image suggestions data pipeline
This MR introduces the image suggestions data pipeline, and closes https://phabricator.wikimedia.org/T296814
The Airflow DAG has the following tasks:
- wait for the latest snapshot date of relevant tables with a Hive sensor
- Wikidata weighted tags for the Commons search index (https://phabricator.wikimedia.org/T302095)
- all-Wikis image suggestions for Cassandra (https://phabricator.wikimedia.org/T299789)
- suggestion flags for Wikis search indices (https://phabricator.wikimedia.org/T299884)
- clean up HDFS
  - `suggestions` Cassandra table (https://phabricator.wikimedia.org/T293808)
  - `title_cache` Cassandra table (https://phabricator.wikimedia.org/T293808)
  - `instanceof_cache` Cassandra table (https://phabricator.wikimedia.org/T293808)
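
As a rough illustration of the task ordering listed above, here is a minimal sketch in plain Python, not the actual DAG definition. It assumes the tasks run strictly one after another in the order listed, which this description does not state. Only `wait_for_hive_partitions` is a task id that appears in this MR; the other ids are hypothetical placeholders for the sensor's downstream tasks and the HDFS cleanup.

```python
from graphlib import TopologicalSorter

# Task ids in the order the tasks are listed above.
# Only "wait_for_hive_partitions" comes from this MR; the rest are hypothetical names.
TASKS = [
    "wait_for_hive_partitions",  # Hive sensor on the latest snapshot
    "commons_index",             # Wikidata weighted tags for the Commons search index
    "cassandra",                 # all-Wikis image suggestions for Cassandra
    "search_indices",            # suggestion flags for Wikis search indices
    "cleanup_hdfs",              # clean up HDFS
]


def linear_dependencies(tasks):
    """Map each task to the set of tasks it waits for (ASSUMPTION: sequential run)."""
    return {
        task: ({tasks[i - 1]} if i > 0 else set())
        for i, task in enumerate(tasks)
    }


def execution_order(dependencies):
    """Return one valid execution order for the given dependency mapping."""
    return list(TopologicalSorter(dependencies).static_order())


ORDER = execution_order(linear_dependencies(TASKS))
print(" >> ".join(ORDER))
```

If some of these tasks are in fact independent of each other, the dependency mapping would simply list the sensor as their only upstream task.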
Issues to be solved before merge
- merge !50 (merged)
- broken outputs: containers get killed due to memory errors, see https://phabricator.wikimedia.org/T307362. Fix at !57 (diffs)
- the latest Wikidata snapshot yields empty `commonswiki_file.py` output, see https://phabricator.wikimedia.org/T307371. Fix at !55 (merged)
- Hive connection within Airflow fails. See !55 (comment 6700)
Issue details
- The `commonswiki_file.py` Airflow task fails
  - the container gets killed due to exceeding memory limits
  - see `an-airflow1003.eqiad.wmnet:/home/mfossati/commonswiki_file_failure.log`
  - there are a lot of broken delta parquets (i.e., dir with one `_temporary` file)
  - `lead_image_data_latest` and `wikidata_data_latest` look fine
    - `_SUCCESS` file + snappy ones
    - quickly checked with `count()` & `show()`
- `cassandra.py` fails too
  - memory issues again
  - see `an-airflow1003.eqiad.wmnet:/home/mfossati/cassandra_failure.log`
  - only `analytics_platform_eng.suggestions` is there in Hive, not even sure it was written after the run
- the latest Wikidata snapshot yields empty `commonswiki_file.py` output
  - the main suspect is weekly Wikidata snapshots vs. monthly Wikis ones, e.g., beginning of the month: Wikidata 2022-04-04, but maybe Wikis are still on 2022-03?
  - no more reason to wait for the latest snapshot with the Hive sensor?
  - or maybe another sensor that waits for actual data to be available?
- Hive connection fails
  - http://localhost:8600/log?dag_id=image-suggestions&task_id=wait_for_hive_partitions&execution_date=2022-04-27T15%3A05%3A10.184768%2B00%3A00
  - added `metastore_default` connection in Admin panel on the Airflow Web UI
    - maybe try with this URI: thrift://analytics-hive.eqiad.wmnet:9083 ?
  - passed `metastore_conn_id='analytics-hive'` to the Hive sensor constructor
    - currently hitting !55 (comment 6700)
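
To make the suspected snapshot mismatch above concrete, here is a small sketch, not the fix applied in !55. It assumes monthly Wikis snapshots are labelled `YYYY-MM` and that, at the beginning of a month, the latest available one is the previous month's. Both function names are hypothetical.

```python
from datetime import date


def expected_monthly_snapshot(weekly_snapshot: str) -> str:
    """Return the YYYY-MM label of the month preceding a weekly YYYY-MM-DD snapshot.

    ASSUMPTION: the monthly snapshot for a month is only available once that
    month is over, so a weekly snapshot pairs with the previous month.
    """
    day = date.fromisoformat(weekly_snapshot)
    if day.month == 1:
        # January pairs with December of the previous year
        return f"{day.year - 1:04d}-12"
    return f"{day.year:04d}-{day.month - 1:02d}"


def same_month(weekly_snapshot: str) -> str:
    """Naive alternative: reuse the weekly snapshot's own month."""
    return weekly_snapshot[:7]


weekly = "2022-04-04"
print(expected_monthly_snapshot(weekly))  # 2022-03
print(same_month(weekly))                 # 2022-04
```

If the pipeline derived the Wikis snapshot from the Wikidata one as in `same_month`, it would look for a `2022-04` partition that may not exist yet, which would be consistent with the empty output.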