Automoderator daily snapshot DAG
Bug: T375153
- gets the list of dbs to fetch the snapshots for from Automoderator's config.
- fetches the snapshots from MariaDB replicas (using task mapping from the fetch list).
- each snapshot is appended to wmf_product.automoderator_monitoring_snapshot_daily.
- the latest snapshot is published as TSV to https://analytics.wikimedia.org/published/datasets/.
- snapshots older than 30 days will be purged from the destination table.
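
For reviewers, the retention and publishing steps above can be sketched in plain Python. This is only an illustration, not the DAG's actual code: the row layout, the `snapshot_date` and `value` column names, the helper names, and the sample wikis are all assumptions made for the example. Only `wiki_db` and the 30-day window come from this description.

```python
import csv
import io
from datetime import date, timedelta

RETENTION_DAYS = 30  # retention window stated in the description


def purge_old(rows, run_date, retention_days=RETENTION_DAYS):
    """Drop rows whose (hypothetical) snapshot_date is older than the window."""
    cutoff = run_date - timedelta(days=retention_days)
    # A snapshot exactly `retention_days` old is kept; anything older is dropped.
    return [r for r in rows if r["snapshot_date"] >= cutoff]


def latest_snapshot_tsv(rows):
    """Serialize only the most recent snapshot as TSV text, sorted by wiki_db."""
    if not rows:
        return ""
    latest = max(r["snapshot_date"] for r in rows)
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    # Column names other than wiki_db are assumptions for this sketch.
    writer.writerow(["wiki_db", "snapshot_date", "value"])
    latest_rows = [r for r in rows if r["snapshot_date"] == latest]
    for r in sorted(latest_rows, key=lambda r: r["wiki_db"]):
        writer.writerow([r["wiki_db"], r["snapshot_date"].isoformat(), r["value"]])
    return buf.getvalue()


if __name__ == "__main__":
    # Sample rows; the wiki names and values are made up for illustration.
    rows = [
        {"wiki_db": "trwiki", "snapshot_date": date(2024, 10, 1), "value": 3},
        {"wiki_db": "idwiki", "snapshot_date": date(2024, 10, 1), "value": 5},
        {"wiki_db": "trwiki", "snapshot_date": date(2024, 8, 31), "value": 1},
    ]
    kept = purge_old(rows, run_date=date(2024, 10, 1))
    print(latest_snapshot_tsv(kept), end="")
```

In the DAG itself the purge would presumably be a delete against the destination table rather than an in-memory filter; the sketch only pins down the boundary behaviour (a snapshot exactly 30 days old is kept).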
The script requires a refinery Spark jar that includes the mediawiki-jdbc source, which will be in the next deployment train (probably next Tuesday, 1 Oct 2024).
Marking this as draft so that it doesn't get merged before then.
Testing results
I have verified the output datasets:
hdfs dfs -ls /tmp/kcvelaga/automoderator/daily_snapshots_archive
sudo -u analytics-privatedata spark3-sql -e "select wiki_db, COUNT(*) from kcvelaga.automoderator_monitoring_snapshot_daily group by wiki_db"