
Automoderator daily snapshot DAG

Bug: T375153

  • Gets the list of databases to fetch snapshots for from Automoderator's config.
  • Fetches the snapshots from the MariaDB replicas, using task mapping over the fetch list.
  • Appends each snapshot to wmf_product.automoderator_monitoring_snapshot_daily.
  • Publishes the latest snapshot as a TSV to https://analytics.wikimedia.org/published/datasets/.
  • Purges snapshots older than 30 days from the destination table.
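
For reviewers, the fan-out and retention steps can be illustrated with a minimal plain-Python sketch. This is not the DAG code in this merge request: the helper names, the `snapshot_date` column, and the use of a `DELETE` statement are assumptions made for illustration only, and the actual DAG may purge differently. Only the table name and the 30-day window come from the description above.

```python
from datetime import date, timedelta

# Retention window taken from the description; everything else is illustrative.
RETENTION_DAYS = 30
DESTINATION_TABLE = "wmf_product.automoderator_monitoring_snapshot_daily"


def plan_fetch_tasks(wiki_dbs):
    """Return one task spec per distinct database, mirroring task mapping."""
    return [{"wiki_db": db} for db in sorted(set(wiki_dbs))]


def purge_cutoff(run_date, retention_days=RETENTION_DAYS):
    """Return the oldest snapshot date that is still kept for this run."""
    return run_date - timedelta(days=retention_days)


def build_purge_sql(table, run_date, date_column="snapshot_date"):
    """Build a statement removing snapshots older than the retention window.

    `date_column` is a hypothetical column name, not confirmed by the source.
    """
    cutoff = purge_cutoff(run_date)
    return f"DELETE FROM {table} WHERE {date_column} < DATE '{cutoff.isoformat()}'"


if __name__ == "__main__":
    # Hypothetical fetch list; the real one comes from Automoderator's config.
    print(plan_fetch_tasks(["trwiki", "idwiki", "trwiki"]))
    print(build_purge_sql(DESTINATION_TABLE, date(2024, 10, 31)))
```

With this reading of "older than 30 days", a snapshot dated exactly 30 days before the run date is kept and one dated 31 days before is purged.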

The script requires a refinery Spark jar that includes the mediawiki-jdbc source, which will ship in the next deployment train (probably next Tuesday, 1 Oct 2024). I am marking this as a draft so that it doesn't get merged before then.

Testing results

Screenshot from 2024-09-27 16-22-38.png

I have verified the output datasets:

hdfs dfs -ls /tmp/kcvelaga/automoderator/daily_snapshots_archive

sudo -u analytics-privatedata spark3-sql -e "select wiki_db, COUNT(*) from kcvelaga.automoderator_monitoring_snapshot_daily group by wiki_db"

Edited by KCVelaga
