Add job to publish content dumps as XML
(This MR depends on https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/991795/ being released as part of refinery-job-0.2.29-shaded.jar.
This has now been merged.)
In this MR we implement a first cut of the Airflow DAG that will convert the intermediate table wmf_dumps.wikitext_raw
into actual XML dumps.
The DAG itself is incomplete, as we do not have a proper sensor yet. Additionally, we are only dumping simplewiki
right now.
Still, we'd like to start exercising these code paths on a regular basis, so we want to get this MR into prod.
Bug: T346278