Migrate Refine from systemd to Airflow
This change creates a staging DAG on Airflow Analytics, which will be rolled out progressively:
- Feed the DAG a mocked version of ESC containing a sample of ~170 datasets, targeting a staging database for refined data.
- Run the DAG in parallel with the existing Refine process on systemd and use an ad-hoc script to check for differences.
- Gradually increase the sample size to assess the effect of the load on the current Airflow setup.
- For deployment, we will switch the output of systemd Refine to this new Refine, allowing it to write to the event DB while continuing to check for discrepancies.
- Finally, remove the diffing process and deactivate the legacy Refine.
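
The parallel-run comparison could look roughly like the sketch below. It is an illustration only, not the ad-hoc script itself: the function names and the in-memory row representation (flat dicts with hashable values) are assumptions, and a real check would read both outputs from their respective tables.

```python
from collections import Counter


def _freeze(row):
    # Turn a flat dict into a hashable, order-independent tuple of (key, value) pairs.
    return tuple(sorted(row.items()))


def diff_partitions(legacy_rows, new_rows):
    """Hypothetical helper: multiset difference between two refined outputs."""
    legacy = Counter(_freeze(r) for r in legacy_rows)
    new = Counter(_freeze(r) for r in new_rows)
    return {
        # Rows (with multiplicity) that only the systemd Refine produced.
        "only_in_legacy": sorted((legacy - new).elements(), key=repr),
        # Rows (with multiplicity) that only the Airflow Refine produced.
        "only_in_new": sorted((new - legacy).elements(), key=repr),
    }
```

An empty result on both sides for a given partition would indicate that the two Refine processes agree on that partition.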
DAG Details:
- Loads the configuration from ESC and creates one task group per enabled dataset.
- Updates the table schema to reflect the latest JSON schema version.
- Refines the data and creates a new Hive partition.
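
As a rough sketch of the structure described above, the per-dataset layout can be pictured as a plan built from the configuration. Everything here is hypothetical: the config shape (a dict with an "enabled" flag per stream), the function name, and the step names are assumptions for illustration, and the actual DAG builds Airflow task groups rather than plain lists.

```python
def plan_task_groups(stream_configs):
    """Hypothetical helper: map each enabled dataset to its ordered steps."""
    plan = {}
    for name, settings in sorted(stream_configs.items()):
        # Skip datasets that are not enabled for Refine.
        if not settings.get("enabled", False):
            continue
        # Assumed naming convention: dots replaced to keep group ids simple.
        group_id = name.replace(".", "_")
        plan[group_id] = [
            f"{group_id}.evolve_table_schema",       # align table with latest JSON schema
            f"{group_id}.refine_to_hive_partition",  # refine data into a new Hive partition
        ]
    return plan
```

Within each group the schema update runs before the refinement step, so the new partition is written against the up-to-date table schema.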
This branch is currently running on the test cluster.
Bug: T356762