analytics: webrequest: add webrequest_frontend refine dag.
Post-processing pipeline for the webrequest_frontend HAProxy data feed.
This DAG is a work in progress to support T354694. It is likely to change during the varnishkafka -> benthos migration, and to be deprecated once the migration is complete.
The DAG is scheduled on the analytics instance so that it has access to a stable staging environment. It departs from production patterns in the following ways:
- No alerting is performed.
- Resulting datasets and tables are kept outside the production path
  (in a namespace tied to user gmodena).
- ETL refine queries are sourced over HTTP from GitLab instead of a refinery HDFS path.
The implementation logic follows refine_webrequest_hourly_dag_factory.py: verify the webrequest_frontend logs, refine them, and create the webrequest hourly dataset.
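For illustration only, the verify, refine, and create ordering could be sketched in plain Python as below. The step functions and their names are hypothetical placeholders, not the tasks defined in refine_webrequest_hourly_dag_factory.py, and no Airflow API is used.

```python
from typing import Callable, List, Tuple

# A step is a (name, function) pair; the function takes an hour label and
# returns True on success. All names here are hypothetical placeholders.
Step = Tuple[str, Callable[[str], bool]]


def run_hour(hour: str, steps: List[Step]) -> List[str]:
    """Run the steps in order for one hour, stopping at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for name, step in steps:
        if not step(hour):
            break  # later steps depend on earlier ones, so do not continue
        completed.append(name)
    return completed


def verify_logs(hour: str) -> bool:
    # Placeholder: a real check would inspect the raw logs for this hour.
    return True


def refine(hour: str) -> bool:
    # Placeholder: a real step would run the refine query for this hour.
    return True


def create_hourly_dataset(hour: str) -> bool:
    # Placeholder: a real step would publish the refined hourly dataset.
    return True


STEPS: List[Step] = [
    ("verify_logs", verify_logs),
    ("refine", refine),
    ("create_hourly_dataset", create_hourly_dataset),
]
```

If verification fails for an hour, nothing downstream runs for that hour, which is the dependency order the sentence above describes.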
Currently (2024-05-21) the DAG reads from an rc0 version of the webrequest_frontend stream. This is considered "test" data (Benthos produces to _test Kafka topics).
Bug: T314956
Bug: T351117