
analytics: webrequest: add webrequest_frontend refine dag.

Gmodena requested to merge refine-webrequest-frontend into main

Post-processing pipeline for the webrequest_frontend HAProxy data feed.

This DAG is a work in progress to support T354694. It is likely to be updated during the varnishkafka -> benthos migration, and deprecated once that migration is complete.

The DAG is scheduled on the analytics instance so that it has access to a stable staging environment. It deviates from production patterns in the following ways:

  • No alerting is performed.
  • Resulting datasets and tables are kept outside the production path (in a namespace tied to user gmodena).
  • ETL refine queries are sourced over HTTP from GitLab, instead of from a refinery HDFS path.
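
As an illustration of the last point, here is a minimal sketch of how a query file could be fetched over HTTP from GitLab. It is not the code in this merge request. The helper names (`gitlab_raw_url`, `fetch_query`), the host, the project, and the file path are all hypothetical; only the `/-/raw/<ref>/<path>` URL layout is GitLab's standard raw-file endpoint.

```python
from urllib.parse import quote
from urllib.request import urlopen


def gitlab_raw_url(host: str, project: str, ref: str, path: str) -> str:
    """Build a GitLab raw-file URL: https://<host>/<project>/-/raw/<ref>/<path>."""
    return f"https://{host}/{quote(project)}/-/raw/{quote(ref)}/{quote(path)}"


def fetch_query(url: str, timeout: float = 30.0) -> str:
    """Download the query text; urlopen raises on HTTP errors, and we reject empty bodies."""
    with urlopen(url, timeout=timeout) as response:
        text = response.read().decode("utf-8")
    if not text.strip():
        raise ValueError(f"empty query file at {url}")
    return text


# Hypothetical values, for illustration only (not the real repository or path).
url = gitlab_raw_url(
    host="gitlab.example.org",
    project="some-group/some-repo",
    ref="main",
    path="hql/refine_webrequest_frontend.hql",
)
# query_text = fetch_query(url)  # performs the HTTP request when uncommented
```

Pointing `ref` at a tag or commit SHA rather than a branch would keep scheduled runs reproducible, since a branch can change between runs.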

The implementation logic follows refine_webrequest_hourly_dag_factory.py: it verifies the webrequest_frontend logs, refines them, and creates the webrequest hourly dataset.

Currently (2024-05-21) the DAG reads from an rc0 version of the webrequest_frontend stream. This is considered "test" data (Benthos produces to _test Kafka topics).

Bug: T314956

Bug: T351117

