run IMA data pipeline on an-airflow.
This MR is a PoC of porting an existing Airflow DAG, and its dependency management, to Platform Engineering's an-airflow
dev instance (which provides a Scheduler and a LocalExecutor). The goals of this PoC are:
- Identify a possible DAG repo structure, and define software dependency boundaries.
- Port an existing DAG and data pipeline (image-matching) to our dev environment.
- Build atop Gitlab CI capabilities to improve iteration speed.
These changes are about project layout and development workflow. The MR does not aim to:
- Replace production deployment practices and tooling.
- Refactor the current IMA DAG. This will be carried out in a separate Task.
Repo structure
This repo structure matches the layout of `AIRFLOW_HOME` on an-airflow1003.eqiad.wmnet.

- `${AIRFLOW_HOME}/dags` contains the DAGs for all projects. Each DAG schedules a data pipeline. No business logic is contained in the DAG (a minimal sketch follows below).
- `${AIRFLOW_HOME}/<project-name>` contains the data pipeline's codebase and business logic, configuration files, and resources.

In this MR `<project-name>` is `image-matching`.
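To make the layout concrete, here is a minimal sketch of what a DAG under `${AIRFLOW_HOME}/dags` could look like. This is illustrative only: the task id, schedule, and `make run` target are assumptions (not part of this MR), and Airflow 2.x import paths are assumed.

```python
# dags/ima.py - illustrative sketch, not the actual IMA DAG.
# Task id, schedule, and the `make run` target are assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="image_matching",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@weekly",
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
    catchup=False,
) as dag:
    # The DAG only schedules work; business logic lives in
    # ${AIRFLOW_HOME}/image-matching and is invoked as an external command.
    run_pipeline = BashOperator(
        task_id="run_pipeline",
        bash_command="cd $AIRFLOW_HOME/image-matching && make run",
    )
```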
Dependencies and Python runtime
Only the DAG and the data pipeline's business logic code are kept in this repo. We treat everything else as an external dependency.
ImageMatching algorithm
What follows is a PoC and a request for comments on finding common standards for interacting with, and distributing, code developed by partner teams at WMF.
We depend on a Jupyter notebook developed by the Research team. This code lives outside of the `image-matching` DAG, and we treat it like an upstream dependency. The dependency is declared in `image-matching/requirements.txt` and installed in the project conda environment. Effectively, we treat "upstream" code as just another Python dependency like `pandas`, `numpy`, and `sklearn`.

A companion PoC MR (https://gitlab.wikimedia.org/gmodena/ImageMatching/-/merge_requests/32) contains the refactored upstream code, packaged as a wheel.
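As a sketch of what this means on the pipeline side, upstream code is imported like any other third-party library. The module and function names below are placeholders, not the actual API exposed by the packaged wheel.

```python
# image-matching/spark/candidates.py - hypothetical file. `imagematching` and
# `generate_candidates` are placeholders for whatever API the packaged wheel exposes.
import pandas as pd

# Installed from image-matching/requirements.txt into the project conda env,
# exactly like pandas, numpy, and sklearn.
from imagematching import generate_candidates


def build_dataset(snapshot: str) -> pd.DataFrame:
    """Delegate the core algorithm to the upstream package; keep only glue code here."""
    return generate_candidates(snapshot=snapshot)
```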
Conda environment
For Python and PySpark based projects, a conda virtual environment is provided at `${AIRFLOW_HOME}/image-matching/venv.tar.gz`. This is the runtime for all Python binaries and PySpark jobs triggered by the DAG.
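As a sketch of how a DAG task might consume this runtime, the archive can be shipped alongside the Spark job and the Python interpreter pointed into the unpacked environment. This assumes Airflow's Apache Spark provider and YARN; the application path and task id are illustrative.

```python
# Illustrative only: ship venv.tar.gz with the Spark job and use the unpacked
# conda env as the Python runtime on driver and executors (YARN assumed).
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

train = SparkSubmitOperator(
    task_id="train",
    application="image-matching/spark/train.py",  # hypothetical entry point
    archives="image-matching/venv.tar.gz#venv",   # unpacked as ./venv on the workers
    conf={
        "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "venv/bin/python",
        "spark.executorEnv.PYSPARK_PYTHON": "venv/bin/python",
    },
    dag=dag,  # assumes an existing DAG object
)
```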
The conda env is assembled by `make` targets (`venv`, `test`) in `image-matching/Makefile`. Targets can be triggered by CI (see `.gitlab-ci.yml`) or from any host.
Conda envs are assembled in an `x86_64-linux` Docker container. This allows "cross-compilation" of the venv on non-Linux hosts (e.g. a developer's macOS laptop). Native conda can be used instead by setting the `SKIP_DOCKER=true` flag: for example, `make venv SKIP_DOCKER=true` will build a `venv.tar.gz` conda-packed environment using the local conda installation instead of a container. This is useful when we want to build a venv (either for vendoring a runtime or for running tests) on a Gitlab runner (which is itself a Docker container), or when the build host does not provide Docker.
CI
Every successful build generates a `platform-airflow-dags.tar.gz` artifact. This archive packages the repo itself and includes the assembled virtual environments.

Integration commands are delegated to top-level make targets (e.g. `make test`). A full CI pipeline can be triggered by invoking `make deploy-local-build`. This will trigger testing, assembly of `image-matching/venv.tar.gz`, assembly of `platform-airflow-dags.tar.gz`, and "deployment" of the build artifact to the worker's `AIRFLOW_HOME`.
Gitlab CI
CI is triggered at every `push`. The current CI pipeline is pretty straightforward.

- `ima-test`: runs the data pipeline's `pytest` suite (`image-matching/tests/`); a sketch of such a test follows this list.
- `publish`: the DAG repo is archived as `platform-airflow-dags.tar.gz` and uploaded to the Gitlab generic package registry (e.g. https://gitlab.wikimedia.org/gmodena/platform-airflow-dags/-/packages/52). The artifact is versioned as -.
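For reference, the `ima-test` job boils down to running `pytest` against that directory. A test there has roughly this shape; the function under test and its module are placeholders, not code that exists in this repo.

```python
# image-matching/tests/test_transform.py - illustrative shape of a pipeline unit test.
# `normalize_title` and its module are placeholders.
import pytest

from transform import normalize_title  # hypothetical module under image-matching/


@pytest.mark.parametrize(
    "raw,expected",
    [("Some page", "Some_page"), ("Other_page", "Other_page")],
)
def test_normalize_title(raw, expected):
    assert normalize_title(raw) == expected
```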
Shipping code to the airflow scheduler/executor
We don't have an automated way to deploy code from Gitlab. Instead, the following workflow needs to happen every time we want to ship DAGs to the airflow worker:

- Copy DAGs, venvs, and project files to the worker.
- Impersonate the `analytics-platform_eng` user on the airflow worker and move the files to `AIRFLOW_HOME`.
As a shortcut, top-level `make deploy` and `make deploy-local-build` targets are provided. They both automate the steps above, and are meant to be run from a developer's laptop. Ideally, we could replace this ad-hoc logic with `scap` or some other battle-tested deployment tool/host.