run IMA data pipeline on an-airflow.
This MR is a PoC of porting an existing Airflow DAG, and its dependency management, to Platform Engineering's an-airflow
dev instance (which provides a Scheduler and a LocalExecutor). The goals of this PoC are:
- Identify a possible DAG repo structure, and define software dependency boundaries.
- Port an existing DAG and data pipeline (image-matching) to our dev environment.
- Build atop Gitlab CI capabilities to improve iteration speed.
These changes are about project layout and development workflow. The MR does not aim to:
- Replace production deployment practices and tooling.
- Refactor the current IMA DAG. This will be carried out in a separate Task.
Repo structure
This repo structure matches the layout of `AIRFLOW_HOME` on an-airflow1003.eqiad.wmnet.

- `${AIRFLOW_HOME}/dags` contains the DAGs for all projects. Each DAG schedules a data pipeline. No business logic is contained in the DAG (a minimal sketch follows below).
- `${AIRFLOW_HOME}/<project-name>` contains the data pipeline's codebase and business logic, configuration files, and resources.

In this MR `<project-name>` is `image-matching`.
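To make the layout concrete, here is a minimal sketch of what a DAG under `${AIRFLOW_HOME}/dags` could look like. This is illustrative only: the task id, schedule, and `make run` target are assumptions (not part of this MR), and Airflow 2.x import paths are assumed.

```python
# dags/ima.py - illustrative sketch, not the actual IMA DAG.
# Task id, schedule, and the `make run` target are assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="image_matching",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@weekly",
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
    catchup=False,
) as dag:
    # The DAG only schedules work; business logic lives in
    # ${AIRFLOW_HOME}/image-matching and is invoked as an external command.
    run_pipeline = BashOperator(
        task_id="run_pipeline",
        bash_command="cd $AIRFLOW_HOME/image-matching && make run",
    )
```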
Dependencies and Python runtime
Only the DAG and the data pipeline's business logic code are kept in this repo. We treat everything else as an external dependency.
ImageMatching algorithm
What follows is a PoC and a request for comments on finding common standards for interacting with, and distributing, code developed by partner teams at WMF.
We depend on a Jupyter notebook developed by the Research team. This code lives outside of the `image-matching` DAG, and we treat it like an upstream dependency. The dependency is declared in `image-matching/requirements.txt` and installed in the project conda environment. Effectively, we treat "upstream" code as just another Python dependency like `pandas`, `numpy`, and `sklearn`.

A companion PoC MR (https://gitlab.wikimedia.org/gmodena/ImageMatching/-/merge_requests/32) contains the refactored upstream code, packaged as a wheel.
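As a sketch of what this means on the pipeline side, upstream code is imported like any other third-party library. The module and function names below are placeholders, not the actual API exposed by the packaged wheel.

```python
# image-matching/spark/candidates.py - hypothetical file. `imagematching` and
# `generate_candidates` are placeholders for whatever API the packaged wheel exposes.
import pandas as pd

# Installed from image-matching/requirements.txt into the project conda env,
# exactly like pandas, numpy, and sklearn.
from imagematching import generate_candidates


def build_dataset(snapshot: str) -> pd.DataFrame:
    """Delegate the core algorithm to the upstream package; keep only glue code here."""
    return generate_candidates(snapshot=snapshot)
```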
Conda environment
For Python and PySpark based projects, a conda virtual environment is provided at `${AIRFLOW_HOME}/image-matching/venv.tar.gz`. This is the runtime for all Python binaries and PySpark jobs triggered by the DAG.
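As a sketch of how a DAG task might consume this runtime, the archive can be shipped alongside the Spark job and the Python interpreter pointed into the unpacked environment. This assumes Airflow's Apache Spark provider and YARN; the application path and task id are illustrative.

```python
# Illustrative only: ship venv.tar.gz with the Spark job and use the unpacked
# conda env as the Python runtime on driver and executors (YARN assumed).
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

train = SparkSubmitOperator(
    task_id="train",
    application="image-matching/spark/train.py",  # hypothetical entry point
    archives="image-matching/venv.tar.gz#venv",   # unpacked as ./venv on the workers
    conf={
        "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "venv/bin/python",
        "spark.executorEnv.PYSPARK_PYTHON": "venv/bin/python",
    },
    dag=dag,  # assumes an existing DAG object
)
```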
The conda env is assembled by `make` targets (`venv`, `test`) in `image-matching/Makefile`. Targets can be triggered by CI (see `.gitlab-ci.yml`) or from any host.
Conda envs are assembled in an `x86_64-linux` Docker container. This allows "cross-compilation" of the venv on non-Linux hosts (e.g. a developer's macOS laptop). Native conda can be used instead by setting the `SKIP_DOCKER=true` flag: for example, `make venv SKIP_DOCKER=true` will build a `venv.tar.gz` conda-packed environment using the local conda installation instead of a container. This is useful when we want to build a venv (either for vendoring a runtime or for running tests) on a Gitlab runner (which is itself a Docker container), or when the build host does not provide Docker.
CI
Every successful build generates a `platform-airflow-dags.tar.gz` artifact. This archive packages the repo itself and includes the assembled virtual environments.

Integration commands are delegated to top-level make targets (e.g. `make test`). A full CI pipeline can be triggered by invoking `make deploy-local-build`. This will trigger testing, assembly of `image-matching/venv.tar.gz`, assembly of `platform-airflow-dags.tar.gz`, and "deployment" of the build artifact to the worker's `AIRFLOW_HOME`.
Gitlab CI
CI is triggered at every `push`. The current CI pipeline is pretty straightforward.

- `ima-test`: runs the data pipeline's `pytest` suite (`image-matching/tests/`); a sketch of such a test follows this list.
- `publish`: the DAG repo is archived as `platform-airflow-dags.tar.gz` and uploaded to the Gitlab generic package registry (e.g. https://gitlab.wikimedia.org/gmodena/platform-airflow-dags/-/packages/52). The artifact is versioned as -.
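For reference, the `ima-test` job boils down to running `pytest` against that directory. A test there has roughly this shape; the function under test and its module are placeholders, not code that exists in this repo.

```python
# image-matching/tests/test_transform.py - illustrative shape of a pipeline unit test.
# `normalize_title` and its module are placeholders.
import pytest

from transform import normalize_title  # hypothetical module under image-matching/


@pytest.mark.parametrize(
    "raw,expected",
    [("Some page", "Some_page"), ("Other_page", "Other_page")],
)
def test_normalize_title(raw, expected):
    assert normalize_title(raw) == expected
```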
Shipping code to the airflow scheduler/executor
We don't have an automated way to deploy code from Gitlab. Instead, the following workflow needs to happen every time we want to ship DAGs to the airflow worker:

- Copy DAGs, venvs, and project files to the worker.
- Impersonate the `analytics-platform_eng` user on the airflow worker and move the files to `AIRFLOW_HOME`.
As a shortcut, top-level `make deploy` and `make deploy-local-build` targets are provided. They both automate the steps above, and are meant to be run from a developer's laptop. Ideally, we could replace this ad-hoc logic with `scap` or some other battle-tested deployment tool/host.