
PySpark job to merge events from `page_content_change` stream into an Iceberg table.

Xcollazo requested to merge pyspark into main

In this MR we implement a simple PySpark job that runs a MERGE INTO from the `page_content_change` stream into an Iceberg table.

We landed on a PySpark job after realizing that tuning this job was going to be significantly more complex than just an Airflow SparkSQLOperator. We figured that a reasonable partitioning strategy for now is on revision.rev_dt. For a while we thought that dt was the right one, but since events like a visibility change don't have dt == revision.rev_dt, it makes more sense to go with revision.rev_dt: a consumer of the data can then walk the revisions in order. We can of course revise this later as needed.
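For reference, below is a minimal sketch of what a target table partitioned this way could look like. The table name (`dumps.wikitext_raw`), the column list, and the assumption that Iceberg's hourly partition transform can be applied to the nested revision.rev_dt field are all illustrative, not the actual DDL used by this job.

```python
# Minimal sketch of a target table partitioned hourly on revision.rev_dt.
# Table name, columns, and nested-field partitioning are assumptions,
# not the job's actual DDL. If partitioning on a nested field is not
# supported here, the field could be flattened into a top-level column first.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("create_target_table").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS dumps.wikitext_raw (
        page_id  BIGINT,
        wiki_db  STRING,
        revision STRUCT<rev_id: BIGINT, rev_dt: TIMESTAMP>
        -- remaining event fields elided
    )
    USING iceberg
    PARTITIONED BY (hours(revision.rev_dt))
""")
```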

In this job, we:

  • Run a query to figure out all the revision.rev_dt hours incoming from the source stream.
  • Build a predicate ORing together all the touched revision.rev_dt hours.
  • Run a MERGE INTO that takes the ORed predicate, so that we avoid a full table scan on the target table, and merges the data into the Iceberg table (see the sketch after this list).
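
To make those three steps concrete, here is a hedged sketch of the flow. The table names (`event.page_content_change` as source, `dumps.wikitext_raw` as target), the join key on revision.rev_id, and the schema details are illustrative assumptions rather than the job's actual definitions.

```python
# Hedged sketch of the three steps above; table names, join key, and
# schema are illustrative assumptions, not the job's actual definitions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge_page_content_change").getOrCreate()

# (1) Find every distinct revision.rev_dt hour present in the incoming batch.
hours_df = spark.sql("""
    SELECT DISTINCT date_trunc('HOUR', revision.rev_dt) AS rev_dt_hour
    FROM event.page_content_change
""")
touched_hours = [row.rev_dt_hour for row in hours_df.collect()]

if touched_hours:
    # (2) OR together one half-open range per touched hour. Keeping the
    #     predicate on the raw rev_dt column lets Iceberg prune the target
    #     table's hourly partitions instead of scanning the whole table.
    predicate = " OR ".join(
        f"(target.revision.rev_dt >= TIMESTAMP '{h}'"
        f" AND target.revision.rev_dt < TIMESTAMP '{h}' + INTERVAL 1 HOUR)"
        for h in touched_hours
    )

    # (3) MERGE the batch into the Iceberg table, restricted by the predicate.
    spark.sql(f"""
        MERGE INTO dumps.wikitext_raw AS target
        USING event.page_content_change AS source
        ON target.revision.rev_id = source.revision.rev_id
           AND ({predicate})
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)
```

Collecting the distinct hours to the driver should be cheap as long as each batch only touches a bounded number of hours; the key point is that the ORed predicate keeps the MERGE from scanning untouched partitions of the target table.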

There are many TODOs, but this job is good enough to start running in Airflow so that we can study its performance. So far, test runs finish in the vicinity of 10 minutes.

Some TODOs:

  • We don't do much flattening of the event data at all. Once we have a downstream consumer, we can start shaping the target table better.
  • We need a backfill mechanism.
  • We need to build a mechanism to log errors, e.g. if we ever get MATCHes that have changelog_kind = 'insert'; but we need to backfill first. (A possible shape for this check is sketched after this list.)
  • We need to move this project to another namespace that is not my personal one!
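
One possible shape for the error check mentioned above, reusing the same illustrative table names and join key as the earlier sketch:

```python
# Possible shape for the error check described above; table names and join
# key are the same illustrative assumptions used in the earlier sketch.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("check_unexpected_matches").getOrCreate()

# Count incoming 'insert' events that already MATCH a row in the target
# table; a non-zero count is unexpected and worth logging.
unexpected = spark.sql("""
    SELECT count(*) AS n
    FROM event.page_content_change AS source
    JOIN dumps.wikitext_raw AS target
      ON target.revision.rev_id = source.revision.rev_id
    WHERE source.changelog_kind = 'insert'
""").first().n

if unexpected > 0:
    print(f"WARNING: {unexpected} 'insert' events matched existing rows.")
```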
