Refactor `wikitext_raw` table to support backfilling
In this MR we change the schema of `wikitext_raw_rc0` to `wikitext_raw_rc1` like so:
- Change the partitioning strategy from `hours(revision_timestamp)` to `(wiki_db, days(revision_timestamp))`.
  - The rationale for `days(revision_timestamp)` is that this strategy generates far fewer ORed predicates that we need to push down when doing the `MERGE INTO`. This will also help contain the number of files in HDFS once we call `CALL spark_catalog.system.rewrite_data_files()` on it.
  - The rationale for adding a `wiki_db` partition is to aid the backfilling process. This process touches all `days(revision_timestamp)` partitions, and thus we need a separate mechanism that pushes down the `wiki_db` in order to make the backfill manageable. This way we can ingest in `wiki_db` groupings.
  - Since partitioning keys are orthogonal in Iceberg, this strategy, so far, gives us a good ingestion compromise.
- Switch from parquet to avro. After discussions with the team, we figured this is safer given that `content_slots` contains full revisions.
- Flatten out the schema of the target table. We now include what we believe to be the necessary fields to make a dump out of and nothing else.
- We introduce a helper TIMESTAMP column called `row_last_updated`. The idea is that it will serve as a watermark that we will bump every time we touch a particular row.
  - For streaming ingests, we will update it with `meta.dt` (time the event was received by the system).
  - For backfills, we will update it with the backfilling table's 'freshness date', which in the case of `wmf.mediawiki_wikitext_history` happens to be `snapshot` (which is the dumps 1.0 release date).
  - Notice how, in the event of a stream ingest or backfill, if we have more recent data already (i.e. a higher watermark) then we ignore the update.
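
To illustrate the watermark rule described above, here is a minimal pure-Python sketch. It is not the code in this MR: the function name `should_apply_update` is hypothetical, and the handling of ties (an incoming watermark equal to the stored one is applied) is an assumption, since the description only says that updates are ignored when the stored watermark is higher.

```python
from datetime import datetime
from typing import Optional


def should_apply_update(
    existing_watermark: Optional[datetime],
    incoming_watermark: datetime,
) -> bool:
    """Decide whether an incoming change may overwrite a stored row.

    existing_watermark: the row's current row_last_updated, or None if the
        row does not exist in the target table yet.
    incoming_watermark: meta.dt for a streaming event, or the backfilling
        table's freshness date (e.g. snapshot) for a backfill.
    """
    if existing_watermark is None:
        # No row yet: always insert.
        return True
    # Ignore the update only when we already hold more recent data.
    # Assumption: equal watermarks are treated as applicable.
    return not (existing_watermark > incoming_watermark)


# Example: a row already touched by a stream event on 2023-07-10 is not
# overwritten by a backfill whose snapshot date is 2023-06-01.
stream_dt = datetime(2023, 7, 10, 12, 0, 0)
backfill_snapshot = datetime(2023, 6, 1)
backfill_wins = should_apply_update(stream_dt, backfill_snapshot)
stream_wins = should_apply_update(backfill_snapshot, stream_dt)
```

In a `MERGE INTO` statement the same comparison would presumably appear as a condition on the matched branch, comparing the source watermark against the target's `row_last_updated`.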
Additionally, we add a new MERGE INTO pyspark script that can backfill at a monthly granularity given the scalability issues described at https://phabricator.wikimedia.org/T340861
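
As a rough sketch of why daily partitioning keeps the pushed-down predicate small for a monthly backfill, the following pure-Python helper builds one range predicate per day plus a `wiki_db` filter. The helper names, the `t` table alias, and the example wiki names are hypothetical, and the actual PySpark script may construct its predicates differently. Values are interpolated directly into the SQL text, so this assumes trusted inputs.

```python
import calendar
from datetime import date, timedelta
from typing import List, Sequence


def day_predicates(year: int, month: int, alias: str = "t") -> List[str]:
    """Return one half-open timestamp range predicate per day of the month."""
    days_in_month = calendar.monthrange(year, month)[1]
    predicates = []
    for day in range(1, days_in_month + 1):
        start = date(year, month, day)
        end = start + timedelta(days=1)
        predicates.append(
            f"({alias}.revision_timestamp >= TIMESTAMP '{start.isoformat()} 00:00:00'"
            f" AND {alias}.revision_timestamp < TIMESTAMP '{end.isoformat()} 00:00:00')"
        )
    return predicates


def backfill_predicate(
    year: int, month: int, wiki_dbs: Sequence[str], alias: str = "t"
) -> str:
    """Combine a wiki_db grouping with the ORed daily predicates."""
    wikis = ", ".join(f"'{w}'" for w in sorted(wiki_dbs))
    days = " OR ".join(day_predicates(year, month, alias))
    return f"{alias}.wiki_db IN ({wikis}) AND ({days})"


# Example: one month for a small, made-up grouping of wikis.
july = day_predicates(2023, 7)
predicate = backfill_predicate(2023, 7, ["simplewiki", "enwiki"])
```

For July 2023 this yields 31 daily predicates, whereas covering the same month under `hours(revision_timestamp)` would need 744 hourly ones.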
Bug: T340861
Bug: T336714