Push down earliest rev_dt per wiki on the revision level MERGE INTO

Xcollazo requested to merge better-predidate-for-revision-level-merge-into into main

On !41 (merged) we switched to daily processing, but botched the predicate pushdown: we were only pushing down the wiki_db, which is too broad and would make us shuffle over 40TB of data (an example can be seen here for the next 90 days).

In this MR we fix this based on the following observation:

Now that the revision level MERGE INTO only processes page_change_kind IN ('create', 'edit'), the vast majority of events are in the 'future' relative to what we have already processed. Thus, we can push down, per wiki, the earliest event revision_timestamp we see in the new data. That way we can eliminate a lot of files, given that they are WRITE ORDERED BY wiki_db, page_id, revision_timestamp.

So we push down the result of:

SELECT
  wiki_id,
  MIN(to_timestamp(revision.rev_dt)) as earliest_revision_dt
FROM event.mediawiki_page_content_change_v1
WHERE year=2024
  AND month=10
  AND day=4
  AND page_change_kind IN ('create', 'edit')
GROUP BY wiki_id
ORDER BY wiki_id, earliest_revision_dt
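The per-wiki minimums from this aggregation can then be stitched into a filter on the target table. A minimal sketch of that idea (the helper name and target column names are hypothetical, not the actual MR code):

```python
def build_revision_pushdown(earliest_by_wiki):
    """Build a SQL predicate that, per wiki, keeps only rows at or after
    the earliest revision timestamp seen in the new data.

    earliest_by_wiki: dict mapping wiki_id -> earliest timestamp (string).
    """
    clauses = [
        f"(wiki_db = '{wiki}' AND revision_timestamp >= TIMESTAMP '{ts}')"
        for wiki, ts in sorted(earliest_by_wiki.items())
    ]
    return " OR ".join(clauses)

# Example: two wikis with their earliest event revision timestamp for the day.
predicate = build_revision_pushdown({
    "enwiki": "2024-10-04 00:03:12",
    "dewiki": "2024-10-04 01:17:45",
})
print(predicate)
```

Because the target files are WRITE ORDERED BY wiki_db, page_id, revision_timestamp, a predicate of this shape lets the file-level min/max statistics prune most files for each wiki.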

Additionally, for processing visibility events, we were reading too much data with the old predicate pushdowns based on revision_timestamp. Considering we see only ~1500 visibility events per day, we can certainly push them down. Thus, we now push down the result of:

SELECT
  database,
  page_id,
  sort_array(collect_set(rev_id)) as rev_ids
FROM event.mediawiki_revision_visibility_change
WHERE year=2024
  AND month=10
  AND day=4
GROUP BY database, page_id
ORDER BY database, page_id
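Since the daily volume is tiny (~1500 events), the grouped (database, page_id, rev_ids) rows can be expanded into an explicit filter. A sketch under the same assumptions as above (helper and target column names are illustrative only):

```python
def build_visibility_pushdown(groups):
    """Build a SQL predicate selecting only the exact revisions touched
    by visibility events.

    groups: iterable of (database, page_id, rev_ids) tuples, where
    rev_ids is a set of affected revision ids.
    """
    clauses = []
    for db, page_id, rev_ids in sorted(groups, key=lambda g: (g[0], g[1])):
        ids = ", ".join(str(r) for r in sorted(rev_ids))
        clauses.append(
            f"(wiki_db = '{db}' AND page_id = {page_id} AND revision_id IN ({ids}))"
        )
    return " OR ".join(clauses)

# Example: visibility events for two pages on two wikis.
predicate = build_visibility_pushdown([
    ("enwiki", 12345, {3, 1, 2}),
    ("dewiki", 777, {42}),
])
print(predicate)
```

At ~1500 events per day the resulting predicate stays small, and it is far more selective than a broad revision_timestamp range.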

Bug: T377999

Edited by Xcollazo