Push down earliest rev_dt per wiki on the revision level MERGE INTO
In !41 (merged) we switched to daily processing, but botched the predicate pushdown: we were pushing down only the wiki_db, which is too broad and would make us shuffle over 40 TB of data (an example can be seen here for the next 90 days).
This MR fixes that, based on the following observation:
Now that the revision-level MERGE INTO only processes page_change_kind IN ('create', 'edit'), the vast majority of new events lie in the 'future' relative to what we have already processed. Thus we can push down, per wiki, the earliest revision_timestamp we see in the new data. Since the target files are WRITE ORDERED BY wiki_db, page_id, revision_timestamp, this lets us eliminate a lot of files.
So we push down the result of:
```sql
SELECT
    wiki_id,
    MIN(to_timestamp(revision.rev_dt)) AS earliest_revision_dt
FROM event.mediawiki_page_content_change_v1
WHERE year = 2024
  AND month = 10
  AND day = 4
  AND page_change_kind IN ('create', 'edit')
GROUP BY wiki_id
ORDER BY wiki_id, earliest_revision_dt
```
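To illustrate how such a per-wiki result could be turned into a file-pruning filter, here is a minimal sketch (the helper name and the target column names `wiki_db` / `revision_timestamp` are illustrative assumptions, not the MR's actual code):

```python
# Hypothetical sketch: build an OR-of-ANDs pushdown predicate from the
# (wiki_id, earliest_revision_dt) rows of the query above. Because the
# target table is WRITE ORDERED BY wiki_db, page_id, revision_timestamp,
# a predicate of this shape lets the reader skip whole data files.

def build_revision_pushdown(rows):
    """rows: iterable of (wiki_id, earliest_revision_dt) pairs,
    with the timestamp as an ISO-8601-ish string.

    Returns a SQL predicate string; FALSE when there is no new data,
    so the MERGE INTO source matches nothing.
    """
    clauses = [
        f"(wiki_db = '{wiki}' AND revision_timestamp >= TIMESTAMP '{ts}')"
        for wiki, ts in rows
    ]
    return " OR ".join(clauses) if clauses else "FALSE"

# Example usage with made-up values:
predicate = build_revision_pushdown([
    ("enwiki", "2024-10-04 00:00:03"),
    ("frwiki", "2024-10-04 00:00:11"),
])
```

The per-wiki lower bound is what makes this tighter than pushing down wiki_db alone: within each wiki, files whose revision_timestamp range lies entirely before the bound never need to be read.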
Additionally, when processing visibility events, the old predicate pushdown based on revision_timestamp was reading too much data. Considering that we see only ~1,500 visibility events per day, we can certainly push down the exact keys. Thus we now push down the result of:
```sql
SELECT
    database,
    page_id,
    sort_array(collect_set(rev_id)) AS rev_ids
FROM event.mediawiki_revision_visibility_change
WHERE year = 2024
  AND month = 10
  AND day = 4
GROUP BY database, page_id
ORDER BY database, page_id
```
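Since the daily visibility-event volume is tiny, the full key set fits comfortably in an in-memory predicate. A minimal sketch of what that could look like (again, the helper name and the target column names `wiki_db` / `revision_id` are assumptions for illustration):

```python
# Hypothetical sketch: build an exact-key pushdown predicate from the
# (database, page_id, rev_ids) rows of the query above. With only
# ~1,500 events per day, enumerating every key is cheap and prunes
# far more than a revision_timestamp range ever could.

def build_visibility_pushdown(rows):
    """rows: iterable of (database, page_id, rev_ids) triples,
    where rev_ids is a sorted list of revision ids.

    Returns a SQL predicate string; FALSE when there are no events.
    """
    clauses = [
        f"(wiki_db = '{db}' AND page_id = {page_id} "
        f"AND revision_id IN ({', '.join(map(str, rev_ids))}))"
        for db, page_id, rev_ids in rows
    ]
    return " OR ".join(clauses) if clauses else "FALSE"

# Example usage with made-up values:
vis_predicate = build_visibility_pushdown([
    ("enwiki", 3432, [1001, 1005]),
])
```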
Bug: T377999