Deduplicate intermediate table (wiki_db, revision_id) tuples on read.
In T346281#9195431, we figured out that the current dumps process can inadvertently publish duplicate (wiki_db, revision_id)
tuples. Presumably this happens since that process cannot lock the database when reading the data. These duplicates are rare: On the 2023-07
XML dump, year 2004 has 4 duplicates, while year 2019 has 48.
These duplicates trip the backfill though, as the MERGE mechanism is expecting 1:1 matches at most. In this MR we solve this issue by deduplicating the intermediate table on read. A first attempt on de-duplicating on write with a ROW_NUMBER window function got us into an expensive SORT node for all data. But since the duplicate data is very small, by doing it on read, we can do the following trick:
- Figure out the
(wiki_db, revision_id)
tuples that are duplicated with a separate query with a GROUP BY and a HAVING clause (Price: shuffle plus a hashed group read. Couple minutes.) - Then
collect()
those tuples, jointly with their TIMESTAMPs. - Now build a query that finds these duplicated tuples and then do a ROW_NUMBER window function on them, keeping only the ones we want to remove. Because we collected the TIMESTAMPs, and our intermediate table is partitioned by that field, this query reads very little data, and thus keeps the ROW_NUMBER and underlying SORT price very contained, a minute or two.
- Now take (3) and Broadcast LEFT ANTI JOIN it to the original data.
I tested this mechanism on 2004 (little data) and 2019 (lots of data) and its working great.
(Kudos to @milimetric for the idea to avoid the write time sort!)
Query plan with these changes. Looks scary, but is all good:
+- == Initial Plan ==
Sort (46)
+- Exchange (45)
+- Project (44)
+- MergeRows (43)
+- Sort (42)
+- SortMergeJoin FullOuter (41)
:- Sort (26)
: +- Exchange (25)
: +- Project (24)
: +- Project (23)
: +- BatchScan (22)
+- Sort (40)
+- Exchange (39)
+- Project (38)
+- BroadcastHashJoin LeftAnti BuildRight (37)
:- Project (27)
: +- BatchScan (7)
+- BroadcastExchange (36)
+- Project (35)
+- Filter (34)
+- Window (33)
+- Sort (32)
+- Exchange (31)
+- Project (30)
+- Filter (29)
+- BatchScan (28)
Bug: T346281