Add visibility on backfills via broadcast join of wmf_raw.mediawiki_revision.
(Depends on !10 (merged))
In this MR we cache (df.cache()) and force a broadcast of a table derived from wmf_raw.mediawiki_revision,
which provides the visibility (aka suppression) data.
The physical plan is as expected (the relevant nodes are marked with <<<<<<<):
== Physical Plan ==
ReplaceData (40)
+- AdaptiveSparkPlan (39)
   +- == Final Plan ==
      Sort (19)
      +- ShuffleQueryStage (18), Statistics(sizeInBytes=50.8 GiB, rowCount=7.81E+6)
         +- Exchange (17)
            +- * Project (16)
               +- MergeRows (15)
                  +- * Sort (14)
                     +- * Project (13)
                        +- * Project (12)
                           +- * BroadcastHashJoin LeftOuter BuildRight (11)
                              :- * Filter (2)
                              :  +- Scan hive wmf.mediawiki_wikitext_history (1)
                              +- BroadcastQueryStage (10), Statistics(sizeInBytes=32.5 MiB, rowCount=8.73E+3)  <<<<<<<
                                 +- BroadcastExchange (9)
                                    +- * Filter (8)
                                       +- InMemoryTableScan (3)
                                          +- InMemoryRelation (4)
                                             +- * Project (7)
                                                +- * Filter (6)
                                                   +- Scan hive wmf_raw.mediawiki_revision (5)  <<<<<<<
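For reviewers less familiar with the BroadcastHashJoin LeftOuter BuildRight node (11), here is a minimal pure-Python sketch of its semantics. It is not the code from this MR: the column names (wiki_db, revision_id, text, visibility) and the function name are hypothetical, and it assumes at most one row per key on the small side.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Hypothetical join key for illustration: (wiki_db, revision_id).
Key = Tuple[str, int]


def broadcast_left_outer_join(
    large_rows: Iterable[dict],
    small_rows: Iterable[dict],
) -> Iterator[dict]:
    """Left outer hash join that builds on the small (right) side."""
    # Build phase ("BuildRight"): the small side is materialised once as a
    # hash map. In Spark, this map is what gets broadcast to every executor.
    # Assumption: keys are unique on the small side (later rows would overwrite).
    visibility_by_key: Dict[Key, Optional[dict]] = {
        (r["wiki_db"], r["revision_id"]): r["visibility"] for r in small_rows
    }

    # Probe phase: the large side is streamed row by row, with no shuffle.
    for row in large_rows:
        key = (row["wiki_db"], row["revision_id"])
        # "LeftOuter": unmatched left rows are kept, with None on the right.
        yield {**row, "visibility": visibility_by_key.get(key)}


# Toy data standing in for wmf.mediawiki_wikitext_history (large side) and
# the table derived from wmf_raw.mediawiki_revision (small side).
history = [
    {"wiki_db": "simplewiki", "revision_id": 1, "text": "a"},
    {"wiki_db": "simplewiki", "revision_id": 2, "text": "b"},
]
revisions = [
    {
        "wiki_db": "simplewiki",
        "revision_id": 2,
        "visibility": {"text": False, "user": True, "comment": True},
    },
]

joined = list(broadcast_left_outer_join(history, revisions))
```

The point of the sketch is that only the small side is copied around, so the large scan of wmf.mediawiki_wikitext_history does not need to be shuffled for the join itself.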
Manual test runs on enwiki and simplewiki show no measurable difference between this code and the code from !10 (merged).
Bug: T345183