Goodbye 0.* versions, welcome 1.0!
This MR introduces a radical code base uplift. I believe that we can release the first major version after merging.
Main changes
Logic
- include image extraction when parsing wikitext
- separate wikilinks from images
- dump two datasets: section topics & images
- move section alignment image suggestions here
Project
- modernize packaging (replace setup.* with pyproject.toml)
- refresh CI (mamba for both linting and testing)
- radically re-format code
- introduce black
- set line length to black's default (88 chars)
- reduce & bump dependencies
Airflow test run
DAG as per repos/data-engineering/airflow-dags!864 (merged).
Data checks
# Section topics
topics_dev = spark.read.parquet('/user/mfossati/T331522/topics/2024-09-23')
topics_prod = spark.read.parquet('/user/analytics-platform-eng/structured-data/section_topics/2024-09-23')
topics_dev.count(), topics_prod.count()
(144651363, 144650351)
# Article images
images_dev = spark.read.parquet('/user/mfossati/T331522/images/2024-09-23')
images_prod = spark.read.parquet('/user/analytics-platform-eng/structured-data/section-alignment-suggestions/article_images/2024-09-23')
images_dev.count(), images_prod.count()
(62022805, 62461775)
# Section alignment image suggestions
suggestions_dev = spark.read.parquet('/user/mfossati/T331522/section_alignment_image_suggestions/2024-09-23')
suggestions_prod = spark.read.parquet('/user/analytics-platform-eng/structured-data/section-alignment-suggestions/suggestions/2024-09-23')
suggestions_dev.count(), suggestions_prod.count()
(2664164, 2659824)
# Random enwiki sample check
prod_df = suggestions_prod.where('target_wiki_db="enwiki"').sample(0.0001).toPandas().sort_values('target_id')
ids = prod_df.target_id.to_list()
dev_df = suggestions_dev.where('target_wiki_db="enwiki"').where(suggestions_dev.target_id.isin(ids)).toPandas().sort_values('target_id')
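# `sample` was undefined in the snippet; assumed here to be the prod/dev join on target_id
sample = prod_df.merge(dev_df, on='target_id', suffixes=('_prod', '_dev'))
for row in sample.itertuples():
    # Print prod and dev suggestions one under the other for each target page
    print(row.target_id, row.recommended_images_prod, row.recommended_images_dev, sep='\n')
    print()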
107743
[{'nlwiki': ['Blank_map.svg', 'Map_pointer_black.svg', 'Small-city-symbol.svg', 'The_Mariposa_County_Courthouse,_5088_Bullion_Street,_Mariposa,_California.jpg']}, {'ptwiki': ['Blank_map.svg', 'Map_pointer_black.svg', 'Small-city-symbol.svg']}]
[{'nlwiki': ['Blank_map.svg', 'Map_pointer_black.svg', 'Small-city-symbol.svg', 'The_Mariposa_County_Courthouse,_5088_Bullion_Street,_Mariposa,_California.jpg']}]
117032
[{'eswiki': ['Blank_map.svg', 'Map_pointer_black.svg', 'Small-city-symbol.svg']}]
[{'eswiki': ['Blank_map.svg', 'Map_pointer_black.svg', 'Small-city-symbol.svg']}]
117032
[{'eswiki': ['Blank_map.svg', 'Map_pointer_black.svg', 'Small-city-symbol.svg']}]
[{'eswiki': ['Blank_map.svg', 'Map_pointer_black.svg', 'Small-city-symbol.svg']}]
138567
[{'fawiki': ['Blank_map.svg', 'Map_pointer_black.svg', 'Small-city-symbol.svg']}]
[{'fawiki': ['Blank_map.svg', 'Map_pointer_black.svg', 'Small-city-symbol.svg']}]
390631
[{'svwiki': ['Bundesarchiv_Bild_183-1984-0601-008,_Berlin,_Marzahn.jpg', 'Marzahn_Gaerten_der_Welt_08-2015_img08_Chinese_Garden.jpg', 'Gutshaus_Mahlsdorf.JPG']}, {'plwiki': ['Bundesarchiv_Bild_183-1989-0413-031,_Berlin-Marzahn.jpg', 'Ortsteile_im_Bezirk_Marzahn-Hellersdorf.png']}, {'dewiki': ['Bundesarchiv_Bild_183-1989-0413-031,_Berlin-Marzahn.jpg']}]
[{'plwiki': ['Bundesarchiv_Bild_183-1989-0413-031,_Berlin-Marzahn.jpg', 'Ortsteile_im_Bezirk_Marzahn-Hellersdorf.png']}, {'dewiki': ['Bundesarchiv_Bild_183-1989-0413-031,_Berlin-Marzahn.jpg']}, {'svwiki': ['Bundesarchiv_Bild_183-1984-0601-008,_Berlin,_Marzahn.jpg', 'Marzahn_Gaerten_der_Welt_08-2015_img08_Chinese_Garden.jpg', 'Gutshaus_Mahlsdorf.JPG']}]
739855
[{'arwiki': ['Red-blue-noise.gif']}]
[{'arwiki': ['Red-blue-noise.gif']}]
1054770
[{'ptwiki': ['Toscana_-_Maremma_Regional_Park_-_aerial_photo_with_Torre_di_Collelungo.jpg', 'Alluvione_Maremma_Toscana_Novembre_2012_Detriti_in_Mare.jpg']}, {'dewiki': ['Map_-_IT_-_Grosseto_-_Grosseto.svg', 'Grosseto,_palazzo_della_provincia_01.JPG', 'Mauer_Grosseto.JPG']}]
[{'itwiki': ['20100410GrStadioVonCurvaNord.JPG']}]
1054770
[{'ptwiki': ['Toscana_-_Maremma_Regional_Park_-_aerial_photo_with_Torre_di_Collelungo.jpg', 'Alluvione_Maremma_Toscana_Novembre_2012_Detriti_in_Mare.jpg']}, {'dewiki': ['Map_-_IT_-_Grosseto_-_Grosseto.svg', 'Grosseto,_palazzo_della_provincia_01.JPG', 'Mauer_Grosseto.JPG']}]
[{'ptwiki': ['Toscana_-_Maremma_Regional_Park_-_aerial_photo_with_Torre_di_Collelungo.jpg', 'Alluvione_Maremma_Toscana_Novembre_2012_Detriti_in_Mare.jpg']}, {'dewiki': ['Map_-_IT_-_Grosseto_-_Grosseto.svg', 'Grosseto,_palazzo_della_provincia_01.JPG', 'Mauer_Grosseto.JPG']}]
1054770
[{'ptwiki': ['Toscana_-_Maremma_Regional_Park_-_aerial_photo_with_Torre_di_Collelungo.jpg', 'Alluvione_Maremma_Toscana_Novembre_2012_Detriti_in_Mare.jpg']}, {'dewiki': ['Map_-_IT_-_Grosseto_-_Grosseto.svg', 'Grosseto,_palazzo_della_provincia_01.JPG', 'Mauer_Grosseto.JPG']}]
[{'itwiki': ['Druso_maggiore,_1-50_dc_ca,_da_roselle_02.JPG', 'Roselle_Archaeological_Park_Central_area.JPG', 'Coat_of_Arms_of_the_House_of_Aldobrandeschi.svg', 'B_Innozenz_II.jpg', 'Pianta_della_cittá_di_Grosseto_-_btv1b53099929c.jpg', 'Vallardi_-_Provincia_di_Grosseto_-_1860_ca.jpg', 'Grosseto,_Palazzo_delle_Poste,_1931-32,_esterno_03.jpg', 'Grosseto1935.jpg', 'Stemma_di_Grosseto.svg', 'Grosseto-Gonfalone.png', 'Corona_di_Città_Italiana.svg']}, {'ptwiki': ['Roselle_Archaeological_Park_Central_area.JPG', 'Grosseto,_Palazzo_delle_Poste,_1931-32,_esterno_03.jpg']}, {'jawiki': ['Grosseto_Palazzo_Comunale.JPG']}]
6862266
[{'ukwiki': ['Пам’ятник_ВВВ_Новий_Буг.jpg', 'Новий_Буг_вулиця.JPG']}]
[{'ukwiki': ['Пам’ятник_ВВВ_Новий_Буг.jpg', 'Новий_Буг_вулиця.JPG']}]
8469245
[{'ukwiki': ['Chkalov,_Stalin_and_Belyakov._August_10,_1936.jpg']}, {'ruwiki': ['Огневые_наземные_испытания_истребителя_И-Z_№_39009.jpg', 'Tupolev_ANT-25_at_Central_Air_Force_Museum_pic1.JPG', 'Chkalov,_Stalin_and_Belyakov._August_10,_1936.jpg', 'Сигизмунд_Леваневский_перед_последним_вылетом.jpg', 'Стефановский,_Пётр_Михайлович.jpg', 'Вид_на_памятник_Су_-7_Б,_на_дальнем_плане_въезд_А_Чкаловского_аэродрома.jpg']}]
[{'ruwiki': ['Огневые_наземные_испытания_истребителя_И-Z_№_39009.jpg', 'Tupolev_ANT-25_at_Central_Air_Force_Museum_pic1.JPG', 'Chkalov,_Stalin_and_Belyakov._August_10,_1936.jpg', 'Сигизмунд_Леваневский_перед_последним_вылетом.jpg', 'Стефановский,_Пётр_Михайлович.jpg', 'Вид_на_памятник_Су_-7_Б,_на_дальнем_плане_въезд_А_Чкаловского_аэродрома.jpg']}, {'ukwiki': ['Chkalov,_Stalin_and_Belyakov._August_10,_1936.jpg']}]
8672889
[{'nlwiki': ["Kees_Kist_(AZ'67),_Ivan_Nielsen,_Jan_van_Deinsen_en_Michel_van_de_Korput,_Bestanddeelnr_930-5343.jpg"]}]
[{'nlwiki': ["Kees_Kist_(AZ'67),_Ivan_Nielsen,_Jan_van_Deinsen_en_Michel_van_de_Korput,_Bestanddeelnr_930-5343.jpg"]}]
8672889
[{'nlwiki': ["Kees_Kist_(AZ'67),_Ivan_Nielsen,_Jan_van_Deinsen_en_Michel_van_de_Korput,_Bestanddeelnr_930-5343.jpg"]}]
[{'nlwiki': ["Kees_Kist_(AZ'67),_Ivan_Nielsen,_Jan_van_Deinsen_en_Michel_van_de_Korput,_Bestanddeelnr_930-5343.jpg"]}]
17036006
[{'eswiki': ['ChristianNoboa.jpg', 'Rostov-Amkar_(11).jpg', 'Christian_Noboa_Zenit.jpg']}, {'jawiki': ['ChristianNoboa.jpg']}, {'ruwiki': ['ChristianNoboa.jpg', 'Christian_Noboa_2012.jpg', 'Rostov-Rub15_(8).jpg']}]
[{'jawiki': ['ChristianNoboa.jpg']}, {'ruwiki': ['ChristianNoboa.jpg', 'Christian_Noboa_2012.jpg', 'Rostov-Rub15_(8).jpg']}, {'eswiki': ['ChristianNoboa.jpg', 'Rostov-Amkar_(11).jpg', 'Christian_Noboa_Zenit.jpg']}]
18289775
[{'elwiki': ["Seversky_P-35_Converted_in_1934_to_350hp_Wright_R-975E_with_a_faired_landing_gear_as_SEV-3L,_then_SEV-3XAR,_X-2106_to_win_the_Air_Corps'_BT-8_contract_(16334858241).jpg"]}]
[{'elwiki': ["Seversky_P-35_Converted_in_1934_to_350hp_Wright_R-975E_with_a_faired_landing_gear_as_SEV-3L,_then_SEV-3XAR,_X-2106_to_win_the_Air_Corps'_BT-8_contract_(16334858241).jpg"]}]
20647537
[{'ptwiki': ['Emily_Browning_2010_Comic-Con_Cropped.jpg']}]
[{'ptwiki': ['Emily_Browning_2010_Comic-Con_Cropped.jpg']}]
20647537
[{'ptwiki': ['Emily_Browning_2010_Comic-Con_Cropped.jpg']}]
[{'trwiki': ['Emily_Browning_April_2011.jpg']}, {'ptwiki': ['Emily_Browning_2010_Comic-Con_Cropped.jpg']}, {'ruwiki': ['Emily_Browning_2010_Comic-Con_Cropped.jpg', 'Emily_Browning_April_2011.jpg']}, {'frwiki': ['Emily_Browning_2010_Comic-Con_Cropped.jpg', 'Emily_Browning_April_2011.jpg', 'Photo_of_Emily_Browning_on_August_25_2014.jpg']}]
22421181
[{'hewiki': ['FC_Zenit_Chelyabinsk.gif']}, {'ruwiki': ['Logo_zenit_cheljabinsk.gif']}]
[{'hewiki': ['FC_Zenit_Chelyabinsk.gif']}, {'ruwiki': ['Logo_zenit_cheljabinsk.gif']}]
24029097
[{'ruwiki': ['Dubh-Ghleann,_Glen_Quoich_-_geograph.org.uk_-_87404.jpg']}]
[{'ruwiki': ['Glen_Quoich_-_geograph.org.uk_-_207035.jpg']}]
24029097
[{'ruwiki': ['Dubh-Ghleann,_Glen_Quoich_-_geograph.org.uk_-_87404.jpg']}]
[{'ruwiki': ['Dubh-Ghleann,_Glen_Quoich_-_geograph.org.uk_-_87404.jpg']}]
35479547
[{'ptwiki': ['Mama_MV.jpg']}]
[{'ptwiki': ['Mama_MV.jpg']}]
52836000
[{'ruwiki': ['Mount_Aberdeen_from_Plain_of_6_Glaciers.jpg']}]
[{'ruwiki': ['Mount_Aberdeen_from_Plain_of_6_Glaciers.jpg']}]
54196633
[{'frwiki': ["Localisation_du_pont_d'Ebebda.jpg", 'PontEbebda1.jpg']}]
[{'frwiki': ["Localisation_du_pont_d'Ebebda.jpg", 'PontEbebda1.jpg']}]
66960052
[{'dewiki': ['Mark_Vientos_hits_a_home_run,_July_29,_2023_(1).jpg', 'Mark_Vientos_during_warmups,_March_15,_2024_-_00002.jpg']}]
[{'dewiki': ['Mark_Vientos_hits_a_home_run,_July_29,_2023_(1).jpg', 'Mark_Vientos_during_warmups,_March_15,_2024_-_00002.jpg']}]
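Several prod/dev pairs in the sample differ only in the order of the source wikis (for instance 17036006), so eyeballing the output can overstate the differences. A minimal sketch of an order-insensitive comparison follows. The helper names `normalize` and `same_suggestions` are hypothetical and not part of this code base, and the sketch assumes that ordering carries no meaning in `recommended_images`, which may not hold if the lists are ranked.

```python
def normalize(recommended_images):
    """Map each source wiki to the set of its suggested image names.

    Expects a list of single-key dicts, e.g. [{'jawiki': ['A.jpg']}, ...].
    """
    normalized = {}
    for entry in recommended_images:
        for wiki, images in entry.items():
            normalized.setdefault(wiki, set()).update(images)
    return normalized


def same_suggestions(prod, dev):
    """Return True when prod and dev agree, ignoring wiki and image order."""
    return normalize(prod) == normalize(dev)


# Values copied from the sample output (target IDs 17036006 and 107743).
prod_17036006 = [
    {'eswiki': ['ChristianNoboa.jpg', 'Rostov-Amkar_(11).jpg', 'Christian_Noboa_Zenit.jpg']},
    {'jawiki': ['ChristianNoboa.jpg']},
    {'ruwiki': ['ChristianNoboa.jpg', 'Christian_Noboa_2012.jpg', 'Rostov-Rub15_(8).jpg']},
]
dev_17036006 = [
    {'jawiki': ['ChristianNoboa.jpg']},
    {'ruwiki': ['ChristianNoboa.jpg', 'Christian_Noboa_2012.jpg', 'Rostov-Rub15_(8).jpg']},
    {'eswiki': ['ChristianNoboa.jpg', 'Rostov-Amkar_(11).jpg', 'Christian_Noboa_Zenit.jpg']},
]
courthouse = 'The_Mariposa_County_Courthouse,_5088_Bullion_Street,_Mariposa,_California.jpg'
prod_107743 = [
    {'nlwiki': ['Blank_map.svg', 'Map_pointer_black.svg', 'Small-city-symbol.svg', courthouse]},
    {'ptwiki': ['Blank_map.svg', 'Map_pointer_black.svg', 'Small-city-symbol.svg']},
]
dev_107743 = [
    {'nlwiki': ['Blank_map.svg', 'Map_pointer_black.svg', 'Small-city-symbol.svg', courthouse]},
]

print(same_suggestions(prod_17036006, dev_17036006))  # True: same content, different order
print(same_suggestions(prod_107743, dev_107743))      # False: dev lacks the ptwiki entry
```

Applied to the joined sample, a check like this would separate pure reordering from real content changes.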
Observations
- Section topics look intact: dev has 1,012 more rows than prod, out of roughly 144.65 million
- 438,970 fewer article images:
  - more file extensions are extracted, see !35 (diffs) vs. https://gitlab.wikimedia.org/repos/structured-data/section-image-recs/-/blob/main/imagerec/article_images.py#L80
  - filters are applied at extraction time, thus reducing the number of images
- 4,340 more section alignment image suggestions
- the sample of suggestions shows slight variations: most rows match up to ordering, a few differ in content (e.g. 107743), but overall it looks good
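The deltas quoted in these observations can be reproduced from the counts in the data checks. The snippet below is a small sketch in plain Python (no Spark session needed). The `delta_report` helper and the dictionary keys are illustrative names, not part of the pipeline.

```python
# (dev, prod) row counts copied from the data checks for snapshot 2024-09-23.
counts = {
    'section topics': (144651363, 144650351),
    'article images': (62022805, 62461775),
    'section alignment image suggestions': (2664164, 2659824),
}


def delta_report(counts):
    """Return {dataset: (absolute delta, delta as a percentage of prod)}."""
    report = {}
    for name, (dev, prod) in counts.items():
        diff = dev - prod
        report[name] = (diff, 100 * diff / prod)
    return report


report = delta_report(counts)
for name, (diff, pct) in report.items():
    # A positive delta means dev has more rows than prod.
    print(f'{name}: {diff:+,} rows ({pct:+.4f}% vs. prod)')
```

Reporting the relative delta next to the absolute one makes it easier to judge whether a difference is noise or a real behavior change.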
Bug: T331522
Bug: T333699
Bug: T339120