Port the standard wikitext list/table filter from section topics (!4) · Merge requests · repos / structured-data / Section Alignment Image Suggestions

Marco Fossati requested to merge standard-filter into main Apr 14, 2023

IMPORTANT: don't extract the lead section anymore. When coupled with the filter, extracting it causes a steady increase in executor memory, and it isn't used downstream anyway
adapt filtering logic from https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/blob/a11b6e70f2b00d039b05715167545d6abc284717/section_topics/pipeline.py#L158
remove split_section function
_process_sections now extracts heading and content, skips null or empty content, and skips content with standard lists or tables
isolate heading normalization logic
update tests
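
For illustration only, the section-processing behavior described above could be sketched roughly as follows. This is not the code in this merge request: `process_sections`, `normalize_heading`, and the `LIST_OR_TABLE` regex are hypothetical stand-ins, and the sketch assumes that "standard" lists are lines starting with `*`, `#`, `;` or `:` and that "standard" tables open with `{|`. The actual filter adapted from section-topics may detect them differently.

```python
import re
from typing import Iterable, Iterator, Optional, Tuple

# Assumption: a standard wikitext list item starts a line with *, #, ; or :
# and a standard wikitext table starts a line with {|
LIST_OR_TABLE = re.compile(r"^[ \t]*(?:[*#;:]|\{\|)", re.MULTILINE)


def normalize_heading(heading: str) -> str:
    """Hypothetical normalization: drop '=' markers and whitespace, lowercase."""
    return heading.strip().strip("=").strip().lower()


def process_sections(
    sections: Iterable[Tuple[str, Optional[str]]],
) -> Iterator[Tuple[str, str]]:
    """Yield (normalized heading, content) for sections that pass the filter."""
    for heading, content in sections:
        # Skip null or empty content
        if content is None or not content.strip():
            continue
        # Skip content that contains a standard list or table
        if LIST_OR_TABLE.search(content):
            continue
        yield normalize_heading(heading), content.strip()


sections = [
    ("== History ==", "Founded in 1900."),
    ("== Members ==", "* Alice\n* Bob"),
    ("== Stats ==", '{| class="wikitable"\n|}'),
    ("== Empty ==", "   "),
    ("== Missing ==", None),
]
kept = list(process_sections(sections))
```

In this sketch only the "History" section survives; the list, table, blank, and null sections are dropped. Keeping heading normalization in its own function mirrors the "isolate heading normalization logic" item and makes it testable on its own.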

Bug: T330841
