Alter scripts/*.py to write their outputs to parquets
The output formats of our scripts are inconsistent: some already store a parquet (check_bad_parsing.py), others (collect_media_prefixes.py, fetch_qids_from_wikidata.py) just write plain lines, one (detect_html_tables.py) writes jsonl (which is actually ingested as parquet later on), and another (gather_section_titles_denylist.py) writes json.
Some of these outputs end up bundled as static files within section_topics/data/, which makes them inconvenient to keep up to date: every refresh requires a new commit and build.
This updates all scripts to write to a parquet instead.
Note that these new outputs are not yet being used; those changes are coming in a separate merge request, once we've made sure the scripts run fine and their output has become available for consumption.
Note: this also includes functional changes in 1 script:
detect_html_tables.py no longer includes normalized_section_title in its output, as it was not used. It also no longer ingests the denylist to omit rows we likely don't care about anyway. This makes things simpler (less coordination of scripts/outputs) and safer (no need to remember that the output is only partial, i.e. doesn't include denylisted entries). The code that consumes this output also filters out denylisted rows anyway, so this has no other functional impact.
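
For illustration only, consumer-side filtering of denylisted rows could look roughly like the sketch below. It is not the actual consuming code: the function name drop_denylisted, the section_title key, and the strip/lowercase normalization are all assumptions made for this example.

```python
from typing import Dict, Iterable, Iterator


def drop_denylisted(
    rows: Iterable[Dict[str, str]], denylist: Iterable[str]
) -> Iterator[Dict[str, str]]:
    """Yield only rows whose section title is not on the denylist.

    Hypothetical helper: the "section_title" key and the normalization
    (strip + lowercase) are assumptions, not the project's real schema.
    """
    denied = {title.strip().lower() for title in denylist}
    for row in rows:
        if row["section_title"].strip().lower() not in denied:
            yield row


# The script output is now complete; filtering happens at consumption time.
rows = [{"section_title": "References"}, {"section_title": "History"}]
kept = list(drop_denylisted(rows, ["references"]))
```

Because the filter runs where the data is consumed, the script output can stay complete and the denylist can change without regenerating it.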
Bug: T339129