coalesce outputs with default workable values
Add an optional `--coalesce` argument to the relevant CLIs, with default values chosen as a trade-off between fewer output files and longer execution time.
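A minimal sketch of how such an option could be wired, for illustration only. `build_parser`, `reduce_partitions` and `DEFAULT_COALESCE` are hypothetical names, not the ones used in this change. The only Spark calls assumed are `DataFrame.coalesce(n)` and `DataFrame.repartition(n)`.

```python
import argparse

# Assumed default for illustration; the notes mention 8 as the value
# model.py was already using.
DEFAULT_COALESCE = 8


def build_parser(default_coalesce=DEFAULT_COALESCE):
    """Build a CLI parser exposing an optional --coalesce argument."""
    parser = argparse.ArgumentParser(description="hypothetical job CLI")
    parser.add_argument(
        "--coalesce",
        type=int,
        default=default_coalesce,
        help="target number of partitions before writing the output",
    )
    return parser


def reduce_partitions(df, num_partitions, shuffle=False):
    """Return df with at most num_partitions partitions.

    df is expected to behave like a Spark DataFrame. coalesce() merges
    existing partitions without a full shuffle; repartition() performs
    a full shuffle.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be >= 1")
    if shuffle:
        return df.repartition(num_partitions)
    return df.coalesce(num_partitions)
```

A script would then call something like `reduce_partitions(df, args.coalesce)` right before writing its output, with a per-script default passed to `build_parser`.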
Notes
- `model.py` was already using the default coalesce value of 8
- coalescing `embeddings.py` drastically, down to the default value, crashes Spark executors, because too few nodes end up handling the whole computation
Report
script | coalesce | files before | files after |
---|---|---|---|
`sections.py` | 8 | 2049 | 9 |
`embeddings.py` | 100 | 1025 | 101 |
`features.py` | 4 ^ | 807k | 1k |

^ We used `repartition`, see https://phabricator.wikimedia.org/T350009#9389878.

Bug: T350009