
coalesce outputs with default workable values

Marco Fossati requested to merge T350009 into main

Add an optional --coalesce argument to the relevant CLIs, with default values chosen as a trade-off between fewer output files and longer execution time.
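For illustration, an optional --coalesce flag with a per-script default could be wired up roughly as follows. This is a minimal sketch, not the code in this merge request: `build_parser` and `DEFAULT_COALESCE` are hypothetical names, and the default values are assumed from the coalesce column of the report below.

```python
import argparse

# Assumed per-script defaults, copied from the "coalesce" column of the report.
# Both the mapping and its name are hypothetical.
DEFAULT_COALESCE = {
    "sections.py": 8,
    "embeddings.py": 100,
    "features.py": 4,
}


def build_parser(script: str) -> argparse.ArgumentParser:
    """Build a CLI parser whose --coalesce default depends on the script."""
    parser = argparse.ArgumentParser(prog=script)
    parser.add_argument(
        "--coalesce",
        type=int,
        default=DEFAULT_COALESCE[script],
        metavar="N",
        help="number of partitions to reduce the output to (default: %(default)s)",
    )
    return parser


if __name__ == "__main__":
    # Omitting the flag falls back to the script's default.
    print(build_parser("sections.py").parse_args([]).coalesce)
    # Passing the flag overrides it.
    print(build_parser("embeddings.py").parse_args(["--coalesce", "50"]).coalesce)
```

Keeping the flag optional lets existing invocations run unchanged while still allowing a per-run override.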

Notes

  • model.py was already using the default coalesce value of 8
  • in embeddings.py, a drastic coalesce to the default value crashes Spark executors, because too few nodes end up handling the whole computation

Report

script         coalesce  files before  files after
sections.py    8         2049          9
embeddings.py  100       1025          101
features.py    4 ^       807k          1k

^ We used repartition, see https://phabricator.wikimedia.org/T350009#9389878.
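The difference between the two calls can be sketched as below. In PySpark, `DataFrame.coalesce(n)` merges existing partitions without a full shuffle, so a drastic reduction may push the upstream computation onto only a few nodes, while `DataFrame.repartition(n)` performs a full shuffle and keeps the upstream stages parallel. The helper `reduce_partitions` and its `shuffle` flag are hypothetical and not part of this merge request.

```python
def reduce_partitions(df, num_partitions: int, shuffle: bool = False):
    """Return `df` with `num_partitions` partitions.

    `df` is expected to be a pyspark.sql.DataFrame; only its
    `coalesce` and `repartition` methods are used.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be a positive integer")
    if shuffle:
        # Full shuffle: slower, but the work before it stays spread out.
        return df.repartition(num_partitions)
    # No full shuffle: cheaper, but a drastic reduction can concentrate
    # the whole computation on too few nodes.
    return df.coalesce(num_partitions)
```

A caller would typically apply the helper just before writing, for example `reduce_partitions(df, 4, shuffle=True)` followed by the usual `df.write` call.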

Bug: T350009

Edited by Marco Fossati
