Add support for Iceberg's rewrite_data_files(). (!849) · Merge requests · repos / data-engineering / Airflow DAGs

Xcollazo requested to merge add-iceberg-maintenance into main Sep 28, 2024

In this MR, we expand the maintenance mechanism introduced on !806 (merged) to also support rewrite_data_files().

Example:

iceberg_wmf_dumps_wikitext_raw:
  datastore: iceberg
  table_name: wmf_dumps.wikitext_raw_rc2
 maintenance:
    schedule: "@daily"
    rewrite_data_files:
      enabled: True
      strategy: "sort"
      sort_order: "wiki_db ASC NULLS FIRST, revision_timestamp ASC NULLS FIRST"  # Defined due to Iceberg 1.2.1 bug
      options:
        "max-concurrent-file-group-rewrites": "40"
        "partial-progress.enabled": "true"
      spark_kwargs:
        driver_memory: "32g"
        driver_cores: "4"
        executor_memory: "20g"
        executor_cores: "2"
        pool: mutex_for_wmf_dumps_wikitext_raw
        priority_weight: 10
      spark_conf:
        "spark.dynamicAllocation.maxExecutors": "64"

Notice how the config above now includes a rewrite_data_files section. Under that, Iceberg's strategy, options and more procedure parameters are verbatim from https://iceberg.apache.org/docs/1.6.1/spark-procedures/#rewrite_data_files. spark_kwargs and spark_conf are provided to be able to define the spark resources, as these will vary on a per table basis. If not defined, the cluster defaults apply.

Via spark_kwargs you can send in any kwargs that are relevant to a SparkSqlOperator. In the example above we send in kwargs pool and priority_weight to leverage Airflow's pools for this particular table.

Bug: T375402 Bug: T373694

Edited Oct 01, 2024 by Xcollazo

Admin message

Admin message

Admin message

Add support for Iceberg's rewrite_data_files().

Merge request reports