Add support for Iceberg's rewrite_data_files().
In this MR, we expand the maintenance mechanism introduced in !806 (merged) to also support `rewrite_data_files()`.
Example:
```yaml
iceberg_wmf_dumps_wikitext_raw:
  datastore: iceberg
  table_name: wmf_dumps.wikitext_raw_rc2
  maintenance:
    schedule: "@daily"
    rewrite_data_files:
      enabled: True
      strategy: "sort"
      sort_order: "wiki_db ASC NULLS FIRST, revision_timestamp ASC NULLS FIRST" # Defined due to Iceberg 1.2.1 bug
      options:
        "max-concurrent-file-group-rewrites": "40"
        "partial-progress.enabled": "true"
      spark_kwargs:
        driver_memory: "32g"
        driver_cores: "4"
        executor_memory: "20g"
        executor_cores: "2"
        pool: mutex_for_wmf_dumps_wikitext_raw
        priority_weight: 10
      spark_conf:
        "spark.dynamicAllocation.maxExecutors": "64"
```
Notice how the config above now includes a `rewrite_data_files` section. Under it, Iceberg's `strategy`, `options` and the other procedure parameters are passed through verbatim, as documented at https://iceberg.apache.org/docs/1.6.1/spark-procedures/#rewrite_data_files. `spark_kwargs` and `spark_conf` are provided so that Spark resources can be set per table, since they will vary from one table to another. If they are not defined, the cluster defaults apply.
Via `spark_kwargs` you can send in any kwargs that are relevant to a `SparkSqlOperator`. In the example above we send in the kwargs `pool` and `priority_weight` to leverage Airflow's pools for this particular table.