T275685 generate production datasets
Created by: gmodena
This PR adds the capability to automate end-to-end generation of production datasets.
For more details, see the comments in publish.sh. This script will:
- run the notebook with the algorunner wrapper
- copy model output to HDFS and expose it via a Hive external table (available in Superset)
- run etl/transform.py to generate production data
- expose production data via a Hive external table (available in Superset)
- collect production datasets locally
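
The ordering of these steps, together with a per-step timing, can be illustrated with a small runner. This is only a sketch under stated assumptions: the step names and the `noop` placeholder are hypothetical, and the real work in publish.sh (notebook execution, HDFS copy, Hive table creation, Spark job) is not reproduced here.

```python
import time
from typing import Callable, Dict, List, Tuple


def run_steps(steps: List[Tuple[str, Callable[[], None]]]) -> Dict[str, float]:
    """Run each step in order and return elapsed seconds keyed by step name."""
    timings: Dict[str, float] = {}
    for name, step in steps:
        start = time.perf_counter()
        step()  # an exception raised here aborts the remaining steps
        timings[name] = time.perf_counter() - start
    return timings


def noop() -> None:
    """Hypothetical placeholder standing in for a real pipeline step."""


# Hypothetical step names mirroring the list above, in the same order.
STEPS: List[Tuple[str, Callable[[], None]]] = [
    ("run_notebook", noop),
    ("copy_model_output_to_hdfs", noop),
    ("transform", noop),
    ("expose_production_table", noop),
    ("collect_datasets", noop),
]

if __name__ == "__main__":
    for step_name, seconds in run_steps(STEPS).items():
        print(f"{step_name}\t{seconds:.3f}")
```

Running the steps strictly in sequence keeps a failure in an early stage from producing a partial production dataset.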
Datasets will be created for the following wikis:
enwiki arwiki kowiki cswiki viwiki frwiki fawiki ptwiki ruwiki trwiki plwiki hewiki svwiki ukwiki huwiki hywiki srwiki euwiki arzwiki cebwiki dewiki bnwiki
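
As a quick sanity check after a run, one could verify that a dataset file exists for every wiki in this list. The sketch below assumes the file naming pattern that appears in the example listing in this description (`prod-<wiki>-<snapshot>-wd_image_candidates.tsv`); the helper names `dataset_filename` and `missing_datasets` are hypothetical and not part of this PR.

```python
from pathlib import Path
from typing import List

# Wikis listed in this PR description.
WIKIS = (
    "enwiki arwiki kowiki cswiki viwiki frwiki fawiki ptwiki ruwiki trwiki plwiki "
    "hewiki svwiki ukwiki huwiki hywiki srwiki euwiki arzwiki cebwiki dewiki bnwiki"
).split()


def dataset_filename(wiki: str, snapshot: str) -> str:
    """Build the expected file name (pattern inferred from the example listing)."""
    return f"prod-{wiki}-{snapshot}-wd_image_candidates.tsv"


def missing_datasets(dataset_dir: Path, snapshot: str) -> List[str]:
    """Return the wikis that have no dataset file in dataset_dir."""
    return [
        wiki
        for wiki in WIKIS
        if not (dataset_dir / dataset_filename(wiki, snapshot)).is_file()
    ]
```

For example, `missing_datasets(Path("runs") / run_id / f"imagerec_prod_{snapshot}", snapshot)` would return an empty list when all 22 files are present.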
Use
publish.sh <snapshot>
Each time publish.sh is invoked, it records the following data under runs/<run_id>:
- metrics: a set of timing metrics generated by this script
- Output: raw model output in tsv format
- imagerec_prod_${snapshot}: production datasets in tsv format
- regular.spark.properties: spark properties file for the transform.py job
Each run has an associated unique <run_id>. This UUID is propagated to the ETL transforms
and populates dataset_id in production datasets, which makes it possible to trace
a given dataset back to the run that generated it.
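
The idea of stamping every record with the run's identifier can be sketched in a few lines. This is a minimal illustration using plain dictionaries, not the actual Spark transform; `new_run_id` and `tag_rows` are hypothetical names, and the use of a random (version 4) UUID is an assumption.

```python
import uuid
from typing import Dict, Iterable, List


def new_run_id() -> str:
    """Generate a unique run identifier (assumed here to be a random UUID)."""
    return str(uuid.uuid4())


def tag_rows(rows: Iterable[Dict[str, str]], run_id: str) -> List[Dict[str, str]]:
    """Return copies of the rows with dataset_id set to the run identifier."""
    return [{**row, "dataset_id": run_id} for row in rows]


if __name__ == "__main__":
    run_id = new_run_id()
    rows = [{"wiki": "enwiki", "confidence_rating": "high"}]
    for row in tag_rows(rows, run_id):
        print(row)
```

Given a dataset row, the `dataset_id` value then points back to the matching `runs/<run_id>` directory.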
Example
$ ./publish.sh 2021-01-25
[...]
Datasets are available at runs/dc4c9aea-4e85-475f-9626-ad0909b92fb6/imagerec_prod_2021-01-25
Export summary
22 confidence_rating source
684441
240156 high wikidata
293089 low commons
1182152 medium wikipedia
$ ls runs/dc4c9aea-4e85-475f-9626-ad0909b92fb6/imagerec_prod_2021-02-25/
prod-arwiki-2021-02-25-wd_image_candidates.tsv prod-huwiki-2021-02-25-wd_image_candidates.tsv
prod-arzwiki-2021-02-25-wd_image_candidates.tsv prod-hywiki-2021-02-25-wd_image_candidates.tsv
prod-bnwiki-2021-02-25-wd_image_candidates.tsv prod-kowiki-2021-02-25-wd_image_candidates.tsv
prod-cebwiki-2021-02-25-wd_image_candidates.tsv prod-plwiki-2021-02-25-wd_image_candidates.tsv
prod-cswiki-2021-02-25-wd_image_candidates.tsv prod-ptwiki-2021-02-25-wd_image_candidates.tsv
prod-dewiki-2021-02-25-wd_image_candidates.tsv prod-ruwiki-2021-02-25-wd_image_candidates.tsv
prod-enwiki-2021-02-25-wd_image_candidates.tsv prod-srwiki-2021-02-25-wd_image_candidates.tsv
prod-euwiki-2021-02-25-wd_image_candidates.tsv prod-svwiki-2021-02-25-wd_image_candidates.tsv
prod-fawiki-2021-02-25-wd_image_candidates.tsv prod-trwiki-2021-02-25-wd_image_candidates.tsv
prod-frwiki-2021-02-25-wd_image_candidates.tsv prod-ukwiki-2021-02-25-wd_image_candidates.tsv
prod-hewiki-2021-02-25-wd_image_candidates.tsv prod-viwiki-2021-02-25-wd_image_candidates.tsv
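
A tally like the export summary above can be recomputed from the collected files. The sketch below is an assumption-laden illustration rather than the logic used by publish.sh: it assumes each TSV file starts with a header row containing `confidence_rating` and `source` columns, and the function name `export_summary` is hypothetical. Unlike the summary shown above, header rows are not counted.

```python
import csv
from collections import Counter
from pathlib import Path
from typing import Counter as CounterType, Tuple


def export_summary(dataset_dir: Path) -> CounterType[Tuple[str, str]]:
    """Count rows per (confidence_rating, source) pair across all dataset files."""
    counts: CounterType[Tuple[str, str]] = Counter()
    for path in sorted(dataset_dir.glob("prod-*-wd_image_candidates.tsv")):
        with path.open(newline="", encoding="utf-8") as handle:
            # QUOTE_NONE treats quote characters in the data as ordinary text.
            reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
            for row in reader:
                counts[(row["confidence_rating"], row["source"])] += 1
    return counts
```

Rows with an empty rating and source are counted under the key `("", "")`, which corresponds to the unlabeled line in the summary.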