T274798 include all unillustrated articles
Created by: gmodena
Raw data can contain records with NULL image_id.
The ImageMatching dataset should include all unillustrated articles, with or without candidate matches.
This PR updates the algorithm code and the production dataset ETL to account for this new behaviour.
Articles with no matches will be saved with an empty top_candidates field in the raw dataset. These records will be stored with empty ("") image_id, source, and confidence_rating fields in the prod data. An example of a prod dataset with empty suggestions can be found in gmodena.imagerec_prod.
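The mapping from raw records to prod rows described above can be sketched in plain Python. This is an illustration only, not the code in etl/transform.py: the function name to_prod_rows, the page_id key, and the candidate keys image, source, and rating are hypothetical, and the raw schema is assumed to hold top_candidates as a list of dicts.

```python
from typing import Dict, List

EMPTY = ""  # placeholder written to prod for articles without a suggestion


def to_prod_rows(raw: Dict) -> List[Dict]:
    """Flatten one raw record into one or more prod rows.

    Assumes raw["top_candidates"] is a list of dicts with "image",
    "source" and "rating" keys (hypothetical schema).
    """
    candidates = raw.get("top_candidates") or []
    if not candidates:
        # Unillustrated article with no match: keep the article and
        # emit a single row whose suggestion fields are empty strings.
        return [{
            "page_id": raw["page_id"],
            "image_id": EMPTY,
            "source": EMPTY,
            "confidence_rating": EMPTY,
        }]
    # One prod row per candidate image.
    return [
        {
            "page_id": raw["page_id"],
            "image_id": c["image"],
            "source": c["source"],
            "confidence_rating": c["rating"],
        }
        for c in candidates
    ]
```

Emitting a row with empty fields, rather than dropping the record, is what keeps every unillustrated article in the prod dataset.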
Example
hive (gmodena)> select count(*) from gmodena.imagerec_prod where image_id is not null;
104342
hive (gmodena)> select count(*) from gmodena.imagerec_prod where image_id is null;
39518
Changelog
- algorithm.ipynb has been modified to save all articles that we consider unillustrated.
- etl/transform.py has been updated to handle raw data records with an empty top_candidates field.
- ddl/external_imagerec_prod.hql has been modified so that empty strings are formatted as NULL by Hive (and provide SQL NULL semantics).
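The effect of the DDL change on queries can be imitated in Python. In Hive, text-backed tables commonly get this behaviour from the serialization.null.format table property; whether this PR's DDL uses that property is an assumption. The helper parse_row, the tab separator, and the four-column layout below are hypothetical.

```python
from typing import List, Optional


def parse_row(line: str, null_format: str = "", sep: str = "\t") -> List[Optional[str]]:
    """Split a delimited line, mapping fields equal to null_format to None."""
    fields = line.rstrip("\n").split(sep)
    return [None if field == null_format else field for field in fields]


# Hypothetical rows: page_id, image_id, source, confidence_rating
lines = [
    "123\tA.jpg\twikidata\thigh",
    "456\t\t\t",  # article with no suggestion: empty strings on disk
]
rows = [parse_row(line) for line in lines]

# Rough equivalents of "image_id is not null" / "image_id is null"
with_image = sum(1 for r in rows if r[1] is not None)
without_image = sum(1 for r in rows if r[1] is None)
```

With empty strings read back as nulls, the is null and is not null predicates in the example queries separate articles without suggestions from articles with them.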