Filter out data that is too heavily linked (!71) · Merge requests · repos / generated-data-platform / datapipelines

Cparle requested to merge T314120 into T296814-image-suggestions Jul 29, 2022

If a commons file has its own wikidata item (via P18), then on wikidata the item's commons category (P373) is often (or always?) used to indicate the category the wikidata item is in, rather than the category containing images relevant to the wikidata item.

This use of the commons category is not useful for search - if I'm searching for an image for "Christ washing the feet of the apostles" I don't want other images from Category:Images_from_the_Royal_Library_of_Belgium in the results. Similarly, for image suggestions, other images from that category are not appropriate suggestions for an article about that particular artwork.

It's impossible to tell definitively which meaning of the P373 property is being used. The best guess we can make at the moment is that if an image is in a commons category that is the P373 value for >99 different wikidata items, then the P373 value is probably being used in the sense of "the category the wikidata item is in" rather than "the category containing images related to the wikidata item", and therefore we should ignore it.
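
As a rough illustration of the heuristic above (a plain-Python sketch, not the code in this MR; the function names and the `(item_id, category)` row shape are assumptions made for the example):

```python
from collections import defaultdict

# Threshold taken from the description: a category that is the P373 value
# for more than 99 distinct wikidata items is ignored.
MAX_ITEMS_PER_CATEGORY = 99


def overused_categories(p373_rows, max_items=MAX_ITEMS_PER_CATEGORY):
    """Return the categories that are the P373 value of > max_items items.

    p373_rows is an iterable of (item_id, category) pairs (hypothetical shape).
    """
    items_by_category = defaultdict(set)
    for item_id, category in p373_rows:
        items_by_category[category].add(item_id)
    return {
        category
        for category, items in items_by_category.items()
        if len(items) > max_items
    }


def drop_overused_categories(p373_rows, max_items=MAX_ITEMS_PER_CATEGORY):
    """Keep only the rows whose category is not overused."""
    rows = list(p373_rows)
    ignored = overused_categories(rows, max_items)
    return [(item_id, category) for item_id, category in rows if category not in ignored]
```

A category with exactly 99 distinct items is kept; the cut-off only applies from 100 upwards.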

Similarly, some images are used as the lead image on tens or even hundreds of thousands of pages - for example, there is an svg USA map that's used as the lead image on >400k wiki articles about geographical districts in the US.

This is not actually useful for users - I probably don't want a USA map as a result in an image search for "Sowbelly ridge" - and it also clogs up the commons search index doc for the image with thousands of weighted_tags entries, each containing the wikidata id of a wiki article.

To solve this and prevent the commonswiki_file index from being overwhelmed we'll filter out rows from this dataframe that have >99 wikidata ids for a single image.

In practical terms this means that if an image is a lead image for a set of articles N, and if the set N contains articles with >99 distinct wikidata ids, then the image is likely to be too general to be useful as a search result or an image suggestion, and so we ignore it.
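
A minimal sketch of that rule in plain Python, assuming the rows are `(image, wikidata_id)` pairs (the real pipeline works on a dataframe; the function name, row shape and file names here are hypothetical):

```python
from collections import Counter


def filter_heavily_linked(rows, max_ids=99):
    """Drop every row for an image that is lead image for > max_ids distinct ids.

    rows is an iterable of (image, wikidata_id) pairs (hypothetical shape).
    Duplicate pairs are collapsed first so that only distinct ids are counted.
    """
    distinct_pairs = set(rows)
    ids_per_image = Counter(image for image, _ in distinct_pairs)
    return sorted(
        (image, wikidata_id)
        for image, wikidata_id in distinct_pairs
        if ids_per_image[image] <= max_ids
    )


# Example input: a map used on 100 articles and a photo used on one.
example_rows = [("Usa_map.svg", "Q%d" % i) for i in range(100)]
example_rows += [("Sowbelly_ridge.jpg", "Q1"), ("Sowbelly_ridge.jpg", "Q1")]
example_kept = filter_heavily_linked(example_rows)
```

With this input only the single `Sowbelly_ridge.jpg` row survives, since the map is linked to 100 distinct ids.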

See https://phabricator.wikimedia.org/T314120

