


The goal of CC Search is to index all of the Creative Commons works on the internet, starting with images. We have indexed over 500 million images, which we believe is roughly 36% of all CC licensed content on the internet by our last count. To further enhance the usefulness of our search tool, we recently started crawling and analyzing images for improved search results. This article will discuss the process of taking a paper design for a large scale crawler, implementing it, and putting it in production, with a few idealized code snippets and diagrams along the way. The full source code can be viewed on GitHub.

Originally, when we discovered an image and inserted it into CC Search, we didn't even bother downloading it; we stuck the URL in our database and embedded the image in our search results. Embedding third party content is fraught with problems:

1. We don't know the dimensions or compression quality of images, which is useful both for relevance purposes (de-ranking low quality images) and for filtering. For example, some users are only interested in high resolution images and would like to exclude content below a certain size.
2. We can't run any type of computer vision analysis on any of the images, which could be useful for enriching search metadata through object recognition.
3. What if the other party's server goes down, the images disappear due to link rot, or their TLS certificates expire? Each of these situations results in broken images appearing in the search results or browser alerts about degraded security.

We solved (3) by setting up a caching thumbnail proxy between images in the search results and their 3rd party origin, as well as some last-minute liveness checks to make sure that the image hasn't 404'd. (1) and (2), however, are not possible to solve without actually downloading the image and performing some analysis on the contents of the file. On the scale of several thousand images, it would be easy to cobble together a few scripts to spit out this information, but with half a billion images, there are a lot of hurdles to overcome. For us to reproduce the features that users take for granted in image search, we're going to need a fairly powerful crawling system.
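The thumbnail proxy is only described at a high level here, but the idea is simple enough to sketch. The snippet below is a minimal, hypothetical illustration (not the actual CC Search service) using Flask, requests, and Pillow: the proxy fetches the third-party image on demand, performs the last-minute liveness check, downscales it, and caches the bytes so the search results never embed the origin directly. The `/thumbs` route, `THUMB_SIZE`, and the in-memory cache are all invented for the example.

```python
# A rough sketch of a caching thumbnail proxy, not the actual CC Search service.
# Assumes Flask, requests, and Pillow are installed; names and routes are invented.
import io

import requests
from flask import Flask, Response, abort, request
from PIL import Image

app = Flask(__name__)

THUMB_SIZE = (640, 480)  # hypothetical maximum thumbnail dimensions
CACHE = {}               # in production this would be a real cache (Redis, CDN, etc.)


@app.route("/thumbs")
def thumbnail():
    url = request.args.get("url")
    if not url:
        abort(400)

    if url in CACHE:
        return Response(CACHE[url], mimetype="image/jpeg")

    # "Last-minute liveness check": if the origin 404s, has broken TLS, or
    # times out, return an error the frontend can handle instead of embedding
    # a broken third-party image.
    try:
        upstream = requests.get(url, timeout=5)
        upstream.raise_for_status()
    except requests.RequestException:
        abort(404)

    # Downscale the original and cache the result so repeat requests never
    # touch the third-party origin again.
    image = Image.open(io.BytesIO(upstream.content)).convert("RGB")
    image.thumbnail(THUMB_SIZE)
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG")
    CACHE[url] = buffer.getvalue()
    return Response(CACHE[url], mimetype="image/jpeg")


if __name__ == "__main__":
    app.run(port=8080)  # hypothetical local port
```

Search results would then point their image tags at the proxy (for example, `/thumbs?url=<origin url>`) rather than at the third-party server, so a dead origin degrades into a handled 404 instead of a broken image or a TLS warning.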

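Problems (1) and (2) begin with pulling basic facts out of the downloaded file. As a rough sketch of that per-image analysis (again, not the crawler's actual code), the snippet below uses Pillow to read dimensions and format straight from the response bytes, plus a crude bytes-per-pixel ratio as a stand-in for compression quality; a more serious estimate might inspect JPEG quantization tables. The `ImageMetadata` fields are invented for illustration.

```python
# Rough sketch of per-image metadata extraction, not the actual crawler code.
# Assumes Pillow is installed; field names are invented for illustration.
import io
from dataclasses import dataclass

from PIL import Image


@dataclass
class ImageMetadata:
    width: int
    height: int
    format: str             # e.g. "JPEG", "PNG"
    bytes_per_pixel: float  # crude proxy for compression quality


def analyze_image(blob: bytes) -> ImageMetadata:
    """Extract dimensions and a rough quality signal from raw image bytes."""
    with Image.open(io.BytesIO(blob)) as image:
        width, height = image.size
        fmt = image.format or "UNKNOWN"
    # Heavily compressed images spend fewer bytes per pixel; this is only a
    # rough signal, but it is enough to flag obviously low quality files.
    bytes_per_pixel = len(blob) / float(width * height)
    return ImageMetadata(width, height, fmt, bytes_per_pixel)
```

Values like these are what eventually flow back into the search index to power resolution filters and quality-based de-ranking.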