In a nutshell, CLIP is a multimodal model that combines knowledge of English-language concepts with semantic knowledge of images.
It can distinguish between an image of a "cat" and a "dog" just as easily as it can between "an illustration of Deadpool pretending to be a bunny rabbit" and "an underwater scene in the style of Vincent Van Gogh", even though it has almost certainly never seen those exact things in its training data. This works because of its generalized knowledge of what those English phrases mean and what those pixels represent.
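The mechanism behind this zero-shot matching can be sketched in a few lines: CLIP encodes the image and each candidate caption into the same embedding space, L2-normalizes both, and scores captions by cosine similarity, turning the scores into probabilities with a temperature-scaled softmax. The sketch below uses random vectors as stand-ins for real encoder outputs (real CLIP produces 512-dimensional embeddings from learned image and text encoders); the function name and the 0.07 temperature are illustrative assumptions, not CLIP's exact API.

```python
import numpy as np

def clip_style_scores(image_emb, text_embs, temperature=0.07):
    """Score candidate captions against one image, CLIP-style:
    cosine similarity between L2-normalized embeddings, then softmax."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature          # scaled cosine similarities
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    return exp / exp.sum()

# Toy stand-ins for encoder outputs (not real CLIP embeddings)
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
captions = ["a photo of a cat", "a photo of a dog"]
text_embs = np.stack([
    image_emb + 0.1 * rng.normal(size=512),  # caption embedded near the image
    rng.normal(size=512),                    # unrelated caption
])
probs = clip_style_scores(image_emb, text_embs)
best = captions[int(probs.argmax())]         # the caption closest to the image wins
```

In the real model the two encoders are trained contrastively so that matching image–caption pairs land close together in this shared space, which is what makes the cosine-similarity scoring meaningful.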