OpenAI's CLIP is the most important advancement in computer vision this year


In a nutshell, CLIP is a multimodal model that combines knowledge of English-language concepts with semantic knowledge of images.

It can just as easily distinguish between an image of a "cat" and a "dog" as it can between "an il...
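The core mechanic is a shared embedding space: CLIP's image encoder and text encoder map their inputs to vectors that can be compared directly. A minimal sketch with made-up three-dimensional vectors (real CLIP embeddings are 512-dimensional encoder outputs; the numbers here are purely illustrative):

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Made-up embeddings standing in for CLIP's encoder outputs.
image_of_cat = [0.9, 0.1, 0.05]
text_cat = [0.85, 0.15, 0.1]
text_dog = [0.1, 0.9, 0.2]

# The image lands closer to the matching caption in the shared space.
assert cosine(image_of_cat, text_cat) > cosine(image_of_cat, text_dog)
```

Every use case below is a variation on this one comparison.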

One of the neatest aspects of CLIP is how versatile it is. When OpenAI introduced it, they noted two use cases: image classification and image generation. But in the 9 months since its release it has been us...

Image classification

OpenAI originally evaluated CLIP as a zero-shot image classifier. They compared it against traditional supervised machine learning models and it performed nearly on par with them without having to be trained on any specific dataset.

One challenge with traditional approaches to image classi...
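Zero-shot classification reduces to a simple pattern: embed each candidate label as a caption (e.g. "a photo of a cat"), embed the image, and pick the most similar caption. A hedged sketch with made-up vectors standing in for CLIP's encoder outputs:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def zero_shot_classify(image_emb, label_embs):
    # Pick the label whose caption embedding is most similar to the image.
    return max(label_embs, key=lambda label: cosine(image_emb, label_embs[label]))

# Made-up embeddings; in practice these come from CLIP's text encoder.
label_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a bird": [0.0, 0.1, 0.9],
}
print(zero_shot_classify([0.8, 0.2, 0.1], label_embs))  # → a photo of a cat
```

No retraining is needed to change the label set: swapping in new captions is the whole "training" step.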

Image Generation

DALL-E was developed by OpenAI in tandem with CLIP. It's a generative model that can produce images based on a textual description; CLIP was used to evaluate its efficacy.

The DALL-E model has still not been released publicly, but CLIP has been...
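One common way a scorer like CLIP pairs with a generator: sample several candidate images for a prompt and keep the one whose image embedding best matches the prompt's text embedding. A sketch of that reranking step, with made-up vectors:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def pick_best(prompt_emb, candidate_embs):
    # Index of the generated image most similar to the prompt.
    return max(range(len(candidate_embs)),
               key=lambda i: cosine(prompt_emb, candidate_embs[i]))

prompt = [1.0, 0.0, 0.0]          # made-up text embedding
candidates = [[0.2, 0.9, 0.1],    # off-prompt sample
              [0.9, 0.1, 0.0],    # on-prompt sample
              [0.5, 0.5, 0.5]]
assert pick_best(prompt, candidates) == 1
```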

Content Moderation

One extension of image classification is content moderation. If you ask it in the right way, CLIP can filter graphic or NSFW images out of the box. We demonstrated
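"Asking in the right way" means framing moderation as zero-shot classification between contrasting prompts. A sketch, assuming hypothetical prompts such as "a safe, family-friendly photo" vs. "an explicit, graphic image" (the embeddings below are made up):

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def flag_nsfw(image_emb, safe_emb, nsfw_emb, margin=0.0):
    # Flag when the image sits closer to the NSFW prompt than the safe one.
    return cosine(image_emb, nsfw_emb) - cosine(image_emb, safe_emb) > margin

safe_emb = [0.9, 0.1]  # made-up embedding of the "safe" prompt
nsfw_emb = [0.1, 0.9]  # made-up embedding of the "NSFW" prompt
assert flag_nsfw([0.2, 0.8], safe_emb, nsfw_emb) is True
assert flag_nsfw([0.8, 0.2], safe_emb, nsfw_emb) is False
```

The `margin` parameter lets you trade false positives against false negatives without retraining anything.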

Image Search

Because CLIP doesn't need to be trained on specific phrases, it's perfectly suited to searching large catalogs of images. It doesn't require images to be tagged and supports natural-language search.

Yurij Mikhalevich has already created
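The search pattern: embed every catalog image once, then at query time embed the text and rank the stored vectors by similarity. A sketch with made-up embeddings (in practice each embedding comes from CLIP's encoders):

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(query_emb, catalog):
    # catalog maps filename -> precomputed image embedding.
    # Returns filenames ranked by similarity to the text query.
    return sorted(catalog, key=lambda name: cosine(query_emb, catalog[name]),
                  reverse=True)

catalog = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.1, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for embedding "a sunny beach"
print(search(query, catalog)[0])  # → beach.jpg
```

Because image embeddings are computed once up front, query time is just one text-encoder call plus vector comparisons.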

Image Similarity

Apple's NeuralHash image-similarity algorithm has been in the news a lot recently because of how it's being applied to scanning user devices for CSAM. We showed h...
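With CLIP, similarity between two images is just the similarity of their embeddings, which makes near-duplicate detection a pairwise threshold check. A minimal sketch (the vectors and the 0.97 threshold are illustrative assumptions):

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def near_duplicates(embs, threshold=0.97):
    # Return index pairs of images whose embeddings are nearly identical.
    pairs = []
    for i in range(len(embs)):
        for j in range(i + 1, len(embs)):
            if cosine(embs[i], embs[j]) >= threshold:
                pairs.append((i, j))
    return pairs

embs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
assert near_duplicates(embs) == [(0, 1)]
```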

Image Ranking

It's not just factual representations that are encoded in CLIP's memory; it knows about qualitative concepts as well (as we learned from the Unreal Engine trick).

We used this to create
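One way to score a qualitative concept is to contrast two prompts (for instance "a beautiful photo" vs. "an ugly photo" — hypothetical prompts, not ones from the original post) and rank images by the difference in similarity. A sketch with made-up embeddings:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank(image_embs, pos_emb, neg_emb):
    # Score = closeness to the positive prompt minus closeness to the negative.
    score = lambda e: cosine(e, pos_emb) - cosine(e, neg_emb)
    return sorted(range(len(image_embs)), key=lambda i: score(image_embs[i]),
                  reverse=True)

pos = [1.0, 0.0]   # e.g. "a beautiful photo" (made-up embedding)
neg = [0.0, 1.0]   # e.g. "an ugly photo"
images = [[0.3, 0.7], [0.9, 0.1], [0.5, 0.5]]
assert rank(images, pos, neg) == [1, 2, 0]
```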

Object Tracking

As an extension of image similarity, we've used CLIP to track objects across frames in a video. It uses an object detection model to find items of interest, then crops the image and uses CLIP to determine whether two detected objec...
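The linking step can be sketched as a greedy nearest-neighbor match between the crop embeddings of consecutive frames (the vectors and 0.8 threshold below are illustrative assumptions, not the post's actual implementation):

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def match_detections(prev_embs, curr_embs, min_sim=0.8):
    # Greedily link each current-frame crop to its most similar
    # previous-frame crop; crops below min_sim start new tracks.
    matches, used = {}, set()
    for j, cur in enumerate(curr_embs):
        best_i, best_sim = None, min_sim
        for i, prev in enumerate(prev_embs):
            sim = cosine(prev, cur)
            if i not in used and sim > best_sim:
                best_i, best_sim = i, sim
        if best_i is not None:
            matches[j] = best_i
            used.add(best_i)
    return matches

prev = [[1.0, 0.0], [0.0, 1.0]]    # crop embeddings from frame t
curr = [[0.05, 1.0], [1.0, 0.1]]   # crops from frame t+1, reordered
assert match_detections(prev, curr) == {0: 1, 1: 0}
```

Greedy matching is the simplest choice; production trackers usually combine appearance similarity with positional cues.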

Unfortunately, for many hyper-specific use cases (e.g. examining the output of microchip lithography) or for identifying things invented since CLIP was trained in 2020 (for example, the unique characteristics of CLIP+VQGAN creations), CLIP isn't capable of performing well out of the box for all problem...

We've used CLIP along with GANs to convert text into images; there's no reason we can't go in the other direction and create rich captions for images with creative usage of CLIP (possibly along with a language model like GPT-3).

If you can classify images, it should be doable to classify frames of videos. In this way you could automatically split videos into scenes and create search indexes. Imagine searching YouTube for your comp...
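Once each frame has a zero-shot label, scene splitting is just run-length grouping of consecutive labels. A sketch (the per-frame labels below are hypothetical classifier outputs):

```python
def split_scenes(frame_labels):
    # Merge consecutive frames sharing a zero-shot label into
    # (label, start_frame, end_frame) scenes.
    scenes = []
    for idx, label in enumerate(frame_labels):
        if scenes and scenes[-1][0] == label:
            scenes[-1] = (label, scenes[-1][1], idx)
        else:
            scenes.append((label, idx, idx))
    return scenes

labels = ["intro", "intro", "gameplay", "gameplay", "gameplay", "credits"]
assert split_scenes(labels) == [("intro", 0, 1), ("gameplay", 2, 4),
                                ("credits", 5, 5)]
```

Indexing the resulting scene labels (or the frame embeddings themselves) is what would make a video archive searchable by natural language.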


