In a nutshell, CLIP is a multimodal model that combines knowledge of English-language concepts with semantic knowledge of images.
It can distinguish between an image of a "cat" and a "dog" just as easily as it can between "an illustration of Deadpool pretending to be a bunny rabbit" and "an underwater scene in the style of Vincent Van Gogh", even though it has almost certainly never seen those exact things in its training data. This works because of its generalized knowledge of what those English phrases mean and what those pixels represent.
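The mechanism behind this zero-shot matching can be sketched in a few lines: CLIP encodes the image and each candidate caption into the same embedding space, L2-normalizes both, and scores captions by cosine similarity, turning the scores into probabilities with a temperature-scaled softmax. The sketch below uses random vectors as stand-ins for real encoder outputs (real CLIP produces 512-dimensional embeddings from learned image and text encoders); the function name and the 0.07 temperature are illustrative assumptions, not CLIP's exact API.

```python
import numpy as np

def clip_style_scores(image_emb, text_embs, temperature=0.07):
    """Score candidate captions against one image, CLIP-style:
    cosine similarity between L2-normalized embeddings, then softmax."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature          # scaled cosine similarities
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    return exp / exp.sum()

# Toy stand-ins for encoder outputs (not real CLIP embeddings)
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
captions = ["a photo of a cat", "a photo of a dog"]
text_embs = np.stack([
    image_emb + 0.1 * rng.normal(size=512),  # caption embedded near the image
    rng.normal(size=512),                    # unrelated caption
])
probs = clip_style_scores(image_emb, text_embs)
best = captions[int(probs.argmax())]         # the caption closest to the image wins
```

In the real model the two encoders are trained contrastively so that matching image–caption pairs land close together in this shared space, which is what makes the cosine-similarity scoring meaningful.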