Quick Tour: Zero-shot Image Classification via Prompt with Hugging Face

3 min read · Mar 14, 2024

There are many ways to do zero-shot image classification, and I’ll introduce two here. First, to be clear: classification here does not mean object detection. We are classifying the whole image, not specific objects within it. This article also does not discuss the model papers; it takes an applied perspective. After all, with so many models available, the ultimate goal is to put them to use.

Abstract

Zero-shot image classification means using a pretrained model to infer the class of an image without any fine-tuning. We still have to design the input prompts, though.

As an Applied AI Engineer, the main job is to apply and combine models to meet application requirements. What matters most is how to make a model, or a combination of models, produce the most convincing results, especially when the output is a “probability”. CLIP is the ancestor of models driven by prompts or phrases, and it is fair to say that nearly all modern (large) models of this kind build on CLIP (of course, there may be exceptions). This article uses two examples to illustrate.

Code

https://colab.research.google.com/drive/1J7aH4egYVzg_RLHCShBrvHYVmUd4dNIY?usp=sharing

Model Selection

CLIP of OpenAI: https://huggingface.co/openai/clip-vit-large-patch14
BLIP of Salesforce: https://huggingface.co/Salesforce/blip-itm-base-coco

The case we are analyzing is based on this picture:

https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg

Self-attention-based models

What these models have in common is that the more specific the prompt, the clearer the results.
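To make "more specific" concrete, here is a minimal sketch of building candidate labels at two levels of detail. The helper name `build_labels` and its `scene` parameter are hypothetical, chosen only for illustration; the phrasing follows the labels used later in this article.

```python
from typing import List


def build_labels(subjects: List[str], scene: str = "") -> List[str]:
    """Build one candidate label per subject (hypothetical helper).

    With an empty `scene` the labels stay generic; passing a scene phrase
    adds context, which is what "more specific prompting" means here.
    """
    labels = []
    for subject in subjects:
        if scene:
            labels.append(f"a {subject} {scene} with a dog")
        else:
            labels.append(f"a {subject} with a dog")
    return labels


generic = build_labels(["man", "woman"])
specific = build_labels(["man", "woman"], scene="sitting on the beach")
print(generic)
print(specific)
```

Keeping the wording identical apart from the part being tested (here, man versus woman) makes the resulting scores easier to compare.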

Take CLIP for example:

The output is a probability for each label, obtained by applying softmax across all candidate labels.
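A minimal sketch of how such per-label probabilities can be computed with the Hugging Face `transformers` CLIP classes, assuming `torch`, `transformers`, and `Pillow` are installed. This is not necessarily the code in the linked Colab; the function names `clip_label_probs` and `top_label` are my own.

```python
from typing import Dict, List

# Model id taken from the "Model Selection" section of this article.
MODEL_ID = "openai/clip-vit-large-patch14"


def clip_label_probs(image, labels: List[str], model_id: str = MODEL_ID) -> Dict[str, float]:
    """Return {label: probability} for one PIL image.

    The probabilities come from a softmax over all candidate labels,
    so they are relative to the label set that is passed in.
    """
    # Imported lazily so the helpers below work without the heavy dependencies.
    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(model_id)
    processor = CLIPProcessor.from_pretrained(model_id)
    model.eval()

    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image has shape (num_images, num_labels).
        logits = model(**inputs).logits_per_image
    probs = logits.softmax(dim=1)[0].tolist()
    return dict(zip(labels, probs))


def top_label(probs: Dict[str, float]) -> str:
    """Return the label with the highest probability."""
    return max(probs, key=probs.get)
```

To try it, load the demo picture as a PIL image (for example with `requests` and `PIL.Image.open`), call `clip_label_probs(image, labels)` with your candidate labels, and pass the result to `top_label`. Because of the softmax, adding or removing a label changes every other label's probability.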

We expect the probabilities to become more clear-cut as the labels become more specific.

The probability here represents how accurately the label describes the image. The more information a label contains, the more distinct the gap between positive and negative results becomes, as shown below:

a man with a dog   0.0108  
a woman with a dog 0.0249

a man sitting on the beach with a dog   0.1653 
a woman sitting on the beach with a dog 0.5866
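To quantify the gap in the table above, here is a small worked example that uses only the four probabilities listed there. The helper name `separation` is illustrative.

```python
# Probabilities copied from the table above (CLIP softmax outputs).
short = {"man": 0.0108, "woman": 0.0249}
detailed = {"man": 0.1653, "woman": 0.5866}


def separation(scores):
    """Return (absolute gap, ratio) between the woman and man labels."""
    gap = scores["woman"] - scores["man"]
    ratio = scores["woman"] / scores["man"]
    return gap, ratio


for name, scores in [("short", short), ("detailed", detailed)]:
    gap, ratio = separation(scores)
    print(f"{name:9s} gap={gap:.4f} ratio={ratio:.2f}")
```

With the short labels the two scores differ by about 0.014 (a ratio of roughly 2.3); with the detailed labels they differ by about 0.42 (a ratio of roughly 3.5). Both measures show the more detailed labels separating the two candidates more clearly.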

Take BLIP for example:

BLIP gives a binary classification result. Rather than a single sigmoid output, its image-text matching head ends in 2 neurons passed through softmax: negative and positive.
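The two-neuron softmax and a single sigmoid output are closely related: for two logits, the softmax probability of the positive class equals the sigmoid of the difference between the positive and negative logits. A small check in plain Python, using made-up logit values rather than real BLIP outputs:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Made-up logits for illustration only: [negative, positive].
neg_logit, pos_logit = -1.2, 2.3

probs = softmax([neg_logit, pos_logit])
positive_via_softmax = probs[1]
positive_via_sigmoid = sigmoid(pos_logit - neg_logit)

print(positive_via_softmax, positive_via_sigmoid)
```

The two printed values agree up to floating-point error, and the negative and positive probabilities always sum to 1.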

With more “specific” prompts, the positive value is more likely to behave as expected.
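A minimal sketch of reading the positive probability from BLIP's image-text matching head with Hugging Face `transformers`, assuming `torch`, `transformers`, and `Pillow` are installed. The function names are my own, and treating index 1 as the positive ("match") class is an assumption that follows the original BLIP demo code.

```python
# Model id taken from the "Model Selection" section of this article.
MODEL_ID = "Salesforce/blip-itm-base-coco"


def blip_match_prob(image, text: str, model_id: str = MODEL_ID) -> float:
    """Return the probability that `text` matches the PIL `image`."""
    # Imported lazily so the helper below works without the heavy dependencies.
    import torch
    from transformers import BlipForImageTextRetrieval, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForImageTextRetrieval.from_pretrained(model_id)
    model.eval()

    inputs = processor(images=image, text=text, return_tensors="pt")
    with torch.no_grad():
        # itm_score holds two logits per pair: shape (batch, 2).
        itm_logits = model(**inputs).itm_score
    probs = torch.softmax(itm_logits, dim=1)
    # Assumption: index 0 is negative, index 1 is positive.
    return probs[0, 1].item()


def is_match(prob: float, threshold: float = 0.5) -> bool:
    """Hypothetical decision rule: call it a match above `threshold`."""
    return prob > threshold
```

Unlike CLIP, each prompt is scored on its own here, so the positive values for different prompts do not compete with each other and need not sum to 1.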

Summary

Prompt-based image classification and object detection are the way to go, so writing “good” prompts is a good way to get “clearer” results from the model.

Practice writing prompts well.

Here are two object detection models that accept prompts.

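YOLO-World (Real-Time Open-Vocabulary Object Detection)

Discover YOLO-World, a YOLOv8-based framework for real-time detection of open-vocabulary objects in images. It enhances user interaction, improves computational efficiency, and adapts to a variety of vision tasks.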

docs.ultralytics.com

GitHub - IDEA-Research/GroundingDINO: Official implementation of the paper "Grounding DINO…

Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object…

github.com