Quick Tour: Zero-shot Image Classification via Prompt with Hugging Face

3 min read · Mar 14, 2024


There are many ways to do zero-shot image classification, and I'll introduce two of them here. But first things first, let's get this straight: when we talk about classification here, we are not talking about object detection. In other words, we are classifying the whole image, not specific objects within it. This article also does not discuss the content of the model papers; it looks at the models from an applied perspective. After all, with so many models available, the ultimate goal is to apply and use them.


Zero-shot image classification means using a pretrained model to infer the category of an image without any fine-tuning; what we still have to do is design the input prompts.
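As a concrete sketch, the Hugging Face `pipeline` API wraps this whole workflow in a few lines. The blank image below is only a placeholder for a real photo:

```python
from transformers import pipeline
from PIL import Image

# Zero-shot image classification via the pipeline API,
# backed by the CLIP checkpoint used later in this article.
classifier = pipeline(
    "zero-shot-image-classification",
    model="openai/clip-vit-large-patch14",
)

# Placeholder image; replace with Image.open("your_photo.jpg").
image = Image.new("RGB", (224, 224), "white")

result = classifier(
    image,
    candidate_labels=["a man with a dog", "a woman with a dog"],
)
for r in result:
    print(f"{r['label']}: {r['score']:.4f}")
```

The pipeline applies softmax over the candidate labels, so the scores sum to 1.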

As an Applied AI Engineer, the main job is to apply and combine various models to meet application requirements, so the crucial skill is making a model (or a combination of models) produce convincing results, especially when the output is a probability. CLIP is the ancestor of prompt-based (phrase-based) models, and it is fair to say that most modern models of this kind build on CLIP (with the occasional exception). This article uses two examples as illustration.



Model Selection

  1. CLIP from OpenAI: https://huggingface.co/openai/clip-vit-large-patch14
  2. BLIP from Salesforce: https://huggingface.co/Salesforce/blip-itm-base-coco

The case we are analyzing is based on this picture:


Self-attention-based models

The common trait of these models is that the more specific the prompt, the clearer the expected result.

Take CLIP for example:
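A minimal sketch with `transformers`, using the labels from the article's example (the blank image is a placeholder for the beach photo):

```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

labels = [
    "a man with a dog",
    "a woman with a dog",
    "a man sitting on the beach with a dog",
    "a woman sitting on the beach with a dog",
]

# Placeholder image; replace with Image.open("your_photo.jpg").
image = Image.new("RGB", (224, 224), "white")

inputs = processor(text=labels, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity logits, normalized to probabilities.
probs = outputs.logits_per_image.softmax(dim=1)
for label, p in zip(labels, probs[0]):
    print(f"{label}\t{p:.4f}")
```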

The output is a probability over all the labels, obtained through softmax.

We expect those probabilities to become more decisive as the prompts become more specific.

Each probability reflects how well the corresponding description matches the image. Clearly, the more information a label contains, the more distinct the separation between positive and negative results, as shown below:

a man with a dog                         0.0108
a woman with a dog                       0.0249
a man sitting on the beach with a dog    0.1653
a woman sitting on the beach with a dog  0.5866

Take BLIP for example:

BLIP's image-text matching head produces a binary classification: rather than a single sigmoid output, it has two neurons passed through softmax, namely negative and positive.
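A sketch of that image-text matching head with the `blip-itm-base-coco` checkpoint (again, the blank image is a placeholder for a real photo):

```python
from transformers import BlipProcessor, BlipForImageTextRetrieval
from PIL import Image
import torch

model = BlipForImageTextRetrieval.from_pretrained(
    "Salesforce/blip-itm-base-coco")
processor = BlipProcessor.from_pretrained(
    "Salesforce/blip-itm-base-coco")

# Placeholder image; replace with Image.open("your_photo.jpg").
image = Image.new("RGB", (384, 384), "white")
text = "a woman sitting on the beach with a dog"

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# itm_score holds two logits: [negative, positive].
probs = outputs.itm_score.softmax(dim=1)
print(f"negative: {probs[0][0]:.4f}, positive: {probs[0][1]:.4f}")
```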

The positive score behaves more as expected when the prompt is more "specific".


For prompt-based image classification or object detection, writing a "good" prompt is the way to get "clearer" results from the model.

Practice writing prompts well.

Here are two object detection models that you can prompt with.