# Marqo-FashionSigLIP Model Card
Marqo-FashionSigLIP leverages Generalised Contrastive Learning (GCL), which allows the model to be trained not only on text descriptions but also on categories, styles, colors, materials, keywords, and fine details, yielding highly relevant search results on fashion products. The model was fine-tuned from ViT-B-16-SigLIP (webli).
**GitHub Page**: Marqo-FashionCLIP

**Blog**: Marqo Blog
## Usage
The model can be used with OpenCLIP:

```python
import open_clip
import torch
from PIL import Image

# Load the model and preprocessing transforms from the Hugging Face Hub
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionSigLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionSigLIP')

image = preprocess_val(Image.open("docs/fashion-hippo.png")).unsqueeze(0)
text = tokenizer(["a hat", "a t-shirt", "shoes"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalise embeddings so the dot product below is a cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
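Building on the snippet above, the same objects can rank a small catalogue of product images against a free-form query that mixes category, color, and material, in the spirit of GCL's multi-part training signal. This is a minimal sketch, not part of the official examples: the file paths under `products/` are placeholders, and `model`, `preprocess_val`, and `tokenizer` are the objects created above.

```python
import torch
from PIL import Image

# Hypothetical product catalogue: placeholder paths to local image files
catalogue = ["products/red-dress.jpg", "products/leather-boots.jpg", "products/wool-scarf.jpg"]

images = torch.stack([preprocess_val(Image.open(p)) for p in catalogue])
# A single query string combining category, color, and material
query = tokenizer(["a red cotton summer dress"])

with torch.no_grad():
    image_features = model.encode_image(images)
    query_features = model.encode_text(query)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    query_features /= query_features.norm(dim=-1, keepdim=True)

# Rank catalogue images by cosine similarity to the query
scores = (query_features @ image_features.T).squeeze(0)
for path, score in sorted(zip(catalogue, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")
```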
## Benchmark Results
Average evaluation results on 6 public multimodal fashion datasets (Atlas, DeepFashion (In-shop), DeepFashion (Multimodal), Fashion200k, KAGL, and Polyvore) are reported below:
### Text-To-Image (Averaged across 6 datasets)

| Model | AvgRecall | Recall@1 | Recall@10 | MRR |
|---|---|---|---|---|
| Marqo-FashionSigLIP | 0.231 | 0.121 | 0.340 | 0.239 |
| FashionCLIP2.0 | 0.163 | 0.077 | 0.249 | 0.165 |
| OpenFashionCLIP | 0.132 | 0.060 | 0.204 | 0.135 |
| ViT-B-16-laion2b_s34b_b88k | 0.174 | 0.088 | 0.261 | 0.180 |
| ViT-B-16-SigLIP-webli | 0.212 | 0.111 | 0.314 | 0.214 |
### Category-To-Product (Averaged across 5 datasets)

| Model | AvgP | P@1 | P@10 | MRR |
|---|---|---|---|---|
| Marqo-FashionSigLIP | 0.737 | 0.758 | 0.716 | 0.812 |
| FashionCLIP2.0 | 0.684 | 0.681 | 0.686 | 0.741 |
| OpenFashionCLIP | 0.646 | 0.653 | 0.639 | 0.720 |
| ViT-B-16-laion2b_s34b_b88k | 0.662 | 0.673 | 0.652 | 0.743 |
| ViT-B-16-SigLIP-webli | 0.688 | 0.690 | 0.685 | 0.751 |
### Sub-Category-To-Product (Averaged across 4 datasets)

| Model | AvgP | P@1 | P@10 | MRR |
|---|---|---|---|---|
| Marqo-FashionSigLIP | 0.725 | 0.767 | 0.683 | 0.811 |
| FashionCLIP2.0 | 0.657 | 0.676 | 0.638 | 0.733 |
| OpenFashionCLIP | 0.598 | 0.619 | 0.578 | 0.689 |
| ViT-B-16-laion2b_s34b_b88k | 0.638 | 0.651 | 0.624 | 0.712 |
| ViT-B-16-SigLIP-webli | 0.643 | 0.643 | 0.643 | 0.726 |
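For reference, the precision and reciprocal-rank columns above follow their standard definitions. The sketch below is illustrative only (it is not the evaluation code used for these benchmarks) and shows how P@k, Recall@k, and the per-query reciprocal rank behind MRR are conventionally computed from a ranked result list.

```python
def precision_at_k(ranked_relevant: list[bool], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant (P@k)."""
    return sum(ranked_relevant[:k]) / k

def recall_at_k(ranked_relevant: list[bool], k: int, num_relevant: int) -> float:
    """Fraction of all relevant items found in the top k (Recall@k)."""
    return sum(ranked_relevant[:k]) / num_relevant

def reciprocal_rank(ranked_relevant: list[bool]) -> float:
    """1 / rank of the first relevant item; 0.0 if none is retrieved.
    MRR is this value averaged over all queries."""
    for rank, relevant in enumerate(ranked_relevant, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

# Toy example: relevance of the top-10 results for a single query
ranked = [True, False, True, True, False, False, True, False, False, True]
print(precision_at_k(ranked, 10))  # 0.5
print(recall_at_k(ranked, 10, 8))  # 0.625 (assuming 8 relevant items exist in total)
print(reciprocal_rank(ranked))     # 1.0
```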