openbmb/MiniCPM-Embedding

MiniCPM-Embedding

MiniCPM-Embedding is a bilingual and cross-lingual text embedding model developed by ModelBest Inc. and THUNLP (the Natural Language Processing Lab at Tsinghua University), featuring:

Exceptional Chinese and English retrieval capabilities.
Outstanding cross-lingual retrieval capabilities between Chinese and English.

MiniCPM-Embedding is trained from MiniCPM-2B-sft-bf16 and uses bidirectional attention and Weighted Mean Pooling [1] in its architecture. It was trained in multiple stages on approximately 6 million training examples drawn from open-source, synthetic, and proprietary data.
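
For readers unfamiliar with Weighted Mean Pooling, the following is a minimal sketch of the position-weighted scheme described in [1], where later tokens receive linearly larger weights. It uses plain Python lists to stay dependency-free, and the function name `weighted_mean_pool` is hypothetical; the exact implementation inside MiniCPM-Embedding may differ.

```python
def weighted_mean_pool(hidden, mask):
    """Position-weighted mean of token vectors (sketch of the scheme in [1]).

    hidden: list of token vectors (one list of floats per token)
    mask:   list of 0/1 flags, 1 for real tokens and 0 for padding
    """
    # Token at 1-indexed position i gets raw weight i; padding gets weight 0.
    raw = [(i + 1) * m for i, m in enumerate(mask)]
    total = sum(raw)
    if total == 0:
        raise ValueError("mask selects no tokens")

    dim = len(hidden[0])
    pooled = [0.0] * dim
    for weight, vec in zip(raw, hidden):
        for d in range(dim):
            # Normalize weights so they sum to 1 over the real tokens.
            pooled[d] += (weight / total) * vec[d]
    return pooled


# Two real tokens and one padding token: weights are 1/3 and 2/3.
example = weighted_mean_pool([[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]], [1, 1, 0])
print(example)
```

Compared with a plain mean, this weighting gives more influence to later positions in the sequence.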

We also invite you to explore the RAG toolkit series:

Retrieval Model: MiniCPM-Embedding
Re-ranking Model: MiniCPM-Reranker
LoRA Plugin for RAG scenarios: MiniCPM3-RAG-LoRA

[1] Muennighoff, N. (2022). Sgpt: Gpt sentence embeddings for semantic search. arXiv preprint arXiv:2202.08904.

Model Information

Model Size: 2.4B
Embedding Dimension: 2304
Max Input Tokens: 512
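
As a rough illustration of what a 2304-dimensional embedding means for storage, the sketch below estimates the raw size of a dense index. The helper name `embedding_storage_bytes` is hypothetical, and the figures ignore any index overhead or compression a vector store might add.

```python
def embedding_storage_bytes(num_vectors, dim=2304, bytes_per_value=4):
    """Raw bytes needed to store num_vectors dense embeddings.

    bytes_per_value is 4 for float32 and 2 for float16 (assumed storage types).
    """
    return num_vectors * dim * bytes_per_value


# One float32 vector, and one million float16 vectors.
print(embedding_storage_bytes(1))
print(embedding_storage_bytes(1_000_000, bytes_per_value=2) / 1e9, "GB")
```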

Usage

Input Format

MiniCPM-Embedding supports query-side instructions in the following format:

Instruction: {{ instruction }} Query: {{ query }}

For example:

Instruction: 为这个医学问题检索相关回答。Query: 咽喉癌的成因是什么？

Instruction: Given a claim about climate change, retrieve documents that support or refute the claim. Query: However the warming trend is slower than most climate models have forecast.

MiniCPM-Embedding also works without an instruction, using the following format:

Query: {{ query }}

For evaluation on BEIR and C-MTEB/Retrieval, we use the instructions listed in instructions.json; other evaluations use no instructions. On the document side, the raw document text is used directly as input.
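
The two query formats above can be captured in a small helper. This is a minimal sketch; the function names `format_query` and `format_document` are hypothetical and not part of the model's API.

```python
def format_query(query, instruction=None):
    """Build the query-side input string in the format described above."""
    if instruction:
        return f"Instruction: {instruction} Query: {query}"
    # Instruction-free mode: only the "Query: " prefix is added.
    return f"Query: {query}"


def format_document(document):
    """Documents are passed through unchanged, with no prefix."""
    return document


print(format_query(
    "However the warming trend is slower than most climate models have forecast.",
    instruction="Given a claim about climate change, retrieve documents that support or refute the claim.",
))
print(format_query("中国的首都是哪里？"))
```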

Requirements

transformers==4.37.2
flash-attn>2.3.5

Demo

Hugging Face Transformers


from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

model_name = "openbmb/MiniCPM-Embedding"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, attn_implementation="flash_attention_2", torch_dtype=torch.float16).to("cuda")
model.eval()

# The model actually uses weighted mean pooling, but for deployment convenience
# part of the pooling is integrated into model.forward.
def mean_pooling(hidden, attention_mask):
    s = torch.sum(hidden * attention_mask.unsqueeze(-1).float(), dim=1)
    d = attention_mask.sum(dim=1, keepdim=True).float()
    reps = s / d
    return reps

@torch.no_grad()
def encode(input_texts):
    # Tokenize with truncation at the model's 512-token input limit.
    batch_dict = tokenizer(
        input_texts,
        max_length=512,
        padding=True,
        truncation=True,
        return_tensors="pt",
        return_attention_mask=True,
    ).to("cuda")

    outputs = model(**batch_dict)
    attention_mask = batch_dict["attention_mask"]
    hidden = outputs.last_hidden_state

    # Pool to one vector per text, then L2-normalize so dot products are cosine similarities.
    reps = mean_pooling(hidden, attention_mask)
    embeddings = F.normalize(reps, p=2, dim=1).detach().cpu().numpy()
    return embeddings

queries = ["中国的首都是哪里？"]
passages = ["beijing", "shanghai"]


INSTRUCTION = "Query: "
queries = [INSTRUCTION + query for query in queries]

embeddings_query = encode(queries)
embeddings_doc = encode(passages)

scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist())  # [[0.3535913825035095, 0.18596848845481873]]
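
The score matrix above has one row per query and one column per passage; because the embeddings are L2-normalized, each entry is a cosine similarity. A minimal sketch of turning such a matrix into a ranked list follows. The helper `rank_passages` is hypothetical, and the scores are copied from the example output above rather than recomputed.

```python
def rank_passages(scores, passages, top_k=None):
    """For each query row, return (passage, score) pairs sorted by descending score."""
    ranked = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if top_k is not None:
            order = order[:top_k]
        ranked.append([(passages[j], row[j]) for j in order])
    return ranked


# Scores as produced by scores.tolist() in the demo.
scores = [[0.3535913825035095, 0.18596848845481873]]
passages = ["beijing", "shanghai"]

for hits in rank_passages(scores, passages):
    for passage, score in hits:
        print(f"{score:.4f}\t{passage}")
```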

Sentence Transformers

import torch
from sentence_transformers import SentenceTransformer

model_name = "openbmb/MiniCPM-Embedding"
model = SentenceTransformer(model_name, trust_remote_code=True, model_kwargs={"attn_implementation":"flash_attention_2", "torch_dtype":torch.float16})
model.max_seq_length = 512
model.tokenizer.padding_side="right"

queries = ["中国的首都是哪里？"]
passages = ["beijing", "shanghai"]


INSTRUCTION = "Query: "

embeddings_query = model.encode(queries, prompt=INSTRUCTION, normalize_embeddings=True)
embeddings_doc = model.encode(passages, normalize_embeddings=True)

scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist())  # [[0.3535913825035095, 0.18596848845481873]]

Evaluation Results

Chinese and English Retrieval Results

Model	C-MTEB/Retrieval (NDCG@10)	BEIR (NDCG@10)
bge-large-zh-v1.5	70.46	-
gte-large-zh	72.49	-
Zhihui_LLM_Embedding	76.74	-
bge-large-en-v1.5	-	54.29
gte-en-large-v1.5	-	57.91
NV-Retriever-v1	-	60.9
bge-en-icl	-	62.16
NV-Embed-v2	-	62.65
me5-large	63.66	51.43
bge-m3(Dense)	65.43	48.82
gte-multilingual-base(Dense)	71.95	51.08
gte-Qwen2-1.5B-instruct	71.86	58.29
gte-Qwen2-7B-instruct	76.03	60.25
bge-multilingual-gemma2	73.73	59.24
MiniCPM-Embedding	76.76	58.56
MiniCPM-Embedding+MiniCPM-Reranker	77.08	61.61
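
For reference, NDCG@10 rewards rankings that place relevant documents near the top. The sketch below uses one common definition with linear gain and a log2 position discount; the function names are hypothetical, evaluation toolkits may differ in details, and the table values appear to be these scores scaled to 0-100.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """NDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_relevances: relevance labels in the order the system returned documents
    all_relevances:    relevance labels of every judged document for the query
    """
    ideal = sorted(all_relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / idcg if idcg > 0 else 0.0


# The single relevant document is returned at rank 2 instead of rank 1.
print(ndcg_at_k([0, 1, 0], [1, 0, 0]))
```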

Chinese-English Cross-lingual Retrieval Results

Model	MKQA En-Zh_CN (Recall@20)	NeuCLIR22 (NDCG@10)	NeuCLIR23 (NDCG@10)
me5-large	44.3	9.01	25.33
bge-m3(Dense)	66.4	30.49	41.09
gte-multilingual-base(Dense)	68.2	39.46	45.86
gte-Qwen2-1.5B-instruct	68.52	49.11	45.05
gte-Qwen2-7B-instruct	68.27	49.14	49.6
MiniCPM-Embedding	72.95	52.65	49.95
MiniCPM-Embedding+MiniCPM-Reranker	74.33	53.21	54.12

License

The code in this repository is released under the Apache-2.0 License.
Use of the MiniCPM-Embedding model weights must strictly follow MiniCPM Model License.md.
The MiniCPM-Embedding weights are completely free for academic research. After filling out a questionnaire for registration, the weights are also available for free commercial use.
