bwang0911 committed on
Commit
f077afa
1 Parent(s): 2987873

Create README.md

Files changed (1)
  1. README.md +200 -0
README.md ADDED
@@ -0,0 +1,200 @@
+ ---
+ license: cc-by-nc-4.0
+ tags:
+ - feature-extraction
+ - sentence-similarity
+ - mteb
+ language:
+ - multilingual
+ - af
+ - am
+ - ar
+ - as
+ - az
+ - be
+ - bg
+ - bn
+ - br
+ - bs
+ - ca
+ - cs
+ - cy
+ - da
+ - de
+ - el
+ - en
+ - eo
+ - es
+ - et
+ - eu
+ - fa
+ - fi
+ - fr
+ - fy
+ - ga
+ - gd
+ - gl
+ - gu
+ - ha
+ - he
+ - hi
+ - hr
+ - hu
+ - hy
+ - id
+ - is
+ - it
+ - ja
+ - jv
+ - ka
+ - kk
+ - km
+ - kn
+ - ko
+ - ku
+ - ky
+ - la
+ - lo
+ - lt
+ - lv
+ - mg
+ - mk
+ - ml
+ - mn
+ - mr
+ - ms
+ - my
+ - ne
+ - nl
+ - 'no'
+ - om
+ - or
+ - pa
+ - pl
+ - ps
+ - pt
+ - ro
+ - ru
+ - sa
+ - sd
+ - si
+ - sk
+ - sl
+ - so
+ - sq
+ - sr
+ - su
+ - sv
+ - sw
+ - ta
+ - te
+ - th
+ - tl
+ - tr
+ - ug
+ - uk
+ - ur
+ - uz
+ - vi
+ - xh
+ - yi
+ - zh
+ inference: false
+ library_name: transformers
+ ---
+
+ <br><br>
+
+ <p align="center">
+ <img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
+ </p>
+
+
+ <p align="center">
+ <b>The embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
+ </p>
+
+ <p align="center">
+ <b>Jina Embedding V3: A Multilingual Multi-Task Embedding Model</b>
+ </p>
+
+ ## Quick Start
+
+ The easiest way to start using `jina-embeddings-v3` is through Jina AI's [Embedding API](https://jina.ai/embeddings/).
+
+
+ ## Intended Usage & Model Info
+
+ `jina-embeddings-v3` is a multilingual **text embedding model** that supports sequences of up to **8192 tokens**.
+ It is based on an XLM-RoBERTa architecture (`JinaXLMRoBERTa`) that uses Rotary Position Embeddings to support longer sequences.
+ The `JinaXLMRoBERTa` backbone is pretrained on variable-length text in 89 languages with a masked language modeling objective for 160k steps.
+ The model is further trained on Jina AI's collection of more than 500 million multilingual sentence pairs and hard negatives.
+ These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.
+
+ `jina-embeddings-v3` has 5 task-specific LoRA adapters tuned on top of the backbone. To select one, pass `task_type` as an additional parameter when using the model:
+
+ TODO UPDATE THIS
+
+ 1. **query**: Handles incoming user queries at search time.
+ 2. **index**: Handles user documents submitted for indexing.
+ 3. **text-matching**: Handles symmetric text similarity tasks on short or long texts, such as STS (Semantic Textual Similarity).
+ 4. **classification**: Classifies user inputs into predefined categories.
+ 5. **clustering**: Produces embeddings suited to clustering for further analysis.
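As a rough illustration of how the adapter names above might be chosen in application code, the sketch below maps a use case to a `task_type` string. The helper `pick_task_type` and the `"retrieval"` use-case label are hypothetical and not part of the model's API, and the task names are copied from the list above, which is still marked as a TODO and may change.

```python
# Task names copied from the list above (marked TODO upstream, so they may change).
TASK_TYPES = ("query", "index", "text-matching", "classification", "clustering")


def pick_task_type(use_case, is_query=False):
    """Map a use case to a task_type string.

    Hypothetical helper for illustration only; not part of the model's API.
    """
    if use_case == "retrieval":
        # Retrieval is asymmetric: queries and documents use different adapters.
        return "query" if is_query else "index"
    if use_case in ("text-matching", "classification", "clustering"):
        # Symmetric tasks use the same adapter for every input.
        return use_case
    raise ValueError(f"unknown use case: {use_case!r}")


print(pick_task_type("retrieval", is_query=True))
print(pick_task_type("retrieval"))
print(pick_task_type("clustering"))
```

The point of the split is that, for search, the text typed by a user and the documents being indexed would go through different adapters, while the other three tasks treat all inputs alike.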
+
+ `jina-embeddings-v3` supports Matryoshka representation learning, so embeddings can be truncated to a smaller size at some cost in accuracy. We recommend storing embeddings with at least 128 dimensions; 1024 gives the best performance.
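To make the idea concrete, here is a minimal, dependency-free sketch of truncating a Matryoshka-style embedding: keep the leading components and rescale the result to unit length so that dot products remain comparable. The function name `truncate_embedding` and the toy 4-dimensional vector are assumptions for illustration. The model's own `truncate_dim` argument, used in the Usage section, presumably handles this internally, and whether it renormalizes is not stated here.

```python
import math


def truncate_embedding(vector, dim):
    """Keep the first `dim` components and rescale them to unit length.

    Hypothetical helper for illustration; not part of the model's API.
    """
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be rescaled
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0, 0.5]  # toy 4-dimensional stand-in for an embedding
short = truncate_embedding(full, 2)
print(short)
```

Storing the shorter vectors reduces index size roughly in proportion to the number of dimensions kept, which is the trade-off behind the 128 versus 1024 recommendation.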
+
+
+
+ ## Data & Parameters
+
+ Coming soon.
+
+ ## Usage
+
+ 1. The easiest way to start using `jina-embeddings-v3` is through Jina AI's [Embeddings API](https://jina.ai/embeddings/).
+ 2. Alternatively, you can use `jina-embeddings-v3` directly via the `transformers` package.
+
+ ```python
+ # Install the dependencies first (shell or notebook cell):
+ #   pip install transformers einops flash_attn
+ from transformers import AutoModel
+
+ # Initialize the model (trust_remote_code is needed for the custom architecture)
+ model = AutoModel.from_pretrained('jinaai/jina-embeddings-v3', trust_remote_code=True)
+
+ # Example sentences in several languages
+ sentences = [
+ "Organic skincare for sensitive skin with aloe vera and chamomile.",
+ "New makeup trends focus on bold colors and innovative techniques",
+ "Bio-Hautpflege für empfindliche Haut mit Aloe Vera und Kamille",
+ "Neue Make-up-Trends setzen auf kräftige Farben und innovative Techniken",
+ "Cuidado de la piel orgánico para piel sensible con aloe vera y manzanilla",
+ "Las nuevas tendencias de maquillaje se centran en colores vivos y técnicas innovadoras",
+ "针对敏感肌专门设计的天然有机护肤产品",
+ "新的化妆趋势注重鲜艳的颜色和创新的技巧",
+ "敏感肌のために特別に設計された天然有機スキンケア製品",
+ "新しいメイクのトレンドは鮮やかな色と革新的な技術に焦点を当てています",
+ ]
+
+ # Encode sentences
+ embeddings = model.encode(sentences, truncate_dim=1024, task_type='index') # TODO UPDATE
+
+ # Compute the similarity (dot product) between the first two sentences
+ print(embeddings[0] @ embeddings[1].T)
+ ```
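The snippet above prints a raw dot product for a single pair, which equals cosine similarity only if the embeddings are unit-length (not stated here). As a hedged, dependency-free sketch, the helpers below compute cosine similarity explicitly and find the nearest neighbour of one sentence among the rest. The names `cosine_similarity` and `most_similar` are hypothetical, and the toy 2-dimensional vectors merely stand in for real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(embeddings, i):
    """Index of the vector most similar to embeddings[i], excluding i itself."""
    candidates = [j for j in range(len(embeddings)) if j != i]
    return max(candidates, key=lambda j: cosine_similarity(embeddings[i], embeddings[j]))


# Toy vectors standing in for model output: 0 and 2 point in nearly the same direction.
toy = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.1]]
print(most_similar(toy, 0))
```

With the example sentences, one would expect each skincare sentence to land closest to its translations rather than to the makeup sentences, though that depends on the model and the chosen `task_type`.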
+
+
+ ## Performance
+
+ TODO UPDATE THIS
+
+ ## Contact
+
+ Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.
+
+ ## Citation
+
+ If you find `jina-embeddings-v3` useful in your research, please cite the following paper:
+
+ ```bibtex
+
+ ```