Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution Paper • 2409.12961 • Published 6 days ago • 22
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines Paper • 2409.12959 • Published 6 days ago • 33
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning Paper • 2409.12568 • Published 6 days ago • 44
OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person Paper • 2407.16224 • Published Jul 23 • 23
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models Paper • 2405.15738 • Published May 24 • 43
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models Paper • 2405.01535 • Published May 2 • 114
LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report Paper • 2405.00732 • Published Apr 29 • 118
Introducing Idefics2: A Powerful 8B Vision-Language Model for the community Article • Published Apr 15 • 160
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs Paper • 2404.05719 • Published Apr 8 • 62
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction Paper • 2404.02905 • Published Apr 3 • 63
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models Paper • 2404.02258 • Published Apr 2 • 103
Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? Paper • 2404.03411 • Published Apr 4 • 8
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens Paper • 2404.03413 • Published Apr 4 • 25
Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order Paper • 2404.00399 • Published Mar 30 • 40
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training Paper • 2403.09611 • Published Mar 14 • 123
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring Paper • 2403.09333 • Published Mar 14 • 14
BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences Paper • 2403.09347 • Published Mar 14 • 20
StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control Paper • 2403.09055 • Published Mar 14 • 24
Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset Paper • 2403.09029 • Published Mar 14 • 54
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter Paper • 2402.10896 • Published Feb 16 • 14
Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings Paper • 2403.07750 • Published Mar 12 • 21
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM Paper • 2403.07816 • Published Mar 12 • 39
MoAI: Mixture of All Intelligence for Large Language and Vision Models Paper • 2403.07508 • Published Mar 12 • 75
VideoMamba: State Space Model for Efficient Video Understanding Paper • 2403.06977 • Published Mar 11 • 27
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models Paper • 2403.06764 • Published Mar 11 • 25
DeepSeek-VL: Towards Real-World Vision-Language Understanding Paper • 2403.05525 • Published Mar 8 • 39
LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error Paper • 2403.04746 • Published Mar 7 • 22
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation Paper • 2403.04692 • Published Mar 7 • 40
Design2Code: How Far Are We From Automating Front-End Engineering? Paper • 2403.03163 • Published Mar 5 • 93
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models Paper • 2403.03003 • Published Mar 5 • 9
Wukong: Towards a Scaling Law for Large-Scale Recommendation Paper • 2403.02545 • Published Mar 4 • 15
Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use Paper • 2403.02626 • Published Mar 5 • 9
Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters Paper • 2403.02677 • Published Mar 5 • 16
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis Paper • 2403.03206 • Published Mar 5 • 56
Sora Reference Papers Collection The reference papers listed at the end of OpenAI's "Video generation models as world simulators" technical report, 32 in total. The two OpenAI papers, ImageGPT and DALL-E 3, are missing; their links have been added to the note. • 32 items • Updated Feb 18 • 53
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution Paper • 2307.06304 • Published Jul 12, 2023 • 26
TextBind: Multi-turn Interleaved Multimodal Instruction-following Paper • 2309.08637 • Published Sep 14, 2023 • 7
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages Paper • 2309.09400 • Published Sep 17, 2023 • 82
Semantic-SAM: Segment and Recognize Anything at Any Granularity Paper • 2307.04767 • Published Jul 10, 2023 • 20