Idefics2 🐶 Collection Idefics2-8B is a foundation vision-language model. In this collection, you will find the models, datasets and demo related to its creation. • 11 items • Updated May 6 • 88
LLaVA-NeXT Collection LLaVA-NeXT (also known as LLaVA-1.6) improves upon the 1.5 series by incorporating higher image resolutions and more reasoning/OCR datasets. • 8 items • Updated Jul 19 • 25
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models Paper • 2402.03300 • Published Feb 5 • 67
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization Paper • 2402.03161 • Published Feb 5 • 14
Canonical models Collection This collection lists all the historical (pre-"Hub") canonical model checkpoints, i.e. repos that were not under an org or user namespace • 68 items • Updated Feb 13 • 13
PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models Paper • 2401.05252 • Published Jan 10 • 45
What You See is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs Paper • 2401.02411 • Published Jan 4 • 12
Instruct-Imagen: Image Generation with Multi-modal Instruction Paper • 2401.01952 • Published Jan 3 • 30
Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively Paper • 2401.02955 • Published Jan 5 • 19
GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation Paper • 2401.04092 • Published Jan 8 • 20
MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts Paper • 2401.04081 • Published Jan 8 • 70
Generative Multimodal Models are In-Context Learners Paper • 2312.13286 • Published Dec 20, 2023 • 34
Zero-Shot Metric Depth with a Field-of-View Conditioned Diffusion Model Paper • 2312.13252 • Published Dec 20, 2023 • 27
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions Paper • 2312.08578 • Published Dec 14, 2023 • 16
LLaMA-VID Collection LLaMA-VID checkpoints. Please refer to the project page for more details: https://llama-vid.github.io/ • 11 items • Updated Dec 3, 2023 • 4
HyperDreamer: Hyper-Realistic 3D Content Generation and Editing from a Single Image Paper • 2312.04543 • Published Dec 7, 2023 • 21
AnimateZero: Video Diffusion Models are Zero-Shot Image Animators Paper • 2312.03793 • Published Dec 6, 2023 • 17
Chain of Code: Reasoning with a Language Model-Augmented Code Emulator Paper • 2312.04474 • Published Dec 7, 2023 • 29
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want Paper • 2312.03818 • Published Dec 6, 2023 • 31
Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D priors Paper • 2312.04963 • Published Dec 7, 2023 • 16
DreaMoving: A Human Dance Video Generation Framework based on Diffusion Models Paper • 2312.05107 • Published Dec 8, 2023 • 38
Photorealistic Video Generation with Diffusion Models Paper • 2312.06662 • Published Dec 11, 2023 • 23
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models Paper • 2312.06109 • Published Dec 11, 2023 • 20
Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior Paper • 2312.06655 • Published Dec 11, 2023 • 23
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection Paper • 2311.10122 • Published Nov 16, 2023 • 26
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models Paper • 2311.06607 • Published Nov 11, 2023 • 3
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration Paper • 2311.04257 • Published Nov 7, 2023 • 20
LRM: Large Reconstruction Model for Single Image to 3D Paper • 2311.04400 • Published Nov 8, 2023 • 47
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents Paper • 2311.05437 • Published Nov 9, 2023 • 42
Story-to-Motion: Synthesizing Infinite and Controllable Character Animation from Long Text Paper • 2311.07446 • Published Nov 13, 2023 • 28
Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure Paper • 2311.07590 • Published Nov 9, 2023 • 16
Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster Paper • 2311.08263 • Published Nov 14, 2023 • 15
One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion Paper • 2311.07885 • Published Nov 14, 2023 • 39
Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts Paper • 2309.04354 • Published Sep 8, 2023 • 13
Llama 2: Open Foundation and Fine-Tuned Chat Models Paper • 2307.09288 • Published Jul 18, 2023 • 239
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding Paper • 2307.16449 • Published Jul 31, 2023 • 15
3D Gaussian Splatting for Real-Time Radiance Field Rendering Paper • 2308.04079 • Published Aug 8, 2023 • 166
Nougat: Neural Optical Understanding for Academic Documents Paper • 2308.13418 • Published Aug 25, 2023 • 34
Doppelgangers: Learning to Disambiguate Images of Similar Structures Paper • 2309.02420 • Published Sep 5, 2023 • 9