Multimodal AI Newsletter
July 2024 Edition
The Tech Pulse
Gen-3 Alpha from Runway: New Video Generation Model
Gen-3 Alpha is the first of an upcoming series of models trained by Runway on a new infrastructure built for large-scale multimodal training. It is a major improvement in fidelity, consistency, and motion over Gen-2, and a step towards building General World Models.
Explore More: Blog
Text to Sound Effects Model from ElevenLabs
It's simple: just describe the sound you have in mind and the Text to Sound Effects model will generate a few samples to choose from. You can then upscale the one you like most, or keep generating until you find the sound effect that works for you.
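For readers who want to script this, here is a rough sketch of what a programmatic call might look like. The endpoint path and request fields below are assumptions on our part; consult the current ElevenLabs API reference for the authoritative interface.

# Rough sketch of generating a sound effect over HTTP.
# The endpoint path and JSON fields are assumptions -- check the
# ElevenLabs API docs before relying on them.
import requests

API_KEY = "YOUR_ELEVENLABS_API_KEY"  # placeholder

resp = requests.post(
    "https://api.elevenlabs.io/v1/sound-generation",  # assumed endpoint
    headers={"xi-api-key": API_KEY},
    json={
        "text": "glass shattering on a concrete floor, close-up, no reverb",
        "duration_seconds": 3.0,  # assumed optional parameter
    },
    timeout=60,
)
resp.raise_for_status()
with open("sound_effect.mp3", "wb") as f:
    f.write(resp.content)  # response body is the generated audio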
Explore More: Website
Release of Stable Diffusion 3 Medium from StabilityAI
SD3 Medium is a 2 billion parameter AI model that generates high-quality, photorealistic images with improved handling of hands, faces, and text. It excels in understanding complex prompts, runs efficiently on consumer GPUs, and is suitable for fine-tuning on small datasets.
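If you want to try it locally, here is a minimal sketch using Hugging Face diffusers, assuming a recent diffusers release that ships StableDiffusion3Pipeline and that you have accepted the license for the gated stabilityai/stable-diffusion-3-medium-diffusers checkpoint.

# Minimal sketch: text-to-image with SD3 Medium via diffusers.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

image = pipe(
    prompt="a photorealistic portrait of a violinist, hands clearly visible",
    negative_prompt="blurry, distorted hands",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("sd3_medium_sample.png")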
Explore More: Blog
Release of the Long-Awaited Florence-2 Model from Microsoft
Florence-2 is a versatile vision foundation model that uses text prompts to perform various vision and vision-language tasks. It employs a sequence-to-sequence architecture and leverages a massive dataset to excel in multi-task learning, demonstrating strong performance in both zero-shot and fine-tuned scenarios.
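The prompt-as-task interface is easy to try. Below is a minimal sketch based on the model card, assuming the microsoft/Florence-2-large checkpoint loaded via transformers with trust_remote_code enabled; task tokens such as "<CAPTION>" and "<OD>" select the vision task.

# Minimal sketch: captioning an image with Florence-2.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to("cuda")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg")  # placeholder: any local RGB image
task = "<CAPTION>"                 # e.g. "<OD>" for detection, "<OCR>" for text

inputs = processor(text=task, images=image, return_tensors="pt").to("cuda")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=128,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
print(processor.post_process_generation(raw, task=task, image_size=image.size))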
Explore More: HuggingFace | Paper | Colab Notebook
What's Hot in Research?
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
4M-21 outperforms multimodal models such as 4M and UnifiedIO by training on tens of diverse modalities, expanding to roughly 3x more tasks and modalities without loss of performance and enabling fine-grained, controllable generation.
Explore More: Website | Paper
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Cambrian-1 is a new family of vision-centric multimodal language models. It focuses on improving visual components through better encoders, a new connector design, high-quality instruction data, refined tuning strategies, and comprehensive benchmarking.
Explore More: Website | Paper
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
GAMA, a General-purpose Large Audio-Language Model, integrates a language model with advanced audio representations and is fine-tuned for audio understanding and complex reasoning. Evaluations show GAMA outperforms existing models on audio tasks by significant margins.
Explore More: Website | Paper
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
MuirBench is a comprehensive benchmark for evaluating multi-image understanding capabilities of multimodal LLMs. It includes 12 diverse tasks across 10 categories of multi-image relations, with 11,264 images and 2,600 questions.
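For a sense of how one might iterate over the benchmark, here is an illustrative sketch using the Hugging Face datasets library; the dataset id, split name, and schema are guesses on our part, so check the MuirBench website for the official release location before using it.

# Illustrative sketch of loading a multi-image benchmark with datasets.
# The dataset id and split below are assumptions, not confirmed identifiers.
from datasets import load_dataset

ds = load_dataset("MUIRBENCH/MUIRBENCH", split="test")  # assumed id/split
example = ds[0]
# A multi-image VQA item typically bundles several images with one question
# and its answer options; inspect the keys rather than assuming field names.
print(example.keys())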
Explore More: Website | Paper
Boost Your Knowledge Arsenal
Step-by-Step Diffusion: An Elementary Tutorial
A 101 tutorial on diffusion models, which are widely used in text-to-image generation, and on flow matching for machine learning. It presents the key concepts and algorithms while keeping the math to a minimum.
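To make the core idea concrete, here is a small illustrative sketch (not code from the tutorial) of the DDPM-style forward noising step and the corresponding ancestral reverse step, in the standard notation where x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps.

# Toy sketch of DDPM-style diffusion: the forward process adds Gaussian
# noise, and the reverse step denoises one step at a time given a
# noise prediction (which a trained UNet would normally supply).
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # abar_t

def forward_noise(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps                   # eps is the regression target for training

def reverse_step(xt, t, eps_hat, rng):
    """One ancestral sampling step of p(x_{t-1} | x_t) given predicted noise."""
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (xt - coef * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean
    return mean + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)

# Training would fit a network eps_theta(x_t, t) to predict eps; here we
# just exercise the two updates on a dummy "image".
rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))
xt, eps = forward_noise(x0, t=500, rng=rng)
x_prev = reverse_step(xt, t=500, eps_hat=eps, rng=rng)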
Resources: Paper
Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4
The talk covers how to build large multimodal models like GPT-4, with discussions on data, instruction tuning, architecture, parameter-efficient fine-tuning, and evaluation.
Resources: YouTube
UNet Diffusion Model in Pure CUDA
The repository implements a UNet diffusion model from scratch in CUDA. It is a fun exercise for those who want to understand how CUDA kernels are written for diffusion models.
Resources: GitHub
Recap Multimodal Community (June 2024)
Paper Reading Session - June 21st
The session was led by Surya Guthikonda, where he discussed the chapters on Introduction, Visual Understanding, Visual Generation, and Unified Vision Models from the paper "Multimodal Foundation Models: From Specialists to General-Purpose Assistants."
Resources: Slides | Paper
Paper Reading Session - June 28th
The session was led by Henry Vo, where he discussed the chapters on Large MultiModal Models: Training with LLMs, Multimodal Agents: Chaining with Tools, and Conclusion and Research Trends from the paper "Multimodal Foundation Models: From Specialists to General-Purpose Assistants."
Resources: Recording | Slides | Paper
Disclaimer
This newsletter highlights notable recent developments in multimodal AI but is not exhaustive; we may well have missed some exceptional work.