Multimodal AI Newsletter

July 2024 Edition

The Tech Pulse

Gen-3 Alpha from Runway: New Video Generation Model

Gen-3 Alpha is the first of an upcoming series of models trained by Runway on a new infrastructure built for large-scale multimodal training. It is a major improvement in fidelity, consistency, and motion over Gen-2, and a step towards building General World Models.

Explore More: Blog

Text to Sound Effects Model from ElevenLabs

It's simple: just describe the sound you have in mind, and the Text to Sound Effects model will generate a few samples to choose from. You can then upscale the one you like the most, or keep generating until you find the sound effect that works for you.
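
For scripted workflows, here is a minimal sketch of calling the sound-effects generation API over HTTP. The endpoint path, parameter names, and placeholder API key are assumptions based on ElevenLabs' public API and should be verified against the current documentation.

    import requests

    # Hedged sketch: endpoint path, JSON fields, and response format are assumptions;
    # check the ElevenLabs API docs before relying on them.
    API_KEY = "your-elevenlabs-api-key"  # placeholder
    resp = requests.post(
        "https://api.elevenlabs.io/v1/sound-generation",
        headers={"xi-api-key": API_KEY},
        json={"text": "glass shattering on a concrete floor", "duration_seconds": 3.0},
    )
    resp.raise_for_status()
    with open("sound_effect.mp3", "wb") as f:
        f.write(resp.content)  # the response body is the generated audio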

Explore More: Website

Release of Stable Diffusion 3 Medium from StabilityAI

SD3 Medium is a 2 billion parameter AI model that generates high-quality, photorealistic images with improved handling of hands, faces, and text. It excels in understanding complex prompts, runs efficiently on consumer GPUs, and is suitable for fine-tuning on small datasets.
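
For reference, a minimal sketch of generating an image with SD3 Medium via Hugging Face diffusers; the model id, pipeline class, and sampling settings shown are assumptions drawn from the public release and may need adjusting for your hardware.

    import torch
    from diffusers import StableDiffusion3Pipeline

    # Minimal sketch, assuming the diffusers-format SD3 Medium checkpoint and a CUDA GPU.
    pipe = StableDiffusion3Pipeline.from_pretrained(
        "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
    ).to("cuda")

    image = pipe(
        "a photo of an astronaut riding a horse on Mars",
        num_inference_steps=28,
        guidance_scale=7.0,
    ).images[0]
    image.save("sd3_sample.png")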

Explore More: Blog

Release of the Long-Awaited Florence-2 Model from Microsoft

Florence-2 is a versatile vision foundation model that uses text prompts to perform various vision and vision-language tasks. It employs a sequence-to-sequence architecture and leverages a massive dataset to excel in multi-task learning, demonstrating strong performance in both zero-shot and fine-tuned scenarios.
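
As a quick illustration of the prompt-based interface, here is a hedged sketch of running object detection with Florence-2 through Hugging Face transformers, following the usage pattern on the model card; the model id, task token, and post-processing call are assumptions worth double-checking there.

    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    # Minimal sketch, assuming the microsoft/Florence-2-base checkpoint and its remote code.
    model_id = "microsoft/Florence-2-base"
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    image = Image.open("example.jpg")   # placeholder input image
    prompt = "<OD>"                     # object detection; other tasks use other task tokens
    inputs = processor(text=prompt, images=image, return_tensors="pt")

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    result = processor.post_process_generation(raw, task=prompt, image_size=image.size)
    print(result)                       # labels and bounding boxes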

Explore More: HuggingFace | Paper | Colab Notebook

What's Hot in Research?

4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

4M-21 outperforms multimodal models such as 4M and UnifiedIO by training on tens of diverse modalities, handling 3x more tasks and modalities than its predecessors without any loss in performance, and enabling fine-grained, controllable generation.

Explore More: Website | Paper

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

Cambrian-1 is a new family of vision-centric multimodal language models. It focuses on improving visual components through better encoders, a new connector design, high-quality instruction data, refined tuning strategies, and comprehensive benchmarking.

Explore More: Website | Paper

GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

GAMA, a general-purpose large audio-language model, integrates a language model with advanced audio representations and is fine-tuned for audio understanding and complex reasoning. Evaluations show that GAMA outperforms existing models on audio tasks by significant margins.

Explore More: Website | Paper

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

MuirBench is a comprehensive benchmark for evaluating multi-image understanding capabilities of multimodal LLMs. It includes 12 diverse tasks across 10 categories of multi-image relations, with 11,264 images and 2,600 questions.
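
If you want to poke at the data, below is a minimal sketch of loading the benchmark with the Hugging Face datasets library; the dataset id, split name, and field names are assumptions, so confirm them on the MuirBench page.

    from datasets import load_dataset

    # Hedged sketch: dataset id and split are assumptions; check the MuirBench release page.
    ds = load_dataset("MUIRBENCH/MUIRBENCH", split="test")
    print(len(ds))        # the benchmark ships 2,600 questions over 11,264 images
    print(ds[0].keys())   # inspect the schema (question, options, images, ...) before evaluating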

Explore More: Website | Paper

Boost Your Knowledge Arsenal

Step-by-Step Diffusion: An Elementary Tutorial

A 101 tutorial on diffusion models, which are widely used in text-to-image generation, and on flow matching. It presents the key concepts and algorithms while minimizing complex math.
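
To make the core idea concrete before reading the tutorial, here is a small, self-contained sketch (not taken from the paper) of the DDPM-style forward noising step and the noise-prediction objective that most diffusion formulations share; the schedule values and tensor shapes are illustrative assumptions.

    import torch

    # Toy sketch of forward noising: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps.
    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule (illustrative)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product of (1 - beta_t)

    x0 = torch.randn(8, 3, 32, 32)                   # stand-in for a batch of clean images
    t = torch.randint(0, T, (8,))                    # a random timestep per sample
    eps = torch.randn_like(x0)                       # Gaussian noise

    abar_t = alphas_bar[t].view(-1, 1, 1, 1)
    x_t = abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * eps

    # Training then regresses a network eps_theta(x_t, t) onto eps with an MSE loss;
    # sampling reverses the chain one timestep at a time.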

Resources: Paper

Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4

The talk covers how to build large multimodal models like GPT-4, including discussions of data, instruction tuning, architecture, parameter-efficient fine-tuning, and evaluation.

Resources: YouTube

UNet Diffusion Model in Pure CUDA

The repository implements a UNet diffusion model from scratch in CUDA. It can be a fun exercise for anyone who wants to understand how CUDA kernels are written for diffusion models.

Resources: GitHub

Recap Multimodal Community (June 2024)

Paper Reading Session - June 21st

The session was led by Surya Guthikonda, who discussed the chapters on Introduction, Visual Understanding, Visual Generation, and Unified Vision Models from the paper "Multimodal Foundation Models: From Specialists to General-Purpose Assistants."

Resources: Slides | Paper

Paper Reading Session - June 28th

The session was led by Henry Vo, who discussed the chapters on Large Multimodal Models: Training with LLMs; Multimodal Agents: Chaining with Tools; and Conclusion and Research Trends from the paper "Multimodal Foundation Models: From Specialists to General-Purpose Assistants."

Resources: Recording | Slides | Paper

Disclaimer

This newsletter highlights notable recent multimodal AI developments but is not exhaustive; we acknowledge that we may have missed some exceptional work.