🐍 M4V: Multi-Modal Mamba for Text-to-Video Generation

1Meituan, 2University of Technology Sydney, 3Beijing Jiaotong University

Contact: jianchenghuang@smail.nju.edu.cn

Abstract

Text-to-video generation has significantly enriched content creation and holds the potential to evolve into powerful world simulators. However, modeling the vast spatiotemporal space remains computationally demanding, particularly when employing Diffusion Transformers (DiTs), whose quadratic complexity in sequence processing limits their practical applications. Recent advances in linear-time sequence modeling, particularly the Mamba architecture, offer a more efficient alternative to Transformers. Nevertheless, its plain design constrains its direct application to multi-modal and spatiotemporal video generation tasks. To address these challenges, we introduce M4V, a Multi-Modal Mamba framework for text-to-video generation. Specifically, we propose a multi-modal diffusion Mamba (MM-DiM) block that enables seamless multi-modal integration and spatiotemporal modeling within an autoregressive generation pipeline. Additionally, to mitigate error accumulation in long-context autoregressive processes, we introduce a reward learning strategy that enhances per-frame consistency and quality. As a result, the Mamba blocks in M4V reduce FLOPs by 45% compared to the attention alternative when generating 121-frame videos at 768×1280 resolution. Extensive experiments on text-to-video benchmarks demonstrate M4V's ability to produce high-quality videos while significantly lowering computational costs. Code and models will be publicly available.
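To make the scaling contrast concrete, below is a minimal back-of-the-envelope sketch (Python) comparing the sequence-mixing cost of self-attention (quadratic in token count) with a Mamba-style selective scan (linear in token count) for a 121-frame 768×1280 video. The latent compression, patch size, hidden width, and state size are illustrative assumptions, not M4V's actual configuration, and the sketch ignores MLP and projection cost, so it does not reproduce the reported 45% end-to-end figure.

# Back-of-the-envelope comparison of per-layer sequence-mixing FLOPs for
# self-attention vs. a Mamba-style selective scan. All shape choices below
# (latent compression, patch size, hidden width, state size) are illustrative
# assumptions, not M4V's actual configuration.

def attention_flops(seq_len: int, dim: int) -> float:
    # QK^T and attention-weighted V matmuls: ~2 matmuls * 2 * L^2 * d FLOPs.
    return 4.0 * seq_len**2 * dim

def mamba_flops(seq_len: int, dim: int, state_dim: int = 16) -> float:
    # Selective scan is roughly linear in L: ~const * L * d * state_dim FLOPs.
    return 6.0 * seq_len * dim * state_dim

if __name__ == "__main__":
    frames, height, width = 121, 768, 1280
    # Assumed 4x temporal / 8x spatial latent compression plus 2x2 patchify,
    # as is typical of video DiT pipelines; the real tokenizer may differ.
    tokens = (frames // 4) * (height // 8 // 2) * (width // 8 // 2)
    dim = 2048  # assumed hidden width
    a, m = attention_flops(tokens, dim), mamba_flops(tokens, dim)
    print(f"tokens per video: {tokens:,}")
    print(f"attention mixing FLOPs/layer: {a:.3e}")
    print(f"mamba mixing FLOPs/layer:     {m:.3e}  ({a / m:.1f}x fewer)")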

Architecture


(a) Overview of the generation architecture. (b) Detailed structure of our MM-DiM block. α, β, and γ are introduced by projecting the timestep condition; we omit the projection for simplicity. (c) Illustration of MM-Token Re-Composition.
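The caption notes that α, β, and γ come from projecting the timestep condition but omits the projection itself. As a rough illustration only, here is a minimal PyTorch sketch of the standard adaLN-style recipe used in DiT-like blocks; the class name, dimensions, and exactly how M4V applies the three terms inside MM-DiM are assumptions rather than the paper's implementation.

import torch
import torch.nn as nn

class TimestepModulation(nn.Module):
    # Projects a timestep embedding into (alpha, beta, gamma), adaLN-style.
    # Hypothetical sketch: M4V's actual wiring inside MM-DiM may differ.
    def __init__(self, cond_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 3 * hidden_dim))

    def forward(self, t_emb: torch.Tensor):
        # t_emb: (batch, cond_dim) -> three (batch, 1, hidden_dim) terms that
        # broadcast over the token sequence.
        alpha, beta, gamma = self.proj(t_emb).chunk(3, dim=-1)
        return alpha.unsqueeze(1), beta.unsqueeze(1), gamma.unsqueeze(1)

def modulated_block(x, norm: nn.LayerNorm, inner: nn.Module, alpha, beta, gamma):
    # Scale/shift the normalized tokens, run the inner mixer (e.g. a Mamba
    # block), then gate the residual branch with alpha.
    return x + alpha * inner(norm(x) * (1 + gamma) + beta)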

More Results


A sleek and modern editorial fashion photograph, full body, showcases a young Japanese ...


a young Japanese woman standing waiting for a train outside station ...


A young person, wearing a cozy gray hoodie and black-rimmed glasses, sits in a dimly lit room, intensely focused on a video game. The glow from the TV screen illuminates their face, highlighting their ...

A determined individual in a sleek, black athletic outfit jogs along a winding forest trail, surrounded by towering trees and dappled sunlight filtering through the leaves. Their rhythmic strides ...


A futuristic cityscape at dusk, with flying cars zipping between towering skyscrapers adorned with neon lights


a young woman wearing a purple sweater running in a park during autumn

In a charming Parisian café, a panda sits at a quaint wooden table, sipping coffee from a delicate porcelain cup. The panda, wearing a stylish beret and a striped scarf, embodies a whimsical blend of ...


In the bustling heart of Amsterdam, a young man in a black beanie and jacket str ...


In the bustling heart of NYC's Times Square, a life-sized teddy bear, dressed in a tiny leather jacket and sunglasses, sits behind a gleaming drum kit. The bear's furry paws expertly strike the drums ...

In the neon-lit streets of Cyberpunk Beijing, a colossal robot towers over the cityscape, its sleek metallic frame adorned with glowing blue and red lights. The robot's design is a fusion of futuristi ...


In the video, a group of school children is seen walking together, with some of them using their smartphones. Among them, a young boy and girl are walking side by side, both looking at their phones ...


A stylish woman walks down the streets of Tokyo, surrounded by warm neon lights and vibrant city signs. She wears a black leather jacket, a red long skirt, black boots, and carries a black purse. She ...

BibTeX

@article{2025M4V,
  author = {Jiancheng Huang and Gengwei Zhang and others},
  title  = {M4V: Multi-Modal Mamba for Text-to-Video Generation},
  year   = {2025},
}