TL;DR: We present Self-Flow, a self-supervised flow matching framework. Across (a) image, (b) video, and (c) audio generation, Self-Flow consistently outperforms REPA by jointly modeling representation and generation, without using any external models or supervision.

Figure: FID, FVD, and FAD comparisons for (a) Text-to-Image, (b) Text-to-Video, and (c) Text-to-Audio.

Abstract

Strong semantic representations improve the convergence and generation quality of diffusion and flow models. Existing approaches largely rely on external models, which require separate training, operate on misaligned objectives, and exhibit unexpected scaling behavior. We argue that this dependence arises from the model's training objective, which poses a denoising task with little incentive to learn semantic representations. We introduce Self-Flow: a self-supervised flow matching paradigm that integrates representation learning within the generative framework. Our key mechanism, Dual-Timestep Scheduling, applies heterogeneous noise levels across tokens, creating an information asymmetry that forces the model to infer missing information from corrupted inputs. This drives the model to learn strong representations alongside its generative capabilities, without external supervision. Our method generalizes across modalities and enables multi-modal training while following expected scaling laws, achieving superior image, video, and audio generation.
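For context, the vanilla flow matching objective that the abstract refers to regresses the model onto the velocity of a straight interpolation path between data and noise. The sketch below is our own minimal illustration (the `model` callable and function name are hypothetical, not the paper's API):

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0, t):
    """Vanilla conditional flow matching step (a sketch).

    x0: clean data, shape (B, ...)
    t:  timesteps in [0, 1], shape (B,)
    model: any callable taking (x_t, t) and returning a velocity prediction.
    """
    noise = torch.randn_like(x0)
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))   # broadcast t over data dims
    x_t = (1 - t_) * x0 + t_ * noise           # linear interpolation path
    v_target = noise - x0                      # constant velocity of that path
    v_pred = model(x_t, t)
    return F.mse_loss(v_pred, v_target)
```

Note that the target depends only on the sampled noise and the clean input, which is the "denoising task with little incentive to learn semantic representations" the abstract argues against.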

Joint Multi-Modal Training Comparison

We benchmark our method against vanilla flow matching in a scaled multi-modal experiment where a single 4B parameter FLUX.2 backbone is trained to jointly generate images, videos, and audio. The results presented were obtained with just 100k steps of high-resolution fine-tuning on top of a low-resolution multi-modal model, using only 6M training videos and 200M training images. Our method produces significant improvements in structural coherence (faces, hands), motion quality, and text rendering accuracy.

Image Samples

4B parameter multi-modal model trained on 200M images for 100k high-resolution fine-tuning steps.

Video Samples

4B parameter multi-modal model trained on 6M videos for 100k high-resolution fine-tuning steps.

Joint Video-Audio Samples

4B parameter multi-modal model trained on 2M audio-video pairs for 100k high-resolution fine-tuning steps. Click play to watch baseline and ours sequentially.


Method Overview

Method architecture
Illustration of our method. Given a clean input x0, we draw two timesteps t, s, and a random mask M, then noise each token according to its assigned timestep. The teacher input is noised with τmin = min{t, s}, creating an information asymmetry relative to the student. The student is trained to simultaneously denoise its mixed-noise input and reconstruct the teacher's features.
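The per-token noising described above can be sketched as follows. This is a minimal illustration under our own naming (function and variable names are hypothetical, not the paper's code); it produces the student's mixed-noise view and the teacher's less corrupted view at τmin:

```python
import torch

def dual_timestep_noise(x0, t, s, mask):
    """Sketch of Dual-Timestep Scheduling (names are ours, not the paper's).

    x0:   clean tokens, shape (B, N, D)
    t, s: two sampled timesteps in [0, 1], shape (B,)
    mask: boolean token mask M, shape (B, N); True -> token uses t, False -> s

    Returns the student's mixed-noise view and the teacher's view noised
    at tau_min = min(t, s), i.e. the more informative input.
    """
    noise = torch.randn_like(x0)
    tau_student = torch.where(mask, t[:, None], s[:, None])[..., None]  # (B, N, 1)
    tau_teacher = torch.minimum(t, s)[:, None, None]                    # (B, 1, 1)
    x_student = (1 - tau_student) * x0 + tau_student * noise
    x_teacher = (1 - tau_teacher) * x0 + tau_teacher * noise
    return x_student, x_teacher
```

Because the teacher always receives the cleaner of the two noise levels, the student must infer the missing information from its more corrupted tokens, which is the information asymmetry the method relies on.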
Scaling vs training steps
Scaling vs FLOPs

Scaling behavior (Text-to-Image). As model size increases (290M → 420M → 625M → 1B parameters), the performance gap between our method and REPA widens. Our method effectively leverages increased compute, while REPA shows diminishing returns.

(a) Multi-Modal Training
(b) Joint Video-Action Training

Multi-modal experiments. (a) We train a single model on three modalities with different weightings to control the trade-off between them. Self-Flow provides consistent improvements (shaded area) across all settings. Axes are inverted so that larger area indicates better performance. (b) Success rates for joint Video-Action prediction. Early on (30k), Self-Flow outperforms flow matching (FM) across all tasks and achieves success in all task categories, whereas FM fails entirely on Open and Place tasks. Later (100k), performance on single-object tasks (Pick Coke Can, Open/Close Drawer) converges, while Self-Flow maintains a significant advantage on complex multi-object and sequential tasks (Move Near, Open and Place).


Image Generation Comparison

All models use a 625M parameter backbone trained on 20M images. We benchmark against vanilla flow matching and leading approaches for improving diffusion representations. For external alignment methods, we compare against REPA with DINOv2 and REPA with SigLIP2. For methods without external models, we compare against SRA. Our method achieves superior visual quality and prompt adherence across diverse prompts.


Video Generation Comparison

All models use a 625M parameter backbone trained on just 6M videos. We compare against vanilla flow matching, REPA with DINOv2 for external alignment, and SRA as the leading method without external models. Interestingly, DINOv2 remains the strongest external encoder for video generation, outperforming video-specific encoders such as V-JEPA 2 and advanced spatial learners such as Depth Anything 3 (see Sec. 4). Our method achieves superior results across all baselines.


Audio Generation Comparison

All models use a 625M parameter backbone trained on the FMA music dataset. We compare against vanilla flow matching, REPA with MERT for external alignment, and SRA as the leading method without external models. Consistent with our findings on video, external alignment with MERT provides no benefit over vanilla flow matching on audio generation (see Sec. 4), demonstrating that external alignment fails to generalize beyond image-centric tasks. Our method achieves superior results without relying on any external representations.


Future Work — World Models

Looking ahead, by bridging representation learning and generative modeling, our approach offers a path toward world models that harness the scalability and perceptual grounding of visual generative models without sacrificing the semantic abstraction required for planning and understanding. We present results obtained by fine-tuning our video-weighted multi-modal runs (675M parameters) for action prediction, evaluated in the SIMPLER simulator.