Scaling behavior (Text-to-Image). As model size increases (290M → 420M → 625M → 1B parameters), the performance gap between our method and REPA widens. Our method effectively leverages increased compute, while REPA shows diminishing returns.
Multi-modal experiments. (a) We train a single model on three modalities with different weightings to control the trade-off between them. Self-Flow provides consistent improvements (shaded area) across all settings. Axes are inverted so that larger area indicates better performance. (b) Success rates for joint Video-Action prediction. Early on (30k), Self-Flow outperforms flow matching (FM) across all tasks and achieves success in all task categories, whereas FM fails entirely on Open and Place tasks. Later (100k), performance on single-object tasks (Pick Coke Can, Open/Close Drawer) converges, while Self-Flow maintains a significant advantage on complex multi-object and sequential tasks (Move Near, Open and Place).
All models use a 625M-parameter backbone trained on 20M images. We benchmark against vanilla flow matching and leading approaches for improving diffusion representations. For external alignment methods, we compare against REPA with DINOv2 and REPA with SigLIP2. For methods without external models, we compare against SRA. Our method achieves superior visual quality and prompt adherence across diverse prompts.
"a parked blue Vespa scooter on a gravel path beside a tranquil body of water, with tall grasses, captured in a realistic photographic style."
"a professionally portrait of a woman with medium-length dark hair with bangs, a large white decorative bow with silver embellishments on the side of her head."
"a close-up, high-resolution photograph of a wheel and brake system, featuring a polished silver alloy wheel with a blue brake."
"a close-up of a monkey with distinctive facial markings, partially closed eyes, and thick fur, set against a blurred green forest background."
"a tractor, photographed in a vibrant red color, parked on a dirt path beside a grassy field under a partly cloudy sky, highlighting its vintage agricultural design."
"a trapper hat, featuring a red and black plaid pattern with a thick fur around the brim and ear flaps, displayed on a mannequin head against a light background."
"an elephant standing on a dirt path in a natural, forested area with its trunk extended downward, captured in high-definition photography."
"a softly lit portrait of a man with short hair and a beard, wearing a dark shirt with a plaid collar, against a light background with blurred circular elements."
"a stylized, cartoon figurine of a Shiba Inu dog, depicted in a friendly pose while holding a bowl of food, standing on a wooden base."
"a hand-painted, wooden fish-shaped decoration with horizontal white stripes, resting on a white shelf near a window."
"a bento box featuring a compartment of white rice decorated with nori into a cartoonish face, by another compartment containing grilled meat and vegetables."
"a donkey in a grassy, fenced enclosure stands close to the metal fence. The scene is captured in natural daylight."
"a close-up, well-lit portrait photograph of a man with light brown hair and a beard, wearing a blue and red plaid shirt, against a plain, light-colored background."
"a close-up, soft-lit photograph of a pug lying on its back on a red textured surface, with its large, open eyes looking directly at the camera."
"a midcentury-style terracotta sculpture of a hippopotamus, intricately detailed, seated with a decorative headdress against a plain white background."
"a close-up of the front of a car, highlighting its glossy black paint, chrome accents,, set against a snowy outdoor background."
"a donkey standing inside a stable, wearing a blue halter and a red quilted protective blanket over its back, with wooden walls visible in the background."
"a young boy tenderly hugging a white stuffed dog toy while standing indoors against a plain wall."
"a detailed, classical-style sculpted bust of a man in a serious expression, wearing a toga, encased in an oval frame and mounted on a textured, earthy-toned wall."
"a person wearing a black face mask with a white pattern, standing against a green and white gradient background, captured in a softly lit, balanced composition."
"a pair of pink sneakers, worn by someone standing on a concrete surface in an urban environment, close-up, low-angle photograph."
"a professionally lit portrait photograph of a woman with brown hair wearing a trench coat over a white shirt captured outdoors in natural light."
"a highly detailed and colorful traditional Japanese Beckoning Cat, depicted in a stylized manner with intricate patterns and vibrant colors."
"a photograph of a small, white church with blue trim and a prominent blue cross on its facade, featuring a tall steeple and situated on an elevated plot."
"a photograph of plush toys—a white teddy bear and a green rabbit—arranged together against a pink, patterned background."
"two white reindeer inside a wire-fenced enclosure, one facing the camera and the other turned to the side; the background includes a wooden structure."
"a photograph taken from a low angle of an, highlighting its sleek, modern design with metallic railings and overhead structural elements under bright lighting."
"a black cat in a blue jacket with red and white accents, wearing a tag, held by a visible arm, looking upwards with its mouth slightly open against a blue door."
"a man posing indoors with a shallow depth of field, wearing a dark blue shirt and sporting a friendly smile."
All models use a 625M-parameter backbone trained on just 6M videos. We compare against vanilla flow matching, REPA with DINOv2 for external alignment, and SRA as the leading method without external models. Interestingly, DINOv2 remains the strongest external encoder for video generation, outperforming video-specific encoders such as V-JEPA 2 and advanced spatial learners such as Depth Anything 3 (see Sec. 4). Our method outperforms all baselines.
"a parrot preens its brilliant feathers, blue and gold colors catching light, meticulous grooming."
"bicycle roll onto a wet concrete path, creating a dynamic splash as it passes through a large puddle. The lower half of a person's legs, clad in blue jeans, pedal the green bicycle."
"a first-person camera view looking down at vibrant green and black skis, launches over a snow-covered drop revealing a vast, sunlit mountain range."
"a young woman with brown hair, a white cap, denim short overalls, and white sneakers, dances in a hip-hop style, behind her a painted brick wall with a swirling yellow and green spiral."
"an elderly man with deep wrinkles and a neatly trimmed grey beard speaks thoughtfully to the camera. His weathered face fills the frame as his lips form each word deliberately, occasionally pausing to gather his thoughts, warm golden lamplight casting gentle shadows across his features."
"a green sea turtle glides effortlessly through a colorful coral reef, filmed from a low angle. The serene creature moves with graceful flipper strokes, ancient eyes observing the vibrant marine life, shell patterns catching the filtered sunlight in the shallow tropical water."
"a 3D animated orange tries to avoid being picked, hiding behind leaves, nervous expression."
"a snake slowly uncoils from its resting position, massive length gradually revealed. The powerful constrictor moves with undulating grace, muscular body flowing over itself as it extends."
"a massive rhinoceros, seen from a low perspective, slowly walks across a vast, dry, light-colored plain. The powerful animal moves deliberately at a savanna landscape under a blue sky."
"a cartoon 3D fox trots through fallen leaves, bushy tail swishing behind. The orange character continues on its woodland path."
"a sharp chef's knife slices through a ripe red tomato on a wooden cutting board, the blade moving in smooth, confident strokes. Juice pools around, the knife work demonstrating years of practiced technique."
"a cape bull stands its ground, facing the camera with notorious aggression in its posture. The powerful animal's horns form a deadly boss, one of Africa's most dangerous creatures ready to charge."
"a cheese expert cuts into a perfectly aged wheel of Parmigiano-Reggiano."
"an underwater view captures a large blue shark, starting with a medium shot of its head and front body, as it gracefully turns and begins swimming away through the deep, deep blue ocean water. The camera smoothly tracks the shark as it performs a swift turn."
"a bald man with a trimmed beard, wearing a blue hooded sweatshirt, directs an intense and determined gaze forward, his face illuminated by soft light. His eyes are wide and mouth slightly open, conveying a focused, mid-sentence expression of direct communication."
"zoom in on the legs of a figure skater completing a jump combination, blades touching down. The landing is secure, momentum carrying into the next element."
"a powerful male lion walks confidently across the savanna, captured from a low angle that conveys his regal bearing. The tawny predator moves with controlled power, mane flowing slightly in the breeze, amber eyes scanning the horizon as he patrols his territory under golden evening light."
"a silverback gorilla walks through dense jungle, seen from a low, respectful distance. The massive ape moves with quiet power, silver-gray back distinctive, intelligent eyes scanning the forest."
"a 3D cartoon of an owl turns its head around while perched on a tree branch at dusk, worried expression, large yellow eyes blinking slowly."
All models use a 625M-parameter backbone trained on the FMA music dataset. We compare against vanilla flow matching, REPA with MERT for external alignment, and SRA as the leading method without external models. Consistent with our findings on video, external alignment with MERT provides no benefit over vanilla flow matching for audio generation (see Sec. 4), demonstrating that external alignment fails to generalize beyond image-centric tasks. Our method achieves superior results without relying on any external representations.
"The audio features a rhythmic and energetic electronic music track. It is characterized by a consistent, driving beat, featuring a prominent kick drum and crisp hi-hats that establish a strong rhythm. Layered over this percussive foundation are distinct synth elements that contribute to the track's dynamic and somewhat raw electronic feel, creating an engaging and forward-moving atmosphere."
"The audio features energetic electronic music with a driving beat, a prominent bassline, and a high-pitched synth melody, layered with scratching sound effects. A male voice utters the phrase "In the kitchen" once, and then frequently repeats the vocal sound "Ah ah ah" throughout the track, integrating with the rhythmic structure of the music, with the "Ah ah ah" repeating multiple times."
"The audio features upbeat, energetic electronic music characterized by a steady, driving house beat and a repetitive, rhythmic synth melody. The track has a danceable quality, with the synth creating a consistent, engaging groove throughout the clip. The mood is lively and rhythmic, suitable for a club or workout setting."
"The audio features a rhythmic and percussive soundscape, dominated by a distinct drum machine beat. It prominently features a steady, crisp hi-hat pattern, accompanied by synthetic drum hits that resemble a snare or tom, creating a driving, electronic rhythm. The overall mood is energetic and mechanical."
"The audio features the gentle, tinkling melody of a music box, playing a delicate and somewhat wistful tune. The distinct, resonant sounds of the music box mechanism are clearly audible, creating a calm and reflective atmosphere."
"The audio features an energetic rock music track with a driving beat, prominent electric guitar riffs, and a strong bassline. A male vocalist sings in English with a powerful, slightly gritty tone, delivering the lyrics: "out of my mind, on in my head, from your room, and into your bed." The song maintains a high energy level throughout, driven by a steady drum rhythm, creating an intense and passionate mood."
"The audio features an energetic electronic dance music track. It begins with a driving, rhythmic beat and a looping synthesizer melody, accompanied by various percussive electronic sounds. A prominent 'whoosh' sound builds up, increasing anticipation, before dropping into a more intense and full-bodied section at around the 0:16 mark. This drop introduces a stronger bassline and a more elaborate synth melody, maintaining a danceable and upbeat tempo throughout the instrumental piece, which continues with its pulsating rhythm and evolving electronic textures."
"The audio features instrumental electronic music with a distinct, somewhat melancholic and groovy feel. A steady electronic drum beat forms the rhythmic backbone, accompanied by a consistent bassline. The prominent melodic element is carried by a synthesized sound that closely resembles an accordion or bandoneon, playing a repeating, slightly soulful melody. The overall mood is chill and contemplative, with no spoken words or additional sound effects present throughout the clip."
Looking ahead, by bridging representation learning and generative modeling, our approach offers a path toward world models that harness the scalability and perceptual grounding of visual generative models without sacrificing the semantic abstraction required for planning and understanding.
We present results obtained by fine-tuning our video-weighted multi-modal runs (with 675M parameters) for action prediction, and we evaluate the predicted actions in the SIMPLER simulator.
Action: "Move Near"
Action: "Close Bottom Drawer"
Action: "Pick Standing Coke Can"
Action: "Place Apple In Closed Top Drawer"