Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting

Avinash Paliwal  ·  Adithya Iyer  ·  Shivin Yadav  ·  Muhammad Ali Afridi  ·  Midhun Harikumar

Morphic

TL;DR  Video-to-video models are difficult to train due to the scarcity of paired multi-perspective video data depicting the same action. We solve this by proposing a self-supervised method that trains a Video-Reshooting model entirely from in-the-wild monocular videos, achieving state-of-the-art temporal consistency and camera control.

Abstract

Precise camera control for reshooting dynamic videos is bottlenecked by the severe scarcity of paired multi-view data for non-rigid scenes. We overcome this limitation with a highly scalable self-supervised framework capable of leveraging internet-scale monocular videos. Our core contribution is the generation of pseudo multi-view training triplets, consisting of a source video, a geometric anchor, and a target video. We achieve this by extracting two distinct smooth random-walk crop trajectories from a single input video to serve as the source and target views. The anchor is synthetically generated by forward-warping the first frame of the source with a dense tracking field, which effectively simulates the distorted point-cloud inputs expected at inference. Because our independent cropping strategy introduces spatial misalignment and artificial occlusions, the model cannot simply copy information from the current source frame. Instead, it is forced to implicitly learn 4D spatiotemporal structures by actively routing and re-projecting missing high-fidelity textures across distinct times and viewpoints from the source video to reconstruct the target. At inference, our minimally adapted diffusion transformer utilizes a 4D point-cloud-derived anchor to achieve state-of-the-art temporal consistency, robust camera control, and high-fidelity novel view synthesis on complex dynamic scenes.

Training Data Construction

From a single monocular video we extract two independent crop trajectories (source & target) and synthetically generate a geometric anchor by forward-warping the first source frame with a dense tracking field.
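
For concreteness, the sketch below shows one way the two smooth random-walk crop trajectories could be sampled; the function names, smoothing kernel, and crop size are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def random_walk_crop_trajectory(num_frames, frame_h, frame_w, crop_size,
                                step_sigma=2.0, smooth=9, seed=None):
    """Sample a smooth random-walk trajectory of crop-window top-left corners."""
    rng = np.random.default_rng(seed)
    # Accumulate Gaussian steps into a 2D random walk, then smooth with a box filter.
    steps = rng.normal(0.0, step_sigma, size=(num_frames, 2))
    path = np.cumsum(steps, axis=0)
    kernel = np.ones(smooth) / smooth
    path = np.stack([np.convolve(path[:, i], kernel, mode="same") for i in range(2)], axis=1)
    # Shift to a random start position and keep the window inside the frame.
    start = rng.uniform([0.0, 0.0], [frame_h - crop_size, frame_w - crop_size])
    path = np.clip(path + start, 0, [frame_h - crop_size, frame_w - crop_size])
    return path.astype(int)  # (num_frames, 2): top-left (y, x) per frame

def extract_crops(video, trajectory, crop_size):
    """Cut a crop_size x crop_size window from each frame along a trajectory."""
    return np.stack([frame[y:y + crop_size, x:x + crop_size]
                     for frame, (y, x) in zip(video, trajectory)])

# Two independently sampled trajectories turn one clip into pseudo source/target views:
# src = extract_crops(video, random_walk_crop_trajectory(T, H, W, 384, seed=0), 384)
# tgt = extract_crops(video, random_walk_crop_trajectory(T, H, W, 384, seed=1), 384)
```

Because the two trajectories drift independently, the source and target crops only partially overlap, which produces the spatial misalignment and artificial occlusions that drive learning.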

Self-supervised training data construction — teaser figure

Figure 1. Our self-supervised training data pipeline. Two independent smooth random-walk crop trajectories are extracted from a single input video to form source & target pseudo-views. The anchor is generated via dense-tracking-guided forward-warping, simulating the point-cloud anchors used at inference.

Pipeline in Action

Example 1: Input Video → 4D Point Cloud Anchor → Reshooting Result

Example 2: Input Video → 4D Anchor (cam 78) → Reshooting Result

Qualitative Results

Each row shows the original video (left) alongside our reshooting result (right) from a different camera trajectory.

Five video pairs: Original | Ours (reshooting)

Inference Example

Given a single input video, we reshoot it from two different anchor camera trajectories simultaneously.

Input Video · Anchor Camera 1 → Reshooting Result · Anchor Camera 2 → Reshooting Result

Method Overview

Model architecture overview

Figure 2. Overview of our conditioning architecture. (1) VAE Encoding: The anchor video (Va) and source video (Vs) are independently encoded into latents. (2) Conditioning Setup: The anchor latent is combined with a noise latent; the source latent uses an all-ones mask ensuring all content is leveraged. (3) DiT Processing: The two conditioned streams are temporally concatenated and jointly processed through the DiT blocks via self-attention, enabling fine-grained content routing. (4) Source Token Management: An auxiliary reconstruction loss on output source tokens ensures high-fidelity content retention.
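
The conditioning flow in Figure 2 can be summarized with the schematic below. The tensor layout, the way noise is combined with the anchor latent, and the mask channel are assumptions made for illustration; only the overall structure (noise-combined anchor latent, all-ones mask on the source latent, temporal concatenation into one DiT sequence, auxiliary reconstruction loss on source tokens) follows the figure.

```python
import torch
import torch.nn.functional as F

def build_dit_input(z_anchor, z_source, noise):
    """Assemble the two conditioned streams of Figure 2 (schematic).

    z_anchor, z_source : (B, c, T, h, w) VAE latents of the anchor and source videos.
    noise              : (B, c, T, h, w) noise latent matching the anchor latent.
    Returns a (B, c+1, 2T, h, w) tensor: [latent | mask] per stream, concatenated in time.
    """
    # Anchor stream: anchor latent combined with noise; a zero mask marks it as
    # the content to be denoised into the reshot target.
    anchor_stream = torch.cat([z_anchor + noise, torch.zeros_like(z_anchor[:, :1])], dim=1)
    # Source stream: clean source latent with an all-ones mask, so every source
    # token stays available for content routing via self-attention.
    source_stream = torch.cat([z_source, torch.ones_like(z_source[:, :1])], dim=1)
    # Temporal concatenation: the DiT blocks jointly attend over both streams.
    return torch.cat([anchor_stream, source_stream], dim=2)

def auxiliary_source_loss(dit_output, z_source):
    """Auxiliary reconstruction loss on the output source tokens (step 4 in Figure 2)."""
    out_source = dit_output[:, :, -z_source.shape[2]:]   # last T frames = source tokens
    return F.mse_loss(out_source, z_source)
```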

Implicit 4D Learning

Our training setup forces the model to learn 4D spatiotemporal structure from purely 2D data. Because the source and target crops are spatially misaligned and contain artificial occlusions, the model must search across both space and time in the source video to fill in missing target details; this search behavior amounts to a learned 4D reconstruction.

Self-Supervised Training

Our entire pipeline relies only on a robust 2D dense tracker (AllTracker). It is therefore domain-agnostic and scales to any monocular video, including photorealistic footage, animation, and generative art.

Training Triplet Generation

Video triplet generation pipeline

Figure 3. Pseudo multi-view triplet generation. Two independent smooth random-walk crop trajectories are sampled from a single input video. The anchor is synthesized by forward-warping the source's first frame using a combined dense tracking + crop offset flow field.
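
Below is a minimal sketch of the anchor synthesis in Figure 3, assuming dense tracks of the first source frame are already available (e.g. from a tracker such as AllTracker); the nearest-neighbor splatting and the handling of crop offsets are simplifying assumptions, not the released pipeline.

```python
import numpy as np

def forward_warp_anchor(first_frame, tracks, crop_offsets):
    """Forward-warp the first source frame along dense tracks to build the anchor.

    first_frame  : (H, W, 3) first frame of the source crop.
    tracks       : (T, H, W, 2) per-frame (x, y) position of every first-frame pixel,
                   as produced by a dense 2D tracker.
    crop_offsets : (T, 2) target-minus-source crop offsets (x, y) per frame.
    Returns an anchor video (T, H, W, 3); unmapped pixels stay black, mimicking the
    disocclusion holes of the point-cloud anchors used at inference.
    """
    T, H, W, _ = tracks.shape
    anchor = np.zeros((T, H, W, 3), dtype=first_frame.dtype)
    ys, xs = np.mgrid[0:H, 0:W]
    for t in range(T):
        # Combined flow: dense-track displacement plus the crop-offset shift.
        xt = np.round(tracks[t, ..., 0] + crop_offsets[t, 0]).astype(int)
        yt = np.round(tracks[t, ..., 1] + crop_offsets[t, 1]).astype(int)
        valid = (xt >= 0) & (xt < W) & (yt >= 0) & (yt < H)
        # Splat source pixels to their tracked target locations (nearest neighbor).
        anchor[t, yt[valid], xt[valid]] = first_frame[ys[valid], xs[valid]]
    return anchor
```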

Comparisons with State-of-the-Art

Our approach consistently outperforms existing methods in dynamic video reshooting, achieving leading performance across video quality, temporal consistency, camera accuracy, and view synchronization metrics.

| Method | Aesthetic ↑ | Imaging ↑ | Flickering ↑ | Smoothness ↑ | Subject ↑ | Background ↑ | CLIP-F ↑ | RotErr ↓ | TransErr ↓ | Mat.Pix ↑ | FVD-V ↓ | CLIP-V ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TrajectoryCrafter (49 frames) | 52.69 | 59.67 | 96.97 | 99.03 | 93.78 | 95.13 | 98.80 | 2.26 | 3.03 | 1851.80 | 582.56 | 92.40 |
| Ours (49 frames) | 52.72 | 57.81 | 97.43 | 99.24 | 95.09 | 95.62 | 99.01 | 2.61 | 2.73 | 2737.65 | 488.22 | 94.96 |
| ReCamMaster | 48.71 | 52.61 | 97.57 | 99.26 | 88.57 | 90.65 | 98.49 | 11.29 | 19.59 | 1314.00 | 732.52 | 88.91 |
| Ex4D | 49.72 | 55.76 | 97.46 | 99.08 | 91.51 | 94.78 | 98.94 | 3.94 | 4.21 | 2188.98 | 685.63 | 89.77 |
| Ours | 52.85 | 58.64 | 97.37 | 99.21 | 93.43 | 95.24 | 99.03 | 2.76 | 4.23 | 2720.83 | 586.24 | 93.16 |

Column groups: VBench Quality (Aesthetic, Imaging) · Temporal (Flickering, Smoothness, Subject, Background, CLIP-F) · Camera Accuracy (RotErr, TransErr) · View Synchronization (Mat.Pix, FVD-V, CLIP-V).

Bold + underline = best in group. ↑ higher is better, ↓ lower is better.
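
For reference, the sketch below shows one common way camera-accuracy metrics such as RotErr and TransErr are computed from estimated versus reference camera poses (geodesic rotation angle and translation distance); the exact evaluation protocol, pose alignment, and units used in the paper may differ.

```python
import numpy as np

def pose_errors(R_est, t_est, R_gt, t_gt):
    """Per-sequence camera pose errors, one common formulation.

    R_est, R_gt : (N, 3, 3) rotation matrices.
    t_est, t_gt : (N, 3) camera translations.
    Returns mean rotation error in degrees and mean translation error.
    """
    # Geodesic angle between rotations: angle of R_est^T @ R_gt per frame.
    rel = np.einsum("nij,nik->njk", R_est, R_gt)
    trace = np.trace(rel, axis1=1, axis2=2)
    cos = np.clip((trace - 1.0) / 2.0, -1.0, 1.0)
    rot_err = np.degrees(np.arccos(cos)).mean()
    # Euclidean distance between camera positions.
    trans_err = np.linalg.norm(t_est - t_gt, axis=1).mean()
    return rot_err, trans_err
```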

Ablation Studies

We ablate core architectural and training choices. Our baseline uses a black anchor background, token concatenation through self-attention, and monocular videos without augmentations.

Qualitative ablation comparisons

Figure 4. Qualitative ablation on a complex scene with moving smoke and colored lighting. The Synthetic Data Only model fails to capture dynamic smoke. The Cross-Attention model loses fine source details (saxophone texture) and struggles with anchor alignment. Our full model faithfully tracks the anchor while preserving high-fidelity textures and complex dynamics.

| Method | RotErr ↓ | TransErr ↓ | Mat.Pix ↑ | FVD-V ↓ | CLIP-V ↑ |
| --- | --- | --- | --- | --- | --- |
| Baseline (w/ self-attn) | 3.27 | 4.92 | 2636 | 595.39 | 92.94 |
| — Source Video | 4.82 | 11.65 | 226 | 662.09 | 89.97 |
| + Gaussian Noise in Latent | 2.81 | 5.05 | 2586 | 605.93 | 91.70 |
| + 3D Noise in Anchor | 2.49 | 4.95 | 2624 | 598.67 | 92.88 |
| w/ Cross-Attention | 3.53 | 4.31 | 1766 | 626.37 | 91.64 |
| + Auxiliary Loss | 2.76 | 4.98 | 2627 | 562.94 | 92.93 |
| + LoRA | 2.85 | 4.17 | 2615 | 578.91 | 92.98 |
| + Random Query | 3.36 | 4.52 | 2618 | 571.74 | 92.83 |
| + Fluorescent Background Anchor | 3.16 | 4.78 | 2627 | 571.06 | 92.68 |
| w/ Synthetic Data (Syn) | 3.70 | 5.04 | 1746 | 608.03 | 91.86 |
| w/ Monocular Videos (Ours) | 2.76 | 4.23 | 2720 | 586.24 | 93.16 |
| Ours + Syn | 3.36 | 3.66 | 2577 | 587.91 | 92.64 |

Bold + underline = best per column. ↑ higher is better, ↓ lower is better.

BibTeX

@article{paliwal2026reshootanything,
  title   = {Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting},
  author  = {Paliwal, Avinash and Iyer, Adithya and Yadav, Shivin and Afridi, Muhammad Ali and Harikumar, Midhun},
  journal = {arXiv preprint arXiv:2604.21776},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.21776}
}

Acknowledgements

We'd like to thank the members of Morphic for compute resources and motivation, and Dharmesh Kakadia and Isaac Wang for reviewing the paper.