Precise camera control for reshooting dynamic videos is bottlenecked by the severe scarcity of paired multi-view data for non-rigid scenes. We overcome this limitation with a highly scalable self-supervised framework that leverages internet-scale monocular videos. Our core contribution is the generation of pseudo multi-view training triplets, each consisting of a source video, a geometric anchor, and a target video. We achieve this by extracting two distinct smooth random-walk crop trajectories from a single input video to serve as the source and target views. The anchor is synthetically generated by forward-warping the first frame of the source with a dense tracking field, which effectively simulates the distorted point-cloud inputs expected at inference. Because our independent cropping strategy introduces spatial misalignment and artificial occlusions, the model cannot simply copy information from the current source frame. Instead, it is forced to implicitly learn 4D spatiotemporal structure by actively routing and re-projecting missing high-fidelity textures across distinct times and viewpoints in the source video to reconstruct the target. At inference, our minimally adapted diffusion transformer uses a 4D point-cloud-derived anchor to achieve state-of-the-art temporal consistency, robust camera control, and high-fidelity novel view synthesis on complex dynamic scenes.
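To make the crop-sampling step concrete, the following sketch draws a smoothed 2D random walk and clamps it to valid crop positions. It is a rough illustration rather than the paper's code; the crop size, step scale, and smoothing length are assumed hyperparameters.

```python
# Illustrative sketch (not the authors' exact code): sampling a smooth
# random-walk crop trajectory from a monocular video.
import numpy as np


def sample_crop_trajectory(num_frames, frame_h, frame_w, crop_size=384,
                           step_sigma=4.0, smooth_window=9, seed=None):
    """Return per-frame top-left crop corners following a smoothed random walk."""
    rng = np.random.default_rng(seed)
    # Raw random-walk steps in pixels, accumulated into a path.
    steps = rng.normal(scale=step_sigma, size=(num_frames, 2))
    path = np.cumsum(steps, axis=0)
    # Smooth the path with a moving average to avoid jitter.
    kernel = np.ones(smooth_window) / smooth_window
    path = np.stack([np.convolve(path[:, d], kernel, mode="same")
                     for d in range(2)], axis=1)
    # Start from a random valid corner and clamp to the frame bounds.
    start = rng.uniform([0, 0], [frame_h - crop_size, frame_w - crop_size])
    corners = np.clip(start + path,
                      [0, 0],
                      [frame_h - crop_size, frame_w - crop_size])
    return corners.astype(int)  # (num_frames, 2) = (top, left) per frame


# Two independent trajectories yield the source and target crop streams.
src_corners = sample_crop_trajectory(49, 720, 1280, seed=0)
tgt_corners = sample_crop_trajectory(49, 720, 1280, seed=1)
```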
Each row shows the original video (left) alongside our reshooting result (right) from a different camera trajectory.
Given one input video, we reshoot it from two different anchor camera trajectories simultaneously.
Panel labels: Input Video; Anchor Camera and Reshooting Result for each of the two trajectories.
Figure 2. Overview of our conditioning architecture. (1) VAE Encoding: The anchor video (Va) and source video (Vs) are independently encoded into latents. (2) Conditioning Setup: The anchor latent is combined with a noise latent; the source latent is paired with an all-ones mask so that all of its content is leveraged. (3) DiT Processing: The two conditioned streams are temporally concatenated and jointly processed through the DiT blocks via self-attention, enabling fine-grained content routing. (4) Source Token Management: An auxiliary reconstruction loss on the output source tokens ensures high-fidelity content retention.
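The layout in Figure 2 can be summarized with a few tensor operations. The sketch below uses toy latent shapes and a single generic transformer layer standing in for the pretrained DiT; every size and module name here is an assumption for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny illustrative latent sizes; the real model operates on VAE latents of
# full videos and a pretrained DiT, not a single TransformerEncoderLayer.
B, T, C, H, W = 1, 5, 16, 8, 8
anchor_lat = torch.randn(B, T, C, H, W)   # VAE latent of the anchor video Va
source_lat = torch.randn(B, T, C, H, W)   # VAE latent of the source video Vs

# (2) Conditioning setup: the anchor stream mixes noise (to be denoised) with
# the anchor latent; the source stream carries the clean source latent plus an
# all-ones mask marking every source token as trusted content.
noise = torch.randn_like(anchor_lat)
anchor_stream = torch.cat([noise, anchor_lat], dim=2)
source_stream = torch.cat([source_lat, torch.ones_like(source_lat)], dim=2)

def to_tokens(x):
    # (B, T, C, H, W) -> (B, T*H*W, C): one token per latent pixel.
    b, t, c, h, w = x.shape
    return x.permute(0, 1, 3, 4, 2).reshape(b, t * h * w, c)

# (3) DiT processing: temporally concatenate the two token streams so joint
# self-attention can route content from source tokens to target tokens.
tokens = torch.cat([to_tokens(anchor_stream), to_tokens(source_stream)], dim=1)
block = nn.TransformerEncoderLayer(d_model=2 * C, nhead=8, batch_first=True)
out = block(tokens)

# (4) Source token management: an auxiliary reconstruction loss on the output
# source tokens encourages high-fidelity retention of source content.
n_anchor = T * H * W
source_out = out[:, n_anchor:, :C]
aux_loss = F.mse_loss(source_out, to_tokens(source_lat))
```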
Our training setup forces the model to learn 4D spatiotemporal structure from purely 2D data. Because the source and target crops are spatially misaligned and introduce artificial occlusions, the model must search across both space and time in the source video to fill in missing target details; 4D reconstruction emerges as a learned behavior.
Our entire pipeline relies only on a robust 2D dense tracker (AllTracker). It is domain-agnostic and scales to any monocular video, including photorealistic footage, animation, and generative art.
Figure 3. Pseudo multi-view triplet generation. Two independent smooth random-walk crop trajectories are sampled from a single input video. The anchor is synthesized by forward-warping the source's first frame using a combined dense tracking + crop offset flow field.
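To make the warping step concrete, here is a minimal sketch of anchor synthesis by forward splatting. The dense tracks would come from a 2D tracker such as AllTracker; here they are assumed to be given as an array, and the function name and signature are hypothetical.

```python
# Illustrative sketch of anchor synthesis via forward splatting.
import numpy as np


def forward_warp(first_frame, tracks, crop_offset, out_h, out_w):
    """Splat pixels of the source's first crop to their tracked positions.

    first_frame: (H, W, 3) source crop at t=0
    tracks:      (H, W, 2) per-pixel (x, y) positions of those pixels at the
                 target time, in full-frame coordinates (dense tracking field)
    crop_offset: (2,) (x, y) offset of the target crop, subtracted so warped
                 pixels land in the target crop's coordinate frame
    """
    anchor = np.zeros((out_h, out_w, 3), dtype=first_frame.dtype)  # black anchor background
    xy = tracks - crop_offset                      # combined tracking + crop-offset flow
    xs = np.round(xy[..., 0]).astype(int)
    ys = np.round(xy[..., 1]).astype(int)
    valid = (xs >= 0) & (xs < out_w) & (ys >= 0) & (ys < out_h)
    anchor[ys[valid], xs[valid]] = first_frame[valid]  # nearest-neighbour splat
    return anchor
```

Splatting only the source's first frame leaves holes and stretched regions wherever tracked points move or become occluded, which mimics the distorted point-cloud renderings the model receives as anchors at inference.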
Our approach consistently outperforms existing methods in dynamic video reshooting, achieving leading performance across video quality, temporal consistency, camera accuracy, and view synchronization metrics.
Metric groups: VBench quality (Aesthetic, Imaging), temporal consistency (Flickering, Smoothness, Subject, Background, CLIP-F), camera accuracy (RotErr, TransErr, Mat.Pix), view synchronization (FVD-V, CLIP-V).

| Method | Aesthetic ↑ | Imaging ↑ | Flickering ↑ | Smoothness ↑ | Subject ↑ | Background ↑ | CLIP-F ↑ | RotErr ↓ | TransErr ↓ | Mat.Pix ↑ | FVD-V ↓ | CLIP-V ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TrajectoryCrafter (49 frames) | 52.69 | 59.67 | 96.97 | 99.03 | 93.78 | 95.13 | 98.80 | 2.26 | 3.03 | 1851.80 | 582.56 | 92.40 |
| Ours (49 frames) | 52.72 | 57.81 | 97.43 | 99.24 | 95.09 | 95.62 | 99.01 | 2.61 | 2.73 | 2737.65 | 488.22 | 94.96 |
| ReCamMaster | 48.71 | 52.61 | 97.57 | 99.26 | 88.57 | 90.65 | 98.49 | 11.29 | 19.59 | 1314.00 | 732.52 | 88.91 |
| Ex4D | 49.72 | 55.76 | 97.46 | 99.08 | 91.51 | 94.78 | 98.94 | 3.94 | 4.21 | 2188.98 | 685.63 | 89.77 |
| Ours | 52.85 | 58.64 | 97.37 | 99.21 | 93.43 | 95.24 | 99.03 | 2.76 | 4.23 | 2720.83 | 586.24 | 93.16 |
Bold + underline = best in group. ↑ higher is better, ↓ lower is better.
We ablate core architectural and training choices. Our baseline uses a black anchor background, token concatenation through self-attention, and monocular videos without augmentations.
Figure 4. Qualitative ablation on a complex scene with moving smoke and colored lighting. The Synthetic Data Only model fails to capture dynamic smoke. The Cross-Attention model loses fine source details (saxophone texture) and struggles with anchor alignment. Our full model faithfully tracks the anchor while preserving high-fidelity textures and complex dynamics.
| Method | RotErr ↓ | TransErr ↓ | Mat.Pix ↑ | FVD-V ↓ | CLIP-V ↑ |
|---|---|---|---|---|---|
| Baseline (w/ self-attn) | 3.27 | 4.92 | 2636 | 595.39 | 92.94 |
| w/o Source Video | 4.82 | 11.65 | 226 | 662.09 | 89.97 |
| + Gaussian Noise in Latent | 2.81 | 5.05 | 2586 | 605.93 | 91.70 |
| + 3D Noise in Anchor | 2.49 | 4.95 | 2624 | 598.67 | 92.88 |
| w/ Cross-Attention | 3.53 | 4.31 | 1766 | 626.37 | 91.64 |
| + Auxiliary Loss | 2.76 | 4.98 | 2627 | 562.94 | 92.93 |
| + LoRA | 2.85 | 4.17 | 2615 | 578.91 | 92.98 |
| + Random Query | 3.36 | 4.52 | 2618 | 571.74 | 92.83 |
| + Fluorescent Background Anchor | 3.16 | 4.78 | 2627 | 571.06 | 92.68 |
| w/ Synthetic Data (Syn) | 3.70 | 5.04 | 1746 | 608.03 | 91.86 |
| w/ Monocular Videos (Ours) | 2.76 | 4.23 | 2720 | 586.24 | 93.16 |
| Ours + Syn | 3.36 | 3.66 | 2577 | 587.91 | 92.64 |
Bold + underline = best per column. ↑ higher is better, ↓ lower is better.
@article{paliwal2026reshootanything,
title = {Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting},
author = {Paliwal, Avinash and Iyer, Adithya and Yadav, Shivin and Afridi, Muhammad Ali and Harikumar, Midhun},
journal = {arXiv preprint arXiv:2604.21776},
year = {2026},
url = {https://arxiv.org/abs/2604.21776}
}
We'd like to thank the members of Morphic for compute resources and motivation, and Dharmesh Kakadia and Isaac Wang for reviewing the paper.