Reshoot-Anything
CVPR 2026 Workshop

A Self-Supervised Model for In-the-Wild Video Reshooting

Avinash Paliwal* · Adithya Iyer* · Shivin Yadav · Muhammad Ali Afridi · Midhun Harikumar
* Equal contribution
[Video pair: Input video · Reshot, new camera path]
Abstract
We present a self-supervised framework for video reshooting that trains entirely from in-the-wild monocular videos. Existing approaches either rely on synthetic data, which breaks on real footage, or stitch together pre-trained 4D reconstructions whose errors propagate. Our key idea is to manufacture multi-view supervision from a single clip: two crop trajectories produce a source and target view of the same dynamic scene, and a forward-warped anchor exposes the disocclusions a new camera path would create. Forced to route textures across space and time, the model learns implicit 4D structure with no 3D supervision, achieving state-of-the-art camera control and temporal consistency on complex dynamic scenes.
Inference

From a single video, end to end.

The pipeline at inference: a monocular input drives a 4D point-cloud anchor rendered along a new camera trajectory, and the model refines that anchor into a clean reshoot while preserving the original motion and fine detail.

[Videos: Input Video · 4D Point-Cloud Anchor · Reshoot Result]
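
As a rough illustration of how a point-cloud anchor of this kind can be rendered, the sketch below unprojects one frame using a depth map, moves the points into a new camera, and reprojects them with a simple z-buffer. The pinhole intrinsics `K`, the relative pose `(R, t)`, the per-pixel `depth`, and the function name `render_anchor` are all assumptions made for illustration; the page does not state how depth or camera parameters are obtained.

```python
import numpy as np

def render_anchor(rgb, depth, K, R, t):
    """Reproject one RGB frame into a new camera (hypothetical helper).

    rgb:   (H, W, 3) image
    depth: (H, W) positive depth per pixel (assumed to be available)
    K:     (3, 3) pinhole intrinsics with last row [0, 0, 1] (assumed)
    R, t:  rotation (3, 3) and translation (3,) from the old to the new camera
    Returns the anchor image and a boolean mask of pixels that received a point.
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T.astype(float)

    # Unproject pixels to 3D points in the original camera frame.
    pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    # Move the points into the new camera frame.
    pts_new = R @ pts + np.asarray(t, dtype=float).reshape(3, 1)

    z = pts_new[2]
    front = z > 1e-6                      # keep points in front of the camera
    proj = K @ pts_new[:, front]
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)
    zf = z[front]
    colors = rgb.reshape(-1, rgb.shape[-1])[front]

    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, zf, colors = u[inside], v[inside], zf[inside], colors[inside]

    anchor = np.zeros_like(rgb)           # unfilled pixels stay black
    mask = np.zeros((H, W), dtype=bool)
    # Draw far-to-near so the nearest point wins (simple, slow z-buffer).
    for i in np.argsort(-zf):
        anchor[v[i], u[i]] = colors[i]
        mask[v[i], u[i]] = True
    return anchor, mask
```

Pixels where `mask` is false are the holes that a refinement model would have to fill. A practical renderer would vectorize the z-buffer and splat points over more than one pixel.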
Method

We manufacture multi-view supervision from a single video.

Two independent crop trajectories from one monocular clip form a source and a target view. A forward-warped anchor exposes the disocclusions a new camera path would create, and a minimally adapted diffusion transformer routes textures across space and time to reconstruct the target.

Training data construction from a single monocular video

Our training data pipeline. From a single monocular video we sample two independent smooth random-walk crop trajectories: one becomes the source view, the other becomes the target view of the same dynamic scene. Because the trajectories disagree spatially, the source frame at any given time cannot be copied directly to produce the target. The model must instead route textures from other source frames where the missing regions were visible, learning to reconstruct one view of a dynamic scene from a different one.
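
A minimal sketch of the crop-pair idea described above, assuming the video is a NumPy array of shape `(T, H, W, 3)`. The function names, the Gaussian step size, and the moving-average smoothing are illustrative choices, not details taken from the paper.

```python
import numpy as np

def smooth_random_walk(num_frames, max_offset, step_std=4.0, smooth=9, rng=None):
    """Return (num_frames, 2) integer (x, y) crop offsets within [0, max_offset].

    `smooth` is the moving-average window and is assumed to be odd.
    """
    rng = np.random.default_rng() if rng is None else rng
    max_offset = np.asarray(max_offset, dtype=float)

    # Random walk: cumulative sum of Gaussian steps.
    walk = np.cumsum(rng.normal(0.0, step_std, size=(num_frames, 2)), axis=0)

    # Moving-average smoothing, edge-padded so the length is preserved.
    pad = smooth // 2
    padded = np.pad(walk, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(smooth) / smooth
    walk = np.stack(
        [np.convolve(padded[:, i], kernel, mode="valid") for i in range(2)], axis=1
    )

    # Shrink the walk if needed so it fits the available margin,
    # then place it at a random position inside that margin.
    walk -= walk.min(axis=0)
    span = np.maximum(walk.max(axis=0), 1e-8)
    walk *= np.minimum(1.0, max_offset / span)
    walk += rng.uniform(size=2) * (max_offset - walk.max(axis=0))
    return np.clip(np.round(walk), 0, max_offset).astype(int)

def crop_along(video, offsets, crop_hw):
    """Crop each frame at its own (x, y) offset."""
    ch, cw = crop_hw
    return np.stack([f[y:y + ch, x:x + cw] for f, (x, y) in zip(video, offsets)])

def make_pair(video, crop_hw, seed=0):
    """Sample two independent crop trajectories: one source, one target."""
    rng = np.random.default_rng(seed)
    T, H, W = video.shape[:3]
    ch, cw = crop_hw
    max_offset = (W - cw, H - ch)
    src_off = smooth_random_walk(T, max_offset, rng=rng)
    tgt_off = smooth_random_walk(T, max_offset, rng=rng)
    return (crop_along(video, src_off, crop_hw),
            crop_along(video, tgt_off, crop_hw), src_off, tgt_off)
```

For example, `make_pair(video, (32, 48))` on a `(16, 64, 96, 3)` clip returns two `(16, 32, 48, 3)` crops whose windows drift independently, so the target generally shows regions the source does not show at the same timestep.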

Pseudo multi-view triplets on real footage

The same pipeline on real footage. Each row shows the source frames, the target frames, and the synthesized anchor for one training triplet. The black regions in the anchor are the disocclusions a new camera path would expose; the model is trained to fill them by routing textures from later source frames where the geometry was still visible. This is the supervisory signal that forces 4D structure to emerge from purely 2D data.
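
To make the black disocclusion regions concrete, here is a hedged sketch of nearest-pixel forward warping. It assumes a per-pixel displacement field `flow` from a source frame into the target view, of the kind a dense 2D tracker could supply; the exact anchor construction used by the authors is not described on this page, and `forward_warp` is a hypothetical name.

```python
import numpy as np

def forward_warp(frame, flow):
    """Splat source pixels into the target view (hypothetical helper).

    frame: (H, W, 3) source image
    flow:  (H, W, 2) displacement (dx, dy) of each source pixel, in pixels
    Returns the warped anchor and a boolean validity mask.
    Target pixels that no source pixel lands on stay black: these are the
    disocclusions. Collisions are resolved arbitrarily in this sketch; a real
    pipeline would need a visibility or depth ordering.
    """
    H, W = frame.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    tx = np.round(xs + flow[..., 0]).astype(int)
    ty = np.round(ys + flow[..., 1]).astype(int)

    # Discard pixels that leave the target frame.
    inside = (tx >= 0) & (tx < W) & (ty >= 0) & (ty < H)

    anchor = np.zeros_like(frame)
    mask = np.zeros((H, W), dtype=bool)
    anchor[ty[inside], tx[inside]] = frame[inside]
    mask[ty[inside], tx[inside]] = True
    return anchor, mask
```

With a uniform shift of one pixel to the right, the leftmost column of the anchor receives nothing and stays black, which is the simplest case of a hole the model must fill from other source frames.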

Implicit 4D Learning

Because the source and target crops are spatially misaligned and share occlusions, the model must search across both space and time in the source video to fill in missing target details. 4D structure emerges as a consequence, with no 3D supervision.

Self-Supervised Training

The training pipeline relies only on a robust 2D dense tracker (AllTracker), so it is domain-agnostic and scales to any monocular video, including photorealistic footage, animation, and generative art.

Conditioning architecture overview

Our conditioning architecture.
(1) VAE Encoding: the anchor video (Va) and source video (Vs) are independently encoded into latents (za, zs).
(2) Conditioning Setup: the anchor latent pairs with a noise latent zn and downsampled mask Ma; the source latent duplicates itself in place of noise and uses an all-ones mask Ms, so any source content is usable.
(3) DiT Processing: both conditioned streams are patchified, temporally concatenated, and routed through the pre-trained DiT via self-attention, which lets the model do fine-grained content routing without architecture changes.
(4) Source Token Management: an auxiliary reconstruction loss on the output source tokens preserves high-fidelity texture through refinement.
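
Steps (2) and (3) can be sketched with plain arrays. This is a shape-level illustration only: the channel order (noise, condition, mask), the latent layout `(T, C, H, W)`, the patch size, and all function names are assumptions, and the DiT itself is omitted.

```python
import numpy as np

def build_streams(z_a, z_s, mask_a, rng):
    """Assemble the two conditioned streams (assumed channel layout).

    z_a, z_s: (T, C, H, W) anchor and source latents
    mask_a:   (T, 1, H, W) downsampled anchor validity mask
    """
    z_n = rng.standard_normal(z_a.shape)            # noise latent for the target
    anchor_stream = np.concatenate([z_n, z_a, mask_a], axis=1)
    mask_s = np.ones_like(mask_a)                   # all-ones source mask
    # The source latent takes the place of noise, so it appears twice.
    source_stream = np.concatenate([z_s, z_s, mask_s], axis=1)
    return anchor_stream, source_stream

def patchify(x, p):
    """(T, C, H, W) -> (T, N, D) with N = (H/p)*(W/p) and D = C*p*p."""
    T, C, H, W = x.shape
    x = x.reshape(T, C, H // p, p, W // p, p).transpose(0, 2, 4, 1, 3, 5)
    return x.reshape(T, (H // p) * (W // p), C * p * p)

def joint_tokens(anchor_stream, source_stream, p=2):
    """Concatenate both streams along time and flatten to one token sequence."""
    a = patchify(anchor_stream, p)
    s = patchify(source_stream, p)
    tokens = np.concatenate([a, s], axis=0)         # (2T, N, D)
    num_anchor_tokens = a.shape[0] * a.shape[1]
    return tokens.reshape(-1, tokens.shape[-1]), num_anchor_tokens
```

Because both streams end up in one sequence, ordinary self-attention lets every target token attend to every source token. The first `num_anchor_tokens` entries belong to the anchor stream and the rest to the source stream.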

Qualitative Results

Across people, vehicles, and lighting.

The same model rendering new camera trajectories on scenes it never trained on. Same motion, same fine detail, a path the camera never actually took.

[Videos: Input · Ours]
Comparisons

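Ahead of the baselines on most metrics.

Across VBench quality, temporal consistency, camera accuracy, and view synchronization, our method leads TrajectoryCrafter, ReCamMaster, and Ex4D on most metrics on complex dynamic scenes. The exceptions are TrajectoryCrafter on Imag and RotErr, ReCamMaster on Flick and Smooth, and Ex4D, narrowly, on TransErr.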

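Column groups: VBench Quality (Aesth, Imag, Flick, Smooth, Subj, Bg) · Temporal (CLIP-F) · Camera Accuracy (RotErr, TransErr) · View Synchronization (Mat.Pix, FVD-V, CLIP-V).

| Method | Aesth ↑ | Imag ↑ | Flick ↑ | Smooth ↑ | Subj ↑ | Bg ↑ | CLIP-F ↑ | RotErr ↓ | TransErr ↓ | Mat.Pix ↑ | FVD-V ↓ | CLIP-V ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TrajectoryCrafter (49 frames) | 52.69 | 59.67* | 96.97 | 99.03 | 93.78 | 95.13 | 98.80 | 2.26* | 3.03 | 1851.80 | 582.56 | 92.40 |
| Ours (49 frames) | 52.72* | 57.81 | 97.43* | 99.24* | 95.09* | 95.62* | 99.01* | 2.61 | 2.73* | 2737.65* | 488.22* | 94.96* |
| ReCamMaster | 48.71 | 52.61 | 97.57* | 99.26* | 88.57 | 90.65 | 98.49 | 11.29 | 19.59 | 1314.00 | 732.52 | 88.91 |
| Ex4D | 49.72 | 55.76 | 97.46 | 99.08 | 91.51 | 94.78 | 98.94 | 3.94 | 4.21* | 2188.98 | 685.63 | 89.77 |
| Ours | 52.85* | 58.64* | 97.37 | 99.21 | 93.43* | 95.24* | 99.03* | 2.76* | 4.23 | 2720.83* | 586.24* | 93.16* |

* = best in group (the two 49-frame rows form one group, the remaining three rows the other). ↑ higher is better, ↓ lower is better.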
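
The "best in group" comparison can be recomputed from the reported numbers. The short script below is only a sanity check: the values are transcribed from the table, the split into two groups follows the "(49 frames)" labels, and the group names and function names are made up for illustration.

```python
METRICS = ["Aesth", "Imag", "Flick", "Smooth", "Subj", "Bg", "CLIP-F",
           "RotErr", "TransErr", "Mat.Pix", "FVD-V", "CLIP-V"]
LOWER_IS_BETTER = {"RotErr", "TransErr", "FVD-V"}

# Values transcribed from the results table; group names are illustrative.
GROUPS = {
    "49 frames": {
        "TrajectoryCrafter": [52.69, 59.67, 96.97, 99.03, 93.78, 95.13, 98.80,
                              2.26, 3.03, 1851.80, 582.56, 92.40],
        "Ours": [52.72, 57.81, 97.43, 99.24, 95.09, 95.62, 99.01,
                 2.61, 2.73, 2737.65, 488.22, 94.96],
    },
    "second group": {
        "ReCamMaster": [48.71, 52.61, 97.57, 99.26, 88.57, 90.65, 98.49,
                        11.29, 19.59, 1314.00, 732.52, 88.91],
        "Ex4D": [49.72, 55.76, 97.46, 99.08, 91.51, 94.78, 98.94,
                 3.94, 4.21, 2188.98, 685.63, 89.77],
        "Ours": [52.85, 58.64, 97.37, 99.21, 93.43, 95.24, 99.03,
                 2.76, 4.23, 2720.83, 586.24, 93.16],
    },
}

def best_per_metric(group):
    """Map each metric to the method with the best value in this group."""
    best = {}
    for i, metric in enumerate(METRICS):
        pick = min if metric in LOWER_IS_BETTER else max
        best[metric] = pick(group, key=lambda name: group[name][i])
    return best

def wins(group, method="Ours"):
    """Count the metrics on which `method` is best in its group."""
    return sum(1 for m in best_per_metric(group).values() if m == method)

if __name__ == "__main__":
    for name, group in GROUPS.items():
        print(name, wins(group), "of", len(METRICS))
```

Under this transcription, "Ours" is best on 10 of 12 metrics in the 49-frame group and on 9 of 12 in the other group.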

Ablations

What matters for dynamic reshooting.

We ablate core architectural and training choices. Our baseline uses a black anchor background, token concatenation through self-attention, and monocular videos without augmentations.

Qualitative ablation comparisons

Qualitative ablation on a complex scene with moving smoke and colored lighting. The Synthetic Data Only model fails to capture dynamic smoke, snapping to rigid plausible content instead. The Cross-Attention variant loses fine source detail (notice the saxophone texture) and struggles to align with the anchor when geometry deviates. Our full model faithfully tracks the anchor's new camera path while preserving high-fidelity textures and the complex dynamics of the original footage.

BibTeX
@InProceedings{paliwal2026reshoot,
    author    = {Paliwal, Avinash and Iyer, Adithya and Yadav, Shivin and Afridi, Muhammad and Harikumar, Midhun},
    title     = {Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2026},
    pages     = {11596-11606}
}
Acknowledgements

We thank Jaynti Kanani and the rest of the Morphic team for compute resources and support, and Dharmesh Kakadia and Isaac Wang for reviewing the paper.
