Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting

Avinash Paliwal  ·  Adithya Iyer  ·  Shivin Yadav  ·  Muhammad Ali Afridi  ·  Midhun Harikumar

Morphic

TL;DR  Video-to-video models are difficult to train due to the scarcity of paired multi-perspective video data depicting the same action. We solve this by proposing a self-supervised method that trains a Video-Reshooting model entirely from in-the-wild monocular videos, achieving state-of-the-art temporal consistency and camera control.

Abstract

Precise camera control for reshooting dynamic videos is bottlenecked by the severe scarcity of paired multi-view data for non-rigid scenes. We overcome this limitation with a highly scalable self-supervised framework capable of leveraging internet-scale monocular videos. Our core contribution is the generation of pseudo multi-view training triplets, consisting of a source video, a geometric anchor, and a target video. We achieve this by extracting two distinct smooth random-walk crop trajectories from a single input video to serve as the source and target views. The anchor is synthetically generated by forward-warping the first frame of the source with a dense tracking field, which effectively simulates the distorted point-cloud inputs expected at inference. Because our independent cropping strategy introduces spatial misalignment and artificial occlusions, the model cannot simply copy information from the current source frame. Instead, it is forced to implicitly learn 4D spatiotemporal structures by actively routing and re-projecting missing high-fidelity textures across distinct times and viewpoints from the source video to reconstruct the target. At inference, our minimally adapted diffusion transformer utilizes a 4D point-cloud-derived anchor to achieve state-of-the-art temporal consistency, robust camera control, and high-fidelity novel view synthesis on complex dynamic scenes.

Training Data Construction

From a single monocular video we extract two independent crop trajectories (source & target) and synthetically generate a geometric anchor by forward-warping the first source frame with a dense tracking field.
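
For concreteness, the sketch below shows one way the two smooth random-walk crop trajectories could be sampled; the function names, smoothing kernel, and crop size are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def random_walk_crop_trajectory(num_frames, frame_h, frame_w, crop_size,
                                step_sigma=2.0, smooth=9, seed=None):
    """Sample a smooth random-walk trajectory of crop-window top-left corners."""
    rng = np.random.default_rng(seed)
    # Accumulate Gaussian steps into a 2D random walk, then smooth with a box filter.
    steps = rng.normal(0.0, step_sigma, size=(num_frames, 2))
    path = np.cumsum(steps, axis=0)
    kernel = np.ones(smooth) / smooth
    path = np.stack([np.convolve(path[:, i], kernel, mode="same") for i in range(2)], axis=1)
    # Shift to a random start position and keep the window inside the frame.
    start = rng.uniform([0.0, 0.0], [frame_h - crop_size, frame_w - crop_size])
    path = np.clip(path + start, 0, [frame_h - crop_size, frame_w - crop_size])
    return path.astype(int)  # (num_frames, 2): top-left (y, x) per frame

def extract_crops(video, trajectory, crop_size):
    """Cut a crop_size x crop_size window from each frame along a trajectory."""
    return np.stack([frame[y:y + crop_size, x:x + crop_size]
                     for frame, (y, x) in zip(video, trajectory)])

# Two independently sampled trajectories turn one clip into pseudo source/target views:
# src = extract_crops(video, random_walk_crop_trajectory(T, H, W, 384, seed=0), 384)
# tgt = extract_crops(video, random_walk_crop_trajectory(T, H, W, 384, seed=1), 384)
```

Because the two trajectories drift independently, the source and target crops only partially overlap, which produces the spatial misalignment and artificial occlusions that drive learning.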

Self-supervised training data construction — teaser figure

Figure 1. Our self-supervised training data pipeline. Two independent smooth random-walk crop trajectories are extracted from a single input video to form source & target pseudo-views. The anchor is generated via dense-tracking-guided forward-warping, simulating the point-cloud anchors used at inference.

Pipeline in Action

Example 1: Input Video → 4D Point Cloud Anchor → Reshooting Result

Example 2: Input Video → 4D Anchor (cam 78) → Reshooting Result

Qualitative Results

Each row shows the original video (left) alongside our reshooting result (right) from a different camera trajectory.

Five video pairs: Original | Ours (reshooting)

Inference Example

Given a single input video, we reshoot it from two different anchor camera trajectories simultaneously.

Input Video · Anchor Camera 1 → Reshooting Result · Anchor Camera 2 → Reshooting Result

Method Overview

Model architecture overview

Figure 2. Overview of our conditioning architecture. (1) VAE Encoding: The anchor video (Va) and source video (Vs) are independently encoded into latents. (2) Conditioning Setup: The anchor latent is combined with a noise latent; the source latent uses an all-ones mask ensuring all content is leveraged. (3) DiT Processing: The two conditioned streams are temporally concatenated and jointly processed through the DiT blocks via self-attention, enabling fine-grained content routing. (4) Source Token Management: An auxiliary reconstruction loss on output source tokens ensures high-fidelity content retention.
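
The conditioning flow in Figure 2 can be summarized with the schematic below. The tensor layout, the way noise is combined with the anchor latent, and the mask channel are assumptions made for illustration; only the overall structure (noise-combined anchor latent, all-ones mask on the source latent, temporal concatenation into one DiT sequence, auxiliary reconstruction loss on source tokens) follows the figure.

```python
import torch
import torch.nn.functional as F

def build_dit_input(z_anchor, z_source, noise):
    """Assemble the two conditioned streams of Figure 2 (schematic).

    z_anchor, z_source : (B, c, T, h, w) VAE latents of the anchor and source videos.
    noise              : (B, c, T, h, w) noise latent matching the anchor latent.
    Returns a (B, c+1, 2T, h, w) tensor: [latent | mask] per stream, concatenated in time.
    """
    # Anchor stream: anchor latent combined with noise; a zero mask marks it as
    # the content to be denoised into the reshot target.
    anchor_stream = torch.cat([z_anchor + noise, torch.zeros_like(z_anchor[:, :1])], dim=1)
    # Source stream: clean source latent with an all-ones mask, so every source
    # token stays available for content routing via self-attention.
    source_stream = torch.cat([z_source, torch.ones_like(z_source[:, :1])], dim=1)
    # Temporal concatenation: the DiT blocks jointly attend over both streams.
    return torch.cat([anchor_stream, source_stream], dim=2)

def auxiliary_source_loss(dit_output, z_source):
    """Auxiliary reconstruction loss on the output source tokens (step 4 in Figure 2)."""
    out_source = dit_output[:, :, -z_source.shape[2]:]   # last T frames = source tokens
    return F.mse_loss(out_source, z_source)
```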

Implicit 4D Learning

Our training setup forces the model to learn 4D spatiotemporal structure from purely 2D data. Because the source and target crops are spatially misaligned and contain artificial occlusions, the model must search across both space and time in the source video to fill in missing target details; this search behavior amounts to a learned 4D reconstruction.

Self-Supervised Training

Our entire pipeline relies only on a robust 2D dense tracker (AllTracker). It is therefore domain-agnostic and scales to any monocular video, including photorealistic footage, animation, and generative art.

Training Triplet Generation

Video triplet generation pipeline

Figure 3. Pseudo multi-view triplet generation. Two independent smooth random-walk crop trajectories are sampled from a single input video. The anchor is synthesized by forward-warping the source's first frame using a combined dense tracking + crop offset flow field.
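
Below is a minimal sketch of the anchor synthesis in Figure 3, assuming dense tracks of the first source frame are already available (e.g. from a tracker such as AllTracker); the nearest-neighbor splatting and the handling of crop offsets are simplifying assumptions, not the released pipeline.

```python
import numpy as np

def forward_warp_anchor(first_frame, tracks, crop_offsets):
    """Forward-warp the first source frame along dense tracks to build the anchor.

    first_frame  : (H, W, 3) first frame of the source crop.
    tracks       : (T, H, W, 2) per-frame (x, y) position of every first-frame pixel,
                   as produced by a dense 2D tracker.
    crop_offsets : (T, 2) target-minus-source crop offsets (x, y) per frame.
    Returns an anchor video (T, H, W, 3); unmapped pixels stay black, mimicking the
    disocclusion holes of the point-cloud anchors used at inference.
    """
    T, H, W, _ = tracks.shape
    anchor = np.zeros((T, H, W, 3), dtype=first_frame.dtype)
    ys, xs = np.mgrid[0:H, 0:W]
    for t in range(T):
        # Combined flow: dense-track displacement plus the crop-offset shift.
        xt = np.round(tracks[t, ..., 0] + crop_offsets[t, 0]).astype(int)
        yt = np.round(tracks[t, ..., 1] + crop_offsets[t, 1]).astype(int)
        valid = (xt >= 0) & (xt < W) & (yt >= 0) & (yt < H)
        # Splat source pixels to their tracked target locations (nearest neighbor).
        anchor[t, yt[valid], xt[valid]] = first_frame[ys[valid], xs[valid]]
    return anchor
```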

Comparisons with State-of-the-Art

Our approach consistently outperforms existing methods in dynamic video reshooting, achieving leading performance across video quality, temporal consistency, camera accuracy, and view synchronization metrics.

| Method | Aesthetic ↑ | Imaging ↑ | Flickering ↑ | Smoothness ↑ | Subject ↑ | Background ↑ | CLIP-F ↑ | RotErr ↓ | TransErr ↓ | Mat.Pix ↑ | FVD-V ↓ | CLIP-V ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TrajectoryCrafter (49 frames) | 52.69 | 59.67 | 96.97 | 99.03 | 93.78 | 95.13 | 98.80 | 2.26 | 3.03 | 1851.80 | 582.56 | 92.40 |
| Ours (49 frames) | 52.72 | 57.81 | 97.43 | 99.24 | 95.09 | 95.62 | 99.01 | 2.61 | 2.73 | 2737.65 | 488.22 | 94.96 |
| ReCamMaster | 48.71 | 52.61 | 97.57 | 99.26 | 88.57 | 90.65 | 98.49 | 11.29 | 19.59 | 1314.00 | 732.52 | 88.91 |
| Ex4D | 49.72 | 55.76 | 97.46 | 99.08 | 91.51 | 94.78 | 98.94 | 3.94 | 4.21 | 2188.98 | 685.63 | 89.77 |
| Ours | 52.85 | 58.64 | 97.37 | 99.21 | 93.43 | 95.24 | 99.03 | 2.76 | 4.23 | 2720.83 | 586.24 | 93.16 |

Column groups: VBench Quality (Aesthetic, Imaging) · Temporal (Flickering, Smoothness, Subject, Background, CLIP-F) · Camera Accuracy (RotErr, TransErr) · View Synchronization (Mat.Pix, FVD-V, CLIP-V).

Bold + underline = best in group. ↑ higher is better, ↓ lower is better.
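
For reference, the sketch below shows one common way camera-accuracy metrics such as RotErr and TransErr are computed from estimated versus reference camera poses (geodesic rotation angle and translation distance); the exact evaluation protocol, pose alignment, and units used in the paper may differ.

```python
import numpy as np

def pose_errors(R_est, t_est, R_gt, t_gt):
    """Per-sequence camera pose errors, one common formulation.

    R_est, R_gt : (N, 3, 3) rotation matrices.
    t_est, t_gt : (N, 3) camera translations.
    Returns mean rotation error in degrees and mean translation error.
    """
    # Geodesic angle between rotations: angle of R_est^T @ R_gt per frame.
    rel = np.einsum("nij,nik->njk", R_est, R_gt)
    trace = np.trace(rel, axis1=1, axis2=2)
    cos = np.clip((trace - 1.0) / 2.0, -1.0, 1.0)
    rot_err = np.degrees(np.arccos(cos)).mean()
    # Euclidean distance between camera positions.
    trans_err = np.linalg.norm(t_est - t_gt, axis=1).mean()
    return rot_err, trans_err
```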

Ablation Studies

We ablate core architectural and training choices. Our baseline uses a black anchor background, token concatenation through self-attention, and monocular videos without augmentations.

Qualitative ablation comparisons

Figure 4. Qualitative ablation on a complex scene with moving smoke and colored lighting. The Synthetic Data Only model fails to capture dynamic smoke. The Cross-Attention model loses fine source details (saxophone texture) and struggles with anchor alignment. Our full model faithfully tracks the anchor while preserving high-fidelity textures and complex dynamics.

| Method | RotErr ↓ | TransErr ↓ | Mat.Pix ↑ | FVD-V ↓ | CLIP-V ↑ |
| --- | --- | --- | --- | --- | --- |
| Baseline (w/ self-attn) | 3.27 | 4.92 | 2636 | 595.39 | 92.94 |
| — Source Video | 4.82 | 11.65 | 226 | 662.09 | 89.97 |
| + Gaussian Noise in Latent | 2.81 | 5.05 | 2586 | 605.93 | 91.70 |
| + 3D Noise in Anchor | 2.49 | 4.95 | 2624 | 598.67 | 92.88 |
| w/ Cross-Attention | 3.53 | 4.31 | 1766 | 626.37 | 91.64 |
| + Auxiliary Loss | 2.76 | 4.98 | 2627 | 562.94 | 92.93 |
| + LoRA | 2.85 | 4.17 | 2615 | 578.91 | 92.98 |
| + Random Query | 3.36 | 4.52 | 2618 | 571.74 | 92.83 |
| + Fluorescent Background Anchor | 3.16 | 4.78 | 2627 | 571.06 | 92.68 |
| w/ Synthetic Data (Syn) | 3.70 | 5.04 | 1746 | 608.03 | 91.86 |
| w/ Monocular Videos (Ours) | 2.76 | 4.23 | 2720 | 586.24 | 93.16 |
| Ours + Syn | 3.36 | 3.66 | 2577 | 587.91 | 92.64 |

Bold + underline = best per column. ↑ higher is better, ↓ lower is better.

BibTeX

@article{paliwal2026reshootanything,
  title   = {Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting},
  author  = {Paliwal, Avinash and Iyer, Adithya and Yadav, Shivin and Afridi, Muhammad Ali and Harikumar, Midhun},
  journal = {arXiv preprint arXiv:2604.21776},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.21776}
}

Acknowledgements

We'd like to thank the members of Morphic for compute resources and motivation, and Dharmesh Kakadia and Isaac Wang for reviewing the paper.