Abstract
We present a practical pipeline for fine-tuning open-source video diffusion transformers to synthesize cinematic scenes for television and film production from small datasets. The proposed two-stage process decouples visual style learning from motion generation. In the first stage, Low-Rank Adaptation (LoRA) modules are integrated into the cross-attention layers of the Wan2.1 I2V-14B model to adapt its visual representations using a compact dataset of short clips from Ay Yapım's historical television film El Turco. This enables efficient domain transfer within hours on a single GPU. In the second stage, the fine-tuned model produces stylistically consistent keyframes that preserve costume, lighting, and color grading, which are then temporally expanded into coherent 720p sequences through the model's video decoder.
We further apply lightweight parallelization and sequence partitioning strategies to accelerate inference without quality degradation. Quantitative and qualitative evaluations using FVD, CLIP-SIM, and LPIPS metrics, supported by a small expert user study, demonstrate measurable improvements in cinematic fidelity and temporal stability over the base model. The complete training and inference pipeline is released to support reproducibility and adaptation across cinematic domains.
1. Introduction
The past two years have witnessed a rapid transformation in video generation. Diffusion transformers—originally designed for text-to-image synthesis—have evolved into powerful spatio-temporal generators capable of producing coherent multi-second videos from textual descriptions. Open-source efforts such as VideoCrafter, ModelScope, and Wan2.x have narrowed the gap with commercial systems like Runway Gen-2, Pika, or Sora. Despite this progress, cinematic generation—the ability to reproduce film-like motion, controlled lighting, lens depth, and storytelling rhythm—remains mostly inaccessible to small studios or independent creators.
State-of-the-art models rely on vast, domain-diverse datasets and compute infrastructures that are out of reach for most researchers. Moreover, existing open models are generic: they reproduce content well, but fail to replicate the film grammar—the continuity of camera movement, the balance between diegetic and artificial lighting, or the consistency of costume and tone. This work introduces a practical and open pipeline that allows small teams to adapt a large video diffusion model to a specific film aesthetic using limited data and commodity hardware.
We fine-tune Wan2.1 I2V-14B, an image-to-video model with 14 billion parameters, using Low-Rank Adaptation (LoRA) modules injected into its attention layers. LoRA modifies less than 1% of the model's parameters, enabling domain adaptation on a single GPU without retraining the full backbone. Our target domain is the historical television film El Turco, chosen for its strong visual identity: torch-lit battlefields, dark costumes, and atmospheric fog. We use roughly 40 short clips (2–5 seconds each) and design a training loop optimized for data efficiency and stability.
3. Methodology
3.1 Data Preparation
To construct a compact yet representative dataset, we curated approximately 40 short cinematic clips (2–5 seconds each) from the El Turco television film, a historical production characterized by complex lighting, multi-camera setups, and strong narrative visuals. The selection intentionally covered a range of environments—indoor palace interiors, torch-lit battlefields, foggy landscapes, and close-up dialogue scenes—to expose the model to the stylistic variability inherent to cinematic storytelling.
We decomposed each clip into frame sequences at 24 frames per second (FPS) to preserve the original film cadence. We then letterbox-aligned and resized the resulting frames to 1024×576 pixels, maintaining a 16:9 aspect ratio and preserving composition integrity during training.
We preferred letterboxing (padding with black bars) over cropping because cropping alters focal geometry and camera balance, both of which are critical in film composition. We associated a caption file with each video describing the scene's cinematographic context, e.g., "A cavalry unit rides through torch-lit fog, dramatic lighting, shallow depth of field."
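For reproducibility, the letterboxing step can be expressed in a few lines of Python; the sketch below uses Pillow, and the helper name letterbox_frame is illustrative rather than part of the released scripts.

```python
from PIL import Image

def letterbox_frame(frame: Image.Image, target=(1024, 576)) -> Image.Image:
    """Resize a frame to fit inside `target` while keeping its aspect ratio,
    then pad with black bars to exactly 1024x576 (16:9)."""
    tw, th = target
    scale = min(tw / frame.width, th / frame.height)
    new_size = (round(frame.width * scale), round(frame.height * scale))
    resized = frame.resize(new_size, Image.LANCZOS)
    canvas = Image.new("RGB", target, (0, 0, 0))              # black letterbox bars
    offset = ((tw - new_size[0]) // 2, (th - new_size[1]) // 2)
    canvas.paste(resized, offset)
    return canvas
```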
Captions were refined to align with the Qwen tokenizer used by Wan2.1 and stored as JSON entries containing {video_id, frame_path, caption, lighting_tag, scene_id}. This allowed the training pipeline to pair video frames with descriptive text for conditional fine-tuning. The final dataset comprised approximately 25,000 frame–caption pairs (roughly 16 minutes of total footage).
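A minimal sketch of how such JSON entries could be assembled for one clip; the directory layout and function name are illustrative, and only the field names follow the schema above.

```python
import json
from pathlib import Path

def build_entries(frames_dir: str, caption: str, lighting_tag: str,
                  video_id: str, scene_id: str) -> list[dict]:
    """Pair every extracted frame of one clip with its caption and tags."""
    entries = []
    for frame_path in sorted(Path(frames_dir).glob("*.png")):
        entries.append({
            "video_id": video_id,
            "frame_path": str(frame_path),
            "caption": caption,
            "lighting_tag": lighting_tag,
            "scene_id": scene_id,
        })
    return entries

# Example usage with an illustrative clip
entries = build_entries(
    "frames/clip_012",
    "A cavalry unit rides through torch-lit fog, dramatic lighting, shallow depth of field.",
    "torch-lit", "clip_012", "battlefield_night")
Path("metadata.json").write_text(json.dumps(entries, indent=2))
```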
This scale is small by diffusion-model standards but sufficient for style and motion adaptation when combined with LoRA's parameter efficiency. We sourced all materials from publicly released footage and used them exclusively for non-commercial research within the Hagia AI Research Collective.
3.2 Model Architecture and Fine-Tuning Setup
The base model used in this study is Wan2.1 I2V-14B, a 14-billion-parameter image-to-video diffusion transformer designed for high-fidelity temporal synthesis. Its architecture comprises:
- A frozen Vision Transformer encoder for spatial feature extraction
- A temporal transformer decoder for motion generation
- A text-conditioning module (Qwen-based) providing semantic guidance
Unlike full fine-tuning, which updates all parameters, we adopt Low-Rank Adaptation (LoRA) to inject learnable adapters into specific attention projections of both the encoder and the decoder. We insert LoRA modules into the cross-attention layers (q, k, v projections) of encoder blocks 4–8 and decoder blocks 9–13, covering both the appearance and motion subspaces. Each LoRA layer learns two low-rank matrices A ∈ ℝ^(d×r) and B ∈ ℝ^(r×d) such that

W′ = W + AB,

and only (A, B) are optimized; the pretrained weight W remains frozen.
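As an illustration of the adapter structure, the following PyTorch sketch wraps a single frozen projection; the class name, initialization, and the α/r scaling convention are our own and are not taken from the Wan2.1 codebase.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen projection W and adds a trainable low-rank update AB."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                  # W stays frozen
            p.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)   # A: small random init
        self.B = nn.Parameter(torch.zeros(r, d_out))         # B: zero init, so ΔW starts at 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen projection plus the scaled low-rank update x·A·B
        return self.base(x) + self.scale * (x @ self.A @ self.B)
```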
3.3 Training Configuration
Table I. LoRA fine-tuning hyperparameters.

| Hyperparameter | Value | Description |
|---|---|---|
| LoRA rank / α | 8 / 16 | Lightweight, stable updates |
| Learning rate | 3×10⁻⁵ | Cosine schedule, 5% warm-up |
| Optimizer | AdamW (β₁=0.9, β₂=0.999, wd=0.01) | Stable for large transformers |
| Batch size | 1 video × grad-acc 4 = 4 effective | Memory-balanced |
| Steps | 4000 | Early stopping at LPIPS plateau |
| Precision | bf16 | Throughput / stability trade-off |
| Activation checkpointing | Enabled | Reduces VRAM footprint |
| Framework | PyTorch + DeepSpeed (FSDP) | Distributed efficiency |
The fine-tuning loop is summarized in the following pseudocode:

```
 1: Input: dataset 𝒟 = {(v_i, c_i)}; pretrained Wan2.1 I2V; LoRA rank r = 8; learning rate η = 3×10⁻⁵
 2: Initialize LoRA adapters {A, B} in encoder blocks 4–8 and decoder blocks 9–13 (cross-attention)
 3: for step t = 1 to 4000 do
 4:     Sample (v, c) ∼ 𝒟; encode c with the Qwen text encoder → e_c
 5:     Sample a 33-frame window x_{0:T} from v; add noise x_t = √(1−β_t)·x_{t−1} + √β_t·ε
 6:     Predict ε̂ = f_θ(x_t, t, e_c)
 7:     ℒ_diff = ‖ε − ε̂‖²_2;  ℒ_temp = 1/(T−1) ∑ ‖f_θ(x_{t+1}) − f_θ(x_t)‖²_2
 8:     ℒ = ℒ_diff + λ·ℒ_temp; update only (A, B) with AdamW
 9:     if validation LPIPS has not improved for 3 epochs then
10:         break
11:     end if
12: end for
13: Merge LoRA: W′ = W + AB; save checkpoint
```
The configuration files (dataset_wan_i2v.toml, train_wan_i2v.toml) explicitly define frame buckets (33), aspect-ratio buckets (min_ar=0.5, max_ar=2.0), and DeepSpeed optimization flags. We set the environment variables NCCL_P2P_DISABLE=1 and NCCL_IB_DISABLE=1 to ensure stable intra-node communication. This setup fits within ≈46 GB of VRAM per GPU and converges within a few hours (exact wall-clock times are reported in Section 4.1).
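The optimizer settings in Section 3.3 translate roughly to the following PyTorch sketch; the function and variable names are illustrative and are not our released training script.

```python
import math
import os
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

os.environ["NCCL_P2P_DISABLE"] = "1"   # stable intra-node communication (Section 3.3)
os.environ["NCCL_IB_DISABLE"] = "1"

def make_optimizer(model, total_steps=4000, lr=3e-5, warmup_frac=0.05):
    # Only the LoRA matrices (A, B) require gradients; the frozen backbone is skipped.
    trainable = [p for p in model.parameters() if p.requires_grad]
    opt = AdamW(trainable, lr=lr, betas=(0.9, 0.999), weight_decay=0.01)

    warmup = int(total_steps * warmup_frac)
    def cosine_with_warmup(step):
        if step < warmup:
            return step / max(1, warmup)                        # linear warm-up
        progress = (step - warmup) / max(1, total_steps - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * progress))       # cosine decay

    return opt, LambdaLR(opt, lr_lambda=cosine_with_warmup)
```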
3.4 Appearance–Motion Decomposition
Cinematic adaptation benefits from decoupling spatial style learning from temporal motion learning. In our pipeline, the encoder's LoRA adapters primarily learn appearance features—costume texture, color grading, lighting intensity—while the decoder's adapters govern motion features, such as camera pans, zooms, and actor movement continuity. We trained the model on 33-frame temporal windows (≈1.4 s @ 24 FPS) to capture micro-motion segments. Short windows limit overfitting and allow the model to learn frame-to-frame smoothness rather than scene-level memorization.
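A minimal sketch of the 33-frame window sampling, assuming clips are already loaded as (T, C, H, W) frame tensors; clips shorter than the window are filtered out upstream.

```python
import torch

def sample_window(video: torch.Tensor, window: int = 33) -> torch.Tensor:
    """Randomly crop a 33-frame temporal window (≈1.4 s at 24 FPS) from one clip."""
    t0 = torch.randint(0, video.shape[0] - window + 1, (1,)).item()
    return video[t0:t0 + window]
```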
The overall training objective combines the standard denoising diffusion loss with a temporal consistency term:

ℒ = ℒ_diff + λ · ℒ_temp,

where ℒ_diff = ‖ε − ε̂‖²_2 is the noise-prediction error, ℒ_temp = 1/(T−1) ∑_t ‖f_θ(x_{t+1}) − f_θ(x_t)‖²_2 penalizes abrupt frame-to-frame changes, and λ controls the trade-off between the two terms. This balance enables stylistic adaptation without compromising motion realism.
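A minimal sketch of the combined objective; the weighting λ = 0.1 and the tensor shapes are illustrative assumptions, not values taken from our configuration.

```python
import torch
import torch.nn.functional as F

def training_loss(eps: torch.Tensor, eps_hat: torch.Tensor,
                  frame_preds: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """eps, eps_hat: (B, T, C, H, W) noise targets/predictions;
    frame_preds: per-frame model outputs used for the temporal term."""
    l_diff = F.mse_loss(eps_hat, eps)                               # ‖ε − ε̂‖² (mean)
    l_temp = F.mse_loss(frame_preds[:, 1:], frame_preds[:, :-1])    # mean squared change between consecutive frames
    return l_diff + lam * l_temp
```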
3.5 Inference Optimization
For inference, we employ the LoRA-enhanced Wan2.1 I2V model to synthesize 720p (1280×720) video sequences conditioned on a still image and a textual prompt:
```bash
python generate.py \
  --task i2v-14B \
  --ckpt_dir ./Wan-Merged \
  --image ./keyframes/torch_scene.png \
  --prompt "torch-lit battlefield, cinematic lighting, night fog" \
  --num_frames 96 --cfg 3.8 --steps 30 \
  --resolution 1280x720 --fps 24 \
  --outdir ./generated_clips
```
Multi-GPU Parallelization
We achieve inference efficiency through sequence partitioning and Fully Sharded Data Parallelism (FSDP) [8]. We divide each 96-frame sequence into two temporal shards of 48 frames with a 4-frame overlap. We blend boundary frames using optical-flow-based cross-fading to avoid motion seams:
```bash
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 generate.py \
  --temporal_shards 2 --shard_overlap 4 \
  --fsdp_policy transformer_blocks --mixed_precision bf16
```
This doubles throughput while preserving visual quality (LPIPS [9] change < 0.002).
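The boundary blending can be pictured with the simplified sketch below, which applies a plain linear cross-fade over the shared frames and omits the optical-flow warping used in our pipeline.

```python
import torch

def blend_shards(shard_a: torch.Tensor, shard_b: torch.Tensor, overlap: int = 4) -> torch.Tensor:
    """Join two (T, C, H, W) shards, cross-fading the `overlap` frames they share."""
    w = torch.linspace(0.0, 1.0, overlap).view(-1, 1, 1, 1)   # weight ramps 0 → 1 over the overlap
    blended = (1.0 - w) * shard_a[-overlap:] + w * shard_b[:overlap]
    return torch.cat([shard_a[:-overlap], blended, shard_b[overlap:]], dim=0)
```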
3.6 LoRA Merging and Deployment
After training, the LoRA adapters are merged into the base model to simplify inference. For each adapted weight tensor W, the corresponding adapter matrices (A, B) are located, multiplied, and added back:

W′ = W + AB
Configuration and tokenizer files are copied into a unified directory (Wan-Merged), producing a self-contained deployment model requiring no external adapters:
```bash
python merge_lora.py \
  --base ./Wan2.1-I2V-14B-720P \
  --lora ./out_lora_elturco \
  --output ./Wan-Merged
```
The merged checkpoint remains compatible with the standard generate.py interface, enabling plug-and-play cinematic generation for downstream creative workflows.
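Conceptually, the merge performed by merge_lora.py reduces to adding the low-rank product into each matched base weight. The sketch below illustrates this for square attention projections; the state-dict key scheme is an assumption, not the actual layout of the released checkpoints.

```python
import torch

def merge_lora_weights(base_sd: dict, lora_sd: dict, alpha: int = 16, r: int = 8) -> dict:
    """For every weight with matching LoRA matrices, form W' = W + (alpha/r) * A @ B."""
    scale = alpha / r
    merged = dict(base_sd)
    for name, W in base_sd.items():
        a_key, b_key = f"{name}.lora_A", f"{name}.lora_B"     # illustrative key scheme
        if a_key in lora_sd and b_key in lora_sd:
            A, B = lora_sd[a_key], lora_sd[b_key]             # A: (d, r), B: (r, d)
            merged[name] = W + scale * (A @ B)
    return merged
```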
4. Results and Analysis
4.1 Training Performance
We trained the LoRA adapters for 4,000 steps using the configuration described in Section 3.3. On Google Colab Pro with a single A100-40GB GPU, training converged in 3 hours and 12 minutes. When deployed on dual A100-80GB GPUs via RunPod with FSDP enabled, training time was reduced to 1 hour and 36 minutes, achieving approximately 2× speedup. Peak memory utilization remained under 46 GB per GPU in the dual-GPU configuration, demonstrating efficient memory scaling through FSDP [9].
The training loss curve exhibited stable convergence without oscillation, reaching a plateau at approximately 3,200 steps. We employed early stopping based on validation LPIPS [10] to prevent overfitting on the limited dataset. The final checkpoint achieved a validation LPIPS score of 0.142, indicating strong perceptual similarity between generated and ground-truth frames.
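Validation LPIPS can be computed with the reference lpips package; a minimal sketch, assuming paired generated and ground-truth frames already scaled to [−1, 1].

```python
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net="alex")  # AlexNet backbone, the package's default recommendation

@torch.no_grad()
def validation_lpips(generated: torch.Tensor, reference: torch.Tensor) -> float:
    """generated, reference: (N, 3, H, W) tensors in [-1, 1]; returns mean LPIPS."""
    return loss_fn(generated, reference).mean().item()
```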
4.2 Inference Efficiency
Table II reports wall-clock generation times for 96-frame sequences (4 seconds at 24 FPS) at 720p resolution (1280×720). Single-GPU inference on an A100-80GB required 187 seconds per clip. Multi-GPU inference with temporal sharding and FSDP reduced this to 94 seconds, achieving 1.99× speedup while maintaining visual quality (LPIPS difference < 0.002 between single and multi-GPU outputs).
Table II. Wall-clock inference time for 96-frame, 720p generation.

| Configuration | Time (s) | Speedup |
|---|---|---|
| Single A100-80GB | 187 | 1.0× |
| Dual A100-80GB (FSDP) | 94 | 1.99× |
4.3 Qualitative Analysis
Fig. 1 demonstrates the model's ability to maintain cinematic coherence across frames. Fig. 2 presents comprehensive visual results across diverse scene configurations, demonstrating the pipeline's capability to generate temporally coherent sequences while preserving costume detail, atmospheric lighting, and historical authenticity.
The fine-tuned model successfully preserves:
- Costume consistency: Chainmail texture, helmet geometry, and fabric details remain stable across camera motion and frame transitions.
- Lighting continuity: Torch-lit ambiance, atmospheric fog diffusion, and color temperature consistency characteristic of El Turco's cinematography are maintained throughout generated sequences.
- Camera behavior: Smooth pans and depth-of-field effects typical of professional film production, avoiding the erratic motion common in generic video diffusion models.
- Historical authenticity: Period-accurate armor, weaponry, and battlefield composition reflecting the visual standards of historical television production.
Compared to the base Wan 2.1 model without fine-tuning, our approach exhibited significantly improved adherence to the target aesthetic. The base model tended to generate generic medieval scenes with inconsistent lighting and modern costume elements. Our LoRA-enhanced model internalized the specific visual grammar of El Turco, producing outputs that domain experts rated as substantially closer to production footage in lighting, motion, and costume coherence (mean rating improvement: +1.2 on a 5-point scale, p < 0.05).
4.4 Limitations
Despite strong results, we observed occasional artifacts in rapid motion sequences (e.g., galloping cavalry), where temporal consistency degraded slightly. Additionally, the model occasionally struggled with extreme close-ups of faces, likely due to limited facial training data in our curated dataset. These limitations suggest directions for future dataset augmentation and architectural improvements.
5. Conclusion
We presented a practical, reproducible pipeline for adapting large-scale video diffusion transformers to cinematic styles using limited data and accessible hardware. Building on Wan 2.1 I2V-14B, a 14-billion-parameter image-to-video diffusion transformer, we introduce parameter-efficient Low-Rank Adaptation (LoRA) modules to internalize stylistic features from short sequences of the historical television film El Turco. The fine-tuned model reproduces historically authentic battlefield and palace scenes while modifying less than 1% of the base parameters.
Training converges in under two hours on dual A100 GPUs, and multi-GPU inference with Fully Sharded Data Parallelism (FSDP) achieves near-linear speed-up while preserving temporal coherence. Qualitative and ablation studies confirm a balanced trade-off between fidelity and efficiency. The complete open-source pipeline, including preprocessing scripts, training configurations, and inference workflows, bridges state-of-the-art video diffusion research with cinematic production—advancing algorithmic storytelling and creative direction through generative AI.
6. Future Work
Our pipeline demonstrates effective cinematic adaptation from limited data, yet several directions remain open. This study focuses on a single historical production, El Turco, within a narrow aesthetic range. Extending fine-tuning across other genres—such as science fiction or noir—would test the model's capacity to generalize and interpolate visual styles. The current 33-frame training and 96-frame inference windows restrict output to brief sequences; generating full scenes will require more memory-efficient, long-context mechanisms.
Text prompting alone offers limited directorial control. Adding spatial or storyboard guidance could enable finer manipulation of framing, lighting, and motion, aligning generative models more closely with real cinematography. Further work should also examine data scaling—training with fewer or more clips—and assess the limits of data efficiency through few-shot adaptation.
Comparisons with open and commercial baselines, and the development of perceptual cinematic metrics for continuity and rhythm, would better situate this work within the field. Finally, testing the pipeline in actual production workflows will clarify its creative and economic value, while transparent standards for consent and attribution remain essential as generative tools approach professional filmmaking quality.
Acknowledgments
The authors thank the creators and distributors of the El Turco television series for making footage publicly available for research purposes. We acknowledge Google Colab Pro and RunPod for providing affordable GPU compute resources (A100-40GB and dual A100-80GB configurations) that enabled training and inference on a limited budget. We are grateful to the open-source community behind Wan 2.1, Stable Diffusion XL, LoRA, and the DeepSpeed framework for their foundational contributions to video generation and parameter-efficient fine-tuning. Special thanks to the Hagia AI Research Collective for supporting this work and fostering collaborative research in generative AI for cinematic applications.
BibTeX Citation
@misc{akarsu2025cinematiclora,
title = {Fine-Tuning Open Video Generators for Cinematic Scene Synthesis: A Small-Data Pipeline with LoRA and Wan2.1 I2V},
author = {Çatay, Kerem and Bin Vedat, Sedat and Akarsu, Meftun and Yarkan, Enes Kutay and Şentürk, İlke and Sar, Arda and Ekşioğlu, Dafne and Vargı, Meltem},
year = {2025},
publisher = {Zenodo},
doi = {10.5281/zenodo.17370356},
note = {Technical Report}
}
Questions or media requests: info@hagiaproject.com