*GSoC 2026 Proposal: High-Performance LibTorch Backend Modernization*

*Candidate:* Raja Rathour
*Project Type:* Large (350 Hours)
*Mentor:* Guo Yejun

*1. Problem Statement: Bridging the Integration Gap*

Although the LibTorch backend logic is present in the source tree, it was functionally inaccessible to end users due to a registration mismatch in the AVOption system. Specifically, the dnn_backend unit in vf_dnn_processing.c lacked the constant needed to map the user input string "torch" to the internal DNN_TH backend ID. This caused the following failure:

*The Error:*

[Parsed_dnn_processing_0 @ 0x...] Option 'dnn_backend' not found

*The Fix:* I have already diagnosed and resolved this by registering the torch constant in the dnn_processing_options array and updating the DnnContext offsets.

*2. Current Progress & Functional Verification*

I have verified the end-to-end inference pipeline using a local build (--enable-libtorch). The terminal output below serves as proof of concept, showing successful 25-frame inference at 14.8x speed through the new LibTorch integration:

*# Verified Command-Line Proof*

./ffmpeg -f lavfi -i testsrc=duration=1 -vf "dnn_processing=model=model.pt:dnn_backend=torch" -f null -
...
Stream mapping:
  Stream #0:0 -> #0:0 (wrapped_avframe -> wrapped_avframe)
frame=   25 fps=0.0 time=00:00:01.00 speed=14.8x

This confirms that the "plumbing" between FFmpeg's filtergraph and the LibTorch engine (which uses at::from_blob to wrap frame memory as tensors) is fully operational.

*3. 350-Hour Technical Roadmap (12-Week Plan)*

The project scope has been expanded to 350 hours to prioritize architectural modernization and high-performance GPU utilization:

• *Phase 1: Unified Async Infrastructure (Weeks 1–4):* Migrate to the common DNNAsyncExecModule. This provides non-blocking execution and brings LibTorch into alignment with the OpenVINO and TensorFlow backends.
• *Phase 2: Zero-Copy GPU Pipeline (Weeks 5–9 – HIGH PRIORITY):* Develop direct mapping for AV_PIX_FMT_CUDA frames. Eliminating redundant CPU-to-GPU memory copies significantly reduces PCIe bandwidth bottlenecks for hardware-accelerated filters.
• *Phase 3: Batch-Mode Inference & Refinement (Weeks 10–12 – Optional):* Implement frame-accumulation buffers for (B x C x H x W) processing, maximizing hardware throughput via tensor concatenation.

*4. Proven Track Record: Merged Contributions*

My proposal builds on a foundation of successful upstream contributions. I have already submitted and merged patches that established the initial infrastructure for this backend, including:

• Async Infrastructure Refactoring: Initial migration steps for the LibTorch worker lifecycle.
• Memory Safety & Buffering: Persistent input buffers and dynamic shape-reallocation logic to prevent runtime overflows.
• Worker Management: Refined handling of the inference request queue to improve stability.

*5. Community Impact & Stability*

To ensure long-term maintainability, all performance features will be implemented as opt-in parameters. I am committed to maintaining a synchronous fallback path and to contributing FATE test cases that verify the new execution paths across different environments.
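To make the memory-wrapping and batching ideas above concrete, here is a minimal Python sketch. It uses PyTorch's Python API (torch.from_numpy) as a stand-in for the C++ at::from_blob call the backend actually uses, and the frame dimensions and four-frame batch size are illustrative assumptions, not values from the proposal:

```python
import numpy as np
import torch

# Illustrative frame geometry (assumption): 3-channel float planes.
C, H, W = 3, 64, 64

# Simulate four decoded frame buffers as NumPy arrays standing in for
# AVFrame data planes.
frames = [np.random.rand(C, H, W).astype(np.float32) for _ in range(4)]

# Zero-copy wrap: torch.from_numpy shares memory with the source buffer
# (no copy), analogous to at::from_blob wrapping AVFrame data in C++.
tensors = [torch.from_numpy(f) for f in frames]
assert tensors[0].data_ptr() == frames[0].ctypes.data  # same memory

# Batch-mode inference (Phase 3 idea): accumulate C x H x W tensors and
# combine them into a single B x C x H x W batch so one forward pass
# processes all buffered frames. Note torch.stack allocates the batch.
batch = torch.stack(tensors)
print(batch.shape)  # torch.Size([4, 3, 64, 64])
```

The wrap step is where the zero-copy benefit lives: the model sees the frame's own memory, so the only unavoidable allocation is the batch buffer itself.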