*GSoC 2026 Proposal: High-Performance LibTorch Backend Modernization*

*Candidate:* Raja Rathour
*Project Type:* Large (350 Hours)
*Mentor:* Guo Yejun

*1. Problem Statement: Bridging the Integration Gap*

Although the LibTorch backend logic is present in the source tree, it was functionally inaccessible to end users due to a registration mismatch in the AVOption system. Specifically, the dnn_backend unit in vf_dnn_processing.c lacked the constant needed to map the user input string "torch" to the internal DNN_TH backend ID. This caused the following failure:

*The Error:*

[Parsed_dnn_processing_0 @ 0x...] Option 'dnn_backend' not found

*The Fix:* I have already diagnosed and resolved this by registering the torch constant in the dnn_processing_options array and updating the DnnContext offsets.

*2. Current Progress & Functional Verification*

I have verified the end-to-end inference pipeline using a local build (--enable-libtorch). The terminal output below serves as proof of concept, showing successful 25-frame inference at 14.8x speed through the new LibTorch integration:

*# Verified Command-Line Proof*

./ffmpeg -f lavfi -i testsrc=duration=1 -vf "dnn_processing=model=model.pt:dnn_backend=torch" -f null -
...
Stream mapping:
  Stream #0:0 -> #0:0 (wrapped_avframe -> wrapped_avframe)
frame=   25 fps=0.0 time=00:00:01.00 speed=14.8x

This confirms that the "plumbing" between FFmpeg's filtergraph and the LibTorch engine (which uses at::from_blob to wrap frame memory as tensors) is fully operational.

*3. 350-Hour Technical Roadmap (12-Week Plan)*

The project scope has been expanded to 350 hours to prioritize architectural modernization and high-performance GPU utilization:

• *Phase 1: Unified Async Infrastructure (Weeks 1–4):* Migrate to the common DNNAsyncExecModule. This provides non-blocking execution and brings LibTorch into alignment with the OpenVINO and TensorFlow backends.
• *Phase 2: Zero-Copy GPU Pipeline (Weeks 5–9 – HIGH PRIORITY):* Develop direct mapping for AV_PIX_FMT_CUDA frames. Eliminating redundant CPU-to-GPU memory copies significantly reduces PCIe bandwidth bottlenecks for hardware-accelerated filters.
• *Phase 3: Batch-Mode Inference & Refinement (Weeks 10–12 – Optional):* Implement frame-accumulation buffers for (B x C x H x W) processing, maximizing hardware throughput via tensor concatenation.

*4. Proven Track Record: Merged Contributions*

My proposal builds on a foundation of successful upstream contributions. I have already submitted and merged patches that established the initial infrastructure for this backend, including:

• Async Infrastructure Refactoring: Initial migration steps for the LibTorch worker lifecycle.
• Memory Safety & Buffering: Persistent input buffers and dynamic shape-reallocation logic to prevent runtime overflows.
• Worker Management: Refined handling of the inference request queue to improve stability.

*5. Community Impact & Stability*

To ensure long-term maintainability, all performance features will be implemented as opt-in parameters. I am committed to maintaining a synchronous fallback path and to contributing FATE test cases that verify the new execution paths across different environments.
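To make the memory-wrapping and batching ideas above concrete, here is a minimal Python sketch. It uses PyTorch's Python API (torch.from_numpy) as a stand-in for the C++ at::from_blob call the backend actually uses, and the frame dimensions and four-frame batch size are illustrative assumptions, not values from the proposal:

```python
import numpy as np
import torch

# Illustrative frame geometry (assumption): 3-channel float planes.
C, H, W = 3, 64, 64

# Simulate four decoded frame buffers as NumPy arrays standing in for
# AVFrame data planes.
frames = [np.random.rand(C, H, W).astype(np.float32) for _ in range(4)]

# Zero-copy wrap: torch.from_numpy shares memory with the source buffer
# (no copy), analogous to at::from_blob wrapping AVFrame data in C++.
tensors = [torch.from_numpy(f) for f in frames]
assert tensors[0].data_ptr() == frames[0].ctypes.data  # same memory

# Batch-mode inference (Phase 3 idea): accumulate C x H x W tensors and
# combine them into a single B x C x H x W batch so one forward pass
# processes all buffered frames. Note torch.stack allocates the batch.
batch = torch.stack(tensors)
print(batch.shape)  # torch.Size([4, 3, 64, 64])
```

The wrap step is where the zero-copy benefit lives: the model sees the frame's own memory, so the only unavoidable allocation is the batch buffer itself.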