From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from ffbox0-bg.ffmpeg.org (ffbox0-bg.ffmpeg.org [79.124.17.100]) by master.gitmailbox.com (Postfix) with ESMTPS id A818049147 for ; Sun, 15 Feb 2026 11:32:57 +0000 (UTC) Authentication-Results: ffbox; dkim=fail (body hash mismatch (got b'xJaVN27rVPMK2S5naB444PWZkAZNSutmcXkx+NZkpTk=', expected b'8OjqYfAS2P+H/n/tVGcuaU1sgXtOf3tDqGmdZd5+8pE=')) header.d=ffmpeg.org header.i=@ffmpeg.org header.a=rsa-sha256 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=ffmpeg.org; i=@ffmpeg.org; q=dns/txt; s=mail; t=1771155151; h=mime-version : to : date : message-id : reply-to : subject : list-id : list-archive : list-archive : list-help : list-owner : list-post : list-subscribe : list-unsubscribe : from : cc : content-type : content-transfer-encoding : from; bh=xJaVN27rVPMK2S5naB444PWZkAZNSutmcXkx+NZkpTk=; b=wQASCD5y1zeU/amI1HrGPU2bnRlrk0c13NfZE1WKvVcpf1r/VF8lDhBhR4q1MiSsr3zcM 5+1+BZzLKnd6BZ7hPHl3hp9Pkrw3on/riR5VZu7jtb8fcII8V3XVsIgOaicxUd/Iry+WxaU BpA3SCxnhE8rpSax1/xigWBt7pF070LroB3pfvcb0FDi5nBNapyiDJ7NF1Z9xBFKYd4Bt+5 jYFhn1YWKL5GRtlFOCSJWMe4c5Ir2R1UvL24Jj0gFaQawTSJjgiGjIt70mAqdSOTDCSZI4y k3Yv7IqHwRFrpxJhz42gOUCyhGpl3eIWYfHwMGIdb+8MDYeckRDIwyjKSupA== Received: from [172.20.0.3] (unknown [172.20.0.3]) by ffbox0-bg.ffmpeg.org (Postfix) with ESMTP id 1202869123C; Sun, 15 Feb 2026 13:32:31 +0200 (EET) ARC-Seal: i=1; cv=none; a=rsa-sha256; d=ffmpeg.org; s=arc; t=1771155141; b=jp9xFmrQj+spVwyEbTDv24puJa9zIQs5Dh5djd5ngDRxfqstmoLaDaajmBOnyw9x6jr5e 5n4CmtTkhu1j7MXgLWexkZI9aK+jqdHmnWurObVnLhTNQWw2ptIlIlXeXMxQRYuj768Y1Fn ktWTvJ4vAdQ00hsZIqglr4dFwObUYpnauZTZuzHH97kFkwsBGgcj2wzEPozeqry/cLRBIB7 Bn+/EuroLjCnv+QJV3QS+yCa70nEaKmANlgdGWiODk9SRbOMUt89sntIGK+++vivyqRSeMV 3Uz5vsAThZ9Envhu9OBYuVPKnThtCzVpaWcQaEZctsrtR/9S7B0usWO9Tqeg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=ffmpeg.org; s=arc; t=1771155141; h=from : sender : reply-to : subject : date : message-id : to : cc : mime-version : content-type : content-transfer-encoding : content-id : content-description : resent-date : resent-from : resent-sender : resent-to : resent-cc : resent-message-id : in-reply-to : references : list-id : list-help : list-unsubscribe : list-subscribe : list-post : list-owner : list-archive; bh=Vq1ho2u8AqBIENFNqNt/au8WwtUuZ+09gFG1Ng/l8z8=; b=nE/fRfP+cWLz/pRYdQ/v9j3/RucUirR4NsJU4b+7I9/AUwf2U1odlOp8mr85t6NNk7eak wVBNItTNLarK1BqGOq/jKlk/Yq4md8ky2ZnlqeQWwtUL3rgEEfbBlxd300pAH+fL6hV7Wig RIBpJ1XVSSfkl8SzgMcRfu3J886Ylvvvb2L1JNtyqypg/VOoCdkNHM/FONjlaxh00rZZlGy ib6h7LPSF326KGmXj24D1AY6tIwYpmVJEuSTQeD9eCuunFZ/ybjhzENKZpgBw70hQCoi8hK QqWXzXFBkAcE2ZJMFDIFf29qMJJLC5oeI+51qf68AQ98W57eZP+Ap6B7Bd3g== ARC-Authentication-Results: i=1; ffmpeg.org; dkim=pass header.d=ffmpeg.org header.i=@ffmpeg.org; arc=none; dmarc=pass header.from=ffmpeg.org policy.dmarc=quarantine Authentication-Results: ffmpeg.org; dkim=pass header.d=ffmpeg.org header.i=@ffmpeg.org; arc=none (Message is not ARC signed); dmarc=pass (Used From Domain Record) header.from=ffmpeg.org policy.dmarc=quarantine DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=ffmpeg.org; i=@ffmpeg.org; q=dns/txt; s=mail; t=1771155135; h=content-type : mime-version : content-transfer-encoding : from : to : reply-to : subject : date : from; bh=8OjqYfAS2P+H/n/tVGcuaU1sgXtOf3tDqGmdZd5+8pE=; b=XmoxYZi+Ua24rvVqaPPKoH+PPcOfjRunl8nSFRNvZlIaZUJAAXVzdRIAQgHYXQPjdsCcV YcMeiFjgHqIZOAn5SJHEcoKyajQMRb0zOmflYmNH3Un+znZwt1sVbd6PE3a6GU0hN2vAm0L TmN9YaMaiFwDZSNn83Q7e17iFdBiIY8TsByoViyHMus/suiMs4c4VPd1XQFXRsXxE2zTfGT HYnu2fzJYwueORDJmmP8l3EVvJiVtPH06sFQx38OOMHwnt9wUIegPxjkkA/k3MvSgBxL3Gw J5c7EKZ/t7m19Azu13ths8hZQxYPTqyuWaD3Mo9F0//4vvu6Pm7TVsqine6A== Received: from c8d966988b92 (code.ffmpeg.org [188.245.149.3]) by ffbox0-bg.ffmpeg.org (Postfix) with ESMTPS id DD4AE6911E3 for ; Sun, 15 Feb 2026 13:32:14 +0200 (EET) MIME-Version: 1.0 To: ffmpeg-devel@ffmpeg.org Date: Sun, 15 Feb 2026 11:32:14 -0000 Message-ID: <177115513506.25.10469838722626100490@009cbcb3d8cd> Message-ID-Hash: XFIO4X7UKGFBRSJHUMLH52DUHR56LXST X-Message-ID-Hash: XFIO4X7UKGFBRSJHUMLH52DUHR56LXST X-MailFrom: code@ffmpeg.org X-Mailman-Rule-Hits: nonmember-moderation X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; loop; banned-address; header-match-ffmpeg-devel.ffmpeg.org-0; header-match-ffmpeg-devel.ffmpeg.org-1; header-match-ffmpeg-devel.ffmpeg.org-2; header-match-ffmpeg-devel.ffmpeg.org-3; emergency; member-moderation X-Mailman-Version: 3.3.10 Precedence: list Reply-To: FFmpeg development discussions and patches Subject: [FFmpeg-devel] [PR] qpel_h16_v1 (PR #21761) List-Id: FFmpeg development discussions and patches Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: From: Jun Zhao via ffmpeg-devel Cc: Jun Zhao Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Archived-At: List-Archive: List-Post: PR #21761 opened by Jun Zhao (mypopydev) URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/21761 Patch URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/21761.patch These two commits clean up and optimize the HEVC/VVC qpel horizontal filter NEON assembly. The first removes a duplicate mov mx, x30 instruction in the VVC qpel h16/h32 functions where the link register save was needlessly performed twice. The second rewrites the HEVC qpel H-pass for width >= 16 to use byte-domain widening multiply (calc_qpelb/calc_qpelb2) instead of the previous int16-domain approach, eliminating uxtl expansion and bl/ret call overhead, yielding a ~1.39x speedup on the H-pass itself and ~1.16x geometric mean improvement across all HV-path variants on Apple M4; VVC qpel h16/h32 are separated into self-contained int16-domain functions since VVC filters are incompatible with the hardcoded sign pattern used in the byte-domain macros. >>From 70b6e89c5a604612498e39e700dea0dafdb05bf1 Mon Sep 17 00:00:00 2001 From: Jun Zhao Date: Sun, 15 Feb 2026 19:27:10 +0800 Subject: [PATCH 1/2] lavc/vvc: remove duplicate 'mov mx, x30' in VVC qpel h16/h32 The VVC qpel h16 and h32 functions had a redundant 'mov mx, x30' instruction. The first one was placed before vvc_load_filter had finished using mx (the filter pointer argument), making it a dead store immediately overwritten by the second 'mov mx, x30'. Remove the first instance and reorder so that 'sub src, src, #3' comes before 'mov mx, x30', ensuring the filter pointer in mx is fully consumed by vvc_load_filter before being overwritten with the link register. Signed-off-by: Jun Zhao --- libavcodec/aarch64/h26x/qpel_neon.S | 2 -- 1 file changed, 2 deletions(-) diff --git a/libavcodec/aarch64/h26x/qpel_neon.S b/libavcodec/aarch64/h26x/qpel_neon.S index 7901fedaf3..b7d2e0f34a 100644 --- a/libavcodec/aarch64/h26x/qpel_neon.S +++ b/libavcodec/aarch64/h26x/qpel_neon.S @@ -556,7 +556,6 @@ endfunc function ff_vvc_put_\type\()_h16_8_neon, export=1 vvc_load_filter mx sxtw height, heightw - mov mx, x30 sub src, src, #3 mov mx, x30 .ifc \type, qpel @@ -634,7 +633,6 @@ endfunc function ff_vvc_put_\type\()_h32_8_neon, export=1 vvc_load_filter mx sxtw height, heightw - mov mx, x30 sub src, src, #3 mov mx, x30 .ifc \type, qpel -- 2.52.0 >>From 912c260a0d37fc205d27eeae2e448eb4bd73106f Mon Sep 17 00:00:00 2001 From: Jun Zhao Date: Sun, 15 Feb 2026 13:23:24 +0800 Subject: [PATCH 2/2] lavc/hevc: optimize qpel H-pass for width>=16 with byte-domain widening multiply Rewrite ff_hevc_put_hevc_qpel_h16_8_neon and h32 to use byte-domain widening multiply (umull/umlal/umlsl via calc_qpelb/calc_qpelb2 macros) instead of the previous int16-domain approach (uxtl + mul/mla). The byte-domain approach eliminates the uxtl expansion step and halves the ext stride (1 byte vs 2 bytes per tap), reducing per-row instruction count from ~32 to ~23. The functions are also inlined, removing bl/ret call overhead. This benefits all HV-path callers (hv/uni_hv/bi_hv/uni_w_hv/bi_w_hv) at widths 16/32/48/64. checkasm benchmarks on Apple M4 (5-run average): H-pass standalone (NEON): h16: 34.0 -> 24.4 cycles (1.39x speedup) h32: 132.0 -> 95.0 cycles (1.39x speedup) h64: 521.8 -> 373.9 cycles (1.40x speedup) HV compound paths geometric mean speedup (NEON, width >= 16): qpel_hv: 1.144x (4 functions) qpel_bi_hv: 1.158x (4 functions) qpel_uni_hv: 1.188x (4 functions) qpel_uni_w_hv: 1.158x (3 functions) Overall: 1.162x (15 functions) VVC qpel h16/h32 are separated into self-contained functions retaining the int16-domain approach, as VVC filters have arbitrary coefficients incompatible with the hardcoded sign pattern in calc_qpelb. Signed-off-by: Jun Zhao --- libavcodec/aarch64/h26x/qpel_neon.S | 165 ++++++++++++++++++++++------ 1 file changed, 129 insertions(+), 36 deletions(-) diff --git a/libavcodec/aarch64/h26x/qpel_neon.S b/libavcodec/aarch64/h26x/qpel_neon.S index b7d2e0f34a..423db38491 100644 --- a/libavcodec/aarch64/h26x/qpel_neon.S +++ b/libavcodec/aarch64/h26x/qpel_neon.S @@ -552,20 +552,64 @@ function ff_hevc_put_hevc_\type\()_h12_8_neon, export=1 ret mx endfunc +.ifc \type, qpel +// VVC qpel h16: self-contained int16-domain implementation +function ff_vvc_put_qpel_h16_8_neon, export=1 + vvc_load_filter mx + sxtw height, heightw + sub src, src, #3 + mov mx, x30 + mov dststride, #(VVC_MAX_PB_SIZE << 1) + lsl x13, srcstride, #1 // srcstridel + mov x14, #(VVC_MAX_PB_SIZE << 2) + add x10, dst, dststride // dstb + add x12, src, srcstride // srcb +1: ld1 {v16.8b-v18.8b}, [src], x13 + ld1 {v19.8b-v21.8b}, [x12], x13 + uxtl v16.8h, v16.8b + uxtl v19.8h, v19.8b + bl ff_hevc_put_hevc_h16_8_neon + subs height, height, #2 + st1 {v26.8h, v27.8h}, [dst], x14 + st1 {v28.8h, v29.8h}, [x10], x14 + b.gt 1b // double line + ret mx +endfunc + +// HEVC qpel h16: byte-domain widening multiply +function ff_hevc_put_hevc_qpel_h16_8_neon, export=1 + load_qpel_filterb mx, x15 + sxtw height, heightw + sub src, src, #3 + mov dststride, #(HEVC_MAX_PB_SIZE << 1) +1: + ld1 {v16.16b, v17.16b}, [src], srcstride + ext v18.16b, v16.16b, v17.16b, #1 + ext v19.16b, v16.16b, v17.16b, #2 + ext v20.16b, v16.16b, v17.16b, #3 + ext v21.16b, v16.16b, v17.16b, #4 + ext v22.16b, v16.16b, v17.16b, #5 + ext v23.16b, v16.16b, v17.16b, #6 + ext v24.16b, v16.16b, v17.16b, #7 + calc_qpelb v26, v16, v18, v19, v20, v21, v22, v23, v24 + calc_qpelb2 v27, v16, v18, v19, v20, v21, v22, v23, v24 + stp q26, q27, [dst] + add dst, dst, dststride + subs height, height, #1 + b.gt 1b + ret +endfunc + +.else // qpel_uni, qpel_bi + .ifnc \type, qpel_bi function ff_vvc_put_\type\()_h16_8_neon, export=1 vvc_load_filter mx sxtw height, heightw sub src, src, #3 mov mx, x30 -.ifc \type, qpel - mov dststride, #(VVC_MAX_PB_SIZE << 1) - lsl x13, srcstride, #1 // srcstridel - mov x14, #(VVC_MAX_PB_SIZE << 2) -.else lsl x14, dststride, #1 // dststridel lsl x13, srcstride, #1 // srcstridel -.endif b 0f endfunc .endif // !qpel_bi @@ -581,14 +625,8 @@ function ff_hevc_put_hevc_\type\()_h16_8_neon, export=1 .endif sub src, src, #3 mov mx, x30 -.ifc \type, qpel - mov dststride, #(HEVC_MAX_PB_SIZE << 1) - lsl x13, srcstride, #1 // srcstridel - mov x14, #(HEVC_MAX_PB_SIZE << 2) -.else lsl x14, dststride, #1 // dststridel lsl x13, srcstride, #1 // srcstridel -.endif 0: add x10, dst, dststride // dstb add x12, src, srcstride // srcb @@ -601,10 +639,6 @@ function ff_hevc_put_hevc_\type\()_h16_8_neon, export=1 bl ff_hevc_put_hevc_h16_8_neon subs height, height, #2 -.ifc \type, qpel - st1 {v26.8h, v27.8h}, [dst], x14 - st1 {v28.8h, v29.8h}, [x10], x14 -.else .ifc \type, qpel_bi ld1 {v16.8h, v17.8h}, [ x4], x16 ld1 {v18.8h, v19.8h}, [x15], x16 @@ -624,27 +658,96 @@ function ff_hevc_put_hevc_\type\()_h16_8_neon, export=1 .endif st1 {v26.8b, v27.8b}, [dst], x14 st1 {v28.8b, v29.8b}, [x10], x14 -.endif b.gt 1b // double line ret mx endfunc +.endif // qpel vs qpel_uni/qpel_bi + +.ifc \type, qpel +// VVC qpel h32: self-contained int16-domain implementation +function ff_vvc_put_qpel_h32_8_neon, export=1 + vvc_load_filter mx + sxtw height, heightw + mov mx, x30 + sub src, src, #3 + mov dststride, #(VVC_MAX_PB_SIZE << 1) + lsl x13, srcstride, #1 // srcstridel + mov x14, #(VVC_MAX_PB_SIZE << 2) + sub x14, x14, width, uxtw #1 + sub x13, x13, width, uxtw + sub x13, x13, #8 + add x10, dst, dststride // dstb + add x12, src, srcstride // srcb +0: mov w9, width + ld1 {v16.8b}, [src], #8 + ld1 {v19.8b}, [x12], #8 + uxtl v16.8h, v16.8b + uxtl v19.8h, v19.8b +1: + ld1 {v17.8b-v18.8b}, [src], #16 + ld1 {v20.8b-v21.8b}, [x12], #16 + bl ff_hevc_put_hevc_h16_8_neon + subs w9, w9, #16 + mov v16.16b, v18.16b + mov v19.16b, v21.16b + st1 {v26.8h, v27.8h}, [dst], #32 + st1 {v28.8h, v29.8h}, [x10], #32 + b.gt 1b // double line + subs height, height, #2 + add src, src, x13 + add x12, x12, x13 + add dst, dst, x14 + add x10, x10, x14 + b.gt 0b + ret mx +endfunc + +// HEVC qpel h32: byte-domain widening multiply with width loop +function ff_hevc_put_hevc_qpel_h32_8_neon, export=1 + load_qpel_filterb mx, x15 + sxtw height, heightw + sub src, src, #3 + mov dststride, #(HEVC_MAX_PB_SIZE << 1) + sub x13, dststride, width, uxtw #1 // stride adjustment +0: + mov w9, width + mov x10, src + mov x11, dst +1: + ld1 {v16.16b, v17.16b}, [x10] + add x10, x10, #16 + ext v18.16b, v16.16b, v17.16b, #1 + ext v19.16b, v16.16b, v17.16b, #2 + ext v20.16b, v16.16b, v17.16b, #3 + ext v21.16b, v16.16b, v17.16b, #4 + ext v22.16b, v16.16b, v17.16b, #5 + ext v23.16b, v16.16b, v17.16b, #6 + ext v24.16b, v16.16b, v17.16b, #7 + calc_qpelb v26, v16, v18, v19, v20, v21, v22, v23, v24 + calc_qpelb2 v27, v16, v18, v19, v20, v21, v22, v23, v24 + stp q26, q27, [x11], #32 + subs w9, w9, #16 + b.gt 1b + add src, src, srcstride + add dst, dst, x13 + add dst, dst, width, uxtw #1 + subs height, height, #1 + b.gt 0b + ret +endfunc + +.else // qpel_uni, qpel_bi + .ifnc \type, qpel_bi function ff_vvc_put_\type\()_h32_8_neon, export=1 vvc_load_filter mx sxtw height, heightw sub src, src, #3 mov mx, x30 -.ifc \type, qpel - mov dststride, #(VVC_MAX_PB_SIZE << 1) - lsl x13, srcstride, #1 // srcstridel - mov x14, #(VVC_MAX_PB_SIZE << 2) - sub x14, x14, width, uxtw #1 -.else lsl x14, dststride, #1 // dststridel lsl x13, srcstride, #1 // srcstridel sub x14, x14, width, uxtw -.endif b 1f endfunc .endif // !qpel_bi @@ -662,16 +765,9 @@ function ff_hevc_put_hevc_\type\()_h32_8_neon, export=1 .endif sub src, src, #3 mov mx, x30 -.ifc \type, qpel - mov dststride, #(HEVC_MAX_PB_SIZE << 1) - lsl x13, srcstride, #1 // srcstridel - mov x14, #(HEVC_MAX_PB_SIZE << 2) - sub x14, x14, width, uxtw #1 -.else lsl x14, dststride, #1 // dststridel lsl x13, srcstride, #1 // srcstridel sub x14, x14, width, uxtw -.endif 1: sub x13, x13, width, uxtw sub x13, x13, #8 @@ -691,10 +787,6 @@ function ff_hevc_put_hevc_\type\()_h32_8_neon, export=1 mov v16.16b, v18.16b mov v19.16b, v21.16b -.ifc \type, qpel - st1 {v26.8h, v27.8h}, [dst], #32 - st1 {v28.8h, v29.8h}, [x10], #32 -.else .ifc \type, qpel_bi ld1 {v20.8h, v21.8h}, [ x4], #32 ld1 {v22.8h, v23.8h}, [x15], #32 @@ -714,7 +806,6 @@ function ff_hevc_put_hevc_\type\()_h32_8_neon, export=1 .endif st1 {v26.8b, v27.8b}, [dst], #16 st1 {v28.8b, v29.8b}, [x10], #16 -.endif b.gt 1b // double line subs height, height, #2 add src, src, x13 @@ -729,6 +820,8 @@ function ff_hevc_put_hevc_\type\()_h32_8_neon, export=1 ret mx endfunc +.endif // qpel vs qpel_uni/qpel_bi + .unreq height .unreq heightw .unreq width -- 2.52.0 _______________________________________________ ffmpeg-devel mailing list -- ffmpeg-devel@ffmpeg.org To unsubscribe send an email to ffmpeg-devel-leave@ffmpeg.org