From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
To: ffmpeg-devel@ffmpeg.org
Date: Wed, 22 Oct 2025 18:30:05 -0000
Message-ID: <176115780645.25.2812638608647755938@7d278768979e>
Precedence: list
Reply-To: FFmpeg development discussions and patches
Subject: [FFmpeg-devel] [PATCH] aarch64/vvc: Optimisations of put_luma_h() functions for 10/12-bit (PR #20737)
List-Id: FFmpeg development discussions and patches
From: "george.zaguri via ffmpeg-devel"
Cc: "george.zaguri"
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

PR #20737 opened by george.zaguri
URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20737
Patch URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20737.patch

RPi4:
put_luma_h_10_4x4_c:            261.8 ( 1.00x)
put_luma_h_10_8x8_c:           1051.5 ( 1.00x)
put_luma_h_10_8x8_neon:         231.5 ( 4.54x)
put_luma_h_10_16x16_c:         4131.0 ( 1.00x)
put_luma_h_10_16x16_neon:       848.6 ( 4.87x)
put_luma_h_10_32x32_c:        16469.5 ( 1.00x)
put_luma_h_10_32x32_neon:      3345.6 ( 4.92x)
put_luma_h_10_64x64_c:        66734.0 ( 1.00x)
put_luma_h_10_64x64_neon:     14586.9 ( 4.57x)
put_luma_h_10_128x128_c:     264228.9 ( 1.00x)
put_luma_h_10_128x128_neon:   52199.7 ( 5.06x)
put_luma_h_12_4x4_c:            262.1 ( 1.00x)
put_luma_h_12_8x8_c:           1051.3 ( 1.00x)
put_luma_h_12_8x8_neon:         230.9 ( 4.55x)
put_luma_h_12_16x16_c:         4124.4 ( 1.00x)
put_luma_h_12_16x16_neon:       848.0 ( 4.86x)
put_luma_h_12_32x32_c:        16446.9 ( 1.00x)
put_luma_h_12_32x32_neon:      3347.4 ( 4.91x)
put_luma_h_12_64x64_c:        66770.1 ( 1.00x)
put_luma_h_12_64x64_neon:     14360.2 ( 4.65x)
put_luma_h_12_128x128_c:     264419.5 ( 1.00x)
put_luma_h_12_128x128_neon:   52200.6 ( 5.07x)

M2 Air (with auto-vectorization feature):
put_luma_h_10_4x4_c:              0.3 ( 1.00x)
put_luma_h_10_8x8_c:              1.0 ( 1.00x)
put_luma_h_10_8x8_neon:           0.4 ( 2.58x)
put_luma_h_10_16x16_c:            3.0 ( 1.00x)
put_luma_h_10_16x16_neon:         1.5 ( 2.01x)
put_luma_h_10_32x32_c:            9.7 ( 1.00x)
put_luma_h_10_32x32_neon:         6.2 ( 1.57x)
put_luma_h_10_64x64_c:           36.6 ( 1.00x)
put_luma_h_10_64x64_neon:        23.9 ( 1.53x)
put_luma_h_10_128x128_c:        134.2 ( 1.00x)
put_luma_h_10_128x128_neon:      95.4 ( 1.41x)
put_luma_h_12_4x4_c:              0.3 ( 1.00x)
put_luma_h_12_8x8_c:              1.0 ( 1.00x)
put_luma_h_12_8x8_neon:           0.4 ( 2.57x)
put_luma_h_12_16x16_c:            3.0 ( 1.00x)
put_luma_h_12_16x16_neon:         1.5 ( 2.01x)
put_luma_h_12_32x32_c:            9.7 ( 1.00x)
put_luma_h_12_32x32_neon:         6.0 ( 1.63x)
put_luma_h_12_64x64_c:           36.5 ( 1.00x)
put_luma_h_12_64x64_neon:        23.9 ( 1.53x)
put_luma_h_12_128x128_c:        134.8 ( 1.00x)
put_luma_h_12_128x128_neon:      95.2 ( 1.42x)


>>From dba0d5709658f01e40496d1f1fc8a1832e21b708 Mon Sep 17 00:00:00 2001
From: Georgii Zagoruiko
Date: Wed, 22 Oct 2025 19:22:23 +0100
Subject: [PATCH] aarch64/vvc: Optimisations of put_luma_h() functions for
 10/12-bit

RPi4:
put_luma_h_10_4x4_c:            261.8 ( 1.00x)
put_luma_h_10_8x8_c:           1051.5 ( 1.00x)
put_luma_h_10_8x8_neon:         231.5 ( 4.54x)
put_luma_h_10_16x16_c:         4131.0 ( 1.00x)
put_luma_h_10_16x16_neon:       848.6 ( 4.87x)
put_luma_h_10_32x32_c:        16469.5 ( 1.00x)
put_luma_h_10_32x32_neon:      3345.6 ( 4.92x)
put_luma_h_10_64x64_c:        66734.0 ( 1.00x)
put_luma_h_10_64x64_neon:     14586.9 ( 4.57x)
put_luma_h_10_128x128_c:     264228.9 ( 1.00x)
put_luma_h_10_128x128_neon:   52199.7 ( 5.06x)
put_luma_h_12_4x4_c:            262.1 ( 1.00x)
put_luma_h_12_8x8_c:           1051.3 ( 1.00x)
put_luma_h_12_8x8_neon:         230.9 ( 4.55x)
put_luma_h_12_16x16_c:         4124.4 ( 1.00x)
put_luma_h_12_16x16_neon:       848.0 ( 4.86x)
put_luma_h_12_32x32_c:        16446.9 ( 1.00x)
put_luma_h_12_32x32_neon:      3347.4 ( 4.91x)
put_luma_h_12_64x64_c:        66770.1 ( 1.00x)
put_luma_h_12_64x64_neon:     14360.2 ( 4.65x)
put_luma_h_12_128x128_c:     264419.5 ( 1.00x)
put_luma_h_12_128x128_neon:   52200.6 ( 5.07x)

M2 Air (with auto-vectorization feature):
put_luma_h_10_4x4_c:              0.3 ( 1.00x)
put_luma_h_10_8x8_c:              1.0 ( 1.00x)
put_luma_h_10_8x8_neon:           0.4 ( 2.58x)
put_luma_h_10_16x16_c:            3.0 ( 1.00x)
put_luma_h_10_16x16_neon:         1.5 ( 2.01x)
put_luma_h_10_32x32_c:            9.7 ( 1.00x)
put_luma_h_10_32x32_neon:         6.2 ( 1.57x)
put_luma_h_10_64x64_c:           36.6 ( 1.00x)
put_luma_h_10_64x64_neon:        23.9 ( 1.53x)
put_luma_h_10_128x128_c:        134.2 ( 1.00x)
put_luma_h_10_128x128_neon:      95.4 ( 1.41x)
put_luma_h_12_4x4_c:              0.3 ( 1.00x)
put_luma_h_12_8x8_c:              1.0 ( 1.00x)
put_luma_h_12_8x8_neon:           0.4 ( 2.57x)
put_luma_h_12_16x16_c:            3.0 ( 1.00x)
put_luma_h_12_16x16_neon:         1.5 ( 2.01x)
put_luma_h_12_32x32_c:            9.7 ( 1.00x)
put_luma_h_12_32x32_neon:         6.0 ( 1.63x)
put_luma_h_12_64x64_c:           36.5 ( 1.00x)
put_luma_h_12_64x64_neon:        23.9 ( 1.53x)
put_luma_h_12_128x128_c:        134.8 ( 1.00x)
put_luma_h_12_128x128_neon:      95.2 ( 1.42x)
---
 libavcodec/aarch64/vvc/dsp_init.c | 22 ++++++++
 libavcodec/aarch64/vvc/inter.S    | 90 +++++++++++++++++++++++++++++++
 2 files changed, 112 insertions(+)

diff --git a/libavcodec/aarch64/vvc/dsp_init.c b/libavcodec/aarch64/vvc/dsp_init.c
index b7dc1d89f8..053d453fa7 100644
--- a/libavcodec/aarch64/vvc/dsp_init.c
+++ b/libavcodec/aarch64/vvc/dsp_init.c
@@ -30,6 +30,18 @@
 #define BDOF_BLOCK_SIZE         16
 #define BDOF_MIN_BLOCK_SIZE     4
 
+void ff_vvc_put_luma_h8_10_neon(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
+                                const int height, const int8_t *hf, const int8_t *vf, const int width);
+void ff_vvc_put_luma_h16_10_neon(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
+                                 const int height, const int8_t *hf, const int8_t *vf, const int width);
+void ff_vvc_put_luma_h_x16_10_neon(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
+                                   const int height, const int8_t *hf, const int8_t *vf, const int width);
+void ff_vvc_put_luma_h8_12_neon(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
+                                const int height, const int8_t *hf, const int8_t *vf, const int width);
+void ff_vvc_put_luma_h16_12_neon(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
+                                 const int height, const int8_t *hf, const int8_t *vf, const int width);
+void ff_vvc_put_luma_h_x16_12_neon(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
+                                   const int height, const int8_t *hf, const int8_t *vf, const int width);
 void ff_alf_classify_sum_neon(int *sum0, int *sum1, int16_t *grad, uint32_t gshift, uint32_t steps);
 
@@ -245,6 +257,11 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
         c->inter.dmvr[0][1] = ff_vvc_dmvr_h_10_neon;
         c->inter.dmvr[1][1] = ff_vvc_dmvr_hv_10_neon;
         c->inter.apply_bdof = ff_vvc_apply_bdof_10_neon;
+        c->inter.put[0][2][0][1] = ff_vvc_put_luma_h8_10_neon;
+        c->inter.put[0][3][0][1] = ff_vvc_put_luma_h16_10_neon;
+        c->inter.put[0][4][0][1] =
+        c->inter.put[0][5][0][1] =
+        c->inter.put[0][6][0][1] = ff_vvc_put_luma_h_x16_10_neon;
 
         c->alf.filter[LUMA] = alf_filter_luma_10_neon;
         c->alf.filter[CHROMA] = alf_filter_chroma_10_neon;
@@ -256,6 +273,11 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
         c->inter.dmvr[0][1] = ff_vvc_dmvr_h_12_neon;
         c->inter.dmvr[1][1] = ff_vvc_dmvr_hv_12_neon;
         c->inter.apply_bdof = ff_vvc_apply_bdof_12_neon;
+        c->inter.put[0][2][0][1] = ff_vvc_put_luma_h8_12_neon;
+        c->inter.put[0][3][0][1] = ff_vvc_put_luma_h16_12_neon;
+        c->inter.put[0][4][0][1] =
+        c->inter.put[0][5][0][1] =
+        c->inter.put[0][6][0][1] = ff_vvc_put_luma_h_x16_12_neon;
 
         c->alf.filter[LUMA] = alf_filter_luma_12_neon;
         c->alf.filter[CHROMA] = alf_filter_chroma_12_neon;
diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index a874edf889..be37df8e38 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -1713,3 +1713,93 @@ endfunc
 #undef GRADIENT_V1_OFFSET
 #undef VX_OFFSET
 #undef VY_OFFSET
+
+#define VVC_MAX_PB_SIZE 128
+
+.macro put_luma_h_x8_16bit_ n, is_w_loop, shift
+        // dst         .req x0
+        // _src        .req x1
+        // _src_stride .req x2
+        // height      .req w3
+        // hf          .req x4
+        // vf          .req x5
+        // width       .req w6
+        mov             x9, #(VVC_MAX_PB_SIZE * 2)
+        ld1             {v0.8b}, [x4]
+        sxtl            v0.8h, v0.8b
+        mov             w7, #0          // y loop: height
+1:
+        sub             x11, x1, #6
+        add             w7, w7, #1
+        ld1             {v20.8h}, [x11], #16
+        mov             x10, x0
+    .if \is_w_loop == 1
+        mov             w8, #(8*\n)
+2:
+    .endif
+    .rept \n
+        ld1             {v16.8h}, [x11], #16
+        ext             v1.16b, v20.16b, v16.16b, #2
+        ext             v2.16b, v20.16b, v16.16b, #4
+        ext             v3.16b, v20.16b, v16.16b, #6
+        ext             v4.16b, v20.16b, v16.16b, #8
+        ext             v5.16b, v20.16b, v16.16b, #10
+        ext             v6.16b, v20.16b, v16.16b, #12
+        ext             v17.16b, v20.16b, v16.16b, #14
+        smull           v21.4s, v20.4h, v0.h[0]
+        smull2          v22.4s, v20.8h, v0.h[0]
+        smlal           v21.4s, v1.4h, v0.h[1]
+        smlal2          v22.4s, v1.8h, v0.h[1]
+        smlal           v21.4s, v2.4h, v0.h[2]
+        smlal2          v22.4s, v2.8h, v0.h[2]
+        smlal           v21.4s, v3.4h, v0.h[3]
+        smlal2          v22.4s, v3.8h, v0.h[3]
+        smlal           v21.4s, v4.4h, v0.h[4]
+        smlal2          v22.4s, v4.8h, v0.h[4]
+        smlal           v21.4s, v5.4h, v0.h[5]
+        smlal2          v22.4s, v5.8h, v0.h[5]
+        smlal           v21.4s, v6.4h, v0.h[6]
+        smlal2          v22.4s, v6.8h, v0.h[6]
+        smlal           v21.4s, v17.4h, v0.h[7]
+        smlal2          v22.4s, v17.8h, v0.h[7]
+        sqshrn          v21.4h, v21.4s, #(\shift)
+        sqshrn          v22.4h, v22.4s, #(\shift)
+        st1             {v21.4h, v22.4h}, [x10], #16
+        mov             v20.16b, v16.16b
+    .endr
+    .if \is_w_loop == 1
+        cmp             w8, w6
+        add             w8, w8, #(8*\n)
+        b.lt            2b
+    .endif
+        cmp             w7, w3
+        add             x0, x0, x9
+        add             x1, x1, x2
+        b.lt            1b
+        ret
+.endm
+
+function ff_vvc_put_luma_h8_10_neon, export=1
+        put_luma_h_x8_16bit_ 1, 0, 2
+endfunc
+
+function ff_vvc_put_luma_h8_12_neon, export=1
+        put_luma_h_x8_16bit_ 1, 0, 4
+endfunc
+
+function ff_vvc_put_luma_h16_10_neon, export=1
+        put_luma_h_x8_16bit_ 2, 0, 2
+endfunc
+
+function ff_vvc_put_luma_h16_12_neon, export=1
+        put_luma_h_x8_16bit_ 2, 0, 4
+endfunc
+
+function ff_vvc_put_luma_h_x16_10_neon, export=1
+        put_luma_h_x8_16bit_ 2, 1, 2
+endfunc
+
+function ff_vvc_put_luma_h_x16_12_neon, export=1
+        put_luma_h_x8_16bit_ 2, 1, 4
+endfunc
+
-- 
2.49.1
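
For readers following the assembly, the per-sample arithmetic of the put_luma_h_x8_16bit_ macro can be sketched in scalar C. This is only an illustration inferred from the macro in this patch: eight signed taps applied to src[x-3]..src[x+4], 32-bit accumulation, a saturating right shift by 2 (10-bit) or 4 (12-bit), and a destination row stride of VVC_MAX_PB_SIZE samples. The names ref_put_luma_h and sat16 and the explicit shift parameter are hypothetical and not part of FFmpeg; the unused vf argument is dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PB_SIZE 128 /* same value as VVC_MAX_PB_SIZE in the patch */

/* Clamp to the int16_t range, mirroring the saturation done by sqshrn. */
static int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Hypothetical scalar reference (not FFmpeg code).
 * src_bytes points at sample x = 0 of the first row, src_stride is in bytes,
 * and 3 samples before plus 4 samples after each row must be readable. */
static void ref_put_luma_h(int16_t *dst, const uint8_t *src_bytes,
                           ptrdiff_t src_stride, int height,
                           const int8_t *hf, int width, int shift)
{
    assert(shift >= 0 && shift < 16);
    for (int y = 0; y < height; y++) {
        const uint8_t *row = src_bytes + y * src_stride;
        for (int x = 0; x < width; x++) {
            int32_t sum = 0;
            for (int k = 0; k < 8; k++) {
                uint16_t s;
                /* alignment-safe load of the 16-bit sample at x + k - 3 */
                memcpy(&s, row + 2 * (x + k - 3), sizeof(s));
                sum += hf[k] * (int32_t)s;
            }
            /* Assumes >> on a negative int is an arithmetic shift,
             * which holds on the compilers commonly used for FFmpeg. */
            dst[x] = sat16(sum >> shift);
        }
        dst += MAX_PB_SIZE; /* the macro advances dst by VVC_MAX_PB_SIZE * 2 bytes */
    }
}
```

With the tap at index 3 set to 64 and all other taps zero, each output is the input sample times 16 for the 10-bit shift of 2, which gives a quick sanity check when comparing against the NEON output.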