From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
To: ffmpeg-devel@ffmpeg.org
Date: Sat, 24 Jan 2026 09:38:42 -0000
Message-ID: <176924752281.25.8626360802131877203@4457048688e7>
Message-ID-Hash: NRZ3TCAWO3ADFKEENJ5D2ZEJGZFMXBSQ
X-Message-ID-Hash: NRZ3TCAWO3ADFKEENJ5D2ZEJGZFMXBSQ
X-MailFrom: code@ffmpeg.org
X-Mailman-Rule-Hits: nonmember-moderation
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; loop;
 banned-address; header-match-ffmpeg-devel.ffmpeg.org-0;
 header-match-ffmpeg-devel.ffmpeg.org-1;
 header-match-ffmpeg-devel.ffmpeg.org-2;
 header-match-ffmpeg-devel.ffmpeg.org-3; emergency; member-moderation
X-Mailman-Version: 3.3.10
Precedence: list
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Subject: [FFmpeg-devel] [PR] aarch64/vvc: Optimisations of put_luma_hv() functions for
 10/12-bit (PR #21569)
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
Archived-At: 
 <https://lists.ffmpeg.org/archives/list/ffmpeg-devel@ffmpeg.org/message/NRZ3TCAWO3ADFKEENJ5D2ZEJGZFMXBSQ/>
Archived-At: 
 <https://lists.ffmpeg.org/lore/ffmpeg-devel/176924752281.25.8626360802131877203@4457048688e7/>
List-Archive: 
 <https://lists.ffmpeg.org/archives/list/ffmpeg-devel@ffmpeg.org/>
List-Archive: <https://lists.ffmpeg.org/lore/ffmpeg-devel/>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Owner: <mailto:ffmpeg-devel-owner@ffmpeg.org>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Subscribe: <mailto:ffmpeg-devel-join@ffmpeg.org>
List-Unsubscribe: <mailto:ffmpeg-devel-leave@ffmpeg.org>
From: "george.zaguri via ffmpeg-devel" <ffmpeg-devel@ffmpeg.org>
Cc: "george.zaguri" <code@ffmpeg.org>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Archived-At: <https://master.gitmailbox.com/ffmpegdev/176924752281.25.8626360802131877203@4457048688e7/>
List-Archive: <https://master.gitmailbox.com/ffmpegdev/>
List-Post: <mailto:ffmpegdev@gitmailbox.com>

PR #21569 opened by george.zaguri
URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/21569
Patch URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/21569.patch

Apple M2:
put_luma_hv_10_4x4_c:                                   36.3 ( 1.00x)
put_luma_hv_10_8x8_c:                                   82.9 ( 1.00x)
put_luma_hv_10_8x8_neon:                                34.9 ( 2.37x)
put_luma_hv_10_16x16_c:                                239.2 ( 1.00x)
put_luma_hv_10_16x16_neon:                             119.0 ( 2.01x)
put_luma_hv_10_32x32_c:                                900.3 ( 1.00x)
put_luma_hv_10_32x32_neon:                             429.3 ( 2.10x)
put_luma_hv_10_64x64_c:                               2984.7 ( 1.00x)
put_luma_hv_10_64x64_neon:                            1736.2 ( 1.72x)
put_luma_hv_10_128x128_c:                            11194.2 ( 1.00x)
put_luma_hv_10_128x128_neon:                          6357.3 ( 1.76x)
put_luma_hv_12_4x4_c:                                   35.9 ( 1.00x)
put_luma_hv_12_8x8_c:                                   82.6 ( 1.00x)
put_luma_hv_12_8x8_neon:                                34.3 ( 2.41x)
put_luma_hv_12_16x16_c:                                240.2 ( 1.00x)
put_luma_hv_12_16x16_neon:                             115.3 ( 2.08x)
put_luma_hv_12_32x32_c:                                787.7 ( 1.00x)
put_luma_hv_12_32x32_neon:                             414.2 ( 1.90x)
put_luma_hv_12_64x64_c:                               3058.4 ( 1.00x)
put_luma_hv_12_64x64_neon:                            1592.3 ( 1.92x)
put_luma_hv_12_128x128_c:                            11350.8 ( 1.00x)
put_luma_hv_12_128x128_neon:                          6378.3 ( 1.78x)

RPi4:
put_luma_hv_10_4x4_c:                                  637.8 ( 1.00x)
put_luma_hv_10_8x8_c:                                 1044.9 ( 1.00x)
put_luma_hv_10_8x8_neon:                               483.7 ( 2.16x)
put_luma_hv_10_16x16_c:                               3098.0 ( 1.00x)
put_luma_hv_10_16x16_neon:                            1603.1 ( 1.93x)
put_luma_hv_10_32x32_c:                              10054.8 ( 1.00x)
put_luma_hv_10_32x32_neon:                            5843.6 ( 1.72x)
put_luma_hv_10_64x64_c:                              40506.2 ( 1.00x)
put_luma_hv_10_64x64_neon:                           24384.0 ( 1.66x)
put_luma_hv_10_128x128_c:                           130604.2 ( 1.00x)
put_luma_hv_10_128x128_neon:                         99746.6 ( 1.31x)
put_luma_hv_12_4x4_c:                                  638.2 ( 1.00x)
put_luma_hv_12_8x8_c:                                 1074.6 ( 1.00x)
put_luma_hv_12_8x8_neon:                               482.6 ( 2.23x)
put_luma_hv_12_16x16_c:                               3094.0 ( 1.00x)
put_luma_hv_12_16x16_neon:                            1602.5 ( 1.93x)
put_luma_hv_12_32x32_c:                              10034.4 ( 1.00x)
put_luma_hv_12_32x32_neon:                            5843.3 ( 1.72x)
put_luma_hv_12_64x64_c:                              40447.5 ( 1.00x)
put_luma_hv_12_64x64_neon:                           24377.2 ( 1.66x)
put_luma_hv_12_128x128_c:                           130610.4 ( 1.00x)
put_luma_hv_12_128x128_neon:                         99765.8 ( 1.31x)


>>From 09f1ad09dfd8251181a4a95023699584513387b5 Mon Sep 17 00:00:00 2001
From: Georgii Zagoruiko <george.zaguri@gmail.com>
Date: Sat, 24 Jan 2026 09:35:30 +0000
Subject: [PATCH] aarch64/vvc: Optimisations of put_luma_hv() functions for
 10/12-bit

Apple M2:
put_luma_hv_10_4x4_c:                                   36.3 ( 1.00x)
put_luma_hv_10_8x8_c:                                   82.9 ( 1.00x)
put_luma_hv_10_8x8_neon:                                34.9 ( 2.37x)
put_luma_hv_10_16x16_c:                                239.2 ( 1.00x)
put_luma_hv_10_16x16_neon:                             119.0 ( 2.01x)
put_luma_hv_10_32x32_c:                                900.3 ( 1.00x)
put_luma_hv_10_32x32_neon:                             429.3 ( 2.10x)
put_luma_hv_10_64x64_c:                               2984.7 ( 1.00x)
put_luma_hv_10_64x64_neon:                            1736.2 ( 1.72x)
put_luma_hv_10_128x128_c:                            11194.2 ( 1.00x)
put_luma_hv_10_128x128_neon:                          6357.3 ( 1.76x)
put_luma_hv_12_4x4_c:                                   35.9 ( 1.00x)
put_luma_hv_12_8x8_c:                                   82.6 ( 1.00x)
put_luma_hv_12_8x8_neon:                                34.3 ( 2.41x)
put_luma_hv_12_16x16_c:                                240.2 ( 1.00x)
put_luma_hv_12_16x16_neon:                             115.3 ( 2.08x)
put_luma_hv_12_32x32_c:                                787.7 ( 1.00x)
put_luma_hv_12_32x32_neon:                             414.2 ( 1.90x)
put_luma_hv_12_64x64_c:                               3058.4 ( 1.00x)
put_luma_hv_12_64x64_neon:                            1592.3 ( 1.92x)
put_luma_hv_12_128x128_c:                            11350.8 ( 1.00x)
put_luma_hv_12_128x128_neon:                          6378.3 ( 1.78x)

RPi4:
put_luma_hv_10_4x4_c:                                  637.8 ( 1.00x)
put_luma_hv_10_8x8_c:                                 1044.9 ( 1.00x)
put_luma_hv_10_8x8_neon:                               483.7 ( 2.16x)
put_luma_hv_10_16x16_c:                               3098.0 ( 1.00x)
put_luma_hv_10_16x16_neon:                            1603.1 ( 1.93x)
put_luma_hv_10_32x32_c:                              10054.8 ( 1.00x)
put_luma_hv_10_32x32_neon:                            5843.6 ( 1.72x)
put_luma_hv_10_64x64_c:                              40506.2 ( 1.00x)
put_luma_hv_10_64x64_neon:                           24384.0 ( 1.66x)
put_luma_hv_10_128x128_c:                           130604.2 ( 1.00x)
put_luma_hv_10_128x128_neon:                         99746.6 ( 1.31x)
put_luma_hv_12_4x4_c:                                  638.2 ( 1.00x)
put_luma_hv_12_8x8_c:                                 1074.6 ( 1.00x)
put_luma_hv_12_8x8_neon:                               482.6 ( 2.23x)
put_luma_hv_12_16x16_c:                               3094.0 ( 1.00x)
put_luma_hv_12_16x16_neon:                            1602.5 ( 1.93x)
put_luma_hv_12_32x32_c:                              10034.4 ( 1.00x)
put_luma_hv_12_32x32_neon:                            5843.3 ( 1.72x)
put_luma_hv_12_64x64_c:                              40447.5 ( 1.00x)
put_luma_hv_12_64x64_neon:                           24377.2 ( 1.66x)
put_luma_hv_12_128x128_c:                           130610.4 ( 1.00x)
put_luma_hv_12_128x128_neon:                         99765.8 ( 1.31x)
---
 libavcodec/aarch64/vvc/dsp_init.c |  25 ++-
 libavcodec/aarch64/vvc/inter.S    | 334 ++++++++++++++++++++++++++++++
 2 files changed, 358 insertions(+), 1 deletion(-)

diff --git a/libavcodec/aarch64/vvc/dsp_init.c b/libavcodec/aarch64/vvc/dsp_init.c
index bc2677945e..2c89079545 100644
--- a/libavcodec/aarch64/vvc/dsp_init.c
+++ b/libavcodec/aarch64/vvc/dsp_init.c
@@ -59,7 +59,18 @@ void ff_vvc_put_luma_v16_12_neon(int16_t *dst, const uint8_t *_src, const ptrdif
                                  const int height, const int8_t *hf, const int8_t *vf, const int width);
 void ff_vvc_put_luma_v_x16_12_neon(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
                                    const int height, const int8_t *hf, const int8_t *vf, const int width);
-
+void ff_vvc_put_luma_hv8_10_neon(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
+                                 const int height, const int8_t *hf, const int8_t *vf, const int width);
+void ff_vvc_put_luma_hv8_12_neon(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
+                                 const int height, const int8_t *hf, const int8_t *vf, const int width);
+void ff_vvc_put_luma_hv16_10_neon(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
+                                  const int height, const int8_t *hf, const int8_t *vf, const int width);
+void ff_vvc_put_luma_hv16_12_neon(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
+                                  const int height, const int8_t *hf, const int8_t *vf, const int width);
+void ff_vvc_put_luma_hv_x16_10_neon(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
+                                    const int height, const int8_t *hf, const int8_t *vf, const int width);
+void ff_vvc_put_luma_hv_x16_12_neon(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
+                                    const int height, const int8_t *hf, const int8_t *vf, const int width);
 void ff_alf_classify_sum_neon(int *sum0, int *sum1, int16_t *grad, uint32_t gshift, uint32_t steps);
 
 #define BIT_DEPTH 8
@@ -287,6 +298,12 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
         c->inter.put[0][5][1][0] =
         c->inter.put[0][6][1][0] = ff_vvc_put_luma_v_x16_10_neon;
 
+        c->inter.put[0][2][1][1] = ff_vvc_put_luma_hv8_10_neon;
+        c->inter.put[0][3][1][1] = ff_vvc_put_luma_hv16_10_neon;
+        c->inter.put[0][4][1][1] =
+        c->inter.put[0][5][1][1] =
+        c->inter.put[0][6][1][1] = ff_vvc_put_luma_hv_x16_10_neon;
+
         c->alf.filter[LUMA] = alf_filter_luma_10_neon;
         c->alf.filter[CHROMA] = alf_filter_chroma_10_neon;
         c->alf.classify = alf_classify_10_neon;
@@ -303,6 +320,12 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
         c->inter.put[0][5][0][1] =
         c->inter.put[0][6][0][1] = ff_vvc_put_luma_h_x16_12_neon;
 
+        c->inter.put[0][2][1][1] = ff_vvc_put_luma_hv8_12_neon;
+        c->inter.put[0][3][1][1] = ff_vvc_put_luma_hv16_12_neon;
+        c->inter.put[0][4][1][1] =
+        c->inter.put[0][5][1][1] =
+        c->inter.put[0][6][1][1] = ff_vvc_put_luma_hv_x16_12_neon;
+
         c->inter.put[0][1][1][0] = ff_vvc_put_luma_v4_12_neon;
         c->inter.put[0][2][1][0] = ff_vvc_put_luma_v8_12_neon;
         c->inter.put[0][3][1][0] = ff_vvc_put_luma_v16_12_neon;
diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index 887e456a66..e9dbed188a 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -2224,3 +2224,337 @@ endfunc
 function ff_vvc_put_luma_v_x16_12_neon, export=1
         put_luma_v_x16_xx_neon 4
 endfunc
+
+
+.macro put_luma_hv_x8_vector_filter shift, dst, src0, src1
+        ext             v2.16b, \src0\().16b, \src1\().16b, #2
+        ext             v3.16b, \src0\().16b, \src1\().16b, #4
+        ext             v4.16b, \src0\().16b, \src1\().16b, #6
+        ext             v5.16b, \src0\().16b, \src1\().16b, #8
+        smull           v6.4s, \src0\().4h, v0.h[0]
+        smull2          v7.4s, \src0\().8h, v0.h[0]
+        smlal           v6.4s, v2.4h, v0.h[1]
+        smlal2          v7.4s, v2.8h, v0.h[1]
+        smlal           v6.4s, v3.4h, v0.h[2]
+        smlal2          v7.4s, v3.8h, v0.h[2]
+        smlal           v6.4s, v4.4h, v0.h[3]
+        smlal2          v7.4s, v4.8h, v0.h[3]
+        smlal           v6.4s, v5.4h, v0.h[4]
+        smlal2          v7.4s, v5.8h, v0.h[4]
+        ext             v2.16b, \src0\().16b, \src1\().16b, #10
+        ext             v3.16b, \src0\().16b, \src1\().16b, #12
+        ext             v4.16b, \src0\().16b, \src1\().16b, #14
+        smlal           v6.4s, v2.4h, v0.h[5]
+        smlal2          v7.4s, v2.8h, v0.h[5]
+        smlal           v6.4s, v3.4h, v0.h[6]
+        smlal2          v7.4s, v3.8h, v0.h[6]
+        smlal           v6.4s, v4.4h, v0.h[7]
+        smlal2          v7.4s, v4.8h, v0.h[7]
+        sqshrn          \dst\().4h, v6.4s, #(\shift)
+        sqshrn2         \dst\().8h, v7.4s, #(\shift)
+.endm
+
+.macro put_luma_hv_x8_vertical_filter dst0, dst1, src0, src1, src2, src3, src4, src5, src6, src7
+        smull           \dst0\().4s, \src0\().4h, v1.h[0]
+        smull2          \dst1\().4s, \src0\().8h, v1.h[0]
+        smlal           \dst0\().4s, \src1\().4h, v1.h[1]
+        smlal2          \dst1\().4s, \src1\().8h, v1.h[1]
+        smlal           \dst0\().4s, \src2\().4h, v1.h[2]
+        smlal2          \dst1\().4s, \src2\().8h, v1.h[2]
+        smlal           \dst0\().4s, \src3\().4h, v1.h[3]
+        smlal2          \dst1\().4s, \src3\().8h, v1.h[3]
+        smlal           \dst0\().4s, \src4\().4h, v1.h[4]
+        smlal2          \dst1\().4s, \src4\().8h, v1.h[4]
+        smlal           \dst0\().4s, \src5\().4h, v1.h[5]
+        smlal2          \dst1\().4s, \src5\().8h, v1.h[5]
+        smlal           \dst0\().4s, \src6\().4h, v1.h[6]
+        smlal2          \dst1\().4s, \src6\().8h, v1.h[6]
+        smlal           \dst0\().4s, \src7\().4h, v1.h[7]
+        smlal2          \dst1\().4s, \src7\().8h, v1.h[7]
+        sqshrn          \dst0\().4h, \dst0\().4s, #6
+        sqshrn          \dst1\().4h, \dst1\().4s, #6
+.endm
+
+.macro put_luma_hv8_xx_neon shift
+        mov             x9, #(VVC_MAX_PB_SIZE * 2)
+        sub             x1, x1, #6
+        ld1             {v0.8b}, [x4]
+        sub             x1, x1, x2, lsl #1
+        sxtl            v0.8h, v0.8b
+        ld1             {v1.8b}, [x5]
+        sub             x1, x1, x2
+        sxtl            v1.8h, v1.8b
+        ld1             {v16.8h, v17.8h}, [x1], x2
+        ld1             {v18.8h, v19.8h}, [x1], x2
+        ld1             {v20.8h, v21.8h}, [x1], x2
+        ld1             {v22.8h, v23.8h}, [x1], x2
+        ld1             {v24.8h, v25.8h}, [x1], x2
+        ld1             {v26.8h, v27.8h}, [x1], x2
+        ld1             {v28.8h, v29.8h}, [x1], x2
+        put_luma_hv_x8_vector_filter \shift, v16, v16, v17
+        put_luma_hv_x8_vector_filter \shift, v18, v18, v19
+        put_luma_hv_x8_vector_filter \shift, v20, v20, v21
+        put_luma_hv_x8_vector_filter \shift, v22, v22, v23
+        put_luma_hv_x8_vector_filter \shift, v24, v24, v25
+        put_luma_hv_x8_vector_filter \shift, v26, v26, v27
+        put_luma_hv_x8_vector_filter \shift, v28, v28, v29
+1:
+        ld1             {v30.8h, v31.8h}, [x1], x2
+        put_luma_hv_x8_vector_filter \shift, v30, v30, v31
+        put_luma_hv_x8_vertical_filter v2, v3, v16, v18, v20, v22, v24, v26, v28, v30
+        ld1             {v16.8h, v17.8h}, [x1], x2
+        st1             {v2.4h-v3.4h}, [x0], x9
+        put_luma_hv_x8_vector_filter \shift, v16, v16, v17
+        put_luma_hv_x8_vertical_filter v2, v3, v18, v20, v22, v24, v26, v28, v30, v16
+        ld1             {v18.8h, v19.8h}, [x1], x2
+        st1             {v2.4h-v3.4h}, [x0], x9
+        put_luma_hv_x8_vector_filter \shift, v18, v18, v19
+        put_luma_hv_x8_vertical_filter v2, v3, v20, v22, v24, v26, v28, v30, v16, v18
+        ld1             {v20.8h, v21.8h}, [x1], x2
+        st1             {v2.4h-v3.4h}, [x0], x9
+        put_luma_hv_x8_vector_filter \shift, v20, v20, v21
+        put_luma_hv_x8_vertical_filter v2, v3, v22, v24, v26, v28, v30, v16, v18, v20
+        st1             {v2.4h-v3.4h}, [x0], x9
+
+        mov             v17.16b, v16.16b
+        mov             v16.16b, v24.16b
+        mov             v24.16b, v17.16b
+        mov             v19.16b, v18.16b
+        mov             v18.16b, v26.16b
+        mov             v26.16b, v19.16b
+        mov             v21.16b, v20.16b
+        mov             v20.16b, v28.16b
+        mov             v28.16b, v21.16b
+        subs            w3, w3, #4
+        mov             v22.16b, v30.16b
+        b.gt            1b
+        ret
+.endm
+
+function ff_vvc_put_luma_hv8_10_neon, export=1
+        put_luma_hv8_xx_neon 2
+endfunc
+
+function ff_vvc_put_luma_hv8_12_neon, export=1
+        put_luma_hv8_xx_neon 4
+endfunc
+
+.macro put_luma_hv16_xx_neon shift
+        sub             sp, sp, #128
+        stp             q8,  q9,  [sp, #0]
+        stp             q10, q11, [sp, #32]
+        stp             q12, q13, [sp, #64]
+        stp             q14, q15, [sp, #96]
+
+        mov             x9, #(VVC_MAX_PB_SIZE * 2)
+        sub             x1, x1, #6
+        ld1             {v0.8b}, [x4]
+        sub             x1, x1, x2, lsl #1
+        sxtl            v0.8h, v0.8b
+        ld1             {v1.8b}, [x5]
+        sub             x1, x1, x2
+        sxtl            v1.8h, v1.8b
+        ld1             {v8.8h, v9.8h, v10.8h}, [x1], x2
+        ld1             {v11.8h, v12.8h, v13.8h}, [x1], x2
+        ld1             {v14.8h, v15.8h, v16.8h}, [x1], x2
+        ld1             {v17.8h, v18.8h, v19.8h}, [x1], x2
+        ld1             {v20.8h, v21.8h, v22.8h}, [x1], x2
+        ld1             {v23.8h, v24.8h, v25.8h}, [x1], x2
+        ld1             {v26.8h, v27.8h, v28.8h}, [x1], x2
+        put_luma_hv_x8_vector_filter \shift, v8, v8, v9
+        put_luma_hv_x8_vector_filter \shift, v9, v9, v10
+        put_luma_hv_x8_vector_filter \shift, v11, v11, v12
+        put_luma_hv_x8_vector_filter \shift, v12, v12, v13
+        put_luma_hv_x8_vector_filter \shift, v14, v14, v15
+        put_luma_hv_x8_vector_filter \shift, v15, v15, v16
+        put_luma_hv_x8_vector_filter \shift, v17, v17, v18
+        put_luma_hv_x8_vector_filter \shift, v18, v18, v19
+        put_luma_hv_x8_vector_filter \shift, v20, v20, v21
+        put_luma_hv_x8_vector_filter \shift, v21, v21, v22
+        put_luma_hv_x8_vector_filter \shift, v23, v23, v24
+        put_luma_hv_x8_vector_filter \shift, v24, v24, v25
+        put_luma_hv_x8_vector_filter \shift, v26, v26, v27
+        put_luma_hv_x8_vector_filter \shift, v27, v27, v28
+1:
+        ld1             {v29.8h, v30.8h, v31.8h}, [x1], x2
+        put_luma_hv_x8_vector_filter \shift, v29, v29, v30
+        put_luma_hv_x8_vector_filter \shift, v30, v30, v31
+        put_luma_hv_x8_vertical_filter v2, v3, v8, v11, v14, v17, v20, v23, v26, v29
+        put_luma_hv_x8_vertical_filter v4, v5, v9, v12, v15, v18, v21, v24, v27, v30
+        ld1             {v8.8h, v9.8h, v10.8h}, [x1], x2
+        st1             {v2.4h-v5.4h}, [x0], x9
+        put_luma_hv_x8_vector_filter \shift, v8, v8, v9
+        put_luma_hv_x8_vector_filter \shift, v9, v9, v10
+        put_luma_hv_x8_vertical_filter v2, v3, v11, v14, v17, v20, v23, v26, v29, v8
+        put_luma_hv_x8_vertical_filter v4, v5, v12, v15, v18, v21, v24, v27, v30, v9
+        ld1             {v11.8h, v12.8h, v13.8h}, [x1], x2
+        st1             {v2.4h-v5.4h}, [x0], x9
+        put_luma_hv_x8_vector_filter \shift, v11, v11, v12
+        put_luma_hv_x8_vector_filter \shift, v12, v12, v13
+        put_luma_hv_x8_vertical_filter v2, v3, v14, v17, v20, v23, v26, v29, v8, v11
+        put_luma_hv_x8_vertical_filter v4, v5, v15, v18, v21, v24, v27, v30, v9, v12
+        ld1             {v14.8h, v15.8h, v16.8h}, [x1], x2
+        st1             {v2.4h-v5.4h}, [x0], x9
+        put_luma_hv_x8_vector_filter \shift, v14, v14, v15
+        put_luma_hv_x8_vector_filter \shift, v15, v15, v16
+        put_luma_hv_x8_vertical_filter v2, v3, v17, v20, v23, v26, v29, v8, v11, v14
+        put_luma_hv_x8_vertical_filter v4, v5, v18, v21, v24, v27, v30, v9, v12, v15
+        st1             {v2.4h-v5.4h}, [x0], x9
+
+        mov             v10.16b, v8.16b
+        mov             v8.16b, v20.16b
+        mov             v20.16b, v10.16b
+        mov             v10.16b, v9.16b
+        mov             v9.16b, v21.16b
+        mov             v21.16b, v10.16b
+
+        mov             v13.16b, v11.16b
+        mov             v11.16b, v23.16b
+        mov             v23.16b, v13.16b
+        mov             v13.16b, v12.16b
+        mov             v12.16b, v24.16b
+        mov             v24.16b, v13.16b
+
+        mov             v16.16b, v14.16b
+        mov             v14.16b, v26.16b
+        mov             v26.16b, v16.16b
+        mov             v16.16b, v15.16b
+        mov             v15.16b, v27.16b
+        mov             v27.16b, v16.16b
+
+        subs            w3, w3, #4
+        mov             v17.16b, v29.16b
+        mov             v18.16b, v30.16b
+        b.gt            1b
+
+        ldp             q8,  q9,  [sp, #0]
+        ldp             q10, q11, [sp, #32]
+        ldp             q12, q13, [sp, #64]
+        ldp             q14, q15, [sp, #96]
+        add             sp, sp, #128
+        ret
+.endm
+
+function ff_vvc_put_luma_hv16_10_neon, export=1
+        put_luma_hv16_xx_neon 2
+endfunc
+
+function ff_vvc_put_luma_hv16_12_neon, export=1
+        put_luma_hv16_xx_neon 4
+endfunc
+
+.macro put_luma_hv_x16_xx_neon shift
+        uxtw            x6, w6
+        sub             sp, sp, #128
+        stp             q8,  q9,  [sp, #0]
+        stp             q10, q11, [sp, #32]
+        stp             q12, q13, [sp, #64]
+        stp             q14, q15, [sp, #96]
+        mov             x9, #(VVC_MAX_PB_SIZE * 2)
+        sub             x1, x1, #6
+        ld1             {v0.8b}, [x4]
+        sub             x1, x1, x2, lsl #1
+        sxtl            v0.8h, v0.8b
+        ld1             {v1.8b}, [x5]
+        sub             x1, x1, x2
+        sxtl            v1.8h, v1.8b
+1:
+        mov             w13, w3
+        mov             x11, x1
+        mov             x10, x0
+        ld1             {v8.8h, v9.8h, v10.8h}, [x11], x2
+        ld1             {v11.8h, v12.8h, v13.8h}, [x11], x2
+        ld1             {v14.8h, v15.8h, v16.8h}, [x11], x2
+        ld1             {v17.8h, v18.8h, v19.8h}, [x11], x2
+        ld1             {v20.8h, v21.8h, v22.8h}, [x11], x2
+        ld1             {v23.8h, v24.8h, v25.8h}, [x11], x2
+        ld1             {v26.8h, v27.8h, v28.8h}, [x11], x2
+        put_luma_hv_x8_vector_filter \shift, v8, v8, v9
+        put_luma_hv_x8_vector_filter \shift, v9, v9, v10
+        put_luma_hv_x8_vector_filter \shift, v11, v11, v12
+        put_luma_hv_x8_vector_filter \shift, v12, v12, v13
+        put_luma_hv_x8_vector_filter \shift, v14, v14, v15
+        put_luma_hv_x8_vector_filter \shift, v15, v15, v16
+        put_luma_hv_x8_vector_filter \shift, v17, v17, v18
+        put_luma_hv_x8_vector_filter \shift, v18, v18, v19
+        put_luma_hv_x8_vector_filter \shift, v20, v20, v21
+        put_luma_hv_x8_vector_filter \shift, v21, v21, v22
+        put_luma_hv_x8_vector_filter \shift, v23, v23, v24
+        put_luma_hv_x8_vector_filter \shift, v24, v24, v25
+        put_luma_hv_x8_vector_filter \shift, v26, v26, v27
+        put_luma_hv_x8_vector_filter \shift, v27, v27, v28
+2:
+        ld1             {v29.8h, v30.8h, v31.8h}, [x11], x2
+        put_luma_hv_x8_vector_filter \shift, v29, v29, v30
+        put_luma_hv_x8_vector_filter \shift, v30, v30, v31
+        put_luma_hv_x8_vertical_filter v2, v3, v8, v11, v14, v17, v20, v23, v26, v29
+        put_luma_hv_x8_vertical_filter v4, v5, v9, v12, v15, v18, v21, v24, v27, v30
+        ld1             {v8.8h, v9.8h, v10.8h}, [x11], x2
+        st1             {v2.4h-v5.4h}, [x10], x9
+
+        put_luma_hv_x8_vector_filter \shift, v8, v8, v9
+        put_luma_hv_x8_vector_filter \shift, v9, v9, v10
+        put_luma_hv_x8_vertical_filter v2, v3, v11, v14, v17, v20, v23, v26, v29, v8
+        put_luma_hv_x8_vertical_filter v4, v5, v12, v15, v18, v21, v24, v27, v30, v9
+        ld1             {v11.8h, v12.8h, v13.8h}, [x11], x2
+        st1             {v2.4h-v5.4h}, [x10], x9
+
+        put_luma_hv_x8_vector_filter \shift, v11, v11, v12
+        put_luma_hv_x8_vector_filter \shift, v12, v12, v13
+        put_luma_hv_x8_vertical_filter v2, v3, v14, v17, v20, v23, v26, v29, v8, v11
+        put_luma_hv_x8_vertical_filter v4, v5, v15, v18, v21, v24, v27, v30, v9, v12
+        ld1             {v14.8h, v15.8h, v16.8h}, [x11], x2
+        st1             {v2.4h-v5.4h}, [x10], x9
+
+        put_luma_hv_x8_vector_filter \shift, v14, v14, v15
+        put_luma_hv_x8_vector_filter \shift, v15, v15, v16
+        put_luma_hv_x8_vertical_filter v2, v3, v17, v20, v23, v26, v29, v8, v11, v14
+        put_luma_hv_x8_vertical_filter v4, v5, v18, v21, v24, v27, v30, v9, v12, v15
+        st1             {v2.4h-v5.4h}, [x10], x9
+
+        mov             v10.16b, v8.16b
+        mov             v8.16b, v20.16b
+        mov             v20.16b, v10.16b
+        mov             v10.16b, v9.16b
+        mov             v9.16b, v21.16b
+        mov             v21.16b, v10.16b
+
+        mov             v13.16b, v11.16b
+        mov             v11.16b, v23.16b
+        mov             v23.16b, v13.16b
+        mov             v13.16b, v12.16b
+        mov             v12.16b, v24.16b
+        mov             v24.16b, v13.16b
+
+        mov             v16.16b, v14.16b
+        mov             v14.16b, v26.16b
+        mov             v26.16b, v16.16b
+        mov             v16.16b, v15.16b
+        mov             v15.16b, v27.16b
+        mov             v27.16b, v16.16b
+
+        subs            w13, w13, #4
+        mov             v17.16b, v29.16b
+        mov             v18.16b, v30.16b
+        b.gt            2b
+
+        add             x0, x0, #32
+        add             x1, x1, #32
+        subs            w6, w6, #16
+        b.gt            1b
+
+        ldp             q8,  q9,  [sp, #0]
+        ldp             q10, q11, [sp, #32]
+        ldp             q12, q13, [sp, #64]
+        ldp             q14, q15, [sp, #96]
+        add             sp, sp, #128
+        ret
+.endm
+
+function ff_vvc_put_luma_hv_x16_10_neon, export=1
+        put_luma_hv_x16_xx_neon 2
+endfunc
+
+function ff_vvc_put_luma_hv_x16_12_neon, export=1
+        put_luma_hv_x16_xx_neon 4
+endfunc
-- 
2.52.0
