From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100])
	by master.gitmailbox.com (Postfix) with ESMTP id C07284220B
	for <ffmpegdev@gitmailbox.com>; Mon, 28 Feb 2022 20:41:53 +0000 (UTC)
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id DA25068AF85;
	Mon, 28 Feb 2022 22:41:50 +0200 (EET)
Received: from smtp-fw-9103.amazon.com (smtp-fw-9103.amazon.com
 [207.171.188.200])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id BB88768AB96
 for <ffmpeg-devel@ffmpeg.org>; Mon, 28 Feb 2022 22:41:43 +0200 (EET)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209;
 t=1646080909; x=1677616909; h=from:to:subject:date:message-id:
 content-transfer-encoding:mime-version;
 bh=wLUzhrVmiSHbofqdJBK78NDzrvcSqUTSeIQjeofiUFE=;
 b=lDnS0kgemMf76HLE1aMLY6/8yYhJT5+OCyERCOviuQQdYmVrMMcvhdHc
 pt9DSqFG7bsu5yZKD/PiQVp5P4ju57Q5WQdl11tgxOoojBo06y2ZyVPZv
 FEDxHgEdKZk7voL8JrUjL+/k9CRzF8TS6gUHawT4GYypGecZhgGDvnx8n A=;
X-IronPort-AV: E=Sophos;i="5.90,144,1643673600"; d="scan'208";a="995595018"
Received: from pdx4-co-svc-p1-lb2-vlan2.amazon.com (HELO
 email-inbound-relay-pdx-2c-b09ea7fa.us-west-2.amazon.com) ([10.25.36.210])
 by smtp-border-fw-9103.sea19.amazon.com with ESMTP; 28 Feb 2022 20:41:40 +0000
Received: from EX13MTAUWB001.ant.amazon.com
 (pdx1-ws-svc-p6-lb9-vlan3.pdx.amazon.com [10.236.137.198])
 by email-inbound-relay-pdx-2c-b09ea7fa.us-west-2.amazon.com (Postfix) with
 ESMTPS id 403A741ADD
 for <ffmpeg-devel@ffmpeg.org>; Mon, 28 Feb 2022 20:41:40 +0000 (UTC)
Received: from EX13D07UWB004.ant.amazon.com (10.43.161.196) by
 EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS)
 id 15.0.1497.28; Mon, 28 Feb 2022 20:41:39 +0000
Received: from EX13D07UWB004.ant.amazon.com (10.43.161.196) by
 EX13D07UWB004.ant.amazon.com (10.43.161.196) with Microsoft SMTP Server (TLS)
 id 15.0.1497.28; Mon, 28 Feb 2022 20:41:39 +0000
Received: from EX13D07UWB004.ant.amazon.com ([10.43.161.196]) by
 EX13D07UWB004.ant.amazon.com ([10.43.161.196]) with mapi id 15.00.1497.028;
 Mon, 28 Feb 2022 20:41:39 +0000
From: "Swinney, Jonathan" <jswinney@amazon.com>
To: "ffmpeg-devel@ffmpeg.org" <ffmpeg-devel@ffmpeg.org>
Thread-Topic: [PATCH] swscale/aarch64: add hscale specializations
Thread-Index: AdgqlYlVD0RgImpmSr6Sxw6I7MzmFg==
Date: Mon, 28 Feb 2022 20:41:39 +0000
Message-ID: <097207e51edb42fdbb79fd3f36b42254@EX13D07UWB004.ant.amazon.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-ms-exchange-transport-fromentityheader: Hosted
x-originating-ip: [10.43.161.217]
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH] swscale/aarch64: add hscale specializations
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
Archived-At: <https://master.gitmailbox.com/ffmpegdev/097207e51edb42fdbb79fd3f36b42254@EX13D07UWB004.ant.amazon.com/>
List-Archive: <https://master.gitmailbox.com/ffmpegdev/>
List-Post: <mailto:ffmpegdev@gitmailbox.com>

This patch adds specializations of hscale for filterSize == 4 and
filterSize == 8, and converts the existing implementation into the X8 version
(filterSize a multiple of 8). For the old code, now used for the X8 version,
it improves the efficiency of the final summations by reducing 11 instructions
to 7.

ff_hscale8to15_8_neon is mostly unchanged from the original, apart from three
changes:
 - The loads for the filter data were consolidated into a single 64-byte ld1
   instruction.
 - The final summations were improved.
 - The inner loop on filterSize was completely removed.
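
For illustration, the improved final summation can be modeled lane by lane in
plain C. This is only a sketch: the v4s type and the helper names are made up
here to mimic the NEON instructions, and the real code is the assembly in the
diff. It suggests that the 7-instruction sequence (4 unzips, 2 adds, 1 pairwise
add) yields the same four horizontal sums as the old 11-instruction sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one NEON register viewed as four 32-bit lanes (.4S). */
typedef struct { int32_t s[4]; } v4s;

/* ADDP: sums of adjacent lane pairs of a, followed by those of b. */
static v4s addp(v4s a, v4s b)
{
    v4s r = {{ a.s[0] + a.s[1], a.s[2] + a.s[3],
               b.s[0] + b.s[1], b.s[2] + b.s[3] }};
    return r;
}

/* ADD: lane-wise addition. */
static v4s add4(v4s a, v4s b)
{
    v4s r = {{ a.s[0] + b.s[0], a.s[1] + b.s[1],
               a.s[2] + b.s[2], a.s[3] + b.s[3] }};
    return r;
}

/* ZIP1: interleave the low halves of a and b. */
static v4s zip1(v4s a, v4s b)
{
    v4s r = {{ a.s[0], b.s[0], a.s[1], b.s[1] }};
    return r;
}

/* UZP1: even-numbered lanes of a, then of b. */
static v4s uzp1(v4s a, v4s b)
{
    v4s r = {{ a.s[0], a.s[2], b.s[0], b.s[2] }};
    return r;
}

/* UZP2: odd-numbered lanes of a, then of b. */
static v4s uzp2(v4s a, v4s b)
{
    v4s r = {{ a.s[1], a.s[3], b.s[1], b.s[3] }};
    return r;
}

/* Old tail: 8 addp + 2 zip1 + 1 mov, 11 instructions. */
static v4s reduce_old(v4s v0, v4s v1, v4s v2, v4s v3)
{
    v0 = addp(v0, v0); v1 = addp(v1, v1);
    v2 = addp(v2, v2); v3 = addp(v3, v3);
    v0 = addp(v0, v0); v1 = addp(v1, v1);
    v2 = addp(v2, v2); v3 = addp(v3, v3);
    v0 = zip1(v0, v1);
    v2 = zip1(v2, v3);
    v0.s[2] = v2.s[0];          /* mov v0.d[1], v2.d[0] */
    v0.s[3] = v2.s[1];
    return v0;
}

/* New tail: 4 uzp + 2 add + 1 addp, 7 instructions. */
static v4s reduce_new(v4s v0, v4s v1, v4s v2, v4s v3)
{
    v4s lo01 = uzp1(v0, v1), hi01 = uzp2(v0, v1);
    v4s lo23 = uzp1(v2, v3), hi23 = uzp2(v2, v3);
    return addp(add4(lo01, hi01), add4(lo23, hi23));
}

/* Usage: both tails must produce the per-accumulator sums in order. */
static void check_reductions(void)
{
    v4s a = {{ 1, 2, 3, 4 }}, b = {{ 10, 20, 30, 40 }};
    v4s c = {{ -5, 6, -7, 8 }}, d = {{ 100, -200, 300, -400 }};
    v4s o = reduce_old(a, b, c, d);
    v4s n = reduce_new(a, b, c, d);
    for (int i = 0; i < 4; i++)
        assert(o.s[i] == n.s[i]);
}
```

The saving in this model comes from reducing two accumulators per instruction
instead of reducing each accumulator on its own.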

ff_hscale8to15_4_neon is a complete rewrite. Since the main bottleneck here is
loading the data from src, this data is loaded a whole block ahead and stored
back to the stack to be loaded again with ld4. This arranges the data for the
most efficient use of the vector instructions and removes the need for
completion adds at the end. Each iteration of the assembly loop now covers 8
iterations of the C loop instead of 4, but because of the prefetching, this
path can only be used when dstW >= 16.
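
The data arrangement can be sketched in plain C. This is a simplified model,
not the patch itself: the function names are invented for this note, the
scalar loop is written after the usual C form of hscale with a saturating
shift standing in for sqshrn, and the overlap of gathering the next block with
the multiply-accumulates is left out. What it shows is the gather into a
32-byte scratch buffer and the ld4-style de-interleave, after which every
accumulator lane belongs to exactly one output pixel.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of sqshrn #7: shift right by 7, saturate to int16_t.
 * Assumes >> on negative values is an arithmetic shift. */
static int16_t shift_clip(int32_t val)
{
    int32_t v = val >> 7;
    if (v > INT16_MAX) v = INT16_MAX;
    if (v < INT16_MIN) v = INT16_MIN;
    return (int16_t)v;
}

/* Scalar reference for filterSize == 4 (also used as the tail loop). */
static void hscale4_ref(int16_t *dst, int dstW, const uint8_t *src,
                        const int16_t *filter, const int32_t *filterPos)
{
    for (int i = 0; i < dstW; i++) {
        int32_t val = 0;
        for (int j = 0; j < 4; j++)
            val += src[filterPos[i] + j] * filter[4 * i + j];
        dst[i] = shift_clip(val);
    }
}

/* Blocked model: 8 output pixels per iteration, used only if dstW >= 16. */
static void hscale4_blocked(int16_t *dst, int dstW, const uint8_t *src,
                            const int16_t *filter, const int32_t *filterPos)
{
    int i = 0;
    if (dstW >= 16) {
        for (; dstW - i >= 8; i += 8) {
            uint8_t scratch[32];        /* stands in for the stack area */
            int32_t acc[8] = { 0 };     /* lanes of the two accumulators */

            /* Gather 4 source bytes per output pixel into one block. */
            for (int k = 0; k < 8; k++)
                memcpy(scratch + 4 * k, src + filterPos[i + k], 4);

            /* ld4 de-interleaves with stride 4: "register" j holds tap j
             * of all 8 pixels, so lane k only ever sees pixel i + k. */
            for (int j = 0; j < 4; j++)
                for (int k = 0; k < 8; k++)
                    acc[k] += scratch[4 * k + j] * filter[4 * (i + k) + j];

            for (int k = 0; k < 8; k++)
                dst[i + k] = shift_clip(acc[k]);
        }
    }
    /* Remaining pixels (dstW % 8, or everything when dstW < 16). */
    hscale4_ref(dst + i, dstW - i, src, filter + 4 * i, filterPos + i);
}

/* Usage: both versions should agree for any width. */
static void check_hscale4(int dstW)
{
    uint8_t src[64];
    int32_t pos[24];
    int16_t flt[4 * 24], a[24], b[24];

    assert(dstW >= 0 && dstW <= 24);
    for (int i = 0; i < 64; i++)
        src[i] = (uint8_t)(i * 37 + 11);
    for (int i = 0; i < 24; i++) {
        pos[i] = 2 * i;                 /* 2 * 23 + 3 < 64, stays in bounds */
        for (int j = 0; j < 4; j++)
            flt[4 * i + j] = (int16_t)(4096 - 700 * j + 13 * i);
    }
    hscale4_ref(a, dstW, src, flt, pos);
    hscale4_blocked(b, dstW, src, flt, pos);
    assert(memcmp(a, b, sizeof(a[0]) * (size_t)dstW) == 0);
}
```

Because each lane accumulates a single pixel, no horizontal add is needed
before the narrowing shift, which is the point made above about completion
adds.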

This improves speed by about 26% on Graviton 2 (Neoverse N1), measured as the
reduction in the average time reported by the bench filter:
ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
before: t:0.001796 avg:0.001839 max:0.002756 min:0.001733
after:  t:0.001690 avg:0.001352 max:0.002171 min:0.001292

In direct microbenchmarks I wrote, the benefit is more dramatic when
filterSize == 4.

| (seconds)   | c6g     |       |
| ----------- | ------- | ----- |
| filterSize  | 4       | 8     |
| original    | 7.554   | 7.621 |
| optimized   | 3.736   | 7.054 |
| improvement | 102.19% | 8.04% |
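
As a cross-check of the figures quoted in this message, here is a small sketch
in C. The function names are made up for this note and the inputs are the
numbers given above. It suggests the "improvement" row of the table is
before/after - 1, while the 26% quoted for the ffmpeg run corresponds to the
reduction in average time, 1 - after/before.

```c
#include <assert.h>

/* Speedup expressed relative to the new time: before/after - 1. */
static double speedup_percent(double before, double after)
{
    return (before / after - 1.0) * 100.0;
}

/* Share of the original time that was saved: 1 - after/before. */
static double time_saved_percent(double before, double after)
{
    return (1.0 - after / before) * 100.0;
}

/* Usage: reproduce the quoted percentages from the raw timings. */
static void check_figures(void)
{
    double s4  = speedup_percent(7.554, 3.736);          /* filterSize 4 */
    double s8  = speedup_percent(7.621, 7.054);          /* filterSize 8 */
    double avg = time_saved_percent(0.001839, 0.001352); /* bench avg */

    assert(s4  > 102.18 && s4  < 102.20);
    assert(s8  >   8.03 && s8  <   8.05);
    assert(avg >  26.0  && avg <  27.0);
}
```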

Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
---
 libswscale/aarch64/hscale.S  | 263 +++++++++++++++++++++++++++++++++--
 libswscale/aarch64/swscale.c |  41 ++++--
 libswscale/utils.c           |   2 +-
 3 files changed, 284 insertions(+), 22 deletions(-)

diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S
index af55ffe2b7..a934653a46 100644
--- a/libswscale/aarch64/hscale.S
+++ b/libswscale/aarch64/hscale.S
@@ -1,5 +1,7 @@
 /*
  * Copyright (c) 2016 Clément Bœsch <clement stupeflix.com>
+ * Copyright (c) 2019-2021 Sebastian Pop <spop@amazon.com>
+ * Copyright (c) 2022 Jonathan Swinney <jswinney@amazon.com>
  *
  * This file is part of FFmpeg.
  *
@@ -20,7 +22,25 @@
 =

 #include "libavutil/aarch64/asm.S"
 =

-function ff_hscale_8_to_15_neon, export=3D1
+/*
+;-------------------------------------------------------------------------=
----
+; horizontal line scaling
+;
+; void hscale<source_width>to<intermediate_nbits>_<filterSize>_<opt>
+;                               (SwsContext *c, int{16,32}_t *dst,
+;                                int dstW, const uint{8,16}_t *src,
+;                                const int16_t *filter,
+;                                const int32_t *filterPos, int filterSize);
+;
+; Scale one horizontal line. Input is either 8-bit width or 16-bit width
+; ($source_width can be either 8, 9, 10 or 16, difference is whether we ha=
ve to
+; downscale before multiplying). Filter is 14 bits. Output is either 15 bi=
ts
+; (in int16_t) or 19 bits (in int32_t), as given in $intermediate_nbits. E=
ach
+; output pixel is generated from $filterSize input pixels, the position of
+; the first pixel is given in filterPos[nOutputPixel].
+;-------------------------------------------------------------------------=
---- */
+
+function ff_hscale8to15_X8_neon, export=3D1
         sbfiz               x7, x6, #1, #32             // filterSize*2 (*=
2 because int16)
 1:      ldr                 w8, [x5], #4                // filterPos[idx]
         ldr                 w0, [x5], #4                // filterPos[idx +=
 1]
@@ -61,20 +81,239 @@ function ff_hscale_8_to_15_neon, export=3D1
         smlal               v3.4S, v18.4H, v19.4H       // v3 accumulates =
srcp[filterPos[3] + {0..3}] * filter[{0..3}]
         smlal2              v3.4S, v18.8H, v19.8H       // v3 accumulates =
srcp[filterPos[3] + {4..7}] * filter[{4..7}]
         b.gt                2b                          // inner loop if f=
ilterSize not consumed completely
-        addp                v0.4S, v0.4S, v0.4S         // part0 horizonta=
l pair adding
-        addp                v1.4S, v1.4S, v1.4S         // part1 horizonta=
l pair adding
-        addp                v2.4S, v2.4S, v2.4S         // part2 horizonta=
l pair adding
-        addp                v3.4S, v3.4S, v3.4S         // part3 horizonta=
l pair adding
-        addp                v0.4S, v0.4S, v0.4S         // part0 horizonta=
l pair adding
-        addp                v1.4S, v1.4S, v1.4S         // part1 horizonta=
l pair adding
-        addp                v2.4S, v2.4S, v2.4S         // part2 horizonta=
l pair adding
-        addp                v3.4S, v3.4S, v3.4S         // part3 horizonta=
l pair adding
-        zip1                v0.4S, v0.4S, v1.4S         // part01 =3D zip =
values from part0 and part1
-        zip1                v2.4S, v2.4S, v3.4S         // part23 =3D zip =
values from part2 and part3
-        mov                 v0.d[1], v2.d[0]            // part0123 =3D zi=
p values from part01 and part23
+        uzp1                v4.4S, v0.4S, v1.4S         // unzip low parts=
 0 and 1
+        uzp2                v5.4S, v0.4S, v1.4S         // unzip high part=
s 0 and 1
+        uzp1                v6.4S, v2.4S, v3.4S         // unzip low parts=
 2 and 3
+        uzp2                v7.4S, v2.4S, v3.4S         // unzip high part=
s 2 and 3
+        add                 v16.4S, v4.4S, v5.4S        // add half of eac=
h of part 0 and 1
+        add                 v17.4S, v6.4S, v7.4S        // add half of eac=
h of part 2 and 3
+        addp                v0.4S, v16.4S, v17.4S       // pairwise add to=
 complete half adds in earlier steps
         subs                w2, w2, #4                  // dstW -=3D 4
         sqshrn              v0.4H, v0.4S, #7            // shift and clip =
the 2x16-bit final values
         st1                 {v0.4H}, [x1], #8           // write to destin=
ation part0123
         b.gt                1b                          // loop until end =
of line
         ret
 endfunc
+
+
+function ff_hscale8to15_8_neon, export=3D1
+// x0      SwsContext *c (not used)
+// x1      int16_t *dst
+// x2      int dstW
+// x3      const uint8_t *src
+// x4      const int16_t *filter
+// x5      const int32_t *filterPos
+// x6      int filterSize
+// x8-x11  filterPos values
+
+// v0-v3   multiply add accumulators
+// v4-v7   filter data, temp for final horizontal sum
+// v16-v19 src data
+1:
+        ld1                 {v4.8H, v5.8H, v6.8H, v7.8H}, [x4], #64 // loa=
d filter[idx=3D0..3, j=3D0..7]
+        ldp                 w8, w9,  [x5]               // filterPos[idx +=
 0], [idx + 1]
+        ldp                 w10, w11, [x5, 8]           // filterPos[idx +=
 2], [idx + 3]
+        movi                v0.2D, #0                   // val sum part 1 =
(for dst[0])
+        movi                v1.2D, #0                   // val sum part 2 =
(for dst[1])
+        add                 x5, x5, #16                 // increment filte=
rPos
+
+        add                 x8, x3, w8, UXTW            // srcp + filterPo=
s[0]
+        add                 x9,  x3, w9, UXTW           // srcp + filterPo=
s[1]
+        add                 x10, x3, w10, UXTW          // srcp + filterPo=
s[2]
+        add                 x11, x3, w11, UXTW          // srcp + filterPo=
s[3]
+
+        ld1                 {v16.8B}, [x8], #8          // srcp[filterPos[=
0] + {0..7}]
+        ld1                 {v17.8B}, [x9], #8          // srcp[filterPos[=
1] + {0..7}]
+
+        movi                v2.2D, #0                   // val sum part 3 =
(for dst[2])
+        movi                v3.2D, #0                   // val sum part 4 =
(for dst[3])
+
+        uxtl                v16.8H, v16.8B              // unpack part 1 t=
o 16-bit
+        uxtl                v17.8H, v17.8B              // unpack part 2 t=
o 16-bit
+
+        smlal               v0.4S, v16.4H, v4.4H        // v0 accumulates =
srcp[filterPos[0] + {0..3}] * filter[{0..3}]
+        smlal               v1.4S, v17.4H, v5.4H        // v1 accumulates =
srcp[filterPos[1] + {0..3}] * filter[{0..3}]
+
+        ld1                 {v18.8B}, [x10], #8         // srcp[filterPos[=
2] + {0..7}]
+        ld1                 {v19.8B}, [x11], #8         // srcp[filterPos[=
3] + {0..7}]
+
+        smlal2              v0.4S, v16.8H, v4.8H        // v0 accumulates =
srcp[filterPos[0] + {4..7}] * filter[{4..7}]
+        smlal2              v1.4S, v17.8H, v5.8H        // v1 accumulates =
srcp[filterPos[1] + {4..7}] * filter[{4..7}]
+
+        uxtl                v18.8H, v18.8B              // unpack part 3 t=
o 16-bit
+        uxtl                v19.8H, v19.8B              // unpack part 4 t=
o 16-bit
+
+        smlal               v2.4S, v18.4H, v6.4H        // v2 accumulates =
srcp[filterPos[2] + {0..3}] * filter[{0..3}]
+        smlal               v3.4S, v19.4H, v7.4H        // v3 accumulates =
srcp[filterPos[3] + {0..3}] * filter[{0..3}]
+
+        smlal2              v2.4S, v18.8H, v6.8H        // v2 accumulates =
srcp[filterPos[2] + {4..7}] * filter[{4..7}]
+        smlal2              v3.4S, v19.8H, v7.8H        // v3 accumulates =
srcp[filterPos[3] + {4..7}] * filter[{4..7}]
+
+        uzp1                v4.4S, v0.4S, v1.4S         // unzip low parts=
 0 and 1
+        uzp2                v5.4S, v0.4S, v1.4S         // unzip high part=
s 0 and 1
+        uzp1                v6.4S, v2.4S, v3.4S         // unzip low parts=
 2 and 3
+        uzp2                v7.4S, v2.4S, v3.4S         // unzip high part=
s 2 and 3
+
+        add                 v0.4S, v4.4S, v5.4S         // add half of eac=
h of part 0 and 1
+        add                 v1.4S, v6.4S, v7.4S         // add half of eac=
h of part 2 and 3
+
+        addp                v4.4S, v0.4S, v1.4S         // pairwise add to=
 complete half adds in earlier steps
+
+        subs                w2, w2, #4                  // dstW -=3D 4
+        sqshrn              v0.4H, v4.4S, #7            // shift and clip =
the 2x16-bit final values
+        st1                 {v0.4H}, [x1], #8           // write to destin=
ation part0123
+        b.gt                1b                          // loop until end =
of line
+        ret
+endfunc
+
+function ff_hscale8to15_4_neon, export=3D1
+// x0  SwsContext *c (not used)
+// x1  int16_t *dst
+// x2  int dstW
+// x3  const uint8_t *src
+// x4  const int16_t *filter
+// x5  const int32_t *filterPos
+// x6  int filterSize
+// x8-x15 registers for gathering src data
+
+// v0      madd accumulator 4S
+// v1-v4   filter values (16 bit) 8H
+// v5      madd accumulator 4S
+// v16-v19 src values (8 bit) 8B
+
+// This implementation has 4 sections:
+//  1. Prefetch src data
+//  2. Interleaved prefetching src data and madd
+//  3. Complete madd
+//  4. Complete remaining iterations when dstW % 8 !=3D 0
+
+        add                 sp, sp, #-32                // allocate 32 byt=
es on the stack
+        cmp                 w2, #16                     // if dstW <16, sk=
ip to the last block used for wrapping up
+        b.lt                2f
+
+        // load 8 values from filterPos to be used as offsets into src
+        ldp                 w8, w9,  [x5]               // filterPos[idx +=
 0], [idx + 1]
+        ldp                 w10, w11, [x5, 8]           // filterPos[idx +=
 2], [idx + 3]
+        ldp                 w12, w13, [x5, 16]          // filterPos[idx +=
 4], [idx + 5]
+        ldp                 w14, w15, [x5, 24]          // filterPos[idx +=
 6], [idx + 7]
+        add                 x5, x5, #32                 // advance filterP=
os
+
+        // gather random access data from src into contiguous memory
+        ldr                 w8, [x3, w8, UXTW]          // src[filterPos[i=
dx + 0]][0..3]
+        ldr                 w9, [x3, w9, UXTW]          // src[filterPos[i=
dx + 1]][0..3]
+        ldr                 w10, [x3, w10, UXTW]        // src[filterPos[i=
dx + 2]][0..3]
+        ldr                 w11, [x3, w11, UXTW]        // src[filterPos[i=
dx + 3]][0..3]
+        ldr                 w12, [x3, w12, UXTW]        // src[filterPos[i=
dx + 4]][0..3]
+        ldr                 w13, [x3, w13, UXTW]        // src[filterPos[i=
dx + 5]][0..3]
+        ldr                 w14, [x3, w14, UXTW]        // src[filterPos[i=
dx + 6]][0..3]
+        ldr                 w15, [x3, w15, UXTW]        // src[filterPos[i=
dx + 7]][0..3]
+        stp                 w8, w9, [sp]                // *scratch_mem =
=3D { src[filterPos[idx + 0]][0..3], src[filterPos[idx + 1]][0..3] }
+        stp                 w10, w11, [sp, 8]           // *scratch_mem =
=3D { src[filterPos[idx + 2]][0..3], src[filterPos[idx + 3]][0..3] }
+        stp                 w12, w13, [sp, 16]          // *scratch_mem =
=3D { src[filterPos[idx + 4]][0..3], src[filterPos[idx + 5]][0..3] }
+        stp                 w14, w15, [sp, 24]          // *scratch_mem =
=3D { src[filterPos[idx + 6]][0..3], src[filterPos[idx + 7]][0..3] }
+
+1:
+        ld4                 {v16.8B, v17.8B, v18.8B, v19.8B}, [sp] // tran=
spose 8 bytes each from src into 4 registers
+
+        // load 8 values from filterPos to be used as offsets into src
+        ldp                 w8, w9,  [x5]               // filterPos[idx +=
 0][0..3], [idx + 1][0..3], next iteration
+        ldp                 w10, w11, [x5, 8]           // filterPos[idx +=
 2][0..3], [idx + 3][0..3], next iteration
+        ldp                 w12, w13, [x5, 16]          // filterPos[idx +=
 4][0..3], [idx + 5][0..3], next iteration
+        ldp                 w14, w15, [x5, 24]          // filterPos[idx +=
 6][0..3], [idx + 7][0..3], next iteration
+
+        movi                v0.2D, #0                   // Clear madd accu=
mulator for idx 0..3
+        movi                v5.2D, #0                   // Clear madd accu=
mulator for idx 4..7
+
+        ld4                 {v1.8H, v2.8H, v3.8H, v4.8H}, [x4], #64 // loa=
d filter idx + 0..7
+
+        add                 x5, x5, #32                 // advance filterP=
os
+
+        // interleaved SIMD and prefetching intended to keep ld/st and vec=
tor pipelines busy
+        uxtl                v16.8H, v16.8B              // unsigned extend=
 long, convert src data to 16-bit
+        uxtl                v17.8H, v17.8B              // unsigned extend=
 long, convert src data to 16-bit
+        ldr                 w8, [x3, w8, UXTW]          // src[filterPos[i=
dx + 0]], next iteration
+        ldr                 w9, [x3, w9, UXTW]          // src[filterPos[i=
dx + 1]], next iteration
+        uxtl                v18.8H, v18.8B              // unsigned extend=
 long, convert src data to 16-bit
+        uxtl                v19.8H, v19.8B              // unsigned extend=
 long, convert src data to 16-bit
+        ldr                 w10, [x3, w10, UXTW]        // src[filterPos[i=
dx + 2]], next iteration
+        ldr                 w11, [x3, w11, UXTW]        // src[filterPos[i=
dx + 3]], next iteration
+
+        smlal               v0.4S, v1.4H, v16.4H        // multiply accumu=
late inner loop j =3D 0, idx =3D 0..3
+        smlal               v0.4S, v2.4H, v17.4H        // multiply accumu=
late inner loop j =3D 1, idx =3D 0..3
+        ldr                 w12, [x3, w12, UXTW]        // src[filterPos[i=
dx + 4]], next iteration
+        ldr                 w13, [x3, w13, UXTW]        // src[filterPos[i=
dx + 5]], next iteration
+        smlal               v0.4S, v3.4H, v18.4H        // multiply accumu=
late inner loop j =3D 2, idx =3D 0..3
+        smlal               v0.4S, v4.4H, v19.4H        // multiply accumu=
late inner loop j =3D 3, idx =3D 0..3
+        ldr                 w14, [x3, w14, UXTW]        // src[filterPos[i=
dx + 6]], next iteration
+        ldr                 w15, [x3, w15, UXTW]        // src[filterPos[i=
dx + 7]], next iteration
+
+        smlal2              v5.4S, v1.8H, v16.8H        // multiply accumu=
late inner loop j =3D 0, idx =3D 4..7
+        smlal2              v5.4S, v2.8H, v17.8H        // multiply accumu=
late inner loop j =3D 1, idx =3D 4..7
+        stp                 w8, w9, [sp]                // *scratch_mem =
=3D { src[filterPos[idx + 0]][0..3], src[filterPos[idx + 1]][0..3] }
+        stp                 w10, w11, [sp, 8]           // *scratch_mem =
=3D { src[filterPos[idx + 2]][0..3], src[filterPos[idx + 3]][0..3] }
+        smlal2              v5.4S, v3.8H, v18.8H        // multiply accumu=
late inner loop j =3D 2, idx =3D 4..7
+        smlal2              v5.4S, v4.8H, v19.8H        // multiply accumu=
late inner loop j =3D 3, idx =3D 4..7
+        stp                 w12, w13, [sp, 16]          // *scratch_mem =
=3D { src[filterPos[idx + 4]][0..3], src[filterPos[idx + 5]][0..3] }
+        stp                 w14, w15, [sp, 24]          // *scratch_mem =
=3D { src[filterPos[idx + 6]][0..3], src[filterPos[idx + 7]][0..3] }
+
+        sub                 w2, w2, #8                  // dstW -=3D 8
+        sqshrn              v0.4H, v0.4S, #7            // shift and clip =
the 2x16-bit final values
+        sqshrn              v1.4H, v5.4S, #7            // shift and clip =
the 2x16-bit final values
+        st1                 {v0.4H, v1.4H}, [x1], #16   // write to dst[id=
x + 0..7]
+        cmp                 w2, #16                     // continue on mai=
n loop if there are at least 16 iterations left
+        b.ge                1b
+
+        // last full iteration
+        ld4                 {v16.8B, v17.8B, v18.8B, v19.8B}, [sp]
+        ld4                 {v1.8H, v2.8H, v3.8H, v4.8H}, [x4], #64 // loa=
d filter idx + 0..7
+
+        movi                v0.2D, #0                   // Clear madd accu=
mulator for idx 0..3
+        movi                v5.2D, #0                   // Clear madd accu=
mulator for idx 4..7
+
+        uxtl                v16.8H, v16.8B              // unsigned extend=
 long, convert src data to 16-bit
+        uxtl                v17.8H, v17.8B              // unsigned extend=
 long, convert src data to 16-bit
+        uxtl                v18.8H, v18.8B              // unsigned extend=
 long, convert src data to 16-bit
+        uxtl                v19.8H, v19.8B              // unsigned extend=
 long, convert src data to 16-bit
+
+        smlal               v0.4S, v1.4H, v16.4H        // multiply accumu=
late inner loop j =3D 0, idx =3D 0..3
+        smlal               v0.4S, v2.4H, v17.4H        // multiply accumu=
late inner loop j =3D 1, idx =3D 0..3
+        smlal               v0.4S, v3.4H, v18.4H        // multiply accumu=
late inner loop j =3D 2, idx =3D 0..3
+        smlal               v0.4S, v4.4H, v19.4H        // multiply accumu=
late inner loop j =3D 3, idx =3D 0..3
+
+        smlal2              v5.4S, v1.8H, v16.8H        // multiply accumu=
late inner loop j =3D 0, idx =3D 4..7
+        smlal2              v5.4S, v2.8H, v17.8H        // multiply accumu=
late inner loop j =3D 1, idx =3D 4..7
+        smlal2              v5.4S, v3.8H, v18.8H        // multiply accumu=
late inner loop j =3D 2, idx =3D 4..7
+        smlal2              v5.4S, v4.8H, v19.8H        // multiply accumu=
late inner loop j =3D 3, idx =3D 4..7
+
+        subs                w2, w2, #8                  // dstW -=3D 8
+        sqshrn              v0.4H, v0.4S, #7            // shift and clip =
the 2x16-bit final values
+        sqshrn              v1.4H, v5.4S, #7            // shift and clip =
the 2x16-bit final values
+        st1                 {v0.4H, v1.4H}, [x1], #16   // write to dst[id=
x + 0..7]
+
+        cbnz                w2, 2f                      // if >0 iteration=
s remain, jump to the wrap up section
+
+        add                 sp, sp, #32                 // clean up stack
+        ret
+
+        // finish up when dstW % 8 !=3D 0 or dstW < 16
+2:
+        // load src
+        ldr                 w8, [x5], #4                // filterPos[i]
+        ldr                 w9, [x3, w8, UXTW]          // src[filterPos[i=
] + 0..3]
+        ins                 v5.S[0], w9                 // move to simd re=
gister
+        // load filter
+        ld1                 {v6.4H}, [x4], #8           // filter[filterSi=
ze * i + 0..3]
+
+        uxtl                v5.8H, v5.8B                // unsigned exten=
d long, convert src data to 16-bit
+        smull               v0.4S, v5.4H, v6.4H         // 4 iterations of=
 src[...] * filter[...]
+        addp                v0.4S, v0.4S, v0.4S         // accumulate the =
smull results
+        addp                v0.4S, v0.4S, v0.4S         // accumulate the =
smull results
+        sqshrn              v0.4H, v0.4S, #7            // shift and clip =
the 2x16-bit final values
+        mov                 w10, v0.S[0]                // move back to ge=
neral register (only one value from simd reg is used)
+        strh                w10, [x1], #2               // dst[i] =3D ...
+        sub                 w2, w2, #1                  // dstW--
+        cbnz                w2, 2b
+
+        add                 sp, sp, #32                 // clean up stack
+        ret
+endfunc
diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
index 09d0a7130e..2ea4ccb3a6 100644
--- a/libswscale/aarch64/swscale.c
+++ b/libswscale/aarch64/swscale.c
@@ -22,25 +22,48 @@
 #include "libswscale/swscale_internal.h"
 #include "libavutil/aarch64/cpu.h"
 =

-void ff_hscale_8_to_15_neon(SwsContext *c, int16_t *dst, int dstW,
-                            const uint8_t *src, const int16_t *filter,
-                            const int32_t *filterPos, int filterSize);
+#define SCALE_FUNC(filter_n, from_bpc, to_bpc, opt) \
+void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
+                                                SwsContext *c, int16_t *da=
ta, \
+                                                int dstW, const uint8_t *s=
rc, \
+                                                const int16_t *filter, \
+                                                const int32_t *filterPos, =
int filterSize)
+#define SCALE_FUNCS(filter_n, opt) \
+    SCALE_FUNC(filter_n,  8, 15, opt);
+#define ALL_SCALE_FUNCS(opt) \
+    SCALE_FUNCS(4, opt); \
+    SCALE_FUNCS(8, opt); \
+    SCALE_FUNCS(X8, opt)
+
+ALL_SCALE_FUNCS(neon);
 =

 void ff_yuv2planeX_8_neon(const int16_t *filter, int filterSize,
                           const int16_t **src, uint8_t *dest, int dstW,
                           const uint8_t *dither, int offset);
 =

+#define ASSIGN_SCALE_FUNC2(hscalefn, filtersize, opt) do {              \
+    if (c->srcBpc =3D=3D 8 && c->dstBpc <=3D 14) {                        =
    \
+      hscalefn =3D                                                        \
+        ff_hscale8to15_ ## filtersize ## _ ## opt;                      \
+    }                                                                   \
+} while (0)
+
+#define ASSIGN_SCALE_FUNC(hscalefn, filtersize, opt)                    \
+  switch (filtersize) {                                                 \
+  case 4:  ASSIGN_SCALE_FUNC2(hscalefn, 4, opt); break;                 \
+  case 8:  ASSIGN_SCALE_FUNC2(hscalefn, 8, opt); break;                 \
+  default: if (filtersize % 8 =3D=3D 0)                                   =
  \
+               ASSIGN_SCALE_FUNC2(hscalefn, X8, opt);                   \
+           break;                                                       \
+  }
+
 av_cold void ff_sws_init_swscale_aarch64(SwsContext *c)
 {
     int cpu_flags =3D av_get_cpu_flags();
 =

     if (have_neon(cpu_flags)) {
-        if (c->srcBpc =3D=3D 8 && c->dstBpc <=3D 14 &&
-            (c->hLumFilterSize % 8) =3D=3D 0 &&
-            (c->hChrFilterSize % 8) =3D=3D 0)
-        {
-            c->hyScale =3D c->hcScale =3D ff_hscale_8_to_15_neon;
-        }
+        ASSIGN_SCALE_FUNC(c->hyScale, c->hLumFilterSize, neon);
+        ASSIGN_SCALE_FUNC(c->hcScale, c->hChrFilterSize, neon);
         if (c->dstBpc =3D=3D 8) {
             c->yuv2planeX =3D ff_yuv2planeX_8_neon;
         }
diff --git a/libswscale/utils.c b/libswscale/utils.c
index c5ea8853d5..2f2b8e73a9 100644
--- a/libswscale/utils.c
+++ b/libswscale/utils.c
@@ -1825,7 +1825,7 @@ av_cold int sws_init_context(SwsContext *c, SwsFilter=
 *srcFilter,
         {
             const int filterAlign =3D X86_MMX(cpu_flags)     ? 4 :
                                     PPC_ALTIVEC(cpu_flags) ? 8 :
-                                    have_neon(cpu_flags)   ? 8 : 1;
+                                    have_neon(cpu_flags)   ? 4 : 1;
 =

             if ((ret =3D initFilter(&c->hLumFilter, &c->hLumFilterPos,
                            &c->hLumFilterSize, c->lumXInc,
-- =

2.32.0
