[FFmpeg-devel] [PATCH 0/4] Provide neon implementations for hscale functions

Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

* [FFmpeg-devel] [PATCH 0/4] Provide neon implementations for hscale functions
@ 2022-10-17 13:07 Hubert Mazur
  2022-10-17 13:07 ` [FFmpeg-devel] [PATCH 1/4] sw_scale: Add specializations for hscale 8 to 19 Hubert Mazur
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Hubert Mazur @ 2022-10-17 13:07 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop

Provide arm64 neon-optimized implementations of hscale functions from the swscale family.

Hubert Mazur (4):
  sw_scale: Add specializations for hscale 8 to 19
  tests/sw_scale: Add test cases for input sizes 16
  sw_scale: Add specializations for hscale 16 to 15
  sw_scale: Add specializations for hscale 16 to 19

 libswscale/aarch64/hscale.S  | 1101 +++++++++++++++++++++++++++++++++-
 libswscale/aarch64/swscale.c |  145 ++++-
 libswscale/swscale.c         |    3 +-
 tests/checkasm/sw_scale.c    |   35 +-
 4 files changed, 1268 insertions(+), 16 deletions(-)

-- 
2.37.1

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".


* [FFmpeg-devel] [PATCH 1/4] sw_scale: Add specializations for hscale 8 to 19
  2022-10-17 13:07 [FFmpeg-devel] [PATCH 0/4] Provide neon implementations for hscale functions Hubert Mazur
@ 2022-10-17 13:07 ` Hubert Mazur
  2022-10-24 12:31   ` Martin Storsjö
  2022-10-17 13:07 ` [FFmpeg-devel] [PATCH 2/4] tests/sw_scale: Add test cases for input sizes 16 Hubert Mazur
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Hubert Mazur @ 2022-10-17 13:07 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop

Add arm64 neon implementations for hscale 8 to 19 with filter
sizes 4, X4 and X8. The implementations are based on the very similar
ones dedicated to hscale 8 to 15. The major change is in how the
result is stored: it is written as int32_t instead of int16_t.
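
As a rough scalar illustration of that difference, here is a sketch that
mirrors the shift by 3 and the (1 << 19) - 1 clamp visible in the assembly
in this patch. It is not code from the patch; the name hscale8to19_sketch
is made up for the example, and the accumulator is assumed not to overflow
32 bits.

```c
#include <stdint.h>

/* Hypothetical scalar model of hscale 8 to 19, for illustration only.
 * Each output sample is the dot product of filterSize source bytes with
 * filterSize 16-bit coefficients, shifted right by 3, clamped to 19 bits
 * and stored as int32_t (the 8 to 15 variant stores int16_t instead). */
static void hscale8to19_sketch(int32_t *dst, int dstW, const uint8_t *src,
                               const int16_t *filter,
                               const int32_t *filterPos, int filterSize)
{
    const int32_t max = (1 << 19) - 1;      /* largest allowed output value */

    for (int i = 0; i < dstW; i++) {
        int32_t val = 0;

        for (int j = 0; j < filterSize; j++)
            val += src[filterPos[i] + j] * filter[filterSize * i + j];

        val >>= 3;                          /* matches "sshr ..., #3" */
        dst[i] = val < max ? val : max;     /* matches "smin" against max */
    }
}
```

Only the upper bound is clamped here, which is what the smin instruction
in the assembly does as well.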

These functions are heavily inspired by the patches provided by
J. Swinney and M. Storsjö for hscale8to15, which were slightly adapted
for hscale8to19.

The tests and benchmarks were run on AWS Graviton 2 instances. The
results from the checkasm tool are shown below.

hscale_8_to_19__fs_4_dstW_512_c: 5663.2
hscale_8_to_19__fs_4_dstW_512_neon: 1259.7
hscale_8_to_19__fs_8_dstW_512_c: 9306.0
hscale_8_to_19__fs_8_dstW_512_neon: 2020.2
hscale_8_to_19__fs_12_dstW_512_c: 12932.7
hscale_8_to_19__fs_12_dstW_512_neon: 2462.5
hscale_8_to_19__fs_16_dstW_512_c: 16844.2
hscale_8_to_19__fs_16_dstW_512_neon: 4671.2
hscale_8_to_19__fs_32_dstW_512_c: 32803.7
hscale_8_to_19__fs_32_dstW_512_neon: 5474.2
hscale_8_to_19__fs_40_dstW_512_c: 40948.0
hscale_8_to_19__fs_40_dstW_512_neon: 6669.7
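
Dividing each C timing by the matching neon timing gives speedups between
about 3.6x (filter size 16) and 6.1x (filter size 40). A small helper along
these lines reproduces the ratios; the function names are illustrative and
the arrays simply copy the checkasm numbers listed above.

```c
#include <stdio.h>

/* Ratio of the C reference timing to the neon timing. */
static double speedup(double c_time, double neon_time)
{
    return c_time / neon_time;
}

/* Print the speedup per filter size for the hscale_8_to_19 results. */
static void print_speedups(void)
{
    static const int    fs[]     = { 4, 8, 12, 16, 32, 40 };
    static const double c_t[]    = { 5663.2, 9306.0, 12932.7,
                                     16844.2, 32803.7, 40948.0 };
    static const double neon_t[] = { 1259.7, 2020.2, 2462.5,
                                     4671.2, 5474.2, 6669.7 };

    for (int i = 0; i < 6; i++)
        printf("fs=%d: %.2fx\n", fs[i], speedup(c_t[i], neon_t[i]));
}
```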

Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
 libswscale/aarch64/hscale.S  | 292 ++++++++++++++++++++++++++++++++++-
 libswscale/aarch64/swscale.c |  13 +-
 2 files changed, 300 insertions(+), 5 deletions(-)

diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S
index a16d3dca42..5e8cad9825 100644
--- a/libswscale/aarch64/hscale.S
+++ b/libswscale/aarch64/hscale.S
@@ -218,7 +218,6 @@ function ff_hscale8to15_4_neon, export=1
 //  2. Interleaved prefetching src data and madd
 //  3. Complete madd
 //  4. Complete remaining iterations when dstW % 8 != 0
-
         sub                 sp, sp, #32                 // allocate 32 bytes on the stack
         cmp                 w2, #16                     // if dstW <16, skip to the last block used for wrapping up
         b.lt                2f
@@ -347,3 +346,294 @@ function ff_hscale8to15_4_neon, export=1
         add                 sp, sp, #32                 // clean up stack
         ret
 endfunc
+
+function ff_hscale8to19_4_neon, export=1
+        // x0               SwsContext *c (unused)
+        // x1               int32_t *dst
+        // w2               int dstW
+        // x3               const uint8_t *src // treat it as uint16_t *src
+        // x4               const uint16_t *filter
+        // x5               const int32_t *filterPos
+        // w6               int filterSize
+
+        movi                v18.4s, #1
+        movi                v17.4s, #1
+        shl                 v18.4s, v18.4s, #19
+        sub                 v18.4s, v18.4s, v17.4s      // max allowed value
+
+        cmp                 w2, #16
+        b.lt                2f // move to last block
+
+        ldp                 w8, w9, [x5]                // filterPos[0], filterPos[1]
+        ldp                 w10, w11, [x5, #8]          // filterPos[2], filterPos[3]
+        ldp                 w12, w13, [x5, #16]         // filterPos[4], filterPos[5]
+        ldp                 w14, w15, [x5, #24]         // filterPos[6], filterPos[7]
+        add                 x5, x5, #32
+
+        // load src data at each filterPos offset
+        ldr                 w8, [x3, w8, UXTW]
+        ldr                 w9, [x3, w9, UXTW]
+        ldr                 w10, [x3, w10, UXTW]
+        ldr                 w11, [x3, w11, UXTW]
+        ldr                 w12, [x3, w12, UXTW]
+        ldr                 w13, [x3, w13, UXTW]
+        ldr                 w14, [x3, w14, UXTW]
+        ldr                 w15, [x3, w15, UXTW]
+
+        sub                 sp, sp, #32
+
+        stp                 w8, w9, [sp]
+        stp                 w10, w11, [sp, #8]
+        stp                 w12, w13, [sp, #16]
+        stp                 w14, w15, [sp, #24]
+
+1:
+        ld4                 {v0.8b, v1.8b, v2.8b, v3.8b}, [sp]
+        ld4                 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7]
+        // load filterPositions into registers for next iteration
+
+        ldp                 w8, w9, [x5]                // filterPos[0], filterPos[1]
+        ldp                 w10, w11, [x5, #8]          // filterPos[2], filterPos[3]
+        ldp                 w12, w13, [x5, #16]         // filterPos[4], filterPos[5]
+        ldp                 w14, w15, [x5, #24]         // filterPos[6], filterPos[7]
+        add                 x5, x5, #32
+        uxtl                v0.8h, v0.8b
+        ldr                 w8, [x3, w8, UXTW]
+        smull               v5.4s, v0.4h, v28.4h        // multiply first column of src
+        ldr                 w9, [x3, w9, UXTW]
+        smull2              v6.4s, v0.8h, v28.8h
+        stp                 w8, w9, [sp]
+
+        uxtl                v1.8h, v1.8b
+        ldr                 w10, [x3, w10, UXTW]
+        smlal               v5.4s, v1.4h, v29.4h        // multiply second column of src
+        ldr                 w11, [x3, w11, UXTW]
+        smlal2              v6.4s, v1.8h, v29.8h
+        stp                 w10, w11, [sp, #8]
+
+        uxtl                v2.8h, v2.8b
+        ldr                 w12, [x3, w12, UXTW]
+        smlal               v5.4s, v2.4h, v30.4h        // multiply third column of src
+        ldr                 w13, [x3, w13, UXTW]
+        smlal2              v6.4s, v2.8h, v30.8h
+        stp                 w12, w13, [sp, #16]
+
+        uxtl                v3.8h, v3.8b
+        ldr                 w14, [x3, w14, UXTW]
+        smlal               v5.4s, v3.4h, v31.4h        // multiply fourth column of src
+        ldr                 w15, [x3, w15, UXTW]
+        smlal2              v6.4s, v3.8h, v31.8h
+        stp                 w14, w15, [sp, #24]
+
+        sub                 w2, w2, #8
+        sshr                v5.4s, v5.4s, #3
+        sshr                v6.4s, v6.4s, #3
+        smin                v5.4s, v5.4s, v18.4s
+        smin                v6.4s, v6.4s, v18.4s
+
+        st1                 {v5.4s, v6.4s}, [x1], #32
+        cmp                 w2, #16
+        b.ge                1b
+
+        // here we make last iteration, without updating the registers
+        ld4                 {v0.8b, v1.8b, v2.8b, v3.8b}, [sp]
+        ld4                 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7]
+
+        uxtl                v0.8h, v0.8b
+        uxtl                v1.8h, v1.8b
+        smull               v5.4s, v0.4h, v28.4h
+        smull2              v6.4s, v0.8h, v28.8h
+        uxtl                v2.8h, v2.8b
+        smlal               v5.4s, v1.4h, v29.4H
+        smlal2              v6.4s, v1.8h, v29.8H
+        uxtl                v3.8h, v3.8b
+        smlal               v5.4s, v2.4h, v30.4H
+        smlal2              v6.4s, v2.8h, v30.8H
+        smlal               v5.4s, v3.4h, v31.4H
+        smlal2              v6.4s, v3.8h, v31.8h
+
+        sshr                v5.4s, v5.4s, #3
+        sshr                v6.4s, v6.4s, #3
+
+        smin                v5.4s, v5.4s, v18.4s
+        smin                v6.4s, v6.4s, v18.4s
+
+        sub                 w2, w2, #8
+        st1                 {v5.4s, v6.4s}, [x1], #32
+        add                 sp, sp, #32 // restore stack
+        cbnz                w2, 2f
+
+        ret
+
+2:
+        ldr                 w8, [x5], #4 // load filterPos
+        add                 x9, x3, w8, UXTW // src + filterPos
+        ld1                 {v0.s}[0], [x9] // load 4 x uint8_t into a single 32-bit lane
+        ld1                 {v31.4h}, [x4], #8
+        uxtl                v0.8h, v0.8b
+        smull               v5.4s, v0.4h, v31.4H
+        saddlv              d0, v5.4S
+        sqshrn              s0, d0, #3
+        smin                v0.4s, v0.4s, v18.4s
+        st1                 {v0.s}[0], [x1], #4
+        sub                 w2, w2, #1
+        cbnz                w2, 2b // if iterations remain jump to beginning
+
+        ret
+endfunc
+
+function ff_hscale8to19_X8_neon, export=1
+        movi                v20.4s, #1
+        movi                v17.4s, #1
+        shl                 v20.4s, v20.4s, #19
+        sub                 v20.4s, v20.4s, v17.4s
+
+        sbfiz               x7, x6, #1, #32             // filterSize*2 (*2 because int16)
+1:
+        mov                 x16, x4                     // filter0 = filter
+        ldr                 w8, [x5], #4                // filterPos[idx]
+        add                 x12, x16, x7                // filter1 = filter0 + filterSize*2
+        ldr                 w0, [x5], #4                // filterPos[idx + 1]
+        add                 x13, x12, x7                // filter2 = filter1 + filterSize*2
+        ldr                 w11, [x5], #4               // filterPos[idx + 2]
+        add                 x4, x13, x7                 // filter3 = filter2 + filterSize*2
+        ldr                 w9, [x5], #4                // filterPos[idx + 3]
+        movi                v0.2D, #0                   // val sum part 1 (for dst[0])
+        movi                v1.2D, #0                   // val sum part 2 (for dst[1])
+        movi                v2.2D, #0                   // val sum part 3 (for dst[2])
+        movi                v3.2D, #0                   // val sum part 4 (for dst[3])
+        add                 x17, x3, w8, UXTW           // srcp + filterPos[0]
+        add                 x8,  x3, w0, UXTW           // srcp + filterPos[1]
+        add                 x0, x3, w11, UXTW           // srcp + filterPos[2]
+        add                 x11, x3, w9, UXTW           // srcp + filterPos[3]
+        mov                 w15, w6                     // filterSize counter
+2:      ld1                 {v4.8B}, [x17], #8          // srcp[filterPos[0] + {0..7}]
+        ld1                 {v5.8H}, [x16], #16         // load 8x16-bit filter values, part 1
+        uxtl                v4.8H, v4.8B                // unpack part 1 to 16-bit
+        smlal               v0.4S, v4.4H, v5.4H         // v0 accumulates srcp[filterPos[0] + {0..3}] * filter[{0..3}]
+        ld1                 {v6.8B}, [x8], #8           // srcp[filterPos[1] + {0..7}]
+        smlal2              v0.4S, v4.8H, v5.8H         // v0 accumulates srcp[filterPos[0] + {4..7}] * filter[{4..7}]
+        ld1                 {v7.8H}, [x12], #16         // load 8x16-bit at filter+filterSize
+        ld1                 {v16.8B}, [x0], #8          // srcp[filterPos[2] + {0..7}]
+        uxtl                v6.8H, v6.8B                // unpack part 2 to 16-bit
+        ld1                 {v17.8H}, [x13], #16        // load 8x16-bit at filter+2*filterSize
+        uxtl                v16.8H, v16.8B              // unpack part 3 to 16-bit
+        smlal               v1.4S, v6.4H, v7.4H         // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}]
+        ld1                 {v18.8B}, [x11], #8         // srcp[filterPos[3] + {0..7}]
+        smlal               v2.4S, v16.4H, v17.4H       // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}]
+        ld1                 {v19.8H}, [x4], #16         // load 8x16-bit at filter+3*filterSize
+        smlal2              v2.4S, v16.8H, v17.8H       // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}]
+        uxtl                v18.8H, v18.8B              // unpack part 4 to 16-bit
+        smlal2              v1.4S, v6.8H, v7.8H         // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}]
+        smlal               v3.4S, v18.4H, v19.4H       // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}]
+        subs                w15, w15, #8                // j -= 8: processed 8/filterSize
+        smlal2              v3.4S, v18.8H, v19.8H       // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}]
+        b.gt                2b                          // inner loop if filterSize not consumed completely
+        addp                v0.4S, v0.4S, v1.4S         // part01 horizontal pair adding
+        addp                v2.4S, v2.4S, v3.4S         // part23 horizontal pair adding
+        addp                v0.4S, v0.4S, v2.4S         // part0123 horizontal pair adding
+        subs                w2, w2, #4                  // dstW -= 4
+        sshr                v0.4s, v0.4S, #3            // shift and clip the 4x32-bit final values
+        smin                v0.4s, v0.4s, v20.4s
+        st1                 {v0.4s}, [x1], #16           // write to destination part0123
+        b.gt                1b                          // loop until end of line
+        ret
+endfunc
+
+function ff_hscale8to19_X4_neon, export=1
+        // x0  SwsContext *c (not used)
+        // x1  int32_t *dst
+        // w2  int dstW
+        // x3  const uint8_t *src
+        // x4  const int16_t *filter
+        // x5  const int32_t *filterPos
+        // w6  int filterSize
+
+        movi                v20.4s, #1
+        movi                v17.4s, #1
+        shl                 v20.4s, v20.4s, #19
+        sub                 v20.4s, v20.4s, v17.4s
+
+        lsl                 w7, w6, #1
+1:
+        ldp                 w8, w9, [x5]
+        ldp                 w10, w11, [x5, #8]
+
+        movi                v16.2d, #0                  // initialize accumulator for idx + 0
+        movi                v17.2d, #0                  // initialize accumulator for idx + 1
+        movi                v18.2d, #0                  // initialize accumulator for idx + 2
+        movi                v19.2d, #0                  // initialize accumulator for idx + 3
+
+        mov                 x12, x4                     // filter + 0
+        add                 x13, x4, x7                 // filter + 1
+        add                 x8, x3, w8, UXTW            // srcp + filterPos 0
+        add                 x14, x13, x7                // filter + 2
+        add                 x9, x3, w9, UXTW            // srcp + filterPos 1
+        add                 x15, x14, x7                // filter + 3
+        add                 x10, x3, w10, UXTW          // srcp + filterPos 2
+        mov                 w0, w6                      // save the filterSize to temporary variable
+        add                 x11, x3, w11, UXTW          // srcp + filterPos 3
+        add                 x5, x5, #16                 // advance filter position
+        mov                 x16, xzr                    // clear the register x16 used for offsetting the filter values
+
+2:
+        ldr                 d4, [x8], #8                // load src values for idx 0
+        ldr                 q31, [x12, x16]             // load filter values for idx 0
+        uxtl                v4.8h, v4.8b                // extend type to match the filter's size
+        ldr                 d5, [x9], #8                // load src values for idx 1
+        smlal               v16.4s, v4.4h, v31.4h       // multiplication of lower half for idx 0
+        uxtl                v5.8h, v5.8b                // extend type to match the filter's size
+        ldr                 q30, [x13, x16]             // load filter values for idx 1
+        smlal2              v16.4s, v4.8h, v31.8h       // multiplication of upper half for idx 0
+        ldr                 d6, [x10], #8               // load src values for idx 2
+        ldr                 q29, [x14, x16]             // load filter values for idx 2
+        smlal               v17.4s, v5.4h, v30.4H       // multiplication of lower half for idx 1
+        ldr                 d7, [x11], #8               // load src values for idx 3
+        smlal2              v17.4s, v5.8h, v30.8H       // multiplication of upper half for idx 1
+        uxtl                v6.8h, v6.8B                // extend type to match the filter's size
+        ldr                 q28, [x15, x16]             // load filter values for idx 3
+        smlal               v18.4s, v6.4h, v29.4h       // multiplication of lower half for idx 2
+        uxtl                v7.8h, v7.8B
+        smlal2              v18.4s, v6.8h, v29.8H       // multiplication of upper half for idx 2
+        sub                 w0, w0, #8
+        smlal               v19.4s, v7.4h, v28.4H       // multiplication of lower half for idx 3
+        cmp                 w0, #8
+        smlal2              v19.4s, v7.8h, v28.8h       // multiplication of upper half for idx 3
+        add                 x16, x16, #16                // advance filter values indexing
+
+        b.ge                2b
+
+
+        // 4 iterations left
+
+        sub                 x17, x7, #8                 // step back to wrap up the filter pos for last 4 elements
+
+        ldr                 s4, [x8]                    // load src values for idx 0
+        ldr                 d31, [x12, x17]             // load filter values for idx 0
+        uxtl                v4.8h, v4.8b                // extend type to match the filter's size
+        ldr                 s5, [x9]                    // load src values for idx 1
+        smlal               v16.4s, v4.4h, v31.4h
+        ldr                 d30, [x13, x17]             // load filter values for idx 1
+        uxtl                v5.8h, v5.8b                // extend type to match the filter's size
+        ldr                 s6, [x10]                   // load src values for idx 2
+        smlal               v17.4s, v5.4h, v30.4h
+        uxtl                v6.8h, v6.8B                // extend type to match the filter's size
+        ldr                 d29, [x14, x17]             // load filter values for idx 2
+        ldr                 s7, [x11]                   // load src values for idx 3
+        addp                v16.4s, v16.4s, v17.4s
+        uxtl                v7.8h, v7.8B
+        ldr                 d28, [x15, x17]             // load filter values for idx 3
+        smlal               v18.4s, v6.4h, v29.4h
+        smlal               v19.4s, v7.4h, v28.4h
+        subs                w2, w2, #4
+        addp                v18.4s, v18.4s, v19.4s
+        addp                v16.4s, v16.4s, v18.4s
+        sshr                v16.4s, v16.4s, #3
+        smin                v16.4s, v16.4s, v20.4s
+
+        st1                 {v16.4s}, [x1], #16
+        add                 x4, x4, x7, lsl #2
+        b.gt                1b
+        ret
+
+endfunc
\ No newline at end of file
diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
index d1312c6658..479fe129d0 100644
--- a/libswscale/aarch64/swscale.c
+++ b/libswscale/aarch64/swscale.c
@@ -29,7 +29,8 @@ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
                                                 const int16_t *filter, \
                                                 const int32_t *filterPos, int filterSize)
 #define SCALE_FUNCS(filter_n, opt) \
-    SCALE_FUNC(filter_n,  8, 15, opt);
+    SCALE_FUNC(filter_n,  8, 15, opt); \
+    SCALE_FUNC(filter_n, 8, 19, opt);
 #define ALL_SCALE_FUNCS(opt) \
     SCALE_FUNCS(4, opt); \
     SCALE_FUNCS(X8, opt); \
@@ -48,9 +49,13 @@ void ff_yuv2plane1_8_neon(
         int offset);
 
 #define ASSIGN_SCALE_FUNC2(hscalefn, filtersize, opt) do {              \
-    if (c->srcBpc == 8 && c->dstBpc <= 14) {                            \
-      hscalefn =                                                        \
-        ff_hscale8to15_ ## filtersize ## _ ## opt;                      \
+    if (c->srcBpc == 8) {                                               \
+        if(c->dstBpc <= 14) {                                           \
+            hscalefn =                                                  \
+                ff_hscale8to15_ ## filtersize ## _ ## opt;              \
+        } else                                                          \
+            hscalefn =                                                  \
+                ff_hscale8to19_ ## filtersize ## _ ## opt;              \
     }                                                                   \
 } while (0)
 
-- 
2.37.1



* [FFmpeg-devel] [PATCH 2/4] tests/sw_scale: Add test cases for input sizes 16
  2022-10-17 13:07 [FFmpeg-devel] [PATCH 0/4] Provide neon implementations for hscale functions Hubert Mazur
  2022-10-17 13:07 ` [FFmpeg-devel] [PATCH 1/4] sw_scale: Add specializations for hscale 8 to 19 Hubert Mazur
@ 2022-10-17 13:07 ` Hubert Mazur
  2022-10-17 13:07 ` [FFmpeg-devel] [PATCH 3/4] sw_scale: Add specializations for hscale 16 to 15 Hubert Mazur
  2022-10-17 13:07 ` [FFmpeg-devel] [PATCH 4/4] sw_scale: Add specializations for hscale 16 to 19 Hubert Mazur
  3 siblings, 0 replies; 8+ messages in thread
From: Hubert Mazur @ 2022-10-17 13:07 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop

Previously, the test cases handled only an input size of 8 bits.
Add support for an input size of 16 bits, which is used by the scaling
routines hscale16To15 and hscale16To19. Pass the SwsContext
pointer to each function, as some of them make use of it.
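
Because the destination element type now depends on the output depth, the
comparison has to be sized per case. The following sketch shows the idea
behind the DST_WIDTH macro in the diff: 14-bit output is stored as int16_t,
anything else as int32_t. The helper names dst_elem_size and outputs_match
are hypothetical and do not appear in the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Size in bytes of one destination sample for a given dstBpc value,
 * following the same rule as the DST_WIDTH macro in the diff. */
static size_t dst_elem_size(int dstBpc)
{
    return dstBpc == 14 ? sizeof(int16_t) : sizeof(int32_t);
}

/* Compare the first dstW samples of the reference and the tested output. */
static int outputs_match(const void *ref, const void *cand,
                         int dstW, int dstBpc)
{
    return memcmp(ref, cand, (size_t)dstW * dst_elem_size(dstBpc)) == 0;
}
```

Comparing with a fixed 32-bit element size would read past the end of the
16-bit buffers, which is why the width is selected at run time.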

Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
 tests/checkasm/sw_scale.c | 35 ++++++++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/tests/checkasm/sw_scale.c b/tests/checkasm/sw_scale.c
index 3b8dd310ec..2e4b698f88 100644
--- a/tests/checkasm/sw_scale.c
+++ b/tests/checkasm/sw_scale.c
@@ -262,23 +262,31 @@ static void check_hscale(void)
 #define FILTER_SIZES 6
     static const int filter_sizes[FILTER_SIZES] = { 4, 8, 12, 16, 32, 40 };
 
-#define HSCALE_PAIRS 2
+#define HSCALE_PAIRS 4
     static const int hscale_pairs[HSCALE_PAIRS][2] = {
         { 8, 14 },
         { 8, 18 },
+        { 16, 14 },
+        { 16, 18 }
     };
 
+#define DST_WIDTH(x) ( (x) == (14) ? sizeof(int16_t) : sizeof(int32_t))
 #define LARGEST_INPUT_SIZE 512
 #define INPUT_SIZES 6
     static const int input_sizes[INPUT_SIZES] = {8, 24, 128, 144, 256, 512};
 
     int i, j, fsi, hpi, width, dstWi;
     struct SwsContext *ctx;
+    void *(*_dst)[2];
+    void *_src;
 
     // padded
     LOCAL_ALIGNED_32(uint8_t, src, [FFALIGN(SRC_PIXELS + MAX_FILTER_WIDTH - 1, 4)]);
-    LOCAL_ALIGNED_32(uint32_t, dst0, [SRC_PIXELS]);
-    LOCAL_ALIGNED_32(uint32_t, dst1, [SRC_PIXELS]);
+    LOCAL_ALIGNED_32(uint16_t, src1, [FFALIGN(SRC_PIXELS + MAX_FILTER_WIDTH - 1, 4)]);
+    LOCAL_ALIGNED_32(int16_t, dst_ref_16, [SRC_PIXELS]);
+    LOCAL_ALIGNED_32(int16_t, dst_new_16, [SRC_PIXELS]);
+    LOCAL_ALIGNED_32(int32_t, dst_ref_32, [SRC_PIXELS]);
+    LOCAL_ALIGNED_32(int32_t, dst_new_32, [SRC_PIXELS]);
 
     // padded
     LOCAL_ALIGNED_32(int16_t, filter, [SRC_PIXELS * MAX_FILTER_WIDTH + MAX_FILTER_WIDTH]);
@@ -286,6 +294,9 @@ static void check_hscale(void)
     LOCAL_ALIGNED_32(int16_t, filterAvx2, [SRC_PIXELS * MAX_FILTER_WIDTH + MAX_FILTER_WIDTH]);
     LOCAL_ALIGNED_32(int32_t, filterPosAvx, [SRC_PIXELS]);
 
+    void *_dst_16[2] = {dst_ref_16, dst_new_16};
+    void *_dst_32[2] = {dst_ref_32, dst_new_32};
+
     // The dst parameter here is either int16_t or int32_t but we use void* to
     // just cover both cases.
     declare_func_emms(AV_CPU_FLAG_MMX, void, void *c, void *dst, int dstW,
@@ -297,6 +308,7 @@ static void check_hscale(void)
         fail();
 
     randomize_buffers(src, SRC_PIXELS + MAX_FILTER_WIDTH - 1);
+    randomize_buffers(src1, SRC_PIXELS + MAX_FILTER_WIDTH - 1);
 
     for (hpi = 0; hpi < HSCALE_PAIRS; hpi++) {
         for (fsi = 0; fsi < FILTER_SIZES; fsi++) {
@@ -306,6 +318,8 @@ static void check_hscale(void)
                 ctx->srcBpc = hscale_pairs[hpi][0];
                 ctx->dstBpc = hscale_pairs[hpi][1];
                 ctx->hLumFilterSize = ctx->hChrFilterSize = width;
+                _src = ctx->srcBpc == 8 ? (void *)src : (void *)src1;
+                _dst = ctx->dstBpc == 14 ? (void*)_dst_16 : (void*)_dst_32;
 
                 for (i = 0; i < SRC_PIXELS; i++) {
                     filterPos[i] = i;
@@ -343,14 +357,15 @@ static void check_hscale(void)
                 ff_shuffle_filter_coefficients(ctx, filterPosAvx, width, filterAvx2, ctx->dstW);
 
                 if (check_func(ctx->hcScale, "hscale_%d_to_%d__fs_%d_dstW_%d", ctx->srcBpc, ctx->dstBpc + 1, width, ctx->dstW)) {
-                    memset(dst0, 0, SRC_PIXELS * sizeof(dst0[0]));
-                    memset(dst1, 0, SRC_PIXELS * sizeof(dst1[0]));
+                    memset((*_dst)[0], 0, SRC_PIXELS * DST_WIDTH(ctx->dstBpc));
+                    memset((*_dst)[1], 0, SRC_PIXELS * DST_WIDTH(ctx->dstBpc));
+
+                    call_ref(ctx, (*_dst)[0], ctx->dstW, _src, filter, filterPos, width);
+                    call_new(ctx, (*_dst)[1], ctx->dstW, _src, filterAvx2, filterPosAvx, width);
 
-                    call_ref(NULL, dst0, ctx->dstW, src, filter, filterPos, width);
-                    call_new(NULL, dst1, ctx->dstW, src, filterAvx2, filterPosAvx, width);
-                    if (memcmp(dst0, dst1, ctx->dstW * sizeof(dst0[0])))
+                    if (memcmp((*_dst)[0], (*_dst)[1], ctx->dstW * DST_WIDTH(ctx->dstBpc)))
                         fail();
-                    bench_new(NULL, dst0, ctx->dstW, src, filter, filterPosAvx, width);
+                    bench_new(ctx, (*_dst)[1], ctx->dstW, _src, filter, filterPosAvx, width);
                 }
             }
         }
@@ -358,6 +373,8 @@ static void check_hscale(void)
     sws_freeContext(ctx);
 }
 
+#undef DST_WIDTH
+
 void checkasm_check_sw_scale(void)
 {
     check_hscale();
-- 
2.37.1



* [FFmpeg-devel] [PATCH 3/4] sw_scale: Add specializations for hscale 16 to 15
  2022-10-17 13:07 [FFmpeg-devel] [PATCH 0/4] Provide neon implementations for hscale functions Hubert Mazur
  2022-10-17 13:07 ` [FFmpeg-devel] [PATCH 1/4] sw_scale: Add specializations for hscale 8 to 19 Hubert Mazur
  2022-10-17 13:07 ` [FFmpeg-devel] [PATCH 2/4] tests/sw_scale: Add test cases for input sizes 16 Hubert Mazur
@ 2022-10-17 13:07 ` Hubert Mazur
  2022-10-17 13:07 ` [FFmpeg-devel] [PATCH 4/4] sw_scale: Add specializations for hscale 16 to 19 Hubert Mazur
  3 siblings, 0 replies; 8+ messages in thread
From: Hubert Mazur @ 2022-10-17 13:07 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop

Add arm64 neon implementations for hscale 16 to 15 with filter
sizes 4, 8 and X4.
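
For orientation, a scalar model of what these routines compute might look
like the sketch below. It assumes, as the register comments in the diff
suggest, that the right shift is passed in by the caller and that the result
is clamped to (1 << 15) - 1. How the shift is derived from the pixel format
is not shown, the name hscale16to15_sketch is made up, and the accumulator
is assumed not to overflow 32 bits.

```c
#include <stdint.h>

/* Hypothetical scalar model of hscale 16 to 15, for illustration only.
 * src holds 16-bit samples, so filterPos indexes uint16_t elements.
 * Each sample is widened to 32 bits before multiplying, since a
 * uint16_t value does not fit in int16_t. */
static void hscale16to15_sketch(int shift, int16_t *dst, int dstW,
                                const uint16_t *src, const int16_t *filter,
                                const int32_t *filterPos, int filterSize)
{
    const int32_t max = (1 << 15) - 1;      /* largest allowed output value */

    for (int i = 0; i < dstW; i++) {
        int32_t val = 0;

        for (int j = 0; j < filterSize; j++)
            val += (int32_t)src[filterPos[i] + j] * filter[filterSize * i + j];

        val >>= shift;                      /* caller-supplied right shift */
        dst[i] = (int16_t)(val < max ? val : max);
    }
}
```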

The tests and benchmarks were run on AWS Graviton 2 instances.
The results from the checkasm tool are shown below.

hscale_16_to_15__fs_4_dstW_512_c: 6703.5
hscale_16_to_15__fs_4_dstW_512_neon: 2298.0
hscale_16_to_15__fs_8_dstW_512_c: 10983.0
hscale_16_to_15__fs_8_dstW_512_neon: 3216.5
hscale_16_to_15__fs_12_dstW_512_c: 15526.0
hscale_16_to_15__fs_12_dstW_512_neon: 3993.0
hscale_16_to_15__fs_16_dstW_512_c: 20183.5
hscale_16_to_15__fs_16_dstW_512_neon: 5369.7
hscale_16_to_15__fs_32_dstW_512_c: 39315.2
hscale_16_to_15__fs_32_dstW_512_neon: 9511.2
hscale_16_to_15__fs_40_dstW_512_c: 48995.7
hscale_16_to_15__fs_40_dstW_512_neon: 11570.0

Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
 libswscale/aarch64/hscale.S  | 409 ++++++++++++++++++++++++++++++++++-
 libswscale/aarch64/swscale.c |  66 +++++-
 libswscale/swscale.c         |   3 +-
 3 files changed, 474 insertions(+), 4 deletions(-)

diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S
index 5e8cad9825..7d7e1c1f2e 100644
--- a/libswscale/aarch64/hscale.S
+++ b/libswscale/aarch64/hscale.S
@@ -635,5 +635,412 @@ function ff_hscale8to19_X4_neon, export=1
         add                 x4, x4, x7, lsl #2
         b.gt                1b
         ret
+endfunc
+
+function ff_hscale16to15_4_neon_asm, export=1
+        // w0               int shift
+        // x1               int16_t *dst
+        // w2               int dstW
+        // x3               const uint8_t *src // treat it as uint16_t *src
+        // x4               const uint16_t *filter
+        // x5               const int32_t *filterPos
+        // w6               int filterSize
+
+        movi                v18.4s, #1
+        movi                v17.4s, #1
+        shl                 v18.4s, v18.4s, #15
+        sub                 v18.4s, v18.4s, v17.4s      // max allowed value
+        dup                 v17.4s, w0                  // read shift
+        neg                 v17.4s, v17.4s              // negate it, so it can be used in sshl (effectively shift right)
+
+        cmp                 w2, #16
+        b.lt                2f // move to last block
+
+        ldp                 w8, w9, [x5]                // filterPos[0], filterPos[1]
+        ldp                 w10, w11, [x5, #8]          // filterPos[2], filterPos[3]
+        ldp                 w12, w13, [x5, #16]         // filterPos[4], filterPos[5]
+        ldp                 w14, w15, [x5, #24]         // filterPos[6], filterPos[7]
+        add                 x5, x5, #32
+
+        // shift all filterPos left by one, as uint16_t will be read
+        lsl                 x8, x8, #1
+        lsl                 x9, x9, #1
+        lsl                 x10, x10, #1
+        lsl                 x11, x11, #1
+        lsl                 x12, x12, #1
+        lsl                 x13, x13, #1
+        lsl                 x14, x14, #1
+        lsl                 x15, x15, #1
+
+        // load src with given offset
+        ldr                 x8,  [x3, w8,  UXTW]
+        ldr                 x9,  [x3, w9,  UXTW]
+        ldr                 x10, [x3, w10, UXTW]
+        ldr                 x11, [x3, w11, UXTW]
+        ldr                 x12, [x3, w12, UXTW]
+        ldr                 x13, [x3, w13, UXTW]
+        ldr                 x14, [x3, w14, UXTW]
+        ldr                 x15, [x3, w15, UXTW]
+
+        sub                 sp, sp, #64
+        // push src on stack so it can be loaded into vectors later
+        stp                 x8, x9, [sp]
+        stp                 x10, x11, [sp, #16]
+        stp                 x12, x13, [sp, #32]
+        stp                 x14, x15, [sp, #48]
+
+1:
+        ld4                 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp]
+        ld4                 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7]
+
+        // Each of the blocks below does the following:
+        // extend src and filter to 32 bits with uxtl and sxtl, then
+        // multiply or multiply-accumulate the results.
+        // Extending to 32 bits is necessary, as uint16_t values can't
+        // be represented as int16_t without type promotion.
+        uxtl                v26.4s, v0.4h
+        sxtl                v27.4s, v28.4H
+        uxtl2               v0.4s, v0.8h
+        mul                 v5.4s, v26.4s, v27.4s
+        sxtl2               v28.4s, v28.8H
+        uxtl                v26.4s, v1.4h
+        mul                 v6.4s, v0.4s, v28.4s
+
+        sxtl                v27.4s, v29.4H
+        uxtl2               v0.4s, v1.8h
+        mla                 v5.4s, v27.4s, v26.4s
+        sxtl2               v28.4s, v29.8H
+        uxtl                v26.4s, v2.4h
+        mla                 v6.4s, v28.4s, v0.4s
+
+        sxtl                v27.4s, v30.4H
+        uxtl2               v0.4s, v2.8h
+        mla                 v5.4s, v27.4s, v26.4s
+        sxtl2               v28.4s, v30.8H
+        uxtl                v26.4s, v3.4h
+        mla                 v6.4s, v28.4s, v0.4s
+
+        sxtl                v27.4s, v31.4H
+        uxtl2               v0.4s, v3.8h
+        mla                 v5.4s, v27.4s, v26.4s
+        sxtl2               v28.4s, v31.8H
+        sub                 w2, w2, #8
+        mla                 v6.4s, v28.4s, v0.4s
+
+        sshl                v5.4s, v5.4s, v17.4s
+        sshl                v6.4s, v6.4s, v17.4s
+        smin                v5.4s, v5.4s, v18.4s
+        smin                v6.4s, v6.4s, v18.4s
+        xtn                 v5.4h, v5.4s
+        xtn2                v5.8h, v6.4s
+
+        st1                 {v5.8h}, [x1], #16
+        cmp                 w2, #16
+
+        // load filterPositions into registers for next iteration
+        ldp                 w8, w9, [x5]                // filterPos[0], filterPos[1]
+        ldp                 w10, w11, [x5, #8]          // filterPos[2], filterPos[3]
+        ldp                 w12, w13, [x5, #16]         // filterPos[4], filterPos[5]
+        ldp                 w14, w15, [x5, #24]         // filterPos[6], filterPos[7]
+        add                 x5, x5, #32
+
+        lsl                 x8, x8, #1
+        lsl                 x9, x9, #1
+        lsl                 x10, x10, #1
+        lsl                 x11, x11, #1
+        lsl                 x12, x12, #1
+        lsl                 x13, x13, #1
+        lsl                 x14, x14, #1
+        lsl                 x15, x15, #1
+
+        ldr                 x8,  [x3, w8,  UXTW]
+        ldr                 x9,  [x3, w9,  UXTW]
+        ldr                 x10, [x3, w10, UXTW]
+        ldr                 x11, [x3, w11, UXTW]
+        ldr                 x12, [x3, w12, UXTW]
+        ldr                 x13, [x3, w13, UXTW]
+        ldr                 x14, [x3, w14, UXTW]
+        ldr                 x15, [x3, w15, UXTW]
+
+        stp                 x8, x9, [sp]
+        stp                 x10, x11, [sp, #16]
+        stp                 x12, x13, [sp, #32]
+        stp                 x14, x15, [sp, #48]
 
-endfunc
\ No newline at end of file
+        b.ge                1b
+
+        // perform the last iteration without reloading the registers
+        ld4                 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp]
+        ld4                 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64
+
+        uxtl                v26.4s, v0.4h
+        sxtl                v27.4s, v28.4H
+        uxtl2               v0.4s, v0.8h
+        mul                 v5.4s, v26.4s, v27.4s
+        sxtl2               v28.4s, v28.8H
+        uxtl                v26.4s, v1.4h
+        mul                 v6.4s, v0.4s, v28.4s
+
+        sxtl                v27.4s, v29.4H
+        uxtl2               v0.4s, v1.8h
+        mla                 v5.4s, v26.4s, v27.4s
+        sxtl2               v28.4s, v29.8H
+        uxtl                v26.4s, v2.4h
+        mla                 v6.4s, v0.4s, v28.4s
+
+        sxtl                v27.4s, v30.4H
+        uxtl2               v0.4s, v2.8h
+        mla                 v5.4s, v26.4s, v27.4s
+        sxtl2               v28.4s, v30.8H
+        uxtl                v26.4s, v3.4h
+        mla                 v6.4s, v0.4s, v28.4s
+
+        sxtl                v27.4s, v31.4H
+        uxtl2               v0.4s, v3.8h
+        mla                 v5.4s, v26.4s, v27.4s
+        sxtl2               v28.4s, v31.8H
+        subs                w2, w2, #8
+        mla                 v6.4s, v0.4s, v28.4s
+
+        sshl                v5.4s, v5.4s, v17.4s
+        sshl                v6.4s, v6.4s, v17.4s
+        smin                v5.4s, v5.4s, v18.4s
+        smin                v6.4s, v6.4s, v18.4s
+        xtn                 v5.4h, v5.4S
+        xtn2                v5.8h, v6.4s
+
+        st1                 {v5.8h}, [x1], #16
+        add                 sp, sp, #64                 // restore stack
+        cbnz                w2, 2f
+
+        ret
+
+2:
+        ldr                 w8, [x5], #4                // load filterPos
+        lsl                 w8, w8, #1
+        add                 x9, x3, w8, UXTW            // src + filterPos
+        ld1                 {v0.4h}, [x9]               // load 4 * uint16_t
+        ld1                 {v31.4h}, [x4], #8
+
+        uxtl                v0.4s, v0.4h
+        sxtl                v31.4s, v31.4h
+        mul                 v5.4s, v0.4s, v31.4s
+        addv                s0, v5.4S
+        sshl                v0.4s, v0.4s, v17.4s
+        smin                v0.4s, v0.4s, v18.4s
+        st1                 {v0.h}[0], [x1], #2
+        sub                 w2, w2, #1
+        cbnz                w2, 2b                      // if iterations remain jump to beginning
+
+        ret
+endfunc
+
+function ff_hscale16to15_X8_neon_asm, export=1
+        // w0               int shift
+        // x1               int16_t *dst
+        // w2               int dstW
+        // x3               const uint8_t *src // treat it as uint16_t *src
+        // x4               const int16_t *filter
+        // x5               const int32_t *filterPos
+        // w6               int filterSize
+
+        movi                v20.4s, #1
+        movi                v21.4s, #1
+        shl                 v20.4s, v20.4s, #15
+        sub                 v20.4s, v20.4s, v21.4s
+        dup                 v21.4s, w0
+        neg                 v21.4s, v21.4s
+
+        sbfiz               x7, x6, #1, #32             // filterSize*2 (*2 because int16)
+1:      ldr                 w8, [x5], #4                // filterPos[idx]
+        lsl                 w8, w8, #1
+        ldr                 w10, [x5], #4               // filterPos[idx + 1]
+        lsl                 w10, w10, #1
+        ldr                 w11, [x5], #4               // filterPos[idx + 2]
+        lsl                 w11, w11, #1
+        ldr                 w9, [x5], #4                // filterPos[idx + 3]
+        lsl                 w9, w9, #1
+        mov                 x16, x4                     // filter0 = filter
+        add                 x12, x16, x7                // filter1 = filter0 + filterSize*2
+        add                 x13, x12, x7                // filter2 = filter1 + filterSize*2
+        add                 x4, x13, x7                 // filter3 = filter2 + filterSize*2
+        movi                v0.2D, #0                   // val sum part 1 (for dst[0])
+        movi                v1.2D, #0                   // val sum part 2 (for dst[1])
+        movi                v2.2D, #0                   // val sum part 3 (for dst[2])
+        movi                v3.2D, #0                   // val sum part 4 (for dst[3])
+        add                 x17, x3, w8, UXTW           // srcp + filterPos[0]
+        add                 x8,  x3, w10, UXTW          // srcp + filterPos[1]
+        add                 x10, x3, w11, UXTW          // srcp + filterPos[2]
+        add                 x11, x3, w9, UXTW           // srcp + filterPos[3]
+        mov                 w15, w6                     // filterSize counter
+2:      ld1                 {v4.8H}, [x17], #16         // srcp[filterPos[0] + {0..7}]
+        ld1                 {v5.8H}, [x16], #16         // load 8x16-bit filter values, part 1
+        ld1                 {v6.8H}, [x8], #16          // srcp[filterPos[1] + {0..7}]
+        ld1                 {v7.8H}, [x12], #16         // load 8x16-bit at filter+filterSize
+        uxtl                v24.4s, v4.4H               // zero-extend srcp lower half to 32 bits (src is unsigned)
+        sxtl                v25.4s, v5.4H               // extend filter lower half to 32 bits to match srcp size
+        uxtl2               v4.4s, v4.8h                // extend srcp upper half to 32 bits
+        mla                 v0.4s, v24.4s, v25.4s       // multiply accumulate lower half of v4 * v5
+        sxtl2               v5.4s, v5.8h                // extend filter upper half to 32 bits
+        uxtl                v26.4s, v6.4h               // extend srcp lower half to 32 bits
+        mla                 v0.4S, v4.4s, v5.4s         // multiply accumulate upper half of v4 * v5
+        sxtl                v27.4s, v7.4H               // extend filter lower half
+        uxtl2               v6.4s, v6.8H                // extend srcp upper half
+        sxtl2               v7.4s, v7.8h                // extend filter upper half
+        ld1                 {v16.8H}, [x10], #16        // srcp[filterPos[2] + {0..7}]
+        mla                 v1.4S, v26.4s, v27.4s       // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}]
+        ld1                 {v17.8H}, [x13], #16        // load 8x16-bit at filter+2*filterSize
+        uxtl                v22.4s, v16.4H              // extend srcp lower half
+        sxtl                v23.4s, v17.4H              // extend filter lower half
+        uxtl2               v16.4s, v16.8H              // extend srcp upper half
+        sxtl2               v17.4s, v17.8h              // extend filter upper half
+        mla                 v2.4S, v22.4s, v23.4s       // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}]
+        mla                 v2.4S, v16.4s, v17.4s       // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}]
+        ld1                 {v18.8H}, [x11], #16        // srcp[filterPos[3] + {0..7}]
+        mla                 v1.4S, v6.4s, v7.4s         // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}]
+        ld1                 {v19.8H}, [x4], #16         // load 8x16-bit at filter+3*filterSize
+        subs                w15, w15, #8                // j -= 8: processed 8/filterSize
+        uxtl                v28.4s, v18.4H              // extend srcp lower half
+        sxtl                v29.4s, v19.4H              // extend filter lower half
+        uxtl2               v18.4s, v18.8H              // extend srcp upper half
+        sxtl2               v19.4s, v19.8h              // extend filter upper half
+        mla                 v3.4S, v28.4s, v29.4s       // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}]
+        mla                 v3.4S, v18.4s, v19.4s       // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}]
+        b.gt                2b                          // inner loop if filterSize not consumed completely
+        addp                v0.4S, v0.4S, v1.4S         // part01 horizontal pair adding
+        addp                v2.4S, v2.4S, v3.4S         // part23 horizontal pair adding
+        addp                v0.4S, v0.4S, v2.4S         // part0123 horizontal pair adding
+        subs                w2, w2, #4                  // dstW -= 4
+        sshl                v0.4s, v0.4s, v21.4s        // shift left by a negative amount (effectively a right shift); overflow expected
+        smin                v0.4s, v0.4s, v20.4s        // apply min (do not use sqshl)
+        xtn                 v0.4h, v0.4s                // narrow down to 16 bits
+
+        st1                 {v0.4H}, [x1], #8           // write to destination part0123
+        b.gt                1b                          // loop until end of line
+        ret
+endfunc
+
+function ff_hscale16to15_X4_neon_asm, export=1
+        // w0  int shift
+        // x1  int16_t *dst
+        // w2  int dstW
+        // x3  const uint8_t *src
+        // x4  const int16_t *filter
+        // x5  const int32_t *filterPos
+        // w6  int filterSize
+
+        stp                 d8, d9, [sp, #-0x20]!
+        stp                 d10, d11, [sp, #0x10]
+
+        movi                v18.4s, #1
+        movi                v17.4s, #1
+        shl                 v18.4s, v18.4s, #15
+        sub                 v21.4s, v18.4s, v17.4s      // max allowed value
+        dup                 v17.4s, w0                  // read shift
+        neg                 v20.4s, v17.4s              // negate it, so it can be used in sshl (effectively shift right)
+
+        lsl                 w7, w6, #1
+1:
+        ldp                 w8, w9, [x5]
+        ldp                 w10, w11, [x5, #8]
+
+        movi                v16.2d, #0                  // initialize accumulator for idx + 0
+        movi                v17.2d, #0                  // initialize accumulator for idx + 1
+        movi                v18.2d, #0                  // initialize accumulator for idx + 2
+        movi                v19.2d, #0                  // initialize accumulator for idx + 3
+
+        mov                 x12, x4                     // filter + 0
+        add                 x13, x4, x7                 // filter + 1
+        add                 x8, x3, x8, lsl #1          // srcp + filterPos 0
+        add                 x14, x13, x7                // filter + 2
+        add                 x9, x3, x9, lsl #1          // srcp + filterPos 1
+        add                 x15, x14, x7                // filter + 3
+        add                 x10, x3, x10, lsl #1        // srcp + filterPos 2
+        mov                 w0, w6                      // save the filterSize to temporary variable
+        add                 x11, x3, x11, lsl #1        // srcp + filterPos 3
+        add                 x5, x5, #16                 // advance filter position
+        mov                 x16, xzr                    // clear the register x16 used for offsetting the filter values
+
+2:
+        ldr                 q4, [x8], #16               // load src values for idx 0
+        ldr                 q5, [x9], #16               // load src values for idx 1
+        uxtl                v26.4s, v4.4h
+        uxtl2               v4.4s, v4.8h
+        ldr                 q31, [x12, x16]             // load filter values for idx 0
+        ldr                 q6, [x10], #16              // load src values for idx 2
+        sxtl                v22.4s, v31.4h
+        sxtl2               v31.4s, v31.8h
+        mla                 v16.4s, v26.4s, v22.4s      // multiplication of lower half for idx 0
+        uxtl                v25.4s, v5.4h
+        uxtl2               v5.4s, v5.8h
+        ldr                 q30, [x13, x16]             // load filter values for idx 1
+        ldr                 q7, [x11], #16              // load src values for idx 3
+        mla                 v16.4s, v4.4s, v31.4s       // multiplication of upper half for idx 0
+        uxtl                v24.4s, v6.4h
+        sxtl                v8.4s, v30.4h
+        sxtl2               v30.4s, v30.8h
+        mla                 v17.4s, v25.4s, v8.4s       // multiplication of lower half for idx 1
+        ldr                 q29, [x14, x16]             // load filter values for idx 2
+        uxtl2               v6.4s, v6.8h
+        sxtl                v9.4s, v29.4h
+        sxtl2               v29.4s, v29.8h
+        mla                 v17.4s, v5.4s, v30.4s       // multiplication of upper half for idx 1
+        mla                 v18.4s, v24.4s, v9.4s       // multiplication of lower half for idx 2
+        ldr                 q28, [x15, x16]             // load filter values for idx 3
+        uxtl                v23.4s, v7.4h
+        sxtl                v10.4s, v28.4h
+        mla                 v18.4s, v6.4s, v29.4s       // multiplication of upper half for idx 2
+        uxtl2               v7.4s, v7.8h
+        sxtl2               v28.4s, v28.8h
+        mla                 v19.4s, v23.4s, v10.4s      // multiplication of lower half for idx 3
+        sub                 w0, w0, #8
+        cmp                 w0, #8
+        mla                 v19.4s, v7.4s, v28.4s       // multiplication of upper half for idx 3
+
+        add                 x16, x16, #16               // advance filter values indexing
+
+        b.ge                2b
+
+        // 4 filter elements left
+
+        sub                 x17, x7, #8                 // step back to wrap up the filter pos for last 4 elements
+
+        ldr                 d4, [x8]                    // load src values for idx 0
+        ldr                 d31, [x12, x17]             // load filter values for idx 0
+        uxtl                v4.4s, v4.4h
+        sxtl                v31.4s, v31.4h
+        ldr                 d5, [x9]                    // load src values for idx 1
+        mla                 v16.4s, v4.4s, v31.4s       // multiply-accumulate last 4 elements for idx 0
+        ldr                 d30, [x13, x17]             // load filter values for idx 1
+        uxtl                v5.4s, v5.4h
+        sxtl                v30.4s, v30.4h
+        ldr                 d6, [x10]                   // load src values for idx 2
+        mla                 v17.4s, v5.4s, v30.4s       // multiply-accumulate last 4 elements for idx 1
+        ldr                 d29, [x14, x17]             // load filter values for idx 2
+        uxtl                v6.4s, v6.4h
+        sxtl                v29.4s, v29.4h
+        ldr                 d7, [x11]                   // load src values for idx 3
+        ldr                 d28, [x15, x17]             // load filter values for idx 3
+        mla                 v18.4s, v6.4s, v29.4s       // multiply-accumulate last 4 elements for idx 2
+        uxtl                v7.4s, v7.4h
+        sxtl                v28.4s, v28.4h
+        addp                v16.4s, v16.4s, v17.4s
+        mla                 v19.4s, v7.4s, v28.4s       // multiply-accumulate last 4 elements for idx 3
+        subs                w2, w2, #4
+        addp                v18.4s, v18.4s, v19.4s
+        addp                v16.4s, v16.4s, v18.4s
+        sshl                v16.4s, v16.4s, v20.4s
+        smin                v16.4s, v16.4s, v21.4s
+        xtn                 v16.4h, v16.4s
+
+        st1                 {v16.4h}, [x1], #8
+        add                 x4, x4, x7, lsl #2
+        b.gt                1b
+
+        ldp                 d8, d9, [sp]
+        ldp                 d10, d11, [sp, #0x10]
+
+        add                 sp, sp, #0x20
+
+        ret
+endfunc
diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
index 479fe129d0..993cdd67dd 100644
--- a/libswscale/aarch64/swscale.c
+++ b/libswscale/aarch64/swscale.c
@@ -22,6 +22,18 @@
 #include "libswscale/swscale_internal.h"
 #include "libavutil/aarch64/cpu.h"
 
+void ff_hscale16to15_4_neon_asm(int shift, int16_t *_dst, int dstW,
+                      const uint8_t *_src, const int16_t *filter,
+                      const int32_t *filterPos, int filterSize);
+
+void ff_hscale16to15_X8_neon_asm(int shift, int16_t *_dst, int dstW,
+                      const uint8_t *_src, const int16_t *filter,
+                      const int32_t *filterPos, int filterSize);
+
+void ff_hscale16to15_X4_neon_asm(int shift, int16_t *_dst, int dstW,
+                      const uint8_t *_src, const int16_t *filter,
+                      const int32_t *filterPos, int filterSize);
+
 #define SCALE_FUNC(filter_n, from_bpc, to_bpc, opt) \
 void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
                                                 SwsContext *c, int16_t *data, \
@@ -30,7 +42,8 @@ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
                                                 const int32_t *filterPos, int filterSize)
 #define SCALE_FUNCS(filter_n, opt) \
     SCALE_FUNC(filter_n,  8, 15, opt); \
-    SCALE_FUNC(filter_n, 8, 19, opt);
+    SCALE_FUNC(filter_n, 8, 19, opt); \
+    SCALE_FUNC(filter_n, 16, 15, opt);
 #define ALL_SCALE_FUNCS(opt) \
     SCALE_FUNCS(4, opt); \
     SCALE_FUNCS(X8, opt); \
@@ -56,6 +69,10 @@ void ff_yuv2plane1_8_neon(
         } else                                                          \
             hscalefn =                                                  \
                 ff_hscale8to19_ ## filtersize ## _ ## opt;              \
+    } else {                                                            \
+        if (c->dstBpc <= 14)                                            \
+            hscalefn =                                                  \
+                ff_hscale16to15_ ## filtersize ## _ ## opt;             \
     }                                                                   \
 } while (0)
 
@@ -87,3 +104,50 @@ av_cold void ff_sws_init_swscale_aarch64(SwsContext *c)
         }
     }
 }
+
+void ff_hscale16to15_4_neon(SwsContext *c, int16_t *_dst, int dstW,
+                      const uint8_t *_src, const int16_t *filter,
+                      const int32_t *filterPos, int filterSize)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat);
+    int sh              = desc->comp[0].depth - 1;
+
+    if (sh<15) {
+        sh = isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8 ? 13 : (desc->comp[0].depth - 1);
+    } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float input is processed like uint 16bpc */
+        sh = 16 - 1;
+    }
+    ff_hscale16to15_4_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize);
+
+}
+
+void ff_hscale16to15_X8_neon(SwsContext *c, int16_t *_dst, int dstW,
+                      const uint8_t *_src, const int16_t *filter,
+                      const int32_t *filterPos, int filterSize)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat);
+    int sh              = desc->comp[0].depth - 1;
+
+    if (sh<15) {
+        sh = isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8 ? 13 : (desc->comp[0].depth - 1);
+    } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float input is processed like uint 16bpc */
+        sh = 16 - 1;
+    }
+    ff_hscale16to15_X8_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize);
+
+}
+
+void ff_hscale16to15_X4_neon(SwsContext *c, int16_t *_dst, int dstW,
+                      const uint8_t *_src, const int16_t *filter,
+                      const int32_t *filterPos, int filterSize)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat);
+    int sh              = desc->comp[0].depth - 1;
+
+    if (sh<15) {
+        sh = isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8 ? 13 : (desc->comp[0].depth - 1);
+    } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float input is processed like uint 16bpc */
+        sh = 16 - 1;
+    }
+    ff_hscale16to15_X4_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize);
+}
\ No newline at end of file
diff --git a/libswscale/swscale.c b/libswscale/swscale.c
index 367d045a02..5afd5eba83 100644
--- a/libswscale/swscale.c
+++ b/libswscale/swscale.c
@@ -109,11 +109,10 @@ static void hScale16To15_c(SwsContext *c, int16_t *dst, int dstW,
         int j;
         int srcPos = filterPos[i];
         int val    = 0;
-
         for (j = 0; j < filterSize; j++) {
             val += src[srcPos + j] * filter[filterSize * i + j];
         }
-        // filter=14 bit, input=16 bit, output=30 bit, >> 15 makes 15 bit
+        //filter=14 bit, input=16 bit, output=30 bit, >> 15 makes 15 bit
         dst[i] = FFMIN(val >> sh, (1 << 15) - 1);
     }
 }
-- 
2.37.1


* [FFmpeg-devel] [PATCH 4/4] sw_scale: Add specializations for hscale 16 to 19
  2022-10-17 13:07 [FFmpeg-devel] [PATCH 0/4] Provide neon implementations for hscale functions Hubert Mazur
                   ` (2 preceding siblings ...)
  2022-10-17 13:07 ` [FFmpeg-devel] [PATCH 3/4] sw_scale: Add specializations for hscale 16 to 15 Hubert Mazur
@ 2022-10-17 13:07 ` Hubert Mazur
  2022-10-24 13:19   ` Martin Storsjö
  3 siblings, 1 reply; 8+ messages in thread
From: Hubert Mazur @ 2022-10-17 13:07 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop

Provide arm64 neon optimized implementations for hscale16To19 with
filter sizes 4, 8 and X4.

The tests and benchmarks were run on AWS Graviton 2 instances.
The checkasm results are shown below.

hscale_16_to_19__fs_4_dstW_512_c: 6216.0
hscale_16_to_19__fs_4_dstW_512_neon: 2257.0
hscale_16_to_19__fs_8_dstW_512_c: 10417.7
hscale_16_to_19__fs_8_dstW_512_neon: 3112.5
hscale_16_to_19__fs_12_dstW_512_c: 14890.5
hscale_16_to_19__fs_12_dstW_512_neon: 3899.0
hscale_16_to_19__fs_16_dstW_512_c: 19006.5
hscale_16_to_19__fs_16_dstW_512_neon: 5341.2
hscale_16_to_19__fs_32_dstW_512_c: 36629.5
hscale_16_to_19__fs_32_dstW_512_neon: 9502.7
hscale_16_to_19__fs_40_dstW_512_c: 45477.5
hscale_16_to_19__fs_40_dstW_512_neon: 11552.0

Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
 libswscale/aarch64/hscale.S  | 402 +++++++++++++++++++++++++++++++++++
 libswscale/aarch64/swscale.c |  70 +++++-
 2 files changed, 471 insertions(+), 1 deletion(-)
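
For reference, a scalar sketch of what the 16-to-19 routines are expected to
compute. It is modelled on the hScale16To15_c loop in libswscale/swscale.c,
with the clamp widened to (1 << 19) - 1 and a 32-bit destination, matching the
constants set up in the assembly below. The name hscale16to19_ref and the
explicit shift argument are illustrative assumptions, not code from this patch.

```c
#include <stdint.h>

/* Hypothetical scalar reference for the 16-to-19 horizontal scaler.
 * sh is the right shift the C wrapper would pass to the asm routine. */
static void hscale16to19_ref(int sh, int32_t *dst, int dstW,
                             const uint16_t *src, const int16_t *filter,
                             const int32_t *filterPos, int filterSize)
{
    const int max = (1 << 19) - 1;      /* largest value that fits in 19 bits */

    for (int i = 0; i < dstW; i++) {
        int srcPos = filterPos[i];
        int val    = 0;

        /* src is unsigned 16 bit, filter is signed 16 bit; both are
         * promoted to int before the multiply, as uxtl/sxtl do in NEON. */
        for (int j = 0; j < filterSize; j++)
            val += src[srcPos + j] * filter[filterSize * i + j];

        val >>= sh;                     /* sshl by -sh in the asm */
        dst[i] = val < max ? val : max; /* smin against the 19-bit maximum */
    }
}
```

Each output pixel i reads filterSize source samples starting at filterPos[i]
and its own row of filterSize coefficients, which is why the assembly advances
the filter pointer by filterSize * 2 bytes per output.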

diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S
index 7d7e1c1f2e..dfc635d1b9 100644
--- a/libswscale/aarch64/hscale.S
+++ b/libswscale/aarch64/hscale.S
@@ -1044,3 +1044,405 @@ function ff_hscale16to15_X4_neon_asm, export=1
 
         ret
 endfunc
+
+function ff_hscale16to19_4_neon_asm, export=1
+        // w0               int shift
+        // x1               int32_t *dst
+        // w2               int dstW
+        // x3               const uint8_t *src // treat it as uint16_t *src
+        // x4               const int16_t *filter
+        // x5               const int32_t *filterPos
+        // w6               int filterSize
+
+        movi                v18.4s, #1
+        movi                v17.4s, #1
+        shl                 v18.4s, v18.4s, #19
+        sub                 v18.4s, v18.4s, v17.4s      // max allowed value
+        dup                 v17.4s, w0                  // read shift
+        neg                 v17.4s, v17.4s              // negate it, so it can be used in sshl (effectively shift right)
+
+        cmp                 w2, #16
+        b.lt                2f // move to last block
+
+        ldp                 w8, w9, [x5]                // filterPos[0], filterPos[1]
+        ldp                 w10, w11, [x5, #8]          // filterPos[2], filterPos[3]
+        ldp                 w12, w13, [x5, #16]         // filterPos[4], filterPos[5]
+        ldp                 w14, w15, [x5, #24]         // filterPos[6], filterPos[7]
+        add                 x5, x5, #32
+
+        // shift all filterPos left by one, as uint16_t will be read
+        lsl                 x8, x8, #1
+        lsl                 x9, x9, #1
+        lsl                 x10, x10, #1
+        lsl                 x11, x11, #1
+        lsl                 x12, x12, #1
+        lsl                 x13, x13, #1
+        lsl                 x14, x14, #1
+        lsl                 x15, x15, #1
+
+        // load src with given offset
+        ldr                 x8,  [x3, w8,  UXTW]
+        ldr                 x9,  [x3, w9,  UXTW]
+        ldr                 x10, [x3, w10, UXTW]
+        ldr                 x11, [x3, w11, UXTW]
+        ldr                 x12, [x3, w12, UXTW]
+        ldr                 x13, [x3, w13, UXTW]
+        ldr                 x14, [x3, w14, UXTW]
+        ldr                 x15, [x3, w15, UXTW]
+
+        sub                 sp, sp, #64
+        // push src on stack so it can be loaded into vectors later
+        stp                 x8, x9, [sp]
+        stp                 x10, x11, [sp, #16]
+        stp                 x12, x13, [sp, #32]
+        stp                 x14, x15, [sp, #48]
+
+1:
+        ld4                 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp]
+        ld4                 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7]
+
+        // Each of the blocks below does the following:
+        // extend src and filter to 32 bits with uxtl and sxtl, then
+        // multiply or multiply-accumulate the results.
+        // Extending to 32 bits is necessary, as uint16_t values can't
+        // be represented as int16_t without type promotion.
+        uxtl                v26.4s, v0.4h
+        sxtl                v27.4s, v28.4H
+        uxtl2               v0.4s, v0.8h
+        mul                 v5.4s, v26.4s, v27.4s
+        sxtl2               v28.4s, v28.8H
+        uxtl                v26.4s, v1.4h
+        mul                 v6.4s, v0.4s, v28.4s
+
+        sxtl                v27.4s, v29.4H
+        uxtl2               v0.4s, v1.8h
+        mla                 v5.4s, v27.4s, v26.4s
+        sxtl2               v28.4s, v29.8H
+        uxtl                v26.4s, v2.4h
+        mla                 v6.4s, v28.4s, v0.4s
+
+        sxtl                v27.4s, v30.4H
+        uxtl2               v0.4s, v2.8h
+        mla                 v5.4s, v27.4s, v26.4s
+        sxtl2               v28.4s, v30.8H
+        uxtl                v26.4s, v3.4h
+        mla                 v6.4s, v28.4s, v0.4s
+
+        sxtl                v27.4s, v31.4H
+        uxtl2               v0.4s, v3.8h
+        mla                 v5.4s, v27.4s, v26.4s
+        sxtl2               v28.4s, v31.8H
+        sub                 w2, w2, #8
+        mla                 v6.4s, v28.4s, v0.4s
+
+        sshl                v5.4s, v5.4s, v17.4s
+        sshl                v6.4s, v6.4s, v17.4s
+        smin                v5.4s, v5.4s, v18.4s
+        smin                v6.4s, v6.4s, v18.4s
+
+        st1                 {v5.4s, v6.4s}, [x1], #32
+        cmp                 w2, #16
+
+        // load filterPositions into registers for next iteration
+        ldp                 w8, w9, [x5]                // filterPos[0], filterPos[1]
+        ldp                 w10, w11, [x5, #8]          // filterPos[2], filterPos[3]
+        ldp                 w12, w13, [x5, #16]         // filterPos[4], filterPos[5]
+        ldp                 w14, w15, [x5, #24]         // filterPos[6], filterPos[7]
+        add                 x5, x5, #32
+
+        lsl                 x8, x8, #1
+        lsl                 x9, x9, #1
+        lsl                 x10, x10, #1
+        lsl                 x11, x11, #1
+        lsl                 x12, x12, #1
+        lsl                 x13, x13, #1
+        lsl                 x14, x14, #1
+        lsl                 x15, x15, #1
+
+        ldr                 x8,  [x3, w8,  UXTW]
+        ldr                 x9,  [x3, w9,  UXTW]
+        ldr                 x10, [x3, w10, UXTW]
+        ldr                 x11, [x3, w11, UXTW]
+        ldr                 x12, [x3, w12, UXTW]
+        ldr                 x13, [x3, w13, UXTW]
+        ldr                 x14, [x3, w14, UXTW]
+        ldr                 x15, [x3, w15, UXTW]
+
+        stp                 x8, x9, [sp]
+        stp                 x10, x11, [sp, #16]
+        stp                 x12, x13, [sp, #32]
+        stp                 x14, x15, [sp, #48]
+
+        b.ge                1b
+
+        // here we make last iteration, without updating the registers
+        ld4                 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp]
+        ld4                 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64
+
+        uxtl                v26.4s, v0.4h
+        sxtl                v27.4s, v28.4H
+        uxtl2               v0.4s, v0.8h
+        mul                 v5.4s, v26.4s, v27.4s
+        sxtl2               v28.4s, v28.8H
+        uxtl                v26.4s, v1.4h
+        mul                 v6.4s, v0.4s, v28.4s
+
+        sxtl                v27.4s, v29.4H
+        uxtl2               v0.4s, v1.8h
+        mla                 v5.4s, v26.4s, v27.4s
+        sxtl2               v28.4s, v29.8H
+        uxtl                v26.4s, v2.4h
+        mla                 v6.4s, v0.4s, v28.4s
+
+        sxtl                v27.4s, v30.4H
+        uxtl2               v0.4s, v2.8h
+        mla                 v5.4s, v26.4s, v27.4s
+        sxtl2               v28.4s, v30.8H
+        uxtl                v26.4s, v3.4h
+        mla                 v6.4s, v0.4s, v28.4s
+
+        sxtl                v27.4s, v31.4H
+        uxtl2               v0.4s, v3.8h
+        mla                 v5.4s, v26.4s, v27.4s
+        sxtl2               v28.4s, v31.8H
+        subs                w2, w2, #8
+        mla                 v6.4s, v0.4s, v28.4s
+
+        sshl                v5.4s, v5.4s, v17.4s
+        sshl                v6.4s, v6.4s, v17.4s
+
+        smin                v5.4s, v5.4s, v18.4s
+        smin                v6.4s, v6.4s, v18.4s
+
+        st1                 {v5.4s, v6.4s}, [x1], #32
+        add                 sp, sp, #64                 // restore stack
+        cbnz                w2, 2f
+
+        ret
+
+2:
+        ldr                 w8, [x5], #4                // load filterPos
+        lsl                 w8, w8, #1
+        add                 x9, x3, w8, UXTW            // src + filterPos
+        ld1                 {v0.4h}, [x9]               // load 4 * uint16_t
+        ld1                 {v31.4h}, [x4], #8
+
+        uxtl                v0.4s, v0.4h
+        sxtl                v31.4s, v31.4h
+        subs                w2, w2, #1
+        mul                 v5.4s, v0.4s, v31.4s
+        addv                s0, v5.4S
+        sshl                v0.4s, v0.4s, v17.4s
+        smin                v0.4s, v0.4s, v18.4s
+        st1                 {v0.s}[0], [x1], #4
+        cbnz                w2, 2b                      // if iterations remain jump to beginning
+
+        ret
+endfunc
+
+function ff_hscale16to19_X8_neon_asm, export=1
+        // w0               int shift
+        // x1               int32_t *dst
+        // w2               int dstW
+        // x3               const uint8_t *src // treat it as uint16_t *src
+        // x4               const int16_t *filter
+        // x5               const int32_t *filterPos
+        // w6               int filterSize
+
+        movi                v20.4s, #1
+        movi                v21.4s, #1
+        shl                 v20.4s, v20.4s, #19
+        sub                 v20.4s, v20.4s, v21.4s
+        dup                 v21.4s, w0
+        neg                 v21.4s, v21.4s
+
+        sbfiz               x7, x6, #1, #32             // filterSize*2 (*2 because int16)
+1:      ldr                 w8, [x5], #4                // filterPos[idx]
+        ldr                 w10, [x5], #4               // filterPos[idx + 1]
+        lsl                 w8, w8, #1
+        ldr                 w11, [x5], #4               // filterPos[idx + 2]
+        ldr                 w9, [x5], #4                // filterPos[idx + 3]
+        mov                 x16, x4                     // filter0 = filter
+        lsl                 w11, w11, #1
+        add                 x12, x16, x7                // filter1 = filter0 + filterSize*2
+        lsl                 w9, w9, #1
+        add                 x13, x12, x7                // filter2 = filter1 + filterSize*2
+        lsl                 w10, w10, #1
+        add                 x4, x13, x7                 // filter3 = filter2 + filterSize*2
+        movi                v0.2D, #0                   // val sum part 1 (for dst[0])
+        movi                v1.2D, #0                   // val sum part 2 (for dst[1])
+        movi                v2.2D, #0                   // val sum part 3 (for dst[2])
+        movi                v3.2D, #0                   // val sum part 4 (for dst[3])
+        add                 x17, x3, w8, UXTW           // srcp + filterPos[0]
+        add                 x8,  x3, w10, UXTW          // srcp + filterPos[1]
+        add                 x10, x3, w11, UXTW          // srcp + filterPos[2]
+        add                 x11, x3, w9, UXTW           // srcp + filterPos[3]
+        mov                 w15, w6                     // filterSize counter
+2:      ld1                 {v4.8H}, [x17], #16         // srcp[filterPos[0] + {0..7}]
+        ld1                 {v5.8H}, [x16], #16         // load 8x16-bit filter values, part 1
+        ld1                 {v6.8H}, [x8], #16          // srcp[filterPos[1] + {0..7}]
+        ld1                 {v7.8H}, [x12], #16         // load 8x16-bit at filter+filterSize
+        uxtl                v24.4s, v4.4H               // extend srcp lower half to 32 bits to preserve sign
+        sxtl                v25.4s, v5.4H               // extend filter lower half to 32 bits to match srcp size
+        uxtl2               v4.4s, v4.8h                // extend srcp upper half to 32 bits
+        mla                 v0.4s, v24.4s, v25.4s       // multiply accumulate lower half of v4 * v5
+        sxtl2               v5.4s, v5.8h                // extend filter upper half to 32 bits
+        uxtl                v26.4s, v6.4h               // extend srcp lower half to 32 bits
+        mla                 v0.4S, v4.4s, v5.4s         // multiply accumulate upper half of v4 * v5
+        sxtl                v27.4s, v7.4H               // extend filter lower half
+        uxtl2               v6.4s, v6.8H                // extend srcp upper half
+        sxtl2               v7.4s, v7.8h                // extend filter upper half
+        ld1                 {v16.8H}, [x10], #16        // srcp[filterPos[2] + {0..7}]
+        mla                 v1.4S, v26.4s, v27.4s       // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}]
+        ld1                 {v17.8H}, [x13], #16        // load 8x16-bit at filter+2*filterSize
+        uxtl                v22.4s, v16.4H              // extend srcp lower half
+        sxtl                v23.4s, v17.4H              // extend filter lower half
+        uxtl2               v16.4s, v16.8H              // extend srcp upper half
+        sxtl2               v17.4s, v17.8h              // extend filter upper half
+        mla                 v2.4S, v22.4s, v23.4s       // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}]
+        mla                 v2.4S, v16.4s, v17.4s       // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}]
+        ld1                 {v18.8H}, [x11], #16        // srcp[filterPos[3] + {0..7}]
+        mla                 v1.4S, v6.4s, v7.4s         // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}]
+        ld1                 {v19.8H}, [x4], #16         // load 8x16-bit at filter+3*filterSize
+        subs                w15, w15, #8                // j -= 8: processed 8/filterSize
+        uxtl                v28.4s, v18.4H              // extend srcp lower half
+        sxtl                v29.4s, v19.4H              // extend filter lower half
+        uxtl2               v18.4s, v18.8H              // extend srcp upper half
+        sxtl2               v19.4s, v19.8h              // extend filter upper half
+        mla                 v3.4S, v28.4s, v29.4s       // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}]
+        mla                 v3.4S, v18.4s, v19.4s       // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}]
+        b.gt                2b                          // inner loop if filterSize not consumed completely
+        addp                v0.4S, v0.4S, v1.4S         // part01 horizontal pair adding
+        addp                v2.4S, v2.4S, v3.4S         // part23 horizontal pair adding
+        addp                v0.4S, v0.4S, v2.4S         // part0123 horizontal pair adding
+        subs                w2, w2, #4                  // dstW -= 4
+        sshl                v0.4s, v0.4s, v21.4s        // shift left by a negative count (effectively a right shift); overflow expected
+        smin                v0.4s, v0.4s, v20.4s        // apply min (do not use sqshl)
+        st1                 {v0.4s}, [x1], #16          // write to destination part0123
+        b.gt                1b                          // loop until end of line
+        ret
+endfunc
+
+function ff_hscale16to19_X4_neon_asm, export=1
+        // w0  int shift
+        // x1  int32_t *dst
+        // w2  int dstW
+        // x3  const uint8_t *src
+        // x4  const int16_t *filter
+        // x5  const int32_t *filterPos
+        // w6  int filterSize
+
+        stp                 d8, d9, [sp, #-0x20]!
+        stp                 d10, d11, [sp, #0x10]
+
+        movi                v18.4s, #1
+        movi                v17.4s, #1
+        shl                 v18.4s, v18.4s, #19
+        sub                 v21.4s, v18.4s, v17.4s      // max allowed value
+        dup                 v17.4s, w0                  // read shift
+        neg                 v20.4s, v17.4s              // negate it, so it can be used in sshl (effectively shift right)
+
+        lsl                 w7, w6, #1
+1:
+        ldp                 w8, w9, [x5]
+        ldp                 w10, w11, [x5, #8]
+
+        movi                v16.2d, #0                  // initialize accumulator for idx + 0
+        movi                v17.2d, #0                  // initialize accumulator for idx + 1
+        movi                v18.2d, #0                  // initialize accumulator for idx + 2
+        movi                v19.2d, #0                  // initialize accumulator for idx + 3
+
+        mov                 x12, x4                     // filter + 0
+        add                 x13, x4, x7                 // filter + 1
+        add                 x8, x3, x8, lsl #1          // srcp + filterPos 0
+        add                 x14, x13, x7                // filter + 2
+        add                 x9, x3, x9, lsl #1          // srcp + filterPos 1
+        add                 x15, x14, x7                // filter + 3
+        add                 x10, x3, x10, lsl #1        // srcp + filterPos 2
+        mov                 w0, w6                      // save the filterSize to temporary variable
+        add                 x11, x3, x11, lsl #1        // srcp + filterPos 3
+        add                 x5, x5, #16                 // advance filter position
+        mov                 x16, xzr                    // clear the register x16 used for offsetting the filter values
+
+2:
+        ldr                 q4, [x8], #16               // load src values for idx 0
+        ldr                 q5, [x9], #16               // load src values for idx 1
+        uxtl                v26.4s, v4.4h
+        uxtl2               v4.4s, v4.8h
+        ldr                 q31, [x12, x16]             // load filter values for idx 0
+        ldr                 q6, [x10], #16              // load src values for idx 2
+        sxtl                v22.4s, v31.4h
+        sxtl2               v31.4s, v31.8h
+        mla                 v16.4s, v26.4s, v22.4s      // multiplication of lower half for idx 0
+        uxtl                v25.4s, v5.4h
+        uxtl2               v5.4s, v5.8h
+        ldr                 q30, [x13, x16]             // load filter values for idx 1
+        ldr                 q7, [x11], #16              // load src values for idx 3
+        mla                 v16.4s, v4.4s, v31.4s       // multiplication of upper half for idx 0
+        uxtl                v24.4s, v6.4h
+        sxtl                v8.4s, v30.4h
+        sxtl2               v30.4s, v30.8h
+        mla                 v17.4s, v25.4s, v8.4s       // multiplication of lower half for idx 1
+        ldr                 q29, [x14, x16]             // load filter values for idx 2
+        uxtl2               v6.4s, v6.8h
+        sxtl                v9.4s, v29.4h
+        sxtl2               v29.4s, v29.8h
+        mla                 v17.4s, v5.4s, v30.4s       // multiplication of upper half for idx 1
+        ldr                 q28, [x15, x16]             // load filter values for idx 3
+        mla                 v18.4s, v24.4s, v9.4s       // multiplication of lower half for idx 2
+        uxtl                v23.4s, v7.4h
+        sxtl                v10.4s, v28.4h
+        mla                 v18.4s, v6.4s, v29.4s       // multiplication of upper half for idx 2
+        uxtl2               v7.4s, v7.8h
+        sxtl2               v28.4s, v28.8h
+        mla                 v19.4s, v23.4s, v10.4s      // multiplication of lower half for idx 3
+        sub                 w0, w0, #8
+        cmp                 w0, #8
+        mla                 v19.4s, v7.4s, v28.4s       // multiplication of upper half for idx 3
+
+        add                 x16, x16, #16               // advance filter values indexing
+
+        b.ge                2b
+
+        // 4 iterations left
+
+        sub                 x17, x7, #8                 // step back to wrap up the filter pos for last 4 elements
+
+        ldr                 d4, [x8]                    // load src values for idx 0
+        ldr                 d31, [x12, x17]             // load filter values for idx 0
+        uxtl                v4.4s, v4.4h
+        sxtl                v31.4s, v31.4h
+        ldr                 d5, [x9]                    // load src values for idx 1
+        mla                 v16.4s, v4.4s, v31.4s       // multiply-accumulate last 4 elements for idx 0
+        ldr                 d30, [x13, x17]             // load filter values for idx 1
+        uxtl                v5.4s, v5.4h
+        sxtl                v30.4s, v30.4h
+        ldr                 d6, [x10]                   // load src values for idx 2
+        mla                 v17.4s, v5.4s, v30.4s       // multiply-accumulate last 4 elements for idx 1
+        ldr                 d29, [x14, x17]             // load filter values for idx 2
+        uxtl                v6.4s, v6.4h
+        sxtl                v29.4s, v29.4h
+        ldr                 d7, [x11]                   // load src values for idx 3
+        ldr                 d28, [x15, x17]             // load filter values for idx 3
+        mla                 v18.4s, v6.4s, v29.4s       // multiply-accumulate last 4 elements for idx 2
+        uxtl                v7.4s, v7.4h
+        sxtl                v28.4s, v28.4h
+        addp                v16.4s, v16.4s, v17.4s
+        mla                 v19.4s, v7.4s, v28.4s       // multiply-accumulate last 4 elements for idx 3
+        subs                w2, w2, #4
+        addp                v18.4s, v18.4s, v19.4s
+        addp                v16.4s, v16.4s, v18.4s
+        sshl                v16.4s, v16.4s, v20.4s
+        smin                v16.4s, v16.4s, v21.4s
+
+        st1                 {v16.4s}, [x1], #16
+        add                 x4, x4, x7, lsl #2
+        b.gt                1b
+
+        ldp                 d8, d9, [sp]
+        ldp                 d10, d11, [sp, #0x10]
+
+        add                 sp, sp, #0x20
+
+        ret
+endfunc
\ No newline at end of file
diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
index 993cdd67dd..ef6029e068 100644
--- a/libswscale/aarch64/swscale.c
+++ b/libswscale/aarch64/swscale.c
@@ -34,6 +34,16 @@ void ff_hscale16to15_X4_neon_asm(int shift, int16_t *_dst, int dstW,
                       const uint8_t *_src, const int16_t *filter,
                       const int32_t *filterPos, int filterSize);
 
+void ff_hscale16to19_4_neon_asm(int shift, int16_t *_dst, int dstW,
+                      const uint8_t *_src, const int16_t *filter,
+                      const int32_t *filterPos, int filterSize);
+void ff_hscale16to19_X8_neon_asm(int shift, int16_t *_dst, int dstW,
+                      const uint8_t *_src, const int16_t *filter,
+                      const int32_t *filterPos, int filterSize);
+void ff_hscale16to19_X4_neon_asm(int shift, int16_t *_dst, int dstW,
+                      const uint8_t *_src, const int16_t *filter,
+                      const int32_t *filterPos, int filterSize);
+
 #define SCALE_FUNC(filter_n, from_bpc, to_bpc, opt) \
 void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
                                                 SwsContext *c, int16_t *data, \
@@ -43,7 +53,8 @@ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
 #define SCALE_FUNCS(filter_n, opt) \
     SCALE_FUNC(filter_n,  8, 15, opt); \
     SCALE_FUNC(filter_n, 8, 19, opt); \
-    SCALE_FUNC(filter_n, 16, 15, opt);
+    SCALE_FUNC(filter_n, 16, 15, opt); \
+    SCALE_FUNC(filter_n, 16, 19, opt);
 #define ALL_SCALE_FUNCS(opt) \
     SCALE_FUNCS(4, opt); \
     SCALE_FUNCS(X8, opt); \
@@ -73,6 +84,9 @@ void ff_yuv2plane1_8_neon(
         if (c->dstBpc <= 14)                                            \
             hscalefn =                                                  \
                 ff_hscale16to15_ ## filtersize ## _ ## opt;             \
+        else                                                            \
+            hscalefn =                                                  \
+                ff_hscale16to19_ ## filtersize ## _ ## opt;             \
     }                                                                   \
 } while (0)
 
@@ -150,4 +164,58 @@ void ff_hscale16to15_X4_neon(SwsContext *c, int16_t *_dst, int dstW,
         sh = 16 - 1;
     }
     ff_hscale16to15_X4_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize);
+}
+
+void ff_hscale16to19_4_neon(SwsContext *c, int16_t *_dst, int dstW,
+                           const uint8_t *_src, const int16_t *filter,
+                           const int32_t *filterPos, int filterSize)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat);
+    int bits            = desc->comp[0].depth - 1;
+    int sh              = bits - 4;
+
+    if ((isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8) && desc->comp[0].depth<16) {
+        sh = 9;
+    } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float inputs are processed like uint 16bpc */
+        sh = 16 - 1 - 4;
+    }
+
+    ff_hscale16to19_4_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize);
+
+}
+
+void ff_hscale16to19_X8_neon(SwsContext *c, int16_t *_dst, int dstW,
+                           const uint8_t *_src, const int16_t *filter,
+                           const int32_t *filterPos, int filterSize)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat);
+    int bits            = desc->comp[0].depth - 1;
+    int sh              = bits - 4;
+
+    if ((isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8) && desc->comp[0].depth<16) {
+        sh = 9;
+    } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float inputs are processed like uint 16bpc */
+        sh = 16 - 1 - 4;
+    }
+
+    ff_hscale16to19_X8_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize);
+
+}
+
+void ff_hscale16to19_X4_neon(SwsContext *c, int16_t *_dst, int dstW,
+                           const uint8_t *_src, const int16_t *filter,
+                           const int32_t *filterPos, int filterSize)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat);
+    int bits            = desc->comp[0].depth - 1;
+    int sh              = bits - 4;
+
+    if ((isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8) && desc->comp[0].depth<16) {
+        sh = 9;
+    } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float inputs are processed like uint 16bpc */
+        sh = 16 - 1 - 4;
+    }
+
+    ff_hscale16to19_X4_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize);
+
 }
\ No newline at end of file
-- 
2.37.1
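
For readers following the assembly above, the per-pixel arithmetic of the
16-to-19 routines can be sketched in scalar C. This is only an illustration,
not part of the patch: the name hscale16to19_ref is hypothetical, the shift
is passed in already computed (as the wrappers in swscale.c do), and the
sketch assumes the compiler performs an arithmetic right shift on negative
signed values and wraps on unsigned-to-signed conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar reference for the 16-to-19 horizontal scaler.
 * Not part of the patch; it mirrors the steps the NEON code performs. */
static void hscale16to19_ref(int sh, int32_t *dst, int dstW,
                             const uint16_t *src, const int16_t *filter,
                             const int32_t *filterPos, int filterSize)
{
    const int32_t max19 = (1 << 19) - 1;    /* same clamp the asm builds with shl/sub */

    assert(sh >= 0 && sh < 32 && filterSize > 0);

    for (int i = 0; i < dstW; i++) {
        uint32_t acc = 0;                   /* wraps modulo 2^32, like 32-bit mla */

        for (int j = 0; j < filterSize; j++) {
            uint32_t s = src[filterPos[i] + j];                          /* zero-extend, as uxtl */
            uint32_t f = (uint32_t)(int32_t)filter[filterSize * i + j];  /* sign-extend, as sxtl */
            acc += s * f;
        }

        /* Truncating arithmetic shift right, as sshl with a negated count. */
        int32_t val = (int32_t)acc >> sh;
        dst[i] = val < max19 ? val : max19; /* smin against (1 << 19) - 1 */
    }
}
```

With a 16-bit source depth the wrappers pass sh = 11, so four taps of 4096
over the samples 1000, 2000, 3000, 4000 give 40960000 >> 11 = 20000.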


* Re: [FFmpeg-devel] [PATCH 1/4] sw_scale: Add specializations for hscale 8 to 19
  2022-10-17 13:07 ` [FFmpeg-devel] [PATCH 1/4] sw_scale: Add specializations for hscale 8 to 19 Hubert Mazur
@ 2022-10-24 12:31   ` Martin Storsjö
  0 siblings, 0 replies; 8+ messages in thread
From: Martin Storsjö @ 2022-10-24 12:31 UTC (permalink / raw)
  To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop

On Mon, 17 Oct 2022, Hubert Mazur wrote:

> Add arm64 neon implementations for hscale 8 to 19 with filter
> sizes 4, X4 and X8. The implementations are based on very similar ones
> dedicated to hscale 8 to 15. The major change is in how the data is
> stored: the result is written as int32_t instead of int16_t.
>
> These functions are heavily inspired by patches provided by J. Swinney
> and M. Storsjö for hscale8to15, which were slightly adapted for
> hscale8to19.
>
> The tests and benchmarks were run on AWS Graviton 2 instances. The results
> from the checkasm tool are shown below.
>
> hscale_8_to_19__fs_4_dstW_512_c: 5663.2
> hscale_8_to_19__fs_4_dstW_512_neon: 1259.7
> hscale_8_to_19__fs_8_dstW_512_c: 9306.0
> hscale_8_to_19__fs_8_dstW_512_neon: 2020.2
> hscale_8_to_19__fs_12_dstW_512_c: 12932.7
> hscale_8_to_19__fs_12_dstW_512_neon: 2462.5
> hscale_8_to_19__fs_16_dstW_512_c: 16844.2
> hscale_8_to_19__fs_16_dstW_512_neon: 4671.2
> hscale_8_to_19__fs_32_dstW_512_c: 32803.7
> hscale_8_to_19__fs_32_dstW_512_neon: 5474.2
> hscale_8_to_19__fs_40_dstW_512_c: 40948.0
> hscale_8_to_19__fs_40_dstW_512_neon: 6669.7
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libswscale/aarch64/hscale.S  | 292 ++++++++++++++++++++++++++++++++++-
> libswscale/aarch64/swscale.c |  13 +-
> 2 files changed, 300 insertions(+), 5 deletions(-)
>
> diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S
> index a16d3dca42..5e8cad9825 100644
> --- a/libswscale/aarch64/hscale.S
> +++ b/libswscale/aarch64/hscale.S
> @@ -218,7 +218,6 @@ function ff_hscale8to15_4_neon, export=1
> //  2. Interleaved prefetching src data and madd
> //  3. Complete madd
> //  4. Complete remaining iterations when dstW % 8 != 0
> -

Nit: stray whitespace changes

>         sub                 sp, sp, #32                 // allocate 32 bytes on the stack
>         cmp                 w2, #16                     // if dstW <16, skip to the last block used for wrapping up
>         b.lt                2f
> @@ -347,3 +346,294 @@ function ff_hscale8to15_4_neon, export=1
>         add                 sp, sp, #32                 // clean up stack
>         ret
> endfunc
> +
> +function ff_hscale8to19_4_neon, export=1
> +        // x0               SwsContext *c (unused)
> +        // x1               int32_t *dst
> +        // w2               int dstW
> +        // x3               const uint8_t *src // treat it as uint16_t *src
> +        // x4               const uint16_t *filter
> +        // x5               const int32_t *filterPos
> +        // w6               int filterSize
> +
> +        movi                v18.4s, #1
> +        movi                v17.4s, #1
> +        shl                 v18.4s, v18.4s, #19
> +        sub                 v18.4s, v18.4s, v17.4s      // max allowed value
> +
> +        cmp                 w2, #16
> +        b.lt                2f // move to last block
> +
> +        ldp                 w8, w9, [x5]                // filterPos[0], filterPos[1]
> +        ldp                 w10, w11, [x5, #8]          // filterPos[2], filterPos[3]
> +        ldp                 w12, w13, [x5, #16]         // filterPos[4], filterPos[5]
> +        ldp                 w14, w15, [x5, #24]         // filterPos[6], filterPos[7]
> +        add                 x5, x5, #32
> +
> +        // load data from
> +        ldr                 w8, [x3, w8, UXTW]
> +        ldr                 w9, [x3, w9, UXTW]
> +        ldr                 w10, [x3, w10, UXTW]
> +        ldr                 w11, [x3, w11, UXTW]
> +        ldr                 w12, [x3, w12, UXTW]
> +        ldr                 w13, [x3, w13, UXTW]
> +        ldr                 w14, [x3, w14, UXTW]
> +        ldr                 w15, [x3, w15, UXTW]
> +
> +        sub                 sp, sp, #32
> +
> +        stp                 w8, w9, [sp]
> +        stp                 w10, w11, [sp, #8]
> +        stp                 w12, w13, [sp, #16]
> +        stp                 w14, w15, [sp, #24]
> +
> +1:
> +        ld4                 {v0.8b, v1.8b, v2.8b, v3.8b}, [sp]
> +        ld4                 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7]
> +        // load filterPositions into registers for next iteration
> +
> +        ldp                 w8, w9, [x5]                // filterPos[0], filterPos[1]
> +        ldp                 w10, w11, [x5, #8]          // filterPos[2], filterPos[3]
> +        ldp                 w12, w13, [x5, #16]         // filterPos[4], filterPos[5]
> +        ldp                 w14, w15, [x5, #24]         // filterPos[6], filterPos[7]
> +        add                 x5, x5, #32
> +        uxtl                v0.8h, v0.8b
> +        ldr                 w8, [x3, w8, UXTW]
> +        smull               v5.4s, v0.4h, v28.4h        // multiply first column of src
> +        ldr                 w9, [x3, w9, UXTW]
> +        smull2              v6.4s, v0.8h, v28.8h
> +        stp                 w8, w9, [sp]
> +
> +        uxtl                v1.8h, v1.8b
> +        ldr                 w10, [x3, w10, UXTW]
> +        smlal               v5.4s, v1.4h, v29.4h        // multiply second column of src
> +        ldr                 w11, [x3, w11, UXTW]
> +        smlal2              v6.4s, v1.8h, v29.8h
> +        stp                 w10, w11, [sp, #8]
> +
> +        uxtl                v2.8h, v2.8b
> +        ldr                 w12, [x3, w12, UXTW]
> +        smlal               v5.4s, v2.4h, v30.4h        // multiply third column of src
> +        ldr                 w13, [x3, w13, UXTW]
> +        smlal2              v6.4s, v2.8h, v30.8h
> +        stp                 w12, w13, [sp, #16]
> +
> +        uxtl                v3.8h, v3.8b
> +        ldr                 w14, [x3, w14, UXTW]
> +        smlal               v5.4s, v3.4h, v31.4h        // multiply fourth column of src
> +        ldr                 w15, [x3, w15, UXTW]
> +        smlal2              v6.4s, v3.8h, v31.8h
> +        stp                 w14, w15, [sp, #24]
> +
> +        sub                 w2, w2, #8
> +        sshr                v5.4s, v5.4s, #3
> +        sshr                v6.4s, v6.4s, #3
> +        smin                v5.4s, v5.4s, v18.4s
> +        smin                v6.4s, v6.4s, v18.4s
> +
> +        st1                 {v5.4s, v6.4s}, [x1], #32
> +        cmp                 w2, #16
> +        b.ge                1b
> +
> +        // here we make last iteration, without updating the registers
> +        ld4                 {v0.8b, v1.8b, v2.8b, v3.8b}, [sp]
> +        ld4                 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7]
> +
> +        uxtl                v0.8h, v0.8b
> +        uxtl                v1.8h, v1.8b
> +        smull               v5.4s, v0.4h, v28.4h
> +        smull2              v6.4s, v0.8h, v28.8h
> +        uxtl                v2.8h, v2.8b
> +        smlal               v5.4s, v1.4h, v29.4H
> +        smlal2              v6.4s, v1.8h, v29.8H
> +        uxtl                v3.8h, v3.8b
> +        smlal               v5.4s, v2.4h, v30.4H
> +        smlal2              v6.4s, v2.8h, v30.8H
> +        smlal               v5.4s, v3.4h, v31.4H
> +        smlal2              v6.4s, v3.8h, v31.8h
> +
> +        sshr                v5.4s, v5.4s, #3
> +        sshr                v6.4s, v6.4s, #3
> +
> +        smin                v5.4s, v5.4s, v18.4s
> +        smin                v6.4s, v6.4s, v18.4s
> +
> +        sub                 w2, w2, #8
> +        st1                 {v5.4s, v6.4s}, [x1], #32
> +        add                 sp, sp, #32 // restore stack
> +        cbnz                w2, 2f
> +
> +        ret
> +
> +2:
> +        ldr                 w8, [x5], #4 // load filterPos
> +        add                 x9, x3, w8, UXTW // src + filterPos
> +        ld1                 {v0.s}[0], [x9] // load 4 * uint8_t* into one single
> +        ld1                 {v31.4h}, [x4], #8
> +        uxtl                v0.8h, v0.8b
> +        smull               v5.4s, v0.4h, v31.4H
> +        saddlv              d0, v5.4S
> +        sqshrn              s0, d0, #3
> +        smin                v0.4s, v0.4s, v18.4s
> +        st1                 {v0.s}[0], [x1], #4
> +        sub                 w2, w2, #1
> +        cbnz                w2, 2b // if iterations remain jump to beginning
> +
> +        ret
> +endfunc
> +
> +function ff_hscale8to19_X8_neon, export=1
> +        movi                v20.4s, #1
> +        movi                v17.4s, #1
> +        shl                 v20.4s, v20.4s, #19
> +        sub                 v20.4s, v20.4s, v17.4s
> +
> +        sbfiz               x7, x6, #1, #32             // filterSize*2 (*2 because int16)
> +1:
> +        mov                 x16, x4                     // filter0 = filter
> +        ldr                 w8, [x5], #4                // filterPos[idx]
> +        add                 x12, x16, x7                // filter1 = filter0 + filterSize*2
> +        ldr                 w0, [x5], #4                // filterPos[idx + 1]
> +        add                 x13, x12, x7                // filter2 = filter1 + filterSize*2
> +        ldr                 w11, [x5], #4               // filterPos[idx + 2]
> +        add                 x4, x13, x7                 // filter3 = filter2 + filterSize*2
> +        ldr                 w9, [x5], #4                // filterPos[idx + 3]
> +        movi                v0.2D, #0                   // val sum part 1 (for dst[0])
> +        movi                v1.2D, #0                   // val sum part 2 (for dst[1])
> +        movi                v2.2D, #0                   // val sum part 3 (for dst[2])
> +        movi                v3.2D, #0                   // val sum part 4 (for dst[3])
> +        add                 x17, x3, w8, UXTW           // srcp + filterPos[0]
> +        add                 x8,  x3, w0, UXTW           // srcp + filterPos[1]
> +        add                 x0, x3, w11, UXTW           // srcp + filterPos[2]
> +        add                 x11, x3, w9, UXTW           // srcp + filterPos[3]
> +        mov                 w15, w6                     // filterSize counter
> +2:      ld1                 {v4.8B}, [x17], #8          // srcp[filterPos[0] + {0..7}]
> +        ld1                 {v5.8H}, [x16], #16         // load 8x16-bit filter values, part 1
> +        uxtl                v4.8H, v4.8B                // unpack part 1 to 16-bit
> +        smlal               v0.4S, v4.4H, v5.4H         // v0 accumulates srcp[filterPos[0] + {0..3}] * filter[{0..3}]
> +        ld1                 {v6.8B}, [x8], #8           // srcp[filterPos[1] + {0..7}]
> +        smlal2              v0.4S, v4.8H, v5.8H         // v0 accumulates srcp[filterPos[0] + {4..7}] * filter[{4..7}]
> +        ld1                 {v7.8H}, [x12], #16         // load 8x16-bit at filter+filterSize
> +        ld1                 {v16.8B}, [x0], #8          // srcp[filterPos[2] + {0..7}]
> +        uxtl                v6.8H, v6.8B                // unpack part 2 to 16-bit
> +        ld1                 {v17.8H}, [x13], #16        // load 8x16-bit at filter+2*filterSize
> +        uxtl                v16.8H, v16.8B              // unpack part 3 to 16-bit
> +        smlal               v1.4S, v6.4H, v7.4H         // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}]
> +        ld1                 {v18.8B}, [x11], #8         // srcp[filterPos[3] + {0..7}]
> +        smlal               v2.4S, v16.4H, v17.4H       // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}]
> +        ld1                 {v19.8H}, [x4], #16         // load 8x16-bit at filter+3*filterSize
> +        smlal2              v2.4S, v16.8H, v17.8H       // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}]
> +        uxtl                v18.8H, v18.8B              // unpack part 4 to 16-bit
> +        smlal2              v1.4S, v6.8H, v7.8H         // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}]
> +        smlal               v3.4S, v18.4H, v19.4H       // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}]
> +        subs                w15, w15, #8                // j -= 8: processed 8/filterSize
> +        smlal2              v3.4S, v18.8H, v19.8H       // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}]
> +        b.gt                2b                          // inner loop if filterSize not consumed completely
> +        addp                v0.4S, v0.4S, v1.4S         // part01 horizontal pair adding
> +        addp                v2.4S, v2.4S, v3.4S         // part23 horizontal pair adding
> +        addp                v0.4S, v0.4S, v2.4S         // part0123 horizontal pair adding
> +        subs                w2, w2, #4                  // dstW -= 4
> +        sshr                v0.4s, v0.4S, #3            // shift and clip the 4x32-bit final values
> +        smin                v0.4s, v0.4s, v20.4s
> +        st1                 {v0.4s}, [x1], #16           // write to destination part0123
> +        b.gt                1b                          // loop until end of line
> +        ret
> +endfunc
> +
> +function ff_hscale8to19_X4_neon, export=1
> +        // x0  SwsContext *c (not used)
> +        // x1  int16_t *dst
> +        // w2  int dstW
> +        // x3  const uint8_t *src
> +        // x4  const int16_t *filter
> +        // x5  const int32_t *filterPos
> +        // w6  int filterSize
> +
> +        movi                v20.4s, #1
> +        movi                v17.4s, #1
> +        shl                 v20.4s, v20.4s, #19
> +        sub                 v20.4s, v20.4s, v17.4s
> +
> +        lsl                 w7, w6, #1
> +1:
> +        ldp                 w8, w9, [x5]
> +        ldp                 w10, w11, [x5, #8]
> +
> +        movi                v16.2d, #0                  // initialize accumulator for idx + 0
> +        movi                v17.2d, #0                  // initialize accumulator for idx + 1
> +        movi                v18.2d, #0                  // initialize accumulator for idx + 2
> +        movi                v19.2d, #0                  // initialize accumulator for idx + 3
> +
> +        mov                 x12, x4                     // filter + 0
> +        add                 x13, x4, x7                 // filter + 1
> +        add                 x8, x3, w8, UXTW            // srcp + filterPos 0
> +        add                 x14, x13, x7                // filter + 2
> +        add                 x9, x3, w9, UXTW            // srcp + filterPos 1
> +        add                 x15, x14, x7                // filter + 3
> +        add                 x10, x3, w10, UXTW          // srcp + filterPos 2
> +        mov                 w0, w6                      // save the filterSize to temporary variable
> +        add                 x11, x3, w11, UXTW          // srcp + filterPos 3
> +        add                 x5, x5, #16                 // advance filter position
> +        mov                 x16, xzr                    // clear the register x16 used for offsetting the filter values
> +
> +2:
> +        ldr                 d4, [x8], #8                // load src values for idx 0
> +        ldr                 q31, [x12, x16]             // load filter values for idx 0
> +        uxtl                v4.8h, v4.8b                // extend type to match the filter's size
> +        ldr                 d5, [x9], #8                // load src values for idx 1
> +        smlal               v16.4s, v4.4h, v31.4h       // multiplication of lower half for idx 0
> +        uxtl                v5.8h, v5.8b                // extend type to match the filter's size
> +        ldr                 q30, [x13, x16]             // load filter values for idx 1
> +        smlal2              v16.4s, v4.8h, v31.8h       // multiplication of upper half for idx 0
> +        ldr                 d6, [x10], #8               // load src values for idx 2
> +        ldr                 q29, [x14, x16]             // load filter values for idx 2
> +        smlal               v17.4s, v5.4h, v30.4H       // multiplication of lower half for idx 1
> +        ldr                 d7, [x11], #8               // load src values for idx 3
> +        smlal2              v17.4s, v5.8h, v30.8H       // multiplication of upper half for idx 1
> +        uxtl                v6.8h, v6.8B                // extend type to match the filter's size
> +        ldr                 q28, [x15, x16]             // load filter values for idx 3
> +        smlal               v18.4s, v6.4h, v29.4h       // multiplication of lower half for idx 2
> +        uxtl                v7.8h, v7.8B
> +        smlal2              v18.4s, v6.8h, v29.8H       // multiplication of upper half for idx 2
> +        sub                 w0, w0, #8
> +        smlal               v19.4s, v7.4h, v28.4H       // multiplication of lower half for idx 3
> +        cmp                 w0, #8
> +        smlal2              v19.4s, v7.8h, v28.8h       // multiplication of upper half for idx 3
> +        add                 x16, x16, #16                // advance filter values indexing
> +
> +        b.ge                2b
> +
> +
> +        // 4 iterations left
> +
> +        sub                 x17, x7, #8                 // step back to wrap up the filter pos for last 4 elements
> +
> +        ldr                 s4, [x8]                    // load src values for idx 0
> +        ldr                 d31, [x12, x17]             // load filter values for idx 0
> +        uxtl                v4.8h, v4.8b                // extend type to match the filter's size
> +        ldr                 s5, [x9]                    // load src values for idx 1
> +        smlal               v16.4s, v4.4h, v31.4h
> +        ldr                 d30, [x13, x17]             // load filter values for idx 1
> +        uxtl                v5.8h, v5.8b                // extend type to match the filter's size
> +        ldr                 s6, [x10]                   // load src values for idx 2
> +        smlal               v17.4s, v5.4h, v30.4h
> +        uxtl                v6.8h, v6.8B                // extend type to match the filter's size
> +        ldr                 d29, [x14, x17]             // load filter values for idx 2
> +        ldr                 s7, [x11]                   // load src values for idx 3
> +        addp                v16.4s, v16.4s, v17.4s
> +        uxtl                v7.8h, v7.8B
> +        ldr                 d28, [x15, x17]             // load filter values for idx 3
> +        smlal               v18.4s, v6.4h, v29.4h
> +        smlal               v19.4s, v7.4h, v28.4h
> +        subs                w2, w2, #4
> +        addp                v18.4s, v18.4s, v19.4s
> +        addp                v16.4s, v16.4s, v18.4s
> +        sshr                v16.4s, v16.4s, #3
> +        smin                v16.4s, v16.4s, v20.4s
> +
> +        st1                 {v16.4s}, [x1], #16
> +        add                 x4, x4, x7, lsl #2
> +        b.gt                1b
> +        ret
> +
> +endfunc
> \ No newline at end of file

Nit: The file could use a trailing newline
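
For readers following the assembly above, here is a minimal scalar sketch in C of what the 8 to 19 routines appear to compute: a per-output dot product of 8-bit source samples with 16-bit filter coefficients, an arithmetic shift right by 3, and a clamp to (1 << 19) - 1 (the constant built in v20 with movi/shl/sub). The second function mirrors the loop structure of the X4 variant, which consumes taps in blocks of 8 and then finishes with a final block of 4. The function names are hypothetical, and the assumption that X4 is only used when filterSize % 8 == 4 is inferred from the quoted loop, not stated in the patch.

```c
#include <stdint.h>

#define CLIP19 ((1 << 19) - 1) /* upper clamp, same value the asm keeps in v20 */

/* Scalar reference (hypothetical name): one output sample per filterPos entry. */
static void hscale8to19_ref(int32_t *dst, int dstW, const uint8_t *src,
                            const int16_t *filter, const int32_t *filterPos,
                            int filterSize)
{
    for (int i = 0; i < dstW; i++) {
        int32_t val = 0;
        for (int j = 0; j < filterSize; j++)
            val += src[filterPos[i] + j] * filter[filterSize * i + j];
        val >>= 3; /* arithmetic shift on common compilers, like sshr #3 */
        dst[i] = val < CLIP19 ? val : CLIP19; /* smin against v20 */
    }
}

/* Same arithmetic, split like the X4 loop: blocks of 8 taps, then the last 4.
 * Assumption: filterSize % 8 == 4 (for example 12 or 20). */
static void hscale8to19_x4_sketch(int32_t *dst, int dstW, const uint8_t *src,
                                  const int16_t *filter,
                                  const int32_t *filterPos, int filterSize)
{
    for (int i = 0; i < dstW; i++) {
        const uint8_t *s = src + filterPos[i];
        const int16_t *f = filter + filterSize * i;
        int32_t val = 0;
        int j = 0;

        /* Main loop: 8 taps per iteration (smlal + smlal2 in the asm). */
        for (; j + 8 <= filterSize; j += 8)
            for (int k = 0; k < 8; k++)
                val += s[j + k] * f[j + k];

        /* Tail: the source pointer has already advanced by j, while the
         * filter is addressed from its end (offset filterSize - 4). */
        for (int k = 0; k < 4; k++)
            val += s[j + k] * f[filterSize - 4 + k];

        val >>= 3;
        dst[i] = val < CLIP19 ? val : CLIP19;
    }
}
```

Both functions should produce identical output for any filterSize with filterSize % 8 == 4, which is one way to sanity-check the tail handling outside of checkasm.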

> diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
> index d1312c6658..479fe129d0 100644
> --- a/libswscale/aarch64/swscale.c
> +++ b/libswscale/aarch64/swscale.c
> @@ -29,7 +29,8 @@ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
>                                                 const int16_t *filter, \
>                                                 const int32_t *filterPos, int filterSize)
> #define SCALE_FUNCS(filter_n, opt) \
> -    SCALE_FUNC(filter_n,  8, 15, opt);
> +    SCALE_FUNC(filter_n,  8, 15, opt); \
> +    SCALE_FUNC(filter_n, 8, 19, opt);

Nit: There's no need to preserve the odd spacing of the existing line 
here.

Other than that, this patch (and the others) mostly seem fine. I've got a 
version of the patches with these nits fixed locally (fixing it was a bit 
annoying wrt rebasing the later patches though).

// Martin

* Re: [FFmpeg-devel] [PATCH 4/4] sw_scale: Add specializations for hscale 16 to 19
  2022-10-17 13:07 ` [FFmpeg-devel] [PATCH 4/4] sw_scale: Add specializations for hscale 16 to 19 Hubert Mazur
@ 2022-10-24 13:19   ` Martin Storsjö
  2022-10-25  7:14     ` Hubert Mazur
  0 siblings, 1 reply; 8+ messages in thread
From: Martin Storsjö @ 2022-10-24 13:19 UTC (permalink / raw)
  To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop

On Mon, 17 Oct 2022, Hubert Mazur wrote:

> Provide arm64 neon optimized implementations for hscale16To19 with
> filter sizes 4, 8 and X4.
>
> The tests and benchmarks run on AWS Graviton 2 instances.
> The results from a checkasm tool are shown below.
>
> hscale_16_to_19__fs_4_dstW_512_c: 6216.0
> hscale_16_to_19__fs_4_dstW_512_neon: 2257.0
> hscale_16_to_19__fs_8_dstW_512_c: 10417.7
> hscale_16_to_19__fs_8_dstW_512_neon: 3112.5
> hscale_16_to_19__fs_12_dstW_512_c: 14890.5
> hscale_16_to_19__fs_12_dstW_512_neon: 3899.0
> hscale_16_to_19__fs_16_dstW_512_c: 19006.5
> hscale_16_to_19__fs_16_dstW_512_neon: 5341.2
> hscale_16_to_19__fs_32_dstW_512_c: 36629.5
> hscale_16_to_19__fs_32_dstW_512_neon: 9502.7
> hscale_16_to_19__fs_40_dstW_512_c: 45477.5
> hscale_16_to_19__fs_40_dstW_512_neon: 11552.0
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libswscale/aarch64/hscale.S  | 402 +++++++++++++++++++++++++++++++++++
> libswscale/aarch64/swscale.c |  70 +++++-
> 2 files changed, 471 insertions(+), 1 deletion(-)

> +void ff_hscale16to19_4_neon_asm(int shift, int16_t *_dst, int dstW,
> +                      const uint8_t *_src, const int16_t *filter,
> +                      const int32_t *filterPos, int filterSize);
> +void ff_hscale16to19_X8_neon_asm(int shift, int16_t *_dst, int dstW,
> +                      const uint8_t *_src, const int16_t *filter,
> +                      const int32_t *filterPos, int filterSize);
> +void ff_hscale16to19_X4_neon_asm(int shift, int16_t *_dst, int dstW,
> +                      const uint8_t *_src, const int16_t *filter,
> +                      const int32_t *filterPos, int filterSize);
> +
> #define SCALE_FUNC(filter_n, from_bpc, to_bpc, opt) \
> void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
>                                                 SwsContext *c, int16_t *data, \
> @@ -43,7 +53,8 @@ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
> #define SCALE_FUNCS(filter_n, opt) \
>     SCALE_FUNC(filter_n,  8, 15, opt); \
>     SCALE_FUNC(filter_n, 8, 19, opt); \
> -    SCALE_FUNC(filter_n, 16, 15, opt);
> +    SCALE_FUNC(filter_n, 16, 15, opt); \
> +    SCALE_FUNC(filter_n, 16, 19, opt);

So this declares the functions we're implementing as C wrappers below, and 
the manual declarations further up declare the actual asm functions?

I guess that works, although it makes unnecessary extern functions. In 
such cases, we usually have the C functions be static functions, placed 
above the code that uses them. But it's not a big deal.
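
To illustrate the point about static wrappers, here is a minimal sketch in C. Everything in it is hypothetical: DemoContext stands in for SwsContext, demo_hscale16to19_4_asm is a plain C stand-in for the routine that would live in hscale.S, and deriving the shift as source depth minus one is an assumption that covers only the simplest case. The only thing it is meant to show is the shape: the wrapper has internal linkage and sits above the code that would install it, so no extra extern symbol is created.

```c
#include <stdint.h>

/* Hypothetical stand-in for SwsContext; only the field this sketch needs. */
typedef struct DemoContext {
    int src_depth; /* bits per input sample, e.g. 16 */
} DemoContext;

/* Stand-in for the assembly routine (the patch declares
 * ff_hscale16to19_4_neon_asm with this parameter list). Written in C here
 * so the sketch builds on its own. */
static void demo_hscale16to19_4_asm(int shift, int16_t *_dst, int dstW,
                                    const uint8_t *_src, const int16_t *filter,
                                    const int32_t *filterPos, int filterSize)
{
    int32_t *dst = (int32_t *)_dst;               /* 19-bit results in 32-bit slots */
    const uint16_t *src = (const uint16_t *)_src; /* 16-bit input samples */

    for (int i = 0; i < dstW; i++) {
        int32_t val = 0;
        for (int j = 0; j < filterSize; j++)
            val += src[filterPos[i] + j] * filter[filterSize * i + j];
        val >>= shift;
        dst[i] = val < (1 << 19) - 1 ? val : (1 << 19) - 1;
    }
}

/* Static C wrapper: has the usual hscale signature, computes the shift from
 * the context, and forwards to the asm routine. Being static, it adds no
 * extern symbol. */
static void demo_hscale16to19_4(DemoContext *c, int16_t *dst, int dstW,
                                const uint8_t *src, const int16_t *filter,
                                const int32_t *filterPos, int filterSize)
{
    int shift = c->src_depth - 1; /* assumption: simplest case only */

    demo_hscale16to19_4_asm(shift, dst, dstW, src, filter, filterPos,
                            filterSize);
}
```

With this layout the SCALE_FUNC declaration for the 16 to 19 case would not be needed at all, since the init code below the wrapper can refer to it directly.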

Other than that, this patchset mostly seems fine.

However, I tested the patches on x86, and the new checkasm tests do fail 
on x86 (both i386 and x86_64) - so that needs to be fixed anyway. So since 
we'll need to do a new round anyway, please do try to fix up the minor 
cosmetics I mentioned.

// Martin


* Re: [FFmpeg-devel] [PATCH 4/4] sw_scale: Add specializations for hscale 16 to 19
  2022-10-24 13:19   ` Martin Storsjö
@ 2022-10-25  7:14     ` Hubert Mazur
  0 siblings, 0 replies; 8+ messages in thread
From: Hubert Mazur @ 2022-10-25  7:14 UTC (permalink / raw)
  To: Martin Storsjö; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop

Thanks for the review.
I will fix the failing checkasm first and then take care of the minor
issues. I will try to resend fixed versions this week.

Regards,
Hubert

On Mon, Oct 24, 2022 at 3:19 PM Martin Storsjö <martin@martin.st> wrote:

> On Mon, 17 Oct 2022, Hubert Mazur wrote:
>
> > Provide arm64 neon optimized implementations for hscale16To19 with
> > filter sizes 4, 8 and X4.
> >
> > The tests and benchmarks run on AWS Graviton 2 instances.
> > The results from a checkasm tool are shown below.
> >
> > hscale_16_to_19__fs_4_dstW_512_c: 6216.0
> > hscale_16_to_19__fs_4_dstW_512_neon: 2257.0
> > hscale_16_to_19__fs_8_dstW_512_c: 10417.7
> > hscale_16_to_19__fs_8_dstW_512_neon: 3112.5
> > hscale_16_to_19__fs_12_dstW_512_c: 14890.5
> > hscale_16_to_19__fs_12_dstW_512_neon: 3899.0
> > hscale_16_to_19__fs_16_dstW_512_c: 19006.5
> > hscale_16_to_19__fs_16_dstW_512_neon: 5341.2
> > hscale_16_to_19__fs_32_dstW_512_c: 36629.5
> > hscale_16_to_19__fs_32_dstW_512_neon: 9502.7
> > hscale_16_to_19__fs_40_dstW_512_c: 45477.5
> > hscale_16_to_19__fs_40_dstW_512_neon: 11552.0
> >
> > Signed-off-by: Hubert Mazur <hum@semihalf.com>
> > ---
> > libswscale/aarch64/hscale.S  | 402 +++++++++++++++++++++++++++++++++++
> > libswscale/aarch64/swscale.c |  70 +++++-
> > 2 files changed, 471 insertions(+), 1 deletion(-)
>
> > +void ff_hscale16to19_4_neon_asm(int shift, int16_t *_dst, int dstW,
> > +                      const uint8_t *_src, const int16_t *filter,
> > +                      const int32_t *filterPos, int filterSize);
> > +void ff_hscale16to19_X8_neon_asm(int shift, int16_t *_dst, int dstW,
> > +                      const uint8_t *_src, const int16_t *filter,
> > +                      const int32_t *filterPos, int filterSize);
> > +void ff_hscale16to19_X4_neon_asm(int shift, int16_t *_dst, int dstW,
> > +                      const uint8_t *_src, const int16_t *filter,
> > +                      const int32_t *filterPos, int filterSize);
> > +
> > #define SCALE_FUNC(filter_n, from_bpc, to_bpc, opt) \
> > void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
> >                                                 SwsContext *c, int16_t *data, \
> > @@ -43,7 +53,8 @@ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
> > #define SCALE_FUNCS(filter_n, opt) \
> >     SCALE_FUNC(filter_n,  8, 15, opt); \
> >     SCALE_FUNC(filter_n, 8, 19, opt); \
> > -    SCALE_FUNC(filter_n, 16, 15, opt);
> > +    SCALE_FUNC(filter_n, 16, 15, opt); \
> > +    SCALE_FUNC(filter_n, 16, 19, opt);
>
> So this declares the functions we're implementing as C wrappers below, and
> the manual declarations further up declare the actual asm functions?
>
> I guess that works, although it makes unnecessary extern functions. In
> such cases, we usually have the C functions be static functions, placed
> above the code that uses them. But it's not a big deal.
>
> Other than that, this patchset mostly seems fine.
>
> However, I tested the patches on x86, and the new checkasm tests do fail
> on x86 (both i386 and x86_64) - so that needs to be fixed anyway. So since
> we'll need to do a new round anyway, please do try to fix up the minor
> cosmetics I mentioned.
>
> // Martin
>
>

end of thread, other threads:[~2022-10-25  7:15 UTC | newest]

Thread overview: 8+ messages
2022-10-17 13:07 [FFmpeg-devel] [PATCH 0/4] Provide neon implementations for hscale functions Hubert Mazur
2022-10-17 13:07 ` [FFmpeg-devel] [PATCH 1/4] sw_scale: Add specializations for hscale 8 to 19 Hubert Mazur
2022-10-24 12:31   ` Martin Storsjö
2022-10-17 13:07 ` [FFmpeg-devel] [PATCH 2/4] tests/sw_scale: Add test cases for input sizes 16 Hubert Mazur
2022-10-17 13:07 ` [FFmpeg-devel] [PATCH 3/4] sw_scale: Add specializations for hscale 16 to 15 Hubert Mazur
2022-10-17 13:07 ` [FFmpeg-devel] [PATCH 4/4] sw_scale: Add specializations for hscale 16 to 19 Hubert Mazur
2022-10-24 13:19   ` Martin Storsjö
2022-10-25  7:14     ` Hubert Mazur
