Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
* [FFmpeg-devel] [PATCH 0/5] me_cmp: Provide arm64 neon implementations
@ 2022-08-22 15:26 Hubert Mazur
  2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for vsad16 Hubert Mazur
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Hubert Mazur @ 2022-08-22 15:26 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop

Provide optimized arm64 neon implementations of functions from the
motion estimation family.

Hubert Mazur (5):
  lavc/aarch64: Add neon implementation for vsad16
  lavc/aarch64: Add neon implementation of vsse16
  lavc/aarch64: Add neon implementation for vsad_intra16
  lavc/aarch64: Add neon implementation for vsse_intra16
  lavc/aarch64: Provide neon implementation of nsse16

 libavcodec/aarch64/me_cmp_init_aarch64.c |  30 ++
 libavcodec/aarch64/me_cmp_neon.S         | 431 +++++++++++++++++++++++
 2 files changed, 461 insertions(+)

-- 
2.34.1


* [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for vsad16
  2022-08-22 15:26 [FFmpeg-devel] [PATCH 0/5] me_cmp: Provide arm64 neon implementations Hubert Mazur
@ 2022-08-22 15:26 ` Hubert Mazur
  2022-09-02 21:49   ` Martin Storsjö
  2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation of vsse16 Hubert Mazur
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Hubert Mazur @ 2022-08-22 15:26 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop

Provide an optimized implementation of the vsad16 function for arm64.

Performance comparison tests are shown below.
- vsad_0_c: 285.0
- vsad_0_neon: 42.5

Benchmarks and tests are run with the checkasm tool on AWS Graviton 3.
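
For reference, the scalar logic being vectorized looks roughly like this
(a plain-C sketch paraphrasing the generic C implementation in
libavcodec/me_cmp.c from memory, not a verbatim copy; the numbers above
are checkasm cycle counts, obtained with something like
tests/checkasm/checkasm --bench, exact invocation assumed):

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int vsad16_ref(const uint8_t *s1, const uint8_t *s2,
                      ptrdiff_t stride, int h)
{
    int score = 0;
    for (int y = 1; y < h; y++) {   /* h - 1 row pairs */
        for (int x = 0; x < 16; x++)
            score += abs(s1[x] - s2[x] - s1[x + stride] + s2[x + stride]);
        s1 += stride;
        s2 += stride;
    }
    return score;
}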

Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
 libavcodec/aarch64/me_cmp_init_aarch64.c |  5 ++
 libavcodec/aarch64/me_cmp_neon.S         | 75 ++++++++++++++++++++++++
 2 files changed, 80 insertions(+)

diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index fb7c3f5059..ddc5d05611 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -41,6 +41,9 @@ int sse8_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
 int sse4_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
               ptrdiff_t stride, int h);
 
+int vsad16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
+                ptrdiff_t stride, int h);
+
 av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
 {
     int cpu_flags = av_get_cpu_flags();
@@ -57,5 +60,7 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
         c->sse[0] = sse16_neon;
         c->sse[1] = sse8_neon;
         c->sse[2] = sse4_neon;
+
+        c->vsad[0] = vsad16_neon;
     }
 }
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 4198985c6c..d4c0099854 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -584,3 +584,78 @@ function sse4_neon, export=1
 
         ret
 endfunc
+
+function vsad16_neon, export=1
+        // x0           unused
+        // x1           uint8_t *pix1
+        // x2           uint8_t *pix2
+        // x3           ptrdiff_t stride
+        // w4           int h
+
+        sub             w4, w4, #1                      // we need to make h-1 iterations
+        movi            v16.8h, #0
+
+        cmp             w4, #3                          // check if we can make 3 iterations at once
+        add             x5, x1, x3                      // pix1 + stride
+        add             x6, x2, x3                      // pix2 + stride
+        b.le            2f
+
+1:
+        // abs(pix1[0] - pix2[0] - pix1[0 + stride] + pix2[0 + stride])
+        // abs(x) = (x < 0 ? (-x) : (x))
+        ld1             {v0.16b}, [x1], x3              // Load pix1[0], first iteration
+        ld1             {v1.16b}, [x2], x3              // Load pix2[0], first iteration
+        ld1             {v2.16b}, [x5], x3              // Load pix1[0 + stride], first iteration
+        usubl           v31.8h, v0.8b, v1.8b            // Signed difference pix1[0] - pix2[0], first iteration
+        ld1             {v3.16b}, [x6], x3              // Load pix2[0 + stride], first iteration
+        usubl2          v30.8h, v0.16b, v1.16b          // Signed difference pix1[0] - pix2[0], first iteration
+        usubl           v29.8h, v2.8b, v3.8b            // Signed difference pix1[0 + stride] - pix2[0 + stride], first iteration
+        ld1             {v4.16b}, [x1], x3              // Load pix1[0], second iteration
+        usubl2          v28.8h, v2.16b, v3.16b          // Signed difference pix1[0 + stride] - pix2[0 + stride], first iteration
+        ld1             {v5.16b}, [x2], x3              // Load pix2[0], second iteration
+        saba            v16.8h, v31.8h, v29.8h          // Signed absolute difference and accumulate the result. first iteration
+        ld1             {v6.16b}, [x5], x3              // Load pix1[0 + stride], second iteration
+        saba            v16.8h, v30.8h, v28.8h          // Signed absolute difference and accumulate the result. first iteration
+        usubl           v27.8h, v4.8b, v5.8b            // Signed difference pix1[0] - pix2[0], second iteration
+        ld1             {v7.16b}, [x6], x3              // Load pix2[0 + stride], second iteration
+        usubl2          v26.8h, v4.16b, v5.16b          // Signed difference pix1[0] - pix2[0], second iteration
+        usubl           v25.8h, v6.8b, v7.8b            // Signed difference pix1[0 + stride] - pix2[0 + stride], second iteration
+        ld1             {v17.16b}, [x1], x3             // Load pix1[0], third iteration
+        usubl2          v24.8h, v6.16b, v7.16b          // Signed difference pix1[0 + stride] - pix2[0 + stride], second iteration
+        ld1             {v18.16b}, [x2], x3             // Load pix2[0], second iteration
+        saba            v16.8h, v27.8h, v25.8h          // Signed absolute difference and accumulate the result. second iteration
+        ld1             {v19.16b}, [x5], x3             // Load pix1[0 + stride], third iteration
+        saba            v16.8h, v26.8h, v24.8h          // Signed absolute difference and accumulate the result. second iteration
+        usubl           v23.8h, v17.8b, v18.8b          // Signed difference pix1[0] - pix2[0], third iteration
+        ld1             {v20.16b}, [x6], x3             // Load pix2[0 + stride], third iteration
+        usubl2          v22.8h, v17.16b, v18.16b        // Signed difference pix1[0] - pix2[0], third iteration
+        usubl           v21.8h, v19.8b, v20.8b          // Signed difference pix1[0 + stride] - pix2[0 + stride], third iteration
+        sub             w4, w4, #3                      // h -= 3
+        saba            v16.8h, v23.8h, v21.8h          // Signed absolute difference and accumulate the result. third iteration
+        usubl2          v31.8h, v19.16b, v20.16b        // Signed difference pix1[0 + stride] - pix2[0 + stride], third iteration
+        cmp             w4, #3
+        saba            v16.8h, v22.8h, v31.8h          // Signed absolute difference and accumulate the result. third iteration
+
+        b.ge            1b
+        cbz             w4, 3f
+2:
+
+        ld1             {v0.16b}, [x1], x3
+        ld1             {v1.16b}, [x2], x3
+        ld1             {v2.16b}, [x5], x3
+        usubl           v30.8h, v0.8b, v1.8b
+        ld1             {v3.16b}, [x6], x3
+        usubl2          v29.8h, v0.16b, v1.16b
+        usubl           v28.8h, v2.8b, v3.8b
+        usubl2          v27.8h, v2.16b, v3.16b
+        saba            v16.8h, v30.8h, v28.8h
+        subs            w4, w4, #1
+        saba            v16.8h, v29.8h, v27.8h
+
+        b.ne            2b
+3:
+        uaddlv          s17, v16.8h
+        fmov            w0, s17
+
+        ret
+endfunc
-- 
2.34.1


* [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation of vsse16
  2022-08-22 15:26 [FFmpeg-devel] [PATCH 0/5] me_cmp: Provide arm64 neon implementations Hubert Mazur
  2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for vsad16 Hubert Mazur
@ 2022-08-22 15:26 ` Hubert Mazur
  2022-09-04 20:53   ` Martin Storsjö
  2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for vsad_intra16 Hubert Mazur
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Hubert Mazur @ 2022-08-22 15:26 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop

Provide an optimized implementation of vsse16 for arm64.

Performance comparison tests are shown below.
- vsse_0_c: 254.4
- vsse_0_neon: 64.7

Benchmarks and tests are run with the checkasm tool on AWS Graviton 3.
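
For reference, what vsse16 computes, roughly, in C (a sketch from memory,
not a verbatim copy of the generic version; squared vertical differences
instead of the absolute ones used by vsad16):

#include <stddef.h>
#include <stdint.h>

static int vsse16_ref(const uint8_t *s1, const uint8_t *s2,
                      ptrdiff_t stride, int h)
{
    int score = 0;
    for (int y = 1; y < h; y++) {   /* h - 1 row pairs */
        for (int x = 0; x < 16; x++) {
            int d = s1[x] - s2[x] - s1[x + stride] + s2[x + stride];
            score += d * d;
        }
        s1 += stride;
        s2 += stride;
    }
    return score;
}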

Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
 libavcodec/aarch64/me_cmp_init_aarch64.c |  4 +
 libavcodec/aarch64/me_cmp_neon.S         | 97 ++++++++++++++++++++++++
 2 files changed, 101 insertions(+)

diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index ddc5d05611..7b81e48d16 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -43,6 +43,8 @@ int sse4_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
 
 int vsad16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
                 ptrdiff_t stride, int h);
+int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
+                ptrdiff_t stride, int h);
 
 av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
 {
@@ -62,5 +64,7 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
         c->sse[2] = sse4_neon;
 
         c->vsad[0] = vsad16_neon;
+
+        c->vsse[0] = vsse16_neon;
     }
 }
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index d4c0099854..279bae7cb5 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -659,3 +659,100 @@ function vsad16_neon, export=1
 
         ret
 endfunc
+
+function vsse16_neon, export=1
+        // x0           unused
+        // x1           uint8_t *pix1
+        // x2           uint8_t *pix2
+        // x3           ptrdiff_t stride
+        // w4           int h
+
+        movi            v30.4s, #0
+        movi            v29.4s, #0
+
+        add             x5, x1, x3                      // pix1 + stride
+        add             x6, x2, x3                      // pix2 + stride
+        sub             w4, w4, #1                      // we need to make h-1 iterations
+        cmp             w4, #3                          // check if we can make 4 iterations at once
+        b.le            2f
+
+// make 4 iterations at once
+1:
+        // x = abs(pix1[0] - pix2[0] - pix1[0 + stride] + pix2[0 + stride]) =
+        // res = (x) * (x)
+        ld1             {v0.16b}, [x1], x3              // Load pix1[0], first iteration
+        ld1             {v1.16b}, [x2], x3              // Load pix2[0], first iteration
+        ld1             {v2.16b}, [x5], x3              // Load pix1[0 + stride], first iteration
+        usubl           v28.8h, v0.8b, v1.8b            // Signed difference of pix1[0] - pix2[0], first iteration
+        ld1             {v3.16b}, [x6], x3              // Load pix2[0 + stride], first iteration
+        usubl2          v27.8h, v0.16b, v1.16b          // Signed difference of pix1[0] - pix2[0], first iteration
+        usubl           v26.8h, v3.8b, v2.8b            // Signed difference of pix1[0 + stride] - pix2[0 + stride], first iteration
+        usubl2          v25.8h, v3.16b, v2.16b          // Signed difference of pix1[0 + stride] - pix2[0 + stride], first iteration
+        ld1             {v4.16b}, [x1], x3              // Load pix1[0], second iteration
+        sqadd           v28.8h, v28.8h, v26.8h          // Add first iteration
+        ld1             {v6.16b}, [x5], x3              // Load pix1[0 + stride], second iteration
+        sqadd           v27.8h, v27.8h, v25.8h          // Add first iteration
+        ld1             {v5.16b}, [x2], x3              // Load pix2[0], second iteration
+        smlal           v30.4s, v28.4h, v28.4h          // Multiply-accumulate first iteration
+        ld1             {v7.16b}, [x6], x3              // Load pix2[0 + stride], second iteration
+        usubl           v26.8h, v4.8b, v5.8b            // Signed difference of pix1[0] - pix2[0], second iteration
+        smlal2          v29.4s, v28.8h, v28.8h          // Multiply-accumulate first iteration
+        usubl2          v25.8h, v4.16b, v5.16b          // Signed difference of pix1[0] - pix2[0], second iteration
+        usubl           v24.8h, v7.8b, v6.8b            // Signed difference of pix1[0 + stride] - pix2[0 + stride], second iteration
+        smlal           v30.4s, v27.4h, v27.4h          // Multiply-accumulate first iteration
+        usubl2          v23.8h, v7.16b, v6.16b          // Signed difference of pix1[0 + stride] - pix2[0 + stride], second iteration
+        sqadd           v24.8h, v26.8h, v24.8h          // Add second iteration
+        smlal2          v29.4s, v27.8h, v27.8h          // Multiply-accumulate first iteration
+        sqadd           v23.8h, v25.8h, v23.8h          // Add second iteration
+        ld1             {v18.16b}, [x1], x3             // Load pix1[0], third iteration
+        smlal           v30.4s, v24.4h, v24.4h          // Multiply-accumulate second iteration
+        ld1             {v31.16b}, [x2], x3             // Load pix2[0], third iteration
+        ld1             {v17.16b}, [x5], x3             // Load pix1[0 + stride], third iteration
+        smlal2          v29.4s, v24.8h, v24.8h          // Multiply-accumulate second iteration
+        ld1             {v16.16b}, [x6], x3             // Load pix2[0 + stride], third iteration
+        usubl           v22.8h, v18.8b, v31.8b          // Signed difference of pix1[0] - pix2[0], third iteration
+        smlal           v30.4s, v23.4h, v23.4h          // Multiply-accumulate second iteration
+        usubl2          v21.8h, v18.16b, v31.16b        // Signed difference of pix1[0] - pix2[0], third iteration
+        usubl           v20.8h, v16.8b, v17.8b          // Signed difference of pix1[0 + stride] - pix2[0 + stride], third iteration
+        smlal2          v29.4s, v23.8h, v23.8h          // Multiply-accumulate second iteration
+        sqadd           v20.8h, v22.8h, v20.8h          // Add third iteration
+        usubl2          v19.8h, v16.16b, v17.16b        // Signed difference of pix1[0 + stride] - pix2[0 + stride], third iteration
+        smlal           v30.4s, v20.4h, v20.4h          // Multiply-accumulate third iteration
+        sqadd           v19.8h, v21.8h, v19.8h          // Add third iteration
+        smlal2          v29.4s, v20.8h, v20.8h          // Multiply-accumulate third iteration
+        sub             w4, w4, #3
+        smlal           v30.4s, v19.4h, v19.4h          // Multiply-accumulate third iteration
+        cmp             w4, #3
+        smlal2          v29.4s, v19.8h, v19.8h          // Multiply-accumulate third iteration
+
+        b.ge            1b
+
+        cbz             w4, 3f
+
+// iterate by one
+2:
+        ld1             {v0.16b}, [x1], x3
+        ld1             {v1.16b}, [x2], x3
+        ld1             {v2.16b}, [x5], x3
+        usubl           v28.8h, v0.8b, v1.8b
+        ld1             {v3.16b}, [x6], x3
+        usubl2          v27.8h, v0.16b, v1.16b
+        usubl           v26.8h, v3.8b, v2.8b
+        usubl2          v25.8h, v3.16b, v2.16b
+        sqadd           v28.8h, v28.8h, v26.8h
+        sqadd           v27.8h, v27.8h, v25.8h
+        smlal           v30.4s, v28.4h, v28.4h
+        smlal2          v29.4s, v28.8h, v28.8h
+        subs            w4, w4, #1
+        smlal           v30.4s, v27.4h, v27.4h
+        smlal2          v29.4s, v27.8h, v27.8h
+
+        b.ne            2b
+
+3:
+        add             v30.4s, v30.4s, v29.4s
+        saddlv          d17, v30.4s
+        fmov            w0, s17
+
+        ret
+endfunc
-- 
2.34.1


* [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for vsad_intra16
  2022-08-22 15:26 [FFmpeg-devel] [PATCH 0/5] me_cmp: Provide arm64 neon implementations Hubert Mazur
  2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for vsad16 Hubert Mazur
  2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation of vsse16 Hubert Mazur
@ 2022-08-22 15:26 ` Hubert Mazur
  2022-09-04 20:58   ` Martin Storsjö
  2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for vsse_intra16 Hubert Mazur
  2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16 Hubert Mazur
  4 siblings, 1 reply; 15+ messages in thread
From: Hubert Mazur @ 2022-08-22 15:26 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop

Provide an optimized implementation of the vsad_intra16 function for arm64.

Performance comparison tests are shown below.
- vsad_4_c: 177.2
- vsad_4_neon: 24.5

Benchmarks and tests are run with the checkasm tool on AWS Graviton 3.
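
For reference, the intra variant compares a single plane against itself
shifted by one row; roughly, in C (a sketch from memory; the dummy second
argument of the real function is unused):

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int vsad_intra16_ref(const uint8_t *s, ptrdiff_t stride, int h)
{
    int score = 0;
    for (int y = 1; y < h; y++) {   /* h - 1 row pairs */
        for (int x = 0; x < 16; x++)
            score += abs(s[x] - s[x + stride]);
        s += stride;
    }
    return score;
}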

Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
 libavcodec/aarch64/me_cmp_init_aarch64.c |  3 ++
 libavcodec/aarch64/me_cmp_neon.S         | 58 ++++++++++++++++++++++++
 2 files changed, 61 insertions(+)

diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index 7b81e48d16..af83f7ed1e 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -43,6 +43,8 @@ int sse4_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
 
 int vsad16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
                 ptrdiff_t stride, int h);
+int vsad_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
+                      ptrdiff_t stride, int h);
 int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
                 ptrdiff_t stride, int h);
 
@@ -64,6 +66,7 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
         c->sse[2] = sse4_neon;
 
         c->vsad[0] = vsad16_neon;
+        c->vsad[4] = vsad_intra16_neon;
 
         c->vsse[0] = vsse16_neon;
     }
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 279bae7cb5..126e267fdc 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -756,3 +756,61 @@ function vsse16_neon, export=1
 
         ret
 endfunc
+
+function vsad_intra16_neon, export=1
+        // x0           unused
+        // x1           uint8_t *pix1
+        // x2           uint8_t *dummy
+        // x3           ptrdiff_t stride
+        // w4           int h
+
+        sub             w4, w4, #1 // we need to make h-1 iterations
+        add             x5, x1, x3 // pix1 + stride
+        cmp             w4, #4
+        movi            v16.8h, #0
+
+        b.lt            2f
+
+// make 4 iterations at once
+1:
+        // v = abs( pix1[0] - pix1[0 + stride] )
+        // score = sum(v)
+        // abs(x) = ( (x > 0) ? (x) : (-x) )
+
+        ld1             {v0.16b}, [x1], x3
+        ld1             {v1.16b}, [x5], x3
+        ld1             {v2.16b}, [x1], x3
+        uabal           v16.8h, v0.8b, v1.8b
+        ld1             {v3.16b}, [x5], x3
+        uabal2          v16.8h, v0.16b, v1.16b
+        ld1             {v4.16b}, [x1], x3
+        uabal           v16.8h, v2.8b, v3.8b
+        ld1             {v5.16b}, [x5], x3
+        uabal2          v16.8h, v2.16b, v3.16b
+        ld1             {v6.16b}, [x1], x3
+        uabal           v16.8h, v4.8b, v5.8b
+        ld1             {v7.16b}, [x5], x3
+        uabal2          v16.8h, v4.16b, v5.16b
+        sub             w4, w4, #4
+        uabal           v16.8h, v6.8b, v7.8b
+        cmp             w4, #4
+        uabal2          v16.8h, v6.16b, v7.16b
+
+        b.ge            1b
+        cbz             w4, 3f
+
+// iterate by one
+2:
+        ld1             {v0.16b}, [x1], x3
+        ld1             {v1.16b}, [x5], x3
+        subs            w4, w4, #1
+        uabal           v16.8h, v0.8b, v1.8b
+        uabal2          v16.8h, v0.16b, v1.16b
+        cbnz            w4, 2b
+
+3:
+        uaddlv          s17, v16.8h
+        fmov            w0, s17
+
+        ret
+endfunc
-- 
2.34.1


* [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for vsse_intra16
  2022-08-22 15:26 [FFmpeg-devel] [PATCH 0/5] me_cmp: Provide arm64 neon implementations Hubert Mazur
                   ` (2 preceding siblings ...)
  2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for vsad_intra16 Hubert Mazur
@ 2022-08-22 15:26 ` Hubert Mazur
  2022-09-04 20:59   ` Martin Storsjö
  2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16 Hubert Mazur
  4 siblings, 1 reply; 15+ messages in thread
From: Hubert Mazur @ 2022-08-22 15:26 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop

Provide an optimized implementation of vsse_intra16 for arm64.

Performance tests are shown below.
- vsse_4_c: 153.7
- vsse_4_neon: 34.2

Benchmarks and tests are run with the checkasm tool on AWS Graviton 3.
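
For reference, roughly what vsse_intra16 computes, in C (a sketch from
memory; the uabd/umull pairs in the assembly below implement the same
absolute-difference-then-square per column):

#include <stddef.h>
#include <stdint.h>

static int vsse_intra16_ref(const uint8_t *s, ptrdiff_t stride, int h)
{
    int score = 0;
    for (int y = 1; y < h; y++) {   /* h - 1 row pairs */
        for (int x = 0; x < 16; x++) {
            int d = s[x] - s[x + stride];
            score += d * d;
        }
        s += stride;
    }
    return score;
}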

Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
 libavcodec/aarch64/me_cmp_init_aarch64.c |  3 +
 libavcodec/aarch64/me_cmp_neon.S         | 75 ++++++++++++++++++++++++
 2 files changed, 78 insertions(+)

diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index af83f7ed1e..8c295d5457 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -47,6 +47,8 @@ int vsad_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
                       ptrdiff_t stride, int h);
 int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
                 ptrdiff_t stride, int h);
+int vsse_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
+                      ptrdiff_t stride, int h);
 
 av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
 {
@@ -69,5 +71,6 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
         c->vsad[4] = vsad_intra16_neon;
 
         c->vsse[0] = vsse16_neon;
+        c->vsse[4] = vsse_intra16_neon;
     }
 }
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 126e267fdc..46d4dade5d 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -814,3 +814,78 @@ function vsad_intra16_neon, export=1
 
         ret
 endfunc
+
+function vsse_intra16_neon, export=1
+        // x0           unused
+        // x1           uint8_t *pix1
+        // x2           uint8_t *dummy
+        // x3           ptrdiff_t stride
+        // w4           int h
+
+        add             x5, x1, x3 // pix1 + stride
+        movi            v16.4s, #0
+        movi            v17.4s, #0
+
+        sub             w4, w4, #1 // we need to make h-1 iterations
+        cmp             w4, #4
+        b.lt            2f
+
+// make 4 iterations at once
+1:
+        // v = abs( pix1[0] - pix1[0 + stride] )
+        // score = sum( v * v )
+        // abs(x) = ( (x > 0) ? (x) : (-x) )
+
+        ld1             {v0.16b}, [x1], x3
+        ld1             {v1.16b}, [x5], x3
+        ld1             {v2.16b}, [x1], x3
+        uabd            v30.16b, v0.16b, v1.16b
+        ld1             {v3.16b}, [x5], x3
+        ld1             {v4.16b}, [x1], x3
+        umull           v29.8h, v30.8b, v30.8b
+        umull2          v28.8h, v30.16b, v30.16b
+        uabd            v27.16b, v2.16b, v3.16b
+        uadalp          v16.4s, v29.8h
+        ld1             {v5.16b}, [x5], x3
+        umull           v26.8h, v27.8b, v27.8b
+        uadalp          v17.4s, v28.8h
+        ld1             {v6.16b}, [x1], x3
+        umull2          v27.8h, v27.16b, v27.16b
+        uabd            v25.16b, v4.16b, v5.16b
+        uadalp          v16.4s, v26.8h
+        umull           v24.8h, v25.8b, v25.8b
+        ld1             {v7.16b}, [x5], x3
+        uadalp          v17.4s, v27.8h
+        umull2          v25.8h, v25.16b, v25.16b
+        uabd            v23.16b, v6.16b, v7.16b
+        uadalp          v16.4s, v24.8h
+        umull           v22.8h, v23.8b, v23.8b
+        uadalp          v17.4s, v25.8h
+        umull2          v23.8h, v23.16b, v23.16b
+        sub             w4, w4, #4
+        uadalp          v16.4s, v22.8h
+        cmp             w4, #4
+        uadalp          v17.4s, v23.8h
+
+        b.ge            1b
+        cbz             w4, 3f
+
+// iterate by one
+2:
+        ld1             {v0.16b}, [x1], x3
+        ld1             {v1.16b}, [x5], x3
+        subs            w4, w4, #1
+        uabd            v30.16b, v0.16b, v1.16b
+        umull           v29.8h, v30.8b, v30.8b
+        umull2          v30.8h, v30.16b, v30.16b
+        uadalp          v16.4s, v29.8h
+        uadalp          v17.4s, v30.8h
+        cbnz            w4, 2b
+
+3:
+        add             v16.4s, v16.4s, v17.4s
+        uaddlv          d17, v16.4s
+        fmov            w0, s17
+
+        ret
+endfunc
-- 
2.34.1


* [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16
  2022-08-22 15:26 [FFmpeg-devel] [PATCH 0/5] me_cmp: Provide arm64 neon implementations Hubert Mazur
                   ` (3 preceding siblings ...)
  2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for vsse_intra16 Hubert Mazur
@ 2022-08-22 15:26 ` Hubert Mazur
  2022-09-02 21:29   ` Martin Storsjö
  2022-09-04 21:23   ` Martin Storsjö
  4 siblings, 2 replies; 15+ messages in thread
From: Hubert Mazur @ 2022-08-22 15:26 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop

Add a vectorized implementation of the nsse16 function.

Performance comparison tests are shown below.
- nsse_0_c: 707.0
- nsse_0_neon: 120.0

Benchmarks and tests are run with the checkasm tool on AWS Graviton 3.
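
The function combines plain SSE with a weighted gradient-noise term;
roughly, in C (a sketch paraphrasing the generic C version from memory,
not a verbatim copy):

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int nsse16_ref(int weight, const uint8_t *s1, const uint8_t *s2,
                      ptrdiff_t stride, int h)
{
    int score1 = 0, score2 = 0;

    /* plain sum of squared errors over the 16xh block */
    for (int y = 0; y < h; y++)
        for (int x = 0; x < 16; x++) {
            int d = s1[x + y * stride] - s2[x + y * stride];
            score1 += d * d;
        }

    /* difference in 2x2 gradient magnitude; note: only 15 columns */
    for (int y = 0; y < h - 1; y++)
        for (int x = 0; x < 15; x++) {
            int g1 = s1[x + y * stride] - s1[x + 1 + y * stride]
                   - s1[x + (y + 1) * stride] + s1[x + 1 + (y + 1) * stride];
            int g2 = s2[x + y * stride] - s2[x + 1 + y * stride]
                   - s2[x + (y + 1) * stride] + s2[x + 1 + (y + 1) * stride];
            score2 += abs(g1) - abs(g2);
        }

    return score1 + weight * abs(score2);
}

The 15-column gradient loop is why the assembly zeroes the top lane
(ins v17.h[7], wzr) before reducing the accumulators.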

Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
 libavcodec/aarch64/me_cmp_init_aarch64.c |  15 +++
 libavcodec/aarch64/me_cmp_neon.S         | 126 +++++++++++++++++++++++
 2 files changed, 141 insertions(+)

diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index 8c295d5457..146ef04345 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -49,6 +49,10 @@ int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
                 ptrdiff_t stride, int h);
 int vsse_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
                       ptrdiff_t stride, int h);
+int nsse16_neon(int multiplier, const uint8_t *s, const uint8_t *s2,
+                ptrdiff_t stride, int h);
+int nsse16_neon_wrapper(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
+                        ptrdiff_t stride, int h);
 
 av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
 {
@@ -72,5 +76,16 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
 
         c->vsse[0] = vsse16_neon;
         c->vsse[4] = vsse_intra16_neon;
+
+        c->nsse[0] = nsse16_neon_wrapper;
     }
 }
+
+int nsse16_neon_wrapper(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
+                        ptrdiff_t stride, int h)
+{
+    if (c)
+        return nsse16_neon(c->avctx->nsse_weight, s1, s2, stride, h);
+    else
+        return nsse16_neon(8, s1, s2, stride, h);
+}
\ No newline at end of file
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 46d4dade5d..9fe96e111c 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -889,3 +889,129 @@ function vsse_intra16_neon, export=1
 
         ret
 endfunc
+
+function nsse16_neon, export=1
+        // x0           multiplier
+        // x1           uint8_t *pix1
+        // x2           uint8_t *pix2
+        // x3           ptrdiff_t stride
+        // w4           int h
+
+        str             x0, [sp, #-0x40]!
+        stp             x1, x2, [sp, #0x10]
+        stp             x3, x4, [sp, #0x20]
+        str             lr, [sp, #0x30]
+        bl              sse16_neon
+        ldr             lr, [sp, #0x30]
+        mov             w9, w0                                  // here we store score1
+        ldr             x5, [sp]
+        ldp             x1, x2, [sp, #0x10]
+        ldp             x3, x4, [sp, #0x20]
+        add             sp, sp, #0x40
+
+        movi            v16.8h, #0
+        movi            v17.8h, #0
+        movi            v18.8h, #0
+        movi            v19.8h, #0
+
+        mov             x10, x1                                 // x1
+        mov             x14, x2                                 // x2
+        add             x11, x1, x3                             // x1 + stride
+        add             x15, x2, x3                             // x2 + stride
+        add             x12, x1, #1                             // x1 + 1
+        add             x16, x2, #1                             // x2 + 1
+        add             x13, x11, #1                            // x1 + stride + 1
+        add             x17, x15, #1                            // x2 + stride + 1
+
+        subs            w4, w4, #1                              // we need to make h-1 iterations
+        cmp             w4, #2
+        b.lt            2f
+
+// make 2 iterations at once
+1:
+        ld1             {v0.16b}, [x10], x3
+        ld1             {v1.16b}, [x11], x3
+        ld1             {v2.16b}, [x12], x3
+        usubl           v31.8h, v0.8b, v1.8b
+        ld1             {v3.16b}, [x13], x3
+        usubl2          v30.8h, v0.16b, v1.16b
+        usubl           v29.8h, v2.8b, v3.8b
+        usubl2          v28.8h, v2.16b, v3.16b
+        ld1             {v4.16b}, [x14], x3
+        saba            v16.8h, v31.8h, v29.8h
+        ld1             {v5.16b}, [x15], x3
+        ld1             {v6.16b}, [x16], x3
+        saba            v17.8h, v30.8h, v28.8h
+        ld1             {v7.16b}, [x17], x3
+        usubl           v27.8h, v4.8b, v5.8b
+        usubl2          v26.8h, v4.16b, v5.16b
+        usubl           v25.8h, v6.8b, v7.8b
+        ld1             {v31.16b}, [x10], x3
+        saba            v18.8h, v27.8h, v25.8h
+        usubl2          v24.8h, v6.16b, v7.16b
+        ld1             {v30.16b}, [x11], x3
+        ld1             {v29.16b}, [x12], x3
+        saba            v19.8h, v26.8h, v24.8h
+        usubl           v23.8h, v31.8b, v30.8b
+        ld1             {v28.16b}, [x13], x3
+        usubl2          v22.8h, v31.16b, v30.16b
+        usubl           v21.8h, v29.8b, v28.8b
+        ld1             {v27.16b}, [x14], x3
+        usubl2          v20.8h, v29.16b, v28.16b
+        saba            v16.8h, v23.8h, v21.8h
+        ld1             {v26.16b}, [x15], x3
+        ld1             {v25.16b}, [x16], x3
+        saba            v17.8h, v22.8h, v20.8h
+        ld1             {v24.16b}, [x17], x3
+        usubl           v31.8h, v27.8b, v26.8b
+        usubl           v29.8h, v25.8b, v24.8b
+        usubl2          v30.8h, v27.16b, v26.16b
+        saba            v18.8h, v31.8h, v29.8h
+        usubl2          v28.8h, v25.16b, v24.16b
+        sub             w4, w4, #2
+        cmp             w4, #2
+        saba            v19.8h, v30.8h, v28.8h
+
+        b.ge            1b
+        cbz             w4, 3f
+
+// iterate by one
+2:
+        ld1             {v0.16b}, [x10], x3
+        ld1             {v1.16b}, [x11], x3
+        ld1             {v2.16b}, [x12], x3
+        usubl           v31.8h, v0.8b, v1.8b
+        ld1             {v3.16b}, [x13], x3
+        usubl2          v30.8h, v0.16b, v1.16b
+        usubl           v29.8h, v2.8b, v3.8b
+        usubl2          v28.8h, v2.16b, v3.16b
+        saba            v16.8h, v31.8h, v29.8h
+        ld1             {v4.16b}, [x14], x3
+        ld1             {v5.16b}, [x15], x3
+        saba            v17.8h, v30.8h, v28.8h
+        ld1             {v6.16b}, [x16], x3
+        usubl           v27.8h, v4.8b, v5.8b
+        ld1             {v7.16b}, [x17], x3
+        usubl2          v26.8h, v4.16b, v5.16b
+        usubl           v25.8h, v6.8b, v7.8b
+        usubl2          v24.8h, v6.16b, v7.16b
+        saba            v18.8h, v27.8h, v25.8h
+        subs            w4, w4, #1
+        saba            v19.8h, v26.8h, v24.8h
+
+        cbnz            w4, 2b
+
+3:
+        sqsub           v16.8h, v16.8h, v18.8h
+        sqsub           v17.8h, v17.8h, v19.8h
+        ins             v17.h[7], wzr
+        sqadd           v16.8h, v16.8h, v17.8h
+        saddlv          s16, v16.8h
+        sqabs           s16, s16
+        fmov            w0, s16
+
+        mul             w0, w0, w5
+        add             w0, w0, w9
+
+        ret
+endfunc
-- 
2.34.1


* Re: [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16
  2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16 Hubert Mazur
@ 2022-09-02 21:29   ` Martin Storsjö
  2022-09-04 21:23   ` Martin Storsjö
  1 sibling, 0 replies; 15+ messages in thread
From: Martin Storsjö @ 2022-09-02 21:29 UTC (permalink / raw)
  To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop

On Mon, 22 Aug 2022, Hubert Mazur wrote:

> Add a vectorized implementation of the nsse16 function.
>
> Performance comparison tests are shown below.
> - nsse_0_c: 707.0
> - nsse_0_neon: 120.0
>
> Benchmarks and tests are run with the checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c |  15 +++
> libavcodec/aarch64/me_cmp_neon.S         | 126 +++++++++++++++++++++++
> 2 files changed, 141 insertions(+)
>
> diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
> index 46d4dade5d..9fe96e111c 100644
> --- a/libavcodec/aarch64/me_cmp_neon.S
> +++ b/libavcodec/aarch64/me_cmp_neon.S
> @@ -889,3 +889,129 @@ function vsse_intra16_neon, export=1
>
>         ret
> endfunc
> +
> +function nsse16_neon, export=1
> +        // x0           multiplier
> +        // x1           uint8_t *pix1
> +        // x2           uint8_t *pix2
> +        // x3           ptrdiff_t stride
> +        // w4           int h
> +
> +        str             x0, [sp, #-0x40]!
> +        stp             x1, x2, [sp, #0x10]
> +        stp             x3, x4, [sp, #0x20]
> +        str             lr, [sp, #0x30]
> +        bl              sse16_neon
> +        ldr             lr, [sp, #0x30]

This breaks building in two configurations: old binutils doesn't
recognize the register name lr, so you need to spell out x30.

Building on macOS breaks since there's no symbol named sse16_neon; this
is an exported function, so it gets the symbol prefix _. You need to do
"bl X(sse16_neon)" here.

Didn't look at the code from a performance perspective yet.

// Martin


* Re: [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for vsad16
  2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for vsad16 Hubert Mazur
@ 2022-09-02 21:49   ` Martin Storsjö
  0 siblings, 0 replies; 15+ messages in thread
From: Martin Storsjö @ 2022-09-02 21:49 UTC (permalink / raw)
  To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop

On Mon, 22 Aug 2022, Hubert Mazur wrote:

> Provide an optimized implementation of the vsad16 function for arm64.
>
> Performance comparison tests are shown below.
> - vsad_0_c: 285.0
> - vsad_0_neon: 42.5
>
> Benchmarks and tests are run with the checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c |  5 ++
> libavcodec/aarch64/me_cmp_neon.S         | 75 ++++++++++++++++++++++++
> 2 files changed, 80 insertions(+)
>
> diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
> index 4198985c6c..d4c0099854 100644
> --- a/libavcodec/aarch64/me_cmp_neon.S
> +++ b/libavcodec/aarch64/me_cmp_neon.S
> @@ -584,3 +584,78 @@ function sse4_neon, export=1
>
>         ret
> endfunc
> +
> +function vsad16_neon, export=1
> +        // x0           unused
> +        // x1           uint8_t *pix1
> +        // x2           uint8_t *pix2
> +        // x3           ptrdiff_t stride
> +        // w4           int h
> +
> +        sub             w4, w4, #1                      // we need to make h-1 iterations
> +        movi            v16.8h, #0
> +
> +        cmp             w4, #3                          // check if we can make 3 iterations at once
> +        add             x5, x1, x3                      // pix1 + stride
> +        add             x6, x2, x3                      // pix2 + stride
> +        b.le            2f
> +
> +1:
> +        // abs(pix1[0] - pix2[0] - pix1[0 + stride] + pix2[0 + stride])
> +        // abs(x) = (x < 0 ? (-x) : (x))

This comment (the second line) about abs() doesn't really add anything of 
value here, does it?

> +        ld1             {v0.16b}, [x1], x3              // Load pix1[0], first iteration
> +        ld1             {v1.16b}, [x2], x3              // Load pix2[0], first iteration
> +        ld1             {v2.16b}, [x5], x3              // Load pix1[0 + stride], first iteration
> +        usubl           v31.8h, v0.8b, v1.8b            // Signed difference pix1[0] - pix2[0], first iteration
> +        ld1             {v3.16b}, [x6], x3              // Load pix2[0 + stride], first iteration
> +        usubl2          v30.8h, v0.16b, v1.16b          // Signed difference pix1[0] - pix2[0], first iteration
> +        usubl           v29.8h, v2.8b, v3.8b            // Signed difference pix1[0 + stride] - pix2[0 + stride], first iteration
> +        ld1             {v4.16b}, [x1], x3              // Load pix1[0], second iteration
> +        usubl2          v28.8h, v2.16b, v3.16b          // Signed difference pix1[0 + stride] - pix2[0 + stride], first iteration
> +        ld1             {v5.16b}, [x2], x3              // Load pix2[0], second iteration
> +        saba            v16.8h, v31.8h, v29.8h          // Signed absolute difference and accumulate the result. first iteration
> +        ld1             {v6.16b}, [x5], x3              // Load pix1[0 + stride], second iteration
> +        saba            v16.8h, v30.8h, v28.8h          // Signed absolute difference and accumulate the result. first iteration
> +        usubl           v27.8h, v4.8b, v5.8b            // Signed difference pix1[0] - pix2[0], second iteration
> +        ld1             {v7.16b}, [x6], x3              // Load pix2[0 + stride], second iteration
> +        usubl2          v26.8h, v4.16b, v5.16b          // Signed difference pix1[0] - pix2[0], second iteration
> +        usubl           v25.8h, v6.8b, v7.8b            // Signed difference pix1[0 + stride] - pix2[0 + stride], second iteration
> +        ld1             {v17.16b}, [x1], x3             // Load pix1[0], third iteration
> +        usubl2          v24.8h, v6.16b, v7.16b          // Signed difference pix1[0 + stride] - pix2[0 + stride], second iteration
> +        ld1             {v18.16b}, [x2], x3             // Load pix2[0], second iteration
> +        saba            v16.8h, v27.8h, v25.8h          // Signed absolute difference and accumulate the result. second iteration
> +        ld1             {v19.16b}, [x5], x3             // Load pix1[0 + stride], third iteration
> +        saba            v16.8h, v26.8h, v24.8h          // Signed absolute difference and accumulate the result. second iteration
> +        usubl           v23.8h, v17.8b, v18.8b          // Signed difference pix1[0] - pix2[0], third iteration
> +        ld1             {v20.16b}, [x6], x3             // Load pix2[0 + stride], third iteration
> +        usubl2          v22.8h, v17.16b, v18.16b        // Signed difference pix1[0] - pix2[0], third iteration
> +        usubl           v21.8h, v19.8b, v20.8b          // Signed difference pix1[0 + stride] - pix2[0 + stride], third iteration
> +        sub             w4, w4, #3                      // h -= 3
> +        saba            v16.8h, v23.8h, v21.8h          // Signed absolute difference and accumulate the result. third iteration
> +        usubl2          v31.8h, v19.16b, v20.16b        // Signed difference pix1[0 + stride] - pix2[0 + stride], third iteration
> +        cmp             w4, #3
> +        saba            v16.8h, v22.8h, v31.8h          // Signed absolute difference and accumulate the result. third iteration

This unrolled implementation isn't really interleaved much at all. In such
a case there's not really much benefit from unrolling, other than saving a
couple of decrement/branch instructions.

I tested playing with your code here. Originally, checkasm gave me these 
benchmark numbers:

          Cortex A53    A72    A73
vsad_0_neon:  162.7   74.5   67.7

When I switched to only running the non-unrolled version below, I got 
these numbers:

vsad_0_neon:  165.2   78.5   76.0

I.e. hardly any difference at all. In such a case, we're just wasting a 
lot of code size (and code maintainability!) on big unrolled code without 
much benefit at all. But we can do better.

> +
> +        b.ge            1b
> +        cbz             w4, 3f
> +2:
> +
> +        ld1             {v0.16b}, [x1], x3
> +        ld1             {v1.16b}, [x2], x3
> +        ld1             {v2.16b}, [x5], x3
> +        usubl           v30.8h, v0.8b, v1.8b
> +        ld1             {v3.16b}, [x6], x3
> +        usubl2          v29.8h, v0.16b, v1.16b
> +        usubl           v28.8h, v2.8b, v3.8b
> +        usubl2          v27.8h, v2.16b, v3.16b
> +        saba            v16.8h, v30.8h, v28.8h
> +        subs            w4, w4, #1
> +        saba            v16.8h, v29.8h, v27.8h

Now let's look at this simple non-unrolled implementation first. We're 
doing a lot of redundant work here (and in the other implementation).

When we've loaded v2/v3 in the first round in this loop, and calculated 
v28/v27 from them, and we then proceed to the next iteration, v0/v1 will 
be identical to v2/v3, and v30/v29 will be identical to v28/v27 from the 
previous iteration.

So if we just load one row from pix1/pix2 each and calculate their
difference at the start of the function, and then in the loop load one new
row from each, subtract them and accumulate the difference against the
previous one, we're done.

Originally with your non-unrolled implementation, I got these benchmark 
numbers:

vsad_0_neon:  165.2   78.5   76.0

After this simplification, I got these numbers:

vsad_0_neon:  131.2   68.5   70.0

In this case, my code looked like this:

         ld1             {v0.16b}, [x1], x3
         ld1             {v1.16b}, [x2], x3
         usubl           v31.8h, v0.8b,  v1.8b
         usubl2          v30.8h, v0.16b, v1.16b

2:
         ld1             {v0.16b}, [x1], x3
         ld1             {v1.16b}, [x2], x3
         subs            w4, w4, #1
         usubl           v29.8h,  v0.8b,   v1.8b
         usubl2          v28.8h,  v0.16b,  v1.16b
         saba            v16.8h,  v31.8h,  v29.8h
         mov             v31.16b, v29.16b
         saba            v16.8h,  v30.8h,  v28.8h
         mov             v30.16b, v28.16b
         b.ne            2b

Isn't that much simpler? After that, I tried doing the same modification
to the unrolled version. FWIW, for an algorithm this simple, where there
are more than enough registers, I wrote the unrolled implementation like
this:


     ld1 // first iter
     ld1 // first iter
     ld1 // second iter
     ld1 // second iter
     ld1 // third iter
     ld1 // third iter
     usubl  // first
     usubl2 // first
     usubl ..
     usubl2 ..
     ...
     saba // first
     saba // first
     ...

After that, I reordered them so that the first usubl's come a couple of
instructions after the corresponding loads, then moved the following
saba's closer.

With that, I'm getting these checkasm numbers:

          Cortex A53    A72    A73
vsad_0_neon:  108.7   61.2   58.0

As context, this patch originally gave these numbers:

vsad_0_neon:  162.7   74.5   67.7

I.e. a 1.15x speedup on A73, a 1.21x speedup on A72 and a 1.5x speedup
on A53.

You can see my version in the attached patch; apply it on top of yours.

// Martin

[-- Attachment #2: Type: text/x-diff; name=0001-Improve-vsad16_neon.patch, Size: 8106 bytes --]

From 76eb1f213a72cdfd04a62c773442336cd56e0858 Mon Sep 17 00:00:00 2001
From: Martin Storsjö <martin@martin.st>
Date: Sat, 3 Sep 2022 00:45:55 +0300
Subject: [PATCH] Improve vsad16_neon

---
 libavcodec/aarch64/me_cmp_neon.S | 71 ++++++++++++++------------------
 1 file changed, 31 insertions(+), 40 deletions(-)

diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index ecc9c793d6..7ab8744e0d 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -592,65 +592,56 @@ function vsad16_neon, export=1
         // x3           ptrdiff_t stride
         // w4           int h
 
+        ld1             {v0.16b}, [x1], x3              // Load pix1[0], first iteration
+        ld1             {v1.16b}, [x2], x3              // Load pix2[0], first iteration
+
         sub             w4, w4, #1                      // we need to make h-1 iterations
         movi            v16.8h, #0
 
         cmp             w4, #3                          // check if we can make 3 iterations at once
-        add             x5, x1, x3                      // pix1 + stride
-        add             x6, x2, x3                      // pix2 + stride
-        b.le            2f
+        usubl           v31.8h, v0.8b, v1.8b            // Signed difference pix1[0] - pix2[0], first iteration
+        usubl2          v30.8h, v0.16b, v1.16b          // Signed difference pix1[0] - pix2[0], first iteration
+
+        b.lt            2f
 
 1:
         // abs(pix1[0] - pix2[0] - pix1[0 + stride] + pix2[0 + stride])
         // abs(x) = (x < 0 ? (-x) : (x))
-        ld1             {v0.16b}, [x1], x3              // Load pix1[0], first iteration
-        ld1             {v1.16b}, [x2], x3              // Load pix2[0], first iteration
-        ld1             {v2.16b}, [x5], x3              // Load pix1[0 + stride], first iteration
-        usubl           v31.8h, v0.8b, v1.8b            // Signed difference pix1[0] - pix2[0], first iteration
-        ld1             {v3.16b}, [x6], x3              // Load pix2[0 + stride], first iteration
-        usubl2          v30.8h, v0.16b, v1.16b          // Signed difference pix1[0] - pix2[0], first iteration
-        usubl           v29.8h, v2.8b, v3.8b            // Signed difference pix1[0 + stride] - pix2[0 + stride], first iteration
-        ld1             {v4.16b}, [x1], x3              // Load pix1[0], second iteration
-        usubl2          v28.8h, v2.16b, v3.16b          // Signed difference pix1[0 + stride] - pix2[0 + stride], first iteration
-        ld1             {v5.16b}, [x2], x3              // Load pix2[0], second iteration
+        ld1             {v0.16b}, [x1], x3              // Load pix1[0 + stride], first iteration
+        ld1             {v1.16b}, [x2], x3              // Load pix2[0 + stride], first iteration
+        ld1             {v2.16b}, [x1], x3              // Load pix1[0 + stride], second iteration
+        ld1             {v3.16b}, [x2], x3              // Load pix2[0 + stride], second iteration
+        usubl           v29.8h, v0.8b,  v1.8b           // Signed difference pix1[0 + stride] - pix2[0 + stride], first iteration
+        usubl2          v28.8h, v0.16b, v1.16b          // Signed difference pix1[0 + stride] - pix2[0 + stride], first iteration
+        ld1             {v4.16b}, [x1], x3              // Load pix1[0 + stride], third iteration
+        ld1             {v5.16b}, [x2], x3              // Load pix2[0 + stride], third iteration
+        usubl           v27.8h, v2.8b,  v3.8b           // Signed difference pix1[0 + stride] - pix2[0 + stride], second iteration
         saba            v16.8h, v31.8h, v29.8h          // Signed absolute difference and accumulate the result. first iteration
-        ld1             {v6.16b}, [x5], x3              // Load pix1[0 + stride], second iteration
+        usubl2          v26.8h, v2.16b, v3.16b          // Signed difference pix1[0 + stride] - pix2[0 + stride], second iteration
         saba            v16.8h, v30.8h, v28.8h          // Signed absolute difference and accumulate the result. first iteration
-        usubl           v27.8h, v4.8b, v5.8b            // Signed difference pix1[0] - pix2[0], second iteration
-        ld1             {v7.16b}, [x6], x3              // Load pix2[0 + stride], second iteration
-        usubl2          v26.8h, v4.16b, v5.16b          // Signed difference pix1[0] - pix2[0], second iteration
-        usubl           v25.8h, v6.8b, v7.8b            // Signed difference pix1[0 + stride] - pix2[0 + stride], second iteration
-        ld1             {v17.16b}, [x1], x3             // Load pix1[0], third iteration
-        usubl2          v24.8h, v6.16b, v7.16b          // Signed difference pix1[0 + stride] - pix2[0 + stride], second iteration
-        ld1             {v18.16b}, [x2], x3             // Load pix2[0], second iteration
-        saba            v16.8h, v27.8h, v25.8h          // Signed absolute difference and accumulate the result. second iteration
-        ld1             {v19.16b}, [x5], x3             // Load pix1[0 + stride], third iteration
-        saba            v16.8h, v26.8h, v24.8h          // Signed absolute difference and accumulate the result. second iteration
-        usubl           v23.8h, v17.8b, v18.8b          // Signed difference pix1[0] - pix2[0], third iteration
-        ld1             {v20.16b}, [x6], x3             // Load pix2[0 + stride], third iteration
-        usubl2          v22.8h, v17.16b, v18.16b        // Signed difference pix1[0] - pix2[0], third iteration
-        usubl           v21.8h, v19.8b, v20.8b          // Signed difference pix1[0 + stride] - pix2[0 + stride], third iteration
+        usubl           v25.8h, v4.8b,  v5.8b           // Signed difference pix1[0 + stride] - pix2[0 + stride], third iteration
+        usubl2          v24.8h, v4.16b, v5.16b          // Signed difference pix1[0 + stride] - pix2[0 + stride], third iteration
+        saba            v16.8h, v29.8h, v27.8h          // Signed absolute difference and accumulate the result. second iteration
+        mov             v31.16b, v25.16b
+        saba            v16.8h, v28.8h, v26.8h          // Signed absolute difference and accumulate the result. second iteration
         sub             w4, w4, #3                      // h -= 3
-        saba            v16.8h, v23.8h, v21.8h          // Signed absolute difference and accumulate the result. third iteration
-        usubl2          v31.8h, v19.16b, v20.16b        // Signed difference pix1[0 + stride] - pix2[0 + stride], third iteration
+        mov             v30.16b, v24.16b
+        saba            v16.8h, v27.8h, v25.8h          // Signed absolute difference and accumulate the result. third iteration
         cmp             w4, #3
-        saba            v16.8h, v22.8h, v31.8h          // Signed absolute difference and accumulate the result. third iteration
+        saba            v16.8h, v26.8h, v24.8h          // Signed absolute difference and accumulate the result. third iteration
 
         b.ge            1b
         cbz             w4, 3f
 2:
-
         ld1             {v0.16b}, [x1], x3
         ld1             {v1.16b}, [x2], x3
-        ld1             {v2.16b}, [x5], x3
-        usubl           v30.8h, v0.8b, v1.8b
-        ld1             {v3.16b}, [x6], x3
-        usubl2          v29.8h, v0.16b, v1.16b
-        usubl           v28.8h, v2.8b, v3.8b
-        usubl2          v27.8h, v2.16b, v3.16b
-        saba            v16.8h, v30.8h, v28.8h
         subs            w4, w4, #1
-        saba            v16.8h, v29.8h, v27.8h
+        usubl           v29.8h,  v0.8b,   v1.8b
+        usubl2          v28.8h,  v0.16b,  v1.16b
+        saba            v16.8h,  v31.8h,  v29.8h
+        mov             v31.16b, v29.16b
+        saba            v16.8h,  v30.8h,  v28.8h
+        mov             v30.16b, v28.16b
 
         b.ne            2b
 3:
-- 
2.25.1



* Re: [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation of vsse16
  2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation of vsse16 Hubert Mazur
@ 2022-09-04 20:53   ` Martin Storsjö
  0 siblings, 0 replies; 15+ messages in thread
From: Martin Storsjö @ 2022-09-04 20:53 UTC (permalink / raw)
  To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop

On Mon, 22 Aug 2022, Hubert Mazur wrote:

> Provide an optimized implementation of vsse16 for arm64.
>
> Performance comparison tests are shown below.
> - vsse_0_c: 254.4
> - vsse_0_neon: 64.7
>
> Benchmarks and tests are run with the checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c |  4 +
> libavcodec/aarch64/me_cmp_neon.S         | 97 ++++++++++++++++++++++++
> 2 files changed, 101 insertions(+)
>
> diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
> index ddc5d05611..7b81e48d16 100644
> --- a/libavcodec/aarch64/me_cmp_init_aarch64.c
> +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
> @@ -43,6 +43,8 @@ int sse4_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
> 
> int vsad16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
>                 ptrdiff_t stride, int h);
> +int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
> +                ptrdiff_t stride, int h);
> 
> av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
> {
> @@ -62,5 +64,7 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
>         c->sse[2] = sse4_neon;
>
>         c->vsad[0] = vsad16_neon;
> +
> +        c->vsse[0] = vsse16_neon;
>     }
> }
> diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
> index d4c0099854..279bae7cb5 100644
> --- a/libavcodec/aarch64/me_cmp_neon.S
> +++ b/libavcodec/aarch64/me_cmp_neon.S
> @@ -659,3 +659,100 @@ function vsad16_neon, export=1
>
>         ret
> endfunc
> +
> +function vsse16_neon, export=1
> +        // x0           unused
> +        // x1           uint8_t *pix1
> +        // x2           uint8_t *pix2
> +        // x3           ptrdiff_t stride
> +        // w4           int h
> +
> +        movi            v30.4s, #0
> +        movi            v29.4s, #0
> +
> +        add             x5, x1, x3                      // pix1 + stride
> +        add             x6, x2, x3                      // pix2 + stride
> +        sub             w4, w4, #1                      // we need to make h-1 iterations
> +        cmp             w4, #3                          // check if we can make 4 iterations at once
> +        b.le            2f
> +
> +// make 4 iterations at once

The comments seem to talk about 4 iterations at once while the code 
actually only does 3.

> +1:
> +        // x = abs(pix1[0] - pix2[0] - pix1[0 + stride] + pix2[0 + stride]) =

The comment seems a bit out of date here, since there's no abs() involved 
in this function.

> +        // res = (x) * (x)
> +        ld1             {v0.16b}, [x1], x3              // Load pix1[0], first iteration
> +        ld1             {v1.16b}, [x2], x3              // Load pix2[0], first iteration
> +        ld1             {v2.16b}, [x5], x3              // Load pix1[0 + stride], first iteration
> +        usubl           v28.8h, v0.8b, v1.8b            // Signed difference of pix1[0] - pix2[0], first iteration
> +        ld1             {v3.16b}, [x6], x3              // Load pix2[0 + stride], first iteration
> +        usubl2          v27.8h, v0.16b, v1.16b          // Signed difference of pix1[0] - pix2[0], first iteration
> +        usubl           v26.8h, v3.8b, v2.8b            // Signed difference of pix1[0 + stride] - pix2[0 + stride], first iteration
> +        usubl2          v25.8h, v3.16b, v2.16b          // Signed difference of pix1[0 + stride] - pix2[0 + stride], first iteration

Same thing about reusing data from the previous row, as for the previous 
patch.
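
To spell out both points, the scalar computation being vectorized is 
roughly the following (a minimal C sketch with illustrative names, not 
FFmpeg's exact reference code): the per-row pix1 - pix2 delta is squared, 
not passed through abs(), and it can be carried over to the next iteration 
instead of being reloaded:

    #include <stdint.h>
    #include <stddef.h>

    int vsse16_sketch(const uint8_t *s1, const uint8_t *s2,
                      ptrdiff_t stride, int h)
    {
        int score = 0;
        int prev[16];

        for (int x = 0; x < 16; x++)        /* first row pair, read once */
            prev[x] = s1[x] - s2[x];

        for (int y = 1; y < h; y++) {       /* h - 1 iterations */
            s1 += stride;
            s2 += stride;
            for (int x = 0; x < 16; x++) {
                int cur = s1[x] - s2[x];
                int d   = prev[x] - cur;    /* no abs(): vsse squares the delta */
                score  += d * d;
                prev[x] = cur;              /* current row becomes previous row */
            }
        }
        return score;
    }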

// Martin

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for vsad_intra16
  2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for vsad_intra16 Hubert Mazur
@ 2022-09-04 20:58   ` Martin Storsjö
  0 siblings, 0 replies; 15+ messages in thread
From: Martin Storsjö @ 2022-09-04 20:58 UTC (permalink / raw)
  To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop

On Mon, 22 Aug 2022, Hubert Mazur wrote:

> Provide optimized implementation for vsad_intra16 function for arm64.
>
> Performance comparison tests are shown below.
> - vsad_4_c: 177.2
> - vsad_4_neon: 24.5
>
> Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c |  3 ++
> libavcodec/aarch64/me_cmp_neon.S         | 58 ++++++++++++++++++++++++
> 2 files changed, 61 insertions(+)

Same thing as for the others; keep the data for the previous row in 
registers instead of loading it twice.
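
In scalar terms, the suggested structure looks roughly like this (a hedged 
C sketch with illustrative names; vsad_intra sums absolute vertical 
differences within a single plane, so each row only needs to be read once 
when the previous one is kept live):

    #include <stdint.h>
    #include <stddef.h>
    #include <stdlib.h>

    int vsad_intra16_sketch(const uint8_t *s, ptrdiff_t stride, int h)
    {
        int score = 0;
        int prev[16];

        for (int x = 0; x < 16; x++)
            prev[x] = s[x];                 /* first row, read once */

        for (int y = 1; y < h; y++) {
            s += stride;
            for (int x = 0; x < 16; x++) {
                score  += abs(prev[x] - s[x]);  /* |row[y-1] - row[y]| */
                prev[x] = s[x];             /* keep the row for the next iteration */
            }
        }
        return score;
    }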

// Martin

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for vsse_intra16
  2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for vsse_intra16 Hubert Mazur
@ 2022-09-04 20:59   ` Martin Storsjö
  0 siblings, 0 replies; 15+ messages in thread
From: Martin Storsjö @ 2022-09-04 20:59 UTC (permalink / raw)
  To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop

On Mon, 22 Aug 2022, Hubert Mazur wrote:

> Provide optimized implementation for vsse_intra16 for arm64.
>
> Performance tests are shown below.
> - vsse_4_c: 153.7
> - vsse_4_neon: 34.2
>
> Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c |  3 +
> libavcodec/aarch64/me_cmp_neon.S         | 75 ++++++++++++++++++++++++
> 2 files changed, 78 insertions(+)

The same comment as for the others.

// Martin

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16
  2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16 Hubert Mazur
  2022-09-02 21:29   ` Martin Storsjö
@ 2022-09-04 21:23   ` Martin Storsjö
  1 sibling, 0 replies; 15+ messages in thread
From: Martin Storsjö @ 2022-09-04 21:23 UTC (permalink / raw)
  To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop

On Mon, 22 Aug 2022, Hubert Mazur wrote:

> Add vectorized implementation of nsse16 function.
>
> Performance comparison tests are shown below.
> - nsse_0_c: 707.0
> - nsse_0_neon: 120.0
>
> Benchmarks and tests run with checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c |  15 +++
> libavcodec/aarch64/me_cmp_neon.S         | 126 +++++++++++++++++++++++
> 2 files changed, 141 insertions(+)
>
> diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
> index 8c295d5457..146ef04345 100644
> --- a/libavcodec/aarch64/me_cmp_init_aarch64.c
> +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
> @@ -49,6 +49,10 @@ int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
>                 ptrdiff_t stride, int h);
> int vsse_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
>                       ptrdiff_t stride, int h);
> +int nsse16_neon(int multiplier, const uint8_t *s, const uint8_t *s2,
> +                ptrdiff_t stride, int h);
> +int nsse16_neon_wrapper(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
> +                        ptrdiff_t stride, int h);
> 
> av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
> {
> @@ -72,5 +76,16 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
>
>         c->vsse[0] = vsse16_neon;
>         c->vsse[4] = vsse_intra16_neon;
> +
> +        c->nsse[0] = nsse16_neon_wrapper;
>     }
> }
> +
> +int nsse16_neon_wrapper(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
> +                        ptrdiff_t stride, int h)
> +{
> +        if (c)
> +            return nsse16_neon(c->avctx->nsse_weight, s1, s2, stride, h);
> +        else
> +            return nsse16_neon(8, s1, s2, stride, h);
> +}
> \ No newline at end of file

The indentation is off for this file, and it's missing the final newline.

> diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
> index 46d4dade5d..9fe96e111c 100644
> --- a/libavcodec/aarch64/me_cmp_neon.S
> +++ b/libavcodec/aarch64/me_cmp_neon.S
> @@ -889,3 +889,129 @@ function vsse_intra16_neon, export=1
>
>         ret
> endfunc
> +
> +function nsse16_neon, export=1
> +        // x0           multiplier
> +        // x1           uint8_t *pix1
> +        // x2           uint8_t *pix2
> +        // x3           ptrdiff_t stride
> +        // w4           int h
> +
> +        str             x0, [sp, #-0x40]!
> +        stp             x1, x2, [sp, #0x10]
> +        stp             x3, x4, [sp, #0x20]
> +        str             lr, [sp, #0x30]
> +        bl              sse16_neon
> +        ldr             lr, [sp, #0x30]
> +        mov             w9, w0                                  // here we store score1
> +        ldr             x5, [sp]
> +        ldp             x1, x2, [sp, #0x10]
> +        ldp             x3, x4, [sp, #0x20]
> +        add             sp, sp, #0x40
> +
> +        movi            v16.8h, #0
> +        movi            v17.8h, #0
> +        movi            v18.8h, #0
> +        movi            v19.8h, #0
> +
> +        mov             x10, x1                                 // x1
> +        mov             x14, x2                                 // x2

I don't see why you need to make a copy of x1/x2 here, as you don't use 
x1/x2 after this at all.

> +        add             x11, x1, x3                             // x1 + stride
> +        add             x15, x2, x3                             // x2 + stride
> +        add             x12, x1, #1                             // x1 + 1
> +        add             x16, x2, #1                             // x2 + 1

FWIW, since we don't need the final value at [x1+16], instead of making two 
loads for [x1] and [x1+1], I would normally just do one load of [x1] and 
then make a shifted version with the 'ext' instruction; ext is generally 
cheaper than doing redundant loads. On the other hand, by doing two loads, 
you avoid a serial dependency on the first load.


> +// iterate by one
> +2:
> +        ld1             {v0.16b}, [x10], x3
> +        ld1             {v1.16b}, [x11], x3
> +        ld1             {v2.16b}, [x12], x3
> +        usubl           v31.8h, v0.8b, v1.8b
> +        ld1             {v3.16b}, [x13], x3
> +        usubl2          v30.8h, v0.16b, v1.16b
> +        usubl           v29.8h, v2.8b, v3.8b
> +        usubl2          v28.8h, v2.16b, v3.16b
> +        saba            v16.8h, v31.8h, v29.8h
> +        ld1             {v4.16b}, [x14], x3
> +        ld1             {v5.16b}, [x15], x3
> +        saba            v17.8h, v30.8h, v28.8h
> +        ld1             {v6.16b}, [x16], x3
> +        usubl           v27.8h, v4.8b, v5.8b
> +        ld1             {v7.16b}, [x17], x3

So, looking at the main implementation structure here, going by the 
non-unrolled version: you're doing 8 loads per iteration - and I 
would say you can do this with 2 loads per iteration.

By reusing the loaded data from the previous iteration instead of 
duplicated loading, you can get this down from 8 to 4 loads. And by 
shifting with 'ext' instead of a separate load, you can get it down to 2 
loads. (Then again, with 4 loads instead of 2, you can have the 
overlapping loads running in parallel, instead of having to wait for the 
first load to complete if using ext. I'd suggest trying both and seeing 
which one works better - although the tradeoff might be different between 
different cores. But storing data from the previous line instead of such 
duplicated loading is certainly better in any case.)
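
In scalar terms, the structure being suggested is this (a hedged C sketch 
of nsse16's gradient term only, with illustrative function and variable 
names): per output row, each input row is read once, and the x + 1 column 
is derived from the same data - the scalar analogue of keeping the previous 
row in registers and shifting with 'ext':

    #include <stdint.h>
    #include <stdlib.h>

    /* Contribution of one row pair to nsse16's gradient term; prev1/prev2
     * hold the previously read 16-byte rows of pix1/pix2, row1/row2 the
     * current ones. */
    static int nsse16_row_sketch(const uint8_t *prev1, const uint8_t *row1,
                                 const uint8_t *prev2, const uint8_t *row2)
    {
        int score2 = 0;
        for (int x = 0; x < 15; x++) {  /* 15 columns only: x + 1 must stay in the row */
            int g1 = prev1[x] - prev1[x + 1] - row1[x] + row1[x + 1];
            int g2 = prev2[x] - prev2[x + 1] - row2[x] + row2[x + 1];
            score2 += abs(g1) - abs(g2);
        }
        return score2;
    }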

> +        usubl2          v26.8h, v4.16b, v5.16b
> +        usubl           v25.8h, v6.8b, v7.8b
> +        usubl2          v24.8h, v6.16b, v7.16b
> +        saba            v18.8h, v27.8h, v25.8h
> +        subs            w4, w4, #1
> +        saba            v19.8h, v26.8h, v24.8h
> +
> +        cbnz            w4, 2b
> +
> +3:
> +        sqsub           v16.8h, v16.8h, v18.8h
> +        sqsub           v17.8h, v17.8h, v19.8h
> +        ins             v17.h[7], wzr

It's very good that you figured out how to handle the odd element here 
outside of the loops!
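
For context, a sketch based on the scalar definition (score2 being the 
accumulated gradient term, score1 the plain SSE returned by sse16_neon), 
the final result is computed as:

    score = score1 + FFABS(score2) * (c ? c->avctx->nsse_weight : 8);

Only 15 of the 16 columns contribute to score2, so clearing the top 
halfword lane with 'ins ... wzr' before the horizontal add discards 
exactly the lane that must not be counted - which is what makes handling 
it outside the loops work.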

// Martin

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16
  2022-09-08  9:25 [FFmpeg-devel] [PATCH 0/5] Provide optimized neon implementation Hubert Mazur
@ 2022-09-08  9:25 ` Hubert Mazur
  0 siblings, 0 replies; 15+ messages in thread
From: Hubert Mazur @ 2022-09-08  9:25 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop

Add vectorized implementation of nsse16 function.

Performance comparison tests are shown below.
- nsse_0_c: 682.2
- nsse_0_neon: 116.5

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Co-authored-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
 libavcodec/aarch64/me_cmp_init_aarch64.c |  15 +++
 libavcodec/aarch64/me_cmp_neon.S         | 122 +++++++++++++++++++++++
 2 files changed, 137 insertions(+)

diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index 8c295d5457..ade3e9a4c1 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -49,6 +49,10 @@ int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
                 ptrdiff_t stride, int h);
 int vsse_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
                       ptrdiff_t stride, int h);
+int nsse16_neon(int multiplier, const uint8_t *s, const uint8_t *s2,
+                ptrdiff_t stride, int h);
+int nsse16_neon_wrapper(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
+                        ptrdiff_t stride, int h);
 
 av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
 {
@@ -72,5 +76,16 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
 
         c->vsse[0] = vsse16_neon;
         c->vsse[4] = vsse_intra16_neon;
+
+        c->nsse[0] = nsse16_neon_wrapper;
     }
 }
+
+int nsse16_neon_wrapper(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
+                        ptrdiff_t stride, int h)
+{
+    if (c)
+        return nsse16_neon(c->avctx->nsse_weight, s1, s2, stride, h);
+    else
+        return nsse16_neon(8, s1, s2, stride, h);
+}
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index cf2b8da425..f8998749a5 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -847,3 +847,125 @@ function vsse_intra16_neon, export=1
 
         ret
 endfunc
+
+function nsse16_neon, export=1
+        // x0           multiplier
+        // x1           uint8_t *pix1
+        // x2           uint8_t *pix2
+        // x3           ptrdiff_t stride
+        // w4           int h
+
+        str             x0, [sp, #-0x40]!
+        stp             x1, x2, [sp, #0x10]
+        stp             x3, x4, [sp, #0x20]
+        str             x30, [sp, #0x30]
+        bl              X(sse16_neon)
+        ldr             x30, [sp, #0x30]
+        mov             w9, w0                                  // here we store score1
+        ldr             x5, [sp]
+        ldp             x1, x2, [sp, #0x10]
+        ldp             x3, x4, [sp, #0x20]
+        add             sp, sp, #0x40
+
+        movi            v16.8h, #0
+        movi            v17.8h, #0
+        movi            v18.8h, #0
+        movi            v19.8h, #0
+
+        ld1             {v0.16b}, [x1], x3
+        subs            w4, w4, #1                              // we need to make h-1 iterations
+        ld1             {v2.16b}, [x2], x3
+        ext             v1.16b, v0.16b, v0.16b, #1              // x1 + 1
+        cmp             w4, #2
+        ext             v3.16b, v2.16b, v2.16b, #1              // x2 + 1
+
+        b.lt            2f
+
+// make 2 iterations at once
+1:
+        ld1             {v4.16b}, [x1], x3
+        ld1             {v6.16b}, [x2], x3
+        ld1             {v20.16b}, [x1], x3
+        ext             v5.16b, v4.16b, v4.16b, #1              // x1 + stride + 1
+        usubl           v31.8h, v0.8b, v4.8b
+        usubl2          v30.8h, v0.16b, v4.16b
+        ld1             {v22.16b}, [x2], x3
+        usubl           v29.8h, v1.8b, v5.8b
+        usubl2          v28.8h, v1.16b, v5.16b
+        ext             v7.16b, v6.16b, v6.16b, #1              // x2 + stride + 1
+        saba            v16.8h, v31.8h, v29.8h
+        ext             v21.16b, v20.16b, v20.16b, #1
+        saba            v17.8h, v30.8h, v28.8h
+        usubl           v27.8h, v2.8b, v6.8b
+        usubl2          v26.8h, v2.16b, v6.16b
+        ext             v23.16b, v22.16b, v22.16b, #1
+        usubl           v25.8h, v3.8b, v7.8b
+        usubl2          v24.8h, v3.16b, v7.16b
+        saba            v18.8h, v27.8h, v25.8h
+        saba            v19.8h, v26.8h, v24.8h
+
+        usubl           v31.8h, v4.8b, v20.8b
+        usubl2          v30.8h, v4.16b, v20.16b
+        usubl           v29.8h, v5.8b, v21.8b
+        usubl2          v28.8h, v5.16b, v21.16b
+        saba            v16.8h, v31.8h, v29.8h
+        saba            v17.8h, v30.8h, v28.8h
+        usubl           v27.8h, v6.8b, v22.8b
+        usubl2          v26.8h, v6.16b, v22.16b
+        usubl           v25.8h, v7.8b, v23.8b
+        usubl2          v24.8h, v7.16b, v23.16b
+        saba            v18.8h, v27.8h, v25.8h
+        saba            v19.8h, v26.8h, v24.8h
+        sub             w4, w4, #2
+
+        mov             v0.16b, v20.16b
+        mov             v1.16b, v21.16b
+        cmp             w4, #2
+        mov             v2.16b, v22.16b
+        mov             v3.16b, v23.16b
+
+        b.ge            1b
+        cbz             w4, 3f
+
+// iterate by one
+2:
+        ld1             {v4.16b}, [x1], x3
+        subs            w4, w4, #1
+        ld1             {v6.16b}, [x2], x3
+        ext             v5.16b, v4.16b, v4.16b, #1              // x1 + stride + 1
+        usubl           v31.8h, v0.8b, v4.8b
+        ext             v7.16b, v6.16b, v6.16b, #1              // x2 + stride + 1
+
+        usubl2          v30.8h, v0.16b, v4.16b
+        usubl           v29.8h, v1.8b, v5.8b
+        usubl2          v28.8h, v1.16b, v5.16b
+        saba            v16.8h, v31.8h, v29.8h
+        saba            v17.8h, v30.8h, v28.8h
+        usubl           v27.8h, v2.8b, v6.8b
+        usubl2          v26.8h, v2.16b, v6.16b
+        usubl           v25.8h, v3.8b, v7.8b
+        usubl2          v24.8h, v3.16b, v7.16b
+        saba            v18.8h, v27.8h, v25.8h
+        saba            v19.8h, v26.8h, v24.8h
+
+        mov             v0.16b, v4.16b
+        mov             v1.16b, v5.16b
+        mov             v2.16b, v6.16b
+        mov             v3.16b, v7.16b
+
+        cbnz            w4, 2b
+
+3:
+        sqsub           v17.8h, v17.8h, v19.8h
+        sqsub           v16.8h, v16.8h, v18.8h
+        ins             v17.h[7], wzr
+        sqadd           v16.8h, v16.8h, v17.8h
+        saddlv          s16, v16.8h
+        sqabs           s16, s16
+        fmov            w0, s16
+
+        mul             w0, w0, w5
+        add             w0, w0, w9
+
+        ret
+endfunc
-- 
2.34.1

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16
  2022-09-06 10:27 ` [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16 Hubert Mazur
@ 2022-09-07  9:06   ` Martin Storsjö
  0 siblings, 0 replies; 15+ messages in thread
From: Martin Storsjö @ 2022-09-07  9:06 UTC (permalink / raw)
  To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop

[-- Attachment #1: Type: text/plain, Size: 4851 bytes --]

On Tue, 6 Sep 2022, Hubert Mazur wrote:

> Add vectorized implementation of nsse16 function.
>
> Performance comparison tests are shown below.
> - nsse_0_c: 707.0
> - nsse_0_neon: 120.0
>
> Benchmarks and tests run with checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c |  15 +++
> libavcodec/aarch64/me_cmp_neon.S         | 124 +++++++++++++++++++++++
> 2 files changed, 139 insertions(+)
>
> diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
> index 8c295d5457..ea7f295373 100644
> --- a/libavcodec/aarch64/me_cmp_init_aarch64.c
> +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
> @@ -49,6 +49,10 @@ int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
>                 ptrdiff_t stride, int h);
> int vsse_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
>                       ptrdiff_t stride, int h);
> +int nsse16_neon(int multiplier, const uint8_t *s, const uint8_t *s2,
> +                ptrdiff_t stride, int h);
> +int nsse16_neon_wrapper(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
> +                        ptrdiff_t stride, int h);
> 
> av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
> {
> @@ -72,5 +76,16 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
>
>         c->vsse[0] = vsse16_neon;
>         c->vsse[4] = vsse_intra16_neon;
> +
> +        c->nsse[0] = nsse16_neon_wrapper;
>     }
> }
> +
> +int nsse16_neon_wrapper(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
> +                        ptrdiff_t stride, int h)
> +{
> +        if (c)
> +            return nsse16_neon(c->avctx->nsse_weight, s1, s2, stride, h);
> +        else
> +            return nsse16_neon(8, s1, s2, stride, h);
> +}

The body of this function is still indented 4 spaces too much.

> diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
> index cf2b8da425..bd21122a21 100644
> --- a/libavcodec/aarch64/me_cmp_neon.S
> +++ b/libavcodec/aarch64/me_cmp_neon.S
> @@ -847,3 +847,127 @@ function vsse_intra16_neon, export=1
>
>         ret
> endfunc
> +
> +function nsse16_neon, export=1
> +        // x0           multiplier
> +        // x1           uint8_t *pix1
> +        // x2           uint8_t *pix2
> +        // x3           ptrdiff_t stride
> +        // w4           int h
> +
> +        str             x0, [sp, #-0x40]!
> +        stp             x1, x2, [sp, #0x10]
> +        stp             x3, x4, [sp, #0x20]
> +        str             x30, [sp, #0x30]
> +        bl              X(sse16_neon)
> +        ldr             x30, [sp, #0x30]
> +        mov             w9, w0                                  // here we store score1
> +        ldr             x5, [sp]
> +        ldp             x1, x2, [sp, #0x10]
> +        ldp             x3, x4, [sp, #0x20]
> +        add             sp, sp, #0x40
> +
> +        movi            v16.8h, #0
> +        movi            v17.8h, #0
> +        movi            v18.8h, #0
> +        movi            v19.8h, #0
> +
> +        ld1             {v0.16b}, [x1], x3
> +        subs            w4, w4, #1                              // we need to make h-1 iterations
> +        ext             v1.16b, v0.16b, v0.16b, #1              // x1 + 1
> +        ld1             {v2.16b}, [x2], x3
> +        cmp             w4, #2
> +        ext             v3.16b, v2.16b, v2.16b, #1              // x2 + 1

The scheduling here (and in both loops below) is a bit non-ideal; we use 
v0 very early after loading it, while we could start the load of v2 before 
that.

> +
> +        b.lt            2f
> +
> +// make 2 iterations at once
> +1:
> +        ld1             {v4.16b}, [x1], x3
> +        ld1             {v20.16b}, [x1], x3

The load of v20 here would stall a bit, since the first load updates x1. 
Instead, we can launch the load of v6 in between these two.

> +        ext             v5.16b, v4.16b, v4.16b, #1              // x1 + stride + 1
> +        ext             v21.16b, v20.16b, v20.16b, #1

We can move the use of v20 a bit later here, in between the two other 
exts.

These changes (plus a similar minor one to the non-unrolled version at the 
bottom) produces the following benchmark change for me:

Before:   Cortex A53     A72     A73
nsse_0_neon:   401.0   198.0   194.5
After:
nsse_0_neon:   377.0   198.7   196.5

(The differences on A72 and A73 are within the measurement noise; those 
numbers vary more than that from one run to another.)

You can squash in the attached patch for simplicity. Also, remember to fix 
the indentation of the wrapper function.

With those changes, plus updated benchmark numbers in the commit messages, 
I think this patchset should be good to go.

// Martin

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Type: text/x-diff; name=0001-squash-nsse16-Tune-scheduling.patch, Size: 2496 bytes --]

From 081aff967d4fdc3d475c777033223625db3bb532 Mon Sep 17 00:00:00 2001
From: Martin Storsjö <martin@martin.st>
Date: Wed, 7 Sep 2022 11:50:29 +0300
Subject: [PATCH] squash: nsse16: Tune scheduling

Before:   Cortex A53     A72     A73
nsse_0_neon:   401.0   198.0   194.5
After:
nsse_0_neon:   377.0   198.7   196.5

(The differences on A72 and A73 are within the measurement noise,
those numbers vary more than that from one run to another.)
---
 libavcodec/aarch64/me_cmp_neon.S | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index bd21122a21..2a2af7a788 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -874,8 +874,8 @@ function nsse16_neon, export=1
 
         ld1             {v0.16b}, [x1], x3
         subs            w4, w4, #1                              // we need to make h-1 iterations
-        ext             v1.16b, v0.16b, v0.16b, #1              // x1 + 1
         ld1             {v2.16b}, [x2], x3
+        ext             v1.16b, v0.16b, v0.16b, #1              // x1 + 1
         cmp             w4, #2
         ext             v3.16b, v2.16b, v2.16b, #1              // x2 + 1
 
@@ -884,12 +884,12 @@ function nsse16_neon, export=1
 // make 2 iterations at once
 1:
         ld1             {v4.16b}, [x1], x3
+        ld1             {v6.16b}, [x2], x3
         ld1             {v20.16b}, [x1], x3
         ext             v5.16b, v4.16b, v4.16b, #1              // x1 + stride + 1
-        ext             v21.16b, v20.16b, v20.16b, #1
-        ld1             {v6.16b}, [x2], x3
         ld1             {v22.16b}, [x2], x3
         ext             v7.16b, v6.16b, v6.16b, #1              // x2 + stride + 1
+        ext             v21.16b, v20.16b, v20.16b, #1
         ext             v23.16b, v22.16b, v22.16b, #1
 
         usubl           v31.8h, v0.8b, v4.8b
@@ -933,8 +933,8 @@ function nsse16_neon, export=1
 2:
         ld1             {v4.16b}, [x1], x3
         subs            w4, w4, #1
-        ext             v5.16b, v4.16b, v4.16b, #1              // x1 + stride + 1
         ld1             {v6.16b}, [x2], x3
+        ext             v5.16b, v4.16b, v4.16b, #1              // x1 + stride + 1
         usubl           v31.8h, v0.8b, v4.8b
         ext             v7.16b, v6.16b, v6.16b, #1              // x2 + stride + 1
 
-- 
2.25.1


[-- Attachment #3: Type: text/plain, Size: 251 bytes --]

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16
  2022-09-06 10:27 [FFmpeg-devel] [PATCH 0/5] Provide optimized neon implementation Hubert Mazur
@ 2022-09-06 10:27 ` Hubert Mazur
  2022-09-07  9:06   ` Martin Storsjö
  0 siblings, 1 reply; 15+ messages in thread
From: Hubert Mazur @ 2022-09-06 10:27 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop

Add vectorized implementation of nsse16 function.

Performance comparison tests are shown below.
- nsse_0_c: 707.0
- nsse_0_neon: 120.0

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
 libavcodec/aarch64/me_cmp_init_aarch64.c |  15 +++
 libavcodec/aarch64/me_cmp_neon.S         | 124 +++++++++++++++++++++++
 2 files changed, 139 insertions(+)

diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index 8c295d5457..ea7f295373 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -49,6 +49,10 @@ int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
                 ptrdiff_t stride, int h);
 int vsse_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
                       ptrdiff_t stride, int h);
+int nsse16_neon(int multiplier, const uint8_t *s, const uint8_t *s2,
+                ptrdiff_t stride, int h);
+int nsse16_neon_wrapper(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
+                        ptrdiff_t stride, int h);
 
 av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
 {
@@ -72,5 +76,16 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
 
         c->vsse[0] = vsse16_neon;
         c->vsse[4] = vsse_intra16_neon;
+
+        c->nsse[0] = nsse16_neon_wrapper;
     }
 }
+
+int nsse16_neon_wrapper(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
+                        ptrdiff_t stride, int h)
+{
+        if (c)
+            return nsse16_neon(c->avctx->nsse_weight, s1, s2, stride, h);
+        else
+            return nsse16_neon(8, s1, s2, stride, h);
+}
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index cf2b8da425..bd21122a21 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -847,3 +847,127 @@ function vsse_intra16_neon, export=1
 
         ret
 endfunc
+
+function nsse16_neon, export=1
+        // x0           multiplier
+        // x1           uint8_t *pix1
+        // x2           uint8_t *pix2
+        // x3           ptrdiff_t stride
+        // w4           int h
+
+        str             x0, [sp, #-0x40]!
+        stp             x1, x2, [sp, #0x10]
+        stp             x3, x4, [sp, #0x20]
+        str             x30, [sp, #0x30]
+        bl              X(sse16_neon)
+        ldr             x30, [sp, #0x30]
+        mov             w9, w0                                  // here we store score1
+        ldr             x5, [sp]
+        ldp             x1, x2, [sp, #0x10]
+        ldp             x3, x4, [sp, #0x20]
+        add             sp, sp, #0x40
+
+        movi            v16.8h, #0
+        movi            v17.8h, #0
+        movi            v18.8h, #0
+        movi            v19.8h, #0
+
+        ld1             {v0.16b}, [x1], x3
+        subs            w4, w4, #1                              // we need to make h-1 iterations
+        ext             v1.16b, v0.16b, v0.16b, #1              // x1 + 1
+        ld1             {v2.16b}, [x2], x3
+        cmp             w4, #2
+        ext             v3.16b, v2.16b, v2.16b, #1              // x2 + 1
+
+        b.lt            2f
+
+// make 2 iterations at once
+1:
+        ld1             {v4.16b}, [x1], x3
+        ld1             {v20.16b}, [x1], x3
+        ext             v5.16b, v4.16b, v4.16b, #1              // x1 + stride + 1
+        ext             v21.16b, v20.16b, v20.16b, #1
+        ld1             {v6.16b}, [x2], x3
+        ld1             {v22.16b}, [x2], x3
+        ext             v7.16b, v6.16b, v6.16b, #1              // x2 + stride + 1
+        ext             v23.16b, v22.16b, v22.16b, #1
+
+        usubl           v31.8h, v0.8b, v4.8b
+        usubl2          v30.8h, v0.16b, v4.16b
+        usubl           v29.8h, v1.8b, v5.8b
+        usubl2          v28.8h, v1.16b, v5.16b
+        saba            v16.8h, v31.8h, v29.8h
+        saba            v17.8h, v30.8h, v28.8h
+        usubl           v27.8h, v2.8b, v6.8b
+        usubl2          v26.8h, v2.16b, v6.16b
+        usubl           v25.8h, v3.8b, v7.8b
+        usubl2          v24.8h, v3.16b, v7.16b
+        saba            v18.8h, v27.8h, v25.8h
+        saba            v19.8h, v26.8h, v24.8h
+
+        usubl           v31.8h, v4.8b, v20.8b
+        usubl2          v30.8h, v4.16b, v20.16b
+        usubl           v29.8h, v5.8b, v21.8b
+        usubl2          v28.8h, v5.16b, v21.16b
+        saba            v16.8h, v31.8h, v29.8h
+        saba            v17.8h, v30.8h, v28.8h
+        usubl           v27.8h, v6.8b, v22.8b
+        usubl2          v26.8h, v6.16b, v22.16b
+        usubl           v25.8h, v7.8b, v23.8b
+        usubl2          v24.8h, v7.16b, v23.16b
+        saba            v18.8h, v27.8h, v25.8h
+        saba            v19.8h, v26.8h, v24.8h
+
+        mov             v0.16b, v20.16b
+        mov             v1.16b, v21.16b
+        mov             v2.16b, v22.16b
+        mov             v3.16b, v23.16b
+
+        sub             w4, w4, #2
+        cmp             w4, #2
+
+        b.ge            1b
+        cbz             w4, 3f
+
+// iterate by one
+2:
+        ld1             {v4.16b}, [x1], x3
+        subs            w4, w4, #1
+        ext             v5.16b, v4.16b, v4.16b, #1              // x1 + stride + 1
+        ld1             {v6.16b}, [x2], x3
+        usubl           v31.8h, v0.8b, v4.8b
+        ext             v7.16b, v6.16b, v6.16b, #1              // x2 + stride + 1
+
+        usubl2          v30.8h, v0.16b, v4.16b
+        usubl           v29.8h, v1.8b, v5.8b
+        usubl2          v28.8h, v1.16b, v5.16b
+        saba            v16.8h, v31.8h, v29.8h
+        saba            v17.8h, v30.8h, v28.8h
+        usubl           v27.8h, v2.8b, v6.8b
+        usubl2          v26.8h, v2.16b, v6.16b
+        usubl           v25.8h, v3.8b, v7.8b
+        usubl2          v24.8h, v3.16b, v7.16b
+        saba            v18.8h, v27.8h, v25.8h
+        saba            v19.8h, v26.8h, v24.8h
+
+        mov             v0.16b, v4.16b
+        mov             v1.16b, v5.16b
+        mov             v2.16b, v6.16b
+        mov             v3.16b, v7.16b
+
+        cbnz            w4, 2b
+
+3:
+        sqsub           v16.8h, v16.8h, v18.8h
+        sqsub           v17.8h, v17.8h, v19.8h
+        ins             v17.h[7], wzr
+        sqadd           v16.8h, v16.8h, v17.8h
+        saddlv          s16, v16.8h
+        sqabs           s16, s16
+        fmov            w0, s16
+
+        mul             w0, w0, w5
+        add             w0, w0, w9
+
+        ret
+endfunc
-- 
2.34.1

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-09-08  9:26 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-22 15:26 [FFmpeg-devel] [PATCH 0/5] me_cmp: Provide arm64 neon implementations Hubert Mazur
2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for vsad16 Hubert Mazur
2022-09-02 21:49   ` Martin Storsjö
2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation of vsse16 Hubert Mazur
2022-09-04 20:53   ` Martin Storsjö
2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for vsad_intra16 Hubert Mazur
2022-09-04 20:58   ` Martin Storsjö
2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for vsse_intra16 Hubert Mazur
2022-09-04 20:59   ` Martin Storsjö
2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16 Hubert Mazur
2022-09-02 21:29   ` Martin Storsjö
2022-09-04 21:23   ` Martin Storsjö
2022-09-06 10:27 [FFmpeg-devel] [PATCH 0/5] Provide optimized neon implementation Hubert Mazur
2022-09-06 10:27 ` [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16 Hubert Mazur
2022-09-07  9:06   ` Martin Storsjö
2022-09-08  9:25 [FFmpeg-devel] [PATCH 0/5] Provide optimized neon implementation Hubert Mazur
2022-09-08  9:25 ` [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16 Hubert Mazur
