[FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me

Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

* [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions
@ 2022-08-16 12:20 Hubert Mazur
  2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for sse16 Hubert Mazur
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Hubert Mazur @ 2022-08-16 12:20 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop

Add arm64 neon implementation for functions from motion estimation
family. All of them were tested and benchmarked using checkasm tool.
The rare code paths, e.g. when filter_size % 4 != 0, were also tested.
Instructions were manually deinterleaved to reach best performance.
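
As a rough illustration of the "rare code paths" mentioned above, the sketch below mirrors the control flow of the assembly in this series: four rows per pass while at least four rows remain, then a one-row tail loop. It is a scalar C model, not the project's reference code; the names row_sse16, sse16_simple and sse16_unrolled are hypothetical and chosen only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum of squared differences over one 16-pixel row. */
static unsigned row_sse16(const uint8_t *a, const uint8_t *b)
{
    unsigned sum = 0;
    for (int x = 0; x < 16; x++) {
        int d = a[x] - b[x];
        sum += (unsigned)(d * d);
    }
    return sum;
}

/* Plain reference: one row per iteration. */
static unsigned sse16_simple(const uint8_t *p1, const uint8_t *p2,
                             ptrdiff_t stride, int h)
{
    unsigned sum = 0;
    for (int y = 0; y < h; y++, p1 += stride, p2 += stride)
        sum += row_sse16(p1, p2);
    return sum;
}

/* Same loop structure as the assembly: 4 rows at once, then a tail loop. */
static unsigned sse16_unrolled(const uint8_t *p1, const uint8_t *p2,
                               ptrdiff_t stride, int h)
{
    unsigned sum = 0;
    while (h >= 4) {                      /* label 1: in the assembly */
        for (int i = 0; i < 4; i++, p1 += stride, p2 += stride)
            sum += row_sse16(p1, p2);
        h -= 4;
    }
    while (h > 0) {                       /* label 2: one row at a time */
        sum += row_sse16(p1, p2);
        p1 += stride;
        p2 += stride;
        h--;
    }
    return sum;
}
```

Calling both functions with a height that is not a multiple of 4 (for example 7) exercises the tail loop and should give identical results.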

Hubert Mazur (5):
  lavc/aarch64: Add neon implementation for sse16
  lavc/aarch64: Add neon implementation for sse4
  lavc/aarch64: Add neon implementation for pix_abs16_y2
  lavc/aarch64: Add neon implementation for sse8
  lavc/aarch64: Add neon implementation for pix_abs8

 libavcodec/aarch64/me_cmp_init_aarch64.c |  18 ++
 libavcodec/aarch64/me_cmp_neon.S         | 324 +++++++++++++++++++++++
 2 files changed, 342 insertions(+)

-- 
2.34.1

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".


* [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for sse16
  2022-08-16 12:20 [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions Hubert Mazur
@ 2022-08-16 12:20 ` Hubert Mazur
  2022-08-18  9:09   ` Martin Storsjö
  2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation for sse4 Hubert Mazur
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Hubert Mazur @ 2022-08-16 12:20 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop

Provide neon implementation for sse16 function.

Performance comparison tests are shown below.
- sse_0_c: 268.2
- sse_0_neon: 43.5

Benchmarks and tests run with checkasm tool on AWS Graviton 3.
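
For readers less familiar with the NEON instructions used in this patch, here is a scalar C model of how one 16-pixel row appears to flow through them: uabd yields 8-bit absolute differences, umull/umull2 square them into 16-bit lanes, uadalp adds adjacent 16-bit pairs into 32-bit accumulator lanes, and uaddlv sums the accumulator at the end. This is only a sketch for illustration; sse16_row_model is a hypothetical name and the function is not part of the patch.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of one 16-pixel row (hypothetical, for illustration only). */
static uint32_t sse16_row_model(const uint8_t *pix1, const uint8_t *pix2)
{
    uint32_t acc[4] = { 0, 0, 0, 0 };          /* models the v17.4s accumulator */

    for (int half = 0; half < 2; half++) {     /* umull = low 8 bytes, umull2 = high 8 bytes */
        uint16_t sq[8];
        for (int i = 0; i < 8; i++) {
            /* uabd: absolute difference still fits in 8 bits */
            uint8_t d = (uint8_t)abs(pix1[8 * half + i] - pix2[8 * half + i]);
            /* umull: 255 * 255 = 65025, so the square fits in 16 bits */
            sq[i] = (uint16_t)(d * d);
        }
        for (int i = 0; i < 4; i++)            /* uadalp: pairwise add and accumulate */
            acc[i] += (uint32_t)sq[2 * i] + sq[2 * i + 1];
    }
    return acc[0] + acc[1] + acc[2] + acc[3];  /* uaddlv: sum across lanes */
}
```

Squaring the absolute difference rather than the signed difference is what lets the intermediate values stay in unsigned 8-bit and 16-bit lanes.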

Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
 libavcodec/aarch64/me_cmp_init_aarch64.c |  4 ++
 libavcodec/aarch64/me_cmp_neon.S         | 76 ++++++++++++++++++++++++
 2 files changed, 80 insertions(+)

diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index 79c739914f..7780009d41 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -30,6 +30,9 @@ int ff_pix_abs16_xy2_neon(MpegEncContext *s, const uint8_t *blk1, const uint8_t
 int ff_pix_abs16_x2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
                       ptrdiff_t stride, int h);
 
+int sse16_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
+                      ptrdiff_t stride, int h);
+
 av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
 {
     int cpu_flags = av_get_cpu_flags();
@@ -40,5 +43,6 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
         c->pix_abs[0][3] = ff_pix_abs16_xy2_neon;
 
         c->sad[0] = ff_pix_abs16_neon;
+        c->sse[0] = sse16_neon;
     }
 }
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index cda7ce0408..825ce45d13 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -270,3 +270,79 @@ function ff_pix_abs16_x2_neon, export=1
 
         ret
 endfunc
+
+function sse16_neon, export=1
+        // x0 - unused
+        // x1 - pix1
+        // x2 - pix2
+        // x3 - stride
+        // w4 - h
+
+        cmp             w4, #4
+        movi            d18, #0
+        movi            v17.4s, #0
+        b.lt            2f
+
+// Make 4 iterations at once
+1:
+
+        // res = abs(pix1[0] - pix2[0])
+        // res * res
+
+        ld1             {v0.16b}, [x1], x3              // Load pix1 vector for first iteration
+        ld1             {v1.16b}, [x2], x3              // Load pix2 vector for first iteration
+        ld1             {v2.16b}, [x1], x3              // Load pix1 vector for second iteration
+        uabd            v30.16b, v0.16b, v1.16b         // Absolute difference, first iteration
+        ld1             {v3.16b}, [x2], x3              // Load pix2 vector for second iteration
+        umull           v29.8h, v30.8b, v30.8b          // Multiply lower half of vectors, first iteration
+        umull2          v28.8h, v30.16b, v30.16b        // Multiply upper half of vectors, first iteration
+        uabd            v27.16b, v2.16b, v3.16b         // Absolute difference, second iteration
+        uadalp          v17.4s, v29.8h                  // Pairwise add and accumulate, first iteration
+        ld1             {v4.16b}, [x1], x3              // Load pix1 for third iteration
+        umull           v26.8h, v27.8b, v27.8b          // Multiply lower half, second iteration
+        umull2          v25.8h, v27.16b, v27.16b        // Multiply upper half, second iteration
+        ld1             {v5.16b}, [x2], x3              // Load pix2 for third iteration
+        uadalp          v17.4s, v26.8h                  // Pairwise add and accumulate, second iteration
+        uabd            v24.16b, v4.16b, v5.16b         // Absolute difference, third iteration
+        ld1             {v6.16b}, [x1], x3              // Load pix1 for fourth iteration
+        uadalp          v17.4s, v25.8h                  // Pairwise add and accumulate, second iteration
+        umull           v23.8h, v24.8b, v24.8b          // Multiply lower half, third iteration
+        umull2          v22.8h, v24.16b, v24.16b        // Multiply upper half, third iteration
+        uadalp          v17.4s, v23.8h                  // Pairwise add and accumulate, third iteration
+        ld1             {v7.16b}, [x2], x3              // Load pix2 for fourth iteration
+        uadalp          v17.4s, v22.8h                  // Pairwise add and accumulate, third iteration
+        uabd            v21.16b, v6.16b, v7.16b         // Absolute difference, fourth iteration
+        uadalp          v17.4s, v28.8h                  // Pairwise add and accumulate, first iteration
+        umull           v20.8h, v21.8b, v21.8b          // Multiply lower half, fourth iteration
+        sub             w4, w4, #4                      // h -= 4
+        umull2          v19.8h, v21.16b, v21.16b        // Multiply upper half, fourth iteration
+        uadalp          v17.4s, v20.8h                  // Pairwise add and accumulate, fourth iteration
+        cmp             w4, #4
+        uadalp          v17.4s, v19.8h                  // Pairwise add and accumulate, fourth iteration
+        b.ge            1b
+
+        cbz             w4, 3f
+
+// iterate by one
+2:
+
+        ld1             {v0.16b}, [x1], x3              // Load pix1
+        ld1             {v1.16b}, [x2], x3              // Load pix2
+
+        uabd            v30.16b, v0.16b, v1.16b
+        umull           v29.8h, v30.8b, v30.8b
+        umull2          v28.8h, v30.16b, v30.16b
+        uadalp          v17.4s, v29.8h
+        subs            w4, w4, #1
+        uadalp          v17.4s, v28.8h
+
+        b.ne            2b
+
+3:
+        uaddlv          d16, v17.4s                     // add up accumulator vector
+        add             d18, d18, d16
+
+        fmov            w0, s18
+
+        ret
+endfunc
-- 
2.34.1


* [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation for sse4
  2022-08-16 12:20 [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions Hubert Mazur
  2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for sse16 Hubert Mazur
@ 2022-08-16 12:20 ` Hubert Mazur
  2022-08-18  9:10   ` Martin Storsjö
  2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for pix_abs16_y2 Hubert Mazur
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Hubert Mazur @ 2022-08-16 12:20 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop

Provide neon implementation for sse4 function.

Performance comparison tests are shown below.
- sse_2_c: 80.7
- sse_2_neon: 31.0

Benchmarks and tests are run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
 libavcodec/aarch64/me_cmp_init_aarch64.c |  3 ++
 libavcodec/aarch64/me_cmp_neon.S         | 58 ++++++++++++++++++++++++
 2 files changed, 61 insertions(+)

diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index 7780009d41..955592625a 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -32,6 +32,8 @@ int ff_pix_abs16_x2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *
 
 int sse16_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
                       ptrdiff_t stride, int h);
+int sse4_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
+                      ptrdiff_t stride, int h);
 
 av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
 {
@@ -44,5 +46,6 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
 
         c->sad[0] = ff_pix_abs16_neon;
         c->sse[0] = sse16_neon;
+        c->sse[2] = sse4_neon;
     }
 }
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 825ce45d13..367924b3c2 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -346,3 +346,61 @@ function sse16_neon, export=1
 
         ret
 endfunc
+
+function sse4_neon, export=1
+        // x0 - unused
+        // x1 - pix1
+        // x2 - pix2
+        // x3 - stride
+        // w4 - h
+
+        movi            d18, #0
+        movi            v16.4s, #0                      // clear the result accumulator
+        cmp             w4, #4
+        b.le            2f
+
+// make 4 iterations at once
+1:
+
+        // res = abs(pix1[0] - pix2[0])
+        // res * res
+
+        ld1             {v0.s}[0], [x1], x3             // Load pix1, first iteration
+        ld1             {v1.s}[0], [x2], x3             // Load pix2, first iteration
+        ld1             {v2.s}[0], [x1], x3             // Load pix1, second iteration
+        ld1             {v3.s}[0], [x2], x3             // Load pix2, second iteration
+        uabdl           v30.8h, v0.8b, v1.8b            // Absolute difference, first iteration
+        ld1             {v4.s}[0], [x1], x3             // Load pix1, third iteration
+        ld1             {v5.s}[0], [x2], x3             // Load pix2, third iteration
+        uabdl           v29.8h, v2.8b, v3.8b            // Absolute difference, second iteration
+        umlal           v16.4s, v30.4h, v30.4h          // Multiply vectors, first iteration
+        ld1             {v6.s}[0], [x1], x3             // Load pix1, fourth iteration
+        ld1             {v7.s}[0], [x2], x3             // Load pix2, fourth iteration
+        uabdl           v28.8h, v4.8b, v5.8b            // Absolute difference, third iteration
+        umlal           v16.4s, v29.4h, v29.4h          // Multiply and accumulate, second iteration
+        sub             w4, w4, #4
+        uabdl           v27.8h, v6.8b, v7.8b            // Absolute difference, fourth iteration
+        umlal           v16.4s, v28.4h, v28.4h          // Multiply and accumulate, third iteration
+        cmp             w4, #4
+        umlal           v16.4s, v27.4h, v27.4h          // Multiply and accumulate, fourth iteration
+        b.ge            1b
+
+        cbz             w4, 3f
+
+// iterate by one
+2:
+        ld1             {v0.s}[0], [x1], x3               // Load pix1
+        ld1             {v1.s}[0], [x2], x3               // Load pix2
+        uabdl           v30.8h, v0.8b, v1.8b
+        subs            w4, w4, #1
+        umlal           v16.4s, v30.4h, v30.4h
+
+        b.ne            2b
+
+3:
+        uaddlv          d17, v16.4s                     // Add vector
+        add             d18, d18, d17
+        fmov            w0, s18
+
+        ret
+endfunc
-- 
2.34.1


* [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for pix_abs16_y2
  2022-08-16 12:20 [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions Hubert Mazur
  2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for sse16 Hubert Mazur
  2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation for sse4 Hubert Mazur
@ 2022-08-16 12:20 ` Hubert Mazur
  2022-08-18  9:16   ` Martin Storsjö
  2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for sse8 Hubert Mazur
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Hubert Mazur @ 2022-08-16 12:20 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop

Provide optimized implementation of pix_abs16_y2 function for arm64.

Performance comparison tests are shown below.
- pix_abs_0_2_c: 317.2
- pix_abs_0_2_neon: 37.5

Benchmarks and tests run with checkasm tool on AWS Graviton 3.
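
The operation being optimized can be sketched in scalar C from the comments in the assembly below: each pixel of pix1 is compared against the rounded average of the pixel in pix2 and the pixel one row below it. This is a minimal sketch based only on those comments, not a copy of the C code in libavcodec; pix_abs16_y2_ref is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* avg2(a, b) = (a + b + 1) >> 1, as stated in the assembly comments */
static int avg2(int a, int b)
{
    return (a + b + 1) >> 1;
}

/* Hypothetical scalar reference for a 16-pixel-wide block of h rows. */
static int pix_abs16_y2_ref(const uint8_t *pix1, const uint8_t *pix2,
                            ptrdiff_t stride, int h)
{
    const uint8_t *pix3 = pix2 + stride;   /* row below pix2, x5 in the assembly */
    int sum = 0;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++)
            sum += abs(pix1[x] - avg2(pix2[x], pix3[x]));
        pix1 += stride;
        pix2 += stride;
        pix3 += stride;
    }
    return sum;
}
```

Note that the function reads h + 1 rows from pix2, since the last row processed is averaged with the row below it.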

Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
 libavcodec/aarch64/me_cmp_init_aarch64.c |  3 +
 libavcodec/aarch64/me_cmp_neon.S         | 75 ++++++++++++++++++++++++
 2 files changed, 78 insertions(+)

diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index 955592625a..1c36d3d7cb 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -29,6 +29,8 @@ int ff_pix_abs16_xy2_neon(MpegEncContext *s, const uint8_t *blk1, const uint8_t
                       ptrdiff_t stride, int h);
 int ff_pix_abs16_x2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
                       ptrdiff_t stride, int h);
+int ff_pix_abs16_y2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
+                      ptrdiff_t stride, int h);
 
 int sse16_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
                       ptrdiff_t stride, int h);
@@ -42,6 +44,7 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
     if (have_neon(cpu_flags)) {
         c->pix_abs[0][0] = ff_pix_abs16_neon;
         c->pix_abs[0][1] = ff_pix_abs16_x2_neon;
+        c->pix_abs[0][2] = ff_pix_abs16_y2_neon;
         c->pix_abs[0][3] = ff_pix_abs16_xy2_neon;
 
         c->sad[0] = ff_pix_abs16_neon;
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 367924b3c2..0ec9c0465b 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -404,3 +404,78 @@ function sse4_neon, export=1
 
         ret
 endfunc
+
+function ff_pix_abs16_y2_neon, export=1
+        // x0           unused
+        // x1           uint8_t *pix1
+        // x2           uint8_t *pix2
+        // x3           ptrdiff_t stride
+        // x4           int h
+
+        // initialize buffers
+        movi            v29.8h, #0                      // clear the accumulator
+        movi            v28.8h, #0                      // clear the accumulator
+        movi            d18, #0
+        add             x5, x2, x3                      // pix2 + stride
+        cmp             w4, #4
+        b.lt            2f
+
+// make 4 iterations at once
+1:
+
+        // abs(pix1[0], avg2(pix2[0], pix2[0 + stride]))
+        // avg2(a, b) = (((a) + (b) + 1) >> 1)
+        // abs(x) = (x < 0 ? (-x) : (x))
+
+        ld1             {v1.16b}, [x2], x3              // Load pix2 for first iteration
+        ld1             {v2.16b}, [x5], x3              // Load pix3 for first iteration
+        urhadd          v30.16b, v1.16b, v2.16b         // Rounding halving add, first iteration
+        ld1             {v0.16b}, [x1], x3              // Load pix1 for first iteration
+        uabal           v29.8h, v0.8b, v30.8b           // Absolute difference of lower half, first iteration
+        ld1             {v4.16b}, [x2], x3              // Load pix2 for second iteration
+        uabal2          v28.8h, v0.16b, v30.16b         // Absolute difference of upper half, first iteration
+        ld1             {v5.16b}, [x5], x3              // Load pix3 for second iteration
+        urhadd          v27.16b, v4.16b, v5.16b         // Rounding halving add, second iteration
+        ld1             {v3.16b}, [x1], x3              // Load pix1 for second iteration
+        uabal           v29.8h, v3.8b, v27.8b           // Absolute difference of lower half for second iteration
+        ld1             {v7.16b}, [x2], x3              // Load pix2 for third iteration
+        ld1             {v20.16b}, [x5], x3             // Load pix3 for third iteration
+        uabal2          v28.8h, v3.16b, v27.16b         // Absolute difference of upper half for second iteration
+        ld1             {v6.16b}, [x1], x3              // Load pix1 for third iteration
+        urhadd          v26.16b, v7.16b, v20.16b        // Rounding halving add, third iteration
+        uabal           v29.8h, v6.8b, v26.8b           // Absolute difference of lower half for third iteration
+        ld1             {v22.16b}, [x2], x3             // Load pix2 for fourth iteration
+        uabal2          v28.8h, v6.16b, v26.16b         // Absolute difference of upper half for third iteration
+        ld1             {v23.16b}, [x5], x3             // Load pix3 for fourth iteration
+        sub             w4, w4, #4                      // h-= 4
+        urhadd          v25.16b, v22.16b, v23.16b       // Rounding halving add
+        ld1             {v21.16b}, [x1], x3             // Load pix1 for fourth iteration
+        cmp             w4, #4
+        uabal           v29.8h, v21.8b, v25.8b          // Absolute difference of lower half for fourth iteration
+        uabal2          v28.8h, v21.16b, v25.16b        // Absolute difference of upper half for fourth iteration
+
+        b.ge            1b
+        cbz             w4, 3f
+
+// iterate by one
+2:
+
+        ld1             {v1.16b}, [x2], x3              // Load pix2
+        ld1             {v2.16b}, [x5], x3              // Load pix3
+        subs            w4, w4, #1
+        urhadd          v30.16b, v1.16b, v2.16b         // Rounding halving add
+        ld1             {v0.16b}, [x1], x3              // Load pix1
+        uabal           v29.8h, v30.8b, v0.8b
+        uabal2          v28.8h, v30.16b, v0.16b
+
+        b.ne            2b
+
+3:
+        add             v29.8h, v29.8h, v28.8h          // Add vectors together
+        uaddlv          s16, v29.8h                     // Add up vector values
+        add             d18, d18, d16
+
+        fmov            w0, s18
+
+        ret
+endfunc
-- 
2.34.1


* [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for sse8
  2022-08-16 12:20 [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions Hubert Mazur
                   ` (2 preceding siblings ...)
  2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for pix_abs16_y2 Hubert Mazur
@ 2022-08-16 12:20 ` Hubert Mazur
  2022-08-18  9:18   ` Martin Storsjö
  2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Add neon implementation for pix_abs8 Hubert Mazur
  2022-08-18  9:07 ` [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions Martin Storsjö
  5 siblings, 1 reply; 13+ messages in thread
From: Hubert Mazur @ 2022-08-16 12:20 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop

Provide optimized implementation of sse8 function for arm64.

Performance comparison tests are shown below.
- sse_1_c: 130.7
- sse_1_neon: 29.7

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
 libavcodec/aarch64/me_cmp_init_aarch64.c |  4 ++
 libavcodec/aarch64/me_cmp_neon.S         | 66 ++++++++++++++++++++++++
 2 files changed, 70 insertions(+)

diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index 1c36d3d7cb..2f51f0497e 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -34,9 +34,12 @@ int ff_pix_abs16_y2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *
 
 int sse16_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
                       ptrdiff_t stride, int h);
+int sse8_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
+                      ptrdiff_t stride, int h);
 int sse4_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
                       ptrdiff_t stride, int h);
 
+
 av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
 {
     int cpu_flags = av_get_cpu_flags();
@@ -49,6 +52,7 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
 
         c->sad[0] = ff_pix_abs16_neon;
         c->sse[0] = sse16_neon;
+        c->sse[1] = sse8_neon;
         c->sse[2] = sse4_neon;
     }
 }
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 0ec9c0465b..3f4266d4d5 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -347,6 +347,72 @@ function sse16_neon, export=1
         ret
 endfunc
 
+function sse8_neon, export=1
+        // x0 - unused
+        // x1 - pix1
+        // x2 - pix2
+        // x3 - stride
+        // w4 - h
+
+        movi            d18, #0
+        movi            v21.4s, #0
+        movi            v20.4s, #0
+        cmp             w4, #4
+        b.le            2f
+
+// make 4 iterations at once
+1:
+
+        // res = abs(pix1[0] - pix2[0])
+        // res * res
+
+        ld1             {v0.8b}, [x1], x3               // Load pix1 for first iteration
+        ld1             {v1.8b}, [x2], x3               // Load pix2 for first iteration
+        ld1             {v2.8b}, [x1], x3               // Load pix1 for second iteration
+        ld1             {v3.8b}, [x2], x3               // Load pix2 for second iteration
+        uabdl           v30.8h, v0.8b, v1.8b            // Absolute difference, first iteration
+        ld1             {v4.8b}, [x1], x3               // Load pix1 for third iteration
+        ld1             {v5.8b}, [x2], x3               // Load pix2 for third iteration
+        uabdl           v29.8h, v2.8b, v3.8b            // Absolute difference, second iteration
+        umlal           v21.4s, v30.4h, v30.4h          // Multiply lower half, first iteration
+        ld1             {v6.8b}, [x1], x3               // Load pix1 for fourth iteration
+        ld1             {v7.8b}, [x2], x3               // Load pix2 for fourth iteration
+        uabdl           v28.8h, v4.8b, v5.8b            // Absolute difference, third iteration
+        umlal           v21.4s, v29.4h, v29.4h          // Multiply lower half, second iteration
+        umlal2          v20.4s, v30.8h, v30.8h          // Multiply upper half, first iteration
+        uabdl           v27.8h, v6.8b, v7.8b            // Absolute difference, fourth iteration
+        umlal           v21.4s, v28.4h, v28.4h          // Multiply lower half, third iteration
+        umlal2          v20.4s, v29.8h, v29.8h          // Multiply upper half, second iteration
+        sub             w4, w4, #4                      // h -= 4
+        umlal2          v20.4s, v28.8h, v28.8h          // Multiply upper half, third iteration
+        umlal           v21.4s, v27.4h, v27.4h          // Multiply lower half, fourth iteration
+        cmp             w4, #4
+        umlal2          v20.4s, v27.8h, v27.8h          // Multiply upper half, fourth iteration
+        b.ge            1b
+
+        cbz             w4, 3f
+
+// iterate by one
+2:
+        ld1             {v0.8b}, [x1], x3               // Load pix1
+        ld1             {v1.8b}, [x2], x3               // Load pix2
+        subs            w4, w4, #1
+        uabdl           v30.8h, v0.8b, v1.8b
+        umlal           v21.4s, v30.4h, v30.4h
+        umlal2          v20.4s, v30.8h, v30.8h
+
+        b.ne            2b
+
+3:
+        add             v21.4s, v21.4s, v20.4s          // Add accumulator vectors together
+        uaddlv          d17, v21.4s                     // Add up vector
+        add             d18, d18, d17
+
+        fmov            w0, s18
+        ret
+
+endfunc
+
 function sse4_neon, export=1
         // x0 - unused
         // x1 - pix1
-- 
2.34.1


* [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Add neon implementation for pix_abs8
  2022-08-16 12:20 [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions Hubert Mazur
                   ` (3 preceding siblings ...)
  2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for sse8 Hubert Mazur
@ 2022-08-16 12:20 ` Hubert Mazur
  2022-08-18  9:22   ` Martin Storsjö
  2022-08-18  9:07 ` [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions Martin Storsjö
  5 siblings, 1 reply; 13+ messages in thread
From: Hubert Mazur @ 2022-08-16 12:20 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop

Provide optimized implementation of pix_abs8 function for arm64.

Performance comparison tests are shown below.
- pix_abs_1_0_c: 101.2
- pix_abs_1_0_neon: 22.5
- sad_1_c: 101.2
- sad_1_neon: 22.5

Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
---
 libavcodec/aarch64/me_cmp_init_aarch64.c |  4 ++
 libavcodec/aarch64/me_cmp_neon.S         | 49 ++++++++++++++++++++++++
 2 files changed, 53 insertions(+)

diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index 2f51f0497e..e7dbd4cbc5 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -31,6 +31,8 @@ int ff_pix_abs16_x2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *
                       ptrdiff_t stride, int h);
 int ff_pix_abs16_y2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
                       ptrdiff_t stride, int h);
+int ff_pix_abs8_neon(MpegEncContext *s, const uint8_t *blk1, const uint8_t *blk2,
+                      ptrdiff_t stride, int h);
 
 int sse16_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
                       ptrdiff_t stride, int h);
@@ -49,8 +51,10 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
         c->pix_abs[0][1] = ff_pix_abs16_x2_neon;
         c->pix_abs[0][2] = ff_pix_abs16_y2_neon;
         c->pix_abs[0][3] = ff_pix_abs16_xy2_neon;
+        c->pix_abs[1][0] = ff_pix_abs8_neon;
 
         c->sad[0] = ff_pix_abs16_neon;
+        c->sad[1] = ff_pix_abs8_neon;
         c->sse[0] = sse16_neon;
         c->sse[1] = sse8_neon;
         c->sse[2] = sse4_neon;
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 3f4266d4d5..8c396cad21 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -72,6 +72,55 @@ function ff_pix_abs16_neon, export=1
         ret
 endfunc
 
+function ff_pix_abs8_neon, export=1
+        // x0           unused
+        // x1           uint8_t *pix1
+        // x2           uint8_t *pix2
+        // x3           ptrdiff_t stride
+        // x4           int h
+
+        movi            d18, #0
+        movi            v30.8h, #0
+        cmp             w4, #4
+        b.lt            2f
+
+// make 4 iterations at once
+1:
+        ld1             {v0.8b}, [x1], x3               // Load pix1 for first iteration
+        ld1             {v1.8b}, [x2], x3               // Load pix2 for first iteration
+        ld1             {v2.8b}, [x1], x3               // Load pix1 for second iteration
+        uabal           v30.8h, v0.8b, v1.8b            // Absolute difference, first iteration
+        ld1             {v3.8b}, [x2], x3               // Load pix2 for second iteration
+        ld1             {v4.8b}, [x1], x3               // Load pix1 for third iteration
+        uabal           v30.8h, v2.8b, v3.8b            // Absolute difference, second iteration
+        ld1             {v5.8b}, [x2], x3               // Load pix2 for third iteration
+        sub             w4, w4, #4                      // h -= 4
+        uabal           v30.8h, v4.8b, v5.8b            // Absolute difference, third iteration
+        ld1             {v6.8b}, [x1], x3               // Load pix1 for fourth iteration
+        ld1             {v7.8b}, [x2], x3               // Load pix2 for fourth iteration
+        cmp             w4, #4
+        uabal           v30.8h, v6.8b, v7.8b            // Absolute difference, fourth iteration
+        b.ge            1b
+
+        cbz             w4, 3f
+
+// iterate by one
+2:
+        ld1             {v0.8b}, [x1], x3               // Load pix1
+        ld1             {v1.8b}, [x2], x3               // Load pix2
+
+        subs            w4, w4, #1
+        uabal           v30.8h, v0.8b, v1.8b
+        b.ne            2b
+
+3:
+        uaddlv          s20, v30.8h                     // Add up vector
+        add             d18, d18, d20
+        fmov            w0, s18
+
+        ret
+endfunc
+
 function ff_pix_abs16_xy2_neon, export=1
         // x0           unused
         // x1           uint8_t *pix1
-- 
2.34.1


* Re: [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions
  2022-08-16 12:20 [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions Hubert Mazur
                   ` (4 preceding siblings ...)
  2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Add neon implementation for pix_abs8 Hubert Mazur
@ 2022-08-18  9:07 ` Martin Storsjö
  2022-08-18  9:24   ` Hubert Mazur
  5 siblings, 1 reply; 13+ messages in thread
From: Martin Storsjö @ 2022-08-18  9:07 UTC (permalink / raw)
  To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop

On Tue, 16 Aug 2022, Hubert Mazur wrote:

> Add arm64 neon implementation for functions from motion estimation
> family. All of them were tested and benchmarked using checkasm tool.
> The rare code paths, e.g. when filter_size % 4 != 0 were also tested.

> Instructions were manually deinterleaved to reach best performance.

You probably mean "interleaved", as deinterleaved would be how it was 
initially, which is detrimental for performance.

Overall I think this patchset is close enough now. There were a bunch of 
minor details left on the patches, but I'll fix that up locally and push 
them, instead of doing yet another round of these. I'll comment and point 
out the details I changed - please pay attention to them for future 
patches though!

// Martin

* Re: [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for sse16
  2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for sse16 Hubert Mazur
@ 2022-08-18  9:09   ` Martin Storsjö
  0 siblings, 0 replies; 13+ messages in thread
From: Martin Storsjö @ 2022-08-18  9:09 UTC (permalink / raw)
  To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop

On Tue, 16 Aug 2022, Hubert Mazur wrote:

> Provide neon implementation for sse16 function.
>
> Performance comparison tests are shown below.
> - sse_0_c: 268.2
> - sse_0_neon: 43.5
>
> Benchmarks and tests run with checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c |  4 ++
> libavcodec/aarch64/me_cmp_neon.S         | 76 ++++++++++++++++++++++++
> 2 files changed, 80 insertions(+)
>
> diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
> index 79c739914f..7780009d41 100644
> --- a/libavcodec/aarch64/me_cmp_init_aarch64.c
> +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
> @@ -30,6 +30,9 @@ int ff_pix_abs16_xy2_neon(MpegEncContext *s, const uint8_t *blk1, const uint8_t
> int ff_pix_abs16_x2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
>                       ptrdiff_t stride, int h);
> 
> +int sse16_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
> +                      ptrdiff_t stride, int h);
> +

The second line of the function declaration is incorrectly indented (it
should be aligned with the opening parenthesis). I fixed this for the
preexisting cases and for the new patches in the version I pushed.

> diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
> index cda7ce0408..825ce45d13 100644
> --- a/libavcodec/aarch64/me_cmp_neon.S
> +++ b/libavcodec/aarch64/me_cmp_neon.S
> @@ -270,3 +270,79 @@ function ff_pix_abs16_x2_neon, export=1
>
>         ret
> endfunc
> +
> +function sse16_neon, export=1
> +        // x0 - unused
> +        // x1 - pix1
> +        // x2 - pix2
> +        // x3 - stride
> +        // w4 - h
> +
> +        cmp             w4, #4
> +        movi            d18, #0

The d18 register was essentially unused

> +3:
> +        uaddlv          d16, v17.4s                     // add up accumulator vector
> +        add             d18, d18, d16
> +
> +        fmov            w0, s18

Here, the d18 register could be left out entirely.
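
As a rough C model of that tail (a sketch only; tail_as_posted and
tail_simplified are hypothetical names, and the simplified form is an
assumption about what dropping d18 could look like, not the code that was
pushed), adding the reduced sum into a register that is still zero gives
the same value as returning the reduced sum directly:

```c
#include <stdint.h>

/* Models the quoted tail: d18 is zeroed up front and only receives one add. */
static uint32_t tail_as_posted(const uint32_t acc[4])
{
    uint64_t d18 = 0;                   /* movi   d18, #0           */
    uint64_t d16 = 0;
    for (int i = 0; i < 4; i++)         /* uaddlv d16, v17.4s       */
        d16 += acc[i];
    d18 += d16;                         /* add    d18, d18, d16     */
    return (uint32_t)d18;               /* fmov   w0, s18           */
}

/* Assumed simplification: return the low 32 bits of the reduced sum. */
static uint32_t tail_simplified(const uint32_t acc[4])
{
    uint64_t d16 = 0;
    for (int i = 0; i < 4; i++)         /* uaddlv d16, v17.4s       */
        d16 += acc[i];
    return (uint32_t)d16;               /* e.g. fmov w0, s16        */
}
```

Both helpers return the same value for any accumulator contents, so the
zeroing and the extra add only cost instructions.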

// Martin

* Re: [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation for sse4
  2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation for sse4 Hubert Mazur
@ 2022-08-18  9:10   ` Martin Storsjö
  0 siblings, 0 replies; 13+ messages in thread
From: Martin Storsjö @ 2022-08-18  9:10 UTC (permalink / raw)
  To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop

On Tue, 16 Aug 2022, Hubert Mazur wrote:

> Provide neon implementation for sse4 function.
>
> Performance comparison tests are shown below.
> - sse_2_c: 80.7
> - sse_2_neon: 31.0
>
> Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c |  3 ++
> libavcodec/aarch64/me_cmp_neon.S         | 58 ++++++++++++++++++++++++
> 2 files changed, 61 insertions(+)

This patch had the same issue about unused d18 register and unnecessary 
add instruction, and the misaligned function declaration.

// Martin

* Re: [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for pix_abs16_y2
  2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for pix_abs16_y2 Hubert Mazur
@ 2022-08-18  9:16   ` Martin Storsjö
  0 siblings, 0 replies; 13+ messages in thread
From: Martin Storsjö @ 2022-08-18  9:16 UTC (permalink / raw)
  To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop

On Tue, 16 Aug 2022, Hubert Mazur wrote:

> Provide optimized implementation of pix_abs16_y2 function for arm64.
>
> Performance comparison tests are shown below.
> pix_abs_0_2_c: 317.2
> pix_abs_0_2_neon: 37.5
>
> Benchmarks and tests run with checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c |  3 +
> libavcodec/aarch64/me_cmp_neon.S         | 75 ++++++++++++++++++++++++
> 2 files changed, 78 insertions(+)
>
> diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
> index 955592625a..1c36d3d7cb 100644
> --- a/libavcodec/aarch64/me_cmp_init_aarch64.c
> +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
> @@ -29,6 +29,8 @@ int ff_pix_abs16_xy2_neon(MpegEncContext *s, const uint8_t *blk1, const uint8_t
>                       ptrdiff_t stride, int h);
> int ff_pix_abs16_x2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
>                       ptrdiff_t stride, int h);
> +int ff_pix_abs16_y2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
> +                      ptrdiff_t stride, int h);

Misaligned function declaration.

> diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
> index 367924b3c2..0ec9c0465b 100644
> --- a/libavcodec/aarch64/me_cmp_neon.S
> +++ b/libavcodec/aarch64/me_cmp_neon.S
> @@ -404,3 +404,78 @@ function sse4_neon, export=1
>
>         ret
> endfunc
> +
> +function ff_pix_abs16_y2_neon, export=1

Why place this new function at the bottom of the file, instead of 
logically following the other preexisting pix_abs16 function? In the 
version I pushed, I moved it further up

> +        // x0           unused
> +        // x1           uint8_t *pix1
> +        // x2           uint8_t *pix2
> +        // x3           ptrdiff_t stride
> +        // x4           int h

This should be w4. You had fixed this in a couple patches, but missed this 
one.

> +
> +        // initialize buffers
> +        movi            v29.8h, #0                      // clear the accumulator
> +        movi            v28.8h, #0                      // clear the accumulator
> +        movi            d18, #0

Unused d18 here too


> +        add             x5, x2, x3                      // pix2 + stride
> +        cmp             w4, #4
> +        b.lt            2f
> +
> +// make 4 iterations at once
> +1:
> +
> +        // abs(pix1[0], avg2(pix2[0], pix2[0 + stride]))
> +        // avg2(a, b) = (((a) + (b) + 1) >> 1)
> +        // abs(x) = (x < 0 ? (-x) : (x))
> +
> +        ld1             {v1.16b}, [x2], x3              // Load pix2 for first iteration
> +        ld1             {v2.16b}, [x5], x3              // Load pix3 for first iteration
> +        urhadd          v30.16b, v1.16b, v2.16b         // Rounding halving add, first iteration
> +        ld1             {v0.16b}, [x1], x3              // Load pix1 for first iteration
> +        uabal           v29.8h, v0.8b, v30.8b           // Absolute difference of lower half, first iteration

This whole first sequence is almost entirely blocking, waiting for the
result of the previous operation - did you forget to interleave this with
the rest of the operations?

Normally I wouldn't bother with minor interleaving details, but here the 
impact was rather big. I manually reinterleaved the whole function, and 
got this speedup:

                   Cortex A53     A72     A73
Before:
pix_abs_0_2_neon:       153.0    63.7    52.7
After:
pix_abs_0_2_neon:       141.0    61.7    51.7
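
When reordering instructions like this, a scalar model is handy for
checking that the result is unchanged. The following is a minimal sketch
written from the formulas in the patch comments (avg2 and abs); the name
pix_abs16_y2_ref is hypothetical and this is not FFmpeg's own C
implementation. Note that it reads h + 1 rows from pix2, because each row
is averaged with the row below it.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* avg2(a, b) = (a + b + 1) >> 1, as stated in the patch comments */
static int avg2(int a, int b)
{
    return (a + b + 1) >> 1;
}

/* Sum of |pix1[x] - avg2(pix2[x], pix2[x + stride])| over a 16 x h block. */
static int pix_abs16_y2_ref(const uint8_t *pix1, const uint8_t *pix2,
                            ptrdiff_t stride, int h)
{
    const uint8_t *pix3 = pix2 + stride;    /* row below pix2 */
    int sum = 0;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++)
            sum += abs(pix1[x] - avg2(pix2[x], pix3[x]));
        pix1 += stride;
        pix2 += stride;
        pix3 += stride;
    }
    return sum;
}
```

Comparing its output with the assembly on random inputs for several h
values (including h not divisible by 4) exercises both loops.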

// Martin

* Re: [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for sse8
  2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for sse8 Hubert Mazur
@ 2022-08-18  9:18   ` Martin Storsjö
  0 siblings, 0 replies; 13+ messages in thread
From: Martin Storsjö @ 2022-08-18  9:18 UTC (permalink / raw)
  To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop

On Tue, 16 Aug 2022, Hubert Mazur wrote:

> Provide optimized implementation of sse8 function for arm64.
>
> Performance comparison tests are shown below.
> - sse_1_c: 130.7
> - sse_1_neon: 29.7
>
> Benchmarks and tests run with checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c |  4 ++
> libavcodec/aarch64/me_cmp_neon.S         | 66 ++++++++++++++++++++++++
> 2 files changed, 70 insertions(+)
>
> diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
> index 1c36d3d7cb..2f51f0497e 100644
> --- a/libavcodec/aarch64/me_cmp_init_aarch64.c
> +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
> @@ -34,9 +34,12 @@ int ff_pix_abs16_y2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *
> 
> int sse16_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
>                       ptrdiff_t stride, int h);
> +int sse8_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
> +                      ptrdiff_t stride, int h);
> int sse4_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
>                       ptrdiff_t stride, int h);

Same as the others about function declaration indentation

> diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
> index 0ec9c0465b..3f4266d4d5 100644
> --- a/libavcodec/aarch64/me_cmp_neon.S
> +++ b/libavcodec/aarch64/me_cmp_neon.S
> @@ -347,6 +347,72 @@ function sse16_neon, export=1
>         ret
> endfunc
> 
> +function sse8_neon, export=1
> +        // x0 - unused
> +        // x1 - pix1
> +        // x2 - pix2
> +        // x3 - stride
> +        // w4 - h
> +
> +        movi            d18, #0

Same as the others about d18

> +        movi            v21.4s, #0
> +        movi            v20.4s, #0
> +        cmp             w4, #4
> +        b.le            2f
> +
> +// make 4 iterations at once
> +1:
> +
> +        // res = abs(pix1[0] - pix2[0])
> +        // res * res
> +
> +        ld1             {v0.8b}, [x1], x3               // Load pix1 for first iteration
> +        ld1             {v1.8b}, [x2], x3               // Load pix2 for first iteration
> +        ld1             {v2.8b}, [x1], x3               // Load pix1 for second iteration
> +        ld1             {v3.8b}, [x2], x3               // Load pix2 for second iteration
> +        uabdl           v30.8h, v0.8b, v1.8b            // Absolute difference, first iteration
> +        ld1             {v4.8b}, [x1], x3               // Load pix1 for third iteration
> +        ld1             {v5.8b}, [x2], x3               // Load pix2 for third iteration
> +        uabdl           v29.8h, v2.8b, v3.8b            // Absolute difference, second iteration
> +        umlal           v21.4s, v30.4h, v30.4h          // Multiply lower half, first iteration
> +        ld1             {v6.8b}, [x1], x3               // Load pix1 for fourth iteration
> +        ld1             {v7.8b}, [x2], x3               // Load pix2 for fourth iteration
> +        uabdl           v28.8h, v4.8b, v5.8b            // Absolute difference, third iteration
> +        umlal           v21.4s, v29.4h, v29.4h          // Multiply lower half, second iteration
> +        umlal2          v20.4s, v30.8h, v30.8h          // Multiply upper half, second iteration

The comment was wrong here, this is about the first iteration, not the 
second one.

> +        uabdl           v27.8h, v6.8b, v7.8b            // Absolute difference, fourth iteration
> +        umlal           v21.4s, v28.4h, v28.4h          // Multiply lower half, third iteration
> +        umlal2          v20.4s, v29.8h, v29.8h          // Multiply upper half, second iteration
> +        sub             w4, w4, #4                      // h -= 4
> +        umlal2          v20.4s, v28.8h, v28.8h          // Multiply upper half, third iteration
> +        umlal           v21.4s, v27.4h, v27.4h          // Multiply lower half, fourth iteration
> +        cmp             w4, #4
> +        umlal2          v20.4s, v27.8h, v27.8h          // Multiply upper half, fourth iteration
> +        b.ge            1b
> +
> +        cbz             w4, 3f
> +
> +// iterate by one
> +2:
> +        ld1             {v0.8b}, [x1], x3               // Load pix1
> +        ld1             {v1.8b}, [x2], x3               // Load pix2
> +        subs            w4, w4, #1
> +        uabdl           v30.8h, v0.8b, v1.8b
> +        umlal           v21.4s, v30.4h, v30.4h
> +        umlal2          v20.4s, v30.8h, v30.8h
> +
> +        b.ne            2b
> +
> +3:
> +        add             v21.4s, v21.4s, v20.4s          // Add accumulator vectors together
> +        uaddlv          d17, v21.4s                     // Add up vector
> +        add             d18, d18, d17
> +

Unnecessary d18.
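
For orientation, what the quoted routine accumulates can be modelled in
scalar C. This is a minimal sketch based on the patch comments
("res = abs(pix1[0] - pix2[0])", then "res * res"); the name sse8_ref is
hypothetical and it is not FFmpeg's own C implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of squared differences over an 8 x h block. */
static int sse8_ref(const uint8_t *pix1, const uint8_t *pix2,
                    ptrdiff_t stride, int h)
{
    uint32_t sum = 0;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 8; x++) {
            int d = abs(pix1[x] - pix2[x]);     /* like uabdl         */
            sum += (uint32_t)(d * d);           /* like umlal/umlal2  */
        }
        pix1 += stride;
        pix2 += stride;
    }
    return (int)sum;
}
```

The assembly keeps two 4-lane accumulators (v21 for the lower half, v20
for the upper half) and adds them at the end; the scalar loop folds both
into a single sum.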

// Martin

* Re: [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Add neon implementation for pix_abs8
  2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Add neon implementation for pix_abs8 Hubert Mazur
@ 2022-08-18  9:22   ` Martin Storsjö
  0 siblings, 0 replies; 13+ messages in thread
From: Martin Storsjö @ 2022-08-18  9:22 UTC (permalink / raw)
  To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop

On Tue, 16 Aug 2022, Hubert Mazur wrote:

> Provide optimized implementation of pix_abs8 function for arm64.
>
> Performance comparison tests are shown below.
> - pix_abs_1_0_c: 101.2
> - pix_abs_1_0_neon: 22.5
> - sad_1_c: 101.2
> - sad_1_neon: 22.5
>
> Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c |  4 ++
> libavcodec/aarch64/me_cmp_neon.S         | 49 ++++++++++++++++++++++++
> 2 files changed, 53 insertions(+)
>
> diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
> index 2f51f0497e..e7dbd4cbc5 100644
> --- a/libavcodec/aarch64/me_cmp_init_aarch64.c
> +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
> @@ -31,6 +31,8 @@ int ff_pix_abs16_x2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *
>                       ptrdiff_t stride, int h);
> int ff_pix_abs16_y2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
>                       ptrdiff_t stride, int h);
> +int ff_pix_abs8_neon(MpegEncContext *s, const uint8_t *blk1, const uint8_t *blk2,
> +                      ptrdiff_t stride, int h);

Alignment

> diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
> index 3f4266d4d5..8c396cad21 100644
> --- a/libavcodec/aarch64/me_cmp_neon.S
> +++ b/libavcodec/aarch64/me_cmp_neon.S
> @@ -72,6 +72,55 @@ function ff_pix_abs16_neon, export=1
>         ret
> endfunc
> 
> +function ff_pix_abs8_neon, export=1
> +        // x0           unused
> +        // x1           uint8_t *pix1
> +        // x2           uint8_t *pix2
> +        // x3           ptrdiff_t stride
> +        // x4           int h

w4, not x4

> +
> +        movi            d18, #0

Unused d18

> +        movi            v30.8h, #0
> +        cmp             w4, #4
> +        b.lt            2f
> +
> +// make 4 iterations at once
> +1:
> +        ld1             {v0.8b}, [x1], x3               // Load pix1 for first iteration
> +        ld1             {v1.8b}, [x2], x3               // Load pix2 for first iteration
> +        ld1             {v2.8b}, [x1], x3               // Load pix1 for second iteration
> +        uabal           v30.8h, v0.8b, v1.8b            // Absolute difference, first iteration
> +        ld1             {v3.8b}, [x2], x3               // Load pix2 for second iteration
> +        ld1             {v4.8b}, [x1], x3               // Load pix1 for third iteration
> +        uabal           v30.8h, v2.8b, v3.8b            // Absolute difference, second iteration
> +        ld1             {v5.8b}, [x2], x3               // Load pix2 for third iteration
> +        sub             w4, w4, #4                      // h -= 4
> +        uabal           v30.8h, v4.8b, v5.8b            // Absolute difference, third iteration
> +        ld1             {v6.8b}, [x1], x3               // Load pix1 for fourth iteration
> +        ld1             {v7.8b}, [x2], x3               // Load pix2 for fourth iteration
> +        cmp             w4, #4
> +        uabal           v30.8h, v6.8b, v7.8b            // Absolute difference, fourth iteration

The interleaving here looks mostly quite good, but the last uabal comes
almost directly after the two loads it depends on; I moved the second-last
uabal from before those two ld1s to between the last ld1 and the cmp, and
got a rather notable speedup.

                   Cortex A53     A72     A73
Before:
pix_abs_1_0_neon:        65.7    33.7    21.5
After:
pix_abs_1_0_neon:        57.7    33.5    21.5

So this is a 13% speedup on Cortex A53, just by moving one single 
instruction. This is why paying attention to scheduling matters, sometimes 
a lot.
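
The percentage can be checked from the A53 numbers above. A small C sketch
(the helper names are hypothetical) computes it under two common
conventions: with 65.7 before and 57.7 after, the runtime drops by about
12% and the throughput rises by about 14%, which brackets the figure
quoted.

```c
/* Percentage of runtime saved: (before - after) / before. */
static double runtime_reduction_pct(double before, double after)
{
    return (before - after) / before * 100.0;
}

/* Percentage of extra work done per unit time: before / after - 1. */
static double throughput_gain_pct(double before, double after)
{
    return (before / after - 1.0) * 100.0;
}
```

On A72 and A73 the same helpers give values close to zero for this
particular change, matching the nearly unchanged numbers in the table.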

> +        uaddlv          s20, v30.8h                     // Add up vector
> +        add             d18, d18, d20
> +        fmov            w0, s18

And finally, by removing the unnecessary add of d18 here, I got this 
further reduced to the following runtimes:

               Cortex A53    A72    A73
pix_abs_1_0_neon:   54.7   30.7   20.2

// Martin

* Re: [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions
  2022-08-18  9:07 ` [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions Martin Storsjö
@ 2022-08-18  9:24   ` Hubert Mazur
  0 siblings, 0 replies; 13+ messages in thread
From: Hubert Mazur @ 2022-08-18  9:24 UTC (permalink / raw)
  To: Martin Storsjö
  Cc: Grzegorz Bernacki, upstream, Swinney, Jonathan, ffmpeg-devel,
	Marcin Wojtas, Pop, Sebastian

Thanks for the review and pointing out the issues. I will check out the
other patches for such things and fix them if needed.

Regards

On Thu, Aug 18, 2022 at 11:08 AM Martin Storsjö <martin@martin.st> wrote:

> On Tue, 16 Aug 2022, Hubert Mazur wrote:
>
> > Add arm64 neon implementation for functions from motion estimation
> > family. All of them were tested and benchmarked using checkasm tool.
> > The rare code paths, e.g. when filter_size % 4 != 0 were also tested.
>
>
> > Instructions were manually deinterleaved to reach best performance.
>
> You probably mean "interleaved", as deinterleaved would be how it was
> initially, which is detrimental for performance.
>
> Overall I think this patchset is close enough now. There were a bunch of
> minor details left on the patches, but I'll fix that up locally and push
> them, instead of doing yet another round of these. I'll comment and point
> out the details I changed - please pay attention to them for future
> patches though!
>
> // Martin
>
>

Thread overview: 13+ messages
2022-08-16 12:20 [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions Hubert Mazur
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for sse16 Hubert Mazur
2022-08-18  9:09   ` Martin Storsjö
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation for sse4 Hubert Mazur
2022-08-18  9:10   ` Martin Storsjö
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for pix_abs16_y2 Hubert Mazur
2022-08-18  9:16   ` Martin Storsjö
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for sse8 Hubert Mazur
2022-08-18  9:18   ` Martin Storsjö
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Add neon implementation for pix_abs8 Hubert Mazur
2022-08-18  9:22   ` Martin Storsjö
2022-08-18  9:07 ` [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions Martin Storsjö
2022-08-18  9:24   ` Hubert Mazur
