* [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions
@ 2022-08-16 12:20 Hubert Mazur
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for sse16 Hubert Mazur
` (5 more replies)
0 siblings, 6 replies; 14+ messages in thread
From: Hubert Mazur @ 2022-08-16 12:20 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop
Add arm64 neon implementation for functions from the motion estimation
family. All of them were tested and benchmarked using the checkasm tool.
The rare code paths, e.g. when filter_size % 4 != 0, were also tested.
Instructions were manually deinterleaved to reach best performance.
Hubert Mazur (5):
lavc/aarch64: Add neon implementation for sse16
lavc/aarch64: Add neon implementation for sse4
lavc/aarch64: Add neon implementation for pix_abs16_y2
lavc/aarch64: Add neon implementation for sse8
lavc/aarch64: Add neon implementation for pix_abs8
libavcodec/aarch64/me_cmp_init_aarch64.c | 18 ++
libavcodec/aarch64/me_cmp_neon.S | 324 +++++++++++++++++++++++
2 files changed, 342 insertions(+)
--
2.34.1
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
* [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for sse16
2022-08-16 12:20 [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions Hubert Mazur
@ 2022-08-16 12:20 ` Hubert Mazur
2022-08-18 9:09 ` Martin Storsjö
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation for sse4 Hubert Mazur
` (4 subsequent siblings)
5 siblings, 1 reply; 14+ messages in thread
From: Hubert Mazur @ 2022-08-16 12:20 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop
Provide neon implementation for sse16 function.
Performance comparison tests are shown below.
- sse_0_c: 268.2
- sse_0_neon: 43.5
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
libavcodec/aarch64/me_cmp_init_aarch64.c | 4 ++
libavcodec/aarch64/me_cmp_neon.S | 76 ++++++++++++++++++++++++
2 files changed, 80 insertions(+)
diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index 79c739914f..7780009d41 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -30,6 +30,9 @@ int ff_pix_abs16_xy2_neon(MpegEncContext *s, const uint8_t *blk1, const uint8_t
int ff_pix_abs16_x2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
ptrdiff_t stride, int h);
+int sse16_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
+ ptrdiff_t stride, int h);
+
av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
{
int cpu_flags = av_get_cpu_flags();
@@ -40,5 +43,6 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
c->pix_abs[0][3] = ff_pix_abs16_xy2_neon;
c->sad[0] = ff_pix_abs16_neon;
+ c->sse[0] = sse16_neon;
}
}
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index cda7ce0408..825ce45d13 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -270,3 +270,79 @@ function ff_pix_abs16_x2_neon, export=1
ret
endfunc
+
+function sse16_neon, export=1
+ // x0 - unused
+ // x1 - pix1
+ // x2 - pix2
+ // x3 - stride
+ // w4 - h
+
+ cmp w4, #4
+ movi d18, #0
+ movi v17.4s, #0
+ b.lt 2f
+
+// Make 4 iterations at once
+1:
+
+ // res = abs(pix1[0] - pix2[0])
+ // res * res
+
+ ld1 {v0.16b}, [x1], x3 // Load pix1 vector for first iteration
+ ld1 {v1.16b}, [x2], x3 // Load pix2 vector for first iteration
+ ld1 {v2.16b}, [x1], x3 // Load pix1 vector for second iteration
+ uabd v30.16b, v0.16b, v1.16b // Absolute difference, first iteration
+ ld1 {v3.16b}, [x2], x3 // Load pix2 vector for second iteration
+ umull v29.8h, v30.8b, v30.8b // Multiply lower half of vectors, first iteration
+ umull2 v28.8h, v30.16b, v30.16b // Multiply upper half of vectors, first iteration
+ uabd v27.16b, v2.16b, v3.16b // Absolute difference, second iteration
+ uadalp v17.4s, v29.8h // Pairwise add and accumulate, first iteration
+ ld1 {v4.16b}, [x1], x3 // Load pix1 for third iteration
+ umull v26.8h, v27.8b, v27.8b // Multiply lower half, second iteration
+ umull2 v25.8h, v27.16b, v27.16b // Multiply upper half, second iteration
+ ld1 {v5.16b}, [x2], x3 // Load pix2 for third iteration
+ uadalp v17.4s, v26.8h // Pairwise add and accumulate, second iteration
+ uabd v24.16b, v4.16b, v5.16b // Absolute difference, third iteration
+ ld1 {v6.16b}, [x1], x3 // Load pix1 for fourth iteration
+ uadalp v17.4s, v25.8h // Pairwise add and accumulate, second iteration
+ umull v23.8h, v24.8b, v24.8b // Multiply lower half, third iteration
+ umull2 v22.8h, v24.16b, v24.16b // Multiply upper half, third iteration
+ uadalp v17.4s, v23.8h // Pairwise add and accumulate, third iteration
+ ld1 {v7.16b}, [x2], x3 // Load pix2 for fourth iteration
+ uadalp v17.4s, v22.8h // Pairwise add and accumulate, third iteration
+ uabd v21.16b, v6.16b, v7.16b // Absolute difference, fourth iteration
+ uadalp v17.4s, v28.8h // Pairwise add and accumulate, first iteration
+ umull v20.8h, v21.8b, v21.8b // Multiply lower half, fourth iteration
+ sub w4, w4, #4 // h -= 4
+ umull2 v19.8h, v21.16b, v21.16b // Multiply upper half, fourth iteration
+ uadalp v17.4s, v20.8h // Pairwise add and accumulate, fourth iteration
+ cmp w4, #4
+ uadalp v17.4s, v19.8h // Pairwise add and accumulate, fourth iteration
+ b.ge 1b
+
+ cbz w4, 3f
+
+// iterate by one
+2:
+
+ ld1 {v0.16b}, [x1], x3 // Load pix1
+ ld1 {v1.16b}, [x2], x3 // Load pix2
+
+ uabd v30.16b, v0.16b, v1.16b
+ umull v29.8h, v30.8b, v30.8b
+ umull2 v28.8h, v30.16b, v30.16b
+ uadalp v17.4s, v29.8h
+ subs w4, w4, #1
+ uadalp v17.4s, v28.8h
+
+ b.ne 2b
+
+3:
+ uaddlv d16, v17.4s // add up accumulator vector
+ add d18, d18, d16
+
+ fmov w0, s18
+
+ ret
+endfunc
--
2.34.1
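For reference, the value that sse16_neon accumulates can be modelled in scalar C. This is a minimal sketch and is not taken from the FFmpeg tree: the name sse_ref and the explicit width parameter w are assumptions made for illustration, and the unused MpegEncContext argument is dropped. The comments map each step to the instruction in the patch that performs it.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a sum of squared errors over a w x h block.
 * sse_ref and the w parameter are hypothetical; w = 16 matches the
 * 16 bytes per row that sse16_neon loads. */
static int sse_ref(const uint8_t *pix1, const uint8_t *pix2,
                   ptrdiff_t stride, int w, int h)
{
    uint32_t sum = 0;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            /* uabd: absolute difference of two bytes, at most 255 */
            uint32_t d = pix1[x] > pix2[x] ? (uint32_t)(pix1[x] - pix2[x])
                                           : (uint32_t)(pix2[x] - pix1[x]);
            /* umull/umull2: d * d is at most 65025, so it fits in a
             * 16-bit lane; uadalp then adds adjacent pairs into
             * 32-bit accumulator lanes */
            sum += d * d;
        }
        pix1 += stride;
        pix2 += stride;
    }
    return (int)sum;
}
```

With w = 16, a worst-case row (all differences equal to 255) contributes 16 * 65025 = 1040400, which is why the accumulator in the patch uses 32-bit lanes.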
* [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation for sse4
2022-08-16 12:20 [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions Hubert Mazur
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for sse16 Hubert Mazur
@ 2022-08-16 12:20 ` Hubert Mazur
2022-08-18 9:10 ` Martin Storsjö
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for pix_abs16_y2 Hubert Mazur
` (3 subsequent siblings)
5 siblings, 1 reply; 14+ messages in thread
From: Hubert Mazur @ 2022-08-16 12:20 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop
Provide neon implementation for sse4 function.
Performance comparison tests are shown below.
- sse_2_c: 80.7
- sse_2_neon: 31.0
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
libavcodec/aarch64/me_cmp_init_aarch64.c | 3 ++
libavcodec/aarch64/me_cmp_neon.S | 58 ++++++++++++++++++++++++
2 files changed, 61 insertions(+)
diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index 7780009d41..955592625a 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -32,6 +32,8 @@ int ff_pix_abs16_x2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *
int sse16_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
ptrdiff_t stride, int h);
+int sse4_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
+ ptrdiff_t stride, int h);
av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
{
@@ -44,5 +46,6 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
c->sad[0] = ff_pix_abs16_neon;
c->sse[0] = sse16_neon;
+ c->sse[2] = sse4_neon;
}
}
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 825ce45d13..367924b3c2 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -346,3 +346,61 @@ function sse16_neon, export=1
ret
endfunc
+
+function sse4_neon, export=1
+ // x0 - unused
+ // x1 - pix1
+ // x2 - pix2
+ // x3 - stride
+ // w4 - h
+
+ movi d18, #0
+ movi v16.4s, #0 // clear the result accumulator
+ cmp w4, #4
+ b.le 2f
+
+// make 4 iterations at once
+1:
+
+ // res = abs(pix1[0] - pix2[0])
+ // res * res
+
+ ld1 {v0.s}[0], [x1], x3 // Load pix1, first iteration
+ ld1 {v1.s}[0], [x2], x3 // Load pix2, first iteration
+ ld1 {v2.s}[0], [x1], x3 // Load pix1, second iteration
+ ld1 {v3.s}[0], [x2], x3 // Load pix2, second iteration
+ uabdl v30.8h, v0.8b, v1.8b // Absolute difference, first iteration
+ ld1 {v4.s}[0], [x1], x3 // Load pix1, third iteration
+ ld1 {v5.s}[0], [x2], x3 // Load pix2, third iteration
+ uabdl v29.8h, v2.8b, v3.8b // Absolute difference, second iteration
+ umlal v16.4s, v30.4h, v30.4h // Multiply vectors, first iteration
+ ld1 {v6.s}[0], [x1], x3 // Load pix1, fourth iteration
+ ld1 {v7.s}[0], [x2], x3 // Load pix2, fourth iteration
+ uabdl v28.8h, v4.8b, v5.8b // Absolute difference, third iteration
+ umlal v16.4s, v29.4h, v29.4h // Multiply and accumulate, second iteration
+ sub w4, w4, #4
+ uabdl v27.8h, v6.8b, v7.8b // Absolute difference, fourth iteration
+ umlal v16.4s, v28.4h, v28.4h // Multiply and accumulate, third iteration
+ cmp w4, #4
+ umlal v16.4s, v27.4h, v27.4h // Multiply and accumulate, fourth iteration
+ b.ge 1b
+
+ cbz w4, 3f
+
+// iterate by one
+2:
+ ld1 {v0.s}[0], [x1], x3 // Load pix1
+ ld1 {v1.s}[0], [x2], x3 // Load pix2
+ uabdl v30.8h, v0.8b, v1.8b
+ subs w4, w4, #1
+ umlal v16.4s, v30.4h, v30.4h
+
+ b.ne 2b
+
+3:
+ uaddlv d17, v16.4s // Add vector
+ add d18, d18, d17
+ fmov w0, s18
+
+ ret
+endfunc
--
2.34.1
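The loop skeleton shared by these functions (a four-row unrolled loop at label 1, a one-row loop at label 2, and the reduction at label 3) can be traced with a small C model. This is only a sketch of the branch structure: LoopTrips and count_trips are hypothetical names, and h > 0 is assumed. It also shows one difference between the patches: sse16 enters with b.lt, while sse4 uses b.le, so a block with h == 4 takes the one-row loop in sse4. The number of rows processed is the same either way.

```c
#include <stdbool.h>

/* Hypothetical counters, for illustration only. */
typedef struct {
    int unrolled; /* passes through label 1 (four rows each) */
    int tail;     /* passes through label 2 (one row each)   */
} LoopTrips;

/* skip_if_equal = false models "cmp w4, #4; b.lt 2f" (sse16).
 * skip_if_equal = true  models "cmp w4, #4; b.le 2f" (sse4).
 * Assumes h > 0, since the tail loop decrements before testing. */
static LoopTrips count_trips(int h, bool skip_if_equal)
{
    LoopTrips t = { 0, 0 };
    bool skip_main = skip_if_equal ? (h <= 4) : (h < 4);

    if (!skip_main) {
        do {                /* label 1 */
            t.unrolled++;
            h -= 4;         /* sub w4, w4, #4 */
        } while (h >= 4);   /* cmp w4, #4; b.ge 1b */
        if (h == 0)         /* cbz w4, 3f */
            return t;
    }
    do {                    /* label 2 */
        t.tail++;
        h--;                /* subs w4, w4, #1 */
    } while (h != 0);       /* b.ne 2b */
    return t;
}
```

For example, count_trips(4, false) reports one unrolled pass and no tail passes, while count_trips(4, true) reports no unrolled pass and four tail passes.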
* [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for pix_abs16_y2
2022-08-16 12:20 [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions Hubert Mazur
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for sse16 Hubert Mazur
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation for sse4 Hubert Mazur
@ 2022-08-16 12:20 ` Hubert Mazur
2022-08-18 9:16 ` Martin Storsjö
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for sse8 Hubert Mazur
` (2 subsequent siblings)
5 siblings, 1 reply; 14+ messages in thread
From: Hubert Mazur @ 2022-08-16 12:20 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop
Provide optimized implementation of pix_abs16_y2 function for arm64.
Performance comparison tests are shown below.
pix_abs_0_2_c: 317.2
pix_abs_0_2_neon: 37.5
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
libavcodec/aarch64/me_cmp_init_aarch64.c | 3 +
libavcodec/aarch64/me_cmp_neon.S | 75 ++++++++++++++++++++++++
2 files changed, 78 insertions(+)
diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index 955592625a..1c36d3d7cb 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -29,6 +29,8 @@ int ff_pix_abs16_xy2_neon(MpegEncContext *s, const uint8_t *blk1, const uint8_t
ptrdiff_t stride, int h);
int ff_pix_abs16_x2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
ptrdiff_t stride, int h);
+int ff_pix_abs16_y2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
+ ptrdiff_t stride, int h);
int sse16_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
ptrdiff_t stride, int h);
@@ -42,6 +44,7 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
if (have_neon(cpu_flags)) {
c->pix_abs[0][0] = ff_pix_abs16_neon;
c->pix_abs[0][1] = ff_pix_abs16_x2_neon;
+ c->pix_abs[0][2] = ff_pix_abs16_y2_neon;
c->pix_abs[0][3] = ff_pix_abs16_xy2_neon;
c->sad[0] = ff_pix_abs16_neon;
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 367924b3c2..0ec9c0465b 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -404,3 +404,78 @@ function sse4_neon, export=1
ret
endfunc
+
+function ff_pix_abs16_y2_neon, export=1
+ // x0 unused
+ // x1 uint8_t *pix1
+ // x2 uint8_t *pix2
+ // x3 ptrdiff_t stride
+ // x4 int h
+
+ // initialize buffers
+ movi v29.8h, #0 // clear the accumulator
+ movi v28.8h, #0 // clear the accumulator
+ movi d18, #0
+ add x5, x2, x3 // pix2 + stride
+ cmp w4, #4
+ b.lt 2f
+
+// make 4 iterations at once
+1:
+
+ // abs(pix1[0], avg2(pix2[0], pix2[0 + stride]))
+ // avg2(a, b) = (((a) + (b) + 1) >> 1)
+ // abs(x) = (x < 0 ? (-x) : (x))
+
+ ld1 {v1.16b}, [x2], x3 // Load pix2 for first iteration
+ ld1 {v2.16b}, [x5], x3 // Load pix3 for first iteration
+ urhadd v30.16b, v1.16b, v2.16b // Rounding halving add, first iteration
+ ld1 {v0.16b}, [x1], x3 // Load pix1 for first iteration
+ uabal v29.8h, v0.8b, v30.8b // Absolute difference of lower half, first iteration
+ ld1 {v4.16b}, [x2], x3 // Load pix2 for second iteration
+ uabal2 v28.8h, v0.16b, v30.16b // Absolute difference of upper half, first iteration
+ ld1 {v5.16b}, [x5], x3 // Load pix3 for second iteration
+ urhadd v27.16b, v4.16b, v5.16b // Rounding halving add, second iteration
+ ld1 {v3.16b}, [x1], x3 // Load pix1 for second iteration
+ uabal v29.8h, v3.8b, v27.8b // Absolute difference of lower half for second iteration
+ ld1 {v7.16b}, [x2], x3 // Load pix2 for third iteration
+ ld1 {v20.16b}, [x5], x3 // Load pix3 for third iteration
+ uabal2 v28.8h, v3.16b, v27.16b // Absolute difference of upper half for second iteration
+ ld1 {v6.16b}, [x1], x3 // Load pix1 for third iteration
+ urhadd v26.16b, v7.16b, v20.16b // Rounding halving add, third iteration
+ uabal v29.8h, v6.8b, v26.8b // Absolute difference of lower half for third iteration
+ ld1 {v22.16b}, [x2], x3 // Load pix2 for fourth iteration
+ uabal2 v28.8h, v6.16b, v26.16b // Absolute difference of upper half for third iteration
+ ld1 {v23.16b}, [x5], x3 // Load pix3 for fourth iteration
+ sub w4, w4, #4 // h-= 4
+ urhadd v25.16b, v22.16b, v23.16b // Rounding halving add, fourth iteration
+ ld1 {v21.16b}, [x1], x3 // Load pix1 for fourth iteration
+ cmp w4, #4
+ uabal v29.8h, v21.8b, v25.8b // Absolute difference of lower half for fourth iteration
+ uabal2 v28.8h, v21.16b, v25.16b // Absolute difference of upper half for fourth iteration
+
+ b.ge 1b
+ cbz w4, 3f
+
+// iterate by one
+2:
+
+ ld1 {v1.16b}, [x2], x3 // Load pix2
+ ld1 {v2.16b}, [x5], x3 // Load pix3
+ subs w4, w4, #1
+ urhadd v30.16b, v1.16b, v2.16b // Rounding halving add
+ ld1 {v0.16b}, [x1], x3 // Load pix1
+ uabal v29.8h, v30.8b, v0.8b
+ uabal2 v28.8h, v30.16b, v0.16b
+
+ b.ne 2b
+
+3:
+ add v29.8h, v29.8h, v28.8h // Add vectors together
+ uaddlv s16, v29.8h // Add up vector values
+ add d18, d18, d16
+
+ fmov w0, s18
+
+ ret
+endfunc
--
2.34.1
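The formulas in the comments of this patch (avg2 and abs) translate directly into a scalar model. The following is a sketch under stated assumptions, not the FFmpeg C implementation: pix_abs16_y2_ref is a hypothetical name, the unused context argument is dropped, and the pix3 pointer mirrors the x5 register set up by "add x5, x2, x3". Note that the function reads h + 1 rows of pix2.

```c
#include <stddef.h>
#include <stdint.h>

/* avg2(a, b) = (a + b + 1) >> 1: the rounding average that urhadd
 * computes per byte. */
static inline int avg2(int a, int b)
{
    return (a + b + 1) >> 1;
}

/* Scalar model of a 16-wide SAD against the vertical half-pel
 * average of pix2. Hypothetical helper, for illustration only. */
static int pix_abs16_y2_ref(const uint8_t *pix1, const uint8_t *pix2,
                            ptrdiff_t stride, int h)
{
    const uint8_t *pix3 = pix2 + stride; /* add x5, x2, x3 */
    int sum = 0;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++) {
            int d = pix1[x] - avg2(pix2[x], pix3[x]);
            sum += d < 0 ? -d : d;       /* uabal / uabal2 */
        }
        pix1 += stride;
        pix2 += stride;
        pix3 += stride;
    }
    return sum;
}
```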
* [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for sse8
2022-08-16 12:20 [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions Hubert Mazur
` (2 preceding siblings ...)
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for pix_abs16_y2 Hubert Mazur
@ 2022-08-16 12:20 ` Hubert Mazur
2022-08-18 9:18 ` Martin Storsjö
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Add neon implementation for pix_abs8 Hubert Mazur
2022-08-18 9:07 ` [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions Martin Storsjö
5 siblings, 1 reply; 14+ messages in thread
From: Hubert Mazur @ 2022-08-16 12:20 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop
Provide optimized implementation of sse8 function for arm64.
Performance comparison tests are shown below.
- sse_1_c: 130.7
- sse_1_neon: 29.7
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
libavcodec/aarch64/me_cmp_init_aarch64.c | 4 ++
libavcodec/aarch64/me_cmp_neon.S | 66 ++++++++++++++++++++++++
2 files changed, 70 insertions(+)
diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index 1c36d3d7cb..2f51f0497e 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -34,9 +34,12 @@ int ff_pix_abs16_y2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *
int sse16_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
ptrdiff_t stride, int h);
+int sse8_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
+ ptrdiff_t stride, int h);
int sse4_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
ptrdiff_t stride, int h);
+
av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
{
int cpu_flags = av_get_cpu_flags();
@@ -49,6 +52,7 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
c->sad[0] = ff_pix_abs16_neon;
c->sse[0] = sse16_neon;
+ c->sse[1] = sse8_neon;
c->sse[2] = sse4_neon;
}
}
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 0ec9c0465b..3f4266d4d5 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -347,6 +347,72 @@ function sse16_neon, export=1
ret
endfunc
+function sse8_neon, export=1
+ // x0 - unused
+ // x1 - pix1
+ // x2 - pix2
+ // x3 - stride
+ // w4 - h
+
+ movi d18, #0
+ movi v21.4s, #0
+ movi v20.4s, #0
+ cmp w4, #4
+ b.le 2f
+
+// make 4 iterations at once
+1:
+
+ // res = abs(pix1[0] - pix2[0])
+ // res * res
+
+ ld1 {v0.8b}, [x1], x3 // Load pix1 for first iteration
+ ld1 {v1.8b}, [x2], x3 // Load pix2 for first iteration
+ ld1 {v2.8b}, [x1], x3 // Load pix1 for second iteration
+ ld1 {v3.8b}, [x2], x3 // Load pix2 for second iteration
+ uabdl v30.8h, v0.8b, v1.8b // Absolute difference, first iteration
+ ld1 {v4.8b}, [x1], x3 // Load pix1 for third iteration
+ ld1 {v5.8b}, [x2], x3 // Load pix2 for third iteration
+ uabdl v29.8h, v2.8b, v3.8b // Absolute difference, second iteration
+ umlal v21.4s, v30.4h, v30.4h // Multiply lower half, first iteration
+ ld1 {v6.8b}, [x1], x3 // Load pix1 for fourth iteration
+ ld1 {v7.8b}, [x2], x3 // Load pix2 for fourth iteration
+ uabdl v28.8h, v4.8b, v5.8b // Absolute difference, third iteration
+ umlal v21.4s, v29.4h, v29.4h // Multiply lower half, second iteration
+ umlal2 v20.4s, v30.8h, v30.8h // Multiply upper half, first iteration
+ uabdl v27.8h, v6.8b, v7.8b // Absolute difference, fourth iteration
+ umlal v21.4s, v28.4h, v28.4h // Multiply lower half, third iteration
+ umlal2 v20.4s, v29.8h, v29.8h // Multiply upper half, second iteration
+ sub w4, w4, #4 // h -= 4
+ umlal2 v20.4s, v28.8h, v28.8h // Multiply upper half, third iteration
+ umlal v21.4s, v27.4h, v27.4h // Multiply lower half, fourth iteration
+ cmp w4, #4
+ umlal2 v20.4s, v27.8h, v27.8h // Multiply upper half, fourth iteration
+ b.ge 1b
+
+ cbz w4, 3f
+
+// iterate by one
+2:
+ ld1 {v0.8b}, [x1], x3 // Load pix1
+ ld1 {v1.8b}, [x2], x3 // Load pix2
+ subs w4, w4, #1
+ uabdl v30.8h, v0.8b, v1.8b
+ umlal v21.4s, v30.4h, v30.4h
+ umlal2 v20.4s, v30.8h, v30.8h
+
+ b.ne 2b
+
+3:
+ add v21.4s, v21.4s, v20.4s // Add accumulator vectors together
+ uaddlv d17, v21.4s // Add up vector
+ add d18, d18, d17
+
+ fmov w0, s18
+ ret
+
+endfunc
+
function sse4_neon, export=1
// x0 - unused
// x1 - pix1
--
2.34.1
* [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Add neon implementation for pix_abs8
2022-08-16 12:20 [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions Hubert Mazur
` (3 preceding siblings ...)
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for sse8 Hubert Mazur
@ 2022-08-16 12:20 ` Hubert Mazur
2022-08-18 9:22 ` Martin Storsjö
2022-08-18 9:07 ` [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions Martin Storsjö
5 siblings, 1 reply; 14+ messages in thread
From: Hubert Mazur @ 2022-08-16 12:20 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop
Provide optimized implementation of pix_abs8 function for arm64.
Performance comparison tests are shown below.
- pix_abs_1_0_c: 101.2
- pix_abs_1_0_neon: 22.5
- sad_1_c: 101.2
- sad_1_neon: 22.5
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
---
libavcodec/aarch64/me_cmp_init_aarch64.c | 4 ++
libavcodec/aarch64/me_cmp_neon.S | 49 ++++++++++++++++++++++++
2 files changed, 53 insertions(+)
diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index 2f51f0497e..e7dbd4cbc5 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -31,6 +31,8 @@ int ff_pix_abs16_x2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *
ptrdiff_t stride, int h);
int ff_pix_abs16_y2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
ptrdiff_t stride, int h);
+int ff_pix_abs8_neon(MpegEncContext *s, const uint8_t *blk1, const uint8_t *blk2,
+ ptrdiff_t stride, int h);
int sse16_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
ptrdiff_t stride, int h);
@@ -49,8 +51,10 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
c->pix_abs[0][1] = ff_pix_abs16_x2_neon;
c->pix_abs[0][2] = ff_pix_abs16_y2_neon;
c->pix_abs[0][3] = ff_pix_abs16_xy2_neon;
+ c->pix_abs[1][0] = ff_pix_abs8_neon;
c->sad[0] = ff_pix_abs16_neon;
+ c->sad[1] = ff_pix_abs8_neon;
c->sse[0] = sse16_neon;
c->sse[1] = sse8_neon;
c->sse[2] = sse4_neon;
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 3f4266d4d5..8c396cad21 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -72,6 +72,55 @@ function ff_pix_abs16_neon, export=1
ret
endfunc
+function ff_pix_abs8_neon, export=1
+ // x0 unused
+ // x1 uint8_t *pix1
+ // x2 uint8_t *pix2
+ // x3 ptrdiff_t stride
+ // x4 int h
+
+ movi d18, #0
+ movi v30.8h, #0
+ cmp w4, #4
+ b.lt 2f
+
+// make 4 iterations at once
+1:
+ ld1 {v0.8b}, [x1], x3 // Load pix1 for first iteration
+ ld1 {v1.8b}, [x2], x3 // Load pix2 for first iteration
+ ld1 {v2.8b}, [x1], x3 // Load pix1 for second iteration
+ uabal v30.8h, v0.8b, v1.8b // Absolute difference, first iteration
+ ld1 {v3.8b}, [x2], x3 // Load pix2 for second iteration
+ ld1 {v4.8b}, [x1], x3 // Load pix1 for third iteration
+ uabal v30.8h, v2.8b, v3.8b // Absolute difference, second iteration
+ ld1 {v5.8b}, [x2], x3 // Load pix2 for third iteration
+ sub w4, w4, #4 // h -= 4
+ uabal v30.8h, v4.8b, v5.8b // Absolute difference, third iteration
+ ld1 {v6.8b}, [x1], x3 // Load pix1 for fourth iteration
+ ld1 {v7.8b}, [x2], x3 // Load pix2 for fourth iteration
+ cmp w4, #4
+ uabal v30.8h, v6.8b, v7.8b // Absolute difference, fourth iteration
+ b.ge 1b
+
+ cbz w4, 3f
+
+// iterate by one
+2:
+ ld1 {v0.8b}, [x1], x3 // Load pix1
+ ld1 {v1.8b}, [x2], x3 // Load pix2
+
+ subs w4, w4, #1
+ uabal v30.8h, v0.8b, v1.8b
+ b.ne 2b
+
+3:
+ uaddlv s20, v30.8h // Add up vector
+ add d18, d18, d20
+ fmov w0, s18
+
+ ret
+endfunc
+
function ff_pix_abs16_xy2_neon, export=1
// x0 unused
// x1 uint8_t *pix1
--
2.34.1
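ff_pix_abs8_neon keeps its running sums in eight 16-bit lanes (v30.8h) and only widens at the end with uaddlv. A lane-level C model makes the headroom of that choice explicit. This is a sketch: pix_abs8_lanes is a hypothetical name, and the wrap-around of the uint16_t lanes is modelled deliberately to show where the limit lies.

```c
#include <stddef.h>
#include <stdint.h>

/* Lane-level model of the accumulation in ff_pix_abs8_neon.
 * Hypothetical helper, for illustration only. */
static int pix_abs8_lanes(const uint8_t *pix1, const uint8_t *pix2,
                          ptrdiff_t stride, int h)
{
    uint16_t lane[8] = { 0 };   /* movi v30.8h, #0 */
    uint32_t sum = 0;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 8; x++) {
            int d = pix1[x] - pix2[x];
            /* uabal: add |d| (at most 255) to a 16-bit lane,
             * wrapping modulo 65536 */
            lane[x] = (uint16_t)(lane[x] + (d < 0 ? -d : d));
        }
        pix1 += stride;
        pix2 += stride;
    }
    for (int x = 0; x < 8; x++) /* uaddlv s20, v30.8h */
        sum += lane[x];
    return (int)sum;
}
```

Each lane gains at most 255 per row, so it can absorb 257 rows (255 * 257 = 65535) before wrapping. Heights such as 8 or 16 are far below that limit, which is why a single 16-bit accumulator is enough here.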
* Re: [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions
2022-08-16 12:20 [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions Hubert Mazur
` (4 preceding siblings ...)
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Add neon implementation for pix_abs8 Hubert Mazur
@ 2022-08-18 9:07 ` Martin Storsjö
2022-08-18 9:24 ` Hubert Mazur
5 siblings, 1 reply; 14+ messages in thread
From: Martin Storsjö @ 2022-08-18 9:07 UTC (permalink / raw)
To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop
On Tue, 16 Aug 2022, Hubert Mazur wrote:
> Add arm64 neon implementation for functions from the motion estimation
> family. All of them were tested and benchmarked using the checkasm tool.
> The rare code paths, e.g. when filter_size % 4 != 0, were also tested.
> Instructions were manually deinterleaved to reach best performance.
You probably mean "interleaved", as deinterleaved would be how it was
initially, which is detrimental for performance.
Overall I think this patchset is close enough now. There were a bunch of
minor details left on the patches, but I'll fix that up locally and push
them, instead of doing yet another round of these. I'll comment and point
out the details I changed - please pay attention to them for future
patches though!
// Martin
* Re: [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for sse16
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for sse16 Hubert Mazur
@ 2022-08-18 9:09 ` Martin Storsjö
0 siblings, 0 replies; 14+ messages in thread
From: Martin Storsjö @ 2022-08-18 9:09 UTC (permalink / raw)
To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop
On Tue, 16 Aug 2022, Hubert Mazur wrote:
> Provide neon implementation for sse16 function.
>
> Performance comparison tests are shown below.
> - sse_0_c: 268.2
> - sse_0_neon: 43.5
>
> Benchmarks and tests run with checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c | 4 ++
> libavcodec/aarch64/me_cmp_neon.S | 76 ++++++++++++++++++++++++
> 2 files changed, 80 insertions(+)
>
> diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
> index 79c739914f..7780009d41 100644
> --- a/libavcodec/aarch64/me_cmp_init_aarch64.c
> +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
> @@ -30,6 +30,9 @@ int ff_pix_abs16_xy2_neon(MpegEncContext *s, const uint8_t *blk1, const uint8_t
> int ff_pix_abs16_x2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
> ptrdiff_t stride, int h);
>
> +int sse16_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
> + ptrdiff_t stride, int h);
> +
The second line of the function declaration is incorrectly indented (it
should be aligned with the opening parenthesis). I fixed this for the
preexisting cases and the new patches that I pushed.
> diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
> index cda7ce0408..825ce45d13 100644
> --- a/libavcodec/aarch64/me_cmp_neon.S
> +++ b/libavcodec/aarch64/me_cmp_neon.S
> @@ -270,3 +270,79 @@ function ff_pix_abs16_x2_neon, export=1
>
> ret
> endfunc
> +
> +function sse16_neon, export=1
> + // x0 - unused
> + // x1 - pix1
> + // x2 - pix2
> + // x3 - stride
> + // w4 - h
> +
> + cmp w4, #4
> + movi d18, #0
The d18 register was essentially unused
> +3:
> + uaddlv d16, v17.4s // add up accumulator vector
> + add d18, d18, d16
> +
> + fmov w0, s18
Here, the d18 register could be left out entirely.
// Martin
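The point about d18 can be checked with a small C model of the function tail. This is a sketch only: the helper names are hypothetical, and returning the low half of d16 directly (fmov w0, s16) is an assumption about what the simplified tail looks like, since the pushed version is not shown in this thread.

```c
#include <stdint.h>

/* uaddlv d16, v17.4s: add four 32-bit lanes into one 64-bit value. */
static uint64_t uaddlv_4s(const uint32_t lane[4])
{
    uint64_t sum = 0;
    for (int i = 0; i < 4; i++)
        sum += lane[i];
    return sum;
}

/* Tail as posted: d18 is zeroed once, then only receives d16. */
static int tail_as_posted(const uint32_t acc[4])
{
    uint64_t d18 = 0;               /* movi d18, #0 */
    uint64_t d16 = uaddlv_4s(acc);  /* uaddlv d16, v17.4s */
    d18 += d16;                     /* add d18, d18, d16 */
    return (int)(uint32_t)d18;      /* fmov w0, s18 (low 32 bits) */
}

/* Assumed simplified tail: read the low 32 bits of d16 directly. */
static int tail_without_d18(const uint32_t acc[4])
{
    return (int)(uint32_t)uaddlv_4s(acc); /* fmov w0, s16 */
}
```

Because d18 is zero when the add executes, both tails return the same value for every accumulator state; the movi and the add only cost instructions.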
* Re: [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation for sse4
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation for sse4 Hubert Mazur
@ 2022-08-18 9:10 ` Martin Storsjö
0 siblings, 0 replies; 14+ messages in thread
From: Martin Storsjö @ 2022-08-18 9:10 UTC (permalink / raw)
To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop
On Tue, 16 Aug 2022, Hubert Mazur wrote:
> Provide neon implementation for sse4 function.
>
> Performance comparison tests are shown below.
> - sse_2_c: 80.7
> - sse_2_neon: 31.0
>
> Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c | 3 ++
> libavcodec/aarch64/me_cmp_neon.S | 58 ++++++++++++++++++++++++
> 2 files changed, 61 insertions(+)
This patch had the same issue about unused d18 register and unnecessary
add instruction, and the misaligned function declaration.
// Martin
* Re: [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for pix_abs16_y2
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for pix_abs16_y2 Hubert Mazur
@ 2022-08-18 9:16 ` Martin Storsjö
0 siblings, 0 replies; 14+ messages in thread
From: Martin Storsjö @ 2022-08-18 9:16 UTC (permalink / raw)
To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop
On Tue, 16 Aug 2022, Hubert Mazur wrote:
> Provide optimized implementation of pix_abs16_y2 function for arm64.
>
> Performance comparison tests are shown below.
> pix_abs_0_2_c: 317.2
> pix_abs_0_2_neon: 37.5
>
> Benchmarks and tests run with checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c | 3 +
> libavcodec/aarch64/me_cmp_neon.S | 75 ++++++++++++++++++++++++
> 2 files changed, 78 insertions(+)
>
> diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
> index 955592625a..1c36d3d7cb 100644
> --- a/libavcodec/aarch64/me_cmp_init_aarch64.c
> +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
> @@ -29,6 +29,8 @@ int ff_pix_abs16_xy2_neon(MpegEncContext *s, const uint8_t *blk1, const uint8_t
> ptrdiff_t stride, int h);
> int ff_pix_abs16_x2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
> ptrdiff_t stride, int h);
> +int ff_pix_abs16_y2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
> + ptrdiff_t stride, int h);
Misaligned function declaration.
> diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
> index 367924b3c2..0ec9c0465b 100644
> --- a/libavcodec/aarch64/me_cmp_neon.S
> +++ b/libavcodec/aarch64/me_cmp_neon.S
> @@ -404,3 +404,78 @@ function sse4_neon, export=1
>
> ret
> endfunc
> +
> +function ff_pix_abs16_y2_neon, export=1
Why place this new function at the bottom of the file, instead of
right after the existing pix_abs16 functions where it logically belongs?
In the version I pushed, I moved it further up.
> + // x0 unused
> + // x1 uint8_t *pix1
> + // x2 uint8_t *pix2
> + // x3 ptrdiff_t stride
> + // x4 int h
This should be w4. You had fixed this in a couple of patches, but missed
this one.
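As background for the w4 remark: `int h` is 32 bits wide, and under the AArch64 procedure call standard the upper 32 bits of the register that carries it have no defined value, so the routine should only read the low half (w4). A minimal C model of that, where the function name and the register contents are made up for illustration:

```c
#include <stdint.h>

/* Hypothetical model: x_reg stands for the full 64-bit x4 register.
 * Only its low 32 bits (w4) are meaningful for an `int` argument;
 * the upper half may hold leftovers from earlier code. */
int32_t read_w(uint64_t x_reg)
{
    return (int32_t)(uint32_t)x_reg;   /* what the code sees through w4 */
}
```

With this model, `read_w(0xDEADBEEF00000008)` yields 8, whereas treating the full register as `h` would not.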
> +
> + // initialize buffers
> + movi v29.8h, #0 // clear the accumulator
> + movi v28.8h, #0 // clear the accumulator
> + movi d18, #0
Unused d18 here too
> + add x5, x2, x3 // pix2 + stride
> + cmp w4, #4
> + b.lt 2f
> +
> +// make 4 iterations at once
> +1:
> +
> + // abs(pix1[0], avg2(pix2[0], pix2[0 + stride]))
> + // avg2(a, b) = (((a) + (b) + 1) >> 1)
> + // abs(x) = (x < 0 ? (-x) : (x))
> +
> + ld1 {v1.16b}, [x2], x3 // Load pix2 for first iteration
> + ld1 {v2.16b}, [x5], x3 // Load pix3 for first iteration
> + urhadd v30.16b, v1.16b, v2.16b // Rounding halving add, first iteration
> + ld1 {v0.16b}, [x1], x3 // Load pix1 for first iteration
> + uabal v29.8h, v0.8b, v30.8b // Absolute difference of lower half, first iteration
This whole first sequence is almost entirely blocking, with each
instruction waiting for the result of the previous one - did you forget
to interleave this with the rest of the operations?
Normally I wouldn't bother with minor interleaving details, but here the
impact was rather big. I manually reinterleaved the whole function, and
got this speedup:
Before:           Cortex A53    A72    A73
pix_abs_0_2_neon:      153.0   63.7   52.7
After:
pix_abs_0_2_neon:      141.0   61.7   51.7
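For readers following the assembly, here is a scalar sketch of what the routine computes, built only from the avg2/abs definitions in the quoted patch comments. The function name is illustrative and is not the C reference in FFmpeg; urhadd corresponds to avg2 and uabal to the accumulated absolute difference.

```c
#include <stddef.h>
#include <stdint.h>

/* avg2 as written in the patch comment: rounding average of two pixels. */
static inline int avg2(int a, int b)
{
    return (a + b + 1) >> 1;
}

/* Illustrative scalar version (hypothetical name): compares pix1 against
 * the vertical half-pel average of pix2 and the row below it. */
int pix_abs16_y2_ref(const uint8_t *pix1, const uint8_t *pix2,
                     ptrdiff_t stride, int h)
{
    const uint8_t *pix3 = pix2 + stride;   /* x5 in the asm */
    int sum = 0;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++) {
            int d = pix1[x] - avg2(pix2[x], pix3[x]);
            sum += d < 0 ? -d : d;         /* abs() */
        }
        pix1 += stride;
        pix2 += stride;
        pix3 += stride;
    }
    return sum;
}
```

Note that the routine reads h + 1 rows from pix2, since each output row needs the row below it.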
// Martin
* Re: [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for sse8
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for sse8 Hubert Mazur
@ 2022-08-18 9:18 ` Martin Storsjö
0 siblings, 0 replies; 14+ messages in thread
From: Martin Storsjö @ 2022-08-18 9:18 UTC (permalink / raw)
To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop
On Tue, 16 Aug 2022, Hubert Mazur wrote:
> Provide optimized implementation of sse8 function for arm64.
>
> Performance comparison tests are shown below.
> - sse_1_c: 130.7
> - sse_1_neon: 29.7
>
> Benchmarks and tests run with checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c | 4 ++
> libavcodec/aarch64/me_cmp_neon.S | 66 ++++++++++++++++++++++++
> 2 files changed, 70 insertions(+)
>
> diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
> index 1c36d3d7cb..2f51f0497e 100644
> --- a/libavcodec/aarch64/me_cmp_init_aarch64.c
> +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
> @@ -34,9 +34,12 @@ int ff_pix_abs16_y2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *
>
> int sse16_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
> ptrdiff_t stride, int h);
> +int sse8_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
> + ptrdiff_t stride, int h);
> int sse4_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
> ptrdiff_t stride, int h);
Same as the others about function declaration indentation
> diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
> index 0ec9c0465b..3f4266d4d5 100644
> --- a/libavcodec/aarch64/me_cmp_neon.S
> +++ b/libavcodec/aarch64/me_cmp_neon.S
> @@ -347,6 +347,72 @@ function sse16_neon, export=1
> ret
> endfunc
>
> +function sse8_neon, export=1
> + // x0 - unused
> + // x1 - pix1
> + // x2 - pix2
> + // x3 - stride
> + // w4 - h
> +
> + movi d18, #0
Same as the others about d18
> + movi v21.4s, #0
> + movi v20.4s, #0
> + cmp w4, #4
> + b.le 2f
> +
> +// make 4 iterations at once
> +1:
> +
> + // res = abs(pix1[0] - pix2[0])
> + // res * res
> +
> + ld1 {v0.8b}, [x1], x3 // Load pix1 for first iteration
> + ld1 {v1.8b}, [x2], x3 // Load pix2 for first iteration
> + ld1 {v2.8b}, [x1], x3 // Load pix1 for second iteration
> + ld1 {v3.8b}, [x2], x3 // Load pix2 for second iteration
> + uabdl v30.8h, v0.8b, v1.8b // Absolute difference, first iteration
> + ld1 {v4.8b}, [x1], x3 // Load pix1 for third iteration
> + ld1 {v5.8b}, [x2], x3 // Load pix2 for third iteration
> + uabdl v29.8h, v2.8b, v3.8b // Absolute difference, second iteration
> + umlal v21.4s, v30.4h, v30.4h // Multiply lower half, first iteration
> + ld1 {v6.8b}, [x1], x3 // Load pix1 for fourth iteration
> + ld1 {v7.8b}, [x2], x3 // Load pix2 for fourth iteration
> + uabdl v28.8h, v4.8b, v5.8b // Absolute difference, third iteration
> + umlal v21.4s, v29.4h, v29.4h // Multiply lower half, second iteration
> + umlal2 v20.4s, v30.8h, v30.8h // Multiply upper half, second iteration
The comment was wrong here; this is the upper half of the first
iteration, not the second one.
> + uabdl v27.8h, v6.8b, v7.8b // Absolute difference, fourth iteration
> + umlal v21.4s, v28.4h, v28.4h // Multiply lower half, third iteration
> + umlal2 v20.4s, v29.8h, v29.8h // Multiply upper half, second iteration
> + sub w4, w4, #4 // h -= 4
> + umlal2 v20.4s, v28.8h, v28.8h // Multiply upper half, third iteration
> + umlal v21.4s, v27.4h, v27.4h // Multiply lower half, fourth iteration
> + cmp w4, #4
> + umlal2 v20.4s, v27.8h, v27.8h // Multiply upper half, fourth iteration
> + b.ge 1b
> +
> + cbz w4, 3f
> +
> +// iterate by one
> +2:
> + ld1 {v0.8b}, [x1], x3 // Load pix1
> + ld1 {v1.8b}, [x2], x3 // Load pix2
> + subs w4, w4, #1
> + uabdl v30.8h, v0.8b, v1.8b
> + umlal v21.4s, v30.4h, v30.4h
> + umlal2 v20.4s, v30.8h, v30.8h
> +
> + b.ne 2b
> +
> +3:
> + add v21.4s, v21.4s, v20.4s // Add accumulator vectors together
> + uaddlv d17, v21.4s // Add up vector
> + add d18, d18, d17
> +
Unnecessary d18.
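As a reading aid, a scalar sketch of the sum of squared errors that the quoted routine accumulates. The name is illustrative (not FFmpeg's C reference); the two running sums mirror the v21 (umlal, columns 0..3) and v20 (umlal2, columns 4..7) accumulators, which are only combined at the end.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative scalar version (hypothetical name) of an 8-pixel-wide SSE. */
int sse8_ref(const uint8_t *pix1, const uint8_t *pix2,
             ptrdiff_t stride, int h)
{
    uint32_t lo = 0;   /* columns 0..3, kept in v21 as four lanes */
    uint32_t hi = 0;   /* columns 4..7, kept in v20 as four lanes */

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 8; x++) {
            int d = pix1[x] - pix2[x];   /* sign does not matter once squared */
            if (x < 4)
                lo += (uint32_t)(d * d);
            else
                hi += (uint32_t)(d * d);
        }
        pix1 += stride;
        pix2 += stride;
    }
    return (int)(lo + hi);   /* the final add + uaddlv in the asm */
}
```

The horizontal sum is already the return value, which is why no extra accumulator such as d18 is needed.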
// Martin
* Re: [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Add neon implementation for pix_abs8
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Add neon implementation for pix_abs8 Hubert Mazur
@ 2022-08-18 9:22 ` Martin Storsjö
0 siblings, 0 replies; 14+ messages in thread
From: Martin Storsjö @ 2022-08-18 9:22 UTC (permalink / raw)
To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop
On Tue, 16 Aug 2022, Hubert Mazur wrote:
> Provide optimized implementation of pix_abs8 function for arm64.
>
> Performance comparison tests are shown below.
> - pix_abs_1_0_c: 101.2
> - pix_abs_1_0_neon: 22.5
> - sad_1_c: 101.2
> - sad_1_neon: 22.5
>
> Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c | 4 ++
> libavcodec/aarch64/me_cmp_neon.S | 49 ++++++++++++++++++++++++
> 2 files changed, 53 insertions(+)
>
> diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
> index 2f51f0497e..e7dbd4cbc5 100644
> --- a/libavcodec/aarch64/me_cmp_init_aarch64.c
> +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
> @@ -31,6 +31,8 @@ int ff_pix_abs16_x2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *
> ptrdiff_t stride, int h);
> int ff_pix_abs16_y2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
> ptrdiff_t stride, int h);
> +int ff_pix_abs8_neon(MpegEncContext *s, const uint8_t *blk1, const uint8_t *blk2,
> + ptrdiff_t stride, int h);
Alignment
> diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
> index 3f4266d4d5..8c396cad21 100644
> --- a/libavcodec/aarch64/me_cmp_neon.S
> +++ b/libavcodec/aarch64/me_cmp_neon.S
> @@ -72,6 +72,55 @@ function ff_pix_abs16_neon, export=1
> ret
> endfunc
>
> +function ff_pix_abs8_neon, export=1
> + // x0 unused
> + // x1 uint8_t *pix1
> + // x2 uint8_t *pix2
> + // x3 ptrdiff_t stride
> + // x4 int h
w4, not x4
> +
> + movi d18, #0
Unused d18
> + movi v30.8h, #0
> + cmp w4, #4
> + b.lt 2f
> +
> +// make 4 iterations at once
> +1:
> + ld1 {v0.8b}, [x1], x3 // Load pix1 for first iteration
> + ld1 {v1.8b}, [x2], x3 // Load pix2 for first iteration
> + ld1 {v2.8b}, [x1], x3 // Load pix1 for second iteration
> + uabal v30.8h, v0.8b, v1.8b // Absolute difference, first iteration
> + ld1 {v3.8b}, [x2], x3 // Load pix2 for second iteration
> + ld1 {v4.8b}, [x1], x3 // Load pix1 for third iteration
> + uabal v30.8h, v2.8b, v3.8b // Absolute difference, second iteration
> + ld1 {v5.8b}, [x2], x3 // Load pix2 for third iteration
> + sub w4, w4, #4 // h -= 4
> + uabal v30.8h, v4.8b, v5.8b // Absolute difference, third iteration
> + ld1 {v6.8b}, [x1], x3 // Load pix1 for fourth iteration
> + ld1 {v7.8b}, [x2], x3 // Load pix2 for fourth iteration
> + cmp w4, #4
> + uabal v30.8h, v6.8b, v7.8b // Absolute difference, fourth iteration
The interleaving here looks mostly quite good, but the last uabal comes
almost directly after the two loads; I moved the second-last uabal from
before the two ld1s to between ld1 and cmp, and got a rather notable
speedup.
Before:           Cortex A53    A72    A73
pix_abs_1_0_neon:       65.7   33.7   21.5
After:
pix_abs_1_0_neon:       57.7   33.5   21.5
So this is a 13% speedup on Cortex A53, just by moving one single
instruction. This is why paying attention to scheduling matters, sometimes
a lot.
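The Cortex A53 change (65.7 to 57.7) can be expressed in two common ways, and the small sketch below computes both; the function names are made up for illustration. It works out to roughly 12.2% less time per call, or roughly 13.9% more calls per unit of time, which brackets the figure quoted above.

```c
/* Percent less time per call when going from `before` to `after`. */
double runtime_reduction_pct(double before, double after)
{
    return (before - after) / before * 100.0;
}

/* Percent more calls per unit of time for the same change. */
double throughput_gain_pct(double before, double after)
{
    return (before / after - 1.0) * 100.0;
}
```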
> + uaddlv s20, v30.8h // Add up vector
> + add d18, d18, d20
> + fmov w0, s18
And finally, by removing the unnecessary add of d18 here, I got this
further reduced to the following runtimes:
                  Cortex A53    A72    A73
pix_abs_1_0_neon:       54.7   30.7   20.2
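For reference, a scalar sketch of the value being timed here, written with the same control flow as the assembly: a main loop that consumes four rows per pass, followed by a one-row-at-a-time loop for any leftover rows. The names are illustrative and this is not FFmpeg's C reference.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of absolute differences over one 8-pixel row. */
static inline int sad_row8(const uint8_t *a, const uint8_t *b)
{
    int sum = 0;
    for (int x = 0; x < 8; x++) {
        int d = a[x] - b[x];
        sum += d < 0 ? -d : d;
    }
    return sum;
}

/* Illustrative scalar version (hypothetical name) of pix_abs8. */
int pix_abs8_ref(const uint8_t *pix1, const uint8_t *pix2,
                 ptrdiff_t stride, int h)
{
    int sum = 0;

    while (h >= 4) {                 /* label 1: four rows per pass */
        for (int i = 0; i < 4; i++) {
            sum += sad_row8(pix1, pix2);
            pix1 += stride;
            pix2 += stride;
        }
        h -= 4;
    }
    while (h > 0) {                  /* label 2: leftover rows, one at a time */
        sum += sad_row8(pix1, pix2);
        pix1 += stride;
        pix2 += stride;
        h--;
    }
    return sum;
}
```

Calling it with a height that is not a multiple of 4 exercises both loops, which is the rare path the cover letter mentions testing.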
// Martin
* Re: [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions
2022-08-18 9:07 ` [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions Martin Storsjö
@ 2022-08-18 9:24 ` Hubert Mazur
0 siblings, 0 replies; 14+ messages in thread
From: Hubert Mazur @ 2022-08-18 9:24 UTC (permalink / raw)
To: Martin Storsjö
Cc: Grzegorz Bernacki, upstream, Swinney, Jonathan, ffmpeg-devel,
Marcin Wojtas, Pop, Sebastian
Thanks for the review and pointing out the issues. I will check out the
other patches for such things and fix them if needed.
Regards
On Thu, Aug 18, 2022 at 11:08 AM Martin Storsjö <martin@martin.st> wrote:
> On Tue, 16 Aug 2022, Hubert Mazur wrote:
>
> > Add arm64 neon implementation for functions from motion estimation
> > family. All of them were tested and benchmarked using checkasm tool.
> > The rare code paths, e.g. when filter_size % 4 != 0 were also tested.
>
>
> > Instructions were manually deinterleaved to reach best performance.
>
> You probably mean "interleaved", as deinterleaved would be how it was
> initially, which is detrimental for performance.
>
> Overall I think this patchset is close enough now. There were a bunch of
> minor details left on the patches, but I'll fix that up locally and push
> them, instead of doing yet another round of these. I'll comment and point
> out the details I changed - please pay attention to them for future
> patches though!
>
> // Martin
>
>
* [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Add neon implementation for pix_abs8
2022-07-15 8:02 [FFmpeg-devel] [PATCH 0/5] Add " Hubert Mazur
@ 2022-07-15 8:02 ` Hubert Mazur
0 siblings, 0 replies; 14+ messages in thread
From: Hubert Mazur @ 2022-07-15 8:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop
Provide optimized implementation of pix_abs8 function for arm64.
Performance comparison tests are shown below.
- pix_abs_1_0_c: 105.2
- pix_abs_1_0_neon: 21.4
- sad_1_c: 107.2
- sad_1_neon: 20.9
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
---
libavcodec/aarch64/me_cmp_init_aarch64.c | 4 ++
libavcodec/aarch64/me_cmp_neon.S | 53 ++++++++++++++++++++++++
2 files changed, 57 insertions(+)
diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index 89c817990c..7d7dc38754 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -31,6 +31,8 @@ int ff_pix_abs16_x2_neon(MpegEncContext *v, uint8_t *pix1, uint8_t *pix2,
ptrdiff_t stride, int h);
int ff_pix_abs16_y2_neon(MpegEncContext *v, uint8_t *pix1, uint8_t *pix2,
ptrdiff_t stride, int h);
+int ff_pix_abs8_neon(MpegEncContext *s, uint8_t *blk1, uint8_t *blk2,
+ ptrdiff_t stride, int h);
int sse16_neon(MpegEncContext *v, uint8_t *pix1, uint8_t *pix2,
ptrdiff_t stride, int h);
@@ -48,8 +50,10 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
c->pix_abs[0][1] = ff_pix_abs16_x2_neon;
c->pix_abs[0][2] = ff_pix_abs16_y2_neon;
c->pix_abs[0][3] = ff_pix_abs16_xy2_neon;
+ c->pix_abs[1][0] = ff_pix_abs8_neon;
c->sad[0] = ff_pix_abs16_neon;
+ c->sad[1] = ff_pix_abs8_neon;
c->sse[0] = sse16_neon;
c->sse[1] = sse8_neon;
c->sse[2] = sse4_neon;
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index c78e26df4b..383459d209 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -72,6 +72,59 @@ function ff_pix_abs16_neon, export=1
ret
endfunc
+function ff_pix_abs8_neon, export=1
+ // x0 unused
+ // x1 uint8_t *pix1
+ // x2 uint8_t *pix2
+ // x3 ptrdiff_t stride
+ // x4 int h
+
+ movi d18, #0
+ cmp w4, #4
+ b.lt 2f
+
+// make 4 iterations at once
+1:
+ ld1 {v0.8b}, [x1], x3
+ ld1 {v1.8b}, [x2], x3
+ uabdl v30.8h, v0.8b, v1.8b
+ ld1 {v2.8b}, [x1], x3
+ ld1 {v3.8b}, [x2], x3
+ uabal v30.8h, v2.8b, v3.8b
+ ld1 {v4.8b}, [x1], x3
+ ld1 {v5.8b}, [x2], x3
+ uabal v30.8h, v4.8b, v5.8b
+ ld1 {v6.8b}, [x1], x3
+ ld1 {v7.8b}, [x2], x3
+ uabal v30.8h, v6.8b, v7.8b
+
+ sub w4, w4, #4
+ uaddlv s20, v30.8h
+ cmp w4, #4
+ add d18, d18, d20
+ b.ge 1b
+ cbnz w4, 2f
+ fmov w0, s18
+
+ ret
+
+// iterate by one
+2:
+ ld1 {v0.8b}, [x1], x3
+ ld1 {v1.8b}, [x2], x3
+
+ uabdl v16.8h, v0.8b, v1.8b
+
+ uaddlv s17, v16.8h
+ add d18, d18, d17
+ subs w4, w4, #1
+ b.ne 2b
+ fmov w0, s18
+
+ ret
+
+endfunc
+
function ff_pix_abs16_xy2_neon, export=1
// x0 unused
// x1 uint8_t *pix1
--
2.34.1
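The init hunk in this patch registers the same routine in two slots of the comparison table (pix_abs[1][0] and sad[1]). A simplified, self-contained sketch of that function-pointer dispatch pattern follows. All types and names here are hypothetical stand-ins, not the FFmpeg definitions, and the `have_fast` flag is an assumption since the quoted hunk does not show how the CPU check is done.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for MECmpContext and its callbacks. */
typedef int (*cmp_fn)(void *ctx, const uint8_t *a, const uint8_t *b,
                      ptrdiff_t stride, int h);

typedef struct CmpTable {
    cmp_fn pix_abs[2][4];   /* [size][half-pel variant] */
    cmp_fn sad[2];          /* [0] = 16 wide, [1] = 8 wide */
} CmpTable;

/* Plain C fallback for the 8-wide sum of absolute differences. */
int sad8_c(void *ctx, const uint8_t *a, const uint8_t *b,
           ptrdiff_t stride, int h)
{
    int sum = 0;
    (void)ctx;
    for (int y = 0; y < h; y++, a += stride, b += stride)
        for (int x = 0; x < 8; x++)
            sum += a[x] > b[x] ? a[x] - b[x] : b[x] - a[x];
    return sum;
}

/* Stand-in for an optimized routine; here it simply reuses the C code. */
int sad8_fast(void *ctx, const uint8_t *a, const uint8_t *b,
              ptrdiff_t stride, int h)
{
    return sad8_c(ctx, a, b, stride, h);
}

/* Fill in defaults, then override the slots that have a fast version. */
void cmp_table_init(CmpTable *c, int have_fast)
{
    *c = (CmpTable){0};
    c->pix_abs[1][0] = sad8_c;
    c->sad[1]        = sad8_c;
    if (have_fast) {
        c->pix_abs[1][0] = sad8_fast;   /* same routine in both slots */
        c->sad[1]        = sad8_fast;
    }
}
```

Callers only ever go through the table, for example `t.sad[1](NULL, a, b, stride, h)`, so swapping in an optimized routine needs no change at the call sites.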
end of thread, other threads:[~2022-08-18 9:25 UTC | newest]
Thread overview: 14+ messages
2022-08-16 12:20 [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions Hubert Mazur
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for sse16 Hubert Mazur
2022-08-18 9:09 ` Martin Storsjö
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation for sse4 Hubert Mazur
2022-08-18 9:10 ` Martin Storsjö
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for pix_abs16_y2 Hubert Mazur
2022-08-18 9:16 ` Martin Storsjö
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for sse8 Hubert Mazur
2022-08-18 9:18 ` Martin Storsjö
2022-08-16 12:20 ` [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Add neon implementation for pix_abs8 Hubert Mazur
2022-08-18 9:22 ` Martin Storsjö
2022-08-18 9:07 ` [FFmpeg-devel] [PATCH 0/5] Provide neon implementation for me_cmp functions Martin Storsjö
2022-08-18 9:24 ` Hubert Mazur
-- strict thread matches above, loose matches on Subject: below --
2022-07-15 8:02 [FFmpeg-devel] [PATCH 0/5] Add " Hubert Mazur
2022-07-15 8:02 ` [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Add neon implementation for pix_abs8 Hubert Mazur