* [FFmpeg-devel] [PATCH 0/5] me_cmp: Provide arm64 neon implementations
@ 2022-08-22 15:26 Hubert Mazur
2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for vsad16 Hubert Mazur
` (4 more replies)
0 siblings, 5 replies; 14+ messages in thread
From: Hubert Mazur @ 2022-08-22 15:26 UTC
To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop
Provide optimized arm64 neon implementations for functions from the
motion estimation family.
Hubert Mazur (5):
lavc/aarch64: Add neon implementation for vsad16
lavc/aarch64: Add neon implementation of vsse16
lavc/aarch64: Add neon implementation for vsad_intra16
lavc/aarch64: Add neon implementation for vsse_intra16
lavc/aarch64: Provide neon implementation of nsse16
libavcodec/aarch64/me_cmp_init_aarch64.c | 30 ++
libavcodec/aarch64/me_cmp_neon.S | 431 +++++++++++++++++++++++
2 files changed, 461 insertions(+)
--
2.34.1
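All five comparators follow the common me_cmp prototype and are registered
in MECmpContext by ff_me_cmp_init_aarch64(), as the diffs below show. For
orientation, the shared prototype looks roughly like this (a sketch; see
libavcodec/me_cmp.h for the authoritative definition):

    #include <stddef.h>
    #include <stdint.h>

    struct MpegEncContext;

    /* Every comparator scores two 16-wide blocks (or one block against
     * itself for the intra variants) and returns a cost; lower is better. */
    typedef int (*me_cmp_func)(struct MpegEncContext *c,
                               const uint8_t *blk1, const uint8_t *blk2,
                               ptrdiff_t stride, int h);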
* [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for vsad16
2022-08-22 15:26 [FFmpeg-devel] [PATCH 0/5] me_cmp: Provide arm64 neon implementations Hubert Mazur
@ 2022-08-22 15:26 ` Hubert Mazur
2022-09-02 21:49 ` Martin Storsjö
2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation of vsse16 Hubert Mazur
` (3 subsequent siblings)
4 siblings, 1 reply; 14+ messages in thread
From: Hubert Mazur @ 2022-08-22 15:26 UTC
To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop
Provide optimized implementation of vsad16 function for arm64.
Performance comparison tests are shown below.
- vsad_0_c: 285.0
- vsad_0_neon: 42.5
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
libavcodec/aarch64/me_cmp_init_aarch64.c | 5 ++
libavcodec/aarch64/me_cmp_neon.S | 75 ++++++++++++++++++++++++
2 files changed, 80 insertions(+)
diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index fb7c3f5059..ddc5d05611 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -41,6 +41,9 @@ int sse8_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
int sse4_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
ptrdiff_t stride, int h);
+int vsad16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
+ ptrdiff_t stride, int h);
+
av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
{
int cpu_flags = av_get_cpu_flags();
@@ -57,5 +60,7 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
c->sse[0] = sse16_neon;
c->sse[1] = sse8_neon;
c->sse[2] = sse4_neon;
+
+ c->vsad[0] = vsad16_neon;
}
}
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 4198985c6c..d4c0099854 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -584,3 +584,78 @@ function sse4_neon, export=1
ret
endfunc
+
+function vsad16_neon, export=1
+ // x0 unused
+ // x1 uint8_t *pix1
+ // x2 uint8_t *pix2
+ // x3 ptrdiff_t stride
+ // w4 int h
+
+ sub w4, w4, #1 // we need to make h-1 iterations
+ movi v16.8h, #0
+
+ cmp w4, #3 // check if we can make 3 iterations at once
+ add x5, x1, x3 // pix1 + stride
+ add x6, x2, x3 // pix2 + stride
+ b.le 2f
+
+1:
+ // abs(pix1[0] - pix2[0] - pix1[0 + stride] + pix2[0 + stride])
+ // abs(x) = (x < 0 ? (-x) : (x))
+ ld1 {v0.16b}, [x1], x3 // Load pix1[0], first iteration
+ ld1 {v1.16b}, [x2], x3 // Load pix2[0], first iteration
+ ld1 {v2.16b}, [x5], x3 // Load pix1[0 + stride], first iteration
+ usubl v31.8h, v0.8b, v1.8b // Signed difference pix1[0] - pix2[0], first iteration
+ ld1 {v3.16b}, [x6], x3 // Load pix2[0 + stride], first iteration
+ usubl2 v30.8h, v0.16b, v1.16b // Signed difference pix1[0] - pix2[0], first iteration
+ usubl v29.8h, v2.8b, v3.8b // Signed difference pix1[0 + stride] - pix2[0 + stride], first iteration
+ ld1 {v4.16b}, [x1], x3 // Load pix1[0], second iteration
+ usubl2 v28.8h, v2.16b, v3.16b // Signed difference pix1[0 + stride] - pix2[0 + stride], first iteration
+ ld1 {v5.16b}, [x2], x3 // Load pix2[0], second iteration
+ saba v16.8h, v31.8h, v29.8h // Signed absolute difference and accumulate the result. first iteration
+ ld1 {v6.16b}, [x5], x3 // Load pix1[0 + stride], second iteration
+ saba v16.8h, v30.8h, v28.8h // Signed absolute difference and accumulate the result. first iteration
+ usubl v27.8h, v4.8b, v5.8b // Signed difference pix1[0] - pix2[0], second iteration
+ ld1 {v7.16b}, [x6], x3 // Load pix2[0 + stride], second iteration
+ usubl2 v26.8h, v4.16b, v5.16b // Signed difference pix1[0] - pix2[0], second iteration
+ usubl v25.8h, v6.8b, v7.8b // Signed difference pix1[0 + stride] - pix2[0 + stride], second iteration
+ ld1 {v17.16b}, [x1], x3 // Load pix1[0], third iteration
+ usubl2 v24.8h, v6.16b, v7.16b // Signed difference pix1[0 + stride] - pix2[0 + stride], second iteration
+ ld1 {v18.16b}, [x2], x3 // Load pix2[0], third iteration
+ saba v16.8h, v27.8h, v25.8h // Signed absolute difference and accumulate the result. second iteration
+ ld1 {v19.16b}, [x5], x3 // Load pix1[0 + stride], third iteration
+ saba v16.8h, v26.8h, v24.8h // Signed absolute difference and accumulate the result. second iteration
+ usubl v23.8h, v17.8b, v18.8b // Signed difference pix1[0] - pix2[0], third iteration
+ ld1 {v20.16b}, [x6], x3 // Load pix2[0 + stride], third iteration
+ usubl2 v22.8h, v17.16b, v18.16b // Signed difference pix1[0] - pix2[0], third iteration
+ usubl v21.8h, v19.8b, v20.8b // Signed difference pix1[0 + stride] - pix2[0 + stride], third iteration
+ sub w4, w4, #3 // h -= 3
+ saba v16.8h, v23.8h, v21.8h // Signed absolute difference and accumulate the result. third iteration
+ usubl2 v31.8h, v19.16b, v20.16b // Signed difference pix1[0 + stride] - pix2[0 + stride], third iteration
+ cmp w4, #3
+ saba v16.8h, v22.8h, v31.8h // Signed absolute difference and accumulate the result. third iteration
+
+ b.ge 1b
+ cbz w4, 3f
+2:
+
+ ld1 {v0.16b}, [x1], x3
+ ld1 {v1.16b}, [x2], x3
+ ld1 {v2.16b}, [x5], x3
+ usubl v30.8h, v0.8b, v1.8b
+ ld1 {v3.16b}, [x6], x3
+ usubl2 v29.8h, v0.16b, v1.16b
+ usubl v28.8h, v2.8b, v3.8b
+ usubl2 v27.8h, v2.16b, v3.16b
+ saba v16.8h, v30.8h, v28.8h
+ subs w4, w4, #1
+ saba v16.8h, v29.8h, v27.8h
+
+ b.ne 2b
+3:
+ uaddlv s17, v16.8h
+ fmov w0, s17
+
+ ret
+endfunc
--
2.34.1
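For reference, the scalar computation that vsad16_neon vectorizes can be
sketched in C as follows (a minimal reconstruction from the comments in the
assembly above; the real reference lives in libavcodec/me_cmp.c and the name
vsad16_ref is made up):

    #include <stdlib.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Sum of absolute "vertical gradient" differences between two blocks,
     * accumulated over the h-1 adjacent row pairs of a 16-wide block. */
    static int vsad16_ref(const uint8_t *pix1, const uint8_t *pix2,
                          ptrdiff_t stride, int h)
    {
        int score = 0;
        for (int y = 0; y < h - 1; y++) {
            for (int x = 0; x < 16; x++)
                score += abs(pix1[x] - pix2[x] -
                             pix1[x + stride] + pix2[x + stride]);
            pix1 += stride;
            pix2 += stride;
        }
        return score;
    }

The usubl/usubl2 pairs in the assembly compute the widened differences for
the low and high 8 columns, and saba folds the absolute difference of two
such row differences into the 16-bit accumulator.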
* [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation of vsse16
2022-08-22 15:26 [FFmpeg-devel] [PATCH 0/5] me_cmp: Provide arm64 neon implementations Hubert Mazur
2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for vsad16 Hubert Mazur
@ 2022-08-22 15:26 ` Hubert Mazur
2022-09-04 20:53 ` Martin Storsjö
2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for vsad_intra16 Hubert Mazur
` (2 subsequent siblings)
4 siblings, 1 reply; 14+ messages in thread
From: Hubert Mazur @ 2022-08-22 15:26 UTC
To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop
Provide optimized implementation of vsse16 for arm64.
Performance comparison tests are shown below.
- vsse_0_c: 254.4
- vsse_0_neon: 64.7
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
libavcodec/aarch64/me_cmp_init_aarch64.c | 4 +
libavcodec/aarch64/me_cmp_neon.S | 97 ++++++++++++++++++++++++
2 files changed, 101 insertions(+)
diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index ddc5d05611..7b81e48d16 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -43,6 +43,8 @@ int sse4_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
int vsad16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
ptrdiff_t stride, int h);
+int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
+ ptrdiff_t stride, int h);
av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
{
@@ -62,5 +64,7 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
c->sse[2] = sse4_neon;
c->vsad[0] = vsad16_neon;
+
+ c->vsse[0] = vsse16_neon;
}
}
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index d4c0099854..279bae7cb5 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -659,3 +659,100 @@ function vsad16_neon, export=1
ret
endfunc
+
+function vsse16_neon, export=1
+ // x0 unused
+ // x1 uint8_t *pix1
+ // x2 uint8_t *pix2
+ // x3 ptrdiff_t stride
+ // w4 int h
+
+ movi v30.4s, #0
+ movi v29.4s, #0
+
+ add x5, x1, x3 // pix1 + stride
+ add x6, x2, x3 // pix2 + stride
+ sub w4, w4, #1 // we need to make h-1 iterations
+ cmp w4, #3 // check if we can make 4 iterations at once
+ b.le 2f
+
+// make 4 iterations at once
+1:
+ // x = abs(pix1[0] - pix2[0] - pix1[0 + stride] + pix2[0 + stride]) =
+ // res = (x) * (x)
+ ld1 {v0.16b}, [x1], x3 // Load pix1[0], first iteration
+ ld1 {v1.16b}, [x2], x3 // Load pix2[0], first iteration
+ ld1 {v2.16b}, [x5], x3 // Load pix1[0 + stride], first iteration
+ usubl v28.8h, v0.8b, v1.8b // Signed difference of pix1[0] - pix2[0], first iteration
+ ld1 {v3.16b}, [x6], x3 // Load pix2[0 + stride], first iteration
+ usubl2 v27.8h, v0.16b, v1.16b // Signed difference of pix1[0] - pix2[0], first iteration
+ usubl v26.8h, v3.8b, v2.8b // Signed difference of pix1[0 + stride] - pix2[0 + stride], first iteration
+ usubl2 v25.8h, v3.16b, v2.16b // Signed difference of pix1[0 + stride] - pix2[0 + stride], first iteration
+ ld1 {v4.16b}, [x1], x3 // Load pix1[0], second iteration
+ sqadd v28.8h, v28.8h, v26.8h // Add first iteration
+ ld1 {v6.16b}, [x5], x3 // Load pix1[0 + stride], second iteration
+ sqadd v27.8h, v27.8h, v25.8h // Add first iteration
+ ld1 {v5.16b}, [x2], x3 // Load pix2[0], second iteration
+ smlal v30.4s, v28.4h, v28.4h // Multiply-accumulate first iteration
+ ld1 {v7.16b}, [x6], x3 // Load pix2[0 + stride], second iteration
+ usubl v26.8h, v4.8b, v5.8b // Signed difference of pix1[0] - pix2[0], second iteration
+ smlal2 v29.4s, v28.8h, v28.8h // Multiply-accumulate first iteration
+ usubl2 v25.8h, v4.16b, v5.16b // Signed difference of pix1[0] - pix2[0], second iteration
+ usubl v24.8h, v7.8b, v6.8b // Signed difference of pix1[0 + stride] - pix2[0 + stride], second iteration
+ smlal v30.4s, v27.4h, v27.4h // Multiply-accumulate first iteration
+ usubl2 v23.8h, v7.16b, v6.16b // Signed difference of pix1[0 + stride] - pix2[0 + stride], second iteration
+ sqadd v24.8h, v26.8h, v24.8h // Add second iteration
+ smlal2 v29.4s, v27.8h, v27.8h // Multiply-accumulate first iteration
+ sqadd v23.8h, v25.8h, v23.8h // Add second iteration
+ ld1 {v18.16b}, [x1], x3 // Load pix1[0], third iteration
+ smlal v30.4s, v24.4h, v24.4h // Multiply-accumulate second iteration
+ ld1 {v31.16b}, [x2], x3 // Load pix2[0], third iteration
+ ld1 {v17.16b}, [x5], x3 // Load pix1[0 + stride], third iteration
+ smlal2 v29.4s, v24.8h, v24.8h // Multiply-accumulate second iteration
+ ld1 {v16.16b}, [x6], x3 // Load pix2[0 + stride], third iteration
+ usubl v22.8h, v18.8b, v31.8b // Signed difference of pix1[0] - pix2[0], third iteration
+ smlal v30.4s, v23.4h, v23.4h // Multiply-accumulate second iteration
+ usubl2 v21.8h, v18.16b, v31.16b // Signed difference of pix1[0] - pix2[0], third iteration
+ usubl v20.8h, v16.8b, v17.8b // Signed difference of pix1[0 + stride] - pix2[0 + stride], third iteration
+ smlal2 v29.4s, v23.8h, v23.8h // Multiply-accumulate second iteration
+ sqadd v20.8h, v22.8h, v20.8h // Add third iteration
+ usubl2 v19.8h, v16.16b, v17.16b // Signed difference of pix1[0 + stride] - pix2[0 + stride], third iteration
+ smlal v30.4s, v20.4h, v20.4h // Multiply-accumulate third iteration
+ sqadd v19.8h, v21.8h, v19.8h // Add third iteration
+ smlal2 v29.4s, v20.8h, v20.8h // Multiply-accumulate third iteration
+ sub w4, w4, #3
+ smlal v30.4s, v19.4h, v19.4h // Multiply-accumulate third iteration
+ cmp w4, #3
+ smlal2 v29.4s, v19.8h, v19.8h // Multiply-accumulate third iteration
+
+ b.ge 1b
+
+ cbz w4, 3f
+
+// iterate by once
+2:
+ ld1 {v0.16b}, [x1], x3
+ ld1 {v1.16b}, [x2], x3
+ ld1 {v2.16b}, [x5], x3
+ usubl v28.8h, v0.8b, v1.8b
+ ld1 {v3.16b}, [x6], x3
+ usubl2 v27.8h, v0.16b, v1.16b
+ usubl v26.8h, v3.8b, v2.8b
+ usubl2 v25.8h, v3.16b, v2.16b
+ sqadd v28.8h, v28.8h, v26.8h
+ sqadd v27.8h, v27.8h, v25.8h
+ smlal v30.4s, v28.4h, v28.4h
+ smlal2 v29.4s, v28.8h, v28.8h
+ subs w4, w4, #1
+ smlal v30.4s, v27.4h, v27.4h
+ smlal2 v29.4s, v27.8h, v27.8h
+
+ b.ne 2b
+
+3:
+ add v30.4s, v30.4s, v29.4s
+ saddlv d17, v30.4s
+ fmov w0, s17
+
+ ret
+endfunc
--
2.34.1
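vsse16 is the squared counterpart of vsad16; a scalar sketch (hypothetical
name, reconstructed from the assembly comments above):

    #include <stddef.h>
    #include <stdint.h>

    static int vsse16_ref(const uint8_t *pix1, const uint8_t *pix2,
                          ptrdiff_t stride, int h)
    {
        int score = 0;
        for (int y = 0; y < h - 1; y++) {
            for (int x = 0; x < 16; x++) {
                int d = pix1[x] - pix2[x] -
                        pix1[x + stride] + pix2[x + stride];
                score += d * d;     /* squared instead of abs() */
            }
            pix1 += stride;
            pix2 += stride;
        }
        return score;
    }

Note that the NEON code forms d with saturating 16-bit adds (sqadd), which
is exact here since each widened difference lies in [-510, 510].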
* [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for vsad_intra16
2022-08-22 15:26 [FFmpeg-devel] [PATCH 0/5] me_cmp: Provide arm64 neon implementations Hubert Mazur
2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for vsad16 Hubert Mazur
2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation of vsse16 Hubert Mazur
@ 2022-08-22 15:26 ` Hubert Mazur
2022-09-04 20:58 ` Martin Storsjö
2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for vsse_intra16 Hubert Mazur
2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16 Hubert Mazur
4 siblings, 1 reply; 14+ messages in thread
From: Hubert Mazur @ 2022-08-22 15:26 UTC
To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop
Provide optimized implementation for vsad_intra16 function for arm64.
Performance comparison tests are shown below.
- vsad_4_c: 177.2
- vsad_4_neon: 24.5
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
libavcodec/aarch64/me_cmp_init_aarch64.c | 3 ++
libavcodec/aarch64/me_cmp_neon.S | 58 ++++++++++++++++++++++++
2 files changed, 61 insertions(+)
diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index 7b81e48d16..af83f7ed1e 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -43,6 +43,8 @@ int sse4_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
int vsad16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
ptrdiff_t stride, int h);
+int vsad_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
+ ptrdiff_t stride, int h);
int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
ptrdiff_t stride, int h);
@@ -64,6 +66,7 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
c->sse[2] = sse4_neon;
c->vsad[0] = vsad16_neon;
+ c->vsad[4] = vsad_intra16_neon;
c->vsse[0] = vsse16_neon;
}
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 279bae7cb5..126e267fdc 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -756,3 +756,61 @@ function vsse16_neon, export=1
ret
endfunc
+
+function vsad_intra16_neon, export=1
+ // x0 unused
+ // x1 uint8_t *pix1
+ // x2 uint8_t *dummy
+ // x3 ptrdiff_t stride
+ // w4 int h
+
+ sub w4, w4, #1 // we need to make h-1 iterations
+ add x5, x1, x3 // pix1 + stride
+ cmp w4, #4
+ movi v16.8h, #0
+
+ b.lt 2f
+
+// make 4 iterations at once
+1:
+ // v = abs( pix1[0] - pix1[0 + stride] )
+ // score = sum(v)
+ // abs(x) = ( (x > 0) ? (x) : (-x) )
+
+ ld1 {v0.16b}, [x1], x3
+ ld1 {v1.16b}, [x5], x3
+ ld1 {v2.16b}, [x1], x3
+ uabal v16.8h, v0.8b, v1.8b
+ ld1 {v3.16b}, [x5], x3
+ uabal2 v16.8h, v0.16b, v1.16b
+ ld1 {v4.16b}, [x1], x3
+ uabal v16.8h, v2.8b, v3.8b
+ ld1 {v5.16b}, [x5], x3
+ uabal2 v16.8h, v2.16b, v3.16b
+ ld1 {v6.16b}, [x1], x3
+ uabal v16.8h, v4.8b, v5.8b
+ ld1 {v7.16b}, [x5], x3
+ uabal2 v16.8h, v4.16b, v5.16b
+ sub w4, w4, #4
+ uabal v16.8h, v6.8b, v7.8b
+ cmp w4, #4
+ uabal2 v16.8h, v6.16b, v7.16b
+
+ b.ge 1b
+ cbz w4, 3f
+
+// iterate by one
+2:
+ ld1 {v0.16b}, [x1], x3
+ ld1 {v1.16b}, [x5], x3
+ subs w4, w4, #1
+ uabal v16.8h, v0.8b, v1.8b
+ uabal2 v16.8h, v0.16b, v1.16b
+ cbnz w4, 2b
+
+3:
+ uaddlv s17, v16.8h
+ fmov w0, s17
+
+ ret
+endfunc
--
2.34.1
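The intra variant compares a block against itself; the dummy argument is
unused. A scalar sketch of vsad_intra16 (assumed name):

    #include <stdlib.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Sum of absolute differences between each row and the row below it,
     * within a single 16-wide block. */
    static int vsad_intra16_ref(const uint8_t *pix, ptrdiff_t stride, int h)
    {
        int score = 0;
        for (int y = 0; y < h - 1; y++) {
            for (int x = 0; x < 16; x++)
                score += abs(pix[x] - pix[x + stride]);
            pix += stride;
        }
        return score;
    }

Since there is no second block, the whole body reduces to uabal/uabal2
(absolute difference and accumulate) on pairs of adjacent rows.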
* [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for vsse_intra16
2022-08-22 15:26 [FFmpeg-devel] [PATCH 0/5] me_cmp: Provide arm64 neon implementations Hubert Mazur
` (2 preceding siblings ...)
2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for vsad_intra16 Hubert Mazur
@ 2022-08-22 15:26 ` Hubert Mazur
2022-09-04 20:59 ` Martin Storsjö
2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16 Hubert Mazur
4 siblings, 1 reply; 14+ messages in thread
From: Hubert Mazur @ 2022-08-22 15:26 UTC
To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop
Provide optimized implementation for vsse_intra16 for arm64.
Performance tests are shown below.
- vsse_4_c: 153.7
- vsse_4_neon: 34.2
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
libavcodec/aarch64/me_cmp_init_aarch64.c | 3 +
libavcodec/aarch64/me_cmp_neon.S | 75 ++++++++++++++++++++++++
2 files changed, 78 insertions(+)
diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index af83f7ed1e..8c295d5457 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -47,6 +47,8 @@ int vsad_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
ptrdiff_t stride, int h);
int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
ptrdiff_t stride, int h);
+int vsse_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
+ ptrdiff_t stride, int h);
av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
{
@@ -69,5 +71,6 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
c->vsad[4] = vsad_intra16_neon;
c->vsse[0] = vsse16_neon;
+ c->vsse[4] = vsse_intra16_neon;
}
}
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 126e267fdc..46d4dade5d 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -814,3 +814,78 @@ function vsad_intra16_neon, export=1
ret
endfunc
+
+function vsse_intra16_neon, export=1
+ // x0 unused
+ // x1 uint8_t *pix1
+ // x2 uint8_t *dummy
+ // x3 ptrdiff_t stride
+ // w4 int h
+
+ add x5, x1, x3 // pix1 + stride
+ movi v16.4s, #0
+ movi v17.4s, #0
+
+ sub w4, w4, #1 // we need to make h-1 iterations
+ cmp w4, #4
+ b.lt 2f
+
+// make 4 iterations at once
+1:
+ // v = abs( pix1[0] - pix1[0 + stride] )
+ // score = sum( v * v )
+ // abs(x) = ( (x > 0) ? (x) : (-x) )
+
+ ld1 {v0.16b}, [x1], x3
+ ld1 {v1.16b}, [x5], x3
+ ld1 {v2.16b}, [x1], x3
+ uabd v30.16b, v0.16b, v1.16b
+ ld1 {v3.16b}, [x5], x3
+ ld1 {v4.16b}, [x1], x3
+ umull v29.8h, v30.8b, v30.8b
+ umull2 v28.8h, v30.16b, v30.16b
+ uabd v27.16b, v2.16b, v3.16b
+ uadalp v16.4s, v29.8h
+ ld1 {v5.16b}, [x5], x3
+ umull v26.8h, v27.8b, v27.8b
+ uadalp v17.4s, v28.8h
+ ld1 {v6.16b}, [x1], x3
+ umull2 v27.8h, v27.16b, v27.16b
+ uabd v25.16b, v4.16b, v5.16b
+ uadalp v16.4s, v26.8h
+ umull v24.8h, v25.8b, v25.8b
+ ld1 {v7.16b}, [x5], x3
+ uadalp v17.4s, v27.8h
+ umull2 v25.8h, v25.16b, v25.16b
+ uabd v23.16b, v6.16b, v7.16b
+ uadalp v16.4s, v24.8h
+ umull v22.8h, v23.8b, v23.8b
+ uadalp v17.4s, v25.8h
+ umull2 v23.8h, v23.16b, v23.16b
+ sub w4, w4, #4
+ uadalp v16.4s, v22.8h
+ cmp w4, #4
+ uadalp v17.4s, v23.8h
+
+ b.ge 1b
+ cbz w4, 3f
+
+// iterate by one
+2:
+ ld1 {v0.16b}, [x1], x3
+ ld1 {v1.16b}, [x5], x3
+ subs w4, w4, #1
+ uabd v30.16b, v0.16b, v1.16b
+ umull v29.8h, v30.8b, v30.8b
+ umull2 v30.8h, v30.16b, v30.16b
+ uadalp v16.4s, v29.8h
+ uadalp v17.4s, v30.8h
+ cbnz w4, 2b
+
+3:
+ add v16.4s, v16.4s, v17.4s
+ uaddlv d17, v16.4s
+ fmov w0, s17
+
+ ret
+endfunc
--
2.34.1
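Correspondingly, a scalar sketch of vsse_intra16 (assumed name; squared
row-to-row differences within one block, matching the uabd/umull/uadalp
sequence above):

    #include <stddef.h>
    #include <stdint.h>

    static int vsse_intra16_ref(const uint8_t *pix, ptrdiff_t stride, int h)
    {
        int score = 0;
        for (int y = 0; y < h - 1; y++) {
            for (int x = 0; x < 16; x++) {
                int d = pix[x] - pix[x + stride];
                score += d * d;
            }
            pix += stride;
        }
        return score;
    }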
* [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16
2022-08-22 15:26 [FFmpeg-devel] [PATCH 0/5] me_cmp: Provide arm64 neon implementations Hubert Mazur
` (3 preceding siblings ...)
2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for vsse_intra16 Hubert Mazur
@ 2022-08-22 15:26 ` Hubert Mazur
2022-09-02 21:29 ` Martin Storsjö
2022-09-04 21:23 ` Martin Storsjö
4 siblings, 2 replies; 14+ messages in thread
From: Hubert Mazur @ 2022-08-22 15:26 UTC
To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop
Add vectorized implementation of nsse16 function.
Performance comparison tests are shown below.
- nsse_0_c: 707.0
- nsse_0_neon: 120.0
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
libavcodec/aarch64/me_cmp_init_aarch64.c | 15 +++
libavcodec/aarch64/me_cmp_neon.S | 126 +++++++++++++++++++++++
2 files changed, 141 insertions(+)
diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index 8c295d5457..146ef04345 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -49,6 +49,10 @@ int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
ptrdiff_t stride, int h);
int vsse_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
ptrdiff_t stride, int h);
+int nsse16_neon(int multiplier, const uint8_t *s, const uint8_t *s2,
+ ptrdiff_t stride, int h);
+int nsse16_neon_wrapper(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
+ ptrdiff_t stride, int h);
av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
{
@@ -72,5 +76,16 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
c->vsse[0] = vsse16_neon;
c->vsse[4] = vsse_intra16_neon;
+
+ c->nsse[0] = nsse16_neon_wrapper;
}
}
+
+int nsse16_neon_wrapper(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
+ ptrdiff_t stride, int h)
+{
+ if (c)
+ return nsse16_neon(c->avctx->nsse_weight, s1, s2, stride, h);
+ else
+ return nsse16_neon(8, s1, s2, stride, h);
+}
\ No newline at end of file
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 46d4dade5d..9fe96e111c 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -889,3 +889,129 @@ function vsse_intra16_neon, export=1
ret
endfunc
+
+function nsse16_neon, export=1
+ // x0 multiplier
+ // x1 uint8_t *pix1
+ // x2 uint8_t *pix2
+ // x3 ptrdiff_t stride
+ // w4 int h
+
+ str x0, [sp, #-0x40]!
+ stp x1, x2, [sp, #0x10]
+ stp x3, x4, [sp, #0x20]
+ str lr, [sp, #0x30]
+ bl sse16_neon
+ ldr lr, [sp, #0x30]
+ mov w9, w0 // here we store score1
+ ldr x5, [sp]
+ ldp x1, x2, [sp, #0x10]
+ ldp x3, x4, [sp, #0x20]
+ add sp, sp, #0x40
+
+ movi v16.8h, #0
+ movi v17.8h, #0
+ movi v18.8h, #0
+ movi v19.8h, #0
+
+ mov x10, x1 // x1
+ mov x14, x2 // x2
+ add x11, x1, x3 // x1 + stride
+ add x15, x2, x3 // x2 + stride
+ add x12, x1, #1 // x1 + 1
+ add x16, x2, #1 // x2 + 1
+ add x13, x11, #1 // x1 + stride + 1
+ add x17, x15, #1 // x2 + stride + 1
+
+ subs w4, w4, #1 // we need to make h-1 iterations
+ cmp w4, #2
+ b.lt 2f
+
+// make 2 iterations at once
+1:
+ ld1 {v0.16b}, [x10], x3
+ ld1 {v1.16b}, [x11], x3
+ ld1 {v2.16b}, [x12], x3
+ usubl v31.8h, v0.8b, v1.8b
+ ld1 {v3.16b}, [x13], x3
+ usubl2 v30.8h, v0.16b, v1.16b
+ usubl v29.8h, v2.8b, v3.8b
+ usubl2 v28.8h, v2.16b, v3.16b
+ ld1 {v4.16b}, [x14], x3
+ saba v16.8h, v31.8h, v29.8h
+ ld1 {v5.16b}, [x15], x3
+ ld1 {v6.16b}, [x16], x3
+ saba v17.8h, v30.8h, v28.8h
+ ld1 {v7.16b}, [x17], x3
+ usubl v27.8h, v4.8b, v5.8b
+ usubl2 v26.8h, v4.16b, v5.16b
+ usubl v25.8h, v6.8b, v7.8b
+ ld1 {v31.16b}, [x10], x3
+ saba v18.8h, v27.8h, v25.8h
+ usubl2 v24.8h, v6.16b, v7.16b
+ ld1 {v30.16b}, [x11], x3
+ ld1 {v29.16b}, [x12], x3
+ saba v19.8h, v26.8h, v24.8h
+ usubl v23.8h, v31.8b, v30.8b
+ ld1 {v28.16b}, [x13], x3
+ usubl2 v22.8h, v31.16b, v30.16b
+ usubl v21.8h, v29.8b, v28.8b
+ ld1 {v27.16b}, [x14], x3
+ usubl2 v20.8h, v29.16b, v28.16b
+ saba v16.8h, v23.8h, v21.8h
+ ld1 {v26.16b}, [x15], x3
+ ld1 {v25.16b}, [x16], x3
+ saba v17.8h, v22.8h, v20.8h
+ ld1 {v24.16b}, [x17], x3
+ usubl v31.8h, v27.8b, v26.8b
+ usubl v29.8h, v25.8b, v24.8b
+ usubl2 v30.8h, v27.16b, v26.16b
+ saba v18.8h, v31.8h, v29.8h
+ usubl2 v28.8h, v25.16b, v24.16b
+ sub w4, w4, #2
+ cmp w4, #2
+ saba v19.8h, v30.8h, v28.8h
+
+ b.ge 1b
+ cbz w4, 3f
+
+// iterate by one
+2:
+ ld1 {v0.16b}, [x10], x3
+ ld1 {v1.16b}, [x11], x3
+ ld1 {v2.16b}, [x12], x3
+ usubl v31.8h, v0.8b, v1.8b
+ ld1 {v3.16b}, [x13], x3
+ usubl2 v30.8h, v0.16b, v1.16b
+ usubl v29.8h, v2.8b, v3.8b
+ usubl2 v28.8h, v2.16b, v3.16b
+ saba v16.8h, v31.8h, v29.8h
+ ld1 {v4.16b}, [x14], x3
+ ld1 {v5.16b}, [x15], x3
+ saba v17.8h, v30.8h, v28.8h
+ ld1 {v6.16b}, [x16], x3
+ usubl v27.8h, v4.8b, v5.8b
+ ld1 {v7.16b}, [x17], x3
+ usubl2 v26.8h, v4.16b, v5.16b
+ usubl v25.8h, v6.8b, v7.8b
+ usubl2 v24.8h, v6.16b, v7.16b
+ saba v18.8h, v27.8h, v25.8h
+ subs w4, w4, #1
+ saba v19.8h, v26.8h, v24.8h
+
+ cbnz w4, 2b
+
+3:
+ sqsub v16.8h, v16.8h, v18.8h
+ sqsub v17.8h, v17.8h, v19.8h
+ ins v17.h[7], wzr
+ sqadd v16.8h, v16.8h, v17.8h
+ saddlv s16, v16.8h
+ sqabs s16, s16
+ fmov w0, s16
+
+ mul w0, w0, w5
+ add w0, w0, w9
+
+ ret
+endfunc
--
2.34.1
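nsse is a noise-preserving variant of SSE: the plain SSE (score1, delegated
to sse16_neon above) is biased by how differently the two blocks' internal
2D gradients behave (score2, the saba accumulations). A scalar sketch of the
whole computation including the wrapper's weight handling (reconstructed
from the assembly; treat the details as approximate and see nsse16_c in
libavcodec/me_cmp.c for the authoritative version):

    #include <stdlib.h>
    #include <stddef.h>
    #include <stdint.h>

    static int nsse16_ref(int weight, const uint8_t *s1, const uint8_t *s2,
                          ptrdiff_t stride, int h)
    {
        int score1 = 0;     /* plain sum of squared errors */
        int score2 = 0;     /* difference in gradient "noisiness" */

        for (int y = 0; y < h; y++) {
            for (int x = 0; x < 16; x++)
                score1 += (s1[x] - s2[x]) * (s1[x] - s2[x]);
            if (y + 1 < h) {
                /* only 15 column gradients exist per row pair, hence the
                 * "ins v17.h[7], wzr" zeroing the last lane in the assembly */
                for (int x = 0; x < 15; x++)
                    score2 += abs(s1[x] - s1[x + stride] -
                                  s1[x + 1] + s1[x + stride + 1]) -
                              abs(s2[x] - s2[x + stride] -
                                  s2[x + 1] + s2[x + stride + 1]);
            }
            s1 += stride;
            s2 += stride;
        }
        return score1 + abs(score2) * weight;  /* weight: nsse_weight, or 8 */
    }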
* Re: [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16
2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16 Hubert Mazur
@ 2022-09-02 21:29 ` Martin Storsjö
2022-09-04 21:23 ` Martin Storsjö
1 sibling, 0 replies; 14+ messages in thread
From: Martin Storsjö @ 2022-09-02 21:29 UTC
To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop
On Mon, 22 Aug 2022, Hubert Mazur wrote:
> Add vectorized implementation of nsse16 function.
>
> Performance comparison tests are shown below.
> - nsse_0_c: 707.0
> - nsse_0_neon: 120.0
>
> Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c | 15 +++
> libavcodec/aarch64/me_cmp_neon.S | 126 +++++++++++++++++++++++
> 2 files changed, 141 insertions(+)
>
> diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
> index 46d4dade5d..9fe96e111c 100644
> --- a/libavcodec/aarch64/me_cmp_neon.S
> +++ b/libavcodec/aarch64/me_cmp_neon.S
> @@ -889,3 +889,129 @@ function vsse_intra16_neon, export=1
>
> ret
> endfunc
> +
> +function nsse16_neon, export=1
> + // x0 multiplier
> + // x1 uint8_t *pix1
> + // x2 uint8_t *pix2
> + // x3 ptrdiff_t stride
> + // w4 int h
> +
> + str x0, [sp, #-0x40]!
> + stp x1, x2, [sp, #0x10]
> + stp x3, x4, [sp, #0x20]
> + str lr, [sp, #0x30]
> + bl sse16_neon
> + ldr lr, [sp, #0x30]
This breaks building in two configurations; old binutils doesn't recognize
the register name lr, so you need to spell out x30.
Building on macOS breaks since there's no symbol named sse16_neon; this is
an exported function, so it has got the symbol prefix _. So you need to do
"bl X(sse16_neon)" here.
Didn't look at the code from a performance perspective yet.
// Martin
* Re: [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for vsad16
2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for vsad16 Hubert Mazur
@ 2022-09-02 21:49 ` Martin Storsjö
0 siblings, 0 replies; 14+ messages in thread
From: Martin Storsjö @ 2022-09-02 21:49 UTC
To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop
On Mon, 22 Aug 2022, Hubert Mazur wrote:
> Provide optimized implementation of vsad16 function for arm64.
>
> Performance comparison tests are shown below.
> - vsad_0_c: 285.0
> - vsad_0_neon: 42.5
>
> Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c | 5 ++
> libavcodec/aarch64/me_cmp_neon.S | 75 ++++++++++++++++++++++++
> 2 files changed, 80 insertions(+)
>
> diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
> index 4198985c6c..d4c0099854 100644
> --- a/libavcodec/aarch64/me_cmp_neon.S
> +++ b/libavcodec/aarch64/me_cmp_neon.S
> @@ -584,3 +584,78 @@ function sse4_neon, export=1
>
> ret
> endfunc
> +
> +function vsad16_neon, export=1
> + // x0 unused
> + // x1 uint8_t *pix1
> + // x2 uint8_t *pix2
> + // x3 ptrdiff_t stride
> + // w4 int h
> +
> + sub w4, w4, #1 // we need to make h-1 iterations
> + movi v16.8h, #0
> +
> + cmp w4, #3 // check if we can make 3 iterations at once
> + add x5, x1, x3 // pix1 + stride
> + add x6, x2, x3 // pix2 + stride
> + b.le 2f
> +
> +1:
> + // abs(pix1[0] - pix2[0] - pix1[0 + stride] + pix2[0 + stride])
> + // abs(x) = (x < 0 ? (-x) : (x))
This comment (the second line) about abs() doesn't really add anything of
value here, does it?
> + ld1 {v0.16b}, [x1], x3 // Load pix1[0], first iteration
> + ld1 {v1.16b}, [x2], x3 // Load pix2[0], first iteration
> + ld1 {v2.16b}, [x5], x3 // Load pix1[0 + stride], first iteration
> + usubl v31.8h, v0.8b, v1.8b // Signed difference pix1[0] - pix2[0], first iteration
> + ld1 {v3.16b}, [x6], x3 // Load pix2[0 + stride], first iteration
> + usubl2 v30.8h, v0.16b, v1.16b // Signed difference pix1[0] - pix2[0], first iteration
> + usubl v29.8h, v2.8b, v3.8b // Signed difference pix1[0 + stride] - pix2[0 + stride], first iteration
> + ld1 {v4.16b}, [x1], x3 // Load pix1[0], second iteration
> + usubl2 v28.8h, v2.16b, v3.16b // Signed difference pix1[0 + stride] - pix2[0 + stride], first iteration
> + ld1 {v5.16b}, [x2], x3 // Load pix2[0], second iteration
> + saba v16.8h, v31.8h, v29.8h // Signed absolute difference and accumulate the result. first iteration
> + ld1 {v6.16b}, [x5], x3 // Load pix1[0 + stride], second iteration
> + saba v16.8h, v30.8h, v28.8h // Signed absolute difference and accumulate the result. first iteration
> + usubl v27.8h, v4.8b, v5.8b // Signed difference pix1[0] - pix2[0], second iteration
> + ld1 {v7.16b}, [x6], x3 // Load pix2[0 + stride], second iteration
> + usubl2 v26.8h, v4.16b, v5.16b // Signed difference pix1[0] - pix2[0], second iteration
> + usubl v25.8h, v6.8b, v7.8b // Signed difference pix1[0 + stride] - pix2[0 + stride], second iteration
> + ld1 {v17.16b}, [x1], x3 // Load pix1[0], third iteration
> + usubl2 v24.8h, v6.16b, v7.16b // Signed difference pix1[0 + stride] - pix2[0 + stride], second iteration
> + ld1 {v18.16b}, [x2], x3 // Load pix2[0], third iteration
> + saba v16.8h, v27.8h, v25.8h // Signed absolute difference and accumulate the result. second iteration
> + ld1 {v19.16b}, [x5], x3 // Load pix1[0 + stride], third iteration
> + saba v16.8h, v26.8h, v24.8h // Signed absolute difference and accumulate the result. second iteration
> + usubl v23.8h, v17.8b, v18.8b // Signed difference pix1[0] - pix2[0], third iteration
> + ld1 {v20.16b}, [x6], x3 // Load pix2[0 + stride], third iteration
> + usubl2 v22.8h, v17.16b, v18.16b // Signed difference pix1[0] - pix2[0], third iteration
> + usubl v21.8h, v19.8b, v20.8b // Signed difference pix1[0 + stride] - pix2[0 + stride], third iteration
> + sub w4, w4, #3 // h -= 3
> + saba v16.8h, v23.8h, v21.8h // Signed absolute difference and accumulate the result. third iteration
> + usubl2 v31.8h, v19.16b, v20.16b // Signed difference pix1[0 + stride] - pix2[0 + stride], third iteration
> + cmp w4, #3
> + saba v16.8h, v22.8h, v31.8h // Signed absolute difference and accumulate the result. third iteration
This unrolled implementation isn't really interleaved much at all. In such
a case there's not really much benefit from an unrolled implementation,
other than saving a couple decrements/branch instructions.
I tested playing with your code here. Originally, checkasm gave me these
benchmark numbers:
Cortex A53 A72 A73
vsad_0_neon: 162.7 74.5 67.7
When I switched to only running the non-unrolled version below, I got
these numbers:
vsad_0_neon: 165.2 78.5 76.0
I.e. hardly any difference at all. In such a case, we're just wasting a
lot of code size (and code maintainability!) on big unrolled code without
much benefit at all. But we can do better.
> +
> + b.ge 1b
> + cbz w4, 3f
> +2:
> +
> + ld1 {v0.16b}, [x1], x3
> + ld1 {v1.16b}, [x2], x3
> + ld1 {v2.16b}, [x5], x3
> + usubl v30.8h, v0.8b, v1.8b
> + ld1 {v3.16b}, [x6], x3
> + usubl2 v29.8h, v0.16b, v1.16b
> + usubl v28.8h, v2.8b, v3.8b
> + usubl2 v27.8h, v2.16b, v3.16b
> + saba v16.8h, v30.8h, v28.8h
> + subs w4, w4, #1
> + saba v16.8h, v29.8h, v27.8h
Now let's look at this simple non-unrolled implementation first. We're
doing a lot of redundant work here (and in the other implementation).
When we've loaded v2/v3 in the first round in this loop, and calculated
v28/v27 from them, and we then proceed to the next iteration, v0/v1 will
be identical to v2/v3, and v30/v29 will be identical to v28/v27 from the
previous iteration.
So if we just load one row from pix1/pix2 each and calculate their
difference in the start of the function, and then just load one row from
each, subtract them and accumulate the difference to the previous one,
we're done.
Originally with your non-unrolled implementation, I got these benchmark
numbers:
vsad_0_neon: 165.2 78.5 76.0
After this simplification, I got these numbers:
vsad_0_neon: 131.2 68.5 70.0
In this case, my code looked like this:
ld1 {v0.16b}, [x1], x3
ld1 {v1.16b}, [x2], x3
usubl v31.8h, v0.8b, v1.8b
usubl2 v30.8h, v0.16b, v1.16b
2:
ld1 {v0.16b}, [x1], x3
ld1 {v1.16b}, [x2], x3
subs w4, w4, #1
usubl v29.8h, v0.8b, v1.8b
usubl2 v28.8h, v0.16b, v1.16b
saba v16.8h, v31.8h, v29.8h
mov v31.16b, v29.16b
saba v16.8h, v30.8h, v28.8h
mov v30.16b, v28.16b
b.ne 2b
Isn't that much simpler? After that, I tried doing the same modification
to the unrolled version. For an algorithm this simple, where there are
more than enough registers, I wrote the unrolled implementation like this:
ld1 // first iter
ld1 // first iter
ld1 // second iter
ld1 // second iter
ld1 // third iter
ld1 // third iter
usubl // first
usubl2 // first
usubl ..
usubl2 ..
...
saba // first
saba // first
...
After that, I reordered them so that the first usubl's start a couple of
instructions after the corresponding loads, and then moved the following
saba's closer.
With that, I'm getting these checkasm numbers:
Cortex A53 A72 A73
vsad_0_neon: 108.7 61.2 58.0
As context, this patch originally gave these numbers:
vsad_0_neon: 162.7 74.5 67.7
I.e. a 1.15x speedup on A73, 1.21x speedup on A72 and an 1.5x speedup on
A53.
You can see my version in the attached patch; apply it on top of yours.
// Martin
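The same reuse-the-previous-row idea, spelled out as scalar C for clarity
(a schematic sketch, not code from either patch):

    #include <stdlib.h>
    #include <stddef.h>
    #include <stdint.h>

    static int vsad16_carry_row(const uint8_t *pix1, const uint8_t *pix2,
                                ptrdiff_t stride, int h)
    {
        int16_t prev[16], cur[16];
        int score = 0;

        for (int x = 0; x < 16; x++)         /* difference of the first row */
            prev[x] = pix1[x] - pix2[x];

        for (int y = 1; y < h; y++) {
            pix1 += stride;
            pix2 += stride;
            for (int x = 0; x < 16; x++) {
                cur[x]  = pix1[x] - pix2[x]; /* one new difference per row */
                score  += abs(prev[x] - cur[x]);
                prev[x] = cur[x];            /* carried into the next row */
            }
        }
        return score;
    }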
[-- Attachment #2: 0001-Improve-vsad16_neon.patch --]
From 76eb1f213a72cdfd04a62c773442336cd56e0858 Mon Sep 17 00:00:00 2001
From: Martin Storsjö <martin@martin.st>
Date: Sat, 3 Sep 2022 00:45:55 +0300
Subject: [PATCH] Improve vsad16_neon
---
libavcodec/aarch64/me_cmp_neon.S | 71 ++++++++++++++------------------
1 file changed, 31 insertions(+), 40 deletions(-)
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index ecc9c793d6..7ab8744e0d 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -592,65 +592,56 @@ function vsad16_neon, export=1
// x3 ptrdiff_t stride
// w4 int h
+ ld1 {v0.16b}, [x1], x3 // Load pix1[0], first iteration
+ ld1 {v1.16b}, [x2], x3 // Load pix2[0], first iteration
+
sub w4, w4, #1 // we need to make h-1 iterations
movi v16.8h, #0
cmp w4, #3 // check if we can make 3 iterations at once
- add x5, x1, x3 // pix1 + stride
- add x6, x2, x3 // pix2 + stride
- b.le 2f
+ usubl v31.8h, v0.8b, v1.8b // Signed difference pix1[0] - pix2[0], first iteration
+ usubl2 v30.8h, v0.16b, v1.16b // Signed difference pix1[0] - pix2[0], first iteration
+
+ b.lt 2f
1:
// abs(pix1[0] - pix2[0] - pix1[0 + stride] + pix2[0 + stride])
// abs(x) = (x < 0 ? (-x) : (x))
- ld1 {v0.16b}, [x1], x3 // Load pix1[0], first iteration
- ld1 {v1.16b}, [x2], x3 // Load pix2[0], first iteration
- ld1 {v2.16b}, [x5], x3 // Load pix1[0 + stride], first iteration
- usubl v31.8h, v0.8b, v1.8b // Signed difference pix1[0] - pix2[0], first iteration
- ld1 {v3.16b}, [x6], x3 // Load pix2[0 + stride], first iteration
- usubl2 v30.8h, v0.16b, v1.16b // Signed difference pix1[0] - pix2[0], first iteration
- usubl v29.8h, v2.8b, v3.8b // Signed difference pix1[0 + stride] - pix2[0 + stride], first iteration
- ld1 {v4.16b}, [x1], x3 // Load pix1[0], second iteration
- usubl2 v28.8h, v2.16b, v3.16b // Signed difference pix1[0 + stride] - pix2[0 + stride], first iteration
- ld1 {v5.16b}, [x2], x3 // Load pix2[0], second iteration
+ ld1 {v0.16b}, [x1], x3 // Load pix1[0 + stride], first iteration
+ ld1 {v1.16b}, [x2], x3 // Load pix2[0 + stride], first iteration
+ ld1 {v2.16b}, [x1], x3 // Load pix1[0 + stride], second iteration
+ ld1 {v3.16b}, [x2], x3 // Load pix2[0 + stride], second iteration
+ usubl v29.8h, v0.8b, v1.8b // Signed difference pix1[0 + stride] - pix2[0 + stride], first iteration
+ usubl2 v28.8h, v0.16b, v1.16b // Signed difference pix1[0 + stride] - pix2[0 + stride], first iteration
+ ld1 {v4.16b}, [x1], x3 // Load pix1[0 + stride], third iteration
+ ld1 {v5.16b}, [x2], x3 // Load pix2[0 + stride], third iteration
+ usubl v27.8h, v2.8b, v3.8b // Signed difference pix1[0 + stride] - pix2[0 + stride], second iteration
saba v16.8h, v31.8h, v29.8h // Signed absolute difference and accumulate the result. first iteration
- ld1 {v6.16b}, [x5], x3 // Load pix1[0 + stride], second iteration
+ usubl2 v26.8h, v2.16b, v3.16b // Signed difference pix1[0 + stride] - pix2[0 + stride], second iteration
saba v16.8h, v30.8h, v28.8h // Signed absolute difference and accumulate the result. first iteration
- usubl v27.8h, v4.8b, v5.8b // Signed difference pix1[0] - pix2[0], second iteration
- ld1 {v7.16b}, [x6], x3 // Load pix2[0 + stride], second iteration
- usubl2 v26.8h, v4.16b, v5.16b // Signed difference pix1[0] - pix2[0], second iteration
- usubl v25.8h, v6.8b, v7.8b // Signed difference pix1[0 + stride] - pix2[0 + stride], second iteration
- ld1 {v17.16b}, [x1], x3 // Load pix1[0], third iteration
- usubl2 v24.8h, v6.16b, v7.16b // Signed difference pix1[0 + stride] - pix2[0 + stride], second iteration
- ld1 {v18.16b}, [x2], x3 // Load pix2[0], third iteration
- saba v16.8h, v27.8h, v25.8h // Signed absolute difference and accumulate the result. second iteration
- ld1 {v19.16b}, [x5], x3 // Load pix1[0 + stride], third iteration
- saba v16.8h, v26.8h, v24.8h // Signed absolute difference and accumulate the result. second iteration
- usubl v23.8h, v17.8b, v18.8b // Signed difference pix1[0] - pix2[0], third iteration
- ld1 {v20.16b}, [x6], x3 // Load pix2[0 + stride], third iteration
- usubl2 v22.8h, v17.16b, v18.16b // Signed difference pix1[0] - pix2[0], third iteration
- usubl v21.8h, v19.8b, v20.8b // Signed difference pix1[0 + stride] - pix2[0 + stride], third iteration
+ usubl v25.8h, v4.8b, v5.8b // Signed difference pix1[0 + stride] - pix2[0 + stride], third iteration
+ usubl2 v24.8h, v4.16b, v5.16b // Signed difference pix1[0 + stride] - pix2[0 + stride], third iteration
+ saba v16.8h, v29.8h, v27.8h // Signed absolute difference and accumulate the result. second iteration
+ mov v31.16b, v25.16b
+ saba v16.8h, v28.8h, v26.8h // Signed absolute difference and accumulate the result. second iteration
sub w4, w4, #3 // h -= 3
- saba v16.8h, v23.8h, v21.8h // Signed absolute difference and accumulate the result. third iteration
- usubl2 v31.8h, v19.16b, v20.16b // Signed difference pix1[0 + stride] - pix2[0 + stride], third iteration
+ mov v30.16b, v24.16b
+ saba v16.8h, v27.8h, v25.8h // Signed absolute difference and accumulate the result. third iteration
cmp w4, #3
- saba v16.8h, v22.8h, v31.8h // Signed absolute difference and accumulate the result. third iteration
+ saba v16.8h, v26.8h, v24.8h // Signed absolute difference and accumulate the result. third iteration
b.ge 1b
cbz w4, 3f
2:
-
ld1 {v0.16b}, [x1], x3
ld1 {v1.16b}, [x2], x3
- ld1 {v2.16b}, [x5], x3
- usubl v30.8h, v0.8b, v1.8b
- ld1 {v3.16b}, [x6], x3
- usubl2 v29.8h, v0.16b, v1.16b
- usubl v28.8h, v2.8b, v3.8b
- usubl2 v27.8h, v2.16b, v3.16b
- saba v16.8h, v30.8h, v28.8h
subs w4, w4, #1
- saba v16.8h, v29.8h, v27.8h
+ usubl v29.8h, v0.8b, v1.8b
+ usubl2 v28.8h, v0.16b, v1.16b
+ saba v16.8h, v31.8h, v29.8h
+ mov v31.16b, v29.16b
+ saba v16.8h, v30.8h, v28.8h
+ mov v30.16b, v28.16b
b.ne 2b
3:
--
2.25.1
* Re: [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation of vsse16
2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation of vsse16 Hubert Mazur
@ 2022-09-04 20:53 ` Martin Storsjö
0 siblings, 0 replies; 14+ messages in thread
From: Martin Storsjö @ 2022-09-04 20:53 UTC
To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop
On Mon, 22 Aug 2022, Hubert Mazur wrote:
> Provide optimized implementation of vsse16 for arm64.
>
> Performance comparison tests are shown below.
> - vsse_0_c: 254.4
> - vsse_0_neon: 64.7
>
> Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c | 4 +
> libavcodec/aarch64/me_cmp_neon.S | 97 ++++++++++++++++++++++++
> 2 files changed, 101 insertions(+)
>
> diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
> index ddc5d05611..7b81e48d16 100644
> --- a/libavcodec/aarch64/me_cmp_init_aarch64.c
> +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
> @@ -43,6 +43,8 @@ int sse4_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
>
> int vsad16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
> ptrdiff_t stride, int h);
> +int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
> + ptrdiff_t stride, int h);
>
> av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
> {
> @@ -62,5 +64,7 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
> c->sse[2] = sse4_neon;
>
> c->vsad[0] = vsad16_neon;
> +
> + c->vsse[0] = vsse16_neon;
> }
> }
> diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
> index d4c0099854..279bae7cb5 100644
> --- a/libavcodec/aarch64/me_cmp_neon.S
> +++ b/libavcodec/aarch64/me_cmp_neon.S
> @@ -659,3 +659,100 @@ function vsad16_neon, export=1
>
> ret
> endfunc
> +
> +function vsse16_neon, export=1
> + // x0 unused
> + // x1 uint8_t *pix1
> + // x2 uint8_t *pix2
> + // x3 ptrdiff_t stride
> + // w4 int h
> +
> + movi v30.4s, #0
> + movi v29.4s, #0
> +
> + add x5, x1, x3 // pix1 + stride
> + add x6, x2, x3 // pix2 + stride
> + sub w4, w4, #1 // we need to make h-1 iterations
> + cmp w4, #3 // check if we can make 4 iterations at once
> + b.le 2f
> +
> +// make 4 iterations at once
The comments seem to talk about 4 iterations at once while the code
actually only does 3.
> +1:
> + // x = abs(pix1[0] - pix2[0] - pix1[0 + stride] + pix2[0 + stride]) =
The comment seems a bit un-updated here, since there's no abs() involved
here
> + // res = (x) * (x)
> + ld1 {v0.16b}, [x1], x3 // Load pix1[0], first iteration
> + ld1 {v1.16b}, [x2], x3 // Load pix2[0], first iteration
> + ld1 {v2.16b}, [x5], x3 // Load pix1[0 + stride], first iteration
> + usubl v28.8h, v0.8b, v1.8b // Signed difference of pix1[0] - pix2[0], first iteration
> + ld1 {v3.16b}, [x6], x3 // Load pix2[0 + stride], first iteration
> + usubl2 v27.8h, v0.16b, v1.16b // Signed difference of pix1[0] - pix2[0], first iteration
> + usubl v26.8h, v3.8b, v2.8b // Signed difference of pix1[0 + stride] - pix2[0 + stride], first iteration
> + usubl2 v25.8h, v3.16b, v2.16b // Signed difference of pix1[0 + stride] - pix2[0 + stride], first iteration
Same thing about reusing data from the previous row, as for the previous
patch.
// Martin
* Re: [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for vsad_intra16
2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for vsad_intra16 Hubert Mazur
@ 2022-09-04 20:58 ` Martin Storsjö
0 siblings, 0 replies; 14+ messages in thread
From: Martin Storsjö @ 2022-09-04 20:58 UTC
To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop
On Mon, 22 Aug 2022, Hubert Mazur wrote:
> Provide optimized implementation for vsad_intra16 function for arm64.
>
> Performance comparison tests are shown below.
> - vsad_4_c: 177.2
> - vsad_4_neon: 24.5
>
> Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c | 3 ++
> libavcodec/aarch64/me_cmp_neon.S | 58 ++++++++++++++++++++++++
> 2 files changed, 61 insertions(+)
Same thing as for the others; keep the data for the previous row in
registers instead of loading it twice.
// Martin
* Re: [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for vsse_intra16
2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for vsse_intra16 Hubert Mazur
@ 2022-09-04 20:59 ` Martin Storsjö
0 siblings, 0 replies; 14+ messages in thread
From: Martin Storsjö @ 2022-09-04 20:59 UTC
To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop
On Mon, 22 Aug 2022, Hubert Mazur wrote:
> Provide optimized implementation for vsse_intra16 for arm64.
>
> Performance tests are shown below.
> - vsse_4_c: 153.7
> - vsse_4_neon: 34.2
>
> Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c | 3 +
> libavcodec/aarch64/me_cmp_neon.S | 75 ++++++++++++++++++++++++
> 2 files changed, 78 insertions(+)
The same comment as for the others.
// Martin
* Re: [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16
2022-08-22 15:26 ` [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16 Hubert Mazur
2022-09-02 21:29 ` Martin Storsjö
@ 2022-09-04 21:23 ` Martin Storsjö
1 sibling, 0 replies; 14+ messages in thread
From: Martin Storsjö @ 2022-09-04 21:23 UTC
To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop
On Mon, 22 Aug 2022, Hubert Mazur wrote:
> Add vectorized implementation of nsse16 function.
>
> Performance comparison tests are shown below.
> - nsse_0_c: 707.0
> - nsse_0_neon: 120.0
>
> Benchmarks and tests run with checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c | 15 +++
> libavcodec/aarch64/me_cmp_neon.S | 126 +++++++++++++++++++++++
> 2 files changed, 141 insertions(+)
>
> diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
> index 8c295d5457..146ef04345 100644
> --- a/libavcodec/aarch64/me_cmp_init_aarch64.c
> +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
> @@ -49,6 +49,10 @@ int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
> ptrdiff_t stride, int h);
> int vsse_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
> ptrdiff_t stride, int h);
> +int nsse16_neon(int multiplier, const uint8_t *s, const uint8_t *s2,
> + ptrdiff_t stride, int h);
> +int nsse16_neon_wrapper(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
> + ptrdiff_t stride, int h);
>
> av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
> {
> @@ -72,5 +76,16 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
>
> c->vsse[0] = vsse16_neon;
> c->vsse[4] = vsse_intra16_neon;
> +
> + c->nsse[0] = nsse16_neon_wrapper;
> }
> }
> +
> +int nsse16_neon_wrapper(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
> + ptrdiff_t stride, int h)
> +{
> + if (c)
> + return nsse16_neon(c->avctx->nsse_weight, s1, s2, stride, h);
> + else
> + return nsse16_neon(8, s1, s2, stride, h);
> +}
> \ No newline at end of file
The indentation is off for this file, and it's missing the final newline.
> diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
> index 46d4dade5d..9fe96e111c 100644
> --- a/libavcodec/aarch64/me_cmp_neon.S
> +++ b/libavcodec/aarch64/me_cmp_neon.S
> @@ -889,3 +889,129 @@ function vsse_intra16_neon, export=1
>
> ret
> endfunc
> +
> +function nsse16_neon, export=1
> + // x0 multiplier
> + // x1 uint8_t *pix1
> + // x2 uint8_t *pix2
> + // x3 ptrdiff_t stride
> + // w4 int h
> +
> + str x0, [sp, #-0x40]!
> + stp x1, x2, [sp, #0x10]
> + stp x3, x4, [sp, #0x20]
> + str lr, [sp, #0x30]
> + bl sse16_neon
> + ldr lr, [sp, #0x30]
> + mov w9, w0 // here we store score1
> + ldr x5, [sp]
> + ldp x1, x2, [sp, #0x10]
> + ldp x3, x4, [sp, #0x20]
> + add sp, sp, #0x40
> +
> + movi v16.8h, #0
> + movi v17.8h, #0
> + movi v18.8h, #0
> + movi v19.8h, #0
> +
> + mov x10, x1 // x1
> + mov x14, x2 // x2
I don't see why you need to make a copy of x1/x2 here, as you don't use
x1/x2 after this at all.
> + add x11, x1, x3 // x1 + stride
> + add x15, x2, x3 // x2 + stride
> + add x12, x1, #1 // x1 + 1
> + add x16, x2, #1 // x2 + 1
FWIW, instead of making two loads, for [x1] and [x1+1], as we don't need
the final value at [x1+16], I would normally just do one load of [x1] and
then make a shifted version with the 'ext' instruction; ext is generally
cheaper than doing redundant loads. On the other hand, by doing two loads,
you don't have a serial dependency on the first load.
> +// iterate by one
> +2:
> + ld1 {v0.16b}, [x10], x3
> + ld1 {v1.16b}, [x11], x3
> + ld1 {v2.16b}, [x12], x3
> + usubl v31.8h, v0.8b, v1.8b
> + ld1 {v3.16b}, [x13], x3
> + usubl2 v30.8h, v0.16b, v1.16b
> + usubl v29.8h, v2.8b, v3.8b
> + usubl2 v28.8h, v2.16b, v3.16b
> + saba v16.8h, v31.8h, v29.8h
> + ld1 {v4.16b}, [x14], x3
> + ld1 {v5.16b}, [x15], x3
> + saba v17.8h, v30.8h, v28.8h
> + ld1 {v6.16b}, [x16], x3
> + usubl v27.8h, v4.8b, v5.8b
> + ld1 {v7.16b}, [x17], x3
So, looking at the main implementation structure here through the
non-unrolled version: you're doing 8 loads per iteration - and I would
say you can do this with 2 loads per iteration.
By reusing the loaded data from the previous iteration instead of
duplicated loading, you can get this down from 8 to 4 loads. And by
shifting with 'ext' instead of a separate load, you can get it down to 2
loads. (Then again, with 4 loads instead of 2, you can have the
overlapping loads running in parallel, instead of having to wait for the
first load to complete if using ext. I'd suggest trying both and seeing
which one worke better - although the tradeoff might be different between
different cores. But storing data from the previous line instead of such
duplicated loading is certainly better in any case.)
> + usubl2 v26.8h, v4.16b, v5.16b
> + usubl v25.8h, v6.8b, v7.8b
> + usubl2 v24.8h, v6.16b, v7.16b
> + saba v18.8h, v27.8h, v25.8h
> + subs w4, w4, #1
> + saba v19.8h, v26.8h, v24.8h
> +
> + cbnz w4, 2b
> +
> +3:
> + sqsub v16.8h, v16.8h, v18.8h
> + sqsub v17.8h, v17.8h, v19.8h
> + ins v17.h[7], wzr
It is very good that you figured out how to handle the odd element here
outside of the loops!
// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 14+ messages in thread
* [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for vsse_intra16
2022-09-08 9:25 [FFmpeg-devel] [PATCH 0/5] Provide optimized neon implementation Hubert Mazur
@ 2022-09-08 9:25 ` Hubert Mazur
0 siblings, 0 replies; 14+ messages in thread
From: Hubert Mazur @ 2022-09-08 9:25 UTC
To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop
Provide optimized implementation for vsse_intra16 for arm64.
Performance tests are shown below.
- vsse_4_c: 155.2
- vsse_4_neon: 36.2
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
libavcodec/aarch64/me_cmp_init_aarch64.c | 3 ++
libavcodec/aarch64/me_cmp_neon.S | 63 ++++++++++++++++++++++++
2 files changed, 66 insertions(+)
diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index af83f7ed1e..8c295d5457 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -47,6 +47,8 @@ int vsad_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
ptrdiff_t stride, int h);
int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
ptrdiff_t stride, int h);
+int vsse_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
+ ptrdiff_t stride, int h);
av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
{
@@ -69,5 +71,6 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
c->vsad[4] = vsad_intra16_neon;
c->vsse[0] = vsse16_neon;
+ c->vsse[4] = vsse_intra16_neon;
}
}
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index ce198ea227..cf2b8da425 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -784,3 +784,66 @@ function vsad_intra16_neon, export=1
ret
endfunc
+
+function vsse_intra16_neon, export=1
+ // x0 unused
+ // x1 uint8_t *pix1
+ // x2 uint8_t *dummy
+ // x3 ptrdiff_t stride
+ // w4 int h
+
+ ld1 {v0.16b}, [x1], x3
+ movi v16.4s, #0
+ movi v17.4s, #0
+
+ sub w4, w4, #1 // we need to make h-1 iterations
+ cmp w4, #3
+ b.lt 2f
+
+1:
+ // v = abs( pix1[0] - pix1[0 + stride] )
+ // score = sum( v * v )
+ ld1 {v1.16b}, [x1], x3
+ ld1 {v2.16b}, [x1], x3
+ uabd v30.16b, v0.16b, v1.16b
+ ld1 {v3.16b}, [x1], x3
+ umull v29.8h, v30.8b, v30.8b
+ umull2 v28.8h, v30.16b, v30.16b
+ uabd v27.16b, v1.16b, v2.16b
+ uadalp v16.4s, v29.8h
+ umull v26.8h, v27.8b, v27.8b
+ umull2 v27.8h, v27.16b, v27.16b
+ uadalp v17.4s, v28.8h
+ uabd v25.16b, v2.16b, v3.16b
+ uadalp v16.4s, v26.8h
+ umull v24.8h, v25.8b, v25.8b
+ umull2 v25.8h, v25.16b, v25.16b
+ uadalp v17.4s, v27.8h
+ sub w4, w4, #3
+ uadalp v16.4s, v24.8h
+ cmp w4, #3
+ uadalp v17.4s, v25.8h
+ mov v0.16b, v3.16b
+
+ b.ge 1b
+ cbz w4, 3f
+
+// iterate by one
+2:
+ ld1 {v1.16b}, [x1], x3
+ subs w4, w4, #1
+ uabd v30.16b, v0.16b, v1.16b
+ mov v0.16b, v1.16b
+ umull v29.8h, v30.8b, v30.8b
+ umull2 v30.8h, v30.16b, v30.16b
+ uadalp v16.4s, v29.8h
+ uadalp v17.4s, v30.8h
+ cbnz w4, 2b
+
+3:
+ add v16.4s, v16.4s, v17.4s
+ uaddlv d17, v16.4s
+ fmov w0, s17
+
+ ret
+endfunc
--
2.34.1
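A scalar sketch of what this function computes (the assumed shape of the
me_cmp.c reference, not copied verbatim; vsse_intra16_ref is a made-up
name): the sum of squared vertical differences over h-1 row pairs, which
is why the assembly only makes h-1 iterations.

#include <stddef.h>
#include <stdint.h>

static int vsse_intra16_ref(const uint8_t *pix1, ptrdiff_t stride, int h)
{
    int score = 0;
    for (int y = 0; y < h - 1; y++) {        /* h-1 row pairs */
        for (int x = 0; x < 16; x++) {
            int v = pix1[x] - pix1[x + stride];
            score += v * v;                  /* uabd + umull/umull2 + uadalp */
        }
        pix1 += stride;
    }
    return score;
}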