* [FFmpeg-devel] [PATCH 0/4] Provide neon implementations for hscale functions
From: Hubert Mazur @ 2022-10-17 13:07 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop
Provide arm64 NEON-optimized functions from the swscale family.
Hubert Mazur (4):
sw_scale: Add specializations for hscale 8 to 19
tests/sw_scale: Add test cases for input sizes 16
sw_scale: Add specializations for hscale 16 to 15
sw_scale: Add specializations for hscale 16 to 19
libswscale/aarch64/hscale.S | 1101 +++++++++++++++++++++++++++++++++-
libswscale/aarch64/swscale.c | 145 ++++-
libswscale/swscale.c | 3 +-
tests/checkasm/sw_scale.c | 35 +-
4 files changed, 1268 insertions(+), 16 deletions(-)
--
2.37.1
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
* [FFmpeg-devel] [PATCH 1/4] sw_scale: Add specializations for hscale 8 to 19
From: Hubert Mazur @ 2022-10-17 13:07 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop
Add arm64 neon implementations for hscale 8 to 19 with filter
sizes 4, X4 and X8. The implementations are based on the very similar
ones dedicated to hscale 8 to 15. The major change is in how the data
is stored: the result is written as int32_t instead of int16_t.
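For reference, the scalar operation that these functions vectorize can be sketched in C as below. This is a minimal sketch inferred from the assembly in this patch (accumulate src * filter, shift right by 3, clamp to (1 << 19) - 1, store as int32_t). The name hscale8to19_ref is hypothetical, and the unused SwsContext argument is left out.

```c
#include <stdint.h>

/* Hypothetical scalar reference for the 8 -> 19 bit horizontal scaler.
 * Not part of the patch; it only illustrates the arithmetic. */
static void hscale8to19_ref(int32_t *dst, int dstW, const uint8_t *src,
                            const int16_t *filter, const int32_t *filterPos,
                            int filterSize)
{
    const int32_t max = (1 << 19) - 1; /* largest value allowed in dst */

    for (int i = 0; i < dstW; i++) {
        int32_t val = 0;

        /* Dot product of filterSize source bytes with the i-th filter row. */
        for (int j = 0; j < filterSize; j++)
            val += (int32_t)src[filterPos[i] + j] * filter[filterSize * i + j];

        val >>= 3;                      /* same shift as "sshr ..., #3" */
        dst[i] = val < max ? val : max; /* same clamp as "smin" */
    }
}
```

The only difference from the 8 to 15 case in this sketch is the destination type and the clamp value, which is why the NEON code can reuse the same load and multiply structure.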
These functions are heavily inspired by the patches provided by J. Swinney
and M. Storsjö for hscale8to15, which were slightly adapted for
hscale8to19.
The tests and benchmarks were run on AWS Graviton 2 instances. The results
from the checkasm tool are shown below.
hscale_8_to_19__fs_4_dstW_512_c: 5663.2
hscale_8_to_19__fs_4_dstW_512_neon: 1259.7
hscale_8_to_19__fs_8_dstW_512_c: 9306.0
hscale_8_to_19__fs_8_dstW_512_neon: 2020.2
hscale_8_to_19__fs_12_dstW_512_c: 12932.7
hscale_8_to_19__fs_12_dstW_512_neon: 2462.5
hscale_8_to_19__fs_16_dstW_512_c: 16844.2
hscale_8_to_19__fs_16_dstW_512_neon: 4671.2
hscale_8_to_19__fs_32_dstW_512_c: 32803.7
hscale_8_to_19__fs_32_dstW_512_neon: 5474.2
hscale_8_to_19__fs_40_dstW_512_c: 40948.0
hscale_8_to_19__fs_40_dstW_512_neon: 6669.7
Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
libswscale/aarch64/hscale.S | 292 ++++++++++++++++++++++++++++++++++-
libswscale/aarch64/swscale.c | 13 +-
2 files changed, 300 insertions(+), 5 deletions(-)
diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S
index a16d3dca42..5e8cad9825 100644
--- a/libswscale/aarch64/hscale.S
+++ b/libswscale/aarch64/hscale.S
@@ -218,7 +218,6 @@ function ff_hscale8to15_4_neon, export=1
// 2. Interleaved prefetching src data and madd
// 3. Complete madd
// 4. Complete remaining iterations when dstW % 8 != 0
-
sub sp, sp, #32 // allocate 32 bytes on the stack
cmp w2, #16 // if dstW <16, skip to the last block used for wrapping up
b.lt 2f
@@ -347,3 +346,294 @@ function ff_hscale8to15_4_neon, export=1
add sp, sp, #32 // clean up stack
ret
endfunc
+
+function ff_hscale8to19_4_neon, export=1
+ // x0 SwsContext *c (unused)
+ // x1 int32_t *dst
+ // w2 int dstW
+ // x3 const uint8_t *src
+ // x4 const int16_t *filter
+ // x5 const int32_t *filterPos
+ // w6 int filterSize
+
+ movi v18.4s, #1
+ movi v17.4s, #1
+ shl v18.4s, v18.4s, #19
+ sub v18.4s, v18.4s, v17.4s // max allowed value
+
+ cmp w2, #16
+ b.lt 2f // move to last block
+
+ ldp w8, w9, [x5] // filterPos[0], filterPos[1]
+ ldp w10, w11, [x5, #8] // filterPos[2], filterPos[3]
+ ldp w12, w13, [x5, #16] // filterPos[4], filterPos[5]
+ ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7]
+ add x5, x5, #32
+
+ // load 4 source bytes at each filterPos offset
+ ldr w8, [x3, w8, UXTW]
+ ldr w9, [x3, w9, UXTW]
+ ldr w10, [x3, w10, UXTW]
+ ldr w11, [x3, w11, UXTW]
+ ldr w12, [x3, w12, UXTW]
+ ldr w13, [x3, w13, UXTW]
+ ldr w14, [x3, w14, UXTW]
+ ldr w15, [x3, w15, UXTW]
+
+ sub sp, sp, #32
+
+ stp w8, w9, [sp]
+ stp w10, w11, [sp, #8]
+ stp w12, w13, [sp, #16]
+ stp w14, w15, [sp, #24]
+
+1:
+ ld4 {v0.8b, v1.8b, v2.8b, v3.8b}, [sp]
+ ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7]
+ // load filterPositions into registers for next iteration
+
+ ldp w8, w9, [x5] // filterPos[0], filterPos[1]
+ ldp w10, w11, [x5, #8] // filterPos[2], filterPos[3]
+ ldp w12, w13, [x5, #16] // filterPos[4], filterPos[5]
+ ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7]
+ add x5, x5, #32
+ uxtl v0.8h, v0.8b
+ ldr w8, [x3, w8, UXTW]
+ smull v5.4s, v0.4h, v28.4h // multiply first column of src
+ ldr w9, [x3, w9, UXTW]
+ smull2 v6.4s, v0.8h, v28.8h
+ stp w8, w9, [sp]
+
+ uxtl v1.8h, v1.8b
+ ldr w10, [x3, w10, UXTW]
+ smlal v5.4s, v1.4h, v29.4h // multiply second column of src
+ ldr w11, [x3, w11, UXTW]
+ smlal2 v6.4s, v1.8h, v29.8h
+ stp w10, w11, [sp, #8]
+
+ uxtl v2.8h, v2.8b
+ ldr w12, [x3, w12, UXTW]
+ smlal v5.4s, v2.4h, v30.4h // multiply third column of src
+ ldr w13, [x3, w13, UXTW]
+ smlal2 v6.4s, v2.8h, v30.8h
+ stp w12, w13, [sp, #16]
+
+ uxtl v3.8h, v3.8b
+ ldr w14, [x3, w14, UXTW]
+ smlal v5.4s, v3.4h, v31.4h // multiply fourth column of src
+ ldr w15, [x3, w15, UXTW]
+ smlal2 v6.4s, v3.8h, v31.8h
+ stp w14, w15, [sp, #24]
+
+ sub w2, w2, #8
+ sshr v5.4s, v5.4s, #3
+ sshr v6.4s, v6.4s, #3
+ smin v5.4s, v5.4s, v18.4s
+ smin v6.4s, v6.4s, v18.4s
+
+ st1 {v5.4s, v6.4s}, [x1], #32
+ cmp w2, #16
+ b.ge 1b
+
+ // here we make last iteration, without updating the registers
+ ld4 {v0.8b, v1.8b, v2.8b, v3.8b}, [sp]
+ ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7]
+
+ uxtl v0.8h, v0.8b
+ uxtl v1.8h, v1.8b
+ smull v5.4s, v0.4h, v28.4h
+ smull2 v6.4s, v0.8h, v28.8h
+ uxtl v2.8h, v2.8b
+ smlal v5.4s, v1.4h, v29.4H
+ smlal2 v6.4s, v1.8h, v29.8H
+ uxtl v3.8h, v3.8b
+ smlal v5.4s, v2.4h, v30.4H
+ smlal2 v6.4s, v2.8h, v30.8H
+ smlal v5.4s, v3.4h, v31.4H
+ smlal2 v6.4s, v3.8h, v31.8h
+
+ sshr v5.4s, v5.4s, #3
+ sshr v6.4s, v6.4s, #3
+
+ smin v5.4s, v5.4s, v18.4s
+ smin v6.4s, v6.4s, v18.4s
+
+ sub w2, w2, #8
+ st1 {v5.4s, v6.4s}, [x1], #32
+ add sp, sp, #32 // restore stack
+ cbnz w2, 2f
+
+ ret
+
+2:
+ ldr w8, [x5], #4 // load filterPos
+ add x9, x3, w8, UXTW // src + filterPos
+ ld1 {v0.s}[0], [x9] // load 4 x uint8_t into a single 32-bit lane
+ ld1 {v31.4h}, [x4], #8
+ uxtl v0.8h, v0.8b
+ smull v5.4s, v0.4h, v31.4H
+ saddlv d0, v5.4S
+ sqshrn s0, d0, #3
+ smin v0.4s, v0.4s, v18.4s
+ st1 {v0.s}[0], [x1], #4
+ sub w2, w2, #1
+ cbnz w2, 2b // if iterations remain jump to beginning
+
+ ret
+endfunc
+
+function ff_hscale8to19_X8_neon, export=1
+ movi v20.4s, #1
+ movi v17.4s, #1
+ shl v20.4s, v20.4s, #19
+ sub v20.4s, v20.4s, v17.4s
+
+ sbfiz x7, x6, #1, #32 // filterSize*2 (*2 because int16)
+1:
+ mov x16, x4 // filter0 = filter
+ ldr w8, [x5], #4 // filterPos[idx]
+ add x12, x16, x7 // filter1 = filter0 + filterSize*2
+ ldr w0, [x5], #4 // filterPos[idx + 1]
+ add x13, x12, x7 // filter2 = filter1 + filterSize*2
+ ldr w11, [x5], #4 // filterPos[idx + 2]
+ add x4, x13, x7 // filter3 = filter2 + filterSize*2
+ ldr w9, [x5], #4 // filterPos[idx + 3]
+ movi v0.2D, #0 // val sum part 1 (for dst[0])
+ movi v1.2D, #0 // val sum part 2 (for dst[1])
+ movi v2.2D, #0 // val sum part 3 (for dst[2])
+ movi v3.2D, #0 // val sum part 4 (for dst[3])
+ add x17, x3, w8, UXTW // srcp + filterPos[0]
+ add x8, x3, w0, UXTW // srcp + filterPos[1]
+ add x0, x3, w11, UXTW // srcp + filterPos[2]
+ add x11, x3, w9, UXTW // srcp + filterPos[3]
+ mov w15, w6 // filterSize counter
+2: ld1 {v4.8B}, [x17], #8 // srcp[filterPos[0] + {0..7}]
+ ld1 {v5.8H}, [x16], #16 // load 8x16-bit filter values, part 1
+ uxtl v4.8H, v4.8B // unpack part 1 to 16-bit
+ smlal v0.4S, v4.4H, v5.4H // v0 accumulates srcp[filterPos[0] + {0..3}] * filter[{0..3}]
+ ld1 {v6.8B}, [x8], #8 // srcp[filterPos[1] + {0..7}]
+ smlal2 v0.4S, v4.8H, v5.8H // v0 accumulates srcp[filterPos[0] + {4..7}] * filter[{4..7}]
+ ld1 {v7.8H}, [x12], #16 // load 8x16-bit at filter+filterSize
+ ld1 {v16.8B}, [x0], #8 // srcp[filterPos[2] + {0..7}]
+ uxtl v6.8H, v6.8B // unpack part 2 to 16-bit
+ ld1 {v17.8H}, [x13], #16 // load 8x16-bit at filter+2*filterSize
+ uxtl v16.8H, v16.8B // unpack part 3 to 16-bit
+ smlal v1.4S, v6.4H, v7.4H // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}]
+ ld1 {v18.8B}, [x11], #8 // srcp[filterPos[3] + {0..7}]
+ smlal v2.4S, v16.4H, v17.4H // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}]
+ ld1 {v19.8H}, [x4], #16 // load 8x16-bit at filter+3*filterSize
+ smlal2 v2.4S, v16.8H, v17.8H // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}]
+ uxtl v18.8H, v18.8B // unpack part 4 to 16-bit
+ smlal2 v1.4S, v6.8H, v7.8H // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}]
+ smlal v3.4S, v18.4H, v19.4H // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}]
+ subs w15, w15, #8 // j -= 8: processed 8/filterSize
+ smlal2 v3.4S, v18.8H, v19.8H // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}]
+ b.gt 2b // inner loop if filterSize not consumed completely
+ addp v0.4S, v0.4S, v1.4S // part01 horizontal pair adding
+ addp v2.4S, v2.4S, v3.4S // part23 horizontal pair adding
+ addp v0.4S, v0.4S, v2.4S // part0123 horizontal pair adding
+ subs w2, w2, #4 // dstW -= 4
+ sshr v0.4s, v0.4S, #3 // shift and clip the 4x32-bit final values
+ smin v0.4s, v0.4s, v20.4s
+ st1 {v0.4s}, [x1], #16 // write to destination part0123
+ b.gt 1b // loop until end of line
+ ret
+endfunc
+
+function ff_hscale8to19_X4_neon, export=1
+ // x0 SwsContext *c (not used)
+ // x1 int32_t *dst
+ // w2 int dstW
+ // x3 const uint8_t *src
+ // x4 const int16_t *filter
+ // x5 const int32_t *filterPos
+ // w6 int filterSize
+
+ movi v20.4s, #1
+ movi v17.4s, #1
+ shl v20.4s, v20.4s, #19
+ sub v20.4s, v20.4s, v17.4s
+
+ lsl w7, w6, #1
+1:
+ ldp w8, w9, [x5]
+ ldp w10, w11, [x5, #8]
+
+ movi v16.2d, #0 // initialize accumulator for idx + 0
+ movi v17.2d, #0 // initialize accumulator for idx + 1
+ movi v18.2d, #0 // initialize accumulator for idx + 2
+ movi v19.2d, #0 // initialize accumulator for idx + 3
+
+ mov x12, x4 // filter + 0
+ add x13, x4, x7 // filter + 1
+ add x8, x3, w8, UXTW // srcp + filterPos 0
+ add x14, x13, x7 // filter + 2
+ add x9, x3, w9, UXTW // srcp + filterPos 1
+ add x15, x14, x7 // filter + 3
+ add x10, x3, w10, UXTW // srcp + filterPos 2
+ mov w0, w6 // save the filterSize to temporary variable
+ add x11, x3, w11, UXTW // srcp + filterPos 3
+ add x5, x5, #16 // advance filter position
+ mov x16, xzr // clear the register x16 used for offsetting the filter values
+
+2:
+ ldr d4, [x8], #8 // load src values for idx 0
+ ldr q31, [x12, x16] // load filter values for idx 0
+ uxtl v4.8h, v4.8b // extend type to match the filter's size
+ ldr d5, [x9], #8 // load src values for idx 1
+ smlal v16.4s, v4.4h, v31.4h // multiplication of lower half for idx 0
+ uxtl v5.8h, v5.8b // extend type to match the filter's size
+ ldr q30, [x13, x16] // load filter values for idx 1
+ smlal2 v16.4s, v4.8h, v31.8h // multiplication of upper half for idx 0
+ ldr d6, [x10], #8 // load src values for idx 2
+ ldr q29, [x14, x16] // load filter values for idx 2
+ smlal v17.4s, v5.4h, v30.4H // multiplication of lower half for idx 1
+ ldr d7, [x11], #8 // load src values for idx 3
+ smlal2 v17.4s, v5.8h, v30.8H // multiplication of upper half for idx 1
+ uxtl v6.8h, v6.8B // extend type to match the filter's size
+ ldr q28, [x15, x16] // load filter values for idx 3
+ smlal v18.4s, v6.4h, v29.4h // multiplication of lower half for idx 2
+ uxtl v7.8h, v7.8B
+ smlal2 v18.4s, v6.8h, v29.8H // multiplication of upper half for idx 2
+ sub w0, w0, #8
+ smlal v19.4s, v7.4h, v28.4H // multiplication of lower half for idx 3
+ cmp w0, #8
+ smlal2 v19.4s, v7.8h, v28.8h // multiplication of upper half for idx 3
+ add x16, x16, #16 // advance filter values indexing
+
+ b.ge 2b
+
+
+ // 4 iterations left
+
+ sub x17, x7, #8 // step back to wrap up the filter pos for last 4 elements
+
+ ldr s4, [x8] // load src values for idx 0
+ ldr d31, [x12, x17] // load filter values for idx 0
+ uxtl v4.8h, v4.8b // extend type to match the filter's size
+ ldr s5, [x9] // load src values for idx 1
+ smlal v16.4s, v4.4h, v31.4h
+ ldr d30, [x13, x17] // load filter values for idx 1
+ uxtl v5.8h, v5.8b // extend type to match the filter's size
+ ldr s6, [x10] // load src values for idx 2
+ smlal v17.4s, v5.4h, v30.4h
+ uxtl v6.8h, v6.8B // extend type to match the filter's size
+ ldr d29, [x14, x17] // load filter values for idx 2
+ ldr s7, [x11] // load src values for idx 3
+ addp v16.4s, v16.4s, v17.4s
+ uxtl v7.8h, v7.8B
+ ldr d28, [x15, x17] // load filter values for idx 3
+ smlal v18.4s, v6.4h, v29.4h
+ smlal v19.4s, v7.4h, v28.4h
+ subs w2, w2, #4
+ addp v18.4s, v18.4s, v19.4s
+ addp v16.4s, v16.4s, v18.4s
+ sshr v16.4s, v16.4s, #3
+ smin v16.4s, v16.4s, v20.4s
+
+ st1 {v16.4s}, [x1], #16
+ add x4, x4, x7, lsl #2
+ b.gt 1b
+ ret
+
+endfunc
\ No newline at end of file
diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
index d1312c6658..479fe129d0 100644
--- a/libswscale/aarch64/swscale.c
+++ b/libswscale/aarch64/swscale.c
@@ -29,7 +29,8 @@ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
const int16_t *filter, \
const int32_t *filterPos, int filterSize)
#define SCALE_FUNCS(filter_n, opt) \
- SCALE_FUNC(filter_n, 8, 15, opt);
+ SCALE_FUNC(filter_n, 8, 15, opt); \
+ SCALE_FUNC(filter_n, 8, 19, opt);
#define ALL_SCALE_FUNCS(opt) \
SCALE_FUNCS(4, opt); \
SCALE_FUNCS(X8, opt); \
@@ -48,9 +49,13 @@ void ff_yuv2plane1_8_neon(
int offset);
#define ASSIGN_SCALE_FUNC2(hscalefn, filtersize, opt) do { \
- if (c->srcBpc == 8 && c->dstBpc <= 14) { \
- hscalefn = \
- ff_hscale8to15_ ## filtersize ## _ ## opt; \
+ if (c->srcBpc == 8) { \
+ if(c->dstBpc <= 14) { \
+ hscalefn = \
+ ff_hscale8to15_ ## filtersize ## _ ## opt; \
+ } else \
+ hscalefn = \
+ ff_hscale8to19_ ## filtersize ## _ ## opt; \
} \
} while (0)
--
2.37.1
* [FFmpeg-devel] [PATCH 2/4] tests/sw_scale: Add test cases for input sizes 16
From: Hubert Mazur @ 2022-10-17 13:07 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop
Previously test cases handled only input sizes equal to 8.
Add support for input size 16 which is used by scaling
routines hscale16To15 and hscale16To19. Pass SwsContext
pointer to each function as some of them make use of it.
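The buffer selection that the test performs can be sketched in C as below. This is a minimal illustration of the DST_WIDTH macro and the _dst_16 / _dst_32 choice from the diff in this patch; the helper names dst_elem_size and select_dst are hypothetical and do not exist in checkasm.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper mirroring the DST_WIDTH macro from the test:
 * 14-bit intermediates are stored as int16_t, all others as int32_t. */
static size_t dst_elem_size(int dstBpc)
{
    return dstBpc == 14 ? sizeof(int16_t) : sizeof(int32_t);
}

/* Hypothetical helper: pick the {reference, new} buffer pair that matches
 * the destination depth, so memset/memcmp can use dst_elem_size(). */
static void **select_dst(int dstBpc, void *dst16[2], void *dst32[2])
{
    return dstBpc == 14 ? dst16 : dst32;
}
```

Keeping separate 16-bit and 32-bit buffer pairs means the byte count passed to memset and memcmp always matches what the scaler actually wrote.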
Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
tests/checkasm/sw_scale.c | 35 ++++++++++++++++++++++++++---------
1 file changed, 26 insertions(+), 9 deletions(-)
diff --git a/tests/checkasm/sw_scale.c b/tests/checkasm/sw_scale.c
index 3b8dd310ec..2e4b698f88 100644
--- a/tests/checkasm/sw_scale.c
+++ b/tests/checkasm/sw_scale.c
@@ -262,23 +262,31 @@ static void check_hscale(void)
#define FILTER_SIZES 6
static const int filter_sizes[FILTER_SIZES] = { 4, 8, 12, 16, 32, 40 };
-#define HSCALE_PAIRS 2
+#define HSCALE_PAIRS 4
static const int hscale_pairs[HSCALE_PAIRS][2] = {
{ 8, 14 },
{ 8, 18 },
+ { 16, 14 },
+ { 16, 18 }
};
+#define DST_WIDTH(x) ( (x) == (14) ? sizeof(int16_t) : sizeof(int32_t))
#define LARGEST_INPUT_SIZE 512
#define INPUT_SIZES 6
static const int input_sizes[INPUT_SIZES] = {8, 24, 128, 144, 256, 512};
int i, j, fsi, hpi, width, dstWi;
struct SwsContext *ctx;
+ void *(*_dst)[2];
+ void *_src;
// padded
LOCAL_ALIGNED_32(uint8_t, src, [FFALIGN(SRC_PIXELS + MAX_FILTER_WIDTH - 1, 4)]);
- LOCAL_ALIGNED_32(uint32_t, dst0, [SRC_PIXELS]);
- LOCAL_ALIGNED_32(uint32_t, dst1, [SRC_PIXELS]);
+ LOCAL_ALIGNED_32(uint16_t, src1, [FFALIGN(SRC_PIXELS + MAX_FILTER_WIDTH - 1, 4)]);
+ LOCAL_ALIGNED_32(int16_t, dst_ref_16, [SRC_PIXELS]);
+ LOCAL_ALIGNED_32(int16_t, dst_new_16, [SRC_PIXELS]);
+ LOCAL_ALIGNED_32(int32_t, dst_ref_32, [SRC_PIXELS]);
+ LOCAL_ALIGNED_32(int32_t, dst_new_32, [SRC_PIXELS]);
// padded
LOCAL_ALIGNED_32(int16_t, filter, [SRC_PIXELS * MAX_FILTER_WIDTH + MAX_FILTER_WIDTH]);
@@ -286,6 +294,9 @@ static void check_hscale(void)
LOCAL_ALIGNED_32(int16_t, filterAvx2, [SRC_PIXELS * MAX_FILTER_WIDTH + MAX_FILTER_WIDTH]);
LOCAL_ALIGNED_32(int32_t, filterPosAvx, [SRC_PIXELS]);
+ void *_dst_16[2] = {dst_ref_16, dst_new_16};
+ void *_dst_32[2] = {dst_ref_32, dst_new_32};
+
// The dst parameter here is either int16_t or int32_t but we use void* to
// just cover both cases.
declare_func_emms(AV_CPU_FLAG_MMX, void, void *c, void *dst, int dstW,
@@ -297,6 +308,7 @@ static void check_hscale(void)
fail();
randomize_buffers(src, SRC_PIXELS + MAX_FILTER_WIDTH - 1);
+ randomize_buffers(src1, SRC_PIXELS + MAX_FILTER_WIDTH - 1);
for (hpi = 0; hpi < HSCALE_PAIRS; hpi++) {
for (fsi = 0; fsi < FILTER_SIZES; fsi++) {
@@ -306,6 +318,8 @@ static void check_hscale(void)
ctx->srcBpc = hscale_pairs[hpi][0];
ctx->dstBpc = hscale_pairs[hpi][1];
ctx->hLumFilterSize = ctx->hChrFilterSize = width;
+ _src = ctx->srcBpc == 8 ? (void *)src : (void *)src1;
+ _dst = ctx->dstBpc == 14 ? (void*)_dst_16 : (void*)_dst_32;
for (i = 0; i < SRC_PIXELS; i++) {
filterPos[i] = i;
@@ -343,14 +357,15 @@ static void check_hscale(void)
ff_shuffle_filter_coefficients(ctx, filterPosAvx, width, filterAvx2, ctx->dstW);
if (check_func(ctx->hcScale, "hscale_%d_to_%d__fs_%d_dstW_%d", ctx->srcBpc, ctx->dstBpc + 1, width, ctx->dstW)) {
- memset(dst0, 0, SRC_PIXELS * sizeof(dst0[0]));
- memset(dst1, 0, SRC_PIXELS * sizeof(dst1[0]));
+ memset((*_dst)[0], 0, SRC_PIXELS * DST_WIDTH(ctx->dstBpc));
+ memset((*_dst)[1], 0, SRC_PIXELS * DST_WIDTH(ctx->dstBpc));
+
+ call_ref(ctx, (*_dst)[0], ctx->dstW, src, filter, filterPos, width);
+ call_new(ctx, (*_dst)[1], ctx->dstW, src, filterAvx2, filterPosAvx, width);
- call_ref(NULL, dst0, ctx->dstW, src, filter, filterPos, width);
- call_new(NULL, dst1, ctx->dstW, src, filterAvx2, filterPosAvx, width);
- if (memcmp(dst0, dst1, ctx->dstW * sizeof(dst0[0])))
+ if (memcmp((*_dst)[0], (*_dst)[1], ctx->dstW * DST_WIDTH(ctx->dstBpc)))
fail();
- bench_new(NULL, dst0, ctx->dstW, src, filter, filterPosAvx, width);
+ bench_new(ctx, (*_dst)[1], ctx->dstW, _src, filter, filterPosAvx, width);
}
}
}
@@ -358,6 +373,8 @@ static void check_hscale(void)
sws_freeContext(ctx);
}
+#undef DST_WIDTH
+
void checkasm_check_sw_scale(void)
{
check_hscale();
--
2.37.1
* [FFmpeg-devel] [PATCH 3/4] sw_scale: Add specializations for hscale 16 to 15
From: Hubert Mazur @ 2022-10-17 13:07 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop
Add arm64 neon implementations for hscale 16 to 15 with filter
sizes 4, X8 and X4.
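As a reading aid, the scalar operation implemented by these functions can be sketched in C as below. This is a minimal sketch inferred from the assembly in this patch: the shift is passed in by the caller, products are accumulated in 32 bits with wraparound, and the result is clamped to (1 << 15) - 1 before being narrowed to int16_t. The name hscale16to15_ref is hypothetical, and how the caller derives the shift is not shown here.

```c
#include <stdint.h>

/* Hypothetical scalar reference for the 16 -> 15 bit horizontal scaler.
 * Not part of the patch; it only illustrates the arithmetic. */
static void hscale16to15_ref(int shift, int16_t *dst, int dstW,
                             const uint16_t *src, const int16_t *filter,
                             const int32_t *filterPos, int filterSize)
{
    const int32_t max = (1 << 15) - 1; /* largest value allowed in dst */

    for (int i = 0; i < dstW; i++) {
        /* Unsigned accumulator so that 32-bit wraparound is well defined,
         * matching the wrapping behaviour of the NEON mul/mla. */
        uint32_t acc = 0;

        for (int j = 0; j < filterSize; j++)
            acc += (uint32_t)src[filterPos[i] + j] *
                   (uint32_t)(int32_t)filter[filterSize * i + j];

        /* Assumes the usual arithmetic right shift for signed values. */
        int32_t val = (int32_t)acc >> shift;
        dst[i] = (int16_t)(val < max ? val : max);
    }
}
```

The source samples are unsigned 16-bit while the filter coefficients are signed 16-bit, so both are widened to 32 bits before multiplying, which is the same reason the assembly uses uxtl and sxtl ahead of mul and mla.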
The tests and benchmarks were run on AWS Graviton 2 instances.
The results from the checkasm tool are shown below.
hscale_16_to_15__fs_4_dstW_512_c: 6703.5
hscale_16_to_15__fs_4_dstW_512_neon: 2298.0
hscale_16_to_15__fs_8_dstW_512_c: 10983.0
hscale_16_to_15__fs_8_dstW_512_neon: 3216.5
hscale_16_to_15__fs_12_dstW_512_c: 15526.0
hscale_16_to_15__fs_12_dstW_512_neon: 3993.0
hscale_16_to_15__fs_16_dstW_512_c: 20183.5
hscale_16_to_15__fs_16_dstW_512_neon: 5369.7
hscale_16_to_15__fs_32_dstW_512_c: 39315.2
hscale_16_to_15__fs_32_dstW_512_neon: 9511.2
hscale_16_to_15__fs_40_dstW_512_c: 48995.7
hscale_16_to_15__fs_40_dstW_512_neon: 11570.0
Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
libswscale/aarch64/hscale.S | 409 ++++++++++++++++++++++++++++++++++-
libswscale/aarch64/swscale.c | 66 +++++-
libswscale/swscale.c | 3 +-
3 files changed, 474 insertions(+), 4 deletions(-)
diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S
index 5e8cad9825..7d7e1c1f2e 100644
--- a/libswscale/aarch64/hscale.S
+++ b/libswscale/aarch64/hscale.S
@@ -635,5 +635,412 @@ function ff_hscale8to19_X4_neon, export=1
add x4, x4, x7, lsl #2
b.gt 1b
ret
+endfunc
+
+function ff_hscale16to15_4_neon_asm, export=1
+ // w0 int shift
+ // x1 int16_t *dst
+ // w2 int dstW
+ // x3 const uint8_t *src // treat it as uint16_t *src
+ // x4 const int16_t *filter
+ // x5 const int32_t *filterPos
+ // w6 int filterSize
+
+ movi v18.4s, #1
+ movi v17.4s, #1
+ shl v18.4s, v18.4s, #15
+ sub v18.4s, v18.4s, v17.4s // max allowed value
+ dup v17.4s, w0 // read shift
+ neg v17.4s, v17.4s // negate it, so it can be used in sshl (effectively shift right)
+
+ cmp w2, #16
+ b.lt 2f // move to last block
+
+ ldp w8, w9, [x5] // filterPos[0], filterPos[1]
+ ldp w10, w11, [x5, #8] // filterPos[2], filterPos[3]
+ ldp w12, w13, [x5, #16] // filterPos[4], filterPos[5]
+ ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7]
+ add x5, x5, #32
+
+ // shift all filterPos left by one, as uint16_t will be read
+ lsl x8, x8, #1
+ lsl x9, x9, #1
+ lsl x10, x10, #1
+ lsl x11, x11, #1
+ lsl x12, x12, #1
+ lsl x13, x13, #1
+ lsl x14, x14, #1
+ lsl x15, x15, #1
+
+ // load src with given offset
+ ldr x8, [x3, w8, UXTW]
+ ldr x9, [x3, w9, UXTW]
+ ldr x10, [x3, w10, UXTW]
+ ldr x11, [x3, w11, UXTW]
+ ldr x12, [x3, w12, UXTW]
+ ldr x13, [x3, w13, UXTW]
+ ldr x14, [x3, w14, UXTW]
+ ldr x15, [x3, w15, UXTW]
+
+ sub sp, sp, #64
+ // push src on stack so it can be loaded into vectors later
+ stp x8, x9, [sp]
+ stp x10, x11, [sp, #16]
+ stp x12, x13, [sp, #32]
+ stp x14, x15, [sp, #48]
+
+1:
+ ld4 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp]
+ ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7]
+
+ // Each of the blocks below does the following:
+ // Extend src and filter to 32 bits with uxtl and sxtl
+ // multiply or multiply and accumulate results
+ // Extending to 32 bits is necessary, as uint16_t values can't
+ // be represented as int16_t without type promotion.
+ uxtl v26.4s, v0.4h
+ sxtl v27.4s, v28.4H
+ uxtl2 v0.4s, v0.8h
+ mul v5.4s, v26.4s, v27.4s
+ sxtl2 v28.4s, v28.8H
+ uxtl v26.4s, v1.4h
+ mul v6.4s, v0.4s, v28.4s
+
+ sxtl v27.4s, v29.4H
+ uxtl2 v0.4s, v1.8h
+ mla v5.4s, v27.4s, v26.4s
+ sxtl2 v28.4s, v29.8H
+ uxtl v26.4s, v2.4h
+ mla v6.4s, v28.4s, v0.4s
+
+ sxtl v27.4s, v30.4H
+ uxtl2 v0.4s, v2.8h
+ mla v5.4s, v27.4s, v26.4s
+ sxtl2 v28.4s, v30.8H
+ uxtl v26.4s, v3.4h
+ mla v6.4s, v28.4s, v0.4s
+
+ sxtl v27.4s, v31.4H
+ uxtl2 v0.4s, v3.8h
+ mla v5.4s, v27.4s, v26.4s
+ sxtl2 v28.4s, v31.8H
+ sub w2, w2, #8
+ mla v6.4s, v28.4s, v0.4s
+
+ sshl v5.4s, v5.4s, v17.4s
+ sshl v6.4s, v6.4s, v17.4s
+ smin v5.4s, v5.4s, v18.4s
+ smin v6.4s, v6.4s, v18.4s
+ xtn v5.4h, v5.4s
+ xtn2 v5.8h, v6.4s
+
+ st1 {v5.8h}, [x1], #16
+ cmp w2, #16
+
+ // load filterPositions into registers for next iteration
+ ldp w8, w9, [x5] // filterPos[0], filterPos[1]
+ ldp w10, w11, [x5, #8] // filterPos[2], filterPos[3]
+ ldp w12, w13, [x5, #16] // filterPos[4], filterPos[5]
+ ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7]
+ add x5, x5, #32
+
+ lsl x8, x8, #1
+ lsl x9, x9, #1
+ lsl x10, x10, #1
+ lsl x11, x11, #1
+ lsl x12, x12, #1
+ lsl x13, x13, #1
+ lsl x14, x14, #1
+ lsl x15, x15, #1
+
+ ldr x8, [x3, w8, UXTW]
+ ldr x9, [x3, w9, UXTW]
+ ldr x10, [x3, w10, UXTW]
+ ldr x11, [x3, w11, UXTW]
+ ldr x12, [x3, w12, UXTW]
+ ldr x13, [x3, w13, UXTW]
+ ldr x14, [x3, w14, UXTW]
+ ldr x15, [x3, w15, UXTW]
+
+ stp x8, x9, [sp]
+ stp x10, x11, [sp, #16]
+ stp x12, x13, [sp, #32]
+ stp x14, x15, [sp, #48]
-endfunc
\ No newline at end of file
+ b.ge 1b
+
+ // here we make last iteration, without updating the registers
+ ld4 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp]
+ ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64
+
+ uxtl v26.4s, v0.4h
+ sxtl v27.4s, v28.4H
+ uxtl2 v0.4s, v0.8h
+ mul v5.4s, v26.4s, v27.4s
+ sxtl2 v28.4s, v28.8H
+ uxtl v26.4s, v1.4h
+ mul v6.4s, v0.4s, v28.4s
+
+ sxtl v27.4s, v29.4H
+ uxtl2 v0.4s, v1.8h
+ mla v5.4s, v26.4s, v27.4s
+ sxtl2 v28.4s, v29.8H
+ uxtl v26.4s, v2.4h
+ mla v6.4s, v0.4s, v28.4s
+
+ sxtl v27.4s, v30.4H
+ uxtl2 v0.4s, v2.8h
+ mla v5.4s, v26.4s, v27.4s
+ sxtl2 v28.4s, v30.8H
+ uxtl v26.4s, v3.4h
+ mla v6.4s, v0.4s, v28.4s
+
+ sxtl v27.4s, v31.4H
+ uxtl2 v0.4s, v3.8h
+ mla v5.4s, v26.4s, v27.4s
+ sxtl2 v28.4s, v31.8H
+ subs w2, w2, #8
+ mla v6.4s, v0.4s, v28.4s
+
+ sshl v5.4s, v5.4s, v17.4s
+ sshl v6.4s, v6.4s, v17.4s
+ smin v5.4s, v5.4s, v18.4s
+ smin v6.4s, v6.4s, v18.4s
+ xtn v5.4h, v5.4S
+ xtn2 v5.8h, v6.4s
+
+ st1 {v5.8h}, [x1], #16
+ add sp, sp, #64 // restore stack
+ cbnz w2, 2f
+
+ ret
+
+2:
+ ldr w8, [x5], #4 // load filterPos
+ lsl w8, w8, #1
+ add x9, x3, w8, UXTW // src + filterPos
+ ld1 {v0.4h}, [x9] // load 4 * uint16_t
+ ld1 {v31.4h}, [x4], #8
+
+ uxtl v0.4s, v0.4h
+ sxtl v31.4s, v31.4h
+ mul v5.4s, v0.4s, v31.4s
+ addv s0, v5.4S
+ sshl v0.4s, v0.4s, v17.4s
+ smin v0.4s, v0.4s, v18.4s
+ st1 {v0.h}[0], [x1], #2
+ sub w2, w2, #1
+ cbnz w2, 2b // if iterations remain jump to beginning
+
+ ret
+endfunc
+
+function ff_hscale16to15_X8_neon_asm, export=1
+ // w0 int shift
+ // x1 int16_t *dst
+ // w2 int dstW
+ // x3 const uint8_t *src // treat it as uint16_t *src
+ // x4 const int16_t *filter
+ // x5 const int32_t *filterPos
+ // w6 int filterSize
+
+ movi v20.4s, #1
+ movi v21.4s, #1
+ shl v20.4s, v20.4s, #15
+ sub v20.4s, v20.4s, v21.4s
+ dup v21.4s, w0
+ neg v21.4s, v21.4s
+
+ sbfiz x7, x6, #1, #32 // filterSize*2 (*2 because int16)
+1: ldr w8, [x5], #4 // filterPos[idx]
+ lsl w8, w8, #1
+ ldr w10, [x5], #4 // filterPos[idx + 1]
+ lsl w10, w10, #1
+ ldr w11, [x5], #4 // filterPos[idx + 2]
+ lsl w11, w11, #1
+ ldr w9, [x5], #4 // filterPos[idx + 3]
+ lsl w9, w9, #1
+ mov x16, x4 // filter0 = filter
+ add x12, x16, x7 // filter1 = filter0 + filterSize*2
+ add x13, x12, x7 // filter2 = filter1 + filterSize*2
+ add x4, x13, x7 // filter3 = filter2 + filterSize*2
+ movi v0.2D, #0 // val sum part 1 (for dst[0])
+ movi v1.2D, #0 // val sum part 2 (for dst[1])
+ movi v2.2D, #0 // val sum part 3 (for dst[2])
+ movi v3.2D, #0 // val sum part 4 (for dst[3])
+ add x17, x3, w8, UXTW // srcp + filterPos[0]
+ add x8, x3, w10, UXTW // srcp + filterPos[1]
+ add x10, x3, w11, UXTW // srcp + filterPos[2]
+ add x11, x3, w9, UXTW // srcp + filterPos[3]
+ mov w15, w6 // filterSize counter
+2: ld1 {v4.8H}, [x17], #16 // srcp[filterPos[0] + {0..7}]
+ ld1 {v5.8H}, [x16], #16 // load 8x16-bit filter values, part 1
+ ld1 {v6.8H}, [x8], #16 // srcp[filterPos[1] + {0..7}]
+ ld1 {v7.8H}, [x12], #16 // load 8x16-bit at filter+filterSize
+ uxtl v24.4s, v4.4H // extend srcp lower half to 32 bits to preserve sign
+ sxtl v25.4s, v5.4H // extend filter lower half to 32 bits to match srcp size
+ uxtl2 v4.4s, v4.8h // extend srcp upper half to 32 bits
+ mla v0.4s, v24.4s, v25.4s // multiply accumulate lower half of v4 * v5
+ sxtl2 v5.4s, v5.8h // extend filter upper half to 32 bits
+ uxtl v26.4s, v6.4h // extend srcp lower half to 32 bits
+ mla v0.4S, v4.4s, v5.4s // multiply accumulate upper half of v4 * v5
+ sxtl v27.4s, v7.4H // extend filter lower half
+ uxtl2 v6.4s, v6.8H // extend srcp upper half
+ sxtl2 v7.4s, v7.8h // extend filter upper half
+ ld1 {v16.8H}, [x10], #16 // srcp[filterPos[2] + {0..7}]
+ mla v1.4S, v26.4s, v27.4s // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}]
+ ld1 {v17.8H}, [x13], #16 // load 8x16-bit at filter+2*filterSize
+ uxtl v22.4s, v16.4H // extend srcp lower half
+ sxtl v23.4s, v17.4H // extend filter lower half
+ uxtl2 v16.4s, v16.8H // extend srcp upper half
+ sxtl2 v17.4s, v17.8h // extend filter upper half
+ mla v2.4S, v22.4s, v23.4s // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}]
+ mla v2.4S, v16.4s, v17.4s // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}]
+ ld1 {v18.8H}, [x11], #16 // srcp[filterPos[3] + {0..7}]
+ mla v1.4S, v6.4s, v7.4s // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}]
+ ld1 {v19.8H}, [x4], #16 // load 8x16-bit at filter+3*filterSize
+ subs w15, w15, #8 // j -= 8: processed 8/filterSize
+ uxtl v28.4s, v18.4H // extend srcp lower half
+ sxtl v29.4s, v19.4H // extend filter lower half
+ uxtl2 v18.4s, v18.8H // extend srcp upper half
+ sxtl2 v19.4s, v19.8h // extend filter upper half
+ mla v3.4S, v28.4s, v29.4s // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}]
+ mla v3.4S, v18.4s, v19.4s // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}]
+ b.gt 2b // inner loop if filterSize not consumed completely
+ addp v0.4S, v0.4S, v1.4S // part01 horizontal pair adding
+ addp v2.4S, v2.4S, v3.4S // part23 horizontal pair adding
+ addp v0.4S, v0.4S, v2.4S // part0123 horizontal pair adding
+ subs w2, w2, #4 // dstW -= 4
+ sshl v0.4s, v0.4s, v21.4s // shift right (sshl with a negative shift count); overflow expected
+ smin v0.4s, v0.4s, v20.4s // apply min (do not use sqshl)
+ xtn v0.4h, v0.4s // narrow down to 16 bits
+
+ st1 {v0.4H}, [x1], #8 // write to destination part0123
+ b.gt 1b // loop until end of line
+ ret
+endfunc
+
+function ff_hscale16to15_X4_neon_asm, export=1
+ // w0 int shift
+ // x1 int16_t *dst
+ // w2 int dstW
+ // x3 const uint8_t *src
+ // x4 const int16_t *filter
+ // x5 const int32_t *filterPos
+ // w6 int filterSize
+
+ stp d8, d9, [sp, #-0x20]!
+ stp d10, d11, [sp, #0x10]
+
+ movi v18.4s, #1
+ movi v17.4s, #1
+ shl v18.4s, v18.4s, #15
+ sub v21.4s, v18.4s, v17.4s // max allowed value
+ dup v17.4s, w0 // read shift
+ neg v20.4s, v17.4s // negate it, so it can be used in sshl (effectively shift right)
+
+ lsl w7, w6, #1
+1:
+ ldp w8, w9, [x5]
+ ldp w10, w11, [x5, #8]
+
+ movi v16.2d, #0 // initialize accumulator for idx + 0
+ movi v17.2d, #0 // initialize accumulator for idx + 1
+ movi v18.2d, #0 // initialize accumulator for idx + 2
+ movi v19.2d, #0 // initialize accumulator for idx + 3
+
+ mov x12, x4 // filter + 0
+ add x13, x4, x7 // filter + 1
+ add x8, x3, x8, lsl #1 // srcp + filterPos 0
+ add x14, x13, x7 // filter + 2
+ add x9, x3, x9, lsl #1 // srcp + filterPos 1
+ add x15, x14, x7 // filter + 3
+ add x10, x3, x10, lsl #1 // srcp + filterPos 2
+ mov w0, w6 // save the filterSize to temporary variable
+ add x11, x3, x11, lsl #1 // srcp + filterPos 3
+ add x5, x5, #16 // advance filter position
+ mov x16, xzr // clear the register x16 used for offsetting the filter values
+
+2:
+ ldr q4, [x8], #16 // load src values for idx 0
+ ldr q5, [x9], #16 // load src values for idx 1
+ uxtl v26.4s, v4.4h
+ uxtl2 v4.4s, v4.8h
+ ldr q31, [x12, x16] // load filter values for idx 0
+ ldr q6, [x10], #16 // load src values for idx 2
+ sxtl v22.4s, v31.4h
+ sxtl2 v31.4s, v31.8h
+ mla v16.4s, v26.4s, v22.4s // multiplication of lower half for idx 0
+ uxtl v25.4s, v5.4h
+ uxtl2 v5.4s, v5.8h
+ ldr q30, [x13, x16] // load filter values for idx 1
+ ldr q7, [x11], #16 // load src values for idx 3
+ mla v16.4s, v4.4s, v31.4s // multiplication of upper half for idx 0
+ uxtl v24.4s, v6.4h
+ sxtl v8.4s, v30.4h
+ sxtl2 v30.4s, v30.8h
+ mla v17.4s, v25.4s, v8.4s // multiplication of lower half for idx 1
+ ldr q29, [x14, x16] // load filter values for idx 2
+ uxtl2 v6.4s, v6.8h
+ sxtl v9.4s, v29.4h
+ sxtl2 v29.4s, v29.8h
+ mla v17.4s, v5.4s, v30.4s // multiplication of upper half for idx 1
+ mla v18.4s, v24.4s, v9.4s // multiplication of lower half for idx 2
+ ldr q28, [x15, x16] // load filter values for idx 3
+ uxtl v23.4s, v7.4h
+ sxtl v10.4s, v28.4h
+ mla v18.4s, v6.4s, v29.4s // multiplication of upper half for idx 2
+ uxtl2 v7.4s, v7.8h
+ sxtl2 v28.4s, v28.8h
+ mla v19.4s, v23.4s, v10.4s // multiplication of lower half for idx 3
+ sub w0, w0, #8
+ cmp w0, #8
+ mla v19.4s, v7.4s, v28.4s // multiplication of upper half for idx 3
+
+ add x16, x16, #16 // advance filter values indexing
+
+ b.ge 2b
+
+ // 4 iterations left
+
+ sub x17, x7, #8 // step back to wrap up the filter pos for last 4 elements
+
+ ldr d4, [x8] // load src values for idx 0
+ ldr d31, [x12, x17] // load filter values for idx 0
+ uxtl v4.4s, v4.4h
+ sxtl v31.4s, v31.4h
+ ldr d5, [x9] // load src values for idx 1
+ mla v16.4s, v4.4s, v31.4s // multiplication of the remaining 4 elements for idx 0
+ ldr d30, [x13, x17] // load filter values for idx 1
+ uxtl v5.4s, v5.4h
+ sxtl v30.4s, v30.4h
+ ldr d6, [x10] // load src values for idx 2
+ mla v17.4s, v5.4s, v30.4s // multiplication of the remaining 4 elements for idx 1
+ ldr d29, [x14, x17] // load filter values for idx 2
+ uxtl v6.4s, v6.4h
+ sxtl v29.4s, v29.4h
+ ldr d7, [x11] // load src values for idx 3
+ ldr d28, [x15, x17] // load filter values for idx 3
+ mla v18.4s, v6.4s, v29.4s // multiplication of the remaining 4 elements for idx 2
+ uxtl v7.4s, v7.4h
+ sxtl v28.4s, v28.4h
+ addp v16.4s, v16.4s, v17.4s
+ mla v19.4s, v7.4s, v28.4s // multiplication of the remaining 4 elements for idx 3
+ subs w2, w2, #4
+ addp v18.4s, v18.4s, v19.4s
+ addp v16.4s, v16.4s, v18.4s
+ sshl v16.4s, v16.4s, v20.4s
+ smin v16.4s, v16.4s, v21.4s
+ xtn v16.4h, v16.4s
+
+ st1 {v16.4h}, [x1], #8
+ add x4, x4, x7, lsl #2
+ b.gt 1b
+
+ ldp d8, d9, [sp]
+ ldp d10, d11, [sp, #0x10]
+
+ add sp, sp, #0x20
+
+ ret
+endfunc
diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
index 479fe129d0..993cdd67dd 100644
--- a/libswscale/aarch64/swscale.c
+++ b/libswscale/aarch64/swscale.c
@@ -22,6 +22,18 @@
#include "libswscale/swscale_internal.h"
#include "libavutil/aarch64/cpu.h"
+void ff_hscale16to15_4_neon_asm(int shift, int16_t *_dst, int dstW,
+ const uint8_t *_src, const int16_t *filter,
+ const int32_t *filterPos, int filterSize);
+
+void ff_hscale16to15_X8_neon_asm(int shift, int16_t *_dst, int dstW,
+ const uint8_t *_src, const int16_t *filter,
+ const int32_t *filterPos, int filterSize);
+
+void ff_hscale16to15_X4_neon_asm(int shift, int16_t *_dst, int dstW,
+ const uint8_t *_src, const int16_t *filter,
+ const int32_t *filterPos, int filterSize);
+
#define SCALE_FUNC(filter_n, from_bpc, to_bpc, opt) \
void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
SwsContext *c, int16_t *data, \
@@ -30,7 +42,8 @@ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
const int32_t *filterPos, int filterSize)
#define SCALE_FUNCS(filter_n, opt) \
SCALE_FUNC(filter_n, 8, 15, opt); \
- SCALE_FUNC(filter_n, 8, 19, opt);
+ SCALE_FUNC(filter_n, 8, 19, opt); \
+ SCALE_FUNC(filter_n, 16, 15, opt);
#define ALL_SCALE_FUNCS(opt) \
SCALE_FUNCS(4, opt); \
SCALE_FUNCS(X8, opt); \
@@ -56,6 +69,10 @@ void ff_yuv2plane1_8_neon(
} else \
hscalefn = \
ff_hscale8to19_ ## filtersize ## _ ## opt; \
+ } else { \
+ if (c->dstBpc <= 14) \
+ hscalefn = \
+ ff_hscale16to15_ ## filtersize ## _ ## opt; \
} \
} while (0)
@@ -87,3 +104,50 @@ av_cold void ff_sws_init_swscale_aarch64(SwsContext *c)
}
}
}
+
+void ff_hscale16to15_4_neon(SwsContext *c, int16_t *_dst, int dstW,
+ const uint8_t *_src, const int16_t *filter,
+ const int32_t *filterPos, int filterSize)
+{
+ const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat);
+ int sh = desc->comp[0].depth - 1;
+
+ if (sh<15) {
+ sh = isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8 ? 13 : (desc->comp[0].depth - 1);
+ } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float input is processed like uint 16bpc */
+ sh = 16 - 1;
+ }
+ ff_hscale16to15_4_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize);
+
+}
+
+void ff_hscale16to15_X8_neon(SwsContext *c, int16_t *_dst, int dstW,
+ const uint8_t *_src, const int16_t *filter,
+ const int32_t *filterPos, int filterSize)
+{
+ const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat);
+ int sh = desc->comp[0].depth - 1;
+
+ if (sh<15) {
+ sh = isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8 ? 13 : (desc->comp[0].depth - 1);
+ } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float input is processed like uint 16bpc */
+ sh = 16 - 1;
+ }
+ ff_hscale16to15_X8_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize);
+
+}
+
+void ff_hscale16to15_X4_neon(SwsContext *c, int16_t *_dst, int dstW,
+ const uint8_t *_src, const int16_t *filter,
+ const int32_t *filterPos, int filterSize)
+{
+ const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat);
+ int sh = desc->comp[0].depth - 1;
+
+ if (sh<15) {
+ sh = isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8 ? 13 : (desc->comp[0].depth - 1);
+ } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float input is processed like uint 16bpc */
+ sh = 16 - 1;
+ }
+ ff_hscale16to15_X4_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize);
+}
\ No newline at end of file
diff --git a/libswscale/swscale.c b/libswscale/swscale.c
index 367d045a02..5afd5eba83 100644
--- a/libswscale/swscale.c
+++ b/libswscale/swscale.c
@@ -109,11 +109,10 @@ static void hScale16To15_c(SwsContext *c, int16_t *dst, int dstW,
int j;
int srcPos = filterPos[i];
int val = 0;
-
for (j = 0; j < filterSize; j++) {
val += src[srcPos + j] * filter[filterSize * i + j];
}
- // filter=14 bit, input=16 bit, output=30 bit, >> 15 makes 15 bit
+ //filter=14 bit, input=16 bit, output=30 bit, >> 15 makes 15 bit
dst[i] = FFMIN(val >> sh, (1 << 15) - 1);
}
}
--
2.37.1
* [FFmpeg-devel] [PATCH 4/4] sw_scale: Add specializations for hscale 16 to 19
2022-10-17 13:07 [FFmpeg-devel] [PATCH 0/4] Provide neon implementations for hscale functions Hubert Mazur
` (2 preceding siblings ...)
2022-10-17 13:07 ` [FFmpeg-devel] [PATCH 3/4] sw_scale: Add specializations for hscale 16 to 15 Hubert Mazur
@ 2022-10-17 13:07 ` Hubert Mazur
2022-10-24 13:19 ` Martin Storsjö
3 siblings, 1 reply; 8+ messages in thread
From: Hubert Mazur @ 2022-10-17 13:07 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: gjb, upstream, jswinney, Hubert Mazur, martin, mw, spop
Provide arm64 neon optimized implementations for hscale16To19 with
filter sizes 4, 8 and X4.
The tests and benchmarks run on AWS Graviton 2 instances.
The results from a checkasm tool are shown below.
hscale_16_to_19__fs_4_dstW_512_c: 6216.0
hscale_16_to_19__fs_4_dstW_512_neon: 2257.0
hscale_16_to_19__fs_8_dstW_512_c: 10417.7
hscale_16_to_19__fs_8_dstW_512_neon: 3112.5
hscale_16_to_19__fs_12_dstW_512_c: 14890.5
hscale_16_to_19__fs_12_dstW_512_neon: 3899.0
hscale_16_to_19__fs_16_dstW_512_c: 19006.5
hscale_16_to_19__fs_16_dstW_512_neon: 5341.2
hscale_16_to_19__fs_32_dstW_512_c: 36629.5
hscale_16_to_19__fs_32_dstW_512_neon: 9502.7
hscale_16_to_19__fs_40_dstW_512_c: 45477.5
hscale_16_to_19__fs_40_dstW_512_neon: 11552.0
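For orientation, the operation benchmarked above can be written as a scalar loop. The following is only a minimal sketch, not the code from this patch: the name hscale16to19_ref is hypothetical, the loop body follows the hScale16To15_c loop touched in patch 3/4, and the 19-bit clamp and int32_t output are taken from the NEON code (shl #19, smin, 32-bit stores). It assumes an arithmetic right shift for negative values, which matches sshl with a negated shift.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar reference for hscale 16 to 19 (illustration only).
 * sh is the right shift computed by the C wrapper, e.g. 11 for 16-bit input. */
static void hscale16to19_ref(int sh, int32_t *dst, int dst_w,
                             const uint16_t *src, const int16_t *filter,
                             const int32_t *filter_pos, int filter_size)
{
    const int32_t max_val = (1 << 19) - 1; /* same clamp as the smin operand */

    assert(filter_size > 0);
    for (int i = 0; i < dst_w; i++) {
        /* Accumulate in unsigned arithmetic so that wrap-around is well
         * defined; the NEON mla accumulators wrap modulo 2^32 as well. */
        uint32_t acc = 0;
        for (int j = 0; j < filter_size; j++)
            acc += (uint32_t)src[filter_pos[i] + j] *
                   (uint32_t)(int32_t)filter[filter_size * i + j];

        /* Assumption: conversion to int32_t and >> behave as two's
         * complement arithmetic shift, as on common compilers. */
        int32_t val = (int32_t)acc >> sh;
        dst[i] = val < max_val ? val : max_val;
    }
}
```

With four filter taps of 4096 each (summing to 1 << 14) and sh = 11, a window of 1000, 2000, 3000, 4000 yields 20000, which is the kind of value checkasm compares between the C and NEON paths.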
Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
libswscale/aarch64/hscale.S | 402 +++++++++++++++++++++++++++++++++++
libswscale/aarch64/swscale.c | 70 +++++-
2 files changed, 471 insertions(+), 1 deletion(-)
diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S
index 7d7e1c1f2e..dfc635d1b9 100644
--- a/libswscale/aarch64/hscale.S
+++ b/libswscale/aarch64/hscale.S
@@ -1044,3 +1044,405 @@ function ff_hscale16to15_X4_neon_asm, export=1
ret
endfunc
+
+function ff_hscale16to19_4_neon_asm, export=1
+ // w0 int shift
+ // x1 int32_t *dst
+ // w2 int dstW
+ // x3 const uint8_t *src // treat it as uint16_t *src
+ // x4 const int16_t *filter
+ // x5 const int32_t *filterPos
+ // w6 int filterSize
+
+ movi v18.4s, #1
+ movi v17.4s, #1
+ shl v18.4s, v18.4s, #19
+ sub v18.4s, v18.4s, v17.4s // max allowed value
+ dup v17.4s, w0 // read shift
+ neg v17.4s, v17.4s // negate it, so it can be used in sshl (effectively shift right)
+
+ cmp w2, #16
+ b.lt 2f // move to last block
+
+ ldp w8, w9, [x5] // filterPos[0], filterPos[1]
+ ldp w10, w11, [x5, #8] // filterPos[2], filterPos[3]
+ ldp w12, w13, [x5, #16] // filterPos[4], filterPos[5]
+ ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7]
+ add x5, x5, #32
+
+ // shift all filterPos left by one, as uint16_t will be read
+ lsl x8, x8, #1
+ lsl x9, x9, #1
+ lsl x10, x10, #1
+ lsl x11, x11, #1
+ lsl x12, x12, #1
+ lsl x13, x13, #1
+ lsl x14, x14, #1
+ lsl x15, x15, #1
+
+ // load src with given offset
+ ldr x8, [x3, w8, UXTW]
+ ldr x9, [x3, w9, UXTW]
+ ldr x10, [x3, w10, UXTW]
+ ldr x11, [x3, w11, UXTW]
+ ldr x12, [x3, w12, UXTW]
+ ldr x13, [x3, w13, UXTW]
+ ldr x14, [x3, w14, UXTW]
+ ldr x15, [x3, w15, UXTW]
+
+ sub sp, sp, #64
+ // push src on stack so it can be loaded into vectors later
+ stp x8, x9, [sp]
+ stp x10, x11, [sp, #16]
+ stp x12, x13, [sp, #32]
+ stp x14, x15, [sp, #48]
+
+1:
+ ld4 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp]
+ ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7]
+
+ // Each of blocks does the following:
+ // Extend src and filter to 32 bits with uxtl and sxtl
+ // multiply or multiply and accumulate results
+ // Extending to 32 bits is necessary, as uint16_t values can't
+ // be represented as int16_t without type promotion.
+ uxtl v26.4s, v0.4h
+ sxtl v27.4s, v28.4H
+ uxtl2 v0.4s, v0.8h
+ mul v5.4s, v26.4s, v27.4s
+ sxtl2 v28.4s, v28.8H
+ uxtl v26.4s, v1.4h
+ mul v6.4s, v0.4s, v28.4s
+
+ sxtl v27.4s, v29.4H
+ uxtl2 v0.4s, v1.8h
+ mla v5.4s, v27.4s, v26.4s
+ sxtl2 v28.4s, v29.8H
+ uxtl v26.4s, v2.4h
+ mla v6.4s, v28.4s, v0.4s
+
+ sxtl v27.4s, v30.4H
+ uxtl2 v0.4s, v2.8h
+ mla v5.4s, v27.4s, v26.4s
+ sxtl2 v28.4s, v30.8H
+ uxtl v26.4s, v3.4h
+ mla v6.4s, v28.4s, v0.4s
+
+ sxtl v27.4s, v31.4H
+ uxtl2 v0.4s, v3.8h
+ mla v5.4s, v27.4s, v26.4s
+ sxtl2 v28.4s, v31.8H
+ sub w2, w2, #8
+ mla v6.4s, v28.4s, v0.4s
+
+ sshl v5.4s, v5.4s, v17.4s
+ sshl v6.4s, v6.4s, v17.4s
+ smin v5.4s, v5.4s, v18.4s
+ smin v6.4s, v6.4s, v18.4s
+
+ st1 {v5.4s, v6.4s}, [x1], #32
+ cmp w2, #16
+
+ // load filterPositions into registers for next iteration
+ ldp w8, w9, [x5] // filterPos[0], filterPos[1]
+ ldp w10, w11, [x5, #8] // filterPos[2], filterPos[3]
+ ldp w12, w13, [x5, #16] // filterPos[4], filterPos[5]
+ ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7]
+ add x5, x5, #32
+
+ lsl x8, x8, #1
+ lsl x9, x9, #1
+ lsl x10, x10, #1
+ lsl x11, x11, #1
+ lsl x12, x12, #1
+ lsl x13, x13, #1
+ lsl x14, x14, #1
+ lsl x15, x15, #1
+
+ ldr x8, [x3, w8, UXTW]
+ ldr x9, [x3, w9, UXTW]
+ ldr x10, [x3, w10, UXTW]
+ ldr x11, [x3, w11, UXTW]
+ ldr x12, [x3, w12, UXTW]
+ ldr x13, [x3, w13, UXTW]
+ ldr x14, [x3, w14, UXTW]
+ ldr x15, [x3, w15, UXTW]
+
+ stp x8, x9, [sp]
+ stp x10, x11, [sp, #16]
+ stp x12, x13, [sp, #32]
+ stp x14, x15, [sp, #48]
+
+ b.ge 1b
+
+ // here we make last iteration, without updating the registers
+ ld4 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp]
+ ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64
+
+ uxtl v26.4s, v0.4h
+ sxtl v27.4s, v28.4H
+ uxtl2 v0.4s, v0.8h
+ mul v5.4s, v26.4s, v27.4s
+ sxtl2 v28.4s, v28.8H
+ uxtl v26.4s, v1.4h
+ mul v6.4s, v0.4s, v28.4s
+
+ sxtl v27.4s, v29.4H
+ uxtl2 v0.4s, v1.8h
+ mla v5.4s, v26.4s, v27.4s
+ sxtl2 v28.4s, v29.8H
+ uxtl v26.4s, v2.4h
+ mla v6.4s, v0.4s, v28.4s
+
+ sxtl v27.4s, v30.4H
+ uxtl2 v0.4s, v2.8h
+ mla v5.4s, v26.4s, v27.4s
+ sxtl2 v28.4s, v30.8H
+ uxtl v26.4s, v3.4h
+ mla v6.4s, v0.4s, v28.4s
+
+ sxtl v27.4s, v31.4H
+ uxtl2 v0.4s, v3.8h
+ mla v5.4s, v26.4s, v27.4s
+ sxtl2 v28.4s, v31.8H
+ subs w2, w2, #8
+ mla v6.4s, v0.4s, v28.4s
+
+ sshl v5.4s, v5.4s, v17.4s
+ sshl v6.4s, v6.4s, v17.4s
+
+ smin v5.4s, v5.4s, v18.4s
+ smin v6.4s, v6.4s, v18.4s
+
+ st1 {v5.4s, v6.4s}, [x1], #32
+ add sp, sp, #64 // restore stack
+ cbnz w2, 2f
+
+ ret
+
+2:
+ ldr w8, [x5], #4 // load filterPos
+ lsl w8, w8, #1
+ add x9, x3, w8, UXTW // src + filterPos
+ ld1 {v0.4h}, [x9] // load 4 * uint16_t
+ ld1 {v31.4h}, [x4], #8
+
+ uxtl v0.4s, v0.4h
+ sxtl v31.4s, v31.4h
+ subs w2, w2, #1
+ mul v5.4s, v0.4s, v31.4s
+ addv s0, v5.4S
+ sshl v0.4s, v0.4s, v17.4s
+ smin v0.4s, v0.4s, v18.4s
+ st1 {v0.s}[0], [x1], #4
+ cbnz w2, 2b // if iterations remain jump to beginning
+
+ ret
+endfunc
+
+function ff_hscale16to19_X8_neon_asm, export=1
+ // w0 int shift
+ // x1 int32_t *dst
+ // w2 int dstW
+ // x3 const uint8_t *src // treat it as uint16_t *src
+ // x4 const int16_t *filter
+ // x5 const int32_t *filterPos
+ // w6 int filterSize
+
+ movi v20.4s, #1
+ movi v21.4s, #1
+ shl v20.4s, v20.4s, #19
+ sub v20.4s, v20.4s, v21.4s
+ dup v21.4s, w0
+ neg v21.4s, v21.4s
+
+ sbfiz x7, x6, #1, #32 // filterSize*2 (*2 because int16)
+1: ldr w8, [x5], #4 // filterPos[idx]
+ ldr w10, [x5], #4 // filterPos[idx + 1]
+ lsl w8, w8, #1
+ ldr w11, [x5], #4 // filterPos[idx + 2]
+ ldr w9, [x5], #4 // filterPos[idx + 3]
+ mov x16, x4 // filter0 = filter
+ lsl w11, w11, #1
+ add x12, x16, x7 // filter1 = filter0 + filterSize*2
+ lsl w9, w9, #1
+ add x13, x12, x7 // filter2 = filter1 + filterSize*2
+ lsl w10, w10, #1
+ add x4, x13, x7 // filter3 = filter2 + filterSize*2
+ movi v0.2D, #0 // val sum part 1 (for dst[0])
+ movi v1.2D, #0 // val sum part 2 (for dst[1])
+ movi v2.2D, #0 // val sum part 3 (for dst[2])
+ movi v3.2D, #0 // val sum part 4 (for dst[3])
+ add x17, x3, w8, UXTW // srcp + filterPos[0]
+ add x8, x3, w10, UXTW // srcp + filterPos[1]
+ add x10, x3, w11, UXTW // srcp + filterPos[2]
+ add x11, x3, w9, UXTW // srcp + filterPos[3]
+ mov w15, w6 // filterSize counter
+2: ld1 {v4.8H}, [x17], #16 // srcp[filterPos[0] + {0..7}]
+ ld1 {v5.8H}, [x16], #16 // load 8x16-bit filter values, part 1
+ ld1 {v6.8H}, [x8], #16 // srcp[filterPos[1] + {0..7}]
+ ld1 {v7.8H}, [x12], #16 // load 8x16-bit at filter+filterSize
+ uxtl v24.4s, v4.4H // extend srcp lower half to 32 bits to preserve sign
+ sxtl v25.4s, v5.4H // extend filter lower half to 32 bits to match srcp size
+ uxtl2 v4.4s, v4.8h // extend srcp upper half to 32 bits
+ mla v0.4s, v24.4s, v25.4s // multiply accumulate lower half of v4 * v5
+ sxtl2 v5.4s, v5.8h // extend filter upper half to 32 bits
+ uxtl v26.4s, v6.4h // extend srcp lower half to 32 bits
+ mla v0.4S, v4.4s, v5.4s // multiply accumulate upper half of v4 * v5
+ sxtl v27.4s, v7.4H // extend filter lower half
+ uxtl2 v6.4s, v6.8H // extend srcp upper half
+ sxtl2 v7.4s, v7.8h // extend filter upper half
+ ld1 {v16.8H}, [x10], #16 // srcp[filterPos[2] + {0..7}]
+ mla v1.4S, v26.4s, v27.4s // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}]
+ ld1 {v17.8H}, [x13], #16 // load 8x16-bit at filter+2*filterSize
+ uxtl v22.4s, v16.4H // extend srcp lower half
+ sxtl v23.4s, v17.4H // extend filter lower half
+ uxtl2 v16.4s, v16.8H // extend srcp upper half
+ sxtl2 v17.4s, v17.8h // extend filter upper half
+ mla v2.4S, v22.4s, v23.4s // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}]
+ mla v2.4S, v16.4s, v17.4s // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}]
+ ld1 {v18.8H}, [x11], #16 // srcp[filterPos[3] + {0..7}]
+ mla v1.4S, v6.4s, v7.4s // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}]
+ ld1 {v19.8H}, [x4], #16 // load 8x16-bit at filter+3*filterSize
+ subs w15, w15, #8 // j -= 8: processed 8/filterSize
+ uxtl v28.4s, v18.4H // extend srcp lower half
+ sxtl v29.4s, v19.4H // extend filter lower half
+ uxtl2 v18.4s, v18.8H // extend srcp upper half
+ sxtl2 v19.4s, v19.8h // extend filter upper half
+ mla v3.4S, v28.4s, v29.4s // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}]
+ mla v3.4S, v18.4s, v19.4s // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}]
+ b.gt 2b // inner loop if filterSize not consumed completely
+ addp v0.4S, v0.4S, v1.4S // part01 horizontal pair adding
+ addp v2.4S, v2.4S, v3.4S // part23 horizontal pair adding
+ addp v0.4S, v0.4S, v2.4S // part0123 horizontal pair adding
+ subs w2, w2, #4 // dstW -= 4
+ sshl v0.4s, v0.4s, v21.4s // shift left by a negative amount (effectively a right shift); overflow expected
+ smin v0.4s, v0.4s, v20.4s // apply min (do not use sqshl)
+ st1 {v0.4s}, [x1], #16 // write to destination part0123
+ b.gt 1b // loop until end of line
+ ret
+endfunc
+
+function ff_hscale16to19_X4_neon_asm, export=1
+ // w0 int shift
+ // x1 int32_t *dst
+ // w2 int dstW
+ // x3 const uint8_t *src
+ // x4 const int16_t *filter
+ // x5 const int32_t *filterPos
+ // w6 int filterSize
+
+ stp d8, d9, [sp, #-0x20]!
+ stp d10, d11, [sp, #0x10]
+
+ movi v18.4s, #1
+ movi v17.4s, #1
+ shl v18.4s, v18.4s, #19
+ sub v21.4s, v18.4s, v17.4s // max allowed value
+ dup v17.4s, w0 // read shift
+ neg v20.4s, v17.4s // negate it, so it can be used in sshl (effectively shift right)
+
+ lsl w7, w6, #1
+1:
+ ldp w8, w9, [x5]
+ ldp w10, w11, [x5, #8]
+
+ movi v16.2d, #0 // initialize accumulator for idx + 0
+ movi v17.2d, #0 // initialize accumulator for idx + 1
+ movi v18.2d, #0 // initialize accumulator for idx + 2
+ movi v19.2d, #0 // initialize accumulator for idx + 3
+
+ mov x12, x4 // filter + 0
+ add x13, x4, x7 // filter + 1
+ add x8, x3, x8, lsl #1 // srcp + filterPos 0
+ add x14, x13, x7 // filter + 2
+ add x9, x3, x9, lsl #1 // srcp + filterPos 1
+ add x15, x14, x7 // filter + 3
+ add x10, x3, x10, lsl #1 // srcp + filterPos 2
+ mov w0, w6 // save the filterSize to temporary variable
+ add x11, x3, x11, lsl #1 // srcp + filterPos 3
+ add x5, x5, #16 // advance filter position
+ mov x16, xzr // clear the register x16 used for offsetting the filter values
+
+2:
+ ldr q4, [x8], #16 // load src values for idx 0
+ ldr q5, [x9], #16 // load src values for idx 1
+ uxtl v26.4s, v4.4h
+ uxtl2 v4.4s, v4.8h
+ ldr q31, [x12, x16] // load filter values for idx 0
+ ldr q6, [x10], #16 // load src values for idx 2
+ sxtl v22.4s, v31.4h
+ sxtl2 v31.4s, v31.8h
+ mla v16.4s, v26.4s, v22.4s // multiplication of lower half for idx 0
+ uxtl v25.4s, v5.4h
+ uxtl2 v5.4s, v5.8h
+ ldr q30, [x13, x16] // load filter values for idx 1
+ ldr q7, [x11], #16 // load src values for idx 3
+ mla v16.4s, v4.4s, v31.4s // multiplication of upper half for idx 0
+ uxtl v24.4s, v6.4h
+ sxtl v8.4s, v30.4h
+ sxtl2 v30.4s, v30.8h
+ mla v17.4s, v25.4s, v8.4s // multiplication of lower half for idx 1
+ ldr q29, [x14, x16] // load filter values for idx 2
+ uxtl2 v6.4s, v6.8h
+ sxtl v9.4s, v29.4h
+ sxtl2 v29.4s, v29.8h
+ mla v17.4s, v5.4s, v30.4s // multiplication of upper half for idx 1
+ ldr q28, [x15, x16] // load filter values for idx 3
+ mla v18.4s, v24.4s, v9.4s // multiplication of lower half for idx 2
+ uxtl v23.4s, v7.4h
+ sxtl v10.4s, v28.4h
+ mla v18.4s, v6.4s, v29.4s // multiplication of upper half for idx 2
+ uxtl2 v7.4s, v7.8h
+ sxtl2 v28.4s, v28.8h
+ mla v19.4s, v23.4s, v10.4s // multiplication of lower half for idx 3
+ sub w0, w0, #8
+ cmp w0, #8
+ mla v19.4s, v7.4s, v28.4s // multiplication of upper half for idx 3
+
+ add x16, x16, #16 // advance filter values indexing
+
+ b.ge 2b
+
+ // 4 iterations left
+
+ sub x17, x7, #8 // step back to wrap up the filter pos for last 4 elements
+
+ ldr d4, [x8] // load src values for idx 0
+ ldr d31, [x12, x17] // load filter values for idx 0
+ uxtl v4.4s, v4.4h
+ sxtl v31.4s, v31.4h
+ ldr d5, [x9] // load src values for idx 1
+ mla v16.4s, v4.4s, v31.4s // multiplication of the remaining 4 elements for idx 0
+ ldr d30, [x13, x17] // load filter values for idx 1
+ uxtl v5.4s, v5.4h
+ sxtl v30.4s, v30.4h
+ ldr d6, [x10] // load src values for idx 2
+ mla v17.4s, v5.4s, v30.4s // multiplication of the remaining 4 elements for idx 1
+ ldr d29, [x14, x17] // load filter values for idx 2
+ uxtl v6.4s, v6.4h
+ sxtl v29.4s, v29.4h
+ ldr d7, [x11] // load src values for idx 3
+ ldr d28, [x15, x17] // load filter values for idx 3
+ mla v18.4s, v6.4s, v29.4s // multiplication of the remaining 4 elements for idx 2
+ uxtl v7.4s, v7.4h
+ sxtl v28.4s, v28.4h
+ addp v16.4s, v16.4s, v17.4s
+ mla v19.4s, v7.4s, v28.4s // multiplication of the remaining 4 elements for idx 3
+ subs w2, w2, #4
+ addp v18.4s, v18.4s, v19.4s
+ addp v16.4s, v16.4s, v18.4s
+ sshl v16.4s, v16.4s, v20.4s
+ smin v16.4s, v16.4s, v21.4s
+
+ st1 {v16.4s}, [x1], #16
+ add x4, x4, x7, lsl #2
+ b.gt 1b
+
+ ldp d8, d9, [sp]
+ ldp d10, d11, [sp, #0x10]
+
+ add sp, sp, #0x20
+
+ ret
+endfunc
\ No newline at end of file
diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
index 993cdd67dd..ef6029e068 100644
--- a/libswscale/aarch64/swscale.c
+++ b/libswscale/aarch64/swscale.c
@@ -34,6 +34,16 @@ void ff_hscale16to15_X4_neon_asm(int shift, int16_t *_dst, int dstW,
const uint8_t *_src, const int16_t *filter,
const int32_t *filterPos, int filterSize);
+void ff_hscale16to19_4_neon_asm(int shift, int16_t *_dst, int dstW,
+ const uint8_t *_src, const int16_t *filter,
+ const int32_t *filterPos, int filterSize);
+void ff_hscale16to19_X8_neon_asm(int shift, int16_t *_dst, int dstW,
+ const uint8_t *_src, const int16_t *filter,
+ const int32_t *filterPos, int filterSize);
+void ff_hscale16to19_X4_neon_asm(int shift, int16_t *_dst, int dstW,
+ const uint8_t *_src, const int16_t *filter,
+ const int32_t *filterPos, int filterSize);
+
#define SCALE_FUNC(filter_n, from_bpc, to_bpc, opt) \
void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
SwsContext *c, int16_t *data, \
@@ -43,7 +53,8 @@ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
#define SCALE_FUNCS(filter_n, opt) \
SCALE_FUNC(filter_n, 8, 15, opt); \
SCALE_FUNC(filter_n, 8, 19, opt); \
- SCALE_FUNC(filter_n, 16, 15, opt);
+ SCALE_FUNC(filter_n, 16, 15, opt); \
+ SCALE_FUNC(filter_n, 16, 19, opt);
#define ALL_SCALE_FUNCS(opt) \
SCALE_FUNCS(4, opt); \
SCALE_FUNCS(X8, opt); \
@@ -73,6 +84,9 @@ void ff_yuv2plane1_8_neon(
if (c->dstBpc <= 14) \
hscalefn = \
ff_hscale16to15_ ## filtersize ## _ ## opt; \
+ else \
+ hscalefn = \
+ ff_hscale16to19_ ## filtersize ## _ ## opt; \
} \
} while (0)
@@ -150,4 +164,58 @@ void ff_hscale16to15_X4_neon(SwsContext *c, int16_t *_dst, int dstW,
sh = 16 - 1;
}
ff_hscale16to15_X4_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize);
+}
+
+void ff_hscale16to19_4_neon(SwsContext *c, int16_t *_dst, int dstW,
+ const uint8_t *_src, const int16_t *filter,
+ const int32_t *filterPos, int filterSize)
+{
+ const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat);
+ int bits = desc->comp[0].depth - 1;
+ int sh = bits - 4;
+
+ if ((isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8) && desc->comp[0].depth<16) {
+ sh = 9;
+ } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float input is processed like uint 16bpc */
+ sh = 16 - 1 - 4;
+ }
+
+ ff_hscale16to19_4_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize);
+
+}
+
+void ff_hscale16to19_X8_neon(SwsContext *c, int16_t *_dst, int dstW,
+ const uint8_t *_src, const int16_t *filter,
+ const int32_t *filterPos, int filterSize)
+{
+ const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat);
+ int bits = desc->comp[0].depth - 1;
+ int sh = bits - 4;
+
+ if ((isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8) && desc->comp[0].depth<16) {
+ sh = 9;
+ } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float input is processed like uint 16bpc */
+ sh = 16 - 1 - 4;
+ }
+
+ ff_hscale16to19_X8_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize);
+
+}
+
+void ff_hscale16to19_X4_neon(SwsContext *c, int16_t *_dst, int dstW,
+ const uint8_t *_src, const int16_t *filter,
+ const int32_t *filterPos, int filterSize)
+{
+ const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat);
+ int bits = desc->comp[0].depth - 1;
+ int sh = bits - 4;
+
+ if ((isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8) && desc->comp[0].depth<16) {
+ sh = 9;
+ } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float input is processed like uint 16bpc */
+ sh = 16 - 1 - 4;
+ }
+
+ ff_hscale16to19_X4_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize);
+
}
\ No newline at end of file
--
2.37.1
* Re: [FFmpeg-devel] [PATCH 1/4] sw_scale: Add specializations for hscale 8 to 19
2022-10-17 13:07 ` [FFmpeg-devel] [PATCH 1/4] sw_scale: Add specializations for hscale 8 to 19 Hubert Mazur
@ 2022-10-24 12:31 ` Martin Storsjö
0 siblings, 0 replies; 8+ messages in thread
From: Martin Storsjö @ 2022-10-24 12:31 UTC (permalink / raw)
To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop
On Mon, 17 Oct 2022, Hubert Mazur wrote:
> Add arm64 neon implementations for hscale 8 to 19 with filter
> sizes 4, 4X and 8. Both implementations are based on very similar ones
> dedicated to hscale 8 to 15. The major changes refer to saving
> the data - instead of writing the result as int16_t it is done
> with int32_t.
>
> These functions are heavily inspired on patches provided by J. Swinney
> and M. Storsjö for hscale8to15 which were slightly adapted for
> hscale8to19.
>
> The tests and benchmarks run on AWS Graviton 2 instances. The results
> from a checkasm tool shown below.
>
> hscale_8_to_19__fs_4_dstW_512_c: 5663.2
> hscale_8_to_19__fs_4_dstW_512_neon: 1259.7
> hscale_8_to_19__fs_8_dstW_512_c: 9306.0
> hscale_8_to_19__fs_8_dstW_512_neon: 2020.2
> hscale_8_to_19__fs_12_dstW_512_c: 12932.7
> hscale_8_to_19__fs_12_dstW_512_neon: 2462.5
> hscale_8_to_19__fs_16_dstW_512_c: 16844.2
> hscale_8_to_19__fs_16_dstW_512_neon: 4671.2
> hscale_8_to_19__fs_32_dstW_512_c: 32803.7
> hscale_8_to_19__fs_32_dstW_512_neon: 5474.2
> hscale_8_to_19__fs_40_dstW_512_c: 40948.0
> hscale_8_to_19__fs_40_dstW_512_neon: 6669.7
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libswscale/aarch64/hscale.S | 292 ++++++++++++++++++++++++++++++++++-
> libswscale/aarch64/swscale.c | 13 +-
> 2 files changed, 300 insertions(+), 5 deletions(-)
>
> diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S
> index a16d3dca42..5e8cad9825 100644
> --- a/libswscale/aarch64/hscale.S
> +++ b/libswscale/aarch64/hscale.S
> @@ -218,7 +218,6 @@ function ff_hscale8to15_4_neon, export=1
> // 2. Interleaved prefetching src data and madd
> // 3. Complete madd
> // 4. Complete remaining iterations when dstW % 8 != 0
> -
Nit: stray whitespace changes
> sub sp, sp, #32 // allocate 32 bytes on the stack
> cmp w2, #16 // if dstW <16, skip to the last block used for wrapping up
> b.lt 2f
> @@ -347,3 +346,294 @@ function ff_hscale8to15_4_neon, export=1
> add sp, sp, #32 // clean up stack
> ret
> endfunc
> +
> +function ff_hscale8to19_4_neon, export=1
> + // x0 SwsContext *c (unused)
> + // x1 int32_t *dst
> + // w2 int dstW
> + // x3 const uint8_t *src // treat it as uint16_t *src
> + // x4 const uint16_t *filter
> + // x5 const int32_t *filterPos
> + // w6 int filterSize
> +
> + movi v18.4s, #1
> + movi v17.4s, #1
> + shl v18.4s, v18.4s, #19
> + sub v18.4s, v18.4s, v17.4s // max allowed value
> +
> + cmp w2, #16
> + b.lt 2f // move to last block
> +
> + ldp w8, w9, [x5] // filterPos[0], filterPos[1]
> + ldp w10, w11, [x5, #8] // filterPos[2], filterPos[3]
> + ldp w12, w13, [x5, #16] // filterPos[4], filterPos[5]
> + ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7]
> + add x5, x5, #32
> +
> + // load data from
> + ldr w8, [x3, w8, UXTW]
> + ldr w9, [x3, w9, UXTW]
> + ldr w10, [x3, w10, UXTW]
> + ldr w11, [x3, w11, UXTW]
> + ldr w12, [x3, w12, UXTW]
> + ldr w13, [x3, w13, UXTW]
> + ldr w14, [x3, w14, UXTW]
> + ldr w15, [x3, w15, UXTW]
> +
> + sub sp, sp, #32
> +
> + stp w8, w9, [sp]
> + stp w10, w11, [sp, #8]
> + stp w12, w13, [sp, #16]
> + stp w14, w15, [sp, #24]
> +
> +1:
> + ld4 {v0.8b, v1.8b, v2.8b, v3.8b}, [sp]
> + ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7]
> + // load filterPositions into registers for next iteration
> +
> + ldp w8, w9, [x5] // filterPos[0], filterPos[1]
> + ldp w10, w11, [x5, #8] // filterPos[2], filterPos[3]
> + ldp w12, w13, [x5, #16] // filterPos[4], filterPos[5]
> + ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7]
> + add x5, x5, #32
> + uxtl v0.8h, v0.8b
> + ldr w8, [x3, w8, UXTW]
> + smull v5.4s, v0.4h, v28.4h // multiply first column of src
> + ldr w9, [x3, w9, UXTW]
> + smull2 v6.4s, v0.8h, v28.8h
> + stp w8, w9, [sp]
> +
> + uxtl v1.8h, v1.8b
> + ldr w10, [x3, w10, UXTW]
> + smlal v5.4s, v1.4h, v29.4h // multiply second column of src
> + ldr w11, [x3, w11, UXTW]
> + smlal2 v6.4s, v1.8h, v29.8h
> + stp w10, w11, [sp, #8]
> +
> + uxtl v2.8h, v2.8b
> + ldr w12, [x3, w12, UXTW]
> + smlal v5.4s, v2.4h, v30.4h // multiply third column of src
> + ldr w13, [x3, w13, UXTW]
> + smlal2 v6.4s, v2.8h, v30.8h
> + stp w12, w13, [sp, #16]
> +
> + uxtl v3.8h, v3.8b
> + ldr w14, [x3, w14, UXTW]
> + smlal v5.4s, v3.4h, v31.4h // multiply fourth column of src
> + ldr w15, [x3, w15, UXTW]
> + smlal2 v6.4s, v3.8h, v31.8h
> + stp w14, w15, [sp, #24]
> +
> + sub w2, w2, #8
> + sshr v5.4s, v5.4s, #3
> + sshr v6.4s, v6.4s, #3
> + smin v5.4s, v5.4s, v18.4s
> + smin v6.4s, v6.4s, v18.4s
> +
> + st1 {v5.4s, v6.4s}, [x1], #32
> + cmp w2, #16
> + b.ge 1b
> +
> + // here we make last iteration, without updating the registers
> + ld4 {v0.8b, v1.8b, v2.8b, v3.8b}, [sp]
> + ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7]
> +
> + uxtl v0.8h, v0.8b
> + uxtl v1.8h, v1.8b
> + smull v5.4s, v0.4h, v28.4h
> + smull2 v6.4s, v0.8h, v28.8h
> + uxtl v2.8h, v2.8b
> + smlal v5.4s, v1.4h, v29.4H
> + smlal2 v6.4s, v1.8h, v29.8H
> + uxtl v3.8h, v3.8b
> + smlal v5.4s, v2.4h, v30.4H
> + smlal2 v6.4s, v2.8h, v30.8H
> + smlal v5.4s, v3.4h, v31.4H
> + smlal2 v6.4s, v3.8h, v31.8h
> +
> + sshr v5.4s, v5.4s, #3
> + sshr v6.4s, v6.4s, #3
> +
> + smin v5.4s, v5.4s, v18.4s
> + smin v6.4s, v6.4s, v18.4s
> +
> + sub w2, w2, #8
> + st1 {v5.4s, v6.4s}, [x1], #32
> + add sp, sp, #32 // restore stack
> + cbnz w2, 2f
> +
> + ret
> +
> +2:
> + ldr w8, [x5], #4 // load filterPos
> + add x9, x3, w8, UXTW // src + filterPos
> + ld1 {v0.s}[0], [x9] // load 4 * uint8_t* into one single
> + ld1 {v31.4h}, [x4], #8
> + uxtl v0.8h, v0.8b
> + smull v5.4s, v0.4h, v31.4H
> + saddlv d0, v5.4S
> + sqshrn s0, d0, #3
> + smin v0.4s, v0.4s, v18.4s
> + st1 {v0.s}[0], [x1], #4
> + sub w2, w2, #1
> + cbnz w2, 2b // if iterations remain jump to beginning
> +
> + ret
> +endfunc
> +
> +function ff_hscale8to19_X8_neon, export=1
> + movi v20.4s, #1
> + movi v17.4s, #1
> + shl v20.4s, v20.4s, #19
> + sub v20.4s, v20.4s, v17.4s
> +
> + sbfiz x7, x6, #1, #32 // filterSize*2 (*2 because int16)
> +1:
> + mov x16, x4 // filter0 = filter
> + ldr w8, [x5], #4 // filterPos[idx]
> + add x12, x16, x7 // filter1 = filter0 + filterSize*2
> + ldr w0, [x5], #4 // filterPos[idx + 1]
> + add x13, x12, x7 // filter2 = filter1 + filterSize*2
> + ldr w11, [x5], #4 // filterPos[idx + 2]
> + add x4, x13, x7 // filter3 = filter2 + filterSize*2
> + ldr w9, [x5], #4 // filterPos[idx + 3]
> + movi v0.2D, #0 // val sum part 1 (for dst[0])
> + movi v1.2D, #0 // val sum part 2 (for dst[1])
> + movi v2.2D, #0 // val sum part 3 (for dst[2])
> + movi v3.2D, #0 // val sum part 4 (for dst[3])
> + add x17, x3, w8, UXTW // srcp + filterPos[0]
> + add x8, x3, w0, UXTW // srcp + filterPos[1]
> + add x0, x3, w11, UXTW // srcp + filterPos[2]
> + add x11, x3, w9, UXTW // srcp + filterPos[3]
> + mov w15, w6 // filterSize counter
> +2: ld1 {v4.8B}, [x17], #8 // srcp[filterPos[0] + {0..7}]
> + ld1 {v5.8H}, [x16], #16 // load 8x16-bit filter values, part 1
> + uxtl v4.8H, v4.8B // unpack part 1 to 16-bit
> + smlal v0.4S, v4.4H, v5.4H // v0 accumulates srcp[filterPos[0] + {0..3}] * filter[{0..3}]
> + ld1 {v6.8B}, [x8], #8 // srcp[filterPos[1] + {0..7}]
> + smlal2 v0.4S, v4.8H, v5.8H // v0 accumulates srcp[filterPos[0] + {4..7}] * filter[{4..7}]
> + ld1 {v7.8H}, [x12], #16 // load 8x16-bit at filter+filterSize
> + ld1 {v16.8B}, [x0], #8 // srcp[filterPos[2] + {0..7}]
> + uxtl v6.8H, v6.8B // unpack part 2 to 16-bit
> + ld1 {v17.8H}, [x13], #16 // load 8x16-bit at filter+2*filterSize
> + uxtl v16.8H, v16.8B // unpack part 3 to 16-bit
> + smlal v1.4S, v6.4H, v7.4H // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}]
> + ld1 {v18.8B}, [x11], #8 // srcp[filterPos[3] + {0..7}]
> + smlal v2.4S, v16.4H, v17.4H // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}]
> + ld1 {v19.8H}, [x4], #16 // load 8x16-bit at filter+3*filterSize
> + smlal2 v2.4S, v16.8H, v17.8H // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}]
> + uxtl v18.8H, v18.8B // unpack part 4 to 16-bit
> + smlal2 v1.4S, v6.8H, v7.8H // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}]
> + smlal v3.4S, v18.4H, v19.4H // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}]
> + subs w15, w15, #8 // j -= 8: processed 8/filterSize
> + smlal2 v3.4S, v18.8H, v19.8H // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}]
> + b.gt 2b // inner loop if filterSize not consumed completely
> + addp v0.4S, v0.4S, v1.4S // part01 horizontal pair adding
> + addp v2.4S, v2.4S, v3.4S // part23 horizontal pair adding
> + addp v0.4S, v0.4S, v2.4S // part0123 horizontal pair adding
> + subs w2, w2, #4 // dstW -= 4
> + sshr v0.4s, v0.4S, #3 // shift and clip the 4x32-bit final values
> + smin v0.4s, v0.4s, v20.4s
> + st1 {v0.4s}, [x1], #16 // write to destination part0123
> + b.gt 1b // loop until end of line
> + ret
> +endfunc
> +
> +function ff_hscale8to19_X4_neon, export=1
> + // x0 SwsContext *c (not used)
> + // x1 int16_t *dst
> + // w2 int dstW
> + // x3 const uint8_t *src
> + // x4 const int16_t *filter
> + // x5 const int32_t *filterPos
> + // w6 int filterSize
> +
> + movi v20.4s, #1
> + movi v17.4s, #1
> + shl v20.4s, v20.4s, #19
> + sub v20.4s, v20.4s, v17.4s
> +
> + lsl w7, w6, #1
> +1:
> + ldp w8, w9, [x5]
> + ldp w10, w11, [x5, #8]
> +
> + movi v16.2d, #0 // initialize accumulator for idx + 0
> + movi v17.2d, #0 // initialize accumulator for idx + 1
> + movi v18.2d, #0 // initialize accumulator for idx + 2
> + movi v19.2d, #0 // initialize accumulator for idx + 3
> +
> + mov x12, x4 // filter + 0
> + add x13, x4, x7 // filter + 1
> + add x8, x3, w8, UXTW // srcp + filterPos 0
> + add x14, x13, x7 // filter + 2
> + add x9, x3, w9, UXTW // srcp + filterPos 1
> + add x15, x14, x7 // filter + 3
> + add x10, x3, w10, UXTW // srcp + filterPos 2
> + mov w0, w6 // save the filterSize to temporary variable
> + add x11, x3, w11, UXTW // srcp + filterPos 3
> + add x5, x5, #16 // advance filter position
> + mov x16, xzr // clear the register x16 used for offsetting the filter values
> +
> +2:
> + ldr d4, [x8], #8 // load src values for idx 0
> + ldr q31, [x12, x16] // load filter values for idx 0
> + uxtl v4.8h, v4.8b // extend type to match the filter's size
> + ldr d5, [x9], #8 // load src values for idx 1
> + smlal v16.4s, v4.4h, v31.4h // multiplication of lower half for idx 0
> + uxtl v5.8h, v5.8b // extend type to match the filter's size
> + ldr q30, [x13, x16] // load filter values for idx 1
> + smlal2 v16.4s, v4.8h, v31.8h // multiplication of upper half for idx 0
> + ldr d6, [x10], #8 // load src values for idx 2
> + ldr q29, [x14, x16] // load filter values for idx 2
> + smlal v17.4s, v5.4h, v30.4H // multiplication of lower half for idx 1
> + ldr d7, [x11], #8 // load src values for idx 3
> + smlal2 v17.4s, v5.8h, v30.8H // multiplication of upper half for idx 1
> + uxtl v6.8h, v6.8B // extend type to match the filter's size
> + ldr q28, [x15, x16] // load filter values for idx 3
> + smlal v18.4s, v6.4h, v29.4h // multiplication of lower half for idx 2
> + uxtl v7.8h, v7.8B
> + smlal2 v18.4s, v6.8h, v29.8H // multiplication of upper half for idx 2
> + sub w0, w0, #8
> + smlal v19.4s, v7.4h, v28.4H // multiplication of lower half for idx 3
> + cmp w0, #8
> + smlal2 v19.4s, v7.8h, v28.8h // multiplication of upper half for idx 3
> + add x16, x16, #16 // advance filter values indexing
> +
> + b.ge 2b
> +
> +
> + // 4 iterations left
> +
> + sub x17, x7, #8 // step back to wrap up the filter pos for last 4 elements
> +
> + ldr s4, [x8] // load src values for idx 0
> + ldr d31, [x12, x17] // load filter values for idx 0
> + uxtl v4.8h, v4.8b // extend type to match the filter's size
> + ldr s5, [x9] // load src values for idx 1
> + smlal v16.4s, v4.4h, v31.4h
> + ldr d30, [x13, x17] // load filter values for idx 1
> + uxtl v5.8h, v5.8b // extend type to match the filter's size
> + ldr s6, [x10] // load src values for idx 2
> + smlal v17.4s, v5.4h, v30.4h
> + uxtl v6.8h, v6.8B // extend type to match the filter's size
> + ldr d29, [x14, x17] // load filter values for idx 2
> + ldr s7, [x11] // load src values for idx 3
> + addp v16.4s, v16.4s, v17.4s
> + uxtl v7.8h, v7.8B
> + ldr d28, [x15, x17] // load filter values for idx 3
> + smlal v18.4s, v6.4h, v29.4h
> + smlal v19.4s, v7.4h, v28.4h
> + subs w2, w2, #4
> + addp v18.4s, v18.4s, v19.4s
> + addp v16.4s, v16.4s, v18.4s
> + sshr v16.4s, v16.4s, #3
> + smin v16.4s, v16.4s, v20.4s
> +
> + st1 {v16.4s}, [x1], #16
> + add x4, x4, x7, lsl #2
> + b.gt 1b
> + ret
> +
> +endfunc
> \ No newline at end of file
Nit: The file could use a trailing newline
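For readers following the assembly quoted above, the per-output arithmetic it performs (multiply-accumulate over the filter taps, arithmetic shift right by 3, clamp to 2^19 - 1, 32-bit stores) can be sketched in scalar C. This is only an illustrative reference, not the code from the patch: the name hscale8to19_ref and the int32_t destination type are assumptions made for the sketch.

```c
#include <stdint.h>

/* Hypothetical scalar reference for the 8-bit -> 19-bit horizontal scale.
 * Mirrors what the NEON code does per output sample:
 *   smull/smlal -> multiply-accumulate src * filter
 *   sshr #3     -> shift right by 3
 *   smin        -> clamp to (1 << 19) - 1
 */
static void hscale8to19_ref(int32_t *dst, int dstW, const uint8_t *src,
                            const int16_t *filter, const int32_t *filterPos,
                            int filterSize)
{
    const int32_t max19 = (1 << 19) - 1;

    for (int i = 0; i < dstW; i++) {
        int32_t val = 0;

        /* Each output sample has its own run of filterSize coefficients. */
        for (int j = 0; j < filterSize; j++)
            val += (int32_t)src[filterPos[i] + j] * filter[filterSize * i + j];

        /* Right shift of a negative value is implementation-defined in C;
         * common compilers perform an arithmetic shift, matching sshr. */
        val >>= 3;
        dst[i] = val < max19 ? val : max19;
    }
}
```

Only the upper bound is clamped, which matches the single smin in the assembly.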
> diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
> index d1312c6658..479fe129d0 100644
> --- a/libswscale/aarch64/swscale.c
> +++ b/libswscale/aarch64/swscale.c
> @@ -29,7 +29,8 @@ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
> const int16_t *filter, \
> const int32_t *filterPos, int filterSize)
> #define SCALE_FUNCS(filter_n, opt) \
> - SCALE_FUNC(filter_n, 8, 15, opt);
> + SCALE_FUNC(filter_n, 8, 15, opt); \
> + SCALE_FUNC(filter_n, 8, 19, opt);
Nit: There's no need to preserve the odd spacing of the existing line
here.
Other than that, this patch (and the others) mostly seem fine. I've got a
version of the patches with these nits fixed locally (fixing it was a bit
annoying wrt rebasing the later patches though).
// Martin
* Re: [FFmpeg-devel] [PATCH 4/4] sw_scale: Add specializations for hscale 16 to 19
2022-10-17 13:07 ` [FFmpeg-devel] [PATCH 4/4] sw_scale: Add specializations for hscale 16 to 19 Hubert Mazur
@ 2022-10-24 13:19 ` Martin Storsjö
2022-10-25 7:14 ` Hubert Mazur
0 siblings, 1 reply; 8+ messages in thread
From: Martin Storsjö @ 2022-10-24 13:19 UTC (permalink / raw)
To: Hubert Mazur; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop
On Mon, 17 Oct 2022, Hubert Mazur wrote:
> Provide arm64 neon optimized implementations for hscale16To19 with
> filter sizes 4, 8 and X4.
>
> The tests and benchmarks run on AWS Graviton 2 instances.
> The results from a checkasm tool are shown below.
>
> hscale_16_to_19__fs_4_dstW_512_c: 6216.0
> hscale_16_to_19__fs_4_dstW_512_neon: 2257.0
> hscale_16_to_19__fs_8_dstW_512_c: 10417.7
> hscale_16_to_19__fs_8_dstW_512_neon: 3112.5
> hscale_16_to_19__fs_12_dstW_512_c: 14890.5
> hscale_16_to_19__fs_12_dstW_512_neon: 3899.0
> hscale_16_to_19__fs_16_dstW_512_c: 19006.5
> hscale_16_to_19__fs_16_dstW_512_neon: 5341.2
> hscale_16_to_19__fs_32_dstW_512_c: 36629.5
> hscale_16_to_19__fs_32_dstW_512_neon: 9502.7
> hscale_16_to_19__fs_40_dstW_512_c: 45477.5
> hscale_16_to_19__fs_40_dstW_512_neon: 11552.0
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libswscale/aarch64/hscale.S | 402 +++++++++++++++++++++++++++++++++++
> libswscale/aarch64/swscale.c | 70 +++++-
> 2 files changed, 471 insertions(+), 1 deletion(-)
> +void ff_hscale16to19_4_neon_asm(int shift, int16_t *_dst, int dstW,
> + const uint8_t *_src, const int16_t *filter,
> + const int32_t *filterPos, int filterSize);
> +void ff_hscale16to19_X8_neon_asm(int shift, int16_t *_dst, int dstW,
> + const uint8_t *_src, const int16_t *filter,
> + const int32_t *filterPos, int filterSize);
> +void ff_hscale16to19_X4_neon_asm(int shift, int16_t *_dst, int dstW,
> + const uint8_t *_src, const int16_t *filter,
> + const int32_t *filterPos, int filterSize);
> +
> #define SCALE_FUNC(filter_n, from_bpc, to_bpc, opt) \
> void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
> SwsContext *c, int16_t *data, \
> @@ -43,7 +53,8 @@ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
> #define SCALE_FUNCS(filter_n, opt) \
> SCALE_FUNC(filter_n, 8, 15, opt); \
> SCALE_FUNC(filter_n, 8, 19, opt); \
> - SCALE_FUNC(filter_n, 16, 15, opt);
> + SCALE_FUNC(filter_n, 16, 15, opt); \
> + SCALE_FUNC(filter_n, 16, 19, opt);
So this declares the functions we're implementing as C wrappers below, and
the manual declarations further up declare the actual asm functions?
I guess that works, although it makes unnecessary extern functions. In
such cases, we usually have the C functions be static functions, placed
above the code that uses them. But it's not a big deal.
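The layout suggested here can be sketched as follows: only the asm routine has external linkage, while the C wrapper is a static function defined above its first use. Everything in this sketch is hypothetical and exists only to make it compile on any platform: DemoContext stands in for SwsContext, demo_hscale16to19_kernel is a plain C stand-in for the asm routine, and the shift rule (depth - 5) is a placeholder rather than the logic from the patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for SwsContext; carries only the source bit depth. */
typedef struct DemoContext {
    int src_depth;
} DemoContext;

/* The one symbol that needs external linkage: in the patch this would be
 * the asm routine. Here it is a C stand-in with the same argument order. */
void demo_hscale16to19_kernel(int shift, int16_t *_dst, int dstW,
                              const uint8_t *_src, const int16_t *filter,
                              const int32_t *filterPos, int filterSize);

void demo_hscale16to19_kernel(int shift, int16_t *_dst, int dstW,
                              const uint8_t *_src, const int16_t *filter,
                              const int32_t *filterPos, int filterSize)
{
    int32_t *dst = (int32_t *)_dst;               /* 19-bit results need 32-bit slots */
    const uint16_t *src = (const uint16_t *)_src; /* 16-bit input samples */
    const int32_t max19 = (1 << 19) - 1;

    for (int i = 0; i < dstW; i++) {
        int32_t val = 0;
        for (int j = 0; j < filterSize; j++)
            val += (int32_t)src[filterPos[i] + j] * filter[filterSize * i + j];
        val >>= shift;
        dst[i] = val < max19 ? val : max19;
    }
}

/* Static wrapper, placed above the code that would install it as a function
 * pointer. It derives the shift from the context and forwards the rest. */
static void demo_hscale16to19_4(DemoContext *c, int16_t *dst, int dstW,
                                const uint8_t *src, const int16_t *filter,
                                const int32_t *filterPos, int filterSize)
{
    int shift = c->src_depth - 5; /* placeholder rule, assumed for the sketch */

    demo_hscale16to19_kernel(shift, dst, dstW, src, filter, filterPos,
                             filterSize);
}
```

With this arrangement no SCALE_FUNC-style extern declaration is needed for the wrapper itself; only the kernel prototype remains visible outside the file.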
Other than that, this patchset mostly seems fine.
However, I tested the patches on x86, and the new checkasm tests do fail
on x86 (both i386 and x86_64) - so that needs to be fixed anyway. So since
we'll need to do a new round anyway, please do try to fix up the minor
cosmetics I mentioned.
// Martin
* Re: [FFmpeg-devel] [PATCH 4/4] sw_scale: Add specializations for hscale 16 to 19
2022-10-24 13:19 ` Martin Storsjö
@ 2022-10-25 7:14 ` Hubert Mazur
0 siblings, 0 replies; 8+ messages in thread
From: Hubert Mazur @ 2022-10-25 7:14 UTC (permalink / raw)
To: Martin Storsjö; +Cc: gjb, upstream, jswinney, ffmpeg-devel, mw, spop
Thanks for the review.
I will fix the failing checkasm tests first and then take care of the minor
issues. I will try to resend fixed versions this week.
Regards,
Hubert
On Mon, Oct 24, 2022 at 3:19 PM Martin Storsjö <martin@martin.st> wrote:
> On Mon, 17 Oct 2022, Hubert Mazur wrote:
>
> > Provide arm64 neon optimized implementations for hscale16To19 with
> > filter sizes 4, 8 and X4.
> >
> > The tests and benchmarks run on AWS Graviton 2 instances.
> > The results from a checkasm tool are shown below.
> >
> > hscale_16_to_19__fs_4_dstW_512_c: 6216.0
> > hscale_16_to_19__fs_4_dstW_512_neon: 2257.0
> > hscale_16_to_19__fs_8_dstW_512_c: 10417.7
> > hscale_16_to_19__fs_8_dstW_512_neon: 3112.5
> > hscale_16_to_19__fs_12_dstW_512_c: 14890.5
> > hscale_16_to_19__fs_12_dstW_512_neon: 3899.0
> > hscale_16_to_19__fs_16_dstW_512_c: 19006.5
> > hscale_16_to_19__fs_16_dstW_512_neon: 5341.2
> > hscale_16_to_19__fs_32_dstW_512_c: 36629.5
> > hscale_16_to_19__fs_32_dstW_512_neon: 9502.7
> > hscale_16_to_19__fs_40_dstW_512_c: 45477.5
> > hscale_16_to_19__fs_40_dstW_512_neon: 11552.0
> >
> > Signed-off-by: Hubert Mazur <hum@semihalf.com>
> > ---
> > libswscale/aarch64/hscale.S | 402 +++++++++++++++++++++++++++++++++++
> > libswscale/aarch64/swscale.c | 70 +++++-
> > 2 files changed, 471 insertions(+), 1 deletion(-)
>
> > +void ff_hscale16to19_4_neon_asm(int shift, int16_t *_dst, int dstW,
> > + const uint8_t *_src, const int16_t *filter,
> > + const int32_t *filterPos, int filterSize);
> > +void ff_hscale16to19_X8_neon_asm(int shift, int16_t *_dst, int dstW,
> > + const uint8_t *_src, const int16_t *filter,
> > + const int32_t *filterPos, int filterSize);
> > +void ff_hscale16to19_X4_neon_asm(int shift, int16_t *_dst, int dstW,
> > + const uint8_t *_src, const int16_t *filter,
> > + const int32_t *filterPos, int filterSize);
> > +
> > #define SCALE_FUNC(filter_n, from_bpc, to_bpc, opt) \
> > void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
> > SwsContext *c, int16_t *data, \
> > @@ -43,7 +53,8 @@ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
> > #define SCALE_FUNCS(filter_n, opt) \
> > SCALE_FUNC(filter_n, 8, 15, opt); \
> > SCALE_FUNC(filter_n, 8, 19, opt); \
> > - SCALE_FUNC(filter_n, 16, 15, opt);
> > + SCALE_FUNC(filter_n, 16, 15, opt); \
> > + SCALE_FUNC(filter_n, 16, 19, opt);
>
> So this declares the functions we're implementing as C wrappers below, and
> the manual declarations further up declare the actual asm functions?
>
> I guess that works, although it makes unnecessary extern functions. In
> such cases, we usually have the C functions be static functions, placed
> above the code that uses them. But it's not a big deal.
>
> Other than that, this patchset mostly seems fine.
>
> However, I tested the patches on x86, and the new checkasm tests do fail
> on x86 (both i386 and x86_64) - so that needs to be fixed anyway. So since
> we'll need to do a new round anyway, please do try to fix up the minor
> cosmetics I mentioned.
>
> // Martin
>
>