* [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions
@ 2023-06-29 17:57 John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 01/15] avfilter/vf_bwdif: Add outline for aarch " John Cox
` (15 more replies)
0 siblings, 16 replies; 32+ messages in thread
From: John Cox @ 2023-06-29 17:57 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: thomas.mundt, John Cox
Also adds a filter_line3 method which on aarch64 neon yields approx 30%
speedup over 2xfilter_line and a memcpy
John Cox (15):
avfilter/vf_bwdif: Add outline for aarch neon functions
avfilter/vf_bwdif: Add common macros and consts for aarch64 neon
avfilter/vf_bwdif: Export C filter_intra
avfilter/vf_bwdif: Add neon for filter_intra
tests/checkasm: Add test for vf_bwdif filter_intra
avfilter/vf_bwdif: Add clip and spatial macros for aarch64 neon
avfilter/vf_bwdif: Export C filter_edge
avfilter/vf_bwdif: Add neon for filter_edge
tests/checkasm: Add test for vf_bwdif filter_edge
avfilter/vf_bwdif: Export C filter_line
avfilter/vf_bwdif: Add neon for filter_line
avfilter/vf_bwdif: Add a filter_line3 method for optimisation
avfilter/vf_bwdif: Add neon for filter_line3
tests/checkasm: Add test for vf_bwdif filter_line3
avfilter/vf_bwdif: Block filter slices into a multiple of 4 lines
libavfilter/aarch64/Makefile | 2 +
libavfilter/aarch64/vf_bwdif_init_aarch64.c | 125 ++++
libavfilter/aarch64/vf_bwdif_neon.S | 780 ++++++++++++++++++++
libavfilter/bwdif.h | 20 +
libavfilter/vf_bwdif.c | 70 +-
tests/checkasm/vf_bwdif.c | 172 +++++
6 files changed, 1154 insertions(+), 15 deletions(-)
create mode 100644 libavfilter/aarch64/vf_bwdif_init_aarch64.c
create mode 100644 libavfilter/aarch64/vf_bwdif_neon.S
--
2.39.2
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* [FFmpeg-devel] [PATCH 01/15] avfilter/vf_bwdif: Add outline for aarch neon functions
2023-06-29 17:57 [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions John Cox
@ 2023-06-29 17:57 ` John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 02/15] avfilter/vf_bwdif: Add common macros and consts for aarch64 neon John Cox
` (14 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: John Cox @ 2023-06-29 17:57 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: thomas.mundt, John Cox
Outline but no actual functions.
Signed-off-by: John Cox <jc@kynesim.co.uk>
---
libavfilter/aarch64/Makefile | 2 ++
libavfilter/aarch64/vf_bwdif_init_aarch64.c | 39 +++++++++++++++++++++
libavfilter/aarch64/vf_bwdif_neon.S | 25 +++++++++++++
libavfilter/bwdif.h | 1 +
libavfilter/vf_bwdif.c | 2 ++
5 files changed, 69 insertions(+)
create mode 100644 libavfilter/aarch64/vf_bwdif_init_aarch64.c
create mode 100644 libavfilter/aarch64/vf_bwdif_neon.S
diff --git a/libavfilter/aarch64/Makefile b/libavfilter/aarch64/Makefile
index b58daa3a3f..b68209bc94 100644
--- a/libavfilter/aarch64/Makefile
+++ b/libavfilter/aarch64/Makefile
@@ -1,3 +1,5 @@
+OBJS-$(CONFIG_BWDIF_FILTER) += aarch64/vf_bwdif_init_aarch64.o
OBJS-$(CONFIG_NLMEANS_FILTER) += aarch64/vf_nlmeans_init.o
+NEON-OBJS-$(CONFIG_BWDIF_FILTER) += aarch64/vf_bwdif_neon.o
NEON-OBJS-$(CONFIG_NLMEANS_FILTER) += aarch64/vf_nlmeans_neon.o
diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
new file mode 100644
index 0000000000..86d53b2ca1
--- /dev/null
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -0,0 +1,39 @@
+/*
+ * bwdif aarch64 NEON optimisations
+ *
+ * Copyright (c) 2023 John Cox <jc@kynesim.co.uk>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/common.h"
+#include "libavfilter/bwdif.h"
+#include "libavutil/aarch64/cpu.h"
+
+void
+ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
+{
+ const int cpu_flags = av_get_cpu_flags();
+
+ if (bit_depth != 8)
+ return;
+
+ if (!have_neon(cpu_flags))
+ return;
+
+}
+
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
new file mode 100644
index 0000000000..639ab22998
--- /dev/null
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -0,0 +1,25 @@
+/*
+ * bwdif aarch64 NEON optimisations
+ *
+ * Copyright (c) 2023 John Cox <jc@kynesim.co.uk>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+
+#include "libavutil/aarch64/asm.S"
+
diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index 5749345f78..6a0f70487a 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -39,5 +39,6 @@ typedef struct BWDIFContext {
void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int bit_depth);
void ff_bwdif_init_x86(BWDIFContext *bwdif, int bit_depth);
+void ff_bwdif_init_aarch64(BWDIFContext *bwdif, int bit_depth);
#endif /* AVFILTER_BWDIF_H */
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index e278cf1217..39a51429ac 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -369,6 +369,8 @@ av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth)
#if ARCH_X86
ff_bwdif_init_x86(s, bit_depth);
+#elif ARCH_AARCH64
+ ff_bwdif_init_aarch64(s, bit_depth);
#endif
}
--
2.39.2
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* [FFmpeg-devel] [PATCH 02/15] avfilter/vf_bwdif: Add common macros and consts for aarch64 neon
2023-06-29 17:57 [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 01/15] avfilter/vf_bwdif: Add outline for aarch " John Cox
@ 2023-06-29 17:57 ` John Cox
2023-07-01 21:35 ` Martin Storsjö
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 03/15] avfilter/vf_bwdif: Export C filter_intra John Cox
` (13 subsequent siblings)
15 siblings, 1 reply; 32+ messages in thread
From: John Cox @ 2023-06-29 17:57 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: thomas.mundt, John Cox
Add macros for dual scalar half->single multiply and accumulate
Add macro for shift, saturate and shorten single to byte
Add filter constants
Signed-off-by: John Cox <jc@kynesim.co.uk>
---
libavfilter/aarch64/vf_bwdif_neon.S | 46 +++++++++++++++++++++++++++++
1 file changed, 46 insertions(+)
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index 639ab22998..a8f0ed525a 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -23,3 +23,49 @@
#include "libavutil/aarch64/asm.S"
+.macro SQSHRUNN b, s0, s1, s2, s3, n
+ sqshrun \s0\().4h, \s0\().4s, #\n - 8
+ sqshrun2 \s0\().8h, \s1\().4s, #\n - 8
+ sqshrun \s1\().4h, \s2\().4s, #\n - 8
+ sqshrun2 \s1\().8h, \s3\().4s, #\n - 8
+ uzp2 \b\().16b, \s0\().16b, \s1\().16b
+.endm
+
+.macro SMULL4K a0, a1, a2, a3, s0, s1, k
+ smull \a0\().4s, \s0\().4h, \k
+ smull2 \a1\().4s, \s0\().8h, \k
+ smull \a2\().4s, \s1\().4h, \k
+ smull2 \a3\().4s, \s1\().8h, \k
+.endm
+
+.macro UMULL4K a0, a1, a2, a3, s0, s1, k
+ umull \a0\().4s, \s0\().4h, \k
+ umull2 \a1\().4s, \s0\().8h, \k
+ umull \a2\().4s, \s1\().4h, \k
+ umull2 \a3\().4s, \s1\().8h, \k
+.endm
+
+.macro UMLAL4K a0, a1, a2, a3, s0, s1, k
+ umlal \a0\().4s, \s0\().4h, \k
+ umlal2 \a1\().4s, \s0\().8h, \k
+ umlal \a2\().4s, \s1\().4h, \k
+ umlal2 \a3\().4s, \s1\().8h, \k
+.endm
+
+.macro UMLSL4K a0, a1, a2, a3, s0, s1, k
+ umlsl \a0\().4s, \s0\().4h, \k
+ umlsl2 \a1\().4s, \s0\().8h, \k
+ umlsl \a2\().4s, \s1\().4h, \k
+ umlsl2 \a3\().4s, \s1\().8h, \k
+.endm
+
+// static const uint16_t coef_lf[2] = { 4309, 213 };
+// static const uint16_t coef_hf[3] = { 5570, 3801, 1016 };
+// static const uint16_t coef_sp[2] = { 5077, 981 };
+
+ .align 16
+coeffs:
+ .hword 4309 * 4, 213 * 4 // lf[0]*4 = v0.h[0]
+ .hword 5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
+ .hword 5077, 981 // sp[0] = v0.h[6]
+
--
2.39.2
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* [FFmpeg-devel] [PATCH 03/15] avfilter/vf_bwdif: Export C filter_intra
2023-06-29 17:57 [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 01/15] avfilter/vf_bwdif: Add outline for aarch " John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 02/15] avfilter/vf_bwdif: Add common macros and consts for aarch64 neon John Cox
@ 2023-06-29 17:57 ` John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 04/15] avfilter/vf_bwdif: Add neon for filter_intra John Cox
` (12 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: John Cox @ 2023-06-29 17:57 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: thomas.mundt, John Cox
Needed for tail fixup of neon code
Signed-off-by: John Cox <jc@kynesim.co.uk>
---
libavfilter/bwdif.h | 3 +++
libavfilter/vf_bwdif.c | 6 +++---
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index 6a0f70487a..ae6f6ce223 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -41,4 +41,7 @@ void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int bit_depth);
void ff_bwdif_init_x86(BWDIFContext *bwdif, int bit_depth);
void ff_bwdif_init_aarch64(BWDIFContext *bwdif, int bit_depth);
+void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int mrefs,
+ int prefs3, int mrefs3, int parity, int clip_max);
+
#endif /* AVFILTER_BWDIF_H */
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 39a51429ac..035fc58670 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -122,8 +122,8 @@ typedef struct ThreadData {
next2++; \
}
-static void filter_intra(void *dst1, void *cur1, int w, int prefs, int mrefs,
- int prefs3, int mrefs3, int parity, int clip_max)
+void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int mrefs,
+ int prefs3, int mrefs3, int parity, int clip_max)
{
uint8_t *dst = dst1;
uint8_t *cur = cur1;
@@ -362,7 +362,7 @@ av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth)
s->filter_line = filter_line_c_16bit;
s->filter_edge = filter_edge_16bit;
} else {
- s->filter_intra = filter_intra;
+ s->filter_intra = ff_bwdif_filter_intra_c;
s->filter_line = filter_line_c;
s->filter_edge = filter_edge;
}
--
2.39.2
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* [FFmpeg-devel] [PATCH 04/15] avfilter/vf_bwdif: Add neon for filter_intra
2023-06-29 17:57 [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions John Cox
` (2 preceding siblings ...)
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 03/15] avfilter/vf_bwdif: Export C filter_intra John Cox
@ 2023-06-29 17:57 ` John Cox
2023-07-01 21:37 ` Martin Storsjö
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 05/15] tests/checkasm: Add test for vf_bwdif filter_intra John Cox
` (11 subsequent siblings)
15 siblings, 1 reply; 32+ messages in thread
From: John Cox @ 2023-06-29 17:57 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: thomas.mundt, John Cox
Signed-off-by: John Cox <jc@kynesim.co.uk>
---
libavfilter/aarch64/vf_bwdif_init_aarch64.c | 17 +++++++
libavfilter/aarch64/vf_bwdif_neon.S | 53 +++++++++++++++++++++
2 files changed, 70 insertions(+)
diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
index 86d53b2ca1..3ffaa07ab3 100644
--- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -24,6 +24,22 @@
#include "libavfilter/bwdif.h"
#include "libavutil/aarch64/cpu.h"
+void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
+ int prefs3, int mrefs3, int parity, int clip_max);
+
+
+static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int mrefs,
+ int prefs3, int mrefs3, int parity, int clip_max)
+{
+ const int w0 = clip_max != 255 ? 0 : w & ~15;
+
+ ff_bwdif_filter_intra_neon(dst1, cur1, w0, prefs, mrefs, prefs3, mrefs3, parity, clip_max);
+
+ if (w0 < w)
+ ff_bwdif_filter_intra_c((char *)dst1 + w0, (char *)cur1 + w0,
+ w - w0, prefs, mrefs, prefs3, mrefs3, parity, clip_max);
+}
+
void
ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
{
@@ -35,5 +51,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
if (!have_neon(cpu_flags))
return;
+ s->filter_intra = filter_intra_helper;
}
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index a8f0ed525a..b863b3447d 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -69,3 +69,56 @@ coeffs:
.hword 5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
.hword 5077, 981 // sp[0] = v0.h[6]
+// ============================================================================
+//
+// void ff_bwdif_filter_intra_neon(
+// void *dst1, // x0
+// void *cur1, // x1
+// int w, // w2
+// int prefs, // w3
+// int mrefs, // w4
+// int prefs3, // w5
+// int mrefs3, // w6
+// int parity, // w7 unused
+// int clip_max) // [sp, #0] unused
+
+function ff_bwdif_filter_intra_neon, export=1
+ cmp w2, #0
+ ble 99f
+
+ ldr q0, coeffs
+
+// for (x = 0; x < w; x++) {
+10:
+
+// interpol = (coef_sp[0] * (cur[mrefs] + cur[prefs]) - coef_sp[1] * (cur[mrefs3] + cur[prefs3])) >> 13;
+ ldr q31, [x1, w4, SXTW]
+ ldr q30, [x1, w3, SXTW]
+ ldr q29, [x1, w6, SXTW]
+ ldr q28, [x1, w5, SXTW]
+
+ uaddl v20.8h, v31.8b, v30.8b
+ uaddl2 v21.8h, v31.16b, v30.16b
+
+ UMULL4K v2, v3, v4, v5, v20, v21, v0.h[6]
+
+ uaddl v20.8h, v29.8b, v28.8b
+ uaddl2 v21.8h, v29.16b, v28.16b
+
+ UMLSL4K v2, v3, v4, v5, v20, v21, v0.h[7]
+
+// dst[0] = av_clip(interpol, 0, clip_max);
+ SQSHRUNN v2, v2, v3, v4, v5, 13
+ str q2, [x0], #16
+
+// dst++;
+// cur++;
+// }
+
+ subs w2, w2, #16
+ add x1, x1, #16
+ bgt 10b
+
+99:
+ ret
+endfunc
--
2.39.2
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* [FFmpeg-devel] [PATCH 05/15] tests/checkasm: Add test for vf_bwdif filter_intra
2023-06-29 17:57 [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions John Cox
` (3 preceding siblings ...)
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 04/15] avfilter/vf_bwdif: Add neon for filter_intra John Cox
@ 2023-06-29 17:57 ` John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 06/15] avfilter/vf_bwdif: Add clip and spatial macros for aarch64 neon John Cox
` (10 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: John Cox @ 2023-06-29 17:57 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: thomas.mundt, John Cox
Signed-off-by: John Cox <jc@kynesim.co.uk>
---
tests/checkasm/vf_bwdif.c | 37 +++++++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)
diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c
index 46224bb575..034bbabb4c 100644
--- a/tests/checkasm/vf_bwdif.c
+++ b/tests/checkasm/vf_bwdif.c
@@ -20,6 +20,7 @@
#include "checkasm.h"
#include "libavcodec/internal.h"
#include "libavfilter/bwdif.h"
+#include "libavutil/mem_internal.h"
#define WIDTH 256
@@ -81,4 +82,40 @@ void checkasm_check_vf_bwdif(void)
BODY(uint16_t, 10);
report("bwdif10");
}
+
+ if (check_func(ctx_8.filter_intra, "bwdif8.intra")) {
+ LOCAL_ALIGNED_16(uint8_t, cur0, [11*WIDTH]);
+ LOCAL_ALIGNED_16(uint8_t, cur1, [11*WIDTH]);
+ LOCAL_ALIGNED_16(uint8_t, dst0, [WIDTH*3]);
+ LOCAL_ALIGNED_16(uint8_t, dst1, [WIDTH*3]);
+ const int stride = WIDTH;
+ const int mask = (1<<8)-1;
+
+ declare_func(void, void *dst1, void *cur1, int w, int prefs, int mrefs,
+ int prefs3, int mrefs3, int parity, int clip_max);
+
+ randomize_buffers( cur0, cur1, mask, 11*WIDTH);
+ memset(dst0, 0xba, WIDTH * 3);
+ memset(dst1, 0xba, WIDTH * 3);
+
+ call_ref(dst0 + stride,
+ cur0 + stride * 4, WIDTH,
+ stride, -stride, stride * 3, -stride * 3,
+ 0, mask);
+ call_new(dst1 + stride,
+ cur0 + stride * 4, WIDTH,
+ stride, -stride, stride * 3, -stride * 3,
+ 0, mask);
+
+ if (memcmp(dst0, dst1, WIDTH*3)
+ || memcmp( cur0, cur1, WIDTH*11))
+ fail();
+
+ bench_new(dst1 + stride,
+ cur0 + stride * 4, WIDTH,
+ stride, -stride, stride * 3, -stride * 3,
+ 0, mask);
+
+ report("bwdif8.intra");
+ }
}
--
2.39.2
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* [FFmpeg-devel] [PATCH 06/15] avfilter/vf_bwdif: Add clip and spatial macros for aarch64 neon
2023-06-29 17:57 [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions John Cox
` (4 preceding siblings ...)
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 05/15] tests/checkasm: Add test for vf_bwdif filter_intra John Cox
@ 2023-06-29 17:57 ` John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 07/15] avfilter/vf_bwdif: Export C filter_edge John Cox
` (9 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: John Cox @ 2023-06-29 17:57 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: thomas.mundt, John Cox
Signed-off-by: John Cox <jc@kynesim.co.uk>
---
libavfilter/aarch64/vf_bwdif_neon.S | 59 +++++++++++++++++++++++++++++
1 file changed, 59 insertions(+)
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index b863b3447d..6c5d1598f4 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -59,6 +59,65 @@
umlsl2 \a3\().4s, \s1\().8h, \k
.endm
+// int b = m2s1 - m1;
+// int f = p2s1 - p1;
+// int dc = c0s1 - m1;
+// int de = c0s1 - p1;
+// int sp_max = FFMIN(p1 - c0s1, m1 - c0s1);
+// sp_max = FFMIN(sp_max, FFMAX(-b,-f));
+// int sp_min = FFMIN(c0s1 - p1, c0s1 - m1);
+// sp_min = FFMIN(sp_min, FFMAX(b,f));
+// diff = diff == 0 ? 0 : FFMAX3(diff, sp_min, sp_max);
+.macro SPAT_CHECK diff, m2s1, m1, c0s1, p1, p2s1, t0, t1, t2, t3
+ uqsub \t0\().16b, \p1\().16b, \c0s1\().16b
+ uqsub \t2\().16b, \m1\().16b, \c0s1\().16b
+ umin \t2\().16b, \t0\().16b, \t2\().16b
+
+ uqsub \t1\().16b, \m1\().16b, \m2s1\().16b
+ uqsub \t3\().16b, \p1\().16b, \p2s1\().16b
+ umax \t3\().16b, \t3\().16b, \t1\().16b
+ umin \t3\().16b, \t3\().16b, \t2\().16b
+
+ uqsub \t0\().16b, \c0s1\().16b, \p1\().16b
+ uqsub \t2\().16b, \c0s1\().16b, \m1\().16b
+ umin \t2\().16b, \t0\().16b, \t2\().16b
+
+ uqsub \t1\().16b, \m2s1\().16b, \m1\().16b
+ uqsub \t0\().16b, \p2s1\().16b, \p1\().16b
+ umax \t0\().16b, \t0\().16b, \t1\().16b
+ umin \t2\().16b, \t2\().16b, \t0\().16b
+
+ cmeq \t1\().16b, \diff\().16b, #0
+ umax \diff\().16b, \diff\().16b, \t3\().16b
+ umax \diff\().16b, \diff\().16b, \t2\().16b
+ bic \diff\().16b, \diff\().16b, \t1\().16b
+.endm
+
+// i0 = s0;
+// if (i0 > d0 + diff0)
+// i0 = d0 + diff0;
+// else if (i0 < d0 - diff0)
+// i0 = d0 - diff0;
+//
+// i0 = s0 is safe
+.macro DIFF_CLIP i0, s0, d0, diff, t0, t1
+ uqadd \t0\().16b, \d0\().16b, \diff\().16b
+ uqsub \t1\().16b, \d0\().16b, \diff\().16b
+ umin \i0\().16b, \s0\().16b, \t0\().16b
+ umax \i0\().16b, \i0\().16b, \t1\().16b
+.endm
+
+// i0 = FFABS(m1 - p1) > td0 ? i1 : i2;
+// DIFF_CLIP
+//
+// i0 = i1 is safe
+.macro INTERPOL i0, i1, i2, m1, d0, p1, td0, diff, t0, t1, t2
+ uabd \t0\().16b, \m1\().16b, \p1\().16b
+ cmhi \t0\().16b, \t0\().16b, \td0\().16b
+ bsl \t0\().16b, \i1\().16b, \i2\().16b
+ DIFF_CLIP \i0, \t0, \d0, \diff, \t1, \t2
+.endm
+
// static const uint16_t coef_lf[2] = { 4309, 213 };
// static const uint16_t coef_hf[3] = { 5570, 3801, 1016 };
// static const uint16_t coef_sp[2] = { 5077, 981 };
--
2.39.2
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* [FFmpeg-devel] [PATCH 07/15] avfilter/vf_bwdif: Export C filter_edge
2023-06-29 17:57 [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions John Cox
` (5 preceding siblings ...)
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 06/15] avfilter/vf_bwdif: Add clip and spatial macros for aarch64 neon John Cox
@ 2023-06-29 17:57 ` John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 08/15] avfilter/vf_bwdif: Add neon for filter_edge John Cox
` (8 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: John Cox @ 2023-06-29 17:57 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: thomas.mundt, John Cox
Needed for tail fixup of neon code
Signed-off-by: John Cox <jc@kynesim.co.uk>
---
libavfilter/bwdif.h | 4 ++++
libavfilter/vf_bwdif.c | 8 ++++----
2 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index ae6f6ce223..ae1616d366 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -41,6 +41,10 @@ void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int bit_depth);
void ff_bwdif_init_x86(BWDIFContext *bwdif, int bit_depth);
void ff_bwdif_init_aarch64(BWDIFContext *bwdif, int bit_depth);
+void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void *cur1, void *next1,
+ int w, int prefs, int mrefs, int prefs2, int mrefs2,
+ int parity, int clip_max, int spat);
+
void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int mrefs,
int prefs3, int mrefs3, int parity, int clip_max);
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 035fc58670..bec83111b4 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -150,9 +150,9 @@ static void filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
FILTER2()
}
-static void filter_edge(void *dst1, void *prev1, void *cur1, void *next1,
- int w, int prefs, int mrefs, int prefs2, int mrefs2,
- int parity, int clip_max, int spat)
+void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void *cur1, void *next1,
+ int w, int prefs, int mrefs, int prefs2, int mrefs2,
+ int parity, int clip_max, int spat)
{
uint8_t *dst = dst1;
uint8_t *prev = prev1;
@@ -364,7 +364,7 @@ av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth)
} else {
s->filter_intra = ff_bwdif_filter_intra_c;
s->filter_line = filter_line_c;
- s->filter_edge = filter_edge;
+ s->filter_edge = ff_bwdif_filter_edge_c;
}
#if ARCH_X86
--
2.39.2
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* [FFmpeg-devel] [PATCH 08/15] avfilter/vf_bwdif: Add neon for filter_edge
2023-06-29 17:57 [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions John Cox
` (6 preceding siblings ...)
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 07/15] avfilter/vf_bwdif: Export C filter_edge John Cox
@ 2023-06-29 17:57 ` John Cox
2023-07-01 21:40 ` Martin Storsjö
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 09/15] tests/checkasm: Add test for vf_bwdif filter_edge John Cox
` (7 subsequent siblings)
15 siblings, 1 reply; 32+ messages in thread
From: John Cox @ 2023-06-29 17:57 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: thomas.mundt, John Cox
Signed-off-by: John Cox <jc@kynesim.co.uk>
---
libavfilter/aarch64/vf_bwdif_init_aarch64.c | 20 ++++
libavfilter/aarch64/vf_bwdif_neon.S | 104 ++++++++++++++++++++
2 files changed, 124 insertions(+)
diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
index 3ffaa07ab3..e75cf2f204 100644
--- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -24,10 +24,29 @@
#include "libavfilter/bwdif.h"
#include "libavutil/aarch64/cpu.h"
+void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void *next1,
+ int w, int prefs, int mrefs, int prefs2, int mrefs2,
+ int parity, int clip_max, int spat);
+
void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
int prefs3, int mrefs3, int parity, int clip_max);
+static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void *next1,
+ int w, int prefs, int mrefs, int prefs2, int mrefs2,
+ int parity, int clip_max, int spat)
+{
+ const int w0 = clip_max != 255 ? 0 : w & ~15;
+
+ ff_bwdif_filter_edge_neon(dst1, prev1, cur1, next1, w0, prefs, mrefs, prefs2, mrefs2,
+ parity, clip_max, spat);
+
+ if (w0 < w)
+ ff_bwdif_filter_edge_c((char *)dst1 + w0, (char *)prev1 + w0, (char *)cur1 + w0, (char *)next1 + w0,
+ w - w0, prefs, mrefs, prefs2, mrefs2,
+ parity, clip_max, spat);
+}
+
static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int mrefs,
int prefs3, int mrefs3, int parity, int clip_max)
{
@@ -52,5 +71,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
return;
s->filter_intra = filter_intra_helper;
+ s->filter_edge = filter_edge_helper;
}
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index 6c5d1598f4..a33b235882 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -128,6 +128,110 @@ coeffs:
.hword 5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
.hword 5077, 981 // sp[0] = v0.h[6]
+// ============================================================================
+//
+// void ff_bwdif_filter_edge_neon(
+// void *dst1, // x0
+// void *prev1, // x1
+// void *cur1, // x2
+// void *next1, // x3
+// int w, // w4
+// int prefs, // w5
+// int mrefs, // w6
+// int prefs2, // w7
+// int mrefs2, // [sp, #0]
+// int parity, // [sp, #8]
+// int clip_max, // [sp, #16] unused
+// int spat); // [sp, #24]
+
+function ff_bwdif_filter_edge_neon, export=1
+ // Sanity check w
+ cmp w4, #0
+ ble 99f
+
+// #define prev2 cur
+// const uint8_t * restrict next2 = parity ? prev : next;
+
+ ldr w8, [sp, #0] // mrefs2
+
+ ldr w17, [sp, #8] // parity
+ ldr w16, [sp, #24] // spat
+ cmp w17, #0
+ csel x17, x1, x3, ne
+
+// for (x = 0; x < w; x++) {
+
+10:
+// int m1 = cur[mrefs];
+// int d = (prev2[0] + next2[0]) >> 1;
+// int p1 = cur[prefs];
+// int temporal_diff0 = FFABS(prev2[0] - next2[0]);
+// int temporal_diff1 =(FFABS(prev[mrefs] - m1) + FFABS(prev[prefs] - p1)) >> 1;
+// int temporal_diff2 =(FFABS(next[mrefs] - m1) + FFABS(next[prefs] - p1)) >> 1;
+// int diff = FFMAX3(temporal_diff0 >> 1, temporal_diff1, temporal_diff2);
+ ldr q31, [x2]
+ ldr q21, [x17]
+ uhadd v16.16b, v31.16b, v21.16b // d0 = v16
+ uabd v17.16b, v31.16b, v21.16b // td0 = v17
+ ldr q24, [x2, w6, SXTW] // m1 = v24
+ ldr q22, [x2, w5, SXTW] // p1 = v22
+
+ ldr q0, [x1, w6, SXTW] // prev[mrefs]
+ ldr q2, [x1, w5, SXTW] // prev[prefs]
+ ldr q1, [x3, w6, SXTW] // next[mrefs]
+ ldr q3, [x3, w5, SXTW] // next[prefs]
+
+ ushr v29.16b, v17.16b, #1
+
+ uabd v31.16b, v0.16b, v24.16b
+ uabd v30.16b, v2.16b, v22.16b
+ uhadd v0.16b, v31.16b, v30.16b // td1 = q0
+
+ uabd v31.16b, v1.16b, v24.16b
+ uabd v30.16b, v3.16b, v22.16b
+ uhadd v1.16b, v31.16b, v30.16b // td2 = q1
+
+ umax v0.16b, v0.16b, v29.16b
+ umax v0.16b, v0.16b, v1.16b // diff = v0
+
+// if (spat) {
+// SPAT_CHECK()
+// }
+// i0 = (m1 + p1) >> 1;
+ cbz w16, 1f
+
+ ldr q31, [x2, w8, SXTW]
+ ldr q18, [x17, w8, SXTW]
+ ldr q30, [x2, w7, SXTW]
+ ldr q19, [x17, w7, SXTW]
+ uhadd v18.16b, v18.16b, v31.16b
+ uhadd v19.16b, v19.16b, v30.16b
+
+ SPAT_CHECK v0, v18, v24, v16, v22, v19, v31, v30, v29, v28
+
+1:
+ uhadd v2.16b, v22.16b, v24.16b
+
+ // i0 = v2, s0 = v2, d0 = v16, diff = v0, t0 = v31, t1 = v30
+ DIFF_CLIP v2, v2, v16, v0, v31, v30
+
+// dst[0] = av_clip(interpol, 0, clip_max);
+ str q2, [x0], #16
+
+// dst++;
+// cur++;
+// }
+ subs w4, w4, #16
+ add x1, x1, #16
+ add x2, x2, #16
+ add x3, x3, #16
+ add x17, x17, #16
+ bgt 10b
+
+99:
+ ret
+endfunc
+
// ============================================================================
//
// void ff_bwdif_filter_intra_neon(
--
2.39.2
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* [FFmpeg-devel] [PATCH 09/15] tests/checkasm: Add test for vf_bwdif filter_edge
2023-06-29 17:57 [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions John Cox
` (7 preceding siblings ...)
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 08/15] avfilter/vf_bwdif: Add neon for filter_edge John Cox
@ 2023-06-29 17:57 ` John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 10/15] avfilter/vf_bwdif: Export C filter_line John Cox
` (6 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: John Cox @ 2023-06-29 17:57 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: thomas.mundt, John Cox
Signed-off-by: John Cox <jc@kynesim.co.uk>
---
tests/checkasm/vf_bwdif.c | 54 +++++++++++++++++++++++++++++++++++++++
1 file changed, 54 insertions(+)
diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c
index 034bbabb4c..5fdba09fdc 100644
--- a/tests/checkasm/vf_bwdif.c
+++ b/tests/checkasm/vf_bwdif.c
@@ -83,6 +83,60 @@ void checkasm_check_vf_bwdif(void)
report("bwdif10");
}
+ {
+ LOCAL_ALIGNED_16(uint8_t, prev0, [11*WIDTH]);
+ LOCAL_ALIGNED_16(uint8_t, prev1, [11*WIDTH]);
+ LOCAL_ALIGNED_16(uint8_t, next0, [11*WIDTH]);
+ LOCAL_ALIGNED_16(uint8_t, next1, [11*WIDTH]);
+ LOCAL_ALIGNED_16(uint8_t, cur0, [11*WIDTH]);
+ LOCAL_ALIGNED_16(uint8_t, cur1, [11*WIDTH]);
+ LOCAL_ALIGNED_16(uint8_t, dst0, [WIDTH*3]);
+ LOCAL_ALIGNED_16(uint8_t, dst1, [WIDTH*3]);
+ const int stride = WIDTH;
+ const int mask = (1<<8)-1;
+ int spat;
+ int parity;
+
+ for (spat = 0; spat != 2; ++spat) {
+ for (parity = 0; parity != 2; ++parity) {
+ if (check_func(ctx_8.filter_edge, "bwdif8.edge.s%d.p%d", spat, parity)) {
+
+ declare_func(void, void *dst1, void *prev1, void *cur1, void *next1,
+ int w, int prefs, int mrefs, int prefs2, int mrefs2,
+ int parity, int clip_max, int spat);
+
+ randomize_buffers(prev0, prev1, mask, 11*WIDTH);
+ randomize_buffers(next0, next1, mask, 11*WIDTH);
+ randomize_buffers( cur0, cur1, mask, 11*WIDTH);
+ memset(dst0, 0xba, WIDTH * 3);
+ memset(dst1, 0xba, WIDTH * 3);
+
+ call_ref(dst0 + stride,
+ prev0 + stride * 4, cur0 + stride * 4, next0 + stride * 4, WIDTH,
+ stride, -stride, stride * 2, -stride * 2,
+ parity, mask, spat);
+ call_new(dst1 + stride,
+ prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, WIDTH,
+ stride, -stride, stride * 2, -stride * 2,
+ parity, mask, spat);
+
+ if (memcmp(dst0, dst1, WIDTH*3)
+ || memcmp(prev0, prev1, WIDTH*11)
+ || memcmp(next0, next1, WIDTH*11)
+ || memcmp( cur0, cur1, WIDTH*11))
+ fail();
+
+ bench_new(dst1 + stride,
+ prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, WIDTH,
+ stride, -stride, stride * 2, -stride * 2,
+ parity, mask, spat);
+ }
+ }
+ }
+
+ report("bwdif8.edge");
+ }
+
if (check_func(ctx_8.filter_intra, "bwdif8.intra")) {
LOCAL_ALIGNED_16(uint8_t, cur0, [11*WIDTH]);
LOCAL_ALIGNED_16(uint8_t, cur1, [11*WIDTH]);
--
2.39.2
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* [FFmpeg-devel] [PATCH 10/15] avfilter/vf_bwdif: Export C filter_line
2023-06-29 17:57 [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions John Cox
` (8 preceding siblings ...)
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 09/15] tests/checkasm: Add test for vf_bwdif filter_edge John Cox
@ 2023-06-29 17:57 ` John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 11/15] avfilter/vf_bwdif: Add neon for filter_line John Cox
` (5 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: John Cox @ 2023-06-29 17:57 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: thomas.mundt, John Cox
Needed for tail fixup of neon code
Signed-off-by: John Cox <jc@kynesim.co.uk>
---
libavfilter/bwdif.h | 5 +++++
libavfilter/vf_bwdif.c | 10 +++++-----
2 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index ae1616d366..cce99953f3 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -48,4 +48,9 @@ void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void *cur1, void *next1,
void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int mrefs,
int prefs3, int mrefs3, int parity, int clip_max);
+void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
+ int w, int prefs, int mrefs, int prefs2, int mrefs2,
+ int prefs3, int mrefs3, int prefs4, int mrefs4,
+ int parity, int clip_max);
+
#endif /* AVFILTER_BWDIF_H */
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index bec83111b4..26349da1fd 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -132,10 +132,10 @@ void ff_bwdif_filter_intra_c(void *dst1, void *cur1, int w, int prefs, int mrefs
FILTER_INTRA()
}
-static void filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
- int w, int prefs, int mrefs, int prefs2, int mrefs2,
- int prefs3, int mrefs3, int prefs4, int mrefs4,
- int parity, int clip_max)
+void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
+ int w, int prefs, int mrefs, int prefs2, int mrefs2,
+ int prefs3, int mrefs3, int prefs4, int mrefs4,
+ int parity, int clip_max)
{
uint8_t *dst = dst1;
uint8_t *prev = prev1;
@@ -363,7 +363,7 @@ av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth)
s->filter_edge = filter_edge_16bit;
} else {
s->filter_intra = ff_bwdif_filter_intra_c;
- s->filter_line = filter_line_c;
+ s->filter_line = ff_bwdif_filter_line_c;
s->filter_edge = ff_bwdif_filter_edge_c;
}
--
2.39.2
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* [FFmpeg-devel] [PATCH 11/15] avfilter/vf_bwdif: Add neon for filter_line
2023-06-29 17:57 [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions John Cox
` (9 preceding siblings ...)
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 10/15] avfilter/vf_bwdif: Export C filter_line John Cox
@ 2023-06-29 17:57 ` John Cox
2023-07-01 21:44 ` Martin Storsjö
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 12/15] avfilter/vf_bwdif: Add a filter_line3 method for optimisation John Cox
` (4 subsequent siblings)
15 siblings, 1 reply; 32+ messages in thread
From: John Cox @ 2023-06-29 17:57 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: thomas.mundt, John Cox
Signed-off-by: John Cox <jc@kynesim.co.uk>
---
libavfilter/aarch64/vf_bwdif_init_aarch64.c | 21 ++
libavfilter/aarch64/vf_bwdif_neon.S | 215 ++++++++++++++++++++
2 files changed, 236 insertions(+)
diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
index e75cf2f204..21e67884ab 100644
--- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -31,6 +31,26 @@ void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void *next1,
void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
int prefs3, int mrefs3, int parity, int clip_max);
+void ff_bwdif_filter_line_neon(void *dst1, void *prev1, void *cur1, void *next1,
+ int w, int prefs, int mrefs, int prefs2, int mrefs2,
+ int prefs3, int mrefs3, int prefs4, int mrefs4,
+ int parity, int clip_max);
+
+
+static void filter_line_helper(void *dst1, void *prev1, void *cur1, void *next1,
+ int w, int prefs, int mrefs, int prefs2, int mrefs2,
+ int prefs3, int mrefs3, int prefs4, int mrefs4,
+ int parity, int clip_max)
+{
+ const int w0 = clip_max != 255 ? 0 : w & ~15;
+
+ ff_bwdif_filter_line_neon(dst1, prev1, cur1, next1,
+ w0, prefs, mrefs, prefs2, mrefs2, prefs3, mrefs3, prefs4, mrefs4, parity, clip_max);
+
+ if (w0 < w)
+ ff_bwdif_filter_line_c((char *)dst1 + w0, (char *)prev1 + w0, (char *)cur1 + w0, (char *)next1 + w0,
+ w - w0, prefs, mrefs, prefs2, mrefs2, prefs3, mrefs3, prefs4, mrefs4, parity, clip_max);
+}
static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void *next1,
int w, int prefs, int mrefs, int prefs2, int mrefs2,
@@ -71,6 +91,7 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
return;
s->filter_intra = filter_intra_helper;
+ s->filter_line = filter_line_helper;
s->filter_edge = filter_edge_helper;
}
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index a33b235882..675e97d966 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -128,6 +128,221 @@ coeffs:
.hword 5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
.hword 5077, 981 // sp[0] = v0.h[6]
+// ===========================================================================
+//
+// void filter_line(
+// void *dst1, // x0
+// void *prev1, // x1
+// void *cur1, // x2
+// void *next1, // x3
+// int w, // w4
+// int prefs, // w5
+// int mrefs, // w6
+// int prefs2, // w7
+// int mrefs2, // [sp, #0]
+// int prefs3, // [sp, #8]
+// int mrefs3, // [sp, #16]
+// int prefs4, // [sp, #24]
+// int mrefs4, // [sp, #32]
+// int parity, // [sp, #40]
+// int clip_max) // [sp, #48]
+
+function ff_bwdif_filter_line_neon, export=1
+ // Sanity check w
+ cmp w4, #0
+ ble 99f
+
+ // Rearrange regs to be the same as line3 for ease of debug!
+ mov w10, w4 // w10 = loop count
+ mov w9, w6 // w9 = mref
+ mov w12, w7 // w12 = pref2
+ mov w11, w5 // w11 = pref
+ ldr w8, [sp, #0] // w8 = mref2
+ ldr w7, [sp, #16] // w7 = mref3
+ ldr w6, [sp, #32] // w6 = mref4
+ ldr w13, [sp, #8] // w13 = pref3
+ ldr w14, [sp, #24] // w14 = pref4
+
+ mov x4, x3
+ mov x3, x2
+ mov x2, x1
+
+// #define prev2 cur
+// const uint8_t * restrict next2 = parity ? prev : next;
+ ldr w17, [sp, #40] // parity
+ cmp w17, #0
+ csel x17, x2, x4, ne
+
+ // We want all the V registers - save all the ones we must
+ stp d14, d15, [sp, #-64]!
+ stp d8, d9, [sp, #48]
+ stp d10, d11, [sp, #32]
+ stp d12, d13, [sp, #16]
+
+ ldr q0, coeffs
+
+// for (x = 0; x < w; x++) {
+// int diff0, diff2;
+// int d0, d2;
+// int temporal_diff0, temporal_diff2;
+//
+// int i1, i2;
+// int j1, j2;
+// int p6, p5, p4, p3, p2, p1, c0, m1, m2, m3, m4;
+
+10:
+// c0 = prev2[0] + next2[0]; // c0 = v20, v21
+// d0 = c0 >> 1; // d0 = v10
+// temporal_diff0 = FFABS(prev2[0] - next2[0]); // td0 = v11
+ ldr q31, [x3]
+ ldr q21, [x17]
+ uhadd v10.16b, v31.16b, v21.16b
+ uabd v11.16b, v31.16b, v21.16b
+ uaddl v20.8h, v21.8b, v31.8b
+ uaddl2 v21.8h, v21.16b, v31.16b
+
+ ldr q31, [x3, w6, SXTW]
+ ldr q23, [x17, w6, SXTW]
+
+// i1 = coef_hf[0] * c0; // i1 = v2-v5
+ UMULL4K v2, v3, v4, v5, v20, v21, v0.h[2]
+
+ ldr q30, [x3, w14, SXTW]
+ ldr q25, [x17, w14, SXTW]
+
+// m4 = prev2[mrefs4] + next2[mrefs4]; // m4 = v22,v23
+ uaddl v22.8h, v23.8b, v31.8b
+ uaddl2 v23.8h, v23.16b, v31.16b
+
+// p4 = prev2[prefs4] + next2[prefs4]; // p4 = v24,v25, (p4 >> 1) = v12
+ uhadd v12.16b, v25.16b, v30.16b
+ uaddl v24.8h, v25.8b, v30.8b
+ uaddl2 v25.8h, v25.16b, v30.16b
+
+// m3 = cur[mrefs3]; // m3 = v20
+ ldr q20, [x3, w7, SXTW]
+
+// p3 = cur[prefs3]; // p3 = v21
+ ldr q21, [x3, w13, SXTW]
+
+// i1 += coef_hf[2] * (m4 + p4); // (-m4:v22,v23) (-p4:v24,v25)
+ add v22.8h, v22.8h, v24.8h
+ add v23.8h, v23.8h, v25.8h
+ UMLAL4K v2, v3, v4, v5, v22, v23, v0.h[4]
+
+ ldr q29, [x3, w8, SXTW]
+ ldr q23, [x17, w8, SXTW]
+
+// i1 -= coef_lf[1] * 4 * (m3 + p3); // -
+ uaddl v30.8h, v20.8b, v21.8b
+ uaddl2 v31.8h, v20.16b, v21.16b
+
+ UMLSL4K v2, v3, v4, v5, v30, v31, v0.h[1]
+
+// m2 = prev2[mrefs2] + next2[mrefs2]; // m2 = v22,v23, (m2 >> 1) = v13
+ uhadd v13.16b, v23.16b, v29.16b
+ uaddl v22.8h, v23.8b, v29.8b
+ uaddl2 v23.8h, v23.16b, v29.16b
+
+ ldr q31, [x3, w12, SXTW]
+ ldr q27, [x17, w12, SXTW]
+
+// j1 += coef_hf[2] * (m2 + p6); // (-p6:v24,v25)
+ add v24.8h, v24.8h, v22.8h
+ add v25.8h, v25.8h, v23.8h
+ UMLAL4K v6, v7, v8, v9, v24, v25, v0.h[4]
+
+// m1 = cur[mrefs]; // m1 = v24
+ ldr q24, [x3, w9, SXTW]
+
+// p2 = prev2[prefs2] + next2[prefs2]; // p2 = v26, v27
+// temporal_diff2 = FFABS(prev2[prefs2] - next2[prefs2]); // td2 = v14
+// d2 = p2 >> 1; // d2 = v15
+ uabd v14.16b, v31.16b, v27.16b
+ uhadd v15.16b, v31.16b, v27.16b
+ uaddl v26.8h, v27.8b, v31.8b
+ uaddl2 v27.8h, v27.16b, v31.16b
+
+// i1 -= coef_hf[1] * (m2 + p2); // (-m2:v22,v23*) (-p2:v26*,v27*)
+ add v22.8h, v22.8h, v26.8h
+ add v23.8h, v23.8h, v27.8h
+ UMLSL4K v2, v3, v4, v5, v22, v23, v0.h[3]
+
+// p1 = cur[prefs]; // p1 = v22
+ ldr q22, [x3, w11, SXTW]
+
+// i2 = (coef_sp[0] * (m1 + p1) - coef_sp[1] * (m3 + p3)) >> 13; // (-m3:v20*) i2=v17
+ uaddl v18.8h, v22.8b, v24.8b
+ uaddl2 v19.8h, v22.16b, v24.16b
+ UMULL4K v28, v29, v30, v31, v18, v19, v0.h[6]
+
+ uaddl v18.8h, v20.8b, v21.8b
+ uaddl2 v19.8h, v20.16b, v21.16b
+ UMLSL4K v28, v29, v30, v31, v18, v19, v0.h[7]
+
+ SQSHRUNN v17, v28, v29, v30, v31, 13
+
+// i1 += coef_lf[0] * 4 * (m1 + p1); // p1 = v22, m1 = v24
+ uaddl v26.8h, v24.8b, v22.8b
+ uaddl2 v27.8h, v24.16b, v22.16b
+ UMLAL4K v2, v3, v4, v5, v26, v27, v0.h[0]
+
+ ldr q31, [x2, w9, SXTW]
+ ldr q29, [x4, w9, SXTW]
+
+ ldr q30, [x2, w11, SXTW]
+ ldr q28, [x4, w11, SXTW]
+
+// i1 >>= 15; // i1 = v2, -v3, -v4*, -v5*
+ SQSHRUNN v2, v2, v3, v4, v5, 15
+
+// {
+// int t1 =(FFABS(prev[mrefs] - m1) + FFABS(prev[prefs] - p1)) >> 1;
+// int t2 =(FFABS(next[mrefs] - m1) + FFABS(next[prefs] - p1)) >> 1;
+ uabd v30.16b, v22.16b, v30.16b
+ uabd v31.16b, v24.16b, v31.16b
+ uabd v28.16b, v22.16b, v28.16b
+ uabd v29.16b, v24.16b, v29.16b
+ uhadd v31.16b, v31.16b, v30.16b
+ uhadd v29.16b, v29.16b, v28.16b
+
+// diff0 = FFMAX3(temporal_diff0 >> 1, t1, t2); // diff0=v18
+ ushr v18.16b, v11.16b, #1
+ umax v18.16b, v18.16b, v31.16b
+ umax v18.16b, v18.16b, v29.16b
+
+ // diff0 = v18, (m2 >> 1) = v13, m1 = v24, d0 = v10, p1 = v22, d2 = v15
+ SPAT_CHECK v18, v13, v24, v10, v22, v15, v31, v30, v29, v28
+
+ // i1 = v2, i2 = v17, m1 = v24, d0 = v10, p1 = v22, td2 = v11, diff2 = v18
+ INTERPOL v2, v2, v17, v24, v10, v22, v11, v18, v31, v30, v29
+
+// dst[0] = av_clip_uint8(interpol);
+ str q2, [x0], #16
+// }
+//
+// dst++;
+// cur++;
+// prev++;
+// prev2++;
+// next++;
+// }
+
+ subs w10, w10, #16
+ add x2, x2, #16
+ add x3, x3, #16
+ add x4, x4, #16
+ add x17, x17, #16
+ bgt 10b
+
+ ldp d12, d13, [sp, #16]
+ ldp d10, d11, [sp, #32]
+ ldp d8, d9, [sp, #48]
+ ldp d14, d15, [sp], #64
+99:
+ ret
+endfunc
+
// ============================================================================
//
// void ff_bwdif_filter_edge_neon(
--
2.39.2
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* [FFmpeg-devel] [PATCH 12/15] avfilter/vf_bwdif: Add a filter_line3 method for optimisation
2023-06-29 17:57 [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions John Cox
` (10 preceding siblings ...)
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 11/15] avfilter/vf_bwdif: Add neon for filter_line John Cox
@ 2023-06-29 17:57 ` John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 13/15] avfilter/vf_bwdif: Add neon for filter_line3 John Cox
` (3 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: John Cox @ 2023-06-29 17:57 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: thomas.mundt, John Cox
Add an optional filter_line3 to the available optimisations.
filter_line3 is equivalent to filter_line, memcpy, filter_line
filter_line shares quite a number of loads and some calculations in
common with its next iteration and testing shows that using aarch64
neon filter_line3s performance is 30% better than two filter_lines
and a memcpy.
Signed-off-by: John Cox <jc@kynesim.co.uk>
---
libavfilter/bwdif.h | 7 +++++++
libavfilter/vf_bwdif.c | 31 +++++++++++++++++++++++++++++++
2 files changed, 38 insertions(+)
diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index cce99953f3..496cec72ef 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -35,6 +35,9 @@ typedef struct BWDIFContext {
void (*filter_edge)(void *dst, void *prev, void *cur, void *next,
int w, int prefs, int mrefs, int prefs2, int mrefs2,
int parity, int clip_max, int spat);
+ void (*filter_line3)(void *dst, int dstride,
+ const void *prev, const void *cur, const void *next, int prefs,
+ int w, int parity, int clip_max);
} BWDIFContext;
void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int bit_depth);
@@ -53,4 +56,8 @@ void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
int prefs3, int mrefs3, int prefs4, int mrefs4,
int parity, int clip_max);
+void ff_bwdif_filter_line3_c(void * dst1, int d_stride,
+ const void * prev1, const void * cur1, const void * next1, int s_stride,
+ int w, int parity, int clip_max);
+
#endif /* AVFILTER_BWDIF_H */
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 26349da1fd..52bc676cf8 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -150,6 +150,31 @@ void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
FILTER2()
}
+#define NEXT_LINE()\
+ dst += d_stride; \
+ prev += prefs; \
+ cur += prefs; \
+ next += prefs;
+
+void ff_bwdif_filter_line3_c(void * dst1, int d_stride,
+ const void * prev1, const void * cur1, const void * next1, int s_stride,
+ int w, int parity, int clip_max)
+{
+ const int prefs = s_stride;
+ uint8_t * dst = dst1;
+ const uint8_t * prev = prev1;
+ const uint8_t * cur = cur1;
+ const uint8_t * next = next1;
+
+ ff_bwdif_filter_line_c(dst, (void*)prev, (void*)cur, (void*)next, w,
+ prefs, -prefs, prefs * 2, - prefs * 2, prefs * 3, -prefs * 3, prefs * 4, -prefs * 4, parity, clip_max);
+ NEXT_LINE();
+ memcpy(dst, cur, w);
+ NEXT_LINE();
+ ff_bwdif_filter_line_c(dst, (void*)prev, (void*)cur, (void*)next, w,
+ prefs, -prefs, prefs * 2, - prefs * 2, prefs * 3, -prefs * 3, prefs * 4, -prefs * 4, parity, clip_max);
+}
+
void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void *cur1, void *next1,
int w, int prefs, int mrefs, int prefs2, int mrefs2,
int parity, int clip_max, int spat)
@@ -244,6 +269,11 @@ static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
refs << 1, -(refs << 1),
td->parity ^ td->tff, clip_max,
(y < 2) || ((y + 3) > td->h) ? 0 : 1);
+ } else if (s->filter_line3 && y + 2 < slice_end && y + 6 < td->h) {
+ s->filter_line3(dst, td->frame->linesize[td->plane],
+ prev, cur, next, linesize, td->w,
+ td->parity ^ td->tff, clip_max);
+ y += 2;
} else {
s->filter_line(dst, prev, cur, next, td->w,
refs, -refs, refs << 1, -(refs << 1),
@@ -357,6 +387,7 @@ static int config_props(AVFilterLink *link)
av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth)
{
+ s->filter_line3 = 0;
if (bit_depth > 8) {
s->filter_intra = filter_intra_16bit;
s->filter_line = filter_line_c_16bit;
--
2.39.2
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* [FFmpeg-devel] [PATCH 13/15] avfilter/vf_bwdif: Add neon for filter_line3
2023-06-29 17:57 [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions John Cox
` (11 preceding siblings ...)
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 12/15] avfilter/vf_bwdif: Add a filter_line3 method for optimisation John Cox
@ 2023-06-29 17:57 ` John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 14/15] tests/checkasm: Add test for vf_bwdif filter_line3 John Cox
` (2 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: John Cox @ 2023-06-29 17:57 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: thomas.mundt, John Cox
Signed-off-by: John Cox <jc@kynesim.co.uk>
---
libavfilter/aarch64/vf_bwdif_init_aarch64.c | 28 ++
libavfilter/aarch64/vf_bwdif_neon.S | 278 ++++++++++++++++++++
2 files changed, 306 insertions(+)
diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
index 21e67884ab..f52bc4b9b4 100644
--- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
+++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
@@ -36,6 +36,33 @@ void ff_bwdif_filter_line_neon(void *dst1, void *prev1, void *cur1, void *next1,
int prefs3, int mrefs3, int prefs4, int mrefs4,
int parity, int clip_max);
+void ff_bwdif_filter_line3_neon(void * dst1, int d_stride,
+ const void * prev1, const void * cur1, const void * next1, int s_stride,
+ int w, int parity, int clip_max);
+
+
+static void filter_line3_helper(void * dst1, int d_stride,
+ const void * prev1, const void * cur1, const void * next1, int s_stride,
+ int w, int parity, int clip_max)
+{
+ // Asm works on 16 byte chunks
+ // If w is a multiple of 16 then all is good - if not then if width rounded
+ // up to nearest 16 will fit in both src & dst strides then allow the asm
+ // to write over the padding bytes as that is almost certainly faster than
+ // having to invoke the C version to clean up the tail.
+ const int w1 = FFALIGN(w, 16);
+ const int w0 = clip_max != 255 ? 0 :
+ d_stride <= w1 && s_stride <= w1 ? w : w & ~15;
+
+ ff_bwdif_filter_line3_neon(dst1, d_stride,
+ prev1, cur1, next1, s_stride,
+ w0, parity, clip_max);
+
+ if (w0 < w)
+ ff_bwdif_filter_line3_c((char *)dst1 + w0, d_stride,
+ (const char *)prev1 + w0, (const char *)cur1 + w0, (const char *)next1 + w0, s_stride,
+ w - w0, parity, clip_max);
+}
static void filter_line_helper(void *dst1, void *prev1, void *cur1, void *next1,
int w, int prefs, int mrefs, int prefs2, int mrefs2,
@@ -93,5 +120,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
s->filter_intra = filter_intra_helper;
s->filter_line = filter_line_helper;
s->filter_edge = filter_edge_helper;
+ s->filter_line3 = filter_line3_helper;
}
diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
index 675e97d966..bcffbe5793 100644
--- a/libavfilter/aarch64/vf_bwdif_neon.S
+++ b/libavfilter/aarch64/vf_bwdif_neon.S
@@ -128,6 +128,284 @@ coeffs:
.hword 5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
.hword 5077, 981 // sp[0] = v0.h[6]
+// ===========================================================================
+//
+// void ff_bwdif_filter_line3_neon(
+// void * dst1, // x0
+// int d_stride, // w1
+// const void * prev1, // x2
+// const void * cur1, // x3
+// const void * next1, // x4
+// int s_stride, // w5
+// int w, // w6
+// int parity, // w7
+// int clip_max); // [sp, #0] (Ignored)
+
+function ff_bwdif_filter_line3_neon, export=1
+ // Sanity check w
+ cmp w6, #0
+ ble 99f
+
+// #define prev2 cur
+// const uint8_t * restrict next2 = parity ? prev : next;
+ cmp w7, #0
+ csel x17, x2, x4, ne
+
+ // We want all the V registers - save all the ones we must
+ stp d14, d15, [sp, #-64]!
+ stp d8, d9, [sp, #48]
+ stp d10, d11, [sp, #32]
+ stp d12, d13, [sp, #16]
+
+ ldr q0, coeffs
+
+ // Some rearrangement of initial values for nice layout of refs in regs
+ mov w10, w6 // w10 = loop count
+ neg w9, w5 // w9 = mref
+ lsl w8, w9, #1 // w8 = mref2
+ add w7, w9, w9, LSL #1 // w7 = mref3
+ lsl w6, w9, #2 // w6 = mref4
+ mov w11, w5 // w11 = pref
+ lsl w12, w5, #1 // w12 = pref2
+ add w13, w5, w5, LSL #1 // w13 = pref3
+ lsl w14, w5, #2 // w14 = pref4
+ add w15, w5, w5, LSL #2 // w15 = pref5
+ add w16, w14, w12 // w16 = pref6
+
+ lsl w5, w1, #1 // w5 = d_stride * 2
+
+// for (x = 0; x < w; x++) {
+// int diff0, diff2;
+// int d0, d2;
+// int temporal_diff0, temporal_diff2;
+//
+// int i1, i2;
+// int j1, j2;
+// int p6, p5, p4, p3, p2, p1, c0, m1, m2, m3, m4;
+
+10:
+// c0 = prev2[0] + next2[0]; // c0 = v20, v21
+// d0 = c0 >> 1; // d0 = v10
+// temporal_diff0 = FFABS(prev2[0] - next2[0]); // td0 = v11
+ ldr q31, [x3]
+ ldr q21, [x17]
+ uhadd v10.16b, v31.16b, v21.16b
+ uabd v11.16b, v31.16b, v21.16b
+ uaddl v20.8h, v21.8b, v31.8b
+ uaddl2 v21.8h, v21.16b, v31.16b
+
+ ldr q31, [x3, w6, SXTW]
+ ldr q23, [x17, w6, SXTW]
+
+// i1 = coef_hf[0] * c0; // i1 = v2-v5
+ UMULL4K v2, v3, v4, v5, v20, v21, v0.h[2]
+
+ ldr q30, [x3, w14, SXTW]
+ ldr q25, [x17, w14, SXTW]
+
+// m4 = prev2[mrefs4] + next2[mrefs4]; // m4 = v22,v23
+ uaddl v22.8h, v23.8b, v31.8b
+ uaddl2 v23.8h, v23.16b, v31.16b
+
+// p4 = prev2[prefs4] + next2[prefs4]; // p4 = v24,v25, (p4 >> 1) = v12
+ uhadd v12.16b, v25.16b, v30.16b
+ uaddl v24.8h, v25.8b, v30.8b
+ uaddl2 v25.8h, v25.16b, v30.16b
+
+// j1 = -coef_hf[1] * (c0 + p4); // j1 = v6-v9 (-c0:v20,v21)
+ add v20.8h, v20.8h, v24.8h
+ add v21.8h, v21.8h, v25.8h
+ SMULL4K v6, v7, v8, v9, v20, v21, v0.h[5]
+
+// m3 = cur[mrefs3]; // m3 = v20
+ ldr q20, [x3, w7, SXTW]
+
+// p3 = cur[prefs3]; // p3 = v21
+ ldr q21, [x3, w13, SXTW]
+
+// i1 += coef_hf[2] * (m4 + p4); // (-m4:v22,v23) (-p4:v24,v25)
+ add v22.8h, v22.8h, v24.8h
+ add v23.8h, v23.8h, v25.8h
+ UMLAL4K v2, v3, v4, v5, v22, v23, v0.h[4]
+
+ ldr q29, [x3, w8, SXTW]
+ ldr q23, [x17, w8, SXTW]
+
+// i1 -= coef_lf[1] * 4 * (m3 + p3); // -
+ uaddl v30.8h, v20.8b, v21.8b
+ uaddl2 v31.8h, v20.16b, v21.16b
+
+ ldr q28, [x3, w16, SXTW]
+ ldr q25, [x17, w16, SXTW]
+
+ UMLSL4K v2, v3, v4, v5, v30, v31, v0.h[1]
+
+// m2 = prev2[mrefs2] + next2[mrefs2]; // m2 = v22,v23, (m2 >> 1) = v13
+ uhadd v13.16b, v23.16b, v29.16b
+ uaddl v22.8h, v23.8b, v29.8b
+ uaddl2 v23.8h, v23.16b, v29.16b
+
+ ldr q31, [x3, w12, SXTW]
+ ldr q27, [x17, w12, SXTW]
+
+// p6 = prev2[prefs6] + next2[prefs6]; // p6 = v24,v25
+ uaddl v24.8h, v25.8b, v28.8b
+ uaddl2 v25.8h, v25.16b, v28.16b
+
+// j1 += coef_hf[2] * (m2 + p6); // (-p6:v24,v25)
+ add v24.8h, v24.8h, v22.8h
+ add v25.8h, v25.8h, v23.8h
+ UMLAL4K v6, v7, v8, v9, v24, v25, v0.h[4]
+
+// m1 = cur[mrefs]; // m1 = v24
+ ldr q24, [x3, w9, SXTW]
+
+// p5 = cur[prefs5]; // p5 = v25
+ ldr q25, [x3, w15, SXTW]
+
+// p2 = prev2[prefs2] + next2[prefs2]; // p2 = v26, v27
+// temporal_diff2 = FFABS(prev2[prefs2] - next2[prefs2]); // td2 = v14
+// d2 = p2 >> 1; // d2 = v15
+ uabd v14.16b, v31.16b, v27.16b
+ uhadd v15.16b, v31.16b, v27.16b
+ uaddl v26.8h, v27.8b, v31.8b
+ uaddl2 v27.8h, v27.16b, v31.16b
+
+// j1 += coef_hf[0] * p2; // -
+ UMLAL4K v6, v7, v8, v9, v26, v27, v0.h[2]
+
+// i1 -= coef_hf[1] * (m2 + p2); // (-m2:v22,v23*) (-p2:v26*,v27*)
+ add v22.8h, v22.8h, v26.8h
+ add v23.8h, v23.8h, v27.8h
+ UMLSL4K v2, v3, v4, v5, v22, v23, v0.h[3]
+
+// p1 = cur[prefs]; // p1 = v22
+ ldr q22, [x3, w11, SXTW]
+
+// j1 -= coef_lf[1] * 4 * (m1 + p5); // -
+ uaddl v26.8h, v24.8b, v25.8b
+ uaddl2 v27.8h, v24.16b, v25.16b
+ UMLSL4K v6, v7, v8, v9, v26, v27, v0.h[1]
+
+// j2 = (coef_sp[0] * (p1 + p3) - coef_sp[1] * (m1 + p5)) >> 13; // (-p5:v25*) j2=v16
+ uaddl v18.8h, v22.8b, v21.8b
+ uaddl2 v19.8h, v22.16b, v21.16b
+ UMULL4K v28, v29, v30, v31, v18, v19, v0.h[6]
+
+ uaddl v18.8h, v24.8b, v25.8b
+ uaddl2 v19.8h, v24.16b, v25.16b
+ UMLSL4K v28, v29, v30, v31, v18, v19, v0.h[7]
+
+ SQSHRUNN v16, v28, v29, v30, v31, 13
+
+// i2 = (coef_sp[0] * (m1 + p1) - coef_sp[1] * (m3 + p3)) >> 13; // (-m3:v20*) i2=v17
+ uaddl v18.8h, v22.8b, v24.8b
+ uaddl2 v19.8h, v22.16b, v24.16b
+ UMULL4K v28, v29, v30, v31, v18, v19, v0.h[6]
+
+ uaddl v18.8h, v20.8b, v21.8b
+ uaddl2 v19.8h, v20.16b, v21.16b
+ UMLSL4K v28, v29, v30, v31, v18, v19, v0.h[7]
+
+ SQSHRUNN v17, v28, v29, v30, v31, 13
+
+// i1 += coef_lf[0] * 4 * (m1 + p1); // p1 = v22, m1 = v24
+ uaddl v26.8h, v24.8b, v22.8b
+ uaddl2 v27.8h, v24.16b, v22.16b
+ UMLAL4K v2, v3, v4, v5, v26, v27, v0.h[0]
+
+ ldr q31, [x2, w9, SXTW]
+ ldr q29, [x4, w9, SXTW]
+
+// j1 += coef_lf[0] * 4 * (p1 + p3); // p1 = v22, p3 = v21
+ uaddl v26.8h, v21.8b, v22.8b
+ uaddl2 v27.8h, v21.16b, v22.16b
+ UMLAL4K v6, v7, v8, v9, v26, v27, v0.h[0]
+
+ ldr q30, [x2, w11, SXTW]
+ ldr q28, [x4, w11, SXTW]
+
+// i1 >>= 15; // i1 = v2, -v3, -v4*, -v5*
+ SQSHRUNN v2, v2, v3, v4, v5, 15
+
+// j1 >>= 15; // j1 = v3, -v6*, -v7*, -v8*, -v9*
+ SQSHRUNN v3, v6, v7, v8, v9, 15
+
+// {
+// int t1 =(FFABS(prev[mrefs] - m1) + FFABS(prev[prefs] - p1)) >> 1;
+// int t2 =(FFABS(next[mrefs] - m1) + FFABS(next[prefs] - p1)) >> 1;
+ uabd v30.16b, v22.16b, v30.16b
+ uabd v31.16b, v24.16b, v31.16b
+ uabd v28.16b, v22.16b, v28.16b
+ uabd v29.16b, v24.16b, v29.16b
+ uhadd v31.16b, v31.16b, v30.16b
+ uhadd v29.16b, v29.16b, v28.16b
+
+ ldr q27, [x2, w13, SXTW]
+ ldr q26, [x4, w13, SXTW]
+
+// diff0 = FFMAX3(temporal_diff0 >> 1, t1, t2); // diff0=v18
+ ushr v18.16b, v11.16b, #1
+ umax v18.16b, v18.16b, v31.16b
+ umax v18.16b, v18.16b, v29.16b
+// } // v28, v30 preserved for next block
+// { // tdiff2 = v14
+// int t1 =(FFABS(prev[prefs] - p1) + FFABS(prev[prefs3] - p3)) >> 1;
+// int t2 =(FFABS(next[prefs] - p1) + FFABS(next[prefs3] - p3)) >> 1;
+ uabd v31.16b, v21.16b, v27.16b
+ uabd v29.16b, v21.16b, v26.16b
+ uhadd v31.16b, v31.16b, v30.16b
+ uhadd v29.16b, v29.16b, v28.16b
+
+// diff2 = FFMAX3(temporal_diff2 >> 1, t1, t2); // diff2=v19
+ ushr v19.16b, v14.16b, #1
+ umax v19.16b, v19.16b, v31.16b
+ umax v19.16b, v19.16b, v29.16b
+// }
+
+ // diff0 = v18, (m2 >> 1) = v13, m1 = v24, d0 = v10, p1 = v22, d2 = v15
+ SPAT_CHECK v18, v13, v24, v10, v22, v15, v31, v30, v29, v28
+
+ // diff2 = v19, d0 = v10, p1 = v22, d2 = v15, p3 = v21, (p4 >> 1) = v12
+ SPAT_CHECK v19, v10, v22, v15, v21, v12, v31, v30, v29, v28
+
+ // j1 = v3, j2 = v16, p1 = v22, d2 = v15, p3 = v21, td2 = v14, diff2 = v19
+ INTERPOL v3, v3, v16, v22, v15, v21, v14, v19, v31, v30, v29
+
+// dst[d_stride * 2] = av_clip_uint8(interpol);
+ str q3, [x0, w5, SXTW]
+
+// dst[d_stride] = p1;
+ str q22, [x0, w1, SXTW]
+
+ // i1 = v2, i2 = v17, m1 = v24, d0 = v10, p1 = v22, td2 = v11, diff2 = v18
+ INTERPOL v2, v2, v17, v24, v10, v22, v11, v18, v31, v30, v29
+
+// dst[0] = av_clip_uint8(interpol);
+ str q2, [x0], #16
+// }
+//
+// dst++;
+// cur++;
+// prev++;
+// prev2++;
+// next++;
+// }
+ subs w10, w10, #16
+ add x2, x2, #16
+ add x3, x3, #16
+ add x4, x4, #16
+ add x17, x17, #16
+ bgt 10b
+
+ ldp d12, d13, [sp, #16]
+ ldp d10, d11, [sp, #32]
+ ldp d8, d9, [sp, #48]
+ ldp d14, d15, [sp], #64
+99:
+ ret
+endfunc
+
// ===========================================================================
//
// void filter_line(
--
2.39.2
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* [FFmpeg-devel] [PATCH 14/15] tests/checkasm: Add test for vf_bwdif filter_line3
2023-06-29 17:57 [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions John Cox
` (12 preceding siblings ...)
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 13/15] avfilter/vf_bwdif: Add neon for filter_line3 John Cox
@ 2023-06-29 17:57 ` John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 15/15] avfilter/vf_bwdif: Block filter slices into a multiple of 4 lines John Cox
2023-07-01 21:33 ` [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions Martin Storsjö
15 siblings, 0 replies; 32+ messages in thread
From: John Cox @ 2023-06-29 17:57 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: thomas.mundt, John Cox
Signed-off-by: John Cox <jc@kynesim.co.uk>
---
tests/checkasm/vf_bwdif.c | 81 +++++++++++++++++++++++++++++++++++++++
1 file changed, 81 insertions(+)
diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c
index 5fdba09fdc..3399cacdf7 100644
--- a/tests/checkasm/vf_bwdif.c
+++ b/tests/checkasm/vf_bwdif.c
@@ -28,6 +28,10 @@
for (size_t i = 0; i < count; i++) \
buf0[i] = buf1[i] = rnd() & mask
+#define randomize_overflow_check(buf0, buf1, mask, count) \
+ for (size_t i = 0; i < count; i++) \
+ buf0[i] = buf1[i] = (rnd() & 1) != 0 ? mask : 0;
+
#define BODY(type, depth) \
do { \
type prev0[9*WIDTH], prev1[9*WIDTH]; \
@@ -83,6 +87,83 @@ void checkasm_check_vf_bwdif(void)
report("bwdif10");
}
+ if (!ctx_8.filter_line3)
+ ctx_8.filter_line3 = ff_bwdif_filter_line3_c;
+
+ {
+ LOCAL_ALIGNED_16(uint8_t, prev0, [11*WIDTH]);
+ LOCAL_ALIGNED_16(uint8_t, prev1, [11*WIDTH]);
+ LOCAL_ALIGNED_16(uint8_t, next0, [11*WIDTH]);
+ LOCAL_ALIGNED_16(uint8_t, next1, [11*WIDTH]);
+ LOCAL_ALIGNED_16(uint8_t, cur0, [11*WIDTH]);
+ LOCAL_ALIGNED_16(uint8_t, cur1, [11*WIDTH]);
+ LOCAL_ALIGNED_16(uint8_t, dst0, [WIDTH*3]);
+ LOCAL_ALIGNED_16(uint8_t, dst1, [WIDTH*3]);
+ const int stride = WIDTH;
+ const int mask = (1<<8)-1;
+ int parity;
+
+ for (parity = 0; parity != 2; ++parity) {
+ if (check_func(ctx_8.filter_line3, "bwdif8.line3.rnd.p%d", parity)) {
+
+ declare_func(void, void * dst1, int d_stride,
+ const void * prev1, const void * cur1, const void * next1, int prefs,
+ int w, int parity, int clip_max);
+
+ randomize_buffers(prev0, prev1, mask, 11*WIDTH);
+ randomize_buffers(next0, next1, mask, 11*WIDTH);
+ randomize_buffers( cur0, cur1, mask, 11*WIDTH);
+
+ call_ref(dst0, stride,
+ prev0 + stride * 4, cur0 + stride * 4, next0 + stride * 4, stride,
+ WIDTH, parity, mask);
+ call_new(dst1, stride,
+ prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, stride,
+ WIDTH, parity, mask);
+
+ if (memcmp(dst0, dst1, WIDTH*3)
+ || memcmp(prev0, prev1, WIDTH*11)
+ || memcmp(next0, next1, WIDTH*11)
+ || memcmp( cur0, cur1, WIDTH*11))
+ fail();
+
+ bench_new(dst1, stride,
+ prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, stride,
+ WIDTH, parity, mask);
+ }
+ }
+
+ // Use just 0s and ~0s to try to provoke bad cropping or overflow
+ // Parity makes no difference to this test so just test 0
+ if (check_func(ctx_8.filter_line3, "bwdif8.line3.overflow")) {
+
+ declare_func(void, void * dst1, int d_stride,
+ const void * prev1, const void * cur1, const void * next1, int prefs,
+ int w, int parity, int clip_max);
+
+ randomize_overflow_check(prev0, prev1, mask, 11*WIDTH);
+ randomize_overflow_check(next0, next1, mask, 11*WIDTH);
+ randomize_overflow_check( cur0, cur1, mask, 11*WIDTH);
+
+ call_ref(dst0, stride,
+ prev0 + stride * 4, cur0 + stride * 4, next0 + stride * 4, stride,
+ WIDTH, 0, mask);
+ call_new(dst1, stride,
+ prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, stride,
+ WIDTH, 0, mask);
+
+ if (memcmp(dst0, dst1, WIDTH*3)
+ || memcmp(prev0, prev1, WIDTH*11)
+ || memcmp(next0, next1, WIDTH*11)
+ || memcmp( cur0, cur1, WIDTH*11))
+ fail();
+
+ // No point to benching
+ }
+
+ report("bwdif8.line3");
+ }
+
{
LOCAL_ALIGNED_16(uint8_t, prev0, [11*WIDTH]);
LOCAL_ALIGNED_16(uint8_t, prev1, [11*WIDTH]);
--
2.39.2
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* [FFmpeg-devel] [PATCH 15/15] avfilter/vf_bwdif: Block filter slices into a multiple of 4 lines
2023-06-29 17:57 [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions John Cox
` (13 preceding siblings ...)
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 14/15] tests/checkasm: Add test for vf_bwdif filter_line3 John Cox
@ 2023-06-29 17:57 ` John Cox
2023-07-01 21:33 ` [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions Martin Storsjö
15 siblings, 0 replies; 32+ messages in thread
From: John Cox @ 2023-06-29 17:57 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: thomas.mundt, John Cox
Round job start lines down to a multiple of 4. This means that if
filter_line3 exists then filter_line will not sometimes be called
once at the end of a slice depending on thread count. The final slice
may do up to 3 extra lines but filter_edge is faster than filter_line
so it is unlikely to create any noticable thread load variation.
Signed-off-by: John Cox <jc@kynesim.co.uk>
---
libavfilter/vf_bwdif.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 52bc676cf8..6701208efe 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -237,6 +237,13 @@ static void filter_edge_16bit(void *dst1, void *prev1, void *cur1, void *next1,
FILTER2()
}
+// Round job start line down to multiple of 4 so that if filter_line3 exists
+// and the frame is a multiple of 4 high then filter_line will never be called
+static inline int job_start(const int jobnr, const int nb_jobs, const int h)
+{
+ return jobnr >= nb_jobs ? h : ((h * jobnr) / nb_jobs) & ~3;
+}
+
static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
{
BWDIFContext *s = ctx->priv;
@@ -246,8 +253,8 @@ static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
int clip_max = (1 << (yadif->csp->comp[td->plane].depth)) - 1;
int df = (yadif->csp->comp[td->plane].depth + 7) / 8;
int refs = linesize / df;
- int slice_start = (td->h * jobnr ) / nb_jobs;
- int slice_end = (td->h * (jobnr+1)) / nb_jobs;
+ int slice_start = job_start(jobnr, nb_jobs, td->h);
+ int slice_end = job_start(jobnr + 1, nb_jobs, td->h);
int y;
for (y = slice_start; y < slice_end; y++) {
@@ -310,7 +317,7 @@ static void filter(AVFilterContext *ctx, AVFrame *dstpic,
td.plane = i;
ff_filter_execute(ctx, filter_slice, &td, NULL,
- FFMIN(h, ff_filter_get_nb_threads(ctx)));
+ FFMIN((h+3)/4, ff_filter_get_nb_threads(ctx)));
}
if (yadif->current_field == YADIF_FIELD_END) {
yadif->current_field = YADIF_FIELD_NORMAL;
--
2.39.2
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions
2023-06-29 17:57 [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions John Cox
` (14 preceding siblings ...)
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 15/15] avfilter/vf_bwdif: Block filter slices into a multiple of 4 lines John Cox
@ 2023-07-01 21:33 ` Martin Storsjö
2023-07-02 10:18 ` John Cox
15 siblings, 1 reply; 32+ messages in thread
From: Martin Storsjö @ 2023-07-01 21:33 UTC (permalink / raw)
To: FFmpeg development discussions and patches; +Cc: thomas.mundt, John Cox
On Thu, 29 Jun 2023, John Cox wrote:
> Also adds a filter_line3 method which on aarch64 neon yields approx 30%
> speedup over 2xfilter_line and a memcpy
>
> John Cox (15):
> avfilter/vf_bwdif: Add outline for aarch neon functions
> avfilter/vf_bwdif: Add common macros and consts for aarch64 neon
> avfilter/vf_bwdif: Export C filter_intra
> avfilter/vf_bwdif: Add neon for filter_intra
> tests/checkasm: Add test for vf_bwdif filter_intra
> avfilter/vf_bwdif: Add clip and spatial macros for aarch64 neon
> avfilter/vf_bwdif: Export C filter_edge
> avfilter/vf_bwdif: Add neon for filter_edge
> tests/checkasm: Add test for vf_bwdif filter_edge
> avfilter/vf_bwdif: Export C filter_line
> avfilter/vf_bwdif: Add neon for filter_line
> avfilter/vf_bwdif: Add a filter_line3 method for optimisation
> avfilter/vf_bwdif: Add neon for filter_line3
> tests/checkasm: Add test for vf_bwdif filter_line3
> avfilter/vf_bwdif: Block filter slices into a multiple of 4 lines
It's nice to have this split up in small easily checkable patches, but
this is perhaps a bit more finegrained than what's usual. But I guess
that's ok...
I'll comment on the patches that need commenting on.
// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [FFmpeg-devel] [PATCH 02/15] avfilter/vf_bwdif: Add common macros and consts for aarch64 neon
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 02/15] avfilter/vf_bwdif: Add common macros and consts for aarch64 neon John Cox
@ 2023-07-01 21:35 ` Martin Storsjö
2023-07-02 10:27 ` John Cox
0 siblings, 1 reply; 32+ messages in thread
From: Martin Storsjö @ 2023-07-01 21:35 UTC (permalink / raw)
To: FFmpeg development discussions and patches; +Cc: thomas.mundt, John Cox
On Thu, 29 Jun 2023, John Cox wrote:
> Add macros for dual scalar half->single multiply and accumulate
> Add macro for shift, saturate and shorten single to byte
> Add filter constants
>
> Signed-off-by: John Cox <jc@kynesim.co.uk>
> ---
> libavfilter/aarch64/vf_bwdif_neon.S | 46 +++++++++++++++++++++++++++++
> 1 file changed, 46 insertions(+)
>
> diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
> index 639ab22998..a8f0ed525a 100644
> --- a/libavfilter/aarch64/vf_bwdif_neon.S
> +++ b/libavfilter/aarch64/vf_bwdif_neon.S
> @@ -23,3 +23,49 @@
>
> #include "libavutil/aarch64/asm.S"
>
> +.macro SQSHRUNN b, s0, s1, s2, s3, n
> + sqshrun \s0\().4h, \s0\().4s, #\n - 8
> + sqshrun2 \s0\().8h, \s1\().4s, #\n - 8
> + sqshrun \s1\().4h, \s2\().4s, #\n - 8
> + sqshrun2 \s1\().8h, \s3\().4s, #\n - 8
> + uzp2 \b\().16b, \s0\().16b, \s1\().16b
> +.endm
> +
> +.macro SMULL4K a0, a1, a2, a3, s0, s1, k
> + smull \a0\().4s, \s0\().4h, \k
> + smull2 \a1\().4s, \s0\().8h, \k
> + smull \a2\().4s, \s1\().4h, \k
> + smull2 \a3\().4s, \s1\().8h, \k
> +.endm
> +
> +.macro UMULL4K a0, a1, a2, a3, s0, s1, k
> + umull \a0\().4s, \s0\().4h, \k
> + umull2 \a1\().4s, \s0\().8h, \k
> + umull \a2\().4s, \s1\().4h, \k
> + umull2 \a3\().4s, \s1\().8h, \k
> +.endm
> +
> +.macro UMLAL4K a0, a1, a2, a3, s0, s1, k
> + umlal \a0\().4s, \s0\().4h, \k
> + umlal2 \a1\().4s, \s0\().8h, \k
> + umlal \a2\().4s, \s1\().4h, \k
> + umlal2 \a3\().4s, \s1\().8h, \k
> +.endm
> +
> +.macro UMLSL4K a0, a1, a2, a3, s0, s1, k
> + umlsl \a0\().4s, \s0\().4h, \k
> + umlsl2 \a1\().4s, \s0\().8h, \k
> + umlsl \a2\().4s, \s1\().4h, \k
> + umlsl2 \a3\().4s, \s1\().8h, \k
> +.endm
> +
> +// static const uint16_t coef_lf[2] = { 4309, 213 };
> +// static const uint16_t coef_hf[3] = { 5570, 3801, 1016 };
> +// static const uint16_t coef_sp[2] = { 5077, 981 };
> +
> + .align 16
Note that .align for arm is power of two; this triggers a 2^16 byte
alignment here, which certainly isn't intended.
But in general, the usual way of defining constants is with a
const/endconst block, which places them in the right rdata section instead
of in the text section. But that then requires you to use a movrel macro
for accessing the data, instead of a plain ldr instruction.
> +coeffs:
> + .hword 4309 * 4, 213 * 4 // lf[0]*4 = v0.h[0]
> + .hword 5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
> + .hword 5077, 981 // sp[0] = v0.h[6]
> +
> --
// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [FFmpeg-devel] [PATCH 04/15] avfilter/vf_bwdif: Add neon for filter_intra
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 04/15] avfilter/vf_bwdif: Add neon for filter_intra John Cox
@ 2023-07-01 21:37 ` Martin Storsjö
2023-07-02 10:43 ` John Cox
0 siblings, 1 reply; 32+ messages in thread
From: Martin Storsjö @ 2023-07-01 21:37 UTC (permalink / raw)
To: FFmpeg development discussions and patches; +Cc: thomas.mundt, John Cox
On Thu, 29 Jun 2023, John Cox wrote:
> Signed-off-by: John Cox <jc@kynesim.co.uk>
> ---
> libavfilter/aarch64/vf_bwdif_init_aarch64.c | 17 +++++++
> libavfilter/aarch64/vf_bwdif_neon.S | 53 +++++++++++++++++++++
> 2 files changed, 70 insertions(+)
>
> diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
> index 86d53b2ca1..3ffaa07ab3 100644
> --- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
> +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
> @@ -24,6 +24,22 @@
> #include "libavfilter/bwdif.h"
> #include "libavutil/aarch64/cpu.h"
>
> +void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
> + int prefs3, int mrefs3, int parity, int clip_max);
> +
> +
> +static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int mrefs,
> + int prefs3, int mrefs3, int parity, int clip_max)
> +{
> + const int w0 = clip_max != 255 ? 0 : w & ~15;
> +
> + ff_bwdif_filter_intra_neon(dst1, cur1, w0, prefs, mrefs, prefs3, mrefs3, parity, clip_max);
> +
> + if (w0 < w)
> + ff_bwdif_filter_intra_c((char *)dst1 + w0, (char *)cur1 + w0,
> + w - w0, prefs, mrefs, prefs3, mrefs3, parity, clip_max);
> +}
> +
> void
> ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
> {
> @@ -35,5 +51,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
> if (!have_neon(cpu_flags))
> return;
>
> + s->filter_intra = filter_intra_helper;
> }
>
> diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
> index a8f0ed525a..b863b3447d 100644
> --- a/libavfilter/aarch64/vf_bwdif_neon.S
> +++ b/libavfilter/aarch64/vf_bwdif_neon.S
> @@ -69,3 +69,56 @@ coeffs:
> .hword 5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
> .hword 5077, 981 // sp[0] = v0.h[6]
>
> +// ============================================================================
> +//
> +// void ff_bwdif_filter_intra_neon(
> +// void *dst1, // x0
> +// void *cur1, // x1
> +// int w, // w2
> +// int prefs, // w3
> +// int mrefs, // w4
> +// int prefs3, // w5
> +// int mrefs3, // w6
> +// int parity, // w7 unused
> +// int clip_max) // [sp, #0] unused
This bit is great to have
> +
> +function ff_bwdif_filter_intra_neon, export=1
> + cmp w2, #0
> + ble 99f
> +
> + ldr q0, coeffs
> +
> +// for (x = 0; x < w; x++) {
> +10:
> +
> +// interpol = (coef_sp[0] * (cur[mrefs] + cur[prefs]) - coef_sp[1] * (cur[mrefs3] + cur[prefs3])) >> 13;
I guess the style with intermixed C code is a bit uncommon in our
assembly, but as long as it doesn't affect the overall code style I guess
we can keep it.
> + ldr q31, [x1, w4, SXTW]
> + ldr q30, [x1, w3, SXTW]
> + ldr q29, [x1, w6, SXTW]
> + ldr q28, [x1, w5, SXTW]
Don't use shouty uppercase SXTW here
> +
> + uaddl v20.8h, v31.8b, v30.8b
> + uaddl2 v21.8h, v31.16b, v30.16b
> +
> + UMULL4K v2, v3, v4, v5, v20, v21, v0.h[6]
> +
> + uaddl v20.8h, v29.8b, v28.8b
> + uaddl2 v21.8h, v29.16b, v28.16b
> +
> + UMLSL4K v2, v3, v4, v5, v20, v21, v0.h[7]
> +
> +// dst[0] = av_clip(interpol, 0, clip_max);
> + SQSHRUNN v2, v2, v3, v4, v5, 13
> + str q2, [x0], #16
> +
> +// dst++;
> +// cur++;
> +// }
> +
> + subs w2, w2, #16
> + add x1, x1, #16
For in-order cores, it might be good to update these variables sometime
sooner, e.g. before the str instruction. But such scheduling breaks your
mapping between neat C code and assembly.
// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [FFmpeg-devel] [PATCH 08/15] avfilter/vf_bwdif: Add neon for filter_edge
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 08/15] avfilter/vf_bwdif: Add neon for filter_edge John Cox
@ 2023-07-01 21:40 ` Martin Storsjö
2023-07-02 10:50 ` John Cox
0 siblings, 1 reply; 32+ messages in thread
From: Martin Storsjö @ 2023-07-01 21:40 UTC (permalink / raw)
To: FFmpeg development discussions and patches; +Cc: thomas.mundt, John Cox
On Thu, 29 Jun 2023, John Cox wrote:
> Signed-off-by: John Cox <jc@kynesim.co.uk>
> ---
> libavfilter/aarch64/vf_bwdif_init_aarch64.c | 20 ++++
> libavfilter/aarch64/vf_bwdif_neon.S | 104 ++++++++++++++++++++
> 2 files changed, 124 insertions(+)
>
> diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
> index 3ffaa07ab3..e75cf2f204 100644
> --- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
> +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
> @@ -24,10 +24,29 @@
> #include "libavfilter/bwdif.h"
> #include "libavutil/aarch64/cpu.h"
>
> +void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void *next1,
> + int w, int prefs, int mrefs, int prefs2, int mrefs2,
> + int parity, int clip_max, int spat);
> +
> void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
> int prefs3, int mrefs3, int parity, int clip_max);
>
>
> +static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void *next1,
> + int w, int prefs, int mrefs, int prefs2, int mrefs2,
> + int parity, int clip_max, int spat)
> +{
> + const int w0 = clip_max != 255 ? 0 : w & ~15;
> +
> + ff_bwdif_filter_edge_neon(dst1, prev1, cur1, next1, w0, prefs, mrefs, prefs2, mrefs2,
> + parity, clip_max, spat);
> +
> + if (w0 < w)
> + ff_bwdif_filter_edge_c((char *)dst1 + w0, (char *)prev1 + w0, (char *)cur1 + w0, (char *)next1 + w0,
> + w - w0, prefs, mrefs, prefs2, mrefs2,
> + parity, clip_max, spat);
> +}
> +
> static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int mrefs,
> int prefs3, int mrefs3, int parity, int clip_max)
> {
> @@ -52,5 +71,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
> return;
>
> s->filter_intra = filter_intra_helper;
> + s->filter_edge = filter_edge_helper;
> }
>
> diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
> index 6c5d1598f4..a33b235882 100644
> --- a/libavfilter/aarch64/vf_bwdif_neon.S
> +++ b/libavfilter/aarch64/vf_bwdif_neon.S
> @@ -128,6 +128,110 @@ coeffs:
> .hword 5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
> .hword 5077, 981 // sp[0] = v0.h[6]
>
> +// ============================================================================
> +//
> +// void ff_bwdif_filter_edge_neon(
> +// void *dst1, // x0
> +// void *prev1, // x1
> +// void *cur1, // x2
> +// void *next1, // x3
> +// int w, // w4
> +// int prefs, // w5
> +// int mrefs, // w6
> +// int prefs2, // w7
> +// int mrefs2, // [sp, #0]
> +// int parity, // [sp, #8]
> +// int clip_max, // [sp, #16] unused
> +// int spat); // [sp, #24]
This doesn't hold for macOS targets (and the checkasm tests fail on that
platform).
On macOS, arguments that aren't passed in registers but on the stack, are
tightly packed. So since parity is 32 bit and mrefs2 also was 32 bit,
parity is available at [sp, #4].
Therefore, it's usually simplest for portability reasons, to pass any
arguments after the first 8, as intptr_t or ptrdiff_t, as that makes them
consistent across platforms.
// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [FFmpeg-devel] [PATCH 11/15] avfilter/vf_bwdif: Add neon for filter_line
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 11/15] avfilter/vf_bwdif: Add neon for filter_line John Cox
@ 2023-07-01 21:44 ` Martin Storsjö
2023-07-02 10:57 ` John Cox
0 siblings, 1 reply; 32+ messages in thread
From: Martin Storsjö @ 2023-07-01 21:44 UTC (permalink / raw)
To: FFmpeg development discussions and patches; +Cc: thomas.mundt, John Cox
On Thu, 29 Jun 2023, John Cox wrote:
> Signed-off-by: John Cox <jc@kynesim.co.uk>
> ---
> libavfilter/aarch64/vf_bwdif_init_aarch64.c | 21 ++
> libavfilter/aarch64/vf_bwdif_neon.S | 215 ++++++++++++++++++++
> 2 files changed, 236 insertions(+)
>
> diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
> index e75cf2f204..21e67884ab 100644
> --- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
> +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
> @@ -31,6 +31,26 @@ void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void *next1,
> void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
> int prefs3, int mrefs3, int parity, int clip_max);
>
> +void ff_bwdif_filter_line_neon(void *dst1, void *prev1, void *cur1, void *next1,
> + int w, int prefs, int mrefs, int prefs2, int mrefs2,
> + int prefs3, int mrefs3, int prefs4, int mrefs4,
> + int parity, int clip_max);
> +
> +
> +static void filter_line_helper(void *dst1, void *prev1, void *cur1, void *next1,
> + int w, int prefs, int mrefs, int prefs2, int mrefs2,
> + int prefs3, int mrefs3, int prefs4, int mrefs4,
> + int parity, int clip_max)
> +{
> + const int w0 = clip_max != 255 ? 0 : w & ~15;
> +
> + ff_bwdif_filter_line_neon(dst1, prev1, cur1, next1,
> + w0, prefs, mrefs, prefs2, mrefs2, prefs3, mrefs3, prefs4, mrefs4, parity, clip_max);
> +
> + if (w0 < w)
> + ff_bwdif_filter_line_c((char *)dst1 + w0, (char *)prev1 + w0, (char *)cur1 + w0, (char *)next1 + w0,
> + w - w0, prefs, mrefs, prefs2, mrefs2, prefs3, mrefs3, prefs4, mrefs4, parity, clip_max);
> +}
>
> static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void *next1,
> int w, int prefs, int mrefs, int prefs2, int mrefs2,
> @@ -71,6 +91,7 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
> return;
>
> s->filter_intra = filter_intra_helper;
> + s->filter_line = filter_line_helper;
> s->filter_edge = filter_edge_helper;
> }
>
> diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
> index a33b235882..675e97d966 100644
> --- a/libavfilter/aarch64/vf_bwdif_neon.S
> +++ b/libavfilter/aarch64/vf_bwdif_neon.S
> @@ -128,6 +128,221 @@ coeffs:
> .hword 5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
> .hword 5077, 981 // sp[0] = v0.h[6]
>
> +// ===========================================================================
> +//
> +// void filter_line(
> +// void *dst1, // x0
> +// void *prev1, // x1
> +// void *cur1, // x2
> +// void *next1, // x3
> +// int w, // w4
> +// int prefs, // w5
> +// int mrefs, // w6
> +// int prefs2, // w7
> +// int mrefs2, // [sp, #0]
> +// int prefs3, // [sp, #8]
> +// int mrefs3, // [sp, #16]
> +// int prefs4, // [sp, #24]
> +// int mrefs4, // [sp, #32]
> +// int parity, // [sp, #40]
> +// int clip_max) // [sp, #48]
> +
> +function ff_bwdif_filter_line_neon, export=1
> + // Sanity check w
> + cmp w4, #0
> + ble 99f
> +
> + // Rearrange regs to be the same as line3 for ease of debug!
> + mov w10, w4 // w10 = loop count
> + mov w9, w6 // w9 = mref
> + mov w12, w7 // w12 = pref2
> + mov w11, w5 // w11 = pref
> + ldr w8, [sp, #0] // w8 = mref2
> + ldr w7, [sp, #16] // w7 = mref3
> + ldr w6, [sp, #32] // w6 = mref4
> + ldr w13, [sp, #8] // w13 = pref3
> + ldr w14, [sp, #24] // w14 = pref4
Btw, remember that you can load two arguments from the stack at once with
ldp, e.g. "ldp x8, x13, [sp, #0]". If they're made intptr_t/ptrdiff_t, you
won't have an issue with garbage in the upper 32 bits either.
> +
> + mov x4, x3
> + mov x3, x2
> + mov x2, x1
> +
> +// #define prev2 cur
> +// const uint8_t * restrict next2 = parity ? prev : next;
> + ldr w17, [sp, #40] // parity
> + cmp w17, #0
> + csel x17, x2, x4, ne
> +
> + // We want all the V registers - save all the ones we must
> + stp d14, d15, [sp, #-64]!
> + stp d8, d9, [sp, #48]
> + stp d10, d11, [sp, #32]
> + stp d12, d13, [sp, #16]
The order looks a bit weird here even if they end up sequential on the
stack. If you'd fill it from the bottom up, e.g.
stp d8, d9, [sp, #-64]!
stp d10, d11, [sp, #16]
stp d12, d13, [sp, #32]
stp d14, d15, [sp, #48]
they're sequential both in code and on the stack.
// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions
2023-07-01 21:33 ` [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions Martin Storsjö
@ 2023-07-02 10:18 ` John Cox
0 siblings, 0 replies; 32+ messages in thread
From: John Cox @ 2023-07-02 10:18 UTC (permalink / raw)
To: Martin Storsjö
Cc: thomas.mundt, FFmpeg development discussions and patches
Hi
>On Thu, 29 Jun 2023, John Cox wrote:
>
>> Also adds a filter_line3 method which on aarch64 neon yields approx 30%
>> speedup over 2xfilter_line and a memcpy
>>
>> John Cox (15):
>> avfilter/vf_bwdif: Add outline for aarch neon functions
>> avfilter/vf_bwdif: Add common macros and consts for aarch64 neon
>> avfilter/vf_bwdif: Export C filter_intra
>> avfilter/vf_bwdif: Add neon for filter_intra
>> tests/checkasm: Add test for vf_bwdif filter_intra
>> avfilter/vf_bwdif: Add clip and spatial macros for aarch64 neon
>> avfilter/vf_bwdif: Export C filter_edge
>> avfilter/vf_bwdif: Add neon for filter_edge
>> tests/checkasm: Add test for vf_bwdif filter_edge
>> avfilter/vf_bwdif: Export C filter_line
>> avfilter/vf_bwdif: Add neon for filter_line
>> avfilter/vf_bwdif: Add a filter_line3 method for optimisation
>> avfilter/vf_bwdif: Add neon for filter_line3
>> tests/checkasm: Add test for vf_bwdif filter_line3
>> avfilter/vf_bwdif: Block filter slices into a multiple of 4 lines
>
>It's nice to have this split up in small easily checkable patches, but
>this is perhaps a bit more finegrained than what's usual. But I guess
>that's ok...
I normally find that people ask me to split patches so I though I'd cut
stuff down to the minimum plausible unit.
I'm more than happy to coalesce stuff if wanted.
JC
>I'll comment on the patches that need commenting on.
>
>// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [FFmpeg-devel] [PATCH 02/15] avfilter/vf_bwdif: Add common macros and consts for aarch64 neon
2023-07-01 21:35 ` Martin Storsjö
@ 2023-07-02 10:27 ` John Cox
2023-07-02 20:07 ` Martin Storsjö
0 siblings, 1 reply; 32+ messages in thread
From: John Cox @ 2023-07-02 10:27 UTC (permalink / raw)
To: Martin Storsjö
Cc: thomas.mundt, FFmpeg development discussions and patches
On Sun, 2 Jul 2023 00:35:14 +0300 (EEST), you wrote:
>On Thu, 29 Jun 2023, John Cox wrote:
>
>> Add macros for dual scalar half->single multiply and accumulate
>> Add macro for shift, saturate and shorten single to byte
>> Add filter constants
>>
>> Signed-off-by: John Cox <jc@kynesim.co.uk>
>> ---
>> libavfilter/aarch64/vf_bwdif_neon.S | 46 +++++++++++++++++++++++++++++
>> 1 file changed, 46 insertions(+)
>>
>> diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
>> index 639ab22998..a8f0ed525a 100644
>> --- a/libavfilter/aarch64/vf_bwdif_neon.S
>> +++ b/libavfilter/aarch64/vf_bwdif_neon.S
>> @@ -23,3 +23,49 @@
>>
>> #include "libavutil/aarch64/asm.S"
>>
>> +.macro SQSHRUNN b, s0, s1, s2, s3, n
>> + sqshrun \s0\().4h, \s0\().4s, #\n - 8
>> + sqshrun2 \s0\().8h, \s1\().4s, #\n - 8
>> + sqshrun \s1\().4h, \s2\().4s, #\n - 8
>> + sqshrun2 \s1\().8h, \s3\().4s, #\n - 8
>> + uzp2 \b\().16b, \s0\().16b, \s1\().16b
>> +.endm
>> +
>> +.macro SMULL4K a0, a1, a2, a3, s0, s1, k
>> + smull \a0\().4s, \s0\().4h, \k
>> + smull2 \a1\().4s, \s0\().8h, \k
>> + smull \a2\().4s, \s1\().4h, \k
>> + smull2 \a3\().4s, \s1\().8h, \k
>> +.endm
>> +
>> +.macro UMULL4K a0, a1, a2, a3, s0, s1, k
>> + umull \a0\().4s, \s0\().4h, \k
>> + umull2 \a1\().4s, \s0\().8h, \k
>> + umull \a2\().4s, \s1\().4h, \k
>> + umull2 \a3\().4s, \s1\().8h, \k
>> +.endm
>> +
>> +.macro UMLAL4K a0, a1, a2, a3, s0, s1, k
>> + umlal \a0\().4s, \s0\().4h, \k
>> + umlal2 \a1\().4s, \s0\().8h, \k
>> + umlal \a2\().4s, \s1\().4h, \k
>> + umlal2 \a3\().4s, \s1\().8h, \k
>> +.endm
>> +
>> +.macro UMLSL4K a0, a1, a2, a3, s0, s1, k
>> + umlsl \a0\().4s, \s0\().4h, \k
>> + umlsl2 \a1\().4s, \s0\().8h, \k
>> + umlsl \a2\().4s, \s1\().4h, \k
>> + umlsl2 \a3\().4s, \s1\().8h, \k
>> +.endm
>> +
>> +// static const uint16_t coef_lf[2] = { 4309, 213 };
>> +// static const uint16_t coef_hf[3] = { 5570, 3801, 1016 };
>> +// static const uint16_t coef_sp[2] = { 5077, 981 };
>> +
>> + .align 16
>
>Note that .align for arm is power of two; this triggers a 2^16 byte
>alignment here, which certainly isn't intended.
Yikes! I'll swap that for a .balign now I've looked that up
>But in general, the usual way of defining constants is with a
>const/endconst block, which places them in the right rdata section instead
>of in the text section. But that then requires you to use a movrel macro
>for accessing the data, instead of a plain ldr instruction.
Yeah - arm has a history of mixing text & const - I went with the
simpler code. Is this a deal breaker or can I leave it as is?
JC
>> +coeffs:
>> + .hword 4309 * 4, 213 * 4 // lf[0]*4 = v0.h[0]
>> + .hword 5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
>> + .hword 5077, 981 // sp[0] = v0.h[6]
>> +
>> --
>
>
>// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [FFmpeg-devel] [PATCH 04/15] avfilter/vf_bwdif: Add neon for filter_intra
2023-07-01 21:37 ` Martin Storsjö
@ 2023-07-02 10:43 ` John Cox
2023-07-02 20:18 ` Martin Storsjö
0 siblings, 1 reply; 32+ messages in thread
From: John Cox @ 2023-07-02 10:43 UTC (permalink / raw)
To: FFmpeg development discussions and patches
Cc: thomas.mundt, FFmpeg development discussions and patches
On Sun, 2 Jul 2023 00:37:35 +0300 (EEST), you wrote:
>On Thu, 29 Jun 2023, John Cox wrote:
>
>> Signed-off-by: John Cox <jc@kynesim.co.uk>
>> ---
>> libavfilter/aarch64/vf_bwdif_init_aarch64.c | 17 +++++++
>> libavfilter/aarch64/vf_bwdif_neon.S | 53 +++++++++++++++++++++
>> 2 files changed, 70 insertions(+)
>>
>> diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>> index 86d53b2ca1..3ffaa07ab3 100644
>> --- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>> +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>> @@ -24,6 +24,22 @@
>> #include "libavfilter/bwdif.h"
>> #include "libavutil/aarch64/cpu.h"
>>
>> +void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
>> + int prefs3, int mrefs3, int parity, int clip_max);
>> +
>> +
>> +static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int mrefs,
>> + int prefs3, int mrefs3, int parity, int clip_max)
>> +{
>> + const int w0 = clip_max != 255 ? 0 : w & ~15;
>> +
>> + ff_bwdif_filter_intra_neon(dst1, cur1, w0, prefs, mrefs, prefs3, mrefs3, parity, clip_max);
>> +
>> + if (w0 < w)
>> + ff_bwdif_filter_intra_c((char *)dst1 + w0, (char *)cur1 + w0,
>> + w - w0, prefs, mrefs, prefs3, mrefs3, parity, clip_max);
>> +}
>> +
>> void
>> ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
>> {
>> @@ -35,5 +51,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
>> if (!have_neon(cpu_flags))
>> return;
>>
>> + s->filter_intra = filter_intra_helper;
>> }
>>
>> diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
>> index a8f0ed525a..b863b3447d 100644
>> --- a/libavfilter/aarch64/vf_bwdif_neon.S
>> +++ b/libavfilter/aarch64/vf_bwdif_neon.S
>> @@ -69,3 +69,56 @@ coeffs:
>> .hword 5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
>> .hword 5077, 981 // sp[0] = v0.h[6]
>>
>> +// ============================================================================
>> +//
>> +// void ff_bwdif_filter_intra_neon(
>> +// void *dst1, // x0
>> +// void *cur1, // x1
>> +// int w, // w2
>> +// int prefs, // w3
>> +// int mrefs, // w4
>> +// int prefs3, // w5
>> +// int mrefs3, // w6
>> +// int parity, // w7 unused
>> +// int clip_max) // [sp, #0] unused
>
>This bit is great to have
>
>> +
>> +function ff_bwdif_filter_intra_neon, export=1
>> + cmp w2, #0
>> + ble 99f
>> +
>> + ldr q0, coeffs
>> +
>> +// for (x = 0; x < w; x++) {
>> +10:
>> +
>> +// interpol = (coef_sp[0] * (cur[mrefs] + cur[prefs]) - coef_sp[1] * (cur[mrefs3] + cur[prefs3])) >> 13;
>
>I guess the style with intermixed C code is a bit uncommon in our
>assembly, but as long as it doesn't affect the overall code style I guess
>we can keep it.
I needed it to track where I was whilst writing the code & if I ever
need to change it I'll be lost without it - so I, at least, rate it as
valuable and in line3 where I am very tight on registers it was
invaluable for keeping track of what referred to what.
>> + ldr q31, [x1, w4, SXTW]
>> + ldr q30, [x1, w3, SXTW]
>> + ldr q29, [x1, w6, SXTW]
>> + ldr q28, [x1, w5, SXTW]
>
>Don't use shouty uppercase SXTW here
Will change.
>> +
>> + uaddl v20.8h, v31.8b, v30.8b
>> + uaddl2 v21.8h, v31.16b, v30.16b
>> +
>> + UMULL4K v2, v3, v4, v5, v20, v21, v0.h[6]
>> +
>> + uaddl v20.8h, v29.8b, v28.8b
>> + uaddl2 v21.8h, v29.16b, v28.16b
>> +
>> + UMLSL4K v2, v3, v4, v5, v20, v21, v0.h[7]
>> +
>> +// dst[0] = av_clip(interpol, 0, clip_max);
>> + SQSHRUNN v2, v2, v3, v4, v5, 13
>> + str q2, [x0], #16
>> +
>> +// dst++;
>> +// cur++;
>> +// }
>> +
>> + subs w2, w2, #16
>> + add x1, x1, #16
>
>For in-order cores, it might be good to update these variables sometime
>sooner, e.g. before the str instruction. But such scheduling breaks your
>mapping between neat C code and assembly.
I take your point but there is at least 1 instruction between set and
use which is normally enough.
I did my benching on a Pi4 and found, to my surprise, that in most cases
reordering instructions to interleavse operations made life worse and
seeing as Pi4 is my target platform I stopped trying to do that and
stuck with code that was easier to read! (Also filter_intra seems to be
much more memory b/w limited than code limited on a Pi4.)
JC
>// Martin
>
>_______________________________________________
>ffmpeg-devel mailing list
>ffmpeg-devel@ffmpeg.org
>https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
>To unsubscribe, visit link above, or email
>ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [FFmpeg-devel] [PATCH 08/15] avfilter/vf_bwdif: Add neon for filter_edge
2023-07-01 21:40 ` Martin Storsjö
@ 2023-07-02 10:50 ` John Cox
2023-07-02 20:36 ` Martin Storsjö
0 siblings, 1 reply; 32+ messages in thread
From: John Cox @ 2023-07-02 10:50 UTC (permalink / raw)
To: Martin Storsjö
Cc: thomas.mundt, FFmpeg development discussions and patches
On Sun, 2 Jul 2023 00:40:09 +0300 (EEST), you wrote:
>On Thu, 29 Jun 2023, John Cox wrote:
>
>> Signed-off-by: John Cox <jc@kynesim.co.uk>
>> ---
>> libavfilter/aarch64/vf_bwdif_init_aarch64.c | 20 ++++
>> libavfilter/aarch64/vf_bwdif_neon.S | 104 ++++++++++++++++++++
>> 2 files changed, 124 insertions(+)
>>
>> diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>> index 3ffaa07ab3..e75cf2f204 100644
>> --- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>> +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>> @@ -24,10 +24,29 @@
>> #include "libavfilter/bwdif.h"
>> #include "libavutil/aarch64/cpu.h"
>>
>> +void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void *next1,
>> + int w, int prefs, int mrefs, int prefs2, int mrefs2,
>> + int parity, int clip_max, int spat);
>> +
>> void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
>> int prefs3, int mrefs3, int parity, int clip_max);
>>
>>
>> +static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void *next1,
>> + int w, int prefs, int mrefs, int prefs2, int mrefs2,
>> + int parity, int clip_max, int spat)
>> +{
>> + const int w0 = clip_max != 255 ? 0 : w & ~15;
>> +
>> + ff_bwdif_filter_edge_neon(dst1, prev1, cur1, next1, w0, prefs, mrefs, prefs2, mrefs2,
>> + parity, clip_max, spat);
>> +
>> + if (w0 < w)
>> + ff_bwdif_filter_edge_c((char *)dst1 + w0, (char *)prev1 + w0, (char *)cur1 + w0, (char *)next1 + w0,
>> + w - w0, prefs, mrefs, prefs2, mrefs2,
>> + parity, clip_max, spat);
>> +}
>> +
>> static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int mrefs,
>> int prefs3, int mrefs3, int parity, int clip_max)
>> {
>> @@ -52,5 +71,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
>> return;
>>
>> s->filter_intra = filter_intra_helper;
>> + s->filter_edge = filter_edge_helper;
>> }
>>
>> diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
>> index 6c5d1598f4..a33b235882 100644
>> --- a/libavfilter/aarch64/vf_bwdif_neon.S
>> +++ b/libavfilter/aarch64/vf_bwdif_neon.S
>> @@ -128,6 +128,110 @@ coeffs:
>> .hword 5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
>> .hword 5077, 981 // sp[0] = v0.h[6]
>>
>> +// ============================================================================
>> +//
>> +// void ff_bwdif_filter_edge_neon(
>> +// void *dst1, // x0
>> +// void *prev1, // x1
>> +// void *cur1, // x2
>> +// void *next1, // x3
>> +// int w, // w4
>> +// int prefs, // w5
>> +// int mrefs, // w6
>> +// int prefs2, // w7
>> +// int mrefs2, // [sp, #0]
>> +// int parity, // [sp, #8]
>> +// int clip_max, // [sp, #16] unused
>> +// int spat); // [sp, #24]
>
>This doesn't hold for macOS targets (and the checkasm tests fail on that
>platform).
>
>On macOS, arguments that aren't passed in registers but on the stack, are
>tightly packed. So since parity is 32 bit and mrefs2 also was 32 bit,
>parity is available at [sp, #4].
>
>Therefore, it's usually simplest for portability reasons, to pass any
>arguments after the first 8, as intptr_t or ptrdiff_t, as that makes them
>consistent across platforms.
Not my interface - this is already existing code. What do you suggest I
do?
I'm happy either to change the interface or fix my stack offsets if
there is any clue that lets me detect this ABI. As personal preference
I'd choose the latter.
I don't have easy access to a mac. Is there any easy way of getting this
tested before resubmission?
Thanks
JC
>// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [FFmpeg-devel] [PATCH 11/15] avfilter/vf_bwdif: Add neon for filter_line
2023-07-01 21:44 ` Martin Storsjö
@ 2023-07-02 10:57 ` John Cox
2023-07-02 20:40 ` Martin Storsjö
0 siblings, 1 reply; 32+ messages in thread
From: John Cox @ 2023-07-02 10:57 UTC (permalink / raw)
To: Martin Storsjö
Cc: thomas.mundt, FFmpeg development discussions and patches
On Sun, 2 Jul 2023 00:44:10 +0300 (EEST), you wrote:
>On Thu, 29 Jun 2023, John Cox wrote:
>
>> Signed-off-by: John Cox <jc@kynesim.co.uk>
>> ---
>> libavfilter/aarch64/vf_bwdif_init_aarch64.c | 21 ++
>> libavfilter/aarch64/vf_bwdif_neon.S | 215 ++++++++++++++++++++
>> 2 files changed, 236 insertions(+)
>>
>> diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>> index e75cf2f204..21e67884ab 100644
>> --- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>> +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>> @@ -31,6 +31,26 @@ void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void *next1,
>> void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
>> int prefs3, int mrefs3, int parity, int clip_max);
>>
>> +void ff_bwdif_filter_line_neon(void *dst1, void *prev1, void *cur1, void *next1,
>> + int w, int prefs, int mrefs, int prefs2, int mrefs2,
>> + int prefs3, int mrefs3, int prefs4, int mrefs4,
>> + int parity, int clip_max);
>> +
>> +
>> +static void filter_line_helper(void *dst1, void *prev1, void *cur1, void *next1,
>> + int w, int prefs, int mrefs, int prefs2, int mrefs2,
>> + int prefs3, int mrefs3, int prefs4, int mrefs4,
>> + int parity, int clip_max)
>> +{
>> + const int w0 = clip_max != 255 ? 0 : w & ~15;
>> +
>> + ff_bwdif_filter_line_neon(dst1, prev1, cur1, next1,
>> + w0, prefs, mrefs, prefs2, mrefs2, prefs3, mrefs3, prefs4, mrefs4, parity, clip_max);
>> +
>> + if (w0 < w)
>> + ff_bwdif_filter_line_c((char *)dst1 + w0, (char *)prev1 + w0, (char *)cur1 + w0, (char *)next1 + w0,
>> + w - w0, prefs, mrefs, prefs2, mrefs2, prefs3, mrefs3, prefs4, mrefs4, parity, clip_max);
>> +}
>>
>> static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void *next1,
>> int w, int prefs, int mrefs, int prefs2, int mrefs2,
>> @@ -71,6 +91,7 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
>> return;
>>
>> s->filter_intra = filter_intra_helper;
>> + s->filter_line = filter_line_helper;
>> s->filter_edge = filter_edge_helper;
>> }
>>
>> diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
>> index a33b235882..675e97d966 100644
>> --- a/libavfilter/aarch64/vf_bwdif_neon.S
>> +++ b/libavfilter/aarch64/vf_bwdif_neon.S
>> @@ -128,6 +128,221 @@ coeffs:
>> .hword 5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
>> .hword 5077, 981 // sp[0] = v0.h[6]
>>
>> +// ===========================================================================
>> +//
>> +// void filter_line(
>> +// void *dst1, // x0
>> +// void *prev1, // x1
>> +// void *cur1, // x2
>> +// void *next1, // x3
>> +// int w, // w4
>> +// int prefs, // w5
>> +// int mrefs, // w6
>> +// int prefs2, // w7
>> +// int mrefs2, // [sp, #0]
>> +// int prefs3, // [sp, #8]
>> +// int mrefs3, // [sp, #16]
>> +// int prefs4, // [sp, #24]
>> +// int mrefs4, // [sp, #32]
>> +// int parity, // [sp, #40]
>> +// int clip_max) // [sp, #48]
>> +
>> +function ff_bwdif_filter_line_neon, export=1
>> + // Sanity check w
>> + cmp w4, #0
>> + ble 99f
>> +
>> + // Rearrange regs to be the same as line3 for ease of debug!
>> + mov w10, w4 // w10 = loop count
>> + mov w9, w6 // w9 = mref
>> + mov w12, w7 // w12 = pref2
>> + mov w11, w5 // w11 = pref
>> + ldr w8, [sp, #0] // w8 = mref2
>> + ldr w7, [sp, #16] // w7 = mref3
>> + ldr w6, [sp, #32] // w6 = mref4
>> + ldr w13, [sp, #8] // w13 = pref3
>> + ldr w14, [sp, #24] // w14 = pref4
>
>Btw, remember that you can load two arguments from the stack at once with
>ldp, e.g. "ldp x8, x13, [sp, #0]". If they're made intptr_t/ptrdiff_t, you
>won't have an issue with garbage in the upper 32 bits either.
Fair point - I was indeed worrying about garbage in the upper half (and
this is not performance or size critical code).
>> +
>> + mov x4, x3
>> + mov x3, x2
>> + mov x2, x1
>> +
>> +// #define prev2 cur
>> +// const uint8_t * restrict next2 = parity ? prev : next;
>> + ldr w17, [sp, #40] // parity
>> + cmp w17, #0
>> + csel x17, x2, x4, ne
>> +
>> + // We want all the V registers - save all the ones we must
>> + stp d14, d15, [sp, #-64]!
>> + stp d8, d9, [sp, #48]
>> + stp d10, d11, [sp, #32]
>> + stp d12, d13, [sp, #16]
>
>The order looks a bit weird here even if they end up sequential on the
>stack. If you'd fill it from the bottom up, e.g.
>
>stp d8, d9, [sp, #-64]!
>stp d10, d11, [sp, #16]
>stp d12, d13, [sp, #32]
>stp d14, d15, [sp, #48]
>
>they're sequential both in code and on the stack.
Sure I can tweak that.
JC
>// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [FFmpeg-devel] [PATCH 02/15] avfilter/vf_bwdif: Add common macros and consts for aarch64 neon
2023-07-02 10:27 ` John Cox
@ 2023-07-02 20:07 ` Martin Storsjö
2023-07-02 21:02 ` Martin Storsjö
0 siblings, 1 reply; 32+ messages in thread
From: Martin Storsjö @ 2023-07-02 20:07 UTC (permalink / raw)
To: John Cox; +Cc: thomas.mundt, FFmpeg development discussions and patches
On Sun, 2 Jul 2023, John Cox wrote:
> On Sun, 2 Jul 2023 00:35:14 +0300 (EEST), you wrote:
>
>> On Thu, 29 Jun 2023, John Cox wrote:
>>
>>> Add macros for dual scalar half->single multiply and accumulate
>>> Add macro for shift, saturate and shorten single to byte
>>> Add filter constants
>>>
>>> Signed-off-by: John Cox <jc@kynesim.co.uk>
>>> ---
>>> libavfilter/aarch64/vf_bwdif_neon.S | 46 +++++++++++++++++++++++++++++
>>> 1 file changed, 46 insertions(+)
>>>
>>> diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
>>> index 639ab22998..a8f0ed525a 100644
>>> --- a/libavfilter/aarch64/vf_bwdif_neon.S
>>> +++ b/libavfilter/aarch64/vf_bwdif_neon.S
>>> @@ -23,3 +23,49 @@
>>>
>>> #include "libavutil/aarch64/asm.S"
>>>
>>> +.macro SQSHRUNN b, s0, s1, s2, s3, n
>>> + sqshrun \s0\().4h, \s0\().4s, #\n - 8
>>> + sqshrun2 \s0\().8h, \s1\().4s, #\n - 8
>>> + sqshrun \s1\().4h, \s2\().4s, #\n - 8
>>> + sqshrun2 \s1\().8h, \s3\().4s, #\n - 8
>>> + uzp2 \b\().16b, \s0\().16b, \s1\().16b
>>> +.endm
>>> +
>>> +.macro SMULL4K a0, a1, a2, a3, s0, s1, k
>>> + smull \a0\().4s, \s0\().4h, \k
>>> + smull2 \a1\().4s, \s0\().8h, \k
>>> + smull \a2\().4s, \s1\().4h, \k
>>> + smull2 \a3\().4s, \s1\().8h, \k
>>> +.endm
>>> +
>>> +.macro UMULL4K a0, a1, a2, a3, s0, s1, k
>>> + umull \a0\().4s, \s0\().4h, \k
>>> + umull2 \a1\().4s, \s0\().8h, \k
>>> + umull \a2\().4s, \s1\().4h, \k
>>> + umull2 \a3\().4s, \s1\().8h, \k
>>> +.endm
>>> +
>>> +.macro UMLAL4K a0, a1, a2, a3, s0, s1, k
>>> + umlal \a0\().4s, \s0\().4h, \k
>>> + umlal2 \a1\().4s, \s0\().8h, \k
>>> + umlal \a2\().4s, \s1\().4h, \k
>>> + umlal2 \a3\().4s, \s1\().8h, \k
>>> +.endm
>>> +
>>> +.macro UMLSL4K a0, a1, a2, a3, s0, s1, k
>>> + umlsl \a0\().4s, \s0\().4h, \k
>>> + umlsl2 \a1\().4s, \s0\().8h, \k
>>> + umlsl \a2\().4s, \s1\().4h, \k
>>> + umlsl2 \a3\().4s, \s1\().8h, \k
>>> +.endm
>>> +
>>> +// static const uint16_t coef_lf[2] = { 4309, 213 };
>>> +// static const uint16_t coef_hf[3] = { 5570, 3801, 1016 };
>>> +// static const uint16_t coef_sp[2] = { 5077, 981 };
>>> +
>>> + .align 16
>>
>> Note that .align for arm is power of two; this triggers a 2^16 byte
>> alignment here, which certainly isn't intended.
>
> Yikes! I'll swap that for a .balign now I've looked that up
>
>> But in general, the usual way of defining constants is with a
>> const/endconst block, which places them in the right rdata section instead
>> of in the text section. But that then requires you to use a movrel macro
>> for accessing the data, instead of a plain ldr instruction.
>
> Yeah - arm has a history of mixing text & const - I went with the
> simpler code. Is this a deal breaker or can I leave it as is?
I wouldn't treat it as a deal breaker as long as it works across all
platforms - even if consistency with the code style elsewhere is
preferred, but IIRC there may be issues with MS armasm (after passed
through gas-preprocessor). IIRC there might be issues with starting out
with straight up content without the full setup made by the function/const
macros. OTOH I might have fixed that in gas-preprocessor too...
Last time around, the patchset failed building in that configuration due
ot the out of range alignment, I'll see how it fares now.
// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [FFmpeg-devel] [PATCH 04/15] avfilter/vf_bwdif: Add neon for filter_intra
2023-07-02 10:43 ` John Cox
@ 2023-07-02 20:18 ` Martin Storsjö
0 siblings, 0 replies; 32+ messages in thread
From: Martin Storsjö @ 2023-07-02 20:18 UTC (permalink / raw)
To: FFmpeg development discussions and patches; +Cc: thomas.mundt
On Sun, 2 Jul 2023, John Cox wrote:
> On Sun, 2 Jul 2023 00:37:35 +0300 (EEST), you wrote:
>
>>> +
>>> + uaddl v20.8h, v31.8b, v30.8b
>>> + uaddl2 v21.8h, v31.16b, v30.16b
>>> +
>>> + UMULL4K v2, v3, v4, v5, v20, v21, v0.h[6]
>>> +
>>> + uaddl v20.8h, v29.8b, v28.8b
>>> + uaddl2 v21.8h, v29.16b, v28.16b
>>> +
>>> + UMLSL4K v2, v3, v4, v5, v20, v21, v0.h[7]
>>> +
>>> +// dst[0] = av_clip(interpol, 0, clip_max);
>>> + SQSHRUNN v2, v2, v3, v4, v5, 13
>>> + str q2, [x0], #16
>>> +
>>> +// dst++;
>>> +// cur++;
>>> +// }
>>> +
>>> + subs w2, w2, #16
>>> + add x1, x1, #16
>>
>> For in-order cores, it might be good to update these variables sometime
>> sooner, e.g. before the str instruction. But such scheduling breaks your
>> mapping between neat C code and assembly.
>
> I take your point but there is at least 1 instruction between set and
> use which is normally enough.
True, in most cases, it's enough, but sometimes you can save more if
there's a stall bubble elsewhere too.
> I did my benching on a Pi4 and found, to my surprise, that in most cases
> reordering instructions to interleavse operations made life worse and
> seeing as Pi4 is my target platform I stopped trying to do that and
> stuck with code that was easier to read! (Also filter_intra seems to be
> much more memory b/w limited than code limited on a Pi4.)
A Raspberry Pi 4 is Cortex A72, and that one has got out of order
execution, so you generally won't be able to notice most of the effects of
instruction scheduling. On actual in-order cores, like Cortex A53 (found
in e..g the Pi3 and lots of other places), scheduling can have a fairly
dramatic effect though.
In any case, here it's not a big concern, and one instruction inbetween
usually is good enough indeed.
// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [FFmpeg-devel] [PATCH 08/15] avfilter/vf_bwdif: Add neon for filter_edge
2023-07-02 10:50 ` John Cox
@ 2023-07-02 20:36 ` Martin Storsjö
0 siblings, 0 replies; 32+ messages in thread
From: Martin Storsjö @ 2023-07-02 20:36 UTC (permalink / raw)
To: John Cox; +Cc: thomas.mundt, FFmpeg development discussions and patches
On Sun, 2 Jul 2023, John Cox wrote:
> On Sun, 2 Jul 2023 00:40:09 +0300 (EEST), you wrote:
>
>> On Thu, 29 Jun 2023, John Cox wrote:
>>
>>> Signed-off-by: John Cox <jc@kynesim.co.uk>
>>> ---
>>> libavfilter/aarch64/vf_bwdif_init_aarch64.c | 20 ++++
>>> libavfilter/aarch64/vf_bwdif_neon.S | 104 ++++++++++++++++++++
>>> 2 files changed, 124 insertions(+)
>>>
>>> diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>>> index 3ffaa07ab3..e75cf2f204 100644
>>> --- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>>> +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>>> @@ -24,10 +24,29 @@
>>> #include "libavfilter/bwdif.h"
>>> #include "libavutil/aarch64/cpu.h"
>>>
>>> +void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void *next1,
>>> + int w, int prefs, int mrefs, int prefs2, int mrefs2,
>>> + int parity, int clip_max, int spat);
>>> +
>>> void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
>>> int prefs3, int mrefs3, int parity, int clip_max);
>>>
>>>
>>> +static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void *next1,
>>> + int w, int prefs, int mrefs, int prefs2, int mrefs2,
>>> + int parity, int clip_max, int spat)
>>> +{
>>> + const int w0 = clip_max != 255 ? 0 : w & ~15;
>>> +
>>> + ff_bwdif_filter_edge_neon(dst1, prev1, cur1, next1, w0, prefs, mrefs, prefs2, mrefs2,
>>> + parity, clip_max, spat);
>>> +
>>> + if (w0 < w)
>>> + ff_bwdif_filter_edge_c((char *)dst1 + w0, (char *)prev1 + w0, (char *)cur1 + w0, (char *)next1 + w0,
>>> + w - w0, prefs, mrefs, prefs2, mrefs2,
>>> + parity, clip_max, spat);
>>> +}
>>> +
>>> static void filter_intra_helper(void *dst1, void *cur1, int w, int prefs, int mrefs,
>>> int prefs3, int mrefs3, int parity, int clip_max)
>>> {
>>> @@ -52,5 +71,6 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
>>> return;
>>>
>>> s->filter_intra = filter_intra_helper;
>>> + s->filter_edge = filter_edge_helper;
>>> }
>>>
>>> diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
>>> index 6c5d1598f4..a33b235882 100644
>>> --- a/libavfilter/aarch64/vf_bwdif_neon.S
>>> +++ b/libavfilter/aarch64/vf_bwdif_neon.S
>>> @@ -128,6 +128,110 @@ coeffs:
>>> .hword 5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
>>> .hword 5077, 981 // sp[0] = v0.h[6]
>>>
>>> +// ============================================================================
>>> +//
>>> +// void ff_bwdif_filter_edge_neon(
>>> +// void *dst1, // x0
>>> +// void *prev1, // x1
>>> +// void *cur1, // x2
>>> +// void *next1, // x3
>>> +// int w, // w4
>>> +// int prefs, // w5
>>> +// int mrefs, // w6
>>> +// int prefs2, // w7
>>> +// int mrefs2, // [sp, #0]
>>> +// int parity, // [sp, #8]
>>> +// int clip_max, // [sp, #16] unused
>>> +// int spat); // [sp, #24]
>>
>> This doesn't hold for macOS targets (and the checkasm tests fail on that
>> platform).
>>
>> On macOS, arguments that aren't passed in registers but on the stack, are
>> tightly packed. So since parity is 32 bit and mrefs2 also was 32 bit,
>> parity is available at [sp, #4].
>>
>> Therefore, it's usually simplest for portability reasons, to pass any
>> arguments after the first 8, as intptr_t or ptrdiff_t, as that makes them
>> consistent across platforms.
>
> Not my interface - this is already existing code. What do you suggest I
> do?
>
> I'm happy either to change the interface or fix my stack offsets if
> there is any clue that lets me detect this ABI. As personal preference
> I'd choose the latter.
You can litter the code with "#ifdef __APPLE__", but in general I'd prefer
to not go down that route as it makes the code messier to maintain, and
explicitly requires you to test the code on both system combinations
whenever touching it, compared to just having to test it once if there's a
shared codepath.
We often do change function interfaces to make writing the assembly
simpler; we often change parameters like "int stride" into "ptrdiff_t
stride", to avoid needing explicit sign extension instructions on some
architectures. Changing interfaces for this reason is less common though
(we often don't have many functions that take that many parameters), but
it shouldn't require a lot of changes for other architectures with
existing assembly.
// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [FFmpeg-devel] [PATCH 11/15] avfilter/vf_bwdif: Add neon for filter_line
2023-07-02 10:57 ` John Cox
@ 2023-07-02 20:40 ` Martin Storsjö
0 siblings, 0 replies; 32+ messages in thread
From: Martin Storsjö @ 2023-07-02 20:40 UTC (permalink / raw)
To: John Cox; +Cc: thomas.mundt, FFmpeg development discussions and patches
On Sun, 2 Jul 2023, John Cox wrote:
> On Sun, 2 Jul 2023 00:44:10 +0300 (EEST), you wrote:
>
>> On Thu, 29 Jun 2023, John Cox wrote:
>>
>>> Signed-off-by: John Cox <jc@kynesim.co.uk>
>>> ---
>>> libavfilter/aarch64/vf_bwdif_init_aarch64.c | 21 ++
>>> libavfilter/aarch64/vf_bwdif_neon.S | 215 ++++++++++++++++++++
>>> 2 files changed, 236 insertions(+)
>>>
>>> diff --git a/libavfilter/aarch64/vf_bwdif_init_aarch64.c b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>>> index e75cf2f204..21e67884ab 100644
>>> --- a/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>>> +++ b/libavfilter/aarch64/vf_bwdif_init_aarch64.c
>>> @@ -31,6 +31,26 @@ void ff_bwdif_filter_edge_neon(void *dst1, void *prev1, void *cur1, void *next1,
>>> void ff_bwdif_filter_intra_neon(void *dst1, void *cur1, int w, int prefs, int mrefs,
>>> int prefs3, int mrefs3, int parity, int clip_max);
>>>
>>> +void ff_bwdif_filter_line_neon(void *dst1, void *prev1, void *cur1, void *next1,
>>> + int w, int prefs, int mrefs, int prefs2, int mrefs2,
>>> + int prefs3, int mrefs3, int prefs4, int mrefs4,
>>> + int parity, int clip_max);
>>> +
>>> +
>>> +static void filter_line_helper(void *dst1, void *prev1, void *cur1, void *next1,
>>> + int w, int prefs, int mrefs, int prefs2, int mrefs2,
>>> + int prefs3, int mrefs3, int prefs4, int mrefs4,
>>> + int parity, int clip_max)
>>> +{
>>> + const int w0 = clip_max != 255 ? 0 : w & ~15;
>>> +
>>> + ff_bwdif_filter_line_neon(dst1, prev1, cur1, next1,
>>> + w0, prefs, mrefs, prefs2, mrefs2, prefs3, mrefs3, prefs4, mrefs4, parity, clip_max);
>>> +
>>> + if (w0 < w)
>>> + ff_bwdif_filter_line_c((char *)dst1 + w0, (char *)prev1 + w0, (char *)cur1 + w0, (char *)next1 + w0,
>>> + w - w0, prefs, mrefs, prefs2, mrefs2, prefs3, mrefs3, prefs4, mrefs4, parity, clip_max);
>>> +}
>>>
>>> static void filter_edge_helper(void *dst1, void *prev1, void *cur1, void *next1,
>>> int w, int prefs, int mrefs, int prefs2, int mrefs2,
>>> @@ -71,6 +91,7 @@ ff_bwdif_init_aarch64(BWDIFContext *s, int bit_depth)
>>> return;
>>>
>>> s->filter_intra = filter_intra_helper;
>>> + s->filter_line = filter_line_helper;
>>> s->filter_edge = filter_edge_helper;
>>> }
>>>
>>> diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S
>>> index a33b235882..675e97d966 100644
>>> --- a/libavfilter/aarch64/vf_bwdif_neon.S
>>> +++ b/libavfilter/aarch64/vf_bwdif_neon.S
>>> @@ -128,6 +128,221 @@ coeffs:
>>> .hword 5570, 3801, 1016, -3801 // hf[0] = v0.h[2], -hf[1] = v0.h[5]
>>> .hword 5077, 981 // sp[0] = v0.h[6]
>>>
>>> +// ===========================================================================
>>> +//
>>> +// void filter_line(
>>> +// void *dst1, // x0
>>> +// void *prev1, // x1
>>> +// void *cur1, // x2
>>> +// void *next1, // x3
>>> +// int w, // w4
>>> +// int prefs, // w5
>>> +// int mrefs, // w6
>>> +// int prefs2, // w7
>>> +// int mrefs2, // [sp, #0]
>>> +// int prefs3, // [sp, #8]
>>> +// int mrefs3, // [sp, #16]
>>> +// int prefs4, // [sp, #24]
>>> +// int mrefs4, // [sp, #32]
>>> +// int parity, // [sp, #40]
>>> +// int clip_max) // [sp, #48]
>>> +
>>> +function ff_bwdif_filter_line_neon, export=1
>>> + // Sanity check w
>>> + cmp w4, #0
>>> + ble 99f
>>> +
>>> + // Rearrange regs to be the same as line3 for ease of debug!
>>> + mov w10, w4 // w10 = loop count
>>> + mov w9, w6 // w9 = mref
>>> + mov w12, w7 // w12 = pref2
>>> + mov w11, w5 // w11 = pref
>>> + ldr w8, [sp, #0] // w8 = mref2
>>> + ldr w7, [sp, #16] // w7 = mref3
>>> + ldr w6, [sp, #32] // w6 = mref4
>>> + ldr w13, [sp, #8] // w13 = pref3
>>> + ldr w14, [sp, #24] // w14 = pref4
>>
>> Btw, remember that you can load two arguments from the stack at once with
>> ldp, e.g. "ldp x8, x13, [sp, #0]". If they're made intptr_t/ptrdiff_t, you
>> won't have an issue with garbage in the upper 32 bits either.
>
> Fair point - I was indeed worrying about garbage in the upper half (and
> this is not performance or size critical code).
Well as long as you actually do refer to the register in the form of w8
instead of x8, it shouldn't matter. Checkasm does try to make sure that
you actually should get garbage in such areas, so if it passes checkasm,
it should be fine.
// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [FFmpeg-devel] [PATCH 02/15] avfilter/vf_bwdif: Add common macros and consts for aarch64 neon
2023-07-02 20:07 ` Martin Storsjö
@ 2023-07-02 21:02 ` Martin Storsjö
2023-07-03 8:31 ` John Cox
0 siblings, 1 reply; 32+ messages in thread
From: Martin Storsjö @ 2023-07-02 21:02 UTC (permalink / raw)
To: John Cox; +Cc: thomas.mundt, FFmpeg development discussions and patches
On Sun, 2 Jul 2023, Martin Storsjö wrote:
> On Sun, 2 Jul 2023, John Cox wrote:
>
>> On Sun, 2 Jul 2023 00:35:14 +0300 (EEST), you wrote:
>>
>>> On Thu, 29 Jun 2023, John Cox wrote:
>>>
>>>> Add macros for dual scalar half->single multiply and accumulate
>>>> Add macro for shift, saturate and shorten single to byte
>>>> Add filter constants
>>>>
>>>> Signed-off-by: John Cox <jc@kynesim.co.uk>
>>>> ---
>>>> libavfilter/aarch64/vf_bwdif_neon.S | 46 +++++++++++++++++++++++++++++
>>>> 1 file changed, 46 insertions(+)
>>>>
>>>> +
>>>> + .align 16
>>>
>>> Note that .align for arm is power of two; this triggers a 2^16 byte
>>> alignment here, which certainly isn't intended.
>>
>> Yikes! I'll swap that for a .balign now I've looked that up
>>
>>> But in general, the usual way of defining constants is with a
>>> const/endconst block, which places them in the right rdata section instead
>>> of in the text section. But that then requires you to use a movrel macro
>>> for accessing the data, instead of a plain ldr instruction.
>>
>> Yeah - arm has a history of mixing text & const - I went with the
>> simpler code. Is this a deal breaker or can I leave it as is?
>
> I wouldn't treat it as a deal breaker as long as it works across all
> platforms - even if consistency with the code style elsewhere is preferred,
> but IIRC there may be issues with MS armasm (after passed through
> gas-preprocessor). IIRC there might be issues with starting out with straight
> up content without the full setup made by the function/const macros. OTOH I
> might have fixed that in gas-preprocessor too...
>
> Last time around, the patchset failed building in that configuration due ot
> the out of range alignment, I'll see how it fares now.
I'm sorry, but I'd just recommend you to go with the const macros.
Your current patch fails because gas-preprocessor,
https://github.com/ffmpeg/gas-preprocessor, doesn't support the .balign
directive, it only recognizes .align and .p2align. (Extending it to
support it would be trivial though.)
If I change your code to ".align 4", I get the following warning:
\home\martin\code\ffmpeg-msvc-arm64\libavfilter\aarch64\vf_bwdif_neon.o.asm(1011)
: warning A4228: Alignment value exceeds AREA alignment; alignment not
guaranteed
ALIGN 16
Since we haven't started any section, apparently armasm defaults to a
section with 4 byte alignment.
But anyway, regardless of the alignment, it later fails with this error:
\home\martin\code\ffmpeg-msvc-arm64\libavfilter\aarch64\vf_bwdif_neon.o.asm(1051)
: error A2504: operand 2: Expected address
ldr q0, coeffs
So I would request you to just go with the macros we use elsewhere. The
gas-preprocessor/armasm setup doesn't support/expect any random assembly,
but the disciplined subset we normally write. In most cases, we
essentially never write bare directives in the code, but only use the
macros from asm.S, which are set up to handle portability across all
supported platforms and their toolchains.
// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [FFmpeg-devel] [PATCH 02/15] avfilter/vf_bwdif: Add common macros and consts for aarch64 neon
2023-07-02 21:02 ` Martin Storsjö
@ 2023-07-03 8:31 ` John Cox
0 siblings, 0 replies; 32+ messages in thread
From: John Cox @ 2023-07-03 8:31 UTC (permalink / raw)
To: Martin Storsjö
Cc: thomas.mundt, FFmpeg development discussions and patches
On Mon, 3 Jul 2023 00:02:27 +0300 (EEST), you wrote:
>On Sun, 2 Jul 2023, Martin Storsjö wrote:
>
>> On Sun, 2 Jul 2023, John Cox wrote:
>>
>>> On Sun, 2 Jul 2023 00:35:14 +0300 (EEST), you wrote:
>>>
>>>> On Thu, 29 Jun 2023, John Cox wrote:
>>>>
>>>>> Add macros for dual scalar half->single multiply and accumulate
>>>>> Add macro for shift, saturate and shorten single to byte
>>>>> Add filter constants
>>>>>
>>>>> Signed-off-by: John Cox <jc@kynesim.co.uk>
>>>>> ---
>>>>> libavfilter/aarch64/vf_bwdif_neon.S | 46 +++++++++++++++++++++++++++++
>>>>> 1 file changed, 46 insertions(+)
>>>>>
>>>>> +
>>>>> + .align 16
>>>>
>>>> Note that .align for arm is power of two; this triggers a 2^16 byte
>>>> alignment here, which certainly isn't intended.
>>>
>>> Yikes! I'll swap that for a .balign now I've looked that up
>>>
>>>> But in general, the usual way of defining constants is with a
>>>> const/endconst block, which places them in the right rdata section instead
>>>> of in the text section. But that then requires you to use a movrel macro
>>>> for accessing the data, instead of a plain ldr instruction.
>>>
>>> Yeah - arm has a history of mixing text & const - I went with the
>>> simpler code. Is this a deal breaker or can I leave it as is?
>>
>> I wouldn't treat it as a deal breaker as long as it works across all
>> platforms - even if consistency with the code style elsewhere is preferred,
>> but IIRC there may be issues with MS armasm (after passed through
>> gas-preprocessor). IIRC there might be issues with starting out with straight
>> up content without the full setup made by the function/const macros. OTOH I
>> might have fixed that in gas-preprocessor too...
>>
>> Last time around, the patchset failed building in that configuration due ot
>> the out of range alignment, I'll see how it fares now.
>
>I'm sorry, but I'd just recommend you to go with the const macros.
>
>Your current patch fails because gas-preprocessor,
>https://github.com/ffmpeg/gas-preprocessor, doesn't support the .balign
>directive, it only recognizes .align and .p2align. (Extending it to
>support it would be trivial though.)
>
>If I change your code to ".align 4", I get the following warning:
>
>\home\martin\code\ffmpeg-msvc-arm64\libavfilter\aarch64\vf_bwdif_neon.o.asm(1011)
>: warning A4228: Alignment value exceeds AREA alignment; alignment not
>guaranteed
> ALIGN 16
>
>Since we haven't started any section, apparently armasm defaults to a
>section with 4 byte alignment.
>
>But anyway, regardless of the alignment, it later fails with this error:
>
>\home\martin\code\ffmpeg-msvc-arm64\libavfilter\aarch64\vf_bwdif_neon.o.asm(1051)
>: error A2504: operand 2: Expected address
> ldr q0, coeffs
>
>
>So I would request you to just go with the macros we use elsewhere. The
>gas-preprocessor/armasm setup doesn't support/expect any random assembly,
>but the disciplined subset we normally write. In most cases, we
>essentially never write bare directives in the code, but only use the
>macros from asm.S, which are set up to handle portability across all
>supported platforms and their toolchains.
OK - will do.
JC
>// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2023-07-03 8:31 UTC | newest]
Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-29 17:57 [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 01/15] avfilter/vf_bwdif: Add outline for aarch " John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 02/15] avfilter/vf_bwdif: Add common macros and consts for aarch64 neon John Cox
2023-07-01 21:35 ` Martin Storsjö
2023-07-02 10:27 ` John Cox
2023-07-02 20:07 ` Martin Storsjö
2023-07-02 21:02 ` Martin Storsjö
2023-07-03 8:31 ` John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 03/15] avfilter/vf_bwdif: Export C filter_intra John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 04/15] avfilter/vf_bwdif: Add neon for filter_intra John Cox
2023-07-01 21:37 ` Martin Storsjö
2023-07-02 10:43 ` John Cox
2023-07-02 20:18 ` Martin Storsjö
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 05/15] tests/checkasm: Add test for vf_bwdif filter_intra John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 06/15] avfilter/vf_bwdif: Add clip and spatial macros for aarch64 neon John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 07/15] avfilter/vf_bwdif: Export C filter_edge John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 08/15] avfilter/vf_bwdif: Add neon for filter_edge John Cox
2023-07-01 21:40 ` Martin Storsjö
2023-07-02 10:50 ` John Cox
2023-07-02 20:36 ` Martin Storsjö
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 09/15] tests/checkasm: Add test for vf_bwdif filter_edge John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 10/15] avfilter/vf_bwdif: Export C filter_line John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 11/15] avfilter/vf_bwdif: Add neon for filter_line John Cox
2023-07-01 21:44 ` Martin Storsjö
2023-07-02 10:57 ` John Cox
2023-07-02 20:40 ` Martin Storsjö
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 12/15] avfilter/vf_bwdif: Add a filter_line3 method for optimisation John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 13/15] avfilter/vf_bwdif: Add neon for filter_line3 John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 14/15] tests/checkasm: Add test for vf_bwdif filter_line3 John Cox
2023-06-29 17:57 ` [FFmpeg-devel] [PATCH 15/15] avfilter/vf_bwdif: Block filter slices into a multiple of 4 lines John Cox
2023-07-01 21:33 ` [FFmpeg-devel] [PATCH 00/15] avfilter/vf_bwdif: Add aarch64 neon functions Martin Storsjö
2023-07-02 10:18 ` John Cox
Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
This inbox may be cloned and mirrored by anyone:
git clone --mirror https://master.gitmailbox.com/ffmpegdev/0 ffmpegdev/git/0.git
# If you have public-inbox 1.1+ installed, you may
# initialize and index your mirror using the following commands:
public-inbox-init -V2 ffmpegdev ffmpegdev/ https://master.gitmailbox.com/ffmpegdev \
ffmpegdev@gitmailbox.com
public-inbox-index ffmpegdev
Example config snippet for mirrors.
AGPL code for this site: git clone https://public-inbox.org/public-inbox.git