From: Ben Avison <bavison@riscosopen.org> To: ffmpeg-devel@ffmpeg.org Cc: Ben Avison <bavison@riscosopen.org> Subject: [FFmpeg-devel] [PATCH 08/10] avcodec/idctdsp: Arm 64-bit NEON block add and clamp fast paths Date: Fri, 25 Mar 2022 18:52:55 +0000 Message-ID: <20220325185257.513933-9-bavison@riscosopen.org> (raw) In-Reply-To: <20220325185257.513933-1-bavison@riscosopen.org> checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. idctdsp.add_pixels_clamped_c: 323.0 idctdsp.add_pixels_clamped_neon: 41.5 idctdsp.put_pixels_clamped_c: 243.0 idctdsp.put_pixels_clamped_neon: 30.0 idctdsp.put_signed_pixels_clamped_c: 225.7 idctdsp.put_signed_pixels_clamped_neon: 37.7 Signed-off-by: Ben Avison <bavison@riscosopen.org> --- libavcodec/aarch64/Makefile | 3 +- libavcodec/aarch64/idctdsp_init_aarch64.c | 26 +++-- libavcodec/aarch64/idctdsp_neon.S | 130 ++++++++++++++++++++++ 3 files changed, 150 insertions(+), 9 deletions(-) create mode 100644 libavcodec/aarch64/idctdsp_neon.S diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile index 5b25e4dfb9..c8935f205e 100644 --- a/libavcodec/aarch64/Makefile +++ b/libavcodec/aarch64/Makefile @@ -44,7 +44,8 @@ NEON-OBJS-$(CONFIG_H264PRED) += aarch64/h264pred_neon.o NEON-OBJS-$(CONFIG_H264QPEL) += aarch64/h264qpel_neon.o \ aarch64/hpeldsp_neon.o NEON-OBJS-$(CONFIG_HPELDSP) += aarch64/hpeldsp_neon.o -NEON-OBJS-$(CONFIG_IDCTDSP) += aarch64/simple_idct_neon.o +NEON-OBJS-$(CONFIG_IDCTDSP) += aarch64/idctdsp_neon.o \ + aarch64/simple_idct_neon.o NEON-OBJS-$(CONFIG_MDCT) += aarch64/mdct_neon.o NEON-OBJS-$(CONFIG_MPEGAUDIODSP) += aarch64/mpegaudiodsp_neon.o NEON-OBJS-$(CONFIG_PIXBLOCKDSP) += aarch64/pixblockdsp_neon.o diff --git a/libavcodec/aarch64/idctdsp_init_aarch64.c b/libavcodec/aarch64/idctdsp_init_aarch64.c index 742a3372e3..eec21aa5a2 100644 --- a/libavcodec/aarch64/idctdsp_init_aarch64.c +++ b/libavcodec/aarch64/idctdsp_init_aarch64.c @@ -27,19 +27,29 @@ #include "libavcodec/idctdsp.h" #include "idct.h" +void ff_put_pixels_clamped_neon(const int16_t *, uint8_t *, ptrdiff_t); +void ff_put_signed_pixels_clamped_neon(const int16_t *, uint8_t *, ptrdiff_t); +void ff_add_pixels_clamped_neon(const int16_t *, uint8_t *, ptrdiff_t); + av_cold void ff_idctdsp_init_aarch64(IDCTDSPContext *c, AVCodecContext *avctx, unsigned high_bit_depth) { int cpu_flags = av_get_cpu_flags(); - if (have_neon(cpu_flags) && !avctx->lowres && !high_bit_depth) { - if (avctx->idct_algo == FF_IDCT_AUTO || - avctx->idct_algo == FF_IDCT_SIMPLEAUTO || - avctx->idct_algo == FF_IDCT_SIMPLENEON) { - c->idct_put = ff_simple_idct_put_neon; - c->idct_add = ff_simple_idct_add_neon; - c->idct = ff_simple_idct_neon; - c->perm_type = FF_IDCT_PERM_PARTTRANS; + if (have_neon(cpu_flags)) { + if (!avctx->lowres && !high_bit_depth) { + if (avctx->idct_algo == FF_IDCT_AUTO || + avctx->idct_algo == FF_IDCT_SIMPLEAUTO || + avctx->idct_algo == FF_IDCT_SIMPLENEON) { + c->idct_put = ff_simple_idct_put_neon; + c->idct_add = ff_simple_idct_add_neon; + c->idct = ff_simple_idct_neon; + c->perm_type = FF_IDCT_PERM_PARTTRANS; + } } + + c->add_pixels_clamped = ff_add_pixels_clamped_neon; + c->put_pixels_clamped = ff_put_pixels_clamped_neon; + c->put_signed_pixels_clamped = ff_put_signed_pixels_clamped_neon; } } diff --git a/libavcodec/aarch64/idctdsp_neon.S b/libavcodec/aarch64/idctdsp_neon.S new file mode 100644 index 0000000000..7f47611206 --- /dev/null +++ b/libavcodec/aarch64/idctdsp_neon.S @@ -0,0 +1,130 @@ +/* + * IDCT AArch64 NEON optimisations + * + * Copyright (c) 2022 Ben Avison <bavison@riscosopen.org> + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "libavutil/aarch64/asm.S" + +// Clamp 16-bit signed block coefficients to unsigned 8-bit +// On entry: +// x0 -> array of 64x 16-bit coefficients +// x1 -> 8-bit results +// x2 = row stride for results, bytes +function ff_put_pixels_clamped_neon, export=1 + ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [x0], #64 + ld1 {v4.16b, v5.16b, v6.16b, v7.16b}, [x0] + sqxtun v0.8b, v0.8h + sqxtun v1.8b, v1.8h + sqxtun v2.8b, v2.8h + sqxtun v3.8b, v3.8h + sqxtun v4.8b, v4.8h + st1 {v0.8b}, [x1], x2 + sqxtun v0.8b, v5.8h + st1 {v1.8b}, [x1], x2 + sqxtun v1.8b, v6.8h + st1 {v2.8b}, [x1], x2 + sqxtun v2.8b, v7.8h + st1 {v3.8b}, [x1], x2 + st1 {v4.8b}, [x1], x2 + st1 {v0.8b}, [x1], x2 + st1 {v1.8b}, [x1], x2 + st1 {v2.8b}, [x1] + ret +endfunc + +// Clamp 16-bit signed block coefficients to signed 8-bit (biased by 128) +// On entry: +// x0 -> array of 64x 16-bit coefficients +// x1 -> 8-bit results +// x2 = row stride for results, bytes +function ff_put_signed_pixels_clamped_neon, export=1 + ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [x0], #64 + movi v4.8b, #128 + ld1 {v16.16b, v17.16b, v18.16b, v19.16b}, [x0] + sqxtn v0.8b, v0.8h + sqxtn v1.8b, v1.8h + sqxtn v2.8b, v2.8h + sqxtn v3.8b, v3.8h + sqxtn v5.8b, v16.8h + add v0.8b, v0.8b, v4.8b + sqxtn v6.8b, v17.8h + add v1.8b, v1.8b, v4.8b + sqxtn v7.8b, v18.8h + add v2.8b, v2.8b, v4.8b + sqxtn v16.8b, v19.8h + add v3.8b, v3.8b, v4.8b + st1 {v0.8b}, [x1], x2 + add v0.8b, v5.8b, v4.8b + st1 {v1.8b}, [x1], x2 + add v1.8b, v6.8b, v4.8b + st1 {v2.8b}, [x1], x2 + add v2.8b, v7.8b, v4.8b + st1 {v3.8b}, [x1], x2 + add v3.8b, v16.8b, v4.8b + st1 {v0.8b}, [x1], x2 + st1 {v1.8b}, [x1], x2 + st1 {v2.8b}, [x1], x2 + st1 {v3.8b}, [x1] + ret +endfunc + +// Add 16-bit signed block coefficients to unsigned 8-bit +// On entry: +// x0 -> array of 64x 16-bit coefficients +// x1 -> 8-bit input and results +// x2 = row stride for 8-bit input and results, bytes +function ff_add_pixels_clamped_neon, export=1 + ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [x0], #64 + mov x3, x1 + ld1 {v4.8b}, [x1], x2 + ld1 {v5.8b}, [x1], x2 + ld1 {v6.8b}, [x1], x2 + ld1 {v7.8b}, [x1], x2 + ld1 {v16.16b, v17.16b, v18.16b, v19.16b}, [x0] + uaddw v0.8h, v0.8h, v4.8b + uaddw v1.8h, v1.8h, v5.8b + uaddw v2.8h, v2.8h, v6.8b + ld1 {v4.8b}, [x1], x2 + uaddw v3.8h, v3.8h, v7.8b + ld1 {v5.8b}, [x1], x2 + sqxtun v0.8b, v0.8h + ld1 {v6.8b}, [x1], x2 + sqxtun v1.8b, v1.8h + ld1 {v7.8b}, [x1] + sqxtun v2.8b, v2.8h + sqxtun v3.8b, v3.8h + uaddw v4.8h, v16.8h, v4.8b + st1 {v0.8b}, [x3], x2 + uaddw v0.8h, v17.8h, v5.8b + st1 {v1.8b}, [x3], x2 + uaddw v1.8h, v18.8h, v6.8b + st1 {v2.8b}, [x3], x2 + uaddw v2.8h, v19.8h, v7.8b + sqxtun v4.8b, v4.8h + sqxtun v0.8b, v0.8h + st1 {v3.8b}, [x3], x2 + sqxtun v1.8b, v1.8h + sqxtun v2.8b, v2.8h + st1 {v4.8b}, [x3], x2 + st1 {v0.8b}, [x3], x2 + st1 {v1.8b}, [x3], x2 + st1 {v2.8b}, [x3] + ret +endfunc -- 2.25.1 _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
next prev parent reply other threads:[~2022-03-25 18:54 UTC|newest] Thread overview: 55+ messages / expand[flat|nested] mbox.gz Atom feed top 2022-03-17 18:58 [FFmpeg-devel] [PATCH 0/6] avcodec/vc1: Arm optimisations Ben Avison 2022-03-17 18:58 ` [FFmpeg-devel] [PATCH 1/6] avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths Ben Avison 2022-03-17 18:58 ` [FFmpeg-devel] [PATCH 2/6] avcodec/vc1: Arm 32-bit " Ben Avison 2022-03-17 18:58 ` [FFmpeg-devel] [PATCH 3/6] avcodec/vc1: Arm 64-bit NEON inverse transform " Ben Avison 2022-03-17 18:58 ` [FFmpeg-devel] [PATCH 4/6] avcodec/idctdsp: Arm 64-bit NEON block add and clamp " Ben Avison 2022-03-17 18:58 ` [FFmpeg-devel] [PATCH 5/6] avcodec/blockdsp: Arm 64-bit NEON block clear " Ben Avison 2022-03-17 18:58 ` [FFmpeg-devel] [PATCH 6/6] avcodec/vc1: Introduce fast path for unescaping bitstream buffer Ben Avison 2022-03-18 19:10 ` Andreas Rheinhardt 2022-03-21 15:51 ` Ben Avison 2022-03-21 20:44 ` Martin Storsjö 2022-03-19 23:06 ` [FFmpeg-devel] [PATCH 0/6] avcodec/vc1: Arm optimisations Martin Storsjö 2022-03-19 23:07 ` Martin Storsjö 2022-03-21 17:37 ` Ben Avison 2022-03-21 22:29 ` Martin Storsjö 2022-03-25 18:52 ` [FFmpeg-devel] [PATCH v2 00/10] " Ben Avison 2022-03-25 18:52 ` [FFmpeg-devel] [PATCH 01/10] checkasm: Add vc1dsp in-loop deblocking filter tests Ben Avison 2022-03-25 22:53 ` Martin Storsjö 2022-03-28 18:28 ` Ben Avison 2022-03-29 11:47 ` Martin Storsjö 2022-03-29 12:24 ` Martin Storsjö 2022-03-29 12:43 ` Martin Storsjö 2022-03-25 18:52 ` [FFmpeg-devel] [PATCH 02/10] checkasm: Add vc1dsp inverse transform tests Ben Avison 2022-03-29 12:41 ` Martin Storsjö 2022-03-25 18:52 ` [FFmpeg-devel] [PATCH 03/10] checkasm: Add idctdsp add/put-pixels-clamped tests Ben Avison 2022-03-29 13:13 ` Martin Storsjö 2022-03-29 19:56 ` Martin Storsjö 2022-03-29 20:22 ` Ben Avison 2022-03-29 20:30 ` Martin Storsjö 2022-03-25 18:52 ` [FFmpeg-devel] [PATCH 04/10] avcodec/vc1: Introduce fast path for unescaping bitstream buffer Ben Avison 2022-03-29 20:37 ` Martin Storsjö 2022-03-31 13:58 ` Ben Avison 2022-03-31 14:07 ` Martin Storsjö 2022-03-25 18:52 ` [FFmpeg-devel] [PATCH 05/10] avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths Ben Avison 2022-03-30 12:35 ` Martin Storsjö 2022-03-31 15:15 ` Ben Avison 2022-03-31 21:21 ` Martin Storsjö 2022-03-25 18:52 ` [FFmpeg-devel] [PATCH 06/10] avcodec/vc1: Arm 32-bit " Ben Avison 2022-03-25 19:27 ` Lynne 2022-03-25 19:49 ` Martin Storsjö 2022-03-25 19:55 ` Lynne 2022-03-30 12:37 ` Martin Storsjö 2022-03-30 13:03 ` Martin Storsjö 2022-03-25 18:52 ` [FFmpeg-devel] [PATCH 07/10] avcodec/vc1: Arm 64-bit NEON inverse transform " Ben Avison 2022-03-30 13:49 ` Martin Storsjö 2022-03-30 14:01 ` Martin Storsjö 2022-03-31 15:37 ` Ben Avison 2022-03-31 21:32 ` Martin Storsjö 2022-03-25 18:52 ` Ben Avison [this message] 2022-03-30 14:14 ` [FFmpeg-devel] [PATCH 08/10] avcodec/idctdsp: Arm 64-bit NEON block add and clamp " Martin Storsjö 2022-03-31 16:47 ` Ben Avison 2022-03-31 21:42 ` Martin Storsjö 2022-03-25 18:52 ` [FFmpeg-devel] [PATCH 09/10] avcodec/vc1: Arm 64-bit NEON unescape fast path Ben Avison 2022-03-30 14:35 ` Martin Storsjö 2022-03-25 18:52 ` [FFmpeg-devel] [PATCH 10/10] avcodec/vc1: Arm 32-bit " Ben Avison 2022-03-30 14:35 ` Martin Storsjö
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20220325185257.513933-9-bavison@riscosopen.org \ --to=bavison@riscosopen.org \ --cc=ffmpeg-devel@ffmpeg.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel This inbox may be cloned and mirrored by anyone: git clone --mirror https://master.gitmailbox.com/ffmpegdev/0 ffmpegdev/git/0.git # If you have public-inbox 1.1+ installed, you may # initialize and index your mirror using the following commands: public-inbox-init -V2 ffmpegdev ffmpegdev/ https://master.gitmailbox.com/ffmpegdev \ ffmpegdev@gitmailbox.com public-inbox-index ffmpegdev Example config snippet for mirrors. AGPL code for this site: git clone https://public-inbox.org/public-inbox.git