From: "Martin Storsjö" <martin@martin.st> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org> Subject: Re: [FFmpeg-devel] [PATCH 1/2] lavc/aarch64: new 8-bit hevc 16x16 idct Date: Tue, 9 Aug 2022 15:15:05 +0300 (EEST) Message-ID: <44688f68-969a-8570-de5c-7e1a4fe4fe0@martin.st> (raw) In-Reply-To: <20220623122311.20097-1-jdek@itanimul.li> On Thu, 23 Jun 2022, J. Dekker wrote: > old: > hevc_idct_16x16_8_c: 5366.2 > hevc_idct_16x16_8_neon: 1493.2 > > new: > hevc_idct_16x16_8_c: 5363.2 > hevc_idct_16x16_8_neon: 943.5 > > Co-developed-by: Rafal Dabrowa <fatwildcat@gmail.com> > Signed-off-by: J. Dekker <jdek@itanimul.li> > --- > libavcodec/aarch64/hevcdsp_idct_neon.S | 666 ++++++++++++++++++++++ > libavcodec/aarch64/hevcdsp_init_aarch64.c | 3 +- > 2 files changed, 668 insertions(+), 1 deletion(-) Throughout the new code, you have e.g. "add x5, x5, x4, lsl 2", where the "lsl 2" breaks assembling with MS armasm64 - it's missing the '#' on the constant 2. Also, for loads/stores, it seems to be missing the same '#' for postincrement, e.g. "ld1 {v16.8h, v17.8h, v18.8h, v19.8h}, [x4], 64". Also "mov x4, 64". Apparently armasm64 doesn't have a problem with that, but it would still be good to have it consistent with the rest. > This idct is significantly faster than the one we currently have, I > suspect its for a couple reasons: 1) it's only written for 8bit I don't see how that would change anything? Isn't the only thing that differs between 8 and 10/12 bit in the existing implementation about how much to scale down at the end? All other intermediate values are the same size? > 2) it's unrolled signficantly more. It comes at a hefty cost of roughly > 2.25x the object size. If by that, you mean that the existing code works on 4 elements at a time (i.e. mostly operating on .4h vectors), while this one operates on .8h vectors, then yes, that's most probably the biggest source of the speedup (even if a lot of the intermediate stuff happens in .4s vectors). The existing code was ported from the 32 bit arm version (which probably had to stick to 4 elements at a time due to register availability there), while it probably could have been made double width when it was ported to 64 bit. > I'm wondering if this idct is salvagable, or the one we have should just > be improved instead. Well, my honest opinion is: - I don't quite understand the current code (I've worked on the vp8/vp9/av1 IDCTs a fair amount, but the HEVC one seems to be different enough that I don't recognize all the concepts here. - The current implementation would need to be reformatted if kept - The current implementation does have some rather clear high level structure though, e.g. when looking at the idct_16x16 macro. - The new implementation seems to be just one huuuuge function. If you know it by heart, it's probably good, but it's really hard to get an overview of if you're not familiar with the HEVC IDCTs. As for steps forward: - Is it possible to widen the existing implementation to operate on 8 elements instead of 4? I think that would bring it up to par with this one. - Can you get some high level structure to the new implementation so that it becomes understandable? Either lots of more comments explaining what's happening and why, or splitting it up in smaller macros. Some more comments on the code itself below: > +// void ff_hevc_idct_16x16_8_neon(int16_t *coeffs, int col_limit) > +function ff_hevc_idct_16x16_8_neon_new, export=1 > + sub sp, sp, 64 > + st1 {v8.16b, v9.16b, v10.16b, v11.16b}, [sp] > + sub sp, sp, 32 > + st1 {v14.16b, v15.16b}, [sp] > + mov x3, 0 > + mov x2, x0 > +1: mov x4, x2 > + mov x5, 32 > + ld1 {v16.8h}, [x4], x5 > + ld1 {v17.8h}, [x4], x5 > + ld1 {v18.8h}, [x4], x5 > + ld1 {v19.8h}, [x4], x5 > + ld1 {v20.8h}, [x4], x5 > + ld1 {v21.8h}, [x4], x5 > + ld1 {v22.8h}, [x4], x5 > + ld1 {v23.8h}, [x4], x5 > + ld1 {v24.8h}, [x4], x5 > + ld1 {v25.8h}, [x4], x5 > + ld1 {v26.8h}, [x4], x5 > + ld1 {v27.8h}, [x4], x5 > + ld1 {v28.8h}, [x4], x5 > + ld1 {v29.8h}, [x4], x5 > + ld1 {v30.8h}, [x4], x5 > + ld1 {v31.8h}, [x4], x5 > + cmp x1, 12 > + b.hs 5f > + // limit2 below 16 > + bic x4, x1, 1 > + adr x5, .LimitMask > + cbnz x3, 3f > + // columns 0 .. 7 - cleanup of indexes 5 .. 7 > + ld1 {v0.8h}, [x5] > + adr x5, 2f > + add x5, x5, x4, lsl 2 > + add x5, x5, x4, lsl 1 > + br x5 > +2: and v17.16b, v17.16b, v0.16b // col_limit 0..1 -> limit2 == 4..5 > + and v19.16b, v19.16b, v0.16b > + b 5f I don't really know what these jump tables do and how it corresponds to things in the existing implementation - but I guess that can be one part of what makes things faster too. The existing implementation does an 16x16 transform by first doing 4x transforms for an 4x16 piece of data, transpose that, then do another 4x 4x16 for the second pass. How does the new implementation do it? If I understand correctly, the old implementation didn't take col_limit into account at all. Can that be one part of what makes things faster - or is that only something that makes a difference in real use but not in checkasm benchmarks? // Martin _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
prev parent reply other threads:[~2022-08-09 12:15 UTC|newest] Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top 2022-06-23 12:23 J. Dekker 2022-06-23 12:23 ` [FFmpeg-devel] [PATCH 2/2] lavc/aarch64: add 8-bit hevc 32x32 idct J. Dekker 2022-08-09 12:15 ` Martin Storsjö [this message]
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=44688f68-969a-8570-de5c-7e1a4fe4fe0@martin.st \ --to=martin@martin.st \ --cc=ffmpeg-devel@ffmpeg.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel This inbox may be cloned and mirrored by anyone: git clone --mirror https://master.gitmailbox.com/ffmpegdev/0 ffmpegdev/git/0.git # If you have public-inbox 1.1+ installed, you may # initialize and index your mirror using the following commands: public-inbox-init -V2 ffmpegdev ffmpegdev/ https://master.gitmailbox.com/ffmpegdev \ ffmpegdev@gitmailbox.com public-inbox-index ffmpegdev Example config snippet for mirrors. AGPL code for this site: git clone https://public-inbox.org/public-inbox.git