From: "Martin Storsjö" <martin@martin.st> To: Ben Avison <bavison@riscosopen.org> Cc: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org> Subject: Re: [FFmpeg-devel] [PATCH 05/10] avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths Date: Fri, 1 Apr 2022 00:21:37 +0300 (EEST) Message-ID: <d45d531f-cddf-f9ba-677a-c5e091b7438f@martin.st> (raw) In-Reply-To: <452decbf-939d-238b-4307-f88a1e3b14a2@riscosopen.org> On Thu, 31 Mar 2022, Ben Avison wrote: > On 30/03/2022 13:35, Martin Storsjö wrote: >> Overall, the code looks sensible to me. >> >> Would it make sense to share the core of the filter between the >> horizontal/vertical cases with e.g. a macro? (I didn't check in detail if >> there's much differences in the core of the filter. At most some >> differences in condition registers for partial writeout in the horizontal >> forms?) > > Well, looking at the comments at the right-hand side of the source, which > give the logical meaning of the results of each instruction, I admit there's > a resemblance in the middle of the 8-pixel-pair function. Actually, I didn't try to follow/compare it to that level, I just assumed them to be similar. > However, the physical register assignments are quite different, and > attempting to reassign the registers in one to match the other isn't a > trivial task. It's hard enough when you start register assignment from > the top of a function and work your way down, as I have done here. > > In the 16-pixel-pair case, the fact that the input values arrive in a > different order as the result of them, in one case, being loaded in > regularly-increasing address order, and in the other, falling out of a matrix > transposition, has resulted in even the logical order of instructions being > quite different in the two cases. > > In the 4-pixel-pair case, the values are packed differently into registers in > the two cases, because in the v case, we're loading 4 pixels between > row-strides, which means it's easy to place each row in its own vector, > whereas in the h case we load 4 rows of 8 pixels each and transpose, which > leaves the values in 4 vectors rather than 8. Some of the filtering steps can > be performed with the data packed in this way (calculating a1 and a2) while > waiting for it to be restructured in order to calculate the other metrics, > but it's not worth packing the data together in this way in the v case given > that it starts off already separated. So the two implementations end up quite > different in the operations they perform, not just the scheduling of > instructions and in register assignment terms. > > Some background: as you may have guessed, I didn't start out writing these > functions as they currently appear. Prototype versions didn't care much for > scheduling or keeping to a small number of registers. They were primarily for > checking the correctness of the mathematics, and they'd use all available > vectors, sometimes shuffling values between registers or to the stack to make > room. Once I'd verified correctness, I then reworked them to keep to a > minimal number of registers and to minimise stalls as far as possible. > > I'm targeting the Cortex-A72, since that's what the Raspberry Pi 4 uses and > it's on the cusp of having enough power to decode VC-1 BluRay streams, so I > deliberately didn't take too much consideration of the requirements of > earlier cores. Yes, it's an out-of-order core, but I reckoned there are > probably limits to how wisely it can select instructions to execute (there > have got to be limits to instruction queue lengths, for example). So based on > the pipeline structure documented in Arm's Cortex-A72 software opimization > guide, I arranged the instructions to best keep all pipelines busy as much as > possible, then assigned registers to keep the instructions in this order. > > For the most part, I was able to keep the number of vectors used low enough > that no callee-saving was required - or failing that, at least avoiding > having to spill values to the stack mid-function. But it came pretty close at > times - witness for example the peculiar order in which vectors had to be > loaded in the AArch32 version of ff_vc1_h_loop_filter16_neon. There's reason > behind that! > > In short, I'd really rather not tamper with these larger assembly functions > any more unless I really have to. Ok, fair enough. FWIW, my point of view was from implementing the loop filters for VP9 and AV1, where I did the core filter as one shared implementation for both variants, and where the frontend functions just load (and transpose) data into the registers used as input for the common core filter, and vice versa. But I presume that a custom implementation for each of them can be more optimal, at the cost of more code to maintain (but if there are no bugs, it usually doesn't need maintainance either). Thus - fair enough, this code probably is ok then. // Martin _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
next prev parent reply other threads:[~2022-03-31 21:21 UTC|newest] Thread overview: 55+ messages / expand[flat|nested] mbox.gz Atom feed top 2022-03-17 18:58 [FFmpeg-devel] [PATCH 0/6] avcodec/vc1: Arm optimisations Ben Avison 2022-03-17 18:58 ` [FFmpeg-devel] [PATCH 1/6] avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths Ben Avison 2022-03-17 18:58 ` [FFmpeg-devel] [PATCH 2/6] avcodec/vc1: Arm 32-bit " Ben Avison 2022-03-17 18:58 ` [FFmpeg-devel] [PATCH 3/6] avcodec/vc1: Arm 64-bit NEON inverse transform " Ben Avison 2022-03-17 18:58 ` [FFmpeg-devel] [PATCH 4/6] avcodec/idctdsp: Arm 64-bit NEON block add and clamp " Ben Avison 2022-03-17 18:58 ` [FFmpeg-devel] [PATCH 5/6] avcodec/blockdsp: Arm 64-bit NEON block clear " Ben Avison 2022-03-17 18:58 ` [FFmpeg-devel] [PATCH 6/6] avcodec/vc1: Introduce fast path for unescaping bitstream buffer Ben Avison 2022-03-18 19:10 ` Andreas Rheinhardt 2022-03-21 15:51 ` Ben Avison 2022-03-21 20:44 ` Martin Storsjö 2022-03-19 23:06 ` [FFmpeg-devel] [PATCH 0/6] avcodec/vc1: Arm optimisations Martin Storsjö 2022-03-19 23:07 ` Martin Storsjö 2022-03-21 17:37 ` Ben Avison 2022-03-21 22:29 ` Martin Storsjö 2022-03-25 18:52 ` [FFmpeg-devel] [PATCH v2 00/10] " Ben Avison 2022-03-25 18:52 ` [FFmpeg-devel] [PATCH 01/10] checkasm: Add vc1dsp in-loop deblocking filter tests Ben Avison 2022-03-25 22:53 ` Martin Storsjö 2022-03-28 18:28 ` Ben Avison 2022-03-29 11:47 ` Martin Storsjö 2022-03-29 12:24 ` Martin Storsjö 2022-03-29 12:43 ` Martin Storsjö 2022-03-25 18:52 ` [FFmpeg-devel] [PATCH 02/10] checkasm: Add vc1dsp inverse transform tests Ben Avison 2022-03-29 12:41 ` Martin Storsjö 2022-03-25 18:52 ` [FFmpeg-devel] [PATCH 03/10] checkasm: Add idctdsp add/put-pixels-clamped tests Ben Avison 2022-03-29 13:13 ` Martin Storsjö 2022-03-29 19:56 ` Martin Storsjö 2022-03-29 20:22 ` Ben Avison 2022-03-29 20:30 ` Martin Storsjö 2022-03-25 18:52 ` [FFmpeg-devel] [PATCH 04/10] avcodec/vc1: Introduce fast path for unescaping bitstream buffer Ben Avison 2022-03-29 20:37 ` Martin Storsjö 2022-03-31 13:58 ` Ben Avison 2022-03-31 14:07 ` Martin Storsjö 2022-03-25 18:52 ` [FFmpeg-devel] [PATCH 05/10] avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths Ben Avison 2022-03-30 12:35 ` Martin Storsjö 2022-03-31 15:15 ` Ben Avison 2022-03-31 21:21 ` Martin Storsjö [this message] 2022-03-25 18:52 ` [FFmpeg-devel] [PATCH 06/10] avcodec/vc1: Arm 32-bit " Ben Avison 2022-03-25 19:27 ` Lynne 2022-03-25 19:49 ` Martin Storsjö 2022-03-25 19:55 ` Lynne 2022-03-30 12:37 ` Martin Storsjö 2022-03-30 13:03 ` Martin Storsjö 2022-03-25 18:52 ` [FFmpeg-devel] [PATCH 07/10] avcodec/vc1: Arm 64-bit NEON inverse transform " Ben Avison 2022-03-30 13:49 ` Martin Storsjö 2022-03-30 14:01 ` Martin Storsjö 2022-03-31 15:37 ` Ben Avison 2022-03-31 21:32 ` Martin Storsjö 2022-03-25 18:52 ` [FFmpeg-devel] [PATCH 08/10] avcodec/idctdsp: Arm 64-bit NEON block add and clamp " Ben Avison 2022-03-30 14:14 ` Martin Storsjö 2022-03-31 16:47 ` Ben Avison 2022-03-31 21:42 ` Martin Storsjö 2022-03-25 18:52 ` [FFmpeg-devel] [PATCH 09/10] avcodec/vc1: Arm 64-bit NEON unescape fast path Ben Avison 2022-03-30 14:35 ` Martin Storsjö 2022-03-25 18:52 ` [FFmpeg-devel] [PATCH 10/10] avcodec/vc1: Arm 32-bit " Ben Avison 2022-03-30 14:35 ` Martin Storsjö
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=d45d531f-cddf-f9ba-677a-c5e091b7438f@martin.st \ --to=martin@martin.st \ --cc=bavison@riscosopen.org \ --cc=ffmpeg-devel@ffmpeg.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel This inbox may be cloned and mirrored by anyone: git clone --mirror https://master.gitmailbox.com/ffmpegdev/0 ffmpegdev/git/0.git # If you have public-inbox 1.1+ installed, you may # initialize and index your mirror using the following commands: public-inbox-init -V2 ffmpegdev ffmpegdev/ https://master.gitmailbox.com/ffmpegdev \ ffmpegdev@gitmailbox.com public-inbox-index ffmpegdev Example config snippet for mirrors. AGPL code for this site: git clone https://public-inbox.org/public-inbox.git