From: Lynne via ffmpeg-devel <ffmpeg-devel@ffmpeg.org>
To: ffmpeg-devel@ffmpeg.org
Cc: Lynne <dev@lynne.ee>
Subject: Re: [FFmpeg-devel] [PATCH] lpc: rewrite lpc_compute_autocorr in external asm
Date: Sun, 26 May 2024 02:02:17 +0200
Message-ID: <13d7b7ee-9b36-4b8a-8c44-f46a5352174a@lynne.ee>
In-Reply-To: <d971bea5-3424-429a-b944-53d659a92d26@gmail.com>
On 26/05/2024 00:45, James Almer wrote:
> On 5/25/2024 7:31 PM, James Almer wrote:
>> On 5/25/2024 5:57 PM, Lynne via ffmpeg-devel wrote:
>>> The inline asm function had issues running under checkasm.
>>> So I came to finish what I started, and wrote the last part
>>> of LPC computation in assembly.
>>>
>>> autocorr_10_c: 135525.8
>>> autocorr_10_sse2: 50729.8
>>> autocorr_10_fma3: 19007.8
>>> autocorr_30_c: 390100.8
>>> autocorr_30_sse2: 142478.8
>>> autocorr_30_fma3: 50559.8
>>> autocorr_32_c: 407058.3
>>> autocorr_32_sse2: 151633.3
>>> autocorr_32_fma3: 50517.3
>>> ---
>>> libavcodec/x86/lpc.asm | 91 +++++++++++++++++++++++++++++++++++++++
>>> libavcodec/x86/lpc_init.c | 87 ++++---------------------------------
>>> 2 files changed, 100 insertions(+), 78 deletions(-)
>>>
>>> diff --git a/libavcodec/x86/lpc.asm b/libavcodec/x86/lpc.asm
>>> index a585c17ef5..790841b7f4 100644
>>> --- a/libavcodec/x86/lpc.asm
>>> +++ b/libavcodec/x86/lpc.asm
>>> @@ -32,6 +32,8 @@ dec_tab_sse2: times 2 dq -2.0
>>> dec_tab_scalar: times 2 dq -1.0
>>> seq_tab_sse2: dq 1.0, 0.0
>>> +autoc_init_tab: times 4 dq 1.0
>>> +
>>> SECTION .text
>>> %macro APPLY_WELCH_FN 0
>>> @@ -261,3 +263,92 @@ APPLY_WELCH_FN
>>> INIT_YMM avx2
>>> APPLY_WELCH_FN
>>> %endif
>>> +
>>> +%macro COMPUTE_AUTOCORR_FN 0
>>> +cglobal lpc_compute_autocorr, 4, 7, 8, data, len, lag, autoc, lag_p, data_l, len_p
>>
>> Already mentioned, but it should be 3 not 8.
>>
>>> +
>>> + shl lagd, 3
>>> + shl lenq, 3
>>> + xor lag_pq, lag_pq
>>> +
>>> +.lag_l:
>>> + movaps m8, [autoc_init_tab]
>>
>> m2
>>
>>> +
>>> + mov len_pq, lag_pq
>>> +
>>> + lea data_lq, [lag_pq + mmsize - 8]
>>> + neg data_lq ; -j - mmsize
>>> + add data_lq, dataq ; data[-j - mmsize]
>>> +.len_l:
>>> + ; We waste the upper value here on SSE2,
>>> + ; but we use it on AVX.
>>> + movupd xm0, [dataq + len_pq] ; data[i]
>>
>> movsd
>>
>>> + movupd m1, [data_lq + len_pq] ; data[i - j]
>>> +
>>> +%if cpuflag(avx)
>>
>> %if mmsize == 32 here and everywhere else.
>>
>>> + vbroadcastsd m0, xm0
>>
>> This is AVX2. AVX only has memory input argument. So use that and save
>> the movsd from above for the FMA3 version.
>>
>>> + vperm2f128 m1, m1, m1, 0x01
>>
>> Aren't you loading 16 extra bytes for no reason if you're just going
>> to use the upper 16 bytes from the load above?
>
> Nevermind, this is swapping lanes.
>
> That aside, these versions are barely better and sometimes worse in all
> my tests on win64 with GCC with certain seeds.
> For example, seed 4022958484 gives me:
>
> autocorr_10_c: 21345.6
> autocorr_10_sse2: 16434.6
> autocorr_10_fma3: 24154.6
> autocorr_30_c: 59239.1
> autocorr_30_sse2: 46114.6
> autocorr_30_fma3: 64147.1
> autocorr_32_c: 63022.1
> autocorr_32_sse2: 50040.1
> autocorr_32_fma3: 66594.1
>
> But seed 2236774811 gives me:
>
> autocorr_10_c: 37135.3
> autocorr_10_sse2: 26492.3
> autocorr_10_fma3: 32943.3
> autocorr_30_c: 102266.8
> autocorr_30_sse2: 72933.3
> autocorr_30_fma3: 85808.3
> autocorr_32_c: 106537.8
> autocorr_32_sse2: 77623.3
> autocorr_32_fma3: 85844.3
>
> But if i force len to always be 4999 instead of its value varying
> depending on seed, i consistently get things like:
>
> autocorr_10_c: 40447.3
> autocorr_10_sse2: 39526.8
> autocorr_10_fma3: 42955.3
> autocorr_30_c: 111362.3
> autocorr_30_sse2: 111408.3
> autocorr_30_fma3: 116781.8
> autocorr_32_c: 122388.3
> autocorr_32_sse2: 119125.3
> autocorr_32_fma3: 114239.3
>
> It would help if someone else could confirm this, but overall i don't
> see any worthwhile gain here. The old inline version, for those seeds
> where it worked, was somewhat faster.

The numbers I gave were measured on Zen 3, using Clang with compiler
optimizations disabled.

We do not rely on compiler optimizations, and we have plenty of assembly
that turns out to be slower than what modern compilers produce by
autovectorizing (we disable tree vectorization on GCC, but that does not
apply to simple loops like this one). On the other hand, we also support
ancient compilers, and compilers that have no understanding of
vectorization at all.
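
For reference, the kind of simple loop being discussed looks roughly like
the following. This is a minimal scalar sketch, not the actual FFmpeg C
code: the name autocorr_ref is hypothetical, the 1.0 starting value is
inferred from autoc_init_tab in the quoted patch, and writing lag + 1
outputs is an assumption about the calling convention.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference loop, for illustration only.
 * Assumes autoc has room for lag + 1 doubles and that each sum
 * starts at 1.0 (inferred from autoc_init_tab in the patch). */
static void autocorr_ref(const double *data, ptrdiff_t len, int lag,
                         double *autoc)
{
    assert(data && autoc && len >= 0 && lag >= 0);

    for (int j = 0; j <= lag; j++) {
        double sum = 1.0;

        /* Start at i = j so that data[i - j] never reads before data[0]. */
        for (ptrdiff_t i = j; i < len; i++)
            sum += data[i] * data[i - j];

        autoc[j] = sum;
    }
}
```

The inner loop is a plain multiply-accumulate over contiguous doubles,
which is the pattern a modern compiler can vectorize on its own and an
old one cannot.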
To illustrate how different results can look on different arches and
compilers, and even platforms (you mentioned you tested only on win64):
Zen 3, gcc-9, O2:
autocorr_10_c: 48796.8
autocorr_10_sse2: 39571.8
autocorr_10_fma3: 30272.8
autocorr_30_c: 138499.3
autocorr_30_sse2: 114091.3
autocorr_30_fma3: 82114.3
autocorr_32_c: 146466.8
autocorr_32_sse2: 118400.8
autocorr_32_fma3: 80473.8
Zen 3, gcc-14, O2:
autocorr_10_c: 44981.3
autocorr_10_sse2: 36481.3
autocorr_10_fma3: 18418.8
autocorr_30_c: 129462.8
autocorr_30_sse2: 104175.3
autocorr_30_fma3: 48670.3
autocorr_32_c: 135625.3
autocorr_32_sse2: 109079.8
autocorr_32_fma3: 48670.3
Zen 3, clang-18, O2:
autocorr_10_c: 51872.6
autocorr_10_sse2: 48311.1
autocorr_10_fma3: 30070.1
autocorr_30_c: 145899.6
autocorr_30_sse2: 135793.1
autocorr_30_fma3: 79922.6
autocorr_32_c: 160443.1
autocorr_32_sse2: 147591.1
autocorr_32_fma3: 80075.6
Skylake, gcc-14, O2:
autocorr_10_c: 149251.0
autocorr_10_sse2: 133769.5
autocorr_10_fma3: 72886.0
autocorr_30_c: 396145.0
autocorr_30_sse2: 376618.5
autocorr_30_fma3: 194116.5
autocorr_32_c: 413219.0
autocorr_32_sse2: 400867.5
autocorr_32_fma3: 194117.5
Skylake, clang-18, O2:
autocorr_10_c: 153825.3
autocorr_10_sse2: 133774.3
autocorr_10_fma3: 72883.8
autocorr_30_c: 398339.8
autocorr_30_sse2: 376603.8
autocorr_30_fma3: 194098.8
autocorr_32_c: 432183.3
autocorr_32_sse2: 422583.8
autocorr_32_fma3: 194094.3
<Insert your favorite decade old compiler here>
But again, this is irrelevant, as we do not rely on compilers for
optimizations. We help them as much as we can, and when they work, it's
nice, but that is in no way reliable, and certainly not a reason to turn
down a patch like this.