From: James Almer <jamrial@gmail.com>
To: ffmpeg-devel@ffmpeg.org
Subject: Re: [FFmpeg-devel] [PATCH] lpc: rewrite lpc_compute_autocorr in external asm
Date: Sat, 25 May 2024 21:09:03 -0300
Message-ID: <7ff3c1ea-b2cc-491a-be39-b8d64c23f6a0@gmail.com>
In-Reply-To: <13d7b7ee-9b36-4b8a-8c44-f46a5352174a@lynne.ee>

On 5/25/2024 9:02 PM, Lynne via ffmpeg-devel wrote:
> On 26/05/2024 00:45, James Almer wrote:
>> On 5/25/2024 7:31 PM, James Almer wrote:
>>> On 5/25/2024 5:57 PM, Lynne via ffmpeg-devel wrote:
>>>> The inline asm function had issues running under checkasm.
>>>> So I came to finish what I started, and wrote the last part
>>>> of LPC computation in assembly.
>>>>
>>>> autocorr_10_c: 135525.8
>>>> autocorr_10_sse2: 50729.8
>>>> autocorr_10_fma3: 19007.8
>>>> autocorr_30_c: 390100.8
>>>> autocorr_30_sse2: 142478.8
>>>> autocorr_30_fma3: 50559.8
>>>> autocorr_32_c: 407058.3
>>>> autocorr_32_sse2: 151633.3
>>>> autocorr_32_fma3: 50517.3
>>>> ---
>>>>  libavcodec/x86/lpc.asm    | 91
>>>> +++++++++++++++++++++++++++++++++++++++
>>>>  libavcodec/x86/lpc_init.c | 87 ++++--------------------------------
>>>>  2 files changed, 100 insertions(+), 78 deletions(-)
>>>>
>>>> diff --git a/libavcodec/x86/lpc.asm b/libavcodec/x86/lpc.asm
>>>> index a585c17ef5..790841b7f4 100644
>>>> --- a/libavcodec/x86/lpc.asm
>>>> +++ b/libavcodec/x86/lpc.asm
>>>> @@ -32,6 +32,8 @@ dec_tab_sse2: times 2 dq -2.0
>>>>  dec_tab_scalar: times 2 dq -1.0
>>>>  seq_tab_sse2: dq 1.0, 0.0
>>>> +autoc_init_tab: times 4 dq 1.0
>>>> +
>>>>  SECTION .text
>>>>  %macro APPLY_WELCH_FN 0
>>>> @@ -261,3 +263,92 @@ APPLY_WELCH_FN
>>>>  INIT_YMM avx2
>>>>  APPLY_WELCH_FN
>>>>  %endif
>>>> +
>>>> +%macro COMPUTE_AUTOCORR_FN 0
>>>> +cglobal lpc_compute_autocorr, 4, 7, 8, data, len, lag, autoc,
>>>> lag_p, data_l, len_p
>>>
>>> Already mentioned, but it should be 3 not 8.
>>>
>>>> +
>>>> +    shl lagd, 3
>>>> +    shl lenq, 3
>>>> +    xor lag_pq, lag_pq
>>>> +
>>>> +.lag_l:
>>>> +    movaps m8, [autoc_init_tab]
>>>
>>> m2
>>>
>>>> +
>>>> +    mov len_pq, lag_pq
>>>> +
>>>> +    lea data_lq, [lag_pq + mmsize - 8]
>>>> +    neg data_lq                     ; -j - mmsize
>>>> +    add data_lq, dataq              ; data[-j - mmsize]
>>>> +.len_l:
>>>> +    ; We waste the upper value here on SSE2,
>>>> +    ; but we use it on AVX.
>>>> +    movupd xm0, [dataq + len_pq]    ; data[i]
>>>
>>> movsd
>>>
>>>> +    movupd m1, [data_lq + len_pq]   ; data[i - j]
>>>> +
>>>> +%if cpuflag(avx)
>>>
>>> %if mmsize == 32 here and everywhere else.
>>>
>>>> +    vbroadcastsd m0, xm0
>>>
>>> This is AVX2. AVX only has memory input argument. So use that and
>>> save the movsd from above for the FMA3 version.
>>>
>>>> +    vperm2f128 m1, m1, m1, 0x01
>>>
>>> Aren't you loading 16 extra bytes for no reason if you're just going
>>> to use the upper 16 bytes from the load above?
>>
>> Never mind, this is swapping lanes.
>>
>> That aside, these versions are barely better and sometimes worse in
>> all my tests on win64 with GCC with certain seeds.
>> For example, seed 4022958484 gives me:
>>
>> autocorr_10_c: 21345.6
>> autocorr_10_sse2: 16434.6
>> autocorr_10_fma3: 24154.6
>> autocorr_30_c: 59239.1
>> autocorr_30_sse2: 46114.6
>> autocorr_30_fma3: 64147.1
>> autocorr_32_c: 63022.1
>> autocorr_32_sse2: 50040.1
>> autocorr_32_fma3: 66594.1
>>
>> But seed 2236774811 gives me:
>>
>> autocorr_10_c: 37135.3
>> autocorr_10_sse2: 26492.3
>> autocorr_10_fma3: 32943.3
>> autocorr_30_c: 102266.8
>> autocorr_30_sse2: 72933.3
>> autocorr_30_fma3: 85808.3
>> autocorr_32_c: 106537.8
>> autocorr_32_sse2: 77623.3
>> autocorr_32_fma3: 85844.3
>>
>> But if I force len to always be 4999 instead of its value varying
>> depending on seed, I consistently get things like:
>>
>> autocorr_10_c: 40447.3
>> autocorr_10_sse2: 39526.8
>> autocorr_10_fma3: 42955.3
>> autocorr_30_c: 111362.3
>> autocorr_30_sse2: 111408.3
>> autocorr_30_fma3: 116781.8
>> autocorr_32_c: 122388.3
>> autocorr_32_sse2: 119125.3
>> autocorr_32_fma3: 114239.3
>>
>> It would help if someone else could confirm this, but overall I don't
>> see any worthwhile gain here. The old inline version, for those seeds
>> where it worked, was somewhat faster.
>
> The metrics given are on Zen 3, with Clang with compiler optimizations
> disabled.
> We do not rely on compiler optimizations, and have plenty of assembly
> which turns out to be slower than modern compilers autovectorizing (even
> though we disable tree vectorization on GCC, that does not apply to
> simple loops like this one). On the other hand, we also support ancient
> compilers, and compilers which have no understanding of vectorization at
> all.

Tree vectorization is disabled everywhere, including my target (GCC 14,
mingw-w64, Alder Lake i7).
> To illustrate how different results can look on different arches and
> compilers, and even platforms (you mentioned you tested only on win64):
>
> Zen 3, gcc-9, O2:
> autocorr_10_c: 48796.8
> autocorr_10_sse2: 39571.8
> autocorr_10_fma3: 30272.8
> autocorr_30_c: 138499.3
> autocorr_30_sse2: 114091.3
> autocorr_30_fma3: 82114.3
> autocorr_32_c: 146466.8
> autocorr_32_sse2: 118400.8
> autocorr_32_fma3: 80473.8
>
> Zen 3, gcc-14, O2:
> autocorr_10_c: 44981.3
> autocorr_10_sse2: 36481.3
> autocorr_10_fma3: 18418.8
> autocorr_30_c: 129462.8
> autocorr_30_sse2: 104175.3
> autocorr_30_fma3: 48670.3
> autocorr_32_c: 135625.3
> autocorr_32_sse2: 109079.8
> autocorr_32_fma3: 48670.3
>
> Zen 3, clang-18, O2:
> autocorr_10_c: 51872.6
> autocorr_10_sse2: 48311.1
> autocorr_10_fma3: 30070.1
> autocorr_30_c: 145899.6
> autocorr_30_sse2: 135793.1
> autocorr_30_fma3: 79922.6
> autocorr_32_c: 160443.1
> autocorr_32_sse2: 147591.1
> autocorr_32_fma3: 80075.6
>
> Skylake, gcc-14, O2:
> autocorr_10_c: 149251.0
> autocorr_10_sse2: 133769.5
> autocorr_10_fma3: 72886.0
> autocorr_30_c: 396145.0
> autocorr_30_sse2: 376618.5
> autocorr_30_fma3: 194116.5
> autocorr_32_c: 413219.0
> autocorr_32_sse2: 400867.5
> autocorr_32_fma3: 194117.5
>
> Skylake, clang-18, O2:
> autocorr_10_c: 153825.3
> autocorr_10_sse2: 133774.3
> autocorr_10_fma3: 72883.8
> autocorr_30_c: 398339.8
> autocorr_30_sse2: 376603.8
> autocorr_30_fma3: 194098.8
> autocorr_32_c: 432183.3
> autocorr_32_sse2: 422583.8
> autocorr_32_fma3: 194094.3

I see no such boost at all. You're getting twice the performance with
fma3 compared to sse2, whereas I get fma3 worse than sse2 in many cases.
There is something fishy going on, hence me asking others to check
whether they can reproduce it.

>
> <Insert your favorite decade old compiler here>
>
> But again, this is irrelevant, as we do not rely on compilers for
> optimizations.
> We help them as much as we can, and when they work, it's
> nice, but that is in no way reliable, especially to turn down a patch
> like this.
>
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
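For readers following the benchmark discussion, here is a minimal scalar
sketch of the kind of computation an autocorrelation routine like the one
under review performs. It is not the FFmpeg C implementation and not the
patch author's method. The function name autocorr_ref is hypothetical, and
the 1.0 starting value for each sum is an assumption suggested by the
autoc_init_tab constant in the quoted diff. It writes lag + 1 outputs.

```c
#include <stddef.h>

/* Hypothetical reference: autoc[j] = 1.0 + sum over i >= j of
 * data[i] * data[i - j], for j = 0 .. lag (lag + 1 outputs).
 * The 1.0 initial value is an assumption, see the note above. */
static void autocorr_ref(const double *data, ptrdiff_t len, int lag,
                         double *autoc)
{
    for (int j = 0; j <= lag; j++) {
        double sum = 1.0;
        /* Start at i = j so data[i - j] never indexes before data[0]. */
        for (ptrdiff_t i = j; i < len; i++)
            sum += data[i] * data[i - j];
        autoc[j] = sum;
    }
}
```

The inner loop runs roughly len - j times per lag, so the total work scales
with both len and lag. That is one reason timings for a fixed lag can differ
noticeably when len varies between runs.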