Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
From: James Almer <jamrial@gmail.com>
To: ffmpeg-devel@ffmpeg.org
Subject: Re: [FFmpeg-devel] [PATCH] lpc: rewrite lpc_compute_autocorr in external asm
Date: Sat, 25 May 2024 21:09:03 -0300
Message-ID: <7ff3c1ea-b2cc-491a-be39-b8d64c23f6a0@gmail.com> (raw)
In-Reply-To: <13d7b7ee-9b36-4b8a-8c44-f46a5352174a@lynne.ee>

On 5/25/2024 9:02 PM, Lynne via ffmpeg-devel wrote:
> On 26/05/2024 00:45, James Almer wrote:
>> On 5/25/2024 7:31 PM, James Almer wrote:
>>> On 5/25/2024 5:57 PM, Lynne via ffmpeg-devel wrote:
>>>> The inline asm function had issues running under checkasm.
>>>> So I came to finish what I started, and wrote the last part
>>>> of LPC computation in assembly.
>>>>
>>>> autocorr_10_c: 135525.8
>>>> autocorr_10_sse2: 50729.8
>>>> autocorr_10_fma3: 19007.8
>>>> autocorr_30_c: 390100.8
>>>> autocorr_30_sse2: 142478.8
>>>> autocorr_30_fma3: 50559.8
>>>> autocorr_32_c: 407058.3
>>>> autocorr_32_sse2: 151633.3
>>>> autocorr_32_fma3: 50517.3
>>>> ---
>>>>   libavcodec/x86/lpc.asm    | 91 +++++++++++++++++++++++++++++++++++++++
>>>>   libavcodec/x86/lpc_init.c | 87 ++++---------------------------------
>>>>   2 files changed, 100 insertions(+), 78 deletions(-)
>>>>
>>>> diff --git a/libavcodec/x86/lpc.asm b/libavcodec/x86/lpc.asm
>>>> index a585c17ef5..790841b7f4 100644
>>>> --- a/libavcodec/x86/lpc.asm
>>>> +++ b/libavcodec/x86/lpc.asm
>>>> @@ -32,6 +32,8 @@ dec_tab_sse2: times 2 dq -2.0
>>>>   dec_tab_scalar: times 2 dq -1.0
>>>>   seq_tab_sse2: dq 1.0, 0.0
>>>> +autoc_init_tab: times 4 dq 1.0
>>>> +
>>>>   SECTION .text
>>>>   %macro APPLY_WELCH_FN 0
>>>> @@ -261,3 +263,92 @@ APPLY_WELCH_FN
>>>>   INIT_YMM avx2
>>>>   APPLY_WELCH_FN
>>>>   %endif
>>>> +
>>>> +%macro COMPUTE_AUTOCORR_FN 0
>>>> +cglobal lpc_compute_autocorr, 4, 7, 8, data, len, lag, autoc, lag_p, data_l, len_p
>>>
>>> Already mentioned, but it should be 3 not 8.
>>>
>>>> +
>>>> +    shl lagd, 3
>>>> +    shl lenq, 3
>>>> +    xor lag_pq, lag_pq
>>>> +
>>>> +.lag_l:
>>>> +    movaps m8, [autoc_init_tab]
>>>
>>> m2
>>>
>>>> +
>>>> +    mov len_pq, lag_pq
>>>> +
>>>> +    lea data_lq, [lag_pq + mmsize - 8]
>>>> +    neg data_lq                     ; -j - mmsize
>>>> +    add data_lq, dataq              ; data[-j - mmsize]
>>>> +.len_l:
>>>> +    ; We waste the upper value here on SSE2,
>>>> +    ; but we use it on AVX.
>>>> +    movupd xm0, [dataq + len_pq]    ; data[i]
>>>
>>> movsd
>>>
>>>> +    movupd m1, [data_lq + len_pq]   ; data[i - j]
>>>> +
>>>> +%if cpuflag(avx)
>>>
>>> %if mmsize == 32 here and everywhere else.
>>>
>>>> +    vbroadcastsd m0, xm0
>>>
>>> The register-source form is AVX2. AVX only has the memory-source 
>>> form of vbroadcastsd, so use that and save the movsd from above for 
>>> the FMA3 version.
>>>
>>>> +    vperm2f128 m1, m1, m1, 0x01
>>>
>>> Aren't you loading 16 extra bytes for no reason if you're just going 
>>> to use the upper 16 bytes from the load above?
>>
>> Never mind, this is swapping lanes.
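
For readers following the lane-swap remark, a plain-C model of how vperm2f128 interprets its immediate may help. This is only an illustrative sketch: perm2f128_model is a hypothetical helper name (not part of the patch or of FFmpeg), and a 256-bit register is modelled as an array of four doubles.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical scalar model of "vperm2f128 dst, a, b, imm".
 * Each 4-bit half of imm controls one 128-bit lane of dst:
 *   bits 1:0 select a.low (0), a.high (1), b.low (2) or b.high (3);
 *   bit 3 zeroes the lane instead. */
static void perm2f128_model(double dst[4], const double a[4],
                            const double b[4], unsigned imm)
{
    const double *lanes[4] = { a, a + 2, b, b + 2 };
    double tmp[4]; /* temporary so that dst may alias a or b */

    assert(dst && a && b);
    for (int half = 0; half < 2; half++) {
        unsigned ctl = (imm >> (4 * half)) & 0xF;
        if (ctl & 8)
            tmp[2 * half] = tmp[2 * half + 1] = 0.0;
        else
            memcpy(&tmp[2 * half], lanes[ctl & 3], 2 * sizeof(double));
    }
    memcpy(dst, tmp, sizeof(tmp));
}
```

With both sources equal and imm = 0x01, the low lane of the result takes the source's high lane and the high lane takes the source's low lane, which is the swap described above.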
>>
>> That aside, in my tests on win64 with GCC these versions are barely 
>> better than the C code and, with certain seeds, sometimes worse.
>> For example, seed 4022958484 gives me:
>>
>> autocorr_10_c: 21345.6
>> autocorr_10_sse2: 16434.6
>> autocorr_10_fma3: 24154.6
>> autocorr_30_c: 59239.1
>> autocorr_30_sse2: 46114.6
>> autocorr_30_fma3: 64147.1
>> autocorr_32_c: 63022.1
>> autocorr_32_sse2: 50040.1
>> autocorr_32_fma3: 66594.1
>>
>> But seed 2236774811 gives me:
>>
>> autocorr_10_c: 37135.3
>> autocorr_10_sse2: 26492.3
>> autocorr_10_fma3: 32943.3
>> autocorr_30_c: 102266.8
>> autocorr_30_sse2: 72933.3
>> autocorr_30_fma3: 85808.3
>> autocorr_32_c: 106537.8
>> autocorr_32_sse2: 77623.3
>> autocorr_32_fma3: 85844.3
>>
>> But if I force len to always be 4999 instead of letting it vary 
>> with the seed, I consistently get things like:
>>
>> autocorr_10_c: 40447.3
>> autocorr_10_sse2: 39526.8
>> autocorr_10_fma3: 42955.3
>> autocorr_30_c: 111362.3
>> autocorr_30_sse2: 111408.3
>> autocorr_30_fma3: 116781.8
>> autocorr_32_c: 122388.3
>> autocorr_32_sse2: 119125.3
>> autocorr_32_fma3: 114239.3
>>
>> It would help if someone else could confirm this, but overall I don't 
>> see any worthwhile gain here. The old inline version, for those seeds 
>> where it worked, was somewhat faster.
> 
> The metrics given are on Zen 3, with Clang with compiler optimizations 
> disabled.
> We do not rely on compiler optimizations, and have plenty of assembly 
> which turns out to be slower than modern compilers autovectorizing (even 
> though we disable tree vectorization on GCC, that does not apply to 
> simple loops like this one). On the other hand, we also support ancient 
> compilers, and compilers which have no understanding of vectorization at 
> all.

Tree vectorization is disabled everywhere, including my target (GCC 14, 
mingw-w64, Alder Lake i7).

> To illustrate how different results can look on different arches and 
> compilers, and even platforms (you mentioned you tested only on win64):
> 
> Zen 3, gcc-9, O2:
> autocorr_10_c: 48796.8
> autocorr_10_sse2: 39571.8
> autocorr_10_fma3: 30272.8
> autocorr_30_c: 138499.3
> autocorr_30_sse2: 114091.3
> autocorr_30_fma3: 82114.3
> autocorr_32_c: 146466.8
> autocorr_32_sse2: 118400.8
> autocorr_32_fma3: 80473.8
> 
> Zen 3, gcc-14, O2:
> autocorr_10_c: 44981.3
> autocorr_10_sse2: 36481.3
> autocorr_10_fma3: 18418.8
> autocorr_30_c: 129462.8
> autocorr_30_sse2: 104175.3
> autocorr_30_fma3: 48670.3
> autocorr_32_c: 135625.3
> autocorr_32_sse2: 109079.8
> autocorr_32_fma3: 48670.3
> 
> Zen 3, clang-18, O2:
> autocorr_10_c: 51872.6
> autocorr_10_sse2: 48311.1
> autocorr_10_fma3: 30070.1
> autocorr_30_c: 145899.6
> autocorr_30_sse2: 135793.1
> autocorr_30_fma3: 79922.6
> autocorr_32_c: 160443.1
> autocorr_32_sse2: 147591.1
> autocorr_32_fma3: 80075.6
> 
> Skylake, gcc-14, O2:
> autocorr_10_c: 149251.0
> autocorr_10_sse2: 133769.5
> autocorr_10_fma3: 72886.0
> autocorr_30_c: 396145.0
> autocorr_30_sse2: 376618.5
> autocorr_30_fma3: 194116.5
> autocorr_32_c: 413219.0
> autocorr_32_sse2: 400867.5
> autocorr_32_fma3: 194117.5
> 
> Skylake, clang-18, O2:
> autocorr_10_c: 153825.3
> autocorr_10_sse2: 133774.3
> autocorr_10_fma3: 72883.8
> autocorr_30_c: 398339.8
> autocorr_30_sse2: 376603.8
> autocorr_30_fma3: 194098.8
> autocorr_32_c: 432183.3
> autocorr_32_sse2: 422583.8
> autocorr_32_fma3: 194094.3

I see no such boost at all. You're getting roughly twice the performance 
from fma3 compared to sse2, whereas I get fma3 slower than sse2 in many 
cases. There is something fishy going on, which is why I'm asking others 
to check whether they can reproduce it.
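
To make the discrepancy concrete, the sse2/fma3 ratio can be computed from the autocorr_10 rows quoted in this thread. The following is just a sketch for the arithmetic: the struct and function names are hypothetical, and the numbers are copied from the listings above.

```c
#include <assert.h>
#include <stdio.h>

/* One autocorr_10 benchmark row from the thread (hypothetical struct). */
struct bench_row {
    const char *setup;
    double sse2;
    double fma3;
};

/* Ratio > 1.0 means fma3 is faster than sse2, < 1.0 means slower. */
static double fma3_speedup_over_sse2(const struct bench_row *r)
{
    assert(r->fma3 > 0.0);
    return r->sse2 / r->fma3;
}

/* Values copied from the benchmark listings quoted in this thread. */
static const struct bench_row autocorr_10_rows[] = {
    { "Zen 3, gcc-14, O2",           36481.3,  18418.8 },
    { "Skylake, gcc-14, O2",         133769.5, 72886.0 },
    { "win64, GCC, seed 4022958484", 16434.6,  24154.6 },
    { "win64, GCC, seed 2236774811", 26492.3,  32943.3 },
};

static void print_speedups(void)
{
    size_t n = sizeof(autocorr_10_rows) / sizeof(autocorr_10_rows[0]);
    for (size_t i = 0; i < n; i++)
        printf("%-30s fma3 vs sse2: %.2fx\n", autocorr_10_rows[i].setup,
               fma3_speedup_over_sse2(&autocorr_10_rows[i]));
}
```

Calling print_speedups() from a main() gives ratios close to 2 for the Zen 3 gcc-14 and Skylake rows and below 1 for both win64 seeds.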

> 
> <Insert your favorite decade old compiler here>
> 
> But again, this is irrelevant, as we do not rely on compilers for 
> optimizations. We help them as much as we can, and when they work, it's 
> nice, but that is in no way reliable, and certainly not grounds to turn 
> down a patch like this.
> 
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
> 
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".


Thread overview: 14+ messages
2024-05-25 20:57 Lynne via ffmpeg-devel
2024-05-25 22:12 ` Michael Niedermayer
2024-05-25 22:31 ` James Almer
2024-05-25 22:45   ` James Almer
2024-05-26  0:02     ` Lynne via ffmpeg-devel
2024-05-26  0:09       ` James Almer [this message]
2024-05-25 23:24   ` Lynne via ffmpeg-devel
2024-05-25 23:41     ` James Almer
2024-05-26  5:45   ` Rémi Denis-Courmont
2024-05-26  0:39 ` James Almer
2024-05-26  1:42 ` [FFmpeg-devel] [PATCH v2] " Lynne via ffmpeg-devel
2024-05-26  1:51   ` James Almer
2024-05-26  2:16     ` James Almer
2024-05-26 19:43   ` Michael Niedermayer
