From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <72761d42-0f8f-4abb-8476-6832d14b0774@gmail.com>
Date: Sat, 25 May 2024 19:31:18 -0300
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
To: ffmpeg-devel@ffmpeg.org
References: <20240525205731.2578146-1-dev@lynne.ee>
Content-Language: en-US
From: James Almer
In-Reply-To: <20240525205731.2578146-1-dev@lynne.ee>
Subject: Re: [FFmpeg-devel] [PATCH] lpc: rewrite lpc_compute_autocorr in external asm
List-Id: FFmpeg development discussions and patches
Reply-To: FFmpeg development discussions and patches
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset="us-ascii"; Format="flowed"

On 5/25/2024 5:57 PM, Lynne via ffmpeg-devel wrote:
> The inline asm function had issues running under checkasm.
> So I came to finish what I started, and wrote the last part
> of LPC computation in assembly.
>
> autocorr_10_c: 135525.8
> autocorr_10_sse2: 50729.8
> autocorr_10_fma3: 19007.8
> autocorr_30_c: 390100.8
> autocorr_30_sse2: 142478.8
> autocorr_30_fma3: 50559.8
> autocorr_32_c: 407058.3
> autocorr_32_sse2: 151633.3
> autocorr_32_fma3: 50517.3
> ---
>  libavcodec/x86/lpc.asm    | 91 +++++++++++++++++++++++++++++++++++++++
>  libavcodec/x86/lpc_init.c | 87 ++++---------------------------------
>  2 files changed, 100 insertions(+), 78 deletions(-)
>
> diff --git a/libavcodec/x86/lpc.asm b/libavcodec/x86/lpc.asm
> index a585c17ef5..790841b7f4 100644
> --- a/libavcodec/x86/lpc.asm
> +++ b/libavcodec/x86/lpc.asm
> @@ -32,6 +32,8 @@ dec_tab_sse2: times 2 dq -2.0
>  dec_tab_scalar: times 2 dq -1.0
>  seq_tab_sse2: dq 1.0, 0.0
>
> +autoc_init_tab: times 4 dq 1.0
> +
>  SECTION .text
>
>  %macro APPLY_WELCH_FN 0
> @@ -261,3 +263,92 @@ APPLY_WELCH_FN
>  INIT_YMM avx2
>  APPLY_WELCH_FN
>  %endif
> +
> +%macro COMPUTE_AUTOCORR_FN 0
> +cglobal lpc_compute_autocorr, 4, 7, 8, data, len, lag, autoc, lag_p, data_l, len_p

Already mentioned, but it should be 3 not 8.

> +
> +    shl lagd, 3
> +    shl lenq, 3
> +    xor lag_pq, lag_pq
> +
> +.lag_l:
> +    movaps m8, [autoc_init_tab]

m2

> +
> +    mov len_pq, lag_pq
> +
> +    lea data_lq, [lag_pq + mmsize - 8]
> +    neg data_lq                     ; -j - mmsize
> +    add data_lq, dataq              ; data[-j - mmsize]
> +.len_l:
> +    ; We waste the upper value here on SSE2,
> +    ; but we use it on AVX.
> +    movupd xm0, [dataq + len_pq]    ; data[i]

movsd

> +    movupd m1, [data_lq + len_pq]   ; data[i - j]
> +
> +%if cpuflag(avx)

%if mmsize == 32 here and everywhere else.

> +    vbroadcastsd m0, xm0

This is AVX2. AVX only has memory input argument. So use that and save
the movsd from above for the FMA3 version.

> +    vperm2f128 m1, m1, m1, 0x01

Aren't you loading 16 extra bytes for no reason if you're just going to
use the upper 16 bytes from the load above?

> +%endif
> +
> +    shufpd m0, m0, m0, 1100b

The last argument has two bits, not four.
What you're doing here is a splat/broadcast, so you don't need it for FMA3.

> +    shufpd m1, m1, m1, 0101b

The upper two bits of imm8 are ignored.

> +
> +%if cpuflag(fma3)
> +    fmaddpd m8, m0, m1, m8          ; sum += data[i]*data[i-j]
> +%else
> +    mulpd m0, m1
> +    addpd m8, m0                    ; sum += data[i]*data[i-j]
> +%endif
> +
> +    add len_pq, 8
> +    cmp len_pq, lenq
> +    jl .len_l
> +
> +    movups [autocq + lag_pq], m8    ; autoc[j] = sum
> +    add lag_pq, mmsize
> +    cmp lag_pq, lagq
> +    jl .lag_l
> +
> +    ; The tail computation is guaranteed never to happen
> +    ; as long as we're doing multiples of 4, rather than 2.
> +    ; It is trivial to convert this to avx if ever needed.
> +%if !cpuflag(avx)

This doesn't seem to be tested as is. Maybe the checkasm should try
other lag values?

> +    jg .end
> +    ; If lag_p == lag fallthrough
> +
> +.tail:
> +    movaps xm2, [autoc_init_tab]
> +
> +    mov len_pq, lag_pq
> +    sub len_pq, mmsize
> +
> +    lea data_lq, [lag_pq]
> +    neg data_lq                     ; -j
> +    add data_lq, dataq              ; data[-j]
> +
> +.tail_l:
> +    movupd xm0, [dataq + len_pq]
> +    movupd xm1, [data_lq + len_pq]
> +
> +    mulpd xm0, xm1
> +    addpd xm2, xm0                  ; sum += data[i]*data[i-j]
> +
> +    add len_pq, mmsize
> +    cmp len_pq, lenq
> +    jl .tail_l
> +
> +    shufpd xm1, xm2, xm2, 01b
> +    addpd xm2, xm1
> +
> +    movhpd [autocq + lag_pq], xm2
> +%endif
> +
> +.end:
> +    RET
> +
> +%endmacro
> +
> +INIT_XMM sse2
> +COMPUTE_AUTOCORR_FN
> +INIT_YMM fma3
> +COMPUTE_AUTOCORR_FN
> diff --git a/libavcodec/x86/lpc_init.c b/libavcodec/x86/lpc_init.c
> index f2fca53799..96469fae40 100644
> --- a/libavcodec/x86/lpc_init.c
> +++ b/libavcodec/x86/lpc_init.c
> @@ -28,89 +28,20 @@ void ff_lpc_apply_welch_window_sse2(const int32_t *data, ptrdiff_t len,
>                                      double *w_data);
>  void ff_lpc_apply_welch_window_avx2(const int32_t *data, ptrdiff_t len,
>                                      double *w_data);
> -
> -DECLARE_ASM_CONST(16, double, pd_1)[2] = { 1.0, 1.0 };
> -
> -#if HAVE_SSE2_INLINE
> -
> -static void lpc_compute_autocorr_sse2(const double *data, ptrdiff_t len, int lag,
> -                                      double *autoc)
> -{
> -    int j;
> -
> -    if((x86_reg)data & 15)
> -        data++;
> -
> -    for(j=0; j<lag; j+=2){
> -        x86_reg i = -len*sizeof(double);
> -        if(j == lag-2) {
> -            __asm__ volatile(
> -                "movsd "MANGLE(pd_1)", %%xmm0 \n\t"
> -                "movsd "MANGLE(pd_1)", %%xmm1 \n\t"
> -                "movsd "MANGLE(pd_1)", %%xmm2 \n\t"
> -                "1: \n\t"
> -                "movapd (%2,%0), %%xmm3 \n\t"
> -                "movupd -8(%3,%0), %%xmm4 \n\t"
> -                "movapd (%3,%0), %%xmm5 \n\t"
> -                "mulpd %%xmm3, %%xmm4 \n\t"
> -                "mulpd %%xmm3, %%xmm5 \n\t"
> -                "mulpd -16(%3,%0), %%xmm3 \n\t"
> -                "addpd %%xmm4, %%xmm1 \n\t"
> -                "addpd %%xmm5, %%xmm0 \n\t"
> -                "addpd %%xmm3, %%xmm2 \n\t"
> -                "add $16, %0 \n\t"
> -                "jl 1b \n\t"
> -                "movhlps %%xmm0, %%xmm3 \n\t"
> -                "movhlps %%xmm1, %%xmm4 \n\t"
> -                "movhlps %%xmm2, %%xmm5 \n\t"
> -                "addsd %%xmm3, %%xmm0 \n\t"
> -                "addsd %%xmm4, %%xmm1 \n\t"
> -                "addsd %%xmm5, %%xmm2 \n\t"
> -                "movsd %%xmm0, (%1) \n\t"
> -                "movsd %%xmm1, 8(%1) \n\t"
> -                "movsd %%xmm2, 16(%1) \n\t"
> -                :"+&r"(i)
> -                :"r"(autoc+j), "r"(data+len), "r"(data+len-j)
> -                 NAMED_CONSTRAINTS_ARRAY_ADD(pd_1)
> -                :"memory"
> -            );
> -        } else {
> -            __asm__ volatile(
> -                "movsd "MANGLE(pd_1)", %%xmm0 \n\t"
> -                "movsd "MANGLE(pd_1)", %%xmm1 \n\t"
> -                "1: \n\t"
> -                "movapd (%3,%0), %%xmm3 \n\t"
> -                "movupd -8(%4,%0), %%xmm4 \n\t"
> -                "mulpd %%xmm3, %%xmm4 \n\t"
> -                "mulpd (%4,%0), %%xmm3 \n\t"
> -                "addpd %%xmm4, %%xmm1 \n\t"
> -                "addpd %%xmm3, %%xmm0 \n\t"
> -                "add $16, %0 \n\t"
> -                "jl 1b \n\t"
> -                "movhlps %%xmm0, %%xmm3 \n\t"
> -                "movhlps %%xmm1, %%xmm4 \n\t"
> -                "addsd %%xmm3, %%xmm0 \n\t"
> -                "addsd %%xmm4, %%xmm1 \n\t"
> -                "movsd %%xmm0, %1 \n\t"
> -                "movsd %%xmm1, %2 \n\t"
> -                :"+&r"(i), "=m"(autoc[j]), "=m"(autoc[j+1])
> -                :"r"(data+len), "r"(data+len-j)
> -                 NAMED_CONSTRAINTS_ARRAY_ADD(pd_1)
> -            );
> -        }
> -    }
> -}
> -
> -#endif /* HAVE_SSE2_INLINE */
> +void ff_lpc_compute_autocorr_sse2(const double *data, ptrdiff_t len, int lag,
> +                                  double *autoc);
> +void ff_lpc_compute_autocorr_fma3(const double *data, ptrdiff_t len, int lag,
> +                                  double *autoc);
>
>  av_cold void ff_lpc_init_x86(LPCContext *c)
>  {
>      int cpu_flags = av_get_cpu_flags();
>
> -#if HAVE_SSE2_INLINE
> -    if (INLINE_SSE2_SLOW(cpu_flags))
> -        c->lpc_compute_autocorr = lpc_compute_autocorr_sse2;
> -#endif
> +    if (EXTERNAL_SSE2(cpu_flags))
> +        c->lpc_compute_autocorr = ff_lpc_compute_autocorr_sse2;
> +
> +    if (EXTERNAL_FMA3(cpu_flags))
> +        c->lpc_compute_autocorr = ff_lpc_compute_autocorr_fma3;
>
>      if (EXTERNAL_SSE2(cpu_flags))
>          c->lpc_apply_welch_window = ff_lpc_apply_welch_window_sse2;

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
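
As a reading aid for the review above, here is a minimal scalar sketch in C of what the quoted assembly loop appears to compute. It is inferred only from the patch comments (`sum += data[i]*data[i-j]`, `autoc[j] = sum`), from `autoc_init_tab` seeding each accumulator with 1.0, and from the tail comment "If lag_p == lag fallthrough", which suggests that `lag + 1` values are written. The name `autocorr_ref` is hypothetical, and this is not claimed to match FFmpeg's own C fallback, whose loop bounds may differ.

```c
#include <stddef.h>

/* Hypothetical scalar reference (an assumption, not FFmpeg code):
 *   autoc[j] = 1.0 + sum over i = j .. len-1 of data[i] * data[i - j]
 * for j = 0 .. lag inclusive. The 1.0 seed mirrors autoc_init_tab in the
 * quoted patch. Starting at i = j keeps every read inside data[0..len-1]. */
static void autocorr_ref(const double *data, ptrdiff_t len, int lag,
                         double *autoc)
{
    for (int j = 0; j <= lag; j++) {
        double sum = 1.0;                    /* seed, as in autoc_init_tab */
        for (ptrdiff_t i = j; i < len; i++)
            sum += data[i] * data[i - j];    /* sum += data[i]*data[i-j] */
        autoc[j] = sum;                      /* autoc[j] = sum */
    }
}
```

A checkasm-style comparison against a reference of this kind, run over several `lag` values (including ones that are not multiples of 4), would be one way to exercise the tail path questioned in the review.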