From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100])
	by master.gitmailbox.com (Postfix) with ESMTP id D234D42B5B
	for <ffmpegdev@gitmailbox.com>; Tue, 26 Apr 2022 07:45:46 +0000 (UTC)
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 55B7168B272;
	Tue, 26 Apr 2022 10:45:44 +0300 (EEST)
Received: from mail-wr1-f43.google.com (mail-wr1-f43.google.com
 [209.85.221.43])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 9B16568A56F
 for <ffmpeg-devel@ffmpeg.org>; Tue, 26 Apr 2022 10:45:37 +0300 (EEST)
Received: by mail-wr1-f43.google.com with SMTP id v12so17357179wrv.10
 for <ffmpeg-devel@ffmpeg.org>; Tue, 26 Apr 2022 00:45:37 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112;
 h=mime-version:references:in-reply-to:from:date:message-id:subject:to;
 bh=jhiLQ1qkrUpbb/+Id7y4QPtHk1ImKvfrC8lQAFRbOpo=;
 b=cJ/pbFaj2MLxF9/JL6m/+aGBhTAnLOFUeWjKWulWLOOsdmE6MP2/r9ZPa9jdxuebh6
 Q0OymN4ZIaPtJ7PJRAvF8zdB1Mp7eCPTLRGFvSyDklcewebCFwFEQpZY+/D3qP6/ytJP
 U/TVX4+yOlv6TKu2mUJkV+TJP5CBr1X4sonrDUA22ImPZLFY/ZDUdzYFfj2h2COFvAwn
 22x5l9c2PWkFuUAeEmvtjZIb14bKJHtafpXN9uxHJtIwoXgejZvWwUOFUkyqta9Chbb6
 9+c1EpGOWPetGSHs71b8EK9dZ0HkDJPJl29IRnrhRIUK17lAOlV0zxyRuCcutYyg9KdN
 2ZGw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to;
 bh=jhiLQ1qkrUpbb/+Id7y4QPtHk1ImKvfrC8lQAFRbOpo=;
 b=e/KP2dMyTyDBwZSBxWLlZiVJl9a5BwVNyosJxPlXkoNAQ9J19Cn4SZtJKGf7fTaXeA
 5SMpka1k8pld2QLVLOgJw+PJdSzImt1Q5tWGl6jga3ggiIxtwfTpM0EIRdI+6jjvwi5Z
 Oqjt2gQHia/IlOs8iZXLHaWTj3IsSS1SsnaI5BMtOO3ShOh02aDSQJbXKzy6I3rjfMv1
 MP/XMWGlBj2luUU5QgDwiYGOnrkQO11mbseUJayVOWGz+w4gRfbb5gYAXTLOsiE/Y/t9
 hlYCG0JFX3ziZql39mIoigizrbQykSdjkes2rbyTs5ylkv6hi/R7xAt7yFBluU4JD30g
 nuLQ==
X-Gm-Message-State: AOAM530HJKxAOXjJsvybf7zru09x6nlfgEflHtSYsvn+vBowlXBMDYUf
 3eRMLvUgKuibWl39Znl1eQY681GWZYS15EJi2zcyHpb6KJU=
X-Google-Smtp-Source: ABdhPJzB00yP0OawwkIPaekOPqoQb+rBM1pr4CxGWvXejMG+YUg735QDFaipoRe4faEAWEwjz53iLTgxZaS3ZJ8KjZk=
X-Received: by 2002:a5d:498c:0:b0:20a:db70:a246 with SMTP id
 r12-20020a5d498c000000b0020adb70a246mr6887605wrq.392.1650959136210; Tue, 26
 Apr 2022 00:45:36 -0700 (PDT)
MIME-Version: 1.0
References: <20220217100404.1112755-1-alankelly@google.com>
 <20220422174240.GV2829255@pb2>
In-Reply-To: <20220422174240.GV2829255@pb2>
From: Alan Kelly <alankelly-at-google.com@ffmpeg.org>
Date: Tue, 26 Apr 2022 09:45:24 +0200
Message-ID: <CABaBdZq6f1g+==2qrHwubTSU2JibbOVkDjyLdOmxkLp-Kb_CNQ@mail.gmail.com>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
X-Content-Filtered-By: Mailman/MimeDel 2.1.29
Subject: Re: [FFmpeg-devel] [PATCH v2 3/5] libswscale: Avx2 hscale can
 process inputs of any size.
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
Archived-At: <https://master.gitmailbox.com/ffmpegdev/CABaBdZq6f1g+==2qrHwubTSU2JibbOVkDjyLdOmxkLp-Kb_CNQ@mail.gmail.com/>
List-Archive: <https://master.gitmailbox.com/ffmpegdev/>
List-Post: <mailto:ffmpegdev@gitmailbox.com>

Hi Paul, Michael,

Thanks for your responses and review.

Paul: I have tested this on many input sizes (primes etc.) and it works. It
has the same restriction as the original SSSE3 version: it processes pixels
in blocks of 4. However, sufficient memory is allocated so that reads and
writes are safe. All loads and stores are unaligned, and I have tested it
with unaligned pointers to verify that it works.
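
To illustrate the block-of-4 restriction, here is a minimal C sketch. It is
not code from the patch, and pixels_written and overwrite_slack are
hypothetical helper names. It assumes every store covers 4 output pixels, so
a row of width w is rounded up to the next multiple of 4, and the difference
is the slack the buffers must provide:

```c
#include <assert.h>

/* Hypothetical helper, not part of libswscale: number of output pixels
 * stored for a row of width w when every store covers 4 pixels. */
static int pixels_written(int w)
{
    assert(w >= 0);
    return (w + 3) & ~3;   /* round up to the next multiple of 4 */
}

/* Hypothetical helper: pixels stored beyond the requested width, i.e. the
 * slack the destination buffer must provide past the end of the row. */
static int overwrite_slack(int w)
{
    return pixels_written(w) - w;
}
```

For a width of 513 this gives 516 pixels stored, i.e. 3 pixels of slack
past the end of the row.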

Michael: The performance impact of this change is that it enables
processing of all input sizes by avx2 hscale instead of only sizes that are
divisible by 16. For example, an input of width 513 with a filter size of 4
is processed 35% faster (506 vs 771). Thanks for the tip on countq; one add
has been removed from each loop.
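
As a rough model of the loop structure (a C sketch under stated
assumptions, not the assembly itself; LoopSplit, split_width and
iterations_counting_to_zero are hypothetical names), the width is consumed
in blocks of 16 by the main loop and in blocks of 4 by the tail. The second
function sketches the count-against-zero idea: the counter starts at -n and
is incremented until it reaches 0.

```c
#include <assert.h>

/* Hypothetical model of the control flow: a main loop that consumes
 * 16 pixels per iteration and a tail loop that consumes 4. */
typedef struct {
    int main_iters;  /* iterations of the 16-pixel loop */
    int tail_iters;  /* iterations of the 4-pixel tail loop */
} LoopSplit;

static LoopSplit split_width(int w)
{
    LoopSplit s = { 0, 0 };
    int count = 0;

    assert(w >= 0);
    /* Main loop: runs only while a full block of 16 remains. */
    while (w - count >= 16) {
        count += 16;
        s.main_iters++;
    }
    /* Tail loop: blocks of 4; the last block may run past w. */
    while (count < w) {
        count += 4;
        s.tail_iters++;
    }
    return s;
}

/* Counting against zero: start at -n and add the step until the counter
 * is no longer negative. In assembly the add sets the flags, so no
 * separate cmp against the width is needed; C cannot express that
 * directly, this only mirrors the iteration count. */
static int iterations_counting_to_zero(int n, int step)
{
    int iters = 0;

    assert(n >= 0 && step > 0);
    for (int count = -n; count < 0; count += step)
        iters++;
    return iters;
}
```

For a width of 513, split_width gives 32 iterations of the 16-pixel loop
and 1 iteration of the tail loop.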

Alan


On Fri, Apr 22, 2022 at 7:43 PM Michael Niedermayer <michael@niedermayer.cc>
wrote:

> On Thu, Feb 17, 2022 at 11:04:04AM +0100, Alan Kelly wrote:
> > The main loop processes blocks of 16 pixels. The tail processes blocks
> > of size 4.
> > ---
> >  libswscale/x86/scale_avx2.asm | 48 +++++++++++++++++++++++++++++++++--
> >  1 file changed, 46 insertions(+), 2 deletions(-)
> >
> > diff --git a/libswscale/x86/scale_avx2.asm b/libswscale/x86/scale_avx2.asm
> > index 20acdbd633..dc42abb100 100644
> > --- a/libswscale/x86/scale_avx2.asm
> > +++ b/libswscale/x86/scale_avx2.asm
> > @@ -53,6 +53,9 @@ cglobal hscale8to15_%1, 7, 9, 16, pos0, dst, w, srcmem, filter, fltpos, fltsize,
> >      mova m14, [four]
> >      shr fltsized, 2
> >  %endif
> > +    cmp wq, 16
> > +    jl .tail_loop
> > +    mov countq, 0x10
> >  .loop:
> >      movu m1, [fltposq]
> >      movu m2, [fltposq+32]
> > @@ -97,11 +100,52 @@ cglobal hscale8to15_%1, 7, 9, 16, pos0, dst, w, srcmem, filter, fltpos, fltsize,
> >      vpsrad  m6, 7
> >      vpackssdw m5, m5, m6
> >      vpermd m5, m15, m5
> > -    vmovdqu [dstq + countq * 2], m5
> > +    vmovdqu [dstq], m5
> > +    add dstq, 0x20
> >      add fltposq, 0x40
> >      add countq, 0x10
> >      cmp countq, wq
> > -    jl .loop
> > +    jle .loop
> > +
> > +    sub countq, 0x10
> > +    cmp countq, wq
> > +    jge .end
> > +
> > +.tail_loop:
> > +    movu xm1, [fltposq]
> > +%ifidn %1, X4
> > +    pxor xm9, xm9
> > +    pxor xm10, xm10
> > +    xor innerq, innerq
> > +.tail_innerloop:
> > +%endif
> > +    vpcmpeqd  xm13, xm13
> > +    vpgatherdd xm3,[srcmemq + xm1], xm13
> > +    vpunpcklbw xm5, xm3, xm0
> > +    vpunpckhbw xm6, xm3, xm0
> > +    vpmaddwd xm5, xm5, [filterq]
> > +    vpmaddwd xm6, xm6, [filterq + 16]
> > +    add filterq, 0x20
> > +%ifidn %1, X4
> > +    paddd xm9, xm5
> > +    paddd xm10, xm6
> > +    paddd xm1, xm14
> > +    add innerq, 1
> > +    cmp innerq, fltsizeq
> > +    jl .tail_innerloop
> > +    vphaddd xm5, xm9, xm10
> > +%else
> > +    vphaddd xm5, xm5, xm6
> > +%endif
> > +    vpsrad  xm5, 7
> > +    vpackssdw xm5, xm5, xm5
> > +    vmovq [dstq], xm5
> > +    add dstq, 0x8
> > +    add fltposq, 0x10
> > +    add countq, 0x4
> > +    cmp countq, wq
> > +    jl .tail_loop
> > +.end:
> >  REP_RET
> >  %endmacro
>
> countq is only used as a counter after this.
> If you count against 0, this reduces the instructions in the loop from
> add/cmp to just add.
> Similarly, the previously used [dstq + countq * 2] avoids an add.
> Can you comment on the performance impact of these changes?
> On previous generations of CPUs this would have been generally slower.
> I haven't really optimized ASM for current CPUs, so these comments might
> not apply today, but no one else seems to be reviewing this.
>
> thx
>
> [...]
> --
> Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>
> The worst form of inequality is to try to make unequal things equal.
> -- Aristotle
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
>