[FFmpeg-devel] [PATCH v2 3/5] libswscale: Avx2 hscale can process inputs of any size.

Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

* [FFmpeg-devel] [PATCH v2 3/5] libswscale: Avx2 hscale can process inputs of any size.
@ 2022-02-17 10:04 Alan Kelly
  2022-02-17 16:21 ` Michael Niedermayer
  2022-04-22 17:42 ` Michael Niedermayer
  0 siblings, 2 replies; 9+ messages in thread
From: Alan Kelly @ 2022-02-17 10:04 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Alan Kelly

The main loop processes blocks of 16 pixels. The tail processes blocks
of size 4.
---
 libswscale/x86/scale_avx2.asm | 48 +++++++++++++++++++++++++++++++++--
 1 file changed, 46 insertions(+), 2 deletions(-)

diff --git a/libswscale/x86/scale_avx2.asm b/libswscale/x86/scale_avx2.asm
index 20acdbd633..dc42abb100 100644
--- a/libswscale/x86/scale_avx2.asm
+++ b/libswscale/x86/scale_avx2.asm
@@ -53,6 +53,9 @@ cglobal hscale8to15_%1, 7, 9, 16, pos0, dst, w, srcmem, filter, fltpos, fltsize,
     mova m14, [four]
     shr fltsized, 2
 %endif
+    cmp wq, 16
+    jl .tail_loop
+    mov countq, 0x10
 .loop:
     movu m1, [fltposq]
     movu m2, [fltposq+32]
@@ -97,11 +100,52 @@ cglobal hscale8to15_%1, 7, 9, 16, pos0, dst, w, srcmem, filter, fltpos, fltsize,
     vpsrad  m6, 7
     vpackssdw m5, m5, m6
     vpermd m5, m15, m5
-    vmovdqu [dstq + countq * 2], m5
+    vmovdqu [dstq], m5
+    add dstq, 0x20
     add fltposq, 0x40
     add countq, 0x10
     cmp countq, wq
-    jl .loop
+    jle .loop
+
+    sub countq, 0x10
+    cmp countq, wq
+    jge .end
+
+.tail_loop:
+    movu xm1, [fltposq]
+%ifidn %1, X4
+    pxor xm9, xm9
+    pxor xm10, xm10
+    xor innerq, innerq
+.tail_innerloop:
+%endif
+    vpcmpeqd  xm13, xm13
+    vpgatherdd xm3,[srcmemq + xm1], xm13
+    vpunpcklbw xm5, xm3, xm0
+    vpunpckhbw xm6, xm3, xm0
+    vpmaddwd xm5, xm5, [filterq]
+    vpmaddwd xm6, xm6, [filterq + 16]
+    add filterq, 0x20
+%ifidn %1, X4
+    paddd xm9, xm5
+    paddd xm10, xm6
+    paddd xm1, xm14
+    add innerq, 1
+    cmp innerq, fltsizeq
+    jl .tail_innerloop
+    vphaddd xm5, xm9, xm10
+%else
+    vphaddd xm5, xm5, xm6
+%endif
+    vpsrad  xm5, 7
+    vpackssdw xm5, xm5, xm5
+    vmovq [dstq], xm5
+    add dstq, 0x8
+    add fltposq, 0x10
+    add countq, 0x4
+    cmp countq, wq
+    jl .tail_loop
+.end:
 REP_RET
 %endmacro
 
-- 
2.35.1.265.g69c8d7142f-goog

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".


* Re: [FFmpeg-devel] [PATCH v2 3/5] libswscale: Avx2 hscale can process inputs of any size.
  2022-02-17 10:04 [FFmpeg-devel] [PATCH v2 3/5] libswscale: Avx2 hscale can process inputs of any size Alan Kelly
@ 2022-02-17 16:21 ` Michael Niedermayer
  2022-03-07 15:27   ` Alan Kelly
  2022-04-22 17:42 ` Michael Niedermayer
  1 sibling, 1 reply; 9+ messages in thread
From: Michael Niedermayer @ 2022-02-17 16:21 UTC (permalink / raw)
  To: FFmpeg development discussions and patches


[-- Attachment #1.1: Type: text/plain, Size: 794 bytes --]

On Thu, Feb 17, 2022 at 11:04:04AM +0100, Alan Kelly wrote:
> The main loop processes blocks of 16 pixels. The tail processes blocks
> of size 4.
> ---
>  libswscale/x86/scale_avx2.asm | 48 +++++++++++++++++++++++++++++++++--
>  1 file changed, 46 insertions(+), 2 deletions(-)

I'll wait a few days on this; there are people here who know AVX2 better than I do.
It's been a while since I wrote x86 SIMD,
but if no one else reviews this, then I will.

thx


[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

"You are 36 times more likely to die in a bathtub than at the hands of a
terrorist. Also, you are 2.5 times more likely to become a president and
2 times more likely to become an astronaut, than to die in a terrorist
attack." -- Thoughty2


[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]


* Re: [FFmpeg-devel] [PATCH v2 3/5] libswscale: Avx2 hscale can process inputs of any size.
  2022-02-17 16:21 ` Michael Niedermayer
@ 2022-03-07 15:27   ` Alan Kelly
  2022-04-22  8:02     ` Alan Kelly
  0 siblings, 1 reply; 9+ messages in thread
From: Alan Kelly @ 2022-03-07 15:27 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

Hi Michael,

Thanks for reviewing the first two parts of this patchset.

Is there anybody interested in reviewing this part?

Thanks,

Alan


* Re: [FFmpeg-devel] [PATCH v2 3/5] libswscale: Avx2 hscale can process inputs of any size.
  2022-03-07 15:27   ` Alan Kelly
@ 2022-04-22  8:02     ` Alan Kelly
  2022-04-22 14:53       ` Paul B Mahol
  0 siblings, 1 reply; 9+ messages in thread
From: Alan Kelly @ 2022-04-22  8:02 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

Hi,

Is anyone interested in this patch? This makes AVX2 hscale work on all
input sizes.

Thanks,

Alan


* Re: [FFmpeg-devel] [PATCH v2 3/5] libswscale: Avx2 hscale can process inputs of any size.
  2022-04-22  8:02     ` Alan Kelly
@ 2022-04-22 14:53       ` Paul B Mahol
  0 siblings, 0 replies; 9+ messages in thread
From: Paul B Mahol @ 2022-04-22 14:53 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

On Fri, Apr 22, 2022 at 10:03 AM Alan Kelly <
alankelly-at-google.com@ffmpeg.org> wrote:

> Hi,
>
> Is anyone interested in this patch? This makes AVX2 hscale work on all
> input sizes.
>
>
I can apply this if you confirm it works with any image size and on
unaligned data.



* Re: [FFmpeg-devel] [PATCH v2 3/5] libswscale: Avx2 hscale can process inputs of any size.
  2022-02-17 10:04 [FFmpeg-devel] [PATCH v2 3/5] libswscale: Avx2 hscale can process inputs of any size Alan Kelly
  2022-02-17 16:21 ` Michael Niedermayer
@ 2022-04-22 17:42 ` Michael Niedermayer
  2022-04-26  7:45   ` Alan Kelly
  1 sibling, 1 reply; 9+ messages in thread
From: Michael Niedermayer @ 2022-04-22 17:42 UTC (permalink / raw)
  To: FFmpeg development discussions and patches


[-- Attachment #1.1: Type: text/plain, Size: 2770 bytes --]

On Thu, Feb 17, 2022 at 11:04:04AM +0100, Alan Kelly wrote:
> The main loop processes blocks of 16 pixels. The tail processes blocks
> of size 4.
> ---
>  libswscale/x86/scale_avx2.asm | 48 +++++++++++++++++++++++++++++++++--
>  1 file changed, 46 insertions(+), 2 deletions(-)
> 
> diff --git a/libswscale/x86/scale_avx2.asm b/libswscale/x86/scale_avx2.asm
> index 20acdbd633..dc42abb100 100644
> --- a/libswscale/x86/scale_avx2.asm
> +++ b/libswscale/x86/scale_avx2.asm
> @@ -53,6 +53,9 @@ cglobal hscale8to15_%1, 7, 9, 16, pos0, dst, w, srcmem, filter, fltpos, fltsize,
>      mova m14, [four]
>      shr fltsized, 2
>  %endif
> +    cmp wq, 16
> +    jl .tail_loop
> +    mov countq, 0x10
>  .loop:
>      movu m1, [fltposq]
>      movu m2, [fltposq+32]
> @@ -97,11 +100,52 @@ cglobal hscale8to15_%1, 7, 9, 16, pos0, dst, w, srcmem, filter, fltpos, fltsize,
>      vpsrad  m6, 7
>      vpackssdw m5, m5, m6
>      vpermd m5, m15, m5
> -    vmovdqu [dstq + countq * 2], m5
> +    vmovdqu [dstq], m5
> +    add dstq, 0x20
>      add fltposq, 0x40
>      add countq, 0x10
>      cmp countq, wq
> -    jl .loop
> +    jle .loop
> +
> +    sub countq, 0x10
> +    cmp countq, wq
> +    jge .end
> +
> +.tail_loop:
> +    movu xm1, [fltposq]
> +%ifidn %1, X4
> +    pxor xm9, xm9
> +    pxor xm10, xm10
> +    xor innerq, innerq
> +.tail_innerloop:
> +%endif
> +    vpcmpeqd  xm13, xm13
> +    vpgatherdd xm3,[srcmemq + xm1], xm13
> +    vpunpcklbw xm5, xm3, xm0
> +    vpunpckhbw xm6, xm3, xm0
> +    vpmaddwd xm5, xm5, [filterq]
> +    vpmaddwd xm6, xm6, [filterq + 16]
> +    add filterq, 0x20
> +%ifidn %1, X4
> +    paddd xm9, xm5
> +    paddd xm10, xm6
> +    paddd xm1, xm14
> +    add innerq, 1
> +    cmp innerq, fltsizeq
> +    jl .tail_innerloop
> +    vphaddd xm5, xm9, xm10
> +%else
> +    vphaddd xm5, xm5, xm6
> +%endif
> +    vpsrad  xm5, 7
> +    vpackssdw xm5, xm5, xm5
> +    vmovq [dstq], xm5
> +    add dstq, 0x8
> +    add fltposq, 0x10
> +    add countq, 0x4
> +    cmp countq, wq
> +    jl .tail_loop
> +.end:
>  REP_RET
>  %endmacro

countq is only used as a counter after this.
If you count against 0, this reduces the instructions in the loop from
add/cmp to just add.
Similarly, the previously used [dstq + countq * 2] addressing avoids an add.
Can you comment on the performance impact of these changes?
On previous generations of CPUs this would generally have been slower.
I haven't really optimized asm for current CPUs, so these comments might
not apply today, but no one else seems to be reviewing this.
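
The count-against-zero idea can be illustrated in C. This is only a sketch of the indexing pattern, not code from the patch: the function names are hypothetical, and a C compiler is free to emit either loop form. In hand-written asm, the second form lets the add that advances the index also set the flags for the loop branch, so the separate cmp is no longer needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Conventional form: the index counts up and is compared with the width,
 * which in asm is an add, a cmp and a branch per iteration. */
static void copy_count_up(int16_t *dst, const int16_t *src, ptrdiff_t w)
{
    for (ptrdiff_t i = 0; i < w; i++)
        dst[i] = src[i];
}

/* Count-against-zero form: bias both pointers by w once, then run the
 * index from -w up to 0. The loop ends when the index reaches zero. */
static void copy_count_to_zero(int16_t *dst, const int16_t *src, ptrdiff_t w)
{
    assert(w >= 0);
    dst += w;                      /* one-past-the-end pointers */
    src += w;
    for (ptrdiff_t i = -w; i != 0; i++)
        dst[i] = src[i];           /* negative index reaches back into the buffer */
}
```

Both functions write the same w elements; only the loop bookkeeping differs.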

thx

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

The worst form of inequality is to try to make unequal things equal.
-- Aristotle

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]


* Re: [FFmpeg-devel] [PATCH v2 3/5] libswscale: Avx2 hscale can process inputs of any size.
  2022-04-22 17:42 ` Michael Niedermayer
@ 2022-04-26  7:45   ` Alan Kelly
  2022-04-26  8:00     ` Alan Kelly
  0 siblings, 1 reply; 9+ messages in thread
From: Alan Kelly @ 2022-04-26  7:45 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

Hi Paul, Michael,

Thanks for your responses and review.

Paul: I have tested this on many input sizes (primes, etc.) and it works. It
has the same restriction as the original SSSE3 version: it processes pixels
in blocks of 4. However, sufficient memory is allocated so that reads and
writes are safe. All loads and stores are unaligned, and I have tested it
with unaligned pointers to verify that it works.
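
The block structure described above can be modelled in scalar C. This is a sketch under assumptions, not the asm from the patch: the function names are hypothetical, the per-pixel arithmetic (multiply-accumulate, shift right by 7, clamp to 15 bits) is a simplified model of an 8-bit to 15-bit horizontal scale, and callers are assumed to pad dst, filter and filter_pos to a multiple of 4 entries so that the tail may run past w.

```c
#include <assert.h>
#include <stdint.h>

/* One output pixel: multiply-accumulate, shift by 7, clamp to 15 bits. */
static int16_t hscale_pixel(const uint8_t *src, const int16_t *filter,
                            const int32_t *filter_pos, int filter_size, int i)
{
    int val = 0;
    for (int j = 0; j < filter_size; j++)
        val += src[filter_pos[i] + j] * filter[filter_size * i + j];
    val >>= 7;
    return (int16_t)(val > 32767 ? 32767 : val);
}

/* Plain per-pixel loop, used as the reference. */
static void hscale_ref(int16_t *dst, int w, const uint8_t *src,
                       const int16_t *filter, const int32_t *filter_pos,
                       int filter_size)
{
    for (int i = 0; i < w; i++)
        dst[i] = hscale_pixel(src, filter, filter_pos, filter_size, i);
}

/* Blocked model: full blocks of 16 first, then blocks of 4 until w is
 * covered. The last tail block may write up to 3 entries past w, which
 * is why the buffers must be padded. */
static void hscale_blocked(int16_t *dst, int w, const uint8_t *src,
                           const int16_t *filter, const int32_t *filter_pos,
                           int filter_size)
{
    int count = 0;
    assert(w >= 0);
    for (; count + 16 <= w; count += 16)        /* main loop */
        for (int k = 0; k < 16; k++)
            dst[count + k] = hscale_pixel(src, filter, filter_pos,
                                          filter_size, count + k);
    for (; count < w; count += 4)               /* tail loop */
        for (int k = 0; k < 4; k++)
            dst[count + k] = hscale_pixel(src, filter, filter_pos,
                                          filter_size, count + k);
}
```

Comparing hscale_blocked against hscale_ref over a range of widths, including primes, is one way to check that the first w outputs agree while the extra writes stay inside the padding.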

Michael: The performance impact of this change is that it enables AVX2
hscale to process all input sizes instead of only sizes that are
divisible by 16. For example, an input of size 513 with a filter size of 4 is
processed 35% faster (506 vs 771). Thanks for the tip on countq; one add
has been removed from each loop.
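
To make these numbers concrete, the sketch below works out how a width splits into main and tail blocks and what relative reduction the two figures imply. The helper names are hypothetical, and the thread does not state the units of 506 and 771, so they are treated here as two measurements in the same unit where lower is better.

```c
#include <assert.h>

struct width_split {
    int main_blocks;   /* full blocks of 16 handled by the main loop */
    int tail_blocks;   /* blocks of 4 handled by the tail loop */
    int padded_w;      /* number of outputs actually written */
};

/* Split a width into 16-pixel main blocks and 4-pixel tail blocks. */
static struct width_split split_width(int w)
{
    struct width_split s;
    int rem;
    assert(w >= 0);
    s.main_blocks = w / 16;
    rem = w - 16 * s.main_blocks;
    s.tail_blocks = (rem + 3) / 4;              /* round up to whole blocks */
    s.padded_w = 16 * s.main_blocks + 4 * s.tail_blocks;
    return s;
}

/* Relative reduction of a lower-is-better measurement, in percent. */
static double percent_reduction(double before, double after)
{
    return 100.0 * (before - after) / before;
}
```

For a width of 513 this gives 32 main blocks and 1 tail block, so 516 outputs are written, and going from 771 to 506 is a reduction of roughly 34%.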

Alan



* [FFmpeg-devel] [PATCH v2 3/5] libswscale: Avx2 hscale can process inputs of any size.
  2022-04-26  7:45   ` Alan Kelly
@ 2022-04-26  8:00     ` Alan Kelly
  2022-07-13  8:04       ` Alan Kelly
  0 siblings, 1 reply; 9+ messages in thread
From: Alan Kelly @ 2022-04-26  8:00 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Alan Kelly

The main loop processes blocks of 16 pixels. The tail processes blocks
of size 4.
---
 libswscale/x86/scale_avx2.asm | 44 ++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/libswscale/x86/scale_avx2.asm b/libswscale/x86/scale_avx2.asm
index 20acdbd633..7657b2825f 100644
--- a/libswscale/x86/scale_avx2.asm
+++ b/libswscale/x86/scale_avx2.asm
@@ -53,6 +53,9 @@ cglobal hscale8to15_%1, 7, 9, 16, pos0, dst, w, srcmem, filter, fltpos, fltsize,
     mova m14, [four]
     shr fltsized, 2
 %endif
+    cmp wq, 16
+    jl .tail_loop
+    sub wq, 0x10
 .loop:
     movu m1, [fltposq]
     movu m2, [fltposq+32]
@@ -101,7 +104,46 @@ cglobal hscale8to15_%1, 7, 9, 16, pos0, dst, w, srcmem, filter, fltpos, fltsize,
     add fltposq, 0x40
     add countq, 0x10
     cmp countq, wq
-    jl .loop
+    jle .loop
+
+    add wq, 0x10
+    cmp countq, wq
+    jge .end
+
+.tail_loop:
+    movu xm1, [fltposq]
+%ifidn %1, X4
+    pxor xm9, xm9
+    pxor xm10, xm10
+    xor innerq, innerq
+.tail_innerloop:
+%endif
+    vpcmpeqd  xm13, xm13
+    vpgatherdd xm3,[srcmemq + xm1], xm13
+    vpunpcklbw xm5, xm3, xm0
+    vpunpckhbw xm6, xm3, xm0
+    vpmaddwd xm5, xm5, [filterq]
+    vpmaddwd xm6, xm6, [filterq + 16]
+    add filterq, 0x20
+%ifidn %1, X4
+    paddd xm9, xm5
+    paddd xm10, xm6
+    paddd xm1, xm14
+    add innerq, 1
+    cmp innerq, fltsizeq
+    jl .tail_innerloop
+    vphaddd xm5, xm9, xm10
+%else
+    vphaddd xm5, xm5, xm6
+%endif
+    vpsrad  xm5, 7
+    vpackssdw xm5, xm5, xm5
+    vmovq [dstq + countq * 2], xm5
+    add fltposq, 0x10
+    add countq, 0x4
+    cmp countq, wq
+    jl .tail_loop
+.end:
 REP_RET
 %endmacro
 
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


* Re: [FFmpeg-devel] [PATCH v2 3/5] libswscale: Avx2 hscale can process inputs of any size.
  2022-04-26  8:00     ` Alan Kelly
@ 2022-07-13  8:04       ` Alan Kelly
  0 siblings, 0 replies; 9+ messages in thread
From: Alan Kelly @ 2022-07-13  8:04 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

Hi,

Are there any further comments on this patch or can it be committed?

Thanks,

Alan


