Alright, I learned a bit more. So should we not take the internal
implementation into account?
I've included the version that uses one fewer vset in this reply.

Rémi Denis-Courmont <remi@remlab.net> wrote on Sun, Jan 7, 2024 at 16:03:

> On Sunday, 7 January 2024, 3:33:39 EET, flow gg wrote:
> > I tested it, and indeed using vwsub is faster. Updated it in the reply.
> >
> > ---
> >
> > I have a question: if I tweak the load order a bit so that it uses one
> > fewer vset, it gets slower (the patch I submitted takes 13.2; with the
> > following change, it takes 15.2).
> > But I thought it would be faster.
>
> I would guess that v0 is needed before v8 in the internal implementation
> of vwsub. This kind of makes sense, as the elements still need to be
> sign-extended. Thus vwsub ends up stalling the pipeline while waiting for
> vle8 to complete. That's just a guess though, as I don't have internal
> cycle timing documentation.
>
> > - vsetvli      t0, a2, e8, m2, tu, ma
> > - vle8.v       v0, (a0)
> > - sub          a2, a2, t0
> > - vsetvli      zero, t0, e16, m4, tu, ma
> > - vle16.v      v8, (a1)
> > - vsetvli      zero, t0, e8, m2, tu, ma
> > - vwsub.wv     v16, v8, v0
> >
> > + vsetvli      t0, a2, e16, m4, tu, ma
> > + vle16.v      v8, (a1)
> > + sub          a2, a2, t0
> > + vsetvli      zero, t0, e8, m2, tu, ma
> > + vle8.v       v0, (a0)
> > + vwsub.wv     v16, v8, v0
>
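
For reference, here is how I understand what vwsub.wv computes at e8,
written as a scalar C model. This is only a sketch: the function and
parameter names are made up for illustration and are not FFmpeg code, and
it says nothing about the timing inside the hardware. It just shows that
the 8-bit operand (v0 above) is the one that has to be sign-extended
before the subtraction.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of "vwsub.wv vd, vs2, vs1" with SEW=8:
 * vs2 and vd hold 16-bit elements, vs1 holds 8-bit elements. */
static void vwsub_wv_model(int16_t *vd, const int16_t *vs2,
                           const int8_t *vs1, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        /* Sign-extend the narrow operand from 8 to 16 bits. */
        int16_t widened = (int16_t)vs1[i];
        /* Subtract at 16 bits; the cast truncates the result back to
         * 16 bits (wraps on common compilers). */
        vd[i] = (int16_t)(vs2[i] - widened);
    }
}
```

With vs2 = {100, -100} and vs1 = {-1, 127}, this gives vd = {101, -227}.
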
> --
> 雷米‧德尼-库尔蒙
> http://www.remlab.net/
>
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel