Alright, I've learned a bit more. So should we not take the internal
implementation into account? I've added the version that uses one fewer vset
to this reply.

On Sunday, January 7, 2024 at 16:03, Rémi Denis-Courmont wrote:

> On Sunday, 7 January 2024, 3:33:39 EET, flow gg wrote:
> > I tested it, and indeed using vwsub is faster. Updated it in the reply.
> >
> > ---
> >
> > I have a question: if I tweak the load order a bit, using one fewer vset,
> > it ends up slower (the patch I submitted takes 13.2; with the following
> > change, the time would be 15.2).
> > But I thought it would be faster.
>
> I would guess that v0 is needed before v8 in the internal implementation of
> vwsub. This kind of makes sense, as the elements still need to be
> sign-extended.
> Thus vwsub ends up stalling the pipeline while waiting for vle8 to complete.
> That's just a guess though, as I don't have internal cycle timing
> documentation.
>
> > -        vsetvli     t0, a2, e8, m2, tu, ma
> > -        vle8.v      v0, (a0)
> > -        sub         a2, a2, t0
> > -        vsetvli     zero, t0, e16, m4, tu, ma
> > -        vle16.v     v8, (a1)
> > -        vsetvli     zero, t0, e8, m2, tu, ma
> > -        vwsub.wv    v16, v8, v0
> >
> > +        vsetvli     t0, a2, e16, m4, tu, ma
> > +        vle16.v     v8, (a1)
> > +        sub         a2, a2, t0
> > +        vsetvli     zero, t0, e8, m2, tu, ma
> > +        vle8.v      v0, (a0)
> > +        vwsub.wv    v16, v8, v0
>
> --
> 雷米‧德尼-库尔蒙
> http://www.remlab.net/
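
For reference, here is a rough scalar sketch in C of what the vwsub.wv step
computes per element, which may help when checking either load order against
expected output. It assumes a1 points to 16-bit values and a0 to signed 8-bit
values, as the vle16.v / vle8.v loads suggest. The function name
widening_sub_ref and its parameters are hypothetical and not part of the
patch; this only models the arithmetic, not the timing.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for the vwsub.wv step quoted above:
 * dst[i] = wide[i] - sign_extend(narrow[i]), truncated to 16 bits.
 * "wide" stands in for v8 (e16 elements), "narrow" for v0 (e8 elements). */
static void widening_sub_ref(int16_t *dst, const int16_t *wide,
                             const int8_t *narrow, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* int8_t is sign-extended when promoted to int. */
        int diff = (int)wide[i] - (int)narrow[i];

        /* Keep the low 16 bits, as the vector instruction does. The final
         * conversion to int16_t is implementation-defined for out-of-range
         * values, but wraps on common compilers. */
        dst[i] = (int16_t)(uint16_t)diff;
    }
}
```

Both orderings in the diff should produce the same values as this loop; only
the position of the loads relative to vwsub.wv differs.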