>>> please pad mnemonics to at least 8 columns for consistency

Okay, changed.

>>> It seems that you could just as well use vlseg2 without register stride, no?

Yes, vlseg is better; changed.

>>> Note that you could do the double versions with very little extra effort.

Okay.

>>> But really, DO NOT use a fixed vector length here. At best, you're wasting half
>>> the vector width. Your input has a variable size, use it.

Okay, changed.

>>> I'm a bit surprised that the performance improves this much, considering that
>>> the C910 is notoriously bad at both segmented and strided loads. It might be that
>>> the C version is just very bad due to lack of aliasing optimisations.

Thanks, you reminded me. Sorry, I had forgotten that there was a problem.

A few days ago, I wanted to try running some existing benchmarks:

```
tests/checkasm/checkasm --bench --test=aacpsdsp
tests/checkasm/checkasm --bench --test=alacdsp
tests/checkasm/checkasm --bench --test=audiodsp
tests/checkasm/checkasm --bench --test=g722dsp
tests/checkasm/checkasm --bench --test=vorbisdsp
tests/checkasm/checkasm --bench --test=float_dsp
tests/checkasm/checkasm --bench --test=fixed_dsp
tests/checkasm/checkasm --bench --test=af_afir
```

but they all returned 0.0. For example:

```
butterflies_float_c: 0.0
butterflies_float_rvv_f32: 0.0
scalarproduct_float_c: 0.0
scalarproduct_float_rvv_f32: 0.0
vector_dmac_scalar_c: 0.0
vector_dmac_scalar_rvv_f64: 0.0
...
```

I tried changing the -O3 in configure to -O2 or -O1, but still got 0.0.
Only by changing to -O0 did I get non-zero results.

So the benchmark I ran was built with -O0, and that is how I obtained the
initial results:

fcmul_add_c: 19.7
fcmul_add_rvv_f32: 6.7

Rémi Denis-Courmont wrote on Wed, 27 Sep 2023 at 02:44:

> On Tuesday, 26 September 2023, 21:40:12 EEST, Paul B Mahol wrote:
> > On Tue, Sep 26, 2023 at 8:35 PM Rémi Denis-Courmont wrote:
> > > On Tuesday, 26 September 2023, 12:24:58 EEST, flow gg wrote:
> > > > benchmark:
> > > > fcmul_add_c: 19.7
> > > > fcmul_add_rvv_f32: 6.7
> > >
> > > Nit: please pad mnemonics to at least 8 columns for consistency.
> > >
> > > I'm a bit surprised that the performance improves this much, considering
> > > that the C910 is notoriously bad at both segmented and strided loads. It
> > > might be that the C version is just very bad due to lack of aliasing
> > > optimisations. Oh well.
> >
> > What exactly do you mean the C version is missing?
>
> The C version does not have any restrict qualifier. This potentially
> prevents the C compiler from unrolling. Adding the keyword can yield
> performance gains of 20-30% on RISC-V scalar floating point.
>
> That said, sometimes you can't validly use restrict, and you simply can't
> tell the C compiler how to optimise properly. In those cases, even scalar
> floating-point optimisations improve performance.
>
> --
> Rémi Denis-Courmont
> http://www.remlab.net/
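
To make the restrict point in the quoted reply concrete, here is a minimal sketch in C. It is not the actual FFmpeg fcmul_add_c; the function names, the signature, and the interleaved re/im layout are assumptions chosen for illustration only. Without restrict, the compiler has to assume that a store to sum may overlap src0 or src1, which can keep it from reordering or unrolling the loop. With restrict, the caller promises the buffers do not overlap.

```c
#include <stddef.h>

/* Hypothetical scalar reference (not FFmpeg's code): complex
 * multiply-accumulate over interleaved re/im floats.
 * No restrict: sum may alias src0/src1 as far as the compiler knows. */
void fcmul_add_plain(float *sum, const float *src0,
                     const float *src1, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        const float re0 = src0[2 * i], im0 = src0[2 * i + 1];
        const float re1 = src1[2 * i], im1 = src1[2 * i + 1];

        sum[2 * i]     += re0 * re1 - im0 * im1;
        sum[2 * i + 1] += re0 * im1 + im0 * re1;
    }
}

/* Same arithmetic, but the caller promises that the three buffers
 * do not overlap, so loads and stores can be reordered freely. */
void fcmul_add_restrict(float *restrict sum, const float *restrict src0,
                        const float *restrict src1, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        const float re0 = src0[2 * i], im0 = src0[2 * i + 1];
        const float re1 = src1[2 * i], im1 = src1[2 * i + 1];

        sum[2 * i]     += re0 * re1 - im0 * im1;
        sum[2 * i + 1] += re0 * im1 + im0 * re1;
    }
}
```

Both functions compute the same result for non-overlapping inputs. Whether the restrict variant is actually faster depends on the compiler and the target, and it is only valid if no caller ever passes overlapping buffers.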