> On Apr 29, 2025, at 15:58, Martin Storsjö <martin@martin.st> wrote:
>
> On Tue, 29 Apr 2025, Zhao Zhili wrote:
>
>>> On Apr 25, 2025, at 16:25, Martin Storsjö <martin@martin.st> wrote:
>>> On Tue, 15 Apr 2025, Zhao Zhili wrote:
>>>> + tbx v3.8b, {v16.16b-v17.16b}, v3.8b
>>> Is there any specific reason for preferring tbx over tbl here? (I know
>>> the existing code used tbx.) Without having studied cycle tables, I
>>> would expect tbl to maybe be slightly simpler, but perhaps there's no
>>> difference (or tbx is faster)?
>>
>> tbl can be faster. The result is quite impressive. Changed to tbl
>> before push.
>>
>>                          Before            tbx               tbl
>> hevc_sao_band_8_8_c:       252.3 ( 1.00x)    252.3 ( 1.00x)    252.3 ( 1.00x)
>> hevc_sao_band_8_8_neon:     95.8 ( 2.63x)     61.0 ( 4.14x)     61.0 ( 4.57x)
>> hevc_sao_band_16_8_c:      875.2 ( 1.00x)    864.9 ( 1.00x)    864.9 ( 1.00x)
>> hevc_sao_band_16_8_neon:   317.5 ( 2.76x)    150.0 ( 5.76x)    150.0 ( 6.26x)
>> hevc_sao_band_32_8_c:     3853.5 ( 1.00x)   3871.6 ( 1.00x)   3871.6 ( 1.00x)
>> hevc_sao_band_32_8_neon:  1222.3 ( 3.15x)    550.6 ( 7.03x)    550.6 ( 7.39x)
>> hevc_sao_band_48_8_c:     8203.6 ( 1.00x)   8182.6 ( 1.00x)   8182.6 ( 1.00x)
>> hevc_sao_band_48_8_neon:  2685.7 ( 3.05x)   1185.8 ( 6.90x)   1185.8 ( 7.36x)
>> hevc_sao_band_64_8_c:    14023.0 ( 1.00x)  14038.9 ( 1.00x)  14038.9 ( 1.00x)
>> hevc_sao_band_64_8_neon:  4783.2 ( 2.93x)   2078.4 ( 6.75x)   2078.4 ( 7.15x)
>
> The cycle numbers in the tbl and tbx columns seem to be identical here,
> while the relative speedup numbers differ - was this some sort of
> copypaste mistake in preparing the table? (The difference in speedup
> numbers does seem impressive.)

They are the same on A75, but not on A76/A77/X3.

tbl: 2 cycles for 1 or 2 table registers
tbx: 2 cycles for 1 table register, 4 cycles for 2 table registers

The code uses 2 table registers.