* [FFmpeg-devel] [RFC/PATCH] bitpacked_dec: Optimization for bitpacked_dec decoder performance
@ 2023-05-05 21:54 Devin Heitmueller
  2023-05-06 11:32 ` Lance Wang
  0 siblings, 1 reply; 13+ messages in thread
From: Devin Heitmueller @ 2023-05-05 21:54 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Devin Heitmueller

Rework the code a bit to speed up the 10-bit bitpacked decoding
routine.  This is probably about as fast as I can get it without
switching to assembly language.

Demonstratable with:

./ffmpeg -f lavfi -i "smptehdbars=size=3840x2160" -c bitpacked -f image2 -frames:v 1 source.yuv
./ffmpeg -f bitpacked -pix_fmt yuv422p10le -s 3840x2160 -c:v bitpacked -i source.yuv -pix_fmt yuv422p10le out.yuv

On my development system, it went from 80ms for a 2160p frame
down to 20ms (i.e. a 4X speedup).  Good enough for now, I hope...

Signed-off-by: Devin Heitmueller <dheitmueller@ltnglobal.com>
---
 libavcodec/bitpacked_dec.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/libavcodec/bitpacked_dec.c b/libavcodec/bitpacked_dec.c
index a1ffef1..96aba27 100644
--- a/libavcodec/bitpacked_dec.c
+++ b/libavcodec/bitpacked_dec.c
@@ -28,7 +28,6 @@
 
 #include "avcodec.h"
 #include "codec_internal.h"
-#include "get_bits.h"
 #include "libavutil/imgutils.h"
 #include "thread.h"
 
@@ -65,7 +64,7 @@ static int bitpacked_decode_yuv422p10(AVCodecContext *avctx, AVFrame *frame,
 {
     uint64_t frame_size = (uint64_t)avctx->width * (uint64_t)avctx->height * 20;
     uint64_t packet_size = (uint64_t)avpkt->size * 8;
-    GetBitContext bc;
+    uint8_t *src;
     uint16_t *y, *u, *v;
     int ret, i, j;
 
@@ -79,20 +78,18 @@ static int bitpacked_decode_yuv422p10(AVCodecContext *avctx, AVFrame *frame,
     if (avctx->width % 2)
         return AVERROR_PATCHWELCOME;
 
-    ret = init_get_bits(&bc, avpkt->data, avctx->width * avctx->height * 20);
-    if (ret)
-        return ret;
-
+    src = avpkt->data;
     for (i = 0; i < avctx->height; i++) {
         y = (uint16_t*)(frame->data[0] + i * frame->linesize[0]);
         u = (uint16_t*)(frame->data[1] + i * frame->linesize[1]);
         v = (uint16_t*)(frame->data[2] + i * frame->linesize[2]);
 
         for (j = 0; j < avctx->width; j += 2) {
-            *u++ = get_bits(&bc, 10);
-            *y++ = get_bits(&bc, 10);
-            *v++ = get_bits(&bc, 10);
-            *y++ = get_bits(&bc, 10);
+            *u++ = (src[0] << 2) | (src[1] >> 6);
+            *y++ = ((src[1] << 4) | (src[2] >> 4)) & 0x3ff;
+            *v++ = ((src[2] << 6) | (src[3] >> 2)) & 0x3ff;
+            *y++ = ((src[3] << 8) | (src[4])) & 0x3ff;
+            src += 5;
         }
     }
 
-- 
1.8.3.1

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [FFmpeg-devel] [RFC/PATCH] bitpacked_dec: Optimization for bitpacked_dec decoder performance
  2023-05-05 21:54 [FFmpeg-devel] [RFC/PATCH] bitpacked_dec: Optimization for bitpacked_dec decoder performance Devin Heitmueller
@ 2023-05-06 11:32 ` Lance Wang
  2023-05-06 11:49   ` Devin Heitmueller
  2023-05-06 11:52   ` Paul B Mahol
  0 siblings, 2 replies; 13+ messages in thread
From: Lance Wang @ 2023-05-06 11:32 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

On Sat, May 6, 2023 at 4:58 AM Devin Heitmueller <
devin.heitmueller@ltnglobal.com> wrote:

> Rework the code a bit to speed up the 10-bit bitpacked decoding
> routine.  This is probably about as fast as I can get it without
> switching to assembly language.
>
> Demonstratable with:
>
> ./ffmpeg -f lavfi -i "smptehdbars=size=3840x2160" -c bitpacked -f image2
> -frames:v 1 source.yuv
> ./ffmpeg -f bitpacked -pix_fmt yuv422p10le -s 3840x2160 -c:v bitpacked -i
> source.yuv -pix_fmt yuv422p10le out.yuv
>
> On my development system, it went from 80ms for a 2160p frame
> down to 20ms (i.e. a 4X speedup).  Good enough for now, I hope...
>

FYI, on my development system I ran the original and the modified version
twice each and saw no obvious difference:

./ffmpeg -f lavfi -i "smptehdbars=size=3840x2160" -c bitpacked -frames:v 25 source.yuv
time ./ffmpeg -f bitpacked -pix_fmt yuv422p10le -s 3840x2160 -c:v bitpacked -i source.yuv -pix_fmt yuv422p10le out.yuv

Before the patch:

frame= 25 fps=0.0 q=-0.0 Lsize= 810000kB time=00:00:00.96 bitrate=6912000.0kbits/s speed=1.13x

real 0m0.961s
user 0m1.086s
sys 0m1.360s

frame= 25 fps=0.0 q=-0.0 Lsize= 810000kB time=00:00:00.96 bitrate=6912000.0kbits/s speed=1.16x

real 0m0.936s
user 0m1.358s
sys 0m1.350s

After applying the patch:

frame= 25 fps=0.0 q=-0.0 Lsize= 810000kB time=00:00:00.96 bitrate=6912000.0kbits/s speed=1.14x

real 0m0.953s
user 0m0.906s
sys 0m1.438s

frame= 25 fps=0.0 q=-0.0 Lsize= 810000kB time=00:00:00.96 bitrate=6912000.0kbits/s speed=1.17x

real 0m0.922s
user 0m0.926s
sys 0m1.066s
* Re: [FFmpeg-devel] [RFC/PATCH] bitpacked_dec: Optimization for bitpacked_dec decoder performance
  2023-05-06 11:32 ` Lance Wang
@ 2023-05-06 11:49   ` Devin Heitmueller
  2023-05-06 11:52   ` Paul B Mahol
  1 sibling, 0 replies; 13+ messages in thread
From: Devin Heitmueller @ 2023-05-06 11:49 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

Hi Lance,

On Sat, May 6, 2023 at 7:32 AM Lance Wang <lance.lmwang@gmail.com> wrote:
> FYI, on my development system, I run two time for the original and modified
> version and no obvious difference:

Simply running "time" against the binary isn't an accurate way to
measure a 60ms difference for a single frame being processed.  For any
such execution of ffmpeg the bulk of the time is spent loading the
application and in this case loading the 20MB file from disk into
memory.

In my case I added instrumentation to the decoder to measure how much
time it took to perform the actual decode operation.  I discarded the
patch already since it was like six lines of code and I did the work
several weeks ago, but if there is really a dispute about the
performance benefit I can obviously recreate it (as can anyone who
wants to benchmark it themselves).  You would just have to insert a
couple of calls to gettimeofday() in libavcodec/bitpacked_dec.c before
and after the decoding operation.

This is one of those cases where you won't notice the performance
difference doing any operation once, but it becomes important when
you're processing a live RTP stream of 2160p59 video at 10.4 Gbps.

Devin

-- 
Devin Heitmueller, Senior Software Engineer
LTN Global Communications
o: +1 (301) 363-1001
w: https://ltnglobal.com  e: devin.heitmueller@ltnglobal.com
* Re: [FFmpeg-devel] [RFC/PATCH] bitpacked_dec: Optimization for bitpacked_dec decoder performance
  2023-05-06 11:32 ` Lance Wang
  2023-05-06 11:49   ` Devin Heitmueller
@ 2023-05-06 11:52   ` Paul B Mahol
  2023-05-06 12:13     ` Devin Heitmueller
  1 sibling, 1 reply; 13+ messages in thread
From: Paul B Mahol @ 2023-05-06 11:52 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

On Sat, May 6, 2023 at 1:32 PM Lance Wang <lance.lmwang@gmail.com> wrote:

> FYI, on my development system, I run two time for the original and modified
> version and no obvious difference:
> ./ffmpeg -f lavfi -i "smptehdbars=size=3840x2160" -c bitpacked -frames:v 25
> source.yuv
> time ./ffmpeg -f bitpacked -pix_fmt yuv422p10le -s 3840x2160 -c:v bitpacked
> -i source.yuv -pix_fmt yuv422p10le out.yuv
> frame= 25 fps=0.0 q=-0.0 Lsize= 810000kB time=00:00:00.96
> bitrate=6912000.0kbits/s speed=1.13x
>
> real 0m0.961s
> user 0m1.086s
> sys 0m1.360s
>
> frame= 25 fps=0.0 q=-0.0 Lsize= 810000kB time=00:00:00.96
> bitrate=6912000.0kbits/s speed=1.16x
>
> real 0m0.936s
> user 0m1.358s
> sys 0m1.350s
>
> after apply the patch:
> frame= 25 fps=0.0 q=-0.0 Lsize= 810000kB time=00:00:00.96
> bitrate=6912000.0kbits/s speed=1.14x
>
> real 0m0.953s
> user 0m0.906s
> sys 0m1.438s
>
> frame= 25 fps=0.0 q=-0.0 Lsize= 810000kB time=00:00:00.96
> bitrate=6912000.0kbits/s speed=1.17x
>
> real 0m0.922s
> user 0m0.926s
> sys 0m1.066s
>

Only 25 frames?
This is flawed.
* Re: [FFmpeg-devel] [RFC/PATCH] bitpacked_dec: Optimization for bitpacked_dec decoder performance
  2023-05-06 11:52   ` Paul B Mahol
@ 2023-05-06 12:13     ` Devin Heitmueller
  2023-05-06 12:16       ` James Almer
  0 siblings, 1 reply; 13+ messages in thread
From: Devin Heitmueller @ 2023-05-06 12:13 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

[-- Attachment #1: Type: text/plain, Size: 5976 bytes --]

I added some instrumentation via the attached patch.  You can see the
benefits here:

Before=1683378057.243350 After 1683378057.264239
Before=1683378083.335424 After 1683378083.356440
Before=1683378089.675400 After 1683378089.696512
Before=1683378151.792324 After 1683378151.813579
21 ms per run

After patch:
Before=1683378222.167796 After 1683378222.175760
Before=1683378233.131416 After 1683378233.139326
Before=1683378243.591895 After 1683378243.599840
8 ms per run

Note: this is a different platform than I did the original development
on, and apparently the improvement on this particular box is only 2.5x
rather than 4x.

Devin

On Sat, May 6, 2023 at 7:53 AM Paul B Mahol <onemda@gmail.com> wrote:
>
> Only 25 frames?
> This is flawed.

-- 
Devin Heitmueller, Senior Software Engineer
LTN Global Communications
o: +1 (301) 363-1001
w: https://ltnglobal.com  e: devin.heitmueller@ltnglobal.com

[-- Attachment #2: timing.patch --]
[-- Type: application/octet-stream, Size: 1066 bytes --]
* Re: [FFmpeg-devel] [RFC/PATCH] bitpacked_dec: Optimization for bitpacked_dec decoder performance
  2023-05-06 12:13     ` Devin Heitmueller
@ 2023-05-06 12:16       ` James Almer
  2023-05-06 12:40         ` Devin Heitmueller
  0 siblings, 1 reply; 13+ messages in thread
From: James Almer @ 2023-05-06 12:16 UTC (permalink / raw)
  To: ffmpeg-devel

On 5/6/2023 9:13 AM, Devin Heitmueller wrote:
> I added some instrumentation via the attached patch.  You can see the
> benefits here:
>
> Before=1683378057.243350 After 1683378057.264239
> Before=1683378083.335424 After 1683378083.356440
> Before=1683378089.675400 After 1683378089.696512
> Before=1683378151.792324 After 1683378151.813579
> 21 ms per run
>
> After patch:
> Before=1683378222.167796 After 1683378222.175760
> Before=1683378233.131416 After 1683378233.139326
> Before=1683378243.591895 After 1683378243.599840
> 8 ms per run
>
> Note: this is a different platform than I did the original development
> on, and apparently the improvement on this particular box is only 2.5x
> rather than 4x.
>
> Devin

Can you bench with the START_TIMER and STOP_TIMER macros in timer.h?
Also, define CACHED_BITSTREAM_READER in bitpacked_dec.c before including
get_bits.h and test the actual implementation again, to see if it makes
any difference.
* Re: [FFmpeg-devel] [RFC/PATCH] bitpacked_dec: Optimization for bitpacked_dec decoder performance
  2023-05-06 12:16       ` James Almer
@ 2023-05-06 12:40         ` Devin Heitmueller
  2023-05-10 11:16           ` Lance Wang
  0 siblings, 1 reply; 13+ messages in thread
From: Devin Heitmueller @ 2023-05-06 12:40 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

[-- Attachment #1: Type: text/plain, Size: 1188 bytes --]

On Sat, May 6, 2023 at 8:16 AM James Almer <jamrial@gmail.com> wrote:
> Can you bench with the START_TIMER and STOP_TIMER macros in timer.h?
> Also, define CACHED_BITSTREAM_READER in bitpacked_dec.c before including
> get_bits.h and test the actual implementation again, to see if it makes
> any difference.

Original code:
 671661910 decicycles in bitpacked_dec, 1 runs, 0 skips
 669736380 decicycles in bitpacked_dec, 1 runs, 0 skips
 669370700 decicycles in bitpacked_dec, 1 runs, 0 skips

Original code with CACHED_BITSTREAM_READER defined:
 352599030 decicycles in bitpacked_dec, 1 runs, 0 skips
 336163810 decicycles in bitpacked_dec, 1 runs, 0 skips
 344628350 decicycles in bitpacked_dec, 1 runs, 0 skips

My proposed version:
 257353330 decicycles in bitpacked_dec, 1 runs, 0 skips
 271527000 decicycles in bitpacked_dec, 1 runs, 0 skips
 252701500 decicycles in bitpacked_dec, 1 runs, 0 skips

Devin

-- 
Devin Heitmueller, Senior Software Engineer
LTN Global Communications
o: +1 (301) 363-1001
w: https://ltnglobal.com  e: devin.heitmueller@ltnglobal.com

[-- Attachment #2: timing3.patch --]
[-- Type: application/octet-stream, Size: 1116 bytes --]
* Re: [FFmpeg-devel] [RFC/PATCH] bitpacked_dec: Optimization for bitpacked_dec decoder performance
  2023-05-06 12:40         ` Devin Heitmueller
@ 2023-05-10 11:16           ` Lance Wang
  2023-05-11 22:20             ` Marton Balint
  0 siblings, 1 reply; 13+ messages in thread
From: Lance Wang @ 2023-05-10 11:16 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

On Sat, May 6, 2023 at 8:41 PM Devin Heitmueller <
devin.heitmueller@ltnglobal.com> wrote:

> On Sat, May 6, 2023 at 8:16 AM James Almer <jamrial@gmail.com> wrote:
> > Can you bench with the START_TIMER and STOP_TIMER macros in timer.h?
> > Also, define CACHED_BITSTREAM_READER in bitpacked_dec.c before including
> > get_bits.h and test the actual implementation again, to see if it makes
> > any difference.
>
> Original code:
>  671661910 decicycles in bitpacked_dec, 1 runs, 0 skips
>  669736380 decicycles in bitpacked_dec, 1 runs, 0 skips
>  669370700 decicycles in bitpacked_dec, 1 runs, 0 skips
>
> Original code with CACHED_BITSTREAM_READER defined
>  352599030 decicycles in bitpacked_dec, 1 runs, 0 skips
>  336163810 decicycles in bitpacked_dec, 1 runs, 0 skips
>  344628350 decicycles in bitpacked_dec, 1 runs, 0 skips
>
> My proposed versioned:
>  257353330 decicycles in bitpacked_dec, 1 runs, 0 skips
>  271527000 decicycles in bitpacked_dec, 1 runs, 0 skips
>  252701500 decicycles in bitpacked_dec, 1 runs, 0 skips
>

Yes, it shows better performance, so LGTM if nobody plans to optimize
the bitstream function.
* Re: [FFmpeg-devel] [RFC/PATCH] bitpacked_dec: Optimization for bitpacked_dec decoder performance 2023-05-10 11:16 ` Lance Wang @ 2023-05-11 22:20 ` Marton Balint 2023-05-12 15:26 ` Devin Heitmueller 2023-06-12 16:05 ` [FFmpeg-devel] [RFC/PATCH] bitpacked_dec: Optimization for bitpacked_dec decoder performance Paul B Mahol 0 siblings, 2 replies; 13+ messages in thread From: Marton Balint @ 2023-05-11 22:20 UTC (permalink / raw) To: FFmpeg development discussions and patches On Wed, 10 May 2023, Lance Wang wrote: > On Sat, May 6, 2023 at 8:41 PM Devin Heitmueller < > devin.heitmueller@ltnglobal.com> wrote: > >> On Sat, May 6, 2023 at 8:16 AM James Almer <jamrial@gmail.com> wrote: >> > Can you bench with the START_TIMER and STOP_TIMER macros in timer.h? >> > Also, define CACHED_BITSTREAM_READER in bitpacked_dec.c before including >> > git_bits.h and test the actual implementation again, to see if it makes >> > any difference. >> >> Original code: >> 671661910 decicycles in bitpacked_dec, 1 runs, 0 skips >> 669736380 decicycles in bitpacked_dec, 1 runs, 0 skips >> 669370700 decicycles in bitpacked_dec, 1 runs, 0 skips >> >> Original code with CACHED_BITSTREAM_READER defined >> 352599030 decicycles in bitpacked_dec, 1 runs, 0 skips >> 336163810 decicycles in bitpacked_dec, 1 runs, 0 skips >> 344628350 decicycles in bitpacked_dec, 1 runs, 0 skips >> >> My proposed versioned: >> 257353330 decicycles in bitpacked_dec, 1 runs, 0 skips >> 271527000 decicycles in bitpacked_dec, 1 runs, 0 skips >> 252701500 decicycles in bitpacked_dec, 1 runs, 0 skips >> >> > Yes, it's show better performance, so LGTM if nobody have plan to optimize > the bitstream > function. 
Actually the cached bitstream reader was faster here than the manual
approach:

./ffmpeg -stream_loop 128 -threads 1 -f bitpacked -pix_fmt yuv422p10le -s 3840x2160 -c:v bitpacked -i source.yuv -pix_fmt yuv422p10le -f null none -loglevel error

Old code:

 821050920 decicycles in bitpacked,       1 runs,      0 skips
 815402160 decicycles in bitpacked,       2 runs,      0 skips
 814108410 decicycles in bitpacked,       4 runs,      0 skips
 814213800 decicycles in bitpacked,       8 runs,      0 skips
 815048325 decicycles in bitpacked,      16 runs,      0 skips
 812866713 decicycles in bitpacked,      32 runs,      0 skips
 809186523 decicycles in bitpacked,      64 runs,      0 skips
 808317601 decicycles in bitpacked,     128 runs,      0 skips

With the patch:

 379879920 decicycles in bitpacked,       1 runs,      0 skips
 387491580 decicycles in bitpacked,       2 runs,      0 skips
 397720260 decicycles in bitpacked,       4 runs,      0 skips
 389581560 decicycles in bitpacked,       8 runs,      0 skips
 381820635 decicycles in bitpacked,      16 runs,      0 skips
 379791675 decicycles in bitpacked,      32 runs,      0 skips
 379246303 decicycles in bitpacked,      64 runs,      0 skips
 379221671 decicycles in bitpacked,     128 runs,      0 skips

Old code and #defined CACHED_BITSTREAM_READER 1

 345122280 decicycles in bitpacked,       1 runs,      0 skips
 343663020 decicycles in bitpacked,       2 runs,      0 skips
 343372680 decicycles in bitpacked,       4 runs,      0 skips
 342554535 decicycles in bitpacked,       8 runs,      0 skips
 340816522 decicycles in bitpacked,      16 runs,      0 skips
 340225672 decicycles in bitpacked,      32 runs,      0 skips
 340283520 decicycles in bitpacked,      64 runs,      0 skips
 339643105 decicycles in bitpacked,     128 runs,      0 skips

Regards,
Marton
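For readers comparing the three result sets, the relative speedups can be worked out from the 128-run averages reported in this message. The snippet below is only an arithmetic sketch: the function name `speedup` and the constant names are made up for illustration, and the figures are copied from the tables above, not re-measured.

```c
#include <stdio.h>

/* 128-run averages in decicycles, copied from the benchmark above. */
#define OLD_CODE_DECICYCLES     808317601.0
#define PATCHED_DECICYCLES      379221671.0
#define CACHED_READER_DECICYCLES 339643105.0

/* Hypothetical helper: how many times faster "after" is than "before". */
double speedup(double before, double after)
{
    return before / after;
}

/* Print the three ratios of interest. */
void print_speedups(void)
{
    printf("patch vs old code:         %.2fx\n",
           speedup(OLD_CODE_DECICYCLES, PATCHED_DECICYCLES));
    printf("cached reader vs old code: %.2fx\n",
           speedup(OLD_CODE_DECICYCLES, CACHED_READER_DECICYCLES));
    printf("cached reader vs patch:    %.2fx\n",
           speedup(PATCHED_DECICYCLES, CACHED_READER_DECICYCLES));
}
```

On these numbers both variants roughly halve the cycle count of the old code, and the cached reader comes out around a tenth faster than the patch on this particular machine.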
* Re: [FFmpeg-devel] [RFC/PATCH] bitpacked_dec: Optimization for bitpacked_dec decoder performance 2023-05-11 22:20 ` Marton Balint @ 2023-05-12 15:26 ` Devin Heitmueller 2023-12-13 19:58 ` [FFmpeg-devel] [PATCH] avcodec/bitpacked_dec: optimize bitpacked_decode_yuv422p10 Marton Balint 2023-06-12 16:05 ` [FFmpeg-devel] [RFC/PATCH] bitpacked_dec: Optimization for bitpacked_dec decoder performance Paul B Mahol 1 sibling, 1 reply; 13+ messages in thread From: Devin Heitmueller @ 2023-05-12 15:26 UTC (permalink / raw) To: FFmpeg development discussions and patches On Thu, May 11, 2023 at 6:20 PM Marton Balint <cus@passwd.hu> wrote: > Actually the cached bitstream reader was faster here than the manual > approach: > > ./ffmpeg -stream_loop 128 -threads 1 -f bitpacked -pix_fmt yuv422p10le -s 3840x2160 -c:v bitpacked -i source.yuv -pix_fmt yuv422p10le -f null none -loglevel error > > Old code: > > 821050920 decicycles in bitpacked, 1 runs, 0 skips > 815402160 decicycles in bitpacked, 2 runs, 0 skips > 814108410 decicycles in bitpacked, 4 runs, 0 skips > 814213800 decicycles in bitpacked, 8 runs, 0 skips > 815048325 decicycles in bitpacked, 16 runs, 0 skips > 812866713 decicycles in bitpacked, 32 runs, 0 skips > 809186523 decicycles in bitpacked, 64 runs, 0 skips > 808317601 decicycles in bitpacked, 128 runs, 0 skips > > With the patch: > > 379879920 decicycles in bitpacked, 1 runs, 0 skips > 387491580 decicycles in bitpacked, 2 runs, 0 skips > 397720260 decicycles in bitpacked, 4 runs, 0 skips > 389581560 decicycles in bitpacked, 8 runs, 0 skips > 381820635 decicycles in bitpacked, 16 runs, 0 skips > 379791675 decicycles in bitpacked, 32 runs, 0 skips > 379246303 decicycles in bitpacked, 64 runs, 0 skips > 379221671 decicycles in bitpacked, 128 runs, 0 skips > > Old code and #defined CACHED_BITSTREAM_READER 1 > > 345122280 decicycles in bitpacked, 1 runs, 0 skips > 343663020 decicycles in bitpacked, 2 runs, 0 skips > 343372680 decicycles in bitpacked, 4 runs, 0 skips > 
342554535 decicycles in bitpacked, 8 runs, 0 skips
> 340816522 decicycles in bitpacked, 16 runs, 0 skips
> 340225672 decicycles in bitpacked, 32 runs, 0 skips
> 340283520 decicycles in bitpacked, 64 runs, 0 skips
> 339643105 decicycles in bitpacked, 128 runs, 0 skips

I don't have a good explanation for this. I could speculate that some
of it comes down to the processor architecture, how much onboard cache
it has, the gcc version (and what sort of optimization/vectorization it
does, if any), etc. In my case I was testing on Haswell and Skylake
(both with 12MB cache) with gcc 4.8. I would welcome feedback from
others.

Looking at the code in libavcodec/get_bits.h, it might also be worth
trying #define LONG_BITSTREAM_READER, as that might speed things
up as well for such large files.

Devin

-- 
Devin Heitmueller, Senior Software Engineer
LTN Global Communications
o: +1 (301) 363-1001
w: https://ltnglobal.com e: devin.heitmueller@ltnglobal.com
* [FFmpeg-devel] [PATCH] avcodec/bitpacked_dec: optimize bitpacked_decode_yuv422p10
  2023-05-12 15:26 ` Devin Heitmueller
@ 2023-12-13 19:58 ` Marton Balint
  2023-12-28 20:42 ` Marton Balint
  0 siblings, 1 reply; 13+ messages in thread
From: Marton Balint @ 2023-12-13 19:58 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Marton Balint, Devin Heitmueller, Devin Heitmueller

From: Devin Heitmueller <devin.heitmueller@ltnglobal.com>

Rework the code a bit to speed up the 10-bit bitpacked decoding
routine. This is probably about as fast as I can get it without
switching to assembly language.

Demonstratable with:

./ffmpeg -f lavfi -i "smptehdbars=size=3840x2160" -c bitpacked -f image2 -frames:v 1 source.yuv
./ffmpeg -f bitpacked -pix_fmt yuv422p10le -s 3840x2160 -c:v bitpacked -i source.yuv -pix_fmt yuv422p10le out.yuv

On my development system, it went from 80ms for a 2160p frame
down to 20ms (i.e. a 4X speedup). Good enough for now, I hope...

Comments from Marton:

Originally on my system better performance could be achieved by simply
switching to the cached bitstream reader, but for Devin it was slower than
his direct byte operations.

I changed the order of writing output from u/y/v/y to u/v/y/y, and that made
the code faster than the cached bitstream reader on my system as well.

TIMER measurement of the decode loop on Ryzen 5 3600 with command line:

./ffmpeg -stream_loop 256 -threads 1 -f bitpacked -pix_fmt yuv422p10le -s 3840x2160 -c:v bitpacked -i source.yuv -pix_fmt yuv422p10le -f null none -loglevel error

Before: 823204127 decicycles in YUV,     256 runs,      0 skips
After:  315070524 decicycles in YUV,     256 runs,      0 skips

Signed-off-by: Devin Heitmueller <dheitmueller@ltnglobal.com>
Signed-off-by: Marton Balint <cus@passwd.hu>
---
 libavcodec/bitpacked_dec.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/libavcodec/bitpacked_dec.c b/libavcodec/bitpacked_dec.c
index c88f861993..54c008bd86 100644
--- a/libavcodec/bitpacked_dec.c
+++ b/libavcodec/bitpacked_dec.c
@@ -28,7 +28,6 @@

 #include "avcodec.h"
 #include "codec_internal.h"
-#include "get_bits.h"
 #include "libavutil/imgutils.h"
 #include "thread.h"

@@ -65,7 +64,7 @@ static int bitpacked_decode_yuv422p10(AVCodecContext *avctx, AVFrame *frame,
 {
     uint64_t frame_size = (uint64_t)avctx->width * (uint64_t)avctx->height * 20;
     uint64_t packet_size = (uint64_t)avpkt->size * 8;
-    GetBitContext bc;
+    uint8_t *src;
     uint16_t *y, *u, *v;
     int ret, i, j;

@@ -79,20 +78,18 @@ static int bitpacked_decode_yuv422p10(AVCodecContext *avctx, AVFrame *frame,
     if (avctx->width % 2)
         return AVERROR_PATCHWELCOME;

-    ret = init_get_bits(&bc, avpkt->data, avctx->width * avctx->height * 20);
-    if (ret)
-        return ret;
-
+    src = avpkt->data;
     for (i = 0; i < avctx->height; i++) {
         y = (uint16_t*)(frame->data[0] + i * frame->linesize[0]);
         u = (uint16_t*)(frame->data[1] + i * frame->linesize[1]);
         v = (uint16_t*)(frame->data[2] + i * frame->linesize[2]);

         for (j = 0; j < avctx->width; j += 2) {
-            *u++ = get_bits(&bc, 10);
-            *y++ = get_bits(&bc, 10);
-            *v++ = get_bits(&bc, 10);
-            *y++ = get_bits(&bc, 10);
+            *u++ = (src[0] << 2) | (src[1] >> 6);
+            *v++ = ((src[2] << 6) | (src[3] >> 2)) & 0x3ff;
+            *y++ = ((src[1] << 4) | (src[2] >> 4)) & 0x3ff;
+            *y++ = ((src[3] << 8) | (src[4])) & 0x3ff;
+            src += 5;
         }
     }

-- 
2.35.3
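The byte arithmetic in the patch above can be checked outside FFmpeg. What follows is a standalone sketch, not the decoder itself: `unpack_row_yuv422p10` and `read_bits_msb` are names invented here, the second being a deliberately slow MSB-first bit reader that stands in for the old get_bits() path. It assumes the layout the patch relies on, namely that each pair of pixels occupies 5 bytes holding four 10-bit samples in U, Y0, V, Y1 order.

```c
#include <stddef.h>
#include <stdint.h>

/* Slow reference (hypothetical helper): read n bits MSB-first from buf,
 * starting at bit offset *pos, and advance *pos by n. */
unsigned read_bits_msb(const uint8_t *buf, size_t *pos, int n)
{
    unsigned val = 0;
    for (int k = 0; k < n; k++, (*pos)++)
        val = (val << 1) | ((buf[*pos >> 3] >> (7 - (*pos & 7))) & 1);
    return val;
}

/* Unpack one row of `width` pixels (width must be even) from packed
 * 10-bit 4:2:2 data. Every 5 source bytes carry U, Y0, V, Y1:
 *   bits  0..9  -> U     bits 10..19 -> Y0
 *   bits 20..29 -> V     bits 30..39 -> Y1
 * The stores use the same u/v/y/y order as the patch. */
void unpack_row_yuv422p10(const uint8_t *src, uint16_t *y,
                          uint16_t *u, uint16_t *v, int width)
{
    for (int j = 0; j < width; j += 2) {
        *u++ =  (src[0] << 2) | (src[1] >> 6);           /* no mask needed: max 1023 */
        *v++ = ((src[2] << 6) | (src[3] >> 2)) & 0x3ff;
        *y++ = ((src[1] << 4) | (src[2] >> 4)) & 0x3ff;
        *y++ = ((src[3] << 8) |  src[4])       & 0x3ff;
        src += 5;
    }
}
```

Feeding the same buffer to both functions and comparing the samples in U, Y0, V, Y1 order is a cheap way to confirm that the shifts and masks match the bit reader before benchmarking.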
* Re: [FFmpeg-devel] [PATCH] avcodec/bitpacked_dec: optimize bitpacked_decode_yuv422p10 2023-12-13 19:58 ` [FFmpeg-devel] [PATCH] avcodec/bitpacked_dec: optimize bitpacked_decode_yuv422p10 Marton Balint @ 2023-12-28 20:42 ` Marton Balint 0 siblings, 0 replies; 13+ messages in thread From: Marton Balint @ 2023-12-28 20:42 UTC (permalink / raw) To: FFmpeg development discussions and patches On Wed, 13 Dec 2023, Marton Balint wrote: > From: Devin Heitmueller <devin.heitmueller@ltnglobal.com> > > Rework the code a bit to speed up the 10-bit bitpacked decoding > routine. This is probably about as fast as I can get it without > switching to assembly language. > > Demonstratable with: > > ./ffmpeg -f lavfi -i "smptehdbars=size=3840x2160" -c bitpacked -f image2 -frames:v 1 source.yuv > ./ffmpeg -f bitpacked -pix_fmt yuv422p10le -s 3840x2160 -c:v bitpacked -i source.yuv -pix_fmt yuv422p10le out.yuv > > On my development system, it went from 80ms for a 2160p frame > down to 20ms (i.e. a 4X speedup). Good enough for now, I hope... > > Comments from Marton: > > Originally on my system better performance could be achieved by simply > switching to the cached bitstream reader, but for Devin it was slower than > his direct byte operations. > > I changed the order of writing output from u/y/v/y to u/v/y/y, and that made > the code faster than the cached bitstream reader on my system as well. > > TIMER measurement of the decode loop on Ryzen 5 3600 with command line: > > ./ffmpeg -stream_loop 256 -threads 1 -f bitpacked -pix_fmt yuv422p10le -s 3840x2160 -c:v bitpacked -i source.yuv -pix_fmt yuv422p10le -f null none -loglevel error > > Before: 823204127 decicycles in YUV, 256 runs, 0 skips > After: 315070524 decicycles in YUV, 256 runs, 0 skips > > Signed-off-by: Devin Heitmueller <dheitmueller@ltnglobal.com> > Signed-off-by: Marton Balint <cus@passwd.hu> > --- > libavcodec/bitpacked_dec.c | 17 +++++++---------- > 1 file changed, 7 insertions(+), 10 deletions(-) Will apply. 
Regards,
Marton
* Re: [FFmpeg-devel] [RFC/PATCH] bitpacked_dec: Optimization for bitpacked_dec decoder performance 2023-05-11 22:20 ` Marton Balint 2023-05-12 15:26 ` Devin Heitmueller @ 2023-06-12 16:05 ` Paul B Mahol 1 sibling, 0 replies; 13+ messages in thread From: Paul B Mahol @ 2023-06-12 16:05 UTC (permalink / raw) To: FFmpeg development discussions and patches On Fri, May 12, 2023 at 12:20 AM Marton Balint <cus@passwd.hu> wrote: > > > On Wed, 10 May 2023, Lance Wang wrote: > > > On Sat, May 6, 2023 at 8:41 PM Devin Heitmueller < > > devin.heitmueller@ltnglobal.com> wrote: > > > >> On Sat, May 6, 2023 at 8:16 AM James Almer <jamrial@gmail.com> wrote: > >> > Can you bench with the START_TIMER and STOP_TIMER macros in timer.h? > >> > Also, define CACHED_BITSTREAM_READER in bitpacked_dec.c before > including > >> > git_bits.h and test the actual implementation again, to see if it > makes > >> > any difference. > >> > >> Original code: > >> 671661910 decicycles in bitpacked_dec, 1 runs, 0 skips > >> 669736380 decicycles in bitpacked_dec, 1 runs, 0 skips > >> 669370700 decicycles in bitpacked_dec, 1 runs, 0 skips > >> > >> Original code with CACHED_BITSTREAM_READER defined > >> 352599030 decicycles in bitpacked_dec, 1 runs, 0 skips > >> 336163810 decicycles in bitpacked_dec, 1 runs, 0 skips > >> 344628350 decicycles in bitpacked_dec, 1 runs, 0 skips > >> > >> My proposed versioned: > >> 257353330 decicycles in bitpacked_dec, 1 runs, 0 skips > >> 271527000 decicycles in bitpacked_dec, 1 runs, 0 skips > >> 252701500 decicycles in bitpacked_dec, 1 runs, 0 skips > >> > >> > > Yes, it's show better performance, so LGTM if nobody have plan to > optimize > > the bitstream > > function. 
> > Actually the cached bitstream reader was faster here than the manual > approach: > > ./ffmpeg -stream_loop 128 -threads 1 -f bitpacked -pix_fmt yuv422p10le -s > 3840x2160 -c:v bitpacked -i source.yuv -pix_fmt yuv422p10le -f null none > -loglevel error > > Old code: > > 821050920 decicycles in bitpacked, 1 runs, 0 skips > 815402160 decicycles in bitpacked, 2 runs, 0 skips > 814108410 decicycles in bitpacked, 4 runs, 0 skips > 814213800 decicycles in bitpacked, 8 runs, 0 skips > 815048325 decicycles in bitpacked, 16 runs, 0 skips > 812866713 decicycles in bitpacked, 32 runs, 0 skips > 809186523 decicycles in bitpacked, 64 runs, 0 skips > 808317601 decicycles in bitpacked, 128 runs, 0 skips > > With the patch: > > 379879920 decicycles in bitpacked, 1 runs, 0 skips > 387491580 decicycles in bitpacked, 2 runs, 0 skips > 397720260 decicycles in bitpacked, 4 runs, 0 skips > 389581560 decicycles in bitpacked, 8 runs, 0 skips > 381820635 decicycles in bitpacked, 16 runs, 0 skips > 379791675 decicycles in bitpacked, 32 runs, 0 skips > 379246303 decicycles in bitpacked, 64 runs, 0 skips > 379221671 decicycles in bitpacked, 128 runs, 0 skips > > Old code and #defined CACHED_BITSTREAM_READER 1 > > 345122280 decicycles in bitpacked, 1 runs, 0 skips > 343663020 decicycles in bitpacked, 2 runs, 0 skips > 343372680 decicycles in bitpacked, 4 runs, 0 skips > 342554535 decicycles in bitpacked, 8 runs, 0 skips > 340816522 decicycles in bitpacked, 16 runs, 0 skips > 340225672 decicycles in bitpacked, 32 runs, 0 skips > 340283520 decicycles in bitpacked, 64 runs, 0 skips > 339643105 decicycles in bitpacked, 128 runs, 0 skips > Could someone send patch for this ? > > Regards, > Marton > _______________________________________________ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > https://ffmpeg.org/mailman/listinfo/ffmpeg-devel > > To unsubscribe, visit link above, or email > ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe". 
Thread overview: 13+ messages
2023-05-05 21:54 [FFmpeg-devel] [RFC/PATCH] bitpacked_dec: Optimization for bitpacked_dec decoder performance Devin Heitmueller
2023-05-06 11:32 ` Lance Wang
2023-05-06 11:49 ` Devin Heitmueller
2023-05-06 11:52 ` Paul B Mahol
2023-05-06 12:13 ` Devin Heitmueller
2023-05-06 12:16 ` James Almer
2023-05-06 12:40 ` Devin Heitmueller
2023-05-10 11:16 ` Lance Wang
2023-05-11 22:20 ` Marton Balint
2023-05-12 15:26 ` Devin Heitmueller
2023-12-13 19:58 ` [FFmpeg-devel] [PATCH] avcodec/bitpacked_dec: optimize bitpacked_decode_yuv422p10 Marton Balint
2023-12-28 20:42 ` Marton Balint
2023-06-12 16:05 ` [FFmpeg-devel] [RFC/PATCH] bitpacked_dec: Optimization for bitpacked_dec decoder performance Paul B Mahol
Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel