Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
* [FFmpeg-devel] [PATCH v2 0/6] APV support
@ 2025-04-21 15:24 Mark Thompson
  2025-04-21 15:24 ` [FFmpeg-devel] [PATCH v2 1/6] lavc: APV codec ID and descriptor Mark Thompson
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Mark Thompson @ 2025-04-21 15:24 UTC
  To: ffmpeg-devel

This is v2, incorporating review comments and adding some minor optimisations.

Thanks,

- Mark

Mark Thompson (6):
  lavc: APV codec ID and descriptor
  lavc/cbs: APV support
  lavf: APV demuxer
  lavc: APV decoder
  lavc/apv: AVX2 transquant for x86-64
  lavc: APV metadata bitstream filter

 configure                            |   2 +
 libavcodec/Makefile                  |   2 +
 libavcodec/allcodecs.c               |   1 +
 libavcodec/apv.h                     |  86 ++++
 libavcodec/apv_decode.c              | 331 +++++++++++++++
 libavcodec/apv_decode.h              |  80 ++++
 libavcodec/apv_dsp.c                 | 140 +++++++
 libavcodec/apv_dsp.h                 |  39 ++
 libavcodec/apv_entropy.c             | 200 +++++++++
 libavcodec/bitstream_filters.c       |   1 +
 libavcodec/bsf/Makefile              |   1 +
 libavcodec/bsf/apv_metadata.c        | 134 ++++++
 libavcodec/cbs.c                     |   6 +
 libavcodec/cbs_apv.c                 | 395 ++++++++++++++++++
 libavcodec/cbs_apv.h                 | 207 ++++++++++
 libavcodec/cbs_apv_syntax_template.c | 595 +++++++++++++++++++++++++++
 libavcodec/cbs_internal.h            |   4 +
 libavcodec/codec_desc.c              |   7 +
 libavcodec/codec_id.h                |   1 +
 libavcodec/x86/Makefile              |   2 +
 libavcodec/x86/apv_dsp.asm           | 279 +++++++++++++
 libavcodec/x86/apv_dsp_init.c        |  40 ++
 libavformat/Makefile                 |   1 +
 libavformat/allformats.c             |   1 +
 libavformat/apvdec.c                 | 248 +++++++++++
 libavformat/cbs.h                    |   1 +
 tests/checkasm/Makefile              |   1 +
 tests/checkasm/apv_dsp.c             | 109 +++++
 tests/checkasm/checkasm.c            |   3 +
 tests/checkasm/checkasm.h            |   1 +
 tests/fate/checkasm.mak              |   1 +
 31 files changed, 2919 insertions(+)
 create mode 100644 libavcodec/apv.h
 create mode 100644 libavcodec/apv_decode.c
 create mode 100644 libavcodec/apv_decode.h
 create mode 100644 libavcodec/apv_dsp.c
 create mode 100644 libavcodec/apv_dsp.h
 create mode 100644 libavcodec/apv_entropy.c
 create mode 100644 libavcodec/bsf/apv_metadata.c
 create mode 100644 libavcodec/cbs_apv.c
 create mode 100644 libavcodec/cbs_apv.h
 create mode 100644 libavcodec/cbs_apv_syntax_template.c
 create mode 100644 libavcodec/x86/apv_dsp.asm
 create mode 100644 libavcodec/x86/apv_dsp_init.c
 create mode 100644 libavformat/apvdec.c
 create mode 100644 tests/checkasm/apv_dsp.c

-- 
2.47.2

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".


* [FFmpeg-devel] [PATCH v2 1/6] lavc: APV codec ID and descriptor
  2025-04-21 15:24 [FFmpeg-devel] [PATCH v2 0/6] APV support Mark Thompson
@ 2025-04-21 15:24 ` Mark Thompson
  2025-04-21 15:24 ` [FFmpeg-devel] [PATCH v2 2/6] lavc/cbs: APV support Mark Thompson
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Mark Thompson @ 2025-04-21 15:24 UTC
  To: ffmpeg-devel

---
 libavcodec/codec_desc.c | 7 +++++++
 libavcodec/codec_id.h   | 1 +
 2 files changed, 8 insertions(+)

diff --git a/libavcodec/codec_desc.c b/libavcodec/codec_desc.c
index 9fb190e35a..88fed478a3 100644
--- a/libavcodec/codec_desc.c
+++ b/libavcodec/codec_desc.c
@@ -1985,6 +1985,13 @@ static const AVCodecDescriptor codec_descriptors[] = {
         .props     = AV_CODEC_PROP_LOSSY | AV_CODEC_PROP_LOSSLESS,
         .mime_types= MT("image/jxl"),
     },
+    {
+        .id        = AV_CODEC_ID_APV,
+        .type      = AVMEDIA_TYPE_VIDEO,
+        .name      = "apv",
+        .long_name = NULL_IF_CONFIG_SMALL("Advanced Professional Video"),
+        .props     = AV_CODEC_PROP_INTRA_ONLY | AV_CODEC_PROP_LOSSY,
+    },
 
     /* various PCM "codecs" */
     {
diff --git a/libavcodec/codec_id.h b/libavcodec/codec_id.h
index 2f6efe8261..be0a65bcb9 100644
--- a/libavcodec/codec_id.h
+++ b/libavcodec/codec_id.h
@@ -329,6 +329,7 @@ enum AVCodecID {
     AV_CODEC_ID_DNXUC,
     AV_CODEC_ID_RV60,
     AV_CODEC_ID_JPEGXL_ANIM,
+    AV_CODEC_ID_APV,
 
     /* various PCM "codecs" */
     AV_CODEC_ID_FIRST_AUDIO = 0x10000,     ///< A dummy id pointing at the start of audio codecs
-- 
2.47.2



* [FFmpeg-devel] [PATCH v2 2/6] lavc/cbs: APV support
  2025-04-21 15:24 [FFmpeg-devel] [PATCH v2 0/6] APV support Mark Thompson
  2025-04-21 15:24 ` [FFmpeg-devel] [PATCH v2 1/6] lavc: APV codec ID and descriptor Mark Thompson
@ 2025-04-21 15:24 ` Mark Thompson
  2025-04-21 15:24 ` [FFmpeg-devel] [PATCH v2 3/6] lavf: APV demuxer Mark Thompson
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Mark Thompson @ 2025-04-21 15:24 UTC
  To: ffmpeg-devel

---
 configure                            |   1 +
 libavcodec/Makefile                  |   1 +
 libavcodec/apv.h                     |  86 ++++
 libavcodec/cbs.c                     |   6 +
 libavcodec/cbs_apv.c                 | 395 ++++++++++++++++++
 libavcodec/cbs_apv.h                 | 207 ++++++++++
 libavcodec/cbs_apv_syntax_template.c | 595 +++++++++++++++++++++++++++
 libavcodec/cbs_internal.h            |   4 +
 libavformat/cbs.h                    |   1 +
 9 files changed, 1296 insertions(+)
 create mode 100644 libavcodec/apv.h
 create mode 100644 libavcodec/cbs_apv.c
 create mode 100644 libavcodec/cbs_apv.h
 create mode 100644 libavcodec/cbs_apv_syntax_template.c
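
Note on the tile grid: cbs_apv_derive_tile_info() in this patch turns the
frame size and the tile size in macroblocks into per-tile start positions,
with one extra entry marking the MB-aligned frame edge. The following is a
minimal standalone sketch of that derivation for a single dimension. The
names derive_tile_starts, MB_SIZE and MAX_TILES_PER_DIM are hypothetical and
do not appear in the patch, and the sketch returns -1 on an oversized grid
where the patch uses av_assert0().

```c
#include <stdint.h>

/* Hypothetical constants mirroring APV_MB_WIDTH and APV_MAX_TILE_COLS. */
enum { MB_SIZE = 16, MAX_TILES_PER_DIM = 20 };

/*
 * Fill starts[] with the start position (in samples) of each tile along
 * one dimension, followed by one final entry for the MB-aligned frame edge.
 * starts[] must have room for MAX_TILES_PER_DIM + 1 entries.
 * Returns the number of tiles, or -1 if tile_size_in_mbs is zero or more
 * than MAX_TILES_PER_DIM tiles would be needed.
 */
static int derive_tile_starts(uint32_t frame_size, uint32_t tile_size_in_mbs,
                              uint32_t *starts)
{
    uint32_t frame_size_in_mbs = (frame_size + MB_SIZE - 1) / MB_SIZE;
    uint32_t start_mb = 0;
    int i;

    if (tile_size_in_mbs == 0)
        return -1;

    for (i = 0; start_mb < frame_size_in_mbs; i++) {
        if (i >= MAX_TILES_PER_DIM)
            return -1;               /* grid larger than the level limit */
        starts[i] = start_mb * MB_SIZE;
        start_mb += tile_size_in_mbs;
    }
    starts[i] = frame_size_in_mbs * MB_SIZE;   /* sentinel past last tile */
    return i;
}
```

For example, a 3840-sample width with 16-MB-wide tiles gives 15 columns, and
a 1080-sample height with 8-MB-high tiles gives 9 rows with the final entry
at 1088 (the height rounded up to a whole macroblock).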

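Note on the framing: cbs_apv_split_fragment() in this patch treats a fragment
as a sequence of PBUs, each preceded by a 32-bit big-endian pbu_size, and
takes pbu_type from the first byte of the PBU header. The sketch below walks
that layout without any FFmpeg dependencies, applying the same three size
checks as the patch. The names rb32 and walk_pbus are hypothetical helpers
written for this note only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read a 32-bit big-endian value. */
static uint32_t rb32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] <<  8) |  (uint32_t)p[3];
}

/*
 * Walk a buffer laid out as repeated [pbu_size (u32 BE)][PBU bytes].
 * Stores the first byte of each PBU (pbu_type) in types[] for the first
 * max_types PBUs, and returns the total number of PBUs found, or -1 if
 * the framing is invalid.
 */
static int walk_pbus(const uint8_t *data, size_t size,
                     uint8_t *types, int max_types)
{
    int count = 0;

    while (size > 0) {
        uint32_t pbu_size;

        if (size < 8)          /* 4-byte size field + 4-byte PBU header */
            return -1;
        pbu_size = rb32(data);
        if (pbu_size < 4)      /* must hold at least the PBU header */
            return -1;
        data += 4;
        size -= 4;
        if (pbu_size > size)   /* PBU runs past the end of the buffer */
            return -1;

        if (count < max_types)
            types[count] = data[0];
        count++;

        data += pbu_size;
        size -= pbu_size;
    }
    return count;
}
```

The assemble path is the inverse: write each unit's size as four big-endian
bytes, then the unit data, as cbs_apv_assemble_fragment() does with AV_WB32.
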
diff --git a/configure b/configure
index c94b8eac43..ca404d2797 100755
--- a/configure
+++ b/configure
@@ -2562,6 +2562,7 @@ CONFIG_EXTRA="
     bswapdsp
     cabac
     cbs
+    cbs_apv
     cbs_av1
     cbs_h264
     cbs_h265
diff --git a/libavcodec/Makefile b/libavcodec/Makefile
index 7bd1dbec9a..a5f5c4e904 100644
--- a/libavcodec/Makefile
+++ b/libavcodec/Makefile
@@ -83,6 +83,7 @@ OBJS-$(CONFIG_BLOCKDSP)                += blockdsp.o
 OBJS-$(CONFIG_BSWAPDSP)                += bswapdsp.o
 OBJS-$(CONFIG_CABAC)                   += cabac.o
 OBJS-$(CONFIG_CBS)                     += cbs.o cbs_bsf.o
+OBJS-$(CONFIG_CBS_APV)                 += cbs_apv.o
 OBJS-$(CONFIG_CBS_AV1)                 += cbs_av1.o
 OBJS-$(CONFIG_CBS_H264)                += cbs_h2645.o cbs_sei.o h2645_parse.o
 OBJS-$(CONFIG_CBS_H265)                += cbs_h2645.o cbs_sei.o h2645_parse.o
diff --git a/libavcodec/apv.h b/libavcodec/apv.h
new file mode 100644
index 0000000000..27e089ea22
--- /dev/null
+++ b/libavcodec/apv.h
@@ -0,0 +1,86 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef AVCODEC_APV_H
+#define AVCODEC_APV_H
+
+// PBU types (section 5.3.3).
+enum {
+    APV_PBU_PRIMARY_FRAME           = 1,
+    APV_PBU_NON_PRIMARY_FRAME       = 2,
+    APV_PBU_PREVIEW_FRAME           = 25,
+    APV_PBU_DEPTH_FRAME             = 26,
+    APV_PBU_ALPHA_FRAME             = 27,
+    APV_PBU_ACCESS_UNIT_INFORMATION = 65,
+    APV_PBU_METADATA                = 66,
+    APV_PBU_FILLER                  = 67,
+};
+
+// Format parameters (section 4.2).
+enum {
+    APV_MAX_NUM_COMP = 4,
+    APV_MB_WIDTH     = 16,
+    APV_MB_HEIGHT    = 16,
+    APV_TR_SIZE      = 8,
+};
+
+// Chroma formats (section 4.2).
+enum {
+    APV_CHROMA_FORMAT_400  = 0,
+    APV_CHROMA_FORMAT_422  = 2,
+    APV_CHROMA_FORMAT_444  = 3,
+    APV_CHROMA_FORMAT_4444 = 4,
+};
+
+// Coefficient limits (section 5.3.15).
+enum {
+    APV_BLK_COEFFS      = (APV_TR_SIZE * APV_TR_SIZE),
+    APV_MIN_TRANS_COEFF = -32768,
+    APV_MAX_TRANS_COEFF =  32767,
+};
+
+// Profiles (section 10.1.3).
+enum {
+    APV_PROFILE_422_10  = 33,
+    APV_PROFILE_422_12  = 44,
+    APV_PROFILE_444_10  = 55,
+    APV_PROFILE_444_12  = 66,
+    APV_PROFILE_4444_10 = 77,
+    APV_PROFILE_4444_12 = 88,
+    APV_PROFILE_400_10  = 99,
+};
+
+// General level limits for tiles (section 10.1.4.1).
+enum {
+    APV_MIN_TILE_WIDTH_IN_MBS  = 16,
+    APV_MIN_TILE_HEIGHT_IN_MBS = 8,
+    APV_MAX_TILE_COLS          = 20,
+    APV_MAX_TILE_ROWS          = 20,
+    APV_MAX_TILE_COUNT         = APV_MAX_TILE_COLS * APV_MAX_TILE_ROWS,
+};
+
+// Metadata types (section 10.3.1).
+enum {
+    APV_METADATA_ITU_T_T35    = 4,
+    APV_METADATA_MDCV         = 5,
+    APV_METADATA_CLL          = 6,
+    APV_METADATA_FILLER       = 10,
+    APV_METADATA_USER_DEFINED = 170,
+};
+
+#endif /* AVCODEC_APV_H */
diff --git a/libavcodec/cbs.c b/libavcodec/cbs.c
index ba1034a72e..9b485420d5 100644
--- a/libavcodec/cbs.c
+++ b/libavcodec/cbs.c
@@ -31,6 +31,9 @@
 
 
 static const CodedBitstreamType *const cbs_type_table[] = {
+#if CBS_APV
+    &CBS_FUNC(type_apv),
+#endif
 #if CBS_AV1
     &CBS_FUNC(type_av1),
 #endif
@@ -58,6 +61,9 @@ static const CodedBitstreamType *const cbs_type_table[] = {
 };
 
 const enum AVCodecID CBS_FUNC(all_codec_ids)[] = {
+#if CBS_APV
+    AV_CODEC_ID_APV,
+#endif
 #if CBS_AV1
     AV_CODEC_ID_AV1,
 #endif
diff --git a/libavcodec/cbs_apv.c b/libavcodec/cbs_apv.c
new file mode 100644
index 0000000000..c37124168a
--- /dev/null
+++ b/libavcodec/cbs_apv.c
@@ -0,0 +1,395 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/mem.h"
+#include "cbs.h"
+#include "cbs_internal.h"
+#include "cbs_apv.h"
+
+
+static int cbs_apv_get_num_comp(const APVRawFrameHeader *fh)
+{
+    switch (fh->frame_info.chroma_format_idc) {
+    case APV_CHROMA_FORMAT_400:
+        return 1;
+    case APV_CHROMA_FORMAT_422:
+    case APV_CHROMA_FORMAT_444:
+        return 3;
+    case APV_CHROMA_FORMAT_4444:
+        return 4;
+    default:
+        av_assert0(0 && "Invalid chroma_format_idc");
+    }
+}
+
+static void cbs_apv_derive_tile_info(APVDerivedTileInfo *ti,
+                                     const APVRawFrameHeader *fh)
+{
+    int frame_width_in_mbs   = (fh->frame_info.frame_width  + 15) / 16;
+    int frame_height_in_mbs  = (fh->frame_info.frame_height + 15) / 16;
+    int start_mb, i;
+
+    start_mb = 0;
+    for (i = 0; start_mb < frame_width_in_mbs; i++) {
+        ti->col_starts[i] = start_mb * APV_MB_WIDTH;
+        start_mb += fh->tile_info.tile_width_in_mbs;
+    }
+    av_assert0(i <= APV_MAX_TILE_COLS);
+    ti->col_starts[i] = frame_width_in_mbs * APV_MB_WIDTH;
+    ti->tile_cols = i;
+
+    start_mb = 0;
+    for (i = 0; start_mb < frame_height_in_mbs; i++) {
+        av_assert0(i < APV_MAX_TILE_ROWS);
+        ti->row_starts[i] = start_mb * APV_MB_HEIGHT;
+        start_mb += fh->tile_info.tile_height_in_mbs;
+    }
+    av_assert0(i <= APV_MAX_TILE_ROWS);
+    ti->row_starts[i] = frame_height_in_mbs * APV_MB_HEIGHT;
+    ti->tile_rows = i;
+
+    ti->num_tiles = ti->tile_cols * ti->tile_rows;
+}
+
+
+#define HEADER(name) do { \
+        ff_cbs_trace_header(ctx, name); \
+    } while (0)
+
+#define CHECK(call) do { \
+        err = (call); \
+        if (err < 0) \
+            return err; \
+    } while (0)
+
+#define SUBSCRIPTS(subs, ...) (subs > 0 ? ((int[subs + 1]){ subs, __VA_ARGS__ }) : NULL)
+
+
+#define u(width, name, range_min, range_max) \
+    xu(width, name, current->name, range_min, range_max, 0, )
+#define ub(width, name) \
+    xu(width, name, current->name, 0, MAX_UINT_BITS(width), 0, )
+#define us(width, name, range_min, range_max, subs, ...) \
+    xu(width, name, current->name, range_min, range_max,  subs, __VA_ARGS__)
+#define ubs(width, name, subs, ...) \
+    xu(width, name, current->name, 0, MAX_UINT_BITS(width),  subs, __VA_ARGS__)
+
+#define fixed(width, name, value) do { \
+        av_unused uint32_t fixed_value = value; \
+        xu(width, name, fixed_value, value, value, 0, ); \
+    } while (0)
+
+
+#define READ
+#define READWRITE read
+#define RWContext GetBitContext
+#define FUNC(name) cbs_apv_read_ ## name
+
+#define xu(width, name, var, range_min, range_max, subs, ...) do { \
+        uint32_t value; \
+        CHECK(ff_cbs_read_unsigned(ctx, rw, width, #name, \
+                                   SUBSCRIPTS(subs, __VA_ARGS__), \
+                                   &value, range_min, range_max)); \
+        var = value; \
+    } while (0)
+
+#define infer(name, value) do { \
+        current->name = value; \
+    } while (0)
+
+#define byte_alignment(rw) (get_bits_count(rw) % 8)
+
+#include "cbs_apv_syntax_template.c"
+
+#undef READ
+#undef READWRITE
+#undef RWContext
+#undef FUNC
+#undef xu
+#undef infer
+#undef byte_alignment
+
+#define WRITE
+#define READWRITE write
+#define RWContext PutBitContext
+#define FUNC(name) cbs_apv_write_ ## name
+
+#define xu(width, name, var, range_min, range_max, subs, ...) do { \
+        uint32_t value = var; \
+        CHECK(ff_cbs_write_unsigned(ctx, rw, width, #name, \
+                                    SUBSCRIPTS(subs, __VA_ARGS__), \
+                                    value, range_min, range_max)); \
+    } while (0)
+
+#define infer(name, value) do { \
+        if (current->name != (value)) { \
+            av_log(ctx->log_ctx, AV_LOG_ERROR, \
+                   "%s does not match inferred value: " \
+                   "%"PRId64", but should be %"PRId64".\n", \
+                   #name, (int64_t)current->name, (int64_t)(value)); \
+            return AVERROR_INVALIDDATA; \
+        } \
+    } while (0)
+
+#define byte_alignment(rw) (put_bits_count(rw) % 8)
+
+#include "cbs_apv_syntax_template.c"
+
+#undef WRITE
+#undef READWRITE
+#undef RWContext
+#undef FUNC
+#undef xu
+#undef infer
+#undef byte_alignment
+
+
+static int cbs_apv_split_fragment(CodedBitstreamContext *ctx,
+                                  CodedBitstreamFragment *frag,
+                                  int header)
+{
+    uint8_t *data = frag->data;
+    size_t   size = frag->data_size;
+    int err, trace;
+
+    // Don't include parsing here in trace output.
+    trace = ctx->trace_enable;
+    ctx->trace_enable = 0;
+
+    while (size > 0) {
+        GetBitContext   gbc;
+        uint32_t        pbu_size;
+        APVRawPBUHeader pbu_header;
+
+        if (size < 8) {
+            av_log(ctx->log_ctx, AV_LOG_ERROR, "Invalid PBU: "
+                   "fragment too short (%"SIZE_SPECIFIER" bytes).\n",
+                   size);
+            err = AVERROR_INVALIDDATA;
+            goto fail;
+        }
+
+        pbu_size = AV_RB32(data);
+        if (pbu_size < 4) {
+            av_log(ctx->log_ctx, AV_LOG_ERROR, "Invalid PBU: "
+                   "pbu_size too small (%"PRIu32" bytes).\n",
+                   pbu_size);
+            err = AVERROR_INVALIDDATA;
+            goto fail;
+        }
+
+        data += 4;
+        size -= 4;
+
+        if (pbu_size > size) {
+            av_log(ctx->log_ctx, AV_LOG_ERROR, "Invalid PBU: "
+                   "pbu_size too large (%"PRIu32" bytes).\n",
+                   pbu_size);
+            err = AVERROR_INVALIDDATA;
+            goto fail;
+        }
+
+        init_get_bits(&gbc, data, 8 * pbu_size);
+
+        err = cbs_apv_read_pbu_header(ctx, &gbc, &pbu_header);
+        if (err < 0)
+            goto fail;
+
+        // Could select/skip frames based on type/group_id here.
+
+        err = ff_cbs_append_unit_data(frag, pbu_header.pbu_type,
+                                      data, pbu_size, frag->data_ref);
+        if (err < 0)
+            goto fail;
+
+        data += pbu_size;
+        size -= pbu_size;
+    }
+
+    err = 0;
+fail:
+    ctx->trace_enable = trace;
+    return err;
+}
+
+static int cbs_apv_read_unit(CodedBitstreamContext *ctx,
+                             CodedBitstreamUnit *unit)
+{
+    GetBitContext gbc;
+    int err;
+
+    err = init_get_bits(&gbc, unit->data, 8 * unit->data_size);
+    if (err < 0)
+        return err;
+
+    err = ff_cbs_alloc_unit_content(ctx, unit);
+    if (err < 0)
+        return err;
+
+    switch (unit->type) {
+    case APV_PBU_PRIMARY_FRAME:
+        {
+            APVRawFrame *frame = unit->content;
+
+            err = cbs_apv_read_frame(ctx, &gbc, frame);
+            if (err < 0)
+                return err;
+
+            // Each tile inside the frame has pointers into the unit
+            // data buffer; make a single reference here for all of
+            // them together.
+            frame->tile_data_ref = av_buffer_ref(unit->data_ref);
+            if (!frame->tile_data_ref)
+                return AVERROR(ENOMEM);
+        }
+        break;
+    case APV_PBU_ACCESS_UNIT_INFORMATION:
+        {
+            err = cbs_apv_read_au_info(ctx, &gbc, unit->content);
+            if (err < 0)
+                return err;
+        }
+        break;
+    case APV_PBU_METADATA:
+        {
+            err = cbs_apv_read_metadata(ctx, &gbc, unit->content);
+            if (err < 0)
+                return err;
+        }
+        break;
+    case APV_PBU_FILLER:
+        {
+            err = cbs_apv_read_filler(ctx, &gbc, unit->content);
+            if (err < 0)
+                return err;
+        }
+        break;
+    default:
+        return AVERROR(ENOSYS);
+    }
+
+    return 0;
+}
+
+static int cbs_apv_write_unit(CodedBitstreamContext *ctx,
+                              CodedBitstreamUnit *unit,
+                              PutBitContext *pbc)
+{
+    int err;
+
+    switch (unit->type) {
+    case APV_PBU_PRIMARY_FRAME:
+        {
+            APVRawFrame *frame = unit->content;
+
+            err = cbs_apv_write_frame(ctx, pbc, frame);
+            if (err < 0)
+                return err;
+        }
+        break;
+    case APV_PBU_ACCESS_UNIT_INFORMATION:
+        {
+            err = cbs_apv_write_au_info(ctx, pbc, unit->content);
+            if (err < 0)
+                return err;
+        }
+        break;
+    case APV_PBU_METADATA:
+        {
+            err = cbs_apv_write_metadata(ctx, pbc, unit->content);
+            if (err < 0)
+                return err;
+        }
+        break;
+    case APV_PBU_FILLER:
+        {
+            err = cbs_apv_write_filler(ctx, pbc, unit->content);
+            if (err < 0)
+                return err;
+        }
+        break;
+    default:
+        return AVERROR(ENOSYS);
+    }
+
+    return 0;
+}
+
+static int cbs_apv_assemble_fragment(CodedBitstreamContext *ctx,
+                                     CodedBitstreamFragment *frag)
+{
+    size_t size = 0, pos;
+
+    for (int i = 0; i < frag->nb_units; i++)
+        size += frag->units[i].data_size + 4;
+
+    frag->data_ref = av_buffer_alloc(size + AV_INPUT_BUFFER_PADDING_SIZE);
+    if (!frag->data_ref)
+        return AVERROR(ENOMEM);
+    frag->data = frag->data_ref->data;
+    memset(frag->data + size, 0, AV_INPUT_BUFFER_PADDING_SIZE);
+
+    pos = 0;
+    for (int i = 0; i < frag->nb_units; i++) {
+        AV_WB32(frag->data + pos, frag->units[i].data_size);
+        pos += 4;
+
+        memcpy(frag->data + pos, frag->units[i].data,
+               frag->units[i].data_size);
+        pos += frag->units[i].data_size;
+    }
+    av_assert0(pos == size);
+    frag->data_size = size;
+
+    return 0;
+}
+
+
+static const CodedBitstreamUnitTypeDescriptor cbs_apv_unit_types[] = {
+    {
+        .nb_unit_types   = CBS_UNIT_TYPE_RANGE,
+        .unit_type.range = {
+            .start       = APV_PBU_PRIMARY_FRAME,
+            .end         = APV_PBU_ALPHA_FRAME,
+        },
+        .content_type    = CBS_CONTENT_TYPE_INTERNAL_REFS,
+        .content_size    = sizeof(APVRawFrame),
+        .type.ref        = {
+            .nb_offsets  = 1,
+            .offsets     = { offsetof(APVRawFrame, tile_data_ref) -
+                             sizeof(void*) },
+        },
+    },
+
+    CBS_UNIT_TYPE_POD(APV_PBU_METADATA, APVRawMetadata),
+
+    CBS_UNIT_TYPE_END_OF_LIST
+};
+
+const CodedBitstreamType ff_cbs_type_apv = {
+    .codec_id          = AV_CODEC_ID_APV,
+
+    .priv_data_size    = sizeof(CodedBitstreamAPVContext),
+
+    .unit_types        = cbs_apv_unit_types,
+
+    .split_fragment    = &cbs_apv_split_fragment,
+    .read_unit         = &cbs_apv_read_unit,
+    .write_unit        = &cbs_apv_write_unit,
+    .assemble_fragment = &cbs_apv_assemble_fragment,
+};
diff --git a/libavcodec/cbs_apv.h b/libavcodec/cbs_apv.h
new file mode 100644
index 0000000000..cbaeb45acb
--- /dev/null
+++ b/libavcodec/cbs_apv.h
@@ -0,0 +1,207 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef AVCODEC_CBS_APV_H
+#define AVCODEC_CBS_APV_H
+
+#include <stddef.h>
+#include <stdint.h>
+
+#include "libavutil/buffer.h"
+#include "apv.h"
+
+// Arbitrary limits to avoid large structures.
+#define CBS_APV_MAX_AU_FRAMES         8
+#define CBS_APV_MAX_METADATA_PAYLOADS 8
+
+
+typedef struct APVRawPBUHeader {
+    uint8_t  pbu_type;
+    uint16_t group_id;
+    uint8_t  reserved_zero_8bits;
+} APVRawPBUHeader;
+
+typedef struct APVRawFiller {
+    size_t   filler_size;
+} APVRawFiller;
+
+typedef struct APVRawFrameInfo {
+    uint8_t  profile_idc;
+    uint8_t  level_idc;
+    uint8_t  band_idc;
+    uint8_t  reserved_zero_5bits;
+    uint32_t frame_width;
+    uint32_t frame_height;
+    uint8_t  chroma_format_idc;
+    uint8_t  bit_depth_minus8;
+    uint8_t  capture_time_distance;
+    uint8_t  reserved_zero_8bits;
+} APVRawFrameInfo;
+
+typedef struct APVRawQuantizationMatrix {
+    uint8_t  q_matrix[APV_MAX_NUM_COMP][APV_TR_SIZE][APV_TR_SIZE];
+} APVRawQuantizationMatrix;
+
+typedef struct APVRawTileInfo {
+    uint32_t tile_width_in_mbs;
+    uint32_t tile_height_in_mbs;
+    uint8_t  tile_size_present_in_fh_flag;
+    uint32_t tile_size_in_fh[APV_MAX_TILE_COUNT];
+} APVRawTileInfo;
+
+typedef struct APVRawFrameHeader {
+    APVRawFrameInfo frame_info;
+    uint8_t  reserved_zero_8bits;
+
+    uint8_t  color_description_present_flag;
+    uint8_t  color_primaries;
+    uint8_t  transfer_characteristics;
+    uint8_t  matrix_coefficients;
+    uint8_t  full_range_flag;
+
+    uint8_t  use_q_matrix;
+    APVRawQuantizationMatrix quantization_matrix;
+
+    APVRawTileInfo tile_info;
+
+    uint8_t  reserved_zero_8bits_2;
+} APVRawFrameHeader;
+
+typedef struct APVRawTileHeader {
+    uint16_t tile_header_size;
+    uint16_t tile_index;
+    uint32_t tile_data_size[APV_MAX_NUM_COMP];
+    uint8_t  tile_qp       [APV_MAX_NUM_COMP];
+    uint8_t  reserved_zero_8bits;
+} APVRawTileHeader;
+
+typedef struct APVRawTile {
+    APVRawTileHeader tile_header;
+
+    uint8_t         *tile_data[APV_MAX_NUM_COMP];
+    uint8_t         *tile_dummy_byte;
+    uint32_t         tile_dummy_byte_size;
+} APVRawTile;
+
+typedef struct APVRawFrame {
+    APVRawPBUHeader   pbu_header;
+    APVRawFrameHeader frame_header;
+    uint32_t          tile_size[APV_MAX_TILE_COUNT];
+    APVRawTile        tile     [APV_MAX_TILE_COUNT];
+    APVRawFiller      filler;
+
+    AVBufferRef      *tile_data_ref;
+} APVRawFrame;
+
+typedef struct APVRawAUInfo {
+    uint16_t num_frames;
+
+    uint8_t  pbu_type           [CBS_APV_MAX_AU_FRAMES];
+    uint8_t  group_id           [CBS_APV_MAX_AU_FRAMES];
+    uint8_t  reserved_zero_8bits[CBS_APV_MAX_AU_FRAMES];
+    APVRawFrameInfo frame_info  [CBS_APV_MAX_AU_FRAMES];
+
+    uint8_t  reserved_zero_8bits_2;
+
+    APVRawFiller filler;
+} APVRawAUInfo;
+
+typedef struct APVRawMetadataITUTT35 {
+    uint8_t  itu_t_t35_country_code;
+    uint8_t  itu_t_t35_country_code_extension;
+
+    uint8_t     *data;
+    AVBufferRef *data_ref;
+    size_t       data_size;
+} APVRawMetadataITUTT35;
+
+typedef struct APVRawMetadataMDCV {
+    uint16_t primary_chromaticity_x[3];
+    uint16_t primary_chromaticity_y[3];
+    uint16_t white_point_chromaticity_x;
+    uint16_t white_point_chromaticity_y;
+    uint32_t max_mastering_luminance;
+    uint32_t min_mastering_luminance;
+} APVRawMetadataMDCV;
+
+typedef struct APVRawMetadataCLL {
+    uint16_t max_cll;
+    uint16_t max_fall;
+} APVRawMetadataCLL;
+
+typedef struct APVRawMetadataFiller {
+    uint32_t payload_size;
+} APVRawMetadataFiller;
+
+typedef struct APVRawMetadataUserDefined {
+    uint8_t  uuid[16];
+
+    uint8_t     *data;
+    AVBufferRef *data_ref;
+    size_t       data_size;
+} APVRawMetadataUserDefined;
+
+typedef struct APVRawMetadataUndefined {
+    uint8_t     *data;
+    AVBufferRef *data_ref;
+    size_t       data_size;
+} APVRawMetadataUndefined;
+
+typedef struct APVRawMetadataPayload {
+    uint32_t payload_type;
+    uint32_t payload_size;
+    union {
+        APVRawMetadataITUTT35     itu_t_t35;
+        APVRawMetadataMDCV        mdcv;
+        APVRawMetadataCLL         cll;
+        APVRawMetadataFiller      filler;
+        APVRawMetadataUserDefined user_defined;
+        APVRawMetadataUndefined   undefined;
+    };
+} APVRawMetadataPayload;
+
+typedef struct APVRawMetadata {
+    APVRawPBUHeader pbu_header;
+
+    uint32_t metadata_size;
+    uint32_t metadata_count;
+
+    APVRawMetadataPayload payloads[CBS_APV_MAX_METADATA_PAYLOADS];
+
+    APVRawFiller filler;
+} APVRawMetadata;
+
+
+typedef struct APVDerivedTileInfo {
+    uint8_t  tile_cols;
+    uint8_t  tile_rows;
+    uint16_t num_tiles;
+    // The spec uses an extra element on the end of these arrays
+    // not corresponding to any tile.
+    uint16_t col_starts[APV_MAX_TILE_COLS + 1];
+    uint16_t row_starts[APV_MAX_TILE_ROWS + 1];
+} APVDerivedTileInfo;
+
+typedef struct CodedBitstreamAPVContext {
+    int bit_depth;
+    int num_comp;
+
+    APVDerivedTileInfo tile_info;
+} CodedBitstreamAPVContext;
+
+#endif /* AVCODEC_CBS_APV_H */
diff --git a/libavcodec/cbs_apv_syntax_template.c b/libavcodec/cbs_apv_syntax_template.c
new file mode 100644
index 0000000000..2ebe6050cf
--- /dev/null
+++ b/libavcodec/cbs_apv_syntax_template.c
@@ -0,0 +1,595 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+static int FUNC(pbu_header)(CodedBitstreamContext *ctx, RWContext *rw,
+                            APVRawPBUHeader *current)
+{
+    int err;
+
+    ub(8,  pbu_type);
+    ub(16, group_id);
+    u(8, reserved_zero_8bits, 0, 0);
+
+    return 0;
+}
+
+static int FUNC(byte_alignment)(CodedBitstreamContext *ctx, RWContext *rw)
+{
+    int err;
+
+    while (byte_alignment(rw) != 0)
+        fixed(1, alignment_bit_equal_to_zero, 0);
+
+    return 0;
+}
+
+static int FUNC(filler)(CodedBitstreamContext *ctx, RWContext *rw,
+                        APVRawFiller *current)
+{
+    int err;
+
+#ifdef READ
+    current->filler_size = 0;
+    while (show_bits(rw, 8) == 0xff) {
+        fixed(8, ff_byte, 0xff);
+        ++current->filler_size;
+    }
+#else
+    {
+        uint32_t i;
+        for (i = 0; i < current->filler_size; i++)
+            fixed(8, ff_byte, 0xff);
+    }
+#endif
+
+    return 0;
+}
+
+static int FUNC(frame_info)(CodedBitstreamContext *ctx, RWContext *rw,
+                            APVRawFrameInfo *current)
+{
+    int err;
+
+    ub(8,  profile_idc);
+    ub(8,  level_idc);
+    ub(3,  band_idc);
+
+    u(5, reserved_zero_5bits, 0, 0);
+
+    ub(24, frame_width);
+    ub(24, frame_height);
+
+    u(4, chroma_format_idc, 0, 4);
+    if (current->chroma_format_idc == 1) {
+        av_log(ctx->log_ctx, AV_LOG_ERROR,
+               "chroma_format_idc 1 for 4:2:0 is not allowed in APV.\n");
+        return AVERROR_INVALIDDATA;
+    }
+
+    u(4, bit_depth_minus8, 2, 8);
+
+    ub(8, capture_time_distance);
+
+    u(8, reserved_zero_8bits, 0, 0);
+
+    return 0;
+}
+
+static int FUNC(quantization_matrix)(CodedBitstreamContext *ctx,
+                                     RWContext *rw,
+                                     APVRawQuantizationMatrix *current)
+{
+    const CodedBitstreamAPVContext *priv = ctx->priv_data;
+    int err;
+
+    for (int c = 0; c < priv->num_comp; c++) {
+        for (int y = 0; y < 8; y++) {
+            for (int x = 0; x < 8; x++) {
+                us(8, q_matrix[c][x][y], 1, 255, 3, c, x, y);
+            }
+        }
+    }
+
+    return 0;
+}
+
+static int FUNC(tile_info)(CodedBitstreamContext *ctx, RWContext *rw,
+                           APVRawTileInfo *current,
+                           const APVRawFrameHeader *fh)
+{
+    CodedBitstreamAPVContext *priv = ctx->priv_data;
+    int err;
+
+    u(20, tile_width_in_mbs,
+      APV_MIN_TILE_WIDTH_IN_MBS,  MAX_UINT_BITS(20));
+    u(20, tile_height_in_mbs,
+      APV_MIN_TILE_HEIGHT_IN_MBS, MAX_UINT_BITS(20));
+
+    ub(1, tile_size_present_in_fh_flag);
+
+    cbs_apv_derive_tile_info(&priv->tile_info, fh);
+
+    if (current->tile_size_present_in_fh_flag) {
+        for (int t = 0; t < priv->tile_info.num_tiles; t++) {
+            us(32, tile_size_in_fh[t], 10, MAX_UINT_BITS(32), 1, t);
+        }
+    }
+
+    return 0;
+}
+
+static int FUNC(frame_header)(CodedBitstreamContext *ctx, RWContext *rw,
+                              APVRawFrameHeader *current)
+{
+    CodedBitstreamAPVContext *priv = ctx->priv_data;
+    int err;
+
+    CHECK(FUNC(frame_info)(ctx, rw, &current->frame_info));
+
+    u(8, reserved_zero_8bits, 0, 0);
+
+    ub(1, color_description_present_flag);
+    if (current->color_description_present_flag) {
+        ub(8, color_primaries);
+        ub(8, transfer_characteristics);
+        ub(8, matrix_coefficients);
+        ub(1, full_range_flag);
+    } else {
+        infer(color_primaries,          2);
+        infer(transfer_characteristics, 2);
+        infer(matrix_coefficients,      2);
+        infer(full_range_flag,          0);
+    }
+
+    priv->bit_depth = current->frame_info.bit_depth_minus8 + 8;
+    priv->num_comp  = cbs_apv_get_num_comp(current);
+
+    ub(1, use_q_matrix);
+    if (current->use_q_matrix) {
+        CHECK(FUNC(quantization_matrix)(ctx, rw,
+                                        &current->quantization_matrix));
+    } else {
+        for (int c = 0; c < priv->num_comp; c++) {
+            for (int y = 0; y < 8; y++) {
+                for (int x = 0; x < 8 ; x++) {
+                    infer(quantization_matrix.q_matrix[c][y][x], 16);
+                }
+            }
+        }
+    }
+
+    CHECK(FUNC(tile_info)(ctx, rw, &current->tile_info, current));
+
+    u(8, reserved_zero_8bits_2, 0, 0);
+
+    CHECK(FUNC(byte_alignment)(ctx, rw));
+
+    return 0;
+}
+
+static int FUNC(tile_header)(CodedBitstreamContext *ctx, RWContext *rw,
+                             APVRawTileHeader *current, int tile_idx)
+{
+    const CodedBitstreamAPVContext *priv = ctx->priv_data;
+    uint16_t expected_tile_header_size;
+    uint8_t max_qp;
+    int err;
+
+    expected_tile_header_size = 4 + priv->num_comp * (4 + 1) + 1;
+
+    u(16, tile_header_size,
+      expected_tile_header_size, expected_tile_header_size);
+
+    u(16, tile_index, tile_idx, tile_idx);
+
+    for (int c = 0; c < priv->num_comp; c++) {
+        us(32, tile_data_size[c], 1, MAX_UINT_BITS(32), 1, c);
+    }
+
+    max_qp = 3 + priv->bit_depth * 6;
+    for (int c = 0; c < priv->num_comp; c++) {
+        us(8, tile_qp[c], 0, max_qp, 1, c);
+    }
+
+    u(8, reserved_zero_8bits, 0, 0);
+
+    return 0;
+}
+
+static int FUNC(tile)(CodedBitstreamContext *ctx, RWContext *rw,
+                      APVRawTile *current, int tile_idx)
+{
+    const CodedBitstreamAPVContext *priv = ctx->priv_data;
+    int err;
+
+    CHECK(FUNC(tile_header)(ctx, rw, &current->tile_header, tile_idx));
+
+    for (int c = 0; c < priv->num_comp; c++) {
+        uint32_t comp_size = current->tile_header.tile_data_size[c];
+#ifdef READ
+        int pos = get_bits_count(rw);
+        av_assert0(pos % 8 == 0);
+        current->tile_data[c] = (uint8_t*)align_get_bits(rw);
+        skip_bits_long(rw, 8 * comp_size);
+#else
+        if (put_bytes_left(rw, 0) < comp_size)
+            return AVERROR(ENOSPC);
+        ff_copy_bits(rw, current->tile_data[c], comp_size * 8);
+#endif
+    }
+
+    return 0;
+}
+
+static int FUNC(frame)(CodedBitstreamContext *ctx, RWContext *rw,
+                       APVRawFrame *current)
+{
+    const CodedBitstreamAPVContext *priv = ctx->priv_data;
+    int err;
+
+    HEADER("Frame");
+
+    CHECK(FUNC(pbu_header)(ctx, rw, &current->pbu_header));
+
+    CHECK(FUNC(frame_header)(ctx, rw, &current->frame_header));
+
+    for (int t = 0; t < priv->tile_info.num_tiles; t++) {
+        us(32, tile_size[t], 10, MAX_UINT_BITS(32), 1, t);
+
+        CHECK(FUNC(tile)(ctx, rw, &current->tile[t], t));
+    }
+
+    CHECK(FUNC(filler)(ctx, rw, &current->filler));
+
+    return 0;
+}
+
+static int FUNC(au_info)(CodedBitstreamContext *ctx, RWContext *rw,
+                         APVRawAUInfo *current)
+{
+    int err;
+
+    HEADER("Access Unit Information");
+
+    u(16, num_frames, 1, CBS_APV_MAX_AU_FRAMES);
+
+    for (int i = 0; i < current->num_frames; i++) {
+        ubs(8, pbu_type[i], 1, i);
+        ubs(8, group_id[i], 1, i);
+
+        us(8, reserved_zero_8bits[i], 0, 0, 1, i);
+
+        CHECK(FUNC(frame_info)(ctx, rw, &current->frame_info[i]));
+    }
+
+    u(8, reserved_zero_8bits_2, 0, 0);
+
+    return 0;
+}
+
+static int FUNC(metadata_itu_t_t35)(CodedBitstreamContext *ctx,
+                                    RWContext *rw,
+                                    APVRawMetadataITUTT35 *current,
+                                    size_t payload_size)
+{
+    int err;
+    size_t read_size = payload_size - 1;
+
+    HEADER("ITU-T T.35 Metadata");
+
+    ub(8, itu_t_t35_country_code);
+
+    if (current->itu_t_t35_country_code == 0xff) {
+        ub(8, itu_t_t35_country_code_extension);
+        --read_size;
+    }
+
+#ifdef READ
+    current->data_size = read_size;
+    current->data_ref  = av_buffer_alloc(current->data_size);
+    if (!current->data_ref)
+        return AVERROR(ENOMEM);
+    current->data = current->data_ref->data;
+#else
+    if (current->data_size != read_size) {
+        av_log(ctx->log_ctx, AV_LOG_ERROR, "Write size mismatch: "
+               "payload %zu but expecting %zu\n",
+               current->data_size, read_size);
+        return AVERROR(EINVAL);
+    }
+#endif
+
+    for (size_t i = 0; i < current->data_size; i++) {
+        xu(8, itu_t_t35_payload[i],
+           current->data[i], 0x00, 0xff, 1, i);
+    }
+
+    return 0;
+}
+
+static int FUNC(metadata_mdcv)(CodedBitstreamContext *ctx,
+                               RWContext *rw,
+                               APVRawMetadataMDCV *current)
+{
+    int err, i;
+
+    HEADER("MDCV Metadata");
+
+    for (i = 0; i < 3; i++) {
+        ubs(16, primary_chromaticity_x[i], 1, i);
+        ubs(16, primary_chromaticity_y[i], 1, i);
+    }
+
+    ub(16, white_point_chromaticity_x);
+    ub(16, white_point_chromaticity_y);
+
+    ub(32, max_mastering_luminance);
+    ub(32, min_mastering_luminance);
+
+    return 0;
+}
+
+static int FUNC(metadata_cll)(CodedBitstreamContext *ctx,
+                              RWContext *rw,
+                              APVRawMetadataCLL *current)
+{
+    int err;
+
+    HEADER("CLL Metadata");
+
+    ub(16, max_cll);
+    ub(16, max_fall);
+
+    return 0;
+}
+
+static int FUNC(metadata_filler)(CodedBitstreamContext *ctx,
+                                 RWContext *rw,
+                                 APVRawMetadataFiller *current,
+                                 size_t payload_size)
+{
+    int err;
+
+    HEADER("Filler Metadata");
+
+    for (size_t i = 0; i < payload_size; i++)
+        fixed(8, ff_byte, 0xff);
+
+    return 0;
+}
+
+static int FUNC(metadata_user_defined)(CodedBitstreamContext *ctx,
+                                       RWContext *rw,
+                                       APVRawMetadataUserDefined *current,
+                                       size_t payload_size)
+{
+    int err;
+
+    HEADER("User-Defined Metadata");
+
+    for (int i = 0; i < 16; i++)
+        ubs(8, uuid[i], 1, i);
+
+#ifdef READ
+    current->data_size = payload_size - 16;
+    current->data_ref  = av_buffer_alloc(current->data_size);
+    if (!current->data_ref)
+        return AVERROR(ENOMEM);
+    current->data = current->data_ref->data;
+#else
+    if (current->data_size != payload_size - 16) {
+        av_log(ctx->log_ctx, AV_LOG_ERROR, "Write size mismatch: "
+               "payload %zu but expecting %zu\n",
+               current->data_size, payload_size - 16);
+        return AVERROR(EINVAL);
+    }
+#endif
+
+    for (size_t i = 0; i < current->data_size; i++) {
+        xu(8, user_defined_data_payload[i],
+           current->data[i], 0x00, 0xff, 1, i);
+    }
+
+    return 0;
+}
+
+static int FUNC(metadata_undefined)(CodedBitstreamContext *ctx,
+                                    RWContext *rw,
+                                    APVRawMetadataUndefined *current,
+                                    size_t payload_size)
+{
+    int err;
+
+    HEADER("Undefined Metadata");
+
+#ifdef READ
+    current->data_size = payload_size;
+    current->data_ref  = av_buffer_alloc(current->data_size);
+    if (!current->data_ref)
+        return AVERROR(ENOMEM);
+    current->data = current->data_ref->data;
+#else
+    if (current->data_size != payload_size) {
+        av_log(ctx->log_ctx, AV_LOG_ERROR, "Write size mismatch: "
+               "payload %zu but expecting %zu\n",
+               current->data_size, payload_size);
+        return AVERROR(EINVAL);
+    }
+#endif
+
+    for (size_t i = 0; i < current->data_size; i++) {
+        xu(8, undefined_metadata_payload_byte[i],
+           current->data[i], 0x00, 0xff, 1, i);
+    }
+
+    return 0;
+}
+
+static int FUNC(metadata_payload)(CodedBitstreamContext *ctx,
+                                  RWContext *rw,
+                                  APVRawMetadataPayload *current)
+{
+    int err;
+
+    switch (current->payload_type) {
+    case APV_METADATA_ITU_T_T35:
+        CHECK(FUNC(metadata_itu_t_t35)(ctx, rw,
+                                       &current->itu_t_t35,
+                                       current->payload_size));
+        break;
+    case APV_METADATA_MDCV:
+        CHECK(FUNC(metadata_mdcv)(ctx, rw, &current->mdcv));
+        break;
+    case APV_METADATA_CLL:
+        CHECK(FUNC(metadata_cll)(ctx, rw, &current->cll));
+        break;
+    case APV_METADATA_FILLER:
+        CHECK(FUNC(metadata_filler)(ctx, rw,
+                                    &current->filler,
+                                    current->payload_size));
+        break;
+    case APV_METADATA_USER_DEFINED:
+        CHECK(FUNC(metadata_user_defined)(ctx, rw,
+                                          &current->user_defined,
+                                          current->payload_size));
+        break;
+    default:
+        CHECK(FUNC(metadata_undefined)(ctx, rw,
+                                       &current->undefined,
+                                       current->payload_size));
+    }
+
+    return 0;
+}
+
+static int FUNC(metadata)(CodedBitstreamContext *ctx, RWContext *rw,
+                          APVRawMetadata *current)
+{
+    int err;
+
+#ifdef WRITE
+    PutBitContext metadata_start_state;
+    uint32_t metadata_start_position;
+    int trace;
+#endif
+
+    HEADER("Metadata");
+
+    CHECK(FUNC(pbu_header)(ctx, rw, &current->pbu_header));
+
+#ifdef READ
+    ub(32, metadata_size);
+    uint32_t metadata_bytes_left = current->metadata_size;
+
+    for (int p = 0; p < CBS_APV_MAX_METADATA_PAYLOADS; p++) {
+        APVRawMetadataPayload *pl = &current->payloads[p];
+        uint32_t tmp;
+
+        pl->payload_type = 0;
+        while (show_bits(rw, 8) == 0xff) {
+            fixed(8, ff_byte, 0xff);
+            pl->payload_type += 255;
+            --metadata_bytes_left;
+        }
+        xu(8, metadata_payload_type, tmp, 0, 254, 0);
+        pl->payload_type += tmp;
+        --metadata_bytes_left;
+
+        pl->payload_size = 0;
+        while (show_bits(rw, 8) == 0xff) {
+            fixed(8, ff_byte, 0xff);
+            pl->payload_size += 255;
+            --metadata_bytes_left;
+        }
+        xu(8, metadata_payload_size, tmp, 0, 254, 0);
+        pl->payload_size += tmp;
+        --metadata_bytes_left;
+
+        if (pl->payload_size > metadata_bytes_left) {
+            av_log(ctx->log_ctx, AV_LOG_ERROR, "Invalid metadata: "
+                   "payload_size larger than remaining metadata size "
+                   "(%"PRIu32" bytes).\n", pl->payload_size);
+            return AVERROR_INVALIDDATA;
+        }
+
+        CHECK(FUNC(metadata_payload)(ctx, rw, pl));
+
+        metadata_bytes_left -= pl->payload_size;
+
+        if (metadata_bytes_left == 0)
+            break;
+    }
+#else
+    // Two passes: the first write finds the size (with tracing
+    // disabled), the second write does the real write.
+
+    metadata_start_state = *rw;
+    metadata_start_position = put_bits_count(rw);
+
+    trace = ctx->trace_enable;
+    ctx->trace_enable = 0;
+
+    for (int pass = 1; pass <= 2; pass++) {
+        *rw = metadata_start_state;
+
+        ub(32, metadata_size);
+
+        for (int p = 0; p < current->metadata_count; p++) {
+            APVRawMetadataPayload *pl = &current->payloads[p];
+            uint32_t payload_start_position;
+            uint32_t tmp;
+
+            payload_start_position = put_bits_count(rw);
+
+            tmp = pl->payload_type;
+            while (tmp >= 255) {
+                fixed(8, ff_byte, 0xff);
+                tmp -= 255;
+            }
+            xu(8, metadata_payload_type, tmp, 0, 254, 0);
+
+            tmp = pl->payload_size;
+            while (tmp >= 255) {
+                fixed(8, ff_byte, 0xff);
+                tmp -= 255;
+            }
+            xu(8, metadata_payload_size, tmp, 0, 254, 0);
+
+            err = FUNC(metadata_payload)(ctx, rw, pl);
+            ctx->trace_enable = trace;
+            if (err < 0)
+                return err;
+
+            if (pass == 1) {
+                pl->payload_size = (put_bits_count(rw) -
+                                    payload_start_position) / 8;
+            }
+        }
+
+        if (pass == 1) {
+            current->metadata_size = (put_bits_count(rw) -
+                                      metadata_start_position) / 8;
+            ctx->trace_enable = trace;
+        }
+    }
+#endif
+
+    CHECK(FUNC(filler)(ctx, rw, &current->filler));
+
+    return 0;
+}
diff --git a/libavcodec/cbs_internal.h b/libavcodec/cbs_internal.h
index 1ed1f04c15..c3265924ba 100644
--- a/libavcodec/cbs_internal.h
+++ b/libavcodec/cbs_internal.h
@@ -42,6 +42,9 @@
 #define CBS_TRACE 1
 #endif
 
+#ifndef CBS_APV
+#define CBS_APV CONFIG_CBS_APV
+#endif
 #ifndef CBS_AV1
 #define CBS_AV1 CONFIG_CBS_AV1
 #endif
@@ -383,6 +386,7 @@ int CBS_FUNC(write_signed)(CodedBitstreamContext *ctx, PutBitContext *pbc,
 #define CBS_UNIT_TYPE_END_OF_LIST { .nb_unit_types = 0 }
 
 
+extern const CodedBitstreamType CBS_FUNC(type_apv);
 extern const CodedBitstreamType CBS_FUNC(type_av1);
 extern const CodedBitstreamType CBS_FUNC(type_h264);
 extern const CodedBitstreamType CBS_FUNC(type_h265);
diff --git a/libavformat/cbs.h b/libavformat/cbs.h
index e4dc231001..0fab3a7457 100644
--- a/libavformat/cbs.h
+++ b/libavformat/cbs.h
@@ -22,6 +22,7 @@
 #define CBS_PREFIX lavf_cbs
 #define CBS_WRITE 0
 #define CBS_TRACE 0
+#define CBS_APV 0
 #define CBS_H264 0
 #define CBS_H265 0
 #define CBS_H266 0
-- 
2.47.2
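
The metadata payload type and size fields in the patch above are coded as a run of 0xff bytes (each adding 255) followed by one final byte in the range 0..254. Below is a minimal standalone sketch of that coding, for illustration only. The `sketch_*` names are hypothetical and are not part of the patch or of the CBS API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Write value as N bytes of 0xff followed by one byte < 0xff.
 * Returns the number of bytes written, or 0 if out does not have room.
 */
static size_t sketch_put_escaped(uint8_t *out, size_t cap, uint32_t value)
{
    size_t n = 0;

    assert(out);
    while (value >= 255) {
        if (n >= cap)
            return 0;
        out[n++] = 0xff;
        value -= 255;
    }
    if (n >= cap)
        return 0;
    out[n++] = (uint8_t)value;
    return n;
}

/*
 * Read a value written by sketch_put_escaped().
 * Returns the number of bytes consumed, or 0 if the input is truncated.
 */
static size_t sketch_get_escaped(const uint8_t *in, size_t len,
                                 uint32_t *value)
{
    size_t   n = 0;
    uint32_t v = 0;

    assert(in && value);
    while (n < len && in[n] == 0xff) {
        v += 255;
        n++;
    }
    if (n >= len)
        return 0;
    v += in[n++];
    *value = v;
    return n;
}
```

Every byte consumed this way counts against metadata_size, which is why the reader decrements its remaining-bytes counter once per byte.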


* [FFmpeg-devel] [PATCH v2 3/6] lavf: APV demuxer
  2025-04-21 15:24 [FFmpeg-devel] [PATCH v2 0/6] APV support Mark Thompson
  2025-04-21 15:24 ` [FFmpeg-devel] [PATCH v2 1/6] lavc: APV codec ID and descriptor Mark Thompson
  2025-04-21 15:24 ` [FFmpeg-devel] [PATCH v2 2/6] lavc/cbs: APV support Mark Thompson
@ 2025-04-21 15:24 ` Mark Thompson
  2025-04-21 15:24 ` [FFmpeg-devel] [PATCH v2 4/6] lavc: APV decoder Mark Thompson
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Mark Thompson @ 2025-04-21 15:24 UTC (permalink / raw)
  To: ffmpeg-devel

Demuxes raw streams as defined in draft spec section 10.2.
---
 libavformat/Makefile     |   1 +
 libavformat/allformats.c |   1 +
 libavformat/apvdec.c     | 248 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 250 insertions(+)
 create mode 100644 libavformat/apvdec.c
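
For reviewers, here is a rough standalone sketch of the framing the demuxer expects: a 32-bit big-endian AU size, an optional 'aPv1' tag that is counted in that size, then the AU payload. It mirrors the size limits used in apv_read_packet() but works on a memory buffer. The `sketch_*` names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 'a', 'P', 'v', '1' packed big-endian, as MKBETAG would produce. */
#define SKETCH_APV_TAG 0x61507631u

/* Read a 32-bit big-endian value. */
static uint32_t sketch_rb32(const uint8_t *p)
{
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
           (uint32_t)p[2] <<  8 | (uint32_t)p[3];
}

/*
 * Locate the payload of the access unit at the start of buf.
 * Returns 0 on success, -1 if the AU is malformed or truncated.
 */
static int sketch_apv_next_au(const uint8_t *buf, size_t len,
                              size_t *payload_offset,
                              uint32_t *payload_size)
{
    uint32_t au_size;
    size_t   offset = 4;

    assert(buf && payload_offset && payload_size);
    if (len < 8)
        return -1;

    au_size = sketch_rb32(buf);
    if (sketch_rb32(buf + 4) == SKETCH_APV_TAG) {
        /* The optional tag is counted in au_size; skip it. */
        if (au_size < 4)
            return -1;
        au_size -= 4;
        offset  += 4;
    }

    /* Same bounds as the demuxer: at least 16 bytes, at most 1 << 24. */
    if (au_size < 16 || au_size > (1u << 24))
        return -1;
    if (len - offset < au_size)
        return -1;

    *payload_offset = offset;
    *payload_size   = au_size;
    return 0;
}
```

The next AU, if any, starts at payload_offset + payload_size.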

diff --git a/libavformat/Makefile b/libavformat/Makefile
index a94ac66e7e..ef96c2762e 100644
--- a/libavformat/Makefile
+++ b/libavformat/Makefile
@@ -119,6 +119,7 @@ OBJS-$(CONFIG_APTX_DEMUXER)              += aptxdec.o
 OBJS-$(CONFIG_APTX_MUXER)                += rawenc.o
 OBJS-$(CONFIG_APTX_HD_DEMUXER)           += aptxdec.o
 OBJS-$(CONFIG_APTX_HD_MUXER)             += rawenc.o
+OBJS-$(CONFIG_APV_DEMUXER)               += apvdec.o
 OBJS-$(CONFIG_AQTITLE_DEMUXER)           += aqtitledec.o subtitles.o
 OBJS-$(CONFIG_ARGO_ASF_DEMUXER)          += argo_asf.o
 OBJS-$(CONFIG_ARGO_ASF_MUXER)            += argo_asf.o
diff --git a/libavformat/allformats.c b/libavformat/allformats.c
index 445f13f42a..90a4fe64ec 100644
--- a/libavformat/allformats.c
+++ b/libavformat/allformats.c
@@ -72,6 +72,7 @@ extern const FFInputFormat  ff_aptx_demuxer;
 extern const FFOutputFormat ff_aptx_muxer;
 extern const FFInputFormat  ff_aptx_hd_demuxer;
 extern const FFOutputFormat ff_aptx_hd_muxer;
+extern const FFInputFormat  ff_apv_demuxer;
 extern const FFInputFormat  ff_aqtitle_demuxer;
 extern const FFInputFormat  ff_argo_asf_demuxer;
 extern const FFOutputFormat ff_argo_asf_muxer;
diff --git a/libavformat/apvdec.c b/libavformat/apvdec.c
new file mode 100644
index 0000000000..f2e6982e48
--- /dev/null
+++ b/libavformat/apvdec.c
@@ -0,0 +1,248 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavcodec/apv.h"
+#include "libavcodec/get_bits.h"
+#include "avformat.h"
+#include "avio_internal.h"
+#include "demux.h"
+#include "internal.h"
+
+
+#define APV_TAG MKBETAG('a', 'P', 'v', '1')
+
+typedef struct APVHeaderInfo {
+    uint8_t  pbu_type;
+    uint16_t group_id;
+
+    uint8_t  profile_idc;
+    uint8_t  level_idc;
+    uint8_t  band_idc;
+
+    int      frame_width;
+    int      frame_height;
+
+    uint8_t  chroma_format_idc;
+    uint8_t  bit_depth_minus8;
+
+    enum AVPixelFormat pixel_format;
+} APVHeaderInfo;
+
+static const enum AVPixelFormat apv_format_table[5][5] = {
+    { AV_PIX_FMT_GRAY8,    AV_PIX_FMT_GRAY10,     AV_PIX_FMT_GRAY12,     AV_PIX_FMT_GRAY14, AV_PIX_FMT_GRAY16 },
+    { 0 }, // 4:2:0 is not valid.
+    { AV_PIX_FMT_YUV422P,  AV_PIX_FMT_YUV422P10,  AV_PIX_FMT_YUV422P12,  AV_PIX_FMT_YUV422P14, AV_PIX_FMT_YUV422P16 },
+    { AV_PIX_FMT_YUV444P,  AV_PIX_FMT_YUV444P10,  AV_PIX_FMT_YUV444P12,  AV_PIX_FMT_YUV444P14, AV_PIX_FMT_YUV444P16 },
+    { AV_PIX_FMT_YUVA444P, AV_PIX_FMT_YUVA444P10, AV_PIX_FMT_YUVA444P12, AV_PIX_FMT_GRAY14, AV_PIX_FMT_YUVA444P16 },
+};
+
+static int apv_extract_header_info(APVHeaderInfo *info,
+                                   GetBitContext *gbc)
+{
+    int zero, bit_depth_index;
+
+    info->pbu_type = get_bits(gbc, 8);
+    info->group_id = get_bits(gbc, 16);
+
+    zero = get_bits(gbc, 8);
+    if (zero != 0)
+        return AVERROR_INVALIDDATA;
+
+    if (info->pbu_type != APV_PBU_PRIMARY_FRAME)
+        return AVERROR_INVALIDDATA;
+
+    info->profile_idc = get_bits(gbc, 8);
+    info->level_idc   = get_bits(gbc, 8);
+    info->band_idc    = get_bits(gbc, 3);
+
+    zero = get_bits(gbc, 5);
+    if (zero != 0)
+        return AVERROR_INVALIDDATA;
+
+    info->frame_width  = get_bits(gbc, 24);
+    info->frame_height = get_bits(gbc, 24);
+    if (info->frame_width  < 1 || info->frame_width  > 65536 ||
+        info->frame_height < 1 || info->frame_height > 65536)
+        return AVERROR_INVALIDDATA;
+
+    info->chroma_format_idc = get_bits(gbc, 4);
+    info->bit_depth_minus8  = get_bits(gbc, 4);
+
+    if (info->bit_depth_minus8 > 8) {
+        return AVERROR_INVALIDDATA;
+    }
+    if (info->bit_depth_minus8 % 2) {
+        // Odd bit depths are technically valid but not useful here.
+        return AVERROR_INVALIDDATA;
+    }
+    bit_depth_index = info->bit_depth_minus8 / 2;
+
+    switch (info->chroma_format_idc) {
+    case APV_CHROMA_FORMAT_400:
+    case APV_CHROMA_FORMAT_422:
+    case APV_CHROMA_FORMAT_444:
+    case APV_CHROMA_FORMAT_4444:
+        info->pixel_format = apv_format_table[info->chroma_format_idc][bit_depth_index];
+        break;
+    default:
+        return AVERROR_INVALIDDATA;
+    }
+
+    // Ignore capture_time_distance.
+    skip_bits(gbc, 8);
+
+    zero = get_bits(gbc, 8);
+    if (zero != 0)
+        return AVERROR_INVALIDDATA;
+
+    return 1;
+}
+
+static int apv_probe(const AVProbeData *p)
+{
+    GetBitContext gbc;
+    APVHeaderInfo header;
+    uint32_t au_size, tag, pbu_size;
+    int score = AVPROBE_SCORE_EXTENSION + 1;
+    int err;
+
+    init_get_bits8(&gbc, p->buf, p->buf_size);
+
+    au_size = get_bits_long(&gbc, 32);
+    if (au_size < 16) {
+        // Too small.
+        return 0;
+    }
+    // The spec doesn't have this tag, but the reference software and
+    // all current files do.  Treat it as optional and skip if present,
+    // but if it is there then this is definitely an APV file.
+    tag = get_bits_long(&gbc, 32);
+    if (tag == APV_TAG) {
+        pbu_size = get_bits_long(&gbc, 32);
+        score = AVPROBE_SCORE_MAX;
+    } else {
+        pbu_size = tag;
+    }
+    if (pbu_size < 16) {
+        // Too small.
+        return 0;
+    }
+
+    err = apv_extract_header_info(&header, &gbc);
+    if (err < 0) {
+        // Header does not look like APV.
+        return 0;
+    }
+    return score;
+}
+
+static int apv_read_header(AVFormatContext *s)
+{
+    AVStream *st;
+    GetBitContext gbc;
+    APVHeaderInfo header;
+    uint8_t buffer[32];
+    uint32_t au_size, tag, pbu_size;
+    int err, size;
+
+    err = ffio_ensure_seekback(s->pb, sizeof(buffer));
+    if (err < 0)
+        return err;
+    size = avio_read(s->pb, buffer, sizeof(buffer));
+    if (size < 0)
+        return size;
+
+    init_get_bits8(&gbc, buffer, sizeof(buffer));
+
+    au_size = get_bits_long(&gbc, 32);
+    if (au_size < 16) {
+        // Too small.
+        return AVERROR_INVALIDDATA;
+    }
+    tag = get_bits_long(&gbc, 32);
+    if (tag == APV_TAG) {
+        pbu_size = get_bits_long(&gbc, 32);
+    } else {
+        pbu_size = tag;
+    }
+    if (pbu_size < 16) {
+        // Too small.
+        return AVERROR_INVALIDDATA;
+    }
+
+    err = apv_extract_header_info(&header, &gbc);
+    if (err < 0)
+        return err;
+
+    st = avformat_new_stream(s, NULL);
+    if (!st)
+        return AVERROR(ENOMEM);
+
+    st->codecpar->codec_type = AVMEDIA_TYPE_VIDEO;
+    st->codecpar->codec_id   = AV_CODEC_ID_APV;
+    st->codecpar->format     = header.pixel_format;
+    st->codecpar->profile    = header.profile_idc;
+    st->codecpar->level      = header.level_idc;
+    st->codecpar->width      = header.frame_width;
+    st->codecpar->height     = header.frame_height;
+
+    st->avg_frame_rate = (AVRational){ 30, 1 };
+    avpriv_set_pts_info(st, 64, 1, 30);
+
+    avio_seek(s->pb, -size, SEEK_CUR);
+
+    return 0;
+}
+
+static int apv_read_packet(AVFormatContext *s, AVPacket *pkt)
+{
+    uint32_t au_size, tag;
+    int ret;
+
+    au_size = avio_rb32(s->pb);
+    if (au_size == 0 && avio_feof(s->pb))
+        return AVERROR_EOF;
+
+    tag = avio_rb32(s->pb);
+    if (tag == APV_TAG) {
+        au_size -= 4;
+    } else {
+        avio_seek(s->pb, -4, SEEK_CUR);
+    }
+
+    if (au_size < 16 || au_size > 1 << 24) {
+        av_log(s, AV_LOG_ERROR, "Invalid APV AU size: %"PRIu32".\n", au_size);
+        return AVERROR_INVALIDDATA;
+    }
+
+    ret = av_get_packet(s->pb, pkt, au_size);
+    pkt->flags |= AV_PKT_FLAG_KEY;
+
+    return ret;
+}
+
+const FFInputFormat ff_apv_demuxer = {
+    .p.name         = "apv",
+    .p.long_name    = NULL_IF_CONFIG_SMALL("APV raw bitstream"),
+    .p.extensions   = "apv",
+    .p.flags        = AVFMT_GENERIC_INDEX | AVFMT_NOTIMESTAMPS,
+    .flags_internal = FF_INFMT_FLAG_INIT_CLEANUP,
+    .read_probe     = apv_probe,
+    .read_header    = apv_read_header,
+    .read_packet    = apv_read_packet,
+};
-- 
2.47.2


* [FFmpeg-devel] [PATCH v2 4/6] lavc: APV decoder
  2025-04-21 15:24 [FFmpeg-devel] [PATCH v2 0/6] APV support Mark Thompson
                   ` (2 preceding siblings ...)
  2025-04-21 15:24 ` [FFmpeg-devel] [PATCH v2 3/6] lavf: APV demuxer Mark Thompson
@ 2025-04-21 15:24 ` Mark Thompson
  2025-04-21 15:24 ` [FFmpeg-devel] [PATCH v2 5/6] lavc/apv: AVX2 transquant for x86-64 Mark Thompson
  2025-04-21 15:24 ` [FFmpeg-devel] [PATCH v2 6/6] lavc: APV metadata bitstream filter Mark Thompson
  5 siblings, 0 replies; 12+ messages in thread
From: Mark Thompson @ 2025-04-21 15:24 UTC (permalink / raw)
  To: ffmpeg-devel

---
 configure                |   1 +
 libavcodec/Makefile      |   1 +
 libavcodec/allcodecs.c   |   1 +
 libavcodec/apv_decode.c  | 331 +++++++++++++++++++++++++++++++++++++++
 libavcodec/apv_decode.h  |  80 ++++++++++
 libavcodec/apv_dsp.c     | 136 ++++++++++++++++
 libavcodec/apv_dsp.h     |  37 +++++
 libavcodec/apv_entropy.c | 200 +++++++++++++++++++++++
 8 files changed, 787 insertions(+)
 create mode 100644 libavcodec/apv_decode.c
 create mode 100644 libavcodec/apv_decode.h
 create mode 100644 libavcodec/apv_dsp.c
 create mode 100644 libavcodec/apv_dsp.h
 create mode 100644 libavcodec/apv_entropy.c
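
Two small index calculations in the decoder are easy to get wrong, so here is a standalone sketch of both: the column of the 5-entry pixel format table (ordered 8, 10, 12, 14, 16 bits) for a given bit depth, and the split of a slice-thread job number into a tile index and a component index. The `sketch_*` names and the `SketchJob` type are hypothetical and are not part of the patch.

```c
#include <assert.h>

/*
 * Column in a table ordered by bit depth 8, 10, 12, 14, 16.
 * Returns -1 for depths the table does not cover.
 */
static int sketch_bit_depth_column(int bit_depth)
{
    if (bit_depth < 8 || bit_depth > 16 || bit_depth % 2)
        return -1;
    return (bit_depth - 8) / 2;
}

typedef struct SketchJob {
    int tile_index;
    int comp_index;
} SketchJob;

/*
 * Jobs are numbered tile-major: all components of tile 0 first,
 * then all components of tile 1, and so on.
 */
static SketchJob sketch_split_job(int job, int num_comp)
{
    SketchJob j;

    assert(job >= 0 && num_comp > 0);
    j.tile_index = job / num_comp;
    j.comp_index = job % num_comp;
    return j;
}
```

With three components, job 7 is component 1 of tile 2, so the total job count for a frame is num_tiles * num_comp.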

diff --git a/configure b/configure
index ca404d2797..ee270b770c 100755
--- a/configure
+++ b/configure
@@ -2935,6 +2935,7 @@ apng_decoder_select="inflate_wrapper"
 apng_encoder_select="deflate_wrapper llvidencdsp"
 aptx_encoder_select="audio_frame_queue"
 aptx_hd_encoder_select="audio_frame_queue"
+apv_decoder_select="cbs_apv"
 asv1_decoder_select="blockdsp bswapdsp idctdsp"
 asv1_encoder_select="aandcttables bswapdsp fdctdsp pixblockdsp"
 asv2_decoder_select="blockdsp bswapdsp idctdsp"
diff --git a/libavcodec/Makefile b/libavcodec/Makefile
index a5f5c4e904..e674671460 100644
--- a/libavcodec/Makefile
+++ b/libavcodec/Makefile
@@ -244,6 +244,7 @@ OBJS-$(CONFIG_APTX_HD_DECODER)         += aptxdec.o aptx.o
 OBJS-$(CONFIG_APTX_HD_ENCODER)         += aptxenc.o aptx.o
 OBJS-$(CONFIG_APNG_DECODER)            += png.o pngdec.o pngdsp.o
 OBJS-$(CONFIG_APNG_ENCODER)            += png.o pngenc.o
+OBJS-$(CONFIG_APV_DECODER)             += apv_decode.o apv_entropy.o apv_dsp.o
 OBJS-$(CONFIG_ARBC_DECODER)            += arbc.o
 OBJS-$(CONFIG_ARGO_DECODER)            += argo.o
 OBJS-$(CONFIG_SSA_DECODER)             += assdec.o ass.o
diff --git a/libavcodec/allcodecs.c b/libavcodec/allcodecs.c
index f10519617e..09f06c71d6 100644
--- a/libavcodec/allcodecs.c
+++ b/libavcodec/allcodecs.c
@@ -47,6 +47,7 @@ extern const FFCodec ff_anm_decoder;
 extern const FFCodec ff_ansi_decoder;
 extern const FFCodec ff_apng_encoder;
 extern const FFCodec ff_apng_decoder;
+extern const FFCodec ff_apv_decoder;
 extern const FFCodec ff_arbc_decoder;
 extern const FFCodec ff_argo_decoder;
 extern const FFCodec ff_asv1_encoder;
diff --git a/libavcodec/apv_decode.c b/libavcodec/apv_decode.c
new file mode 100644
index 0000000000..db1ae9756d
--- /dev/null
+++ b/libavcodec/apv_decode.c
@@ -0,0 +1,331 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/mem_internal.h"
+#include "libavutil/pixdesc.h"
+
+#include "apv.h"
+#include "apv_decode.h"
+#include "apv_dsp.h"
+#include "avcodec.h"
+#include "cbs.h"
+#include "cbs_apv.h"
+#include "codec_internal.h"
+#include "decode.h"
+#include "thread.h"
+
+
+typedef struct APVDecodeContext {
+    CodedBitstreamContext *cbc;
+    APVDSPContext dsp;
+
+    CodedBitstreamFragment au;
+    APVDerivedTileInfo tile_info;
+
+    APVVLCLUT decode_lut;
+
+    AVFrame *output_frame;
+} APVDecodeContext;
+
+static const enum AVPixelFormat apv_format_table[5][5] = {
+    { AV_PIX_FMT_GRAY8,    AV_PIX_FMT_GRAY10,     AV_PIX_FMT_GRAY12,     AV_PIX_FMT_GRAY14, AV_PIX_FMT_GRAY16 },
+    { 0 }, // 4:2:0 is not valid.
+    { AV_PIX_FMT_YUV422P,  AV_PIX_FMT_YUV422P10,  AV_PIX_FMT_YUV422P12,  AV_PIX_FMT_YUV422P14, AV_PIX_FMT_YUV422P16 },
+    { AV_PIX_FMT_YUV444P,  AV_PIX_FMT_YUV444P10,  AV_PIX_FMT_YUV444P12,  AV_PIX_FMT_YUV444P14, AV_PIX_FMT_YUV444P16 },
+    { AV_PIX_FMT_YUVA444P, AV_PIX_FMT_YUVA444P10, AV_PIX_FMT_YUVA444P12, AV_PIX_FMT_GRAY14, AV_PIX_FMT_YUVA444P16 },
+};
+
+static int apv_decode_check_format(AVCodecContext *avctx,
+                                   const APVRawFrameHeader *header)
+{
+    int err, bit_depth;
+
+    avctx->profile = header->frame_info.profile_idc;
+    avctx->level   = header->frame_info.level_idc;
+
+    bit_depth = header->frame_info.bit_depth_minus8 + 8;
+    if (bit_depth < 8 || bit_depth > 16 || bit_depth % 2) {
+        avpriv_request_sample(avctx, "Bit depth %d", bit_depth);
+        return AVERROR_PATCHWELCOME;
+    }
+    avctx->pix_fmt =
+        apv_format_table[header->frame_info.chroma_format_idc][(bit_depth - 8) >> 1];
+
+    err = ff_set_dimensions(avctx,
+                            FFALIGN(header->frame_info.frame_width,  16),
+                            FFALIGN(header->frame_info.frame_height, 16));
+    if (err < 0) {
+        // Unsupported frame size.
+        return err;
+    }
+    avctx->width  = header->frame_info.frame_width;
+    avctx->height = header->frame_info.frame_height;
+
+    avctx->sample_aspect_ratio = (AVRational){ 1, 1 };
+
+    avctx->color_primaries = header->color_primaries;
+    avctx->color_trc       = header->transfer_characteristics;
+    avctx->colorspace      = header->matrix_coefficients;
+    avctx->color_range     = header->full_range_flag ? AVCOL_RANGE_JPEG
+                                                     : AVCOL_RANGE_MPEG;
+    avctx->chroma_sample_location = AVCHROMA_LOC_TOPLEFT;
+
+    avctx->refs = 0;
+    avctx->has_b_frames = 0;
+
+    return 0;
+}
+
+static av_cold int apv_decode_init(AVCodecContext *avctx)
+{
+    APVDecodeContext *apv = avctx->priv_data;
+    int err;
+
+    err = ff_cbs_init(&apv->cbc, AV_CODEC_ID_APV, avctx);
+    if (err < 0)
+        return err;
+
+    ff_apv_entropy_build_decode_lut(&apv->decode_lut);
+
+    ff_apv_dsp_init(&apv->dsp);
+
+    if (avctx->extradata) {
+        av_log(avctx, AV_LOG_WARNING,
+               "APV does not support extradata.\n");
+    }
+
+    return 0;
+}
+
+static av_cold int apv_decode_close(AVCodecContext *avctx)
+{
+    APVDecodeContext *apv = avctx->priv_data;
+
+    ff_cbs_fragment_free(&apv->au);
+    ff_cbs_close(&apv->cbc);
+
+    return 0;
+}
+
+static int apv_decode_block(AVCodecContext *avctx,
+                            void *output,
+                            ptrdiff_t pitch,
+                            GetBitContext *gbc,
+                            APVEntropyState *entropy_state,
+                            int bit_depth,
+                            int qp_shift,
+                            const uint16_t *qmatrix)
+{
+    APVDecodeContext *apv = avctx->priv_data;
+    int err;
+
+    LOCAL_ALIGNED_32(int16_t, coeff, [64]);
+
+    err = ff_apv_entropy_decode_block(coeff, gbc, entropy_state);
+    if (err < 0)
+        return err;
+
+    apv->dsp.decode_transquant(output, pitch,
+                               coeff, qmatrix,
+                               bit_depth, qp_shift);
+
+    return 0;
+}
+
+static int apv_decode_tile_component(AVCodecContext *avctx, void *data,
+                                     int job, int thread)
+{
+    APVRawFrame                      *input = data;
+    APVDecodeContext                   *apv = avctx->priv_data;
+    const CodedBitstreamAPVContext *apv_cbc = apv->cbc->priv_data;
+    const APVDerivedTileInfo     *tile_info = &apv_cbc->tile_info;
+
+    int tile_index = job / apv_cbc->num_comp;
+    int comp_index = job % apv_cbc->num_comp;
+
+    const AVPixFmtDescriptor *pix_fmt_desc =
+        av_pix_fmt_desc_get(avctx->pix_fmt);
+
+    int sub_w = comp_index == 0 ? 1 : pix_fmt_desc->log2_chroma_w + 1;
+    int sub_h = comp_index == 0 ? 1 : pix_fmt_desc->log2_chroma_h + 1;
+
+    APVRawTile *tile = &input->tile[tile_index];
+
+    int tile_y = tile_index / tile_info->tile_cols;
+    int tile_x = tile_index % tile_info->tile_cols;
+
+    int tile_start_x = tile_info->col_starts[tile_x];
+    int tile_start_y = tile_info->row_starts[tile_y];
+
+    int tile_width  = tile_info->col_starts[tile_x + 1] - tile_start_x;
+    int tile_height = tile_info->row_starts[tile_y + 1] - tile_start_y;
+
+    int tile_mb_width  = tile_width  / APV_MB_WIDTH;
+    int tile_mb_height = tile_height / APV_MB_HEIGHT;
+
+    int blk_mb_width  = 2 / sub_w;
+    int blk_mb_height = 2 / sub_h;
+
+    int bit_depth;
+    int qp_shift;
+    LOCAL_ALIGNED_32(uint16_t, qmatrix_scaled, [64]);
+
+    GetBitContext gbc;
+
+    APVEntropyState entropy_state = {
+        .log_ctx           = avctx,
+        .decode_lut        = &apv->decode_lut,
+        .prev_dc           = 0,
+        .prev_dc_diff      = 20,
+        .prev_1st_ac_level = 0,
+    };
+
+    init_get_bits8(&gbc, tile->tile_data[comp_index],
+                   tile->tile_header.tile_data_size[comp_index]);
+
+    // Combine the bitstream quantisation matrix with the qp scaling
+    // in advance.  (Including qp_shift as well would overflow 16 bits.)
+    // Fix the row ordering at the same time.
+    {
+        static const uint8_t apv_level_scale[6] = { 40, 45, 51, 57, 64, 71 };
+        int qp = tile->tile_header.tile_qp[comp_index];
+        int level_scale = apv_level_scale[qp % 6];
+
+        bit_depth = apv_cbc->bit_depth;
+        qp_shift  = qp / 6;
+
+        for (int y = 0; y < 8; y++) {
+            for (int x = 0; x < 8; x++)
+                qmatrix_scaled[y * 8 + x] = level_scale *
+                    input->frame_header.quantization_matrix.q_matrix[comp_index][x][y];
+        }
+    }
+
+    for (int mb_y = 0; mb_y < tile_mb_height; mb_y++) {
+        for (int mb_x = 0; mb_x < tile_mb_width; mb_x++) {
+            for (int blk_y = 0; blk_y < blk_mb_height; blk_y++) {
+                for (int blk_x = 0; blk_x < blk_mb_width; blk_x++) {
+                    int frame_y = (tile_start_y +
+                                   APV_MB_HEIGHT * mb_y +
+                                   APV_TR_SIZE * blk_y) / sub_h;
+                    int frame_x = (tile_start_x +
+                                   APV_MB_WIDTH * mb_x +
+                                   APV_TR_SIZE * blk_x) / sub_w;
+
+                    ptrdiff_t frame_pitch = apv->output_frame->linesize[comp_index];
+                    uint8_t  *block_start = apv->output_frame->data[comp_index] +
+                                            frame_y * frame_pitch + 2 * frame_x;
+
+                    apv_decode_block(avctx,
+                                     block_start, frame_pitch,
+                                     &gbc, &entropy_state,
+                                     bit_depth,
+                                     qp_shift,
+                                     qmatrix_scaled);
+                }
+            }
+        }
+    }
+
+    av_log(avctx, AV_LOG_DEBUG,
+           "Decoded tile %d component %d: %dx%d MBs starting at (%d,%d)\n",
+           tile_index, comp_index, tile_mb_width, tile_mb_height,
+           tile_start_x, tile_start_y);
+
+    return 0;
+}
+
+static int apv_decode(AVCodecContext *avctx, AVFrame *output,
+                      APVRawFrame *input)
+{
+    APVDecodeContext                   *apv = avctx->priv_data;
+    const CodedBitstreamAPVContext *apv_cbc = apv->cbc->priv_data;
+    const APVDerivedTileInfo     *tile_info = &apv_cbc->tile_info;
+    int err, job_count;
+
+    err = apv_decode_check_format(avctx, &input->frame_header);
+    if (err < 0) {
+        av_log(avctx, AV_LOG_ERROR, "Unsupported format parameters.\n");
+        return err;
+    }
+
+    err = ff_thread_get_buffer(avctx, output, 0);
+    if (err) {
+        av_log(avctx, AV_LOG_ERROR, "No output frame supplied.\n");
+        return err;
+    }
+
+    apv->output_frame = output;
+
+    // Each component within a tile is independent of every other,
+    // so we can decode all in parallel.
+    job_count = tile_info->num_tiles * apv_cbc->num_comp;
+
+    avctx->execute2(avctx, apv_decode_tile_component,
+                    input, NULL, job_count);
+
+    return 0;
+}
+
+static int apv_decode_frame(AVCodecContext *avctx, AVFrame *frame,
+                            int *got_frame, AVPacket *packet)
+{
+    APVDecodeContext      *apv = avctx->priv_data;
+    CodedBitstreamFragment *au = &apv->au;
+    int err;
+
+    err = ff_cbs_read_packet(apv->cbc, au, packet);
+    if (err < 0) {
+        av_log(avctx, AV_LOG_ERROR, "Failed to read packet.\n");
+        return err;
+    }
+
+    for (int i = 0; i < au->nb_units; i++) {
+        CodedBitstreamUnit *pbu = &au->units[i];
+
+        switch (pbu->type) {
+        case APV_PBU_PRIMARY_FRAME:
+            err = apv_decode(avctx, frame, pbu->content);
+            if (err < 0)
+                return err;
+            *got_frame = 1;
+            break;
+        default:
+            av_log(avctx, AV_LOG_WARNING,
+                   "Ignoring unsupported PBU type %d.\n", pbu->type);
+        }
+    }
+
+    ff_cbs_fragment_reset(au);
+
+    return packet->size;
+}
+
+const FFCodec ff_apv_decoder = {
+    .p.name                = "apv",
+    CODEC_LONG_NAME("Advanced Professional Video"),
+    .p.type                = AVMEDIA_TYPE_VIDEO,
+    .p.id                  = AV_CODEC_ID_APV,
+    .priv_data_size        = sizeof(APVDecodeContext),
+    .init                  = apv_decode_init,
+    .close                 = apv_decode_close,
+    FF_CODEC_DECODE_CB(apv_decode_frame),
+    .p.capabilities        = AV_CODEC_CAP_DR1 |
+                             AV_CODEC_CAP_SLICE_THREADS |
+                             AV_CODEC_CAP_FRAME_THREADS,
+};
diff --git a/libavcodec/apv_decode.h b/libavcodec/apv_decode.h
new file mode 100644
index 0000000000..34c6176ea0
--- /dev/null
+++ b/libavcodec/apv_decode.h
@@ -0,0 +1,80 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef AVCODEC_APV_DECODE_H
+#define AVCODEC_APV_DECODE_H
+
+#include <stdint.h>
+
+#include "apv.h"
+#include "avcodec.h"
+#include "get_bits.h"
+
+
+// Number of bits in the entropy look-up tables.
+// It may be desirable to tune this per-architecture, as a larger LUT
+// trades greater memory use for fewer instructions.
+// (N bits -> 24*2^N bytes of tables; 9 -> 12KB of tables.)
+#define APV_VLC_LUT_BITS 9
+#define APV_VLC_LUT_SIZE (1 << APV_VLC_LUT_BITS)
+
+typedef struct APVVLCLUTEntry {
+    uint16_t result;  // Return value if not reading more.
+    uint8_t  consume; // Number of bits to consume.
+    uint8_t  more;    // Whether to read additional bits.
+} APVVLCLUTEntry;
+
+typedef struct APVVLCLUT {
+    APVVLCLUTEntry lut[6][APV_VLC_LUT_SIZE];
+} APVVLCLUT;
+
+typedef struct APVEntropyState {
+    void *log_ctx;
+
+    const APVVLCLUT *decode_lut;
+
+    int16_t prev_dc;
+    int16_t prev_dc_diff;
+    int16_t prev_1st_ac_level;
+} APVEntropyState;
+
+
+/**
+ * Build the decoder VLC look-up table.
+ */
+void ff_apv_entropy_build_decode_lut(APVVLCLUT *decode_lut);
+
+/**
+ * Entropy decode a single 8x8 block to coefficients.
+ *
+ * Outputs in block order (dezigzag already applied).
+ */
+int ff_apv_entropy_decode_block(int16_t *coeff,
+                                GetBitContext *gbc,
+                                APVEntropyState *state);
+
+/**
+ * Read a single APV VLC code.
+ *
+ * This entrypoint is exposed for testing.
+ */
+unsigned int ff_apv_read_vlc(GetBitContext *gbc, int k_param,
+                             const APVVLCLUT *lut);
+
+
+#endif /* AVCODEC_APV_DECODE_H */
diff --git a/libavcodec/apv_dsp.c b/libavcodec/apv_dsp.c
new file mode 100644
index 0000000000..fe11cd6b94
--- /dev/null
+++ b/libavcodec/apv_dsp.c
@@ -0,0 +1,136 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <stdint.h>
+
+#include "config.h"
+#include "libavutil/attributes.h"
+#include "libavutil/common.h"
+
+#include "apv.h"
+#include "apv_dsp.h"
+
+
+static const int8_t apv_trans_matrix[8][8] = {
+    {  64,  64,  64,  64,  64,  64,  64,  64 },
+    {  89,  75,  50,  18, -18, -50, -75, -89 },
+    {  84,  35, -35, -84, -84, -35,  35,  84 },
+    {  75, -18, -89, -50,  50,  89,  18, -75 },
+    {  64, -64, -64,  64,  64, -64, -64,  64 },
+    {  50, -89,  18,  75, -75, -18,  89, -50 },
+    {  35, -84,  84, -35, -35,  84, -84,  35 },
+    {  18, -50,  75, -89,  89, -75,  50, -18 },
+};
+
+static void apv_decode_transquant_c(void *output,
+                                    ptrdiff_t pitch,
+                                    const int16_t *input_flat,
+                                    const int16_t *qmatrix_flat,
+                                    int bit_depth,
+                                    int qp_shift)
+{
+    const int16_t (*input)[8]   = (const int16_t(*)[8])input_flat;
+    const int16_t (*qmatrix)[8] = (const int16_t(*)[8])qmatrix_flat;
+
+    int16_t scaled_coeff[8][8];
+    int32_t recon_sample[8][8];
+
+    // Dequant.
+    {
+        // Note that level_scale was already combined into qmatrix
+        // before we got here.
+        int bd_shift = bit_depth + 3 - 5;
+
+        for (int y = 0; y < 8; y++) {
+            for (int x = 0; x < 8; x++) {
+                int coeff = (((input[y][x] * qmatrix[y][x]) << qp_shift) +
+                             (1 << (bd_shift - 1))) >> bd_shift;
+
+                scaled_coeff[y][x] =
+                    av_clip(coeff, APV_MIN_TRANS_COEFF,
+                                   APV_MAX_TRANS_COEFF);
+            }
+        }
+    }
+
+    // Transform.
+    {
+        int32_t tmp[8][8];
+
+        // Vertical transform of columns.
+        for (int x = 0; x < 8; x++) {
+            for (int i = 0; i < 8; i++) {
+                int sum = 0;
+                for (int j = 0; j < 8; j++)
+                    sum += apv_trans_matrix[j][i] * scaled_coeff[j][x];
+                tmp[i][x] = sum;
+            }
+        }
+
+        // Renormalise.
+        for (int x = 0; x < 8; x++) {
+            for (int y = 0; y < 8; y++)
+                tmp[y][x] = (tmp[y][x] + 64) >> 7;
+        }
+
+        // Horizontal transform of rows.
+        for (int y = 0; y < 8; y++) {
+            for (int i = 0; i < 8; i++) {
+                int sum = 0;
+                for (int j = 0; j < 8; j++)
+                    sum += apv_trans_matrix[j][i] * tmp[y][j];
+                recon_sample[y][i] = sum;
+            }
+        }
+    }
+
+    // Output.
+    if (bit_depth == 8) {
+        uint8_t *ptr = output;
+        int bd_shift = 20 - bit_depth;
+
+        for (int y = 0; y < 8; y++) {
+            for (int x = 0; x < 8; x++) {
+                int sample = ((recon_sample[y][x] +
+                               (1 << (bd_shift - 1))) >> bd_shift) +
+                    (1 << (bit_depth - 1));
+                ptr[x] = av_clip_uintp2(sample, bit_depth);
+            }
+            ptr += pitch;
+        }
+    } else {
+        uint16_t *ptr = output;
+        int bd_shift = 20 - bit_depth;
+        pitch /= 2; // Pitch was in bytes, 2 bytes per sample.
+
+        for (int y = 0; y < 8; y++) {
+            for (int x = 0; x < 8; x++) {
+                int sample = ((recon_sample[y][x] +
+                               (1 << (bd_shift - 1))) >> bd_shift) +
+                    (1 << (bit_depth - 1));
+                ptr[x] = av_clip_uintp2(sample, bit_depth);
+            }
+            ptr += pitch;
+        }
+    }
+}
+
+av_cold void ff_apv_dsp_init(APVDSPContext *dsp)
+{
+    dsp->decode_transquant = apv_decode_transquant_c;
+}
diff --git a/libavcodec/apv_dsp.h b/libavcodec/apv_dsp.h
new file mode 100644
index 0000000000..31645b8581
--- /dev/null
+++ b/libavcodec/apv_dsp.h
@@ -0,0 +1,37 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef AVCODEC_APV_DSP_H
+#define AVCODEC_APV_DSP_H
+
+#include <stddef.h>
+#include <stdint.h>
+
+
+typedef struct APVDSPContext {
+    void (*decode_transquant)(void *output,
+                              ptrdiff_t pitch,
+                              const int16_t *input,
+                              const int16_t *qmatrix,
+                              int bit_depth,
+                              int qp_shift);
+} APVDSPContext;
+
+void ff_apv_dsp_init(APVDSPContext *dsp);
+
+#endif /* AVCODEC_APV_DSP_H */
diff --git a/libavcodec/apv_entropy.c b/libavcodec/apv_entropy.c
new file mode 100644
index 0000000000..00e0b4fbdf
--- /dev/null
+++ b/libavcodec/apv_entropy.c
@@ -0,0 +1,200 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "apv.h"
+#include "apv_decode.h"
+
+
+void ff_apv_entropy_build_decode_lut(APVVLCLUT *decode_lut)
+{
+    const int code_len = APV_VLC_LUT_BITS;
+    const int lut_size = APV_VLC_LUT_SIZE;
+
+    for (int k = 0; k <= 5; k++) {
+        for (unsigned int code = 0; code < lut_size; code++) {
+            APVVLCLUTEntry *ent = &decode_lut->lut[k][code];
+            unsigned int first_bit      = code & (1 << code_len - 1);
+            unsigned int remaining_bits = code ^ first_bit;
+
+            if (first_bit) {
+                ent->consume = 1 + k;
+                ent->result  = remaining_bits >> (code_len - k - 1);
+                ent->more    = 0;
+            } else {
+                unsigned int second_bit = code & (1 << code_len - 2);
+                remaining_bits ^= second_bit;
+
+                if (second_bit) {
+                    unsigned int bits_left = code_len - 2;
+                    unsigned int first_set = bits_left - av_log2(remaining_bits);
+                    unsigned int last_bits = first_set - 1 + k;
+
+                    if (first_set + last_bits <= bits_left) {
+                        // Whole code fits here.
+                        ent->consume = 2 + first_set + last_bits;
+                        ent->result  = ((2 << k) +
+                                        (((1 << first_set - 1) - 1) << k) +
+                                        ((code >> bits_left - first_set - last_bits) & (1 << last_bits) - 1));
+                        ent->more    = 0;
+                    } else {
+                        // Need to read more, collapse to default.
+                        ent->consume = 2;
+                        ent->more    = 1;
+                    }
+                } else {
+                    ent->consume = 2 + k;
+                    ent->result  = (1 << k) + (remaining_bits >> (code_len - k - 2));
+                    ent->more    = 0;
+                }
+            }
+        }
+    }
+}
+
+av_always_inline
+static unsigned int apv_read_vlc(GetBitContext *gbc, int k_param,
+                                 const APVVLCLUT *lut)
+{
+    unsigned int next_bits;
+    const APVVLCLUTEntry *ent;
+
+    next_bits = show_bits(gbc, APV_VLC_LUT_BITS);
+    ent = &lut->lut[k_param][next_bits];
+
+    if (ent->more) {
+        unsigned int leading_zeroes;
+
+        skip_bits(gbc, ent->consume);
+
+        next_bits = show_bits(gbc, 16);
+        leading_zeroes = 15 - av_log2(next_bits);
+
+        skip_bits(gbc, leading_zeroes + 1);
+
+        return (2 << k_param) +
+            ((1 << leading_zeroes) - 1) * (1 << k_param) +
+            get_bits(gbc, leading_zeroes + k_param);
+    } else {
+        skip_bits(gbc, ent->consume);
+        return ent->result;
+    }
+}
+
+unsigned int ff_apv_read_vlc(GetBitContext *gbc, int k_param,
+                             const APVVLCLUT *lut)
+{
+    return apv_read_vlc(gbc, k_param, lut);
+}
+
+int ff_apv_entropy_decode_block(int16_t *coeff,
+                                GetBitContext *gbc,
+                                APVEntropyState *state)
+{
+    const APVVLCLUT *lut = state->decode_lut;
+    int k_param;
+
+    // DC coefficient.
+    {
+        int abs_dc_coeff_diff;
+        int sign_dc_coeff_diff;
+        int dc_coeff;
+
+        k_param = av_clip(state->prev_dc_diff >> 1, 0, 5);
+        abs_dc_coeff_diff = apv_read_vlc(gbc, k_param, lut);
+
+        if (abs_dc_coeff_diff > 0)
+            sign_dc_coeff_diff = get_bits1(gbc);
+        else
+            sign_dc_coeff_diff = 0;
+
+        if (sign_dc_coeff_diff)
+            dc_coeff = state->prev_dc - abs_dc_coeff_diff;
+        else
+            dc_coeff = state->prev_dc + abs_dc_coeff_diff;
+
+        if (dc_coeff < APV_MIN_TRANS_COEFF ||
+            dc_coeff > APV_MAX_TRANS_COEFF) {
+            av_log(state->log_ctx, AV_LOG_ERROR,
+                   "Out-of-range DC coefficient value: %d "
+                   "(from prev_dc %d abs_dc_coeff_diff %d sign_dc_coeff_diff %d)\n",
+                   dc_coeff, state->prev_dc, abs_dc_coeff_diff, sign_dc_coeff_diff);
+            return AVERROR_INVALIDDATA;
+        }
+
+        coeff[0] = dc_coeff;
+
+        state->prev_dc      = dc_coeff;
+        state->prev_dc_diff = abs_dc_coeff_diff;
+    }
+
+    // AC coefficients.
+    {
+        int scan_pos   = 1;
+        int first_ac   = 1;
+        int prev_level = state->prev_1st_ac_level;
+        int prev_run   = 0;
+
+        do {
+            int coeff_zero_run;
+
+            k_param = av_clip(prev_run >> 2, 0, 2);
+            coeff_zero_run = apv_read_vlc(gbc, k_param, lut);
+
+            if (coeff_zero_run > APV_BLK_COEFFS - scan_pos) {
+                av_log(state->log_ctx, AV_LOG_ERROR,
+                       "Out-of-range zero-run value: %d (at scan pos %d)\n",
+                       coeff_zero_run, scan_pos);
+                return AVERROR_INVALIDDATA;
+            }
+
+            for (int i = 0; i < coeff_zero_run; i++) {
+                coeff[ff_zigzag_direct[scan_pos]] = 0;
+                ++scan_pos;
+            }
+            prev_run = coeff_zero_run;
+
+            if (scan_pos < APV_BLK_COEFFS) {
+                int abs_ac_coeff_minus1;
+                int sign_ac_coeff;
+                int level;
+
+                k_param = av_clip(prev_level >> 2, 0, 4);
+                abs_ac_coeff_minus1 = apv_read_vlc(gbc, k_param, lut);
+                sign_ac_coeff = get_bits(gbc, 1);
+
+                if (sign_ac_coeff)
+                    level = -abs_ac_coeff_minus1 - 1;
+                else
+                    level = abs_ac_coeff_minus1 + 1;
+
+                coeff[ff_zigzag_direct[scan_pos]] = level;
+
+                prev_level = abs_ac_coeff_minus1 + 1;
+                if (first_ac) {
+                    state->prev_1st_ac_level = prev_level;
+                    first_ac = 0;
+                }
+
+                ++scan_pos;
+            }
+
+        } while (scan_pos < APV_BLK_COEFFS);
+    }
+
+    return 0;
+}
-- 
2.47.2
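
Note on the entropy code: the LUT in apv_entropy.c is a fast path for a code with three prefix classes, which can be read off ff_apv_entropy_build_decode_lut() and the slow path of apv_read_vlc(). A bit-at-a-time reference is sketched here as an illustration only. The BitReader type and the br_* and ref_read_vlc names are hypothetical stand-ins for GetBitContext and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MSB-first bit reader, standing in for GetBitContext. */
typedef struct BitReader {
    const uint8_t *data;
    size_t         size_bits;
    size_t         pos;
} BitReader;

static unsigned int br_get_bit(BitReader *br)
{
    unsigned int bit;
    assert(br->pos < br->size_bits);
    bit = (br->data[br->pos >> 3] >> (7 - (br->pos & 7))) & 1;
    br->pos++;
    return bit;
}

static unsigned int br_get_bits(BitReader *br, int n)
{
    unsigned int value = 0;
    while (n-- > 0)
        value = (value << 1) | br_get_bit(br);
    return value;
}

/* Bit-at-a-time reference for the code that the LUT accelerates. */
static unsigned int ref_read_vlc(BitReader *br, int k)
{
    int n = 0;

    /* "1" + k bits: values 0 .. (1 << k) - 1. */
    if (br_get_bit(br))
        return br_get_bits(br, k);

    /* "00" + k bits: values (1 << k) .. (2 << k) - 1. */
    if (!br_get_bit(br))
        return (1u << k) + br_get_bits(br, k);

    /* "01", then n zeroes and a one, then n + k suffix bits. */
    while (!br_get_bit(br))
        n++;
    return (2u << k) + (((1u << n) - 1) << k) + br_get_bits(br, n + k);
}
```

With k = 0, the bit strings "1", "00", "011" and "01011" decode to 0, 1, 2 and 4 respectively. A reader like this could also serve as a cross-check for the LUT path exposed through ff_apv_read_vlc().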


* [FFmpeg-devel] [PATCH v2 5/6] lavc/apv: AVX2 transquant for x86-64
  2025-04-21 15:24 [FFmpeg-devel] [PATCH v2 0/6] APV support Mark Thompson
                   ` (3 preceding siblings ...)
  2025-04-21 15:24 ` [FFmpeg-devel] [PATCH v2 4/6] lavc: APV decoder Mark Thompson
@ 2025-04-21 15:24 ` Mark Thompson
  2025-04-21 16:53   ` James Almer
  2025-04-23 19:52   ` Michael Niedermayer
  2025-04-21 15:24 ` [FFmpeg-devel] [PATCH v2 6/6] lavc: APV metadata bitstream filter Mark Thompson
  5 siblings, 2 replies; 12+ messages in thread
From: Mark Thompson @ 2025-04-21 15:24 UTC (permalink / raw)
  To: ffmpeg-devel

Typical checkasm result on Alder Lake:

decode_transquant_8_c:                                 461.1 ( 1.00x)
decode_transquant_8_avx2:                               97.5 ( 4.73x)
decode_transquant_10_c:                                483.9 ( 1.00x)
decode_transquant_10_avx2:                              91.7 ( 5.28x)
---
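Note on the dequantisation step that LOAD_AND_DEQUANT vectorises: per coefficient it is a multiply by the pre-scaled quantisation matrix entry, a left shift by qp / 6, and a rounded right shift by bit_depth - 2. A scalar sketch for a single coefficient follows. The function and macro names are hypothetical, and the 16-bit clip range is an assumption standing in for APV_MIN_TRANS_COEFF / APV_MAX_TRANS_COEFF from apv.h.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 16-bit coefficient range; the real limits are the
 * APV_MIN_TRANS_COEFF / APV_MAX_TRANS_COEFF constants in apv.h. */
#define SKETCH_MIN_COEFF (-32768)
#define SKETCH_MAX_COEFF   32767

/* Same values as apv_level_scale in apv_decode.c. */
static const uint8_t sketch_level_scale[6] = { 40, 45, 51, 57, 64, 71 };

/* Dequantise one entropy-decoded level (hypothetical helper). */
static int sketch_dequant_one(int level, int q_matrix_entry,
                              int qp, int bit_depth)
{
    int     qp_shift = qp / 6;
    int     bd_shift = bit_depth - 2;   /* bit_depth + 3 - 5 */
    int     scaled_q;
    int64_t coeff;

    assert(qp >= 0 && bit_depth >= 8);

    /* The decoder folds level_scale into the matrix once per tile. */
    scaled_q = sketch_level_scale[qp % 6] * q_matrix_entry;

    /* Multiply instead of shifting left so that negative levels stay
     * well-defined; 64-bit intermediate to avoid overflow here. */
    coeff = (int64_t)level * scaled_q * (1 << qp_shift);

    /* Rounded shift; assumes an arithmetic right shift for negative
     * values, as on the compilers FFmpeg targets. */
    coeff = (coeff + (1 << (bd_shift - 1))) >> bd_shift;

    if (coeff < SKETCH_MIN_COEFF)
        return SKETCH_MIN_COEFF;
    if (coeff > SKETCH_MAX_COEFF)
        return SKETCH_MAX_COEFF;
    return (int)coeff;
}
```

For example, level 3 with matrix entry 16, qp 26 and bit depth 10 gives 3 * (51 * 16) * 16 = 39168, and (39168 + 128) >> 8 = 153.
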
 libavcodec/apv_dsp.c          |   4 +
 libavcodec/apv_dsp.h          |   2 +
 libavcodec/x86/Makefile       |   2 +
 libavcodec/x86/apv_dsp.asm    | 279 ++++++++++++++++++++++++++++++++++
 libavcodec/x86/apv_dsp_init.c |  40 +++++
 tests/checkasm/Makefile       |   1 +
 tests/checkasm/apv_dsp.c      | 109 +++++++++++++
 tests/checkasm/checkasm.c     |   3 +
 tests/checkasm/checkasm.h     |   1 +
 tests/fate/checkasm.mak       |   1 +
 10 files changed, 442 insertions(+)
 create mode 100644 libavcodec/x86/apv_dsp.asm
 create mode 100644 libavcodec/x86/apv_dsp_init.c
 create mode 100644 tests/checkasm/apv_dsp.c
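
Note on the column pass in apv_dsp.asm: it relies on the even rows of the transform matrix being symmetric and the odd rows antisymmetric, so each 8-point column transform splits into an even half (rows 0, 4, 2, 6) and an odd half (rows 1, 3, 5, 7). A scalar sketch of that decomposition, with a comparison against the direct matrix form used in apv_decode_transquant_c(), is given here. All function names are hypothetical and the sketch covers one column only, without the renormalisation.

```c
#include <stdint.h>

/* Same coefficients as apv_trans_matrix in apv_dsp.c. */
static const int8_t sketch_matrix[8][8] = {
    {  64,  64,  64,  64,  64,  64,  64,  64 },
    {  89,  75,  50,  18, -18, -50, -75, -89 },
    {  84,  35, -35, -84, -84, -35,  35,  84 },
    {  75, -18, -89, -50,  50,  89,  18, -75 },
    {  64, -64, -64,  64,  64, -64, -64,  64 },
    {  50, -89,  18,  75, -75, -18,  89, -50 },
    {  35, -84,  84, -35, -35,  84, -84,  35 },
    {  18, -50,  75, -89,  89, -75,  50, -18 },
};

/* Direct form: out[i] = sum over j of M[j][i] * in[j]. */
static void inverse_direct(int32_t out[8], const int16_t in[8])
{
    for (int i = 0; i < 8; i++) {
        int32_t sum = 0;
        for (int j = 0; j < 8; j++)
            sum += sketch_matrix[j][i] * in[j];
        out[i] = sum;
    }
}

/* Even/odd split, following the register usage in the asm. */
static void inverse_butterfly(int32_t out[8], const int16_t in[8])
{
    /* Even half: pairs (0,4) and (2,6), as with tmatrix_col_even. */
    int32_t a = 64 * in[0] + 64 * in[4];
    int32_t b = 64 * in[0] - 64 * in[4];
    int32_t c = 84 * in[2] + 35 * in[6];
    int32_t d = 35 * in[2] - 84 * in[6];
    int32_t even[4] = { a + c, b + d, b - d, a - c };

    for (int i = 0; i < 4; i++) {
        /* Odd half: rows 1, 3, 5 and 7, as with tmatrix_col_odd. */
        int32_t odd = sketch_matrix[1][i] * in[1] +
                      sketch_matrix[3][i] * in[3] +
                      sketch_matrix[5][i] * in[5] +
                      sketch_matrix[7][i] * in[7];

        out[i]     = even[i] + odd;
        out[7 - i] = even[i] - odd;
    }
}

/* Returns 1 if both forms agree on the given column, else 0. */
static int butterfly_matches_direct(const int16_t in[8])
{
    int32_t x[8], y[8];

    inverse_direct(x, in);
    inverse_butterfly(y, in);
    for (int i = 0; i < 8; i++) {
        if (x[i] != y[i])
            return 0;
    }
    return 1;
}
```

The split halves the number of multiplies per column relative to the direct form, which is presumably why the asm uses it for columns while multiplying rows by the full matrix to avoid a transpose.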

diff --git a/libavcodec/apv_dsp.c b/libavcodec/apv_dsp.c
index fe11cd6b94..fd814ef900 100644
--- a/libavcodec/apv_dsp.c
+++ b/libavcodec/apv_dsp.c
@@ -133,4 +133,8 @@ static void apv_decode_transquant_c(void *output,
 av_cold void ff_apv_dsp_init(APVDSPContext *dsp)
 {
     dsp->decode_transquant = apv_decode_transquant_c;
+
+#if ARCH_X86_64
+    ff_apv_dsp_init_x86_64(dsp);
+#endif
 }
diff --git a/libavcodec/apv_dsp.h b/libavcodec/apv_dsp.h
index 31645b8581..c63d6a88ee 100644
--- a/libavcodec/apv_dsp.h
+++ b/libavcodec/apv_dsp.h
@@ -34,4 +34,6 @@ typedef struct APVDSPContext {
 
 void ff_apv_dsp_init(APVDSPContext *dsp);
 
+void ff_apv_dsp_init_x86_64(APVDSPContext *dsp);
+
 #endif /* AVCODEC_APV_DSP_H */
diff --git a/libavcodec/x86/Makefile b/libavcodec/x86/Makefile
index 5d53515381..821c410a0f 100644
--- a/libavcodec/x86/Makefile
+++ b/libavcodec/x86/Makefile
@@ -44,6 +44,7 @@ OBJS-$(CONFIG_ADPCM_G722_DECODER)      += x86/g722dsp_init.o
 OBJS-$(CONFIG_ADPCM_G722_ENCODER)      += x86/g722dsp_init.o
 OBJS-$(CONFIG_ALAC_DECODER)            += x86/alacdsp_init.o
 OBJS-$(CONFIG_APNG_DECODER)            += x86/pngdsp_init.o
+OBJS-$(CONFIG_APV_DECODER)             += x86/apv_dsp_init.o
 OBJS-$(CONFIG_CAVS_DECODER)            += x86/cavsdsp.o
 OBJS-$(CONFIG_CFHD_DECODER)            += x86/cfhddsp_init.o
 OBJS-$(CONFIG_CFHD_ENCODER)            += x86/cfhdencdsp_init.o
@@ -149,6 +150,7 @@ X86ASM-OBJS-$(CONFIG_ADPCM_G722_DECODER) += x86/g722dsp.o
 X86ASM-OBJS-$(CONFIG_ADPCM_G722_ENCODER) += x86/g722dsp.o
 X86ASM-OBJS-$(CONFIG_ALAC_DECODER)     += x86/alacdsp.o
 X86ASM-OBJS-$(CONFIG_APNG_DECODER)     += x86/pngdsp.o
+X86ASM-OBJS-$(CONFIG_APV_DECODER)      += x86/apv_dsp.o
 X86ASM-OBJS-$(CONFIG_CAVS_DECODER)     += x86/cavsidct.o
 X86ASM-OBJS-$(CONFIG_CFHD_ENCODER)     += x86/cfhdencdsp.o
 X86ASM-OBJS-$(CONFIG_CFHD_DECODER)     += x86/cfhddsp.o
diff --git a/libavcodec/x86/apv_dsp.asm b/libavcodec/x86/apv_dsp.asm
new file mode 100644
index 0000000000..6b045e989a
--- /dev/null
+++ b/libavcodec/x86/apv_dsp.asm
@@ -0,0 +1,279 @@
+;************************************************************************
+;* This file is part of FFmpeg.
+;*
+;* FFmpeg is free software; you can redistribute it and/or
+;* modify it under the terms of the GNU Lesser General Public
+;* License as published by the Free Software Foundation; either
+;* version 2.1 of the License, or (at your option) any later version.
+;*
+;* FFmpeg is distributed in the hope that it will be useful,
+;* but WITHOUT ANY WARRANTY; without even the implied warranty of
+;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+;* Lesser General Public License for more details.
+;*
+;* You should have received a copy of the GNU Lesser General Public
+;* License along with FFmpeg; if not, write to the Free Software
+;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+;******************************************************************************
+
+%include "libavutil/x86/x86util.asm"
+
+SECTION_RODATA 32
+
+; Full matrix for row transform.
+const tmatrix_row
+    dw  64,  89,  84,  75,  64,  50,  35,  18
+    dw  64, -18, -84,  50,  64, -75, -35,  89
+    dw  64,  75,  35, -18, -64, -89, -84, -50
+    dw  64, -50, -35,  89, -64, -18,  84, -75
+    dw  64,  50, -35, -89, -64,  18,  84,  75
+    dw  64, -75,  35,  18, -64,  89, -84,  50
+    dw  64,  18, -84, -50,  64,  75, -35, -89
+    dw  64, -89,  84, -75,  64, -50,  35, -18
+
+; Constant pairs for broadcast in column transform.
+const tmatrix_col_even
+    dw  64,  64,  64, -64
+    dw  84,  35,  35, -84
+const tmatrix_col_odd
+    dw  89,  75,  50,  18
+    dw  75, -18, -89, -50
+    dw  50, -89,  18,  75
+    dw  18, -50,  75, -89
+
+; Memory targets for vpbroadcastd (register version requires AVX512).
+cextern pd_1
+const sixtyfour
+    dd  64
+
+SECTION .text
+
+; void ff_apv_decode_transquant_avx2(void *output,
+;                                    ptrdiff_t pitch,
+;                                    const int16_t *input,
+;                                    const int16_t *qmatrix,
+;                                    int bit_depth,
+;                                    int qp_shift);
+
+INIT_YMM avx2
+
+cglobal apv_decode_transquant, 6, 7, 16, output, pitch, input, qmatrix, bit_depth, qp_shift, tmp
+
+    ; Load input and dequantise
+
+    vpbroadcastd  m10, [pd_1]
+    lea       tmpq, [bit_depthq - 2]
+    movd      xm8, qp_shiftd
+    movd      xm9, tmpd
+    vpslld    m10, m10, xm9
+    vpsrld    m10, m10, 1
+
+    ; m8  = scalar qp_shift
+    ; m9  = scalar bd_shift
+    ; m10 = vector 1 << (bd_shift - 1)
+    ; m11 = qmatrix load
+
+%macro LOAD_AND_DEQUANT 2 ; (xmm input, constant offset)
+    vpmovsxwd m%1, [inputq   + %2]
+    vpmovsxwd m11, [qmatrixq + %2]
+    vpmaddwd  m%1, m%1, m11
+    vpslld    m%1, m%1, xm8
+    vpaddd    m%1, m%1, m10
+    vpsrad    m%1, m%1, xm9
+    vpackssdw m%1, m%1, m%1
+%endmacro
+
+    LOAD_AND_DEQUANT 0, 0x00
+    LOAD_AND_DEQUANT 1, 0x10
+    LOAD_AND_DEQUANT 2, 0x20
+    LOAD_AND_DEQUANT 3, 0x30
+    LOAD_AND_DEQUANT 4, 0x40
+    LOAD_AND_DEQUANT 5, 0x50
+    LOAD_AND_DEQUANT 6, 0x60
+    LOAD_AND_DEQUANT 7, 0x70
+
+    ; mN = row N words 0 1 2 3 0 1 2 3 4 5 6 7 4 5 6 7
+
+    ; Transform columns
+    ; This applies a 1-D DCT butterfly
+
+    vpunpcklwd  m12, m0,  m4
+    vpunpcklwd  m13, m2,  m6
+    vpunpcklwd  m14, m1,  m3
+    vpunpcklwd  m15, m5,  m7
+
+    ; m12 = rows 0 and 4 interleaved
+    ; m13 = rows 2 and 6 interleaved
+    ; m14 = rows 1 and 3 interleaved
+    ; m15 = rows 5 and 7 interleaved
+
+    vpbroadcastd   m0, [tmatrix_col_even + 0x00]
+    vpbroadcastd   m1, [tmatrix_col_even + 0x04]
+    vpbroadcastd   m2, [tmatrix_col_even + 0x08]
+    vpbroadcastd   m3, [tmatrix_col_even + 0x0c]
+
+    vpmaddwd  m4,  m12, m0
+    vpmaddwd  m5,  m12, m1
+    vpmaddwd  m6,  m13, m2
+    vpmaddwd  m7,  m13, m3
+    vpaddd    m8,  m4,  m6
+    vpaddd    m9,  m5,  m7
+    vpsubd    m10, m5,  m7
+    vpsubd    m11, m4,  m6
+
+    vpbroadcastd   m0, [tmatrix_col_odd + 0x00]
+    vpbroadcastd   m1, [tmatrix_col_odd + 0x04]
+    vpbroadcastd   m2, [tmatrix_col_odd + 0x08]
+    vpbroadcastd   m3, [tmatrix_col_odd + 0x0c]
+
+    vpmaddwd  m4,  m14, m0
+    vpmaddwd  m5,  m15, m1
+    vpmaddwd  m6,  m14, m2
+    vpmaddwd  m7,  m15, m3
+    vpaddd    m12, m4,  m5
+    vpaddd    m13, m6,  m7
+
+    vpbroadcastd   m0, [tmatrix_col_odd + 0x10]
+    vpbroadcastd   m1, [tmatrix_col_odd + 0x14]
+    vpbroadcastd   m2, [tmatrix_col_odd + 0x18]
+    vpbroadcastd   m3, [tmatrix_col_odd + 0x1c]
+
+    vpmaddwd  m4,  m14, m0
+    vpmaddwd  m5,  m15, m1
+    vpmaddwd  m6,  m14, m2
+    vpmaddwd  m7,  m15, m3
+    vpaddd    m14, m4,  m5
+    vpaddd    m15, m6,  m7
+
+    vpaddd    m0,  m8,  m12
+    vpaddd    m1,  m9,  m13
+    vpaddd    m2,  m10, m14
+    vpaddd    m3,  m11, m15
+    vpsubd    m4,  m11, m15
+    vpsubd    m5,  m10, m14
+    vpsubd    m6,  m9,  m13
+    vpsubd    m7,  m8,  m12
+
+    ; Mid-transform normalisation
+    ; Note that outputs here are fitted to 16 bits
+
+    vpbroadcastd  m8, [sixtyfour]
+
+%macro NORMALISE 1
+    vpaddd    m%1, m%1, m8
+    vpsrad    m%1, m%1, 7
+    vpackssdw m%1, m%1, m%1
+    vpermq    m%1, m%1, q3120
+%endmacro
+
+    NORMALISE 0
+    NORMALISE 1
+    NORMALISE 2
+    NORMALISE 3
+    NORMALISE 4
+    NORMALISE 5
+    NORMALISE 6
+    NORMALISE 7
+
+    ; mN = row N words 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
+
+    ; Transform rows
+    ; This multiplies the rows directly by the transform matrix,
+    ; avoiding the need to transpose anything
+
+    mova      m12, [tmatrix_row + 0x00]
+    mova      m13, [tmatrix_row + 0x20]
+    mova      m14, [tmatrix_row + 0x40]
+    mova      m15, [tmatrix_row + 0x60]
+
+%macro TRANS_ROW_STEP 1
+    vpmaddwd  m8,  m%1, m12
+    vpmaddwd  m9,  m%1, m13
+    vpmaddwd  m10, m%1, m14
+    vpmaddwd  m11, m%1, m15
+    vphaddd   m8,  m8,  m9
+    vphaddd   m10, m10, m11
+    vphaddd   m%1, m8,  m10
+%endmacro
+
+    TRANS_ROW_STEP 0
+    TRANS_ROW_STEP 1
+    TRANS_ROW_STEP 2
+    TRANS_ROW_STEP 3
+    TRANS_ROW_STEP 4
+    TRANS_ROW_STEP 5
+    TRANS_ROW_STEP 6
+    TRANS_ROW_STEP 7
+
+    ; Renormalise, clip and store output
+
+    vpbroadcastd  m14, [pd_1]
+    mov       tmpd, 20
+    sub       tmpd, bit_depthd
+    movd      xm9, tmpd
+    dec       tmpd
+    movd      xm13, tmpd
+    movd      xm15, bit_depthd
+    vpslld    m8,  m14, xm13
+    vpslld    m12, m14, xm15
+    vpsrld    m10, m12, 1
+    vpsubd    m12, m12, m14
+    vpxor     m11, m11, m11
+
+    ; m8  = vector 1 << (bd_shift - 1)
+    ; m9  = scalar bd_shift
+    ; m10 = vector 1 << (bit_depth - 1)
+    ; m11 = zero
+    ; m12 = vector (1 << bit_depth) - 1
+
+    cmp       bit_depthd, 8
+    jne       store_10
+
+%macro NORMALISE_AND_STORE_8 1
+    vpaddd    m%1, m%1, m8
+    vpsrad    m%1, m%1, xm9
+    vpaddd    m%1, m%1, m10
+    vextracti128  xm13, m%1, 0
+    vextracti128  xm14, m%1, 1
+    vpackusdw xm%1, xm13, xm14
+    vpackuswb xm%1, xm%1, xm%1
+    movq      [outputq], xm%1
+    add       outputq, pitchq
+%endmacro
+
+    NORMALISE_AND_STORE_8 0
+    NORMALISE_AND_STORE_8 1
+    NORMALISE_AND_STORE_8 2
+    NORMALISE_AND_STORE_8 3
+    NORMALISE_AND_STORE_8 4
+    NORMALISE_AND_STORE_8 5
+    NORMALISE_AND_STORE_8 6
+    NORMALISE_AND_STORE_8 7
+
+    RET
+
+store_10:
+
+%macro NORMALISE_AND_STORE_10 1
+    vpaddd    m%1, m%1, m8
+    vpsrad    m%1, m%1, xm9
+    vpaddd    m%1, m%1, m10
+    vpmaxsd   m%1, m%1, m11
+    vpminsd   m%1, m%1, m12
+    vextracti128  xm13, m%1, 0
+    vextracti128  xm14, m%1, 1
+    vpackusdw xm%1, xm13, xm14
+    mova      [outputq], xm%1
+    add       outputq, pitchq
+%endmacro
+
+    NORMALISE_AND_STORE_10 0
+    NORMALISE_AND_STORE_10 1
+    NORMALISE_AND_STORE_10 2
+    NORMALISE_AND_STORE_10 3
+    NORMALISE_AND_STORE_10 4
+    NORMALISE_AND_STORE_10 5
+    NORMALISE_AND_STORE_10 6
+    NORMALISE_AND_STORE_10 7
+
+    RET
diff --git a/libavcodec/x86/apv_dsp_init.c b/libavcodec/x86/apv_dsp_init.c
new file mode 100644
index 0000000000..bc017ce37a
--- /dev/null
+++ b/libavcodec/x86/apv_dsp_init.c
@@ -0,0 +1,40 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "config.h"
+#include "libavutil/attributes.h"
+#include "libavutil/cpu.h"
+#include "libavutil/x86/asm.h"
+#include "libavutil/x86/cpu.h"
+#include "libavcodec/apv_dsp.h"
+
+void ff_apv_decode_transquant_avx2(void *output,
+                                   ptrdiff_t pitch,
+                                   const int16_t *input,
+                                   const int16_t *qmatrix,
+                                   int bit_depth,
+                                   int qp_shift);
+
+av_cold void ff_apv_dsp_init_x86_64(APVDSPContext *dsp)
+{
+    int cpu_flags = av_get_cpu_flags();
+
+    if (EXTERNAL_AVX2_FAST(cpu_flags)) {
+        dsp->decode_transquant = ff_apv_decode_transquant_avx2;
+    }
+}
diff --git a/tests/checkasm/Makefile b/tests/checkasm/Makefile
index d5c50e5599..193c1e4633 100644
--- a/tests/checkasm/Makefile
+++ b/tests/checkasm/Makefile
@@ -28,6 +28,7 @@ AVCODECOBJS-$(CONFIG_AAC_DECODER)       += aacpsdsp.o \
                                            sbrdsp.o
 AVCODECOBJS-$(CONFIG_AAC_ENCODER)       += aacencdsp.o
 AVCODECOBJS-$(CONFIG_ALAC_DECODER)      += alacdsp.o
+AVCODECOBJS-$(CONFIG_APV_DECODER)       += apv_dsp.o
 AVCODECOBJS-$(CONFIG_DCA_DECODER)       += synth_filter.o
 AVCODECOBJS-$(CONFIG_DIRAC_DECODER)     += diracdsp.o
 AVCODECOBJS-$(CONFIG_EXR_DECODER)       += exrdsp.o
diff --git a/tests/checkasm/apv_dsp.c b/tests/checkasm/apv_dsp.c
new file mode 100644
index 0000000000..b3adb8ca06
--- /dev/null
+++ b/tests/checkasm/apv_dsp.c
@@ -0,0 +1,109 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <stdint.h>
+
+#include "checkasm.h"
+
+#include "libavutil/attributes.h"
+#include "libavutil/mem_internal.h"
+#include "libavcodec/apv_dsp.h"
+
+
+static void check_decode_transquant_8(void)
+{
+    LOCAL_ALIGNED_16(int16_t, input,      [64]);
+    LOCAL_ALIGNED_16(int16_t, qmatrix,    [64]);
+    LOCAL_ALIGNED_16(uint8_t, new_output, [64]);
+    LOCAL_ALIGNED_16(uint8_t, ref_output, [64]);
+
+    declare_func(void,
+                 uint8_t *output,
+                 ptrdiff_t pitch,
+                 const int16_t *input,
+                 const int16_t *qmatrix,
+                 int bit_depth,
+                 int qp_shift);
+
+    for (int i = 0; i < 64; i++) {
+        // Any signed 12-bit integer.
+        input[i] = rnd() % 2048 - 1024;
+
+        // qmatrix input is premultiplied by level_scale, so
+        // range is 1 to 255 * 71.  Interesting values are all
+        // at the low end of that, though.
+        qmatrix[i] = rnd() % 16 + 16;
+    }
+
+    call_ref(ref_output, 8, input, qmatrix, 8, 4);
+    call_new(new_output, 8, input, qmatrix, 8, 4);
+
+    if (memcmp(new_output, ref_output, 64 * sizeof(*ref_output)))
+        fail();
+
+    bench_new(new_output, 8, input, qmatrix, 8, 4);
+}
+
+static void check_decode_transquant_10(void)
+{
+    LOCAL_ALIGNED_16( int16_t, input,      [64]);
+    LOCAL_ALIGNED_16( int16_t, qmatrix,    [64]);
+    LOCAL_ALIGNED_16(uint16_t, new_output, [64]);
+    LOCAL_ALIGNED_16(uint16_t, ref_output, [64]);
+
+    declare_func(void,
+                 uint16_t *output,
+                 ptrdiff_t pitch,
+                 const int16_t *input,
+                 const int16_t *qmatrix,
+                 int bit_depth,
+                 int qp_shift);
+
+    for (int i = 0; i < 64; i++) {
+        // Any signed 14-bit integer.
+        input[i] = rnd() % 16384 - 8192;
+
+        // qmatrix input is premultiplied by level_scale, so
+        // range is 1 to 255 * 71.  Interesting values are all
+        // at the low end of that, though.
+        qmatrix[i] = 16; //rnd() % 16 + 16;
+    }
+
+    call_ref(ref_output, 16, input, qmatrix, 10, 4);
+    call_new(new_output, 16, input, qmatrix, 10, 4);
+
+    if (memcmp(new_output, ref_output, 64 * sizeof(*ref_output)))
+        fail();
+
+    bench_new(new_output, 16, input, qmatrix, 10, 4);
+}
+
+void checkasm_check_apv_dsp(void)
+{
+    APVDSPContext dsp;
+
+    ff_apv_dsp_init(&dsp);
+
+    if (check_func(dsp.decode_transquant, "decode_transquant_8"))
+        check_decode_transquant_8();
+
+    if (check_func(dsp.decode_transquant, "decode_transquant_10"))
+        check_decode_transquant_10();
+
+    report("apv_dsp");
+}
diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
index 412b8b2cd1..3bb82ed0e5 100644
--- a/tests/checkasm/checkasm.c
+++ b/tests/checkasm/checkasm.c
@@ -129,6 +129,9 @@ static const struct {
     #if CONFIG_ALAC_DECODER
         { "alacdsp", checkasm_check_alacdsp },
     #endif
+    #if CONFIG_APV_DECODER
+        { "apv_dsp", checkasm_check_apv_dsp },
+    #endif
     #if CONFIG_AUDIODSP
         { "audiodsp", checkasm_check_audiodsp },
     #endif
diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h
index ad239fb2a4..a6b5965e02 100644
--- a/tests/checkasm/checkasm.h
+++ b/tests/checkasm/checkasm.h
@@ -83,6 +83,7 @@ void checkasm_check_ac3dsp(void);
 void checkasm_check_aes(void);
 void checkasm_check_afir(void);
 void checkasm_check_alacdsp(void);
+void checkasm_check_apv_dsp(void);
 void checkasm_check_audiodsp(void);
 void checkasm_check_av_tx(void);
 void checkasm_check_blend(void);
diff --git a/tests/fate/checkasm.mak b/tests/fate/checkasm.mak
index 6d42df148e..720c5fd77e 100644
--- a/tests/fate/checkasm.mak
+++ b/tests/fate/checkasm.mak
@@ -4,6 +4,7 @@ FATE_CHECKASM = fate-checkasm-aacencdsp                                 \
                 fate-checkasm-aes                                       \
                 fate-checkasm-af_afir                                   \
                 fate-checkasm-alacdsp                                   \
+                fate-checkasm-apv_dsp                                   \
                 fate-checkasm-audiodsp                                  \
                 fate-checkasm-av_tx                                     \
                 fate-checkasm-blockdsp                                  \
-- 
2.47.2
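
For readers following the last stage of the AVX2 routine in this patch, the register comments (m8 to m12) suggest the scalar arithmetic sketched below. This is only an illustration: the name renorm_clip_sketch is hypothetical, it is not the C reference from libavcodec/apv_dsp.c, the accepted bit_depth range is an assumption of the sketch, and it assumes that >> on a negative int32_t is an arithmetic shift, as vpsrad is.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one output sample in the
 * "Renormalise, clip and store output" stage of the asm. */
static int32_t renorm_clip_sketch(int32_t x, int bit_depth)
{
    /* Assumption of this sketch only: keep the shifts well defined. */
    assert(bit_depth >= 8 && bit_depth <= 16);

    int     bd_shift = 20 - bit_depth;        /* m9:  scalar bd_shift        */
    int32_t round    = 1 << (bd_shift - 1);   /* m8:  1 << (bd_shift - 1)    */
    int32_t offset   = 1 << (bit_depth - 1);  /* m10: 1 << (bit_depth - 1)   */
    int32_t max      = (1 << bit_depth) - 1;  /* m12: (1 << bit_depth) - 1   */

    /* Round, shift down, then recentre around the mid-range value. */
    int32_t v = ((x + round) >> bd_shift) + offset;

    /* Clip to the valid sample range (m11 is the zero lower bound). */
    if (v < 0)
        v = 0;
    if (v > max)
        v = max;
    return v;
}
```

With bit_depth 8 a zero input maps to 128, and with bit_depth 10 to 512. Note that the 8-bit path in the assembly has no explicit min/max and relies on the saturating packs (vpackusdw, vpackuswb) instead.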

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".


* [FFmpeg-devel] [PATCH v2 6/6] lavc: APV metadata bitstream filter
  2025-04-21 15:24 [FFmpeg-devel] [PATCH v2 0/6] APV support Mark Thompson
                   ` (4 preceding siblings ...)
  2025-04-21 15:24 ` [FFmpeg-devel] [PATCH v2 5/6] lavc/apv: AVX2 transquant for x86-64 Mark Thompson
@ 2025-04-21 15:24 ` Mark Thompson
  5 siblings, 0 replies; 12+ messages in thread
From: Mark Thompson @ 2025-04-21 15:24 UTC (permalink / raw)
  To: ffmpeg-devel

---
 libavcodec/bitstream_filters.c |   1 +
 libavcodec/bsf/Makefile        |   1 +
 libavcodec/bsf/apv_metadata.c  | 134 +++++++++++++++++++++++++++++++++
 3 files changed, 136 insertions(+)
 create mode 100644 libavcodec/bsf/apv_metadata.c

diff --git a/libavcodec/bitstream_filters.c b/libavcodec/bitstream_filters.c
index f923411bee..da9d0d2513 100644
--- a/libavcodec/bitstream_filters.c
+++ b/libavcodec/bitstream_filters.c
@@ -25,6 +25,7 @@
 #include "bsf_internal.h"
 
 extern const FFBitStreamFilter ff_aac_adtstoasc_bsf;
+extern const FFBitStreamFilter ff_apv_metadata_bsf;
 extern const FFBitStreamFilter ff_av1_frame_merge_bsf;
 extern const FFBitStreamFilter ff_av1_frame_split_bsf;
 extern const FFBitStreamFilter ff_av1_metadata_bsf;
diff --git a/libavcodec/bsf/Makefile b/libavcodec/bsf/Makefile
index 40b7fc6e9b..39ea091b50 100644
--- a/libavcodec/bsf/Makefile
+++ b/libavcodec/bsf/Makefile
@@ -2,6 +2,7 @@ clean::
 	$(RM) $(CLEANSUFFIXES:%=libavcodec/bsf/%)
 
 OBJS-$(CONFIG_AAC_ADTSTOASC_BSF)          += bsf/aac_adtstoasc.o
+OBJS-$(CONFIG_APV_METADATA_BSF)           += bsf/apv_metadata.o
 OBJS-$(CONFIG_AV1_FRAME_MERGE_BSF)        += bsf/av1_frame_merge.o
 OBJS-$(CONFIG_AV1_FRAME_SPLIT_BSF)        += bsf/av1_frame_split.o
 OBJS-$(CONFIG_AV1_METADATA_BSF)           += bsf/av1_metadata.o
diff --git a/libavcodec/bsf/apv_metadata.c b/libavcodec/bsf/apv_metadata.c
new file mode 100644
index 0000000000..a1cdcf86c8
--- /dev/null
+++ b/libavcodec/bsf/apv_metadata.c
@@ -0,0 +1,134 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/common.h"
+#include "libavutil/opt.h"
+
+#include "bsf.h"
+#include "bsf_internal.h"
+#include "cbs.h"
+#include "cbs_bsf.h"
+#include "cbs_apv.h"
+
+typedef struct APVMetadataContext {
+    CBSBSFContext common;
+
+    int color_primaries;
+    int transfer_characteristics;
+    int matrix_coefficients;
+    int full_range_flag;
+} APVMetadataContext;
+
+
+static int apv_metadata_update_frame_header(AVBSFContext *bsf,
+                                            APVRawFrameHeader *hdr)
+{
+    APVMetadataContext *ctx = bsf->priv_data;
+
+    if (ctx->color_primaries >= 0          ||
+        ctx->transfer_characteristics >= 0 ||
+        ctx->matrix_coefficients >= 0      ||
+        ctx->full_range_flag >= 0) {
+        hdr->color_description_present_flag = 1;
+
+        if (ctx->color_primaries >= 0)
+            hdr->color_primaries = ctx->color_primaries;
+        if (ctx->transfer_characteristics >= 0)
+            hdr->transfer_characteristics = ctx->transfer_characteristics;
+        if (ctx->matrix_coefficients >= 0)
+            hdr->matrix_coefficients = ctx->matrix_coefficients;
+        if (ctx->full_range_flag >= 0)
+            hdr->full_range_flag = ctx->full_range_flag;
+    }
+
+    return 0;
+}
+
+static int apv_metadata_update_fragment(AVBSFContext *bsf, AVPacket *pkt,
+                                        CodedBitstreamFragment *frag)
+{
+    int err, i;
+
+    for (i = 0; i < frag->nb_units; i++) {
+        if (frag->units[i].type == APV_PBU_PRIMARY_FRAME) {
+            APVRawFrame *pbu = frag->units[i].content;
+            err = apv_metadata_update_frame_header(bsf, &pbu->frame_header);
+            if (err < 0)
+                return err;
+        }
+    }
+
+    return 0;
+}
+
+static const CBSBSFType apv_metadata_type = {
+    .codec_id        = AV_CODEC_ID_APV,
+    .fragment_name   = "access unit",
+    .unit_name       = "PBU",
+    .update_fragment = &apv_metadata_update_fragment,
+};
+
+static int apv_metadata_init(AVBSFContext *bsf)
+{
+    return ff_cbs_bsf_generic_init(bsf, &apv_metadata_type);
+}
+
+#define OFFSET(x) offsetof(APVMetadataContext, x)
+#define FLAGS (AV_OPT_FLAG_VIDEO_PARAM|AV_OPT_FLAG_BSF_PARAM)
+static const AVOption apv_metadata_options[] = {
+    { "color_primaries", "Set color primaries (section 5.3.5)",
+        OFFSET(color_primaries), AV_OPT_TYPE_INT,
+        { .i64 = -1 }, -1, 255, FLAGS },
+    { "transfer_characteristics", "Set transfer characteristics (section 5.3.5)",
+        OFFSET(transfer_characteristics), AV_OPT_TYPE_INT,
+        { .i64 = -1 }, -1, 255, FLAGS },
+    { "matrix_coefficients", "Set matrix coefficients (section 5.3.5)",
+        OFFSET(matrix_coefficients), AV_OPT_TYPE_INT,
+        { .i64 = -1 }, -1, 255, FLAGS },
+
+    { "full_range_flag", "Set full range flag (section 5.3.5)",
+        OFFSET(full_range_flag), AV_OPT_TYPE_INT,
+        { .i64 = -1 }, -1, 1, FLAGS, .unit = "cr" },
+    { "tv", "TV (limited) range", 0, AV_OPT_TYPE_CONST,
+        { .i64 = 0 }, .flags = FLAGS, .unit = "cr" },
+    { "pc", "PC (full) range",    0, AV_OPT_TYPE_CONST,
+        { .i64 = 1 }, .flags = FLAGS, .unit = "cr" },
+
+    { NULL }
+};
+
+static const AVClass apv_metadata_class = {
+    .class_name = "apv_metadata_bsf",
+    .item_name  = av_default_item_name,
+    .option     = apv_metadata_options,
+    .version    = LIBAVUTIL_VERSION_INT,
+};
+
+static const enum AVCodecID apv_metadata_codec_ids[] = {
+    AV_CODEC_ID_APV, AV_CODEC_ID_NONE,
+};
+
+const FFBitStreamFilter ff_apv_metadata_bsf = {
+    .p.name         = "apv_metadata",
+    .p.codec_ids    = apv_metadata_codec_ids,
+    .p.priv_class   = &apv_metadata_class,
+    .priv_data_size = sizeof(APVMetadataContext),
+    .init           = &apv_metadata_init,
+    .close          = &ff_cbs_bsf_generic_close,
+    .filter         = &ff_cbs_bsf_generic_filter,
+};
-- 
2.47.2
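
The option handling in this filter follows a "-1 means leave unchanged" convention. A minimal sketch of that pattern is below, assuming simplified stand-in types: OverrideSketch, HeaderSketch and apply_overrides_sketch are hypothetical names, and only the fields the filter touches are modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the option fields of APVMetadataContext. */
typedef struct OverrideSketch {
    int color_primaries;
    int transfer_characteristics;
    int matrix_coefficients;
    int full_range_flag;
} OverrideSketch;

/* Hypothetical stand-in for the colour fields of APVRawFrameHeader. */
typedef struct HeaderSketch {
    uint8_t color_description_present_flag;
    uint8_t color_primaries;
    uint8_t transfer_characteristics;
    uint8_t matrix_coefficients;
    uint8_t full_range_flag;
} HeaderSketch;

static void apply_overrides_sketch(HeaderSketch *hdr, const OverrideSketch *opt)
{
    assert(hdr && opt);

    /* Nothing requested: leave the header exactly as it was. */
    if (opt->color_primaries < 0 && opt->transfer_characteristics < 0 &&
        opt->matrix_coefficients < 0 && opt->full_range_flag < 0)
        return;

    /* At least one override: the colour description must be signalled. */
    hdr->color_description_present_flag = 1;

    /* Only fields with a non-negative option value are overwritten. */
    if (opt->color_primaries >= 0)
        hdr->color_primaries = (uint8_t)opt->color_primaries;
    if (opt->transfer_characteristics >= 0)
        hdr->transfer_characteristics = (uint8_t)opt->transfer_characteristics;
    if (opt->matrix_coefficients >= 0)
        hdr->matrix_coefficients = (uint8_t)opt->matrix_coefficients;
    if (opt->full_range_flag >= 0)
        hdr->full_range_flag = (uint8_t)opt->full_range_flag;
}
```

As in the patch, setting a single option turns the present flag on while the remaining fields keep whatever values the header already held.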


* Re: [FFmpeg-devel] [PATCH v2 5/6] lavc/apv: AVX2 transquant for x86-64
  2025-04-21 15:24 ` [FFmpeg-devel] [PATCH v2 5/6] lavc/apv: AVX2 transquant for x86-64 Mark Thompson
@ 2025-04-21 16:53   ` James Almer
  2025-04-21 19:50     ` Mark Thompson
  2025-04-23 19:52   ` Michael Niedermayer
  1 sibling, 1 reply; 12+ messages in thread
From: James Almer @ 2025-04-21 16:53 UTC (permalink / raw)
  To: ffmpeg-devel


On 4/21/2025 12:24 PM, Mark Thompson wrote:
> Typical checkasm result on Alder Lake:
> 
> decode_transquant_8_c:                                 461.1 ( 1.00x)
> decode_transquant_8_avx2:                               97.5 ( 4.73x)
> decode_transquant_10_c:                                483.9 ( 1.00x)
> decode_transquant_10_avx2:                              91.7 ( 5.28x)
> ---
>   libavcodec/apv_dsp.c          |   4 +
>   libavcodec/apv_dsp.h          |   2 +
>   libavcodec/x86/Makefile       |   2 +
>   libavcodec/x86/apv_dsp.asm    | 279 ++++++++++++++++++++++++++++++++++
>   libavcodec/x86/apv_dsp_init.c |  40 +++++
>   tests/checkasm/Makefile       |   1 +
>   tests/checkasm/apv_dsp.c      | 109 +++++++++++++
>   tests/checkasm/checkasm.c     |   3 +
>   tests/checkasm/checkasm.h     |   1 +
>   tests/fate/checkasm.mak       |   1 +
>   10 files changed, 442 insertions(+)
>   create mode 100644 libavcodec/x86/apv_dsp.asm
>   create mode 100644 libavcodec/x86/apv_dsp_init.c
>   create mode 100644 tests/checkasm/apv_dsp.c
> 
> diff --git a/libavcodec/apv_dsp.c b/libavcodec/apv_dsp.c
> index fe11cd6b94..fd814ef900 100644
> --- a/libavcodec/apv_dsp.c
> +++ b/libavcodec/apv_dsp.c
> @@ -133,4 +133,8 @@ static void apv_decode_transquant_c(void *output,
>   av_cold void ff_apv_dsp_init(APVDSPContext *dsp)
>   {
>       dsp->decode_transquant = apv_decode_transquant_c;
> +
> +#if ARCH_X86_64
> +    ff_apv_dsp_init_x86_64(dsp);
> +#endif
>   }
> diff --git a/libavcodec/apv_dsp.h b/libavcodec/apv_dsp.h
> index 31645b8581..c63d6a88ee 100644
> --- a/libavcodec/apv_dsp.h
> +++ b/libavcodec/apv_dsp.h
> @@ -34,4 +34,6 @@ typedef struct APVDSPContext {
>   
>   void ff_apv_dsp_init(APVDSPContext *dsp);
>   
> +void ff_apv_dsp_init_x86_64(APVDSPContext *dsp);
> +
>   #endif /* AVCODEC_APV_DSP_H */
> diff --git a/libavcodec/x86/Makefile b/libavcodec/x86/Makefile
> index 5d53515381..821c410a0f 100644
> --- a/libavcodec/x86/Makefile
> +++ b/libavcodec/x86/Makefile
> @@ -44,6 +44,7 @@ OBJS-$(CONFIG_ADPCM_G722_DECODER)      += x86/g722dsp_init.o
>   OBJS-$(CONFIG_ADPCM_G722_ENCODER)      += x86/g722dsp_init.o
>   OBJS-$(CONFIG_ALAC_DECODER)            += x86/alacdsp_init.o
>   OBJS-$(CONFIG_APNG_DECODER)            += x86/pngdsp_init.o
> +OBJS-$(CONFIG_APV_DECODER)             += x86/apv_dsp_init.o
>   OBJS-$(CONFIG_CAVS_DECODER)            += x86/cavsdsp.o
>   OBJS-$(CONFIG_CFHD_DECODER)            += x86/cfhddsp_init.o
>   OBJS-$(CONFIG_CFHD_ENCODER)            += x86/cfhdencdsp_init.o
> @@ -149,6 +150,7 @@ X86ASM-OBJS-$(CONFIG_ADPCM_G722_DECODER) += x86/g722dsp.o
>   X86ASM-OBJS-$(CONFIG_ADPCM_G722_ENCODER) += x86/g722dsp.o
>   X86ASM-OBJS-$(CONFIG_ALAC_DECODER)     += x86/alacdsp.o
>   X86ASM-OBJS-$(CONFIG_APNG_DECODER)     += x86/pngdsp.o
> +X86ASM-OBJS-$(CONFIG_APV_DECODER)      += x86/apv_dsp.o
>   X86ASM-OBJS-$(CONFIG_CAVS_DECODER)     += x86/cavsidct.o
>   X86ASM-OBJS-$(CONFIG_CFHD_ENCODER)     += x86/cfhdencdsp.o
>   X86ASM-OBJS-$(CONFIG_CFHD_DECODER)     += x86/cfhddsp.o
> diff --git a/libavcodec/x86/apv_dsp.asm b/libavcodec/x86/apv_dsp.asm
> new file mode 100644
> index 0000000000..6b045e989a
> --- /dev/null
> +++ b/libavcodec/x86/apv_dsp.asm
> @@ -0,0 +1,279 @@
> +;************************************************************************
> +;* This file is part of FFmpeg.
> +;*
> +;* FFmpeg is free software; you can redistribute it and/or
> +;* modify it under the terms of the GNU Lesser General Public
> +;* License as published by the Free Software Foundation; either
> +;* version 2.1 of the License, or (at your option) any later version.
> +;*
> +;* FFmpeg is distributed in the hope that it will be useful,
> +;* but WITHOUT ANY WARRANTY; without even the implied warranty of
> +;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +;* Lesser General Public License for more details.
> +;*
> +;* You should have received a copy of the GNU Lesser General Public
> +;* License along with FFmpeg; if not, write to the Free Software
> +;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> +;******************************************************************************
> +
> +%include "libavutil/x86/x86util.asm"
> +
> +SECTION_RODATA 32
> +
> +; Full matrix for row transform.
> +const tmatrix_row
> +    dw  64,  89,  84,  75,  64,  50,  35,  18
> +    dw  64, -18, -84,  50,  64, -75, -35,  89
> +    dw  64,  75,  35, -18, -64, -89, -84, -50
> +    dw  64, -50, -35,  89, -64, -18,  84, -75
> +    dw  64,  50, -35, -89, -64,  18,  84,  75
> +    dw  64, -75,  35,  18, -64,  89, -84,  50
> +    dw  64,  18, -84, -50,  64,  75, -35, -89
> +    dw  64, -89,  84, -75,  64, -50,  35, -18
> +
> +; Constant pairs for broadcast in column transform.
> +const tmatrix_col_even
> +    dw  64,  64,  64, -64
> +    dw  84,  35,  35, -84
> +const tmatrix_col_odd
> +    dw  89,  75,  50,  18
> +    dw  75, -18, -89, -50
> +    dw  50, -89,  18,  75
> +    dw  18, -50,  75, -89
> +
> +; Memory targets for vpbroadcastd (register version requires AVX512).
> +cextern pd_1
> +const sixtyfour
> +    dd  64
> +
> +SECTION .text
> +
> +; void ff_apv_decode_transquant_avx2(void *output,
> +;                                    ptrdiff_t pitch,
> +;                                    const int16_t *input,
> +;                                    const int16_t *qmatrix,
> +;                                    int bit_depth,
> +;                                    int qp_shift);
> +
> +INIT_YMM avx2
> +
> +cglobal apv_decode_transquant, 6, 7, 16, output, pitch, input, qmatrix, bit_depth, qp_shift, tmp
> +
> +    ; Load input and dequantise
> +
> +    vpbroadcastd  m10, [pd_1]
> +    lea       tmpq, [bit_depthq - 2]

lea       tmpd, [bit_depthd - 2]

The upper 32 bits of the register may have garbage.
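
The point about the upper half can be modelled in plain C. The sketch below is an illustration only: a uint64_t stands in for the register, the helper names are hypothetical, and the garbage value is arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/* Model a 64-bit GPR holding a 32-bit int argument: only the low half
 * is defined, so the high half is filled with `garbage` here. */
static uint64_t gpr_with_garbage(uint32_t value, uint32_t garbage)
{
    return ((uint64_t)garbage << 32) | value;
}

/* Model of "lea tmpq, [bit_depthq - 2]": computes on all 64 bits. */
static uint64_t lea_q_sketch(uint64_t reg)
{
    return reg - 2;
}

/* Model of "lea tmpd, [bit_depthd - 2]": computes on the low 32 bits. */
static uint32_t lea_d_sketch(uint64_t reg)
{
    return (uint32_t)reg - 2;
}

/* Usage: the 32-bit form is clean, the 64-bit form carries the garbage
 * along, although the low halves of both results agree. */
static void lea_sketch_demo(void)
{
    uint64_t reg = gpr_with_garbage(10, 0xDEADBEEFu);

    assert(lea_d_sketch(reg) == 8);
    assert(lea_q_sketch(reg) != 8);
    assert((uint32_t)lea_q_sketch(reg) == 8);
}
```

Using the d-suffixed names throughout means no later instruction can pick up the undefined upper half by accident.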

> +    movd      xm8, qp_shiftd

If you declare the function as 5, 7, 16, then qp_shift will not be
loaded into a GPR on ABIs where it is passed on the stack (Win64, and
x86_32 if it were supported), and you can then do

     movd      xm8, qp_shiftm

which loads it directly from memory into the SIMD register, saving one
instruction in the prologue.

> +    movd      xm9, tmpd
> +    vpslld    m10, m10, xm9
> +    vpsrld    m10, m10, 1
> +
> +    ; m8  = scalar qp_shift
> +    ; m9  = scalar bd_shift
> +    ; m10 = vector 1 << (bd_shift - 1)
> +    ; m11 = qmatrix load
> +
> +%macro LOAD_AND_DEQUANT 2 ; (xmm input, constant offset)
> +    vpmovsxwd m%1, [inputq   + %2]
> +    vpmovsxwd m11, [qmatrixq + %2]
> +    vpmaddwd  m%1, m%1, m11
> +    vpslld    m%1, m%1, xm8
> +    vpaddd    m%1, m%1, m10
> +    vpsrad    m%1, m%1, xm9
> +    vpackssdw m%1, m%1, m%1
> +%endmacro
> +
> +    LOAD_AND_DEQUANT 0, 0x00
> +    LOAD_AND_DEQUANT 1, 0x10
> +    LOAD_AND_DEQUANT 2, 0x20
> +    LOAD_AND_DEQUANT 3, 0x30
> +    LOAD_AND_DEQUANT 4, 0x40
> +    LOAD_AND_DEQUANT 5, 0x50
> +    LOAD_AND_DEQUANT 6, 0x60
> +    LOAD_AND_DEQUANT 7, 0x70
> +
> +    ; mN = row N words 0 1 2 3 0 1 2 3 4 5 6 7 4 5 6 7
> +
> +    ; Transform columns
> +    ; This applies a 1-D DCT butterfly
> +
> +    vpunpcklwd  m12, m0,  m4
> +    vpunpcklwd  m13, m2,  m6
> +    vpunpcklwd  m14, m1,  m3
> +    vpunpcklwd  m15, m5,  m7
> +
> +    ; m12 = rows 0 and 4 interleaved
> +    ; m13 = rows 2 and 6 interleaved
> +    ; m14 = rows 1 and 3 interleaved
> +    ; m15 = rows 5 and 7 interleaved
> +
> +    vpbroadcastd   m0, [tmatrix_col_even + 0x00]
> +    vpbroadcastd   m1, [tmatrix_col_even + 0x04]
> +    vpbroadcastd   m2, [tmatrix_col_even + 0x08]
> +    vpbroadcastd   m3, [tmatrix_col_even + 0x0c]

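Maybe load the table address once:

lea tmpq, [tmatrix_col_even]
vpbroadcastd   m0, [tmpq + 0x00]
vpbroadcastd   m1, [tmpq + 0x04]
...

That emits smaller instructions, since each load then needs only a
register base and a short displacement instead of a full 32-bit one.
The same applies to tmatrix_col_odd and tmatrix_row below.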

> +
> +    vpmaddwd  m4,  m12, m0
> +    vpmaddwd  m5,  m12, m1
> +    vpmaddwd  m6,  m13, m2
> +    vpmaddwd  m7,  m13, m3
> +    vpaddd    m8,  m4,  m6
> +    vpaddd    m9,  m5,  m7
> +    vpsubd    m10, m5,  m7
> +    vpsubd    m11, m4,  m6
> +
> +    vpbroadcastd   m0, [tmatrix_col_odd + 0x00]
> +    vpbroadcastd   m1, [tmatrix_col_odd + 0x04]
> +    vpbroadcastd   m2, [tmatrix_col_odd + 0x08]
> +    vpbroadcastd   m3, [tmatrix_col_odd + 0x0c]
> +
> +    vpmaddwd  m4,  m14, m0
> +    vpmaddwd  m5,  m15, m1
> +    vpmaddwd  m6,  m14, m2
> +    vpmaddwd  m7,  m15, m3
> +    vpaddd    m12, m4,  m5
> +    vpaddd    m13, m6,  m7
> +
> +    vpbroadcastd   m0, [tmatrix_col_odd + 0x10]
> +    vpbroadcastd   m1, [tmatrix_col_odd + 0x14]
> +    vpbroadcastd   m2, [tmatrix_col_odd + 0x18]
> +    vpbroadcastd   m3, [tmatrix_col_odd + 0x1c]
> +
> +    vpmaddwd  m4,  m14, m0
> +    vpmaddwd  m5,  m15, m1
> +    vpmaddwd  m6,  m14, m2
> +    vpmaddwd  m7,  m15, m3
> +    vpaddd    m14, m4,  m5
> +    vpaddd    m15, m6,  m7
> +
> +    vpaddd    m0,  m8,  m12
> +    vpaddd    m1,  m9,  m13
> +    vpaddd    m2,  m10, m14
> +    vpaddd    m3,  m11, m15
> +    vpsubd    m4,  m11, m15
> +    vpsubd    m5,  m10, m14
> +    vpsubd    m6,  m9,  m13
> +    vpsubd    m7,  m8,  m12
> +
> +    ; Mid-transform normalisation
> +    ; Note that outputs here are fitted to 16 bits
> +
> +    vpbroadcastd  m8, [sixtyfour]
> +
> +%macro NORMALISE 1
> +    vpaddd    m%1, m%1, m8
> +    vpsrad    m%1, m%1, 7
> +    vpackssdw m%1, m%1, m%1
> +    vpermq    m%1, m%1, q3120
> +%endmacro
> +
> +    NORMALISE 0
> +    NORMALISE 1
> +    NORMALISE 2
> +    NORMALISE 3
> +    NORMALISE 4
> +    NORMALISE 5
> +    NORMALISE 6
> +    NORMALISE 7
> +
> +    ; mN = row N words 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
> +
> +    ; Transform rows
> +    ; This multiplies the rows directly by the transform matrix,
> +    ; avoiding the need to transpose anything
> +
> +    mova      m12, [tmatrix_row + 0x00]
> +    mova      m13, [tmatrix_row + 0x20]
> +    mova      m14, [tmatrix_row + 0x40]
> +    mova      m15, [tmatrix_row + 0x60]
> +
> +%macro TRANS_ROW_STEP 1
> +    vpmaddwd  m8,  m%1, m12
> +    vpmaddwd  m9,  m%1, m13
> +    vpmaddwd  m10, m%1, m14
> +    vpmaddwd  m11, m%1, m15
> +    vphaddd   m8,  m8,  m9
> +    vphaddd   m10, m10, m11
> +    vphaddd   m%1, m8,  m10
> +%endmacro
> +
> +    TRANS_ROW_STEP 0
> +    TRANS_ROW_STEP 1
> +    TRANS_ROW_STEP 2
> +    TRANS_ROW_STEP 3
> +    TRANS_ROW_STEP 4
> +    TRANS_ROW_STEP 5
> +    TRANS_ROW_STEP 6
> +    TRANS_ROW_STEP 7
> +
> +    ; Renormalise, clip and store output
> +
> +    vpbroadcastd  m14, [pd_1]
> +    mov       tmpd, 20
> +    sub       tmpd, bit_depthd
> +    movd      xm9, tmpd
> +    dec       tmpd
> +    movd      xm13, tmpd
> +    movd      xm15, bit_depthd
> +    vpslld    m8,  m14, xm13
> +    vpslld    m12, m14, xm15
> +    vpsrld    m10, m12, 1
> +    vpsubd    m12, m12, m14
> +    vpxor     m11, m11, m11
> +
> +    ; m8  = vector 1 << (bd_shift - 1)
> +    ; m9  = scalar bd_shift
> +    ; m10 = vector 1 << (bit_depth - 1)
> +    ; m11 = zero
> +    ; m12 = vector (1 << bit_depth) - 1
> +
> +    cmp       bit_depthd, 8
> +    jne       store_10
> +
> +%macro NORMALISE_AND_STORE_8 1
> +    vpaddd    m%1, m%1, m8
> +    vpsrad    m%1, m%1, xm9
> +    vpaddd    m%1, m%1, m10
> +    vextracti128  xm13, m%1, 0
> +    vextracti128  xm14, m%1, 1
> +    vpackusdw xm%1, xm13, xm14
> +    vpackuswb xm%1, xm%1, xm%1

     vpaddd    m%1, m%1, m10
     vextracti128  xm14, m%1, 1
     vpackusdw xm%1, xm%1, xm14
     vpackuswb xm%1, xm%1, xm%1

vextracti128 with 0 as third argument is the same as a mova for the 
lower 128 bits, so it's not needed.
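
A scalar model shows why dropping the lane-0 extract cannot change the result. This is a sketch under stated assumptions: a 256-bit register is modelled as eight int32_t values, and all function names are hypothetical models of the instructions, not real intrinsics.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Unsigned saturation of a signed dword to 16 bits, as vpackusdw does. */
static uint16_t sat_u16(int32_t v)
{
    return v < 0 ? 0 : v > 65535 ? 65535 : (uint16_t)v;
}

/* Model of vextracti128 xmm, ymm, lane: copy one 128-bit half. */
static void extracti128_sketch(int32_t dst[4], const int32_t ymm[8], int lane)
{
    assert(lane == 0 || lane == 1);
    memcpy(dst, ymm + 4 * lane, 4 * sizeof(*dst));
}

/* Model of vpackusdw xmm, a, b: 4 dwords of a, then 4 dwords of b. */
static void packusdw_sketch(uint16_t dst[8], const int32_t a[4],
                            const int32_t b[4])
{
    for (int i = 0; i < 4; i++) {
        dst[i]     = sat_u16(a[i]);
        dst[i + 4] = sat_u16(b[i]);
    }
}

/* Sequence in the patch: extract both halves, then pack. */
static void pack_two_extracts(uint16_t dst[8], const int32_t ymm[8])
{
    int32_t lo[4], hi[4];

    extracti128_sketch(lo, ymm, 0);
    extracti128_sketch(hi, ymm, 1);
    packusdw_sketch(dst, lo, hi);
}

/* Suggested sequence: read the low half of the register in place. */
static void pack_one_extract(uint16_t dst[8], const int32_t ymm[8])
{
    int32_t hi[4];

    extracti128_sketch(hi, ymm, 1);
    packusdw_sketch(dst, ymm, hi);
}
```

Both functions produce identical output for any input, because the lane-0 extract is only a copy of data the xmm view of the register already exposes.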

> +    movq      [outputq], xm%1
> +    add       outputq, pitchq
> +%endmacro
> +
> +    NORMALISE_AND_STORE_8 0
> +    NORMALISE_AND_STORE_8 1
> +    NORMALISE_AND_STORE_8 2
> +    NORMALISE_AND_STORE_8 3
> +    NORMALISE_AND_STORE_8 4
> +    NORMALISE_AND_STORE_8 5
> +    NORMALISE_AND_STORE_8 6
> +    NORMALISE_AND_STORE_8 7
> +
> +    RET
> +
> +store_10:
> +
> +%macro NORMALISE_AND_STORE_10 1
> +    vpaddd    m%1, m%1, m8
> +    vpsrad    m%1, m%1, xm9
> +    vpaddd    m%1, m%1, m10
> +    vpmaxsd   m%1, m%1, m11
> +    vpminsd   m%1, m%1, m12
> +    vextracti128  xm13, m%1, 0
> +    vextracti128  xm14, m%1, 1
> +    vpackusdw xm%1, xm13, xm14

Same.

> +    mova      [outputq], xm%1
> +    add       outputq, pitchq
> +%endmacro
> +
> +    NORMALISE_AND_STORE_10 0
> +    NORMALISE_AND_STORE_10 1
> +    NORMALISE_AND_STORE_10 2
> +    NORMALISE_AND_STORE_10 3
> +    NORMALISE_AND_STORE_10 4
> +    NORMALISE_AND_STORE_10 5
> +    NORMALISE_AND_STORE_10 6
> +    NORMALISE_AND_STORE_10 7
> +
> +    RET
> diff --git a/libavcodec/x86/apv_dsp_init.c b/libavcodec/x86/apv_dsp_init.c
> new file mode 100644
> index 0000000000..bc017ce37a
> --- /dev/null
> +++ b/libavcodec/x86/apv_dsp_init.c
> @@ -0,0 +1,40 @@
> +/*
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include "config.h"
> +#include "libavutil/attributes.h"
> +#include "libavutil/cpu.h"
> +#include "libavutil/x86/asm.h"
> +#include "libavutil/x86/cpu.h"
> +#include "libavcodec/apv_dsp.h"
> +
> +void ff_apv_decode_transquant_avx2(void *output,
> +                                   ptrdiff_t pitch,
> +                                   const int16_t *input,
> +                                   const int16_t *qmatrix,
> +                                   int bit_depth,
> +                                   int qp_shift);
> +
> +av_cold void ff_apv_dsp_init_x86_64(APVDSPContext *dsp)
> +{
> +    int cpu_flags = av_get_cpu_flags();
> +
> +    if (EXTERNAL_AVX2_FAST(cpu_flags)) {
> +        dsp->decode_transquant = ff_apv_decode_transquant_avx2;
> +    }
> +}
> diff --git a/tests/checkasm/Makefile b/tests/checkasm/Makefile
> index d5c50e5599..193c1e4633 100644
> --- a/tests/checkasm/Makefile
> +++ b/tests/checkasm/Makefile
> @@ -28,6 +28,7 @@ AVCODECOBJS-$(CONFIG_AAC_DECODER)       += aacpsdsp.o \
>                                              sbrdsp.o
>   AVCODECOBJS-$(CONFIG_AAC_ENCODER)       += aacencdsp.o
>   AVCODECOBJS-$(CONFIG_ALAC_DECODER)      += alacdsp.o
> +AVCODECOBJS-$(CONFIG_APV_DECODER)       += apv_dsp.o
>   AVCODECOBJS-$(CONFIG_DCA_DECODER)       += synth_filter.o
>   AVCODECOBJS-$(CONFIG_DIRAC_DECODER)     += diracdsp.o
>   AVCODECOBJS-$(CONFIG_EXR_DECODER)       += exrdsp.o
> diff --git a/tests/checkasm/apv_dsp.c b/tests/checkasm/apv_dsp.c
> new file mode 100644
> index 0000000000..b3adb8ca06
> --- /dev/null
> +++ b/tests/checkasm/apv_dsp.c
> @@ -0,0 +1,109 @@
> +/*
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include <stdint.h>
> +
> +#include "checkasm.h"
> +
> +#include "libavutil/attributes.h"
> +#include "libavutil/mem_internal.h"
> +#include "libavcodec/apv_dsp.h"
> +
> +
> +static void check_decode_transquant_8(void)
> +{
> +    LOCAL_ALIGNED_16(int16_t, input,      [64]);
> +    LOCAL_ALIGNED_16(int16_t, qmatrix,    [64]);
> +    LOCAL_ALIGNED_16(uint8_t, new_output, [64]);
> +    LOCAL_ALIGNED_16(uint8_t, ref_output, [64]);
> +
> +    declare_func(void,
> +                 uint8_t *output,
> +                 ptrdiff_t pitch,
> +                 const int16_t *input,
> +                 const int16_t *qmatrix,
> +                 int bit_depth,
> +                 int qp_shift);
> +
> +    for (int i = 0; i < 64; i++) {
> +        // Any signed 12-bit integer.
> +        input[i] = rnd() % 2048 - 1024;
> +
> +        // qmatrix input is premultiplied by level_scale, so
> +        // range is 1 to 255 * 71.  Interesting values are all
> +        // at the low end of that, though.
> +        qmatrix[i] = rnd() % 16 + 16;
> +    }
> +
> +    call_ref(ref_output, 8, input, qmatrix, 8, 4);
> +    call_new(new_output, 8, input, qmatrix, 8, 4);
> +
> +    if (memcmp(new_output, ref_output, 64 * sizeof(*ref_output)))
> +        fail();
> +
> +    bench_new(new_output, 8, input, qmatrix, 8, 4);
> +}
> +
> +static void check_decode_transquant_10(void)
> +{
> +    LOCAL_ALIGNED_16( int16_t, input,      [64]);
> +    LOCAL_ALIGNED_16( int16_t, qmatrix,    [64]);
> +    LOCAL_ALIGNED_16(uint16_t, new_output, [64]);
> +    LOCAL_ALIGNED_16(uint16_t, ref_output, [64]);
> +
> +    declare_func(void,
> +                 uint16_t *output,
> +                 ptrdiff_t pitch,
> +                 const int16_t *input,
> +                 const int16_t *qmatrix,
> +                 int bit_depth,
> +                 int qp_shift);
> +
> +    for (int i = 0; i < 64; i++) {
> +        // Any signed 14-bit integer.
> +        input[i] = rnd() % 16384 - 8192;
> +
> +        // qmatrix input is premultiplied by level_scale, so
> +        // range is 1 to 255 * 71.  Interesting values are all
> +        // at the low end of that, though.
> +        qmatrix[i] = 16; //rnd() % 16 + 16;
> +    }
> +
> +    call_ref(ref_output, 16, input, qmatrix, 10, 4);
> +    call_new(new_output, 16, input, qmatrix, 10, 4);
> +
> +    if (memcmp(new_output, ref_output, 64 * sizeof(*ref_output)))
> +        fail();
> +
> +    bench_new(new_output, 16, input, qmatrix, 10, 4);
> +}
> +
> +void checkasm_check_apv_dsp(void)
> +{
> +    APVDSPContext dsp;
> +
> +    ff_apv_dsp_init(&dsp);
> +
> +    if (check_func(dsp.decode_transquant, "decode_transquant_8"))
> +        check_decode_transquant_8();
> +
> +    if (check_func(dsp.decode_transquant, "decode_transquant_10"))
> +        check_decode_transquant_10();
> +
> +    report("apv_dsp");
> +}
> diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
> index 412b8b2cd1..3bb82ed0e5 100644
> --- a/tests/checkasm/checkasm.c
> +++ b/tests/checkasm/checkasm.c
> @@ -129,6 +129,9 @@ static const struct {
>       #if CONFIG_ALAC_DECODER
>           { "alacdsp", checkasm_check_alacdsp },
>       #endif
> +    #if CONFIG_APV_DECODER
> +        { "apv_dsp", checkasm_check_apv_dsp },
> +    #endif
>       #if CONFIG_AUDIODSP
>           { "audiodsp", checkasm_check_audiodsp },
>       #endif
> diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h
> index ad239fb2a4..a6b5965e02 100644
> --- a/tests/checkasm/checkasm.h
> +++ b/tests/checkasm/checkasm.h
> @@ -83,6 +83,7 @@ void checkasm_check_ac3dsp(void);
>   void checkasm_check_aes(void);
>   void checkasm_check_afir(void);
>   void checkasm_check_alacdsp(void);
> +void checkasm_check_apv_dsp(void);
>   void checkasm_check_audiodsp(void);
>   void checkasm_check_av_tx(void);
>   void checkasm_check_blend(void);
> diff --git a/tests/fate/checkasm.mak b/tests/fate/checkasm.mak
> index 6d42df148e..720c5fd77e 100644
> --- a/tests/fate/checkasm.mak
> +++ b/tests/fate/checkasm.mak
> @@ -4,6 +4,7 @@ FATE_CHECKASM = fate-checkasm-aacencdsp                                 \
>                   fate-checkasm-aes                                       \
>                   fate-checkasm-af_afir                                   \
>                   fate-checkasm-alacdsp                                   \
> +                fate-checkasm-apv_dsp                                   \
>                   fate-checkasm-audiodsp                                  \
>                   fate-checkasm-av_tx                                     \
>                   fate-checkasm-blockdsp                                  \


_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [FFmpeg-devel] [PATCH v2 5/6] lavc/apv: AVX2 transquant for x86-64
  2025-04-21 16:53   ` James Almer
@ 2025-04-21 19:50     ` Mark Thompson
  2025-04-22 20:00       ` James Almer
  0 siblings, 1 reply; 12+ messages in thread
From: Mark Thompson @ 2025-04-21 19:50 UTC (permalink / raw)
  To: ffmpeg-devel

On 21/04/2025 17:53, James Almer wrote:
> On 4/21/2025 12:24 PM, Mark Thompson wrote:
>> Typical checkasm result on Alder Lake:
>>
>> decode_transquant_8_c:                                 461.1 ( 1.00x)
>> decode_transquant_8_avx2:                               97.5 ( 4.73x)
>> decode_transquant_10_c:                                483.9 ( 1.00x)
>> decode_transquant_10_avx2:                              91.7 ( 5.28x)
>> ---
>>   libavcodec/apv_dsp.c          |   4 +
>>   libavcodec/apv_dsp.h          |   2 +
>>   libavcodec/x86/Makefile       |   2 +
>>   libavcodec/x86/apv_dsp.asm    | 279 ++++++++++++++++++++++++++++++++++
>>   libavcodec/x86/apv_dsp_init.c |  40 +++++
>>   tests/checkasm/Makefile       |   1 +
>>   tests/checkasm/apv_dsp.c      | 109 +++++++++++++
>>   tests/checkasm/checkasm.c     |   3 +
>>   tests/checkasm/checkasm.h     |   1 +
>>   tests/fate/checkasm.mak       |   1 +
>>   10 files changed, 442 insertions(+)
>>   create mode 100644 libavcodec/x86/apv_dsp.asm
>>   create mode 100644 libavcodec/x86/apv_dsp_init.c
>>   create mode 100644 tests/checkasm/apv_dsp.c
>>
>> ...
>> diff --git a/libavcodec/x86/apv_dsp.asm b/libavcodec/x86/apv_dsp.asm
>> new file mode 100644
>> index 0000000000..6b045e989a
>> --- /dev/null
>> +++ b/libavcodec/x86/apv_dsp.asm
>> @@ -0,0 +1,279 @@
>> +;************************************************************************
>> +;* This file is part of FFmpeg.
>> +;*
>> +;* FFmpeg is free software; you can redistribute it and/or
>> +;* modify it under the terms of the GNU Lesser General Public
>> +;* License as published by the Free Software Foundation; either
>> +;* version 2.1 of the License, or (at your option) any later version.
>> +;*
>> +;* FFmpeg is distributed in the hope that it will be useful,
>> +;* but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +;* Lesser General Public License for more details.
>> +;*
>> +;* You should have received a copy of the GNU Lesser General Public
>> +;* License along with FFmpeg; if not, write to the Free Software
>> +;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
>> +;******************************************************************************
>> +
>> +%include "libavutil/x86/x86util.asm"
>> +
>> +SECTION_RODATA 32
>> +
>> +; Full matrix for row transform.
>> +const tmatrix_row
>> +    dw  64,  89,  84,  75,  64,  50,  35,  18
>> +    dw  64, -18, -84,  50,  64, -75, -35,  89
>> +    dw  64,  75,  35, -18, -64, -89, -84, -50
>> +    dw  64, -50, -35,  89, -64, -18,  84, -75
>> +    dw  64,  50, -35, -89, -64,  18,  84,  75
>> +    dw  64, -75,  35,  18, -64,  89, -84,  50
>> +    dw  64,  18, -84, -50,  64,  75, -35, -89
>> +    dw  64, -89,  84, -75,  64, -50,  35, -18
>> +
>> +; Constant pairs for broadcast in column transform.
>> +const tmatrix_col_even
>> +    dw  64,  64,  64, -64
>> +    dw  84,  35,  35, -84
>> +const tmatrix_col_odd
>> +    dw  89,  75,  50,  18
>> +    dw  75, -18, -89, -50
>> +    dw  50, -89,  18,  75
>> +    dw  18, -50,  75, -89
>> +
>> +; Memory targets for vpbroadcastd (register version requires AVX512).
>> +cextern pd_1
>> +const sixtyfour
>> +    dd  64
>> +
>> +SECTION .text
>> +
>> +; void ff_apv_decode_transquant_avx2(void *output,
>> +;                                    ptrdiff_t pitch,
>> +;                                    const int16_t *input,
>> +;                                    const int16_t *qmatrix,
>> +;                                    int bit_depth,
>> +;                                    int qp_shift);
>> +
>> +INIT_YMM avx2
>> +
>> +cglobal apv_decode_transquant, 6, 7, 16, output, pitch, input, qmatrix, bit_depth, qp_shift, tmp
>> +
>> +    ; Load input and dequantise
>> +
>> +    vpbroadcastd  m10, [pd_1]
>> +    lea       tmpq, [bit_depthq - 2]
> 
> lea       tmpd, [bit_depthd - 2]
> 
> The upper 32 bits of the register may have garbage.

Ah, I was assuming that lea had to be pointer-sized, but apparently it doesn't.  Changed.
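To illustrate the point about the upper half of the register, here is a small C model (not code from the patch; the function names are made up for this sketch). It treats a 64-bit register as holding a 32-bit int argument with arbitrary junk above it, and compares the 32-bit and 64-bit forms of the subtraction.

```c
#include <stdint.h>

/* Hypothetical model of "lea tmpd, [bit_depthd - 2]":
 * only the low 32 bits of the register take part. */
uint32_t bd_shift_32(uint64_t reg)
{
    return (uint32_t)reg - 2;
}

/* Hypothetical model of "lea tmpq, [bit_depthq - 2]":
 * whatever sits in the upper 32 bits is carried into the result. */
uint64_t bd_shift_64(uint64_t reg)
{
    return reg - 2;
}
```

With bit_depth = 10 and junk in the top half, e.g. 0xdeadbeef0000000a, the 32-bit form gives 8 while the 64-bit form gives 0xdeadbeef00000008. The low 32 bits agree in both cases, so the difference only shows up if something later consumes the full 64-bit value; in the quoted code tmp appears to be read only through tmpd afterwards.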

>> +    movd      xm8, qp_shiftd
> 
> If you declare the function as 5, 7, 16, then qp_shift will not be loaded into a gpr on ABIs where it's on stack (Win64, and x86_32 if it was supported), and then you can do
> 
>     movd      xm8, qp_shiftm
> 
> Which will load it directly to the simd register from memory, saving one instruction in the prologue.

This seems like highly dubious magic since it is lying about the number of arguments.

I've changed it, but I want to check a Windows machine as well.

>> +    movd      xm9, tmpd
>> +    vpslld    m10, m10, xm9
>> +    vpsrld    m10, m10, 1
>> +
>> +    ; m8  = scalar qp_shift
>> +    ; m9  = scalar bd_shift
>> +    ; m10 = vector 1 << (bd_shift - 1)
>> +    ; m11 = qmatrix load
>> +
>> +%macro LOAD_AND_DEQUANT 2 ; (xmm input, constant offset)
>> +    vpmovsxwd m%1, [inputq   + %2]
>> +    vpmovsxwd m11, [qmatrixq + %2]
>> +    vpmaddwd  m%1, m%1, m11
>> +    vpslld    m%1, m%1, xm8
>> +    vpaddd    m%1, m%1, m10
>> +    vpsrad    m%1, m%1, xm9
>> +    vpackssdw m%1, m%1, m%1
>> +%endmacro
>> +
>> +    LOAD_AND_DEQUANT 0, 0x00
>> +    LOAD_AND_DEQUANT 1, 0x10
>> +    LOAD_AND_DEQUANT 2, 0x20
>> +    LOAD_AND_DEQUANT 3, 0x30
>> +    LOAD_AND_DEQUANT 4, 0x40
>> +    LOAD_AND_DEQUANT 5, 0x50
>> +    LOAD_AND_DEQUANT 6, 0x60
>> +    LOAD_AND_DEQUANT 7, 0x70
>> +
>> +    ; mN = row N words 0 1 2 3 0 1 2 3 4 5 6 7 4 5 6 7
>> +
>> +    ; Transform columns
>> +    ; This applies a 1-D DCT butterfly
>> +
>> +    vpunpcklwd  m12, m0,  m4
>> +    vpunpcklwd  m13, m2,  m6
>> +    vpunpcklwd  m14, m1,  m3
>> +    vpunpcklwd  m15, m5,  m7
>> +
>> +    ; m12 = rows 0 and 4 interleaved
>> +    ; m13 = rows 2 and 6 interleaved
>> +    ; m14 = rows 1 and 3 interleaved
>> +    ; m15 = rows 5 and 7 interleaved
>> +
>> +    vpbroadcastd   m0, [tmatrix_col_even + 0x00]
>> +    vpbroadcastd   m1, [tmatrix_col_even + 0x04]
>> +    vpbroadcastd   m2, [tmatrix_col_even + 0x08]
>> +    vpbroadcastd   m3, [tmatrix_col_even + 0x0c]
> 
> Maybe do
> 
> lea tmpq, [tmatrix_col_even]
> vpbroadcastd   m0, [tmpq + 0x00]
> vpbroadcastd   m1, [tmpq + 0x04]
> ...
> 
> To emit smaller instructions. Same for tmatrix_col_odd and tmatrix_row below.

 150:   48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        # 157 <ff_apv_decode_transquant_avx2+0x157>
 157:   c4 e2 7d 58 00          vpbroadcastd (%rax),%ymm0
 15c:   c4 e2 7d 58 48 04       vpbroadcastd 0x4(%rax),%ymm1
 162:   c4 e2 7d 58 50 08       vpbroadcastd 0x8(%rax),%ymm2
 168:   c4 e2 7d 58 58 0c       vpbroadcastd 0xc(%rax),%ymm3

 18e:   c4 e2 7d 58 05 00 00    vpbroadcastd 0x0(%rip),%ymm0        # 197 <ff_apv_decode_transquant_avx2+0x197>
 195:   00 00
 197:   c4 e2 7d 58 0d 00 00    vpbroadcastd 0x0(%rip),%ymm1        # 1a0 <ff_apv_decode_transquant_avx2+0x1a0>
 19e:   00 00
 1a0:   c4 e2 7d 58 15 00 00    vpbroadcastd 0x0(%rip),%ymm2        # 1a9 <ff_apv_decode_transquant_avx2+0x1a9>
 1a7:   00 00
 1a9:   c4 e2 7d 58 1d 00 00    vpbroadcastd 0x0(%rip),%ymm3        # 1b2 <ff_apv_decode_transquant_avx2+0x1b2>
 1b0:   00 00

Saves 6 bytes, but there is now a dependency which wasn't there before.  Is it really better?
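The byte count can be checked from the two listings. A rough sketch of the arithmetic in C, with the instruction lengths read off the objdump output above (the names are invented for this example, and the lengths assume the same base register and disp8-sized offsets as in that listing):

```c
#include <stddef.h>

/* Encoded lengths taken from the objdump listing. */
enum {
    LEA_RIP_LEN       = 7, /* 48 8d 05 + disp32 */
    BCAST_BASE_LEN    = 5, /* c4 e2 7d 58 + modrm, no displacement */
    BCAST_BASE_D8_LEN = 6, /* same, plus a one-byte displacement */
    BCAST_RIP_LEN     = 9  /* same, plus a four-byte RIP-relative displacement */
};

/* Bytes for n broadcasts off a single lea'd base; the first uses offset 0. */
size_t size_with_lea(size_t n)
{
    if (n == 0)
        return 0;
    return LEA_RIP_LEN + BCAST_BASE_LEN + (n - 1) * BCAST_BASE_D8_LEN;
}

/* Bytes for n broadcasts that each address the table RIP-relative. */
size_t size_rip_relative(size_t n)
{
    return n * BCAST_RIP_LEN;
}
```

For the four broadcasts shown this is 30 bytes against 36, which matches the 6 bytes saved. It says nothing about speed, which is the open question.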

>> +
>> +    vpmaddwd  m4,  m12, m0
>> +    vpmaddwd  m5,  m12, m1
>> +    vpmaddwd  m6,  m13, m2
>> +    vpmaddwd  m7,  m13, m3
>> +    vpaddd    m8,  m4,  m6
>> +    vpaddd    m9,  m5,  m7
>> +    vpsubd    m10, m5,  m7
>> +    vpsubd    m11, m4,  m6
>> +
>> +    vpbroadcastd   m0, [tmatrix_col_odd + 0x00]
>> +    vpbroadcastd   m1, [tmatrix_col_odd + 0x04]
>> +    vpbroadcastd   m2, [tmatrix_col_odd + 0x08]
>> +    vpbroadcastd   m3, [tmatrix_col_odd + 0x0c]
>> +
>> +    vpmaddwd  m4,  m14, m0
>> +    vpmaddwd  m5,  m15, m1
>> +    vpmaddwd  m6,  m14, m2
>> +    vpmaddwd  m7,  m15, m3
>> +    vpaddd    m12, m4,  m5
>> +    vpaddd    m13, m6,  m7
>> +
>> +    vpbroadcastd   m0, [tmatrix_col_odd + 0x10]
>> +    vpbroadcastd   m1, [tmatrix_col_odd + 0x14]
>> +    vpbroadcastd   m2, [tmatrix_col_odd + 0x18]
>> +    vpbroadcastd   m3, [tmatrix_col_odd + 0x1c]
>> +
>> +    vpmaddwd  m4,  m14, m0
>> +    vpmaddwd  m5,  m15, m1
>> +    vpmaddwd  m6,  m14, m2
>> +    vpmaddwd  m7,  m15, m3
>> +    vpaddd    m14, m4,  m5
>> +    vpaddd    m15, m6,  m7
>> +
>> +    vpaddd    m0,  m8,  m12
>> +    vpaddd    m1,  m9,  m13
>> +    vpaddd    m2,  m10, m14
>> +    vpaddd    m3,  m11, m15
>> +    vpsubd    m4,  m11, m15
>> +    vpsubd    m5,  m10, m14
>> +    vpsubd    m6,  m9,  m13
>> +    vpsubd    m7,  m8,  m12
>> +
>> +    ; Mid-transform normalisation
>> +    ; Note that outputs here are fitted to 16 bits
>> +
>> +    vpbroadcastd  m8, [sixtyfour]
>> +
>> +%macro NORMALISE 1
>> +    vpaddd    m%1, m%1, m8
>> +    vpsrad    m%1, m%1, 7
>> +    vpackssdw m%1, m%1, m%1
>> +    vpermq    m%1, m%1, q3120
>> +%endmacro
>> +
>> +    NORMALISE 0
>> +    NORMALISE 1
>> +    NORMALISE 2
>> +    NORMALISE 3
>> +    NORMALISE 4
>> +    NORMALISE 5
>> +    NORMALISE 6
>> +    NORMALISE 7
>> +
>> +    ; mN = row N words 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
>> +
>> +    ; Transform rows
>> +    ; This multiplies the rows directly by the transform matrix,
>> +    ; avoiding the need to transpose anything
>> +
>> +    mova      m12, [tmatrix_row + 0x00]
>> +    mova      m13, [tmatrix_row + 0x20]
>> +    mova      m14, [tmatrix_row + 0x40]
>> +    mova      m15, [tmatrix_row + 0x60]
>> +
>> +%macro TRANS_ROW_STEP 1
>> +    vpmaddwd  m8,  m%1, m12
>> +    vpmaddwd  m9,  m%1, m13
>> +    vpmaddwd  m10, m%1, m14
>> +    vpmaddwd  m11, m%1, m15
>> +    vphaddd   m8,  m8,  m9
>> +    vphaddd   m10, m10, m11
>> +    vphaddd   m%1, m8,  m10
>> +%endmacro
>> +
>> +    TRANS_ROW_STEP 0
>> +    TRANS_ROW_STEP 1
>> +    TRANS_ROW_STEP 2
>> +    TRANS_ROW_STEP 3
>> +    TRANS_ROW_STEP 4
>> +    TRANS_ROW_STEP 5
>> +    TRANS_ROW_STEP 6
>> +    TRANS_ROW_STEP 7
>> +
>> +    ; Renormalise, clip and store output
>> +
>> +    vpbroadcastd  m14, [pd_1]
>> +    mov       tmpd, 20
>> +    sub       tmpd, bit_depthd
>> +    movd      xm9, tmpd
>> +    dec       tmpd
>> +    movd      xm13, tmpd
>> +    movd      xm15, bit_depthd
>> +    vpslld    m8,  m14, xm13
>> +    vpslld    m12, m14, xm15
>> +    vpsrld    m10, m12, 1
>> +    vpsubd    m12, m12, m14
>> +    vpxor     m11, m11, m11
>> +
>> +    ; m8  = vector 1 << (bd_shift - 1)
>> +    ; m9  = scalar bd_shift
>> +    ; m10 = vector 1 << (bit_depth - 1)
>> +    ; m11 = zero
>> +    ; m12 = vector (1 << bit_depth) - 1
>> +
>> +    cmp       bit_depthd, 8
>> +    jne       store_10
>> +
>> +%macro NORMALISE_AND_STORE_8 1
>> +    vpaddd    m%1, m%1, m8
>> +    vpsrad    m%1, m%1, xm9
>> +    vpaddd    m%1, m%1, m10
>> +    vextracti128  xm13, m%1, 0
>> +    vextracti128  xm14, m%1, 1
>> +    vpackusdw xm%1, xm13, xm14
>> +    vpackuswb xm%1, xm%1, xm%1
> 
>     vpaddd    m%1, m%1, m10
>     vextracti128  xm14, m%1, 1
>     vpackusdw xm%1, xm%1, xm14
>     vpackuswb xm%1, xm%1, xm%1
> 
> vextracti128 with 0 as third argument is the same as a mova for the lower 128 bits, so it's not needed.

Thinking about this a bit more makes me want to combine rows to not waste elements.  It's not obvious that this is better, but how about:

%macro NORMALISE_AND_STORE_8 4
    vpaddd    m%1, m%1, m8
    vpaddd    m%2, m%2, m8
    vpaddd    m%3, m%3, m8
    vpaddd    m%4, m%4, m8
    vpsrad    m%1, m%1, xm9
    vpsrad    m%2, m%2, xm9
    vpsrad    m%3, m%3, xm9
    vpsrad    m%4, m%4, xm9
    vpaddd    m%1, m%1, m10
    vpaddd    m%2, m%2, m10
    vpaddd    m%3, m%3, m10
    vpaddd    m%4, m%4, m10
    ; m%1 = 32x8   A0-3 A4-7
    ; m%2 = 32x8   B0-3 B4-7
    ; m%3 = 32x8   C0-3 C4-7
    ; m%4 = 32x8   D0-3 D4-7
    vpackusdw m%1, m%1, m%2
    vpackusdw m%3, m%3, m%4
    ; m%1 = 16x16  A0-3 B0-3 A4-7 B4-7
    ; m%3 = 16x16  C0-3 D0-3 C4-7 D4-7
    vpermq    m%1, m%1, q3120
    vpermq    m%2, m%3, q3120
    ; m%1 = 16x16  A0-3 A4-7 B0-3 B4-7
    ; m%2 = 16x16  C0-3 C4-7 D0-3 D4-7
    vpackuswb m%1, m%1, m%2
    ; m%1 = 8x32   A0-3 A4-7 C0-3 C4-7 B0-3 B4-7 D0-3 D4-7
    vextracti128  xm%2, m%1, 1
    vpsrldq   xm%3, xm%1, 8
    vpsrldq   xm%4, xm%2, 8
    vmovq     [outputq],          xm%1
    vmovq     [outputq + pitchq], xm%2
    lea       outputq, [outputq + 2*pitchq]
    vmovq     [outputq],          xm%3
    vmovq     [outputq + pitchq], xm%4
    lea       outputq, [outputq + 2*pitchq]
%endmacro

    NORMALISE_AND_STORE_8 0, 1, 2, 3
    NORMALISE_AND_STORE_8 4, 5, 6, 7
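To sanity-check the lane shuffling in the four-row version, here is a scalar C model of the pack and permute steps (a sketch only, not part of the patch; all helper names are invented, and the rounding/shift/offset steps before the packs are left out). It follows the in-lane behaviour of vpackusdw and vpackuswb on 256-bit registers and the q3120 permute, then pulls out the four 8-byte rows the same way the extract and byte-shift instructions do.

```c
#include <stdint.h>
#include <string.h>

static uint16_t sat_u16(int32_t v) { return v < 0 ? 0 : v > 65535 ? 65535 : (uint16_t)v; }
static uint8_t  sat_u8(int16_t v)  { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

/* vpackusdw ymm: in each 128-bit lane, four dwords of a then four dwords of b. */
static void packusdw(uint16_t dst[16], const int32_t a[8], const int32_t b[8])
{
    for (int lane = 0; lane < 2; lane++)
        for (int i = 0; i < 4; i++) {
            dst[lane * 8 + i]     = sat_u16(a[lane * 4 + i]);
            dst[lane * 8 + 4 + i] = sat_u16(b[lane * 4 + i]);
        }
}

/* vpermq with q3120: swap the two middle 64-bit quarters. */
static void permq_3120(uint16_t r[16])
{
    uint16_t t[4];
    memcpy(t,     r + 4, sizeof(t));
    memcpy(r + 4, r + 8, sizeof(t));
    memcpy(r + 8, t,     sizeof(t));
}

/* vpackuswb ymm: in each lane, eight words of a then eight words of b.
 * The words are read as signed, as the instruction does. */
static void packuswb(uint8_t dst[32], const uint16_t a[16], const uint16_t b[16])
{
    for (int lane = 0; lane < 2; lane++)
        for (int i = 0; i < 8; i++) {
            dst[lane * 16 + i]     = sat_u8((int16_t)a[lane * 8 + i]);
            dst[lane * 16 + 8 + i] = sat_u8((int16_t)b[lane * 8 + i]);
        }
}

/* Pack rows A, B, C, D of eight 32-bit values into four rows of eight bytes. */
void pack_rows_8(uint8_t out[4][8], int32_t rows[4][8])
{
    uint16_t ab[16], cd[16];
    uint8_t  acbd[32];

    packusdw(ab, rows[0], rows[1]);  /* A0-3 B0-3 | A4-7 B4-7 */
    packusdw(cd, rows[2], rows[3]);  /* C0-3 D0-3 | C4-7 D4-7 */
    permq_3120(ab);                  /* A0-7 | B0-7 */
    permq_3120(cd);                  /* C0-7 | D0-7 */
    packuswb(acbd, ab, cd);          /* A0-7 C0-7 | B0-7 D0-7 */

    memcpy(out[0], acbd +  0, 8);    /* low qword of the low lane */
    memcpy(out[1], acbd + 16, 8);    /* low qword of the high lane (extract 1) */
    memcpy(out[2], acbd +  8, 8);    /* low lane shifted right by 8 bytes */
    memcpy(out[3], acbd + 24, 8);    /* high lane shifted right by 8 bytes */
}
```

In this model the rows come out in the order A, B, C, D, which is the order the four stores in the macro write them, and out-of-range values saturate to 0 or 255 in the packs.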

>> +    movq      [outputq], xm%1
>> +    add       outputq, pitchq
>> +%endmacro
>> +
>> +    NORMALISE_AND_STORE_8 0
>> +    NORMALISE_AND_STORE_8 1
>> +    NORMALISE_AND_STORE_8 2
>> +    NORMALISE_AND_STORE_8 3
>> +    NORMALISE_AND_STORE_8 4
>> +    NORMALISE_AND_STORE_8 5
>> +    NORMALISE_AND_STORE_8 6
>> +    NORMALISE_AND_STORE_8 7
>> +
>> +    RET
>> +
>> +store_10:
>> +
>> +%macro NORMALISE_AND_STORE_10 1
>> +    vpaddd    m%1, m%1, m8
>> +    vpsrad    m%1, m%1, xm9
>> +    vpaddd    m%1, m%1, m10
>> +    vpmaxsd   m%1, m%1, m11
>> +    vpminsd   m%1, m%1, m12
>> +    vextracti128  xm13, m%1, 0
>> +    vextracti128  xm14, m%1, 1
>> +    vpackusdw xm%1, xm13, xm14
> 
> Same.

A similar method for pairs applies here as well.

>> +    mova      [outputq], xm%1
>> +    add       outputq, pitchq
>> +%endmacro
>> +
>> +    NORMALISE_AND_STORE_10 0
>> +    NORMALISE_AND_STORE_10 1
>> +    NORMALISE_AND_STORE_10 2
>> +    NORMALISE_AND_STORE_10 3
>> +    NORMALISE_AND_STORE_10 4
>> +    NORMALISE_AND_STORE_10 5
>> +    NORMALISE_AND_STORE_10 6
>> +    NORMALISE_AND_STORE_10 7
>> +
>> +    RET
>> ...

Thanks,

- Mark

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [FFmpeg-devel] [PATCH v2 5/6] lavc/apv: AVX2 transquant for x86-64
  2025-04-21 19:50     ` Mark Thompson
@ 2025-04-22 20:00       ` James Almer
  0 siblings, 0 replies; 12+ messages in thread
From: James Almer @ 2025-04-22 20:00 UTC (permalink / raw)
  To: ffmpeg-devel


On 4/21/2025 4:50 PM, Mark Thompson wrote:
> On 21/04/2025 17:53, James Almer wrote:
>> On 4/21/2025 12:24 PM, Mark Thompson wrote:
>>> Typical checkasm result on Alder Lake:
>>>
>>> decode_transquant_8_c:                                 461.1 ( 1.00x)
>>> decode_transquant_8_avx2:                               97.5 ( 4.73x)
>>> decode_transquant_10_c:                                483.9 ( 1.00x)
>>> decode_transquant_10_avx2:                              91.7 ( 5.28x)
>>> ---
>>>    libavcodec/apv_dsp.c          |   4 +
>>>    libavcodec/apv_dsp.h          |   2 +
>>>    libavcodec/x86/Makefile       |   2 +
>>>    libavcodec/x86/apv_dsp.asm    | 279 ++++++++++++++++++++++++++++++++++
>>>    libavcodec/x86/apv_dsp_init.c |  40 +++++
>>>    tests/checkasm/Makefile       |   1 +
>>>    tests/checkasm/apv_dsp.c      | 109 +++++++++++++
>>>    tests/checkasm/checkasm.c     |   3 +
>>>    tests/checkasm/checkasm.h     |   1 +
>>>    tests/fate/checkasm.mak       |   1 +
>>>    10 files changed, 442 insertions(+)
>>>    create mode 100644 libavcodec/x86/apv_dsp.asm
>>>    create mode 100644 libavcodec/x86/apv_dsp_init.c
>>>    create mode 100644 tests/checkasm/apv_dsp.c
>>>
>>> ...
>>> diff --git a/libavcodec/x86/apv_dsp.asm b/libavcodec/x86/apv_dsp.asm
>>> new file mode 100644
>>> index 0000000000..6b045e989a
>>> --- /dev/null
>>> +++ b/libavcodec/x86/apv_dsp.asm
>>> @@ -0,0 +1,279 @@
>>> +;************************************************************************
>>> +;* This file is part of FFmpeg.
>>> +;*
>>> +;* FFmpeg is free software; you can redistribute it and/or
>>> +;* modify it under the terms of the GNU Lesser General Public
>>> +;* License as published by the Free Software Foundation; either
>>> +;* version 2.1 of the License, or (at your option) any later version.
>>> +;*
>>> +;* FFmpeg is distributed in the hope that it will be useful,
>>> +;* but WITHOUT ANY WARRANTY; without even the implied warranty of
>>> +;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>> +;* Lesser General Public License for more details.
>>> +;*
>>> +;* You should have received a copy of the GNU Lesser General Public
>>> +;* License along with FFmpeg; if not, write to the Free Software
>>> +;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
>>> +;******************************************************************************
>>> +
>>> +%include "libavutil/x86/x86util.asm"
>>> +
>>> +SECTION_RODATA 32
>>> +
>>> +; Full matrix for row transform.
>>> +const tmatrix_row
>>> +    dw  64,  89,  84,  75,  64,  50,  35,  18
>>> +    dw  64, -18, -84,  50,  64, -75, -35,  89
>>> +    dw  64,  75,  35, -18, -64, -89, -84, -50
>>> +    dw  64, -50, -35,  89, -64, -18,  84, -75
>>> +    dw  64,  50, -35, -89, -64,  18,  84,  75
>>> +    dw  64, -75,  35,  18, -64,  89, -84,  50
>>> +    dw  64,  18, -84, -50,  64,  75, -35, -89
>>> +    dw  64, -89,  84, -75,  64, -50,  35, -18
>>> +
>>> +; Constant pairs for broadcast in column transform.
>>> +const tmatrix_col_even
>>> +    dw  64,  64,  64, -64
>>> +    dw  84,  35,  35, -84
>>> +const tmatrix_col_odd
>>> +    dw  89,  75,  50,  18
>>> +    dw  75, -18, -89, -50
>>> +    dw  50, -89,  18,  75
>>> +    dw  18, -50,  75, -89
>>> +
>>> +; Memory targets for vpbroadcastd (register version requires AVX512).
>>> +cextern pd_1
>>> +const sixtyfour
>>> +    dd  64
>>> +
>>> +SECTION .text
>>> +
>>> +; void ff_apv_decode_transquant_avx2(void *output,
>>> +;                                    ptrdiff_t pitch,
>>> +;                                    const int16_t *input,
>>> +;                                    const int16_t *qmatrix,
>>> +;                                    int bit_depth,
>>> +;                                    int qp_shift);
>>> +
>>> +INIT_YMM avx2
>>> +
>>> +cglobal apv_decode_transquant, 6, 7, 16, output, pitch, input, qmatrix, bit_depth, qp_shift, tmp
>>> +
>>> +    ; Load input and dequantise
>>> +
>>> +    vpbroadcastd  m10, [pd_1]
>>> +    lea       tmpq, [bit_depthq - 2]
>>
>> lea       tmpd, [bit_depthd - 2]
>>
>> The upper 32 bits of the register may have garbage.
> 
> Ah, I was assuming that lea had to be pointer-sized, but apparently it doesn't.  Changed.
> 
>>> +    movd      xm8, qp_shiftd
>>
>> If you declare the function as 5, 7, 16, then qp_shift will not be loaded into a gpr on ABIs where it's on stack (Win64, and x86_32 if it was supported), and then you can do
>>
>>      movd      xm8, qp_shiftm
>>
>> Which will load it directly to the simd register from memory, saving one instruction in the prologue.
> 
> This seems like highly dubious magic since it is lying about the number of arguments.

You're not lying. That value tells x86inc how many arguments to load into 
gprs in the prologue if they are on the stack. If they are not (as is the 
case with the first six arguments on Unix64, first four on Win64), 
qp_shiftm will be equivalent to qp_shiftd, and if they are, it will be 
the stack memory address.

So with my suggestion, on Win64 you get

movd xmm8, [rsp]; qp_shiftm points to memory

instead of

mov r11, [rsp] ; prologue loads argument into what will be qp_shiftd
movd xmm8, r11d ; qp_shiftd is an alias of r11d

It's ricing, yes, but it's free.

> 
> I've changed it, but I want to check a Windows machine as well.
> 
>>> +    movd      xm9, tmpd
>>> +    vpslld    m10, m10, xm9
>>> +    vpsrld    m10, m10, 1
>>> +
>>> +    ; m8  = scalar qp_shift
>>> +    ; m9  = scalar bd_shift
>>> +    ; m10 = vector 1 << (bd_shift - 1)
>>> +    ; m11 = qmatrix load
>>> +
>>> +%macro LOAD_AND_DEQUANT 2 ; (xmm input, constant offset)
>>> +    vpmovsxwd m%1, [inputq   + %2]
>>> +    vpmovsxwd m11, [qmatrixq + %2]
>>> +    vpmaddwd  m%1, m%1, m11
>>> +    vpslld    m%1, m%1, xm8
>>> +    vpaddd    m%1, m%1, m10
>>> +    vpsrad    m%1, m%1, xm9
>>> +    vpackssdw m%1, m%1, m%1
>>> +%endmacro
>>> +
>>> +    LOAD_AND_DEQUANT 0, 0x00
>>> +    LOAD_AND_DEQUANT 1, 0x10
>>> +    LOAD_AND_DEQUANT 2, 0x20
>>> +    LOAD_AND_DEQUANT 3, 0x30
>>> +    LOAD_AND_DEQUANT 4, 0x40
>>> +    LOAD_AND_DEQUANT 5, 0x50
>>> +    LOAD_AND_DEQUANT 6, 0x60
>>> +    LOAD_AND_DEQUANT 7, 0x70
>>> +
>>> +    ; mN = row N words 0 1 2 3 0 1 2 3 4 5 6 7 4 5 6 7
>>> +
>>> +    ; Transform columns
>>> +    ; This applies a 1-D DCT butterfly
>>> +
>>> +    vpunpcklwd  m12, m0,  m4
>>> +    vpunpcklwd  m13, m2,  m6
>>> +    vpunpcklwd  m14, m1,  m3
>>> +    vpunpcklwd  m15, m5,  m7
>>> +
>>> +    ; m12 = rows 0 and 4 interleaved
>>> +    ; m13 = rows 2 and 6 interleaved
>>> +    ; m14 = rows 1 and 3 interleaved
>>> +    ; m15 = rows 5 and 7 interleaved
>>> +
>>> +    vpbroadcastd   m0, [tmatrix_col_even + 0x00]
>>> +    vpbroadcastd   m1, [tmatrix_col_even + 0x04]
>>> +    vpbroadcastd   m2, [tmatrix_col_even + 0x08]
>>> +    vpbroadcastd   m3, [tmatrix_col_even + 0x0c]
>>
>> Maybe do
>>
>> lea tmpq, [tmatrix_col_even]
>> vpbroadcastd   m0, [tmpq + 0x00]
>> vpbroadcastd   m1, [tmpq + 0x04]
>> ...
>>
>> To emit smaller instructions. Same for tmatrix_col_odd and tmatrix_row below.
> 
>   150:   48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        # 157 <ff_apv_decode_transquant_avx2+0x157>
>   157:   c4 e2 7d 58 00          vpbroadcastd (%rax),%ymm0
>   15c:   c4 e2 7d 58 48 04       vpbroadcastd 0x4(%rax),%ymm1
>   162:   c4 e2 7d 58 50 08       vpbroadcastd 0x8(%rax),%ymm2
>   168:   c4 e2 7d 58 58 0c       vpbroadcastd 0xc(%rax),%ymm3
> 
>   18e:   c4 e2 7d 58 05 00 00    vpbroadcastd 0x0(%rip),%ymm0        # 197 <ff_apv_decode_transquant_avx2+0x197>
>   195:   00 00
>   197:   c4 e2 7d 58 0d 00 00    vpbroadcastd 0x0(%rip),%ymm1        # 1a0 <ff_apv_decode_transquant_avx2+0x1a0>
>   19e:   00 00
>   1a0:   c4 e2 7d 58 15 00 00    vpbroadcastd 0x0(%rip),%ymm2        # 1a9 <ff_apv_decode_transquant_avx2+0x1a9>
>   1a7:   00 00
>   1a9:   c4 e2 7d 58 1d 00 00    vpbroadcastd 0x0(%rip),%ymm3        # 1b2 <ff_apv_decode_transquant_avx2+0x1b2>
>   1b0:   00 00
> 
> Saves 6 bytes, but there is now a dependency which wasn't there before.  Is it really better?

You could do the lea several instructions earlier, so the dependency 
wouldn't matter, but unless you can measure a difference in speed, then 
maybe don't bother.

> 
>>> +
>>> +    vpmaddwd  m4,  m12, m0
>>> +    vpmaddwd  m5,  m12, m1
>>> +    vpmaddwd  m6,  m13, m2
>>> +    vpmaddwd  m7,  m13, m3
>>> +    vpaddd    m8,  m4,  m6
>>> +    vpaddd    m9,  m5,  m7
>>> +    vpsubd    m10, m5,  m7
>>> +    vpsubd    m11, m4,  m6
>>> +
>>> +    vpbroadcastd   m0, [tmatrix_col_odd + 0x00]
>>> +    vpbroadcastd   m1, [tmatrix_col_odd + 0x04]
>>> +    vpbroadcastd   m2, [tmatrix_col_odd + 0x08]
>>> +    vpbroadcastd   m3, [tmatrix_col_odd + 0x0c]
>>> +
>>> +    vpmaddwd  m4,  m14, m0
>>> +    vpmaddwd  m5,  m15, m1
>>> +    vpmaddwd  m6,  m14, m2
>>> +    vpmaddwd  m7,  m15, m3
>>> +    vpaddd    m12, m4,  m5
>>> +    vpaddd    m13, m6,  m7
>>> +
>>> +    vpbroadcastd   m0, [tmatrix_col_odd + 0x10]
>>> +    vpbroadcastd   m1, [tmatrix_col_odd + 0x14]
>>> +    vpbroadcastd   m2, [tmatrix_col_odd + 0x18]
>>> +    vpbroadcastd   m3, [tmatrix_col_odd + 0x1c]
>>> +
>>> +    vpmaddwd  m4,  m14, m0
>>> +    vpmaddwd  m5,  m15, m1
>>> +    vpmaddwd  m6,  m14, m2
>>> +    vpmaddwd  m7,  m15, m3
>>> +    vpaddd    m14, m4,  m5
>>> +    vpaddd    m15, m6,  m7
>>> +
>>> +    vpaddd    m0,  m8,  m12
>>> +    vpaddd    m1,  m9,  m13
>>> +    vpaddd    m2,  m10, m14
>>> +    vpaddd    m3,  m11, m15
>>> +    vpsubd    m4,  m11, m15
>>> +    vpsubd    m5,  m10, m14
>>> +    vpsubd    m6,  m9,  m13
>>> +    vpsubd    m7,  m8,  m12
>>> +
>>> +    ; Mid-transform normalisation
>>> +    ; Note that outputs here are fitted to 16 bits
>>> +
>>> +    vpbroadcastd  m8, [sixtyfour]
>>> +
>>> +%macro NORMALISE 1
>>> +    vpaddd    m%1, m%1, m8
>>> +    vpsrad    m%1, m%1, 7
>>> +    vpackssdw m%1, m%1, m%1
>>> +    vpermq    m%1, m%1, q3120
>>> +%endmacro
>>> +
>>> +    NORMALISE 0
>>> +    NORMALISE 1
>>> +    NORMALISE 2
>>> +    NORMALISE 3
>>> +    NORMALISE 4
>>> +    NORMALISE 5
>>> +    NORMALISE 6
>>> +    NORMALISE 7
>>> +
>>> +    ; mN = row N words 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
>>> +
>>> +    ; Transform rows
>>> +    ; This multiplies the rows directly by the transform matrix,
>>> +    ; avoiding the need to transpose anything
>>> +
>>> +    mova      m12, [tmatrix_row + 0x00]
>>> +    mova      m13, [tmatrix_row + 0x20]
>>> +    mova      m14, [tmatrix_row + 0x40]
>>> +    mova      m15, [tmatrix_row + 0x60]
>>> +
>>> +%macro TRANS_ROW_STEP 1
>>> +    vpmaddwd  m8,  m%1, m12
>>> +    vpmaddwd  m9,  m%1, m13
>>> +    vpmaddwd  m10, m%1, m14
>>> +    vpmaddwd  m11, m%1, m15
>>> +    vphaddd   m8,  m8,  m9
>>> +    vphaddd   m10, m10, m11
>>> +    vphaddd   m%1, m8,  m10
>>> +%endmacro
>>> +
>>> +    TRANS_ROW_STEP 0
>>> +    TRANS_ROW_STEP 1
>>> +    TRANS_ROW_STEP 2
>>> +    TRANS_ROW_STEP 3
>>> +    TRANS_ROW_STEP 4
>>> +    TRANS_ROW_STEP 5
>>> +    TRANS_ROW_STEP 6
>>> +    TRANS_ROW_STEP 7
>>> +
>>> +    ; Renormalise, clip and store output
>>> +
>>> +    vpbroadcastd  m14, [pd_1]
>>> +    mov       tmpd, 20
>>> +    sub       tmpd, bit_depthd
>>> +    movd      xm9, tmpd
>>> +    dec       tmpd
>>> +    movd      xm13, tmpd
>>> +    movd      xm15, bit_depthd
>>> +    vpslld    m8,  m14, xm13
>>> +    vpslld    m12, m14, xm15
>>> +    vpsrld    m10, m12, 1
>>> +    vpsubd    m12, m12, m14
>>> +    vpxor     m11, m11, m11
>>> +
>>> +    ; m8  = vector 1 << (bd_shift - 1)
>>> +    ; m9  = scalar bd_shift
>>> +    ; m10 = vector 1 << (bit_depth - 1)
>>> +    ; m11 = zero
>>> +    ; m12 = vector (1 << bit_depth) - 1
>>> +
>>> +    cmp       bit_depthd, 8
>>> +    jne       store_10
>>> +
>>> +%macro NORMALISE_AND_STORE_8 1
>>> +    vpaddd    m%1, m%1, m8
>>> +    vpsrad    m%1, m%1, xm9
>>> +    vpaddd    m%1, m%1, m10
>>> +    vextracti128  xm13, m%1, 0
>>> +    vextracti128  xm14, m%1, 1
>>> +    vpackusdw xm%1, xm13, xm14
>>> +    vpackuswb xm%1, xm%1, xm%1
>>
>>      vpaddd    m%1, m%1, m10
>>      vextracti128  xm14, m%1, 1
>>      vpackusdw xm%1, xm%1, xm14
>>      vpackuswb xm%1, xm%1, xm%1
>>
>> vextracti128 with 0 as third argument is the same as a mova for the lower 128 bits, so it's not needed.
> 
> Thinking about this a bit more makes me want to combine rows to not waste elements.  It's not obvious that this is better, but how about:

It may be better just for having six crosslane instructions instead of 
eight.

> 
> %macro NORMALISE_AND_STORE_8 4
>      vpaddd    m%1, m%1, m8
>      vpaddd    m%2, m%2, m8
>      vpaddd    m%3, m%3, m8
>      vpaddd    m%4, m%4, m8
>      vpsrad    m%1, m%1, xm9
>      vpsrad    m%2, m%2, xm9
>      vpsrad    m%3, m%3, xm9
>      vpsrad    m%4, m%4, xm9
>      vpaddd    m%1, m%1, m10
>      vpaddd    m%2, m%2, m10
>      vpaddd    m%3, m%3, m10
>      vpaddd    m%4, m%4, m10
>      ; m%1 = 32x8   A0-3 A4-7
>      ; m%2 = 32x8   B0-3 B4-7
>      ; m%3 = 32x8   C0-3 C4-7
>      ; m%4 = 32x8   D0-3 D4-7
>      vpackusdw m%1, m%1, m%2
>      vpackusdw m%3, m%3, m%4
>      ; m%1 = 16x16  A0-3 B0-3 A4-7 B4-7
>      ; m%3 = 16x16  C0-3 D0-3 C4-7 D4-7
>      vpermq    m%1, m%1, q3120
>      vpermq    m%2, m%3, q3120
>      ; m%1 = 16x16  A0-3 A4-7 B0-3 B4-7
>      ; m%2 = 16x16  C0-3 C4-7 D0-3 D4-7
>      vpackuswb m%1, m%1, m%2
>      ; m%1 = 8x32   A0-3 A4-7 C0-3 C4-7 B0-3 B4-7 D0-3 D4-7
>      vextracti128  xm%2, m%1, 1
>      vpsrldq   xm%3, xm%1, 8
>      vpsrldq   xm%4, xm%2, 8
>      vmovq     [outputq],          xm%1
>      vmovq     [outputq + pitchq], xm%2
>      lea       outputq, [outputq + 2*pitchq]

Maybe instead load pitch*3 into tmpq outside of the macro:

lea       tmpq, [pitchq+pitchq*2]

Then you can do:

vmovq     [outputq],          xm%1
vmovq     [outputq+pitchq],   xm%2
vmovq     [outputq+pitchq*2], xm%3
vmovq     [outputq+tmpq],     xm%4
lea       outputq, [outputq+pitchq*4]

Inside it.

>      vmovq     [outputq],          xm%3
>      vmovq     [outputq + pitchq], xm%4
>      lea       outputq, [outputq + 2*pitchq]
> %endmacro
> 
>      NORMALISE_AND_STORE_8 0, 1, 2, 3
>      NORMALISE_AND_STORE_8 4, 5, 6, 7
> 
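For readers following the lane comments in the four-row macro above, here is a scalar C sketch of the same data movement. It is only a model under stated assumptions: the helper names (packusdw256, permq_q3120, packuswb256, pack_and_store_4) are made up for illustration, the rows are assumed to be already shifted and offset, and the pitch is assumed positive. It emulates the per-lane behaviour of vpackusdw and vpackuswb, the q3120 quarter swap, and the four stores using a precomputed pitch*3 as suggested for tmpq.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint16_t sat_u16(int32_t v) { return v < 0 ? 0 : v > 65535 ? 65535 : (uint16_t)v; }
static uint8_t  sat_u8(int16_t v)  { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

/* Model of 256-bit vpackusdw: packs within each 128-bit lane. */
static void packusdw256(uint16_t d[16], const int32_t a[8], const int32_t b[8])
{
    for (int lane = 0; lane < 2; lane++)
        for (int i = 0; i < 4; i++) {
            d[8 * lane + i]     = sat_u16(a[4 * lane + i]);
            d[8 * lane + 4 + i] = sat_u16(b[4 * lane + i]);
        }
}

/* Model of vpermq with q3120: swaps the two middle 64-bit quarters. */
static void permq_q3120(uint16_t d[16], const uint16_t s[16])
{
    static const int src_quarter[4] = { 0, 2, 1, 3 };
    for (int q = 0; q < 4; q++)
        memcpy(&d[4 * q], &s[4 * src_quarter[q]], 4 * sizeof(*d));
}

/* Model of 256-bit vpackuswb: inputs are read as signed 16-bit words.
 * The uint16_t -> int16_t cast is assumed to wrap, as on common compilers. */
static void packuswb256(uint8_t d[32], const uint16_t a[16], const uint16_t b[16])
{
    for (int lane = 0; lane < 2; lane++)
        for (int i = 0; i < 8; i++) {
            d[16 * lane + i]     = sat_u8((int16_t)a[8 * lane + i]);
            d[16 * lane + 8 + i] = sat_u8((int16_t)b[8 * lane + i]);
        }
}

/* Pack rows A..D to bytes and store them pitch bytes apart.
 * Returns the output pointer advanced by four rows. */
static uint8_t *pack_and_store_4(uint8_t *output, ptrdiff_t pitch,
                                 const int32_t a[8], const int32_t b[8],
                                 const int32_t c[8], const int32_t d[8])
{
    uint16_t ab[16], cd[16], ab_p[16], cd_p[16];
    uint8_t  bytes[32];
    const ptrdiff_t pitch3 = pitch * 3;  /* the value proposed for tmpq */

    assert(pitch >= 8);                  /* sketch assumes a positive pitch */

    packusdw256(ab, a, b);               /* A0-3 B0-3 A4-7 B4-7 */
    packusdw256(cd, c, d);               /* C0-3 D0-3 C4-7 D4-7 */
    permq_q3120(ab_p, ab);               /* A0-7 B0-7 */
    permq_q3120(cd_p, cd);               /* C0-7 D0-7 */
    packuswb256(bytes, ab_p, cd_p);      /* A0-7 C0-7 B0-7 D0-7 */

    memcpy(output,             bytes +  0, 8);  /* row A */
    memcpy(output + pitch,     bytes + 16, 8);  /* row B */
    memcpy(output + 2 * pitch, bytes +  8, 8);  /* row C */
    memcpy(output + pitch3,    bytes + 24, 8);  /* row D */
    return output + 4 * pitch;
}
```

The point of the model is the A, C, B, D byte order after the final pack: rows B and D end up in the upper 128-bit lane, which is why the macro needs one vextracti128 plus two byte shifts before the stores.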
>>> +    movq      [outputq], xm%1
>>> +    add       outputq, pitchq
>>> +%endmacro
>>> +
>>> +    NORMALISE_AND_STORE_8 0
>>> +    NORMALISE_AND_STORE_8 1
>>> +    NORMALISE_AND_STORE_8 2
>>> +    NORMALISE_AND_STORE_8 3
>>> +    NORMALISE_AND_STORE_8 4
>>> +    NORMALISE_AND_STORE_8 5
>>> +    NORMALISE_AND_STORE_8 6
>>> +    NORMALISE_AND_STORE_8 7
>>> +
>>> +    RET
>>> +
>>> +store_10:
>>> +
>>> +%macro NORMALISE_AND_STORE_10 1
>>> +    vpaddd    m%1, m%1, m8
>>> +    vpsrad    m%1, m%1, xm9
>>> +    vpaddd    m%1, m%1, m10
>>> +    vpmaxsd   m%1, m%1, m11
>>> +    vpminsd   m%1, m%1, m12
>>> +    vextracti128  xm13, m%1, 0
>>> +    vextracti128  xm14, m%1, 1
>>> +    vpackusdw xm%1, xm13, xm14
>>
>> Same.
> 
> A similar method for pairs applies here as well.
> 
>>> +    mova      [outputq], xm%1
>>> +    add       outputq, pitchq
>>> +%endmacro
>>> +
>>> +    NORMALISE_AND_STORE_10 0
>>> +    NORMALISE_AND_STORE_10 1
>>> +    NORMALISE_AND_STORE_10 2
>>> +    NORMALISE_AND_STORE_10 3
>>> +    NORMALISE_AND_STORE_10 4
>>> +    NORMALISE_AND_STORE_10 5
>>> +    NORMALISE_AND_STORE_10 6
>>> +    NORMALISE_AND_STORE_10 7
>>> +
>>> +    RET
>>> ...
> 
> Thanks,
> 
> - Mark
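
As a cross-check on the "Renormalise, clip and store output" stage quoted above, the per-element arithmetic can be written as a scalar C sketch. This is an illustrative model, not the reference C code from the patch: the function name renormalise_row is made up, and bit_depth is assumed to be 8 or 10, the two cases the assembly branches on. The constants follow the register comments in the patch (m8, m9, m10, m12). In the 8-bit path the assembly gets its clip from the saturating packs rather than explicit min/max, so the explicit clamp here matches only as long as the shifted value fits in a signed 16-bit word.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the final stage for one row of eight coefficients. */
static void renormalise_row(const int32_t in[8], uint16_t out[8], int bit_depth)
{
    assert(bit_depth == 8 || bit_depth == 10);

    const int     bd_shift = 20 - bit_depth;        /* scalar held in xm9 */
    const int32_t rounding = 1 << (bd_shift - 1);   /* vector held in m8  */
    const int32_t offset   = 1 << (bit_depth - 1);  /* vector held in m10 */
    const int32_t max      = (1 << bit_depth) - 1;  /* vector held in m12 */

    for (int i = 0; i < 8; i++) {
        /* Assumes >> on a negative int32_t is an arithmetic shift,
         * which is what vpsrad does and what common compilers emit. */
        int32_t v = ((in[i] + rounding) >> bd_shift) + offset;
        if (v < 0)
            v = 0;
        if (v > max)
            v = max;
        out[i] = (uint16_t)v;
    }
}
```

For example, with bit_depth 8 an input of 0 maps to 128 (the mid-grey offset) and an input of 4096 maps to 129.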
> 



* Re: [FFmpeg-devel] [PATCH v2 5/6] lavc/apv: AVX2 transquant for x86-64
  2025-04-21 15:24 ` [FFmpeg-devel] [PATCH v2 5/6] lavc/apv: AVX2 transquant for x86-64 Mark Thompson
  2025-04-21 16:53   ` James Almer
@ 2025-04-23 19:52   ` Michael Niedermayer
  2025-04-23 20:47     ` Mark Thompson
  1 sibling, 1 reply; 12+ messages in thread
From: Michael Niedermayer @ 2025-04-23 19:52 UTC (permalink / raw)
  To: FFmpeg development discussions and patches


Hi

On Mon, Apr 21, 2025 at 04:24:36PM +0100, Mark Thompson wrote:
> Typical checkasm result on Alder Lake:
> 
> decode_transquant_8_c:                                 461.1 ( 1.00x)
> decode_transquant_8_avx2:                               97.5 ( 4.73x)
> decode_transquant_10_c:                                483.9 ( 1.00x)
> decode_transquant_10_avx2:                              91.7 ( 5.28x)
> ---
>  libavcodec/apv_dsp.c          |   4 +
>  libavcodec/apv_dsp.h          |   2 +
>  libavcodec/x86/Makefile       |   2 +
>  libavcodec/x86/apv_dsp.asm    | 279 ++++++++++++++++++++++++++++++++++
>  libavcodec/x86/apv_dsp_init.c |  40 +++++
>  tests/checkasm/Makefile       |   1 +
>  tests/checkasm/apv_dsp.c      | 109 +++++++++++++
>  tests/checkasm/checkasm.c     |   3 +
>  tests/checkasm/checkasm.h     |   1 +
>  tests/fate/checkasm.mak       |   1 +
>  10 files changed, 442 insertions(+)
>  create mode 100644 libavcodec/x86/apv_dsp.asm
>  create mode 100644 libavcodec/x86/apv_dsp_init.c
>  create mode 100644 tests/checkasm/apv_dsp.c

breaks build on x86-32
make
X86ASM	libavcodec/x86/apv_dsp.o
src/libavcodec/x86/apv_dsp.asm:64: error: symbol `m10' undefined
src/libavcodec/x86/apv_dsp.asm:66: error: symbol `xmmm8' undefined
src//libavutil/x86/x86inc.asm:1637: ... from macro `movd' defined here
src//libavutil/x86/x86inc.asm:1501: ... from macro `RUN_AVX_INSTR' defined here
src/libavcodec/x86/apv_dsp.asm:67: error: symbol `xmmm9' undefined
src//libavutil/x86/x86inc.asm:1637: ... from macro `movd' defined here
src//libavutil/x86/x86inc.asm:1501: ... from macro `RUN_AVX_INSTR' defined here
src/libavcodec/x86/apv_dsp.asm:68: error: symbol `m10' undefined
src/libavcodec/x86/apv_dsp.asm:69: error: symbol `m10' undefined
src/libavcodec/x86/apv_dsp.asm:86: error: symbol `m11' undefined
src/libavcodec/x86/apv_dsp.asm:78: ... from macro `LOAD_AND_DEQUANT' defined here
src/libavcodec/x86/apv_dsp.asm:86: error: symbol `m11' undefined
src/libavcodec/x86/apv_dsp.asm:79: ... from macro `LOAD_AND_DEQUANT' defined here
src/libavcodec/x86/apv_dsp.asm:86: error: symbol `xmmm8' undefined
src/libavcodec/x86/apv_dsp.asm:80: ... from macro `LOAD_AND_DEQUANT' defined here
src/libavcodec/x86/apv_dsp.asm:86: error: symbol `m10' undefined
src/libavcodec/x86/apv_dsp.asm:81: ... from macro `LOAD_AND_DEQUANT' defined here
src/libavcodec/x86/apv_dsp.asm:86: error: symbol `xmmm9' undefined
src/libavcodec/x86/apv_dsp.asm:82: ... from macro `LOAD_AND_DEQUANT' defined here
src/libavcodec/x86/apv_dsp.asm:87: error: symbol `m11' undefined
src/libavcodec/x86/apv_dsp.asm:78: ... from macro `LOAD_AND_DEQUANT' defined here
...

thx

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Everything should be made as simple as possible, but not simpler.
-- Albert Einstein


* Re: [FFmpeg-devel] [PATCH v2 5/6] lavc/apv: AVX2 transquant for x86-64
  2025-04-23 19:52   ` Michael Niedermayer
@ 2025-04-23 20:47     ` Mark Thompson
  0 siblings, 0 replies; 12+ messages in thread
From: Mark Thompson @ 2025-04-23 20:47 UTC (permalink / raw)
  To: ffmpeg-devel

On 23/04/2025 20:52, Michael Niedermayer wrote:
> Hi
> 
> On Mon, Apr 21, 2025 at 04:24:36PM +0100, Mark Thompson wrote:
>> Typical checkasm result on Alder Lake:
>>
>> decode_transquant_8_c:                                 461.1 ( 1.00x)
>> decode_transquant_8_avx2:                               97.5 ( 4.73x)
>> decode_transquant_10_c:                                483.9 ( 1.00x)
>> decode_transquant_10_avx2:                              91.7 ( 5.28x)
>> ---
>>  libavcodec/apv_dsp.c          |   4 +
>>  libavcodec/apv_dsp.h          |   2 +
>>  libavcodec/x86/Makefile       |   2 +
>>  libavcodec/x86/apv_dsp.asm    | 279 ++++++++++++++++++++++++++++++++++
>>  libavcodec/x86/apv_dsp_init.c |  40 +++++
>>  tests/checkasm/Makefile       |   1 +
>>  tests/checkasm/apv_dsp.c      | 109 +++++++++++++
>>  tests/checkasm/checkasm.c     |   3 +
>>  tests/checkasm/checkasm.h     |   1 +
>>  tests/fate/checkasm.mak       |   1 +
>>  10 files changed, 442 insertions(+)
>>  create mode 100644 libavcodec/x86/apv_dsp.asm
>>  create mode 100644 libavcodec/x86/apv_dsp_init.c
>>  create mode 100644 tests/checkasm/apv_dsp.c
> 
> breaks build on x86-32
> make
> X86ASM	libavcodec/x86/apv_dsp.o
> src/libavcodec/x86/apv_dsp.asm:64: error: symbol `m10' undefined
> src/libavcodec/x86/apv_dsp.asm:66: error: symbol `xmmm8' undefined
> src//libavutil/x86/x86inc.asm:1637: ... from macro `movd' defined here
> src//libavutil/x86/x86inc.asm:1501: ... from macro `RUN_AVX_INSTR' defined here
> src/libavcodec/x86/apv_dsp.asm:67: error: symbol `xmmm9' undefined
> src//libavutil/x86/x86inc.asm:1637: ... from macro `movd' defined here
> src//libavutil/x86/x86inc.asm:1501: ... from macro `RUN_AVX_INSTR' defined here
> src/libavcodec/x86/apv_dsp.asm:68: error: symbol `m10' undefined
> src/libavcodec/x86/apv_dsp.asm:69: error: symbol `m10' undefined
> src/libavcodec/x86/apv_dsp.asm:86: error: symbol `m11' undefined
> src/libavcodec/x86/apv_dsp.asm:78: ... from macro `LOAD_AND_DEQUANT' defined here
> src/libavcodec/x86/apv_dsp.asm:86: error: symbol `m11' undefined
> src/libavcodec/x86/apv_dsp.asm:79: ... from macro `LOAD_AND_DEQUANT' defined here
> src/libavcodec/x86/apv_dsp.asm:86: error: symbol `xmmm8' undefined
> src/libavcodec/x86/apv_dsp.asm:80: ... from macro `LOAD_AND_DEQUANT' defined here
> src/libavcodec/x86/apv_dsp.asm:86: error: symbol `m10' undefined
> src/libavcodec/x86/apv_dsp.asm:81: ... from macro `LOAD_AND_DEQUANT' defined here
> src/libavcodec/x86/apv_dsp.asm:86: error: symbol `xmmm9' undefined
> src/libavcodec/x86/apv_dsp.asm:82: ... from macro `LOAD_AND_DEQUANT' defined here
> src/libavcodec/x86/apv_dsp.asm:87: error: symbol `m11' undefined
> src/libavcodec/x86/apv_dsp.asm:78: ... from macro `LOAD_AND_DEQUANT' defined here
> ...

This was intended to be x86-64 only (due to register pressure) and wasn't guarded properly.  Fixed in the latest version.

Thank you for testing!

- Mark
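
The kind of guard described above can be sketched in C as follows. The actual fix is not shown in this thread, so everything here is hypothetical: the SKETCH_* macros, SketchDSPContext, sketch_dsp_init and the stub functions are placeholders, not FFmpeg's real configuration macro, CPU-flag API or DSP context. The idea is simply that the AVX2 routine uses registers m8 to m15, which 32-bit x86 does not have, so both its declaration and its selection are compiled only for x86-64.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the build-system architecture macro and
 * the runtime CPU flag; the real names are not reproduced here. */
#if defined(__x86_64__) || defined(_M_X64)
#define SKETCH_ARCH_X86_64 1
#else
#define SKETCH_ARCH_X86_64 0
#endif
#define SKETCH_CPU_FLAG_AVX2 0x1

typedef int (*transquant_fn)(int16_t *coeffs);

/* Portable fallback; returns 0 so callers can tell the versions apart. */
static int transquant_c(int16_t *coeffs)
{
    (void)coeffs;
    return 0;
}

#if SKETCH_ARCH_X86_64
/* Placeholder for the assembly routine, only built for x86-64. */
static int transquant_avx2(int16_t *coeffs)
{
    (void)coeffs;
    return 1;
}
#endif

typedef struct SketchDSPContext {
    transquant_fn decode_transquant;
} SketchDSPContext;

static void sketch_dsp_init(SketchDSPContext *dsp, int cpu_flags)
{
    assert(dsp);
    dsp->decode_transquant = transquant_c;
#if SKETCH_ARCH_X86_64
    /* Only reference the AVX2 symbol where it was actually assembled. */
    if (cpu_flags & SKETCH_CPU_FLAG_AVX2)
        dsp->decode_transquant = transquant_avx2;
#else
    (void)cpu_flags;
#endif
}
```

On a 32-bit build the function pointer stays on the C version regardless of the CPU flags, and the assembly side would need a matching guard so the file assembles to nothing there.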


end of thread, other threads:[~2025-04-23 20:47 UTC | newest]

Thread overview: 12+ messages
2025-04-21 15:24 [FFmpeg-devel] [PATCH v2 0/6] APV support Mark Thompson
2025-04-21 15:24 ` [FFmpeg-devel] [PATCH v2 1/6] lavc: APV codec ID and descriptor Mark Thompson
2025-04-21 15:24 ` [FFmpeg-devel] [PATCH v2 2/6] lavc/cbs: APV support Mark Thompson
2025-04-21 15:24 ` [FFmpeg-devel] [PATCH v2 3/6] lavf: APV demuxer Mark Thompson
2025-04-21 15:24 ` [FFmpeg-devel] [PATCH v2 4/6] lavc: APV decoder Mark Thompson
2025-04-21 15:24 ` [FFmpeg-devel] [PATCH v2 5/6] lavc/apv: AVX2 transquant for x86-64 Mark Thompson
2025-04-21 16:53   ` James Almer
2025-04-21 19:50     ` Mark Thompson
2025-04-22 20:00       ` James Almer
2025-04-23 19:52   ` Michael Niedermayer
2025-04-23 20:47     ` Mark Thompson
2025-04-21 15:24 ` [FFmpeg-devel] [PATCH v2 6/6] lavc: APV metadata bitstream filter Mark Thompson
