Date: Fri, 21 Jan 2022 09:33:05 +0100 (CET)
From: Lynne
To: Ffmpeg Devel
Subject: [FFmpeg-devel] [PATCH 1/2] lavu/tx: rewrite internal code as a tree-based codelet constructor

This commit rewrites the internal transform code into a constructor that
stitches transforms (codelets) together. This allows transforms to reuse
arbitrary parts of other transforms, and allows transforms to be stacked
onto one another (such as a full iMDCT using a half-iMDCT, which in turn
uses an FFT). It also permits each step to be individually replaced by
assembly or a custom implementation (such as an ASIC).

Patch attached.
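To illustrate the idea before the diff: a minimal, self-contained sketch of a tree-based codelet constructor. Every name in it is invented for illustration; the patch's real API (FFTXCodelet, ff_tx_init_subtx) is in tx_priv.h below.

/* Sketch only: a registry of codelets is searched for the requested
 * length; a composite codelet's init() builds its subtransform
 * recursively, yielding a tree of contexts. */
#include <stdio.h>
#include <stdlib.h>

typedef struct Node Node;
typedef struct Codelet {
    const char *name;
    int leaf_len;           /* exact length handled; 0 = composite */
    int factor;             /* composite: length must divide by this */
    int (*init)(Node *n);   /* may recursively construct subtransforms */
} Codelet;

struct Node {
    const Codelet *cd;
    int len;
    Node *sub;              /* subtransform context, if any */
};

static int build(Node *n, int len);

static int init_split(Node *n) /* delegate half the work to a subtransform */
{
    n->sub = calloc(1, sizeof(*n->sub));
    return n->sub ? build(n->sub, n->len / 2) : -1;
}

static const Codelet registry[] = {
    { "leaf4",  4, 0, NULL       }, /* exact-length kernel, tried first */
    { "split2", 0, 2, init_split }, /* generic split-by-two composite   */
};

static int build(Node *n, int len)
{
    for (int i = 0; i < 2; i++) {
        const Codelet *cd = &registry[i];
        if (cd->leaf_len ? len == cd->leaf_len : !(len % cd->factor)) {
            n->cd  = cd;
            n->len = len;
            n->sub = NULL;
            return cd->init ? cd->init(n) : 0;
        }
    }
    return -1;
}

int main(void)
{
    Node root;
    if (!build(&root, 16)) /* prints: split2 16 -> split2 8 -> leaf4 4 */
        for (Node *n = &root; n; n = n->sub)
            printf("%s %d\n", n->cd->name, n->len);
    return 0; /* nodes deliberately leaked; a sketch, not a library */
}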
[Attachment: 0001-lavu-tx-rewrite-internal-code-as-a-tree-based-codele.patch]

From a5e4bb3e1bde245264c6a4cde5c8db3162ddfa5f Mon Sep 17 00:00:00 2001
From: Lynne
Date: Thu, 20 Jan 2022 07:14:46 +0100
Subject: [PATCH 1/2] lavu/tx: rewrite internal code as a tree-based codelet
 constructor

This commit rewrites the internal transform code into a constructor that
stitches transforms (codelets) together. This allows transforms to reuse
arbitrary parts of other transforms, and allows transforms to be stacked
onto one another (such as a full iMDCT using a half-iMDCT, which in turn
uses an FFT). It also permits each step to be individually replaced by
assembly or a custom implementation (such as an ASIC).
---
 libavutil/Makefile            |    4 +-
 libavutil/tx.c                |  483 +++++++++---
 libavutil/tx.h                |    3 +
 libavutil/tx_priv.h           |  180 +++--
 libavutil/tx_template.c       | 1356 ++++++++++++++++++++-------------
 libavutil/x86/tx_float.asm    |  111 +--
 libavutil/x86/tx_float_init.c |  170 +++--
 7 files changed, 1526 insertions(+), 781 deletions(-)

diff --git a/libavutil/Makefile b/libavutil/Makefile
index d17876df1a..22a7b15f61 100644
--- a/libavutil/Makefile
+++ b/libavutil/Makefile
@@ -170,8 +170,8 @@ OBJS = adler32.o                          \
        tea.o                                               \
        tx.o                                                \
        tx_float.o                                          \
-       tx_double.o                                         \
-       tx_int32.o                                          \
+#      tx_double.o                                         \
+#      tx_int32.o                                          \
        video_enc_params.o                                  \
        film_grain_params.o                                 \
 
diff --git a/libavutil/tx.c b/libavutil/tx.c
index fa81ada2f1..28fe6c55b9 100644
--- a/libavutil/tx.c
+++ b/libavutil/tx.c
@@ -17,8 +17,9 @@
  */
 
 #include "tx_priv.h"
+#include "qsort.h"
 
-int ff_tx_type_is_mdct(enum AVTXType type)
+static av_always_inline int type_is_mdct(enum AVTXType type)
 {
     switch (type) {
     case AV_TX_FLOAT_MDCT:
@@ -42,22 +43,26 @@ static av_always_inline int mulinv(int n, int m)
 }
 
 /* Guaranteed to work for any n, m where gcd(n, m) == 1 */
-int ff_tx_gen_compound_mapping(AVTXContext *s)
+int ff_tx_gen_compound_mapping(AVTXContext *s, int n, int m)
 {
     int *in_map, *out_map;
-    const int n     = s->n;
-    const int m     = s->m;
-    const int inv   = s->inv;
-    const int len   = n*m;
-    const int m_inv = mulinv(m, n);
-    const int n_inv = mulinv(n, m);
-    const int mdct  = ff_tx_type_is_mdct(s->type);
-
-    if (!(s->pfatab = av_malloc(2*len*sizeof(*s->pfatab))))
+    const int inv  = s->inv;
+    const int len  = n*m; /* Will not be equal to s->len for MDCTs */
+    const int mdct = type_is_mdct(s->type);
+    int m_inv, n_inv;
+
+    /* Make sure the numbers are coprime */
+    if (av_gcd(n, m) != 1)
+        return AVERROR(EINVAL);
+
+    m_inv = mulinv(m, n);
+    n_inv = mulinv(n, m);
+
+    if (!(s->map = av_malloc(2*len*sizeof(*s->map))))
         return AVERROR(ENOMEM);
 
-    in_map  = s->pfatab;
-    out_map = s->pfatab + n*m;
+    in_map  = s->map;
+    out_map = s->map + len;
 
     /* Ruritanian map for input, CRT map for output, can be swapped */
     for (int j = 0; j < m; j++) {
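A worked example of the two maps this function generates, as a standalone sketch (n = 3, m = 5; not the patch's exact table layout, which also handles the MDCT case):

/* The Ruritanian map serves the input, the CRT map the output; both are
 * bijections on [0, n*m) whenever gcd(n, m) == 1. */
#include <stdio.h>

int main(void)
{
    const int n = 3, m = 5, len = n*m;
    const int m_inv = 2; /* mulinv(5, 3): 5*2 = 10 = 1 (mod 3) */
    const int n_inv = 2; /* mulinv(3, 5): 3*2 =  6 = 1 (mod 5) */

    for (int j = 0; j < m; j++)
        for (int i = 0; i < n; i++)
            printf("(%d,%d): in %2d  out %2d\n", i, j,
                   (i*m + j*n) % len,              /* Ruritanian */
                   (i*m*m_inv + j*n*n_inv) % len); /* CRT        */
    return 0;
}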
@@ -92,48 +97,50 @@ int ff_tx_gen_compound_mapping(AVTXContext *s)
     return 0;
 }
 
-static inline int split_radix_permutation(int i, int m, int inverse)
+static inline int split_radix_permutation(int i, int len, int inv)
 {
-    m >>= 1;
-    if (m <= 1)
+    len >>= 1;
+    if (len <= 1)
         return i & 1;
-    if (!(i & m))
-        return split_radix_permutation(i, m, inverse) * 2;
-    m >>= 1;
-    return split_radix_permutation(i, m, inverse) * 4 + 1 - 2*(!(i & m) ^ inverse);
+    if (!(i & len))
+        return split_radix_permutation(i, len, inv) * 2;
+    len >>= 1;
+    return split_radix_permutation(i, len, inv) * 4 + 1 - 2*(!(i & len) ^ inv);
 }
 
 int ff_tx_gen_ptwo_revtab(AVTXContext *s, int invert_lookup)
 {
-    const int m = s->m, inv = s->inv;
+    int len = s->len;
 
-    if (!(s->revtab = av_malloc(s->m*sizeof(*s->revtab))))
-        return AVERROR(ENOMEM);
-    if (!(s->revtab_c = av_malloc(m*sizeof(*s->revtab_c))))
+    if (!(s->map = av_malloc(len*sizeof(*s->map))))
         return AVERROR(ENOMEM);
 
-    /* Default */
-    for (int i = 0; i < m; i++) {
-        int k = -split_radix_permutation(i, m, inv) & (m - 1);
-        if (invert_lookup)
-            s->revtab[i] = s->revtab_c[i] = k;
-        else
-            s->revtab[i] = s->revtab_c[k] = i;
+    if (invert_lookup) {
+        for (int i = 0; i < s->len; i++)
+            s->map[i] = -split_radix_permutation(i, len, s->inv) & (len - 1);
+    } else {
+        for (int i = 0; i < s->len; i++)
+            s->map[-split_radix_permutation(i, len, s->inv) & (len - 1)] = i;
     }
 
     return 0;
 }
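The two invert_lookup conventions (documented later in tx_priv.h) correspond to a gather versus a scatter at the point of use; a standalone sketch, using a plain 3-bit bit-reversal table instead of the real split-radix output:

#include <stdio.h>

int main(void)
{
    const int map[8] = { 0, 4, 2, 6, 1, 5, 3, 7 };
    float in[8] = { 0, 1, 2, 3, 4, 5, 6, 7 }, a[8], b[8];

    for (int i = 0; i < 8; i++)
        a[i] = in[map[i]];      /* invert_lookup == 1: gather  */
    for (int i = 0; i < 8; i++)
        b[map[i]] = a[i];       /* invert_lookup == 0: scatter */

    for (int i = 0; i < 8; i++) /* scatter undoes gather: prints 0..7 */
        printf("%g ", b[i]);
    printf("\n");
    return 0;
}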
 
-int ff_tx_gen_ptwo_inplace_revtab_idx(AVTXContext *s, int *revtab)
+int ff_tx_gen_ptwo_inplace_revtab_idx(AVTXContext *s)
 {
-    int nb_inplace_idx = 0;
+    int *chain_map, chain_map_idx = 0, len = s->len;
 
-    if (!(s->inplace_idx = av_malloc(s->m*sizeof(*s->inplace_idx))))
+    if (!(s->map = av_malloc(2*len*sizeof(*s->map))))
         return AVERROR(ENOMEM);
 
+    chain_map = &s->map[s->len];
+
+    for (int i = 0; i < len; i++)
+        s->map[-split_radix_permutation(i, len, s->inv) & (len - 1)] = i;
+
     /* The first coefficient is always already in-place */
-    for (int src = 1; src < s->m; src++) {
-        int dst = revtab[src];
+    for (int src = 1; src < s->len; src++) {
+        int dst = s->map[src];
         int found = 0;
 
         if (dst <= src)
@@ -143,48 +150,54 @@ int ff_tx_gen_ptwo_inplace_revtab_idx(AVTXContext *s, int *revtab)
          * and if so, skips it, since to fully permute a loop we must only
          * enter it once. */
         do {
-            for (int j = 0; j < nb_inplace_idx; j++) {
-                if (dst == s->inplace_idx[j]) {
+            for (int j = 0; j < chain_map_idx; j++) {
+                if (dst == chain_map[j]) {
                     found = 1;
                     break;
                 }
             }
-            dst = revtab[dst];
+            dst = s->map[dst];
         } while (dst != src && !found);
 
         if (!found)
-            s->inplace_idx[nb_inplace_idx++] = src;
+            chain_map[chain_map_idx++] = src;
     }
 
-    s->inplace_idx[nb_inplace_idx++] = 0;
+    chain_map[chain_map_idx++] = 0;
 
     return 0;
 }
 
 static void parity_revtab_generator(int *revtab, int n, int inv, int offset,
                                     int is_dual, int dual_high, int len,
-                                    int basis, int dual_stride)
+                                    int basis, int dual_stride, int inv_lookup)
 {
     len >>= 1;
 
     if (len <= basis) {
-        int k1, k2, *even, *odd, stride;
+        int k1, k2, stride, even_idx, odd_idx;
 
         is_dual   = is_dual && dual_stride;
         dual_high = is_dual & dual_high;
         stride    = is_dual ? FFMIN(dual_stride, len) : 0;
 
-        even = &revtab[offset + dual_high*(stride - 2*len)];
-        odd  = &even[len + (is_dual && !dual_high)*len + dual_high*len];
+        even_idx = offset + dual_high*(stride - 2*len);
+        odd_idx  = even_idx + len + (is_dual && !dual_high)*len + dual_high*len;
 
         for (int i = 0; i < len; i++) {
            k1 = -split_radix_permutation(offset + i*2 + 0, n, inv) & (n - 1);
            k2 = -split_radix_permutation(offset + i*2 + 1, n, inv) & (n - 1);
-            *even++ = k1;
-            *odd++  = k2;
+            if (inv_lookup) {
+                revtab[even_idx++] = k1;
+                revtab[odd_idx++]  = k2;
+            } else {
+                revtab[k1] = even_idx++;
+                revtab[k2] = odd_idx++;
+            }
            if (stride && !((i + 1) % stride)) {
-                even += stride;
-                odd  += stride;
+                even_idx += stride;
+                odd_idx  += stride;
            }
         }
 
@@ -192,22 +205,52 @@ static void parity_revtab_generator(int *revtab, int n, int inv, int offset,
     }
 
     parity_revtab_generator(revtab, n, inv, offset,
-                            0, 0, len >> 0, basis, dual_stride);
+                            0, 0, len >> 0, basis, dual_stride, inv_lookup);
     parity_revtab_generator(revtab, n, inv, offset + (len >> 0),
-                            1, 0, len >> 1, basis, dual_stride);
+                            1, 0, len >> 1, basis, dual_stride, inv_lookup);
     parity_revtab_generator(revtab, n, inv, offset + (len >> 0) + (len >> 1),
-                            1, 1, len >> 1, basis, dual_stride);
+                            1, 1, len >> 1, basis, dual_stride, inv_lookup);
 }
 
-void ff_tx_gen_split_radix_parity_revtab(int *revtab, int len, int inv,
-                                         int basis, int dual_stride)
+int ff_tx_gen_split_radix_parity_revtab(AVTXContext *s, int invert_lookup,
+                                        int basis, int dual_stride)
 {
+    int len = s->len;
+    int inv = s->inv;
+
+    if (!(s->map = av_mallocz(len*sizeof(*s->map))))
+        return AVERROR(ENOMEM);
+
     basis >>= 1;
     if (len < basis)
-        return;
+        return AVERROR(EINVAL);
+
     av_assert0(!dual_stride || !(dual_stride & (dual_stride - 1)));
     av_assert0(dual_stride <= basis);
-    parity_revtab_generator(revtab, len, inv, 0, 0, 0, len, basis, dual_stride);
+    parity_revtab_generator(s->map, len, inv, 0, 0, 0, len,
+                            basis, dual_stride, invert_lookup);
+
+    return 0;
+}
+
+static void reset_ctx(AVTXContext *s)
+{
+    if (!s)
+        return;
+
+    if (s->sub)
+        for (int i = 0; i < s->nb_sub; i++)
+            reset_ctx(&s->sub[i]);
+
+    if (s->cd_self->uninit)
+        s->cd_self->uninit(s);
+
+    av_freep(&s->sub);
+    av_freep(&s->map);
+    av_freep(&s->exp);
+    av_freep(&s->tmp);
+
+    memset(s, 0, sizeof(*s));
+}
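The chain list built above, in isolation: a permutation decomposes into disjoint cycles, and one entry point per cycle (0-terminated, as in the code) suffices to apply it in-place. A toy 6-element sketch, not the real split-radix map:

#include <stdio.h>

int main(void)
{
    const int map[6]   = { 0, 3, 4, 5, 2, 1 }; /* src i goes to map[i]; cycles (1 3 5)(2 4) */
    const int chain[3] = { 1, 2, 0 };          /* one entry per cycle, 0-terminated */
    char data[] = "abcdef";

    for (int c = 0; chain[c]; c++) {
        int src = chain[c], dst = map[src];
        char tmp = data[src];
        while (dst != src) {    /* walk the cycle, carrying one value */
            char next = data[dst];
            data[dst] = tmp;
            tmp = next;
            dst = map[dst];
        }
        data[src] = tmp;
    }
    printf("%s\n", data); /* "afebcd" */
    return 0;
}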
 
@@ -215,53 +258,303 @@ av_cold void av_tx_uninit(AVTXContext **ctx)
     if (!(*ctx))
         return;
 
-    av_free((*ctx)->pfatab);
-    av_free((*ctx)->exptab);
-    av_free((*ctx)->revtab);
-    av_free((*ctx)->revtab_c);
-    av_free((*ctx)->inplace_idx);
-    av_free((*ctx)->tmp);
-
+    reset_ctx(*ctx);
     av_freep(ctx);
 }
 
+/* Null transform when the length is 1 */
+static void ff_tx_null(AVTXContext *s, void *_out, void *_in, ptrdiff_t stride)
+{
+    memcpy(_out, _in, stride);
+}
+
+static const FFTXCodelet ff_tx_null_def = {
+    .name       = "null",
+    .function   = ff_tx_null,
+    .type       = TX_TYPE_ANY,
+    .flags      = AV_TX_UNALIGNED | FF_TX_ALIGNED |
+                  FF_TX_OUT_OF_PLACE | AV_TX_INPLACE,
+    .factors[0] = TX_FACTOR_ANY,
+    .min_len    = 1,
+    .max_len    = 1,
+    .init       = NULL,
+    .cpu_flags  = FF_TX_CPU_FLAGS_ALL,
+    .prio       = FF_TX_PRIO_MAX,
+};
+
+static const FFTXCodelet * const ff_tx_null_list[] = {
+    &ff_tx_null_def,
+};
+
+typedef struct TXCodeletMatch {
+    const FFTXCodelet *cd;
+    int prio;
+} TXCodeletMatch;
+
+static int cmp_matches(TXCodeletMatch *a, TXCodeletMatch *b)
+{
+    int diff = FFDIFFSIGN(b->prio, a->prio);
+    if (!diff)
+        return FFDIFFSIGN(b->cd->factors[0], a->cd->factors[0]);
+    return diff;
+}
+
+static void print_flags(uint64_t flags)
+{
+    av_log(NULL, AV_LOG_WARNING, "Flags: %s%s%s%s%s%s\n",
+           flags & AV_TX_INPLACE      ? "inplace+"      : "",
+           flags & FF_TX_OUT_OF_PLACE ? "out_of_place+" : "",
+           flags & FF_TX_ALIGNED      ? "aligned+"      : "",
+           flags & AV_TX_UNALIGNED    ? "unaligned+"    : "",
+           flags & FF_TX_PRESHUFFLE   ? "preshuffle+"   : "",
+           flags & AV_TX_FULL_IMDCT   ? "full_imdct+"   : "");
+}
+
+/* We want all factors to completely cover the length */
+static inline int check_cd_factors(const FFTXCodelet *cd, int len)
+{
+    int all_flag = 0;
+
+    for (int i = 0; i < TX_MAX_SUB; i++) {
+        int factor = cd->factors[i];
+
+        /* Conditions satisfied */
+        if (len == 1)
+            return 1;
+
+        /* No more factors */
+        if (!factor) {
+            break;
+        } else if (factor == TX_FACTOR_ANY) {
+            all_flag = 1;
+            continue;
+        }
+
+        if (factor == 2) { /* Fast path */
+            int bits_2 = ff_ctz(len);
+            if (!bits_2)
+                return 0; /* Factor not supported */
+
+            len >>= bits_2;
+        } else {
+            int res = len % factor;
+            if (res)
+                return 0; /* Factor not supported */
+
+            while (!res) {
+                len /= factor;
+                res  = len % factor;
+            }
+        }
+    }
+
+    return all_flag || (len == 1);
+}
+
+av_cold int ff_tx_init_subtx(AVTXContext *s, enum AVTXType type,
+                             uint64_t flags, FFTXCodeletOptions *opts,
+                             int len, int inv, const void *scale)
+{
+    int ret = 0;
+    AVTXContext *sub = NULL;
+    TXCodeletMatch *cd_tmp, *cd_matches = NULL;
+    unsigned int cd_matches_size = 0;
+    int nb_cd_matches = 0;
+
+    /* Array of all compiled codelet lists. Order is irrelevant. */
+    const FFTXCodelet * const * const codelet_list[] = {
+        ff_tx_codelet_list_float_c,
+        ff_tx_codelet_list_float_x86,
+        ff_tx_null_list,
+    };
+    int codelet_list_num = FF_ARRAY_ELEMS(codelet_list);
+
+    /* We still accept functions marked with SLOW, even if the CPU is
+     * marked with the same flag, but we give them lower priority. */
+    const int cpu_flags = av_get_cpu_flags();
+    const int slow_mask = AV_CPU_FLAG_SSE2SLOW | AV_CPU_FLAG_SSE3SLOW |
+                          AV_CPU_FLAG_ATOM     | AV_CPU_FLAG_SSSE3SLOW |
+                          AV_CPU_FLAG_AVXSLOW  | AV_CPU_FLAG_SLOW_GATHER;
+
+    /* Flags the transform wants */
+    uint64_t req_flags = flags;
+    int penalize_unaligned = 0;
+
+    /* Unaligned codelets are compatible with the aligned flag, with a slight
+     * penalty */
+    if (req_flags & FF_TX_ALIGNED) {
+        req_flags |= AV_TX_UNALIGNED;
+        penalize_unaligned = 1;
+    }
+
+    /* If either flag is set, both are okay, so don't check for an exact match */
+    if ((req_flags & AV_TX_INPLACE) && (req_flags & FF_TX_OUT_OF_PLACE))
+        req_flags &= ~(AV_TX_INPLACE | FF_TX_OUT_OF_PLACE);
+    if ((req_flags & FF_TX_ALIGNED) && (req_flags & AV_TX_UNALIGNED))
+        req_flags &= ~(FF_TX_ALIGNED | AV_TX_UNALIGNED);
+
+    /* Flags the codelet may require to be present */
+    uint64_t inv_req_mask = AV_TX_FULL_IMDCT | FF_TX_PRESHUFFLE;
+
+// print_flags(req_flags);
+
+    /* Loop through all codelets in all codelet lists to find matches
+     * to the requirements */
+    while (codelet_list_num--) {
+        const FFTXCodelet * const * list = codelet_list[codelet_list_num];
+        const FFTXCodelet *cd = NULL;
+
+        while ((cd = *list++)) {
+            /* Check if the type matches */
+            if (cd->type != TX_TYPE_ANY && type != cd->type)
+                continue;
+
+            /* Check direction for non-orthogonal codelets */
+            if (((cd->flags & FF_TX_FORWARD_ONLY) && inv) ||
+                ((cd->flags & (FF_TX_INVERSE_ONLY | AV_TX_FULL_IMDCT)) && !inv))
+                continue;
+
+            /* Check if the requested flags match from both sides */
+            if (((req_flags & cd->flags) != (req_flags)) ||
+                ((inv_req_mask & cd->flags) != (req_flags & inv_req_mask)))
+                continue;
+
+            /* Check if length is supported */
+            if ((len < cd->min_len) || (cd->max_len != -1 && (len > cd->max_len)))
+                continue;
+
+            /* Check if the CPU supports the required ISA */
+            if (!(!cd->cpu_flags || (cpu_flags & (cd->cpu_flags & ~slow_mask))))
+                continue;
+
+            /* Check for factors */
+            if (!check_cd_factors(cd, len))
+                continue;
+
+            /* Realloc array and append */
+            cd_tmp = av_fast_realloc(cd_matches, &cd_matches_size,
+                                     sizeof(*cd_tmp) * (nb_cd_matches + 1));
+            if (!cd_tmp) {
+                av_free(cd_matches);
+                return AVERROR(ENOMEM);
+            }
+
+            cd_matches = cd_tmp;
+            cd_matches[nb_cd_matches].cd = cd;
+            cd_matches[nb_cd_matches].prio = cd->prio;
+
+            /* If the CPU has a SLOW flag, and the instruction is also flagged
+             * as being slow for such, reduce its priority */
+            if ((cpu_flags & cd->cpu_flags) & slow_mask)
+                cd_matches[nb_cd_matches].prio -= 64;
+
+            /* Penalize unaligned functions if needed */
+            if ((cd->flags & AV_TX_UNALIGNED) && penalize_unaligned)
+                cd_matches[nb_cd_matches].prio -= 64;
+
+            /* Codelets for specific lengths are generally faster. */
+            if ((len == cd->min_len) && (len == cd->max_len))
+                cd_matches[nb_cd_matches].prio += 64;
+
+            nb_cd_matches++;
+        }
+    }
+
+    /* No matches found */
+    if (!nb_cd_matches)
+        return AVERROR(ENOSYS);
+
+    /* Sort the list */
+    AV_QSORT(cd_matches, nb_cd_matches, TXCodeletMatch, cmp_matches);
+
+    if (!s->sub)
+        s->sub = sub = av_mallocz(TX_MAX_SUB*sizeof(*sub));
+
+    /* Attempt to initialize each */
+    for (int i = 0; i < nb_cd_matches; i++) {
+        const FFTXCodelet *cd = cd_matches[i].cd;
+        AVTXContext *sctx = &s->sub[s->nb_sub];
+
+        sctx->len     = len;
+        sctx->inv     = inv;
+        sctx->type    = type;
+        sctx->flags   = flags;
+        sctx->cd_self = cd;
+
+        s->fn[s->nb_sub] = cd->function;
+        s->cd[s->nb_sub] = cd;
+
+        ret = 0;
+        if (cd->init)
+            ret = cd->init(sctx, cd, flags, opts, len, inv, scale);
+
+        if (ret >= 0) {
+            s->nb_sub++;
+            goto end;
+        }
+
+        s->fn[s->nb_sub] = NULL;
+        s->cd[s->nb_sub] = NULL;
+
+        reset_ctx(sctx);
+        if (ret == AVERROR(ENOMEM))
+            break;
+    }
+
+    if (sub)
+        av_freep(&s->sub);
+
+    if (ret >= 0)
+        ret = AVERROR(ENOSYS);
+
+end:
+    av_free(cd_matches);
+    return ret;
+}
+
+static void print_tx_structure(AVTXContext *s, int depth)
+{
+    const FFTXCodelet *cd = s->cd_self;
+
+    for (int i = 0; i <= depth; i++)
+        av_log(NULL, AV_LOG_WARNING, "    ");
+    av_log(NULL, AV_LOG_WARNING, "↳ %s - %s, %ipt, %p\n", cd->name,
+           cd->type == TX_TYPE_ANY       ? "all"         :
+           cd->type == AV_TX_FLOAT_FFT   ? "fft_float"   :
+           cd->type == AV_TX_FLOAT_MDCT  ? "mdct_float"  :
+           cd->type == AV_TX_DOUBLE_FFT  ? "fft_double"  :
+           cd->type == AV_TX_DOUBLE_MDCT ? "mdct_double" :
+           cd->type == AV_TX_INT32_FFT   ? "fft_int32"   :
+           cd->type == AV_TX_INT32_MDCT  ? "mdct_int32"  : "unknown",
+           s->len, cd->function);
+
+    for (int i = 0; i < s->nb_sub; i++)
+        print_tx_structure(&s->sub[i], depth + 1);
+}
+
 av_cold int av_tx_init(AVTXContext **ctx, av_tx_fn *tx, enum AVTXType type,
                        int inv, int len, const void *scale, uint64_t flags)
 {
-    int err;
-    AVTXContext *s = av_mallocz(sizeof(*s));
-    if (!s)
-        return AVERROR(ENOMEM);
+    int ret;
+    AVTXContext tmp = { 0 };
 
-    switch (type) {
-    case AV_TX_FLOAT_FFT:
-    case AV_TX_FLOAT_MDCT:
-        if ((err = ff_tx_init_mdct_fft_float(s, tx, type, inv, len, scale, flags)))
-            goto fail;
-        if (ARCH_X86)
-            ff_tx_init_float_x86(s, tx);
-        break;
-    case AV_TX_DOUBLE_FFT:
-    case AV_TX_DOUBLE_MDCT:
-        if ((err = ff_tx_init_mdct_fft_double(s, tx, type, inv, len, scale, flags)))
-            goto fail;
-        break;
-    case AV_TX_INT32_FFT:
-    case AV_TX_INT32_MDCT:
-        if ((err = ff_tx_init_mdct_fft_int32(s, tx, type, inv, len, scale, flags)))
-            goto fail;
-        break;
-    default:
-        err = AVERROR(EINVAL);
-        goto fail;
-    }
+    if (!len || type >= AV_TX_NB)
+        return AVERROR(EINVAL);
 
-    *ctx = s;
+    if (!(flags & AV_TX_UNALIGNED))
+        flags |= FF_TX_ALIGNED;
+    if (!(flags & AV_TX_INPLACE))
+        flags |= FF_TX_OUT_OF_PLACE;
 
-    return 0;
+    ret = ff_tx_init_subtx(&tmp, type, flags, NULL, len, inv, scale);
+    if (ret < 0)
+        return ret;
+
+    *ctx = &tmp.sub[0];
+    *tx  = tmp.fn[0];
+
+    av_log(NULL, AV_LOG_WARNING, "Transform tree:\n");
+    print_tx_structure(*ctx, 0);
 
-fail:
-    av_tx_uninit(&s);
-    *tx = NULL;
-    return err;
+    return ret;
 }
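The public entry point keeps its signature, so caller-side usage is unchanged by this patch; for reference, a minimal sketch:

/* Caller-side view: av_tx_init() now builds a codelet tree internally,
 * but the public contract stays the same. */
#include "libavutil/tx.h"

int do_fft64(AVComplexFloat *out, const AVComplexFloat *in)
{
    AVTXContext *ctx = NULL;
    av_tx_fn tx = NULL;
    float scale = 1.0f;
    int ret = av_tx_init(&ctx, &tx, AV_TX_FLOAT_FFT, 0 /* forward */,
                         64, &scale, 0);
    if (ret < 0)
        return ret;
    tx(ctx, out, (void *)in, sizeof(AVComplexFloat)); /* stride in bytes */
    av_tx_uninit(&ctx);
    return 0;
}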
diff --git a/libavutil/tx.h b/libavutil/tx.h
index 55173810ee..4bc1478644 100644
--- a/libavutil/tx.h
+++ b/libavutil/tx.h
@@ -82,6 +82,9 @@ enum AVTXType {
      * Stride must be a non-zero multiple of sizeof(int32_t).
      */
     AV_TX_INT32_MDCT = 5,
+
+    /* Not part of the API, do not use */
+    AV_TX_NB,
 };
 
 /**
diff --git a/libavutil/tx_priv.h b/libavutil/tx_priv.h
index 63dc6bbe6d..a709e6973f 100644
--- a/libavutil/tx_priv.h
+++ b/libavutil/tx_priv.h
@@ -25,17 +25,26 @@
 #include "attributes.h"
 
 #ifdef TX_FLOAT
-#define TX_NAME(x) x ## _float
+#define TX_TAB(x) x ## _float
+#define TX_NAME(x) x ## _float_c
+#define TX_NAME_STR(x) x "_float_c"
+#define TX_TYPE(x) AV_TX_FLOAT_ ## x
 #define SCALE_TYPE float
 typedef float FFTSample;
 typedef AVComplexFloat FFTComplex;
 #elif defined(TX_DOUBLE)
-#define TX_NAME(x) x ## _double
+#define TX_TAB(x) x ## _double
+#define TX_NAME(x) x ## _double_c
+#define TX_NAME_STR(x) x "_double_c"
+#define TX_TYPE(x) AV_TX_DOUBLE_ ## x
 #define SCALE_TYPE double
 typedef double FFTSample;
 typedef AVComplexDouble FFTComplex;
 #elif defined(TX_INT32)
-#define TX_NAME(x) x ## _int32
+#define TX_TAB(x) x ## _int32
+#define TX_NAME(x) x ## _int32_c
+#define TX_NAME_STR(x) x "_int32_c"
+#define TX_TYPE(x) AV_TX_INT32_ ## x
 #define SCALE_TYPE float
 typedef int32_t FFTSample;
 typedef AVComplexInt32 FFTComplex;
@@ -103,53 +112,130 @@ typedef void FFTComplex;
 #define CMUL3(c, a, b)                                  \
     CMUL((c).re, (c).im, (a).re, (a).im, (b).re, (b).im)
 
-#define COSTABLE(size)                                  \
-    DECLARE_ALIGNED(32, FFTSample, TX_NAME(ff_cos_##size))[size/4 + 1]
+/* Codelet flags, used to pick codelets. Must be a superset of enum AVTXFlags,
+ * but if it runs out of bits, it can be made separate. */
+typedef enum FFTXCodeletFlags {
+    FF_TX_OUT_OF_PLACE = (1ULL << 63), /* Can be OR'd with AV_TX_INPLACE */
+    FF_TX_ALIGNED      = (1ULL << 62), /* Cannot be OR'd with AV_TX_UNALIGNED */
+    FF_TX_PRESHUFFLE   = (1ULL << 61), /* Codelet expects permuted coeffs */
+    FF_TX_INVERSE_ONLY = (1ULL << 60), /* For non-orthogonal inverse-only transforms */
+    FF_TX_FORWARD_ONLY = (1ULL << 59), /* For non-orthogonal forward-only transforms */
+} FFTXCodeletFlags;
+
+typedef enum FFTXCodeletPriority {
+    FF_TX_PRIO_BASE = 0,      /* Baseline priority */
+
+    /* For SIMD, set prio to the register size in bits. */
+
+    FF_TX_PRIO_MIN = -131072, /* For naive implementations */
+    FF_TX_PRIO_MAX = 32768,   /* For custom implementations/ASICs */
+} FFTXCodeletPriority;
+
+/* Codelet options */
+typedef struct FFTXCodeletOptions {
+    int invert_lookup; /* If the codelet is flagged as FF_TX_PRESHUFFLE,
+                          invert the lookup direction for the map generated */
+} FFTXCodeletOptions;
+
+/* Maximum amount of subtransform functions, subtransforms and factors. Arbitrary. */
+#define TX_MAX_SUB 4
+
+typedef struct FFTXCodelet {
+    const char *name;          /* Codelet name, for debugging */
+    av_tx_fn    function;      /* Codelet function, != NULL */
+    enum AVTXType type;        /* Type of codelet transform */
+#define TX_TYPE_ANY INT32_MAX  /* Special type to allow all types */
+
+    uint64_t flags;            /* A combination of AVTXFlags
+                                * and FFTXCodeletFlags flags
+                                * to describe the codelet. */
+
+    int factors[TX_MAX_SUB];   /* Length factors */
+#define TX_FACTOR_ANY -1       /* When used alone, signals that the codelet
+                                * supports all factors. Otherwise, if other
+                                * factors are present, it signals that whatever
+                                * remains will be supported, as long as the
+                                * other factors are a component of the length */
+
+    int min_len;               /* Minimum length of transform, must be >= 1 */
+    int max_len;               /* Maximum length of transform */
+#define TX_LEN_UNLIMITED -1    /* Special length value to permit arbitrarily large transforms */
+
+    int (*init)(AVTXContext *s,               /* Callback for current context initialization. */
+                const struct FFTXCodelet *cd, /* May be NULL */
+                uint64_t flags,
+                FFTXCodeletOptions *opts,
+                int len, int inv,
+                const void *scale);
+
+    int (*uninit)(AVTXContext *s); /* Callback for uninitialization. Can be NULL. */
+
+    int cpu_flags;             /* CPU flags. If any negative flags like
+                                * SLOW are present, will avoid picking.
+                                * 0x0 to signal it's a C codelet */
+#define FF_TX_CPU_FLAGS_ALL 0x0 /* Special CPU flag for C */
+
+    int prio;                  /* < 0 = least, 0 = no pref, > 0 = prefer */
+} FFTXCodelet;
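To make the descriptor concrete: a hypothetical exact-length codelet would be registered like this (my_fft32 is invented for illustration; it assumes the definitions in this header):

/* Hypothetical example of filling in the descriptor above; my_fft32() is
 * not part of the patch. */
static void my_fft32(AVTXContext *s, void *dst, void *src, ptrdiff_t stride);

static const FFTXCodelet my_fft32_def = {
    .name       = "my_fft32",
    .function   = my_fft32,
    .type       = AV_TX_FLOAT_FFT,
    .flags      = AV_TX_UNALIGNED | FF_TX_ALIGNED | FF_TX_OUT_OF_PLACE,
    .factors[0] = 2,
    .min_len    = 32,   /* min_len == max_len: exact-length codelet, which
                         * the constructor rewards with a priority bonus */
    .max_len    = 32,
    .init       = NULL, /* no tables or subtransforms needed */
    .cpu_flags  = FF_TX_CPU_FLAGS_ALL,
    .prio       = FF_TX_PRIO_BASE,
};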
 
-/* Used by asm, reorder with care */
 struct AVTXContext {
-    int n;              /* Non-power-of-two part */
-    int m;              /* Power-of-two part */
-    int inv;            /* Is inverse */
-    int type;           /* Type */
-    uint64_t flags;     /* Flags */
-    double scale;       /* Scale */
-
-    FFTComplex *exptab; /* MDCT exptab */
-    FFTComplex *tmp;    /* Temporary buffer needed for all compound transforms */
-    int *pfatab;        /* Input/Output mapping for compound transforms */
-    int *revtab;        /* Input mapping for power of two transforms */
-    int *inplace_idx;   /* Required indices to revtab for in-place transforms */
-
-    int *revtab_c;      /* Revtab for only the C transforms, needed because
-                         * checkasm makes us reuse the same context. */
-
-    av_tx_fn top_tx;    /* Used for computing transforms derived from other
-                         * transforms, like full-length iMDCTs and RDFTs.
-                         * NOTE: Do NOT use this to mix assembly with C code. */
+    /* Fields the root transform and subtransforms use or may use.
+     * NOTE: This section is used by assembly, do not reorder or change */
+    int len;                 /* Length of the transform */
+    int inv;                 /* If transform is inverse */
+    int *map;                /* Lookup table(s) */
+    FFTComplex *exp;         /* Any non-pre-baked multiplication factors needed */
+    FFTComplex *tmp;         /* Temporary buffer, if needed */
+
+    AVTXContext *sub;        /* Subtransform context(s), if needed */
+    av_tx_fn fn[TX_MAX_SUB]; /* Function(s) for the subtransforms */
+    int nb_sub;              /* Number of subtransforms.
+                              * The reason all of these are set here
+                              * rather than in each separate context
+                              * is to eliminate extra pointer
+                              * dereferences. */
+
+    /* Fields mainly useful/applicable for the root transform or initialization.
+     * Fields below are not used by assembly code. */
+    const FFTXCodelet *cd[TX_MAX_SUB]; /* Subtransform codelets */
+    const FFTXCodelet *cd_self;        /* Codelet for the current context */
+    enum AVTXType type;                /* Type of transform */
+    uint64_t flags;                    /* A combination of AVTXFlags
+                                          and FFTXCodeletFlags flags
+                                          used when creating */
+    float  scale_f;
+    double scale_d;
+    void  *opaque;                     /* Free to use by implementations */
 };
 
-/* Checks if type is an MDCT */
-int ff_tx_type_is_mdct(enum AVTXType type);
+/* Create a subtransform in the current context with the given parameters.
+ * The flags parameter from FFTXCodelet.init() should be preserved as much
+ * as that's possible.
+ * MUST be called during the init() callback of each codelet. */
+int ff_tx_init_subtx(AVTXContext *s, enum AVTXType type,
+                     uint64_t flags, FFTXCodeletOptions *opts,
+                     int len, int inv, const void *scale);
 
 /*
  * Generates the PFA permutation table into AVTXContext->pfatab. The end table
  * is appended to the start table.
  */
-int ff_tx_gen_compound_mapping(AVTXContext *s);
+int ff_tx_gen_compound_mapping(AVTXContext *s, int n, int m);
 
 /*
  * Generates a standard-ish (slightly modified) Split-Radix revtab into
- * AVTXContext->revtab
+ * AVTXContext->map. Invert lookup changes how the mapping needs to be applied.
+ * If it's set to 0, it has to be applied like out[map[i]] = in[i], otherwise
+ * if it's set to 1, it has to be applied as out[i] = in[map[i]]
  */
 int ff_tx_gen_ptwo_revtab(AVTXContext *s, int invert_lookup);
 
 /*
  * Generates an index into AVTXContext->inplace_idx that if followed in the
- * specific order, allows the revtab to be done in-place. AVTXContext->revtab
+ * specific order, allows the revtab to be done in-place. AVTXContext->map
  * must already exist.
  */
-int ff_tx_gen_ptwo_inplace_revtab_idx(AVTXContext *s, int *revtab);
+int ff_tx_gen_ptwo_inplace_revtab_idx(AVTXContext *s);
 
 /*
  * This generates a parity-based revtab of length len and direction inv.
@@ -179,25 +265,17 @@ int ff_tx_gen_ptwo_inplace_revtab_idx(AVTXContext *s, int *revtab);
  *
  * If length is smaller than basis/2 this function will not do anything.
  */
-void ff_tx_gen_split_radix_parity_revtab(int *revtab, int len, int inv,
-                                         int basis, int dual_stride);
-
-/* Templated init functions */
-int ff_tx_init_mdct_fft_float(AVTXContext *s, av_tx_fn *tx,
-                              enum AVTXType type, int inv, int len,
-                              const void *scale, uint64_t flags);
-int ff_tx_init_mdct_fft_double(AVTXContext *s, av_tx_fn *tx,
-                               enum AVTXType type, int inv, int len,
-                               const void *scale, uint64_t flags);
-int ff_tx_init_mdct_fft_int32(AVTXContext *s, av_tx_fn *tx,
-                              enum AVTXType type, int inv, int len,
-                              const void *scale, uint64_t flags);
-
-typedef struct CosTabsInitOnce {
-    void (*func)(void);
-    AVOnce control;
-} CosTabsInitOnce;
-
-void ff_tx_init_float_x86(AVTXContext *s, av_tx_fn *tx);
+int ff_tx_gen_split_radix_parity_revtab(AVTXContext *s, int invert_lookup,
+                                        int basis, int dual_stride);
+
+void ff_tx_init_tabs_float (int len);
+extern const FFTXCodelet * const ff_tx_codelet_list_float_c [];
+extern const FFTXCodelet * const ff_tx_codelet_list_float_x86 [];
+
+void ff_tx_init_tabs_double(int len);
+extern const FFTXCodelet * const ff_tx_codelet_list_double_c [];
+
+void ff_tx_init_tabs_int32 (int len);
+extern const FFTXCodelet * const ff_tx_codelet_list_int32_c [];
 
 #endif /* AVUTIL_TX_PRIV_H */
diff --git a/libavutil/tx_template.c b/libavutil/tx_template.c
index cad66a8bc0..bfd27799be 100644
--- a/libavutil/tx_template.c
+++ b/libavutil/tx_template.c
@@ -24,134 +24,160 @@
  * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
  */
 
-/* All costabs for a type are defined here */
-COSTABLE(16);
-COSTABLE(32);
-COSTABLE(64);
-COSTABLE(128);
-COSTABLE(256);
-COSTABLE(512);
-COSTABLE(1024);
-COSTABLE(2048);
-COSTABLE(4096);
-COSTABLE(8192);
-COSTABLE(16384);
-COSTABLE(32768);
-COSTABLE(65536);
-COSTABLE(131072);
-DECLARE_ALIGNED(32, FFTComplex, TX_NAME(ff_cos_53))[4];
-DECLARE_ALIGNED(32, FFTComplex, TX_NAME(ff_cos_7))[3];
-DECLARE_ALIGNED(32, FFTComplex, TX_NAME(ff_cos_9))[4];
-
-static FFTSample * const cos_tabs[18] = {
-    NULL,
-    NULL,
-    NULL,
-    NULL,
-    TX_NAME(ff_cos_16),
-    TX_NAME(ff_cos_32),
-    TX_NAME(ff_cos_64),
-    TX_NAME(ff_cos_128),
-    TX_NAME(ff_cos_256),
-    TX_NAME(ff_cos_512),
-    TX_NAME(ff_cos_1024),
-    TX_NAME(ff_cos_2048),
-    TX_NAME(ff_cos_4096),
-    TX_NAME(ff_cos_8192),
-    TX_NAME(ff_cos_16384),
-    TX_NAME(ff_cos_32768),
-    TX_NAME(ff_cos_65536),
-    TX_NAME(ff_cos_131072),
-};
-
-static av_always_inline void init_cos_tabs_idx(int index)
-{
-    int m = 1 << index;
-    double freq = 2*M_PI/m;
-    FFTSample *tab = cos_tabs[index];
-
-    for (int i = 0; i < m/4; i++)
-        *tab++ = RESCALE(cos(i*freq));
-
-    *tab = 0;
+#define TABLE_DEF(name, size) \
+    DECLARE_ALIGNED(32, FFTSample, TX_TAB(ff_tx_tab_ ##name))[size]
+
+#define SR_TABLE(len) \
+    TABLE_DEF(len, len/4 + 1)
+
+/* Power of two tables */
+SR_TABLE(8);
+SR_TABLE(16);
+SR_TABLE(32);
+SR_TABLE(64);
+SR_TABLE(128);
+SR_TABLE(256);
+SR_TABLE(512);
+SR_TABLE(1024);
+SR_TABLE(2048);
+SR_TABLE(4096);
+SR_TABLE(8192);
+SR_TABLE(16384);
+SR_TABLE(32768);
+SR_TABLE(65536);
+SR_TABLE(131072);
+
+/* Other factors' tables */
+TABLE_DEF(53, 8);
+TABLE_DEF( 7, 6);
+TABLE_DEF( 9, 8);
+
+typedef struct FFSRTabsInitOnce {
+    void (*func)(void);
+    AVOnce control;
+    int factors[4]; /* Must be sorted high -> low */
+} FFSRTabsInitOnce;
+
+#define INIT_FF_SR_TAB(len)                                \
+static av_cold void TX_TAB(ff_tx_init_tab_ ##len)(void)    \
+{                                                          \
+    double freq = 2*M_PI/len;                              \
+    FFTSample *tab = TX_TAB(ff_tx_tab_ ##len);             \
+                                                           \
+    for (int i = 0; i < len/4; i++)                        \
+        *tab++ = RESCALE(cos(i*freq));                     \
+                                                           \
+    *tab = 0;                                              \
 }
 
-#define INIT_FF_COS_TABS_FUNC(index, size)                 \
-static av_cold void init_cos_tabs_ ## size (void)          \
-{                                                          \
-    init_cos_tabs_idx(index);                              \
-}
+INIT_FF_SR_TAB(8)
+INIT_FF_SR_TAB(16)
+INIT_FF_SR_TAB(32)
+INIT_FF_SR_TAB(64)
+INIT_FF_SR_TAB(128)
+INIT_FF_SR_TAB(256)
+INIT_FF_SR_TAB(512)
+INIT_FF_SR_TAB(1024)
+INIT_FF_SR_TAB(2048)
+INIT_FF_SR_TAB(4096)
+INIT_FF_SR_TAB(8192)
+INIT_FF_SR_TAB(16384)
+INIT_FF_SR_TAB(32768)
+INIT_FF_SR_TAB(65536)
+INIT_FF_SR_TAB(131072)
+
+FFSRTabsInitOnce sr_tabs_init_once[] = {
+    { TX_TAB(ff_tx_init_tab_8),      AV_ONCE_INIT },
+    { TX_TAB(ff_tx_init_tab_16),     AV_ONCE_INIT },
+    { TX_TAB(ff_tx_init_tab_32),     AV_ONCE_INIT },
+    { TX_TAB(ff_tx_init_tab_64),     AV_ONCE_INIT },
+    { TX_TAB(ff_tx_init_tab_128),    AV_ONCE_INIT },
+    { TX_TAB(ff_tx_init_tab_256),    AV_ONCE_INIT },
+    { TX_TAB(ff_tx_init_tab_512),    AV_ONCE_INIT },
+    { TX_TAB(ff_tx_init_tab_1024),   AV_ONCE_INIT },
+    { TX_TAB(ff_tx_init_tab_2048),   AV_ONCE_INIT },
+    { TX_TAB(ff_tx_init_tab_4096),   AV_ONCE_INIT },
+    { TX_TAB(ff_tx_init_tab_8192),   AV_ONCE_INIT },
+    { TX_TAB(ff_tx_init_tab_16384),  AV_ONCE_INIT },
+    { TX_TAB(ff_tx_init_tab_32768),  AV_ONCE_INIT },
+    { TX_TAB(ff_tx_init_tab_65536),  AV_ONCE_INIT },
+    { TX_TAB(ff_tx_init_tab_131072), AV_ONCE_INIT },
+};
 
-INIT_FF_COS_TABS_FUNC(4, 16)
-INIT_FF_COS_TABS_FUNC(5, 32)
-INIT_FF_COS_TABS_FUNC(6, 64)
-INIT_FF_COS_TABS_FUNC(7, 128)
-INIT_FF_COS_TABS_FUNC(8, 256)
-INIT_FF_COS_TABS_FUNC(9, 512)
-INIT_FF_COS_TABS_FUNC(10, 1024)
-INIT_FF_COS_TABS_FUNC(11, 2048)
-INIT_FF_COS_TABS_FUNC(12, 4096)
-INIT_FF_COS_TABS_FUNC(13, 8192)
-INIT_FF_COS_TABS_FUNC(14, 16384)
-INIT_FF_COS_TABS_FUNC(15, 32768)
-INIT_FF_COS_TABS_FUNC(16, 65536)
-INIT_FF_COS_TABS_FUNC(17, 131072)
-
-static av_cold void ff_init_53_tabs(void)
+static av_cold void TX_TAB(ff_tx_init_tab_53)(void)
 {
-    TX_NAME(ff_cos_53)[0] = (FFTComplex){ RESCALE(cos(2 * M_PI / 12)), RESCALE(cos(2 * M_PI / 12)) };
-    TX_NAME(ff_cos_53)[1] = (FFTComplex){ RESCALE(cos(2 * M_PI /  6)), RESCALE(cos(2 * M_PI /  6)) };
-    TX_NAME(ff_cos_53)[2] = (FFTComplex){ RESCALE(cos(2 * M_PI /  5)), RESCALE(sin(2 * M_PI /  5)) };
-    TX_NAME(ff_cos_53)[3] = (FFTComplex){ RESCALE(cos(2 * M_PI / 10)), RESCALE(sin(2 * M_PI / 10)) };
+    TX_TAB(ff_tx_tab_53)[0] = RESCALE(cos(2 * M_PI / 12));
+    TX_TAB(ff_tx_tab_53)[1] = RESCALE(cos(2 * M_PI / 12));
+    TX_TAB(ff_tx_tab_53)[2] = RESCALE(cos(2 * M_PI /  6));
+    TX_TAB(ff_tx_tab_53)[3] = RESCALE(cos(2 * M_PI /  6));
+    TX_TAB(ff_tx_tab_53)[4] = RESCALE(cos(2 * M_PI /  5));
+    TX_TAB(ff_tx_tab_53)[5] = RESCALE(sin(2 * M_PI /  5));
+    TX_TAB(ff_tx_tab_53)[6] = RESCALE(cos(2 * M_PI / 10));
+    TX_TAB(ff_tx_tab_53)[7] = RESCALE(sin(2 * M_PI / 10));
 }
 
-static av_cold void ff_init_7_tabs(void)
+static av_cold void TX_TAB(ff_tx_init_tab_7)(void)
 {
-    TX_NAME(ff_cos_7)[0] = (FFTComplex){ RESCALE(cos(2 * M_PI /  7)), RESCALE(sin(2 * M_PI /  7)) };
-    TX_NAME(ff_cos_7)[1] = (FFTComplex){ RESCALE(sin(2 * M_PI / 28)), RESCALE(cos(2 * M_PI / 28)) };
-    TX_NAME(ff_cos_7)[2] = (FFTComplex){ RESCALE(cos(2 * M_PI / 14)), RESCALE(sin(2 * M_PI / 14)) };
+    TX_TAB(ff_tx_tab_7)[0] = RESCALE(cos(2 * M_PI /  7));
+    TX_TAB(ff_tx_tab_7)[1] = RESCALE(sin(2 * M_PI /  7));
+    TX_TAB(ff_tx_tab_7)[2] = RESCALE(sin(2 * M_PI / 28));
+    TX_TAB(ff_tx_tab_7)[3] = RESCALE(cos(2 * M_PI / 28));
+    TX_TAB(ff_tx_tab_7)[4] = RESCALE(cos(2 * M_PI / 14));
+    TX_TAB(ff_tx_tab_7)[5] = RESCALE(sin(2 * M_PI / 14));
 }
 
-static av_cold void ff_init_9_tabs(void)
+static av_cold void TX_TAB(ff_tx_init_tab_9)(void)
 {
-    TX_NAME(ff_cos_9)[0] = (FFTComplex){ RESCALE(cos(2 * M_PI /  3)), RESCALE( sin(2 * M_PI /  3)) };
-    TX_NAME(ff_cos_9)[1] = (FFTComplex){ RESCALE(cos(2 * M_PI /  9)), RESCALE( sin(2 * M_PI /  9)) };
-    TX_NAME(ff_cos_9)[2] = (FFTComplex){ RESCALE(cos(2 * M_PI / 36)), RESCALE( sin(2 * M_PI / 36)) };
-    TX_NAME(ff_cos_9)[3] = (FFTComplex){ TX_NAME(ff_cos_9)[1].re + TX_NAME(ff_cos_9)[2].im,
-                                         TX_NAME(ff_cos_9)[1].im - TX_NAME(ff_cos_9)[2].re };
+    TX_TAB(ff_tx_tab_9)[0] = RESCALE(cos(2 * M_PI /  3));
+    TX_TAB(ff_tx_tab_9)[1] = RESCALE(sin(2 * M_PI /  3));
+    TX_TAB(ff_tx_tab_9)[2] = RESCALE(cos(2 * M_PI /  9));
+    TX_TAB(ff_tx_tab_9)[3] = RESCALE(sin(2 * M_PI /  9));
+    TX_TAB(ff_tx_tab_9)[4] = RESCALE(cos(2 * M_PI / 36));
+    TX_TAB(ff_tx_tab_9)[5] = RESCALE(sin(2 * M_PI / 36));
+    TX_TAB(ff_tx_tab_9)[6] = TX_TAB(ff_tx_tab_9)[2] + TX_TAB(ff_tx_tab_9)[5];
+    TX_TAB(ff_tx_tab_9)[7] = TX_TAB(ff_tx_tab_9)[3] - TX_TAB(ff_tx_tab_9)[4];
 }
 
-static CosTabsInitOnce cos_tabs_init_once[] = {
-    { ff_init_53_tabs, AV_ONCE_INIT },
-    { ff_init_7_tabs, AV_ONCE_INIT },
-    { ff_init_9_tabs, AV_ONCE_INIT },
-    { NULL },
-    { init_cos_tabs_16, AV_ONCE_INIT },
-    { init_cos_tabs_32, AV_ONCE_INIT },
-    { init_cos_tabs_64, AV_ONCE_INIT },
-    { init_cos_tabs_128, AV_ONCE_INIT },
-    { init_cos_tabs_256, AV_ONCE_INIT },
-    { init_cos_tabs_512, AV_ONCE_INIT },
-    { init_cos_tabs_1024, AV_ONCE_INIT },
-    { init_cos_tabs_2048, AV_ONCE_INIT },
-    { init_cos_tabs_4096, AV_ONCE_INIT },
-    { init_cos_tabs_8192, AV_ONCE_INIT },
-    { init_cos_tabs_16384, AV_ONCE_INIT },
-    { init_cos_tabs_32768, AV_ONCE_INIT },
-    { init_cos_tabs_65536, AV_ONCE_INIT },
-    { init_cos_tabs_131072, AV_ONCE_INIT },
+FFSRTabsInitOnce nptwo_tabs_init_once[] = {
+    { TX_TAB(ff_tx_init_tab_53), AV_ONCE_INIT, { 15, 5, 3 } },
+    { TX_TAB(ff_tx_init_tab_9),  AV_ONCE_INIT, { 9 } },
+    { TX_TAB(ff_tx_init_tab_7),  AV_ONCE_INIT, { 7 } },
 };
 
-static av_cold void init_cos_tabs(int index)
+av_cold void TX_TAB(ff_tx_init_tabs)(int len)
 {
-    ff_thread_once(&cos_tabs_init_once[index].control,
-                   cos_tabs_init_once[index].func);
+    int factor_2 = ff_ctz(len);
+    if (factor_2) {
+        int idx = factor_2 - 3;
+        for (int i = 0; i <= idx; i++)
+            ff_thread_once(&sr_tabs_init_once[i].control,
+                           sr_tabs_init_once[i].func);
+        len >>= factor_2;
+    }
+
+    for (int i = 0; i < FF_ARRAY_ELEMS(nptwo_tabs_init_once); i++) {
+        int f, f_idx = 0;
+
+        if (len <= 1)
+            return;
+
+        while ((f = nptwo_tabs_init_once[i].factors[f_idx++])) {
+            if (f % len)
+                continue;
+
+            ff_thread_once(&nptwo_tabs_init_once[i].control,
+                           nptwo_tabs_init_once[i].func);
+            len /= f;
+            break;
+        }
+    }
 }
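A rough standalone trace of the table selection above for a composite length (a sketch of the logic, not the real code path; 8 is the smallest power-of-two table):

#include <stdio.h>

int main(void)
{
    int len = 96;                          /* = 2^5 * 3 */
    int factor_2 = __builtin_ctz(len);     /* the GCC/Clang builtin ff_ctz() wraps */
    for (int i = 3; i <= factor_2; i++)
        printf("SR table %d\n", 1 << i);   /* 8, 16, 32 */
    len >>= factor_2;                      /* 96 >> 5 = 3 */
    if (len > 1) {
        if      (15 % len == 0) printf("3/5/15 table\n"); /* hit for len = 3 */
        else if ( 9 % len == 0) printf("9 table\n");
        else if ( 7 % len == 0) printf("7 table\n");
    }
    return 0;
}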
 
 static av_always_inline void fft3(FFTComplex *out, FFTComplex *in,
                                   ptrdiff_t stride)
 {
     FFTComplex tmp[2];
+    const FFTSample *tab = TX_TAB(ff_tx_tab_53);
 #ifdef TX_INT32
     int64_t mtmp[4];
 #endif
@@ -163,19 +189,19 @@ static av_always_inline void fft3(FFTComplex *out, FFTComplex *in,
     out[0*stride].im = in[0].im + tmp[1].im;
 
 #ifdef TX_INT32
-    mtmp[0] = (int64_t)TX_NAME(ff_cos_53)[0].re * tmp[0].re;
-    mtmp[1] = (int64_t)TX_NAME(ff_cos_53)[0].im * tmp[0].im;
-    mtmp[2] = (int64_t)TX_NAME(ff_cos_53)[1].re * tmp[1].re;
-    mtmp[3] = (int64_t)TX_NAME(ff_cos_53)[1].re * tmp[1].im;
+    mtmp[0] = (int64_t)tab[0] * tmp[0].re;
+    mtmp[1] = (int64_t)tab[1] * tmp[0].im;
+    mtmp[2] = (int64_t)tab[2] * tmp[1].re;
+    mtmp[3] = (int64_t)tab[2] * tmp[1].im;
     out[1*stride].re = in[0].re - (mtmp[2] + mtmp[0] + 0x40000000 >> 31);
     out[1*stride].im = in[0].im - (mtmp[3] - mtmp[1] + 0x40000000 >> 31);
     out[2*stride].re = in[0].re - (mtmp[2] - mtmp[0] + 0x40000000 >> 31);
     out[2*stride].im = in[0].im - (mtmp[3] + mtmp[1] + 0x40000000 >> 31);
 #else
-    tmp[0].re = TX_NAME(ff_cos_53)[0].re * tmp[0].re;
-    tmp[0].im = TX_NAME(ff_cos_53)[0].im * tmp[0].im;
-    tmp[1].re = TX_NAME(ff_cos_53)[1].re * tmp[1].re;
-    tmp[1].im = TX_NAME(ff_cos_53)[1].re * tmp[1].im;
+    tmp[0].re = tab[0] * tmp[0].re;
+    tmp[0].im = tab[1] * tmp[0].im;
+    tmp[1].re = tab[2] * tmp[1].re;
+    tmp[1].im = tab[2] * tmp[1].im;
     out[1*stride].re = in[0].re - tmp[1].re + tmp[0].re;
     out[1*stride].im = in[0].im - tmp[1].im - tmp[0].im;
     out[2*stride].re = in[0].re - tmp[1].re - tmp[0].re;
@@ -183,38 +209,39 @@ static av_always_inline void fft3(FFTComplex *out, FFTComplex *in,
 #endif
 }
 
-#define DECL_FFT5(NAME, D0, D1, D2, D3, D4)                                                       \
-static av_always_inline void NAME(FFTComplex *out, FFTComplex *in,                                \
-                                  ptrdiff_t stride)                                               \
-{                                                                                                 \
-    FFTComplex z0[4], t[6];                                                                       \
-                                                                                                  \
-    BF(t[1].im, t[0].re, in[1].re, in[4].re);                                                     \
-    BF(t[1].re, t[0].im, in[1].im, in[4].im);                                                     \
-    BF(t[3].im, t[2].re, in[2].re, in[3].re);                                                     \
-    BF(t[3].re, t[2].im, in[2].im, in[3].im);                                                     \
-                                                                                                  \
-    out[D0*stride].re = in[0].re + t[0].re + t[2].re;                                             \
-    out[D0*stride].im = in[0].im + t[0].im + t[2].im;                                             \
-                                                                                                  \
-    SMUL(t[4].re, t[0].re, TX_NAME(ff_cos_53)[2].re, TX_NAME(ff_cos_53)[3].re, t[2].re, t[0].re); \
-    SMUL(t[4].im, t[0].im, TX_NAME(ff_cos_53)[2].re, TX_NAME(ff_cos_53)[3].re, t[2].im, t[0].im); \
-    CMUL(t[5].re, t[1].re, TX_NAME(ff_cos_53)[2].im, TX_NAME(ff_cos_53)[3].im, t[3].re, t[1].re); \
-    CMUL(t[5].im, t[1].im, TX_NAME(ff_cos_53)[2].im, TX_NAME(ff_cos_53)[3].im, t[3].im, t[1].im); \
-                                                                                                  \
-    BF(z0[0].re, z0[3].re, t[0].re, t[1].re);                                                     \
-    BF(z0[0].im, z0[3].im, t[0].im, t[1].im);                                                     \
-    BF(z0[2].re, z0[1].re, t[4].re, t[5].re);                                                     \
-    BF(z0[2].im, z0[1].im, t[4].im, t[5].im);                                                     \
-                                                                                                  \
-    out[D1*stride].re = in[0].re + z0[3].re;                                                      \
-    out[D1*stride].im = in[0].im + z0[0].im;                                                      \
-    out[D2*stride].re = in[0].re + z0[2].re;                                                      \
-    out[D2*stride].im = in[0].im + z0[1].im;                                                      \
-    out[D3*stride].re = in[0].re + z0[1].re;                                                      \
-    out[D3*stride].im = in[0].im + z0[2].im;                                                      \
-    out[D4*stride].re = in[0].re + z0[0].re;                                                      \
-    out[D4*stride].im = in[0].im + z0[3].im;                                                      \
+#define DECL_FFT5(NAME, D0, D1, D2, D3, D4)                         \
+static av_always_inline void NAME(FFTComplex *out, FFTComplex *in,  \
+                                  ptrdiff_t stride)                 \
+{                                                                   \
+    FFTComplex z0[4], t[6];                                         \
+    const FFTSample *tab = TX_TAB(ff_tx_tab_53);                    \
+                                                                    \
+    BF(t[1].im, t[0].re, in[1].re, in[4].re);                       \
+    BF(t[1].re, t[0].im, in[1].im, in[4].im);                       \
+    BF(t[3].im, t[2].re, in[2].re, in[3].re);                       \
+    BF(t[3].re, t[2].im, in[2].im, in[3].im);                       \
+                                                                    \
+    out[D0*stride].re = in[0].re + t[0].re + t[2].re;               \
+    out[D0*stride].im = in[0].im + t[0].im + t[2].im;               \
+                                                                    \
+    SMUL(t[4].re, t[0].re, tab[4], tab[6], t[2].re, t[0].re);       \
+    SMUL(t[4].im, t[0].im, tab[4], tab[6], t[2].im, t[0].im);       \
+    CMUL(t[5].re, t[1].re, tab[5], tab[7], t[3].re, t[1].re);       \
+    CMUL(t[5].im, t[1].im, tab[5], tab[7], t[3].im, t[1].im);       \
+                                                                    \
+    BF(z0[0].re, z0[3].re, t[0].re, t[1].re);                       \
+    BF(z0[0].im, z0[3].im, t[0].im, t[1].im);                       \
+    BF(z0[2].re, z0[1].re, t[4].re, t[5].re);                       \
+    BF(z0[2].im, z0[1].im, t[4].im, t[5].im);                       \
+                                                                    \
+    out[D1*stride].re = in[0].re + z0[3].re;                        \
+    out[D1*stride].im = in[0].im + z0[0].im;                        \
+    out[D2*stride].re = in[0].re + z0[2].re;                        \
+    out[D2*stride].im = in[0].im + z0[1].im;                        \
+    out[D3*stride].re = in[0].re + z0[1].re;                        \
+    out[D3*stride].im = in[0].im + z0[2].im;                        \
+    out[D4*stride].re = in[0].re + z0[0].re;                        \
+    out[D4*stride].im = in[0].im + z0[3].im;                        \
 }
 
 DECL_FFT5(fft5, 0, 1, 2, 3, 4)
@@ -226,7 +253,7 @@ static av_always_inline void fft7(FFTComplex *out, FFTComplex *in,
                                   ptrdiff_t stride)
 {
     FFTComplex t[6], z[3];
-    const FFTComplex *tab = TX_NAME(ff_cos_7);
+    const FFTComplex *tab = (const FFTComplex *)TX_TAB(ff_tx_tab_7);
 #ifdef TX_INT32
     int64_t mtmp[12];
 #endif
@@ -312,7 +339,7 @@ static av_always_inline void fft7(FFTComplex *out, FFTComplex *in,
 static av_always_inline void fft9(FFTComplex *out, FFTComplex *in,
                                   ptrdiff_t stride)
 {
-    const FFTComplex *tab = TX_NAME(ff_cos_9);
+    const FFTComplex *tab = (const FFTComplex *)TX_TAB(ff_tx_tab_9);
     FFTComplex t[16], w[4], x[5], y[5], z[2];
 #ifdef TX_INT32
     int64_t mtmp[12];
@@ -468,15 +495,16 @@ static av_always_inline void fft15(FFTComplex *out, FFTComplex *in,
 } while (0)
 
 /* z[0...8n-1], w[1...2n-1] */
-static void split_radix_combine(FFTComplex *z, const FFTSample *cos, int n)
+static inline void TX_NAME(ff_tx_fft_sr_combine)(FFTComplex *z,
+                                                 const FFTSample *cos, int len)
 {
-    int o1 = 2*n;
-    int o2 = 4*n;
-    int o3 = 6*n;
+    int o1 = 2*len;
+    int o2 = 4*len;
+    int o3 = 6*len;
     const FFTSample *wim = cos + o1 - 7;
     FFTSample t1, t2, t3, t4, t5, t6, r0, i0, r1, i1;
 
-    for (int i = 0; i < n; i += 4) {
+    for (int i = 0; i < len; i += 4) {
         TRANSFORM(z[0], z[o1 + 0], z[o2 + 0], z[o3 + 0], cos[0], wim[7]);
         TRANSFORM(z[2], z[o1 + 2], z[o2 + 2], z[o3 + 2], cos[2], wim[5]);
         TRANSFORM(z[4], z[o1 + 4], z[o2 + 4], z[o3 + 4], cos[4], wim[3]);
@@ -493,25 +521,62 @@ static void split_radix_combine(FFTComplex *z, const FFTSample *cos, int n)
     }
 }
 
-#define DECL_FFT(n, n2, n4)                            \
-static void fft##n(FFTComplex *z)                      \
-{                                                      \
-    fft##n2(z);                                        \
-    fft##n4(z + n4*2);                                 \
-    fft##n4(z + n4*3);                                 \
-    split_radix_combine(z, TX_NAME(ff_cos_##n), n4/2); \
+static av_cold int TX_NAME(ff_tx_fft_sr_codelet_init)(AVTXContext *s,
+                                                      const FFTXCodelet *cd,
+                                                      uint64_t flags,
+                                                      FFTXCodeletOptions *opts,
+                                                      int len, int inv,
+                                                      const void *scale)
+{
+    TX_TAB(ff_tx_init_tabs)(len);
+    return ff_tx_gen_ptwo_revtab(s, opts ? opts->invert_lookup : 1);
 }
 
-static void fft2(FFTComplex *z)
+#define DECL_SR_CODELET_DEF(n)                              \
+const FFTXCodelet TX_NAME(ff_tx_fft##n##_ns_def) = {        \
+    .name       = TX_NAME_STR("fft" #n "_ns"),              \
+    .function   = TX_NAME(ff_tx_fft##n##_ns),               \
+    .type       = TX_TYPE(FFT),                             \
+    .flags      = AV_TX_INPLACE | AV_TX_UNALIGNED |         \
+                  FF_TX_PRESHUFFLE,                         \
+    .factors[0] = 2,                                        \
+    .min_len    = n,                                        \
+    .max_len    = n,                                        \
+    .init       = TX_NAME(ff_tx_fft_sr_codelet_init),       \
+    .cpu_flags  = FF_TX_CPU_FLAGS_ALL,                      \
+    .prio       = FF_TX_PRIO_BASE,                          \
+};
+
+#define DECL_SR_CODELET(n, n2, n4)                                    \
+static void TX_NAME(ff_tx_fft##n##_ns)(AVTXContext *s, void *dst,     \
+                                       void *src, ptrdiff_t stride)   \
+{                                                                     \
+    FFTComplex *z = dst;                                              \
+    const FFTSample *cos = TX_TAB(ff_tx_tab_##n);                     \
+                                                                      \
+    TX_NAME(ff_tx_fft##n2##_ns)(s, z,        z,        stride);       \
+    TX_NAME(ff_tx_fft##n4##_ns)(s, z + n4*2, z + n4*2, stride);       \
+    TX_NAME(ff_tx_fft##n4##_ns)(s, z + n4*3, z + n4*3, stride);       \
+    TX_NAME(ff_tx_fft_sr_combine)(z, cos, n4 >> 1);                   \
+}                                                                     \
+                                                                      \
+DECL_SR_CODELET_DEF(n)
+
+static void TX_NAME(ff_tx_fft2_ns)(AVTXContext *s, void *dst,
+                                   void *src, ptrdiff_t stride)
 {
+    FFTComplex *z = dst;
     FFTComplex tmp;
+
     BF(tmp.re, z[0].re, z[0].re, z[1].re);
     BF(tmp.im, z[0].im, z[0].im, z[1].im);
     z[1] = tmp;
 }
 
-static void fft4(FFTComplex *z)
+static void TX_NAME(ff_tx_fft4_ns)(AVTXContext *s, void *dst,
                                   void *src, ptrdiff_t stride)
 {
+    FFTComplex *z = dst;
     FFTSample t1, t2, t3, t4, t5, t6, t7, t8;
 
     BF(t3, t1, z[0].re, z[1].re);
@@ -524,11 +589,14 @@ static void fft4(FFTComplex *z)
     BF(z[2].im, z[0].im, t2, t5);
 }
 
-static void fft8(FFTComplex *z)
+static void TX_NAME(ff_tx_fft8_ns)(AVTXContext *s, void *dst,
+                                   void *src, ptrdiff_t stride)
 {
+    FFTComplex *z = dst;
     FFTSample t1, t2, t3, t4, t5, t6, r0, i0, r1, i1;
+    const FFTSample cos = TX_TAB(ff_tx_tab_8)[1];
 
-    fft4(z);
+    TX_NAME(ff_tx_fft4_ns)(s, z, z, stride);
 
     BF(t1, z[5].re, z[4].re, -z[5].re);
     BF(t2, z[5].im, z[4].im, -z[5].im);
     BF(t6, z[7].im, z[6].im, -z[7].im);
 
     BUTTERFLIES(z[0], z[2], z[4], z[6]);
-    TRANSFORM(z[1], z[3], z[5], z[7], RESCALE(M_SQRT1_2), RESCALE(M_SQRT1_2));
+    TRANSFORM(z[1], z[3], z[5], z[7], cos, cos);
 }
 
-static void fft16(FFTComplex *z)
+static void TX_NAME(ff_tx_fft16_ns)(AVTXContext *s, void *dst,
+                                    void *src, ptrdiff_t stride)
 {
+    FFTComplex *z = dst;
+    const FFTSample *cos = TX_TAB(ff_tx_tab_16);
+
     FFTSample t1, t2, t3, t4, t5, t6, r0, i0, r1, i1;
-    FFTSample cos_16_1 = TX_NAME(ff_cos_16)[1];
-    FFTSample cos_16_2 = TX_NAME(ff_cos_16)[2];
-    FFTSample cos_16_3 = TX_NAME(ff_cos_16)[3];
+    FFTSample cos_16_1 = cos[1];
+    FFTSample cos_16_2 = cos[2];
+    FFTSample cos_16_3 = cos[3];
 
-    fft8(z + 0);
-    fft4(z + 8);
-    fft4(z + 12);
+    TX_NAME(ff_tx_fft8_ns)(s, z +  0, z +  0, stride);
+    TX_NAME(ff_tx_fft4_ns)(s, z +  8, z +  8, stride);
+    TX_NAME(ff_tx_fft4_ns)(s, z + 12, z + 12, stride);
 
     t1 = z[ 8].re;
     t2 = z[ 8].im;
@@ -561,90 +633,125 @@ static void fft16(FFTComplex *z)
     TRANSFORM(z[ 3], z[ 7], z[11], z[15], cos_16_3, cos_16_1);
 }
 
-DECL_FFT(32,16,8)
-DECL_FFT(64,32,16)
-DECL_FFT(128,64,32)
-DECL_FFT(256,128,64)
-DECL_FFT(512,256,128)
-DECL_FFT(1024,512,256)
-DECL_FFT(2048,1024,512)
-DECL_FFT(4096,2048,1024)
-DECL_FFT(8192,4096,2048)
-DECL_FFT(16384,8192,4096)
-DECL_FFT(32768,16384,8192)
-DECL_FFT(65536,32768,16384)
-DECL_FFT(131072,65536,32768)
-
-static void (* const fft_dispatch[])(FFTComplex*) = {
-    NULL, fft2, fft4, fft8, fft16, fft32, fft64, fft128, fft256, fft512,
-    fft1024, fft2048, fft4096, fft8192, fft16384, fft32768, fft65536, fft131072
-};
+DECL_SR_CODELET_DEF(2)
+DECL_SR_CODELET_DEF(4)
+DECL_SR_CODELET_DEF(8)
+DECL_SR_CODELET_DEF(16)
+DECL_SR_CODELET(32,16,8)
+DECL_SR_CODELET(64,32,16)
+DECL_SR_CODELET(128,64,32)
+DECL_SR_CODELET(256,128,64)
+DECL_SR_CODELET(512,256,128)
+DECL_SR_CODELET(1024,512,256)
+DECL_SR_CODELET(2048,1024,512)
+DECL_SR_CODELET(4096,2048,1024)
+DECL_SR_CODELET(8192,4096,2048)
+DECL_SR_CODELET(16384,8192,4096)
+DECL_SR_CODELET(32768,16384,8192)
+DECL_SR_CODELET(65536,32768,16384)
+DECL_SR_CODELET(131072,65536,32768)
 
-#define DECL_COMP_FFT(N)                                                       \
-static void compound_fft_##N##xM(AVTXContext *s, void *_out,                   \
-                                 void *_in, ptrdiff_t stride)                  \
-{                                                                              \
-    const int m = s->m, *in_map = s->pfatab, *out_map = in_map + N*m;          \
-    FFTComplex *in = _in;                                                      \
-    FFTComplex *out = _out;                                                    \
-    FFTComplex fft##N##in[N];                                                  \
-    void (*fftp)(FFTComplex *z) = fft_dispatch[av_log2(m)];                    \
-                                                                               \
-    for (int i = 0; i < m; i++) {                                              \
-        for (int j = 0; j < N; j++)                                            \
-            fft##N##in[j] = in[in_map[i*N + j]];                               \
-        fft##N(s->tmp + s->revtab_c[i], fft##N##in, m);                        \
-    }                                                                          \
-                                                                               \
-    for (int i = 0; i < N; i++)                                                \
-        fftp(s->tmp + m*i);                                                    \
-                                                                               \
-    for (int i = 0; i < N*m; i++)                                              \
-        out[i] = s->tmp[out_map[i]];                                           \
-}
-
-DECL_COMP_FFT(3)
-DECL_COMP_FFT(5)
-DECL_COMP_FFT(7)
-DECL_COMP_FFT(9)
-DECL_COMP_FFT(15)
+static void TX_NAME(ff_tx_fft_sr)(AVTXContext *s, void *_dst,
+                                  void *_src, ptrdiff_t stride)
+{
+    FFTComplex *src = _src;
+    FFTComplex *dst = _dst;
+    int *map = s->sub[0].map;
+    int len = s->len;
+
+    /* Compilers can't vectorize this anyway without assuming AVX2, which they
+     * generally don't, at least without -march=native -mtune=native */
+    for (int i = 0; i < len; i++)
+        dst[i] = src[map[i]];
+
+    s->fn[0](&s->sub[0], dst, dst, stride);
+}
 
-static void split_radix_fft(AVTXContext *s, void *_out, void *_in,
-                            ptrdiff_t stride)
+static void TX_NAME(ff_tx_fft_sr_inplace)(AVTXContext *s, void *_dst,
+                                          void *_src, ptrdiff_t stride)
 {
-    FFTComplex *in = _in;
-    FFTComplex *out = _out;
-    int m = s->m, mb = av_log2(m);
+    FFTComplex *dst = _dst;
+    FFTComplex tmp;
+    const int *map = s->sub->map;
+    const int *inplace_idx = s->map;
+    int src_idx, dst_idx;
+
+    src_idx = *inplace_idx++;
+    do {
+        tmp = dst[src_idx];
+        dst_idx = map[src_idx];
+        do {
+            FFSWAP(FFTComplex, tmp, dst[dst_idx]);
+            dst_idx = map[dst_idx];
+        } while (dst_idx != src_idx); /* Can be > as well, but is less predictable */
+        dst[dst_idx] = tmp;
+    } while ((src_idx = *inplace_idx++));
 
-    if (s->flags & AV_TX_INPLACE) {
-        FFTComplex tmp;
-        int src, dst, *inplace_idx = s->inplace_idx;
-
-        src = *inplace_idx++;
-
-        do {
-            tmp = out[src];
-            dst = s->revtab_c[src];
-            do {
-                FFSWAP(FFTComplex, tmp, out[dst]);
-                dst = s->revtab_c[dst];
-            } while (dst != src); /* Can be > as well, but is less predictable */
-            out[dst] = tmp;
-        } while ((src = *inplace_idx++));
-    } else {
-        for (int i = 0; i < m; i++)
-            out[i] = in[s->revtab_c[i]];
-    }
-
-    fft_dispatch[mb](out);
+    s->fn[0](&s->sub[0], dst, dst, stride);
 }
 
+static av_cold int TX_NAME(ff_tx_fft_sr_init)(AVTXContext *s,
+                                              const FFTXCodelet *cd,
+                                              uint64_t flags,
+                                              FFTXCodeletOptions *opts,
+                                              int len, int inv,
+                                              const void *scale)
+{
+    int ret;
+    FFTXCodeletOptions sub_opts = { 0 };
+
+    if (flags & AV_TX_INPLACE) {
+        if ((ret = ff_tx_gen_ptwo_inplace_revtab_idx(s)))
+            return ret;
+        sub_opts.invert_lookup = 0;
+    } else {
+        /* For a straightforward lookup, it's faster to do it inverted
+         * (gather, rather than scatter). */
+        sub_opts.invert_lookup = 1;
+    }
+
+    flags &= ~FF_TX_OUT_OF_PLACE; /* We want the subtransform to be */
+    flags |=  AV_TX_INPLACE;      /* in-place */
+    flags |=  FF_TX_PRESHUFFLE;   /* This function handles the permute step */
+
+    if ((ret = ff_tx_init_subtx(s, TX_TYPE(FFT), flags, &sub_opts, len, inv, scale)))
+        return ret;
+
+    return 0;
+}
+
+const FFTXCodelet TX_NAME(ff_tx_fft_sr_def) = {
+    .name       = TX_NAME_STR("fft_sr"),
+    .function   = TX_NAME(ff_tx_fft_sr),
+    .type       = TX_TYPE(FFT),
+    .flags      = AV_TX_UNALIGNED | FF_TX_OUT_OF_PLACE,
+    .factors[0] = 2,
+    .min_len    = 2,
+    .max_len    = TX_LEN_UNLIMITED,
+    .init       = TX_NAME(ff_tx_fft_sr_init),
+    .cpu_flags  = FF_TX_CPU_FLAGS_ALL,
+    .prio       = FF_TX_PRIO_BASE,
+};
+
+const FFTXCodelet TX_NAME(ff_tx_fft_sr_inplace_def) = {
+    .name       = TX_NAME_STR("fft_sr_inplace"),
+    .function   = TX_NAME(ff_tx_fft_sr_inplace),
+    .type       = TX_TYPE(FFT),
+    .flags      = AV_TX_UNALIGNED | AV_TX_INPLACE,
+    .factors[0] = 2,
+    .min_len    = 2,
+    .max_len    = TX_LEN_UNLIMITED,
+    .init       = TX_NAME(ff_tx_fft_sr_init),
+    .cpu_flags  = FF_TX_CPU_FLAGS_ALL,
+    .prio       = FF_TX_PRIO_BASE,
+};
+
-static void naive_fft(AVTXContext *s, void *_out, void *_in,
-                      ptrdiff_t stride)
+static void TX_NAME(ff_tx_fft_naive)(AVTXContext *s, void *_dst, void *_src,
+                                     ptrdiff_t stride)
 {
-    FFTComplex *in = _in;
-    FFTComplex *out = _out;
-    const int n = s->n;
+    FFTComplex *src = _src;
+    FFTComplex *dst = _dst;
+    const int n = s->len;
     double phase = s->inv ? 2.0*M_PI/n : -2.0*M_PI/n;
 
     for(int i = 0; i < n; i++) {
@@ -656,164 +763,218 @@ static void naive_fft(AVTXContext *s, void *_out, void *_in,
                 RESCALE(sin(factor)),
             };
             FFTComplex res;
-            CMUL3(res, in[j], mult);
+            CMUL3(res, src[j], mult);
             tmp.re += res.re;
             tmp.im += res.im;
         }
-        out[i] = tmp;
+        dst[i] = tmp;
     }
 }
 
+const FFTXCodelet TX_NAME(ff_tx_fft_naive_def) = {
+    .name       = TX_NAME_STR("fft_naive"),
+    .function   = TX_NAME(ff_tx_fft_naive),
+    .type       = TX_TYPE(FFT),
+    .flags      = AV_TX_UNALIGNED | FF_TX_OUT_OF_PLACE,
+    .factors[0] = TX_FACTOR_ANY,
+    .min_len    = 2,
+    .max_len    = TX_LEN_UNLIMITED,
+    .init       = NULL,
+    .cpu_flags  = FF_TX_CPU_FLAGS_ALL,
+    .prio       = FF_TX_PRIO_MIN,
+};
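For reference, the naive codelet evaluates the textbook DFT directly, with the sign of the exponent flipped for the inverse (the phase variable above):

    X[k] = sum_{j = 0 .. n-1} x[j] * exp(+-i * 2*pi * j*k / n),   k = 0, ..., n-1

where the forward transform takes the minus sign. This costs O(n^2), hence FF_TX_PRIO_MIN: it is only ever picked as a last resort, but it supports any length.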
 
+static av_cold int TX_NAME(ff_tx_fft_pfa_init)(AVTXContext *s,
+                                               const FFTXCodelet *cd,
+                                               uint64_t flags,
+                                               FFTXCodeletOptions *opts,
+                                               int len, int inv,
+                                               const void *scale)
+{
+    int ret;
+    int sub_len = len / cd->factors[0];
+    FFTXCodeletOptions sub_opts = { .invert_lookup = 0 };
+
+    flags &= ~FF_TX_OUT_OF_PLACE; /* We want the subtransform to be */
+    flags |=  AV_TX_INPLACE;      /* in-place */
+    flags |=  FF_TX_PRESHUFFLE;   /* This function handles the permute step */
+
+    if ((ret = ff_tx_init_subtx(s, TX_TYPE(FFT), flags, &sub_opts,
+                                sub_len, inv, scale)))
+        return ret;
+
+    if ((ret = ff_tx_gen_compound_mapping(s, cd->factors[0], sub_len)))
+        return ret;
+
+    if (!(s->tmp = av_malloc(len*sizeof(*s->tmp))))
+        return AVERROR(ENOMEM);
+
+    TX_TAB(ff_tx_init_tabs)(len / sub_len);
+
+    return 0;
+}
+
-#define DECL_COMP_IMDCT(N)                                                     \
-static void compound_imdct_##N##xM(AVTXContext *s, void *_dst, void *_src,     \
-                                   ptrdiff_t stride)                           \
-{                                                                              \
-    FFTComplex fft##N##in[N];                                                  \
-    FFTComplex *z = _dst, *exp = s->exptab;                                    \
-    const int m = s->m, len8 = N*m >> 1;                                       \
-    const int *in_map = s->pfatab, *out_map = in_map + N*m;                    \
-    const FFTSample *src = _src, *in1, *in2;                                   \
-    void (*fftp)(FFTComplex *) = fft_dispatch[av_log2(m)];                     \
-                                                                               \
-    stride /= sizeof(*src); /* To convert it from bytes */                     \
-    in1 = src;                                                                 \
-    in2 = src + ((N*m*2) - 1) * stride;                                        \
-                                                                               \
-    for (int i = 0; i < m; i++) {                                              \
-        for (int j = 0; j < N; j++) {                                          \
-            const int k = in_map[i*N + j];                                     \
-            FFTComplex tmp = { in2[-k*stride], in1[k*stride] };                \
-            CMUL3(fft##N##in[j], tmp, exp[k >> 1]);                            \
-        }                                                                      \
-        fft##N(s->tmp + s->revtab_c[i], fft##N##in, m);                        \
-    }                                                                          \
-                                                                               \
-    for (int i = 0; i < N; i++)                                                \
-        fftp(s->tmp + m*i);                                                    \
-                                                                               \
-    for (int i = 0; i < len8; i++) {                                           \
-        const int i0 = len8 + i, i1 = len8 - i - 1;                            \
-        const int s0 = out_map[i0], s1 = out_map[i1];                          \
-        FFTComplex src1 = { s->tmp[s1].im, s->tmp[s1].re };                    \
-        FFTComplex src0 = { s->tmp[s0].im, s->tmp[s0].re };                    \
-                                                                               \
-        CMUL(z[i1].re, z[i0].im, src1.re, src1.im, exp[i1].im, exp[i1].re);    \
-        CMUL(z[i0].re, z[i1].im, src0.re, src0.im, exp[i0].im, exp[i0].re);    \
-    }                                                                          \
-}
-
-DECL_COMP_IMDCT(3)
-DECL_COMP_IMDCT(5)
-DECL_COMP_IMDCT(7)
-DECL_COMP_IMDCT(9)
-DECL_COMP_IMDCT(15)
+#define DECL_COMP_FFT(N)                                                    \
+static void TX_NAME(ff_tx_fft_pfa_##N##xM)(AVTXContext *s, void *_out,      \
+                                           void *_in, ptrdiff_t stride)     \
+{                                                                           \
+    const int m = s->sub->len;                                              \
+    const int *in_map = s->map, *out_map = in_map + s->len;                 \
+    const int *sub_map = s->sub->map;                                       \
+    FFTComplex *in = _in;                                                   \
+    FFTComplex *out = _out;                                                 \
+    FFTComplex fft##N##in[N];                                               \
+                                                                            \
+    for (int i = 0; i < m; i++) {                                           \
+        for (int j = 0; j < N; j++)                                         \
+            fft##N##in[j] = in[in_map[i*N + j]];                            \
+        fft##N(s->tmp + sub_map[i], fft##N##in, m);                         \
+    }                                                                       \
+                                                                            \
+    for (int i = 0; i < N; i++)                                             \
+        s->fn[0](&s->sub[0], s->tmp + m*i, s->tmp + m*i, sizeof(FFTComplex)); \
+                                                                            \
+    for (int i = 0; i < N*m; i++)                                           \
+        out[i] = s->tmp[out_map[i]];                                        \
+}                                                                           \
+                                                                            \
+const FFTXCodelet TX_NAME(ff_tx_fft_pfa_##N##xM_def) = {                    \
+    .name       = TX_NAME_STR("fft_pfa_" #N "xM"),                          \
+    .function   = TX_NAME(ff_tx_fft_pfa_##N##xM),                           \
+    .type       = TX_TYPE(FFT),                                             \
+    .flags      = AV_TX_UNALIGNED | FF_TX_OUT_OF_PLACE,                     \
+    .factors    = { N, TX_FACTOR_ANY },                                     \
+    .min_len    = N*2,                                                      \
+    .max_len    = TX_LEN_UNLIMITED,                                         \
+    .init       = TX_NAME(ff_tx_fft_pfa_init),                              \
+    .cpu_flags  = FF_TX_CPU_FLAGS_ALL,                                      \
+    .prio       = FF_TX_PRIO_BASE,                                          \
+};
 
+DECL_COMP_FFT(3)
+DECL_COMP_FFT(5)
+DECL_COMP_FFT(7)
+DECL_COMP_FFT(9)
+DECL_COMP_FFT(15)
 
-#define DECL_COMP_MDCT(N)                                                   \
-static void compound_mdct_##N##xM(AVTXContext *s, void *_dst, void *_src,   \
-                                  ptrdiff_t stride)                         \
-{                                                                           \
-    FFTSample *src = _src, *dst = _dst;                                     \
-    FFTComplex *exp = s->exptab, tmp, fft##N##in[N];                        \
-    const int m = s->m, len4 = N*m, len3 = len4 * 3, len8 = len4 >> 1;      \
-    const int *in_map = s->pfatab, *out_map = in_map + N*m;                 \
-    void (*fftp)(FFTComplex *) = fft_dispatch[av_log2(m)];                  \
-                                                                            \
-    stride /= sizeof(*dst);                                                 \
-                                                                            \
-    for (int i = 0; i < m; i++) { /* Folding and pre-reindexing */          \
-        for (int j = 0; j < N; j++) {                                       \
-            const int k = in_map[i*N + j];                                  \
-            if (k < len4) {                                                 \
-                tmp.re = FOLD(-src[ len4 + k],  src[1*len4 - 1 - k]);       \
-                tmp.im = FOLD(-src[ len3 + k], -src[1*len3 - 1 - k]);       \
-            } else {                                                        \
-                tmp.re = FOLD(-src[ len4 + k], -src[5*len4 - 1 - k]);       \
-                tmp.im = FOLD( src[-len4 + k], -src[1*len3 - 1 - k]);       \
-            }                                                               \
-            CMUL(fft##N##in[j].im, fft##N##in[j].re, tmp.re, tmp.im,        \
-                 exp[k >> 1].re, exp[k >> 1].im);                           \
-        }                                                                   \
-        fft##N(s->tmp + s->revtab_c[i], fft##N##in, m);                     \
-    }                                                                       \
-                                                                            \
-    for (int i = 0; i < N; i++)                                             \
-        fftp(s->tmp + m*i);                                                 \
-                                                                            \
-    for (int i = 0; i < len8; i++) {                                        \
-        const int i0 = len8 + i, i1 = len8 - i - 1;                         \
-        const int s0 = out_map[i0], s1 = out_map[i1];                       \
-        FFTComplex src1 = { s->tmp[s1].re, s->tmp[s1].im };                 \
-        FFTComplex src0 = { s->tmp[s0].re, s->tmp[s0].im };                 \
-                                                                            \
-        CMUL(dst[2*i1*stride + stride], dst[2*i0*stride], src0.re, src0.im, \
-             exp[i0].im, exp[i0].re);                                       \
-        CMUL(dst[2*i0*stride + stride], dst[2*i1*stride], src1.re, src1.im, \
-             exp[i1].im, exp[i1].re);                                       \
-    }                                                                       \
-}
-
-DECL_COMP_MDCT(3)
-DECL_COMP_MDCT(5)
-DECL_COMP_MDCT(7)
-DECL_COMP_MDCT(9)
-DECL_COMP_MDCT(15)
+static void TX_NAME(ff_tx_mdct_naive_fwd)(AVTXContext *s, void *_dst,
+                                          void *_src, ptrdiff_t stride)
+{
+    FFTSample *src = _src;
+    FFTSample *dst = _dst;
+    double scale = s->scale_d;
+    int len = s->len;
+    const double phase = M_PI/(4.0*len);
+
+    stride /= sizeof(*dst);
+
+    for (int i = 0; i < len; i++) {
+        double sum = 0.0;
+        for (int j = 0; j < len*2; j++) {
+            int a = (2*j + 1 + len) * (2*i + 1);
+            sum += UNSCALE(src[j]) * cos(a * phase);
+        }
+        dst[i*stride] = RESCALE(sum*scale);
+    }
+}
-#define DECL_COMP_MDCT(N)                                                 \
-static void compound_mdct_##N##xM(AVTXContext *s, void *_dst, void *_src, \
-                                  ptrdiff_t stride)                       \
-{                                                                         \
-    FFTSample *src = _src, *dst = _dst;                                   \
-    FFTComplex *exp = s->exptab, tmp, fft##N##in[N];                      \
-    const int m = s->m, len4 = N*m, len3 = len4 * 3, len8 = len4 >> 1;    \
-    const int *in_map = s->pfatab, *out_map = in_map + N*m;               \
-    void (*fftp)(FFTComplex *) = fft_dispatch[av_log2(m)];                \
-                                                                          \
-    stride /= sizeof(*dst);                                               \
-                                                                          \
-    for (int i = 0; i < m; i++) { /* Folding and pre-reindexing */        \
-        for (int j = 0; j < N; j++) {                                     \
-            const int k = in_map[i*N + j];                                \
-            if (k < len4) {                                               \
-                tmp.re = FOLD(-src[ len4 + k],  src[1*len4 - 1 - k]);     \
-                tmp.im = FOLD(-src[ len3 + k], -src[1*len3 - 1 - k]);     \
-            } else {                                                      \
-                tmp.re = FOLD(-src[ len4 + k], -src[5*len4 - 1 - k]);     \
-                tmp.im = FOLD( src[-len4 + k], -src[1*len3 - 1 - k]);     \
-            }                                                             \
-            CMUL(fft##N##in[j].im, fft##N##in[j].re, tmp.re, tmp.im,      \
-                 exp[k >> 1].re, exp[k >> 1].im);                         \
-        }                                                                 \
-        fft##N(s->tmp + s->revtab_c[i], fft##N##in, m);                   \
-    }                                                                     \
-                                                                          \
-    for (int i = 0; i < N; i++)                                           \
-        fftp(s->tmp + m*i);                                               \
-                                                                          \
-    for (int i = 0; i < len8; i++) {                                      \
-        const int i0 = len8 + i, i1 = len8 - i - 1;                       \
-        const int s0 = out_map[i0], s1 = out_map[i1];                     \
-        FFTComplex src1 = { s->tmp[s1].re, s->tmp[s1].im };               \
-        FFTComplex src0 = { s->tmp[s0].re, s->tmp[s0].im };               \
-                                                                          \
-        CMUL(dst[2*i1*stride + stride], dst[2*i0*stride], src0.re, src0.im, \
-             exp[i0].im, exp[i0].re);                                     \
-        CMUL(dst[2*i0*stride + stride], dst[2*i1*stride], src1.re, src1.im, \
-             exp[i1].im, exp[i1].re);                                     \
-    }                                                                     \
-}
+static void TX_NAME(ff_tx_mdct_naive_fwd)(AVTXContext *s, void *_dst,
+                                          void *_src, ptrdiff_t stride)
+{
+    FFTSample *src = _src;
+    FFTSample *dst = _dst;
+    double scale = s->scale_d;
+    int len = s->len;
+    const double phase = M_PI/(4.0*len);

-DECL_COMP_MDCT(3)
-DECL_COMP_MDCT(5)
-DECL_COMP_MDCT(7)
-DECL_COMP_MDCT(9)
-DECL_COMP_MDCT(15)
+    stride /= sizeof(*dst);
+
+    for (int i = 0; i < len; i++) {
+        double sum = 0.0;
+        for (int j = 0; j < len*2; j++) {
+            int a = (2*j + 1 + len) * (2*i + 1);
+            sum += UNSCALE(src[j]) * cos(a * phase);
+        }
+        dst[i*stride] = RESCALE(sum*scale);
+    }
+}

-static void monolithic_imdct(AVTXContext *s, void *_dst, void *_src,
-                             ptrdiff_t stride)
+static void TX_NAME(ff_tx_mdct_naive_inv)(AVTXContext *s, void *_dst,
+                                          void *_src, ptrdiff_t stride)
 {
-    FFTComplex *z = _dst, *exp = s->exptab;
-    const int m = s->m, len8 = m >> 1;
-    const FFTSample *src = _src, *in1, *in2;
-    void (*fftp)(FFTComplex *) = fft_dispatch[av_log2(m)];
+    FFTSample *src = _src;
+    FFTSample *dst = _dst;
+    double scale = s->scale_d;
+    int len = s->len >> 1;
+    int len2 = len*2;
+    const double phase = M_PI/(4.0*len2);

     stride /= sizeof(*src);
-    in1 = src;
-    in2 = src + ((m*2) - 1) * stride;

-    for (int i = 0; i < m; i++) {
-        FFTComplex tmp = { in2[-2*i*stride], in1[2*i*stride] };
-        CMUL3(z[s->revtab_c[i]], tmp, exp[i]);
+    for (int i = 0; i < len; i++) {
+        double sum_d = 0.0;
+        double sum_u = 0.0;
+        double i_d = phase * (4*len - 2*i - 1);
+        double i_u = phase * (3*len2 + 2*i + 1);
+        for (int j = 0; j < len2; j++) {
+            double a = (2 * j + 1);
+            double a_d = cos(a * i_d);
+            double a_u = cos(a * i_u);
+            double val = UNSCALE(src[j*stride]);
+            sum_d += a_d * val;
+            sum_u += a_u * val;
+        }
+        dst[i +   0] = RESCALE( sum_d*scale);
+        dst[i + len] = RESCALE(-sum_u*scale);
     }
+}

-    fftp(z);
+static av_cold int TX_NAME(ff_tx_mdct_naive_init)(AVTXContext *s,
+                                                  const FFTXCodelet *cd,
+                                                  uint64_t flags,
+                                                  FFTXCodeletOptions *opts,
+                                                  int len, int inv,
+                                                  const void *scale)
+{
+    s->scale_d = *((SCALE_TYPE *)scale);
+    s->scale_f = s->scale_d;
+    return 0;
+}

-    for (int i = 0; i < len8; i++) {
-        const int i0 = len8 + i, i1 = len8 - i - 1;
-        FFTComplex src1 = { z[i1].im, z[i1].re };
-        FFTComplex src0 = { z[i0].im, z[i0].re };
+const FFTXCodelet TX_NAME(ff_tx_mdct_naive_fwd_def) = {
+    .name       = TX_NAME_STR("mdct_naive_fwd"),
+    .function   = TX_NAME(ff_tx_mdct_naive_fwd),
+    .type       = TX_TYPE(MDCT),
+    .flags      = AV_TX_UNALIGNED | FF_TX_OUT_OF_PLACE | FF_TX_FORWARD_ONLY,
+    .factors    = { 2, TX_FACTOR_ANY }, /* MDCTs need even number of coefficients/samples */
+    .min_len    = 2,
+    .max_len    = TX_LEN_UNLIMITED,
+    .init       = TX_NAME(ff_tx_mdct_naive_init),
+    .cpu_flags  = FF_TX_CPU_FLAGS_ALL,
+    .prio       = FF_TX_PRIO_MIN,
+};

TX_NAME_STR("mdct_naive_inv"), + .function =3D TX_NAME(ff_tx_mdct_naive_inv), + .type =3D TX_TYPE(MDCT), + .flags =3D AV_TX_UNALIGNED | FF_TX_OUT_OF_PLACE | FF_TX_INVERSE_O= NLY, + .factors =3D { 2, TX_FACTOR_ANY }, + .min_len =3D 2, + .max_len =3D TX_LEN_UNLIMITED, + .init =3D TX_NAME(ff_tx_mdct_naive_init), + .cpu_flags =3D FF_TX_CPU_FLAGS_ALL, + .prio =3D FF_TX_PRIO_MIN, +}; =20 -static void monolithic_mdct(AVTXContext *s, void *_dst, void *_src, - ptrdiff_t stride) +static void TX_NAME(ff_tx_mdct_sr_fwd)(AVTXContext *s, void *_dst, void *_= src, + ptrdiff_t stride) { FFTSample *src =3D _src, *dst =3D _dst; - FFTComplex *exp =3D s->exptab, tmp, *z =3D _dst; - const int m =3D s->m, len4 =3D m, len3 =3D len4 * 3, len8 =3D len4 >> = 1; - void (*fftp)(FFTComplex *) =3D fft_dispatch[av_log2(m)]; + FFTComplex *exp =3D s->exp, tmp, *z =3D _dst; + const int len2 =3D s->len >> 1; + const int len4 =3D s->len >> 2; + const int len3 =3D len2 * 3; + const int *sub_map =3D s->sub->map; =20 stride /=3D sizeof(*dst); =20 - for (int i =3D 0; i < m; i++) { /* Folding and pre-reindexing */ + for (int i =3D 0; i < len2; i++) { /* Folding and pre-reindexing */ const int k =3D 2*i; - if (k < len4) { - tmp.re =3D FOLD(-src[ len4 + k], src[1*len4 - 1 - k]); + const int idx =3D sub_map[i]; + if (k < len2) { + tmp.re =3D FOLD(-src[ len2 + k], src[1*len2 - 1 - k]); tmp.im =3D FOLD(-src[ len3 + k], -src[1*len3 - 1 - k]); } else { - tmp.re =3D FOLD(-src[ len4 + k], -src[5*len4 - 1 - k]); - tmp.im =3D FOLD( src[-len4 + k], -src[1*len3 - 1 - k]); + tmp.re =3D FOLD(-src[ len2 + k], -src[5*len2 - 1 - k]); + tmp.im =3D FOLD( src[-len2 + k], -src[1*len3 - 1 - k]); } - CMUL(z[s->revtab_c[i]].im, z[s->revtab_c[i]].re, tmp.re, tmp.im, - exp[i].re, exp[i].im); + CMUL(z[idx].im, z[idx].re, tmp.re, tmp.im, exp[i].re, exp[i].im); } =20 - fftp(z); + s->fn[0](&s->sub[0], z, z, sizeof(FFTComplex)); =20 - for (int i =3D 0; i < len8; i++) { - const int i0 =3D len8 + i, i1 =3D len8 - i - 1; + for (int i =3D 0; i < len4; i++) { + const int i0 =3D len4 + i, i1 =3D len4 - i - 1; FFTComplex src1 =3D { z[i1].re, z[i1].im }; FFTComplex src0 =3D { z[i0].re, z[i0].im }; =20 @@ -824,66 +985,117 @@ static void monolithic_mdct(AVTXContext *s, void *_d= st, void *_src, } } =20 -static void naive_imdct(AVTXContext *s, void *_dst, void *_src, - ptrdiff_t stride) +static void TX_NAME(ff_tx_mdct_sr_inv)(AVTXContext *s, void *_dst, void *_= src, + ptrdiff_t stride) { - int len =3D s->n; - int len2 =3D len*2; - FFTSample *src =3D _src; - FFTSample *dst =3D _dst; - double scale =3D s->scale; - const double phase =3D M_PI/(4.0*len2); + FFTComplex *z =3D _dst, *exp =3D s->exp; + const FFTSample *src =3D _src, *in1, *in2; + const int len2 =3D s->len >> 1; + const int len4 =3D s->len >> 2; + const int *sub_map =3D s->sub->map; =20 stride /=3D sizeof(*src); + in1 =3D src; + in2 =3D src + ((len2*2) - 1) * stride; =20 - for (int i =3D 0; i < len; i++) { - double sum_d =3D 0.0; - double sum_u =3D 0.0; - double i_d =3D phase * (4*len - 2*i - 1); - double i_u =3D phase * (3*len2 + 2*i + 1); - for (int j =3D 0; j < len2; j++) { - double a =3D (2 * j + 1); - double a_d =3D cos(a * i_d); - double a_u =3D cos(a * i_u); - double val =3D UNSCALE(src[j*stride]); - sum_d +=3D a_d * val; - sum_u +=3D a_u * val; - } - dst[i + 0] =3D RESCALE( sum_d*scale); - dst[i + len] =3D RESCALE(-sum_u*scale); + for (int i =3D 0; i < len2; i++) { + FFTComplex tmp =3D { in2[-2*i*stride], in1[2*i*stride] }; + CMUL3(z[sub_map[i]], tmp, exp[i]); + } + + s->fn[0](&s->sub[0], z, z, 
-static void monolithic_mdct(AVTXContext *s, void *_dst, void *_src,
-                            ptrdiff_t stride)
+static void TX_NAME(ff_tx_mdct_sr_fwd)(AVTXContext *s, void *_dst, void *_src,
+                                       ptrdiff_t stride)
 {
     FFTSample *src = _src, *dst = _dst;
-    FFTComplex *exp = s->exptab, tmp, *z = _dst;
-    const int m = s->m, len4 = m, len3 = len4 * 3, len8 = len4 >> 1;
-    void (*fftp)(FFTComplex *) = fft_dispatch[av_log2(m)];
+    FFTComplex *exp = s->exp, tmp, *z = _dst;
+    const int len2 = s->len >> 1;
+    const int len4 = s->len >> 2;
+    const int len3 = len2 * 3;
+    const int *sub_map = s->sub->map;

     stride /= sizeof(*dst);

-    for (int i = 0; i < m; i++) { /* Folding and pre-reindexing */
+    for (int i = 0; i < len2; i++) { /* Folding and pre-reindexing */
         const int k = 2*i;
-        if (k < len4) {
-            tmp.re = FOLD(-src[ len4 + k],  src[1*len4 - 1 - k]);
+        const int idx = sub_map[i];
+        if (k < len2) {
+            tmp.re = FOLD(-src[ len2 + k],  src[1*len2 - 1 - k]);
             tmp.im = FOLD(-src[ len3 + k], -src[1*len3 - 1 - k]);
         } else {
-            tmp.re = FOLD(-src[ len4 + k], -src[5*len4 - 1 - k]);
-            tmp.im = FOLD( src[-len4 + k], -src[1*len3 - 1 - k]);
+            tmp.re = FOLD(-src[ len2 + k], -src[5*len2 - 1 - k]);
+            tmp.im = FOLD( src[-len2 + k], -src[1*len3 - 1 - k]);
         }
-        CMUL(z[s->revtab_c[i]].im, z[s->revtab_c[i]].re, tmp.re, tmp.im,
-             exp[i].re, exp[i].im);
+        CMUL(z[idx].im, z[idx].re, tmp.re, tmp.im, exp[i].re, exp[i].im);
     }

-    fftp(z);
+    s->fn[0](&s->sub[0], z, z, sizeof(FFTComplex));

-    for (int i = 0; i < len8; i++) {
-        const int i0 = len8 + i, i1 = len8 - i - 1;
+    for (int i = 0; i < len4; i++) {
+        const int i0 = len4 + i, i1 = len4 - i - 1;
         FFTComplex src1 = { z[i1].re, z[i1].im };
         FFTComplex src0 = { z[i0].re, z[i0].im };

@@ -824,66 +985,117 @@ static void monolithic_mdct(AVTXContext *s, void *_dst, void *_src,
     }
 }

-static void naive_imdct(AVTXContext *s, void *_dst, void *_src,
-                        ptrdiff_t stride)
+static void TX_NAME(ff_tx_mdct_sr_inv)(AVTXContext *s, void *_dst, void *_src,
+                                       ptrdiff_t stride)
 {
-    int len = s->n;
-    int len2 = len*2;
-    FFTSample *src = _src;
-    FFTSample *dst = _dst;
-    double scale = s->scale;
-    const double phase = M_PI/(4.0*len2);
+    FFTComplex *z = _dst, *exp = s->exp;
+    const FFTSample *src = _src, *in1, *in2;
+    const int len2 = s->len >> 1;
+    const int len4 = s->len >> 2;
+    const int *sub_map = s->sub->map;

     stride /= sizeof(*src);
+    in1 = src;
+    in2 = src + ((len2*2) - 1) * stride;

-    for (int i = 0; i < len; i++) {
-        double sum_d = 0.0;
-        double sum_u = 0.0;
-        double i_d = phase * (4*len - 2*i - 1);
-        double i_u = phase * (3*len2 + 2*i + 1);
-        for (int j = 0; j < len2; j++) {
-            double a = (2 * j + 1);
-            double a_d = cos(a * i_d);
-            double a_u = cos(a * i_u);
-            double val = UNSCALE(src[j*stride]);
-            sum_d += a_d * val;
-            sum_u += a_u * val;
-        }
-        dst[i +   0] = RESCALE( sum_d*scale);
-        dst[i + len] = RESCALE(-sum_u*scale);
+    for (int i = 0; i < len2; i++) {
+        FFTComplex tmp = { in2[-2*i*stride], in1[2*i*stride] };
+        CMUL3(z[sub_map[i]], tmp, exp[i]);
+    }
+
+    s->fn[0](&s->sub[0], z, z, sizeof(FFTComplex));
+
+    for (int i = 0; i < len4; i++) {
+        const int i0 = len4 + i, i1 = len4 - i - 1;
+        FFTComplex src1 = { z[i1].im, z[i1].re };
+        FFTComplex src0 = { z[i0].im, z[i0].re };
+
+        CMUL(z[i1].re, z[i0].im, src1.re, src1.im, exp[i1].im, exp[i1].re);
+        CMUL(z[i0].re, z[i1].im, src0.re, src0.im, exp[i0].im, exp[i0].re);
     }
 }

-static void naive_mdct(AVTXContext *s, void *_dst, void *_src,
-                       ptrdiff_t stride)
+static int TX_NAME(ff_tx_mdct_gen_exp)(AVTXContext *s)
 {
-    int len = s->n*2;
-    FFTSample *src = _src;
-    FFTSample *dst = _dst;
-    double scale = s->scale;
-    const double phase = M_PI/(4.0*len);
+    int len4 = s->len >> 1;
+    double scale = s->scale_d;
+    const double theta = (scale < 0 ? len4 : 0) + 1.0/8.0;

-    stride /= sizeof(*dst);
+    if (!(s->exp = av_malloc_array(len4, sizeof(*s->exp))))
+        return AVERROR(ENOMEM);

-    for (int i = 0; i < len; i++) {
-        double sum = 0.0;
-        for (int j = 0; j < len*2; j++) {
-            int a = (2*j + 1 + len) * (2*i + 1);
-            sum += UNSCALE(src[j]) * cos(a * phase);
-        }
-        dst[i*stride] = RESCALE(sum*scale);
+    scale = sqrt(fabs(scale));
+    for (int i = 0; i < len4; i++) {
+        const double alpha = M_PI_2 * (i + theta) / len4;
+        s->exp[i].re = RESCALE(cos(alpha) * scale);
+        s->exp[i].im = RESCALE(sin(alpha) * scale);
     }
+
+    return 0;
 }

-static void full_imdct_wrapper_fn(AVTXContext *s, void *_dst, void *_src,
-                                  ptrdiff_t stride)
+static av_cold int TX_NAME(ff_tx_mdct_sr_init)(AVTXContext *s,
+                                               const FFTXCodelet *cd,
+                                               uint64_t flags,
+                                               FFTXCodeletOptions *opts,
+                                               int len, int inv,
+                                               const void *scale)
+{
+    int ret;
+    FFTXCodeletOptions sub_opts = { .invert_lookup = 0 };
+
+    s->scale_d = *((SCALE_TYPE *)scale);
+    s->scale_f = s->scale_d;
+
+    flags &= ~FF_TX_OUT_OF_PLACE; /* We want the subtransform to be */
+    flags |=  AV_TX_INPLACE;      /* in-place */
+    flags |=  FF_TX_PRESHUFFLE;   /* This function handles the permute step */
+
+    if ((ret = ff_tx_init_subtx(s, TX_TYPE(FFT), flags, &sub_opts, len >> 1,
+                                inv, scale)))
+        return ret;
+
+    if ((ret = TX_NAME(ff_tx_mdct_gen_exp)(s)))
+        return ret;
+
+    return 0;
+}
+
+const FFTXCodelet TX_NAME(ff_tx_mdct_sr_fwd_def) = {
+    .name       = TX_NAME_STR("mdct_sr_fwd"),
+    .function   = TX_NAME(ff_tx_mdct_sr_fwd),
+    .type       = TX_TYPE(MDCT),
+    .flags      = AV_TX_UNALIGNED | FF_TX_OUT_OF_PLACE | FF_TX_FORWARD_ONLY,
+    .factors[0] = 2,
+    .min_len    = 2,
+    .max_len    = TX_LEN_UNLIMITED,
+    .init       = TX_NAME(ff_tx_mdct_sr_init),
+    .cpu_flags  = FF_TX_CPU_FLAGS_ALL,
+    .prio       = FF_TX_PRIO_BASE,
+};
+
+const FFTXCodelet TX_NAME(ff_tx_mdct_sr_inv_def) = {
+    .name       = TX_NAME_STR("mdct_sr_inv"),
+    .function   = TX_NAME(ff_tx_mdct_sr_inv),
+    .type       = TX_TYPE(MDCT),
+    .flags      = AV_TX_UNALIGNED | FF_TX_OUT_OF_PLACE | FF_TX_INVERSE_ONLY,
+    .factors[0] = 2,
+    .min_len    = 2,
+    .max_len    = TX_LEN_UNLIMITED,
+    .init       = TX_NAME(ff_tx_mdct_sr_init),
+    .cpu_flags  = FF_TX_CPU_FLAGS_ALL,
+    .prio       = FF_TX_PRIO_BASE,
+};
+
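The s->fn[0](&s->sub[0], ...) calls above are the entire composition mechanism: a parent codelet only ever sees its subtransform as a function pointer plus a subcontext, which is what allows any step of the tree to be swapped out for assembly or a custom implementation. A minimal conceptual sketch of the convention (not the real AVTXContext layout, which carries many more fields):

    #include <stddef.h>

    /* Conceptual sketch of the codelet tree call convention. */
    typedef struct Node {
        struct Node *sub;                 /* child transform contexts */
        void (*fn[4])(struct Node *s, void *dst, void *src, ptrdiff_t stride);
        int len;
    } Node;

    static void parent_transform(Node *s, void *dst, void *src, ptrdiff_t stride)
    {
        /* ... pre-permute/fold the input into dst ... */
        s->fn[0](&s->sub[0], dst, dst, sizeof(float[2])); /* child runs in-place */
        /* ... post-process the child's output in dst ... */
    }
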
+static void TX_NAME(ff_tx_mdct_inv_full)(AVTXContext *s, void *_dst,
+                                         void *_src, ptrdiff_t stride)
 {
-    int len = s->m*s->n*4;
+    int len = s->len << 1;
     int len2 = len >> 1;
     int len4 = len >> 2;
     FFTSample *dst = _dst;

-    s->top_tx(s, dst + len4, _src, stride);
+    s->fn[0](&s->sub[0], dst + len4, _src, stride);

     stride /= sizeof(*dst);

@@ -893,132 +1105,246 @@ static void full_imdct_wrapper_fn(AVTXContext *s, void *_dst, void *_src,
     }
 }

-static int gen_mdct_exptab(AVTXContext *s, int len4, double scale)
+static av_cold int TX_NAME(ff_tx_mdct_inv_full_init)(AVTXContext *s,
+                                                     const FFTXCodelet *cd,
+                                                     uint64_t flags,
+                                                     FFTXCodeletOptions *opts,
+                                                     int len, int inv,
+                                                     const void *scale)
 {
-    const double theta = (scale < 0 ? len4 : 0) + 1.0/8.0;
+    int ret;

-    if (!(s->exptab = av_malloc_array(len4, sizeof(*s->exptab))))
-        return AVERROR(ENOMEM);
+    s->scale_d = *((SCALE_TYPE *)scale);
+    s->scale_f = s->scale_d;

-    scale = sqrt(fabs(scale));
-    for (int i = 0; i < len4; i++) {
-        const double alpha = M_PI_2 * (i + theta) / len4;
-        s->exptab[i].re = RESCALE(cos(alpha) * scale);
-        s->exptab[i].im = RESCALE(sin(alpha) * scale);
-    }
+    flags &= ~AV_TX_FULL_IMDCT;
+
+    if ((ret = ff_tx_init_subtx(s, TX_TYPE(MDCT), flags, NULL, len, 1, scale)))
+        return ret;

     return 0;
 }

-int TX_NAME(ff_tx_init_mdct_fft)(AVTXContext *s, av_tx_fn *tx,
-                                 enum AVTXType type, int inv, int len,
-                                 const void *scale, uint64_t flags)
+const FFTXCodelet TX_NAME(ff_tx_mdct_inv_full_def) = {
+    .name       = TX_NAME_STR("mdct_inv_full"),
+    .function   = TX_NAME(ff_tx_mdct_inv_full),
+    .type       = TX_TYPE(MDCT),
+    .flags      = AV_TX_UNALIGNED | FF_TX_OUT_OF_PLACE | AV_TX_FULL_IMDCT,
+    .factors    = { 2, TX_FACTOR_ANY },
+    .min_len    = 2,
+    .max_len    = TX_LEN_UNLIMITED,
+    .init       = TX_NAME(ff_tx_mdct_inv_full_init),
+    .cpu_flags  = FF_TX_CPU_FLAGS_ALL,
+    .prio       = FF_TX_PRIO_BASE,
+};
+
+static av_cold int TX_NAME(ff_tx_mdct_pfa_init)(AVTXContext *s,
+                                                const FFTXCodelet *cd,
+                                                uint64_t flags,
+                                                FFTXCodeletOptions *opts,
+                                                int len, int inv,
+                                                const void *scale)
 {
-    const int is_mdct = ff_tx_type_is_mdct(type);
-    int err, l, n = 1, m = 1, max_ptwo = 1 << (FF_ARRAY_ELEMS(fft_dispatch) - 1);
+    int ret, sub_len;
+    FFTXCodeletOptions sub_opts = { .invert_lookup = 0 };

-    if (is_mdct)
-        len >>= 1;
+    len >>= 1;
+    sub_len = len / cd->factors[0];

-    l = len;
+    s->scale_d = *((SCALE_TYPE *)scale);
+    s->scale_f = s->scale_d;

-#define CHECK_FACTOR(DST, FACTOR, SRC)                                    \
-    if (DST == 1 && !(SRC % FACTOR)) {                                    \
-        DST = FACTOR;                                                     \
-        SRC /= FACTOR;                                                    \
-    }
-    CHECK_FACTOR(n, 15, len)
-    CHECK_FACTOR(n,  9, len)
-    CHECK_FACTOR(n,  7, len)
-    CHECK_FACTOR(n,  5, len)
-    CHECK_FACTOR(n,  3, len)
-#undef CHECK_FACTOR
-
-    /* len must be a power of two now */
-    if (!(len & (len - 1)) && len >= 2 && len <= max_ptwo) {
-        m = len;
-        len = 1;
-    }
+    flags &= ~FF_TX_OUT_OF_PLACE; /* We want the subtransform to be */
+    flags |=  AV_TX_INPLACE;      /* in-place */
+    flags |=  FF_TX_PRESHUFFLE;   /* This function handles the permute step */

-    s->n = n;
-    s->m = m;
-    s->inv = inv;
-    s->type = type;
-    s->flags = flags;
-
-    /* If we weren't able to split the length into factors we can handle,
-     * resort to using the naive and slow FT. This also filters out
-     * direct 3, 5 and 15 transforms as they're too niche. */
-    if (len > 1 || m == 1) {
-        if (is_mdct && (l & 1)) /* Odd (i)MDCTs are not supported yet */
-            return AVERROR(ENOSYS);
-        if (flags & AV_TX_INPLACE) /* Neither are in-place naive transforms */
-            return AVERROR(ENOSYS);
-        s->n = l;
-        s->m = 1;
-        *tx = naive_fft;
-        if (is_mdct) {
-            s->scale = *((SCALE_TYPE *)scale);
-            *tx = inv ? naive_imdct : naive_mdct;
-            if (inv && (flags & AV_TX_FULL_IMDCT)) {
-                s->top_tx = *tx;
-                *tx = full_imdct_wrapper_fn;
-            }
-        }
-        return 0;
-    }
+    if ((ret = ff_tx_init_subtx(s, TX_TYPE(FFT), flags, &sub_opts,
+                                sub_len, inv, scale)))
+        return ret;

-    if (n > 1 && m > 1) { /* 2D transform case */
-        if ((err = ff_tx_gen_compound_mapping(s)))
-            return err;
-        if (!(s->tmp = av_malloc(n*m*sizeof(*s->tmp))))
-            return AVERROR(ENOMEM);
-        if (!(m & (m - 1))) {
-            *tx = n == 3 ? compound_fft_3xM :
-                  n == 5 ? compound_fft_5xM :
-                  n == 7 ? compound_fft_7xM :
-                  n == 9 ? compound_fft_9xM :
-                           compound_fft_15xM;
-            if (is_mdct)
-                *tx = n == 3 ? inv ? compound_imdct_3xM  : compound_mdct_3xM  :
-                      n == 5 ? inv ? compound_imdct_5xM  : compound_mdct_5xM  :
-                      n == 7 ? inv ? compound_imdct_7xM  : compound_mdct_7xM  :
-                      n == 9 ? inv ? compound_imdct_9xM  : compound_mdct_9xM  :
-                               inv ? compound_imdct_15xM : compound_mdct_15xM;
-        }
-    } else { /* Direct transform case */
-        *tx = split_radix_fft;
-        if (is_mdct)
-            *tx = inv ? monolithic_imdct : monolithic_mdct;
-    }
+    if ((ret = ff_tx_gen_compound_mapping(s, cd->factors[0], sub_len)))
+        return ret;

-    if (n == 3 || n == 5 || n == 15)
-        init_cos_tabs(0);
-    else if (n == 7)
-        init_cos_tabs(1);
-    else if (n == 9)
-        init_cos_tabs(2);
-
-    if (m != 1 && !(m & (m - 1))) {
-        if ((err = ff_tx_gen_ptwo_revtab(s, n == 1 && !is_mdct && !(flags & AV_TX_INPLACE))))
-            return err;
-        if (flags & AV_TX_INPLACE) {
-            if (is_mdct) /* In-place MDCTs are not supported yet */
-                return AVERROR(ENOSYS);
-            if ((err = ff_tx_gen_ptwo_inplace_revtab_idx(s, s->revtab_c)))
-                return err;
-        }
-        for (int i = 4; i <= av_log2(m); i++)
-            init_cos_tabs(i);
-    }
+    if ((ret = TX_NAME(ff_tx_mdct_gen_exp)(s)))
+        return ret;

-    if (is_mdct) {
-        if (inv && (flags & AV_TX_FULL_IMDCT)) {
-            s->top_tx = *tx;
-            *tx = full_imdct_wrapper_fn;
-        }
-        return gen_mdct_exptab(s, n*m, *((SCALE_TYPE *)scale));
-    }
+    if (!(s->tmp = av_malloc(len*sizeof(*s->tmp))))
+        return AVERROR(ENOMEM);
+
+    TX_TAB(ff_tx_init_tabs)(len / sub_len);

     return 0;
 }
+
+#define DECL_COMP_IMDCT(N)                                                \
+static void TX_NAME(ff_tx_mdct_pfa_##N##xM_inv)(AVTXContext *s, void *_dst, \
+                                                void *_src, ptrdiff_t stride) \
+{                                                                         \
+    FFTComplex fft##N##in[N];                                             \
+    FFTComplex *z = _dst, *exp = s->exp;                                  \
+    const FFTSample *src = _src, *in1, *in2;                              \
+    const int len4 = s->len >> 2;                                         \
+    const int m = s->sub->len;                                            \
+    const int *in_map = s->map, *out_map = in_map + N*m;                  \
+    const int *sub_map = s->sub->map;                                     \
+                                                                          \
+    stride /= sizeof(*src); /* To convert it from bytes */                \
+    in1 = src;                                                            \
+    in2 = src + ((N*m*2) - 1) * stride;                                   \
+                                                                          \
+    for (int i = 0; i < m; i++) {                                         \
+        for (int j = 0; j < N; j++) {                                     \
+            const int k = in_map[i*N + j];                                \
+            FFTComplex tmp = { in2[-k*stride], in1[k*stride] };           \
+            CMUL3(fft##N##in[j], tmp, exp[k >> 1]);                       \
+        }                                                                 \
+        fft##N(s->tmp + sub_map[i], fft##N##in, m);                       \
+    }                                                                     \
+                                                                          \
+    for (int i = 0; i < N; i++)                                           \
+        s->fn[0](&s->sub[0], s->tmp + m*i, s->tmp + m*i, sizeof(FFTComplex)); \
+                                                                          \
+    for (int i = 0; i < len4; i++) {                                      \
+        const int i0 = len4 + i, i1 = len4 - i - 1;                       \
+        const int s0 = out_map[i0], s1 = out_map[i1];                     \
+        FFTComplex src1 = { s->tmp[s1].im, s->tmp[s1].re };               \
+        FFTComplex src0 = { s->tmp[s0].im, s->tmp[s0].re };               \
+                                                                          \
+        CMUL(z[i1].re, z[i0].im, src1.re, src1.im, exp[i1].im, exp[i1].re); \
+        CMUL(z[i0].re, z[i1].im, src0.re, src0.im, exp[i0].im, exp[i0].re); \
+    }                                                                     \
+}                                                                         \
+                                                                          \
+const FFTXCodelet TX_NAME(ff_tx_mdct_pfa_##N##xM_inv_def) = {             \
+    .name       = TX_NAME_STR("mdct_pfa_" #N "xM_inv"),                   \
+    .function   = TX_NAME(ff_tx_mdct_pfa_##N##xM_inv),                    \
+    .type       = TX_TYPE(MDCT),                                          \
+    .flags      = AV_TX_UNALIGNED | FF_TX_OUT_OF_PLACE | FF_TX_INVERSE_ONLY, \
+    .factors    = { N, TX_FACTOR_ANY },                                   \
+    .min_len    = N*2,                                                    \
+    .max_len    = TX_LEN_UNLIMITED,                                       \
+    .init       = TX_NAME(ff_tx_mdct_pfa_init),                           \
+    .cpu_flags  = FF_TX_CPU_FLAGS_ALL,                                    \
+    .prio       = FF_TX_PRIO_BASE,                                        \
+};
+
+DECL_COMP_IMDCT(3)
+DECL_COMP_IMDCT(5)
+DECL_COMP_IMDCT(7)
+DECL_COMP_IMDCT(9)
+DECL_COMP_IMDCT(15)
+
+#define DECL_COMP_MDCT(N)                                                 \
+static void TX_NAME(ff_tx_mdct_pfa_##N##xM_fwd)(AVTXContext *s, void *_dst, \
+                                                void *_src, ptrdiff_t stride) \
+{                                                                         \
+    FFTComplex fft##N##in[N];                                             \
+    FFTSample *src = _src, *dst = _dst;                                   \
+    FFTComplex *exp = s->exp, tmp;                                        \
+    const int m = s->sub->len;                                            \
+    const int len4 = N*m;                                                 \
+    const int len3 = len4 * 3;                                            \
+    const int len8 = s->len >> 2;                                         \
+    const int *in_map = s->map, *out_map = in_map + N*m;                  \
+    const int *sub_map = s->sub->map;                                     \
+                                                                          \
+    stride /= sizeof(*dst);                                               \
+                                                                          \
+    for (int i = 0; i < m; i++) { /* Folding and pre-reindexing */        \
+        for (int j = 0; j < N; j++) {                                     \
+            const int k = in_map[i*N + j];                                \
+            if (k < len4) {                                               \
+                tmp.re = FOLD(-src[ len4 + k],  src[1*len4 - 1 - k]);     \
+                tmp.im = FOLD(-src[ len3 + k], -src[1*len3 - 1 - k]);     \
+            } else {                                                      \
+                tmp.re = FOLD(-src[ len4 + k], -src[5*len4 - 1 - k]);     \
+                tmp.im = FOLD( src[-len4 + k], -src[1*len3 - 1 - k]);     \
+            }                                                             \
+            CMUL(fft##N##in[j].im, fft##N##in[j].re, tmp.re, tmp.im,      \
+                 exp[k >> 1].re, exp[k >> 1].im);                         \
+        }                                                                 \
+        fft##N(s->tmp + sub_map[i], fft##N##in, m);                       \
+    }                                                                     \
+                                                                          \
+    for (int i = 0; i < N; i++)                                           \
+        s->fn[0](&s->sub[0], s->tmp + m*i, s->tmp + m*i, sizeof(FFTComplex)); \
+                                                                          \
+    for (int i = 0; i < len8; i++) {                                      \
+        const int i0 = len8 + i, i1 = len8 - i - 1;                       \
+        const int s0 = out_map[i0], s1 = out_map[i1];                     \
+        FFTComplex src1 = { s->tmp[s1].re, s->tmp[s1].im };               \
+        FFTComplex src0 = { s->tmp[s0].re, s->tmp[s0].im };               \
+                                                                          \
+        CMUL(dst[2*i1*stride + stride], dst[2*i0*stride], src0.re, src0.im, \
+             exp[i0].im, exp[i0].re);                                     \
+        CMUL(dst[2*i0*stride + stride], dst[2*i1*stride], src1.re, src1.im, \
+             exp[i1].im, exp[i1].re);                                     \
+    }                                                                     \
+}                                                                         \
+                                                                          \
+const FFTXCodelet TX_NAME(ff_tx_mdct_pfa_##N##xM_fwd_def) = {             \
+    .name       = TX_NAME_STR("mdct_pfa_" #N "xM_fwd"),                   \
+    .function   = TX_NAME(ff_tx_mdct_pfa_##N##xM_fwd),                    \
+    .type       = TX_TYPE(MDCT),                                          \
+    .flags      = AV_TX_UNALIGNED | FF_TX_OUT_OF_PLACE | FF_TX_FORWARD_ONLY, \
+    .factors    = { N, TX_FACTOR_ANY },                                   \
+    .min_len    = N*2,                                                    \
+    .max_len    = TX_LEN_UNLIMITED,                                       \
+    .init       = TX_NAME(ff_tx_mdct_pfa_init),                           \
+    .cpu_flags  = FF_TX_CPU_FLAGS_ALL,                                    \
+    .prio       = FF_TX_PRIO_BASE,                                        \
+};
+
+DECL_COMP_MDCT(3)
+DECL_COMP_MDCT(5)
+DECL_COMP_MDCT(7)
+DECL_COMP_MDCT(9)
+DECL_COMP_MDCT(15)
+
+const FFTXCodelet * const TX_NAME(ff_tx_codelet_list)[] = {
+    /* Split-Radix codelets */
+    &TX_NAME(ff_tx_fft2_ns_def),
+    &TX_NAME(ff_tx_fft4_ns_def),
+    &TX_NAME(ff_tx_fft8_ns_def),
+    &TX_NAME(ff_tx_fft16_ns_def),
+    &TX_NAME(ff_tx_fft32_ns_def),
+    &TX_NAME(ff_tx_fft64_ns_def),
+    &TX_NAME(ff_tx_fft128_ns_def),
+    &TX_NAME(ff_tx_fft256_ns_def),
+    &TX_NAME(ff_tx_fft512_ns_def),
+    &TX_NAME(ff_tx_fft1024_ns_def),
+    &TX_NAME(ff_tx_fft2048_ns_def),
+    &TX_NAME(ff_tx_fft4096_ns_def),
+    &TX_NAME(ff_tx_fft8192_ns_def),
+    &TX_NAME(ff_tx_fft16384_ns_def),
+    &TX_NAME(ff_tx_fft32768_ns_def),
+    &TX_NAME(ff_tx_fft65536_ns_def),
+    &TX_NAME(ff_tx_fft131072_ns_def),
+
+    /* Standalone transforms */
+    &TX_NAME(ff_tx_fft_sr_def),
+    &TX_NAME(ff_tx_fft_sr_inplace_def),
+    &TX_NAME(ff_tx_fft_pfa_3xM_def),
+    &TX_NAME(ff_tx_fft_pfa_5xM_def),
+    &TX_NAME(ff_tx_fft_pfa_7xM_def),
+    &TX_NAME(ff_tx_fft_pfa_9xM_def),
+    &TX_NAME(ff_tx_fft_pfa_15xM_def),
+    &TX_NAME(ff_tx_fft_naive_def),
+    &TX_NAME(ff_tx_mdct_sr_fwd_def),
+    &TX_NAME(ff_tx_mdct_sr_inv_def),
+    &TX_NAME(ff_tx_mdct_pfa_3xM_fwd_def),
+    &TX_NAME(ff_tx_mdct_pfa_5xM_fwd_def),
+    &TX_NAME(ff_tx_mdct_pfa_7xM_fwd_def),
+    &TX_NAME(ff_tx_mdct_pfa_9xM_fwd_def),
+    &TX_NAME(ff_tx_mdct_pfa_15xM_fwd_def),
+    &TX_NAME(ff_tx_mdct_pfa_3xM_inv_def),
+    &TX_NAME(ff_tx_mdct_pfa_5xM_inv_def),
+    &TX_NAME(ff_tx_mdct_pfa_7xM_inv_def),
+    &TX_NAME(ff_tx_mdct_pfa_9xM_inv_def),
+    &TX_NAME(ff_tx_mdct_pfa_15xM_inv_def),
+    &TX_NAME(ff_tx_mdct_naive_fwd_def),
+    &TX_NAME(ff_tx_mdct_naive_inv_def),
+    &TX_NAME(ff_tx_mdct_inv_full_def),
+
+    NULL,
+};
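Nothing about the public interface changes; callers still go through av_tx_init() and get a function pointer back. A usage sketch (illustration only, with a hypothetical wrapper function; a length of 60 should make the constructor pick the 15xM MDCT PFA codelet above with a length-2 sub-FFT):

    #include "libavutil/tx.h"

    /* Forward float MDCT of length 60 (120 input samples, 60 coefficients). */
    int mdct60_example(const float in[120], float out[60])
    {
        AVTXContext *ctx = NULL;
        av_tx_fn fn;
        float scale = 1.0f;
        int ret = av_tx_init(&ctx, &fn, AV_TX_FLOAT_MDCT, 0 /* fwd */, 60,
                             &scale, 0);
        if (ret < 0)
            return ret;
        fn(ctx, out, (void *)in, sizeof(float)); /* stride is in bytes */
        av_tx_uninit(&ctx);
        return 0;
    }
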
&TX_NAME(ff_tx_fft4096_ns_def), + &TX_NAME(ff_tx_fft8192_ns_def), + &TX_NAME(ff_tx_fft16384_ns_def), + &TX_NAME(ff_tx_fft32768_ns_def), + &TX_NAME(ff_tx_fft65536_ns_def), + &TX_NAME(ff_tx_fft131072_ns_def), + + /* Standalone transforms */ + &TX_NAME(ff_tx_fft_sr_def), + &TX_NAME(ff_tx_fft_sr_inplace_def), + &TX_NAME(ff_tx_fft_pfa_3xM_def), + &TX_NAME(ff_tx_fft_pfa_5xM_def), + &TX_NAME(ff_tx_fft_pfa_7xM_def), + &TX_NAME(ff_tx_fft_pfa_9xM_def), + &TX_NAME(ff_tx_fft_pfa_15xM_def), + &TX_NAME(ff_tx_fft_naive_def), + &TX_NAME(ff_tx_mdct_sr_fwd_def), + &TX_NAME(ff_tx_mdct_sr_inv_def), + &TX_NAME(ff_tx_mdct_pfa_3xM_fwd_def), + &TX_NAME(ff_tx_mdct_pfa_5xM_fwd_def), + &TX_NAME(ff_tx_mdct_pfa_7xM_fwd_def), + &TX_NAME(ff_tx_mdct_pfa_9xM_fwd_def), + &TX_NAME(ff_tx_mdct_pfa_15xM_fwd_def), + &TX_NAME(ff_tx_mdct_pfa_3xM_inv_def), + &TX_NAME(ff_tx_mdct_pfa_5xM_inv_def), + &TX_NAME(ff_tx_mdct_pfa_7xM_inv_def), + &TX_NAME(ff_tx_mdct_pfa_9xM_inv_def), + &TX_NAME(ff_tx_mdct_pfa_15xM_inv_def), + &TX_NAME(ff_tx_mdct_naive_fwd_def), + &TX_NAME(ff_tx_mdct_naive_inv_def), + &TX_NAME(ff_tx_mdct_inv_full_def), + + NULL, +}; diff --git a/libavutil/x86/tx_float.asm b/libavutil/x86/tx_float.asm index 4d2283fae1..963e6cad66 100644 --- a/libavutil/x86/tx_float.asm +++ b/libavutil/x86/tx_float.asm @@ -31,6 +31,8 @@ =20 %include "x86util.asm" =20 +%define private_prefix ff_tx + %if ARCH_X86_64 %define ptr resq %else @@ -39,25 +41,22 @@ =20 %assign i 16 %rep 14 -cextern cos_ %+ i %+ _float ; ff_cos_i_float... +cextern tab_ %+ i %+ _float ; ff_tab_i_float... %assign i (i << 1) %endrep =20 struc AVTXContext - .n: resd 1 ; Non-power-of-two part - .m: resd 1 ; Power-of-two part - .inv: resd 1 ; Is inverse - .type: resd 1 ; Type - .flags: resq 1 ; Flags - .scale: resq 1 ; Scale - - .exptab: ptr 1 ; MDCT exptab - .tmp: ptr 1 ; Temporary buffer needed for all compound transf= orms - .pfatab: ptr 1 ; Input/Output mapping for compound transforms - .revtab: ptr 1 ; Input mapping for power of two transforms - .inplace_idx: ptr 1 ; Required indices to revtab for in-place transfo= rms - - .top_tx ptr 1 ; Used for transforms derived from other transfo= rms + .len: resd 1 ; Length + .inv resd 1 ; Inverse flag + .map: ptr 1 ; Lookup table(s) + .exp: ptr 1 ; Exponentiation factors + .tmp: ptr 1 ; Temporary data + + .sub: ptr 1 ; Subcontexts + .fn: ptr 4 ; Subcontext functions + .nb_sub: resd 1 ; Subcontext count + + ; Everything else is inaccessible endstruc =20 SECTION_RODATA 32 @@ -485,8 +484,8 @@ SECTION .text movaps [outq + 10*mmsize], tx1_o0 movaps [outq + 14*mmsize], tx2_o0 =20 - movaps tw_e, [cos_64_float + mmsize] - vperm2f128 tw_o, tw_o, [cos_64_float + 64 - 4*7 - mmsize], 0x23 + movaps tw_e, [tab_64_float + mmsize] + vperm2f128 tw_o, tw_o, [tab_64_float + 64 - 4*7 - mmsize], 0x23 =20 movaps m0, [outq + 1*mmsize] movaps m1, [outq + 3*mmsize] @@ -708,14 +707,21 @@ cglobal fft4_ %+ %1 %+ _float, 4, 4, 3, ctx, out, in,= stride FFT4 fwd, 0 FFT4 inv, 1 =20 +%macro FFT8_FN 2 INIT_XMM sse3 -cglobal fft8_float, 4, 4, 6, ctx, out, in, tmp - mov ctxq, [ctxq + AVTXContext.revtab] - +cglobal fft8_ %+ %1, 4, 4, 6, ctx, out, in, tmp +%if %2 + mov ctxq, [ctxq + AVTXContext.map] LOAD64_LUT m0, inq, ctxq, (mmsize/2)*0, tmpq LOAD64_LUT m1, inq, ctxq, (mmsize/2)*1, tmpq LOAD64_LUT m2, inq, ctxq, (mmsize/2)*2, tmpq LOAD64_LUT m3, inq, ctxq, (mmsize/2)*3, tmpq +%else + movaps m0, [inq + 0*mmsize] + movaps m1, [inq + 1*mmsize] + movaps m2, [inq + 2*mmsize] + movaps m3, [inq + 3*mmsize] +%endif =20 FFT8 m0, m1, m2, m3, m4, m5 =20 @@ -730,13 +736,22 @@ 
cglobal fft8_float, 4, 4, 6, ctx, out, in, tmp movups [outq + 3*mmsize], m1 =20 RET +%endmacro =20 -INIT_YMM avx -cglobal fft8_float, 4, 4, 4, ctx, out, in, tmp - mov ctxq, [ctxq + AVTXContext.revtab] +FFT8_FN float, 1 +FFT8_FN ns_float, 0 =20 +%macro FFT16_FN 2 +INIT_YMM avx +cglobal fft8_ %+ %1, 4, 4, 4, ctx, out, in, tmp +%if %2 + mov ctxq, [ctxq + AVTXContext.map] LOAD64_LUT m0, inq, ctxq, (mmsize/2)*0, tmpq, m2 LOAD64_LUT m1, inq, ctxq, (mmsize/2)*1, tmpq, m3 +%else + movaps m0, [inq + 0*mmsize] + movaps m1, [inq + 1*mmsize] +%endif =20 FFT8_AVX m0, m1, m2, m3 =20 @@ -750,11 +765,15 @@ cglobal fft8_float, 4, 4, 4, ctx, out, in, tmp vextractf128 [outq + 16*3], m0, 1 =20 RET +%endmacro + +FFT16_FN float, 1 +FFT16_FN ns_float, 0 =20 %macro FFT16_FN 1 INIT_YMM %1 cglobal fft16_float, 4, 4, 8, ctx, out, in, tmp - mov ctxq, [ctxq + AVTXContext.revtab] + mov ctxq, [ctxq + AVTXContext.map] =20 LOAD64_LUT m0, inq, ctxq, (mmsize/2)*0, tmpq, m4 LOAD64_LUT m1, inq, ctxq, (mmsize/2)*1, tmpq, m5 @@ -786,7 +805,7 @@ FFT16_FN fma3 %macro FFT32_FN 1 INIT_YMM %1 cglobal fft32_float, 4, 4, 16, ctx, out, in, tmp - mov ctxq, [ctxq + AVTXContext.revtab] + mov ctxq, [ctxq + AVTXContext.map] =20 LOAD64_LUT m4, inq, ctxq, (mmsize/2)*4, tmpq, m8, m9 LOAD64_LUT m5, inq, ctxq, (mmsize/2)*5, tmpq, m10, m11 @@ -800,8 +819,8 @@ cglobal fft32_float, 4, 4, 16, ctx, out, in, tmp LOAD64_LUT m2, inq, ctxq, (mmsize/2)*2, tmpq, m12, m13 LOAD64_LUT m3, inq, ctxq, (mmsize/2)*3, tmpq, m14, m15 =20 - movaps m8, [cos_32_float] - vperm2f128 m9, m9, [cos_32_float + 4*8 - 4*7], 0x23 + movaps m8, [tab_32_float] + vperm2f128 m9, m9, [tab_32_float + 4*8 - 4*7], 0x23 =20 FFT16 m0, m1, m2, m3, m10, m11, m12, m13 =20 @@ -858,8 +877,8 @@ ALIGN 16 POP lenq sub outq, (%1*4) + (%1*2) + (%1/2) =20 - lea rtabq, [cos_ %+ %1 %+ _float] - lea itabq, [cos_ %+ %1 %+ _float + %1 - 4*7] + lea rtabq, [tab_ %+ %1 %+ _float] + lea itabq, [tab_ %+ %1 %+ _float + %1 - 4*7] =20 %if %0 > 1 cmp tgtq, %1 @@ -883,9 +902,9 @@ ALIGN 16 =20 %macro FFT_SPLIT_RADIX_FN 1 INIT_YMM %1 -cglobal split_radix_fft_float, 4, 8, 16, 272, lut, out, in, len, tmp, itab= , rtab, tgt - movsxd lenq, dword [lutq + AVTXContext.m] - mov lutq, [lutq + AVTXContext.revtab] +cglobal fft_sr_float, 4, 8, 16, 272, lut, out, in, len, tmp, itab, rtab, t= gt + movsxd lenq, dword [lutq + AVTXContext.len] + mov lutq, [lutq + AVTXContext.map] mov tgtq, lenq =20 ; Bottom-most/32-point transform =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D @@ -903,8 +922,8 @@ ALIGN 16 LOAD64_LUT m2, inq, lutq, (mmsize/2)*2, tmpq, m12, m13 LOAD64_LUT m3, inq, lutq, (mmsize/2)*3, tmpq, m14, m15 =20 - movaps m8, [cos_32_float] - vperm2f128 m9, m9, [cos_32_float + 32 - 4*7], 0x23 + movaps m8, [tab_32_float] + vperm2f128 m9, m9, [tab_32_float + 32 - 4*7], 0x23 =20 FFT16 m0, m1, m2, m3, m10, m11, m12, m13 =20 @@ -961,8 +980,8 @@ ALIGN 16 =20 FFT16 tx2_e0, tx2_e1, tx2_o0, tx2_o1, tmp1, tmp2, tw_e, tw_o =20 - movaps tw_e, [cos_64_float] - vperm2f128 tw_o, tw_o, [cos_64_float + 64 - 4*7], 0x23 + movaps tw_e, [tab_64_float] + vperm2f128 tw_o, tw_o, [tab_64_float + 64 - 4*7], 0x23 =20 add lutq, (mmsize/2)*8 cmp tgtq, 64 @@ -989,8 +1008,8 @@ ALIGN 16 POP lenq sub outq, 24*mmsize =20 - lea rtabq, [cos_128_float] - lea itabq, [cos_128_float + 128 - 4*7] + lea rtabq, [tab_128_float] + lea itabq, [tab_128_float + 128 - 4*7] =20 cmp tgtq, 128 je .deinterleave @@ -1016,8 +1035,8 @@ ALIGN 16 POP lenq sub outq, 48*mmsize =20 - lea 
rtabq, [cos_256_float] - lea itabq, [cos_256_float + 256 - 4*7] + lea rtabq, [tab_256_float] + lea itabq, [tab_256_float + 256 - 4*7] =20 cmp tgtq, 256 je .deinterleave @@ -1044,8 +1063,8 @@ ALIGN 16 POP lenq sub outq, 96*mmsize =20 - lea rtabq, [cos_512_float] - lea itabq, [cos_512_float + 512 - 4*7] + lea rtabq, [tab_512_float] + lea itabq, [tab_512_float + 512 - 4*7] =20 cmp tgtq, 512 je .deinterleave @@ -1079,8 +1098,8 @@ ALIGN 16 POP lenq sub outq, 192*mmsize =20 - lea rtabq, [cos_1024_float] - lea itabq, [cos_1024_float + 1024 - 4*7] + lea rtabq, [tab_1024_float] + lea itabq, [tab_1024_float + 1024 - 4*7] =20 cmp tgtq, 1024 je .deinterleave @@ -1160,8 +1179,8 @@ FFT_SPLIT_RADIX_DEF 131072 vextractf128 [outq + 13*mmsize + 0], tw_e, 1 vextractf128 [outq + 13*mmsize + 16], tx2_e0, 1 =20 - movaps tw_e, [cos_64_float + mmsize] - vperm2f128 tw_o, tw_o, [cos_64_float + 64 - 4*7 - mmsize], 0x23 + movaps tw_e, [tab_64_float + mmsize] + vperm2f128 tw_o, tw_o, [tab_64_float + 64 - 4*7 - mmsize], 0x23 =20 movaps m0, [outq + 1*mmsize] movaps m1, [outq + 3*mmsize] diff --git a/libavutil/x86/tx_float_init.c b/libavutil/x86/tx_float_init.c index 8b77a5f29f..9e9de35530 100644 --- a/libavutil/x86/tx_float_init.c +++ b/libavutil/x86/tx_float_init.c @@ -21,86 +21,112 @@ #include "libavutil/attributes.h" #include "libavutil/x86/cpu.h" =20 -void ff_fft2_float_sse3 (AVTXContext *s, void *out, void *in, ptrdiff_= t stride); -void ff_fft4_inv_float_sse2 (AVTXContext *s, void *out, void *in, ptrdiff_= t stride); -void ff_fft4_fwd_float_sse2 (AVTXContext *s, void *out, void *in, ptrdiff_= t stride); -void ff_fft8_float_sse3 (AVTXContext *s, void *out, void *in, ptrdiff_= t stride); -void ff_fft8_float_avx (AVTXContext *s, void *out, void *in, ptrdiff_= t stride); -void ff_fft16_float_avx (AVTXContext *s, void *out, void *in, ptrdiff_= t stride); -void ff_fft16_float_fma3 (AVTXContext *s, void *out, void *in, ptrdiff_= t stride); -void ff_fft32_float_avx (AVTXContext *s, void *out, void *in, ptrdiff_= t stride); -void ff_fft32_float_fma3 (AVTXContext *s, void *out, void *in, ptrdiff_= t stride); +#include "config.h" =20 -void ff_split_radix_fft_float_avx (AVTXContext *s, void *out, void *in, pt= rdiff_t stride); -void ff_split_radix_fft_float_avx2(AVTXContext *s, void *out, void *in, pt= rdiff_t stride); +void ff_tx_fft2_float_sse3 (AVTXContext *s, void *out, void *in, ptrdi= ff_t stride); +void ff_tx_fft4_inv_float_sse2 (AVTXContext *s, void *out, void *in, ptrdi= ff_t stride); +void ff_tx_fft4_fwd_float_sse2 (AVTXContext *s, void *out, void *in, ptrdi= ff_t stride); +void ff_tx_fft8_float_sse3 (AVTXContext *s, void *out, void *in, ptrdi= ff_t stride); +void ff_tx_fft8_ns_float_sse3 (AVTXContext *s, void *out, void *in, ptrdi= ff_t stride); +void ff_tx_fft8_float_avx (AVTXContext *s, void *out, void *in, ptrdi= ff_t stride); +void ff_tx_fft8_ns_float_avx (AVTXContext *s, void *out, void *in, ptrdi= ff_t stride); +void ff_tx_fft16_float_avx (AVTXContext *s, void *out, void *in, ptrdi= ff_t stride); +void ff_tx_fft16_float_fma3 (AVTXContext *s, void *out, void *in, ptrdi= ff_t stride); +void ff_tx_fft32_float_avx (AVTXContext *s, void *out, void *in, ptrdi= ff_t stride); +void ff_tx_fft32_float_fma3 (AVTXContext *s, void *out, void *in, ptrdi= ff_t stride); =20 -av_cold void ff_tx_init_float_x86(AVTXContext *s, av_tx_fn *tx) -{ - int cpu_flags =3D av_get_cpu_flags(); - int gen_revtab =3D 0, basis, revtab_interleave; +void ff_tx_fft_sr_float_avx (AVTXContext *s, void *out, void *in, ptrdi= ff_t stride); +void 
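One caveat worth noting: the struc above mirrors the leading fields of the C-side AVTXContext by hand, so the two layouts can silently drift apart. An illustrative compile-time guard (not part of the patch; the ASM_OFF_* macros are hypothetical and would have to be kept in step with the struc):

    #include <stddef.h>
    #include "tx_priv.h"

    /* Hypothetical offsets; each must match the corresponding slot in
     * the asm 'struc AVTXContext' above. */
    #define ASM_OFF_LEN 0 /* hypothetical: the .len slot */
    #define ASM_OFF_INV 4 /* hypothetical: the .inv slot */

    _Static_assert(offsetof(AVTXContext, len) == ASM_OFF_LEN,
                   "AVTXContext.len drifted from the asm layout");
    _Static_assert(offsetof(AVTXContext, inv) == ASM_OFF_INV,
                   "AVTXContext.inv drifted from the asm layout");
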
diff --git a/libavutil/x86/tx_float_init.c b/libavutil/x86/tx_float_init.c
index 8b77a5f29f..9e9de35530 100644
--- a/libavutil/x86/tx_float_init.c
+++ b/libavutil/x86/tx_float_init.c
@@ -21,86 +21,112 @@
 #include "libavutil/attributes.h"
 #include "libavutil/x86/cpu.h"

-void ff_fft2_float_sse3     (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
-void ff_fft4_inv_float_sse2 (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
-void ff_fft4_fwd_float_sse2 (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
-void ff_fft8_float_sse3     (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
-void ff_fft8_float_avx      (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
-void ff_fft16_float_avx     (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
-void ff_fft16_float_fma3    (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
-void ff_fft32_float_avx     (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
-void ff_fft32_float_fma3    (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+#include "config.h"

-void ff_split_radix_fft_float_avx (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
-void ff_split_radix_fft_float_avx2(AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+void ff_tx_fft2_float_sse3     (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+void ff_tx_fft4_inv_float_sse2 (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+void ff_tx_fft4_fwd_float_sse2 (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+void ff_tx_fft8_float_sse3     (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+void ff_tx_fft8_ns_float_sse3  (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+void ff_tx_fft8_float_avx      (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+void ff_tx_fft8_ns_float_avx   (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+void ff_tx_fft16_float_avx     (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+void ff_tx_fft16_float_fma3    (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+void ff_tx_fft32_float_avx     (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+void ff_tx_fft32_float_fma3    (AVTXContext *s, void *out, void *in, ptrdiff_t stride);

-av_cold void ff_tx_init_float_x86(AVTXContext *s, av_tx_fn *tx)
-{
-    int cpu_flags = av_get_cpu_flags();
-    int gen_revtab = 0, basis, revtab_interleave;
+void ff_tx_fft_sr_float_avx (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+void ff_tx_fft_sr_float_avx2(AVTXContext *s, void *out, void *in, ptrdiff_t stride);

-    if (s->flags & AV_TX_UNALIGNED)
-        return;
-
-    if (ff_tx_type_is_mdct(s->type))
-        return;
+#define DECL_INIT_FN(basis, interleave)                                    \
+static av_cold av_unused int                                               \
+    ff_tx_fft_sr_codelet_init_b ##basis## _i ##interleave## _x86           \
+    (AVTXContext *s,                                                       \
+     const FFTXCodelet *cd,                                                \
+     uint64_t flags,                                                       \
+     FFTXCodeletOptions *opts,                                             \
+     int len, int inv,                                                     \
+     const void *scale)                                                    \
+{                                                                          \
+    const int inv_lookup = opts ? opts->invert_lookup : 1;                 \
+    ff_tx_init_tabs_float(len);                                            \
+    return ff_tx_gen_split_radix_parity_revtab(s, inv_lookup,              \
+                                               basis, interleave);         \
+}

-#define TXFN(fn, gentab, sr_basis, interleave) \
-    do {                                       \
-        *tx = fn;                              \
-        gen_revtab = gentab;                   \
-        basis = sr_basis;                      \
-        revtab_interleave = interleave;        \
-    } while (0)
+#define ff_tx_fft_sr_codelet_init_b0_i0_x86 NULL
+DECL_INIT_FN(8, 0)
+DECL_INIT_FN(8, 2)

-    if (s->n == 1) {
-        if (EXTERNAL_SSE2(cpu_flags)) {
-            if (s->m == 4 && s->inv)
-                TXFN(ff_fft4_inv_float_sse2, 0, 0, 0);
-            else if (s->m == 4)
-                TXFN(ff_fft4_fwd_float_sse2, 0, 0, 0);
-        }
+#define DECL_SR_CD_DEF(fn_name, len, init_fn, fn_prio, cpu, fn_flags)      \
+const FFTXCodelet ff_tx_ ##fn_name## _def = {                              \
+    .name       = #fn_name,                                                \
+    .function   = ff_tx_ ##fn_name,                                        \
+    .type       = TX_TYPE(FFT),                                            \
+    .flags      = FF_TX_OUT_OF_PLACE | FF_TX_ALIGNED | fn_flags,           \
+    .factors[0] = 2,                                                       \
+    .min_len    = len,                                                     \
+    .max_len    = len,                                                     \
+    .init       = ff_tx_fft_sr_codelet_init_ ##init_fn## _x86,             \
+    .cpu_flags  = AV_CPU_FLAG_ ##cpu,                                      \
+    .prio       = fn_prio,                                                 \
+};

-        if (EXTERNAL_SSE3(cpu_flags)) {
-            if (s->m == 2)
-                TXFN(ff_fft2_float_sse3, 0, 0, 0);
-            else if (s->m == 8)
-                TXFN(ff_fft8_float_sse3, 1, 8, 0);
-        }
+DECL_SR_CD_DEF(fft2_float_sse3,     2,  b0_i0, 128, SSE3, AV_TX_INPLACE)
+DECL_SR_CD_DEF(fft4_fwd_float_sse2, 4,  b0_i0, 128, SSE2, AV_TX_INPLACE | FF_TX_FORWARD_ONLY)
+DECL_SR_CD_DEF(fft4_inv_float_sse2, 4,  b0_i0, 128, SSE2, AV_TX_INPLACE | FF_TX_INVERSE_ONLY)
+DECL_SR_CD_DEF(fft8_float_sse3,     8,  b8_i0, 128, SSE3, AV_TX_INPLACE)
+DECL_SR_CD_DEF(fft8_ns_float_sse3,  8,  b8_i0, 192, SSE3, AV_TX_INPLACE | FF_TX_PRESHUFFLE)
+DECL_SR_CD_DEF(fft8_float_avx,      8,  b8_i0, 256, AVX,  AV_TX_INPLACE)
+DECL_SR_CD_DEF(fft8_ns_float_avx,   8,  b8_i0, 320, AVX,  AV_TX_INPLACE | FF_TX_PRESHUFFLE)
+DECL_SR_CD_DEF(fft16_float_avx,     16, b8_i2, 256, AVX,  AV_TX_INPLACE)
+DECL_SR_CD_DEF(fft16_float_fma3,    16, b8_i2, 288, FMA3, AV_TX_INPLACE)
+DECL_SR_CD_DEF(fft32_float_avx,     32, b8_i2, 256, AVX,  AV_TX_INPLACE)
+DECL_SR_CD_DEF(fft32_float_fma3,    32, b8_i2, 288, FMA3, AV_TX_INPLACE)

-        if (EXTERNAL_AVX_FAST(cpu_flags)) {
-            if (s->m == 8)
-                TXFN(ff_fft8_float_avx, 1, 8, 0);
-            else if (s->m == 16)
-                TXFN(ff_fft16_float_avx, 1, 8, 2);
-#if ARCH_X86_64
-            else if (s->m == 32)
-                TXFN(ff_fft32_float_avx, 1, 8, 2);
-            else if (s->m >= 64 && s->m <= 131072 && !(s->flags & AV_TX_INPLACE))
-                TXFN(ff_split_radix_fft_float_avx, 1, 8, 2);
-#endif
-        }
+const FFTXCodelet ff_tx_fft_sr_float_avx_def = {
+    .name       = "fft_sr_float_avx",
+    .function   = ff_tx_fft_sr_float_avx,
+    .type       = TX_TYPE(FFT),
+    .flags      = FF_TX_ALIGNED | FF_TX_OUT_OF_PLACE,
+    .factors[0] = 2,
+    .min_len    = 64,
+    .max_len    = 131072,
+    .init       = ff_tx_fft_sr_codelet_init_b8_i2_x86,
+    .cpu_flags  = AV_CPU_FLAG_AVX,
+    .prio       = 256,
+};

-        if (EXTERNAL_FMA3_FAST(cpu_flags)) {
-            if (s->m == 16)
-                TXFN(ff_fft16_float_fma3, 1, 8, 2);
-#if ARCH_X86_64
-            else if (s->m == 32)
-                TXFN(ff_fft32_float_fma3, 1, 8, 2);
-#endif
-        }
+const FFTXCodelet ff_tx_fft_sr_float_avx2_def = {
+    .name       = "fft_sr_float_avx2",
+    .function   = ff_tx_fft_sr_float_avx2,
+    .type       = TX_TYPE(FFT),
+    .flags      = FF_TX_ALIGNED | FF_TX_OUT_OF_PLACE,
+    .factors[0] = 2,
+    .min_len    = 64,
+    .max_len    = 131072,
+    .init       = ff_tx_fft_sr_codelet_init_b8_i2_x86,
+    .cpu_flags  = AV_CPU_FLAG_AVX2,
+    .prio       = 288,
+};

-#if ARCH_X86_64
-        if (EXTERNAL_AVX2_FAST(cpu_flags)) {
-            if (s->m >= 64 && s->m <= 131072 && !(s->flags & AV_TX_INPLACE))
-                TXFN(ff_split_radix_fft_float_avx2, 1, 8, 2);
-        }
-#endif
-    }
+const FFTXCodelet * const ff_tx_codelet_list_float_x86[] = {
+    /* Split-Radix codelets */
+    &ff_tx_fft2_float_sse3_def,
+    &ff_tx_fft4_fwd_float_sse2_def,
+    &ff_tx_fft4_inv_float_sse2_def,
+    &ff_tx_fft8_float_sse3_def,
+    &ff_tx_fft8_ns_float_sse3_def,
+    &ff_tx_fft8_float_avx_def,
+    &ff_tx_fft8_ns_float_avx_def,
+    &ff_tx_fft16_float_avx_def,
+    &ff_tx_fft16_float_fma3_def,
+    &ff_tx_fft32_float_avx_def,
+    &ff_tx_fft32_float_fma3_def,

-    if (gen_revtab)
-        ff_tx_gen_split_radix_parity_revtab(s->revtab, s->m, s->inv, basis,
-                                            revtab_interleave);
+    /* Standalone transforms */
+    &ff_tx_fft_sr_float_avx_def,
+    &ff_tx_fft_sr_float_avx2_def,

-#undef TXFN
-}
+    NULL,
+};
-- 
2.34.1
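For reviewers, the net effect of all the _def tables: construction is now a search over codelet lists rather than a chain of if/else. Selection filters candidates by length bounds and CPU flags and keeps the highest .prio match; roughly, as a conceptual sketch only (the real ff_tx_init_subtx() also matches the transform type, factors, direction and in-place flags, and recurses to build the subtransform tree):

    /* Conceptual sketch of codelet selection, not the actual code. */
    static const FFTXCodelet *pick_codelet(const FFTXCodelet *const *list,
                                           int len, uint64_t cpu_flags)
    {
        const FFTXCodelet *best = NULL;
        for (int i = 0; list[i]; i++) {
            const FFTXCodelet *cd = list[i];
            if (len < cd->min_len || len > cd->max_len)
                continue;
            if (cd->cpu_flags != FF_TX_CPU_FLAGS_ALL &&
                !(cd->cpu_flags & cpu_flags))
                continue;
            if (!best || cd->prio > best->prio)
                best = cd;
        }
        return best;
    }

This is also why each step can be replaced independently: a higher-priority entry for the same type and length range, whether assembly or an ASIC wrapper, simply wins the search.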