To: ffmpeg-devel@ffmpeg.org
Date: Mon, 27 Oct 2025 18:47:48 -0000
Message-ID: <176159086870.81.11355698174203966738@7d278768979e>
Subject: [FFmpeg-devel] [PATCH] avfilter/boxblur: add AVX2 assembly (PR #20770)
From: MakarDev via ffmpeg-devel
Cc: MakarDev

PR #20770 opened by MakarDev
URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20770
Patch URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20770.patch

This PR adds an AVX2 assembly implementation of the boxblur filter. Because the running sum in boxblur forms a dependency chain, the filter cannot be fully vectorized; the speedup comes from vectorizing all the other operations. The assembly covers only the "steady-state" middle part of each blurred line; the C code still handles the edges.
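To illustrate the dependency chain mentioned above, here is a minimal C sketch (not the code from this patch) of how the steady-state loop can be split into independent per-pixel work plus a short prefix sum. The names middle_scalar, middle_blocked, blocked_matches_scalar and the BLOCK width are hypothetical and chosen for illustration only; the assembly in the patch uses its own register-level scheme.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 8 /* hypothetical block width, stands in for the SIMD lane count */

/* Scalar steady-state loop: every output depends on the previous sum. */
static void middle_scalar(uint8_t *dst, const uint8_t *src,
                          int x_start, int x_end, int radius,
                          int inv, int *sum_ptr)
{
    int sum = *sum_ptr;
    for (int x = x_start; x < x_end; x++) {
        sum += (src[x + radius] - src[x - radius - 1]) * inv;
        dst[x] = sum >> 16;
    }
    *sum_ptr = sum;
}

/* Blocked variant: deltas are independent, only the prefix sum is serial. */
static void middle_blocked(uint8_t *dst, const uint8_t *src,
                           int x_start, int x_end, int radius,
                           int inv, int *sum_ptr)
{
    int sum = *sum_ptr;
    assert((x_end - x_start) % BLOCK == 0);
    for (int x = x_start; x < x_end; x += BLOCK) {
        int delta[BLOCK];
        /* Independent per-pixel work: this is the part that maps to SIMD. */
        for (int i = 0; i < BLOCK; i++)
            delta[i] = (src[x + i + radius] - src[x + i - radius - 1]) * inv;
        /* Inclusive prefix sum in log2(BLOCK) shift-and-add passes. */
        for (int shift = 1; shift < BLOCK; shift <<= 1)
            for (int i = BLOCK - 1; i >= shift; i--)
                delta[i] += delta[i - shift];
        /* Add the sum carried in from the previous block, then store. */
        for (int i = 0; i < BLOCK; i++)
            dst[x + i] = (sum + delta[i]) >> 16;
        sum += delta[BLOCK - 1];
    }
    *sum_ptr = sum;
}

/* Self-check on deterministic pseudo-random input; returns 1 on a match. */
static int blocked_matches_scalar(void)
{
    enum { LEN = 200, RADIUS = 5 };
    const int length  = 2 * RADIUS + 1;
    const int inv     = ((1 << 16) + length / 2) / length;
    const int x_start = RADIUS + 1;
    const int x_end   = x_start + (LEN - RADIUS - x_start) / BLOCK * BLOCK;
    uint8_t src[LEN], a[LEN] = {0}, b[LEN] = {0};
    unsigned state = 1u;
    int sum_a = 0, sum_b;

    for (int i = 0; i < LEN; i++) {
        state = state * 1103515245u + 12345u;
        src[i] = (uint8_t)(state >> 16);
    }
    /* Window centred on x = RADIUS, scaled by inv with rounding. */
    for (int i = 0; i < length; i++)
        sum_a += src[i];
    sum_a = sum_a * inv + (1 << 15);
    sum_b = sum_a;

    middle_scalar(a, src, x_start, x_end, RADIUS, inv, &sum_a);
    middle_blocked(b, src, x_start, x_end, RADIUS, inv, &sum_b);
    return sum_a == sum_b && memcmp(a, b, LEN) == 0;
}
```

Calling blocked_matches_scalar() from a test harness compares both variants on the same input. Only the carried sum links one block to the next, which is why the chain limits the speedup but does not prevent it.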
Benchmarking results

tests/checkasm/checkasm --test=vf_boxblur --bench

AVX2:
 - vf_boxblur.boxblur_blur8  [OK]
 - vf_boxblur.boxblur_blur16 [OK]
checkasm: all 2 tests passed
boxblur_blur8_c:      1396.9 ( 1.00x)
boxblur_blur8_avx2:    541.1 ( 2.58x)
boxblur_blur16_c:     1256.0 ( 1.00x)
boxblur_blur16_avx2:   504.2 ( 2.49x)

From 26e836c1ebf2bfcd3c02f9e7d7a46dd135ee6174 Mon Sep 17 00:00:00 2001
From: MakarDev
Date: Thu, 16 Oct 2025 22:44:31 -0700
Subject: [PATCH] avfilter/boxblur: add AVX2 assembly

---
 libavfilter/Makefile              |   2 +-
 libavfilter/boxblur.h             |   9 ++
 libavfilter/boxblur_dsp.c         |  37 ++++++
 libavfilter/vf_boxblur.c          |  93 ++++++++++---
 libavfilter/vf_boxblur_dsp.h      |  46 +++++++
 libavfilter/x86/Makefile          |   2 +
 libavfilter/x86/vf_boxblur.asm    | 213 ++++++++++++++++++++++++++++++
 libavfilter/x86/vf_boxblur_init.c |  50 +++++++
 tests/checkasm/Makefile           |   1 +
 tests/checkasm/checkasm.c         |   3 +
 tests/checkasm/checkasm.h         |   1 +
 tests/checkasm/vf_boxblur.c       | 148 +++++++++++++++++++++
 tests/fate/checkasm.mak           |   1 +
 13 files changed, 585 insertions(+), 21 deletions(-)
 create mode 100644 libavfilter/boxblur_dsp.c
 create mode 100644 libavfilter/vf_boxblur_dsp.h
 create mode 100644 libavfilter/x86/vf_boxblur.asm
 create mode 100644 libavfilter/x86/vf_boxblur_init.c
 create mode 100644 tests/checkasm/vf_boxblur.c

diff --git a/libavfilter/Makefile b/libavfilter/Makefile
index 69d74183b2..00f956dc19 100644
--- a/libavfilter/Makefile
+++ b/libavfilter/Makefile
@@ -217,7 +217,7 @@ OBJS-$(CONFIG_BLEND_VULKAN_FILTER) += vf_blend_vulkan.o framesync.o vu
 OBJS-$(CONFIG_BLOCKDETECT_FILTER) += vf_blockdetect.o
 OBJS-$(CONFIG_BLURDETECT_FILTER) += vf_blurdetect.o edge_common.o
 OBJS-$(CONFIG_BM3D_FILTER) += vf_bm3d.o framesync.o
-OBJS-$(CONFIG_BOXBLUR_FILTER) += vf_boxblur.o boxblur.o
+OBJS-$(CONFIG_BOXBLUR_FILTER) += vf_boxblur.o boxblur.o boxblur_dsp.o
 OBJS-$(CONFIG_BOXBLUR_OPENCL_FILTER) += vf_avgblur_opencl.o opencl.o \
                                         opencl/avgblur.o boxblur.o
 OBJS-$(CONFIG_BWDIF_FILTER) += vf_bwdif.o bwdifdsp.o yadif_common.o
diff --git a/libavfilter/boxblur.h b/libavfilter/boxblur.h index 214d4e0c93..16ca377600 100644 --- a/libavfilter/boxblur.h +++ b/libavfilter/boxblur.h @@ -44,4 +44,13 @@ int ff_boxblur_eval_filter_params(AVFilterLink *inlink, FilterParam *chroma_param, FilterParam *alpha_param); +/* Forward declaration */ +typedef struct FFBoxblurDSPContext FFBoxblurDSPContext; + +/* Blur functions - used for testing and internally */ +void ff_boxblur_blur8(uint8_t *dst, int dst_step, const uint8_t *src, + int src_step, int len, int radius, FFBoxblurDSPContext *dsp); +void ff_boxblur_blur16(uint16_t *dst, int dst_step, const uint16_t *src, + int src_step, int len, int radius, FFBoxblurDSPContext *dsp); + #endif // AVFILTER_BOXBLUR_H diff --git a/libavfilter/boxblur_dsp.c b/libavfilter/boxblur_dsp.c new file mode 100644 index 0000000000..9633cd1062 --- /dev/null +++ b/libavfilter/boxblur_dsp.c @@ -0,0 +1,37 @@ +/* + * Copyright (c) 2025 Makar Kuznietsov + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "config.h" +#include "libavutil/attributes.h" +#include "vf_boxblur_dsp.h" + +#if ARCH_X86_64 +void ff_boxblur_dsp_init_x86(FFBoxblurDSPContext *dsp, int depth); +#endif + +av_cold void ff_boxblur_dsp_init(FFBoxblurDSPContext *dsp, int depth) +{ + dsp->middle = depth > 8 ? (void *)boxblur_middle16_c : (void *)boxblur_middle8_c; + +#if ARCH_X86_64 + ff_boxblur_dsp_init_x86(dsp, depth); +#endif +} + diff --git a/libavfilter/vf_boxblur.c b/libavfilter/vf_boxblur.c index 3cb42471a7..07bf979453 100644 --- a/libavfilter/vf_boxblur.c +++ b/libavfilter/vf_boxblur.c @@ -33,7 +33,7 @@ #include "formats.h" #include "video.h" #include "boxblur.h" - +#include "vf_boxblur_dsp.h" typedef struct BoxBlurContext { const AVClass *class; @@ -45,6 +45,7 @@ typedef struct BoxBlurContext { int radius[4]; int power[4]; uint8_t *temp[2]; ///< temporary buffer used in blur_power() + FFBoxblurDSPContext dsp; } BoxBlurContext; static av_cold void uninit(AVFilterContext *ctx) @@ -108,9 +109,39 @@ static int config_input(AVFilterLink *inlink) s->power[U] = s->power[V] = s->chroma_param.power; s->power[A] = s->alpha_param.power; + ff_boxblur_dsp_init(&s->dsp, desc->comp[0].depth); + return 0; } +/* C reference implementation of middle loop for 8-bit */ +void boxblur_middle8_c(uint8_t *dst, const uint8_t *src, + int x_start, int x_end, int radius, + int inv, int *sum_ptr) +{ + int x; + int sum = *sum_ptr; + for (x = x_start; x < x_end; x++) { + sum += (src[radius+x] - src[x-radius-1])*inv; + dst[x] = sum >>16; + } + *sum_ptr = sum; +} + +/* C reference implementation of middle loop for 16-bit */ +void boxblur_middle16_c(uint16_t *dst, const uint16_t *src, + int x_start, int x_end, int radius, + int inv, int *sum_ptr) +{ + int x; + int sum = *sum_ptr; + for (x = 
x_start; x < x_end; x++) { + sum += (src[radius+x] - src[x-radius-1])*inv; + dst[x] = sum >>16; + } + *sum_ptr = sum; +} + /* Naive boxblur would sum source pixels from x-radius .. x+radius * for destination pixel x. That would be O(radius*width). * If you now look at what source pixels represent 2 consecutive @@ -125,9 +156,10 @@ static int config_input(AVFilterLink *inlink) * and subtracting 1 input pixel. * The following code adopts this faster variant. */ -#define BLUR(type, depth) \ -static inline void blur ## depth(type *dst, int dst_step, const type *src, \ - int src_step, int len, int radius) \ +#define BLUR(type, depth) \ +void ff_boxblur_blur ## depth(type *dst, int dst_step, const type *src, \ + int src_step, int len, int radius, \ + FFBoxblurDSPContext *dsp) \ { \ const int length = radius*2 + 1; \ const int inv = ((1<<16) + length/2)/length; \ @@ -143,9 +175,27 @@ static inline void blur ## depth(type *dst, int dst_step, const type *src, \ dst[x*dst_step] = sum>>16; \ } \ \ - for (; x < len-radius; x++) { \ - sum += (src[(radius+x)*src_step] - src[(x-radius-1)*src_step])*inv; \ - dst[x*dst_step] = sum >>16; \ + /* Middle loop: use optimized function if strides are 1 */ \ + { \ + int middle_start = radius + 1; \ + int middle_end = len - radius; \ + if (middle_end > middle_start && dst_step == 1 && src_step == 1) { \ + int middle_end_mod16 = middle_end - ((middle_end-middle_start)%16); \ + if (dsp && dsp->middle && middle_end_mod16 > middle_start) { \ + dsp->middle(dst, src, middle_start, middle_end_mod16, \ + radius, inv, &sum); \ + x = middle_end_mod16; \ + } \ + for (; x < middle_end; x++) { \ + sum += (src[(radius+x)*src_step] - src[(x-radius-1)*src_step])*inv; \ + dst[x*dst_step] = sum >>16; \ + } \ + } else { \ + for (x = middle_start; x < middle_end; x++) { \ + sum += (src[(radius+x)*src_step] - src[(x-radius-1)*src_step])*inv; \ + dst[x*dst_step] = sum >>16; \ + } \ + } \ } \ \ for (; x < len; x++) { \ @@ -160,26 +210,27 @@ BLUR(uint16_t, 16) 
#undef BLUR static inline void blur(uint8_t *dst, int dst_step, const uint8_t *src, int src_step, - int len, int radius, int pixsize) + int len, int radius, int pixsize, FFBoxblurDSPContext *dsp) { - if (pixsize == 1) blur8 (dst, dst_step , src, src_step , len, radius); - else blur16((uint16_t*)dst, dst_step>>1, (const uint16_t*)src, src_step>>1, len, radius); + if (pixsize == 1) ff_boxblur_blur8 (dst, dst_step , src, src_step , len, radius, dsp); + else ff_boxblur_blur16((uint16_t*)dst, dst_step>>1, (const uint16_t*)src, src_step>>1, len, radius, dsp); } static inline void blur_power(uint8_t *dst, int dst_step, const uint8_t *src, int src_step, - int len, int radius, int power, uint8_t *temp[2], int pixsize) + int len, int radius, int power, uint8_t *temp[2], int pixsize, + FFBoxblurDSPContext *dsp) { uint8_t *a = temp[0], *b = temp[1]; if (radius && power) { - blur(a, pixsize, src, src_step, len, radius, pixsize); + blur(a, pixsize, src, src_step, len, radius, pixsize, dsp); for (; power > 2; power--) { uint8_t *c; - blur(b, pixsize, a, pixsize, len, radius, pixsize); + blur(b, pixsize, a, pixsize, len, radius, pixsize, dsp); c = a; a = b; b = c; } if (power > 1) { - blur(dst, dst_step, a, pixsize, len, radius, pixsize); + blur(dst, dst_step, a, pixsize, len, radius, pixsize, dsp); } else { int i; if (pixsize == 1) { @@ -201,7 +252,8 @@ static inline void blur_power(uint8_t *dst, int dst_step, const uint8_t *src, in } static void hblur(uint8_t *dst, int dst_linesize, const uint8_t *src, int src_linesize, - int w, int h, int radius, int power, uint8_t *temp[2], int pixsize) + int w, int h, int radius, int power, uint8_t *temp[2], int pixsize, + FFBoxblurDSPContext *dsp) { int y; @@ -210,11 +262,12 @@ static void hblur(uint8_t *dst, int dst_linesize, const uint8_t *src, int src_li for (y = 0; y < h; y++) blur_power(dst + y*dst_linesize, pixsize, src + y*src_linesize, pixsize, - w, radius, power, temp, pixsize); + w, radius, power, temp, pixsize, dsp); } static void 
vblur(uint8_t *dst, int dst_linesize, const uint8_t *src, int src_linesize, - int w, int h, int radius, int power, uint8_t *temp[2], int pixsize) + int w, int h, int radius, int power, uint8_t *temp[2], int pixsize, + FFBoxblurDSPContext *dsp) { int x; @@ -223,7 +276,7 @@ static void vblur(uint8_t *dst, int dst_linesize, const uint8_t *src, int src_li for (x = 0; x < w; x++) blur_power(dst + x*pixsize, dst_linesize, src + x*pixsize, src_linesize, - h, radius, power, temp, pixsize); + h, radius, power, temp, pixsize, dsp); } static int filter_frame(AVFilterLink *inlink, AVFrame *in) @@ -251,13 +304,13 @@ static int filter_frame(AVFilterLink *inlink, AVFrame *in) hblur(out->data[plane], out->linesize[plane], in ->data[plane], in ->linesize[plane], w[plane], h[plane], s->radius[plane], s->power[plane], - s->temp, pixsize); + s->temp, pixsize, &s->dsp); for (plane = 0; plane < 4 && in->data[plane] && in->linesize[plane]; plane++) vblur(out->data[plane], out->linesize[plane], out->data[plane], out->linesize[plane], w[plane], h[plane], s->radius[plane], s->power[plane], - s->temp, pixsize); + s->temp, pixsize, &s->dsp); av_frame_free(&in); diff --git a/libavfilter/vf_boxblur_dsp.h b/libavfilter/vf_boxblur_dsp.h new file mode 100644 index 0000000000..c2603df55f --- /dev/null +++ b/libavfilter/vf_boxblur_dsp.h @@ -0,0 +1,46 @@ +/* + * Copyright (c) 2025 Makar Kuznietsov + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#ifndef AVFILTER_BOXBLUR_DSP_H +#define AVFILTER_BOXBLUR_DSP_H + +#include +#include + +typedef struct FFBoxblurDSPContext { + /* Optimized middle-loop function for steady-state blur */ + void (*middle)(void *dst, const void *src, + int x_start, int x_end, int radius, + int inv, int *sum_ptr); +} FFBoxblurDSPContext; + +/* C reference implementations */ +void boxblur_middle8_c(uint8_t *dst, const uint8_t *src, + int x_start, int x_end, int radius, + int inv, int *sum_ptr); + +void boxblur_middle16_c(uint16_t *dst, const uint16_t *src, + int x_start, int x_end, int radius, + int inv, int *sum_ptr); + +void ff_boxblur_dsp_init(FFBoxblurDSPContext *dsp, int depth); + +#endif /* AVFILTER_BOXBLUR_DSP_H */ + diff --git a/libavfilter/x86/Makefile b/libavfilter/x86/Makefile index b485c10fbe..f8840c5a73 100644 --- a/libavfilter/x86/Makefile +++ b/libavfilter/x86/Makefile @@ -5,6 +5,7 @@ OBJS-$(CONFIG_ANLMDN_FILTER) += x86/af_anlmdn_init.o OBJS-$(CONFIG_ATADENOISE_FILTER) += x86/vf_atadenoise_init.o OBJS-$(CONFIG_BLACKDETECT_FILTER) += x86/vf_blackdetect_init.o OBJS-$(CONFIG_BLEND_FILTER) += x86/vf_blend_init.o +OBJS-$(CONFIG_BOXBLUR_FILTER) += x86/vf_boxblur_init.o OBJS-$(CONFIG_BWDIF_FILTER) += x86/vf_bwdif_init.o OBJS-$(CONFIG_COLORDETECT_FILTER) += x86/vf_colordetect_init.o OBJS-$(CONFIG_COLORSPACE_FILTER) += x86/colorspacedsp_init.o @@ -53,6 +54,7 @@ X86ASM-OBJS-$(CONFIG_ANLMDN_FILTER) += x86/af_anlmdn.o X86ASM-OBJS-$(CONFIG_ATADENOISE_FILTER) += x86/vf_atadenoise.o X86ASM-OBJS-$(CONFIG_BLACKDETECT_FILTER) += x86/vf_blackdetect.o X86ASM-OBJS-$(CONFIG_BLEND_FILTER) += x86/vf_blend.o +X86ASM-OBJS-$(CONFIG_BOXBLUR_FILTER) += x86/vf_boxblur.o X86ASM-OBJS-$(CONFIG_BWDIF_FILTER) += x86/vf_bwdif.o X86ASM-OBJS-$(CONFIG_COLORDETECT_FILTER) += 
x86/vf_colordetect.o X86ASM-OBJS-$(CONFIG_COLORSPACE_FILTER) += x86/colorspacedsp.o diff --git a/libavfilter/x86/vf_boxblur.asm b/libavfilter/x86/vf_boxblur.asm new file mode 100644 index 0000000000..069ed092c9 --- /dev/null +++ b/libavfilter/x86/vf_boxblur.asm @@ -0,0 +1,213 @@ +;***************************************************************************** +;* x86-optimized functions for boxblur filter +;* +;* Copyright (c) 2025 Makar Kuznietsov +;* +;* This file is part of FFmpeg. +;* +;* FFmpeg is free software; you can redistribute it and/or +;* modify it under the terms of the GNU Lesser General Public +;* License as published by the Free Software Foundation; either +;* version 2.1 of the License, or (at your option) any later version. +;* +;* FFmpeg is distributed in the hope that it will be useful, +;* but WITHOUT ANY WARRANTY; without even the implied warranty of +;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +;* Lesser General Public License for more details. 
+;* +;* You should have received a copy of the GNU Lesser General Public +;* License along with FFmpeg; if not, write to the Free Software +;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA +;****************************************************************************** + +%include "libavutil/x86/x86util.asm" + +SECTION .text + +%if ARCH_X86_64 + +; void ff_boxblur_middle_avx2(uint8_t *dst, const uint8_t *src, +; int x_start, int x_end, int radius, +; int inv, int *sum_ptr) +INIT_YMM avx2 +cglobal boxblur_middle, 7, 10, 6, dst, src, x_start, x_end, radius, inv, sum_ptr, x, tmp, sum + mov sumd, [sum_ptrq] + movd xm3, invd + vpbroadcastd m3, xm3 + mov xd, x_startd + +.vloop: + ; Load incoming pixels: src[x + radius] + lea tmpq, [xq + radiusq] + movu xm0, [srcq + tmpq] + + ; Load outgoing pixels: src[x - radius - 1] + lea tmpq, [xq - 1] + sub tmpq, radiusq + movu xm1, [srcq + tmpq] + + ; Zero-extend u8 -> u16 + pmovzxbw m0, xm0 + pmovzxbw m1, xm1 + + ; Compute signed difference + psubw m2, m0, m1 + pmovsxwd m4, xm2 + + ; Extract high 8 words and sign-extend + vextracti128 xm0, m2, 1 + pmovsxwd m5, xm0 + + ; Multiply by inv + pmulld m4, m4, m3 + pmulld m5, m5, m3 + + ; Compute prefix sum for m4 (lower 8 pixels) + mova m0, m4 + pslldq m1, m0, 4 + paddd m0, m0, m1 + pslldq m1, m0, 8 + paddd m0, m0, m1 + + ; Propagate carry across 128-bit lanes + vextracti128 xm1, m0, 0 + vpshufd xm1, xm1, 0xFF + vpxor m2, m2, m2 + vinserti128 m2, m2, xm1, 1 + vpaddd m0, m0, m2 + + ; Add accumulator + movd xm2, sumd + vpbroadcastd m2, xm2 + paddd m0, m0, m2 + mova m4, m0 + + ; Update accumulator for next iteration + vextracti128 xm1, m0, 1 + pshufd xm1, xm1, 0xFF + movd sumd, xm1 + + ; Compute prefix sum for m5 (upper 8 pixels) + mova m0, m5 + pslldq m1, m0, 4 + paddd m0, m0, m1 + pslldq m1, m0, 8 + paddd m0, m0, m1 + + ; Propagate carry across 128-bit lanes + vextracti128 xm1, m0, 0 + pshufd xm1, xm1, 0xFF + pxor m2, m2, m2 + vinserti128 m2, m2, xm1, 1 + 
paddd m0, m0, m2 + + ; Add accumulator + movd xm2, sumd + vpbroadcastd m2, xm2 + paddd m0, m0, m2 + mova m5, m0 + + ; Update accumulator for next iteration + vextracti128 xm1, m0, 1 + pshufd xm1, xm1, 0xFF + movd sumd, xm1 + + ; Shift and pack results + psrad m4, m4, 16 + psrad m5, m5, 16 + + ; Pack lower 8 pixels + vextracti128 xm0, m4, 0 + vextracti128 xm1, m4, 1 + packusdw xm0, xm0, xm1 + packuswb xm0, xm0, xm0 + movq [dstq + xq + 0], xm0 + + ; Pack upper 8 pixels + vextracti128 xm0, m5, 0 + vextracti128 xm1, m5, 1 + packusdw xm0, xm0, xm1 + packuswb xm0, xm0, xm0 + movq [dstq + xq + 8], xm0 + + add xd, 16 + cmp xd, x_endd + jl .vloop + + mov [sum_ptrq], sumd + RET + +; void ff_boxblur_middle16_avx2(uint16_t *dst, const uint16_t *src, +; int x_start, int x_end, int radius, +; int inv, int *sum_ptr) +INIT_YMM avx2 +cglobal boxblur_middle16, 7, 10, 5, dst, src, x_start, x_end, radius, inv, sum_ptr, x, tmp, sum + mov sumd, [sum_ptrq] + movd xm3, invd + vpbroadcastd m3, xm3 + mov xd, x_startd + +.vloop: + ; Load incoming pixels: src[x + radius] (accounting for 2-byte stride) + lea tmpq, [xq + radiusq] + movu xm0, [srcq + tmpq*2] + + ; Load outgoing pixels: src[x - radius - 1] + lea tmpq, [xq - 1] + sub tmpq, radiusq + movu xm1, [srcq + tmpq*2] + + ; Zero-extend u16 -> u32 + pmovzxwd m0, xm0 + pmovzxwd m1, xm1 + + ; Compute signed difference + psubd m2, m0, m1 + + ; Multiply by inv + pmulld m4, m2, m3 + + ; Compute prefix sum + mova m0, m4 + pslldq m1, m0, 4 + paddd m0, m0, m1 + pslldq m1, m0, 8 + paddd m0, m0, m1 + + ; Propagate carry across 128-bit lanes + vextracti128 xm1, m0, 0 + pshufd xm1, xm1, 0xFF + pxor m2, m2, m2 + vinserti128 m2, m2, xm1, 1 + paddd m0, m0, m2 + + ; Add accumulator + movd xm2, sumd + vpbroadcastd m2, xm2 + paddd m0, m0, m2 + mova m4, m0 + + ; Update accumulator for next iteration + vextracti128 xm1, m0, 1 + pshufd xm1, xm1, 0xFF + movd sumd, xm1 + + ; Shift and pack results + psrld m4, m4, 16 + vextracti128 xm0, m4, 0 + vextracti128 xm1, 
m4, 1 + packusdw xm0, xm0, xm1 + movu [dstq + xq*2], xm0 + + add xd, 8 + cmp xd, x_endd + jl .vloop + + mov [sum_ptrq], sumd + RET + +%endif + +%if HAVE_AVX2_EXTERNAL +INIT_YMM avx2 +%endif diff --git a/libavfilter/x86/vf_boxblur_init.c b/libavfilter/x86/vf_boxblur_init.c new file mode 100644 index 0000000000..e11536d10c --- /dev/null +++ b/libavfilter/x86/vf_boxblur_init.c @@ -0,0 +1,50 @@ +/* + * Copyright (c) 2025 Makar Kuznietsov + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "libavutil/attributes.h" +#include "libavutil/cpu.h" +#include "libavutil/x86/cpu.h" + +#include "libavfilter/vf_boxblur_dsp.h" + +/* Forward declaration */ +void ff_boxblur_dsp_init_x86(FFBoxblurDSPContext *dsp, int depth); + +/* AVX2 optimized middle-loop functions */ +#if ARCH_X86_64 && HAVE_AVX2_EXTERNAL +void ff_boxblur_middle_avx2(uint8_t *dst, const uint8_t *src, + int x_start, int x_end, int radius, + int inv, int *sum_ptr); + +void ff_boxblur_middle16_avx2(uint16_t *dst, const uint16_t *src, + int x_start, int x_end, int radius, + int inv, int *sum_ptr); +#endif + +av_cold void ff_boxblur_dsp_init_x86(FFBoxblurDSPContext *dsp, int depth) +{ +#if ARCH_X86_64 && HAVE_AVX2_EXTERNAL + int cpu_flags = av_get_cpu_flags(); + + if 
(EXTERNAL_AVX2_FAST(cpu_flags)) { + dsp->middle = depth > 8 ? (void *)ff_boxblur_middle16_avx2 : (void *)ff_boxblur_middle_avx2; + } +#endif +} diff --git a/tests/checkasm/Makefile b/tests/checkasm/Makefile index e47070d90f..8d3196bbdf 100644 --- a/tests/checkasm/Makefile +++ b/tests/checkasm/Makefile @@ -60,6 +60,7 @@ AVFILTEROBJS-$(CONFIG_SCENE_SAD) += scene_sad.o AVFILTEROBJS-$(CONFIG_AFIR_FILTER) += af_afir.o AVFILTEROBJS-$(CONFIG_BLACKDETECT_FILTER) += vf_blackdetect.o AVFILTEROBJS-$(CONFIG_BLEND_FILTER) += vf_blend.o +AVFILTEROBJS-$(CONFIG_BOXBLUR_FILTER) += vf_boxblur.o AVFILTEROBJS-$(CONFIG_BWDIF_FILTER) += vf_bwdif.o AVFILTEROBJS-$(CONFIG_COLORDETECT_FILTER)+= vf_colordetect.o AVFILTEROBJS-$(CONFIG_COLORSPACE_FILTER) += vf_colorspace.o diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c index 4469e043f5..23800b9978 100644 --- a/tests/checkasm/checkasm.c +++ b/tests/checkasm/checkasm.c @@ -285,6 +285,9 @@ static const struct { #if CONFIG_BLEND_FILTER { "vf_blend", checkasm_check_blend }, #endif + #if CONFIG_BOXBLUR_FILTER + { "vf_boxblur", checkasm_check_boxblur }, + #endif #if CONFIG_BWDIF_FILTER { "vf_bwdif", checkasm_check_vf_bwdif }, #endif diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h index e1ccd4011b..bfca26cb82 100644 --- a/tests/checkasm/checkasm.h +++ b/tests/checkasm/checkasm.h @@ -89,6 +89,7 @@ void checkasm_check_av_tx(void); void checkasm_check_blackdetect(void); void checkasm_check_blend(void); void checkasm_check_blockdsp(void); +void checkasm_check_boxblur(void); void checkasm_check_bswapdsp(void); void checkasm_check_cavsdsp(void); void checkasm_check_colordetect(void); diff --git a/tests/checkasm/vf_boxblur.c b/tests/checkasm/vf_boxblur.c new file mode 100644 index 0000000000..c67abc5ece --- /dev/null +++ b/tests/checkasm/vf_boxblur.c @@ -0,0 +1,148 @@ +/* + * Copyright (c) 2025 Makar Kuznietsov + * + * This file is part of FFmpeg. 
+ * + * FFmpeg is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License along + * with FFmpeg; if not, write to the Free Software Foundation, Inc., + * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. + */ + +#include +#include "checkasm.h" +#include "libavutil/mem_internal.h" +#include "libavutil/cpu.h" + +#include "libavfilter/boxblur.h" +#include "libavfilter/vf_boxblur_dsp.h" + +static int current_depth = 8; + +static void blur8_c(uint8_t *dst, int dst_step, const uint8_t *src, + int src_step, int len, int radius) +{ + FFBoxblurDSPContext dsp; + int saved_flags = av_get_cpu_flags(); + av_force_cpu_flags(saved_flags & ~AV_CPU_FLAG_AVX2); + ff_boxblur_dsp_init(&dsp, current_depth); + av_force_cpu_flags(saved_flags); + ff_boxblur_blur8(dst, dst_step, src, src_step, len, radius, &dsp); +} + +static void blur8_simd(uint8_t *dst, int dst_step, const uint8_t *src, + int src_step, int len, int radius) +{ + FFBoxblurDSPContext dsp; + ff_boxblur_dsp_init(&dsp, current_depth); + ff_boxblur_blur8(dst, dst_step, src, src_step, len, radius, &dsp); +} + +static void check_blur8(int depth) +{ + LOCAL_ALIGNED_32(uint8_t, src, [2048]); + LOCAL_ALIGNED_32(uint8_t, dst0, [2048]); + LOCAL_ALIGNED_32(uint8_t, dst1, [2048]); + + declare_func(void, uint8_t *, int, const uint8_t *, int, int, int); + + current_depth = depth; + + /* Register exactly one version per CPU run so checkasm records C and AVX2 */ + void (*fn)(uint8_t *, int, const uint8_t *, int, int, int) = + 
(av_get_cpu_flags() & AV_CPU_FLAG_AVX2) ? blur8_simd : blur8_c; + + if (check_func(fn, "boxblur_blur8")) { + for (int iter = 0; iter < 16; iter++) { + const int len = 64 + (rnd() % 256); + const int radius = FFMIN((len - 1) / 2, 1 + (rnd() % 15)); + for (int i = 0; i < len; i++) + src[i] = rnd(); + + call_ref(dst0, 1, src, 1, len, radius); + call_new(dst1, 1, src, 1, len, radius); + if (memcmp(dst0, dst1, len)) + fail(); + } + + /* Benchmark with typical size */ + const int bench_len = 256; + const int bench_radius = 8; + for (int i = 0; i < bench_len; i++) + src[i] = rnd(); + bench_new(dst1, 1, src, 1, bench_len, bench_radius); + } +} + +static void blur16_c(uint16_t *dst, int dst_step, const uint16_t *src, + int src_step, int len, int radius) +{ + FFBoxblurDSPContext dsp; + int saved_flags = av_get_cpu_flags(); + av_force_cpu_flags(saved_flags & ~AV_CPU_FLAG_AVX2); + ff_boxblur_dsp_init(&dsp, current_depth); + av_force_cpu_flags(saved_flags); + ff_boxblur_blur16(dst, dst_step, src, src_step, len, radius, &dsp); +} + +static void blur16_simd(uint16_t *dst, int dst_step, const uint16_t *src, + int src_step, int len, int radius) +{ + FFBoxblurDSPContext dsp; + ff_boxblur_dsp_init(&dsp, current_depth); + ff_boxblur_blur16(dst, dst_step, src, src_step, len, radius, &dsp); +} + +static void check_blur16(int depth) +{ + LOCAL_ALIGNED_32(uint16_t, src, [2048]); + LOCAL_ALIGNED_32(uint16_t, dst0, [2048]); + LOCAL_ALIGNED_32(uint16_t, dst1, [2048]); + + declare_func(void, uint16_t *, int, const uint16_t *, int, int, int); + + current_depth = depth; + + /* Register exactly one version per CPU run so checkasm records C and AVX2 */ + void (*fn)(uint16_t *, int, const uint16_t *, int, int, int) = + (av_get_cpu_flags() & AV_CPU_FLAG_AVX2) ? 
blur16_simd : blur16_c; + + if (check_func(fn, "boxblur_blur16")) { + for (int iter = 0; iter < 16; iter++) { + const int len = 64 + (rnd() % 256); + const int radius = FFMIN((len - 1) / 2, 1 + (rnd() % 15)); + for (int i = 0; i < len; i++) + src[i] = rnd(); + + call_ref(dst0, 1, src, 1, len, radius); + call_new(dst1, 1, src, 1, len, radius); + if (memcmp(dst0, dst1, len * sizeof(uint16_t))) + fail(); + } + + /* Benchmark with typical size */ + const int bench_len = 256; + const int bench_radius = 8; + for (int i = 0; i < bench_len; i++) + src[i] = rnd(); + bench_new(dst1, 1, src, 1, bench_len, bench_radius); + } +} + +void checkasm_check_boxblur(void) +{ + check_blur8(8); + report("boxblur_blur8"); + + check_blur16(16); + report("boxblur_blur16"); +} diff --git a/tests/fate/checkasm.mak b/tests/fate/checkasm.mak index ca1cd0dea3..3fcef57496 100644 --- a/tests/fate/checkasm.mak +++ b/tests/fate/checkasm.mak @@ -63,6 +63,7 @@ FATE_CHECKASM = fate-checkasm-aacencdsp \ fate-checkasm-vc1dsp \ fate-checkasm-vf_blackdetect \ fate-checkasm-vf_blend \ + fate-checkasm-vf_boxblur \ fate-checkasm-vf_bwdif \ fate-checkasm-vf_colordetect \ fate-checkasm-vf_colorspace \ -- 2.49.1