2018-04-05 01:37:25 +02:00
|
|
|
OBJS-$(CONFIG_SCENE_SAD) += x86/scene_sad_init.o
|
|
|
|
|
2017-01-26 18:03:08 +02:00
|
|
|
OBJS-$(CONFIG_AFIR_FILTER) += x86/af_afir_init.o
|
2019-01-09 14:33:02 +02:00
|
|
|
OBJS-$(CONFIG_ANLMDN_FILTER) += x86/af_anlmdn_init.o
|
2019-10-14 18:15:14 +02:00
|
|
|
OBJS-$(CONFIG_ATADENOISE_FILTER) += x86/vf_atadenoise_init.o
|
2015-10-02 17:22:42 +02:00
|
|
|
OBJS-$(CONFIG_BLEND_FILTER) += x86/vf_blend_init.o
|
2016-03-13 11:06:21 +02:00
|
|
|
OBJS-$(CONFIG_BWDIF_FILTER) += x86/vf_bwdif_init.o
|
2016-04-06 20:09:08 +02:00
|
|
|
OBJS-$(CONFIG_COLORSPACE_FILTER) += x86/colorspacedsp_init.o
|
2019-06-27 04:07:21 +02:00
|
|
|
OBJS-$(CONFIG_CONVOLUTION_FILTER) += x86/vf_convolution_init.o
|
2019-09-18 09:05:34 +02:00
|
|
|
OBJS-$(CONFIG_EQ_FILTER) += x86/vf_eq_init.o
|
2014-12-26 20:37:54 +02:00
|
|
|
OBJS-$(CONFIG_FSPP_FILTER) += x86/vf_fspp_init.o
|
2019-05-15 11:54:10 +02:00
|
|
|
OBJS-$(CONFIG_GBLUR_FILTER) += x86/vf_gblur_init.o
|
2013-10-22 03:37:46 +03:00
|
|
|
OBJS-$(CONFIG_GRADFUN_FILTER) += x86/vf_gradfun_init.o
|
avfilter/vf_framerate: add SIMD functions for frame blending
Blend function speedups on x86_64 Core i5 4460:
ffmpeg -f lavfi -i allyuv -vf framerate=60:threads=1 -f null none
C: 447548411 decicycles in Blend, 2048 runs, 0 skips
SSSE3: 130020087 decicycles in Blend, 2048 runs, 0 skips
AVX2: 128508221 decicycles in Blend, 2048 runs, 0 skips
ffmpeg -f lavfi -i allyuv -vf format=yuv420p12,framerate=60:threads=1 -f null none
C: 228932745 decicycles in Blend, 2048 runs, 0 skips
SSE4: 123357781 decicycles in Blend, 2048 runs, 0 skips
AVX2: 121215353 decicycles in Blend, 2048 runs, 0 skips
Signed-off-by: Marton Balint <cus@passwd.hu>
2018-01-08 02:05:45 +02:00
|
|
|
OBJS-$(CONFIG_FRAMERATE_FILTER) += x86/vf_framerate_init.o
|
2017-12-01 21:56:45 +02:00
|
|
|
OBJS-$(CONFIG_HFLIP_FILTER) += x86/vf_hflip_init.o
|
2013-01-22 03:39:37 +03:00
|
|
|
OBJS-$(CONFIG_HQDN3D_FILTER) += x86/vf_hqdn3d_init.o
|
2014-09-03 12:02:32 +03:00
|
|
|
OBJS-$(CONFIG_IDET_FILTER) += x86/vf_idet_init.o
|
2018-04-17 12:48:28 +02:00
|
|
|
OBJS-$(CONFIG_INTERLACE_FILTER) += x86/vf_tinterlace_init.o
|
2017-07-03 17:42:03 +02:00
|
|
|
OBJS-$(CONFIG_LIMITER_FILTER) += x86/vf_limiter_init.o
|
avfilter/vf_lut3d: add x86-optimized tetrahedral interpolation
I spotted an interesting pattern that I didn't see before that leads to the implementation being faster.
The bit shifting table I was using before is no longer needed, and was able to remove quite a few lines.
I also add use of FMA on the AVX2 version.
f32 1920x1080 1 thread with prelut
c impl
1434012700 UNITS in lut3d->interp, 1 runs, 0 skips
1434035335 UNITS in lut3d->interp, 2 runs, 0 skips
1423615347 UNITS in lut3d->interp, 4 runs, 0 skips
1426268863 UNITS in lut3d->interp, 8 runs, 0 skips
sse2
905484420 UNITS in lut3d->interp, 1 runs, 0 skips
905659010 UNITS in lut3d->interp, 2 runs, 0 skips
915167140 UNITS in lut3d->interp, 4 runs, 0 skips
915834222 UNITS in lut3d->interp, 8 runs, 0 skips
avx
574794860 UNITS in lut3d->interp, 1 runs, 0 skips
581035090 UNITS in lut3d->interp, 2 runs, 0 skips
584116720 UNITS in lut3d->interp, 4 runs, 0 skips
581460290 UNITS in lut3d->interp, 8 runs, 0 skips
avx2
301698880 UNITS in lut3d->interp, 1 runs, 0 skips
301982880 UNITS in lut3d->interp, 2 runs, 0 skips
306962430 UNITS in lut3d->interp, 4 runs, 0 skips
305472025 UNITS in lut3d->interp, 8 runs, 0 skips
gbrap16 1920x1080 1 thread with prelut
c impl
1480894840 UNITS in lut3d->interp, 1 runs, 0 skips
1502922990 UNITS in lut3d->interp, 2 runs, 0 skips
1496114307 UNITS in lut3d->interp, 4 runs, 0 skips
1492554551 UNITS in lut3d->interp, 8 runs, 0 skips
sse2
980777180 UNITS in lut3d->interp, 1 runs, 0 skips
986121520 UNITS in lut3d->interp, 2 runs, 0 skips
986489840 UNITS in lut3d->interp, 4 runs, 0 skips
998832248 UNITS in lut3d->interp, 8 runs, 0 skips
avx
622212360 UNITS in lut3d->interp, 1 runs, 0 skips
622981160 UNITS in lut3d->interp, 2 runs, 0 skips
645396315 UNITS in lut3d->interp, 4 runs, 0 skips
641057075 UNITS in lut3d->interp, 8 runs, 0 skips
avx2
321336400 UNITS in lut3d->interp, 1 runs, 0 skips
321268920 UNITS in lut3d->interp, 2 runs, 0 skips
323459895 UNITS in lut3d->interp, 4 runs, 0 skips
324949967 UNITS in lut3d->interp, 8 runs, 0 skips
2021-10-06 05:58:30 +02:00
|
|
|
OBJS-$(CONFIG_LUT3D_FILTER) += x86/vf_lut3d_init.o
|
2019-10-22 18:57:14 +02:00
|
|
|
OBJS-$(CONFIG_MASKEDCLAMP_FILTER) += x86/vf_maskedclamp_init.o
|
2015-09-30 23:00:14 +02:00
|
|
|
OBJS-$(CONFIG_MASKEDMERGE_FILTER) += x86/vf_maskedmerge_init.o
|
2021-10-24 17:13:34 +02:00
|
|
|
OBJS-$(CONFIG_NLMEANS_FILTER) += x86/vf_nlmeans_init.o
|
2014-10-17 04:24:42 +03:00
|
|
|
OBJS-$(CONFIG_NOISE_FILTER) += x86/vf_noise.o
|
2018-04-30 12:01:07 +02:00
|
|
|
OBJS-$(CONFIG_OVERLAY_FILTER) += x86/vf_overlay_init.o
|
2015-01-09 21:51:13 +02:00
|
|
|
OBJS-$(CONFIG_PP7_FILTER) += x86/vf_pp7_init.o
|
2015-07-12 12:44:39 +02:00
|
|
|
OBJS-$(CONFIG_PSNR_FILTER) += x86/vf_psnr_init.o
|
2013-07-08 15:42:53 +03:00
|
|
|
OBJS-$(CONFIG_PULLUP_FILTER) += x86/vf_pullup_init.o
|
avfilter/vf_removegrain: add x86 and x86_64 SSE2 functions
Speed of all modes increased by a factor between 7.4 and 19.8 largely depending
on whether bytes are unpacked into words. Modes 2, 3, and 4 have been sped-up
by a factor of 43 (thanks quick sort!)
All modes are available on x86_64 but only modes 1, 10, 11, 12, 13, 14, 19, 20,
21, and 22 are available on x86 due to the number of SIMD registers used.
With a contribution from James Almer <jamrial@gmail.com>
2015-07-15 01:48:47 +02:00
|
|
|
OBJS-$(CONFIG_REMOVEGRAIN_FILTER) += x86/vf_removegrain_init.o
|
2016-06-04 09:33:05 +02:00
|
|
|
OBJS-$(CONFIG_SHOWCQT_FILTER) += x86/avf_showcqt_init.o
|
2013-05-11 13:03:38 +03:00
|
|
|
OBJS-$(CONFIG_SPP_FILTER) += x86/vf_spp.o
|
2015-07-13 01:33:06 +02:00
|
|
|
OBJS-$(CONFIG_SSIM_FILTER) += x86/vf_ssim_init.o
|
2015-10-04 11:34:03 +02:00
|
|
|
OBJS-$(CONFIG_STEREO3D_FILTER) += x86/vf_stereo3d_init.o
|
2015-10-02 17:22:42 +02:00
|
|
|
OBJS-$(CONFIG_TBLEND_FILTER) += x86/vf_blend_init.o
|
2017-11-12 20:11:51 +02:00
|
|
|
OBJS-$(CONFIG_THRESHOLD_FILTER) += x86/vf_threshold_init.o
|
2014-11-15 04:49:37 +02:00
|
|
|
OBJS-$(CONFIG_TINTERLACE_FILTER) += x86/vf_tinterlace_init.o
|
2019-10-21 16:43:26 +02:00
|
|
|
OBJS-$(CONFIG_TRANSPOSE_FILTER) += x86/vf_transpose_init.o
|
2012-09-23 21:49:26 +03:00
|
|
|
OBJS-$(CONFIG_VOLUME_FILTER) += x86/af_volume_init.o
|
2019-09-03 18:54:44 +02:00
|
|
|
OBJS-$(CONFIG_V360_FILTER) += x86/vf_v360_init.o
|
2015-10-07 21:03:16 +02:00
|
|
|
OBJS-$(CONFIG_W3FDIF_FILTER) += x86/vf_w3fdif_init.o
|
2014-01-04 15:49:38 +03:00
|
|
|
OBJS-$(CONFIG_YADIF_FILTER) += x86/vf_yadif_init.o
|
2012-08-29 20:37:14 +03:00
|
|
|
|
2018-04-05 01:37:25 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_SCENE_SAD) += x86/scene_sad.o
|
|
|
|
|
2016-10-08 16:18:33 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_AFIR_FILTER) += x86/af_afir.o
|
2019-01-09 14:33:02 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_ANLMDN_FILTER) += x86/af_anlmdn.o
|
2019-10-14 18:15:14 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_ATADENOISE_FILTER) += x86/vf_atadenoise.o
|
2016-10-08 16:18:33 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_BLEND_FILTER) += x86/vf_blend.o
|
|
|
|
X86ASM-OBJS-$(CONFIG_BWDIF_FILTER) += x86/vf_bwdif.o
|
|
|
|
X86ASM-OBJS-$(CONFIG_COLORSPACE_FILTER) += x86/colorspacedsp.o
|
2019-06-27 04:07:21 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_CONVOLUTION_FILTER) += x86/vf_convolution.o
|
2019-09-26 17:12:33 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_EQ_FILTER) += x86/vf_eq.o
|
avfilter/vf_framerate: add SIMD functions for frame blending
Blend function speedups on x86_64 Core i5 4460:
ffmpeg -f lavfi -i allyuv -vf framerate=60:threads=1 -f null none
C: 447548411 decicycles in Blend, 2048 runs, 0 skips
SSSE3: 130020087 decicycles in Blend, 2048 runs, 0 skips
AVX2: 128508221 decicycles in Blend, 2048 runs, 0 skips
ffmpeg -f lavfi -i allyuv -vf format=yuv420p12,framerate=60:threads=1 -f null none
C: 228932745 decicycles in Blend, 2048 runs, 0 skips
SSE4: 123357781 decicycles in Blend, 2048 runs, 0 skips
AVX2: 121215353 decicycles in Blend, 2048 runs, 0 skips
Signed-off-by: Marton Balint <cus@passwd.hu>
2018-01-08 02:05:45 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_FRAMERATE_FILTER) += x86/vf_framerate.o
|
2016-10-08 16:18:33 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_FSPP_FILTER) += x86/vf_fspp.o
|
2019-05-15 11:54:10 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_GBLUR_FILTER) += x86/vf_gblur.o
|
2016-10-08 16:18:33 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_GRADFUN_FILTER) += x86/vf_gradfun.o
|
2017-12-01 21:56:45 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_HFLIP_FILTER) += x86/vf_hflip.o
|
2016-10-08 16:18:33 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_HQDN3D_FILTER) += x86/vf_hqdn3d.o
|
|
|
|
X86ASM-OBJS-$(CONFIG_IDET_FILTER) += x86/vf_idet.o
|
|
|
|
X86ASM-OBJS-$(CONFIG_INTERLACE_FILTER) += x86/vf_interlace.o
|
2017-07-03 17:42:03 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_LIMITER_FILTER) += x86/vf_limiter.o
|
avfilter/vf_lut3d: add x86-optimized tetrahedral interpolation
I spotted an interesting pattern that I didn't see before that leads to the implementation being faster.
The bit shifting table I was using before is no longer needed, and was able to remove quite a few lines.
I also add use of FMA on the AVX2 version.
f32 1920x1080 1 thread with prelut
c impl
1434012700 UNITS in lut3d->interp, 1 runs, 0 skips
1434035335 UNITS in lut3d->interp, 2 runs, 0 skips
1423615347 UNITS in lut3d->interp, 4 runs, 0 skips
1426268863 UNITS in lut3d->interp, 8 runs, 0 skips
sse2
905484420 UNITS in lut3d->interp, 1 runs, 0 skips
905659010 UNITS in lut3d->interp, 2 runs, 0 skips
915167140 UNITS in lut3d->interp, 4 runs, 0 skips
915834222 UNITS in lut3d->interp, 8 runs, 0 skips
avx
574794860 UNITS in lut3d->interp, 1 runs, 0 skips
581035090 UNITS in lut3d->interp, 2 runs, 0 skips
584116720 UNITS in lut3d->interp, 4 runs, 0 skips
581460290 UNITS in lut3d->interp, 8 runs, 0 skips
avx2
301698880 UNITS in lut3d->interp, 1 runs, 0 skips
301982880 UNITS in lut3d->interp, 2 runs, 0 skips
306962430 UNITS in lut3d->interp, 4 runs, 0 skips
305472025 UNITS in lut3d->interp, 8 runs, 0 skips
gbrap16 1920x1080 1 thread with prelut
c impl
1480894840 UNITS in lut3d->interp, 1 runs, 0 skips
1502922990 UNITS in lut3d->interp, 2 runs, 0 skips
1496114307 UNITS in lut3d->interp, 4 runs, 0 skips
1492554551 UNITS in lut3d->interp, 8 runs, 0 skips
sse2
980777180 UNITS in lut3d->interp, 1 runs, 0 skips
986121520 UNITS in lut3d->interp, 2 runs, 0 skips
986489840 UNITS in lut3d->interp, 4 runs, 0 skips
998832248 UNITS in lut3d->interp, 8 runs, 0 skips
avx
622212360 UNITS in lut3d->interp, 1 runs, 0 skips
622981160 UNITS in lut3d->interp, 2 runs, 0 skips
645396315 UNITS in lut3d->interp, 4 runs, 0 skips
641057075 UNITS in lut3d->interp, 8 runs, 0 skips
avx2
321336400 UNITS in lut3d->interp, 1 runs, 0 skips
321268920 UNITS in lut3d->interp, 2 runs, 0 skips
323459895 UNITS in lut3d->interp, 4 runs, 0 skips
324949967 UNITS in lut3d->interp, 8 runs, 0 skips
2021-10-06 05:58:30 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_LUT3D_FILTER) += x86/vf_lut3d.o
|
2019-10-22 18:57:14 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_MASKEDCLAMP_FILTER) += x86/vf_maskedclamp.o
|
2016-10-08 16:18:33 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_MASKEDMERGE_FILTER) += x86/vf_maskedmerge.o
|
2021-10-24 17:13:34 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_NLMEANS_FILTER) += x86/vf_nlmeans.o
|
2018-04-30 12:01:07 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_OVERLAY_FILTER) += x86/vf_overlay.o
|
2016-10-08 16:18:33 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_PP7_FILTER) += x86/vf_pp7.o
|
|
|
|
X86ASM-OBJS-$(CONFIG_PSNR_FILTER) += x86/vf_psnr.o
|
|
|
|
X86ASM-OBJS-$(CONFIG_PULLUP_FILTER) += x86/vf_pullup.o
|
avfilter/vf_removegrain: add x86 and x86_64 SSE2 functions
Speed of all modes increased by a factor between 7.4 and 19.8 largely depending
on whether bytes are unpacked into words. Modes 2, 3, and 4 have been sped-up
by a factor of 43 (thanks quick sort!)
All modes are available on x86_64 but only modes 1, 10, 11, 12, 13, 14, 19, 20,
21, and 22 are available on x86 due to the number of SIMD registers used.
With a contribution from James Almer <jamrial@gmail.com>
2015-07-15 01:48:47 +02:00
|
|
|
ifdef CONFIG_GPL
|
2016-10-08 16:18:33 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_REMOVEGRAIN_FILTER) += x86/vf_removegrain.o
|
avfilter/vf_removegrain: add x86 and x86_64 SSE2 functions
Speed of all modes increased by a factor between 7.4 and 19.8 largely depending
on whether bytes are unpacked into words. Modes 2, 3, and 4 have been sped-up
by a factor of 43 (thanks quick sort!)
All modes are available on x86_64 but only modes 1, 10, 11, 12, 13, 14, 19, 20,
21, and 22 are available on x86 due to the number of SIMD registers used.
With a contribution from James Almer <jamrial@gmail.com>
2015-07-15 01:48:47 +02:00
|
|
|
endif
|
2016-10-08 16:18:33 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_SHOWCQT_FILTER) += x86/avf_showcqt.o
|
|
|
|
X86ASM-OBJS-$(CONFIG_SSIM_FILTER) += x86/vf_ssim.o
|
|
|
|
X86ASM-OBJS-$(CONFIG_STEREO3D_FILTER) += x86/vf_stereo3d.o
|
|
|
|
X86ASM-OBJS-$(CONFIG_TBLEND_FILTER) += x86/vf_blend.o
|
2017-11-12 20:11:51 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_THRESHOLD_FILTER) += x86/vf_threshold.o
|
2016-10-08 16:18:33 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_TINTERLACE_FILTER) += x86/vf_interlace.o
|
2019-10-21 16:43:26 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_TRANSPOSE_FILTER) += x86/vf_transpose.o
|
2016-10-08 16:18:33 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_VOLUME_FILTER) += x86/af_volume.o
|
2019-09-03 18:54:44 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_V360_FILTER) += x86/vf_v360.o
|
2016-10-08 16:18:33 +02:00
|
|
|
X86ASM-OBJS-$(CONFIG_W3FDIF_FILTER) += x86/vf_w3fdif.o
|
|
|
|
X86ASM-OBJS-$(CONFIG_YADIF_FILTER) += x86/vf_yadif.o x86/yadif-16.o x86/yadif-10.o
|