FFmpeg/libavcodec/aarch64/Makefile

# subsystems
OBJS-$(CONFIG_FFT)                      += aarch64/fft_init_aarch64.o
OBJS-$(CONFIG_FMTCONVERT)               += aarch64/fmtconvert_init.o
OBJS-$(CONFIG_H264CHROMA)               += aarch64/h264chroma_init_aarch64.o
OBJS-$(CONFIG_H264DSP)                  += aarch64/h264dsp_init_aarch64.o
OBJS-$(CONFIG_H264PRED)                 += aarch64/h264pred_init.o
OBJS-$(CONFIG_H264QPEL)                 += aarch64/h264qpel_init_aarch64.o
OBJS-$(CONFIG_HPELDSP)                  += aarch64/hpeldsp_init_aarch64.o
OBJS-$(CONFIG_MPEGAUDIODSP)             += aarch64/mpegaudiodsp_init.o
OBJS-$(CONFIG_NEON_CLOBBER_TEST)        += aarch64/neontest.o
OBJS-$(CONFIG_VIDEODSP)                 += aarch64/videodsp_init.o

# decoders/encoders
OBJS-$(CONFIG_DCA_DECODER)              += aarch64/synth_filter_init.o
OBJS-$(CONFIG_RV40_DECODER)             += aarch64/rv40dsp_init_aarch64.o
OBJS-$(CONFIG_VC1DSP)                   += aarch64/vc1dsp_init_aarch64.o
OBJS-$(CONFIG_VORBIS_DECODER)           += aarch64/vorbisdsp_init.o
OBJS-$(CONFIG_VP9_DECODER)              += aarch64/vp9dsp_init_10bpp_aarch64.o \
                                           aarch64/vp9dsp_init_12bpp_aarch64.o \
                                           aarch64/vp9dsp_init_aarch64.o

# ARMv8 optimizations

# subsystems
ARMV8-OBJS-$(CONFIG_VIDEODSP)           += aarch64/videodsp.o

# NEON optimizations

# subsystems
NEON-OBJS-$(CONFIG_FFT)                 += aarch64/fft_neon.o
NEON-OBJS-$(CONFIG_FMTCONVERT)          += aarch64/fmtconvert_neon.o
NEON-OBJS-$(CONFIG_H264CHROMA)          += aarch64/h264cmc_neon.o
NEON-OBJS-$(CONFIG_H264DSP)             += aarch64/h264dsp_neon.o              \
                                           aarch64/h264idct_neon.o
NEON-OBJS-$(CONFIG_H264PRED)            += aarch64/h264pred_neon.o
NEON-OBJS-$(CONFIG_H264QPEL)            += aarch64/h264qpel_neon.o             \
                                           aarch64/hpeldsp_neon.o
NEON-OBJS-$(CONFIG_HPELDSP)             += aarch64/hpeldsp_neon.o
NEON-OBJS-$(CONFIG_IDCTDSP)             += aarch64/idctdsp_init_aarch64.o      \
                                           aarch64/simple_idct_neon.o
NEON-OBJS-$(CONFIG_MDCT)                += aarch64/mdct_neon.o
NEON-OBJS-$(CONFIG_MPEGAUDIODSP)        += aarch64/mpegaudiodsp_neon.o

# decoders/encoders
NEON-OBJS-$(CONFIG_DCA_DECODER)         += aarch64/synth_filter_neon.o
NEON-OBJS-$(CONFIG_VORBIS_DECODER)      += aarch64/vorbisdsp_neon.o
NEON-OBJS-$(CONFIG_VP9_DECODER)         += aarch64/vp9itxfm_16bpp_neon.o       \
                                           aarch64/vp9itxfm_neon.o             \
                                           aarch64/vp9lpf_16bpp_neon.o         \
                                           aarch64/vp9lpf_neon.o               \
                                           aarch64/vp9mc_16bpp_neon.o          \
                                           aarch64/vp9mc_neon.o
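
The variable names above follow a conditional-list idiom: each `$(CONFIG_...)` reference expands to a value such as `yes` or `no`, so an enabled component appends its objects to a `...-yes` list and a disabled one to a list that is never read. Below is a minimal, self-contained sketch of that idiom for GNU make. The `CONFIG_*` and `HAVE_*` values are hard-coded here purely for illustration (normally they would come from the configure step), and the two aggregation lines near the bottom are an assumption about how the shared build machinery consumes `ARMV8-OBJS-yes` and `NEON-OBJS-yes`; neither is shown in this file.

```make
# Sketch only, not part of the upstream Makefile.
# Assumed configuration values (hypothetical; normally generated by configure).
CONFIG_FFT      := yes
CONFIG_VIDEODSP := yes
CONFIG_MDCT     := no
HAVE_ARMV8      := yes
HAVE_NEON       := yes

# Init objects: enabled components append to OBJS-yes.
OBJS-$(CONFIG_FFT)            += aarch64/fft_init_aarch64.o
OBJS-$(CONFIG_VIDEODSP)       += aarch64/videodsp_init.o

# CPU-feature-specific objects are first collected in per-feature lists.
ARMV8-OBJS-$(CONFIG_VIDEODSP) += aarch64/videodsp.o
NEON-OBJS-$(CONFIG_FFT)       += aarch64/fft_neon.o
# CONFIG_MDCT is "no", so this lands in NEON-OBJS-no and is never used.
NEON-OBJS-$(CONFIG_MDCT)      += aarch64/mdct_neon.o

# Assumed aggregation step: fold each per-feature list into OBJS-yes
# only when the corresponding CPU feature is available.
OBJS-$(HAVE_ARMV8) += $(ARMV8-OBJS-yes)
OBJS-$(HAVE_NEON)  += $(NEON-OBJS-yes)

OBJS := $(OBJS-yes)

# Print the resulting object list while the makefile is parsed.
$(info $(OBJS))

.PHONY: all
all: ; @:
```

Saved under a hypothetical name such as `sketch.mk` and run with `make -f sketch.mk`, this should print the two init objects followed by `aarch64/videodsp.o` and `aarch64/fft_neon.o`, with `aarch64/mdct_neon.o` left out. Setting `HAVE_NEON := no` drops `aarch64/fft_neon.o` as well, which is why the init objects sit in plain `OBJS-` lists while the assembly objects are kept in feature-prefixed ones.
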
build: miscellaneous cosmetics Restore alphabetical order in lists, break overly long lines, do some prettyprinting, add some explanatory section comments, group parts together that belong together logically. 2016-02-16 18:58:50 +02:00			`# subsystems`
aarch64: NEON float FFT Approximately as fast as the ARM NEON version on Apple's A7. 2014-03-26 17:20:42 +03:00			`OBJS-$(CONFIG_FFT) += aarch64/fft_init_aarch64.o`
arm64: int32_to_float_fmul neon asm 3% faster dts decoding on a cortex-a57. cortex-a57 cortex-a53 int32_to_float_fmul_array8_c: 1270.9 4475.6 int32_to_float_fmul_array8_neon: 328.6 569.2 int32_to_float_fmul_scalar_c: 928.5 4119.6 int32_to_float_fmul_scalar_neon: 309.1 524.1 2015-12-03 12:04:29 +02:00			`OBJS-$(CONFIG_FMTCONVERT) += aarch64/fmtconvert_init.o`
aarch64: h264 chroma motion compensation NEON optimizations Since RV40 and VC-1 use almost the same algorithm so optimizations for those two decoders are easy to do and included. 2013-12-10 22:16:08 +03:00			`OBJS-$(CONFIG_H264CHROMA) += aarch64/h264chroma_init_aarch64.o`
aarch64: h264 idct NEON assembler optimizations Ported from ARMv7 NEON. 2013-12-13 02:33:48 +03:00			`OBJS-$(CONFIG_H264DSP) += aarch64/h264dsp_init_aarch64.o`
h264: aarch64: intra prediction optimisations 2015-07-12 18:30:09 +02:00			`OBJS-$(CONFIG_H264PRED) += aarch64/h264pred_init.o`
aarch64: h264 qpel NEON optimizations Ported from ARMv7 NEON. 2013-12-18 17:56:50 +03:00			`OBJS-$(CONFIG_H264QPEL) += aarch64/h264qpel_init_aarch64.o`
aarch64: hpeldsp NEON optimizations Ported from ARMv7 NEON. 2013-12-20 22:03:58 +03:00			`OBJS-$(CONFIG_HPELDSP) += aarch64/hpeldsp_init_aarch64.o`
aarch64: NEON fixed/floating point MPADSP apply_window 30%/25% (fixed/float) faster mp3 decoding on Apple's A7. The floating point decoder is approximately 7% faster. 2014-04-19 19:17:23 +03:00			`OBJS-$(CONFIG_MPEGAUDIODSP) += aarch64/mpegaudiodsp_init.o`
aarch64: port neon clobber test from arm 2014-01-11 19:21:19 +03:00			`OBJS-$(CONFIG_NEON_CLOBBER_TEST) += aarch64/neontest.o`
aarch64: implement videodsp.prefetch 8% faster h264 decoding on Apple A7. 2014-04-05 12:47:18 +03:00			`OBJS-$(CONFIG_VIDEODSP) += aarch64/videodsp_init.o`
build: Group general components separate from de/encoders in arch Makefiles This is in line with how the top-level libavcodec Makefile is structured. 2013-12-20 17:28:18 +03:00
build: miscellaneous cosmetics Restore alphabetical order in lists, break overly long lines, do some prettyprinting, add some explanatory section comments, group parts together that belong together logically. 2016-02-16 18:58:50 +02:00			`# decoders/encoders`
aarch64/synth_filter: fix compilation Signed-off-by: James Almer <jamrial@gmail.com> 2016-05-11 04:33:12 +02:00			`OBJS-$(CONFIG_DCA_DECODER) += aarch64/synth_filter_init.o`
aarch64: h264 chroma motion compensation NEON optimizations Since RV40 and VC-1 use almost the same algorithm so optimizations for those two decoders are easy to do and included. 2013-12-10 22:16:08 +03:00			`OBJS-$(CONFIG_RV40_DECODER) += aarch64/rv40dsp_init_aarch64.o`
avcodec: fix vc1dsp dependencies 2016-09-25 12:56:55 +02:00			`OBJS-$(CONFIG_VC1DSP) += aarch64/vc1dsp_init_aarch64.o`
aarch64: NEON vorbis_inverse_coupling From the ARMv7 NEON version. 16 times faster as the C version, overall more than 12% faster vorbis decoding on Apple's A7. 2014-04-20 18:57:36 +03:00			`OBJS-$(CONFIG_VORBIS_DECODER) += aarch64/vorbisdsp_init.o`
aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC This work is sponsored by, and copyright, Google. This has mostly got the same differences to the 8 bit version as in the arm version. For the horizontal filters, we do 16 pixels in parallel as well. For the 8 pixel wide vertical filters, we can accumulate 4 rows before storing, just as in the 8 bit version. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_avg4_10bpp_neon: 35.7 30.7 vp9_avg8_10bpp_neon: 93.5 84.7 vp9_avg16_10bpp_neon: 324.4 296.6 vp9_avg32_10bpp_neon: 1236.5 1148.2 vp9_avg64_10bpp_neon: 4639.6 4571.1 vp9_avg_8tap_smooth_4h_10bpp_neon: 130.0 128.0 vp9_avg_8tap_smooth_4hv_10bpp_neon: 440.0 440.5 vp9_avg_8tap_smooth_4v_10bpp_neon: 114.0 105.5 vp9_avg_8tap_smooth_8h_10bpp_neon: 327.0 314.0 vp9_avg_8tap_smooth_8hv_10bpp_neon: 918.7 865.4 vp9_avg_8tap_smooth_8v_10bpp_neon: 330.0 300.2 vp9_avg_8tap_smooth_16h_10bpp_neon: 1187.5 1155.5 vp9_avg_8tap_smooth_16hv_10bpp_neon: 2663.1 2591.0 vp9_avg_8tap_smooth_16v_10bpp_neon: 1107.4 1078.3 vp9_avg_8tap_smooth_64h_10bpp_neon: 17754.6 17454.7 vp9_avg_8tap_smooth_64hv_10bpp_neon: 33285.2 33001.5 vp9_avg_8tap_smooth_64v_10bpp_neon: 16066.9 16048.6 vp9_put4_10bpp_neon: 25.5 21.7 vp9_put8_10bpp_neon: 56.0 52.0 vp9_put16_10bpp_neon/armv8: 183.0 163.1 vp9_put32_10bpp_neon/armv8: 678.6 563.1 vp9_put64_10bpp_neon/armv8: 2679.9 2195.8 vp9_put_8tap_smooth_4h_10bpp_neon: 120.0 118.0 vp9_put_8tap_smooth_4hv_10bpp_neon: 435.2 435.0 vp9_put_8tap_smooth_4v_10bpp_neon: 107.0 98.2 vp9_put_8tap_smooth_8h_10bpp_neon: 303.0 290.0 vp9_put_8tap_smooth_8hv_10bpp_neon: 893.7 828.7 vp9_put_8tap_smooth_8v_10bpp_neon: 305.5 263.5 vp9_put_8tap_smooth_16h_10bpp_neon: 1089.1 1059.2 vp9_put_8tap_smooth_16hv_10bpp_neon: 2578.8 2452.4 vp9_put_8tap_smooth_16v_10bpp_neon: 1009.5 933.5 vp9_put_8tap_smooth_64h_10bpp_neon: 16223.4 15918.6 vp9_put_8tap_smooth_64hv_10bpp_neon: 32153.0 31016.2 vp9_put_8tap_smooth_64v_10bpp_neon: 14516.5 13748.1 These are generally about as fast as the corresponding ARM routines on the same CPU (at least on the A53), in most cases marginally faster. The speedup vs C code is around 4-9x. Signed-off-by: Martin Storsjö <martin@martin.st> 2016-12-14 23:48:35 +02:00			`OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9dsp_init_10bpp_aarch64.o \`
			`aarch64/vp9dsp_init_12bpp_aarch64.o \`
			`aarch64/vp9dsp_init_aarch64.o`
aarch64: h264 chroma motion compensation NEON optimizations Since RV40 and VC-1 use almost the same algorithm so optimizations for those two decoders are easy to do and included. 2013-12-10 22:16:08 +03:00
build: miscellaneous cosmetics Restore alphabetical order in lists, break overly long lines, do some prettyprinting, add some explanatory section comments, group parts together that belong together logically. 2016-02-16 18:58:50 +02:00			`# ARMv8 optimizations`

			`# subsystems`
aarch64: implement videodsp.prefetch 8% faster h264 decoding on Apple A7. 2014-04-05 12:47:18 +03:00			`ARMV8-OBJS-$(CONFIG_VIDEODSP) += aarch64/videodsp.o`

build: miscellaneous cosmetics Restore alphabetical order in lists, break overly long lines, do some prettyprinting, add some explanatory section comments, group parts together that belong together logically. 2016-02-16 18:58:50 +02:00			`# NEON optimizations`

			`# subsystems`
aarch64: NEON float FFT Approximately as fast as the ARM NEON version on Apple's A7. 2014-03-26 17:20:42 +03:00			`NEON-OBJS-$(CONFIG_FFT) += aarch64/fft_neon.o`
arm64: int32_to_float_fmul neon asm 3% faster dts decoding on a cortex-a57. cortex-a57 cortex-a53 int32_to_float_fmul_array8_c: 1270.9 4475.6 int32_to_float_fmul_array8_neon: 328.6 569.2 int32_to_float_fmul_scalar_c: 928.5 4119.6 int32_to_float_fmul_scalar_neon: 309.1 524.1 2015-12-03 12:04:29 +02:00			`NEON-OBJS-$(CONFIG_FMTCONVERT) += aarch64/fmtconvert_neon.o`
aarch64: h264 chroma motion compensation NEON optimizations Since RV40 and VC-1 use almost the same algorithm so optimizations for those two decoders are easy to do and included. 2013-12-10 22:16:08 +03:00			`NEON-OBJS-$(CONFIG_H264CHROMA) += aarch64/h264cmc_neon.o`
aarch64: h264 loop filter NEON optimizations Ported from ARMv7 NEON. 2013-12-20 23:02:43 +03:00			`NEON-OBJS-$(CONFIG_H264DSP) += aarch64/h264dsp_neon.o \`
			`aarch64/h264idct_neon.o`
h264: aarch64: intra prediction optimisations 2015-07-12 18:30:09 +02:00			`NEON-OBJS-$(CONFIG_H264PRED) += aarch64/h264pred_neon.o`
aarch64: hpeldsp NEON optimizations Ported from ARMv7 NEON. 2013-12-20 22:03:58 +03:00			`NEON-OBJS-$(CONFIG_H264QPEL) += aarch64/h264qpel_neon.o \`
			`aarch64/hpeldsp_neon.o`
			`NEON-OBJS-$(CONFIG_HPELDSP) += aarch64/hpeldsp_neon.o`
lavc/aarch64: add ff_simple_idct{,_add,_put}_neon functions 2017-01-27 13:55:48 +02:00			`NEON-OBJS-$(CONFIG_IDCTDSP) += aarch64/idctdsp_init_aarch64.o \`
			`aarch64/simple_idct_neon.o`
aarch64: NEON float (i)MDCT Approximately as fast as the ARM NEON version on Apple's A7. 2014-04-15 19:35:57 +03:00			`NEON-OBJS-$(CONFIG_MDCT) += aarch64/mdct_neon.o`
build: miscellaneous cosmetics Restore alphabetical order in lists, break overly long lines, do some prettyprinting, add some explanatory section comments, group parts together that belong together logically. 2016-02-16 18:58:50 +02:00			`NEON-OBJS-$(CONFIG_MPEGAUDIODSP) += aarch64/mpegaudiodsp_neon.o`
aarch64: NEON vorbis_inverse_coupling From the ARMv7 NEON version. 16 times faster as the C version, overall more than 12% faster vorbis decoding on Apple's A7. 2014-04-20 18:57:36 +03:00
build: miscellaneous cosmetics Restore alphabetical order in lists, break overly long lines, do some prettyprinting, add some explanatory section comments, group parts together that belong together logically. 2016-02-16 18:58:50 +02:00			`# decoders/encoders`
Merge commit '01621202aad7e27b2a05c71d9ad7a19dfcbe17ec' * commit '01621202aad7e27b2a05c71d9ad7a19dfcbe17ec': build: miscellaneous cosmetics Merged-by: Derek Buitenhuis <derek.buitenhuis@gmail.com> 2016-05-09 16:52:05 +02:00			`NEON-OBJS-$(CONFIG_DCA_DECODER) += aarch64/synth_filter_neon.o`
aarch64: NEON vorbis_inverse_coupling From the ARMv7 NEON version. 16 times faster as the C version, overall more than 12% faster vorbis decoding on Apple's A7. 2014-04-20 18:57:36 +03:00			`NEON-OBJS-$(CONFIG_VORBIS_DECODER) += aarch64/vorbisdsp_neon.o`
aarch64: Add NEON optimizations for 10 and 12 bit vp9 itxfm This work is sponsored by, and copyright, Google. Compared to the arm version, on aarch64 we can keep the full 8x8 transform in registers, and for 16x16 and 32x32, we can process it in slices of 4 pixels instead of 2. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_inv_adst_adst_4x4_sub4_add_10_neon: 111.0 109.7 vp9_inv_adst_adst_8x8_sub8_add_10_neon: 914.0 733.5 vp9_inv_adst_adst_16x16_sub16_add_10_neon: 5184.0 3745.7 vp9_inv_dct_dct_4x4_sub1_add_10_neon: 65.0 65.7 vp9_inv_dct_dct_4x4_sub4_add_10_neon: 100.0 96.7 vp9_inv_dct_dct_8x8_sub1_add_10_neon: 111.0 119.7 vp9_inv_dct_dct_8x8_sub8_add_10_neon: 618.0 494.7 vp9_inv_dct_dct_16x16_sub1_add_10_neon: 295.1 284.6 vp9_inv_dct_dct_16x16_sub2_add_10_neon: 2303.2 1883.9 vp9_inv_dct_dct_16x16_sub8_add_10_neon: 2984.8 2189.3 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 3890.0 2799.4 vp9_inv_dct_dct_32x32_sub1_add_10_neon: 1044.4 1012.7 vp9_inv_dct_dct_32x32_sub2_add_10_neon: 13333.7 9695.1 vp9_inv_dct_dct_32x32_sub16_add_10_neon: 18531.3 12459.8 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 24470.7 16160.2 vp9_inv_wht_wht_4x4_sub4_add_10_neon: 83.0 79.7 The larger transforms are significantly faster than the corresponding ARM versions. The speedup vs C code is smaller than in 32 bit mode, probably because the 64 bit intermediates in the C code can be expressed more efficiently in aarch64. Signed-off-by: Martin Storsjö <martin@martin.st> 2017-01-03 14:35:54 +02:00			`NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9itxfm_16bpp_neon.o \`
			`aarch64/vp9itxfm_neon.o \`
aarch64: Add NEON optimizations for 10 and 12 bit vp9 loop filter This work is sponsored by, and copyright, Google. This is similar to the arm version, but due to the larger registers on aarch64, we can do 8 pixels at a time for all filter sizes. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_loop_filter_h_4_8_10bpp_neon: 213.2 172.6 vp9_loop_filter_h_8_8_10bpp_neon: 281.2 244.2 vp9_loop_filter_h_16_8_10bpp_neon: 657.0 444.5 vp9_loop_filter_h_16_16_10bpp_neon: 1280.4 877.7 vp9_loop_filter_mix2_h_44_16_10bpp_neon: 397.7 358.0 vp9_loop_filter_mix2_h_48_16_10bpp_neon: 465.7 429.0 vp9_loop_filter_mix2_h_84_16_10bpp_neon: 465.7 428.0 vp9_loop_filter_mix2_h_88_16_10bpp_neon: 533.7 499.0 vp9_loop_filter_mix2_v_44_16_10bpp_neon: 271.5 244.0 vp9_loop_filter_mix2_v_48_16_10bpp_neon: 330.0 305.0 vp9_loop_filter_mix2_v_84_16_10bpp_neon: 329.0 306.0 vp9_loop_filter_mix2_v_88_16_10bpp_neon: 386.0 365.0 vp9_loop_filter_v_4_8_10bpp_neon: 150.0 115.2 vp9_loop_filter_v_8_8_10bpp_neon: 209.0 175.5 vp9_loop_filter_v_16_8_10bpp_neon: 492.7 345.2 vp9_loop_filter_v_16_16_10bpp_neon: 951.0 682.7 This is significantly faster than the ARM version in almost all cases except for the mix2 functions. Based on START_TIMER/STOP_TIMER wrapping around a few individual functions, the speedup vs C code is around 2-3x. Signed-off-by: Martin Storsjö <martin@martin.st> 2017-01-05 12:52:06 +02:00			`aarch64/vp9lpf_16bpp_neon.o \`
aarch64: vp9: Implement NEON loop filters This work is sponsored by, and copyright, Google. These are ported from the ARM version; thanks to the larger amount of registers available, we can do the loop filters with 16 pixels at a time. The implementation is fully templated, with a single macro which can generate versions for both 8 and 16 pixels wide, for both 4, 8 and 16 pixels loop filters (and the 4/8 mixed versions as well). For the 8 pixel wide versions, it is pretty close in speed (the v_4_8 and v_8_8 filters are the best examples of this; the h_4_8 and h_8_8 filters seem to get some gain in the load/transpose/store part). For the 16 pixels wide ones, we get a speedup of around 1.2-1.4x compared to the 32 bit version. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_loop_filter_h_4_8_neon: 144.0 127.2 vp9_loop_filter_h_8_8_neon: 207.0 182.5 vp9_loop_filter_h_16_8_neon: 415.0 328.7 vp9_loop_filter_h_16_16_neon: 672.0 558.6 vp9_loop_filter_mix2_h_44_16_neon: 302.0 203.5 vp9_loop_filter_mix2_h_48_16_neon: 365.0 305.2 vp9_loop_filter_mix2_h_84_16_neon: 365.0 305.2 vp9_loop_filter_mix2_h_88_16_neon: 376.0 305.2 vp9_loop_filter_mix2_v_44_16_neon: 193.2 128.2 vp9_loop_filter_mix2_v_48_16_neon: 246.7 218.4 vp9_loop_filter_mix2_v_84_16_neon: 248.0 218.5 vp9_loop_filter_mix2_v_88_16_neon: 302.0 218.2 vp9_loop_filter_v_4_8_neon: 89.0 88.7 vp9_loop_filter_v_8_8_neon: 141.0 137.7 vp9_loop_filter_v_16_8_neon: 295.0 272.7 vp9_loop_filter_v_16_16_neon: 546.0 453.7 The speedup vs C code in checkasm tests is around 2-7x, which is pretty much the same as for the 32 bit version. Even if these functions are faster than their 32 bit equivalent, the C version that we compare to also became around 1.3-1.7x faster than the C version in 32 bit. Based on START_TIMER/STOP_TIMER wrapping around a few individual functions, the speedup vs C code is around 4-5x. Examples of runtimes vs C on a Cortex A57 (for a slightly older version of the patch): A57 gcc-5.3 neon loop_filter_h_4_8_neon: 256.6 93.4 loop_filter_h_8_8_neon: 307.3 139.1 loop_filter_h_16_8_neon: 340.1 254.1 loop_filter_h_16_16_neon: 827.0 407.9 loop_filter_mix2_h_44_16_neon: 524.5 155.4 loop_filter_mix2_h_48_16_neon: 644.5 173.3 loop_filter_mix2_h_84_16_neon: 630.5 222.0 loop_filter_mix2_h_88_16_neon: 697.3 222.0 loop_filter_mix2_v_44_16_neon: 598.5 100.6 loop_filter_mix2_v_48_16_neon: 651.5 127.0 loop_filter_mix2_v_84_16_neon: 591.5 167.1 loop_filter_mix2_v_88_16_neon: 855.1 166.7 loop_filter_v_4_8_neon: 271.7 65.3 loop_filter_v_8_8_neon: 312.5 106.9 loop_filter_v_16_8_neon: 473.3 206.5 loop_filter_v_16_16_neon: 976.1 327.8 The speed-up compared to the C functions is 2.5 to 6 and the cortex-a57 is again 30-50% faster than the cortex-a53. This is an adapted cherry-pick from libav commits 9d2afd1eb8c5cc0633062430e66326dbf98c99e0 and 31756abe29eb039a11c59a42cb12e0cc2aef3b97. Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com> 2016-11-14 12:32:27 +02:00			`aarch64/vp9lpf_neon.o \`
aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC This work is sponsored by, and copyright, Google. This has mostly got the same differences to the 8 bit version as in the arm version. For the horizontal filters, we do 16 pixels in parallel as well. For the 8 pixel wide vertical filters, we can accumulate 4 rows before storing, just as in the 8 bit version. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_avg4_10bpp_neon: 35.7 30.7 vp9_avg8_10bpp_neon: 93.5 84.7 vp9_avg16_10bpp_neon: 324.4 296.6 vp9_avg32_10bpp_neon: 1236.5 1148.2 vp9_avg64_10bpp_neon: 4639.6 4571.1 vp9_avg_8tap_smooth_4h_10bpp_neon: 130.0 128.0 vp9_avg_8tap_smooth_4hv_10bpp_neon: 440.0 440.5 vp9_avg_8tap_smooth_4v_10bpp_neon: 114.0 105.5 vp9_avg_8tap_smooth_8h_10bpp_neon: 327.0 314.0 vp9_avg_8tap_smooth_8hv_10bpp_neon: 918.7 865.4 vp9_avg_8tap_smooth_8v_10bpp_neon: 330.0 300.2 vp9_avg_8tap_smooth_16h_10bpp_neon: 1187.5 1155.5 vp9_avg_8tap_smooth_16hv_10bpp_neon: 2663.1 2591.0 vp9_avg_8tap_smooth_16v_10bpp_neon: 1107.4 1078.3 vp9_avg_8tap_smooth_64h_10bpp_neon: 17754.6 17454.7 vp9_avg_8tap_smooth_64hv_10bpp_neon: 33285.2 33001.5 vp9_avg_8tap_smooth_64v_10bpp_neon: 16066.9 16048.6 vp9_put4_10bpp_neon: 25.5 21.7 vp9_put8_10bpp_neon: 56.0 52.0 vp9_put16_10bpp_neon/armv8: 183.0 163.1 vp9_put32_10bpp_neon/armv8: 678.6 563.1 vp9_put64_10bpp_neon/armv8: 2679.9 2195.8 vp9_put_8tap_smooth_4h_10bpp_neon: 120.0 118.0 vp9_put_8tap_smooth_4hv_10bpp_neon: 435.2 435.0 vp9_put_8tap_smooth_4v_10bpp_neon: 107.0 98.2 vp9_put_8tap_smooth_8h_10bpp_neon: 303.0 290.0 vp9_put_8tap_smooth_8hv_10bpp_neon: 893.7 828.7 vp9_put_8tap_smooth_8v_10bpp_neon: 305.5 263.5 vp9_put_8tap_smooth_16h_10bpp_neon: 1089.1 1059.2 vp9_put_8tap_smooth_16hv_10bpp_neon: 2578.8 2452.4 vp9_put_8tap_smooth_16v_10bpp_neon: 1009.5 933.5 vp9_put_8tap_smooth_64h_10bpp_neon: 16223.4 15918.6 vp9_put_8tap_smooth_64hv_10bpp_neon: 32153.0 31016.2 vp9_put_8tap_smooth_64v_10bpp_neon: 14516.5 13748.1 These are generally about as fast as the corresponding ARM routines on the same CPU (at least on the A53), in most cases marginally faster. The speedup vs C code is around 4-9x. Signed-off-by: Martin Storsjö <martin@martin.st> 2016-12-14 23:48:35 +02:00			`aarch64/vp9mc_16bpp_neon.o \`
aarch64: vp9: Add NEON itxfm routines This work is sponsored by, and copyright, Google. These are ported from the ARM version; thanks to the larger amount of registers available, we can do the 16x16 and 32x32 transforms in slices 8 pixels wide instead of 4. This gives a speedup of around 1.4x compared to the 32 bit version. The fact that aarch64 doesn't have the same d/q register aliasing makes some of the macros quite a bit simpler as well. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_inv_adst_adst_4x4_add_neon: 90.0 87.7 vp9_inv_adst_adst_8x8_add_neon: 400.0 354.7 vp9_inv_adst_adst_16x16_add_neon: 2526.5 1827.2 vp9_inv_dct_dct_4x4_add_neon: 74.0 72.7 vp9_inv_dct_dct_8x8_add_neon: 271.0 256.7 vp9_inv_dct_dct_16x16_add_neon: 1960.7 1372.7 vp9_inv_dct_dct_32x32_add_neon: 11988.9 8088.3 vp9_inv_wht_wht_4x4_add_neon: 63.0 57.7 The speedup vs C code (2-4x) is smaller than in the 32 bit case, mostly because the C code ends up significantly faster (around 1.6x faster, with GCC 5.4) when built for aarch64. Examples of runtimes vs C on a Cortex A57 (for a slightly older version of the patch): A57 gcc-5.3 neon vp9_inv_adst_adst_4x4_add_neon: 152.2 60.0 vp9_inv_adst_adst_8x8_add_neon: 948.2 288.0 vp9_inv_adst_adst_16x16_add_neon: 4830.4 1380.5 vp9_inv_dct_dct_4x4_add_neon: 153.0 58.6 vp9_inv_dct_dct_8x8_add_neon: 789.2 180.2 vp9_inv_dct_dct_16x16_add_neon: 3639.6 917.1 vp9_inv_dct_dct_32x32_add_neon: 20462.1 4985.0 vp9_inv_wht_wht_4x4_add_neon: 91.0 49.8 The asm is around factor 3-4 faster than C on the cortex-a57 and the asm is around 30-50% faster on the a57 compared to the a53. This is an adapted cherry-pick from libav commit 3c9546dfafcdfe8e7860aff9ebbf609318220f29. Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com> 2016-11-14 12:32:26 +02:00			`aarch64/vp9mc_neon.o`