1
0
mirror of https://github.com/FFmpeg/FFmpeg.git synced 2024-12-23 12:43:46 +02:00
Commit Graph

401 Commits

Author SHA1 Message Date
Hubert Mazur
d7abb7d143 lavc/aarch64: Add neon implementation for sse4
Provide neon implementation for sse4 function.

Performance comparison tests are shown below.
- sse_2_c: 80.7
- sse_2_neon: 31.0

Benchmarks and tests are run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-18 12:07:26 +03:00
Hubert Mazur
ad251fd262 lavc/aarch64: Add neon implementation for sse16
Provide neon implementation for sse16 function.

Performance comparison tests are shown below.
- sse_0_c: 268.2
- sse_0_neon: 43.5

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-18 12:07:26 +03:00
Martin Storsjö
60109d5b3d aarch64: me_cmp: Fix the indentation of function declarations
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-18 12:07:26 +03:00
J. Dekker
aa9eabb7a5 lavc/aarch64: reformat add_res funcs
Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-08-16 14:00:34 +02:00
Andreas Rheinhardt
333b32af8e avcodec/h264chroma: Constify src in h264_chroma_mc_func
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-08-05 03:02:13 +02:00
Andreas Rheinhardt
b3bbbb14d0 avcodec/hevcdsp: Constify src pointers
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-08-05 02:54:04 +02:00
Andreas Rheinhardt
abb85429f3 avcodec/me_cmp: Constify me_cmp_func buffer parameters
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-07-31 03:31:53 +02:00
Andreas Rheinhardt
af43da3e4d avcodec/videodsp: Constify buf in VideoDSPContext.prefetch
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-07-31 03:14:34 +02:00
Martin Storsjö
4136405c86 aarch64: me_cmp: Don't do uaddlv once per iteration
The max height is currently documented as 16; the max difference per
pixel is 255, and a .8h element can easily contain 16*255, thus keep
accumulating in two .8h vectors, and just do the final accumulationat the
end. This should work for heights up to 256.

This requires a minor register renumbering in ff_pix_abs16_xy2_neon.

Before:       Cortex A53    A72    A73   Graviton 3
pix_abs_0_0_neon:   97.7   47.0   37.5   22.7
pix_abs_0_1_neon:  154.0   59.0   52.0   25.0
pix_abs_0_3_neon:  179.7   96.7   87.5   41.2
After:
pix_abs_0_0_neon:   96.0   39.2   31.2   22.0
pix_abs_0_1_neon:  150.7   59.7   46.2   23.7
pix_abs_0_3_neon:  175.7   83.7   81.7   38.2

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-07-16 17:26:17 +03:00
Martin Storsjö
68a03f6424 aarch64: me_cmp: Switch from uabd to uabal in ff_pix_abs16_xy2_neon
Using absolute-difference-accumulate does use twice the amount of
absolute-difference instructions, but avoids the need for the
uaddl and add instructions, reducing the total number of instructions
by 3.

These can be interleaved in the rest of the calculation, to avoid
tight dependencies at the end. Unfortunately, this is marginally
slower on Cortex A53, but faster on A72 and A73.

Before:       Cortex A53    A72    A73   Graviton 3
pix_abs_0_3_neon:  175.7  109.2   92.0   41.2
After:
pix_abs_0_3_neon:  179.7   96.7   87.5   41.2

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-07-16 17:25:54 +03:00
Martin Storsjö
b46de9aba4 aarch64: me_cmp: Interleave some of the loads in ff_pix_abs16_xy2_neon
Before:       Cortex A53    A72    A73   Graviton 3
pix_abs_0_3_neon:  183.7  112.7   97.5   41.2
After:
pix_abs_0_3_neon:  175.7  109.2   92.0   41.2

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-07-16 17:25:44 +03:00
Martin Storsjö
02e7853fd9 libavcodec: aarch64: Don't clobber v8 in the h%4 case in ff_pix_abs16_xy2_neon
Checkasm doesn't currently test this codepath.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-07-16 17:25:11 +03:00
Hubert Mazur
01e190dc99 lavc/aarch64: Add pix_abs16_x2 neon implementation
Provide neon implementation for pix_abs16_x2 function.

Performance tests of implementation are below.
 - pix_abs_0_1_c: 283.5
 - pix_abs_0_1_neon: 39.0

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-07-13 23:25:22 +03:00
Hubert Mazur
eb7ab3928f lavc/aarch64: Hook up the existing ff_pix_abs16_neon to the sad[0] function pointer
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-07-11 23:58:28 +03:00
Swinney, Jonathan
c471cc7474 lavc/aarch64: motion estimation functions in neon
- ff_pix_abs16_neon
 - ff_pix_abs16_xy2_neon

In direct micro benchmarks of these ff functions verses their C implementations,
these functions performed as follows on AWS Graviton 3.

ff_pix_abs16_neon:
pix_abs_0_0_c: 141.1
pix_abs_0_0_neon: 19.6

ff_pix_abs16_xy2_neon:
pix_abs_0_3_c: 269.1
pix_abs_0_3_neon: 39.3

Tested with:
./tests/checkasm/checkasm --test=motion --bench --disable-linux-perf

Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-06-28 00:51:39 +03:00
J. Dekker
3c694967f8 lavc/aarch64: hevc_sao reschedule slightly
Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-05-26 08:10:41 +02:00
J. Dekker
2e832be322 lavc/aarch64: add hevc sao edge 8x8
bench on AWS Graviton:

hevc_sao_edge_8x8_8_c: 516.0
hevc_sao_edge_8x8_8_neon: 81.0

Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-05-25 08:04:46 +02:00
J. Dekker
92f67e4017 lavc/aarch64: add hevc sao edge 16x16
bench on AWS Graviton:

hevc_sao_edge_16x16_8_c: 1857.0
hevc_sao_edge_16x16_8_neon: 211.0
hevc_sao_edge_32x32_8_c: 7802.2
hevc_sao_edge_32x32_8_neon: 808.2
hevc_sao_edge_48x48_8_c: 16764.2
hevc_sao_edge_48x48_8_neon: 1796.5
hevc_sao_edge_64x64_8_c: 32647.5
hevc_sao_edge_64x64_8_neon: 3118.5

Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-05-25 08:04:39 +02:00
J. Dekker
d957ee34a6 lavc/aarch64: fix hevc sao band filter
The SAO band filter can be called with non-multiples of 8, we round up
to the nearest multiple of 8 to account for this.

Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-05-25 08:04:35 +02:00
Andre Kempe
861285c146 arm64: Fix wrong BTI landing pad
This patch fixes a wrong type of BTI landing pad when branching to
functions instantiated via the fft*_neon macro.

Although the previously employed paciasp instruction serves as a landing
pad, for the ways that this function is invoked it is the wrong type, resulting
in an unexpected termination of the running process.

Signed-off-by: André Kempe <andre.kempe@arm.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-04-26 10:26:49 +03:00
Ben Avison
6eee650289 avcodec/vc1: Arm 64-bit NEON unescape fast path
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows.

vc1dsp.vc1_unescape_buffer_c: 655617.7
vc1dsp.vc1_unescape_buffer_neon: 118237.0

Signed-off-by: Ben Avison <bavison@riscosopen.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-04-01 10:03:34 +03:00
Ben Avison
5379412ed0 avcodec/idctdsp: Arm 64-bit NEON block add and clamp fast paths
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows.

idctdsp.add_pixels_clamped_c: 313.3
idctdsp.add_pixels_clamped_neon: 24.3
idctdsp.put_pixels_clamped_c: 220.3
idctdsp.put_pixels_clamped_neon: 15.5
idctdsp.put_signed_pixels_clamped_c: 210.5
idctdsp.put_signed_pixels_clamped_neon: 19.5

Signed-off-by: Ben Avison <bavison@riscosopen.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-04-01 10:03:34 +03:00
Ben Avison
501fdc017d avcodec/vc1: Arm 64-bit NEON inverse transform fast paths
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows.

vc1dsp.vc1_inv_trans_4x4_c: 158.2
vc1dsp.vc1_inv_trans_4x4_neon: 65.7
vc1dsp.vc1_inv_trans_4x4_dc_c: 86.5
vc1dsp.vc1_inv_trans_4x4_dc_neon: 26.5
vc1dsp.vc1_inv_trans_4x8_c: 335.2
vc1dsp.vc1_inv_trans_4x8_neon: 106.2
vc1dsp.vc1_inv_trans_4x8_dc_c: 151.2
vc1dsp.vc1_inv_trans_4x8_dc_neon: 25.5
vc1dsp.vc1_inv_trans_8x4_c: 365.7
vc1dsp.vc1_inv_trans_8x4_neon: 97.2
vc1dsp.vc1_inv_trans_8x4_dc_c: 139.7
vc1dsp.vc1_inv_trans_8x4_dc_neon: 16.5
vc1dsp.vc1_inv_trans_8x8_c: 547.7
vc1dsp.vc1_inv_trans_8x8_neon: 137.0
vc1dsp.vc1_inv_trans_8x8_dc_c: 268.2
vc1dsp.vc1_inv_trans_8x8_dc_neon: 30.5

Signed-off-by: Ben Avison <bavison@riscosopen.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-04-01 10:03:34 +03:00
Ben Avison
c62bbd4d20 avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. Note that the C
version can still outperform the NEON version in specific cases. The balance
between different code paths is stream-dependent, but in practice the best
case happens about 5% of the time, the worst case happens about 40% of the
time, and the complexity of the remaining cases fall somewhere in between.
Therefore, taking the average of the best and worst case timings is
probably a conservative estimate of the degree by which the NEON code
improves performance.

vc1dsp.vc1_h_loop_filter4_bestcase_c: 10.7
vc1dsp.vc1_h_loop_filter4_bestcase_neon: 43.5
vc1dsp.vc1_h_loop_filter4_worstcase_c: 184.5
vc1dsp.vc1_h_loop_filter4_worstcase_neon: 73.7
vc1dsp.vc1_h_loop_filter8_bestcase_c: 31.2
vc1dsp.vc1_h_loop_filter8_bestcase_neon: 62.2
vc1dsp.vc1_h_loop_filter8_worstcase_c: 358.2
vc1dsp.vc1_h_loop_filter8_worstcase_neon: 88.2
vc1dsp.vc1_h_loop_filter16_bestcase_c: 51.0
vc1dsp.vc1_h_loop_filter16_bestcase_neon: 107.7
vc1dsp.vc1_h_loop_filter16_worstcase_c: 722.7
vc1dsp.vc1_h_loop_filter16_worstcase_neon: 140.5
vc1dsp.vc1_v_loop_filter4_bestcase_c: 9.7
vc1dsp.vc1_v_loop_filter4_bestcase_neon: 43.0
vc1dsp.vc1_v_loop_filter4_worstcase_c: 178.7
vc1dsp.vc1_v_loop_filter4_worstcase_neon: 69.0
vc1dsp.vc1_v_loop_filter8_bestcase_c: 30.2
vc1dsp.vc1_v_loop_filter8_bestcase_neon: 50.7
vc1dsp.vc1_v_loop_filter8_worstcase_c: 353.0
vc1dsp.vc1_v_loop_filter8_worstcase_neon: 69.2
vc1dsp.vc1_v_loop_filter16_bestcase_c: 60.0
vc1dsp.vc1_v_loop_filter16_bestcase_neon: 90.0
vc1dsp.vc1_v_loop_filter16_worstcase_c: 714.2
vc1dsp.vc1_v_loop_filter16_worstcase_neon: 97.2

Signed-off-by: Ben Avison <bavison@riscosopen.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-04-01 10:03:33 +03:00
Martin Storsjö
a78f136f3f configure: Use a separate config_components.h header for $ALL_COMPONENTS
This avoids unnecessary rebuilds of most source files if only the
list of enabled components has changed, but not the other properties
of the build, set in config.h.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-03-16 14:12:49 +02:00
Andre Kempe
248986a0db arm64: Add Armv8.3-A PAC support to assembly files
This patch adds optional support for Arm Pointer Authentication Codes.

PAC support is turned on or off at compile time using additional
compiler flags. Unless any of these is enabled explicitly, no additional
code will be emitted at all.

Signed-off-by: André Kempe <andre.kempe@arm.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-03-09 15:04:25 +02:00
Andreas Rheinhardt
52e9113695 avcodec/aarch64/idct: Add missing stddef
Fixes checkheaders on aarch64.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-02-21 13:10:04 +01:00
Martin Storsjö
402784ba9f aarch64: h264dsp: Fix incorrectly indented code
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-02-11 10:49:12 +02:00
Martin Storsjö
24b93022fe aarch64: Disable ff_hevc_sao_band_filter_8x8_8_neon out of precaution
While this function on its own passes all of fate-hevc, there's
indications that the function might need to handle widths that
aren't a multiple of 8 (noted in commit
f63f9be37c, which later was
reverted).

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-01-07 22:33:27 +02:00
Martin Storsjö
16fba44b4d Revert "lavc/aarch64: add hevc sao edge 16x16"
This reverts commit a9214a2ca3, as
it breaks fate-hevc.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-01-07 22:33:23 +02:00
Martin Storsjö
df48b1d06f Revert "lavc/aarch64: add hevc sao edge 8x8"
This reverts commit c97ffc1a77, as
it breaks fate-hevc.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-01-07 22:33:19 +02:00
Martin Storsjö
cafed377eb Revert "lavc/aarch64: add hevc sao band 8x8 tiling"
This reverts commit f63f9be37c, as
it breaks fate-hevc.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-01-07 22:33:14 +02:00
J. Dekker
f63f9be37c lavc/aarch64: add hevc sao band 8x8 tiling
bench on AWS Graviton:

hevc_sao_band_8x8_8_c: 317.5
hevc_sao_band_8x8_8_neon: 97.5
hevc_sao_band_16x16_8_c: 1115.0
hevc_sao_band_16x16_8_neon: 322.7
hevc_sao_band_32x32_8_c: 4599.2
hevc_sao_band_32x32_8_neon: 1246.2
hevc_sao_band_48x48_8_c: 10021.7
hevc_sao_band_48x48_8_neon: 2740.5
hevc_sao_band_64x64_8_c: 17635.0
hevc_sao_band_64x64_8_neon: 4875.7

Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-01-04 14:32:26 +01:00
J. Dekker
89a2ed4a8b lavc/aarch64: clean-up sao band 8x8 function formatting
Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-01-04 14:31:56 +01:00
J. Dekker
c97ffc1a77 lavc/aarch64: add hevc sao edge 8x8
bench on AWS Graviton:

hevc_sao_edge_8x8_8_c: 516.0
hevc_sao_edge_8x8_8_neon: 81.0

Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-01-04 14:31:56 +01:00
J. Dekker
a9214a2ca3 lavc/aarch64: add hevc sao edge 16x16
bench on AWS Graviton:

hevc_sao_edge_16x16_8_c: 1857.0
hevc_sao_edge_16x16_8_neon: 211.0
hevc_sao_edge_32x32_8_c: 7802.2
hevc_sao_edge_32x32_8_neon: 808.2
hevc_sao_edge_48x48_8_c: 16764.2
hevc_sao_edge_48x48_8_neon: 1796.5
hevc_sao_edge_64x64_8_c: 32647.5
hevc_sao_edge_64x64_8_neon: 3118.5

Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-01-04 14:31:56 +01:00
Jonathan Wright
08b4716a9e aarch64: Add Armv8.5-A BTI support
Add Branch Target Identifiers (BTIs) to all functions defined in
AArch64 assembly files. Most of the BTI landing pads are added
automatically by the 'function' macro.

BTI support is turned on or off at compile time based on the presence
of the __ARM_FEATURE_BTI_DEFAULT feature macro.

A binary compiled with BTI support can be executed on an Armv8-A
processor without BTI support because the instructions are defined in
NOP space.

Signed-off-by: Jonathan Wright <jonathan.wright@arm.com>
Signed-off-by: Elijah Ahmad <elijah.ahmad@arm.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-11-16 13:43:56 +02:00
Jonathan Wright
6f04cf54f5 aarch64: Use ret x<n> instead of br x<n> where possible
Change AArch64 assembly code to use:
  ret     x<n>
instead of:
  br      x<n>

"ret x<n>" is already used in a lot of places so this patch makes it
consistent across the code base. This does not change behavior or
performance.

In addition, this change reduces the number of landing pads needed in
a subsequent patch to support the Armv8.5-A Branch Target
Identification (BTI) security feature.

Signed-off-by: Jonathan Wright <jonathan.wright@arm.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-11-16 13:43:56 +02:00
Martin Storsjö
fd3bd5c492 aarch64: h264qpel: Do vertical filtering without transposing
This gives rather big speedups on these functions:

Before:
put_h264_qpel_8_mc01_8_neon:     241.0   131.5   138.7
put_h264_qpel_8_mc02_8_neon:     214.7   121.2   127.5
put_h264_qpel_8_mc03_8_neon:     242.5   131.2   135.7
put_h264_qpel_8_mc11_8_neon:     421.2   218.7   251.0
put_h264_qpel_8_mc12_8_neon:     878.0   509.5   537.5
put_h264_qpel_8_mc13_8_neon:     423.7   217.0   252.0
put_h264_qpel_8_mc21_8_neon:     858.2   479.5   514.0
put_h264_qpel_8_mc22_8_neon:     649.7   385.2   403.0
put_h264_qpel_8_mc23_8_neon:     860.2   476.5   517.7
put_h264_qpel_8_mc31_8_neon:     437.2   219.5   252.5
put_h264_qpel_8_mc32_8_neon:     892.5   510.5   546.0
put_h264_qpel_8_mc33_8_neon:     438.2   218.5   257.0
put_h264_qpel_16_mc01_8_neon:    944.2   509.7   546.7
put_h264_qpel_16_mc02_8_neon:    878.7   469.5   509.7
put_h264_qpel_16_mc03_8_neon:    945.7   510.7   557.0
put_h264_qpel_16_mc11_8_neon:   1663.2   858.5   979.5
put_h264_qpel_16_mc12_8_neon:   3510.2  2027.7  2112.7
put_h264_qpel_16_mc13_8_neon:   1664.7   857.5   980.5
put_h264_qpel_16_mc21_8_neon:   3366.2  1928.5  2030.5
put_h264_qpel_16_mc22_8_neon:   2584.7  1514.7  1590.2
put_h264_qpel_16_mc23_8_neon:   3367.7  1927.7  2035.0
put_h264_qpel_16_mc31_8_neon:   1716.7   849.7   997.0
put_h264_qpel_16_mc32_8_neon:   3564.0  2044.2  3835.2
put_h264_qpel_16_mc33_8_neon:   1717.7   863.0   989.5

After:
put_h264_qpel_8_mc01_8_neon:     136.0    73.7    76.0
put_h264_qpel_8_mc02_8_neon:     108.7    65.0    64.0
put_h264_qpel_8_mc03_8_neon:     137.5    72.7    73.0
put_h264_qpel_8_mc11_8_neon:     316.2   159.0   188.5
put_h264_qpel_8_mc12_8_neon:     653.0   375.5   384.7
put_h264_qpel_8_mc13_8_neon:     318.7   165.5   189.5
put_h264_qpel_8_mc21_8_neon:     739.2   385.7   432.5
put_h264_qpel_8_mc22_8_neon:     530.7   295.5   309.5
put_h264_qpel_8_mc23_8_neon:     741.2   393.7   421.0
put_h264_qpel_8_mc31_8_neon:     332.2   162.5   190.0
put_h264_qpel_8_mc32_8_neon:     667.5   378.2   390.5
put_h264_qpel_8_mc33_8_neon:     332.7   166.5   195.5
put_h264_qpel_16_mc01_8_neon:    524.2   285.2   294.0
put_h264_qpel_16_mc02_8_neon:    454.7   252.2   250.2
put_h264_qpel_16_mc03_8_neon:    525.7   286.0   283.0
put_h264_qpel_16_mc11_8_neon:   1243.2   630.7   726.7
put_h264_qpel_16_mc12_8_neon:   2610.2  1479.7  1481.2
put_h264_qpel_16_mc13_8_neon:   1250.5   631.7   727.7
put_h264_qpel_16_mc21_8_neon:   2890.2  1571.2  1679.7
put_h264_qpel_16_mc22_8_neon:   2108.7  1177.5  1223.5
put_h264_qpel_16_mc23_8_neon:   2891.7  1578.7  1667.7
put_h264_qpel_16_mc31_8_neon:   1296.7   630.5   752.5
put_h264_qpel_16_mc32_8_neon:   2664.0  1483.2  1503.5
put_h264_qpel_16_mc33_8_neon:   1297.7   632.5   747.2

I.e. overall a 20%-60% reduction in runtime of these
functions.

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-10-18 14:27:58 +03:00
Martin Storsjö
2d5a7f6d00 arm/aarch64: Improve scheduling in the avg form of h264_qpel
Don't use the loaded registers directly, avoiding stalls on in
order cores. Use vrhadd.u8 with q registers where easily possible.

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-10-18 14:27:36 +03:00
Zhao Zhili
378ad2f8fd lavc/aarch64: fix relocation out of range error
Use a temporary label instead of global function symbol for b.gt.

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-09-25 21:55:29 +03:00
Mikhail Nitenko
d3e56b56ae lavc/aarch64: add pred functions for 10-bit
Benchmarks:                        A53     A72
pred8x8_dc_10_c:                   64.2    49.5
pred8x8_dc_10_neon:                62.0    53.7
pred8x8_dc_128_10_c:               26.0    14.0
pred8x8_dc_128_10_neon:            30.7    17.5
pred8x8_horizontal_10_c:           60.0    27.7
pred8x8_horizontal_10_neon:        38.0    34.0
pred8x8_left_dc_10_c:              42.5    27.5
pred8x8_left_dc_10_neon:           51.0    41.2
pred8x8_mad_cow_dc_0l0_10_c:       55.7    37.2
pred8x8_mad_cow_dc_0l0_10_neon:    50.2    35.2
pred8x8_mad_cow_dc_0lt_10_c:       89.2    67.0
pred8x8_mad_cow_dc_0lt_10_neon:    52.2    46.7
pred8x8_mad_cow_dc_l0t_10_c:       74.7    51.0
pred8x8_mad_cow_dc_l0t_10_neon:    50.5    45.2
pred8x8_mad_cow_dc_l00_10_c:       58.0    38.0
pred8x8_mad_cow_dc_l00_10_neon:    42.5    37.5
pred8x8_plane_10_c:               354.0   288.7
pred8x8_plane_10_neon:            141.0   101.2
pred8x8_top_dc_10_c:               44.5    30.5
pred8x8_top_dc_10_neon:            40.0    31.0
pred8x8_vertical_10_c:             27.5    14.5
pred8x8_vertical_10_neon:          21.0    17.5
pred16x16_plane_10_c:            1242.0  1070.5
pred16x16_plane_10_neon:          324.0   196.7

Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-08-21 00:06:26 +03:00
Mikhail Nitenko
43ca887bc2 lavc/aarch64: h264, add chroma loop filters for 10bit
Benchmarks:                                             A53     A72
h264_h_loop_filter_chroma422_10bpp_c:                  282.7   114.2
h264_h_loop_filter_chroma422_10bpp_neon:               109.5    78.5
h264_h_loop_filter_chroma_10bpp_c:                     165.0    81.5
h264_h_loop_filter_chroma_10bpp_neon:                  120.0    76.7
h264_h_loop_filter_chroma_intra422_10bpp_c:            323.7   124.2
h264_h_loop_filter_chroma_intra422_10bpp_neon:         155.0   102.7
h264_h_loop_filter_chroma_intra_10bpp_c:               121.0    49.5
h264_h_loop_filter_chroma_intra_10bpp_neon:             79.7    53.7
h264_h_loop_filter_chroma_mbaff422_10bpp_c:            188.5    75.0
h264_h_loop_filter_chroma_mbaff422_10bpp_neon:         120.0    75.5
h264_h_loop_filter_chroma_mbaff_intra422_10bpp_c:      116.7    46.0
h264_h_loop_filter_chroma_mbaff_intra422_10bpp_neon:    79.7    53.7
h264_h_loop_filter_chroma_mbaff_intra_10bpp_c:          63.0    27.2
h264_h_loop_filter_chroma_mbaff_intra_10bpp_neon:       48.5    34.0
h264_v_loop_filter_chroma_10bpp_c:                     258.7   135.5
h264_v_loop_filter_chroma_10bpp_neon:                   71.2    51.0
h264_v_loop_filter_chroma_intra_10bpp_c:               158.0    70.7
h264_v_loop_filter_chroma_intra_10bpp_neon:             48.7    31.5

Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-08-21 00:06:26 +03:00
Mikhail Nitenko
756d2e087a lavc/aarch64: move transpose_4x8H to neon.S
transpose_4x8H was declared in vp9lpf_16bpp_neon, however this macro is
not unique to vp9 and could be used elsewhere.

Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-08-21 00:06:26 +03:00
Martin Storsjö
c60b76d0c8 aarch64: h264dsp: Fix indentation of some functions to match the rest
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-08-08 23:18:43 +03:00
Martin Storsjö
e86ec831b0 aarch64: h264dsp: Remove unnecessary sign extensions
These became unnecessary when the stride arguments were changed from
int to ptrdiff_t in bc26fe8927
(0576ef466d) and
d5d699ab6e
(aa844dc46f).

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-08-08 23:18:43 +03:00
Andreas Rheinhardt
afc95a10ac avcodec/h264dsp, h264idct: Fix lengths of array parameters
Fixes many -Warray-parameter warnings from GCC 11.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2021-08-08 17:44:57 +02:00
Martin Storsjö
f27e3ccf06 aarch64: hevc_idct: Fix overflows in idct_dc
This is marginally slower, but correct for all input values.
The previous implementation failed with certain input seeds, e.g.
"checkasm --test=hevc_idct 98".

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-05-22 00:08:03 +03:00
Andreas Rheinhardt
7c1f347b18 avcodec: Remove deprecated old encode/decode APIs
Deprecated in commits 7fc329e2dd
and 31f6a4b4b8.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2021-04-27 10:43:12 -03:00
Andreas Rheinhardt
f3c197b129 Include attributes.h directly
Some files currently rely on libavutil/cpu.h to include it for them;
yet said file won't use include it any more after the currently
deprecated functions are removed, so include attributes.h directly.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2021-04-19 14:34:10 +02:00
Mikhail Nitenko
84ac1440b2 lavc/aarch64: add pred16x16 10-bit functions
Benchmarks:                      A53     A72
pred16x16_dc_10_c:              136.0   124.0
pred16x16_dc_10_neon:           121.2   106.0
pred16x16_horizontal_10_c:      155.0    73.2
pred16x16_horizontal_10_neon:    82.2    67.7
pred16x16_top_dc_10_c:          106.0    93.7
pred16x16_top_dc_10_neon:        87.7    77.2
pred16x16_vertical_10_c:         83.0    67.7
pred16x16_vertical_10_neon:      54.2    61.7

Some functions work slower than C and are left commented out.
2021-04-19 09:01:14 +02:00
Mikhail Nitenko
6b2e7dc828 lavc/aarch64: change h264pred_init structure
Change structure to allow the addition of other bit depths.
2021-04-19 09:00:58 +02:00
Martin Storsjö
870bfe16a1 aarch64: h264pred: Optimize the inner loop of existing 8 bit functions
Move the loop counter decrement further from the branch instruction,
this hides the latency of the decrement.

In loops that first load, then store (the horizontal prediction cases),
do the decrement after the load (where the next instruction would
stall a bit anyway, waiting for the result of the load).

In loops that store twice using the same destination register,
also do the decrement between the two stores (as the second store
would need to wait for the updated destination register from the
first instruction).

In loops that store twice to two different destination registers,
do the decrement before both stores, to do it as soon before the
branch as possible.

This gives minor (1-2 cycle) speedups in most cases (modulo measurement
noise), but the horizontal prediction functions get a rather notable
speedup on the Cortex A53.

Before:                     Cortex A53     A72     A73
pred8x8_dc_8_neon:                60.7    46.2    39.2
pred8x8_dc_128_8_neon:            30.7    18.0    14.0
pred8x8_horizontal_8_neon:        42.2    29.2    18.5
pred8x8_left_dc_8_neon:           52.7    36.2    32.2
pred8x8_mad_cow_dc_0l0_8_neon:    48.2    27.7    25.7
pred8x8_mad_cow_dc_0lt_8_neon:    52.5    33.2    34.7
pred8x8_mad_cow_dc_l0t_8_neon:    52.5    31.7    33.2
pred8x8_mad_cow_dc_l00_8_neon:    43.2    27.0    25.5
pred8x8_plane_8_neon:            112.2    86.2    88.2
pred8x8_top_dc_8_neon:            40.7    23.0    21.2
pred8x8_vertical_8_neon:          27.2    15.5    14.0
pred16x16_dc_8_neon:              91.0    73.2    70.5
pred16x16_dc_128_8_neon:          43.0    34.7    30.7
pred16x16_horizontal_8_neon:      86.0    49.7    44.7
pred16x16_left_dc_8_neon:         87.0    67.2    67.5
pred16x16_plane_8_neon:          236.0   175.7   173.0
pred16x16_top_dc_8_neon:          53.2    39.0    41.7
pred16x16_vertical_8_neon:        41.7    29.7    31.0

After:
pred8x8_dc_8_neon:                59.0    46.7    42.5
pred8x8_dc_128_8_neon:            28.2    18.0    14.0
pred8x8_horizontal_8_neon:        34.2    29.2    18.5
pred8x8_left_dc_8_neon:           51.0    38.2    32.7
pred8x8_mad_cow_dc_0l0_8_neon:    46.7    28.2    26.2
pred8x8_mad_cow_dc_0lt_8_neon:    55.2    33.7    37.5
pred8x8_mad_cow_dc_l0t_8_neon:    51.2    31.7    37.2
pred8x8_mad_cow_dc_l00_8_neon:    41.7    27.5    26.0
pred8x8_plane_8_neon:            111.5    86.5    89.5
pred8x8_top_dc_8_neon:            39.0    23.2    21.0
pred8x8_vertical_8_neon:          27.2    16.0    14.0
pred16x16_dc_8_neon:              85.0    70.2    70.5
pred16x16_dc_128_8_neon:          42.0    30.0    30.7
pred16x16_horizontal_8_neon:      66.5    49.5    42.5
pred16x16_left_dc_8_neon:         81.0    66.5    67.5
pred16x16_plane_8_neon:          235.0   175.7   173.0
pred16x16_top_dc_8_neon:          52.0    39.0    41.7
pred16x16_vertical_8_neon:        40.2    33.2    31.0

Despite this, a number of these functions still are slower than
what e.g. GCC 7 generates - this shows the relative speedup of the
neon codepaths over the compiler generated ones:

                           Cortex A53    A72    A73
pred8x8_dc_8_neon:               0.86   0.65   1.04
pred8x8_dc_128_8_neon:           0.59   0.44   0.62
pred8x8_horizontal_8_neon:       1.51   0.58   1.30
pred8x8_left_dc_8_neon:          0.72   0.56   0.89
pred8x8_mad_cow_dc_0l0_8_neon:   0.93   0.93   1.37
pred8x8_mad_cow_dc_0lt_8_neon:   1.37   1.41   1.68
pred8x8_mad_cow_dc_l0t_8_neon:   1.21   1.17   1.32
pred8x8_mad_cow_dc_l00_8_neon:   1.24   1.19   1.60
pred8x8_plane_8_neon:            3.36   3.58   3.76
pred8x8_top_dc_8_neon:           0.97   0.99   1.43
pred8x8_vertical_8_neon:         0.86   0.78   1.18
pred16x16_dc_8_neon:             1.20   1.06   1.49
pred16x16_dc_128_8_neon:         0.83   0.95   0.99
pred16x16_horizontal_8_neon:     1.78   0.96   1.59
pred16x16_left_dc_8_neon:        1.06   0.96   1.32
pred16x16_plane_8_neon:          5.78   6.49   7.19
pred16x16_top_dc_8_neon:         1.48   1.53   1.94
pred16x16_vertical_8_neon:       1.39   1.34   1.98

In particular, on Cortex A72, many of these functions are slower
than the compiler generated code, while they're more beneficial on
e.g. the Cortex A73.

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-04-14 15:23:44 +03:00
James Almer
f1a894f9d3 avcodec: add missing FF_API_OLD_ENCDEC wrappers to xmm clobber functions
Signed-off-by: James Almer <jamrial@gmail.com>
2021-02-26 19:26:31 -03:00
Josh Dekker
7ac41e0db2 lavc/aarch64: add HEVC sao_band NEON
Only works for 8x8.

Signed-off-by: Josh Dekker <josh@itanimul.li>
2021-02-18 14:12:01 +01:00
Josh Dekker
75c2ddfa61 lavc/aarch64: add HEVC idct_dc NEON
Signed-off-by: Josh Dekker <josh@itanimul.li>
2021-02-18 14:12:01 +01:00
Reimar Döffinger
00c916ef61 lavc/aarch64: port HEVC add_residual NEON
Speedup is fairly small, around 1.5%, but these are fairly simple.

Signed-off-by: Josh Dekker <josh@itanimul.li>
2021-02-18 14:11:57 +01:00
Reimar Döffinger
30f80d855b lavc/aarch64: port HEVC SIMD idct NEON
Makes SIMD-optimized 8x8 and 16x16 idcts for 8 and 10 bit depth
available on aarch64.
For a UHD HDR (10 bit) sample video these were consuming the most time
and this optimization reduced overall decode time from 19.4s to 16.4s,
approximately 15% speedup.
Test sample was the first 300 frames of "LG 4K HDR Demo - New York.ts",
running on Apple M1.

Signed-off-by: Josh Dekker <josh@itanimul.li>
2021-02-18 14:11:53 +01:00
Anton Khirnov
c8c2dfbc37 lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h
That is a more appropriate place for it.
2021-01-01 14:11:01 +01:00
Martin Storsjö
7168adedbc libavcodec: aarch64: Add a NEON implementation of pixblockdsp
Cortex A53    A72    A73
get_pixels_c:                140.7   87.7   72.5
get_pixels_neon:              46.0   20.0   19.5
get_pixels_unaligned_c:      140.7   87.7   73.0
get_pixels_unaligned_neon:    49.2   20.2   26.2
diff_pixels_c:               209.7  133.7  138.7
diff_pixels_neon:             54.2   31.7   23.5
diff_pixels_unaligned_c:     209.7  134.2  139.0
diff_pixels_unaligned_neon:   68.0   27.7   41.7

Signed-off-by: Martin Storsjö <martin@martin.st>
2020-05-15 23:37:55 +03:00
Carl Eugen Hoyos
34d7c8d942 lavc/aarch64: Remove unneeded file vp9mc_aarch64.c 2020-03-11 14:36:07 +01:00
Carl Eugen Hoyos
951bd25572 lavc/aarch64: Fix suffix of new file vp9mc_aarch64. 2020-03-11 14:29:22 +01:00
Carl Eugen Hoyos
213c796561 lavc/aarch64: Fix compilation with --disable-neon
Fixes ticket #8565.
2020-03-11 14:16:48 +01:00
Carl Eugen Hoyos
9a21754904 lavc/aarch64: Move non-neon vp9 copy functions out of neon source file.
Fixes part of ticket #8565.
2020-03-11 14:16:40 +01:00
Lynne
aac382e9e5 aarch64/opusdsp: do not clobber register v8
A part of v8-v15 needs to be preserved across calls.
2019-08-15 13:29:22 +01:00
Lynne
f62ee527cb aarch64/asm-offsets: remove old CELT offsets
They're not used and they're incorrect.
2019-05-14 23:41:24 +01:00
Lynne
4d2f62150d aarch64/opusdsp: implement NEON accelerated postfilter and deemphasis
153372 UNITS in postfilter_c,   65536 runs,      0 skips
73164 UNITS in postfilter_neon,   65536 runs,      0 skips -> 2.1x speedup

80591 UNITS in deemphasis_c,  131072 runs,      0 skips
43969 UNITS in deemphasis_neon,  131072 runs,      0 skips -> 1.83x speedup

Total decoder speedup: ~15% on a Raspberry Pi 3 (from 28.1x to 33.5x realtime)

Deemphasis SIMD based on the following unrolling:
const float c1 = CELT_EMPH_COEFF, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;
float state = coeff;

for (int i = 0; i < len; i += 4) {
    y[0] = x[0] + c1*state;
    y[1] = x[1] + c2*state + c1*x[0];
    y[2] = x[2] + c3*state + c1*x[1] + c2*x[0];
    y[3] = x[3] + c4*state + c1*x[2] + c2*x[1] + c3*x[0];

    state = y[3];
    y += 4;
    x += 4;
}

Unlike the x86 version, duplication is used instead of pslldq so
the structure and tables are different.
2019-04-10 01:08:54 +02:00
James Almer
92219ef4ac Merge commit '186bd30aa3b6c2b29b4dbf18278700b572068b1e'
* commit '186bd30aa3b6c2b29b4dbf18278700b572068b1e':
  h264/arm64: implement missing 4:2:2 chroma loop filter neon functions

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:29:41 -03:00
James Almer
5c363d3e59 Merge commit '7e42d5f0ab2aeac811fd01e122627c9198b13f01'
* commit '7e42d5f0ab2aeac811fd01e122627c9198b13f01':
  aarch64: vp8: Optimize vp8_idct_add_neon for aarch64

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:22:29 -03:00
James Almer
409e684e79 Merge commit '49f9c4272c4029b57ff300d908ba03c6332fc9c4'
* commit '49f9c4272c4029b57ff300d908ba03c6332fc9c4':
  aarch64: vp8: Skip saturating in shrn in ff_vp8_idct_add_neon

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:21:46 -03:00
James Almer
fbd607dd56 Merge commit '37394ef01b040605f8e1c98e73aa12b1c0bcba07'
* commit '37394ef01b040605f8e1c98e73aa12b1c0bcba07':
  aarch64: vp8: Optimize put_epel16_h6v6 with vp8_epel8_v6_y2

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:20:05 -03:00
James Almer
34a0a9746b Merge commit 'e39a9212ab37a55b346801c77487d8a47b6f9fe2'
* commit 'e39a9212ab37a55b346801c77487d8a47b6f9fe2':
  aarch64: vp8: Port bilin functions from arm version

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:18:42 -03:00
James Almer
2ac399d7fa Merge commit '58d154922707bfeb873cb3a7476e0f94b17463dd'
* commit '58d154922707bfeb873cb3a7476e0f94b17463dd':
  aarch64: vp8: Port epel4 functions from arm version

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:17:33 -03:00
James Almer
c6892f59eb Merge commit 'cc7ba00c35faf0478f1f56215e926f70ccb31282'
* commit 'cc7ba00c35faf0478f1f56215e926f70ccb31282':
  aarch64: vp8: Port missing epel8 functions from arm version

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:16:43 -03:00
James Almer
79025da3f2 Merge commit '52c9b0a6c0d02cff6caebcf6989e565e05b55200'
* commit '52c9b0a6c0d02cff6caebcf6989e565e05b55200':
  aarch64: vp8: Port vp8_luma_dc_wht and vp8_idct_dc_add4uv from arm version

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:14:40 -03:00
James Almer
39278ff0de Merge commit 'c513fcd7d235aa4cef45a6c3125bd4dcc03bf276'
* commit 'c513fcd7d235aa4cef45a6c3125bd4dcc03bf276':
  aarch64: vp8: Fix a typo in a comment

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:13:32 -03:00
James Almer
4f9a8d3fe2 Merge commit 'f1011ea28a4048ddec97794ca3e9901474fe055f'
* commit 'f1011ea28a4048ddec97794ca3e9901474fe055f':
  aarch64: vp8: Reorder the function pointer inits to match the arm original

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:09:11 -03:00
James Almer
398000abcf Merge commit '85bfaa4949f4afcde19061def3e8a18988964858'
* commit '85bfaa4949f4afcde19061def3e8a18988964858':
  aarch64: vp8: Use the proper aarch64 form for conditional branches

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:06:43 -03:00
James Almer
a2ae381b5a Merge commit '0801853e640624537db386727b36fa97aa6258e7'
* commit '0801853e640624537db386727b36fa97aa6258e7':
  libavcodec: vp8 neon optimizations for aarch64

See 833fed5253

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:05:52 -03:00
Janne Grunau
186bd30aa3 h264/arm64: implement missing 4:2:2 chroma loop filter neon functions 2019-02-27 21:57:05 +01:00
Carl Eugen Hoyos
7e4d3dbe18 lavc/aarch64/h264dsp_init: Only use neon horizontal intra loopfilter for 4:2:0. 2019-02-20 23:56:21 +01:00
James Almer
aa844dc46f aarch64/h264dsp: change loop filter stride argument to ptrdiff_t
This was missed in d5d699ab6e

Signed-off-by: James Almer <jamrial@gmail.com>
2019-02-20 19:38:46 -03:00
James Almer
e4e04dce1f Merge commit '28a8b5413b64b831dfb8650208bccd8b78360484'
* commit '28a8b5413b64b831dfb8650208bccd8b78360484':
  h264/aarch64: add intra loop filter neon asm

Merged-by: James Almer <jamrial@gmail.com>
2019-02-20 15:42:01 -03:00
James Almer
4dc1f06f0c Merge commit '846c3d6aca5484904e60946c4fe8b8833bc07f92'
* commit '846c3d6aca5484904e60946c4fe8b8833bc07f92':
  h264/aarch64: optimize neon loop filter

Merged-by: James Almer <jamrial@gmail.com>
2019-02-20 15:41:03 -03:00
James Almer
5ca7eb36b7 Merge commit 'bb515e3a735f526ccb1068031e289eb5aeb69e22'
* commit 'bb515e3a735f526ccb1068031e289eb5aeb69e22':
  h264/aarch64: sign extend int stride in loop filter asm

Merged-by: James Almer <jamrial@gmail.com>
2019-02-20 14:50:37 -03:00
Martin Storsjö
c8bc9d1380 aarch64: vp8: Move the vp8dsp makefile entries to the right places
Even if NEON would be disabled, the init functions should be built
as they are called as long as ARCH_AARCH64 is set.

These functions are part of a generic DSP subsytem, not tied directly
to one decoder. (They should be built if the vp7 decoder is enabled,
even if the vp8 decoder is disabled.)

Signed-off-by: Martin Storsjö <martin@martin.st>
(cherry picked from commit b4b27dce95)
2019-02-19 23:43:17 +02:00
Martin Storsjö
fecf75a5c4 aarch64: vp8: Remove superfluous includes
This fixes building with MSVC, which lacks unistd.h.

Signed-off-by: Martin Storsjö <martin@martin.st>
(cherry picked from commit ad32f7b126)
2019-02-19 23:42:16 +02:00
Martin Storsjö
7ddfa5e908 aarch64: vp8: Fix assembling with armasm64
Signed-off-by: Martin Storsjö <martin@martin.st>
(cherry picked from commit 2eeac79936)
2019-02-19 23:42:03 +02:00
Martin Storsjö
c950beb68d aarch64: vp8: Fix assembling with clang
This also partially fixes assembling with MS armasm64 (via
gas-preprocessor).

The movrel macro invocations need to pass the offset via a separate
parameter. Mach-o and COFF relocations don't allow a negative
offset to a symbol, which is handled properly if the offset is passed
via the parameter. If no offset parameter is given, the macro
evaluates to something like "adrp x17, subpel_filters-16+(0)", which
older clang versions also fail to parse (the older clang versions
only support one single offset term, although it can be a parenthesis.

Signed-off-by: Martin Storsjö <martin@martin.st>
(cherry picked from commit 26d7af4c38)
2019-02-19 23:41:47 +02:00
Martin Storsjö
7e42d5f0ab aarch64: vp8: Optimize vp8_idct_add_neon for aarch64
The previous version was a pretty exact translation of the arm
version. This version does do some unnecessary arithemetic (it does
more operations on vectors that are only half filled; it does 4
uaddw and 4 sqxtun instead of 2 of each), but it reduces the overhead
of packing data together (which could be done for free in the arm
version).

This gives a decent speedup on Cortex A53, a minor speedup on
A72 and a very minor slowdown on Cortex A73.

Before:        Cortex A53    A72    A73
vp8_idct_add_neon:   79.7   67.5   65.0
After:
vp8_idct_add_neon:   67.7   64.8   66.7

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:28 +02:00
Martin Storsjö
49f9c4272c aarch64: vp8: Skip saturating in shrn in ff_vp8_idct_add_neon
The original arm version didn't do saturation here. This probably
doesn't make any difference for performance, but reduces the
differences.

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:24 +02:00
Martin Storsjö
37394ef01b aarch64: vp8: Optimize put_epel16_h6v6 with vp8_epel8_v6_y2
This makes it similar to put_epel16_v6, and gives a large speedup
on Cortex A53, a minor speedup on A72 and a very minor slowdown on
A73.

Before:                 Cortex A53     A72     A73
vp8_put_epel16_h6v6_neon:   2211.4  1586.5  1431.7
After:
vp8_put_epel16_h6v6_neon:   1736.9  1522.0  1448.1

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:21 +02:00
Martin Storsjö
e39a9212ab aarch64: vp8: Port bilin functions from arm version
Cortex A53     A72     A73
vp8_put_bilin4_h_c:        303.8   102.2   161.8
vp8_put_bilin4_h_neon:     100.0    40.9    41.2
vp8_put_bilin4_hv_c:       322.8   201.0   305.9
vp8_put_bilin4_hv_neon:    156.8    72.6    77.0
vp8_put_bilin4_v_c:        304.7   101.7   166.5
vp8_put_bilin4_v_neon:      82.7    41.2    33.0
vp8_put_bilin8_h_c:       1192.7   352.5   623.8
vp8_put_bilin8_h_neon:     213.5    70.2    87.8
vp8_put_bilin8_hv_c:      1098.6   769.2  1041.9
vp8_put_bilin8_hv_neon:    324.0   123.5   146.0
vp8_put_bilin8_v_c:       1193.9   350.4   617.7
vp8_put_bilin8_v_neon:     183.9    60.7    64.7
vp8_put_bilin16_h_c:      2353.1   671.2  1223.3
vp8_put_bilin16_h_neon:    261.9   140.7   145.0
vp8_put_bilin16_hv_c:     2453.2  1470.9  2355.2
vp8_put_bilin16_hv_neon:   383.9   196.0   217.0
vp8_put_bilin16_v_c:      2349.3   669.8  1251.2
vp8_put_bilin16_v_neon:    202.9   110.7    96.2

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:14 +02:00
Martin Storsjö
58d1549227 aarch64: vp8: Port epel4 functions from arm version
Cortex A53    A72    A73
vp8_put_epel4_h4_c:        631.4  291.7  367.8
vp8_put_epel4_h4_neon:     241.0  131.0  155.7
vp8_put_epel4_h4v4_c:      967.5  529.3  667.7
vp8_put_epel4_h4v4_neon:   429.3  241.8  279.7
vp8_put_epel4_h4v6_c:     1374.7  657.5  864.5
vp8_put_epel4_h4v6_neon:   515.5  295.5  334.7
vp8_put_epel4_h6_c:        851.0  421.0  486.0
vp8_put_epel4_h6_neon:     321.5  195.0  217.7
vp8_put_epel4_h6v4_c:     1111.3  621.1  781.2
vp8_put_epel4_h6v4_neon:   539.2  328.0  365.3
vp8_put_epel4_h6v6_c:     1561.3  763.3  999.7
vp8_put_epel4_h6v6_neon:   645.5  401.0  434.7
vp8_put_epel4_v4_c:        663.8  298.3  357.0
vp8_put_epel4_v4_neon:     116.0   81.5   72.5
vp8_put_epel4_v6_c:        870.5  437.0  507.4
vp8_put_epel4_v6_neon:     147.7  108.8   92.0

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:11 +02:00
Martin Storsjö
cc7ba00c35 aarch64: vp8: Port missing epel8 functions from arm version
Cortex A53     A72     A73
vp8_put_epel8_h4_c:       2594.8  1159.6  1374.8
vp8_put_epel8_h4_neon:     506.4   244.2   314.0
vp8_put_epel8_h6_c:       3445.8  1677.1  1811.3
vp8_put_epel8_h6_neon:     634.4   371.7   433.0
vp8_put_epel8_v4_c:       2614.0  1174.8  1378.0
vp8_put_epel8_v4_neon:     321.0   221.7   235.8
vp8_put_epel8_v6_c:       3635.5  1703.0  2079.2
vp8_put_epel8_v6_neon:     416.9   317.0   295.5

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:08 +02:00
Martin Storsjö
52c9b0a6c0 aarch64: vp8: Port vp8_luma_dc_wht and vp8_idct_dc_add4uv from arm version
Cortex A53    A72    A73
vp8_luma_dc_wht_c:        115.7   75.7   90.7
vp8_luma_dc_wht_neon:      60.7   41.2   45.7
vp8_idct_dc_add4uv_c:     376.1  262.9  282.5
vp8_idct_dc_add4uv_neon:   52.0   29.0   37.0

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:04 +02:00
Martin Storsjö
c513fcd7d2 aarch64: vp8: Fix a typo in a comment
Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:00 +02:00
Martin Storsjö
f1011ea28a aarch64: vp8: Reorder the function pointer inits to match the arm original
Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:45:56 +02:00
Martin Storsjö
b4b27dce95 aarch64: vp8: Move the vp8dsp makefile entries to the right places
Even if NEON would be disabled, the init functions should be built
as they are called as long as ARCH_AARCH64 is set.

These functions are part of a generic DSP subsytem, not tied directly
to one decoder. (They should be built if the vp7 decoder is enabled,
even if the vp8 decoder is disabled.)

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:45:53 +02:00
Martin Storsjö
ad32f7b126 aarch64: vp8: Remove superfluous includes
This fixes building with MSVC, which lacks unistd.h.

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:45:50 +02:00
Martin Storsjö
85bfaa4949 aarch64: vp8: Use the proper aarch64 form for conditional branches
The previous form also does seem to assemble on current tools,
but I think it might fail on some older aarch64 tools.

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:45:47 +02:00
Martin Storsjö
2eeac79936 aarch64: vp8: Fix assembling with armasm64
Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:45:44 +02:00
Martin Storsjö
26d7af4c38 aarch64: vp8: Fix assembling with clang
This also partially fixes assembling with MS armasm64 (via
gas-preprocessor).

The movrel macro invocations need to pass the offset via a separate
parameter. Mach-o and COFF relocations don't allow a negative
offset to a symbol, which is handled properly if the offset is passed
via the parameter. If no offset parameter is given, the macro
evaluates to something like "adrp x17, subpel_filters-16+(0)", which
older clang versions also fail to parse (the older clang versions
only support one single offset term, although it can be a parenthesis.

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:45:41 +02:00
Magnus Röös
0801853e64 libavcodec: vp8 neon optimizations for aarch64
Partial port of the ARM Neon for aarch64.

Benchmarks from fate:

benchmarking with Linux Perf Monitoring API
nop: 58.6
checkasm: using random seed 1760970128
NEON:
 - vp8dsp.idct       [OK]
 - vp8dsp.mc         [OK]
 - vp8dsp.loopfilter [OK]
checkasm: all 21 tests passed
vp8_idct_add_c: 201.6
vp8_idct_add_neon: 83.1
vp8_idct_dc_add_c: 107.6
vp8_idct_dc_add_neon: 33.8
vp8_idct_dc_add4y_c: 426.4
vp8_idct_dc_add4y_neon: 59.4
vp8_loop_filter8uv_h_c: 688.1
vp8_loop_filter8uv_h_neon: 216.3
vp8_loop_filter8uv_inner_h_c: 649.3
vp8_loop_filter8uv_inner_h_neon: 195.3
vp8_loop_filter8uv_inner_v_c: 544.8
vp8_loop_filter8uv_inner_v_neon: 131.3
vp8_loop_filter8uv_v_c: 706.1
vp8_loop_filter8uv_v_neon: 141.1
vp8_loop_filter16y_h_c: 668.8
vp8_loop_filter16y_h_neon: 242.8
vp8_loop_filter16y_inner_h_c: 647.3
vp8_loop_filter16y_inner_h_neon: 224.6
vp8_loop_filter16y_inner_v_c: 647.8
vp8_loop_filter16y_inner_v_neon: 128.8
vp8_loop_filter16y_v_c: 721.8
vp8_loop_filter16y_v_neon: 154.3
vp8_loop_filter_simple_h_c: 387.8
vp8_loop_filter_simple_h_neon: 187.6
vp8_loop_filter_simple_v_c: 384.1
vp8_loop_filter_simple_v_neon: 78.6
vp8_put_epel8_h4v4_c: 3971.1
vp8_put_epel8_h4v4_neon: 855.1
vp8_put_epel8_h4v6_c: 5060.1
vp8_put_epel8_h4v6_neon: 989.6
vp8_put_epel8_h6v4_c: 4320.8
vp8_put_epel8_h6v4_neon: 1007.3
vp8_put_epel8_h6v6_c: 5449.3
vp8_put_epel8_h6v6_neon: 1158.1
vp8_put_epel16_h6_c: 6683.8
vp8_put_epel16_h6_neon: 831.8
vp8_put_epel16_h6v6_c: 11110.8
vp8_put_epel16_h6v6_neon: 2214.8
vp8_put_epel16_v6_c: 7024.8
vp8_put_epel16_v6_neon: 799.6
vp8_put_pixels8_c: 112.8
vp8_put_pixels8_neon: 78.1
vp8_put_pixels16_c: 131.3
vp8_put_pixels16_neon: 129.8

This contains a fix to include guards by Carl Eugen Hoyos.

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:45:33 +02:00
Carl Eugen Hoyos
ed20fbcd48 lavc/aarch64/vp8dsp: Fix the include guard.
Fixes fate-source.
2019-01-31 22:35:44 +01:00
Magnus Röös
833fed5253 libavcodec: vp8 neon optimizations for aarch64
Partial port of the ARM Neon for aarch64.

Benchmarks from fate:

benchmarking with Linux Perf Monitoring API
nop: 58.6
checkasm: using random seed 1760970128
NEON:
 - vp8dsp.idct       [OK]
 - vp8dsp.mc         [OK]
 - vp8dsp.loopfilter [OK]
checkasm: all 21 tests passed
vp8_idct_add_c: 201.6
vp8_idct_add_neon: 83.1
vp8_idct_dc_add_c: 107.6
vp8_idct_dc_add_neon: 33.8
vp8_idct_dc_add4y_c: 426.4
vp8_idct_dc_add4y_neon: 59.4
vp8_loop_filter8uv_h_c: 688.1
vp8_loop_filter8uv_h_neon: 216.3
vp8_loop_filter8uv_inner_h_c: 649.3
vp8_loop_filter8uv_inner_h_neon: 195.3
vp8_loop_filter8uv_inner_v_c: 544.8
vp8_loop_filter8uv_inner_v_neon: 131.3
vp8_loop_filter8uv_v_c: 706.1
vp8_loop_filter8uv_v_neon: 141.1
vp8_loop_filter16y_h_c: 668.8
vp8_loop_filter16y_h_neon: 242.8
vp8_loop_filter16y_inner_h_c: 647.3
vp8_loop_filter16y_inner_h_neon: 224.6
vp8_loop_filter16y_inner_v_c: 647.8
vp8_loop_filter16y_inner_v_neon: 128.8
vp8_loop_filter16y_v_c: 721.8
vp8_loop_filter16y_v_neon: 154.3
vp8_loop_filter_simple_h_c: 387.8
vp8_loop_filter_simple_h_neon: 187.6
vp8_loop_filter_simple_v_c: 384.1
vp8_loop_filter_simple_v_neon: 78.6
vp8_put_epel8_h4v4_c: 3971.1
vp8_put_epel8_h4v4_neon: 855.1
vp8_put_epel8_h4v6_c: 5060.1
vp8_put_epel8_h4v6_neon: 989.6
vp8_put_epel8_h6v4_c: 4320.8
vp8_put_epel8_h6v4_neon: 1007.3
vp8_put_epel8_h6v6_c: 5449.3
vp8_put_epel8_h6v6_neon: 1158.1
vp8_put_epel16_h6_c: 6683.8
vp8_put_epel16_h6_neon: 831.8
vp8_put_epel16_h6v6_c: 11110.8
vp8_put_epel16_h6v6_neon: 2214.8
vp8_put_epel16_v6_c: 7024.8
vp8_put_epel16_v6_neon: 799.6
vp8_put_pixels8_c: 112.8
vp8_put_pixels8_neon: 78.1
vp8_put_pixels16_c: 131.3
vp8_put_pixels16_neon: 129.8

Signed-off-by: Magnus Röös <mla2.roos@gmail.com>
2019-01-31 20:17:51 +01:00
Janne Grunau
28a8b5413b h264/aarch64: add intra loop filter neon asm
Add my neon asm from x264 relicensed under the LGPL 2.1 or later. Ported
(x264 uses nv12 chroma) and optimized.

Cycle count for checkasm --bench on a Snapdragon 820e:
h264_h_loop_filter_luma_intra_8bpp_c: 60.0
h264_h_loop_filter_luma_intra_8bpp_neon: 54.2
h264_v_loop_filter_luma_intra_8bpp_c: 148.3
h264_v_loop_filter_luma_intra_8bpp_neon: 73.8
h264_h_loop_filter_chroma_intra_8bpp_c: 27.8
h264_h_loop_filter_chroma_intra_8bpp_neon: 21.4
h264_h_loop_filter_chroma_mbaff_intra_8bpp_c: 15.8
h264_h_loop_filter_chroma_mbaff_intra_8bpp_neon: 15.7
h264_v_loop_filter_chroma_intra_8bpp_c: 45.8
h264_v_loop_filter_chroma_intra_8bpp_neon: 17.3
2019-01-26 12:05:10 +01:00
Janne Grunau
846c3d6aca h264/aarch64: optimize neon loop filter
Exit as soon as possible if no filtering will be done.

Improves the checkasm --bench cycle count on a Snapdragon 820e:
h264_h_loop_filter_luma_8bpp_c:      72.4 ->  72.5
h264_h_loop_filter_luma_8bpp_neon:   97.1 ->  56.3
h264_v_loop_filter_luma_8bpp_c:     174.0 -> 173.5
h264_v_loop_filter_luma_8bpp_neon:   62.9 ->  60.9
h264_h_loop_filter_chroma_8bpp_c:    30.2 ->  30.3
h264_h_loop_filter_chroma_8bpp_neon: 51.6 ->  25.7
h264_v_loop_filter_chroma_8bpp_c:    57.3 ->  57.3
h264_v_loop_filter_chroma_8bpp_neon: 28.0 ->  24.0
2019-01-26 12:05:10 +01:00
Janne Grunau
bb515e3a73 h264/aarch64: sign extend int stride in loop filter asm 2019-01-26 12:05:10 +01:00
Manoj Gupta
6fcf813110 libavcodec: Remove dynamic relocs from aarch64/h264idct_neon.S
Some of the assembly functions e.g. ff_h264_idct_dc_add_neon
has code like:
        movrel          x14, X(ff_h264_idct_add_neon)

Linker cannot resolve them fully at link time and emits dynamic
relocations.
Use explicit labels instead so that no dynamic relocations are
needed at all.

This avoids lld complains about text relocations.

For background, see https://crbug.com/917919

Signed-off-by: Manoj Gupta <manojgupta@chromium.org>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2019-01-03 20:12:07 +01:00
Carl Eugen Hoyos
0576ef466d lavc/aarch64/h264dsp_init_aarch64: Fix weight function prototypes.
Fixes the following warnings:
libavcodec/aarch64/h264dsp_init_aarch64.c: In function ‘ff_h264dsp_init_aarch64’:
libavcodec/aarch64/h264dsp_init_aarch64.c:84:38: warning: assignment from incompatible pointer type [enabled by default]
         c->weight_h264_pixels_tab[0] = ff_weight_h264_pixels_16_neon;
                                      ^
libavcodec/aarch64/h264dsp_init_aarch64.c:85:38: warning: assignment from incompatible pointer type [enabled by default]
         c->weight_h264_pixels_tab[1] = ff_weight_h264_pixels_8_neon;
                                      ^
libavcodec/aarch64/h264dsp_init_aarch64.c:86:38: warning: assignment from incompatible pointer type [enabled by default]
         c->weight_h264_pixels_tab[2] = ff_weight_h264_pixels_4_neon;
                                      ^
libavcodec/aarch64/h264dsp_init_aarch64.c:88:40: warning: assignment from incompatible pointer type [enabled by default]
         c->biweight_h264_pixels_tab[0] = ff_biweight_h264_pixels_16_neon;
                                        ^
libavcodec/aarch64/h264dsp_init_aarch64.c:89:40: warning: assignment from incompatible pointer type [enabled by default]
         c->biweight_h264_pixels_tab[1] = ff_biweight_h264_pixels_8_neon;
                                        ^
libavcodec/aarch64/h264dsp_init_aarch64.c:90:40: warning: assignment from incompatible pointer type [enabled by default]
         c->biweight_h264_pixels_tab[2] = ff_biweight_h264_pixels_4_neon;
                                        ^
2018-07-13 21:28:04 +02:00
Rodger Combs
7723750475 lavc/aarch64/sbrdsp_neon: fix build on old binutils 2018-01-26 02:42:01 -06:00
James Almer
0525722ca0 Merge commit '732510636e597585a79be7d111c88b3f7e174fe7'
* commit '732510636e597585a79be7d111c88b3f7e174fe7':
  aarch64: Remove a dot from a label

Merged-by: James Almer <jamrial@gmail.com>
2017-11-11 17:47:10 -03:00
Martin Storsjö
732510636e aarch64: Remove a dot from a label
This fixes building with armasm64 (when run through gas-preprocessor).

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-10-18 10:49:33 +03:00
Matthieu Bouron
0a24d7ca83 lavc/aarch64: add sbrdsp neon implementation
autocorrelate_c: 644.0
autocorrelate_neon: 420.0
hf_apply_noise_0_c: 1688.5
hf_apply_noise_0_neon: 1498.6
hf_apply_noise_1_c: 1691.2
hf_apply_noise_1_neon: 1500.6
hf_apply_noise_2_c: 1688.1
hf_apply_noise_2_neon: 1500.3
hf_apply_noise_3_c: 1696.6
hf_apply_noise_3_neon: 1502.2
hf_g_filt_c: 2117.8
hf_g_filt_neon: 1218.7
hf_gen_c: 4573.4
hf_gen_neon: 2461.0
neg_odd_64_c: 72.0
neg_odd_64_neon: 64.7
qmf_deint_bfly_c: 1107.6
qmf_deint_bfly_neon: 291.6
qmf_deint_neg_c: 210.4
qmf_deint_neg_neon: 107.4
qmf_post_shuffle_c: 163.0
qmf_post_shuffle_neon: 107.7
qmf_pre_shuffle_c: 120.5
qmf_pre_shuffle_neon: 110.7
sum64x5_c: 1361.6
sum64x5_neon: 435.4
sum_square_c: 1686.4
sum_square_neon: 787.2
2017-07-03 14:29:22 +02:00
Clément Bœsch
b12a36170b lavc/aacpsdsp: use ptrdiff_t for stride in hybrid_analysis 2017-06-28 12:22:39 +02:00
Clément Bœsch
ff0ecef624 lavc/aarch64: add a few SIMD functions for AAC PS
☭ tests/checkasm/checkasm --bench --test=aacpsdsp
checkasm: using random seed 3318985180
MMX implied by specified flags
MMX implied by specified flags
NEON:
 - aacpsdsp.add_squares        [OK]
 - aacpsdsp.mul_pair_single    [OK]
 - aacpsdsp.hybrid_analysis    [OK]
 - aacpsdsp.stereo_interpolate [OK]
checkasm: all 5 tests passed
nop: 10.0
ps_add_squares_c: 63221.2
ps_add_squares_neon: 22311.7
ps_hybrid_analysis_c: 2466.6
ps_hybrid_analysis_neon: 1521.9
ps_mul_pair_single_c: 68592.0
ps_mul_pair_single_neon: 17426.6
ps_stereo_interpolate_c: 72344.3
ps_stereo_interpolate_neon: 72308.8
ps_stereo_interpolate_ipdopd_c: 117415.2
ps_stereo_interpolate_ipdopd_neon: 113386.3
2017-06-28 12:22:39 +02:00
Memphiz
9e85c5d6a7 aarch64: vp9 16bpp: Fix assembling with Xcode 6.2 and older
Properly use the b.eq form instead of the nonstandard form (which
both gas and newer clang accept though), and expand the register
lists that used a range (which the Xcode 6.2 clang, based on clang
3.5 svn, didn't support).

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-06-21 09:08:14 +03:00
Memphiz
998609ddb8 aarch64: vp9: Fix assembling with Xcode 6.2 and older
Properly use the b.eq/b.ge forms instead of the nonstandard forms
(which both gas and newer clang accept though), and expand the
register list that used a range (which the Xcode 6.2 clang, based
on clang 3.5 svn, didn't support).

This is cherrypicked from libav commit
a970f9de86.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-06-21 09:08:13 +03:00
Memphiz
a970f9de86 aarch64: vp9: Fix assembling with Xcode 6.2 and older
Properly use the b.eq/b.ge forms instead of the nonstandard forms
(which both gas and newer clang accept though), and expand the
register list that used a range (which the Xcode 6.2 clang, based
on clang 3.5 svn, didn't support).

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-06-20 16:14:03 +03:00
Matthieu Bouron
204008354f lavc/aarch64/simple_idct: fix build with Xcode 7.2 2017-06-14 23:20:58 +02:00
Matthieu Bouron
8aa60606fb lavc/aarch64/simple_idct: fix idct_col4_top coefficient
Fixes regression introduced by 5d0b8b1ae3.
2017-06-13 17:46:55 +02:00
Matthieu Bouron
5d0b8b1ae3 lavc/aarch64/simple_idct: fix iOS build without gas-preprocessor
Separates macro arguments with commas and passes .4H/.8H as macro
arguments instead of 4H/8H (the later form being interpreted as an
hexadecimal value).

Fixes ticket #6324.

Suggested-by: Martin Storsjö <martin@martin.st>
2017-05-11 16:28:54 +02:00
Clément Bœsch
0f00eb0e4e Merge commit '2425d7329fdccfa9954faba748f3865151354f0c'
* commit '2425d7329fdccfa9954faba748f3865151354f0c':
  arm64: replace 'bic' with immediate with 'and' with inverted immediate

Merged-by: Clément Bœsch <u@pkh.me>
2017-04-26 16:28:57 +02:00
James Almer
5694427dc3 Merge commit '72a19f4013ec2c7f8581416f8ad4bf81df163fb6'
* commit '72a19f4013ec2c7f8581416f8ad4bf81df163fb6':
  mpegaudiodsp: aarch64: Adjust function prototype after 2caa93b813

Merged-by: James Almer <jamrial@gmail.com>
2017-03-31 14:43:37 -03:00
James Almer
c31cbeef58 aarch64/vp9dsp: add missing header includes 2017-03-28 23:02:09 -03:00
Ronald S. Bultje
f8c019944d vp9: re-split the decoder/format/dsp interface header files.
The advantage here is that the internal software decoder interface is
not exposed to the DSP functions or the hardware accelerations.
2017-03-28 18:04:26 -04:00
Clément Bœsch
1c9f4b5078 lavc/vp9: split into vp9{block,data,mvs}
This is following Libav layout to ease merges.
2017-03-27 21:38:21 +02:00
Clément Bœsch
739d8c83f2 Merge commit '9b2ccafb480c94fd09cfb24306d5296dc013cf5b'
* commit '9b2ccafb480c94fd09cfb24306d5296dc013cf5b':
  aarch64: Add missing sign extension in ff_h264_idct8_add_neon

Merged-by: Clément Bœsch <u@pkh.me>
2017-03-23 12:15:39 +01:00
James Almer
9a0fbb9ca9 Merge commit '2caa93b813adc5dbb7771dfe615da826a2947d18'
* commit '2caa93b813adc5dbb7771dfe615da826a2947d18':
  mpegaudiodsp: Change type of array stride parameters to ptrdiff_t

Merged-by: James Almer <jamrial@gmail.com>
2017-03-21 16:04:22 -03:00
James Almer
a8474df944 Merge commit 'e4a94d8b36c48d95a7d412c40d7b558422ff659c'
* commit 'e4a94d8b36c48d95a7d412c40d7b558422ff659c':
  h264chroma: Change type of stride parameters to ptrdiff_t

Merged-by: James Almer <jamrial@gmail.com>
2017-03-21 15:20:45 -03:00
James Almer
5a49097b42 Merge commit '2ec9fa5ec60dcd10e1cb10d8b4e4437e634ea428'
* commit '2ec9fa5ec60dcd10e1cb10d8b4e4437e634ea428':
  idct: Change type of array stride parameters to ptrdiff_t

Merged-by: James Almer <jamrial@gmail.com>
2017-03-21 14:29:52 -03:00
Clément Bœsch
ad98af27f7 Merge commit 'de2ae3c1fae5a2eb539b9abd7bc2a9ca8c286ff0'
* commit 'de2ae3c1fae5a2eb539b9abd7bc2a9ca8c286ff0':
  lavc: add clobber tests for the new encoding/decoding API

The merge only re-order what we already have.

Merged-by: Clément Bœsch <u@pkh.me>
2017-03-21 14:43:53 +01:00
Martin Storsjö
61b8a9ea29 aarch64: vp9itxfm16: Do a simpler half/quarter idct16/idct32 when possible
This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.

This gives a pretty substantial speedup for the smaller subpartitions.

The code size increases from 21512 bytes to 31400 bytes.

The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.

Before:
vp9_inv_dct_dct_16x16_sub1_add_10_neon:     284.6
vp9_inv_dct_dct_16x16_sub2_add_10_neon:    1902.7
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    1903.0
vp9_inv_dct_dct_16x16_sub8_add_10_neon:    2201.1
vp9_inv_dct_dct_16x16_sub12_add_10_neon:   2510.0
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   2821.3
vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1011.6
vp9_inv_dct_dct_32x32_sub2_add_10_neon:    9716.5
vp9_inv_dct_dct_32x32_sub4_add_10_neon:    9704.9
vp9_inv_dct_dct_32x32_sub8_add_10_neon:   10641.7
vp9_inv_dct_dct_32x32_sub12_add_10_neon:  11555.7
vp9_inv_dct_dct_32x32_sub16_add_10_neon:  12499.8
vp9_inv_dct_dct_32x32_sub20_add_10_neon:  13403.7
vp9_inv_dct_dct_32x32_sub24_add_10_neon:  14335.8
vp9_inv_dct_dct_32x32_sub28_add_10_neon:  15253.6
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  16179.5

After:
vp9_inv_dct_dct_16x16_sub1_add_10_neon:     282.8
vp9_inv_dct_dct_16x16_sub2_add_10_neon:    1142.4
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    1139.0
vp9_inv_dct_dct_16x16_sub8_add_10_neon:    1772.9
vp9_inv_dct_dct_16x16_sub12_add_10_neon:   2515.2
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   2823.5
vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1012.7
vp9_inv_dct_dct_32x32_sub2_add_10_neon:    6944.4
vp9_inv_dct_dct_32x32_sub4_add_10_neon:    6944.2
vp9_inv_dct_dct_32x32_sub8_add_10_neon:    7609.8
vp9_inv_dct_dct_32x32_sub12_add_10_neon:   9953.4
vp9_inv_dct_dct_32x32_sub16_add_10_neon:  10770.1
vp9_inv_dct_dct_32x32_sub20_add_10_neon:  13418.8
vp9_inv_dct_dct_32x32_sub24_add_10_neon:  14330.7
vp9_inv_dct_dct_32x32_sub28_add_10_neon:  15257.1
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  16190.6

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:54:37 +02:00
Martin Storsjö
d564c9018f aarch64: vp9itxfm16: Move the load_add_store macro out from the itxfm16 pass2 function
This allows reusing the macro for a separate implementation of the
pass2 function.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:54:30 +02:00
Martin Storsjö
0f2705e66b aarch64: vp9itxfm16: Make the larger core transforms standalone functions
This work is sponsored by, and copyright, Google.

This reduces the code size of libavcodec/aarch64/vp9itxfm_16bpp_neon.o from
26288 to 21512 bytes.

This gives a small slowdown of a couple of tens of cycles, but makes
it more feasible to add more optimized versions of these transforms.

Before:
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    1887.4
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   2801.5
vp9_inv_dct_dct_32x32_sub4_add_10_neon:    9691.4
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  16154.9

After:
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    1899.5
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   2827.2
vp9_inv_dct_dct_32x32_sub4_add_10_neon:    9714.7
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  16175.9

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:54:26 +02:00
Martin Storsjö
b76533f105 aarch64: vp9itxfm16: Restructure the idct32 store macros
This avoids concatenation, which can't be used if the whole macro
is wrapped within another macro.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:54:15 +02:00
Martin Storsjö
d613251622 aarch64: vp9itxfm16: Avoid .irp when it doesn't save any lines
This makes the code a bit more readable.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:54:09 +02:00
Martin Storsjö
25ced1eb1c aarch64: vp9itxfm16: Fix a typo in a comment
Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:54:02 +02:00
Martin Storsjö
21c89f3a26 arm/aarch64: vp9: Fix vertical alignment
Align the second/third operands as they usually are.

Due to the wildly varying sizes of the written out operands
in aarch64 assembly, the column alignment is usually not as clear
as in arm assembly.

This is cherrypicked from libav commit
7995ebfad1.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:53:32 +02:00
Martin Storsjö
70317b25aa arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used
In the half/quarter cases where we don't use the min_eob array, defer
loading the pointer until we know it will be needed.

This is cherrypicked from libav commit
3a0d5e206d.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:53:28 +02:00
Martin Storsjö
7995ebfad1 arm/aarch64: vp9: Fix vertical alignment
Align the second/third operands as they usually are.

Due to the wildly varying sizes of the written out operands
in aarch64 assembly, the column alignment is usually not as clear
as in arm assembly.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-16 23:09:00 +02:00
Matthieu Bouron
4c8e528d19 lavc/aarch64: add ff_simple_idct{,_add,_put}_neon functions 2017-03-16 12:00:41 +01:00
Martin Storsjö
3a0d5e206d arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used
In the half/quarter cases where we don't use the min_eob array, defer
loading the pointer until we know it will be needed.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 22:07:30 +02:00
Martin Storsjö
26ee83acc4 aarch64: vp9itxfm: Reorder iadst16 coeffs
This matches the order they are in the 16 bpp version.

There they are in this order, to make sure we access them in the
same order they are declared, easing loading only half of the
coefficients at a time.

This makes the 8 bpp version match the 16 bpp version better.

This is cherrypicked from libav commit
b8f66c0838.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:52 +02:00
Martin Storsjö
f952273019 aarch64: vp9itxfm: Reorder the idct coefficients for better pairing
All elements are used pairwise, except for the first one.
Previously, the 16th element was unused. Move the unused element
to the second slot, to make the later element pairs not split
across registers.

This simplifies loading only parts of the coefficients,
reducing the difference to the 16 bpp version.

This is cherrypicked from libav commit
09eb88a12e.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:52 +02:00
Martin Storsjö
2905657b90 aarch64: vp9itxfm: Avoid reloading the idct32 coefficients
The idct32x32 function actually pushed d8-d15 onto the stack even
though it didn't clobber them; there are plenty of registers that
can be used to allow keeping all the idct coefficients in registers
without having to reload different subsets of them at different
stages in the transform.

After this, we still can skip pushing d12-d15.

Before:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8128.3
After:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8053.3

This is cherrypicked from libav commit
65aa002d54.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:51 +02:00
Martin Storsjö
f32690a298 aarch64: vp9lpf: Use dup+rev16+uzp1 instead of dup+lsr+dup+trn1
This is one cycle faster in total, and three instructions fewer.

Before:
vp9_loop_filter_mix2_v_44_16_neon: 123.2
After:
vp9_loop_filter_mix2_v_44_16_neon: 122.2

This is cherrypicked from libav commit
3bf9c48320.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:50 +02:00
Martin Storsjö
3fbbad2984 arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit
The theoretical maximum value of E is 193, so we can just
saturate the addition to 255.

Before:                     Cortex A7      A8      A9     A53  A53/AArch64
vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0
After:
vp9_loop_filter_v_4_8_neon:     136.0   125.7   112.6    84.0         83.0
vp9_loop_filter_v_8_8_neon:     234.0   195.5   171.5   136.0        133.7
vp9_loop_filter_v_16_8_neon:    490.0   417.5   377.7   289.0        271.0
vp9_loop_filter_v_16_16_neon:   951.2   814.7   732.3   571.0        446.7

This is cherrypicked from libav commit
c582cb8537.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:50 +02:00
Martin Storsjö
c8d6eec85d aarch64: vp9lpf: Fix broken indentation/vertical alignment
This is cherrypicked from libav commit
07b5136c48.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:49 +02:00