1
0
mirror of https://github.com/FFmpeg/FFmpeg.git synced 2024-12-23 12:43:46 +02:00
Commit Graph

266 Commits

Author SHA1 Message Date
J. Dekker
a9214a2ca3 lavc/aarch64: add hevc sao edge 16x16
bench on AWS Graviton:

hevc_sao_edge_16x16_8_c: 1857.0
hevc_sao_edge_16x16_8_neon: 211.0
hevc_sao_edge_32x32_8_c: 7802.2
hevc_sao_edge_32x32_8_neon: 808.2
hevc_sao_edge_48x48_8_c: 16764.2
hevc_sao_edge_48x48_8_neon: 1796.5
hevc_sao_edge_64x64_8_c: 32647.5
hevc_sao_edge_64x64_8_neon: 3118.5

Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-01-04 14:31:56 +01:00
Jonathan Wright
08b4716a9e aarch64: Add Armv8.5-A BTI support
Add Branch Target Identifiers (BTIs) to all functions defined in
AArch64 assembly files. Most of the BTI landing pads are added
automatically by the 'function' macro.

BTI support is turned on or off at compile time based on the presence
of the __ARM_FEATURE_BTI_DEFAULT feature macro.

A binary compiled with BTI support can be executed on an Armv8-A
processor without BTI support because the instructions are defined in
NOP space.

Signed-off-by: Jonathan Wright <jonathan.wright@arm.com>
Signed-off-by: Elijah Ahmad <elijah.ahmad@arm.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-11-16 13:43:56 +02:00
Jonathan Wright
6f04cf54f5 aarch64: Use ret x<n> instead of br x<n> where possible
Change AArch64 assembly code to use:
  ret     x<n>
instead of:
  br      x<n>

"ret x<n>" is already used in a lot of places so this patch makes it
consistent across the code base. This does not change behavior or
performance.

In addition, this change reduces the number of landing pads needed in
a subsequent patch to support the Armv8.5-A Branch Target
Identification (BTI) security feature.

Signed-off-by: Jonathan Wright <jonathan.wright@arm.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-11-16 13:43:56 +02:00
Martin Storsjö
fd3bd5c492 aarch64: h264qpel: Do vertical filtering without transposing
This gives rather big speedups on these functions:

Before:
put_h264_qpel_8_mc01_8_neon:     241.0   131.5   138.7
put_h264_qpel_8_mc02_8_neon:     214.7   121.2   127.5
put_h264_qpel_8_mc03_8_neon:     242.5   131.2   135.7
put_h264_qpel_8_mc11_8_neon:     421.2   218.7   251.0
put_h264_qpel_8_mc12_8_neon:     878.0   509.5   537.5
put_h264_qpel_8_mc13_8_neon:     423.7   217.0   252.0
put_h264_qpel_8_mc21_8_neon:     858.2   479.5   514.0
put_h264_qpel_8_mc22_8_neon:     649.7   385.2   403.0
put_h264_qpel_8_mc23_8_neon:     860.2   476.5   517.7
put_h264_qpel_8_mc31_8_neon:     437.2   219.5   252.5
put_h264_qpel_8_mc32_8_neon:     892.5   510.5   546.0
put_h264_qpel_8_mc33_8_neon:     438.2   218.5   257.0
put_h264_qpel_16_mc01_8_neon:    944.2   509.7   546.7
put_h264_qpel_16_mc02_8_neon:    878.7   469.5   509.7
put_h264_qpel_16_mc03_8_neon:    945.7   510.7   557.0
put_h264_qpel_16_mc11_8_neon:   1663.2   858.5   979.5
put_h264_qpel_16_mc12_8_neon:   3510.2  2027.7  2112.7
put_h264_qpel_16_mc13_8_neon:   1664.7   857.5   980.5
put_h264_qpel_16_mc21_8_neon:   3366.2  1928.5  2030.5
put_h264_qpel_16_mc22_8_neon:   2584.7  1514.7  1590.2
put_h264_qpel_16_mc23_8_neon:   3367.7  1927.7  2035.0
put_h264_qpel_16_mc31_8_neon:   1716.7   849.7   997.0
put_h264_qpel_16_mc32_8_neon:   3564.0  2044.2  3835.2
put_h264_qpel_16_mc33_8_neon:   1717.7   863.0   989.5

After:
put_h264_qpel_8_mc01_8_neon:     136.0    73.7    76.0
put_h264_qpel_8_mc02_8_neon:     108.7    65.0    64.0
put_h264_qpel_8_mc03_8_neon:     137.5    72.7    73.0
put_h264_qpel_8_mc11_8_neon:     316.2   159.0   188.5
put_h264_qpel_8_mc12_8_neon:     653.0   375.5   384.7
put_h264_qpel_8_mc13_8_neon:     318.7   165.5   189.5
put_h264_qpel_8_mc21_8_neon:     739.2   385.7   432.5
put_h264_qpel_8_mc22_8_neon:     530.7   295.5   309.5
put_h264_qpel_8_mc23_8_neon:     741.2   393.7   421.0
put_h264_qpel_8_mc31_8_neon:     332.2   162.5   190.0
put_h264_qpel_8_mc32_8_neon:     667.5   378.2   390.5
put_h264_qpel_8_mc33_8_neon:     332.7   166.5   195.5
put_h264_qpel_16_mc01_8_neon:    524.2   285.2   294.0
put_h264_qpel_16_mc02_8_neon:    454.7   252.2   250.2
put_h264_qpel_16_mc03_8_neon:    525.7   286.0   283.0
put_h264_qpel_16_mc11_8_neon:   1243.2   630.7   726.7
put_h264_qpel_16_mc12_8_neon:   2610.2  1479.7  1481.2
put_h264_qpel_16_mc13_8_neon:   1250.5   631.7   727.7
put_h264_qpel_16_mc21_8_neon:   2890.2  1571.2  1679.7
put_h264_qpel_16_mc22_8_neon:   2108.7  1177.5  1223.5
put_h264_qpel_16_mc23_8_neon:   2891.7  1578.7  1667.7
put_h264_qpel_16_mc31_8_neon:   1296.7   630.5   752.5
put_h264_qpel_16_mc32_8_neon:   2664.0  1483.2  1503.5
put_h264_qpel_16_mc33_8_neon:   1297.7   632.5   747.2

I.e. overall a 20%-60% reduction in runtime of these
functions.

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-10-18 14:27:58 +03:00
Martin Storsjö
2d5a7f6d00 arm/aarch64: Improve scheduling in the avg form of h264_qpel
Don't use the loaded registers directly, avoiding stalls on in
order cores. Use vrhadd.u8 with q registers where easily possible.

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-10-18 14:27:36 +03:00
Zhao Zhili
378ad2f8fd lavc/aarch64: fix relocation out of range error
Use a temporary label instead of global function symbol for b.gt.

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-09-25 21:55:29 +03:00
Mikhail Nitenko
d3e56b56ae lavc/aarch64: add pred functions for 10-bit
Benchmarks:                        A53     A72
pred8x8_dc_10_c:                   64.2    49.5
pred8x8_dc_10_neon:                62.0    53.7
pred8x8_dc_128_10_c:               26.0    14.0
pred8x8_dc_128_10_neon:            30.7    17.5
pred8x8_horizontal_10_c:           60.0    27.7
pred8x8_horizontal_10_neon:        38.0    34.0
pred8x8_left_dc_10_c:              42.5    27.5
pred8x8_left_dc_10_neon:           51.0    41.2
pred8x8_mad_cow_dc_0l0_10_c:       55.7    37.2
pred8x8_mad_cow_dc_0l0_10_neon:    50.2    35.2
pred8x8_mad_cow_dc_0lt_10_c:       89.2    67.0
pred8x8_mad_cow_dc_0lt_10_neon:    52.2    46.7
pred8x8_mad_cow_dc_l0t_10_c:       74.7    51.0
pred8x8_mad_cow_dc_l0t_10_neon:    50.5    45.2
pred8x8_mad_cow_dc_l00_10_c:       58.0    38.0
pred8x8_mad_cow_dc_l00_10_neon:    42.5    37.5
pred8x8_plane_10_c:               354.0   288.7
pred8x8_plane_10_neon:            141.0   101.2
pred8x8_top_dc_10_c:               44.5    30.5
pred8x8_top_dc_10_neon:            40.0    31.0
pred8x8_vertical_10_c:             27.5    14.5
pred8x8_vertical_10_neon:          21.0    17.5
pred16x16_plane_10_c:            1242.0  1070.5
pred16x16_plane_10_neon:          324.0   196.7

Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-08-21 00:06:26 +03:00
Mikhail Nitenko
43ca887bc2 lavc/aarch64: h264, add chroma loop filters for 10bit
Benchmarks:                                             A53     A72
h264_h_loop_filter_chroma422_10bpp_c:                  282.7   114.2
h264_h_loop_filter_chroma422_10bpp_neon:               109.5    78.5
h264_h_loop_filter_chroma_10bpp_c:                     165.0    81.5
h264_h_loop_filter_chroma_10bpp_neon:                  120.0    76.7
h264_h_loop_filter_chroma_intra422_10bpp_c:            323.7   124.2
h264_h_loop_filter_chroma_intra422_10bpp_neon:         155.0   102.7
h264_h_loop_filter_chroma_intra_10bpp_c:               121.0    49.5
h264_h_loop_filter_chroma_intra_10bpp_neon:             79.7    53.7
h264_h_loop_filter_chroma_mbaff422_10bpp_c:            188.5    75.0
h264_h_loop_filter_chroma_mbaff422_10bpp_neon:         120.0    75.5
h264_h_loop_filter_chroma_mbaff_intra422_10bpp_c:      116.7    46.0
h264_h_loop_filter_chroma_mbaff_intra422_10bpp_neon:    79.7    53.7
h264_h_loop_filter_chroma_mbaff_intra_10bpp_c:          63.0    27.2
h264_h_loop_filter_chroma_mbaff_intra_10bpp_neon:       48.5    34.0
h264_v_loop_filter_chroma_10bpp_c:                     258.7   135.5
h264_v_loop_filter_chroma_10bpp_neon:                   71.2    51.0
h264_v_loop_filter_chroma_intra_10bpp_c:               158.0    70.7
h264_v_loop_filter_chroma_intra_10bpp_neon:             48.7    31.5

Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-08-21 00:06:26 +03:00
Mikhail Nitenko
756d2e087a lavc/aarch64: move transpose_4x8H to neon.S
transpose_4x8H was declared in vp9lpf_16bpp_neon, however this macro is
not unique to vp9 and could be used elsewhere.

Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-08-21 00:06:26 +03:00
Martin Storsjö
c60b76d0c8 aarch64: h264dsp: Fix indentation of some functions to match the rest
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-08-08 23:18:43 +03:00
Martin Storsjö
e86ec831b0 aarch64: h264dsp: Remove unnecessary sign extensions
These became unnecessary when the stride arguments were changed from
int to ptrdiff_t in bc26fe8927
(0576ef466d) and
d5d699ab6e
(aa844dc46f).

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-08-08 23:18:43 +03:00
Andreas Rheinhardt
afc95a10ac avcodec/h264dsp, h264idct: Fix lengths of array parameters
Fixes many -Warray-parameter warnings from GCC 11.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2021-08-08 17:44:57 +02:00
Martin Storsjö
f27e3ccf06 aarch64: hevc_idct: Fix overflows in idct_dc
This is marginally slower, but correct for all input values.
The previous implementation failed with certain input seeds, e.g.
"checkasm --test=hevc_idct 98".

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-05-22 00:08:03 +03:00
Andreas Rheinhardt
7c1f347b18 avcodec: Remove deprecated old encode/decode APIs
Deprecated in commits 7fc329e2dd
and 31f6a4b4b8.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2021-04-27 10:43:12 -03:00
Andreas Rheinhardt
f3c197b129 Include attributes.h directly
Some files currently rely on libavutil/cpu.h to include it for them;
yet said file won't use include it any more after the currently
deprecated functions are removed, so include attributes.h directly.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2021-04-19 14:34:10 +02:00
Mikhail Nitenko
84ac1440b2 lavc/aarch64: add pred16x16 10-bit functions
Benchmarks:                      A53     A72
pred16x16_dc_10_c:              136.0   124.0
pred16x16_dc_10_neon:           121.2   106.0
pred16x16_horizontal_10_c:      155.0    73.2
pred16x16_horizontal_10_neon:    82.2    67.7
pred16x16_top_dc_10_c:          106.0    93.7
pred16x16_top_dc_10_neon:        87.7    77.2
pred16x16_vertical_10_c:         83.0    67.7
pred16x16_vertical_10_neon:      54.2    61.7

Some functions work slower than C and are left commented out.
2021-04-19 09:01:14 +02:00
Mikhail Nitenko
6b2e7dc828 lavc/aarch64: change h264pred_init structure
Change structure to allow the addition of other bit depths.
2021-04-19 09:00:58 +02:00
Martin Storsjö
870bfe16a1 aarch64: h264pred: Optimize the inner loop of existing 8 bit functions
Move the loop counter decrement further from the branch instruction,
this hides the latency of the decrement.

In loops that first load, then store (the horizontal prediction cases),
do the decrement after the load (where the next instruction would
stall a bit anyway, waiting for the result of the load).

In loops that store twice using the same destination register,
also do the decrement between the two stores (as the second store
would need to wait for the updated destination register from the
first instruction).

In loops that store twice to two different destination registers,
do the decrement before both stores, to do it as soon before the
branch as possible.

This gives minor (1-2 cycle) speedups in most cases (modulo measurement
noise), but the horizontal prediction functions get a rather notable
speedup on the Cortex A53.

Before:                     Cortex A53     A72     A73
pred8x8_dc_8_neon:                60.7    46.2    39.2
pred8x8_dc_128_8_neon:            30.7    18.0    14.0
pred8x8_horizontal_8_neon:        42.2    29.2    18.5
pred8x8_left_dc_8_neon:           52.7    36.2    32.2
pred8x8_mad_cow_dc_0l0_8_neon:    48.2    27.7    25.7
pred8x8_mad_cow_dc_0lt_8_neon:    52.5    33.2    34.7
pred8x8_mad_cow_dc_l0t_8_neon:    52.5    31.7    33.2
pred8x8_mad_cow_dc_l00_8_neon:    43.2    27.0    25.5
pred8x8_plane_8_neon:            112.2    86.2    88.2
pred8x8_top_dc_8_neon:            40.7    23.0    21.2
pred8x8_vertical_8_neon:          27.2    15.5    14.0
pred16x16_dc_8_neon:              91.0    73.2    70.5
pred16x16_dc_128_8_neon:          43.0    34.7    30.7
pred16x16_horizontal_8_neon:      86.0    49.7    44.7
pred16x16_left_dc_8_neon:         87.0    67.2    67.5
pred16x16_plane_8_neon:          236.0   175.7   173.0
pred16x16_top_dc_8_neon:          53.2    39.0    41.7
pred16x16_vertical_8_neon:        41.7    29.7    31.0

After:
pred8x8_dc_8_neon:                59.0    46.7    42.5
pred8x8_dc_128_8_neon:            28.2    18.0    14.0
pred8x8_horizontal_8_neon:        34.2    29.2    18.5
pred8x8_left_dc_8_neon:           51.0    38.2    32.7
pred8x8_mad_cow_dc_0l0_8_neon:    46.7    28.2    26.2
pred8x8_mad_cow_dc_0lt_8_neon:    55.2    33.7    37.5
pred8x8_mad_cow_dc_l0t_8_neon:    51.2    31.7    37.2
pred8x8_mad_cow_dc_l00_8_neon:    41.7    27.5    26.0
pred8x8_plane_8_neon:            111.5    86.5    89.5
pred8x8_top_dc_8_neon:            39.0    23.2    21.0
pred8x8_vertical_8_neon:          27.2    16.0    14.0
pred16x16_dc_8_neon:              85.0    70.2    70.5
pred16x16_dc_128_8_neon:          42.0    30.0    30.7
pred16x16_horizontal_8_neon:      66.5    49.5    42.5
pred16x16_left_dc_8_neon:         81.0    66.5    67.5
pred16x16_plane_8_neon:          235.0   175.7   173.0
pred16x16_top_dc_8_neon:          52.0    39.0    41.7
pred16x16_vertical_8_neon:        40.2    33.2    31.0

Despite this, a number of these functions still are slower than
what e.g. GCC 7 generates - this shows the relative speedup of the
neon codepaths over the compiler generated ones:

                           Cortex A53    A72    A73
pred8x8_dc_8_neon:               0.86   0.65   1.04
pred8x8_dc_128_8_neon:           0.59   0.44   0.62
pred8x8_horizontal_8_neon:       1.51   0.58   1.30
pred8x8_left_dc_8_neon:          0.72   0.56   0.89
pred8x8_mad_cow_dc_0l0_8_neon:   0.93   0.93   1.37
pred8x8_mad_cow_dc_0lt_8_neon:   1.37   1.41   1.68
pred8x8_mad_cow_dc_l0t_8_neon:   1.21   1.17   1.32
pred8x8_mad_cow_dc_l00_8_neon:   1.24   1.19   1.60
pred8x8_plane_8_neon:            3.36   3.58   3.76
pred8x8_top_dc_8_neon:           0.97   0.99   1.43
pred8x8_vertical_8_neon:         0.86   0.78   1.18
pred16x16_dc_8_neon:             1.20   1.06   1.49
pred16x16_dc_128_8_neon:         0.83   0.95   0.99
pred16x16_horizontal_8_neon:     1.78   0.96   1.59
pred16x16_left_dc_8_neon:        1.06   0.96   1.32
pred16x16_plane_8_neon:          5.78   6.49   7.19
pred16x16_top_dc_8_neon:         1.48   1.53   1.94
pred16x16_vertical_8_neon:       1.39   1.34   1.98

In particular, on Cortex A72, many of these functions are slower
than the compiler generated code, while they're more beneficial on
e.g. the Cortex A73.

Signed-off-by: Martin Storsjö <martin@martin.st>
2021-04-14 15:23:44 +03:00
James Almer
f1a894f9d3 avcodec: add missing FF_API_OLD_ENCDEC wrappers to xmm clobber functions
Signed-off-by: James Almer <jamrial@gmail.com>
2021-02-26 19:26:31 -03:00
Josh Dekker
7ac41e0db2 lavc/aarch64: add HEVC sao_band NEON
Only works for 8x8.

Signed-off-by: Josh Dekker <josh@itanimul.li>
2021-02-18 14:12:01 +01:00
Josh Dekker
75c2ddfa61 lavc/aarch64: add HEVC idct_dc NEON
Signed-off-by: Josh Dekker <josh@itanimul.li>
2021-02-18 14:12:01 +01:00
Reimar Döffinger
00c916ef61 lavc/aarch64: port HEVC add_residual NEON
Speedup is fairly small, around 1.5%, but these are fairly simple.

Signed-off-by: Josh Dekker <josh@itanimul.li>
2021-02-18 14:11:57 +01:00
Reimar Döffinger
30f80d855b lavc/aarch64: port HEVC SIMD idct NEON
Makes SIMD-optimized 8x8 and 16x16 idcts for 8 and 10 bit depth
available on aarch64.
For a UHD HDR (10 bit) sample video these were consuming the most time
and this optimization reduced overall decode time from 19.4s to 16.4s,
approximately 15% speedup.
Test sample was the first 300 frames of "LG 4K HDR Demo - New York.ts",
running on Apple M1.

Signed-off-by: Josh Dekker <josh@itanimul.li>
2021-02-18 14:11:53 +01:00
Anton Khirnov
c8c2dfbc37 lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h
That is a more appropriate place for it.
2021-01-01 14:11:01 +01:00
Martin Storsjö
7168adedbc libavcodec: aarch64: Add a NEON implementation of pixblockdsp
Cortex A53    A72    A73
get_pixels_c:                140.7   87.7   72.5
get_pixels_neon:              46.0   20.0   19.5
get_pixels_unaligned_c:      140.7   87.7   73.0
get_pixels_unaligned_neon:    49.2   20.2   26.2
diff_pixels_c:               209.7  133.7  138.7
diff_pixels_neon:             54.2   31.7   23.5
diff_pixels_unaligned_c:     209.7  134.2  139.0
diff_pixels_unaligned_neon:   68.0   27.7   41.7

Signed-off-by: Martin Storsjö <martin@martin.st>
2020-05-15 23:37:55 +03:00
Carl Eugen Hoyos
34d7c8d942 lavc/aarch64: Remove unneeded file vp9mc_aarch64.c 2020-03-11 14:36:07 +01:00
Carl Eugen Hoyos
951bd25572 lavc/aarch64: Fix suffix of new file vp9mc_aarch64. 2020-03-11 14:29:22 +01:00
Carl Eugen Hoyos
213c796561 lavc/aarch64: Fix compilation with --disable-neon
Fixes ticket #8565.
2020-03-11 14:16:48 +01:00
Carl Eugen Hoyos
9a21754904 lavc/aarch64: Move non-neon vp9 copy functions out of neon source file.
Fixes part of ticket #8565.
2020-03-11 14:16:40 +01:00
Lynne
aac382e9e5 aarch64/opusdsp: do not clobber register v8
A part of v8-v15 needs to be preserved across calls.
2019-08-15 13:29:22 +01:00
Lynne
f62ee527cb aarch64/asm-offsets: remove old CELT offsets
They're not used and they're incorrect.
2019-05-14 23:41:24 +01:00
Lynne
4d2f62150d aarch64/opusdsp: implement NEON accelerated postfilter and deemphasis
153372 UNITS in postfilter_c,   65536 runs,      0 skips
73164 UNITS in postfilter_neon,   65536 runs,      0 skips -> 2.1x speedup

80591 UNITS in deemphasis_c,  131072 runs,      0 skips
43969 UNITS in deemphasis_neon,  131072 runs,      0 skips -> 1.83x speedup

Total decoder speedup: ~15% on a Raspberry Pi 3 (from 28.1x to 33.5x realtime)

Deemphasis SIMD based on the following unrolling:
const float c1 = CELT_EMPH_COEFF, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;
float state = coeff;

for (int i = 0; i < len; i += 4) {
    y[0] = x[0] + c1*state;
    y[1] = x[1] + c2*state + c1*x[0];
    y[2] = x[2] + c3*state + c1*x[1] + c2*x[0];
    y[3] = x[3] + c4*state + c1*x[2] + c2*x[1] + c3*x[0];

    state = y[3];
    y += 4;
    x += 4;
}

Unlike the x86 version, duplication is used instead of pslldq so
the structure and tables are different.
2019-04-10 01:08:54 +02:00
James Almer
92219ef4ac Merge commit '186bd30aa3b6c2b29b4dbf18278700b572068b1e'
* commit '186bd30aa3b6c2b29b4dbf18278700b572068b1e':
  h264/arm64: implement missing 4:2:2 chroma loop filter neon functions

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:29:41 -03:00
James Almer
5c363d3e59 Merge commit '7e42d5f0ab2aeac811fd01e122627c9198b13f01'
* commit '7e42d5f0ab2aeac811fd01e122627c9198b13f01':
  aarch64: vp8: Optimize vp8_idct_add_neon for aarch64

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:22:29 -03:00
James Almer
409e684e79 Merge commit '49f9c4272c4029b57ff300d908ba03c6332fc9c4'
* commit '49f9c4272c4029b57ff300d908ba03c6332fc9c4':
  aarch64: vp8: Skip saturating in shrn in ff_vp8_idct_add_neon

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:21:46 -03:00
James Almer
fbd607dd56 Merge commit '37394ef01b040605f8e1c98e73aa12b1c0bcba07'
* commit '37394ef01b040605f8e1c98e73aa12b1c0bcba07':
  aarch64: vp8: Optimize put_epel16_h6v6 with vp8_epel8_v6_y2

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:20:05 -03:00
James Almer
34a0a9746b Merge commit 'e39a9212ab37a55b346801c77487d8a47b6f9fe2'
* commit 'e39a9212ab37a55b346801c77487d8a47b6f9fe2':
  aarch64: vp8: Port bilin functions from arm version

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:18:42 -03:00
James Almer
2ac399d7fa Merge commit '58d154922707bfeb873cb3a7476e0f94b17463dd'
* commit '58d154922707bfeb873cb3a7476e0f94b17463dd':
  aarch64: vp8: Port epel4 functions from arm version

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:17:33 -03:00
James Almer
c6892f59eb Merge commit 'cc7ba00c35faf0478f1f56215e926f70ccb31282'
* commit 'cc7ba00c35faf0478f1f56215e926f70ccb31282':
  aarch64: vp8: Port missing epel8 functions from arm version

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:16:43 -03:00
James Almer
79025da3f2 Merge commit '52c9b0a6c0d02cff6caebcf6989e565e05b55200'
* commit '52c9b0a6c0d02cff6caebcf6989e565e05b55200':
  aarch64: vp8: Port vp8_luma_dc_wht and vp8_idct_dc_add4uv from arm version

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:14:40 -03:00
James Almer
39278ff0de Merge commit 'c513fcd7d235aa4cef45a6c3125bd4dcc03bf276'
* commit 'c513fcd7d235aa4cef45a6c3125bd4dcc03bf276':
  aarch64: vp8: Fix a typo in a comment

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:13:32 -03:00
James Almer
4f9a8d3fe2 Merge commit 'f1011ea28a4048ddec97794ca3e9901474fe055f'
* commit 'f1011ea28a4048ddec97794ca3e9901474fe055f':
  aarch64: vp8: Reorder the function pointer inits to match the arm original

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:09:11 -03:00
James Almer
398000abcf Merge commit '85bfaa4949f4afcde19061def3e8a18988964858'
* commit '85bfaa4949f4afcde19061def3e8a18988964858':
  aarch64: vp8: Use the proper aarch64 form for conditional branches

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:06:43 -03:00
James Almer
a2ae381b5a Merge commit '0801853e640624537db386727b36fa97aa6258e7'
* commit '0801853e640624537db386727b36fa97aa6258e7':
  libavcodec: vp8 neon optimizations for aarch64

See 833fed5253

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:05:52 -03:00
Janne Grunau
186bd30aa3 h264/arm64: implement missing 4:2:2 chroma loop filter neon functions 2019-02-27 21:57:05 +01:00
Carl Eugen Hoyos
7e4d3dbe18 lavc/aarch64/h264dsp_init: Only use neon horizontal intra loopfilter for 4:2:0. 2019-02-20 23:56:21 +01:00
James Almer
aa844dc46f aarch64/h264dsp: change loop filter stride argument to ptrdiff_t
This was missed in d5d699ab6e

Signed-off-by: James Almer <jamrial@gmail.com>
2019-02-20 19:38:46 -03:00
James Almer
e4e04dce1f Merge commit '28a8b5413b64b831dfb8650208bccd8b78360484'
* commit '28a8b5413b64b831dfb8650208bccd8b78360484':
  h264/aarch64: add intra loop filter neon asm

Merged-by: James Almer <jamrial@gmail.com>
2019-02-20 15:42:01 -03:00
James Almer
4dc1f06f0c Merge commit '846c3d6aca5484904e60946c4fe8b8833bc07f92'
* commit '846c3d6aca5484904e60946c4fe8b8833bc07f92':
  h264/aarch64: optimize neon loop filter

Merged-by: James Almer <jamrial@gmail.com>
2019-02-20 15:41:03 -03:00
James Almer
5ca7eb36b7 Merge commit 'bb515e3a735f526ccb1068031e289eb5aeb69e22'
* commit 'bb515e3a735f526ccb1068031e289eb5aeb69e22':
  h264/aarch64: sign extend int stride in loop filter asm

Merged-by: James Almer <jamrial@gmail.com>
2019-02-20 14:50:37 -03:00