1
0
mirror of https://github.com/FFmpeg/FFmpeg.git synced 2024-11-21 10:55:51 +02:00
Commit Graph

295 Commits

Author SHA1 Message Date
Reimar Döffinger
00c916ef61 lavc/aarch64: port HEVC add_residual NEON
Speedup is fairly small, around 1.5%, but these are fairly simple.

Signed-off-by: Josh Dekker <josh@itanimul.li>
2021-02-18 14:11:57 +01:00
Reimar Döffinger
30f80d855b lavc/aarch64: port HEVC SIMD idct NEON
Makes SIMD-optimized 8x8 and 16x16 idcts for 8 and 10 bit depth
available on aarch64.
For a UHD HDR (10 bit) sample video these were consuming the most time
and this optimization reduced overall decode time from 19.4s to 16.4s,
approximately 15% speedup.
Test sample was the first 300 frames of "LG 4K HDR Demo - New York.ts",
running on Apple M1.

Signed-off-by: Josh Dekker <josh@itanimul.li>
2021-02-18 14:11:53 +01:00
Anton Khirnov
c8c2dfbc37 lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h
That is a more appropriate place for it.
2021-01-01 14:11:01 +01:00
Martin Storsjö
7168adedbc libavcodec: aarch64: Add a NEON implementation of pixblockdsp
Cortex A53    A72    A73
get_pixels_c:                140.7   87.7   72.5
get_pixels_neon:              46.0   20.0   19.5
get_pixels_unaligned_c:      140.7   87.7   73.0
get_pixels_unaligned_neon:    49.2   20.2   26.2
diff_pixels_c:               209.7  133.7  138.7
diff_pixels_neon:             54.2   31.7   23.5
diff_pixels_unaligned_c:     209.7  134.2  139.0
diff_pixels_unaligned_neon:   68.0   27.7   41.7

Signed-off-by: Martin Storsjö <martin@martin.st>
2020-05-15 23:37:55 +03:00
Carl Eugen Hoyos
34d7c8d942 lavc/aarch64: Remove unneeded file vp9mc_aarch64.c 2020-03-11 14:36:07 +01:00
Carl Eugen Hoyos
951bd25572 lavc/aarch64: Fix suffix of new file vp9mc_aarch64. 2020-03-11 14:29:22 +01:00
Carl Eugen Hoyos
213c796561 lavc/aarch64: Fix compilation with --disable-neon
Fixes ticket #8565.
2020-03-11 14:16:48 +01:00
Carl Eugen Hoyos
9a21754904 lavc/aarch64: Move non-neon vp9 copy functions out of neon source file.
Fixes part of ticket #8565.
2020-03-11 14:16:40 +01:00
Lynne
aac382e9e5 aarch64/opusdsp: do not clobber register v8
A part of v8-v15 needs to be preserved across calls.
2019-08-15 13:29:22 +01:00
Lynne
f62ee527cb aarch64/asm-offsets: remove old CELT offsets
They're not used and they're incorrect.
2019-05-14 23:41:24 +01:00
Lynne
4d2f62150d aarch64/opusdsp: implement NEON accelerated postfilter and deemphasis
153372 UNITS in postfilter_c,   65536 runs,      0 skips
73164 UNITS in postfilter_neon,   65536 runs,      0 skips -> 2.1x speedup

80591 UNITS in deemphasis_c,  131072 runs,      0 skips
43969 UNITS in deemphasis_neon,  131072 runs,      0 skips -> 1.83x speedup

Total decoder speedup: ~15% on a Raspberry Pi 3 (from 28.1x to 33.5x realtime)

Deemphasis SIMD based on the following unrolling:
const float c1 = CELT_EMPH_COEFF, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;
float state = coeff;

for (int i = 0; i < len; i += 4) {
    y[0] = x[0] + c1*state;
    y[1] = x[1] + c2*state + c1*x[0];
    y[2] = x[2] + c3*state + c1*x[1] + c2*x[0];
    y[3] = x[3] + c4*state + c1*x[2] + c2*x[1] + c3*x[0];

    state = y[3];
    y += 4;
    x += 4;
}

Unlike the x86 version, duplication is used instead of pslldq so
the structure and tables are different.
2019-04-10 01:08:54 +02:00
James Almer
92219ef4ac Merge commit '186bd30aa3b6c2b29b4dbf18278700b572068b1e'
* commit '186bd30aa3b6c2b29b4dbf18278700b572068b1e':
  h264/arm64: implement missing 4:2:2 chroma loop filter neon functions

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:29:41 -03:00
James Almer
5c363d3e59 Merge commit '7e42d5f0ab2aeac811fd01e122627c9198b13f01'
* commit '7e42d5f0ab2aeac811fd01e122627c9198b13f01':
  aarch64: vp8: Optimize vp8_idct_add_neon for aarch64

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:22:29 -03:00
James Almer
409e684e79 Merge commit '49f9c4272c4029b57ff300d908ba03c6332fc9c4'
* commit '49f9c4272c4029b57ff300d908ba03c6332fc9c4':
  aarch64: vp8: Skip saturating in shrn in ff_vp8_idct_add_neon

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:21:46 -03:00
James Almer
fbd607dd56 Merge commit '37394ef01b040605f8e1c98e73aa12b1c0bcba07'
* commit '37394ef01b040605f8e1c98e73aa12b1c0bcba07':
  aarch64: vp8: Optimize put_epel16_h6v6 with vp8_epel8_v6_y2

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:20:05 -03:00
James Almer
34a0a9746b Merge commit 'e39a9212ab37a55b346801c77487d8a47b6f9fe2'
* commit 'e39a9212ab37a55b346801c77487d8a47b6f9fe2':
  aarch64: vp8: Port bilin functions from arm version

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:18:42 -03:00
James Almer
2ac399d7fa Merge commit '58d154922707bfeb873cb3a7476e0f94b17463dd'
* commit '58d154922707bfeb873cb3a7476e0f94b17463dd':
  aarch64: vp8: Port epel4 functions from arm version

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:17:33 -03:00
James Almer
c6892f59eb Merge commit 'cc7ba00c35faf0478f1f56215e926f70ccb31282'
* commit 'cc7ba00c35faf0478f1f56215e926f70ccb31282':
  aarch64: vp8: Port missing epel8 functions from arm version

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:16:43 -03:00
James Almer
79025da3f2 Merge commit '52c9b0a6c0d02cff6caebcf6989e565e05b55200'
* commit '52c9b0a6c0d02cff6caebcf6989e565e05b55200':
  aarch64: vp8: Port vp8_luma_dc_wht and vp8_idct_dc_add4uv from arm version

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:14:40 -03:00
James Almer
39278ff0de Merge commit 'c513fcd7d235aa4cef45a6c3125bd4dcc03bf276'
* commit 'c513fcd7d235aa4cef45a6c3125bd4dcc03bf276':
  aarch64: vp8: Fix a typo in a comment

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:13:32 -03:00
James Almer
4f9a8d3fe2 Merge commit 'f1011ea28a4048ddec97794ca3e9901474fe055f'
* commit 'f1011ea28a4048ddec97794ca3e9901474fe055f':
  aarch64: vp8: Reorder the function pointer inits to match the arm original

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:09:11 -03:00
James Almer
398000abcf Merge commit '85bfaa4949f4afcde19061def3e8a18988964858'
* commit '85bfaa4949f4afcde19061def3e8a18988964858':
  aarch64: vp8: Use the proper aarch64 form for conditional branches

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:06:43 -03:00
James Almer
a2ae381b5a Merge commit '0801853e640624537db386727b36fa97aa6258e7'
* commit '0801853e640624537db386727b36fa97aa6258e7':
  libavcodec: vp8 neon optimizations for aarch64

See 833fed5253

Merged-by: James Almer <jamrial@gmail.com>
2019-03-14 16:05:52 -03:00
Janne Grunau
186bd30aa3 h264/arm64: implement missing 4:2:2 chroma loop filter neon functions 2019-02-27 21:57:05 +01:00
Carl Eugen Hoyos
7e4d3dbe18 lavc/aarch64/h264dsp_init: Only use neon horizontal intra loopfilter for 4:2:0. 2019-02-20 23:56:21 +01:00
James Almer
aa844dc46f aarch64/h264dsp: change loop filter stride argument to ptrdiff_t
This was missed in d5d699ab6e

Signed-off-by: James Almer <jamrial@gmail.com>
2019-02-20 19:38:46 -03:00
James Almer
e4e04dce1f Merge commit '28a8b5413b64b831dfb8650208bccd8b78360484'
* commit '28a8b5413b64b831dfb8650208bccd8b78360484':
  h264/aarch64: add intra loop filter neon asm

Merged-by: James Almer <jamrial@gmail.com>
2019-02-20 15:42:01 -03:00
James Almer
4dc1f06f0c Merge commit '846c3d6aca5484904e60946c4fe8b8833bc07f92'
* commit '846c3d6aca5484904e60946c4fe8b8833bc07f92':
  h264/aarch64: optimize neon loop filter

Merged-by: James Almer <jamrial@gmail.com>
2019-02-20 15:41:03 -03:00
James Almer
5ca7eb36b7 Merge commit 'bb515e3a735f526ccb1068031e289eb5aeb69e22'
* commit 'bb515e3a735f526ccb1068031e289eb5aeb69e22':
  h264/aarch64: sign extend int stride in loop filter asm

Merged-by: James Almer <jamrial@gmail.com>
2019-02-20 14:50:37 -03:00
Martin Storsjö
c8bc9d1380 aarch64: vp8: Move the vp8dsp makefile entries to the right places
Even if NEON would be disabled, the init functions should be built
as they are called as long as ARCH_AARCH64 is set.

These functions are part of a generic DSP subsytem, not tied directly
to one decoder. (They should be built if the vp7 decoder is enabled,
even if the vp8 decoder is disabled.)

Signed-off-by: Martin Storsjö <martin@martin.st>
(cherry picked from commit b4b27dce95)
2019-02-19 23:43:17 +02:00
Martin Storsjö
fecf75a5c4 aarch64: vp8: Remove superfluous includes
This fixes building with MSVC, which lacks unistd.h.

Signed-off-by: Martin Storsjö <martin@martin.st>
(cherry picked from commit ad32f7b126)
2019-02-19 23:42:16 +02:00
Martin Storsjö
7ddfa5e908 aarch64: vp8: Fix assembling with armasm64
Signed-off-by: Martin Storsjö <martin@martin.st>
(cherry picked from commit 2eeac79936)
2019-02-19 23:42:03 +02:00
Martin Storsjö
c950beb68d aarch64: vp8: Fix assembling with clang
This also partially fixes assembling with MS armasm64 (via
gas-preprocessor).

The movrel macro invocations need to pass the offset via a separate
parameter. Mach-o and COFF relocations don't allow a negative
offset to a symbol, which is handled properly if the offset is passed
via the parameter. If no offset parameter is given, the macro
evaluates to something like "adrp x17, subpel_filters-16+(0)", which
older clang versions also fail to parse (the older clang versions
only support one single offset term, although it can be a parenthesis.

Signed-off-by: Martin Storsjö <martin@martin.st>
(cherry picked from commit 26d7af4c38)
2019-02-19 23:41:47 +02:00
Martin Storsjö
7e42d5f0ab aarch64: vp8: Optimize vp8_idct_add_neon for aarch64
The previous version was a pretty exact translation of the arm
version. This version does do some unnecessary arithemetic (it does
more operations on vectors that are only half filled; it does 4
uaddw and 4 sqxtun instead of 2 of each), but it reduces the overhead
of packing data together (which could be done for free in the arm
version).

This gives a decent speedup on Cortex A53, a minor speedup on
A72 and a very minor slowdown on Cortex A73.

Before:        Cortex A53    A72    A73
vp8_idct_add_neon:   79.7   67.5   65.0
After:
vp8_idct_add_neon:   67.7   64.8   66.7

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:28 +02:00
Martin Storsjö
49f9c4272c aarch64: vp8: Skip saturating in shrn in ff_vp8_idct_add_neon
The original arm version didn't do saturation here. This probably
doesn't make any difference for performance, but reduces the
differences.

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:24 +02:00
Martin Storsjö
37394ef01b aarch64: vp8: Optimize put_epel16_h6v6 with vp8_epel8_v6_y2
This makes it similar to put_epel16_v6, and gives a large speedup
on Cortex A53, a minor speedup on A72 and a very minor slowdown on
A73.

Before:                 Cortex A53     A72     A73
vp8_put_epel16_h6v6_neon:   2211.4  1586.5  1431.7
After:
vp8_put_epel16_h6v6_neon:   1736.9  1522.0  1448.1

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:21 +02:00
Martin Storsjö
e39a9212ab aarch64: vp8: Port bilin functions from arm version
Cortex A53     A72     A73
vp8_put_bilin4_h_c:        303.8   102.2   161.8
vp8_put_bilin4_h_neon:     100.0    40.9    41.2
vp8_put_bilin4_hv_c:       322.8   201.0   305.9
vp8_put_bilin4_hv_neon:    156.8    72.6    77.0
vp8_put_bilin4_v_c:        304.7   101.7   166.5
vp8_put_bilin4_v_neon:      82.7    41.2    33.0
vp8_put_bilin8_h_c:       1192.7   352.5   623.8
vp8_put_bilin8_h_neon:     213.5    70.2    87.8
vp8_put_bilin8_hv_c:      1098.6   769.2  1041.9
vp8_put_bilin8_hv_neon:    324.0   123.5   146.0
vp8_put_bilin8_v_c:       1193.9   350.4   617.7
vp8_put_bilin8_v_neon:     183.9    60.7    64.7
vp8_put_bilin16_h_c:      2353.1   671.2  1223.3
vp8_put_bilin16_h_neon:    261.9   140.7   145.0
vp8_put_bilin16_hv_c:     2453.2  1470.9  2355.2
vp8_put_bilin16_hv_neon:   383.9   196.0   217.0
vp8_put_bilin16_v_c:      2349.3   669.8  1251.2
vp8_put_bilin16_v_neon:    202.9   110.7    96.2

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:14 +02:00
Martin Storsjö
58d1549227 aarch64: vp8: Port epel4 functions from arm version
Cortex A53    A72    A73
vp8_put_epel4_h4_c:        631.4  291.7  367.8
vp8_put_epel4_h4_neon:     241.0  131.0  155.7
vp8_put_epel4_h4v4_c:      967.5  529.3  667.7
vp8_put_epel4_h4v4_neon:   429.3  241.8  279.7
vp8_put_epel4_h4v6_c:     1374.7  657.5  864.5
vp8_put_epel4_h4v6_neon:   515.5  295.5  334.7
vp8_put_epel4_h6_c:        851.0  421.0  486.0
vp8_put_epel4_h6_neon:     321.5  195.0  217.7
vp8_put_epel4_h6v4_c:     1111.3  621.1  781.2
vp8_put_epel4_h6v4_neon:   539.2  328.0  365.3
vp8_put_epel4_h6v6_c:     1561.3  763.3  999.7
vp8_put_epel4_h6v6_neon:   645.5  401.0  434.7
vp8_put_epel4_v4_c:        663.8  298.3  357.0
vp8_put_epel4_v4_neon:     116.0   81.5   72.5
vp8_put_epel4_v6_c:        870.5  437.0  507.4
vp8_put_epel4_v6_neon:     147.7  108.8   92.0

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:11 +02:00
Martin Storsjö
cc7ba00c35 aarch64: vp8: Port missing epel8 functions from arm version
Cortex A53     A72     A73
vp8_put_epel8_h4_c:       2594.8  1159.6  1374.8
vp8_put_epel8_h4_neon:     506.4   244.2   314.0
vp8_put_epel8_h6_c:       3445.8  1677.1  1811.3
vp8_put_epel8_h6_neon:     634.4   371.7   433.0
vp8_put_epel8_v4_c:       2614.0  1174.8  1378.0
vp8_put_epel8_v4_neon:     321.0   221.7   235.8
vp8_put_epel8_v6_c:       3635.5  1703.0  2079.2
vp8_put_epel8_v6_neon:     416.9   317.0   295.5

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:08 +02:00
Martin Storsjö
52c9b0a6c0 aarch64: vp8: Port vp8_luma_dc_wht and vp8_idct_dc_add4uv from arm version
Cortex A53    A72    A73
vp8_luma_dc_wht_c:        115.7   75.7   90.7
vp8_luma_dc_wht_neon:      60.7   41.2   45.7
vp8_idct_dc_add4uv_c:     376.1  262.9  282.5
vp8_idct_dc_add4uv_neon:   52.0   29.0   37.0

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:04 +02:00
Martin Storsjö
c513fcd7d2 aarch64: vp8: Fix a typo in a comment
Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:46:00 +02:00
Martin Storsjö
f1011ea28a aarch64: vp8: Reorder the function pointer inits to match the arm original
Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:45:56 +02:00
Martin Storsjö
b4b27dce95 aarch64: vp8: Move the vp8dsp makefile entries to the right places
Even if NEON would be disabled, the init functions should be built
as they are called as long as ARCH_AARCH64 is set.

These functions are part of a generic DSP subsytem, not tied directly
to one decoder. (They should be built if the vp7 decoder is enabled,
even if the vp8 decoder is disabled.)

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:45:53 +02:00
Martin Storsjö
ad32f7b126 aarch64: vp8: Remove superfluous includes
This fixes building with MSVC, which lacks unistd.h.

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:45:50 +02:00
Martin Storsjö
85bfaa4949 aarch64: vp8: Use the proper aarch64 form for conditional branches
The previous form also does seem to assemble on current tools,
but I think it might fail on some older aarch64 tools.

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:45:47 +02:00
Martin Storsjö
2eeac79936 aarch64: vp8: Fix assembling with armasm64
Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:45:44 +02:00
Martin Storsjö
26d7af4c38 aarch64: vp8: Fix assembling with clang
This also partially fixes assembling with MS armasm64 (via
gas-preprocessor).

The movrel macro invocations need to pass the offset via a separate
parameter. Mach-o and COFF relocations don't allow a negative
offset to a symbol, which is handled properly if the offset is passed
via the parameter. If no offset parameter is given, the macro
evaluates to something like "adrp x17, subpel_filters-16+(0)", which
older clang versions also fail to parse (the older clang versions
only support one single offset term, although it can be a parenthesis.

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:45:41 +02:00
Magnus Röös
0801853e64 libavcodec: vp8 neon optimizations for aarch64
Partial port of the ARM Neon for aarch64.

Benchmarks from fate:

benchmarking with Linux Perf Monitoring API
nop: 58.6
checkasm: using random seed 1760970128
NEON:
 - vp8dsp.idct       [OK]
 - vp8dsp.mc         [OK]
 - vp8dsp.loopfilter [OK]
checkasm: all 21 tests passed
vp8_idct_add_c: 201.6
vp8_idct_add_neon: 83.1
vp8_idct_dc_add_c: 107.6
vp8_idct_dc_add_neon: 33.8
vp8_idct_dc_add4y_c: 426.4
vp8_idct_dc_add4y_neon: 59.4
vp8_loop_filter8uv_h_c: 688.1
vp8_loop_filter8uv_h_neon: 216.3
vp8_loop_filter8uv_inner_h_c: 649.3
vp8_loop_filter8uv_inner_h_neon: 195.3
vp8_loop_filter8uv_inner_v_c: 544.8
vp8_loop_filter8uv_inner_v_neon: 131.3
vp8_loop_filter8uv_v_c: 706.1
vp8_loop_filter8uv_v_neon: 141.1
vp8_loop_filter16y_h_c: 668.8
vp8_loop_filter16y_h_neon: 242.8
vp8_loop_filter16y_inner_h_c: 647.3
vp8_loop_filter16y_inner_h_neon: 224.6
vp8_loop_filter16y_inner_v_c: 647.8
vp8_loop_filter16y_inner_v_neon: 128.8
vp8_loop_filter16y_v_c: 721.8
vp8_loop_filter16y_v_neon: 154.3
vp8_loop_filter_simple_h_c: 387.8
vp8_loop_filter_simple_h_neon: 187.6
vp8_loop_filter_simple_v_c: 384.1
vp8_loop_filter_simple_v_neon: 78.6
vp8_put_epel8_h4v4_c: 3971.1
vp8_put_epel8_h4v4_neon: 855.1
vp8_put_epel8_h4v6_c: 5060.1
vp8_put_epel8_h4v6_neon: 989.6
vp8_put_epel8_h6v4_c: 4320.8
vp8_put_epel8_h6v4_neon: 1007.3
vp8_put_epel8_h6v6_c: 5449.3
vp8_put_epel8_h6v6_neon: 1158.1
vp8_put_epel16_h6_c: 6683.8
vp8_put_epel16_h6_neon: 831.8
vp8_put_epel16_h6v6_c: 11110.8
vp8_put_epel16_h6v6_neon: 2214.8
vp8_put_epel16_v6_c: 7024.8
vp8_put_epel16_v6_neon: 799.6
vp8_put_pixels8_c: 112.8
vp8_put_pixels8_neon: 78.1
vp8_put_pixels16_c: 131.3
vp8_put_pixels16_neon: 129.8

This contains a fix to include guards by Carl Eugen Hoyos.

Signed-off-by: Martin Storsjö <martin@martin.st>
2019-02-19 11:45:33 +02:00
Carl Eugen Hoyos
ed20fbcd48 lavc/aarch64/vp8dsp: Fix the include guard.
Fixes fate-source.
2019-01-31 22:35:44 +01:00
Magnus Röös
833fed5253 libavcodec: vp8 neon optimizations for aarch64
Partial port of the ARM Neon for aarch64.

Benchmarks from fate:

benchmarking with Linux Perf Monitoring API
nop: 58.6
checkasm: using random seed 1760970128
NEON:
 - vp8dsp.idct       [OK]
 - vp8dsp.mc         [OK]
 - vp8dsp.loopfilter [OK]
checkasm: all 21 tests passed
vp8_idct_add_c: 201.6
vp8_idct_add_neon: 83.1
vp8_idct_dc_add_c: 107.6
vp8_idct_dc_add_neon: 33.8
vp8_idct_dc_add4y_c: 426.4
vp8_idct_dc_add4y_neon: 59.4
vp8_loop_filter8uv_h_c: 688.1
vp8_loop_filter8uv_h_neon: 216.3
vp8_loop_filter8uv_inner_h_c: 649.3
vp8_loop_filter8uv_inner_h_neon: 195.3
vp8_loop_filter8uv_inner_v_c: 544.8
vp8_loop_filter8uv_inner_v_neon: 131.3
vp8_loop_filter8uv_v_c: 706.1
vp8_loop_filter8uv_v_neon: 141.1
vp8_loop_filter16y_h_c: 668.8
vp8_loop_filter16y_h_neon: 242.8
vp8_loop_filter16y_inner_h_c: 647.3
vp8_loop_filter16y_inner_h_neon: 224.6
vp8_loop_filter16y_inner_v_c: 647.8
vp8_loop_filter16y_inner_v_neon: 128.8
vp8_loop_filter16y_v_c: 721.8
vp8_loop_filter16y_v_neon: 154.3
vp8_loop_filter_simple_h_c: 387.8
vp8_loop_filter_simple_h_neon: 187.6
vp8_loop_filter_simple_v_c: 384.1
vp8_loop_filter_simple_v_neon: 78.6
vp8_put_epel8_h4v4_c: 3971.1
vp8_put_epel8_h4v4_neon: 855.1
vp8_put_epel8_h4v6_c: 5060.1
vp8_put_epel8_h4v6_neon: 989.6
vp8_put_epel8_h6v4_c: 4320.8
vp8_put_epel8_h6v4_neon: 1007.3
vp8_put_epel8_h6v6_c: 5449.3
vp8_put_epel8_h6v6_neon: 1158.1
vp8_put_epel16_h6_c: 6683.8
vp8_put_epel16_h6_neon: 831.8
vp8_put_epel16_h6v6_c: 11110.8
vp8_put_epel16_h6v6_neon: 2214.8
vp8_put_epel16_v6_c: 7024.8
vp8_put_epel16_v6_neon: 799.6
vp8_put_pixels8_c: 112.8
vp8_put_pixels8_neon: 78.1
vp8_put_pixels16_c: 131.3
vp8_put_pixels16_neon: 129.8

Signed-off-by: Magnus Röös <mla2.roos@gmail.com>
2019-01-31 20:17:51 +01:00
Janne Grunau
28a8b5413b h264/aarch64: add intra loop filter neon asm
Add my neon asm from x264 relicensed under the LGPL 2.1 or later. Ported
(x264 uses nv12 chroma) and optimized.

Cycle count for checkasm --bench on a Snapdragon 820e:
h264_h_loop_filter_luma_intra_8bpp_c: 60.0
h264_h_loop_filter_luma_intra_8bpp_neon: 54.2
h264_v_loop_filter_luma_intra_8bpp_c: 148.3
h264_v_loop_filter_luma_intra_8bpp_neon: 73.8
h264_h_loop_filter_chroma_intra_8bpp_c: 27.8
h264_h_loop_filter_chroma_intra_8bpp_neon: 21.4
h264_h_loop_filter_chroma_mbaff_intra_8bpp_c: 15.8
h264_h_loop_filter_chroma_mbaff_intra_8bpp_neon: 15.7
h264_v_loop_filter_chroma_intra_8bpp_c: 45.8
h264_v_loop_filter_chroma_intra_8bpp_neon: 17.3
2019-01-26 12:05:10 +01:00
Janne Grunau
846c3d6aca h264/aarch64: optimize neon loop filter
Exit as soon as possible if no filtering will be done.

Improves the checkasm --bench cycle count on a Snapdragon 820e:
h264_h_loop_filter_luma_8bpp_c:      72.4 ->  72.5
h264_h_loop_filter_luma_8bpp_neon:   97.1 ->  56.3
h264_v_loop_filter_luma_8bpp_c:     174.0 -> 173.5
h264_v_loop_filter_luma_8bpp_neon:   62.9 ->  60.9
h264_h_loop_filter_chroma_8bpp_c:    30.2 ->  30.3
h264_h_loop_filter_chroma_8bpp_neon: 51.6 ->  25.7
h264_v_loop_filter_chroma_8bpp_c:    57.3 ->  57.3
h264_v_loop_filter_chroma_8bpp_neon: 28.0 ->  24.0
2019-01-26 12:05:10 +01:00
Janne Grunau
bb515e3a73 h264/aarch64: sign extend int stride in loop filter asm 2019-01-26 12:05:10 +01:00
Manoj Gupta
6fcf813110 libavcodec: Remove dynamic relocs from aarch64/h264idct_neon.S
Some of the assembly functions e.g. ff_h264_idct_dc_add_neon
has code like:
        movrel          x14, X(ff_h264_idct_add_neon)

Linker cannot resolve them fully at link time and emits dynamic
relocations.
Use explicit labels instead so that no dynamic relocations are
needed at all.

This avoids lld complains about text relocations.

For background, see https://crbug.com/917919

Signed-off-by: Manoj Gupta <manojgupta@chromium.org>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2019-01-03 20:12:07 +01:00
Carl Eugen Hoyos
0576ef466d lavc/aarch64/h264dsp_init_aarch64: Fix weight function prototypes.
Fixes the following warnings:
libavcodec/aarch64/h264dsp_init_aarch64.c: In function ‘ff_h264dsp_init_aarch64’:
libavcodec/aarch64/h264dsp_init_aarch64.c:84:38: warning: assignment from incompatible pointer type [enabled by default]
         c->weight_h264_pixels_tab[0] = ff_weight_h264_pixels_16_neon;
                                      ^
libavcodec/aarch64/h264dsp_init_aarch64.c:85:38: warning: assignment from incompatible pointer type [enabled by default]
         c->weight_h264_pixels_tab[1] = ff_weight_h264_pixels_8_neon;
                                      ^
libavcodec/aarch64/h264dsp_init_aarch64.c:86:38: warning: assignment from incompatible pointer type [enabled by default]
         c->weight_h264_pixels_tab[2] = ff_weight_h264_pixels_4_neon;
                                      ^
libavcodec/aarch64/h264dsp_init_aarch64.c:88:40: warning: assignment from incompatible pointer type [enabled by default]
         c->biweight_h264_pixels_tab[0] = ff_biweight_h264_pixels_16_neon;
                                        ^
libavcodec/aarch64/h264dsp_init_aarch64.c:89:40: warning: assignment from incompatible pointer type [enabled by default]
         c->biweight_h264_pixels_tab[1] = ff_biweight_h264_pixels_8_neon;
                                        ^
libavcodec/aarch64/h264dsp_init_aarch64.c:90:40: warning: assignment from incompatible pointer type [enabled by default]
         c->biweight_h264_pixels_tab[2] = ff_biweight_h264_pixels_4_neon;
                                        ^
2018-07-13 21:28:04 +02:00
Rodger Combs
7723750475 lavc/aarch64/sbrdsp_neon: fix build on old binutils 2018-01-26 02:42:01 -06:00
James Almer
0525722ca0 Merge commit '732510636e597585a79be7d111c88b3f7e174fe7'
* commit '732510636e597585a79be7d111c88b3f7e174fe7':
  aarch64: Remove a dot from a label

Merged-by: James Almer <jamrial@gmail.com>
2017-11-11 17:47:10 -03:00
Martin Storsjö
732510636e aarch64: Remove a dot from a label
This fixes building with armasm64 (when run through gas-preprocessor).

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-10-18 10:49:33 +03:00
Matthieu Bouron
0a24d7ca83 lavc/aarch64: add sbrdsp neon implementation
autocorrelate_c: 644.0
autocorrelate_neon: 420.0
hf_apply_noise_0_c: 1688.5
hf_apply_noise_0_neon: 1498.6
hf_apply_noise_1_c: 1691.2
hf_apply_noise_1_neon: 1500.6
hf_apply_noise_2_c: 1688.1
hf_apply_noise_2_neon: 1500.3
hf_apply_noise_3_c: 1696.6
hf_apply_noise_3_neon: 1502.2
hf_g_filt_c: 2117.8
hf_g_filt_neon: 1218.7
hf_gen_c: 4573.4
hf_gen_neon: 2461.0
neg_odd_64_c: 72.0
neg_odd_64_neon: 64.7
qmf_deint_bfly_c: 1107.6
qmf_deint_bfly_neon: 291.6
qmf_deint_neg_c: 210.4
qmf_deint_neg_neon: 107.4
qmf_post_shuffle_c: 163.0
qmf_post_shuffle_neon: 107.7
qmf_pre_shuffle_c: 120.5
qmf_pre_shuffle_neon: 110.7
sum64x5_c: 1361.6
sum64x5_neon: 435.4
sum_square_c: 1686.4
sum_square_neon: 787.2
2017-07-03 14:29:22 +02:00
Clément Bœsch
b12a36170b lavc/aacpsdsp: use ptrdiff_t for stride in hybrid_analysis 2017-06-28 12:22:39 +02:00
Clément Bœsch
ff0ecef624 lavc/aarch64: add a few SIMD functions for AAC PS
☭ tests/checkasm/checkasm --bench --test=aacpsdsp
checkasm: using random seed 3318985180
MMX implied by specified flags
MMX implied by specified flags
NEON:
 - aacpsdsp.add_squares        [OK]
 - aacpsdsp.mul_pair_single    [OK]
 - aacpsdsp.hybrid_analysis    [OK]
 - aacpsdsp.stereo_interpolate [OK]
checkasm: all 5 tests passed
nop: 10.0
ps_add_squares_c: 63221.2
ps_add_squares_neon: 22311.7
ps_hybrid_analysis_c: 2466.6
ps_hybrid_analysis_neon: 1521.9
ps_mul_pair_single_c: 68592.0
ps_mul_pair_single_neon: 17426.6
ps_stereo_interpolate_c: 72344.3
ps_stereo_interpolate_neon: 72308.8
ps_stereo_interpolate_ipdopd_c: 117415.2
ps_stereo_interpolate_ipdopd_neon: 113386.3
2017-06-28 12:22:39 +02:00
Memphiz
9e85c5d6a7 aarch64: vp9 16bpp: Fix assembling with Xcode 6.2 and older
Properly use the b.eq form instead of the nonstandard form (which
both gas and newer clang accept though), and expand the register
lists that used a range (which the Xcode 6.2 clang, based on clang
3.5 svn, didn't support).

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-06-21 09:08:14 +03:00
Memphiz
998609ddb8 aarch64: vp9: Fix assembling with Xcode 6.2 and older
Properly use the b.eq/b.ge forms instead of the nonstandard forms
(which both gas and newer clang accept though), and expand the
register list that used a range (which the Xcode 6.2 clang, based
on clang 3.5 svn, didn't support).

This is cherrypicked from libav commit
a970f9de86.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-06-21 09:08:13 +03:00
Memphiz
a970f9de86 aarch64: vp9: Fix assembling with Xcode 6.2 and older
Properly use the b.eq/b.ge forms instead of the nonstandard forms
(which both gas and newer clang accept though), and expand the
register list that used a range (which the Xcode 6.2 clang, based
on clang 3.5 svn, didn't support).

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-06-20 16:14:03 +03:00
Matthieu Bouron
204008354f lavc/aarch64/simple_idct: fix build with Xcode 7.2 2017-06-14 23:20:58 +02:00
Matthieu Bouron
8aa60606fb lavc/aarch64/simple_idct: fix idct_col4_top coefficient
Fixes regression introduced by 5d0b8b1ae3.
2017-06-13 17:46:55 +02:00
Matthieu Bouron
5d0b8b1ae3 lavc/aarch64/simple_idct: fix iOS build without gas-preprocessor
Separates macro arguments with commas and passes .4H/.8H as macro
arguments instead of 4H/8H (the later form being interpreted as an
hexadecimal value).

Fixes ticket #6324.

Suggested-by: Martin Storsjö <martin@martin.st>
2017-05-11 16:28:54 +02:00
Clément Bœsch
0f00eb0e4e Merge commit '2425d7329fdccfa9954faba748f3865151354f0c'
* commit '2425d7329fdccfa9954faba748f3865151354f0c':
  arm64: replace 'bic' with immediate with 'and' with inverted immediate

Merged-by: Clément Bœsch <u@pkh.me>
2017-04-26 16:28:57 +02:00
James Almer
5694427dc3 Merge commit '72a19f4013ec2c7f8581416f8ad4bf81df163fb6'
* commit '72a19f4013ec2c7f8581416f8ad4bf81df163fb6':
  mpegaudiodsp: aarch64: Adjust function prototype after 2caa93b813

Merged-by: James Almer <jamrial@gmail.com>
2017-03-31 14:43:37 -03:00
James Almer
c31cbeef58 aarch64/vp9dsp: add missing header includes 2017-03-28 23:02:09 -03:00
Ronald S. Bultje
f8c019944d vp9: re-split the decoder/format/dsp interface header files.
The advantage here is that the internal software decoder interface is
not exposed to the DSP functions or the hardware accelerations.
2017-03-28 18:04:26 -04:00
Clément Bœsch
1c9f4b5078 lavc/vp9: split into vp9{block,data,mvs}
This is following Libav layout to ease merges.
2017-03-27 21:38:21 +02:00
Clément Bœsch
739d8c83f2 Merge commit '9b2ccafb480c94fd09cfb24306d5296dc013cf5b'
* commit '9b2ccafb480c94fd09cfb24306d5296dc013cf5b':
  aarch64: Add missing sign extension in ff_h264_idct8_add_neon

Merged-by: Clément Bœsch <u@pkh.me>
2017-03-23 12:15:39 +01:00
James Almer
9a0fbb9ca9 Merge commit '2caa93b813adc5dbb7771dfe615da826a2947d18'
* commit '2caa93b813adc5dbb7771dfe615da826a2947d18':
  mpegaudiodsp: Change type of array stride parameters to ptrdiff_t

Merged-by: James Almer <jamrial@gmail.com>
2017-03-21 16:04:22 -03:00
James Almer
a8474df944 Merge commit 'e4a94d8b36c48d95a7d412c40d7b558422ff659c'
* commit 'e4a94d8b36c48d95a7d412c40d7b558422ff659c':
  h264chroma: Change type of stride parameters to ptrdiff_t

Merged-by: James Almer <jamrial@gmail.com>
2017-03-21 15:20:45 -03:00
James Almer
5a49097b42 Merge commit '2ec9fa5ec60dcd10e1cb10d8b4e4437e634ea428'
* commit '2ec9fa5ec60dcd10e1cb10d8b4e4437e634ea428':
  idct: Change type of array stride parameters to ptrdiff_t

Merged-by: James Almer <jamrial@gmail.com>
2017-03-21 14:29:52 -03:00
Clément Bœsch
ad98af27f7 Merge commit 'de2ae3c1fae5a2eb539b9abd7bc2a9ca8c286ff0'
* commit 'de2ae3c1fae5a2eb539b9abd7bc2a9ca8c286ff0':
  lavc: add clobber tests for the new encoding/decoding API

The merge only re-order what we already have.

Merged-by: Clément Bœsch <u@pkh.me>
2017-03-21 14:43:53 +01:00
Martin Storsjö
61b8a9ea29 aarch64: vp9itxfm16: Do a simpler half/quarter idct16/idct32 when possible
This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.

This gives a pretty substantial speedup for the smaller subpartitions.

The code size increases from 21512 bytes to 31400 bytes.

The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.

Before:
vp9_inv_dct_dct_16x16_sub1_add_10_neon:     284.6
vp9_inv_dct_dct_16x16_sub2_add_10_neon:    1902.7
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    1903.0
vp9_inv_dct_dct_16x16_sub8_add_10_neon:    2201.1
vp9_inv_dct_dct_16x16_sub12_add_10_neon:   2510.0
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   2821.3
vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1011.6
vp9_inv_dct_dct_32x32_sub2_add_10_neon:    9716.5
vp9_inv_dct_dct_32x32_sub4_add_10_neon:    9704.9
vp9_inv_dct_dct_32x32_sub8_add_10_neon:   10641.7
vp9_inv_dct_dct_32x32_sub12_add_10_neon:  11555.7
vp9_inv_dct_dct_32x32_sub16_add_10_neon:  12499.8
vp9_inv_dct_dct_32x32_sub20_add_10_neon:  13403.7
vp9_inv_dct_dct_32x32_sub24_add_10_neon:  14335.8
vp9_inv_dct_dct_32x32_sub28_add_10_neon:  15253.6
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  16179.5

After:
vp9_inv_dct_dct_16x16_sub1_add_10_neon:     282.8
vp9_inv_dct_dct_16x16_sub2_add_10_neon:    1142.4
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    1139.0
vp9_inv_dct_dct_16x16_sub8_add_10_neon:    1772.9
vp9_inv_dct_dct_16x16_sub12_add_10_neon:   2515.2
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   2823.5
vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1012.7
vp9_inv_dct_dct_32x32_sub2_add_10_neon:    6944.4
vp9_inv_dct_dct_32x32_sub4_add_10_neon:    6944.2
vp9_inv_dct_dct_32x32_sub8_add_10_neon:    7609.8
vp9_inv_dct_dct_32x32_sub12_add_10_neon:   9953.4
vp9_inv_dct_dct_32x32_sub16_add_10_neon:  10770.1
vp9_inv_dct_dct_32x32_sub20_add_10_neon:  13418.8
vp9_inv_dct_dct_32x32_sub24_add_10_neon:  14330.7
vp9_inv_dct_dct_32x32_sub28_add_10_neon:  15257.1
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  16190.6

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:54:37 +02:00
Martin Storsjö
d564c9018f aarch64: vp9itxfm16: Move the load_add_store macro out from the itxfm16 pass2 function
This allows reusing the macro for a separate implementation of the
pass2 function.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:54:30 +02:00
Martin Storsjö
0f2705e66b aarch64: vp9itxfm16: Make the larger core transforms standalone functions
This work is sponsored by, and copyright, Google.

This reduces the code size of libavcodec/aarch64/vp9itxfm_16bpp_neon.o from
26288 to 21512 bytes.

This gives a small slowdown of a couple of tens of cycles, but makes
it more feasible to add more optimized versions of these transforms.

Before:
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    1887.4
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   2801.5
vp9_inv_dct_dct_32x32_sub4_add_10_neon:    9691.4
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  16154.9

After:
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    1899.5
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   2827.2
vp9_inv_dct_dct_32x32_sub4_add_10_neon:    9714.7
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  16175.9

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:54:26 +02:00
Martin Storsjö
b76533f105 aarch64: vp9itxfm16: Restructure the idct32 store macros
This avoids concatenation, which can't be used if the whole macro
is wrapped within another macro.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:54:15 +02:00
Martin Storsjö
d613251622 aarch64: vp9itxfm16: Avoid .irp when it doesn't save any lines
This makes the code a bit more readable.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:54:09 +02:00
Martin Storsjö
25ced1eb1c aarch64: vp9itxfm16: Fix a typo in a comment
Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:54:02 +02:00
Martin Storsjö
21c89f3a26 arm/aarch64: vp9: Fix vertical alignment
Align the second/third operands as they usually are.

Due to the wildly varying sizes of the written out operands
in aarch64 assembly, the column alignment is usually not as clear
as in arm assembly.

This is cherrypicked from libav commit
7995ebfad1.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:53:32 +02:00
Martin Storsjö
70317b25aa arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used
In the half/quarter cases where we don't use the min_eob array, defer
loading the pointer until we know it will be needed.

This is cherrypicked from libav commit
3a0d5e206d.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:53:28 +02:00
Martin Storsjö
7995ebfad1 arm/aarch64: vp9: Fix vertical alignment
Align the second/third operands as they usually are.

Due to the wildly varying sizes of the written out operands
in aarch64 assembly, the column alignment is usually not as clear
as in arm assembly.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-16 23:09:00 +02:00
Matthieu Bouron
4c8e528d19 lavc/aarch64: add ff_simple_idct{,_add,_put}_neon functions 2017-03-16 12:00:41 +01:00
Martin Storsjö
3a0d5e206d arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used
In the half/quarter cases where we don't use the min_eob array, defer
loading the pointer until we know it will be needed.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 22:07:30 +02:00
Martin Storsjö
26ee83acc4 aarch64: vp9itxfm: Reorder iadst16 coeffs
This matches the order they are in the 16 bpp version.

There they are in this order, to make sure we access them in the
same order they are declared, easing loading only half of the
coefficients at a time.

This makes the 8 bpp version match the 16 bpp version better.

This is cherrypicked from libav commit
b8f66c0838.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:52 +02:00
Martin Storsjö
f952273019 aarch64: vp9itxfm: Reorder the idct coefficients for better pairing
All elements are used pairwise, except for the first one.
Previously, the 16th element was unused. Move the unused element
to the second slot, to make the later element pairs not split
across registers.

This simplifies loading only parts of the coefficients,
reducing the difference to the 16 bpp version.

This is cherrypicked from libav commit
09eb88a12e.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:52 +02:00
Martin Storsjö
2905657b90 aarch64: vp9itxfm: Avoid reloading the idct32 coefficients
The idct32x32 function actually pushed d8-d15 onto the stack even
though it didn't clobber them; there are plenty of registers that
can be used to allow keeping all the idct coefficients in registers
without having to reload different subsets of them at different
stages in the transform.

After this, we still can skip pushing d12-d15.

Before:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8128.3
After:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8053.3

This is cherrypicked from libav commit
65aa002d54.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:51 +02:00
Martin Storsjö
f32690a298 aarch64: vp9lpf: Use dup+rev16+uzp1 instead of dup+lsr+dup+trn1
This is one cycle faster in total, and three instructions fewer.

Before:
vp9_loop_filter_mix2_v_44_16_neon: 123.2
After:
vp9_loop_filter_mix2_v_44_16_neon: 122.2

This is cherrypicked from libav commit
3bf9c48320.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:50 +02:00
Martin Storsjö
3fbbad2984 arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit
The theoretical maximum value of E is 193, so we can just
saturate the addition to 255.

Before:                     Cortex A7      A8      A9     A53  A53/AArch64
vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0
After:
vp9_loop_filter_v_4_8_neon:     136.0   125.7   112.6    84.0         83.0
vp9_loop_filter_v_8_8_neon:     234.0   195.5   171.5   136.0        133.7
vp9_loop_filter_v_16_8_neon:    490.0   417.5   377.7   289.0        271.0
vp9_loop_filter_v_16_16_neon:   951.2   814.7   732.3   571.0        446.7

This is cherrypicked from libav commit
c582cb8537.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:50 +02:00
Martin Storsjö
c8d6eec85d aarch64: vp9lpf: Fix broken indentation/vertical alignment
This is cherrypicked from libav commit
07b5136c48.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:49 +02:00
Martin Storsjö
9f3a886364 aarch64: vp9lpf: Interleave the start of flat8in into the calculation above
This adds lots of extra .ifs, but speeds it up by a couple cycles,
by avoiding stalls.

This is cherrypicked from libav commit
b0806088d3.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:49 +02:00
Martin Storsjö
f0ecbb13cf arm/aarch64: vp9lpf: Calculate !hev directly
Previously we first calculated hev, and then negated it.

Since we were able to schedule the negation in the middle
of another calculation, we don't see any gain in all cases.

Before:                     Cortex A7      A8      A9     A53  A53/AArch64
vp9_loop_filter_v_4_8_neon:     147.0   129.0   115.8    89.0         88.7
vp9_loop_filter_v_8_8_neon:     242.0   198.5   174.7   140.0        136.7
vp9_loop_filter_v_16_8_neon:    500.0   419.5   382.7   293.0        275.7
vp9_loop_filter_v_16_16_neon:   971.2   825.5   731.5   579.0        453.0
After:
vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0

This is cherrypicked from libav commit
e1f9de86f4.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:48 +02:00
Martin Storsjö
148cc0bb89 aarch64: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling
This work is sponsored by, and copyright, Google.

Before:                           Cortex A53
vp9_inv_dct_dct_16x16_sub1_add_neon:   235.3
vp9_inv_dct_dct_32x32_sub1_add_neon:   555.1
After:
vp9_inv_dct_dct_16x16_sub1_add_neon:   180.2
vp9_inv_dct_dct_32x32_sub1_add_neon:   475.3

This is cherrypicked from libav commit
3fcf788fbb.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:48 +02:00
Martin Storsjö
045e33ae3f aarch64: vp9mc: Calculate less unused data in the 4 pixel wide horizontal filter
No measured speedup on a Cortex A53, but other cores might benefit.

This is cherrypicked from libav commit
388e0d2515.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:48 +02:00
Martin Storsjö
ac6cb8ae5b aarch64: vp9mc: Simplify the extmla macro parameters
Fold the field lengths into the macro.

This makes the macro invocations much more readable, when the
lines are shorter.

This also makes it easier to use only half the registers within
the macro.

This is cherrypicked from libav commit
5e0c2158fb.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:47 +02:00
Martin Storsjö
16ef000799 aarch64: vp9itxfm: Fix incorrect vertical alignment
This is cherrypicked from libav commit
0c0b87f12d.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:47 +02:00