FFmpeg

mirror of https://github.com/FFmpeg/FFmpeg.git synced 2024-12-23 12:43:46 +02:00

Author	SHA1	Message	Date
Mikhail Nitenko	84ac1440b2	lavc/aarch64: add pred16x16 10-bit functions Benchmarks: A53 A72 pred16x16_dc_10_c: 136.0 124.0 pred16x16_dc_10_neon: 121.2 106.0 pred16x16_horizontal_10_c: 155.0 73.2 pred16x16_horizontal_10_neon: 82.2 67.7 pred16x16_top_dc_10_c: 106.0 93.7 pred16x16_top_dc_10_neon: 87.7 77.2 pred16x16_vertical_10_c: 83.0 67.7 pred16x16_vertical_10_neon: 54.2 61.7 Some functions work slower than C and are left commented out.	2021-04-19 09:01:14 +02:00
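The checkasm figures quoted in these commit messages are cycle counts, so lower is better, and a NEON function is only worth enabling when its count is below the C one. A minimal sketch of that comparison follows; the helper names `speedup` and `worth_enabling` and the `min_gain` threshold are hypothetical and not part of FFmpeg.

```c
#include <assert.h>

/* Ratio of C cycles to NEON cycles: values above 1.0 mean the NEON
 * version is faster, values below 1.0 mean it is slower than C. */
static double speedup(double c_cycles, double neon_cycles)
{
    assert(neon_cycles > 0.0);
    return c_cycles / neon_cycles;
}

/* Hypothetical policy helper: enable the NEON version only when it
 * beats the C version by at least min_gain (1.0 = break-even). */
static int worth_enabling(double c_cycles, double neon_cycles, double min_gain)
{
    return speedup(c_cycles, neon_cycles) >= min_gain;
}
```

With the Cortex A53 numbers above, `speedup(136.0, 121.2)` for pred16x16_dc_10 comes out at roughly 1.12, a modest gain.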
Mikhail Nitenko	6b2e7dc828	lavc/aarch64: change h264pred_init structure Change structure to allow the addition of other bit depths.	2021-04-19 09:00:58 +02:00
Martin Storsjö	870bfe16a1	aarch64: h264pred: Optimize the inner loop of existing 8 bit functions Move the loop counter decrement further from the branch instruction, this hides the latency of the decrement. In loops that first load, then store (the horizontal prediction cases), do the decrement after the load (where the next instruction would stall a bit anyway, waiting for the result of the load). In loops that store twice using the same destination register, also do the decrement between the two stores (as the second store would need to wait for the updated destination register from the first instruction). In loops that store twice to two different destination registers, do the decrement before both stores, to do it as soon before the branch as possible. This gives minor (1-2 cycle) speedups in most cases (modulo measurement noise), but the horizontal prediction functions get a rather notable speedup on the Cortex A53. Before: Cortex A53 A72 A73 pred8x8_dc_8_neon: 60.7 46.2 39.2 pred8x8_dc_128_8_neon: 30.7 18.0 14.0 pred8x8_horizontal_8_neon: 42.2 29.2 18.5 pred8x8_left_dc_8_neon: 52.7 36.2 32.2 pred8x8_mad_cow_dc_0l0_8_neon: 48.2 27.7 25.7 pred8x8_mad_cow_dc_0lt_8_neon: 52.5 33.2 34.7 pred8x8_mad_cow_dc_l0t_8_neon: 52.5 31.7 33.2 pred8x8_mad_cow_dc_l00_8_neon: 43.2 27.0 25.5 pred8x8_plane_8_neon: 112.2 86.2 88.2 pred8x8_top_dc_8_neon: 40.7 23.0 21.2 pred8x8_vertical_8_neon: 27.2 15.5 14.0 pred16x16_dc_8_neon: 91.0 73.2 70.5 pred16x16_dc_128_8_neon: 43.0 34.7 30.7 pred16x16_horizontal_8_neon: 86.0 49.7 44.7 pred16x16_left_dc_8_neon: 87.0 67.2 67.5 pred16x16_plane_8_neon: 236.0 175.7 173.0 pred16x16_top_dc_8_neon: 53.2 39.0 41.7 pred16x16_vertical_8_neon: 41.7 29.7 31.0 After: pred8x8_dc_8_neon: 59.0 46.7 42.5 pred8x8_dc_128_8_neon: 28.2 18.0 14.0 pred8x8_horizontal_8_neon: 34.2 29.2 18.5 pred8x8_left_dc_8_neon: 51.0 38.2 32.7 pred8x8_mad_cow_dc_0l0_8_neon: 46.7 28.2 26.2 pred8x8_mad_cow_dc_0lt_8_neon: 55.2 33.7 37.5 pred8x8_mad_cow_dc_l0t_8_neon: 51.2 31.7 37.2 pred8x8_mad_cow_dc_l00_8_neon: 41.7 27.5 26.0 pred8x8_plane_8_neon: 111.5 86.5 89.5 pred8x8_top_dc_8_neon: 39.0 23.2 21.0 pred8x8_vertical_8_neon: 27.2 16.0 14.0 pred16x16_dc_8_neon: 85.0 70.2 70.5 pred16x16_dc_128_8_neon: 42.0 30.0 30.7 pred16x16_horizontal_8_neon: 66.5 49.5 42.5 pred16x16_left_dc_8_neon: 81.0 66.5 67.5 pred16x16_plane_8_neon: 235.0 175.7 173.0 pred16x16_top_dc_8_neon: 52.0 39.0 41.7 pred16x16_vertical_8_neon: 40.2 33.2 31.0 Despite this, a number of these functions still are slower than what e.g. GCC 7 generates - this shows the relative speedup of the neon codepaths over the compiler generated ones: Cortex A53 A72 A73 pred8x8_dc_8_neon: 0.86 0.65 1.04 pred8x8_dc_128_8_neon: 0.59 0.44 0.62 pred8x8_horizontal_8_neon: 1.51 0.58 1.30 pred8x8_left_dc_8_neon: 0.72 0.56 0.89 pred8x8_mad_cow_dc_0l0_8_neon: 0.93 0.93 1.37 pred8x8_mad_cow_dc_0lt_8_neon: 1.37 1.41 1.68 pred8x8_mad_cow_dc_l0t_8_neon: 1.21 1.17 1.32 pred8x8_mad_cow_dc_l00_8_neon: 1.24 1.19 1.60 pred8x8_plane_8_neon: 3.36 3.58 3.76 pred8x8_top_dc_8_neon: 0.97 0.99 1.43 pred8x8_vertical_8_neon: 0.86 0.78 1.18 pred16x16_dc_8_neon: 1.20 1.06 1.49 pred16x16_dc_128_8_neon: 0.83 0.95 0.99 pred16x16_horizontal_8_neon: 1.78 0.96 1.59 pred16x16_left_dc_8_neon: 1.06 0.96 1.32 pred16x16_plane_8_neon: 5.78 6.49 7.19 pred16x16_top_dc_8_neon: 1.48 1.53 1.94 pred16x16_vertical_8_neon: 1.39 1.34 1.98 In particular, on Cortex A72, many of these functions are slower than the compiler generated code, while they're more beneficial on e.g. the Cortex A73. Signed-off-by: Martin Storsjö <martin@martin.st>	2021-04-14 15:23:44 +03:00
James Almer	f1a894f9d3	avcodec: add missing FF_API_OLD_ENCDEC wrappers to xmm clobber functions Signed-off-by: James Almer <jamrial@gmail.com>	2021-02-26 19:26:31 -03:00
Josh Dekker	7ac41e0db2	lavc/aarch64: add HEVC sao_band NEON Only works for 8x8. Signed-off-by: Josh Dekker <josh@itanimul.li>	2021-02-18 14:12:01 +01:00
Josh Dekker	75c2ddfa61	lavc/aarch64: add HEVC idct_dc NEON Signed-off-by: Josh Dekker <josh@itanimul.li>	2021-02-18 14:12:01 +01:00
Reimar Döffinger	00c916ef61	lavc/aarch64: port HEVC add_residual NEON Speedup is fairly small, around 1.5%, but these are fairly simple. Signed-off-by: Josh Dekker <josh@itanimul.li>	2021-02-18 14:11:57 +01:00
Reimar Döffinger	30f80d855b	lavc/aarch64: port HEVC SIMD idct NEON Makes SIMD-optimized 8x8 and 16x16 idcts for 8 and 10 bit depth available on aarch64. For a UHD HDR (10 bit) sample video these were consuming the most time and this optimization reduced overall decode time from 19.4s to 16.4s, approximately 15% speedup. Test sample was the first 300 frames of "LG 4K HDR Demo - New York.ts", running on Apple M1. Signed-off-by: Josh Dekker <josh@itanimul.li>	2021-02-18 14:11:53 +01:00
Anton Khirnov	c8c2dfbc37	lavu: move LOCAL_ALIGNED from internal.h to mem_internal.h That is a more appropriate place for it.	2021-01-01 14:11:01 +01:00
Martin Storsjö	7168adedbc	libavcodec: aarch64: Add a NEON implementation of pixblockdsp Cortex A53 A72 A73 get_pixels_c: 140.7 87.7 72.5 get_pixels_neon: 46.0 20.0 19.5 get_pixels_unaligned_c: 140.7 87.7 73.0 get_pixels_unaligned_neon: 49.2 20.2 26.2 diff_pixels_c: 209.7 133.7 138.7 diff_pixels_neon: 54.2 31.7 23.5 diff_pixels_unaligned_c: 209.7 134.2 139.0 diff_pixels_unaligned_neon: 68.0 27.7 41.7 Signed-off-by: Martin Storsjö <martin@martin.st>	2020-05-15 23:37:55 +03:00
Carl Eugen Hoyos	34d7c8d942	lavc/aarch64: Remove unneeded file vp9mc_aarch64.c	2020-03-11 14:36:07 +01:00
Carl Eugen Hoyos	951bd25572	lavc/aarch64: Fix suffix of new file vp9mc_aarch64.	2020-03-11 14:29:22 +01:00
Carl Eugen Hoyos	213c796561	lavc/aarch64: Fix compilation with --disable-neon Fixes ticket #8565.	2020-03-11 14:16:48 +01:00
Carl Eugen Hoyos	9a21754904	lavc/aarch64: Move non-neon vp9 copy functions out of neon source file. Fixes part of ticket #8565.	2020-03-11 14:16:40 +01:00
Lynne	aac382e9e5	aarch64/opusdsp: do not clobber register v8 A part of v8-v15 needs to be preserved across calls.	2019-08-15 13:29:22 +01:00
Lynne	f62ee527cb	aarch64/asm-offsets: remove old CELT offsets They're not used and they're incorrect.	2019-05-14 23:41:24 +01:00
Lynne	4d2f62150d	aarch64/opusdsp: implement NEON accelerated postfilter and deemphasis 153372 UNITS in postfilter_c, 65536 runs, 0 skips 73164 UNITS in postfilter_neon, 65536 runs, 0 skips -> 2.1x speedup 80591 UNITS in deemphasis_c, 131072 runs, 0 skips 43969 UNITS in deemphasis_neon, 131072 runs, 0 skips -> 1.83x speedup Total decoder speedup: ~15% on a Raspberry Pi 3 (from 28.1x to 33.5x realtime) Deemphasis SIMD based on the following unrolling: const float c1 = CELT_EMPH_COEFF, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1; float state = coeff; for (int i = 0; i < len; i += 4) { y[0] = x[0] + c1*state; y[1] = x[1] + c2*state + c1*x[0]; y[2] = x[2] + c3*state + c1*x[1] + c2*x[0]; y[3] = x[3] + c4*state + c1*x[2] + c2*x[1] + c3*x[0]; state = y[3]; y += 4; x += 4; } Unlike the x86 version, duplication is used instead of pslldq so the structure and tables are different.	2019-04-10 01:08:54 +02:00
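The unrolling quoted in the commit above expands the one-tap recursion y[n] = x[n] + c*y[n-1] four steps at a time, so each group of four outputs depends only on the four inputs and the single carried state. The sketch below checks that equivalence in plain C. The value of `EMPH_COEFF` is a placeholder assumed for illustration, and the function names are hypothetical; this is not the FFmpeg implementation.

```c
#include <assert.h>

/* Placeholder coefficient, assumed for illustration only. */
#define EMPH_COEFF 0.85f

/* Reference form: y[n] = x[n] + c * y[n-1], one sample at a time. */
static float deemphasis_scalar(float *y, const float *x, float state, int len)
{
    for (int i = 0; i < len; i++)
        state = y[i] = x[i] + EMPH_COEFF * state;
    return state;
}

/* Unrolled by four as in the commit message: every output in a group is
 * written in terms of the inputs and the state carried into the group. */
static float deemphasis_unrolled(float *y, const float *x, float state, int len)
{
    const float c1 = EMPH_COEFF, c2 = c1 * c1, c3 = c2 * c1, c4 = c3 * c1;

    assert(len % 4 == 0); /* this sketch handles whole groups of four only */
    for (int i = 0; i < len; i += 4) {
        y[0] = x[0] + c1 * state;
        y[1] = x[1] + c2 * state + c1 * x[0];
        y[2] = x[2] + c3 * state + c1 * x[1] + c2 * x[0];
        y[3] = x[3] + c4 * state + c1 * x[2] + c2 * x[1] + c3 * x[0];

        state = y[3];
        y += 4;
        x += 4;
    }
    return state;
}

/* Largest absolute difference between two buffers of equal length. */
static float max_abs_diff(const float *a, const float *b, int len)
{
    float worst = 0.0f;
    for (int i = 0; i < len; i++) {
        float d = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

Running both functions over the same input should give outputs that agree to within floating-point rounding. The unrolled form is the one that maps onto four-lane SIMD, since the four right-hand sides no longer depend on each other.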
James Almer	92219ef4ac	Merge commit '186bd30aa3b6c2b29b4dbf18278700b572068b1e' * commit '186bd30aa3b6c2b29b4dbf18278700b572068b1e': h264/arm64: implement missing 4:2:2 chroma loop filter neon functions Merged-by: James Almer <jamrial@gmail.com>	2019-03-14 16:29:41 -03:00
James Almer	5c363d3e59	Merge commit '7e42d5f0ab2aeac811fd01e122627c9198b13f01' * commit '7e42d5f0ab2aeac811fd01e122627c9198b13f01': aarch64: vp8: Optimize vp8_idct_add_neon for aarch64 Merged-by: James Almer <jamrial@gmail.com>	2019-03-14 16:22:29 -03:00
James Almer	409e684e79	Merge commit '49f9c4272c4029b57ff300d908ba03c6332fc9c4' * commit '49f9c4272c4029b57ff300d908ba03c6332fc9c4': aarch64: vp8: Skip saturating in shrn in ff_vp8_idct_add_neon Merged-by: James Almer <jamrial@gmail.com>	2019-03-14 16:21:46 -03:00
James Almer	fbd607dd56	Merge commit '37394ef01b040605f8e1c98e73aa12b1c0bcba07' * commit '37394ef01b040605f8e1c98e73aa12b1c0bcba07': aarch64: vp8: Optimize put_epel16_h6v6 with vp8_epel8_v6_y2 Merged-by: James Almer <jamrial@gmail.com>	2019-03-14 16:20:05 -03:00
James Almer	34a0a9746b	Merge commit 'e39a9212ab37a55b346801c77487d8a47b6f9fe2' * commit 'e39a9212ab37a55b346801c77487d8a47b6f9fe2': aarch64: vp8: Port bilin functions from arm version Merged-by: James Almer <jamrial@gmail.com>	2019-03-14 16:18:42 -03:00
James Almer	2ac399d7fa	Merge commit '58d154922707bfeb873cb3a7476e0f94b17463dd' * commit '58d154922707bfeb873cb3a7476e0f94b17463dd': aarch64: vp8: Port epel4 functions from arm version Merged-by: James Almer <jamrial@gmail.com>	2019-03-14 16:17:33 -03:00
James Almer	c6892f59eb	Merge commit 'cc7ba00c35faf0478f1f56215e926f70ccb31282' * commit 'cc7ba00c35faf0478f1f56215e926f70ccb31282': aarch64: vp8: Port missing epel8 functions from arm version Merged-by: James Almer <jamrial@gmail.com>	2019-03-14 16:16:43 -03:00
James Almer	79025da3f2	Merge commit '52c9b0a6c0d02cff6caebcf6989e565e05b55200' * commit '52c9b0a6c0d02cff6caebcf6989e565e05b55200': aarch64: vp8: Port vp8_luma_dc_wht and vp8_idct_dc_add4uv from arm version Merged-by: James Almer <jamrial@gmail.com>	2019-03-14 16:14:40 -03:00
James Almer	39278ff0de	Merge commit 'c513fcd7d235aa4cef45a6c3125bd4dcc03bf276' * commit 'c513fcd7d235aa4cef45a6c3125bd4dcc03bf276': aarch64: vp8: Fix a typo in a comment Merged-by: James Almer <jamrial@gmail.com>	2019-03-14 16:13:32 -03:00
James Almer	4f9a8d3fe2	Merge commit 'f1011ea28a4048ddec97794ca3e9901474fe055f' * commit 'f1011ea28a4048ddec97794ca3e9901474fe055f': aarch64: vp8: Reorder the function pointer inits to match the arm original Merged-by: James Almer <jamrial@gmail.com>	2019-03-14 16:09:11 -03:00
James Almer	398000abcf	Merge commit '85bfaa4949f4afcde19061def3e8a18988964858' * commit '85bfaa4949f4afcde19061def3e8a18988964858': aarch64: vp8: Use the proper aarch64 form for conditional branches Merged-by: James Almer <jamrial@gmail.com>	2019-03-14 16:06:43 -03:00
James Almer	a2ae381b5a	Merge commit '0801853e640624537db386727b36fa97aa6258e7' * commit '0801853e640624537db386727b36fa97aa6258e7': libavcodec: vp8 neon optimizations for aarch64 See `833fed5253` Merged-by: James Almer <jamrial@gmail.com>	2019-03-14 16:05:52 -03:00
Janne Grunau	186bd30aa3	h264/arm64: implement missing 4:2:2 chroma loop filter neon functions	2019-02-27 21:57:05 +01:00
Carl Eugen Hoyos	7e4d3dbe18	lavc/aarch64/h264dsp_init: Only use neon horizontal intra loopfilter for 4:2:0.	2019-02-20 23:56:21 +01:00
James Almer	aa844dc46f	aarch64/h264dsp: change loop filter stride argument to ptrdiff_t This was missed in `d5d699ab6e` Signed-off-by: James Almer <jamrial@gmail.com>	2019-02-20 19:38:46 -03:00
James Almer	e4e04dce1f	Merge commit '28a8b5413b64b831dfb8650208bccd8b78360484' * commit '28a8b5413b64b831dfb8650208bccd8b78360484': h264/aarch64: add intra loop filter neon asm Merged-by: James Almer <jamrial@gmail.com>	2019-02-20 15:42:01 -03:00
James Almer	4dc1f06f0c	Merge commit '846c3d6aca5484904e60946c4fe8b8833bc07f92' * commit '846c3d6aca5484904e60946c4fe8b8833bc07f92': h264/aarch64: optimize neon loop filter Merged-by: James Almer <jamrial@gmail.com>	2019-02-20 15:41:03 -03:00
James Almer	5ca7eb36b7	Merge commit 'bb515e3a735f526ccb1068031e289eb5aeb69e22' * commit 'bb515e3a735f526ccb1068031e289eb5aeb69e22': h264/aarch64: sign extend int stride in loop filter asm Merged-by: James Almer <jamrial@gmail.com>	2019-02-20 14:50:37 -03:00
Martin Storsjö	c8bc9d1380	aarch64: vp8: Move the vp8dsp makefile entries to the right places Even if NEON would be disabled, the init functions should be built as they are called as long as ARCH_AARCH64 is set. These functions are part of a generic DSP subsystem, not tied directly to one decoder. (They should be built if the vp7 decoder is enabled, even if the vp8 decoder is disabled.) Signed-off-by: Martin Storsjö <martin@martin.st> (cherry picked from commit `b4b27dce95`)	2019-02-19 23:43:17 +02:00
Martin Storsjö	fecf75a5c4	aarch64: vp8: Remove superfluous includes This fixes building with MSVC, which lacks unistd.h. Signed-off-by: Martin Storsjö <martin@martin.st> (cherry picked from commit `ad32f7b126`)	2019-02-19 23:42:16 +02:00
Martin Storsjö	7ddfa5e908	aarch64: vp8: Fix assembling with armasm64 Signed-off-by: Martin Storsjö <martin@martin.st> (cherry picked from commit `2eeac79936`)	2019-02-19 23:42:03 +02:00
Martin Storsjö	c950beb68d	aarch64: vp8: Fix assembling with clang This also partially fixes assembling with MS armasm64 (via gas-preprocessor). The movrel macro invocations need to pass the offset via a separate parameter. Mach-o and COFF relocations don't allow a negative offset to a symbol, which is handled properly if the offset is passed via the parameter. If no offset parameter is given, the macro evaluates to something like "adrp x17, subpel_filters-16+(0)", which older clang versions also fail to parse (the older clang versions only support one single offset term, although it can be a parenthesis). Signed-off-by: Martin Storsjö <martin@martin.st> (cherry picked from commit `26d7af4c38`)	2019-02-19 23:41:47 +02:00
Martin Storsjö	7e42d5f0ab	aarch64: vp8: Optimize vp8_idct_add_neon for aarch64 The previous version was a pretty exact translation of the arm version. This version does do some unnecessary arithmetic (it does more operations on vectors that are only half filled; it does 4 uaddw and 4 sqxtun instead of 2 of each), but it reduces the overhead of packing data together (which could be done for free in the arm version). This gives a decent speedup on Cortex A53, a minor speedup on A72 and a very minor slowdown on Cortex A73. Before: Cortex A53 A72 A73 vp8_idct_add_neon: 79.7 67.5 65.0 After: vp8_idct_add_neon: 67.7 64.8 66.7 Signed-off-by: Martin Storsjö <martin@martin.st>	2019-02-19 11:46:28 +02:00
Martin Storsjö	49f9c4272c	aarch64: vp8: Skip saturating in shrn in ff_vp8_idct_add_neon The original arm version didn't do saturation here. This probably doesn't make any difference for performance, but reduces the differences. Signed-off-by: Martin Storsjö <martin@martin.st>	2019-02-19 11:46:24 +02:00
Martin Storsjö	37394ef01b	aarch64: vp8: Optimize put_epel16_h6v6 with vp8_epel8_v6_y2 This makes it similar to put_epel16_v6, and gives a large speedup on Cortex A53, a minor speedup on A72 and a very minor slowdown on A73. Before: Cortex A53 A72 A73 vp8_put_epel16_h6v6_neon: 2211.4 1586.5 1431.7 After: vp8_put_epel16_h6v6_neon: 1736.9 1522.0 1448.1 Signed-off-by: Martin Storsjö <martin@martin.st>	2019-02-19 11:46:21 +02:00
Martin Storsjö	e39a9212ab	aarch64: vp8: Port bilin functions from arm version Cortex A53 A72 A73 vp8_put_bilin4_h_c: 303.8 102.2 161.8 vp8_put_bilin4_h_neon: 100.0 40.9 41.2 vp8_put_bilin4_hv_c: 322.8 201.0 305.9 vp8_put_bilin4_hv_neon: 156.8 72.6 77.0 vp8_put_bilin4_v_c: 304.7 101.7 166.5 vp8_put_bilin4_v_neon: 82.7 41.2 33.0 vp8_put_bilin8_h_c: 1192.7 352.5 623.8 vp8_put_bilin8_h_neon: 213.5 70.2 87.8 vp8_put_bilin8_hv_c: 1098.6 769.2 1041.9 vp8_put_bilin8_hv_neon: 324.0 123.5 146.0 vp8_put_bilin8_v_c: 1193.9 350.4 617.7 vp8_put_bilin8_v_neon: 183.9 60.7 64.7 vp8_put_bilin16_h_c: 2353.1 671.2 1223.3 vp8_put_bilin16_h_neon: 261.9 140.7 145.0 vp8_put_bilin16_hv_c: 2453.2 1470.9 2355.2 vp8_put_bilin16_hv_neon: 383.9 196.0 217.0 vp8_put_bilin16_v_c: 2349.3 669.8 1251.2 vp8_put_bilin16_v_neon: 202.9 110.7 96.2 Signed-off-by: Martin Storsjö <martin@martin.st>	2019-02-19 11:46:14 +02:00
Martin Storsjö	58d1549227	aarch64: vp8: Port epel4 functions from arm version Cortex A53 A72 A73 vp8_put_epel4_h4_c: 631.4 291.7 367.8 vp8_put_epel4_h4_neon: 241.0 131.0 155.7 vp8_put_epel4_h4v4_c: 967.5 529.3 667.7 vp8_put_epel4_h4v4_neon: 429.3 241.8 279.7 vp8_put_epel4_h4v6_c: 1374.7 657.5 864.5 vp8_put_epel4_h4v6_neon: 515.5 295.5 334.7 vp8_put_epel4_h6_c: 851.0 421.0 486.0 vp8_put_epel4_h6_neon: 321.5 195.0 217.7 vp8_put_epel4_h6v4_c: 1111.3 621.1 781.2 vp8_put_epel4_h6v4_neon: 539.2 328.0 365.3 vp8_put_epel4_h6v6_c: 1561.3 763.3 999.7 vp8_put_epel4_h6v6_neon: 645.5 401.0 434.7 vp8_put_epel4_v4_c: 663.8 298.3 357.0 vp8_put_epel4_v4_neon: 116.0 81.5 72.5 vp8_put_epel4_v6_c: 870.5 437.0 507.4 vp8_put_epel4_v6_neon: 147.7 108.8 92.0 Signed-off-by: Martin Storsjö <martin@martin.st>	2019-02-19 11:46:11 +02:00
Martin Storsjö	cc7ba00c35	aarch64: vp8: Port missing epel8 functions from arm version Cortex A53 A72 A73 vp8_put_epel8_h4_c: 2594.8 1159.6 1374.8 vp8_put_epel8_h4_neon: 506.4 244.2 314.0 vp8_put_epel8_h6_c: 3445.8 1677.1 1811.3 vp8_put_epel8_h6_neon: 634.4 371.7 433.0 vp8_put_epel8_v4_c: 2614.0 1174.8 1378.0 vp8_put_epel8_v4_neon: 321.0 221.7 235.8 vp8_put_epel8_v6_c: 3635.5 1703.0 2079.2 vp8_put_epel8_v6_neon: 416.9 317.0 295.5 Signed-off-by: Martin Storsjö <martin@martin.st>	2019-02-19 11:46:08 +02:00
Martin Storsjö	52c9b0a6c0	aarch64: vp8: Port vp8_luma_dc_wht and vp8_idct_dc_add4uv from arm version Cortex A53 A72 A73 vp8_luma_dc_wht_c: 115.7 75.7 90.7 vp8_luma_dc_wht_neon: 60.7 41.2 45.7 vp8_idct_dc_add4uv_c: 376.1 262.9 282.5 vp8_idct_dc_add4uv_neon: 52.0 29.0 37.0 Signed-off-by: Martin Storsjö <martin@martin.st>	2019-02-19 11:46:04 +02:00
Martin Storsjö	c513fcd7d2	aarch64: vp8: Fix a typo in a comment Signed-off-by: Martin Storsjö <martin@martin.st>	2019-02-19 11:46:00 +02:00
Martin Storsjö	f1011ea28a	aarch64: vp8: Reorder the function pointer inits to match the arm original Signed-off-by: Martin Storsjö <martin@martin.st>	2019-02-19 11:45:56 +02:00
Martin Storsjö	b4b27dce95	aarch64: vp8: Move the vp8dsp makefile entries to the right places Even if NEON would be disabled, the init functions should be built as they are called as long as ARCH_AARCH64 is set. These functions are part of a generic DSP subsystem, not tied directly to one decoder. (They should be built if the vp7 decoder is enabled, even if the vp8 decoder is disabled.) Signed-off-by: Martin Storsjö <martin@martin.st>	2019-02-19 11:45:53 +02:00
Martin Storsjö	ad32f7b126	aarch64: vp8: Remove superfluous includes This fixes building with MSVC, which lacks unistd.h. Signed-off-by: Martin Storsjö <martin@martin.st>	2019-02-19 11:45:50 +02:00
Martin Storsjö	85bfaa4949	aarch64: vp8: Use the proper aarch64 form for conditional branches The previous form also does seem to assemble on current tools, but I think it might fail on some older aarch64 tools. Signed-off-by: Martin Storsjö <martin@martin.st>	2019-02-19 11:45:47 +02:00
Martin Storsjö	2eeac79936	aarch64: vp8: Fix assembling with armasm64 Signed-off-by: Martin Storsjö <martin@martin.st>	2019-02-19 11:45:44 +02:00
Martin Storsjö	26d7af4c38	aarch64: vp8: Fix assembling with clang This also partially fixes assembling with MS armasm64 (via gas-preprocessor). The movrel macro invocations need to pass the offset via a separate parameter. Mach-o and COFF relocations don't allow a negative offset to a symbol, which is handled properly if the offset is passed via the parameter. If no offset parameter is given, the macro evaluates to something like "adrp x17, subpel_filters-16+(0)", which older clang versions also fail to parse (the older clang versions only support one single offset term, although it can be a parenthesis). Signed-off-by: Martin Storsjö <martin@martin.st>	2019-02-19 11:45:41 +02:00
Magnus Röös	0801853e64	libavcodec: vp8 neon optimizations for aarch64 Partial port of the ARM Neon for aarch64. Benchmarks from fate: benchmarking with Linux Perf Monitoring API nop: 58.6 checkasm: using random seed 1760970128 NEON: - vp8dsp.idct [OK] - vp8dsp.mc [OK] - vp8dsp.loopfilter [OK] checkasm: all 21 tests passed vp8_idct_add_c: 201.6 vp8_idct_add_neon: 83.1 vp8_idct_dc_add_c: 107.6 vp8_idct_dc_add_neon: 33.8 vp8_idct_dc_add4y_c: 426.4 vp8_idct_dc_add4y_neon: 59.4 vp8_loop_filter8uv_h_c: 688.1 vp8_loop_filter8uv_h_neon: 216.3 vp8_loop_filter8uv_inner_h_c: 649.3 vp8_loop_filter8uv_inner_h_neon: 195.3 vp8_loop_filter8uv_inner_v_c: 544.8 vp8_loop_filter8uv_inner_v_neon: 131.3 vp8_loop_filter8uv_v_c: 706.1 vp8_loop_filter8uv_v_neon: 141.1 vp8_loop_filter16y_h_c: 668.8 vp8_loop_filter16y_h_neon: 242.8 vp8_loop_filter16y_inner_h_c: 647.3 vp8_loop_filter16y_inner_h_neon: 224.6 vp8_loop_filter16y_inner_v_c: 647.8 vp8_loop_filter16y_inner_v_neon: 128.8 vp8_loop_filter16y_v_c: 721.8 vp8_loop_filter16y_v_neon: 154.3 vp8_loop_filter_simple_h_c: 387.8 vp8_loop_filter_simple_h_neon: 187.6 vp8_loop_filter_simple_v_c: 384.1 vp8_loop_filter_simple_v_neon: 78.6 vp8_put_epel8_h4v4_c: 3971.1 vp8_put_epel8_h4v4_neon: 855.1 vp8_put_epel8_h4v6_c: 5060.1 vp8_put_epel8_h4v6_neon: 989.6 vp8_put_epel8_h6v4_c: 4320.8 vp8_put_epel8_h6v4_neon: 1007.3 vp8_put_epel8_h6v6_c: 5449.3 vp8_put_epel8_h6v6_neon: 1158.1 vp8_put_epel16_h6_c: 6683.8 vp8_put_epel16_h6_neon: 831.8 vp8_put_epel16_h6v6_c: 11110.8 vp8_put_epel16_h6v6_neon: 2214.8 vp8_put_epel16_v6_c: 7024.8 vp8_put_epel16_v6_neon: 799.6 vp8_put_pixels8_c: 112.8 vp8_put_pixels8_neon: 78.1 vp8_put_pixels16_c: 131.3 vp8_put_pixels16_neon: 129.8 This contains a fix to include guards by Carl Eugen Hoyos. Signed-off-by: Martin Storsjö <martin@martin.st>	2019-02-19 11:45:33 +02:00
Carl Eugen Hoyos	ed20fbcd48	lavc/aarch64/vp8dsp: Fix the include guard. Fixes fate-source.	2019-01-31 22:35:44 +01:00
Magnus Röös	833fed5253	libavcodec: vp8 neon optimizations for aarch64 Partial port of the ARM Neon for aarch64. Benchmarks from fate: benchmarking with Linux Perf Monitoring API nop: 58.6 checkasm: using random seed 1760970128 NEON: - vp8dsp.idct [OK] - vp8dsp.mc [OK] - vp8dsp.loopfilter [OK] checkasm: all 21 tests passed vp8_idct_add_c: 201.6 vp8_idct_add_neon: 83.1 vp8_idct_dc_add_c: 107.6 vp8_idct_dc_add_neon: 33.8 vp8_idct_dc_add4y_c: 426.4 vp8_idct_dc_add4y_neon: 59.4 vp8_loop_filter8uv_h_c: 688.1 vp8_loop_filter8uv_h_neon: 216.3 vp8_loop_filter8uv_inner_h_c: 649.3 vp8_loop_filter8uv_inner_h_neon: 195.3 vp8_loop_filter8uv_inner_v_c: 544.8 vp8_loop_filter8uv_inner_v_neon: 131.3 vp8_loop_filter8uv_v_c: 706.1 vp8_loop_filter8uv_v_neon: 141.1 vp8_loop_filter16y_h_c: 668.8 vp8_loop_filter16y_h_neon: 242.8 vp8_loop_filter16y_inner_h_c: 647.3 vp8_loop_filter16y_inner_h_neon: 224.6 vp8_loop_filter16y_inner_v_c: 647.8 vp8_loop_filter16y_inner_v_neon: 128.8 vp8_loop_filter16y_v_c: 721.8 vp8_loop_filter16y_v_neon: 154.3 vp8_loop_filter_simple_h_c: 387.8 vp8_loop_filter_simple_h_neon: 187.6 vp8_loop_filter_simple_v_c: 384.1 vp8_loop_filter_simple_v_neon: 78.6 vp8_put_epel8_h4v4_c: 3971.1 vp8_put_epel8_h4v4_neon: 855.1 vp8_put_epel8_h4v6_c: 5060.1 vp8_put_epel8_h4v6_neon: 989.6 vp8_put_epel8_h6v4_c: 4320.8 vp8_put_epel8_h6v4_neon: 1007.3 vp8_put_epel8_h6v6_c: 5449.3 vp8_put_epel8_h6v6_neon: 1158.1 vp8_put_epel16_h6_c: 6683.8 vp8_put_epel16_h6_neon: 831.8 vp8_put_epel16_h6v6_c: 11110.8 vp8_put_epel16_h6v6_neon: 2214.8 vp8_put_epel16_v6_c: 7024.8 vp8_put_epel16_v6_neon: 799.6 vp8_put_pixels8_c: 112.8 vp8_put_pixels8_neon: 78.1 vp8_put_pixels16_c: 131.3 vp8_put_pixels16_neon: 129.8 Signed-off-by: Magnus Röös <mla2.roos@gmail.com>	2019-01-31 20:17:51 +01:00
Janne Grunau	28a8b5413b	h264/aarch64: add intra loop filter neon asm Add my neon asm from x264 relicensed under the LGPL 2.1 or later. Ported (x264 uses nv12 chroma) and optimized. Cycle count for checkasm --bench on a Snapdragon 820e: h264_h_loop_filter_luma_intra_8bpp_c: 60.0 h264_h_loop_filter_luma_intra_8bpp_neon: 54.2 h264_v_loop_filter_luma_intra_8bpp_c: 148.3 h264_v_loop_filter_luma_intra_8bpp_neon: 73.8 h264_h_loop_filter_chroma_intra_8bpp_c: 27.8 h264_h_loop_filter_chroma_intra_8bpp_neon: 21.4 h264_h_loop_filter_chroma_mbaff_intra_8bpp_c: 15.8 h264_h_loop_filter_chroma_mbaff_intra_8bpp_neon: 15.7 h264_v_loop_filter_chroma_intra_8bpp_c: 45.8 h264_v_loop_filter_chroma_intra_8bpp_neon: 17.3	2019-01-26 12:05:10 +01:00
Janne Grunau	846c3d6aca	h264/aarch64: optimize neon loop filter Exit as soon as possible if no filtering will be done. Improves the checkasm --bench cycle count on a Snapdragon 820e: h264_h_loop_filter_luma_8bpp_c: 72.4 -> 72.5 h264_h_loop_filter_luma_8bpp_neon: 97.1 -> 56.3 h264_v_loop_filter_luma_8bpp_c: 174.0 -> 173.5 h264_v_loop_filter_luma_8bpp_neon: 62.9 -> 60.9 h264_h_loop_filter_chroma_8bpp_c: 30.2 -> 30.3 h264_h_loop_filter_chroma_8bpp_neon: 51.6 -> 25.7 h264_v_loop_filter_chroma_8bpp_c: 57.3 -> 57.3 h264_v_loop_filter_chroma_8bpp_neon: 28.0 -> 24.0	2019-01-26 12:05:10 +01:00
Janne Grunau	bb515e3a73	h264/aarch64: sign extend int stride in loop filter asm	2019-01-26 12:05:10 +01:00
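The sign-extension fix above concerns a 32-bit `int` stride that the assembly consumes as a 64-bit register. Under the AArch64 procedure call standard the upper 32 bits of such a register are not guaranteed to hold anything meaningful, so using it directly in address arithmetic can go wrong, most visibly for negative strides. The sketch below models this with plain integers; all function names are hypothetical, and the "garbage" upper half is simulated, not read from a real register.

```c
#include <stdint.h>

/* Model a 64-bit register carrying a 32-bit int argument: the low half
 * holds the value, the upper half holds whatever the caller left there. */
static uint64_t reg_with_garbage(int32_t stride, uint32_t garbage)
{
    return ((uint64_t)garbage << 32) | (uint32_t)stride;
}

/* Address arithmetic using the full register as-is (the bug). */
static uint64_t next_row_raw(uint64_t base, uint64_t reg)
{
    return base + reg;
}

/* Address arithmetic after sign-extending the low 32 bits, which is
 * what an sxtw instruction does before the stride is used. */
static uint64_t next_row_sxtw(uint64_t base, uint64_t reg)
{
    uint64_t low = reg & 0xffffffffu;
    uint64_t ext = (low ^ 0x80000000u) - 0x80000000u; /* sign-extend bit 31 */
    return base + ext;
}
```

With a base of 0x1000 and a stride of -16, the sign-extended form yields 0xFF0 whatever the upper half contains, while the raw form lands far outside the buffer. Changing the prototype to `ptrdiff_t`, as the later commit `aa844dc46f` does for the loop filter stride, makes the caller pass a full-width value in the first place.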
Manoj Gupta	6fcf813110	libavcodec: Remove dynamic relocs from aarch64/h264idct_neon.S Some of the assembly functions, e.g. ff_h264_idct_dc_add_neon, have code like: movrel x14, X(ff_h264_idct_add_neon) The linker cannot resolve them fully at link time and emits dynamic relocations. Use explicit labels instead so that no dynamic relocations are needed at all. This avoids lld complaints about text relocations. For background, see https://crbug.com/917919 Signed-off-by: Manoj Gupta <manojgupta@chromium.org> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2019-01-03 20:12:07 +01:00
Carl Eugen Hoyos	0576ef466d	lavc/aarch64/h264dsp_init_aarch64: Fix weight function prototypes. Fixes the following warnings: libavcodec/aarch64/h264dsp_init_aarch64.c: In function ‘ff_h264dsp_init_aarch64’: libavcodec/aarch64/h264dsp_init_aarch64.c:84:38: warning: assignment from incompatible pointer type [enabled by default] c->weight_h264_pixels_tab[0] = ff_weight_h264_pixels_16_neon; ^ libavcodec/aarch64/h264dsp_init_aarch64.c:85:38: warning: assignment from incompatible pointer type [enabled by default] c->weight_h264_pixels_tab[1] = ff_weight_h264_pixels_8_neon; ^ libavcodec/aarch64/h264dsp_init_aarch64.c:86:38: warning: assignment from incompatible pointer type [enabled by default] c->weight_h264_pixels_tab[2] = ff_weight_h264_pixels_4_neon; ^ libavcodec/aarch64/h264dsp_init_aarch64.c:88:40: warning: assignment from incompatible pointer type [enabled by default] c->biweight_h264_pixels_tab[0] = ff_biweight_h264_pixels_16_neon; ^ libavcodec/aarch64/h264dsp_init_aarch64.c:89:40: warning: assignment from incompatible pointer type [enabled by default] c->biweight_h264_pixels_tab[1] = ff_biweight_h264_pixels_8_neon; ^ libavcodec/aarch64/h264dsp_init_aarch64.c:90:40: warning: assignment from incompatible pointer type [enabled by default] c->biweight_h264_pixels_tab[2] = ff_biweight_h264_pixels_4_neon; ^	2018-07-13 21:28:04 +02:00
Rodger Combs	7723750475	lavc/aarch64/sbrdsp_neon: fix build on old binutils	2018-01-26 02:42:01 -06:00
James Almer	0525722ca0	Merge commit '732510636e597585a79be7d111c88b3f7e174fe7' * commit '732510636e597585a79be7d111c88b3f7e174fe7': aarch64: Remove a dot from a label Merged-by: James Almer <jamrial@gmail.com>	2017-11-11 17:47:10 -03:00
Martin Storsjö	732510636e	aarch64: Remove a dot from a label This fixes building with armasm64 (when run through gas-preprocessor). Signed-off-by: Martin Storsjö <martin@martin.st>	2017-10-18 10:49:33 +03:00
Matthieu Bouron	0a24d7ca83	lavc/aarch64: add sbrdsp neon implementation autocorrelate_c: 644.0 autocorrelate_neon: 420.0 hf_apply_noise_0_c: 1688.5 hf_apply_noise_0_neon: 1498.6 hf_apply_noise_1_c: 1691.2 hf_apply_noise_1_neon: 1500.6 hf_apply_noise_2_c: 1688.1 hf_apply_noise_2_neon: 1500.3 hf_apply_noise_3_c: 1696.6 hf_apply_noise_3_neon: 1502.2 hf_g_filt_c: 2117.8 hf_g_filt_neon: 1218.7 hf_gen_c: 4573.4 hf_gen_neon: 2461.0 neg_odd_64_c: 72.0 neg_odd_64_neon: 64.7 qmf_deint_bfly_c: 1107.6 qmf_deint_bfly_neon: 291.6 qmf_deint_neg_c: 210.4 qmf_deint_neg_neon: 107.4 qmf_post_shuffle_c: 163.0 qmf_post_shuffle_neon: 107.7 qmf_pre_shuffle_c: 120.5 qmf_pre_shuffle_neon: 110.7 sum64x5_c: 1361.6 sum64x5_neon: 435.4 sum_square_c: 1686.4 sum_square_neon: 787.2	2017-07-03 14:29:22 +02:00
Clément Bœsch	b12a36170b	lavc/aacpsdsp: use ptrdiff_t for stride in hybrid_analysis	2017-06-28 12:22:39 +02:00
Clément Bœsch	ff0ecef624	lavc/aarch64: add a few SIMD functions for AAC PS ☭ tests/checkasm/checkasm --bench --test=aacpsdsp checkasm: using random seed 3318985180 MMX implied by specified flags MMX implied by specified flags NEON: - aacpsdsp.add_squares [OK] - aacpsdsp.mul_pair_single [OK] - aacpsdsp.hybrid_analysis [OK] - aacpsdsp.stereo_interpolate [OK] checkasm: all 5 tests passed nop: 10.0 ps_add_squares_c: 63221.2 ps_add_squares_neon: 22311.7 ps_hybrid_analysis_c: 2466.6 ps_hybrid_analysis_neon: 1521.9 ps_mul_pair_single_c: 68592.0 ps_mul_pair_single_neon: 17426.6 ps_stereo_interpolate_c: 72344.3 ps_stereo_interpolate_neon: 72308.8 ps_stereo_interpolate_ipdopd_c: 117415.2 ps_stereo_interpolate_ipdopd_neon: 113386.3	2017-06-28 12:22:39 +02:00
Memphiz	9e85c5d6a7	aarch64: vp9 16bpp: Fix assembling with Xcode 6.2 and older Properly use the b.eq form instead of the nonstandard form (which both gas and newer clang accept though), and expand the register lists that used a range (which the Xcode 6.2 clang, based on clang 3.5 svn, didn't support). Signed-off-by: Martin Storsjö <martin@martin.st>	2017-06-21 09:08:14 +03:00
Memphiz	998609ddb8	aarch64: vp9: Fix assembling with Xcode 6.2 and older Properly use the b.eq/b.ge forms instead of the nonstandard forms (which both gas and newer clang accept though), and expand the register list that used a range (which the Xcode 6.2 clang, based on clang 3.5 svn, didn't support). This is cherrypicked from libav commit `a970f9de86`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-06-21 09:08:13 +03:00
Memphiz	a970f9de86	aarch64: vp9: Fix assembling with Xcode 6.2 and older Properly use the b.eq/b.ge forms instead of the nonstandard forms (which both gas and newer clang accept though), and expand the register list that used a range (which the Xcode 6.2 clang, based on clang 3.5 svn, didn't support). Signed-off-by: Martin Storsjö <martin@martin.st>	2017-06-20 16:14:03 +03:00
Matthieu Bouron	204008354f	lavc/aarch64/simple_idct: fix build with Xcode 7.2	2017-06-14 23:20:58 +02:00
Matthieu Bouron	8aa60606fb	lavc/aarch64/simple_idct: fix idct_col4_top coefficient Fixes regression introduced by `5d0b8b1ae3`.	2017-06-13 17:46:55 +02:00
Matthieu Bouron	5d0b8b1ae3	lavc/aarch64/simple_idct: fix iOS build without gas-preprocessor Separates macro arguments with commas and passes .4H/.8H as macro arguments instead of 4H/8H (the latter form being interpreted as a hexadecimal value). Fixes ticket #6324. Suggested-by: Martin Storsjö <martin@martin.st>	2017-05-11 16:28:54 +02:00
Clément Bœsch	0f00eb0e4e	Merge commit '2425d7329fdccfa9954faba748f3865151354f0c' * commit '2425d7329fdccfa9954faba748f3865151354f0c': arm64: replace 'bic' with immediate with 'and' with inverted immediate Merged-by: Clément Bœsch <u@pkh.me>	2017-04-26 16:28:57 +02:00
James Almer	5694427dc3	Merge commit '72a19f4013ec2c7f8581416f8ad4bf81df163fb6' * commit '72a19f4013ec2c7f8581416f8ad4bf81df163fb6': mpegaudiodsp: aarch64: Adjust function prototype after `2caa93b813` Merged-by: James Almer <jamrial@gmail.com>	2017-03-31 14:43:37 -03:00
James Almer	c31cbeef58	aarch64/vp9dsp: add missing header includes	2017-03-28 23:02:09 -03:00
Ronald S. Bultje	f8c019944d	vp9: re-split the decoder/format/dsp interface header files. The advantage here is that the internal software decoder interface is not exposed to the DSP functions or the hardware accelerations.	2017-03-28 18:04:26 -04:00
Clément Bœsch	1c9f4b5078	lavc/vp9: split into vp9{block,data,mvs} This is following Libav layout to ease merges.	2017-03-27 21:38:21 +02:00
Clément Bœsch	739d8c83f2	Merge commit '9b2ccafb480c94fd09cfb24306d5296dc013cf5b' * commit '9b2ccafb480c94fd09cfb24306d5296dc013cf5b': aarch64: Add missing sign extension in ff_h264_idct8_add_neon Merged-by: Clément Bœsch <u@pkh.me>	2017-03-23 12:15:39 +01:00
James Almer	9a0fbb9ca9	Merge commit '2caa93b813adc5dbb7771dfe615da826a2947d18' * commit '2caa93b813adc5dbb7771dfe615da826a2947d18': mpegaudiodsp: Change type of array stride parameters to ptrdiff_t Merged-by: James Almer <jamrial@gmail.com>	2017-03-21 16:04:22 -03:00
James Almer	a8474df944	Merge commit 'e4a94d8b36c48d95a7d412c40d7b558422ff659c' * commit 'e4a94d8b36c48d95a7d412c40d7b558422ff659c': h264chroma: Change type of stride parameters to ptrdiff_t Merged-by: James Almer <jamrial@gmail.com>	2017-03-21 15:20:45 -03:00
James Almer	5a49097b42	Merge commit '2ec9fa5ec60dcd10e1cb10d8b4e4437e634ea428' * commit '2ec9fa5ec60dcd10e1cb10d8b4e4437e634ea428': idct: Change type of array stride parameters to ptrdiff_t Merged-by: James Almer <jamrial@gmail.com>	2017-03-21 14:29:52 -03:00
Clément Bœsch	ad98af27f7	Merge commit 'de2ae3c1fae5a2eb539b9abd7bc2a9ca8c286ff0' * commit 'de2ae3c1fae5a2eb539b9abd7bc2a9ca8c286ff0': lavc: add clobber tests for the new encoding/decoding API The merge only re-orders what we already have. Merged-by: Clément Bœsch <u@pkh.me>	2017-03-21 14:43:53 +01:00
Martin Storsjö	61b8a9ea29	aarch64: vp9itxfm16: Do a simpler half/quarter idct16/idct32 when possible This work is sponsored by, and copyright, Google. This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read. This gives a pretty substantial speedup for the smaller subpartitions. The code size increases from 21512 bytes to 31400 bytes. The idct16/32_end macros are moved above the individual functions; the instructions themselves are unchanged, but since new functions are added at the same place where the code is moved from, the diff looks rather messy. Before: vp9_inv_dct_dct_16x16_sub1_add_10_neon: 284.6 vp9_inv_dct_dct_16x16_sub2_add_10_neon: 1902.7 vp9_inv_dct_dct_16x16_sub4_add_10_neon: 1903.0 vp9_inv_dct_dct_16x16_sub8_add_10_neon: 2201.1 vp9_inv_dct_dct_16x16_sub12_add_10_neon: 2510.0 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 2821.3 vp9_inv_dct_dct_32x32_sub1_add_10_neon: 1011.6 vp9_inv_dct_dct_32x32_sub2_add_10_neon: 9716.5 vp9_inv_dct_dct_32x32_sub4_add_10_neon: 9704.9 vp9_inv_dct_dct_32x32_sub8_add_10_neon: 10641.7 vp9_inv_dct_dct_32x32_sub12_add_10_neon: 11555.7 vp9_inv_dct_dct_32x32_sub16_add_10_neon: 12499.8 vp9_inv_dct_dct_32x32_sub20_add_10_neon: 13403.7 vp9_inv_dct_dct_32x32_sub24_add_10_neon: 14335.8 vp9_inv_dct_dct_32x32_sub28_add_10_neon: 15253.6 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16179.5 After: vp9_inv_dct_dct_16x16_sub1_add_10_neon: 282.8 vp9_inv_dct_dct_16x16_sub2_add_10_neon: 1142.4 vp9_inv_dct_dct_16x16_sub4_add_10_neon: 1139.0 vp9_inv_dct_dct_16x16_sub8_add_10_neon: 1772.9 vp9_inv_dct_dct_16x16_sub12_add_10_neon: 2515.2 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 2823.5 vp9_inv_dct_dct_32x32_sub1_add_10_neon: 1012.7 vp9_inv_dct_dct_32x32_sub2_add_10_neon: 6944.4 vp9_inv_dct_dct_32x32_sub4_add_10_neon: 6944.2 vp9_inv_dct_dct_32x32_sub8_add_10_neon: 7609.8 vp9_inv_dct_dct_32x32_sub12_add_10_neon: 9953.4 vp9_inv_dct_dct_32x32_sub16_add_10_neon: 10770.1 vp9_inv_dct_dct_32x32_sub20_add_10_neon: 13418.8 vp9_inv_dct_dct_32x32_sub24_add_10_neon: 14330.7 vp9_inv_dct_dct_32x32_sub28_add_10_neon: 15257.1 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16190.6 Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-19 22:54:37 +02:00
Martin Storsjö	d564c9018f	aarch64: vp9itxfm16: Move the load_add_store macro out from the itxfm16 pass2 function This allows reusing the macro for a separate implementation of the pass2 function. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-19 22:54:30 +02:00
Martin Storsjö	0f2705e66b	aarch64: vp9itxfm16: Make the larger core transforms standalone functions This work is sponsored by, and copyright, Google. This reduces the code size of libavcodec/aarch64/vp9itxfm_16bpp_neon.o from 26288 to 21512 bytes. This gives a small slowdown of a couple of tens of cycles, but makes it more feasible to add more optimized versions of these transforms. Before: vp9_inv_dct_dct_16x16_sub4_add_10_neon: 1887.4 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 2801.5 vp9_inv_dct_dct_32x32_sub4_add_10_neon: 9691.4 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16154.9 After: vp9_inv_dct_dct_16x16_sub4_add_10_neon: 1899.5 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 2827.2 vp9_inv_dct_dct_32x32_sub4_add_10_neon: 9714.7 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16175.9 Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-19 22:54:26 +02:00
Martin Storsjö	b76533f105	aarch64: vp9itxfm16: Restructure the idct32 store macros This avoids concatenation, which can't be used if the whole macro is wrapped within another macro. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-19 22:54:15 +02:00
Martin Storsjö	d613251622	aarch64: vp9itxfm16: Avoid .irp when it doesn't save any lines This makes the code a bit more readable. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-19 22:54:09 +02:00
Martin Storsjö	25ced1eb1c	aarch64: vp9itxfm16: Fix a typo in a comment Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-19 22:54:02 +02:00
Martin Storsjö	21c89f3a26	arm/aarch64: vp9: Fix vertical alignment Align the second/third operands as they usually are. Due to the wildly varying sizes of the written out operands in aarch64 assembly, the column alignment is usually not as clear as in arm assembly. This is cherrypicked from libav commit `7995ebfad1`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-19 22:53:32 +02:00
Martin Storsjö	70317b25aa	arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used In the half/quarter cases where we don't use the min_eob array, defer loading the pointer until we know it will be needed. This is cherrypicked from libav commit `3a0d5e206d`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-19 22:53:28 +02:00
Martin Storsjö	7995ebfad1	arm/aarch64: vp9: Fix vertical alignment Align the second/third operands as they usually are. Due to the wildly varying sizes of the written out operands in aarch64 assembly, the column alignment is usually not as clear as in arm assembly. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-16 23:09:00 +02:00
Matthieu Bouron	4c8e528d19	lavc/aarch64: add ff_simple_idct{,_add,_put}_neon functions	2017-03-16 12:00:41 +01:00
Martin Storsjö	3a0d5e206d	arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used In the half/quarter cases where we don't use the min_eob array, defer loading the pointer until we know it will be needed. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 22:07:30 +02:00
Martin Storsjö	26ee83acc4	aarch64: vp9itxfm: Reorder iadst16 coeffs This matches the order they are in the 16 bpp version. There they are in this order, to make sure we access them in the same order they are declared, easing loading only half of the coefficients at a time. This makes the 8 bpp version match the 16 bpp version better. This is cherrypicked from libav commit `b8f66c0838`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:52 +02:00
Martin Storsjö	f952273019	aarch64: vp9itxfm: Reorder the idct coefficients for better pairing All elements are used pairwise, except for the first one. Previously, the 16th element was unused. Move the unused element to the second slot, to make the later element pairs not split across registers. This simplifies loading only parts of the coefficients, reducing the difference to the 16 bpp version. This is cherrypicked from libav commit `09eb88a12e`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:52 +02:00
Martin Storsjö	2905657b90	aarch64: vp9itxfm: Avoid reloading the idct32 coefficients The idct32x32 function actually pushed d8-d15 onto the stack even though it didn't clobber them; there are plenty of registers that can be used to allow keeping all the idct coefficients in registers without having to reload different subsets of them at different stages in the transform. After this, we still can skip pushing d12-d15. Before: vp9_inv_dct_dct_32x32_sub32_add_neon: 8128.3 After: vp9_inv_dct_dct_32x32_sub32_add_neon: 8053.3 This is cherrypicked from libav commit `65aa002d54`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:51 +02:00
Martin Storsjö	f32690a298	aarch64: vp9lpf: Use dup+rev16+uzp1 instead of dup+lsr+dup+trn1 This is one cycle faster in total, and three instructions fewer. Before: vp9_loop_filter_mix2_v_44_16_neon: 123.2 After: vp9_loop_filter_mix2_v_44_16_neon: 122.2 This is cherrypicked from libav commit `3bf9c48320`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:50 +02:00
Martin Storsjö	3fbbad2984	arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit The theoretical maximum value of E is 193, so we can just saturate the addition to 255. Before: Cortex A7 A8 A9 A53 A53/AArch64 vp9_loop_filter_v_4_8_neon: 143.0 127.7 114.8 88.0 87.7 vp9_loop_filter_v_8_8_neon: 241.0 197.2 173.7 140.0 136.7 vp9_loop_filter_v_16_8_neon: 497.0 419.5 379.7 293.0 275.7 vp9_loop_filter_v_16_16_neon: 965.2 818.7 731.4 579.0 452.0 After: vp9_loop_filter_v_4_8_neon: 136.0 125.7 112.6 84.0 83.0 vp9_loop_filter_v_8_8_neon: 234.0 195.5 171.5 136.0 133.7 vp9_loop_filter_v_16_8_neon: 490.0 417.5 377.7 289.0 271.0 vp9_loop_filter_v_16_16_neon: 951.2 814.7 732.3 571.0 446.7 This is cherrypicked from libav commit `c582cb8537`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:50 +02:00
Martin Storsjö	c8d6eec85d	aarch64: vp9lpf: Fix broken indentation/vertical alignment This is cherrypicked from libav commit `07b5136c48`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:49 +02:00
Martin Storsjö	9f3a886364	aarch64: vp9lpf: Interleave the start of flat8in into the calculation above This adds lots of extra .ifs, but speeds it up by a couple cycles, by avoiding stalls. This is cherrypicked from libav commit `b0806088d3`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:49 +02:00
Martin Storsjö	f0ecbb13cf	arm/aarch64: vp9lpf: Calculate !hev directly Previously we first calculated hev, and then negated it. Since we were able to schedule the negation in the middle of another calculation, we don't see a gain in every case. Before: Cortex A7 A8 A9 A53 A53/AArch64 vp9_loop_filter_v_4_8_neon: 147.0 129.0 115.8 89.0 88.7 vp9_loop_filter_v_8_8_neon: 242.0 198.5 174.7 140.0 136.7 vp9_loop_filter_v_16_8_neon: 500.0 419.5 382.7 293.0 275.7 vp9_loop_filter_v_16_16_neon: 971.2 825.5 731.5 579.0 453.0 After: vp9_loop_filter_v_4_8_neon: 143.0 127.7 114.8 88.0 87.7 vp9_loop_filter_v_8_8_neon: 241.0 197.2 173.7 140.0 136.7 vp9_loop_filter_v_16_8_neon: 497.0 419.5 379.7 293.0 275.7 vp9_loop_filter_v_16_16_neon: 965.2 818.7 731.4 579.0 452.0 This is cherrypicked from libav commit `e1f9de86f4`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:48 +02:00
Martin Storsjö	148cc0bb89	aarch64: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling This work is sponsored by, and copyright, Google. Before: Cortex A53 vp9_inv_dct_dct_16x16_sub1_add_neon: 235.3 vp9_inv_dct_dct_32x32_sub1_add_neon: 555.1 After: vp9_inv_dct_dct_16x16_sub1_add_neon: 180.2 vp9_inv_dct_dct_32x32_sub1_add_neon: 475.3 This is cherrypicked from libav commit `3fcf788fbb`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:48 +02:00
Martin Storsjö	045e33ae3f	aarch64: vp9mc: Calculate less unused data in the 4 pixel wide horizontal filter No measured speedup on a Cortex A53, but other cores might benefit. This is cherrypicked from libav commit `388e0d2515`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:48 +02:00
Martin Storsjö	ac6cb8ae5b	aarch64: vp9mc: Simplify the extmla macro parameters Fold the field lengths into the macro. This makes the macro invocations much more readable, when the lines are shorter. This also makes it easier to use only half the registers within the macro. This is cherrypicked from libav commit `5e0c2158fb`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:47 +02:00
Martin Storsjö	16ef000799	aarch64: vp9itxfm: Fix incorrect vertical alignment This is cherrypicked from libav commit `0c0b87f12d`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:47 +02:00
Martin Storsjö	d0fbf7f34e	aarch64: vp9itxfm: Update a comment to refer to a register with a different name This is cherrypicked from libav commit `8476eb0d3a`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:46 +02:00
Martin Storsjö	6752318c73	aarch64: vp9itxfm: Use the right lane sizes in 8x8 for improved readability This is cherrypicked from libav commit `3dd7827258`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:46 +02:00
Martin Storsjö	19a0f9529c	aarch64: vp9itxfm: Use a single lane ld1 instead of ld1r where possible The ld1r is a leftover from the arm version, where this trick is beneficial on some cores. Use a single-lane load where we don't need the semantics of ld1r. This is cherrypicked from libav commit `ed8d293306`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:46 +02:00
Martin Storsjö	3006e5253a	aarch64: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function This is cherrypicked from libav commit `4da4b2b87f`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:27 +02:00
Martin Storsjö	9532a7d4d0	aarch64: vp9itxfm: Do separate functions for half/quarter idct16 and idct32 This work is sponsored by, and copyright, Google. This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read. This gives a pretty substantial speedup for the smaller subpartitions. The code size increases from 14740 bytes to 24292 bytes. The idct16/32_end macros are moved above the individual functions; the instructions themselves are unchanged, but since new functions are added at the same place where the code is moved from, the diff looks rather messy. Before: vp9_inv_dct_dct_16x16_sub1_add_neon: 236.7 vp9_inv_dct_dct_16x16_sub2_add_neon: 1051.0 vp9_inv_dct_dct_16x16_sub4_add_neon: 1051.0 vp9_inv_dct_dct_16x16_sub8_add_neon: 1051.0 vp9_inv_dct_dct_16x16_sub12_add_neon: 1387.4 vp9_inv_dct_dct_16x16_sub16_add_neon: 1387.6 vp9_inv_dct_dct_32x32_sub1_add_neon: 554.1 vp9_inv_dct_dct_32x32_sub2_add_neon: 5198.5 vp9_inv_dct_dct_32x32_sub4_add_neon: 5198.6 vp9_inv_dct_dct_32x32_sub8_add_neon: 5196.3 vp9_inv_dct_dct_32x32_sub12_add_neon: 6183.4 vp9_inv_dct_dct_32x32_sub16_add_neon: 6174.3 vp9_inv_dct_dct_32x32_sub20_add_neon: 7151.4 vp9_inv_dct_dct_32x32_sub24_add_neon: 7145.3 vp9_inv_dct_dct_32x32_sub28_add_neon: 8119.3 vp9_inv_dct_dct_32x32_sub32_add_neon: 8118.7 After: vp9_inv_dct_dct_16x16_sub1_add_neon: 236.7 vp9_inv_dct_dct_16x16_sub2_add_neon: 640.8 vp9_inv_dct_dct_16x16_sub4_add_neon: 639.0 vp9_inv_dct_dct_16x16_sub8_add_neon: 842.0 vp9_inv_dct_dct_16x16_sub12_add_neon: 1388.3 vp9_inv_dct_dct_16x16_sub16_add_neon: 1389.3 vp9_inv_dct_dct_32x32_sub1_add_neon: 554.1 vp9_inv_dct_dct_32x32_sub2_add_neon: 3685.5 vp9_inv_dct_dct_32x32_sub4_add_neon: 3685.1 vp9_inv_dct_dct_32x32_sub8_add_neon: 3684.4 vp9_inv_dct_dct_32x32_sub12_add_neon: 5312.2 vp9_inv_dct_dct_32x32_sub16_add_neon: 5315.4 vp9_inv_dct_dct_32x32_sub20_add_neon: 7154.9 vp9_inv_dct_dct_32x32_sub24_add_neon: 7154.5 vp9_inv_dct_dct_32x32_sub28_add_neon: 8126.6 vp9_inv_dct_dct_32x32_sub32_add_neon: 8127.2 This is cherrypicked from libav commit `a63da4511d`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:25 +02:00
Martin Storsjö	a681c793a3	aarch64: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function This allows reusing the macro for a separate implementation of the pass2 function. This is cherrypicked from libav commit `79d332ebbd`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:24 +02:00
Martin Storsjö	dc47bf3872	aarch64: vp9itxfm: Make the larger core transforms standalone functions This work is sponsored by, and copyright, Google. This reduces the code size of libavcodec/aarch64/vp9itxfm_neon.o from 19496 to 14740 bytes. This gives a small slowdown of a couple of tens of cycles, but makes it more feasible to add more optimized versions of these transforms. Before: vp9_inv_dct_dct_16x16_sub4_add_neon: 1036.7 vp9_inv_dct_dct_16x16_sub16_add_neon: 1372.2 vp9_inv_dct_dct_32x32_sub4_add_neon: 5180.0 vp9_inv_dct_dct_32x32_sub32_add_neon: 8095.7 After: vp9_inv_dct_dct_16x16_sub4_add_neon: 1051.0 vp9_inv_dct_dct_16x16_sub16_add_neon: 1390.1 vp9_inv_dct_dct_32x32_sub4_add_neon: 5199.9 vp9_inv_dct_dct_32x32_sub32_add_neon: 8125.8 This is cherrypicked from libav commit `115476018d`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:22 +02:00
Martin Storsjö	52c7366c83	aarch64: vp9itxfm: Restructure the idct32 store macros This avoids concatenation, which can't be used if the whole macro is wrapped within another macro. This is also arguably more readable. This is cherrypicked from libav commit `58d87e0f49`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:09 +02:00
Martin Storsjö	b8f66c0838	aarch64: vp9itxfm: Reorder iadst16 coeffs This matches the order they are in the 16 bpp version. There they are in this order, to make sure we access them in the same order they are declared, easing loading only half of the coefficients at a time. This makes the 8 bpp version match the 16 bpp version better. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-02-24 00:04:34 +02:00
Martin Storsjö	09eb88a12e	aarch64: vp9itxfm: Reorder the idct coefficients for better pairing All elements are used pairwise, except for the first one. Previously, the 16th element was unused. Move the unused element to the second slot, to make the later element pairs not split across registers. This simplifies loading only parts of the coefficients, reducing the difference to the 16 bpp version. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-02-24 00:04:32 +02:00
Martin Storsjö	65aa002d54	aarch64: vp9itxfm: Avoid reloading the idct32 coefficients The idct32x32 function actually pushed d8-d15 onto the stack even though it didn't clobber them; there are plenty of registers that can be used to allow keeping all the idct coefficients in registers without having to reload different subsets of them at different stages in the transform. After this, we still can skip pushing d12-d15. Before: vp9_inv_dct_dct_32x32_sub32_add_neon: 8128.3 After: vp9_inv_dct_dct_32x32_sub32_add_neon: 8053.3 Signed-off-by: Martin Storsjö <martin@martin.st>	2017-02-24 00:03:44 +02:00
Martin Storsjö	3bf9c48320	aarch64: vp9lpf: Use dup+rev16+uzp1 instead of dup+lsr+dup+trn1 This is one cycle faster in total, and three instructions fewer. Before: vp9_loop_filter_mix2_v_44_16_neon: 123.2 After: vp9_loop_filter_mix2_v_44_16_neon: 122.2 Signed-off-by: Martin Storsjö <martin@martin.st>	2017-02-24 00:03:00 +02:00
Martin Storsjö	c582cb8537	arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit The theoretical maximum value of E is 193, so we can just saturate the addition to 255. Before: Cortex A7 A8 A9 A53 A53/AArch64 vp9_loop_filter_v_4_8_neon: 143.0 127.7 114.8 88.0 87.7 vp9_loop_filter_v_8_8_neon: 241.0 197.2 173.7 140.0 136.7 vp9_loop_filter_v_16_8_neon: 497.0 419.5 379.7 293.0 275.7 vp9_loop_filter_v_16_16_neon: 965.2 818.7 731.4 579.0 452.0 After: vp9_loop_filter_v_4_8_neon: 136.0 125.7 112.6 84.0 83.0 vp9_loop_filter_v_8_8_neon: 234.0 195.5 171.5 136.0 133.7 vp9_loop_filter_v_16_8_neon: 490.0 417.5 377.7 289.0 271.0 vp9_loop_filter_v_16_16_neon: 951.2 814.7 732.3 571.0 446.7 Signed-off-by: Martin Storsjö <martin@martin.st>	2017-02-24 00:02:36 +02:00
Martin Storsjö	07b5136c48	aarch64: vp9lpf: Fix broken indentation/vertical alignment Signed-off-by: Martin Storsjö <martin@martin.st>	2017-02-12 21:57:23 +02:00
Martin Storsjö	b0806088d3	aarch64: vp9lpf: Interleave the start of flat8in into the calculation above This adds lots of extra .ifs, but speeds it up by a couple cycles, by avoiding stalls. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-02-11 22:54:18 +02:00
Martin Storsjö	e1f9de86f4	arm/aarch64: vp9lpf: Calculate !hev directly Previously we first calculated hev, and then negated it. Since we were able to schedule the negation in the middle of another calculation, we don't see a gain in every case. Before: Cortex A7 A8 A9 A53 A53/AArch64 vp9_loop_filter_v_4_8_neon: 147.0 129.0 115.8 89.0 88.7 vp9_loop_filter_v_8_8_neon: 242.0 198.5 174.7 140.0 136.7 vp9_loop_filter_v_16_8_neon: 500.0 419.5 382.7 293.0 275.7 vp9_loop_filter_v_16_16_neon: 971.2 825.5 731.5 579.0 453.0 After: vp9_loop_filter_v_4_8_neon: 143.0 127.7 114.8 88.0 87.7 vp9_loop_filter_v_8_8_neon: 241.0 197.2 173.7 140.0 136.7 vp9_loop_filter_v_16_8_neon: 497.0 419.5 379.7 293.0 275.7 vp9_loop_filter_v_16_16_neon: 965.2 818.7 731.4 579.0 452.0 Signed-off-by: Martin Storsjö <martin@martin.st>	2017-02-11 00:43:59 +02:00
Martin Storsjö	3fcf788fbb	aarch64: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling This work is sponsored by, and copyright, Google. Before: Cortex A53 vp9_inv_dct_dct_16x16_sub1_add_neon: 235.3 vp9_inv_dct_dct_32x32_sub1_add_neon: 555.1 After: vp9_inv_dct_dct_16x16_sub1_add_neon: 180.2 vp9_inv_dct_dct_32x32_sub1_add_neon: 475.3 Signed-off-by: Martin Storsjö <martin@martin.st>	2017-02-11 00:31:58 +02:00
Martin Storsjö	388e0d2515	aarch64: vp9mc: Calculate less unused data in the 4 pixel wide horizontal filter No measured speedup on a Cortex A53, but other cores might benefit. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-02-11 00:08:50 +02:00
Martin Storsjö	5e0c2158fb	aarch64: vp9mc: Simplify the extmla macro parameters Fold the field lengths into the macro. This makes the macro invocations much more readable, when the lines are shorter. This also makes it easier to use only half the registers within the macro. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-02-11 00:08:29 +02:00
Martin Storsjö	0c0b87f12d	aarch64: vp9itxfm: Fix incorrect vertical alignment Signed-off-by: Martin Storsjö <martin@martin.st>	2017-02-09 23:57:06 +02:00
Martin Storsjö	8476eb0d3a	aarch64: vp9itxfm: Update a comment to refer to a register with a different name Signed-off-by: Martin Storsjö <martin@martin.st>	2017-02-09 23:57:02 +02:00
Martin Storsjö	3dd7827258	aarch64: vp9itxfm: Use the right lane sizes in 8x8 for improved readability Signed-off-by: Martin Storsjö <martin@martin.st>	2017-02-09 23:56:59 +02:00
Martin Storsjö	ed8d293306	aarch64: vp9itxfm: Use a single lane ld1 instead of ld1r where possible The ld1r is a leftover from the arm version, where this trick is beneficial on some cores. Use a single-lane load where we don't need the semantics of ld1r. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-02-09 23:56:54 +02:00
Martin Storsjö	4da4b2b87f	aarch64: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function Signed-off-by: Martin Storsjö <martin@martin.st>	2017-02-09 23:56:50 +02:00
Martin Storsjö	a63da4511d	aarch64: vp9itxfm: Do separate functions for half/quarter idct16 and idct32 This work is sponsored by, and copyright, Google. This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read. This gives a pretty substantial speedup for the smaller subpartitions. The code size increases from 14740 bytes to 24292 bytes. The idct16/32_end macros are moved above the individual functions; the instructions themselves are unchanged, but since new functions are added at the same place where the code is moved from, the diff looks rather messy. Before: vp9_inv_dct_dct_16x16_sub1_add_neon: 236.7 vp9_inv_dct_dct_16x16_sub2_add_neon: 1051.0 vp9_inv_dct_dct_16x16_sub4_add_neon: 1051.0 vp9_inv_dct_dct_16x16_sub8_add_neon: 1051.0 vp9_inv_dct_dct_16x16_sub12_add_neon: 1387.4 vp9_inv_dct_dct_16x16_sub16_add_neon: 1387.6 vp9_inv_dct_dct_32x32_sub1_add_neon: 554.1 vp9_inv_dct_dct_32x32_sub2_add_neon: 5198.5 vp9_inv_dct_dct_32x32_sub4_add_neon: 5198.6 vp9_inv_dct_dct_32x32_sub8_add_neon: 5196.3 vp9_inv_dct_dct_32x32_sub12_add_neon: 6183.4 vp9_inv_dct_dct_32x32_sub16_add_neon: 6174.3 vp9_inv_dct_dct_32x32_sub20_add_neon: 7151.4 vp9_inv_dct_dct_32x32_sub24_add_neon: 7145.3 vp9_inv_dct_dct_32x32_sub28_add_neon: 8119.3 vp9_inv_dct_dct_32x32_sub32_add_neon: 8118.7 After: vp9_inv_dct_dct_16x16_sub1_add_neon: 236.7 vp9_inv_dct_dct_16x16_sub2_add_neon: 640.8 vp9_inv_dct_dct_16x16_sub4_add_neon: 639.0 vp9_inv_dct_dct_16x16_sub8_add_neon: 842.0 vp9_inv_dct_dct_16x16_sub12_add_neon: 1388.3 vp9_inv_dct_dct_16x16_sub16_add_neon: 1389.3 vp9_inv_dct_dct_32x32_sub1_add_neon: 554.1 vp9_inv_dct_dct_32x32_sub2_add_neon: 3685.5 vp9_inv_dct_dct_32x32_sub4_add_neon: 3685.1 vp9_inv_dct_dct_32x32_sub8_add_neon: 3684.4 vp9_inv_dct_dct_32x32_sub12_add_neon: 5312.2 vp9_inv_dct_dct_32x32_sub16_add_neon: 5315.4 vp9_inv_dct_dct_32x32_sub20_add_neon: 7154.9 vp9_inv_dct_dct_32x32_sub24_add_neon: 7154.5 vp9_inv_dct_dct_32x32_sub28_add_neon: 8126.6 vp9_inv_dct_dct_32x32_sub32_add_neon: 8127.2 Signed-off-by: Martin Storsjö <martin@martin.st>	2017-02-09 12:32:03 +02:00
Martin Storsjö	79d332ebbd	aarch64: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function This allows reusing the macro for a separate implementation of the pass2 function. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-02-09 12:31:56 +02:00
Martin Storsjö	115476018d	aarch64: vp9itxfm: Make the larger core transforms standalone functions This work is sponsored by, and copyright, Google. This reduces the code size of libavcodec/aarch64/vp9itxfm_neon.o from 19496 to 14740 bytes. This gives a small slowdown of a couple of tens of cycles, but makes it more feasible to add more optimized versions of these transforms. Before: vp9_inv_dct_dct_16x16_sub4_add_neon: 1036.7 vp9_inv_dct_dct_16x16_sub16_add_neon: 1372.2 vp9_inv_dct_dct_32x32_sub4_add_neon: 5180.0 vp9_inv_dct_dct_32x32_sub32_add_neon: 8095.7 After: vp9_inv_dct_dct_16x16_sub4_add_neon: 1051.0 vp9_inv_dct_dct_16x16_sub16_add_neon: 1390.1 vp9_inv_dct_dct_32x32_sub4_add_neon: 5199.9 vp9_inv_dct_dct_32x32_sub32_add_neon: 8125.8 Signed-off-by: Martin Storsjö <martin@martin.st>	2017-02-09 12:31:45 +02:00
Martin Storsjö	58d87e0f49	aarch64: vp9itxfm: Restructure the idct32 store macros This avoids concatenation, which can't be used if the whole macro is wrapped within another macro. This is also arguably more readable. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-02-05 13:05:32 +02:00
Martin Storsjö	9f10cff610	aarch64: Add NEON optimizations for 10 and 12 bit vp9 loop filter This work is sponsored by, and copyright, Google. This is similar to the arm version, but due to the larger registers on aarch64, we can do 8 pixels at a time for all filter sizes. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_loop_filter_h_4_8_10bpp_neon: 213.2 172.6 vp9_loop_filter_h_8_8_10bpp_neon: 281.2 244.2 vp9_loop_filter_h_16_8_10bpp_neon: 657.0 444.5 vp9_loop_filter_h_16_16_10bpp_neon: 1280.4 877.7 vp9_loop_filter_mix2_h_44_16_10bpp_neon: 397.7 358.0 vp9_loop_filter_mix2_h_48_16_10bpp_neon: 465.7 429.0 vp9_loop_filter_mix2_h_84_16_10bpp_neon: 465.7 428.0 vp9_loop_filter_mix2_h_88_16_10bpp_neon: 533.7 499.0 vp9_loop_filter_mix2_v_44_16_10bpp_neon: 271.5 244.0 vp9_loop_filter_mix2_v_48_16_10bpp_neon: 330.0 305.0 vp9_loop_filter_mix2_v_84_16_10bpp_neon: 329.0 306.0 vp9_loop_filter_mix2_v_88_16_10bpp_neon: 386.0 365.0 vp9_loop_filter_v_4_8_10bpp_neon: 150.0 115.2 vp9_loop_filter_v_8_8_10bpp_neon: 209.0 175.5 vp9_loop_filter_v_16_8_10bpp_neon: 492.7 345.2 vp9_loop_filter_v_16_16_10bpp_neon: 951.0 682.7 This is significantly faster than the ARM version in almost all cases except for the mix2 functions. Based on START_TIMER/STOP_TIMER wrapping around a few individual functions, the speedup vs C code is around 2-3x. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-01-24 22:36:11 +02:00
Martin Storsjö	ceb36b8178	aarch64: Add NEON optimizations for 10 and 12 bit vp9 itxfm This work is sponsored by, and copyright, Google. Compared to the arm version, on aarch64 we can keep the full 8x8 transform in registers, and for 16x16 and 32x32, we can process it in slices of 4 pixels instead of 2. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_inv_adst_adst_4x4_sub4_add_10_neon: 111.0 109.7 vp9_inv_adst_adst_8x8_sub8_add_10_neon: 914.0 733.5 vp9_inv_adst_adst_16x16_sub16_add_10_neon: 5184.0 3745.7 vp9_inv_dct_dct_4x4_sub1_add_10_neon: 65.0 65.7 vp9_inv_dct_dct_4x4_sub4_add_10_neon: 100.0 96.7 vp9_inv_dct_dct_8x8_sub1_add_10_neon: 111.0 119.7 vp9_inv_dct_dct_8x8_sub8_add_10_neon: 618.0 494.7 vp9_inv_dct_dct_16x16_sub1_add_10_neon: 295.1 284.6 vp9_inv_dct_dct_16x16_sub2_add_10_neon: 2303.2 1883.9 vp9_inv_dct_dct_16x16_sub8_add_10_neon: 2984.8 2189.3 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 3890.0 2799.4 vp9_inv_dct_dct_32x32_sub1_add_10_neon: 1044.4 1012.7 vp9_inv_dct_dct_32x32_sub2_add_10_neon: 13333.7 9695.1 vp9_inv_dct_dct_32x32_sub16_add_10_neon: 18531.3 12459.8 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 24470.7 16160.2 vp9_inv_wht_wht_4x4_sub4_add_10_neon: 83.0 79.7 The larger transforms are significantly faster than the corresponding ARM versions. The speedup vs C code is smaller than in 32 bit mode, probably because the 64 bit intermediates in the C code can be expressed more efficiently in aarch64. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-01-24 22:36:08 +02:00
Martin Storsjö	638eceed47	aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC This work is sponsored by, and copyright, Google. This has mostly got the same differences to the 8 bit version as in the arm version. For the horizontal filters, we do 16 pixels in parallel as well. For the 8 pixel wide vertical filters, we can accumulate 4 rows before storing, just as in the 8 bit version. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_avg4_10bpp_neon: 35.7 30.7 vp9_avg8_10bpp_neon: 93.5 84.7 vp9_avg16_10bpp_neon: 324.4 296.6 vp9_avg32_10bpp_neon: 1236.5 1148.2 vp9_avg64_10bpp_neon: 4639.6 4571.1 vp9_avg_8tap_smooth_4h_10bpp_neon: 130.0 128.0 vp9_avg_8tap_smooth_4hv_10bpp_neon: 440.0 440.5 vp9_avg_8tap_smooth_4v_10bpp_neon: 114.0 105.5 vp9_avg_8tap_smooth_8h_10bpp_neon: 327.0 314.0 vp9_avg_8tap_smooth_8hv_10bpp_neon: 918.7 865.4 vp9_avg_8tap_smooth_8v_10bpp_neon: 330.0 300.2 vp9_avg_8tap_smooth_16h_10bpp_neon: 1187.5 1155.5 vp9_avg_8tap_smooth_16hv_10bpp_neon: 2663.1 2591.0 vp9_avg_8tap_smooth_16v_10bpp_neon: 1107.4 1078.3 vp9_avg_8tap_smooth_64h_10bpp_neon: 17754.6 17454.7 vp9_avg_8tap_smooth_64hv_10bpp_neon: 33285.2 33001.5 vp9_avg_8tap_smooth_64v_10bpp_neon: 16066.9 16048.6 vp9_put4_10bpp_neon: 25.5 21.7 vp9_put8_10bpp_neon: 56.0 52.0 vp9_put16_10bpp_neon/armv8: 183.0 163.1 vp9_put32_10bpp_neon/armv8: 678.6 563.1 vp9_put64_10bpp_neon/armv8: 2679.9 2195.8 vp9_put_8tap_smooth_4h_10bpp_neon: 120.0 118.0 vp9_put_8tap_smooth_4hv_10bpp_neon: 435.2 435.0 vp9_put_8tap_smooth_4v_10bpp_neon: 107.0 98.2 vp9_put_8tap_smooth_8h_10bpp_neon: 303.0 290.0 vp9_put_8tap_smooth_8hv_10bpp_neon: 893.7 828.7 vp9_put_8tap_smooth_8v_10bpp_neon: 305.5 263.5 vp9_put_8tap_smooth_16h_10bpp_neon: 1089.1 1059.2 vp9_put_8tap_smooth_16hv_10bpp_neon: 2578.8 2452.4 vp9_put_8tap_smooth_16v_10bpp_neon: 1009.5 933.5 vp9_put_8tap_smooth_64h_10bpp_neon: 16223.4 15918.6 vp9_put_8tap_smooth_64hv_10bpp_neon: 32153.0 31016.2 vp9_put_8tap_smooth_64v_10bpp_neon: 14516.5 13748.1 These are generally about as fast as the corresponding ARM routines on the same CPU (at least on the A53), in most cases marginally faster. The speedup vs C code is around 4-9x. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-01-24 22:36:05 +02:00
Martin Storsjö	48ad3fe1be	aarch64: vp9dsp: Restructure the bpp checks This work is sponsored by, and copyright, Google. This is more in line with how it will be extended for more bitdepths. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-01-24 22:36:02 +02:00
Martin Storsjö	0ba0187535	aarch64: vp9mc: Fix a comment to refer to a register with the right name This is cherrypicked from libav commit `85ad5ea72c`. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:43 +01:00
Martin Storsjö	02cfb9a16e	aarch64: vp9dsp: Fix vertical alignment in the init file This is cherrypicked from libav commit `65074791e8`. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:40 +01:00
Martin Storsjö	8b11a89c06	aarch64: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32 This work is sponsored by, and copyright, Google. Previously all subpartitions except the eob=1 (DC) case ran with the same runtime: vp9_inv_dct_dct_16x16_sub16_add_neon: 1373.2 vp9_inv_dct_dct_32x32_sub32_add_neon: 8089.0 By skipping individual 8x16 or 8x32 pixel slices in the first pass, we reduce the runtime of these functions like this: vp9_inv_dct_dct_16x16_sub1_add_neon: 235.3 vp9_inv_dct_dct_16x16_sub2_add_neon: 1036.7 vp9_inv_dct_dct_16x16_sub4_add_neon: 1036.7 vp9_inv_dct_dct_16x16_sub8_add_neon: 1036.7 vp9_inv_dct_dct_16x16_sub12_add_neon: 1372.1 vp9_inv_dct_dct_16x16_sub16_add_neon: 1372.1 vp9_inv_dct_dct_32x32_sub1_add_neon: 555.1 vp9_inv_dct_dct_32x32_sub2_add_neon: 5190.2 vp9_inv_dct_dct_32x32_sub4_add_neon: 5180.0 vp9_inv_dct_dct_32x32_sub8_add_neon: 5183.1 vp9_inv_dct_dct_32x32_sub12_add_neon: 6161.5 vp9_inv_dct_dct_32x32_sub16_add_neon: 6155.5 vp9_inv_dct_dct_32x32_sub20_add_neon: 7136.3 vp9_inv_dct_dct_32x32_sub24_add_neon: 7128.4 vp9_inv_dct_dct_32x32_sub28_add_neon: 8098.9 vp9_inv_dct_dct_32x32_sub32_add_neon: 8098.8 I.e. in general a very minor overhead for the full subpartition case due to the additional cmps, but a significant speedup for the cases when we only need to process a small part of the actual input data. This is cherrypicked from libav commits `cad42fadcd` and `a0c443a398`. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:32 +01:00
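The slice-skipping idea in the commit above can be sketched in C. This is only an illustration under stated assumptions: `toy_1d` is a stand-in linear transform (a running sum), not the VP9 IDCT, and `nonzero_rows` is a hypothetical bound that a caller would derive from eob. The real assembly makes its decision per 8-pixel-wide slice from eob directly.

```c
#include <string.h>

#define N 16      /* block size */
#define SLICE 8   /* rows handled per first-pass slice */

/* Toy linear 1-D transform (a running sum), standing in for the real IDCT. */
static void toy_1d(const int *in, int *out)
{
    int acc = 0;
    for (int i = 0; i < N; i++) {
        acc += in[i];
        out[i] = acc;
    }
}

/* First pass over every row, with no shortcut. */
void first_pass_full(int in[N][N], int out[N][N])
{
    for (int r = 0; r < N; r++)
        toy_1d(in[r], out[r]);
}

/* First pass that skips slices known to hold only zero coefficients.
 * nonzero_rows is a hypothetical bound; rows at or past it must be zero. */
void first_pass_skip(int in[N][N], int out[N][N], int nonzero_rows)
{
    for (int s = 0; s < N; s += SLICE) {
        if (s >= nonzero_rows) {
            /* A linear transform maps all-zero input to all-zero output. */
            memset(&out[s], 0, sizeof(out[s]) * SLICE);
            continue;
        }
        for (int r = s; r < s + SLICE; r++)
            toy_1d(in[r], out[r]);
    }
}
```

The extra comparison per slice is the "very minor overhead" mentioned for the full case; the skipped slices account for the savings in the small-eob cases.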
Martin Storsjö	37cb224e3e	aarch64: vp9itxfm: Don't repeatedly set x9 when nothing overwrites it This is cherrypicked from libav commit `2f99117f6f`. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:25 +01:00
Martin Storsjö	4a5874ea8d	arm/aarch64: vp9itxfm: Fix indentation of macro arguments This is cherrypicked from libav commit `721bc37522`. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:19 +01:00
Martin Storsjö	a95e7de41d	aarch64: vp9itxfm: Use w3 instead of x3 for the int eob parameter The clobbering tests in checkasm are only invoked when testing correctness, so this bug didn't show up when benchmarking the dc-only version. This is cherrypicked from libav commit `4d960a1185`. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:16 +01:00
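The w3 versus x3 distinction can be modelled in C. Under the AArch64 procedure call standard, a 32-bit `int` argument occupies the low half of its register and the upper 32 bits are unspecified. The sketch below is a model only: the function names are hypothetical and a `uint64_t` stands in for the register.

```c
#include <stdint.h>

/* Models a comparison on the full 64-bit register (x3):
 * garbage in the upper half leaks into the result. */
int eob_is_one_x(uint64_t reg)
{
    return reg == 1;
}

/* Models a comparison on the low 32 bits only (w3),
 * which is all an int parameter guarantees. */
int eob_is_one_w(uint64_t reg)
{
    return (uint32_t)reg == 1;
}
```

With `reg = 0xdeadbeef00000001`, the x-register model misses the eob == 1 case while the w-register model detects it.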
Janne Grunau	cb220eeef9	aarch64: vp9: loop filter: replace 'orr; cbn?z' with 'adds; b.{eq,ne}' The latter is 1 cycle faster on a cortex-a53, and since the operands are bytewise (or larger) bitmasks (impossible to overflow to zero), both are equivalent. This is cherrypicked from libav commit `e7ae8f7a71`. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:10 +01:00
Janne Grunau	62ea07d797	aarch64: vp9: use alternative returns in the core loop filter function Since aarch64 has enough free general purpose registers, use them to branch to the appropriate storage code. 1-2 cycles faster for the functions using loop_filter 8/16, ... on a cortex-a53. Mixed results (up to 2 cycles faster/slower) on a cortex-a57. This is cherrypicked from libav commit `d7595de0b2`. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:06 +01:00
Rostislav Pehlivanov	4fdacf4cdb	imdct15: remove the AArch64 assembly Prep work for the next commit, which will add a new FFT algorithm which makes the iMDCT over 3x faster than it is currently (standalone, with some frame sizes the FFT is over 10x faster). The new FFT algorithm uses the already thoroughly SIMD'd power of two FFT which already has SIMD for AArch64, so users of that platform will still see an improvement. The previous FFT+SIMD was barely 2.5x faster than the C versions on these platforms. Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>	2017-01-05 22:32:02 +00:00
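The equivalence claimed above can be checked exhaustively in C for 64-bit values whose bytes are each 0x00 or 0xff. This is a model of the two zero tests, not the assembly itself; all function names are illustrative.

```c
#include <stdint.h>

/* Expand the low 8 bits of sel into a 64-bit mask whose bytes are 0x00 or 0xff. */
uint64_t byte_mask(unsigned sel)
{
    uint64_t m = 0;
    for (int i = 0; i < 8; i++)
        if (sel & (1u << i))
            m |= (uint64_t)0xff << (8 * i);
    return m;
}

/* 'orr; cbz' form: zero test on the bitwise OR. */
int zero_by_orr(uint64_t a, uint64_t b)
{
    return (a | b) == 0;
}

/* 'adds; b.eq' form: zero test on the wrapping 64-bit sum. */
int zero_by_adds(uint64_t a, uint64_t b)
{
    return (uint64_t)(a + b) == 0;
}

/* Returns 1 if both tests agree for every pair of byte masks. */
int forms_agree(void)
{
    for (unsigned i = 0; i < 256; i++)
        for (unsigned j = 0; j < 256; j++)
            if (zero_by_orr(byte_mask(i), byte_mask(j)) !=
                zero_by_adds(byte_mask(i), byte_mask(j)))
                return 0;
    return 1;
}
```

The restriction to bitmasks matters: for arbitrary operands such as 1 and `UINT64_MAX`, the sum wraps to zero while the OR does not, so the two tests disagree.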
Martin Storsjö	85ad5ea72c	aarch64: vp9mc: Fix a comment to refer to a register with the right name Signed-off-by: Martin Storsjö <martin@martin.st>	2017-01-03 14:16:10 +02:00
Martin Storsjö	65074791e8	aarch64: vp9dsp: Fix vertical alignment in the init file Signed-off-by: Martin Storsjö <martin@martin.st>	2017-01-03 14:15:58 +02:00
Martin Storsjö	a0c443a398	aarch64: vp9itxfm: Use the offset parameter to movrel This fixes build failures for iOS, broken since `cad42fadcd`. Signed-off-by: Martin Storsjö <martin@martin.st>	2016-12-19 22:49:51 +02:00
Janne Grunau	2425d7329f	arm64: replace 'bic' with immediate with 'and' with inverted immediate The former is not an official pseudo instruction although gas and llvm's internal assembler support it. Fixes a build error with xcode 6.2 reported by Memphiz on github.	2016-12-14 21:53:05 +01:00
Martin Storsjö	da5c8284c0	aarch64: h264idct: Use the offset parameter to movrel Signed-off-by: Martin Storsjö <martin@martin.st> (cherry picked from commit `6a62795d40`) Cherry pick Suggested-by: Martin Storsjö This should fix the build failure on macosx Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2016-12-08 18:11:07 +01:00
Martin Storsjö	cad42fadcd	aarch64: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32 This work is sponsored by, and copyright, Google. Previously all subpartitions except the eob=1 (DC) case ran with the same runtime: vp9_inv_dct_dct_16x16_sub16_add_neon: 1373.2 vp9_inv_dct_dct_32x32_sub32_add_neon: 8089.0 By skipping individual 8x16 or 8x32 pixel slices in the first pass, we reduce the runtime of these functions like this: vp9_inv_dct_dct_16x16_sub1_add_neon: 235.3 vp9_inv_dct_dct_16x16_sub2_add_neon: 1036.7 vp9_inv_dct_dct_16x16_sub4_add_neon: 1036.7 vp9_inv_dct_dct_16x16_sub8_add_neon: 1036.7 vp9_inv_dct_dct_16x16_sub12_add_neon: 1372.1 vp9_inv_dct_dct_16x16_sub16_add_neon: 1372.1 vp9_inv_dct_dct_32x32_sub1_add_neon: 555.1 vp9_inv_dct_dct_32x32_sub2_add_neon: 5190.2 vp9_inv_dct_dct_32x32_sub4_add_neon: 5180.0 vp9_inv_dct_dct_32x32_sub8_add_neon: 5183.1 vp9_inv_dct_dct_32x32_sub12_add_neon: 6161.5 vp9_inv_dct_dct_32x32_sub16_add_neon: 6155.5 vp9_inv_dct_dct_32x32_sub20_add_neon: 7136.3 vp9_inv_dct_dct_32x32_sub24_add_neon: 7128.4 vp9_inv_dct_dct_32x32_sub28_add_neon: 8098.9 vp9_inv_dct_dct_32x32_sub32_add_neon: 8098.8 I.e. in general a very minor overhead for the full subpartition case due to the additional cmps, but a significant speedup for the cases when we only need to process a small part of the actual input data. Signed-off-by: Martin Storsjö <martin@martin.st>	2016-11-30 23:57:05 +02:00
Martin Storsjö	2f99117f6f	aarch64: vp9itxfm: Don't repeatedly set x9 when nothing overwrites it Signed-off-by: Martin Storsjö <martin@martin.st>	2016-11-24 13:39:21 +02:00
Martin Storsjö	721bc37522	arm/aarch64: vp9itxfm: Fix indentation of macro arguments Signed-off-by: Martin Storsjö <martin@martin.st>	2016-11-23 23:56:16 +02:00
Martin Storsjö	4d960a1185	aarch64: vp9itxfm: Use w3 instead of x3 for the int eob parameter The clobbering tests in checkasm are only invoked when testing correctness, so this bug didn't show up when benchmarking the dc-only version. Signed-off-by: Martin Storsjö <martin@martin.st>	2016-11-18 23:17:33 +02:00
Janne Grunau	e7ae8f7a71	aarch64: vp9: loop filter: replace 'orr; cbn?z' with 'adds; b.{eq,ne}' The latter is 1 cycle faster on a cortex-a53, and since the operands are bytewise (or larger) bitmasks (impossible to overflow to zero), both are equivalent.	2016-11-16 09:05:18 +01:00
Janne Grunau	d7595de0b2	aarch64: vp9: use alternative returns in the core loop filter function Since aarch64 has enough free general purpose registers, use them to branch to the appropriate storage code. 1-2 cycles faster for the functions using loop_filter 8/16, ... on a cortex-a53. Mixed results (up to 2 cycles faster/slower) on a cortex-a57.	2016-11-16 09:05:18 +01:00
Martin Storsjö	f1212e472b	aarch64: vp9: Implement NEON loop filters This work is sponsored by, and copyright, Google. These are ported from the ARM version; thanks to the larger amount of registers available, we can do the loop filters with 16 pixels at a time. The implementation is fully templated, with a single macro which can generate versions for both 8 and 16 pixels wide, for both 4, 8 and 16 pixels loop filters (and the 4/8 mixed versions as well). For the 8 pixel wide versions, it is pretty close in speed (the v_4_8 and v_8_8 filters are the best examples of this; the h_4_8 and h_8_8 filters seem to get some gain in the load/transpose/store part). For the 16 pixels wide ones, we get a speedup of around 1.2-1.4x compared to the 32 bit version. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_loop_filter_h_4_8_neon: 144.0 127.2 vp9_loop_filter_h_8_8_neon: 207.0 182.5 vp9_loop_filter_h_16_8_neon: 415.0 328.7 vp9_loop_filter_h_16_16_neon: 672.0 558.6 vp9_loop_filter_mix2_h_44_16_neon: 302.0 203.5 vp9_loop_filter_mix2_h_48_16_neon: 365.0 305.2 vp9_loop_filter_mix2_h_84_16_neon: 365.0 305.2 vp9_loop_filter_mix2_h_88_16_neon: 376.0 305.2 vp9_loop_filter_mix2_v_44_16_neon: 193.2 128.2 vp9_loop_filter_mix2_v_48_16_neon: 246.7 218.4 vp9_loop_filter_mix2_v_84_16_neon: 248.0 218.5 vp9_loop_filter_mix2_v_88_16_neon: 302.0 218.2 vp9_loop_filter_v_4_8_neon: 89.0 88.7 vp9_loop_filter_v_8_8_neon: 141.0 137.7 vp9_loop_filter_v_16_8_neon: 295.0 272.7 vp9_loop_filter_v_16_16_neon: 546.0 453.7 The speedup vs C code in checkasm tests is around 2-7x, which is pretty much the same as for the 32 bit version. Even if these functions are faster than their 32 bit equivalent, the C version that we compare to also became around 1.3-1.7x faster than the C version in 32 bit. Based on START_TIMER/STOP_TIMER wrapping around a few individual functions, the speedup vs C code is around 4-5x. 
Examples of runtimes vs C on a Cortex A57 (for a slightly older version of the patch): A57 gcc-5.3 neon loop_filter_h_4_8_neon: 256.6 93.4 loop_filter_h_8_8_neon: 307.3 139.1 loop_filter_h_16_8_neon: 340.1 254.1 loop_filter_h_16_16_neon: 827.0 407.9 loop_filter_mix2_h_44_16_neon: 524.5 155.4 loop_filter_mix2_h_48_16_neon: 644.5 173.3 loop_filter_mix2_h_84_16_neon: 630.5 222.0 loop_filter_mix2_h_88_16_neon: 697.3 222.0 loop_filter_mix2_v_44_16_neon: 598.5 100.6 loop_filter_mix2_v_48_16_neon: 651.5 127.0 loop_filter_mix2_v_84_16_neon: 591.5 167.1 loop_filter_mix2_v_88_16_neon: 855.1 166.7 loop_filter_v_4_8_neon: 271.7 65.3 loop_filter_v_8_8_neon: 312.5 106.9 loop_filter_v_16_8_neon: 473.3 206.5 loop_filter_v_16_16_neon: 976.1 327.8 The speed-up compared to the C functions is 2.5 to 6 and the cortex-a57 is again 30-50% faster than the cortex-a53. This is an adapted cherry-pick from libav commits `9d2afd1eb8` and `31756abe29`. Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>	2016-11-15 15:10:03 -05:00
Martin Storsjö	f43079e11c	aarch64: vp9: Add NEON itxfm routines This work is sponsored by, and copyright, Google. These are ported from the ARM version; thanks to the larger amount of registers available, we can do the 16x16 and 32x32 transforms in slices 8 pixels wide instead of 4. This gives a speedup of around 1.4x compared to the 32 bit version. The fact that aarch64 doesn't have the same d/q register aliasing makes some of the macros quite a bit simpler as well. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_inv_adst_adst_4x4_add_neon: 90.0 87.7 vp9_inv_adst_adst_8x8_add_neon: 400.0 354.7 vp9_inv_adst_adst_16x16_add_neon: 2526.5 1827.2 vp9_inv_dct_dct_4x4_add_neon: 74.0 72.7 vp9_inv_dct_dct_8x8_add_neon: 271.0 256.7 vp9_inv_dct_dct_16x16_add_neon: 1960.7 1372.7 vp9_inv_dct_dct_32x32_add_neon: 11988.9 8088.3 vp9_inv_wht_wht_4x4_add_neon: 63.0 57.7 The speedup vs C code (2-4x) is smaller than in the 32 bit case, mostly because the C code ends up significantly faster (around 1.6x faster, with GCC 5.4) when built for aarch64. Examples of runtimes vs C on a Cortex A57 (for a slightly older version of the patch): A57 gcc-5.3 neon vp9_inv_adst_adst_4x4_add_neon: 152.2 60.0 vp9_inv_adst_adst_8x8_add_neon: 948.2 288.0 vp9_inv_adst_adst_16x16_add_neon: 4830.4 1380.5 vp9_inv_dct_dct_4x4_add_neon: 153.0 58.6 vp9_inv_dct_dct_8x8_add_neon: 789.2 180.2 vp9_inv_dct_dct_16x16_add_neon: 3639.6 917.1 vp9_inv_dct_dct_32x32_add_neon: 20462.1 4985.0 vp9_inv_wht_wht_4x4_add_neon: 91.0 49.8 The asm is around factor 3-4 faster than C on the cortex-a57 and the asm is around 30-50% faster on the a57 compared to the a53. This is an adapted cherry-pick from libav commit `3c9546dfaf`. Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>	2016-11-15 15:10:03 -05:00
Martin Storsjö	1f7801c2bc	aarch64: vp9: Add NEON optimizations of VP9 MC functions This work is sponsored by, and copyright, Google. These are ported from the ARM version; it is essentially a 1:1 port with no extra added features, but with some hand tuning (especially for the plain copy/avg functions). The ARM version isn't very register starved to begin with, so there's not much to be gained from having more spare registers here - we only avoid having to clobber callee-saved registers. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_avg4_neon: 27.2 23.7 vp9_avg8_neon: 56.5 54.7 vp9_avg16_neon: 169.9 167.4 vp9_avg32_neon: 585.8 585.2 vp9_avg64_neon: 2460.3 2294.7 vp9_avg_8tap_smooth_4h_neon: 132.7 125.2 vp9_avg_8tap_smooth_4hv_neon: 478.8 442.0 vp9_avg_8tap_smooth_4v_neon: 126.0 93.7 vp9_avg_8tap_smooth_8h_neon: 241.7 234.2 vp9_avg_8tap_smooth_8hv_neon: 690.9 646.5 vp9_avg_8tap_smooth_8v_neon: 245.0 205.5 vp9_avg_8tap_smooth_64h_neon: 11273.2 11280.1 vp9_avg_8tap_smooth_64hv_neon: 22980.6 22184.1 vp9_avg_8tap_smooth_64v_neon: 11549.7 10781.1 vp9_put4_neon: 18.0 17.2 vp9_put8_neon: 40.2 37.7 vp9_put16_neon: 97.4 99.5 vp9_put32_neon/armv8: 346.0 307.4 vp9_put64_neon/armv8: 1319.0 1107.5 vp9_put_8tap_smooth_4h_neon: 126.7 118.2 vp9_put_8tap_smooth_4hv_neon: 465.7 434.0 vp9_put_8tap_smooth_4v_neon: 113.0 86.5 vp9_put_8tap_smooth_8h_neon: 229.7 221.6 vp9_put_8tap_smooth_8hv_neon: 658.9 621.3 vp9_put_8tap_smooth_8v_neon: 215.0 187.5 vp9_put_8tap_smooth_64h_neon: 10636.7 10627.8 vp9_put_8tap_smooth_64hv_neon: 21076.8 21026.9 vp9_put_8tap_smooth_64v_neon: 9635.0 9632.4 These are generally about as fast as the corresponding ARM routines on the same CPU (at least on the A53), in most cases marginally faster. The speedup vs C code is pretty much the same as for the 32 bit case; on the A53 it's around 6-13x for the larger 8tap filters. The exact speedup varies a little, since the C versions generally don't end up exactly as slow/fast as on 32 bit. 
This is an adapted cherry-pick from libav commit `383d96aa22`. Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>	2016-11-15 15:10:03 -05:00
Janne Grunau	31756abe29	aarch64: vp9: loop_filter: fix typo in skip flatout8 check The 16_16 loop filter functions could miss an early exit before flatout8. Signed-off-by: Martin Storsjö <martin@martin.st>	2016-11-14 08:51:58 +02:00
Martin Storsjö	9d2afd1eb8	aarch64: vp9: Implement NEON loop filters This work is sponsored by, and copyright, Google. These are ported from the ARM version; thanks to the larger amount of registers available, we can do the loop filters with 16 pixels at a time. The implementation is fully templated, with a single macro which can generate versions for both 8 and 16 pixels wide, for both 4, 8 and 16 pixels loop filters (and the 4/8 mixed versions as well). For the 8 pixel wide versions, it is pretty close in speed (the v_4_8 and v_8_8 filters are the best examples of this; the h_4_8 and h_8_8 filters seem to get some gain in the load/transpose/store part). For the 16 pixels wide ones, we get a speedup of around 1.2-1.4x compared to the 32 bit version. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_loop_filter_h_4_8_neon: 144.0 127.2 vp9_loop_filter_h_8_8_neon: 207.0 182.5 vp9_loop_filter_h_16_8_neon: 415.0 328.7 vp9_loop_filter_h_16_16_neon: 672.0 558.6 vp9_loop_filter_mix2_h_44_16_neon: 302.0 203.5 vp9_loop_filter_mix2_h_48_16_neon: 365.0 305.2 vp9_loop_filter_mix2_h_84_16_neon: 365.0 305.2 vp9_loop_filter_mix2_h_88_16_neon: 376.0 305.2 vp9_loop_filter_mix2_v_44_16_neon: 193.2 128.2 vp9_loop_filter_mix2_v_48_16_neon: 246.7 218.4 vp9_loop_filter_mix2_v_84_16_neon: 248.0 218.5 vp9_loop_filter_mix2_v_88_16_neon: 302.0 218.2 vp9_loop_filter_v_4_8_neon: 89.0 88.7 vp9_loop_filter_v_8_8_neon: 141.0 137.7 vp9_loop_filter_v_16_8_neon: 295.0 272.7 vp9_loop_filter_v_16_16_neon: 546.0 453.7 The speedup vs C code in checkasm tests is around 2-7x, which is pretty much the same as for the 32 bit version. Even if these functions are faster than their 32 bit equivalent, the C version that we compare to also became around 1.3-1.7x faster than the C version in 32 bit. Based on START_TIMER/STOP_TIMER wrapping around a few individual functions, the speedup vs C code is around 4-5x. 
Examples of runtimes vs C on a Cortex A57 (for a slightly older version of the patch): A57 gcc-5.3 neon loop_filter_h_4_8_neon: 256.6 93.4 loop_filter_h_8_8_neon: 307.3 139.1 loop_filter_h_16_8_neon: 340.1 254.1 loop_filter_h_16_16_neon: 827.0 407.9 loop_filter_mix2_h_44_16_neon: 524.5 155.4 loop_filter_mix2_h_48_16_neon: 644.5 173.3 loop_filter_mix2_h_84_16_neon: 630.5 222.0 loop_filter_mix2_h_88_16_neon: 697.3 222.0 loop_filter_mix2_v_44_16_neon: 598.5 100.6 loop_filter_mix2_v_48_16_neon: 651.5 127.0 loop_filter_mix2_v_84_16_neon: 591.5 167.1 loop_filter_mix2_v_88_16_neon: 855.1 166.7 loop_filter_v_4_8_neon: 271.7 65.3 loop_filter_v_8_8_neon: 312.5 106.9 loop_filter_v_16_8_neon: 473.3 206.5 loop_filter_v_16_16_neon: 976.1 327.8 The speed-up compared to the C functions is 2.5 to 6 and the cortex-a57 is again 30-50% faster than the cortex-a53. Signed-off-by: Martin Storsjö <martin@martin.st>	2016-11-14 00:10:13 +02:00
Martin Storsjö	3c9546dfaf	aarch64: vp9: Add NEON itxfm routines This work is sponsored by, and copyright, Google. These are ported from the ARM version; thanks to the larger amount of registers available, we can do the 16x16 and 32x32 transforms in slices 8 pixels wide instead of 4. This gives a speedup of around 1.4x compared to the 32 bit version. The fact that aarch64 doesn't have the same d/q register aliasing makes some of the macros quite a bit simpler as well. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_inv_adst_adst_4x4_add_neon: 90.0 87.7 vp9_inv_adst_adst_8x8_add_neon: 400.0 354.7 vp9_inv_adst_adst_16x16_add_neon: 2526.5 1827.2 vp9_inv_dct_dct_4x4_add_neon: 74.0 72.7 vp9_inv_dct_dct_8x8_add_neon: 271.0 256.7 vp9_inv_dct_dct_16x16_add_neon: 1960.7 1372.7 vp9_inv_dct_dct_32x32_add_neon: 11988.9 8088.3 vp9_inv_wht_wht_4x4_add_neon: 63.0 57.7 The speedup vs C code (2-4x) is smaller than in the 32 bit case, mostly because the C code ends up significantly faster (around 1.6x faster, with GCC 5.4) when built for aarch64. Examples of runtimes vs C on a Cortex A57 (for a slightly older version of the patch): A57 gcc-5.3 neon vp9_inv_adst_adst_4x4_add_neon: 152.2 60.0 vp9_inv_adst_adst_8x8_add_neon: 948.2 288.0 vp9_inv_adst_adst_16x16_add_neon: 4830.4 1380.5 vp9_inv_dct_dct_4x4_add_neon: 153.0 58.6 vp9_inv_dct_dct_8x8_add_neon: 789.2 180.2 vp9_inv_dct_dct_16x16_add_neon: 3639.6 917.1 vp9_inv_dct_dct_32x32_add_neon: 20462.1 4985.0 vp9_inv_wht_wht_4x4_add_neon: 91.0 49.8 The asm is around factor 3-4 faster than C on the cortex-a57 and the asm is around 30-50% faster on the a57 compared to the a53. Signed-off-by: Martin Storsjö <martin@martin.st>	2016-11-14 00:10:13 +02:00
Martin Storsjö	6a62795d40	aarch64: h264idct: Use the offset parameter to movrel Signed-off-by: Martin Storsjö <martin@martin.st>	2016-11-10 11:18:22 +02:00
Martin Storsjö	383d96aa22	aarch64: vp9: Add NEON optimizations of VP9 MC functions This work is sponsored by, and copyright, Google. These are ported from the ARM version; it is essentially a 1:1 port with no extra added features, but with some hand tuning (especially for the plain copy/avg functions). The ARM version isn't very register starved to begin with, so there's not much to be gained from having more spare registers here - we only avoid having to clobber callee-saved registers. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_avg4_neon: 27.2 23.7 vp9_avg8_neon: 56.5 54.7 vp9_avg16_neon: 169.9 167.4 vp9_avg32_neon: 585.8 585.2 vp9_avg64_neon: 2460.3 2294.7 vp9_avg_8tap_smooth_4h_neon: 132.7 125.2 vp9_avg_8tap_smooth_4hv_neon: 478.8 442.0 vp9_avg_8tap_smooth_4v_neon: 126.0 93.7 vp9_avg_8tap_smooth_8h_neon: 241.7 234.2 vp9_avg_8tap_smooth_8hv_neon: 690.9 646.5 vp9_avg_8tap_smooth_8v_neon: 245.0 205.5 vp9_avg_8tap_smooth_64h_neon: 11273.2 11280.1 vp9_avg_8tap_smooth_64hv_neon: 22980.6 22184.1 vp9_avg_8tap_smooth_64v_neon: 11549.7 10781.1 vp9_put4_neon: 18.0 17.2 vp9_put8_neon: 40.2 37.7 vp9_put16_neon: 97.4 99.5 vp9_put32_neon/armv8: 346.0 307.4 vp9_put64_neon/armv8: 1319.0 1107.5 vp9_put_8tap_smooth_4h_neon: 126.7 118.2 vp9_put_8tap_smooth_4hv_neon: 465.7 434.0 vp9_put_8tap_smooth_4v_neon: 113.0 86.5 vp9_put_8tap_smooth_8h_neon: 229.7 221.6 vp9_put_8tap_smooth_8hv_neon: 658.9 621.3 vp9_put_8tap_smooth_8v_neon: 215.0 187.5 vp9_put_8tap_smooth_64h_neon: 10636.7 10627.8 vp9_put_8tap_smooth_64hv_neon: 21076.8 21026.9 vp9_put_8tap_smooth_64v_neon: 9635.0 9632.4 These are generally about as fast as the corresponding ARM routines on the same CPU (at least on the A53), in most cases marginally faster. The speedup vs C code is pretty much the same as for the 32 bit case; on the A53 it's around 6-13x for the larger 8tap filters. The exact speedup varies a little, since the C versions generally don't end up exactly as slow/fast as on 32 bit. 
Signed-off-by: Martin Storsjö <martin@martin.st>	2016-11-10 11:15:56 +02:00
Diego Biurrun	72a19f4013	mpegaudiodsp: aarch64: Adjust function prototype after `2caa93b813`	2016-11-10 00:13:48 +01:00
Martin Storsjö	9b2ccafb48	aarch64: Add missing sign extension in ff_h264_idct8_add_neon Signed-off-by: Martin Storsjö <martin@martin.st>	2016-10-10 14:57:53 +03:00
James Almer	42111e8543	avcodec: fix arguments on xmm/neon clobber test wrappers Signed-off-by: James Almer <jamrial@gmail.com>	2016-10-02 02:15:47 -03:00
James Almer	449f263f9f	avcodec: add missing xmm/neon clobber test wrappers for the new encode API Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: James Almer <jamrial@gmail.com>	2016-10-01 14:08:50 -03:00
Diego Biurrun	2caa93b813	mpegaudiodsp: Change type of array stride parameters to ptrdiff_t This avoids SIMD-optimized functions having to sign-extend their stride argument manually to be able to do pointer arithmetic.	2016-09-29 17:54:24 +02:00
Diego Biurrun	e4a94d8b36	h264chroma: Change type of stride parameters to ptrdiff_t This avoids SIMD-optimized functions having to sign-extend their stride argument manually to be able to do pointer arithmetic.	2016-09-29 14:48:04 +02:00
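The reason a `ptrdiff_t` stride spares the assembly a sign extension can be modelled in C. In this sketch a `uint64_t` stands in for a 64-bit register holding an `int` stride in its low half; the function names are illustrative and not part of FFmpeg.

```c
#include <stddef.h>
#include <stdint.h>

/* Using the whole register as an offset: a negative int stride whose
 * upper half was not sign-extended turns into a large positive value. */
int64_t offset_without_sxtw(uint64_t reg)
{
    return (int64_t)reg;
}

/* What 'sxtw' does: sign-extend the low 32 bits to 64 bits.
 * The narrowing cast assumes two's-complement conversion. */
int64_t offset_with_sxtw(uint64_t reg)
{
    return (int64_t)(int32_t)(uint32_t)reg;
}

/* With a ptrdiff_t stride the full-width value arrives ready for
 * pointer arithmetic, including negative strides. */
const uint8_t *next_row(const uint8_t *p, ptrdiff_t stride)
{
    return p + stride;
}
```

For a stride of -16 held as `0x00000000fffffff0`, the first function yields 4294967280 and the second yields -16, which is the manual step the prototype change removes.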
Anton Khirnov	de2ae3c1fa	lavc: add clobber tests for the new encoding/decoding API	2016-09-28 10:01:52 +02:00
Xiaolei Yu	5a70e56f2f	avcodec: fix vc1dsp dependencies	2016-09-25 13:11:45 +02:00
James Almer	293484fa5e	avcodec: add missing xmm/neon clobber test wrappers for the new decode API Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: James Almer <jamrial@gmail.com>	2016-07-03 18:04:30 -03:00
Clément Bœsch	4a081f224e	libavcodec: fix constness in clobber test avcodec_open2() wrappers Signed-off-by: Martin Storsjö <martin@martin.st>	2016-06-26 21:34:04 +03:00
Clément Bœsch	dfd0c0f981	lavc/neontest: fix constness in arm/aarch64 avcodec_open2() wrappers	2016-06-25 13:41:13 +02:00
Clément Bœsch	8ef57a0d61	Merge commit '41ed7ab45fc693f7d7fc35664c0233f4c32d69bb' * commit '41ed7ab45fc693f7d7fc35664c0233f4c32d69bb': cosmetics: Fix spelling mistakes Merged-by: Clément Bœsch <u@pkh.me>	2016-06-21 21:55:34 +02:00
James Almer	c8c14d0ffc	aarch64/synth_filter: fix compilation Signed-off-by: James Almer <jamrial@gmail.com>	2016-05-10 23:33:12 -03:00
Derek Buitenhuis	ca5ec2bf51	Merge commit '01621202aad7e27b2a05c71d9ad7a19dfcbe17ec' * commit '01621202aad7e27b2a05c71d9ad7a19dfcbe17ec': build: miscellaneous cosmetics Merged-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>	2016-05-09 16:25:28 +01:00
Vittorio Giovara	41ed7ab45f	cosmetics: Fix spelling mistakes Signed-off-by: Diego Biurrun <diego@biurrun.de>	2016-05-04 18:16:21 +02:00
Derek Buitenhuis	87b8e95008	Merge commit 'cdb1665f70def544ddab3e3ed3763ef99c8b3873' * commit 'cdb1665f70def544ddab3e3ed3763ef99c8b3873': aarch64: Make transpose_4x4H do a regular transpose Merged-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>	2016-04-24 12:51:42 +01:00
Derek Buitenhuis	197fa698c6	Merge commit '97aec6e75ef36ed0402653519daa8e1fc8ddb555' * commit '97aec6e75ef36ed0402653519daa8e1fc8ddb555': fft: arm: Drop unnecessary #include, add missing ones Merged-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>	2016-04-12 15:43:09 +01:00
Diego Biurrun	01621202aa	build: miscellaneous cosmetics Restore alphabetical order in lists, break overly long lines, do some prettyprinting, add some explanatory section comments, group parts together that belong together logically.	2016-04-07 15:26:08 +02:00
Martin Storsjö	cdb1665f70	aarch64: Make transpose_4x4H do a regular transpose Previously, ff_h264_idct_add_neon (originally in the arm version) used a non-regular transpose in order to be able to use more instructions that deal with registers as 128 bit register pairs. The aarch64 translation doesn't do it to the same extent, but brought along the same structure since it was a straight translation. This reshuffles ff_h264_idct_add_neon, bringing it closer to the C implementation, making the transpose_4x4H macro do a regular transpose, usable for other algorithms as well. Previously, the third and fourth output from transpose_4x4H were swapped, and prior to `cc29d96d5a`, the same inputs as well. In addition to just swapping the outputs, also renumber the intermediate registers for better readability (making the register order match transpose_4x8B). This runs with the same number of cycles as before. Signed-off-by: Martin Storsjö <martin@martin.st>	2016-03-26 21:25:56 +02:00
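A regular 4x4 transpose of 16-bit elements built from TRN1/TRN2 steps can be modelled in C as below. This is a sketch of the general technique, assuming the standard TRN1/TRN2 lane semantics; it is not a transcription of the `transpose_4x4H` macro, and the helper names are illustrative.

```c
#include <stdint.h>

/* TRN1 on four 16-bit lanes (.4h): even lanes of a and b, interleaved. */
static void trn1_4h(int16_t *d, const int16_t *a, const int16_t *b)
{
    d[0] = a[0]; d[1] = b[0]; d[2] = a[2]; d[3] = b[2];
}

/* TRN2 on four 16-bit lanes (.4h): odd lanes of a and b, interleaved. */
static void trn2_4h(int16_t *d, const int16_t *a, const int16_t *b)
{
    d[0] = a[1]; d[1] = b[1]; d[2] = a[3]; d[3] = b[3];
}

/* TRN1 on two 32-bit lanes (.2s): each lane is a pair of 16-bit values. */
static void trn1_2s(int16_t *d, const int16_t *a, const int16_t *b)
{
    d[0] = a[0]; d[1] = a[1]; d[2] = b[0]; d[3] = b[1];
}

/* TRN2 on two 32-bit lanes (.2s). */
static void trn2_2s(int16_t *d, const int16_t *a, const int16_t *b)
{
    d[0] = a[2]; d[1] = a[3]; d[2] = b[2]; d[3] = b[3];
}

/* In-place transpose: afterwards m[r][c] holds the old m[c][r]. */
void transpose_4x4h(int16_t m[4][4])
{
    int16_t t[4][4];

    /* Step 1: interleave 16-bit lanes of row pairs (0,1) and (2,3). */
    trn1_4h(t[0], m[0], m[1]);
    trn2_4h(t[1], m[0], m[1]);
    trn1_4h(t[2], m[2], m[3]);
    trn2_4h(t[3], m[2], m[3]);

    /* Step 2: interleave 32-bit lanes; outputs come out in row order. */
    trn1_2s(m[0], t[0], t[2]);
    trn1_2s(m[1], t[1], t[3]);
    trn2_2s(m[2], t[0], t[2]);
    trn2_2s(m[3], t[1], t[3]);
}
```

A "non-regular" variant in the sense of the commit would deliver the third and fourth outputs swapped, which forces callers to compensate.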
Diego Biurrun	1a094af638	fft: Split MDCT bits off from FFT	2016-03-01 10:18:28 +01:00
Diego Biurrun	97aec6e75e	fft: arm: Drop unnecessary #include, add missing ones	2016-02-26 14:34:58 +01:00
foo86	ae5b2c5250	avcodec/dca: add new decoder based on libdcadec	2016-01-31 17:09:38 +01:00
foo86	4608996772	avcodec/dca: remove old decoder Remove all files and functions which are not going to be reused, and disable all functions and FATE tests temporarily which will be.	2016-01-31 17:09:38 +01:00
James Almer	209f50e16b	avcodec/synth_filter: split off remaining code from dcadec files Signed-off-by: James Almer <jamrial@gmail.com>	2016-01-25 14:57:38 -03:00
Hendrik Leppkes	d03da3e240	Merge commit '2008f76054906e9ff6bf744800af0e5a5bfe61be' * commit '2008f76054906e9ff6bf744800af0e5a5bfe61be': dca: remove unused decode_hf function and quant_d tables Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>	2016-01-02 13:17:48 +01:00
Hendrik Leppkes	e97e2588ca	Merge commit 'a0fc780a2093784e8664f88205ee1b215e109cee' * commit 'a0fc780a2093784e8664f88205ee1b215e109cee': arm64: int32_to_float_fmul neon asm Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>	2016-01-02 11:21:16 +01:00
Hendrik Leppkes	10e075c138	Merge commit '705f5e5e155f6f280a360af220fc5b30cfcee702' * commit '705f5e5e155f6f280a360af220fc5b30cfcee702': arm64: port synth_filter_float_neon from arm Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>	2016-01-02 11:14:28 +01:00
Hendrik Leppkes	de3a33784c	Merge commit 'c33c1fa8af2b2e82418a06901b6ad17b3d61b73e' * commit 'c33c1fa8af2b2e82418a06901b6ad17b3d61b73e': arm64: convert dcadsp neon asm from arm Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>	2016-01-02 11:10:24 +01:00
Alexandra Hájková	2008f76054	dca: remove unused decode_hf function and quant_d tables They were superseded with their integer equivalents. Rename integer decode_hf to decode_hf.	2015-12-24 13:58:18 +01:00
Janne Grunau	cc29d96d5a	arm64: fix inverted register order in transpose_4x4H Fix related register order issue in ff_h264_idct_add_neon. Found-by: zjh8890 <243186085@qq.com>	2015-12-21 13:44:20 +01:00
Janne Grunau	2dba0407fd	avcodec/arm64: fix inverted register order in transpose_4x4H Fix related register order issue in ff_h264_idct_add_neon. Found-by: zjh8890 <243186085@qq.com> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2015-12-19 03:58:46 +01:00
Michael Niedermayer	95b59bfb9d	Revert "avcodec/aarch64/neon.S: Update neon.s for transpose_4x4H" The change was not correct and broke H264 This reverts commit `cd83f899c9`.	2015-12-17 21:26:37 +01:00
Janne Grunau	a0fc780a20	arm64: int32_to_float_fmul neon asm 3% faster dts decoding on a cortex-a57. cortex-a57 cortex-a53 int32_to_float_fmul_array8_c: 1270.9 4475.6 int32_to_float_fmul_array8_neon: 328.6 569.2 int32_to_float_fmul_scalar_c: 928.5 4119.6 int32_to_float_fmul_scalar_neon: 309.1 524.1	2015-12-14 16:45:02 +01:00
Janne Grunau	705f5e5e15	arm64: port synth_filter_float_neon from arm ~25% faster dts decoding overall. The checkasm CPU cycles numbers are not that useful since synth_filter_float() calls FFTContext.imdct_half(). cortex-a57 cortex-a53 synth_filter_float_c: 1866.2 3490.9 synth_filter_float_neon: 915.0 1531.5 With fftc.imdct_half forced to imdct_half_neon: cortex-a57 cortex-a53 synth_filter_float_c: 1718.4 3025.3 synth_filter_float_neon: 926.2 1530.1	2015-12-14 16:45:01 +01:00

401 Commits