Mikhail Nitenko
0f745b74ec
lavc/aarch64: h264qpel, add 10-bit lowpass_8_10 based functions
...
Benchmarks A53 A55 A72 A76
avg_h264_qpel_8_mc01_10_c: 936.5 924.0 656.0 504.7
avg_h264_qpel_8_mc01_10_neon: 234.7 202.0 120.7 63.2
avg_h264_qpel_8_mc02_10_c: 921.0 920.0 669.2 493.7
avg_h264_qpel_8_mc02_10_neon: 202.0 173.2 102.7 58.5
avg_h264_qpel_8_mc03_10_c: 936.5 924.0 656.0 509.5
avg_h264_qpel_8_mc03_10_neon: 236.2 203.7 120.0 63.2
avg_h264_qpel_8_mc10_10_c: 1441.0 1437.7 806.7 478.5
avg_h264_qpel_8_mc10_10_neon: 325.7 324.0 153.7 94.2
avg_h264_qpel_8_mc11_10_c: 2160.7 2148.2 1366.7 906.7
avg_h264_qpel_8_mc11_10_neon: 492.0 464.0 242.5 134.5
avg_h264_qpel_8_mc13_10_c: 2157.0 2138.2 1357.0 908.2
avg_h264_qpel_8_mc13_10_neon: 494.0 467.2 242.0 140.0
avg_h264_qpel_8_mc20_10_c: 1433.5 1410.0 785.2 486.0
avg_h264_qpel_8_mc20_10_neon: 293.7 289.7 138.0 91.5
avg_h264_qpel_8_mc30_10_c: 1458.5 1461.7 813.7 483.2
avg_h264_qpel_8_mc30_10_neon: 341.7 339.2 154.0 95.2
avg_h264_qpel_8_mc31_10_c: 2194.7 2197.2 1358.7 928.0
avg_h264_qpel_8_mc31_10_neon: 520.0 495.0 245.5 142.5
avg_h264_qpel_8_mc33_10_c: 2188.0 2205.5 1356.7 910.7
avg_h264_qpel_8_mc33_10_neon: 521.0 494.5 245.7 145.7
avg_h264_qpel_16_mc01_10_c: 3717.2 3595.0 2610.0 2012.0
avg_h264_qpel_16_mc01_10_neon: 920.5 791.5 483.2 240.5
avg_h264_qpel_16_mc02_10_c: 3684.0 3633.0 2659.0 1919.7
avg_h264_qpel_16_mc02_10_neon: 790.7 678.2 409.2 217.0
avg_h264_qpel_16_mc03_10_c: 3726.5 3596.0 2606.7 2010.0
avg_h264_qpel_16_mc03_10_neon: 922.0 792.5 483.2 239.7
avg_h264_qpel_16_mc10_10_c: 5912.0 5803.2 3241.5 1916.7
avg_h264_qpel_16_mc10_10_neon: 1267.5 1277.2 616.5 365.0
avg_h264_qpel_16_mc11_10_c: 8599.2 8482.5 5338.0 3616.2
avg_h264_qpel_16_mc11_10_neon: 1913.0 1827.0 956.2 542.2
avg_h264_qpel_16_mc13_10_c: 8643.7 8488.5 5388.0 3628.5
avg_h264_qpel_16_mc13_10_neon: 1914.7 1828.7 969.2 530.5
avg_h264_qpel_16_mc20_10_c: 5719.5 5641.0 3147.0 1946.2
avg_h264_qpel_16_mc20_10_neon: 1139.5 1150.0 539.5 344.0
avg_h264_qpel_16_mc30_10_c: 5930.0 5872.5 3267.5 1918.0
avg_h264_qpel_16_mc30_10_neon: 1331.5 1341.2 616.5 369.5
avg_h264_qpel_16_mc31_10_c: 8758.7 8697.7 5353.0 3630.7
avg_h264_qpel_16_mc31_10_neon: 2018.7 1941.7 982.2 574.7
avg_h264_qpel_16_mc33_10_c: 8683.2 8675.2 5339.2 3634.7
avg_h264_qpel_16_mc33_10_neon: 2019.7 1940.2 994.5 566.0
put_h264_qpel_8_mc01_10_c: 854.2 843.0 599.2 478.0
put_h264_qpel_8_mc01_10_neon: 192.7 168.0 101.7 56.7
put_h264_qpel_8_mc02_10_c: 766.5 760.0 550.2 441.0
put_h264_qpel_8_mc02_10_neon: 160.0 139.2 88.7 53.0
put_h264_qpel_8_mc03_10_c: 854.2 843.0 599.2 479.0
put_h264_qpel_8_mc03_10_neon: 194.2 169.7 102.0 56.2
put_h264_qpel_8_mc10_10_c: 1352.7 1353.7 749.7 446.7
put_h264_qpel_8_mc10_10_neon: 289.7 294.2 135.5 88.5
put_h264_qpel_8_mc11_10_c: 2080.0 2066.2 1309.5 876.7
put_h264_qpel_8_mc11_10_neon: 450.0 429.7 229.7 131.2
put_h264_qpel_8_mc13_10_c: 2074.7 2060.2 1294.5 870.5
put_h264_qpel_8_mc13_10_neon: 452.5 434.5 226.5 130.0
put_h264_qpel_8_mc20_10_c: 1221.5 1216.0 684.5 399.7
put_h264_qpel_8_mc20_10_neon: 257.7 262.5 121.2 78.7
put_h264_qpel_8_mc30_10_c: 1379.0 1374.7 757.2 449.5
put_h264_qpel_8_mc30_10_neon: 305.7 310.2 135.5 86.5
put_h264_qpel_8_mc31_10_c: 2109.2 2119.7 1299.5 878.0
put_h264_qpel_8_mc31_10_neon: 478.0 458.5 226.0 137.2
put_h264_qpel_8_mc33_10_c: 2101.5 2115.2 1306.5 887.0
put_h264_qpel_8_mc33_10_neon: 479.0 458.7 229.7 141.7
put_h264_qpel_16_mc01_10_c: 3485.7 3396.7 2460.5 1914.5
put_h264_qpel_16_mc01_10_neon: 752.5 665.5 397.0 213.2
put_h264_qpel_16_mc02_10_c: 3103.5 3023.2 2154.7 1720.7
put_h264_qpel_16_mc02_10_neon: 622.7 551.2 347.7 196.2
put_h264_qpel_16_mc03_10_c: 3486.2 3394.0 2436.5 1917.7
put_h264_qpel_16_mc03_10_neon: 754.0 666.5 397.0 215.7
put_h264_qpel_16_mc10_10_c: 5533.0 5488.5 2989.0 1783.0
put_h264_qpel_16_mc10_10_neon: 1123.5 1165.2 535.2 334.7
put_h264_qpel_16_mc11_10_c: 8437.7 8281.2 5209.0 3510.7
put_h264_qpel_16_mc11_10_neon: 1745.0 1697.0 878.5 513.5
put_h264_qpel_16_mc13_10_c: 8567.7 8468.0 5221.5 3528.0
put_h264_qpel_16_mc13_10_neon: 1751.7 1698.2 889.2 507.0
put_h264_qpel_16_mc20_10_c: 4907.5 4885.0 2786.2 1607.5
put_h264_qpel_16_mc20_10_neon: 995.5 1034.5 475.5 307.0
put_h264_qpel_16_mc30_10_c: 5579.7 5537.7 3045.2 1789.5
put_h264_qpel_16_mc30_10_neon: 1187.5 1231.2 532.5 334.5
put_h264_qpel_16_mc31_10_c: 8677.2 8672.5 5204.2 3516.0
put_h264_qpel_16_mc31_10_neon: 1850.7 1813.2 893.0 545.2
put_h264_qpel_16_mc33_10_c: 8688.7 8671.2 5223.2 3512.0
put_h264_qpel_16_mc33_10_neon: 1851.7 1814.2 908.5 535.2
Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-12-07 23:20:14 +02:00
Martin Storsjö
7f905f3672
aarch64: Make the indentation more consistent
...
Some functions have slightly different indentation styles; try
to match the surrounding code.
libavcodec/aarch64/vc1dsp_neon.S is skipped here, as it intentionally
uses a layered indentation style to visually show how different
unrolled/interleaved phases fit together.
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-10-21 23:25:29 +03:00
Martin Storsjö
184103b310
aarch64: Consistently use lowercase for vector element specifiers
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-10-21 23:25:18 +03:00
Martin Storsjö
fd3bd5c492
aarch64: h264qpel: Do vertical filtering without transposing
...
This gives rather big speedups on these functions:
Before:
put_h264_qpel_8_mc01_8_neon: 241.0 131.5 138.7
put_h264_qpel_8_mc02_8_neon: 214.7 121.2 127.5
put_h264_qpel_8_mc03_8_neon: 242.5 131.2 135.7
put_h264_qpel_8_mc11_8_neon: 421.2 218.7 251.0
put_h264_qpel_8_mc12_8_neon: 878.0 509.5 537.5
put_h264_qpel_8_mc13_8_neon: 423.7 217.0 252.0
put_h264_qpel_8_mc21_8_neon: 858.2 479.5 514.0
put_h264_qpel_8_mc22_8_neon: 649.7 385.2 403.0
put_h264_qpel_8_mc23_8_neon: 860.2 476.5 517.7
put_h264_qpel_8_mc31_8_neon: 437.2 219.5 252.5
put_h264_qpel_8_mc32_8_neon: 892.5 510.5 546.0
put_h264_qpel_8_mc33_8_neon: 438.2 218.5 257.0
put_h264_qpel_16_mc01_8_neon: 944.2 509.7 546.7
put_h264_qpel_16_mc02_8_neon: 878.7 469.5 509.7
put_h264_qpel_16_mc03_8_neon: 945.7 510.7 557.0
put_h264_qpel_16_mc11_8_neon: 1663.2 858.5 979.5
put_h264_qpel_16_mc12_8_neon: 3510.2 2027.7 2112.7
put_h264_qpel_16_mc13_8_neon: 1664.7 857.5 980.5
put_h264_qpel_16_mc21_8_neon: 3366.2 1928.5 2030.5
put_h264_qpel_16_mc22_8_neon: 2584.7 1514.7 1590.2
put_h264_qpel_16_mc23_8_neon: 3367.7 1927.7 2035.0
put_h264_qpel_16_mc31_8_neon: 1716.7 849.7 997.0
put_h264_qpel_16_mc32_8_neon: 3564.0 2044.2 3835.2
put_h264_qpel_16_mc33_8_neon: 1717.7 863.0 989.5
After:
put_h264_qpel_8_mc01_8_neon: 136.0 73.7 76.0
put_h264_qpel_8_mc02_8_neon: 108.7 65.0 64.0
put_h264_qpel_8_mc03_8_neon: 137.5 72.7 73.0
put_h264_qpel_8_mc11_8_neon: 316.2 159.0 188.5
put_h264_qpel_8_mc12_8_neon: 653.0 375.5 384.7
put_h264_qpel_8_mc13_8_neon: 318.7 165.5 189.5
put_h264_qpel_8_mc21_8_neon: 739.2 385.7 432.5
put_h264_qpel_8_mc22_8_neon: 530.7 295.5 309.5
put_h264_qpel_8_mc23_8_neon: 741.2 393.7 421.0
put_h264_qpel_8_mc31_8_neon: 332.2 162.5 190.0
put_h264_qpel_8_mc32_8_neon: 667.5 378.2 390.5
put_h264_qpel_8_mc33_8_neon: 332.7 166.5 195.5
put_h264_qpel_16_mc01_8_neon: 524.2 285.2 294.0
put_h264_qpel_16_mc02_8_neon: 454.7 252.2 250.2
put_h264_qpel_16_mc03_8_neon: 525.7 286.0 283.0
put_h264_qpel_16_mc11_8_neon: 1243.2 630.7 726.7
put_h264_qpel_16_mc12_8_neon: 2610.2 1479.7 1481.2
put_h264_qpel_16_mc13_8_neon: 1250.5 631.7 727.7
put_h264_qpel_16_mc21_8_neon: 2890.2 1571.2 1679.7
put_h264_qpel_16_mc22_8_neon: 2108.7 1177.5 1223.5
put_h264_qpel_16_mc23_8_neon: 2891.7 1578.7 1667.7
put_h264_qpel_16_mc31_8_neon: 1296.7 630.5 752.5
put_h264_qpel_16_mc32_8_neon: 2664.0 1483.2 1503.5
put_h264_qpel_16_mc33_8_neon: 1297.7 632.5 747.2
I.e. overall a 20%-60% reduction in runtime of these
functions.
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-10-18 14:27:58 +03:00
Martin Storsjö
2d5a7f6d00
arm/aarch64: Improve scheduling in the avg form of h264_qpel
...
Don't use the loaded registers directly, avoiding stalls on in
order cores. Use vrhadd.u8 with q registers where easily possible.
Signed-off-by: Martin Storsjö <martin@martin.st>
2021-10-18 14:27:36 +03:00
Michael Niedermayer
19fc3c0122
Merge commit 'd5dd8c7bf0f0d77c581db3236e0d938f06fd5591'
...
* commit 'd5dd8c7bf0f0d77c581db3236e0d938f06fd5591':
aarch64: h264 qpel NEON optimizations
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-15 15:13:41 +01:00
Janne Grunau
d5dd8c7bf0
aarch64: h264 qpel NEON optimizations
...
Ported from ARMv7 NEON.
2014-01-15 12:17:49 +01:00