FFmpeg

mirror of https://github.com/FFmpeg/FFmpeg.git synced 2024-12-18 03:19:31 +02:00

Author	SHA1	Message	Date
Martin Storsjö	f872b19714	aarch64: hevc: Produce plain neon versions of qpel_bi_hv As the plain neon qpel_h functions process two rows at a time, we need to allocate storage for h+8 rows instead of h+7. By allocating storage for h+8 rows, incrementing the stack pointer won't end up at the right spot in the end. Store the intended final stack pointer value in a register x14 which we store on the stack. AWS Graviton 3: put_hevc_qpel_bi_hv4_8_c: 385.7 put_hevc_qpel_bi_hv4_8_neon: 131.0 put_hevc_qpel_bi_hv4_8_i8mm: 92.2 put_hevc_qpel_bi_hv6_8_c: 701.0 put_hevc_qpel_bi_hv6_8_neon: 239.5 put_hevc_qpel_bi_hv6_8_i8mm: 191.0 put_hevc_qpel_bi_hv8_8_c: 1162.0 put_hevc_qpel_bi_hv8_8_neon: 228.0 put_hevc_qpel_bi_hv8_8_i8mm: 225.2 put_hevc_qpel_bi_hv12_8_c: 2305.0 put_hevc_qpel_bi_hv12_8_neon: 558.0 put_hevc_qpel_bi_hv12_8_i8mm: 483.2 put_hevc_qpel_bi_hv16_8_c: 3965.2 put_hevc_qpel_bi_hv16_8_neon: 732.7 put_hevc_qpel_bi_hv16_8_i8mm: 656.5 put_hevc_qpel_bi_hv24_8_c: 8709.7 put_hevc_qpel_bi_hv24_8_neon: 1555.2 put_hevc_qpel_bi_hv24_8_i8mm: 1448.7 put_hevc_qpel_bi_hv32_8_c: 14818.0 put_hevc_qpel_bi_hv32_8_neon: 2763.7 put_hevc_qpel_bi_hv32_8_i8mm: 2468.0 put_hevc_qpel_bi_hv48_8_c: 32855.5 put_hevc_qpel_bi_hv48_8_neon: 6107.2 put_hevc_qpel_bi_hv48_8_i8mm: 5452.7 put_hevc_qpel_bi_hv64_8_c: 57591.5 put_hevc_qpel_bi_hv64_8_neon: 10660.2 put_hevc_qpel_bi_hv64_8_i8mm: 9580.0 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 09:05:55 +02:00
Martin Storsjö	d21b9a0411	aarch64: hevc: Produce plain neon versions of qpel_uni_w_hv As the plain neon qpel_h functions process two rows at a time, we need to allocate storage for h+8 rows instead of h+7. AWS Graviton 3: put_hevc_qpel_uni_w_hv4_8_c: 422.2 put_hevc_qpel_uni_w_hv4_8_neon: 140.7 put_hevc_qpel_uni_w_hv4_8_i8mm: 100.7 put_hevc_qpel_uni_w_hv8_8_c: 1208.0 put_hevc_qpel_uni_w_hv8_8_neon: 268.2 put_hevc_qpel_uni_w_hv8_8_i8mm: 261.5 put_hevc_qpel_uni_w_hv16_8_c: 4297.2 put_hevc_qpel_uni_w_hv16_8_neon: 802.2 put_hevc_qpel_uni_w_hv16_8_i8mm: 731.2 put_hevc_qpel_uni_w_hv32_8_c: 15518.5 put_hevc_qpel_uni_w_hv32_8_neon: 3085.2 put_hevc_qpel_uni_w_hv32_8_i8mm: 2783.2 put_hevc_qpel_uni_w_hv64_8_c: 57254.5 put_hevc_qpel_uni_w_hv64_8_neon: 11787.5 put_hevc_qpel_uni_w_hv64_8_i8mm: 10659.0 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 09:05:55 +02:00
Martin Storsjö	5ab138673b	aarch64: hevc: Produce plain neon versions of qpel_uni_hv As the plain neon qpel_h functions process two rows at a time, we need to allocate storage for h+8 rows instead of h+7. By allocating storage for h+8 rows, incrementing the stack pointer won't end up at the right spot in the end. Store the intended final stack pointer value in a register x14 which we store on the stack. AWS Graviton 3: put_hevc_qpel_uni_hv4_8_c: 384.2 put_hevc_qpel_uni_hv4_8_neon: 127.5 put_hevc_qpel_uni_hv4_8_i8mm: 85.5 put_hevc_qpel_uni_hv6_8_c: 705.5 put_hevc_qpel_uni_hv6_8_neon: 224.5 put_hevc_qpel_uni_hv6_8_i8mm: 176.2 put_hevc_qpel_uni_hv8_8_c: 1136.5 put_hevc_qpel_uni_hv8_8_neon: 216.5 put_hevc_qpel_uni_hv8_8_i8mm: 214.0 put_hevc_qpel_uni_hv12_8_c: 2259.5 put_hevc_qpel_uni_hv12_8_neon: 498.5 put_hevc_qpel_uni_hv12_8_i8mm: 410.7 put_hevc_qpel_uni_hv16_8_c: 3824.7 put_hevc_qpel_uni_hv16_8_neon: 670.0 put_hevc_qpel_uni_hv16_8_i8mm: 603.7 put_hevc_qpel_uni_hv24_8_c: 8113.5 put_hevc_qpel_uni_hv24_8_neon: 1474.7 put_hevc_qpel_uni_hv24_8_i8mm: 1351.5 put_hevc_qpel_uni_hv32_8_c: 14744.5 put_hevc_qpel_uni_hv32_8_neon: 2599.7 put_hevc_qpel_uni_hv32_8_i8mm: 2266.0 put_hevc_qpel_uni_hv48_8_c: 32800.0 put_hevc_qpel_uni_hv48_8_neon: 5650.0 put_hevc_qpel_uni_hv48_8_i8mm: 5011.7 put_hevc_qpel_uni_hv64_8_c: 57856.2 put_hevc_qpel_uni_hv64_8_neon: 9863.5 put_hevc_qpel_uni_hv64_8_i8mm: 8767.7 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 09:05:55 +02:00
Martin Storsjö	5cbeefc79e	aarch64: hevc: Produce plain neon versions of qpel_hv As the plain neon qpel_h functions process two rows at a time, we need to allocate storage for h+8 rows instead of h+7. By allocating storage for h+8 rows, incrementing the stack pointer won't end up at the right spot in the end. Store the intended final stack pointer value in a register x14 which we store on the stack. AWS Graviton 3: put_hevc_qpel_hv4_8_c: 386.0 put_hevc_qpel_hv4_8_neon: 125.7 put_hevc_qpel_hv4_8_i8mm: 83.2 put_hevc_qpel_hv6_8_c: 749.0 put_hevc_qpel_hv6_8_neon: 207.0 put_hevc_qpel_hv6_8_i8mm: 166.0 put_hevc_qpel_hv8_8_c: 1305.2 put_hevc_qpel_hv8_8_neon: 216.5 put_hevc_qpel_hv8_8_i8mm: 213.0 put_hevc_qpel_hv12_8_c: 2570.5 put_hevc_qpel_hv12_8_neon: 480.0 put_hevc_qpel_hv12_8_i8mm: 398.2 put_hevc_qpel_hv16_8_c: 4158.7 put_hevc_qpel_hv16_8_neon: 659.7 put_hevc_qpel_hv16_8_i8mm: 593.5 put_hevc_qpel_hv24_8_c: 8626.7 put_hevc_qpel_hv24_8_neon: 1653.5 put_hevc_qpel_hv24_8_i8mm: 1398.7 put_hevc_qpel_hv32_8_c: 14646.0 put_hevc_qpel_hv32_8_neon: 2566.2 put_hevc_qpel_hv32_8_i8mm: 2287.5 put_hevc_qpel_hv48_8_c: 31072.5 put_hevc_qpel_hv48_8_neon: 6228.5 put_hevc_qpel_hv48_8_i8mm: 5291.0 put_hevc_qpel_hv64_8_c: 53847.2 put_hevc_qpel_hv64_8_neon: 9856.7 put_hevc_qpel_hv64_8_i8mm: 8831.0 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 09:05:55 +02:00
Martin Storsjö	20c38f4b8d	aarch64: hevc: Reorder qpel_hv functions to prepare for templating This is a pure reordering of code without changing anything in the individual functions. Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 09:05:50 +02:00
Martin Storsjö	4f71e4ebf2	aarch64: hevc: Deduplicate the hevc_put_hevc_qpel_uni_w_hv*_8_end_neon functions The hv32 and hv64 functions were identical - both loop and process 16 pixels at a time. The hv16 function was near identical, except for the outer loop (and using sp instead of a separate register). Given the size of these functions, the extra cost of the outer loop is negligible, so use the same function for hv16 as well. This removes over 200 lines of duplicated assembly, and over 4 KB of binary size. Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 09:05:40 +02:00
Martin Storsjö	4063e50eec	aarch64: hevc: Split the qpel_*_hv functions into two parts The first horizontal filter can use either i8mm or plain neon versions, while the second part is a pure neon implementation. Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 09:05:29 +02:00
Martin Storsjö	ad01d06f91	aarch64: hevc: Implement a neon version of hevc_qpel_uni_w_h*_8 AWS Graviton 3: put_hevc_qpel_uni_w_h4_8_c: 159.0 put_hevc_qpel_uni_w_h4_8_neon: 64.2 put_hevc_qpel_uni_w_h4_8_i8mm: 40.0 put_hevc_qpel_uni_w_h6_8_c: 344.7 put_hevc_qpel_uni_w_h6_8_neon: 114.5 put_hevc_qpel_uni_w_h6_8_i8mm: 82.0 put_hevc_qpel_uni_w_h8_8_c: 596.2 put_hevc_qpel_uni_w_h8_8_neon: 132.2 put_hevc_qpel_uni_w_h8_8_i8mm: 106.0 put_hevc_qpel_uni_w_h12_8_c: 1325.0 put_hevc_qpel_uni_w_h12_8_neon: 299.0 put_hevc_qpel_uni_w_h12_8_i8mm: 211.5 put_hevc_qpel_uni_w_h16_8_c: 2300.0 put_hevc_qpel_uni_w_h16_8_neon: 422.0 put_hevc_qpel_uni_w_h16_8_i8mm: 286.2 put_hevc_qpel_uni_w_h24_8_c: 5059.0 put_hevc_qpel_uni_w_h24_8_neon: 912.2 put_hevc_qpel_uni_w_h24_8_i8mm: 664.2 put_hevc_qpel_uni_w_h32_8_c: 9198.2 put_hevc_qpel_uni_w_h32_8_neon: 1638.2 put_hevc_qpel_uni_w_h32_8_i8mm: 1033.7 put_hevc_qpel_uni_w_h48_8_c: 20754.7 put_hevc_qpel_uni_w_h48_8_neon: 3633.7 put_hevc_qpel_uni_w_h48_8_i8mm: 2300.7 put_hevc_qpel_uni_w_h64_8_c: 36854.7 put_hevc_qpel_uni_w_h64_8_neon: 6435.7 put_hevc_qpel_uni_w_h64_8_i8mm: 4039.2 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 09:03:18 +02:00
Martin Storsjö	de23b384fd	aarch64: hevc: Produce epel_bi_hv functions for both neon and i8mm In addition to just templating, this contains one change to ff_hevc_put_hevc_epel_bi_hv32_8, by setting the w6 register which ff_hevc_put_hevc_epel_h32_8_neon requires. AWS Graviton 3: put_hevc_epel_bi_hv4_8_c: 176.5 put_hevc_epel_bi_hv4_8_neon: 62.0 put_hevc_epel_bi_hv4_8_i8mm: 58.0 put_hevc_epel_bi_hv6_8_c: 343.7 put_hevc_epel_bi_hv6_8_neon: 109.7 put_hevc_epel_bi_hv6_8_i8mm: 105.7 put_hevc_epel_bi_hv8_8_c: 536.0 put_hevc_epel_bi_hv8_8_neon: 112.7 put_hevc_epel_bi_hv8_8_i8mm: 111.7 put_hevc_epel_bi_hv12_8_c: 1107.7 put_hevc_epel_bi_hv12_8_neon: 254.7 put_hevc_epel_bi_hv12_8_i8mm: 239.0 put_hevc_epel_bi_hv16_8_c: 1927.7 put_hevc_epel_bi_hv16_8_neon: 356.2 put_hevc_epel_bi_hv16_8_i8mm: 334.2 put_hevc_epel_bi_hv24_8_c: 4195.2 put_hevc_epel_bi_hv24_8_neon: 736.7 put_hevc_epel_bi_hv24_8_i8mm: 715.5 put_hevc_epel_bi_hv32_8_c: 7280.5 put_hevc_epel_bi_hv32_8_neon: 1287.7 put_hevc_epel_bi_hv32_8_i8mm: 1162.2 put_hevc_epel_bi_hv48_8_c: 16857.7 put_hevc_epel_bi_hv48_8_neon: 2836.2 put_hevc_epel_bi_hv48_8_i8mm: 2908.5 put_hevc_epel_bi_hv64_8_c: 29248.2 put_hevc_epel_bi_hv64_8_neon: 5051.7 put_hevc_epel_bi_hv64_8_i8mm: 4491.5 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 09:03:16 +02:00
Martin Storsjö	96e5adda9f	aarch64: hevc: Produce epel_uni_w_hv functions for both neon and i8mm AWS Graviton 3: put_hevc_epel_uni_w_hv4_8_c: 191.2 put_hevc_epel_uni_w_hv4_8_neon: 87.7 put_hevc_epel_uni_w_hv4_8_i8mm: 83.2 put_hevc_epel_uni_w_hv6_8_c: 349.5 put_hevc_epel_uni_w_hv6_8_neon: 153.0 put_hevc_epel_uni_w_hv6_8_i8mm: 148.5 put_hevc_epel_uni_w_hv8_8_c: 581.2 put_hevc_epel_uni_w_hv8_8_neon: 166.7 put_hevc_epel_uni_w_hv8_8_i8mm: 163.5 put_hevc_epel_uni_w_hv12_8_c: 1230.0 put_hevc_epel_uni_w_hv12_8_neon: 387.7 put_hevc_epel_uni_w_hv12_8_i8mm: 370.2 put_hevc_epel_uni_w_hv16_8_c: 2003.2 put_hevc_epel_uni_w_hv16_8_neon: 501.5 put_hevc_epel_uni_w_hv16_8_i8mm: 490.2 put_hevc_epel_uni_w_hv24_8_c: 4448.7 put_hevc_epel_uni_w_hv24_8_neon: 1092.2 put_hevc_epel_uni_w_hv24_8_i8mm: 1069.7 put_hevc_epel_uni_w_hv32_8_c: 7817.2 put_hevc_epel_uni_w_hv32_8_neon: 1916.2 put_hevc_epel_uni_w_hv32_8_i8mm: 1829.5 put_hevc_epel_uni_w_hv48_8_c: 16728.2 put_hevc_epel_uni_w_hv48_8_neon: 4263.7 put_hevc_epel_uni_w_hv48_8_i8mm: 4342.7 put_hevc_epel_uni_w_hv64_8_c: 29563.2 put_hevc_epel_uni_w_hv64_8_neon: 7474.2 put_hevc_epel_uni_w_hv64_8_i8mm: 7128.5 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:59:58 +02:00
Martin Storsjö	d7294199ab	aarch64: hevc: Produce epel_uni_hv functions for both neon and i8mm AWS Graviton 3: put_hevc_epel_uni_hv4_8_c: 163.5 put_hevc_epel_uni_hv4_8_neon: 59.7 put_hevc_epel_uni_hv4_8_i8mm: 57.5 put_hevc_epel_uni_hv6_8_c: 344.7 put_hevc_epel_uni_hv6_8_neon: 105.0 put_hevc_epel_uni_hv6_8_i8mm: 102.7 put_hevc_epel_uni_hv8_8_c: 552.2 put_hevc_epel_uni_hv8_8_neon: 111.2 put_hevc_epel_uni_hv8_8_i8mm: 104.0 put_hevc_epel_uni_hv12_8_c: 1195.0 put_hevc_epel_uni_hv12_8_neon: 248.7 put_hevc_epel_uni_hv12_8_i8mm: 229.5 put_hevc_epel_uni_hv16_8_c: 1910.2 put_hevc_epel_uni_hv16_8_neon: 339.5 put_hevc_epel_uni_hv16_8_i8mm: 323.2 put_hevc_epel_uni_hv24_8_c: 4048.2 put_hevc_epel_uni_hv24_8_neon: 737.7 put_hevc_epel_uni_hv24_8_i8mm: 713.7 put_hevc_epel_uni_hv32_8_c: 6865.7 put_hevc_epel_uni_hv32_8_neon: 1285.0 put_hevc_epel_uni_hv32_8_i8mm: 1206.0 put_hevc_epel_uni_hv48_8_c: 15830.5 put_hevc_epel_uni_hv48_8_neon: 2844.7 put_hevc_epel_uni_hv48_8_i8mm: 2914.0 put_hevc_epel_uni_hv64_8_c: 27912.7 put_hevc_epel_uni_hv64_8_neon: 4970.5 put_hevc_epel_uni_hv64_8_i8mm: 4653.7 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:59:28 +02:00
Martin Storsjö	7bf3d14769	aarch64: hevc: Produce epel_hv functions for both plain neon and i8mm AWS Graviton 3: put_hevc_epel_hv4_8_c: 163.7 put_hevc_epel_hv4_8_neon: 52.5 put_hevc_epel_hv4_8_i8mm: 49.5 put_hevc_epel_hv6_8_c: 292.2 put_hevc_epel_hv6_8_neon: 97.7 put_hevc_epel_hv6_8_i8mm: 101.2 put_hevc_epel_hv8_8_c: 471.0 put_hevc_epel_hv8_8_neon: 106.7 put_hevc_epel_hv8_8_i8mm: 102.5 put_hevc_epel_hv12_8_c: 1030.2 put_hevc_epel_hv12_8_neon: 240.5 put_hevc_epel_hv12_8_i8mm: 215.0 put_hevc_epel_hv16_8_c: 1711.5 put_hevc_epel_hv16_8_neon: 340.2 put_hevc_epel_hv16_8_i8mm: 319.2 put_hevc_epel_hv24_8_c: 3670.0 put_hevc_epel_hv24_8_neon: 702.0 put_hevc_epel_hv24_8_i8mm: 666.5 put_hevc_epel_hv32_8_c: 6785.5 put_hevc_epel_hv32_8_neon: 1247.0 put_hevc_epel_hv32_8_i8mm: 1169.0 put_hevc_epel_hv48_8_c: 14689.7 put_hevc_epel_hv48_8_neon: 2665.2 put_hevc_epel_hv48_8_i8mm: 2740.0 put_hevc_epel_hv64_8_c: 25899.2 put_hevc_epel_hv64_8_neon: 4801.2 put_hevc_epel_hv64_8_i8mm: 4487.7 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:59:19 +02:00
Martin Storsjö	5b5666e5ab	aarch64: hevc: Reorder epel_hv functions to prepare for templating This is a pure reordering of code without changing anything in the individual functions. Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:59:07 +02:00
Martin Storsjö	e6d4c0e117	aarch64: hevc: Split the epel_*_hv functions into two parts The first horizontal filter can use either i8mm or plain neon versions, while the second part is a pure neon implementation. Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:59:00 +02:00
Martin Storsjö	54af555bfa	aarch64: hevc: Implement a neon version of hevc_epel_uni_w_h*_8 AWS Graviton 3: put_hevc_epel_uni_w_h4_8_c: 97.2 put_hevc_epel_uni_w_h4_8_neon: 41.2 put_hevc_epel_uni_w_h4_8_i8mm: 35.2 put_hevc_epel_uni_w_h6_8_c: 203.7 put_hevc_epel_uni_w_h6_8_neon: 84.7 put_hevc_epel_uni_w_h6_8_i8mm: 74.7 put_hevc_epel_uni_w_h8_8_c: 345.7 put_hevc_epel_uni_w_h8_8_neon: 94.0 put_hevc_epel_uni_w_h8_8_i8mm: 80.7 put_hevc_epel_uni_w_h12_8_c: 768.7 put_hevc_epel_uni_w_h12_8_neon: 196.7 put_hevc_epel_uni_w_h12_8_i8mm: 169.7 put_hevc_epel_uni_w_h16_8_c: 1313.0 put_hevc_epel_uni_w_h16_8_neon: 290.7 put_hevc_epel_uni_w_h16_8_i8mm: 238.0 put_hevc_epel_uni_w_h24_8_c: 2877.5 put_hevc_epel_uni_w_h24_8_neon: 650.0 put_hevc_epel_uni_w_h24_8_i8mm: 512.0 put_hevc_epel_uni_w_h32_8_c: 5113.5 put_hevc_epel_uni_w_h32_8_neon: 1129.5 put_hevc_epel_uni_w_h32_8_i8mm: 739.2 put_hevc_epel_uni_w_h48_8_c: 11757.0 put_hevc_epel_uni_w_h48_8_neon: 2518.7 put_hevc_epel_uni_w_h48_8_i8mm: 1688.5 put_hevc_epel_uni_w_h64_8_c: 20478.0 put_hevc_epel_uni_w_h64_8_neon: 4411.7 put_hevc_epel_uni_w_h64_8_i8mm: 2884.0 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:58:47 +02:00
Martin Storsjö	6d384298ec	aarch64: hevc: Implement a neon version of put_hevc_epel_h*_8 AWS Graviton 3: put_hevc_epel_h4_8_c: 64.7 put_hevc_epel_h4_8_neon: 25.0 put_hevc_epel_h4_8_i8mm: 21.2 put_hevc_epel_h6_8_c: 130.0 put_hevc_epel_h6_8_neon: 40.7 put_hevc_epel_h6_8_i8mm: 36.5 put_hevc_epel_h8_8_c: 209.0 put_hevc_epel_h8_8_neon: 45.2 put_hevc_epel_h8_8_i8mm: 41.2 put_hevc_epel_h12_8_c: 465.5 put_hevc_epel_h12_8_neon: 104.5 put_hevc_epel_h12_8_i8mm: 86.5 put_hevc_epel_h16_8_c: 830.7 put_hevc_epel_h16_8_neon: 134.2 put_hevc_epel_h16_8_i8mm: 114.0 put_hevc_epel_h24_8_c: 1844.7 put_hevc_epel_h24_8_neon: 282.2 put_hevc_epel_h24_8_i8mm: 277.2 put_hevc_epel_h32_8_c: 3227.5 put_hevc_epel_h32_8_neon: 501.5 put_hevc_epel_h32_8_i8mm: 396.0 put_hevc_epel_h48_8_c: 7229.2 put_hevc_epel_h48_8_neon: 1120.2 put_hevc_epel_h48_8_i8mm: 901.2 put_hevc_epel_h64_8_c: 12869.0 put_hevc_epel_h64_8_neon: 1999.2 put_hevc_epel_h64_8_i8mm: 1610.5 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:58:29 +02:00
Martin Storsjö	8f03c30a17	aarch64: hevc: Use ld1r instead of ldr+dup in hevc_qpel_uni_w_h Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:58:20 +02:00
Martin Storsjö	717cc82d28	aarch64: hevc: Specialize put_hevc_\type\()_h*_8_neon for horizontal looping For widths of 32 pixels and more, loop first horizontally, then vertically. Previously, this function would process a 16 pixel wide slice of the block, looping vertically. After processing the whole height, it would backtrack and process the next 16 pixel wide slice. When doing 8tap filtering horizontally, the function must load 7 more pixels (in practice, 8) following the actual inputs, and this was done for each slice. By iterating first horizontally throughout each line, then vertically, we access data in a more cache friendly order, and we don't need to reload data unnecessarily. Keep the original order in put_hevc_\type\()_h12_8_neon; the only suboptimal case there is for width=24. But specializing an optimal variant for that would require more code, which might not be worth it. For the h16 case, this implementation would give a slowdown, as it now loads the first 8 pixels separately from the rest, but for larger widths, it is a gain. Therefore, keep the h16 case as it was (but remove the outer loop), and create a new specialized version for horizontal looping with 16 pixels at a time. Before: Cortex A53 A72 A73 Graviton 3 put_hevc_qpel_h16_8_neon: 710.5 667.7 692.5 211.0 put_hevc_qpel_h32_8_neon: 2791.5 2643.5 2732.0 883.5 put_hevc_qpel_h64_8_neon: 10954.0 10657.0 10874.2 3241.5 After: put_hevc_qpel_h16_8_neon: 697.5 663.5 705.7 212.5 put_hevc_qpel_h32_8_neon: 2767.2 2684.5 2791.2 920.5 put_hevc_qpel_h64_8_neon: 10559.2 10471.5 10932.2 3051.7 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:58:11 +02:00
Martin Storsjö	e3a54cabde	aarch64: hevc: Merge consecutive stores in put_hevc_\type\()_h16_8_neon This gets rid of a couple instructions, but the actual performance is almost identical on Cortex A72/A73. On Cortex A53, it is a handful of cycles faster. Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:58:01 +02:00
Martin Storsjö	78db8405c0	aarch64: hevc: Don't iterate with sp in ff_hevc_put_hevc_qpel_uni_w_hv32/64_8_neon_i8mm Many of the routines within hevcdsp_epel_neon and hevcdsp_qpel_neon store temporary buffers on the stack. When consuming it, many of these functions use the stack pointer as incremental pointer for reading the data (instead of storing it in another register), which is rather unusual. Technically, this is fine as long as the pointer remains properly aligned. However in the case of ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, after incrementing sp when reading data (within each 16 pixel wide stripe) it would then reset the stack pointer back to a lower value, for reading the next 16 pixel wide stripe, expecting the data to remain untouched. This can't be assumed; data on the stack below the stack pointer can be clobbered (e.g. by a signal handler). Some OS ABIs allow for a little margin that won't be touched, aka a red zone, but not all do. The ones that do, guarantee 16 or 128 bytes, not 9 KB. Convert this function to use a separate pointer register to iterate through the data, retaining the stack pointer to point at the bottom of the data we require to remain untouched. Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:57:55 +02:00
Martin Storsjö	e66858fbab	aarch64: hevc: Reorder a misplaced function init line Group the epel and qpel functions together. Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:57:50 +02:00
Andreas Rheinhardt	ced5c5fdb8	fftools/ffmpeg_mux_init: Fix double-free on error MATCH_PER_STREAM_OPT iterates over all options of a given OptionDef and tests whether they apply to the current stream; if so, they are set to ost->apad, otherwise, the code errors out. If no error happens, ost->apad is av_strdup'ed in order to take ownership of this pointer. But this means that setting it originally was premature, as it leads to double-frees when an error happens later on. This can simply be reproduced with ffmpeg -filter_complex anullsrc -apad bar -apad:n baz -f null - This is a regression since `83ace80bfd`. Fix this by using a temporary variable instead of directly setting ost->apad. Also only strdup the string if it actually is != NULL. Reviewed-by: Marth64 <marth64@proxyid.net> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2024-03-26 06:48:35 +01:00
Andreas Rheinhardt	4a4dcde339	avformat/internal: Move FF_FMT_INIT_CLEANUP to demux.h and rename it to FF_INFMT_INIT_CLEANUP. This flag is demuxer-only, so this is the more appropriate place for it. This does not preclude adding internal flags common to both demuxer and muxer in the future. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2024-03-26 06:36:43 +01:00
Andreas Rheinhardt	27af88fb7f	avformat/vqf: Return 0 on success in read_packet Demuxers are not supposed to return the size of the packet read. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2024-03-26 06:36:43 +01:00
Andreas Rheinhardt	29aa499fc9	avformat/cdg: Don't store avio_size() return value in int Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2024-03-26 06:36:43 +01:00
Andreas Rheinhardt	cee70b9f1b	avformat/lafdec: Fix shadowing Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2024-03-26 06:36:43 +01:00
Andreas Rheinhardt	aa8c7dc3d8	avformat/argo_cvg: Avoid relocations for ArgoCVGOverride The average length of the strings used here does not differ much from the length of the longest string; therefore it makes sense to use an array big enough for the longest string and not a pointer to a string. This also moves this array into .rodata (from .data.rel.ro). Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2024-03-26 06:36:43 +01:00
Andreas Rheinhardt	69b85a69bd	avformat/wady: Combine skips Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2024-03-26 06:36:43 +01:00
Andreas Rheinhardt	cdff5a2c0c	avformat/avr: Combine skips Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2024-03-26 06:36:43 +01:00
Andreas Rheinhardt	56ba83ff2d	avformat/fsb: Don't set data_offset manually It is set generically to the value that it is to here. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2024-03-26 06:36:43 +01:00
Andreas Rheinhardt	88f803cf64	avformat/wvedec: Inline constant Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2024-03-26 06:36:43 +01:00
Andreas Rheinhardt	8768188581	avformat/g722: Inline constants Forgotten in `5f0e161dd6`. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2024-03-26 06:36:43 +01:00
Andreas Rheinhardt	b93ed5c28e	avformat/fitsdec: Don't use AVBPrint for temporary storage Most of the data in the temporary storage ends up being returned to the user as AVPacket.data, so it makes sense to avoid using the AVBPrint for temporary storage altogether (in particular in light of the fact that the blocks read here are too big for the small-string optimization anyway) and read the data directly into AVPacket.data. This also avoids another memcpy() from a stack buffer to the AVBPrint in ts_image() (that could always have been avoided with av_bprint_get_buffer()). These changes also allow the use of av_append_packet(), which greatly simplifies the code; furthermore, one can avoid cleanup code on error as the packet is already unreferenced generically on error. There are two user-visible changes from this patch: 1. Truncated packets are now marked as corrupt. 2. AVPacket.pos is set (it corresponds to the discarded header line, 80 bytes before the position corresponding to the actual packet data). Furthermore, this patch also removes code that triggered a -Wtautological-constant-out-of-range-compare warning from Clang (namely a comparison of an unsigned and INT64_MAX in an assert). Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2024-03-26 06:36:43 +01:00
Andreas Rheinhardt	5144455c20	avformat/hls: Don't access FFInputFormat.raw_codec_id It is an implementation detail of other input formats whether they use raw_codec_id or not. The HLS demuxer should not rely on this. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2024-03-26 06:36:43 +01:00
Andreas Rheinhardt	8d8b5947c3	configure: Make hls demuxer select AAC, AC3 and EAC3 demuxers The code relies on their presence and would presumably crash when retrieving in_fmt->name if an encrypted stream with a codec id without demuxer were encountered. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2024-03-26 06:36:43 +01:00
Andreas Rheinhardt	a990e6fa01	avformat/mux: Remove check for AVFMT_ALLOW_FLUSH Due to the bump it is now certain that all devices that support flushing have the proper internal flag set. (Notice that the check for LIBAVFORMAT_VERSION was wrong.) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2024-03-26 06:32:52 +01:00
Andreas Rheinhardt	e95dd6f53e	avformat/file: Combine all CONFIG_ANDROID_CONTENT_PROTOCOL blocks Besides improving readability this also ensures that a developer who has the android content protocol enabled and works on the other parts of the file will not forget to add necessary inclusions just because of (indirect) inclusions from the files included only when said protocol is enabled. Reviewed-by: Matthieu Bouron <matthieu.bouron@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2024-03-26 06:31:58 +01:00
Andreas Rheinhardt	ebe8326409	avformat/file: Constify android content protocol (The discrepancy between the definition and the declaration in protocols.c is actually UB.) Reviewed-by: Matthieu Bouron <matthieu.bouron@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2024-03-26 06:31:40 +01:00
Andreas Rheinhardt	a6189ba896	avcodec/mpegutils: Simplify indenting Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2024-03-26 06:30:45 +01:00
Andreas Rheinhardt	5eda98f382	avcodec/mpegutils: Avoid allocations when using AVBPrint Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2024-03-26 06:30:45 +01:00
James Almer	0963ef4996	fftools/ffmpeg_filter: remove prototype for non existent function Signed-off-by: James Almer <jamrial@gmail.com>	2024-03-25 23:23:27 -03:00
James Almer	767e7d3d2b	fftools/ffmpeg_filter: remove unused struct from InputFilterPriv It's already in InputFilterOptions. Signed-off-by: James Almer <jamrial@gmail.com>	2024-03-25 23:23:27 -03:00
James Almer	abcdd3aed7	avformat/mov: don't use cur_item_id as array index Reviewed-by: Michael Niedermayer <michael@niedermayer.cc> Signed-off-by: James Almer <jamrial@gmail.com>	2024-03-25 23:20:51 -03:00
Michael Niedermayer	dd733b2be4	avformat/concatdec: clip outpoint - inpoint overflow in get_best_effort_duration() An alternative would be to limit all time/duration fields to below 64bit Fixes: signed integer overflow: -93000000 - 9223372036839000000 cannot be represented in type 'long long' Fixes: 64546/clusterfuzz-testcase-minimized-ffmpeg_dem_CONCAT_fuzzer-5110813828186112 Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2024-03-26 01:19:17 +01:00
Michael Niedermayer	b54c9a9c8f	avcodec/osq: avoid several signed integer overflows Fixes: signed integer overflow: 178459578 + 2009763270 cannot be represented in type 'int' Fixes: 62285/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_OSQ_fuzzer-5013423686287360 Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2024-03-26 01:19:17 +01:00
Michael Niedermayer	e83e8d443b	avformat/jacosubdec: clarify code add comments, rename variables and indent things differently Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2024-03-26 01:19:16 +01:00
Jun Zhao	5ebcca4e08	lavf/movenc: small cleanup for style Small cleanup for style, indent, switch case labels. BTW, the preferred way to ease multiple indentation levels in a switch statement is to align the switch and its subordinate case labels in the same column Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2024-03-26 07:52:53 +08:00
Michael Niedermayer	b792e4d4c7	avformat/cafdec: Check that data chunk end fits within 64bit Fixes: signed integer overflow: 64 + 9223372036854775803 cannot be represented in type 'long long' Fixes: 51896/clusterfuzz-testcase-minimized-ffmpeg_dem_CAF_fuzzer-6536881135550464 Fixes: 62276/clusterfuzz-testcase-minimized-ffmpeg_dem_CAF_fuzzer-6536881135550464 Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2024-03-26 00:08:25 +01:00
Michael Niedermayer	b8e754525c	avformat/iff: Saturate avio_tell() + 12 Fixes: signed integer overflow: 9223372036854775796 + 12 cannot be represented in type 'long long' Fixes: 51896/clusterfuzz-testcase-minimized-ffmpeg_dem_IFF_fuzzer-4898373660704768 Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2024-03-26 00:08:25 +01:00
Michael Niedermayer	50d8e4f273	avformat/dxa: Adjust order of operations around block align Fixes: 51896/clusterfuzz-testcase-minimized-ffmpeg_dem_DXA_fuzzer-5730576523198464 Fixes: signed integer overflow: 2147483566 + 82 cannot be represented in type 'int' Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2024-03-26 00:08:25 +01:00
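
Several of the fuzzing fixes listed above (concatdec, cafdec, iff, dxa) deal with signed integer overflow, which is undefined behaviour in C. The following is a minimal sketch of the general technique of checking the operands before the arithmetic and clamping the result. The helper names `sat_add64` and `sat_sub64` are hypothetical and chosen for illustration only; this is not the code from those commits, and each real fix may check, clip, or reorder operations differently.

```c
#include <stdint.h>

/* Hypothetical helper (not FFmpeg's code): add two int64_t values and
 * clamp to INT64_MAX / INT64_MIN instead of overflowing. The range test
 * is done before the addition, so no undefined behaviour is triggered. */
static int64_t sat_add64(int64_t a, int64_t b)
{
    if (b > 0 && a > INT64_MAX - b)
        return INT64_MAX;
    if (b < 0 && a < INT64_MIN - b)
        return INT64_MIN;
    return a + b;
}

/* Hypothetical helper: subtract b from a with the same clamping rule. */
static int64_t sat_sub64(int64_t a, int64_t b)
{
    if (b < 0 && a > INT64_MAX + b)
        return INT64_MAX;
    if (b > 0 && a < INT64_MIN + b)
        return INT64_MIN;
    return a - b;
}
```

With the operand values quoted in the sanitizer reports above, `sat_add64(INT64_C(9223372036854775796), 12)` would clamp to `INT64_MAX`, and `sat_sub64(INT64_C(-93000000), INT64_C(9223372036839000000))` would clamp to `INT64_MIN` rather than wrapping.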

114588 Commits