FFmpeg

mirror of https://github.com/FFmpeg/FFmpeg.git synced 2025-03-03 14:32:16 +02:00

Author	SHA1	Message	Date
Rémi Denis-Courmont	3c6516330f	lavc/exrdsp: R-V V reorder_pixels	2023-10-09 19:52:51 +03:00
Rémi Denis-Courmont	89c10d8d20	lavc/ac3: add R-V Zbb extract_exponents	2023-10-05 18:13:00 +03:00
Rémi Denis-Courmont	cec48e3b32	riscv: factor out the bswap32 assembler	2023-10-02 22:28:21 +03:00
Rémi Denis-Courmont	b36f3d5330	lavc/fmtconvert: unroll R-V V int32_to_float_fmul_scalar	2023-10-02 18:08:23 +03:00
Rémi Denis-Courmont	f3dfd4ccf2	lavc/aacpsdsp: unroll RISC-V V hybrid_synthesis_deint	2023-10-02 18:08:23 +03:00
Rémi Denis-Courmont	0f1336b285	lavc/aacpsdsp: unroll RISC-V V hybrid_analysis_ileave	2023-10-02 18:08:23 +03:00
Rémi Denis-Courmont	69d7486e59	lavc/aacpsdsp: unroll RISC-V V mul_pair_single	2023-10-02 18:08:23 +03:00
Rémi Denis-Courmont	c270928cc0	lavc/aacpsdsp: unroll R-V V stereo interpolate	2023-10-02 18:08:23 +03:00
Rémi Denis-Courmont	27d74fc1ef	lavc/aacpsdsp: simplify R-V V stereo interpolate Remove some useless vector splat.	2023-10-02 18:08:23 +03:00
Rémi Denis-Courmont	3575ee2ea3	lavc/audiodsp: unroll RISC-V clip functions audiodsp.vector_clip_int32_c: 17500.7; audiodsp.vector_clip_int32_rvv_i32: 8404.7 (m1); audiodsp.vector_clip_int32_rvv_i32: 2689.9 (m8); audiodsp.vector_clipf_c: 33679.7; audiodsp.vector_clipf_rvf: 7019.7; audiodsp.vector_clipf_rvv_f32: 8328.0 (m1); audiodsp.vector_clipf_rvv_f32: 2209.4 (m8)	2023-10-02 18:07:54 +03:00
Rémi Denis-Courmont	9bc5676e40	lavc/g722dsp: add RISC-V V DSP function	2023-08-24 21:07:18 +03:00
Arnie Chang	8d1316e515	lavc/h264chroma: RISC-V V add motion compensation for 4xH and 2xH chroma blocks Optimize the put and avg filtering for 4xH and 2xH blocks Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>	2023-07-25 19:10:40 +03:00
Rémi Denis-Courmont	44cac1def0	lavc/audiodsp: rework RISC-V V scalar product Take vector reduction out of the loop and unroll. Before: audiodsp.scalarproduct_int16_c: 12321.0; audiodsp.scalarproduct_int16_rvv_i32: 4175.7. After: audiodsp.scalarproduct_int16_c: 12320.5; audiodsp.scalarproduct_int16_rvv_i32: 1230.2	2023-07-20 22:54:34 +03:00
Rémi Denis-Courmont	61e5ca4ded	lavc/bswapdsp: purge RISC-V V bswap32 This cannot beat the Zbb implementation, and it is unlikely that a real meaningful CPU design would support V and not Zbb. The best loop rewrite that I could come up with (4 shifts, 2 ands, 3 ors) is still ~40% slower than Zbb. A proper faster vector implementation should be feasible with the cryptographic vector extensions, but that is a story for another time.	2023-07-19 19:29:35 +03:00
Rémi Denis-Courmont	5de1db5370	lavc/bswapdsp: rewrite RISC-V V bswap16 This favours bit-wise logic over slow strided stores.	2023-07-19 19:29:35 +03:00
Rémi Denis-Courmont	b6585eb04c	lavu: add/use flag for RISC-V Zba extension The code was blindly assuming that Zbb or V implied Zba. While the former is practically always true, the latter broke some QEMU setups, as V was introduced earlier than Zba.	2023-07-19 19:29:35 +03:00
Rémi Denis-Courmont	2eb55157aa	lavc/aacpsdsp: unroll RISC-V V add_squares This slightly improves performance with the Device Under Test.	2023-07-19 19:29:35 +03:00
Rémi Denis-Courmont	c541ecf0dc	lavc/alacdsp: unroll RISC-V V loops This increases the group multiplier as per T-Head C910 benchmarks: alac_append_extra_bits_mono_c: 803.0; alac_append_extra_bits_stereo_c: 1604.2; alac_decorrelate_stereo_c: 1077.5. LMUL=1: alac_append_extra_bits_mono_rvv_i32: 418.2; alac_append_extra_bits_stereo_rvv_i32: 693.2; alac_decorrelate_stereo_rvv_i32: 673.5. LMUL=2: alac_append_extra_bits_mono_rvv_i32: 382.2; alac_append_extra_bits_stereo_rvv_i32: 648.2; alac_decorrelate_stereo_rvv_i32: 542.7. LMUL=4: alac_append_extra_bits_mono_rvv_i32: 241.5; alac_append_extra_bits_stereo_rvv_i32: 512.7; alac_decorrelate_stereo_rvv_i32: 364.2. LMUL=8: alac_append_extra_bits_mono_rvv_i32: 239.7; alac_append_extra_bits_stereo_rvv_i32: 497.2; alac_decorrelate_stereo_rvv_i32: 426.7	2023-07-16 23:24:00 +03:00
Rémi Denis-Courmont	a28aa0475d	lavc/vorbisdsp: unroll RISC-V V inverse_coupling This increases the group multiplier as per T-Head C910 benchmarks: inverse_coupling_c: 4597.0; inverse_coupling_rvv_i32: 1312.7 (m1); inverse_coupling_rvv_i32: 1116.7 (m2); inverse_coupling_rvv_i32: 732.2 (m4); inverse_coupling_rvv_i32: 898.0 (m8)	2023-07-16 23:24:00 +03:00
Arnie Chang	c5508f60c2	lavc/h264chroma: RISC-V V add motion compensation for 8x8 chroma blocks Optimize the put and avg filtering for 8x8 chroma blocks Signed-off-by: Arnie Chang <arnie.chang@sifive.com>	2023-05-30 17:15:05 +02:00
Rémi Denis-Courmont	4d66e8c12e	lavc/audiodsp: fix RISC-V V scalar product (again) The loop uses a 32-bit accumulator. The current code would only zero the lower 16 bits thereof.	2022-10-17 06:39:00 +02:00
Rémi Denis-Courmont	96a83ceea4	riscv: fix scalar product initialisation 'VSETVLI xd, x0, ...' has rather nonobvious semantics: if xd is x0, then it preserves the current vector length; if xd is not x0, it sets the vector length to the supported maximum. Also somewhat confusingly, while VMV.X.S always does its thing regardless of the selected vector length, VMV.S.X does _nothing_ if the selected vector length is zero. So the current code fails to initialise the accumulator if we are unlucky to have a selected vector length of zero on entry. Fix it by forcing the vector length to one.	2022-10-13 10:17:38 +02:00
Rémi Denis-Courmont	105921251a	lavc/aacpsdsp: fix clobber on RISC-V LP64D/ILP32D Although the DSP function only uses single precision from RISC-V F, the caller may leave double precision values in the spilled registers if the calling convention supports double precision hardware floats. Then, we need to save and restore FS registers as double precision. Conversely, we do not need to save anything at all if an integer calling convention is in use. However we can assume that single precision floats are supported, since the Zve32f extension implies the F extension. So for the sake of simplicity, we always save at least single precision values. In theory, we should even save quadruple precision values if the LP64Q ABI is in use. I have yet to see a compiler that supports it though.	2022-10-10 02:23:18 +02:00
Rémi Denis-Courmont	bfc69297c5	lavc/opusdsp: RISC-V V (512-bit) postfilter This adds a variant of the postfilter for use with 512-bit vectors. Half a vector is enough to perform the scalar product. Normally a whole vector would be used anyhow. Indeed fractional multipliers are no faster than the unit multiplier. But in this particular function, a full vector makes up 16 samples, which would be loaded at each iteration of the outer loop. The minimum guaranteed CELT postfilter period is only 15. Accounting for the edges, we can only safely preload up to 13 samples. The fractional multiplier is thus used to cap the selected vector length to a safe value of 8 elements or 256 bits. Likewise, we have the 1024-bit variant with the quarter multiplier. In theory, a 2048-bit one would be possible with the eighth multiplier, but that length is not even defined in the specifications as of yet, nor is it supported by any emulator - forget actual hardware.	2022-10-10 02:23:17 +02:00
Rémi Denis-Courmont	97d34befea	lavc/opusdsp: RISC-V V (256-bit) postfilter This adds a variant of the postfilter for use with 256-bit vectors. As a single vector is then large enough to perform the scalar product, the group multiplier is reduced to just one at run-time. The different vector type is passed via register. Unfortunately, there is no VSETIVL instruction, so the constant vector size (5) also needs to be passed via a register.	2022-10-10 02:22:39 +02:00
Rémi Denis-Courmont	8009581912	lavc/opusdsp: RISC-V V (128-bit) postfilter This is implemented for a vector size of 128-bit. Since the scalar product in the inner loop covers 5 samples or 160 bits, we need a group multiplier of 2. To avoid reconfiguring the vector type, the outer loop, which loads multiple input samples, sticks to the same multiplier. Consequently, the outer loop loads 8 samples per iteration. This is safe since the minimum period of the CELT codec is 15 samples. The same code would also work, albeit needlessly inefficiently, with a vector length of 256 bits. A proper implementation will follow instead.	2022-10-10 02:22:10 +02:00
Rémi Denis-Courmont	2abafd7307	lavc/bswapdsp: RISC-V V bswap16_buf	2022-10-05 08:26:19 +02:00
Rémi Denis-Courmont	d7528af4df	lavc/bswapdsp: RISC-V V bswap_buf	2022-10-05 08:26:19 +02:00
Rémi Denis-Courmont	f0ef11ea83	lavc/bswapdsp: RISC-V B bswap_buf Simply taking the Zbb REV8 instruction into use in a simple loop gives some significant savings: bswap_buf_c: 1081.0 bswap_buf_rvb_b: 771.0 But we can also use the 64-bit REV8 as a pseudo-SIMD instruction with just one additional shift, and one fewer load, effectively doubling the bandwidth. Consequently, this patch is useful even if the compile-time target has Zbb enabled for C code: bswap_buf_c: 1081.0 bswap_buf_rvb_b: 341.0 (this patch) On the other hand, this approach fails miserably for bswap16_buf as the ratio of shifts and stores becomes unfavorable compared to naïve C: bswap16_buf_c: 1542.0 bswap16_buf_rvb_b: 1803.7 Unrolling to process 128 bits (4 samples) at a time actually worsens performance ever so slightly: bswap_buf_c: 1081.0 bswap_buf_rvb_b: 408.5	2022-10-05 08:26:19 +02:00
Lynne	b25c6a5704	riscv/alacdsp: drop config.h include	2022-10-05 06:59:43 +02:00
Rémi Denis-Courmont	3ba5579e55	riscv: remove unnecessary #include's Pointed out by Andreas Rheinhardt.	2022-10-05 06:54:56 +02:00
Rémi Denis-Courmont	f0d1637c11	lavc/alacdsp: RISC-V V append_extra_bits[1]	2022-10-05 06:51:11 +02:00
Rémi Denis-Courmont	55bde97f29	lavc/alacdsp: RISC-V V append_extra_bits[0]	2022-10-05 06:51:11 +02:00
Rémi Denis-Courmont	64ab577954	lavc/alacdsp: RISC-V V decorrelate_stereo To avoid data dependencies, this does the following unroll, which requires one extra but probably free addition: coeff = (b * left_weight) >> decorr_shift; b += a; a -= coeff; b -= coeff; swap(a, b);	2022-10-05 06:51:11 +02:00
Martin Storsjö	6059ea2a14	riscv: Fix linking without RVV; change #ifdef into #if Signed-off-by: Martin Storsjö <martin@martin.st>	2022-09-29 10:28:37 +03:00
Rémi Denis-Courmont	d31013166a	lavc/pixblockdsp: RISC-V diff_pixels & diff_pixels_unaligned	2022-09-28 11:46:11 +02:00
Rémi Denis-Courmont	ebee25855a	lavc/pixblockdsp: RISC-V V 16-bit get_pixels & get_pixels_unaligned	2022-09-28 11:46:11 +02:00
Rémi Denis-Courmont	676b08cb70	lavc/pixblockdsp: RISC-V V 8-bit get_pixels & get_pixels_unaligned	2022-09-28 11:46:11 +02:00
Rémi Denis-Courmont	2746329ce2	lavc/idctdsp: RISC-V V put_signed_pixels_clamped function	2022-09-28 11:46:11 +02:00
Rémi Denis-Courmont	fa983b5656	lavc/idctdsp: RISC-V V add_pixels_clamped function	2022-09-28 11:46:11 +02:00
Rémi Denis-Courmont	b29ee63a1b	lavc/idctdsp: RISC-V V put_pixels_clamped function	2022-09-28 11:46:11 +02:00
Martin Storsjö	dd2e524ffa	riscv: Use the correct path for including asm.S Signed-off-by: Martin Storsjö <martin@martin.st>	2022-09-28 11:02:46 +03:00
Rémi Denis-Courmont	c03f9654c9	lavc/aacpsdsp: RISC-V V stereo_interpolate[0]	2022-09-27 13:19:52 +02:00
Rémi Denis-Courmont	a15edb0bc0	lavc/aacpsdsp: RISC-V V hybrid_synthesis_deint	2022-09-27 13:19:52 +02:00
Rémi Denis-Courmont	09f907999f	lavc/aacpsdsp: RISC-V V hybrid_analysis_ileave	2022-09-27 13:19:52 +02:00
Rémi Denis-Courmont	15c3a0bd6e	lavc/aacpsdsp: RISC-V V hybrid_analysis This starts with one-time initialisation of the 26 constant factors like 08edacc248bce3f8946d75e97188d189c74a6de6. That is done with the scalar instruction set. While the formula can readily be vectored, the gains would (probably) be more than lost in transferring the results back to FP registers (or suitably reshuffling them into vector registers). Note that the main loop could likely be scheduled slightly better by expanding the filter macro and interleaving loads with arithmetic. It is not clear yet if that would be relevant for vector processing (as opposed to traditional SIMD). We could also use fewer vectors, but there is not much point in sparing them (they are all callee-clobbered).	2022-09-27 13:19:52 +02:00
Rémi Denis-Courmont	e180326a0b	lavc/aacpsdsp: RISC-V V mul_pair_single	2022-09-27 13:19:52 +02:00
Rémi Denis-Courmont	b0cacf4c3f	lavc/aacpsdsp: RISC-V V add_squares	2022-09-27 13:19:52 +02:00
Rémi Denis-Courmont	453aba71e6	lavc/vorbisdsp: RISC-V V inverse_coupling This uses the following vectorisation: for (i = 0; i < blocksize; i++) { ang[i] = mag[i] - copysignf(fmaxf(ang[i], 0.f), mag[i]); mag[i] = mag[i] - copysignf(fminf(ang[i], 0.f), mag[i]); }	2022-09-27 13:19:52 +02:00
Rémi Denis-Courmont	220dfd0945	lavc/fmtconvert: RISC-V V int32_to_float_fmul_array8	2022-09-27 13:19:52 +02:00
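
Commit 64ab577954 (lavc/alacdsp: RISC-V V decorrelate_stereo) describes rearranging the stereo decorrelation so that the update of b no longer waits on the updated a. The following is a minimal C sketch of that rearrangement, not the RISC-V assembly from the commit. The function names, the parameter order, and the "serial" baseline are assumptions made for illustration; only the five-statement sequence in the second loop comes from the commit message. Right-shifting a negative product is implementation-defined in C, so the sketch is best exercised with non-negative products.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed baseline form: b is updated from the already-updated a,
 * so the addition depends on the multiply, shift and subtract. */
static void decorrelate_serial(int32_t *ch0, int32_t *ch1, size_t n,
                               int decorr_shift, int left_weight)
{
    for (size_t i = 0; i < n; i++) {
        int32_t a = ch0[i], b = ch1[i];

        a -= (b * left_weight) >> decorr_shift;
        b += a;
        ch0[i] = b;
        ch1[i] = a;
    }
}

/* Rearranged form quoted in the commit message: one extra subtraction,
 * but b += a can be issued without waiting for coeff. */
static void decorrelate_unrolled(int32_t *ch0, int32_t *ch1, size_t n,
                                 int decorr_shift, int left_weight)
{
    for (size_t i = 0; i < n; i++) {
        int32_t a = ch0[i], b = ch1[i];
        int32_t coeff = (b * left_weight) >> decorr_shift;

        b += a;
        a -= coeff;
        b -= coeff;
        /* swap(a, b): channel 0 receives b, channel 1 receives a */
        ch0[i] = b;
        ch1[i] = a;
    }
}
```

Both loops produce the same outputs; the second one simply exposes more independent work per element, which is what the commit message refers to as avoiding data dependencies.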

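
Commit f0ef11ea83 (lavc/bswapdsp: RISC-V B bswap_buf) mentions using the 64-bit REV8 as a pseudo-SIMD instruction with one additional shift. The C sketch below illustrates the idea under stated assumptions: `rev8_64` is a portable stand-in for the Zbb instruction, and `bswap32_buf_pairs` is a hypothetical name, not FFmpeg's. Two 32-bit words are packed into one 64-bit value, reversed in a single step, and the two byte-swapped words are then found in exchanged halves.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable 64-bit byte reversal, standing in for Zbb REV8 on RV64. */
static uint64_t rev8_64(uint64_t x)
{
    x = ((x & 0x00FF00FF00FF00FFull) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFull);
    x = ((x & 0x0000FFFF0000FFFFull) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFull);
    return (x << 32) | (x >> 32);
}

/* Byte-swap n 32-bit words, two at a time (hypothetical helper). */
static void bswap32_buf_pairs(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        /* Pack two words: src[i] in the low half, src[i + 1] in the high half. */
        uint64_t pair = (uint64_t)src[i] | ((uint64_t)src[i + 1] << 32);
        uint64_t rev  = rev8_64(pair);

        /* After the full reversal the halves have traded places. */
        dst[i]     = (uint32_t)(rev >> 32); /* the one extra shift */
        dst[i + 1] = (uint32_t)rev;
    }
    if (i < n) /* odd tail: the swapped word lands in the high half */
        dst[i] = (uint32_t)(rev8_64(src[i]) >> 32);
}
```

The packing is done arithmetically here so the sketch does not depend on host endianness; the commit instead describes saving a load by reading 64 bits at once.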

56 Commits
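
Commit 44cac1def0 (lavc/audiodsp: rework RISC-V V scalar product) takes the vector reduction out of the loop. A rough scalar C analogy of that restructuring follows. The lane count `LANES`, the function name, and the use of a plain array in place of a vector register are all assumptions for illustration; the actual change is in RISC-V V assembly. Each lane keeps its own 32-bit partial sum inside the loop, and the lanes are summed once at the end.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 4 /* hypothetical lane count standing in for the vector length */

static int32_t scalarproduct_int16_lanes(const int16_t *v1, const int16_t *v2,
                                         size_t n)
{
    uint32_t acc[LANES] = { 0 }; /* per-lane 32-bit accumulators */
    uint32_t sum = 0;
    size_t i = 0;

    /* Main loop: multiply-accumulate only, no cross-lane reduction. */
    for (; i + LANES <= n; i += LANES)
        for (size_t l = 0; l < LANES; l++)
            acc[l] += (uint32_t)((int32_t)v1[i + l] * v2[i + l]);

    /* Leftover elements go into lane 0. */
    for (; i < n; i++)
        acc[0] += (uint32_t)((int32_t)v1[i] * v2[i]);

    /* Single reduction after the loop. */
    for (size_t l = 0; l < LANES; l++)
        sum += acc[l];

    return (int32_t)sum;
}
```

Unsigned accumulators are used so that wrap-around is well defined in C; whether the assembly wraps in the same way is not stated in the commit message.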