As we do not need to widen accumulators to 64 bits, we effectively get
double capacity for unrolling compared to the integer function. This
explains the slightly better performance gains.
ac3_sum_square_bufferfly_float_c: 65.2
ac3_sum_square_bufferfly_float_rvv_f32: 12.2