The gather index vector is only used as double-length (due to register
pressure), so no need to initialise it for quad-length. Basically this
matches the multiplier in the prologue to the the multipler in the loop.
This revectors the inner loop to reverse vectors element in vectors,
thus eliminating the negative register stride. Note that RVV does not
have a vector reverse instruction, so this uses a gather.
1) Take the reductive sum out of the loop,
leaving a regular vector addition in the loop.
2) Merge the addition and the multiplication.
3) Unroll.
Before:
scalarproduct_float_rvv_f32: 832.5
After:
scalarproduct_float_rvv_f32: 275.2
VSETVLI xd, x0, ...' has rather nonobvious semantics:
- If xd is x0, then it preserves the current vector length.
- If xd is not x0, it sets the vector length to the supported maximum.
Also somewhat confusingly, while VMV.X.S always does its thing
regardless of the selected vector length, VMV.S.X does _nothing_ if the
selected vector length is zero.
So the current code breaks fails to initialise the accumulator if we
are unlucky to have a selected vector length of zero on entry. Fix it
by forcing the vector length to one.