1. no longer use the register names directly and optimized code format
2. to be compatible with O32, specify type of address variable with mips_reg and handle the address variable with PTR_ operator
3. optimize some unaligned loads and stores
4. use uld and mtc1 to workaround cpu 3A2000 gslwlc1 bug (gslwlc1 instruction extension bug in O32 ABI)
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>