Dan Malek wrote:
... But you gave me an idea (that still doesn't require assembly programming) ...
So, my idea was to modify imdct36 so that the mad_f_mul + mad_f_mul ... sequences are replaced by 64-bit multiply/accumulates, and you round/scale just once at the end (like I would do on a DSP). This improved performance by about 6%, and I ended up with great compiler-generated code. For example:
t6 = mad_f_mul(X[4], 0x0ec835e8L) + mad_f_mul(X[13], 0x061f78aaL);
becomes:
macreg = 0;
mad_f_mac(macreg, X[4], 0x0ec835e8L);
mad_f_mac(macreg, X[13], 0x061f78aaL);
t6 = mad_f_macscale(macreg);
Of course, this makes more sense on the longer multiply/accumulate sequences, but you get the idea in a minimal amount of space here :-). 'macreg' is a 64-bit signed long long. The 'mad_f_mac' macro is just the multiply part of mad_f_mul, with the product added into macreg. The 'mad_f_macscale' macro is just the rounding/scaling part of mad_f_mul, which is done only once, at the end of the MAC sequence, to get the mad_fixed_t result.
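As a rough illustration of the two helpers described above, here is a minimal, self-contained sketch. It is not the actual patch: the macro bodies are guesses based on the description, the local mad_f_mul is a stand-in for libmad's 64-bit rounding multiply, and the typedefs assume libmad's usual 32-bit fixed-point format with 28 fractional bits. The t6_mul and t6_mac functions are hypothetical wrappers added only so the two forms can be compared.

```c
#include <stdint.h>

/* Assumption: 32-bit signed fixed point with 28 fractional bits. */
typedef int32_t mad_fixed_t;
typedef int64_t mad_fixed64_t;
#define MAD_F_SCALEBITS 28

/* Rounding constant: half of one output LSB, expressed in 64-bit product units. */
#define MAD_F_ROUND (INT64_C(1) << (MAD_F_SCALEBITS - 1))

/* Stand-in for a 64-bit mad_f_mul: multiply, round, scale on every call.
 * Relies on >> being an arithmetic shift for negative values, which is
 * implementation-defined in C but true on common compilers. */
#define mad_f_mul(x, y) \
    ((mad_fixed_t)((((mad_fixed64_t)(x) * (y)) + MAD_F_ROUND) >> MAD_F_SCALEBITS))

/* Hypothetical mad_f_mac: only the multiply, accumulated at full precision. */
#define mad_f_mac(acc, x, y) ((acc) += (mad_fixed64_t)(x) * (y))

/* Hypothetical mad_f_macscale: the round/scale step, applied once to the sum. */
#define mad_f_macscale(acc) \
    ((mad_fixed_t)(((acc) + MAD_F_ROUND) >> MAD_F_SCALEBITS))

/* Original form: two rounded multiplies, then a 32-bit add. */
mad_fixed_t t6_mul(const mad_fixed_t X[18])
{
    return mad_f_mul(X[4], 0x0ec835e8L) + mad_f_mul(X[13], 0x061f78aaL);
}

/* MAC form: two 64-bit multiply/accumulates, then one round/scale. */
mad_fixed_t t6_mac(const mad_fixed_t X[18])
{
    mad_fixed64_t macreg = 0;

    mad_f_mac(macreg, X[4], 0x0ec835e8L);
    mad_f_mac(macreg, X[13], 0x061f78aaL);
    return mad_f_macscale(macreg);
}
```

Because the MAC form rounds once instead of once per product, its result can differ from the original form in the last bit; with two terms, as here, the difference is at most one LSB.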
I guess I still need to run some official bit streams through it, but the output sounds OK. This could certainly be the PowerPC optimization. I'll send a patch to anyone who would like to see it.
-- Dan