I have ported the library to our hardware platform (a set-top box with an IDE interface and an ARC8 processor), but I have found that we are running short of horsepower. The problem stems from the fact that when I use the multiplier, it stalls the chip for 10 cycles on every multiply:
    mul64 %r0, %r1
    lsl   %r0, [%mhi], 4      --> Stalls until multiplier finishes
I have started looking at reordering the code, since I can execute other instructions during these 10 cycles:
<from synth.c dct32:282>

    // t69 = t33 + t34;  t89 = MUL(t33 - t34, costab4);
    mad_f_mul_j( t33 - t34, costab4);   // We have 10 cycles to wait
    t69 = t33 + t34;                    // 2-3
    t70 = t35 - t36;                    // 2-3
    t89 = mad_f_mul_r();

    // t70 = t35 + t36;  t90 = MUL(t35 - t36, costab28);
    // t70 = t35 - t36;
    mad_f_mul_j( t70, costab28);        // We have 10 cycles to wait
    t70 = t35 + t36;                    // 2-3
    t113 = t69 + t70;                   // 2-3
    t71 = t37 - t38;                    // 2-3
    t90 = mad_f_mul_r();
This gives me savings of around 6 cycles per multiply, so we only lose 4 cycles.
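To make the reordering concrete, here is a minimal portable C sketch of the split multiply. The names mad_f_mul_j and mad_f_mul_r come from the snippet above, but their bodies here are assumptions: a static variable stands in for the multiplier's result register, and FRACBITS = 28 is assumed to match the library's fixed-point format. mul_blocking, struct frag and the two fragment_* functions are hypothetical helpers for illustration only.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t mad_fixed_t;   /* 32-bit fixed-point value */
#define FRACBITS 28            /* assumed number of fractional bits */

/* Reference: a single multiply that blocks until the product is ready.
 * Right-shifting a negative value is assumed to be arithmetic here. */
mad_fixed_t mul_blocking(mad_fixed_t x, mad_fixed_t y)
{
    return (mad_fixed_t)(((int64_t)x * y) >> FRACBITS);
}

/* Hypothetical stand-in for the hardware result register. */
static int64_t pending_product;

/* "Issue" the multiply; on real hardware this would start the multiplier. */
void mad_f_mul_j(mad_fixed_t x, mad_fixed_t y)
{
    pending_product = (int64_t)x * y;
}

/* Collect the result; on real hardware this is where any remaining stall occurs. */
mad_fixed_t mad_f_mul_r(void)
{
    return (mad_fixed_t)(pending_product >> FRACBITS);
}

struct frag { mad_fixed_t t69, t70, t89, t90, t113; };

/* Original ordering: each multiply is consumed immediately. */
struct frag fragment_blocking(mad_fixed_t t33, mad_fixed_t t34,
                              mad_fixed_t t35, mad_fixed_t t36,
                              mad_fixed_t c4, mad_fixed_t c28)
{
    struct frag r;
    r.t69  = t33 + t34;
    r.t89  = mul_blocking(t33 - t34, c4);
    r.t70  = t35 + t36;
    r.t90  = mul_blocking(t35 - t36, c28);
    r.t113 = r.t69 + r.t70;
    return r;
}

/* Reordered: independent adds/subtracts fill the gap between issue and collect. */
struct frag fragment_interleaved(mad_fixed_t t33, mad_fixed_t t34,
                                 mad_fixed_t t35, mad_fixed_t t36,
                                 mad_fixed_t c4, mad_fixed_t c28)
{
    struct frag r;
    mad_fixed_t diff;

    mad_f_mul_j(t33 - t34, c4);   /* issue first multiply */
    r.t69 = t33 + t34;            /* filler work, no multiplier use */
    diff  = t35 - t36;            /* operand for the next multiply */
    r.t89 = mad_f_mul_r();        /* collect first product */

    mad_f_mul_j(diff, c28);       /* issue second multiply */
    r.t70  = t35 + t36;
    r.t113 = r.t69 + r.t70;
    r.t90  = mad_f_mul_r();       /* collect second product */
    return r;
}
```

On the real hardware the two halves would presumably be inline assembly (issue mul64, read %mhi later). The sketch only shows that the interleaved ordering yields the same values as the blocking one, provided nothing between the issue and the collect starts another multiply.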
What I would like to know is which routine(s) I should convert first to save the maximum amount of time.
Joolz