I have ported the library to our hardware platform (a set-top box with an IDE interface and an ARC8 processor), but I have found that we are running short of horsepower. The problem is that using the multiplier stalls the chip for 10 cycles on every multiply:
    mul64 %r0, %r1
    lsl   %r0, [%mhi], 4    --> Stalls until multiplier finishes
I have started looking at reordering the code, since I can issue other instructions during these 10 cycles:
<from synth.c dct32:282>
    // t69 = t33 + t34;  t89 = MUL(t33 - t34, costab4);
    mad_f_mul_j( t33 - t34, costab4);   // We have 10 cycles to wait
    t69 = t33 + t34;                    // 2-3
    t70 = t35 - t36;                    // 2-3
    t89 = mad_f_mul_r();

    // t70 = t35 + t36;  t90 = MUL(t35 - t36, costab28);
    // t70 = t35 - t36;
    mad_f_mul_j( t70, costab28);        // We have 10 cycles to wait
    t70 = t35 + t36;                    // 2-3
    t113 = t69 + t70;                   // 2-3
    t71 = t37 - t38;                    // 2-3
    t90 = mad_f_mul_r();
This gives me savings of around 6 cycles per multiply, so we only lose 4 cycles.
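The reordering can be checked on a host machine before it is committed to assembly. Below is a minimal sketch, assuming Q28 fixed-point values and assuming that mad_f_mul_j() starts a multiply and mad_f_mul_r() collects its result. The bodies shown here are portable stand-ins of my own, not the real ARC versions, and mul_ref(), butterfly_out and the two step_* functions are hypothetical names used only for the comparison.

```c
#include <stdint.h>

/* Assumption: 32-bit Q28 fixed point, 28 fractional bits. */
typedef int32_t fixed_t;
#define FRACBITS 28

/* Reference one-shot multiply, standing in for MUL().
 * Relies on >> of a negative int64_t being an arithmetic shift,
 * which holds on the usual compilers but is implementation-defined. */
static fixed_t mul_ref(fixed_t x, fixed_t y)
{
    return (fixed_t)(((int64_t)x * y) >> FRACBITS);
}

/* Stand-in for the hardware multiplier's result register. */
static int64_t mul_pending;

/* Stand-in: start a multiply (on the target this would issue mul64). */
static void mad_f_mul_j(fixed_t x, fixed_t y)
{
    mul_pending = (int64_t)x * y;
}

/* Stand-in: collect the result (on the target this would read %mhi). */
static fixed_t mad_f_mul_r(void)
{
    return (fixed_t)(mul_pending >> FRACBITS);
}

typedef struct { fixed_t t69, t70, t89, t90, t113; } butterfly_out;

/* The statements in their original order. */
static butterfly_out step_original(fixed_t t33, fixed_t t34,
                                   fixed_t t35, fixed_t t36,
                                   fixed_t costab4, fixed_t costab28)
{
    butterfly_out o;
    o.t69 = t33 + t34;  o.t89 = mul_ref(t33 - t34, costab4);
    o.t70 = t35 + t36;  o.t90 = mul_ref(t35 - t36, costab28);
    o.t113 = o.t69 + o.t70;
    return o;
}

/* The same statements, interleaved so independent adds fill the wait. */
static butterfly_out step_reordered(fixed_t t33, fixed_t t34,
                                    fixed_t t35, fixed_t t36,
                                    fixed_t costab4, fixed_t costab28)
{
    butterfly_out o;
    fixed_t t70;

    mad_f_mul_j(t33 - t34, costab4);   /* multiply in flight */
    o.t69 = t33 + t34;                 /* independent work */
    t70   = t35 - t36;                 /* operand for the next multiply */
    o.t89 = mad_f_mul_r();

    mad_f_mul_j(t70, costab28);        /* multiply in flight */
    t70    = t35 + t36;                /* t70 now holds its final value */
    o.t113 = o.t69 + t70;
    o.t90  = mad_f_mul_r();

    o.t70 = t70;
    return o;
}
```

Calling both functions with the same inputs and comparing the fields confirms that the interleaving only changes the order of independent operations. It says nothing about the cycle counts, which still have to be measured on the target.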
What I would like to know is which routine(s) I should convert first to save the maximum amount of time.
Joolz
Hi Joolz!
You have a BIG problem :) A 10-cycle stall on every mul. The most "multiplication consuming" functions are:
- IMDCT (actually it isn't too much)
- synthesis filter

The last one has a lot of multiplications; all the others are relatively free of multiply operations. Maybe stereo processing has a few.
If you are short of calculation power, I can recommend doing the same as I did: I separated out the calculation functions and implemented them in asm. I can send you the source code of MAD with the improvements that I used for
- tms320vc55xx and
- sp3R5 3DSP cores.
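To illustrate what separating a calculation function can look like, here is a minimal sketch and not the code I mentioned. It assumes Q28 fixed point, and every name in it (dot_q28_c, dot_q28_asm, HAVE_DOT_Q28_ASM) is hypothetical. The idea is to keep a portable C kernel as the reference and switch to a hand-written asm routine at build time.

```c
#include <stddef.h>
#include <stdint.h>

typedef int32_t fixed_t;   /* assumption: Q28 fixed point */

/* Portable reference kernel: sum of Q28 products, the kind of
 * multiply-accumulate loop a synthesis filter spends its time in. */
static fixed_t dot_q28_c(const fixed_t *a, const fixed_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int64_t)a[i] * b[i];
    return (fixed_t)(acc >> 28);
}

#if defined(HAVE_DOT_Q28_ASM)
/* Hypothetical hand-written version, provided by a separate .s file. */
fixed_t dot_q28_asm(const fixed_t *a, const fixed_t *b, size_t n);
# define dot_q28 dot_q28_asm
#else
# define dot_q28 dot_q28_c
#endif
```

The rest of the decoder calls only dot_q28(), so the asm version can be compared against the C one on the same input before it replaces it.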