Hi Joolz!
You have a BIG problem :) A 10-cycle stall on every mul is a lot. The most "multiplication consuming" functions are:
- IMDCT (actually it isn't too much)
- synthesis filter
The last one has a lot of multiplications; all the others are relatively free of multiply operations. Maybe stereo processing has a few.
If you are short of calculation power, I can recommend doing the same as I did: I separated out the calculation functions and implemented them in asm. I can send you the source code of MAD with the improvements I used for the tms320vc55xx and sp3R5 3DSP cores.
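
To give a rough idea of what "separating the calculation functions" can look like, here is a minimal sketch in C. It is not the actual MAD code or my asm port. The names (fixed_t, FRACBITS, dot_product_c, USE_ASM_KERNEL, dot_product_asm) are made up for illustration, and it assumes a 32-bit fixed-point format with 28 fractional bits. The point is only that the multiply-accumulate loop sits behind one function, so a hand-written asm version can replace it without touching the rest of the decoder.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-bit fixed point with 28 fractional bits (hypothetical names). */
#define FRACBITS 28
typedef int32_t fixed_t;

/*
 * Portable C reference for the multiply-heavy kernel: a fixed-point dot
 * product, the kind of inner loop a synthesis filter spends its time in.
 * Products are accumulated in 64 bits and scaled back once at the end, so
 * there is one shift per call instead of one per multiply.
 * Note: >> on a negative int64_t is implementation-defined in C, although
 * common compilers implement it as an arithmetic shift.
 */
fixed_t dot_product_c(const fixed_t *a, const fixed_t *b, size_t n)
{
    int64_t acc = 0;
    size_t i;

    for (i = 0; i < n; ++i)
        acc += (int64_t)a[i] * b[i];

    return (fixed_t)(acc >> FRACBITS);
}

/*
 * Callers use dot_product() only. Building with USE_ASM_KERNEL (a made-up
 * switch) routes the call to a hand-written routine with the same signature,
 * which would live in a separate .asm/.s file for the target core.
 */
#ifdef USE_ASM_KERNEL
fixed_t dot_product_asm(const fixed_t *a, const fixed_t *b, size_t n);
#define dot_product dot_product_asm
#else
#define dot_product dot_product_c
#endif
```

Keeping the C version around is useful for checking the asm routine: both can be run on the same input and the results compared.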