A little help needed on code optimisation to deal with delay slots - mad-dev

1 Jul 2004


I have ported the library to our hardware platform (a set-top box with
an IDE interface and an ARC8 processor), but I have found that we are
running short of horsepower. The problem stems from the fact that when I
use the multiplier it stalls the chip for 10 cycles on every multiply:
    mul64 %r0, %r1
    lsl %r0, [%mhi], 4    --> stalls until the multiplier finishes
I have started looking at reordering the code, as I can use other
instructions in these 10 cycles:
<from synth.c dct32:282>
//  t69  = t33 + t34;  t89  = MUL(t33 - t34, costab4);
    mad_f_mul_j(t33 - t34, costab4);   // We have 10 cycles to wait
    t69  = t33 + t34;                  // 2-3
    t70  = t35 - t36;                  // 2-3
    t89  = mad_f_mul_r();
//  t70  = t35 + t36;  t90  = MUL(t35 - t36, costab28);
//  t70  = t35 - t36;
    mad_f_mul_j(t70, costab28);        // We have 10 cycles to wait
    t70  = t35 + t36;                  // 2-3
    t113 = t69 + t70;                  // 2-3
    t71  = t37 - t38;                  // 2-3
    t90  = mad_f_mul_r();
This gives me savings of around 6 cycles per multiply, so we only lose 4
cycles.
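
For anyone who wants to try the same reordering off the target, here is a
minimal portable sketch of a split multiply. The names mad_f_mul_j and
mad_f_mul_r come from the snippet above, but the bodies here are an
assumption: plain C that models "issue" and "collect" with a static
64-bit value, not the ARC inline assembly. It also assumes the default
libmad fixed-point format with 28 fractional bits. The butterfly() helper
is hypothetical and only shows the interleaving pattern.

```c
#include <stdint.h>

/* Assumption: 32-bit fixed point with 28 fractional bits. */
typedef int32_t mad_fixed_t;
#define FRACBITS 28

/* Stands in for the hardware multiplier result registers. */
static int64_t pending_product;

/* Issue the multiply; on the real chip this would start the multiplier. */
static inline void mad_f_mul_j(mad_fixed_t x, mad_fixed_t y)
{
    pending_product = (int64_t)x * (int64_t)y;
}

/* Collect the result; on the real chip this is where a stall can occur.
 * Relies on arithmetic right shift for negative values, which mainstream
 * compilers provide but the C standard leaves implementation-defined. */
static inline mad_fixed_t mad_f_mul_r(void)
{
    return (mad_fixed_t)(pending_product >> FRACBITS);
}

/* Hypothetical helper: one sum/scaled-difference pair, with the
 * independent addition placed between issue and collect. */
static void butterfly(mad_fixed_t a, mad_fixed_t b, mad_fixed_t c,
                      mad_fixed_t *sum, mad_fixed_t *scaled_diff)
{
    mad_f_mul_j(a - b, c);        /* start (a - b) * c */
    *sum = a + b;                 /* independent work fills the wait */
    *scaled_diff = mad_f_mul_r(); /* fetch the product */
}
```

Because only one product is pending at a time in this model, every
mad_f_mul_j must be matched by a mad_f_mul_r before the next multiply is
issued, which is the same constraint the reordered dct32 code follows.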
What I would like to know is which routine(s) I should convert first to
save the maximum amount of time?
Joolz