I would like to have a look please, Grigory. I have already started the port, changing

    // MLO( x, y);
    ld    %r0, [x]
    ld    %r1, [y]
    mul64 %r0, %r1
    ld    %r2, [%mhi]

    // MLA( x1, y1);
    ld    %r0, [x1]
    ld    %r1, [y1]
    mul64 %r0, %r1
    add   %r2, [%mhi]

    // MLA( x2, y2);
    ld    %r0, [x2]
    ld    %r1, [y2]
    mul64 %r0, %r1
    add   %r2, [%mhi]

into

    ld    %r0, [x]
    ld    %r1, [y]
    mul64 %r0, %r1
    ld    %r0, [x1]
    ld    %r1, [y1]
    ld    %r2, [%mhi]
    mul64 %r0, %r1
    ld    %r0, [x2]
    ld    %r1, [y2]
    add   %r2, [%mhi]
    mul64 %r0, %r1

This saves 2 cycles per multiply, so only another 8 to go.
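The reordering above is a form of software pipelining: the operands for the next multiply are loaded while the current one is still in flight. A minimal C sketch of the same idea follows, assuming the accumulator sums the high 32 bits of each 64-bit product (as the `%mhi` reads suggest) and that the sum fits in 32 bits. The names `mul_hi`, `mac_hi_naive` and `mac_hi_pipelined` are hypothetical and are not taken from MAD.

```c
#include <stdint.h>
#include <stddef.h>

/* High 32 bits of a signed 32x32 -> 64-bit product (assumed to be what
   %mhi holds). Relies on arithmetic right shift of negative values,
   which is implementation-defined in C but usual in practice. */
static int32_t mul_hi(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * (int64_t)y) >> 32);
}

/* Straightforward form: load, multiply, consume the result, repeat. */
int32_t mac_hi_naive(const int32_t *x, const int32_t *y, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += mul_hi(x[i], y[i]);
    return acc;
}

/* Pipelined form: the loads for element i are written between starting
   the multiply for element i-1 and consuming its result, mirroring the
   reordered assembly. */
int32_t mac_hi_pipelined(const int32_t *x, const int32_t *y, size_t n)
{
    if (n == 0)
        return 0;

    int32_t acc = 0;
    int32_t a = x[0];
    int32_t b = y[0];

    for (size_t i = 1; i < n; i++) {
        int32_t hi = mul_hi(a, b); /* multiply the current operands */
        a = x[i];                  /* load the next operands meanwhile */
        b = y[i];
        acc += hi;                 /* consume the finished product */
    }
    acc += mul_hi(a, b);           /* drain the last product */
    return acc;
}
```

Both functions return the same value; a C compiler is free to reorder the loads itself, so the sketch only illustrates the schedule that the hand-written assembly makes explicit.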
Best Regards
Julian Gardner
RSD Communications Ltd
-----Original Message-----
From: Grigory A. [mailto:Ryhor@tut.by]
Sent: Friday, July 02, 2004 2:34 AM
To: Joolz [RSD]
Cc: mad-dev@lists.mars.org
Subject: Re: [mad-dev] A little help needed on code optimisation to deal with delay slots
Hi Joolz!
You have a BIG problem :) A 10-cycle stall on every mul. The most multiplication-heavy functions are
- IMDCT (actually not too many there)
- synthesis filter
The last one has a lot of multiplications; all the others are relatively free of multiply operations. Maybe stereo processing has a few.
If you are short of computing power, I can recommend doing the same as I did: I separated out the calculation functions and implemented them in asm. I can send the source code of MAD with the improvements that I used for
- tms320vc55xx and
- sp3R5 3DSP cores.
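One way to picture this kind of separation, as a minimal sketch only: the multiply-heavy kernels sit behind a small C interface with a portable fallback, and a port replaces the fallback with hand-written assembly. All names here (`HAVE_ASM_KERNELS`, `kernel_dot_hi`, `window_sample`) are hypothetical and are not taken from MAD or from the code mentioned above.

```c
#include <stdint.h>
#include <stddef.h>

/* Kernel interface: sum of the high 32 bits of 32x32-bit products.
   A port would supply this symbol from an assembly file and define
   HAVE_ASM_KERNELS (a hypothetical macro) to drop the C fallback. */
int32_t kernel_dot_hi(const int32_t *x, const int32_t *y, size_t n);

#ifndef HAVE_ASM_KERNELS
/* Portable C fallback, used when no assembly version is provided.
   Assumes arithmetic right shift and a sum that fits in 32 bits. */
int32_t kernel_dot_hi(const int32_t *x, const int32_t *y, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += ((int64_t)x[i] * (int64_t)y[i]) >> 32;
    return (int32_t)acc;
}
#endif

/* Caller side: the decoder logic stays in C and only calls the kernel,
   so it does not change when the kernel is swapped for assembly. */
int32_t window_sample(const int32_t *samples, const int32_t *window,
                      size_t taps)
{
    return kernel_dot_hi(samples, window, taps);
}
```

The point of the split is that only the few kernel functions need rewriting per target, while the rest of the decoder stays as portable C.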
--
Best regards,
Grigory mailto:Ryhor@tut.by
Hi Julian!
I'll do it tomorrow. In my code there are 7 to 9 asm functions, which you will have to implement in the best way for your hardware. That way the decoder can run on almost any hardware at close to maximum efficiency. There may be one more thing to optimize, Huffman decoding and frequency reordering, which in my case are still in C. So I don't optimize just the multiply function, but each whole calculation-intensive function. One by one :).
Friday, July 2, 2004, 7:57:06 PM, you wrote:

JR> I would like to have a look please Gregory. I have already started the
JR> port

JR> // MLO( x, y);
JR> ld %r0, [x]
JR> ld %r1, [y]
JR> mul64 %r0, %r1
JR> ld %r2, [%mhi]

JR> // MLA( x1, y1);
JR> ld %r0, [x1]
JR> ld %r1, [y1]
JR> mul64 %r0, %r1
JR> add %r2, [%mhi]

JR> // MLA( x2, y2);
JR> ld %r0, [x2]
JR> ld %r1, [y2]
JR> mul64 %r0, %r1
JR> add %r2, [%mhi]

JR> Into

JR> ld %r0, [x]
JR> ld %r1, [y]
JR> mul64 %r0, %r1
JR> ld %r0, [x1]
JR> ld %r1, [y1]
JR> ld %r2, [%mhi]
JR> mul64 %r0, %r1
JR> ld %r0, [x2]
JR> ld %r1, [y2]
JR> add %r2, [%mhi]
JR> mul64 %r0, %r1

JR> Saves 2 cycles per multiply, only another 8 to go
--
Best regards,
Grigory mailto:Ryhor@tut.by