MAD 0.12.4b is now available.
This version incorporates a number of performance improvements for all platforms.
It also includes (untested) native fixed-point math support for the PowerPC platform, contributed by David Blythe. Since I've made some significant changes to the way the fixed-point math routines are maintained, I'd appreciate feedback from anyone using MAD on a PPC to make sure I haven't broken anything.
Finally, this version changes the way mixed short blocks are handled in Layer III with respect to alias reduction, correcting another apparent ambiguity in ISO/IEC 11172-3. Since mixed blocks are rare (many encoders don't support them) this has until now gone unnoticed. Thanks to H.O. for reporting this.
The source code release is here:
ftp://ftp.mars.org/pub/mpeg/
Winamp users should notice improved performance in the MAD plug-in:
http://www.mars.org/home/rob/proj/mpeg/mad-plugin/
The MAD home page is, of course:
http://www.mars.org/home/rob/proj/mpeg/
Cheers,
Rob Leslie wrote:
MAD 0.12.4b is now available.
[ ... ]
It also includes (untested) native fixed-point math support for the PowerPC platform, contributed by David Blythe. Since I've made some significant changes to the way the fixed-point math routines are maintained, I'd appreciate feedback from anyone using MAD on a PPC to make sure I haven't broken anything.
Rob,
Another NetBSD person with a Mac PowerBook (a 400MHz G3 PPC CPU) has reported no change in CPU utilisation at all between 0.12.3b and 0.12.4b: both before and after, he was getting about 12.5-13% CPU utilisation decoding a 128kbps file.
I don't really have any more details than that, this is more a "for your info" mail. He is able to test patches and whatnot and may be able to provide an account for testing if you want/need it.
On my 233MHz StrongARM I saw a reduction in CPU utilisation from 37% with 0.12.3b to 21% with 0.12.4b decoding the same 192kbps mp3. Here's the output of tcsh's "time" command for each:
before: 78.704u 3.579s 3:40.28 37.3% 0+0k 0+0io 162pf+0w
after:  43.241u 3.788s 3:40.12 21.3% 0+0k 0+0io 160pf+0w
Simon.
--
Simon Burge  simonb@wasabisystems.com
NetBSD CDs, Support and Service: http://www.wasabisystems.com/
Simon Burge wrote:
Another NetBSD person with a Mac PowerBook (a 400MHz G3 PPC CPU) has reported no change in CPU utilisation at all between 0.12.3b and 0.12.4b: both before and after, he was getting about 12.5-13% CPU utilisation decoding a 128kbps file.
I don't really have any more details than that, this is more a "for your info" mail. He is able to test patches and whatnot and may be able to provide an account for testing if you want/need it.
Could you put me in touch with this person?
On my 233MHz StrongARM I saw a reduction in CPU utilisation from 37% with 0.12.3b to 21% with 0.12.4b decoding the same 192kbps mp3. Here's the output of tcsh's "time" command for each:
before: 78.704u 3.579s 3:40.28 37.3% 0+0k 0+0io 162pf+0w
after:  43.241u 3.788s 3:40.12 21.3% 0+0k 0+0io 160pf+0w
That's pretty nice. Which StrongARM CPU is that?
On my 220 MHz SA 1100, usage is also now at about 49 MHz.
This weekend I'll be concentrating on a rewrite of the Layer III Huffman decoding and requantization for hopefully even more performance gains.
Cheers, -rob
Rob Leslie wrote:
Simon Burge wrote:
On my 233MHz StrongARM I saw a reduction in CPU utilisation from 37% with 0.12.3b to 21% with 0.12.4b decoding the same 192kbps mp3. Here's the output of tcsh's "time" command for each:
before: 78.704u 3.579s 3:40.28 37.3% 0+0k 0+0io 162pf+0w
after:  43.241u 3.788s 3:40.12 21.3% 0+0k 0+0io 160pf+0w
That's pretty nice. Which StrongARM CPU is that?
Ahh, a SA 110.
On my 220 MHz SA 1100, usage is also now at about 49 MHz.
This weekend I'll be concentrating on a rewrite of the Layer III Huffman decoding and requantization for hopefully even more performance gains.
Cool!
Simon.
--
Simon Burge  simonb@wasabisystems.com
NetBSD CDs, Support and Service: http://www.wasabisystems.com/
Rob Leslie wrote:
It also includes (untested) native fixed-point math support for the PowerPC platform, contributed by David Blythe.
It's about as good as the all 'C' version I sent you before :-). The comment about 4xx is misleading. If it was really done for the IBM 4xx processors, it would be using the MAC that it contains, not just standard PowerPC assembler fixed point multiply.
I haven't done any detailed evaluation of the data output. It sounds OK on the familiar bit streams I use for testing.
-- Dan
Dan Malek wrote:
Rob Leslie wrote:
It also includes (untested) native fixed-point math support for the PowerPC platform, contributed by David Blythe.
It's about as good as the all 'C' version I sent you before :-). The comment about 4xx is misleading. If it was really done for the IBM 4xx processors, it would be using the MAC that it contains, not just standard PowerPC assembler fixed point multiply.
Perhaps it depends on what compiler you are using (I'm using gcc 2.95.2). The FPM_64BIT option results in reasonable code, but gcc uses a 3-instruction sequence to do the reduction from 64 to 32 bits when it could be done with fewer. I didn't see any suitable 16-bit MAC instruction sequence to do what the MAD_F_MLA macro does on the 405gp, but if you suggest one I'll try it.
My goal was to make the madplay distribution use less CPU on our 405 while trying not to lose accuracy. It went from ~40% to 25% with the simple changes I made to the mad-0.11.4b distribution. Others' experience may differ; for example, the changes I made have a much smaller impact when the SSO optimizations are enabled. The comments about 4xx are there in case anyone was thinking that 64-bit or AltiVec instructions would have been more effective.
Measuring on a 200MHz 405 with gcc 2.95.2 and -O3, I get the following CPU utilization:
FPM_DEFAULT            20% (low accuracy)
FPM_PPC                24%
FPM_64BIT              35%
FPM_64BIT + OPT_SSO    26%
regards,
david
David Blythe wrote:
measuring on a 200Mhz 405 with gcc 2.95.2, -O3, i get the following cpu utilization
FPM_DEFAULT            20% (low accuracy)
FPM_PPC                24%
FPM_64BIT              35%
FPM_64BIT + OPT_SSO    26%
David, I'm curious: do you get better performance with -O3 than with the default set of optimization flags?
Cheers, -rob
Rob Leslie wrote:
David Blythe wrote:
measuring on a 200Mhz 405 with gcc 2.95.2, -O3, i get the following cpu utilization
FPM_DEFAULT            20% (low accuracy)
FPM_PPC                24%
FPM_64BIT              35%
FPM_64BIT + OPT_SSO    26%
David, I'm curious: do you get better performance with -O3 than with the default set of optimization flags?
Seems like a fair question. I got in the habit of using -O3 with the 0.11.xx release before you added the extra options and never broke the habit. I gave the default options a try today and it broke the cross compiler we are using :(
powerpc-linux-gcc -DHAVE_CONFIG_H -I. -I. -I. -DFPM_PPC -Wall -g -O -fforce-mem -fforce-addr -finline-functions -fstrength-reduce -fthread-jumps -fcse-follow-jumps -fcse-skip-blocks -fexpensive-optimizations -fregmove -fschedule-insns2 -c frame.c -o frame.o
frame.c: In function `mad_frame_mute':
frame.c:496: Internal compiler error in `make_edges', at flow.c:967
Please submit a full bug report.
See URL:http://www.gnu.org/software/gcc/faq.html#bugreport for instructions.
It's the -fstrength-reduce option that seems to provoke the problem; however, removing just that option and plowing ahead, I get the following CPU utilization (I wouldn't call these particularly exact; they are easily +/-2%):
FPM_DEFAULT            19%
FPM_PPC                24%
FPM_PPC+SSO            21%
FPM_PPC+ACCURACY       25%
FPM_64BIT              27%
FPM_64BIT+SSO          23%
FPM_64BIT+ACCURACY     35%
There are definite improvements with the config default options. I also did a fairly unscientific look at the static instruction counts for synth.c. -O3 versus "config defaults" for FPM_PPC were about 2400 versus 2000 instructions (although, there wasn't really a noticeable improvement in the cpu utilization). FPM_64BIT was about 3800 versus 2300. I think the FPM_64BIT performance and accuracy could also be further improved by creating MLA macros to work for FPM_64BIT (just redefine mad_fixed64hi_t to be long long and accumulate in hi, ...)
So is anybody else having problems with -fstrength-reduce?
david
David Blythe wrote:
Seems like a fair question. I got in the habit of using -O3 with the 0.11.xx release before you added the extra options and never broke the habit. I gave the default options a try today and it broke the cross compiler we are using :(
You need to be using the latest version of 2.95.3. I'm using: gcc version 2.95.3 20010101 (prerelease/franzo/20010101)
None of the -Ox optimizations give the performance of individual specifications you have created.
-- Dan
David Blythe wrote:
Perhaps it depends on what compiler you are using, (I'm using gcc 2.95.2). The FPM_64BIT option.........
I created an FPM_PPC that just used C code macros that generated code equivalent to the in-line assembler you created. I sent the patch to Rob, but I don't know where it went. I also worked on the first version of the layer 3 MLA that Rob then made more generic. I avoid writing in-line assembler when the compiler can do the same job with a couple of hints.
...... I didn't see any suitable 16-bit MAC instruction sequence to do what the MAD_F_MLA macro does on the 405gp,
The 405gp has about a dozen MAC instructions that may not map exactly into something for the MLA. At some point I will probably work on those. My only comment was about your comment.......you didn't do anything unique for the 4xx, you just used regular PowerPC instructions. If you did something unique for the 4xx it should use the MAC instructions.
You will also notice that the superscalar PowerPCs (pretty much anything but the IBM 4xx and MPC8xx) all process about the same speed regardless of FPM_64BIT, FPM_PPC, or other options used. My 400 MHz iMac (G3/750) runs between 10 and 13 percent depending upon options, my 500 MHz G4/7400 runs between 7 and 10 percent. On the other hand, the MPC8xx, where I started the modifications, will run between 60 and 100+ percent on a 66 MHz processor with EDO memory depending upon options chosen.
-- Dan
Rob Leslie wrote:
MAD 0.12.4b is now available.
This version incorporates a number of performance improvements for all platforms.
It also includes (untested) native fixed-point math support for the PowerPC platform, contributed by David Blythe. Since I've made some significant changes to the way the fixed-point math routines are maintained, I'd appreciate feedback from anyone using MAD on a PPC to make sure I haven't broken anything.
I did a sample-by-sample compare against the Intel implementation and (oops) found a bug (which I introduced). The add-with-carry sequence in the MLA code for the PPC was being too aggressively scheduled and the carry bit was being lost. The attached patch fixes it.
thanks
david
--- ../../mad-0.12.4b/libmad/fixed.h	Thu Feb  8 18:12:24 2001
+++ libmad/fixed.h	Fri Feb  9 20:52:37 2001
@@ -290,12 +290,10 @@
     ({ mad_fixed64hi_t __hi;  \
        mad_fixed64lo_t __lo;  \
        MAD_F_MLX(__hi, __lo, (x), (y));  \
-       asm ("addc %0, %1, %2"  \
-            : "=r" (lo)  \
-            : "%r" (__lo), "0" (lo));  \
-       asm ("adde %0, %1, %2"  \
-            : "=r" (hi)  \
-            : "%r" (__hi), "0" (hi));  \
+       asm ("addc %0, %2, %3\t\n"  \
+            "adde %1, %4, %5"  \
+            : "=r" (lo), "=r" (hi)  \
+            : "%r" (__lo), "0" (lo), "%r" (__hi), "1" (hi));  \
     })
 
 # if defined(OPT_ACCURACY)
@@ -306,7 +304,7 @@
  * tracking the magnitude of (1 << (MAD_F_SCALEBITS - 1)) is too
  * complicated.
  *
- * The __volatile__ improve the generated code by another 5% (fewer spills
+ * The __volatile__ improves the generated code by another 5% (fewer spills
  * to memory); eventually they should be removed.
  */
 # define mad_f_scale64(hi, lo)  \
On Fri, 9 Feb 2001, David Blythe wrote:
I did a sample-by-sample compare against the Intel implementation and (oops) found a bug (i introduced). The add with carry sequence in the MLA code for the ppc was being too aggressively scheduled and the carry bit was being lost. The attached patch fixes it.
[...]
asm ("addc %0, %2, %3\t\n" \
"adde %1, %4, %5" \
For a prettier assembly output, you might consider '\n\t' instead of '\t\n' in the line above.
Nicolas