Nicolas,
Well done for spotting the ARM 'Round while you shift' optimisation!
Originally in imdct_l_arm.S, I made a fairly arbitrary choice to
round in some places and just shift in others to try and balance
code-size/speed against accuracy. With your optimisation it's possible
to round everywhere with no penalty, so I guess it makes sense to do
so.
I've attached a new version to this email which hopefully should be
the most accurate version so far - it now rounds everywhere like
FPM_ARM and FPM_64BIT, but has the advantage over them of using 64bit
accumulators for the imdct part.
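For anyone following along, the difference between "just shift" and "round" can be sketched in C roughly as below. This is only an illustration and not the code in imdct_l_arm.S: the 28 fractional bits are assumed to match libmad's fixed-point format, the names mul_shift() and mul_round() are made up for the example, and it assumes the compiler implements >> on negative values as an arithmetic shift (gcc does).

```c
#include <stdint.h>

#define FRACBITS 28  /* assumed fixed-point format: 28 fractional bits */

/* Multiply two fixed-point values and just shift: the bits below the
   LSB of the result are discarded (rounds toward minus infinity). */
static int32_t mul_shift(int32_t x, int32_t y)
{
    int64_t product = (int64_t)x * y;

    return (int32_t)(product >> FRACBITS);
}

/* Same multiply, but add half an LSB before shifting so that the
   result is rounded to nearest. */
static int32_t mul_round(int32_t x, int32_t y)
{
    int64_t product = (int64_t)x * y;
    int64_t half = (int64_t)1 << (FRACBITS - 1);

    return (int32_t)((product + half) >> FRACBITS);
}
```

In C the rounding costs an extra add. As far as I understand it, the ARM trick is that a flag-setting shift leaves the last bit shifted out in the carry flag, so a following add-with-carry picks up the rounding for free.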
Rob,
Just a small tweak: if ASO_IMDCT is defined, the window_l[] table in
layer3.c doesn't need to be included (a saving of 144 bytes),
as imdct_l_arm.S already contains its own copy.
Andre
--
____________________________________________________________
Do You Yahoo!?
Get your free @yahoo.co.uk address at http://mail.yahoo.co.uk
or your free @yahoo.ie address at http://mail.yahoo.ie
I put a new release of MAD on the FTP site. It's a day later than promised,
but I wanted to incorporate some patches Nicolas sent me.
This version has many changes. The highlights are:
libmad changes:
* Incorporated Nicolas Pitre's ARM assembly and parameterized scaling
changes.
* Incorporated Andre McCurdy's ARM assembly optimization (used only if
--enable-aso is given to `configure' to enable architecture-specific
optimizations.)
* Reduced FPM_INTEL assembly to two instructions.
* Fixed accuracy problems with certain FPM modes in synth.c.
* Improved the accuracy of FPM_APPROX.
* Improved the accuracy of SSO.
* Added experimental rules for generating a libmad.so shared library.
madplay changes:
* PCM output is now dithered for better audio quality.
* Added a resampling feature for unsupported output sampling frequencies.
* Improved the OSS output module by falling back on 8-bit format if 16-bit
is not available, and by using native 16-bit endianness.
* Added a dual channel output selection option.
The ARM changes produced a favorable effect on the accuracy of the output from
libmad on ARM processors; ARM output is no longer the same as Intel output and
instead now matches the 64bit output, but with better performance:
              default        with SSO       with ASO       with SSO+ASO
FPM_APPROX    6.800e-05/L    6.775e-05/L    6.431e-05/L    6.431e-05/L
FPM_64BIT     5.580e-08/F    1.007e-05/L    5.652e-08/F    1.008e-05/L
FPM_ARM       5.580e-08/F    1.007e-05/L    5.652e-08/F    1.008e-05/L
FPM_INTEL     9.000e-08/F    1.008e-05/L
/F means full compliance, /L means limited accuracy, /N means not compliant.
Perhaps the most significant change to `madplay' is the addition of PCM output
dithering. This is an alternative to ordinary rounding that improves the audio
quality by reducing the negative effects of quantization noise. Dithering is
commonly used in professional mastering when reducing studio-quality audio to
16 bits for pressing onto a CD. Since MAD produces PCM samples with >24 bits,
dithering is a good idea.
The dithered output sounds less "grainy" than non-dithered, although this is
easier to perceive with 8-bit output than with 16. Best of all, however,
dithering effectively increases the precision of the output because it allows
bits below the LSB to be heard.
There are many possible dithering strategies, but the chosen algorithm is
fairly simple: it merely involves keeping the cumulative quantization error
less than the significance of the LSB. The effect of this is for the LSB to
modulate in proportional agreement with the bits below it.
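One way to read the description above is as simple error feedback: whatever is discarded when quantizing one sample is added to the next. The sketch below is written under that assumption and is not taken from the madplay source. The struct and function names are hypothetical, samples are assumed to be fixed-point with 28 fractional bits, `bits` is assumed to be between 2 and 29, and >> on negative values is assumed to be an arithmetic shift.

```c
#include <stdint.h>

#define FRACBITS 28  /* assumed: fixed-point samples, 28 fractional bits */

/* Hypothetical per-channel state: the quantization error carried over
   from the previous sample. */
struct dither {
    int32_t error;
};

/* Reduce one fixed-point sample (nominally in [-1.0, 1.0)) to a signed
   integer of `bits` bits, feeding the discarded low bits forward. */
static int32_t dither_quantize(int32_t sample, unsigned int bits,
                               struct dither *state)
{
    unsigned int shift = FRACBITS + 1 - bits;    /* bits to discard */
    int32_t mask = ((int32_t)1 << shift) - 1;
    int64_t max = ((int64_t)1 << FRACBITS) - 1;  /* just below +1.0 */
    int64_t min = -((int64_t)1 << FRACBITS);     /* exactly -1.0 */
    int64_t acc;
    int32_t value;

    /* add the error left over from the previous sample */
    acc = (int64_t)sample + state->error;

    /* clip to the representable output range */
    if (acc > max)
        acc = max;
    if (acc < min)
        acc = min;
    value = (int32_t)acc;

    /* the bits dropped now are carried into the next sample; they are
       always in [0, 1 LSB), so the running error stays below one LSB */
    state->error = value & mask;

    return value >> shift;
}
```

Feeding a constant input of half an output LSB (1 << 12 when reducing to 16 bits) through this makes the output alternate between 0 and 1, so the average still carries the value that plain rounding or truncation would lose.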
The only significant drawback with dithering is that it hinders an analytic
examination of the output, such as compliance testing. Therefore, it can be
turned off with a -d option to `madplay'.
As always, the release can be found here:
ftp://ftp.mars.org/pub/mpeg/
Cheers,
-rob
Rob,
Interesting.
If you have them, I would also be interested in figures for MAD which
compare multiplies using FPM_64BIT, a version which truncates the
LSBit (eg FPM_ARM), and FPM_APPROX.
Thanks
Andre
I thought people might be interested in some results I'm putting together of
tests for MPEG audio decoder accuracy.
These are the same tests I wrote about some time ago, but now I have results
for a much wider range of decoders, and across all three layers. MAD is still
showing well, and is in fact the only decoder I am aware of that will produce
24-bit output.
The results are interesting:
http://www.mars.org/home/rob/proj/mpeg/compliance/
Worth noting is that MAD is the only decoder in the class of integer decoders
that can produce fully compliant output.
Does anyone have any suggestions for decoders to test that are not listed?
I would like to test the ARM decoder somehow, but I don't know if I can get
access to it.
Cheers,
-rob
Rob + others,
Please find attached a version of the layer3 III_imdct_l() function
I've written in ARM assembler.
I've been messing around with it for a while, mainly as an exercise
to learn ARM assembler, but hopefully the end result is worth
sharing.
Performance wise, it should be quite a bit faster than the current C
version (and slightly more accurate as well since the
multiply-accumulate steps accumulate into 64bits, then round back to
32bits only when finished).
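To illustrate the accuracy point in C (a sketch only, not the attached assembler or the existing III_imdct_l() code): the two hypothetical functions below compute the same sum of products, one rounding every product back to 32 bits before adding, the other keeping a 64-bit accumulator and rounding once at the end. The 28 fractional bits are an assumption, as is an arithmetic >> on negative values.

```c
#include <stddef.h>
#include <stdint.h>

#define FRACBITS 28  /* assumed fixed-point format: 28 fractional bits */

/* Round a 64-bit product (or sum of products) back to 32-bit fixed point. */
static int32_t round_to_fixed(int64_t value)
{
    return (int32_t)((value + ((int64_t)1 << (FRACBITS - 1))) >> FRACBITS);
}

/* Every product is rounded to 32 bits before it is added, so each term
   contributes its own rounding error. */
static int32_t mac_round_each(const int32_t *x, const int32_t *w, size_t n)
{
    int32_t sum = 0;
    size_t i;

    for (i = 0; i < n; ++i)
        sum += round_to_fixed((int64_t)x[i] * w[i]);

    return sum;
}

/* Products are summed at full 64-bit precision and rounded only once. */
static int32_t mac_accumulate64(const int32_t *x, const int32_t *w, size_t n)
{
    int64_t acc = 0;
    size_t i;

    for (i = 0; i < n; ++i)
        acc += (int64_t)x[i] * w[i];

    return round_to_fixed(acc);
}
```

With four terms that are each a quarter of an LSB, the first version returns 0 because every term rounds away, while the second returns 1.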
Unfortunately, I don't actually have any ARM based hardware that will
play audio, so it's only been tested standalone with a small range of
test cases on the 'armulator' ARM simulator.
Any feedback (especially overall performance) or bug reports from
anyone actually able to test it for real would be appreciated.
It assembles for me (using gcc v2.95.2) with just:
arm-elf-gcc -c arm_III_imdct_l.S
(making sure that the extension is .S rather than .s to cause gcc to
run it through the C pre-processor).
I'd appreciate some feedback, even if the performance increase isn't
big enough to bother including it in future releases.
Andre
I made a new release of MAD (0.11.1b) with the following changes:
- the libmad code is now in a separate directory
- the robustness of the Win32 audio output module is much improved
- the SSO is now disabled by default, as output accuracy is deemed to
be more important than speed in the general case
- a bug in the Layer III sanity checking was fixed that could have caused
a crash on certain random data input
- the Layer III requantization table was extended from 8191 to 8206 values
as some encoders are known to use these values, even though ISO/IEC
11172-3 suggests the maximum should be 8191 (and I couldn't convince
anyone on the LAME mailing list that this could be a bug in LAME)
- a short man page for madplay was added
- a new `madtime' program (not yet built or installed by default) accurately
calculates average bitrate and playing time for any file, including VBR
- a new experimental multi-stream mixer `madmix' was added
(pass --enable-experimental to configure to add -x option support for this
to madplay)
The experimental mixer code is designed to minimize CPU involvement in
decoding multiple bitstreams; subband synthesis is performed only once after
all the mixing has taken place on the intermediate decoded data.
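The idea relies on subband synthesis being a linear operation, so summing the subband samples first and synthesizing once gives the same result as synthesizing each stream and summing the PCM. A rough C sketch of the mixing step follows. It is not the madmix.c code: the function name, the flat array layout, and the fixed-point gain (1 << 28 meaning 100%) are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define FRACBITS 28  /* assumed fixed-point format: 28 fractional bits */

/* Mix `nstreams` blocks of subband samples into `out`.
   in[s] points to the `nsamples` subband samples of stream s, and
   gain[s] is that stream's scale factor (1 << FRACBITS means 100%).
   Subband synthesis would then be run once, on `out` only. */
static void mix_subbands(int32_t *out, const int32_t *const *in,
                         const int32_t *gain, size_t nstreams,
                         size_t nsamples)
{
    size_t s, i;

    for (i = 0; i < nsamples; ++i) {
        int64_t acc = 0;

        /* accumulate the scaled samples at 64-bit precision */
        for (s = 0; s < nstreams; ++s)
            acc += (int64_t)in[s][i] * gain[s];

        out[i] = (int32_t)(acc >> FRACBITS);
    }
}
```

Nothing here guards against the sum exceeding full scale, so with every stream at 100% a real mixer would still need to clip or scale the result somewhere.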
Here's an example usage:
madmix <(madplay -Qx one.mp3) <(madplay -Qx two.mp3)
Any number of input streams can be given on the command line to be mixed. You
can also use the same -o option as for `madplay'. If your shell doesn't
support process substitution with named pipes, you'll have to mess around and
make them yourself.
Currently the mix is fixed at 100% for all streams, but this can be adjusted
on line 330 of madmix.c. I think there's potential for command-line or
file-based configuration to further make this useful, or possibly even a GUI.
As always, the release can be found here:
ftp://ftp.mars.org/pub/mpeg/
Cheers,
-rob
> Have you compared your optimised version of 0.10.0b to what's
> now in 0.11.0b ?
> I'm curious what your benchmarking will reveal.. when I get a
> chance I'm definitely going to take a closer look at it.
I just compared the 0.11.0b version of imdct36() with mine:
gcc -O1 : 1329 clocks (asm mul_f), 1529 (C mul_f)
gcc -O2 : 1607 clocks (asm mul_f), 2176 (C mul_f)
gcc -O3 : 1608 clocks (asm mul_f), 2186 (C mul_f)
All of which beat my best of 2215 clocks. It could be closer on
an embedded processor with a smaller cache though.
The raw output isn't identical to the older version, but I guess
this is just different rounding errors (?).
My x86 assembler knowledge isn't too good so I haven't really looked
at why gcc seems to be so much worse with optimisations above -O1.
As well as being slower, code size almost doubles e.g. 10448 bytes
(-O3) against 5391 (-O1) for the latest imdct36() using a C mul_f,
so maybe it's a problem with optimisation in my version of gcc (the
default egcs-2.91.66 installed with RH6.2).
Does arm-elf-gcc behave in the same way?
> > On a different subject, does anyone have access to an ARM platform
> > for testing ?? I would be interested to know how the MAD code
> > (with and without my changes) compares to ARM's own mp3 decode
> > library (which claims to use only 29 MHz of cpu bandwidth for real
> > time decode on an ARM7 core).
>
> I have access to a StrongARM 1100 and a 110, but not to anything
> else. I don't think I have access to ARM's MP3 decoding library, so
> I can't do any comparisons against it. However, I can evaluate your
> improvements to MAD alone on the StrongARM.
ARM's MP3 decode library (see http://www.arm.com/SoftSys/as022.html )
isn't available for free so it might be difficult to get hold of a
copy, but if the claims are true (ie 20 MHz cpu bandwidth for
realtime decode on a StrongARM using 27k of code) it looks like being
quite a good benchmark to compare MAD against... Is it anywhere near
yet ??
If not (to answer lwong's questions...) it's probably due to:
1) ARM's library being written with all critical sections in assembler
   by ARM engineers who know the architecture inside out.
2) ARM's library may use more approximate calculations in some places.
3) Any code compiled from C will have used ARM's own C compiler, which
   from what I've heard seems to be at least 20 percent better than gcc.
This list seems a little quiet, but maybe this will be of interest to
someone.
I've been cleaning up and optimising the imdct36() function in layer3.c
(where most of the execution time was being spent), with the result
that it's now over 30 percent faster when compiled for x86.
The attached .tgz file includes the changed layer3.c, plus hacked
madplay.c and Makefile which profile the imdct36() function using the
pentium timestamp counter.
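For anyone who wants to try this kind of measurement without the attachment, a minimal sketch is below. It is not the code from the .tgz: read_cycles(), profile_min() and workload() are hypothetical names, the __rdtsc() intrinsic from <x86intrin.h> assumes gcc or clang on x86, and other targets fall back to clock(), which counts in different units.

```c
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
# include <x86intrin.h>
/* read the x86 timestamp counter */
static uint64_t read_cycles(void)
{
    return __rdtsc();
}
#else
# include <time.h>
/* fallback for other targets: processor time, not clock cycles */
static uint64_t read_cycles(void)
{
    return (uint64_t)clock();
}
#endif

/* Stand-in for the function being profiled (e.g. imdct36()). */
static volatile uint32_t sink;

static void workload(void)
{
    uint32_t i;

    for (i = 0; i < 1000; ++i)
        sink += i;
}

/* Call func() `runs` times and return the smallest count seen, which
   filters out runs disturbed by interrupts or cache misses.
   Returns UINT64_MAX if runs is 0. */
static uint64_t profile_min(void (*func)(void), unsigned int runs)
{
    uint64_t best = UINT64_MAX;
    unsigned int i;

    for (i = 0; i < runs; ++i) {
        uint64_t start = read_cycles();
        uint64_t elapsed;

        func();
        elapsed = read_cycles() - start;
        if (elapsed < best)
            best = elapsed;
    }

    return best;
}
```

A call such as profile_min(workload, 100) gives a figure that can be compared between builds, although the absolute number also includes the overhead of the function call and of reading the counter.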
On a different subject, does anyone have access to an ARM platform
for testing? I would be interested to know how the MAD code (with
and without my changes) compares to ARM's own mp3 decode library
(which claims to use only 29 MHz of cpu bandwidth for real time
decode on an ARM7 core).
Andre
Hello,
I have tried compiling the latest version of this program and running it
on an ARM7 with a CPU speed of 74MHz under ArmLinux. With the ARM asm code
enabled, the program takes 25 seconds to decode a 10-second MP3 bitstream
at 192kbps, 44.1kHz. Using the approx 32-bit code, it takes 18 seconds to
decode the same 10-second bitstream. MAD cannot reach realtime decoding on
this CPU. I have already turned on all the optimization options.
I have used the same ARM7 CPU to run the circuit logic mp3 decoder demo,
decoding some MP3 bitstreams at 128kbps, 44.1kHz. That program does not
need ArmLinux; it runs standalone. The audio playback works very well. I do
not know whether that program could decode in realtime at 29MHz.
Does anyone know why? Does ArmLinux eat some resources? Or are there some
compiler tricks I need to add to speed up the program? As far as I know,
the ARM7 has a 3-stage instruction pipeline. Does a normal compile not
take full advantage of this CPU's structure?
Regards,
lwong.
Hello
I'd like to know how the conformance of decoders should be tested. I think
there are some reference tracks, but where can I get them?
Regards,
--
Gabriel Bouvigne - France
bouvigne(a)mp3-tech.org
icq: 12138873
MP3' Tech: www.mp3-tech.org