I made a new snapshot of MAD, version 0.10.1b.
ftp://ftp.mars.org/pub/mpeg/
I managed to create three new optimizations for this version to significantly improve performance across all layers and use less memory at the same time.
On the StrongARM 1100, CPU usage for Layer III decoding dropped from 37% to 31%. Layer II dropped from 22% to 17%, and Layer I from 23% to 18%. This puts MAD ahead of all known integer decoders for Layer I and Layer II, and behind only Xaudio (at 24%) for Layer III. For a more complete summary, see the TIMINGS file in the distribution.
The first optimization came from observing that the least significant 12 bits of every subband synthesis windowing coefficient are zero. Pre-shifting these bits away gives the coefficients more leading zeros or leading ones, which reduces the multiplication cycle count on some machines.
The second optimization reduces the windowing coefficient table almost by half (saving memory) by exploiting its symmetry, and at the same time localizes the symmetric computation so that fewer memory references are needed overall.
The third optimization, made in conjunction with the first, changes the way fixed-point multiplication is performed during synthesis windowing. Since the pre-shifted coefficients have only 16 significant fractional bits, multiplying one by a number with 12 fractional bits yields exactly 28 fractional bits, which is what the fixed-point format uses. Thus all the fixed-point shifts can be eliminated during windowing if the input is pre-shifted by 16 (28-12) bits. Another benefit is that the compiler can more easily choose a multiply-accumulate instruction if one is available.
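To make the bit arithmetic of the first and third optimizations concrete, here is a minimal sketch. It is not the code in the distribution: the names fixed_t, preshift_coef, preshift_sample, mul_reference, mul_preshifted and window_mac are made up for illustration, and it assumes a 32-bit fixed-point word with 28 fractional bits and a compiler that shifts negative values arithmetically.

```c
#include <stdint.h>

/* Hypothetical names throughout; assumes 28 fractional bits per word. */
#define FRACBITS 28

typedef int32_t fixed_t;

/* Conventional multiply: 64-bit product, then shift back to 28
 * fractional bits. This is the shift the optimization removes. */
static fixed_t mul_reference(fixed_t x, fixed_t y)
{
    return (fixed_t)(((int64_t)x * y) >> FRACBITS);
}

/* Drop the 12 low bits of a window coefficient. This is exact whenever
 * those bits are zero, and leaves 16 fractional bits. */
static fixed_t preshift_coef(fixed_t coef28)
{
    return coef28 >> 12;
}

/* Drop the 16 low bits of an input sample, leaving 12 fractional bits.
 * This is where precision is lost. Right-shifting a negative value is
 * implementation-defined in C; an arithmetic shift is assumed here. */
static fixed_t preshift_sample(fixed_t sample28)
{
    return sample28 >> 16;
}

/* 12 + 16 = 28 fractional bits, so the plain 32-bit product is already
 * in the target format and needs no shift afterwards. */
static fixed_t mul_preshifted(fixed_t sample12, fixed_t coef16)
{
    return sample12 * coef16;
}

/* With no shift between the multiply and the add, the windowing sum is
 * a plain multiply-accumulate loop. */
static fixed_t window_mac(const fixed_t *sample12, const fixed_t *coef16,
                          int n)
{
    fixed_t acc = 0;
    int i;

    for (i = 0; i < n; ++i)
        acc += sample12[i] * coef16[i];

    return acc;
}
```

For example, with a coefficient of 0.5 (0x08000000) and a sample of 0.25 (0x04000000), both paths give 0.125 (0x02000000). With a sample of 0x04001234 the reference multiply gives 0x0200091A while the pre-shifted path still gives 0x02000000, because the low 16 bits of the sample were discarded before the multiply.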
This last optimization loses precision in the output, but I think this should not generally be audible. I haven't yet analyzed the loss with respect to decoding compliance, although this is my intention. If anyone is interested in looking into this, I'd appreciate the help. The optimization can optionally be turned off to get more accurate output.
I have not yet rewritten the Layer III IMDCT, so Layer III performance will get better still. How much? The Layer I and Layer II numbers give a lower bound on the achievable CPU usage, since these are the simplest layers that use the same subband synthesis.
This version also has some code cleanup, including fixed_t -> mad_fixed_t renaming as suggested, new M/S and intensity stereo indicators for madplay's verbose mode, and plenty of other miscellaneous changes. There is also new sample code `minimad.c' which shows perhaps the simplest use of the libmad API.
Cheers, -rob