I was testing the Szu-Wei Lee algorithm and found that it is faster than the one in libmad.
As a matter of fact I have also been experimenting with a new Layer III IMDCT implementation based on the same algorithm. It is nice because it reduces the number of multiplications from 188 to 43.
I had to modify the algorithm slightly to improve the accuracy of its output. With that change it improves overall performance on many platforms by as much as 6-7%, and it is also more accurate than the libmad 0.15.0b release.
My current implementation does not, however, appear to improve on Andre McCurdy's hand-written assembly implementation for ARM platforms. Perhaps that assembly could be rewritten around the new algorithm for significant gains.
My implementation is attached; this or something very similar will be incorporated into the next libmad release.