So, I was running the profiler to see where the slow points were. To make the output easier to read, I split a chunk of III_decode out into III_decode_a. This chunk is the sizable portion from the first for loop. The profiler output is now better broken down:
  %   cumulative     self                self    total
 time    seconds  seconds    calls  ms/call  ms/call  name
29.81       5.72     5.72     7394     0.77     1.05  mad_synth_frame
20.43       9.64     3.92   768704     0.01     0.01  imdct36
10.84      11.72     2.08   532368     0.00     0.00  dct32
 8.49      13.35     1.63    29576     0.06     0.12  III_decode_a
 7.76      14.84     1.49    29576     0.05     0.06  III_huffdecode
 7.14      16.21     1.37     7394     0.19     1.53  III_decode
 6.88      17.53     1.32   177728     0.01     0.01  III_imdct_s
 3.49      18.20     0.67   768704     0.00     0.01  III_imdct_l
 2.50      18.68     0.48    14788     0.03     0.03  III_stereo
 1.62      18.99     0.31  2919640     0.00     0.00  mad_bit_read
Where III_decode() used to weigh in at 2, it is now seen as 3 chunks.
I doubt this patch is of any worth, but I just wanted to share.
In dct32() there is a cos table declared on every single call. Why not move it to just before the function?
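To illustrate the cos table question, here is a minimal sketch of the two layouts. It is not the real dct32() code: the function names, the table size, and the table contents are made up for illustration. Whether the first variant actually costs anything depends on the compiler, which may already place a const local table in static storage.

```c
#include <stddef.h>

/* Hypothetical variant A: the table is an automatic (stack) array.
 * A compiler may rebuild it on every call. */
double scale_local(size_t i, double x)
{
    /* cos(0), cos(pi/8), cos(pi/4), cos(3*pi/8) */
    const double costab[4] = {
        1.0, 0.9238795325, 0.7071067812, 0.3826834324
    };
    return x * costab[i & 3];
}

/* Hypothetical variant B: the same table hoisted to file scope.
 * It lives in static storage and is set up once, before main() runs. */
static const double costab_hoisted[4] = {
    1.0, 0.9238795325, 0.7071067812, 0.3826834324
};

double scale_hoisted(size_t i, double x)
{
    return x * costab_hoisted[i & 3];
}
```

Declaring the table `static const` inside the function would have the same effect while keeping the name local to the function.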
In III_huffdecode() you open a new block for most of the function body. Why use this temporary block?
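For anyone unfamiliar with the pattern, here is a rough sketch of what a bare block inside a function looks like. This is not taken from III_huffdecode(); the function and variable names are hypothetical. One common reason for it, which may or may not be the reason here, is that C89 only allows declarations at the start of a block, so opening a block is a way to declare locals after some early statements and to limit their scope.

```c
/* Hypothetical example of a bare block inside a function body. */
int sum_of_squares(const int *v, int n)
{
    int total = 0;

    /* Early exit happens before the working variables are needed. */
    if (n <= 0)
        return 0;

    {   /* bare block: C89 permits new declarations at its start */
        int i;
        int sq;

        for (i = 0; i < n; ++i) {
            sq = v[i] * v[i];
            total += sq;
        }
    }   /* i and sq go out of scope here */

    return total;
}
```

The block changes nothing about the generated behavior; it only narrows where `i` and `sq` are visible.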