To achieve sub-5 ms accuracy, you would have to decode the whole frame containing the desired seek destination (or perhaps just a granule, i.e. half of a frame, but that would require significant changes to the decoder internals) and then discard samples up to the desired start point. As you rightly say, this increases latency.
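The "decode the containing frame, then discard" approach comes down to two pieces of arithmetic: which frame holds the target sample, and how many decoded samples to throw away from the start of that frame. Below is a minimal sketch. It assumes a constant 1152 samples per frame (MPEG-1 Layer III). It also ignores the bit reservoir (`main_data_begin` can point back into earlier frames) and the decoder's overlap-add state, so a real decoder would typically need to decode one or more preceding frames as well. The function names are hypothetical and not part of any decoder API.

```python
def seek_position(target_ms: float, sample_rate: int,
                  samples_per_frame: int = 1152) -> tuple[int, int]:
    """Return (frame_index, samples_to_discard) for a target time.

    Hypothetical helper. Assumes a constant number of samples per frame
    (1152 for MPEG-1 Layer III) and ignores encoder/decoder delay,
    the bit reservoir, and overlap-add state from previous frames.
    """
    target_sample = round(target_ms * sample_rate / 1000)
    # Integer division gives the frame; the remainder is the offset inside it.
    frame_index, skip = divmod(target_sample, samples_per_frame)
    return frame_index, skip


def trim_decoded_frame(pcm: list[int], skip: int) -> list[int]:
    """Drop the first `skip` PCM samples of one decoded frame (single channel)."""
    return pcm[skip:]
```

For example, seeking to 1000 ms at 44100 Hz gives `seek_position(1000, 44100) == (38, 324)`: decode frame 38 and discard its first 324 samples. In the worst case you discard 1151 samples, i.e. roughly 26 ms of decoded audio, which is where the extra latency comes from.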
Could it be possible to use smaller frames? For example, in a 22050 Hz MPEG file the frame size is only 207 bytes, while at 44100 Hz it's 417 bytes. Could this be achieved without lowering the quality?
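A frame that is smaller in bytes is not necessarily shorter in time, and the duration is what limits seek granularity. For Layer III, MPEG-1 (32, 44.1, 48 kHz) uses 1152 samples per frame, while MPEG-2/2.5 (the lower sample rates) uses 576. Halving the sample rate therefore also halves the samples per frame, and the frame duration stays at about 26 ms. The sketch below computes both quantities. The 128 kbps bitrate used for the 417-byte figure is an assumption, and the `sample_rate >= 32000` test is a simplification for telling MPEG-1 from MPEG-2/2.5 rates; the function names are hypothetical.

```python
def frame_duration_ms(sample_rate: int) -> float:
    """Duration of one Layer III frame in milliseconds.

    Simplification: rates >= 32 kHz are treated as MPEG-1 (1152 samples/frame),
    lower rates as MPEG-2/2.5 (576 samples/frame).
    """
    samples = 1152 if sample_rate >= 32000 else 576
    return 1000.0 * samples / sample_rate


def frame_size_bytes(bitrate: int, sample_rate: int, padding: int = 0) -> int:
    """Layer III frame length in bytes (header included), using the
    standard 144 (MPEG-1) or 72 (MPEG-2/2.5) coefficient."""
    coeff = 144 if sample_rate >= 32000 else 72
    return coeff * bitrate // sample_rate + padding


# Same time granularity at both rates, despite the different byte sizes.
for rate in (44100, 22050):
    print(rate, round(frame_duration_ms(rate), 2))
```

At 128 kbps and 44100 Hz this gives `frame_size_bytes(128000, 44100) == 417`, and both sample rates print a frame duration of 26.12 ms. Lowering the sample rate does reduce quality, since it also lowers the audio bandwidth (the Nyquist limit at 22050 Hz is about 11 kHz).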
What is your application? Why do you have such tight seek accuracy/latency requirements?
My app is for medical-music purposes and needs real-time performance. I can't say more, sorry...
Gad