Handling ID3 tags with non-latin1 encoding

s.huelswitt＠gmx.de

5 Jun 2006

9:06 a.m.

Hi,

I have a question regarding handling of ID3 tags which are non-latin1 encoded e.g. iso-8859-5.

I think I can get ucs4 strings from libid3tag and convert them with iconv to whatever codeset was originally used.

How can I get information about the original codeset?

Is there another designated way to handle these cases?

My current code to get a string for an ID3 tag is (FYI: it's VDR-MP3 player plugin):

void cScanID3::ParseStr(const struct id3_tag *tag, const char *id, char * &data)
{
  const struct id3_frame *frame=id3_tag_findframe(tag,id,0);
  if(!frame) return;

  free(data); data=0;
  const union id3_field *field=&frame->fields[1];
  if(id3_field_getnstrings(field)>0) {
    const id3_ucs4_t *ucs4=id3_field_getstrings(field,0);
    if(!ucs4) return;
    if(!strcmp(id,ID3_FRAME_GENRE)) ucs4=id3_genre_name(ucs4);

    id3_latin1_t *latin1=id3_ucs4_latin1duplicate(ucs4);
    if(!latin1) return;

    data=strdup((char *)latin1);
    free(latin1);
    }
}

Obviously the code yields bad results if the string wasn't originally in latin1.

TIA.

Regards.

-- Stefan Huelswitt s.huelswitt@gmx.de | http://www.muempf.de/


Kurt Roeckx

6 Jun 2006

11:46 p.m.

On Mon, Jun 05, 2006 at 09:06:00AM +0000, Stefan Huelswitt wrote:

> Hi,
>
> I have a question regarding handling of ID3 tags which are non-latin1 encoded e.g. iso-8859-5.
>
> I think I can get ucs4 strings from libid3tag and convert them with iconv to whatever codeset was originally used.

What you really should do is convert things to wchar_t, and then use functions that work with them (like wprintf()) to output things. This should be much more portable.

I find it unfortunate that libid3tag does everything with ucs4 instead of wchar_t.

Anyway, one problem you should probably be aware of is that id3_ucs4_t is a long, which is typically 64 bits on a 64-bit machine, and so is not 4 bytes as you might expect.

I'm not sure what's the best way to convert it from ucs4 to wchar_t though, and can't think of a portable way, and I'd love it if someone could point out how to do it.

(On Linux (glibc), casting it from an id3_ucs4_t to a wchar_t seems to work, but it's not portable.)

Kurt

Rob Leslie

7 Jun 2006

1:38 a.m.

On Jun 6, 2006, at 4:46 PM, Kurt Roeckx wrote:

> On Mon, Jun 05, 2006 at 09:06:00AM +0000, Stefan Huelswitt wrote:
>
> > I have a question regarding handling of ID3 tags which are non-latin1 encoded e.g. iso-8859-5.
> >
> > I think I can get ucs4 strings from libid3tag and convert them with iconv to whatever codeset was originally used.
>
> What you really should do is convert things to wchar_t, and then use functions that work with them (like wprintf()) to output things. This should be much more portable.
>
> I find it unfortunate that libid3tag does everything with ucs4 instead of wchar_t.

Unfortunately wchar_t is not a very portable way of supporting Unicode. Its size is compiler-dependent, and may not be large enough to represent all of the ISO/IEC 10646 code points. For example, ISO/IEC 9899:1999 (C99) allows wchar_t to be as small as 8 bits; even 16 bits is not large enough. The type is also locale-dependent.

> I'm not sure what's the best way to convert it from ucs4 to wchar_t though, and can't think of a portable way, and I'd love it if someone could point out how to do it.
>
> (On Linux (glibc), casting it from an id3_ucs4_t to a wchar_t seems to work, but it's not portable.)

There's nothing wrong with this per se as long as wchar_t is at least 21 bits wide (for example, when __STDC_ISO_10646__ is defined) or you are sure the UCS-4 code point in question can otherwise be represented by a wchar_t -- and the relevant locale is compatible. Note that casting id3_ucs4_t to wchar_t is not the same as casting (id3_ucs4_t *) to (wchar_t *); the latter is definitely not portable though it may work if the underlying types happen to have the same size.

A simple way to translate a UCS-4 string to wchar_t might be (untested):

wchar_t *ucs4_to_wchar(id3_ucs4_t const *ucs4)
{
  wchar_t *wchar;

  wchar = malloc(id3_ucs4_size(ucs4) * sizeof(wchar_t));
  if (wchar) {
    wchar_t *ptr = wchar;
    while ((*ptr++ = (wchar_t) *ucs4++))
      ;
  }

  return wchar;
}

This assumes __STDC_ISO_10646__ is defined, or wchar_t is otherwise compatible with ISO/IEC 10646 character codes.
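The condition stated above can also be checked in code. The following is only a sketch, not part of libid3tag: it refuses to convert when the platform does not define __STDC_ISO_10646__ or when a code point does not fit in wchar_t. The ucs4_t typedef is a stand-in for id3_ucs4_t (assumed to be unsigned long), and the function name ucs4_to_wchar_checked is hypothetical. It counts the string length itself instead of calling id3_ucs4_size().

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Stand-in for libid3tag's id3_ucs4_t (assumed here to be unsigned long). */
typedef unsigned long ucs4_t;

/* Hypothetical helper: copy a 0-terminated UCS-4 string into a newly
 * allocated wchar_t string. Returns NULL if wchar_t is not known to hold
 * ISO/IEC 10646 values, if a code point is out of range, or if malloc()
 * fails. The caller frees the result. */
wchar_t *ucs4_to_wchar_checked(const ucs4_t *ucs4)
{
    assert(ucs4 != NULL);
#ifndef __STDC_ISO_10646__
    return NULL;   /* wchar_t values are not known to be ISO/IEC 10646 */
#else
    size_t len = 0, i;
    wchar_t *wchar;

    while (ucs4[len] != 0)
        len++;

    wchar = malloc((len + 1) * sizeof(wchar_t));
    if (!wchar)
        return NULL;

    for (i = 0; i < len; i++) {
        /* Reject anything beyond the Unicode range or beyond wchar_t. */
        if (ucs4[i] > 0x10FFFFUL || ucs4[i] > (unsigned long) WCHAR_MAX) {
            free(wchar);
            return NULL;
        }
        wchar[i] = (wchar_t) ucs4[i];
    }
    wchar[len] = L'\0';
    return wchar;
#endif
}
```

Copying element by element like this sidesteps the pointer-cast problem, because it never assumes that the two types have the same size.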

-- Rob Leslie rob@mars.org

Kurt Roeckx

6:46 a.m.

On Tue, Jun 06, 2006 at 06:38:04PM -0700, Rob Leslie wrote:

> > I'm not sure what's the best way to convert it from ucs4 to wchar_t though, and can't think of a portable way, and I'd love it if someone could point out how to do it.
> >
> > (On Linux (glibc), casting it from an id3_ucs4_t to a wchar_t seems to work, but it's not portable.)
>
> There's nothing wrong with this per se as long as wchar_t is at least 21 bits wide (for example, when __STDC_ISO_10646__ is defined) or you are sure the UCS-4 code point in question can otherwise be represented by a wchar_t -- and the relevant locale is compatible.

The biggest problem being that "the locale is compatible" means it's not very portable.

> Note that casting id3_ucs4_t to wchar_t is not the same as casting (id3_ucs4_t *) to (wchar_t *); the latter is definitely not portable though it may work if the underlying types happen to have the same size.

As I pointed out, on 64-bit architectures they're typically of different sizes, so this breaks.

> A simple way to translate a UCS-4 string to wchar_t might be (untested):
>
> wchar_t *ucs4_to_wchar(id3_ucs4_t const *ucs4)
> {
>   wchar_t *wchar;
>
>   wchar = malloc(id3_ucs4_size(ucs4) * sizeof(wchar_t));
>   if (wchar) {
>     wchar_t *ptr = wchar;
>     while ((*ptr++ = (wchar_t) *ucs4++))
>       ;
>   }
>
>   return wchar;
> }

I've used the following code:

wchar_t *ucs4_to_wchar_strdup(const id3_ucs4_t *ucs4)
{
  id3_length_t size = id3_ucs4_size(ucs4);
  wchar_t *s = malloc(size * sizeof(wchar_t));
  id3_length_t i;

  for (i = 0; i < size; i++) {
    s[i] = ucs4[i];
  }
  return s;
}

This basically does the same, but doesn't do error checking, and seems to work. The problem, however, is that id3_ucs4_size isn't exported in the header files that get installed.

Anyway, I now think it might be more portable to just run iconv() to convert from ucs4 to nl_langinfo(CODESET). If you want, you can then convert the result to wchar_t with something like mbtowc(), but this isn't needed.

Note that iconv() might not be able to represent everything in CODESET, but you'll have that problem anyway once you try to output it.
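The iconv() idea above could look roughly like the sketch below. It is not taken from madplay or libid3tag. The ucs4_t typedef stands in for id3_ucs4_t, the function name ucs4_to_codeset is hypothetical, and the encoding name "UCS-4BE" is an assumption about what the local iconv implementation accepts. Because id3_ucs4_t may be wider than 4 bytes, the code points are first serialized into 4 big-endian bytes each. In a real program the codeset argument would be nl_langinfo(CODESET) after calling setlocale(LC_ALL, "").

```c
#include <assert.h>
#include <iconv.h>
#include <limits.h>
#include <stdlib.h>

/* Stand-in for libid3tag's id3_ucs4_t (assumed here to be unsigned long). */
typedef unsigned long ucs4_t;

/* Hypothetical helper: convert a 0-terminated UCS-4 string to `codeset`.
 * Returns a malloc()ed, NUL-terminated string, or NULL on any failure,
 * including characters that `codeset` cannot represent. */
char *ucs4_to_codeset(const ucs4_t *ucs4, const char *codeset)
{
    size_t len = 0, i, inleft, outleft, rc;
    unsigned char *inbuf;
    char *outbuf, *in, *out;
    iconv_t cd;

    assert(ucs4 != NULL && codeset != NULL);

    while (ucs4[len] != 0)
        len++;

    inleft = len * 4;
    outleft = len * MB_LEN_MAX;
    inbuf = malloc(inleft + 1);
    outbuf = malloc(outleft + 1);
    if (!inbuf || !outbuf) {
        free(inbuf);
        free(outbuf);
        return NULL;
    }

    /* 4 big-endian bytes per character, whatever sizeof(long) is. */
    for (i = 0; i < len; i++) {
        inbuf[4 * i + 0] = (unsigned char) ((ucs4[i] >> 24) & 0xff);
        inbuf[4 * i + 1] = (unsigned char) ((ucs4[i] >> 16) & 0xff);
        inbuf[4 * i + 2] = (unsigned char) ((ucs4[i] >> 8) & 0xff);
        inbuf[4 * i + 3] = (unsigned char) (ucs4[i] & 0xff);
    }

    cd = iconv_open(codeset, "UCS-4BE");
    if (cd == (iconv_t) -1) {
        free(inbuf);
        free(outbuf);
        return NULL;
    }

    in = (char *) inbuf;
    out = outbuf;
    rc = iconv(cd, &in, &inleft, &out, &outleft);
    iconv_close(cd);
    free(inbuf);

    if (rc == (size_t) -1) {
        free(outbuf);
        return NULL;
    }
    *out = '\0';
    return outbuf;
}
```

This sketch simply fails when a character cannot be represented in the target codeset; a player would more likely want to substitute a placeholder character and continue.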

I'll try and send a patch for madplayer/player.c later, if you think this should work and is acceptable. I'm just not sure how happy you are about things that are SUS/POSIX specific, since both iconv() and nl_langinfo() are.

Kurt

Rob Leslie

1:17 a.m.

On Jun 5, 2006, at 2:06 AM, Stefan Huelswitt wrote:

> I have a question regarding handling of ID3 tags which are non-latin1 encoded e.g. iso-8859-5.
>
> I think I can get ucs4 strings from libid3tag and convert them with iconv to whatever codeset was originally used.
>
> How can I get information about the original codeset?

In ID3v2, the original text encoding is one of ISO-8859-1 (i.e. Latin-1), UTF-16, or UTF-8. You can find the encoding in the assigned field of the tag frame containing the string; usually it is the first field.

ID3v1 has no text encoding specification, so it is impossible to know what encoding was used; libid3tag assumes it is Latin-1 and translates this to UCS-4.
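For the ID3v2 case, the encoding field holds a single byte whose values are defined by the ID3v2 specification. The sketch below maps that byte to a name that could be passed to iconv(). The function name is hypothetical, and the byte values come from the ID3v2.4 specification rather than from this thread. libid3tag should expose the same information through id3_field_gettextencoding() on that first field, but check id3tag.h before relying on that.

```c
#include <stddef.h>

/* Hypothetical helper: map the ID3v2 text-encoding byte to an encoding
 * name. Returns NULL for values the specification does not define. */
const char *id3v2_text_encoding_name(unsigned char byte)
{
    switch (byte) {
    case 0x00: return "ISO-8859-1";
    case 0x01: return "UTF-16";     /* with byte order mark */
    case 0x02: return "UTF-16BE";   /* ID3v2.4 only, no byte order mark */
    case 0x03: return "UTF-8";      /* ID3v2.4 only */
    default:   return NULL;
    }
}
```

None of these values can describe iso-8859-5, so a tag that stores such text in a frame marked as ISO-8859-1 is indistinguishable from genuine Latin-1 without outside knowledge.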

> Is there another designated way to handle these cases?

If you happen to know the encoding of an ID3v1 tag is something other than Latin-1, you can translate it from UCS-4 back to Latin-1, then re-interpret it however you like.
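For the iso-8859-5 example from the original question, that round trip can be sketched without any library. Since Latin-1 maps each byte to the code point of the same value, a UCS-4 string that libid3tag produced under the Latin-1 assumption still contains the original bytes, and each one can be remapped directly. The ucs4_t typedef and both function names are hypothetical. The mapping table is the ISO-8859-5 layout, not something stated in this thread. For other codesets, running iconv() on the Latin-1 bytes is the more general route.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for libid3tag's id3_ucs4_t (assumed here to be unsigned long). */
typedef unsigned long ucs4_t;

/* Map one ISO-8859-5 byte to its Unicode code point. */
static ucs4_t iso8859_5_to_ucs4(unsigned char byte)
{
    if (byte < 0xA1 || byte == 0xAD)
        return byte;                 /* ASCII, controls, NBSP, soft hyphen */
    if (byte == 0xF0)
        return 0x2116;               /* NUMERO SIGN */
    if (byte == 0xFD)
        return 0x00A7;               /* SECTION SIGN */
    return (ucs4_t) byte + 0x360;    /* 0xA1 -> U+0401 ... 0xFF -> U+045F */
}

/* Hypothetical helper: reinterpret, in place, a UCS-4 string that was
 * decoded as Latin-1 but really held ISO-8859-5 bytes. Returns 0 on
 * success, or -1 (leaving the string untouched) if any code point is
 * above 0xFF and so cannot have come from a single byte. */
int reinterpret_latin1_as_iso8859_5(ucs4_t *ucs4)
{
    size_t i;

    assert(ucs4 != NULL);

    for (i = 0; ucs4[i] != 0; i++)
        if (ucs4[i] > 0xFF)
            return -1;

    for (i = 0; ucs4[i] != 0; i++)
        ucs4[i] = iso8859_5_to_ucs4((unsigned char) ucs4[i]);

    return 0;
}
```

The strings returned by id3_field_getstrings() are const, so a caller would work on a copy. The result is ordinary UCS-4 again and can be passed on to whatever output conversion the program uses.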

-- Rob Leslie rob@mars.org

mad-dev@lists.mars.org
