XML2DCM: Umlaut & Encoding Problems

st.ku
Posts: 3
Joined: Fri, 2014-02-21, 11:06

#1 Post by st.ku » Mon, 2014-02-24, 08:41

Hi,

I made a small toolchain with dcm2xml and xml2dcm. I have a data set containing the study description "Wirbelsäule" (English: spine). dcm2xml is able to generate a proper xml file but xml2dcm after that crashes. In detail:

Code:

xml2dcm.exe --log-level debug a.xml a2.dcm
D: $dcmtk: xml2dcm v3.6.0 2011-01-06 $
D: 
I: reading XML input file: a.xml
--- libxml parsing ------
a.xml:34: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xE4 0x75 0x6C 0x65
<element tag="0008,1030" vr="LO" vm="1" len="30" name="StudyDescription">Wirbels
I can search for umlauts and replace them as a workaround, but this may be a general problem. When converting the original a.dcm to a.xml, I also receive a warning from dcm2xml:

Code:

W: (0008,0005) Specific Character Set 'ISO 2022 IR 100' not supported
Another problem I have (I was not able to find a suitable solution here): some DICOM entries have a trailing line feed (0x0A). This is converted to

Code:

&amp;#10;
by dcm2xml and not converted back by xml2dcm. This also happens on line breaks with CR LF (0x0D 0x0A).

Thank you for your help!

J. Riesmeier
DCMTK Developer
Posts: 2015
Joined: Tue, 2011-05-03, 14:38
Location: Oldenburg, Germany

Re: XML2DCM: Umlaut & Encoding Problems

#2 Post by J. Riesmeier » Mon, 2014-02-24, 09:21

dcm2xml is able to generate a proper xml file but xml2dcm after that crashes.
Do you really mean "crashes" or does the tool just exit/terminate with an error?

Regarding your "umlaut" problem: This is probably caused by a wrong character set encoding. Are you sure that the XML encoding of the "ä" is correct?
When converting the original a.dcm to a.xml, I also receive a warning from dcm2xml:

Code:

W: (0008,0005) Specific Character Set 'ISO 2022 IR 100' not supported
Right, this encoding (ISO 2022 switching of multiple character sets) is not supported by dcm2xml (as you can read in the documentation).
However, you might want to try the latest snapshot of this tool with option +U8 (--convert-to-utf8)...
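To illustrate what a conversion to UTF-8 has to do for this particular value, here is a minimal Python sketch. It is not DCMTK code. The mapping table and the helper name `decode_dicom_text` are my own assumptions for illustration, and the sketch only covers values that stay within Latin-1 (it ignores the escape sequences that ISO 2022 code extensions allow).

```python
# Assumed mapping from DICOM Specific Character Set terms to Python codecs.
# Only the Latin-1 and UTF-8 cases are listed; ISO 2022 escape sequences
# (switching between several character sets) are NOT handled here.
DICOM_TO_PYTHON_CODEC = {
    "ISO_IR 100": "latin-1",
    "ISO 2022 IR 100": "latin-1",
    "ISO_IR 192": "utf-8",
}


def decode_dicom_text(raw: bytes, specific_character_set: str) -> str:
    """Decode a raw DICOM string value using the assumed codec table."""
    codec = DICOM_TO_PYTHON_CODEC[specific_character_set]
    return raw.decode(codec)


# The study description from the first post: 0xE4 is the Latin-1 byte for "ä".
raw_value = b"Wirbels\xe4ule"
text = decode_dicom_text(raw_value, "ISO 2022 IR 100")

# Re-encoded as UTF-8, the single byte 0xE4 becomes the pair 0xC3 0xA4.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)
```

The point is only that the byte 0xE4 must be re-encoded before it can appear in an XML file that is declared as UTF-8.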
Another problem I have (I was not able to find a suitable solution here): some DICOM entries have a trailing line feed (0x0A). This is converted to

Code:

&amp;#10;
by dcm2xml and not converted back by xml2dcm. This also happens on line breaks with CR LF (0x0D 0x0A).
I have to check this. Actually, this conversion should be done by the underlying XML library (libxml2).


Re: XML2DCM: Umlaut & Encoding Problems

#3 Post by J. Riesmeier » Mon, 2014-02-24, 09:30

I checked the newline issue with the latest snapshot: LF and CR are correctly converted to "&#10;" and "&#13;", and back to LF and CR.
Your sequence "&amp;#10;" is incorrect by the way, it should be "&#10;". As far as I can see, dcm2xml 3.6.0 also generates the correct output...
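The round trip described above can be reproduced outside of DCMTK. The following Python sketch uses the standard library only, not libxml2 or dcm2xml, and the element name `e` and the helper `escape_line_breaks` are made up for the example. It writes CR and LF as numeric character references and lets an XML parser turn them back into the original bytes.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def escape_line_breaks(text: str) -> str:
    """Escape XML markup characters and write CR/LF as character references."""
    return escape(text, {"\r": "&#13;", "\n": "&#10;"})


original = "line one\r\nline two\n"

# Writing: CR becomes "&#13;" and LF becomes "&#10;".
escaped = escape_line_breaks(original)
print(escaped)

# Reading: the parser resolves the references back to CR and LF.
restored = ET.fromstring("<e>" + escaped + "</e>").text
print(repr(restored))
```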


Re: XML2DCM: Umlaut & Encoding Problems

#4 Post by st.ku » Mon, 2014-02-24, 10:23

Thank you for the fast reply!
Do you really mean "crashes" or does the tool just exit/terminate with an error?
Sorry for the inaccurate description. It terminates with an error:

Code:

xml2dcm.exe a.xml a2.dcm
E: could not parse document: a.xml
(All files are in the same folder, and converting other XML files works fine.)
The output of running xml2dcm.exe with debug logging is shown in my first post.
Are you sure that the XML encoding of the "ä" is correct?
What is the correct encoding? If I open the XML file with my notepad2 (mod) with ANSI 1252 encoding I can see an "ä". The file was not modified in any way.
However, you might want to try the latest snapshot of this tool with option +U8 (--convert-to-utf8)...
I will test it.
Your sequence "&amp;#10;" is incorrect by the way, it should be "&#10;".
OK, I found it. It is not a DCMTK problem. I checked the generated vanilla XML file and everything is fine, but the extra ampersand escapes are added when my code parses the XML file. RapidXML seems to add them while parsing. Sorry for the false alarm!
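For anyone who runs into the same symptom: a double-escaped reference no longer decodes to a line feed. The sketch below uses Python's standard library rather than RapidXML or libxml2, and the element name `e` is made up, so take it as a generic illustration of the effect, not as a statement about how RapidXML behaves.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# A correct character reference decodes to a single line feed.
correct = ET.fromstring("<e>abc&#10;</e>").text
print(repr(correct))

# If already-serialized text is escaped a second time, "&" becomes "&amp;" ...
double_escaped_markup = escape("abc&#10;")
print(double_escaped_markup)

# ... and a parser then returns the literal characters "&#10;", not a line feed.
wrong = ET.fromstring("<e>" + double_escaped_markup + "</e>").text
print(repr(wrong))
```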


Re: XML2DCM: Umlaut & Encoding Problems

#5 Post by J. Riesmeier » Mon, 2014-02-24, 13:23

What is the correct encoding? If I open the XML file with my notepad2 (mod) with ANSI 1252 encoding I can see an "ä". The file was not modified in any way.
ANSI 1252 is a Windows-specific character set (similar but not identical to ISO 8859-1 as far as I know).
The prolog of an XML document usually specifies the encoding, e.g. "ISO-8859-1" or "UTF-8".

The log output in your above posting indicates that the XML encoding is declared as UTF-8, which contradicts your statement that it is ANSI 1252 (Windows).
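A quick way to check for such a mismatch is to compare the declared encoding with the actual bytes. The following Python sketch is only an illustration: the helper names and the two sample documents are made up, and the regular expression is a simplification of the XML declaration syntax. Only the byte 0xE4 in "Wirbelsäule" is taken from the log in the first post.

```python
import re


def declared_xml_encoding(data: bytes) -> str:
    """Return the encoding named in the XML declaration, or UTF-8 if absent."""
    match = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', data)
    return match.group(1).decode("ascii") if match else "UTF-8"


def matches_declared_encoding(data: bytes) -> bool:
    """Check whether the document bytes decode with the declared encoding."""
    try:
        data.decode(declared_xml_encoding(data))
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample documents: the same bytes with two different prologs.
body = b'<element tag="0008,1030" vr="LO">Wirbels\xe4ule</element>'
declared_utf8 = b'<?xml version="1.0" encoding="UTF-8"?>' + body
declared_latin1 = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body

# 0xE4 followed by "u" (0x75) is not a valid UTF-8 sequence.
print(matches_declared_encoding(declared_utf8))
# The same bytes are valid when the prolog says ISO-8859-1.
print(matches_declared_encoding(declared_latin1))
```

So either the prolog has to name the encoding that the bytes really use, or the bytes have to be converted to the encoding that the prolog names.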
