Encoding Conversion
Data encoding compatibility problems are among the most common difficulties encountered by programmers new to XML in general and libxml in particular. Thinking through the design of your application in light of this issue will help avoid difficulties later. Internally, libxml stores and manipulates data in the UTF-8 format. Data used by your program in other formats, such as the commonly used ISO-8859-1 encoding, must be converted to UTF-8 before being passed to libxml functions. If you want your program's output in an encoding other than UTF-8, you must also convert it.
Libxml uses iconv, if it is available, to convert data. Without iconv, only UTF-8, UTF-16 and ISO-8859-1 can be used as external formats. With iconv, any format can be used provided iconv is able to convert it to and from UTF-8. Currently iconv supports about 150 different character formats, with the ability to convert from any one to any other. The actual number of supported formats varies between implementations, but most iconv implementations cover the encodings in common use.
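As an illustration of the kind of conversion iconv performs, the following is a minimal sketch that converts an ISO-8859-1 string to UTF-8 by calling the POSIX iconv interface directly. The helper name latin1_to_utf8 is hypothetical and is not part of libxml or iconv. The sketch assumes the platform's iconv recognizes the names "ISO-8859-1" and "UTF-8", and on some systems the program must be linked against a separate iconv library.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/*
 * latin1_to_utf8 is a hypothetical helper, not a libxml or iconv function.
 * It returns a newly allocated, NUL-terminated UTF-8 copy of `in`,
 * or NULL on failure. The caller frees the result.
 */
char *latin1_to_utf8(const char *in)
{
    assert(in != NULL);

    /* iconv_open takes the target encoding first, then the source. */
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        return NULL;

    size_t in_left = strlen(in);
    /* Each ISO-8859-1 byte becomes at most two UTF-8 bytes. */
    size_t out_left = in_left * 2;
    char *out = malloc(out_left + 1);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    /* iconv advances both pointers and decrements both counters. */
    char *in_ptr = (char *)in;
    char *out_ptr = out;
    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1 || in_left != 0) {
        free(out);
        return NULL;
    }
    *out_ptr = '\0';
    return out;
}
```

For example, the four-byte ISO-8859-1 string "caf\xE9" comes back as the five-byte UTF-8 string "caf\xC3\xA9", which can then be passed to libxml functions.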
Warning: A common mistake is to use different formats for the internal data in different parts of one's code. The most common case is an application that assumes ISO-8859-1 to be the internal data format, combined with libxml, which assumes UTF-8 to be the internal data format. The result is an application that treats internal data differently depending on which code section is executing. One part of the code or the other will then, naturally, misinterpret the data.
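One way to catch this kind of mix-up early is to check strings at the boundary where they enter the UTF-8 side of the program. The following is a simplified sketch of such a check. The function looks_like_utf8 is a hypothetical helper written for this illustration. It tests only the structure of lead and continuation bytes and does not reject overlong forms or surrogates. Libxml provides its own xmlCheckUTF8 function for the same purpose.

```c
#include <assert.h>
#include <stddef.h>

/*
 * looks_like_utf8 is a hypothetical helper. It returns 1 if the first
 * `len` bytes of `s` are structurally valid UTF-8, and 0 otherwise.
 * Simplification: overlong forms and surrogates are not rejected.
 */
int looks_like_utf8(const unsigned char *s, size_t len)
{
    assert(s != NULL);

    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t follow;                 /* continuation bytes expected */

        if (c < 0x80)                follow = 0;   /* ASCII */
        else if ((c & 0xE0) == 0xC0) follow = 1;
        else if ((c & 0xF0) == 0xE0) follow = 2;
        else if ((c & 0xF8) == 0xF0) follow = 3;
        else return 0;                 /* stray continuation or bad lead byte */

        if (follow > len - i - 1)
            return 0;                  /* sequence is cut off */

        for (size_t k = 1; k <= follow; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;              /* not a continuation byte */

        i += follow + 1;
    }
    return 1;
}
```

An ISO-8859-1 string such as "caf\xE9" fails this check, which signals that it was never converted. A passing result is not proof that the data was UTF-8 to begin with, since some ISO-8859-1 byte sequences happen to be valid UTF-8 as well.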
      
This example constructs a simple document, then adds content provided at the command line to the document's root element and outputs the results to stdout in the proper encoding. For this example, we use ISO-8859-1 encoding. The encoding of the string input at the command line is converted from ISO-8859-1 to UTF-8. Full code: Appendix H, Code for Encoding Conversion Example.
The conversion, encapsulated in the example code in the convert function, uses libxml's xmlFindCharEncodingHandler function:
      
	
    xmlCharEncodingHandlerPtr handler;                 /* (1) */

    size = (int)strlen(in)+1;                          /* (2) */
    out_size = size*2-1;
    out = malloc((size_t)out_size);
    …
    handler = xmlFindCharEncodingHandler(encoding);    /* (3) */
    …
    handler->input(out, &out_size, in, &temp);         /* (4) */
    …
    xmlSaveFormatFileEnc("-", doc, encoding, 1);       /* (5) */
      
      
(1) handler is declared as a pointer to an xmlCharEncodingHandler structure.
(2) The conversion function needs to be given the sizes of the input and output strings, which are calculated here for the strings in and out.
(3) xmlFindCharEncodingHandler takes as its argument the data's initial encoding and searches libxml's built-in set of conversion handlers, returning a pointer to the handler, or NULL if none is found.
(4) The conversion function identified by handler requires as its arguments pointers to the input and output strings, along with pointers to the length of each. The lengths must be determined separately by the application.
(5) To output in a specified encoding rather than UTF-8, we use xmlSaveFormatFileEnc, specifying the encoding.
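The buffer sizing in step 2 and the length arguments in step 4 can be illustrated without libxml. The following is a sketch only: latin1_input is a hypothetical stand-in that takes the same kinds of arguments as the call in step 4, and convert_sketch is a hypothetical caller that mirrors the sizing arithmetic from step 2. The return convention used here (0 on success, -1 if the output buffer is too small) is an assumption made for this sketch. Consult the libxml documentation for the return value of the real handler.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in with the same argument shape as the call in step 4.
 * On entry, *out_len is the capacity of out and *in_len the number of
 * input bytes. On return, they hold the bytes written and bytes consumed.
 * Returns 0 on success, -1 if out is too small (assumed convention).
 */
int latin1_input(unsigned char *out, int *out_len,
                 const unsigned char *in, int *in_len)
{
    assert(out && out_len && in && in_len);

    int o = 0, i = 0, rc = 0;
    while (i < *in_len) {
        unsigned char c = in[i];
        int need = (c < 0x80) ? 1 : 2;   /* ASCII stays one byte */
        if (o + need > *out_len) {
            rc = -1;
            break;
        }
        if (need == 1) {
            out[o++] = c;
        } else {
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
        i++;
    }
    *out_len = o;
    *in_len = i;
    return rc;
}

/* Mirrors steps 2 and 4: size the buffers, convert, then NUL-terminate. */
unsigned char *convert_sketch(const char *in)
{
    int size = (int)strlen(in) + 1;   /* input length including the NUL */
    int out_size = size * 2 - 1;      /* worst case: every byte doubles, plus NUL */
    int in_len = size - 1;            /* the NUL itself is not converted */

    unsigned char *out = malloc((size_t)out_size);
    if (out == NULL)
        return NULL;

    if (latin1_input(out, &out_size, (const unsigned char *)in, &in_len) != 0
        || in_len != size - 1) {
        free(out);
        return NULL;
    }
    out[out_size] = '\0';             /* out_size now holds the bytes written */
    return out;
}
```

The point of the sketch is that the length arguments are both inputs and outputs: the caller supplies the capacity and the input length, and reads back how much was written and consumed. Checking that the whole input was consumed guards against a partial conversion.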
    