mirror of https://github.com/zeux/pugixml.git synced 2024-12-27 13:33:17 +08:00

docs: Added error handling, parsing options and encoding sections, minor spelling fix

git-svn-id: http://pugixml.googlecode.com/svn/trunk@553 99668b35-9821-0410-8761-19e4c4f06640
arseny.kapoulkine 2010-06-30 14:29:14 +00:00
parent e73b54e60d
commit 678d2f2369


@ -540,7 +540,7 @@ Sometimes XML data should be loaded from some other source than file, i.e. HTTP
All functions accept the buffer which is represented by a pointer to XML data, `contents`, and data size in bytes. Also there are two optional arguments, which specify parsing options (see [sref manual.loading.options]) and input data encoding (see [sref manual.loading.encoding]). The buffer does not have to be zero-terminated.
`load_buffer` function works with immutable buffer - it does not ever modify the buffer. Because of this restriction it has to create a private buffer and copy XML data to it before parsing (applying encoding conversions if necessary). This copy operation carries a performance penalty, so inplace functions are provided - `load_buffer_inplace` and `load_buffer_inplace_own` store the document data in the buffer, modifying it in the process. In order for the document to stay valid, you have to make sure that the buffers lifetime exceeds that of the tree if you're using inplace functions. In addition to that, `load_buffer_inplace` does not assume ownership of the buffer, so you'll have to destroy it yourself; `load_buffer_inplace_own` assumes ownership of the buffer and destroys it once it is not needed. This means that if you're using `load_buffer_inplace_own`, you have to allocate memory with pugixml allocation function (you can get it via [link get_memory_allocation_function]).
`load_buffer` function works with immutable buffer - it does not ever modify the buffer. Because of this restriction it has to create a private buffer and copy XML data to it before parsing (applying encoding conversions if necessary). This copy operation carries a performance penalty, so inplace functions are provided - `load_buffer_inplace` and `load_buffer_inplace_own` store the document data in the buffer, modifying it in the process. In order for the document to stay valid, you have to make sure that the buffer's lifetime exceeds that of the tree if you're using inplace functions. In addition to that, `load_buffer_inplace` does not assume ownership of the buffer, so you'll have to destroy it yourself; `load_buffer_inplace_own` assumes ownership of the buffer and destroys it once it is not needed. This means that if you're using `load_buffer_inplace_own`, you have to allocate memory with pugixml allocation function (you can get it via [link get_memory_allocation_function]).
The best way from the performance/memory point of view is to load document using `load_buffer_inplace_own`; this function has maximum control of the buffer with XML data so it is able to avoid redundant copies and reduce peak memory usage while parsing. This is the recommended function if you have to load the document from memory and performance is critical.
@ -584,19 +584,173 @@ Stream loading requires working seek/tell functions and therefore may fail when
[endsect] [/stream]
[section:errors Handling parsing errors]
foo
concise syntax (if (!doc.load(...)) ...)
[#xml_parse_result]
All document loading functions return the parsing result via `xml_parse_result` object. It contains parsing status, the offset of last successfully parsed character from the beginning of the source stream, and the encoding of the source stream:
struct xml_parse_result
{
xml_parse_status status;
ptrdiff_t offset;
xml_encoding encoding;
operator bool() const;
const char* description() const;
};
[#xml_parse_status]
[#xml_parse_result::status]
Parsing status is represented as the `xml_parse_status` enumeration and can be one of the following:
* [#status_ok]
`status_ok` means that no error was encountered during parsing; the source stream represents the valid XML document which was fully parsed and converted to a tree.
[lbr]
* [#status_file_not_found]
`status_file_not_found` is only returned by `load_file` function and means that file could not be opened.
* [#status_io_error]
`status_io_error` is returned by `load_file` function and by `load` functions with `std::istream`/`std::wistream` arguments; it means that some I/O error has occurred during reading the file/stream.
* [#status_out_of_memory]
`status_out_of_memory` means that there was not enough memory during some allocation; any allocation failure during parsing results in this error.
* [#status_internal_error]
`status_internal_error` means that something went horribly wrong; currently this error does not occur
[lbr]
* [#status_unrecognized_tag]
`status_unrecognized_tag` means that parsing stopped due to a tag with either an empty name or a name which starts with incorrect character, such as [^#].
* [#status_bad_pi]
`status_bad_pi` means that parsing stopped due to incorrect document declaration/processing instruction
* [#status_bad_comment][#status_bad_cdata][#status_bad_doctype][#status_bad_pcdata]
`status_bad_comment`, `status_bad_cdata`, `status_bad_doctype` and `status_bad_pcdata` mean that parsing stopped due to the invalid construct of the respective type
* [#status_bad_start_element]
`status_bad_start_element` means that parsing stopped because starting tag either had no closing `>` symbol or contained some incorrect symbol
* [#status_bad_attribute]
`status_bad_attribute` means that parsing stopped because there was an incorrect attribute, such as an attribute without value or with value that is not quoted (note that `<node attr=1>` is incorrect in XML)
* [#status_bad_end_element]
`status_bad_end_element` means that parsing stopped because ending tag had incorrect syntax (e.g. extra non-whitespace symbols between tag name and `>`)
* [#status_end_element_mismatch]
`status_end_element_mismatch` means that parsing stopped because the closing tag did not match the opening one (e.g. `<node></nedo>`) or because some tag was not closed at all
[#xml_parse_result::description]
`description()` member function can be used to convert parsing status to a string; the returned message is always in English, so you'll have to write your own function if you need a localized string. However please note that the exact messages returned by `description()` function may change from version to version, so any complex status handling should be based on `status` value.
If parsing failed because the source data was not valid XML, the resulting tree is not destroyed: although the load function returns an error, you can use the part of the tree that was successfully parsed. The last parsed element may have an unexpected name/value; for example, if the attribute value does not end with the necessary quotation mark, as in [^<node attr="value>some data</node>], the value of attribute `attr` will contain the string `value>some data</node>`.
[#xml_parse_result::offset]
In addition to the status code, parsing result has an `offset` member, which contains the offset of last successfully parsed character if parsing failed because of an error in source data; otherwise `offset` is 0. For parsing efficiency reasons, pugixml does not track the current line during parsing; this offset is in units of `pugi::char_t` (bytes for character mode, wide characters for wide character mode). Many text editors support 'Go To Position' feature - you can use it to locate the exact error position. Alternatively, if you're loading the document from memory, you can display the error chunk along with the error description (see the example code below).
[caution Offset is calculated in the XML buffer in native encoding; if encoding conversion is performed during parsing, offset can not be used to reliably track the error position.]
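Since the parser reports only a character offset, a line/column pair has to be computed by the caller. The following is a minimal sketch of one way to do that; `text_position` and `offset_to_position` are hypothetical names introduced for illustration and are not part of pugixml. It assumes that the buffer passed in is the same one that was parsed (no encoding conversion) and that lines end with `\n`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper type, not part of pugixml: a 1-based line/column pair.
struct text_position
{
    std::size_t line;
    std::size_t column;
};

// Converts a character offset (such as the one stored in the `offset` member
// of the parsing result) into a line/column pair by scanning the buffer.
// Assumes '\n' terminates lines and that offsets refer to this exact buffer.
text_position offset_to_position(const std::string& text, std::size_t offset)
{
    assert(offset <= text.size());

    text_position pos = {1, 1};

    for (std::size_t i = 0; i < offset; ++i)
    {
        if (text[i] == '\n')
        {
            // A new line starts after this character.
            ++pos.line;
            pos.column = 1;
        }
        else
        {
            ++pos.column;
        }
    }

    return pos;
}
```

The scan is linear in the offset, which is usually acceptable because it only runs on the error path.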
[#xml_parse_result::encoding]
Parsing result also has an `encoding` member, which can be used to check that the source data encoding was correctly guessed. It is equal to the exact encoding used during parsing (i.e. with the exact endianness); see [sref manual.loading.encoding] for more information.
[#xml_parse_result::bool]
Parsing result object can be implicitly converted to `bool`; if you do not want to handle parsing errors thoroughly, you can just check the return value of load functions as if it was a `bool`: `if (doc.load_file("file.xml")) { ... } else { ... }`.
This is a simple example of handling loading errors ([@samples/load_error_handling.cpp]):
[import samples/load_error_handling.cpp]
[code_load_error_handling]
[endsect] [/errors]
[section:options Parsing options]
foo
All document loading functions accept the optional parameter `options`. This is a bitmask that customizes the parsing process: you can select the node types that are parsed and various transformations that are performed with the XML text. Disabling certain transformations can improve parsing performance for some documents; however, the code for all transformations is very well optimized, and thus the majority of documents won't get any performance benefit. As a rule of thumb, only modify parsing flags if you want to get some nodes in the document that are excluded by default (e.g. declaration or comment nodes).
[note You should use the usual bitwise arithmetic to manipulate the bitmask: to enable a flag, use `mask | flag`; to disable a flag, use `mask & ~flag`.]
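The bitmask operations from the note can be sketched as follows. The flag names and numeric values here (`demo_flag_comments` and so on) are hypothetical and exist only for this illustration; the real constants are defined by pugixml and their values may differ.

```cpp
#include <cassert>

// Hypothetical flag values for illustration only; they are not pugixml's.
const unsigned int demo_flag_comments = 0x01;
const unsigned int demo_flag_cdata    = 0x02;
const unsigned int demo_flag_escapes  = 0x04;

// A hypothetical default mask: CDATA and escapes on, comments off.
const unsigned int demo_mask_default = demo_flag_cdata | demo_flag_escapes;

// Enabling a flag: mask | flag.
unsigned int enable_flag(unsigned int mask, unsigned int flag)
{
    assert(flag != 0);
    return mask | flag;
}

// Disabling a flag: mask & ~flag.
unsigned int disable_flag(unsigned int mask, unsigned int flag)
{
    assert(flag != 0);
    return mask & ~flag;
}

// Testing a flag: (mask & flag) != 0.
bool has_flag(unsigned int mask, unsigned int flag)
{
    return (mask & flag) != 0;
}
```

With the actual constants the same expressions apply directly, for example `parse_default | parse_comments` to additionally get comment nodes, or `parse_default & ~parse_escapes` to turn reference expansion off.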
These flags control the resulting tree contents:
* [#parse_declaration]
`parse_declaration` determines if XML document declaration (node with type [link node_declaration]) is to be put in DOM tree. If this flag is off, it is not put in the tree, but is still parsed and checked for correctness. This flag is *off* by default.
[lbr]
* [#parse_pi]
`parse_pi` determines if processing instructions (nodes with type [link node_pi]) are to be put in DOM tree. If this flag is off, they are not put in the tree, but are still parsed and checked for correctness. Note that `<?xml ...?>` (document declaration) is not considered to be a PI. This flag is *off* by default.
[lbr]
* [#parse_comments]
`parse_comments` determines if comments (nodes with type [link node_comment]) are to be put in DOM tree. If this flag is off, they are not put in the tree, but are still parsed and checked for correctness. This flag is *off* by default.
[lbr]
* [#parse_cdata]
`parse_cdata` determines if CDATA sections (nodes with type [link node_cdata]) are to be put in DOM tree. If this flag is off, they are not put in the tree, but are still parsed and checked for correctness. This flag is *on* by default.
[lbr]
* [#parse_ws_pcdata]
`parse_ws_pcdata` determines if PCDATA nodes (nodes with type [link node_pcdata]) that consist only of whitespace characters are to be put in DOM tree. Often whitespace-only data is not significant for the application, and the cost of allocating and storing such nodes (both memory and speed-wise) can be significant. For example, after parsing XML string `<node> <a/> </node>`, `<node>` element will have three children when `parse_ws_pcdata` is set (child with type `node_pcdata` and value `" "`, child with type `node_element` and name `"a"`, and another child with type `node_pcdata` and value `" "`), and only one child when `parse_ws_pcdata` is not set. This flag is *off* by default.
These flags control the transformation of tree element contents:
* [#parse_escapes]
`parse_escapes` determines if character and entity references are to be expanded during the parsing process. Character references have the form [^&#...;] or [^&#x...;] ([^...] is Unicode numeric representation of character in either decimal ([^&#...;]) or hexadecimal ([^&#x...;]) form), entity references are [^&lt;], [^&gt;], [^&amp;], [^&apos;] and [^&quot;] (note that as pugixml does not handle DTD, the only allowed entities are predefined ones). If character/entity reference can not be expanded, it is left as is, so you can do additional processing later. Reference expansion is performed in attribute values and PCDATA content. This flag is *on* by default.
[lbr]
* [#parse_eol]
`parse_eol` determines if EOL handling (that is, replacing sequences `0x0d 0x0a` by a single `0x0a` character, and replacing all standalone `0x0d` characters by `0x0a`) is to be performed on input data (that is, comments contents, PCDATA/CDATA contents and attribute values). This flag is *on* by default.
[lbr]
* [#parse_wconv_attribute]
`parse_wconv_attribute` determines if attribute value normalization should be performed for all attributes. This means that whitespace characters (new line, tab and space) are replaced with space (`' '`). New line characters are always treated as if `parse_eol` is set, i.e. `\r\n` is converted to single space. This flag is *on* by default.
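To make the EOL and attribute whitespace rules above concrete, here is a small standalone sketch of the two transformations as they are described in the text. The function names are hypothetical, and this is not pugixml's implementation (which works in place on the parse buffer); it only reproduces the described input/output behavior on `std::string`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// EOL handling as described for parse_eol: "\r\n" becomes a single '\n',
// and every standalone '\r' becomes '\n'.
std::string normalize_eol(const std::string& input)
{
    std::string result;
    result.reserve(input.size());

    for (std::size_t i = 0; i < input.size(); ++i)
    {
        if (input[i] == '\r')
        {
            result += '\n';

            // Skip the '\n' of a "\r\n" pair so it is not emitted twice.
            if (i + 1 < input.size() && input[i + 1] == '\n') ++i;
        }
        else
        {
            result += input[i];
        }
    }

    // Normalization can only shrink the text, never grow it.
    assert(result.size() <= input.size());
    return result;
}

// Attribute whitespace conversion as described for parse_wconv_attribute:
// EOL handling is applied first, then new lines and tabs become spaces.
std::string normalize_attribute_whitespace(const std::string& value)
{
    std::string result = normalize_eol(value);

    for (std::size_t i = 0; i < result.size(); ++i)
    {
        if (result[i] == '\n' || result[i] == '\t') result[i] = ' ';
    }

    return result;
}
```

Applying EOL handling before the whitespace replacement is what makes `\r\n` collapse to one space rather than two.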
Additionally there are two predefined option masks:
* [#parse_minimal]
`parse_minimal` has all options turned off. This option mask means that pugixml does not add declaration nodes, PI nodes, CDATA sections and comments to the resulting tree and does not perform any conversion for input data, so theoretically it is the fastest mode. However, as discussed above, in practice `parse_default` is usually equally fast.
[lbr]
* [#parse_default]
`parse_default` is the default set of flags, i.e. it has all options set to their default values. It includes parsing CDATA sections (comments/PIs are not parsed), performing character and entity reference expansion, replacing whitespace characters with spaces in attribute values and performing EOL handling. Note that PCDATA sections consisting only of whitespace characters are not parsed (by default) for performance reasons.
This is a simple example of using different parsing options ([@samples/load_options.cpp]):
[import samples/load_options.cpp]
[code_load_options]
[endsect] [/options]
[section:encoding Encodings]
foo
[#xml_encoding]
pugixml supports all popular Unicode encodings (UTF-8, UTF-16 (big and little endian), UTF-32 (big and little endian); UCS-2 is naturally supported since it's a strict subset of UTF-16) and handles all encoding conversions. Most loading functions accept the optional parameter `encoding`. This is a value of enumeration type `xml_encoding` that can have the following values:
* [#encoding_auto]
`encoding_auto` means that pugixml will try to guess the encoding based on source XML data. The algorithm is a modified version of the one presented in Appendix F.1 of XML recommendation; it tries to match the first few bytes of input data with the following patterns in strict order:
[lbr]
* If first four bytes match UTF-32 BOM (Byte Order Mark), encoding is assumed to be UTF-32 with the endianness equal to that of BOM;
* If first two bytes match UTF-16 BOM, encoding is assumed to be UTF-16 with the endianness equal to that of BOM;
* If first three bytes match UTF-8 BOM, encoding is assumed to be UTF-8;
* If first four bytes match UTF-32 representation of [^<], encoding is assumed to be UTF-32 with the corresponding endianness;
* If first four bytes match UTF-16 representation of [^<?], encoding is assumed to be UTF-16 with the corresponding endianness;
* If first two bytes match UTF-16 representation of [^<], encoding is assumed to be UTF-16 with the corresponding endianness (this guess may yield incorrect result, but it's better than UTF-8);
* Otherwise encoding is assumed to be UTF-8.
[lbr]
* [#encoding_utf8]
`encoding_utf8` corresponds to UTF-8 encoding as defined in Unicode standard; UTF-8 sequences with length equal to 5 or 6 are not standard and are rejected.
* [#encoding_utf16_le]
`encoding_utf16_le` corresponds to little-endian UTF-16 encoding as defined in Unicode standard; surrogate pairs are supported.
* [#encoding_utf16_be]
`encoding_utf16_be` corresponds to big-endian UTF-16 encoding as defined in Unicode standard; surrogate pairs are supported.
* [#encoding_utf16]
`encoding_utf16` corresponds to UTF-16 encoding as defined in Unicode standard; the endianness is assumed to be that of target platform.
* [#encoding_utf32_le]
`encoding_utf32_le` corresponds to little-endian UTF-32 encoding as defined in Unicode standard.
* [#encoding_utf32_be]
`encoding_utf32_be` corresponds to big-endian UTF-32 encoding as defined in Unicode standard.
* [#encoding_utf32]
`encoding_utf32` corresponds to UTF-32 encoding as defined in Unicode standard; the endianness is assumed to be that of target platform.
* [#encoding_wchar]
`encoding_wchar` corresponds to the encoding of `wchar_t` type; it has the same meaning as either `encoding_utf16` or `encoding_utf32`, depending on `wchar_t` size.
The algorithm used for `encoding_auto` correctly detects any supported Unicode encoding for all well-formed XML documents (since they start with document declaration) and for all other XML documents that start with [^<]; if your XML document does not start with [^<] and has encoding that is different from UTF-8, use the specific encoding.
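The pattern matching order listed for `encoding_auto` can be sketched as a standalone function. This is an illustration of the described steps only, not pugixml's actual code; `guessed_encoding`, `has_prefix` and `guess_encoding` are hypothetical names, and the byte patterns are the standard BOM and `<`/`<?` representations in each encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch (not pugixml's xml_encoding).
enum guessed_encoding
{
    guess_utf8,
    guess_utf16_le,
    guess_utf16_be,
    guess_utf32_le,
    guess_utf32_be
};

// Returns true if data starts with the given byte pattern.
bool has_prefix(const std::string& data, const unsigned char* pattern, std::size_t length)
{
    assert(pattern != 0);
    if (data.size() < length) return false;

    for (std::size_t i = 0; i < length; ++i)
    {
        if (static_cast<unsigned char>(data[i]) != pattern[i]) return false;
    }

    return true;
}

// Checks the patterns in the strict order given in the text.
guessed_encoding guess_encoding(const std::string& data)
{
    static const unsigned char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
    static const unsigned char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
    static const unsigned char utf16_be_bom[] = {0xFE, 0xFF};
    static const unsigned char utf16_le_bom[] = {0xFF, 0xFE};
    static const unsigned char utf8_bom[]     = {0xEF, 0xBB, 0xBF};
    static const unsigned char utf32_be_lt[]  = {0x00, 0x00, 0x00, 0x3C};
    static const unsigned char utf32_le_lt[]  = {0x3C, 0x00, 0x00, 0x00};
    static const unsigned char utf16_be_pi[]  = {0x00, 0x3C, 0x00, 0x3F};
    static const unsigned char utf16_le_pi[]  = {0x3C, 0x00, 0x3F, 0x00};
    static const unsigned char utf16_be_lt[]  = {0x00, 0x3C};
    static const unsigned char utf16_le_lt[]  = {0x3C, 0x00};

    // 1. UTF-32 BOM (must precede the UTF-16 check: FF FE is a prefix of FF FE 00 00).
    if (has_prefix(data, utf32_be_bom, 4)) return guess_utf32_be;
    if (has_prefix(data, utf32_le_bom, 4)) return guess_utf32_le;

    // 2. UTF-16 BOM.
    if (has_prefix(data, utf16_be_bom, 2)) return guess_utf16_be;
    if (has_prefix(data, utf16_le_bom, 2)) return guess_utf16_le;

    // 3. UTF-8 BOM.
    if (has_prefix(data, utf8_bom, 3)) return guess_utf8;

    // 4. '<' in UTF-32.
    if (has_prefix(data, utf32_be_lt, 4)) return guess_utf32_be;
    if (has_prefix(data, utf32_le_lt, 4)) return guess_utf32_le;

    // 5. "<?" in UTF-16.
    if (has_prefix(data, utf16_be_pi, 4)) return guess_utf16_be;
    if (has_prefix(data, utf16_le_pi, 4)) return guess_utf16_le;

    // 6. '<' in UTF-16 (a weaker guess, as noted in the text).
    if (has_prefix(data, utf16_be_lt, 2)) return guess_utf16_be;
    if (has_prefix(data, utf16_le_lt, 2)) return guess_utf16_le;

    // 7. Fallback.
    return guess_utf8;
}
```

The order matters: the four-byte patterns are tested before the two-byte ones that they begin with, otherwise UTF-32 little-endian input would be misreported as UTF-16.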
[note The current behavior for Unicode conversion is to skip all invalid UTF sequences during conversion. This behavior should not be relied upon; moreover, in case no encoding conversion is performed, the invalid sequences are not removed, so you'll get them as is in node/attribute contents.]
[endsect] [/encoding]
[section:w3c W3C specification conformance]
[section:w3c W3C recommendation conformance]
foo
[endsect] [/w3c]
@ -784,6 +938,8 @@ First private release for testing purposes
[section:apiref API reference]
This is the reference for all macros, types, enumerations, classes and functions in pugixml. Each symbol is a link that leads to the relevant section of the user manual.
Macros:
* `#define `[link PUGIXML_WCHAR_MODE]
@ -796,8 +952,8 @@ Macros:
Types:
* `typedef `/configuration-defined type/ [link char_t]`;`
* `typedef `/configuration-defined type/ [link string_t]`;`
* `typedef `/configuration-defined type/` `[link char_t]`;`
* `typedef `/configuration-defined type/` `[link string_t]`;`
* `typedef void* (*`[link allocation_function]`)(size_t size);`
* `typedef void (*`[link deallocation_function]`)(void* ptr);`
@ -813,16 +969,33 @@ Enumerations:
* [link node_pi]
* [link node_declaration]
* xml_encoding
* encoding_auto
* encoding_utf8
* encoding_utf16_le
* encoding_utf16_be
* encoding_utf16
* encoding_utf32_le
* encoding_utf32_be
* encoding_utf32
* encoding_wchar
* `enum `[link xml_parse_status]
* [link status_ok]
* [link status_file_not_found]
* [link status_io_error]
* [link status_out_of_memory]
* [link status_internal_error]
* [link status_unrecognized_tag]
* [link status_bad_pi]
* [link status_bad_comment]
* [link status_bad_cdata]
* [link status_bad_doctype]
* [link status_bad_pcdata]
* [link status_bad_start_element]
* [link status_bad_attribute]
* [link status_bad_end_element]
* [link status_end_element_mismatch]
* [link xml_encoding]
* [link encoding_auto]
* [link encoding_utf8]
* [link encoding_utf16_le]
* [link encoding_utf16_be]
* [link encoding_utf16]
* [link encoding_utf32_le]
* [link encoding_utf32_be]
* [link encoding_utf32]
* [link encoding_wchar]
* xpath_value_type
* xpath_type_none
@ -831,23 +1004,6 @@ Enumerations:
* xpath_type_string
* xpath_type_boolean
* xml_parse_status
* status_ok
* status_file_not_found
* status_io_error
* status_out_of_memory
* status_internal_error
* status_unrecognized_tag
* status_bad_pi
* status_bad_comment
* status_bad_cdata
* status_bad_doctype
* status_bad_pcdata
* status_bad_start_element
* status_bad_attribute
* status_bad_end_element
* status_end_element_mismatch
Constants:
* Formatting options bit flags:
@ -858,16 +1014,16 @@ Constants:
* format_write_bom
* Parsing options bit flags:
* parse_cdata
* parse_comments
* parse_declaration
* parse_default
* parse_eol
* parse_escapes
* parse_minimal
* parse_pi
* parse_ws_pcdata
* parse_wconv_attribute
* [link parse_cdata]
* [link parse_comments]
* [link parse_declaration]
* [link parse_default]
* [link parse_eol]
* [link parse_escapes]
* [link parse_minimal]
* [link parse_pi]
* [link parse_ws_pcdata]
* [link parse_wconv_attribute]
Classes:
@ -980,8 +1136,8 @@ Classes:
* xpath_node_set select_nodes(const char_t* query) const;
* xpath_node_set select_nodes(const xpath_query& query) const;
* void print(xml_writer& writer, const char_t* indent = PUGIXML_TEXT("\t"), unsigned int flags = format_default, xml_encoding encoding = encoding_auto, unsigned int depth = 0) const;
* void print(std::basic_ostream<char, std::char_traits<char> >& os, const char_t* indent = PUGIXML_TEXT("\t"), unsigned int flags = format_default, xml_encoding encoding = encoding_auto, unsigned int depth = 0) const;
* void print(std::basic_ostream<wchar_t, std::char_traits<wchar_t> >& os, const char_t* indent = PUGIXML_TEXT("\t"), unsigned int flags = format_default, unsigned int depth = 0) const;
* void print(std::ostream& os, const char_t* indent = PUGIXML_TEXT("\t"), unsigned int flags = format_default, xml_encoding encoding = encoding_auto, unsigned int depth = 0) const;
* void print(std::wostream& os, const char_t* indent = PUGIXML_TEXT("\t"), unsigned int flags = format_default, unsigned int depth = 0) const;
* typedef xml_node_iterator iterator;
* typedef xml_attribute_iterator attribute_iterator;
@ -1018,13 +1174,22 @@ Classes:
* void save(xml_writer& writer, const char_t* indent = PUGIXML_TEXT("\t"), unsigned int flags = format_default, xml_encoding encoding = encoding_auto) const;
[lbr]
* void save(std::basic_ostream<char, std::char_traits<char> >& stream, const char_t* indent = PUGIXML_TEXT("\t"), unsigned int flags = format_default, xml_encoding encoding = encoding_auto) const;
* void save(std::basic_ostream<wchar_t, std::char_traits<wchar_t> >& stream, const char_t* indent = PUGIXML_TEXT("\t"), unsigned int flags = format_default) const;
* void save(std::ostream& stream, const char_t* indent = PUGIXML_TEXT("\t"), unsigned int flags = format_default, xml_encoding encoding = encoding_auto) const;
* void save(std::wostream& stream, const char_t* indent = PUGIXML_TEXT("\t"), unsigned int flags = format_default) const;
[lbr]
* bool save_file(const char* path, const char_t* indent = PUGIXML_TEXT("\t"), unsigned int flags = format_default, xml_encoding encoding = encoding_auto) const;
[lbr]
* `struct `[link xml_parse_result]
* `xml_parse_status `[link xml_parse_result::status status]`;`
* `ptrdiff_t `[link xml_parse_result::offset offset]`;`
* `xml_encoding `[link xml_parse_result::encoding encoding]`;`
[lbr]
* `operator `[link xml_parse_result::bool bool]`() const;`
* `const char* `[link xml_parse_result::description description]`() const;`
* xpath_query
* explicit xpath_query(const char_t* query);
* ~xpath_query();
@ -1043,8 +1208,8 @@ Classes:
* virtual void write(const void* data, size_t size);
* xml_writer_stream
* xml_writer_stream(std::basic_ostream<char, std::char_traits<char> >& stream);
* xml_writer_stream(std::basic_ostream<wchar_t, std::char_traits<wchar_t> >& stream);
* xml_writer_stream(std::ostream& stream);
* xml_writer_stream(std::wostream& stream);
* virtual void write(const void* data, size_t size);
* xml_node_iterator
@ -1059,13 +1224,6 @@ Classes:
* virtual bool for_each(xml_node&) = 0;
* virtual bool end(xml_node&);
* xml_parse_result
* xml_parse_status status;
* ptrdiff_t offset;
* xml_encoding encoding;
* operator bool() const
* const char* description() const;
* xpath_exception
* explicit xpath_exception(const char* message);
* virtual const char* what() const throw();