Ben | 9 May 16:09

Getting info from an XML file that has invalid character data in it (and how to specify recover option)

Hello

I'm writing some code to check whether our daily backups worked. Backup Exec stores its
results in XML files. Sometimes bad characters (or maybe binary data) end up in
these XML files, and then lxml chokes:

C:\>python sb-lxml.py
Traceback (most recent call last):
  File "sb-lxml.py", line 5, in <module>
    Xml = etree.parse(XmlFileName)
  File "lxml.etree.pyx", line 2520, in lxml.etree.parse (src/lxml/lxml.etree.c:22062)
  File "parser.pxi", line 1309, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:53088)
  File "parser.pxi", line 1338, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:53337)
  File "parser.pxi", line 1248, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:52584)
  File "parser.pxi", line 828, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:50115)
  File "parser.pxi", line 452, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:47023)
  File "parser.pxi", line 536, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:47861)
  File "parser.pxi", line 478, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47285)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 11, line 132, column 95

The offending line looks like this (not sure if the bad characters will make it through
the email):

</error><error>Directory not found. Can not backup directory \Data\\l Strategy - Progress

Stefan Behnel | 9 May 16:42

Re: Getting info from an XML file that has invalid character data in it (and how to specify recover option)

Hi,

Ben wrote:
> Xml = etree.parse(XmlFileName)
> ##############################
> XmlFileName = r'c:/BEX03194.xml'
> parser = etree.XMLParser(recover=True)
> Xml   = etree.parse(StringIO(XmlFileName), parser)

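Not sure if this is just a "find-a-short-example" error, but you parse the
filename, not the file here: StringIO(XmlFileName) wraps the path string
itself, so the parser only ever sees the text 'c:/BEX03194.xml' and never
opens the file. This should read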

   Xml   = etree.parse(XmlFileName, parser)
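
For illustration, here is a minimal self-contained sketch of that call. The
sample document and the temporary file are made up for the example (they are
not the real Backup Exec log); the only thing taken from the report is that
the data contains a character with value 11, which XML 1.0 does not allow.

```python
import os
import tempfile

from lxml import etree

# Hypothetical sample data standing in for the real log file.
# \x0b is character value 11 (vertical tab), which is invalid in XML 1.0.
broken = (b"<joblog><error>Directory not found\x0b</error>"
          b"<end_time>16:09</end_time></joblog>")

# Write the sample to a temporary file so we can parse it by filename.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as f:
    f.write(broken)
    xml_file_name = f.name

try:
    # The default parser rejects the document outright.
    try:
        etree.parse(xml_file_name)
        strict_failed = False
    except etree.XMLSyntaxError:
        strict_failed = True

    # With recover=True the parser tries to carry on past the bad character.
    parser = etree.XMLParser(recover=True)
    tree = etree.parse(xml_file_name, parser)  # pass the filename, not StringIO(filename)
    end_time = tree.findtext(".//end_time")
finally:
    os.remove(xml_file_name)

print(strict_failed, end_time)
```

Note that a recovering parser may silently drop the content it cannot
handle, so the resulting tree is not guaranteed to match the original data.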

> Also, I've tried the 'recover' parser option, but I'm doing something wrong,
> because I get this:
>
> C:\>python sb-lxml.py
> Traceback (most recent call last):
>   File "sb-lxml.py", line 9, in <module>
>     print Xml.findtext(".//end_time")
>   File "lxml.etree.pyx", line 1656, in lxml.etree._ElementTree.findtext (src/lxml/lxml.etree.c:15354)
>   File "lxml.etree.pyx", line 1489, in lxml.etree._ElementTree._assertHasRoot (src/lxml/lxml.etree.c:14116)
> AssertionError: ElementTree not initialized, missing root

I guess that happens when the parser "recover"s from not finding any XML at
all. Maybe we should still raise an exception in this case instead of
returning an empty ElementTree. This is really an extreme case of broken data...
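
Until that is settled, calling code can guard against both outcomes. The
following is only a sketch: parse_leniently is a hypothetical helper name, and
it assumes that a recovering parser either raises XMLSyntaxError or hands back
a tree whose getroot() is None when it finds no XML at all.

```python
from io import BytesIO

from lxml import etree


def parse_leniently(source):
    """Parse with recover=True.

    Returns the root element, or None if the parser found nothing usable
    (either by raising XMLSyntaxError or by returning a tree without a root).
    """
    parser = etree.XMLParser(recover=True)
    try:
        tree = etree.parse(source, parser)
    except etree.XMLSyntaxError:
        return None
    return tree.getroot()


# Parsing the filename string instead of the file: there is no XML in it.
no_xml = parse_leniently(BytesIO(b"c:/BEX03194.xml"))

# A well-formed (made-up) document still parses as usual.
root = parse_leniently(BytesIO(b"<joblog><end_time>16:09</end_time></joblog>"))

if no_xml is None:
    print("no usable XML found")
if root is not None:
    print(root.findtext(".//end_time"))
```

Checking the root before calling findtext() avoids the "missing root"
AssertionError shown above.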