1 Jul 14:03
Re: A quick and simple xpath solution for nasty HTML (was Re: Premature end of data in tag - but it looks well formed)
Stefan Behnel <stefan_ml <at> behnel.de>
2008-07-01 12:03:35 GMT
Hi,

Mike MacCana wrote:
> I solved the crap HTML problem as follows. Hopefully the following will
> be useful to anyone beginning XPath with lxml.

Just adding a few comments as I see fit.

> ## Function to strip non-ascii characters
> ## See http://en.wikipedia.org/wiki/Ascii#ASCII_printable_characters
> ## for list
> def onlyascii(char):
>     if ord(char) < 32 or ord(char) > 176:
>         return ''
>     else:
>         return char

Note that this will not work as expected with multi-byte encodings such as
UTF-8.

> ## We can now access our cleaned content as 'cleanedcontent'
> cleanedcontent=cleaner.clean_html(asciihtml)

This will (obviously) parse the HTML into a tree internally, so it's more
efficient to pass a parsed tree directly.

> ## Go parse our content
> cleanedcontentstringio = StringIO(cleanedcontent)
> parser = etree.XMLParser(recover=True)
> tree = etree.parse(cleanedcontentstringio)
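
To illustrate the multi-byte remark above, here is a minimal sketch. It assumes Python 3, and the names onlyascii_bytes, onlyascii_text and the sample string are made up for illustration; they are not from the original post. It applies the quoted range check (keep 32..176) to each byte of UTF-8 encoded input, which drops the lead byte of a two-byte sequence but keeps its continuation byte, so the result is no longer valid UTF-8. Filtering after decoding avoids this.

```python
# Python 3 sketch. Function names and sample data are illustrative only.

def onlyascii_bytes(data: bytes) -> bytes:
    """Apply the quoted check (keep values 32..176) to each byte."""
    return bytes(b for b in data if 32 <= b <= 176)


def onlyascii_text(text: str) -> str:
    """Filter decoded characters instead, keeping printable ASCII (32..126)."""
    return "".join(ch for ch in text if 32 <= ord(ch) <= 126)


sample = "café £5"
raw = sample.encode("utf-8")      # 'é' and '£' each become two bytes

mangled = onlyascii_bytes(raw)    # lead bytes 0xC3/0xC2 dropped, 0xA9/0xA3 kept
try:
    mangled.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False           # stray continuation bytes are not valid UTF-8

cleaned = onlyascii_text(raw.decode("utf-8"))  # decode first, then filter
```

Note that both filters also drop tabs and newlines, since those are below 32, just as the quoted function does.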