Re: HTMLParser encoding
Max Ivanov wrote:
>>> If there is no meta tag defining the document encoding, how does
>>> HTMLParser convert text data into Unicode? Does it contain some
>>> encoding detection machinery?
>> Yes, but that's implemented in libxml2 and I don't know much about the
>> details. There are some ways to help it, though, in case it gets it wrong. If
>> you can provide the proper encoding (e.g. as provided through HTTP, MIME or
>> some other source), you can pass it to the parser when you create it. Or, you
>> can decode the data to a unicode string and pass that to the parser.
>
> I plan to use the chardet module (http://chardet.feedparser.org/) to
> detect the charset if no meta tag is present. chardet needs untouched
> text for proper detection. I can't pass it unicode text from
> element.text or .text_content(), and I can't pass plain text full of
> tags either, since that makes chardet return wrong results. Is there
> any way to restore the original text from element.text or text_content()?
Really, you should run chardet on the raw bytes before parsing the
document, then decode them and parse the resulting unicode document.
There's not much point in running chardet after parsing: by then the
text has already been decoded, so it's far too late to do anything useful.
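
Roughly, the order of operations could look like the sketch below. This is only an illustration, not tested against your documents. The helper names decode_html and parse_html, and the declared_encoding and detect parameters, are made up for this example. It assumes chardet.detect() is given the raw bytes and that lxml.html.fromstring() is handed the already-decoded string. Both third-party imports are deferred so the decoding step can be tried on its own.

```python
def decode_html(raw, declared_encoding=None, detect=None):
    """Decode raw HTML bytes to a unicode string *before* parsing.

    declared_encoding: an encoding known from HTTP/MIME headers, if any.
    detect: optional callable taking bytes and returning a dict with an
            "encoding" key (hypothetical hook; defaults to chardet.detect).
    """
    if declared_encoding:
        # Trust the externally supplied encoding over guessing.
        return raw.decode(declared_encoding, errors="replace")
    if detect is None:
        import chardet  # third-party; imported only when a guess is needed
        detect = chardet.detect
    # chardet may report None when it cannot decide; fall back to UTF-8
    # (the fallback is an assumption of this sketch).
    guess = detect(raw).get("encoding") or "utf-8"
    return raw.decode(guess, errors="replace")


def parse_html(raw, declared_encoding=None, detect=None):
    """Detect and decode first, then parse the unicode document."""
    from lxml import html  # third-party
    return html.fromstring(decode_html(raw, declared_encoding, detect))
```

If you already know the encoding (say, from the HTTP Content-Type header), pass it as declared_encoding and chardet is never consulted. Otherwise chardet sees the untouched bytes, which is what it needs.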
--
Ian Bicking : ianb <at> colorstudy.com : http://blog.ianbicking.org