Max Ivanov | 23 Aug 12:27

HTMLParser encoding

If there is no meta tag defining the document encoding, how does HTMLParser
convert text data into Unicode? Does it contain some encoding
detection machinery?

Stefan Behnel | 23 Aug 14:02

Re: HTMLParser encoding

Hi,

Max Ivanov wrote:
> If there is no meta tag defining the document encoding, how does HTMLParser
> convert text data into Unicode? Does it contain some encoding
> detection machinery?

Yes, but that's implemented in libxml2 and I don't know much about the
details. There are some ways to help it, though, in case it gets it wrong. If
you can provide the proper encoding (e.g. as provided through HTTP, MIME or
some other source), you can pass it to the parser when you create it. Or, you
can decode the data to a unicode string and pass that to the parser.
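A minimal sketch of both options, assuming the encoding arrives in an HTTP
Content-Type header. The helper names charset_from_content_type and
parse_html are made up for illustration; the lxml calls used are
etree.HTMLParser(encoding=...) and etree.fromstring(data, parser).

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Extract the charset parameter from a Content-Type header value.

    Hypothetical helper; returns None if the header names no charset.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def parse_html(raw_bytes, content_type=None, decode_first=False):
    """Parse HTML bytes, telling lxml the encoding when we know it.

    Hypothetical helper. lxml is a third-party package, imported here so
    the header helper above stays usable without it.
    """
    from lxml import etree

    encoding = charset_from_content_type(content_type) if content_type else None
    if encoding is None:
        # Nothing known: leave detection to libxml2.
        return etree.fromstring(raw_bytes, etree.HTMLParser())
    if decode_first:
        # Option 2: decode ourselves and hand the parser a unicode string.
        return etree.fromstring(raw_bytes.decode(encoding), etree.HTMLParser())
    # Option 1: pass the encoding to the parser when creating it.
    return etree.fromstring(raw_bytes, etree.HTMLParser(encoding=encoding))
```

Usage would look like parse_html(data, "text/html; charset=windows-1251").
Which of the two options behaves better on broken input is not something
this sketch tries to answer.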

Stefan

Max Ivanov | 23 Aug 14:52

Re: HTMLParser encoding

>> If there is no meta tag defining the document encoding, how does HTMLParser
>> convert text data into Unicode? Does it contain some encoding
>> detection machinery?
>
> Yes, but that's implemented in libxml2 and I don't know much about the
> details. There are some ways to help it, though, in case it gets it wrong. If
> you can provide the proper encoding (e.g. as provided through HTTP, MIME or
> some other source), you can pass it to the parser when you create it. Or, you
> can decode the data to a unicode string and pass that to the parser.

I plan to use the chardet module (http://chardet.feedparser.org/) to
detect the charset if no meta tag is present. chardet needs untouched text
for proper detection. I couldn't pass it the unicode text from
element.text or .text_content(), and I couldn't pass plain text full of
tags either, since that makes chardet return wrong results. Is there any way to
restore the original text from element.text or text_content()?

Ian Bicking | 28 Aug 19:33

Re: HTMLParser encoding

Max Ivanov wrote:
>>> If there is no meta tag defining the document encoding, how does HTMLParser
>>> convert text data into Unicode? Does it contain some encoding
>>> detection machinery?
>> Yes, but that's implemented in libxml2 and I don't know much about the
>> details. There are some ways to help it, though, in case it gets it wrong. If
>> you can provide the proper encoding (e.g. as provided through HTTP, MIME or
>> some other source), you can pass it to the parser when you create it. Or, you
>> can decode the data to a unicode string and pass that to the parser.
> 
> I plan to use the chardet module (http://chardet.feedparser.org/) to
> detect the charset if no meta tag is present. chardet needs untouched text
> for proper detection. I couldn't pass it the unicode text from
> element.text or .text_content(), and I couldn't pass plain text full of
> tags either, since that makes chardet return wrong results. Is there any way to
> restore the original text from element.text or text_content()?

Really you should run chardet before parsing the document, then parse 
the unicode document.  There's not much purpose to running chardet after 
parsing, as it's far too late to do anything useful.
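
A rough sketch of that order of operations: detect on the raw bytes,
decode, then parse the unicode result. The function names decode_html and
parse_detected and the utf-8 fallback are assumptions made for
illustration; chardet.detect() and lxml.html.fromstring() are the only
third-party calls used, and both packages are imported lazily.

```python
def decode_html(raw_bytes, declared=None, fallback="utf-8"):
    """Decode raw HTML bytes and return (text, encoding_used).

    Hypothetical helper. Tries a declared encoding first, then chardet's
    guess if chardet is installed, then a lossy fallback decode.
    """
    candidates = []
    if declared:
        candidates.append(declared)
    try:
        import chardet  # third-party; optional in this sketch

        guess = chardet.detect(raw_bytes).get("encoding")
        if guess:
            candidates.append(guess)
    except ImportError:
        pass

    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding), encoding
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or wrong guess: try the next candidate.
            continue
    return raw_bytes.decode(fallback, errors="replace"), fallback


def parse_detected(raw_bytes, declared=None):
    """Decode before parsing, then give lxml the unicode document."""
    import lxml.html  # third-party

    text, _encoding = decode_html(raw_bytes, declared)
    return lxml.html.fromstring(text)
```

The detection runs on the undecoded bytes, so nothing has to be restored
from element.text afterwards.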

-- 
Ian Bicking : ianb <at> colorstudy.com : http://blog.ianbicking.org
