25 Aug 17:22
Re: Encoding again
From: Stefan Behnel <stefan_ml <at> behnel.de>
Subject: Re: Encoding again
Newsgroups: gmane.comp.python.lxml.devel
Date: 2008-08-25 15:22:29 GMT
Max Ivanov wrote:
>> Max Ivanov wrote:
>>> Is there any way to force lxml to make element.text and element.tail
>>> to be exactly the same as in original text, without any encoding
>>> manipulation? Or to restore them to original state, i.e. maybe
>>> somewhere inside lxml there is a var which contain original encoding,
>>> so I could do element.text.encode('...')?
>>
>> I'm not sure I understand what you want, but in case you want lxml.etree to
>> return the encoded byte string instead of the unicode string: no, there is no
>> switch to do that. I have no idea why you would want to do that, though.
>>
>> The original encoding is stored in the docinfo property of the ElementTree of
>> the document.
>
> Ok, I'll explain it in Python since my English isn't ok for this task
> =) I've attached a simple test case. None of the assertions there pass.
I can't test it right now, but this might work for you. I just provided
the parser with the right encoding information. Note that your "HTML
document" does not specify an encoding, so I assume that the parser just
expects it to be latin-1 or some other plain byte encoding, and reads the
bytes as they come in. To be clear: it's the document that's broken here,
not the parser.
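
For readers without the attachment, here is a minimal sketch of what passing the encoding to the parser could look like. It is not the attached code: the sample bytes and the variable names are made up for illustration, and the document is assumed to be UTF-8 with no charset declaration.

```python
from lxml import etree

# Hypothetical input (not the original test case): UTF-8 bytes for an
# HTML document that carries no charset declaration of its own.
raw = u"<html><body><p>caf\u00e9</p></body></html>".encode("utf-8")

# Tell the parser which encoding the bytes are in, rather than letting
# it fall back to a default for an undeclared document.
parser = etree.HTMLParser(encoding="utf-8")
root = etree.fromstring(raw, parser)

# .text is a decoded unicode string, not the original bytes.
text = root.find(".//p").text

# The encoding recorded for the document, as mentioned in the quoted
# reply; what is reported here can depend on how the input was parsed.
doc_encoding = root.getroottree().docinfo.encoding

# To get bytes back, encode explicitly with the encoding you expect.
round_trip = text.encode("utf-8")
```

With the encoding supplied up front, text should hold the decoded characters, and encoding it again with the same codec should give back the bytes that were in the input.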
Note that you can also pass unicode strings into the parser, so if you
manage to decode your HTML data into correct unicode, the parser will do