Stefan Behnel | 25 Aug 17:22

Re: Encoding again

Max Ivanov wrote:
>> Max Ivanov wrote:
>>> Is there any way to force lxml to make element.text and element.tail
>>> exactly the same as in the original text, without any encoding
>>> manipulation? Or to restore them to their original state, i.e. maybe
>>> somewhere inside lxml there is a variable which contains the original
>>> encoding, so I could do element.text.encode('...')?
>>
>> I'm not sure I understand what you want, but in case you want lxml.etree
>> to return the encoded byte string instead of the unicode string: no,
>> there is no switch to do that. I have no idea why you would want to do
>> that, though.
>>
>> The original encoding is stored in the docinfo property of the
>> ElementTree of the document.
>
> Ok, I'll explain it in Python since my English isn't good enough for this
> task =) I've attached a simple test case. None of the assertions there pass.

I can't test it right now, but this might work for you. I just provided
the parser with the right encoding information. Note that your "HTML
document" does not specify an encoding, so I assume that the parser just
expects it to be latin-1 or some other plain byte encoding, and reads the
bytes as they come in. To be clear: it's the document that's broken here,
not the parser.
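
A minimal sketch of what "providing the parser with the right encoding information" can look like. It is written for Python 3; the sample bytes and the choice of windows-1251 are assumptions for illustration and are not taken from the attached test case.

```python
from lxml import etree

# Hypothetical sample: "Privet" (Russian for "hello") encoded as windows-1251,
# with no <meta> charset, so the document itself gives the parser no hint.
original_text = "\u041f\u0440\u0438\u0432\u0435\u0442"
raw_html = (
    b"<html><body><p>"
    + original_text.encode("windows-1251")
    + b"</p></body></html>"
)

# Tell the parser which encoding to use instead of letting it guess.
parser = etree.HTMLParser(encoding="windows-1251")
root = etree.HTML(raw_html, parser)

paragraph_text = root.findtext(".//p")
print(paragraph_text == original_text)
```

The encoding still has to come from the caller; the parser only applies it.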

Note that you can also pass unicode strings into the parser, so if you
manage to decode your HTML data into correct unicode, the parser will do
[...]
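
One way to "decode your HTML data into correct unicode" before parsing is to try a short list of likely encodings. The sketch below is only an illustration: `decode_with_fallbacks` and its candidate list are hypothetical, and only the caller can know which guesses are sensible for their data.

```python
def decode_with_fallbacks(raw_bytes, candidate_encodings=("utf-8", "windows-1251")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: the candidate list is a guess made by the caller;
    nothing in the bytes themselves identifies the encoding.
    """
    for encoding in candidate_encodings:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a character, so it never fails.
    # The result may be wrong, but it is reversible.
    return raw_bytes.decode("latin-1"), "latin-1"


sample_text = "\u041f\u0440\u0438\u0432\u0435\u0442"
utf8_page = ("<p>" + sample_text + "</p>").encode("utf-8")
cp1251_page = ("<p>" + sample_text + "</p>").encode("windows-1251")

print(decode_with_fallbacks(utf8_page)[1])
print(decode_with_fallbacks(cp1251_page)[1])
```

The decoded text can then be handed to the parser as a unicode string. Note that a wrong guess can still decode without raising an error, so the order of the candidates matters.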

Max Ivanov | 25 Aug 22:15

Re: Encoding again

> I can't test it right now, but this might work for you. I just provided
> the parser with the right encoding information. Note that your "HTML
> document" does not specify an encoding, so I assume that the parser just
> expects it to be latin-1 or some other plain byte encoding, and reads the
> bytes as they come in. To be clear: it's the document that's broken here,
> not the parser.

Yes indeed. I understand that the document is broken, but that is exactly
the case: I have to process even broken HTML pages. Moreover, lxml already
does a lot of heavy lifting to make processing broken HTML much easier. I'm
talking about another step in that direction. There are a lot of pages in
the Russian segment of the internet with no charset specified. All of them
contain lots of characters with codes > 127. Do you agree that if you
pass in some data, it is reasonable to expect to get exactly the same
data back? Nowadays we have:

origdata = 'some string with codes > 127 (national chars)'
xml = '<root>' + origdata + '</root>'
# ... parse it with lxml ...
rettext = doc.text_content()
isinstance(rettext, unicode)  # True! But the original text was not unicode.
# OK, convert the original text to unicode to compare
unidata = origdata.decode('original encoding')
unidata == rettext  # False! lxml makes garbage from our text.

XML is all about tags and attributes, so why does lxml affect the content
of elements? It should leave the content as is if it doesn't know what to
do with it (that is, when there is no charset information and it is unable
to detect the encoding).
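
If the parser really did fall back to Latin-1, as suggested earlier in the thread, the "garbage" is reversible. The sketch below only simulates that fallback with plain Python 3 string operations; whether a given lxml/libxml2 version actually uses Latin-1 here is an assumption, and windows-1251 is a hypothetical source encoding.

```python
# Hypothetical original page content, encoded as windows-1251.
original_bytes = "\u041f\u0440\u0438\u0432\u0435\u0442".encode("windows-1251")

# Simulate the assumed fallback: read each byte as a Latin-1 character.
# This stands in for element.text; it is not produced by lxml here.
garbled_text = original_bytes.decode("latin-1")

# Latin-1 maps bytes 0-255 one-to-one onto code points 0-255, so encoding
# the garbled text back to Latin-1 returns the original bytes unchanged.
recovered_bytes = garbled_text.encode("latin-1")
recovered_text = recovered_bytes.decode("windows-1251")

print(recovered_bytes == original_bytes)
```

This only works for a fallback codec that maps every byte to exactly one character; it would not hold if the bytes had been interpreted as UTF-8.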


Stefan Behnel | 26 Aug 08:58

Re: Encoding again

Hi,

Max Ivanov wrote:
>> I can't test it right now, but this might work for you. I just provided
>> the parser with the right encoding information. Note that your "HTML
>> document" does not specify an encoding, so I assume that the parser just
>> expects it to be latin-1 or some other plain byte encoding, and reads the
>> bytes as they come in. To be clear: it's the document that's broken here,
>> not the parser.
> 
> Yes indeed. I understand that the document is broken, but that is exactly
> the case: I have to process even broken HTML pages. Moreover, lxml already
> does a lot of heavy lifting to make processing broken HTML much easier. I'm
> talking about another step in that direction. There are a lot of pages in
> the Russian segment of the internet with no charset specified. All of them
> contain lots of characters with codes > 127. Do you agree that if you
> pass in some data, it is reasonable to expect to get exactly the same
> data back?

What you pass is a byte stream of unknown encoding. What you get back is a
tree with well defined characters. Isn't that great enough?

> Nowadays we have:
>
> origdata = 'some string with codes > 127 (national chars)'
> xml = '<root>' + origdata + '</root>'
> # ... parse it with lxml ...
> rettext = doc.text_content()
> isinstance(rettext, unicode)  # True! But the original text was not unicode.


Max Ivanov | 26 Aug 11:09

Re: Encoding again

> What you pass is a byte stream of unknown encoding. What you get back is a
> tree with well defined characters. Isn't that great enough?
In some cases (original text in ASCII) there are well-defined
characters; in other cases it is garbage. Why can't you just leave the
content inside tags as is when the original encoding is unknown and the
parser is unable to detect it from the data (no <meta> tag, for example)?

I'm just asking for a new keyword argument which disables any
processing of unknown byte streams inside tags. That would make lxml
more useful in a wider range of situations.

>> origdata = 'some string with codes > 127 (national chars)'
>> xml = '<root>' + origdata + '</root>'
>> # ... parse it with lxml ...
>> rettext = doc.text_content()
>> isinstance(rettext, unicode)  # True! But the original text was not unicode.
>
> The "text" you are talking about was a sequence of bytes. Now it is a
> sequence of characters. It may not be the sequence you expect, because the
> document does not provide any hints about what the characters it describes
> with its byte sequences are (how do /you/ know it's really Bulgarian
> characters?), so they may be Latin-1, they may be UTF-8, they may be
> Cyrillic, they may be EBCDIC.
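
To illustrate the point quoted above, the same byte sequence can be read under several encodings. This is a plain Python 3 sketch; the sample bytes and the list of codecs are chosen for illustration only.

```python
# Six bytes with no accompanying encoding information.
raw_bytes = b"\xcf\xf0\xe8\xe2\xe5\xf2"

readings = {}
for encoding in ("latin-1", "windows-1251", "koi8-r", "cp500", "utf-8"):
    try:
        readings[encoding] = raw_bytes.decode(encoding)
    except UnicodeDecodeError:
        readings[encoding] = None  # the bytes are not valid in this encoding

for encoding, text in readings.items():
    print(encoding, ascii(text))
```

Several of these decodings succeed and produce different characters, so the bytes alone cannot say which reading is the intended one.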
That's what I'm talking about! If nobody knows what the content
actually is, then leave it as is, as the original byte stream. Why does
lxml now suggest that the input stream is unicode? Nobody told it that. If
lxml doesn't know the encoding, then it should just process tags and
attributes and build the tree; lxml doesn't need to know the correct
encoding to do that, and any unknown data should be left untouched! That's
the simplest rule I could ever imagine: if you don't know what it is, and
you don't need it for your task, then avoid any processing of that data,
[...]

Stefan Behnel | 26 Aug 11:46

Re: Encoding again

Max Ivanov wrote:
>> What you pass is a byte stream of unknown encoding. What you get back is
>> a tree with well defined characters. Isn't that great enough?
> In some cases (original text in ASCII) there are well-defined
> characters; in other cases it is garbage. Why can't you just leave the
> content inside tags as is when the original encoding is unknown and the
> parser is unable to detect it from the data (no <meta> tag, for example)?
>
> I'm just asking for a new keyword argument which disables any
> processing of unknown byte streams inside tags. That would make lxml
> more useful in a wider range of situations.

If you provide a patch, we can discuss it. But be warned that it's not
just about adding a new keyword argument.

> If nobody knows what the content
> actually is, then leave it as is, as the original byte stream.

Parsing XML/HTML is about converting a byte stream to a tree (ok, usually)
and character content (always).

> Why does lxml now suggest that the input stream is unicode?

It does not suggest anything like that. It expects the input to be a byte
stream and returns a tree with Unicode content.
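
A small sketch of "bytes in, Unicode content out", combined with the docinfo property mentioned earlier in the thread. It is written for Python 3. Unlike the broken pages under discussion, this hypothetical document declares its encoding, so the parser has something to go on.

```python
from lxml import etree

text = "\u041f\u0440\u0438\u0432\u0435\u0442"
raw_xml = (
    '<?xml version="1.0" encoding="windows-1251"?><root>' + text + "</root>"
).encode("windows-1251")

# Input is a byte string; element content comes back as a text string.
root = etree.fromstring(raw_xml)
print(type(raw_xml).__name__, "->", type(root.text).__name__)

# The encoding the document declared is available from the tree's docinfo.
declared_encoding = root.getroottree().docinfo.encoding

# Characters back to bytes, using the declared encoding.
text_bytes = root.text.encode(declared_encoding)
print(text_bytes == text.encode("windows-1251"))
```

For a document with no declaration, the docinfo value cannot tell you what the author intended, so this round trip is only as good as the encoding information that was available at parse time.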

> any unknown data should be left untouched! That's the simplest
> rule I could ever imagine: if you don't know what it is, and you
> don't need it for your task, then avoid any processing of that data,
> it's up to the user how to handle it later.

