oleg | 5 Dec 06:13 2012

Is it possible to have constant-space JSON decoding?


For several months I have been doing constant-space processing of large
XML files using iteratees. The file contains many XML elements (which
are a bit more complex than a number). An element can be processed
independently. After the parser finishes with one element, and dumps
the related data, the processing of the next element starts anew, so
to speak. No significant state is accumulated over the whole parse,
save the counters of processed and bad elements, kept for statistics.
XML is somewhat like JSON, only more complex: an XML parser has to deal
with namespaces, parsed entities, CDATA sections and other
interesting stuff. Therefore, I'm quite sure there should be no
fundamental problems in constant-space parsing of JSON.
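To make the shape of this concrete, here is a toy model of the processing style (the event type and all names are my own illustration, not Oleg's actual library):

```haskell
import Data.List (foldl')

-- A toy SAX-like event stream; an illustrative assumption, not the
-- real parser's output type.
data Event = StartElem String | EndElem String | Text String

-- Fold over a lazily produced event stream, keeping only a counter
-- of completed elements of a given name. Since no other state
-- accumulates, memory use stays constant however long the stream is.
countElems :: String -> [Event] -> Int
countElems name = foldl' step 0
  where
    step n (EndElem e) | e == name = n + 1
    step n _                       = n
```

The real pipeline does the same thing one level up: each element is consumed and dumped as it completes, and only the statistics counters survive between elements.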

BTW, the parser itself is described here:
	http://okmij.org/ftp/Streams.html#xml
Johan Tibell | 5 Dec 06:37 2012

Re: Is it possible to have constant-space JSON decoding?

Hi Oleg,

On Tue, Dec 4, 2012 at 9:13 PM,  <oleg <at> okmij.org> wrote:
> I am doing, for several months, constant-space processing of large XML
> files using iteratees. The file contains many XML elements (which are
> a bit complex than a number). An element can be processed
> independently. After the parser finished with one element, and dumped
> the related data, the processing of the next element starts anew, so
> to speak. No significant state is accumulated for the overall parsing
> sans the counters of processed and bad elements, for statistics. XML
> is somewhat like JSON, only more complex: an XML parser has to deal
> with namespaces, parsed entities, CDATA sections and the other
> interesting stuff. Therefore, I'm quite sure there should not be
> fundamental problems in constant-space parsing of JSON.
>
> BTW, the parser itself is described there
>         http://okmij.org/ftp/Streams.html#xml

It certainly is possible (using a SAX style parser). What you can't
have (I think) is a function:

    decode :: FromJSON a => ByteString -> Maybe a

and constant-memory parsing at the same time. The return type here
says that we will return Nothing if parsing fails. We can only do so
after looking at the whole input (otherwise how would we know if it's
malformed).
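A constant-memory decoder therefore has to expose partial results instead of a single Maybe. As a sketch of what such an interface could look like (a hand-rolled type in the spirit of attoparsec's result type; none of this is aeson's actual API):

```haskell
import Data.ByteString (ByteString)

-- Sketch of an incremental decoding result; an assumption modeled
-- on attoparsec-style parsers, not an existing aeson type.
data Result a
  = Fail String                      -- malformed input, with a reason
  | Partial (ByteString -> Result a) -- needs more input to decide
  | Done ByteString a                -- remaining input and the value

-- Feeding another chunk resumes a suspended parse and leaves
-- finished results untouched.
feed :: Result a -> ByteString -> Result a
feed (Partial k) chunk = k chunk
feed r           _     = r
```

With this shape the caller learns about malformedness only once enough input has been seen, while each chunk can be processed and discarded as it arrives.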

The use cases aeson was designed for (which I bet is the majority use
case) is parsing smaller messages sent over the network (i.e. in web
service APIs) so this is the only mode of parsing it supplies.

-- Johan

Jason Dagit | 5 Dec 06:48 2012

Re: Is it possible to have constant-space JSON decoding?



On Tue, Dec 4, 2012 at 9:37 PM, Johan Tibell <johan.tibell <at> gmail.com> wrote:
Hi Oleg,

On Tue, Dec 4, 2012 at 9:13 PM,  <oleg <at> okmij.org> wrote:
> I am doing, for several months, constant-space processing of large XML
> files using iteratees. The file contains many XML elements (which are
> a bit complex than a number). An element can be processed
> independently. After the parser finished with one element, and dumped
> the related data, the processing of the next element starts anew, so
> to speak. No significant state is accumulated for the overall parsing
> sans the counters of processed and bad elements, for statistics. XML
> is somewhat like JSON, only more complex: an XML parser has to deal
> with namespaces, parsed entities, CDATA sections and the other
> interesting stuff. Therefore, I'm quite sure there should not be
> fundamental problems in constant-space parsing of JSON.
>
> BTW, the parser itself is described there
>         http://okmij.org/ftp/Streams.html#xml

It certainly is possible (using a SAX style parser). What you can't
have (I think) is a function:

    decode :: FromJSON a => ByteString -> Maybe a

and constant-memory parsing at the same time. The return type here
says that we will return Nothing if parsing fails. We can only do so
after looking at the whole input (otherwise how would we know if it's
malformed).

I thought it was possible to get around this with lazy patterns such as "Wadler's force function" [1]?

(untested code)

force y =
  let Just x = y
  in Just x

lazyDecode :: FromJSON a => ByteString -> Maybe a
lazyDecode = force . decode



_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Albert Y. C. Lai | 8 Dec 00:06 2012

Re: Is it possible to have constant-space JSON decoding?

On 12-12-05 12:48 AM, Jason Dagit wrote:
> I thought it was possible to get around this with lazy patterns such
> "Wadler's force function" [1]?
>
> (untested code)
>
> force y =
>    let Just x = y
>    in Just x
>
> lazyDecode :: FromJSON a => ByteString -> Maybe a
> lazyDecode = force . decode

This says: the type is Maybe, but the value is always Just. If there
is a parse error, you will get an asynchronous imprecise exception, not
Nothing, at the later time when you look closer.

This respects the letter, but not the point, of the Maybe type. If you
are to do this (I make no moral judgment, I hold nothing against doing
this; the asynchronous imprecise exception will give you enough grief),
you may as well skip the Just wrapping and go directly to ByteString -> a.

The point of the Maybe type is to communicate parse errors in a less
asynchronous, less imprecise way.
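The point can be seen concretely in a small self-contained program (Control.Exception is used only to observe the failure):

```haskell
import Control.Exception (SomeException, evaluate, try)

-- Wadler's force, as in Jason's message.
force :: Maybe a -> Maybe a
force y = let Just x = y in Just x

main :: IO ()
main = case force Nothing of
  Nothing -> putStrLn "parse failure reported as Nothing"  -- never taken
  Just x  -> do
    -- force Nothing is Just <thunk>: the failure surfaces only when
    -- the payload is demanded, as an exception rather than Nothing.
    r <- try (evaluate (x :: Int)) :: IO (Either SomeException Int)
    case r of
      Left e  -> putStrLn ("got an exception, not Nothing: " ++ show e)
      Right v -> print v
```

The Nothing branch is dead code: force always returns a Just, and the irrefutable pattern inside it is what blows up later.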
Malcolm Wallace | 7 Dec 10:57 2012

Re: Is it possible to have constant-space JSON decoding?

See also the incremental XML parser in HaXml, described in "Partial parsing: combining choice with
commitment", IFL 2006.  It has constant space usage (for some patterns of usage), even with extremely
large inputs.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.7512&rep=rep1&type=pdf

On 5 Dec 2012, at 05:37, Johan Tibell wrote:

> Hi Oleg,
> 
> On Tue, Dec 4, 2012 at 9:13 PM,  <oleg <at> okmij.org> wrote:
>> I am doing, for several months, constant-space processing of large XML
>> files using iteratees. The file contains many XML elements (which are
>> a bit complex than a number). An element can be processed
>> independently. After the parser finished with one element, and dumped
>> the related data, the processing of the next element starts anew, so
>> to speak. No significant state is accumulated for the overall parsing
>> sans the counters of processed and bad elements, for statistics. XML
>> is somewhat like JSON, only more complex: an XML parser has to deal
>> with namespaces, parsed entities, CDATA sections and the other
>> interesting stuff. Therefore, I'm quite sure there should not be
>> fundamental problems in constant-space parsing of JSON.
>> 
>> BTW, the parser itself is described there
>>        http://okmij.org/ftp/Streams.html#xml
> 
> It certainly is possible (using a SAX style parser). What you can't
> have (I think) is a function:
> 
>    decode :: FromJSON a => ByteString -> Maybe a
> 
> and constant-memory parsing at the same time. The return type here
> says that we will return Nothing if parsing fails. We can only do so
> after looking at the whole input (otherwise how would we know if it's
> malformed).
> 
> The use cases aeson was designed for (which I bet is the majority use
> case) is parsing smaller messages sent over the network (i.e. in web
> service APIs) so this is the only mode of parsing it supplies.
> 
> -- Johan
> 
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe <at> haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
oleg | 15 Dec 16:02 2012

Re: Is it possible to have constant-space JSON decoding?


Johan Tibell posed an interesting problem: incremental parsing
while still detecting and reporting ill-formedness errors.
> What you can't have (I think) is a function:
>
>     decode :: FromJSON a => ByteString -> Maybe a
>
> and constant-memory parsing at the same time. The return type here
> says that we will return Nothing if parsing fails. We can only do so
> after looking at the whole input (otherwise how would we know if it's
> malformed).

The problem is very common: suppose we receive an XML document over a
network (e.g., in an HTTP stream). Network connections are inherently
unreliable and can be dropped at any time (e.g., because someone
tripped over a power cord). The XML document may therefore arrive
truncated, for example, missing the end tag for the root
element. According to the XML Recommendation, such a document is
ill-formed, and hence does not have an Infoset (in contrast, invalid
but well-formed documents do have an Infoset). Strictly speaking, we
should not be processing an XML document until we have verified that it
is well-formed, that is, until we have parsed all of it and checked that
all end tags are in place. It seems we cannot do incremental XML
processing in principle.

I should mention first that sometimes people avoid such a strict
interpretation. If we process telemetry data encoded in XML, we may
consider a document meaningful even if the root end tag is missing. We
process as far as we can.

Even if we take the strict interpretation, it is still possible
to handle a document incrementally, so long as the processing is
functional or the side effects can be backed out (e.g., in a
transaction). This message illustrates exactly such incremental
processing, which always detects ill-formed XML and, optionally,
invalid XML. It is possible after all to detect ill-formedness
errors and process the document without loading it all into memory
first -- using only as much memory as needed to hold the state of the
processor -- just a short string in our example.

Our running example is an XML document representing a finite map:
a collection of key-value pairs where key is an integer:

  <map>
   <kv><key>1</key><value>v1</value></kv>
   <kv><key>2</key><value>v2</value></kv>
   <kv><key>bad</key><value>v3</value></kv>

The above document is both ill-formed (missing the end tag)
and invalid (one key is bad: not Read-able). We would like to write
a lookup function for a key in such an encoded map. An ill-formedness
error must always be reported. The reporting of validation
errors may vary. The function

xml_lookup :: Monad m => Key -> Iteratee Char m (Maybe Value)
xml_lookup key = id .| xml_enum default_handlers .| xml_normalize
                 .| kv_enum (lkup key)

always reports validation errors. It is built
by composition from smaller, separately developed and tested
pipeline components: parsing of the document into an XMLStream,
normalization, conversion of the XMLStream into
a stream of (Key,Value) pairs, and finally searching that stream.
A similar function that replaces kv_enum with kv_enum_pterm
terminates the (Key,Value) conversion as soon as its client iteratee
has finished. In that case, if the lookup succeeds before we encounter
the bad key, no validation error is reported. Ill-formedness errors are
always raised.
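The early-termination behavior of kv_enum_pterm can be modeled on plain
lazy lists (an illustration only; the real code consumes an iteratee
stream, not a list):

```haskell
type Key   = Int
type Value = String

-- Lazy-list model of the (Key,Value) stream recovered from the
-- example document; the third key is bad and errors when demanded.
kvStream :: [(Key, Value)]
kvStream = [(1, "v1"), (2, "v2"), (error "bad key", "v3")]

-- Prelude's lookup stops at the first match, so finding key 1 or 2
-- never forces the bad key; a failed lookup walks the whole stream
-- and trips over it, mirroring the always-validating kv_enum.
findValue :: Key -> Maybe Value
findValue k = lookup k kvStream
```

findValue 1 yields Just "v1" without touching the bad entry, while
findValue 5 raises the error, just as the full pipeline reports the
validation failure only when the stream is demanded past the bad pair.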

The code
	http://okmij.org/ftp/Haskell/Iteratee/XMLookup.hs

shows examples of both methods of validation-error reporting.
The code also illustrates a library of parsing combinators, which
represent the element structure (the `DTD').
