Scott Dillard | 5 Aug 02:08

Proposal: add laziness to Data.Map / IntMap

Hi,

I found myself wanting a lazy version of Data.Map today, and by "lazy" I mean in the node-subtree pointers. I trust the library writers when they put strictness annotations in the fields of the tree nodes, so I'm wondering what the various implications of lazy subtrees are, beyond the obvious speed hit. Does this cause space leaks beyond what lazy value pointers can already cause? Can someone point me to some reading that discusses this?

Anyway, I'm positing to libraries (rather than haskell-cafe) to gauge opinion about a possible rewrite of Data.Map and IntMap to remove strictness annotations (bangs) from the node constructors and move them into the functions (as seqs).  "Rewrite" is maybe too strong of a word. "Significant patch" is more like it. It would involve only those functions that construct Bin values directly, which is not that many. Less than a days work, I think (yes that means I volunteer.)  Semantics of everything remains unchanged, but it opens up the possibility for lazy versions of some functions.

The most usefull result of this would be a lazy map (little m). Here's Data.Map.mapWithKey

  mapWithKey f Tip = Tip
  mapWithKey f (Bin sx kx x l r)
    = Bin sx kx (f kx x) (mapWithKey f l) (mapWithKey f r)

De-banged and then restrictified, it would look like this

  mapWithKey f Tip = Tip
  mapWithKey f (Bin sx kx x l r)
    = seq l' $ seq r' $ Bin sx kx (f kx x) l' r'
        where l = (mapWithKey f l)
                 r = (mapWithKey f r)

Looking at the first version, clearly you see that when constructing a new map you should only have to pay for the sub trees that you actually use. If you perform only a handful of lookups then throw the new tree away, why build the whole thing?

To further motivate, let me explain my use case. I have cyclical data structures (graphs, more or less) that I mutate frequently, so I store them in a Map, indexed by some Ord thing, lets say Int, so I'd have something like Map Int [Int] (but not that exactly, and nothing like Data.Graph). This is great for mutations because I can use backtracking, but for lookups it's a burden on both me and the cpu. So I memoize the thing into something like "data Node a = Node a [Node a]"  I can do this memoization using Data.Map.mapWithKey, with the Nodes built therein referring back to the produced Map. But then, what if I only crawl a small portion of this cyclical network of Nodes? Why should I have to pay for the whole thing to be rebuilt? It defeats the purpose of the me moization, which is to amortize the cost of following edges in the mutable graph.

The pro and con as I see it are:

Pro
- More flexible data structure

Con
- Code is more verbose (see Data.Tree.AVL)
- Only a few (but important) functions can be made lazy

To that last point, note that while mapWithKey can be made lazy for both Map and IntMap, only IntMap allows lazy filter and mapMaybe because it doesn't rebalance. But I'm wondering how much of the tree needs to be forced when rebalancing. Should be only O(log n), right? It also becomes important where the tree is sourced from. The source needs to produce the tree lazily. The regular definition of fromList (= foldr (uncurry insert) empty) admits no laziness, but maybe successive unions could if the sub-maps were nearly disjoint (a not-uncommon case I think.) Does anyone know if a ny benchmarking has been done to this end?

Finally, I'll stress once more that the semantics of the functions currently exported would be unchanged. This would only allow new lazy versions, named something like mapWithKeyL or unionL.

So what do you think? Too much for too little?

Scott

_______________________________________________
Libraries mailing list
Libraries <at> haskell.org
http://www.haskell.org/mailman/listinfo/libraries
Don Stewart | 5 Aug 02:18
Gravatar

Re: Proposal: add laziness to Data.Map / IntMap

sedillard:
>    Hi,
> 
>    I found myself wanting a lazy version of Data.Map today, and by "lazy" I
>    mean in the node-subtree pointers. I trust the library writers when they
>    put strictness annotations in the fields of the tree nodes, so I'm
>    wondering what the various implications of lazy subtrees are, beyond the
>    obvious speed hit. Does this cause space leaks beyond what lazy value
>    pointers can already cause? Can someone point me to some reading that
>    discusses this?
> 
>    Anyway, I'm positing to libraries (rather than haskell-cafe) to gauge
>    opinion about a possible rewrite of Data.Map and IntMap to remove
>    strictness annotations (bangs) from the node constructors and move them
>    into the functions (as seqs).  "Rewrite" is maybe too strong of a word.
>    "Significant patch" is more like it. It would involve only those functions
>    that construct Bin values directly, which is not that many. Less than a
>    days work, I think (yes that means I volunteer.)  Semantics of everything
>    remains unchanged, but it opens up the possibility for lazy versions of
>    some functions.

How about doing it as a separate library, then we can choose either
strict or lazy as the case may be?

-- Don
Scott Dillard | 5 Aug 03:17

Re: Proposal: add laziness to Data.Map / IntMap

I think maybe you guys (Don and Andrew) are misunderstanding my proposal. The lazy/strict tradeoff is a subtle issue, and I'll be sure to re-read Okasaki's stuff with this in mind, but what I'm talking about here is not a trade off. It's laziness "for free". Move the strictness annotations out of the constructors and into the library functions using 'seq'. Laziness is exposed through _separate_ functions.

I'll copy again my proposed versions of mapWithKey (because I made a typo the first time)

mapWithKey :: (k -> a -> b) -> Map k a -> Map k b
mapWithKey _ Tip = Tip
mapWithKey f (Bin sx kx x l r)
  = seq l' $ seq r' $ Bin sx kx (f kx x) l' r'
    where l' = mapWithKey f l
          r' = mapWithKey f r

mapWithKeyLazy _ Tip = Tip
mapWithKeyLazy f (Bin sx kx x l r)
  = Bin sx kx (f kx x) (mapWithKey f l) (mapWithKey f r)

So mapWithKey retains all semantics, including guarantees. So would insert, and all other functions. You export a second API (maybe in a nested module) that exposes laziness. Writing another library is kind of silly since a) you want stricness 90% of the time b) it shares 90% of the code. If maintainers are willing to deal with some extra seqs here and there then we can have both libraries in one.

Scott


On Mon, Aug 4, 2008 at 6:18 PM, Don Stewart <dons <at> galois.com> wrote:
sedillard:
>    Hi,
>
>    I found myself wanting a lazy version of Data.Map today, and by "lazy" I
>    mean in the node-subtree pointers. I trust the library writers when they
>    put strictness annotations in the fields of the tree nodes, so I'm
>    wondering what the various implications of lazy subtrees are, beyond the
>    obvious speed hit. Does this cause space leaks beyond what lazy value
>    pointers can already cause? Can someone point me to some reading that
>    discusses this?
>
>    Anyway, I'm positing to libraries (rather than haskell-cafe) to gauge
>    opinion about a possible rewrite of Data.Map and IntMap to remove
>    strictness annotations (bangs) from the node constructors and move them
>    into the functions (as seqs).  "Rewrite" is maybe too strong of a word.
>    "Significant patch" is more like it. It would involve only those functions
>    that construct Bin values directly, which is not that many. Less than a
>    days work, I think (yes that means I volunteer.)  Semantics of everything
>    remains unchanged, but it opens up the possibility for lazy versions of
>    some functions.

How about doing it as a separate library, then we can choose either
strict or lazy as the case may be?

-- Don

_______________________________________________
Libraries mailing list
Libraries <at> haskell.org
http://www.haskell.org/mailman/listinfo/libraries

Re: Proposal: add laziness to Data.Map / IntMap

Scott Dillard wrote:
> I think maybe you guys (Don and Andrew) are misunderstanding my proposal.
> The lazy/strict tradeoff is a subtle issue, and I'll be sure to re-read
> Okasaki's stuff with this in mind, but what I'm talking about here is not a
> trade off. It's laziness "for free". Move the strictness annotations out of
> the constructors and into the library functions using 'seq'. Laziness is
> exposed through _separate_ functions.

It's not for free. When the compiler does a pattern match on the Bin
constructor, Bin sz kx x l r, it can no longer assume that l and r are
fully evaluated, so it has to add code to evaluate them in case they are
not. And in fact, this code will be needed if any of your proposed lazy
functions are added later. I have not checked whether this has a
measurable performance or code size impact.

> So mapWithKey retains all semantics, including guarantees.

Semantically the change is safe, agreed.

regards,

Bertram
ajb | 5 Aug 02:50
Favicon
Gravatar

Re: Proposal: add laziness to Data.Map / IntMap

G'day all.

Quoting Scott Dillard <sedillard <at> ucdavis.edu>:

> I found myself wanting a lazy version of Data.Map today, and by "lazy" I
> mean in the node-subtree pointers.

Right.  Just to be clear, to start off with:

   - It makes no sense for "keys" to be lazy, because they need to be
     inspected to determine the shape of the data structure.  (Except
     in the case of a singleton map, if you know where in the tree
     some (key,value) pair goes, then you've already evaluated the key.)

   - It's better for "values" to be lazy in a general-purpose map-type
     data structure, because making them strict breaks Functor.

So the remaining interesting question is the internal structure pointers.

> I trust the library writers when they put
> strictness annotations in the fields of the tree nodes, so I'm wondering
> what the various implications of lazy subtrees are, beyond the obvious speed
> hit. Does this cause space leaks beyond what lazy value pointers can already
> cause? Can someone point me to some reading that discusses this?

Yes, please read Chris Okasaki's "Purely Functional Data Structures"
for a fuller discussion of the tradeoffs of putting laziness in
different places in a data structure.

Making internal pointers strict vs making them lazy doesn't necessarily
buy you much in the way of raw-cycle-counting performance.  What it buys
you is a complexity _guarantee_, which in Haskell is often more valuable.

Thinking of a binary search tree for a moment, making the internal
pointers lazy means that insertion is always O(1), but lookup may take
an arbitrary amount of time (though it will be O(log n) amortised).  It
also adds a raw-cycle-counting cost to every lookup, even if the tree is
fully evaluated.

This is the opposite of the usual desired performance.  Dictionary
implementations tend to assume that lookups are more common than
insertions and deletions, and correspondingly, clients tend to assume
that insertions and deletions are more expensive than lookups.

If these assumptions don't match your code, then indeed, you may be
using the wrong data struture.

> Looking at the first version, clearly you see that when constructing a new
> map you should only have to pay for the sub trees that you actually use. If
> you perform only a handful of lookups then throw the new tree away, why
> build the whole thing?

If you only perform a handful of lookups, I question whether you actually
wanted a binary search tree to begin with.  Wouldn't an association list
have done the job just as well?  Or compiling to functions?

> To further motivate, let me explain my use case. [...]
> So I memoize the thing into something like
> "data Node a = Node a [Node a]"

Right.  Memoising CAFs is an excellent example of one of those very few
places where writing your own dictionary data type can make a lot of
sense.

Why?  Because there are a lot of things that you don't need from, say,
AVL trees.  For example, you know all the keys in advance, which means
that your memo table won't need to be modified once it's built.  You
don't even need insertions, let alone deletions or a Functor instance.

Have you tried just using functions?  Something like this:

-- WARNING: Untested code follows.  Use at own risk.

type MyMap k v = k -> v    -- k -> Maybe v may also be appropriate.

myMapFromSortedAssocList :: (Ord k) => [(k,v)] -> MyMap k v
myMapFromSortedAssocList kvs
     = buildMap (length kvs) kvs
     where
         errorCase = error "Can't find key"

         -- Feel free to optimise additional base cases if desired.
         buildMap _ [] = \key -> errorCase
         buildMap _ [(k,v)] = \key -> if k == key then v else errorCase
         buildMap l kvs
           = let l2 = l `div` 2
                 (kvsl,(k,v):kvs2) = splitAt l2 kvs
                 mapl = buildMap l2 kvs1
                 mapr = buildMap (l - l2 - 1) kvs2
             in \key -> case compare key k of
                           LT -> mapl key
                           GT -> mapr key
                           EQ -> v

(Exercise for intermediate-level Haskellers: Why is "key" bound by an
explicit lambda?)

Cheers,
Andrew Bromage
Scott Dillard | 5 Aug 16:56

Re: Proposal: add laziness to Data.Map / IntMap

Hi Bertram

Scott Dillard wrote:
> I think maybe you guys (Don and Andrew) are misunderstanding my proposal.
> The lazy/strict tradeoff is a subtle issue, and I'll be sure to re-read
> Okasaki's stuff with this in mind, but what I'm talking about here is not a
> trade off. It's laziness "for free". Move the strictness annotations out of
> the constructors and into the library functions using 'seq'. Laziness is
> exposed through _separate_ functions.

It's not for free. When the compiler does a pattern match on the Bin
constructor, Bin sz kx x l r, it can no longer assume that l and r are
fully evaluated, so it has to add code to evaluate them in case they are
not. And in fact, this code will be needed if any of your proposed lazy
functions are added later. I have not checked whether this has a
measurable performance or code size impact.

> So mapWithKey retains all semantics, including guarantees.

Semantically the change is safe, agreed.

You're right of course (hadn't thought about that) but I'd like to point out that Adrien Hey's AVL tree library is structured in the way I propose: A node's subtree fields are not marked strict, instead seq is used to force construction of the tree. I think it's generally faster than Data.Map, at least it's claimed to be so in the documentation. The trees are not built with the same algorithm, but if the use of seq instead of bangs does impose a significant overhead, then that speaks very highly to the AVL algorithm.

This of course raises the question of why I don't just use that library. I'm checking it out, but it's quite a bit bigger than Data.Map. The thing I like about Data.Map is that it's relatively simple and I can see whats going on very quickly, enough to make modifications (which I'm about to do.) Reading Data.Tree.AVL is a more daunting task.

Also, there's something to be said for "lazy by default." I don't see much difference between Data.List.map and Data.Map.map (for finite lists anyway) and so if laziness is the correct default for the former then it should also be for the latter. The current Data.Map.map is only half strict: it builds an entire tree and fills it with thunks. Better I think to either build the whole thing completely or build only what's needed. The current Data.Map.map is the worst of both, no laziness + heap burn.

So I guess I'll start writing. Anyone have any good benchmarks? Which naming scheme is less ugly, Data.Map.mapWithKeyL or Data.Map.Lazy.mapWithKey? Any other suggestions would be appreciated.

Scott

_______________________________________________
Libraries mailing list
Libraries <at> haskell.org
http://www.haskell.org/mailman/listinfo/libraries
Favicon

RE: Proposal: add laziness to Data.Map / IntMap

> Which naming scheme is less ugly, 
> Data.Map.mapWithKeyL or Data.Map.Lazy.mapWithKey? 

A separate module is much better, as it will allow switching entire
modules over just by changing an import, and still allow something close
to the mapWithKeyL usage via a qualified import (e.g. "import qualified
Data.Map.Lazy as L")

Ganesh

==============================================================================
Please access the attached hyperlink for an important electronic communications disclaimer: 

http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html
==============================================================================
Adrian Hey | 5 Aug 19:22
Favicon

Re: Proposal: add laziness to Data.Map / IntMap

Scott Dillard wrote:

> You're right of course (hadn't thought about that) but I'd like to point out
> that Adrien Hey's AVL tree library is structured in the way I propose: A
> node's subtree fields are not marked strict, instead seq is used to force
> construction of the tree. I think it's generally faster than Data.Map, at
> least it's claimed to be so in the documentation. The trees are not built
> with the same algorithm, but if the use of seq instead of bangs does impose
> a significant overhead, then that speaks very highly to the AVL algorithm.

Using explicit seqs rather than strict data types is actually faster,
for reasons that are a bit of a mystery to me. I'm not sure what cost
Bertram is talking about, but AFAIK ghc uses the same info pointer
mechanism for all heap records, including unevaluated thunks (although
the info pointers will point to different things of course). But the
cost of pattern matching on *evaluated* AVL nodes should be independent
of strictness annotation AFAICS.

Regards
--
Adrian Hey

Re: Proposal: add laziness to Data.Map / IntMap

Adrian Hey wrote:
> Using explicit seqs rather than strict data types is actually faster,
> for reasons that are a bit of a mystery to me. I'm not sure what cost
> Bertram is talking about, but AFAIK ghc uses the same info pointer
> mechanism for all heap records, including unevaluated thunks (although
> the info pointers will point to different things of course). But the
> cost of pattern matching on *evaluated* AVL nodes should be independent
> of strictness annotation AFAICS.

I must admit that for the time being the cost is of a theoretical
nature. But let me explain the idea.

Consider this code:

> module Nat (isOne) where
> 
> data Nat = Succ !Nat | Zero
> 
> isOne :: Nat -> Bool
> isOne n = case n of
>     Zero    -> False
>     Succ n' -> case n' of
>         Zero   -> True
>         Succ _ -> False

The code of isOne starts by forcing n (looking at n's tag and entering
the closure if it's unevaluated in ghc's case) and then a pattern
match (looking at the the tag again).

Now for the second pattern match, we can skip the first step, because
we know that n' is fully evaluated, thanks to the strictness annotation
in the Succ constructor.

However, ghc currently does generate the same (cmm) code for isOne
regardless of the strictness annotation, so performance wise you only
get to pay the price of the annotation (I expect that some thunks are
unnecessarily reevaluated when the constructor is used) without this
benefit.

Did I miss any reason why this idea can't work?

I really expected ghc to do that optimisation - obviously that was
wishful thinking on my part.

Bertram
Jamie Brandon | 5 Aug 21:03

Re: Proposal: add laziness to Data.Map / IntMap

> This of course raises the question of why I don't just use that library. I'm
> checking it out, but it's quite a bit bigger than Data.Map. The thing I like
> about Data.Map is that it's relatively simple and I can see whats going on
> very quickly, enough to make modifications (which I'm about to do.) Reading
> Data.Tree.AVL is a more daunting task.

I agree. Have a look at code.haskell.org/gmap/api . The
Data.GMap.OrdMap module is a wrapper around AvlTree, using a slightly
more complete interface than Data.Map . It should make an easier
starting point. In fact, I suspect that if you just make a new
AvlTreeL by deleting every occurrence of `seq` and then OrdMapL by
changing the import in OrdMap everything would work seamlessly.

The current version of OrdMap should be fairly safe to use. I'll be
putting a proper package on hackage in a week or two once Ive tidied
everything up.

Jamie
Scott Dillard | 5 Aug 20:46

Proposal: add laziness to Data.Map / IntMap

>Adrian Hey wrote:
>
> Using explicit seqs rather than strict data types is actually faster,
> for reasons that are a bit of a mystery to me. I'm not sure what cost
> Bertram is talking about, but AFAIK ghc uses the same info pointer
> mechanism for all heap records, including unevaluated thunks (although
> the info pointers will point to different things of course). But the
> cost of pattern matching on *evaluated* AVL nodes should be independent
> of strictness annotation AFAICS.

Thanks for chiming in Adrian. Just to get started I removed the strictness
annotations from the Data.Map Bin constructor, made no other changes, and ran a
silly benchmark (at the end of this email). The version without bangs is
actually faster than the version currently shipping. I get about 10.5 sec for
the lazy version and 11.5 sec for the strict version (2.1Ghz Intel Core)

I'll repeat that in bold for people just skimming this thread:

__Removing Strictness Annotations Makes It Go Faster__

The reason I think is that the helper functions bin, join and balance already
provide just enough strictness, as they need to inspect the size field. The
strictness analyzer can then do its job. The case for IntMap is tricker,
as there is no implicit strictness in the code so removing the bangs causes
stack overflows. Still working on that one.

Here are the benchmarks. The lazy version also evaluates "keySum dmap" slightly
faster (repeated inserts) and its a tie for "keySum smap" (sequential inserts).
I admit this benchmark is goofy, if you have a better one please share.

Scott



import qualified Data.Map as Map
import Data.List as List

n = 1000000
rkeys = [ (i*122789) `mod` 1006471 | i<-[0..] ] :: [Int]
dkeys = map (`div`1000) rkeys :: [Int]
skeys = [0..] :: [Int]

shuffle (a:b:c:d:e:f:g:h:rest) = e:a:h:d:c:b:g:f: shuffle rest

keySum = List.foldl' (+) 0 . Map.keys

rmap = Map.fromList    . take n . shuffle $ rkeys `zip` [0..]
dmap = Map.fromList    . take n . shuffle $ dkeys `zip` [0..]
smap = Map.fromAscList . take n . shuffle $ skeys `zip` [0..]

mix (x:xs) (y:ys) = x : y : mix xs ys
mix _ [] = []
mix [] _ = []

rkeys2 = [ (i*122789) `mod` 1006471 | i<-[0..] ] :: [Int]
rlooks = [ (i*122819) `mod` 1006471 | i<-[0..] ] :: [Int]

rlook =
  List.foldl'
    (\k s -> case Map.lookup k rmap of Nothing -> s; Just x -> s+x)
    0 (take n $ rkeys2 `mix` drop 1000 rlooks)

main = print rlook  -- or print (keySum dmap) or whatever


_______________________________________________
Libraries mailing list
Libraries <at> haskell.org
http://www.haskell.org/mailman/listinfo/libraries
Adrian Hey | 6 Aug 09:35
Favicon

Re: Proposal: add laziness to Data.Map / IntMap

Hello Scott,

Scott Dillard wrote:
> Thanks for chiming in Adrian. Just to get started I removed the strictness
> annotations from the Data.Map Bin constructor, made no other changes, and
> ran a
> silly benchmark (at the end of this email). The version without bangs is
> actually faster than the version currently shipping. I get about 10.5 sec
> for
> the lazy version and 11.5 sec for the strict version (2.1Ghz Intel Core)
> 
> I'll repeat that in bold for people just skimming this thread:
> 
> __Removing Strictness Annotations Makes It Go Faster__
> 

Interesting. This seems to be more or less in agreement with my tests on
AVL insertion. I get something like 10-20% speedup by *not* using strict
constructors (providing I add explicit seqs if needed to give me the
strictness I want). So  it seems strict constructors cause some
unnecessary work or inhibit some optimisation (or something..).

Regards
--
Adrian Hey

Gmane