Johan Tibell | 1 Jun 08:52 2011
Picon

Computing the memory footprint of a HashMap ByteString Int (Was: How on Earth Do You Reason about Space?)

Hi Aleksandar,

I thought it'd be educational to do some back-of-the-envelope
calculations to see how much memory we'd expect to use to store words
in a HashMap ByteString Int. First, lets start by looking at how much
memory one ByteString uses. Here's the definition of ByteString [1]:

    data ByteString = PS {-# UNPACK #-} !(ForeignPtr Word8)
                         {-# UNPACK #-} !Int                -- offset
                         {-# UNPACK #-} !Int                -- length

The two Int fields are used to support O(1) slicing of ByteStrings.

We also need the definitions of ForeignPtr [2] and Int.

    data ForeignPtr a = ForeignPtr Addr# ForeignPtrContents

    data ForeignPtrContents
        = PlainForeignPtr !(IORef (Finalizers, [IO ()]))
        | MallocPtr      (MutableByteArray# RealWorld) !(IORef
(Finalizers, [IO ()]))
        | PlainPtr       (MutableByteArray# RealWorld)

    data Int = I# Int#

The UNPACK indicates to the compiler that it should unpack the
contents of a constructor field into the constructor itself, removing
a level of indirection. We'll end up with a structure that looks like
this:

(Continue reading)

Aleksandar Dimitrov | 1 Jun 16:24 2011

Re: Computing the memory footprint of a HashMap ByteString Int (Was: How on Earth Do You Reason about Space?)

Hello Johan,

On Wed, Jun 01, 2011 at 08:52:04AM +0200, Johan Tibell wrote:
> I thought it'd be educational to do some back-of-the-envelope
> calculations to see how much memory we'd expect to use to store words
> in a HashMap ByteString Int.

Thank you for your writeup, which is very informative! You should really think
about writing that tutorial you mentioned earlier, it would probably be a boon
for many.

One additional thought: it might be interesting to provide this outside of this
mailing list, perhaps as a documentation addendum to unordered-containers, since
it really explains the size needs for HashMaps of ByteStrings to folks without
knowledge of higher arcane wizardry.

> For example, the average English word length is 5 characters, if you
> include stop words. We'd expect to use
> 
>     (9 + 9 + 2) * 8 + 5 = 165
> 
> bytes per unique word in our input corpus, on a 64-bit machine. Plus
> any GC overhead. This is probably more than one would expect.

Using an sqlite database, I committed 1/16 of my whole corpus (which is in
German) in ten separate steps and counted the amount of unigrams contained
therein: 2,207,369; Assuming German affection for longwords, and the fact that
the data contained umlauts in UTF-8, let's up the average word length a bit:
say, 15 (I pulled that out of my ass, obviously.) This brings us to 2207369 *
180 bytes, or ~400MB hash table.
(Continue reading)

Johan Tibell | 1 Jun 17:33 2011
Picon

Re: Computing the memory footprint of a HashMap ByteString Int (Was: How on Earth Do You Reason about Space?)

On Wed, Jun 1, 2011 at 4:24 PM, Aleksandar Dimitrov
<aleks.dimitrov <at> googlemail.com> wrote:
> One additional thought: it might be interesting to provide this outside of this
> mailing list, perhaps as a documentation addendum to unordered-containers, since
> it really explains the size needs for HashMaps of ByteStrings to folks without
> knowledge of higher arcane wizardry.

I think it would be a good answer on StackOverflow, but no one asked
me this question there. I could list the size overhead of a HashMap in
the docs, but I'm about to change it so I won't do so until after the
change. I don't know how big guarantees I want to make in the docs
either, as it might constrain future improvements to the
implementation. Perhaps in an addendum, like you said.

> I ended up using Data.Text over SmallString, also because I need to do other
> operations on the words (case folding, mass matching, and possibly more) and
> Text seemed more attractive for these tasks, but I'll try using SmallString,
> too.

Text uses 6 words per value, plus the size of the UTF-16 encoded
content. There's a Google Summer of Code project this year to convert
Text to UTF-8, which should decrease the space usage. In addition,
Text values aren't pinned on the heap, unlike ByteStrings, so they
should be nicer to the GC. The lowest overhead string type you could
imagine (given how GHC implements ByteArray#) would have a 4 word
overhead. Text trades 2 extra words to support efficient slicing (just
like ByteString).

When Text uses UTF-8 internally it should be possible to convert Text
to/from SmallString in O(1) time as the underlying ByteArray# could
(Continue reading)

Jason Dagit | 2 Jun 05:10 2011
Picon

Re: Computing the memory footprint of a HashMap ByteString Int (Was: How on Earth Do You Reason about Space?)

On Wed, Jun 1, 2011 at 8:33 AM, Johan Tibell <johan.tibell <at> gmail.com> wrote:
> On Wed, Jun 1, 2011 at 4:24 PM, Aleksandar Dimitrov
> <aleks.dimitrov <at> googlemail.com> wrote:
>> One additional thought: it might be interesting to provide this outside of this
>> mailing list, perhaps as a documentation addendum to unordered-containers, since
>> it really explains the size needs for HashMaps of ByteStrings to folks without
>> knowledge of higher arcane wizardry.
>
> I think it would be a good answer on StackOverflow, but no one asked
> me this question there. I could list the size overhead of a HashMap in
> the docs, but I'm about to change it so I won't do so until after the
> change. I don't know how big guarantees I want to make in the docs
> either, as it might constrain future improvements to the
> implementation. Perhaps in an addendum, like you said.

One of the cool things about SO is that you can answer your own
question.  For example, you might do that if you're anticipating an
FAQ.  I think asking this question on SO and reposting your answer
from this thread would be great.

Jason
Johan Tibell | 2 Jun 07:52 2011
Picon

Re: Computing the memory footprint of a HashMap ByteString Int (Was: How on Earth Do You Reason about Space?)

On Thu, Jun 2, 2011 at 5:10 AM, Jason Dagit <dagitj <at> gmail.com> wrote:
> One of the cool things about SO is that you can answer your own
> question.  For example, you might do that if you're anticipating an
> FAQ.  I think asking this question on SO and reposting your answer
> from this thread would be great.

Good to know.

I've decided to stick it in a blog post, add some pictures, and
elaborate some more (e.g. provide numbers for all containers and for
Text, so people can refer to the post when needed).

-- Johan
Johan Tibell | 9 Jun 13:59 2011
Picon

Re: Computing the memory footprint of a HashMap ByteString Int (Was: How on Earth Do You Reason about Space?)

On Thu, Jun 2, 2011 at 7:52 AM, Johan Tibell <johan.tibell <at> gmail.com> wrote:
> I've decided to stick it in a blog post, add some pictures, and
> elaborate some more (e.g. provide numbers for all containers and for
> Text, so people can refer to the post when needed).

I ended up writing two blog posts:

Computing the size of a HashMap
http://blog.johantibell.com/2011/06/computing-size-of-hashmap.html

Memory footprints of some common data types
http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html

Johan

Gmane