Kirk True | 17 May 19:52

UIMA internals memory footprint

Hi all,

I have begun seeing heavy memory use when processing largish
documents through a UIMA pipeline. I wanted to make sure what I'm
seeing with regard to UIMA's internal memory use is on par with
expectations.

It looks like for either a 1,500,000-byte or a 15,000,000-byte document
with the same annotations (100,000 10-character annotations), we incur
a ~13 MB "overhead" for internal UIMA data structures. Is this in line
with expectations?

Details:

In the interest of narrowing down the issue, I made a very simple test
annotator which mimics what my annotators do. The annotator creates a
document of N bytes which is set in a view in the CAS, then it
transforms the bytes to an HTML string that is then set in a view in
the CAS. Next, for each view, the annotator creates 50,000 annotations.
Each annotation has two 5-character attributes. I profiled my
application using two profilers (JProbe and YourKit) and took heap
snapshots before and after processing was performed and saw similar
results.

I know there's a lot going on under the hood, so I'm trying to get an
idea of what kind of size factor I can expect for a given document
size. Right now, according to my calculations and verified by the
profiler, the expected memory usage for just my data (i.e. the two
views of the document and the strings making up the annotations) is:

[...]

Marshall Schor | 29 May 22:04

Re: UIMA internals memory footprint

Hi Kirk -

Thanks for posting your test case.  It did point up an inefficiency in 
how one of the hash maps was being used - this is now being improved. 

Your basic question concerned expectations for storing things in the 
CAS.  Here's the basic model.

The CAS stores most things in int[] arrays.

Feature Structures take up a number of entries in the int[]:  one plus 
the number of features.  So an annotation takes 1 + 3 = 4 words.  (1 
word = 4 bytes). Your annotation type added 2 more features, so each of 
your annotations takes 6 words in the int[].

Feature Structure features which are Strings take another 4 slots in the 
int[] arrays (16 bytes), per string being referenced, in addition to 
whatever storage Java uses for strings.  Sun Java 6_01 appears to use 32 
+ 2 * number_of_characters bytes to store a string.  In your test case, each 
String in Java took 42 bytes. 

Indexes take one word per annotation indexed, typically.

All of the int[] objects grow as needed, by quantum jumps, so at any 
particular time, the number of words allocated is often larger than the 
number used. 
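
As a rough illustration, the figures above can be combined into a small estimator. This is only a sketch and not UIMA code: the class and method names are made up, it assumes a single index holding every annotation, it treats the string features as plain Java Strings of equal length, and it ignores both the allocated-but-unused slots and any Java wrapper objects.

```java
/** Back-of-the-envelope CAS sizing built only from the figures quoted in this message. */
final class CasFootprintModel {
    static final int BYTES_PER_WORD = 4;           // one int[] slot
    static final int BASE_ANNOTATION_FEATURES = 3; // a plain annotation is 1 + 3 = 4 words
    static final int STRING_REF_WORDS = 4;         // extra int[] slots per referenced String
    static final int JAVA_STRING_BASE_BYTES = 32;  // observed on Sun Java 6_01, may differ elsewhere

    /** Words in the int[] for one annotation: one for the type plus one per feature. */
    static long fsWords(int extraFeatures) {
        return 1 + BASE_ANNOTATION_FEATURES + extraFeatures;
    }

    /** Bytes Java uses for one String of the given length, per the observation above. */
    static long javaStringBytes(int chars) {
        return JAVA_STRING_BASE_BYTES + 2L * chars;
    }

    /** Estimated bytes for the annotations, their string values and one index entry each. */
    static long estimateBytes(long annotations, int extraFeatures,
                              int stringFeatures, int charsPerString) {
        long fsBytes = annotations * fsWords(extraFeatures) * BYTES_PER_WORD;
        long refBytes = annotations * stringFeatures * STRING_REF_WORDS * BYTES_PER_WORD;
        long stringBytes = annotations * stringFeatures * javaStringBytes(charsPerString);
        long indexBytes = annotations * BYTES_PER_WORD; // assumption: exactly one index
        return fsBytes + refBytes + stringBytes + indexBytes;
    }

    public static void main(String[] args) {
        // The test case: 100,000 annotations, 2 added features, both 5-character Strings.
        System.out.println(estimateBytes(100_000, 2, 2, 5)); // 14400000
    }
}
```

The result is a lower bound on what the model predicts, since the int[] arrays are usually allocated larger than the number of words in use.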

To reference CAS objects from a Java program, one of 2 interfaces is 
used: the "JCas" interface or the plain "CAS" Java interface.  Both of 
these create a 2 field Java object for each referenced CAS object.  In 
[...]

Kirk True | 1 Jun 09:26

Re: UIMA internals memory footprint

Hi Marshall,

> This reduces 4.6 MB down to 1 MB overhead for 100K annotations.

That's awesome - thanks so much for looking into this! 

Just to double-check - will this make it into the 2.2 release?

Thanks again,
Kirk 

Marshall Schor | 8 Jun 02:45

Re: UIMA internals memory footprint

Kirk True wrote:
> Hi Marshall,
>
>   
>> This reduces 4.6 MB down to 1 MB overhead for 100K annotations.
>>     
>
> That's awesome - thanks so much for looking into this! 
>
> Just to double-check - will this make it into the 2.2 release?
>
> Thanks again,
> Kirk 
>
>
>   
Yes, it should be there.

-Marshall

Adam Lally | 17 May 20:58

Re: UIMA internals memory footprint

Kirk,

In this test are you running a CPE or just an AnalysisEngine?  If it
is a CPE do you know what your CAS Pool size is?

When a CAS is created it does allocate a large heap which is then
filled as you create annotations.  By default I believe this is
500,000 cells (2MB) per CAS, but this can be overridden (see
UIMAFramework.getDefaultPerformanceTuningProperties()).  So this can
definitely be one source of memory overhead.  As you saw it does not
grow with larger documents, it will only grow if you create enough
annotations to fill up the allocated space.
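
The behaviour described here (a block of cells reserved up front, then filled, and enlarged only when it runs out) can be pictured with a toy heap. This is a minimal sketch, not the UIMA implementation: the class name is invented, and growing by a fixed increment is an assumption made for illustration.

```java
import java.util.Arrays;

/** Toy model of a cell heap: one int[] allocated up front, grown only when it fills. */
final class GrowableCellHeap {
    private int[] cells;          // backing storage, 4 bytes per cell
    private int used;             // number of cells handed out so far
    private final int increment;  // growth step in cells (assumed policy)

    GrowableCellHeap(int initialCells, int increment) {
        if (initialCells <= 0 || increment <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        this.cells = new int[initialCells];
        this.increment = increment;
    }

    /** Reserves n cells and returns the index of the first one, growing the array if needed. */
    int allocate(int n) {
        while (used + n > cells.length) {
            cells = Arrays.copyOf(cells, cells.length + increment);
        }
        int address = used;
        used += n;
        return address;
    }

    int usedCells() { return used; }
    int capacityCells() { return cells.length; }
    long capacityBytes() { return 4L * cells.length; }

    public static void main(String[] args) {
        GrowableCellHeap heap = new GrowableCellHeap(500_000, 500_000);
        // The full block is reserved before any annotation exists.
        System.out.println(heap.capacityBytes()); // 2000000
        for (int i = 0; i < 1_000; i++) {
            heap.allocate(5); // five cells per annotation
        }
        // Capacity depends only on the cells used, not on the document length.
        System.out.println(heap.usedCells() + " of " + heap.capacityCells()); // 5000 of 500000
    }
}
```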

-Adam

On 5/17/07, Kirk True <kirk@...> wrote:
> Hi all,
>
> I have begun seeing heavy memory use when processing largish
> documents through a UIMA pipeline. I wanted to make sure what I'm
> seeing with regard to UIMA's internal memory use is on par with
> expectations.
>
> It looks like for either a 1,500,000-byte or a 15,000,000-byte document
> with the same annotations (100,000 10-character annotations), we incur
> a ~13 MB "overhead" for internal UIMA data structures. Is this in line
> with expectations?
>
> Details:
>
[...]

Kirk True | 18 May 01:28

Re: UIMA internals memory footprint

Hi Adam,

> Kirk,
> 
> In this test are you running a CPE or just an AnalysisEngine?  If it
> is a CPE do you know what your CAS Pool size is?

It's an AnalysisEngine.

> When a CAS is created it does allocate a large heap which is then
> filled as you create annotations.  By default I believe this is
> 500,000 cells (2MB) per CAS, but this can be overridden (see
> UIMAFramework.getDefaultPerformanceTuningProperties()).  So this can
> definitely be one source of memory overhead.  As you saw it does not
> grow with larger documents, it will only grow if you create enough
> annotations to fill up the allocated space.

I noticed that this is tweak-able and set it to something insanely
small (like 100). But, as you said, it grows as the number of
annotations grows. Since the parameter is under the umbrella of
performance, I'd assume that it would actually be better to
pre-allocate close to what we're going to use.

Thanks!
Kirk

> On 5/17/07, Kirk True <kirk@...> wrote:
> > Hi all,
> >
> > I have begun seeing heavy memory use when processing
[...]

Thilo Goetz | 18 May 09:55

Re: UIMA internals memory footprint

Kirk True wrote:
> Hi Adam,
> 
>> Kirk,
>>
>> In this test are you running a CPE or just an AnalysisEngine?  If it
>> is a CPE do you know what your CAS Pool size is?
> 
> It's an AnalysisEngine.
> 
>> When a CAS is created it does allocate a large heap which is then
>> filled as you create annotations.  By default I believe this is
>> 500,000 cells (2MB) per CAS, but this can be overridden (see
>> UIMAFramework.getDefaultPerformanceTuningProperties()).  So this can
>> definitely be one source of memory overhead.  As you saw it does not
>> grow with larger documents, it will only grow if you create enough
>> annotations to fill up the allocated space.
> 
> I noticed that this is tweak-able and set it to something insanely
> small (like 100). But, as you said, it grows as the number of
> annotations grows. Since the parameter is under the umbrella of
> performance, I'd assume that it would actually be better to
> pre-allocate close to what we're going to use.
[...]

Yes.

You can estimate data use on the heap as follows.  Each FS uses at least one
int for the type information, plus whatever features it has.  So a vanilla
annotation is 3 ints, one for the type, and one for the start and end features,
[...]

Adam Lally | 21 May 15:38

Re: UIMA internals memory footprint

On 5/18/07, Thilo Goetz <twgoetz@...> wrote:
> You can estimate data use on the heap as follows.  Each FS uses at least one
> int for the type information, plus whatever features it has.  So a vanilla
> annotation is 3 ints, one for the type, and one for the start and end features,
> respectively.  If you have two additional features, that's 5 ints, so 20 bytes.
> If you use the JCas, you incur an additional overhead of a Java object for
> each annotation.  It's small, but I can't say off the top of my head how small
> exactly.  Plus, the JCas objects are held in a HashMap (or some such, Marshall
> correct me if I'm wrong), which incurs additional memory overhead.
>
> In my experience, the CAS can easily reach 10 to 20 times the size of the input
> document.  If you have information-rich token annotations, that's not really
> surprising.  (And this is without using JCas).  Imagine you were to manually
> create Java objects that carry the same information, you would see roughly
> the same kind of overhead.
>

Using these numbers can we account for the 9,300,000 bytes of integer arrays?

100,000 annotations of size 5 cells = 500,000 ints, which is exactly
the default heap size.  But with the Sofa FS this will exceed the
default heap size.  It will grow by another 500,000 (I think).

So that accounts for 1,000,000 ints = 4,000,000 bytes.

Where are the other 5,300,000?
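
The accounting above can be replayed as a few lines of arithmetic. This is a sketch under the same guesses made in this message: a 500,000-cell default, growth in 500,000-cell steps (stated above as uncertain), and a Sofa FS of unknown size, represented here by a one-cell placeholder. The class and method names are invented.

```java
/** Replays the int[] accounting from this message using its stated assumptions. */
final class HeapAccounting {
    static final long BYTES_PER_INT = 4;

    /** Smallest capacity, reachable from the initial size in fixed steps, that holds neededCells. */
    static long allocatedCells(long neededCells, long initialCells, long increment) {
        if (neededCells <= initialCells) {
            return initialCells;
        }
        long extra = neededCells - initialCells;
        long steps = (extra + increment - 1) / increment; // ceiling division
        return initialCells + steps * increment;
    }

    public static void main(String[] args) {
        long annotationCells = 100_000L * 5; // 500,000 cells, exactly the default heap
        long sofaCells = 1;                  // placeholder: the real Sofa FS size is not given
        long allocated = allocatedCells(annotationCells + sofaCells, 500_000, 500_000);
        long explainedBytes = allocated * BYTES_PER_INT;
        long unexplainedBytes = 9_300_000L - explainedBytes;
        System.out.println(allocated);        // 1000000
        System.out.println(explainedBytes);   // 4000000
        System.out.println(unexplainedBytes); // 5300000
    }
}
```

Any Sofa size between 1 and 500,000 cells gives the same result, so the placeholder does not affect the 5,300,000-byte gap.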

Likewise, what about the 1,600,000 bytes of Integers?  The JCAS hash
map only accounts for one per annotation, which in this case should
only be 400,000 bytes.
[...]

Marshall Schor | 22 May 15:46

Re: UIMA internals memory footprint

The indexes use int[] arrays. 

Kirk - what indexes do you have defined (if any)?  Do you 
"addToIndexes..." any of
the annotations you create?

-Marshall

Adam Lally wrote:
> On 5/18/07, Thilo Goetz <twgoetz@...> wrote:
>> You can estimate data use on the heap as follows.  Each FS uses at 
>> least one
>> int for the type information, plus whatever features it has.  So a 
>> vanilla
>> annotation is 3 ints, one for the type, and one for the start and end 
>> features,
>> respectively.  If you have two additional features, that's 5 ints, so 
>> 20 bytes.
>> If you use the JCas, you incur an additional overhead of a Java 
>> object for
>> each annotation.  It's small, but I can't say off the top of my head 
>> how small
>> exactly.  Plus, the JCas objects are held in a HashMap (or some such, 
>> Marshall
>> correct me if I'm wrong), which incurs additional memory overhead.
>>
>> In my experience, the CAS can easily reach 10 to 20 times the size of 
>> the input
>> document.  If you have information-rich token annotations, that's 
>> not really
[...]

Kirk True | 22 May 18:42

Re: UIMA internals memory footprint

Hi Marshall,

> The indexes use int[] arrays. 
> 
> Kirk - what indexes do you have defined (if any)?  Do you 
> "addToIndexes..." any of
> the annotations you create?

Yes - I'm adding all annotations to the indexes.

If it helps, here's the source code for the annotator and the shim
application from which it is run:

    http://www.mustardgrain.com/files/testcaseannotator.zip

Thanks for all the feedback!

Kirk

> -Marshall
> 
> Adam Lally wrote:
> > On 5/18/07, Thilo Goetz <twgoetz@...> wrote:
> >> You can estimate data use on the heap as follows.  Each FS uses at least one
> >> int for the type information, plus whatever features it has.  So a vanilla
> >> annotation is 3 ints, one for the type, and one for the start and
[...]

Adam Lally | 22 May 21:23

Re: UIMA internals memory footprint

On 5/22/07, Kirk True <kirk@...> wrote:
> If it helps, here's the source code for the annotator and the shim
> application from which it is run:
>
>     http://www.mustardgrain.com/files/testcaseannotator.zip
>

Kirk,

Just so we have all the legal bases covered, could you attach this to
the JIRA issue UIMA-412
(https://issues.apache.org/jira/browse/UIMA-412) and check the box
that says you grant license to the ASF?

Thanks,
  -Adam

Marshall Schor | 20 May 02:58

Re: UIMA internals memory footprint

Thilo Goetz wrote:
> Kirk True wrote:
>> Hi Adam,
>>
>>> Kirk,
>>>
>>> In this test are you running a CPE or just an AnalysisEngine?  If it
>>> is a CPE do you know what your CAS Pool size is?
>>
>> It's an AnalysisEngine.
>>
>>> When a CAS is created it does allocate a large heap which is then
>>> filled as you create annotations.  By default I believe this is
>>> 500,000 cells (2MB) per CAS, but this can be overridden (see
>>> UIMAFramework.getDefaultPerformanceTuningProperties()).  So this can
>>> definitely be one source of memory overhead.  As you saw it does not
>>> grow with larger documents, it will only grow if you create enough
>>> annotations to fill up the allocated space.
>>
>> I noticed that this is tweak-able and set it to something insanely
>> small (like 100). But, as you said, it grows as the number of
>> annotations grows. Since the parameter is under the umbrella of
>> performance, I'd assume that it would actually be better to
>> pre-allocate close to what we're going to use.
> [...]
>
> Yes.
>
> You can estimate data use on the heap as follows.  Each FS uses at 
> least one
[...]

Thilo Goetz | 20 May 08:25

Re: UIMA internals memory footprint

Marshall Schor wrote:
> Thilo Goetz wrote:
[...]
>> If you use the JCas, 
> or you create FeatureStructure Java objects (which are Java Objects),

True.  My point was (which maybe I should have mentioned ;-) that JCas
objects stick around, while plain old FeatureStructures can get
garbage collected after each annotator has run.  So JCas objects behave
like the rest of the CAS in that respect, and unlike FeatureStructure
objects.  Not beating on the JCas, just trying to explain sources of
memory consumption in the final analysis, after processing, so to speak.
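
The difference between wrappers that stick around and wrappers that can be collected comes down to whether something keeps a reference to them. The sketch below illustrates only that general Java behaviour; none of these classes are UIMA classes, and the map-based cache is a stand-in for whatever structure actually holds the JCas objects.

```java
import java.util.HashMap;
import java.util.Map;

/** Contrasts wrappers held in a map with wrappers created on demand. */
final class WrapperRetentionSketch {

    /** Hypothetical Java object wrapping an address in the int[] heap. */
    static final class Wrapper {
        final int address;
        Wrapper(int address) { this.address = address; }
    }

    /** Cached style: every wrapper is stored in a map, so it stays reachable. */
    static final class CachingView {
        private final Map<Integer, Wrapper> cache = new HashMap<>();

        Wrapper get(int address) {
            return cache.computeIfAbsent(address, a -> new Wrapper(a));
        }

        int retained() { return cache.size(); }
    }

    /** Transient style: a fresh wrapper per call, which the view itself never retains. */
    static final class TransientView {
        Wrapper get(int address) { return new Wrapper(address); }
    }

    public static void main(String[] args) {
        CachingView cached = new CachingView();
        TransientView transientView = new TransientView();
        for (int i = 0; i < 100_000; i++) {
            cached.get(i * 5);        // stays alive as long as the view does
            transientView.get(i * 5); // eligible for collection once the caller drops it
        }
        System.out.println(cached.retained()); // 100000
    }
}
```

In the cached style the map also needs one boxed Integer key and one map entry per wrapper, which adds to the per-annotation cost.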

--Thilo

