fred | 6 Aug 14:54

partially reading a file...

Hi,

Let's say I want to read a (binary) file which contains an nx*ny*nz array.

Is it possible to read a "sub-array" from this file, i.e. each block of 
(nx/4, ny/4, nz/4) for instance, without loading the whole file?

TIA.

Cheers,

--

-- 
Fred
Travis E. Oliphant | 6 Aug 16:16

Re: partially reading a file...

fred wrote:
> Hi,
>
> Let's say I want to read a (binary) file which contains a nx*ny*nz array.
>
> Is it possible to read a "sub-array" from this file, ie each block of 
> (nx/4, ny/4, nz/4) for instance, without loading the whole file ?
>   
An easy way to do this which forces the operating system to do the work 
of partial loading is to use a memory mapped file as the source of the 
array (i.e. a memmap array). 

Then, selecting out a block is as simple as slicing.

-Travis
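
A minimal sketch of this suggestion, assuming NumPy is installed. The file name, the float64 dtype and the tiny shape are made up for illustration; a real file would be far larger.

```python
import os
import tempfile

import numpy as np

# Illustrative values only.
nx, ny, nz = 8, 12, 16
path = os.path.join(tempfile.mkdtemp(), "cube.bin")  # hypothetical file name

# Write a raw C-ordered float64 array so there is something to map.
full = np.arange(nx * ny * nz, dtype=np.float64).reshape(nx, ny, nz)
full.tofile(path)

# Map the file read-only; no array data is read at this point.
a = np.memmap(path, mode="r", dtype=np.float64, shape=(nx, ny, nz))

# Slicing gives a view onto the mapping; the operating system only reads
# the pages that are actually touched.
block = a[: nx // 4, : ny // 4, : nz // 4]

# Copy the block into ordinary memory when a standalone array is needed.
block_in_ram = np.array(block)
print(block_in_ram.shape)
```

The dtype and shape passed to memmap must match how the file was written, since a raw binary file carries no metadata of its own.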
fred | 6 Aug 16:30

Re: partially reading a file...

Travis E. Oliphant wrote:
> fred wrote:
>> Hi,
>>
>> Let's say I want to read a (binary) file which contains a nx*ny*nz array.
>>
>> Is it possible to read a "sub-array" from this file, ie each block of 
>> (nx/4, ny/4, nz/4) for instance, without loading the whole file ?
>>   
> An easy way to do this which forces the operating system to do the work 
> of partial loading is to use a memory mapped file as the source of the 
> array (i.e. a memmap array). 
> 
> Then, selecting out a block is as simple as slicing.
I should have mentioned this: the aim is to split a "large" data file, 
_bigger_ than the total available memory, into several files.

Does memmap still apply?

Cheers,

--

-- 
Fred
Travis E. Oliphant | 6 Aug 19:43

Re: partially reading a file...

fred wrote:
> Travis E. Oliphant wrote:
>   
>> fred wrote:
>>     
>>> Hi,
>>>
>>> Let's say I want to read a (binary) file which contains a nx*ny*nz array.
>>>
>>> Is it possible to read a "sub-array" from this file, ie each block of 
>>> (nx/4, ny/4, nz/4) for instance, without loading the whole file ?
>>>   
>>>       
>> An easy way to do this which forces the operating system to do the work 
>> of partial loading is to use a memory mapped file as the source of the 
>> array (i.e. a memmap array). 
>>
>> Then, selecting out a block is as simple as slicing.
>>     
> Maybe I had to mention this: the aim is to cut in several files a 
> "large" data file, _bigger_ than total available memory amount.
>   
Absolutely, memory mapping still applies --- it's a perfect application 
for it.  But you will probably need a 64-bit system, because a 32-bit 
process cannot address a mapping of more than a few gigabytes.

Memory mapping is the mechanism the OS uses for "virtual memory", which 
uses disk space to extend main memory.  You are just using that idea 
directly with a memory mapped file.

-Travis
fred | 6 Aug 19:53

Re: partially reading a file...

Travis E. Oliphant wrote:

> Absolutely memory mapping still applies --- it's a perfect application 
> for it.  But, you will probably need a 64-bit system.
No problem.

> Memory mapping is how the OS handles "virtual memory" which uses disk 
> space to increase main memory.  You are just using that idea directly 
> with a memory mapped file.
Ok.
Thanks for the hint.

Cheers,

--

-- 
Fred
Travis E. Oliphant | 6 Aug 20:14

Re: partially reading a file...

fred wrote:
> Travis E. Oliphant wrote:
>
>   
>> Absolutely memory mapping still applies --- it's a perfect application 
>> for it.  But, you will probably need a 64-bit system.
>>     
> No problem.
>
>   
>> Memory mapping is how the OS handles "virtual memory" which uses disk 
>> space to increase main memory.  You are just using that idea directly 
>> with a memory mapped file.
>>     
> Ok.
> Thanks for the hint.
>
>   

More directly:

Use numpy.memmap --- look at the docstring for example use and help on 
all the arguments available.  But, something like this (untested):

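import numpy
# filename, outname, nx, ny and nz stand for your own values
a = numpy.memmap(filename, mode='r', dtype=float, shape=(nx, ny, nz))
b = a[:nx//4, :ny//4, :nz//4]  # // keeps the slice bounds integers
b.tofile(outname)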

Should work...
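
The same idea extends to writing every block, not just the first one. The following is only a sketch under stated assumptions: the helper name `split_into_blocks`, the output file names and the tiny float32 test array are all hypothetical, and the file is assumed to be a raw C-ordered array with no header.

```python
import os
import tempfile

import numpy as np


def split_into_blocks(path, shape, dtype, parts, out_dir):
    """Write every sub-block of a raw C-ordered 3D array to its own file.

    `parts` is the number of blocks along each axis (4 in the thread).
    Returns the list of files written.
    """
    a = np.memmap(path, mode="r", dtype=dtype, shape=shape)
    # Integer block boundaries along each axis; this also copes with
    # sizes that are not exact multiples of `parts`.
    edges = [[n * p // parts for p in range(parts + 1)] for n in shape]
    written = []
    for i in range(parts):
        for j in range(parts):
            for k in range(parts):
                block = a[edges[0][i]:edges[0][i + 1],
                          edges[1][j]:edges[1][j + 1],
                          edges[2][k]:edges[2][k + 1]]
                out = os.path.join(out_dir, "block_%d_%d_%d.bin" % (i, j, k))
                # Only one block is copied into memory at a time, then
                # written out in C order.
                np.ascontiguousarray(block).tofile(out)
                written.append(out)
    return written


# Tiny demonstration with made-up sizes.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "input.bin")  # hypothetical file name
np.arange(8 * 8 * 8, dtype=np.float32).reshape(8, 8, 8).tofile(src)
files = split_into_blocks(src, (8, 8, 8), np.float32, 4, tmp)
print(len(files))
```

Because each block is copied and written separately, peak memory use stays near the size of one block rather than the size of the whole file.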

fred | 6 Aug 20:18

Re: partially reading a file...

Travis E. Oliphant wrote:

> More directly:
> 
> Use numpy.memmap --- look at the docstring for example use and help on 
> all the arguments available.  But, something like this (untested):
> 
> a = numpy.memmap(<filename>, mode='r', dtype=float, shape=(nx,ny,nz))
> b = a[:nx/4,:ny/4,:nz/4]
> b.tofile(<somefilename>)
> 
> Should work...
Travis: tons of thanks ! :-))

Cheers,

--

-- 
Fred
Sebastian Haase | 6 Aug 21:06

Re: partially reading a file...

On Wed, Aug 6, 2008 at 8:14 PM, Travis E. Oliphant
<oliphant <at> enthought.com> wrote:
> fred wrote:
>> Travis E. Oliphant wrote:
>>
>>
>>> Absolutely memory mapping still applies --- it's a perfect application
>>> for it.  But, you will probably need a 64-bit system.
>>>
>> No problem.
>>
>>
>>> Memory mapping is how the OS handles "virtual memory" which uses disk
>>> space to increase main memory.  You are just using that idea directly
>>> with a memory mapped file.
>>>
>> Ok.
>> Thanks for the hint.
>>
>>
>
> More directly:
>
> Use numpy.memmap --- look at the docstring for example use and help on
> all the arguments available.  But, something like this (untested):
>
> a = numpy.memmap(<filename>, mode='r', dtype=float, shape=(nx,ny,nz))
> b = a[:nx/4,:ny/4,:nz/4]
> b.tofile(<somefilename>)
>
Dharhas Pothina | 7 Aug 15:19

Mapping a series of files.

Hi,

I've been following the thread on 'partially reading a file' with some interest and have a related question.

I have a series of large binary data files (1_data.dat, 2_data.dat, etc.) that represent a 3D time series
of data. Right now I am cycling through all the files, reading the entire dataset into memory and extracting
the subset I need. This works but is extremely memory hungry and slow, and I'm running out of memory for
datasets more than a year long. I could calculate which few files contain the data I need and only read those
in, but that is a bit cumbersome and also doesn't help if I need a 1D or 2D slice of the whole time period.

In the other thread Travis gave an example of using memmap to map a file to memory. Can I do this with
multiple files, i.e. use memmap to generate an array[x,y,z,t] that I can then slice to read only
what I need? Another complication is that each binary file has a header section and then a data section. By
reading the first file I can calculate the offset for the data part of the file.

thanks,

- dharhas
Sebastian Haase | 7 Aug 18:00

Re: Mapping a series of files.

On Thu, Aug 7, 2008 at 3:19 PM, Dharhas Pothina
<Dharhas.Pothina <at> twdb.state.tx.us> wrote:
> Hi,
>
> I've been following the thread on 'partially reading a file' with some interest and have a related question.
>
> So I have a series of large binary data files (1_data.dat, 2_data.dat, etc) that represent a 3D time series
of data. Right now I am cycling through all the files reading the entire dataset to memory and extracting
the subset I need. This works but is extremely memory hungry and slow and I'm running out of memory for
datasets more than a year long. I could calculate which few files contain the  data I need and only read those
in but that is a bit cumbersome and also doesn't help if I need a 1d or 2d slice of the whole time period.
>
> In the other thread Travis gave an example of using memmap to map a file to memory. Can I do this to with
multiple files. ie use memmap to generate an array[x,y,z,t] that I can then use slicing to actually read
what I need? Another complication is that each binary file has a header section and then a data section. By
reading the first file I can calculate the offset for the data part of the file.
>
Hi Dharhas,
yes, you can do all these things.
I'm doing this for 3D and 4D image files.  What file format are you
interested in? I use MRC files ...

Cheers,
Sebastian Haase
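
One way this could look, as a rough sketch rather than anyone's actual code: map each file separately with the `offset` argument of numpy.memmap to skip the header, then gather the wanted selection from every map. The function names, the 16-byte header, the float32 dtype and the (time, x, y, z) layout inside each file are all assumptions made for the example.

```python
import os
import tempfile

import numpy as np


def open_series(paths, header_bytes, shape, dtype):
    """Map the data section of every file read-only, skipping its header."""
    return [
        np.memmap(p, mode="r", dtype=dtype, offset=header_bytes, shape=shape)
        for p in paths
    ]


def extract(maps, spatial_index):
    """Gather one spatial selection over the whole time period.

    Each map is assumed to have shape (nt, nx, ny, nz) with time first;
    only the selected values are copied into memory.
    """
    pieces = [np.asarray(m[(slice(None),) + tuple(spatial_index)]) for m in maps]
    return np.concatenate(pieces, axis=0)


# Tiny demonstration with made-up sizes and a made-up 16-byte header.
tmp = tempfile.mkdtemp()
shape = (2, 3, 4, 5)
paths = []
for n in range(1, 4):
    path = os.path.join(tmp, "%d_data.dat" % n)
    with open(path, "wb") as f:
        f.write(b"\0" * 16)
        f.write(np.full(shape, n, dtype=np.float32).tobytes())
    paths.append(path)

maps = open_series(paths, header_bytes=16, shape=shape, dtype=np.float32)
profile = extract(maps, (0, 1, slice(None)))  # all z at x=0, y=1, all times
print(profile.shape)
```

Keeping one map per file avoids building a single giant array; the concatenation only ever holds the slice that was asked for.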
Dharhas Pothina | 7 Aug 18:24

Re: Mapping a series of files.


It isn't a standardized format. It is the output of a Fortran hydrodynamic circulation model called SELFE.
The output files are Fortran binaries. I could probably cycle through the files and convert them to netCDF
one by one with a Python script, but it would be quicker and more space efficient if I could directly use the
original outputs.

thanks,

- dharhas
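
A caveat worth checking when computing the data offset of Fortran output: files written with unformatted *sequential* access usually wrap each record in length markers, while direct or stream access does not. Which of these SELFE uses is not stated in this thread, so the sketch below is only an assumption-laden illustration. It assumes 4-byte little-endian markers (a common compiler default, but compiler dependent), and the helper name `first_record_span` is hypothetical.

```python
import os
import struct
import tempfile


def first_record_span(path, marker_fmt="<i"):
    """Return (start, length) in bytes of the first record's payload.

    Assumes the layout [length][payload][length] used by many compilers
    for unformatted sequential files.
    """
    size = struct.calcsize(marker_fmt)
    with open(path, "rb") as f:
        (length,) = struct.unpack(marker_fmt, f.read(size))
        f.seek(length, 1)  # skip over the payload
        (trailer,) = struct.unpack(marker_fmt, f.read(size))
    if trailer != length:
        raise ValueError("record markers disagree; the file may not use this layout")
    return size, length


# Build a small file with one 24-byte record to try the helper on.
path = os.path.join(tempfile.mkdtemp(), "record.bin")  # hypothetical name
payload = struct.pack("<6i", 1, 2, 3, 4, 5, 6)
marker = struct.pack("<i", len(payload))
with open(path, "wb") as f:
    f.write(marker + payload + marker)

print(first_record_span(path))  # (4, 24)
```

If the markers do not agree, the file probably uses a different layout and the offset has to come from the model's own format description.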

Charles Doutriaux | 7 Aug 18:15

Re: Mapping a series of files.

Dharhas,

If your files can be converted to netCDF (or GRIB), then we have a 
tool to do exactly what you want.
Basically you'd run
cdscan -x full.xml *.nc

and it would generate an XML file that simulates a single full file.

Then, using our cdms2 read module, you would do

import cdms2
f = cdms2.open('full.xml')
data = f("var", time=('2008-1', '2008-7'))
It would figure out for you which files to open. You could be even more 
restrictive by selecting a subregion (latitude=(-20,20)), etc.

for more info:
http://cdat.sf.net

C.

Dharhas Pothina | 7 Aug 18:29

Re: Mapping a series of files.


There are some issues with converting to netCDF, mainly the fact that there is no standard for unstructured
grids in netCDF. Most of the tools work for structured grids. There have been a couple of attempts to come up
with an unstructured-grid netCDF standard, but from what I can tell they petered out in 2006. We are
struggling with this right now, since we have a couple of different hydro models and are trying to define a
common format so we can develop our analysis and visualization tools.

My present idea is to write a module that abstracts the details of each model format and allows me to load the
data into Python.

Will your module work with unstructured grids?

- dharhas

fred | 6 Aug 21:17

Re: partially reading a file...

Travis E. Oliphant wrote:

> Should work...
It does !

Travis, as Gaël likes to say, you are my hero :-)))

Many many thanks.

Cheers,

--

-- 
Fred
fred | 6 Aug 21:56

Re: partially reading a file [corollary]

Now, let's say I have scatter data in a big binary file (stored in the 
form (xi, yi, zi, vi)), as in the snapshot, which shows a "small" scatter.

How can I split the scatter efficiently into several files, as in the 
previous mail?

I can use memmap to "read" the whole file, but what then?
It's more of an algorithmic issue, from my point of view.

TIA.

Cheers,

--

-- 
Fred
_______________________________________________
SciPy-user mailing list
SciPy-user <at> scipy.org
http://projects.scipy.org/mailman/listinfo/scipy-user
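
No answer to this corollary appears in the thread. One possible approach, sketched here purely as an illustration, is to stream through the memory-mapped records in chunks, compute for every point the spatial cell it falls in, and append it to that cell's file. The function name `split_scatter`, the float32 record layout, the known bounding box and the output file names are all assumptions.

```python
import os
import tempfile

import numpy as np


def split_scatter(path, out_dir, bounds, parts, dtype=np.float32, chunk=1000000):
    """Append each (x, y, z, v) record to the file of the cell it falls in.

    bounds: ((xmin, xmax), (ymin, ymax), (zmin, zmax)); parts: cells per axis.
    Returns a dict mapping cell number to the number of points written.
    """
    pts = np.memmap(path, mode="r", dtype=dtype).reshape(-1, 4)
    lo = np.array([b[0] for b in bounds], dtype=np.float64)
    hi = np.array([b[1] for b in bounds], dtype=np.float64)
    counts = {}
    for start in range(0, len(pts), chunk):
        rows = np.asarray(pts[start:start + chunk])
        # Cell index of every point along each axis, clipped so that points
        # on (or outside) the boundary land in the nearest edge cell.
        cell = ((rows[:, :3] - lo) / (hi - lo) * parts).astype(int)
        cell = np.clip(cell, 0, parts - 1)
        flat = (cell[:, 0] * parts + cell[:, 1]) * parts + cell[:, 2]
        for c in np.unique(flat):
            sel = rows[flat == c]
            with open(os.path.join(out_dir, "cell_%d.bin" % c), "ab") as f:
                f.write(sel.tobytes())
            counts[int(c)] = counts.get(int(c), 0) + len(sel)
    return counts


# Tiny demonstration: 1000 random points in the unit cube, 2 cells per axis.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "scatter.bin")  # hypothetical file name
rng = np.random.default_rng(0)
rng.random((1000, 4)).astype(np.float32).tofile(src)
counts = split_scatter(src, tmp, ((0, 1), (0, 1), (0, 1)), parts=2, chunk=300)
print(sum(counts.values()))
```

Only one chunk is held in memory at a time, so the approach does not depend on the scatter fitting in RAM. The output files are opened in append mode, so they should not exist before the run.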
fred | 7 Aug 17:36

Re: partially reading a file...

Travis E. Oliphant wrote:
>
> Should work...
I have tested the trick on a file which has 2.7*10**9 nodes, i.e. > 2**31,
and I get the following message:

File "/usr/local/lib/python2.5/site-packages/numpy/core/memmap.py", line 193, in __new__
      mm = mmap.mmap(fid.fileno(), bytes, access=acc)
ValueError: mmap length is greater than file size

Is there a workaround to handle long integers (if this is the issue)?

TIA.

Cheers,

--

-- 
Fred
Sebastian Haase | 7 Aug 18:08

Re: partially reading a file...

On Thu, Aug 7, 2008 at 5:36 PM, fred <fredmfp <at> gmail.com> wrote:
> Travis E. Oliphant wrote:
>>
>> Should work...
> I have tested the trick on a file which has 2.7 10**9 nodes, ie > 2**31
> and I get the following message:
>
> File "/usr/local/lib/python2.5/site-packages/numpy/core/memmap.py", line
> 193, in __new__
>      mm = mmap.mmap(fid.fileno(), bytes, access=acc)
>      ValueError: mmap length is greater than file size
>
> Is there a workaround to consider long integer (if this is the issue) ?
>
> TIA.
>
Are you "really" on a 64-bit system ?
Is this Linux ?
Is your Python the original from the distro - or did you build it yourself ?
Do a: >>> import sys;print sys.maxint

Did you build numpy yourself or did you download a binary ?

HTH,
Sebastian
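
For readers on current Python versions, where `sys.maxint` no longer exists, a comparable check might look like the sketch below. It only reports the interpreter's own word size and says nothing about how NumPy was built.

```python
import struct
import sys

# Pointer size in bits: 64 on a 64-bit interpreter, 32 on a 32-bit one.
pointer_bits = struct.calcsize("P") * 8

# sys.maxsize is the largest container index the interpreter supports;
# it exceeds 2**32 only on 64-bit builds.
is_64bit = sys.maxsize > 2**32

print(pointer_bits, is_64bit)
```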
fred | 8 Aug 10:36

Re: partially reading a file...

Sebastian Haase wrote:

> Are you "really" on a 64-bit system ?
Yes.

> Is this Linux ?
Yes.

> Is your Python the original from the distro - or did you build it yourself ?
Built myself.

> Do a: >>> import sys;print sys.maxint
I get the expected answer: 2**63-1.

> Did you build numpy yourself or did you download a binary ?
Built myself.

What's going on ?

Cheers,

--

-- 
Fred
Sebastian Haase | 8 Aug 11:27

Re: partially reading a file...

On Fri, Aug 8, 2008 at 10:36 AM, fred <fredmfp <at> gmail.com> wrote:
> Sebastian Haase wrote:
>
>> Are you "really" on a 64-bit system ?
> Yes.
>
>> Is this Linux ?
> Yes.
>
>> Is your Python the original from the distro - or did you build it yourself ?
> Built myself.
>
>> Do a: >>> import sys;print sys.maxint
> I get the expected answer: 2**63-1.
>
>> Did you build numpy yourself or did you download a binary ?
> Built myself.
>
> What's going on ?
>
>
Don't know ....  what is the size of the file you are trying to open
again - in bytes ?
What file system are you using (don't know if this is of any interest ....) ?

-S.
fred | 8 Aug 11:35

Re: partially reading a file...

Sebastian Haase wrote:

> Don't know ....  what is the size of the file you are trying to open
> again - in bytes ?
-rw-r--r-- 1 fred users 5529600000 2008-08-07 16:31 input.sep

> What file system are you using (don't know if this is of any interest ....) ?
ext3

Cheers,

--

-- 
Fred
fred | 8 Aug 12:17

Re: partially reading a file...

Sebastian Haase wrote:

> What file system are you using (don't know if this is of any interest ....) ?
Hmmm, forget this thread.

A keyboard-to-chair interface problem.

Sorry.

Cheers,

--

-- 
Fred
Sebastian Haase | 8 Aug 12:24

Re: partially reading a file...

On Fri, Aug 8, 2008 at 12:17 PM, fred <fredmfp <at> gmail.com> wrote:
> Sebastian Haase wrote:
>
>> What file system are you using (don't know if this is of any interest ....) ?
> Hmmm, forget this thread.
>
> A keyboard-to-chair interface problem.
>
> Sorry.
>
>
That's O.K.

>>> 5529600000 / 1024 / 1024 / 1024
5.14984130859

So you are saying you are mem-mapping a 5.1 GiB file without problem!?

That's pretty neat ;-)

- Sebastian
fred | 8 Aug 14:11

Re: partially reading a file...

Sebastian Haase wrote:

>>>> 5529600000 / 1024 / 1024 / 1024
> 5.14984130859
> 
> So you are saying you are mem-mapping a 5.2 GB file without problem  !?
> 
> That's pretty neat ;-)
Dimensions were wrong in my code, yes.
For 1200x1600x720, it works fine ;-)

Cheers,

--

-- 
Fred
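
The "mmap length is greater than file size" error reported earlier in the thread is what happens when the requested shape times the element size exceeds the file's size on disk. A small guard like the sketch below can make that mistake easier to spot. The helper name `checked_memmap` is hypothetical, and the 4-byte element size in the arithmetic is an inference from the numbers in the thread, not something Fred stated.

```python
import os
import tempfile

import numpy as np


def checked_memmap(path, shape, dtype):
    """Map a raw file read-only after checking that the shape fits in it."""
    needed = int(np.prod(shape, dtype=np.int64)) * np.dtype(dtype).itemsize
    actual = os.path.getsize(path)
    if needed > actual:
        raise ValueError(
            "shape %r with dtype %s needs %d bytes, but %s holds only %d"
            % (shape, np.dtype(dtype), needed, path, actual)
        )
    return np.memmap(path, mode="r", dtype=dtype, shape=shape)


# The file size reported in the thread matches 1200x1600x720 elements
# of 4 bytes each (for example float32):
print(1200 * 1600 * 720 * 4)  # 5529600000

# Tiny demonstration with a made-up file.
path = os.path.join(tempfile.mkdtemp(), "small.bin")  # hypothetical name
np.zeros((4, 5, 6), dtype=np.float32).tofile(path)
ok = checked_memmap(path, (4, 5, 6), np.float32)
print(ok.shape)
```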
