Tim | 28 Jul 2011 16:34
Picon

SQLite join takes a very long time

All-
I have a set of text files, each with 410 million records (with a
structure something like these 5-record examples) that I'd like to
bring into SQLite:

file a = [-99, 1, 1, -99, -99]
file b = [-99, 2, 4, 1, -99]
file c = [-99, 7, 7, -99, -99]

The -99 values are NA; location in the vector matters. I'd like as my
final output a table looking like this

ID a b c
1 1 2 7
2 1 4 7

With ID being the location in the vector (= row) and only rows without
NA in the table.

With a test dataset of 6 files of 10 million records each, I've tried
importing each file independently to separate tables of two columns
each that look like this:

table_a
ID a
0 -99
1 1
2 1
3 -99
4 -99
(Continue reading)

Roger Binns | 28 Jul 2011 17:01

Re: SQLite join takes a very long time


On 07/28/2011 07:34 AM, Tim wrote:
> The -99 values are NA; location in the vector matters. I'd like as my
> final output a table looking like this

I recommend you write a generator function that just returns each final row,
something like this:

def rows(*columns):
   for ID, cols in enumerate(itertools.izip(*columns)):
      if any(c==-99 for c in cols):
         continue
      yield [ID]+list(cols)

You insert this data like this:

  cursor.executemany("INSERT INTO table VALUES(?,?,?,?)",
     rows(file_a, file_b, file_c))

This will save all the manipulations you are trying to do.  If the file
contents are also returned as generators then this will all use very little
memory.

> ***But this has been running for >24 hours on the test data set!

You should prefix the query with EXPLAIN QUERY PLAN and you'll at least be
able to see what SQLite is doing, especially what indices are (not) used.

> The
> SQLite file size has grown from 2GB to 200GB and continues to grow. 
(Continue reading)

Tim | 28 Jul 2011 17:29
Picon

Re: SQLite join takes a very long time

Thank you for your quick reply -

>
> I recommend you write a generator function that just returns each final row,
> something like this:

Yes, I'm already using a generator function to read in the files. As
I
actually read by row, and then split the rows to yield to apsw, I
hadn't
thought it would be easy to make a generator work across multiple
files at a time. Here's a snippet of what I'm using for access to one
text file and writing to one table (fcn is the file connection):

def getrows():
   for r in range(header['nrows']):
	   line = fcn.readline()
	   for i, x in enumerate(line.split()):
		   yield i, x
sql = 'insert into tbl' + layerName + '(FID, ' + layerName + ')
values(?, ?)'
c.execute('''begin; drop table if exists tbl''' + layerName + ''';
	   create table tbl''' + layerName + '''(
	   FID integer, '''
	   + layerName + ''' ''' + dataType +
	   ''')''')

> This will save all the manipulations you are trying to do.  If the file
> contents are also returned as generators then this will all use very little
> memory.
(Continue reading)

Tim | 28 Jul 2011 17:37
Picon

Re: SQLite join takes a very long time

Whoops, I didn't include two key execute statements in my snippet...

> def getrows():
>    for r in range(header['nrows']):
>            line = fcn.readline()
>            for i, x in enumerate(line.split()):
>                    yield i, x
> sql = 'insert into tbl' + layerName + '(FID, ' + layerName + ')
> values(?, ?)'
> c.execute('''begin; drop table if exists tbl''' + layerName + ''';
>            create table tbl''' + layerName + '''(
>            FID integer, '''
>            + layerName + ''' ''' + dataType +
>            ''')''')

c.executemany(sql, getrows())
c.execute("commit")

>

Edzard Pasma | 29 Jul 2011 10:02
Picon
Favicon

Re: Re: SQLite join takes a very long time


Op 28-jul-2011, om 17:37 heeft Tim het volgende geschreven:

> Whoops, I didn't include two key execute statements in my snippet...
>
>> def getrows():
>>    for r in range(header['nrows']):
>>            line = fcn.readline()
>>            for i, x in enumerate(line.split()):
>>                    yield i, x
>> sql = 'insert into tbl' + layerName + '(FID, ' + layerName + ')
>> values(?, ?)'
>> c.execute('''begin; drop table if exists tbl''' + layerName + ''';
>>            create table tbl''' + layerName + '''(
>>            FID integer, '''
>>            + layerName + ''' ''' + dataType +
>>            ''')''')
>
> c.executemany(sql, getrows())
> c.execute("commit")

Hello,

It is strange that you do not save the line number (r in above code)  
from the input files. That should become the ID that you use later on  
to join the tables. Is there a chance that ID and FID are mixed up?  
(that  would explain things)

I hope the conclusion is not that SQL is not suited for this case.  
One should just use the tools that one feels confortable with. Unix  
(Continue reading)

Tim | 29 Jul 2011 13:29
Picon

Re: SQLite join takes a very long time

Thanks for your comments -
Note that the files are structured as a grid with XX rows and YY
columns, so in that code example, I read a single row with readlines()
an then break the row up into values that go into the table with
split. So each file has rows*columns number of records not just
records equal to the number of rows.

I'm sure you are right that SQLite is suited to the task - it's just
me not knowing how to do the join without it making a cross-product as
Roger indicated as what was probably going on.

Cheers,
Tim Howard

> Hello,
>
> It is strange that you do not save the line number (r in above code)  
> from the input files. That should become the ID that you use later on  
> to join the tables. Is there a chance that ID and FID are mixed up?  
> (that  would explain things)
>
> I hope the conclusion is not that SQL is not suited for this case.  
> One should just use the tools that one feels confortable with. Unix  
> commands like merge and cut might also do the job.
>
> Regards, Edzard Pasma

Roger Binns | 28 Jul 2011 17:49

Re: Re: SQLite join takes a very long time


On 07/28/2011 08:29 AM, Tim wrote:
> Since I can't read all the data into memory at one time is there
> another way to 'yield' in various places in the generator function?

You can yield in as many places as you want in the generator function.
Define a readfile function that yields each value from one file and then the
final code would look something like this:

  cursor.executemany("INSERT INTO table VALUES(?,?,?,?)",
     rows(readfile("a"), readfile("b"), readfile("c")))

The only other issue I see from your code is that you are returning the
numeric values as strings.  SQLite's affinity rules can fix that for you,
but I'd call int() on the field from line.split() to explicitly pass in an
integer.  You can verify how data is stored in the SQLite database using typeof:

  select typeof(ID), typeof(a) from table_a limit 10

If you aren't getting integer back then indices will be a heck of lot slower
and larger.

Roger

Gmane