Emmanuel Tabard | 26 Jan 18:42
Picon

Fwd: [sql] memory consumption


It's so slow and takes so much memory because it was thought to work with
a few hundreds of entries. :-D

Fair enough :D

Wow, that's an interesting problem... I guess it can be heavily improved,
especially if we can store some information to the disc.
Anyway, it's not an easy task: the real problem is that we don't have a
unique ID to identify a movie (that would be the ID that we're saving... but
the problem is matching it to the other information of the row: title, year,
imdb_index, kind, etc. etc.)

The thing is, the whole database takes 5go. That's why I was wondering how the script can eat 20go of memory. Maybe sqlobject leaks !
You could do it in 4 steps :
- Grab all informations from the existing database (imdb id, title, index, year, kind) and store it in a temporary table or text file.
- Drop the database
- rebuild it
- iterate in your file/temp table and restore the ids one by one

But it could be slow to query the fresh database with your temp table datas. (Because of the text fields ...)
Anyway, it takes 10 hours to store the ids in memory. Can't be worse :D

To make it faster you can also generate a unique signature for each rows (sha1(title, index, year, kinds)?). Index this field and your temp table would be : imdbid | signature.
It should be quick.

With mysql you can also warmup indexes this way :

SHOW TABLES in imdbpy
-> for each table LOAD INDEX INTO CACHE table


- Emmanuel

Le 26 janv. 2012 à 17:23, Davide Alberani a écrit :

On Thu, Jan 26, 2012 at 15:32, Emmanuel Tabard <manu-+vDP33JFECNGWvitb5QawA@public.gmane.org> wrote:

First of all, thank you for imdbpy. This is really plug'n play, well done !!!

Thanks. :-P

Context :
- Import all imdb database (from text dumps) - first time it's fast and ok
- I have the imdb ids for 90% of titles and names (no need for companies and characters)

That's a lot of data. :)

My problem comes when imdbpy updates my database. It takes hours to save the imdb ids and it consumes a *lot* of memory. Almost all of my RAM (24go) ...

Is there a way to optimize that step ? Why does it takes so much memory ?

It's so slow and takes so much memory because it was thought to work with
a few hundreds of entries. :-D

Wow, that's an interesting problem... I guess it can be heavily improved,
especially if we can store some information to the disc.
Anyway, it's not an easy task: the real problem is that we don't have a
unique ID to identify a movie (that would be the ID that we're saving... but
the problem is matching it to the other information of the row: title, year,
imdb_index, kind, etc. etc.)

Hmmm... I promise to think about it in the weekend.  If anyone have a
nice solution to this problem, any hint is welcome!

--
Davide Alberani <davide.alberani-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/


------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@...
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
Emmanuel Tabard | 26 Jan 18:50
Picon

Re: [sql] memory consumption

Few stats :

RESTORING imdbID values for movies... DONE! (restored 1644956 entries out of 1952428) RESTORING imdbID values for people... DONE! (restored 3304069 entries out of 3320213) # TIME fushing caches... : 90min, 23sec (wall) 73min, 59sec (user) 1min, 37sec (system) # TIME TOTAL TIME TO INSERT/WRITE DATA : 1193min, 58sec (wall) 1095min, 16sec (user) 13min, 7sec (system) building database indexes (this may take a while) # TIME createIndexes() : 13min, 56sec (wall) 0min, 0sec (user) 0min, 0sec (system) adding foreign keys (this may take a while) # TIME createForeignKeys() : 16min, 5sec (wall) 0min, 0sec (user) 0min, 0sec (system) # TIME FINAL : 1223min, 59sec (wall) 1095min, 16sec (user) 13min, 7sec (system)


You can notice that :
 - title 84% success
 - name 99% success

But I didn't watch the diffs. I don't know if the restore fails somehow or if imdb has a lot of editing :)


- Emmanuel
Le 26 janv. 2012 à 18:42, Emmanuel Tabard a écrit :


It's so slow and takes so much memory because it was thought to work with
a few hundreds of entries. :-D

Fair enough :D

Wow, that's an interesting problem... I guess it can be heavily improved,
especially if we can store some information to the disc.
Anyway, it's not an easy task: the real problem is that we don't have a
unique ID to identify a movie (that would be the ID that we're saving... but
the problem is matching it to the other information of the row: title, year,
imdb_index, kind, etc. etc.)

The thing is, the whole database takes 5go. That's why I was wondering how the script can eat 20go of memory. Maybe sqlobject leaks ! 
You could do it in 4 steps : 
- Grab all informations from the existing database (imdb id, title, index, year, kind) and store it in a temporary table or text file.
- Drop the database
- rebuild it
- iterate in your file/temp table and restore the ids one by one

But it could be slow to query the fresh database with your temp table datas. (Because of the text fields ...)
Anyway, it takes 10 hours to store the ids in memory. Can't be worse :D

To make it faster you can also generate a unique signature for each rows (sha1(title, index, year, kinds)?). Index this field and your temp table would be : imdbid | signature.
It should be quick.

With mysql you can also warmup indexes this way :

SHOW TABLES in imdbpy
-> for each table LOAD INDEX INTO CACHE table


- Emmanuel

------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@...
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
Davide Alberani | 27 Jan 09:18
Picon
Gravatar

Re: [sql] memory consumption

On Thu, Jan 26, 2012 at 18:50, Emmanuel Tabard <manu@...> wrote:
>
> # TIME FINAL : 1223min, 59sec (wall) 1095min, 16sec (user) 13min, 7sec
> (system)

:-)

> You can notice that :
>  - title 84% success
>  - name 99% success

For sure movie titles change more and faster than user names (a user name
can change only if a typo was found, if the person legally change his name or
if a second person with the same name/surname is added to the db)

Anyway, 84% is a little too high... I guess some kind of titles (tv
series episodes,
maybe?) are not handled correctly.

Thanks for your help!

--

-- 
Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

------------------------------------------------------------------------------
Try before you buy = See our experts in action!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-dev2
Emmanuel Tabard | 6 Feb 09:10
Picon

Re: [sql] memory consumption

Feb  6 07:16:53 ks380342 kernel: Out of memory: Kill process 29782 (imdbpy2sql.py) score 718 or sacrifice child
Feb  6 07:16:53 ks380342 kernel: Killed process 29782 (imdbpy2sql.py) total-vm:18212980kB,
anon-rss:18149024kB, file-rss:64kB

Greedy boy :D 

Le 27 janv. 2012 à 09:18, Davide Alberani a écrit :

> On Thu, Jan 26, 2012 at 18:50, Emmanuel Tabard <manu@...> wrote:
>> 
>> # TIME FINAL : 1223min, 59sec (wall) 1095min, 16sec (user) 13min, 7sec
>> (system)
> 
> :-)
> 
>> You can notice that :
>>  - title 84% success
>>  - name 99% success
> 
> For sure movie titles change more and faster than user names (a user name
> can change only if a typo was found, if the person legally change his name or
> if a second person with the same name/surname is added to the db)
> 
> Anyway, 84% is a little too high... I guess some kind of titles (tv
> series episodes,
> maybe?) are not handled correctly.
> 
> 
> Thanks for your help!
> 
> 
> -- 
> Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
> http://www.mimante.net/

------------------------------------------------------------------------------
Try before you buy = See our experts in action!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-dev2
Davide Alberani | 7 Feb 09:20
Picon
Gravatar

Re: [sql] memory consumption

On Mon, Feb 6, 2012 at 09:10, Emmanuel Tabard <manu@...> wrote:
>
> Feb  6 07:16:53 ks380342 kernel: Out of memory: Kill process 29782 (imdbpy2sql.py) score 718 or
sacrifice child
> Feb  6 07:16:53 ks380342 kernel: Killed process 29782 (imdbpy2sql.py) total-vm:18212980kB,
anon-rss:18149024kB, file-rss:64kB
>
> Greedy boy :D

Eheh... :-)
As usual, I'm really busy right now... I hope to have time to give it
a look this weekend.

--

-- 
Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
Davide Alberani | 11 Feb 19:49
Picon
Gravatar

Re: [sql] memory consumption

On Tue, Feb 7, 2012 at 09:20, Davide Alberani
<davide.alberani@...> wrote:
>
> As usual, I'm really busy right now... I hope to have time to give it
> a look this weekend.

Ehi, snowstorms buy you a lot of free time... :-P

It was easier that I thought, mostly thanks to the fact the we already have
md5 checksum of names and title (a more or less recent feature).

In the mercurial repository there's a draft of solution.

How it works:
- titles/names with imdbID are stored in a dbm database, using their
md5 as keys.
- at restore time, imdbIDs are restored in batches of 10000 each time.

Notes:
- by default, the database are created in the current directory (and
not deleted);
  there's now the '-t dir' command line argument, to specify a
temporary directory.
- I've not tested it with huge amounts of data: if it's slow or fails,
let me know
  if it's while storing or restoring the IDs (and the error message).
- 10.000 entries for a batch is *totally* arbitrary: we've to choose a
good compromise
  between performances and the maximum size of a query.
- the batch is executed as a single query, like:
      UPDATE table SET imdb_id = CASE md5sum WHEN 'md5_1' THEN
'imdbID1' ... END WHERE md5sum IN ('md5_1', md5_2', ...)
  I don't really know if this syntax is valid for every SQL databases...
- I've simplified the code, maybe too much.
- I've not tested it with CSV support.

As usual, any test, bug report, comment and so on is welcome.

--

-- 
Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
Emmanuel Tabard | 11 Feb 22:06
Picon

Re: [sql] memory consumption

Hey !

Ehi, snowstorms buy you a lot of free time... :-P

:D

How it works:
- titles/names with imdbID are stored in a dbm database, using their
md5 as keys.
- at restore time, imdbIDs are restored in batches of 10000 each time.

Looks nice !!!

It seems that you load all the datas in memory before storing it in the temp databases. 
"cls.select(ISNOTNULL(cls.q.imdbID))"

Maybe you should save the imdbids by batch of 10000entries ?


Tell me if you need the complete database dump to test with tons of datas !

Thank you very much for your help !


Le 11 févr. 2012 à 19:49, Davide Alberani a écrit :

On Tue, Feb 7, 2012 at 09:20, Davide Alberani <davide.alberani-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

As usual, I'm really busy right now... I hope to have time to give it
a look this weekend.

Ehi, snowstorms buy you a lot of free time... :-P

It was easier that I thought, mostly thanks to the fact the we already have
md5 checksum of names and title (a more or less recent feature).

In the mercurial repository there's a draft of solution.

How it works:
- titles/names with imdbID are stored in a dbm database, using their
md5 as keys.
- at restore time, imdbIDs are restored in batches of 10000 each time.

Notes:
- by default, the database are created in the current directory (and
not deleted);
 there's now the '-t dir' command line argument, to specify a
temporary directory.
- I've not tested it with huge amounts of data: if it's slow or fails,
let me know
 if it's while storing or restoring the IDs (and the error message).
- 10.000 entries for a batch is *totally* arbitrary: we've to choose a
good compromise
 between performances and the maximum size of a query.
- the batch is executed as a single query, like:
     UPDATE table SET imdb_id = CASE md5sum WHEN 'md5_1' THEN
'imdbID1' ... END WHERE md5sum IN ('md5_1', md5_2', ...)
 I don't really know if this syntax is valid for every SQL databases...
- I've simplified the code, maybe too much.
- I've not tested it with CSV support.

As usual, any test, bug report, comment and so on is welcome.


--
Davide Alberani <davide.alberani-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@...
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
Davide Alberani | 12 Feb 08:44
Picon
Gravatar

Re: [sql] memory consumption

On Sat, Feb 11, 2012 at 22:06, Emmanuel Tabard <manu@...> wrote:
>
> Looks nice !!!

Well, it's not exactly nice & clean code... let's hope it works.

> It seems that you load all the datas in memory before storing it in the temp
> databases.
> "cls.select(ISNOTNULL(cls.q.imdbID))"
>
> Maybe you should save the imdbids by batch of 10000entries ?

If you believe at the theory, the return value of a 'select' method is
a lazy object, and so it should not consume much memory.
It's true that it's then cycled over, one item at a time, and so it
may be slow (in which case, we can directly use a cursors and
process the results in batches).

> Tell me if you need the complete database dump to test with tons of datas !

It won't hurt... :)
A dump of "SELECT md5sum, imdb_id" from the 'name', 'title', 'char_name'
and 'company_name' will be enough (not on the public list, please :))

Bye,

--

-- 
Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
Emmanuel Tabard | 12 Feb 14:20
Picon

Re: [sql] memory consumption



Le 12 févr. 2012 à 08:44, Davide Alberani a écrit :

Well, it's not exactly nice & clean code... let's hope it works.

If prefer nice features to clean code :D

If you believe at the theory, the return value of a 'select' method is
a lazy object, and so it should not consume much memory.
It's true that it's then cycled over, one item at a time, and so it
may be slow (in which case, we can directly use a cursors and
process the results in batches).

Fair enough !
When it was selecting all the not null ids, the memory of the process grows up and the size of the .db never grows up.
My theory is that dbm save on close ? Does that make sense ?



It seems that you load all the datas in memory before storing it in the temp
databases.
"cls.select(ISNOTNULL(cls.q.imdbID))"

Maybe you should save the imdbids by batch of 10000entries ?

If you believe at the theory, the return value of a 'select' method is
a lazy object, and so it should not consume much memory.
It's true that it's then cycled over, one item at a time, and so it
may be slow (in which case, we can directly use a cursors and
process the results in batches).

Tell me if you need the complete database dump to test with tons of datas !

It won't hurt... :)
A dump of "SELECT md5sum, imdb_id" from the 'name', 'title', 'char_name'
and 'company_name' will be enough (not on the public list, please :))


Bye,

--
Davide Alberani <davide.alberani-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@...
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
Davide Alberani | 12 Feb 14:56
Picon
Gravatar

Re: [sql] memory consumption

On Sun, Feb 12, 2012 at 14:20, Emmanuel Tabard <manu@...> wrote:
>
> Fair enough !
> When it was selecting all the not null ids, the memory of the process grows
> up and the size of the .db never grows up.
> My theory is that dbm save on close ? Does that make sense ?

Strange (even if, being anydbm a generic interface to various underlying
modules, you can never tell).

This simple snippet, on my system, creates a 1.2 Gb files and in the process
the memory in not used much (besides for caches, but it doesn't matter):

#!/usr/bin/env python
import time
import anydbm

long_string = 'LALALALA' * 1024
db = anydbm.open('/tmp/big.db', 'n')
for x in xrange(100000):
    x = str(x)
    db[x] = long_string

print 'INSERT'
db.close()
print 'CLOSE'
time.sleep(10)
print 'DONE'
sys.exit()
#======================

I fear that the leak is in the cycle on the result of the 'select'. :-/

--

-- 
Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
Emmanuel Tabard | 12 Feb 15:52
Picon

Re: [sql] memory consumption

I was wondering, why don't you use the original dbs ?

Something like that takes 3 seconds: 

"CREATE TABLE title_extract SELECT imdb_id, md5sum FROM title WHERE imdb_id IS NOT NULL
CREATE TABLE name_extract SELECT imdb_id, md5sum FROM name WHERE imdb_id IS NOT NULL
"

And use your query to restore.

Should be freaking fast ...

Le 12 févr. 2012 à 14:56, Davide Alberani a écrit :

> On Sun, Feb 12, 2012 at 14:20, Emmanuel Tabard <manu@...> wrote:
>> 
>> Fair enough !
>> When it was selecting all the not null ids, the memory of the process grows
>> up and the size of the .db never grows up.
>> My theory is that dbm save on close ? Does that make sense ?
> 
> Strange (even if, being anydbm a generic interface to various underlying
> modules, you can never tell).
> 
> This simple snippet, on my system, creates a 1.2 Gb files and in the process
> the memory in not used much (besides for caches, but it doesn't matter):
> 
> #!/usr/bin/env python
> import time
> import anydbm
> 
> long_string = 'LALALALA' * 1024
> db = anydbm.open('/tmp/big.db', 'n')
> for x in xrange(100000):
>    x = str(x)
>    db[x] = long_string
> 
> print 'INSERT'
> db.close()
> print 'CLOSE'
> time.sleep(10)
> print 'DONE'
> sys.exit()
> #======================
> 
> I fear that the leak is in the cycle on the result of the 'select'. :-/
> 
> 
> -- 
> Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
> http://www.mimante.net/

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
Emmanuel Tabard | 12 Feb 16:50
Picon

Re: [sql] memory consumption

Here is a little workarround :

-- Extract imdb_id and md5sum  (6sec)
CREATE TABLE title_extract SELECT imdb_id, md5sum FROM title WHERE imdb_id IS NOT NULL;
CREATE TABLE name_extract SELECT imdb_id, md5sum FROM name WHERE imdb_id IS NOT NULL;

-- Add indexes (12sec)
ALTER TABLE title_extract ADD INDEX md5sum_idx (md5sum)
ALTER TABLE name_extract ADD INDEX md5sum_idx (md5sum)

-- Reset imdb ids ...
UPDATE title SET imdb_id = NULL;
UPDATE name SET imdb_id = NULL;

-- Restore imdb ids for movies (2min)
UPDATE title
INNER JOIN title_extract USING (md5sum)
SET title.imdb_id = title_extract.imdb_id

-- Restore imdb ids for people (5min)
UPDATE name
INNER JOIN name_extract USING (md5sum)
SET name.imdb_id = name_extract.imdb_id

Total time save/restore : less than 10minutes

Le 12 févr. 2012 à 15:52, Emmanuel Tabard a écrit :

> I was wondering, why don't you use the original dbs ?
> 
> Something like that takes 3 seconds: 
> 
> "CREATE TABLE title_extract SELECT imdb_id, md5sum FROM title WHERE imdb_id IS NOT NULL
> CREATE TABLE name_extract SELECT imdb_id, md5sum FROM name WHERE imdb_id IS NOT NULL
> "
> 
> And use your query to restore.
> 
> Should be freaking fast ...
> 
> Le 12 févr. 2012 à 14:56, Davide Alberani a écrit :
> 
>> On Sun, Feb 12, 2012 at 14:20, Emmanuel Tabard <manu@...> wrote:
>>> 
>>> Fair enough !
>>> When it was selecting all the not null ids, the memory of the process grows
>>> up and the size of the .db never grows up.
>>> My theory is that dbm save on close ? Does that make sense ?
>> 
>> Strange (even if, being anydbm a generic interface to various underlying
>> modules, you can never tell).
>> 
>> This simple snippet, on my system, creates a 1.2 Gb files and in the process
>> the memory in not used much (besides for caches, but it doesn't matter):
>> 
>> #!/usr/bin/env python
>> import time
>> import anydbm
>> 
>> long_string = 'LALALALA' * 1024
>> db = anydbm.open('/tmp/big.db', 'n')
>> for x in xrange(100000):
>>   x = str(x)
>>   db[x] = long_string
>> 
>> print 'INSERT'
>> db.close()
>> print 'CLOSE'
>> time.sleep(10)
>> print 'DONE'
>> sys.exit()
>> #======================
>> 
>> I fear that the leak is in the cycle on the result of the 'select'. :-/
>> 
>> 
>> -- 
>> Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
>> http://www.mimante.net/
> 

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
Davide Alberani | 12 Feb 19:23
Picon
Gravatar

Re: [sql] memory consumption

On Sun, Feb 12, 2012 at 16:50, Emmanuel Tabard <manu@...> wrote:
>
> Here is a little workarround :

Well, that's a very interesting solution, thanks. :-)

Anyway, I have to think a little about it: storing the data in the
filesystem granted
us the possibility to split the "CSV mode" into 2 separated steps:
first all data are
put in CSV/pickle/dbm files, then the db is re-created.

Hmmm... mumble, mumble... (d'oh, the weekend is over!)

--

-- 
Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
Emmanuel Tabard | 12 Feb 21:17
Picon

Re: [sql] memory consumption

> Well, that's a very interesting solution, thanks. :-)

I made a test run with this solution.

Time to save/restore : 6minutes

Restoring success :

 - People : 99.8777%
 - Movies : 99.8845%

> Anyway, I have to think a little about it: storing the data in the
> filesystem granted
> us the possibility to split the "CSV mode" into 2 separated steps:
> first all data are
> put in CSV/pickle/dbm files, then the db is re-created.

I saw that you drop tables and not the entire database. So I created the extract tables, clear the imdb ids,
run imdb2sql et restore the extract tables.
IMHO, I recommend you to keep your datas (extracted) on the same database system. Less queries, less disk
operations, less python processing.

Anyhow I don't know if it's a cross dbs solution (tested only with postgres and mysql ...). 

> Hmmm... mumble, mumble... (d'oh, the weekend is over!)

Hope the snowstorms are over !
Thanks you very much for your help ;-)

Le 12 févr. 2012 à 19:23, Davide Alberani a écrit :

> On Sun, Feb 12, 2012 at 16:50, Emmanuel Tabard <manu@...> wrote:
>> 
>> Here is a little workarround :
> 
> Well, that's a very interesting solution, thanks. :-)
> 
> Anyway, I have to think a little about it: storing the data in the
> filesystem granted
> us the possibility to split the "CSV mode" into 2 separated steps:
> first all data are
> put in CSV/pickle/dbm files, then the db is re-created.
> 
> Hmmm... mumble, mumble... (d'oh, the weekend is over!)
> 
> -- 
> Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
> http://www.mimante.net/

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
Davide Alberani | 19 Feb 11:26
Picon
Gravatar

Re: [sql] memory consumption

On Sun, Feb 12, 2012 at 21:17, Emmanuel Tabard <manu@...> wrote:
>
>> Well, that's a very interesting solution, thanks. :-)
>
> I made a test run with this solution.
>
> Time to save/restore : 6minutes
>
> Restoring success :
>
>  - People : 99.8777%
>  - Movies : 99.8845%

That's great!

Ok, I've committed a compromise solution: when we're using CSV (which
is thought to give you a set of files that you can move and restore on
completely
different db altogether), I use my previous implementation, but using
a cursor instead
of an ORM object (it should be really fast, at least storing the imdbIDs).

When not using CSV, I've implemented your solution (the version based on
a dbm database is also used as a fallback, in case some db servers
have problems).

But this is just the default: with the -i command line argument (by
the way: I've
removed -t to select a tmp directory: the CSV one is used or the
current directory)
you can force one method or the other: "-i dbm" and "-i table".

I hope it works and it covers all the possible use cases. :-)

--

-- 
Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
Emmanuel Tabard | 20 Feb 18:40
Picon

Re: [sql] memory consumption

Hell yeah :D

RESTORING imdbIDs values for movies... DONE! (restored 2102978 entries out of 2104144)
# TIME restore movies : 0min, 47sec (wall) 0min, 0sec (user) 0min, 0sec (system)
RESTORING imdbIDs values for people... DONE! (restored 3544106 entries out of 3545620)
# TIME restore people : 1min, 19sec (wall) 0min, 0sec (user) 0min, 0sec (system)

Freaking fast !!!!

thank you for those quick fixs ! Also, thanks for the credits ;-) 

Le 19 févr. 2012 à 11:26, Davide Alberani a écrit :

> On Sun, Feb 12, 2012 at 21:17, Emmanuel Tabard <manu@...> wrote:
>> 
>>> Well, that's a very interesting solution, thanks. :-)
>> 
>> I made a test run with this solution.
>> 
>> Time to save/restore : 6minutes
>> 
>> Restoring success :
>> 
>>  - People : 99.8777%
>>  - Movies : 99.8845%
> 
> That's great!
> 
> Ok, I've committed a compromise solution: when we're using CSV (which
> is thought to give you a set of files that you can move and restore on
> completely
> different db altogether), I use my previous implementation, but using
> a cursor instead
> of an ORM object (it should be really fast, at least storing the imdbIDs).
> 
> When not using CSV, I've implemented your solution (the version based on
> a dbm database is also used as a fallback, in case some db servers
> have problems).
> 
> But this is just the default: with the -i command line argument (by
> the way: I've
> removed -t to select a tmp directory: the CSV one is used or the
> current directory)
> you can force one method or the other: "-i dbm" and "-i table".
> 
> I hope it works and it covers all the possible use cases. :-)
> 
> -- 
> Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
> http://www.mimante.net/

------------------------------------------------------------------------------
Try before you buy = See our experts in action!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-dev2
Emmanuel Tabard | 25 Feb 15:12
Picon

[sql] indexes

Hi ! 

Your sql schema misses a few essentials indexes : 

 - title.imdb_id
 - title.episode_nr
 - title.season_nr

 - name.imdb_id

You maybe don't need those indexes :)

I just wanted to know if you're ok pulling this upstream. In the other case I'll maintain a patch on my side !

Bests,

Emmanuel
------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
Davide Alberani | 25 Feb 16:09
Picon
Gravatar

Re: [sql] indexes

On Sat, Feb 25, 2012 at 15:12, Emmanuel Tabard <manu@...> wrote:
> Hi !
>
> Your sql schema misses a few essentials indexes :
>
>  - title.imdb_id
>  - title.episode_nr
>  - title.season_nr
>
>  - name.imdb_id
>
> You maybe don't need those indexes :)

Yep, I kept the indexes at the minimum... maybe too much. :-)

> I just wanted to know if you're ok pulling this upstream. In the other case I'll maintain a patch on my side !

No problems at introducing them, but... would not it be easier if you just join
the IMDbPY project? :-P
If it's ok for you, give me your sourceforge and bitbucket usernames, and
I'll grant you write permission to the repository.
It's ok even if you don't really have much time to dedicate to it: when
you need something, just ask to me (to be sure that it will not create some
strange problems that only me can know) and commit. :-)

Point is, IMDbPY was always a playground, for me (come on... we didn't
REALLY need to support both SQLObject and SQLAlchemy ORMs ;-) and I'm not
an expert on databases.

On the db, actually, I'd like to see this work done:
1. an overall evaluation of the current status: is the structure still
valid/meaningful?
   (keeping in mind that many choices were due to performance reasons - maybe
   with wrong assumptions on my side).
2. introduce indexes/foreign keys were needed.
3. analyze if it's possible to get rid of the movie_info_idx tables:
it's exactly
    the same as movie_info, but it contains only the rating information, and
    more indexes are created on them (point is: does creating these indexes
    for movie_info really waste so much spaces that it's not worth, or was it
    all in my head? ;-)

--

-- 
Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
Emmanuel Tabard | 25 Feb 16:45
Picon

Re: [sql] indexes


Le 25 févr. 2012 à 16:09, Davide Alberani a écrit :

> Yep, I kept the indexes at the minimum... maybe too much. :-)

This is a reasonable approach. Sometimes too much indexes could be worse than no index at all :D

> 
>> I just wanted to know if you're ok pulling this upstream. In the other case I'll maintain a patch on my side !
> 
> No problems at introducing them, but... would not it be easier if you just join
> the IMDbPY project? :-P

Fair enough ! I can help you with sql nightmares. To be honest, I only use the parsing and sql importing part of
your project, which is by itself awesome :p

> If it's ok for you, give me your sourceforge and bitbucket usernames, and
> I'll grant you write permission to the repository.

I saw that your github repo is just a mirror ? Is it possible to use pull requests on github or this is a read only
repo ?
Point is, the process is easier on github to propose path and discuss it. 

If it's not possible, i'll create a sourceforge account :)

> It's ok even if you don't really have much time to dedicate to it: when
> you need something, just ask to me (to be sure that it will not create some
> strange problems that only me can know) and commit. :-)

I don't have much time, my startup eats a lot of my time ... But your project has an important part in mine, so if
I can share my optims be sure I'll will commit them !
I'll probably make one of my own project public one day. It's python, it imports imdb with imdbpy add
wikipedia synospys and bio(you can target language), freebase, themoviedb, thetvdb (etc) to have an
universal and automatized movie database. For now, the code is too dirty to be public :D

> 
> Point is, IMDbPY was always a playground, for me (come on... we didn't
> REALLY need to support both SQLObject and SQLAlchemy ORMs ;-) and I'm not
> an expert on databases.
> 
> On the db, actually, I'd like to see this work done:
> 1. an overall evaluation of the current status: is the structure still
> valid/meaningful?
>   (keeping in mind that many choices were due to performance reasons - maybe
>   with wrong assumptions on my side).
> 2. introduce indexes/foreign keys were needed.
> 3. analyze if it's possible to get rid of the movie_info_idx tables:
> it's exactly
>    the same as movie_info, but it contains only the rating information, and
>    more indexes are created on them (point is: does creating these indexes
>    for movie_info really waste so much spaces that it's not worth, or was it
>    all in my head? ;-)

Your choices are mainly smart, the schema isn't so bad ;-)
The nature of the data is complex with a lot of relations, and for performance reasons I think the
denormalization of movie_info_idx isn't a bad thing.

Let me know for the github thing :)

> 
> -- 
> Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
> http://www.mimante.net/

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
Davide Alberani | 27 Jan 09:07
Picon
Gravatar

Re: [sql] memory consumption

On Thu, Jan 26, 2012 at 18:42, Emmanuel Tabard <manu@...> wrote:
>
> The thing is, the whole database takes 5go. That's why I was wondering how
> the script can eat 20go of memory. Maybe sqlobject leaks !

Or maybe my script does. :P  (by the way you can try using SQLAlchemy,
instead, with the '-o sqlalchemy' option of imdbpy2sql.py - yes, we support
two ORMs...)

> To make it faster you can also generate a unique signature for each rows
> (sha1(title, index, year, kinds)?). Index this field and your temp table
> would be : imdbid | signature.
> It should be quick.

Yep, I was thinking at something like that.
I'll try to see if the signature can be easily created when
the data are first inserted.

--

-- 
Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

------------------------------------------------------------------------------
Try before you buy = See our experts in action!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-dev2

Gmane