Benjamin Lau | 27 Aug 2012 11:28

[Trac-dev] Large git repositories?

Hello,

I was wondering if any progress had been made on this. I found this page:
http://trac.edgewall.org/wiki/TracDev/Performance/Git
and was playing around with Peter Stuge's branch, but that seems to be
in an incomplete state.

After playing around with the code for a while I realized that the main
issue is that we're having to reprocess a large number of changes over
and over. It also seems like the revision cache gets invalidated and
reset quite a bit, way more than I'd expect, especially since I've got
the system hooked up to a trac repository that nobody can write to
(a local copy of our production repository).

I'm not sure how my repo compares in size to others (23k commits with
6k branches/tags) but it takes PyGIT about 3 minutes (200k
milliseconds) to run get_rev_cache when a rebuild is triggered.
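
A rough sketch of the kind of guard that would avoid needless rebuilds
on a read-only repository: rebuild only when the set of refs actually
changes. This is not Trac's or PyGIT's code; `refs_fingerprint` and
`RevCache` are hypothetical names, and the `refs` mapping (ref name to
commit sha) is assumed to be obtained elsewhere.

```python
import hashlib


def refs_fingerprint(refs):
    """Return a stable digest of a {ref_name: commit_sha} mapping."""
    h = hashlib.sha1()
    for name in sorted(refs):
        h.update(name.encode("utf-8") + b"\0" + refs[name].encode("ascii") + b"\n")
    return h.hexdigest()


class RevCache:
    """Hypothetical cache that rebuilds only when the refs change."""

    def __init__(self, build):
        self._build = build        # expensive callable: refs -> cache data
        self._fingerprint = None   # digest of the refs used for the last build
        self._data = None
        self.rebuilds = 0          # counter, only here to make rebuilds visible

    def get(self, refs):
        fp = refs_fingerprint(refs)
        if fp != self._fingerprint:
            # Refs differ from the last build: pay the rebuild cost once.
            self._data = self._build(refs)
            self._fingerprint = fp
            self.rebuilds += 1
        return self._data
```

With something like this, repeated requests against an unchanged
repository would hit the cached data instead of triggering another
multi-minute rebuild.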

Another thought I had is about the way the source browser behaves. It
tries to show the whole source tree with the latest changes on each
file from all the branches and tags in the repo... but I think it
would be better to just have it default to showing whatever branch is
pointed to by origin/HEAD and then let people select particular
branches if they want... or at least a workflow like this would work
better for me. I'm not sure if this would help solve the existing
problem with speed since we'd still need to traverse a large portion
of the commits in the repo (my master branch contains 18k of the 23k
commits in the repository).
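
To make the "default to origin/HEAD" idea concrete, here is a minimal
sketch of resolving the default branch. It assumes a hypothetical
`refs` dict in which symbolic refs are stored the way git writes them
on disk ("ref: <target>") and ordinary refs map to a commit sha;
`default_branch` is an illustrative name, not an existing Trac or
PyGIT function.

```python
def default_branch(refs, head="refs/remotes/origin/HEAD", max_depth=5):
    """Follow symbolic refs starting at `head`.

    Returns the name of the first non-symbolic ref reached, or None if
    the chain is broken or longer than `max_depth`.
    """
    name = head
    for _ in range(max_depth):
        value = refs.get(name)
        if value is None:
            return None                 # dangling or missing ref
        if not value.startswith("ref: "):
            return name                 # points at a commit: done
        name = value[len("ref: "):]     # symbolic ref: follow it
    return None                         # too many hops, give up
```

The browser could then show only that branch by default and offer the
other branches and tags through a selector.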

I was also playing around with pygit2 and played around with improving
[...]

Christian Boos | 29 Aug 2012 12:21

Re: [Trac-dev] Large git repositories?

On 8/27/2012 11:28 AM, Benjamin Lau wrote:
> I'm not sure how my repo compares in size to others (23k commits with
> 6k branches/tags) but it takes PyGIT about 3 minutes (200k
> milliseconds) to run get_rev_cache when a rebuild is triggered.

That's what I would call a medium-sized repository, except for the 
number of refs. It's a bit worrisome that even such a repository can't 
be handled efficiently...

>
> Another thought I had is about the way the source browser behaves. It
> tries to show the whole source tree with the latest changes on each
> file from all the branches and tags in the repo... but I think it
> would be better to just have it default to showing whatever branch is
> pointed to by origin/HEAD and then let people select particular
> branches if they want... or at least a workflow like this would work
> better for me.

The Mercurial plugin recently adopted that approach as well
(http://trac.edgewall.org/changeset/8af21bda2b3e2272f4dc6b41037efcd896d2d5d8/mercurial-plugin/)
so I think it would make sense to do the same for Git.

> I'm not sure if this would help solve the existing
> problem with speed since we'd still need to traverse a large portion
> of the commits in the repo (my master branch contains 18k of the 23k
> commits in the repository).
>
> I was also playing around with pygit2 and played around with improving
[...]

Benjamin Lau | 29 Aug 2012 12:51

Re: [Trac-dev] Large git repositories?

> The Mercurial plugin recently adopted that approach as well
> (http://trac.edgewall.org/changeset/8af21bda2b3e2272f4dc6b41037efcd896d2d5d8/mercurial-plugin/)
> so I think it would make sense to do the same for Git.

I'll take a look at that.

> Impressive! Yes, the pygit2 approach seems worth the trouble after all.

I forked Peter's branch onto github[1] to start playing with it. Since
it wasn't complete I ended up backing out his changes and reverting to
trunk as of a few days ago. I've since split off a copy of PyGIT.py as
PyGIT2.py (you can see this in some of my logs from #10826) but the
only difference between those at the moment is that PyGIT2 imports
pygit2 and does some additional logging. I'll try swapping out the
internals of PyGIT.all_revs() with my pygit2-based version and maybe
just keep the PyGIT API for the moment while I see where improvements
can be made.

I've been working on some code for walking the git repository as well
(with some help from the fine folks on #libgit2 on freenode). But that
takes about the same amount of time to execute as the trac-admin $ENV
repository resync <repo> command... so no real improvement there. It
may just be a problem of approach. And maybe it's not a huge deal if
this is slow as long as the rest of the caching system is working.
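
For reference, the general shape of such a walk, as a sketch and not
the actual code from my branch: with thousands of refs pointing into
mostly shared history, one `seen` set shared across all tips means
each commit is visited once, however many branches or tags can reach
it. The `parents` mapping (commit sha to list of parent shas) is an
assumption here; in practice it would come from the repository.

```python
def all_revs(tips, parents):
    """Collect every commit reachable from `tips`.

    `tips` is an iterable of commit shas (branch and tag targets) and
    `parents` maps each sha to a list of its parent shas.
    """
    seen = set()
    stack = list(tips)
    while stack:
        sha = stack.pop()
        if sha in seen:
            continue                    # already reached via another ref
        seen.add(sha)
        # Only queue parents that have not been visited yet.
        stack.extend(p for p in parents.get(sha, ()) if p not in seen)
    return seen
```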

My underlying issue here seems to have been that apache/wsgi has been
doing something funky that isn't at all the behavior I'd expect.

Ben
[...]

