Piotr Krukowiecki | 10 Feb 10:42
Picon

git status: small difference between stating whole repository and small subdirectory

Hi,

I compared stating whole tree vs one small subdirectory, and I
expected that for the subdirectory status will be very very fast.
After all, it has only few files to stat. But it's not fast. Why?

With cold cache (echo 3 | sudo tee /proc/sys/vm/drop_caches):

$ time git status    > /dev/null
real	0m41.670s
user	0m0.980s
sys	0m2.908s

$ time git status -- src/.../somedir   > /dev/null
real	0m17.380s
user	0m0.748s
sys	0m0.328s

With warm cache:

$ time git status    > /dev/null
real	0m0.792s
user	0m0.404s
sys	0m0.384s

$ time git status -- src/.../somedir   > /dev/null
real	0m0.335s
user	0m0.288s
sys	0m0.048s

(Continue reading)

Nguyen Thai Ngoc Duy | 10 Feb 13:33
Picon
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

On Fri, Feb 10, 2012 at 4:42 PM, Piotr Krukowiecki
<piotr.krukowiecki <at> gmail.com> wrote:
> Hi,
>
> I compared stating whole tree vs one small subdirectory, and I
> expected that for the subdirectory status will be very very fast.
> After all, it has only few files to stat. But it's not fast. Why?

Because stat'ing is not the only thing git-status does? In order to
find out staged changes, unstaged changes and untracked files, it has
to do the equivalence of "git diff --cached", "git diff" and "git
ls-files -o". I think copy detection is also enabled, which uses more
cycles.

Profiling it should give you a good idea what parts cost most.
--

-- 
Duy
Piotr Krukowiecki | 10 Feb 14:46
Picon

Re: git status: small difference between stating whole repository and small subdirectory

On Fri, Feb 10, 2012 at 1:33 PM, Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> wrote:
> On Fri, Feb 10, 2012 at 4:42 PM, Piotr Krukowiecki
> <piotr.krukowiecki <at> gmail.com> wrote:
>> Hi,
>>
>> I compared stating whole tree vs one small subdirectory, and I
>> expected that for the subdirectory status will be very very fast.
>> After all, it has only few files to stat. But it's not fast. Why?
>
> Because stat'ing is not the only thing git-status does? In order to
> find out staged changes, unstaged changes and untracked files, it has
> to do the equivalence of "git diff --cached", "git diff" and "git
> ls-files -o". I think copy detection is also enabled, which uses more
> cycles.

I believe copy detection is not done, neither for tracked nor untracked files.
Rename detection is done for tracked files. In this case it should not
matter, as there were no changes to the files.

My point is, that for such small number of small files (55 files and
223KB), which had no changes at all, the status took a lot of time (17
seconds) and doing status on whole repository which has more than
2000x files and 10000x data took only 2x more time.

I don't think that the algorithm scales so well, so my guess is that
'status' is so inefficient for subdirectories (i.e. the "git status --
dir" filter does not filter as much as it could).

> Profiling it should give you a good idea what parts cost most.

(Continue reading)

Nguyen Thai Ngoc Duy | 10 Feb 15:37
Picon
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

On Fri, Feb 10, 2012 at 8:46 PM, Piotr Krukowiecki
<piotr.krukowiecki <at> gmail.com> wrote:
> On Fri, Feb 10, 2012 at 1:33 PM, Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> wrote:
>> On Fri, Feb 10, 2012 at 4:42 PM, Piotr Krukowiecki
>> <piotr.krukowiecki <at> gmail.com> wrote:
>>> Hi,
>>>
>>> I compared stating whole tree vs one small subdirectory, and I
>>> expected that for the subdirectory status will be very very fast.
>>> After all, it has only few files to stat. But it's not fast. Why?
>>
>> Because stat'ing is not the only thing git-status does? In order to
>> find out staged changes, unstaged changes and untracked files, it has
>> to do the equivalence of "git diff --cached", "git diff" and "git
>> ls-files -o". I think copy detection is also enabled, which uses more
>> cycles.
>
> I believe copy detection is not done, neither for tracked nor untracked files.
> Rename detection is done for tracked files. In this case it should not
> matter, as there were no changes to the files.
>
> My point is, that for such small number of small files (55 files and
> 223KB), which had no changes at all, the status took a lot of time (17
> seconds) and doing status on whole repository which has more than
> 2000x files and 10000x data took only 2x more time.
>
> I don't think that the algorithm scales so well, so my guess is that
> 'status' is so inefficient for subdirectories (i.e. the "git status --
> dir" filter does not filter as much as it could).

(Continue reading)

Piotr Krukowiecki | 13 Feb 17:54
Picon

Re: git status: small difference between stating whole repository and small subdirectory

On Fri, Feb 10, 2012 at 3:37 PM, Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> wrote:
> I think the cost is in $GIT_DIR, not the working directory.

Could you explain?

I got the problem again today. This time I've made a copy of the
repository so hopefully I'll able to reproduce the problems.

This time it's a different repository but the 'status' on a small
subdirectory is even more than 2x slower than on the whole repository.

Whole repo:
$ find * -type f | wc -l
33021
$ du -shc * | grep total
2.1G	total

The subdir:
$ find * -type f | wc -l
17
$ du -shc * | grep total
84K	total

As previously, timing was done with cold cache (echo 3 | sudo tee
/proc/sys/vm/drop_caches) and executed several times.

This time I have used recent git (1.7.9.188.g12766) compiled with -pg.
git was executed in the subdirectory. Tracked files were not
changed/deleted, there was just a couple of small untracked files.

(Continue reading)

Piotr Krukowiecki | 10 Feb 17:18
Picon

Re: git status: small difference between stating whole repository and small subdirectory

On Fri, Feb 10, 2012 at 10:42 AM, Piotr Krukowiecki
<piotr.krukowiecki <at> gmail.com> wrote:
> Hi,
>
> I compared stating whole tree vs one small subdirectory, and I
> expected that for the subdirectory status will be very very fast.
> After all, it has only few files to stat. But it's not fast. Why?
>
>
> With cold cache (echo 3 | sudo tee /proc/sys/vm/drop_caches):
>
> $ time git status    > /dev/null
> real    0m41.670s
> user    0m0.980s
> sys     0m2.908s
>
> $ time git status -- src/.../somedir   > /dev/null
> real    0m17.380s
> user    0m0.748s
> sys     0m0.328s
>
>
> With warm cache:
>
> $ time git status    > /dev/null
> real    0m0.792s
> user    0m0.404s
> sys     0m0.384s
>
> $ time git status -- src/.../somedir   > /dev/null
(Continue reading)

Thomas Rast | 14 Feb 12:34
Picon

Re: git status: small difference between stating whole repository and small subdirectory

Piotr Krukowiecki <piotr.krukowiecki <at> gmail.com> writes:

> On Fri, Feb 10, 2012 at 10:42 AM, Piotr Krukowiecki
> <piotr.krukowiecki <at> gmail.com> wrote:
>> I compared stating whole tree vs one small subdirectory, and I
>> expected that for the subdirectory status will be very very fast.
>> After all, it has only few files to stat. But it's not fast. Why?
>>
>>
>> With cold cache (echo 3 | sudo tee /proc/sys/vm/drop_caches):
>>
>> $ time git status    > /dev/null
>> real    0m41.670s
>> user    0m0.980s
>> sys     0m2.908s
>>
>> $ time git status -- src/.../somedir   > /dev/null
>> real    0m17.380s
>> user    0m0.748s
>> sys     0m0.328s
[...]
> I can't reproduce this behavior at the moment. 'status' on the
> directory takes about 1.5s instead of 17s. status on whole repository
> takes 27s.
> This is my work repository, so it was changed today.

To me these timings smell like a combination of either a network
filesystem or a slow/busy disk, and non-packed repositories.  Next time
this happens look at 'git count-objects', run 'git gc' and redo the
timings.
(Continue reading)

Piotr Krukowiecki | 15 Feb 09:57
Picon

Re: git status: small difference between stating whole repository and small subdirectory

On Tue, Feb 14, 2012 at 12:34 PM, Thomas Rast <trast <at> inf.ethz.ch> wrote:
> Piotr Krukowiecki <piotr.krukowiecki <at> gmail.com> writes:
>
>> On Fri, Feb 10, 2012 at 10:42 AM, Piotr Krukowiecki
>> <piotr.krukowiecki <at> gmail.com> wrote:
>>> I compared stating whole tree vs one small subdirectory, and I
>>> expected that for the subdirectory status will be very very fast.
>>> After all, it has only few files to stat. But it's not fast. Why?
>>>
>>>
>>> With cold cache (echo 3 | sudo tee /proc/sys/vm/drop_caches):
>>>
>>> $ time git status    > /dev/null
>>> real    0m41.670s
>>> user    0m0.980s
>>> sys     0m2.908s
>>>
>>> $ time git status -- src/.../somedir   > /dev/null
>>> real    0m17.380s
>>> user    0m0.748s
>>> sys     0m0.328s
> [...]
>> I can't reproduce this behavior at the moment. 'status' on the
>> directory takes about 1.5s instead of 17s. status on whole repository
>> takes 27s.
>> This is my work repository, so it was changed today.
>
> To me these timings smell like a combination of either a network
> filesystem or a slow/busy disk, and non-packed repositories.  Next time
> this happens look at 'git count-objects', run 'git gc' and redo the
(Continue reading)

Nguyen Thai Ngoc Duy | 15 Feb 12:01
Picon
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

On Wed, Feb 15, 2012 at 3:57 PM, Piotr Krukowiecki
<piotr.krukowiecki <at> gmail.com> wrote:
> Indeed, after gc the times went down:
> 10s -> 2.3s (subdirectory)
> 17s -> 9.5s (whole repo)
>
> 2 seconds is much better and I'd say acceptable for me. But my questions are:
> - why is it so slow with not packed repo?
> - can it be faster without repacking?

gc does more than just repacking. If you still have the un-gc'd repo,
Try these commands one by one, and time "git status" after each:

 - git pack-refs --all --prune
 - git reflog expire --all
 - git repack -d -l
 - git prune --expire
 - git rerere gc

I'd be more interested in why auto-gc does not kick in (or whther it should).

> - even with packed repo, the time on small subdirectory is much higher
> than I'd expect given time on whole repo and subdirectory size - why?

Hard to say without measuring. I just notice that I missed your mail
with profiling results. I will have a look, but just in case, is the
repository publicly available?
--

-- 
Duy
(Continue reading)

Piotr Krukowiecki | 15 Feb 16:14
Picon

Re: git status: small difference between stating whole repository and small subdirectory

On Wed, Feb 15, 2012 at 12:01 PM, Nguyen Thai Ngoc Duy
<pclouds <at> gmail.com> wrote:
> On Wed, Feb 15, 2012 at 3:57 PM, Piotr Krukowiecki
> <piotr.krukowiecki <at> gmail.com> wrote:
>> Indeed, after gc the times went down:
>> 10s -> 2.3s (subdirectory)
>> 17s -> 9.5s (whole repo)
>>
>> 2 seconds is much better and I'd say acceptable for me. But my questions are:
>> - why is it so slow with not packed repo?
>> - can it be faster without repacking?
>
> gc does more than just repacking. If you still have the un-gc'd repo,
> Try these commands one by one, and time "git status" after each:
>
>  - git pack-refs --all --prune
>  - git reflog expire --all
>  - git repack -d -l
>  - git prune --expire
>  - git rerere gc

It will take some time but hopefully I'll have the stats for tomorrow.

> I'd be more interested in why auto-gc does not kick in (or whther it should).

I don't have any specific options set, so default values should be used.

I'm using git-svn though, so my workflow looks like this:
   git svn fetch + git svn rebase
   ... git operations like commit, cherry-pick, rebase ...
(Continue reading)

Piotr Krukowiecki | 16 Feb 14:22
Picon

Re: git status: small difference between stating whole repository and small subdirectory

On Wed, Feb 15, 2012 at 4:14 PM, Piotr Krukowiecki
<piotr.krukowiecki <at> gmail.com> wrote:
> On Wed, Feb 15, 2012 at 12:01 PM, Nguyen Thai Ngoc Duy
> <pclouds <at> gmail.com> wrote:
>> On Wed, Feb 15, 2012 at 3:57 PM, Piotr Krukowiecki
>> <piotr.krukowiecki <at> gmail.com> wrote:
>>> Indeed, after gc the times went down:
>>> 10s -> 2.3s (subdirectory)
>>> 17s -> 9.5s (whole repo)
>>>
>>> 2 seconds is much better and I'd say acceptable for me. But my questions are:
>>> - why is it so slow with not packed repo?
>>> - can it be faster without repacking?
>>
>> gc does more than just repacking. If you still have the un-gc'd repo,
>> Try these commands one by one, and time "git status" after each:
>>
>>  - git pack-refs --all --prune
>>  - git reflog expire --all
>>  - git repack -d -l
>>  - git prune --expire
>>  - git rerere gc
>
> It will take some time but hopefully I'll have the stats for tomorrow.

Here they are. I did 'status' three times to get reliable results and
before each run have dropped caches. Backed up repository was copied
before each 'status'. Full log is at http://pastebin.com/VmB7J9CJ

git version 1.7.9.rc0.10.gbeecc
(Continue reading)

Jeff King | 15 Feb 20:03
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

On Wed, Feb 15, 2012 at 09:57:29AM +0100, Piotr Krukowiecki wrote:

> All is on local disk and system is idle.
> 
> Indeed, after gc the times went down:
> 10s -> 2.3s (subdirectory)
> 17s -> 9.5s (whole repo)
> 
> 2 seconds is much better and I'd say acceptable for me. But my questions are:

Obviously these answers didn't come from any deep analysis, but are
educated guesses from me based on previous performance patterns we've
seen on the list:

> - why is it so slow with not packed repo?

Your numbers show that you're I/O-bound:

>>> $ time git status    > /dev/null
>>> real    0m41.670s
>>> user    0m0.980s
>>> sys     0m2.908s
>>>
>>> $ time git status -- src/.../somedir   > /dev/null
>>> real    0m17.380s
>>> user    0m0.748s
>>> sys     0m0.328s

which is not surprising, since you said you dropped caches before-hand.
Repacking probably reduced your disk footprint by a lot, which meant
(Continue reading)

Piotr Krukowiecki | 16 Feb 14:37
Picon

Re: git status: small difference between stating whole repository and small subdirectory

On Wed, Feb 15, 2012 at 8:03 PM, Jeff King <peff <at> peff.net> wrote:
> On Wed, Feb 15, 2012 at 09:57:29AM +0100, Piotr Krukowiecki wrote:
>>
> I notice that you're still I/O bound even after the repack:
>
>> $ time git status  -- .
>> real    0m2.503s
>> user    0m0.160s
>> sys     0m0.096s
>>
>> $ time git status
>> real    0m9.663s
>> user    0m0.232s
>> sys     0m0.556s
>
> Did you drop caches here, too?

Yes I did - with cache the status takes something like 0.1-0.3s on whole repo.

>  Usually that would not be the case on a
> warm cache. If it is, then it sounds like you are short on memory to
> actually hold the directory tree and object db in cache. If not, what do
> the warm cache numbers look like?

I've got 4GB of ram and I did not hit the swap when doing last
performance tests AFAIK.
Please see my previous posts for performance results with warm cache
and profile results:

http://article.gmane.org/gmane.comp.version-control.git/190397
(Continue reading)

Thomas Rast | 16 Feb 15:05
Picon

Re: git status: small difference between stating whole repository and small subdirectory

Piotr Krukowiecki <piotr.krukowiecki <at> gmail.com> writes:

> On Wed, Feb 15, 2012 at 8:03 PM, Jeff King <peff <at> peff.net> wrote:
>> On Wed, Feb 15, 2012 at 09:57:29AM +0100, Piotr Krukowiecki wrote:
>>>
>> I notice that you're still I/O bound even after the repack:
>>
>>> $ time git status  -- .
>>> real    0m2.503s
>>> user    0m0.160s
>>> sys     0m0.096s
>>>
>>> $ time git status
>>> real    0m9.663s
>>> user    0m0.232s
>>> sys     0m0.556s
>>
>> Did you drop caches here, too?
>
> Yes I did - with cache the status takes something like 0.1-0.3s on whole repo.

So umm, I'm not sure that leaves anything to be improved.

I looked at some strace dumps, and limiting the status to a subdirectory
(in my case, '-- t' in git.git) does omit the lstat()s on uninteresting
parts of the index-listed files, as well as the getdents() (i.e.,
readdir()) for parts of the tree that are not interesting.

BTW, some other parts of git-status's display may be responsible for the
amount of data it pulls from disk.  In particular, the "Your branch is
(Continue reading)

Junio C Hamano | 16 Feb 21:15
Picon
Picon
Favicon
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

Thomas Rast <trast <at> inf.ethz.ch> writes:

> ...  If my memory of pack organization serves
> right, the commit objects involved would essentially be spread across
> the whole pack

No they are crumped together in a contiguous section in a packfile, so
that "git log" without any pathspec can go faster without consulting tree
objects.

Piotr Krukowiecki | 17 Feb 17:55
Picon

Re: git status: small difference between stating whole repository and small subdirectory

On Thu, Feb 16, 2012 at 3:05 PM, Thomas Rast <trast <at> inf.ethz.ch> wrote:
> Piotr Krukowiecki <piotr.krukowiecki <at> gmail.com> writes:
>
>> On Wed, Feb 15, 2012 at 8:03 PM, Jeff King <peff <at> peff.net> wrote:
>>> On Wed, Feb 15, 2012 at 09:57:29AM +0100, Piotr Krukowiecki wrote:
>>>>
>>> I notice that you're still I/O bound even after the repack:
>>>
>>>> $ time git status  -- .
>>>> real    0m2.503s
>>>> user    0m0.160s
>>>> sys     0m0.096s
>>>>
>>>> $ time git status
>>>> real    0m9.663s
>>>> user    0m0.232s
>>>> sys     0m0.556s
>>>
>>> Did you drop caches here, too?
>>
>> Yes I did - with cache the status takes something like 0.1-0.3s on whole repo.
>
> So umm, I'm not sure that leaves anything to be improved.

But even with caches time on small directory is only half of time on whole repo:
0.15s vs 0.07s

> I looked at some strace dumps, and limiting the status to a subdirectory
> (in my case, '-- t' in git.git) does omit the lstat()s on uninteresting
> parts of the index-listed files, as well as the getdents() (i.e.,
(Continue reading)

Jeff King | 16 Feb 20:20
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

On Thu, Feb 16, 2012 at 02:37:47PM +0100, Piotr Krukowiecki wrote:

> >> $ time git status  -- .
> >> real    0m2.503s
> >> user    0m0.160s
> >> sys     0m0.096s
> >>
> >> $ time git status
> >> real    0m9.663s
> >> user    0m0.232s
> >> sys     0m0.556s
> >
> > Did you drop caches here, too?
> 
> Yes I did - with cache the status takes something like 0.1-0.3s on whole repo.

OK, then that makes sense. It's pretty much just I/O on the filesystem
and on the object db.

You can break status down a little more to see which is which. Try "git
update-index --refresh" to see just how expensive the lstat and index
handling is.

And then try "git diff-index HEAD" for an idea of how expensive it is to
just read the objects and compare to the index.

> > Not really. You're showing an I/O problem, and repacking is git's way of
> > reducing I/O.
> 
> So if I understand correctly, the reason is because git must compare
(Continue reading)

Piotr Krukowiecki | 17 Feb 18:19
Picon

Re: git status: small difference between stating whole repository and small subdirectory

On Thu, Feb 16, 2012 at 8:20 PM, Jeff King <peff <at> peff.net> wrote:
> On Thu, Feb 16, 2012 at 02:37:47PM +0100, Piotr Krukowiecki wrote:
>
>> >> $ time git status  -- .
>> >> real    0m2.503s
>> >> user    0m0.160s
>> >> sys     0m0.096s
>> >>
>> >> $ time git status
>> >> real    0m9.663s
>> >> user    0m0.232s
>> >> sys     0m0.556s
>> >
>> > Did you drop caches here, too?
>>
>> Yes I did - with cache the status takes something like 0.1-0.3s on whole repo.
>
> OK, then that makes sense. It's pretty much just I/O on the filesystem
> and on the object db.
>
> You can break status down a little more to see which is which. Try "git
> update-index --refresh" to see just how expensive the lstat and index
> handling is.

"git update-index --refresh" with dropped cache took
real	0m3.726s
user	0m0.024s
sys	0m0.404s

while "git status" with dropped cache takes
(Continue reading)

Jeff King | 17 Feb 21:37
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

On Fri, Feb 17, 2012 at 06:19:06PM +0100, Piotr Krukowiecki wrote:

> "git update-index --refresh" with dropped cache took
> real	0m3.726s
> user	0m0.024s
> sys	0m0.404s
> [...]
> The diff-index after dropping cache takes
> real	0m14.095s
> user	0m0.268s
> sys	0m0.564s

OK, that suggests to me that the real culprit is the I/O we spend in
accessing the object db, since that is the main I/O that happens in the
second command but not the first.

> > Mostly reading (we keep a sorted index and access the packfiles via
> > mmap, so we only touch the pages we need). But you're also paying to
> > lstat() the directory tree, too. And you're paying to load (probably)
> > the whole index into memory, although it's relatively compact compared
> > to the actual file data.
> 
> If the index is the objects/pack/*.idx files than it's 21MB

Yes, that's it. Though we don't necessarily read the whole thing. The
sorted list of sha1s is only a part of that. And we mmap and
binary-search that, so we only have to fault in pages that are actually
used in our binary search.

However, we're faulting in random pages of the index in series, so it
(Continue reading)

Junio C Hamano | 17 Feb 23:25
Picon
Picon
Favicon
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

Jeff King <peff <at> peff.net> writes:

> That being said, we do have an index extension to store the tree sha1 of
> whole directories (i.e., we populate it when we write a whole tree or
> subtree into the index from the object db, and it becomes invalidated
> when a file becomes modified). This optimization is used by things like
> "git commit" to avoid having to recreate the same sub-trees over and
> over when creating tree objects from the index. But we could also use it
> here to avoid having to even read the sub-tree objects from the object
> db.

Like b65982b (Optimize "diff-index --cached" using cache-tree, 2009-05-20)
perhaps?
Jeff King | 17 Feb 23:29
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

On Fri, Feb 17, 2012 at 02:25:25PM -0800, Junio C Hamano wrote:

> Jeff King <peff <at> peff.net> writes:
> 
> > That being said, we do have an index extension to store the tree sha1 of
> > whole directories (i.e., we populate it when we write a whole tree or
> > subtree into the index from the object db, and it becomes invalidated
> > when a file becomes modified). This optimization is used by things like
> > "git commit" to avoid having to recreate the same sub-trees over and
> > over when creating tree objects from the index. But we could also use it
> > here to avoid having to even read the sub-tree objects from the object
> > db.
> 
> Like b65982b (Optimize "diff-index --cached" using cache-tree, 2009-05-20)
> perhaps?

That's what I get for speaking before running "git log".

So yeah, we may be about as reasonably fast as we can go. Or maybe that
optimization isn't kicking in for some reason. I think going further
would require Piotr to do more profiling.

-Peff
Piotr Krukowiecki | 20 Feb 09:25
Picon

Re: git status: small difference between stating whole repository and small subdirectory

On Fri, Feb 17, 2012 at 11:29 PM, Jeff King <peff <at> peff.net> wrote:
> On Fri, Feb 17, 2012 at 02:25:25PM -0800, Junio C Hamano wrote:
>
>> Jeff King <peff <at> peff.net> writes:
>>
>> > That being said, we do have an index extension to store the tree sha1 of
>> > whole directories (i.e., we populate it when we write a whole tree or
>> > subtree into the index from the object db, and it becomes invalidated
>> > when a file becomes modified). This optimization is used by things like
>> > "git commit" to avoid having to recreate the same sub-trees over and
>> > over when creating tree objects from the index. But we could also use it
>> > here to avoid having to even read the sub-tree objects from the object
>> > db.
>>
>> Like b65982b (Optimize "diff-index --cached" using cache-tree, 2009-05-20)
>> perhaps?
>
> That's what I get for speaking before running "git log".
>
> So yeah, we may be about as reasonably fast as we can go. Or maybe that
> optimization isn't kicking in for some reason. I think going further
> would require Piotr to do more profiling.

Is the cache set? Not sure how to check it. t0090-cache-tree.sh uses
test-dump-cache-tree and executes "read-tree HEAD" to establish the
cache, but in my case read-tree does not make the cache dumpable (but
it improves status performance).

$ test-dump-cache-tree  | wc -l
0
(Continue reading)

Jeff King | 20 Feb 15:06
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

On Mon, Feb 20, 2012 at 09:25:00AM +0100, Piotr Krukowiecki wrote:

> Is the cache set? Not sure how to check it. t0090-cache-tree.sh uses
> test-dump-cache-tree and executes "read-tree HEAD" to establish the
> cache, but in my case read-tree does not make the cache dumpable (but
> it improves status performance).
> 
> $ test-dump-cache-tree  | wc -l
> 0
> $ git read-tree HEAD
> $ test-dump-cache-tree  | wc -l
> 0
> $ echo 3 | sudo tee /proc/sys/vm/drop_caches && time git status -- .
> [...]
> real	0m1.085s

Hmm. I would think test-dump-cache-tree would do it. I don't know why
read-tree wouldn't fill it in, though.

Interestingly, on my git.git repo, I had an empty cache. Running "git
read-tree HEAD" filled it (according to test-dump-cache-tree). It seems
that running "git checkout" empties the cache.  So perhaps git could do
better about keeping the cache valid over time.

-Peff
Thomas Rast | 20 Feb 15:09
Picon

Re: git status: small difference between stating whole repository and small subdirectory

Jeff King <peff <at> peff.net> writes:

> On Mon, Feb 20, 2012 at 09:25:00AM +0100, Piotr Krukowiecki wrote:
>
>> Is the cache set? Not sure how to check it. t0090-cache-tree.sh uses
>> test-dump-cache-tree and executes "read-tree HEAD" to establish the
>> cache, but in my case read-tree does not make the cache dumpable (but
>> it improves status performance).
>> 
>> $ test-dump-cache-tree  | wc -l
>> 0
>> $ git read-tree HEAD
>> $ test-dump-cache-tree  | wc -l
>> 0
>> $ echo 3 | sudo tee /proc/sys/vm/drop_caches && time git status -- .
>> [...]
>> real	0m1.085s
>
> Hmm. I would think test-dump-cache-tree would do it. I don't know why
> read-tree wouldn't fill it in, though.
>
> Interestingly, on my git.git repo, I had an empty cache. Running "git
> read-tree HEAD" filled it (according to test-dump-cache-tree). It seems
> that running "git checkout" empties the cache.  So perhaps git could do
> better about keeping the cache valid over time.

test_expect_failure 'checkout gives cache-tree' '
	git checkout HEAD^ &&
	test_shallow_cache_tree
'
(Continue reading)

Nguyen Thai Ngoc Duy | 20 Feb 15:36
Picon
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

On Mon, Feb 20, 2012 at 03:09:57PM +0100, Thomas Rast wrote:
> > Interestingly, on my git.git repo, I had an empty cache. Running "git
> > read-tree HEAD" filled it (according to test-dump-cache-tree). It seems
> > that running "git checkout" empties the cache.  So perhaps git could do
> > better about keeping the cache valid over time.
> 
> test_expect_failure 'checkout gives cache-tree' '
> 	git checkout HEAD^ &&
> 	test_shallow_cache_tree
> '
> 
> ;-)

Quick and dirty that passes that test. I think we could do better if
we analyse two way merge rules carefully and avoid this diff, but
that's too much for me right now.

-- 8< --
diff --git a/builtin/checkout.c b/builtin/checkout.c
index 5bf96ba..c06287a 100644
--- a/builtin/checkout.c
+++ b/builtin/checkout.c
@@ -319,6 +319,10 @@ static void show_local_changes(struct object *head, struct diff_options *opts)
 		die(_("diff_setup_done failed"));
 	add_pending_object(&rev, head, NULL);
 	run_diff_index(&rev, 0);
+	if (!DIFF_OPT_TST(&rev.diffopt, HAS_CHANGES)) {
+		struct tree *tree = parse_tree_indirect(head->sha1);
+		prime_cache_tree(&active_cache_tree, tree);
+	}
(Continue reading)

Jeff King | 20 Feb 15:39
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

On Mon, Feb 20, 2012 at 09:36:44PM +0700, Nguyen Thai Ngoc Duy wrote:

> > test_expect_failure 'checkout gives cache-tree' '
> > 	git checkout HEAD^ &&
> > 	test_shallow_cache_tree
> > '
> > 
> > ;-)
> 
> Quick and dirty that passes that test. I think we could do better if
> we analyse two way merge rules carefully and avoid this diff, but
> that's too much for me right now.

Unpack trees is already sprinkled with cache_tree_invalidate_path. But
something seems to throw away the cache tree entirely (I think it may be
that the extension simply isn't copied over to the destination index).
I'm walking through it right now.

-Peff
Jeff King | 20 Feb 16:11
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

On Mon, Feb 20, 2012 at 09:39:52AM -0500, Jeff King wrote:

> On Mon, Feb 20, 2012 at 09:36:44PM +0700, Nguyen Thai Ngoc Duy wrote:
> 
> > > test_expect_failure 'checkout gives cache-tree' '
> > > 	git checkout HEAD^ &&
> > > 	test_shallow_cache_tree
> > > '
> > > 
> > > ;-)
> > 
> > Quick and dirty that passes that test. I think we could do better if
> > we analyse two way merge rules carefully and avoid this diff, but
> > that's too much for me right now.
> 
> Unpack trees is already sprinkled with cache_tree_invalidate_path. But
> something seems to throw away the cache tree entirely (I think it may be
> that the extension simply isn't copied over to the destination index).
> I'm walking through it right now.

Hmm. OK, this doesn't pass the test, but I think it is better than the
current behavior.

Basically, what happens now with "git checkout" is this:

  1. read_cache pulls the cache_tree from disk into the_index

  2. we call unpack_trees with o->src_index == o->dst_index ==
     &the_index.

(Continue reading)

Thomas Rast | 20 Feb 19:45
Picon

Re: git status: small difference between stating whole repository and small subdirectory

Jeff King <peff <at> peff.net> writes:

> diff --git a/unpack-trees.c b/unpack-trees.c
> index 8be3f6c..e8aedea 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -1135,6 +1135,7 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
>  		}
>  	}
>  
> +	o->result.cache_tree = o->src_index->cache_tree;
>  	o->src_index = NULL;
>  	ret = check_updates(o) ? (-2) : 0;
>  	if (o->dst_index)

Brilliant.  I know I'm stealing Junio's punchline, but please make it so
:-)

Browsing around in history, it seems that this was silently broken by
34110cd (Make 'unpack_trees()' have a separate source and destination
index, 2008-03-06), which introduced the distinction between source and
destination index.  Before that they were the same, so the cache tree
would have been updated correctly.

> It makes "git checkout" with no changes just work (since we preserve the
> cache tree, and it doesn't need updated). It makes something like "git
> checkout HEAD^" work, keeping most of the cache-tree intact, but
> invalidating trees containing paths that were modified.

Great.  Here's a test you could use.  It's a bit noisy because the
(Continue reading)

Jeff King | 20 Feb 21:35
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

On Mon, Feb 20, 2012 at 07:45:59PM +0100, Thomas Rast wrote:

> > +	o->result.cache_tree = o->src_index->cache_tree;
> >  	o->src_index = NULL;
> >  	ret = check_updates(o) ? (-2) : 0;
> >  	if (o->dst_index)
> 
> Brilliant.  I know I'm stealing Junio's punchline, but please make it so
> :-)
> 
> Browsing around in history, it seems that this was silently broken by
> 34110cd (Make 'unpack_trees()' have a separate source and destination
> index, 2008-03-06), which introduced the distinction between source and
> destination index.  Before that they were the same, so the cache tree
> would have been updated correctly.

OK, good. When you write a one-liner that makes a huge change in
performance, it is usually a good idea to think to yourself "no, it
couldn't be this easy, could it?".

But after more discussion from people more clueful than I (this is the
first time I've even looked at cache-tree code), I'm feeling like this
is the right direction, at least, if not exactly the right patch.
And seeing that it is in fact a regression in 34110cd, and that the
existing cache-tree invalidations predate that makes me feel better. At
one point, at least, they were complete and we were depending on them to
be accurate. Things may have changed since then, of course, but I at
least know that they were sufficient in 34110cd^.

> +# NEEDSWORK: only one of these two can succeed.  The second is there
(Continue reading)

Junio C Hamano | 20 Feb 23:04
Picon
Picon
Favicon
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

Jeff King <peff <at> peff.net> writes:

> ... Things may have changed since then, of course, but I at
> least know that they were sufficient in 34110cd^.

Looking at where cache_tree_free() is called, I think back then the
two-way merge was deemed OK, but we did not trust three-way merge or
merge-recursive at all.

> I think you can construct two tests that will both work in the "ideal"
> case. In the first one, you move to a tree that updates "foo", and
> therefore the root cache-tree is invalidated.

I have to warn you that under-invalidation of cache-tree is extremely hard
to find.  The only way I know, which I had to resort to when dealing with
a handful of instances of under-invalidation bugs, is to run write-tree
with potentially corrupt cache-tree and then using the same index but with
the cache-tree purged, run another write-tree and check to see if two
trees match.

> In general, t0090 could benefit from using a larger tree. For example,
> the add test does "git add foo" and checks that the root cache-tree was
> invalidated. But it should _also_ check that the cache-tree for a
> subdirectory is _not_ invalidated (and it isn't; git-add does the right
> thing).

It is OK to check that we do not over-invalidate for performance, but it
is a lot more important to make sure we do not under-invalidate for
correctness.  I am a bit worried that you seem to be putting more stress
on the former.
(Continue reading)

Jeff King | 20 Feb 23:41
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

On Mon, Feb 20, 2012 at 02:04:26PM -0800, Junio C Hamano wrote:

> > ... Things may have changed since then, of course, but I at
> > least know that they were sufficient in 34110cd^.
> 
> Looking at where cache_tree_free() is called, I think back then the
> two-way merge was deemed OK, but we did not trust three-way merge or
> merge-recursive at all.

Thanks, I'll take a look more closely at those cases.

> It is OK to check that we do not over-invalidate for performance, but it
> is a lot more important to make sure we do not under-invalidate for
> correctness.  I am a bit worried that you seem to be putting more stress
> on the former.

I think it is just selection bias of the specific parts of his tests
that I was responding to. I completely agree that correctness is way
more important, and I'm also trying to come up with tests to validate
correctness. I just wasn't talking about them there.

I still think replaying real-world test cases is going to be more likely
to find issues in invalidation. I can come up with lots of simple
test-cases, but they're not likely to find anything we wouldn't find in
the code with trivial inspection. I think a combination of careful
analysis and real-world validation is going to be more helpful in the
long run than the kind of simplistic tests that are in t0090.

-Peff
(Continue reading)

Junio C Hamano | 21 Feb 00:31
Picon
Picon
Favicon
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

Jeff King <peff <at> peff.net> writes:

> I still think replaying real-world test cases is going to be more likely
> to find issues in invalidation.

Yes, but it depends on what kind of replaying you have in mind.  It is
very hard to come up with "replaying real-world test case".

For example, randomly picking commit pair <$A,$B> from the kernel
repository and running

	git reset --hard $A
        git checkout $B
        T0=$(git write-tree)
        drop-cache-tree
        T1=$(git write-tree)
        test "$T0" = "$T1" && test "$T0" = $(git rev-parse $B^{tree})

is necessary but I do not think that is sufficient.  We also want to do
something like:

	git reset --hard $A
        ... modify paths that do not change between $A and $B
        ... git add these paths
        git write-tree
        git checkout $B

with and without the "git write-tree" to see the part of the cache-tree
smudged by the modification behaves sanely.  The codepath that is used
to deal with the case where the index does not match $A but matches $B is
(Continue reading)

Piotr Krukowiecki | 21 Feb 08:21
Picon

Re: git status: small difference between stating whole repository and small subdirectory

On Tue, Feb 21, 2012 at 12:31 AM, Junio C Hamano <gitster <at> pobox.com> wrote:
> Jeff King <peff <at> peff.net> writes:
[...]

Hi,

I hope the fixes will also help git-svn users? I.e. there's a lot of
rebasing (and cherry-picking - at least in my case) and probably some
other stuff going on under hood. I.e. I hope that if I git-svn rebase
or cherry-pick, it won't invalidate the cache...

Thanks,
--

-- 
Piotr Krukowiecki
Junio C Hamano | 20 Feb 21:08
Picon
Picon
Favicon
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

Jeff King <peff <at> peff.net> writes:

>   4. At the end of unpack_trees, we forget about src_index, and copy
>      o->result into *o->dst_index byte for byte. I.e., we overwrite
>      the_index.cache_tree, which has been properly updated the whole
>      time,

I strongly suspect that "properly updated" part needs to be thoroughly
audited.  I wouldn't be surprised that this behaviour is what we did when
we split src_index vs dst_index when he rewrote unpack_trees() in order to
emulate the original "unpack-trees is beyond salvation because it does not
maintain cache tree correctly, just nuke it" behaviour.

> But it does not actually insert the _destination_ tree into the cache
> tree. Which we can do in certain situations, but only if there were no
> paths in the tree that were left unchanged (e.g., you modify "foo", then
> "git checkout HEAD^", which updates "bar". Your tree does not match
> HEAD^, and must be invalidated).  While it would be cool to be able to
> handle those complex cases,...

It may look cool but it may not be a good change. You are spending extra
cycles to optimize for the next write-tree that may not happen before the
index is further updated.

> I think this implementation matches the intent of the original calls to
> cache_tree_invalidate_path sprinkled throughout unpack-trees.c.

Yes, and as long as we invalidate all the directories that need to be
invalidated during the unpack-tree operation, I think it is a correct
thing to do.
(Continue reading)

Jeff King | 20 Feb 21:17
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

On Mon, Feb 20, 2012 at 12:08:03PM -0800, Junio C Hamano wrote:

> >   4. At the end of unpack_trees, we forget about src_index, and copy
> >      o->result into *o->dst_index byte for byte. I.e., we overwrite
> >      the_index.cache_tree, which has been properly updated the whole
> >      time,
> 
> I strongly suspect that "properly updated" part needs to be thoroughly
> audited.  I wouldn't be surprised that this behaviour is what we did when
> we split src_index vs dst_index when he rewrote unpack_trees() in order to
> emulate the original "unpack-trees is beyond salvation because it does not
> maintain cache tree correctly, just nuke it" behaviour.

Yep, I am also concerned about that.

> > But it does not actually insert the _destination_ tree into the cache
> > tree. Which we can do in certain situations, but only if there were no
> > paths in the tree that were left unchanged (e.g., you modify "foo", then
> > "git checkout HEAD^", which updates "bar". Your tree does not match
> > HEAD^, and must be invalidated).  While it would be cool to be able to
> > handle those complex cases,...
> 
> It may look cool but it may not be a good change. You are spending extra
> cycles to optimize for the next write-tree that may not happen before the
> index is further updated.

I don't think it would be too many cycles; you would have to mark each
tree you enter as having items from the left-hand tree or the right-hand
tree. If only one, you can reuse the cache-tree entry (or tree sha1, if
coming from a tree). Otherwise, you must invalidate.  And it doesn't
(Continue reading)

Nguyen Thai Ngoc Duy | 21 Feb 15:45
Picon
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

On Mon, Feb 20, 2012 at 9:36 PM, Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> wrote:
> On Mon, Feb 20, 2012 at 03:09:57PM +0100, Thomas Rast wrote:
>> > Interestingly, on my git.git repo, I had an empty cache. Running "git
>> > read-tree HEAD" filled it (according to test-dump-cache-tree). It seems
>> > that running "git checkout" empties the cache.  So perhaps git could do
>> > better about keeping the cache valid over time.
>>
>> test_expect_failure 'checkout gives cache-tree' '
>>       git checkout HEAD^ &&
>>       test_shallow_cache_tree
>> '
>>
>> ;-)
>
> Quick and dirty that passes that test.

I'm aware that Jeff's tackling at lower level, which retains
cache-tree for many more cases. But this patch seems simple and safe
to me, and in my experience this case happens quite often (or maybe I
tend to keep my index clean). Junio, any chance this patch may get in?

> -- 8< --
> diff --git a/builtin/checkout.c b/builtin/checkout.c
> index 5bf96ba..c06287a 100644
> --- a/builtin/checkout.c
> +++ b/builtin/checkout.c
> @@ -319,6 +319,10 @@ static void show_local_changes(struct object *head, struct diff_options *opts)
>                die(_("diff_setup_done failed"));
>        add_pending_object(&rev, head, NULL);
>        run_diff_index(&rev, 0);
(Continue reading)

Junio C Hamano | 21 Feb 20:16
Picon
Picon
Favicon
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> writes:

> I'm aware that Jeff's tackling at lower level, which retains
> cache-tree for many more cases.
>
> But this patch seems simple and safe
> to me, and in my experience this case happens quite often (or maybe I
> tend to keep my index clean). Junio, any chance this patch may get in?

I do not think we are talking about a duplicated effort here.

By definition, the change to hook into unpack_trees() and making sure we
invalidate all the necessary subtrees in the cache cannot give you a cache
tree that is more populated than what you started with.  And the train of
thought in Peff's message is to improve this invalidation---we currently
invalidate everything ;-)

Somebody has to populate the cache tree fully when we _know_ the index
matches a certain tree, and adding a call to prime_cache_tree() in
strategic places is a way to do so.  The most obvious is write-tree, but
there are a few other existing codepaths that do so.

Because prime_cache_tree() by itself is a fairly expensive operation that
reads all the trees recursively, its benefits need to be evaluated. It
should to happen only in an operation that is already heavy-weight, is
likely to have read all the trees and have many of them in-core cache, and
also relatively rarely happens compared to "git add" so that the cost can
be amortised over time, such as "reset --(hard|mixed)".

Switching branches is likely to fall into that category, but that is just
(Continue reading)

Nguyen Thai Ngoc Duy | 22 Feb 03:12
Picon
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

On Wed, Feb 22, 2012 at 2:16 AM, Junio C Hamano <gitster <at> pobox.com> wrote:
> Because prime_cache_tree() by itself is a fairly expensive operation that
> reads all the trees recursively, its benefits need to be evaluated. It
> should to happen only in an operation that is already heavy-weight, is
> likely to have read all the trees and have many of them in-core cache, and
> also relatively rarely happens compared to "git add" so that the cost can
> be amortised over time, such as "reset --(hard|mixed)".
>
> Switching branches is likely to fall into that category, but that is just
> my gut feeling.  I would feel better at night if somebody did a benchmark
> ;-)

In this particular case, "git diff --cached" is run internally, so I
say all trees are read once and hopefully most of them still in OS
cache. Will run some benchmark, maybe with the coming perf test suite.

> One thing we do not currently do anywhere that _might_ be of merit is to
> make a call to cache_tree_update() instead of prime_cache_tree() when we
> already know that only a very small subpart of the cache-tree is invalid
> and it is cheaper to repair it by rehashing only a small portion of the
> index than to re-prime the entire cache tree with prime_cache_tree().

That makes me think if "diff --cached" can take advantage of
cache-tree to avoid walking down valid cached trees and do tree-tree
diff in those cases instead. Not sure if it gains us anything but code
complexity.
--

-- 
Duy
Junio C Hamano | 22 Feb 03:55
Picon
Picon
Favicon
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> writes:

> That makes me think if "diff --cached" can take advantage of
> cache-tree to avoid walking down valid cached trees and do tree-tree
> diff in those cases instead. Not sure if it gains us anything but code
> complexity.

Why do I have this funny feeling that we saw that comment in this thread
already?

Nguyen Thai Ngoc Duy | 22 Feb 13:54
Picon
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

On Wed, Feb 22, 2012 at 9:55 AM, Junio C Hamano <gitster <at> pobox.com> wrote:
> Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> writes:
>
>> That makes me think if "diff --cached" can take advantage of
>> cache-tree to avoid walking down valid cached trees and do tree-tree
>> diff in those cases instead. Not sure if it gains us anything but code
>> complexity.
>
> Why do I have this funny feeling that we saw that comment in this thread
> already?

Simple. You wrote it and I missed it.

On Sat, Feb 18, 2012 at 5:25 AM, Junio C Hamano <gitster <at> pobox.com> wrote:
> Jeff King <peff <at> peff.net> writes:
>
>> That being said, we do have an index extension to store the tree sha1 of
>> whole directories (i.e., we populate it when we write a whole tree or
>> subtree into the index from the object db, and it becomes invalidated
>> when a file becomes modified). This optimization is used by things like
>> "git commit" to avoid having to recreate the same sub-trees over and
>> over when creating tree objects from the index. But we could also use it
>> here to avoid having to even read the sub-tree objects from the object
>> db.
>
> Like b65982b (Optimize "diff-index --cached" using cache-tree, 2009-05-20)
> perhaps?

This optimizes the case when a cached tree matches entirely.I wonder
whether it's faster if we switch to tree-tree diff whenever we find
(Continue reading)

Thomas Rast | 22 Feb 14:17
Picon

Re: git status: small difference between stating whole repository and small subdirectory

Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> writes:

> On Sat, Feb 18, 2012 at 5:25 AM, Junio C Hamano <gitster <at> pobox.com> wrote:
>> Jeff King <peff <at> peff.net> writes:
>>
>>> That being said, we do have an index extension to store the tree sha1 of
>>> whole directories (i.e., we populate it when we write a whole tree or
>>> subtree into the index from the object db, and it becomes invalidated
>>> when a file becomes modified). This optimization is used by things like
>>> "git commit" to avoid having to recreate the same sub-trees over and
>>> over when creating tree objects from the index. But we could also use it
>>> here to avoid having to even read the sub-tree objects from the object
>>> db.
>>
>> Like b65982b (Optimize "diff-index --cached" using cache-tree, 2009-05-20)
>> perhaps?
>
> This optimizes the case when a cached tree matches entirely.I wonder
> whether it's faster if we switch to tree-tree diff whenever we find
> valid cached trees. If cache-tree is fully valid, "git diff --cached
> foo" would be equivalent to "git diff HEAD foo".

Not necessarily; the cache-tree is valid if it faithfully represents
what is in the index.  It does not have any direct relation to HEAD.

> I tried "git diff --raw HEAD HEAD~100" (where HEAD was
> v3.1-rc1-272-g73e0881 on linux-2.6) and "git diff --cached --raw
> HEAD~100" with no cache-tree. The former is a little bit faster than
> the latter (177ms vs 275ms). On gentoo-x86, 70k worktree files, it's
> 4.33s vs 4.45s. But in tree-tree diff we pay high in cold cache case
(Continue reading)

Nguyen Thai Ngoc Duy | 22 Feb 11:34
Picon
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

On Tue, Feb 21, 2012 at 11:16:37AM -0800, Junio C Hamano wrote:
> Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> writes:
> 
> > I'm aware that Jeff's tackling at lower level, which retains
> > cache-tree for many more cases.
> >
> > But this patch seems simple and safe
> > to me, and in my experience this case happens quite often (or maybe I
> > tend to keep my index clean). Junio, any chance this patch may get in?
> 
> I do not think we are talking about a duplicated effort here.
> 
> By definition, the change to hook into unpack_trees() and making sure we
> invalidate all the necessary subtrees in the cache cannot give you a cache
> tree that is more populated than what you started with.  And the train of
> thought in Peff's message is to improve this invalidation---we currently
> invalidate everything ;-)
> 
> Somebody has to populate the cache tree fully when we _know_ the index
> matches a certain tree, and adding a call to prime_cache_tree() in
> strategic places is a way to do so.  The most obvious is write-tree, but
> there are a few other existing codepaths that do so.
> 
> Because prime_cache_tree() by itself is a fairly expensive operation that
> reads all the trees recursively, its benefits need to be evaluated. It
> should to happen only in an operation that is already heavy-weight, is
> likely to have read all the trees and have many of them in-core cache, and
> also relatively rarely happens compared to "git add" so that the cost can
> be amortised over time, such as "reset --(hard|mixed)".

(Continue reading)

Junio C Hamano | 22 Feb 04:32
Picon
Picon
Favicon
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> writes:

>> diff --git a/builtin/checkout.c b/builtin/checkout.c
>> index 5bf96ba..c06287a 100644
>> --- a/builtin/checkout.c
>> +++ b/builtin/checkout.c
>> @@ -319,6 +319,10 @@ static void show_local_changes(struct object *head, struct diff_options *opts)
>>                die(_("diff_setup_done failed"));
>>        add_pending_object(&rev, head, NULL);
>>        run_diff_index(&rev, 0);
>> +       if (!DIFF_OPT_TST(&rev.diffopt, HAS_CHANGES)) {
>> +               struct tree *tree = parse_tree_indirect(head->sha1);
>> +               prime_cache_tree(&active_cache_tree, tree);
>> +       }
>>  }

I think this patch is wrong on at least two counts.

 * The run_diff_index(&rev, 0) you reused is doing "diff HEAD" and not
   "diff --cached HEAD".  The added check does not say anything about the
   comparison between the index and the tree at the HEAD.

 * Even if we added an extra run_diff_index(&rev, 1) there, or added a
   call to index_differs_from() to run "diff --cached HEAD" to check what
   needs to be checked, it is still not quite right.

On the latter point, imagine what happens in the two invocations of
checkout in the following sequence:

   $ git reset --hard master
(Continue reading)

Piotr Krukowiecki | 10 Apr 17:16
Picon

Re: git status: small difference between stating whole repository and small subdirectory

On Wed, Feb 22, 2012 at 4:32 AM, Junio C Hamano <gitster <at> pobox.com> wrote:
> Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> writes:
>
>>> diff --git a/builtin/checkout.c b/builtin/checkout.c
>>> index 5bf96ba..c06287a 100644
>>> --- a/builtin/checkout.c
>>> +++ b/builtin/checkout.c
>>> @@ -319,6 +319,10 @@ static void show_local_changes(struct object *head, struct diff_options *opts)
>>>                die(_("diff_setup_done failed"));
>>>        add_pending_object(&rev, head, NULL);
>>>        run_diff_index(&rev, 0);
>>> +       if (!DIFF_OPT_TST(&rev.diffopt, HAS_CHANGES)) {
>>> +               struct tree *tree = parse_tree_indirect(head->sha1);
>>> +               prime_cache_tree(&active_cache_tree, tree);
>>> +       }
>>>  }
>
> I think this patch is wrong on at least two counts.
>
>  * The run_diff_index(&rev, 0) you reused is doing "diff HEAD" and not
>   "diff --cached HEAD".  The added check does not say anything about the
>   comparison between the index and the tree at the HEAD.
>
>  * Even if we added an extra run_diff_index(&rev, 1) there, or added a
>   call to index_differs_from() to run "diff --cached HEAD" to check what
>   needs to be checked, it is still not quite right.
>
> On the latter point, imagine what happens in the two invocations of
> checkout in the following sequence:
>
(Continue reading)

Junio C Hamano | 10 Apr 18:23
Picon
Picon
Favicon
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

Piotr Krukowiecki <piotr.krukowiecki <at> gmail.com> writes:

> could I ask what is the status of this? There were some patches
> posted, but I think nothing final?

I do not think you meant to address your inquiry to me, but I think these
patches tried out some ideas, got issues discovered in them and then got
abandoned before resulting in a working code that is ready for testing.

I wish there were fewer such series, but it happens (see "Stalled" section
in What's cooking).

Jeff King | 10 Apr 20:00
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

On Tue, Apr 10, 2012 at 09:23:59AM -0700, Junio C Hamano wrote:

> > could I ask what is the status of this? There were some patches
> > posted, but I think nothing final?
> 
> I do not think you meant to address your inquiry to me, but I think these
> patches tried out some ideas, got issues discovered in them and then got
> abandoned before resulting in a working code that is ready for testing.

Yes. I think we decided that we needed some pretty good testing to add
the cache_tree handling back into unpack_trees. I'd still like to do
that testing, but haven't done it yet.

-Peff
Junio C Hamano | 20 Feb 20:57
Picon
Picon
Favicon
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

Thomas Rast <trast <at> inf.ethz.ch> writes:

> test_expect_failure 'checkout gives cache-tree' '
> 	git checkout HEAD^ &&
> 	test_shallow_cache_tree
> '

Depending on what state you start the checkout from, that is not a valid
test.  Some form of "git reset" before the checkout to ensure the initial
state is needed.
Thomas Rast | 20 Feb 20:59
Picon

Re: git status: small difference between stating whole repository and small subdirectory

Junio C Hamano <gitster <at> pobox.com> writes:

> Thomas Rast <trast <at> inf.ethz.ch> writes:
>
>> test_expect_failure 'checkout gives cache-tree' '
>> 	git checkout HEAD^ &&
>> 	test_shallow_cache_tree
>> '
>
> Depending on what state you start the checkout from, that is not a valid
> test.  Some form of "git reset" before the checkout to ensure the initial
> state is needed.

Oh, I was just quoting what we already had at the end of t0090 since
4eb0346f.  The test preceding it runs 'git reset --hard'.

--

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
Nguyen Thai Ngoc Duy | 20 Feb 15:16
Picon
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

On Mon, Feb 20, 2012 at 9:06 PM, Jeff King <peff <at> peff.net> wrote:
> Interestingly, on my git.git repo, I had an empty cache. Running "git
> read-tree HEAD" filled it (according to test-dump-cache-tree). It seems
> that running "git checkout" empties the cache.  So perhaps git could do
> better about keeping the cache valid over time.

For fast forward case when result index matches 100% destination tree,
yeah we should repopulate cache-tree. "git reset" does that. Not sure
about other cases though. I don't think we can keep track what
subtrees are unchanged after unpack_trees() in order to keep them.
--

-- 
Duy
Jeff King | 20 Feb 15:22
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

On Mon, Feb 20, 2012 at 09:16:43PM +0700, Nguyen Thai Ngoc Duy wrote:

> On Mon, Feb 20, 2012 at 9:06 PM, Jeff King <peff <at> peff.net> wrote:
> > Interestingly, on my git.git repo, I had an empty cache. Running "git
> > read-tree HEAD" filled it (according to test-dump-cache-tree). It seems
> > that running "git checkout" empties the cache.  So perhaps git could do
> > better about keeping the cache valid over time.
> 
> For fast forward case when result index matches 100% destination tree,
> yeah we should repopulate cache-tree. "git reset" does that. Not sure
> about other cases though. I don't think we can keep track what
> subtrees are unchanged after unpack_trees() in order to keep them.

Yeah, doing it after unpack_trees seems crazy. But I really feel like
unpack_trees should be able to handle cache updates as it unpacks. It
knows what is being updated and what is being merged. But maybe it is
more complicated than that; I haven't looked at the code yet.

-Peff
Junio C Hamano | 20 Feb 20:56
Picon
Picon
Favicon
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

Jeff King <peff <at> peff.net> writes:

> Interestingly, on my git.git repo, I had an empty cache. Running "git
> read-tree HEAD" filled it (according to test-dump-cache-tree). It seems
> that running "git checkout" empties the cache.  So perhaps git could do
> better about keeping the cache valid over time.

At least in the early days unpack-trees built the result by manually
adding an entry without calling the add_index_entry() all over the place,
which meant that it was futile to pretend that there is even a slight
chance that complex beast would correctly invalidate cached tree
information at all the necessary places. I recall that I added a code to
nuke the cache tree at the very beginning of "merging" codepaths to avoid
any bogus cache tree result to be stored in the resulting index.

These days, we have src_index and dst_index, and dst_index IIRC can start
as empty in which case "start from kept information and selectively
invalidate" would not work at all.  When src_index and dst_index are the
same, however, you should be able to keep the cached tree valid, at least
in theory.
Jeff King | 20 Feb 21:09
Gravatar

Re: git status: small difference between stating whole repository and small subdirectory

On Mon, Feb 20, 2012 at 11:56:13AM -0800, Junio C Hamano wrote:

> These days, we have src_index and dst_index, and dst_index IIRC can start
> as empty in which case "start from kept information and selectively
> invalidate" would not work at all.  When src_index and dst_index are the
> same, however, you should be able to keep the cached tree valid, at least
> in theory.

Yeah, I was worried that the cache invalidations sprinkled throughout
unpack-trees.c would not be sufficient (and because we are invalidating,
a missing invalidation would give us bogus cache info, which is Very
Bad).

So I think the one-liner I posted before is not sufficient in the
general case, because it definitely doesn't consider where the
destination is starting from. It should at least be more like:

  if (src_index == dst_index) {
          /* We would ordinarily want to do a deep copy here, but since
           * we know that we will be overwriting src_index in the long
           * run, it's OK to just take ownership of its cache_tree. */
          o->result.cache_tree = o->src_index->cache_tree;
          o->src_index->cache_tree = NULL;
  }

  [... do the usual tree traversal here, except invalidate entries in
       o->result.call_tree instead of o->src_index. That makes it a
       no-op when src_index != dst_index (because we have no cache tree
       defined in result, then), and otherwise we are invalidating what
       will go into the result...]
(Continue reading)


Gmane