Dirk Pranke | 15 Aug 2012 21:22

A proposal for handling "failing" layout tests and TestExpectations

Hi all,

As many of you know, we normally treat the -expected files as
"regression" tests rather than "correctness" tests; they are intended
to capture the current behavior of the tree. As such, they
historically have not distinguished between a "correct failure" and an
"incorrect failure".

The chromium port, however, has historically often not checked in
expectations for tests that are currently failing (or even have been
failing for a long time), and instead listed them in the expectations
file. This was primarily motivated by us wanting to know easily all of
the tests that were "failing". However, this approach has its own
downsides.

I would like to move the project to a point where all of the ports
were basically using the same workflow/model, and combine the best
features of each approach [1].

To that end, I propose that we allow tests to have expectations that
end in '-passing' and '-failing' as well as '-expected'.

The meanings for '-passing' and '-failing' should be obvious, and
'-expected' can continue the current meaning of either or both of
"what we expect to happen" and "I don't know if this is correct or
not" :).

A given test will be allowed to only have one of the three potential
results at any one time/revision in a checkout. [2]

(Continue reading)

Filip Pizlo | 15 Aug 2012 21:27
Picon
Favicon

Re: A proposal for handling "failing" layout tests and TestExpectations

This sounds like it's adding even more complication to an already complicated system.

Given how many tests we currently have, I also don't buy that continuing to run a test that is already known to
fail provides much benefit.  If the test covers two things and one fails while the other succeeds (example
#1: it prints correct results but doesn't crash; example #2: it literally has tests for two distinct
unrelated features, and one feature is broken) then the correct thing to do is to split up the test.

So, I'd rather not continue to institutionalize this notion that we should have loads of incomprehensible
machinery to reason about tests that have already given us all the information they were meant to give
(i.e. they failed, end of story).

-F

On Aug 15, 2012, at 12:22 PM, Dirk Pranke <dpranke <at> chromium.org> wrote:

> Hi all,
> 
> As many of you know, we normally treat the -expected files as
> "regression" tests rather than "correctness" tests; they are intended
> to capture the current behavior of the tree. As such, they
> historically have not distinguished between a "correct failure" and an
> "incorrect failure".
> 
> The chromium port, however, has historically often not checked in
> expectations for tests that are currently failing (or even have been
> failing for a long time), and instead listed them in the expectations
> file. This was primarily motivated by us wanting to know easily all of
> the tests that were "failing". However, this approach has its own
> downsides.
> 
(Continue reading)

Dirk Pranke | 15 Aug 2012 21:40

Re: A proposal for handling "failing" layout tests and TestExpectations

On Wed, Aug 15, 2012 at 12:27 PM, Filip Pizlo <fpizlo <at> apple.com> wrote:
> This sounds like it's adding even more complication to an already complicated system.

In some ways, yes. In other ways, perhaps it will allow us to simplify
things; e.g., if we are checking in failing tests, there is much less
of a need for multiple failure keywords in the TestExpectations file
(so perhaps we can simplify them back to something closer to Skipped
files).

> Given how many tests we currently have, I also don't buy that continuing to run a test that is already known
to fail provides much benefit.

I'm not sure I understand your feedback here? It's common practice (on
all the ports as far as I know today) to rebaseline tests that are
currently failing so that they fail differently. Of course, we also
skip some tests while they are failing as well. However, common
objections to skipping tests are that we can lose coverage for a
feature and/or miss when a test starts failing worse (e.g. crashing)?
And of course, a test might start passing again, but if we're skipping
it we wouldn't know that ...

> So, I'd rather not continue to institutionalize this notion that we should have loads of
incomprehensible machinery to reason about tests that have already given us all the information they
were meant to give (i.e. they failed, end of story).

Are you suggesting that, rather than checking in new baselines at all
or having lots of logic to manage different kinds of failures, we
should just let failing tests fail (and keep the tree red) until a
change is either reverted or the test is fixed?

(Continue reading)

Filip Pizlo | 16 Aug 2012 00:06
Picon
Favicon

Re: A proposal for handling "failing" layout tests and TestExpectations

Apparently I was somewhat unclear.  Let me restate.  We have the following mechanisms available when a test fails:

1) Check in a new -expected.* file.

2) Modify the test.

3) Modify a TestExpectations file.

4) Add the test to a Skipped file.

5) Remove the test entirely.

I have no problem with (1) unless it is intended to mark the test as expected-to-fail-but-not-crash.  I
agree that using -expected.* to accomplish what TestExpectations accomplishes is not valuable, but I
further believe that even TestExpectations is not valuable.

I broadly prefer (2) whenever possible.

I believe that (3) and (4) are redundant, and I don't buy the value of (3).

I don't like (5) but we should probably do more of it for tests that have a chronically low signal-to-noise ratio.

You're proposing a new mechanism.  I'm arguing that given the sheer number of tests, and the overheads
associated with maintaining them, (4) is the broadly more productive strategy in terms of
bugs-fixed/person-hours.  And, increasing the number of mechanisms for dealing with tests by 20% is
likely to reduce overall productivity rather than helping anyone.

-F

On Aug 15, 2012, at 12:40 PM, Dirk Pranke <dpranke <at> chromium.org> wrote:
(Continue reading)

Dirk Pranke | 16 Aug 2012 01:02

Re: A proposal for handling "failing" layout tests and TestExpectations

On Wed, Aug 15, 2012 at 3:06 PM, Filip Pizlo <fpizlo <at> apple.com> wrote:
> Apparently I was somewhat unclear.  Let me restate.  We have the following mechanisms available when a test fails:
>
> 1) Check in a new -expected.* file.
>
> 2) Modify the test.
>
> 3) Modify a TestExpectations file.
>
> 4) Add the test to a Skipped file.
>
> 5) Remove the test entirely.
>
> I have no problem with (1) unless it is intended to mark the test as expected-to-fail-but-not-crash.  I
agree that using -expected.* to accomplish what TestExpectations accomplishes is not valuable, but I
further believe that even TestExpectations is not valuable.
>
> I broadly prefer (2) whenever possible.
>
> I believe that (3) and (4) are redundant, and I don't buy the value of (3).
>
> I don't like (5) but we should probably do more of it for tests that have a chronically low signal-to-noise ratio.
>

Thank you for clarifying. I had actually written an almost identical
list but didn't send it, so I think we're on the same page at least as
far as understanding the problem goes ...

So, I would describe my suggestion as an improved variant of the kind
of (1) that can be used as "expected-to-fail-but-not-crash" (which
(Continue reading)

Filip Pizlo | 16 Aug 2012 02:00
Picon
Favicon

Re: A proposal for handling "failing" layout tests and TestExpectations


On Aug 15, 2012, at 4:02 PM, Dirk Pranke <dpranke <at> chromium.org> wrote:

> On Wed, Aug 15, 2012 at 3:06 PM, Filip Pizlo <fpizlo <at> apple.com> wrote:
>> Apparently I was somewhat unclear.  Let me restate.  We have the following mechanisms available when a
test fails:
>> 
>> 1) Check in a new -expected.* file.
>> 
>> 2) Modify the test.
>> 
>> 3) Modify a TestExpectations file.
>> 
>> 4) Add the test to a Skipped file.
>> 
>> 5) Remove the test entirely.
>> 
>> I have no problem with (1) unless it is intended to mark the test as expected-to-fail-but-not-crash.  I
agree that using -expected.* to accomplish what TestExpectations accomplishes is not valuable, but I
further believe that even TestExpectations is not valuable.
>> 
>> I broadly prefer (2) whenever possible.
>> 
>> I believe that (3) and (4) are redundant, and I don't buy the value of (3).
>> 
>> I don't like (5) but we should probably do more of it for tests that have a chronically low signal-to-noise ratio.
>> 
> 
> Thank you for clarifying. I had actually written an almost identical
> list but didn't send it, so I think we're on the same page at least as
(Continue reading)

Dirk Pranke | 16 Aug 2012 02:24

Re: A proposal for handling "failing" layout tests and TestExpectations

On Wed, Aug 15, 2012 at 5:00 PM, Filip Pizlo <fpizlo <at> apple.com> wrote:
>
> On Aug 15, 2012, at 4:02 PM, Dirk Pranke <dpranke <at> chromium.org> wrote:
>
>> On Wed, Aug 15, 2012 at 3:06 PM, Filip Pizlo <fpizlo <at> apple.com> wrote:
>>> Apparently I was somewhat unclear.  Let me restate.  We have the following mechanisms available when a
test fails:
>>>
>>> 1) Check in a new -expected.* file.
>>>
>>> 2) Modify the test.
>>>
>>> 3) Modify a TestExpectations file.
>>>
>>> 4) Add the test to a Skipped file.
>>>
>>> 5) Remove the test entirely.
>>>
>>> I have no problem with (1) unless it is intended to mark the test as expected-to-fail-but-not-crash.  I
agree that using -expected.* to accomplish what TestExpectations accomplishes is not valuable, but I
further believe that even TestExpectations is not valuable.
>>>
>>> I broadly prefer (2) whenever possible.
>>>
>>> I believe that (3) and (4) are redundant, and I don't buy the value of (3).
>>>
>>> I don't like (5) but we should probably do more of it for tests that have a chronically low
signal-to-noise ratio.
>>>
>>
(Continue reading)

Peter Kasting | 16 Aug 2012 02:39

Re: A proposal for handling "failing" layout tests and TestExpectations

On Wed, Aug 15, 2012 at 5:00 PM, Filip Pizlo <fpizlo <at> apple.com> wrote:

I believe that the cognitive load is greater than any benefit from catching bugs incidentally by continuing to run a (1-fail) or (3) test, and continuing to evaluate whether or not the expectation matches some notions of desired behavior.

As someone who has spent a lot of time maintaining Chromium's expectations, this seems clearly false, if your proposed alternative is to stop running the test.  This is because a very common course of events is for a test to begin failing, and then later on return to passing.  We (Chromium) see this all the time with e.g. Skia changes, where for example the Skia folks will rewrite gradient handling to more perfectly match some spec and as a result dozens or hundreds of tests, many not explicitly intended to be about gradient handling, will change and possibly begin passing.

By contrast, if we aren't running a test, we don't know when the test begins passing again (except by trying to run it).  The resulting effect is that skipped tests tend to remain skipped.  Tests that remain skipped are no better than no tests.  And even if such tests are periodically retested, once a test's output changes, there is a large window of time where the test wasn't running, making it difficult to pinpoint exactly what caused the change and whether the resulting effect is intentional and beneficial.

If we ARE running a test, then when the results change, knowing whether the existing result was thought to be correct or not is a critical part of a sheriff's job in deciding what to do about the change.  This is one reason why Chromium has never gone down the path of simply checking in failure expectations, and something that Dirk's proposal explicitly tries to address while still allowing ports that (IMO mistakenly) don't care to continue to not care.

We already have some good tooling (e.g. garden-o-matic) that could be extended to show and update the small amount of additional info Dirk is proposing.  I am very skeptical of abstract claims that this proposal inflates complexity and decreases productivity in the absence of actually testing a real workflow using the tools that we sheriffs really use to maintain tree greenness.

I would like to see this proposal tested to get concrete feedback instead of arguments on principle.

PK
_______________________________________________
webkit-dev mailing list
webkit-dev <at> lists.webkit.org
http://lists.webkit.org/mailman/listinfo/webkit-dev
Filip Pizlo | 16 Aug 2012 02:45
Picon
Favicon

Re: A proposal for handling "failing" layout tests and TestExpectations

The typical approach used in situations that you describe is to rebase, not skip.  This avoids the problem of not knowing when the test started passing.  Hence, I'm not sure what you're implying.  Maybe a better example would help.


On Aug 15, 2012, at 5:39 PM, Peter Kasting <pkasting <at> chromium.org> wrote:

On Wed, Aug 15, 2012 at 5:00 PM, Filip Pizlo <fpizlo <at> apple.com&g t; wrote:
I believe that the cognitive load is greater than any benefit from catching bugs incidentally by continuing to run a (1-fail) or (3) test, and continuing to evaluate whether or not the expectation matches some notions of desired behavior.

As someone who has spent a lot of time maintaining Chromium's expectations, this seems clearly false, if your proposed alternative is to stop running the test.  This is because a very common course of events is for a test to begin failing, and then later on return to passing.  We (Chromium) see this all the time with e.g. Skia changes, where for example the Skia folks will rewrite gradient handling to more perfectly match some spec and as a result dozens or hundreds of tests, many not explicitly intended to be about gradient handling, will change and possibly begin passing.

By contrast, if we aren't running a test, we don't know when the test begins passing again (except by trying to run it).  The resulting effect is that skipped tests tend to remain skipped.  Tests that remain skipped are no better than no tests.  And even if such tests are periodically retested, once a test's output changes, there is a large window of time where the test wasn't running, making it difficult to pinpoint exactly what caused the change and whether the resulting effect is intentional and beneficial.

If we ARE running a test, then when the results change, knowing whether the existing result was thought to be correct or not is a critical part of a sheriff's job in deciding what to do about the change.  This is one reason why Chromium has never gone down the path of simply checking in failure expectations, and something that Dirk's proposal explicitly tries to address while still allowing ports that (IMO mistakenly) don't care to continue to not care.

We already have some good tooling (e.g. garden-o-matic) that could be extended to show and update the small amount of additional info Dirk is proposing.  I am very skeptical of abstract claims that this proposal inflates complexity and decreases productivity in the absence of actually testing a real workflow using the tools that we sheriffs really use to maintain tree greenness.

I would like to see this proposal tested to get concrete feedback instead of arguments on principle.

I would not like to see our testing infrastructure get any more complicated than it already is, just because of a philosophical direction chosen unilaterally by one port.


PK

_______________________________________________
webkit-dev mailing list
webkit-dev <at> lists.webkit.org
http://lists.webkit.org/mailman/listinfo/webkit-dev
Peter Kasting | 16 Aug 2012 02:51

Re: A proposal for handling "failing" layout tests and TestExpectations

On Wed, Aug 15, 2012 at 5:45 PM, Filip Pizlo <fpizlo <at> apple.com> wrote:

The typical approach used in situations that you describe is to rebase, not skip.

Which is precisely what Dirk is proposing.  Literally the only difference here is that the rebased test expectation would contain an optional notation about whether we believe the new baseline to be correct or incorrect.

Ports that don't care, or tests where we don't know, will be totally unaffected.  I am not sure why this bothers you so much.  You talk about making the infrastructure more complicated, but it doesn't seem like there is much additional complication involved, and what minor complication is involved is being volunteered to be handled by Dirk and other folks.

PK
_______________________________________________
webkit-dev mailing list
webkit-dev <at> lists.webkit.org
http://lists.webkit.org/mailman/listinfo/webkit-dev
Filip Pizlo | 16 Aug 2012 03:02
Picon
Favicon

Re: A proposal for handling "failing" layout tests and TestExpectations


On Aug 15, 2012, at 5:51 PM, Peter Kasting <pkasting <at> chromium.org> wrote:

On Wed, Aug 15, 2012 at 5:45 PM, Filip Pizlo <fpizlo <at> apple.com> wrote:
The typical approach used in situations that you describe is to rebase, not skip.

Which is precisely what Dirk is proposing.  Literally the only difference here is that the rebased test expectation would contain an optional notation about whether we believe the new baseline to be correct or incorrect.

Can you articulate the value that this gives?  We must weigh that against the downsides:

1) Increased complexity of infrastructure.

2) Possibility of the sheriff getting it wrong.

(2) concerns me most.  We're talking about using filenames to serve as a kind of unchecked comment.  We already know that comments are usually bad because there is no protection against them going stale.


Ports that don't care, or tests where we don't know, will be totally unaffected.  I am not sure why this bothers you so much.  You talk about making the infrastructure more complicated, but it doesn't seem like there is much additional complication involved, and what minor complication is involved is being volunteered to be handled by Dirk and other folks.

Ultimately it will lead to a further multiplication of number of possible ways of doing things, which is bad.  I'm sure some would love to get rid of Skipped files just as much as I would love to get rid of TestExpectations files.  Both are valid things to love, and imply that there must surely exist a middle ground: a way of doing things that is strictly better than the sum of the two.

In particular, to further clarify my position, if someone were to argue that Dirk's proposal would be a wholesale replacement for TestExpectations, then I would be more likely to be on board, since I very much like the idea of reducing the number of ways of doing things.  Maybe that's a good way to reach compromise.

Dirk, what value do you see in TestExpectations were your change to be landed?  Do scenarios still exist where there would be a test for which (a) there is no -fail.* file, (b) the test is not skipped, and (c) it's marked with some annotation in TestExpectations?  I'm most interested in the question of such scenarios exist, since in my experience, whenever a test is not rebased, is not skipped, and is marked as failing in TestExpectations, it ends up just causing gardening overhead later.

-F

_______________________________________________
webkit-dev mailing list
webkit-dev <at> lists.webkit.org
http://lists.webkit.org/mailman/listinfo/webkit-dev
Peter Kasting | 16 Aug 2012 03:18

Re: A proposal for handling "failing" layout tests and TestExpectations

On Wed, Aug 15, 2012 at 6:02 PM, Filip Pizlo <fpizlo <at> apple.com> wrote:

2) Possibility of the sheriff getting it wrong.

(2) concerns me most.  We're talking about using filenames to serve as a kind of unchecked comment.  We already know that comments are usually bad because there is no protection against them going stale.

I don't see how this is parallel to stale comments.  Tests get continually compared to the given output and we see immediately when something changes.

It is certainly possible for whoever assigns the filenames to get things wrong.  There are basically two mitigations of this.  One is allowing the existing "expected.xxx" file extensions to remain, and encouraging people to leave them as-is when they're not sure whether the existing result is correct.  The other is for sheriffs to use this signal as just that -- a signal -- just as today we use the "expected.xxx" files as a signal of what the correct output might be.  The difference is that this can generally be considered a stronger signal.  Historically, there's been no real attempt to guarantee that an "expected" result is anything other than the test's current behavior.

I'm sure some would love to get rid of Skipped files just as much as I would love to get rid of TestExpectations files.  Both are valid things to love, and imply that there must surely exist a middle ground: a way of doing things that is strictly better than the sum of the two.

That's exactly what we're trying to do.

The value of this change is that hopefully it would dramatically reduce the amount of content in these, especially in TestExpectations files.  If you want to kill these so much, then this is a change you should at least let us test!

In particular, to further clarify my position, if someone were to argue that Dirk's proposal would be a wholesale replacement for TestExpectations, then I would be more likely to be on board, since I very much like the idea of reducing the number of ways of doing things.  Maybe that's a good way to reach compromise.

It's hard to know if we could completely eliminate them without testing this, but yes, one goal here is to greatly reduce the need for TestExpectations lines.  A related goal is to make the patterns and mechanisms used by all ports more similar.  As someone who has noted his frustration with both "different ways of doing things" and "philosophical directions chosen by one port", you would hopefully be well-served by this direction.

PK
_______________________________________________
webkit-dev mailing list
webkit-dev <at> lists.webkit.org
http://lists.webkit.org/mailman/listinfo/webkit-dev
Dirk Pranke | 16 Aug 2012 23:13

Re: A proposal for handling "failing" layout tests and TestExpectations

On Wed, Aug 15, 2012 at 6:02 PM, Filip Pizlo <fpizlo <at> apple.com> wrote:
>
> 2) Possibility of the sheriff getting it wrong.
>
> (2) concerns me most.  We're talking about using filenames to serve as a
> kind of unchecked comment.  We already know that comments are usually bad
> because there is no protection against them going stale.
>

Sheriffs can already get things wrong (and rebaseline when they
shouldn't). I believe that adding passing/failing to expected will
make things better in this regard, not worse.

Another idea/observation is that if we have multiple types of
expectation files, it might be easier to set up watchlists, e.g., "let
me know whenever a file gets checked into fast/forms with an -expected
or -failing result". It seems like this might be useful, but I'm not
sure.

> In particular, to further clarify my position, if someone were to argue that
> Dirk's proposal would be a wholesale replacement for TestExpectations, then
> I would be more likely to be on board, since I very much like the idea of
> reducing the number of ways of doing things.  Maybe that's a good way to
> reach compromise.
>
> Dirk, what value do you see in TestExpectations were your change to be
> landed?  Do scenarios still exist where there would be a test for which (a)
> there is no -fail.* file, (b) the test is not skipped, and (c) it's marked
> with some annotation in TestExpectations?  I'm most interested in the
> question of such scenarios exist, since in my experience, whenever a test is
> not rebased, is not skipped, and is marked as failing in TestExpectations,
> it ends up just causing gardening overhead later.

This is a good question, because it is definitely my intent that this
change replace some existing practices, not add to them.

Currently, the Chromium port uses TestExpectations entries for four
different kinds of things: tests we don't ever plan to fix (WONTFIX),
tests that we skip because not doing so causes other tests to break,
tests that fail (reliably), and tests that are flaky.

Skipped files do not let you distinguish (programmatically) between
the first two categories, and so my plan is to replace Skipped files
with TestExpectations (using the new syntax discussed a month or so
ago) soon (next week or two at the latest).

I would like to replace using TestExpectations for failing tests (at
least for tests that are expected to keep failing indefinitely because
someone isn't working on an active fix) with this new mechanism.

That leaves flaky tests. One can debate what the right thing to do w/
flaky tests is here; I'm inclined to argue that flakiness is at least
as bad as failing, and we should probably be skipping them, but the
Chromium port has not yet actively tried this approach (I think other
ports probably have experience here, though).

Does that help answer your question / sway you at all?

-- Dirk
Filip Pizlo | 16 Aug 2012 23:32
Picon
Favicon

Re: A proposal for handling "failing" layout tests and TestExpectations


On Aug 16, 2012, at 2:13 PM, Dirk Pranke wrote:

> On Wed, Aug 15, 2012 at 6:02 PM, Filip Pizlo <fpizlo <at> apple.com> wrote:
>> 
>> 2) Possibility of the sheriff getting it wrong.
>> 
>> (2) concerns me most.  We're talking about using filenames to serve as a
>> kind of unchecked comment.  We already know that comments are usually bad
>> because there is no protection against them going stale.
>> 
> 
> Sheriffs can already get things wrong (and rebaseline when they
> shouldn't). I believe that adding passing/failing to expected will
> make things better in this regard, not worse.

In what way do things become better?  Because other people will see what the sheriff believed about the result?

Can you articulate some more about what happens when you have both -expected and -failing?

My specific concern is that after someone checks in a fix, we will have some sheriff accidentally misjudge
the change in behavior to be a regression, and check in a -failing file.  And then we end up in a world of confusion.

This is why I think that just having -expected files is better.  It is a kind of recognition that we're
tracking changes in behavior, rather than comparing against some almighty notion of what it means to be correct.

> 
> Another idea/observation is that if we have multiple types of
> expectation files, it might be easier to set up watchlists, e.g., "let
> me know whenever a file gets checked into fast/forms with an -expected
> or -failing result". It seems like this might be useful, but I'm not
> sure.
> 
>> In particular, to further clarify my position, if someone were to argue that
>> Dirk's proposal would be a wholesale replacement for TestExpectations, then
>> I would be more likely to be on board, since I very much like the idea of
>> reducing the number of ways of doing things.  Maybe that's a good way to
>> reach compromise.
>> 
>> Dirk, what value do you see in TestExpectations were your change to be
>> landed?  Do scenarios still exist where there would be a test for which (a)
>> there is no -fail.* file, (b) the test is not skipped, and (c) it's marked
>> with some annotation in TestExpectations?  I'm most interested in the
>> question of such scenarios exist, since in my experience, whenever a test is
>> not rebased, is not skipped, and is marked as failing in TestExpectations,
>> it ends up just causing gardening overhead later.
> 
> This is a good question, because it is definitely my intent that this
> change replace some existing practices, not add to them.
> 
> Currently, the Chromium port uses TestExpectations entries for four
> different kinds of things: tests we don't ever plan to fix (WONTFIX),
> tests that we skip because not doing so causes other tests to break,
> tests that fail (reliably), and tests that are flaky.
> 
> Skipped files do not let you distinguish (programmatically) between
> the first two categories, and so my plan is to replace Skipped files
> with TestExpectations (using the new syntax discussed a month or so
> ago) soon (next week or two at the latest).
> 
> I would like to replace using TestExpectations for failing tests (at
> least for tests that are expected to keep failing indefinitely because
> someone isn't working on an active fix) with this new mechanism.
> 
> That leaves flaky tests. One can debate what the right thing to do w/
> flaky tests is here; I'm inclined to argue that flakiness is at least
> as bad as failing, and we should probably be skipping them, but the
> Chromium port has not yet actively tried this approach (I think other
> ports probably have experience here, though).
> 
> Does that help answer your question / sway you at all?

Yes, it does - it answers my question, though it perhaps doesn't sway me.  My concerns are still that:

1) Switching to skipping flaky tests wholesale in all ports would be great, and then we could get rid of the
flakiness support.

2) The WONTFIX mode in TestExpectations feels to me more like a statement that you're just trying to see if
the test doesn't crash.  Correct?  Either way, it's confusing.

3) Your new mechanism feels like it's already covered by our existing use of -expected files.  I'm not quite
convinced that having -failing in addition to -expected files would be all that helpful.

(3) concerns me the most, and it concerns me particularly because we're still not giving good answers to (1)
or (2).

-F
Ryosuke Niwa | 16 Aug 2012 23:38
Favicon
Gravatar

Re: A proposal for handling "failing" layout tests and TestExpectations


On Thu, Aug 16, 2012 at 2:32 PM, Filip Pizlo <fpizlo <at> apple.com> wrote:

On Aug 16, 2012, at 2:13 PM, Dirk Pranke wrote:

> On Wed, Aug 15, 2012 at 6:02 PM, Filip Pizlo <fpizlo <at> apple.com> wrote:
>>
>> 2) Possibility of the sheriff getting it wrong.
>>
>> (2) concerns me most.  We're talking about using filenames to serve as a
>> kind of unchecked comment.  We already know that comments are usually bad
>> because there is no protection against them going stale.
>>
>
> Sheriffs can already get things wrong (and rebaseline when they
> shouldn't). I believe that adding passing/failing to expected will
> make things better in this regard, not worse.

In what way do things become better?  Because other people will see what the sheriff believed about the result?

Can you articulate some more about what happens when you have both -expected and -failing?

My specific concern is that after someone checks in a fix, we will have some sheriff accidentally misjudge the change in behavior to be a regression, and check in a -failing file.  And then we end up in a world of confusion.

This is why I think that just having -expected files is better.  It is a kind of recognition that we're tracking changes in behavior, rather than comparing against some almighty notion of what it means to be correct.

>
> Another idea/observation is that if we have multiple types of
> expectation files, it might be easier to set up watchlists, e.g., "let
> me know whenever a file gets checked into fast/forms with an -expected
> or -failing result". It seems like this might be useful, but I'm not
> sure.
>
>> In particular, to further clarify my position, if someone were to argue that
>> Dirk's proposal would be a wholesale replacement for TestExpectations, then
>> I would be more likely to be on board, since I very much like the idea of
>> reducing the number of ways of doing things.  Maybe that's a good way to
>> reach compromise.
>>
>> Dirk, what value do you see in TestExpectations were your change to be
>> landed?  Do scenarios still exist where there would be a test for which (a)
>> there is no -fail.* file, (b) the test is not skipped, and (c) it's marked
>> with some annotation in TestExpectations?  I'm most interested in the
>> question of such scenarios exist, since in my experience, whenever a test is
>> not rebased, is not skipped, and is marked as failing in TestExpectations,
>> it ends up just causing gardening overhead later.
>
> This is a good question, because it is definitely my intent that this
> change replace some existing practices, not add to them.
>
> Currently, the Chromium port uses TestExpectations entries for four
> different kinds of things: tests we don't ever plan to fix (WONTFIX),
> tests that we skip because not doing so causes other tests to break,
> tests that fail (reliably), and tests that are flaky.
>
> Skipped files do not let you distinguish (programmatically) between
> the first two categories, and so my plan is to replace Skipped files
> with TestExpectations (using the new syntax discussed a month or so
> ago) soon (next week or two at the latest).
>
> I would like to replace using TestExpectations for failing tests (at
> least for tests that are expected to keep failing indefinitely because
> someone isn't working on an active fix) with this new mechanism.
>
> That leaves flaky tests. One can debate what the right thing to do w/
> flaky tests is here; I'm inclined to argue that flakiness is at least
> as bad as failing, and we should probably be skipping them, but the
> Chromium port has not yet actively tried this approach (I think other
> ports probably have experience here, though).
>
> Does that help answer your question / sway you at all?

Yes, it does - it answers my question, though it perhaps doesn't sway me.  My concerns are still that:

1) Switching to skipping flaky tests wholesale in all ports would be great, and then we could get rid of the flakiness support.

This is not necessarily helpful in some cases. There are cases some test make subsequent tests flaky so skipping a flaky test doesn't fix the problem. Instead, it just makes the next test flaky. The right fix, of course, to skip/fix the culprit but pinpointing the test that makes subsequent tests more flaky has proved to be a time consuming and challenging task.

2) The WONTFIX mode in TestExpectations feels to me more like a statement that you're just trying to see if the test doesn't crash.  Correct?  Either way, it's confusing.

Yes, I'd argue that they should just be skipped but that's a bikeshedding for another time.

- Ryosuke

_______________________________________________
webkit-dev mailing list
webkit-dev <at> lists.webkit.org
http://lists.webkit.org/mailman/listinfo/webkit-dev
Ojan Vafai | 16 Aug 2012 23:50

Re: A proposal for handling "failing" layout tests and TestExpectations

On Thu, Aug 16, 2012 at 2:32 PM, Filip Pizlo <fpizlo <at> apple.com> wrote:

1) Switching to skipping flaky tests wholesale in all ports would be great, and then we could get rid of the flakiness support.

Then you don't notice when a flaky tests stops being flaky. The cost of flakiness support on the project does not seem large to me and it's pretty frequent that a test will only be flaky for a few hundred runs (e.g. due to a bad checkin that gets fixed), so we can then remove it from TestExpectations and run the test as normal.

2) The WONTFIX mode in TestExpectations feels to me more like a statement that you're just trying to see if the test doesn't crash.  Correct?  Either way, it's confusing.

WONTFIX is for tests that don't make sense to fix for the given port (e.g. dashboard-specific tests for non-Apple ports). It's a way of distinguishing tests that we're skipping because of a bug on our part vs. tests that we're skipping because the test doesn't apply to the port.

3) Your new mechanism feels like it's already covered by our existing use of -expected files.  I'm not quite convinced that having -failing in addition to -expected files would be all that helpful.

But there are many cases where we *know* the result is incorrect, e.g. it doesn't match a spec. Sure, there are also many cases where it's not clear what the correct behavior is.
_______________________________________________
webkit-dev mailing list
webkit-dev <at> lists.webkit.org
http://lists.webkit.org/mailman/listinfo/webkit-dev
Maciej Stachowiak | 15 Aug 2012 23:16
Picon
Favicon

Re: A proposal for handling "failing" layout tests and TestExpectations


On Aug 15, 2012, at 12:27 PM, Filip Pizlo <fpizlo <at> apple.com> wrote:

> This sounds like it's adding even more complication to an already complicated system.
> 
> Given how many tests we currently have, I also don't buy that continuing to run a test that is already known
to fail provides much benefit.  If the test covers two things and one fails while the other succeeds
(example #1: it prints correct results but doesn't crash; example #2: it literally has tests for two
distinct unrelated features, and one feature is broken) then the correct thing to do is to split up the test.

The case where this happens isn't so much testing unrelated features. It's more that a single test can face
failures of multiple incidental aspects. Let me give an example. Let's say you have a test that's intended
to verify JavaScript multiplication. Let's say it starts failing because of a regression in function
call behavior. There may not be a clean way to split these two aspects. You may have to rewrite the
multiplication test to work around this hypothetical function call bug. It may be nontrivial to figure
out how to do that.

So I agree in principle that splitting orthogonal tests is good, but it's not always a good solution to
avoiding chained regressions.

Note also that having more test files makes the test rights slower.

> 
> So, I'd rather not continue to institutionalize this notion that we should have loads of
incomprehensible machinery to reason about tests that have already given us all the information they
were meant to give (i.e. they failed, end of story).

I think another aspect of your reasoning doesn't entirely fit with the history of WebKit regression
testing. Our tests were not originally designed to test against some abstract notion of "correct"
behavior. They were designed to detect behavior changes. It's definitely not the case that, once
behavior has changed once, there's no value to detecting when it changes again. Some of these behavior
changes might be good and should form a new baseline. Others might be bad but still should not inhibit
detecting other testing while they are investigated.

(I should note that I don't yet have an opinion on the new proposal; I have not read it. But I am skeptical that
splitting tests is a practical solution.)

Regards,
Maciej

> 
> -F
> 
> 
> On Aug 15, 2012, at 12:22 PM, Dirk Pranke <dpranke <at> chromium.org> wrote:
> 
>> Hi all,
>> 
>> As many of you know, we normally treat the -expected files as
>> "regression" tests rather than "correctness" tests; they are intended
>> to capture the current behavior of the tree. As such, they
>> historically have not distinguished between a "correct failure" and an
>> "incorrect failure".
>> 
>> The chromium port, however, has historically often not checked in
>> expectations for tests that are currently failing (or even have been
>> failing for a long time), and instead listed them in the expectations
>> file. This was primarily motivated by us wanting to know easily all of
>> the tests that were "failing". However, this approach has its own
>> downsides.
>> 
>> I would like to move the project to a point where all of the ports
>> were basically using the same workflow/model, and combine the best
>> features of each approach [1].
>> 
>> To that end, I propose that we allow tests to have expectations that
>> end in '-passing' and '-failing' as well as '-expected'.
>> 
>> The meanings for '-passing' and '-failing' should be obvious, and
>> '-expected' can continue the current meaning of either or both of
>> "what we expect to happen" and "I don't know if this is correct or
>> not" :).
>> 
>> A given test will be allowed to only have one of the three potential
>> results at any one time/revision in a checkout. [2]
>> 
>> Because '-expected' will still be supported, this means that ports can
>> continue to work as they do today and we can try -passing/-failing on
>> a piecemeal basis to see if it's useful or not.
>> 
>> Ideally we will have some way (via a presubmit hook, or lint checks,
>> or something) to be able to generate a (checked-in) list (or perhaps
>> just a dashboard or web page) of all of the currently failing tests
>> and corresponding bugs from the "-failing" expectations, so that we
>> can keep one of the advantages that chromium has gotten out of their
>> TestExpectations files [3].
>> 
>> I will update all of the tools (run-webkit-tests, garden-o-matic,
>> flakiness dashboard, etc.) as needed to make managing these things as
>> easy as possible. [4]
>> 
>> Thoughts? I'm definitely open to suggestions/variants/other ideas/etc.
>> 
>> -- Dirk
>> 
>> Notes:
>> 
>> [1] Both the "check in the failures" and the "suppress the failures"
>> approaches have advantages and disadvantages:
>> 
>> Both approaches have their advantages and disadvantages:
>> 
>> Advantages for checking in failures:
>> 
>> * you can tell when a test starts failing differently
>> * the need to distinguish between different types of failures (text
>> vs. image vs. image+text) in the expectations file drops; the baseline
>> tells you what to expect
>> * the TestExpectations file can be much smaller and easier to manage -
>> the current Chromium file is a massively unruly mess
>> * the history of a particular test can be found by looking at the repo history
>> 
>> Disadvantages for checking in failures (advantages for just suppressing them):
>> * given current practice (just using -expected) you can't tell if a
>> particular -expected file is supposed to be be correct or not
>> * it may create more churn in the checked-in baselines in the repo
>> * you can't get a list of all of the failing tests associated with a
>> particular bug as easily
>> 
>> There are probably lots of ways one could attempt to design a solution
>> to these problems; I believe the approach outlined above is perhaps
>> the simplest possible and allows for us to try it in small parts of
>> the test suite (and only on some ports) to see if it's useful or not
>> without forcing it on everyone.
>> 
>> [2] I considered allowing "-passing" and "-failing" to co-exist, but a
>> risk is that the "passing" or correct result for a test will change,
>> and if a test is currently expected to fail, we won't notice that that
>> port's "passing" result becomes stale. In addition, managing the
>> transitions across multiple files becomes a lot more
>> complicated/tricky. There are tradeoffs here, and it's possible some
>> other logic will work better in practice, but I thought it might be
>> best to start simple.
>> 
>> [3] I haven't figured out the best way to do this yet, or whether it's
>> better to keep this list inside the TestExpectations files or in
>> separate files, or just have a dashboard separate from the repository
>> completely ...
>> 
>> [4] rough sketch of the work needed to be done here:
>> * update run-webkit-tests to look for all three suffixes, and to
>> generate new -failing -failing and/or -passing results as well as new
>> -expected results
>> * update garden-o-matic so that when you want to rebaseline a file you
>> can (optionally) indicate whether the new baseline is passing or
>> failing
>> * update webkit-patch rebaseline-expectations to (optionally) indicate
>> if the new baselines are passing or failing
>> * pass the information from run-webkit-tests to the flakiness
>> dashboard (via results.json) to indicate whether we matched against
>> -expected/-passing/-failing so that the dashboard can display the
>> right baselines for comparison.
>> * figure out the solution to [3] :).
>> _______________________________________________
>> webkit-dev mailing list
>> webkit-dev <at> lists.webkit.org
>> http://lists.webkit.org/mailman/listinfo/webkit-dev
> 
> _______________________________________________
> webkit-dev mailing list
> webkit-dev <at> lists.webkit.org
> http://lists.webkit.org/mailman/listinfo/webkit-dev
Filip Pizlo | 16 Aug 2012 00:02
Picon
Favicon

Re: A proposal for handling "failing" layout tests and TestExpectations


On Aug 15, 2012, at 2:16 PM, Maciej Stachowiak <mjs <at> apple.com> wrote:

> 
> On Aug 15, 2012, at 12:27 PM, Filip Pizlo <fpizlo <at> apple.com> wrote:
> 
>> This sounds like it's adding even more complication to an already complicated system.
>> 
>> Given how many tests we currently have, I also don't buy that continuing to run a test that is already known
to fail provides much benefit.  If the test covers two things and one fails while the other succeeds
(example #1: it prints correct results but doesn't crash; example #2: it literally has tests for two
distinct unrelated features, and one feature is broken) then the correct thing to do is to split up the test.
> 
> The case where this happens isn't so much testing unrelated features. It's more that a single test can face
failures of multiple incidental aspects. Let me give an example. Let's say you have a test that's intended
to verify JavaScript multiplication. Let's say it starts failing because of a regression in function
call behavior. There may not be a clean way to split these two aspects. You may have to rewrite the
multiplication test to work around this hypothetical function call bug. It may be nontrivial to figure
out how to do that.
> 
> So I agree in principle that splitting orthogonal tests is good, but it's not always a good solution to
avoiding chained regressions.
> 
> Note also that having more test files makes the test rights slower.
> 
>> 
>> So, I'd rather not continue to institutionalize this notion that we should have loads of
incomprehensible machinery to reason about tests that have already given us all the information they
were meant to give (i.e. they failed, end of story).
> 
> I think another aspect of your reasoning doesn't entirely fit with the history of WebKit regression
testing. Our tests were not originally designed to test against some abstract notion of "correct"
behavior. They were designed to detect behavior changes. It's definitely not the case that, once
behavior has changed once, there's no value to detecting when it changes again. Some of these behavior
changes might be good and should form a new baseline. Others might be bad but still should not inhibit
detecting other testing while they are investigated.

Sorry, I was rather unclear.  I'm arguing against marking a test as expected-to-fail-but-not-crash.  I
have no broad problem with rebasing the test, except if you're using -expected.* files to indicate that
the test is in this expected-to-fail-but-not-crash state.  There is potentially a significant
difference between tests that cover what is displayed and the loads of tests that cover discrete
functionality (DOM APIs, JS behavior, etc).  Rebasing has very little benefit in the latter kinds of tests.

I'm also arguing against having yet another mechanism for indicating that a test is borked.

Finally, I think any strategy we use must take into account the fact that we currently have a *huge* number of
tests.  While there may be examples of failing tests that still succeed in giving us meaningful coverage, I
think there are far more examples of tests that provide no benefit, but require an evil troika of (1) CPU
time to run, (2) network/disk time to check out and update, and (3) person time to rebase.  So, at this point,
having sophisticated mechanisms for dealing with expected-to-fail-but-not-crash tests beyond
filing a bug and skipping the test is likely to just be noise.

> 
> 
> (I should note that I don't yet have an opinion on the new proposal; I have not read it. But I am skeptical that
splitting tests is a practical solution.)
> 
> 
> Regards,
> Maciej
> 
> 
>> 
>> -F
>> 
>> 
>> On Aug 15, 2012, at 12:22 PM, Dirk Pranke <dpranke <at> chromium.org> wrote:
>> 
>>> Hi all,
>>> 
>>> As many of you know, we normally treat the -expected files as
>>> "regression" tests rather than "correctness" tests; they are intended
>>> to capture the current behavior of the tree. As such, they
>>> historically have not distinguished between a "correct failure" and an
>>> "incorrect failure".
>>> 
>>> The chromium port, however, has historically often not checked in
>>> expectations for tests that are currently failing (or even have been
>>> failing for a long time), and instead listed them in the expectations
>>> file. This was primarily motivated by us wanting to know easily all of
>>> the tests that were "failing". However, this approach has its own
>>> downsides.
>>> 
>>> I would like to move the project to a point where all of the ports
>>> were basically using the same workflow/model, and combine the best
>>> features of each approach [1].
>>> 
>>> To that end, I propose that we allow tests to have expectations that
>>> end in '-passing' and '-failing' as well as '-expected'.
>>> 
>>> The meanings for '-passing' and '-failing' should be obvious, and
>>> '-expected' can continue the current meaning of either or both of
>>> "what we expect to happen" and "I don't know if this is correct or
>>> not" :).
>>> 
>>> A given test will be allowed to only have one of the three potential
>>> results at any one time/revision in a checkout. [2]
>>> 
>>> Because '-expected' will still be supported, this means that ports can
>>> continue to work as they do today and we can try -passing/-failing on
>>> a piecemeal basis to see if it's useful or not.
>>> 
>>> Ideally we will have some way (via a presubmit hook, or lint checks,
>>> or something) to be able to generate a (checked-in) list (or perhaps
>>> just a dashboard or web page) of all of the currently failing tests
>>> and corresponding bugs from the "-failing" expectations, so that we
>>> can keep one of the advantages that chromium has gotten out of their
>>> TestExpectations files [3].
>>> 
>>> I will update all of the tools (run-webkit-tests, garden-o-matic,
>>> flakiness dashboard, etc.) as needed to make managing these things as
>>> easy as possible. [4]
>>> 
>>> Thoughts? I'm definitely open to suggestions/variants/other ideas/etc.
>>> 
>>> -- Dirk
>>> 
>>> Notes:
>>> 
>>> [1] Both the "check in the failures" and the "suppress the failures"
>>> approaches have advantages and disadvantages:
>>> 
>>> Both approaches have their advantages and disadvantages:
>>> 
>>> Advantages for checking in failures:
>>> 
>>> * you can tell when a test starts failing differently
>>> * the need to distinguish between different types of failures (text
>>> vs. image vs. image+text) in the expectations file drops; the baseline
>>> tells you what to expect
>>> * the TestExpectations file can be much smaller and easier to manage -
>>> the current Chromium file is a massively unruly mess
>>> * the history of a particular test can be found by looking at the repo history
>>> 
>>> Disadvantages for checking in failures (advantages for just suppressing them):
>>> * given current practice (just using -expected) you can't tell if a
>>> particular -expected file is supposed to be be correct or not
>>> * it may create more churn in the checked-in baselines in the repo
>>> * you can't get a list of all of the failing tests associated with a
>>> particular bug as easily
>>> 
>>> There are probably lots of ways one could attempt to design a solution
>>> to these problems; I believe the approach outlined above is perhaps
>>> the simplest possible and allows for us to try it in small parts of
>>> the test suite (and only on some ports) to see if it's useful or not
>>> without forcing it on everyone.
>>> 
>>> [2] I considered allowing "-passing" and "-failing" to co-exist, but a
>>> risk is that the "passing" or correct result for a test will change,
>>> and if a test is currently expected to fail, we won't notice that that
>>> port's "passing" result becomes stale. In addition, managing the
>>> transitions across multiple files becomes a lot more
>>> complicated/tricky. There are tradeoffs here, and it's possible some
>>> other logic will work better in practice, but I thought it might be
>>> best to start simple.
>>> 
>>> [3] I haven't figured out the best way to do this yet, or whether it's
>>> better to keep this list inside the TestExpectations files or in
>>> separate files, or just have a dashboard separate from the repository
>>> completely ...
>>> 
>>> [4] rough sketch of the work needed to be done here:
>>> * update run-webkit-tests to look for all three suffixes, and to
>>> generate new -failing -failing and/or -passing results as well as new
>>> -expected results
>>> * update garden-o-matic so that when you want to rebaseline a file you
>>> can (optionally) indicate whether the new baseline is passing or
>>> failing
>>> * update webkit-patch rebaseline-expectations to (optionally) indicate
>>> if the new baselines are passing or failing
>>> * pass the information from run-webkit-tests to the flakiness
>>> dashboard (via results.json) to indicate whether we matched against
>>> -expected/-passing/-failing so that the dashboard can display the
>>> right baselines for comparison.
>>> * figure out the solution to [3] :).
>>> _______________________________________________
>>> webkit-dev mailing list
>>> webkit-dev <at> lists.webkit.org
>>> http://lists.webkit.org/mailman/listinfo/webkit-dev
>> 
>> _______________________________________________
>> webkit-dev mailing list
>> webkit-dev <at> lists.webkit.org
>> http://lists.webkit.org/mailman/listinfo/webkit-dev
> 
Dirk Pranke | 15 Aug 2012 21:48

Re: A proposal for handling "failing" layout tests and TestExpectations

I've received at least one bit of feedback off-list that it might not
have been clear what problems I was trying to solve and whether the
solution would add enough benefit to be worth it. Let me try to
restate things, in case this helps ...

The problems I'm attempting to solve are:

1) Chromium does things differently from the other WebKit ports, and
this seems bad :).

2) the TestExpectations syntax is too complicated

3) When you suppress or skip failures (as chromium does), you can't
easily tell when test changes behavior (starts failing differently).

4) We can't easily tell if a test's -expected output is actually
believed to be correct or not, which makes figuring out whether to
rebaseline or not difficult, and makes figuring out if your change
that is causing a test to "fail" is actually introducing a new
problem, just failing differently, or even potentially fixing
something.

I agree that it's unclear how much churn there will be and how much
benefit there will be, but we've been talking about these problems for
years and I think without actually trying *something* we won't know if
there's a better way to solve this or not, and I personally think it's
worth trying. The solution I've described is the least intrusive
mechanism we can try that I've yet come up with.

-- Dirk

On Wed, Aug 15, 2012 at 12:22 PM, Dirk Pranke <dpranke <at> chromium.org> wrote:
> Hi all,
>
> As many of you know, we normally treat the -expected files as
> "regression" tests rather than "correctness" tests; they are intended
> to capture the current behavior of the tree. As such, they
> historically have not distinguished between a "correct failure" and an
> "incorrect failure".
>
> The chromium port, however, has historically often not checked in
> expectations for tests that are currently failing (or even have been
> failing for a long time), and instead listed them in the expectations
> file. This was primarily motivated by us wanting to know easily all of
> the tests that were "failing". However, this approach has its own
> downsides.
>
> I would like to move the project to a point where all of the ports
> were basically using the same workflow/model, and combine the best
> features of each approach [1].
>
> To that end, I propose that we allow tests to have expectations that
> end in '-passing' and '-failing' as well as '-expected'.
>
> The meanings for '-passing' and '-failing' should be obvious, and
> '-expected' can continue the current meaning of either or both of
> "what we expect to happen" and "I don't know if this is correct or
> not" :).
>
> A given test will be allowed to only have one of the three potential
> results at any one time/revision in a checkout. [2]
>
> Because '-expected' will still be supported, this means that ports can
> continue to work as they do today and we can try -passing/-failing on
> a piecemeal basis to see if it's useful or not.
>
> Ideally we will have some way (via a presubmit hook, or lint checks,
> or something) to be able to generate a (checked-in) list (or perhaps
> just a dashboard or web page) of all of the currently failing tests
> and corresponding bugs from the "-failing" expectations, so that we
> can keep one of the advantages that chromium has gotten out of their
> TestExpectations files [3].
>
> I will update all of the tools (run-webkit-tests, garden-o-matic,
> flakiness dashboard, etc.) as needed to make managing these things as
> easy as possible. [4]
>
> Thoughts? I'm definitely open to suggestions/variants/other ideas/etc.
>
> -- Dirk
>
> Notes:
>
> [1] Both the "check in the failures" and the "suppress the failures"
> approaches have advantages and disadvantages:
>
> Both approaches have their advantages and disadvantages:
>
> Advantages for checking in failures:
>
> * you can tell when a test starts failing differently
> * the need to distinguish between different types of failures (text
> vs. image vs. image+text) in the expectations file drops; the baseline
> tells you what to expect
> * the TestExpectations file can be much smaller and easier to manage -
> the current Chromium file is a massively unruly mess
> * the history of a particular test can be found by looking at the repo history
>
> Disadvantages for checking in failures (advantages for just suppressing them):
> * given current practice (just using -expected) you can't tell if a
> particular -expected file is supposed to be be correct or not
> * it may create more churn in the checked-in baselines in the repo
> * you can't get a list of all of the failing tests associated with a
> particular bug as easily
>
> There are probably lots of ways one could attempt to design a solution
> to these problems; I believe the approach outlined above is perhaps
> the simplest possible and allows for us to try it in small parts of
> the test suite (and only on some ports) to see if it's useful or not
> without forcing it on everyone.
>
> [2] I considered allowing "-passing" and "-failing" to co-exist, but a
> risk is that the "passing" or correct result for a test will change,
> and if a test is currently expected to fail, we won't notice that that
> port's "passing" result becomes stale. In addition, managing the
> transitions across multiple files becomes a lot more
> complicated/tricky. There are tradeoffs here, and it's possible some
> other logic will work better in practice, but I thought it might be
> best to start simple.
>
> [3] I haven't figured out the best way to do this yet, or whether it's
> better to keep this list inside the TestExpectations files or in
> separate files, or just have a dashboard separate from the repository
> completely ...
>
> [4] rough sketch of the work needed to be done here:
> * update run-webkit-tests to look for all three suffixes, and to
> generate new -failing -failing and/or -passing results as well as new
> -expected results
> * update garden-o-matic so that when you want to rebaseline a file you
> can (optionally) indicate whether the new baseline is passing or
> failing
> * update webkit-patch rebaseline-expectations to (optionally) indicate
> if the new baselines are passing or failing
> * pass the information from run-webkit-tests to the flakiness
> dashboard (via results.json) to indicate whether we matched against
> -expected/-passing/-failing so that the dashboard can display the
> right baselines for comparison.
> * figure out the solution to [3] :).
Michael Saboff | 15 Aug 2012 22:36
Picon
Favicon

Re: A proposal for handling "failing" layout tests and TestExpectations

It seems to me that there are two issues here.  One is Chromium specific about process conformity.  It seems to
me that should stay a Chromium issue without making the mechanism more complex for all ports.  The other
ports seem to make things work using the existing framework.

The other broader issue is failing tests.  If I understand part of Filip's concern it is a signal to noise
issue.  We do not want the pass / fail signal to be lost in the noise of expected failures.  Failing tests
should be fixed as appropriate for failing platform(s).  That fixing might involve splitting off or
removing a failing sub-test so that the remaining test adds value once again.  Especially "a pass becoming
a fail" edge.  For me, a test failing differently provides unknown value as the noise of it being a failing
test likely exceeds the benefit of the different failure mode signal.  It takes a non-zero effort to filter
that noise and that effort is likely better spent fixing the test.

- Michael

On Aug 15, 2012, at 12:48 PM, Dirk Pranke <dpranke <at> chromium.org> wrote:

> I've received at least one bit of feedback off-list that it might not
> have been clear what problems I was trying to solve and whether the
> solution would add enough benefit to be worth it. Let me try to
> restate things, in case this helps ...
> 
> The problems I'm attempting to solve are:
> 
> 1) Chromium does things differently from the other WebKit ports, and
> this seems bad :).
> 
> 2) the TestExpectations syntax is too complicated
> 
> 3) When you suppress or skip failures (as chromium does), you can't
> easily tell when test changes behavior (starts failing differently).
> 
> 4) We can't easily tell if a test's -expected output is actually
> believed to be correct or not, which makes figuring out whether to
> rebaseline or not difficult, and makes figuring out if your change
> that is causing a test to "fail" is actually introducing a new
> problem, just failing differently, or even potentially fixing
> something.
> 
> I agree that it's unclear how much churn there will be and how much
> benefit there will be, but we've been talking about these problems for
> years and I think without actually trying *something* we won't know if
> there's a better way to solve this or not, and I personally think it's
> worth trying. The solution I've described is the least intrusive
> mechanism we can try that I've yet come up with.
> 
> -- Dirk
> 
> On Wed, Aug 15, 2012 at 12:22 PM, Dirk Pranke <dpranke <at> chromium.org> wrote:
>> Hi all,
>> 
>> As many of you know, we normally treat the -expected files as
>> "regression" tests rather than "correctness" tests; they are intended
>> to capture the current behavior of the tree. As such, they
>> historically have not distinguished between a "correct failure" and an
>> "incorrect failure".
>> 
>> The chromium port, however, has historically often not checked in
>> expectations for tests that are currently failing (or even have been
>> failing for a long time), and instead listed them in the expectations
>> file. This was primarily motivated by us wanting to know easily all of
>> the tests that were "failing". However, this approach has its own
>> downsides.
>> 
>> I would like to move the project to a point where all of the ports
>> were basically using the same workflow/model, and combine the best
>> features of each approach [1].
>> 
>> To that end, I propose that we allow tests to have expectations that
>> end in '-passing' and '-failing' as well as '-expected'.
>> 
>> The meanings for '-passing' and '-failing' should be obvious, and
>> '-expected' can continue the current meaning of either or both of
>> "what we expect to happen" and "I don't know if this is correct or
>> not" :).
>> 
>> A given test will be allowed to only have one of the three potential
>> results at any one time/revision in a checkout. [2]
>> 
>> Because '-expected' will still be supported, this means that ports can
>> continue to work as they do today and we can try -passing/-failing on
>> a piecemeal basis to see if it's useful or not.
>> 
>> Ideally we will have some way (via a presubmit hook, or lint checks,
>> or something) to be able to generate a (checked-in) list (or perhaps
>> just a dashboard or web page) of all of the currently failing tests
>> and corresponding bugs from the "-failing" expectations, so that we
>> can keep one of the advantages that chromium has gotten out of their
>> TestExpectations files [3].
>> 
>> I will update all of the tools (run-webkit-tests, garden-o-matic,
>> flakiness dashboard, etc.) as needed to make managing these things as
>> easy as possible. [4]
>> 
>> Thoughts? I'm definitely open to suggestions/variants/other ideas/etc.
>> 
>> -- Dirk
>> 
>> Notes:
>> 
>> [1] Both the "check in the failures" and the "suppress the failures"
>> approaches have advantages and disadvantages:
>> 
>> Both approaches have their advantages and disadvantages:
>> 
>> Advantages for checking in failures:
>> 
>> * you can tell when a test starts failing differently
>> * the need to distinguish between different types of failures (text
>> vs. image vs. image+text) in the expectations file drops; the baseline
>> tells you what to expect
>> * the TestExpectations file can be much smaller and easier to manage -
>> the current Chromium file is a massively unruly mess
>> * the history of a particular test can be found by looking at the repo history
>> 
>> Disadvantages for checking in failures (advantages for just suppressing them):
>> * given current practice (just using -expected) you can't tell if a
>> particular -expected file is supposed to be be correct or not
>> * it may create more churn in the checked-in baselines in the repo
>> * you can't get a list of all of the failing tests associated with a
>> particular bug as easily
>> 
>> There are probably lots of ways one could attempt to design a solution
>> to these problems; I believe the approach outlined above is perhaps
>> the simplest possible and allows for us to try it in small parts of
>> the test suite (and only on some ports) to see if it's useful or not
>> without forcing it on everyone.
>> 
>> [2] I considered allowing "-passing" and "-failing" to co-exist, but a
>> risk is that the "passing" or correct result for a test will change,
>> and if a test is currently expected to fail, we won't notice that that
>> port's "passing" result becomes stale. In addition, managing the
>> transitions across multiple files becomes a lot more
>> complicated/tricky. There are tradeoffs here, and it's possible some
>> other logic will work better in practice, but I thought it might be
>> best to start simple.
>> 
>> [3] I haven't figured out the best way to do this yet, or whether it's
>> better to keep this list inside the TestExpectations files or in
>> separate files, or just have a dashboard separate from the repository
>> completely ...
>> 
>> [4] rough sketch of the work needed to be done here:
>> * update run-webkit-tests to look for all three suffixes, and to
>> generate new -failing -failing and/or -passing results as well as new
>> -expected results
>> * update garden-o-matic so that when you want to rebaseline a file you
>> can (optionally) indicate whether the new baseline is passing or
>> failing
>> * update webkit-patch rebaseline-expectations to (optionally) indicate
>> if the new baselines are passing or failing
>> * pass the information from run-webkit-tests to the flakiness
>> dashboard (via results.json) to indicate whether we matched against
>> -expected/-passing/-failing so that the dashboard can display the
>> right baselines for comparison.
>> * figure out the solution to [3] :).
> _______________________________________________
> webkit-dev mailing list
> webkit-dev <at> lists.webkit.org
> http://lists.webkit.org/mailman/listinfo/webkit-dev
Ojan Vafai | 15 Aug 2012 22:46

Re: A proposal for handling "failing" layout tests and TestExpectations

I think this is the best compromise between the benefits of the historically chromium approach (i.e. add to TestExpectations) and the historically non-chromium approach (either skip the test or check in a failing result, usually the latter). The only thing Dirk's proposal changes for these ports (i.e. all but Chromium) is that an expectation file's suffix can be -passing or -failing in addition to the current -expected.


It makes for a clear list of broken tests that need fixing without compromising the test suites effectiveness at detecting new regressions.

I agree that we just need to try it and see how it turns out. For non-chromium ports, it will be easy to switch back if we decide it's too much overhead and Chromium will just need to come up with an alternate way of keeping track of the list of tests that are failing.

On Wed, Aug 15, 2012 at 1:36 PM, Michael Saboff <msaboff <at> apple.com> wrote:
It seems to me that there are two issues here.  One is Chromium specific about process conformity.  It seems to me that should stay a Chromium issue without making the mechanism more complex for all ports.  The other ports seem to make things work using the existing framework.

The other broader issue is failing tests.  If I understand part of Filip's concern it is a signal to noise issue.  We do not want the pass / fail signal to be lost in the noise of expected failures.  Failing tests should be fixed as appropriate for failing platform(s).  That fixing might involve splitting off or removing a failing sub-test so that the remaining test adds value once again.  Especially "a pass becoming a fail" edge.  For me, a test failing differently provides unknown value as the noise of it being a failing test likely exceeds the benefit of the different failure mode signal.  It takes a non-zero effort to filter that noise and that effort is likely better spent fixing the test.

Historically, all WebKit ports other than Chromium (especially the Apple port) have taken exactly this approach. What you're arguing against is the long-standing approach WebKit has taken to testing. I don't think discussing this broader issue is useful since noone is asking the non-Chromium ports to change this aspect. This has been discussed to death on webkit-dev, with many people (mostly at Apple) being strong proponents of checking in failing expectations.

This just brings the Chromium port inline with other ports without losing the benefit the Chromium approach currently has of keeping a list of failing tests that need fixing.
 

- Michael

On Aug 15, 2012, at 12:48 PM, Dirk Pranke <dpranke <at> chromium.org> wrote:

> I've received at least one bit of feedback off-list that it might not
> have been clear what problems I was trying to solve and whether the
> solution would add enough benefit to be worth it. Let me try to
> restate things, in case this helps ...
>
> The problems I'm attempting to solve are:
>
> 1) Chromium does things differently from the other WebKit ports, and
> this seems bad :).
>
> 2) the TestExpectations syntax is too complicated
>
> 3) When you suppress or skip failures (as chromium does), you can't
> easily tell when test changes behavior (starts failing differently).
>
> 4) We can't easily tell if a test's -expected output is actually
> believed to be correct or not, which makes figuring out whether to
> rebaseline or not difficult, and makes figuring out if your change
> that is causing a test to "fail" is actually introducing a new
> problem, just failing differently, or even potentially fixing
> something.
>
> I agree that it's unclear how much churn there will be and how much
> benefit there will be, but we've been talking about these problems for
> years and I think without actually trying *something* we won't know if
> there's a better way to solve this or not, and I personally think it's
> worth trying. The solution I've described is the least intrusive
> mechanism we can try that I've yet come up with.
>
> -- Dirk
>
> On Wed, Aug 15, 2012 at 12:22 PM, Dirk Pranke <dpranke <at> chromium.org> wrote:
>> Hi all,
>>
>> As many of you know, we normally treat the -expected files as
>> "regression" tests rather than "correctness" tests; they are intended
>> to capture the current behavior of the tree. As such, they
>> historically have not distinguished between a "correct failure" and an
>> "incorrect failure".
>>
>> The chromium port, however, has historically often not checked in
>> expectations for tests that are currently failing (or even have been
>> failing for a long time), and instead listed them in the expectations
>> file. This was primarily motivated by us wanting to know easily all of
>> the tests that were "failing". However, this approach has its own
>> downsides.
>>
>> I would like to move the project to a point where all of the ports
>> were basically using the same workflow/model, and combine the best
>> features of each approach [1].
>>
>> To that end, I propose that we allow tests to have expectations that
>> end in '-passing' and '-failing' as well as '-expected'.
>>
>> The meanings for '-passing' and '-failing' should be obvious, and
>> '-expected' can continue the current meaning of either or both of
>> "what we expect to happen" and "I don't know if this is correct or
>> not" :).
>>
>> A given test will be allowed to only have one of the three potential
>> results at any one time/revision in a checkout. [2]
>>
>> Because '-expected' will still be supported, this means that ports can
>> continue to work as they do today and we can try -passing/-failing on
>> a piecemeal basis to see if it's useful or not.
>>
>> Ideally we will have some way (via a presubmit hook, or lint checks,
>> or something) to be able to generate a (checked-in) list (or perhaps
>> just a dashboard or web page) of all of the currently failing tests
>> and corresponding bugs from the "-failing" expectations, so that we
>> can keep one of the advantages that chromium has gotten out of their
>> TestExpectations files [3].
>>
>> I will update all of the tools (run-webkit-tests, garden-o-matic,
>> flakiness dashboard, etc.) as needed to make managing these things as
>> easy as possible. [4]
>>
>> Thoughts? I'm definitely open to suggestions/variants/other ideas/etc.
>>
>> -- Dirk
>>
>> Notes:
>>
>> [1] Both the "check in the failures" and the "suppress the failures"
>> approaches have advantages and disadvantages:
>>
>> Both approaches have their advantages and disadvantages:
>>
>> Advantages for checking in failures:
>>
>> * you can tell when a test starts failing differently
>> * the need to distinguish between different types of failures (text
>> vs. image vs. image+text) in the expectations file drops; the baseline
>> tells you what to expect
>> * the TestExpectations file can be much smaller and easier to manage -
>> the current Chromium file is a massively unruly mess
>> * the history of a particular test can be found by looking at the repo history
>>
>> Disadvantages for checking in failures (advantages for just suppressing them):
>> * given current practice (just using -expected) you can't tell if a
>> particular -expected file is supposed to be be correct or not
>> * it may create more churn in the checked-in baselines in the repo
>> * you can't get a list of all of the failing tests associated with a
>> particular bug as easily
>>
>> There are probably lots of ways one could attempt to design a solution
>> to these problems; I believe the approach outlined above is perhaps
>> the simplest possible and allows for us to try it in small parts of
>> the test suite (and only on some ports) to see if it's useful or not
>> without forcing it on everyone.
>>
>> [2] I considered allowing "-passing" and "-failing" to co-exist, but a
>> risk is that the "passing" or correct result for a test will change,
>> and if a test is currently expected to fail, we won't notice that that
>> port's "passing" result becomes stale. In addition, managing the
>> transitions across multiple files becomes a lot more
>> complicated/tricky. There are tradeoffs here, and it's possible some
>> other logic will work better in practice, but I thought it might be
>> best to start simple.
>>
>> [3] I haven't figured out the best way to do this yet, or whether it's
>> better to keep this list inside the TestExpectations files or in
>> separate files, or just have a dashboard separate from the repository
>> completely ...
>>
>> [4] rough sketch of the work needed to be done here:
>> * update run-webkit-tests to look for all three suffixes, and to
>> generate new -failing -failing and/or -passing results as well as new
>> -expected results
>> * update garden-o-matic so that when you want to rebaseline a file you
>> can (optionally) indicate whether the new baseline is passing or
>> failing
>> * update webkit-patch rebaseline-expectations to (optionally) indicate
>> if the new baselines are passing or failing
>> * pass the information from run-webkit-tests to the flakiness
>> dashboard (via results.json) to indicate whether we matched against
>> -expected/-passing/-failing so that the dashboard can display the
>> right baselines for comparison.
>> * figure out the solution to [3] :).
> _______________________________________________
> webkit-dev mailing list
> webkit-dev <at> lists.webkit.org
> http://lists.webkit.org/mailman/listinfo/webkit-dev

_______________________________________________
webkit-dev mailing list
webkit-dev <at> lists.webkit.org
http://lists.webkit.org/mailman/listinfo/webkit-dev

_______________________________________________
webkit-dev mailing list
webkit-dev <at> lists.webkit.org
http://lists.webkit.org/mailman/listinfo/webkit-dev
Dirk Pranke | 15 Aug 2012 23:25

Re: A proposal for handling "failing" layout tests and TestExpectations

On Wed, Aug 15, 2012 at 1:36 PM, Michael Saboff <msaboff <at> apple.com> wrote:
> It seems to me that there are two issues here.  One is Chromium specific about process conformity.  It seems
to me that should stay a Chromium issue without making the mechanism more complex for all ports.  The other
ports seem to make things work using the existing framework.
>

I'm definitely attempting to minimize the impact on ports (and people)
that are happy with things the way they are. One thing we could do to
help with this would be to require that any '-passing'/'-failing'
tests be limited to the platform/ directories, so that way a given
port could opt-in to this additional complexity (and we could replace
one Chromium-specific process with another one without affecting the
other ports).

> The other broader issue is failing tests.  If I understand part of Filip's concern it is a signal to noise
issue.  We do not want the pass / fail signal to be lost in the noise of expected failures.  Failing tests
should be fixed as appropriate for failing platform(s).  That fixing might involve splitting off or
removing a failing sub-test so that the remaining test adds value once again.  Especially "a pass becoming
a fail" edge.  For me, a test failing differently provides unknown value as the noise of it being a failing
test likely exceeds the benefit of the different failure mode signal.  It takes a non-zero effort to filter
that noise and that effort is likely better spent fixing the test.

If I understand you correctly, you're suggesting that if a test starts
to fail, you don't care how it is failing, and simply knowing that it
is failing is enough? And if a test doesn't meet that criteria, it
should be split into multiple tests such that each sub-test does? I
think that's a nice goal, but we're nowhere close to that in practice.

In particular, this isn't true of many pixel tests at all. Minor
changes in text rendering can cause a whole bunch of tests to fail,
and yet we still need to run those tests to look for real regressions.
Ideally we can replace some or most of these tests with text-only
tests and reftests, but we're a long way away from that, too.

-- Dirk
Ryosuke Niwa | 16 Aug 2012 02:19
Favicon
Gravatar

Re: A proposal for handling "failing" layout tests and TestExpectations

I have a concern that a lot of people wouldn't know what the "correct" output is for a given test.


For a lot of pixel tests, deciding whether a given output is correct or not is really hard. e.g. some seemingly insignificant anti-alias different may turn out be a result of a bug in Skia and other graphics library or WebCore code that uses it.

As a result,
  • people may check in wrong "correct" results
  • people may add "failing" results even though new results are more or less correct

This leads me to think that just checking in the current output as -expected.png and filing bugs separately is a better approach.

- Ryosuke

_______________________________________________
webkit-dev mailing list
webkit-dev <at> lists.webkit.org
http://lists.webkit.org/mailman/listinfo/webkit-dev
Dirk Pranke | 16 Aug 2012 23:05

Re: A proposal for handling "failing" layout tests and TestExpectations

On Wed, Aug 15, 2012 at 5:19 PM, Ryosuke Niwa <rniwa <at> webkit.org> wrote:
> I have a concern that a lot of people wouldn't know what the "correct"
> output is for a given test.
>
> For a lot of pixel tests, deciding whether a given output is correct or not
> is really hard. e.g. some seemingly insignificant anti-alias different may
> turn out be a result of a bug in Skia and other graphics library or WebCore
> code that uses it.
>
> As a result,
>
> people may check in wrong "correct" results
> people may add "failing" results even though new results are more or less
> correct
>
>
> This leads me to think that just checking in the current output as
> -expected.png and filing bugs separately is a better approach.
>

I think your observations are correct, but at least my experience as a
gardener/sheriff leads me to a different conclusion. Namely, when I'm
looking at a newly failing test, it is difficult if not impossible for
me to know if the existing baseline was previously believed to be
correct or not, and thus it's hard for me to tell if the new baseline
should be considered worse, better, or different. In theory I could go
look at the changelog for each test, but I would be skeptical if that
had enough useful information (I would expect most comments to be
along the lines of "rebaselining after X" with no indication if the
output is correct or not. This is just a theory, though.

This is why I want to test this theory :). It seems like if we got
experience with this on one (or more) ports for a couple of months we
would have a much more well-informed opinion, and I'm not seeing a
huge downside to at least trying this idea out.

-- Dirk
Ryosuke Niwa | 16 Aug 2012 23:23
Favicon
Gravatar

Re: A proposal for handling "failing" layout tests and TestExpectations

On Thu, Aug 16, 2012 at 2:05 PM, Dirk Pranke <dpranke <at> chromium.org> wrote:

I think your observations are correct, but at least my experience as a
gardener/sheriff leads me to a different conclusion. Namely, when I'm
looking at a newly failing test, it is difficult if not impossible for
me to know if the existing baseline was previously believed to be
correct or not, and thus it's hard for me to tell if the new baseline
should be considered worse, better, or different.
 
How does the proposal solve this problem? Right now gardeners have two options:
  • Rebaseline
  • Add a test expectation
Once we implemented the proposal, we have at least three options:
  • Rebaseline correct.png
  • Add/rebaseline expected.png
  • Add/rebaseline failure.png
  • (Optionally) Add a test expectation.
And that's a lot of options to choose from. The more options we have, the more likely people make mistakes. We're already inconsistent in how -expected.png is used because some people make mistakes. I'm afraid that adding another set of expected results result in even more mistakes and a unrecoverable mess.

This is why I want to test this theory :). It seems like if we got
experience with this on one (or more) ports for a couple of months we
would have a much more well-informed opinion, and I'm not seeing a
huge downside to at least trying this idea out.

Sure. But if we're doing this experiment on trunk, thereby imposing significant cognitive load on every other port, then I'd like to see us setting both an exact date at which we decide whether this approach is good or not and criteria by which we decide this before the experiment starts.

Like Filip, I'm extremely concerned about the prospect of us introducing yet-another-way-of-doing-things, and not be able to get rid of it later.

- Ryosuke

_______________________________________________
webkit-dev mailing list
webkit-dev <at> lists.webkit.org
http://lists.webkit.org/mailman/listinfo/webkit-dev
Dirk Pranke | 17 Aug 2012 00:04

Re: A proposal for handling "failing" layout tests and TestExpectations

On Thu, Aug 16, 2012 at 2:23 PM, Ryosuke Niwa <rniwa <at> webkit.org> wrote:
> On Thu, Aug 16, 2012 at 2:05 PM, Dirk Pranke <dpranke <at> chromium.org> wrote:
>>
>> I think your observations are correct, but at least my experience as a
>> gardener/sheriff leads me to a different conclusion. Namely, when I'm
>> looking at a newly failing test, it is difficult if not impossible for
>> me to know if the existing baseline was previously believed to be
>> correct or not, and thus it's hard for me to tell if the new baseline
>> should be considered worse, better, or different.
>
>
> How does the proposal solve this problem?

It doesn't solve the problem per se; sorry, I didn't mean to imply
that it did. What my proposal does do is give the sheriff more
information with which to decide what to do, namely:

1) the test was passing, but now it isn't. Is that okay? Do I need to
rebaseline, or should we revert?
2) the test was failing; now it's either failing differently, or it's
maybe passing ...
3) the test was -expected (status quo).

Granted, in all three cases the sheriff has to make a judgement call,
but my theory/claim is that this is easier than just (3).

In addition, we gain an entirely new feature (for the non-Chromium
ports): the ability to tell if a test is believed to be failing or
not.

> The more options we have, the more likely people make mistakes.

I don't think this is necessarily true; it's possible that having more
options will actually make it easier to do the right thing.

>> This is why I want to test this theory :). It seems like if we got
>> experience with this on one (or more) ports for a couple of months we
>> would have a much more well-informed opinion, and I'm not seeing a
>> huge downside to at least trying this idea out.
>
>
> Sure. But if we're doing this experiment on trunk, thereby imposing
> significant cognitive load on every other port

How does this impose a cognitive load (of any size) on every other
port? If the files are only checked in to platform/X, every other
platform doesn't even need to know about them ...

> setting both an exact date at which we decide whether this approach is good
> or not and criteria by which we decide this before the experiment starts.

I like the theory of this but I think determining the exact criteria
will be difficult if not impossible. I'm not sure how we would
evaluate this other than "contributors are happier" (or not).

> Like Filip, I'm extremely concerned about the prospect of us introducing
> yet-another-way-of-doing-things, and not be able to get rid of it later.

Presumably the only way we'd be not able to get rid of it would be if
some port actually liked it, in which case, why would we get rid of
it?

On Thu, Aug 16, 2012 at 2:32 PM, Filip Pizlo <fpizlo <at> apple.com> wrote:
> In what way do things become better?  Because other people will see what the sheriff believed about the result?

Yes, that's the theory; see above.

> Can you articulate some more about what happens when you have both -expected and -failing?
>
> My specific concern is that after someone checks in a fix, we will have some sheriff accidentally misjudge
the change in behavior to be a regression, and check in a -failing file.  And then we end up in a world of confusion.
>

Ah! This is important ... you *can't have this scenario*. The tools
will prohibit you (via style checks or something) from having more
than one kind of result checked in. I absolutely agree that this would
be confusing, which is why I don't want to allow it (I originally
started out thinking we'd support both -correct and -failing, and
realized it made life horrible)

(I'm going to skip the flaky test discussion ... I think it's
tangential and it's clearly a debatable topic in its own right; I only
brought it up because it still shows a purpose for TestExpectations
entries).

> 2) The WONTFIX mode in TestExpectations feels to me more like a statement that you're just trying to see if
the test doesn't crash.  Correct?  Either way, it's confusing.

No, sorry, in the new world (new TestExpectations syntax) WONTFIX
files will be skipped automatically. They are like SKIP except that
they carry the connotation that there are no plans to unskip them.

> 3) Your new mechanism feels like it's already covered by our existing use of -expected files.  I'm not quite
convinced that having -failing in addition to -expected files would be all that helpful.

Fair enough, I realize it's not an open-and-shut case. I'm not 100%
convinced it'll be useful myself, but I think it's worth trying.

> (3) concerns me the most, and it concerns me particularly because we're still not giving good answers to
(1) or (2).

You lost me here. I plan to do (1) and (2) regardless of whether we
start using '-passing/-failing' or not. How is (3) related to (1) and
(2)?

-- Dirk

-- Dirk
Ryosuke Niwa | 17 Aug 2012 00:15
Favicon
Gravatar

Re: A proposal for handling "failing" layout tests and TestExpectations

On Thu, Aug 16, 2012 at 3:04 PM, Dirk Pranke <dpranke <at> chromium.org> wrote:

On Thu, Aug 16, 2012 at 2:23 PM, Ryosuke Niwa <rniwa <at> webkit.org> wrote:
> Like Filip, I'm extremely concerned about the prospect of us introducing
> yet-another-way-of-doing-things, and not be able to get rid of it later.

Presumably the only way we'd be not able to get rid of it would be if
some port actually liked it, in which case, why would we get rid of it?

Because we already have too many ways to deal with test failures.

Why don't we at least merge Skipped and TestExpectations first.

- Ryosuke

_______________________________________________
webkit-dev mailing list
webkit-dev <at> lists.webkit.org
http://lists.webkit.org/mailman/listinfo/webkit-dev
Stephen Chenney | 17 Aug 2012 00:50

Re: A proposal for handling "failing" layout tests and TestExpectations

On Thu, Aug 16, 2012 at 6:15 PM, Ryosuke Niwa <rniwa <at> webkit.org> wrote:
On Thu, Aug 16, 2012 at 3:04 PM, Dirk Pranke <dpranke <at> chromium.org> wrote:
On Thu, Aug 16, 2012 at 2:23 PM, Ryosuke Niwa <rniwa <at> webkit.org> wrote:
> Like Filip, I'm extremely concerned about the prospect of us introducing
> yet-another-way-of-doing-things, and not be able to get rid of it later.

Presumably the only way we'd be not able to get rid of it would be if
some port actually liked it, in which case, why would we get rid of it?

Because we already have too many ways to deal with test failures.

Why don't we at least merge Skipped and TestExpectations first.


I agree with the priorities above, at least. I also agree with the overall goal of making our implementation match our philosophy on testing.

Ryosuke has raised a very valid point: it is not clear what a Gardener should do with a test that has changed behavior. Personally, the area I deal with is very susceptible to the "single pixel differences matter" issue, and I would prefer gardeners to flag updates in some way so that I can look at it myself (or one of the very few other experts can look at it). Maybe I'm just a control freak.

In the current situation, gardeners are not likely to roll out a patch short of build failures, obvious test or perf regressions, or security issues. For the bulk of cases where a test result has changed, the gardener will either add it to expectations or go ahead and update the result.

The result is a mixture of incorrect expected results and files listed in TestExpectations that may or may not be correct. The way I deal with this right now is to maintain a snapshot of TestExpectations and occasionally go through and look for newly added results, and check if they are in fact correct. And every now and then go look at older entries to see if they have been updated. Or I get lucky and notice that a new expected result has been checked in that is incorrect (I can mostly catch that by following checkins in the code).

The proposed change does not alter this fundamental dynamic. Depending on what policy gardeners adopt, they might now rename the result as failing and remove the expected, or maybe they'll just update the expected. I'm still in a situation where I don't know the exact status of any given result, and where I have no easy way of knowing if a gardener has correctly updated an expected or failing image.

Two questions come out of this:
1) What policy do we want gardeners to follow? Giving "one more option" is not helpful as far as I can tell. The new method does nothing that TestExpectations could not already do, so I suppose the goal is to not put known failing tests in expectations.
2) How will it be simpler to know if something is really failing, really correct, or whether a non-expert gardener made an understandable mistake?

Personally, I don't much care if we go to failing and remove all just-plain-failing tests from TestExpectations provided gardeners know what to do. What I really want is "unconfirmed" or some way for a gardener to say "I changed a result to turn the bots green, but I don't know what I'm doing and need an expert".

On top of that we would need a nagging system to make experts look at unconfirmed results. We already have such a nagging system, thanks to Ojan.

My goal would also be met by the ability to easily track all expected results changes and TestExpectation updates for a subset of tests.

Cheers,
Stephen.
_______________________________________________
webkit-dev mailing list
webkit-dev <at> lists.webkit.org
http://lists.webkit.org/mailman/listinfo/webkit-dev
Dirk Pranke | 17 Aug 2012 02:18

Re: A proposal for handling "failing" layout tests and TestExpectations

On Thu, Aug 16, 2012 at 3:50 PM, Stephen Chenney <schenney <at> chromium.org> wrote:
> On Thu, Aug 16, 2012 at 6:15 PM, Ryosuke Niwa <rniwa <at> webkit.org> wrote:
>>
>> On Thu, Aug 16, 2012 at 3:04 PM, Dirk Pranke <dpranke <at> chromium.org> wrote:
>>>
>>> On Thu, Aug 16, 2012 at 2:23 PM, Ryosuke Niwa <rniwa <at> webkit.org> wrote:
>>> > Like Filip, I'm extremely concerned about the prospect of us
>>> > introducing
>>> > yet-another-way-of-doing-things, and not be able to get rid of it
>>> > later.
>>>
>>> Presumably the only way we'd be not able to get rid of it would be if
>>> some port actually liked it, in which case, why would we get rid of it?
>>
>>
>> Because we already have too many ways to deal with test failures.
>>
>> Why don't we at least merge Skipped and TestExpectations first.
>>
>
> I agree with the priorities above, at least. I also agree with the overall
> goal of making our implementation match our philosophy on testing.
>
> Ryosuke has raised a very valid point: it is not clear what a Gardener
> should do with a test that has changed behavior. Personally, the area I deal
> with is very susceptible to the "single pixel differences matter" issue, and
> I would prefer gardeners to flag updates in some way so that I can look at
> it myself (or one of the very few other experts can look at it). Maybe I'm
> just a control freak.
>
> In the current situation, gardeners are not likely to roll out a patch short
> of build failures, obvious test or perf regressions, or security issues. For
> the bulk of cases where a test result has changed, the gardener will either
> add it to expectations or go ahead and update the result.
>
> The result is a mixture of incorrect expected results and files listed in
> TestExpectations that may or may not be correct. The way I deal with this
> right now is to maintain a snapshot of TestExpectations and occasionally go
> through and look for newly added results, and check if they are in fact
> correct. And every now and then go look at older entries to see if they have
> been updated. Or I get lucky and notice that a new expected result has been
> checked in that is incorrect (I can mostly catch that by following checkins
> in the code).
>
> The proposed change does not alter this fundamental dynamic. Depending on
> what policy gardeners adopt, they might now rename the result as failing and
> remove the expected, or maybe they'll just update the expected. I'm still in
> a situation where I don't know the exact status of any given result, and
> where I have no easy way of knowing if a gardener has correctly updated an
> expected or failing image.

So, if we had -passing/-failing, then people who knew what the results
were supposed to be in a given directory (call them "owners") could
rename the existing -expected files into one category or another, and
then a gardener could add newly-failing tests as -expected; it would
then be trivial for the owner to look for new -expected files in that
directory.

Right now, an owner would have to periodically scan the directory for
files changed since the last time they scanned the directory, and
would have no way of knowing whether an updated file was "correct" or
not (perhaps you could filter out by who committed the change, but
it's certainly harder than ls *-expected).

So, I would contend that my proposal would make it easier for us to
have a process by which gardeners could punt on changes they're unsure
about and for owners to subsequently review them. I think you could
have a similar model if we just checked in *all* new results into
-expected, but it would be harder to implement (perhaps not
impossible, though, I haven't given it much thought yet).

On the other hand, if you follow the chromium model of just
suppressing the file, then we have no idea if the failure you
currently see has any relationship to the failure that was originally
seen, or if the file that is currently checked in has any relationship
to what "correct" might now need to be (we know that the file was
correct at some point but is now stale). So, in the chromium model, at
least, you can do a grep through TestExpectations to look for failing
tests, but you don't have the history you would get by looking at a
history of changes to -expected/-passing/-failing.

In a world where all failures are triaged promptly by qualified
owners, all of this mechanism is unnecessary. Unfortunately, we don't
live in that world at the moment.

-- Dirk
>
> Two questions come out of this:
> 1) What policy do we want gardeners to follow? Giving "one more option" is
> not helpful as far as I can tell. The new method does nothing that
> TestExpectations could not already do, so I suppose the goal is to not put
> known failing tests in expectations.

I believe it does do something TestExpectations cannot do, see above,
and yes, we would then not put known failing tests in
TestExpectations.

> 2) How will it be simpler to know if something is really failing, really
> correct, or whether a non-expert gardener made an understandable mistake?
>

This part I'm not sure about but I believe that it would be easier to
guess whether the new result is newly failing, potentially failing
differently, or potentially passing, if we know how the prior result
was categorized, assuming it was categorized correctly. Perhaps this
implies a much stronger level of saying that you shouldn't categorize
something as -passing or -failing unless you're *sure* about what
you're doing.

I think I have already addressed your remaining comments ...

-- Dirk
Ryosuke Niwa | 17 Aug 2012 02:41
Favicon
Gravatar

Re: A proposal for handling "failing" layout tests and TestExpectations

  1. On Thu, Aug 16, 2012 at 5:18 PM, Dirk Pranke <dpranke <at> chromium.org> wrote:
On Thu, Aug 16, 2012 at 3:50 PM, Stephen Chenney <schenney <at> chromium.org> wrote:
> I agree with the priorities above, at least. I also agree with the overall
> goal of making our implementation match our philosophy on testing.
>
> Ryosuke has raised a very valid point: it is not clear what a Gardener
> should do with a test that has changed behavior. Personally, the area I deal
> with is very susceptible to the "single pixel differences matter" issue, and
> I would prefer gardeners to flag updates in some way so that I can look at
> it myself (or one of the very few other experts can look at it). Maybe I'm
> just a control freak.
>
> In the current situation, gardeners are not likely to roll out a patch short
> of build failures, obvious test or perf regressions, or security issues. For
> the bulk of cases where a test result has changed, the gardener will either
> add it to expectations or go ahead and update the result.
>
> The result is a mixture of incorrect expected results and files listed in
> TestExpectations that may or may not be correct. The way I deal with this
> right now is to maintain a snapshot of TestExpectations and occasionally go
> through and look for newly added results, and check if they are in fact
> correct. And every now and then go look at older entries to see if they have
> been updated. Or I get lucky and notice that a new expected result has been
> checked in that is incorrect (I can mostly catch that by following checkins
> in the code).
>
> The proposed change does not alter this fundamental dynamic. Depending on
> what policy gardeners adopt, they might now rename the result as failing and
> remove the expected, or maybe they'll just update the expected. I'm still in
> a situation where I don't know the exact status of any given result, and
> where I have no easy way of knowing if a gardener has correctly updated an
> expected or failing image.

So, if we had -passing/-failing, then people who knew what the results
were supposed to be in a given directory (call them "owners") could
rename the existing -expected files into one category or another, and
then a gardener could add newly-failing tests as -expected; it would
then be trivial for the owner to look for new -expected files in that
directory.

Right now, an owner would have to periodically scan the directory for
files changed since the last time they scanned the directory, and
would have no way of knowing whether an updated file was "correct" or
not (perhaps you could filter out by who committed the change, but
it's certainly harder than ls *-expected).

So, I would contend that my proposal would make it easier for us to
have a process by which gardeners could punt on changes they're unsure
about and for owners to subsequently review them. I think you could
have a similar model if we just checked in *all* new results into
-expected, but it would be harder to implement (perhaps not
impossible, though, I haven't given it much thought yet).

That seems like a much better model. Right now, one of the biggest problem is that people rebaseline tests or add test expectations without understanding what the correct output is. It happens for almost all ports.

Having -correct.png doesn't necessarily help that because even a seemingly innocent one-pixel difference could be a bug (a patch may only add a baseline to one platform), and it's really hard to tell what the expected result is unless you have worked on the rendering code for a long time.

On the other hand, if you follow the chromium model of just
suppressing the file, then we have no idea if the failure you
currently see has any relationship to the failure that was originally
seen, or if the file that is currently checked in has any relationship
to what "correct" might now need to be (we know that the file was
correct at some point but is now stale). So, in the chromium model, at
least, you can do a grep through TestExpectations to look for failing
tests, but you don't have the history you would get by looking at a
history of changes to -expected/-passing/-failing.

In a world where all failures are triaged promptly by qualified
owners, all of this mechanism is unnecessary. Unfortunately, we don't
live in that world at the moment.

I'd argue that this is a bigger problem worth solving. Differentiating expected failure and correct results isn't going to help if gardeners were to keep making mistakes.

- Ryosuke

_______________________________________________
webkit-dev mailing list
webkit-dev <at> lists.webkit.org
http://lists.webkit.org/mailman/listinfo/webkit-dev
Dirk Pranke | 17 Aug 2012 03:11

Re: A proposal for handling "failing" layout tests and TestExpectations

On Thu, Aug 16, 2012 at 5:41 PM, Ryosuke Niwa <rniwa <at> webkit.org> wrote:
> On Thu, Aug 16, 2012 at 5:18 PM, Dirk Pranke <dpranke <at> chromium.org> wrote:
>>
>> On Thu, Aug 16, 2012 at 3:50 PM, Stephen Chenney <schenney <at> chromium.org>
>> wrote:
>> > I agree with the priorities above, at least. I also agree with the
>> > overall
>> > goal of making our implementation match our philosophy on testing.
>> >
>> > Ryosuke has raised a very valid point: it is not clear what a Gardener
>> > should do with a test that has changed behavior. Personally, the area I
>> > deal
>> > with is very susceptible to the "single pixel differences matter" issue,
>> > and
>> > I would prefer gardeners to flag updates in some way so that I can look
>> > at
>> > it myself (or one of the very few other experts can look at it). Maybe
>> > I'm
>> > just a control freak.
>> >
>> > In the current situation, gardeners are not likely to roll out a patch
>> > short
>> > of build failures, obvious test or perf regressions, or security issues.
>> > For
>> > the bulk of cases where a test result has changed, the gardener will
>> > either
>> > add it to expectations or go ahead and update the result.
>> >
>> > The result is a mixture of incorrect expected results and files listed
>> > in
>> > TestExpectations that may or may not be correct. The way I deal with
>> > this
>> > right now is to maintain a snapshot of TestExpectations and occasionally
>> > go
>> > through and look for newly added results, and check if they are in fact
>> > correct. And every now and then go look at older entries to see if they
>> > have
>> > been updated. Or I get lucky and notice that a new expected result has
>> > been
>> > checked in that is incorrect (I can mostly catch that by following
>> > checkins
>> > in the code).
>> >
>> > The proposed change does not alter this fundamental dynamic. Depending
>> > on
>> > what policy gardeners adopt, they might now rename the result as failing
>> > and
>> > remove the expected, or maybe they'll just update the expected. I'm
>> > still in
>> > a situation where I don't know the exact status of any given result, and
>> > where I have no easy way of knowing if a gardener has correctly updated
>> > an
>> > expected or failing image.
>>
>> So, if we had -passing/-failing, then people who knew what the results
>> were supposed to be in a given directory (call them "owners") could
>> rename the existing -expected files into one category or another, and
>> then a gardener could add newly-failing tests as -expected; it would
>> then be trivial for the owner to look for new -expected files in that
>> directory.
>>
>> Right now, an owner would have to periodically scan the directory for
>> files changed since the last time they scanned the directory, and
>> would have no way of knowing whether an updated file was "correct" or
>> not (perhaps you could filter out by who committed the change, but
>> it's certainly harder than ls *-expected).
>>
>> So, I would contend that my proposal would make it easier for us to
>> have a process by which gardeners could punt on changes they're unsure
>> about and for owners to subsequently review them. I think you could
>> have a similar model if we just checked in *all* new results into
>> -expected, but it would be harder to implement (perhaps not
>> impossible, though, I haven't given it much thought yet).
>
>
> That seems like a much better model.

To clarify: what does "That" refer to? My idea for gardeners to check
in "-expected" and owners to move to "-passing/failing", or the idea
that we check in "new results into -expected" and have some other way
of reviewing new baselines? (or something else)?

> Right now, one of the biggest problem
> is that people rebaseline tests or add test expectations without
> understanding what the correct output is. It happens for almost all ports.
>
> Having -correct.png doesn't necessarily help that because even a seemingly
> innocent one-pixel difference could be a bug (a patch may only add a
> baseline to one platform), and it's really hard to tell what the expected
> result is unless you have worked on the rendering code for a long time.
>

Yes. I hate gardening/sheriffing because I feel like I usually don't
know whether to rebaseline or not :) (Although I suspect having
-correct would be better than not having it, at least).

>> In a world where all failures are triaged promptly by qualified
>> owners, all of this mechanism is unnecessary. Unfortunately, we don't
>> live in that world at the moment.
>
>
> I'd argue that this is a bigger problem worth solving. Differentiating
> expected failure and correct results isn't going to help if gardeners were
> to keep making mistakes.
>

It is a bigger problem worth solving, but it is also much harder to
solve. As long as we have the combination of

1) changes can cause lots of tests to change expectations and need new baselines
2) it's acceptable to land changes without the accompanying updated baselines
3) there is pressure to keep the bots green (not only for its own
sake, but so that we can more easily identify *new* failures)
4) insufficient resources watching the tree (which is really true for
every port today, even chromium)

i think we're pretty much screwed :). I'm working on things that
*might* help with 1), but I don't see 2) or 4) changing in the
near-to-foreseeable future (unless my changes also might help w/ 2),
and I think 3) is just good practice.

Put differently, I think my proposal gives us the most manageable
process for keeping the bots green / keeping new failures visible
while still being able to identify older failures.

But I'm not wedded to it; if anyone has better ideas, I would
certainly like to hear them.

-- Dirk
Ryosuke Niwa | 17 Aug 2012 17:07
Favicon
Gravatar

Re: A proposal for handling "failing" layout tests and TestExpectations

On Thu, Aug 16, 2012 at 6:11 PM, Dirk Pranke <dpranke <at> chromium.org> wrote:
On Thu, Aug 16, 2012 at 5:41 PM, Ryosuke Niwa <rniwa <at> webkit.org> wrote:
> On Thu, Aug 16, 2012 at 5:18 PM, Dirk Pranke <dpranke <at> chromium.org> wrote:
>>
>> On Thu, Aug 16, 2012 at 3:50 PM, Stephen Chenney <schenney <at> chromium.org>
>> wrote:
>> > I agree with the priorities above, at least. I also agree with the
>> > overall
>> > goal of making our implementation match our philosophy on testing.
>> >
>> > Ryosuke has raised a very valid point: it is not clear what a Gardener
>> > should do with a test that has changed behavior. Personally, the area I
>> > deal
>> > with is very susceptible to the "single pixel differences matter" issue,
>> > and
>> > I would prefer gardeners to flag updates in some way so that I can look
>> > at
>> > it myself (or one of the very few other experts can look at it). Maybe
>> > I'm
>> > just a control freak.
>> >
>> > In the current situation, gardeners are not likely to roll out a patch
>> > short
>> > of build failures, obvious test or perf regressions, or security issues.
>> > For
>> > the bulk of cases where a test result has changed, the gardener will
>> > either
>> > add it to expectations or go ahead and update the result.
>> >
>> > The result is a mixture of incorrect expected results and files listed
>> > in
>> > TestExpectations that may or may not be correct. The way I deal with
>> > this
>> > right now is to maintain a snapshot of TestExpectations and occasionally
>> > go
>> > through and look for newly added results, and check if they are in fact
>> > correct. And every now and then go look at older entries to see if they
>> > have
>> > been updated. Or I get lucky and notice that a new expected result has
>> > been
>> > checked in that is incorrect (I can mostly catch that by following
>> > checkins
>> > in the code).
>> >
>> > The proposed change does not alter this fundamental dynamic. Depending
>> > on
>> > what policy gardeners adopt, they might now rename the result as failing
>> > and
>> > remove the expected, or maybe they'll just update the expected. I'm
>> > still in
>> > a situation where I don't know the exact status of any given result, and
>> > where I have no easy way of knowing if a gardener has correctly updated
>> > an
>> > expected or failing image.
>>
>> So, if we had -passing/-failing, then people who knew what the results
>> were supposed to be in a given directory (call them "owners") could
>> rename the existing -expected files into one category or another, and
>> then a gardener could add newly-failing tests as -expected; it would
>> then be trivial for the owner to look for new -expected files in that
>> directory.
>>
>> Right now, an owner would have to periodically scan the directory for
>> files changed since the last time they scanned the directory, and
>> would have no way of knowing whether an updated file was "correct" or
>> not (perhaps you could filter out by who committed the change, but
>> it's certainly harder than ls *-expected).
>>
>> So, I would contend that my proposal would make it easier for us to
>> have a process by which gardeners could punt on changes they're unsure
>> about and for owners to subsequently review them. I think you could
>> have a similar model if we just checked in *all* new results into
>> -expected, but it would be harder to implement (perhaps not
>> impossible, though, I haven't given it much thought yet).
>
>
> That seems like a much better model.

To clarify: what does "That" refer to? My idea for gardeners to check
in "-expected" and owners to move to "-passing/failing"] or the idea
that we check in "new results into -expected" and have some other way
of reviewing new baselines? (or something else)?

The former. In particular, the idea that only experts of a given area can check in -correct/-failing is promising because there are many cases where pixel tests fail due to completely unrelated bugs and we want to consider it to be still correct. Butwe can implement this policy without the said proposal. All we need to do is for gardeners to add test expectations and wait for experts to come in and decide whether we need rebaseline, bug fix, or rollout.

On the other hand, the pixel test output that's correct to one expert may not be correct to another expert. For example, I might think that one editing test's output is correct because it shows the feature we're testing in the test is working properly. But Stephen might realizes that this -expected.png contains off-by-one Skia bug. So categorizing -correct.png and -failure.png may require multiple experts looking at the output, which may or may not be practical.

Furthermore, some areas may not have an adequate number of experts or experts who can triage -expected.png, which means that we'll further fragment the process depending on how many experts we have for a given area.

>> In a world where all failures are triaged promptly by qualified
>> owners, all of this mechanism is unnecessary. Unfortunately, we don't
>> live in that world at the moment.
>
>
> I'd argue that this is a bigger problem worth solving. Differentiating
> expected failure and correct results isn't going to help if gardeners were
> to keep making mistakes.
>

It is a bigger problem worth solving, but it is also much harder to
solve.

I agree. This is why I think we should just check-in whatever result we're currently seeing as -expected.png because we wouldn't at least have any ambiguity in the process then. We just check in whatever we're currently seeing and file a bug if we see a problem with the new result and possibly rollout the patch after talking with the author/reviewer.

The new result we check in may not be 100% right but experts  e.g. me for editing and Stephen for Skia  can come in and look at recent changes to triage any new failures.

In fact, it might be worthwhile for us to invest our time in improving tools to do this. For example, can we have a portal where I can see new rebaselines that happened in LayoutTests/editing and LayoutTests/platform/*/editing since the last time I visited the portal? e.g. it can show chronological timeline of baselines along with a possible cause (list of changesets maybe?) of the baseline.

- Ryosuke

_______________________________________________
webkit-dev mailing list
webkit-dev <at> lists.webkit.org
http://lists.webkit.org/mailman/listinfo/webkit-dev
Dirk Pranke | 17 Aug 2012 20:06

Re: A proposal for handling "failing" layout tests and TestExpectations

On Fri, Aug 17, 2012 at 8:07 AM, Ryosuke Niwa <rniwa <at> webkit.org> wrote:
> On Thu, Aug 16, 2012 at 6:11 PM, Dirk Pranke <dpranke <at> chromium.org> wrote:
>>
>> On Thu, Aug 16, 2012 at 5:41 PM, Ryosuke Niwa <rniwa <at> webkit.org> wrote:
>> > On Thu, Aug 16, 2012 at 5:18 PM, Dirk Pranke <dpranke <at> chromium.org>
>> > wrote:
>> >>
>> >> On Thu, Aug 16, 2012 at 3:50 PM, Stephen Chenney
>> >> <schenney <at> chromium.org>
>> >> wrote:
>> >> > I agree with the priorities above, at least. I also agree with the
>> >> > overall
>> >> > goal of making our implementation match our philosophy on testing.
>> >> >
>> >> > Ryosuke has raised a very valid point: it is not clear what a
>> >> > Gardener
>> >> > should do with a test that has changed behavior. Personally, the area
>> >> > I
>> >> > deal
>> >> > with is very susceptible to the "single pixel differences matter"
>> >> > issue,
>> >> > and
>> >> > I would prefer gardeners to flag updates in some way so that I can
>> >> > look
>> >> > at
>> >> > it myself (or one of the very few other experts can look at it).
>> >> > Maybe
>> >> > I'm
>> >> > just a control freak.
>> >> >
>> >> > In the current situation, gardeners are not likely to roll out a
>> >> > patch
>> >> > short
>> >> > of build failures, obvious test or perf regressions, or security
>> >> > issues.
>> >> > For
>> >> > the bulk of cases where a test result has changed, the gardener will
>> >> > either
>> >> > add it to expectations or go ahead and update the result.
>> >> >
>> >> > The result is a mixture of incorrect expected results and files
>> >> > listed
>> >> > in
>> >> > TestExpectations that may or may not be correct. The way I deal with
>> >> > this
>> >> > right now is to maintain a snapshot of TestExpectations and
>> >> > occasionally
>> >> > go
>> >> > through and look for newly added results, and check if they are in
>> >> > fact
>> >> > correct. And every now and then go look at older entries to see if
>> >> > they
>> >> > have
>> >> > been updated. Or I get lucky and notice that a new expected result
>> >> > has
>> >> > been
>> >> > checked in that is incorrect (I can mostly catch that by following
>> >> > checkins
>> >> > in the code).
>> >> >
>> >> > The proposed change does not alter this fundamental dynamic.
>> >> > Depending
>> >> > on
>> >> > what policy gardeners adopt, they might now rename the result as
>> >> > failing
>> >> > and
>> >> > remove the expected, or maybe they'll just update the expected. I'm
>> >> > still in
>> >> > a situation where I don't know the exact status of any given result,
>> >> > and
>> >> > where I have no easy way of knowing if a gardener has correctly
>> >> > updated
>> >> > an
>> >> > expected or failing image.
>> >>
>> >> So, if we had -passing/-failing, then people who knew what the results
>> >> were supposed to be in a given directory (call them "owners") could
>> >> rename the existing -expected files into one category or another, and
>> >> then a gardener could add newly-failing tests as -expected; it would
>> >> then be trivial for the owner to look for new -expected files in that
>> >> directory.
>> >>
>> >> Right now, an owner would have to periodically scan the directory for
>> >> files changed since the last time they scanned the directory, and
>> >> would have no way of knowing whether an updated file was "correct" or
>> >> not (perhaps you could filter out by who committed the change, but
>> >> it's certainly harder than ls *-expected).
>> >>
>> >> So, I would contend that my proposal would make it easier for us to
>> >> have a process by which gardeners could punt on changes they're unsure
>> >> about and for owners to subsequently review them. I think you could
>> >> have a similar model if we just checked in *all* new results into
>> >> -expected, but it would be harder to implement (perhaps not
>> >> impossible, though, I haven't given it much thought yet).
>> >
>> >
>> > That seems like a much better model.
>>
>> To clarify: what does "That" refer to? My idea for gardeners to check
>> in "-expected" and owners to move to "-passing/failing"] or the idea
>> that we check in "new results into -expected" and have some other way
>> of reviewing new baselines? (or something else)?
>
>
> The former. In particular, the idea that only experts of a given area can
> check in -correct/-failing is promising because there are many cases where
> pixel tests fail due to completely unrelated bugs and we want to consider it
> to be still correct. Butwe can implement this policy without the said
> proposal. All we need to do is for gardeners to add test expectations and
> wait for experts to come in and decide whether we need rebaseline, bug fix,
> or rollout.

A problem with just adding test expectations is that we lose
visibility into whether the test is failing the same way or a
different way. That is one of the (if not the primary) motivation for
my proposal ...

>
> On the other hand, the pixel test output that's correct to one expert may
> not be correct to another expert. For example, I might think that one
> editing test's output is correct because it shows the feature we're testing
> in the test is working properly. But Stephen might realizes that this
> -expected.png contains off-by-one Skia bug. So categorizing -correct.png and
> -failure.png may require multiple experts looking at the output, which may
> or may not be practical.

Perhaps. Obviously (a) there's a limit to what you can do here, and
(b) a test that requires multiple experts to verify its correctness
is, IMO, a bad test :).

>  I think we should just check-in whatever result we're
> currently seeing as -expected.png because we wouldn't at least have any
> ambiguity in the process then. We just check in whatever we're currently
> seeing and file a bug if we see a problem with the new result and possibly
> rollout the patch after talking with the author/reviewer.

This is basically saying we should just follow the "existing
non-Chromium" process, right? This would seem to bring us back to step
1: it doesn't address the problem that I identified with the "existing
non-Chromium" process, namely that a non-expert can't tell by looking
at the checkout  what tests are believed to be passing or not. I don't
think searching bugzilla (as it is currently used) is a workable
alternative.

> The new result we check in may not be 100% right but experts — e.g. me for
> editing and Stephen for Skia — can come in and look at recent changes to
> triage any new failures.
>
> In fact, it might be worthwhile for us to invest our time in improving tools
> to do this. For example, can we have a portal where I can see new
> rebaselines that happened in LayoutTests/editing and
> LayoutTests/platform/*/editing since the last time I visited the portal?
> e.g. it can show chronological timeline of baselines along with a possible
> cause (list of changesets maybe?) of the baseline.

We could build such a portal, sure. I would be interested to hear from
others whether such a thing would be more or less useful than my
proposal.

Of course, you could just set up a watchlist for new expectations
today. Probably not quite as polished as we could get with a portal,
but dirt simple ...

-- Dirk
_______________________________________________
webkit-dev mailing list
webkit-dev <at> lists.webkit.org
http://lists.webkit.org/mailman/listinfo/webkit-dev
Dimitri Glazkov | 17 Aug 2012 20:15

Re: A proposal for handling "failing" layout tests and TestExpectations

I don't have Dirk's patience, stamina and determination to explain and
evangelize the advantages of better awareness of current LayoutTests
coverage, I will chime in saying that I gladly support any effort to
improve it. I am especially excited about the ability to distinguish
failing, passing, and failing-in-different-way expectations.

:DG<
Ryosuke Niwa | 17 Aug 2012 20:29
Favicon
Gravatar

Re: A proposal for handling "failing" layout tests and TestExpectations

On Fri, Aug 17, 2012 at 11:06 AM, Dirk Pranke <dpranke <at> chromium.org> wrote:

> On the other hand, the pixel test output that's correct to one expert may
> not be correct to another expert. For example, I might think that one
> editing test's output is correct because it shows the feature we're testing
> in the test is working properly. But Stephen might realizes that this
> -expected.png contains off-by-one Skia bug. So categorizing -correct.png and
> -failure.png may require multiple experts looking at the output, which may
> or may not be practical.

Perhaps. Obviously (a) there's a limit to what you can do here, and
(b) a test that requires multiple experts to verify its correctness
is, IMO, a bad test :).

With that argument, almost all pixel tests are bad tests because pixel tests in editing, for example, involve editing, rendering, and graphics code. I don't think any single person can comprehend the entire stack to tell with a 100% confidence that the test result is exactly and precisely correct.

>  I think we should just check-in whatever result we're
> currently seeing as -expected.png because we wouldn't at least have any
> ambiguity in the process then. We just check in whatever we're currently
> seeing and file a bug if we see a problem with the new result and possibly
> rollout the patch after talking with the author/reviewer.

This is basically saying we should just follow the "existing
non-Chromium" process, right?

Yes. In addition, doing so will significantly reduce the complexity of the current process.

This would seem to bring us back to step
1: it doesn't address the problem that I identified with the "existing
non-Chromium" process, namely that a non-expert can't tell by looking
at the checkout what tests are believed to be passing or not.

What is the use case of this? I've been working on WebKit for more than 3 years, and I've never had to think about whether a test for an area outside of my expertise has the correct output or not other than when I was gardening. And having -correct / -failing wouldn't have helped me knowing what the correct output when I was gardening anyway because the new output may as well as be new -correct or -failing result.

I don't think searching bugzilla (as it is currently used) is a workable alternative.

Why not? Bugzilla is the tool we use to triage and track bugs. I don't see a need for an alternative method to keep track of bugs.

> The new result we check in may not be 100% right but experts — e.g. me for
> editing and Stephen for Skia — can come in and look at recent changes to
> triage any new failures.
>
> In fact, it might be worthwhile for us to invest our time in improving tools
> to do this. For example, can we have a portal where I can see new
> rebaselines that happened in LayoutTests/editing and
> LayoutTests/platform/*/editing since the last time I visited the portal?
> e.g. it can show chronological timeline of baselines along with a possible
> cause (list of changesets maybe?) of the baseline.

We could build such a portal, sure. I would be interested to hear from
others whether such a thing would be more or less useful than my
proposal.

Of course, you could just set up a watchlist for new expectations
today. Probably not quite as polished as we could get with a portal,
but dirt simple ..

That might be useful as long as it has an option to give us a digest instead of sending me an e-mail per commit.

- Ryosuke

_______________________________________________
webkit-dev mailing list
webkit-dev <at> lists.webkit.org
http://lists.webkit.org/mailman/listinfo/webkit-dev
Dirk Pranke | 17 Aug 2012 20:52

Re: A proposal for handling "failing" layout tests and TestExpectations

On Fri, Aug 17, 2012 at 11:29 AM, Ryosuke Niwa <rniwa <at> webkit.org> wrote:
> On Fri, Aug 17, 2012 at 11:06 AM, Dirk Pranke <dpranke <at> chromium.org> wrote:
>>
>> > On the other hand, the pixel test output that's correct to one expert
>> > may
>> > not be correct to another expert. For example, I might think that one
>> > editing test's output is correct because it shows the feature we're
>> > testing
>> > in the test is working properly. But Stephen might realizes that this
>> > -expected.png contains off-by-one Skia bug. So categorizing -correct.png
>> > and
>> > -failure.png may require multiple experts looking at the output, which
>> > may
>> > or may not be practical.
>>
>> Perhaps. Obviously (a) there's a limit to what you can do here, and
>> (b) a test that requires multiple experts to verify its correctness
>> is, IMO, a bad test :).
>
>
> With that argument, almost all pixel tests are bad tests because pixel tests
> in editing, for example, involve editing, rendering, and graphics code.

If in order to tell a pixel test is correct you need to be aware of
how all of that stuff works, then, yes, it's a bad test. It can fail
too many different ways, and is testing too many different bits of
information. As Filip might suggest, it would be nice if we could
split such tests up. That said, I will freely grant that in many cases
we can't easily do better given the way things are currently
structured, and splitting up such tests would be an enormous amount of
work.

If the pixel test is testing whether a rectangle is actually green or
actually red, such a test is fine, doesn't need much subject matter
expertise, and it is hard to imagine how you'd test such a thing some
other way.

> I don't think any single person can comprehend the entire stack to tell with a
> 100% confidence that the test result is exactly and precisely correct.

Sure. Such a high bar should be avoided.

>> >  I think we should just check-in whatever result we're
>> > currently seeing as -expected.png because we wouldn't at least have any
>> > ambiguity in the process then. We just check in whatever we're currently
>> > seeing and file a bug if we see a problem with the new result and
>> > possibly
>> > rollout the patch after talking with the author/reviewer.
>>
>> This is basically saying we should just follow the "existing
>> non-Chromium" process, right?
>
>
> Yes. In addition, doing so will significantly reduce the complexity of the
> current process.
>
>> This would seem to bring us back to step
>> 1: it doesn't address the problem that I identified with the "existing
>> non-Chromium" process, namely that a non-expert can't tell by looking
>> at the checkout what tests are believed to be passing or not.
>
>
> What is the use case of this? I've been working on WebKit for more than 3
> years, and I've never had to think about whether a test for an area outside
> of my expertise has the correct output or not other than when I was
> gardening. And having -correct / -failing wouldn't have helped me knowing
> what the correct output when I was gardening anyway because the new output
> may as well as be new -correct or -failing result.

I've done this frequently when gardening, when simply trying to learn
how a given chunk of code works and how a given chunk of tests work
(or don't work), and when trying to get a sense of how well our
product is or isn't passing tests.

Perhaps this is the case because I tend to more work on infrastructure
and testing, and look at stuff shallowly across the whole tree rather
than in depth in particular areas as you do.

>> I don't think searching bugzilla (as it is currently used) is a workable
>> alternative.
>
>
> Why not? Bugzilla is the tool we use to triage and track bugs. I don't see a
> need for an alternative method to keep track of bugs.

The way we currently use bugzilla, it is difficult if not impossible
to find a concise and accurate list of all the failing layout tests
meeting any sort of filename- or directory-based criteria (maybe you
can do it just for editing, I don't know). The layout test summary
reports that Ojan sends out to the chromium devs is an example of
this: he generates that from the TestExpectations files; doing so from
bugzilla is not currently feasible.

Note that we could certainly extend bugzilla to make this easier, if
there was consensus to do so (and I would be in favor of this, but
that would also incur more process than we have today).

- Dirk

Gmane