Ingo Molnar | 3 Mar 2008 18:48
Picon
Picon
Favicon

Re: Regression in 2.6.25-rc3: s2ram segfaults before suspending


* Klaus S. Madsen <ksm <at> hjernemadsen.org> wrote:

> The following patch solves the segfault, by changing the mmap flags of 
> the video memory area, to allow execution. The patch is against 
> libx86-0.99 available from http://www.codon.org.uk/~mjg59/libx86/
> 
> --- libx86-0.99/x86-common.c	2006-09-08 00:44:27.000000000 +0200
> +++ libx86-0.99.new/x86-common.c	2008-03-01 10:08:25.000000000 +0100
>  <at>  <at>  -232,7 +232,7  <at>  <at> 
>  	}
>  
>  	m = mmap((void *)0xa0000, 0x100000 - 0xa0000,
> -	 PROT_READ | PROT_WRITE,
> +	 PROT_READ | PROT_WRITE | PROT_EXEC,

are you sure you ID-ed the right commit that broke things?

while requiring PROT_EXEC is fine, breaking existing user-space apps 
over that is not fine. So are you absolutely sure that by reverting that 
PWT|PCD commit, s2ram again starts to work? That's utmost weird...

perhaps there's some CPU bug that causes NX to _NOT_ work if only PCD is 
used (not PCD|PWT). Seems like a pretty unlikely scenario though.

	Ingo

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
(Continue reading)

Klaus S. Madsen | 3 Mar 2008 21:52

Re: Regression in 2.6.25-rc3: s2ram segfaults before suspending

On Mon, Mar 03, 2008 at 18:48:58 +0100, Ingo Molnar wrote:
> 
> * Klaus S. Madsen <ksm <at> hjernemadsen.org> wrote:
> 
> > The following patch solves the segfault, by changing the mmap flags of 
> > the video memory area, to allow execution. The patch is against 
> > libx86-0.99 available from http://www.codon.org.uk/~mjg59/libx86/
> > 
> > --- libx86-0.99/x86-common.c	2006-09-08 00:44:27.000000000 +0200
> > +++ libx86-0.99.new/x86-common.c	2008-03-01 10:08:25.000000000 +0100
> >  <at>  <at>  -232,7 +232,7  <at>  <at> 
> >  	}
> >  
> >  	m = mmap((void *)0xa0000, 0x100000 - 0xa0000,
> > -	 PROT_READ | PROT_WRITE,
> > +	 PROT_READ | PROT_WRITE | PROT_EXEC,
> 
> are you sure you ID-ed the right commit that broke things?
I can't be sure. It was my third attempt, and there seems to be some
sort of Makefile trouble in that area, which causes the problem to
appear and disappear at random, unless I do a make clean && make. But
the triggering commit was found with make clean && make, and I made sure
that reverting the resulting commit did actually solve the problem...

However I wasn't able to make the problem go away, by removing the
_PAGE_PWT constants from __PAGE_KERNEL_NOCACHE and
__PAGE_KERNEL_VSYSCALL_NOCACHE in include-asm/pgtable.h in the newest
2.6.25:

diff --git a/include/asm-x86/pgtable.h b/include/asm-x86/pgtable.h
(Continue reading)

Ingo Molnar | 4 Mar 2008 13:42
Picon
Picon
Favicon

Re: Regression in 2.6.25-rc3: s2ram segfaults before suspending


* Klaus S. Madsen <ksm <at> hjernemadsen.org> wrote:

> > are you sure you ID-ed the right commit that broke things?
>
> I can't be sure. It was my third attempt, and there seems to be some 
> sort of Makefile trouble in that area, which causes the problem to 
> appear and disappear at random, unless I do a make clean && make. But 
> the triggering commit was found with make clean && make, and I made 
> sure that reverting the resulting commit did actually solve the 
> problem...

btw., even if it turns out to be the wrong commit, you sure poked in the 
right general area. This is one of the reoccuring problems with git 
bisection: a small mistake near the end of a long bisection session can 
point to the wrong commit. Especially with more sporadic failure modes 
it can be quite a challenge.

	Ingo

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
Ingo Molnar | 4 Mar 2008 13:36
Picon
Picon
Favicon

Re: Regression in 2.6.25-rc3: s2ram segfaults before suspending


* Klaus S. Madsen <ksm <at> hjernemadsen.org> wrote:

> So while I'm fairly confident in that I bisected correctly, the number 
> of attempts I had to go through to get a reliable result, and the fact 
> that I cannot make the problem go away by reverting the current code 
> to something similar, counts quite a lot against me.
> 
> However I'm 100% confident that the problem appears between 
> cf8fa920cb4271f17e0265c863d64bea1b31941a and 
> 925596a017bbd045ff711b778256f459e50a119, which is something like 16 
> commits. I have been at both points in the tree at least 2 times, and 
> confirmed that cf8fa920cb4271f17e0265c863d64bea1b31941a worked for me, 
> and 925596a017bbd045ff711b778256f459e50a119 didn't.

my guess would be that it's this commit that causes it:

| commit 6c3866558213ff706d8331053386915371ad63ec
| Author: Jeremy Fitzhardinge <jeremy <at> goop.org>
| Date:   Wed Jan 30 13:32:55 2008 +0100
|
|    x86: move all asm/pgtable constants into one place

> But I'm a bit puzzled by the fact that I'm aparently the only one how 
> have encountered the problem? Maybe it's only a problem if one also 
> uses PAE? (Thats just a wild guess to explain why I'm the only one 
> seeing this).

PAE activates NX on 32-bit. So we probably had an NX regression that got 
fixed by the side-effects of one of the unifications. Does it start 
(Continue reading)

Jeremy Fitzhardinge | 5 Mar 2008 00:00
Gravatar

Re: Regression in 2.6.25-rc3: s2ram segfaults before suspending

Ingo Molnar wrote:
> * Klaus S. Madsen <ksm <at> hjernemadsen.org> wrote:
>
>   
>> So while I'm fairly confident in that I bisected correctly, the number 
>> of attempts I had to go through to get a reliable result, and the fact 
>> that I cannot make the problem go away by reverting the current code 
>> to something similar, counts quite a lot against me.
>>
>> However I'm 100% confident that the problem appears between 
>> cf8fa920cb4271f17e0265c863d64bea1b31941a and 
>> 925596a017bbd045ff711b778256f459e50a119, which is something like 16 
>> commits. I have been at both points in the tree at least 2 times, and 
>> confirmed that cf8fa920cb4271f17e0265c863d64bea1b31941a worked for me, 
>> and 925596a017bbd045ff711b778256f459e50a119 didn't.
>>     
>
> my guess would be that it's this commit that causes it:
>
> | commit 6c3866558213ff706d8331053386915371ad63ec
> | Author: Jeremy Fitzhardinge <jeremy <at> goop.org>
> | Date:   Wed Jan 30 13:32:55 2008 +0100
> |
> |    x86: move all asm/pgtable constants into one place
>
>   
>> But I'm a bit puzzled by the fact that I'm aparently the only one how 
>> have encountered the problem? Maybe it's only a problem if one also 
>> uses PAE? (Thats just a wild guess to explain why I'm the only one 
>> seeing this).
(Continue reading)

H. Peter Anvin | 5 Mar 2008 00:11
Favicon

Re: Regression in 2.6.25-rc3: s2ram segfaults before suspending

Jeremy Fitzhardinge wrote:
>>
>> PAE activates NX on 32-bit. So we probably had an NX regression that 
>> got fixed by the side-effects of one of the unifications. Does it 
>> start working if you disable NX via the noexec=off boot option?
> 
> What's the state of play here?  Is upshot that this change fixed a bug 
> which broke s2ram, or caused a bug which broke s2ram?
> 

As far as I can tell, this change fixed a bug, and the fact that the bug 
was fixed triggered a s2ram bug.

	-hpa
Matthew Garrett | 5 Mar 2008 00:30
Favicon

Re: Regression in 2.6.25-rc3: s2ram segfaults before suspending

On Tue, Mar 04, 2008 at 03:11:17PM -0800, H. Peter Anvin wrote:
> Jeremy Fitzhardinge wrote:
> >>
> >>PAE activates NX on 32-bit. So we probably had an NX regression that 
> >>got fixed by the side-effects of one of the unifications. Does it 
> >>start working if you disable NX via the noexec=off boot option?
> >
> >What's the state of play here?  Is upshot that this change fixed a bug 
> >which broke s2ram, or caused a bug which broke s2ram?
> >
> 
> As far as I can tell, this change fixed a bug, and the fact that the bug 
> was fixed triggered a s2ram bug.

Strictly a libx86 bug, so I'll try to get an updated version uploaded in 
the near future. This won't hit x86_64 users, since the code is run 
through x86emu rather than executed directly.

--

-- 
Matthew Garrett | mjg59 <at> srcf.ucam.org
Jeremy Fitzhardinge | 5 Mar 2008 00:21
Gravatar

Re: Regression in 2.6.25-rc3: s2ram segfaults before suspending

H. Peter Anvin wrote:
> As far as I can tell, this change fixed a bug, and the fact that the 
> bug was fixed triggered a s2ram bug. 

OK.  Doesn't surprise me.  There were a few cases of dubious pte 
handling which could have accidentally lost the NX bit.

    J
Klaus S. Madsen | 4 Mar 2008 22:58

Re: Regression in 2.6.25-rc3: s2ram segfaults before suspending

On Tue, Mar 04, 2008 at 13:36:02 +0100, Ingo Molnar wrote:
> my guess would be that it's this commit that causes it:
> 
> | commit 6c3866558213ff706d8331053386915371ad63ec
> | Author: Jeremy Fitzhardinge <jeremy <at> goop.org>
> | Date:   Wed Jan 30 13:32:55 2008 +0100
> |
> |    x86: move all asm/pgtable constants into one place
All right. I just tried the following:

git checkout 82bc03fc158e28c90d7ed9919410776039cb4e14
make clean && make -j3 && sudo make install modules_install

And the resulting kernel works OK. However if I checkout commit
6c3866558213ff706d8331053386915371ad63ec, and compile it using the
previous commandline, s2ram fails so it seems you're right.

> > But I'm a bit puzzled by the fact that I'm aparently the only one how 
> > have encountered the problem? Maybe it's only a problem if one also 
> > uses PAE? (Thats just a wild guess to explain why I'm the only one 
> > seeing this).
> 
> PAE activates NX on 32-bit. So we probably had an NX regression that got 
> fixed by the side-effects of one of the unifications. Does it start 
> working if you disable NX via the noexec=off boot option?
If I boot with noexec=off, s2ram starts working again.

--

-- 
Kind regards
	Klaus S. Madsen
(Continue reading)

H. Peter Anvin | 4 Mar 2008 23:08
Favicon

Re: Regression in 2.6.25-rc3: s2ram segfaults before suspending

Klaus S. Madsen wrote:
>>> But I'm a bit puzzled by the fact that I'm aparently the only one how 
>>> have encountered the problem? Maybe it's only a problem if one also 
>>> uses PAE? (Thats just a wild guess to explain why I'm the only one 
>>> seeing this).
>> PAE activates NX on 32-bit. So we probably had an NX regression that got 
>> fixed by the side-effects of one of the unifications. Does it start 
>> working if you disable NX via the noexec=off boot option?
> If I boot with noexec=off, s2ram starts working again.

As one can expect.

	-hpa
Pavel Machek | 3 Mar 2008 22:05
Picon

Re: Regression in 2.6.25-rc3: s2ram segfaults before suspending

Hi!

> However I wasn't able to make the problem go away, by removing the
> _PAGE_PWT constants from __PAGE_KERNEL_NOCACHE and
> __PAGE_KERNEL_VSYSCALL_NOCACHE in include-asm/pgtable.h in the newest
> 2.6.25:
> 
> diff --git a/include/asm-x86/pgtable.h b/include/asm-x86/pgtable.h
> index 174b877..f81c968 100644
> --- a/include/asm-x86/pgtable.h
> +++ b/include/asm-x86/pgtable.h
>  <at>  <at>  -84,9 +84,9  <at>  <at>  extern pteval_t __PAGE_KERNEL, __PAGE_KERNEL_EXEC;
>  #define __PAGE_KERNEL_RO		(__PAGE_KERNEL & ~_PAGE_RW)
>  #define __PAGE_KERNEL_RX		(__PAGE_KERNEL_EXEC & ~_PAGE_RW)
>  #define __PAGE_KERNEL_EXEC_NOCACHE	(__PAGE_KERNEL_EXEC | _PAGE_PCD | _PAGE_PWT)
> -#define __PAGE_KERNEL_NOCACHE		(__PAGE_KERNEL | _PAGE_PCD | _PAGE_PWT)
> +#define __PAGE_KERNEL_NOCACHE		(__PAGE_KERNEL | _PAGE_PCD)
>  #define __PAGE_KERNEL_VSYSCALL		(__PAGE_KERNEL_RX | _PAGE_USER)
> -#define __PAGE_KERNEL_VSYSCALL_NOCACHE	(__PAGE_KERNEL_VSYSCALL | _PAGE_PCD | _PAGE_PWT)
> +#define __PAGE_KERNEL_VSYSCALL_NOCACHE	(__PAGE_KERNEL_VSYSCALL | _PAGE_PCD)
>  #define __PAGE_KERNEL_LARGE		(__PAGE_KERNEL | _PAGE_PSE)
>  #define __PAGE_KERNEL_LARGE_EXEC	(__PAGE_KERNEL_EXEC | _PAGE_PSE)
>  
> So while I'm fairly confident in that I bisected correctly, the number
> of attempts I had to go through to get a reliable result, and the fact
> that I cannot make the problem go away by reverting the current code to
> something similar, counts quite a lot against me.
> 
> However I'm 100% confident that the problem appears between
> cf8fa920cb4271f17e0265c863d64bea1b31941a and
(Continue reading)

Pavel Machek | 3 Mar 2008 22:06
Picon

Re: Regression in 2.6.25-rc3: s2ram segfaults before suspending

On Mon 2008-03-03 22:05:26, Pavel Machek wrote:
> Hi!
> 
> > However I wasn't able to make the problem go away, by removing the
> > _PAGE_PWT constants from __PAGE_KERNEL_NOCACHE and
> > __PAGE_KERNEL_VSYSCALL_NOCACHE in include-asm/pgtable.h in the newest
> > 2.6.25:
> > 
> > diff --git a/include/asm-x86/pgtable.h b/include/asm-x86/pgtable.h
> > index 174b877..f81c968 100644
> > --- a/include/asm-x86/pgtable.h
> > +++ b/include/asm-x86/pgtable.h
> >  <at>  <at>  -84,9 +84,9  <at>  <at>  extern pteval_t __PAGE_KERNEL, __PAGE_KERNEL_EXEC;
> >  #define __PAGE_KERNEL_RO		(__PAGE_KERNEL & ~_PAGE_RW)
> >  #define __PAGE_KERNEL_RX		(__PAGE_KERNEL_EXEC & ~_PAGE_RW)
> >  #define __PAGE_KERNEL_EXEC_NOCACHE	(__PAGE_KERNEL_EXEC | _PAGE_PCD | _PAGE_PWT)
> > -#define __PAGE_KERNEL_NOCACHE		(__PAGE_KERNEL | _PAGE_PCD | _PAGE_PWT)
> > +#define __PAGE_KERNEL_NOCACHE		(__PAGE_KERNEL | _PAGE_PCD)
> >  #define __PAGE_KERNEL_VSYSCALL		(__PAGE_KERNEL_RX | _PAGE_USER)
> > -#define __PAGE_KERNEL_VSYSCALL_NOCACHE	(__PAGE_KERNEL_VSYSCALL | _PAGE_PCD | _PAGE_PWT)
> > +#define __PAGE_KERNEL_VSYSCALL_NOCACHE	(__PAGE_KERNEL_VSYSCALL | _PAGE_PCD)
> >  #define __PAGE_KERNEL_LARGE		(__PAGE_KERNEL | _PAGE_PSE)
> >  #define __PAGE_KERNEL_LARGE_EXEC	(__PAGE_KERNEL_EXEC | _PAGE_PSE)
> >  
> > So while I'm fairly confident in that I bisected correctly, the number
> > of attempts I had to go through to get a reliable result, and the fact
> > that I cannot make the problem go away by reverting the current code to
> > something similar, counts quite a lot against me.
> > 
> > However I'm 100% confident that the problem appears between
(Continue reading)

Klaus S. Madsen | 3 Mar 2008 22:21

Re: Regression in 2.6.25-rc3: s2ram segfaults before suspending

On Mon, Mar 03, 2008 at 22:06:47 +0100, Pavel Machek wrote:
> > > But I'm a bit puzzled by the fact that I'm aparently the only one how
> > > have encountered the problem? Maybe it's only a problem if one also uses
> > > PAE? (Thats just a wild guess to explain why I'm the only one seeing
> > > this).
> > 
> > I do not have NX capable CPU in machine using for suspend
> > testing. ... and you are using 32-bit userspace on 64-bit kernel,
> > right?
> 
> Sorry, I meant "32-bit userspace on 32-bit kernel"... but that on
> 64-bit capable cpu.
Yes. With 4 GB of RAM, so I use PAE, which is also required for NX. Just
to be sure, I disabled PAE, and sure enough, the problem disappeared.

I'll double that the commit I fingered actually caused the problem, but
I will not have time for this before tomorrw evening. 

Just to be sure: If I want to checkout the revision, just before the bad
commit I can do something like:

git checkout commit-id~1

Right? 

Thanks for spending time on this.

--

-- 
Kind regards
	Klaus S. Madsen
(Continue reading)

Matthew Garrett | 3 Mar 2008 22:06
Favicon

Re: Regression in 2.6.25-rc3: s2ram segfaults before suspending

On Mon, Mar 03, 2008 at 10:05:26PM +0100, Pavel Machek wrote:

> I do not have NX capable CPU in machine using for suspend
> testing. ... and you are using 32-bit userspace on 64-bit kernel,
> right?

Not if he's expecting vm86 to work...

--

-- 
Matthew Garrett | mjg59 <at> srcf.ucam.org

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
H. Peter Anvin | 3 Mar 2008 21:58
Favicon

Re: Regression in 2.6.25-rc3: s2ram segfaults before suspending

Klaus S. Madsen wrote:
> $ cat /proc/cpuinfo
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 15
> model name      : Intel(R) Core(TM)2 Duo CPU     T7500   <at>  2.20GHz
> stepping        : 10
> 
> But I'm a bit puzzled by the fact that I'm aparently the only one how
> have encountered the problem? Maybe it's only a problem if one also uses
> PAE? (Thats just a wild guess to explain why I'm the only one seeing
> this).

NX protection only applies to PAE.

	-hpa

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
H. Peter Anvin | 3 Mar 2008 18:53
Favicon

Re: Regression in 2.6.25-rc3: s2ram segfaults before suspending

Ingo Molnar wrote:
> * Klaus S. Madsen <ksm <at> hjernemadsen.org> wrote:
> 
>> The following patch solves the segfault, by changing the mmap flags of 
>> the video memory area, to allow execution. The patch is against 
>> libx86-0.99 available from http://www.codon.org.uk/~mjg59/libx86/
>>
>> --- libx86-0.99/x86-common.c	2006-09-08 00:44:27.000000000 +0200
>> +++ libx86-0.99.new/x86-common.c	2008-03-01 10:08:25.000000000 +0100
>>  <at>  <at>  -232,7 +232,7  <at>  <at> 
>>  	}
>>  
>>  	m = mmap((void *)0xa0000, 0x100000 - 0xa0000,
>> -	 PROT_READ | PROT_WRITE,
>> +	 PROT_READ | PROT_WRITE | PROT_EXEC,
> 
> are you sure you ID-ed the right commit that broke things?
> 
> while requiring PROT_EXEC is fine, breaking existing user-space apps 
> over that is not fine. So are you absolutely sure that by reverting that 
> PWT|PCD commit, s2ram again starts to work? That's utmost weird...
> 
> perhaps there's some CPU bug that causes NX to _NOT_ work if only PCD is 
> used (not PCD|PWT). Seems like a pretty unlikely scenario though.
> 

It really does.  What would be much more likely is that the PCD -> 
(PCD|PWT) triggered something in the kernel proper.

	-hpa
(Continue reading)


Gmane