Linus Torvalds | 8 Nov 2004 02:16

Re: Oops in 2.6.10-rc1


On Mon, 8 Nov 2004, Christian Kujau wrote:
> 
> what i did not expect is that this ChangeSet is now *not* the culprit,
> because there is no oops. am i right? [1]

Yes.

So now I'd like to know _where_ the culprit is, since it turned out to be 
not the ALSA code. 

> i did another thing: i enabled the (deprecated) OSS driver (es1371.ko)
> tried to load this thing:
> 
> http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-debug_oops-OSS.txt
> 
> it oopses.
>  - you said it's not a b0rken pci thingy
>  - i have to assume now that it's not an ALSA issue (since oss oopses too)
>  - it is OSS? the driver? i've CC'ed linux-sound...

Sounds like something else changed, and likely the ALSA _and_ the OSS 
driver both broke. Which is not all that unlikely, since I suspect they 
share a lot of history.

> yes, like Documentation/BUG-HUNTING says. but i seem to have difficulties
> in using my tools (bk). sorry for that.

Not your fault. Think of this as a learning experience ;)

[...]

Christian Kujau | 8 Nov 2004 14:01

Re: Oops in 2.6.10-rc1


Linus Torvalds wrote:
> 
> Not your fault. Think of this as a learning experience ;)

it definitely is, yes.

> Anyway, now that the _other_ driver also oopses, and with a very similar 
> oops too, so it looks like they both depended on some undocumented (or 
> changed) detail in the PCI layer. Next step would be to see if the thing 
> that breaks is this merge:

may i ask how you come to this conclusion? by technical knowledge or could
this be deduced by some bk magic too?

> 
> 	ChangeSet <at> 1.2463, 2004-11-04 17:07:16-08:00, torvalds <at> ppc970.osdl.org
> 	  Merge bk://kernel.bkbits.net/gregkh/linux/driver-2.6
> 	  into ppc970.osdl.org:/home/torvalds/v2.6/linux
> 
> which merges Greg's PCI/driver model changes.
> 
> It's all the same steps you took with the ALSA merge, you're a
> professional by now ;)

i did "bk undo -a1.2463" from a current -BK tree and it oopses:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-debug_oops-a1.2463.txt

(i've booted with different boot options this time, because i noticed that
[...]

Pekka Enberg | 8 Nov 2004 19:44

Re: Oops in 2.6.10-rc1

Hi Christian,

On Mon, 08 Nov 2004 14:01:39 +0100, Christian Kujau <evil <at> g-house.de> wrote:
> i've put everything on http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/
> the .configs, the oopses are there. i've double checked a kernel built
> from "bk -a a1.2000.7.2" yesterday but the result was the same (no oops)

Just to update, I cannot reproduce the oops with your config (nor
mine) on my machine running 2.6.10-rc1-bk14.

                       Pekka

0000:00:00.0 Host bridge: VIA Technologies, Inc. VT8363/8365
[KT133/KM133] (rev 03)
        Subsystem: ASUSTeK Computer Inc. A7V133/A7V133-C Mainboard
        Flags: bus master, medium devsel, latency 8
        Memory at e7000000 (32-bit, prefetchable)
        Capabilities: [a0] AGP version 2.0
        Capabilities: [c0] Power Management version 2

0000:00:01.0 PCI bridge: VIA Technologies, Inc. VT8363/8365
[KT133/KM133 AGP] (prog-if 00 [Normal decode])
        Flags: bus master, 66Mhz, medium devsel, latency 0
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: 0000d000-0000dfff
        Memory behind bridge: d7000000-d7efffff
        Prefetchable memory behind bridge: d7f00000-e6ffffff
        Expansion ROM at 0000d000 [disabled] [size=4K]
        Capabilities: [80] Power Management version 2

[...]

Greg KH | 8 Nov 2004 20:00

Re: Oops in 2.6.10-rc1

On Mon, Nov 08, 2004 at 08:44:37PM +0200, Pekka Enberg wrote:
> Hi Christian,
> 
> On Mon, 08 Nov 2004 14:01:39 +0100, Christian Kujau <evil <at> g-house.de> wrote:
> > i've put everything on http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/
> > the .configs, the oopses are there. i've double checked a kernel built
> > from "bk -a a1.2000.7.2" yesterday but the result was the same (no oops)
> 
> Just to update, I cannot reproduce the oops with your config (nor
> mine) on my machine running 2.6.10-rc1-bk14.

But 2.6.10-rc1-bk15 does have the problem?

Trying to figure out where the issue is...

greg k-h
-
To unsubscribe from this list: send the line "unsubscribe linux-sound" in
the body of a message to majordomo <at> vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Pekka Enberg | 8 Nov 2004 20:18

Re: Oops in 2.6.10-rc1

Hi,

On Mon, 8 Nov 2004 11:00:40 -0800, Greg KH <greg <at> kroah.com> wrote:
> But 2.6.10-rc1-bk15 does have the problem?
> 
> Trying to figure out where the issue is...

No, -bk14 is just the kernel I am running right now (I haven't tried
-bk15) and I haven't had the problem. I cannot reproduce the oops _at
all_ which is why I suspect it's his hardware. I included my lspci and
dmesg output because we have similar (but not exactly the same)
setups.

FWIW, I've asked Christian for an objdump of the kernel to see if I can
track down where it oopses because I cannot find anything in the
code. I suspected pcibios_enable_irq (which is a function pointer)
might be wrong but looking at his logs, I don't think we get that far.

                          Pekka

Christian Kujau | 8 Nov 2004 21:31

Re: Oops in 2.6.10-rc1


Pekka Enberg wrote:
> Hi,
> 
> On Mon, 8 Nov 2004 11:00:40 -0800, Greg KH <greg <at> kroah.com> wrote:
> 
>>But 2.6.10-rc1-bk15 does have the problem?
>>
>>Trying to figure out where the issue is...

i could use the -bk snapshots too, but since i am using bk myself (i try),
i think we can narrow it down a bit more.

> 
> No, -bk14 is just the kernel I am running right now (I haven't tried
> -bk15) and I haven't had the problem. I cannot reproduce the oops _at
> all_ which is why I suspect it's his hardware. I included my lspci and
> dmesg output because we have similar (but not exactly the same)
> setups.

i've put an lspci output here:
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/lspci-v.txt
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/lspci-vv.txt

i do not suspect hw problems *yet*, because kernel up to 2.6.9 (tracking
bk) do not show this behaviour.

> FWIW, I've asked Christian for an objdump of the kernel to see if I can

will show up in a couple of minutes here:
[...]

Pekka Enberg | 8 Nov 2004 20:30

Re: Oops in 2.6.10-rc1

On Mon, 8 Nov 2004 11:00:40 -0800, Greg KH <greg <at> kroah.com> wrote:
> > But 2.6.10-rc1-bk15 does have the problem?
> >
> > Trying to figure out where the issue is...

On Mon, 8 Nov 2004 21:18:09 +0200, Pekka Enberg <penberg <at> gmail.com> wrote: 
> No, -bk14 is just the kernel I am running right now (I haven't tried
> -bk15) and I haven't had the problem.

Sorry for not being clear, any kernel after 2.6.10-rc1 oopses
according to Christian which is why I haven't bothered to test
anything else except -bk14.

                           Pekka

Linus Torvalds | 8 Nov 2004 19:13

Re: Oops in 2.6.10-rc1


On Mon, 8 Nov 2004, Christian Kujau wrote:
> 
> > Anyway, now that the _other_ driver also oopses, and with a very similar 
> > oops too, so it looks like they both depended on some undocumented (or 
> > changed) detail in the PCI layer. Next step would be to see if the thing 
> > that breaks is this merge:
> 
> may i ask how you come to this conclusion? by technical knowledge or could
> this be deduced by some bk magic too?

No, just gut feel. If the pre-merge ALSA works, and the post-merge one 
doesn't, and the oops in both cases happen somewhere close to where it 
does "pci_enable_device()", there's not a lot left. There are interrupts, 
and there is the PCI layer...

> > 	ChangeSet <at> 1.2463, 2004-11-04 17:07:16-08:00, torvalds <at> ppc970.osdl.org
> > 	  Merge bk://kernel.bkbits.net/gregkh/linux/driver-2.6
> > 	  into ppc970.osdl.org:/home/torvalds/v2.6/linux
> > 
> > which merges Greg's PCI/driver model changes.
> > 
> > It's all the same steps you took with the ALSA merge, you're a
> > professional by now ;)
> 
> i did "bk undo -a1.2463" from a current -BK tree and it oopses:

Note that "bk undo -axxx" will _leave_ xxx in place, and undo everything 
after. 
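
The "bk undo -a" behaviour described above can be modelled in a few lines of C. This is only a sketch: the struct tree type, the undo_after() and tree_has() helpers, and the idea that history is a flat, oldest-first list of changeset names are assumptions made for illustration, not how BitKeeper stores or numbers revisions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model: a tree is an ordered list of changeset names,
 * oldest first. Real BitKeeper history is a graph, not a flat list. */
struct tree {
    const char *csets[8];
    size_t count;
};

/* Model of "bk undo -a<rev>": keep <rev> and everything before it,
 * drop everything after it. Returns 0 on success, -1 if <rev> is absent. */
static int undo_after(struct tree *t, const char *rev)
{
    assert(t != NULL && rev != NULL);
    for (size_t i = 0; i < t->count; i++) {
        if (strcmp(t->csets[i], rev) == 0) {
            t->count = i + 1;   /* <rev> itself stays in the tree */
            return 0;
        }
    }
    return -1;
}

/* Returns 1 if the tree still contains <rev>, 0 otherwise. */
static int tree_has(const struct tree *t, const char *rev)
{
    assert(t != NULL && rev != NULL);
    for (size_t i = 0; i < t->count; i++) {
        if (strcmp(t->csets[i], rev) == 0)
            return 1;
    }
    return 0;
}
```

In this model, undoing after "1.2463" leaves 1.2463 in the tree, so a kernel built from it still contains that merge; only undoing after "1.2462" removes it.
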

[...]

Christian Kujau | 8 Nov 2004 21:59

Re: Oops in 2.6.10-rc1


Linus Torvalds wrote:
>
> No, just gut feel. If the pre-merge ALSA works, and the post-merge one 
> doesn't, and the oops in both cases happen somewhere close to where it 
> does "pci_enable_device()", there's not a lot left. There are interrupts, 
> and there is the PCI layer...

yes, makes sense.

>>
>>i did "bk undo -a1.2463" from a current -BK tree and it oopses:
> 
> Note that "bk undo -axxx" will _leave_ xxx in place, and undo everything 
> after. 
> 
> So what you did still has the merge in the tree, and that it still oopses 
> is thus to be expected. BUT, we're getting closer.

yes, i think i understood that. that's why i wanted to revert 1.2463 too.

[...]

> 
> Now, that's fine - the USB merge is likely to be ok, so try doing
> 
> 	bk undo -a1.2462

for now i appreciate your work here but i have to postpone the "bk
revtool" stuff because i have no X _and_ bk here. (but i'm a good student
[...]

Christian Kujau | 9 Nov 2004 00:49

Re: Oops in 2.6.10-rc1


>>>Now, that's fine - the USB merge is likely to be ok, so try doing
>>>
>>>	bk undo -a1.2462

i did so, 1.2463 went away, building as usual - but the oops persists :(

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-debug_oops-a1.2462.txt

> 
> for now i appreciate your work here but i have to postpone the "bk
> revtool" stuff because i have no X _and_ bk here. (but i'm a good student
> and will do my homework)

...in progress...

>>>
>>>	bk set -n -d -r1.2462 -r1.2463 | bk -R prs -h -d'<:P: <at> :HOST:>\n$each(:C:){\t(:C:)\n}\n' -
>>>
>>>which is black magic that does a set operation and shows all the changes 
>>>in between the sets of "bk at 1.2462" and "bk at 1.2463".

hm, i guess this has to wait now.

>>>Looking at the list (appended), I don't see anything obvious, but hey, if 
>>>it was obvious it wouldn't have been merged in the first place. 

yes, i'll look for changes regarding PCI. i've started to compile the -bk
snapshots too. there i can do less wrong things. when i have the "bad" -bk
snapshot i'll use "bk" itself again to find the detailed change leading to
[...]

Christian Kujau | 9 Nov 2004 02:31

Re: Oops in 2.6.10-rc1


ok, i've done some other things here and built kernels from
2.6.10-rc1-bk13 and all were giving the oops:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/config-2.6.10-rc1-bk13
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-debug_oops-2.6.10-rc1-bk13.txt

the config is the same config i am usually using, never gave me a
headache, new options (due to new kernel version) were left to default in
most cases. anyway - i've pulled again a recent tree, did
"bk undo -a1.2463" again but this time i stripped down my .config (via
menuconfig) to the absolute necessary things:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/config-2.6.10-rc1_a1.2463_take2

...and  it did *NOT* oops:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-no-oops-2.6.10-rc1_a1.2463.txt

i'll investigate further, building former -bk snapshots, using other
configs before i'll fiddle around with bk again (to get the smaller
changes). but this is a tomorrow thing, real life calls in :(

Thank you all so far,
Christian.
--
BOFH excuse #92:

Stale file handle (next time use Tupperware(tm)!)

Pekka Enberg | 9 Nov 2004 08:40

Re: Oops in 2.6.10-rc1

Hi,

On Tue, 09 Nov 2004 02:31:28 +0100, Christian Kujau <evil <at> g-house.de> wrote:
> the config is the same config i am usually using, never gave me a
> headache, new options (due to new kernel version) were left to default in
> most cases. anyway - i've pulled again a recent tree, did
> "bk undo -a1.2463" again but this time i stripped down my .config (via
> menuconfig) to the absolute necessary things:
> 
> http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/config-2.6.10-rc1_a1.2463_take2
> 
> ...and  it did *NOT* oops:
> 
> http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-no-oops-2.6.10-rc1_a1.2463.txt
> 
> i'll investigate further, building former -bk snapshots, using other
> configs before i'll fiddle around with bk again (to get the smaller
> changes). but this is a tomorrow thing, real life calls in :(

CONFIG_PREEMPT is one obvious candidate (you have that enabled in the
original config and disabled in the non-oopsing one).

                       Pekka

Christian Kujau | 9 Nov 2004 13:33

Re: Oops in 2.6.10-rc1


this damn thread is far too long already...

Pekka Enberg wrote:
> CONFIG_PREEMPT is one obvious candidate (you have that enabled in the
> original config and disabled in the non-oopsing one).

i've disabled *only* CONFIG_PREEMPT in another .config but it still oopses:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-debug_oops-2.6.10-rc1_no-preempt.txt
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/config-2.6.10-rc1_no-preempt.txt

2.6.9 with preempt enabled does not oops:
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/config-2.6.9_preempt.txt
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-no-oops_2.6.9_preempt.txt

i was a fool to test further -bk snapshots but it was kinda late yesterday
 and i was confused:

patch-2.6.9.bz2          -> 19-Oct-2004
patch-2.6.10-rc1.bz2     -> 23-Oct-2004 00:12
patch-2.6.10-rc1-bk1.bz2 -> 23-Oct-2004 13:34

2.6.9 is not oopsing *here*, plain 2.6.10-rc1 is oopsing. so i can *not*
use -bk snapshots any more and i will go on with BK (undo the ChangeSets
Linus told me about) and use different .configs now. sorry for the
confusion and especially sorry to my bk mentor: we seem to be so close to
the right ChangeSet and then i started to use *snapshots* again.

Thanks,

Christian Kujau | 9 Nov 2004 18:26

Re: Oops in 2.6.10-rc1 (almost solved)

On Tue, 09 Nov 2004 13:33:20 +0100, Christian Kujau wrote
> i've disabled *only* CONFIG_PREEMPT in another .config but it 
> still oopses:

at least i finally found the "bad" .config option: it's CONFIG_EDD.
when i disable this option (and only this option. i can use the same
.config as usual only disabling this very option. diff is my witness.)
i can boot a current (!) 2.6.10-rc1-bk and a working snd-ens1371!

i'll test with CONFIG_EDD=m later on. here a short summary:

2.6.9         CONFIG_EDD=y   - OK
2.6.10-rc1-bk CONFIG_EDD=y   - OOPS!
2.6.10-rc1-bk CONFIG_EDD=n   - OK
2.6.10-rc1-bk CONFIG_EDD=m   - ??

yes, i'll continue to find out the ChangeSet but now i (and perhaps you
too, if you are as curious as me) will know where to look at.
i must admit that i was not entirely sure why i wanted to enable
CONFIG_EDD at all. if i had never enabled it, it'd have saved me a week
of bug chasing, but learning is fun, too.

thanks,
Christian.
-- 
BOFH excuse #209:

Only people with names beginning with 'A' are getting mail this week (a
la Microsoft)

Linus Torvalds | 9 Nov 2004 19:53

Re: Oops in 2.6.10-rc1 (almost solved)


On Tue, 9 Nov 2004, Christian Kujau wrote:
> 
> at least i finally found the "bad" .config option: it's CONFIG_EDD.
> when i disable this option (and only this option. i can use the same
> .config as usual only disabling this very option. diff is my witness.)
> i can boot a current (!) 2.6.10-rc1-bk and a working snd-ens1371!

Very strange. There's not a lot of stuff that affects EDD directly that I 
can see, but there is:

	ChangeSet <at> 1.2000.5.108, 2004-10-20 08:36:22-07:00, Matt_Domsch <at> dell.com
	  [PATCH] EDD: use EXTENDED READ command, add CONFIG_EDD_SKIP_MBR
	  
	  Some controller BIOSes have problems with the legacy int13 fn02 READ
	  SECTORS command.  int13 fn42 EXTENDED READ is used in preference by most
	  boot loaders today, so lets use that.  If EXTENDED READ fails or isn't
	  supported, fall back to READ SECTORS.
	  
	  This hopefully resolves the three reports of BIOSes which would either
	  long-pause (30+ seconds) or hang completely on the legacy READ SECTORS
	  command.
	  
	  This also adds CONFIG_EDD_SKIP_MBR to eliminate reading the MBR on each
	  BIOS-presented disk, in case there are further problems in this area.
	  
	  Signed-off-by: Matt Domsch <Matt_Domsch <at> dell.com>
	  Signed-off-by: Andrew Morton <akpm <at> osdl.org>
	  Signed-off-by: Linus Torvalds <torvalds <at> osdl.org>
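
The fallback order described in the changeset comment (EXTENDED READ first, then the legacy READ SECTORS) can be sketched in C. The real probe runs as real-mode assembly against the BIOS; the read_fn callback type, the stub functions and read_first_sector() below are hypothetical names used only to show the control flow, not the actual EDD code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reader callback: returns 0 on success, non-zero on failure. */
typedef int (*read_fn)(unsigned char drive, unsigned char *buf);

/* Which path produced the data, if any. */
enum read_path { READ_NONE = 0, READ_EXTENDED = 1, READ_LEGACY = 2 };

/* Try the extended read first; if it is unsupported (NULL) or fails,
 * fall back to the legacy read. */
static enum read_path read_first_sector(read_fn extended_read,
                                        read_fn read_sectors,
                                        unsigned char drive,
                                        unsigned char *buf)
{
    assert(buf != NULL);
    if (extended_read != NULL && extended_read(drive, buf) == 0)
        return READ_EXTENDED;
    if (read_sectors != NULL && read_sectors(drive, buf) == 0)
        return READ_LEGACY;
    return READ_NONE;
}

/* Stub "BIOS calls" for illustration only. */
static int stub_ok(unsigned char drive, unsigned char *buf)
{
    (void)drive;
    buf[0] = 0x55;      /* pretend some data was read */
    return 0;
}

static int stub_fail(unsigned char drive, unsigned char *buf)
{
    (void)drive;
    (void)buf;
    return -1;
}
```

Passing stub_fail as the extended reader and stub_ok as the legacy reader exercises the fallback path; passing NULL for the extended reader models a BIOS that does not support it.
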

[...]

Christian Kujau | 10 Nov 2004 00:30

Re: Oops in 2.6.10-rc1 (almost solved)


Linus Torvalds wrote:
> 
> Very strange. There's not a lot of stuff that affects EDD directly that I 
> can see, but there is:
> 
> 	ChangeSet <at> 1.2000.5.108, 2004-10-20 08:36:22-07:00, Matt_Domsch <at> dell.com
> 	  [PATCH] EDD: use EXTENDED READ command, add CONFIG_EDD_SKIP_MBR

and i say: good catch! that does it!

i did "bk undo -a1.2000.5.108" on a current tree, booting this still gives
an oops:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.9_a1.2000.5.108.txt

excluding this single ChangeSet with "bk undo -r1.2118" does work with
CONFIG_EDD=y:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.9_r1.2000.5.108.txt

(the filename here should really read "...r1.2118.txt" because that was
the number of the changeset representing the above [PATCH] *after* i did
"bk undo -a1.2000.5.108". right?)

> However, even that would just change the EDD _data_, it doesn't change the 
> code that actually runs in the kernel. And I _really_ don't see what EDD 
> has got to do with anything.

understanding a lot less of all this than you guys i also wonder why only
[...]

Matt Domsch | 10 Nov 2004 00:40

Re: Oops in 2.6.10-rc1 (almost solved)

On Wed, Nov 10, 2004 at 12:30:21AM +0100, Christian Kujau wrote:
> > 	ChangeSet <at> 1.2000.5.108, 2004-10-20 08:36:22-07:00, Matt_Domsch <at> dell.com
> > 	  [PATCH] EDD: use EXTENDED READ command, add CONFIG_EDD_SKIP_MBR
> 
> and i say: good catch! that does it!
> 
> i did "bk undo -a1.2000.5.108" on a current tree, booting this still gives
> an oops:
> 
> http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.9_a1.2000.5.108.txt
> 
> excluding this single ChangeSet with "bk undo -r1.2118" does work with
> CONFIG_EDD=y:
> 
> http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.9_r1.2000.5.108.txt

OK, thanks, that helps.  From the diff of those dmesg:

-BIOS EDD facility v0.16 2004-Jun-25, 16 devices found
+BIOS EDD facility v0.16 2004-Jun-25, 6 devices found

So with the latest EDD patch noted above, it's finding more disks than
before.  How many disks do you actually have in the system?

I'll review the assembly again to see where I could have miscounted,
and see how that may affect the EDD sysfs exports.  Likely no answer
from me before tomorrow though.

Thanks,
Matt

Matt Domsch | 11 Nov 2004 23:43

Re: Oops in 2.6.10-rc1 (almost solved)

On Tue, Nov 09, 2004 at 05:40:54PM -0600, Matt Domsch wrote:
> OK, thanks, that helps.  From the diff of those dmesg:
> 
> -BIOS EDD facility v0.16 2004-Jun-25, 16 devices found
> +BIOS EDD facility v0.16 2004-Jun-25, 6 devices found

As Linus points out, those are the magic numbers in EDD for number of
device entries stored.  Your BIOS seems to be reporting that it has
more devices than it does, or the EDD assembly is horked in a way I
have not yet deciphered.

> I'll review the assembly again to see where I could have miscounted,
> and see how that may affect the EDD sysfs exports.  Likely no answer
> from me before tomorrow though.

I haven't been able to find a solution to your problem yet, and given
some external time constraints I've got, won't be able to look into
this again for another week or more.

Thanks,
Matt

-- 
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions linux.dell.com & www.dell.com/linux
Linux on Dell mailing lists  <at>  http://lists.us.dell.com

Christian Kujau | 12 Nov 2004 01:27

Re: Oops in 2.6.10-rc1 (almost solved)


Matt Domsch wrote:
> 
> As Linus points out, those are the magic numbers in EDD for number of
> device entries stored.  Your BIOS seems to be reporting that it has
> more devices than it does, or the EDD assembly is horked in a way I
> have not yet deciphered.

actually, my BIOS is even too old for e.g. ACPI, with the latest firmware
installed. i had no issues so far with the board/bios, but perhaps this is
no longer true. however, it's still strange that this thing is only
triggered with your change and CONFIG_EDD=y.

> 
> I haven't been able to find a solution to your problem yet, and given
> some external time constraints I've got, won't be able to look into
> this again for another week or more.

nevermind then. as nobody else seems to be bothered by this i am happy with
the workaround (CONFIG_EDD=n) and since the lkml-archives exist we could
get back to it when it's bothering more people (n>1)

thank you for your time,
Christian.
--
BOFH excuse #396:

Mail server hit by UniSpammer.

Linus Torvalds | 12 Nov 2004 01:49

Re: Oops in 2.6.10-rc1 (almost solved)


On Fri, 12 Nov 2004, Christian Kujau wrote:
> 
> nevermind then. as nobody else seems to be bothered by this i am happy with
> the workaround (CONFIG_EDD=n) and since the lkml-archives exist we could
> get back to it when it's bothering more people (n>1)

The problem with that approach is that very few people are willing to 
spend the time and effort to really try to figure out where the problem 
triggers for them. Thanks again for testing lots of kernels, and different 
configurations.

Basically, if it's a problem that only happens for a smallish percentage
of people, and an even smaller percentage of those is willing to dig down
and find it, it's not a problem we can afford to ignore. Ignoring it just
means that there will be "a few" error reports that we will either waste
time on, or (even worse) we'll dismiss as "known problems" and then
possibly miss _another_ bug.

This is why I take random unexplained (but pinpointed) problems so 
seriously. If it wasn't as apparently random, we could file it under 
"known problem" and decide to try to fix it later. As it is, it's filed 
under "known cause", but since we don't know _why_, it might cause totally 
different problems on another machine, and that just makes it too painful 
for words. 

So the changeset is reverted for now in the current -bk tree, and I'll 
make a -rc2 this weekend and hope that we can stabilize for 2.6.10.

		Linus

Christian Kujau | 12 Nov 2004 02:27

Re: Oops in 2.6.10-rc1 (almost solved)


Linus Torvalds wrote:
> 
> This is why I take random unexplained (but pinpointed) problems so 
> seriously. If it wasn't as apparently random, we could file it under 
> "known problem" and decide to try to fix it later. As it is, it's filed 
> under "known cause", but since we don't know _why_, it might cause totally 
> different problems on another machine, and that just makes it too painful 
> for words. 

just after sending my last mail i too (re)thought about this and i'd have
begged Matt to revert the patch if it was not *only* me having this issue.

but i can see your point here and i appreciate your decision.

> So the changeset is reverted for now in the current -bk tree, and I'll 
> make a -rc2 this weekend and hope that we can stabilize for 2.6.10.

yay!

thanks,
Christian.
--
BOFH excuse #96:

Vendor no longer supports the product

Linus Torvalds | 11 Nov 2004 23:53

Re: Oops in 2.6.10-rc1 (almost solved)


On Thu, 11 Nov 2004, Matt Domsch wrote:
> 
> I haven't been able to find a solution to your problem yet, and given
> some external time constraints I've got, won't be able to look into
> this again for another week or more.

Matt, I'll revert the EXTENDED READ change for now, then. The random
behaviour of the problem it causes makes me really dislike this bug, and
I'd like to release a -rc2 and start calming down the 2.6.10 stuff, but
having known random stuff happen really disturbs me.

We can re-do it once it's more obvious why it broke..

		Linus

Matt Domsch | 11 Nov 2004 23:55

Re: Oops in 2.6.10-rc1 (almost solved)

On Thu, Nov 11, 2004 at 02:53:15PM -0800, Linus Torvalds wrote:
> Matt, I'll revert the EXTENDED READ change for now, then. The random
> behaviour of the problem it causes makes me really dislike this bug, and
> I'd like to release a -rc2 and start calming down the 2.6.10 stuff, but
> having known random stuff happen really disturbs me.
> 
> We can re-do it once it's more obvious why it broke..

Good plan, thanks.

-- 
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions linux.dell.com & www.dell.com/linux
Linux on Dell mailing lists  <at>  http://lists.us.dell.com

Christian Kujau | 10 Nov 2004 01:21

Re: Oops in 2.6.10-rc1 (almost solved)


Matt Domsch wrote:
> 
> -BIOS EDD facility v0.16 2004-Jun-25, 16 devices found
> +BIOS EDD facility v0.16 2004-Jun-25, 6 devices found
> 
> So with the latest EDD patch noted above, it's finding more disks than
> before.  How many disks do you actually have in the system?

i have one scsi disk (sda) and two atapi cdrom drives:

hda: CRD-8483B, ATAPI CD/DVD-ROM drive
hdb: AOPEN CD-RW CRW3248 1.17 20020620, ATAPI CD/DVD-ROM drive
...
SCSI device sda: 35548320 512-byte hdwr sectors (18201 MB)
SCSI device sda: drive cache: write back

the "scsi0 : sym-2.1.18k" is on a pci card, the atapi devices are
connected onboard. if it helps:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/lspci-v.txt
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/lspci-vv.txt

> I'll review the assembly again to see where I could have miscounted,
> and see how that may affect the EDD sysfs exports.  Likely no answer
> from me before tomorrow though.

that's ok, real life kicks in here too...

thanks,

Linus Torvalds | 10 Nov 2004 02:01

Re: Oops in 2.6.10-rc1 (almost solved)


On Wed, 10 Nov 2004, Christian Kujau wrote:

> Matt Domsch wrote:
> > 
> > -BIOS EDD facility v0.16 2004-Jun-25, 16 devices found
> > +BIOS EDD facility v0.16 2004-Jun-25, 6 devices found
> > 
> > So with the latest EDD patch noted above, it's finding more disks than
> > before.  How many disks do you actually have in the system?
> 
> i have one scsi disk (sda) and two atapi cdrom drives:

Interestingly, "16" is also EDD_MBR_SIG_MAX, so my suspicion is that it 
overflowed some EDD data area. edd_num_devices() (which is what reports 
the above number) does

	min_t(unsigned char,
		max_t(unsigned char, edd.edd_info_nr, edd.mbr_signature_nr),
		max_t(unsigned char, EDD_MBR_SIG_MAX, EDDMAXNR));

where EDDMAXNR is 6, and EDD_MBR_SIG_MAX is the afore-mentioned 16, so we 
know that either edd.edd_info_nr or edd.mbr_signature_nr is actually 
_bigger_ than 16.

Which is clearly totally bogus. In fact, even your old "6 devices found" 
thing looks suspiciously bogus.
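
The clamp in the min_t/max_t expression above can be reproduced in plain C to see which inputs yield which reported count. This is a sketch, not the kernel source: max_uc(), min_uc() and num_devices_sketch() are hypothetical stand-ins for the kernel macros and for edd_num_devices(), and the two constants use the values stated in the message.

```c
#include <assert.h>

#define EDDMAXNR        6   /* value as stated in the message above */
#define EDD_MBR_SIG_MAX 16  /* value as stated in the message above */

/* Plain-C stand-ins for max_t(unsigned char, ...) and min_t(unsigned char, ...). */
static unsigned char max_uc(unsigned char a, unsigned char b)
{
    return a > b ? a : b;
}

static unsigned char min_uc(unsigned char a, unsigned char b)
{
    return a < b ? a : b;
}

/* Same shape as the quoted expression: take the larger of the two
 * counters, then cap it at the larger of the two table sizes. */
static unsigned char num_devices_sketch(unsigned char edd_info_nr,
                                        unsigned char mbr_signature_nr)
{
    unsigned char found = max_uc(edd_info_nr, mbr_signature_nr);
    unsigned char cap = max_uc(EDD_MBR_SIG_MAX, EDDMAXNR);

    assert(cap == EDD_MBR_SIG_MAX);  /* the cap is always 16 with these constants */
    return min_uc(found, cap);
}
```

With these definitions the function can only return 16 when at least one counter has reached the cap, and a result of 6 means the larger counter is exactly 6; neither figure matches a machine with one disk and two CD-ROM drives.
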
[...]

Greg KH | 9 Nov 2004 20:04

[PATCH] kobject: fix double kobject_put() in error path of kobject_add()

This fixes a problem introduced in the previous set of driver model
changes that has been seen by a lot of people (most notably the greater
than 256 pty users, but others might also be hitting this without
realizing it.)

Also add a comment so we don't try to "fix" this again.

Signed-off-by: Greg Kroah-Hartman <greg <at> kroah.com>

--- a/lib/kobject.c	2004-11-05 10:06:33 -08:00
+++ b/lib/kobject.c	2004-11-08 23:58:02 -08:00
@@ -181,10 +181,10 @@ int kobject_add(struct kobject * kobj)

 	error = create_dir(kobj);
 	if (error) {
+		/* unlink does the kobject_put() for us */
 		unlink(kobj);
 		if (parent)
 			kobject_put(parent);
-		kobject_put(kobj);
 	} else {
 		kobject_hotplug(kobj, KOBJ_ADD);
 	}
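
Why the removed kobject_put() was one put too many can be shown with a toy reference counter. This is a hedged model, not the kernel's kobject code: struct obj and the obj_*() helpers are hypothetical, and the sketch assumes that the add path takes its own reference before trying to create the directory and that unlink drops that reference, as the comment added by the patch says.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted object. */
struct obj {
    int refcount;
};

static void obj_get(struct obj *o) { o->refcount++; }
static void obj_put(struct obj *o) { o->refcount--; }

/* Model of unlink(): drops the reference taken by the add path. */
static void obj_unlink(struct obj *o)
{
    obj_put(o);
}

/* Model of the add path. create_dir_error simulates a failing
 * create_dir(); extra_put selects the pre-patch behaviour. */
static int obj_add(struct obj *o, int create_dir_error, int extra_put)
{
    assert(o != NULL);
    obj_get(o);                 /* assumption: add takes its own reference */
    if (create_dir_error) {
        obj_unlink(o);          /* already drops that reference */
        if (extra_put)
            obj_put(o);         /* the line the patch removes */
    }
    return create_dir_error;
}
```

Starting from a caller-held count of 1, a failed add with the extra put ends at 0, so the object would be released while the caller still holds it; without the extra put the count returns to 1.
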

Linus Torvalds | 9 Nov 2004 20:09

Re: [PATCH] kobject: fix double kobject_put() in error path of kobject_add()


On Tue, 9 Nov 2004, Greg KH wrote:
>
> This fixes a problem introduced in the previous set of driver model
> changes that has been seen by a lot of people (most notably the greater
> than 256 pty users, but others might also be hitting this without
> realizing it.)

Ahh.. Christian, pls test this one.

		Linus

Christian Kujau | 9 Nov 2004 23:06

Re: [PATCH] kobject: fix double kobject_put() in error path of kobject_add()


i'm sorry to say that it did not help:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.10-rc1_edd__kobject_put.txt

i'll go on and try to exclude

ChangeSet <at> 1.2000.5.108, 2004-10-20 08:36:22-07:00, Matt_Domsch <at> dell.com
	  [PATCH] EDD: use EXTENDED READ command, add CONFIG_EDD_SKIP_MBR

(or just test /pub/linux/kernel/v2.6/snapshots/old/patch-2.6.9-bk*.gz ...)

thanks,
Christian.
--
BOFH excuse #200:

The monitor needs another box of pixels.

Greg KH | 9 Nov 2004 20:08

Re: [PATCH] kobject: fix double kobject_put() in error path of kobject_add()

On Tue, Nov 09, 2004 at 11:04:21AM -0800, Greg KH wrote:
> This fixes a problem introduced in the previous set of driver model
> changes that has been seen by a lot of people (most notably the greater
> than 256 pty users, but others might also be hitting this without
> realizing it.)
> 
> Also add a comment so we don't try to "fix" this again.
> 
> Signed-off-by: Greg Kroah-Hartman <greg <at> kroah.com>

Christian, I don't know if this patch explicitly fixes your problem, but
it fixes problems other people have been having with the driver core
lately.  I'd appreciate it if you could test it out and let me know if
it solves your problem, with CONFIG_EDD enabled, or if it doesn't help
at all.

thanks,

greg k-h

Christian Kujau | 9 Nov 2004 22:31

Re: [PATCH] kobject: fix double kobject_put() in error path of kobject_add()


Greg KH wrote:
> lately.  I'd appreciate it if you could test it out and let me know if
> it solves your problem, with CONFIG_EDD enabled, or if it doesn't help
> at all.

please ignore my first mail (the part about not being able to patch), it's
already in BK i can see now, sorry.

compiling now...

--
BOFH excuse #22:

monitor resolution too high

Christian Kujau | 9 Nov 2004 22:21

Re: [PATCH] kobject: fix double kobject_put() in error path of kobject_add()


Greg KH wrote:
> 
> Christian, I don't know if this patch explicitly fixes your problem, but
> it fixes problems other people have been having with the driver core
> lately.  I'd appreciate it if you could test it out and let me know if
> it solves your problem, with CONFIG_EDD enabled, or if it doesn't help
> at all.
> 

yes, i'll do so and test the patch. is this in current -BK yet? because
applying your patch [1] to 2.6.10-rc1 gives:

Hunk #1 FAILED at 181.
1 out of 1 hunk FAILED -- saving rejects to file lib/kobject.c.rej

i've done a few other things before, let me just post the results before i
go on with your suggestions:

i've compiled a recent (BK) 2.6.10-rc1 again with CONFIG_EDD=m|y|n

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/config-2.6.10-rc1_edd-modular.txt
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/config-2.6.10-rc1_edd.txt
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/config-2.6.10-rc1_no-edd.txt

the results:

http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.10-rc1_edd-modular.txt
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.10-rc1_edd.txt
http://www.nerdbynature.de/bits/prinz/2.6.10-rc1/dmesg-2.6.10-rc1_no-edd.txt

Pekka Enberg | 9 Nov 2004 21:19

Re: [PATCH] kobject: fix double kobject_put() in error path of kobject_add()

Hi Greg,

On Tue, 9 Nov 2004 11:08:09 -0800, Greg KH <greg <at> kroah.com> wrote:
> Christian, I don't know if this patch explicitly fixes your problem, but
> it fixes problems other people have been having with the driver core
> lately.  I'd appreciate it if you could test it out and let me know if
> it solves your problem, with CONFIG_EDD enabled, or if it doesn't help
> at all.

The broken kobject_add fix is not in -rc1 proper which oopses on
Christian's machine. I don't think this patch has anything to do with
his problem.

                               Pekka
Linus Torvalds | 9 Nov 2004 02:05

Re: Oops in 2.6.10-rc1


On Tue, 9 Nov 2004, Christian Kujau wrote:
> 
> >>>Looking at the list (appended), I don't see anything obvious, but hey, if 
> >>>it was obvious it wouldn't have been merged in the first place. 
> 
> yes, i'll look for changes regarding PCI. i've started to compile the -bk
> snapshots too. there i can do less wrong things. when i have the "bad" -bk
> snapshot i'll use "bk" itself again to find the detailed change leading to
> the oops.

Actually, looking a bit closer, I think the PCI merge we just looked at 
was the PCI merge that happened _after_ 2.6.10-rc1. And since 2.6.10-rc1 
already oopsed for you, it shouldn't be an issue.

I think the _real_ PCI merge we should have looked at is:

	ChangeSet <at> 1.2000.1.7, 2004-10-19 16:59:19-07:00, torvalds <at> ppc970.osdl.org
	  Merge PCI updates

and in particular, that merged the PCI changes from

	ChangeSet <at> 1.1988.2.81, 2004-10-19 14:48:04-07:00, greg <at> kroah.com
	  PCI: fix up pci_save/restore_state in via-agp due to api change.
	  
	  Signed-off-by: Greg Kroah-Hartman <greg <at> kroah.com>

with my pre-PCI-merge tree at:

	ChangeSet <at> 1.2000.1.6, 2004-10-19 15:06:19-07:00, torvalds <at> ppc970.osdl.org

Christian Kujau | 9 Nov 2004 02:41

Re: Oops in 2.6.10-rc1


Linus Torvalds schrieb:
> 
> So what I'd like you to do is to take the pre-PCI-merge tree, and see if 
> that works for you
> 
> 	# assuming a 2.6.10-rc1 tree
> 	bk undo -a1.2000.1.6
> 
> and if that works, then try the post-PCI-merge tree:
> 
> 	# assuming a 2.6.10-rc1 tree
> 	bk undo -a1.2000.1.7
> 
> (I just checked: the above numbers are actually valid even in the current
> -bk tree, so you don't have to first go to 2.6.10-rc1, you can just start 
> from a current tree)

thanks, Linus. i'll do all this tomorrow, see my other mail i just sent.
i'll definitely do all this 'cause i'm really curious about this thing.
(it's not even the need of sound any more. heck, i could just put in
another soundcard but that'd be too easy :)

> 
> Thanks for testing, and sorry for the confusion with the more recent PCI 
> merge.

doh, you can't imagine how thankful i am for your (and the other people's!)
help here. but don't waste too many cycles on this weird issue here. if it
does not break for a million users out there now - why bother at all?

Christian Kujau | 10 Nov 2004 01:12

Re: Oops in 2.6.10-rc1


Linus Torvalds schrieb:
> 
> Now, if you want to get _really_ fancy, you can now look at each changeset 
> that differed, with something like
> 
> 	bk set -n -d -r1.2462 -r1.2463 | bk -R prs -h -d'<:P: <at> :HOST:>\n$each(:C:){\t(:C:)\n}\n' -
> 
> which is black magic that does a set operation and shows all the changes 
> in between the sets of "bk at 1.2462" and "bk at 1.2463".
> 
> (This is _not_ the same as "bk changes -r1.2462..1.2463", because that one 
> just shows the single merge change that is on the direct _path_ from one 
> changeset to another. The black magic thing shows the set difference of 
> changesets that comes from the full graph at two points).

hm, i still fail to see the "magic" part here. from a current tree i get:

---------------
$ bk set -n -d -r1.2000.5.107 -r1.2000.5.108 | bk -R prs -h \
-d'<:P: <at> :HOST:>\n$each(:C:){\t(:C:)\n}\n' - | head -n5
<Matt_Domsch <at> dell.com>
  [PATCH] EDD: use EXTENDED READ command, add CONFIG_EDD_SKIP_MBR

  Some controller BIOSes have problems with the legacy int13 fn02 READ
  SECTORS command.  int13 fn42 EXTENDED READ is used in preference by most
---------------

which looks similar to the next one, but with "bk changes" i get the
ChangeSet number again:

Linus Torvalds | 10 Nov 2004 01:23

Re: Oops in 2.6.10-rc1


On Wed, 10 Nov 2004, Christian Kujau wrote:
> > 
> > 	bk set -n -d -r1.2462 -r1.2463 | bk -R prs -h -d'<:P: <at> :HOST:>\n$each(:C:){\t(:C:)\n}\n' -
> > 
> > which is black magic that does a set operation and shows all the changes 
> > in between the sets of "bk at 1.2462" and "bk at 1.2463".
> 
> hm, i still fail to see the "magic" part here. from a current tree i get:

You don't see any magic, unless there are merges involved. And you've 
already narrowed the thing down to a single non-merge changeset, at which 
point the "magic" way is just a very slow way of doing the same thing.

The magic hits you only when you have non-trivial merges, in which case 
the set operation shows you more than the "just walk from one top-of-tree 
to the other".

		Linus
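
The distinction Linus draws above, between the set difference of two trees and a walk along the direct path between them, can be illustrated with a toy changeset graph. This is only a sketch under stated assumptions: the graph, the changeset names, and the helper functions are hypothetical, and nothing here claims to reflect how BitKeeper computes its sets internally.

```python
# Toy changeset graph: each changeset maps to its list of parents.
# All names are hypothetical; they are not real BitKeeper revisions.
PARENTS = {
    "base": [],
    "trunk-1": ["base"],
    "side-1": ["base"],
    "side-2": ["side-1"],
    "merge": ["trunk-1", "side-2"],  # first parent is the trunk
}


def ancestors(rev, parents=PARENTS):
    """Return rev plus everything reachable through any parent link."""
    seen = set()
    stack = [rev]
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(parents[cur])
    return seen


def set_difference(old, new, parents=PARENTS):
    """Changesets present in the tree at `new` but absent at `old`."""
    return ancestors(new, parents) - ancestors(old, parents)


def direct_path(old, new, parents=PARENTS):
    """Changesets met when walking first parents from `new` back to `old`."""
    path = []
    cur = new
    while cur != old:
        path.append(cur)
        if not parents[cur]:
            raise ValueError(f"{old!r} is not on the first-parent path of {new!r}")
        cur = parents[cur][0]
    return path


if __name__ == "__main__":
    # Across a merge, the set difference also lists the merged-in changesets.
    print(sorted(set_difference("trunk-1", "merge")))
    # The direct path only lists the merge changeset itself.
    print(direct_path("trunk-1", "merge"))
```

In this toy graph the two views agree for a single non-merge step (`base` to `trunk-1`) and diverge only across `merge`, which matches the point that the "magic" shows up only when non-trivial merges are involved.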
