NeilBrown | 16 Apr 06:03 2014

[PATCH/RFC 00/19] Support loop-back NFS mounts

Loop-back NFS mounts are when the NFS client and server run on the
same host.

The use-case for this is a high availability cluster with shared
storage.  The shared filesystem is mounted on any one machine and
NFS-mounted on the others.
If the nfs server fails, some other node will take over that service,
and then it will have a loop-back NFS mount which needs to keep
working.

This patch set addresses the "keep working" part, and specifically
targets deadlocks and livelocks.
Allowing the fail-over itself to be deadlock free is a separate
challenge for another day.

The short description of how this works is:

deadlocks:
  - Elevate PF_FSTRANS to apply globally instead of just in NFS and XFS.
    PF_FSTRANS disables __GFP_FS in the same way that PF_MEMALLOC_NOIO
    disables __GFP_IO (a sketch of this masking follows the list).
  - Set PF_FSTRANS in nfsd when handling requests related to
    memory reclaim, or requests which could block requests related
    to memory reclaim.
  - Use lockdep to find all the consequent deadlocks that arise when
    some other thread allocates memory while holding a lock that nfsd
    might want.
  - Fix those other deadlocks by setting PF_FSTRANS or using GFP_NOFS
    as appropriate.
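To make the first point concrete, here is a minimal sketch of the masking
step (patch 04 implements this in a helper it names gfp_from_current();
the body below is illustrative, not the literal patch text):

static inline gfp_t gfp_from_current(gfp_t flags)
{
	/* PF_MEMALLOC_NOIO already strips __GFP_IO from any allocation the
	 * task makes; PF_FSTRANS is made to strip __GFP_FS in exactly the
	 * same way, so a task that sets it cannot recurse into filesystem
	 * reclaim from an allocation.
	 */
	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
		flags &= ~__GFP_IO;
	if (unlikely(current->flags & PF_FSTRANS))
		flags &= ~__GFP_FS;
	return flags;
}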


NeilBrown | 16 Apr 06:03 2014

[PATCH 01/19] Promote current_{set, restore}_flags_nested from xfs to global.

These are useful macros from xfs for modifying current->flags.
Other places in the kernel perform the same task in various different
ways.
This patch moves the macros from xfs to include/linux/sched.h and
changes all code which temporarily sets a current->flags flag to
use these macros.

This does not change functionality in any important way, but does fix a
few sites which assume that PF_FSTRANS is not already set and so
arbitrarily set and then clear it.  The new code is more careful and
will only clear it if it was previously clear.
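For reference, the promoted macros look roughly like this (modelled on the
xfs originals; the exact definitions in the patch may differ):

#define current_set_flags_nested(sp, f)		\
	(*(sp) = current->flags, current->flags |= (f))

#define current_restore_flags_nested(sp, f)	\
	(current->flags = ((current->flags & ~(f)) | (*(sp) & (f))))

A caller saves the old flags word, sets the bits it needs, and on restore
the saved copy ensures a bit is only cleared if it was clear beforehand.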

Signed-off-by: NeilBrown <neilb <at> suse.de>
---
 drivers/base/power/runtime.c    |    6 +++---
 drivers/block/nbd.c             |    6 +++---
 drivers/md/dm-bufio.c           |    6 +++---
 drivers/md/dm-ioctl.c           |    6 +++---
 drivers/mtd/nand/nandsim.c      |   28 ++++++++--------------------
 drivers/scsi/iscsi_tcp.c        |    6 +++---
 drivers/usb/core/hub.c          |    6 +++---
 fs/fs-writeback.c               |    5 +++--
 fs/xfs/xfs_linux.h              |    7 -------
 include/linux/sched.h           |   27 ++++++++-------------------
 kernel/softirq.c                |    6 +++---
 mm/migrate.c                    |    9 ++++-----
 mm/page_alloc.c                 |   10 ++++++----
 mm/vmscan.c                     |   10 ++++++----
 net/core/dev.c                  |    6 +++---
 net/core/sock.c                 |    6 +++---

NeilBrown | 16 Apr 06:03 2014

[PATCH 02/19] lockdep: lockdep_set_current_reclaim_state should save old value

Currently kswapd sets current->lockdep_reclaim_gfp but the first
memory allocation call will clear it.  So the setting does no good.
Thus the lockdep_set_current_reclaim_state call in kswapd() is
ineffective.

With this patch we always save the old value and then restore it,
so lockdep gets to properly check the locks that kswapd takes.
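With the changed prototypes shown below, the calling pattern in kswapd()
becomes roughly (sketch only):

	gfp_t old_mask;

	/* Enter reclaim context for lockdep, remembering whatever state
	 * was already recorded instead of clobbering it.
	 */
	old_mask = lockdep_set_current_reclaim_state(GFP_KERNEL);
	/* ... balance_pgdat() and other reclaim work ... */
	lockdep_restore_current_reclaim_state(old_mask);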

Signed-off-by: NeilBrown <neilb <at> suse.de>
---
 include/linux/lockdep.h  |    8 ++++----
 kernel/locking/lockdep.c |    8 +++++---
 mm/page_alloc.c          |    5 +++--
 mm/vmscan.c              |   10 ++++++----
 4 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 92b1bfc5da60..18eedd692d16 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -351,8 +351,8 @@ static inline void lock_set_subclass(struct lockdep_map *lock,
 	lock_set_class(lock, lock->name, lock->key, subclass, ip);
 }

-extern void lockdep_set_current_reclaim_state(gfp_t gfp_mask);
-extern void lockdep_clear_current_reclaim_state(void);
+extern gfp_t lockdep_set_current_reclaim_state(gfp_t gfp_mask);
+extern void lockdep_restore_current_reclaim_state(gfp_t old_mask);
 extern void lockdep_trace_alloc(gfp_t mask);


NeilBrown | 16 Apr 06:03 2014

[PATCH 03/19] lockdep: improve scenario messages for RECLAIM_FS errors.

lockdep can check for locking problems involving reclaim using
the same infrastructure as used for interrupts.

However a number of the messages still refer to interrupts even
if it was actually a reclaim-related problem.

So determine whether the problem was caused by reclaim or irq and adjust
messages accordingly.

Signed-off-by: NeilBrown <neilb <at> suse.de>
---
 kernel/locking/lockdep.c |   43 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 32 insertions(+), 11 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index e05b82e92373..33d2ac7519dc 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -1423,7 +1423,8 @@ static void
 print_irq_lock_scenario(struct lock_list *safe_entry,
 			struct lock_list *unsafe_entry,
 			struct lock_class *prev_class,
-			struct lock_class *next_class)
+			struct lock_class *next_class,
+			int reclaim)
 {
 	struct lock_class *safe_class = safe_entry->class;
 	struct lock_class *unsafe_class = unsafe_entry->class;
@@ -1455,20 +1456,27 @@ print_irq_lock_scenario(struct lock_list *safe_entry,
 		printk("\n\n");

Peter Zijlstra | 16 Apr 09:22 2014

Re: [PATCH 03/19] lockdep: improve scenario messages for RECLAIM_FS errors.

On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:
> lockdep can check for locking problems involving reclaim using
> the same infrastructure as used for interrupts.
> 
> However a number of the messages still refer to interrupts even
> if it was actually a reclaim-related problem.
> 
> So determine whether the problem was caused by reclaim or irq and adjust
> messages accordingly.
> 
> Signed-off-by: NeilBrown <neilb@...>
> ---
>  kernel/locking/lockdep.c |   43 ++++++++++++++++++++++++++++++++-----------
>  1 file changed, 32 insertions(+), 11 deletions(-)
> 
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index e05b82e92373..33d2ac7519dc 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -1423,7 +1423,8 @@ static void
>  print_irq_lock_scenario(struct lock_list *safe_entry,
>  			struct lock_list *unsafe_entry,
>  			struct lock_class *prev_class,
> -			struct lock_class *next_class)
> +			struct lock_class *next_class,
> +			int reclaim)

I would rather we just pass enum lock_usage_bit along from the
callsites.


NeilBrown | 16 Apr 06:03 2014

[PATCH 04/19] Make effect of PF_FSTRANS to disable __GFP_FS universal.

Currently both xfs and nfs will handle PF_FSTRANS by disabling
__GFP_FS.

Make this effect global by repurposing memalloc_noio_flags (which
does the same thing for PF_MEMALLOC_NOIO and __GFP_IO) to generally
impose the task flags on a gfp_t.
Due to this repurposing we change the name of memalloc_noio_flags
to gfp_from_current().

As PF_FSTRANS now uniformly removes __GFP_FS we can remove special
code for this from xfs and nfs.

As we can now expect other code to set PF_FSTRANS, its meaning is more
general, so the WARN_ON in xfs_vm_writepage() which checks PF_FSTRANS
is not set is no longer appropriate.  PF_FSTRANS may be set for other
reasons than an XFS transaction.

As lockdep cares about __GFP_FS, we need to translate PF_FSTRANS to
__GFP_FS before calling lockdep_trace_alloc() in various places.
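For that last point the change at each site is essentially a one-liner;
a sketch (not the literal hunks):

	/* Let lockdep judge the allocation by the mask that will actually
	 * be used, i.e. with __GFP_FS removed when PF_FSTRANS is set.
	 */
	lockdep_trace_alloc(gfp_from_current(gfp_mask));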

Signed-off-by: NeilBrown <neilb <at> suse.de>
---
 fs/nfs/file.c         |    3 +--
 fs/xfs/kmem.h         |    2 --
 fs/xfs/xfs_aops.c     |    7 -------
 include/linux/sched.h |    5 ++++-
 mm/page_alloc.c       |    3 ++-
 mm/slab.c             |    2 ++
 mm/slob.c             |    2 ++
 mm/slub.c             |    1 +

Dave Chinner | 16 Apr 07:37 2014

Re: [PATCH 04/19] Make effect of PF_FSTRANS to disable __GFP_FS universal.

On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:
> Currently both xfs and nfs will handle PF_FSTRANS by disabling
> __GFP_FS.
> 
> Make this effect global by repurposing memalloc_noio_flags (which
> does the same thing for PF_MEMALLOC_NOIO and __GFP_IO) to generally
> impose the task flags on a gfp_t.
> Due to this repurposing we change the name of memalloc_noio_flags
> to gfp_from_current().
> 
> As PF_FSTRANS now uniformly removes __GFP_FS we can remove special
> code for this from xfs and nfs.
> 
> As we can now expect other code to set PF_FSTRANS, its meaning is more
> general, so the WARN_ON in xfs_vm_writepage() which checks PF_FSTRANS
> is not set is no longer appropriate.  PF_FSTRANS may be set for other
> reasons than an XFS transaction.

So PF_FSTRANS no longer means "filesystem in transaction context".
Are you going to rename to match whatever it's meaning is now?
I'm not exactly clear on what it means now...

> As lockdep cares about __GFP_FS, we need to translate PF_FSTRANS to
> __GFP_FS before calling lockdep_trace_alloc() in various places.
> 
> Signed-off-by: NeilBrown <neilb@...>
....
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index 64db0e53edea..882b86270ebe 100644
> --- a/fs/xfs/kmem.h

NeilBrown | 16 Apr 08:17 2014

Re: [PATCH 04/19] Make effect of PF_FSTRANS to disable __GFP_FS universal.

On Wed, 16 Apr 2014 15:37:56 +1000 Dave Chinner <david <at> fromorbit.com> wrote:

> On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:
> > Currently both xfs and nfs will handle PF_FSTRANS by disabling
> > __GFP_FS.
> > 
> > Make this effect global by repurposing memalloc_noio_flags (which
> > does the same thing for PF_MEMALLOC_NOIO and __GFP_IO) to generally
> > impose the task flags on a gfp_t.
> > Due to this repurposing we change the name of memalloc_noio_flags
> > to gfp_from_current().
> > 
> > As PF_FSTRANS now uniformly removes __GFP_FS we can remove special
> > code for this from xfs and nfs.
> > 
> > As we can now expect other code to set PF_FSTRANS, its meaning is more
> > general, so the WARN_ON in xfs_vm_writepage() which checks PF_FSTRANS
> > is not set is no longer appropriate.  PF_FSTRANS may be set for other
> > reasons than an XFS transaction.
> 
> So PF_FSTRANS no longer means "filesystem in transaction context".
> Are you going to rename to match whatever it's meaning is now?
> I'm not exactly clear on what it means now...

I did consider renaming it to "PF_MEMALLOC_NOFS" as it is similar to
"PF_MEMALLOC_NOIO", except that it disables __GFP_FS rather than __GFP_IO.
Maybe I should go ahead with that.

> 
> 

NeilBrown | 17 Apr 03:03 2014

Re: [PATCH 04/19] Make effect of PF_FSTRANS to disable __GFP_FS universal.

On Wed, 16 Apr 2014 16:17:26 +1000 NeilBrown <neilb@...> wrote:

> On Wed, 16 Apr 2014 15:37:56 +1000 Dave Chinner <david@...> wrote:
> 
> > On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:

> > > -	/*
> > > -	 * Given that we do not allow direct reclaim to call us, we should
> > > -	 * never be called while in a filesystem transaction.
> > > -	 */
> > > -	if (WARN_ON(current->flags & PF_FSTRANS))
> > > -		goto redirty;
> > 
> > We still need to ensure this rule isn't broken. If it is, the
> > filesystem will silently deadlock in delayed allocation rather than
> > gracefully handle the problem with a warning....
> 
> Hmm... that might be tricky.  The 'new' PF_FSTRANS can definitely be set when
> xfs_vm_writepage is called and we really want the write to happen.
> I don't suppose there is any other way to detect if a transaction is
> happening?

I've been thinking about this some more....

That code is in xfs_vm_writepage which is only called as ->writepage.
xfs never calls that directly so it could only possibly be called during
reclaim?

We know that doesn't happen, but if it does then PF_MEMALLOC would be set,
but PF_KSWAPD would not... and you already have a test for that.

Dave Chinner | 17 Apr 06:41 2014

Re: [PATCH 04/19] Make effect of PF_FSTRANS to disable __GFP_FS universal.

On Thu, Apr 17, 2014 at 11:03:50AM +1000, NeilBrown wrote:
> On Wed, 16 Apr 2014 16:17:26 +1000 NeilBrown <neilb <at> suse.de> wrote:
> 
> > On Wed, 16 Apr 2014 15:37:56 +1000 Dave Chinner <david <at> fromorbit.com> wrote:
> > 
> > > On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:
> 
> > > > -	/*
> > > > -	 * Given that we do not allow direct reclaim to call us, we should
> > > > -	 * never be called while in a filesystem transaction.
> > > > -	 */
> > > > -	if (WARN_ON(current->flags & PF_FSTRANS))
> > > > -		goto redirty;
> > > 
> > > We still need to ensure this rule isn't broken. If it is, the
> > > filesystem will silently deadlock in delayed allocation rather than
> > > gracefully handle the problem with a warning....
> > 
> > Hmm... that might be tricky.  The 'new' PF_FSTRANS can definitely be set when
> > xfs_vm_writepage is called and we really want the write to happen.
> > I don't suppose there is any other way to detect if a transaction is
> > happening?
> 
> I've been thinking about this some more....
> 
> That code is in xfs_vm_writepage which is only called as ->writepage.
> xfs never calls that directly so it could only possibly be called during
> reclaim?

    __filemap_fdatawrite_range or __writeback_single_inode

NeilBrown | 16 Apr 06:03 2014

[PATCH 05/19] SUNRPC: track whether a request is coming from a loop-back interface.

If an incoming NFS request is coming from the local host, then
nfsd will need to perform some special handling.  So detect that
possibility and make the source visible in rq_local.
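A sketch of the shape of the change (XPT_LOCAL is an assumed name, and the
loop-back test shown is only an illustration of how a local peer might be
recognised; the patch's actual detection may differ):

	/* When the connection is set up (IPv4 shown; sin is the peer's
	 * sockaddr_in): remember on the transport that the peer is local.
	 */
	if (ipv4_is_loopback(sin->sin_addr.s_addr))
		set_bit(XPT_LOCAL, &xprt->xpt_flags);

	/* When each request arrives, make that visible to nfsd: */
	rqstp->rq_local = test_bit(XPT_LOCAL, &rqstp->rq_xprt->xpt_flags);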

Signed-off-by: NeilBrown <neilb <at> suse.de>
---
 include/linux/sunrpc/svc.h      |    1 +
 include/linux/sunrpc/svc_xprt.h |    1 +
 net/sunrpc/svcsock.c            |   10 ++++++++++
 3 files changed, 12 insertions(+)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 04e763221246..a0dbbd1e00e9 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -254,6 +254,7 @@ struct svc_rqst {
 	u32			rq_prot;	/* IP protocol */
 	unsigned short
 				rq_secure  : 1;	/* secure port */
+	unsigned short		rq_local   : 1;	/* local request */

 	void *			rq_argp;	/* decoded arguments */
 	void *			rq_resp;	/* xdr'd results */
diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index b05963f09ebf..b99bdfb0fcf9 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -63,6 +63,7 @@ struct svc_xprt {
 #define	XPT_DETACHED	10		/* detached from tempsocks list */
 #define XPT_LISTENER	11		/* listening endpoint */

Jeff Layton | 16 Apr 16:47 2014

Re: [PATCH 05/19] SUNRPC: track whether a request is coming from a loop-back interface.

On Wed, 16 Apr 2014 14:03:36 +1000
NeilBrown <neilb <at> suse.de> wrote:

> If an incoming NFS request is coming from the local host, then
> nfsd will need to perform some special handling.  So detect that
> possibility and make the source visible in rq_local.
> 
> Signed-off-by: NeilBrown <neilb <at> suse.de>
> ---
>  include/linux/sunrpc/svc.h      |    1 +
>  include/linux/sunrpc/svc_xprt.h |    1 +
>  net/sunrpc/svcsock.c            |   10 ++++++++++
>  3 files changed, 12 insertions(+)
> 
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 04e763221246..a0dbbd1e00e9 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -254,6 +254,7 @@ struct svc_rqst {
>  	u32			rq_prot;	/* IP protocol */
>  	unsigned short
>  				rq_secure  : 1;	/* secure port */
> +	unsigned short		rq_local   : 1;	/* local request */
>  
>  	void *			rq_argp;	/* decoded arguments */
>  	void *			rq_resp;	/* xdr'd results */
> diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
> index b05963f09ebf..b99bdfb0fcf9 100644
> --- a/include/linux/sunrpc/svc_xprt.h
> +++ b/include/linux/sunrpc/svc_xprt.h

NeilBrown | 17 Apr 01:25 2014

Re: [PATCH 05/19] SUNRPC: track whether a request is coming from a loop-back interface.

On Wed, 16 Apr 2014 10:47:02 -0400 Jeff Layton <jlayton <at> poochiereds.net>
wrote:

> On Wed, 16 Apr 2014 14:03:36 +1000
> NeilBrown <neilb <at> suse.de> wrote:
> 
> > If an incoming NFS request is coming from the local host, then
> > nfsd will need to perform some special handling.  So detect that
> > possibility and make the source visible in rq_local.
> > 
> > Signed-off-by: NeilBrown <neilb <at> suse.de>
> > ---
> >  include/linux/sunrpc/svc.h      |    1 +
> >  include/linux/sunrpc/svc_xprt.h |    1 +
> >  net/sunrpc/svcsock.c            |   10 ++++++++++
> >  3 files changed, 12 insertions(+)
> > 
> > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > index 04e763221246..a0dbbd1e00e9 100644
> > --- a/include/linux/sunrpc/svc.h
> > +++ b/include/linux/sunrpc/svc.h
> > @@ -254,6 +254,7 @@ struct svc_rqst {
> >  	u32			rq_prot;	/* IP protocol */
> >  	unsigned short
> >  				rq_secure  : 1;	/* secure port */
> > +	unsigned short		rq_local   : 1;	/* local request */
> >  
> >  	void *			rq_argp;	/* decoded arguments */
> >  	void *			rq_resp;	/* xdr'd results */
> > diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h

NeilBrown | 16 Apr 06:03 2014

[PATCH 06/19] nfsd: set PF_FSTRANS for nfsd threads.

If a localhost mount is present, then it is easy to deadlock NFS by
nfsd entering direct reclaim and calling nfs_release_page() which
requires nfsd to perform an fsync() (which it cannot do because it is
reclaiming memory).

By setting PF_FSTRANS we stop the memory allocator from ever
attempting any FS operation that could deadlock.

We need this flag set for any thread which is handling a request from
the local host, but we also need to always have it for at least 1 or 2
threads so that we don't end up with all threads blocked in allocation.

When we set PF_FSTRANS we also tell lockdep that we are handling
reclaim so that it can detect deadlocks for us.
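Ignoring the "always keep one or two threads unrestricted" refinement, each
nfsd thread ends up doing something like this around its request loop
(sketch):

	unsigned int pflags;
	gfp_t old_gfp;

	/* Opt this thread out of filesystem reclaim while it handles
	 * requests, and tell lockdep it is effectively in reclaim context
	 * so the locks it takes are checked against reclaim.
	 */
	current_set_flags_nested(&pflags, PF_FSTRANS);
	old_gfp = lockdep_set_current_reclaim_state(GFP_KERNEL);

	/* ... svc_recv() / svc_process() request handling ... */

	lockdep_restore_current_reclaim_state(old_gfp);
	current_restore_flags_nested(&pflags, PF_FSTRANS);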

Signed-off-by: NeilBrown <neilb <at> suse.de>
---
 fs/nfsd/nfssvc.c           |   18 ++++++++++++++++++
 include/linux/sunrpc/svc.h |    1 +
 net/sunrpc/svc.c           |    6 ++++++
 3 files changed, 25 insertions(+)

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 9a4a5f9e7468..6af8bc2daf7d 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -565,6 +565,8 @@ nfsd(void *vrqstp)
 	struct svc_xprt *perm_sock = list_entry(rqstp->rq_server->sv_permsocks.next, typeof(struct svc_xprt), xpt_list);
 	struct net *net = perm_sock->xpt_net;

Peter Zijlstra | 16 Apr 09:28 2014

Re: [PATCH 06/19] nfsd: set PF_FSTRANS for nfsd threads.

On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:
> If a localhost mount is present, then it is easy to deadlock NFS by
> nfsd entering direct reclaim and calling nfs_release_page() which
> requires nfsd to perform an fsync() (which it cannot do because it is
> reclaiming memory).
> 
> By setting PF_FSTRANS we stop the memory allocator from ever
> attempting any FS operation that could deadlock.
> 
> We need this flag set for any thread which is handling a request from
> the local host, but we also need to always have it for at least 1 or 2
> threads so that we don't end up with all threads blocked in allocation.
> 
> When we set PF_FSTRANS we also tell lockdep that we are handling
> reclaim so that it can detect deadlocks for us.
> 
> Signed-off-by: NeilBrown <neilb <at> suse.de>
> ---
>  fs/nfsd/nfssvc.c           |   18 ++++++++++++++++++
>  include/linux/sunrpc/svc.h |    1 +
>  net/sunrpc/svc.c           |    6 ++++++
>  3 files changed, 25 insertions(+)
> 
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index 9a4a5f9e7468..6af8bc2daf7d 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -565,6 +565,8 @@ nfsd(void *vrqstp)
>  	struct svc_xprt *perm_sock = list_entry(rqstp->rq_server->sv_permsocks.next, typeof(struct svc_xprt), xpt_list);

NeilBrown | 16 Apr 06:03 2014

[PATCH 07/19] nfsd and VM: use PF_LESS_THROTTLE to avoid throttle in shrink_inactive_list.

nfsd already uses PF_LESS_THROTTLE (and is the only user) to avoid
throttling while dirtying pages.  Use it also to avoid throttling while
doing direct reclaim as this can stall nfsd in the same way.

Also only set PF_LESS_THROTTLE when handling a 'write' request for a
local connection.  This is the only time when the throttling can cause
a problem.  In other cases we should throttle if the system is busy.
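The vfs.c side presumably ends up with something of this shape in the write
path (sketch; the vfs_writev() call and its arguments are borrowed from the
existing nfsd_vfs_write() and may not match the patch exactly):

	unsigned int pflags = 0;

	if (rqstp->rq_local)
		/* A write from the local client must not be throttled on
		 * dirty pages that only this very nfsd can clean.
		 */
		current_set_flags_nested(&pflags, PF_LESS_THROTTLE);

	host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &offset);

	if (rqstp->rq_local)
		current_restore_flags_nested(&pflags, PF_LESS_THROTTLE);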

Signed-off-by: NeilBrown <neilb <at> suse.de>
---
 fs/nfsd/nfssvc.c |    6 ------
 fs/nfsd/vfs.c    |    6 ++++++
 mm/vmscan.c      |    7 +++++--
 3 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 6af8bc2daf7d..cd24aa76e58d 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -593,12 +593,6 @@ nfsd(void *vrqstp)
 	nfsdstats.th_cnt++;
 	mutex_unlock(&nfsd_mutex);

-	/*
-	 * We want less throttling in balance_dirty_pages() so that nfs to
-	 * localhost doesn't cause nfsd to lock up due to all the client's
-	 * dirty pages.
-	 */
-	current->flags |= PF_LESS_THROTTLE;
 	set_freezable();

NeilBrown | 16 Apr 06:03 2014

[PATCH 08/19] Set PF_FSTRANS while write_cache_pages calls ->writepage

It is normally safe for direct reclaim to enter filesystems
even when a page is locked - as can happen if ->writepage
allocates memory with GFP_KERNEL (which xfs does).

However if a localhost NFS mount is present, then a flush-*
thread might hold a page locked and then in direct reclaim,
ask nfs to commit an inode (nfs_release_page).  When nfsd
performs the fsync it might try to lock the same page, which leads to
a deadlock.

A ->writepage should not allocate much memory, or do so very often, so
it is safe to set PF_FSTRANS, and this removes the possible deadlock.

This was not detected by lockdep as it doesn't monitor the page lock.
It was found as a real deadlock in testing.
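Given the pflags variable added in the (truncated) hunk below, the guarded
call is presumably of this shape (sketch):

			/* Whatever ->writepage allocates while this page is
			 * locked must not recurse into filesystem reclaim.
			 */
			current_set_flags_nested(&pflags, PF_FSTRANS);
			ret = (*writepage)(page, wbc, data);
			current_restore_flags_nested(&pflags, PF_FSTRANS);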

Signed-off-by: NeilBrown  <neilb <at> suse.de>
---
 mm/page-writeback.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7106cb1aca8e..572e70b9a3f7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1909,6 +1909,7 @@ retry:

 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
+			unsigned int pflags;

NeilBrown | 16 Apr 06:03 2014

[PATCH 09/19] XFS: ensure xfs_file_*_read cannot deadlock in memory allocation.

xfs_file_*_read holds an inode lock while calling a generic 'read'
function.  These functions perform read-ahead and are quite likely to
allocate memory.
So set PF_FSTRANS to ensure they avoid __GFP_FS and so don't recurse
into a filesystem to free memory.

This can be a problem with loop-back NFS mounts, if free_pages ends up
waiting in nfs_release_page(), and nfsd is blocked waiting for the lock
that this code holds.

This was found both by lockdep and as a real deadlock during testing.
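With the pflags variable added in the hunk below, the read path presumably
becomes (sketch):

	/* Readahead done by the generic read path can allocate memory;
	 * do it without __GFP_FS while the inode iolock is held.
	 */
	current_set_flags_nested(&pflags, PF_FSTRANS);
	ret = generic_file_aio_read(iocb, iovp, nr_segs, iocb->ki_pos);
	current_restore_flags_nested(&pflags, PF_FSTRANS);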

Signed-off-by: NeilBrown <neilb <at> suse.de>
---
 fs/xfs/xfs_file.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 64b48eade91d..88b33ef64668 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -243,6 +243,7 @@ xfs_file_aio_read(
 	ssize_t			ret = 0;
 	int			ioflags = 0;
 	xfs_fsize_t		n;
+	unsigned int		pflags;

 	XFS_STATS_INC(xs_read_calls);

@@ -290,6 +291,10 @@ xfs_file_aio_read(

Dave Chinner | 16 Apr 08:04 2014

Re: [PATCH 09/19] XFS: ensure xfs_file_*_read cannot deadlock in memory allocation.

On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:
> xfs_file_*_read holds an inode lock while calling a generic 'read'
> function.  These functions perform read-ahead and are quite likely to
> allocate memory.

Yes, that's what reading data from disk requires.

> So set PF_FSTRANS to ensure they avoid __GFP_FS and so don't recurse
> into a filesystem to free memory.

We already have that protection via the
> 
> This can be a problem with loop-back NFS mounts, if free_pages ends up
> waiting in nfs_release_page(), and nfsd is blocked waiting for the lock
> that this code holds.
> 
> This was found both by lockdep and as a real deadlock during testing.
> 
> Signed-off-by: NeilBrown <neilb <at> suse.de>
> ---
>  fs/xfs/xfs_file.c |   12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 64b48eade91d..88b33ef64668 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -243,6 +243,7 @@ xfs_file_aio_read(
>  	ssize_t			ret = 0;
>  	int			ioflags = 0;

NeilBrown | 16 Apr 08:27 2014

Re: [PATCH 09/19] XFS: ensure xfs_file_*_read cannot deadlock in memory allocation.

On Wed, 16 Apr 2014 16:04:59 +1000 Dave Chinner <david@...> wrote:

> On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:
> > xfs_file_*_read holds an inode lock while calling a generic 'read'
> > function.  These functions perform read-ahead and are quite likely to
> > allocate memory.
> 
> Yes, that's what reading data from disk requires.
> 
> > So set PF_FSTRANS to ensure they avoid __GFP_FS and so don't recurse
> > into a filesystem to free memory.
> 
> We already have that protection via the
> > 
> > This can be a problem with loop-back NFS mounts, if free_pages ends up
> > waiting in nfs_release_page(), and nfsd is blocked waiting for the lock
> > that this code holds.
> > 
> > This was found both by lockdep and as a real deadlock during testing.
> > 
> > Signed-off-by: NeilBrown <neilb@...>
> > ---
> >  fs/xfs/xfs_file.c |   12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 64b48eade91d..88b33ef64668 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -243,6 +243,7 @@ xfs_file_aio_read(

Dave Chinner | 16 Apr 08:31 2014

Re: [PATCH 09/19] XFS: ensure xfs_file_*_read cannot deadlock in memory allocation.

On Wed, Apr 16, 2014 at 04:04:59PM +1000, Dave Chinner wrote:
> On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:
> > xfs_file_*_read holds an inode lock while calling a generic 'read'
> > function.  These functions perform read-ahead and are quite likely to
> > allocate memory.
> 
> Yes, that's what reading data from disk requires.
> 
> > So set PF_FSTRANS to ensure they avoid __GFP_FS and so don't recurse
> > into a filesystem to free memory.
> 
> We already have that protection via the

Oops, stray paste. Ignore that comment.

-Dave.
--

-- 
Dave Chinner
david@...

NeilBrown | 16 Apr 06:03 2014

[PATCH 10/19] NET: set PF_FSTRANS while holding sk_lock

sk_lock can be taken while reclaiming memory (in nfsd for loop-back
NFS mounts, and presumably in nfs), and memory can be allocated while
holding sk_lock, at least via:

 inet_listen -> inet_csk_listen_start ->reqsk_queue_alloc

So to avoid deadlocks, always set PF_FSTRANS while holding sk_lock.

This deadlock was found by lockdep.
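Given the sk_pflags field added below, the intent is roughly the following
pair of added lines (sketch):

	/* In lock_sock(), once the lock is held: any allocation made while
	 * holding it must not wait on filesystem reclaim, which may itself
	 * need this lock.
	 */
	current_set_flags_nested(&sk->sk_pflags, PF_FSTRANS);

	/* In release_sock(), just before the lock is dropped: */
	current_restore_flags_nested(&sk->sk_pflags, PF_FSTRANS);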

Signed-off-by: NeilBrown <neilb <at> suse.de>
---
 include/net/sock.h |    1 +
 net/core/sock.c    |    2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index b9586a137cad..27c355637e44 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -324,6 +324,7 @@ struct sock {
 #define sk_v6_rcv_saddr	__sk_common.skc_v6_rcv_saddr

 	socket_lock_t		sk_lock;
+	unsigned int		sk_pflags; /* process flags before taking lock */
 	struct sk_buff_head	sk_receive_queue;
 	/*
 	 * The backlog queue is special, it is always used with
diff --git a/net/core/sock.c b/net/core/sock.c
index cf9bd24e4099..8bc677ef072e 100644

Eric Dumazet | 16 Apr 07:13 2014

Re: [PATCH 10/19] NET: set PF_FSTRANS while holding sk_lock

On Wed, 2014-04-16 at 14:03 +1000, NeilBrown wrote:
> sk_lock can be taken while reclaiming memory (in nfsd for loop-back
> NFS mounts, and presumably in nfs), and memory can be allocated while
> holding sk_lock, at least via:
> 
>  inet_listen -> inet_csk_listen_start ->reqsk_queue_alloc
> 
> So to avoid deadlocks, always set PF_FSTRANS while holding sk_lock.
> 
> This deadlock was found by lockdep.

Wow, this is adding expensive stuff in fast path, only for nfsd :(

BTW, why should the current->flags should be saved on a socket field,
and not a current->save_flags. This really looks a thread property, not
a socket one.

Why nfsd could not have PF_FSTRANS in its current->flags ?

For applications handling millions of sockets, this makes a difference.

NeilBrown | 16 Apr 07:47 2014

Re: [PATCH 10/19] NET: set PF_FSTRANS while holding sk_lock

On Tue, 15 Apr 2014 22:13:46 -0700 Eric Dumazet <eric.dumazet@...>
wrote:

> On Wed, 2014-04-16 at 14:03 +1000, NeilBrown wrote:
> > sk_lock can be taken while reclaiming memory (in nfsd for loop-back
> > NFS mounts, and presumably in nfs), and memory can be allocated while
> > holding sk_lock, at least via:
> > 
> >  inet_listen -> inet_csk_listen_start ->reqsk_queue_alloc
> > 
> > So to avoid deadlocks, always set PF_FSTRANS while holding sk_lock.
> > 
> > This deadlock was found by lockdep.
> 
> Wow, this is adding expensive stuff in fast path, only for nfsd :(

Yes, this was probably one part that I was least comfortable about.

> 
> BTW, why should the current->flags should be saved on a socket field,
> and not a current->save_flags. This really looks a thread property, not
> a socket one.
> 
> Why nfsd could not have PF_FSTRANS in its current->flags ?

nfsd does have PF_FSTRANS set in current->flags.  But some other processes
might not.

If any process takes sk_lock, allocates memory, and then blocks in reclaim it
could be waiting for nfsd.  If nfsd waits for that sk_lock, it would cause a

David Miller | 16 Apr 15:00 2014

Re: [PATCH 10/19] NET: set PF_FSTRANS while holding sk_lock

From: Eric Dumazet <eric.dumazet <at> gmail.com>
Date: Tue, 15 Apr 2014 22:13:46 -0700

> For applications handling millions of sockets, this makes a difference.

Indeed, this really is not acceptable.

NeilBrown | 17 Apr 04:38 2014

Re: [PATCH 10/19] NET: set PF_FSTRANS while holding sk_lock

On Wed, 16 Apr 2014 09:00:02 -0400 (EDT) David Miller <davem <at> davemloft.net>
wrote:

> From: Eric Dumazet <eric.dumazet <at> gmail.com>
> Date: Tue, 15 Apr 2014 22:13:46 -0700
> 
> > For applications handling millions of sockets, this makes a difference.
> 
> Indeed, this really is not acceptable.

As you say...
I've just discovered that I can get rid of the lockdep message (and hence
presumably the deadlock risk) with a well placed:

		newsock->sk->sk_allocation = GFP_NOFS;

which surprised me as it seemed to be an explicit GFP_KERNEL allocation that
was mentioned in the lockdep trace.  Obviously these traces require quite
some sophistication to understand.

So - thanks for the feedback, patch can be ignored.

Thanks,
NeilBrown
NeilBrown | 17 Apr 04:38 2014
Picon

Re: [PATCH 10/19] NET: set PF_FSTRANS while holding sk_lock

On Wed, 16 Apr 2014 09:00:02 -0400 (EDT) David Miller <davem <at> davemloft.net>
wrote:

> From: Eric Dumazet <eric.dumazet <at> gmail.com>
> Date: Tue, 15 Apr 2014 22:13:46 -0700
> 
> > For applications handling millions of sockets, this makes a difference.
> 
> Indeed, this really is not acceptable.

As you say...
I've just discovered that I can get rid of the lockdep message (and hence
presumably the deadlock risk) with a well placed:

		newsock->sk->sk_allocation = GFP_NOFS;

which surprised me as it seemed to be an explicit GFP_KERNEL allocation that
was mentioned in the lockdep trace.  Obviously these traces require quite
some sophistication to understand.

So - thanks for the feedback, patch can be ignored.

Thanks,
NeilBrown
_______________________________________________
xfs mailing list
xfs <at> oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
(Continue reading)

David Miller | 16 Apr 15:00 2014
Picon

Re: [PATCH 10/19] NET: set PF_FSTRANS while holding sk_lock

From: Eric Dumazet <eric.dumazet <at> gmail.com>
Date: Tue, 15 Apr 2014 22:13:46 -0700

> For applications handling millions of sockets, this makes a difference.

Indeed, this really is not acceptable.

_______________________________________________
xfs mailing list
xfs <at> oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

Eric Dumazet | 16 Apr 07:13 2014
Picon

Re: [PATCH 10/19] NET: set PF_FSTRANS while holding sk_lock

On Wed, 2014-04-16 at 14:03 +1000, NeilBrown wrote:
> sk_lock can be taken while reclaiming memory (in nfsd for loop-back
> NFS mounts, and presumably in nfs), and memory can be allocated while
> holding sk_lock, at least via:
> 
>  inet_listen -> inet_csk_listen_start ->reqsk_queue_alloc
> 
> So to avoid deadlocks, always set PF_FSTRANS while holding sk_lock.
> 
> This deadlock was found by lockdep.

Wow, this is adding expensive stuff in fast path, only for nfsd :(

BTW, why should current->flags be saved in a socket field,
and not in a current->save_flags? This really looks like a thread property, not
a socket one.

Why nfsd could not have PF_FSTRANS in its current->flags ?

For applications handling millions of sockets, this makes a difference.


NeilBrown | 16 Apr 06:03 2014

[PATCH 11/19] FS: set PF_FSTRANS while holding mmap_sem in exec.c

Because mmap_sem is sometimes(*) taken while holding a sock lock,
and the sock lock might be needed for reclaim (at least when loop-back
NFS is active), we must not block on FS reclaim while mmap_sem is
held.

exec.c allocates memory while holding mmap_sem, and so needs
PF_FSTRANS protection.

* lockdep reports:
[   57.653355]    [<ffffffff810eb068>] lock_acquire+0xa8/0x1f0
[   57.653355]    [<ffffffff811835a4>] might_fault+0x84/0xb0
[   57.653355]    [<ffffffff81aec65d>] do_ip_setsockopt.isra.18+0x93d/0xed0
[   57.653355]    [<ffffffff81aecc17>] ip_setsockopt+0x27/0x90
[   57.653355]    [<ffffffff81b15146>] udp_setsockopt+0x16/0x30
[   57.653355]    [<ffffffff81a913cf>] sock_common_setsockopt+0xf/0x20
[   57.653355]    [<ffffffff81a9075e>] SyS_setsockopt+0x5e/0xc0
[   57.653355]    [<ffffffff81c3d062>] system_call_fastpath+0x16/0x1b

to explain why mmap_sem might be taken while sock lock is held.

Signed-off-by: NeilBrown <neilb <at> suse.de>
---
 fs/exec.c |    6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 3d78fccdd723..2c70a03ddb2b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
 <at>  <at>  -652,6 +652,7  <at>  <at>  int setup_arg_pages(struct linux_binprm *bprm,
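The hunk itself is cut off above.  Based on the pflags declaration it adds and
the pattern used by the other patches in this series, the change presumably
brackets the allocating region of setup_arg_pages() roughly as follows; this is
a sketch, not the verbatim patch, and the exact extent of the bracketed region
is an assumption:

    unsigned int pflags;

    down_write(&mm->mmap_sem);
    /* shift_arg_pages()/expand_stack() below may allocate memory while
     * mmap_sem is held, so suppress __GFP_FS-driven reclaim */
    current_set_flags_nested(&pflags, PF_FSTRANS);

    /* ... existing setup_arg_pages() body ... */

    current_restore_flags_nested(&pflags, PF_FSTRANS);
    up_write(&mm->mmap_sem);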

NeilBrown | 16 Apr 06:03 2014

[PATCH 09/19] XFS: ensure xfs_file_*_read cannot deadlock in memory allocation.

xfs_file_*_read holds an inode lock while calling a generic 'read'
function.  These functions perform read-ahead and are quite likely to
allocate memory.
So set PF_FSTRANS to ensure they avoid __GFP_FS and so don't recurse
into a filesystem to free memory.

This can be a problem with loop-back NFS mounts, if free_pages ends up
waiting in nfs_release_page(), and nfsd is blocked waiting for the lock
that this code holds.

This was found both by lockdep and as a real deadlock during testing.

Signed-off-by: NeilBrown <neilb <at> suse.de>
---
 fs/xfs/xfs_file.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 64b48eade91d..88b33ef64668 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
 <at>  <at>  -243,6 +243,7  <at>  <at>  xfs_file_aio_read(
 	ssize_t			ret = 0;
 	int			ioflags = 0;
 	xfs_fsize_t		n;
+	unsigned int		pflags;

 	XFS_STATS_INC(xs_read_calls);

 <at>  <at>  -290,6 +291,10  <at>  <at>  xfs_file_aio_read(
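The second hunk is truncated above.  Given the added pflags declaration and the
description, the read path is presumably wrapped along these lines; a sketch
only, with the surrounding locking calls taken from 3.14-era xfs_file.c rather
than from the patch text shown:

    xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);

    /* generic_file_aio_read() may allocate readahead pages while we hold
     * the iolock, so keep __GFP_FS out of those allocations */
    current_set_flags_nested(&pflags, PF_FSTRANS);
    ret = generic_file_aio_read(iocb, iovp, nr_segs, pos);
    current_restore_flags_nested(&pflags, PF_FSTRANS);

    xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);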

NeilBrown | 16 Apr 06:03 2014

[PATCH 12/19] NET: set PF_FSTRANS while holding rtnl_lock

As rtnl_mutex can be taken while holding sk_lock, and sk_lock can be
taken while performing memory reclaim (at least when loop-back NFS is
active), any memory allocation under rtnl_mutex must avoid __GFP_FS,
which is most easily done by setting PF_FSTRANS.

        CPU0                    CPU1
        ----                    ----
   lock(rtnl_mutex);
                                lock(sk_lock-AF_INET);
                                lock(rtnl_mutex);
   <Memory allocation/reclaim>
     lock(sk_lock-AF_INET);

  *** DEADLOCK ***

1/ rtnl_mutex is taken while holding sk_lock:

    [<ffffffff81abb442>] rtnl_lock+0x12/0x20
    [<ffffffff81b28c3a>] ip_mc_leave_group+0x2a/0x160
    [<ffffffff81aec70b>] do_ip_setsockopt.isra.18+0x96b/0xed0
    [<ffffffff81aecc97>] ip_setsockopt+0x27/0x90
    [<ffffffff81b151c6>] udp_setsockopt+0x16/0x30
    [<ffffffff81a9144f>] sock_common_setsockopt+0xf/0x20
    [<ffffffff81a907de>] SyS_setsockopt+0x5e/0xc0

2/ memory is allocated under rtnl_mutex:
    [<ffffffff8166eb41>] kobject_set_name_vargs+0x21/0x70
    [<ffffffff81840d92>] dev_set_name+0x42/0x50
    [<ffffffff81ac5e97>] netdev_register_kobject+0x57/0x130
    [<ffffffff81aaf574>] register_netdevice+0x354/0x550
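The diff for this patch is not shown above.  By analogy with the client_mutex
change in patch 15, the rtnl_lock()/rtnl_unlock() wrappers presumably gain
something like the following; the variable name and exact placement are
assumptions, and the real rtnl_unlock() path through netdev_run_todo() is
simplified away here:

    static unsigned int rtnl_pflags;

    void rtnl_lock(void)
    {
        mutex_lock(&rtnl_mutex);
        /* allocations under rtnl_mutex must not recurse into FS reclaim */
        current_set_flags_nested(&rtnl_pflags, PF_FSTRANS);
    }

    void rtnl_unlock(void)
    {
        current_restore_flags_nested(&rtnl_pflags, PF_FSTRANS);
        mutex_unlock(&rtnl_mutex);
    }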

NeilBrown | 16 Apr 06:03 2014

[PATCH 10/19] NET: set PF_FSTRANS while holding sk_lock

sk_lock can be taken while reclaiming memory (in nfsd for loop-back
NFS mounts, and presumably in nfs), and memory can be allocated while
holding sk_lock, at least via:

 inet_listen -> inet_csk_listen_start -> reqsk_queue_alloc

So to avoid deadlocks, always set PF_FSTRANS while holding sk_lock.

This deadlock was found by lockdep.

Signed-off-by: NeilBrown <neilb <at> suse.de>
---
 include/net/sock.h |    1 +
 net/core/sock.c    |    2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index b9586a137cad..27c355637e44 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
 <at>  <at>  -324,6 +324,7  <at>  <at>  struct sock {
 #define sk_v6_rcv_saddr	__sk_common.skc_v6_rcv_saddr

 	socket_lock_t		sk_lock;
+	unsigned int		sk_pflags; /* process flags before taking lock */
 	struct sk_buff_head	sk_receive_queue;
 	/*
 	 * The backlog queue is special, it is always used with
diff --git a/net/core/sock.c b/net/core/sock.c
index cf9bd24e4099..8bc677ef072e 100644
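The net/core/sock.c hunk is truncated above.  Given the new sk_pflags field,
the lock and unlock paths presumably look roughly like this; which of the
several lock_sock variants the real patch touches is an assumption:

    void lock_sock_nested(struct sock *sk, int subclass)
    {
        /* ... existing spin_lock_bh()/__lock_sock() sequence ... */
        current_set_flags_nested(&sk->sk_pflags, PF_FSTRANS);
    }

    void release_sock(struct sock *sk)
    {
        current_restore_flags_nested(&sk->sk_pflags, PF_FSTRANS);
        /* ... existing backlog processing and unlock ... */
    }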

NeilBrown | 16 Apr 06:03 2014

[PATCH 13/19] MM: set PF_FSTRANS while allocating per-cpu memory to avoid deadlock.

lockdep reports a locking chain

  sk_lock-AF_INET --> rtnl_mutex --> pcpu_alloc_mutex

As sk_lock may be needed to reclaim memory, allowing that
reclaim while pcpu_alloc_mutex is held can lead to deadlock.
So set PF_FSTRANS while it is held to avoid FS reclaim.

pcpu_alloc_mutex can be taken when rtnl_mutex is held:

    [<ffffffff8117f979>] pcpu_alloc+0x49/0x960
    [<ffffffff8118029b>] __alloc_percpu+0xb/0x10
    [<ffffffff8193b9f7>] loopback_dev_init+0x17/0x60
    [<ffffffff81aaf30c>] register_netdevice+0xec/0x550
    [<ffffffff81aaf785>] register_netdev+0x15/0x30

Signed-off-by: NeilBrown <neilb <at> suse.de>
---
 mm/percpu.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/percpu.c b/mm/percpu.c
index 036cfe07050f..77dd24032f41 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
 <at>  <at>  -712,6 +712,7  <at>  <at>  static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved)
 	int slot, off, new_alloc;
 	unsigned long flags;
 	void __percpu *ptr;
+	unsigned int pflags;
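The rest of the pcpu_alloc() hunk is truncated above; the wrapping presumably
follows the usual pattern of this series, roughly:

    mutex_lock(&pcpu_alloc_mutex);
    current_set_flags_nested(&pflags, PF_FSTRANS);

    /* existing chunk search and populate steps, which may call into the
     * page allocator, run here */

    current_restore_flags_nested(&pflags, PF_FSTRANS);
    mutex_unlock(&pcpu_alloc_mutex);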

Dave Chinner | 16 Apr 07:49 2014

Re: [PATCH 13/19] MM: set PF_FSTRANS while allocating per-cpu memory to avoid deadlock.

On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:
> lockdep reports a locking chain
> 
>   sk_lock-AF_INET --> rtnl_mutex --> pcpu_alloc_mutex
> 
> As sk_lock may be needed to reclaim memory, allowing that
> reclaim while pcu_alloc_mutex is held can lead to deadlock.
> So set PF_FSTRANS while it is help to avoid the FS reclaim.
> 
> pcpu_alloc_mutex can be taken when rtnl_mutex is held:
> 
>     [<ffffffff8117f979>] pcpu_alloc+0x49/0x960
>     [<ffffffff8118029b>] __alloc_percpu+0xb/0x10
>     [<ffffffff8193b9f7>] loopback_dev_init+0x17/0x60
>     [<ffffffff81aaf30c>] register_netdevice+0xec/0x550
>     [<ffffffff81aaf785>] register_netdev+0x15/0x30
> 
> Signed-off-by: NeilBrown <neilb <at> suse.de>

This looks like a workaround to avoid passing a gfp mask around to
describe the context in which the allocation is taking place.
Whether or not that's the right solution, I can't say, but spreading
this "we can turn off all reclaim of filesystem objects" mechanism
all around the kernel doesn't sit well with me...

And, again, PF_FSTRANS looks plainly wrong in this code - it sure
isn't a fs transaction context we are worried about here...

--

-- 
Dave Chinner

NeilBrown | 16 Apr 08:22 2014

Re: [PATCH 13/19] MM: set PF_FSTRANS while allocating per-cpu memory to avoid deadlock.

On Wed, 16 Apr 2014 15:49:42 +1000 Dave Chinner <david <at> fromorbit.com> wrote:

> On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:
> > lockdep reports a locking chain
> > 
> >   sk_lock-AF_INET --> rtnl_mutex --> pcpu_alloc_mutex
> > 
> > As sk_lock may be needed to reclaim memory, allowing that
> > reclaim while pcu_alloc_mutex is held can lead to deadlock.
> > So set PF_FSTRANS while it is help to avoid the FS reclaim.
> > 
> > pcpu_alloc_mutex can be taken when rtnl_mutex is held:
> > 
> >     [<ffffffff8117f979>] pcpu_alloc+0x49/0x960
> >     [<ffffffff8118029b>] __alloc_percpu+0xb/0x10
> >     [<ffffffff8193b9f7>] loopback_dev_init+0x17/0x60
> >     [<ffffffff81aaf30c>] register_netdevice+0xec/0x550
> >     [<ffffffff81aaf785>] register_netdev+0x15/0x30
> > 
> > Signed-off-by: NeilBrown <neilb <at> suse.de>
> 
> This looks like a workaround to avoid passing a gfp mask around to
> describe the context in which the allocation is taking place.
> Whether or not that's the right solution, I can't say, but spreading
> this "we can turn off all reclaim of filesystem objects" mechanism
> all around the kernel doesn't sit well with me...

We are (effectively) passing a gfp mask around, except that it lives in
'current' rather than lots of other places.
I actually like the idea of discarding PF_MEMALLOC, PF_FSTRANS and

Dave Chinner | 16 Apr 08:30 2014

Re: [PATCH 13/19] MM: set PF_FSTRANS while allocating per-cpu memory to avoid deadlock.

On Wed, Apr 16, 2014 at 04:22:01PM +1000, NeilBrown wrote:
> On Wed, 16 Apr 2014 15:49:42 +1000 Dave Chinner <david@...> wrote:
> 
> > On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:
> > > lockdep reports a locking chain
> > > 
> > >   sk_lock-AF_INET --> rtnl_mutex --> pcpu_alloc_mutex
> > > 
> > > As sk_lock may be needed to reclaim memory, allowing that
> > > reclaim while pcu_alloc_mutex is held can lead to deadlock.
> > > So set PF_FSTRANS while it is help to avoid the FS reclaim.
> > > 
> > > pcpu_alloc_mutex can be taken when rtnl_mutex is held:
> > > 
> > >     [<ffffffff8117f979>] pcpu_alloc+0x49/0x960
> > >     [<ffffffff8118029b>] __alloc_percpu+0xb/0x10
> > >     [<ffffffff8193b9f7>] loopback_dev_init+0x17/0x60
> > >     [<ffffffff81aaf30c>] register_netdevice+0xec/0x550
> > >     [<ffffffff81aaf785>] register_netdev+0x15/0x30
> > > 
> > > Signed-off-by: NeilBrown <neilb@...>
> > 
> > This looks like a workaround to avoid passing a gfp mask around to
> > describe the context in which the allocation is taking place.
> > Whether or not that's the right solution, I can't say, but spreading
> > this "we can turn off all reclaim of filesystem objects" mechanism
> > all around the kernel doesn't sit well with me...
> 
> We are (effectively) passing a gfp mask around, except that it lives in
> 'current' rather than lots of other places.


NeilBrown | 16 Apr 06:03 2014

[PATCH 14/19] driver core: set PF_FSTRANS while holding gdp_mutex

lockdep reports a locking chain:

  sk_lock-AF_INET --> rtnl_mutex --> gdp_mutex

As sk_lock can be needed for memory reclaim (when loop-back NFS is in
use at least), any memory allocation under gdp_mutex needs to
be protected by PF_FSTRANS.

The path from rtnl_mutex to gdp_mutex is

    [<ffffffff81841ecc>] get_device_parent+0x4c/0x1f0
    [<ffffffff81842496>] device_add+0xe6/0x610
    [<ffffffff81ac5f2a>] netdev_register_kobject+0x7a/0x130
    [<ffffffff81aaf5d4>] register_netdevice+0x354/0x550
    [<ffffffff81aaf7e5>] register_netdev+0x15/0x30

Signed-off-by: NeilBrown <neilb <at> suse.de>
---
 drivers/base/core.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 2b567177ef78..1a2735237650 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
 <at>  <at>  -750,6 +750,7  <at>  <at>  static struct kobject *get_device_parent(struct device *dev,
 		struct kobject *kobj = NULL;
 		struct kobject *parent_kobj;
 		struct kobject *k;
+		unsigned int pflags;
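The remainder of the hunk is truncated; presumably the gdp_mutex section of
get_device_parent() is bracketed like this (a sketch, not the verbatim patch):

    mutex_lock(&gdp_mutex);
    current_set_flags_nested(&pflags, PF_FSTRANS);

    /* existing lookup of the class directory kobject, and
     * class_dir_create_and_add(), which allocates */

    current_restore_flags_nested(&pflags, PF_FSTRANS);
    mutex_unlock(&gdp_mutex);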


NeilBrown | 16 Apr 06:03 2014

[PATCH 15/19] nfsd: set PF_FSTRANS when client_mutex is held.

When loop-back NFS with NFSv4 is in use, client_mutex might be needed
to reclaim memory, so any memory allocation while client_mutex is held
must avoid __GFP_FS; the simplest way to ensure that is to set PF_FSTRANS.

Signed-off-by: NeilBrown <neilb <at> suse.de>
---
 fs/nfsd/nfs4state.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index d5d070fbeb35..7b7fbcbe20cb 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
 <at>  <at>  -75,6 +75,7  <at>  <at>  static int check_for_locks(struct nfs4_file *filp, struct nfs4_lockowner *lowner

 /* Currently used for almost all code touching nfsv4 state: */
 static DEFINE_MUTEX(client_mutex);
+static unsigned int client_mutex_pflags;

 /*
  * Currently used for the del_recall_lru and file hash table.  In an
 <at>  <at>  -93,6 +94,7  <at>  <at>  void
 nfs4_lock_state(void)
 {
 	mutex_lock(&client_mutex);
+	current_set_flags_nested(&client_mutex_pflags, PF_FSTRANS);
 }

 static void free_session(struct nfsd4_session *);
 <at>  <at>  -127,6 +129,7  <at>  <at>  static __be32 nfsd4_get_session_locked(struct nfsd4_session *ses)
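The rest of the diff is truncated; the matching restore presumably lands in
nfs4_unlock_state(), roughly as follows (a sketch, inferred from the
description rather than from the patch text shown):

    void
    nfs4_unlock_state(void)
    {
        current_restore_flags_nested(&client_mutex_pflags, PF_FSTRANS);
        mutex_unlock(&client_mutex);
    }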


NeilBrown | 16 Apr 06:03 2014

[PATCH 16/19] VFS: use GFP_NOFS rather than GFP_KERNEL in __d_alloc.

__d_alloc can be called with i_mutex held, so it is safer to
use GFP_NOFS.

lockdep reports this can deadlock when loop-back NFS is in use,
as nfsd may be required to write out for reclaim, and nfsd certainly
takes i_mutex.

Signed-off-by: NeilBrown <neilb <at> suse.de>
---
 fs/dcache.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index ca02c13a84aa..3651ff6185b4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
 <at>  <at>  -1483,7 +1483,7  <at>  <at>  struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 	struct dentry *dentry;
 	char *dname;

-	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);
+	dentry = kmem_cache_alloc(dentry_cache, GFP_NOFS);
 	if (!dentry)
 		return NULL;

 <at>  <at>  -1495,7 +1495,7  <at>  <at>  struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 	 */
 	dentry->d_iname[DNAME_INLINE_LEN-1] = 0;
 	if (name->len > DNAME_INLINE_LEN-1) {
-		dname = kmalloc(name->len + 1, GFP_KERNEL);

Dave Chinner | 16 Apr 08:25 2014

Re: [PATCH 16/19] VFS: use GFP_NOFS rather than GFP_KERNEL in __d_alloc.

On Wed, Apr 16, 2014 at 02:03:37PM +1000, NeilBrown wrote:
> __d_alloc can be called with i_mutex held, so it is safer to
> use GFP_NOFS.
> 
> lockdep reports this can deadlock when loop-back NFS is in use,
> as nfsd may be required to write out for reclaim, and nfsd certainly
> takes i_mutex.

But not the same i_mutex as is currently held. To me, this seems
like a false positive? If you are holding the i_mutex on an inode,
then you have a reference to the inode and hence memory reclaim
won't ever take the i_mutex on that inode.

FWIW, this sort of false positive was a long standing problem for
XFS - we managed to get rid of most of the false positives like this
by ensuring that only the ilock is taken within memory reclaim and
memory reclaim can't be entered while we hold the ilock.

You can't do that with the i_mutex, though....

Cheers,

Dave.
--

-- 
Dave Chinner
david <at> fromorbit.com


NeilBrown | 16 Apr 08:49 2014

Re: [PATCH 16/19] VFS: use GFP_NOFS rather than GFP_KERNEL in __d_alloc.

On Wed, 16 Apr 2014 16:25:20 +1000 Dave Chinner <david <at> fromorbit.com> wrote:

> On Wed, Apr 16, 2014 at 02:03:37PM +1000, NeilBrown wrote:
> > __d_alloc can be called with i_mutex held, so it is safer to
> > use GFP_NOFS.
> > 
> > lockdep reports this can deadlock when loop-back NFS is in use,
> > as nfsd may be required to write out for reclaim, and nfsd certainly
> > takes i_mutex.
> 
> But not the same i_mutex as is currently held. To me, this seems
> like a false positive? If you are holding the i_mutex on an inode,
> then you have a reference to the inode and hence memory reclaim
> won't ever take the i_mutex on that inode.
> 
> FWIW, this sort of false positive was a long stabding problem for
> XFS - we managed to get rid of most of the false positives like this
> by ensuring that only the ilock is taken within memory reclaim and
> memory reclaim can't be entered while we hold the ilock.
> 
> You can't do that with the i_mutex, though....
> 
> Cheers,
> 
> Dave.

I'm not sure this is a false positive.
You can call __d_alloc when creating a file and so are holding i_mutex on the
directory.
nfsd might also want to access that directory.

Dave Chinner | 16 Apr 11:00 2014

Re: [PATCH 16/19] VFS: use GFP_NOFS rather than GFP_KERNEL in __d_alloc.

On Wed, Apr 16, 2014 at 04:49:41PM +1000, NeilBrown wrote:
> On Wed, 16 Apr 2014 16:25:20 +1000 Dave Chinner <david <at> fromorbit.com> wrote:
> 
> > On Wed, Apr 16, 2014 at 02:03:37PM +1000, NeilBrown wrote:
> > > __d_alloc can be called with i_mutex held, so it is safer to
> > > use GFP_NOFS.
> > > 
> > > lockdep reports this can deadlock when loop-back NFS is in use,
> > > as nfsd may be required to write out for reclaim, and nfsd certainly
> > > takes i_mutex.
> > 
> > But not the same i_mutex as is currently held. To me, this seems
> > like a false positive? If you are holding the i_mutex on an inode,
> > then you have a reference to the inode and hence memory reclaim
> > won't ever take the i_mutex on that inode.
> > 
> > FWIW, this sort of false positive was a long stabding problem for
> > XFS - we managed to get rid of most of the false positives like this
> > by ensuring that only the ilock is taken within memory reclaim and
> > memory reclaim can't be entered while we hold the ilock.
> > 
> > You can't do that with the i_mutex, though....
> > 
> > Cheers,
> > 
> > Dave.
> 
> I'm not sure this is a false positive.
> You can call __d_alloc when creating a file and so are holding i_mutex on the
> directory.

NeilBrown | 17 Apr 02:51 2014

Re: [PATCH 16/19] VFS: use GFP_NOFS rather than GFP_KERNEL in __d_alloc.

On Wed, 16 Apr 2014 19:00:51 +1000 Dave Chinner <david <at> fromorbit.com> wrote:

> On Wed, Apr 16, 2014 at 04:49:41PM +1000, NeilBrown wrote:
> > On Wed, 16 Apr 2014 16:25:20 +1000 Dave Chinner <david <at> fromorbit.com> wrote:
> > 
> > > On Wed, Apr 16, 2014 at 02:03:37PM +1000, NeilBrown wrote:
> > > > __d_alloc can be called with i_mutex held, so it is safer to
> > > > use GFP_NOFS.
> > > > 
> > > > lockdep reports this can deadlock when loop-back NFS is in use,
> > > > as nfsd may be required to write out for reclaim, and nfsd certainly
> > > > takes i_mutex.
> > > 
> > > But not the same i_mutex as is currently held. To me, this seems
> > > like a false positive? If you are holding the i_mutex on an inode,
> > > then you have a reference to the inode and hence memory reclaim
> > > won't ever take the i_mutex on that inode.
> > > 
> > > FWIW, this sort of false positive was a long stabding problem for
> > > XFS - we managed to get rid of most of the false positives like this
> > > by ensuring that only the ilock is taken within memory reclaim and
> > > memory reclaim can't be entered while we hold the ilock.
> > > 
> > > You can't do that with the i_mutex, though....
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > 
> > I'm not sure this is a false positive.

Dave Chinner | 17 Apr 07:58 2014

Re: [PATCH 16/19] VFS: use GFP_NOFS rather than GFP_KERNEL in __d_alloc.

On Thu, Apr 17, 2014 at 10:51:05AM +1000, NeilBrown wrote:
> On Wed, 16 Apr 2014 19:00:51 +1000 Dave Chinner <david <at> fromorbit.com> wrote:
> 
> > On Wed, Apr 16, 2014 at 04:49:41PM +1000, NeilBrown wrote:
> > > On Wed, 16 Apr 2014 16:25:20 +1000 Dave Chinner <david <at> fromorbit.com> wrote:
> > > 
> > > > On Wed, Apr 16, 2014 at 02:03:37PM +1000, NeilBrown wrote:
> > > > > __d_alloc can be called with i_mutex held, so it is safer to
> > > > > use GFP_NOFS.
> > > > > 
> > > > > lockdep reports this can deadlock when loop-back NFS is in use,
> > > > > as nfsd may be required to write out for reclaim, and nfsd certainly
> > > > > takes i_mutex.
> > > > 
> > > > But not the same i_mutex as is currently held. To me, this seems
> > > > like a false positive? If you are holding the i_mutex on an inode,
> > > > then you have a reference to the inode and hence memory reclaim
> > > > won't ever take the i_mutex on that inode.
> > > > 
> > > > FWIW, this sort of false positive was a long stabding problem for
> > > > XFS - we managed to get rid of most of the false positives like this
> > > > by ensuring that only the ilock is taken within memory reclaim and
> > > > memory reclaim can't be entered while we hold the ilock.
> > > > 
> > > > You can't do that with the i_mutex, though....
> > > > 
> > > > Cheers,
> > > > 
> > > > Dave.
> > > 


NeilBrown | 16 Apr 06:03 2014

[PATCH 17/19] VFS: set PF_FSTRANS while namespace_sem is held.

namespace_sem can be taken while various i_mutex locks are held, so we
need to avoid reclaim from blocking on an FS (particularly loop-back
NFS).

A memory allocation happens under namespace_sem at least in:

[<ffffffff8119d16f>] kmem_cache_alloc+0x4f/0x290
[<ffffffff811c2fff>] alloc_vfsmnt+0x1f/0x1d0
[<ffffffff811c339a>] clone_mnt+0x2a/0x310
[<ffffffff811c57e3>] copy_tree+0x53/0x380
[<ffffffff811c6aef>] copy_mnt_ns+0x7f/0x280
[<ffffffff810c16fc>] create_new_namespaces+0x5c/0x190
[<ffffffff810c1ab9>] unshare_nsproxy_namespaces+0x59/0x90

So set PF_FSTRANS in namespace_lock() and restore in
namespace_unlock().

Signed-off-by: NeilBrown <neilb <at> suse.de>
---
 fs/namespace.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index 2ffc5a2905d4..83dcd5083dbb 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
 <at>  <at>  -63,6 +63,7  <at>  <at>  static struct hlist_head *mount_hashtable __read_mostly;
 static struct hlist_head *mountpoint_hashtable __read_mostly;
 static struct kmem_cache *mnt_cache __read_mostly;
 static DECLARE_RWSEM(namespace_sem);
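The body of the change is truncated above.  From the description ("set
PF_FSTRANS in namespace_lock() and restore in namespace_unlock()"), the
wrappers presumably end up looking roughly like this; the variable name is an
assumption and namespace_unlock()'s existing cleanup work is elided:

    static unsigned int namespace_sem_pflags;

    static void namespace_lock(void)
    {
        down_write(&namespace_sem);
        current_set_flags_nested(&namespace_sem_pflags, PF_FSTRANS);
    }

    static void namespace_unlock(void)
    {
        current_restore_flags_nested(&namespace_sem_pflags, PF_FSTRANS);
        up_write(&namespace_sem);
    }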

Al Viro | 16 Apr 06:46 2014

Re: [PATCH 17/19] VFS: set PF_FSTRANS while namespace_sem is held.

On Wed, Apr 16, 2014 at 02:03:37PM +1000, NeilBrown wrote:
> namespace_sem can be taken while various i_mutex locks are held, so we
> need to avoid reclaim from blocking on an FS (particularly loop-back
> NFS).

I would really prefer to deal with that differently - by explicit change of
gfp_t arguments of allocators.

The thing is, namespace_sem is held *only* over allocations, and not a lot
of them, at that - only mnt_alloc_id(), mnt_alloc_group_id(), alloc_vfsmnt()
and new_mountpoint().  That is all that is allowed.

Again, actual work with filesystems (setup, shutdown, remount, pathname
resolution, etc.) is all done outside of namespace_sem; it's held only
for manipulations of fs/{namespace,pnode}.c data structures and the only
reason it isn't a spinlock is that we need to do some allocations.

So I'd rather slap GFP_NOFS on those few allocations...


NeilBrown | 16 Apr 07:52 2014

Re: [PATCH 17/19] VFS: set PF_FSTRANS while namespace_sem is held.

On Wed, 16 Apr 2014 05:46:18 +0100 Al Viro <viro <at> ZenIV.linux.org.uk> wrote:

> On Wed, Apr 16, 2014 at 02:03:37PM +1000, NeilBrown wrote:
> > namespace_sem can be taken while various i_mutex locks are held, so we
> > need to avoid reclaim from blocking on an FS (particularly loop-back
> > NFS).
> 
> I would really prefer to deal with that differently - by explicit change of
> gfp_t arguments of allocators.
> 
> The thing is, namespace_sem is held *only* over allocations, and not a lot
> of them, at that - only mnt_alloc_id(), mnt_alloc_group_id(), alloc_vfsmnt()
> and new_mountpoint().  That is all that is allowed.
> 
> Again, actual work with filesystems (setup, shutdown, remount, pathname
> resolution, etc.) is all done outside of namespace_sem; it's held only
> for manipulations of fs/{namespace,pnode}.c data structures and the only
> reason it isn't a spinlock is that we need to do some allocations.
> 
> So I'd rather slap GFP_NOFS on those few allocations...

So something like this?  I put that in to my testing instead.

Thanks,
NeilBrown

diff --git a/fs/namespace.c b/fs/namespace.c
index 83dcd5083dbb..8e103b8c8323 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
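Neil's diff is cut off above.  The allocation sites Al lists are few, so the
change amounts to swapping GFP_KERNEL for GFP_NOFS in each of them; a sketch of
the kind of lines that end up changed, with the exact call sites recalled from
roughly 3.14-era fs/namespace.c and therefore to be treated as assumptions:

    /* alloc_vfsmnt() */
    mnt = kmem_cache_zalloc(mnt_cache, GFP_NOFS);        /* was GFP_KERNEL */
    mnt->mnt_devname = kstrdup(name, GFP_NOFS);          /* was GFP_KERNEL */

    /* mnt_alloc_id() and mnt_alloc_group_id() */
    ida_pre_get(&mnt_id_ida, GFP_NOFS);                  /* was GFP_KERNEL */

    /* new_mountpoint() */
    mp = kmalloc(sizeof(struct mountpoint), GFP_NOFS);   /* was GFP_KERNEL */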

Al Viro | 16 Apr 18:37 2014

Re: [PATCH 17/19] VFS: set PF_FSTRANS while namespace_sem is held.

On Wed, Apr 16, 2014 at 03:52:30PM +1000, NeilBrown wrote:

> So something like this?  I put that in to my testing instead.

Something like this, yes...  And TBH I would prefer the same approach
elsewhere - this kind of hidden state makes it hard to do any kind of
analysis.

Put it that way - the simplest situation is when the allocation flags
depend only on the call site.  Next one is when it's a function of
call chain - somewhat harder to review.  And the worst is when it's
a function of previous history of execution - not just the call chain,
but the things that had been called (and returned) prior to that one.

How many of those locations need to be of the third kind?  All fs/namespace.c
ones are of the first one...


NeilBrown | 16 Apr 06:03 2014

[PATCH 18/19] nfsd: set PF_FSTRANS during nfsd4_do_callback_rpc.

nfsd will sometimes call flush_workqueue on the workqueue running
nfsd4_do_callback_rpc, so we must ensure that it doesn't block in
filesystem reclaim.
So set PF_FSTRANS.

Signed-off-by: NeilBrown <neilb <at> suse.de>
---
 fs/nfsd/nfs4callback.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
index 7f05cd140de3..7784b5d4edf0 100644
--- a/fs/nfsd/nfs4callback.c
+++ b/fs/nfsd/nfs4callback.c
 <at>  <at>  -997,6 +997,9  <at>  <at>  static void nfsd4_do_callback_rpc(struct work_struct *w)
 	struct nfsd4_callback *cb = container_of(w, struct nfsd4_callback, cb_work);
 	struct nfs4_client *clp = cb->cb_clp;
 	struct rpc_clnt *clnt;
+	unsigned int pflags;
+
+	current_set_flags_nested(&pflags, PF_FSTRANS);

 	if (clp->cl_flags & NFSD4_CLIENT_CB_FLAG_MASK)
 		nfsd4_process_cb_update(cb);
 <at>  <at>  -1005,11 +1008,13  <at>  <at>  static void nfsd4_do_callback_rpc(struct work_struct *w)
 	if (!clnt) {
 		/* Callback channel broken, or client killed; give up: */
 		nfsd4_release_cb(cb);
+		current_restore_flags_nested(&pflags, PF_FSTRANS);
 		return;


NeilBrown | 16 Apr 06:03 2014

[PATCH 19/19] XFS: set PF_FSTRANS while ilock is held in xfs_free_eofblocks

memory allocations can happen while the xfs ilock is held in
xfs_free_eofblocks, particularly

  [<ffffffff813e6667>] kmem_zone_alloc+0x67/0xc0
  [<ffffffff813e5945>] xfs_trans_add_item+0x25/0x50
  [<ffffffff8143d64c>] xfs_trans_ijoin+0x2c/0x60
  [<ffffffff8142275e>] xfs_itruncate_extents+0xbe/0x400
  [<ffffffff813c72f4>] xfs_free_eofblocks+0x1c4/0x240

So set PF_FSTRANS to avoid this causing a deadlock.

Care is needed here as xfs_trans_reserve() also sets PF_FSTRANS, while
xfs_trans_cancel and xfs_trans_commit will clear it.
So our extra setting must fully nest these calls.

Signed-off-by: NeilBrown <neilb <at> suse.de>
---
 fs/xfs/xfs_bmap_util.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index f264616080ca..53761fe4fada 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
 <at>  <at>  -889,6 +889,7  <at>  <at>  xfs_free_eofblocks(
 	xfs_filblks_t	map_len;
 	int		nimaps;
 	xfs_bmbt_irec_t	imap;
+	unsigned int pflags;

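The point about nesting is easiest to see in outline.  xfs_trans_reserve() sets
PF_FSTRANS itself and xfs_trans_commit()/xfs_trans_cancel() clear it, so the
extra set/restore added by this patch has to sit outside that whole sequence.
Roughly, with arguments abbreviated from memory and error handling omitted, so
treat the details as assumptions:

    unsigned int pflags;

    current_set_flags_nested(&pflags, PF_FSTRANS);   /* before xfs_trans_reserve() */

    tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
    error = xfs_trans_reserve(tp, &M_RES(mp)->tr_itruncate, 0, 0); /* sets PF_FSTRANS */
    xfs_ilock(ip, XFS_ILOCK_EXCL);
    xfs_trans_ijoin(tp, ip, 0);                      /* allocates under ilock */
    error = xfs_itruncate_extents(&tp, ip, XFS_DATA_FORK, XFS_ISIZE(ip));
    error = xfs_trans_commit(tp, XFS_TRANS_RELEASE_LOG_RES); /* clears PF_FSTRANS */
    xfs_iunlock(ip, XFS_ILOCK_EXCL);

    current_restore_flags_nested(&pflags, PF_FSTRANS); /* restores the saved value */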

Dave Chinner | 16 Apr 08:18 2014

Re: [PATCH 19/19] XFS: set PF_FSTRANS while ilock is held in xfs_free_eofblocks

On Wed, Apr 16, 2014 at 02:03:37PM +1000, NeilBrown wrote:
> memory allocates can happen while the xfs ilock is held in
> xfs_free_eofblocks, particularly
> 
>   [<ffffffff813e6667>] kmem_zone_alloc+0x67/0xc0
>   [<ffffffff813e5945>] xfs_trans_add_item+0x25/0x50
>   [<ffffffff8143d64c>] xfs_trans_ijoin+0x2c/0x60
>   [<ffffffff8142275e>] xfs_itruncate_extents+0xbe/0x400
>   [<ffffffff813c72f4>] xfs_free_eofblocks+0x1c4/0x240
> 
> So set PF_FSTRANS to avoid this causing a deadlock.

Another "You broke KM_NOFS" moment. You win a Kit Kat. ;)

xfs_trans_add_item():

	lidp = kmem_zone_zalloc(xfs_log_item_desc_zone, KM_SLEEP | KM_NOFS);

KM_NOFS needs to work, otherwise XFS is just a huge steaming pile of
memory reclaim deadlocks regardless of whether you are using
loopback NFS or not.

Cheers,

Dave.
--

-- 
Dave Chinner
david@...


Jeff Layton | 16 Apr 16:42 2014

Re: [PATCH/RFC 00/19] Support loop-back NFS mounts

On Wed, 16 Apr 2014 14:03:35 +1000
NeilBrown <neilb <at> suse.de> wrote:

> Loop-back NFS mounts are when the NFS client and server run on the
> same host.
> 
> The use-case for this is a high availability cluster with shared
> storage.  The shared filesystem is mounted on any one machine and
> NFS-mounted on the others.
> If the nfs server fails, some other node will take over that service,
> and then it will have a loop-back NFS mount which needs to keep
> working.
> 
> This patch set addresses the "keep working" bit and specifically
> addresses deadlocks and livelocks.
> Allowing the fail-over itself to be deadlock free is a separate
> challenge for another day.
> 
> The short description of how this works is:
> 
> deadlocks:
>   - Elevate PF_FSTRANS to apply globally instead of just in NFS and XFS.
>     PF_FSTRANS disables __GFP_FS in the same way that PF_MEMALLOC_NOIO
>     disables __GFP_IO.
>   - Set PF_FSTRANS in nfsd when handling requests related to
>     memory reclaim, or requests which could block requests related
>     to memory reclaim.
>   - Use lockdep to find all consequent deadlocks from some other
>     thread allocating memory while holding a lock that nfsd might
>     want.

NeilBrown | 17 Apr 02:20 2014

Re: [PATCH/RFC 00/19] Support loop-back NFS mounts

On Wed, 16 Apr 2014 10:42:07 -0400 Jeff Layton <jlayton@...> wrote:

> On Wed, 16 Apr 2014 14:03:35 +1000
> NeilBrown <neilb@...> wrote:
> 

> > Comments, criticisms, etc most welcome.
> > 
> > Thanks,
> > NeilBrown
> > 
> 
> I've only given this a once-over, but the basic concept seems a bit
> flawed. IIUC, the basic idea is to disallow allocations done in knfsd
> thread context from doing fs-based reclaim.
> 
> This seems very heavy-handed, and like it could cause problems on a
> busy NFS server. Those sorts of servers are likely to have a lot of
> data in pagecache and thus we generally want to allow them to do
> writeback when memory is tight.
> 
> It's generally acceptable for knfsd to recurse into local filesystem
> code for writeback. What you want to avoid in this situation is reclaim
> on NFS filesystems that happen to be from knfsd on the same box.
> 
> If you really want to fix this, what may make more sense is trying to
> plumb that information down more granularly. Maybe GFP_NONETFS and/or
> PF_NETFSTRANS flags?

Hi Jeff,
(Continue reading)
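
For context, neither GFP_NONETFS nor PF_NETFSTRANS exists in the tree; they
are only names Jeff is floating.  A purely hypothetical sketch of the kind of
plumbing he describes, modelled on the existing PF_MEMALLOC_NOIO /
memalloc_noio_flags() pairing (all names and bit values below are
illustrative, not real kernel interfaces):

	/* hypothetical gfp bit: allocation may recurse into a network fs */
	#define ___GFP_NETFS	0x4000000u
	#define __GFP_NETFS	((__force gfp_t)___GFP_NETFS)

	/* hypothetical task flag: task must not recurse into a network fs */
	#define PF_NETFSTRANS	0x01000000

	/*
	 * Analogue of memalloc_noio_flags(): strip only network-fs reclaim
	 * for marked tasks (e.g. knfsd threads that might be serving a
	 * loop-back mount), while still allowing local-fs writeback.
	 */
	static inline gfp_t memalloc_netfs_flags(gfp_t flags)
	{
		if (unlikely(current->flags & PF_NETFSTRANS))
			flags &= ~__GFP_NETFS;
		return flags;
	}

A network filesystem's releasepage/writepage paths would then back off when
__GFP_NETFS is missing from the caller's mask, which is essentially the
__GFP_NETFS > __GFP_FS > __GFP_IO hierarchy Neil mentions below.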

Dave Chinner | 17 Apr 03:27 2014

Re: [PATCH/RFC 00/19] Support loop-back NFS mounts

On Thu, Apr 17, 2014 at 10:20:48AM +1000, NeilBrown wrote:
> A good example is the deadlock with the flush-* threads.
> flush-* will lock a page, and  then call ->writepage.  If ->writepage
> allocates memory it can enter reclaim, call ->releasepage on NFS, and block
> waiting for a COMMIT to complete.
> The COMMIT might already be running, performing fsync on that same file that
> flush-* is flushing.  It locks each page in turn.  When it  gets to the page
> that flush-* has locked, it will deadlock.

It's nfs_release_page() again....

> In general, if nfsd is allowed to block on local filesystem, and local
> filesystem is allowed to block on NFS, then a deadlock can happen.
> We would need a clear hierarchy
> 
>    __GFP_NETFS > __GFP_FS > __GFP_IO
> 
> for it to work.  I'm not sure the extra level really helps a lot and it would
> be a lot of churn.

I think you are looking at this the wrong way - it's not the other
filesystems that have to avoid memory reclaim recursion, it's the
NFS client mount that is on loopback that needs to avoid recursion.

IMO, the fix should be that the NFS client cannot block on messages sent to the NFSD
on the same host during memory reclaim. That is, nfs_release_page()
cannot send commit messages to the server if the server is on
localhost. Instead, it just tells memory reclaim that it can't
reclaim that page.

(Continue reading)
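
No patch implementing this exists in the thread; a rough sketch of the idea
might look like the following in nfs_release_page(), where
nfs_server_is_local() is a hypothetical helper that would have to compare
the transport's server address with the host's own addresses, and the
existing commit-and-wait logic is elided:

	static int nfs_release_page(struct page *page, gfp_t gfp)
	{
		struct address_space *mapping = page->mapping;

		/*
		 * Never block waiting for a COMMIT that has to be serviced
		 * by an nfsd on this same host: under memory reclaim that
		 * nfsd may itself be waiting for memory.  Just tell reclaim
		 * it cannot have this page.
		 * (nfs_server_is_local() is hypothetical.)
		 */
		if (mapping && nfs_server_is_local(NFS_SERVER(mapping->host)))
			return 0;

		/* ... otherwise the normal commit-and-wait path runs ... */

		return nfs_fscache_release_page(page, gfp);
	}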

NeilBrown | 17 Apr 03:50 2014

Re: [PATCH/RFC 00/19] Support loop-back NFS mounts

On Thu, 17 Apr 2014 11:27:39 +1000 Dave Chinner <david <at> fromorbit.com> wrote:

> On Thu, Apr 17, 2014 at 10:20:48AM +1000, NeilBrown wrote:
> > A good example is the deadlock with the flush-* threads.
> > flush-* will lock a page, and  then call ->writepage.  If ->writepage
> > allocates memory it can enter reclaim, call ->releasepage on NFS, and block
> > waiting for a COMMIT to complete.
> > The COMMIT might already be running, performing fsync on that same file that
> > flush-* is flushing.  It locks each page in turn.  When it  gets to the page
> > that flush-* has locked, it will deadlock.
> 
> It's nfs_release_page() again....
> 
> > In general, if nfsd is allowed to block on local filesystem, and local
> > filesystem is allowed to block on NFS, then a deadlock can happen.
> > We would need a clear hierarchy
> > 
> >    __GFP_NETFS > __GFP_FS > __GFP_IO
> > 
> > for it to work.  I'm not sure the extra level really helps a lot and it would
> > be a lot of churn.
> 
> I think you are looking at this the wrong way - it's not the other
> filesystems that have to avoid memory reclaim recursion, it's the
> NFS client mount that is on loopback that needs to avoid recursion.
> 
> IMO, the fix should be that the NFS client cannot block on messages sent to the NFSD
> on the same host during memory reclaim. That is, nfs_release_page()
> cannot send commit messages to the server if the server is on
> localhost. Instead, it just tells memory reclaim that it can't
(Continue reading)

Dave Chinner | 17 Apr 06:23 2014

Re: [PATCH/RFC 00/19] Support loop-back NFS mounts

On Thu, Apr 17, 2014 at 11:50:18AM +1000, NeilBrown wrote:
> On Thu, 17 Apr 2014 11:27:39 +1000 Dave Chinner <david <at> fromorbit.com> wrote:
> 
> > On Thu, Apr 17, 2014 at 10:20:48AM +1000, NeilBrown wrote:
> > > A good example is the deadlock with the flush-* threads.
> > > flush-* will lock a page, and  then call ->writepage.  If ->writepage
> > > allocates memory it can enter reclaim, call ->releasepage on NFS, and block
> > > waiting for a COMMIT to complete.
> > > The COMMIT might already be running, performing fsync on that same file that
> > > flush-* is flushing.  It locks each page in turn.  When it  gets to the page
> > > that flush-* has locked, it will deadlock.
> > 
> > It's nfs_release_page() again....
> > 
> > > In general, if nfsd is allowed to block on local filesystem, and local
> > > filesystem is allowed to block on NFS, then a deadlock can happen.
> > > We would need a clear hierarchy
> > > 
> > >    __GFP_NETFS > __GFP_FS > __GFP_IO
> > > 
> > > for it to work.  I'm not sure the extra level really helps a lot and it would
> > > be a lot of churn.
> > 
> > I think you are looking at this the wrong way - it's not the other
> > filesystems that have to avoid memory reclaim recursion, it's the
> > NFS client mount that is on loopback that needs to avoid recursion.
> > 
> > IMO, the fix should be that the NFS client cannot block on messages sent to the NFSD
> > on the same host during memory reclaim. That is, nfs_release_page()
> > cannot send commit messages to the server if the server is on
(Continue reading)
