Shaohua Li | 25 Jun 2012 09:24

[patch 00/10 v3] raid5: improve write performance for fast storage

Hi,

Like raid 1/10, raid5 uses one thread to handle stripe. In a fast storage, the
thread becomes a bottleneck. raid5 can offload calculation like checksum to
async threads. And if storge is fast, scheduling async work and running async
work will introduce heavy lock contention of workqueue, which makes such
optimization useless. And calculation isn't the only bottleneck. For example,
in my test raid5 thread must handle > 450k requests per second. Just doing
dispatch and completion will make raid5 thread incapable. The only chance to
scale is using several threads to handle stripe.

Simpliy using several threads doesn't work. conf->device_lock is a global lock
which is heavily contended. patch 3-9 in the set are trying to address this
problem. With them, when several threads are handling stripe, device_lock is
still contended but takes much less cpu time and not the heavist locking any
more. Even the 10th patch isn't accepted, the patch 3-9 look good to merge.

I did stress test (block size range 1k - 64k with a small total size, so
overlap/stripe sharing guaranteed) with the patches and looks fine except some
issues fixed in the first two patches. That issues aren't related to the series,
but I need them in stress test.

With the locking issue solved (at least largely), switching stripe handling to
multiple threads is trival.

Threads are still created in advance (default thread number is disk number) and
can be reconfigured by user. Automatically creating and reaping threads is
great, but I'm worrying about numa binding.

In a 3-disk raid5 setup, 2 extra threads can provide 130% throughput
(Continue reading)

Shaohua Li | 25 Jun 2012 09:24

[patch 01/10 v3] raid5: use wake_up_all for overlap waking

It's possible several tasks are waiting for stripe overlap. We clear R5_Overlap
bit and wake_up, but wake_up just wakes one task. So if there are several tasks
in the wait queue, some tasks will not be woken up even its strip R5_Overlap
clear. The end result is tasks hang in make_request.

wake_up_all should not introduce performance issue here, since overlap case is
rare.

Signed-off-by: Shaohua Li <shli <at> fusionio.com>
---
 drivers/md/raid5.c |   20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-19 08:11:10.021688417 +0800
+++ linux/drivers/md/raid5.c	2012-06-19 08:11:29.833439339 +0800
 <at>  <at>  -1399,7 +1399,7  <at>  <at>  static void __raid_run_ops(struct stripe
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
 			if (test_and_clear_bit(R5_Overlap, &dev->flags))
-				wake_up(&sh->raid_conf->wait_for_overlap);
+				wake_up_all(&sh->raid_conf->wait_for_overlap);
 		}
 	put_cpu();
 }
 <at>  <at>  -2436,7 +2436,7  <at>  <at>  handle_failed_stripe(struct r5conf *conf
 		}

 		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
(Continue reading)

NeilBrown | 28 Jun 2012 09:26
Picon
Gravatar

Re: [patch 01/10 v3] raid5: use wake_up_all for overlap waking

On Mon, 25 Jun 2012 15:24:48 +0800 Shaohua Li <shli <at> kernel.org> wrote:

> It's possible several tasks are waiting for stripe overlap. We clear R5_Overlap
> bit and wake_up, but wake_up just wakes one task. So if there are several tasks
> in the wait queue, some tasks will not be woken up even its strip R5_Overlap
> clear. The end result is tasks hang in make_request.
> 
> wake_up_all should not introduce performance issue here, since overlap case is
> rare.

This is not necessary.
wake_up_all is only different from wake_up if WQ_FLAG_EXCLUSIVE it set, e.g.
by prepare_to_wait_exclusive.
As we don't use an exclusive wait to wait on wait_for_overlap, there is no
point in using wake_up_all, wake_up already wakes everything up.

NeilBrown

> 
> Signed-off-by: Shaohua Li <shli <at> fusionio.com>
> ---
>  drivers/md/raid5.c |   20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)
> 
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c	2012-06-19 08:11:10.021688417 +0800
> +++ linux/drivers/md/raid5.c	2012-06-19 08:11:29.833439339 +0800
>  <at>  <at>  -1399,7 +1399,7  <at>  <at>  static void __raid_run_ops(struct stripe
>  		for (i = disks; i--; ) {
(Continue reading)

Shaohua Li | 28 Jun 2012 10:53

Re: [patch 01/10 v3] raid5: use wake_up_all for overlap waking

On Thu, Jun 28, 2012 at 05:26:21PM +1000, NeilBrown wrote:
> On Mon, 25 Jun 2012 15:24:48 +0800 Shaohua Li <shli <at> kernel.org> wrote:
> 
> > It's possible several tasks are waiting for stripe overlap. We clear R5_Overlap
> > bit and wake_up, but wake_up just wakes one task. So if there are several tasks
> > in the wait queue, some tasks will not be woken up even its strip R5_Overlap
> > clear. The end result is tasks hang in make_request.
> > 
> > wake_up_all should not introduce performance issue here, since overlap case is
> > rare.
> 
> This is not necessary.
> wake_up_all is only different from wake_up if WQ_FLAG_EXCLUSIVE it set, e.g.
> by prepare_to_wait_exclusive.
> As we don't use an exclusive wait to wait on wait_for_overlap, there is no
> point in using wake_up_all, wake_up already wakes everything up.

Oh, this is silly, sorry about it. So the only problem I hit the hang is that I
fixed in the second patch. Other patches can still applied without this one.

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo <at> vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Shaohua Li | 25 Jun 2012 09:24

[patch 02/10 v3] raid5: delayed stripe fix

There isn't locking setting STRIPE_DELAYED and STRIPE_PREREAD_ACTIVE bits, but
the two bits have relationship. A delayed stripe can be moved to hold list only
when preread active stripe count is below IO_THRESHOLD. If a stripe has both
the bits set, such stripe will be in delayed list and preread count not 0,
which will make such stripe never leave delayed list.

Signed-off-by: Shaohua Li <shli <at> fusionio.com>
---
 drivers/md/raid5.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-25 14:36:15.964613183 +0800
+++ linux/drivers/md/raid5.c	2012-06-25 14:36:57.280096788 +0800
 <at>  <at>  -196,12 +196,14  <at>  <at>  static void __release_stripe(struct r5co
 		BUG_ON(!list_empty(&sh->lru));
 		BUG_ON(atomic_read(&conf->active_stripes)==0);
 		if (test_bit(STRIPE_HANDLE, &sh->state)) {
-			if (test_bit(STRIPE_DELAYED, &sh->state))
+			if (test_bit(STRIPE_DELAYED, &sh->state) &&
+			    !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
 				list_add_tail(&sh->lru, &conf->delayed_list);
 			else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
 				   sh->bm_seq - conf->seq_write > 0)
 				list_add_tail(&sh->lru, &conf->bitmap_list);
 			else {
+				clear_bit(STRIPE_DELAYED, &sh->state);
 				clear_bit(STRIPE_BIT_DELAY, &sh->state);
 				list_add_tail(&sh->lru, &conf->handle_list);
(Continue reading)

NeilBrown | 2 Jul 2012 02:46
Picon
Gravatar

Re: [patch 02/10 v3] raid5: delayed stripe fix

On Mon, 25 Jun 2012 15:24:49 +0800 Shaohua Li <shli <at> kernel.org> wrote:

> There isn't locking setting STRIPE_DELAYED and STRIPE_PREREAD_ACTIVE bits, but
> the two bits have relationship. A delayed stripe can be moved to hold list only
> when preread active stripe count is below IO_THRESHOLD. If a stripe has both
> the bits set, such stripe will be in delayed list and preread count not 0,
> which will make such stripe never leave delayed list.
> 
> Signed-off-by: Shaohua Li <shli <at> fusionio.com>
> ---
>  drivers/md/raid5.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c	2012-06-25 14:36:15.964613183 +0800
> +++ linux/drivers/md/raid5.c	2012-06-25 14:36:57.280096788 +0800
>  <at>  <at>  -196,12 +196,14  <at>  <at>  static void __release_stripe(struct r5co
>  		BUG_ON(!list_empty(&sh->lru));
>  		BUG_ON(atomic_read(&conf->active_stripes)==0);
>  		if (test_bit(STRIPE_HANDLE, &sh->state)) {
> -			if (test_bit(STRIPE_DELAYED, &sh->state))
> +			if (test_bit(STRIPE_DELAYED, &sh->state) &&
> +			    !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
>  				list_add_tail(&sh->lru, &conf->delayed_list);
>  			else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
>  				   sh->bm_seq - conf->seq_write > 0)
>  				list_add_tail(&sh->lru, &conf->bitmap_list);
>  			else {
> +				clear_bit(STRIPE_DELAYED, &sh->state);
(Continue reading)

Shaohua Li | 2 Jul 2012 02:49

Re: [patch 02/10 v3] raid5: delayed stripe fix

On Mon, Jul 02, 2012 at 10:46:48AM +1000, NeilBrown wrote:
> On Mon, 25 Jun 2012 15:24:49 +0800 Shaohua Li <shli <at> kernel.org> wrote:
> 
> > There isn't locking setting STRIPE_DELAYED and STRIPE_PREREAD_ACTIVE bits, but
> > the two bits have relationship. A delayed stripe can be moved to hold list only
> > when preread active stripe count is below IO_THRESHOLD. If a stripe has both
> > the bits set, such stripe will be in delayed list and preread count not 0,
> > which will make such stripe never leave delayed list.
> > 
> > Signed-off-by: Shaohua Li <shli <at> fusionio.com>
> > ---
> >  drivers/md/raid5.c |    4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > Index: linux/drivers/md/raid5.c
> > ===================================================================
> > --- linux.orig/drivers/md/raid5.c	2012-06-25 14:36:15.964613183 +0800
> > +++ linux/drivers/md/raid5.c	2012-06-25 14:36:57.280096788 +0800
> >  <at>  <at>  -196,12 +196,14  <at>  <at>  static void __release_stripe(struct r5co
> >  		BUG_ON(!list_empty(&sh->lru));
> >  		BUG_ON(atomic_read(&conf->active_stripes)==0);
> >  		if (test_bit(STRIPE_HANDLE, &sh->state)) {
> > -			if (test_bit(STRIPE_DELAYED, &sh->state))
> > +			if (test_bit(STRIPE_DELAYED, &sh->state) &&
> > +			    !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> >  				list_add_tail(&sh->lru, &conf->delayed_list);
> >  			else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
> >  				   sh->bm_seq - conf->seq_write > 0)
> >  				list_add_tail(&sh->lru, &conf->bitmap_list);
> >  			else {
(Continue reading)

NeilBrown | 2 Jul 2012 02:55
Picon
Gravatar

Re: [patch 02/10 v3] raid5: delayed stripe fix

On Mon, 2 Jul 2012 08:49:55 +0800 Shaohua Li <shli <at> kernel.org> wrote:

> On Mon, Jul 02, 2012 at 10:46:48AM +1000, NeilBrown wrote:
> > On Mon, 25 Jun 2012 15:24:49 +0800 Shaohua Li <shli <at> kernel.org> wrote:
> > 
> > > There isn't locking setting STRIPE_DELAYED and STRIPE_PREREAD_ACTIVE bits, but
> > > the two bits have relationship. A delayed stripe can be moved to hold list only
> > > when preread active stripe count is below IO_THRESHOLD. If a stripe has both
> > > the bits set, such stripe will be in delayed list and preread count not 0,
> > > which will make such stripe never leave delayed list.
> > > 
> > > Signed-off-by: Shaohua Li <shli <at> fusionio.com>
> > > ---
> > >  drivers/md/raid5.c |    4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > Index: linux/drivers/md/raid5.c
> > > ===================================================================
> > > --- linux.orig/drivers/md/raid5.c	2012-06-25 14:36:15.964613183 +0800
> > > +++ linux/drivers/md/raid5.c	2012-06-25 14:36:57.280096788 +0800
> > >  <at>  <at>  -196,12 +196,14  <at>  <at>  static void __release_stripe(struct r5co
> > >  		BUG_ON(!list_empty(&sh->lru));
> > >  		BUG_ON(atomic_read(&conf->active_stripes)==0);
> > >  		if (test_bit(STRIPE_HANDLE, &sh->state)) {
> > > -			if (test_bit(STRIPE_DELAYED, &sh->state))
> > > +			if (test_bit(STRIPE_DELAYED, &sh->state) &&
> > > +			    !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> > >  				list_add_tail(&sh->lru, &conf->delayed_list);
> > >  			else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
> > >  				   sh->bm_seq - conf->seq_write > 0)
(Continue reading)

Shaohua Li | 25 Jun 2012 09:24

[patch 03/10 v3] raid5: add a per-stripe lock

Add a per-stripe lock to protect stripe specific data, like dev->read,
written, ... The purpose is to reduce lock contention of conf->device_lock.

Signed-off-by: Shaohua Li <shli <at> fusionio.com>
---
 drivers/md/raid5.c |   17 +++++++++++++++++
 drivers/md/raid5.h |    1 +
 2 files changed, 18 insertions(+)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-25 14:36:57.280096788 +0800
+++ linux/drivers/md/raid5.c	2012-06-25 14:37:13.651888057 +0800
 <at>  <at>  -751,6 +751,7  <at>  <at>  static void ops_complete_biofill(void *s

 	/* clear completed biofills */
 	spin_lock_irq(&conf->device_lock);
+	spin_lock(&sh->stripe_lock);
 	for (i = sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];

 <at>  <at>  -776,6 +777,7  <at>  <at>  static void ops_complete_biofill(void *s
 			}
 		}
 	}
+	spin_unlock(&sh->stripe_lock);
 	spin_unlock_irq(&conf->device_lock);
 	clear_bit(STRIPE_BIOFILL_RUN, &sh->state);

 <at>  <at>  -800,8 +802,10  <at>  <at>  static void ops_run_biofill(struct strip
(Continue reading)

NeilBrown | 2 Jul 2012 02:50
Picon
Gravatar

Re: [patch 03/10 v3] raid5: add a per-stripe lock

On Mon, 25 Jun 2012 15:24:50 +0800 Shaohua Li <shli <at> kernel.org> wrote:

> Add a per-stripe lock to protect stripe specific data, like dev->read,
> written, ... The purpose is to reduce lock contention of conf->device_lock.
> 
> Signed-off-by: Shaohua Li <shli <at> fusionio.com>

I had hoped to avoid having a per-stripe lock again, but it does look like it
is needed.
However I don't like the way you have split up these three patches - it makes
them a little hard to review.

I would like to see one patch which converts the bi_phys_segments access to
be atomic and also removes all the spin_lock calls that were just for
protecting that.

Then another patch which adds the new stripe_lock, clearly documenting
exactly what is protects (not just "like dev->read" but an explicit list)
and also removes any spin_lock of device_lock that is no longer needed.

Then I could see what is being added and what is being removed all in the one
patch and I can be sure that they balance.

Thanks,
NeilBrown

> ---
>  drivers/md/raid5.c |   17 +++++++++++++++++
>  drivers/md/raid5.h |    1 +
>  2 files changed, 18 insertions(+)
(Continue reading)

Shaohua Li | 2 Jul 2012 05:16

Re: [patch 03/10 v3] raid5: add a per-stripe lock

On Mon, Jul 02, 2012 at 10:50:46AM +1000, NeilBrown wrote:
> On Mon, 25 Jun 2012 15:24:50 +0800 Shaohua Li <shli <at> kernel.org> wrote:
> 
> > Add a per-stripe lock to protect stripe specific data, like dev->read,
> > written, ... The purpose is to reduce lock contention of conf->device_lock.
> > 
> > Signed-off-by: Shaohua Li <shli <at> fusionio.com>
> 
> I had hoped to avoid having a per-stripe lock again, but it does look like it
> is needed.
> However I don't like the way you have split up these three patches - it makes
> them a little hard to review.
> 
> I would like to see one patch which converts the bi_phys_segments access to
> be atomic and also removes all the spin_lock calls that were just for
> protecting that.
> 
> Then another patch which adds the new stripe_lock, clearly documenting
> exactly what is protects (not just "like dev->read" but an explicit list)
> and also removes any spin_lock of device_lock that is no longer needed.
> 
> Then I could see what is being added and what is being removed all in the one
> patch and I can be sure that they balance.

reworked the patch 3-5 to two patches as you suggested, and sent to you. please check.

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
(Continue reading)

NeilBrown | 2 Jul 2012 09:39
Picon
Gravatar

Re: [patch 03/10 v3] raid5: add a per-stripe lock

On Mon, 2 Jul 2012 11:16:26 +0800 Shaohua Li <shli <at> kernel.org> wrote:

> > Then I could see what is being added and what is being removed all in the one
> > patch and I can be sure that they balance.
> 
> reworked the patch 3-5 to two patches as you suggested, and sent to you. please check.

Thanks.  That's looking really good.

However I think we can do better.  I've been looking more closely at the code
and I think that the only things that we need stripe_lock to protect are
->toread and ->towrite, together with the following bios.  e.g.
->toread->bi_next etc.

->read and ->written don't need stripe_lock protection, as they are only
manipulated by the handle_stripe state machine which uses STRIPE_ACTIVE 
and refcounts for exclusion.

So add_stripe_bio need to take the lock while adding a bio to the
->toread and  ->towrite lists, and ops_run_biodrain() and ops_run_biofill
need to take the lock while the move the list from ->to{read,write} to
->{read,written}.  But we don't need it anywhere else.  e.g. analyse_stripe
shouldn't need the lock at all.  Any change that could happen during the loop
could equally happen after the lock was released so we don't lose by not
having the lock.

There is another current user of the lock, but I think that should be
discarded as a false optimisation.
We currently try to optimise out extra calls to bitmap_startwrite and
bitmap_endwrite when we see back-to-back writes to the one stripe.  However I
(Continue reading)

Shaohua Li | 3 Jul 2012 03:27

Re: [patch 03/10 v3] raid5: add a per-stripe lock

On Mon, Jul 02, 2012 at 05:39:53PM +1000, NeilBrown wrote:
> On Mon, 2 Jul 2012 11:16:26 +0800 Shaohua Li <shli <at> kernel.org> wrote:
> 
> 
> > > Then I could see what is being added and what is being removed all in the one
> > > patch and I can be sure that they balance.
> > 
> > reworked the patch 3-5 to two patches as you suggested, and sent to you. please check.
> 
> Thanks.  That's looking really good.
> 
> However I think we can do better.  I've been looking more closely at the code
> and I think that the only things that we need stripe_lock to protect are
> ->toread and ->towrite, together with the following bios.  e.g.
> ->toread->bi_next etc.
> 
> ->read and ->written don't need stripe_lock protection, as they are only
> manipulated by the handle_stripe state machine which uses STRIPE_ACTIVE 
> and refcounts for exclusion.
> 
> So add_stripe_bio need to take the lock while adding a bio to the
> ->toread and  ->towrite lists, and ops_run_biodrain() and ops_run_biofill
> need to take the lock while the move the list from ->to{read,write} to
> ->{read,written}.  But we don't need it anywhere else.  e.g. analyse_stripe
> shouldn't need the lock at all.  Any change that could happen during the loop
> could equally happen after the lock was released so we don't lose by not
> having the lock.

Looks reasonable.

(Continue reading)

majianpeng | 3 Jul 2012 14:16
Picon

Re: Re: [patch 03/10 v3] raid5: add a per-stripe lock

On 2012-07-02 15:39 NeilBrown <neilb <at> suse.de> Wrote:
>On Mon, 2 Jul 2012 11:16:26 +0800 Shaohua Li <shli <at> kernel.org> wrote:
>
>
>> > Then I could see what is being added and what is being removed all in the one
>> > patch and I can be sure that they balance.
>> 
>> reworked the patch 3-5 to two patches as you suggested, and sent to you. please check.
>
>Thanks.  That's looking really good.
>
>However I think we can do better.  I've been looking more closely at the code
>and I think that the only things that we need stripe_lock to protect are
>->toread and ->towrite, together with the following bios.  e.g.
>->toread->bi_next etc.
>
>->read and ->written don't need stripe_lock protection, as they are only
>manipulated by the handle_stripe state machine which uses STRIPE_ACTIVE 
>and refcounts for exclusion.
>
>So add_stripe_bio need to take the lock while adding a bio to the
>->toread and  ->towrite lists, and ops_run_biodrain() and ops_run_biofill
>need to take the lock while the move the list from ->to{read,write} to
>->{read,written}.  
How about xchg()?
>But we don't need it anywhere else.  e.g. analyse_stripe
>shouldn't need the lock at all.  Any change that could happen during the loop
>could equally happen after the lock was released so we don't lose by not
>having the lock.
>
(Continue reading)

NeilBrown | 4 Jul 2012 01:56
Picon
Gravatar

Re: [patch 03/10 v3] raid5: add a per-stripe lock

On Tue, 3 Jul 2012 20:16:31 +0800 majianpeng <majianpeng <at> gmail.com> wrote:

> On 2012-07-02 15:39 NeilBrown <neilb <at> suse.de> Wrote:
> >On Mon, 2 Jul 2012 11:16:26 +0800 Shaohua Li <shli <at> kernel.org> wrote:
> >
> >
> >> > Then I could see what is being added and what is being removed all in the one
> >> > patch and I can be sure that they balance.
> >> 
> >> reworked the patch 3-5 to two patches as you suggested, and sent to you. please check.
> >
> >Thanks.  That's looking really good.
> >
> >However I think we can do better.  I've been looking more closely at the code
> >and I think that the only things that we need stripe_lock to protect are
> >->toread and ->towrite, together with the following bios.  e.g.
> >->toread->bi_next etc.
> >
> >->read and ->written don't need stripe_lock protection, as they are only
> >manipulated by the handle_stripe state machine which uses STRIPE_ACTIVE 
> >and refcounts for exclusion.
> >
> >So add_stripe_bio need to take the lock while adding a bio to the
> >->toread and  ->towrite lists, and ops_run_biodrain() and ops_run_biofill
> >need to take the lock while the move the list from ->to{read,write} to
> >->{read,written}.  

> How about xchg()?

(it would help a little if you made your comments stand out more, and maybe
(Continue reading)

majianpeng | 4 Jul 2012 03:09
Picon

Re: Re: [patch 03/10 v3] raid5: add a per-stripe lock

On 2012-07-04 07:56 NeilBrown <neilb <at> suse.de> Wrote:
>On Tue, 3 Jul 2012 20:16:31 +0800 majianpeng <majianpeng <at> gmail.com> wrote:
>
>> On 2012-07-02 15:39 NeilBrown <neilb <at> suse.de> Wrote:
>> >On Mon, 2 Jul 2012 11:16:26 +0800 Shaohua Li <shli <at> kernel.org> wrote:
>> >
>> >
>> >> > Then I could see what is being added and what is being removed all in the one
>> >> > patch and I can be sure that they balance.
>> >> 
>> >> reworked the patch 3-5 to two patches as you suggested, and sent to you. please check.
>> >
>> >Thanks.  That's looking really good.
>> >
>> >However I think we can do better.  I've been looking more closely at the code
>> >and I think that the only things that we need stripe_lock to protect are
>> >->toread and ->towrite, together with the following bios.  e.g.
>> >->toread->bi_next etc.
>> >
>> >->read and ->written don't need stripe_lock protection, as they are only
>> >manipulated by the handle_stripe state machine which uses STRIPE_ACTIVE 
>> >and refcounts for exclusion.
>> >
>> >So add_stripe_bio need to take the lock while adding a bio to the
>> >->toread and  ->towrite lists, and ops_run_biodrain() and ops_run_biofill
>> >need to take the lock while the move the list from ->to{read,write} to
>> >->{read,written}.  
>
>> How about xchg()?
>
(Continue reading)

Shaohua Li | 25 Jun 2012 09:24

[patch 04/10 v3] raid5: lockless access raid5 overrided bi_phys_segments

Raid5 overrides bio->bi_phys_segments, accessing it is with device_lock hold,
which is unnecessary, We can make it lockless actually.

Signed-off-by: Shaohua Li <shli <at> fusionio.com>
---
 drivers/md/raid5.c |   58 +++++++++++++++++++++++++++++------------------------
 1 file changed, 32 insertions(+), 26 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-25 14:37:13.651888057 +0800
+++ linux/drivers/md/raid5.c	2012-06-25 14:37:18.743841116 +0800
 <at>  <at>  -99,34 +99,40  <at>  <at>  static inline struct bio *r5_next_bio(st
  * We maintain a biased count of active stripes in the bottom 16 bits of
  * bi_phys_segments, and a count of processed stripes in the upper 16 bits
  */
-static inline int raid5_bi_phys_segments(struct bio *bio)
+static inline int raid5_bi_processed_stripes(struct bio *bio)
 {
-	return bio->bi_phys_segments & 0xffff;
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	return (atomic_read(segments) >> 16) & 0xffff;
 }

-static inline int raid5_bi_hw_segments(struct bio *bio)
+static inline int raid5_dec_bi_active_stripes(struct bio *bio)
 {
-	return (bio->bi_phys_segments >> 16) & 0xffff;
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	return atomic_sub_return(1, segments) & 0xffff;
(Continue reading)

Shaohua Li | 25 Jun 2012 09:24

[patch 05/10 v3] raid5: remove some device_lock locking places

With per-stripe lock and bi_phys_segments lockless, we can safely remove some
locking places of device_lock.

stripe ->read, ->toread ... are protected by per-stripe lock. Accessing bio
list of the stripe is always serialized by this lock. If bio in ->read,
->toread ... list are shared by multiple stripes, there are two protections:
1. bi_phys_segments acts as a reference count
2. traverse the list uses r5_next_bio, which makes traverse never access bio
not belonging to the stripe

Let's have an example:
|  stripe1 |  stripe2    |  stripe3  |
...bio1......|bio2|bio3|....bio4.....

stripe2 has 4 bios, when it's finished, it will decrement bi_phys_segments for
all bios, but only end_bio for bio2 and bio3. bio1->bi_next still points to
bio2, but this doesn't matter. When stripe1 is finished, it will not touch bio2
because of r5_next_bio check. Next time stripe1 will end_bio for bio1 and
stripe3 will end_bio bio4.

before add_stripe_bio() addes a bio to a stripe, we already increament the bio
bi_phys_segments, so don't worry other stripes release the bio.

Signed-off-by: Shaohua Li <shli <at> fusionio.com>
---
 drivers/md/raid5.c |   60 ++++++++++++++++++++---------------------------------
 1 file changed, 23 insertions(+), 37 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
(Continue reading)

Shaohua Li | 25 Jun 2012 09:24

[patch 06/10 v3] raid5: reduce chance release_stripe() taking device_lock

release_stripe() is a place conf->device_lock is heavily contended. We take the
lock even stripe count isn't 1, which isn't required.

Signed-off-by: Shaohua Li <shli <at> fusionio.com>
---
 drivers/md/raid5.c |   73 +++++++++++++++++++++++++++++------------------------
 1 file changed, 40 insertions(+), 33 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-25 14:37:21.000000000 +0800
+++ linux/drivers/md/raid5.c	2012-06-25 14:38:13.899130571 +0800
 <at>  <at>  -196,49 +196,56  <at>  <at>  static int stripe_operations_active(stru
 	       test_bit(STRIPE_COMPUTE_RUN, &sh->state);
 }

-static void __release_stripe(struct r5conf *conf, struct stripe_head *sh)
+static void handle_release_stripe(struct r5conf *conf, struct stripe_head *sh)
 {
-	if (atomic_dec_and_test(&sh->count)) {
-		BUG_ON(!list_empty(&sh->lru));
-		BUG_ON(atomic_read(&conf->active_stripes)==0);
-		if (test_bit(STRIPE_HANDLE, &sh->state)) {
-			if (test_bit(STRIPE_DELAYED, &sh->state) &&
-			    !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
-				list_add_tail(&sh->lru, &conf->delayed_list);
-			else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
-				   sh->bm_seq - conf->seq_write > 0)
-				list_add_tail(&sh->lru, &conf->bitmap_list);
-			else {
(Continue reading)

NeilBrown | 2 Jul 2012 02:57
Picon
Gravatar

Re: [patch 06/10 v3] raid5: reduce chance release_stripe() taking device_lock

On Mon, 25 Jun 2012 15:24:53 +0800 Shaohua Li <shli <at> kernel.org> wrote:

> release_stripe() is a place conf->device_lock is heavily contended. We take the
> lock even stripe count isn't 1, which isn't required.
> 
> Signed-off-by: Shaohua Li <shli <at> fusionio.com>
> ---
>  drivers/md/raid5.c |   73 +++++++++++++++++++++++++++++------------------------
>  1 file changed, 40 insertions(+), 33 deletions(-)
> 
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c	2012-06-25 14:37:21.000000000 +0800
> +++ linux/drivers/md/raid5.c	2012-06-25 14:38:13.899130571 +0800
>  <at>  <at>  -196,49 +196,56  <at>  <at>  static int stripe_operations_active(stru
>  	       test_bit(STRIPE_COMPUTE_RUN, &sh->state);
>  }
>  
> -static void __release_stripe(struct r5conf *conf, struct stripe_head *sh)
> +static void handle_release_stripe(struct r5conf *conf, struct stripe_head *sh)
>  {
> -	if (atomic_dec_and_test(&sh->count)) {
> -		BUG_ON(!list_empty(&sh->lru));
> -		BUG_ON(atomic_read(&conf->active_stripes)==0);
> -		if (test_bit(STRIPE_HANDLE, &sh->state)) {
> -			if (test_bit(STRIPE_DELAYED, &sh->state) &&
> -			    !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> -				list_add_tail(&sh->lru, &conf->delayed_list);
> -			else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
> -				   sh->bm_seq - conf->seq_write > 0)
(Continue reading)

Shaohua Li | 25 Jun 2012 09:24

[patch 07/10 v3] md: personality can provide unplug private data

Allow personality providing unplug private data. Next patch will use it.

Signed-off-by: Shaohua Li <shli <at> fusionio.com>
---
 drivers/md/md.c     |   31 +++++++++++++------------------
 drivers/md/md.h     |   20 +++++++++++++++++++-
 drivers/md/raid1.c  |    2 +-
 drivers/md/raid10.c |    2 +-
 drivers/md/raid5.c  |    2 +-
 5 files changed, 35 insertions(+), 22 deletions(-)

Index: linux/drivers/md/md.c
===================================================================
--- linux.orig/drivers/md/md.c	2012-06-25 14:36:13.668642048 +0800
+++ linux/drivers/md/md.c	2012-06-25 14:38:33.106889041 +0800
 <at>  <at>  -498,22 +498,13  <at>  <at>  void md_flush_request(struct mddev *mdde
 }
 EXPORT_SYMBOL(md_flush_request);

-/* Support for plugging.
- * This mirrors the plugging support in request_queue, but does not
- * require having a whole queue or request structures.
- * We allocate an md_plug_cb for each md device and each thread it gets
- * plugged on.  This links tot the private plug_handle structure in the
- * personality data where we keep a count of the number of outstanding
- * plugs so other code can see if a plug is active.
- */
-struct md_plug_cb {
-	struct blk_plug_cb cb;
-	struct mddev *mddev;
(Continue reading)

NeilBrown | 2 Jul 2012 03:06
Picon
Gravatar

Re: [patch 07/10 v3] md: personality can provide unplug private data

On Mon, 25 Jun 2012 15:24:54 +0800 Shaohua Li <shli <at> kernel.org> wrote:

> Allow personality providing unplug private data. Next patch will use it.

Thanks. I've applied this with a couple of minor changes.

In particular I change the 'size' arg to be size total size of the
plug structure, not the amount to add to the end.  I also change it
to use kzalloc rather then an extra memset.

Thanks,
NeilBrown

> 
> Signed-off-by: Shaohua Li <shli <at> fusionio.com>
> ---
>  drivers/md/md.c     |   31 +++++++++++++------------------
>  drivers/md/md.h     |   20 +++++++++++++++++++-
>  drivers/md/raid1.c  |    2 +-
>  drivers/md/raid10.c |    2 +-
>  drivers/md/raid5.c  |    2 +-
>  5 files changed, 35 insertions(+), 22 deletions(-)
> 
> Index: linux/drivers/md/md.c
> ===================================================================
> --- linux.orig/drivers/md/md.c	2012-06-25 14:36:13.668642048 +0800
> +++ linux/drivers/md/md.c	2012-06-25 14:38:33.106889041 +0800
>  <at>  <at>  -498,22 +498,13  <at>  <at>  void md_flush_request(struct mddev *mdde
>  }
>  EXPORT_SYMBOL(md_flush_request);
(Continue reading)

Shaohua Li | 25 Jun 2012 09:24

[patch 08/10 v3] raid5: make_request use batch stripe release

make_request() does stripe release for every stripe and the stripe usually has
count 1, which makes previous release_stripe() optimization not work. In my
test, this release_stripe() becomes the heaviest pleace to take
conf->device_lock after previous patches applied.

Below patch makes stripe release batch. All the stripes will be released in
unplug. The STRIPE_ON_UNPLUG_LIST bit is to protect concurrent access stripe
lru.

Signed-off-by: Shaohua Li <shli <at> fusionio.com>
---
 drivers/md/raid5.c |   64 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 drivers/md/raid5.h |    1 
 2 files changed, 60 insertions(+), 5 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-25 14:38:33.110889008 +0800
+++ linux/drivers/md/raid5.c	2012-06-25 14:38:37.378835415 +0800
 <at>  <at>  -484,7 +484,8  <at>  <at>  get_active_stripe(struct r5conf *conf, s
 		} else {
 			if (atomic_read(&sh->count)) {
 				BUG_ON(!list_empty(&sh->lru)
-				    && !test_bit(STRIPE_EXPANDING, &sh->state));
+				    && !test_bit(STRIPE_EXPANDING, &sh->state)
+				    && !test_bit(STRIPE_ON_UNPLUG_LIST, &sh->state));
 			} else {
 				if (!test_bit(STRIPE_HANDLE, &sh->state))
 					atomic_inc(&conf->active_stripes);
 <at>  <at>  -3984,6 +3985,51  <at>  <at>  static struct stripe_head *__get_priorit
(Continue reading)

NeilBrown | 2 Jul 2012 04:31
Picon
Gravatar

Re: [patch 08/10 v3] raid5: make_request use batch stripe release

On Mon, 25 Jun 2012 15:24:55 +0800 Shaohua Li <shli <at> kernel.org> wrote:

> make_request() does stripe release for every stripe and the stripe usually has
> count 1, which makes previous release_stripe() optimization not work. In my
> test, this release_stripe() becomes the heaviest pleace to take
> conf->device_lock after previous patches applied.
> 
> Below patch makes stripe release batch. All the stripes will be released in
> unplug. The STRIPE_ON_UNPLUG_LIST bit is to protect concurrent access stripe
> lru.
> 

I've applied this patch, but I'm afraid I butchered it a bit first :-)

>  <at>  <at>  -3984,6 +3985,51  <at>  <at>  static struct stripe_head *__get_priorit
>  	return sh;
>  }
>  
> +#define raid5_unplug_list(mdcb) (struct list_head *)(mdcb + 1)

I really don't like this sort of construct.  It is much cleaner (I think) to
add to a structure by embedding it in a larger structure, then using
"container_of" to map from the inner to the outer structure.  So I have
changed that.

>  <at>  <at>  -4114,7 +4161,14  <at>  <at>  static void make_request(struct mddev *m
>  			if ((bi->bi_rw & REQ_SYNC) &&
>  			    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
>  				atomic_inc(&conf->preread_active_stripes);
> -			release_stripe(sh);
(Continue reading)

Shaohua Li | 2 Jul 2012 04:59

Re: [patch 08/10 v3] raid5: make_request use batch stripe release

On Mon, Jul 02, 2012 at 12:31:12PM +1000, NeilBrown wrote:
> On Mon, 25 Jun 2012 15:24:55 +0800 Shaohua Li <shli <at> kernel.org> wrote:
> 
> > make_request() does stripe release for every stripe and the stripe usually has
> > count 1, which makes previous release_stripe() optimization not work. In my
> > test, this release_stripe() becomes the heaviest pleace to take
> > conf->device_lock after previous patches applied.
> > 
> > Below patch makes stripe release batch. All the stripes will be released in
> > unplug. The STRIPE_ON_UNPLUG_LIST bit is to protect concurrent access stripe
> > lru.
> > 
> 
> I've applied this patch, but I'm afraid I butchered it a bit first :-)
> 
> 
> >  <at>  <at>  -3984,6 +3985,51  <at>  <at>  static struct stripe_head *__get_priorit
> >  	return sh;
> >  }
> >  
> > +#define raid5_unplug_list(mdcb) (struct list_head *)(mdcb + 1)
> 
> I really don't like this sort of construct.  It is much cleaner (I think) to
> add to a structure by embedding it in a larger structure, then using
> "container_of" to map from the inner to the outer structure.  So I have
> changed that.

Thanks.

> >  <at>  <at>  -4114,7 +4161,14  <at>  <at>  static void make_request(struct mddev *m
(Continue reading)

NeilBrown | 2 Jul 2012 07:07
Picon
Gravatar

Re: [patch 08/10 v3] raid5: make_request use batch stripe release

On Mon, 2 Jul 2012 10:59:50 +0800 Shaohua Li <shli <at> kernel.org> wrote:

> > From 04b7dd7d0ad4a21622cad7c10821f914a8d9ccd3 Mon Sep 17 00:00:00 2001
> > From: NeilBrown <neilb <at> suse.de>
> > Date: Mon, 2 Jul 2012 12:14:49 +1000
> > Subject: [PATCH] md/plug: disable preempt when reported a plug is present.
> > 
> > As 'schedule' will unplug a queue, a plug added by
> > mddev_check_plugged is only valid until the next schedule().
> > So call preempt_disable before installing the plug, and require the
> > called to call preempt_enable once the value has been used.
> > 
> > Signed-off-by: NeilBrown  <neilb <at> suse.de>
> > 
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 1369c9d..63ea6d6 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> >  <at>  <at>  -512,6 +512,10  <at>  <at>  static void plugger_unplug(struct blk_plug_cb *cb)
> >  
> >  /* Check that an unplug wakeup will come shortly.
> >   * If not, wakeup the md thread immediately
> > + * Note that the structure returned is only value until
> > + * the next schedule(), so preemption is disabled when it
> > + * is not NULL, and must be re-enabled after the value
> > + * has been used.
> >   */
> >  struct md_plug_cb *mddev_check_plugged(struct mddev *mddev,
> >  				       md_unplug_func_t unplug, size_t size)
> >  <at>  <at>  -522,6 +526,7  <at>  <at>  struct md_plug_cb *mddev_check_plugged(struct mddev *mddev,
(Continue reading)

Shaohua Li | 25 Jun 2012 09:24

[patch 09/10 v3] raid5: raid5d handle stripe in batch way

Let raid5d handle stripe in batch way to reduce conf->device_lock locking.

Signed-off-by: Shaohua Li <shli <at> fusionio.com>
---
 drivers/md/raid5.c |   45 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 32 insertions(+), 13 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-25 14:38:37.378835415 +0800
+++ linux/drivers/md/raid5.c	2012-06-25 14:38:40.150800523 +0800
 <at>  <at>  -4568,6 +4568,30  <at>  <at>  static int  retry_aligned_read(struct r5
 	return handled;
 }

+#define MAX_STRIPE_BATCH 8
+static int handle_active_stripes(struct r5conf *conf)
+{
+	struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
+	int i, batch_size = 0;
+
+	while (batch_size < MAX_STRIPE_BATCH &&
+			(sh = __get_priority_stripe(conf)) != NULL)
+		batch[batch_size++] = sh;
+
+	if (batch_size == 0)
+		return batch_size;
+	spin_unlock_irq(&conf->device_lock);
+
+	for (i = 0; i < batch_size; i++)
(Continue reading)

NeilBrown | 2 Jul 2012 04:32
Picon
Gravatar

Re: [patch 09/10 v3] raid5: raid5d handle stripe in batch way

On Mon, 25 Jun 2012 15:24:56 +0800 Shaohua Li <shli <at> kernel.org> wrote:

> Let raid5d handle stripe in batch way to reduce conf->device_lock locking.
> 
> Signed-off-by: Shaohua Li <shli <at> fusionio.com>
> ---
>  drivers/md/raid5.c |   45 ++++++++++++++++++++++++++++++++-------------
>  1 file changed, 32 insertions(+), 13 deletions(-)
> 
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c	2012-06-25 14:38:37.378835415 +0800
> +++ linux/drivers/md/raid5.c	2012-06-25 14:38:40.150800523 +0800
>  <at>  <at>  -4568,6 +4568,30  <at>  <at>  static int  retry_aligned_read(struct r5
>  	return handled;
>  }
>  
> +#define MAX_STRIPE_BATCH 8
> +static int handle_active_stripes(struct r5conf *conf)
> +{
> +	struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
> +	int i, batch_size = 0;
> +
> +	while (batch_size < MAX_STRIPE_BATCH &&
> +			(sh = __get_priority_stripe(conf)) != NULL)
> +		batch[batch_size++] = sh;
> +
> +	if (batch_size == 0)
> +		return batch_size;
> +	spin_unlock_irq(&conf->device_lock);
(Continue reading)

Shaohua Li | 25 Jun 2012 09:24

[patch 10/10 v3] raid5: create multiple threads to handle stripes

Like raid 1/10, raid5 uses one thread to handle stripe. In a fast storage, the
thread becomes a bottleneck. raid5 can offload calculation like checksum to
async threads. And if storge is fast, scheduling async work and running async
work will introduce heavy lock contention of workqueue, which makes such
optimization useless. And calculation isn't the only bottleneck. For example,
in my test raid5 thread must handle > 450k requests per second. Just doing
dispatch and completion will make raid5 thread incapable. The only chance to
scale is using several threads to handle stripe.

With this patch, user can create several extra threads to handle stripe. How
many threads are better depending on disk number, so the thread number can be
changed in userspace. By default, the thread number is 0, which means no extra
thread.

In a 3-disk raid5 setup, 2 extra threads can provide 130% throughput
improvement (double stripe_cache_size) and the throughput is pretty close to
theory value. With >=4 disks, the improvement is even bigger, for example, can
improve 200% for 4-disk setup, but the throughput is far less than theory
value, which is caused by several factors like request queue lock contention,
cache issue, latency introduced by how a stripe is handled in different disks.
Those factors need further investigations.

Signed-off-by: Shaohua Li <shli <at> fusionio.com>
---
 drivers/md/raid5.c |  137 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/md/raid5.h |    3 +
 2 files changed, 139 insertions(+), 1 deletion(-)

Index: linux/drivers/md/raid5.c
===================================================================
(Continue reading)

NeilBrown | 2 Jul 2012 04:39
Picon
Gravatar

Re: [patch 10/10 v3] raid5: create multiple threads to handle stripes

On Mon, 25 Jun 2012 15:24:57 +0800 Shaohua Li <shli <at> kernel.org> wrote:

> Like raid 1/10, raid5 uses one thread to handle stripe. In a fast storage, the
> thread becomes a bottleneck. raid5 can offload calculation like checksum to
> async threads. And if storge is fast, scheduling async work and running async
> work will introduce heavy lock contention of workqueue, which makes such
> optimization useless. And calculation isn't the only bottleneck. For example,
> in my test raid5 thread must handle > 450k requests per second. Just doing
> dispatch and completion will make raid5 thread incapable. The only chance to
> scale is using several threads to handle stripe.
> 
> With this patch, user can create several extra threads to handle stripe. How
> many threads are better depending on disk number, so the thread number can be
> changed in userspace. By default, the thread number is 0, which means no extra
> thread.
> 
> In a 3-disk raid5 setup, 2 extra threads can provide 130% throughput
> improvement (double stripe_cache_size) and the throughput is pretty close to
> theory value. With >=4 disks, the improvement is even bigger, for example, can
> improve 200% for 4-disk setup, but the throughput is far less than theory
> value, which is caused by several factors like request queue lock contention,
> cache issue, latency introduced by how a stripe is handled in different disks.
> Those factors need further investigations.
> 
> Signed-off-by: Shaohua Li <shli <at> fusionio.com>
> ---
>  drivers/md/raid5.c |  137 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  drivers/md/raid5.h |    3 +
>  2 files changed, 139 insertions(+), 1 deletion(-)
> 
(Continue reading)

Dan Williams | 2 Jul 2012 22:03
Picon
Favicon

Re: [patch 10/10 v3] raid5: create multiple threads to handle stripes

On Mon, Jun 25, 2012 at 12:24 AM, Shaohua Li <shli <at> kernel.org> wrote:
> Like raid 1/10, raid5 uses one thread to handle stripe. In a fast storage, the
> thread becomes a bottleneck. raid5 can offload calculation like checksum to
> async threads. And if storge is fast, scheduling async work and running async
> work will introduce heavy lock contention of workqueue, which makes such
> optimization useless. And calculation isn't the only bottleneck. For example,
> in my test raid5 thread must handle > 450k requests per second. Just doing
> dispatch and completion will make raid5 thread incapable. The only chance to
> scale is using several threads to handle stripe.
>
> With this patch, user can create several extra threads to handle stripe. How
> many threads are better depending on disk number, so the thread number can be
> changed in userspace. By default, the thread number is 0, which means no extra
> thread.
>
> In a 3-disk raid5 setup, 2 extra threads can provide 130% throughput
> improvement (double stripe_cache_size) and the throughput is pretty close to
> theory value. With >=4 disks, the improvement is even bigger, for example, can
> improve 200% for 4-disk setup, but the throughput is far less than theory
> value, which is caused by several factors like request queue lock contention,
> cache issue, latency introduced by how a stripe is handled in different disks.
> Those factors need further investigations.
>
> Signed-off-by: Shaohua Li <shli <at> fusionio.com>

Can you share a bit more about your test setup?  Is this
single-threaded throughput?  I'm wondering if we can take advantage of
keeping the work cpu local.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
(Continue reading)

Shaohua Li | 3 Jul 2012 10:04

Re: [patch 10/10 v3] raid5: create multiple threads to handle stripes

On Mon, Jul 02, 2012 at 01:03:53PM -0700, Dan Williams wrote:
> On Mon, Jun 25, 2012 at 12:24 AM, Shaohua Li <shli <at> kernel.org> wrote:
> > Like raid 1/10, raid5 uses one thread to handle stripe. In a fast storage, the
> > thread becomes a bottleneck. raid5 can offload calculation like checksum to
> > async threads. And if storge is fast, scheduling async work and running async
> > work will introduce heavy lock contention of workqueue, which makes such
> > optimization useless. And calculation isn't the only bottleneck. For example,
> > in my test raid5 thread must handle > 450k requests per second. Just doing
> > dispatch and completion will make raid5 thread incapable. The only chance to
> > scale is using several threads to handle stripe.
> >
> > With this patch, user can create several extra threads to handle stripe. How
> > many threads are better depending on disk number, so the thread number can be
> > changed in userspace. By default, the thread number is 0, which means no extra
> > thread.
> >
> > In a 3-disk raid5 setup, 2 extra threads can provide 130% throughput
> > improvement (double stripe_cache_size) and the throughput is pretty close to
> > theory value. With >=4 disks, the improvement is even bigger, for example, can
> > improve 200% for 4-disk setup, but the throughput is far less than theory
> > value, which is caused by several factors like request queue lock contention,
> > cache issue, latency introduced by how a stripe is handled in different disks.
> > Those factors need further investigations.
> >
> > Signed-off-by: Shaohua Li <shli <at> fusionio.com>
> 
> Can you share a bit more about your test setup?  Is this
> single-threaded throughput?  I'm wondering if we can take advantage of
> keeping the work cpu local.

(Continue reading)


Gmane