Shaohua Li | 2 Jul 2012 03:08

[patch 0/3 v3] Optimize raid1 read balance for SSD

raid1 read balance is an important algorithm to make read performance optimal.
It's a distance based algorithm, eg, for each request dispatch, choose disk
whose last finished request is close the request. This is great for hard disk.
But SSD has some special characteristics:

1. nonrotational. Distance means nothing for SSD, though merging small rquests
to big request is still optimal for SSD. If no merge, distributing rquests to
raid disks as more as possible is more optimal.

2. Getting too big request isn't always optimal. For hard disk, compared to
spindle move, data transfer overhead is trival, so we always prefer bigger
request. In SSD, request size exceeds specific value, performance isn't always
increased with request size increased.  An example is readahead. If readahead
merges too big request and causes some disks idle, the performance is less
optimal than that when all disks are busy and running small requests.

The patches try to address the issues.

V3:
makes the algorithm for the first issue work for hard disk and SSD mixed raid
as suggested by Neil. The algorithm for the second issue only applies to SSD,
because in my test it degrades performance for hard disk raid or hard disk/SSD
mixed raid.

V1->V2:
rebase to latest kernel.

Thanks,
Shaohua
--
(Continue reading)

Shaohua Li | 2 Jul 2012 03:08

[patch 1/3 v3] raid1: make sequential read detection per disk based

Currently the sequential read detection is global wide. It's natural to make it
per disk based, which can improve the detection for concurrent multiple
sequential reads. And next patch will make SSD read balance not use distance
based algorithm, where this change help detect truly sequential read for SSD.

Signed-off-by: Shaohua Li <shli <at> fusionio.com>
---
 drivers/md/raid1.c |   29 ++++++-----------------------
 drivers/md/raid1.h |   11 +++++------
 2 files changed, 11 insertions(+), 29 deletions(-)

Index: linux/drivers/md/raid1.c
===================================================================
--- linux.orig/drivers/md/raid1.c	2012-06-28 10:44:47.550666575 +0800
+++ linux/drivers/md/raid1.c	2012-06-28 12:01:03.513137377 +0800
 <at>  <at>  -483,7 +483,6  <at>  <at>  static int read_balance(struct r1conf *c
 	const sector_t this_sector = r1_bio->sector;
 	int sectors;
 	int best_good_sectors;
-	int start_disk;
 	int best_disk;
 	int i;
 	sector_t best_dist;
 <at>  <at>  -503,20 +502,17  <at>  <at>  static int read_balance(struct r1conf *c
 	best_good_sectors = 0;

 	if (conf->mddev->recovery_cp < MaxSector &&
-	    (this_sector + sectors >= conf->next_resync)) {
+	    (this_sector + sectors >= conf->next_resync))
 		choose_first = 1;
(Continue reading)

NeilBrown | 4 Jul 2012 07:38
Picon
Gravatar

Re: [patch 1/3 v3] raid1: make sequential read detection per disk based

On Mon, 02 Jul 2012 09:08:41 +0800 Shaohua Li <shli <at> kernel.org> wrote:

> Currently the sequential read detection is global wide. It's natural to make it
> per disk based, which can improve the detection for concurrent multiple
> sequential reads. And next patch will make SSD read balance not use distance
> based algorithm, where this change help detect truly sequential read for SSD.
> 
> Signed-off-by: Shaohua Li <shli <at> fusionio.com>
> ---
>  drivers/md/raid1.c |   29 ++++++-----------------------
>  drivers/md/raid1.h |   11 +++++------
>  2 files changed, 11 insertions(+), 29 deletions(-)
> 
> Index: linux/drivers/md/raid1.c
> ===================================================================
> --- linux.orig/drivers/md/raid1.c	2012-06-28 10:44:47.550666575 +0800
> +++ linux/drivers/md/raid1.c	2012-06-28 12:01:03.513137377 +0800
>  <at>  <at>  -483,7 +483,6  <at>  <at>  static int read_balance(struct r1conf *c
>  	const sector_t this_sector = r1_bio->sector;
>  	int sectors;
>  	int best_good_sectors;
> -	int start_disk;
>  	int best_disk;
>  	int i;
>  	sector_t best_dist;
>  <at>  <at>  -503,20 +502,17  <at>  <at>  static int read_balance(struct r1conf *c
>  	best_good_sectors = 0;
>  
>  	if (conf->mddev->recovery_cp < MaxSector &&
> -	    (this_sector + sectors >= conf->next_resync)) {
(Continue reading)

Shaohua Li | 2 Jul 2012 03:08

[patch 3/3 v3] raid1: prevent merging too large request

For SSD, if request size exceeds specific value (optimal io size), request size
isn't important for bandwidth. In such condition, if making request size bigger
will cause some disks idle, the total throughput will actually drop. A good
example is doing a readahead in a two-disk raid1 setup.

So when we should split big request? We absolutly don't want to split big
request to very small requests. Even in SSD, big request transfer is more
efficient. Below patch only consider request with size above optimal io size.

If all disks are busy, is it worthy to do split? Say optimal io size is 16k,
two requests 32k and two disks. We can let each disk run one 32k request, or
split the requests to 4 16k requests and each disk runs two. It's hard to say
which case is better, depending on hardware.

So only consider case where there are idle disks. For readahead, split is
always better in this case. And in my test, below patch can improve > 30%
thoughput. Hmm, not 100%, because disk isn't 100% busy.

Such case can happen not just in readahead, for example, in directio. But I
suppose directio usually will have bigger IO depth and make all disks busy, so
I ignored it.

Note: if the raid uses any hard disk, we don't prevent merging. That will make
performace worse.

Signed-off-by: Shaohua Li <shli <at> fusionio.com>
---
 drivers/md/raid1.c |   46 ++++++++++++++++++++++++++++++++++++++--------
 drivers/md/raid1.h |    1 +
 2 files changed, 39 insertions(+), 8 deletions(-)
(Continue reading)

NeilBrown | 4 Jul 2012 07:59
Picon
Gravatar

Re: [patch 3/3 v3] raid1: prevent merging too large request

On Mon, 02 Jul 2012 09:08:43 +0800 Shaohua Li <shli <at> kernel.org> wrote:

> For SSD, if request size exceeds specific value (optimal io size), request size
> isn't important for bandwidth. In such condition, if making request size bigger
> will cause some disks idle, the total throughput will actually drop. A good
> example is doing a readahead in a two-disk raid1 setup.
> 
> So when we should split big request? We absolutly don't want to split big
> request to very small requests. Even in SSD, big request transfer is more
> efficient. Below patch only consider request with size above optimal io size.
> 
> If all disks are busy, is it worthy to do split? Say optimal io size is 16k,
> two requests 32k and two disks. We can let each disk run one 32k request, or
> split the requests to 4 16k requests and each disk runs two. It's hard to say
> which case is better, depending on hardware.
> 
> So only consider case where there are idle disks. For readahead, split is
> always better in this case. And in my test, below patch can improve > 30%
> thoughput. Hmm, not 100%, because disk isn't 100% busy.
> 
> Such case can happen not just in readahead, for example, in directio. But I
> suppose directio usually will have bigger IO depth and make all disks busy, so
> I ignored it.
> 
> Note: if the raid uses any hard disk, we don't prevent merging. That will make
> performace worse.
> 
> Signed-off-by: Shaohua Li <shli <at> fusionio.com>
> ---
>  drivers/md/raid1.c |   46 ++++++++++++++++++++++++++++++++++++++--------
(Continue reading)

Shaohua Li | 4 Jul 2012 10:01

Re: [patch 3/3 v3] raid1: prevent merging too large request

2012/7/4 NeilBrown <neilb <at> suse.de>:
> On Mon, 02 Jul 2012 09:08:43 +0800 Shaohua Li <shli <at> kernel.org> wrote:
>
>> For SSD, if request size exceeds specific value (optimal io size), request size
>> isn't important for bandwidth. In such condition, if making request size bigger
>> will cause some disks idle, the total throughput will actually drop. A good
>> example is doing a readahead in a two-disk raid1 setup.
>>
>> So when we should split big request? We absolutly don't want to split big
>> request to very small requests. Even in SSD, big request transfer is more
>> efficient. Below patch only consider request with size above optimal io size.
>>
>> If all disks are busy, is it worthy to do split? Say optimal io size is 16k,
>> two requests 32k and two disks. We can let each disk run one 32k request, or
>> split the requests to 4 16k requests and each disk runs two. It's hard to say
>> which case is better, depending on hardware.
>>
>> So only consider case where there are idle disks. For readahead, split is
>> always better in this case. And in my test, below patch can improve > 30%
>> thoughput. Hmm, not 100%, because disk isn't 100% busy.
>>
>> Such case can happen not just in readahead, for example, in directio. But I
>> suppose directio usually will have bigger IO depth and make all disks busy, so
>> I ignored it.
>>
>> Note: if the raid uses any hard disk, we don't prevent merging. That will make
>> performace worse.
>>
>> Signed-off-by: Shaohua Li <shli <at> fusionio.com>
>> ---
(Continue reading)

Shaohua Li | 2 Jul 2012 03:08

[patch 2/3 v3] raid1: read balance chooses idlest disk for SSD

SSD hasn't spindle, distance between requests means nothing. And the original
distance based algorithm sometimes can cause severe performance issue for SSD
raid.

Considering two thread groups, one accesses file A, the other access file B.
The first group will access one disk and the second will access the other disk,
because requests are near from one group and far between groups. In this case,
read balance might keep one disk very busy but the other relative idle.  For
SSD, we should try best to distribute requests to as more disks as possible.
There isn't spindle move penality anyway.

With below patch, I can see more than 50% throughput improvement sometimes
depending on workloads.

The only exception is small requests can be merged to a big request which
typically can drive higher throughput for SSD too. Such small requests are
sequential reads. Unlike hard disk, sequential read which can't be merged (for
example direct IO, or read without readahead) can be ignored for SSD. Again
there is no spindle move penality. readahead dispatches small requests and such
requests can be merged.

Last patch can help detect sequential read well, at least if concurrent read
number isn't greater than raid disk number. In that case, distance based
algorithm doesn't work well too.

V2: For hard disk and SSD mixed raid, doesn't use distance based algorithm for
random IO too. This makes the algorithm generic for raid with SSD.

Signed-off-by: Shaohua Li <shli <at> fusionio.com>
---
(Continue reading)

Roberto Spadim | 2 Jul 2012 04:13
Picon

Re: [patch 2/3 v3] raid1: read balance chooses idlest disk for SSD

nice =) very very nice =)
maybe to get better than this... select the disk with min(pending time)

time could be estimated with something like:
(distance * time/distance unit) + (blocks to read/write * time to
read/write 1 block) + (non sequencial penalty time)

for ssd:
    time/distance unit  = 0
    time to read/write 1 block = must test with each device
    non sequencial penalty time = must test with each device, but some
tests show that non sequencial are near to sequencial reads
for hd:
    time/distance unit and time to read/write 1 block are proportional
with disk speed (rpm)
    non sequencial penalty time is proportional to distance and head
position, many disk specs show in specs that it´s take, in worst case,
near 10ms to start reading/writing, this time is the time to disk spin
one revolution and put head in right position
        check that time to read/write in rotational change with block
position and number of heads reading (blocks at center of disk are
slower, blocks far from center are faster)
         for ssd it´s change with allocation 'problems', (for write)
if a block is 'trimmed' it´s very fast, if block is used (dirty) it
must read block, change and write, this is a slower... in others
words... the time to read is related with position and mean disk/ssd
read/write times (for a 'good' aproximation not a ideal one...), this
algorithm (without pending information) give me 1% of mean improvement
in kernel 2.6.33 (must check but i think that´s right)

(Continue reading)

Shaohua Li | 2 Jul 2012 05:02

Re: [patch 2/3 v3] raid1: read balance chooses idlest disk for SSD

On Sun, Jul 01, 2012 at 11:13:42PM -0300, Roberto Spadim wrote:
> nice =) very very nice =)
> maybe to get better than this... select the disk with min(pending time)
> 
> time could be estimated with something like:
> (distance * time/distance unit) + (blocks to read/write * time to
> read/write 1 block) + (non sequencial penalty time)

I didn't think there is way to measure request time. Disks support NCQ, they
can dispatch several requests at one time and finish them almost at the same
time. I used to measure this time (to make CFQ ioscheduler self tune), but
failed.

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo <at> vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Roberto Spadim | 2 Jul 2012 05:57
Picon

Re: [patch 2/3 v3] raid1: read balance chooses idlest disk for SSD

hummm well that´s true... exist a queue inside disk hardware that we
can´t measure... but... if you want i can make tests to you :)
i used a configuration a bit diferent some time ago, instead of a SSD
and a harddisk, i used a disk with 7200rpm and a disk with 15000 the
"time based" algorithm runs nice in this case, maybe could give just a
little more 'performace' (maybe none), like i told the mean performace
that i got was 1% (i made tests with different disks speed and
ssd+disks, i had a ocz vortex2, a sata 7200rpm (500gb) and a sas
15000rpm (142gb), some other guy here in kernel list tested too, but
they didn´t confirmed if the performace was a mean performace or just
a 'error' in measure

when i done this i got some 'empirical' values to 'tune' the
algorithm, i don´t remember all 'theory' but i done something like
this:

1)  (distance * time/distance unit)
time/distance unit,
    i don´t remember distance unit, i think it´s 1 block =  512bytes
right? well, just check the idea...
    for disks:
        total disk capacity in distance units / 1 revolution time
        1 revolution time = 1/rpm for disk, for example
              7200 rpm => 120 hz => 8.333ms = 8333us (near 10ms like
told in disk spec of random acess time)
              15000 rpm => 250hz => 4ms = 4000us (near 5ms like told
in disk spec)
    for ssd : 0 seconds
	7200 => 500gb (1024*1024*1024/512) / 8333 =   1048576000blocks /
8333us = 0.000'007'946'968'078 block/us
(Continue reading)

Roberto Spadim | 2 Jul 2012 06:33
Picon

Re: [patch 2/3 v3] raid1: read balance chooses idlest disk for SSD

check that if you don´t what this algorithm, you could use:
distance time =1
read time=0
penalty =0
and it would work as today implementation... (ok must check if this
could work for single disk to full array read, but it´s near)

2012/7/2 Roberto Spadim <roberto <at> spadim.com.br>:
> hummm well that´s true... exist a queue inside disk hardware that we
> can´t measure... but... if you want i can make tests to you :)
> i used a configuration a bit diferent some time ago, instead of a SSD
> and a harddisk, i used a disk with 7200rpm and a disk with 15000 the
> "time based" algorithm runs nice in this case, maybe could give just a
> little more 'performace' (maybe none), like i told the mean performace
> that i got was 1% (i made tests with different disks speed and
> ssd+disks, i had a ocz vortex2, a sata 7200rpm (500gb) and a sas
> 15000rpm (142gb), some other guy here in kernel list tested too, but
> they didn´t confirmed if the performace was a mean performace or just
> a 'error' in measure
>
> when i done this i got some 'empirical' values to 'tune' the
> algorithm, i don´t remember all 'theory' but i done something like
> this:
>
>
> 1)  (distance * time/distance unit)
> time/distance unit,
>     i don´t remember distance unit, i think it´s 1 block =  512bytes
> right? well, just check the idea...
>     for disks:
(Continue reading)

Roberto Spadim | 2 Jul 2012 06:31
Picon

Re: [patch 2/3 v3] raid1: read balance chooses idlest disk for SSD

for SSD without inteligent parts (cache and queue) i think this could
works nice... (at least in theory)
for USB sticks and microsd cards may work too...

2012/7/2 Shaohua Li <shli <at> kernel.org>:
> On Sun, Jul 01, 2012 at 11:13:42PM -0300, Roberto Spadim wrote:
>> nice =) very very nice =)
>> maybe to get better than this... select the disk with min(pending time)
>>
>> time could be estimated with something like:
>> (distance * time/distance unit) + (blocks to read/write * time to
>> read/write 1 block) + (non sequencial penalty time)
>
> I didn't think there is way to measure request time. Disks support NCQ, they
> can dispatch several requests at one time and finish them almost at the same
> time. I used to measure this time (to make CFQ ioscheduler self tune), but
> failed.
>
> Thanks,
> Shaohua
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo <at> vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--

-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
(Continue reading)

Roberto Spadim | 2 Jul 2012 06:36
Picon

Re: [patch 2/3 v3] raid1: read balance chooses idlest disk for SSD

sorry about many posts, i will stop a bit now...

could we turn off disk NCQ? ok nobody want to read this, but could we
turn off NCQ and get better performace? since we do a good elevator at
linux code... maybe turn off disk cache too... (it´s only 32MB for
sata/sas, and can be alot to raid controller...), just some ideas to
get better performace... (i didn´t think before write if cache help or
not when we send I/O to disk, but check if it´s a good question or
not)

2012/7/2 Roberto Spadim <roberto <at> spadim.com.br>:
> for SSD without inteligent parts (cache and queue) i think this could
> works nice... (at least in theory)
> for USB sticks and microsd cards may work too...
>
>
> 2012/7/2 Shaohua Li <shli <at> kernel.org>:
>> On Sun, Jul 01, 2012 at 11:13:42PM -0300, Roberto Spadim wrote:
>>> nice =) very very nice =)
>>> maybe to get better than this... select the disk with min(pending time)
>>>
>>> time could be estimated with something like:
>>> (distance * time/distance unit) + (blocks to read/write * time to
>>> read/write 1 block) + (non sequencial penalty time)
>>
>> I didn't think there is way to measure request time. Disks support NCQ, they
>> can dispatch several requests at one time and finish them almost at the same
>> time. I used to measure this time (to make CFQ ioscheduler self tune), but
>> failed.
>>
(Continue reading)

NeilBrown | 4 Jul 2012 07:45
Picon
Gravatar

Re: [patch 2/3 v3] raid1: read balance chooses idlest disk for SSD

On Mon, 02 Jul 2012 09:08:42 +0800 Shaohua Li <shli <at> kernel.org> wrote:

> SSD hasn't spindle, distance between requests means nothing. And the original
> distance based algorithm sometimes can cause severe performance issue for SSD
> raid.
> 
> Considering two thread groups, one accesses file A, the other access file B.
> The first group will access one disk and the second will access the other disk,
> because requests are near from one group and far between groups. In this case,
> read balance might keep one disk very busy but the other relative idle.  For
> SSD, we should try best to distribute requests to as more disks as possible.
> There isn't spindle move penality anyway.
> 
> With below patch, I can see more than 50% throughput improvement sometimes
> depending on workloads.
> 
> The only exception is small requests can be merged to a big request which
> typically can drive higher throughput for SSD too. Such small requests are
> sequential reads. Unlike hard disk, sequential read which can't be merged (for
> example direct IO, or read without readahead) can be ignored for SSD. Again
> there is no spindle move penality. readahead dispatches small requests and such
> requests can be merged.
> 
> Last patch can help detect sequential read well, at least if concurrent read
> number isn't greater than raid disk number. In that case, distance based
> algorithm doesn't work well too.
> 
> V2: For hard disk and SSD mixed raid, doesn't use distance based algorithm for
> random IO too. This makes the algorithm generic for raid with SSD.
> 
(Continue reading)


Gmane