RDP | 24 Sep 14:10

ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

Hello, 
  Maybe this question has been addressed elsewhere, but I'd like the opinion and experience of other users.

There could be some misconceptions that I might be carrying, so please be kind enough to point them out. Any help, advice, and suggestions will be very much appreciated.

My goal is to get a Gluster NAS of more than 100 TB up in the cloud. Each server will hold around 2x8 TB disks. The export volume size (client disk mount size) would be greater than 20 TB.

This is how I am planning to set it all up: 16 servers, each with 2x8 = 16 TB of space. The GlusterFS volume will be replicated and distributed (like RAID-10). I'd like to go with ZFS on Linux for the disks.
The client machines will use the GlusterFS client to mount the volumes.
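As a rough check on the sizing above: the usable capacity of a distributed-replicated volume is the raw capacity divided by the replica count. The sketch below assumes a replica count of 2 (the RAID-10 analogy), equally sized bricks, and decimal terabytes; the function name is made up for illustration.

```python
def usable_capacity_tb(servers: int, tb_per_server: float, replica: int = 2) -> float:
    """Usable capacity of a distributed-replicated volume, in TB.

    Assumes every server contributes the same amount of space and that
    the servers divide evenly into replica sets (assumptions of this sketch).
    """
    if replica < 1 or servers % replica != 0:
        raise ValueError("server count must be a multiple of the replica count")
    raw_tb = servers * tb_per_server
    return raw_tb / replica


if __name__ == "__main__":
    # 16 servers x (2 x 8 TB) with 2-way replication
    print(usable_capacity_tb(16, 2 * 8))  # 128.0
```

So 256 TB raw leaves about 128 TB usable, which still clears the 100 TB goal, before any filesystem or pool overhead.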

ext4 is limited to 16 TB by the userspace tools (e2fsprogs).
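The 16 TB figure follows from 32-bit block numbers: with the common 4 KiB block size, 2^32 blocks come to 16 TiB. A quick sketch of that arithmetic (the 4 KiB block size is an assumption; other block sizes give other limits, and the function name is illustrative only):

```python
TIB = 2 ** 40


def max_fs_bytes(block_size: int = 4096, block_number_bits: int = 32) -> int:
    """Largest filesystem addressable with the given block size and block-number width."""
    return (2 ** block_number_bits) * block_size


if __name__ == "__main__":
    print(max_fs_bytes() // TIB)  # 16
```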

Would this be considered a production-ready setup? The data housed on this cluster is critical, and hence I need to be very sure before I go ahead with this kind of setup.

Or would using ZFS with Gluster make more sense on FreeBSD or illumos (ZFS is native there)?

Thanks a lot
_______________________________________________
Gluster-users mailing list
Gluster-users@...
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Craig Carl | 24 Sep 21:12

Re: ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

The only thing I might worry about is using ZFS on Linux. I still think it might be a little early to trust it with truly critical data, plus there doesn't seem to be a big ZFS+Linux+Gluster install base to help you if problems come up.

I would use mdadm + LVM2 to create your RAID arrays on each server, creating multiple ~2TB LUNs on each server with ext3 or ext4, then layer Gluster on top of that.

Craig

Sent from a mobile device, please excuse my tpyos.

Liam Slusser | 24 Sep 21:14

Re: ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

I have a very large (>500 TB) Gluster cluster on CentOS Linux, and I use the XFS filesystem in a production role. Each XFS filesystem (brick) is around 32 TB in size. No problems; it all runs very well.

ls

Anand Babu Periasamy | 24 Sep 21:26

Re: ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

On Sat, Sep 24, 2011 at 12:14 PM, Liam Slusser wrote:

> I have a very large (>500 TB) Gluster cluster on CentOS Linux, and I use the XFS filesystem in a production role. Each XFS filesystem (brick) is around 32 TB in size. No problems; it all runs very well.
>
> ls

Yes, XFS is the way to go for large partitions > 16 TB (or even 12 TB). XFS has been brought back to life by Red Hat; most of the XFS developers are now RH employees. We can confidently recommend XFS now.

--
Anand Babu Periasamy
Blog [ http://www.unlocksmith.org ]
Twitter [ http://twitter.com/abperiasamy ]

Imagination is more important than knowledge --Albert Einstein
Jon Tegner | 25 Sep 14:26

Re: ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

Sorry for a stupid question, but would there be issues using glusterfs based on several 11 TB ext4-bricks?

/jon


Nathan Stratton | 25 Sep 15:30

Re: ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud


On Sun, 25 Sep 2011, Jon Tegner wrote:

> Sorry for a stupid question, but would there be issues using glusterfs based 
> on several 11 TB ext4-bricks?

Nope! I have used that in a config: 16 x 750 GB disks in RAID 6 on a 3ware
card, formatted ext4.

><>
Nathan Stratton                                CTO, BlinkMind, Inc.
nathan at robotics.net                         nathan at blinkmind.com
http://www.robotics.net                        http://www.blinkmind.com
Craig Carl | 24 Sep 21:33

Re: ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

XFS is a valid alternative to ZFS on Linux. If I remember correctly, any operation that requires modifying a lot of xattrs can be slower than on ext*. Have you noticed anything like that? You might see slower rebalances or self-healing.

Craig

Sent from a mobile device, please excuse my tpyos.

Liam Slusser | 24 Sep 22:21

Re: ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

I've also heard it can be slower; however, I've never done any performance tests on the same hardware with ext3/4 vs XFS, since my partitions are so big that ext3/4 is just not an option. With that said, I've been pleased with the performance I get and am a happy XFS user.

ls

Di Pe | 25 Sep 09:56

Re: ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

Hi,

I think you are up to an interesting project. Maybe you could share a
few more details. (1) What cloud are you planning to use: EC2 with EBS
volumes, or some hosted stuff like Rackspace? (2) What are your
motivations for using RAID10? (For example, on Amazon that would
increase your monthly price from $10k to $20k just for storage, not
counting I/O operations. I am not suggesting you use RAID0, btw.) (3)
Is this for something like a web farm with one Unix user accessing the
web farm, or is it a multi-user HPC-like environment for which you need
a POSIX file system?
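The $10k versus $20k figure is consistent with a flat price of roughly $0.10 per GB-month for 100 TB, doubled by mirroring. A small sketch of that arithmetic follows; the price is inferred from the numbers in this message rather than taken from any provider's price list, I/O charges are ignored, and the function name is hypothetical.

```python
def monthly_storage_cost(usable_tb: float, price_per_gb_month: float, copies: int = 1) -> float:
    """Monthly storage bill in dollars for `copies` full copies of the data.

    Uses decimal units (1 TB = 1000 GB) and ignores I/O and transfer charges.
    """
    if copies < 1:
        raise ValueError("need at least one copy of the data")
    stored_gb = usable_tb * 1000 * copies
    return stored_gb * price_per_gb_month


if __name__ == "__main__":
    single = monthly_storage_cost(100, 0.10)             # one copy of 100 TB
    mirrored = monthly_storage_cost(100, 0.10, copies=2)  # replica 2 / RAID10-style
    print(f"single copy: ${single:,.0f}/month, mirrored: ${mirrored:,.0f}/month")
```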

So far the discussion has been focusing on XFS vs ZFS. I admit that I
am a fan of ZFS and I have only used XFS for performance reasons on
MySQL servers, where it did well. When I read something like this
http://oss.sgi.com/archives/xfs/2011-08/msg00320.html it makes me
not want to use XFS for big data. You can assume that this is a real,
recent bug because Joe is a smart guy who knows exactly what he is
doing. Joe and the Gluster guys are vendors who can work around these
issues and provide support. If XFS is the choice, maybe you should
hire them for this gig.

ZFS typically does not have these FS repair issues in the first place.
The motivation of Lawrence Livermore for porting ZFS to Linux was
quite clear:

http://zfsonlinux.org/docs/SC10_BoF_ZFS_on_Linux_for_Lustre.pdf

OK, they have 50PB and we are talking about much smaller deployments.
However, I can confirm some of the limitations they report. Also,
recovering from a drive failure with this whole LVM/Linux RAID stuff
is unpredictable. Hot swapping does not always work, and if you
prioritize the re-sync of data to the new drive you can strangle the
entire box (by default the priority of the re-sync process is low on
Linux). If you are a Linux expert you can handle this kind of stuff
(or hire someone), but if you ever want to give this setup to a storage
administrator you had better give them something that they can use with
confidence (this may be less of an issue in the cloud).
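For readers unfamiliar with the re-sync priority knob: the Linux md driver exposes its re-sync rate limits as speed_limit_min and speed_limit_max (KB/s per device) under /proc/sys/dev/raid. Raising the minimum is what "prioritizing the re-sync" amounts to, and it is also how foreground I/O ends up starved. A minimal sketch for inspecting and changing them; the helper names are made up for illustration, and writing the value needs root on a real system.

```python
from pathlib import Path

MD_SYSCTL_DIR = Path("/proc/sys/dev/raid")  # where the md driver exposes its limits


def read_resync_limits(base: Path = MD_SYSCTL_DIR) -> dict:
    """Return the md re-sync speed limits (KB/s per device) as integers."""
    return {
        name: int((base / name).read_text().strip())
        for name in ("speed_limit_min", "speed_limit_max")
    }


def set_resync_min(kb_per_sec: int, base: Path = MD_SYSCTL_DIR) -> None:
    """Change the guaranteed minimum re-sync rate (hypothetical helper)."""
    if kb_per_sec <= 0:
        raise ValueError("rate must be positive")
    (base / "speed_limit_min").write_text(f"{kb_per_sec}\n")


if __name__ == "__main__":
    if MD_SYSCTL_DIR.is_dir():
        print(read_resync_limits())
    else:
        print("no md sysctl directory found on this machine")
```

The `base` parameter exists only so the helpers can be pointed at a scratch directory for testing instead of the live /proc tree.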
Compare this to ZFS: re-silvering works with a very predictable
result and timing. There is a ton of info out there on this topic. I
think that Gluster users may be getting around many of the Linux RAID
issues by simply taking the entire node down (which is OK in mirrored
node settings) or by using hardware RAID controllers (which are often
not available in the cloud).
Some in the Linux community seem to be slightly opposed to ZFS (I
assume because of the licensing issue) and sometimes make odd
suggestions ("You should use BTRFS").

As someone who is involved with managing hundreds of terabytes of
storage, I can say that if something goes wrong with a big hunk of your
storage it is quite a pain to get it back. I would only feel comfortable
using a combination of Gluster with anything as my primary storage if
I had it mirrored to another datacenter, not using Gluster technology
for the mirroring (hence my RAID10 question, or maybe that is what
you are planning). Primary storage of that size without mirroring I
would put on a commercial thing like Isilon, IBM, or BlueArc where I get
24x7 support, etc. We are currently happy users of GlusterFS: we
are using it as a caching tier for HPC (our users have managed to
bring it down) and for backup, and I love it for that. We are currently
testing ZFS on Linux with Gluster 3.2.3 on good hardware (8 cores, 64 GB,
SSD for caching) with ultra-cheap drives (WD Caviar Green), and the
performance results are very impressive. I am currently not too
concerned with stability. Should the kernel crash (which has not
happened yet) the data will be unaffected because no Linux code (the
Solaris porting layer) is actually touching any of the hard drives.

dipe

Joe Landman | 25 Sep 14:51

Re: ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

On 09/25/2011 03:56 AM, Di Pe wrote:

> So far the discussion has been focusing on XFS vs ZFS. I admit that I
> am a fan of ZFS and I have only used XFS for performance reasons on
> mysql servers where it did well. When I read something like this
> http://oss.sgi.com/archives/xfs/2011-08/msg00320.html that makes me
> not want to use XFS for big data. You can assume that this is a real

This is a corner case bug, and one we are hoping we can get more data to 
the XFS team for.  They asked for specific information that we couldn't 
provide (as we had to fix the problem).  Note: other file systems which 
allow for sparse files *may* have similar issues.  We haven't tried yet.

The issues with ZFS on Linux have to do with legal hazards.  Neither
Oracle, nor those who claim ZFS violates their patents, would be happy
to see license violations, or further deployment of ZFS on Linux.  I
know the national labs in the US are happily doing the integration from
source.  But I don't think Oracle and the patent holders would sit idly
by while others do this.  So you'd need to use a ZFS-based system such
as Solaris 11 Express to be able to use it without hassle.  BSD and
Illumos may work without issue as well, and should be somewhat better on
the legal front than Linux + ZFS.  I am obviously not a lawyer, and you
should consult one before you proceed down this route.

> recent bug because Joe is a smart guy who knows exactly what he is
> doing. Joe and the Gluster guys are vendors who can work around these
> issues and provide support. If XFS is the choice, may be you should
> hire them for this gig.
>
> ZFS typically does not have these FS repair issues in the first place.
> The motivation of Lawrence Livermore for porting ZFS to Linux was
> quite clear:
>
> http://zfsonlinux.org/docs/SC10_BoF_ZFS_on_Linux_for_Lustre.pdf
>
> OK, they have 50PB and we are talking about much smaller deployments.
> However some of the limitations they report I can confirm. Also,
> recovering from a drive failure with this whole LVM/Linux Raid stuff
> is unpredictable. Hot swapping does not always work and if you
> prioritize the re-sync of data to the new drive you can strangle the
> entire box (by default the priority of the re-sync process is low on
> linux). If you are a Linux expert you can handle this kind of stuff
> (or hire someone) but if you ever want to give this setup to a Storage
> Administrator you better give them something that they can use with
> confidence (may be less of an issue in the cloud).
> Compare to this to ZFS: re-silvering works with a very predictable
> result and timing. There is a ton of info out there on this topic.  I
> think that gluster users may be getting around many of the linux raid
> issues by simply taking the entire node down (which is ok in mirrored
> node settings) or by using hardware raid controllers. (which are often
> not available in the cloud )

There are definite advantages to better technology.  But the issue in 
this case is the legal baggage that goes along with them.

BTRFS may, eventually, be a better choice.  The national labs can do
this with something of an immunity to prosecution for license violation,
by claiming the work is part of a research project, and won't actively
be used in a way that would harm Oracle's interests.  And it would be
... bad ... for Oracle (and others) to sue the government over a
relatively trivial violation.

Until Oracle comes out with an absolute declaration that it's OK to use
ZFS with Linux in a commercial setting ... yeah ... most vendors are
gonna stay away from that scenario.

> Some in the Linux community seem to be slightly opposed to ZFS (I
> assume because of the licensing issue) and make sometimes odd
> suggestions ("You should use BTRFS").

Licensing mainly.  BTRFS has a better design, but it's not ready yet.
Won't be for a while.

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman@...
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
Di Pe | 27 Sep 02:50

Re: ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

On Sun, Sep 25, 2011 at 5:51 AM, Joe Landman
<landman@...> wrote:
> On 09/25/2011 03:56 AM, Di Pe wrote:
>
>> So far the discussion has been focusing on XFS vs ZFS. I admit that I
>> am a fan of ZFS and I have only used XFS for performance reasons on
>> mysql servers where it did well. When I read something like this
>> http://oss.sgi.com/archives/xfs/2011-08/msg00320.html that makes me
>> not want to use XFS for big data. You can assume that this is a real
>
> This is a corner case bug, and one we are hoping we can get more data to the
> XFS team for.  They asked for specific information that we couldn't provide
> (as we had to fix the problem).  Note: other file systems which allow for
> sparse files *may* have similar issues.  We haven't tried yet.

Fair enough, but one of the things LLNL pointed out was that you have
to run fsck in the first place (i.e. standard file systems are not
self-healing).

>
> The issues with ZFS on Linux have to do with legal hazards.  Neither Oracle,
> nor those who claim ZFS violates their patents, would be happy to see
> license violations, or further deployment of ZFS on Linux.  I know the
> national labs in the US are happily doing the integration from source.  But
> I don't think Oracle and the patent holders would sit idly by while others
> do this.  So you'd need to use a ZFS based system such as Solaris 11 express
> to be able to use it without hassle.  BSD and Illumos may work without issue
> as well, and should be somewhat better on the legal front than Linux + ZFS.
>  I am obviously not a lawyer, and you should consult one before you proceed
> down this route.
>
>> recent bug because Joe is a smart guy who knows exactly what he is
>> doing. Joe and the Gluster guys are vendors who can work around these
>> issues and provide support. If XFS is the choice, may be you should
>> hire them for this gig.
>>
>> ZFS typically does not have these FS repair issues in the first place.
>> The motivation of Lawrence Livermore for porting ZFS to Linux was
>> quite clear:
>>
>> http://zfsonlinux.org/docs/SC10_BoF_ZFS_on_Linux_for_Lustre.pdf
>>
>> OK, they have 50PB and we are talking about much smaller deployments.
>> However some of the limitations they report I can confirm. Also,
>> recovering from a drive failure with this whole LVM/Linux Raid stuff
>> is unpredictable. Hot swapping does not always work and if you
>> prioritize the re-sync of data to the new drive you can strangle the
>> entire box (by default the priority of the re-sync process is low on
>> linux). If you are a Linux expert you can handle this kind of stuff
>> (or hire someone) but if you ever want to give this setup to a Storage
>> Administrator you better give them something that they can use with
>> confidence (may be less of an issue in the cloud).
>> Compare to this to ZFS: re-silvering works with a very predictable
>> result and timing. There is a ton of info out there on this topic.  I
>> think that gluster users may be getting around many of the linux raid
>> issues by simply taking the entire node down (which is ok in mirrored
>> node settings) or by using hardware raid controllers. (which are often
>> not available in the cloud )
>
> There are definite advantages to better technology.  But the issue in this
> case is the legal baggage that goes along with them.
>
> BTRFS may, eventually, be a better choice.  The national labs can do this
> with something of an immunity to prosecution for license violation, by
> claiming the work is part of a research project, and won't actively be used
> in a way that would harm Oracle's interests.  And it would be ... bad ...
> for Oracle (and others) to sue to government over a relatively trivial
> violation.
>

I am trying to make sense of what people discuss regarding the ZFS
licensing issue. Did you hear anything from anyone at Oracle that
would indicate that they don't like ZFS on Linux? If I think it
through, I can't see why this would make any sense. The ZFS on Linux
community is extremely small and will probably always be, and the main
reason, besides data size, is that the GPL doesn't like the CDDL (not
vice versa), so distros shy away from it.
The LLNL people have found a way around the GPLv2 issue by implementing
it as a driver.
Why doesn't Oracle sue Nexenta? Those guys have deployed 330PB of
their storage and would be a worthy target.
The only company that seems to have issues with ZFS in general is
NetApp, and I'm sure that they don't care whether it's installed on
Solaris or on Linux. Interestingly, NetApp sued Coraid, a disk shelf
vendor that was using Nexenta as its OS, but they did not sue Nexenta
itself. NetApp knew that their case was very weak. If they had sued
Nexenta, Nexenta would have fought back because the very existence of
the company would have been at risk. NetApp feared that Nexenta might
have won, which would have confirmed the legitimacy of ZFS. Coraid, on
the other hand, was not dependent on their ZFS solution for their
business to be able to continue. They were not willing to take any
risk, and this allowed NetApp to spread some excellent FUD.
I would be interested to hear if there have been any comments anywhere
(written or verbal) by noteworthy Oracle representatives regarding the
ZFS on Linux question (except saying: "Oh, we won't GPL ZFS").

> Until Oracle comes out with an absolute declaration that its OK to use ZFS
> with Linux in a commercial setting ... yeah ... most vendors are gonna stay
> away from that scenario.
>

>> Some in the Linux community seem to be slightly opposed to ZFS (I
>> assume because of the licensing issue) and make sometimes odd
>> suggestions ("You should use BTRFS").
>
> Licensing mainly.  BTRFS has a better design, but its not ready yet. Won't
> be for a while.

I wonder if you can compare the two. ZFS is also a volume manager and
software RAID, and has multiple built-in caching layers, while the
other FSs are just FSs. BTRFS seems to have some volume management in
place. I would be interested to know why you think BTRFS has a better
design than ZFS.

paul simpson | 29 Sep 18:38

Re: ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

been reading this thread - quite fascinating.


zfsonlinux + gluster looks like an intriguing combination.  i'm interested in your findings to date; specifically would the zfs L2ARC (with SSDs) speed up underlying gluster operations?  it sounds like it could be a potent mix.

regards,

-paul
Joe Landman | 29 Sep 18:48

Re: ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

On 09/29/2011 12:38 PM, paul simpson wrote:
> been reading this thread - quite fascinating.
>
> zfsonlinux + gluster looks like an intriguing combination.  i'm
> interested in your findings to date; specifically would the zfs L2ARC
> (with SSDs) speed up underlying gluster operations?  it sounds like it
> could be a potent mix.

Just don't minimize the legal risk issue.  It's very hard for a vendor to
ship/support this due to the potential risk.  It's arguably hard for a
user to deploy ZFS on Linux due to the risk, unless they had a way to
argue that they are not violating licensing (can't intermix GPL and CDDL
and ship/support it) for commercial purposes.

Lots of folks can't claim the type of cover that a national lab can 
claim (researching storage models).  You have to decide if the risk is 
worth it.

If you were to do this, I'd suggest going the Illumos/OpenIndiana or BSD 
route.  Yeah, work still needs to be done to get Gluster to build there, 
but the licensing is on firmer ground (hard to claim that an "open 
source" license such as CDDL does not mean what it says).

Understand where you stand first.  Speak to a lawyer type first.  Make 
sure you won't have issues.

And do remember that while Oracle and NetApp have (for the moment)
de-escalated hostilities, Oracle did not provide indemnity to non-Oracle
customers.  So NetApp (and others) *can* resume their actions.  A
question was asked why not go after Nexenta versus others.  Simple.
There are many others (e.g. more potential licensing/legal fees) as
compared to a single Nexenta.  It's arguably less about rights than it
is about revenue from legal action.  But that stuff does happen ...

Oracle is probably the only one who can ship ZFS anything safely.  And
I'd guess that they are perfectly happy with that situation.

>
> regards,
>
> -paul

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@...
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
paul simpson | 29 Sep 19:13

Re: ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

hey joe,


received the legal issues loud and clear - all very good points.  hope these issues will become clarified in due course.

putting the legal issues aside - still v keen to hear your and others' thoughts about ZFS & L2ARC being a good platform for glusterfs.  that fast SSD tier sounds like a perfect complement to gluster's slow small file performance.

regards,

paul



David Miller | 29 Sep 19:32

Re: ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

On Thu, Sep 29, 2011 at 1:13 PM, paul simpson <paul <at> realisestudio.com> wrote:
> hey joe,
>
> received the legal issues loud and clear - all very good points.  hope these issues will become clarified in due course.
>
> putting the legal issues aside - still v keen to hear your and others' thoughts about ZFS & L2ARC being a good platform for glusterfs.  that fast SSD tier sounds like a perfect complement to gluster's slow small file performance.

Couldn't you accomplish the same thing with flashcache? https://github.com/facebook/flashcache/
--
David
David Miller | 29 Sep 19:44

Re: ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

On Thu, Sep 29, 2011 at 1:32 PM, David Miller wrote:
> Couldn't you accomplish the same thing with flashcache? https://github.com/facebook/flashcache/

I should expand on that a little bit. Flashcache is a kernel module created by Facebook that uses the device mapper interface in Linux to provide an SSD cache layer for any block device.

What I think would be interesting is using flashcache with a PCIe SSD as the caching device. That would add about $500-$600 to the cost of each brick node but should be able to buffer the active I/O from the spinning media pretty well.

Something like this: http://www.amazon.com/OCZ-Technology-Drive-240GB-Express/dp/B0058RECUE or something from Fusion-io if you want something that's aimed more at the enterprise.
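To make the idea of a cache layer in front of a slow block device concrete, here is a toy read cache with least-recently-used eviction. It is only a model of the general technique, not flashcache's actual design or interface; the class and parameter names are invented for this sketch.

```python
from collections import OrderedDict


class BlockCache:
    """Toy LRU read cache sitting in front of a slow block store."""

    def __init__(self, capacity_blocks: int, backing_read):
        self.capacity = capacity_blocks
        self.backing_read = backing_read  # callable: block number -> bytes
        self.blocks = OrderedDict()       # block number -> cached data, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, block_no: int) -> bytes:
        if block_no in self.blocks:
            self.blocks.move_to_end(block_no)  # mark as most recently used
            self.hits += 1
            return self.blocks[block_no]
        self.misses += 1
        data = self.backing_read(block_no)     # slow path: go to the spinning disk
        self.blocks[block_no] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict the least recently used block
        return data


if __name__ == "__main__":
    cache = BlockCache(2, lambda n: b"block-%d" % n)
    for n in (1, 2, 1, 3, 2):
        cache.read(n)
    print(cache.hits, cache.misses)  # 1 4
```

The benefit depends entirely on how much of the active working set fits in the cache, which is why a large SSD in front of the bricks is attractive for small-file workloads.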
--
David
paul simpson | 29 Sep 19:47

Re: ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

hi david,


thanks for the pointer.  already reading the readme on the repo.  it looks interesting - and i'd be keen to hear from any gluster gurus what their thoughts are on such a setup...

regards,

-paul


Joe Landman | 29 Sep 19:58

Re: ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

On 09/29/2011 01:44 PM, David Miller wrote:
> On Thu, Sep 29, 2011 at 1:32 PM, David Miller <david3d@...
> <mailto:david3d@...>> wrote:
>
>     Couldn't you  accomplish the same thing with flashcache?
>     https://github.com/facebook/flashcache/
>
>
> I should expand on that a little bit.  Flashcache is a kernel module
> created by Facebook that uses the device mapper interface in Linux to
> provide a ssd cache layer to any block device.
>
> What I think would be interesting is using flashcache with a pcie ssd as
> the caching device.  That would add about $500-$600 to the cost of each
> brick node but should be able to buffer the active IO from the spinning
> media pretty well.

Erp ... low-end PCIe flash with decent performance starts much higher 
than $500-600 USD.

> Something like this.
> http://www.amazon.com/OCZ-Technology-Drive-240GB-Express/dp/B0058RECUE
> or something from FusionIO if you want something that's aimed more at
> the enterprise.

Flashcache is reasonably good, but there are many variables in using it, 
and it's designed for a different use case.  For most people the 
writeback mode may be reasonable, but other use cases would require 
different configs.

This said, please understand that it (and L2ARC, and other similar 
things) are *not* silver bullets (e.g. not magical things that will 
instantly make something far better, at no cost/effort).  They do 
introduce additional complexity, and additional tuning points.

The thing you cannot get rid of, the network traversal, is implicated in 
much of the performance degradation for small files.  Putting the file 
system on a RAM disk (if possible, tmpfs doesn't support xattrs), 
wouldn't make the system much faster for small files.  Eliminating the 
network traversal and doing local distributed caching of metadata on the 
client side ... could ... but this would be a huge new complication, and 
I'd argue that it probably isn't worth it.
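
The point about network traversal can be made concrete with a toy latency model. This is only a sketch: the round-trip count and every latency figure below are illustrative assumptions, not measurements from any gluster system, and `per_file_seconds` is a hypothetical helper.

```python
def per_file_seconds(round_trips, rtt_s, backend_s):
    """Time to create one small file: network round trips plus backend work."""
    return round_trips * rtt_s + backend_s


# Illustrative assumptions only, not measured values.
ROUND_TRIPS = 5     # hypothetical number of network round trips per file
RTT_S = 0.0005      # assumed 0.5 ms network round trip
DISK_S = 0.0005     # assumed 0.5 ms of backend work on spinning disk
RAM_S = 0.00001     # assumed 10 us of backend work on a RAM disk

disk = per_file_seconds(ROUND_TRIPS, RTT_S, DISK_S)
ram = per_file_seconds(ROUND_TRIPS, RTT_S, RAM_S)

print("disk backend: %.2f ms/file" % (disk * 1e3))
print("RAM backend:  %.2f ms/file" % (ram * 1e3))
print("speedup from RAM backend: %.2fx" % (disk / ram))
```

Under these assumed numbers the backend is a small share of the per-file time, so even a near-instant backend leaves most of the cost in place. With a different ratio of network to backend latency the conclusion would shift accordingly.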

For the short term, small file performance is going to be bad.  You 
might be able to play some games to make this performance better (L2ARC 
etc. could help in some aspects, but they won't be universally much better).

What matters most is very good design on the storage backend (we are 
biased due to what it is we sell/support), very good networking, and 
very good gluster implementation/tuning.  It's really easy to hit very 
slow performance by missing critical elements.  We field many inquiries 
which usually start out with "we built our own and the performance isn't 
that good."  You won't get good performance on the cluster file system if 
the underlying file system and storage design aren't going to give it to 
you in the first place.

This said, please understand that there is a (significant) performance 
cost to all those nice features in ZFS.  And there is a reason why it's 
not generally considered a high performance file system.  So if you 
start building with it, you shouldn't necessarily think that the whole 
is going to be faster than the sum of the parts.  Might be worse.

This is a caution from someone who has tested/shipped many different 
file systems in the past.  ZFS included, on Solaris and other machines.  
There is a very significant performance penalty one pays for using 
some of these features.  You have to decide if this penalty is worth it.

--

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@...
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
Liam Slusser | 30 Sep 22:37

Re: ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

I've used ZFS in lots of different roles and I've found that out of
the box ZFS performs decently, but to get really great performance out
of the (zfs) filesystem you really need to tune it for the application.
ZFS has tons and tons of somewhat hidden features (edit /etc/system
and reboot type stuff) and if set correctly has outstanding
performance.

liam

On Thu, Sep 29, 2011 at 10:58 AM, Joe Landman
<landman@...> wrote:
> This said, please understand that there is a (significant) performance cost
> to all those nice features in ZFS.  And there is a reason why its not
> generally considered a high performance file system.  So if you start
> building with it, you shouldn't necessarily think that the whole is going to
> be faster than the sum of the parts.  Might be worse.
>
> This is a caution from someone who has tested/shipped many different file
> systems in the past.  ZFS included, on Solaris and other machines.  There is
> a very significant performance penalty one pays for using some of these
> features.  You have to decide if this penalty is worth it.
Tristan Ball | 3 Oct 03:48

Re: ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

I'm curious as to what ZFS options you're finding necessary to tune for what kind of applications? Are you
willing to share? :-)

Thanks,
        Tristan

Tristan Ball - Hosted Services Manager VIC
Pronto Hosted Services
20 Lakeside Drive, Burwood East, VIC 3151
Phone: +61 3 9887 7770 | Email: tristanb@...
Mobile: +61 408 397 473

For PHS helpdesk support, please email phs@...
For urgent after hours support phone: 1800 622 556

paul simpson | 3 Oct 18:19

Re: ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

hi joe,


many thanks for your insights here - much food for thought.

regards,

paul


Di Pe | 10 Oct 11:07

Re: ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

All,

I mentioned a few weeks ago that we were doing some testing of ZFS &
Gluster. We brought up two 4-year old but still reasonably beefy
machines with the following specs:

Hardware:

Supermicro X7DB8 Storage server / 64 GB (16x4GB 667MHz ECC)
2 x Xeon Quadcore E5450  @ 3.00GHz
2 x Marvell  MV88SX6081 8-port SATA II PCI-X
1 x OCZ 50GB RevoDrive OCZSSDPX-1RVD0050
1 x Mellanox Technologies MT25204 [InfiniHost III Lx HCA]
16 x 2TB WD SATA Consumer grade drives

Software:

Kernel 2.6.37.6-0.7 (openSUSE 11.4)
ZFSOnLinux 0.6.0 -RC5
GlusterFS 3.2.3

Machine1 was set up with WD 2TB Caviar Black consumer grade drives (5y
warranty) and Machine2 was set up with the cheapest WD 2TB Caviar
Green drives.

Other notes on the hardware:
* We chose Marvell SATA controllers because the Sun Thumpers (the
mother of all ZFS devices) were also using them and a local ZFS expert
has advised against using combined SAS/SATA controllers. We chose
PCI-X because this older hardware did not have enough PCIe slots.
* The OCZ RevoDrive is kind of small, but it is quite fast and
costs only $250.

We created one ZFS pool ("tank") with raidz on each machine, which
resulted in 2 x 25TB of space. The PCIe RevoDrives have 2 x 25GB of
flash. We used one half as the ZIL log device and the other as the
L2ARC cache. (The ZIL device should probably be mirrored in a
production system; the L2ARC device can apparently go away without
causing FS corruption.)

Setting ashift=12 for the green drives was the only change we made
from the default ZFS config (there is a lot of documentation on how to
make these green drives usable with ZFS).
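
For readers who want to reproduce a similar layout, here is a minimal sketch that only assembles (and does not run) a `zpool create` argument list with a raidz vdev, a `log` device, a `cache` device and an `ashift` setting. The device names are hypothetical, the helper `zpool_create_args` is made up for illustration, and a single raidz vdev across all 16 disks is an assumption, since the exact vdev layout is not spelled out above.

```python
def zpool_create_args(pool, data_disks, log_dev=None, cache_dev=None, ashift=None):
    """Build (but do not execute) an argument list for `zpool create`."""
    args = ["zpool", "create"]
    if ashift is not None:
        # ashift=12 requests 4 KiB alignment (2**12 bytes).
        args += ["-o", "ashift=%d" % ashift]
    args += [pool, "raidz"] + list(data_disks)
    if log_dev:
        args += ["log", log_dev]      # separate ZIL log device
    if cache_dev:
        args += ["cache", cache_dev]  # L2ARC cache device
    return args


# Hypothetical device names; substitute the real ones for your hardware.
disks = ["/dev/sd%s" % c for c in "bcdefghijklmnopq"]  # 16 data disks
args = zpool_create_args("tank", disks,
                         log_dev="/dev/sdr", cache_dev="/dev/sds", ashift=12)
print(" ".join(args))
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; check the result against the zpool man page for your ZFS version before using it.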

Our first impression was that either the system memory or the L2ARC
provided really good write caching for short sequential writes (a lot
of our work is like that); iotop showed 700+ MB/s.

We ran bonnie++ a couple of times, but we are no benchmark experts; see
the results at the end of this note. We'd be happy to run other
benchmarks if someone thinks that would make more sense. It was
interesting to see that the green drives reached about 90% of the
throughput performance of the black drives.

Another type of workload we have in our environment is HPC users who
create many small files and delete them quickly afterwards. To
simulate this we created a silly little python script that made lots
of 1k files with slightly different content:

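*****scratch.py ***************************************************
#!/usr/bin/python
# Create N small files (N is the first argument), each with slightly
# different content, in the current directory.
import sys

# Filler payload of roughly 1 KB (elided here, as in the original note).
mystr = """
alskjdhfka;kajf akjhfdskajshf k;ajhsdf;kajhf k;jah ........another
1000 random chars
"""

for i in range(int(sys.argv[1])):
    name = "file-%s" % i
    # Prefix the counter so that no two files are identical.
    with open(name, "w") as fh:
        fh.write(str(i) + mystr)
***********************************************************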

The script created 10000 files and then 100000 files locally on Machine1.
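
A minimal sketch of how the three phases (create, list, remove) could be timed from a single script instead of by hand. This is not the procedure used for the numbers below; the function names are made up for illustration, the listing phase only approximates `ls -la` by calling stat on every entry, and no caches are dropped between phases.

```python
import os
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.time()
    result = fn(*args)
    return result, time.time() - start


def create_files(directory, count, payload):
    """Write `count` small files, each prefixed with its index."""
    for i in range(count):
        with open(os.path.join(directory, "file-%d" % i), "w") as fh:
            fh.write(str(i) + payload)
    return count


def list_files(directory):
    """Stat every entry, roughly what `ls -la` has to do."""
    names = os.listdir(directory)
    for name in names:
        os.stat(os.path.join(directory, name))
    return len(names)


def remove_files(directory):
    """Unlink every entry, roughly what `rm *` does."""
    names = os.listdir(directory)
    for name in names:
        os.remove(os.path.join(directory, name))
    return len(names)


def run_benchmark(directory, count, payload="x" * 1000):
    """Return {phase: seconds} for the create, list and remove phases."""
    results = {}
    _, results["create"] = timed(create_files, directory, count, payload)
    _, results["list"] = timed(list_files, directory)
    _, results["remove"] = timed(remove_files, directory)
    return results


if __name__ == "__main__":
    import sys
    # Usage: python smallfile_bench.py <empty-scratch-directory> <count>
    timings = run_benchmark(sys.argv[1], int(sys.argv[2]))
    for phase in ("create", "list", "remove"):
        print("%-7s %.2fs" % (phase, timings[phase]))
```

Point it only at an empty scratch directory on the filesystem under test, because the remove phase deletes every file it finds there.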

First we ran this locally on the boot drives (ext4):

create 10000 files:      sub seconds
ls -la on 10000 files:    sub seconds
rm * on 10000 files:     sub seconds

create 100000 files:      4s
ls -la on 100000 files:   1s
rm * on 100000 files:    3s

After cleaning the cache we ran this on the ZFS filesystem:

create 10000 files:      3s
ls -la on 10000 files:    sub seconds
rm * on 10000 files:     sub seconds

create 100000 files:     92s
ls -la on 100000 files:  1s
rm * on 100000 files:    8s

At first this is kind of disappointing because ZFS seems to be much
slower. However, a ZFS file server is almost never used as local
storage in a compute box. You'll have to measure this via
NFS/GlusterFS, and then the story looks slightly different:

First we tried to do the same thing on a fast compute box that is
connected to a fast NetApp 3050 with 60 10k RPM FC drives:

create 10000 files:      14s
ls -la on 10000 files:     1s
rm * on 10000 files:      4s

Then we used a Solaris file server (Dell R810 connected to a 600TB FC
SAN that consists mostly of SATA drives) that is mounted on the same
box:

create 10000 files:       32s  (NFSv4: 53s)
ls -la on 10000 files:     6s   (NFSv4: sub seconds)
rm * on 10000 files:      15s  (NFSv4: 20s)

Then we used our new 50TB ZFS gluster system (Machine1 and Machine2,
distributed, 2 bricks) mounted via glusterfs from the client:

create 10000 files:       27s
ls -la on 10000 files:     8s
rm * on 10000 files:      13s
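
To compare the three networked systems more directly, the 10000-file timings above can be converted to an average cost per file. The sketch below only restates the numbers from the tables; the dictionary labels and the helper `ms_per_file` are made up for illustration, and the "sub seconds" rows (local ext4 and ZFS) are left out because they carry no usable figure.

```python
FILES = 10000

# (create, ls -la, rm) in seconds for 10000 files, copied from the tables above.
timings = {
    "NetApp 3050": (14, 1, 4),
    "Solaris file server": (32, 6, 15),
    "ZFS + GlusterFS": (27, 8, 13),
}


def ms_per_file(seconds, files=FILES):
    """Average milliseconds spent per file for one phase."""
    return seconds * 1000.0 / files


for name in sorted(timings):
    create, ls, rm = timings[name]
    print("%-20s create %.1f ms  ls %.1f ms  rm %.1f ms"
          % (name, ms_per_file(create), ms_per_file(ls), ms_per_file(rm)))
```

Seen this way, creating a file costs a few milliseconds on the Solaris and gluster setups and about half that on the NetApp.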

I admit we don't have any really, really fast storage and more tests
need to be done. But when we look at what our current needs are, our
ZFS gluster setup would do quite well compared to existing equipment.
We have not had any ZFS crashes so far; everything has been very stable.

On the throughput front, we have not connected these systems to 10G,
but the numbers indicate that this 2-node gluster cluster could
probably fill an entire 10G pipe.
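
As a rough check on that estimate, the block-I/O figures from the bonnie++ results at the end of this note can be converted to gigabits per second. This is a sketch that assumes bonnie++ reports K/sec in units of 1024 bytes; the helper name `kps_to_gbit` is made up, and the figures are local to Machine1, not measured through gluster or over a network.

```python
KIB = 1024  # assumption: bonnie++ "K/sec" means KiB per second


def kps_to_gbit(kps):
    """Convert a bonnie++ K/sec figure to gigabits per second."""
    return kps * KIB * 8 / 1e9


# Machine1 block write / rewrite / read figures from the two bonnie++ runs.
runs = {
    "22G": {"write": 380617, "rewrite": 332375, "read": 1015229},
    "96G": {"write": 336650, "rewrite": 197629, "read": 472625},
}

for size in sorted(runs):
    for op in ("write", "rewrite", "read"):
        print("%s %-7s %5.2f Gbit/s" % (size, op, kps_to_gbit(runs[size][op])))
```

The 22G run is smaller than the 64 GB of RAM in the box, so its read figure probably reflects caching; the 96G run is likely the more representative one for sustained throughput.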

We also wanted to experiment with InfiniBand/RDMA, but the drivers seem
to be unsupported in later Linux kernels (e.g. 2.6.37). If anyone has a
howto, please let me know.

########   BENCHMARK #####################

Machine1, 16 WD caviar black, 50GB SSD, ashift=9

/loc/bonnie # bonnie++ -d /loc/bonnie -s 22g -r 11g -n 0 -f -b -u root
Using uid:0, gid:0.
Writing intelligently...done
Rewriting...done
Reading intelligently...done
start 'em...done...done...done...
Version 1.03d       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
Machine1           22G           380617  95 332375  92           1015229  99  3380   7
Machine1,22G,,,380617,95,332375,92,,,1015229,99,3380.1,7,,,,,,,,,,,,,
/loc/bonnie #
/loc/bonnie # bonnie++ -d /loc/bonnie -s 96g -r 16g -n 0 -f -b -u root
Using uid:0, gid:0.
Writing intelligently...done
Rewriting...done
Reading intelligently...done
start 'em...done...done...done...
Version 1.03d       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
Machine1          96G           336650  91 197629  70           472625  74 141.4   1
Machine1,96G,,,336650,91,197629,70,,,472625,74,141.4,1,,,,,,,,,,,,,
