ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud
_______________________________________________ Gluster-users mailing list Gluster-users@... http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Hello,

Maybe this question has been addressed elsewhere, but I'd like the opinion and experience of other users. I may be carrying some misconceptions, so please be kind enough to point them out. Any help, advice and suggestions will be very highly appreciated.

My goal is to get a greater than 100 TB Gluster NAS up on the cloud. Each server will hold around 2x8 TB disks. The export volume size (client disk mount size) would be greater than 20 TB.

This is how I am planning to set it all up: 16 servers, each with 2x8=16 TB of space. The GlusterFS volume will be replicated and distributed (RAID-10). I'd like to go with ZFS on Linux for the disks. The client machines will use the GlusterFS client for mounting the volumes. ext4 is limited to 16 TB due to the userspace tools (e2fsprogs).

Would this be considered a production-ready setup? The data housed on this cluster is critical, and hence I need to be very sure before I go ahead with this kind of setup. Or would using ZFS with Gluster make more sense on FreeBSD or Illumos (ZFS is native there)?

Thanks a lot
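As a rough check of the numbers in this plan, here is a minimal sketch of the capacity arithmetic and of how bricks might be ordered for a distributed-replicated volume. The hostnames (`server01` ...), the brick path and the volume name `nas` are hypothetical, a replica count of 2 is assumed from the "RAID-10" description, and the command string is only assembled, never executed.

```python
from __future__ import annotations

# Sketch: usable capacity and brick ordering for a distributed-replicated
# Gluster volume. Hostnames, brick path and volume name are hypothetical.


def usable_tb(servers: int, tb_per_server: int, replica: int) -> float:
    """Raw capacity divided by the replica count."""
    return servers * tb_per_server / replica


def brick_list(servers: int, brick_path: str = "/export/brick1") -> list[str]:
    """Bricks in the order they would be passed to 'gluster volume create'.

    With 'replica 2', each consecutive pair of bricks forms one mirror,
    so neighbouring entries should live on different servers.
    """
    return [f"server{n:02d}:{brick_path}" for n in range(1, servers + 1)]


def create_command(volume: str, replica: int, bricks: list[str]) -> str:
    """Assemble (but do not run) the volume-create command line."""
    parts = ["gluster", "volume", "create", volume,
             "replica", str(replica), "transport", "tcp", *bricks]
    return " ".join(parts)


if __name__ == "__main__":
    print(usable_tb(16, 16, 2), "TB usable")
    print(create_command("nas", 2, brick_list(16)))
```

With 16 servers of 16 TB each and two copies of every file, the usable space comes out above the 100 TB goal, but only by a modest margin.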
I have a very large (>500 TB) Gluster cluster on CentOS Linux, but I use the XFS filesystem in a production role. Each XFS filesystem (brick) is around 32 TB in size. No problems; it all runs very well.
ls
On Sat, Sep 24, 2011 at 12:14 PM, Liam Slusser <lslusser-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> I have a very large, >500tb, Gluster cluster on Centos Linux but I use
> the XFS filesystem in a production role. Each xfs filesystem (brick) is
> around 32tb in size. No problems all runs very well.
>
> ls

Yes, XFS is the way to go for large partitions > 16 TB (or even 12 TB). XFS has been brought back to life by Red Hat; most of the XFS developers are now RH employees. We can confidently recommend XFS now.
--
Anand Babu Periasamy
Blog [ http://www.unlocksmith.org ]
Twitter [ http://twitter.com/abperiasamy ]
Imagination is more important than knowledge --Albert Einstein
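For reference, the 16 TB figure mentioned in this thread matches what you get from 32-bit block numbers combined with 4 KiB blocks. A small sketch of that arithmetic follows; the two parameters are assumptions chosen to reproduce the figure, they are not stated anywhere in the thread.

```python
# Sketch: filesystem size limit implied by the width of the block number.
# Assumes 4 KiB blocks and 32-bit block numbers (not stated in the thread).

TIB = 2 ** 40


def max_fs_bytes(block_size: int = 4096, block_number_bits: int = 32) -> int:
    """Largest addressable size: number of blocks times block size."""
    return (2 ** block_number_bits) * block_size


if __name__ == "__main__":
    print(max_fs_bytes() // TIB, "TiB")
```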
On Sun, 25 Sep 2011, Jon Tegner wrote:

> Sorry for a stupid question, but would there be issues using glusterfs based
> on several 11 TB ext4-bricks?

Nope! I have used that in a config: 16 750 disks in RAID 6 on a 3ware card, formatted ext4.

><>
Nathan Stratton                    CTO, BlinkMind, Inc.
nathan at robotics.net             nathan at blinkmind.com
http://www.robotics.net            http://www.blinkmind.com
I've also heard it can be slower; however, I've never done any performance tests on the same hardware with ext3/4 vs XFS, since my partitions are so big that ext3/4 is just not an option. With that said, I've been pleased with the performance I get and am a happy XFS user.
ls
Hi,

I think you are up to an interesting project. Maybe you could share a few more details:

(1) What cloud are you planning to use: EC2 with EBS volumes, or some hosted stuff like Rackspace?
(2) What are your motivations for using RAID10? (For example, on Amazon that would increase your monthly price from $10k to $20k just for storage, not counting I/O operations. I am not suggesting you use RAID0, btw.)
(3) Is this for something like a web farm with one Unix user accessing the web farm, or is it a multi-user, HPC-like environment for which you need a POSIX file system?

So far the discussion has been focusing on XFS vs ZFS. I admit that I am a fan of ZFS and I have only used XFS for performance reasons on MySQL servers, where it did well. When I read something like this http://oss.sgi.com/archives/xfs/2011-08/msg00320.html it makes me not want to use XFS for big data. You can assume that this is a real, recent bug, because Joe is a smart guy who knows exactly what he is doing. Joe and the Gluster guys are vendors who can work around these issues and provide support. If XFS is the choice, maybe you should hire them for this gig.

ZFS typically does not have these FS repair issues in the first place. The motivation of Lawrence Livermore for porting ZFS to Linux was quite clear:

http://zfsonlinux.org/docs/SC10_BoF_ZFS_on_Linux_for_Lustre.pdf

OK, they have 50 PB and we are talking about much smaller deployments. However, some of the limitations they report I can confirm. Also, recovering from a drive failure with this whole LVM/Linux RAID stuff is unpredictable. Hot swapping does not always work, and if you prioritize the re-sync of data to the new drive you can strangle the entire box (by default the priority of the re-sync process is low on Linux). If you are a Linux expert you can handle this kind of stuff (or hire someone), but if you ever want to give this setup to a storage administrator you had better give them something that they can use with confidence (maybe less of an issue in the cloud).

Compare this to ZFS: re-silvering works with a very predictable result and timing. There is a ton of info out there on this topic. I think that Gluster users may be getting around many of the Linux RAID issues by simply taking the entire node down (which is OK in mirrored node settings) or by using hardware RAID controllers (which are often not available in the cloud).

Some in the Linux community seem to be slightly opposed to ZFS (I assume because of the licensing issue) and sometimes make odd suggestions ("You should use BTRFS").

As someone who is involved with managing hundreds of terabytes of storage, I can say that if something goes wrong with a big hunk of your storage it is quite a pain to get it back. I would only feel comfortable using a combination of Gluster with anything as my primary storage if I had it mirrored to another datacenter, not using Gluster technology for the mirroring (hence my RAID10 question, or maybe that is what you are planning). Primary storage of that size without mirroring I would put on a commercial thing like Isilon, IBM or BlueArc, where I get 24x7 support, etc.

We are currently happy users of GlusterFS. We are using it as a caching tier for HPC (our users have managed to bring it down) and for backup, and I love it for that.

We are currently testing ZFSOnLinux with Gluster 3.2.3 on good hardware (8 core, 64 GB, SSD for caching) with ultra-cheap drives (WD green caviar), and the performance results are very impressive. I am currently not too concerned with stability. Should the kernel crash (which has not happened yet) the data will be unaffected, because no linux code (the Solaris porting layer) is actually touching any of the hard drives.
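To put the re-sync trade-off in numbers, here is a back-of-the-envelope sketch. It assumes the Linux md throttles exposed as `/proc/sys/dev/raid/speed_limit_min` and `/proc/sys/dev/raid/speed_limit_max`, with their commonly documented defaults of 1000 and 200000 KiB/s; both the defaults and the 8 TB disk size (taken from the original plan) should be checked against the actual system. The real rate depends on competing I/O and on what the hardware can sustain, so these are bounds rather than predictions.

```python
# Sketch: how long a full md re-sync would take at a given throttle rate.
# The rates below are assumed defaults for speed_limit_min / speed_limit_max;
# verify them on the target system before relying on the result.

ASSUMED_MIN_KIB_S = 1000      # floor that md keeps even under load
ASSUMED_MAX_KIB_S = 200000    # ceiling used when the box is otherwise idle


def resync_hours(disk_tb: float, rate_kib_per_s: int) -> float:
    """Hours needed to rewrite a whole disk at a constant rate."""
    disk_kib = disk_tb * 1e12 / 1024
    return disk_kib / rate_kib_per_s / 3600


if __name__ == "__main__":
    for label, rate in (("min", ASSUMED_MIN_KIB_S), ("max", ASSUMED_MAX_KIB_S)):
        print(f"8 TB at speed_limit_{label}: {resync_hours(8, rate):.1f} h")
```

The gap between the two results is the point being made above: a low re-sync priority can stretch the rebuild enormously on a busy box, while raising it competes with production I/O.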
dipe
On 09/25/2011 03:56 AM, Di Pe wrote:

> So far the discussion has been focusing on XFS vs ZFS. I admit that I
> am a fan of ZFS and I have only used XFS for performance reasons on
> mysql servers where it did well. When I read something like this
> http://oss.sgi.com/archives/xfs/2011-08/msg00320.html that makes me
> not want to use XFS for big data. You can assume that this is a real

This is a corner case bug, and one we are hoping we can get more data to the XFS team for. They asked for specific information that we couldn't provide (as we had to fix the problem). Note: other file systems which allow for sparse files *may* have similar issues. We haven't tried yet.

The issues with ZFS on Linux have to do with legal hazards. Neither Oracle, nor those who claim ZFS violates their patents, would be happy to see license violations, or further deployment of ZFS on Linux. I know the national labs in the US are happily doing the integration from source. But I don't think Oracle and the patent holders would sit idly by while others do this. So you'd need to use a ZFS-based system such as Solaris 11 Express to be able to use it without hassle. BSD and Illumos may work without issue as well, and should be somewhat better on the legal front than Linux + ZFS. I am obviously not a lawyer, and you should consult one before you proceed down this route.

> recent bug because Joe is a smart guy who knows exactly what he is
> doing. Joe and the Gluster guys are vendors who can work around these
> issues and provide support. If XFS is the choice, may be you should
> hire them for this gig.
>
> ZFS typically does not have these FS repair issues in the first place.
> The motivation of Lawrence Livermore for porting ZFS to Linux was
> quite clear:
>
> http://zfsonlinux.org/docs/SC10_BoF_ZFS_on_Linux_for_Lustre.pdf
>
> OK, they have 50PB and we are talking about much smaller deployments.
> However some of the limitations they report I can confirm. Also,
> recovering from a drive failure with this whole LVM/Linux Raid stuff
> is unpredictable. Hot swapping does not always work and if you
> prioritize the re-sync of data to the new drive you can strangle the
> entire box (by default the priority of the re-sync process is low on
> linux). If you are a Linux expert you can handle this kind of stuff
> (or hire someone) but if you ever want to give this setup to a Storage
> Administrator you better give them something that they can use with
> confidence (may be less of an issue in the cloud).
> Compare to this to ZFS: re-silvering works with a very predictable
> result and timing. There is a ton of info out there on this topic. I
> think that gluster users may be getting around many of the linux raid
> issues by simply taking the entire node down (which is ok in mirrored
> node settings) or by using hardware raid controllers. (which are often
> not available in the cloud )

There are definite advantages to better technology. But the issue in this case is the legal baggage that goes along with them.

BTRFS may, eventually, be a better choice. The national labs can do this with something of an immunity to prosecution for license violation, by claiming the work is part of a research project, and won't actively be used in a way that would harm Oracle's interests. And it would be ... bad ... for Oracle (and others) to sue the government over a relatively trivial violation.

Until Oracle comes out with an absolute declaration that it's OK to use ZFS with Linux in a commercial setting ... yeah ... most vendors are gonna stay away from that scenario.

> Some in the Linux community seem to be slightly opposed to ZFS (I
> assume because of the licensing issue) and make sometimes odd
> suggestions ("You should use BTRFS").

Licensing mainly. BTRFS has a better design, but it's not ready yet. Won't be for a while.

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman@...
web  : http://scalableinformatics.com
       http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
On Sun, Sep 25, 2011 at 5:51 AM, Joe Landman <landman@...> wrote:

> This is a corner case bug, and one we are hoping we can get more data to the
> XFS team for. They asked for specific information that we couldn't provide
> (as we had to fix the problem). Note: other file systems which allow for
> sparse files *may* have similar issues. We haven't tried yet.

Fair enough, but one of the things LLNL pointed out was that you have to do fsck in the first place (aka standard file systems are not self-healing).

> The issues with ZFS on Linux have to do with legal hazards. Neither Oracle,
> nor those who claim ZFS violates their patents, would be happy to see
> license violations, or further deployment of ZFS on Linux. I know the
> national labs in the US are happily doing the integration from source. But
> I don't think Oracle and the patent holders would sit idly by while others
> do this. So you'd need to use a ZFS based system such as Solaris 11 express
> to be able to use it without hassle. BSD and Illumos may work without issue
> as well, and should be somewhat better on the legal front than Linux + ZFS.
> I am obviously not a lawyer, and you should consult one before you proceed
> down this route.
>
> There are definite advantages to better technology. But the issue in this
> case is the legal baggage that goes along with them.
>
> BTRFS may, eventually, be a better choice. The national labs can do this
> with something of an immunity to prosecution for license violation, by
> claiming the work is part of a research project, and won't actively be used
> in a way that would harm Oracle's interests. And it would be ... bad ...
> for Oracle (and others) to sue to government over a relatively trivial
> violation.

I am trying to make sense of what people discuss regarding the ZFS licensing issue. Did you hear anything from anyone at Oracle that would indicate that they don't like ZFS on Linux? If I think it through, I can't see why this would make any sense. The ZFS on Linux community is extremely small and will probably always be, and the main reason besides data size is that the GPL doesn't like the CDDL, not vice versa, so distros shy away from it. The LLNL people have found a way around the GPL2 issue by implementing it as a driver.

Why doesn't Oracle sue Nexenta? Those guys have deployed 330 PB of their storage and would be a worthy target. The only company that seems to have issues with ZFS in general is NetApp, and I'm sure that they don't care whether it's installed on Solaris or on Linux. NetApp interestingly sued CoRaid, a disk shelf vendor that was using Nexenta as OS, but they did not sue Nexenta itself. NetApp knew that their case was very weak. If they had sued Nexenta, Nexenta would have fought back, because the very existence of the company would have been at risk. NetApp feared that Nexenta might have won, which would have confirmed the legitimacy of ZFS. CoRaid, on the other hand, was not dependent on their ZFS solution for their business to be able to continue. They were not willing to take any risk, and this allowed NetApp to spread some excellent FUD.

I would be interested to hear if there have been any comments anywhere (written or verbal) by noteworthy Oracle representatives regarding the ZFS on Linux question (except saying: "Oh, we won't GPL ZFS").

> Until Oracle comes out with an absolute declaration that its OK to use ZFS
> with Linux in a commercial setting ... yeah ... most vendors are gonna stay
> away from that scenario.
>
>> Some in the Linux community seem to be slightly opposed to ZFS (I
>> assume because of the licensing issue) and make sometimes odd
>> suggestions ("You should use BTRFS").
>
> Licensing mainly. BTRFS has a better design, but its not ready yet. Won't
> be for a while.

I wonder if you can compare the two. ZFS is also a volume manager and software RAID, and has multiple built-in caching layers, while the other file systems are just file systems. BTRFS seems to have some volume management in place. I would be interested to know why you think BTRFS has a better design than ZFS?
been reading this thread - quite fascinating.
On 09/29/2011 12:38 PM, paul simpson wrote:

> been reading this thread - quite fascinating.
>
> zfsonlinux + gluster looks like an intriguing combination. i'm
> interested in your findings to date; specifically would the zfs L2ARC
> (with SSDs) speed up underlying gluster operations? it sounds like it
> could be a potent mix.

Just don't minimize the legal risk issue. It's very hard for a vendor to ship/support this due to the potential risk. It's arguably hard for a user to deploy ZFS on Linux due to the risk, unless they had a way to argue that they are not violating licensing (can't intermix GPL and CDDL and ship/support it) for commercial purposes.

Lots of folks can't claim the type of cover that a national lab can claim (researching storage models). You have to decide if the risk is worth it.

If you were to do this, I'd suggest going the Illumos/OpenIndiana or BSD route. Yeah, work still needs to be done to get Gluster to build there, but the licensing is on firmer ground (hard to claim that an "open source" license such as CDDL does not mean what it says).

Understand where you stand first. Speak to a lawyer type first. Make sure you won't have issues.

And do remember that while Oracle and NetApp have (for the moment) de-escalated hostilities, Oracle did not provide indemnity to non-Oracle customers. So NetApp (and others) *can* resume their actions. A question was asked why not go after Nexenta versus others. Simple. There are many others (e.g. more potential licensing/legal fees) as compared to a single Nexenta. It's arguably less about rights than it is about revenue from legal action. But that stuff does happen ...

Oracle is probably the only one who can ship ZFS anything safely. And I'd guess that they are perfectly happy with that situation.

> regards,
>
> -paul

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@...
web  : http://scalableinformatics.com
       http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
hey joe,

received the legal issues loud and clear - all very good points. hope these issues will become clarified in due course.

putting the legal issues aside - still v keen to hear your and others' thoughts about ZFS & L2ARC being a good platform for glusterfs. that fast SSD tier sounds like a perfect complement to gluster's slow small file performance.
Couldn't you accomplish the same thing with flashcache? https://github.com/facebook/flashcache/
hi david,
On Thu, Sep 29, 2011 at 1:32 PM, David Miller <david3d-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Couldn't you accomplish the same thing with flashcache?
> https://github.com/facebook/flashcache/
I should expand on that a little bit. Flashcache is a kernel module created by Facebook that uses the device mapper interface in Linux to provide an SSD cache layer to any block device.
What I think would be interesting is using flashcache with a PCIe SSD as the caching device. That would add about $500-$600 to the cost of each brick node but should be able to buffer the active IO from the spinning media pretty well.
Something like this: http://www.amazon.com/OCZ-Technology-Drive-240GB-Express/dp/B0058RECUE or something from FusionIO if you want something that's aimed more at the enterprise.
--
David
On 09/29/2011 01:44 PM, David Miller wrote:

> On Thu, Sep 29, 2011 at 1:32 PM, David Miller <david3d@...> wrote:
>
>     Couldn't you accomplish the same thing with flashcache?
>     https://github.com/facebook/flashcache/
>
> I should expand on that a little bit. Flashcache is a kernel module
> created by Facebook that uses the device mapper interface in Linux to
> provide a ssd cache layer to any block device.
>
> What I think would be interesting is using flashcache with a pcie ssd as
> the caching device. That would add about $500-$600 to the cost of each
> brick node but should be able to buffer the active IO from the spinning
> media pretty well.

Erp ... low-end PCIe flash with decent performance starts much higher than $500-$600 USD.

> Somthing like this.
> http://www.amazon.com/OCZ-Technology-Drive-240GB-Express/dp/B0058RECUE
> or something from FusionIO if you want something that's aimed more at
> the enterprise.

Flashcache is reasonably good, but there are many variables in using it, and it's designed for a different use case. For most people the writeback mode may be reasonable, but other use cases would require different configs.

This said, please understand that it (and L2ARC, and other similar things) are *not* silver bullets (e.g. not magical things that will instantly make something far better, at no cost/effort). They do introduce additional complexity, and additional tuning points.

The thing you cannot get rid of, the network traversal, is implicated in much of the performance degradation for small files. Putting the file system on a RAM disk (if possible; tmpfs doesn't support xattrs) wouldn't make the system much faster for small files. Eliminating the network traversal and doing local distributed caching of metadata on the client side ... could ... but this would be a huge new complication, and I'd argue that it probably isn't worth it.

For the short term, small file performance is going to be bad. You might be able to play some games to make this performance better (L2ARC etc. could help in some aspects, but they won't be universally much better).

What matters most is very good design on the storage backend (we are biased due to what it is we sell/support), very good networking, and very good Gluster implementation/tuning. It's real easy to hit very slow performance by missing critical elements. We field many inquiries which usually start out with "we built our own and the performance isn't that good." You won't get good performance on the cluster file system if the underlying file system and storage design isn't going to give it to you in the first place.

This said, please understand that there is a (significant) performance cost to all those nice features in ZFS. And there is a reason why it's not generally considered a high performance file system. So if you start building with it, you shouldn't necessarily think that the whole is going to be faster than the sum of the parts. Might be worse.

This is a caution from someone who has tested/shipped many different file systems in the past, ZFS included, on Solaris and other machines. There is a very significant performance penalty one pays for using some of these features. You have to decide if this penalty is worth it.

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@...
web  : http://scalableinformatics.com
       http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
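The network-traversal argument can be illustrated with a toy latency model. Everything numeric here is hypothetical (the number of round trips per small-file operation, the round-trip time and the backend service time are placeholders, not measurements from this thread); the sketch only shows the shape of the argument, namely that a faster backend can remove the storage term but not the network term.

```python
# Sketch: toy model of per-file latency for small-file access over a network
# file system. All numbers used in the example run are hypothetical.


def per_file_ms(round_trips: int, rtt_ms: float, backend_ms: float) -> float:
    """Network time (round trips x RTT) plus time spent in the backend."""
    return round_trips * rtt_ms + backend_ms


def speedup_cap(round_trips: int, rtt_ms: float, backend_ms: float) -> float:
    """Best possible speedup if the backend became infinitely fast."""
    network_ms = round_trips * rtt_ms
    return per_file_ms(round_trips, rtt_ms, backend_ms) / network_ms


if __name__ == "__main__":
    # Hypothetical: 10 round trips of 0.5 ms each, 1 ms spent in the backend.
    print(per_file_ms(10, 0.5, 1.0), "ms per file")
    print(speedup_cap(10, 0.5, 1.0), "x at best from a faster backend")
```

Whether the cap is small in practice depends entirely on the real values, which is why measuring round trips and backend time on the actual deployment matters more than any assumed figure.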
I've used ZFS in lots of different roles, and I've found that out of the box ZFS performs decently, but to get really great performance out of the (ZFS) filesystem you really need to tune it for the application. ZFS has tons and tons of somewhat hidden features (edit /etc/system and reboot type stuff) and, if set correctly, has outstanding performance.

liam

On Thu, Sep 29, 2011 at 10:58 AM, Joe Landman <landman@...> wrote:
> This said, please understand that there is a (significant) performance cost
> to all those nice features in ZFS. And there is a reason why its not
> generally considered a high performance file system. So if you start
> building with it, you shouldn't necessarily think that the whole is going to
> be faster than the sum of the parts. Might be worse.
>
> This is a caution from someone who has tested/shipped many different file
> systems in the past. ZFS included, on Solaris and other machines. There is
> a very significant performance penalty one pays for using some of these
> features. You have to decide if this penalty is worth it.
I'm curious as to which ZFS options you're finding necessary to tune, and for what kinds of applications. Are you willing to share?

Thanks,
Tristan

Tristan Ball - Hosted Services Manager VIC
Pronto Hosted Services

-----Original Message-----
From: gluster-users-bounces@... On Behalf Of Liam Slusser
Sent: Saturday, 1 October 2011 6:38 AM
To: landman@...
Cc: gluster-users@...; rdp.com@...
Subject: Re: [Gluster-users] ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

I've used ZFS in lots of different roles and I've found that out of the box ZFS performs decent but to get really great performance out of the (zfs) filesystem you really need to tune it for the application. ZFS has tons and tons of somewhat hidden features (edit /etc/system and reboot type stuff) and if set correctly has outstanding performance.

liam
hi joe,
All,

I mentioned a few weeks ago that we were doing some testing of ZFS & Gluster. We brought up two 4-year-old but still reasonably beefy machines with the following specs:

Hardware:
  Supermicro X7DB8 storage server
  64 GB RAM (16 x 4GB, 667 MHz ECC)
  2 x Xeon quad-core E5450 @ 3.00GHz
  2 x Marvell MV88SX6081 8-port SATA II PCI-X
  1 x OCZ 50GB RevoDrive OCZSSDPX-1RVD0050
  1 x Mellanox Technologies MT25204 [InfiniHost III Lx HCA]
  16 x 2TB WD SATA consumer-grade drives

Software:
  Kernel 2.6.37.6-0.7 (openSUSE 11.4)
  ZFSOnLinux 0.6.0-RC5
  GlusterFS 3.2.3

Machine1 was set up with WD 2TB Caviar Black consumer-grade drives (5-year warranty) and Machine2 was set up with the cheapest WD 2TB Caviar Green drives.

Other notes on the hardware:

* We chose Marvell SATA controllers because the Sun Thumpers (the mother of all ZFS devices) were also using them, and a local ZFS expert advised against using combined SAS/SATA controllers. We chose PCI-X because this older hardware did not have enough PCIe slots.
* The OCZ RevoDrive is kind of small, but it is quite fast and costs only $250.

We created one zfs tank with raidz on each machine, which resulted in 2 x 25TB of space. The PCIe RevoDrives have 2 x 25GB of flash. We used one half as the ZIL log device and the other as the L2ARC cache. (The ZIL device should probably be mirrored in a production system; the L2ARC device can apparently go away without causing FS corruption.)

Setting ashift=12 for the green drives was the only change we made from the default ZFS config (there is a lot of documentation on how to make these green drives usable with ZFS).

Our first impression was that either the system memory or the L2ARC provided really good write caching for short sequential writes (a lot of our work is like that): iotop showed 700+ MB/s.

We ran bonnie++ a couple of times, but we are no benchmark experts; see the results at the end of this note. We'd be happy to run other benchmarks if someone thinks that would make more sense.
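For readers who want to reproduce the pool layout described above, here is a minimal sketch in Python that only assembles the `zpool create` command line and does not run it. The post does not give the exact command, so several things here are assumptions: the device paths are hypothetical, the 16 data disks are assumed to sit in a single raidz vdev, and `-o ashift=12` is the usual ZFS on Linux way to set the sector-size shift (whether 0.6.0-RC5 accepted it in exactly this form is not confirmed by the post).

```python
def zpool_create_argv(pool, data_disks, log_dev, cache_dev, ashift=None):
    """Return the argv list for a `zpool create` call; nothing is executed."""
    argv = ["zpool", "create"]
    if ashift is not None:
        # ashift=12 means 2**12 = 4096-byte sectors (used for the green drives)
        argv += ["-o", "ashift=%d" % ashift]
    # One raidz vdev holding all data disks (assumed layout)
    argv += [pool, "raidz"] + list(data_disks)
    # Separate intent-log (ZIL) device and L2ARC cache device
    argv += ["log", log_dev]
    argv += ["cache", cache_dev]
    return argv


# Hypothetical device names, for illustration only
data_disks = ["/dev/disk/by-id/data%02d" % n for n in range(16)]
argv = zpool_create_argv(
    "tank",
    data_disks,
    log_dev="/dev/disk/by-id/revodrive-half1",
    cache_dev="/dev/disk/by-id/revodrive-half2",
    ashift=12,
)
print(" ".join(argv))
```

Building the argument list separately makes it easy to review the layout before handing it to something like `subprocess.run(argv, check=True)` on a test machine.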
It was interesting to see that the green drives reached about 90% of the throughput of the black drives.

Another type of workload we have in our environment is HPC users who create many small files and delete them quickly afterwards. To simulate this we created a silly little Python script that makes lots of 1k files with slightly different content:

*****scratch.py ***************************************************
#! /usr/bin/python
# Python 2 script: creates N small files, N given as the first argument.
import sys

mystr = """
alskjdhfka;kajf akjhfdskajshf k;ajhsdf;kajhf k;jah
........another 1000 random chars
"""

for i in xrange(int(sys.argv[1])):
    file = "file-%s" % i
    fh = open(file, "w")
    # Prefix the index so every file has slightly different content
    fh.write(str(i) + mystr)
    fh.close()
***********************************************************

The script created 10000 files and then 100000 files locally on Machine1.

First we ran this locally on the boot drives (ext4):

  create 10000 files:    sub-second
  ls -la on 10000 files: sub-second
  rm * on 10000 files:   sub-second

  create 100000 files:    4s
  ls -la on 100000 files: 1s
  rm * on 100000 files:   3s

After clearing the cache we ran this on the ZFS filesystem:

  create 10000 files:    3s
  ls -la on 10000 files: sub-second
  rm * on 10000 files:   sub-second

  create 100000 files:    92s
  ls -la on 100000 files: 1s
  rm * on 100000 files:   8s

At first this is kind of disappointing, because ZFS seems to be much slower. However, a ZFS file server is almost never used as local storage in a compute box.
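The create / list / delete test above can also be timed from a single script. The following is a rough Python 3 sketch, not the script used for the numbers in this post: the `PAYLOAD` filler, the function names, and the use of `os.scandir` as a stand-in for `ls -la` are all assumptions made for illustration.

```python
import os
import tempfile
import time

PAYLOAD = "x" * 1000  # stand-in for the ~1000 characters of filler text


def create_files(directory, count):
    """Write `count` small files, each starting with its own index."""
    for i in range(count):
        path = os.path.join(directory, "file-%s" % i)
        with open(path, "w") as fh:
            fh.write(str(i) + PAYLOAD)


def list_files(directory):
    """Roughly what `ls -la` does: read the directory and stat every entry."""
    return [(entry.name, entry.stat().st_size) for entry in os.scandir(directory)]


def remove_files(directory):
    """Delete every entry; the listing is materialised before unlinking."""
    for name in os.listdir(directory):
        os.unlink(os.path.join(directory, name))


def timed(func, *args):
    """Return (elapsed seconds, function result)."""
    start = time.perf_counter()
    result = func(*args)
    return time.perf_counter() - start, result


def run(directory, count):
    t_create, _ = timed(create_files, directory, count)
    t_list, listing = timed(list_files, directory)
    t_remove, _ = timed(remove_files, directory)
    return {"create": t_create, "list": t_list, "remove": t_remove,
            "files": len(listing)}


# Small demo in a throwaway directory; point `run` at a ZFS or
# GlusterFS mount (and raise the count) to compare filesystems.
with tempfile.TemporaryDirectory() as scratch:
    print(run(scratch, 1000))
```

Note that timings taken this way include Python overhead and benefit from the page cache, so they are only comparable to each other, not to the shell timings listed above.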
You'll have to measure this via NFS/GlusterFS, and then the story looks slightly different.

First we tried to do the same thing on a fast compute box that is connected to a fast NetApp 3050 with 60 10k RPM FC drives:

  create 10000 files:    14s
  ls -la on 10000 files: 1s
  rm * on 10000 files:   4s

Then we used a Solaris file server (Dell R810 connected to a 600TB FC SAN that consists mostly of SATA drives) that is mounted on the same box:

  create 10000 files:    32s (NFSv4: 53s)
  ls -la on 10000 files: 6s (NFSv4: sub-second)
  rm * on 10000 files:   15s (NFSv4: 20s)

Then we used our new 50TB ZFS gluster system (Machine1 and Machine2, distributed, 2 bricks) mounted via glusterfs from the client:

  create 10000 files:    27s
  ls -la on 10000 files: 8s
  rm * on 10000 files:   13s

I admit we don't have any really, really fast storage, and more tests need to be done. But when we look at what our current needs are, our ZFS gluster would do quite well compared to existing equipment. We have not had any ZFS crashes so far; everything has been very stable.

On the throughput front, we have not connected these systems to 10G, but the numbers indicate that this 2-node gluster cluster could probably fill an entire 10G pipe. We also wanted to experiment with InfiniBand/RDMA, but the drivers seem to be unsupported in later Linux kernels (e.g. 2.6.37). If anyone has a howto, please let me know.

######## BENCHMARK #####################

Machine1, 16 WD caviar black, 50GB SSD, ashift=9

/loc/bonnie # bonnie++ -d /loc/bonnie -s 22g -r 11g -n 0 -f -b -u root
Using uid:0, gid:0.
Writing intelligently...done
Rewriting...done
Reading intelligently...done
start 'em...done...done...done...
Version 1.03d      ----------Sequential Output----------     ----Sequential Input---     --Random-
                   -Per Chr-     --Block--     -Rewrite-     -Per Chr-     --Block--     --Seeks--
Machine   Size     K/sec %CP     K/sec %CP     K/sec %CP     K/sec %CP     K/sec %CP      /sec %CP
Machine1   22G                  380617  95    332375  92                 1015229  99      3380   7
Machine1,22G,,,380617,95,332375,92,,,1015229,99,3380.1,7,,,,,,,,,,,,,

/loc/bonnie # bonnie++ -d /loc/bonnie -s 96g -r 16g -n 0 -f -b -u root
Using uid:0, gid:0.
Writing intelligently...done
Rewriting...done
Reading intelligently...done
start 'em...done...done...done...
Version 1.03d      ----------Sequential Output----------     ----Sequential Input---     --Random-
                   -Per Chr-     --Block--     -Rewrite-     -Per Chr-     --Block--     --Seeks--
Machine   Size     K/sec %CP     K/sec %CP     K/sec %CP     K/sec %CP     K/sec %CP      /sec %CP
Machine1   96G                  336650  91    197629  70                  472625  74     141.4   1
Machine1,96G,,,336650,91,197629,70,,,472625,74,141.4,1,,,,,,,,,,,,,
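The comma-separated summary line that bonnie++ prints after each run is easier to compare across runs than the table. Below is a small Python sketch for reading it. The field names are my own labels, inferred from the column headers in the output above rather than taken from the bonnie++ documentation, and the conversion assumes that "K" in K/sec means 1024 bytes.

```python
# Labels for the first 14 CSV fields, inferred from the table headers
# (assumption: per-char write, block write, rewrite, per-char read,
# block read, random seeks, each followed by its %CPU figure).
FIELDS = [
    "machine", "size",
    "putc_kps", "putc_cpu",
    "write_kps", "write_cpu",
    "rewrite_kps", "rewrite_cpu",
    "getc_kps", "getc_cpu",
    "read_kps", "read_cpu",
    "seeks_ps", "seeks_cpu",
]


def parse_bonnie_csv(line):
    """Turn one bonnie++ summary line into a dict; skipped tests become None."""
    values = line.strip().split(",")
    row = {}
    for name, raw in zip(FIELDS, values):
        if name in ("machine", "size"):
            row[name] = raw
        else:
            row[name] = float(raw) if raw else None
    return row


def kps_to_mibps(kps):
    """Convert K/sec to MiB/s, assuming K = 1024 bytes."""
    return kps / 1024.0


runs = [
    "Machine1,22G,,,380617,95,332375,92,,,1015229,99,3380.1,7,,,,,,,,,,,,,",
    "Machine1,96G,,,336650,91,197629,70,,,472625,74,141.4,1,,,,,,,,,,,,,",
]
for line in runs:
    row = parse_bonnie_csv(line)
    print(row["size"],
          "write %.0f MiB/s," % kps_to_mibps(row["write_kps"]),
          "rewrite %.0f MiB/s," % kps_to_mibps(row["rewrite_kps"]),
          "read %.0f MiB/s," % kps_to_mibps(row["read_kps"]),
          "seeks %.1f/s" % row["seeks_ps"])
```

The per-character fields are empty because both runs used `-f`, which skips the per-character tests; the parser maps those to `None` instead of zero so they are not mistaken for measurements.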