Tuning NFS Performance on the Espresso Cluster
We've gotten some complaints about NFS performance on espresso, and it does look pretty bad.
I've been trying to get good performance on a test rig; here are my latest results.
According to this thread on the bioclusters mailing list, we should be seeing 25-30 MB/sec on this hardware for block read and write with one client. The drop-off for multiple clients is worse than linear (over a single GE link).
The NFS server is a dual Xeon 2.4 GHz with 2 GB RAM and three 3ware 7000-series ATA-RAID controllers, each handling 8 SATA drives in RAID5. All NFS tests were done against one of those RAID5 volumes. Intel PRO/1000 Gigabit Ethernet over copper.
The NFS clients are nodes from the espresso cluster, dual Xeon 2.4 GHz with 1 GB RAM each. Intel PRO/1000 Gigabit Ethernet over copper.
The machines are all connected via a Dell PowerConnect 5424 Gigabit Ethernet switch. We've pounded on these and gotten good performance via TCP and UDP, so we know the switches and network cards are in good working order.
Clients run BioBrew 3, based on ROCKS 3.1, in turn based on RedHat Enterprise Linux 3 WS. Kernel is 2.4.21-4.0.1.ELsmp. NFS runs over the autofs automounter (autofs-3.1.7-41). The ROCKS master node is not the NFS server; I set up a separate NFS machine.
Server runs RedHat Linux 9, kernel 2.4.20-31.9smp, nfsd from nfs-utils-1.0.1-3.9.
All disk IO tests were done with bonnie++ version 1.03. File sizes for the tests were at least 2 GB, to keep Linux file caching from skewing the results.
We've got some problems: block read and write should be 25-35 MB/sec, but are only 10 MB/sec. It may be hardware related, as the performance on the NFS server itself isn't what it should be. Block writes on the RAID5 volume should be 30-40 MB/sec, but the -1 client datapoint, which represents one bonnie process running directly on the NFS server on the native filesystem, is a bit under the 30 MB/sec mark. I've tested both ext3 and reiserfs4.
See also the "Tool to benchmark disk IO?" thread on the bioclusters mailing list.
Performance as seen on the server, the sum of the performance of each client:
If you look at the performance from the client's point of view, however, things are really grim. Here's the average performance per bonnie client:
I've gotten some improvement by exporting the files with the async option, and using rsize=8192,wsize=8192 in the automounter options.
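To check on a client that the automounter really applied these options, one approach is to look at the mount's entry in /proc/mounts. The following is only a minimal sketch: the function name is my own, and the sample line (export path, mount point, and the extra options) is hypothetical, made up for illustration.

```python
def parse_mount_options(mounts_line):
    """Return (fstype, options dict) for one line in /proc/mounts format."""
    fields = mounts_line.split()
    # Field order: device, mount point, fs type, options, dump, pass
    fstype, raw_opts = fields[2], fields[3]
    opts = {}
    for item in raw_opts.split(","):
        key, _, value = item.partition("=")
        opts[key] = value or True  # flags such as "rw" carry no value
    return fstype, opts


# Hypothetical sample line; on a real client, read the lines of /proc/mounts.
sample = "nazgul:/export/home /home nfs rw,rsize=8192,wsize=8192,hard,udp 0 0"
fstype, opts = parse_mount_options(sample)
print(fstype, opts.get("rsize"), opts.get("wsize"))
```

If rsize or wsize come back different from what was requested, the client and server probably negotiated other values, which is worth knowing before comparing benchmark runs.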
Clients are seeing many retransmits:
Client rpc stats:
calls      retrans    authrefrsh
21438790   361096     0
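To put those counters in proportion, the retransmit rate is simply retrans divided by calls. A minimal sketch, assuming the three-line layout that nfsstat prints for the client RPC counters; the function name is hypothetical and the numbers are the ones reported above.

```python
def retrans_rate(nfsstat_text):
    """Return retrans/calls from a 'Client rpc stats' block of nfsstat output."""
    rows = [line.split() for line in nfsstat_text.strip().splitlines()]
    for i, words in enumerate(rows):
        # The header row names the columns; the values follow on the next row.
        if words[:2] == ["calls", "retrans"] and i + 1 < len(rows):
            calls, retrans = int(rows[i + 1][0]), int(rows[i + 1][1])
            return retrans / calls
    raise ValueError("no 'calls retrans' header found")


stats = """
Client rpc stats:
calls      retrans    authrefrsh
21438790   361096     0
"""
rate = retrans_rate(stats)
print(f"retransmit rate: {rate:.2%}")  # prints "retransmit rate: 1.68%"
```

Since the counters are cumulative, sampling them before and after a bonnie run and taking the difference would show the rate for that run alone.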
Unfortunately the nazgul server kernel (2.4.20) doesn't support NFS over TCP. Espresso's kernel (2.4.21-EL) does.
The server is also heavily loaded, so we should run more nfsd threads. See the /proc/net/rpc/nfsd file; the th line shows the time spent at each 10% of load:
th 8 28713782 25919.360 8872.650 12977.000 0.000 16695.330 4689.360 3214.830 2339.010 0.000 51838.740
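The th line is easier to read as percentages. The sketch below assumes the 2.4-era layout as I understand it: the thread count, a counter of how often all threads were busy at once, then ten values giving the seconds spent in each 10% band of thread utilization. The function and variable names are my own.

```python
TH_LINE = ("th 8 28713782 25919.360 8872.650 12977.000 0.000 16695.330 "
           "4689.360 3214.830 2339.010 0.000 51838.740")


def thread_histogram(th_line):
    """Split an nfsd 'th' line into thread count, all-busy counter, and shares."""
    parts = th_line.split()
    if parts[0] != "th" or len(parts) != 13:
        raise ValueError("unexpected th line format")
    threads = int(parts[1])
    all_busy_count = int(parts[2])  # assumed: times every thread was in use
    seconds = [float(x) for x in parts[3:]]
    total = sum(seconds)
    shares = [s / total for s in seconds]  # fraction of time per 10% band
    return threads, all_busy_count, shares


threads, all_busy_count, shares = thread_histogram(TH_LINE)
for band, share in enumerate(shares):
    print(f"{band * 10:3d}-{band * 10 + 10:3d}% busy: {share:6.1%}")
```

With the numbers above, the last band (90-100% of threads busy) accounts for roughly 41% of the recorded time, which supports the case for more nfsd threads.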
I set the following options in '/etc/sysconfig/nfs':
I've also tested both ext3 and reiserfs4 under this load on the same hardware, and it doesn't seem to make any difference in the bonnie benchmarks.
Kennie Cruz alerted me to this possible kernel bug:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=121434
There is a long thread about it on the NPACI ROCKS mailing list:
https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2004-September/thread.html#7712
I'm testing this kernel on espresso; it's mentioned in the RedHat bugzilla entry as a possible cure for the IO bug:
http://people.redhat.com/coughlan/RHEL3-perf-test/i686/kernel-smp-2.4.21-21.ELperftestonly2.i686.rpm
The default settings suck? I don't have any other conclusions yet; I'm still testing. I do want to note that the performance seen is with default export and mount flags on both the server and the clients. I'm going to ask around to find out whether this performance is good or bad.
Espresso's performance is, in fact, much worse than this. Stock ROCKS installs put the NFS server on the master node. Espresso's master node is a dual Xeon 2.4 GHz with 1 GB RAM, and RAID5 over 5 IDE drives with a 3ware controller. I've seen 600 KB/sec IO rates under heavy load with just a few clients on this machine.
I'm pretty sure putting NFS on a separate dedicated server (plus the SATA drives) is a big win, but the performance still isn't what I was expecting from these servers. We've had to tell our users to use SGE facilities for staging files to local storage (see SGE's $TMPDIR) when they do lots of IO. At least that seems to be a common solution; the documentation for several other clusters suggests doing the same thing.
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
compute-1-5.loca 2G 21977  65 43503  24 38001  23 32240  89 92362  23  5101  19
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16    85   0  2135   3   172   0    91   0  2124   3   114   0
compute-1-5.local,2G,21977,65,43503,24,38001,23,32240,89,92362,23,5101.3,19,16,85,0,2135,3,172,0,91,0,2124,3,114,0
106.780u 49.700s 15:20.38 17.0% 0+0k 0+0io 541pf+0w
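The comma-separated line that bonnie++ prints is the convenient one for comparing runs. As a rough sketch, the snippet below names the first 14 fields after the column headers of the report and converts the block rates from K/sec to MB/sec (dividing by 1024). The field labels and function names are my own, not something bonnie++ defines.

```python
BONNIE_CSV = ("compute-1-5.local,2G,21977,65,43503,24,38001,23,32240,89,"
              "92362,23,5101.3,19,16,85,0,2135,3,172,0,91,0,2124,3,114,0")

# Hypothetical labels for the throughput fields, in report column order.
THROUGHPUT_FIELDS = [
    "machine", "size",
    "out_char_k", "out_char_cpu",
    "out_block_k", "out_block_cpu",
    "rewrite_k", "rewrite_cpu",
    "in_char_k", "in_char_cpu",
    "in_block_k", "in_block_cpu",
    "seeks", "seeks_cpu",
]


def parse_bonnie_csv(line):
    """Map the leading throughput fields of a bonnie++ CSV line to names."""
    values = line.strip().split(",")
    # zip stops at the shorter list, so the file-creation fields are ignored.
    return dict(zip(THROUGHPUT_FIELDS, values))


def k_to_mb(k_per_sec):
    """Convert a K/sec figure (string or number) to MB/sec."""
    return float(k_per_sec) / 1024.0


row = parse_bonnie_csv(BONNIE_CSV)
print(f"{row['machine']}: block write {k_to_mb(row['out_block_k']):.1f} MB/sec, "
      f"block read {k_to_mb(row['in_block_k']):.1f} MB/sec")
```

For this run that works out to about 42.5 MB/sec block write and 90.2 MB/sec block read.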