OscarClusterSetup
Our Atipa Cluster Setup (Oscar v2.3 and RH 9)
ide.disk conf:
hda1 78 ext3 /boot defaults
hda5 1992 swap
hda6 * ext3 / defaults
Important notes
1) Update the Red Hat 9 kernel-smp and kernel-source packages.
2) Replace the kernel rpms in the /tftpboot/rpm directory with the updated ones you are actually using.
3) The NIC on the left side of the back of the slave nodes is the Fast Ethernet one; the other is the gigabit one.
4) Use the gigabit NIC as the private NIC for the cluster.
Oscar 2.3 note: the file /opt/opium/etc/sync_users.conf has to be edited in order to sync users across all cluster nodes.
Run /opt/opium/bin/sync_users if you want the changes applied immediately. If you haven't added any users, you need the --force switch.
Stuff we need to do to get this cluster into production:
- Reinstall master, using hardware raid (Alt 3 gets into raid bios) (Carlos did this).
- Fix pbs/maui, it isn't scheduling. (Reinstalling fixed this). Also note that the docs say that the pbsnodes command may be used to mark nodes up and down, or query node status. It didn't seem to work on the crashed nodes.
- Turn on hyperthreading in bios, turn it off in lilo (or kernelpicker)
- Turn on PXE on all the nodes, now that Carlos has fixed the switches.
To run jobs in the queue, prepare a PBS script.
Sample script for a single CPU job:
#!/bin/sh
#PBS -l nodes=1:ppn=1
#PBS -M humberto@hpcf.upr.edu
#PBS -m abe
cd work/dir
foo
Sample script for multi CPU MPI job:
#!/bin/sh
#PBS -l nodes=32:ppn=2
#PBS -M humberto@hpcf.upr.edu
#PBS -m abe
export P4_GLOBMEMSIZE=10485760
mpirun -np 64 -machinefile $PBS_NODEFILE ./xhpl
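A minimal sketch of one way to avoid hard-coding the process count in the MPI script above. It assumes PBS writes one line per allocated processor to $PBS_NODEFILE (so nodes=32:ppn=2 would give 64 lines). The helper name count_procs is hypothetical, not part of PBS or Oscar.

```shell
#!/bin/sh
# count_procs (hypothetical helper): print the number of lines in a PBS node file.
# Assumption: the file holds one line per allocated processor.
count_procs() {
    # grep -c '' counts every line, even a last line without a trailing newline
    grep -c '' "$1"
}

# Inside a job PBS sets PBS_NODEFILE, and the mpirun line could become:
#   mpirun -np "$(count_procs "$PBS_NODEFILE")" -machinefile "$PBS_NODEFILE" ./xhpl
# Outside a job the variable is unset and nothing is printed.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "${PBS_NODEFILE:-}" ]; then
    echo "would run mpirun with -np $(count_procs "$PBS_NODEFILE")"
fi
```

With this, changing the nodes= or ppn= request should not require editing the mpirun line as well.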
Oscar Home page --humberto, 2003/10/30 09:43 AST
The oscar home page at http://oscar.sourceforge.net/ has links to the User's Guide and other documentation, a must read!
PBS sucks. --humberto, 2003/12/10 14:33 AST
I'm pounding on 2 nodes in the cluster, running xhpl with 2 cpu, 12000 NB (looks like it's a bit too big), and the cluster basically stopped scheduling. Maui and PBS are both dead.
Too harsh --humberto, 2003/12/11 09:36 AST
OK, I was a bit too harsh on PBS. I turned off the 2 nodes (they were crashed with out of RAM errors) and restarted pbs. The nodes are now marked down (see pbsnodes -a) and jobs schedule around them. I'm supposed to be able to use pbsnodes to mark them up after I reboot.
By the way, the jobs show as still running in PBS, and I can't delete them from the queue.
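For reference, a sketch of the pbsnodes calls discussed in this comment. As far as I know, in OpenPBS -a lists all nodes, -l lists only the down or offline ones, -o marks nodes offline and -c clears that mark. Exact behaviour differs between PBS versions, so check man pbsnodes first. The helper pbsnodes_cmd is hypothetical and only prints the command line. The node names are placeholders.

```shell
#!/bin/sh
# pbsnodes_cmd (hypothetical helper): print the pbsnodes command for an action.
# It only prints; run the printed command on the master node yourself.
pbsnodes_cmd() {
    action=$1
    shift
    case "$action" in
        list)    echo "pbsnodes -a" ;;      # status of every node
        problem) echo "pbsnodes -l" ;;      # only nodes that are down or offline
        offline) echo "pbsnodes -o $*" ;;   # mark the given nodes offline
        clear)   echo "pbsnodes -c $*" ;;   # clear the offline mark after a reboot
        *)       echo "unknown action: $action" >&2; return 1 ;;
    esac
}

# Placeholder node names; substitute the crashed nodes.
pbsnodes_cmd offline node05 node06
pbsnodes_cmd clear node05 node06
```

Marking a node offline before powering it down should keep the scheduler from trying to place jobs on it.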
See, it's a pain. --humberto, 2003/12/11 15:11 AST
PBS eventually (after 3 or 4 hours) decided it couldn't restart the jobs, but in the meantime qstat, showq and qsub were broken. Shannon is now running codeml jobs and dcomara is working too.
We need to set some kind of node limits to prevent the nodes from taking out the whole cluster.
Too hot, turned off some nodes --humberto, 2003/12/14 14:34 AST
It got too hot here today and I turned off some nodes.
Maui went crazy until I edited /var/spool/pbs/server_priv/nodes and deleted the nodes I had turned off.
I think maui now works.
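A sketch of how the nodes-file edit described in this comment could be scripted rather than done by hand. It assumes the file has one node per line with the node name as the first field (for example "node05 np=2"). The helper drop_nodes is hypothetical and only prints the filtered file. It never modifies the original.

```shell
#!/bin/sh
# drop_nodes (hypothetical helper): print a PBS nodes file without the named nodes.
# Usage: drop_nodes NODES_FILE name [name ...]
# Assumption: one node per line, node name in the first field.
drop_nodes() {
    file=$1
    shift
    awk -v drop="$*" '
        BEGIN {
            n = split(drop, names, " ")
            for (i = 1; i <= n; i++) skip[names[i]] = 1
        }
        # print only lines whose first field is not in the drop list
        !($1 in skip)
    ' "$file"
}
```

For example, `drop_nodes /var/spool/pbs/server_priv/nodes node05 node06 > /tmp/nodes.new` writes a candidate file that can be reviewed before it replaces the original. Keep a copy of the original so the nodes can be added back once they are powered on again.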