Building an Ubuntu Computing Cluster
First off, this is a work in progress, so it will take a while until I complete it. I will also only write up what succeeds; all the hours spent chasing dead ends will not be mentioned. All of the steps described here were tested on Ubuntu Feisty, but should probably work on other releases as well.
(Patrik: I have added some comments marked like this one).
Right, let's get cracking.
Basics
This is what you need:
- Cluster hardware, i.e. a lot of computers with similar hardware and memory.
- A good understanding of Linux and the following technologies:
- NIS, DHCP, NFS and PXE
- Patience and/or coffee/tea.
The design I came up with was to use one of the computing nodes as a master node running all the servers I needed. In other words, one of the computing nodes will have DHCP, NIS, TFTP and NFS servers on it. Sounds like a lot, I know, but you'll see it's all quite necessary. I will make some assumptions in this small tutorial. For instance, I will assume that your cluster lives in a C-class network with a server (the first computing node) at 192.168.0.1. I will also assume that you have 4 computing nodes in total, since that is what I'm dealing with here. The first node should have two network cards: one for the interwub and the other for the local cluster network. Now, since we are building a computing cluster, I find it proper to give the nodes their own swap, because I don't want them signing off when memory is scarce. So on each node's disk you should create a partition table with a primary partition for tmp storage and an extended partition holding a logical partition for swap; see the sketch below. Choose sizes as you like here, but I made a lot of swap.
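To make this concrete, here is roughly what the layout on each node could look like, assuming the disk is /dev/sda (the same device names turn up in the fstab further down; the sizes are entirely up to you):
# /dev/sda1  primary partition, ext3, used for tmp storage
# /dev/sda2  extended partition, containing
# /dev/sda5  logical partition, used as swap
sudo fdisk /dev/sda        # create the partitions interactively
sudo mkfs.ext3 /dev/sda1   # file system for tmp storage
sudo mkswap /dev/sda5      # initialise the swap area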
Getting our hands dirty
Start with installing Ubuntu on one of the nodes using either a cdrom or following the UbuntuViaUSB tutorial. (Patrik and Pontus: We followed Michael's instructions and we managed to build a memory stick that worked perfectly on our desktop computers, but was denied by the cluster. We finally tried this method http://learn.clemsonlinux.org/wiki/Ubuntu:Install_from_USB_drive#The_Quick_Method with a good result. We have absolutely no idea what the problem is/was, but be prepared that this trivial step might take more time than expected). I chose to install only the server edition since a full desktop isn't really required. Now grab some tea and enjoy the installation of Ubuntu on one of the nodes. (Patrik: The installer wants to automatically detect the network hardware. This step takes a very long time and will give you the impression that something is wrong. Leave the room and return after ten minutes). Finished? Good, now we need to install some software.
sudo apt-get install dhcp3-server tftpd-hpa syslinux nfs-kernel-server initramfs-tools
Then we need to set up some PXE booting prerequisites
cp /usr/lib/syslinux/pxelinux.0 /var/lib/tftpboot
mkdir /var/lib/tftpboot/pxelinux.cfg
Now we need to let the dhcp server know which network to offer the pxe boot image to. So edit the dhcp configuration using vi /etc/dhcp3/dhcpd.conf and fill it with the information below.
ddns-update-style none;
option domain-name "thepcluster.org";
#option domain-name-servers ns1.thepcluster.org;
default-lease-time 600;
max-lease-time 7200;
authoritative;
log-facility local7;
allow booting;
allow bootp;
subnet 192.168.0.0 netmask 255.255.255.0 {
  range 192.168.0.2 192.168.0.4;
  option broadcast-address 192.168.0.255;
  option routers 192.168.0.1;
  option domain-name-servers 192.168.0.1;
  filename "/var/lib/tftpboot/pxelinux.0";
}
(Patrik: This file is SelmaN01-/etc/dhcp3/dhcpd.conf for N01).
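Once dhcpd.conf is in place, restart the DHCP server so it picks up the new subnet. On Feisty the init script is named after the package (adjust if yours differs):
sudo /etc/init.d/dhcp3-server restart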
In order for any of this to matter we need to tell the tftp server to run, I mean actually run, itself when started. Edit /etc/default/tftpd-hpa and set RUN_DAEMON="yes".
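The whole file is tiny; after the edit it should look roughly like the sketch below (the OPTIONS line is whatever your installation already shipped with; the value shown is my assumption of the Feisty default, so leave yours as is):
RUN_DAEMON="yes"
OPTIONS="-l -s /var/lib/tftpboot"
Then start the daemon with sudo /etc/init.d/tftpd-hpa start, or simply wait for the next reboot.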
After this we need to create a new config file for the pxe thingie so that it knows which kernel to pass to the client. Create the file /var/lib/tftpboot/pxelinux.cfg/default and fill it with the following information.
LABEL linux
KERNEL vmlinuz
APPEND root=/dev/nfs initrd=initrd.img nfsroot=192.168.0.1:/opt/nfsroot ip=dhcp rw
(Patrik: This file is SelmaN01-/var/lib/tftpboot/pxelinux.cfg).
This will tell the clients to mount the root file system via NFS from the server 192.168.0.1, our master node, at /opt/nfsroot. Of course, in order for a client to mount a directory over NFS, a server must first share it. So edit your /etc/exports and tell it how to export our nfsroot by entering
/opt/nfsroot 192.168.0.0/24(rw,no_root_squash,async)
/home 192.168.0.0/24(rw,no_root_squash,async)
(Patrik: The NFS daemon has been updated since Michael wrote the original document. A new option has been added, which must be set to a value or the daemon will print warnings during startup. I added subtree_check to /opt/nfsroot and no_subtree_check to /home, but I was only guessing and I recommend that you read the manpage for exports and configure the system according to your own understanding of how NFS works). As you may have noticed, we put the home directory in there as well. The reason for this is that we need to let the NIS users access their home directories from each node on the cluster. More about this later. This might be a good time to actually create the clients' root file system and export it.
mkdir -p /opt/nfsroot
exportfs -rv
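Spelling out Patrik's remark above, with the subtree options added the exports file would look something like this (whether subtree_check or no_subtree_check is the right choice is, as he says, a guess; read exports(5) and decide for yourself):
/opt/nfsroot 192.168.0.0/24(rw,no_root_squash,async,subtree_check)
/home 192.168.0.0/24(rw,no_root_squash,async,no_subtree_check)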
Remember the configuration we did in pxelinux.cfg? Well, it's time to actually create some of the files mentioned there. For starters we shall create the initrd.img file. Since we don't have a dedicated client set up yet, we will have to go through some steps to create it. Edit your /etc/initramfs-tools/initramfs.conf and change BOOT to nfs instead of local (see below). Then make an initrd.img and store it along with the kernel image in /var/lib/tftpboot.
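The edit itself is a single line; the file ships with BOOT=local, so after the change the relevant line simply reads:
BOOT=nfs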
mkinitramfs -o /var/lib/tftpboot/initrd.img
cp /boot/vmlinuz-someversion /var/lib/tftpboot/vmlinuz
(Patrik: After you have done this, restore /etc/initramfs-tools/initramfs.conf to whatever it was before you modified it. If initramfs-tools ever gets updated, as in my case, apt-get will try to generate a new initramfs for the login node). Now fill the nfsroot directory.
cd /
cp -axv bin home media root srv tmp var boot etc initrd lib lib64 lost+found mnt sbin sys usr /opt/nfsroot/
cp -axv dev /opt/nfsroot
cp -axv proc /opt/nfsroot
(Patrik: Michael forgot the lib64 symlink in the original instructions. Four hours of reading source code and I found the error. There is also a problem copying some of the files in proc. The -v flag to cp will give you a hint about which files are causing trouble. I just skipped those files). After this you need to make sure that the fstab in the nfsroot knows what to do. Fill /opt/nfsroot/etc/fstab with the information below. This will use the local disks on the client for swap and tmp storage. Home will be mounted over NFS since we will configure NIS users.
proc /proc proc defaults 0 0
/dev/nfs / nfs defaults 0 0
192.168.0.1:/home /home nfs defaults 0 0
/dev/sda1 /tmp ext3 defaults 0 1
/dev/sda5 none swap sw 0 0
(Patrik: This file is SelmaN01-/opt/nfsroot/etc/fstab for N01)
"Wait, 192.168.0.1?" I hear your inquiring minds cry! Well, the thing is we don't have an actual interface set up on the server yet with an IP of 192.168.0.1, so let's get to it. Fire up vi on /etc/network/interfaces and add the new interface. After your work is done the file should look something like the example below.
# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto eth1
iface eth1 inet dhcp

# The interface for the local cluster network
auto eth0
iface eth0 inet static
    address 192.168.0.1
    netmask 255.255.255.0
I know, I know, setting eth0 as the local interface might not have been entirely transparent. Deal with it. Anyway, now we have our two interfaces active and our primary servers are set up properly. Before we move on we need to make sure that the client root file system is configured properly. We already did the fstab, but we need to make one crucial adjustment to the interfaces file on the nfsroot: comment out all interfaces except the lo one (see the sketch below). If you allow eth0 or eth1 to be brought up you will lose the nfsroot, since the network card was already set up by PXE during the initial boot process.
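To be concrete, /opt/nfsroot/etc/network/interfaces should end up containing little more than the loopback stanza, something like this (a sketch; leaving the other stanzas commented out rather than deleted works just as well):
# The loopback network interface
auto lo
iface lo inet loopback
# eth0 and eth1 deliberately left out; the NIC was already configured by PXE during boot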
Setting up NIS server
By now we should be able to boot the nodes from the network and use the nfsroot as the root file system. Now we need to set up some users, since we don't want to maintain duplicate user accounts on both the real root and the nfsroot. First we need to make our system a bit more secure, since the default for NIS is to allow the whole world access to its services. We accomplish this by setting up /etc/hosts.allow
ALL: LOCAL
ALL: .thep.lu.se
ALL: 192.168.0.0/24
and /etc/hosts.deny
ALL: ALL
where we only allow access from thep.lu.se and the local network to the server. To actually install NIS we issue
sudo apt-get install portmap nis
and choose the domain name. Note that this is not the real domain name; it's just something that NIS will use to identify groups of clients, so you can basically choose anything you'd like here. However, we shall stick with thepcluster.org as we did when we configured our local interface for the dhcp server. When the installation is done we have one more security thingie to fix: edit /etc/ypserv.securenets and comment out the 0.0.0.0 line. I can't stress the importance of this enough. If you forget, then the whole world has access. So fill it with allowed hosts instead, like:
host 192.168.0.1
host 192.168.0.2
host 192.168.0.3
host 192.168.0.4
We want the masternode to know that it is a server. We accomplish that by editing /etc/default/nis and setting NISSERVER=master. Also make sure that /etc/default/portmap has the ARGS="-i 127.0.0.1" line commented out. Next edit /etc/yp.conf and add a server line like
domain thepcluster.org server selma.thepcluster.org
(Patrik: I replaced selma.thepcluster.org with the IP address of the first node. This reminds me to point out that Michael does not mention how to set up the network and name lookup. I set up /etc/hosts according to the manpage of hosts. Before I did this I had huge problems with sshd). Maybe your masternode isn't called selma, and if so you should enter the name of your masternode here. Issue sudo /usr/lib/yp/ypinit -m to build the database. Don't worry about some of the errors that occur; I didn't, and our system works just fine. This is an excellent opportunity to add new users to your system, i.e., adduser monkeyboy etc. When all the users are added, issue make -C /var/yp to propagate the changes through the system.
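Since Patrik points out that name lookup caused him trouble, here is roughly what /etc/hosts on the master could look like, using the node names that appear in the Torque section further down (the names and the selma alias are assumptions; adapt them to your own naming, and give the nfsroot's etc the same file):
127.0.0.1     localhost
192.168.0.1   n01.thepcluster.org   n01 selma
192.168.0.2   n02.thepcluster.org   n02
192.168.0.3   n03.thepcluster.org   n03
192.168.0.4   n04.thepcluster.org   n04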
Setting up NIS client in nfsroot
Setting up NIS on the clients is a picnic. Copy hosts.allow, hosts.deny and yp.conf from /etc to the nfsroot's etc. Then chroot into the nfsroot by
sudo chroot /opt/nfsroot
and issue
apt-get install nis
and write the same domain name as before. Then add +::::::, +::: and +:::::::: at the end of /etc/passwd, /etc/group and /etc/shadow respectively.
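This is not part of the original walkthrough, but once a node has booted it is easy to sanity-check the NIS setup from that node (both tools come with the nis package):
ypwhich          # should print the master node
ypcat passwd     # should list the NIS users, e.g. monkeyboy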
Configuring TORQUE Resource Manager
There are a few queuing systems out there today, but we chose Torque, mostly for compatibility reasons. Download it from the Torque site by
wget http://www.clusterresources.com/downloads/torque/torque-2.1.8.tar.gz
and sit back and relax as it downloads. When it's finished you install it a la
tar -zxf torque-2.1.8.tar.gz
cd torque-2.1.8
./configure
make
sudo make install
Torque has to know which user should hold the queue and thus be the admin of the queue; in our case we chose the clusteradmin user. Run
./torque.setup clusteradmin
make packages
from within the torque src folder. This will set up the basic queue and create the packages we need to distribute and install on the clients. Let's install the packages.
cp torque-package-clients-linux-x86_64.sh /opt/nfsroot/tmp
cp torque-package-mom-linux-x86_64.sh /opt/nfsroot/tmp
sudo chroot /opt/nfsroot
cd /tmp
./torque-package-mom-linux-x86_64.sh --install
./torque-package-clients-linux-x86_64.sh --install
(Patrik and Simon: Michael forgot the --install flags, which we have now added to the instructions.) Edit /var/spool/torque/server_name and make sure that it contains the name of the first computing node, i.e., selma in our case. After that, exit the chroot by typing, you got it, exit. Now we're back at the real root, and we want to tell Torque which computing nodes to use. We accomplish this by filling /var/spool/torque/server_priv/nodes with the information
n01.thepcluster.org np=8
n02.thepcluster.org np=8
n03.thepcluster.org np=8
n04.thepcluster.org np=8
where np=8 tells the server that each node has 8 processors. Ok, it turns out that it's not enough to set the server_name file in the torque directory. We also have to tell the mom daemon where the server is. We do that by writing
$pbsserver n01    # note: hostname running pbs_server
$logevent 255     # bitmap of which events to log
in the /var/spool/torque/mom_priv/config file. Modify /opt/nfsroot/etc/rc.local to start the pbs_mom daemon by adding pbs_mom before the exit 0 command! An example of the rc.local file is shown below.
# mount -a
# rm -rf /tmp/torque
# cp -a /var/spool/torque_orig /tmp/torque
/usr/local/sbin/pbs_mom
exit 0
(Patrik and Simon: We commented out the first three lines as they only caused errors (and seem to be redundant). We have no idea why Michael put them there). Now fire up all the computing nodes. When they start, we try the torque server out to make sure that our queue is in order. So terminate and restart the server, check its configuration,
qterm
pbs_server
qstat -q
qmgr -c 'p s'
check the status of all computing nodes, and finally submit a job to the queue.
pbsnodes -a
echo "sleep 60" | qsub
qstat
If everything worked out all right, we are ready to fire up the scheduler by issuing
pbs_sched
and relax. To finish up, we make sure that the server and scheduler start automatically at boot time by adding the following section to /etc/rc.local.
/usr/local/sbin/pbs_server
/usr/local/sbin/pbs_sched
/usr/local/bin/qmgr -c 'set server query_other_jobs=true'
/usr/local/sbin/pbs_mom
That's it! We have a full-fledged computing cluster booting from the network by PXE, controlling users with NIS and scheduling jobs with Torque.
Bonus stuff
I followed this guide to make it possible for the nodes to reach the outside world, which is necessary for Matlab to work.
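In short, the usual recipe is to turn the master node into a NAT gateway for the cluster network. A minimal sketch, assuming eth1 is the outward-facing interface and eth0 the cluster side, as configured earlier:
sudo sysctl -w net.ipv4.ip_forward=1                         # enable IP forwarding on the master
sudo iptables -t nat -A POSTROUTING -o eth1 -j MASQUERADE    # NAT the cluster traffic out through eth1
To make this survive reboots, you can set net.ipv4.ip_forward=1 in /etc/sysctl.conf and add the iptables line to /etc/rc.local.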