I work for the Norwegian High-Throughput Sequencing Centre (NSC), but I am based at the Centre for Ecological and Evolutionary Synthesis (CEES). At CEES, numerous researchers run bioinformatic analyses, or other computation-heavy analyses, for their projects. With this post, I want to describe the infrastructure we use for calculations and storage, and the reasons why we set it up the way we did.
In general, researchers who need high-performance computing (HPC) infrastructure can purchase it and house it in or around the office, or use a cloud solution. Many, if not most, universities also offer a computer cluster for their researchers’ analysis needs. We chose a hybrid model between using the university’s HPC infrastructure and setting up our own. In other words, our infrastructure is a mix of self-owned resources and shared resources that we either apply for or rent.
We own several high-memory (1.5 TB RAM) servers with 64 CPUs and 64 TB local disk each. We also own 256 ‘grid’ CPUs for highly parallel calculations; these are in boxes of 16 CPUs with 64 GB RAM each. We rent project disk space, currently 60 TB, for which we pay a reasonable price. Through Notur, the national “Norwegian metacenter for computational science”, we have applied for and been allocated CPU hours on the University of Oslo supercomputer Abel, currently 1.3 million hours per half year. Finally, we have an allocation of up to 50 TB on disk, and 50 TB on tape, at the Norwegian Storage Infrastructure NorStore.
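To put these numbers in perspective, here is a small back-of-the-envelope calculation in shell arithmetic. The hardware figures are the ones quoted above; the only assumption added here is that a half year is counted as 182 days of 24 hours.

```shell
#!/bin/bash
# Figures quoted in the post
grid_cpus=256
cpus_per_box=16
ram_per_box_gb=64
cpu_hours_per_half_year=1300000

# Assumption (not from the post): a half year is 182 days of 24 hours
hours_per_half_year=$((182 * 24))

grid_boxes=$((grid_cpus / cpus_per_box))
ram_per_cpu_gb=$((ram_per_box_gb / cpus_per_box))
grid_ram_gb=$((grid_boxes * ram_per_box_gb))

# Integer division, so the result is rounded down
equivalent_cores=$((cpu_hours_per_half_year / hours_per_half_year))

echo "grid boxes:            ${grid_boxes}"
echo "RAM per grid CPU (GB): ${ram_per_cpu_gb}"
echo "total grid RAM (GB):   ${grid_ram_gb}"
echo "cores kept busy around the clock by the Abel allocation: ${equivalent_cores}"
```

Under that assumption, the Abel allocation corresponds to a little under 300 cores running non-stop, on top of the 256 grid CPUs we own.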
It is important to note that what we own is not located at our offices. Instead, these servers sit right next to the Abel servers, in the same rooms, sharing the same power and cooling setups, and even sharing the same disks. In other words, both our own servers and the Abel servers ‘see’ the same common disk areas!
Software is made available through the Environment ‘module’ package. For example, to have the environment set up for using the samtools package, users can simply write:
module load samtools
Currently, this will set up samtools 1.0 in the user’s environment. A huge benefit of the module system is that it makes it easy to access different versions of software. Users can for example choose to use an older version of samtools by typing:
module load samtools/0.1.19
This is great for reproducibility! Both the Abel supercomputer servers, and our self-owned servers have access to the same modules. There is a large number of programs available this way. In addition, we can add custom modules that only our user group has access to.
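As an illustration of how version pinning can support reproducibility, here is a minimal sketch of a helper that starts from a clean environment and loads exact module versions. The function name `load_pinned` is made up for this example; `module purge`, `module load` and `module list` are standard commands of the module system, but whether purging is appropriate on a given cluster is something to check with the local documentation.

```shell
#!/bin/bash
# load_pinned: hypothetical helper (not part of the module system).
# Clears the environment, loads the exact module versions given as
# arguments, and lists what ended up loaded.
load_pinned() {
  # 'module' is normally a shell function provided by the cluster setup
  if ! type module >/dev/null 2>&1; then
    echo "module command not available in this shell" >&2
    return 1
  fi
  module purge              # drop whatever was loaded before
  for m in "$@"; do
    module load "$m"        # e.g. samtools/0.1.19
  done
  module list               # record the loaded versions in the log
}
```

Calling `load_pinned samtools/0.1.19` at the top of an analysis script would then document, in the script itself, which version the results were produced with.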
Users access these resources in one of two ways:
- direct ssh access for the high-memory servers (users are asked to send an email to our internal mailing list to alert others of the intended use)
- through the SLURM job submission system for the shared computer resources
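To give an idea of what the second route looks like, here is a minimal sketch of a helper that prints a SLURM batch script for a small samtools job. The helper name `make_job_script`, the account name `myproject`, the file name `sample.bam` and the resource values are all placeholders for this example, not our actual settings. The `#SBATCH` options shown are standard SLURM options, but a particular cluster may require additional site-specific lines that are not shown here.

```shell
#!/bin/bash
# make_job_script: hypothetical helper that prints a SLURM batch script
# running 'samtools flagstat' on one BAM file.
#   $1 = BAM file, $2 = SLURM account (placeholder default: myproject)
make_job_script() {
  local bam="$1"
  local account="${2:-myproject}"
  cat <<EOF
#!/bin/bash
#SBATCH --job-name=flagstat
#SBATCH --account=${account}
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=4G
#SBATCH --cpus-per-task=1

# Pin the samtools version so the run can be repeated later
module purge
module load samtools/0.1.19

samtools flagstat ${bam} > ${bam%.bam}.flagstat.txt
EOF
}

# Print an example job script to the terminal
make_job_script sample.bam
```

Redirecting the output to a file, for example `make_job_script sample.bam > flagstat.slurm`, and then running `sbatch flagstat.slurm` would submit the job to the queue.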
We maintain a wiki with basic information on the use of our common infrastructure, and tips and tricks that are more general. This wiki is open to the world.
There are several benefits of this hybrid model:
- we do not have to worry about setting up and maintaining the infrastructure, power, cooling, servers breaking down and disks that stop spinning; all of this is taken care of by the Abel staff
- we do not need a local sysadmin (systems administrator) to operate our infrastructure. Instead, we rely on the very competent Abel staff, formally known as the Research Infrastructure Services Group. For example, if these people cannot install a piece of software, none of us researchers stands a chance
- we would never be able to get the same good prices by trying to negotiate on our own
But there are also some obvious disadvantages:
- we have less control of the setup; for instance, much software is tailored towards SGE as a job submission system, while Abel uses SLURM (this bothered me for quite a while, but I have heard very good things about SLURM lately, so I am increasingly happy with this choice)
- none of us has root access to our own servers (but I consider that a blessing 🙂 )
- as the Abel supercomputer is sizeable (>10000 cores, serving hundreds of users from the university and other Norwegian research groups), the Abel staff are a busy group of people. They receive many requests, so we sometimes have to wait to get something fixed or installed
- also, when the Abel cluster needs to go down for maintenance, this (usually) means our servers are down too
- there is still a considerable overhead for us (mainly one of my colleagues and me) when it comes to the day-to-day running of the infrastructure; for example, getting new users registered, communication with all users, maintenance of the wiki, etc.
- as with any shared resource (common kitchens, anyone?), some people are better at being considerate of their fellow researchers than others; disks tend to run full faster than we can ask people to remove their leftover temporary run data, just to name one thing
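For the last point, a simple report of old files can make the reminder emails easier to write. Below is a minimal sketch, assuming GNU `find`; the helper name `list_stale_files` and the 30-day default are made up for this example, and the function only lists files, it never deletes anything.

```shell
#!/bin/bash
# list_stale_files: hypothetical helper that prints regular files under
# a directory that have not been modified for more than N days.
#   $1 = directory to scan, $2 = age threshold in days (default: 30)
list_stale_files() {
  local dir="$1"
  local days="${2:-30}"
  # -mtime +N matches files last modified more than N days ago
  find "$dir" -type f -mtime +"$days" -print
}
```

Running it on a shared work area, with the directory path as the first argument, gives a list that can be sent to the owners of the files.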
In conclusion, the hybrid model works very well for us. We get a lot more for our money with far fewer worries. This allows the many ongoing projects at CEES and NSC to run smoothly most of the time, something we are all very grateful for. Having access to these resources also makes us an attractive collaboration partner!
If you are interested in knowing more about our HPC resources, have a look at our wiki.