ICALEPCS 2009
WED005
Implementing High Availability with COTS Components and Open-source Software
N. Neufeld, R. Schwemmer* (CERN)
High availability of IT services is essential for the successful operation of large experimental facilities such as the LHC experiments. In the past, high availability was often taken for granted or ensured by very expensive high-end hardware based on proprietary, single-vendor solutions. Today's IT infrastructure in HEP is usually a heterogeneous environment of cheap, off-the-shelf components, which typically have no intrinsic fault tolerance and thus cannot be considered reliable. Many services, in particular networked services such as the Domain Name Service, shared storage, and databases, must run on this unreliable hardware even though they are indispensable for the operation of today's control systems. We present our approach to this problem, which combines open-source tools, such as the Linux High Availability Project, with home-made tools to ensure high availability for the LHCb Experiment Control System. This system consists of over 200 servers and several hundred switches, and controls thousands of devices, ranging from custom-made devices connected to the LAN to the servers of the event-filter farm.