Fencing and STONITH

Contents

8.1. Classes of Fencing
8.2. Node Level Fencing
8.3. STONITH Configuration
8.4. Monitoring Fencing Devices
8.5. Special Fencing Devices
8.6. For More Information

Abstract

Fencing is a central concept in high availability (HA) computer clusters. When a cluster detects that one of its nodes is misbehaving, it needs to remove that node. This is called fencing and is commonly done with a STONITH resource. Fencing may be defined as a method to bring an HA cluster to a known state.

Every resource in a cluster has a state attached, for example: “resource r1 is started on node1”. In an HA cluster, such a state implies that “resource r1 is stopped on all nodes but node1”, because an HA cluster must make sure that every resource may be started on at most one node. Every node must report every change that happens to a resource. The cluster state is thus a collection of resource states and node states.

If, for whatever reason, the state of a node or resource cannot be established with certainty, fencing comes in. Even when the cluster does not know what is happening on a node, fencing can make sure that the node does not run any important resources.
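The idea of a known state can be illustrated with a small model. The following Python sketch is not how any cluster stack implements fencing; the names Cluster, NodeState, fence and start are hypothetical and chosen only for illustration. It shows a resource that cannot be started elsewhere while its last known node is in an uncertain state, and how fencing that node makes recovery safe.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeState(Enum):
    ONLINE = "online"    # node reports every change to its resources
    UNKNOWN = "unknown"  # node stopped reporting; its resources are uncertain
    FENCED = "fenced"    # node was fenced; it certainly runs nothing


@dataclass
class Cluster:
    """Hypothetical model: cluster state = node states + resource states."""
    nodes: dict[str, NodeState]
    # Last reported location of each started resource: resource -> node.
    running_on: dict[str, str] = field(default_factory=dict)

    def is_known(self) -> bool:
        """True when no node is in an uncertain state."""
        return all(s is not NodeState.UNKNOWN for s in self.nodes.values())

    def fence(self, node: str) -> list[str]:
        """Fence a node and return the resources that are now surely stopped."""
        self.nodes[node] = NodeState.FENCED
        freed = sorted(r for r, n in self.running_on.items() if n == node)
        for resource in freed:
            del self.running_on[resource]
        return freed

    def start(self, resource: str, node: str) -> None:
        """Start a resource, enforcing the at-most-one-node rule."""
        if self.nodes[node] is not NodeState.ONLINE:
            raise RuntimeError(f"{node} is not online")
        if resource in self.running_on:
            holder = self.running_on[resource]
            raise RuntimeError(f"{resource} may still be active on {holder}")
        self.running_on[resource] = node


cluster = Cluster(nodes={"node1": NodeState.ONLINE, "node2": NodeState.ONLINE})
cluster.start("r1", "node1")
cluster.nodes["node1"] = NodeState.UNKNOWN  # node1 stops responding
# At this point cluster.start("r1", "node2") would raise RuntimeError.
cluster.fence("node1")                      # state of node1 is known again
cluster.start("r1", "node2")                # recovery is now safe
```

In this model, the refusal in start is what protects the resource while the state is uncertain, and fence is the only operation that turns an unknown node state into a known one.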

Classes of Fencing

There are two classes of fencing: resource level and node level fencing. The latter is the primary subject of this chapter.

Resource Level Fencing

Using resource level fencing, the cluster can make sure that a node cannot access one or more resources. One typical example is a SAN, where a fencing operation changes rules on a SAN switch to deny access from the node.

Resource level fencing may be achieved using normal resources on which the resource you want to protect depends. Such a resource would simply refuse to start on the affected node, and therefore resources which depend on it will not run on that node either.
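The dependency mechanism described above can be sketched as follows. This is a simplified Python illustration, not cluster configuration syntax; GuardResource, deny and allowed_nodes are hypothetical names introduced here. The guard plays the role of the normal resource, and the protected resource may only run where every resource it depends on can start.

```python
class GuardResource:
    """Hypothetical normal resource that acts as a resource level fence."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.denied_nodes: set[str] = set()

    def deny(self, node: str) -> None:
        """Fence the node at resource level: the guard will not start there."""
        self.denied_nodes.add(node)

    def can_start_on(self, node: str) -> bool:
        return node not in self.denied_nodes


def allowed_nodes(nodes: list[str], depends_on: list[GuardResource]) -> list[str]:
    """Nodes where a protected resource may run: all its guards must start there."""
    return [n for n in nodes if all(g.can_start_on(n) for g in depends_on)]


san_guard = GuardResource("san-access")
san_guard.deny("node2")  # node2 must no longer reach the protected resource
print(allowed_nodes(["node1", "node2"], [san_guard]))  # prints ['node1']
```

Note that the node itself keeps running in this sketch; only the resources that depend on the guard are kept away from it, which is the difference from node level fencing.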

Node Level Fencing

Node level fencing makes sure that a node does not run any resources at all. This is usually done in a very simple, yet brutal way: the node is reset using a power switch. Such a drastic measure may be necessary because the node may not be responsive at all.