Abstract
The High Availability cluster stack's highest priority is protecting the integrity of data. This is achieved by preventing uncoordinated concurrent access to data storage. For example, ext3 file systems are only mounted once in the cluster, and OCFS2 volumes are not mounted unless coordination with other cluster nodes is available. In a well-functioning cluster, Pacemaker detects if resources are active beyond their concurrency limits and initiates recovery. Furthermore, its policy engine never exceeds these limitations.
However, network partitioning or software malfunction could cause scenarios where several coordinators are elected. If such so-called split-brain scenarios were allowed to unfold, data corruption might occur. Hence, several layers of protection have been added to the cluster stack to mitigate this.
The primary component contributing to this goal is IO fencing/STONITH since it ensures that all other access prior to storage activation is terminated. Other mechanisms are cLVM2 exclusive activation or OCFS2 file locking support to protect your system against administrative or application faults. Combined appropriately for your setup, these can reliably prevent split-brain scenarios from causing harm.
This chapter describes an IO fencing mechanism that leverages the storage itself, followed by the description of an additional layer of protection to ensure exclusive storage access. These two mechanisms can be combined for higher levels of protection.
You can reliably avoid split-brain scenarios by using Split Brain
Detector (SBD), watchdog support and the
external/sbd STONITH agent.
In an environment where all nodes have access to shared storage, a small partition (1 MB) is formatted for use with SBD. After the respective daemon is configured, it is brought online on each node before the rest of the cluster stack is started. It is terminated after all other cluster components have been shut down, thus ensuring that cluster resources are never activated without SBD supervision.
The daemon automatically allocates one of the message slots on the partition to itself, and constantly monitors it for messages addressed to itself. Upon receipt of a message, the daemon immediately complies with the request, such as initiating a power-off or reboot cycle for fencing.
The daemon constantly monitors connectivity to the storage device, and terminates itself if the partition becomes unreachable. This guarantees that it is never cut off from fencing messages. If the cluster data resides on the same logical unit in a different partition, this is not an additional point of failure: the workload will terminate anyway if storage connectivity has been lost.
Increased protection is offered through watchdog
support. Modern systems support a hardware watchdog
that has to be updated by the software client, otherwise the hardware
will enforce a system restart. This protects against failures of the SBD
process itself, such as dying, or becoming stuck on an IO error.
The following steps are necessary to set up storage-based protection:
All of the following procedures must be executed as root. Before
you start, make sure the following requirements are met:
![]() | Requirements |
|---|---|
| |
It is recommended to create a 1MB partition at the start of the device.
If your SBD device resides on a multipath group, you need to adjust the
timeouts SBD uses, as MPIO's path down detection can cause some
latency. After the msgwait timeout, the message is
assumed to have been delivered to the node. For multipath, this should
be the time required for MPIO to detect a path failure and switch to
the next path. You may have to test this in your environment. The node
will commit suicide if it has not updated the watchdog timer fast
enough.
The watchdog timeout must be shorter than the
msgwait timeout; half the value is a good
estimate.
In the following, this SBD partition is referred to by
/dev/SBD. Replace it
with your actual pathname, for example: /dev/sdc1.
![]() | Overwriting Existing Data |
|---|---|
Make sure the device you want to use for SBD does not hold any data. The sbd command will overwrite the device without further requests for confirmation. | |
Initialize the SBD device with the following command:
sbd -d /dev/SBD create
This will write a header to the device, and create slots for up to 255 nodes sharing this device with default timings.
If your SBD device resides on a multipath group, adjust the timeouts SBD uses. This can be specified when the SBD device is initialized (all timeouts are given in seconds):
/usr/sbin/sbd -d /dev/SBD -4 $msgwait -1 $watchdogtimeout create
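The timing rule described earlier (watchdog timeout at about half of msgwait) can be sketched as a small shell helper. It only prints the initialization command for review and does not touch any device. The function name `sbd_create_cmd` and the example msgwait of 60 seconds are hypothetical; the `-4` and `-1` options are the ones used in the command above.

```shell
#!/bin/sh
# Sketch: derive the watchdog timeout from msgwait and print the create command.
# Nothing is written to disk; the command is only echoed for review.
sbd_create_cmd() {
    device=$1
    msgwait=$2
    # Half of msgwait, as suggested in the text (integer division).
    watchdog=$((msgwait / 2))
    if [ "$watchdog" -lt 1 ]; then
        echo "msgwait too small: $msgwait" >&2
        return 1
    fi
    echo "sbd -d $device -4 $msgwait -1 $watchdog create"
}

# Hypothetical example: MPIO needs up to 60 seconds to fail over.
sbd_create_cmd /dev/SBD 60
```

Measure the actual path failover time in your environment before choosing the msgwait value.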
With the following command, check what has been written to the device:
sbd -d /dev/SBD dump
Header version     : 2
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 5
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 10
As you can see, the timeouts are also stored in the header, to ensure that all participating nodes agree on them.
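Because the timeouts are stored in the header, they can be checked by a script. The following is a minimal sketch that parses dump-style text and confirms that the watchdog timeout is shorter than msgwait. The sample text copies the values shown above; the helper name `dump_timeout` is hypothetical, and on a real node the input would come from `sbd -d /dev/SBD dump`.

```shell
#!/bin/sh
# Sketch: extract a timeout from "sbd dump"-style text and compare two of them.
# The sample copies the values shown in this chapter.
sample_dump='Header version     : 2
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 5
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 10'

# Print the value of "Timeout (<name>)" from dump text read on stdin.
dump_timeout() {
    awk -F: -v key="Timeout ($1)" '
        { label = $1; sub(/[ \t]+$/, "", label) }
        label == key { gsub(/[ \t]/, "", $2); print $2 }'
}

watchdog=$(printf '%s\n' "$sample_dump" | dump_timeout watchdog)
msgwait=$(printf '%s\n' "$sample_dump" | dump_timeout msgwait)

if [ "$watchdog" -lt "$msgwait" ]; then
    echo "ok: watchdog ($watchdog) < msgwait ($msgwait)"
else
    echo "warning: watchdog ($watchdog) is not shorter than msgwait ($msgwait)"
fi
```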
It is highly recommended to set up your Linux system to use a watchdog. This involves loading the proper watchdog driver on system boot.
On HP hardware, this is the hpwdt module.
For systems with an Intel TCO, iTCO_wdt can be
used. softdog is the most generic driver, but it
is recommended to use a driver with actual hardware integration.
See drivers/watchdog in the kernel package for a
list of choices.
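After loading a driver, a quick check can confirm that a watchdog device node exists. The sketch below assumes the conventional Linux device path `/dev/watchdog`, which is not stated in this chapter; the function name `watchdog_present` is hypothetical.

```shell
#!/bin/sh
# Sketch: check that a watchdog character device is present.
# ASSUMPTION: /dev/watchdog is the device node; pass another path to override.
watchdog_present() {
    [ -c "${1:-/dev/watchdog}" ]
}

if watchdog_present; then
    echo "watchdog device found"
else
    echo "no watchdog device; load a driver such as softdog with modprobe"
fi
```

The check only tests for the device node. It does not open the device, because opening a watchdog device can arm the timer.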
The SBD daemon is a critical piece of the cluster stack. It has to be running whenever the cluster stack is running, even when part of the stack has crashed, so that the node can still be fenced.
To make the OpenAIS init script start and stop SBD, add the following to /etc/sysconfig/sbd:
SBD_DEVICE="/dev/SBD"
# The next line enables the watchdog support:
SBD_OPTS="-W"
If the SBD device is not accessible, the daemon will fail to start and inhibit OpenAIS startup.
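Since a wrong setting in the sysconfig file keeps the cluster stack from starting, it can help to verify the file beforehand. The following sketch checks that `SBD_DEVICE` is set and that `SBD_OPTS` contains `-W`. The function name `check_sbd_sysconfig` is hypothetical, and the demonstration runs against a temporary file rather than the real `/etc/sysconfig/sbd`.

```shell
#!/bin/sh
# Sketch: check an sbd sysconfig file for the two settings described above.
check_sbd_sysconfig() {
    file=$1
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
    # Source the file in a subshell so its variables do not leak into the caller.
    (
        SBD_DEVICE= SBD_OPTS=
        . "$file"
        [ -n "$SBD_DEVICE" ] || { echo "SBD_DEVICE is not set" >&2; exit 1; }
        case " $SBD_OPTS " in
            *" -W "*) ;;
            *) echo "watchdog support (-W) is not enabled" >&2; exit 1 ;;
        esac
        echo "SBD_DEVICE=$SBD_DEVICE, watchdog enabled"
    )
}

# Demonstration with a temporary file that mirrors the example above.
tmp=$(mktemp)
printf '%s\n' 'SBD_DEVICE="/dev/SBD"' 'SBD_OPTS="-W"' > "$tmp"
check_sbd_sysconfig "$tmp"
rm -f "$tmp"
```

On a cluster node, the call would be `check_sbd_sysconfig /etc/sysconfig/sbd`.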
![]() | |
If the SBD device becomes inaccessible from a node, this could cause the node to enter an infinite reboot cycle. That is technically correct, but depending on your administrative policies, might be considered a nuisance. You may wish to not automatically start up OpenAIS on boot in such cases. | |
Before proceeding, ensure that SBD has started on all nodes by
executing rcopenais restart.
The following command will dump the node slots and their current messages from the SBD device:
sbd -d /dev/SBD list
You should now see all cluster nodes that have ever been started with
SBD listed here. The message slot of each node should show
clear.
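The check for clear message slots can be scripted. The sketch below is based on an assumption about the output layout: each line is taken to contain the whitespace-separated fields slot number, node name, and message. This layout is not shown in this chapter, so compare it with the output of your sbd version first. The function name `non_clear_slots` and the sample lines are made up for illustration.

```shell
#!/bin/sh
# Sketch: report slots whose message is not "clear".
# ASSUMPTION: each input line has the fields <slot> <node> <message>.
non_clear_slots() {
    awk 'NF >= 3 && $3 != "clear" { print $2 ": " $3 }'
}

# Made-up sample input; on a real node: sbd -d /dev/SBD list | non_clear_slots
printf '0 nodea clear\n1 nodeb test\n' | non_clear_slots
```

No output means that every listed slot is clear.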
Try sending a test message to one of the nodes:
sbd -d /dev/SBD message nodea test
The node will acknowledge the receipt of the message in the system logs:
Aug 29 14:10:00 nodea sbd: [13412]: info: Received command test from nodeb
This confirms that SBD is indeed up and running on the node, and that it is ready to receive messages.
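The acknowledgement can also be checked by script. The following sketch searches log text for the receipt line shown above. The function name `sbd_test_acked` is hypothetical, and the location of the system log on your nodes (for example `/var/log/messages`) is an assumption.

```shell
#!/bin/sh
# Sketch: look for the SBD acknowledgement of a test message in log text.
# Reads log lines on stdin; $1 is the node expected to have logged the receipt.
sbd_test_acked() {
    grep -q " $1 sbd: .*Received command test from "
}

# Sample line copied from this chapter.
sample='Aug 29 14:10:00 nodea sbd: [13412]: info: Received command test from nodeb'
if printf '%s\n' "$sample" | sbd_test_acked nodea; then
    echo "nodea acknowledged the test message"
fi
```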
To complete the SBD setup, it is necessary to activate SBD as a STONITH/fencing mechanism in the CIB as follows:
crm configure
crm(live)configure# property stonith-enabled="true"
crm(live)configure# property stonith-timeout="30s"
crm(live)configure# primitive stonith:external/sbd params sbd_device="/dev/SBD"
crm(live)configure# commit
crm(live)configure# quit
Since node slots are allocated automatically, no manual hostlist needs to be defined.
Disable any other fencing devices you might have configured before, since the SBD mechanism is used for this function now.
Once the resource has started, your cluster is configured for shared-storage fencing and will use this method whenever a node needs to be fenced.
This section introduces sfex, an additional low-level
mechanism to lock access to shared storage exclusively to one node. Note
that sfex does not replace STONITH. Since sfex requires shared storage,
it is recommended that the external/sbd fencing
mechanism described above is used on another partition of the storage.
By design, sfex cannot be used in conjunction with workloads that require concurrency (such as OCFS2), but serves as a layer of protection for classic fail-over style workloads. This is similar to a SCSI-2 reservation in effect, but more general.
In a shared storage environment, a small partition of the storage is set aside for storing one or more locks.
Before acquiring protected resources, the node must first acquire the protecting lock. The ordering is enforced by Pacemaker, and the sfex component ensures that even if Pacemaker were subject to a split-brain situation, the lock will never be granted more than once.
These locks must also be refreshed periodically, so that a node's death does not permanently block the lock and other nodes can proceed.
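The idea of a lock that must be refreshed can be illustrated with a small conceptual model. This is not sfex's implementation or on-disk format; it only shows why a stale lock can be taken over once its holder stops refreshing it. The state lives in shell variables, times are plain integers in seconds, and the name `try_acquire` is hypothetical. The timeout of 70 matches the lock_timeout value used later in this chapter.

```shell
#!/bin/sh
# Conceptual model of a lease-style lock; NOT sfex's on-disk format.
lock_owner=""
lock_refreshed=0
lock_timeout=70

# try_acquire <node> <now>: succeed if the lock is free, already ours, or stale.
try_acquire() {
    node=$1 now=$2
    if [ -z "$lock_owner" ] || [ "$lock_owner" = "$node" ] ||
       [ $((now - lock_refreshed)) -gt "$lock_timeout" ]; then
        lock_owner=$node
        lock_refreshed=$now
        return 0
    fi
    return 1
}

try_acquire nodea 0   && echo "t=0: nodea holds the lock"
try_acquire nodeb 30  || echo "t=30: nodeb refused, lease still fresh"
try_acquire nodea 60  && echo "t=60: nodea refreshed the lease"
try_acquire nodeb 140 && echo "t=140: nodeb takes over, lease expired"
```

The model ignores the race between two nodes that try to take a free lock at the same moment. A real implementation has to handle that case on the shared device.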
In the following, learn how to create a shared partition for use with sfex and how to configure a resource for the sfex lock in the CIB. A single sfex partition can hold any number of locks (the default is one) and needs 1 KB of storage space allocated per lock.
![]() | Requirements |
|---|---|
| |
Procedure 15.1. Creating an sfex Partition
Create a shared partition for use with sfex. Note the name of this
partition and use it as a substitute for
/dev/sfex below.
Create the sfex meta data with the following command:
sfex_init -i 1 /dev/sfex
Verify that the meta data has been created correctly:
sfex_stats -i 1 /dev/sfex ; echo $?
This should return 2, since the lock is not
currently held.
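In a script, the exit status can be turned into a readable message. The sketch below only interprets status 2, the one value documented above, and reports every other value unchanged. The function name `describe_sfex_status` is hypothetical.

```shell
#!/bin/sh
# Sketch: turn the sfex_stats exit status into a readable message.
# Only status 2 ("lock not currently held") is taken from the text above;
# every other value is reported as-is rather than interpreted.
describe_sfex_status() {
    case $1 in
        2) echo "lock is not currently held" ;;
        *) echo "status $1: see the sfex_stats documentation" ;;
    esac
}

describe_sfex_status 2
```

On a cluster node, this could be used as `sfex_stats -i 1 /dev/sfex; describe_sfex_status $?`.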
Procedure 15.2. Configuring a Resource for the sfex Lock
The sfex lock is represented via a resource in the CIB, configured as follows:
primitive sfex_1 ocf:heartbeat:sfex \
      params device="/dev/sfex" index="1" collision_timeout="1" \
      lock_timeout="70" monitor_interval="10" \
      op monitor interval="10s" timeout="30s" on_fail="fence"
To protect resources via an sfex lock, create mandatory ordering and
placement constraints between the protected resources and the sfex resource. For example, if
the resource to be protected has the id
filesystem1:
# order order-sfex-1 inf: sfex_1 filesystem1
# colocation colo-sfex-1 inf: filesystem1 sfex_1
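When several resources depend on the same lock, the constraint pairs can be generated instead of typed by hand. The following sketch prints one order and one colocation line per protected resource, following the syntax shown above. The function name `sfex_constraints`, the constraint id pattern, and the second resource `filesystem2` are hypothetical.

```shell
#!/bin/sh
# Sketch: print order and colocation constraints for several protected resources.
# Constraint ids (order-<lock>-<rsc>, colo-<lock>-<rsc>) are hypothetical names.
sfex_constraints() {
    lock=$1
    shift
    for rsc in "$@"; do
        # The lock must start first, and the resource must run where the lock is.
        echo "order order-$lock-$rsc inf: $lock $rsc"
        echo "colocation colo-$lock-$rsc inf: $rsc $lock"
    done
}

sfex_constraints sfex_1 filesystem1 filesystem2
```

The printed lines are meant for review before you enter them in the crm shell.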
If using group syntax, add the sfex resource as the first resource to the group:
# group LAMP sfex_1 filesystem1 apache ipaddr