Storage Protection

Contents

15.1. Storage-based Fencing
15.2. Ensuring Exclusive Storage Activation

Abstract

The highest priority of the High Availability cluster stack is protecting the integrity of data. This is achieved by preventing uncoordinated concurrent access to data storage. For example, ext3 file systems are mounted only once in the cluster, and OCFS2 volumes are not mounted unless coordination with other cluster nodes is available. In a well-functioning cluster, Pacemaker detects whether resources are active beyond their concurrency limits and initiates recovery. Furthermore, its policy engine never exceeds these limits.

However, network partitioning or software malfunction could cause scenarios where several coordinators are elected. If such a split-brain scenario were allowed to unfold, data corruption might occur. Hence, several layers of protection have been added to the cluster stack to mitigate this.

The primary component contributing to this goal is IO fencing/STONITH, since it ensures that all other access is terminated before storage is activated. Other mechanisms, such as cLVM2 exclusive activation or OCFS2 file locking support, protect your system against administrative or application faults. Combined appropriately for your setup, these can reliably prevent split-brain scenarios from causing harm.

This chapter describes an IO fencing mechanism that leverages the storage itself, followed by the description of an additional layer of protection to ensure exclusive storage access. These two mechanisms can be combined for higher levels of protection.

Storage-based Fencing

You can reliably avoid split-brain scenarios by using Split Brain Detector (SBD), watchdog support and the external/sbd STONITH agent.

Overview

In an environment where all nodes have access to shared storage, a small partition (1 MB) is formatted for use with SBD. After the SBD daemon is configured, it is brought online on each node before the rest of the cluster stack is started. It is terminated only after all other cluster components have been shut down, thus ensuring that cluster resources are never activated without SBD supervision.

The daemon automatically allocates one of the message slots on the partition to itself, and constantly monitors it for messages addressed to itself. Upon receipt of a message, the daemon immediately complies with the request, such as initiating a power-off or reboot cycle for fencing.

The daemon constantly monitors connectivity to the storage device, and terminates itself if the partition becomes unreachable. This guarantees that a running daemon is never cut off from fencing messages. If the cluster data resides on the same logical unit in a different partition, this is not an additional point of failure: the workload will terminate anyway if storage connectivity has been lost.

Increased protection is offered through watchdog support. Modern systems support a hardware watchdog that must be updated regularly by a software client; otherwise, the hardware enforces a system restart. This protects against failures of the SBD process itself, such as the process dying or becoming stuck on an IO error.

Setting Up Storage-based Protection

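Setting up storage-based protection involves the following steps, each described in its own section below: creating the SBD partition, setting up the watchdog, starting the SBD daemon, testing SBD, and configuring the fencing resource.

All of the following procedures must be executed as root. Before you start, make sure the following requirements are met: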

[Important]Requirements
  • The environment must have shared storage reachable by all nodes.

  • The shared storage segment must not make use of host-based RAID, cLVM2, or DRBD.

  • Using storage-based RAID and multipathing, however, is recommended for increased reliability.

Creating the SBD Partition

It is recommended to create a 1 MB partition at the start of the device. If your SBD device resides on a multipath group, you need to adjust the timeouts SBD uses, because MPIO's path-down detection can cause some latency. After the msgwait timeout has elapsed, a message is assumed to have been delivered to the node. For multipath, this should be the time required for MPIO to detect a path failure and switch to the next path. You may have to test this in your environment. The node will reset itself (self-fence) if it has not updated the watchdog timer in time. The watchdog timeout must be shorter than the msgwait timeout; half the value is a good estimate.

In the following, this SBD partition is referred to as /dev/SBD. Replace it with your actual path name, for example: /dev/sdc1.
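
The relationship between the two timeouts can be sketched in a few lines of shell. This is only an illustration: the function name sbd_create_cmd is hypothetical, the msgwait value of 20 seconds is an arbitrary example rather than a recommendation, and the command is printed for review instead of being executed.

```shell
#!/bin/sh
# Hypothetical helper: derive the watchdog timeout as half of msgwait
# (the estimate given above) and print the matching create command.
sbd_create_cmd() {
    device=$1
    msgwait=$2
    watchdog=$((msgwait / 2))
    # Printed only; review it before running, since create overwrites the device.
    echo "sbd -d $device -4 $msgwait -1 $watchdog create"
}

sbd_create_cmd /dev/SBD 20
```

Printing the command rather than running it keeps the sketch harmless on a machine where the device path has not yet been verified.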

[Important]Overwriting Existing Data

Make sure the device you want to use for SBD does not hold any data. The sbd command will overwrite the device without further requests for confirmation.

  1. Initialize the SBD device with the following command:

    sbd -d /dev/SBD create

    This will write a header to the device, and create slots for up to 255 nodes sharing this device with default timings.

  2. If your SBD device resides on a multipath group, adjust the timeouts SBD uses. This can be specified when the SBD device is initialized (all timeouts are given in seconds):

    /usr/sbin/sbd -d /dev/SBD -4 $msgwait -1 $watchdogtimeout create

  3. With the following command, check what has been written to the device:

    sbd -d /dev/SBD dump 
    Header version     : 2
    Number of slots    : 255
    Sector size        : 512
    Timeout (watchdog) : 5
    Timeout (allocate) : 2
    Timeout (loop)     : 1
    Timeout (msgwait)  : 10

As you can see, the timeouts are also stored in the header, to ensure that all participating nodes agree on them.
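
Because the timeouts are stored in the header, the rule that the watchdog timeout must be shorter than msgwait can be checked against the dump output. The following is a minimal sketch; the function name check_sbd_timeouts is hypothetical, and it assumes the dump output uses the "Timeout (watchdog)" and "Timeout (msgwait)" labels shown above.

```shell
#!/bin/sh
# Hypothetical check: read "sbd ... dump" output on stdin and confirm that
# the watchdog timeout is shorter than the msgwait timeout.
check_sbd_timeouts() {
    awk -F: '
        /Timeout \(watchdog\)/ { wd = $2 + 0 }
        /Timeout \(msgwait\)/  { mw = $2 + 0 }
        END {
            if (wd > 0 && wd < mw) { print "ok: watchdog=" wd " msgwait=" mw; exit 0 }
            print "check failed: watchdog=" wd " msgwait=" mw
            exit 1
        }'
}

# Usage (as root): sbd -d /dev/SBD dump | check_sbd_timeouts
```

The function exits with a non-zero status when the check fails, so it can be used in a script that prepares several nodes.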

Setting Up the Software Watchdog

It is highly recommended to set up your Linux system to use a watchdog. This involves loading the proper watchdog driver on system boot.

  • On HP hardware, this is the hpwdt module.

  • For systems with an Intel TCO, iTCO_wdt can be used. softdog is the most generic driver, but it is recommended to use a driver with actual hardware integration.

See drivers/watchdog in the kernel package for a list of choices.
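
To see whether one of the drivers named above is currently loaded, the output of lsmod can be filtered. This is a rough sketch with a hypothetical function name, loaded_watchdog_driver; it only knows the three modules mentioned in this section, and a driver compiled into the kernel would not appear in lsmod at all.

```shell
#!/bin/sh
# Hypothetical helper: read lsmod output on stdin and print the first
# loaded watchdog driver among those named in this section.
loaded_watchdog_driver() {
    awk '
        $1 == "hpwdt" || $1 == "iTCO_wdt" || $1 == "softdog" { print $1; found = 1; exit }
        END { if (!found) exit 1 }'
}

# Usage: lsmod | loaded_watchdog_driver || echo "no known watchdog module loaded"
```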

Starting the SBD Daemon

The SBD daemon is a critical piece of the cluster stack. It has to be running whenever the cluster stack is running, even when part of the stack has crashed, so that the node can still be fenced.

  1. To make the OpenAIS init script start and stop SBD, add the following to /etc/sysconfig/sbd:

    SBD_DEVICE="/dev/SBD"
    # The next line enables the watchdog support:
    SBD_OPTS="-W"

    If the SBD device is not accessible, the daemon will fail to start and inhibit OpenAIS startup.

    [Note]

    If the SBD device becomes inaccessible from a node, this could cause the node to enter an infinite reboot cycle. That is technically correct, but depending on your administrative policies, might be considered a nuisance. You may wish to not automatically start up OpenAIS on boot in such cases.

  2. Before proceeding, ensure that SBD has started on all nodes by executing rcopenais restart.

Testing SBD

  1. The following command will dump the node slots and their current messages from the SBD device:

    sbd -d /dev/SBD list

    You should now see all cluster nodes that have ever been started with SBD listed here. The message slot of each node should show clear.

  2. Try sending a test message to one of the nodes:

    sbd -d /dev/SBD message nodea test

  3. The node will acknowledge the receipt of the message in the system logs:

    Aug 29 14:10:00 nodea sbd: [13412]: info: Received command test from nodeb

    This confirms that SBD is indeed up and running on the node, and that it is ready to receive messages.
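
The check that every slot shows clear can be scripted. The sketch below is an assumption-laden illustration: the function name nodes_with_pending_messages is hypothetical, and the column layout of the list output (slot number, node name, message, separated by whitespace) is assumed rather than taken from this chapter, so compare it with the actual output on your system first.

```shell
#!/bin/sh
# Hypothetical helper: read "sbd ... list" output on stdin and print every
# node whose slot does not show "clear".
# Assumed line layout (not confirmed here): <slot> <node> <message> ...
nodes_with_pending_messages() {
    awk 'NF >= 3 && $3 != "clear" { print $2 ": " $3 }'
}

# Usage (as root): sbd -d /dev/SBD list | nodes_with_pending_messages
```

No output means that no node has a pending message.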

Configuring the Fencing Resource

  1. To complete the SBD setup, it is necessary to activate SBD as a STONITH/fencing mechanism in the CIB as follows:

    crm configure
    crm(live)configure# property stonith-enabled="true"
    crm(live)configure# property stonith-timeout="30s"
    crm(live)configure# primitive stonith_sbd stonith:external/sbd params sbd_device="/dev/SBD"
    crm(live)configure# commit
    crm(live)configure# quit

    Since node slots are allocated automatically, no manual hostlist needs to be defined.

  2. Disable any other fencing devices you might have configured before, since the SBD mechanism is used for this function now.

Once the resource has started, your cluster is configured for shared-storage fencing and will use this method whenever a node needs to be fenced.

Ensuring Exclusive Storage Activation

This section introduces sfex, an additional low-level mechanism to lock access to shared storage exclusively to one node. Note that sfex does not replace STONITH. Since sfex requires shared storage, it is recommended that the external/sbd fencing mechanism described above is used on another partition of the storage.

By design, sfex cannot be used in conjunction with workloads that require concurrency (such as OCFS2), but serves as a layer of protection for classic fail-over style workloads. This is similar to a SCSI-2 reservation in effect, but more general.

Overview

In a shared storage environment, a small partition of the storage is set aside for storing one or more locks.

Before acquiring protected resources, the node must first acquire the protecting lock. The ordering is enforced by Pacemaker, and the sfex component ensures that even if Pacemaker were subject to a split-brain situation, the lock will never be granted more than once.

These locks must also be refreshed periodically, so that a node's death does not permanently block the lock and other nodes can proceed.
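
The refresh rule can be illustrated with a small conceptual model. It is not the sfex implementation: the function name lock_is_available and its arguments are hypothetical, and it only shows why a lock that is no longer refreshed eventually becomes available to other nodes.

```shell
#!/bin/sh
# Conceptual model only, not the sfex implementation: a lock may be taken
# if nobody holds it, or if the holder has not refreshed it within
# lock_timeout seconds.
# Arguments: holder ("" if free), last refresh time, current time,
# lock_timeout (all times in seconds).
lock_is_available() {
    holder=$1 last_refresh=$2 now=$3 lock_timeout=$4
    [ -z "$holder" ] && return 0
    [ $((now - last_refresh)) -gt "$lock_timeout" ]
}

# nodea refreshed the lock 10 seconds ago: still held.
lock_is_available nodea 100 110 70 || echo "held by nodea"
# nodea died; its last refresh is 80 seconds old: the lock has expired.
lock_is_available nodea 100 180 70 && echo "expired, may be acquired"
```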

Setup

In the following, learn how to create a shared partition for use with sfex and how to configure a resource for the sfex lock in the CIB. A single sfex partition can hold any number of locks (the default is one) and needs 1 KB of storage space per lock.

[Important]Requirements
  • The shared partition for sfex should be on the same logical unit as the data you wish to protect.

  • The shared sfex partition must not make use of host-based RAID or DRBD.

  • Using a cLVM2 logical volume is possible.

Procedure 15.1. Creating an sfex Partition

  1. Create a shared partition for use with sfex. Note the name of this partition and use it as a substitute for /dev/sfex below.

  2. Create the sfex metadata with the following command:

    sfex_init -i 1 /dev/sfex

  3. Verify that the metadata has been created correctly:

    sfex_stats -i 1 /dev/sfex ; echo $?

    This should return 2, since the lock is not currently held.
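
When the verification runs inside a script, the exit status can be turned into a readable message. The helper below is a hypothetical sketch: only the value 2 and its meaning come from the step above, and every other value is reported as-is without interpretation.

```shell
#!/bin/sh
# Hypothetical helper: translate the exit status of sfex_stats into a
# message. Only the value 2 ("lock not currently held") comes from the
# procedure above; every other value is simply reported.
describe_sfex_status() {
    case $1 in
        2) echo "lock is not currently held" ;;
        *) echo "unexpected status $1 for a freshly initialized lock" ;;
    esac
}

# Usage (as root):
#   sfex_stats -i 1 /dev/sfex; describe_sfex_status $?
```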

Procedure 15.2. Configuring a Resource for the sfex Lock

  1. The sfex lock is represented via a resource in the CIB, configured as follows:

    primitive sfex_1 ocf:heartbeat:sfex \
          params device="/dev/sfex" index="1" collision_timeout="1" \
          lock_timeout="70" monitor_interval="10" \
          op monitor interval="10s" timeout="30s" on-fail="fence"

  2. To protect resources via an sfex lock, create mandatory ordering and placement constraints between the protected resources and the sfex resource. If the resource to be protected has the id filesystem1:

    order order-sfex-1 inf: sfex_1 filesystem1
    colocation colo-sfex-1 inf: filesystem1 sfex_1

  3. If using group syntax, add the sfex resource as the first resource to the group:

    group LAMP sfex_1 filesystem1 apache ipaddr