Chapter 13. Multi-Site Clusters (Geo Clusters)

Contents

13.1. Challenges for Multi-Site Clusters
13.2. Conceptual Overview
13.3. Requirements
13.4. Basic Setup
13.5. Managing Multi-Site Clusters
13.6. Troubleshooting

Abstract

Apart from local clusters and metro area clusters, SUSE® Linux Enterprise High Availability Extension 11 SP3 also supports multi-site clusters (geo clusters). That means you can have multiple, geographically dispersed sites with a local cluster each. Failover between these clusters is coordinated by a higher level entity, the so-called booth. Support for multi-site clusters is available as a separate option to SUSE Linux Enterprise High Availability Extension.

13.1. Challenges for Multi-Site Clusters

Typically, multi-site environments are too far apart to support synchronous communication between the sites and synchronous data replication. That leads to the following challenges:

  • How to make sure that a cluster site is up and running?

  • How to make sure that resources are only started once?

  • How to make sure that quorum can be reached between the different sites and a split brain scenario can be avoided?

  • How to manage failover between the sites?

  • How to deal with high latency in case of resources that need to be stopped?

In the following sections, learn how to meet these challenges with SUSE Linux Enterprise High Availability Extension.

13.2. Conceptual Overview

Multi-site clusters based on SUSE Linux Enterprise High Availability Extension can be considered as overlay clusters where each cluster site corresponds to a cluster node in a traditional cluster. The overlay cluster is managed by the booth mechanism. It guarantees that the cluster resources will be highly available across different cluster sites. This is achieved by using so-called tickets, which are treated as a failover domain between cluster sites in case a site goes down.

The following list explains the individual components and mechanisms that were introduced for multi-site clusters in more detail.

Components and Concepts

Ticket

A ticket grants the right to run certain resources on a specific cluster site. A ticket can only be owned by one site at a time. Initially, none of the sites has a ticket—each ticket must be granted once by the cluster administrator. After that, tickets are managed by the booth for automatic failover of resources. But administrators may also intervene and grant or revoke tickets manually.

Resources can be bound to a certain ticket by dependencies. The respective resources are started only if the defined ticket is available at a site. Conversely, if the ticket is removed, the resources depending on that ticket are automatically stopped.

The presence or absence of tickets for a site is stored in the CIB as a cluster status. With regard to a certain ticket, there are only two states for a site: true (the site has the ticket) or false (the site does not have the ticket). The absence of a certain ticket (during the initial state of the multi-site cluster) is not treated differently from the situation after the ticket has been revoked: both are reflected by the value false.

A ticket within an overlay cluster is similar to a resource in a traditional cluster. But in contrast to traditional clusters, tickets are the only type of resource in an overlay cluster. They are primitive resources that need to be neither configured nor cloned.
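
The ticket semantics described above can be illustrated with a small model. This is only a sketch for reasoning about the concept, not how Pacemaker or the booth store tickets: the class Site and the functions grant and revoke are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """Hypothetical model of one cluster site and its ticket states."""
    name: str
    # Ticket name -> True (site has the ticket) or False (it does not).
    tickets: dict = field(default_factory=dict)
    # Resource name -> name of the ticket the resource depends on.
    dependencies: dict = field(default_factory=dict)

    def has_ticket(self, ticket: str) -> bool:
        # A ticket that was never granted reads the same as a revoked one.
        return self.tickets.get(ticket, False)

    def runnable_resources(self) -> list:
        # Only resources whose ticket is present may run on this site.
        return sorted(
            rsc for rsc, ticket in self.dependencies.items()
            if self.has_ticket(ticket)
        )


def grant(sites: list, ticket: str, target: str) -> None:
    """Grant a ticket to one site; refuse if another site still owns it."""
    for site in sites:
        if site.name != target and site.has_ticket(ticket):
            raise ValueError(f"{ticket} is still owned by {site.name}")
    for site in sites:
        if site.name == target:
            site.tickets[ticket] = True


def revoke(site: Site, ticket: str) -> None:
    """Revoke a ticket; dependent resources are no longer runnable."""
    site.tickets[ticket] = False
```

In this model, a site that has never been granted ticketA and a site from which ticketA was revoked both report False, which mirrors the two-state behavior described above.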

Booth

The booth is the instance that manages ticket distribution and thus the failover process between the sites of a multi-site cluster. Each of the participating clusters and arbitrators runs a service, the boothd. It connects to the booth daemons running at the other sites and exchanges connectivity details. Once a ticket is granted to a site, the booth mechanism manages the ticket automatically: if the site that holds the ticket is out of service, the booth daemons vote on which of the other sites gets the ticket. To protect against brief connection failures, sites that lose the vote (either explicitly or implicitly by being disconnected from the voting body) need to relinquish the ticket after a time-out. This ensures that a ticket is only redistributed after the previous site has relinquished it. See also Dead Man Dependency (loss-policy="fence").
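
The time-out rule can be pictured as a lease. The following sketch is a simplified illustration, not the booth's actual protocol: the class TicketLease and its methods are hypothetical, and time is passed in explicitly as a number of seconds so that the behavior is easy to follow.

```python
class TicketLease:
    """Hypothetical lease: one owner at a time, valid until expires_at."""

    def __init__(self, expiry_seconds: float):
        self.expiry_seconds = expiry_seconds
        self.owner = None
        self.expires_at = 0.0

    def is_valid(self, now: float) -> bool:
        # The lease is valid while an owner exists and the expiry lies ahead.
        return self.owner is not None and self.expires_at > now

    def renew(self, site: str, now: float) -> bool:
        # Only the current owner can extend a lease that is still valid.
        if self.owner == site and self.is_valid(now):
            self.expires_at = now + self.expiry_seconds
            return True
        return False

    def acquire(self, site: str, now: float) -> bool:
        # A new owner is accepted only after the previous lease has run out.
        if self.is_valid(now) and self.owner != site:
            return False
        self.owner = site
        self.expires_at = now + self.expiry_seconds
        return True
```

A site that is cut off cannot renew, so its lease runs out on its own. Only then can another site acquire the ticket, which is the ordering the paragraph above describes.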

Arbitrator

Each site runs one booth instance that is responsible for communicating with the other sites. If you have a setup with an even number of sites, you need an additional instance to reach consensus about decisions such as failover of resources across sites. In this case, add one or more arbitrators running at additional sites. Arbitrators are single machines that run a booth instance in a special mode. As all booth instances communicate with each other, arbitrators help to make more reliable decisions about granting or revoking tickets.

An arbitrator is especially important for a two-site scenario: For example, if site A can no longer communicate with site B, there are two possible causes for that:

  • A network failure between A and B.

  • Site B is down.

However, if site C (the arbitrator) can still communicate with site B, site B must still be up and running.
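
The role of the arbitrator can be sketched as a simple majority count. This is an assumption-laden simplification (a strict-majority rule over all booth instances), not the booth's real voting implementation; the function has_majority and the member names are hypothetical.

```python
def has_majority(reachable: set, members: set) -> bool:
    """Return True if the reachable booth instances form a strict majority.

    `reachable` includes the instance doing the counting.
    """
    count = len(reachable.intersection(members))
    return count * 2 > len(members)


# Two sites without an arbitrator: after a link failure, neither side
# can tell whether the other site is down, and neither has a majority.
two_sites = {"siteA", "siteB"}
print(has_majority({"siteA"}, two_sites))

# Two sites plus arbitrator C: the side that still reaches C wins.
with_arbitrator = {"siteA", "siteB", "arbitratorC"}
print(has_majority({"siteA", "arbitratorC"}, with_arbitrator))
print(has_majority({"siteB"}, with_arbitrator))
```

With an even number of voters, a clean split leaves both halves without a majority, which is why the text recommends adding an arbitrator in that case.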

Dead Man Dependency (loss-policy="fence")

After a ticket is revoked, it can take a long time until all resources depending on that ticket are stopped, especially in case of cascaded resources. To cut that process short, the cluster administrator can configure a loss-policy (together with the ticket dependencies) for the case that a ticket gets revoked from a site. If the loss-policy is set to fence, the nodes that are hosting dependent resources are fenced. This considerably speeds up the recovery process of the cluster and makes sure that resources can be migrated more quickly.
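
To make the effect of the loss-policy more concrete, the following sketch maps each of the four loss-policy values listed in this chapter (fence, stop, freeze, demote) to the actions that would follow a ticket revocation. It is an illustration only: the function actions_on_revoke, the tuple format, and the role names are hypothetical and do not reflect Pacemaker internals.

```python
def actions_on_revoke(loss_policy: str, resources: dict) -> list:
    """Return the actions triggered when a ticket is revoked.

    `resources` maps a resource name to a (node, role) tuple for every
    resource that depends on the revoked ticket.
    """
    if loss_policy == "fence":
        # Fence every node that hosts at least one dependent resource.
        nodes = sorted({node for node, _role in resources.values()})
        return [("fence", node) for node in nodes]
    if loss_policy == "stop":
        return [("stop", name) for name in sorted(resources)]
    if loss_policy == "demote":
        # Only resources currently running in master mode are demoted.
        return [
            ("demote", name)
            for name, (_node, role) in sorted(resources.items())
            if role == "master"
        ]
    if loss_policy == "freeze":
        # Nothing is done to the dependent resources.
        return []
    raise ValueError(f"unknown loss-policy: {loss_policy!r}")
```

With fence, the number of actions depends on the number of nodes rather than on the number of (possibly cascaded) resources, which is why it shortens recovery.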

Figure 13.1. Example Scenario: A Two-Site Cluster (4 Nodes + Arbitrator)


As usual, the CIB is synchronized within each cluster, but it is not synchronized across cluster sites of a multi-site cluster. You have to configure the resources that will be highly available across the multi-site cluster for every site accordingly.

13.3. Requirements

Software Requirements

  • All clusters that will be part of the multi-site cluster must be based on SUSE Linux Enterprise High Availability Extension 11 SP3.

  • SUSE® Linux Enterprise Server 11 SP3 must be installed on all arbitrators.

  • The booth package must be installed on all cluster nodes and on all arbitrators that will be part of the multi-site cluster.

The most common scenario is probably a multi-site cluster with two sites and a single arbitrator on a third site. However, technically, there are no limitations with regard to the number of sites and the number of arbitrators involved.

Nodes belonging to the same cluster site should be synchronized via NTP. However, time synchronization is not required between the individual cluster sites.

13.4. Basic Setup

Configuring a multi-site cluster takes the following basic steps:

13.4.1. Configuring Cluster Resources and Constraints

Apart from the resources and constraints that you need to define for your specific cluster setup, multi-site clusters require additional resources and constraints as described below. Instead of configuring them with the CRM shell, you can also do so with the HA Web Konsole. For details, refer to Section 5.5.2, “Configuring Additional Cluster Resources and Constraints”.

Procedure 13.1. Configuring Ticket Dependencies

The crm configure rsc_ticket command lets you specify the resources depending on a certain ticket. Together with the constraint, you can set a loss-policy that defines what should happen to the respective resources if the ticket is revoked. The attribute loss-policy can have the following values:

  • fence: Fence the nodes that are running the relevant resources.

  • stop: Stop the relevant resources.

  • freeze: Do nothing to the relevant resources.

  • demote: Demote relevant resources that are running in master mode to slave mode.

  1. On one of the cluster nodes, start a shell and log in as root or equivalent.

  2. Enter crm configure to switch to the interactive shell.

  3. Configure a constraint that defines which resources depend on a certain ticket. For example:

    crm(live)configure# rsc_ticket rsc1-req-ticketA ticketA: rsc1 loss-policy="fence"

    This creates a constraint with the ID rsc1-req-ticketA. It defines that the resource rsc1 depends on ticketA and that the node running the resource should be fenced in case ticketA is revoked.

    If resource rsc1 were not a primitive, but a special clone resource that can run in master or slave mode, you may want to configure that only rsc1's master mode depends on ticketA. With the following configuration, rsc1 is automatically demoted to slave mode if ticketA is revoked:

    crm(live)configure# rsc_ticket rsc1-req-ticketA ticketA: rsc1:Master loss-policy="demote"

  4. If you want other resources to depend on further tickets, create as many constraints as necessary with rsc_ticket.

  5. Review your changes with show.

  6. If everything is correct, submit your changes with commit and leave the crm live configuration with exit.

    The constraints are saved to the CIB.
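
If many resources depend on tickets, the constraint lines can be generated instead of typed by hand. The helper below is a hypothetical convenience sketch: rsc_ticket_line is not part of the CRM shell, and it only reproduces the two line patterns shown in Step 3, including the ID scheme RESOURCE-req-TICKET used there.

```python
LOSS_POLICIES = ("fence", "stop", "freeze", "demote")


def rsc_ticket_line(ticket: str, resource: str, loss_policy: str, role=None) -> str:
    """Build one rsc_ticket line in the form used in this procedure."""
    if loss_policy not in LOSS_POLICIES:
        raise ValueError(f"unknown loss-policy: {loss_policy!r}")
    # Append the role (for example Master) only if one is given.
    target = f"{resource}:{role}" if role else resource
    return (
        f"rsc_ticket {resource}-req-{ticket} {ticket}: "
        f'{target} loss-policy="{loss_policy}"'
    )


print(rsc_ticket_line("ticketA", "rsc1", "fence"))
print(rsc_ticket_line("ticketA", "rsc1", "demote", role="Master"))
```

The printed lines are meant to be pasted into the crm configure shell and reviewed with show before committing.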

Procedure 13.2. Configuring a Resource Group for boothd

Each site needs to run one instance of boothd that communicates with the other booth daemons. The daemon can be started on any node, therefore it should be configured as a primitive resource. To make the boothd resource stay on the same node, if possible, add resource stickiness to the configuration. As each daemon needs a persistent IP address, configure another primitive with a virtual IP address. Group both primitives:

  1. On one of the cluster nodes, start a shell and log in as root or equivalent.

  2. Enter crm configure to switch to the interactive shell.

  3. To create both primitive resources and to add them to one group, g-booth:

    crm(live)configure# primitive booth-ip ocf:heartbeat:IPaddr2 params ip="IP_ADDRESS"
    crm(live)configure# primitive booth ocf:pacemaker:booth-site \
          meta resource-stickiness="INFINITY" \
          op monitor interval="10s" timeout="20s"
    crm(live)configure# group g-booth booth-ip booth

  4. Review your changes with show.

  5. If everything is correct, submit your changes with commit and leave the crm live configuration with exit.

  6. Repeat the resource group configuration on the other cluster sites, using a different IP address for each boothd resource group.

    With this configuration, each booth daemon will be available at its individual IP address, independent of the node the daemon is running on.

Procedure 13.3. Adding an Ordering Constraint

If a ticket has been granted to a site but all nodes of that site fail to host the boothd resource group for any reason, a split-brain situation among the geographically dispersed sites could occur. In that case, no boothd instance would be available to safely manage failover of the ticket to another site. To avoid a potential concurrency violation of the ticket (the ticket being granted to multiple sites simultaneously), add an ordering constraint:

  1. On one of the cluster nodes, start a shell and log in as root or equivalent.

  2. Enter crm configure to switch to the interactive shell.

  3. Create an ordering constraint:

    crm(live)configure# order order-booth-rsc1 inf: g-booth rsc1

    This defines that rsc1 (that depends on ticketA) can only be started after the g-booth resource group.

    If rsc1 is not a primitive, but a special clone resource configured as described in Step 3 of Procedure 13.1, configure the ordering constraint as follows:

    crm(live)configure# order order-booth-rsc1 inf: g-booth rsc1:promote

    This defines that rsc1 can only be promoted to master mode after the g-booth resource group has started.

  4. Review your changes with show.

  5. For any other resources that depend on a certain ticket, define further ordering constraints.

  6. If everything is correct, submit your changes with commit and leave the crm live configuration with exit.

13.4.2. Setting Up the Booth Services

After having configured the resource group for the boothd and the ticket dependencies, complete the booth setup:

Procedure 13.4. Editing the Booth Configuration File

  1. Log in to a cluster node as root or equivalent.

  2. Create /etc/booth/booth.conf and edit it according to the example below:

    Example 13.1. Example Booth Configuration File

    transport="UDP" 1
    port="6666" 2
    arbitrator="147.2.207.14" 3
    site="147.4.215.19" 4
    site="147.18.2.1" 4
    ticket="ticketA;1000" 5 6
    ticket="ticketB;1000" 5 6

    1

    Defines the transport protocol used for communication between the sites. For SP2, only UDP is supported, other transport layers will follow.

    2

    Defines the port used for communication between the sites. Choose any port that is not already used for different services. Make sure to open the port in the nodes' and arbitrators' firewalls.

    3

    Defines the IP address of the arbitrator. Insert an entry for each arbitrator you use in your setup.

    4

    Defines the IP address used for the boothd on each site. Make sure to insert the correct virtual IP addresses (IPaddr2) for each site, otherwise the booth mechanism will not work correctly.

    5

    Defines the ticket to be managed by the booth. For each ticket, add a ticket entry.

    6

    Optional parameter. Defines the ticket's expiry time in seconds. A site that has been granted a ticket will renew the ticket regularly. If the booth does not receive any information about renewal of the ticket within the defined expiry time, the ticket will be revoked and granted to another site. If no expiry time is specified, the ticket will expire after 600 seconds by default.


    An example booth configuration file is available at /etc/booth/booth.conf.example.

  3. Verify your changes and save the file.

  4. Copy /etc/booth/booth.conf to all sites and arbitrators. In case of any changes, make sure to update the file accordingly on all parties.

    [Note]Synchronize Booth Configuration to All Sites and Arbitrators

    All cluster nodes and arbitrators within the multi-site cluster must use the same booth configuration. While you may need to copy the files manually to the arbitrators and to one cluster node per site, you can use Csync2 within each cluster site to synchronize the file to all nodes.
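
Before copying the file to all sites and arbitrators, it can help to check it mechanically. The following sketch parses the key="value" format shown in Example 13.1. It is not part of the booth package: the function parse_booth_conf, the SAMPLE text, and the strictness of the pattern are assumptions made for this example, and the callout numbers from the example listing must not appear in a real file.

```python
import re

# One setting per line, in the form key="value".
LINE_RE = re.compile(r'^(\w+)\s*=\s*"([^"]*)"$')
# Default expiry in seconds, as stated for tickets without an expiry time.
DEFAULT_EXPIRY = 600

SAMPLE = """\
transport="UDP"
port="6666"
arbitrator="147.2.207.14"
site="147.4.215.19"
site="147.18.2.1"
ticket="ticketA;1000"
ticket="ticketB"
"""


def parse_booth_conf(text: str) -> dict:
    """Parse booth.conf text into a dictionary (illustrative only)."""
    conf = {"transport": None, "port": None,
            "arbitrators": [], "sites": [], "tickets": {}}
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"line {number}: cannot parse {raw!r}")
        key, value = match.groups()
        if key == "transport":
            conf["transport"] = value
        elif key == "port":
            conf["port"] = int(value)
        elif key == "arbitrator":
            conf["arbitrators"].append(value)
        elif key == "site":
            conf["sites"].append(value)
        elif key == "ticket":
            # The expiry time after the semicolon is optional.
            name, _sep, expiry = value.partition(";")
            expiry = expiry.strip()
            conf["tickets"][name.strip()] = int(expiry) if expiry else DEFAULT_EXPIRY
        else:
            raise ValueError(f"line {number}: unknown key {key!r}")
    return conf


print(parse_booth_conf(SAMPLE))
```

Running the same parser on the copies from every site and arbitrator and comparing the resulting dictionaries is one way to confirm that all parties use the same configuration.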

Procedure 13.5. Starting the Booth Services

  1. Start the booth resource group on each cluster site. This starts one instance of the booth service per site.

  2. Log in to each arbitrator and start the booth service:

    /etc/init.d/booth-arbitrator start

    This starts the booth service in arbitrator mode. It can communicate with all other booth daemons but in contrast to the booth daemons running on the cluster sites, it cannot be granted a ticket.

After finishing the booth configuration and starting the booth services, you are now ready to start the ticket process.

13.5. Managing Multi-Site Clusters

Before the booth can manage a certain ticket within the multi-site cluster, you initially need to grant it to a site manually. Use the booth client command line tool to grant, list, or revoke tickets as described in Overview of booth client Commands. The booth client commands work on any machine where the booth daemon is running.

Overview of booth client Commands

Listing All Tickets on All Sites

# booth client list

ticket: ticketA, owner: 147.4.215.19, expires: 2013/04/24 12:00:01
ticket: ticketB, owner: None, expires: INF
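
For scripted checks, the listing can be turned into a data structure. The sketch below assumes that the output has exactly the line format shown above (ticket, owner, and expires fields separated by a comma and a space); that format may differ between booth versions, and parse_ticket_list is a hypothetical helper, not part of booth.

```python
from datetime import datetime


def parse_ticket_list(output: str) -> dict:
    """Parse ticket listing lines into {ticket: {"owner", "expires"}}."""
    tickets = {}
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("ticket:"):
            continue
        # Split "key: value" pairs that are separated by ", ".
        fields = dict(part.split(": ", 1) for part in line.split(", "))
        owner = None if fields["owner"] == "None" else fields["owner"]
        if fields["expires"] == "INF":
            expires = None
        else:
            expires = datetime.strptime(fields["expires"], "%Y/%m/%d %H:%M:%S")
        tickets[fields["ticket"]] = {"owner": owner, "expires": expires}
    return tickets


listing = """\
ticket: ticketA, owner: 147.4.215.19, expires: 2013/04/24 12:00:01
ticket: ticketB, owner: None, expires: INF
"""
print(parse_ticket_list(listing))
```

In the result, a ticket that is not granted anywhere has owner None and no expiry time.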

Granting a Ticket to a Site

# booth client grant -t ticketA -s 147.2.207.14

cluster[3100]: 2013/04/24_11:44:14 info: grant command sent, result will be
returned asynchronously, you can get the result from the log files.

In this case, ticketA will be granted to the site 147.2.207.14. The grant operation will be executed immediately. However, it might not be finished yet when the message above appears on the screen. Find the exact status in the log files.

Before granting a ticket, the command will execute a sanity check. If the same ticket is already granted to another site, you will be warned about that and be prompted to revoke the ticket from the current site first.

Revoking a Ticket From a Site

# booth client revoke -t ticketA -s 147.2.207.14

cluster[3100]: 2013/04/24_11:44:14 info: revoke command sent, result will be
returned asynchronously, you can get the result from the log files.

In this case, ticketA will be revoked from the site 147.2.207.14. The revoke operation will be executed immediately. However, it might not be finished yet when the message above appears on the screen. Find the exact status in the log files.

[Warning]crm_ticket and crm site ticket

If the booth service is not running for any reason, you may also manage tickets manually with crm_ticket or crm site ticket. Both commands are only available on cluster nodes. Use them with great care, as they cannot verify whether the same ticket is already granted elsewhere. For basic information about the commands, refer to their man pages.

As long as booth is up and running, only use booth client for manual intervention.

After you have initially granted a ticket to a site, the booth mechanism will take over and manage the ticket automatically. If the site holding a ticket should be out of service, the ticket will automatically be revoked after the expiry time and granted to another site. The resources that depend on that ticket will fail over to the new site holding the ticket. The nodes that have run the resources before will be treated according to the loss-policy you set within the constraint.

Procedure 13.6. Managing Tickets Manually

Assuming that you want to manually move ticketA from site 147.2.207.14 to 147.2.207.15, proceed as follows:

  1. Set ticketA to standby with the following command:

    crm_ticket -t ticketA -s

  2. Wait for any resources that depend on ticketA to be stopped or demoted cleanly.

  3. Revoke ticketA from its current site with:

    booth client revoke -t ticketA -s 147.2.207.14

  4. Wait for the revocation process to be finished successfully (check /var/log/messages for details). Do not execute any grant commands during this time.

  5. After the ticket has been revoked from its original site, grant it to the new site with:

    booth client grant -t ticketA -s 147.2.207.15
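
The order of the three commands in this procedure can be captured in a small helper, for example as the basis of a runbook. The sketch below only builds the argument lists and deliberately does not execute anything, because Steps 2 and 4 require waiting and checking the logs in between. The function move_ticket_commands is a hypothetical name; the commands and options are the ones shown in the steps above.

```python
def move_ticket_commands(ticket: str, old_site: str, new_site: str) -> list:
    """Return the commands for a manual ticket move, in execution order."""
    if old_site == new_site:
        raise ValueError("old and new site must differ")
    return [
        # Step 1: set the ticket to standby on the site that holds it.
        ["crm_ticket", "-t", ticket, "-s"],
        # Step 3: revoke the ticket (after dependent resources have stopped).
        ["booth", "client", "revoke", "-t", ticket, "-s", old_site],
        # Step 5: grant the ticket (after the revocation has finished).
        ["booth", "client", "grant", "-t", ticket, "-s", new_site],
    ]


for command in move_ticket_commands("ticketA", "147.2.207.14", "147.2.207.15"):
    print(" ".join(command))
```

Keeping the commands as separate list entries makes it harder to issue the grant before the revocation has completed.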

13.6. Troubleshooting

Booth logs to /var/log/messages and uses the same logging mechanism as the CRM. Thus, changing the log level also affects booth logging. The booth log messages also contain information about any tickets.

Both the booth log messages and the booth configuration file are included in the hb_report.

In case of unexpected booth behavior or any problems, check /var/log/messages or create an hb_report.


SUSE Linux Enterprise High Availability Extension High Availability Guide 11 SP3