Abstract
Apart from local clusters and metro area clusters, SUSE® Linux Enterprise High Availability Extension
11 SP2 also supports multi-site clusters. That means you can
have multiple, geographically dispersed sites with a local cluster each.
Failover between these clusters is coordinated by a higher level entity,
the so-called booth. Support for multi-site clusters
is available as a separate option to SUSE Linux Enterprise High Availability Extension.
Typically, multi-site environments are too far apart to support synchronous communication between the sites and synchronous data replication. That leads to the following challenges:
How to make sure that a cluster site is up and running?
How to make sure that resources are only started once?
How to make sure that quorum can be reached between the different sites and a split brain scenario can be avoided?
How to manage failover between the sites?
How to deal with high latency in case of resources that need to be stopped?
In the following sections, learn how to meet these challenges with SUSE Linux Enterprise High Availability Extension.
Multi-site clusters based on SUSE Linux Enterprise High Availability Extension can be considered “overlay” clusters where each cluster site corresponds to a cluster node in a traditional cluster. The overlay cluster is managed by the booth mechanism, which guarantees that the cluster resources are highly available across different cluster sites. This is achieved by using so-called tickets that are treated as a failover domain between cluster sites in case a site goes down.
The following list explains the individual components and mechanisms that were introduced for multi-site clusters in more detail.
Components and Concepts
A ticket grants the right to run certain resources on a specific cluster site. A ticket can only be owned by one site at a time. Initially, none of the sites has a ticket—each ticket must be granted once by the cluster administrator. After that, tickets are managed by the booth for automatic failover of resources. But administrators may also intervene and grant or revoke tickets manually.
Resources can be bound to a certain ticket by dependencies. The respective resources are started only if the defined ticket is available at a site. Conversely, if the ticket is removed, the resources depending on that ticket are automatically stopped.
The presence or absence of tickets for a site is stored in the CIB as
a cluster status. With regard to a certain ticket, there are only two
states for a site: true (the site has the ticket)
or false (the site does not have the ticket). The
absence of a certain ticket (during the initial state of the
multi-site cluster) is not treated differently from the situation
after the ticket has been revoked: both are reflected by the value
false.
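To make the two-state model concrete, here is a minimal Python sketch. The class `SiteTicketStatus` and its methods are hypothetical illustrations and not part of Pacemaker or the CIB; the sketch only shows that a ticket that was never granted and a ticket that was revoked both read as false.

```python
class SiteTicketStatus:
    """Hypothetical stand-in for the ticket status one site keeps in its CIB."""

    def __init__(self):
        # Maps ticket name -> bool. Tickets never seen are simply absent.
        self._granted = {}

    def grant(self, ticket):
        self._granted[ticket] = True

    def revoke(self, ticket):
        self._granted[ticket] = False

    def has_ticket(self, ticket):
        # A missing entry (initial state) and an explicit False (revoked)
        # are indistinguishable to the caller.
        return self._granted.get(ticket, False)
```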
A ticket within an overlay cluster is similar to a resource in a traditional cluster. But in contrast to traditional clusters, tickets are the only type of resource in an overlay cluster. They are primitive resources that need to be neither configured nor cloned.
The booth is the instance managing the ticket distribution and thus,
the failover process between the sites of a multi-site cluster. Each
of the participating clusters and arbitrators runs a service, the
boothd. It connects
to the booth daemons running at the other sites and exchanges
connectivity details. Once a ticket is granted to a site, the booth
mechanism will manage the ticket automatically: If the site which
holds the ticket is out of service, the booth daemons will vote which
of the other sites will get the ticket. To protect against brief
connection failures, sites that lose the vote (either explicitly or
implicitly by being disconnected from the voting body) need to
relinquish the ticket after a time-out. Thus, it is made sure that a
ticket will only be re-distributed after it has been relinquished by
the previous site. See also
Dead Man Dependency (loss-policy="fence").
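The time-out rule described above can be pictured as a lease. The following Python sketch is an assumption-laden illustration, not booth's actual implementation: the names `TicketLease` and `can_redistribute` are hypothetical, and the real daemons additionally vote on the new owner.

```python
from dataclasses import dataclass


@dataclass
class TicketLease:
    """Hypothetical model of a ticket held by one site with an expiry time."""
    holder: str
    expiry: int          # seconds the ticket stays valid without renewal
    last_renewal: float  # timestamp of the last successful renewal

    def renew(self, now):
        self.last_renewal = now

    def is_expired(self, now):
        return now - self.last_renewal >= self.expiry


def can_redistribute(lease, now):
    # Other sites may take over only once the previous holder's lease has
    # run out, that is, after that holder has had to relinquish the ticket.
    return lease.is_expired(now)
```

A brief connection failure that is shorter than the expiry time therefore never leads to two sites holding the same ticket in this model.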
Each site runs one booth instance that is responsible for communicating with the other sites. If you have a setup with an even number of sites, you need an additional instance to reach consensus about decisions such as failover of resources across sites. In this case, add one or more arbitrators running at additional sites. Arbitrators are single machines that run a booth instance in a special mode. As all booth instances communicate with each other, arbitrators help to make more reliable decisions about granting or revoking tickets.
An arbitrator is especially important for a two-site scenario: For
example, if site A can no longer communicate with
site B, there are two possible causes for that:
A network failure between A and
B.
Site B is down.
However, if site C (the arbitrator) can still
communicate with site A, site A
must still be up and running.
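The reasoning behind the arbitrator can be expressed as a simple majority check. This is a generic sketch of majority voting under the assumption that every booth instance (site or arbitrator) counts as one member; `has_majority` is a hypothetical helper and does not reproduce booth's actual voting protocol.

```python
def has_majority(total_members, reachable_members):
    """True if the reachable booth instances (including this one) form a strict majority."""
    return reachable_members > total_members // 2


# Two sites, no arbitrator: after a split, each site only sees itself.
two_sites_after_split = has_majority(total_members=2, reachable_members=1)

# Two sites plus arbitrator C: site A still reaches C, site B is isolated.
site_a_with_arbitrator = has_majority(total_members=3, reachable_members=2)
site_b_isolated = has_majority(total_members=3, reachable_members=1)
```

Without the arbitrator neither site can win a vote after a split; with it, exactly one side can.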
Dead Man Dependency (loss-policy="fence")
After a ticket is revoked, it can take a long time until all resources
depending on that ticket are stopped, especially in case of cascaded
resources. To cut that process short, the cluster administrator can
configure a loss-policy (together with the ticket
dependencies) for the case that a ticket gets revoked from a site. If
the loss-policy is set to fence, the nodes that are
hosting dependent resources are fenced. This considerably speeds up
the recovery process of the cluster and makes sure that resources can
be migrated more quickly.
As usual, the CIB is synchronized within each cluster, but it is not synchronized across cluster sites of a multi-site cluster. You have to configure the resources that will be highly available across the multi-site cluster for every site accordingly.
Software Requirements
All clusters that will be part of the multi-site cluster must be based on SUSE Linux Enterprise High Availability Extension 11 SP2.
SUSE® Linux Enterprise Server 11 SP2 must be installed on all arbitrators.
The booth package must be
installed on all cluster nodes and on all
arbitrators that will be part of the multi-site cluster.
The most common scenario is probably a multi-site cluster with two sites and a single arbitrator on a third site. However, technically, there are no limitations with regard to the number of sites and the number of arbitrators involved.
Nodes belonging to the same cluster site should be synchronized via NTP. However, time synchronization is not required between the individual cluster sites.
Configuring a multi-site cluster takes the following basic steps:
Procedure 13.1. Configuring Ticket Dependencies
The crm configure rsc_ticket command lets you
specify the resources depending on a certain ticket. Together with the
constraint, you can set a loss-policy that defines
what should happen to the respective resources if the ticket is
revoked. The attribute loss-policy can have the
following values:
fence: Fence the nodes that are running the
relevant resources.
stop: Stop the relevant resources.
freeze: Do nothing to the relevant resources.
demote: Demote relevant resources that are running
in master mode to slave mode.
On one of the cluster nodes, start a shell and log in as root or
equivalent.
Enter crm configure to switch to the interactive shell.
Configure a constraint that defines which resources depend on a certain ticket. For example:
crm(live)configure# rsc_ticket rsc1-req-ticketA ticketA: rsc1 loss-policy="fence"
This creates a constraint with the ID
rsc1-req-ticketA. It defines that the resource
rsc1 depends on ticketA and that
the node running the resource should be fenced in case
ticketA is revoked.
If resource rsc1 was not a primitive, but a special
clone resource that can run in master or
slave mode, you could also configure that
rsc1 is automatically demoted to
slave mode if ticketA is
revoked:
crm(live)configure# rsc_ticket rsc1-req-ticketA ticketA: rsc1:Master loss-policy="demote"
If you want other resources to depend on further tickets, create as many constraints as necessary with rsc_ticket.
Review your changes with show.
If everything is correct, submit your changes with commit and leave the crm live configuration with exit.
The constraints are saved to the CIB.
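If many resources depend on tickets, the constraint lines can be generated instead of typed. The following Python sketch only builds text in the syntax shown in the procedure above; `rsc_ticket_line` is a hypothetical helper, and the resulting lines still have to be entered in crm configure by hand or by other means.

```python
# The four loss-policy values listed in this section.
VALID_LOSS_POLICIES = {"fence", "stop", "freeze", "demote"}


def rsc_ticket_line(constraint_id, ticket, resource, loss_policy, role=None):
    """Build one rsc_ticket line in the syntax used in the procedure above."""
    if loss_policy not in VALID_LOSS_POLICIES:
        raise ValueError(f"unknown loss-policy: {loss_policy!r}")
    # An optional role such as "Master" is appended as resource:role.
    target = f"{resource}:{role}" if role else resource
    return f'rsc_ticket {constraint_id} {ticket}: {target} loss-policy="{loss_policy}"'


fence_line = rsc_ticket_line("rsc1-req-ticketA", "ticketA", "rsc1", "fence")
demote_line = rsc_ticket_line("rsc1-req-ticketA", "ticketA", "rsc1", "demote", role="Master")
```

Rejecting unknown loss-policy values early avoids submitting a constraint that the cluster would refuse.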
Procedure 13.2. Configuring a Resource Group for boothd
Each site needs to run one instance of
boothd that communicates
with the other booth daemons. The daemon can be started on any node,
so it should be configured as a primitive resource. As each daemon
needs a persistent IP address, configure another primitive with a
virtual IP address. Group both primitives:
On one of the cluster nodes, start a shell and log in as root or
equivalent.
Enter crm configure to switch to the interactive shell.
Create both primitive resources and add them to one group,
g-booth:
crm(live)configure# primitive booth-ip ocf:heartbeat:IPaddr2 params ip="IP_ADDRESS"
crm(live)configure# primitive booth ocf:pacemaker:booth-site
crm(live)configure# group g-booth booth-ip booth
Review your changes with show.
If everything is correct, submit your changes with commit and leave the crm live configuration with exit.
Repeat the resource group configuration on the other cluster sites,
using a different IP address for each boothd
resource group.
With this configuration, each booth daemon will be available at its individual IP address, independent of the node the daemon is running on.
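Because the group definition is identical on every site except for the IP address, it can be templated. The Python sketch below is a hypothetical helper that only produces the three crm configure lines from the procedure above; the site names `site1` and `site2` are made up for illustration, and the IP addresses are the site addresses from the example booth configuration in this section.

```python
def booth_group_config(ip_address):
    """Return the crm configure lines for one site's booth resource group."""
    return [
        f'primitive booth-ip ocf:heartbeat:IPaddr2 params ip="{ip_address}"',
        "primitive booth ocf:pacemaker:booth-site",
        "group g-booth booth-ip booth",
    ]


# Hypothetical site names mapped to one virtual IP address per site.
site_ips = {"site1": "147.4.215.19", "site2": "147.18.2.1"}
configs = {site: booth_group_config(ip) for site, ip in site_ips.items()}
```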
After having configured the resource group for the
boothd and the ticket
dependencies, complete the booth setup:
Procedure 13.3. Editing the Booth Configuration File
Log in to a cluster node as root or equivalent.
Create /etc/sysconfig/booth and edit it according
to the example below:
Example 13.1. Example Booth Configuration File
transport="UDP"
port="6666"
arbitrator="147.2.207.14"
site="147.4.215.19"
site="147.18.2.1"
ticket="ticketA;1000"
ticket="ticketB;1000"
transport: Defines the transport protocol used for communication between the sites. For SP2, only UDP is supported; other transport layers will follow.
port: Defines the port used for communication between the sites. Choose any port that is not already used for different services. Make sure to open the port in the nodes' and arbitrators' firewalls.
arbitrator: Defines the IP address of the arbitrator. Insert an entry for each arbitrator you use in your setup.
site: Defines the IP address used for the boothd on a site.
ticket: Defines the ticket to be managed by the booth. For each ticket, add a ticket entry.
Expiry value (the number after the semicolon in a ticket entry): Defines the ticket's expiry time in seconds. A site that has been granted a ticket will renew the ticket regularly. If the booth does not receive any information about renewal of the ticket within the defined expiry time, the ticket will be revoked and granted to another site.
An example booth configuration file is available at
/etc/booth/booth.conf.example.
Verify your changes and save the file.
Copy /etc/sysconfig/booth to all sites and
arbitrators. In case of any changes, make sure to update the file
accordingly on all parties.
Synchronize Booth Configuration to All Sites and Arbitrators
All cluster nodes and arbitrators within the multi-site cluster must use the same booth configuration. While you may need to copy the files manually to the arbitrators and to one cluster node per site, you can use Csync2 within each cluster site to synchronize the file to all nodes.
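One way to check that every machine really has the same booth configuration is to compare a digest of the file contents. The following Python sketch assumes the key="value" layout of Example 13.1 and treats lines starting with `#` as comments (an assumption, not confirmed by this section); `parse_booth_config` and `config_fingerprint` are hypothetical helpers, not part of booth or Csync2.

```python
import hashlib


def parse_booth_config(text):
    """Parse key="value" lines into a dict of lists (keys such as site may repeat)."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # assumed comment syntax
            continue
        key, _, value = line.partition("=")
        config.setdefault(key.strip(), []).append(value.strip().strip('"'))
    return config


def config_fingerprint(text):
    """Digest of the non-empty, stripped lines, for comparing copies across machines."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


EXAMPLE = '''transport="UDP"
port="6666"
arbitrator="147.2.207.14"
site="147.4.215.19"
site="147.18.2.1"
ticket="ticketA;1000"
ticket="ticketB;1000"
'''

parsed = parse_booth_config(EXAMPLE)
# Split each "name;expiry" entry into a ticket name and its expiry time.
tickets = dict(entry.split(";") for entry in parsed["ticket"])
```

Running `config_fingerprint` on the file contents from each site and arbitrator and comparing the results shows quickly whether a copy has drifted.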
Procedure 13.4. Starting the Booth Services
Start the booth resource group on each cluster site. This starts one instance of the booth service per site.
Log in to each arbitrator and start the booth service:
/etc/init.d/booth-arbitrator start
This starts the booth service in arbitrator mode. It can communicate with all other booth daemons, but in contrast to the booth daemons running on the cluster sites, it cannot be granted a ticket.
After finishing the booth configuration and starting the booth services, you are now ready to start the ticket process.
Before the booth can manage a certain ticket within the multi-site cluster, you initially need to grant it to a site manually. Use the booth client command line tool to grant, list, or revoke tickets. The booth client commands work on any machine where the booth daemon is running.
crm_ticket and crm site ticket
If the booth service is not running for any reason, you may also manage tickets manually with crm_ticket or crm site ticket. Both commands are only available on cluster nodes. In case of manual intervention, use them with great care as they cannot verify if the same ticket is already granted elsewhere. For basic information about the commands, refer to their man pages. As long as booth is up and running, only use booth client for manual intervention.
Procedure 13.5. Managing Tickets Manually
To list all tickets on all sites:
booth client list
To grant a ticket to a site (for example, ticketA to
the site 147.2.207.14):
booth client grant -t ticketA -s 147.2.207.14
The command executes a sanity check and will warn you if the same ticket is already granted to another site.
When granting tickets, you can also specify the expiry time after which
a ticket will fail over to another site if it has not been renewed. The
default expiry time is 600 seconds. To specify a
different value, use the -e option:
booth client grant -t ticketA -s 147.2.207.14 -e 1000
To revoke a ticket from a site (for example, ticketA
from the site 147.2.207.14):
booth client revoke -t ticketA -s 147.2.207.14
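When ticket operations are scripted, it helps to assemble the command line in one place. The Python sketch below builds argument lists that match the booth client calls shown in this procedure and nothing more; `booth_client_cmd` is a hypothetical helper, and it deliberately does not execute anything.

```python
def booth_client_cmd(action, ticket=None, site=None, expiry=None):
    """Return the argument list for one of the booth client calls shown above."""
    if action == "list":
        return ["booth", "client", "list"]
    if action not in ("grant", "revoke"):
        raise ValueError(f"unsupported action: {action!r}")
    if not ticket or not site:
        raise ValueError("grant and revoke need both a ticket and a site")
    cmd = ["booth", "client", action, "-t", ticket, "-s", site]
    if expiry is not None:
        # The -e option is only described for grant in this section.
        if action != "grant":
            raise ValueError("expiry only applies to grant")
        cmd += ["-e", str(expiry)]
    return cmd


grant_cmd = booth_client_cmd("grant", "ticketA", "147.2.207.14", expiry=1000)
```

On a machine where the booth daemon is running, such a list could be passed to `subprocess.run`; building a list instead of a single string avoids shell quoting problems.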
After you have initially granted a ticket to a site, the booth mechanism
will take over and manage the ticket automatically. If the site holding a
ticket should be out of service, the ticket will automatically be revoked
after the expiry time and granted to another site. The resources that
depend on that ticket will fail over to the new site holding the ticket.
The nodes that have run the resources before will be treated according to
the loss-policy you set within the constraint.