Abstract
Apart from local clusters and metro area clusters, SUSE® Linux Enterprise High Availability Extension
11 SP3 also supports multi-site clusters (geo clusters). That
means you can have multiple, geographically dispersed sites with a local
cluster each. Failover between these clusters is coordinated by a higher
level entity, the so-called booth. Support for
multi-site clusters is available as a separate option to SUSE Linux Enterprise High Availability Extension.
Typically, multi-site environments are too far apart to support synchronous communication between the sites and synchronous data replication. That leads to the following challenges:
How to make sure that a cluster site is up and running?
How to make sure that resources are only started once?
How to make sure that quorum can be reached between the different sites and a split brain scenario can be avoided?
How to manage failover between the sites?
How to deal with high latency in case of resources that need to be stopped?
In the following sections, learn how to meet these challenges with SUSE Linux Enterprise High Availability Extension.
Multi-site clusters based on SUSE Linux Enterprise High Availability Extension can be considered “overlay” clusters where each cluster site corresponds to a cluster node in a traditional cluster. The overlay cluster is managed by the booth mechanism, which guarantees that the cluster resources are highly available across different cluster sites. This is achieved by using so-called tickets, which are treated as a failover domain between cluster sites in case a site goes down.
The following list explains the individual components and mechanisms that were introduced for multi-site clusters in more detail.
Components and Concepts¶
A ticket grants the right to run certain resources on a specific cluster site. A ticket can only be owned by one site at a time. Initially, none of the sites has a ticket—each ticket must be granted once by the cluster administrator. After that, tickets are managed by the booth for automatic failover of resources. But administrators may also intervene and grant or revoke tickets manually.
Resources can be bound to a certain ticket by dependencies. The respective resources are started only if the defined ticket is available at a site. Conversely, if the ticket is removed, the resources depending on that ticket are automatically stopped.
The presence or absence of tickets for a site is stored in the CIB as
a cluster status. With regards to a certain ticket, there are only two
states for a site: true (the site has the ticket)
or false (the site does not have the ticket). The
absence of a certain ticket (during the initial state of the
multi-site cluster) is not treated differently from the situation
after the ticket has been revoked: both are reflected by the value
false.
A ticket within an overlay cluster is similar to a resource in a traditional cluster. But in contrast to traditional clusters, tickets are the only type of resource in an overlay cluster. They are primitive resources that need to be neither configured nor cloned.
The booth is the instance managing the ticket distribution and thus,
the failover process between the sites of a multi-site cluster. Each
of the participating clusters and arbitrators runs a service, the
boothd. It connects
to the booth daemons running at the other sites and exchanges
connectivity details. Once a ticket is granted to a site, the booth
mechanism will manage the ticket automatically: If the site which
holds the ticket is out of service, the booth daemons will vote which
of the other sites will get the ticket. To protect against brief
connection failures, sites that lose the vote (either explicitly or
implicitly by being disconnected from the voting body) need to
relinquish the ticket after a time-out. Thus, it is made sure that a
ticket will only be re-distributed after it has been relinquished by
the previous site. See also
Dead Man Dependency (loss-policy="fence").
Each site runs one booth instance that is responsible for communicating with the other sites. If you have a setup with an even number of sites, you need an additional instance to reach consensus about decisions such as failover of resources across sites. In this case, add one or more arbitrators running at additional sites. Arbitrators are single machines that run a booth instance in a special mode. As all booth instances communicate with each other, arbitrators help to make more reliable decisions about granting or revoking tickets.
An arbitrator is especially important for a two-site scenario: For
example, if site A can no longer communicate with
site B, there are two possible causes for that:
A network failure between A and
B.
Site B is down.
However, if site C (the arbitrator) can still
communicate with site B, site B
must still be up and running.
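The arbitrator argument above can be illustrated with a little arithmetic. The sketch below assumes that booth decides by simple majority among all sites and arbitrators; the text only states that the daemons "vote", so the majority rule and the helper names `majority` and `can_decide` are assumptions for illustration.

```shell
# majority N: smallest number of booth instances that forms a majority of N
# (assumption: simple majority voting)
majority() {
    echo $(( $1 / 2 + 1 ))
}

# can_decide TOTAL REACHABLE: succeed if REACHABLE instances form a majority
can_decide() {
    [ "$2" -ge "$(majority "$1")" ]
}

# Two sites, no arbitrator: a site that loses its peer sees only itself.
can_decide 2 1 || echo "2 instances, 1 reachable: no decision"
# Two sites plus one arbitrator: site A and arbitrator C still see each other.
can_decide 3 2 && echo "3 instances, 2 reachable: decision possible"
```

Under this assumption, a two-site setup without an arbitrator cannot distinguish a network failure from a site failure, whereas adding a third instance lets the remaining two agree.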
Dead Man Dependency (loss-policy="fence")
After a ticket is revoked, it can take a long time until all resources
depending on that ticket are stopped, especially in case of cascaded
resources. To cut that process short, the cluster administrator can
configure a loss-policy (together with the ticket
dependencies) for the case that a ticket gets revoked from a site. If
the loss-policy is set to fence, the nodes that are
hosting dependent resources are fenced. This considerably speeds up
the recovery process of the cluster and makes sure that resources can
be migrated more quickly.
As usual, the CIB is synchronized within each cluster, but it is not synchronized across cluster sites of a multi-site cluster. You have to configure the resources that will be highly available across the multi-site cluster for every site accordingly.
Software Requirements
All clusters that will be part of the multi-site cluster must be based on SUSE Linux Enterprise High Availability Extension 11 SP3.
SUSE® Linux Enterprise Server 11 SP3 must be installed on all arbitrators.
The booth package must be
installed on all cluster nodes and on all
arbitrators that will be part of the multi-site cluster.
The most common scenario is probably a multi-site cluster with two sites and a single arbitrator on a third site. Technically, however, there are no limitations with regard to the number of sites and the number of arbitrators involved.
Nodes belonging to the same cluster site should be synchronized via NTP. However, time synchronization is not required between the individual cluster sites.
Configuring a multi-site cluster takes the following basic steps:
Apart from the resources and constraints that you need to define for your specific cluster setup, multi-site clusters require additional resources and constraints as described below. Instead of configuring them with the CRM shell, you can also do so with the HA Web Konsole. For details, refer to Section 5.5.2, “Configuring Additional Cluster Resources and Constraints”.
Procedure 13.1. Configuring Ticket Dependencies¶
The crm configure rsc_ticket command lets you
specify the resources depending on a certain ticket. Together with the
constraint, you can set a loss-policy that defines
what should happen to the respective resources if the ticket is
revoked. The attribute loss-policy can have the
following values:
fence: Fence the nodes that are running the
relevant resources.
stop: Stop the relevant resources.
freeze: Do nothing to the relevant resources.
demote: Demote relevant resources that are running
in master mode to slave mode.
On one of the cluster nodes, start a shell and log in as root or
equivalent.
Enter crm configure to switch to the interactive shell.
Configure a constraint that defines which resources depend on a certain ticket. For example:
crm(live)configure# rsc_ticket rsc1-req-ticketA ticketA: rsc1 loss-policy="fence"
This creates a constraint with the ID
rsc1-req-ticketA. It defines that the resource
rsc1 depends on ticketA and that
the node running the resource should be fenced in case
ticketA is revoked.
If resource rsc1 was not a primitive, but a special
clone resource that can run in master or
slave mode, you may want to configure that only
rsc1's master mode depends on
ticketA. With the following configuration,
rsc1 is automatically demoted to
slave mode if ticketA is
revoked:
crm(live)configure# rsc_ticket rsc1-req-ticketA ticketA: rsc1:Master loss-policy="demote"
If you want other resources to depend on further tickets, create as many constraints as necessary with rsc_ticket.
Review your changes with show.
If everything is correct, submit your changes with commit and leave the crm live configuration with exit.
The constraints are saved to the CIB.
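As an illustration of several constraints side by side, the following crm configure fragment binds a second resource to a second ticket with a different loss-policy. The names rsc2 and rsc2-req-ticketB are hypothetical and only follow the naming pattern used in this procedure; adjust them to the resources and tickets of your setup.

```
rsc_ticket rsc1-req-ticketA ticketA: rsc1 loss-policy="fence"
rsc_ticket rsc2-req-ticketB ticketB: rsc2 loss-policy="stop"
```

With such a configuration, revoking ticketA would fence the node running rsc1, while revoking ticketB would only stop rsc2.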
Procedure 13.2. Configuring a Resource Group for boothd¶
Each site needs to run one instance of
boothd that communicates
with the other booth daemons. The daemon can be started on any node,
therefore it should be configured as a primitive resource. To make the
boothd resource stay on the same node, if
possible, add resource stickiness to the configuration. As each daemon
needs a persistent IP address, configure another primitive with a
virtual IP address. Group both primitives:
On one of the cluster nodes, start a shell and log in as root or
equivalent.
Enter crm configure to switch to the interactive shell.
Create both primitive resources and add them to one group,
g-booth:
crm(live)configure# primitive booth-ip ocf:heartbeat:IPaddr2 params ip="IP_ADDRESS"
crm(live)configure# primitive booth ocf:pacemaker:booth-site \
  meta resource-stickiness="INFINITY" \
  op monitor interval="10s" timeout="20s"
crm(live)configure# group g-booth booth-ip booth
Review your changes with show.
If everything is correct, submit your changes with commit and leave the crm live configuration with exit.
Repeat the resource group configuration on the other cluster sites,
using a different IP address for each boothd
resource group.
With this configuration, each booth daemon will be available at its individual IP address, independent of the node the daemon is running on.
Procedure 13.3. Adding an Ordering Constraint¶
If a ticket has been granted to a site but all nodes of that site
should fail to host the boothd
resource group for any reason, a “split-brain” situation
among the geographically dispersed sites could occur. In that case, no
boothd instance would be
available to safely manage fail-over of the ticket to another site. To
avoid a potential concurrency violation of the ticket (the ticket is
granted to multiple sites simultaneously), add an ordering constraint:
On one of the cluster nodes, start a shell and log in as root or
equivalent.
Enter crm configure to switch to the interactive shell.
Create an ordering constraint:
crm(live)configure# order order-booth-rsc1 inf: g-booth rsc1
This defines that rsc1 (that depends on
ticketA) can only be started after the
g-booth resource group.
In case rsc1 is not a primitive, but a special
clone resource and configured as described in
Step 3 of Procedure 13.1, the
ordering constraint should be configured as follows:
crm(live)configure# order order-booth-rsc1 inf: g-booth rsc1:promote
This defines that rsc1 can only be promoted to
master mode after the g-booth resource group has
started.
Review your changes with show.
For any other resources that depend on a certain ticket, define further ordering constraints.
If everything is correct, submit your changes with commit and leave the crm live configuration with exit.
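Taken together, Procedures 13.1 to 13.3 result in a per-site configuration along the following lines. This is only a summary of the statements already shown above, for a primitive resource rsc1 that depends on ticketA; IP_ADDRESS remains a placeholder for the virtual IP address of the respective site.

```
primitive booth-ip ocf:heartbeat:IPaddr2 params ip="IP_ADDRESS"
primitive booth ocf:pacemaker:booth-site \
  meta resource-stickiness="INFINITY" \
  op monitor interval="10s" timeout="20s"
group g-booth booth-ip booth
rsc_ticket rsc1-req-ticketA ticketA: rsc1 loss-policy="fence"
order order-booth-rsc1 inf: g-booth rsc1
```

Because the CIB is not synchronized across sites, an equivalent configuration must be entered on each cluster site, with a different IP address for each booth resource group.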
After having configured the resource group for the
boothd and the ticket
dependencies, complete the booth setup:
Procedure 13.4. Editing The Booth Configuration File¶
Log in to a cluster node as root or equivalent.
Create /etc/booth/booth.conf and edit it
according to the example below:
Example 13.1. Example Booth Configuration File
transport="UDP"
port="6666"
arbitrator="147.2.207.14"
site="147.4.215.19"
site="147.18.2.1"
ticket="ticketA;1000"
ticket="ticketB;1000"
transport: Defines the transport protocol used for communication between the sites. For SP2, only UDP is supported; other transport layers will follow.
port: Defines the port used for communication between the sites. Choose any port that is not already used for different services. Make sure to open the port in the nodes' and arbitrators' firewalls.
arbitrator: Defines the IP address of the arbitrator. Insert an entry for each arbitrator you use in your setup.
site: Defines the IP address used for the boothd at a site.
ticket: Defines the ticket to be managed by the booth. For each ticket, add a ticket entry.
Expiry time (the number after the semicolon in a ticket entry): Optional parameter. Defines the ticket's expiry time in seconds. A
site that has been granted a ticket will renew the ticket
regularly. If the booth does not receive any information about
renewal of the ticket within the defined expiry time, the ticket
will be revoked and granted to another site. If no expiry time is
specified, a default expiry time applies.
An example booth configuration file is available at
/etc/booth/booth.conf.example.
Verify your changes and save the file.
Copy /etc/booth/booth.conf to all sites and
arbitrators. In case of any changes, make sure to update the file
accordingly on all parties.
Synchronize Booth Configuration to All Sites and Arbitrators
All cluster nodes and arbitrators within the multi-site cluster must use the same booth configuration. While you may need to copy the files manually to the arbitrators and to one cluster node per site, you can use Csync2 within each cluster site to synchronize the file to all nodes.
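Before or after distributing the file, a quick look at the list of participants can catch simple mistakes. The following sketch counts the site and arbitrator entries in a booth configuration file and warns if their total is even. The function name `booth_members` is hypothetical, and the check assumes the key="value" syntax of the example file with one entry per line and no leading whitespace.

```shell
#!/bin/sh
# booth_members [FILE]: count site= and arbitrator= entries in a booth
# configuration file (default: /etc/booth/booth.conf).
booth_members() {
    conf="${1:-/etc/booth/booth.conf}"
    sites=$(grep -c '^site=' "$conf")
    arbitrators=$(grep -c '^arbitrator=' "$conf")
    total=$((sites + arbitrators))
    echo "sites=$sites arbitrators=$arbitrators total=$total"
    # With an even number of booth instances, an additional arbitrator
    # is needed to reach consensus.
    if [ $((total % 2)) -eq 0 ]; then
        echo "warning: even number of booth instances" >&2
        return 1
    fi
}
```

Run on the example file above, `booth_members` would report two sites and one arbitrator. Running the same check on every site and arbitrator is also a simple way to spot a copy of the file that was not updated.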
Procedure 13.5. Starting the Booth Services¶
Start the booth resource group on each cluster site. This starts one instance of the booth service per site.
Log in to each arbitrator and start the booth service:
/etc/init.d/booth-arbitrator start
This starts the booth service in arbitrator mode. It can communicate with all other booth daemons but in contrast to the booth daemons running on the cluster sites, it cannot be granted a ticket.
After finishing the booth configuration and starting the booth services, you are now ready to start the ticket process.
Before the booth can manage a certain ticket within the multi-site cluster, you initially need to grant it to a site manually. Use the booth client command line tool to grant, list, or revoke tickets as described in Overview of booth client Commands. The booth client commands work on any machine where the booth daemon is running.
Overview of booth client Commands¶
# booth client list
ticket: ticketA, owner: 147.4.215.19, expires: 2013/04/24 12:00:01
ticket: ticketB, owner: None, expires: INF
# booth client grant -t ticketA -s 147.2.207.14
cluster[3100]: 2013/04/24_11:44:14 info: grant command sent, result will be returned asynchronously, you can get the result from the log files.
In this case, ticketA will be granted to the site
147.2.207.14. The grant operation will be executed
immediately. However, it might not be finished yet when the message
above appears on the screen. Find the exact status in the log files.
Before granting a ticket, the command will execute a sanity check. If the same ticket is already granted to another site, you will be warned about that and be prompted to revoke the ticket from the current site first.
# booth client revoke -t ticketA -s 147.2.207.14
cluster[3100]: 2013/04/24_11:44:14 info: revoke command sent, result will be returned asynchronously, you can get the result from the log files.
In this case, ticketA will be revoked from the site
147.2.207.14. The revoke operation will be executed
immediately. However, it might not be finished yet when the message
above appears on the screen. Find the exact status in the log files.
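Because grant and revoke results are returned asynchronously, it can be handy to check the current owner of a ticket from a script. The following sketch extracts the owner from the output of booth client list. It assumes the output format shown above (ticket: NAME, owner: OWNER, expires: ...); the function name `ticket_owner` is hypothetical.

```shell
# ticket_owner TICKET: read "booth client list" output on stdin and
# print the owner field of the given ticket.
ticket_owner() {
    sed -n "s/^ticket: $1, owner: \([^,]*\),.*/\1/p"
}

# Usage on a machine where the booth daemon is running:
#   booth client list | ticket_owner ticketA
```

The function prints nothing if the ticket does not appear in the list at all.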
crm_ticket and crm site ticket
If the booth service is not running for any reason, you can also manage tickets manually with crm_ticket or crm site ticket. Both commands are only available on cluster nodes. Use them with great care, as they cannot verify whether the same ticket is already granted elsewhere. For basic information about the commands, refer to their man pages. As long as booth is up and running, only use booth client for manual intervention.
After you have initially granted a ticket to a site, the booth mechanism
will take over and manage the ticket automatically. If the site holding a
ticket should be out of service, the ticket will automatically be revoked
after the expiry time and granted to another site. The resources that
depend on that ticket will fail over to the new site holding the ticket.
The nodes that have run the resources before will be treated according to
the loss-policy you set within the constraint.
Procedure 13.6. Managing Tickets Manually¶
Assuming that you want to manually move ticketA from
site 147.2.207.14 to 147.2.207.15,
proceed as follows:
Set ticketA to standby with the following command:
crm_ticket -t ticketA -s
Wait for any resources that depend on ticketA to be
stopped or demoted cleanly.
Revoke ticketA from its current site with:
booth client revoke -t ticketA -s 147.2.207.14
Wait for the revocation process to be finished successfully (check
/var/log/messages for details). Do not execute any
grant commands during this time.
After the ticket has been revoked from its original site, grant it to the new site with:
booth client grant -t ticketA -s 147.2.207.15
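The procedure above can be sketched as a small wrapper script. This is an illustration only: the helper names `run`, `pause`, and `move_ticket` are hypothetical, and by default (DRY_RUN=1) the script merely prints the commands from the procedure so that the sequence can be reviewed. With DRY_RUN=0, it runs them and waits for confirmation between the steps, because the script itself cannot tell when resources have stopped or the revocation has finished.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# run CMD ARGS...: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# pause MESSAGE: show a reminder; outside dry-run mode, wait for Enter.
pause() {
    echo "$1" >&2
    if [ "$DRY_RUN" != "1" ]; then
        printf 'Press Enter to continue: ' >&2
        read -r reply
    fi
}

# move_ticket TICKET FROM_SITE TO_SITE
move_ticket() {
    ticket="$1"; from="$2"; to="$3"
    run crm_ticket -t "$ticket" -s
    pause "Wait until resources depending on $ticket are stopped or demoted."
    run booth client revoke -t "$ticket" -s "$from"
    pause "Wait until the revocation has finished (see /var/log/messages)."
    run booth client grant -t "$ticket" -s "$to"
}

# Example: move_ticket ticketA 147.2.207.14 147.2.207.15
```

Note that crm_ticket is only available on cluster nodes, whereas booth client works on any machine where the booth daemon is running, so the script is meant to be run on a cluster node of the site that currently holds the ticket.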
Booth logs to /var/log/messages and uses the same
logging mechanism as the CRM. Thus, changing the log level will also take
effect on booth logging. The booth log messages also contain information
about any tickets.
Both the booth log messages and the booth configuration file are included in the hb_report.
In case of unexpected booth behavior or any problems, check
/var/log/messages or create an
hb_report.