Abstract
Strange problems may occur that are not easy to understand, especially when you are starting to experiment with High Availability. However, several utilities let you take a closer look at the internal High Availability processes. This chapter recommends various solutions.
Troubleshooting difficulties when installing the packages or bringing the cluster online.
The packages needed for configuring and managing a cluster are
included in the High Availability installation
pattern, available with the High Availability Extension.
Check if High Availability Extension is installed as an add-on to SUSE Linux Enterprise Server 11 SP1 on each of the cluster nodes and if the pattern is installed on each of the machines as described in Section 3.1, “Installing the High Availability Extension”.
In order to communicate with each other, all nodes belonging to the
same cluster need to use the same bindnetaddr,
mcastaddr and mcastport as
described in Section 3.2, “Initial Cluster Setup”.
Check if the communication channels and options configured in
/etc/corosync/corosync.conf are the same for all
cluster nodes.
In case you use encrypted communication, check if the
/etc/corosync/authkey file is available on all
cluster nodes.
All corosync.conf settings with the exception of
nodeid must be the same;
authkey files on all nodes must be identical.
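The comparison described above can be sketched as a small shell helper. This is only an illustration, not part of the High Availability Extension: the function name conf_fingerprint is hypothetical, and the sketch assumes that nodeid appears on a line of its own in the form "nodeid: N".

```shell
#!/bin/sh
# Print a checksum of a corosync.conf-style file, ignoring comment lines,
# blank lines, surrounding whitespace, and the per-node "nodeid" setting.
# Two nodes whose files differ only in nodeid yield the same value.
# conf_fingerprint is a hypothetical helper name.
conf_fingerprint() {
    grep -v -E '^[[:space:]]*(#|$)' "$1" \
        | grep -v -E '^[[:space:]]*nodeid[[:space:]]*:' \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        | cksum | cut -d ' ' -f 1
}

# Possible usage (node1 and node2 are placeholder host names and
# SSH access to each node is assumed):
# for n in node1 node2; do
#     ssh "$n" cat /etc/corosync/corosync.conf > "/tmp/corosync.$n"
#     echo "$n $(conf_fingerprint "/tmp/corosync.$n")"
# done
```

If the printed values differ between nodes, compare the files directly, for example with diff. The authkey file is binary, so it can be compared by running cksum on it on each node without any filtering.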
If the mcastport used for communication between the cluster nodes is blocked by the firewall, the nodes cannot see each other. When configuring the initial setup with YaST as described in Section 3.1, “Installing the High Availability Extension”, the firewall settings are usually adjusted automatically.
To make sure the mcastport is not blocked by the firewall, check the
settings in /etc/sysconfig/SuSEfirewall2 on each
node. Alternatively, start the YaST firewall module on each cluster
node. There, add the mcastport to the list of allowed
UDP ports and confirm your changes.
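A manual check of the file can also be scripted. The following is a minimal sketch under two assumptions: the cluster interface belongs to the external firewall zone, so the relevant variable is FW_SERVICES_EXT_UDP, and the port is listed as a plain number rather than as a range or service name. The function name udp_port_allowed is hypothetical.

```shell
#!/bin/sh
# Return 0 if PORT is listed in FW_SERVICES_EXT_UDP in FILE, 1 otherwise.
# Only plain port numbers are recognized; ranges such as 5404:5405 and
# service names are not interpreted by this sketch.
# udp_port_allowed is a hypothetical helper name.
udp_port_allowed() {
    file=$1
    port=$2
    # Use the last uncommented assignment and strip the optional quotes.
    ports=$(sed -n 's/^FW_SERVICES_EXT_UDP="\{0,1\}\([^"]*\)"\{0,1\}.*$/\1/p' "$file" | tail -n 1)
    for p in $ports; do
        [ "$p" = "$port" ] && return 0
    done
    return 1
}

# Possible usage (5405 stands for the mcastport from corosync.conf):
# udp_port_allowed /etc/sysconfig/SuSEfirewall2 5405 || echo "mcastport not listed"
```

A negative result does not prove that the port is blocked, because it may be opened through another zone or a service definition. Treat it as a hint to inspect the firewall configuration more closely.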
Check the OpenAIS status on each cluster node with /etc/init.d/openais status. In case OpenAIS is not running, start it by executing /etc/init.d/openais start.
The following displays the resource operation history (option
-o) and inactive resources (-r):
crm_mon -o -r
The display is refreshed when the status changes (to cancel, press Ctrl+C). An example could look like this:
Example 17.1. Stopped Resources
Refresh in 10s...
============
Last updated: Mon Jan 19 08:56:14 2009
Current DC: d42 (d42)
3 Nodes configured.
3 Resources configured.
============
Online: [ d230 d42 ]
OFFLINE: [ clusternode-1 ]
Full list of resources:
Clone Set: o2cb-clone
Stopped: [ o2cb:0 o2cb:1 o2cb:2 ]
Clone Set: dlm-clone
Stopped: [ dlm:0 dlm:1 dlm:2 ]
mySecondIP (ocf::heartbeat:IPaddr): Stopped
Operations:
* Node d230:
aa: migration-threshold=1000000
+ (5) probe: rc=0 (ok)
+ (37) stop: rc=0 (ok)
+ (38) start: rc=0 (ok)
+ (39) monitor: interval=15000ms rc=0 (ok)
* Node d42:
aa: migration-threshold=1000000
+ (3) probe: rc=0 (ok)
+ (12) stop: rc=0 (ok)
First get your node online (see Section 17.3). After that, check your resources and operations.
The Configuration Explained PDF under http://clusterlabs.org/wiki/Documentation covers three different recovery types in the How Does the Cluster Interpret the OCF Return Codes? section.
To check the current state of your cluster, use either
crm_mon or crm
status. Both display the current DC as well as all
the nodes and resources known by the current node.
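For scripted checks, the text output of crm_mon can be filtered. The sketch below assumes the output layout shown in Example 17.1 (an "OFFLINE: [ ... ]" line and primitive resources ending in "): Stopped"). The layout may differ between versions, and the function names are hypothetical.

```shell
#!/bin/sh
# Read crm_mon text output on stdin and print one offline node per line.
# offline_nodes is a hypothetical helper name.
offline_nodes() {
    sed -n 's/^[[:space:]]*OFFLINE:[[:space:]]*\[\(.*\)\][[:space:]]*$/\1/p' \
        | tr -s ' ' '\n' | sed '/^$/d'
}

# Read crm_mon text output on stdin and print the names of primitive
# resources reported as Stopped (clone sets are not handled here).
# stopped_primitives is a hypothetical helper name.
stopped_primitives() {
    awk '/\):[ \t]*Stopped[ \t]*$/ { print $1 }'
}

# Possible usage: -1 prints the status once and exits, -r includes
# inactive resources.
# crm_mon -1 -r | offline_nodes
# crm_mon -1 -r | stopped_primitives
```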
There could be several reasons:
Look first in the configuration file
/etc/corosync/corosync.conf and check if the
multicast address is the same for every node in the cluster (look in
the interface section with the key
mcastaddr).
Check your firewall settings.
Check if your switch supports multicast addresses.
Check if the connection between your nodes is broken. Most often, this is the result of a badly configured firewall. This also may be the reason for a split brain condition, where the cluster is partitioned.
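To compare the multicast settings quickly, the relevant values can be extracted from the configuration file on each node. This is a minimal sketch that assumes the usual "key: value" line format of corosync.conf. The function name corosync_value is hypothetical.

```shell
#!/bin/sh
# Print the value of KEY (for example mcastaddr) for every matching
# "KEY: value" line in a corosync.conf-style FILE, one value per line.
# corosync_value is a hypothetical helper name.
corosync_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*:[[:space:]]*\([^[:space:]#]*\).*\$/\1/p" "$1"
}

# Possible usage on each node:
# for key in bindnetaddr mcastaddr mcastport; do
#     echo "$key=$(corosync_value /etc/corosync/corosync.conf "$key")"
# done
```

If several interface sections are configured, the function prints one value per section, so the output of each node should be compared line by line.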
Use the command crm_resource -L to learn about your current resources.
To check an OCF script use ocf-tester, for instance:
ocf-tester -n ip1 -o ip=YOUR_IP_ADDRESS \
/usr/lib/ocf/resource.d/heartbeat/IPaddr
Use -o multiple times for more parameters. The list
of required and optional parameters can be obtained by running
crm ra info
AGENT, for example:
crm ra info ocf:heartbeat:IPaddr
Before running ocf-tester, make sure the resource is not managed by the cluster.
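When an agent takes several parameters, the repeated -o options are easy to get wrong. The following sketch only assembles and prints the command line so that it can be reviewed before it is executed. The function name ocf_tester_cmd is hypothetical, and parameter values containing spaces are not quoted by this sketch.

```shell
#!/bin/sh
# Print an ocf-tester command line for resource NAME and agent script
# AGENT, adding one -o option per remaining name=value argument.
# ocf_tester_cmd is a hypothetical helper name; it runs nothing itself.
ocf_tester_cmd() {
    name=$1
    agent=$2
    shift 2
    cmd="ocf-tester -n $name"
    for param in "$@"; do
        cmd="$cmd -o $param"
    done
    printf '%s %s\n' "$cmd" "$agent"
}

# Possible usage (YOUR_IP_ADDRESS is a placeholder, as in the text above):
# ocf_tester_cmd ip1 /usr/lib/ocf/resource.d/heartbeat/IPaddr \
#     ip=YOUR_IP_ADDRESS cidr_netmask=24
```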
You may always add the --verbose parameter to your
commands. If you do that multiple times, the debug output becomes very
verbose. See /var/log/messages for useful hints.
Use the following commands:
crm resource list
crm resource cleanup rscid [node]
If you leave out the node, the resource is cleaned on all nodes. More information can be found in Section 6.4.2, “Cleaning Up Resources”.
Check whether /var/log/messages contains the following
lines:
Jan 12 09:58:55 clusternode2 lrmd: [3487]: info: RA output: (o2cb:1:start:stderr) 2009/01/12_09:58:55 ERROR: Could not load ocfs2_stackglue
Jan 12 16:04:22 clusternode2 modprobe: FATAL: Module ocfs2_stackglue not found.
In this case the kernel module ocfs2_stackglue.ko
is missing. Install the package
ocfs2-kmp-default,
ocfs2-kmp-pae or
ocfs2-kmp-xen depending on the installed kernel.
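The matching package name can be derived from the kernel release string. This sketch assumes that the kernel flavor is the last hyphen-separated field of the release, for example 2.6.32.12-0.7-default. The function name ocfs2_kmp_package is hypothetical.

```shell
#!/bin/sh
# Print the ocfs2-kmp package name matching a kernel release string.
# Without an argument, the release of the running kernel is used.
# ocfs2_kmp_package is a hypothetical helper name.
ocfs2_kmp_package() {
    release=${1:-$(uname -r)}
    flavor=${release##*-}    # text after the last hyphen
    case $flavor in
        default|pae|xen)
            echo "ocfs2-kmp-$flavor"
            ;;
        *)
            echo "unknown kernel flavor: $flavor" >&2
            return 1
            ;;
    esac
}

# Possible usage:
# zypper install "$(ocfs2_kmp_package)"
```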
For additional information about high availability on Linux and Heartbeat, including configuring cluster resources and managing and customizing a Heartbeat cluster, see http://clusterlabs.org/wiki/Documentation.