Troubleshooting

Contents

17.1. Installation Problems
17.2. Debugging a HA Cluster
17.3. FAQs
17.4. For More Information

Abstract

Strange problems that are not easy to understand may occur, especially when you start to experiment with High Availability. However, several utilities allow you to take a closer look at the internal High Availability processes. This chapter recommends various solutions.

Installation Problems

Troubleshooting difficulties when installing the packages or bringing the cluster online.

Are the HA packages installed?

The packages needed for configuring and managing a cluster are included in the High Availability installation pattern, available with the High Availability Extension.

Check whether the High Availability Extension is installed as an add-on to SUSE Linux Enterprise Server 11 SP1 on each cluster node and whether the High Availability pattern is installed on each machine, as described in Section 3.1, “Installing the High Availability Extension”.
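
The package check can be scripted. The following is a minimal sketch that reports which packages from a list are not installed on the local node. The function name missing_packages is hypothetical, and the package names in the usage note (corosync, openais, pacemaker) are assumptions derived from the tools mentioned in this chapter; adjust them to what the High Availability pattern installs on your system.

```shell
# Print the name of every given package that "rpm -q" does not find.
# Returns 0 if all packages are installed, 1 otherwise.
missing_packages() {
    rc=0
    for pkg in "$@"; do
        if ! rpm -q "$pkg" >/dev/null 2>&1; then
            echo "$pkg"
            rc=1
        fi
    done
    return $rc
}
```

Run it on each cluster node, for example as missing_packages corosync openais pacemaker. No output means that all listed packages are installed. To see which patterns are installed, zypper search --installed-only --type pattern can be used.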

Is the initial configuration the same for all cluster nodes?

In order to communicate with each other, all nodes belonging to the same cluster need to use the same bindnetaddr, mcastaddr and mcastport as described in Section 3.2, “Initial Cluster Setup”.

Check if the communication channels and options configured in /etc/corosync/corosync.conf are the same for all cluster nodes.

In case you use encrypted communication, check if the /etc/corosync/authkey file is available on all cluster nodes.

All corosync.conf settings with the exception of nodeid must be the same; authkey files on all nodes must be identical.
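
The comparison can be done with standard tools once the files of another node have been copied to the local machine. The following is a minimal sketch under these assumptions: the function names are hypothetical, nodeid appears on a line of its own in the form nodeid: VALUE, and comment lines start with #.

```shell
# Print a corosync.conf without comments, blank lines and nodeid lines,
# with leading and trailing whitespace removed, so that two copies can
# be compared.
corosync_conf_normalized() {
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        -e '/^#/d' -e '/^$/d' -e '/^nodeid[[:space:]]*:/d' "$1"
}

# Succeed if two corosync.conf files differ in nothing but nodeid.
corosync_conf_same() {
    [ "$(corosync_conf_normalized "$1")" = "$(corosync_conf_normalized "$2")" ]
}

# Succeed if two authkey files are byte-for-byte identical.
authkey_same() {
    cmp -s "$1" "$2"
}
```

For example, copy the file from node d42 with scp d42:/etc/corosync/corosync.conf /tmp/d42.conf and then run corosync_conf_same /etc/corosync/corosync.conf /tmp/d42.conf && echo identical. To see the differing lines, compare the two outputs of corosync_conf_normalized with diff. Treat a copied authkey as sensitive and delete it after the comparison.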

Does the firewall allow communication via the mcastport?

If the mcastport used for communication between the cluster nodes is blocked by the firewall, the nodes cannot see each other. When configuring the initial setup with YaST as described in Section 3.1, “Installing the High Availability Extension”, the firewall settings are usually automatically adjusted.

To make sure the mcastport is not blocked by the firewall, check the settings in /etc/sysconfig/SuSEfirewall2 on each node. Alternatively, start the YaST firewall module on each cluster node. After clicking Allowed Services → Advanced, add the mcastport to the list of allowed UDP Ports and confirm your changes.
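
Whether the port is listed can also be checked on the command line. The following is a minimal sketch. It assumes that the allowed UDP ports are stored in variables named FW_SERVICES_<ZONE>_UDP in /etc/sysconfig/SuSEfirewall2 and that the port is listed as a plain number; a port that is only covered by a range or a service name is not detected. The function name and the port number 5405 in the usage note are placeholders.

```shell
# Usage: udp_port_allowed PORT [FILE]
# Succeed if PORT is listed in one of the FW_SERVICES_*_UDP variables.
udp_port_allowed() {
    grep -E '^FW_SERVICES_[A-Z]+_UDP=' "${2:-/etc/sysconfig/SuSEfirewall2}" |
        tr -d "\"'" | tr ' =' '\n\n' | grep -qxF "$1"
}
```

For example, udp_port_allowed 5405 || echo "port 5405 is not in the list of allowed UDP ports". Replace 5405 with the mcastport value from /etc/corosync/corosync.conf.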

Is OpenAIS started on each cluster node?

Check the OpenAIS status on each cluster node with /etc/init.d/openais status. In case OpenAIS is not running, start it by executing /etc/init.d/openais start.
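
To run the check on all nodes from one machine, the status command can be wrapped in a loop. The following is a minimal sketch; it assumes that you can log in to every node as root with ssh without a password prompt and that the init script returns 0 only if OpenAIS is running. The function name is hypothetical.

```shell
# Print one line per node that tells whether OpenAIS is running there.
openais_status_all() {
    for node in "$@"; do
        if ssh -n "root@$node" /etc/init.d/openais status >/dev/null 2>&1; then
            echo "$node: running"
        else
            echo "$node: not running"
        fi
    done
}
```

Call it with the names of your cluster nodes as arguments. A node is also reported as not running if the ssh connection itself fails, so verify the result on the node before starting OpenAIS there.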

Debugging a HA Cluster

The following displays the resource operation history (option -o) and inactive resources (-r):

crm_mon -o -r

The display is refreshed when the status changes (to cancel, press Ctrl+C). An example could look like this:

Example 17.1. Stopped Resources

Refresh in 10s...

============
Last updated: Mon Jan 19 08:56:14 2009
Current DC: d42 (d42)
3 Nodes configured.
3 Resources configured.
============

Online: [ d230 d42 ]
OFFLINE: [ clusternode-1 ]

Full list of resources:

Clone Set: o2cb-clone
         Stopped: [ o2cb:0 o2cb:1 o2cb:2 ]
Clone Set: dlm-clone
         Stopped: [ dlm:0 dlm:1 dlm:2 ]
mySecondIP      (ocf::heartbeat:IPaddr):        Stopped

Operations:
* Node d230:
   aa: migration-threshold=1000000
    + (5) probe: rc=0 (ok)
    + (37) stop: rc=0 (ok)
    + (38) start: rc=0 (ok)
    + (39) monitor: interval=15000ms rc=0 (ok)
* Node d42:
   aa: migration-threshold=1000000
    + (3) probe: rc=0 (ok)
    + (12) stop: rc=0 (ok)

In this example, the node clusternode-1 is offline. First get the node online (see Section 17.3, “FAQs”). After that, check your resources and operations.
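
To pick the relevant lines out of a longer status display, the output of a single crm_mon run can be filtered. The following is a minimal sketch; it assumes that the output has the same layout as Example 17.1, where offline nodes are listed on a line starting with OFFLINE: and stopped resources are marked with the word Stopped. The function name is hypothetical.

```shell
# Show offline nodes and everything that is reported as stopped.
# -1 prints the status once and exits, -r includes inactive resources.
crm_mon_problems() {
    crm_mon -1 -r | awk '
        /^OFFLINE:/    { print; next }
        /^Clone Set:/  { header = $0; next }   # remember the clone set name
        /^[^ \t]/      { header = "" }         # any other unindented line ends the clone set
        /Stopped/      { if (header != "") { print header; header = "" }; print }
    '
}
```

With the cluster state from Example 17.1, the function prints the OFFLINE line, the two clone sets with their stopped instances, and the line for mySecondIP.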

The Configuration Explained PDF under http://clusterlabs.org/wiki/Documentation covers three different recovery types in the section “How Does the Cluster Interpret the OCF Return Codes?”.
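
The rc= values in the operation history are OCF return codes. The following is a minimal sketch that translates the codes into the names used by the OCF resource agent API and filters the operation history for operations that did not return 0. The function names are hypothetical, and the filter assumes the rc=N notation shown in Example 17.1. For the recovery type that the cluster applies to each code, refer to the section named above.

```shell
# Translate an OCF return code into its symbolic name.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo unknown ;;
    esac
}

# Print only the operations whose return code is not 0.
# Returns a non-zero exit status if there are none.
nonzero_operations() {
    crm_mon -1 -o | grep -E 'rc=[1-9][0-9]*'
}
```

Not every non-zero code indicates a failure: a probe that returns 7 (OCF_NOT_RUNNING) only states that the resource is not active on that node.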

FAQs

What is the state of my cluster?

To check the current state of your cluster, use either crm_mon or crm status. Both display the current DC (designated coordinator) as well as all the nodes and resources known by the current node.
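
For use in scripts, crm_mon can print the status once and exit instead of refreshing the display. The following is a minimal sketch that reduces this output to the DC and the node lists. It assumes the line prefixes Current DC:, Online: and OFFLINE: shown in Example 17.1; the function name is hypothetical.

```shell
# Print the current DC and the lists of online and offline nodes.
# -1 makes crm_mon print the status once and exit.
cluster_nodes_brief() {
    crm_mon -1 | grep -E '^(Current DC|Online|OFFLINE):'
}
```

For the cluster in Example 17.1, this prints the Current DC line followed by the Online and OFFLINE lines.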

Several nodes of my cluster do not see each other.

There could be several reasons. Check the following:

  • Look first in the configuration file /etc/corosync/corosync.conf and check if the multicast address is the same for every node in the cluster (look for the key mcastaddr in the interface section).

  • Check your firewall settings.

  • Check if your switch supports multicast addresses.

  • Check if the connection between your nodes is broken. Most often, this is the result of a badly configured firewall. This also may be the reason for a split brain condition, where the cluster is partitioned.
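
The first check in this list can be run for all nodes at once. The following is a minimal sketch; it assumes password-less ssh access as root to every node and a single interface section in /etc/corosync/corosync.conf. The function names are hypothetical.

```shell
# Print "<node> <mcastaddr>" for each node given as argument.
mcastaddr_per_node() {
    for node in "$@"; do
        addr=$(ssh -n "root@$node" \
            "sed -n 's/^[[:space:]]*mcastaddr:[[:space:]]*//p' /etc/corosync/corosync.conf") || addr=
        echo "$node ${addr:-unknown}"
    done
}

# Succeed if all given nodes report the same mcastaddr.
mcastaddr_consistent() {
    [ "$(mcastaddr_per_node "$@" | awk '{ print $2 }' | sort -u | wc -l)" -eq 1 ]
}
```

Call mcastaddr_per_node with the names of all cluster nodes and compare the second column. A node that cannot be reached or has no mcastaddr entry is listed as unknown.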

I want to list my currently known resources.

Use the command crm_resource -L to learn about your current resources.
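
To test in a script whether a particular resource is known to the cluster, the list can be searched for its ID. The following is a minimal sketch with a hypothetical function name. It makes no assumption about the layout of the list, but because it matches whole words, the ID o2cb would also be found in o2cb-clone or o2cb:0.

```shell
# Succeed if the resource ID appears as a word in the output of crm_resource -L.
resource_known() {
    crm_resource -L | grep -qwF "$1"
}
```

For example, resource_known mySecondIP || echo "mySecondIP is not configured".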

I configured a resource, but it always fails.

To check an OCF script use ocf-tester, for instance:

ocf-tester -n ip1 -o ip=YOUR_IP_ADDRESS \
  /usr/lib/ocf/resource.d/heartbeat/IPaddr

Use -o multiple times for more parameters. The list of required and optional parameters can be obtained by running crm ra info AGENT, for example:

crm ra info ocf:heartbeat:IPaddr

Before running ocf-tester, make sure the resource is not managed by the cluster.
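
When an agent needs several parameters, the ocf-tester command line becomes long. The following is a minimal sketch of a wrapper that adds one -o option per parameter. The function name and the OCF_TESTER variable (which allows you to substitute another command for ocf-tester) are hypothetical, as are the parameter values in the usage note; take the valid parameter names from the output of crm ra info.

```shell
# Usage: ocf_test NAME AGENT_PATH [KEY=VALUE ...]
# Runs ocf-tester with one -o option for every KEY=VALUE argument.
ocf_test() {
    name=$1
    agent=$2
    shift 2
    for param in "$@"; do
        # Append "-o KEY=VALUE" and drop the original argument.
        set -- "$@" -o "$param"
        shift
    done
    "${OCF_TESTER:-ocf-tester}" -n "$name" "$@" "$agent"
}
```

For example, ocf_test ip1 /usr/lib/ocf/resource.d/heartbeat/IPaddr ip=YOUR_IP_ADDRESS cidr_netmask=24 runs ocf-tester -n ip1 -o ip=YOUR_IP_ADDRESS -o cidr_netmask=24 /usr/lib/ocf/resource.d/heartbeat/IPaddr.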

I just get a failed message. Is it possible to get more information?

You can always add the --verbose parameter to your commands. If you add it multiple times, the debug output becomes more detailed. See /var/log/messages for useful hints.
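
/var/log/messages contains entries from many services. The following is a minimal sketch that shows only the most recent entries of selected cluster daemons. The daemon names in the default list (corosync, crmd, lrmd, pengine) are assumptions about what runs on your nodes, as is the function name.

```shell
# Usage: ha_log_tail [LINES] [FILE] [DAEMONS]
# Print the last LINES entries that were logged by one of DAEMONS
# (a |-separated list of program names).
ha_log_tail() {
    grep -E " (${3:-corosync|crmd|lrmd|pengine})(\[[0-9]+\])?: " "${2:-/var/log/messages}" |
        tail -n "${1:-20}"
}
```

For example, ha_log_tail 50 shows the last 50 matching entries from /var/log/messages, and ha_log_tail 20 /var/log/messages modprobe restricts the output to entries written by modprobe.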

How can I clean up my resources?

Use the following commands:

crm resource list
crm resource cleanup rscid [node]

If you leave out the node, the resource is cleaned on all nodes. More information can be found in Section 6.4.2, “Cleaning Up Resources”.
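
After a cleanup it is useful to look at the resource again. The following is a minimal sketch that combines both steps; the function name is hypothetical, and the status line is found by searching the one-shot output of crm_mon for the resource ID.

```shell
# Usage: cleanup_and_show RSCID [NODE]
# Clean up the resource (on NODE, or on all nodes if NODE is omitted),
# then print the lines of the status display that mention it.
cleanup_and_show() {
    crm resource cleanup "$1" ${2:+"$2"} || return
    crm_mon -1 -r | grep -F "$1"
}
```

For example, cleanup_and_show mySecondIP cleans up the resource on all nodes, and cleanup_and_show mySecondIP d42 restricts the cleanup to node d42. The cluster may need a moment to check the resource again, so the status shown directly after the cleanup can still be the old one.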

I cannot mount an ocfs2 device.

Check whether /var/log/messages contains the following lines:

Jan 12 09:58:55 clusternode2 lrmd: [3487]: info: RA output: (o2cb:1:start:stderr) 2009/01/12_09:58:55 
  ERROR: Could not load ocfs2_stackglue
Jan 12 16:04:22 clusternode2 modprobe: FATAL: Module ocfs2_stackglue not found.

In this case the kernel module ocfs2_stackglue.ko is missing. Install the package ocfs2-kmp-default, ocfs2-kmp-pae or ocfs2-kmp-xen depending on the installed kernel.
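
Which of the three packages matches the running kernel can be derived from the kernel release string. The following is a minimal sketch; it assumes that the release string printed by uname -r ends in the kernel flavor (default, pae or xen). The function name is hypothetical.

```shell
# Print the ocfs2-kmp package that matches a kernel release string
# (default: the release of the running kernel).
ocfs2_kmp_package() {
    release=${1:-$(uname -r)}
    case "$release" in
        *-default|*-pae|*-xen) echo "ocfs2-kmp-${release##*-}" ;;
        *) echo "unknown kernel flavor: $release" >&2; return 1 ;;
    esac
}
```

On a node whose kernel release ends in -default, ocfs2_kmp_package prints ocfs2-kmp-default. Check whether that package is present with rpm -q and, if it is missing, install it, for example with zypper install "$(ocfs2_kmp_package)".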

For More Information

For additional information about high availability on Linux and Heartbeat, including configuring cluster resources and managing and customizing a Heartbeat cluster, see http://clusterlabs.org/wiki/Documentation.