Specifying Resource Failover Nodes

A resource is automatically restarted if it fails. If the restart cannot be achieved on the current node, or if the resource fails N times on the current node, the cluster tries to fail it over to another node. You can define the number of failures (the migration-threshold) after which a resource migrates to a new node. If your cluster has more than two nodes, the High Availability software chooses the node that a particular resource fails over to.

If you want to choose which node a resource will fail over to, you must do the following:

  1. Configure a location constraint for that resource as described in Adding or Modifying Locational Constraints.

  2. Add the migration-threshold meta attribute to that resource as described in Adding or Modifying Meta and Instance Attributes and enter a Value for the migration-threshold. The value should be positive and less than INFINITY.

  3. If you want to automatically expire the failcount for a resource, add the failure-timeout meta attribute to that resource as described in Adding or Modifying Meta and Instance Attributes and enter a Value for the failure-timeout.

  4. If you want to specify additional failover nodes with preferences for a resource, create additional location constraints.

For example, assume you have configured a location constraint for resource r1 to preferably run on node1. If r1 fails there, its failcount is compared to the migration-threshold. If failcount >= migration-threshold, the resource is migrated to the node with the next best preference.
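The comparison described above can be sketched as a small model. This is only an illustration of the decision rule, not the cluster's actual implementation: the function name pick_node, the dictionaries, and the scores and node names in the usage lines are hypothetical.

```python
def pick_node(preferences, failcounts, migration_threshold):
    """Return the preferred node that may still run the resource.

    preferences: node name -> location constraint score (hypothetical values)
    failcounts:  node name -> failures of the resource on that node
    """
    # A node is excluded once failcount >= migration-threshold.
    allowed = {
        node: score
        for node, score in preferences.items()
        if failcounts.get(node, 0) < migration_threshold
    }
    if not allowed:
        return None  # no node is left that may run the resource
    # Among the remaining nodes, take the one with the best preference.
    return max(allowed, key=allowed.get)


# r1 prefers node1, then node2, then node3 (illustrative scores).
prefs = {"node1": 200, "node2": 100, "node3": 50}
print(pick_node(prefs, {"node1": 1}, migration_threshold=2))
print(pick_node(prefs, {"node1": 2}, migration_threshold=2))
```

With one failure on node1 the resource stays there. With two failures the threshold is reached and the sketch selects node2, the node with the next best preference.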

By default, once the threshold has been reached, the node will no longer be allowed to run the failed resource until the administrator manually resets the resource’s failcount (after fixing the failure cause).

However, it is possible to expire the failcounts by setting the resource’s failure-timeout option. For example, a setting of migration-threshold=2 and failure-timeout=60s causes the resource to migrate to a new node after two failures and potentially allows it to move back (depending on the stickiness and constraint scores) after one minute.
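The interaction of the two settings can be modeled as follows. This is a minimal sketch under stated assumptions: times are plain seconds, the whole failcount is assumed to expire once the most recent failure is older than failure-timeout, and the function names are hypothetical. Stickiness and constraint scores, which decide whether the resource actually moves back, are not modeled.

```python
def effective_failcount(failcount, last_failure, now, failure_timeout=None):
    """Return the failcount that is acted on at time `now` (seconds)."""
    # Without failure-timeout, failures never expire on their own.
    if failure_timeout is None:
        return failcount
    # Assumption: the count expires as a whole once the last failure is old enough.
    if now - last_failure >= failure_timeout:
        return 0
    return failcount


def node_allowed(failcount, last_failure, now, migration_threshold,
                 failure_timeout=None):
    """True if the node may run the resource (again)."""
    current = effective_failcount(failcount, last_failure, now, failure_timeout)
    return current < migration_threshold


# migration-threshold=2, failure-timeout=60s, second failure at t=100.
print(node_allowed(2, last_failure=100, now=130,
                   migration_threshold=2, failure_timeout=60))
print(node_allowed(2, last_failure=100, now=160,
                   migration_threshold=2, failure_timeout=60))
```

Thirty seconds after the second failure the node is still excluded. After sixty seconds the failcount has expired and the node becomes eligible again.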

There are two exceptions to the migration threshold concept, which occur when a resource either fails to start or fails to stop. Start failures set the failcount to INFINITY and thus always cause an immediate migration. Stop failures cause fencing (when stonith-enabled is set to true, which is the default). If no STONITH resource is defined (or stonith-enabled is set to false), the resource will not migrate at all.
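The two exceptions can be laid next to the regular threshold check in one small decision function. This is an illustrative sketch only: the function name, the returned strings, and the use of "monitor" as an example of an ordinary failing operation are assumptions, and 1000000 is used as a numeric stand-in for INFINITY.

```python
INFINITY = 1_000_000  # numeric stand-in for INFINITY (assumption of this sketch)


def failure_outcome(operation, failcount, migration_threshold,
                    stonith_enabled=True, stonith_resource_defined=True):
    """Return what happens after `operation` fails on the current node."""
    if operation == "stop":
        # Stop failures bypass the threshold: fence the node if possible.
        if stonith_enabled and stonith_resource_defined:
            return "fence"
        return "no migration"
    if operation == "start":
        # Start failures set the failcount to INFINITY.
        failcount = INFINITY
    else:
        # Any other failure counts as one more failure.
        failcount += 1
    # Regular rule: migrate once failcount >= migration-threshold.
    return "migrate" if failcount >= migration_threshold else "restart"


print(failure_outcome("monitor", 0, migration_threshold=3))
print(failure_outcome("start", 0, migration_threshold=3))
print(failure_outcome("stop", 0, migration_threshold=3, stonith_enabled=False))
```

Because migration-threshold must be less than INFINITY, a start failure always satisfies the threshold check, which is why it leads to an immediate migration regardless of the configured value.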

To clean up the failcount for a resource with the Linux HA Management Client, select Management in the left pane, select the respective resource in the right pane, and click Cleanup Resource in the toolbar. This executes the commands crm_resource -C and crm_failcount -D for the specified resource on the specified node. For more information, see crm_resource(8) and crm_failcount(8).