A resource is restarted automatically if it fails. If the restart does
not succeed on the current node, or if the resource fails N times on
that node, the cluster tries to fail it over to another node. You can
define the number of failures (the migration-threshold) after which a
resource migrates to a new node. If your cluster has more than two
nodes, the High Availability software chooses the node that a
particular resource fails over to.
If you want to choose which node a resource will fail over to, you must do the following:
Configure a location constraint for that resource as described in Adding or Modifying Locational Constraints.
Add the migration-threshold meta attribute to that
resource as described in
Adding or Modifying Meta and Instance Attributes and enter a
value for the migration-threshold. The value should
be positive and less than INFINITY.
If you want the failcount for a resource to expire automatically, add
the failure-timeout meta attribute to that resource
as described in Adding or Modifying Meta and Instance Attributes
and enter a value for the failure-timeout.
If you want to specify additional failover nodes with preferences for a resource, create additional location constraints.
For example, assume you have configured a location constraint for
resource r1 to preferably run on
node1. If the resource fails there, its failcount is
compared to the migration-threshold.
If failcount >= migration-threshold, the resource is
migrated to the node with the next best preference.
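
The comparison described above can be sketched as a small Python model. This is an illustration only, not how the cluster software is implemented: the function name pick_node, the node names and the location scores are hypothetical, and the model ignores stickiness and all other constraints.

```python
def pick_node(location_scores, failcounts, migration_threshold):
    """Return the best-scoring node whose failcount is below the threshold.

    location_scores: hypothetical mapping of node name to location score.
    failcounts: mapping of node name to the resource's failcount there.
    Returns None if every node has reached the threshold.
    """
    eligible = {
        node: score
        for node, score in location_scores.items()
        if failcounts.get(node, 0) < migration_threshold
    }
    if not eligible:
        return None
    # The node with the highest remaining score has the next best preference.
    return max(eligible, key=eligible.get)


# r1 prefers node1, then node2, then node3 (scores are made up).
scores = {"node1": 200, "node2": 100, "node3": 50}

print(pick_node(scores, {"node1": 1}, migration_threshold=3))  # node1
print(pick_node(scores, {"node1": 3}, migration_threshold=3))  # node2
```

In this model r1 stays on node1 while its failcount there is below 3 and moves to node2 once the failcount reaches the threshold.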
By default, once the threshold has been reached, the node will no longer be allowed to run the failed resource until the administrator manually resets the resource’s failcount (after fixing the failure cause).
However, it is possible to expire the failcounts by setting the
resource’s failure-timeout option. So a setting of
migration-threshold=2 and
failure-timeout=60s would cause the resource to
migrate to a new node after two failures and potentially allow it to move
back (depending on the stickiness and constraint scores) after one
minute.
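
The effect of failure-timeout can be illustrated with a simplified Python model. It assumes that the failcount is treated as zero once failure-timeout seconds have passed since the most recent failure. The function names and the timestamps are hypothetical, and the model does not account for when the cluster actually re-evaluates the expiry.

```python
def effective_failcount(failcount, last_failure, now, failure_timeout=None):
    """Return the failcount the model takes into account at time `now`.

    All times are in seconds. Without a failure_timeout the failcount
    never expires and must be reset manually.
    """
    if failure_timeout is not None and now - last_failure >= failure_timeout:
        return 0
    return failcount


def is_banned(failcount, last_failure, now, migration_threshold, failure_timeout=None):
    """True if the node may not run the resource at time `now`."""
    current = effective_failcount(failcount, last_failure, now, failure_timeout)
    return current >= migration_threshold


# migration-threshold=2, failure-timeout=60s; second failure happened at t=10.
print(is_banned(2, last_failure=10, now=30, migration_threshold=2, failure_timeout=60))  # True
print(is_banned(2, last_failure=10, now=70, migration_threshold=2, failure_timeout=60))  # False
# Without failure-timeout the node stays banned until a manual cleanup.
print(is_banned(2, last_failure=10, now=70, migration_threshold=2))  # True
```

Whether the resource really moves back after the failcount expires still depends on the stickiness and constraint scores, which this sketch leaves out.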
There are two exceptions to the migration threshold concept, occurring
when a resource either fails to start or fails to stop. Start failures
set the failcount to INFINITY and thus always cause an immediate
migration. Stop failures cause fencing (when
stonith-enabled is set to true,
which is the default). If no STONITH resource is defined (or
stonith-enabled is set to false),
the resource will not migrate at all.
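
The two exceptions, together with the normal threshold check, can be summarized in a short Python decision sketch. The function name and the returned labels are hypothetical and only mirror the rules stated above. The value 1000000 for INFINITY follows Pacemaker's score convention.

```python
INFINITY = 1_000_000  # Pacemaker represents an INFINITY score as 1000000


def failure_outcome(failed_op, failcount, migration_threshold,
                    stonith_enabled=True, has_stonith_resource=True):
    """Return what the model expects after a failed operation.

    failed_op: "start", "stop", or any other operation such as "monitor".
    failcount: failcount on the current node, including this failure.
    """
    if failed_op == "start":
        # Start failures set the failcount to INFINITY: immediate migration.
        failcount = INFINITY
    elif failed_op == "stop":
        # Stop failures are handled by fencing, not by the threshold.
        if stonith_enabled and has_stonith_resource:
            return "fence node"
        return "no migration"
    if failcount >= migration_threshold:
        return "migrate"
    return "restart on current node"


print(failure_outcome("monitor", failcount=1, migration_threshold=3))  # restart on current node
print(failure_outcome("start", failcount=1, migration_threshold=3))    # migrate
print(failure_outcome("stop", failcount=1, migration_threshold=3))     # fence node
print(failure_outcome("stop", 1, 3, stonith_enabled=False))            # no migration
```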
To clean up the failcount for a resource with the Linux HA Management Client, select in the left pane, select the respective resource in the right pane and click in the toolbar. This executes the commands crm_resource -C and crm_failcount -D for the specified resource on the specified node. For more information, see also crm_resource(8) and crm_failcount(8).