@Component public class HealthCheckServiceImpl extends Object implements HealthCheckService
| Constructor and Description |
|---|
| HealthCheckServiceImpl() |
| Modifier and Type | Method and Description |
|---|---|
| void | healFailedNodes() "Cluster healing" is meant to repair inconsistent database state that results from an unclean cluster shutdown or network hardware failure. |
public void healFailedNodes()
Scenario 1: A node takes a work unit from the queue and updates the node_name field of the associated workflow instance. It may also update
the status field of the workflow instance or of the work item. Afterwards, the node is found dead or failed.
Scenario 2: The master node locks a workflow instance but fails before it can add the workflow instance to the work unit queue.
Resolution for scenario 1:
Resolution for scenario 2:
First of all, we need to make sure that the work unit queue is empty and that every consumer has had sufficient time to assign its most recently
taken work unit to itself. To this end, every consumer is granted the so-called "maximum node assignment time."
Two different kinds of errors may leave workflow instances with (locked, nodeName) = (true, null) even though the work unit queue has been empty for at least the maximum node assignment time. The first is that the distributed queue failed. The second is that the node which took the element from the queue failed between taking the element and assigning the process execution to itself.
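The condition described above can be illustrated with a minimal sketch. It is not the actual implementation: `WorkflowInstanceRow`, its fields, and `findLockedButUnassigned` are hypothetical names, and an in-memory list stands in for the database table.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a workflow instance row (field names are assumptions).
final class WorkflowInstanceRow {
    final String id;
    final boolean locked;
    final String nodeName; // null means that no node has assigned the instance to itself

    WorkflowInstanceRow(String id, boolean locked, String nodeName) {
        this.id = id;
        this.locked = locked;
        this.nodeName = nodeName;
    }
}

final class OrphanedInstanceSketch {

    // Returns the rows with (locked, nodeName) = (true, null).
    // The result only indicates a failure if the work unit queue has already been
    // empty for at least the maximum node assignment time; this sketch does not check that.
    static List<WorkflowInstanceRow> findLockedButUnassigned(List<WorkflowInstanceRow> rows) {
        List<WorkflowInstanceRow> result = new ArrayList<>();
        for (WorkflowInstanceRow row : rows) {
            if (row.locked && row.nodeName == null) {
                result.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<WorkflowInstanceRow> rows = new ArrayList<>();
        rows.add(new WorkflowInstanceRow("wf-1", true, null));     // locked, never claimed
        rows.add(new WorkflowInstanceRow("wf-2", true, "node-a")); // locked and claimed
        rows.add(new WorkflowInstanceRow("wf-3", false, null));    // not locked
        for (WorkflowInstanceRow row : findLockedButUnassigned(rows)) {
            System.out.println("needs healing: " + row.id);
        }
    }
}
```

In a real deployment this check would presumably be a database query rather than a loop over a list.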
The recovery of this scenario is expensive because we need to suspend the producer, wait for the queue to become empty, wait for the maximum node assignment grace period, do the recovery and resume the producer. For this reason, this advanced recovery is not run on every health check.
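The sequence of steps above might look roughly like the following sketch. The `Producer` interface, the method name `runAdvancedRecovery`, the polling interval, and the use of a `BlockingQueue` in place of the distributed queue are all assumptions made for illustration, not part of this API.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class AdvancedRecoverySketch {

    // Hypothetical handle on the component that feeds the work unit queue.
    interface Producer {
        void suspend();
        void resume();
    }

    // Runs the expensive recovery sequence. Returns true if the recovery step ran,
    // false if the thread was interrupted while waiting.
    static boolean runAdvancedRecovery(Producer producer,
                                       BlockingQueue<?> workUnitQueue,
                                       long maxNodeAssignmentMillis,
                                       long pollMillis,
                                       Runnable recovery) {
        producer.suspend(); // 1. stop new work units from being queued
        try {
            while (!workUnitQueue.isEmpty()) { // 2. wait for the queue to drain
                Thread.sleep(pollMillis);
            }
            Thread.sleep(maxNodeAssignmentMillis); // 3. grace period for consumers
            recovery.run(); // 4. repair the inconsistent state
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        } finally {
            producer.resume(); // 5. always resume the producer
        }
    }

    public static void main(String[] args) {
        Producer producer = new Producer() {
            public void suspend() { System.out.println("producer suspended"); }
            public void resume() { System.out.println("producer resumed"); }
        };
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        runAdvancedRecovery(producer, queue, 100L, 10L,
                () -> System.out.println("recovering locked, unassigned instances"));
    }
}
```

Resuming the producer in a `finally` block is a design choice of this sketch: it keeps the producer from staying suspended if the recovery step throws or the wait is interrupted.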
healFailedNodes in interface HealthCheckService
Copyright © 2018. All rights reserved.