Auto repair nodes

A. Overview

This article explains how the node auto-repair feature works in FPT Cloud's D-FKE product.

Node auto-repair helps keep the nodes in a cluster in the Ready state. When the feature is enabled, node-related events are monitored continuously. If a node stays in the NotReady state for more than 3 minutes, node auto-repair takes the actions needed to return it to Ready.
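To see the same readiness information that auto-repair reacts to, you can query the Ready condition of each node yourself. The following is a minimal sketch, not part of D-FKE: the function names list_node_readiness and filter_not_ready are hypothetical, and it assumes kubectl is already configured for the cluster.

```shell
# Print one line per node: name, Ready condition status, and the time
# that status last changed, separated by tabs.
list_node_readiness() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\t"}{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}{end}'
}

# Read "name<TAB>status<TAB>time" lines on stdin and print the names of
# nodes whose Ready status is not "True" (kubectl shows these as NotReady).
filter_not_ready() {
  awk -F '\t' '$2 != "True" { print $1 }'
}

# Usage (requires cluster access):
#   list_node_readiness | filter_not_ready
```

A node listed here with a last transition time older than 3 minutes is one that auto-repair would be expected to act on.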

B. Installation

For clusters created after the feature was released, node auto-repair is enabled automatically.
For clusters created before the release, run the following command:

kubectl apply -f https://raw.githubusercontent.com/fci-xplat/fke-config/main/node-autorepair

C. Configuration

The ConfigMap auto-repair-configmap in the kube-system namespace has two fields that you can edit:

enable_rebooting_node (default "true"): allows node auto-repair to reboot faulty nodes.

enable_replacing_node (default "true"): allows node auto-repair to delete faulty nodes.

If you want the set of nodes in your cluster to remain stable, so that node auto-repair never deletes a node, set enable_replacing_node to "false".
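One way to change a field without opening an editor is a JSON merge patch. The sketch below assumes the two fields are stored as top-level string keys under data in the ConfigMap, which this article does not confirm. The function names build_flag_patch and set_auto_repair_flag are hypothetical.

```shell
# Build a JSON merge patch that sets one auto-repair field.
# $1 = field name, $2 = "true" or "false" (ConfigMap values are strings).
build_flag_patch() {
  printf '{"data":{"%s":"%s"}}' "$1" "$2"
}

# Apply the patch to auto-repair-configmap in the kube-system namespace.
# Assumption: the fields live directly under .data in that ConfigMap.
set_auto_repair_flag() {
  kubectl -n kube-system patch configmap auto-repair-configmap \
    --type merge -p "$(build_flag_patch "$1" "$2")"
}

# Usage (requires cluster access):
#   set_auto_repair_flag enable_replacing_node false
```

Afterward, kubectl -n kube-system get configmap auto-repair-configmap -o yaml shows the stored values. Whether the change takes effect immediately or only after the auto-repair component restarts is not stated here.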

D. Basic workflow explanation

A node that has been NotReady for more than 3 minutes is rebooted.
The node is rebooted at most 3 times.
If the node is still not Ready after the reboots, all pods on it are drained.
If the cluster does not have enough resources to run the pods drained from the faulty node, a new node is added to the cluster (Cluster Autoscaler action).
The drained node is deleted automatically after 15 minutes (Cluster Autoscaler action).
If Cluster Autoscaler is not active, node auto-repair itself deletes the faulty node after 60 minutes.
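The steps above can be sketched as a small decision function. This is only an illustration of the thresholds listed in this section, not the D-FKE implementation: the function names are hypothetical, and the sketch leaves out how enable_rebooting_node and enable_replacing_node change the flow, because this article does not describe that.

```shell
# Thresholds taken from the workflow described in this section.
NOT_READY_THRESHOLD_SECONDS=180
MAX_REBOOT_ATTEMPTS=3

# Decide the next repair step for a node.
# $1 = seconds the node has been NotReady
# $2 = number of reboots already attempted
next_repair_step() {
  if [ "$1" -le "$NOT_READY_THRESHOLD_SECONDS" ]; then
    echo "wait"      # not yet NotReady for more than 3 minutes
  elif [ "$2" -lt "$MAX_REBOOT_ATTEMPTS" ]; then
    echo "reboot"    # reboot attempts remain
  else
    echo "drain"     # reboots exhausted, move pods off the node
  fi
}

# Minutes until the faulty node is deleted.
# $1 = "true" if Cluster Autoscaler is active, anything else otherwise
deletion_delay_minutes() {
  if [ "$1" = "true" ]; then
    echo 15          # Cluster Autoscaler removes the drained node
  else
    echo 60          # node auto-repair removes the node itself
  fi
}
```

For example, next_repair_step 200 3 selects the drain step, because the node has passed the 3-minute threshold and all three reboot attempts are used up.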

E. Notes

The D-FKE backend system may add, modify, or delete nodes in the cluster (this commonly occurs with clusters that have a firewall configured).
From within the cluster, it is possible to log in to the infrastructure where the cluster nodes run.
