It seems that manual intervention (logging on and evacuating the node) is the price we have to pay if we don't want our master to wreak havoc in our cluster when it has connectivity problems.
Maybe this whole mechanism could be built in a more defensive way. What we are missing is an option to simply re-create the pods from that node somewhere else once the node has been unreachable for 5 minutes, and to evacuate the node only after, say, 4 hours. The node might still be working properly and serving requests; it might merely be unreachable from the master, as it was in our case.
Such an option would be great to have, because all our services are built so that multiple instances of them can run in the network at the same time.
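To make the idea concrete, here is a rough sketch in Python of the two-stage policy I have in mind. The function name, the action labels and the thresholds are purely illustrative; as far as I know this is not an existing OpenShift or Kubernetes option.

```python
from datetime import timedelta

# Hypothetical thresholds, taken from the proposal above.
RECREATE_AFTER = timedelta(minutes=5)   # start replacement pods elsewhere
EVACUATE_AFTER = timedelta(hours=4)     # only now evacuate the node itself


def action_for_unreachable_node(unreachable_for: timedelta) -> str:
    """Decide what the master should do with a node it cannot reach.

    Illustrative only: the returned labels are made-up names,
    not real API values.
    """
    if unreachable_for >= EVACUATE_AFTER:
        return "evacuate"
    if unreachable_for >= RECREATE_AFTER:
        # The node's own pods are left alone; they may still be serving.
        return "recreate-elsewhere"
    return "wait"
```

With this policy, a node that drops off the master's radar for half an hour would only cause extra copies of its pods to be started elsewhere, while its own pods stay untouched.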
Best regards, and thanks for your support, Clayton!
On 2016-10-12 at 17:44, Clayton Coleman wrote: