
Re: Complete cluster meltdown due to "Kubelet stopped posting node status"



Yeah, if you make this change you'll be responsible for triggering evacuation of down nodes.  You can do that via "oadm manage-node NODE_NAME --evacuate"
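
A rough sketch of what that could look like with the 3.x client, using the node name from the logs below (the --schedulable, --list-pods and --evacuate options are what I'd use here; double-check them against your oadm version):

  # keep new pods off the unresponsive node
  oadm manage-node openshiftnode.com --schedulable=false

  # see which pods are still assigned to it
  oadm manage-node openshiftnode.com --list-pods

  # delete those pods so their controllers reschedule them on healthy nodes
  oadm manage-node openshiftnode.com --evacuate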

On Mon, Oct 10, 2016 at 8:06 AM, v <vekt0r7 gmx net> wrote:
Hello Clayton,

thank you for replying!
I'm not sure whether changing the node failure detection threshold is the right way to go. I have found this:

https://docs.openshift.com/enterprise/3.1/install_config/master_node_configuration.html
  masterIP: 10.0.2.15
  podEvictionTimeout: 5m
  schedulerConfigFile: ""

I think that podEvictionTimeout is the thing that bit us. After changing that to "24h" I don't see any "Evicting pods on node" or "Recording Deleting all Pods from Node" messages in the master logs any more.
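
For reference, this is roughly what the relevant section of our master-config.yaml looks like after the change (I'm assuming the fields sit under kubernetesMasterConfig as in the linked docs; adjust to wherever they live in your config):

  kubernetesMasterConfig:
    masterIP: 10.0.2.15
    # was 5m; raised so the master stops evicting pods from "silent" nodes
    podEvictionTimeout: 24h
    schedulerConfigFile: ""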

Regards
v

On 2016-10-10 at 15:21, Clayton Coleman wrote:
Network segmentation mode is in 1.3.  In 1.1 or 1.2 you can also
increase the node failure detection threshold (80s by default) as high
as you want by setting the extended controller argument for it, which
will delay evictions (you could set 24h and use external tooling to
handle node down).
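
As a sketch, assuming the threshold maps to the controller manager's node-monitor-grace-period flag (verify the exact flag name for your version) and that you pass it through controllerArguments in master-config.yaml, it would look something like:

  kubernetesMasterConfig:
    controllerArguments:
      # how long a node can go without posting status before it is marked down
      node-monitor-grace-period:
      - "24h"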

If you are concerned about external traffic causing a DDoS, add a proxy
configuration for your masters that rate-limits traffic by cookie or
source IP.



On Oct 10, 2016, at 2:56 AM, v <vekt0r7 gmx net> wrote:

Hello,

we just had our whole OpenShift cluster go down hard due to a "feature" in the OpenShift master that deletes all pods from a node if the node doesn't report back to the master on a regular basis.

Turns out we're not the only ones who have been bitten by this "feature":
https://github.com/kubernetes/kubernetes/issues/30972#issuecomment-241077740
https://github.com/kubernetes/kubernetes/issues/24200

I am writing here to find out whether it is possible to disable this feature completely. We don't need it and we don't want our master to ever do something like that again.

Please note how easily this feature can be abused: at the moment anyone can bring down your whole OpenShift cluster just by DDoSing the master(s) for a few minutes.

The logs (they were the same for all nodes):
Okt 09 21:47:10 openshiftmaster.com origin-master[919215]: I1004 21:47:10.804666  919215 nodecontroller.go:697] node openshiftnode.com hasn't been updated for 5m17.169004459s. Last out of disk condition is: &{Type:OutOfDisk Status:Unknown LastHeartbeatTime:2016-10-04 21:41:53 +0200 CEST LastTransitionTime:2016-10-04 21:42:33 +0200 CEST Reason:NodeStatusUnknown Message:Kubelet stopped posting node status.}
Okt 09 21:47:10 openshiftmaster.com origin-master[919215]: I1004 21:47:10.804742  919215 nodecontroller.go:451] Evicting pods on node openshiftnode.com: 2016-10-04 21:47:10.80472667 +0200 CEST is later than 2016-10-04 21:42:33.779813315 +0200 CEST + 4m20s
Okt 09 21:47:10 openshiftmaster.com origin-master[919215]: I1004 21:47:10.945766  919215 nodecontroller.go:540] Recording Deleting all Pods from Node openshiftnode.com. event message for node openshiftnode.com

Regards
v

_______________________________________________
users mailing list
users lists openshift redhat com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


_______________________________________________
users mailing list
users lists openshift redhat com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


