
Re: 3.9 Default Router Malfunction When 1 of 3 Pods is Down

Routers all watch all routes.  What are you fronting your routers with for HA?  VRRP?  An F5 or cloud load balancer?  DNS?
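To illustrate why the HA front end matters here, the following is a minimal sketch, not a diagnosis of this cluster. It assumes a front end (for example plain DNS round robin) that keeps rotating across all three infra node addresses without health checks, while a router pod is running on only two of them. The node names and the helper functions are hypothetical.

```python
from itertools import cycle


def simulate_requests(backends, running, n_requests):
    """Rotate n_requests over backends in round-robin order.

    Returns a list of (backend, served) tuples, where served is True only
    if a router pod is assumed to be running on that backend.
    """
    rotation = cycle(backends)
    results = []
    for _ in range(n_requests):
        node = next(rotation)
        results.append((node, node in running))
    return results


def failure_rate(results):
    """Fraction of requests that landed on a backend without a router."""
    failed = sum(1 for _, served in results if not served)
    return failed / len(results)


if __name__ == "__main__":
    # Hypothetical node names; infra-3 has no running router pod.
    nodes = ["infra-1", "infra-2", "infra-3"]
    running = {"infra-1", "infra-2"}
    results = simulate_requests(nodes, running, 9)
    print(f"failed fraction: {failure_rate(results):.2f}")
```

Under these assumptions roughly one request in three fails regardless of which route it targets, which from the outside can look like random routes dropping in and out even though every running router serves every route.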

On Sep 2, 2018, at 6:18 AM, Stan Varlamov <stan varlamov exlinc com> wrote:

I went through a pretty scary partial and uncontrollable outage in a 3.9 cluster that turned out to be caused by issues in the default, out-of-the-box Router. The original installation had 3 region=infra nodes, where the 3 router pods were installed by the generic ansible cluster installation. 2 of the 3 nodes were subsequently re-labeled at some point in the past, and after one node was restarted, random routes suddenly started “disappearing”, causing 502s. I noticed that one of the 3 Router pods was in Pending due to a lack of available nodes. Bottom line: until I got all 3 pods back into operation (I tried dropping the nodeselector requirements but ended up re-labeling the nodes back to infra), the routes would not come back. I would have expected even one working Router to be able to serve all routes in the cluster, but that was not the case. I couldn’t find a pattern in which routes were off vs. those that stayed on, and some routes would pop in and out of operation. Is there something in the Router design that relies on all of its pods working? It appears that individual Router pods are “responsible” for some routes in the cluster rather than just providing redundancy.
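The mismatch between re-labeled nodes and the router's node selector described above can be checked offline. The sketch below is only an illustration: it assumes the node list was exported with `oc get nodes -o json`, that the router selects on a simple label map such as region=infra, and that at most one router pod fits per node. It ignores taints and unschedulable nodes, and the function names are hypothetical.

```python
def matches_selector(labels, selector):
    """True if every key/value pair in selector is present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def eligible_nodes(node_list, selector):
    """Names of nodes in a parsed node list whose labels satisfy selector."""
    names = []
    for node in node_list.get("items", []):
        meta = node.get("metadata", {})
        if matches_selector(meta.get("labels") or {}, selector):
            names.append(meta.get("name"))
    return names


def placement_gap(node_list, selector, replicas):
    """Number of replicas that have no matching node (one pod per node assumed)."""
    return max(0, replicas - len(eligible_nodes(node_list, selector)))


if __name__ == "__main__":
    # Hypothetical data shaped like the output of `oc get nodes -o json`.
    nodes = {
        "items": [
            {"metadata": {"name": "node-a", "labels": {"region": "infra"}}},
            {"metadata": {"name": "node-b", "labels": {"region": "primary"}}},
            {"metadata": {"name": "node-c", "labels": {"region": "primary"}}},
        ]
    }
    selector = {"region": "infra"}
    print(eligible_nodes(nodes, selector))
    print(placement_gap(nodes, selector, replicas=3))
```

A non-zero gap means that a router pod evicted from a re-labeled node (for example by a restart) has nowhere to be rescheduled and stays in Pending.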




users mailing list
users lists openshift redhat com
