
Re: Missing OpenShift Nodes - Unable to Join Cluster





On Fri, Sep 9, 2016 at 10:18 AM, Isaac Christoffersen <ichristoffersen vizuri com> wrote:
So the hostnames did not change. After rolling back to just the BYO configuration and removing the AWS settings, I was able to get back up and running, which means the certificates were good as well.

I lost the ability to use EBS volumes by doing this, but we're in the process of moving to EFS anyway.

I suspect the issue is tied to the fact that these nodes have multiple aliases and a different local hostname than the one shown in the EC2 console.  However, I'm not sure why this manifested itself after running successfully for four weeks.

That is definitely odd; I would expect the hostname not to matter. For the cloud provider integration, the value of the nodeName setting in /etc/origin/node/node-config.yaml should match the private-dns-name attribute of the instance.
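
For example, the two values can be compared with something like this (the instance ID below is a placeholder):

    # What AWS reports as the instance's private DNS name
    aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
        --query 'Reservations[].Instances[].PrivateDnsName' --output text

    # What the kubelet will register as the node name
    grep nodeName /etc/origin/node/node-config.yaml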


Either way, I'm moving on with just BYO.

thanks,

Isaac

Isaac Christoffersen
Technical Director

Vizuri, a division of AEM Corporation
13880 Dulles Corner Lane # 300
Herndon, Virginia 20171
www.vizuri.com | @1Vizuri


On Thu, Sep 8, 2016 at 10:36 PM, Isaac Christoffersen <ichristoffersen vizuri com> wrote:
No, the hostnames are the same.  Because I was getting the "external ID from cloud provider" error, I disabled the AWS configuration settings and left it as a purely BYO configuration.

This allowed me to get my nodes back up.  There's definitely something off with the AWS cloud provider settings and how instance names for nodes are being looked up.
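
For reference, the settings I'm referring to live in the node config; the stanza sketched in the comment below is my understanding of a stock install, so treat it as an assumption:

    # Show the cloud provider settings the kubelet picks up; with AWS
    # enabled the output should look roughly like:
    #   kubeletArguments:
    #     cloud-provider:
    #     - "aws"
    #     cloud-config:
    #     - "/etc/origin/cloudprovider/aws.conf"
    grep -A 2 'cloud-provider\|cloud-config' /etc/origin/node/node-config.yaml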

I only need the AWS config for EBS storage for Persistent Volumes, so I can't fully disable the AWS settings.

How does the external ID lookup work?  Can I verify the settings it expects?

Isaac Christoffersen
Technical Director

Vizuri, a division of AEM Corporation
13880 Dulles Corner Lane # 300
Herndon, Virginia 20171
www.vizuri.com | @1Vizuri


On Thu, Sep 8, 2016 at 9:24 PM, Jason DeTiberus <jdetiber redhat com> wrote:

On Sep 8, 2016 7:06 PM, "Isaac Christoffersen" <ichristoffersen vizuri com> wrote:
>
> I'm running Origin in AWS and after adding some shared EFS volumes to the node instances, the nodes seem to be unable to rejoin the cluster.  
>
> It's a 3 master + etcd setup with 4 application nodes.  An 'oc get nodes' returns an empty list, and of course none of the pods will start.
>
>
> The relevant error messages I see are:
>
> "Unable to construct api.Node object for kubelet: failed to get external ID from cloud provider: instance not found
> "Could not find an allocated subnet for node: ip-10-0-37-217..... , Waiting..."
>
> and 
>
> ""Error updating node status, will retry: error getting node "ip-10-0-37-217....": nodes "ip-10-0-37-217...." not found"
>
>
> Any insights into how to start troubleshooting further?  I'm baffled.

Did the nodes come back up with a new IP address? If so, the internal DNS name would have also changed and the node would need to be reconfigured accordingly.

Items that would need to be updated:
- node name in the node config
- node serving certificate
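
A quick way to compare the two (paths below are from a default install, so treat them as an assumption and adjust as needed):

    # Current internal DNS name, per the EC2 metadata service
    curl -s http://169.254.169.254/latest/meta-data/local-hostname

    # Name(s) baked into the node serving certificate
    openssl x509 -in /etc/origin/node/server.crt -noout -text \
        | grep -A 1 'Subject Alternative Name'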

There is an Ansible playbook that can automate the redeployment of certificates as well (playbooks/byo/openshift-cluster/redeploy-certificates.yml).
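
The invocation would be along these lines (substitute your own inventory path):

    ansible-playbook -i /path/to/inventory \
        playbooks/byo/openshift-cluster/redeploy-certificates.yml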

--
Jason DeTiberus








--
Jason DeTiberus
