Re: OKD3.11 install blocked - Could not find csr for nodes

Thank you very much for the extensive response, Samuel!

I've found that I do have a DNS misconfiguration, so the CSR error from the subject line is not caused by the OpenShift installer procedure itself.

Somehow (I haven't found the reason yet, but I'm still looking) dnsmasq fills the upstream DNS configuration with some public nameservers instead of my "internal" DNS.
So after the relevant openshift-ansible playbook installs dnsmasq and triggers the /etc/NetworkManager/dispatcher.d/99-origin-dns.sh script (by restarting NetworkManager), all nodes end up with "bad" upstream nameservers in the /etc/dnsmasq.d/origin-upstream-dns.conf and /etc/origin/node/resolv.conf files.
Even if the /etc/resolv.conf file for each host has the right nameserver and search domain, dnsmasq populates the OKD-related conf files above with a different nameserver.
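The mismatch described above can be checked on each node with a small script. This is only a sketch: it assumes the dnsmasq drop-in lists its upstreams as plain "server=<ip>" lines and that the resolv.conf-style files use standard "nameserver <ip>" lines. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Print the nameserver IPs listed in a resolv.conf-style file.
resolv_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "$1" | sort -u
}

# Print the upstream servers from a dnsmasq drop-in.
# Assumption: upstreams appear as "server=<ip>"; per-domain entries
# such as "server=/cluster.local/127.0.0.1" are skipped.
dnsmasq_upstreams() {
    sed -n 's/^server=\([^/].*\)$/\1/p' "$1" | sort -u
}

# Compare the two files and report "match" or "mismatch" with details.
compare_upstreams() {
    resolv_file="$1"
    dnsmasq_file="$2"
    if [ "$(resolv_nameservers "$resolv_file")" = "$(dnsmasq_upstreams "$dnsmasq_file")" ]; then
        echo "match"
    else
        echo "mismatch"
        echo "resolv.conf:" $(resolv_nameservers "$resolv_file")
        echo "dnsmasq:" $(dnsmasq_upstreams "$dnsmasq_file")
    fi
}
```

On a node this would be called as: compare_upstreams /etc/resolv.conf /etc/dnsmasq.d/origin-upstream-dns.conf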

I think this is related to the dnsmasq/NetworkManager configuration. I will have to look into it and figure out what's not going as expected and why. I believe these nameservers are served by the DHCP server, but I'm still looking for a way to address this.

Anyway thanks again for the input, it put me on the right track! :) 


On Sun, Jun 2, 2019 at 22:04, Samuel Martín Moro <faust64 gmail com> wrote:

This is quite puzzling. Could you share your inventory with us? Make sure to obfuscate any sensitive data (LDAP/htpasswd credentials, among others).
I'm mostly interested in any edits to openshift_node_groups, although something else might come up.

At first glance, you are right, it sounds like a firewalling issue.
Yet from your description, you did open all required ports.
I could suggest you check back on these, make sure your data is accurate - although I would assume it is.
Also: if using CRI-O as a runtime, note that you would be missing port 10010, which should be opened on all nodes. Yet I don't think that one would be related to node registration against your master API.

Another explanation could be related to DNS (can your infra/compute nodes properly resolve your masters' names? The contrary would be unusual, but it could still explain what's going on).

As a general rule, at that stage, I would restart the origin-node service on those hosts that fail to register, keeping an eye on /var/log/messages (or journalctl -f).
If that doesn't help, I might raise log levels in /etc/sysconfig/origin-node (there's a variable which defaults to 2, you can change it to 99; beware that it would give you a lot of logs and could saturate your disks at some point, so don't keep it like this over a long period).
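As a rough sketch of the log-level change described above: the helper below rewrites one VAR=value line in a sysconfig-style file and keeps a .bak copy. The variable name DEBUG_LOGLEVEL is an assumption on my part (it is what stock 3.x installs usually carry), so check the actual file first. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Set <var>=<level> in a sysconfig-style file, keeping a <file>.bak copy.
# Assumption: the variable is called DEBUG_LOGLEVEL unless a third
# argument says otherwise; verify this against your own file.
set_loglevel() {
    file="$1"
    level="$2"
    var="${3:-DEBUG_LOGLEVEL}"
    sed -i.bak "s/^${var}=.*/${var}=${level}/" "$file"
}

# Typical use on a failing node, as root:
#   set_loglevel /etc/sysconfig/origin-node 99
#   systemctl restart origin-node
#   journalctl -u origin-node -f
# Once done, revert so the disk does not fill up:
#   set_loglevel /etc/sysconfig/origin-node 2
#   systemctl restart origin-node
```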

Dealing with large volumes of logs, note that openshift services tend to store messages with a prefix based on severity: you might be able to "| grep -E 'E[0-9][0-9]'" to focus on error messages, or W[0-9][0-9] for warnings, ...
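A slightly stricter variant of that grep, as a sketch: it assumes the severity tag is a letter followed by a four-digit month/day (for example "E0601") and is surrounded by whitespace, which is how these messages usually show up in the journal. The function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Keep only lines carrying a severity tag such as "E0601" or "W0601".
# $1 is the set of severity letters to keep, e.g. "E" or "EW".
# Assumption: the tag is one letter plus four digits, delimited by
# whitespace (or at the start of the line).
filter_severity() {
    grep -E "(^|[[:space:]])[$1][0-9]{4}[[:space:]]"
}

# Typical use:
#   journalctl -u origin-node --no-pager | filter_severity EW
```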

Your issue being potentially related to firewalling, I might also use tcpdump looking into what's being exchanged between nodes.
Look for any packets with a SYN flag ("[S]") that would not be followed by a SYN-ACK ("[S.]").
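Spotting those by eye gets tedious, so here is a sketch that reads tcpdump's text output and lists the connections whose SYN never got a SYN-ACK back. It assumes the usual "src > dst: Flags [S], ..." line layout printed with -nn. The function name is hypothetical, and 8443 is simply the master API port mentioned in this thread.

```shell
#!/usr/bin/env bash
# Read tcpdump text output on stdin and print every "src > dst" pair
# that sent a SYN ("[S]") without a matching SYN-ACK ("[S.]") in reply.
unanswered_syns() {
    awk '
    {
        # Locate the ">" separator; its position varies between tcpdump versions.
        src = ""; dst = ""
        for (i = 2; i < NF; i++) {
            if ($i == ">") {
                src = $(i - 1); dst = $(i + 1)
                sub(/:$/, "", dst)
                break
            }
        }
        if (src == "") next
    }
    /Flags \[S\],/   { syn[src " > " dst] = 1 }
    /Flags \[S\.\],/ { answered[dst " > " src] = 1 }
    END {
        for (k in syn) if (!(k in answered)) print k
    }' | sort
}

# Typical use, from a capture taken while a node tries to register:
#   tcpdump -nn -i any -w /tmp/reg.pcap 'tcp port 8443'   # stop with Ctrl-C
#   tcpdump -nn -r /tmp/reg.pcap 'tcp[tcpflags] & tcp-syn != 0' | unanswered_syns
```

Any line printed points at a path where the handshake did not complete, which is what a firewall silently dropping packets would look like.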

Let us know how that goes,

Good luck.
Failing during the "Approve node certificate" steps is relatively common and could have several causes: node group configuration, DNS, firewalls, a broken TCP handshake, an MTU too small for certificates to go through, ... We'll want to dig deeper to elucidate that issue.


On Sat, Jun 1, 2019 at 12:19 PM Punga Dan <dan punga gmail com> wrote:
Hello all!

I'm hitting a problem when trying to install OKD 3.11 on one master, 2 infra and 2 compute nodes. The hosts are VMs that run CentOS 7.
I've gone through the issues related to this subject: https://access.redhat.com/solutions/3680401 which suggests naming the hosts by FQDN. I tried it, with the same problem appearing for the same set of hosts (all except the master).

In my case the error is only for the 2 infra nodes and 2 compute nodes, so not for the master as well.

oc get nodes gives me just the master node, but I guess this is the case as the other OKD-nodes stand to be created by the process that fails. Am I wrong?

oc get csr gives me a result of 3 csrs:
[root master ~]# oc get csr
NAME        AGE       REQUESTOR            CONDITION
csr-4xjjb   24m       system:admin         Approved,Issued
csr-b6x45   24m       system:admin         Approved,Issued
csr-hgmpf   20m       system:node:master   Approved,Issued

Here I believe I have 2 CSRs for system:admin because I ran playbooks/openshift-node/join.yml a second time.
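To tell apart "CSRs were submitted but never approved" from "no CSR ever reached the master", the table printed by oc get csr can be filtered for pending entries. This is a minimal sketch, assuming the CONDITION column is the last one and reads "Pending" for unapproved requests. The function name is made up.

```shell
#!/usr/bin/env bash
# Read `oc get csr` table output on stdin and print the names of the
# requests whose last column (CONDITION) is "Pending".
pending_csrs() {
    awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Typical use on the master:
#   oc get csr | pending_csrs
# and, if some show up and are expected:
#   oc get csr | pending_csrs | xargs -r oc adm certificate approve
```

With the output above nothing would be printed, which suggests the infra and compute nodes never managed to submit a request at all.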

The bootstrapping certificates on the master look fine (?)
[root master ~]# ll /etc/origin/node/certificates/
total 20
-rw-------. 1 root root 2830 iun  1 11:30 kubelet-client-2019-06-01-11-30-04.pem
-rw-------. 1 root root 1135 iun  1 11:31 kubelet-client-2019-06-01-11-31-23.pem
lrwxrwxrwx. 1 root root   68 iun  1 11:31 kubelet-client-current.pem -> /etc/origin/node/certificates/kubelet-client-2019-06-01-11-31-23.pem
-rw-------. 1 root root 1179 iun  1 11:35 kubelet-server-2019-06-01-11-35-42.pem
lrwxrwxrwx. 1 root root   68 iun  1 11:35 kubelet-server-current.pem -> /etc/origin/node/certificates/kubelet-server-2019-06-01-11-35-42.pem

I've rechecked the open ports, thinking the issue lies in some network-related config.
- all hosts have the node related ports opened: 53/udp, 10250/tcp, 4789/udp
- master(with etcd): 8053/udp+tcp, 2049/udp+tcp, 8443/tcp, 8444/tcp, 4789/udp, 53/udp
- infra has on top of the node ones, the ports related to router/routes and logging components which it will host
The chosen SDN is os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant' with no extra config in the inventory file. (Do I need any?)

Any hints about where and what to check would be much appreciated!

Best regards,
Dan Pungă
Samuel Martín Moro
