
Failing to bootstrap disconnected 4.2 cluster on metal



Hi,

I'm trying to bootstrap a disconnected (air-gapped) 4.2 cluster using the bare metal method. It is technically VMware, but I'm following the bare metal instructions because our VMware cluster wasn't quite compatible with the VMware instructions.

After a few false starts I managed to get bootstrapping to start.  One strange thing was that it tried to download images from "quay.io/openshift-release-dev/ocp-v4.0-art-dev" instead of the documented "quay.io/openshift-release-dev/ocp-release". I found this rather odd, and I couldn't find many references to "ocp-v4.0-art-dev" on the internet, so I'm not sure exactly where it came from.  I ran "strings openshift-install | grep ocp-v4.0-art-dev" but that didn't show anything, so it's a bit of a strange one.

So my image content sources ended up being:

imageContentSources:
- mirrors:
  - <bastion_host_name>:5000/<repo_name>/release
  source: quay.io/openshift-release-dev/ocp-release
- mirrors:
  - <bastion_host_name>:5000/<repo_name>/release
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
- mirrors:
  - <bastion_host_name>:5000/<repo_name>/release
  source: registry.svc.ci.openshift.org/ocp/release
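
For reference, here is a rough Python sketch of how I understand entries like these are meant to be applied: a pull spec whose repository matches a "source" gets retried against each listed mirror, keeping the same digest or tag. The exact-match rule is my assumption rather than something taken from the installer, and the bastion host name, repo name and digest in the sketch are hypothetical placeholders.

```python
# Sketch only: the matching rule (exact repository match) is an assumption.
# Host name, repo name and digest below are hypothetical placeholders.
IMAGE_CONTENT_SOURCES = [
    {"source": "quay.io/openshift-release-dev/ocp-release",
     "mirrors": ["bastion.example.com:5000/ocp4/release"]},
    {"source": "quay.io/openshift-release-dev/ocp-v4.0-art-dev",
     "mirrors": ["bastion.example.com:5000/ocp4/release"]},
    {"source": "registry.svc.ci.openshift.org/ocp/release",
     "mirrors": ["bastion.example.com:5000/ocp4/release"]},
]


def split_pullspec(pullspec):
    """Split a pull spec into (repository, suffix), where suffix is '@digest' or ':tag'."""
    if "@" in pullspec:
        repo, _, digest = pullspec.partition("@")
        return repo, "@" + digest
    # Only look for a tag after the last '/', so a registry port is not mistaken for one.
    last_component = pullspec.rsplit("/", 1)[-1]
    if ":" in last_component:
        repo, _, tag = pullspec.rpartition(":")
        return repo, ":" + tag
    return pullspec, ""


def mirror_candidates(pullspec, sources):
    """Return the mirror pull specs to try for `pullspec`, in the order listed."""
    repo, suffix = split_pullspec(pullspec)
    candidates = []
    for entry in sources:
        if entry["source"] == repo:
            candidates.extend(mirror + suffix for mirror in entry["mirrors"])
    return candidates


# Hypothetical digest, shortened for readability.
print(mirror_candidates(
    "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:abc",
    IMAGE_CONTENT_SOURCES,
))
```

A pull spec from a repository with no matching "source" returns an empty list, which in an air-gapped environment would mean there is nowhere to pull it from.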

I was watching journalctl on the bootstrap server and saw each etcd server join one by one. Once they had all joined, the apiserver on the bootstrap server seemed to lock up: when I tried to connect to https://localhost:6443 the connections would hang.  Initially, I thought this meant that bootstrap had completed, but then I noticed that none of the master nodes were listening on 6443. They were all trying to look themselves up in etcd at "api-int.<cluster_name>.<base_domain>", but nothing was listening there.
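
To tell a hanging port apart from one where nothing is listening at all, a small probe along these lines can help. It is a minimal sketch using only the Python standard library. The function name and the 3-second timeout are my own choices, and it only tests the TCP connection, so an apiserver that accepts the connection and then stalls during the HTTPS exchange would still show up as "open".

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify a TCP port as 'open', 'refused', 'timeout', or 'error: ...'."""
    try:
        # create_connection raises if no connection is made within `timeout` seconds.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout (hang, or packets dropped).
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. name resolution failure or no route to host.
        return f"error: {exc}"
```

Running probe_tcp("localhost", 6443) on the bootstrap node, and the same against each master and against the api-int name, would show which of the three cases applies on each host.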

I then scoured the journal on the bootstrap node, but I struggled to find logs related to why the apiserver had disappeared.  The journal was mostly full of the bootstrap node trying to connect to https://localhost:6443, which suggested to me that bootstrap was not yet complete.
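
Since the repeated connection attempts drown out everything else, one way to cut through the journal is to drop those lines and keep only lines that look like failures. The following is a rough sketch, not a tested diagnostic: the function name, the noise marker and the keyword list are all my own guesses at what is worth filtering.

```python
def interesting_lines(lines,
                      noise=("localhost:6443",),
                      keywords=("error", "failed", "denied", "panic")):
    """Return lines that mention a failure keyword and are not known noise."""
    kept = []
    for line in lines:
        lower = line.lower()
        # Skip the repeated connection attempts against the local apiserver.
        if any(marker.lower() in lower for marker in noise):
            continue
        # Keep anything that looks like a failure (case-insensitive match).
        if any(word in lower for word in keywords):
            kept.append(line.rstrip("\n"))
    return kept
```

The input could be the lines of a saved journal dump (for example the output of journalctl -b --no-pager written to a file), which is also easier to carry out of an air-gapped environment than the full journal.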

I tried rebooting the bootstrap node, but I think that made it worse: it seemed to be in a crash loop, whinging about files in /etc/kubernetes already existing, or something like that.  I had a look through /var/log and found this error message in some pod logs:

exiting because of error: log: unable to create log: open /var/log/bootstrap-control-plane/kube-apiserver.log: permission denied 

I'm not sure if that error is because I restarted before bootstrap was successful, or if that is actually some sort of problem.

I tried reinstalling from scratch a few times, and it always got stuck in the same place, so it doesn't seem to be transient.

Where can I look for errors? Is "ocp-v4.0-art-dev" an indication of a problem? Since it's an air-gapped solution it's difficult to get logs out of the system, so I don't know if I'll be able to use must-gather.  However, if I'm understanding it correctly, must-gather can only be used after bootstrap has succeeded.

Thoughts?
