Re: How to debug the machine config operator in 4.2.10?

After a bit more digging and looking at other pod logs, I managed to find some useful logs in the machine-config-daemon on one of the nodes.

The error is:

content mismatch for file /etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt: -----BEGIN CERTIFICATE...

...certificate data...

Marking Degraded due to: unexpected on-disk state validating against rendered-worker-987dsa987f98

When I ssh onto the node, I can see that /etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt already has the certificates that I specified by following the "setting up additional trusted CAs for builds" instructions. But when I try to pull an image via "sudo crictl pull myprivate.registry:5001/image:tag", it complains about x509 certificates not being trusted. If I reboot the node, pulling via crictl starts working. However, the machine config operator remains broken, still complaining about the above error. So it seems that the certificates are finding their way onto the node via a different mechanism than the MCO.
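
One way to see exactly what the MCO expects versus what is on disk might be something like the following. This is only a rough sketch: it assumes the rendered MachineConfig lists each file under spec.config.storage.files, with the content stored as an RFC 2397 data URL in contents.source (an Ignition-style layout, which I have not verified for every version). The function names are made up for illustration.

```python
import base64
import difflib
from typing import List
from urllib.parse import unquote_to_bytes


def decode_data_url(source: str) -> bytes:
    """Decode an RFC 2397 data URL (base64 or percent-encoded) to bytes."""
    if not source.startswith("data:"):
        raise ValueError("not a data URL")
    header, _, payload = source[len("data:"):].partition(",")
    if header.endswith(";base64"):
        return base64.b64decode(payload)
    return unquote_to_bytes(payload)


def expected_file_content(machine_config: dict, path: str) -> bytes:
    """Find `path` in the config's file list and return its decoded content.

    Assumes the layout spec.config.storage.files[].contents.source.
    """
    files = machine_config["spec"]["config"]["storage"]["files"]
    for entry in files:
        if entry["path"] == path:
            return decode_data_url(entry["contents"]["source"])
    raise KeyError(path)


def diff_against_disk(expected: bytes, on_disk: bytes) -> List[str]:
    """Return unified-diff lines; an empty list means the contents match."""
    return list(difflib.unified_diff(
        expected.decode(errors="replace").splitlines(),
        on_disk.decode(errors="replace").splitlines(),
        fromfile="rendered-config", tofile="on-disk", lineterm=""))
```

The dict could come from the output of "oc get machineconfig rendered-worker-987dsa987f98 -o json" parsed with json.load, and the on-disk bytes from reading the .crt file on the node.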

This is a disconnected cluster with some extra trusted CAs that were configured during installation, so I'm wondering if the content mismatch in the MCO is related to merging the CA certs for images with the certs inside the "user-ca-bundle" configmap in the "openshift-config" namespace.
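
To test the merging theory, an order-insensitive comparison of the two bundles might show whether they hold the same certificates in a different order or genuinely different sets. A minimal sketch, assuming both inputs are plain PEM text; the function names are hypothetical:

```python
import base64
import hashlib
import re
from typing import Set

# Matches the base64 body of each certificate in a PEM bundle.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----(.*?)-----END CERTIFICATE-----", re.DOTALL)


def cert_fingerprints(bundle: str) -> Set[str]:
    """Return the SHA-256 fingerprint of every certificate in a PEM bundle."""
    prints = set()
    for body in PEM_CERT.findall(bundle):
        # Strip line breaks and whitespace before decoding to DER bytes.
        der = base64.b64decode("".join(body.split()))
        prints.add(hashlib.sha256(der).hexdigest())
    return prints


def compare_bundles(expected: str, on_disk: str) -> dict:
    """Report which certificates appear on only one side."""
    want, have = cert_fingerprints(expected), cert_fingerprints(on_disk)
    return {"missing_on_disk": want - have, "extra_on_disk": have - want}
```

If both sets come back empty while the files still differ byte for byte, the difference would be down to ordering, duplicates, or whitespace rather than the certificates themselves.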

Any ideas?

On Tue, 18 Feb 2020 at 17:33, Joel Pearson <japearson agiledigital com au> wrote:

I've been having trouble getting OpenShift to reliably accept CAs for custom secure registries.
We've been following this guide: https://docs.openshift.com/container-platform/4.2/builds/setting-up-trusted-ca.html

It has worked sometimes and not others. The most frustrating part is not being able to tell when the CA certificates have been applied: sometimes waiting 5 minutes is enough, other times it never happens. I'm not sure which logs to watch to confirm that the change has been picked up and acted on.

The guide says that the machine config operator (MCO) restarts nodes to apply the updates, but when I watch "oc get nodes" I don't see anything restarting. Even so, the certificates sometimes seem to get applied anyway.

Additionally, the MCO is degraded in the cluster, and it's not clear why. All I have managed to find so far is timeout error messages in the MCO pod, and then in the MCO cluster operator status, it just says it timed out waiting for them to sync, and that they're all unavailable.

Where do I need to look to debug any errors related to the MCO?

