
Re: faulty diagnostics?



Thanks for bringing this up. This tool... needs some attention. Comments below:

On Fri, Oct 27, 2017 at 7:48 AM, Tim Dudgeon <tdudgeon ml gmail com> wrote:
I've been looking at using the diagnostics (oc adm diagnostics) to test the status of a cluster installed with the ansible installer and consistently see things that seem to be false alarms. The cluster appears to be functioning (builds run, I can push to the registry, routes are working, etc.). This is with Origin 3.6.0.

1. This is consistently seen, and a restart of the master does not fix it. The name docker-registry.default.svc resolves to the IP address 172.30.200.62.

ERROR: [DClu1019 from diagnostic ClusterRegistry openshift/origin/pkg/diagnostics/cluster/registry.go:343]
       Diagnostics created a test ImageStream and compared the registry IP
       it received to the registry IP available via the docker-registry service.

       docker-registry      : 172.30.200.62:5000
       ImageStream registry : docker-registry.default.svc:5000

       They do not match, which probably means that an administrator re-created
       the docker-registry service but the master has cached the old service
       IP address. Builds or deployments that use ImageStreams with the wrong
       docker-registry IP will fail under this condition.

       To resolve this issue, restarting the master (to clear the cache) should
       be sufficient. Existing ImageStreams may need to be re-created.

This is a bug -- the registry deployment changed without updating the relevant diagnostic. It has been fixed with https://github.com/openshift/origin/pull/16188, which I guess was not backported to Origin 3.6, so expect it fixed in 3.7.
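For reference, if you want to convince yourself nothing is actually broken, you can compare the two values by hand. A rough sketch (the ImageStream name 'test' is just an example; the service name and project are the defaults):

    # cluster IP of the docker-registry service
    oc get svc docker-registry -n default -o jsonpath='{.spec.clusterIP}'

    # registry reference recorded on an ImageStream ('test' is a placeholder name)
    oc get is test -n default -o jsonpath='{.status.dockerImageRepository}'

If the ImageStream records the service hostname (docker-registry.default.svc) rather than the cluster IP, that is the newer registry behavior the 3.6 diagnostic doesn't know about, not a stale cache.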

2. This warning is seen:

WARN:  [DClu0003 from diagnostic NodeDefinition openshift/origin/pkg/diagnostics/cluster/node_definitions.go:113]
       Node ip-10-0-247-194.eu-west-1.compute.internal is ready but is marked Unschedulable.
       This is usually set manually for administrative reasons.
       An administrator can mark the node schedulable with:
           oadm manage-node ip-10-0-247-194.eu-west-1.compute.internal --schedulable=true

       While in this state, pods should not be scheduled to deploy on the node.
       Existing pods will continue to run until completed or evacuated (see
       other options for 'oadm manage-node').
This is for the master node, which by default is non-schedulable.

It's a warning, not an error, because this could be a legitimate configuration. The diagnostic generally has no way to know that a node is a master or that it is supposed to be unschedulable (there is nothing in the API to determine this).

That diagnostic is intended to alert you to the possibility that a node is not getting pods scheduled because of this setting. It's not saying there's anything wrong with the cluster. It's certainly a bit confusing; do you feel it's better to get a useless warning for masters, or not to hear about unschedulable nodes at all in the diagnostics?
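The flag itself is visible in the API (it's just spec.unschedulable on the node object); what isn't visible is whether it was set on purpose. If you want to check it by hand, something like this works (node name taken from the warning above):

    # unschedulable nodes show up as SchedulingDisabled
    oc get nodes

    # or read the flag for one node
    oc get node ip-10-0-247-194.eu-west-1.compute.internal -o jsonpath='{.spec.unschedulable}'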


3. If metrics and logging are not deployed you see this warning:

WARN:  [DH0005 from diagnostic MasterConfigCheck openshift/origin/pkg/diagnostics/host/check_master_config.go:52]
       Validation of master config file '/etc/origin/master/master-config.yaml' warned:
       assetConfig.loggingPublicURL: Invalid value: "": required to view aggregated container logs in the console
       assetConfig.metricsPublicURL: Invalid value: "": required to view cluster metrics in the console
       auditConfig.auditFilePath: Required value: audit can not be logged to a separate file

Whilst 2 and 3 could be considered minor irritations, 1 might scare people into thinking that something is actually wrong.


Once again... it's a warning. And again, it's because there's no way to determine from the API whether these are supposed to be deployed. 
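If you later deploy logging and metrics, the warning goes away once those URLs are set in the master config. A quick way to check the current state (the project names below assume the installer's usual locations for logging and metrics):

    # are the public URLs set in the master config?
    grep -E 'loggingPublicURL|metricsPublicURL' /etc/origin/master/master-config.yaml

    # are the stacks actually deployed?
    oc get pods -n logging
    oc get pods -n openshift-infra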


Also, the 'oc adm diagnostics' command needs to be run as root or with sudo, otherwise you get some file-permission-related errors. I don't think this is mentioned in the docs.


Could you be more specific about what errors you get? Errors accessing the node/master config files perhaps?
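If it's what I suspect, running as root (or pointing at a readable cluster-admin kubeconfig) avoids it. A sketch, assuming the ansible installer's default paths:

    # run as root so the host config files can be read
    sudo oc adm diagnostics

    # or point explicitly at the cluster-admin kubeconfig
    sudo oc adm diagnostics --config=/etc/origin/master/admin.kubeconfig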

Thanks for the feedback, and sorry for the delay in responding.

