OpenShift Master: v3.9.0+ba7faec-1
Kubernetes Master: v1.9.1+a0ce1bc657
OpenShift Web Console: v3.9.0+b600d46-dirty
After working successfully for the past few months, my Jenkins deployment started failing to launch build agents for jobs. The event error was essentially Failed to start transient scope unit: Argument list too long. The error was initially confusing because Jenkins is just running the same agents it has always been running. The agents are configured to live for a short time (15 minutes), after which they’re removed and new ones are created when necessary.
All this has been perfectly functional up until today.
The complete event error was: -
MountVolume.SetUp failed for volume "fs-input" : mount failed: exit status 1 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/origin/openshift.local.volumes/pods/4da0f883-aaa2-11e8-901a-c81f66c79dfc/volumes/kubernetes.io~nfs/fs-input --scope -- mount -t nfs -o ro bastion.novalocal:/data/fs-input /var/lib/origin/openshift.local.volumes/pods/4da0f883-aaa2-11e8-901a-c81f66c79dfc/volumes/kubernetes.io~nfs/fs-input Output: Failed to start transient scope unit: Argument list too long
I suspect it might be related to Kubernetes issue #57345: Number of "loaded inactive dead" systemd transient mount units continues to grow.
In an attempt to rectify the situation I tried the issue's suggestion, which was to run: -
$ sudo systemctl daemon-reload
...on the affected node(s). It worked on all nodes except the one that was giving me problems. On the “broken” node the command took a few seconds to complete but failed, responding with: -
Failed to execute operation: Connection timed out
I was unable to reboot the node from the command-line (clearly the system was polluted to the point that it was essentially unusable) and I was forced to resort to rebooting the node by other means.
When the node returned, Jenkins and its deployments eventually returned to an operational state.
So it looks like the issue may be right: the number of systemd transient mount units continues to grow unchecked on nodes.
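One way to keep an eye on this build-up might be to count the leftover units on each node. The following is only a sketch: the function name count_dead_transient_mounts is hypothetical, and it assumes the leftover units show up in systemctl list-units output as "inactive dead" with the "Kubernetes transient mount for" description seen in the event error above.

```shell
# Hypothetical helper: counts units that look like leftover Kubernetes
# transient mounts. It reads `systemctl list-units` output on stdin, so it
# can be tried against saved output as well as on a live node.
count_dead_transient_mounts() {
    # Keep only units carrying the description kubelet passes to systemd-run,
    # then count those whose ACTIVE/SUB columns read "inactive dead".
    grep 'Kubernetes transient mount for' | grep -Ec 'inactive +dead'
}

# Usage on a node (assumption: run with enough privileges to list all units):
#   sudo systemctl list-units --all --no-legend --no-pager | count_dead_transient_mounts
```

If the count keeps climbing between runs, that would be consistent with the behaviour described in the issue, and running the daemon-reload before the node becomes unresponsive may be worth trying.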
Although I’ve recovered the system and now believe I have a work-around for the underlying fault the next time I see this, I wonder whether anyone else has seen this in 3.9, and whether there is a long-term solution?
achristie informaticsmatters com