
Re: OpenShift Origin 3.7 Template Broker seems super flaky





On Mon, Jan 8, 2018 at 12:21 AM, Joel Pearson <japearson agiledigital com au> wrote:
> The TemplateInstance object should have an ownerReference to a BrokerTemplateInstance, and that reference not being handled properly is the bug. If you remove that ownerRef from the TemplateInstance, you should be safe from undesired deletion of the TemplateInstance (and the cascading delete of everything else), at least with respect to the bug we are aware of.

Nice, that did the trick.

I did an oc patch, and that fixed it:

$ oc get templateinstance
NAME                                   TEMPLATE
b180d814-2917-4c7e-875f-b91e5d4743e8   jenkins-ephemeral

$ oc patch templateinstance b180d814-2917-4c7e-875f-b91e5d4743e8 --type json -p='[{"op": "remove", "path": "/metadata/ownerReferences"}]'
templateinstance "b180d814-2917-4c7e-875f-b91e5d4743e8" patched
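The patch above removes the whole ownerReferences list. A narrower variant would drop only the reference to the BrokerTemplateInstance, which is the one named in the advice quoted above. The following is a minimal sketch of building such a JSON Patch; the function name and the sample object (including its owner name and uid) are made up for illustration and are not taken from a real cluster.

```python
import json


def owner_ref_removal_patch(obj, owner_kind="BrokerTemplateInstance"):
    """Build JSON Patch operations that remove ownerReferences of one kind.

    `obj` is assumed to be a dict shaped like the output of
    `oc get templateinstance <name> -o json`.
    """
    refs = obj.get("metadata", {}).get("ownerReferences", [])
    indexes = [i for i, ref in enumerate(refs) if ref.get("kind") == owner_kind]
    # Remove from the highest index down so that earlier removals
    # do not shift the positions of the entries still to be removed.
    return [
        {"op": "remove", "path": f"/metadata/ownerReferences/{i}"}
        for i in sorted(indexes, reverse=True)
    ]


# Hypothetical sample object; the owner name and uid are placeholders.
template_instance = {
    "kind": "TemplateInstance",
    "metadata": {
        "name": "b180d814-2917-4c7e-875f-b91e5d4743e8",
        "ownerReferences": [
            {
                "apiVersion": "template.openshift.io/v1",
                "kind": "BrokerTemplateInstance",
                "name": "example-broker-template-instance",
                "uid": "00000000-0000-0000-0000-000000000000",
            }
        ],
    },
}

patch = owner_ref_removal_patch(template_instance)
print(json.dumps(patch))
# prints: [{"op": "remove", "path": "/metadata/ownerReferences/0"}]
```

The printed JSON could then be passed to the same `oc patch templateinstance <name> --type json -p=...` invocation shown above. If the list contains no matching reference, the function returns an empty patch and nothing needs to be sent.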


Also, I've got another stale serviceinstance after a few rounds of testing. I cannot for the life of me make it die, which means I can't delete the project it is part of. I've tried a force delete, but it doesn't work.

$ oc delete serviceinstance jenkins-ephemeral-8dmk9 --force --grace-period=0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
serviceinstance "jenkins-ephemeral-8dmk9" deleted

$ oc get serviceinstance
NAME                      AGE
jenkins-ephemeral-8dmk9   7m

What's the magic sauce to make it so that I can delete the serviceinstance?

That's going to be a question for our Service Catalog lead, Paul Morie (on CC).

 

On 8 January 2018 at 15:29, Ben Parees <bparees redhat com> wrote:


On Sun, Jan 7, 2018 at 9:35 PM, Joel Pearson <japearson agiledigital com au> wrote:
Ahh, I looked into all the objects that were getting deleted and they all have an ownerReference, e.g.:

"ownerReferences": [
    {
        "apiVersion": "template.openshift.io/v1",
        "kind": "TemplateInstance",
        "name": "75c0ccd3-642e-4035-a5cf-3c27e54cae40",
        "uid": "a7301596-f41a-11e7-88e5-fa163eb8ca3a",
        "blockOwnerDeletion": true
    }
]

That looks like what the patch is about. I also found that if I tried to edit an object and remove the ownerReference, it triggered a garbage collection on the spot and all the resources evaporated.


Sounds worse than the behavior we were aware of, but fundamentally what's causing the cascade deletion is this:

Jan 08 00:26:49 master-0.openshift.staging.local dockerd-current[23329]: I0108 00:26:49.904249       1 garbagecollector.go:394] delete object [template.openshift.io/v1/TemplateInstance, namespace: jenkins-test, name: e3639aec-bbbc-4170-b0e4-3b63735af348, uid: 915d585d-f408-11e7-88e5-fa163eb8ca3a] with propagation policy Background

The TemplateInstance object should have an ownerReference to a BrokerTemplateInstance, and that reference not being handled properly is the bug. If you remove that ownerRef from the TemplateInstance, you should be safe from undesired deletion of the TemplateInstance (and the cascading delete of everything else), at least with respect to the bug we are aware of.

That should be the only ownerRef you need to delete, unless there are other (to date unknown) bugs in the GC behavior or in how the TSB is creating the ownerRef chain.

 
So I guess my workaround can be: run the template, wait for everything to deploy, export all templated resources to JSON, strip out the ownerReferences, and create all the resources again.
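The "strip out ownerReferences" step of that workaround could be scripted. Below is a minimal sketch, assuming the export is either a single object or a `kind: List` document as produced by `oc get ... -o json`; the function name and the sample data are illustrative only. It touches nothing except `metadata.ownerReferences`, so other server-set fields (uid, resourceVersion and the like) may still need handling before the objects can be created again.

```python
import copy
import json


def strip_owner_references(exported):
    """Return a copy of an exported object or List without ownerReferences."""
    cleaned = copy.deepcopy(exported)
    # `oc get <types> -o json` wraps multiple objects in a List with "items".
    items = cleaned.get("items", []) if cleaned.get("kind") == "List" else [cleaned]
    for item in items:
        item.get("metadata", {}).pop("ownerReferences", None)
    return cleaned


# Hypothetical export containing two templated resources.
exported = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [
        {
            "kind": "DeploymentConfig",
            "metadata": {
                "name": "jenkins",
                "ownerReferences": [
                    {
                        "apiVersion": "template.openshift.io/v1",
                        "kind": "TemplateInstance",
                        "name": "75c0ccd3-642e-4035-a5cf-3c27e54cae40",
                        "uid": "a7301596-f41a-11e7-88e5-fa163eb8ca3a",
                        "blockOwnerDeletion": True,
                    }
                ],
            },
        },
        {"kind": "Service", "metadata": {"name": "jenkins"}},
    ],
}

cleaned = strip_owner_references(exported)
print(json.dumps(cleaned, indent=2))
```

Working on a deep copy keeps the original export intact, so the two versions can be compared before anything is recreated.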

On Mon, Jan 8, 2018 at 12:30 PM Joel Pearson <japearson agiledigital com au> wrote:
Hmm, in my case I don't need to restart to cause the problem to happen. Is there some way to run nightlies of openshift:release-3.7 using openshift-ansible, so that I can verify it's fixed for me?

On Mon, Jan 8, 2018 at 12:23 PM Jordan Liggitt <jliggitt redhat com> wrote:
Garbage collection in particular could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1525699 (fixed in https://github.com/openshift/origin/pull/17818 but not included in a point release yet)


On Jan 7, 2018, at 8:17 PM, Joel Pearson <japearson agiledigital com au> wrote:

Hi,

Has anyone else noticed that the new OpenShift Origin 3.7 Template Broker seems super flaky?

For example, if I deploy a Jenkins (Persistent or Ephemeral), and then I modify the route, by adding an annotation for example:


I have https://github.com/tnozicka/openshift-acme installed in the cluster, which then grabs an SSL cert for me and adds it to the route. Moments later, all resources from the template are garbage collected for no apparent reason.

I also got the same behaviour when I modified the service account the Jenkins template uses: I had added an additional route, so I added a new "serviceaccounts.openshift.io/oauth-redirectreference.jenkins:" entry. It took a bit longer (about 12 hours), but everything disappeared again. I have a suspicion that if you modify any object that a template created, the template broker will eventually remove all the objects it created.

Is there any way to disable the new template broker and use the old template system?

In Origin 3.6 it was flawless and worked with openshift-acme without any problems at all.

I should mention that if I create things manually then it works fine: I can use openshift-acme, and my resources don't vanish on a whim.

Here is a snippet of the logs. You can see the acme endpoints are removed after successfully getting a cert, and then moments later the deleting starts:

Jan 08 00:26:47 master-0.openshift.staging.local dockerd-current[23329]: I0108 00:26:47.648255       1 leaderelection.go:199] successfully renewed lease kube-service-catalog/service-catalog-controller-manager
Jan 08 00:26:47 master-0.openshift.staging.local origin-node[26684]: I0108 00:26:47.744777   26749 roundrobin.go:338] LoadBalancerRR: Removing endpoints for jenkins-test/acme-9cv97q5dn8:
Jan 08 00:26:47 master-0.openshift.staging.local dockerd-current[23329]: I0108 00:26:47.744777   26749 roundrobin.go:338] LoadBalancerRR: Removing endpoints for jenkins-test/acme-9cv97q5dn8:
Jan 08 00:26:47 master-0.openshift.staging.local origin-node[26684]: I0108 00:26:47.762005   26749 ovs.go:143] Error executing ovs-ofctl: ovs-ofctl: None: invalid IP address
Jan 08 00:26:47 master-0.openshift.staging.local dockerd-current[23329]: I0108 00:26:47.762005   26749 ovs.go:143] Error executing ovs-ofctl: ovs-ofctl: None: invalid IP address
Jan 08 00:26:47 master-0.openshift.staging.local dockerd-current[23329]: E0108 00:26:47.765091   26749 sdn_controller.go:284] Error deleting OVS flows for service &{{ } {acme-9cv97q5dn8  jenkins-test /api/v1/namespaces/jenkins-test/services/acme-9cv97q5dn8 94c6b3b3-f40a-11e7-88e5-fa163eb8ca3a 622382 0 2018-01-08 00:26:34 +0000 UTC <nil> <nil> map[] map[] [] nil [] } {ClusterIP [{http TCP 80 {0 80 } 0}] map[] None  []  None []  0} {{[]}}}: exit status 1
Jan 08 00:26:47 master-0.openshift.staging.local origin-node[26684]: E0108 00:26:47.765091   26749 sdn_controller.go:284] Error deleting OVS flows for service &{{ } {acme-9cv97q5dn8  jenkins-test /api/v1/namespaces/jenkins-test/services/acme-9cv97q5dn8 94c6b3b3-f40a-11e7-88e5-fa163eb8ca3a 622382 0 2018-01-08 00:26:34 +0000 UTC <nil> <nil> map[] map[] [] nil [] } {ClusterIP [{http TCP 80 {0 80 } 0}] map[] None  []  None []  0} {{[]}}}: exit status 1
Jan 08 00:26:48 master-0.openshift.staging.local dockerd-current[23329]: I0108 00:26:48.139090       1 rest.go:362] Starting watch for /api/v1/namespaces, rv=622418 labels= fields= timeout=8m38s
Jan 08 00:26:48 master-0.openshift.staging.local origin-master-api[23448]: I0108 00:26:48.139090       1 rest.go:362] Starting watch for /api/v1/namespaces, rv=622418 labels= fields= timeout=8m38s
Jan 08 00:26:49 master-0.openshift.staging.local dockerd-current[23329]: I0108 00:26:49.668205       1 leaderelection.go:199] successfully renewed lease kube-service-catalog/service-catalog-controller-manager
Jan 08 00:26:49 master-0.openshift.staging.local dockerd-current[23329]: I0108 00:26:49.885207       1 garbagecollector.go:291] processing item [template.openshift.io/v1/TemplateInstance, namespace: jenkins-test, name: e3639aec-bbbc-4170-b0e4-3b63735af348, uid: 915d585d-f408-11e7-88e5-fa163eb8ca3a]
Jan 08 00:26:49 master-0.openshift.staging.local origin-master-controllers[73353]: I0108 00:26:49.885207       1 garbagecollector.go:291] processing item [template.openshift.io/v1/TemplateInstance, namespace: jenkins-test, name: e3639aec-bbbc-4170-b0e4-3b63735af348, uid: 915d585d-f408-11e7-88e5-fa163eb8ca3a]
Jan 08 00:26:49 master-0.openshift.staging.local dockerd-current[23329]: I0108 00:26:49.904249       1 garbagecollector.go:394] delete object [template.openshift.io/v1/TemplateInstance, namespace: jenkins-test, name: e3639aec-bbbc-4170-b0e4-3b63735af348, uid: 915d585d-f408-11e7-88e5-fa163eb8ca3a] with propagation policy Background
Jan 08 00:26:49 master-0.openshift.staging.local origin-master-controllers[73353]: I0108 00:26:49.904249       1 garbagecollector.go:394] delete object [template.openshift.io/v1/TemplateInstance, namespace: jenkins-test, name: e3639aec-bbbc-4170-b0e4-3b63735af348, uid: 915d585d-f408-11e7-88e5-fa163eb8ca3a] with propagation policy Background
Jan 08 00:26:49 master-0.openshift.staging.local dockerd-current[23329]: I0108 00:26:49.910964       1 garbagecollector.go:291] processing item [apps.openshift.io/v1/DeploymentConfig, namespace: jenkins-test, name: jenkins, uid: 91759f72-f408-11e7-88e5-fa163eb8ca3a]
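To see which objects the garbage collector actually deleted, the "delete object" lines can be picked out of a log like the one above. This is a rough sketch: the regular expression is inferred only from the line format visible in this excerpt and may not match other versions, and the function name is made up. Because each message appears twice in the journal (once per logging unit), entries are de-duplicated by uid.

```python
import re

# Pattern inferred from the log excerpt above; an assumption, not a stable format.
DELETE_RE = re.compile(
    r"garbagecollector\.go:\d+\] delete object "
    r"\[(?P<group_version>[^,]+)/(?P<kind>[^/,]+), "
    r"namespace: (?P<namespace>[^,]+), name: (?P<name>[^,]+), "
    r"uid: (?P<uid>[^\]]+)\] with propagation policy (?P<policy>\w+)"
)


def gc_deletions(lines):
    """Return one dict per distinct object deleted by the garbage collector."""
    seen = set()
    deletions = []
    for line in lines:
        match = DELETE_RE.search(line)
        if match is None or match["uid"] in seen:
            continue  # not a delete line, or a duplicate from another unit
        seen.add(match["uid"])
        deletions.append(match.groupdict())
    return deletions


# Three lines copied from the excerpt above.
sample = [
    "Jan 08 00:26:49 master-0.openshift.staging.local dockerd-current[23329]: I0108 00:26:49.885207       1 garbagecollector.go:291] processing item [template.openshift.io/v1/TemplateInstance, namespace: jenkins-test, name: e3639aec-bbbc-4170-b0e4-3b63735af348, uid: 915d585d-f408-11e7-88e5-fa163eb8ca3a]",
    "Jan 08 00:26:49 master-0.openshift.staging.local dockerd-current[23329]: I0108 00:26:49.904249       1 garbagecollector.go:394] delete object [template.openshift.io/v1/TemplateInstance, namespace: jenkins-test, name: e3639aec-bbbc-4170-b0e4-3b63735af348, uid: 915d585d-f408-11e7-88e5-fa163eb8ca3a] with propagation policy Background",
    "Jan 08 00:26:49 master-0.openshift.staging.local origin-master-controllers[73353]: I0108 00:26:49.904249       1 garbagecollector.go:394] delete object [template.openshift.io/v1/TemplateInstance, namespace: jenkins-test, name: e3639aec-bbbc-4170-b0e4-3b63735af348, uid: 915d585d-f408-11e7-88e5-fa163eb8ca3a] with propagation policy Background",
]

for deletion in gc_deletions(sample):
    print(deletion["kind"], deletion["namespace"], deletion["name"], deletion["policy"])
# prints: TemplateInstance jenkins-test e3639aec-bbbc-4170-b0e4-3b63735af348 Background
```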

Any ideas? Has anyone else seen this?  Considering "openshift-ansible-service-broker" is deployed in a broken state by openshift-ansible on the release-3.7 branch (for origin, I think enterprise would work as the tags exist), it makes me think that not many people are using the new service brokers that are talked about here: https://blog.openshift.com/whats-new-in-openshift-3-7-service-catalog-and-brokers/

Thanks,

Joel
_______________________________________________
users mailing list
users lists openshift redhat com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users





--
Ben Parees | OpenShift




--
Kind Regards,

Joel Pearson
Agile Digital | Senior Software Consultant

Love Your Software™ | ABN 98 106 361 273
p: 1300 858 277 | m: 0405 417 843 | w: agiledigital.com.au




