
Re: High number of 4xx requests on etcd (3.6 upgrade)



I eventually found that one of my rules had a wrong filter that was returning a +Inf value. Most of the errors (a count of over 2000) were 404s on QGET, but the rate fell well below 0.01% after I fixed the filter. Is it normal to see this many 404 requests?
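For anyone hitting the same thing, here is a minimal sketch of how a failure-ratio rule can end up at +Inf. It assumes the rule divides a failed-request count (such as etcd_http_failed_total) by a received-request count; the function names and sample numbers are hypothetical and are not the actual rule from this cluster. If a filter on the denominator matches nothing, or matches a series that is zero, the division yields +Inf rather than a percentage.

```python
import math
from typing import Optional


def failure_ratio(failed: float, received: float) -> float:
    """Return failed / received using floating-point division semantics.

    A zero denominator gives +Inf when there were failures and NaN when
    there were none, which is how a mis-filtered rule can report +Inf.
    """
    if received == 0:
        return math.nan if failed == 0 else math.inf
    return failed / received


def safe_failure_ratio(failed: float, received: float) -> Optional[float]:
    """Return the ratio, or None when it is undefined (+Inf or NaN)."""
    ratio = failure_ratio(failed, received)
    if math.isinf(ratio) or math.isnan(ratio):
        return None
    return ratio


if __name__ == "__main__":
    # Hypothetical numbers: 2000 failures against a denominator that the
    # wrong filter reduced to zero.
    print(failure_ratio(2000, 0))
    # The same failures against a plausible (made-up) request total.
    print(safe_failure_ratio(2000, 40_000_000))
```

Treating the undefined case as "no data" instead of a number keeps a broken filter from showing up as a huge error rate on a dashboard or alert.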

This is only a staging cluster, so the number of keys is small. There are only about 100 namespaces with dummy objects.

I went through the etcd logs and didn't see anything abnormal. API server response times jumped to around 0.15 for GET and 1.5 for POST. This is roughly a 50% increase over what I saw with 1.5.

There were 3 leader changes, because I tried upgrading the docker etcd_container to see if that would fix the issue (for some reason the ansible upgrade script doesn't seem to upgrade it).



On Sun, 13 Aug 2017 at 01:23 Clayton Coleman <ccoleman redhat com> wrote:
How big is your etcd working set in terms of number of keys?  How many namespaces?  If keys <50k then I would suspect a software, hardware, or network issue in between masters and etcd.  HTTP etcd failures should only happen when the master is losing elections and being turned over, or the election/heartbeat timeout is too low for your actual network latency.  Double check the etcd logs and verify that you aren't seeing any election failures or turnover.

What metrics are the apiserver side returning related to etcd latencies and failures?
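As a rough illustration of the "timeout too low for your network latency" check mentioned above, the sketch below compares a measured round-trip time between etcd members against the heartbeat interval and election timeout. The thresholds are rules of thumb as I recall them from the etcd tuning guidance (heartbeat at least about half the RTT, election timeout at least 10 times the RTT), and the defaults of 100 ms and 1000 ms are assumed etcd defaults; verify both against the documentation for your etcd version. The function name is hypothetical.

```python
from typing import List


def check_etcd_timeouts(
    rtt_ms: float,
    heartbeat_ms: float = 100.0,   # assumed etcd default heartbeat interval
    election_ms: float = 1000.0,   # assumed etcd default election timeout
) -> List[str]:
    """Return warnings if the timeouts look too low for the measured RTT."""
    warnings = []
    # Rule of thumb: election timeout should be at least 10x the RTT.
    if election_ms < 10 * rtt_ms:
        warnings.append(
            f"election timeout {election_ms:.0f} ms is below 10x RTT "
            f"({10 * rtt_ms:.0f} ms)"
        )
    # Rule of thumb: heartbeat interval should be roughly 0.5x to 1.5x the RTT.
    if heartbeat_ms < 0.5 * rtt_ms:
        warnings.append(
            f"heartbeat interval {heartbeat_ms:.0f} ms is below 0.5x RTT "
            f"({0.5 * rtt_ms:.0f} ms)"
        )
    return warnings


if __name__ == "__main__":
    # Hypothetical RTT values in milliseconds between etcd members.
    for rtt in (10, 150, 300):
        print(rtt, check_etcd_timeouts(rtt))
```

An empty list means the settings pass these two checks; it does not rule out leader turnover caused by disk or CPU pressure, which still needs to be checked in the etcd logs.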

On Aug 12, 2017, at 11:07 AM, Andrew Lau <andrew andrewklau com> wrote:

etcd data is on dedicated drives, and AWS reports idle and burst capacity at around 90%.

On Sun, 13 Aug 2017 at 00:28 Clayton Coleman <ccoleman redhat com> wrote:
Check how much IO is being used by etcd and how much you have provisioned.


> On Aug 12, 2017, at 5:32 AM, Andrew Lau <andrew andrewklau com> wrote:
>
> After upgrading to 3.6, I'm noticing that the API server seems to be responding a lot slower, and my etcd metric etcd_http_failed_total is reporting a large number of failed GET requests.
>
> Has anyone seen this?
