
Re: scenarios of entire app in a cluster unavailable

Thank you for the info. It's useful.

Srinivas Kotaru

On 9/20/16, 5:37 AM, "Brenton Leanhardt" <bleanhar redhat com> wrote:

    On Mon, Sep 19, 2016 at 6:40 PM, Srinivas Naga Kotaru (skotaru)
    <skotaru cisco com> wrote:
    > I'm trying to understand in which scenarios all instances of an
    > application running in a cluster become unavailable.
    > OS upgrade failure?
    > OpenShift upgrade bugs/failures/downtime?
    The best way to mitigate risks from the first two is to upgrade
    independent sets of Nodes in batches, so that an unforeseen problem
    affects only one batch rather than every Node at once.  Downtime
    should be rare if there is sufficient monitoring in the environment.
    In the Origin 1.4, OCP 3.4 timeframe it will be much easier to upgrade
    batches of Nodes.  It's possible today, but it takes a little more
    involvement with the Ansible inventory.  In large environments with
    strict maintenance windows it's common to update only one set of Nodes
    during each window.
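
The batching idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how the OpenShift Ansible playbooks work: `make_batches`, `upgrade_in_batches`, and the `upgrade_batch` and `is_healthy` callbacks are hypothetical names, and the callbacks stand in for whatever upgrade and monitoring tooling the environment already has.

```python
from typing import Callable, List, Sequence


def make_batches(nodes: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split nodes into consecutive, non-overlapping batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(nodes[i:i + batch_size])
            for i in range(0, len(nodes), batch_size)]


def upgrade_in_batches(
    nodes: Sequence[str],
    batch_size: int,
    upgrade_batch: Callable[[List[str]], None],  # hypothetical: runs the upgrade
    is_healthy: Callable[[str], bool],           # hypothetical: monitoring check
) -> List[str]:
    """Upgrade one batch at a time and stop at the first unhealthy batch.

    Returns the nodes that were upgraded and then verified healthy, so the
    remaining nodes keep serving traffic on the old version.
    """
    verified: List[str] = []
    for batch in make_batches(nodes, batch_size):
        upgrade_batch(batch)
        if not all(is_healthy(node) for node in batch):
            break  # halt here; later batches are left untouched
        verified.extend(batch)
    return verified
```

Stopping at the first unhealthy batch is what limits the damage: with a batch size of 2 and five Nodes, a bad upgrade on the third Node leaves the fifth Node untouched.
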
    > Router failures ??
    This is likely the most common source of user-facing downtime.
    > Keepalive containers failed??
    Unless this event triggered a failover to a pod that was itself
    down, I don't think the Keepalive pod failing would cause a
    user-facing outage.  The platform would spawn another.
    > Floating IP shared by keepalive container had issues??
    If somehow the floating IP was in use by another interface on the
    network I'm certain bad things would happen.
    > VXLAN bug or upgrade caused entire cluster network failure?
    Catastrophic network failures could indeed cause a major outage.
    > Human config error (what are those?)
    Always.  Best avoided by using a tool like Ansible and testing changes
    in other environments before production.
    > Is the above list accurate? Can we think of any other possible scenarios
    > where the whole application will be down in the cluster due to platform issues?
    I would mention downtime caused by load.  Anecdotally, this is
    probably the second most common cause of downtime.  It often relates
    to human error and a lack of monitoring.  The denser the platform
    operators wish to keep the environment, the more rigor is needed
    for monitoring.
    This could simply be an error of the pod owner as well.  E.g., the JVM
    inside the pod might be online while the application running in the
    JVM is throwing out-of-memory errors because its limits were assigned
    incorrectly.
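
The JVM case above is why a check that only asks "is the process running?" can miss an outage. A minimal sketch of an application-level check follows, assuming the pod's recent log lines are available as plain strings. `classify_pod` and its status labels are hypothetical, and scanning logs for `java.lang.OutOfMemoryError` is just one possible signal, not something the platform does for you.

```python
from typing import Iterable

# Marker the JVM writes when an allocation fails; the process may stay up.
OOM_MARKER = "java.lang.OutOfMemoryError"


def classify_pod(process_running: bool, recent_log_lines: Iterable[str]) -> str:
    """Return 'down', 'degraded', or 'healthy' for a single pod.

    'degraded' covers the case where the JVM is online but the application
    inside it is failing with out-of-memory errors.
    """
    if not process_running:
        return "down"
    if any(OOM_MARKER in line for line in recent_log_lines):
        return "degraded"
    return "healthy"
```

A pod reported as "degraded" here would still look fine to a process-level check, which is the gap the reply describes.
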
    > --
    > Srinivas Kotaru
