feature-request

Rolling Restarts

One feature I'd really love is the ability to execute a restart as a "rolling" restart. Right now, manual restarts (hitting the button, applying a config change, etc.) stop all instances at once, producing app downtime. And without a defined health check policy there's probably no way around that. But when a health check policy IS defined, I would love to be able to set the default restart method to a rolling restart, where each subsequent instance restart does not begin until the previous instance reaches a healthy status. That functionality would be incredibly valuable in a wide variety of situations...

  • This would also be really useful to me. I hadn't considered using the health check, that's a nice approach :+1:.

  • Thanks Casey for the detail and the request, and Thomas for adding support here. You're both internally tagged for updates on this as it progresses.

    Casey - the health check failing should cause the individual instance to restart. Are you saying that health checks where multiple instances fail would trigger a synchronous restart queue? For that specific case, I don't know that I understand the use case.

  • Hey Chris - the restart behavior when a health check fails seems fine to me. Let's ponder a different scenario - for example, say I need to update an environment variable on a container... When I save the update, Cycle will restart the instances to pick up the change. But in a production situation, I need to ensure that there is a functional application in a healthy state at all times. So my "ideal" would be for Cycle to identify that there is a defined health check, and when that is the case restart only a single instance, wait for that instance to return to a healthy state, then restart the next instance and repeat. This concept of a "rolling restart" ensures that the environment can be updated while remaining healthy.

    A good example of when you might need to do this would be when an error is occurring and you are trying to rapidly make changes to resolve the issue. Going through a full build and deployment loop for every incremental modification (enabling more detailed logging, making an experimental configuration change, etc.) massively slows the process vs. being able to make a quick change and a quick restart. I traditionally try to expose a lot of control elements as environment variables on an app so that I have a lot of flexibility to explore and resolve issues, but right now in Production I cannot use any of those tools because saving changes causes downtime when all the instances restart concurrently.
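
For anyone following along, the restart loop described in this comment could be sketched roughly as below. This is only an illustration of the idea, not how Cycle implements or will implement it: `rolling_restart`, `restart`, `is_healthy`, and the timeout/poll parameters are all hypothetical names standing in for whatever the platform actually exposes.

```python
import time
from typing import Callable, List, Sequence


def rolling_restart(
    instances: Sequence[str],
    restart: Callable[[str], None],      # hypothetical: restarts one instance
    is_healthy: Callable[[str], bool],   # hypothetical: result of the health check
    timeout_s: float = 60.0,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Restart instances one at a time, waiting for each to become healthy."""
    restarted: List[str] = []
    for instance in instances:
        restart(instance)
        deadline = clock() + timeout_s
        # Do not touch the next instance until this one reports healthy.
        while not is_healthy(instance):
            if clock() >= deadline:
                # Stop the rollout so the remaining instances keep serving.
                raise TimeoutError(f"{instance} did not become healthy in time")
            sleep(poll_s)
        restarted.append(instance)
    return restarted
```

Raising on timeout is one possible design choice: if a restarted instance never recovers, the remaining instances are left running the old configuration instead of being taken down as well.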

  • That makes sense and would probably be helpful in a scenario where the service is only partially degraded but mostly working. You want to be able to resolve the issue for the partially degraded portion without downstream users losing access - especially if that degraded portion of the service is small.

  • Yep - exactly. The most common scenario is where a bug is happening and we're trying to identify the source, so we would temporarily escalate the logging level for that part of the application (normally WARN in Production, so highly performant but not very verbose) to try to get more information regarding the issue. A quick reconfigure gets you verbose logging in a few seconds, then you execute the failing event to capture the log output and reconfigure back to WARN (and then go do some forensics on the log output).

    Having to push builds to accomplish that (factoring for test runs, code reviews, etc) really blows up the scope and slows you down.
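
As a rough illustration of the pattern described in this comment, an app might read its log level from an environment variable at startup, so that a reconfigure plus restart is enough to change verbosity. The variable name `LOG_LEVEL`, the logger name `app`, and the `configure_logging` helper are assumptions made for this sketch, not anything taken from the thread or from Cycle.

```python
import logging
import os
from typing import Mapping

# Accepted values for the (hypothetical) LOG_LEVEL environment variable.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def configure_logging(env: Mapping[str, str] = os.environ) -> logging.Logger:
    """Set the app logger's level from LOG_LEVEL, defaulting to WARNING."""
    name = env.get("LOG_LEVEL", "WARN").strip().upper()
    # Unknown values fall back to WARNING rather than failing at startup.
    level = _LEVELS.get(name, logging.WARNING)
    logger = logging.getLogger("app")
    logger.setLevel(level)
    return logger
```

Because the level is only read at startup, the change takes effect on restart, which is why a restart that avoids downtime matters for this workflow.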

  • This is an awesome idea. (so I get tagged, too)

  • I started on this functionality today. It'll be set using <container>.config.deploy.ready_check (similar to health_check). More info soon.

  • ❤️❤️❤️❤️❤️

  • Fantastic; keep it coming team.

  • Excellent news!
