Building Healthy Services

In container orchestration, keeping services healthy and available is central to running reliable applications. Three mechanisms support this: health checks, restarts, and failure detection.

Health Checks

Health checks determine whether a containerized application is running as expected. There are two main types:

  • Liveness Probes: These check if a container is still operational. If a container fails this check, the orchestrator restarts it. This prevents the application from being stuck in a failed state.
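  • Readiness Probes: These check whether the container is ready to handle requests. A container may be alive but not yet ready to serve traffic, so the orchestrator routes no requests to it until the probe succeeds. A failed readiness check typically does not restart the container; it only keeps traffic away from it.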

Common ways to implement a health check are an HTTP request to an endpoint, a command executed inside the container, and a TCP socket check.
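
The HTTP variant can be sketched as a small server that exposes one path per probe. This is a minimal illustration, not a definitive implementation: the paths /healthz and /ready, the `ready` flag, and the helper names are assumptions chosen for the example, and only the Python standard library is used.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical readiness flag. A real service would set it once startup work
# (loading configuration, opening connections) has finished.
ready = threading.Event()


def probe_response(path, is_ready):
    """Map a probe path to an (HTTP status, body) pair."""
    if path == "/healthz":
        # Liveness: the process is up and able to answer at all.
        return 200, b"alive"
    if path == "/ready":
        # Readiness: report success only once the service can take traffic.
        return (200, b"ready") if is_ready else (503, b"not ready")
    return 404, b"not found"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = probe_response(self.path, ready.is_set())
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep frequent probe requests out of the log output


def make_server(host="0.0.0.0", port=8080):
    """Create the health server; the caller decides when to run it."""
    return HTTPServer((host, port), HealthHandler)
```

To try it, call make_server().serve_forever() and request the two paths. The readiness path answers 503 until ready.set() is called, which mirrors a container that is alive but not yet able to serve traffic.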

Restarts

When containers fail or become unhealthy, the platform may automatically restart them. Scenarios that trigger restarts include:

  • Crash Recovery: When a container crashes due to an internal error, it is restarted to recover.
  • Liveness Probe Failures: If the container fails its liveness check, it is restarted.
  • Resource Exhaustion: If a container exceeds its memory limit, the system may kill it (an out-of-memory kill) and restart it to stabilize resources. Exceeding a CPU limit usually leads to throttling rather than a restart.

Restart policies control how restarts are handled. Common policies include:

  • Always: Restart the container regardless of the exit status.
  • On-failure: Restart only if the container exits with an error, that is, a non-zero exit status.
  • Never: Do not restart automatically.
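
The three policies amount to a small decision rule over the container's exit status. The sketch below is illustrative only: the names `RestartPolicy` and `should_restart` are hypothetical, and it assumes that "an error" means a non-zero exit code.

```python
from enum import Enum


class RestartPolicy(Enum):
    ALWAYS = "always"
    ON_FAILURE = "on-failure"
    NEVER = "never"


def should_restart(policy, exit_code):
    """Decide whether a container that exited with exit_code is restarted."""
    if policy is RestartPolicy.ALWAYS:
        return True  # restart regardless of the exit status
    if policy is RestartPolicy.ON_FAILURE:
        return exit_code != 0  # assumption: non-zero means the container failed
    return False  # NEVER: leave the container stopped
```

For example, should_restart(RestartPolicy.ON_FAILURE, 0) is False because a clean exit is not treated as a failure, while the same call with exit code 1 is True.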

Failure Detection

Failure detection involves recognizing when a container or its underlying infrastructure is experiencing problems. Key methods include:

  • Health Check Failures: Repeated, consecutive health check failures signal an unhealthy state.
  • Node Monitoring: The platform monitors the health of the servers (nodes) hosting the containers. If a node fails, its workloads are rescheduled onto other nodes.
  • Resource Metrics: Monitoring CPU, memory, and other resource usage helps predict potential container failures due to resource limits.

Monitoring tools and alerts enhance failure detection, helping teams to intervene before failures escalate.
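
Treating only repeated failures as a signal can be sketched as a counter of consecutive failed checks. This is a simplified illustration under stated assumptions: the class name `FailureDetector` and the default threshold of 3 are hypothetical, and real platforms usually combine such a threshold with check intervals and timeouts.

```python
class FailureDetector:
    """Flags a target as unhealthy after N consecutive failed checks."""

    def __init__(self, failure_threshold=3):
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, check_passed):
        """Record one check result and return True if the target is unhealthy."""
        if check_passed:
            # A single success resets the count, so brief hiccups are tolerated.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold
```

With a threshold of 3, two failures followed by a success do not mark the target unhealthy; three failures in a row do.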

Best Practices

  • Proper Probe Configuration: Tune liveness and readiness probes so that slow startups or brief hiccups do not cause unnecessary restarts or false failure detections.
  • Use Appropriate Restart Policies: Tailor restart policies to the application's requirements. For instance, use on-failure for tasks that run to completion and report success or failure through their exit code, and always for long-running services that should stay up.
  • Resource Monitoring and Limits: Monitor resource usage and tune limits to avoid resource-related failures.
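
As one way to act on resource monitoring, memory usage can be compared against the configured limit and flagged before the limit is reached. The sketch below is an assumption-laden illustration: the function name `memory_pressure`, the 80% warning ratio, and the three labels are all hypothetical, and the usage figures would come from whatever metrics source the platform provides.

```python
def memory_pressure(used_bytes, limit_bytes, warn_ratio=0.8):
    """Classify memory usage relative to the container's limit."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    ratio = used_bytes / limit_bytes
    if ratio >= 1.0:
        return "over-limit"  # at or beyond the limit; a kill is likely
    if ratio >= warn_ratio:
        return "warning"  # close to the limit; consider raising it or fixing a leak
    return "ok"
```

A "warning" result gives a team time to adjust limits or investigate before the container is killed for exceeding them.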