Archive for orchestration and scheduling

DockerCon 2016 Keynote

Andrea and I presented Docker 1.12 orchestration on stage at DockerCon this year — in front of 4,000 people! I don’t even think I’ve ever met 4,000 people. It was awesome!

Comments (5)

What is the Firmament scheduler?

Some in the Kubernetes community are considering adopting a new scheduler based on Malte Schwarzkopf’s Firmament cluster scheduler. I just finished reading Ch. 5 of Malte’s thesis. Here’s a high level summary of what Firmament is about.

Today’s container orchestration systems like Kubernetes, Mesos, Diego and Docker Swarm rely heavily on straightforward heuristics for scheduling. This works well if you want to optimize along a single dimension, like efficient bin packing of workloads to servers. But these heuristics are not designed to simultaneously handle complex tradeoffs between competing priorities like data locality, scheduling delay, soft and hard affinity constraints, inter-task dependency constraints, etc. Taking so many factors into account at once is difficult.

The Firmament scheduler tries to optimize across many tradeoffs, while still making fast scheduling decisions. How? Like Microsoft’s Quincy scheduler, it considers things from a new angle: cost. Suppose we assign a cost to every scheduling tradeoff. The problem of efficient scheduling then becomes a global cost minimization problem, which is much more tractable than trying to design a heuristic that balances many different factors.

Firmament’s technical implementation is to model the scheduling problem as a flow graph. Workloads are the flow sources, and they flow into the cluster, whose topology of machines and availability zones is modeled by vertices in the graph. Ultimately, all workloads arrive at a global sink, having either flowed through a machine on which they were scheduled or having remained unscheduled. Which path is decided by cost.

Here’s a simplified diagram I created (based on Firmament’s diagram (which is a simpler version of Quincy’s Fig. 4)):

Simplified example of Firmament's flow graph structure.

Simplified example of Firmament’s flow graph structure. By assigning costs to each edge, global cost minimization can be performed. For example, each of the three workloads may be scheduled on the cluster or remain unscheduled, depending on the relative costs of their immediate execution vs. delay.

But how are these costs determined? That’s the coolest part of Firmament: it supports pluggable cost models through a cost model API. Firmament provides several performance-based cost models as well as an interesting one that seeks to minimize data center electricity consumption. Of course, users can supply their own cost models through the API.

For more information on Firmament, here are some resources:

Comments (1)

Kubernetes Concepts

Once you have a Kubernetes cluster up and running, there are three key abstractions to understand: pods, services and replication controllers.

Pods. Pods — as in a pod of whales (whale metaphors are very popular in this space) — is a group of containers scheduled on the same host. They are tightly coupled because they are all part of the same application and would have run on the same host in the old days. Each container in a pod shares the same network, IPC and PID namespaces. Of course, since Docker doesn’t support shared PID namespaces (every Docker process is PID 1 of its own hierarchy and there’s no way to merge two running containers), a pod right now is really just a group of Docker containers running on the same host with shared Kubernetes volumes (as distinct from Docker volumes).

Pods are a low level primitive. Users do not normally create them directly; instead, replication controller are responsible for creating pods (see below).

You can view pods like this:

kubectl.sh get pods

Read more about pods in the Kubernetes documentation: Kubernetes Pods

Replication Controllers. Pods, like the containers within them, are ephemeral. They do not survive node failure or reboots. Instead, replication controllers are used to keep a certain number of pod replicas running at all times, taking care to start new pod replicas when more or needed. Thus, replication controllers are longer lived than pods and can be thought of like a manager abstraction sitting atop of the pod concept.

You can view replication controllers like this:

kubectl.sh get replicationControllers

Read more about replication controllers in the Kubernetes documentation: Replication Controllers in Kubernetes.

Services. Services are an abstraction that groups together multiple pods to provide a service. (The term “service” here is used in the microservices architecture sense.) The example in the Kubernetes documentation is that of an image-processing backend, which may consist of several pod replicas. These replicas, grouped together, represent the image processing microservice within your larger application.

A service is longer lived than a replication controller, and a service may create or destroy many replication controllers during its life. Just as replication controllers are a management abstraction sitting atop the pods abstraction, services can be thought of as a control abstraction that sits atop multiple replication controllers.

You can view services like this:

kubectl.sh get services

Read more about services in the Kubernetes documentation: Kubernetes Services.

Source / Further Reading: Design of Kubernetes

Leave a Comment