Archive for November, 2015

What is the Firmament scheduler?

Some in the Kubernetes community are considering adopting a new scheduler based on Malte Schwarzkopf’s Firmament cluster scheduler. I just finished reading Ch. 5 of Malte’s thesis. Here’s a high-level summary of what Firmament is about.

Today’s container orchestration systems like Kubernetes, Mesos, Diego and Docker Swarm rely heavily on straightforward heuristics for scheduling. This works well if you want to optimize along a single dimension, like efficient bin packing of workloads to servers. But these heuristics are not designed to simultaneously handle complex tradeoffs between competing priorities like data locality, scheduling delay, soft and hard affinity constraints, inter-task dependency constraints, etc. Taking so many factors into account at once is difficult.

The Firmament scheduler tries to optimize across many tradeoffs, while still making fast scheduling decisions. How? Like Microsoft’s Quincy scheduler, it considers things from a new angle: cost. Suppose we assign a cost to every scheduling tradeoff. The problem of efficient scheduling then becomes a global cost minimization problem, which is much more tractable than trying to design a heuristic that balances many different factors.

Firmament’s technical implementation is to model the scheduling problem as a flow graph. Workloads are the flow sources, and they flow into the cluster, whose topology of machines and availability zones is modeled by vertices in the graph. Ultimately, all workloads arrive at a global sink, having either flowed through a machine on which they were scheduled or having remained unscheduled. Which path each workload takes is decided by cost.
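To make the flow-graph idea concrete, here is a minimal Python sketch of scheduling as min-cost flow. It is not Firmament’s code: the node names, the `FlowGraph` class, the `schedule` helper, the extra "source" node feeding every task, and all the cost values are assumptions made up for illustration, and the solver is a plain successive-shortest-paths loop rather than whatever solver Firmament actually uses.

```python
from collections import defaultdict

INF = float("inf")


class FlowGraph:
    """Tiny min-cost flow solver (successive shortest paths + Bellman-Ford)."""

    def __init__(self):
        self.edges = []               # [to, residual capacity, cost]; edge i ^ 1 is its reverse
        self.adj = defaultdict(list)  # node -> indices of outgoing edges

    def add_edge(self, u, v, capacity, cost):
        index = len(self.edges)
        self.adj[u].append(index)
        self.edges.append([v, capacity, cost])
        self.adj[v].append(index + 1)
        self.edges.append([u, 0, -cost])  # residual (reverse) edge
        return index

    def min_cost_flow(self, source, sink):
        total_flow = total_cost = 0
        nodes = list(self.adj)
        while True:
            # Bellman-Ford: cheapest path from source in the residual graph.
            dist, prev_edge = {source: 0}, {}
            for _ in range(len(nodes)):
                updated = False
                for u in nodes:
                    if u not in dist:
                        continue
                    for ei in self.adj[u]:
                        v, cap, cost = self.edges[ei]
                        if cap > 0 and dist[u] + cost < dist.get(v, INF):
                            dist[v] = dist[u] + cost
                            prev_edge[v] = ei
                            updated = True
                if not updated:
                    break
            if sink not in dist:
                return total_flow, total_cost  # no augmenting path left
            # Find the bottleneck capacity along the path, then push flow.
            push, v = INF, sink
            while v != source:
                ei = prev_edge[v]
                push = min(push, self.edges[ei][1])
                v = self.edges[ei ^ 1][0]
            v = sink
            while v != source:
                ei = prev_edge[v]
                self.edges[ei][1] -= push
                self.edges[ei ^ 1][1] += push
                v = self.edges[ei ^ 1][0]
            total_flow += push
            total_cost += push * dist[sink]


def schedule(tasks, machines, run_cost, wait_cost):
    """Return (total cost, sorted list of tasks left unscheduled)."""
    g = FlowGraph()
    wait_edges = {}
    for t in tasks:
        g.add_edge("source", t, 1, 0)            # each task supplies one unit of flow
        g.add_edge(t, "cluster", 1, run_cost[t])  # cost of running now
        wait_edges[t] = g.add_edge(t, "unscheduled", 1, wait_cost[t])
    for m in machines:
        g.add_edge("cluster", m, 1, 0)           # assumption: one slot per machine
        g.add_edge(m, "sink", 1, 0)
    g.add_edge("unscheduled", "sink", len(tasks), 0)
    _, cost = g.min_cost_flow("source", "sink")
    # A task is unscheduled if its edge to "unscheduled" is saturated.
    unscheduled = sorted(t for t, ei in wait_edges.items() if g.edges[ei][1] == 0)
    return cost, unscheduled


total_cost, unscheduled = schedule(
    tasks=["T0", "T1", "T2"],
    machines=["M0", "M1"],
    run_cost={"T0": 2, "T1": 2, "T2": 2},
    wait_cost={"T0": 10, "T1": 5, "T2": 8},
)
print(total_cost, unscheduled)
```

With three tasks and only two machine slots, one task has to wait. The solver leaves `T1` unscheduled because it is the cheapest task to delay, giving a total cost of 9. Nothing in the code says "delay the cheapest task"; that decision falls out of the global cost minimization.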

Here’s a simplified diagram I created, based on Firmament’s diagram, which is itself a simpler version of Quincy’s Fig. 4:


Simplified example of Firmament’s flow graph structure. By assigning costs to each edge, global cost minimization can be performed. For example, each of the three workloads may be scheduled on the cluster or remain unscheduled, depending on the relative costs of their immediate execution vs. delay.

But how are these costs determined? That’s the coolest part of Firmament: it supports pluggable cost models through a cost model API. Firmament provides several performance-based cost models as well as an interesting one that seeks to minimize data center electricity consumption. Of course, users can supply their own cost models through the API.
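As a rough illustration of what "pluggable" means here, the sketch below defines a hypothetical cost model interface in Python and two toy models behind it. None of these names (`Task`, `CostModel`, `WaitTimeCostModel`, `EnergyCostModel`, `edge_costs`) or formulas come from Firmament, whose real cost model API I am not reproducing; they only show how swapping the model changes the edge costs that the flow solver would see.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Task:
    name: str
    wait_seconds: int  # how long the task has already been waiting
    est_watts: int     # made-up estimate of the power the task would draw


class CostModel(Protocol):
    """Hypothetical interface: one cost per outgoing edge of a task."""

    def run_cost(self, task: Task) -> int: ...
    def unscheduled_cost(self, task: Task) -> int: ...


class WaitTimeCostModel:
    """Toy performance-style model: waiting gets more expensive over time."""

    def run_cost(self, task: Task) -> int:
        return 1

    def unscheduled_cost(self, task: Task) -> int:
        return 5 + task.wait_seconds


class EnergyCostModel:
    """Toy energy-style model: running a power-hungry task is expensive."""

    def run_cost(self, task: Task) -> int:
        return task.est_watts // 10

    def unscheduled_cost(self, task: Task) -> int:
        return 20


def edge_costs(tasks, model: CostModel):
    """Map task name -> (cost to run now, cost to stay unscheduled)."""
    return {t.name: (model.run_cost(t), model.unscheduled_cost(t)) for t in tasks}


tasks = [Task("web", wait_seconds=30, est_watts=50),
         Task("batch", wait_seconds=0, est_watts=400)]
wait_costs = edge_costs(tasks, WaitTimeCostModel())
energy_costs = edge_costs(tasks, EnergyCostModel())
print(wait_costs)
print(energy_costs)
```

The graph structure and the solver stay the same in both cases. Only the numbers on the edges change, which is why, under the toy energy model, the `batch` task becomes cheaper to delay than to run.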

For more information on Firmament, here are some resources:
