Kubernetes Operators — When, how and the gotchas to keep in mind

Kubernetes Operators have now become mainstream. An Operator is essentially a Kubernetes Custom Controller managing one or more Custom Resources. The term Operator has become popular because it succinctly captures the end result that a Custom Controller+Custom Resource combination is typically defined for, namely declaratively managing stateful software on Kubernetes (e.g., databases, off-the-shelf web applications, ML workloads). Strictly speaking, a ‘Custom Resource Definition’ (CRD) is the Kubernetes object that registers a new Custom Resource type with the API server, and the Custom Controller is the separate program that acts on resources of that type. Because the two are usually built and shipped together, in the following post we use Operator and CRD interchangeably.

KubeCon 2018 in Seattle had a great customizing and extensibility track. Presentations ranged from the visionary, such as converting all Kubernetes constructs to CRDs, to examples of CRDs written for specific requirements, such as a gaming server (Agones), a workflow system (Airflow), databases (Postgres Operator), and proprietary SaaS products (Kolide), to Custom Controllers without Custom Resources (Airbnb). Still, platform teams looking to build their own Operators face several questions about the overall approach. Which life-cycle actions of a piece of software should be automated using Operators? How should such actions be modeled using the declarative primitives of a Custom Resource (the Custom Resource Spec)? When should Custom sub-resources be used? How should community Operators be evaluated? Will different Custom Resources interoperate correctly? As an example, here is a recent email from the kubernetes-dev mailing list with such questions.

At CloudARK, we have seen our share of struggles with such questions. In this post we want to highlight a few key points that can help you get started on your journey. Towards the end we list additional pointers that you should check out as well.

1. How to model declarative state?

A Custom Controller together with a Custom Resource forms a new declarative API. Hence, when you start developing your Operator/CRD, it is important to model its workflow actions in a declarative manner.

As a concrete example, suppose you are developing an Operator to manage Postgres instances on Kubernetes. The actions that you want to automate are adding/deleting users and adding/deleting databases. How would you model this using a Custom Resource? At first, you may think of creating a Spec definition that accepts commands as inputs, such as ‘create database’, ‘create user’, etc. This approach would make your Spec very generic: it could carry any Postgres command, not just those that create databases or users. But the problem is that your Operator will no longer be able to track the state of the underlying resource (the Postgres database instance). It acts only as a conduit for executing commands, without any higher-level understanding of what those commands do. This is not recommended, because the reconciliation logic in your Operator’s Custom Controller works by comparing the desired state with the actual state and closing the gap, and in this case it does not know what the desired state should be. It only knows that there are a bunch of commands that it is executing! Moreover, the Spec is no longer declarative; it has become imperative. What is the correct way then?

The correct way is to explicitly define attributes in your Spec whose values describe the state of your Custom Resource. In our Postgres Custom Resource example, users and databases become these attributes, and their values are the names of the users and databases that you want to exist. The presence of a new name in the value of the users attribute means that the corresponding user needs to be created if it does not already exist. The absence of a name means the opposite (delete the entity if it exists). With this change your Spec becomes declarative. The actual command logic for creating users and databases should be implemented as part of the Custom Controller code. With this design the Custom Controller always has a correct understanding of which users/databases should exist in a database instance: they are exactly the values in the Spec. The Controller can use that to drive any reconciliation decisions.
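To make the difference concrete, here is a minimal sketch of the comparison a Custom Controller could run on each reconciliation. It is an illustration under assumptions, not the implementation of any particular Postgres Operator: the names PostgresSpec and plan, and the field names users and databases, are hypothetical, and the actual state is passed in directly rather than queried from a live Postgres instance.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PostgresSpec:
    """Names that should exist (hypothetical Spec shape, for illustration only)."""
    users: frozenset = field(default_factory=frozenset)
    databases: frozenset = field(default_factory=frozenset)


def plan(desired, actual):
    """Return the (verb, kind, name) actions that move `actual` toward `desired`."""
    actions = []
    for kind in ("users", "databases"):
        want, have = getattr(desired, kind), getattr(actual, kind)
        # A name present in the Spec but missing in Postgres must be created.
        actions += [("create", kind, name) for name in sorted(want - have)]
        # A name absent from the Spec but present in Postgres must be dropped.
        actions += [("drop", kind, name) for name in sorted(have - want)]
    return actions


desired = PostgresSpec(users=frozenset({"alice", "bob"}),
                       databases=frozenset({"orders"}))
actual = PostgresSpec(users=frozenset({"bob", "carol"}))

for action in plan(desired, actual):
    print(action)
```

Because the plan is derived from the two states rather than from a list of commands, running it again after the actions have been applied yields an empty list, which is the property that makes repeated reconciliation safe.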

2. What about non-declarative actions?

There might be some actions that you cannot model as declarative state updates. For example, suppose you want your Postgres Operator to report the history of the actions it has executed, such as when a particular database/user was added to the Postgres Custom Resource Spec, or what the difference is between two Custom Resource Specs. For modeling such actions you can consider using Custom Controller+Custom sub-resource constructs. A Custom sub-resource is essentially an action that does not need any declarative state input. The Custom Controller that handles the Custom sub-resource acts as the entry point for performing the required actions. Some examples of such non-declarative actions in native Kubernetes are ‘exec’, ‘logs’, ‘scale’. You can consider defining similar actions for your particular needs. In order to implement Custom sub-resources, though, you will need to use a Kubernetes Aggregated API server (AA) rather than a Custom Resource Definition (CRD), since CRDs support only the predefined status and scale sub-resources. (Check out our post on patterns of Kubernetes API extensions for the distinction between CRD and AA.) The main advantage of using Custom sub-resources is that they are accessible directly through kubectl. This is very useful, as it ensures that your Operator’s consumers won’t have to install and learn any new CLI.
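As a rough sketch of what the logic behind such history actions might look like, the two functions below answer the example questions from this section over a list of recorded Spec snapshots. Everything here is an assumption made for illustration: the function names spec_diff and first_seen, the snapshot format (a dict of attribute name to list of names), and the timestamps are hypothetical, and the wiring of this logic into an Aggregated API server endpoint is left out.

```python
def spec_diff(old, new):
    """Compare two Spec snapshots (dicts mapping attribute -> list of names)."""
    diff = {}
    for key in sorted(set(old) | set(new)):
        before, after = set(old.get(key, [])), set(new.get(key, []))
        if before != after:
            diff[key] = {"added": sorted(after - before),
                         "removed": sorted(before - after)}
    return diff


def first_seen(history, key, name):
    """Return the timestamp of the earliest snapshot whose `key` contains `name`.

    `history` is a list of (timestamp, spec) pairs ordered oldest first.
    """
    for timestamp, spec in history:
        if name in spec.get(key, []):
            return timestamp
    return None


# Hypothetical recorded history of one Postgres Custom Resource.
history = [
    ("2018-12-01T10:00:00Z", {"users": ["alice"], "databases": []}),
    ("2018-12-02T09:30:00Z", {"users": ["alice", "bob"], "databases": ["orders"]}),
    ("2018-12-03T14:15:00Z", {"users": ["bob"], "databases": ["orders"]}),
]

print(first_seen(history, "users", "bob"))
print(spec_diff(history[0][1], history[2][1]))
```

Neither function takes a desired state as input; each only reads what was recorded, which is why actions like these fit a sub-resource better than a Spec attribute.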

3. How to get started?

Once you decide to start developing your Operator, what tools do you have? There are four that you will find helpful: sample-controller, kubebuilder, Operator SDK, and sample-apiserver. sample-controller is the fundamental one; its architecture provides a good starting point to keep in mind when designing your Operator. The key concepts in it are a workqueue, event dispatching functionality, and shared informers. (Refer to this post to understand how they all work together.) Kubebuilder and Operator SDK are frameworks that provide opinionated abstractions on top of these concepts. They initially started out with distinct architectures. Recently they have converged towards using a common core library (controller-runtime). The current distinction between the two seems to be mostly around directory layout and the RBAC-related annotations on the Reconcile function that Kubebuilder supports. sample-apiserver is useful if you want to develop Custom sub-resources. For additional examples of Aggregated API Servers (AA), you can refer to kubediscovery and kubeprovenance, which we have developed.
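The interplay between event handlers, a local cache, and a workqueue can be sketched in a few lines. This is a simplified, single-threaded toy model and not the client-go API: the classes WorkQueue and Controller and the method names on_event and run_once are hypothetical stand-ins, and real controllers add rate limiting, retries, and multiple workers.

```python
from collections import deque


class WorkQueue:
    """FIFO of object keys that ignores a key already waiting to be processed."""

    def __init__(self):
        self._items = deque()
        self._pending = set()

    def add(self, key):
        if key not in self._pending:
            self._pending.add(key)
            self._items.append(key)

    def get(self):
        key = self._items.popleft()
        self._pending.discard(key)
        return key

    def __len__(self):
        return len(self._items)


class Controller:
    def __init__(self, reconcile):
        self.cache = {}  # stand-in for the informer's local store
        self.queue = WorkQueue()
        self.reconcile = reconcile

    def on_event(self, key, obj):
        """Event handler: update the cache, then enqueue only the key."""
        if obj is None:  # None models a delete event
            self.cache.pop(key, None)
        else:
            self.cache[key] = obj
        self.queue.add(key)

    def run_once(self):
        """Drain the queue, reconciling each key against the latest cached object."""
        while len(self.queue):
            key = self.queue.get()
            self.reconcile(key, self.cache.get(key))


seen = []
controller = Controller(lambda key, obj: seen.append((key, obj)))
controller.on_event("default/pg1", {"users": ["alice"]})
controller.on_event("default/pg1", {"users": ["alice", "bob"]})
controller.on_event("default/pg2", None)
controller.run_once()
print(seen)
```

Two quick updates to default/pg1 result in a single reconcile call that sees only the latest object. Queuing keys rather than events is what lets the reconcile function work from current state instead of replaying every change.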


The above suggestions are a subset of the overall guidelines that we are developing towards simplifying the design, implementation, packaging, discovery, and interoperability of Operators. Here is the link to the complete list of guidelines. If you have any suggestions or questions about these, we would love to hear them. Please consider filing a GitHub issue. We are looking to build and improve upon these guidelines based on community input.

And if you are developing Operators, or are thinking about developing them, we would love to collaborate with you. We have been studying community Operators, including the tooling around them, for some time. We have also built CRDs and Aggregated API servers ourselves (https://github.com/cloud-ark). This exercise has enabled us to develop insights into Kubernetes API extensions in general and the Operator Pattern in particular. We are excited to bring those insights to bear on your endeavors.