Kubernetes scalability bottlenecks: can Kubernetes scale to manage one million objects?
While Kubernetes was originally designed as a platform for orchestrating containers, it is increasingly becoming a platform for orchestrators and for applications that manage other kinds of entities. Kubernetes can run applications that manage VMs, multiple Kubernetes clusters, edge devices and applications, and it can also orchestrate provisioning of bare-metal machines and of OpenShift clusters, among other things.
In this blog post I describe common characteristics and a generic architecture of management applications that use the Kubernetes API for representing application state. I classify the management applications that use the Kubernetes API as either Kubernetes-native or as fully-Kubernetes-native (I invented the latter term). Lastly, I examine scalability bottlenecks of Kubernetes that can hinder applications that manage a large number of objects using the Kubernetes API. I outline possible workarounds for some of the bottlenecks.
Kubernetes-native applications
Some of the management applications mentioned above describe themselves as Kubernetes-native or Kubernetes-style: they are implemented and operated using the Kubernetes API, the CLI and tools from the Kubernetes ecosystem. I did not find a precise definition of what Kubernetes-native means. Sometimes the broader term cloud-native is used. The following list provides common characteristics of Kubernetes-native applications. The list is not meant to be exhaustive, and not all Kubernetes-native applications have all the characteristics.
- Applying the controller pattern, or a special kind of the controller pattern called the operator pattern.
- Representing control and configuration information as Kubernetes ConfigMaps, Kubernetes Secrets and Kubernetes Custom Resources.
- Using Kubernetes software libraries, such as client-go, controller-runtime, OperatorSDK.
- Application components report Prometheus metrics, emit Kubernetes events, and define Kubernetes liveness, readiness and startup probes.
- The applications use tools from the CNCF landscape, such as Prometheus for monitoring and alerts, Grafana for observability, service meshes such as Linkerd for advanced traffic control and security, Open Policy Agent for policy-based control, Crossplane for infrastructure management, and many others.
- The applications are operated using GitOps and Continuous Deployment tools for Kubernetes, such as Flux and ArgoCD.
In my opinion, a Kubernetes-native application does not have to run on a Kubernetes cluster. It can run elsewhere, communicate with the Kubernetes API of some Kubernetes cluster and have all of the characteristics above or a subset of them.
Fully-Kubernetes-native applications
Some of the applications even store all the configuration data for the objects they manage using the Kubernetes API. I call such applications fully Kubernetes-native.
There are multiple advantages of the fully-Kubernetes-native applications:
- Since the applications store all their data in the same database in which Kubernetes stores its data (etcd), there is no need to operate an additional database, define its schema and program access to it. The applications get schema validation and versioning of their data for free. The application developers only need to learn the Kubernetes API and Kubernetes Custom Resource Definitions; there is no need to learn SQL or other database query and data-definition languages (see the sketch after this list).
- The applications get an API server (the Kubernetes API server) and a CLI (kubectl) for their data at no cost. kubectl can be extended for the needs of the application using kubectl plugins.
- The applications use Kubernetes authentication and Kubernetes authorization mechanisms. In particular, the application admins can define Kubernetes Role-Based Access Control (RBAC) rules to specify the operations the users can perform on managed objects, without being required to use external authorization systems.
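As a minimal illustration of the first point, the application's data model can be expressed as Go types, with the CRD schema typically generated from them (for example by controller-gen). The ManagedDevice kind, its group and its fields below are hypothetical and are reused in later sketches:

```go
// Package v1alpha1 sketches a hypothetical ManagedDevice custom resource.
// The kind, group and fields are invented for illustration; in a real
// project the CRD manifest would typically be generated by controller-gen.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ManagedDeviceSpec is the desired configuration pulled by the agent.
type ManagedDeviceSpec struct {
	// Firmware version the device should run.
	FirmwareVersion string `json:"firmwareVersion,omitempty"`
	// Applications that should be deployed on the device.
	Applications []string `json:"applications,omitempty"`
}

// ManagedDeviceStatus is reported back by the management agent.
type ManagedDeviceStatus struct {
	ObservedFirmwareVersion string             `json:"observedFirmwareVersion,omitempty"`
	Conditions              []metav1.Condition `json:"conditions,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// ManagedDevice stores the desired and reported state of one managed entity.
type ManagedDevice struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ManagedDeviceSpec   `json:"spec,omitempty"`
	Status ManagedDeviceStatus `json:"status,omitempty"`
}
```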
Kubernetes-native applications that are not fully Kubernetes-native store some of their state in a database other than etcd. They have to implement data access to that database, authorization for the data access, and some API server on top of the database.
Implementing applications in the fully-Kubernetes-native way might spare development and operational effort, and reduce skill requirements. The question, though, is: can such applications scale to manage a large number of objects? If one wants to manage 5G network infrastructure or edge devices, one might need to manage tens of thousands, hundreds of thousands or, in the future, maybe even millions of objects. In the remainder of this blog post I describe Kubernetes scalability bottlenecks that can hamper fully-Kubernetes-native applications in managing large numbers of objects. These are the bottlenecks I encountered when working with fully-Kubernetes-native applications; readers are welcome to point out more Kubernetes bottlenecks in the comments.
Fully-Kubernetes-native management applications
To facilitate explanation of scalability requirements of the management applications, I present a generic architecture for fully-Kubernetes-native management applications in the following diagram:
In the diagram above, a management application manages some managed entities (sorry for too many "management" words). Examples of managed entities are VMs, bare metal machines, edge devices, and other Kubernetes clusters. Each managed entity might have managed subentities. In the case of a VM, examples of managed subentities are network interfaces and disks; in the case of an edge device, an example of a managed subentity is a mobile application that runs on the device. The management agents run on the managed entities and watch for the desired configuration of their managed entities. The desired configuration resides in the application CRs (Kubernetes Custom Resources). The management agents act according to the desired configuration: for example, they add a disk or a network interface, deploy a mobile application or configure an existing application. The management agents report the status of the managed entities and of the managed subentities by updating the status field of the CRs or by using dedicated CRs. The management agents get and update the application CRs through the Kubernetes API.
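A minimal sketch of such an agent's pull loop, assuming the hypothetical ManagedDevice CR introduced above (the group, namespace, object name and kubeconfig path are invented): the agent watches only its own CR, selected by name, and reacts to changes of the desired spec.

```go
// A sketch of a management-agent pull loop: the agent watches only its own
// (hypothetical) ManagedDevice CR and applies the desired configuration.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// The agent authenticates with the credentials of its own service account;
	// a kubeconfig file is used here for simplicity.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/agent/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Hypothetical group/version/resource of the application CR.
	gvr := schema.GroupVersionResource{Group: "mgmt.example.com", Version: "v1alpha1", Resource: "manageddevices"}
	deviceName := "device-42" // the entity this agent is responsible for

	// Watch only this agent's CR, selected by name.
	w, err := client.Resource(gvr).Namespace("fleet").Watch(context.TODO(), metav1.ListOptions{
		FieldSelector: "metadata.name=" + deviceName,
	})
	if err != nil {
		panic(err)
	}
	for event := range w.ResultChan() {
		obj := event.Object.(*unstructured.Unstructured)
		spec, _, _ := unstructured.NestedMap(obj.Object, "spec")
		fmt.Printf("reconciling local state against desired spec: %v\n", spec)
		// ... apply the configuration locally, then update the status via the API.
	}
}
```

A production agent would additionally re-establish the watch on disconnects (for example with an informer or a retry watcher), which matters on intermittently connected edge devices.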
The users of the application can provide the desired configuration for the managed entities and subentities by creating or updating application CRs using the kubectl create, apply and edit commands. The users can get the status of the managed entities and subentities by reading application CRs using kubectl get. GitOps and other tools such as Web consoles and dashboards, from the Kubernetes ecosystem or custom ones, can operate the managed entities and subentities through the Kubernetes API. Note that while the Kubernetes API for management and the Kubernetes API for agents are represented in the diagram as different objects, they are served by the same Kubernetes API server and may be identical.
The users can use a custom Web console to perform management operations through a GUI instead of the CLI. (I omitted the Web console from the diagram above for brevity.) The management application may be implemented as a set of controllers/operators that watch the application CRs. The management application reconciles the desired state (of managed entities or of managed subentities) with the actual state reported by the management agents. Note that while the management application and the external tools are depicted in the diagram outside of the Kubernetes cluster, they may run inside the cluster. (They consume the same Kubernetes API in both cases, whether running inside or outside the cluster.) The management agents, the management application and the external tools may be implemented using standard Kubernetes libraries and may operate according to the controller pattern.
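On the management-application side, a reconciler built with controller-runtime could look roughly like the sketch below. It reuses the hypothetical ManagedDevice type and an invented module path, and only illustrates the shape of the controller pattern described above:

```go
// A sketch of a reconciler in the management application, built with
// controller-runtime; ManagedDevice is the hypothetical CR from above.
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	mgmtv1alpha1 "example.com/mgmt/api/v1alpha1" // hypothetical module path
)

type ManagedDeviceReconciler struct {
	client.Client
}

// Reconcile compares the desired spec with the status reported by the agent
// and reacts to the difference, for example by updating a condition.
func (r *ManagedDeviceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var device mgmtv1alpha1.ManagedDevice
	if err := r.Get(ctx, req.NamespacedName, &device); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	if device.Status.ObservedFirmwareVersion != device.Spec.FirmwareVersion {
		// The agent has not yet converged; requeue or record a condition here.
		return ctrl.Result{Requeue: true}, nil
	}
	return ctrl.Result{}, nil
}

func (r *ManagedDeviceReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&mgmtv1alpha1.ManagedDevice{}).
		Complete(r)
}
```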
One of the important aspects of management is security. A compromised agent on one managed entity might access or modify data of other managed entities in the application CRs. In Kubernetes such attacks can be prevented by restricting the permissions of the agent to allow access only to the CRs of the agent's entity. Restricting the permissions of an agent can be accomplished in Kubernetes declaratively by allocating a service account to represent the agent and by specifying Kubernetes access control rules for the service account. Alternatively, authorization WebHooks or admission controllers (for control of updates only) might be used, but those require implementing and maintaining additional components.
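A sketch of such a per-agent restriction, built with the standard RBAC API types: a Role whose resourceNames field limits access to a single, hypothetical ManagedDevice CR, bound to the agent's service account (all names and the namespace are invented):

```go
// A sketch of per-agent authorization: a Role limited by resourceNames to a
// single (hypothetical) ManagedDevice CR, bound to the agent's service account.
package mgmt

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func agentRBAC(deviceName string) (*rbacv1.Role, *rbacv1.RoleBinding) {
	role := &rbacv1.Role{
		ObjectMeta: metav1.ObjectMeta{Name: "agent-" + deviceName, Namespace: "fleet"},
		Rules: []rbacv1.PolicyRule{{
			APIGroups:     []string{"mgmt.example.com"},
			Resources:     []string{"manageddevices", "manageddevices/status"},
			ResourceNames: []string{deviceName}, // only this agent's CR
			// Note: a watch restricted by resourceNames is authorized only if the
			// request uses a matching metadata.name field selector, as the agent
			// sketch above does.
			Verbs: []string{"get", "watch", "update", "patch"},
		}},
	}
	binding := &rbacv1.RoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: "agent-" + deviceName, Namespace: "fleet"},
		Subjects: []rbacv1.Subject{{
			Kind:      rbacv1.ServiceAccountKind,
			Name:      "agent-" + deviceName,
			Namespace: "fleet",
		}},
		RoleRef: rbacv1.RoleRef{
			APIGroup: rbacv1.GroupName,
			Kind:     "Role",
			Name:     "agent-" + deviceName,
		},
	}
	return role, binding
}
```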
Notice an additional important point in this architecture: the management agents initiate network connections to the Kubernetes cluster that hosts the application CRs; the management application does not initiate network connections to the managed entities. The management agents pull the configuration data; the management application does not push the configuration data to the managed entities directly. Such a design facilitates network communication in the case where the managed entities are deployed on the edge:
- The managed entities may be intermittently connected to the network. When a managed entity becomes connected, its management agent connects to the Kubernetes cluster, pulls the configuration data for its entity and reports the status back. If the management application initiated connections to the managed entities, it would have to handle network disconnections in an environment with limited connectivity.
- On the edge, there could be a firewall that prevents network connections from being initiated from the outside and allows connections to be initiated only by the managed entities.
Kubernetes scalability bottlenecks
The most obvious scalability bottleneck is storage. The etcd documentation recommends 8GB as the maximum storage size. This means that an application managing one million objects can store only up to 8KB of data per object. In practice, the limit is lower, since etcd also stores other cluster data and a limited history of the objects.
If authorization is handled by creating a service account per management agent as described above, then the management application needs to allocate one million service accounts to manage one million entities securely. The size of a service account in Kubernetes can be on the order of 10KB, which means the management application cannot even create one million service accounts within the recommended etcd storage limit of 8GB (10KB * 1,000,000 = 10GB). Compare this storage limit with the storage limits of a leading SQL database, PostgreSQL: the database size is virtually unlimited, a single relation can be 32TB, and a single field can be 1GB (!).
A possible solution to the storage limitations of etcd is Kine. Kine is an etcd shim, an adapter of the etcd API to other databases, such as MySQL and PostgreSQL. However, even if the storage problem of Kubernetes is solved, there is another bottleneck for Kubernetes scalability, namely the single-resource API. The mutating API verbs, such as CREATE, UPDATE and DELETE, support single resources only. A client that wishes to apply such a verb to many resources must make a separate request for each of those resources.
Consider the case where the management application needs to rotate one million certificates or update a common configuration for one million managed entities. Another case is when a management agent has to update the status of multiple subentities of its managed entity. If the management application or a management agent needs to update a large number of application CRs at once, tough luck: they must perform the updates one by one. Performing updates one by one involves a network roundtrip per application CR, including handling network failures and retries for each CR. The issue can be especially acute when connectivity between a management agent and the Kubernetes cluster is limited or has high latency, for example when managing edge devices.
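To illustrate, here is a sketch of what such a bulk operation degenerates into with the current API: a loop of single-object PATCH requests, each a separate network roundtrip with its own failure handling. The dynamic client is used with the hypothetical ManagedDevice resource; the caBundleRef field is invented.

```go
// A sketch of the "one request per object" pattern the single-resource API
// forces: patching many (hypothetical) ManagedDevice CRs in a loop.
package mgmt

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

func rotateCertificates(ctx context.Context, c dynamic.Interface, names []string, newCARef string) {
	gvr := schema.GroupVersionResource{Group: "mgmt.example.com", Version: "v1alpha1", Resource: "manageddevices"}
	patch := []byte(fmt.Sprintf(`{"spec":{"caBundleRef":%q}}`, newCARef))
	for _, name := range names {
		// One network roundtrip per CR; retries and failures are handled per object.
		_, err := c.Resource(gvr).Namespace("fleet").Patch(ctx, name, types.MergePatchType, patch, metav1.PatchOptions{})
		if err != nil {
			fmt.Printf("patching %s failed: %v (must retry individually)\n", name, err)
		}
	}
}
```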
Compare the Kubernetes API with a SQL database: in SQL you can insert multiple rows in a single INSERT statement, and you can update multiple rows with an UPDATE statement whose WHERE clause selects multiple rows. There are also batch operations for sending multiple INSERT and UPDATE commands in one batch. Some SQL databases even have binary bulk operations, where multiple inserts can be packed into a binary blob on the client side and sent to the server as binary. See, for example, batch queries and COPY protocol support for faster bulk data loads in the pgx Go driver for PostgreSQL.
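For contrast, a sketch of the same bulk update against PostgreSQL with the pgx driver: one statement can update many rows, and independent statements can be queued into a single batch roundtrip (the table and column names are made up).

```go
// For contrast with the single-resource Kubernetes API: a sketch of a bulk
// update against PostgreSQL using pgx batches (table/columns are invented).
package mgmt

import (
	"context"

	"github.com/jackc/pgx/v5"
)

func rotateCertificatesSQL(ctx context.Context, conn *pgx.Conn, names []string, newCARef string) error {
	// One statement updates many rows at once...
	if _, err := conn.Exec(ctx,
		"UPDATE managed_devices SET ca_bundle_ref = $1 WHERE name = ANY($2)",
		newCARef, names); err != nil {
		return err
	}

	// ...and independent statements can still be sent in a single roundtrip.
	batch := &pgx.Batch{}
	for _, name := range names {
		batch.Queue("UPDATE managed_devices SET last_rotated = now() WHERE name = $1", name)
	}
	return conn.SendBatch(ctx, batch).Close()
}
```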
Yet another Kubernetes API problem related to scalability is the inability to specify the sort order of returned lists of objects. The users of kubectl can specify a sort order using the --sort-by flag; however, the sorting is performed on the client side by kubectl. In a similar way, other tools may implement client-side sorting for various sort orders. Fetching one million objects and sorting them on the client side may strain the computational resources (memory and CPU) of the clients and waste bandwidth, especially when sorting is combined with pagination. Consider the case where a user wants to see the top ten objects out of one million according to some criterion. In this case the client (for example, a Web browser) must fetch all one million objects and find the first ten in the requested sort order. Alternatively, clients could use some proxy on top of the Kubernetes API and let this proxy perform caching, sorting and pagination for them. This would require additional effort to develop and maintain the proxy component, and would consume the computational resources required to run it.
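The consequence can be seen in a sketch of a "top ten" view built directly on the Kubernetes API: the client can paginate with limit and continue tokens, but since the server does not sort, it still has to fetch and sort every page (the hypothetical ManagedDevice resource again).

```go
// A sketch of why server-side sorting matters: the API paginates with
// limit/continue, but ordering has to be done by the client, so a "top ten"
// view still requires fetching every object.
package mgmt

import (
	"context"
	"sort"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

func topTenByName(ctx context.Context, c dynamic.Interface) ([]unstructured.Unstructured, error) {
	gvr := schema.GroupVersionResource{Group: "mgmt.example.com", Version: "v1alpha1", Resource: "manageddevices"}
	var all []unstructured.Unstructured
	opts := metav1.ListOptions{Limit: 500}
	for {
		list, err := c.Resource(gvr).Namespace("fleet").List(ctx, opts)
		if err != nil {
			return nil, err
		}
		all = append(all, list.Items...) // the client accumulates everything
		if list.GetContinue() == "" {
			break
		}
		opts.Continue = list.GetContinue()
	}
	// Sorting happens client-side, after all pages have been transferred.
	sort.Slice(all, func(i, j int) bool { return all[i].GetName() < all[j].GetName() })
	if len(all) > 10 {
		all = all[:10]
	}
	return all, nil
}
```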
Another possible performance bottleneck of the Kubernetes API is the lack of Protocol Buffers support for CRDs. Protocol Buffers with gRPC may provide 7 to 10 times faster message transmission compared to a REST API.
One more scalability issue with the Kubernetes ecosystem is the lack of a built-in mechanism for load balancing between replicas of controllers: there is no built-in way to distribute the processing of CR changes across multiple replicas of the same controller. The controller replicas usually perform leader election, and the elected leader performs all the reconciliation.
A side note on controller concurrency: by default, controllers that use controller-runtime do not perform reconciliation concurrently (the MaxConcurrentReconciles option is 1). Also, the work queue retry rate limit (in case of multiple errors) defaults to only 10 QPS. To understand rate limiting in controller-runtime and the Kubernetes Go client, check this great article.
While the defaults mentioned above can be changed by controller developers to increase concurrency, reconciling one million application CRs can strain the computational resources of a single controller. The developers of the management application must implement custom load-balancing solutions, spending development and maintenance effort, if they want to use the controller pattern with one million CRs. Note that various tools in the Kubernetes ecosystem are implemented as controllers, for example ArgoCD and Flux.
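For completeness, here is a sketch of how the concurrency default is raised in controller-runtime, extending the reconciler sketched earlier (the number is arbitrary):

```go
// A sketch of raising the controller-runtime concurrency default mentioned
// above; the work-queue rate limiter can be overridden in the same Options.
package controllers

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"

	mgmtv1alpha1 "example.com/mgmt/api/v1alpha1" // hypothetical module path
)

func (r *ManagedDeviceReconciler) SetupWithManagerScaled(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&mgmtv1alpha1.ManagedDevice{}).
		WithOptions(controller.Options{
			// Reconcile up to 64 CRs in parallel instead of the default 1.
			MaxConcurrentReconciles: 64,
		}).
		Complete(r)
}
```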
A general problem with Kubernetes clients, appearing most commonly but not exclusively in controllers, is the local caches maintained by informers (which are commonly used to build controllers). An informer maintains a local cache of all the objects in its purview, which becomes problematic when that data volume is large, as reported here and here.
Another major scalability bottleneck in Kubernetes is authorization. Consider the following use case: a user of an application that manages one million entities must be allowed to access only ten entities out of the million. Such a requirement can be specified using Kubernetes RBAC, one of the Kubernetes authorization modes, by granting GET access to the ten CRs of the allowed managed entities and by withholding LIST access to the managed entity CRD. (If LIST access is granted, the user can GET all the managed entities.) However, with such authorization, the user cannot GET the ten CRs they are allowed to access without specifying all ten CRs explicitly by name. This is because the Kubernetes LIST operation does not filter based on authorization: clients can either list all the CRs or get specific CRs by name. Moreover, there is no API to ask Kubernetes which objects a given user is allowed to access. Kubernetes clients can query the Kubernetes authorization API to ask whether a given user can access a specific object, or whether a given user can access all objects of some kind. Practically, this means that if a client tool, like a GitOps tool or a UI, needs to process objects on behalf of a user who has access to ten objects out of one million, the tool must query the Kubernetes API one million times, once per object, to filter the objects the user is allowed to access. That would be a major performance bottleneck (one million network calls to the authorization API). A related security issue is that the tool itself must be authorized to LIST all the objects.
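To illustrate the per-object cost, here is a sketch of how such a tool would have to consult the authorization API (SubjectAccessReview) once per object; the user name, resource, group and namespace values are invented:

```go
// A sketch of per-object authorization filtering: the tool must issue one
// SubjectAccessReview per object to learn which CRs a user may see.
package mgmt

import (
	"context"

	authorizationv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func filterAllowed(ctx context.Context, cs kubernetes.Interface, user string, names []string) ([]string, error) {
	var allowed []string
	for _, name := range names { // one API call per object: the bottleneck
		sar := &authorizationv1.SubjectAccessReview{
			Spec: authorizationv1.SubjectAccessReviewSpec{
				User: user,
				ResourceAttributes: &authorizationv1.ResourceAttributes{
					Group:     "mgmt.example.com",
					Resource:  "manageddevices",
					Namespace: "fleet",
					Name:      name,
					Verb:      "get",
				},
			},
		}
		resp, err := cs.AuthorizationV1().SubjectAccessReviews().Create(ctx, sar, metav1.CreateOptions{})
		if err != nil {
			return nil, err
		}
		if resp.Status.Allowed {
			allowed = append(allowed, name)
		}
	}
	return allowed, nil
}
```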
A solution to the problem above could be an authorization cache that continuously computes all the authorization decisions and caches them. Such a cache could be combined with the proxy mentioned above for sorting and pagination. Implementing and maintaining such a custom authorization/sorting/pagination cache and proxy would require significant development effort.
Summary
To summarize, let me list the Kubernetes scalability bottlenecks examined above:
- Storage
- Single-resource API
- No server-side sorting of returned lists
- No protocol-buffers support for CRDs
- No load balancing between replicas of controllers
- Extensive client-side caching
- No filtering by authorization
- Unknown unknowns
The last item in the list above refers to unknown bottlenecks that could be discovered once all the other problems in the list are solved. There is evidence of large-scale fully-Kubernetes-native management from Rancher Fleet, which managed to (sorry again for too many "management" words) import one million managed entities (Kubernetes clusters in that case). This blog post describes that experiment and, in particular, using Kine to overcome the storage problems. However, the authors of that blog post did not provide details on how effective the actual management tasks were after importing, such as initial discovery of deployments and reporting the status back. It would be interesting to know how much time it took to deploy a new application to one million managed clusters, whether the UI was able to perform sorting and pagination of one million managed clusters, and how effective working with one million managed clusters through kubectl was.
In my opinion, in order to support fully-Kubernetes-native management at large scale, without requiring custom solutions for storage, caching, sorting, load balancing and authorization, Kubernetes must be significantly changed.
I would like to thank Mike Spreitzer for reviewing this blog post, for providing enlightening comments, and for the great discussions we had about Kubernetes. Thanks to Maroon Ayoub for his review.