For a while now, I’ve been curious about what impact running a service mesh has on the performance of my software. I’ve spoken with vendors who estimated around a 10% performance hit, but I wanted to see the actual impact for myself. It’s not just to satisfy my own curiosity: I’ve also had customers ask what the runtime penalty is and whether I would recommend they run a service mesh. Some of our customers run real-time systems, so this is an important question to be able to answer for them.
Istio provides a lot of value for microservice architectures, with features such as security, throttling, and telemetry. That is a lot of goodness, but nothing comes for free. I wanted to find out just what the performance costs of running a service mesh are, and whether the mesh is worth them. As will be seen in the details below, there are measurable performance impacts from running an Istio service mesh in your Kubernetes cluster.
First, let’s cover the test setup. In a previous article, I created a tiny service mesh with two microservices: capitol-client and capitol-info. Capitol-client is a microservice that enables users to query for country capitals, enter new capitals, or update existing ones. The microservice runs a website and API gateway written in AngularJS and hosted on NGINX. The second microservice, capitol-info, provides CRUD capabilities via two REST operations: the first returns capitals via an HTTP GET call, and the second inserts or updates capitals via an HTTP POST. This microservice is written in Java and runs with Spring Boot. As shown in the diagram below, the capitol-client microservice calls the back-end capitol-info microservice, and it is this microservice architecture that is used as the test framework. Finally, in order to generate load on the service mesh, I use Postman to call the capitol-client API gateway.
In order to quantify the performance impacts, I ran two sets of tests. The first test set ran against Docker Desktop with Kubernetes (aka K8s), hosted on my local machine. The second test used my local machine to generate the calls to a service mesh running on Red Hat OpenShift 4.4, hosted in AWS.
For each test I conducted three runs of 1000 batched calls to the API gateway in capitol-client. Each batched call did one HTTP GET call and one HTTP POST call with a 10 millisecond delay between the batches. For each test run, I recorded the average time, min and max, and standard deviation. The results of each set of three runs (6000 total HTTP calls) are averaged to help minimize the impacts of outliers and external factors.
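The measurement procedure above can be sketched in Python. This is only an illustration of the arithmetic, not the Postman setup actually used: the function names are hypothetical, the GET and POST calls are passed in as plain callables, the sample standard deviation is an assumption, and averaging every statistic across the three runs is one reading of "the results are averaged".

```python
import statistics
import time
from typing import Callable, Dict, List


def time_batches(get_call: Callable[[], object],
                 post_call: Callable[[], object],
                 batches: int = 1000,
                 delay_s: float = 0.010) -> List[float]:
    """Time one GET and one POST per batch; return durations in milliseconds."""
    durations_ms = []
    for _ in range(batches):
        for call in (get_call, post_call):
            start = time.perf_counter()
            call()
            durations_ms.append((time.perf_counter() - start) * 1000.0)
        time.sleep(delay_s)  # 10 ms pause between batches by default
    return durations_ms


def summarize(durations_ms: List[float]) -> Dict[str, float]:
    """Average, min, max, and sample standard deviation for one run."""
    return {
        "avg": statistics.mean(durations_ms),
        "min": min(durations_ms),
        "max": max(durations_ms),
        "stdev": statistics.stdev(durations_ms),
    }


def average_runs(summaries: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each statistic across runs (assumed interpretation)."""
    return {key: statistics.mean(s[key] for s in summaries)
            for key in summaries[0]}
```

With the defaults, one run times 2000 calls (1000 GETs and 1000 POSTs), so three runs cover the 6000 calls mentioned above. In practice `get_call` and `post_call` would wrap real HTTP requests to the API gateway.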
Test #1: Local machine with Docker Desktop
As mentioned above, in the first test everything is running on my local machine. This machine is a MacBook Pro with a 2.4 GHz 8-core Intel Core i9 and 32 GB of memory, of which 12 GB is allocated to Kubernetes. The testing software is the Postman client calling Docker Desktop with Kubernetes.
To get a better understanding of the overall impacts, I tested not only against Kubernetes with and without Istio, but also against plain vanilla Docker and against Istio with security in place. The test sets were as follows:
- Docker containers running on a bridged network
- Kubernetes pods with network proxies
- Kubernetes running Istio
- Kubernetes running Istio with authorization policies in place for both microservices
For the local test, TLS was not used between the Postman test client and Kubernetes, just plain unencrypted HTTP traffic. The local machine test used Istio 1.6.1 with mTLS enabled for east-west traffic.
Running the tests across these different architectures yields very interesting results:
- Docker was considerably faster than Kubernetes.
- There’s a definite performance penalty to pay when running Istio.
- Adding fine-grained access controls to Istio did not yield any performance impacts. Indeed, it appears to provide a slight performance improvement.
Examining the times, we can see that Istio adds a penalty of roughly 3 milliseconds over plain Docker and approximately 2 milliseconds over plain Kubernetes. If we measure the difference as a percentage of the call time, we can see it doesn’t look good for Istio:
- Docker calls took about 60% of the time of a call to a service running in Istio.
- Kubernetes calls took about 80% of the time of Istio calls.
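The two ways of expressing the gap (absolute penalty and share of call time) are easy to mix up, so here is a small sketch of the arithmetic. The function name is hypothetical, and the 4.5 ms and 7.5 ms inputs in the usage note are illustrative placeholders chosen only to be consistent with the "roughly 3 ms" and "about 60%" figures; they are not the measured values.

```python
from typing import Dict


def compare_call_times(baseline_ms: float, mesh_ms: float) -> Dict[str, float]:
    """Compare an average baseline call time with an average mesh call time."""
    if baseline_ms <= 0 or mesh_ms <= 0:
        raise ValueError("call times must be positive")
    return {
        # Extra time per call when going through the mesh.
        "penalty_ms": mesh_ms - baseline_ms,
        # Baseline time as a share of the mesh time (the figure quoted above).
        "baseline_share_pct": 100.0 * baseline_ms / mesh_ms,
        # The same gap expressed as slowdown relative to the baseline.
        "overhead_pct": 100.0 * (mesh_ms - baseline_ms) / baseline_ms,
    }
```

For example, `compare_call_times(4.5, 7.5)` gives a 3 ms penalty and a 60% baseline share. Note that a baseline taking 60% of the mesh time means the mesh call is about 67% slower than the baseline, not 40% slower.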
If we look at the table above, we also see that calls to Istio-managed microservices had much higher maximums and standard deviations. This could be problematic if you rely on consistent call times. These numbers look problematic for Istio, but we need to look at enterprise-scale numbers as well to get a true understanding of the performance impacts.
Test #2: AWS and OpenShift
Local testing of service mesh performance is interesting, but most people care about how it behaves in the enterprise. Therefore, the next test sets were conducted using Red Hat OpenShift 4.4 running in AWS. This setup follows a well-established cloud native pattern of running Kubernetes in the cloud and gives me a much better idea of how Istio behaves in an enterprise environment.
This environment obviously differs greatly from my local laptop: it is running on a cluster; it has multi-tenancy; it has significantly more latency between my test client and the microservices; and, unlike my laptop, there is no CPU contention between my test client and the OpenShift cluster. As background, the OpenShift cluster I used for testing is running in the AWS U.S. East Region. I live in Northern Virginia, fairly close to the AWS East Region data centers. Because I was running through a VPN, the network connection was about 45 Mbps down and about 20 Mbps up, using wired Ethernet. The OpenShift cluster has three control plane nodes running on m5.xlarge instances and three worker nodes running on m5.2xlarge instances. At the time of this writing, m5.xlarge has 4 virtual CPUs (vCPU) and 16 GB of memory, and m5.2xlarge has 8 vCPU and 32 GB of memory. The cluster is used for R&D, so it was not running a heavy load during testing.
Unlike the local test, the remote test used HTTPS between the Postman test client and the microservices running in OpenShift. OpenShift was running Istio 1.4.6. The test sets were as follows:
- OpenShift with a non-Istio-enabled project (namespace)
- OpenShift with an Istio-enabled project, using automatic sidecar injection
- OpenShift with an Istio-enabled project, using automatic sidecar injection and with mutual TLS (mTLS) enabled for east-west traffic
Calls from my local Postman client to the AWS service mesh took an order of magnitude longer than they did in any of the local tests. This makes the performance impacts look relatively smaller. However, if we look at the numbers, we can see there is still a performance hit when running with Istio. The performance impacts, in both raw numbers and percentage differences, are smaller than they were in the local tests.
Diving into the numbers in a little more detail, introducing network latency evens out the minimum and maximum times, as well as the standard deviation.
Unlike in the local test results, Istio’s performance impact is no longer as significant as an overall percentage of the call time. Obviously, not every use case will accept a 0.5 to 1.5 ms delay, so your mileage may vary.
As these tests show, there is a definite performance impact from running microservices in an Istio service mesh. I only used two microservices in these tests, but I suspect that each additional microservice in the workflow will add a little more overhead. Therefore, depending on the depth of your call chain, you may incur more runtime performance penalties. There are options to ameliorate these penalties, such as using more or faster hardware, or modifying your architecture, but these have financial impacts.
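To get a rough feel for how the penalty might grow with call chain depth, here is a back-of-the-envelope sketch. It assumes the overhead scales linearly with the number of meshed microservices in the chain, which is only a guess and was not measured in these tests; the function name is hypothetical.

```python
def estimate_chain_penalty_ms(measured_penalty_ms: float,
                              measured_services: int,
                              target_services: int) -> float:
    """Scale a measured mesh penalty to a longer call chain.

    Assumes (without evidence from these tests) that each meshed
    microservice in the chain contributes an equal share of the penalty.
    """
    if measured_services <= 0 or target_services < 0:
        raise ValueError("service counts must be positive")
    per_service_ms = measured_penalty_ms / measured_services
    return per_service_ms * target_services
```

Taking the roughly 2 ms penalty over plain Kubernetes seen locally with two microservices, `estimate_chain_penalty_ms(2.0, 2, 6)` suggests about 6 ms for a six-service chain. Treat this as a starting point for your own measurements rather than a prediction.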
Even though there are performance impacts from running an Istio service mesh, I think for many use cases the benefits far outweigh the costs. The ability to have authorization, mutual TLS, telemetry, and features like rate-limiting is well worth a performance penalty of several milliseconds. Personally, I find it much easier to deal with microservices running with Istio than to manage them without it. Obviously, there will be use cases where every millisecond counts, so you’ll need to take this into consideration when deciding whether or not to use a service mesh.