2019

Packer & Fedora Gotchas

Lately, I've been working in virtual machines to strengthen my security posture for clients. There's more to come on that, but for now I wanted to share a fix for a confusing problem I had. I was trying to install Fedora 31 as a Packer build.

Performance with Spring Boot and Gatling (Part 2)

In Part 1, we built a simple Spring Boot webapp and demonstrated a surprising performance problem. A Gatling performance test simulating different numbers of users each making a single request showed our webapp unable to keep up with 40 "users" making one request per second on my fairly powerful computer.

We eliminate a couple of potential causes in the first part of the article. If you just want to know what was causing the problem, you can go straight there.

We've already eliminated many potential culprits, so we continue the process of elimination to figure out what's causing the problem. I shared a link to the first part and invited people to guess what the problem was.

Thread starvation was amongst the guesses, so let's take a look.

How Many Threads?

Spring Boot apps run in an embedded Tomcat server by default, so this is something we can look into. If we were, say, talking to a slow database synchronously, then it would seem more likely that we might run out of threads to service new requests, as existing requests keep threads waiting for responses from the database.

As it's easy, we'll see what happens if we increase the number of threads Tomcat can use. We can check the documentation to see that the default thread pool size is 200 threads and how to increase it. We can increase that by an order of magnitude to 2000 threads in application.properties:

server.tomcat.max-threads=2000

Running our Gatling test shows that this change has adversely affected performance. We see more active users at peak and a funny dip in the actives line as the backlog is cleared. That dip, marked on the chart below, is a few requests timing out.

Requests per second and active users

That didn't help, so let's put the max thread count back to the default before we try anything else! If you're following along, I recommend using clean, as in mvn clean spring-boot:run to ensure that nothing is remembered from the previous build. You can lose a lot of time acting on misleading results otherwise!
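A back-of-envelope check suggests why more threads were never likely to help. Little's Law says the average number of requests in flight is the arrival rate multiplied by the time each request spends in the system. The sketch below is only an illustration: ThreadEstimate and requestsInFlight are made-up names, and the inputs are the rough figures from Part 1 (a peak of 100 users per second, fastest responses around 200 ms, slowest around 30 seconds).

```java
/** Back-of-envelope thread estimate using Little's Law (hypothetical helper, not part of the app). */
final class ThreadEstimate {

    /** Average number of requests in flight, each one occupying a worker thread. */
    static double requestsInFlight(double arrivalsPerSecond, double secondsPerRequest) {
        return arrivalsPerSecond * secondsPerRequest;
    }

    public static void main(String[] args) {
        double peakRate = 100.0; // peak users per second in the Part 1 test
        double healthy = 0.2;    // roughly the fastest responses seen in Part 1
        double degraded = 30.0;  // roughly the slowest responses seen in Part 1

        System.out.printf("healthy:  about %.0f threads busy%n", requestsInFlight(peakRate, healthy));
        System.out.printf("degraded: about %.0f threads wanted%n", requestsInFlight(peakRate, degraded));
    }
}
```

At healthy response times the default 200 threads look like plenty. Threads only run short once responses have already slowed to many seconds, which hints that the pool size is a symptom rather than the cause.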

JSON Encoding

Another unlikely candidate is the JSON encoding we're doing on our responses. Out of the box, Spring Boot uses the venerable Jackson library, but you never know, the default settings might be really inefficient. Again, it's really easy to check so let's find out. We update the controller so that instead of returning a Java object containing a String, it returns just a plain string, so from:

    class Greeting {
        public String getGreeting() {
            return "Greetings from Spring Boot!";
        }
    }

    @RequestMapping("/")
    public Greeting index() {
        return new Greeting();
    }
to:

    @RequestMapping("/")
    public String index() {
        return "Greetings from Spring Boot!";
    }

If you make that change and run the Gatling test, you see...

...drum roll...

No detectable difference from the original results we got in part 1. Not really a surprise, JSON encoding isn't the problem.

Spring Security

The next thing we can check easily is whether Spring Security is causing the problem somehow. Like the cases above, we've got well tried and tested software running with Spring Boot's sensible defaults, so yet again, it seems unlikely. Something's causing the poor performance, though, and Spring Security does lots of things, like authentication and session management. We're running out of possible causes, so let's give it a try.

The quick and easy way to check whether something that Spring Security is autoconfiguring in is causing the problem is to just omit the dependency. Let's delete spring-security-web and spring-security-config from our pom.xml. We can see there's less happening on startup, but will it handle the load better? Let's run the test.

Asciicast with no spring security on the classpath

We have a winner! You can see that the performance is improved. No active requests throughout the test - the app is keeping up easily. The charts tell the same story. Let's compare side by side to get a feel for the difference.

Response time distribution, before:

Histogram of response time distributions, with Spring Security

Response time distribution, after:

Histogram of response time distributions, without Spring Security

Instead of a spread of response times all the way up to 30 seconds, we have all responses served within a few milliseconds. Much better! On to requests per second.

Requests per Second and Active Users, before:

Plots of requests per second and active users, with Spring Security

Requests per Second and Active Users, after:

Plots of requests per second and active users, without Spring Security

To appreciate the difference, note the different scale for Active Users, the y-axis on the right, and the total time the test ran for on the x-axis. Before, the number of active users climbed faster than the request rate, indicating that requests would start timing out if the test hadn't ended. After we remove Spring Security, the number of active users is the same as the request rate throughout the test, so the app is keeping up perfectly with the load.

What's causing the problem?

Spring Security with Spring Boot automatically applies sensible security settings to an app when it's on the classpath. You take control of aspects of configuration through annotations and extending classes. As this is such a simple app, there's not too much to check. We'll take control of authentication and roughly replicate the automatic configuration with this @Configuration annotated class:

@Configuration
public class SecurityConfig extends WebSecurityConfigurerAdapter {

    @Override
    public void configure(HttpSecurity http) throws Exception {
        http.httpBasic();
    }

}

The Gatling test shows that the performance problem is back. Delete the http.httpBasic() line and the problem goes away. Something to do with authentication then. I didn't find anything in Spring Security or Spring Boot documentation to explain it.
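It may help to see what HTTP Basic actually puts on the wire. The sketch below uses only the JDK, and BasicAuthHeader is a made-up class name for illustration. It builds the Authorization header a client would send for the test credentials from Part 1 and decodes it again, the way a server has to.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Illustrative only: shows the credentials carried by an HTTP Basic request. */
final class BasicAuthHeader {

    /** Builds the Authorization header value: "Basic " + base64(username:password). */
    static String encode(String username, String password) {
        String credentials = username + ":" + password;
        return "Basic " + Base64.getEncoder().encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    /** Recovers "username:password" from the header value. */
    static String decode(String headerValue) {
        String base64 = headerValue.substring("Basic ".length());
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String header = encode("user", "24gh39ugh0");
        System.out.println("Authorization: " + header);
        System.out.println("Server sees:   " + decode(header));
    }
}
```

The point to notice is that a client sending only this header presents the full username and password each time, so the server has to check them each time.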

I'm not sure how you'd figure it out if you didn't know where to start. Given a little experience with passwords and authentication you can join the remaining dots. There are two things going on that could be responsible. Let's look at password encoding.

Password Encoding

We protect passwords by 'encoding' or 'hashing' them before we store them. When the user authenticates, we encode the password they gave us and compare with our stored hash to see if the password was right.

The choice of encoding algorithm is important in this world of cloud computing, GPUs and hardware acceleration. We need an algorithm that needs a lot of CPU power to encode. We get a password encoder using the bcrypt algorithm by default, an algorithm that's been designed to withstand modern techniques and compute power. You can read more about bcrypt and how it helps keep your user database secure in Auth0's article and Jeff Atwood's post on Coding Horror.

See the connection yet? The choice of bcrypt makes sense for protecting the credentials we're entrusted with, but what do we do about this terrible performance?
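To get a feel for the cost, here's a rough sketch using only the JDK. The JDK doesn't ship bcrypt, so PBKDF2 stands in for it here as another deliberately slow algorithm; that swap, the HashCost class, the fixed salt and the iteration count of 200,000 are all assumptions for illustration, not what Spring Security does.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

/** Illustrative comparison of a cheap hash with a deliberately expensive one. */
final class HashCost {

    /** Cheap: a single pass of SHA-256 over salt + password. */
    static byte[] fastHash(String password, byte[] salt) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            sha256.update(salt);
            return sha256.digest(password.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Expensive on purpose: PBKDF2 standing in for bcrypt. */
    static byte[] slowHash(String password, byte[] salt, int iterations) {
        try {
            PBEKeySpec spec = new PBEKeySpec(password.toCharArray(), salt, iterations, 256);
            return SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256").generateSecret(spec).getEncoded();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Re-hashes the attempt and compares, so every check pays the full cost. */
    static boolean matches(String attempt, byte[] salt, int iterations, byte[] stored) {
        return MessageDigest.isEqual(slowHash(attempt, salt, iterations), stored);
    }

    public static void main(String[] args) {
        byte[] salt = "illustrative-salt".getBytes(StandardCharsets.UTF_8); // real salts are random
        int iterations = 200_000; // hypothetical work factor

        long t0 = System.nanoTime();
        fastHash("24gh39ugh0", salt);
        long t1 = System.nanoTime();
        byte[] stored = slowHash("24gh39ugh0", salt, iterations);
        long t2 = System.nanoTime();

        System.out.printf("SHA-256: %d us, PBKDF2: %d us%n", (t1 - t0) / 1_000, (t2 - t1) / 1_000);
        System.out.println("correct password matches: " + matches("24gh39ugh0", salt, iterations, stored));
    }
}
```

The timings will vary from machine to machine, but the slow hash should take orders of magnitude longer than the fast one. If every request has to pay for a slow hash, the number of cores divided by the time per hash puts a hard ceiling on requests per second.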

Sessions

By default, the security config responds to our first authentication with a cookie containing a session ID. On later requests, that session ID is exchanged instead of the password, without any encoding. Sessions come with lots of problems of their own, so we'll leave that one for another day. If we'd been using a browser, or Gatling had been set up to make lots of requests as the same user, we'd have used the session ID and not seen a performance problem.
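Here's a minimal sketch of that idea in plain Java, to show why a returning client wouldn't pay for the password check again. It isn't how Spring Security implements sessions. SessionSketch and its methods are hypothetical names, and the password check is a placeholder for the slow bcrypt comparison.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative only: one expensive password check, then cheap session lookups. */
final class SessionSketch {

    private final Map<String, String> sessions = new ConcurrentHashMap<>(); // session ID -> username
    private final AtomicInteger expensiveChecks = new AtomicInteger();

    /** Placeholder for the slow password check (bcrypt in the real app). */
    private boolean checkPassword(String username, String password) {
        expensiveChecks.incrementAndGet();
        return "user".equals(username) && "24gh39ugh0".equals(password);
    }

    /** First request: verify the password once, then hand back a session ID. */
    String login(String username, String password) {
        if (!checkPassword(username, password)) {
            throw new IllegalArgumentException("bad credentials");
        }
        String sessionId = UUID.randomUUID().toString();
        sessions.put(sessionId, username);
        return sessionId;
    }

    /** Later requests: a map lookup with no hashing. Returns null for an unknown ID. */
    String userForSession(String sessionId) {
        return sessions.get(sessionId);
    }

    int expensiveChecks() {
        return expensiveChecks.get();
    }

    public static void main(String[] args) {
        SessionSketch app = new SessionSketch();
        String sessionId = app.login("user", "24gh39ugh0");
        for (int i = 0; i < 1_000; i++) {
            app.userForSession(sessionId);
        }
        System.out.println("expensive checks: " + app.expensiveChecks());
    }
}
```

Our Gatling scenario has every simulated user arrive fresh and make a single request, so in terms of this sketch it would call login every time and never benefit from the lookup.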

Reconfiguring our App

We'll override the password encoder to prove that it is causing the performance problem. As we're dealing with a hardcoded test password rather than real user passwords, we don't need to worry about it not being secured. We update our SecurityConfig class like this:

@Configuration
public class SecurityConfig extends WebSecurityConfigurerAdapter {

    // make the credentials in application.properties available
    @Value("${spring.security.user.name}")
    private String username;

    @Value("${spring.security.user.password}")
    private String password;

    @Override
    protected void configure(AuthenticationManagerBuilder auth) throws Exception {

        // choose a more efficient (and far weaker) hashing algorithm
        final PasswordEncoder sha256 = new StandardPasswordEncoder();

        auth.inMemoryAuthentication()
                .passwordEncoder(sha256)
                .withUser(username)
                .password(sha256.encode(password))
                .roles("USER");
    }

    @Override
    public void configure(HttpSecurity http) throws Exception {
        http.httpBasic();
    }

}
When we run our performance test one last time, we see that we have performance and we have authentication. @glenathan takes the prize for correctly guessing the cause!

How fast can it go? When I push the request rate higher, I see that this app can actually handle around 2,000 requests per second. I won't bore you with more asciinema or the charts, but you can play with the updated app yourself if you want.

Performance with Spring Boot and Gatling (Part 1)

Just after the rest of the team had left for their Christmas holidays, my colleague and I discovered a weird performance problem with a Spring Boot application we'd just started writing. This is the story of discovering the problem and the detective work that led us to the culprit hiding in plain sight. We're going to recreate the app and the performance tests, but first I'll tell you how we got here.

Prologue

Twenty requests per second. Any more than that and response times would climb. Eventually, the load balancer's readiness checks would time out and it'd refuse to send traffic to the app, taking the service offline.

Here's what our little prototype system looked like:

Diagram showing the component parts of the prototype when we discovered the performance problem

We're building an API, rather than a website. Clients authenticate by passing a username and password in the request. Each request is a query, and the app talks to a graph database to calculate an answer, returning it as a JSON document. It's running in a Kubernetes cluster on AWS, behind an Elastic Load Balancer.

How can it be struggling to serve more than twenty requests per second? I've not used this graph database before, can it really be that slow? Nor have I used Kubernetes or Spring Boot, are they responsible? You wouldn't think there'd be enough of our code yet to perform so poorly, but our own code is always the go-to suspect.

Too Many Suspects

There are too many potential culprits here, so let's eliminate some. Can I reproduce the problem here on my machine? Yes - and I get a clue. As the test runs, I can hear the fan spinning up. Checking back on AWS for server metrics, the CPU utilisation was shooting up to 100% during the test. That removes Kubernetes, the load balancer, the network and disks from the investigation, at least for now. Memory could still be a problem, as Java's garbage collection chews up compute time when there's not enough memory.

Now we eliminate the database. We'd written a resource to return version information, which is just returning a document from memory. Running the performance test on that endpoint revealed the same terrible performance! Something to do with the Spring framework, or the Tomcat application server then - where can we go from here?

We could pull out profiler tooling to look inside the running app and see what's going on. It's been a while since I used that tooling on the JVM, and it'll produce a lot of information to interpret, so I'll leave that as a backup plan. For now, we've got an easy option that will rule out the Spring Boot framework and Tomcat application server. A "getting started" Spring Boot app won't take long to set up. We can eliminate JSON processing, configuration problems and coding errors as potential candidates, and get a benchmark for how performant the simplest Spring Boot app is with our hardware and test setup.

This is where we write some code.

The "Getting Started" App

You can find and clone the project we're talking about in this post on GitHub at https://github.com/brabster/performance-with-spring-boot/tree/1.0. You'll need a JDK and Maven installed to compile and run the application.

I based the "getting started" app closely on Spring Boot's documentation. It's got one endpoint at / and returns a JSON document {"greeting": "Greetings from Spring Boot!"} like this:

package hello;

import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.bind.annotation.RequestMapping;

@RestController
public class HelloController {

    class Greeting {
        public String getGreeting() {
            return "Greetings from Spring Boot!";
        }
    }

    @RequestMapping("/")
    public Greeting index() {
        return new Greeting();
    }

}

We'll use Spring Security to authenticate API clients, so we need to add the dependencies and set a default username and password. The dependencies we need to add to our Maven pom.xml file are:

<dependency>
    <groupId>org.springframework.security</groupId>
    <artifactId>spring-security-web</artifactId>
    <version>5.1.2.RELEASE</version>
</dependency>
<dependency>
    <groupId>org.springframework.security</groupId>
    <artifactId>spring-security-config</artifactId>
    <version>5.1.2.RELEASE</version>
</dependency>

To set a default username and password, we add a properties file containing two properties that override Spring Security's defaults:

spring.security.user.name=user
spring.security.user.password=24gh39ugh0

Start the app with mvn spring-boot:run and you should see something like this:

Asciinema recording of the app starting

Performance Testing with Gatling

Gatling is the tooling that gave us those original requests per second figures, so let's reproduce the setup to do our performance tests here. Gatling tests are written in Scala and can coexist with the Java code, but we need a little support in our project to run tests and get editor support for Scala.

To compile Scala code and enable Scala support (at least in IntelliJ IDEA) I used the rather neat scala-maven-plugin:

<plugin>
    <groupId>net.alchim31.maven</groupId>
    <artifactId>scala-maven-plugin</artifactId>
    <version>3.4.4</version>
    <executions>
        <execution>
            <id>scala-test-compile</id>
            <phase>process-test-resources</phase>
            <goals>
                <goal>testCompile</goal>
            </goals>
        </execution>
    </executions>
</plugin>

To run Gatling from Maven and view its output, we need a dependency and a plugin:

<dependency>
    <groupId>io.gatling.highcharts</groupId>
    <artifactId>gatling-charts-highcharts</artifactId>
    <scope>test</scope>
    <version>3.0.2</version>
</dependency>
<plugin>
    <groupId>io.gatling</groupId>
    <artifactId>gatling-maven-plugin</artifactId>
    <version>3.0.1</version>
</plugin>

Now we can write a Gatling test. This is our scenario, describing the client behaviour we want to test.

setUp(myScenario.inject(
    incrementUsersPerSec(20)
      .times(5)
      .eachLevelLasting(5 seconds)
      .startingFrom(20)
  )).protocols(httpProtocol)
    .assertions(global.successfulRequests.percent.is(100))

We're starting with 20 users per second making a request to the / resource, holding at that rate for five seconds. They only make one request. Then we step the rate up by twenty users per second at a time, holding each of the five levels (20, 40, 60, 80 and 100 users per second) for five seconds. Every request must return an HTTP 200 status code to pass the test. Simple! You'll find the rest of the test in LoadTest.scala.
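If it helps to see the numbers, this small Java sketch works out the load that profile should generate. It's arithmetic only and doesn't touch Gatling. InjectionProfile is a made-up helper, and it assumes that times(5) means five levels in total, which fits the peak of 100 requests per second that shows up in the report.

```java
import java.util.Arrays;

/** Illustrative arithmetic for a stepped injection profile (hypothetical helper). */
final class InjectionProfile {

    /** Arrival rate, in users per second, at each level of the profile. */
    static int[] levels(int startingFrom, int increment, int times) {
        int[] rates = new int[times];
        for (int i = 0; i < times; i++) {
            rates[i] = startingFrom + i * increment;
        }
        return rates;
    }

    /** Total users injected when every level lasts the same number of seconds. */
    static int totalUsers(int[] rates, int secondsPerLevel) {
        return Arrays.stream(rates).sum() * secondsPerLevel;
    }

    public static void main(String[] args) {
        int[] rates = levels(20, 20, 5);
        System.out.println("levels: " + Arrays.toString(rates));      // [20, 40, 60, 80, 100]
        System.out.println("total users: " + totalUsers(rates, 5));   // 1500
    }
}
```

Each user makes one request, so under those assumptions the whole test amounts to 1,500 requests.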

Make sure the app is running and then run the test with mvn gatling:test.

Asciinema recording of the performance test running

When the tests run you see a progress bar being refreshed every few seconds. The ### part represents the proportion of requests that have been made and completed. The section with dashes --- is requests made but not yet completed. The numbers are just below the progress bar, with active telling us how many requests have been made but not yet completed. There are a lot of those, over a thousand towards the end of the test, and this computer isn't exactly underpowered. There's our performance problem! Towards the end of the test, requests are taking over 26 seconds to complete.

If you cloned the project, you can try changing the scenario in LoadTest.scala to explore the problem. Running something like top will show you your live CPU utilisation. I can see the app using almost a full 4 cores while the test is running. To serve a short text string from memory to less than 100 users per second!

Gatling's Reports

Gatling saves a report of metrics and charts for each test. There are a couple that I think give us interesting insight into what just happened that we might not have seen as the test was running.

Bar chart showing 15 requests responded in around 200 milliseconds, with other requests uniformly distributed up to almost 30 seconds

The "Response Time Distribution" report tells us that the fastest few requests are served in around 200ms. So it takes at least 200ms to serve a request! Then there's a roughly uniform distribution of request times up to 30 seconds. The test only ran for around 70 seconds in total.

Next, the "Number of requests per second" chart shows more clearly that the app isn't keeping up, even with these low request rates. The number of active users (those that have made a request and not yet had a response) climbs until Gatling stops sending new requests.

Line chart showing the number of new requests per second and the number of active requests over time.

You can see the app is not quite able to keep up at 40 requests per second. As we ramp to 60, the line swings upwards as it really starts to fall behind. 45 seconds or so into the test the number of requests per second drops from 100 to zero, and the number of active requests, just over 1000 by this point, stops climbing and starts to fall as the app starts to clear its backlog.
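A deliberately crude model shows how quickly a backlog like that builds once arrivals outpace what the app can serve. In the sketch below, BacklogModel is a made-up name and the capacity of 20 requests per second is an assumption: it's what you'd get if each request cost about 200 ms of CPU and four cores were available. The model ignores timeouts and everything else that makes real systems interesting.

```java
import java.util.Arrays;

/** Illustrative fluid model of a request backlog (hypothetical helper). */
final class BacklogModel {

    /** Requests still waiting at the end of each level, given a fixed service capacity. */
    static int[] backlogAfterEachLevel(int[] arrivalsPerSecond, int secondsPerLevel, int capacityPerSecond) {
        int[] backlog = new int[arrivalsPerSecond.length];
        int waiting = 0;
        for (int i = 0; i < arrivalsPerSecond.length; i++) {
            int surplus = (arrivalsPerSecond[i] - capacityPerSecond) * secondsPerLevel;
            waiting = Math.max(0, waiting + surplus); // a backlog can't go negative
            backlog[i] = waiting;
        }
        return backlog;
    }

    public static void main(String[] args) {
        int[] arrivals = {20, 40, 60, 80, 100}; // users per second at each five-second level
        int assumedCapacity = 20;               // assumption: 4 cores / 0.2 s of CPU per request
        System.out.println(Arrays.toString(backlogAfterEachLevel(arrivals, 5, assumedCapacity)));
        // prints [0, 100, 300, 600, 1000]
    }
}
```

Under those assumptions the backlog reaches about a thousand requests by the end of the ramp, which is at least in the same ballpark as the active count Gatling reported. Try other capacities to see how sensitive the result is.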

Gatling's reports show you plenty of other interesting charts and figures. Find them in your target directory after running a test.

Next Time

That's the context, the tools and a simple codebase to get us started. In Part 2, we see how a performance problem shows itself and figure out how to resolve it.