• Pricing
  1. Home
  2. Blog
  3. Engineering
  4. Stabilizing Keycloak at Scale: Tools for Debugging & Investigating

Stabilizing Keycloak at Scale: Tools for Debugging & Investigating

by Silvan Jegen

You can also read this article in German, French, Indonesian, Italian and Portuguese.

Smallpdf Staff Software Engineer, Silvan Jegen explores setbacks encountered with Keycloak authentication and solutions for resolving them.

At Smallpdf we introduced Keycloak for authentication in the summer of 2021. Like many new implementations, it wasn’t a smooth process and took a lot longer than expected. In this article, Smallpdf Staff Software Engineer, Silvan Jegen, takes a deep dive into issues encountered and techniques used to investigate and debug them.

Stability Issues


Our initial setup comprised several Keycloak instances, each with an embedded Infinispan cache. When we put these instances into production, we quickly encountered stability issues, where the nodes ran out of memory and restarted. These restarts also resulted in user sessions being lost, which then logged these users out.

Several months of experimentation on how to address these stability issues followed. We tried different combinations of embedded and external Infinispan caches with different persistence options. For a while, we used the SIFS persistence option to an externalized Infinispan cache, but after about two weeks, this setup resulted in an outage.

In the end, we settled on a setup with multiple Keycloak instances, all connecting to an external Infinispan cluster. When we provisioned enough nodes to be able to sustain the production load on both the Keycloak and the Infinispan side, we emerged from the experience with a stable production system.

What did the process of finding our current setup look like?

Load Testing


Our experience with the Keycloak+Infinispan with RocksDB showed us that our production workload wasn’t as close to the load testing conditions as we thought. The first measure we took to make sure that these outages wouldn’t happen again was to make more use of load testing.

The tool we turned to for this was “k6,” which allows us to generate requests with fine-grained control (thanks to JavaScript scripting) and to customize the way they’re sent (i.e., frequency, duration, etc.).

We had to switch from a legacy authentication system, so we needed to send direct grant requests to our Keycloak instance, which would mimic creating a lot of sessions on our cluster. To do this, we used the following script:

To run this script with “k6” for one hour using 16 TCP connections to send requests in parallel, you can use the following command line:

Note: when using this script, you create multiple direct grant sessions for the same user account. This may or may not be what you actually want to test.

Solving the Problem


Using “k6” to load sessions and checking the JVM metrics and logs that we collect in our observability solution allows us to verify whether we can expect our setup to be able to support our production load.

Loading sessions into Infinispan with “k6” directly instead of sending them to Keycloak turned out to not accurately reflect what would happen when we put Keycloak together with Infinispan in production. So for the best results, we recommend loading sessions by sending requests to Keycloak using the script above instead of loading them directly into Infinispan—although this would be faster.

Lost Sessions


Once our Keycloak and Infinispan clusters were stable, we expected the Refresh Token Error counts that we were observing previously to go down. The errors did in fact go down, but they were still higher than expected, meaning that thousands of users were still being logged out daily, because when they were trying to refresh their access token, they were served with an error.

Getting More Info


To debug this issue, we first consulted the event log Keycloak comes with. However, the default log level that’s set was too high to get the information we were interested in. So, to get more detailed logs, we set the level to “DEBUG.” It helps to be aware that this can create a lot of data which will most likely cause issues if you’re saving the event logs in the relational Keycloak DB and have substantial traffic.

The logs obtained this way still didn’t contain the required error details that we needed to identify the source of these Refresh Token Errors. We ended up patching the code of our Keycloak instance to add more details than the generic errors provided, as well as the session IDs. The latter allowed us to associate the Refresh Token Errors more easily with the corresponding “Login” events.

This extra info then pointed us in the right direction: the vast majority of our Refresh Token errors were due to “session_not_found” errors. This indicated that Keycloak wasn’t finding the sessions of our users after some time had passed, though the time until the session was lost seemed to vary greatly.

Creating a Controlled Setup


The next step was to reproduce the issue in a controlled environment. We found that the best way to do this was to create a local Keycloak+Infinispan setup to which we could send requests using Postman. Using this local setup, we could increase the log level to “TRACE” without drowning in events, since there was no production traffic to care about.

The “TRACE” debug level allowed us to see that new sessions were initially successfully retrieved from Infinispan after creation. After some time passed, however, Keycloak would try to retrieve a session on the Infinispan side and the session would be gone with no discernible reason.

It turned out that evicting and recalling a session back from disk twice in a row resulted in the session not being found at the time of the next refresh token request. The reason for this is still unknown. The session disappearing was the source of the Refresh Token Errors that we were observing in our production environment.

So What Did We Learn?


Debugging Keycloak is complex, especially if you’re running it with an externalized Infinispan cluster. Having the JVM metrics and the logs readily available might not be enough.

In that case, depending on a local setup to reproduce the issue you’re having is the best way to go—even if it requires you to patch Keycloak to get the data from it that you need to debug the issue after.

Don’t forget to do some proper load testing for changes that you’re making in your Keycloak setup either. Based on our experience, it will probably save you some trouble in production!

With special thanks to Sam Smith and Zhandos Zhylkaidar for their insights and contributions to this article.

Enjoyed this post? Look out for more info, insights, and tips from the engineers at Smallpdf coming soon on the blog.

Silvan Jegen
Staff Software Engineer @Smallpdf