Explore your tuning options for increasing JanusGraph write throughput and lowering latencies.
It’s common for folks to write to the JanusGraph users and dev groups looking for advice on data loading performance. In many cases, they’re running on a laptop or a small test cluster, kicking the tires, and having trouble getting acceptable performance.
For this post, we’ll walk through the process of getting things running in a speedy fashion.
The experimental setup consists of JanusGraph and a Gatling-based load generator. The load generator does nothing more than send mutating traversals that create vertices in batches of 10 like this:
g.addV("person").property("city", "some city").property("postalCode", 12345).
  addV("person").property("city", "some city").property("postalCode", 12345).
  addV("person").property("city", "some city").property("postalCode", 12345).
  addV("person").property("city", "some city").property("postalCode", 12345).
  addV("person").property("city", "some city").property("postalCode", 12345).
  addV("person").property("city", "some city").property("postalCode", 12345).
  addV("person").property("city", "some city").property("postalCode", 12345).
  addV("person").property("city", "some city").property("postalCode", 12345).
  addV("person").property("city", "some city").property("postalCode", 12345).
  addV("person").property("city", "some city").property("postalCode", 12345).
  count()
That’s it, not even any edges. Though basic, this will be enough to see the effects of a few different tunables on throughput and latency.
The tests were run in AWS using a standalone Gatling instance. Gatling sent write requests to a standalone JanusGraph instance, running on an m4.2xlarge.
JanusGraph then stored all data in a single Scylla instance hosted on an i3.2xlarge. All instances were located in the same placement group. The Scylla instance was created from Scylla’s AMI, and its default configuration was used throughout all of the experiments, so all of the tweaks described below were made on the Janus side.
Note that a production setup should never use fewer than 3 Scylla instances (unless there isn’t an SLA to speak of, I guess…), but one instance is adequate for demonstration purposes, and the following discussion is equally applicable to a clustered setup.
The experimental setup
Janus’s deployment options can be confusing to the newcomer. The support for a variety of storage backends is a bonus if Janus needs to fit into an existing infrastructure.
However, one immediate question comes to mind: Should users run Janus colocated (same instances) with their storage layer?
This is a tuning decision in its own right, but I chose to run Janus on its own instance for these experiments. If Janus is colocated with Scylla, it’s worth looking at Scylla’s smp and overprovisioned settings so that resource contention between Scylla and the Janus JVM is kept to a minimum.
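As a rough illustration only (both flags are real Scylla/Seastar options, but exactly where you set them depends on how Scylla was installed, so treat the file name below as an assumption):

# Assumption: Seastar flags passed to scylla-server via its defaults file
# --smp caps the number of cores/shards Scylla uses; --overprovisioned tells it the box is shared
SCYLLA_ARGS="--smp 4 --overprovisioned 1"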
These experiments will use the Apache TinkerPop Java driver to remotely connect to JanusGraph. Some people still run Janus embedded but I usually set it up to be accessed remotely so that the application workload does not commingle with the database workload.
These tests were run using the original string-based querying method, not the newer language-embedded withRemote option. A later post will look at the performance of the GLV vs. Groovy scripts. One update was made on the client driver side before the test runs.
The following properties were added to remote-objects.yaml to allow for more in-flight requests and ensure that Gatling was not the bottleneck.
connectionPool:
minSize: 8
maxSize: 8
maxInProcessPerConnection: 8
maxSimultaneousUsagePerConnection: 8
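For reference, a minimal sketch of the client side using the TinkerPop Java driver might look like the following (the class name and the trivial traversal are illustrative, not the actual Gatling harness):

import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;

public class RemoteClientSketch {
    public static void main(String[] args) throws Exception {
        // The Cluster is built from the same YAML shown above: host, port,
        // serializer, and the enlarged connectionPool settings all live there.
        Cluster cluster = Cluster.open("conf/remote-objects.yaml");
        Client client = cluster.connect();

        // Scripts are submitted as strings and executed by Gremlin Server.
        long vertexCount = client.submit("g.V().count()").one().getLong();
        System.out.println("vertex count: " + vertexCount);

        client.close();
        cluster.close();
    }
}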
There are a lot of good performance-related tips in the Apache TinkerPop docs; we’ll follow a few of them here. First, a few things about the load script.
The addV calls will be executed by the Groovy script engine embedded in Gremlin Server. One thing in particular is a performance killer when it comes to script execution: script compilation. Gremlin Server has a script cache, so if the hash of an incoming script matches a cached, already-compiled script, no compilation is needed.
If Gremlin Server doesn’t get a cache hit, expect a nasty compilation penalty on every query. In most workloads, similar queries are sent over and over and over.
Our experiment just repeatedly inserts 10 new people. The key to getting a cache hit every time the script is sent is to turn the parts of the script that change, the property values, into variables. That way, the Groovy that is sent over is exactly the same every time, thereby matching the cached script.
The script parameterization feature can be used to attach the unique property values to each request; they are bound into the script’s runtime when it’s executed. See here for the syntactical details and notes.
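Continuing with the client from the sketch above, parameterized submission looks roughly like this (the binding names and values are made up): the script text never changes, so it compiles once and hits the cache, while only the bindings vary per request.

import java.util.HashMap;
import java.util.Map;

// Constant script text: compiled once, cache hit on every subsequent request.
String script = "g.addV('person').property('city', city0).property('postalCode', postalCode0).count()";

// Only the bindings change from request to request.
Map<String, Object> params = new HashMap<>();
params.put("city0", "some city");
params.put("postalCode0", 12345);

long count = client.submit(script, params).one().getLong();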
Looking at our script, you’ll notice that I’m including more than one addV per call. The exact number you’ll want to send over at once may vary, but the basic idea holds that there are performance benefits to be gained from batching.
In addition to batching, note that I chained all of the mutations into a single traversal. This amortizes the cost of traversal compilation, which can be non-trivial when you’re going for as high of throughput as possible.
Note that Gremlin is quite powerful and you can mix reads and writes in the same traversal, extending well beyond my simple insert example, so keep that in mind as you write your mutating traversals. The chosen batch size of 10 is rather arbitrary, so plan to test a few different sizes when you’re doing performance tuning.
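Putting batching and parameterization together, one way to build the 10-vertex script and its bindings on the client side is a simple loop; this is purely illustrative and reuses the hypothetical client from earlier:

int batchSize = 10;
StringBuilder script = new StringBuilder("g");
Map<String, Object> params = new HashMap<>();

for (int i = 0; i < batchSize; i++) {
    // Numbered variable names keep the script text identical from request to request.
    script.append(".addV('person')")
          .append(".property('city', city").append(i).append(")")
          .append(".property('postalCode', postalCode").append(i).append(")");
    params.put("city" + i, "some city");
    params.put("postalCode" + i, 12345);
}
script.append(".count()");

long created = client.submit(script.toString(), params).one().getLong();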
Response serialization is a CPU-intensive operation and can eat into throughput. Notice that the queries being sent over end with a count() step. We could also have iterated the traversal with iterate() so that nothing came back in the response.
If the newly minted vertex ids are needed, an id() step can be added to the traversal to return just the ids without incurring the full overhead of serializing the vertex objects. The same advice goes for reads: if the full vertex or edge is not required in the results of a query, update the traversal to selectively return only what is necessary.
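For example (a hypothetical single-addV variant of the scripts above, reusing the same client and bindings), ending the traversal with id() brings back only the generated id:

import java.util.List;
import java.util.stream.Collectors;
import org.apache.tinkerpop.gremlin.driver.Result;

// Same parameterized insert, but return the new vertex's id instead of a count.
String script = "g.addV('person').property('city', city0).property('postalCode', postalCode0).id()";
List<Object> ids = client.submit(script, params).stream()
        .map(Result::getObject)
        .collect(Collectors.toList());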
Gremlin Server has come up a few times already so you might be saying, “Gremlin Server? I’m running JanusGraph”. You’re not wrong, but JanusGraph implements the TinkerPop APIs and includes Gremlin Server.
All remote access to JanusGraph occurs through the Gremlin Server. Some Gremlin Server settings will be discussed below.
The JanusGraph Gremlin Server startup script, ./bin/gremlin-server.sh, specifies a max heap size of 512 megabytes. Since these experiments only test write throughput, there are no global or longer-running queries to worry about, and the majority of objects created as part of our inserts will be short-lived.
It is a good idea to monitor the Gremlin Server’s memory usage and likely increase the heap size. If the load largely involves short-lived objects, the new generation size can also be adjusted to lessen the likelihood of unnecessary promotion into the older generations.
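As an illustration (the values are assumptions that depend on your instance size, and this presumes the stock gremlin-server.sh picks up JAVA_OPTIONS), that adjustment might look something like:

# Set in (or exported before running) bin/gremlin-server.sh; values are illustrative only
# -Xms/-Xmx raise the heap from the 512 MB default, -Xmn sizes the new generation
JAVA_OPTIONS="-Xms4g -Xmx4g -Xmn2g"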
Per the comprehensive Gremlin Server docs, users can specify the number of worker and Gremlin pool threads. The docs recommend increasing the worker pool size to at most two times the instance core count.
The Gremlin pool is responsible for script execution so, like the worker pool, if the Janus server is not fully utilized, increasing its size may allow the server to process more script requests in parallel.
These settings are found in the Gremlin Server config file: conf/gremlin-server/gremlin-server.yaml
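In that file, the two pools appear as top-level keys. The values below are illustrative for an 8-core instance rather than a recommendation:

# conf/gremlin-server/gremlin-server.yaml (illustrative values)
threadPoolWorker: 16   # worker pool; the docs suggest at most 2x the core count
gremlinPool: 16        # threads available for script execution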
Janus vertices are uniquely identified by ids of the Java type long. Every time an addV is called, an id needs to be assigned to that vertex. These ids come from a pool that Janus “retrieves” when the first requests come in and then replenishes throughout normal operation as the pool is exhausted.
In a distributed setup, these id blocks must be retrieved carefully, so that no two Janus instances can grab the same pool of ids. If they did, two or more Janus instances could step on each other’s mutations.
Since this operation must be safe across not only multiple threads but also multiple instances, locks are required.
If a cluster (or single instance) is under heavy load, it is common to see large spikes in latency in an untuned setup. The user may say to themselves, “Hmm, Janus runs on the JVM, sounds like we have some garbage collection pauses clogging things up”.
That isn’t a bad guess, and GC very well could be the culprit, but they check their metrics and GC looks fine, with no stop-the-world pauses that correlate with the drops in throughput.
At this point, if they move on to looking at the internal Janus threads, there is a good chance they’ll see several idassigner threads periodically blocking and putting a stop to the current batch of insertions. This is id allocation in action.
Luckily, there are a few tunables related to id allocation that can help alleviate this pain and greatly reduce the p99 times.
Refer here for the full set of options. For now, we’ll focus on ids.block-size and ids.renew-percentage. The default block size is 10,000 and the default renew percentage is 0.3.
That means that when a Janus id thread starts up, it will reserve 10,000 ids, and when that pool drops below 30% full, it will attempt to grab the next block asynchronously to reduce the chance of a full stop.
If an application is inserting thousands of vertices a second, a 10,000-id block only lasts a second or two, and even with the asynchronous renewal, threads will start to block and things will come to a grinding halt.
The docs recommend that users set the block size to approximately the number of vertices expected to be inserted in an hour, but don’t be afraid to experiment with this value and the renew-percentage.
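In JanusGraph’s configuration properties, that guidance translates to something like the following sketch (the numbers are made up, sized for roughly 10,000 vertex inserts per second sustained for an hour):

# JanusGraph configuration properties (illustrative values, not a recommendation)
# ~10,000 vertices/sec * 3,600 sec, per the "about an hour of inserts" guidance
ids.block-size=36000000
# begin fetching the next block asynchronously when 30% of the current block remains (the default)
ids.renew-percentage=0.3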
When a Janus instance is shut down, its unused ids go with it and are not returned to the pool. This can cause some concern because there are a finite number of ids available.
The good news is, per the docs: “JanusGraph can store up to a quintillion edges (2^60) and half as many vertices”, so the likelihood of running up against this limit is tiny unless, perhaps, you’re regularly reserving very, very large id blocks and cycling your Janus instances incredibly frequently.
This is going to be a case of “Do as I Say, not as I Do” because when you’re doing performance tuning, it’s not a good idea to go and change a bunch of things all at once.
One method I would suggest is the USE method that Brendan Gregg describes here. Having said that, hopefully, these suggestions provide some food for thought when you embark upon your next JanusGraph tuning exercise.
This first experiment was run against a freshly unzipped JanusGraph install. Gatling was set up to inject 1,000 batches of vertices per second which, taking into account the batch size of 10, translates to 10,000 vertex creations per second. The defaults of particular interest, given the discussion above, are:
JVM: the 512 MB max heap set by gremlin-server.sh
JanusGraph: an ids.block-size of 10,000 and an ids.renew-percentage of 0.3
A smooth response rate of about 1,000 responses per second
For the second experiment, the settings were left the same as experiment one, but since things ran so smoothly, the user injection rate was bumped up to 1,200 users per second. This does not mean Janus can handle that many with this setup; it just means that Gatling attempted to make 1,200 calls per second, regardless of Janus’ ability to handle the load.
Throughput becomes erratic
Things start to get a little more interesting. It doesn’t look like any requests are being dropped, but the Janus throughput is much choppier, interspersed with very large drops to what appears to be a complete stop.
The below chart also shows a much more erratic set of response times, with the upper percentile responses hitting over 500 ms, and sometimes even over one second.
If you were to hook a profiler up to the Janus JVM, you’d likely see the previously mentioned id allocation threads, blocking periodically, lining up with the huge spikes in latency and dead stop on the throughput front.
The Janus instance could be pushed a little further, to the point of dropping requests, but we’ll move on to updating several configuration options next to see what effect that has on performance.
At 1,200 users per second, latencies periodically spike to 500 ms or more
A number of the Janus tunables have been updated for this last experiment. The Gatling user injection rate has also been bumped up from 1,200 to 1,750 users per second. Remember that a “user” here corresponds to one of those g.addV().... traversals that insert 10 new vertices.
JVM: a larger Gremlin Server heap
JanusGraph: most notably, ids.block-size raised to 1 billion (discussed below)
Gremlin Server: larger worker and Gremlin pool sizes
The experiment is run, and looking at the responses per second below, Janus doesn’t quite keep pace with the 1,750 per second, but it is pretty close.
Note that the extreme drop-offs in throughput have disappeared and the shape of the results is more in line with the smooth 1,000 users/second run of the first experiment, albeit at a nicely increased throughput. The change to the default id block size removed the most egregious of the stalls.
By setting it to the very high number of 1 billion, we removed the need for any further id block retrieval rounds during the test. This is not super realistic outside of perhaps a bulk loading scenario, but it nicely illustrates the effect of id allocation on throughput.
When tuning, I would not recommend shooting for never needing to retrieve id blocks. Instead, adjust the renewal percentage and block size so that there is plenty of time for an asynchronous renewal to occur before the id block is starved and writes are brought to a halt.
Results show between 16,000 and 17,500 vertex additions per second
Switching over to the response time percentiles, the p90 numbers look pretty solid and the p99s are well below the 500-1,000 ms seen in the previous experiment.
A large reduction in tail latencies compared to the untuned 1,200 user / second experiment
So why does write performance matter so much? A graph database only delivers value once it has been fed data, and as projects scale, the volume of data to be ingested often grows quickly. If JanusGraph is not primed for the required write rates, ingestion becomes a bottleneck that slows not only the database but also the downstream applications and analyses that rely on the data. For the data scientists, developers, and businesses building on JanusGraph, write performance isn’t just a technical metric; it’s what keeps data flowing in fast enough to support timely queries, analyses, and actionable insights.
Depending on your data ingestion requirements, an out-of-the-box Janus install may meet your needs. If not, this post has covered a few tunables that you can add to your toolbox and possibly put to good use the next time you’re methodically studying the performance of your Janus setups.