Multi-Tenant Applications in DataStax Graph

How do you handle customer #2? You delivered an MVP of some hosted software for customer #1. Your brother-in-law knows a guy who has a similar problem and after a lunch meeting, now you need to add customer #2 to your incubating SaaS tool. Of course customer #1 and customer #2 shouldn’t be able to see each other’s data, but you don’t necessarily want to install and configure everything all over again just because you added another customer.

We call this multi-tenancy: using the same or similar code bases, but partitioning the data so that each tenant’s data is isolated from every other tenant’s. The trick is where and how you split the data. (There are other tricks, but for now we’ll just focus on the data part of it.)

Graph databases are the new kids, relatively speaking, in the persistence engine space. In this post I’ll take a look at implementing multi-tenancy with DataStax Enterprise Graph (DSE Graph), a highly performant, highly scalable graph data store built on top of DataStax Enterprise and Apache Cassandra. DSE Graph does not have any native multi-tenant support, but it does have capabilities that can be leveraged for multi-tenancy. We’ll look at three specifically:

  • PartitionStrategy: logically partitioning the data at query time
  • Graph Isolation: physically partitioning data by creating different graphs for each tenant
  • Virtualization/Containerization: physically partitioning data using virtualization technologies

Note: For this blog post we refer to a multi-tenant application as an application in which two or more tenants are served software by a set of resources. A “tenant” in this case refers to a group of users with common access to a dedicated share of the data, configuration and resources of the system. For an overview on reducing complexity in multi-tenant applications, read Part 1 of this series, Multi-Tenant Applications – Reduce the Complexity.

PartitionStrategy

DSE Graph implements the TinkerPop3 API, which provides the ability to use a PartitionStrategy within your Gremlin traversal. A partition strategy is a mechanism that partitions vertices and edges into subgraphs based on a defined partitioning key. Using this mechanism you can partition your graph on a specific key (e.g., company_id) and then run all future traversals, including mutating traversals, against that subgraph. To achieve the partitioning, you will need to modify your data model so that the appropriate vertices carry the key you want to partition on. This can be done either by adding a property to one or more vertices to define the partitioning, or by using a custom vertex ID on the vertices that will be used for the partitioning.

Note: To read more about Custom Vertex IDs, please read this link.

Once you have properly configured your data model to allow for efficient partitioning into subgraphs, the next step is to build out the PartitionStrategy you are going to use with syntax similar to this:

     strategy = PartitionStrategy.build().partitionKey("_company").writePartition("acme co.").readPartitions("acme co.").create()

Now that you have created the PartitionStrategy, you will have to apply that strategy to your “g” traversal source as shown below. This will need to be done each time you re-initialize the traversal source.

     g = g.withStrategies(strategy)

Once your traversal source has the desired PartitionStrategy applied to it, you are free to build your traversal as you normally would. The PartitionStrategy rewrites each traversal when it is compiled: elements you write are stamped with the write partition, and reads are filtered to the configured read partitions, so the traversal only ever sees the appropriate subgraph.

     g.addV('user').property('email', 'test@test.com')

Note: If you are interested in the details of implementing a PartitionStrategy, please read this link.
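
To make the behavior described above more concrete, here is a minimal conceptual sketch in Python. It is not the TinkerPop or DSE Graph API: the TenantGraph class, its methods, and the second tenant "globex" are hypothetical names invented for illustration. It only models the two things a partition strategy does, assuming a single shared store: stamping writes with the write partition and filtering reads to the read partitions.

```python
# Conceptual model only: this is NOT the TinkerPop or DSE Graph API.
# TenantGraph, add_v, v, and the "globex" tenant are hypothetical names.


class TenantGraph:
    """In-memory stand-in for a traversal source with a partition strategy."""

    def __init__(self, store, partition_key, write_partition, read_partitions):
        self.store = store                      # shared list of vertex dicts
        self.partition_key = partition_key      # e.g. "_company"
        self.write_partition = write_partition  # value stamped on new vertices
        self.read_partitions = set(read_partitions)

    def add_v(self, label, **properties):
        # Writes are stamped with the write partition. The stamp is applied
        # last so a caller cannot override it with a property of the same name.
        vertex = {"label": label, **properties}
        vertex[self.partition_key] = self.write_partition
        self.store.append(vertex)
        return vertex

    def v(self, label=None):
        # Reads only return vertices whose partition value is readable.
        return [
            vertex for vertex in self.store
            if vertex.get(self.partition_key) in self.read_partitions
            and (label is None or vertex["label"] == label)
        ]


shared_store = []  # one physical graph shared by every tenant
acme = TenantGraph(shared_store, "_company", "acme co.", ["acme co."])
globex = TenantGraph(shared_store, "_company", "globex", ["globex"])

acme.add_v("user", email="test@test.com")
globex.add_v("user", email="other@example.com")

acme_emails = [vertex["email"] for vertex in acme.v("user")]
globex_emails = [vertex["email"] for vertex in globex.v("user")]
print(acme_emails, globex_emails)
```

Both tenants write into the same store, yet each one reads back only its own vertices. That is the property the real PartitionStrategy provides, and it is also why every traversal source must have the strategy applied: a source without it would see everything.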

Pros

  • A single graph in DSE Graph is built to scale to hundreds of millions of vertices and billions of edges.
  • High hardware reuse minimizes operational costs.
  • Easier development because tenant filtering is implicit in the strategy rather than written into every query.

Cons

  • The ‘g’ traversal source must be re-initialized for each query script with the partition strategy.
  • Data model needs to accommodate the need to partition the graph.
  • All customers are in the same graph so they are vulnerable to the Noisy Neighbor Problem.

Graph Isolation

DSE Graph is built to allow multiple graphs to run on the same cluster. It may seem as if an ideal solution would be to give each customer their own graph; however, limitations in the Cassandra data store underlying DSE Graph cause several scalability problems. In DSE Graph, each new graph creates three underlying keyspaces to support it, and each new vertex label created within a graph sets up two Cassandra tables as backing tables for the vertices and edges of that label. Cassandra has a practical limit of ~500-1000 tables per cluster due to the memory overhead of holding the table metadata. As a result, any reasonably complex graph leaves room for only a handful of graphs on the same cluster. For example, a graph with 20 vertex labels needs 40 tables, so a cluster would support only ~20 such graphs, and each group of ~20 tenants would require a separate cluster. Any reasonably sized tenant load would therefore require a large number of clusters.
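
The arithmetic behind that estimate can be sketched in a few lines of Python. This is a back-of-the-envelope calculation using only the figures quoted above (two backing tables per vertex label, a practical limit of roughly 500 to 1000 tables per cluster). The function and constant names are hypothetical, and the sketch ignores any other tables the cluster needs, so real capacity would likely be lower.

```python
# Back-of-the-envelope estimate using only the figures quoted in the text.
# Names are hypothetical; other tables on the cluster are ignored here.

TABLES_PER_VERTEX_LABEL = 2  # backing tables per vertex label, per the text


def max_graphs_per_cluster(table_limit, vertex_labels_per_graph):
    """Return how many tenant graphs fit under a cluster-wide table limit."""
    tables_per_graph = vertex_labels_per_graph * TABLES_PER_VERTEX_LABEL
    return table_limit // tables_per_graph


low = max_graphs_per_cluster(500, 20)    # 500 // 40
high = max_graphs_per_cluster(1000, 20)  # 1000 // 40
print(f"20 vertex labels: between {low} and {high} graphs per cluster")
```

Under these assumptions a 20-label graph fits somewhere between a dozen and a couple dozen times on one cluster, which is consistent with the rough figure of 20 tenants per cluster.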

Pros

  • Each tenant’s data is 100% physically isolated from every other tenant’s.
  • Easier development, as no special security measures need to be applied.

Cons

  • High operational costs, as multiple clusters are required.
  • Minimal hardware reuse, as only a few tenants can run on the same cluster.
  • Graphs hosted on the same clusters are vulnerable to the Noisy Neighbor Problem.

Virtualization/Containerization

DSE Graph is built on top of Cassandra, which runs on a distributed peer-to-peer architecture based on node clustering. This architecture requires multiple nodes that are configured to work together as a cluster. The operational complexity of running a DataStax cluster is already higher than that of many other distributed and non-distributed architectures. While DSE does provide some highly efficient operational tooling in the form of OpsCenter and Lifecycle Manager to help minimize the overhead, there’s still a significant amount of work required to install and maintain running clusters. DataStax provides some support for Docker and, most recently in partnership with Mesosphere, for DC/OS (DSE Graph is not yet supported there but is on the roadmap) to help reduce some of the infrastructure management overhead. Even so, building a well-performing cluster still takes significant effort, although IT automation tools such as Chef or Puppet can help reduce it.

Pros

  • Increased hardware utilization.
  • Allows for configuration-as-code scenarios.

Cons

  • Significant work is required to make this work in a virtualized manner; no out-of-the-box offerings currently support DSE Graph.
  • Configuration management system is required.
  • If graphs are hosted on the same hardware, customers are vulnerable to the Noisy Neighbor Problem.

Conclusions

Due to the nature of DSE Graph, physically isolating tenants is difficult to achieve efficiently. Graph isolation does not scale well, and support for isolation via virtualization is thin but improving. Logically isolating tenants inside one large graph is currently the most effective way to achieve multi-tenancy. With a data model that carries a partition key and a PartitionStrategy applied to every traversal source, tenants can share a single graph that scales efficiently. This logical isolation complements DSE Graph’s strength: handling massive volumes of data (hundreds of millions of vertices and billions of edges) with no single point of failure, no downtime, and a linearly scalable architecture.

For other posts in this series, see:

Multi-Tenant Applications: Reduce the Complexity

Multi-Tenant Applications in Neo4j

Multi-Tenant Applications in OrientDB