Combining traditional search techniques with graph algorithms to efficiently find subgroups within data.
As graph consultants, we eagerly apply graph data modeling and algorithms to solve customer problems. A recent example was solving what's known in healthcare as "Care Path Determination" for a particular patient. Let's take an average patient and call him Joe. Determining his care path goes something like this:
For the purposes of this blog, we’re simply going to address step 1. We’ll cover the other steps in later blogs in this series.
Given a large batch of healthcare data, how do you find all the patients who "look like Joe"? Being graph enthusiasts, we're going to use a graph database to hold our data structure, then apply algorithms to turn that data into discoveries.
For some of our internal thesis vetting, we've been using data sets generated by Synthea. Synthea produces realistic synthetic patient data, which serves our purposes for this blog well. The generated data includes patients, allergies, procedures, medications, and immunizations, as well as providers and care plans. The data set we generated contains 1,177 patients altogether.
First, let's review what information is available for patients in this dataset. Central to the issue at hand is the notion of a Patient vertex, and, as in real life, our patients have certain attributes. Patients are also connected to related entities such as Conditions, Allergies, and Addresses. None of this is surprising, but as we explain later, the details matter, and we'll need to refer back to these figures in a bit.
A simple query search is something we're all familiar with, and shaving off unwanted data early gives us more control in the initial steps of journey mapping. This notion may be blasphemous to most graph lovers; however, we're in the business of solving data problems in the best manner possible.
It's important to remember that a productive simple search requires the queried attributes to be populated for every entity. In determining a care path, we would reasonably want to find similar patients based on attributes such as location, age, sex, and health profile, all within some given period of history. However, some or much of that information could be missing, leading us to overlook vital "Joes."
Of all the graph-based algorithms available, one of the simplest and most powerful classes is community detection. The job of community detection is to find nodes within a graph that are similar based on the structure of the graph itself. There are many ways to detect communities within a graph; finding the best (i.e., accurate and efficient) approach is the trick, and it generally depends on your graph structure. Here are a few popular options:
[From Wikipedia: "it has been shown that modularity suffers a resolution limit and, therefore, it is unable to detect small communities. Biological networks, including animal brains, exhibit a high degree of modularity." A higher modularity (-1 to 1) suggests clusters whose members are more similar to each other within a single cluster. Note that modularity is derived entirely from the structure of the graph rather than from attributes on nodes or edges.]
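For orientation, here is the standard (Newman-Girvan) definition of modularity that methods like Louvain maximize; the post itself doesn't give a formula, so this is included only as a reference point:

```latex
Q = \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j)
```

Here \(A_{ij}\) is the adjacency matrix, \(k_i\) is the degree of node \(i\), \(m\) is the number of edges, \(c_i\) is the community assigned to node \(i\), and \(\delta\) is 1 when two nodes share a community and 0 otherwise. Notice that every term is purely structural, which is exactly the point made above.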
If we build our graph with patient nodes and edges connecting patients that share similar attributes, it is certainly feasible to find a community that represents all the "Joes." Which method to use is yet to be determined. As Label Propagation requires some a priori information, it's probably best to table that method for this use case. Modularity and spectral approaches, however, are equally valid in their ability to approximate the solution; without knowing ground truth, calling one "better" than the other is somewhat arbitrary. Looking at their computational complexities, though, we can move forward using the Louvain method for modularity maximization.
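To make that concrete, here is a minimal sketch of Louvain on a small patient graph using networkx. The edge-construction rule (connect patients who share a condition or a city) and the attribute names are illustrative placeholders, not the exact model used against the Synthea data:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Hypothetical patient records; in practice these would come from the Synthea data set.
patients = {
    "joe": {"conditions": {"hypertension", "asthma"}, "city": "Boston"},
    "ann": {"conditions": {"hypertension"}, "city": "Boston"},
    "raj": {"conditions": {"asthma", "diabetes"}, "city": "Worcester"},
    "lee": {"conditions": {"diabetes"}, "city": "Worcester"},
}

# Build an undirected patient graph: connect two patients when they share
# at least one condition or live in the same city (an illustrative rule only).
G = nx.Graph()
G.add_nodes_from(patients)
ids = list(patients)
for i, a in enumerate(ids):
    for b in ids[i + 1:]:
        shared = patients[a]["conditions"] & patients[b]["conditions"]
        if shared or patients[a]["city"] == patients[b]["city"]:
            G.add_edge(a, b, weight=len(shared) + 1)

# Louvain modularity maximization; each returned set is one detected community.
communities = louvain_communities(G, weight="weight", seed=42)
print(communities)
```

Because Louvain is greedy and near-linear in the number of edges in practice, it stays cheap even when the patient graph grows well past this toy size.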
Similarity is a bit different from clustering. On the surface, similarity coefficients measure how alike two graphs (or subgraphs) are. For this scenario we would look at our target patient, Joe, and his neighborhood within the graph, then look at every other patient and rate how similar their neighborhoods are to Joe's. These coefficient calculations scale as O(n) in the number of patients compared.
The purpose of calculating similarity after community detection is to rank the members of the community by how alike they are to our target patient, Joe. Then we can focus our efforts on a given top X percent or top X patients.
These coefficients are all similarly efficient and easy to implement. When doing any kind of comparison measure, the key is consistency, so we really just need to pick one and run with it. Cosine similarity seems to be a community standard and is implemented in many toolkits. However, for the sake of science, we're going to look at all three.
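As a rough sketch of the ranking step, the snippet below scores every candidate's neighborhood against the target's using cosine similarity over binary neighbor-membership vectors. This is one reasonable way to compare neighborhoods, not necessarily the exact feature construction used in our experiments:

```python
import numpy as np
import networkx as nx

def neighborhood_vector(G, node, universe):
    """Binary vector marking which nodes appear in `node`'s neighborhood."""
    neighbors = set(G.neighbors(node))
    return np.array([1.0 if u in neighbors else 0.0 for u in universe])

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

def rank_by_similarity(G, target, candidates, top_n=10):
    """Score each candidate's neighborhood against the target's; return the top matches."""
    universe = sorted(G.nodes)  # fixed ordering so the vectors line up
    tvec = neighborhood_vector(G, target, universe)
    scores = {
        c: cosine(tvec, neighborhood_vector(G, c, universe))
        for c in candidates if c != target
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Tiny illustrative graph standing in for the real patient graph.
G = nx.Graph([("joe", "ann"), ("joe", "raj"), ("ann", "raj"), ("raj", "lee")])
print(rank_by_similarity(G, "joe", list(G.nodes)))
```

Swapping in a different coefficient only means replacing the `cosine` function; the surrounding ranking logic stays the same, which is exactly why consistency matters more than the particular choice.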
The magic happens when we combine all available resources. Our best approach is to start with a simple search on the few basic attributes that will necessarily be available. This best serves our graph algorithms by reducing the load they have to bear. By smartly using simple search as a preprocessor to any community detection and/or similarity algorithm, determining a care path becomes much less painful. Here we drill down to the list we want with the following steps: a simple search on reliable attributes, community detection on the filtered set, then similarity ranking within the target's community.
This drill-down approach allows some parts, like simple search and community detection, to run continuously, constantly keeping those communities up to date. We then push an appropriately narrow set through a similarity algorithm to find the best path forward for the patient; a rough sketch of the whole pipeline follows below.
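Here is one way the three stages could be wired together, assuming a dictionary of patient attributes plus the patient graph from earlier. The filter fields (gender, race, age) and the Jaccard scorer in step 3 are stand-ins, not a prescription:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

def drill_down(patients, graph, target, top_n=10):
    """Drill-down sketch: simple search -> community detection -> similarity ranking.

    `patients` maps id -> attribute dict, `graph` is the full patient graph.
    """
    t = patients[target]

    # Step 1: simple search on attributes we can rely on being present.
    shortlist = [
        p for p, a in patients.items()
        if a["gender"] == t["gender"]
        and a["race"] == t["race"]
        and abs(a["age"] - t["age"]) <= 3
    ]

    # Step 2: community detection restricted to the shortlist.
    sub = graph.subgraph(shortlist)
    communities = louvain_communities(sub, seed=42)
    community = next(c for c in communities if target in c)

    # Step 3: rank the target's community with a similarity coefficient
    # (Jaccard over shared neighbors here, purely as an example).
    scores = {}
    for p in community:
        if p == target:
            continue
        a, b = set(graph[target]), set(graph[p])
        scores[p] = len(a & b) / max(len(a | b), 1)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

The first two steps can be scheduled to run in the background as data arrives, while the final ranking runs on demand for a specific patient.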
Now back to our Synthea dataset of 1,177 patients, we found the following:
First, looking at our target patient, "Joe," we perform a simple search based on gender, race, and age +/- 3 years. This shaves off 96% of our dataset, leaving an initial target set of just 45 patients, including Joe. (Note that these may not be the best attributes for our search given what we're seeking, but they were chosen here because those baseline characteristics are expected to be known.)
As you can see below, simple search can greatly lighten the load on an algorithm like community detection. However, because our synthetic data set is so complete, we get some interesting results: performing community detection on these 45 patients leaves us with just one community, since they are highly connected on the remaining data. In a much larger data set, we might expect to see more distinct communities even after a simple search is performed.
Looking at similarity, as stated before, each coefficient is valid in its own right and the goal is to maintain consistency. Below we consider the top ten results for each method, both for the entire dataset and after performing search and community detection. It is important to note that we did not consider the previously searched attributes when performing the similarity calculations; this accounts for the difference in result bounds, since there were fewer and different values to compute from.
What is interesting here is the overlap in results between the methods before and after search and community detection. After logically narrowing down the data, the similarity coefficient calculations agreed 125% more often than before doing so. This gives us more faith in our results and cause to investigate further.
In conclusion, by combining traditional database search methods with graph algorithms, you lighten the computational load. This method will help you to find your “Joes” and process them accordingly. If you have questions about this blog or want to meet with our technical team, email us at info@experoinc.com!