MGM Resorts recently disclosed that a cyberattack cost the company roughly $100 million and exposed customers' personal information (Siddiqui, 2023). The hospitality and entertainment giant revealed a cybersecurity incident that affected its main website, online reservation systems, and in-casino services such as slot machines, credit card terminals, and ATMs. Events like these have intensified the focus on increasingly sophisticated cyber and ransomware-style attacks, which combine network intrusion and malware-based strategies and result in billions of dollars in losses. Many software efforts concentrate on signaling and detection but prove inadequate against attacks that span multiple investigation silos. In today's economy, every sector, including Retail, Banking, Finance, and Logistics, must coordinate its cyber, identity theft, fraud, AML, and investigation groups to address these threats. The advent of new attack types, including spear phishing, flooding, and ID spoofing, underscores the need to break down the internal firewalls between detection and investigation groups; failure to do so can lead to substantial losses against sophisticated attacks.
This session will explore how new technologies such as GenAI, NVIDIA Morpheus, ML, time-series and geographic analytics, and graph data science can work together with simple deep-link visualization to enhance detection accuracy, reduce false positives, and increase transparency between silos, enabling real-time alerting that helps avoid sanctions and fines.
This event will focus on how Dell, Expero, and Kinetica software and hardware can significantly accelerate detection, prevention, and investigation. It is tailored for executives and special investigation teams across all sectors, with key topics presented in under 15 minutes to maximize insights. Throughout the webinar, experts from Expero and Kinetica will demonstrate how to leverage current technologies alongside graph data science, NVIDIA Morpheus, machine learning, and visualization to unleash organizational potential.
Discussion of the inherent difficulties of cyber and connected fraud despite the many technologies in place today, their impact on compliance, and how to identify threats in real time.
Understanding how to use new GenAI and ML/AI in real time, and the power of combining GenAI with NVIDIA Morpheus, time-series and geographic analytics, ML algorithms, and graph data science to reduce false positives and increase accuracy.
Exploration of visualization and human intelligence technologies to increase throughput, provide valuable data analytics, and achieve quicker and more efficient outcomes for Fraud Managers, Risk Investigators, and Data and Analytics teams.
Practical implementation methods of GenAI for cyber identification, complex dependency, and case management using ‘human-in-the-loop’ technology for higher accuracy and streamlined processes.
Laura Smith
Good morning, everyone. Thank you so much for attending today's webinar, Combating Cybercrime with Next-Gen AI and Analytics. Before we get started today, I just want to go over a couple of housekeeping items with you. The session today is being recorded and you'll receive a follow-up email with a copy of the recording. It will also be hosted on both of our websites, so feel free to access the recording there as well. If you have any questions or issues during the webinar, feel free to message me (I'm the host, Laura Smith, and I'll be happy to assist) or drop any questions in the Q&A box at the bottom of the screen; we will answer them live or at the end. With that, we'll go ahead and turn it over to our speakers today. We have Scott Heath, our financial crime solution lead at Expero, and Aaron Bossert, Director of Solution Engineering at Kinetica.
Scott Heath
Super, thank you, Laura. Well, Aaron, I'm really excited to chat with you today. We've got a couple of different things we're going to share: a case study, a quick demonstration, and then we'll get into some of the guts behind why what we're chatting about today is a little different. Aaron's background is that he actually comes from government and has experience in things he can't tell me about because he might have to kill me. I'm kidding. He worked with the US government and some of the other cyber work there. My background is in cyber and financial crime. What we hope to do today is give you a great overview, and towards the end we'll take some questions. So without further ado, we'll jump in. We're going to go over the state of things, which most everyone on the call will know, then we're going to talk about some of the complexities, then how it all fits together, and finally Aaron is going to bring us home with what's behind the curtain, what's driving all of this in a real-time environment. Then we'll have some questions.

So with that, we all know this, we've all seen this, right? It's terrifying. I think the biggest one recently was MGM. And why did I have this sort of problem? Well, I was actually trying to book a hotel. What ended up happening was they were hit with one of the biggest, most egregious ransomware attacks, and not only could I not get into my hotel, I couldn't book, because it locked them down. So what we're seeing now is a much bigger rise of ransomware, but it starts with cyber. And what we see in a lot of our banking customers is that what could be a problem turns into a much bigger problem when you couple it with anti-money laundering and credit card syndicate kinds of folks, etc. Where we're going now is that it's no longer good enough to look at these things in silos; we've got to bring them together. And finally, if things are being paid out in crypto, we also have the ability to follow that. All of these different things are now moving, the dollar amounts are getting bigger and bigger, and trying to defend against them in real time, given some of the things we're going to talk about, becomes extremely complicated.

So with that, this is part of the cyber kill chain that we're going to talk about today: what do the bad guys do? They poke around the edges, they find a way in, and they weaponize that, whether it's a simple account takeover or a malware drop, whatever that is. Then they exploit it, and it starts to do things on its own, and so on. Now, for those of you who are cyber team members, this is not new; you've been doing this for a while and you have many tools. We'll talk about how Splunk and some of these other products out in the market are all part of that spiderweb of sensing and detection. Some of the things Aaron and I are going to step through today can be internal or external, or combinations of both, and that's really what starts to get hard. On the external side, we see products that give you pretty good information.
But when you combine that with internal data and you start to look for complex patterns, or things that are hiding in plain sight, that's what we're going to talk about a bit later today: not only doing these things right, but coupling them with some of the things I mentioned previously around AML, et cetera. That's where things get, number one, complex, and number two, large when it comes to datasets, and then the velocity with which we can find, detect, and stop starts to really mount up, if you will.

Hopefully these things are familiar. Over on the left, we're looking at malware, or DNS tunneling, or those kinds of things. And over on the inside, as one of our customers said, if I lost $5 to an external threat, I could lose millions of dollars if there's an internal connection to an external one, so it makes it even worse. Again, these are some of the things we're going to cover today. Now, a lot of the technologies out there today are doing their best, and these are all perfectly good products, but what we start to see is individual silos. If you're looking at the edge over on the left, and you're trying to coordinate those things with network packets and so on, those are all very important, and whether you're using Splunk, or CrowdStrike, or Palo Alto, they each do slightly different things, but not one of them is a silver bullet. Then we start to see internal and slightly external: emails, delivery mechanisms, team chats, those kinds of things. They can all be gateways for a cyber attack, or, again, in banking it could be an account takeover, or a combination of those. And that's ultimately what we see over on the right, the more traditional enterprise software that has the capability to share but isn't purpose-built for this.

So now that we have this level set on the different elements we're talking about, and again, many of you probably have them today, they're very common, and some of them can get expensive, the key now is: how do we use what you already have, and how do we start to really change the game from an analytics perspective? That's what we see today. Most of the customers in the sample we're going to share are really kind of overwhelmed. I'm looking at your account data out in your Adobe clickstream: have you visited your account? Are you using a known device of some sort? Have you been in and out of a physical building? Am I sending you documents? Most organizations try to shut those paths down to begin with, but we still need to share documents internally. And then as we work our way around, we start to see things like Splunk out on the edge: what are those cloud and data systems doing? Obviously, IoT has become a huge boon, but at the same time it's easy to create a tsunami of data. I can send packet data between Splunk and Palo Alto and CrowdStrike, I can send you all the information you could ever want in milliseconds. The problem is I now have a traffic jam, and our poor security team member is stuck in the middle.
And we'll see that here in a minute: that person is really awash in information. But what's real? What's a false positive? What is a productive alert versus an unproductive alert? Our counterparts who do fraud investigation or anti-money laundering have calculations for this, and we do it in cyber too, where we calculate the productivity of an alert, and the productivity of that alert attached to a case that may take us farther downstream. All of these elements are now attacking our cyber team, if you will.

There are more sophisticated versions of this, but the story we're going to talk about today is a large financial services institution. Over on the left, they have multiple lines of business, so they need to do entity matching. They're not doing Master Data Management, which is similar but slightly different; what they're really trying to do is say, are this packet, this ID, and your account connected? Then they have different business units, whether it's internal, anti-money laundering, credit card, ID, etc. And then we see cyber and malware. These are all somewhat interconnected, and again, whether you're in supply chain or doing other things, you don't necessarily have all of these, but you still have some of them.

Now, what we're going to talk about today is our partner Kinetica. What do they bring to the table? They bring an enormous capability that has not typically been seen in one product; we usually have to do things serially. What can I do with that? I can start to see those what-ifs, I can see predictive analytics, I can look at alerts of alerts across all those previous products. We'll see here in a minute where I can do in-line, real-time scoring and prediction, I can look for similarities, and I can do these kinds of things at speed, which is different. Many of the other products out there can do one or two, whether it's time series or geospatial, and we'll see those in more detail. And what is Expero bringing to the party? What we've done is build a tool that connects to Kinetica and can be a bolt-on to your existing Splunk or Palo Alto kinds of architectures, and in financial services we can actually plug into your Actimize and Oracle Mantas kinds of solutions. So what we're trying to do is a lot of different things, but ultimately to empower the user, whether you're a cyber defense analyst or a combination of other roles in the way your organization does this. We're trying to build a bolt-on, at-scale system that lets you level up, if you will.

Now, where did we come from? Back in the day, we all know there was SQL. SQL runs the world, it still does today, and it's not going away by any means. Then we started to see the rise of computing on the edge, and around 2010 we see increases in GPU and CPU power, being able to do smart things on the edge, but we're still stuck with streaming, or screaming, from the edge, where the human is overwhelmed. That's what gets us to the center one, which is that now we can start to handle these different kinds of payloads.
We can look at time series, geographic, something called a graph database, which is interconnections, then ML, and then finally GenAI. How do we tie all that together to bring down our false positives? We have some empirical numbers here: the 45% decrease in false positives is a real number from one of our customers. In fact, some of their empirical data showed as high as an 83% reduction in false positives. Why? Because I can join all of this external screaming data, start to sense what is a real or productive alert, and then disassociate the noise, which increases our accuracy. That's the premise of what we're going to talk about today.

So let's talk a little bit about our customer. In this case, I'll give a quick overview. They have multiple silos in their teams. What they wanted to do was answer: where's the cyber activity? Where's the malware? How do we stop it in the kill chain? But more importantly, they were looking at a massive amount of alert information a day. What ended up happening was the lines of business felt they were at an acceptable loss ratio for AML, and at an acceptable, limited level of external cyber attack. Well, when they combined the data together, what they found was no, they were not fine. They had actually been hit by a very sophisticated ring that knew how the silos were working, and it effectively took $20 million in its attacks. It was a combined attack, not just a single malware event.

So what did they do? In this case, they had separate units using separately connected data. The cyber team was using Splunk and Palo Alto, those kinds of things, but a human had to parse that information in the middle, then share it with their anti-money laundering partners and other groups within the firm. Some of that was automated, but there was a level of human effort that had to go into it because of how radioactive the information was, internal and external, cyber and fraud. There were silos, and that's natural in organizations. But if we could create a dataset that spanned the silos and let those algorithms and the math run in conjunction, to look for those patterns and let the humans know, that's ultimately what we're talking about today. And when you look at these use cases down here at the bottom, that interconnectivity as they move up and start to connect could apply to any other use case; most of our customers see this kind of breakdown.

Now, the stats on this slide are for this particular customer. Once they started to use this kind of technology, it was game changing. Not only did it reduce the noise, increase productivity, and reduce those false positives, but it created more accurate, productive cases. We were actually increasing the team's throughput, allowing them to do more at speed and at scale. That was really the big takeaway. Now, what does that mean?
It means that just because you're starting to use machine learning, which is a good thing, that's great, but what you now need to do is combine it with geographic location, combine machine learning algorithms with the time series, et cetera, and the final step is you've got to keep your human in the loop, because they reinforce the models and make them even stronger. Together, that's when we start to see the big gain, if you will. Now, if you can't do all those things, it's still okay; as practitioners, a lot of people are trying to do these things piecemeal. But what we're going to share with you today is the big boost you get when they all work together.

Now, this is what they did. Over on the left, at a data level, they were doing their reconnaissance, and as different things came in, they were doing some heavier math kinds of things, and then they had a separate team doing the machine learning. A lot of that loop in the middle was manual; they couldn't quite get it all together, it was expensive, it was hard, it was complicated. And then they had to do these steps over on the right, which is to connect that and share with other team members, and again, a lot of this was only partially automated.

What we're showing today is where they ended up. They were gathering information from Palo Alto, they were getting the packets, they were getting CrowdStrike, they were looking at each of these different alerts. But each of these alerts, as we'll see later, can be different; they give you a different pattern. Unless a human is in there to say, well, I see the Palo Alto alert, and I see this Adobe alert and a Splunk alert, and they're all different, they're all telling me slightly different things, one may be green, one may be yellow, and one may be red. Then a central team, using a Tableau-style dashboard they had to build themselves, shared the results with these different teams. Triage team number one was looking at more network kinds of things, the second was looking more internally, and the third was writing models. What if I could actually invert this, and make it faster and more real-time? This was taking four to eight hours, so if they were under attack, they really had to stress to get through it. The team was moderately large, but there were a lot of cycles in there, and some of the patterns are now so complex that they were missing them. That was part of the problem: the silos, and then this effort here. This is probably similar to what you see; Aaron and I talk to a lot of people on a daily basis, and this is fairly common.

Now why? What we see here is the speed of ingest: I'm getting IoT packets that are just blasting me, and I can throttle those, and I can use things like Kafka to do that. But if I could combine that real-time ingest with AI and ML, and with what is called a graph database, which is connected nodes and edges that let me see a larger graph of things I know and things I don't know, and Aaron's going to talk a little bit about that.
But if I could do those two things, and then start to do things like OLAP and time series as we work our way around here, now I've got a game changer, and I've got enterprise search. Because then the final mile is that the human can be presented options, and they're combined, not serialized. Instead of running an analytic, looking at something, letting a human interpret it, seeing it in a graph, and doing that again and again, which takes time, we actually collapse those steps, and then we start to use GenAI to really move that productivity needle: ask this data a question, ask it why and what and where, so I can have an interrogation that's productive. Because today, when I look at a spreadsheet, I can't ask it a question; it will just look at me. That's kind of why this is indeed so hard. And the workload, for those who are technical here, behind the scenes this is large computing; this is every bit of NVIDIA and those GPUs we hear about in the news today. The business folks don't necessarily have to get into the details on this, but what you need to understand is that those workloads can be very complicated.

Now, what does that mean? When we started working together in a more unified world, we used Kinetica and the power of its adapters; it has roughly 200 adapters out there, and we brought it together. The difference now is we have the ability to take all of those different workloads and run them as an alert of alerts. So when I get CrowdStrike, Adobe, and Palo Alto alerts, I'm able to weight those in a way that lets me look at that alert of alerts: the Palo Alto one is constantly streaming at me, but is it real? Now I can combine those things. Then what we're able to do is look at different kinds of access inside the system: my super user can see all of the different events that are going on, I can see how things are working, where we are, and what my teams are doing. So now I have my internal and my external teams, and then I have my AI and ML team, and everybody has a home here. And it's a bolt-on, so this is not a slow kind of installation; it can bolt on to get those interactions going pretty quickly. We were able to bring that in under a minute, and in some cases most of the SLAs are in the millisecond range.

Now, what's going on partially behind the scenes? I keep using this word graph. Graph is an interesting data style, and just like folks who are familiar with materialized views, time series, and geographic kinds of things, there are algorithms, and then what we see over there is machine learning. These start to give us that pattern identification, and there are some little diagrams in here; if you're more interested in graph, you can certainly contact us after. But this is one of the key elements that is new, just like GenAI. So now, when we see a connected set of data, whether it's a MAC ID connected to a VPN connected to a blacklist, or whatever it is, I can start to use that analytic to say there are problems here, and then I can start to do this multi-dimensional pattern matching.
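As a rough illustration of that kind of connected-data check, here is a minimal sketch using the open-source networkx library; the node names, edge data, and blacklist are hypothetical and this is not the actual Expero/Kinetica implementation:

```python
# Minimal sketch: walk a graph of observed relationships to see whether an
# account or device is connected, within a few hops, to anything blacklisted.
import networkx as nx

# Hypothetical relationships pulled from alert/enrichment data
edges = [
    ("mac:00:1A:2B:3C:4D:5E", "vpn:203.0.113.7"),   # device seen behind a VPN exit
    ("vpn:203.0.113.7", "url:bad-domain.example"),  # VPN exit contacted a URL
    ("acct:user-4821", "mac:00:1A:2B:3C:4D:5E"),    # account used the device
]
blacklist = {"url:bad-domain.example"}

g = nx.Graph()
g.add_edges_from(edges)

def blast_radius(graph, start, max_hops=3):
    """Return every node reachable from `start` within `max_hops` hops."""
    lengths = nx.single_source_shortest_path_length(graph, start, cutoff=max_hops)
    return set(lengths) - {start}

reachable = blast_radius(g, "acct:user-4821")
hits = reachable & blacklist
if hits:
    print(f"Escalate: account is connected to blacklisted entities {hits}")
```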
And I can do all of that in one simultaneous workload. Then I combine my machine learning, I can go look for things called communities, and then I can overlay it with time and space. This is the secret sauce, and this is what Aaron is going to walk us through: the capability to tie all of these things together in one workload is a game changer. That's why we can accelerate the time through the kill chain and effectively either shut it down or start to interrogate it. In some cases we have customers who are watching some of these attackers, because they don't want to catch them on the first attempt and have them run away; they really want to tie them together with those three-letter guys in our government and say, we gotcha. And this is the kind of horsepower it takes to do that.

Now, big picture, what are we talking about? Down here on the bottom we see what I call the fast lane and the slow lane. We can do real-time data, but there are still sets of information, customer and account data if you're in financial services, or supply chain data, or wherever that may be; you have slow-lane kinds of data, and that's okay, you need both. So what we see is this adapter or connector layer, and now all of that data is at my fingertips. I can leave the data in Palo Alto, but I'm going to use the connected information in this determination or analytic. And up here on the top, which we're about to see shortly, is a demonstration of how these parts fit together. This is that logical step; we call this more of a marketecture diagram for the business folks, and Aaron's going to talk a little more about the technology. Everything has a place, everything's connected, and it's modular; you don't have to do it all. And again, a lot of this is powered by Dell and NVIDIA.

So this is just another layer of how these things work. We see our scoring engine in the center. When we break that down, we see where our business users are: if you're in cyber detection or on the business team, there are a number of user interfaces for your team, and you're still touching all that horsepower underneath. If you're an IT or machine learning team member, there are places for you as well: you can build and test your math models, you can use Python notebooks and things you're most likely already doing today, and plug those elements in. And then the adapters are where we connect to CrowdStrike, or Splunk, or whatever those products are; we have those APIs down here so we can connect that data up to the user interfaces at the top. Now we can start to look for things that matter: what is the data flow, where did it come from, where is it going? We'll see this here in a second. I can look for event correlation between those other data systems: when am I being attacked from certain things, and when am I not being attacked by other things? Which ones are right and wrong? And patterns will emerge from that geography. You've got to have geography in there.
And whether you're doing massive polygons, like Kinetica does for some of the military folks, or some of these other things like trends or a blast radius, all of these things are now tied together. So with that, I'm going to switch over and do a demo.

In this demonstration, we'll walk through some of the concepts we talked about. In this first part, what we see is a dashboard; everybody's got a dashboard. This could be Tableau: each of these widgets can tie into our platform, so if you have an existing Tableau or Power BI dashboard, whatever that may be, each of the widgets that connect to our system can populate it, and if you don't have one, obviously you can use ours. What we're seeing is a composite view, and underneath it, as we'll see in a minute, are each of those individual data sources we mentioned. Now what I can do is basically an alert of alerts: I can see where these attacks are coming from, and in these KPIs we can see the false positives. Over on the left we can see where they're coming from: are they geographic, are they tripping some of these other alarms out on the edge, or are they more internal?

In our first case, I want to go look at the map. The map shows me where some of these existing sources are coming from, and in this case I see a pane here. In it, I see a couple of things happening. Number one, I see time series across the bottom. There are two kinds of time: linear time, when those things are happening, and bitemporal time, and our friends at Kinetica can handle both, so I can see what happened when, what was connected to what, and what is occurring right now. That's a very important thing. Now, this visual user interface looks pretty easy: I can scroll, I can add different kinds of alerts down here, and if you look carefully I can say which ones are phishing, which ones I think are exploits, and so on. The other thing I see is my GenAI companion and what it's doing. This is more than just a copilot; it is able to drive my workflow. What it's doing over here is saying, hey, user, I have found something that I believe is highly risky, and I need you to confirm that you'd like to add it to a workflow. It's also finding specific errors that give the human an indication that this is not just spurious activity. Now the user is saying, hey, I should probably dig in. That's exactly what I'd like to do; I want to click into this.

Now I can see there's much more behind it. I see that this particular threat activity here in my queue is connected to other events, and I can start to dig in and get to the root of where I am. Down here on the left, I see that Splunk is saying it's not that bad, and CrowdStrike's alerts are saying it's not bad either. But when I start to look at Palo Alto, and I look at the blacklist, or perhaps a MAC ID, there is a problem.
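That per-source disagreement is the "alert of alerts" idea from earlier. Here is a minimal sketch of what weighting individual source alerts into a composite score might look like; the source names, weights, severities, and threshold are purely illustrative, not the actual scoring model:

```python
# Minimal sketch: combine per-source alert severities into one composite score.
# Weights and threshold are illustrative only.
SOURCE_WEIGHTS = {"splunk": 0.2, "crowdstrike": 0.3, "palo_alto": 0.5}
ESCALATE_AT = 0.6

def composite_score(alerts: dict) -> float:
    """alerts maps source name -> normalized severity in [0, 1]."""
    total_weight = sum(SOURCE_WEIGHTS[s] for s in alerts if s in SOURCE_WEIGHTS)
    if total_weight == 0:
        return 0.0
    weighted = sum(SOURCE_WEIGHTS[s] * sev for s, sev in alerts.items() if s in SOURCE_WEIGHTS)
    return weighted / total_weight

# Splunk and CrowdStrike look mild; Palo Alto (blacklist hit) looks bad.
score = composite_score({"splunk": 0.2, "crowdstrike": 0.3, "palo_alto": 0.95})
print(f"composite={score:.2f}", "-> escalate" if score >= ESCALATE_AT else "-> monitor")
```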
Over on the right, we use a tool called blast radius, which is that graph data structure. So now I've seen time series, I've seen geographic, and now I'm seeing this graph connectivity. In this case, which is patterned after the one we mentioned, it looks like maybe I'm connected to something out here, and there is indeed a bad item, but it looks okay. But when I overlay what's going on with anti-money laundering and credit card, I see that no, it is not okay; it's actually much worse. I see that it is connected all the way out to a blacklisted, unknown URL, and there is an open case on it. That means I should really be shutting this down. Then up here at the top I have my GenAI walking me through that. So what I want to do now is dig in here, and my cyber companion over here is asking what I want to do with this: where is that timeframe, where are those different elements? I think there's something more going on here. Now I can go over to our alert module, see from Splunk and CrowdStrike what those data elements are, and effectively compose a tighter alert. And you can see down here at the bottom what kicked this off. So again, this was meant to be an illustration of how it all works together, and how our tool can combine those different workloads. With that, I've said a lot of words, and now I'm going to pass it over to my partner Aaron. Aaron, are you on mute? Yeah,
Aaron Bossert
I was just about to say I'm so technically incompetent that I can't find the mute button. Okay, so just a little bit of background about Kinetica. Kinetica came out of the intelligence community back in 2009, and the ask was simple, the mission was simple: we needed a way to track hundreds of different data sources, stream them in in near real time, and be able to analyze them, do geospatial operations, text-based operations, and graph-based operations on them, and have all of that in one platform. That was essentially the genesis of Kinetica, and our original mandate was to support multiple different use cases with the same platform. So if you can go to the next slide.

So fundamentally, what is Kinetica? We have the ability to do bulk ingest and streaming data ingest, both at the same time if that's what you need. We can bring in everything from IoT data to packet traffic, which is of course the focus for today's demo, and we can even bring in unstructured data in the form of vector embeddings for documents and things like that, so that we can do similarity searches, which is also going to play into what I'm going to show you today. Then we have the ability to support multiple different front ends for that. One of the front ends you've just seen today is this rather stunning visualization that Expero is showing, which gives you maps, and charts, and the ability to ask questions of the data and all that kind of fun stuff. We have industry-standard connectors that allow us to connect to those front-end applications, we have libraries for Python, JavaScript, Java, and others, and if none of those suit your needs, we have a REST interface, and of course a SQL interface to the data as well. Thank you, I was just about to ask for the next slide.

So when we take a little bit more of a deep dive: how do we accomplish this? How do we get all those different capabilities into one platform? Certainly the engineering team would want to strangle me if I said it was simple, because it's absolutely not. But what this does provide is a very simple architecture and platform, such that rather than having a vector database for vector similarity search, a time-series database for time-series analysis, a geo database for geospatial analysis, a full-text search engine like Elasticsearch for full-text indexing, and then finally graph capabilities and artificial intelligence on top, we've put all of this into a single platform. When you want to work with a graph, you don't have to reinvent the wheel, and you don't have to send your data over the network somewhere else to create the graph, analyze it, bring the data back in, and then do further analysis with it. You can mix and match all these different components in the same queries, because it all uses the same SQL syntax to query these different data sources.
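As a rough sketch of what issuing one of those queries from Python might look like (this assumes the Kinetica Python client, the gpudb package, with an execute_sql-style call; the table, columns, and connection details are hypothetical, and exact method signatures may differ by version):

```python
# Minimal sketch: one SQL statement with a time window plus an aggregation,
# issued through the Kinetica Python client. Table/column names are hypothetical
# and the client call is an assumption based on the gpudb package.
import gpudb

db = gpudb.GPUdb(host="http://localhost:9191", username="demo", password="demo")

sql = """
SELECT src_ip, COUNT(*) AS events
FROM network_traffic
WHERE event_time BETWEEN '2024-01-01 00:00:00' AND '2024-01-01 01:00:00'
GROUP BY src_ip
ORDER BY events DESC
LIMIT 10
"""

# execute_sql returns the matching records; here we just print the raw response.
response = db.execute_sql(sql)
print(response)
```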
So for example, as a vector database: there are the Pinecones of the world, and those are great, but they're a single-purpose tool. For us, a vector is nothing more than another data type, just like an integer, just like a float, just like a string. You can mix vectors into your data directly, or you can use more traditional tools and leverage what Kinetica has to offer as a standalone vector database, if that's what your use case demands. And all of this is accelerated by a GPU-based design, where all of these different capabilities are enhanced, sped up, and made to scale beyond what you would typically see out in the wild, because we're able to run on top of these NVIDIA GPUs, which allow us to brute-force a lot of problems. A lot of things become problematic with other databases, where you might have these weird joins to accomplish, where you have to set up tables specifically for a subset of questions, set those off to the side, and run incoming queries against those different views of the data; that doesn't work when you have ad hoc questions to answer.

So with that being said, I'm going to go ahead and share my screen now, and I'm going to take you on a tour of Kinetica. If I look at the cluster we're running right now, we don't need this much horsepower just for those 8 billion rows that are in there; this is a demo cluster that I use for many, many different use cases, and it happens to be our new demo cluster. You can see that in terms of memory consumption we're not even a third of the way up, we're storing 13 billion rows of data in this database, and we're getting response times like I just showed you. Each of the nodes in the cluster has GPUs installed, fairly large amounts of memory to work with, and high core counts to go with that. Essentially, this allows you to brute-force a lot of problems, which means we can answer ad hoc questions very, very quickly. That's going to become important here in a second.

Because when we start looking at things like network traffic: essentially, what I've done is I'm using the NVIDIA Morpheus pipeline to ingest streaming PCAP. Of course, PCAP is a file format, not the actual network traffic, but the point is to ingest network traffic as it comes in off the wire. We're able to parse all of that out into a machine- and human-readable form that is significantly less costly in terms of storage but still captures a great deal of information about the packets. So we get TTL values from the IP header, source and destination ports and IPs, just like you would expect, and then we do something interesting as well. In this case, we have the ability to store data as JSON, so mixed in with this relational table I'm also storing JSON that represents, in this case, the DNS header and query section. The reason I'm doing that is that I don't know ahead of time exactly what's going to be interesting within each one of those fields, and if I exploded it out, I'd just end up with thousands of columns, all for one packet, and it just doesn't scale very well if we do that.
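To make that design concrete, here is a minimal sketch of flattening common packet fields into columns while keeping the protocol-specific detail as a JSON blob. It uses the open-source scapy library purely for illustration and is not the Morpheus pipeline itself; the row layout and file name are hypothetical:

```python
# Minimal sketch: flatten common packet fields into relational columns and keep
# the DNS-specific detail as a single JSON column, mirroring the sparse-table idea.
import json
from scapy.all import rdpcap, IP, UDP, DNS  # illustrative; not the Morpheus pipeline

rows = []
for pkt in rdpcap("sample.pcap"):
    if not pkt.haslayer(IP):
        continue
    row = {
        "src_ip": pkt[IP].src,
        "dst_ip": pkt[IP].dst,
        "ttl": pkt[IP].ttl,
        "src_port": pkt[UDP].sport if pkt.haslayer(UDP) else None,
        "dst_port": pkt[UDP].dport if pkt.haslayer(UDP) else None,
        "dns_query_json": None,  # stays NULL for non-DNS packets (sparse column)
    }
    if pkt.haslayer(DNS) and pkt[DNS].qd is not None:
        row["dns_query_json"] = json.dumps({
            "id": pkt[DNS].id,
            "qname": pkt[DNS].qd.qname.decode(errors="replace"),
            "qtype": int(pkt[DNS].qd.qtype),
        })
    rows.append(row)

print(f"parsed {len(rows)} rows; DNS rows:",
      sum(r["dns_query_json"] is not None for r in rows))
```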
Otherwise, we wouldn't be able to store nearly as much PCAP, as much network traffic, as we would want to. We could certainly do it, but it would be fairly inefficient. Same thing with the answer section from the DNS records. And then we're also keeping this as a sparse table: you'll notice that we have HTTP columns and all that kind of stuff in here, and it's all merged into one table to make things a little bit easier. You could certainly design it to split out protocol by protocol, like you would with Zeek, or, if you've been around for a very long time, Bro, and do separate tables for each protocol and then do joins based off of those; that's an absolutely fine use case as well. The design decision in this particular case was just to put it all into one sparse table, because we can handle the scale and the brute-force type of queries we would run against it. So that's what we have running in the database right now.

We also have what's called the SQL workbench. You can think of it kind of like a lightweight Jupyter notebook experience: it allows you to put together SQL workflows and share them with coworkers and things like that. You can create materialized views, create tables, do queries, do everything you would expect to do in a Jupyter notebook, except it's purely focused on visualization and SQL queries.

So just to give you an example of a completely different type of data: one of the things I've done is take some threat intelligence reports and perform vector embeddings for them, so that I can store them off in our vector store. That's what these are right here. The embedding, at the end of the day, is just a list of float values, and the vector embedding column is just another column type, just like this one is JSON, just like the other ones are text. It's as simple as that. So we're able to store off all of this different information and make this unstructured data searchable and usable by the end users.

Coming back to why we want to be able to do all these things: it's because we want to analyze the data in a holistic way without having to go to ten different tools to do it. So what I'm going to show you right now is a Jupyter notebook. All of this could be wrapped up into a pretty user interface, but I wanted to show the details of what's happening so that you get a sense of what tools we're using. Here I'm using LangChain, and within LangChain I am using the NeMo service from NVIDIA, which is really cool from our perspective, because you can experiment on the cloud-based instance to figure out the models you need and decide how to fine-tune them and do prompt engineering for them very quickly and easily, and then, when you're ready to put your use case into production, you can run that same NeMo service locally with those same models. You just bring them into your environment.
And now you have a secure place to interact with these LLMs, where you don't have the security concerns you would typically have with someone like OpenAI or someone else, because who knows what they're actually collecting, or, when you say don't use my data for training, how do you know that's not happening? The easiest way to address that is just to run your own models locally. So we're using the NeMo service from that perspective, and we're combining all of this together. We're running an embedding service that comes from NeMo, but we're running it locally, so we're using an LLM to do the vector embeddings for the data. We're also using this other LLM, GPT-43B, not to be confused with OpenAI, set up here and ready to go.

So what are we going to do? We're going to use the LangChain framework here to ingest some documents. Here's just a sampling of documents, some threat intelligence reports, all in PDF format. LangChain makes that very simple to work with: you'll notice that we have this PDF loader available to us, so I don't have to go somewhere else to do OCR or to strip the text out of the PDFs or anything like that; the tools are already there and available. Once we do that, we get a listing of all the different files, we load them up, we process them one page at a time, and we put them through the vector embedding service. So when I hit enter here, it's going to take a couple of seconds to run, and we're done, we can go on to the next step. Now we actually create the collection that gets stored in Kinetica using our LangChain plugin for a vector store, just like that; that takes another couple of seconds to run, and then everything's inserted into the database and we're good to go. So there we go, five seconds later, and we have all these different PDFs in there, just like I showed you over here: threat intel reports. That's what we just did. We can look at it, we can see how many records there are, all that kind of fun stuff. And just to make sure we're clearing everything out, I can delete it, to show you that we're not faking the funk, as it were; we're actually doing this live, watch it go. There we go. So we can go back up here, process the documents, and a couple of seconds later we're good to go: the table with the vector embeddings is there. Now we can go back and check, and sure enough, as soon as I refresh it, there's our table again, data preview, everything's done. So within all of ten seconds we've ingested all of these different documents, created vector embeddings for them, and done all this good stuff so that we can answer questions in a little bit.

Then we're going to create a retriever for the documents, and we're going to set up LangChain. A full description of what LangChain and LangGraph are is a little bit outside the scope of what we're talking about today, but in general, what we're doing here is making it easier to use multiple different large language models and multiple different services, such as the vector embedding service that's there.
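A minimal sketch of that document-ingestion step in LangChain follows. The directory, embedding endpoint, and connection settings are hypothetical; LangChain does ship a PyPDFLoader and a Kinetica vector store integration, but exact class names and constructor arguments may differ by version:

```python
# Minimal sketch: load threat-intel PDFs, embed them, and store the vectors.
# Paths, model endpoints, and connection settings are illustrative only.
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings  # locally hosted embedding service
from langchain_community.vectorstores import Kinetica, KineticaSettings

docs = []
for pdf in Path("threat_intel_reports").glob("*.pdf"):
    docs.extend(PyPDFLoader(str(pdf)).load())  # one Document per page

chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

embeddings = NVIDIAEmbeddings(base_url="http://localhost:8000/v1")  # hypothetical endpoint

# The exact Kinetica vector store arguments depend on the installed version.
store = Kinetica.from_documents(
    chunks,
    embeddings,
    config=KineticaSettings(host="http://localhost:9191"),
    collection_name="threat_intel_reports",
)
print(f"indexed {len(chunks)} chunks")
```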
One of the things we do to accomplish that is set up a template for the prompts, the context, essentially, that gets sent to the various LLMs we'll be interacting with. We have this formatted area right here, with the ability to insert a variable. Once we ask a question, the chain is going to use the vector embedding service: it creates a vector embedding for the question, retrieves the most similar content from the vector store, and adds that to the context, so that when we ask questions relevant to our internal documentation and things like that, it can answer.

Why is that important? No matter what LLM you're using, whether it's OpenAI, whether it's Llama 3, or whatever the case may be, it only knows what it knows up to the point its training finished. Does it know anything about your specific issues, or your specific cybersecurity-related issues? Probably not. So what we can do is augment that using retrieval-augmented generation, RAG for short. What that means is that every time you ask a question, the question gets encoded as a vector embedding, the system reaches into your repository, your vector store, scans through all of these different documents, finds the most relevant portions, and returns them to the LLM so it can summarize what it got from the documents and give you the answer to your question. So now we have a way to access structured data via RAG, which I'll show you here in a second once we get through with this part, and we have the ability to access unstructured data, being these PDFs. It could be CSV files, it could be PDFs, it could be Word documents; the sky's the limit in terms of what can be supported for this type of use case.

So once we have everything in here, we've created our retriever, and essentially all we're doing is saying: whenever you do a search, bring back the top four answers, the four most relevant things; this is the format the context will be in; and whenever I ask a question and you retrieve from the vector store, you're going to insert that context into this variable location right here so that the LLM now has that additional knowledge. Then from there, all we're doing is creating a formatted message to send to the LLM. We're doing all the piping, all the plumbing, if you will, to connect these things together: we're creating this retrieval chain where we go to the LLM, the LLM goes to the vector store, and then it answers the question based off of that.

So what I have here, let me make sure I actually ran this so I don't look dumb. Oh, I didn't run this one here, that would look bad. There we go. What I did is put up a few questions. Again, all of this can be done through a user interface, but I wanted to dig into the details today, so I put it into this format so you can see what's going on under the hood.
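A minimal sketch of that retrieval-chain wiring, using LangChain's expression language; the prompt text, the k=4 setting, and the chat model endpoint are illustrative and assume the vector store built in the sketch above:

```python
# Minimal sketch: top-4 retriever + prompt template + LLM, wired as a RAG chain.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_nvidia_ai_endpoints import ChatNVIDIA  # locally hosted model endpoint

retriever = store.as_retriever(search_kwargs={"k": 4})  # bring back the 4 most relevant chunks

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

llm = ChatNVIDIA(base_url="http://localhost:8000/v1")  # hypothetical local endpoint

def format_docs(docs):
    # Join the retrieved chunks into one context block for the prompt.
    return "\n\n".join(d.page_content for d in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What is the TLP level for the Volt Typhoon report?"))
```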
So what we're going to do now is ask one, two, three, four questions about data that, if I asked any other LLM, OpenAI or otherwise, it would have very little chance of answering appropriately. But because we're bringing in our internal knowledge base via the PDF documents, and you could do this with your internal Slack channel or Teams, all of that is fair game when it comes to a retriever like this, and you can actually index this stuff in near real time with Kinetica, which kind of changes the game as well. It allows you to keep up to date with the network traffic that I'm going to show you, as well as the documents and all that kind of fun stuff.

So here we've asked a few questions. Let's go back up here, because we have quite a bit of text to go through, and we're going to see what happened. We asked: what is the TLP level for the document on Volt Typhoon? TLP is the traffic light protocol; if anybody doesn't know what that is, it's essentially a way for the government agencies that produce these kinds of reports, or Mandiant, or whoever produces the report, to have a common language for classification of the data outside of the government's Secret, Top Secret, Confidential, all that kind of stuff. It makes it much simpler to share data with other organizations, because you can just look at the document, and if the document says TLP:CLEAR, that means anybody can see it. So when I ask that question, it gets embedded as a vector and compared against my vector store. It retrieves the little chunks of data, up to four, to give the model additional context, additional information it can use to answer the question. And from that context (it's not useful if it just returns the context, because that's just disjointed blobs of text) it sends that to the large language model, which gives you a simple answer. The answer was: hey, the TLP is Clear, so anybody can see it. Okay, fair enough.

So what's the next question? What is Volt Typhoon? This one made it all over the news relatively recently, as a state-sponsored attack and all that kind of jazz. So we ask a question about it: what is Volt Typhoon? The retriever again pulls the top four results from the vector store, and then the model summarizes that and gives you an answer based off the summarized data: the Volt Typhoon actors are suspected to be state-sponsored. Could that be a little bit better of an answer? Absolutely. But for demo purposes, it gets the point across. We can always do prompt engineering, and we can refine the size of the chunks of text we store, and things like that, to produce better and better results.

So now we've done all this, we've asked all these questions, we've gotten our answers. Oh, there we go, there was the last one, I was wondering where it went. This is an interesting one, because it goes beyond just, hey, can you go find some keywords in the text. The actual question was: what should I do if I suspect a Volt Typhoon intrusion?
And the answer comes in: if you suspect a Volt Typhoon intrusion, you should immediately contact your local law enforcement agency and report the incident, they will be able to assist you, and so on, and it will also tell you, from the information that's here, some of the indicators of compromise you can reference. So it gives you all of that information, summarizes it, and brings it back for you to use.

Now we're going to do something a little bit different with the same platform. We've shown visualization, we've shown vector embeddings, we've shown regular old SQL query execution. But here's something Kinetica provides as well: we provide our own large language model, which does text-to-SQL conversion. So what does that buy you? For a person who knows what they're doing, it's a fun toy to play with. But for an executive, or someone less skilled with SQL who just wants to interact with the data without having to go find someone, get them to run a query, run the report, and hand it back, they can ask questions and get answers back directly. So we set up our connection to our large language model that does the text-to-SQL, and we're going to ask some questions. Let's set it up, and there we are, we have some queries, so let's go through them.

Here's where it gets interesting again. What HTTP response codes are there, and how many instances of each are there? Well, it returned that as a data table: HTTP response code 207, there are seven of them; 200, there are five; and so on. 404, there are two. Then the next question is: what different user agent strings were there? Here's a listing of them; there's the DAV client, the Microsoft stuff, all that kind of thing.

Now we get to a really interesting question: show me similar DNS records to the one at this timestamp. So if I've done my analysis, I've looked at the data, and I think this particular DNS request looks a little bit funny, I would like to see if there's anything similar to it within the data so I can analyze it a little bit better. And what does it return? A table with the top five most similar records. Much like our retrieval from the documents, this is doing the same kind of retrieval, but from the data itself. We've done a vector embedding for the header, the entire DNS query header, and for the entire DNS answer header; of course, if it's the query, there's no answer header, and vice versa, so it only does it if there's something there. What you'll notice, though, is that it immediately becomes clear that you've got the same source and destination IP addresses being captured here. They're similar in the sense that they're coming from the same IPs, at least. Could that be behind a proxy? Absolutely, and there's all kinds of other stuff to look at. But we can look at each one of these and see that there are commonalities between them, in terms of what's being queried, or how it's being queried, or what fields are there, all that kind of fun stuff. And it returned that to us in just a couple of seconds.
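Stepping back to the text-to-SQL piece for a moment, here is a minimal sketch of that general pattern. This is a generic illustration, not Kinetica's actual SQL-GPT interface: the model call, schema string, and execution step are hypothetical, and in practice generated SQL should be reviewed before it is run:

```python
# Minimal sketch of text-to-SQL: give a model the table schema plus a natural-language
# question, get SQL back, then run it. All names here are illustrative.
from langchain_nvidia_ai_endpoints import ChatNVIDIA

SCHEMA = """Table network_traffic(
    event_time TIMESTAMP, src_ip VARCHAR, dst_ip VARCHAR,
    http_response_code INT, user_agent VARCHAR)"""

llm = ChatNVIDIA(base_url="http://localhost:8000/v1")  # hypothetical local endpoint

def question_to_sql(question: str) -> str:
    prompt = (
        f"You translate questions into SQL for this schema:\n{SCHEMA}\n"
        f"Return only the SQL statement.\nQuestion: {question}"
    )
    return llm.invoke(prompt).content.strip()

sql = question_to_sql("What HTTP response codes are there and how many of each?")
print(sql)
# In practice you would review/validate the SQL, then execute it, for example via
# the Kinetica client's execute_sql-style call sketched earlier.
```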
So what have we done here? We're leveraging a platform that lets you do multimodal analysis; fundamentally, that's what it is. I don't have a separate graph database, and yet, wherever my window went, here we have graph representations of the data, and we're able to run ad hoc queries, community detection algorithms, all of that. It's not just a graph representation in the visual sense; there's an actual graph representation of the data underlying it all, which allows you to do that kind of analysis and then tie it back in with the rest of the time series data and the regular OLAP-style queries. Find the most connected components within the graph; from those, take the top five; and from the top five, show me only those that were active from time A to time B. I think, from the Expero side of the demo, Farmington is the bad guy in all of these, so I'll stick with that: Farmington was supposed to be on vacation, but was actually accessing a system they shouldn't have been accessing. All of that is possible here. We can see it on a map, and we can filter the data based on IP addresses or on the types of events we're seeing. To take that a step further, rather than just doing this kind of analysis, we can also send out alerts. Here, again just for demo purposes, is my Slack channel with alerts coming out of Kinetica: hey, this IP address at this time attempted to access our web server. Is it malicious? Probably not, it's just for demo purposes, but at the end of the day, there it is. So we're able to do all of that from the same platform, and then come in and implement this entire AI-based workflow, also using large language models to access unstructured data, access structured data, and do vector similarity searches across them. All of these tools come together to give you a holistic capability to look at your data through multiple lenses, perform analysis based on it, and mix and match the components. For example, where I say, show me all the records with DNS headers that look similar to this one, we can take that another step and perform another query based on it: from those similar records, show me the top destination ports, how many bytes were transferred, what the text records look like, whether they're similar, whether there are fuzzy matches between them, whether there's anything in the text records that doesn't look normal. All of these things are possible because we're capable of storing unstructured data, like we did up here to access these documents, and because we're able to access the structured data like we're doing here, all from one system. You ask the system a question, and you can get an answer in tabular form, text form, charts, or any number of other formats you might want to use.
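(To make the "graph plus alerting" pattern concrete, here is an illustrative Python sketch using networkx and a Slack incoming webhook. The record fields and webhook URL are placeholders, not the demo's actual configuration.)

import json
import urllib.request
import networkx as nx

def alert_on_large_components(records, webhook_url, start, end, min_size=5):
    # Build a graph of hosts that talked to each other in the time window,
    # then alert on any unusually large connected component.
    g = nx.Graph()
    for r in records:  # each record: {"src": ..., "dst": ..., "ts": ...}
        if start <= r["ts"] <= end:
            g.add_edge(r["src"], r["dst"])
    for component in nx.connected_components(g):
        if len(component) >= min_size:
            payload = {"text": f"Large connected component ({len(component)} hosts): "
                               f"{sorted(component)[:10]} ..."}
            req = urllib.request.Request(
                webhook_url,
                data=json.dumps(payload).encode("utf-8"),
                headers={"Content-Type": "application/json"})
            urllib.request.urlopen(req)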
And all of that is made possible by a platform that gives you all those capabilities in one place. It gives you a simpler architecture: there are fewer things to break, fewer moving parts, and you don't have to copy your data from point A to point B; it's already there and already available. So I'll say the same thing my partner here said: that was an awful lot of talking I just did for that demo. Let me pause there and let you digest it. I'll stop sharing my screen for the moment, and if you have any questions, I'll be happy to answer them.
Scott Heath
Appreciate that, Aaron. I think the takeaway now is that you've seen the power: you saw how the business users consume it, and then, as technology participants, how much is going on underneath. Hopefully today we've been able to bring that to you. Part of this is that we love Dell and NVIDIA; they've really started to lead the pack here. And whether you care about hardware or not, a lot of what we do today is made possible by that scale-up and scale-out. You can use lots of other technologies, but GPUs are out there, and today the Dell stack is really making this better and stronger. The key takeaway, for business and technology alike, is that it's one platform. We can combine that edge sensing with Splunk and some of those other products, we can integrate with your case management tools and so on, or we can deliver this combined capability. The takeaway is that it's a game changer for these kinds of problems, and it's a bolt-on: you don't have to buy everything you saw today, though hopefully you'll buy something. And if you're building your own, we do that too; we're here to help. The takeaway is to get in front of the bad guys. As for how to get started, we've got a couple of things here. If today was interesting to you, certainly reach out to us. We can do these things quickly, whether you're looking at GenAI, a combined analytic, or any of the things you saw today, and we can do smaller or larger versions of them, either in the cloud or on premises. With that, I'll bring it to a close. Laura, back to you, and thank you so much for your time, all of you who were with us today.
Laura Smith
Thank you all so much. I know we're a little over time, but we did have a few questions come through, so I'll run through those and then we'll wrap up. There's one in the Q&A box: when you said you built your own LLM, was this specific to the customer data, or something more central to Kinetica?
Aaron Bossert
Oh, yeah. So the LLM that we have uses an open-source foundation model, and then we fine-tuned it on query patterns taken from our own internal demo cluster. We've captured thousands of different queries, essentially question-and-answer pairs, and we fine-tuned the model on those. The result is a model that is laser-focused on only one task: text to SQL. You can't ask it if it's happy, you can't ask it for the news; it does text to SQL, and that's it. But that's okay, because that's where we bring in something like the NeMo framework to bring in other large language models. Hopefully that answers the question.
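(As an illustration only, fine-tuning data for a narrow text-to-SQL model is often just question-and-SQL pairs written out as JSONL. The file name, schema, and example queries below are hypothetical, not Kinetica's training set.)

import json

pairs = [
    {"question": "What HTTP response codes are there and how many of each?",
     "sql": "SELECT response_code, COUNT(*) AS n FROM http_logs GROUP BY response_code;"},
    {"question": "What distinct user agent strings were seen?",
     "sql": "SELECT DISTINCT user_agent FROM http_logs;"},
]

with open("text_to_sql_finetune.jsonl", "w") as f:
    for p in pairs:
        f.write(json.dumps({"prompt": p["question"], "completion": p["sql"]}) + "\n")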
Laura Smith
Great, thank you so much, Aaron. And then Scott, you mentioned that the system can bolt on. Does the solution allow for configuration to match our existing systems?
Scott Heath
Yeah, there was a lot of integration in my part of the discussion, which is integrating with your sensing data; we can do that. If you have consoles that are fairly straightforward, we can make it look and feel like you're not leaving them, and just present the data there. If you need to get all the way back to Palo Alto, or Splunk, or whatever it is, we can use them as iframes in our dashboard, or your Tableau dashboard, or whatever you use. During configuration we can match color patterns, and we have a mix-and-match dashboard capability, so we can configure it ourselves or allow our customers to do it. There are lots of ways to configure it so that it doesn't look or act unnatural next to what you're already doing today. And it's fast; that's the other thing, we have configuration tools for exactly that. So hopefully that answers the question as well.
Laura Smith
All right, everyone. Well, thank you all so much for your time, and thank you both, Scott and Aaron, for the presentation today; it was great. Just as a quick reminder, we did record today's session, so we'll send out a copy of the recording in the next few days. Feel free to follow up with us through email, or you can message any of us as well. With that, we'll go ahead and end today's session. Have a great rest of your week, and we will be in touch soon.