Written by Joel Engardio for MapD CEO Todd Mostak
Speed at Scale: How General Purpose GPUs Can Save Us From Data Overload
By Todd Mostak
MapD CEO
Why would anyone want to build a GPU-accelerated analytics platform? Everyone knows that GPUs are great for high performance computing, deep learning AI and machine learning. But why are GPUs good for general purpose analytics?
Because the growth of data has become a very big problem.
Data is growing at a 40 percent year-over-year clip. That’s fast. And for some enterprises, the data tsunami is much bigger. If you’re bringing in a lot of sensor data from mobile phones, airplanes or cars, your rate of data growth is above average, to say the least.
The old solution of scaling with more CPUs is over. CPUs are still getting faster to the tune of 20 percent every year, but that’s no match for the 40 percent growth rate of data.
This gap has spawned all sorts of awkward work-arounds. People are down-sampling. They’re pre-aggregating, or simply not indexing every column. Or they’re massively scaling out to huge clusters of computers. People used to brag about how big their Hadoop-based data lake was. But it’s not cool anymore to have thousand-node clusters just to get a basic count of your users. These are all just coping strategies, not real solutions. Underneath those coping strategies, the pain persists and affects actual outcomes.
People are looking for a new way forward, and that’s where GPUs come in. GPUs, by leveraging thousands of processing cores, have become synonymous with speed, which has made them the mainstay of both supercomputing and computationally heavy “deep learning.” But even more crucial is how much faster they are getting every year. GPU performance is growing on average about 50 percent year-over-year, easily outpacing the growth rate of data. This makes the GPU ideal for next-generation platforms that accelerate analytics at scale.
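To see why those percentages matter, it helps to compound them. The short Python sketch below is only a back-of-the-envelope illustration: it takes the three rates quoted above (40 percent for data, 20 percent for CPUs, 50 percent for GPUs), assumes each stays constant year after year, and compares compute growth to data growth. The function names are made up for this example, and real hardware and data trends are far less tidy.

```python
# Back-of-the-envelope sketch: compound the growth rates quoted in the article.
# Assumption: each rate holds constant every year (a deliberate simplification).

DATA_RATE = 0.40  # data volume growth per year (figure from the article)
CPU_RATE = 0.20   # CPU performance growth per year (figure from the article)
GPU_RATE = 0.50   # GPU performance growth per year (figure from the article)


def growth_factor(annual_rate: float, years: int) -> float:
    """Return the compound growth multiple after `years` at `annual_rate`."""
    return (1.0 + annual_rate) ** years


def capacity_ratio(compute_rate: float, data_rate: float, years: int) -> float:
    """Return compute growth divided by data growth after `years`.

    A value below 1.0 means compute has fallen behind the data;
    a value above 1.0 means compute has pulled ahead.
    """
    return growth_factor(compute_rate, years) / growth_factor(data_rate, years)


if __name__ == "__main__":
    for years in (1, 3, 5, 10):
        cpu = capacity_ratio(CPU_RATE, DATA_RATE, years)
        gpu = capacity_ratio(GPU_RATE, DATA_RATE, years)
        print(f"after {years:2d} years: CPU/data = {cpu:.2f}x, GPU/data = {gpu:.2f}x")
```

Under these assumptions, a CPU-only approach covers less than half as much of the data after five years as it does today, while the GPU curve stays ahead of the data. That widening gap is what the coping strategies are trying to paper over.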
MapD was born from my research at MIT, when I was looking to use GPUs to solve extreme scale analytics problems at extreme speed.
The core of MapD is a very fast open source SQL engine built to leverage the thousands of cores across multiple GPUs and multiple servers. MapD leverages supercomputing infrastructure to run SQL queries across billions of records in milliseconds.
Yet all this power and speed is not just for a CEO to see sales numbers or even for Facebook to count its monthly unique users. It’s designed for people making real time operational decisions based on very large, high-velocity data. If these users are forced to wait hours or even minutes to query or visualize their data, the opportunity to make proactive decisions on insights gained is often squandered.
At MapD, we leverage the GPU rendering pipeline to natively visualize massive datasets in real time. It’s great if you can query billions of rows in real time, but what if you actually want to see that on a map or scatter plot or a network graph? Our human eyes and brains function with millisecond latency, so why should our analytics platform (built to feed our eyes and brains) be any slower? We built a visual analytics platform on top of GPUs to eliminate the barriers to your curiosity by allowing you to instantly pivot, drill down and find correlations in the data.
People have been trying to do this for a long time. Early on, enterprising researchers attempted to shoehorn compute workloads into the graphics shaders of GPUs. Then NVIDIA launched CUDA in 2007. CUDA and GPU computing quickly took hold for scientific workloads. More recently, GPUs have come to accelerate deep learning. The final frontier is leveraging these GPUs for general purpose compute and analytics, where MapD is at the forefront. Far from being a niche problem, these are the workloads that millions of analysts must complete on a daily basis.
MapD is about delivering speed and interactivity at scale. It was built for the analysts and data scientists exploring large streaming datasets to find anomalies and correlations in real time, without having to go get a cup of coffee — or worse — go to sleep and come back the next morning to figure out if they structured their queries correctly.
MapD is a platform, not a point solution. Hedge funds use it to crunch credit card data in real time. Fleet managers use it for real-time visualization of vehicle location. Others build custom visualizations and apps using the MapD platform. That’s not to mention the state-of-the-art ecosystem evolving around GOAI, the GPU Open Analytics Initiative. For instance, there are native Python connectors that leverage Apache Arrow to get data in and out of the system with zero overhead. MapD can be deployed in many different ways and our customers are finding new use cases every day. That’s the potential of GPUs for general analytics.