In this article, you will learn how SQL aggregate functions can represent an easy way to significantly improve your application’s performance. Mainly, you will see how they were a game-changer in a real-world scenario based on a data-driven application developed for a startup operating in the sports industry.
Let’s now delve deeper into this scenario and learn why you can’t ignore SQL aggregate functions in data science.
Introducing the Scenario
The application I recently worked on aims to offer advanced data exploration features in the sports world through the web. In particular, it needs to allow exploration of both raw and aggregated data. Since the database involves terabytes of heterogeneous and unstructured data, the challenges were mostly on the backend and database side. Now, let’s dive into this scenario.
Technologies, Server Specs, and Architecture
We developed the backend in Kotlin with the Spring Boot 2.5.3 framework and the Hibernate 5.4.32.Final ORM (Object Relational Mapping). We deployed it on a VPS with 8GB of RAM and 4 CPUs through a Docker container managed by Dokku. The initial heap size was set to 2GB and capped at 7GB, while we allocated the remaining RAM to a Redis-based caching system. We built the web application with performance in mind. Specifically, it’s based on a multi-layered Spring Boot architecture and involves multithreaded processing.
We implemented the database as a MySQL server running on an 8GB 2 CPU VPS. We hosted the backend application and the database in the same server farm, but they do not share the same VPS. Since the sports data is simple but highly heterogeneous, the database was structured to avoid duplication and encourage standardization. This structure is why we chose a relational database. As it stands, the database involves hundreds of tables, and I cannot present it entirely here due to an NDA.
Luckily, the most problematic tables share more or less the same structure. So, analyzing just one table should be enough. In particular, take the PositionalData table as an example.
It involves more than 100 columns, and it has more than four external IDs. On average, each of these tables contains at least 15 million rows.
One of the critical features of the frontend application is to let users analyze the aggregated values of hundreds of different sport parameters (e.g., passes, throws, blocks) coming from all the selected games of one or more seasons. To retrieve the data, we developed a backend API that queried the table mentioned earlier. Such a query was nothing more than a trivial SELECT returning from 10k to 20k rows. Then, this data was aggregated with a multithreaded process, stored in the Redis cache, and finally serialized to JSON and returned to the frontend application. From the moment the API received its first hit (and thus, before the result was available in the Redis cache) to completion, users had to wait between two and four seconds.
This delay was unacceptable.
Delving Into the Performance Problem
Let’s now see the downsides of the approach just presented.
ORM Data Transformation Bottleneck
Most advanced ORMs abstract how they represent data at the database level. In other words, the ORM performs the query, retrieves the desired data from the database, and takes care of transforming it into its application-level representation. This data transformation process happens behind the scenes, but it undoubtedly represents an overhead. Although that overhead is usually negligible in terms of performance, it can quickly become a bottleneck when thousands of rows are involved.
This slowdown is especially likely with OO (Object-Oriented) languages, because creating a new class instance for each row takes time and resources. One way to limit object size and heap usage might be to select only the strictly necessary columns. This approach would make each object lighter, but since object creation itself is the main overhead, the time spent on the transformation would not change significantly.
Looping Takes Time
Performing simple operations like sum or average on arrays of objects containing thousands of elements is not free in terms of performance. Although this does not compare to the time the ORM spends transforming the data, it certainly represents additional overhead. Fortunately, Java offers many thread-safe collections for performing operations concurrently. On the other hand, creating and managing threads is a complex and time-consuming task.
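To make this kind of work concrete, here is a minimal, single-threaded Python sketch of application-level aggregation. It is not the project's Kotlin code: the `Row` class, its fields, and the sample data are hypothetical and heavily simplified (one area column instead of 144). The point is only that every fetched row must first exist as an object and then be visited in a loop.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Row:
    # Hypothetical, simplified stand-in for one row fetched through the ORM.
    parameter_id: int
    value: float
    area1: int


def aggregate_in_application(rows):
    """Group rows by parameter, then compute the average value and the area1 sum."""
    totals = defaultdict(lambda: {"count": 0, "value_sum": 0.0, "area1_sum": 0})
    for row in rows:  # one pass over every materialized object
        bucket = totals[row.parameter_id]
        bucket["count"] += 1
        bucket["value_sum"] += row.value
        bucket["area1_sum"] += row.area1
    return {
        pid: {"avg_value": b["value_sum"] / b["count"], "sum_area1": b["area1_sum"]}
        for pid, b in totals.items()
    }


rows = [Row(1, 2.0, 3), Row(1, 4.0, 5), Row(2, 10.0, 1)]
result = aggregate_in_application(rows)
```

With 10k to 20k rows and more than a hundred columns per row, both the object creation and the loop grow accordingly, which is the cost the rest of this article tries to remove.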
Let’s see how several SQL aggregate functions helped me solve the performance issue.
What Are SQL Aggregate Functions?
SQL aggregate functions allow you to perform a calculation over several rows and obtain a single value as a result. Even though each SQL dialect has its own set of aggregate functions, the most common ones are:
- COUNT(): returns a count of the number of rows selected
- MIN(): extracts the minimum value
- MAX(): extracts the maximum value
- SUM(): performs the sum operation
- AVG(): performs the average operation
They become a powerful tool when combined with the GROUP BY clause. Thanks to it, you can first group the desired data and then aggregate each group with these functions. If you want to delve into MySQL aggregate functions, you can find all the supported ones here. I also recommend checking out this article and this resource.
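The five functions listed above can be tried out in a few lines. The sketch below uses Python's built-in `sqlite3` module with an in-memory database rather than MySQL, purely so that it runs without a server. The `stats` table and its data are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one measured value per row, tagged with a parameter ID.
conn.execute("CREATE TABLE stats (parameterId INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO stats (parameterId, value) VALUES (?, ?)",
    [(1, 2.0), (1, 4.0), (1, 6.0), (2, 10.0), (2, 20.0)],
)

# GROUP BY builds one group per parameterId; each aggregate function then
# collapses the rows of a group into a single value.
summary = conn.execute(
    """
    SELECT parameterId, COUNT(*), MIN(value), MAX(value), SUM(value), AVG(value)
    FROM stats
    GROUP BY parameterId
    ORDER BY parameterId
    """
).fetchall()
conn.close()
```

Five input rows come back as two result rows, one per group. The same SELECT should work on MySQL as written, since it only relies on standard aggregate functions.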
Replacing Application-Level Operations With Queries
While SQL aggregate functions seemed promising, I did not know if they could make a difference before seeing them in action. Specifically, the application-level operation generated a data structure containing the average of the value column and the sum of each areaX column (with X from 1 to 144) for each chosen parameter over the selected games. You can easily express this with a single query.
This query takes advantage of SQL aggregate functions to return aggregated data at the database level. It does all this while filtering the desired data with the IN operator on parameterId and grouping it by the same parameterId. In other words, data is first filtered based on the selected games of the season and the desired parameters to analyze. Then, the resulting information is grouped by parameter and aggregated by the SQL aggregate functions.
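As a rough sketch of what such a query can look like, the Python snippet below rebuilds its shape from the description: AVG on value, SUM on each areaX column, IN filters, and GROUP BY parameterId. Several things here are assumptions rather than the production code: the `gameId` column name, the helper function `aggregate_by_parameter`, the use of SQLite in place of MySQL, and the reduction to two area columns instead of 144.

```python
import sqlite3


def aggregate_by_parameter(conn, game_ids, parameter_ids, area_count=2):
    """Return {parameterId: (avg_value, [sum_area1, ...])} computed by the database."""
    # Column names follow a fixed pattern and never come from user input.
    area_sums = ", ".join(f"SUM(area{i})" for i in range(1, area_count + 1))
    game_marks = ", ".join("?" for _ in game_ids)
    param_marks = ", ".join("?" for _ in parameter_ids)
    query = f"""
        SELECT parameterId, AVG(value), {area_sums}
        FROM PositionalData
        WHERE gameId IN ({game_marks}) AND parameterId IN ({param_marks})
        GROUP BY parameterId
    """
    rows = conn.execute(query, [*game_ids, *parameter_ids]).fetchall()
    return {row[0]: (row[1], list(row[2:])) for row in rows}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PositionalData ("
    "gameId INTEGER, parameterId INTEGER, value REAL, area1 INTEGER, area2 INTEGER)"
)
conn.executemany(
    "INSERT INTO PositionalData VALUES (?, ?, ?, ?, ?)",
    [
        (10, 1, 2.0, 1, 0),
        (11, 1, 4.0, 2, 3),
        (10, 2, 8.0, 5, 5),
        (12, 1, 100.0, 9, 9),  # game 12 is not selected below
        (10, 3, 7.0, 4, 4),    # parameter 3 is not selected below
    ],
)
aggregated = aggregate_by_parameter(conn, game_ids=[10, 11], parameter_ids=[1, 2])
```

The application now receives one row per parameter instead of thousands of raw rows, so the ORM has far fewer objects to build and no loop is needed afterwards.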
Defining the Right Indexes
Since that query involves GROUP BY, IN, and SQL aggregate functions, it might be slow. This potential slowness is why defining the proper indexes is so essential. In this case, one index in particular turned out to be the most critical and performance-effective.
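To illustrate how an index can support a query of this kind, the sketch below creates a hypothetical composite index on (parameterId, gameId) and inspects the query plan before and after. Both the index name and its column choice are assumptions made for demonstration, not necessarily the index used in the project, and SQLite's planner stands in for MySQL's, which may make different decisions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PositionalData (gameId INTEGER, parameterId INTEGER, value REAL)"
)
# Made-up data: 50 games x 20 parameters.
conn.executemany(
    "INSERT INTO PositionalData VALUES (?, ?, ?)",
    [(g, p, float(g + p)) for g in range(1, 51) for p in range(1, 21)],
)

query = """
    SELECT parameterId, AVG(value)
    FROM PositionalData
    WHERE parameterId IN (1, 2) AND gameId IN (10, 11)
    GROUP BY parameterId
"""


def plan_details(connection):
    # The last column of each EXPLAIN QUERY PLAN row describes one plan step.
    return [row[-1] for row in connection.execute("EXPLAIN QUERY PLAN " + query)]


plan_before = plan_details(conn)
result_before = conn.execute(query).fetchall()

# Hypothetical composite index covering both filtered columns.
conn.execute(
    "CREATE INDEX idx_positional_parameter_game "
    "ON PositionalData (parameterId, gameId)"
)
plan_after = plan_details(conn)
result_after = conn.execute(query).fetchall()
```

Comparing `plan_before` and `plan_after` shows whether the planner switched from scanning the table to searching the index, while the query results stay the same. On MySQL, the equivalent check would be to run EXPLAIN on the query.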
Should you always use aggregate functions? There are both positives and negatives to this approach.
Let’s compare the result in response time when calling the same API involving data aggregation with no cache and the same parameters.
SQL aggregate functions are undoubtedly a great tool for taking performance to the next level when dealing with data science. Using them is easy and effective, although not all ORMs fully or natively support them. Either way, knowing how to take advantage of them may become essential to improving performance, and explaining this through a real-world case study is why I wrote this article!