7 MINUTE READ | May 6, 2019
Parsing Redshift Logs to Understand Data Usage
Every company hits a point where simply providing the right data across the organization is no longer good enough. As your data infrastructure scales, questions get asked like:
Which individuals and teams are using the data most frequently?
How is the data typically being accessed? Through Excel, dashboarding solutions, or SQL IDEs?
How does your data usage tie back to various internal solutions (dashboards, scripts, etc.) that are running?
What data sources get used most frequently? What are the most common pairings?
Are new data sets your Data Engineering team supplies being used?
To answer these questions, it’s imperative that you start actively processing and analyzing your database logs. If you’re using AWS Redshift for your database, Amazon provides built in logging that contains a huge amount of useful data. With our current size and scale, we’ve found that each of these hourly logs each contains approximately 200k logged items and 50mb worth of data when uncompressed. In other words, to look at the metadata about how we’re accessing our data, we have to parse through more than 438GB of text files on a yearly basis. While the logs themselves are immensely helpful, they also have lots of quirks and issues that prevent their underlying data from being immediately beneficial.
For the purposes of this post, we’ll be walking through some of the process and pitfalls found within the “User Activity” logs, using Python to extract the data. We won’t be covering the process needed to download these log files in bulk to begin processing them.
From the documentation, here’s the data structure of the log files:
Here’s an initial example to help demonstrate what these logs might look like in practice:
As we sifted through the log level data, we came across a few unique cases that you should be aware of.
If your queries select data from a view or datasource that is tied to multiple other views, instead of seeing one query in the logs, you’ll see every nested query logged with the same timestamp and the same transaction ID on separate lines.
If there are any failures to run a query and it has to be retried, you’ll see duplicated transaction IDs and queries in the log with different timestamps. We opted to only keep the last instance of any transaction ID to avoid overcounting the realistic number of times data was accessed.
If a user individually authenticates their credentials through a tool, you won’t be able to tell it’s a tool unless you later bring in the IP address of the request from the “connection log” and tie it to the Process ID.
File names are stored according to UTC but appear to be one hour off. For example, a file with the date of 04/01/2019 and a time of 00:00 will actually contain data for 03/31/2019 between the hours of 23:00:00-23:59:59. It’s important to keep this in mind if you’re needing to troubleshoot where the original data lives.
By default, every log item in your Redshift Logs will be separated by newline characters, while also retaining newline characters in the query itself. This makes separating the log items tricky if you want to analyze the full context of the query (which we’ll detail below).
Knowing the quirks of the log files, our first step is to figure out how to split out each log item, given that they are initially separated by newlines and each query will also contain newlines. We recommend the following steps to clean things up:
When you read the file in, replace all newlines in the entire file.
Every new item starts with a timestamp. Find every occurrence of this timestamp and replace it with itself, plus a newline.
Split the log file into a list, looking for each newline instance that you just created.
For our purposes, we only needed to truly evaluate the log items for queries that were either run by our users or our tools. To ensure that we weren’t overwhelmed with data, we ran our list_of_logs through multiple different filters with values we wanted to include or exclude.
Typically, these filters boiled down to ensuring the queries were run on the production database and contained a select statement.
We highly recommend that you implement your own filters before starting to extract your data. We found that oftentimes, this could reduce things from 200k log items to 2k, making the regular expression matches go much quicker.
To start off, it’s easy enough to extract the predefined elements like the user, database, timestamp, etc. Due to the consistency of the log files, you can grab all of these groups using a simple regex statement.
For our purposes, we needed to go one layer deeper to understand exactly what the underlying queries were doing. Once the query was extracted, we then needed to find every unique datasource being used within that query.
We also have some tools internally, like Redash, that can run queries on behalf of our teams. These tools have their own username and don’t really help us understand which teams or users are really leveraging the data. To better classify these, we typically have the systems add ID tags as comments to the top of our queries before they get run so we can easily extract the relevant information.
To extract these pieces of information, we used the following code:
At this point, you should have all of the extracted data stored in some manner for a single log item and can then start looping through all of the log items that exist in a log file.
Since we were wanting to store this data to later visualize our data usage trends, we opted to store the extracted data as a dataframe and output everything as a flat CSV file. Depending on your setup, the better solution may be to store this data as a JSON file since the data is technically nested.
The final step on our side was to ensure that if a transaction ID contained multiple datasources being queried that we didn’t just leave the information as a comma-separated list. We instead wanted to create a unique row for every datasource that was seen. We also decided that despite processing a file for every hour, we wanted to store the extracted data into one single file per date.
Assuming you’ve followed the above steps and stored your extracted data in a dataframe, with each variable existing as its own column, you can use the following function to easily split out the rows and ultimately create the file:
While the process of parsing your Redshift User Activity Logs can take some time, trial, and error to get things right, we truly believe that approaching data from this perspective will be helpful for guiding our data usage efforts across our agency. The final result for us is the ability to gain a full picture into how data gets used, by whom, and where we can continue to improve. This shift in mindset and process is helping us find holes in our data tracking efforts and ensuring that we’re setting up our future for better overall data governance.
Stay in touch
Subscribe to our newsletter
Data Usage by User at PMG
Posted by: Blake Burch
2 MINUTES READ | February 4, 2020
3 MINUTES READ | April 6, 2017
2 MINUTES READ | December 9, 2016