May 06, 2019

Parsing Redshift Logs to Understand Data Usage

Posted by Blake Burch

Every company hits a point where simply providing the right data across the organization is no longer good enough. As your data infrastructure scales, questions get asked like:

  • Which individuals and teams are using the data most frequently?
  • How is the data typically being accessed? Through Excel, dashboarding solutions, or SQL IDEs?
  • How does your data usage tie back to various internal solutions (dashboards, scripts, etc.) that are running?
  • What data sources get used most frequently? What are the most common pairings?
  • Are the new data sets your Data Engineering team supplies actually being used?

To answer these questions, it’s imperative that you start actively processing and analyzing your database logs. If you’re using AWS Redshift for your database, Amazon provides built-in logging that contains a huge amount of useful data. With our current size and scale, we’ve found that each of these hourly logs contains approximately 200k logged items and 50 MB worth of data when uncompressed. In other words, to look at the metadata about how we’re accessing our data, we have to parse through more than 438 GB of text files on a yearly basis. While the logs themselves are immensely helpful, they also have lots of quirks and issues that prevent their underlying data from being immediately beneficial.

For the purposes of this post, we’ll walk through the process of extracting data from the “User Activity” logs with Python, along with the pitfalls we found along the way. We won’t be covering the process needed to download these log files in bulk to begin processing them.

From the documentation, here’s the data structure of the log files:
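
Each log entry records the following fields (summarized here from the documentation, since the original table isn’t reproduced):

  • recordtime – the time the event occurred
  • db – the database the user was connected to
  • user – the user name that ran the query
  • pid – the process ID associated with the statement
  • userid – the user ID
  • xid – the transaction ID
  • query – the text of the query itself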

Here’s an initial example to help demonstrate what these logs might look like in practice:
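
The values below are purely illustrative, but they follow the standard user activity log format:

    '2019-04-01T23:15:07Z UTC [ db=prod user=analyst_jane pid=12345 userid=100 xid=7654321 ]' LOG: select o.order_id, o.total
    from prod.orders o
    join prod.customers c on o.customer_id = c.customer_id;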

Understanding the Data

As we sifted through the log level data, we came across a few unique cases that you should be aware of.

  • If your queries select data from a view or datasource that is tied to multiple other views, instead of seeing one query in the logs, you’ll see every nested query logged with the same timestamp and the same transaction ID on separate lines.

  • If there are any failures to run a query and it has to be retried, you’ll see duplicated transaction IDs and queries in the log with different timestamps. We opted to only keep the last instance of any transaction ID to avoid overcounting how often the data was actually accessed (a sketch of this deduplication follows this list).

  • If a user individually authenticates their credentials through a tool, you won’t be able to tell it’s a tool unless you later bring in the IP address of the request from the “connection log” and tie it to the Process ID.
  • File names are stored according to UTC but appear to be one hour off. For example, a file with the date of 04/01/2019 and a time of 00:00 will actually contain data for 03/31/2019 between the hours of 23:00:00-23:59:59. It’s important to keep this in mind if you need to troubleshoot where the original data lives.
  • By default, every log item in your Redshift logs is separated by newline characters, while the query text itself also retains its newline characters. This makes separating the log items tricky if you want to analyze the full context of the query (which we’ll detail below).
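
As a rough sketch of that retry deduplication, assuming the fields described later in this post have already been extracted into a pandas DataFrame (the DataFrame and column names here are placeholders, not our production code):

    import pandas as pd

    # Keep only the last occurrence of each transaction ID / query pair so that
    # retried queries aren't counted twice. Column names are placeholders.
    deduped = (
        extracted_df.sort_values("recordtime")
        .drop_duplicates(subset=["xid", "query"], keep="last")
    )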

Splitting Out Your Logs

Knowing the quirks of the log files, our first step is to figure out how to split out each log item, given that the items are separated by newlines and each query can also contain newlines. We recommend the following steps to clean things up (a Python sketch follows the list):

  1. When you read the file in, replace every newline character in the entire file (a space works well) so the whole file becomes a single string.
  2. Every new item starts with a timestamp. Find every occurrence of this timestamp and replace it with itself, plus a newline.
  3. Split the log file into a list, looking for each newline instance that you just created.
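
Here’s a minimal Python sketch of those three steps. The file name and the timestamp pattern are assumptions based on the standard user activity log format ('YYYY-MM-DDTHH:MM:SSZ UTC [ ... ]'):

    import re

    # Read the raw hourly log file (placeholder file name).
    with open("useractivitylog_2019-04-01T00:00", "r") as f:
        raw = f.read()

    # Step 1: flatten the file so newlines inside queries no longer break items apart.
    flattened = raw.replace("\n", " ")

    # Step 2: re-insert a newline in front of every timestamp that starts a log item.
    timestamp_pattern = r"'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z UTC"
    flattened = re.sub("(" + timestamp_pattern + ")", r"\n\1", flattened)

    # Step 3: split on the newlines we just created, dropping any empty entries.
    list_of_logs = [item for item in flattened.split("\n") if item.strip()]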

Filtering for the Important Data

For our purposes, we only needed to evaluate the log items for queries that were run by either our users or our tools. To ensure that we weren’t overwhelmed with data, we ran our list_of_logs through several filters with values we wanted to include or exclude.

Typically, these filters boiled down to ensuring the queries were run on the production database and contained a select statement.

We highly recommend that you implement your own filters before starting to extract your data. We found that oftentimes, this could reduce things from 200k log items to 2k, making the regular expression matches go much quicker.
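
As an illustration, a filtering pass over list_of_logs might look like the sketch below; the database name and the include/exclude terms are placeholders for your own rules:

    # Placeholder include/exclude rules; adjust these to your own environment.
    include_terms = ["db=prod", "select"]
    exclude_terms = ["pg_catalog", "information_schema", "pg_internal"]

    filtered_logs = []
    for log in list_of_logs:
        lowered = log.lower()
        if all(term in lowered for term in include_terms) and not any(
            term in lowered for term in exclude_terms
        ):
            filtered_logs.append(log)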

Extracting the Important Data

To start off, it’s easy enough to extract the predefined elements like the user, database, timestamp, etc. Due to the consistency of the log files, you can grab all of these groups using a simple regex statement.
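
A sketch of that extraction for a single log item, assuming the field layout shown earlier, might look like this:

    import re

    # Named groups for each predefined field in a user activity log item.
    log_pattern = re.compile(
        r"'(?P<recordtime>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z) UTC "
        r"\[ db=(?P<db>\S+) user=(?P<user>\S+) pid=(?P<pid>\d+) "
        r"userid=(?P<userid>\d+) xid=(?P<xid>\d+) \]' LOG: (?P<query>.*)"
    )

    match = log_pattern.match(filtered_logs[0])
    if match:
        log_item = match.groupdict()  # recordtime, db, user, pid, userid, xid, query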

For our purposes, we needed to go one layer deeper to understand exactly what the underlying queries were doing. Once the query was extracted, we then needed to find every unique datasource being used within that query.

We also have some tools internally, like Redash, that can run queries on behalf of our teams. These tools have their own username and don’t really help us understand which teams or users are actually leveraging the data. To better classify these, we typically have the systems add ID tags as comments to the top of our queries before they get run so we can easily extract the relevant information.

To extract these pieces of information, we used the following code:
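
The original snippet isn’t reproduced here, but a rough sketch of the same idea follows. It assumes datasources appear as schema-qualified "schema.table" references and that tools prepend comments shaped like /* dashboard_id: 123 */ (both are assumptions for illustration):

    import re

    # Sketch only: assumes schema-qualified table references and "/* key: value */"
    # comment tags; neither convention comes from the original post's code.
    query = log_item["query"]

    # Every unique schema.table reference that follows a FROM or JOIN keyword.
    datasources = set(
        re.findall(r"(?:from|join)\s+([a-z0-9_]+\.[a-z0-9_]+)", query, re.IGNORECASE)
    )

    # Any ID tags a tool prepended as comments at the top of the query.
    id_tags = dict(
        re.findall(r"/\*\s*([a-z_]+)\s*:\s*(\S+?)\s*\*/", query, re.IGNORECASE)
    )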

At this point, you should have all of the extracted data stored in some manner for a single log item and can then start looping through all of the log items that exist in a log file.

Storing your Extracted Data

Since we wanted to store this data to later visualize our data usage trends, we opted to store the extracted data as a dataframe and output everything as a flat CSV file. Depending on your setup, the better solution may be to store this data as a JSON file, since the data is technically nested.

The final step on our side was to ensure that if a transaction ID queried multiple datasources, we didn’t just leave the information as a comma-separated list. We instead wanted to create a unique row for every datasource that was seen. We also decided that, despite processing a file for every hour, we wanted to store the extracted data in one single file per date.

Assuming you’ve followed the above steps and stored your extracted data in a dataframe, with each variable existing as its own column, you can use the following function to easily split out the rows and ultimately create the file:
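
The original function isn’t reproduced here, but a minimal pandas sketch of the idea could look like the following; the column names and output path are placeholders:

    import pandas as pd

    # Sketch only: assumes "datasources" holds a comma-separated string per row.
    def split_datasources_and_save(df, date_string):
        # One row per datasource instead of a comma-separated list.
        exploded = (
            df.assign(datasource=df["datasources"].str.split(","))
            .explode("datasource")
            .drop(columns=["datasources"])
        )
        # One flat CSV per date, even though the source logs arrive hourly.
        exploded.to_csv(f"redshift_usage_{date_string}.csv", index=False)
        return exploded

    # Example usage with a placeholder DataFrame of extracted log items.
    daily_usage = split_datasources_and_save(extracted_df, "2019-04-01")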

Final Thoughts

While the process of parsing your Redshift User Activity Logs can take some time, trial, and error to get right, we truly believe that approaching data from this perspective will be helpful for guiding our data usage efforts across our agency. The final result for us is the ability to gain a full picture of how data gets used, by whom, and where we can continue to improve. This shift in mindset and process is helping us find holes in our data tracking efforts and ensuring that we’re setting up our future for better overall data governance.

Data Usage by User at PMG


Tags: data, data innovation, data visualization, redshift