A Pragmatic Guide to Good Data Hygiene in Digital Marketing
An old truism holds that analysts and data scientists spend about 80% of their time cleaning data and only 20% actually analyzing it. Cleaning data, in this context, refers to all of the manual operations required to turn data that's readily available into data that's readily useful – that is, data that can be used to put together reports, build models, answer questions, or guide strategy.
While the real value of data lies in the story it can tell and, more importantly, in the actions it might suggest, many of us have a rather strained (and consequently limited) relationship with our data. We have to download it, we have to reformat it, we have to move columns around, we have to delete things, we have to rename things, we have to manually group things together, we have to blend in a second set of data, we have to set up rules for custom logic, we have to reference things and parse things and concatenate things, we have to pivot things. And in many cases, we have to run all of the same cleaning operations for a separate analysis a week later with a new set of dirty data.
In today’s digital ecosystem, data analysis is more or less an expected task for all marketers, not just the geeks with hardcore analytical titles. As a self-proclaimed data evangelist, I consider this a good thing, but it also means that every person on your team might be spending well over 25% of their week ‘negotiating’ with Excel and other BI tools to whip their data into a useful state. That’s a lot of time and energy that could be directed towards higher-value activities.
When you really simplify it, dirty data is the consequence of either (1) human error or (2) sub-optimal data processing systems. On the human side, information is mis-entered, fields that should match are inadvertently made inconsistent (e.g. 'Japan' vs. 'JPN'), and in many cases, the de facto 'rules' around a data entry process aren't sufficiently defined or enforced.
Given enough time, good data engineers can build systems that clean up even the dirtiest data feeds, but such systems rely on consistency in the way that new data is generated, which invariably requires that the rules of data entry are rigorously followed. The reality is that marketing operations can change daily, and account managers greatly outpace their data engineers. New categories are created, bespoke naming conventions arise, taxonomy gets shuffled, and your analytics team is left scrambling to accommodate the ad hoc changes. Even if you’re not reliant on a data engineering or analytics team, your manual data cleansing efforts are likely to be complicated by the newly miscategorized information.
When a new campaign is launched in Facebook or AdWords, account managers are effectively going through a data entry process that will determine what details are added to their historic marketing records and what categories of information will be available for analysis after the fact. With the multitude of channels and partners that today’s marketers are tasked to manage, the potential for inconsistent data multiplies with every additional platform. Today’s brands must take steps to manage the complexity of their marketing data, or risk missing out on insights that could help push them further into the black.
Most of the headaches that arise from dirty data can be avoided entirely with a bit of forethought at the data entry phase. A comprehensive and rigorously enforced “naming structure” yields efficiencies at virtually every step of the digital marketing workflow – from campaign launch to historic analysis. This is especially true when your naming structure has consistent elements across channels. When fully implemented, good naming allows you to:
Fully automate data feeds and reporting activities
Minimize junk data, guesswork, and manual re-categorization efforts
Easily query and segment data by meaningful categories
Maintain consistent and detailed records of historic activity to inform similar future initiatives
Greatly reduce new team member onboarding time
Every account I've ever touched has had a unique set of naming rules, ranging from virtually non-existent (note: this is not good practice) to comprehensively detailed and reliably consistent. If you're in the former group, do not despair. I've provided steps you can take to improve your naming structure, from Bryant's Analytics Cookbook for Advanced Gourmets:
A good naming structure allows analysts to effortlessly aggregate data into meaningful categories that help inform business decisions. To that end, your naming should contain all of the categories that will be used to understand the performance of your marketing efforts at all levels of your organization. Again, part of the value of implementing a good naming structure is to build an interpretable record of your marketing history directly into your data for later review.
Think through your marketing calendar, your seasonal efforts, your evergreen strategies, your test initiatives, your targeting tactics, your funding sources, your objectives, your campaigns, your sub-campaigns, your geos – if these dimensions aren’t available in your marketing data, you’ll probably wish they were in the future. And if they already are available in your data, you can thank your previous self or the excellent marketers that preceded you.
For our purposes, structured information consists of two parts: (1) where a particular piece of information lives, and (2) how a particular piece of information is named.
When you launch a new campaign in Facebook or AdWords, you may consider capturing the marketing event, the targeting type, and a specific product that you’re promoting in the campaign name. If these fields of information consistently show up in the same places – say, as a text string with the format [event]-[targeting type]-[product] – they can be parsed into their respective dimensions with a single operation. Of course, this is a very simple example, and experienced digital marketers know that a particular piece of media may be associated with several categorizations. With this in mind, it’s important to put some forethought into these entries and make sure that they adequately reflect the structure of your marketing.
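To make the single-operation parse concrete, here is a minimal Python sketch. The field order `[event]-[targeting type]-[product]` comes from the example above; the sample campaign name and field labels are hypothetical.

```python
# Hypothetical convention: campaign names follow [event]-[targeting]-[product].
CAMPAIGN_NAME_FIELDS = ["event", "targeting_type", "product"]

def parse_campaign_name(name: str) -> dict:
    """Split a delimited campaign name into its labeled dimensions."""
    parts = name.split("-")
    if len(parts) != len(CAMPAIGN_NAME_FIELDS):
        # A name that doesn't fit the structure is flagged instead of guessed at.
        raise ValueError(f"Unexpected structure: {name!r}")
    return dict(zip(CAMPAIGN_NAME_FIELDS, parts))

print(parse_campaign_name("summer_sale-retargeting-sneakers"))
# {'event': 'summer_sale', 'targeting_type': 'retargeting', 'product': 'sneakers'}
```

Because every name carries its fields in the same positions, the same one-line split recovers every dimension for every row – no per-campaign manual re-categorization.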
With everything in its rightful place, it’s important that the information in said places shows up in predictable ways. In general, marketing teams should be prescriptive about how everything is to be named. Without hard rules in place, we tend to take liberties and name things to suit our individual preferences, leading to a data set that shows ‘july4promo,’ ‘4july,’ ‘4jul2018,’ and five other variations for the same marketing event. Looking at this data, we can tell we’re looking at something related to the independence of this great nation, but our analytical tools will interpret this as eight unique categories and we’ll have extra work to do to normalize this field.
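When variants like these have already crept in, the cleanup usually looks like a hand-maintained mapping back to one canonical label. A minimal sketch, using the variants named above (a real mapping would list every variant actually observed in the data):

```python
# Known spellings of the same marketing event, collapsed to one canonical label.
# 'july4promo' as the canonical choice is an assumption for illustration.
JULY4_VARIANTS = {"july4promo", "4july", "4jul2018"}  # extend as variants appear

def normalize_event(raw: str) -> str:
    """Map known variant spellings of the July 4th promo to one label."""
    cleaned = raw.strip().lower()
    return "july4promo" if cleaned in JULY4_VARIANTS else cleaned

normalize_event("4July")  # → 'july4promo'
```

Hard naming rules at entry time make this table unnecessary; without them, someone has to maintain it forever.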
With the continuous evolution of digital advertising platforms, it benefits marketing teams to appoint channel specialists. One consequence is that specialists often operate in silos: your social team likely speaks a much different language than your search team, even though their respective efforts 'bubble up' to common initiatives and marketing goals. To the extent that it makes sense, a good naming structure allows fluid analysis of common fields across channels.
Good naming structure is often compromised by unanticipated exceptions. For example, a small percentage of your media may have a geographic targeting component while the remaining majority does not. In this scenario, a common mistake is to only include a placeholder for geo in the naming of media where geo is relevant, effectively creating a unique structure just for geo campaigns. This isn’t the end of the world by any means, but for every unique structure that exists throughout your naming, additional rules need to be written or manually executed to make your data cooperate.
Alternatively, your data processing efforts can be greatly simplified by having ‘dummy’ placeholders for all components of your marketing structure even if they’re not always relevant to the media at hand. For cases like this, we’ll often use ‘NA’ or a similar tag to designate an unused field while maintaining the correct structure of information.
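A sketch of the 'dummy' placeholder approach, assuming a hypothetical four-slot convention where geo is the optional field:

```python
# Hypothetical convention: every name carries all four slots, in order.
FIELDS = ["event", "targeting_type", "product", "geo"]

def build_campaign_name(event: str, targeting_type: str,
                        product: str, geo: str = "NA") -> str:
    """Assemble a campaign name; 'NA' fills the geo slot when it doesn't apply."""
    return "-".join([event, targeting_type, product, geo])

build_campaign_name("spring_sale", "prospecting", "boots")
# → 'spring_sale-prospecting-boots-NA'
build_campaign_name("spring_sale", "prospecting", "boots", geo="us")
# → 'spring_sale-prospecting-boots-us'
```

Because geo-less names still carry a fourth slot, one parsing rule handles every campaign; you filter out the 'NA' rows at analysis time rather than maintaining a second parser.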
The phrase ‘tribal knowledge’ is used to describe operative knowledge that is required to maintain an existing system but is not documented in any useful form. Many marketers inadvertently find their data practices tethered together by tribal knowledge. Not only is this dangerous for risk of turnover, it’s also extremely inefficient when involving new team members. Information systems are complex, and that complexity is almost always taken for granted by their tribal knowledge owners. All of this is to say that even the most robust naming structure should be thoroughly documented.
It’s a good day when the data engineer I work with tells me that a new data source he’s been working on is ready for analysis. That good day quickly becomes a bad day when I spend time with the data and find that the results are invalid due to inconsistent naming practices. Sometimes it’s as simple as a misused space or an extra character, other times it’s as severe as a certain category showing up in the wrong dimension.
In any case, these entry errors need to be accommodated (or eliminated) to get to a valid results set – often on a case by case basis. With better controls at data entry or even just a quality assurance step at campaign launch, inconsistent naming and the cleanup it necessitates can be minimized.
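That quality assurance step can be as simple as validating names against the convention before launch. A sketch under an assumed convention (lowercase tokens with underscores, exactly four hyphen-separated slots, geo as a two-letter code or 'NA' – all illustrative, not a standard):

```python
import re

# Made-up convention: [event]-[targeting]-[product]-[geo], lowercase tokens,
# geo is a two-letter country code or the literal 'NA'.
NAME_PATTERN = re.compile(r"[a-z0-9_]+-[a-z0-9_]+-[a-z0-9_]+-(?:[a-z]{2}|NA)")

def check_names(names: list) -> list:
    """Return names that violate the convention so they can be fixed
    before launch instead of cleaned up case by case afterward."""
    return [n for n in names if not NAME_PATTERN.fullmatch(n)]

check_names([
    "july4promo-prospecting-sneakers-us",   # conforms
    "July4 Promo-prospecting-sneakers-us",  # capitals and a stray space
    "july4promo-prospecting-sneakers",      # missing geo slot
])
# → the last two names, flagged for correction
```

Running a check like this at campaign launch catches the misused space or extra character before it ever reaches the historic record.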
The structure of your marketing efforts will continue to evolve. As such, new categorization schemes will continue to emerge and previous ones will lose their utility over time. When naming structure exceptions arise, it’s often a signal that it’s time to revisit your naming structure and make updates. Comprehensive platform coverage (see step (3)) and thorough documentation (see step (5)) minimize friction when rolling out new naming rules, so your marketing team can stay focused on hitting their targets.
Dirty data is expensive but with some up-front planning and adherence to the plan, the expense can be greatly reduced. The best way to deal with dirty data is by preventing it altogether. Only you can prevent dirty data.
Posted by: Bryant Schmitz
2-minute read | February 4, 2020