This month marks my fifth year at PMG. I started as a part time SEO when PMG was four people small. Eventually I moved over development full time, somewhere along the way becoming one of our lead devs. You can come work with us! It’s pretty great.
As I was reflecting on the last five years I had a few thoughts…
- Just how much CSV reading code have I written?
- I should distill some of the lessons I learned about dealing with agency data.
So here’s my three biggest lessons about dealing with data. If you work at an agency or deal with online analytical processing, some of these lessons might help you out.
1. Always Internalize Identifiers
Document-oriented data, like product or location feeds, often has an identifier unique to a given company that lets you update and track changes for a single document.
For example, Acme might have 100 product, each with its own unique, Acme-specific SKU. You can use these SKUs to look up the product and update its records in your database.
Let’s call those Acme-specific SKUs an external identifier. External identifiers vary from company to company, so you can’t really make assumptions about them other than they are a string and they will (hopefully) be unique within the companies data. Acme won’t have duplicate SKUs, in other words.
Contrast that with an internal identifier. This is an identifier you control. It might be a simple auto-incrementing value from a database or a UUID. Whatever it is, you own the internal identifier and can control it.
Make sure you come up with a system to map those external identifiers to internal identifiers. Never use an external identifier outside this little bit of mapping code. Don’t use that external identifier as a primary key, in other words.
When in doubt, make your internal identifier a UUID.
2. Accountability is King
You might have the most reliable, smooth-running system ever built, but it doesn’t matter if your users can’t see what’s going on. Give them the tools to do so and you’ll have happy users.
Say your system does imports of data from some external server (like SFTP). You could process all those imports in the background with a simple queue and never show anything to the user other than the complete data.
The result of a system like that is answering a lot of questions about what’s being processed. Instead, you’re better off to give the user tools to check on the imports themselves. Have a queue of imports? Also create a database record for each import that gets updated as the import progresses through its stages (created -> started -> finished|errored).
Some interesting stuff starts happening when you build this kind of accountability and expose it to your users: it makes it really easy to track other metrics along with the status.
- How long did the import take?
- How many records were imported?
- How many records were seen for the first time?
Then you trend those interesting metrics over time and start to see broad patterns. Did your imports slow down on a given data? Was it because of some code you released? Or was the the network wonky? From a business perspective, did the number of records imported go crazy one day? Has the number or records stayed the same? Do either of those things indicate issues? Or just business as usual?
These are questions that cannot even be asked without thinking about the accountability frameworks in place.
3. Never do Partial Reloads
Have an issue with some data for a given day?
Delete it all and reload it. Don’t try to update or partially delete, just call it a mulligan and start over. This is by far the best way to ensure your data is consistent, but it also means you have to design applications differently.
Remember our import example above? Fetching data from an external server, you can’t be sure that the files (or whatever) will always be there. So you build systems that pull down the data for that given data and store it in its raw, unprocessed form so it can be replayed later if necessary. In other words, build some sort of data lake that’s separate and distinct from the processed data.
This can simple (like a bucket on Amazon S3) or complex (a full fledged application). It doesn’t matter. The point is that you build something that makes it easy to reload entire sets of data as necessary.