What your DataBrew Tutorial Won’t Say

Practical tips for your real life Data Analytics / ETL project

Manu Magalhães
10 min read · Dec 16, 2022

DevSecOps is data analysis applied to security. How would you react to this provocation? Well, that’s for another post. Today I’m sharing some lessons from my first DataBrew/data analysis/analytics project (I’ll also touch on Athena and QuickSight). Unsurprisingly, once you leave the neatness of the tutorial and get your real data… things get messy real quick.

Technical Highlight: You probably won’t need a crawler

Did you know that you can save money and skip crawlers altogether if you select “Data Catalog S3 tables” as your output location? I haven’t seen any tutorial say that. It’s easy to think that this kind of output will create just a Data Catalog in Glue and nothing else, especially because it greys out the file type and delimiter fields as soon as you select it. But that’s just bad UX.

Here’s what you really get when you select Data Catalog as output:
1) a Glue Data Catalog that is automatically populated/updated at the end of the Brew job run (i.e., no need for crawlers)
2) the processed data in the format of your choice in the S3 bucket you select.

All you have to do is to pre-select the file type before you select the output location. Yeah, I know.

Data Catalog selected as the output location, with parquet as the output format
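To make the benefit concrete: as soon as the job run finishes, the table is sitting in the Glue Data Catalog and is immediately queryable from Athena, with no crawler in sight. A minimal sketch, using hypothetical database and table names:

    -- Hypothetical names: the Brew job wrote its parquet output to the Glue
    -- database "bike_analytics" as the table "mntn_clean". No crawler needed.
    SELECT *
    FROM bike_analytics.mntn_clean
    LIMIT 10;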

And while we’re at it, tutorials make you run a data profile job, right? It’s largely unnecessary. For me, the schema view was more than enough, even if it only covers the sampled data.

A view of the schema section of your DataBrew project

Now, to the most important discoveries:

Data Analysis is a Different Beast

This was the first time I worked with S3 as a d̶a̶t̶a̶b̶a̶s̶e̶ data lake. Relying on untidy data is quite a mindset shift!

The nature of traditional apps is CRUD (note the bold “u”). When you top that with a SQL DB, you become quite suspicious of duplicated data. In my humble, short experience, however, duplication is a given in Data Analytics. I almost passed out the first time I ran a COUNT in Athena and got a value of 23,487 instead of 132. Honestly, just come to terms with duplication.
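When you do need the “real” number, deduplicating at query time usually does the trick. A rough sketch, with made-up table and column names:

    -- COUNT(*) happily counts every duplicated row...
    SELECT COUNT(*) AS raw_rows
    FROM bike_analytics.mntn_clean;

    -- ...while counting distinct business keys gets you back to the number you expected.
    SELECT COUNT(DISTINCT bike_id) AS unique_bikes
    FROM bike_analytics.mntn_clean;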

Track Your Data Points: Timestamps

Timestamps are your get-out-of-jail free card. Why?

Besides the mind-boggling data duplication, your sources will probably have different extraction frequencies. When a once-a-month data source mixes with an hourly updated source, and then with a firehose (scale this to your heart’s content), you will quickly lose confidence in how accurate or recent the final result is. Timestamps will save your neck.

What did it mean in practice?
First, I ensured that all my transformations would output structured data (I chose parquet), and that they included an extraction and/or transformation timestamp column. I also ensured that, whenever possible, I’d have parameterised datasets that automatically included the data creation date right during extraction. I highly recommend it.

How do timestamps help?
When the time comes to load the data, with a simple ORDER BY you know straight away how recent your data is, thanks to the timestamp column. And obviously, you can use the timestamp to narrow the results down to the time period you want, like current_date or max(extraction_timestamp).
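As an illustration (table and column names are made up), checking freshness and keeping only the latest extraction could look like this:

    -- How recent is the data? The newest extraction timestamp answers immediately.
    SELECT max(extraction_timestamp) AS latest_extraction
    FROM bike_analytics.mntn_clean;

    -- Narrow the results down to the most recent extraction batch.
    SELECT *
    FROM bike_analytics.mntn_clean
    WHERE extraction_timestamp = (
      SELECT max(extraction_timestamp) FROM bike_analytics.mntn_clean
    );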

Timestamps are also helpful to present data in a time series, for example, to show the evolution of a certain metric over time.
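For instance, bucketing by day gives you the time series almost for free (same hypothetical names as above):

    -- Evolution of reported damages over time, one row per day.
    SELECT date_trunc('day', extraction_timestamp) AS report_day,
           COUNT(DISTINCT bike_id)                 AS damaged_bikes
    FROM bike_analytics.mntn_clean
    GROUP BY date_trunc('day', extraction_timestamp)
    ORDER BY report_day;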

Visualise your Data Points

My PO is awesome, and he gave me what I nicknamed “data deliverables”: the specific data points required to play around with later. It looked more or less like this:

Can you see how the viz puts your mind in motion almost instantly? For example, the maintenance category immediately makes me think:

  • We have 5 different repair suppliers. Two of them have an API I can query; the others send invoices in .xlsx format every two weeks.
  • The bike damage data can be sourced both from the app DB (as users report damage) and from suppliers. Which one does he want? Would it be useful to have both sources for cross-reference, to reward users who report damages more accurately, or to watch more closely users who might have caused damage to the bikes?
  • We don’t have ready-to-use information on how many times a bike has been repaired. I might need to use the very data I collect and transform to get it.

This will be super useful in your extraction stage, but we’re not going down that road now. You’ll soon see how the data points shape the transformation stage.

Define Naming Conventions Early: Data Level

Data Analytics projects can feed from many sources, and it’s scary how quickly things can get out of hand. Naming conventions are crucial to keep track of your data sources, both at project and at data level.

How do naming conventions help?

Long story short, naming conventions are all about data tracing. You might not value this a lot in the first weeks, but soon enough you’ll be grateful for them.

Scenario
The image below shows a DataBrew project that feeds from three different datasets.

My particular project was not about bikes, btw

Now, suppose all these data sources have a common field, like bike_damage_type. In DataBrew, you will probably mindlessly pick it from any of the sources because it’s redundant data, so it doesn’t really matter. But a week later, you get the news that the data structure is changing and one of the sources will drop the bike_damage_type field. How long will it take you to check whether this change will break your project? Or suppose you find out that the maintenance data source had a horrible error, and the bike_damage_type field now shows customers’ phone numbers. Can you easily tell if this sensitive data got through to your clean files? With naming conventions, you can.

What have naming conventions meant in practice?
For me, they’ve meant that every column gets a source ID prefix.

In the example above, let’s say the dataset citibike has the following columns: bike_id, bike_damage_type, and bike_colour, and that the dataset maintain contains maintenance_count, bike_damage_type, and bike_id.

As I only need 3 of those data points, the columns in my clean output will be renamed to cb_bike_id, mntn_bike_damage_type and mntn_maintenance_count. I’d then instantly know which data source bike_damage_type comes from. And obviously, naming conventions are agreed upon and logged into the documentation.
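I did the renaming in the Brew recipe itself, but the same idea is easy to picture as SQL (illustrative names, assuming bike_id is the join key):

    -- Source prefixes make the origin of every column obvious downstream.
    SELECT cb.bike_id           AS cb_bike_id,
           m.bike_damage_type   AS mntn_bike_damage_type,
           m.maintenance_count  AS mntn_maintenance_count
    FROM citibike cb
    JOIN maintain m
      ON m.bike_id = cb.bike_id;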

Define Naming Conventions Early: Project Level

Naming conventions are hugely helpful with the DataBrew structure too.

The image above is an early version of my transformation architecture. You can see that the naming convention is applied to the DataBrew structure as a whole: both cb and mntn are used as prefixes for the S3 bucket, for the dataset, for the DataBrew job (and for the Brew project too), for the Glue table and for the clean bucket.

I wish I had applied this from the start. Right now, my project naming is half messy, half organised. But it’s clear that finding, tracking and, most importantly, remembering resources is significantly easier where the conventions are followed.

Separation of Concerns: Project Level

When transformation processes are too tightly coupled, you might find yourself doing a lot of rework. That’s why the architecture above was not the one I ended up with.

The mntn-deliverable job is doing several things at once: cleaning the raw data from 2 different data sources, transforming/joining them, then building the deliverable. If my PO then tells me “I need to know the top 2 colours of damaged bikes reported by our most active users”, my new DataBrew Project would have to do all the cleaning and some of the transformation all over again for the citibike dataset. So I decided to decouple data cleaning from the data deliverable:

I called those “Level 1” jobs. Their purpose is solely to clean the raw data. I don’t mix them with other sources, I just leave them ready to be used as sources for other DataBrew projects, but more importantly, for Athena views. A refresher: Glue Data Catalog tables automatically appear in Athena. Such bliss.

So the logical conclusion was to get the “Level 2 jobs”:

I know it’s a bit more complex, but that helped me an awful lot when I started to stitch data together in Athena. Many times I needed to get the clean data straight from the source, because the “highly transformed” data served one use case, but wouldn’t serve others. Again, it doesn’t have to be your cup of coffee, but it worked for me.
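To make the Level 1 / Level 2 split concrete: once the clean Level 1 tables are in the Data Catalog, a Level 2 deliverable can be as simple as an Athena view stitched on top of them. A sketch with illustrative names:

    -- A deliverable built on top of the clean Level 1 tables,
    -- without re-cleaning anything.
    CREATE OR REPLACE VIEW mntn_deliverable AS
    SELECT cb.cb_bike_id,
           cb.cb_bike_colour,
           m.mntn_bike_damage_type,
           m.mntn_maintenance_count,
           m.mntn_extraction_timestamp
    FROM cb_level1_clean cb
    JOIN mntn_level1_clean m
      ON m.mntn_bike_id = cb.cb_bike_id;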

Separation of Concerns: Process Level

Transform data in Brew, round it in Athena, let QuickSight do the fine tuning. Don’t mix steps or things might get out of hand. I’ll give a few examples.

Athena
Contrary to what some tutorials suggest, if you have a heavily nested json source, flatten it in Brew instead of using Athena. Same goes if you’re doing extensive joins with very detailed conditions. The odds are that you’ll need to run those complex queries over and over. With Athena, this means lower performance and greater costs; with Brew, it means greater simplicity, the benefit of version control via recipes, and more manageable, leaner Athena resources.

Bottom line: let Athena round your data, polish it, make simple joins and high-level filtering. If you’re flexing your SQL skills way too much, and especially if a complex query is used as a base for other views, it might be a sign that you should move that work to Brew instead.
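As a rule of thumb, an Athena query on top of Brew output should stay at roughly this level of complexity (illustrative names again); anything much heavier is a candidate for a Brew recipe:

    -- "Rounding" in Athena: a light filter plus a simple aggregation.
    SELECT mntn_bike_damage_type,
           COUNT(DISTINCT mntn_bike_id) AS damaged_bikes
    FROM mntn_level1_clean
    WHERE mntn_extraction_timestamp >= date_add('day', -30, current_date)
    GROUP BY mntn_bike_damage_type;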

QuickSight
If you are using a BI tool like QuickSight, you probably won’t want Athena to “overfilter” your data. If you need to show, e.g., how many bikes are in use today, and in another graph, how many were used over the last 12 months, you’ll probably want to leverage the same Athena view and let QuickSight handle the filtering. With more freedom, QuickSight end users will be able to slice the data as they please (e.g., they can aggregate data by month, but then drill it down to the week, day, hour or whatever. They will love this.) This also means you can let them do their job, and you can go back to your primary role too.

But this only works if your Athena queries are broad enough. Overly precise Athena queries lead to an unmanageable number of views, and you might quickly lose track of them, as version control in Athena is a bit unnatural.
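In other words, the view QuickSight reads from should keep the raw time column and skip the date filter, so the dashboard can slice it however it likes. A hedged sketch, reusing the hypothetical view from earlier:

    -- Broad on purpose: no date filter, no pre-aggregation.
    -- QuickSight decides whether to show today, the last 12 months,
    -- or a drill-down by week, day or hour.
    CREATE OR REPLACE VIEW bikes_overview AS
    SELECT cb_bike_id,
           cb_bike_colour,
           mntn_bike_damage_type,
           mntn_extraction_timestamp
    FROM mntn_deliverable;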

Mind the Cascading Effect

Correct me if I’m wrong, but complexity seems to be the name of the game for data projects. You need several data points from multiple sources, and even after you have clean data, it doesn’t necessarily come from a single place. This means that any change at the top of the data flow will have a cascading effect down the line. A change (or a failure) in one DataBrew job may easily cause errors in half of your QuickSight sheets.

The best way to forecast, prevent and manage cascading effects is to have robust architecture documentation. Trust me, writing down the data flow and drawing the architecture in Figjam was a game changer for my understanding of the project, for how quickly I could develop the solution, and for my confidence in the trustworthiness of my data. I got to the point where no change would be introduced to my project until it was first accounted for in the documentation.

So, to handle cascading effects effectively, my tip is:
- Use naming conventions for your clean, structured data, so you can trace any column back to its original source
- Have a clear and precise architecture and/or data flow document, so you can quickly identify what resources will be affected by changes

Some challenges I’m still trying to figure out

Maintaining a Data Analysis project is trickier than actually launching it. Your end users will realise that your data can answer a lot of their most burning questions, and depending on how you architect your solution, some requests will take you literally 2 min to fulfill.

But both QuickSight and Brew are pretty new technologies in AWS and are not mature yet. For example, a simple operation like deleting a user group from QuickSight can only be done via the CLI; CDK integration is still very limited; version control for QuickSight is basically non-existent; and the tool can test your nerves because “undo” is not an option. Yes, you read that right. To experiment with changes to some graphs, you’d better duplicate them and do your magic on the copy, because that thing is not going back to its previous form unless you remember pretty well what shape, data points and field wells you had.

I also still struggle with managing Athena views. They proliferate very easily, especially if you are working with tight schedules or focusing on something else. It’s easy to lose track of what each view is for and what dependencies sit behind it, because all your Athena resources are tested, executed and deployed in the console without the faintest need for your IDE, which makes version control difficult again. I’d love to have an end-to-end IaC structure for my data project, but unfortunately I’m not there yet, due to constraints both on my side and on AWS’s.

I’ve written too much already, so I’ll finish with Tessalia’s advice:

Now go and do it, Yennefer :)

Disclaimer:
a) I’m not a Data Engineer/Scientist, happy to hear from you if you find bad advice here!
b) Yes, I proudly used a no-code solution. Data Analytics is not my primary role, and until there’s a business case to justify hiring a Data Person to my DevSecOps team, I’ll gladly use the tools that do the heavy lifting for me :)
c) I use data analytics and data analysis interchangeably, as my project was both.
