10 Data Engineering Practices to Ensure Data and Code Quality
Data engineering is one of the fastest-growing professions of the century. Since I started working in the field, I have encountered various ways of ensuring data and code quality across organizations. Even though each company may follow different processes and standards, some universal principles can help us increase development speed, improve code maintainability, and make working with data easier.
1. Functional programming
The first programming language I learned during my studies was Java. Even though I understood the benefits of object-oriented programming for creating reusable classes and modules, I found it hard to apply when working with data. Two years later, I came across R, a language with strong support for functional programming, and I fell in love. Being able to use the dplyr package, pipe functions together to transform the data, and quickly see the results was life-changing.
But these days, Python allows us to combine both worlds: the ability to write object-oriented modular scripts, while at the same time making use of functional programming that works so well when interacting with data in R.
The reason why functional programming is so excellent for working with data is that nearly any data engineering task can be accomplished by taking input data, applying some function to it (i.e., your T in ETL: transforming, cleaning, or enriching the data), and loading its output to some centralized repository or serving it for reporting or data science use cases.
The functional programming paradigm is so common in data engineering that many blog posts have been written about it. For example, the article linked below was published by the creator of Apache Airflow, Maxime Beauchemin, in early 2018:
Functional Data Engineering — a modern paradigm for batch data processing (medium.com)
Similarly, many data engineering tools have been created to facilitate the process. Functional programming lets us create code that can be reused across many data engineering tasks and easily tested by feeding small chunks of data to a function before running ETL on large amounts of production data.
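As a minimal sketch of this idea, assume the data is held as a plain list of dictionaries. The function name, the column names, and the sample values below are made up for illustration. The transformation returns new rows instead of modifying its input, so it can be checked on a tiny hand-made sample first:

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def add_cost_per_click(rows: List[Row]) -> List[Row]:
    """Return new rows enriched with cost per click; the input is left untouched."""
    enriched = []
    for row in rows:
        clicks = row["clicks"]
        # Guard against division by zero for ads that were never clicked.
        cost_per_click = row["cost"] / clicks if clicks else None
        enriched.append({**row, "cost_per_click": cost_per_click})
    return enriched


# A small hand-made sample stands in for production data.
sample = [
    {"campaign": "a", "cost": 10.0, "clicks": 4},
    {"campaign": "b", "cost": 5.0, "clicks": 0},
]
result = add_cost_per_click(sample)
```

Because the function has no side effects, the same call can later be pointed at the full dataset without changing its behavior.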
2. Design your functions to do one thing
To make functions reusable, it’s a good practice to write them so that each does one thing. You can always have a main function that ties the different pieces together. Overall, I have found that by keeping functions small (i.e., focused on doing one thing well), I tend to develop code faster, because a failure in a single element can be identified and fixed more easily.
Smaller functions also make it easier to exchange single components and use them like Lego bricks that can be combined for different use cases.
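A small sketch of what this could look like, again assuming rows are plain dictionaries (the function names and fields are hypothetical): each function does one thing, and main() only ties them together.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def drop_rows_without_id(rows: List[Row]) -> List[Row]:
    """Remove rows whose id is missing."""
    return [row for row in rows if row.get("id") is not None]


def normalize_country(rows: List[Row]) -> List[Row]:
    """Trim whitespace and upper-case the country code."""
    return [{**row, "country": row["country"].strip().upper()} for row in rows]


def main(rows: List[Row]) -> List[Row]:
    """Tie the single-purpose steps together."""
    return normalize_country(drop_rows_without_id(rows))


cleaned = main([
    {"id": 1, "country": " de "},
    {"id": None, "country": "us"},
])
```

If the country normalization ever needs to change, only one small function has to be touched, and each step can be reused in other pipelines on its own.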
3. Proper naming conventions are crucial
A good practice is to name objects so that a new person who looks at your code can immediately understand your intentions. If some abbreviations may not be understandable to everybody, it may be better to avoid them and write the names in full. Additionally, most data engineers I’ve seen tend to use the following conventions:
- verbs as function names, e.g. get_dataframe_from_google_ads() can be easier to understand than google_ads(): rather than showing only the source system, the longer version also indicates the action that the function performs and the object type that it returns (a data frame). You could consider it wordy, but usually you only need to write it twice: once when you define it and once when you call it. Therefore, in my opinion, those longer names pay off.
- UPPER-CASE global variables: most data engineers I have worked with write global variables in upper case to distinguish them from local variables (e.g. those in the main function)
- imports only at the top of a script: many consider this best, because although you could in theory import libraries within functions or classes, it may be easier to track package dependencies if all imports are at the top of a script.
Ideally, your naming can make your code self-documenting, which can also make you write code faster.
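The conventions above could look like this in a short script. This is only an illustration that uses the standard library; the function name, the constant, and the sample CSV text are invented for the example:

```python
# All imports live at the top of the script.
import csv
import io
from typing import Dict, List

# Module-level constant in upper case, easy to tell apart from local variables.
DEFAULT_DELIMITER = ";"


def get_rows_from_csv_text(
    csv_text: str, delimiter: str = DEFAULT_DELIMITER
) -> List[Dict[str, str]]:
    """Parse CSV text and return one dictionary per row."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return [dict(row) for row in reader]


rows = get_rows_from_csv_text("campaign;clicks\nbrand;12\n")
```

The verb in the function name states the action, and the rest of the name states both the returned object and the source.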
4. Make your code easy to maintain by writing less and better code
In general, we read code much more frequently than we write it. Therefore, it makes sense to make our code readable and easy to follow. With proper naming and a good structure, we make it easier for our future self and for other people who work with our code.
Being concise is also helpful: the less code we write, the less code we need to maintain. If we can accomplish something with fewer lines of code, it’s a potential win.
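As a small, hypothetical illustration, both functions below return the positive values from a list. The second one expresses the same logic in a single line that is arguably easier to read and maintain:

```python
from typing import List


def get_positive_amounts_verbose(amounts: List[float]) -> List[float]:
    """Index-based loop: correct, but more code to read and maintain."""
    positive_amounts = []
    for index in range(len(amounts)):
        if amounts[index] > 0:
            positive_amounts.append(amounts[index])
    return positive_amounts


def get_positive_amounts(amounts: List[float]) -> List[float]:
    """Same behavior expressed as a list comprehension."""
    return [amount for amount in amounts if amount > 0]
```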
5. Documentation is key but only if done properly
It may sound counter-intuitive, but we shouldn’t document WHAT our code is doing. Instead, we should document WHY our code is doing what it’s doing. I’ve often seen code comments stating the obvious.
For instance, the documentation of our function get_dataframe_from_google_ads() doesn’t have to say that we are downloading data from Google Ads, but rather the reason for doing that, e.g. “downloading ads spending data for later marketing cost attribution.”
On top of that, using docstrings or type annotations to document the expected input and output of the function is extremely helpful! It immediately makes you a better data engineer.
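A sketch of how this might look in practice. The function and its fields are hypothetical: the first line of the docstring carries the WHY, while the type annotations and the Args/Returns sections describe the expected input and output.

```python
from typing import Any, Dict, List


def sum_spend_per_campaign(ad_rows: List[Dict[str, Any]]) -> Dict[str, float]:
    """Aggregate ad spend per campaign for later marketing cost attribution.

    Args:
        ad_rows: One dictionary per ad with the keys "campaign" (str)
            and "cost" (float).

    Returns:
        A mapping from campaign name to its total cost.
    """
    spend_per_campaign: Dict[str, float] = {}
    for row in ad_rows:
        campaign = row["campaign"]
        spend_per_campaign[campaign] = spend_per_campaign.get(campaign, 0.0) + row["cost"]
    return spend_per_campaign
```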
6. Avoid hard-coding values
I’ve seen many ETL-related SQL queries that used threshold values without explaining the reason for them. For instance, imagine a script that extracts data from some table, but only for events that happened after 30.09.2020, with absolutely no documentation of why somebody picked this specific date. Without an explanation, how should anybody later find out why this value was hard-coded? It could be that, on that day, the company transitioned to a new source system or data provider, or changed some business strategy.
I’m not saying that it’s wrong to specify such business logic within the code. But without documenting why somebody has chosen such an arbitrary threshold, this hard-coded value could remain a mystery for the next generations of data engineers in years to come.
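One way to keep such a threshold in the code without leaving it a mystery is a named constant with a comment that states the reason. In the sketch below, the reason, the constant name, and the event structure are all invented for illustration; only the date comes from the example above:

```python
from datetime import date
from typing import Any, Dict, List

# Hypothetical reason, for illustration only: events up to and including this
# date come from a legacy source system with an incompatible schema.
LAST_DAY_ON_LEGACY_SOURCE_SYSTEM = date(2020, 9, 30)


def filter_events_after_migration(events: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only events that happened after the documented cut-off date."""
    return [
        event for event in events
        if event["event_date"] > LAST_DAY_ON_LEGACY_SOURCE_SYSTEM
    ]
```

The value is still fixed in the code, but the name and the comment tell the next engineer why it exists and when it can be removed.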
7. Avoid keeping zombie code
A common anti-pattern I’ve often encountered is abandoned code that is left commented out in the script. Maybe somebody wanted to test some new behavior and kept the old version just in case the new one didn’t work. Or perhaps this person wanted to keep the history, which is a job better left to version control. In both cases, I would argue that it’s best to avoid this, as later developers may struggle to tell which version is the correct one.
For instance, I experienced a situation where the commented-out code snippet made much more sense than the version that was not commented out. At some point, somebody might switch the two, assuming that the more logical version was commented out by mistake. Therefore, keeping zombie code can be dangerous.
8. Modularity done right: separate your business logic from the utility functions
Keeping utility functions and business logic in the same place can seem convenient, but it’s usually beneficial to separate them. If done properly, the common functionality can be moved to a separate package and reused later across many projects. This separation requires more effort upfront (e.g. building a release process for such packages), but reusability and the benefit of defining each piece of functionality only once can pay off in the long run.
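A minimal sketch of this separation, shown in a single file for brevity. In a real project the utility part might live in its own shared package; the names and the threshold value are hypothetical:

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


# --- Utility layer: generic, knows nothing about the business domain ---

def filter_rows(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Return the rows for which the predicate is true."""
    return [row for row in rows if predicate(row)]


# --- Business logic: specific to one project ---

# Hypothetical threshold, for illustration only.
LARGE_ORDER_THRESHOLD = 50.0


def is_large_order(row: Row) -> bool:
    return row["order_value"] >= LARGE_ORDER_THRESHOLD


def get_large_orders(orders: List[Row]) -> List[Row]:
    return filter_rows(orders, is_large_order)
```

The utility function can be reused by any project, while a change to the business rule only touches the business-logic part.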
9. Keep it simple
According to the Zen of Python:
“Simple is better than complex.”
Many data engineers, especially those with a Computer Science background, sometimes create sophisticated but overly complex solutions. For instance, if something can be expressed as a simple function that takes some data as input and returns a transformed version as output, then writing a custom class for such an operation may be considered over-engineering.
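To illustrate the point with a made-up example, the class and the function below do exactly the same thing. The class adds state and an extra step for the caller without adding any capability:

```python
from typing import List


# Over-engineered: a class that only wraps a single stateless operation.
class TemperatureConverter:
    def __init__(self, values_celsius: List[float]) -> None:
        self.values_celsius = values_celsius

    def convert(self) -> List[float]:
        return [value * 9 / 5 + 32 for value in self.values_celsius]


# Simpler: a plain function with the same behavior.
def convert_celsius_to_fahrenheit(values_celsius: List[float]) -> List[float]:
    return [value * 9 / 5 + 32 for value in values_celsius]
```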
10. Think long-term
Sometimes we need to make trade-offs between doing things right and doing them fast. Creating solutions that are general enough to be reused across different use cases, and that will make our lives easier long-term, takes longer. For instance, establishing a release process and CI/CD pipelines for modules shared across projects can take a lot of time upfront, but this extra effort usually pays off later. The same is true for taking the time to create scripts that continuously validate and monitor data quality.
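A data quality check does not have to be elaborate to be useful. The sketch below assumes rows are dictionaries with an "id" and an "amount" field, and the rules themselves are invented for illustration. It returns a list of human-readable issues that a scheduled job could log or alert on:

```python
from typing import Any, Dict, List


def find_data_quality_issues(rows: List[Dict[str, Any]]) -> List[str]:
    """Check for missing ids, duplicate ids, and missing or negative amounts."""
    issues = []
    seen_ids = set()
    for position, row in enumerate(rows):
        row_id = row.get("id")
        if row_id is None:
            issues.append(f"row {position}: missing id")
        elif row_id in seen_ids:
            issues.append(f"row {position}: duplicate id {row_id}")
        else:
            seen_ids.add(row_id)

        amount = row.get("amount")
        if amount is None or amount < 0:
            issues.append(f"row {position}: invalid amount {amount!r}")
    return issues
```

An empty list means that all checks passed, so the same function can serve as a gate before data is loaded further downstream.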
In this article, we discussed best practices for data engineering to ensure a high quality of data and maintainability of code. We’ve shown that most data engineering tasks can be expressed as functions that take some input data and transform it according to the specific business requirements.
Those functions should ideally be designed to do one thing and documented so that anyone who reads the code knows what is needed as input and what is generated as the output of the function.
We’ve also looked at useful naming conventions and the best way of structuring, writing, and documenting code to make it useful in the long run.
Thank you for reading!