
The Hitchhiker's Guide to Data Engineering


Motivation

The more experience I gain as a data scientist and machine learning engineer, the more convinced I am that data engineering is one of the most critical and foundational skills in any data scientist's toolkit. I find this to be true both when evaluating projects or job opportunities and when scaling one's work on the job.

I will write a later post on the differences between a data scientist, a machine learning engineer, and a data engineer; until then, suffice it to say that I think there is a huge overlap between those roles. I worked as a machine learning engineer for almost two years before pursuing the data science path. In that role I worked closely with data scientists to build recommendation systems for customers using the company's products, as well as predictive models for other departments in the organization. While I really enjoyed that work, to be honest, I preferred the hours spent digging deeper into the data, finding the insights that no one else could see, and making a significant impact on the business. That is why I quit my job and went to Monash University for a Master's degree in Data Science.

Stepping into data science is absolutely a painful experience, especially for those who don't have any background in computer programming, probability, and statistics. Among all of my friends who studied data science with me, I have consistently seen two kinds of people:
  • CS students: like me, they have a strong programming background, a few years of work experience in either back-end or front-end development, a clear mindset about what to code, good knowledge of data structures and algorithms, and the ability to adapt and teach themselves new programming languages. They can see a problem and immediately picture the application they could build in a few days. However, these students are weak at probability and statistics.
  • Finance students: they have excellent probability and statistics skills, can do hypothesis testing easily, understand how Bayesian inference works, and can analyze data with non-coding tools far better than any CS student.


Of course, in the end, those who study seriously and really want a decent job in this field will master the combination of all the skills above (Monash did an excellent job helping students get there, though those may be the two most painful years of study you will ever have).

I have had great opportunities to sharpen my skills by doing several real-world projects under the supervision of my professors. I built complicated machine learning models from scratch, did in-depth data analysis, and challenged myself by learning new tools such as Tableau and SAS, practicing R and its libraries day by day, and using several machine learning libraries in Python. The more I work with data, the more I realize that a data scientist's ability to convert data into value is largely correlated with the maturity of the company's data infrastructure and data warehouse. This means that data scientists should know enough about data engineering to carefully evaluate how their skills align with the stage and needs of the company.

Despite its importance, education in data engineering has been limited, mainly because almost everyone in the field assumes that as long as you have decent programming skills, moving into data engineering is not difficult. I have to say that is a totally wrong impression of what data engineering is. Given the field's nascency, in many ways the only feasible path to training in data engineering is to learn on the job, and by then it can be too late. I was very fortunate to have professors who patiently taught me this subject after I raised my concern that some of our fellow students actually lacked the skills for data engineering and building data products. I understand that not everyone has the same opportunity, so I have written this guide for beginners in the field and, more importantly, to help them bridge the gap.

Organization of this Guide

This guide aims to be as comprehensive as possible. It is designed around several topics: data modeling with SQL and NoSQL, data warehouse tools and technologies, scheduling ETL with Airflow, and batch and real-time data processing, through to newer technologies such as data lakes, Spark for big data processing, Kafka, and EMR, and cloud platforms such as GCP and AWS. That said, this focus should not prevent the reader from getting a basic understanding of data engineering, and hopefully it will pique your interest to learn more about this fast-growing, emerging field.

Important Note:
  • This series is not for people who know nothing about data science or have no programming knowledge. I won't cover basics such as SQL, nor tutorials for any programming language. The primary audience is aspiring data scientists who need to learn about data engineering, as well as software developers who find themselves more interested in data pipelines than in software development.
  • This series is primarily about the tools, techniques, and algorithms used to design data pipelines and data products. It is not a series on data analysis algorithms or machine learning models.
Nowadays, more and more companies and organizations are seeking a Full-Stack Data Scientist. Sound familiar? Data scientists are people capable of doing data analysis and visualization, finding the insights and patterns in data to answer the important questions that impact the business model of an entire company. A full-stack data scientist can also deploy models at scale and, more importantly, build the data pipelines that traditionally belong to the data engineer. Finding full-stack data scientists with sufficient skills is difficult, as hard as finding a qualified full-stack developer.


Unfortunately, many companies and educators do not realize that most of our existing data science training programs, academic or professional, tend to focus on the top of the knowledge pyramid. If you look at the programs universities offer right now, most of them encourage students to scrape, prepare, or access raw data through public APIs, but few teach students how to properly design table schemas or build data pipelines. As a result, some critical elements of real-life data science projects get lost in translation.

By way of preparation, I will make this guide as comprehensive as possible; by comprehensive, I mean I will try to cover all aspects of data engineering and make use of the newest tools and technologies. Also, we are living in the world of Docker right now, so I will base most of the tutorials on Docker. Along the way, you will get a sense of how to quickly deploy a machine learning model or piece of software using Docker, and later how to deploy it at scale with the help of Kubernetes.
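To give you a feel for what "Docker instead of installation steps" looks like in practice, here is a minimal, hypothetical `docker-compose.yml` that spins up a PostgreSQL instance for data modeling exercises. The service name, credentials, and volume name are placeholders for illustration, not necessarily the ones this series will use:

```yaml
version: "3.8"
services:
  postgres:
    image: postgres:13            # official PostgreSQL image
    environment:
      POSTGRES_USER: demo         # placeholder credentials
      POSTGRES_PASSWORD: demo
      POSTGRES_DB: demo_dwh       # placeholder database name
    ports:
      - "5432:5432"               # expose the default PostgreSQL port
    volumes:
      - pgdata:/var/lib/postgresql/data   # persist data between restarts
volumes:
  pgdata:
```

A single `docker compose up -d` then gives you a throwaway database to experiment with, without installing anything on your machine.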



To prepare for this series, feel free to teach yourself the basics of Docker, and make sure you know Python well.

The following is what I will cover in this guide. As each new post is published, I will append its link below for the sake of tracking.

1. Data Modeling
   · Overview of Data Modeling
   · Relational Data Models
   · Data Modeling with MySQL and PostgreSQL
   · NoSQL Data Models with Cassandra
   · Putting everything together
2. Cloud Data Warehouses
   · What the hell is a Data Warehouse?
   · Cloud Computing and AWS
   · Implementing a Data Warehouse on AWS with Redshift
   · Putting everything together
3. Data Lakes with Spark
   · What the heck is Spark, and why is everyone using it?
   · Data Wrangling with Spark
   · Optimization
   · Now, let's talk about Data Lakes
   · Spark in the Data Lake
4. Data Pipelines with Airflow
   · Data Pipelines
   · Data Quality
   · Production Data Pipelines
   · A real project with a Data Pipeline
5. Data Streaming with Spark and Kafka
   · In progress
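To give a concrete (if toy) preview of what the pipeline topics above involve, here is a minimal extract-transform-load sketch using only the Python standard library. The table and field names are invented for illustration; they are not from the datasets this series will use:

```python
import csv
import io
import sqlite3

# Extract: in a real pipeline this would read from an API, a log dump,
# or object storage; here we fake a small CSV file in memory.
raw_csv = io.StringIO(
    "user_id,amount\n"
    "1,10.5\n"
    "2,not_a_number\n"
    "1,4.5\n"
)
rows = list(csv.DictReader(raw_csv))

# Transform: enforce a schema and drop records that fail validation --
# a tiny stand-in for the "Data Quality" step in the outline above.
clean = []
for row in rows:
    try:
        clean.append((int(row["user_id"]), float(row["amount"])))
    except ValueError:
        continue  # in production you would log or quarantine bad rows

# Load: write the validated rows into a relational table and aggregate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", clean)
total_by_user = dict(
    conn.execute(
        "SELECT user_id, SUM(amount) FROM purchases GROUP BY user_id"
    )
)
print(total_by_user)  # {1: 15.0}
```

Real pipelines swap each of these three steps for sturdier components (Airflow for scheduling, Spark for transformation, Redshift for loading), but the extract-transform-load shape stays the same.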

Again, we will use Docker to skip past installation steps, so you will quickly get a sense of DevOps as well (Holla!!!).

Throughout this guide, you will work with real data that I had the opportunity to work with during my Master of Data Science at Monash. These datasets are huge and reflect the real work you may encounter in the industry.

I understand how much you want to step right into the top level of the data science pyramid, and I am sure you can find plenty of data science material and online courses on Google. For now, though, let's focus on the engineering perspective. I promise you will be much better off after completing this series, and after that it is your choice whether to keep exercising these skills or to go deeper into data science. Whatever decision you make, this series will leave you well prepared and equipped with the skills necessary to thrive in the data science world.

-----

I am extremely busy helping young companies, non-tech companies, and startups automate the data pipelines in their products, so it will be hard for me to publish new posts frequently. I apologize for that, but I need to keep working to put food on the table.
If you have any questions or need help building a data product, feel free to shoot me an email or comment below, and I will respond as soon as possible. For now, please wait for the next post.
