Data Engineering — W’s
Data Engineering is a niche within software engineering (you can disagree here). It has been around alongside Data Science from the start, but lately it has become the buzzword and one of the most sought-after roles. There are tons of openings, yet companies still find it very difficult to hire quality Data Engineers and fill these positions. Here’s a small perspective from my side.
It's really simple. Until a few years ago, data engineering, or “Big Data” as it was called at the time, was limited to companies that were struggling to cope with their growing data volumes and were writing Pig/Hive jobs (MapReduce) on large on-prem clusters. Few people understood this world, and there were no dedicated data engineering teams. Then came Spark, and the whole scenario changed. Spark proved to be 10x or even 20x faster for many jobs, and with the Spark 2.x releases, writing reusable distributed code and doing data analysis on top of Spark SQL changed big data processing drastically.
Most companies now manage so much data that they need three to four data engineers per team, or a full-fledged data team, to run their ETL, depending on company structure and workload. Every decision depends on that data, which makes data engineering the crux of the modern engineering ecosystem.
You can do without a Data Science team, but these days you cannot do without a Data Engineering team. The ROI is easy to calculate for the latter but not for the former.
So what is data engineering (DE)? Going back to the Spark world: processing large data with Spark, whether through a vanilla spark-submit or orchestrated by Oozie/Airflow, is DE. So is moving data from Azure to Snowflake via some connector, or running a Sqoop job that moves RDBMS data into HDFS.
In its most basic form, data engineering is sourcing data from data sources and storing it efficiently in your data lake or data ecosystem, so that it is available for querying or for consumption by downstream processes.
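To make that definition concrete, here is a minimal sketch of the source, store, and query loop using only the Python standard library. The sample CSV, the `orders` table, and the function names are all made up for illustration, and an in-memory SQLite database stands in for a real data lake or warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would come from an API,
# a file drop, or an RDBMS export.
RAW_CSV = """order_id,customer,amount
1,alice,20.50
2,bob,5.00
3,alice,12.25
"""

def extract(raw_text):
    """Read rows from the source (here, an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Cast types once so downstream consumers do not have to."""
    return [(int(r["order_id"]), r["customer"], float(r["amount"])) for r in rows]

def load(conn, records):
    """Store the cleaned records in a queryable table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

def total_by_customer(conn):
    """A downstream query: total order amount per customer."""
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM orders "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()
    return dict(rows)

conn = sqlite3.connect(":memory:")  # stand-in for the real storage layer
load(conn, transform(extract(RAW_CSV)))
totals = total_by_customer(conn)
print(totals)
```

Real pipelines swap each piece for something heavier (Spark for the transform, a warehouse for storage, an orchestrator for scheduling), but the shape stays the same.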
That is only the basic definition. DE obviously involves much more than making data available: logging who accesses the data, storing it efficiently and without redundancy, accounting for PII and other sensitive data, making sure that only certain people have access to certain data, deleting data to meet compliance requirements (GDPR), retaining data as regulations demand (credit card numbers, telecom data), planning for scale in access, processing, and storage, handling different kinds of use cases (preferably with a single cloud provider), and setting up monitoring and alerts wherever possible. The list goes on.
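Two of those concerns, access logging and restricting who sees PII, can be sketched in a few lines. Everything here is an assumption made for illustration: the column names, the role table, the `read_record` helper, and the salted hash, which is a toy stand-in and not a compliant anonymization scheme.

```python
import hashlib

# Hypothetical configuration: which columns are sensitive, and which roles
# may see them unmasked.
PII_COLUMNS = {"email", "card_number"}
ROLE_CAN_SEE_PII = {"compliance": True, "analyst": False}

# In a real system this would be an audit table or a logging service.
access_log = []

def pseudonymize(value, salt="demo-salt"):
    """Replace a value with a short, stable token (illustrative only)."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return digest[:12]

def read_record(record, user, role):
    """Return the record as this role may see it, and log the access."""
    access_log.append((user, role, sorted(record)))
    if ROLE_CAN_SEE_PII.get(role, False):
        return dict(record)
    # Unknown roles fall through to the masked view by default.
    return {
        key: pseudonymize(value) if key in PII_COLUMNS else value
        for key, value in record.items()
    }

record = {
    "user_id": "u-17",
    "email": "someone@example.com",
    "card_number": "0000-0000-0000-0000",  # placeholder, not a real number
    "country": "IN",
}

analyst_view = read_record(record, "dev1", "analyst")
compliance_view = read_record(record, "auditor1", "compliance")
print(analyst_view)
print(len(access_log), "accesses logged")
```

In practice these rules usually live in the platform itself (column-level grants, masking policies, audit logs) and not in application code, but the idea is the same: every read is recorded, and what you see depends on who you are.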
Who is a data engineer? Anyone who deals with data, or who does any subset of the work mentioned above in a way that results in better data management, is a Data Engineer. The data doesn’t have to be “Big” for someone to be called a Data Engineer, and there doesn't have to be HDFS or Spark in the picture. DE has become a very broad term that encompasses a variety of tasks and sub-roles. My personal opinion is that someone who only queries data falls under the umbrella of Analyst and not DE. (Again, you can disagree.)
The world has moved from Cloudera/Hortonworks to cloud-based offerings, mainly because of managed services, full-time support, containerization technologies that cater to millions of use cases, granular controls, and cost savings. Data engineering is therefore ever-evolving: DEs are expected to have expertise in a long list of tools, cloud offerings, and technologies, and you cannot learn them all. But distributed processing sits at the core of everything. If you really understand how data is split, how storage and compute are distributed, and how it all fits together, the gap between open jobs and quality engineers becomes very thin.
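The "how data is split" idea can be shown without any cluster at all. The sketch below is a single-process toy, not how Spark or any particular engine is implemented: records are hash-partitioned by key, each partition is aggregated on its own (as a separate worker would do), and the partial results are merged. The function names and sample records are made up for illustration.

```python
import zlib
from collections import Counter

def partition(records, num_partitions):
    """Assign each (key, value) record to a partition by hashing its key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        parts[index].append((key, value))
    return parts

def local_aggregate(part):
    """What one worker does: sum values per key within its own partition."""
    totals = Counter()
    for key, value in part:
        totals[key] += value
    return totals

def merge(partials):
    """Combine the per-partition results into the final answer."""
    result = Counter()
    for partial in partials:
        result.update(partial)  # Counter.update adds counts together
    return dict(result)

records = [("alice", 3), ("bob", 1), ("alice", 2), ("carol", 5), ("bob", 4)]
parts = partition(records, num_partitions=3)
result = merge(local_aggregate(p) for p in parts)
print(result)
```

Because all records with the same key land in the same partition, each worker can finish its keys without talking to the others. That property, and what happens when it does not hold (shuffles, skewed keys), is most of what you need to reason about in any distributed engine.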
There are new terms floating around, “Data Platform” and “Data Platform Engineer”. More on those in the next write-up.