A self-discovery into life, liberty, and the pursuit of data.
A short disclaimer for all the “real engineers” who will potentially never give my thoughts or opinions the time of day after I bare my soul and share my personal experiences. I’ve been self-taught my entire career, and I’m not afraid to learn. Please don’t cancel me forever as I admit to the interweb that at this very moment, Python isn’t one of my top ten special skills.
I was a data engineer for three years and SQL was my everyday programming language. Let me answer some of your next questions. My official title was indeed Data Engineer, which I might add was a significant improvement in accurately representing my role over my first title of Software Developer & Integrator. I have lived through many of the same growing pains that are synonymous with any data engineering rite of passage. I spent most of my time building and maintaining near-real-time, micro-batch, and batch data pipelines, transforming data from multiple sources, and ensuring the data quality of those pipelines for the business. And I did most of this work using only SQL. I never learned Scala at all. I can google my way through any basic Python needs I have (I can crush a FizzBuzz like nobody’s business), but I have to admit my deep dark confession: my Python use is mostly centered around creating virtual environments for my dbt projects, installing one-time-use packages I need for some specific instance and then never again, or getting stuck on Jinja for the sake of an obscure dbt macro.
Am I a fraud? Does my lack of knowledge of object-oriented programming or the fact that I have no Python code deployed in production mean that I don’t qualify as a data engineer? Am I excommunicated from my profession because I don’t have an utterly repugnant reaction if I have to navigate around using the UI instead of the CLI? My elite impostor syndrome and deepest insecurities would argue that my time in the trenches - where I did substantial work - means nothing. Because I am self-taught and learned entirely on the job, without the substantial background of a traditional software engineer, my experience is invalid. I have spent the last couple of weeks profoundly reflecting on all these questions after some recent parallel conversations with coworkers, friends, and real people out there doing the things. Those conversations led me to the very scary conclusion that whether a data engineering job can be accomplished entirely with SQL is actually decided case by case, for each individual role. However, regardless of your opinion on the matter, you cannot objectively deny that data culture is capitalizing on the mentality that SQL is King. Everyone’s teaching SQL tips and tricks because, even if the hardcore engineers don’t use SQL, all other data professionals do. And having that somewhat universal language to communicate between roles is priceless. As my friend and coworker pointed out, one of the most heavily used DataFrame functions he sees in the field is .sql(), which lets the Python folks type out - you guessed it - SQL.
The truth is that I very much classify my experience as data engineering. I spent hours upon hours cleaning and cleansing data, building and maintaining new data pipelines, fixing legacy code, running CI/CD pipelines, ensuring data quality, securing our data, and so much more. And after the start of this identity crisis I found article after article validating my experience. But here’s the problem with data in general. It’s all a spectrum. I also think you could have called me a Data Product Owner, or by my more affectionate nickname, “girl we ping on Slack when the IA jobs fail”. But the truth is, there were situations where my direct teammates were doing very different things. Everyone has distinct roles and responsibilities because each job is different from the next. Even large organizations can find themselves tangled up in multiple completely contrasting tech stacks, resulting in data engineers developing varying areas of expertise. Data engineers become fingerprints. Each and every one of them holds special skills developed from facing the obscure challenges and needs unique to their organization at that specific moment in time.
There are two contributing problems:
Everyone defines data engineering differently
There are literally too many things
As roles and responsibilities blend and blur, where is the line between infrastructure engineer and data engineer? Between data engineer and analytics engineer? And between an analytics engineer and BI analyst? I’ve actually never truly understood the role of a data architect, but you have to throw that one in too. Don’t forget the treasured data scientists who are critical to our entire operation, especially with the rise of Generative AI, and before you know it, you’ve created a chaotic radar plot where roles and responsibilities intersect and overlap no matter the title. It becomes clearer why we have all gravitated toward a sole lingua franca for our data initiatives where possible.
Let’s add the second layer of complication and address the complex ecosystem of tooling that we all agree is impossible to learn in full. If we shoot for the stars and pretend each of the roles above can master 5 of the associated technologies and tools, and we take a conservative guess that there are around 250 tools to choose from, then according to the good old combinations calculator we have far too many: 7,817,031,300 different five-tool combinations.
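For anyone who wants to check the combinations calculator, here is a minimal Python sketch of the same arithmetic. It assumes the two numbers above (250 tools, 5 mastered per person) and counts unordered sets of tools; the constant and function names are made up for illustration.

```python
from math import comb

# Both figures are the rough guesses from the paragraph above, not measured data.
TOOL_COUNT = 250      # conservative guess at the number of tools out there
TOOLS_MASTERED = 5    # tools one person is assumed to master


def tool_combinations(total_tools: int, tools_mastered: int) -> int:
    """Count the unordered sets of `tools_mastered` tools drawn from `total_tools`."""
    return comb(total_tools, tools_mastered)


if __name__ == "__main__":
    # Prints 7,817,031,300
    print(f"{tool_combinations(TOOL_COUNT, TOOLS_MASTERED):,}")
```

Order doesn’t matter here (knowing dbt and Trino is the same as knowing Trino and dbt), which is why this is a combination rather than a permutation.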
As a curious thought exercise, let’s guess how many of these tools are UI-based or can be categorized as a SQL derivative. I’d bet a high percentage. And I think this contributes to the notion that you can get decently far in your data engineering journey without having to dig toward something deeper.
I’m willing to call a spade a spade. The rise of the cloud has greatly aided and abetted data engineering’s shift left and allowed data engineers like myself to even exist. Since the inception of Redshift, the rise of dbt, and the plethora of new data technologies born out of the chaos that is the modern data stack, custom-written Java UDFs are no longer a top five choice for data transformations, and some of the highly technical software engineering principles assumed in the past are no longer necessary. I’ve often wondered what life looks like for a data engineer at a 2021 start-up that manages its data infrastructure purely in the cloud. I’m assuming their job is grueling, and I would be a fool not to bet they spend more than their fair share of time doing all the fun tasks of data engineering - including explaining the origin of specific data values to their peers for the 8th time that month. But I’m also inclined to believe that they may be data person #1, #2, or #3, and they need help managing their data infrastructure because they can’t take a vacation, sit in meetings, or get anything done otherwise. My hypothesis is that the rise of the modern data stack contributes to a data climate that, quite frankly, allows data experts to develop - one where the insights and revelations become priority number one. It’s why we see so many self-taught data people transition from data analyst to analytics engineer or data engineer: they have learned the hard skill of thinking critically about the data. You can lean on SQL as your backbone and still address the needs of your organization while outsourcing the grueling hours on infrastructure. If this new paradigm means I can do all the work today without having to take a detour to master Terraform, or try (and probably fail) to configure a Kubernetes cluster myself, then I’m pro this.
And if that means that my official title should pivot toward glorified data analyst because I don’t have to or want to deal with some obscure, crappy infrastructure error that comes with the complicated data life, then sign me up.
This next generation of data people will look starkly different from the ones today. With the rise of Copilot, anything is possible. I can see a world where I start using Python every day. But I also see a world where I’m able to lean even more into my SQL comfort zone. With the rise of table formats, data lakehouses, and query engines (quick plug), I dream of an outcome where I can develop SQL-based systems that enable my data consumers and end users to serve themselves. And while I do understand that some take pride in avoiding low-code or no-code tooling (and in some instances, the cost of it), I also think it’s a tradeoff that enables data people to be problem solvers, no matter their skill level. I like data engineering because I love diving into the mess and finding an answer, and I like that sometimes the data helps me make better decisions. And I like that there are avenues today that make it easier than ever for me to do so. But I also can’t deny the history of technologies like Hive, DataStage, Informatica, NiFi, Trino & more, which have for years been trying to bring SQL to the people. That makes me think I can’t be the only one solving data problems via SQL.