🎯 How to choose Data Engineering tools❗
A simple framework with 13 criteria… 💥💥💥
Data Engineering is vast and rapidly evolving, and choosing tools from the plethora of options can be confusing.
Data engineers need not be experts in every available tool and database; rather, they must know how to choose the right technology for the task at hand and anticipate the needs of data scientists and analysts.
This article is therefore designed to serve as a guide for selecting the right tools for a sound data engineering architecture.
Before selecting any data engineering tools, data engineers should start by understanding the business needs and then design the data architecture (also called data engineering architecture).
Let’s look at a typical example of a data engineering architecture: it has data sources, data storage, real-time data ingestion, batch or stream processing, an analytical data store, and a BI layer.
Architecture design is a must before selecting any tool. Think of the architecture as a strategic guide that answers the 3Ws (‘what’, ‘why’, and ‘when’) and of tools as implementors of that architecture. The authors of Fundamentals of Data Engineering, Joe Reis and Matt Housley, coined this approach ‘architecture first and technology second’.
However, I feel it should be called:
Business utility first, Architecture second, and Technology third.
How to choose Data Engineering tools?
Technologies should be selected so that they add value to the business or data product while being future-oriented and cost-efficient.
To make the selection easier, we will discuss the following 13 criteria —
- Business needs and impact
- It is all about the team
- Velocity to deliver securely and reliably
- Interoperability and Flexibility
- $$$$$
- Should you build or buy?
- Open Source and Beyond
- User Interface (UI) and Extent of Coding
- Data Sources and Cloud
- Future-oriented solutions
- Monolith vs. Modular vs. Modular Monolith
- Serverless vs. Servers
- Data Management and DataOps
Let’s deep-dive into each of them…
1. Business needs and impact
It is essential that the selection of tools is done through a business-utility lens. Data engineers should ask:
How does XYZ technology help the business achieve more value and competitive advantage?
Could there be a better alternative to achieve the business needs?
Have you received a green light from the key decision makers?
In my experience, it is always better to discuss the pros and cons with the CEO, CTO, or another decision maker before proposing a solution or finalising the technologies.
2. It is all about the team
The tech team plays a critical role in designing, deploying, and maintaining the technological solution, so the solution should be chosen with the team’s skills and size in mind.
Understand the team size, skills, and capabilities.
For a small team, a low-code tool or a managed service could be better if the business is not looking to expand the team. Alternatively, the solution can also be outsourced. The idea here is not to put a heavy load on an already small team, and yes, attrition is a giant bug.
Ask your team: how would they feel about learning XYZ technology? Who would partner to conduct a proof-of-concept?
For cash-constrained startups in the early stages, open-source technologies could play a balancing role.
3. Velocity to deliver securely and reliably
It goes without saying that speed of delivery is quite important. If a perfect tool is not going to deliver fast, early, and securely, then focus on a “good-enough” tool that can serve the business needs early and reliably.
Ask yourself — what tools can help in delivering value early and often?
One quick way to assess this is a proof-of-concept; then choose the tools that help you move quickly, reliably, safely, and securely.
4. Interoperability and flexibility
Business needs evolve, and the data engineering architecture can change with them. Flexibility and interoperability are the norms to stand by in a world covered with data.
In my experience, every tool must connect to others in the ecosystem; nothing works in a silo. Therefore, ask yourself: How interoperable is the tool? How flexible is it? Is that flexibility good or bad?
On any given day, preference should be given to technologies that can connect, exchange information, and interact with others easily, reliably, and securely.¹
5. $$$$$
Technologies and tools can be a significant cost centre for the business. In fact, practitioners have defined a whole new discipline around this, called FinOps, a relatively new practice in data engineering.
It is always a good idea to opt for cost-efficient solutions (or cost-optimised practices) that serve the business, maximise the ROI, and also offer the flexibility to move to the next, new technology.
To evaluate the cost of technologies, compare the following key metrics¹ when making your case —
- Total Cost of Ownership (TCO)
- CapEx (fixed initial cost) vs. OpEx (pay-as-you-go)
- Total Opportunity Cost of Ownership (TOCO)
Opportunity Cost = Return on the option NOT chosen − Return on the chosen option
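To make these metrics concrete, here is a minimal back-of-the-envelope sketch in Python; every figure in it is a hypothetical placeholder invented for illustration, not a real price.

```python
# Hypothetical TCO comparison: CapEx (buy upfront) vs. OpEx (pay-as-you-go).
# All numbers are made up purely for illustration.
YEARS = 3

# Option A: CapEx-heavy, e.g. a self-hosted cluster bought upfront.
capex_initial = 120_000      # fixed initial cost (hardware, licences)
capex_yearly_ops = 30_000    # ongoing staff/maintenance per year

# Option B: OpEx-only, e.g. a managed cloud service subscription.
opex_yearly_subscription = 65_000

tco_a = capex_initial + capex_yearly_ops * YEARS
tco_b = opex_yearly_subscription * YEARS
print(f"TCO option A (CapEx): ${tco_a:,}")
print(f"TCO option B (OpEx):  ${tco_b:,}")

# If the business return is the same either way, the opportunity cost of
# choosing A reduces to how much cheaper the forgone option B would have been.
opportunity_cost = tco_a - tco_b
print(f"Opportunity cost of choosing A over B: ${opportunity_cost:,}")
```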
6. Should you build or buy?
When to build? If security or competitive advantage is a concern, one may opt to build. Ask yourself: Do the tools available in the market work for the company’s use case? Does the company want to focus on R&D for new opportunities? Is it less costly to build?
When to buy? Is it less expensive to buy? If there are resource constraints, a lack of in-house expertise, no compromise to the core business, and the tool will meet the business expectations as needed, one can consider simply buying.
7. Open source and beyond
There is no simple answer to the debate of open-source vs. proprietary software solutions. It depends on many factors, listed below, and of course on the preferences of the team and decision makers.
- Cost and Security — which is more secure, the open-source or the proprietary solution? What about cost?
- Transparency — which one is more transparent about potential vulnerabilities? Open source software shares its source code, making it quite easy to understand and modify as required.
- Flexibility and Interoperability — which one offers more flexibility and interoperability?
- Change management — where do we have more control to change the code or features? Usually, open source is the answer.
How to compare two open-source solutions? Pay attention to market share, interoperability, troubleshooting, community support, documentation, maturity, the developer team, the roadmap, and the pricing structure; a simple scoring sketch follows below.
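One lightweight way to structure such a comparison is a weighted scoring matrix. Below is a minimal sketch; the tool names, weights, and scores are hypothetical placeholders, so substitute your own evaluation.

```python
# Hypothetical weighted scoring matrix for comparing two open-source tools.
# Weights sum to 1.0; scores are 1-5. All values are placeholders.
criteria_weights = {
    "market share": 0.10,
    "interoperability": 0.20,
    "community support": 0.15,
    "documentation": 0.15,
    "maturity": 0.15,
    "roadmap": 0.10,
    "pricing structure": 0.15,
}

scores = {
    "Tool A": {"market share": 4, "interoperability": 5, "community support": 4,
               "documentation": 3, "maturity": 4, "roadmap": 3, "pricing structure": 5},
    "Tool B": {"market share": 3, "interoperability": 4, "community support": 5,
               "documentation": 5, "maturity": 3, "roadmap": 4, "pricing structure": 4},
}

for tool, tool_scores in scores.items():
    total = sum(criteria_weights[c] * s for c, s in tool_scores.items())
    print(f"{tool}: weighted score = {total:.2f} / 5")
```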
8. User Interface (UI) and Extent of Coding
The user interface (web, CLI, GUI) can play a decent role in data management and in monitoring the data flow. Some data pipeline tools offer visual drag-and-drop or spreadsheet-like user interfaces; others rely on scripting or coding. Take the data pipeline tool Fivetran as an example: it offers a web UI (no coding at all) for users to connect to data sources and load data into a cloud data warehouse. On the other hand, the Prefect ETL tool requires the ETL to be written in Python. Prefect also has a web UI to monitor and test the deployed flows, but there is no escape from coding the ETL itself.
Data engineers should pay attention to the UI and the extent of coding required, and also evaluate low-code or no-code options.
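To illustrate the code-first end of the spectrum, here is a minimal sketch of an extract-transform-load flow written with Prefect 2.x-style @flow/@task decorators; the task bodies are hypothetical placeholders rather than a real pipeline.

```python
# A minimal Prefect 2.x-style flow: the ETL logic itself is plain Python,
# while the Prefect web UI is used to monitor and test the deployed flow.
from prefect import flow, task

@task(retries=2)
def extract() -> list[dict]:
    # Placeholder: in practice this would call an API or read from a source.
    return [{"id": 1, "amount": 42.0}, {"id": 2, "amount": 7.5}]

@task
def transform(rows: list[dict]) -> list[dict]:
    # Placeholder transformation: round every amount.
    return [{**row, "amount": round(row["amount"])} for row in rows]

@task
def load(rows: list[dict]) -> None:
    # Placeholder: in practice this would write to a warehouse table.
    print(f"Loaded {len(rows)} rows")

@flow(name="minimal-etl")
def etl_flow() -> None:
    load(transform(extract()))

if __name__ == "__main__":
    etl_flow()
```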
9. Data Sources and Cloud
Data sources could be SaaS apps, REST APIs, databases, files, or events. Data formats may also differ: JSON, Parquet, XML, log files, HDFS, etc. Data engineers need to choose tools compatible with their data sources and file formats. If a tool is not compatible with a specific data format, cloud functions (AWS Lambda or Google Cloud Functions) can be used to convert the data into a compatible format. For example, Fivetran is not compatible with the XML format yet, so one can use AWS Lambda to convert XML to JSON and continue using Fivetran; a sketch of this workaround follows below.
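As a sketch of that XML-to-JSON workaround (assuming an S3-triggered Lambda and the third-party xmltodict package; the bucket names are hypothetical), the handler could look like this:

```python
# Minimal AWS Lambda sketch: convert an uploaded XML file to JSON so a
# downstream tool (e.g. Fivetran) can ingest it. Bucket names are hypothetical
# and error handling is omitted for brevity.
import json

import boto3
import xmltodict  # third-party package; must be bundled with the Lambda

s3 = boto3.client("s3")
TARGET_BUCKET = "my-json-landing-bucket"  # hypothetical destination bucket

def handler(event, context):
    # Triggered by an S3 "object created" event on the XML source bucket.
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    xml_body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    parsed = xmltodict.parse(xml_body)  # XML -> nested Python dicts

    json_key = key.rsplit(".", 1)[0] + ".json"
    s3.put_object(
        Bucket=TARGET_BUCKET,
        Key=json_key,
        Body=json.dumps(parsed),
        ContentType="application/json",
    )
    return {"converted": json_key}
```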
On the cloud front, which location should be selected: on-premises infrastructure, cloud, or hybrid? Do you understand the cloud provider’s pricing structure? What would it cost to extract, analyze, or migrate data from the cloud? How could you optimize the cost of cloud usage? Should priority be given to a multi-cloud service (for example, Snowflake)?
10. Future-oriented solutions
It is always a good idea to focus on the present problem with an eye on the future. One way to do that is to design the architecture based on immutable versus mutable technologies.
Immutable technologies usually serve as the base of the architecture, and it would be painful to change them in the future. For example, the choice of programming language (Python, Scala, Go, etc.) or of a cloud provider (AWS, GCP, Azure, or multi-cloud) could be close to immutable.
Will the immutable technologies still be there in the future? Is there a clear roadmap? Could you use some transitory technologies and change them in the future?
11. Monolith vs. Modular vs. Modular Monolith
The Shopify Engineering blog talks beautifully about monoliths and modular monoliths —
- Monolithic data systems are tightly coupled, under one roof, and self-contained. Monoliths are brittle, and multitenancy in a monolithic system can also be a significant problem. While monoliths are attractive due to ease of understanding and reduced complexity, this comes at a high cost: the potential loss of flexibility, opportunity cost, and high-friction development cycles.¹
- Modularity is an important concept: it gives you the ability to swap out tools as technology changes (see the sketch after this list).
- The distributed monolith pattern is a distributed architecture that still suffers from many of the limitations of monolithic architecture.
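As a tiny illustration of modularity, the sketch below codes the pipeline against a small interface instead of a concrete tool, so the warehouse behind it can be swapped out later; the class and method names are hypothetical.

```python
# Modularity sketch: code against a small interface, not a concrete tool,
# so the warehouse can be swapped out as technology changes.
from typing import Protocol

class Warehouse(Protocol):
    def write(self, table: str, rows: list[dict]) -> None: ...

class PostgresWarehouse:
    def write(self, table: str, rows: list[dict]) -> None:
        print(f"INSERT {len(rows)} rows into Postgres table {table}")

class SnowflakeWarehouse:
    def write(self, table: str, rows: list[dict]) -> None:
        print(f"COPY {len(rows)} rows into Snowflake table {table}")

def run_pipeline(warehouse: Warehouse) -> None:
    rows = [{"id": 1}, {"id": 2}]
    warehouse.write("events", rows)  # the pipeline never names a concrete tool

run_pipeline(PostgresWarehouse())   # today
run_pipeline(SnowflakeWarehouse())  # tomorrow, with no pipeline changes
```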
12. Serverless vs. Servers
A server (or server computing) is nothing but the backend server infrastructure needed for software or data products.
Let’s understand serverless now. Serverless (also called serverless computing or serverless architecture) is a model in which a cloud provider manages the servers, dynamically allotting compute and storage resources as needed to execute the code. It usually offers greater scalability, more flexibility, and a quicker time to release, all at a reduced cost. With serverless, developers don’t worry about purchasing, provisioning, or managing backend servers. The cloud provider takes care of virtual machine and container management, hardware allocation, and specific tasks built into the code, such as multithreading.
Backend-as-a-service and functions-as-a-service (for example, AWS Lambda) are two popular serverless architecture options.
What would be the reasons to prefer servers over serverless? You should prefer servers if you are looking for full access and control, security, customization, and no dependency on an internet connection. Depending on the usage, servers can also be more cost-effective than serverless.
Cost-effectiveness, scalability, and no maintenance burden on the tech team are a few reasons to choose serverless over servers; a back-of-the-envelope cost comparison follows below. Furthermore, serverless environments typically run on containers behind the scenes, and therefore a choice of container manager (such as Kubernetes) could also be required.
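To make the serverless-vs-server cost trade-off concrete, here is a minimal break-even sketch; all prices are invented placeholders, not actual cloud rates.

```python
# Back-of-the-envelope cost comparison: serverless (pay per invocation)
# vs. an always-on server. All prices are hypothetical placeholders.
price_per_invocation = 0.0000002 + 0.000017  # request + compute (made up)
server_monthly_cost = 75.0                   # always-on VM (made up)

for monthly_invocations in (1e5, 1e6, 1e7, 1e8):
    serverless_cost = monthly_invocations * price_per_invocation
    cheaper = "serverless" if serverless_cost < server_monthly_cost else "server"
    print(f"{int(monthly_invocations):>11,} invocations/month: "
          f"serverless ${serverless_cost:,.2f} vs server ${server_monthly_cost:,.2f} "
          f"-> {cheaper}")
```

At low traffic serverless wins easily; past the break-even point, the always-on server becomes the cheaper option, which is why usage patterns matter so much here.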
13. Data Management and DataOps
As defined by Gartner, “Data management (DM) consists of the practices, architectural techniques, and tools for achieving consistent access to and delivery of data across the spectrum of data subject areas and data structure types in the enterprise, to meet the data consumption requirements of all applications and business processes.”
DataOps is a set of cultural habits: it is DevOps applied to data, mostly for the data-as-product.
Now that we have understood the definitions, let’s look at some sample questions you should ask in this regard —
- Security — How is data protected against breaches, both from the outside and within? Which tools offer security controls, user access control, etc.?
- Regulations — How do you stay compliant with GDPR, CCPA, and other data privacy regulations? Does the tool at hand allow you to host data in compliance with them?
- Data Quality — How do you ensure data quality, and that users are viewing the correct data? (A minimal sketch of automated quality checks follows this list.)
- Data Catalogue — Which tools will support data discoverability and data cataloguing?
- Automation — Which technology helps more with automating integration or data delivery?
- Observability and Monitoring — When evaluating a new technology, how much control do you have over the deployment of new code, and how will you be alerted if there’s a problem?
- Alerts and Incident reports — How will you respond when there’s a problem? How will you handle issues? What would be the incident response?
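On the data quality question above, here is a minimal sketch of automated checks in plain Python; the column names and example rows are hypothetical.

```python
# Minimal data quality checks in plain Python; columns and rows are hypothetical.
rows = [
    {"order_id": 1, "amount": 19.99, "country": "DE"},
    {"order_id": 2, "amount": 5.00, "country": "US"},
    {"order_id": 3, "amount": None, "country": "US"},  # a quality problem
]

def check_not_null(rows, column):
    bad = [r for r in rows if r.get(column) is None]
    return f"{column} NOT NULL", len(bad) == 0, len(bad)

def check_unique(rows, column):
    values = [r[column] for r in rows]
    duplicates = len(values) - len(set(values))
    return f"{column} UNIQUE", duplicates == 0, duplicates

for name, passed, violations in (
    check_not_null(rows, "amount"),
    check_unique(rows, "order_id"),
):
    status = "PASS" if passed else "FAIL"
    print(f"{status}: {name} ({violations} violation(s))")
```

In practice, checks like these would run inside the pipeline and feed the alerting and incident-response questions above.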