Your go-to guide for modern data infrastructure with open-source tooling. Image courtesy of Sigmund on Unsplash.

Your Blueprint to Building Reliable Data Infrastructure With Open-Source

A data engineering blueprint for data infrastructure that outlines the open-source options at each layer of the modern data stack.

Abhi Vaidyanatha

The paradox of choice is real when it comes to setting up your data infrastructure. The modern data stack has more options than ever before, with a wide array of modular tools that you can mix and match to suit your business needs. That freedom can be daunting.

Personally, I like to compare assembling a modern data stack to a buffet: with so many options on offer, more often than not you load your plate up with more than you can handle.

Recently, we have noticed a trend among data teams choosing open-source tools for their data stacks, whether they are building new infrastructure or re-evaluating what they already have.

With the current state of the market, it makes sense to continuously evaluate your stack to ensure that you are keeping costs down. Companies are beginning to opt for open-source toolkits due to their low cost, high flexibility, and helpful developer communities.

Martin Woodward, VP of developer relations at GitHub, perfectly summed up why companies are opting for open-source. “There are a number of benefits to open source, from providing an environment to work fast and flexibly, to enabling collaboration from developers around the world,” said Woodward. “No single person or team can make the progress that we can all make together.”

Every organization's approach to building a data stack will vary based on a variety of factors. With that in mind, we put together a basic blueprint that outlines the options at each layer of the data stack and the open-source tools that are available to use.

What is data infrastructure?

Data infrastructure refers to the set of tools that guide data from extraction to analysis. Photo by CHUTTERSNAP / Unsplash

Before diving into the five foundational layers of the modern data stack, it is important to understand what we mean when we refer to data infrastructure. When we talk about data infrastructure, we are referring to the set of tools that guide data from being extracted to being suitable for an end user to analyze or leverage. Typically, companies follow either an extract, transform, load (ETL) or extract, load, transform (ELT) paradigm for their data.

First, during extraction, data gets collected from its sources. Then it gets loaded into a central repository and transformed into a consumable form, or vice versa. Solid data infrastructure is the foundation for BI and monitoring—and essential for advanced real-time analytics and machine learning automation.
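To make the ELT flow above concrete, here is a toy pipeline using SQLite as a stand-in for a warehouse; the source rows, table names, and columns are invented for illustration, not taken from any real system.

```python
import sqlite3

# Extract: pretend these rows came from an API or production database.
raw_orders = [
    {"id": 1, "amount_cents": 1250, "status": "paid"},
    {"id": 2, "amount_cents": 830, "status": "refunded"},
    {"id": 3, "amount_cents": 2400, "status": "paid"},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)")

# Load: land the data untransformed, as an ELT pipeline would.
conn.executemany(
    "INSERT INTO raw_orders VALUES (:id, :amount_cents, :status)", raw_orders
)

# Transform: derive an analytics-friendly table inside the "warehouse".
conn.execute(
    """
    CREATE TABLE paid_orders AS
    SELECT id, amount_cents / 100.0 AS amount_dollars
    FROM raw_orders
    WHERE status = 'paid'
    """
)

total = conn.execute("SELECT SUM(amount_dollars) FROM paid_orders").fetchone()[0]
print(total)  # 36.5
```

An ETL pipeline would simply run the cleanup step before the insert instead of after it; the tools in the rest of this guide industrialize each of these three stages.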

But building data infrastructure is not an easy task. For starters, every piece of your data infrastructure needs to be secure, efficient, and, most importantly, interoperable with the rest of your stack. If you choose open-source tooling, you are in luck: the ecosystem offers mature, interoperable options at each layer of your data platform.

The five layers of modern data infrastructure

Data infrastructure is multilayered, with each layer corresponding to a different piece of your data stack. Every foundational data stack should have at a minimum these five core layers:

  • Data ingestion
  • Data orchestration
  • Data warehouse/analytics engine
  • Data transformation
  • Data analysis/business intelligence

Let’s dig deeper into each layer and compare some of the most popular options to choose from when creating a modern data stack.

Note: While we are big advocates of using open-source tooling at each layer, we do realize that there are other SaaS options available to choose from. We included those in our guide to provide you with a full analysis of each layer.

1. Data ingestion

The start of every data stack is ingesting data that downstream systems can consume. When picking a data ingestion tool, it is best to choose one that reliably funnels accurate data from various sources into a centralized repository on schedule.

Two of the most common options are to either choose Airbyte or Fivetran.

Airbyte vs. Fivetran

Fivetran and Airbyte both emphasize security and speed in extracting data and loading it to a destination. Their core functionalities are very similar in that they offer pre-built data connectors enabling access to hundreds of potential sources. Unlike Fivetran, Airbyte is open-source, which makes it more customizable.

Airbyte users can implement custom data connectors in under an hour or easily modify existing connectors to suit a company’s needs. While both tools offer built-in scheduling for how often they ingest data, Fivetran is limited to specific time increments, whereas Airbyte has no restrictions.

In terms of support options, Airbyte has large Slack and GitHub communities for troubleshooting. Fivetran is better suited to larger enterprise customers, with whom it has a longer track record than Airbyte.

As far as cost, since Airbyte is open-source, it’s possible to self-host it and avoid paying for their cloud offering. If self-hosting isn’t of interest, both solutions have cloud offerings that follow consumption-based pricing schemes.

2. Data orchestration

Since data infrastructure spans so many layers, orchestration is essential to automate workflows and ensure that each layer of the stack works together.

For instance, an orchestration tool might communicate with Airbyte or Fivetran to periodically ingest data, after which the orchestrator kicks off data transformation, a step we’ll address in a later section. Dagster and Airflow are popular open-source solutions that take different approaches to this problem.

Dagster vs. Airflow

Apache Airflow is the original workflow orchestration tool and has been the industry standard for a long time. Airflow takes a task-centric approach to orchestration. Using just Python, a data engineer can define a directed acyclic graph (DAG) of tasks that need to execute in a particular order. Airflow takes care of the rest and displays statuses in a web UI that’s easy to monitor. Airflow is widely adopted and has lots of support available online in case you encounter issues.
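The task-centric idea can be sketched in plain Python. This is not Airflow’s actual API, just an illustration of the model: you declare tasks and their upstream dependencies, and the orchestrator derives a valid execution order. The task names are hypothetical.

```python
from graphlib import TopologicalSorter

def extract():
    return "raw rows"

def transform():
    return "clean rows"

def load():
    return "loaded"

tasks = {"extract": extract, "transform": transform, "load": load}

# Each entry reads "task: {its upstream dependencies}".
dag = {"transform": {"extract"}, "load": {"transform"}}

# The orchestrator's job: run every task only after all of its upstreams.
order = list(TopologicalSorter(dag).static_order())
results = [tasks[name]() for name in order]
print(order)  # ['extract', 'transform', 'load']
```

Airflow layers scheduling, retries, and a monitoring UI on top of exactly this kind of dependency graph.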

Dagster emerged relatively recently, with its first public release in 2019, and has a completely different orchestration model. Unlike Airflow’s imperative, task-centric philosophy, Dagster is declarative and data-centric.

Dagster’s major benefit is that it doesn’t naively execute tasks, but rather considers the state of your data to determine what jobs need to run. Data teams also praise Dagster for its emphasis on developer friendliness. It’s designed with features like local development, unit testing, and continuous integration in mind.
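A toy illustration of the data-centric idea (again, not Dagster’s API): instead of always running every task, compare the last-updated times of upstream and downstream assets and only recompute what is stale. The asset names and timestamps below are made up.

```python
# Fake "last updated" timestamps for three data assets.
last_updated = {
    "raw_events": 100,     # upstream asset, refreshed at t=100
    "daily_summary": 90,   # downstream, built at t=90 (now stale)
    "weekly_report": 105,  # downstream, built at t=105 (still fresh)
}

# Which assets depend on which upstreams.
deps = {
    "daily_summary": ["raw_events"],
    "weekly_report": ["raw_events"],
}

def stale_assets(last_updated, deps):
    """Return assets whose upstream dependencies are newer than they are."""
    return [
        asset
        for asset, upstreams in deps.items()
        if any(last_updated[u] > last_updated[asset] for u in upstreams)
    ]

print(stale_assets(last_updated, deps))  # ['daily_summary']
```

A task-centric scheduler would rerun both downstream jobs on every tick; a data-centric one only rebuilds `daily_summary`, because `weekly_report` is already newer than its input.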

3. Data warehouse & analytics engine

A data warehouse serves as the central node in most companies’ data infrastructure. This is the single source of truth for all data and typically the destination for data that tools like Airbyte and Fivetran ingest.

Alternatively, some companies ditch the data warehouse altogether and query data as needed from disparate sources with an analytics engine.

Let’s quickly compare some of the most popular data warehouse and analytics engine tools in the market and see how they match up.

Snowflake

Snowflake is a cloud data platform that supports multi-cloud environments built on top of Google Cloud, Microsoft Azure, and Amazon Web Services. Snowflake uses virtual compute instances for its compute needs and a storage service for persistent data storage.

Snowflake differs from other data warehouses and query engines with its ability to scale swiftly. Its decoupled storage and compute architecture easily supports resizing clusters without downtime. It also supports auto-scaling horizontally for higher query concurrency during peak business hours.

Databricks

Developed by the creators of Apache Spark, Databricks is a cloud-based data processing and analysis platform that simplifies working with large amounts of data.

The company also pioneered the Data Lakehouse, which combines the benefits of a data warehouse and a data lake. Users can leverage structured and unstructured data while taking advantage of the low storage costs that typically come with data lakes.

A great feature of Databricks is the ability to autoscale clusters based on utilization. What separates Databricks from other query engines is its ability to leverage the Spark framework for processing large amounts of data.

BigQuery

BigQuery is a fully managed enterprise data warehouse that helps engineers manage and analyze data. Its serverless architecture lets you use SQL queries to answer your organization's data questions without the need to manage infrastructure.

Because it integrates directly with the Google Cloud Platform (GCP), BigQuery is convenient for ad-hoc analytics and ML use cases on GCP. Like the other data warehouses, BigQuery scales to handle large data volumes and assigns more compute resources when needed.

Redshift

Redshift is a fully managed, petabyte-scale data warehouse service that simplifies analyzing all your data using your existing business intelligence tools. Unlike other cloud data warehouses, Redshift runs as an isolated tenant per customer, inside your own VPC.

What separates Redshift from other cloud data warehouses is its integration with the AWS ecosystem. Redshift also provides a result cache for accelerating repetitive query workloads.

ClickHouse

ClickHouse is a data warehouse that offers both a free, open-source distribution and a cloud version of its software. It is compatible with a large ecosystem of other data tools, positioning itself as the nexus of your data stack. ClickHouse also achieves high performance through vectorized query execution and parallel processing.

Data warehouses need to be fault-tolerant to mitigate the risk of data loss, so ClickHouse supports replicating data across data centers. As your company and data grow, ClickHouse scales with your computing cluster, handling hundreds of terabytes of data per node.

Trino

Analytics engines are an alternative to a data warehouse, designed to avoid the complexity and cost of extracting and loading data into a centralized store. Trino is a distributed query engine that enables you to write SQL that quickly queries large amounts of data.

Known for its federated queries, Trino enables analytics across multiple data stores as if they were one. The advantage of this method is that it decouples data storage from computation. Analytics engines have to run queries across numerous sources, meaning they’re only as fast as their slowest source, so Trino is highly optimized for speed to compensate.
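Trino itself needs a running cluster, so as a stand-in this sketch uses SQLite’s ATTACH to join “two data stores” in a single SQL query, a toy analogy for the federated queries Trino runs across real sources. The database and table names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("ATTACH ':memory:' AS orders_db")  # a second, separate store

# Store 1: user records (imagine a production Postgres).
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

# Store 2: order records (imagine a data lake or another database).
conn.execute("CREATE TABLE orders_db.orders (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders_db.orders VALUES (?, ?)",
    [(1, 30.0), (1, 12.5), (2, 20.0)],
)

# One query spans both stores, the way Trino federates across sources.
rows = conn.execute(
    """
    SELECT u.name, SUM(o.amount)
    FROM users u JOIN orders_db.orders o ON u.id = o.user_id
    GROUP BY u.name ORDER BY u.name
    """
).fetchall()
print(rows)  # [('Ada', 42.5), ('Grace', 20.0)]
```

The payoff is the same as with Trino: no extract-and-load step, yet analysts can join data that lives in different systems.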

4. Data transformation

Raw data needs to be transformed so that downstream analysis and intelligence tools can efficiently consume it. Transformation includes many operations, such as unnesting fields and adjusting timestamps. dbt is by far the most developed option, and it happens to be open-source.

dbt

With a robust community and low barrier to getting started, dbt fits into many companies’ ELT pipelines as the final step before sending data off to the data visualization layer.

Engineers can write modular transformations in SQL or Python, and dbt provides a bunch of features to ease deployment, including scheduling, logging, and alerting. Like many of the other tools in this guide, dbt offers an open-source and cloud-based product.
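As a toy version of the layered, modular transformations dbt manages (dbt itself adds templating, scheduling, and lineage on top), here is one SQL “model” built on another, using SQLite; the table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_payments (id INTEGER, amount_cents INTEGER, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO raw_payments VALUES (?, ?, ?)",
    [(1, 500, "2024-01-01"), (2, 1500, "2024-01-01"), (3, 700, "2024-01-02")],
)

# Staging model: light cleanup (unit conversion); one concern per model.
conn.execute("""
    CREATE VIEW stg_payments AS
    SELECT id, amount_cents / 100.0 AS amount_dollars, created_at
    FROM raw_payments
""")

# Mart model: business-facing aggregate built on the staging model.
conn.execute("""
    CREATE VIEW daily_revenue AS
    SELECT created_at AS day, SUM(amount_dollars) AS revenue
    FROM stg_payments
    GROUP BY created_at
""")

rows = conn.execute("SELECT day, revenue FROM daily_revenue ORDER BY day").fetchall()
print(rows)  # [('2024-01-01', 20.0), ('2024-01-02', 7.0)]
```

Splitting transformations into small, composable models like this is what makes dbt projects easy to test and reason about.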

5. Data analysis/business intelligence

Ultimately, one of the main reasons to collect data and send it through the earlier steps of ingestion, storage, and transformation is so that data analysis and business intelligence software can draw insights from the data. There are many tools in this space that tackle all kinds of use cases. We’ll focus on some open-source solutions for data visualization.

Superset

Apache Superset is an extremely powerful data exploration and visualization tool. It supports deep customization through visualization plug-ins and can be used either with SQL or through a graphical interface. Superset also supports dashboard creation for interactive views of multiple tables and charts.

Metabase

Metabase is rather similar to Superset, although it’s reputed to have a more intuitive UI at the expense of some options for customization. Metabase still supports SQL, but its primary emphasis is on no-code ways to make sense of data. Metabase also excels at alerts and reporting to ensure that the relevant stakeholders are up to date with business trends.

Lightdash

Lightdash is a great visualization option for companies that heavily rely on dbt. The platform tightly integrates with dbt, to the extent that all the business logic lives as code in dbt, reducing overall complexity. Like the previous tools, Lightdash tries to meet the needs of technical and non-technical users such that an entire organization can benefit from its data infrastructure.

Plural helps build and deploy your data infrastructure

It can be difficult to know where to start when there are so many data infrastructure options and lots of pieces that need to come together for a robust stack.

Typically, the struggle of using open-source technology is managing, deploying, and integrating the tools yourself in your own cloud.

Plural aims to make deploying open-source applications on the cloud a breeze for organizations of all sizes. In under ten minutes, you can deploy a production-ready open-source data infrastructure stack on a Kubernetes cluster with Plural.

To learn more about how Plural works and how we are helping engineering teams across the world deploy open-source applications in a cloud production environment, reach out to our team to schedule a demo.

Ready to effortlessly deploy and operate open-source applications in minutes? Get started with Plural today.

Join us on our Discord channel for questions, discussions, and to meet the rest of the community.


Abhi Vaidyanatha

Head of Community