In today's data-driven world, managing the sheer volume and complexity of data has become a formidable challenge for companies of all sizes. Engineers often face challenging questions like "Which data source is accurate?" or "What does this field mean?"
Data governance tools like DataHub have emerged to help data consumers and producers better understand how their data is represented and connected. DataHub is a modern data catalog that enables end-to-end data discovery, data observability, and data governance.
There are currently a few ways to deploy a DataHub instance:
- AWS Elastic Kubernetes Service (EKS)
- GCP Google Kubernetes Engine (GKE)
- DataHub’s managed service
There are multiple compelling reasons for deploying DataHub on Kubernetes. If you operate in a regulated industry and handle sensitive PII data, it is crucial to ensure that everything remains within your own VPC. The most effective way to achieve this is by self-hosting open-source applications.
This article walks you through setting up a new Kubernetes cluster and deploying DataHub on it with Plural, a free and open-source Kubernetes DevOps platform that lets users deploy Kubernetes clusters with little to no management experience.
What is DataHub used for?
DataHub eases the struggle of tracking various data assets scattered across a company’s data ecosystem. The platform acts as a centralized hub and reliable repository, where all metadata related to a company’s data is stored and is easily accessible.
With DataHub, engineers can search across all layers of their data stack to see where datasets exist and understand the end-to-end journey of the data by tracing lineage across platforms. From there, engineers can proactively identify the impact of breaking changes on downstream dependencies and notify stakeholders of any issues.
Self-hosting DataHub is not an easy task. At its core, DataHub consists of four main components: GMS, MAE Consumer, MCE Consumer, and Frontend. Those main components require you to install the following external dependencies before deploying DataHub:
- Apache Kafka
- Local DB (MySQL, PostgreSQL, MariaDB)
- Search Index (Elasticsearch)
- Graph Index (Supports either Neo4j or Elasticsearch)
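To make the dependency choices above concrete, here is a minimal Python sketch that records which backing services a self-hosted install would need. The component and dependency names come from the lists above; the `DependencyPlan` class and its methods are hypothetical and are not part of DataHub or Plural.

```python
from dataclasses import dataclass
from typing import List, Set

# Names taken from the lists above; the structure itself is illustrative only.
CORE_COMPONENTS = ("GMS", "MAE Consumer", "MCE Consumer", "Frontend")
SUPPORTED_DATABASES = {"mysql", "postgresql", "mariadb"}
SUPPORTED_GRAPH_INDEXES = {"neo4j", "elasticsearch"}


@dataclass(frozen=True)
class DependencyPlan:
    """Hypothetical description of the backing services for a self-hosted install."""

    database: str
    graph_index: str
    message_broker: str = "kafka"        # Apache Kafka is listed as required
    search_index: str = "elasticsearch"  # Elasticsearch is the listed search index

    def validate(self) -> List[str]:
        """Return human-readable problems; an empty list means the plan is consistent."""
        problems = []
        if self.database not in SUPPORTED_DATABASES:
            problems.append(f"unsupported database: {self.database}")
        if self.graph_index not in SUPPORTED_GRAPH_INDEXES:
            problems.append(f"unsupported graph index: {self.graph_index}")
        return problems

    def services(self) -> Set[str]:
        """Distinct services to provision; Elasticsearch can back both indexes."""
        return {self.message_broker, self.database, self.search_index, self.graph_index}
```

For example, `DependencyPlan(database="mysql", graph_index="elasticsearch").services()` yields three distinct services, because Elasticsearch serves as both the search index and the graph index. Choosing Neo4j for the graph index adds a fourth.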
To deploy DataHub on Kubernetes using Plural, you must create an account. To do so, head to the Plural App and follow the on-screen instructions.
Note: If you prefer to work from your local machine instead of the browser, you can download and use the Plural CLI. Before proceeding, make sure you have an AWS, GCP, or Azure cloud account with administrator access, the CLI for your chosen cloud provider correctly installed and configured, and either a GitHub or GitLab account.
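If you go the local CLI route, a quick pre-flight check can catch a missing tool early. The sketch below only tests whether the relevant executables are on your `PATH`; it does not verify that they are configured or authenticated. The executable names (`aws`, `gcloud`, `az`, `plural`) are the usual binary names, and the `missing_tools` helper is hypothetical.

```python
import shutil
from typing import Callable, List, Optional

# Usual executable name for each cloud provider's CLI.
PROVIDER_CLI = {"aws": "aws", "gcp": "gcloud", "azure": "az"}
# The Plural CLI is needed regardless of provider.
COMMON_TOOLS = ["plural"]


def missing_tools(
    provider: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the required executables that cannot be found on PATH."""
    if provider not in PROVIDER_CLI:
        raise ValueError(f"unknown provider: {provider!r}")
    required = [PROVIDER_CLI[provider]] + COMMON_TOOLS
    # shutil.which returns None when the executable is not found.
    return [tool for tool in required if which(tool) is None]
```

Calling `missing_tools("aws")` returns an empty list when both `aws` and `plural` are found. The `which` parameter exists only so the lookup can be swapped out in tests.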
Creating and configuring a Kubernetes cluster with Plural
1. After creating your account, you will go through Plural's onboarding process.
2. Click on use your own cloud. From there you'll be prompted to select if you want to use our cloud shell experience or install the CLI on your local machine. It is recommended to use our cloud shell for a quick and easy experience.
3. Create a GitHub or GitLab repository to store the state of the deployment. Plural manages all cluster configurations via Git and will provision a GitHub repository on your behalf. This repository is set up using scoped deploy keys to store the state of your workspace, and no OAuth credentials are persisted.
4. Choose your cloud provider. Plural is a solution that deploys and manages infrastructure in a user’s cloud environment, so it needs relatively high levels of access to your cloud environment. As a result, you need to provide a service account to Plural so it can authenticate against your cloud environment.
5. Choose a distinct name for the cluster created for the deployment. Afterward, specify a unique prefix for the bucket and a subdomain for creating DNS.
6. Review that the information you entered is correct, and if so click create. Note: This step can take a few minutes.
Plural will now create a cloud shell environment for you, which will take a few minutes. Afterward, you’ll be asked to choose which applications you wish to install on a new Kubernetes cluster.
How to Install DataHub on the Kubernetes Cluster with Plural
1. Search for datahub in the install apps window. Select datahub and press Continue.
2. Next, enter a name for the Virtual Private Cloud (VPC) where the DataHub deployment will reside. This ensures Plural has a clean environment to deploy into and minimizes disruption to existing systems.
3. Enter a WAL bucket name.
4. You will be prompted to enter a hostname for Elasticsearch, a DataHub dependency.
5. Next, if you chose earlier to install the Plural Console alongside your DataHub installation, you’ll be prompted to configure your Plural Console environment.
6. Lastly, enter a hostname for the DataHub installation. Note: It's recommended to name the hostname after the application. Confirm everything looks good before deploying. Note: This step can take up to 15 minutes.
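Since the hostname is typically the application name followed by the subdomain you chose earlier, it can be handy to sanity-check it before typing it into the installer. The following is a small sketch that assumes that naming pattern; the `app_hostname` helper and the example domain are hypothetical, and the check covers only DNS label syntax, not whether the name resolves.

```python
import re

# One DNS label: 1-63 characters, lowercase letters, digits, or hyphens,
# with no leading or trailing hyphen.
_LABEL = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)")


def app_hostname(app: str, subdomain: str) -> str:
    """Build '<app>.<subdomain>' and check that it is a syntactically valid hostname."""
    hostname = f"{app}.{subdomain}".lower()
    if len(hostname) > 253:
        raise ValueError("hostname longer than 253 characters")
    for label in hostname.split("."):
        # fullmatch rejects empty labels (e.g. from "a..b") and illegal characters.
        if not _LABEL.fullmatch(label):
            raise ValueError(f"invalid DNS label: {label!r}")
    return hostname
```

With a placeholder subdomain, `app_hostname("datahub", "my-team.example.com")` returns `datahub.my-team.example.com`, while a name containing an underscore or an empty label raises `ValueError`.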
Accessing the Plural Console
Deployed within the same cluster as the managed applications, the Plural Console acts as a central operational hub and offers several essential functionalities for effective management.
- The Plural Console facilitates automated upgrades from the Kubernetes API.
- The Console serves as a built-in Kubernetes dashboard for all Plural-managed applications within the cluster.
- It conducts app-level health checks to ensure smooth operation.
- The Console also serves as a communication point for reporting incidents to the application owner.
To enter the Plural Console, navigate to the “Plural Console URL” from the Cloud Shell.
Accessing and Configuring the DataHub Deployment
You can access the DataHub Dashboard through the Cloud Shell link or the Plural Console.
Inside the Plural Console, press Launch next to datahub to access the Dashboard. If you enabled OIDC earlier, you won’t need to manage authentication separately.
Once you are in, you should see DataHub's search console up and running. To start using DataHub, you'll need to perform your first metadata ingestion. Refer to DataHub’s Introduction to Metadata Ingestion guide to get up and running.
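As a rough illustration of what a first ingestion involves, DataHub ingestion is driven by a recipe that names a source and a sink. The sketch below builds such a recipe in Python and renders it as text. The `source`/`sink` layout, the `datahub-rest` sink type, and the `datahub ingest -c` command follow DataHub's documented recipe format as far as we know. The helper functions, the MySQL address, and the GMS URL are placeholders you would replace with your own values, and credentials are deliberately left out.

```python
import json
from typing import Any, Dict


def build_recipe(source_type: str, source_config: Dict[str, Any], gms_server: str) -> Dict[str, Any]:
    """Assemble a recipe dict: where metadata comes from and where it is sent."""
    if not source_type:
        raise ValueError("source_type must not be empty")
    return {
        "source": {"type": source_type, "config": dict(source_config)},
        # "datahub-rest" sends metadata to the DataHub GMS endpoint over HTTP.
        "sink": {"type": "datahub-rest", "config": {"server": gms_server}},
    }


def render_recipe(recipe: Dict[str, Any]) -> str:
    """Render the recipe as JSON, which YAML parsers generally accept as well."""
    return json.dumps(recipe, indent=2)


if __name__ == "__main__":
    recipe = build_recipe(
        source_type="mysql",
        source_config={"host_port": "localhost:3306", "database": "example_db"},  # placeholders
        gms_server="http://localhost:8080",  # placeholder; use your own GMS address
    )
    print(render_recipe(recipe))
```

Saving the printed output to a file such as `recipe.yml` and running `datahub ingest -c recipe.yml` with the DataHub CLI would then start the ingestion, assuming the source is reachable and the required credentials are added to the source config.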
Next Steps and Resources
In this article you have learned how to:
- Set up a Plural Git repository for storing infrastructure information.
- Easily provision a fully configured Kubernetes cluster, even with no prior management experience.
- Install an instance of DataHub on your newly created Kubernetes cluster.
Are you looking to get your DataHub instance up and running on Kubernetes with minimal effort?
Make sure to join our Discord community for deployment help, discussion, and meeting other Plural users.
Ready to effortlessly deploy and operate open-source applications in minutes? Get started with Plural today.
Read more about DataHub on its official documentation.