Get startedSign in
Back

Spark

A unified analytics engine for large-scale data processing.

Available providers

Why use Spark on Plural?

Plural helps you deploy and manage the lifecycle of open-source applications on Kubernetes. Our platform combines the scalability and observability benefits of managed SaaS with the data security, governance, and compliance benefits of self-hosting Spark.

If you need more than just Spark, look for other cloud-native and open-source tools in our marketplace of curated applications to leapfrog complex deployments and get started quickly.

Spark’s websiteGitHubLicenseInstalling Spark docs

Deploying Spark is a matter of executing these 3 commands:

plural bundle install spark spark-aws
plural build
plural deploy --commit "deploying spark"
Read the install documentation

Apache Spark

Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.

https://spark.apache.org/

GitHub Actions Build AppVeyor Build PySpark Coverage PyPI Downloads

Online Documentation

You can find the latest Spark documentation, including a programming guide, on the project web page. This README file only contains basic setup instructions.

Building Spark

Spark is built using Apache Maven. To build Spark and its example programs, run:

bash
./build/mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.)

More detailed documentation is available from the project site, at "Building Spark".

For general development tips, including info on developing Spark using an IDE, see "Useful Developer Tools".

Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

bash
./bin/spark-shell

Try the following command, which should return 1,000,000,000:

scala
scala> spark.range(1000 * 1000 * 1000).count()

Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

bash
./bin/pyspark

And run the following command, which should also return 1,000,000,000:

python
>>> spark.range(1000 * 1000 * 1000).count()

Example Programs

Spark also comes with several sample programs in the examples directory. To run one of them, use ./bin/run-example <class> [params]. For example:

bash
./bin/run-example SparkPi

will run the Pi example locally.

You can set the MASTER environment variable when running examples to submit examples to a cluster. This can be a mesos:// or spark:// URL, "yarn" to run on YARN, and "local" to run locally with one thread, or "local[N]" to run locally with N threads. You can also use an abbreviated class name if the class is in the examples package. For instance:

bash
MASTER=spark://host:7077 ./bin/run-example SparkPi

Many of the example programs print usage help if no params are given.

Running Tests

Testing first requires building Spark. Once Spark is built, tests can be run using:

bash
./dev/run-tests

Please see the guidance on how to run tests for a module, or individual tests.

There is also a Kubernetes integration test, see resource-managers/kubernetes/integration-tests/README.md

A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the protocols have changed in different versions of Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at "Specifying the Hadoop Version and Enabling YARN" for detailed guidance on building for a particular distribution of Hadoop, including building for particular Hive and Hive Thriftserver distributions.

Configuration

Please refer to the Configuration Guide in the online documentation for an overview on how to configure Spark.

Contributing

Please review the Contribution to Spark guide for information on how to get started contributing to the project.

How Plural works

We make it easy to securely deploy and manage open-source applications in your cloud.

Select from 90+ open-source applications

Get any stack you want running in minutes, and never think about upgrades again.

Securely deployed on your cloud with your git

You control everything. No need to share your cloud account, keys, or data.

Designed to be fully customizable

Built on Kubernetes and using standard infrastructure as code with Terraform and Helm.

Maintain & Scale with Plural Console

Interactive runbooks, dashboards, and Kubernetes api visualizers give an easy-to-use toolset to manage application operations.

Learn more
Screenshot of app installation in Plural app

Build your custom stack with Plural

Build your custom stack with over 90+ apps in the Plural Marketplace.

Data
Stack
Airbyte
DATA
Clickhouse
DATA
Dagster
DATA
Datahub
DATA
Growthbook
DATA
Jitsu
DATA
Lightdash
DATA
Posthog
DATA
Explore the Marketplace

Used by fast-moving teams at

  • CoachHub
  • Digitas
  • Fnatic
  • FSN Capital
  • Justos
  • Mott Mac

What companies are saying about us

We no longer needed a dedicated DevOps team; instead, we actively participated in the industrialization and deployment of our applications through Plural. Additionally, it allowed us to quickly gain proficiency in Terraform and Helm.

Walid El Bouchikhi
Data Engineer at Beamy

I have neither the patience nor the talent for DevOps/SysAdmin work, and yet I've deployed four enterprise-caliber open-source apps on Kubernetes... since 9am today. Bonkers.

Sawyer Waugh
Head of Engineering at Justifi

This is awesome. You saved me hours of further DevOps work for our v1 release. Just to say, I really love Plural.

Ismael Goulani
CTO & Data Engineer at Modeo

Wow! First of all I want to say thank you for creating Plural! It solves a lot of problems coming from a non-DevOps background. You guys are amazing!

Joey Taleño
Head of Data at Poplar Homes

We have been using Plural for complex Kubernetes deployments of Kubeflow and are excited with the possibilities it provides in making our workflows simpler and more efficient.

Jürgen Stary
Engineering Manager @ Alexander Thamm

Plural has been awesome, it’s super fast and intuitive to get going and there is zero-to-no overhead of the app management.

Richard Freling
CTO and Co-Founder at Commandbar

Case StudyHow Fnatic Deploys Their Data Stack with Plural

Fnatic is a leading global esports performance brand headquartered in London, focused on leveling up gamers. At the core of Fnatic’s success is its best-in-class data team. The Fnatic data team relies on third-party applications to serve different business functions with every member of the organization utilizing data daily. While having access to an abundance of data is great, it opens up a degree of complexity when it comes to answering critical business questions and in-game analytics for gaming members.

To answer these questions, the data team began constructing a data stack to solve these use cases. Since the team at Fnatic are big fans of open-source they elected to build their stack with popular open-source technologies.

Fnatic’s Data Stack

Airbyte
Airflow
Clickhouse
Grafana
Metabase
PostgreSQL

FAQ

Plural is open-source and self-hosted. You retain full control over your deployments in your cloud. We perform automated testing and upgrades and provide out-of-the-box Day 2 operational workflows. Monitor, manage, and scale your configuration with ease to meet changing demands of your business. Read more.

We support deploying on all major cloud providers, including AWS, Azure, and GCP. We also support all on-prem Kubernetes clusters, including OpenShift, Tanzu, Rancher, and others.

No, Plural does not have access to any cloud environments when deployed through the CLI. We generate deployment manifests in the Plural Git repository and then use your configured cloud provider's CLI on your behalf. We cannot perform anything outside of deploying and managing the manifests that are created in your Plural Git repository. However, Plural does have access to your cloud credentials when deployed through the Cloud Shell. Read more.