To start, I think it's important to discuss the motivations for building Plural, since they're ultimately what guided most of our technical decisions. Plural originally came out of my enthusiasm for Kubernetes, especially the realization that its unique combination of a rich, extensible API and a strong community could ultimately provide a platform on which to build self-managing applications.
When I began investigating the Kubernetes deployments of many popular open source applications, it became clear there was a wide chasm between even a good Kubernetes deployment and a fully hosted offering from a cloud provider or a mature software vendor. This meant that while Kubernetes has a lot of technical potential, it's not generally commercially viable until that user experience gap is closed.
That said, I thought the gap was closable, given how mature a lot of the tooling actually is. The big unsolved problem is delivering a workflow that consistently combines the standard deployment toolchain, which is what we're building at Plural.
It's worth listing exactly what makes up that gap, at least as I've consistently seen it:
- Applications need to be tailored to each specific cloud. This usually comes down to injecting credentials and setting up object storage and databases. Still, each cloud has its own services, APIs and conventions, which quickly means you are navigating n sets of docs, where n is the number of providers supported.
- The big draw of running directly on Kubernetes is deep customizability, but that works against a functional out-of-the-box experience, and most tools don't solve for both.
- The application lifecycle needs to be solved, since an unmanageable but easy-to-install application is still a pile of tech debt. That requires really strong administration UX.
Additionally, we wanted a set of abstractions and principles that can scale to virtually any application, deploy to virtually any cloud, and be usable by virtually any developer.
It became clear that solving for cloud customizability and application configurability is as complex as a code management problem. You can think of it as managing a graph of dependencies for an application, spanning the other applications and submodules needed to create the various cloud-specific resources it requires. Take Apache Airflow as an example. It generally needs to deploy these things:
- a log object storage bucket (which is different depending on cloud, with Azure being unique in having no S3 compatibility whatsoever)
- a Redis server for Celery queueing
- the Airflow webserver / worker / scheduler
Virtually any deployment of Airflow will need to manage a sequenced installation of all those components, and any upgrade will also need some sort of version compatibility check to ensure all the components can play nicely together.
So how did we solve it? In general, we chose an architecture with three main components:
- the core Plural API - a GraphQL API that effectively serves as a package manager for all the deployment modules we support (e.g. Terraform and Helm) and their dependencies, in order to manage the creation of arbitrary applications like the one described above. Think of it like an npm for devops, but with a deployment engine underneath.
- a command line interface - responsible for taking information stored in the API and generating a standardized workspace within a git repository based on what a user has installed. It also does the work of executing the Terraform and Helm commands for you using that workspace.
- admin console - a web service deployable in your own Kubernetes cluster to take over all the day 2+ operational responsibilities of managing applications. In particular, its biggest responsibility is accepting and applying updates as they flow through the API.
It's worth digging into some of the technical choices we made in each of these systems.
We made a somewhat unusual decision to use Elixir for our server-side code on both the API and the admin console. I had previous experience building a very large Elixir codebase at Frame.io and learned to love the language, but there were some unique rationales that made it, or really the entire BEAM (Elixir's VM) ecosystem, a good fit for Plural as well:
- Our need for a lot of realtime, websocket-based UX, especially around displaying changes in state within a Kubernetes cluster, lends itself naturally to the BEAM's actor model implementation, and Elixir/Phoenix make that very approachable.
- Our admin console needs to be bulletproof, since it's the interface for managing all the other applications within Plural. If it goes down, users are basically flying blind. Elixir/Erlang is renowned for extreme reliability. It also needs to be efficient, since we want the "Plural tax" to be as small as possible. BEAM scales vertically wonderfully and is built for network-switch level reliability.
- The memory consumption for Elixir is remarkably stable for a managed runtime. This also ties into its actor model implementation: each "process" is allocated its own heap and stack, so if you program according to the standard BEAM practice of orchestrating ephemeral processes to manage your application, the death of those processes can guide the VM to efficiently reap unused memory. You can read more about how all this works here.
- Dynamic typing makes rapid development much easier, while Elixir's powerful pattern matching allows the codebase to scale much like one in a static language, since a pattern-matched function head can express an even more powerful contract than a statically typed function signature (except that it is ultimately enforced at runtime).
Like our choice of GraphQL, using Elixir does come with tradeoffs, the most significant of which is community. Elixir is a niche language, and you don't have as large an initial well of developers to source from. That said, existing Elixir devs love to continue working in Elixir and are often high quality. There's also an interesting ramp-up process on the language, as it is a significant paradigm shift from imperative and object-oriented languages to a fully functional language with strong immutability guarantees. Part of how I've navigated that in the past is being very active in pair-programming with new developers as part of their onboarding into the codebase.
Finally, dynamic typing is a meaningful perf hit in comparison to static typing, along with the overhead imposed by immutability, especially for the CPU-bound work our server side will do a fair amount of (JSON serialization in particular). I do think some recent changes in the BEAM, such as the JIT compiler introduced in OTP 24, should improve the straight-line performance of Elixir/Erlang, but it's still worth noting.
When building APIs, there are two main patterns available: REST and GraphQL. I had built REST APIs at plenty of former roles, but there were two main reasons I actually preferred GraphQL. First, it's currently much better supported in the browser. Apollo Client makes React development much easier than Redux, and also seems more performant, and I was anticipating a lot of complex UI to solve a really challenging UX problem: making a wide slew of applications operable. Second, I knew we were going to need a lot of realtime functionality in the various products, and there aren't many solid wire protocols to add on top of websockets, except for GraphQL subscriptions. Being able to have auto-typed, self-documenting websocket clients was a huge win, and I felt it was worth the slight novelty of GraphQL.
We have a significant portion of our product tied to a CLI distribution. The two common languages we could have used for building that were Python and Go. I think Go is the obvious winner here for the ability to build easily distributed cross-platform binaries alone, but it also lets us link directly against the source code of many of the tools we need from the Kubernetes ecosystem. We have also written a fair amount of Kubernetes operator code to manage the runtime of Plural applications, and it's good to choose a language that makes it easy for our dev teams to move between both of those codebases.
There's infinitely more granularity to all these decisions, but we thought there might be some insights here that people could find helpful, or maybe just food for thought. If you are interested in learning more, everything we build is open source, and you can find it at these links (give us a star if you like what we are doing):