Hello, Manta: Bringing Unix to Big Data

June 25, 2013 - by Mark Cavage

I came to Joyent about two and a half years ago after being thoroughly convinced that Joyent was a place where "full stack engineering" actually happened; nothing was off-limits, systems could be built with the best abstractions, and we could take a fresh approach to tackling cloud computing problems using the right technology for each task. In particular, one of the products Joyent had long needed to build was object storage, but we didn't want to build something that was only marginally better than the storage offerings already on the market. If we were going to tackle large-scale storage, we needed to do something truly better than anything else out there. As Bryan Cantrill points out, storage is always a trail of tears, and not something you undertake lightly (having built a great enterprise storage appliance and long walked that trail of tears himself, he has some experience here).

Joyent created and maintains SmartOS, which offers some very compelling technologies for systems infrastructure: DTrace, ZFS and Zones. For a while we kicked around how ZFS could give us something better, and were fixated on how to build a better object storage service using ZFS. However, somewhere along the way, we realized the differentiating technology wasn't ZFS; it was Zones.

I've been working on the cloud for a while now (almost since "the beginning"), and one of the things I've always loathed is that it's just too hard to leverage the cloud to perform basic data processing tasks. There's cluster setup and management, data movement and ETL (extract, transform, load), and high availability (HA) management. And that's all before you write any code to actually work with your data. The realization eventually came that we could deliver a truly amazing product if we elevated compute to a first-class citizen in the product, specifically by bringing arbitrary compute to data.

One afternoon in October of 2011, Bryan and I talked this over as a first pass, and it became clear Zones were the answer. A very short time later (a few days, if I recall), Dave Pacheco was on board and we started kicking around what this would really look like. For the first while we were also trying to integrate with Node, such that users phrased work in terms of JavaScript code, and it actually took quite a while of kicking ideas around before we realized the interface we really wanted was simply Unix. Every Unix user is familiar with data processing using pipes (I'm looking at you, find | grep | sort), and if we could truly run arbitrary compute on stored data while managing the distributed problems, large-scale data processing would be significantly easier. We were all busy with other commitments, so we really didn't start development in earnest until the spring of 2012, when we all got together in San Francisco for a "summit" and created a whiteboard architecture of how this would work.

There were obviously a lot of other problems to solve, and as I said earlier, Joyent is a "full stack" company. With some of the initial design laid out, Jerry Jelinek actually wrote the first lines of code for Manta, in the form of hyperlofs, without which Manta would not exist. Yunong Xiao wrote an HA metadata stack built on Postgres and ZooKeeper. Keith Wesolowski created custom hardware systems for storage and applications in Manta. Bill Pijewski wrote a custom deployment management system for Manta. Nate Fitch wrote garbage collection, and Fred Kuo wrote usage aggregation. And Manta couldn't actually exist without all the engineering effort Joyent has put into SmartDataCenter.

As Manta has evolved from something that barely worked in our lab, to a private beta, to public availability now, we've continually striven to make it easier to use. Nobody disagrees that the simplest abstractions are best, and the further along we got, the more we realized we just wanted to make it easy to carry over existing "one-liners" to Manta, and have Manta manage the distributed coordination and "muck." In essence, we've brought the Unix philosophy to big data. From what we've seen, everyone who tries Manta (including us) walks in thinking it's object storage with a twist, but walks away realizing that it's a paradigm shift. Manta dramatically reduces the barrier to entry for big data processing, and enables completely new use cases that weren't possible before; while Unix one-liners are the "hello world" for Manta, really you can do anything you would be able to do in a full OS.
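To make the "one-liner" idea concrete, here is a minimal local sketch of the model, written in Python. It is not Manta's API or implementation: the function names (`run_phase`, `run_job`), the in-memory `objects` dictionary, and the split into a per-object "map" command followed by a single "reduce" command are all assumptions made for illustration. It only shows the shape of the idea: the user supplies ordinary shell pipelines, and something else takes care of feeding each stored object to them and collecting the results.

```python
import subprocess
from typing import Mapping


def run_phase(command: str, data: bytes) -> bytes:
    """Run a shell one-liner with `data` on stdin and return its stdout."""
    result = subprocess.run(
        ["sh", "-c", command],
        input=data,
        stdout=subprocess.PIPE,
        check=False,  # e.g. grep exits non-zero when nothing matches
    )
    return result.stdout


def run_job(objects: Mapping[str, bytes], map_cmd: str, reduce_cmd: str) -> bytes:
    """Hypothetical local stand-in for a distributed job.

    The map command runs once per object; the reduce command runs once
    over the concatenated map outputs.
    """
    map_outputs = [run_phase(map_cmd, objects[name]) for name in sorted(objects)]
    return run_phase(reduce_cmd, b"".join(map_outputs))


# Made-up sample "objects" standing in for stored log files.
objects = {
    "logs/a.txt": b"GET /index\nPOST /login\nGET /about\n",
    "logs/b.txt": b"GET /index\nGET /index\nPOST /login\n",
}

# The same pipeline you might type at a terminal, split into two phases.
output = run_job(objects, "grep GET", "sort | uniq -c | sort -rn")
print(output.decode(), end="")
```

In this sketch the per-object commands run sequentially in one process on one machine; the point of a system like Manta is that the same commands can run next to the data, with the scheduling and coordination handled for you.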

It is profoundly satisfying to see Manta hit the market today. As with all revolutionary technologies, we don't really know what all the possibilities are for users; we just know that we've unlocked them. We're tremendously excited to see all the applications that will be built on top of Manta, and as we go, we'll be adding more features as we hear about various use cases. We're already planning on rich access control, triggers, SQL, real-time analytics and more.

To get started, there's a tutorial and a screencast that will walk you through installing an SDK, managing objects, and running some jobs. Give it a spin, and let us know what you think.