Joyent Manta Storage Service: Image Manipulation and Publishing Part 2 - Analytics with MapReduce

In Part 1 of this series I introduced the Getty Open Content image collection.

In this blog, I will explain how you can use the Manta Storage Service to inspect and validate the 4,596 images in the set and extract the image pixel dimensions from within the JPEG files. I will also show you how I found the widest image in the Getty Open Content collection, shown below.


Figures Walking in a Parkland; Louis Carrogis de Carmontelle, French, 1717 - 1806; France, Europe; 1783 - 1800; Watercolor and gouache with traces of black chalk underdrawing on translucent Whatman paper; 47.3 x 377 cm (18 5/8 x 148 7/16 in.); 96.GC.20

This piece of artwork is one of the two extreme points on a graph of the dimensions of all the Getty Open Content images, after they have been area normalized to 0.25 megapixels with ImageMagick.

Deploying a Unix pipeline that computes on the stored objects you have put on the Manta Storage Service is straightforward. There is a mapping from a pipeline you construct on your notebook to the style of compute jobs Manta can process. In this blog we will look at how Manta's MapReduce features work by setting up an example on your local machine first.

This blog contains a How-To in four stages:

  1. Set up software on your local machine to mimic the Manta tools you will use.
  2. Walk through the steps of a simple MapReduce analysis on the Getty Open Content images on your local machine.
  3. Use the same MapReduce example on the Joyent Manta Storage Service.
  4. Retrieve the results, graph image size distribution, and find the widest and tallest image.

Installation List

For this blog you will need a Linux or Unix based computer with:

  • A Joyent account with your public/private keys installed
  • the Node.js based Manta Command Line Utilities
  • ImageMagick
  • cURL
  • R

You will need to set up a Joyent account for the Manta Storage Service.

If you are new to Joyent, there is a free trial link at the top right of this page that provides enough credit to work through the examples in this blog, and more.

The image manipulation tools are from the ImageMagick suite of command-line programs. The two key command line tools are programs called convert and identify. These are already installed on Manta, but for the first part, you will need them installed on your own machine.

Mac OS X Local Install

Manta's Command Line Utilities are Node.js programs, delivered and installed by the Node Package Manager npm.

Some of the Manta code requires Xcode to compile, so Xcode must be installed. You can check whether Xcode is installed in a Terminal window with the command:

$ sudo xcode-select --version
xcode-select version 2311.

Not installed? Head over to the Mac OS X App Store and download the free Xcode developer tools.

Install and start Xcode, click on "Create a New Xcode Project" at the welcome screen, then cancel and quit Xcode. Run the above check again and it should be ready to go.

For a new Mac OS X install including Node.js, npm, node-manta and its dependencies, use the complete package here.

If you already have Node.js installed, follow the instructions here.

Mac OS X users should go to CRAN to get the R package.

To install ImageMagick and cURL you can use MacPorts or Joyent's pkgsrc package management system. Pkgsrc is a command line installation utility that Joyent uses for deploying packages to SmartOS and Mac OS X.

Instructions for installing pkgsrc are found here.

To install ImageMagick and curl with pkgin:

sudo pkgin install ImageMagick curl

Linux Local Install

Node.js for Linux can be installed from nodejs.org

The Manta Command Line Utilities are installed with:

npm install -g manta

Linux users can use package management tools to install ImageMagick, cURL and R (e.g. apt-get on Ubuntu, yum on RedHat/Fedora/CentOS, zypper on SuSE).

Windows Local Install

You can install Node.js from http://nodejs.org and the Manta Command Line Utilities with:

npm install -g manta

Go to CRAN to get the R package. ImageMagick is here. cURL is here.

While the How-To is a Unix based one in its instruction syntax, the Windows PowerShell supports pipes and the ForEach-Object construct is similar to the find & xargs combinations in Unix. Windows users can follow along for the most part, and use Manta directly from the Windows command shell with the Manta Command Line Utilities.

Manta Environment Variables

The command line Manta tools need three environment variables. One is the key ID (fingerprint) of your public SSH key for secure access. You can find these on the Joyent Dashboard page; they look like this:

export MANTA_URL=https://us-east.manta.joyent.com
export MANTA_USER=your_user_name
export MANTA_KEY_ID=d0:4a:88:2a:f1:b3:2f:9b:57:09:c4:4b:83:1d:29:a7

Copy and paste these into the .bashrc file in your home directory using a text editor and save the file. Then source the .bashrc file:

source ~/.bashrc

to set the environment variables. Then check that they are set with:

$ echo $MANTA_USER
your_user_name

Ready to go?

Once that is done, you can check that everything is installed by checking the versions. Note that you may not have exactly the same versions I show here.

$ convert --version
Version: ImageMagick 6.8.5-8 2013-07-12 Q16 http://www.imagemagick.org
$ curl --version
curl 7.31.0 (i386-apple-darwin10) libcurl/7.31.0 OpenSSL/1.0.1e zlib/1.2.8 libidn/1.27
$ R --version
R version 3.0.1 (2013-05-16) -- "Good Sport"
$ node --version
v0.10.16
$ which mlogin
/usr/local/bin/mlogin

Download the Small WebP form of the Getty Images

This demo uses shrunken versions of the original Getty Open images, which were a little over 100 GB of JPEG data in total, and a bit impractical for a short demonstration.

Once you have cURL installed, you can use it to download the tar archive of the small 0.25 megapixel Getty Open Content images:

Then extract the images with the tar command:

tar -xf getty_webp.tar

Using ImageMagick on your local system

Assuming you extracted the archive into your home directory, you can find the files by changing to the extracted directory:

$ cd ~/var/tmp/500x500_webp
$ ls -al | head
total 346920
drwxr-xr-x  4599 cwvhogue  staff  156366 17 Sep 17:10 .
drwxr-xr-x     3 cwvhogue  staff     102 17 Sep 17:09 ..
-rw-r--r--     1 cwvhogue  staff   25720 10 Sep 15:14 00000201.webp
-rw-r--r--     1 cwvhogue  staff   31520 10 Sep 15:14 00000301.webp
-rw-r--r--     1 cwvhogue  staff   26567 10 Sep 15:14 00000401.webp
...

Including the README.txt file, there are a total of 4,597 files in the directory.

Validating Image Data

Now we can use ImageMagick identify to check the image data format and report on the file dimensions and size of each:

identify *.webp > local_identify.txt

This outputs a file with one line per image like this:

$ head local_identify.txt
00000201.webp JPEG 372x672 372x672+0+0 8-bit sRGB 25.7KB 0.000u 0:00.009
00000301.webp[1] JPEG 598x418 598x418+0+0 8-bit sRGB 31.5KB 0.000u 0:00.000
00000401.webp[2] JPEG 417x600 417x600+0+0 8-bit sRGB 26.6KB 0.000u 0:00.000
00000501.webp[3] JPEG 414x604 414x604+0+0 8-bit sRGB 18.3KB 0.000u 0:00.000
00000601.webp[4] JPEG 518x483 518x483+0+0 8-bit sRGB 28.5KB 0.000u 0:00.000
...

The identify program interrogates the graphic format by looking inside the image. In this example, the image files have the .webp extension, but the internal encoding is reported as JPEG.

Uh oh, these files are NOT webp formatted.

They are JPEG files with a .webp extension, and not what I intended to make and post in the previous blog. These files were created with an ImageMagick convert command along these lines (the @ suffix resizes to a target area in pixels):

convert 00000201.jpg -resize 250000@ -quality 50 00000201.webp

ImageMagick in this case was not linked to the library it required for WebP conversion. Yes I will have to go back and fix Part 1.

The convert program blissfully ignored the WebP format implied by the output file extension and simply carried out the conversion to make a JPEG file, reduced to 0.25 megapixels and with a quality setting of 50.

ImageMagick supports many graphical file image types and does not do any checking to see that the internal encoding matches the filename extension. So it happily and silently spits out a file that is JPEG encoded internally with a .webp extension. The .webp extension signals the image type to the browser, so the files will open in Chrome from the web link, but not in Firefox or Safari. Rename the files to .jpg and they will open.

So let's correct this problem and rename the files to *.jpg on our local system.

Wild Cards

Simple wild-cards do not work for renaming on Unix. There is no built-in command to rename *.webp to *.jpg. On older Unix systems the shell also had limits on the number of files wildcard expansion could pass to a command; I have hit this wall while making hundreds of thousands of files in a Mac OS X directory.

The one-liner to rename these files combines the Unix find and mv commands. It extracts the file name with the dirname and basename commands, which are resolved first inside the backtick quotations.

The find command has an -exec capability; that is, it can execute a shell running any Unix command specified with -c. Here it executes mv on every file it finds with the *.webp extension.
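Putting dirname, basename and -exec together, a sketch of that rename one-liner looks like this (using $(...) command substitution in place of backticks):

```shell
# For every *.webp file, -exec runs a small shell command that renames
# it in place: dirname keeps the file in its original directory, and
# basename strips the old .webp extension before .jpg is appended.
find . -name "*.webp" -exec sh -c 'mv "$1" "$(dirname "$1")/$(basename "$1" .webp).jpg"' _ {} \;
```

The `_` placeholder fills `$0` of the inner shell so that the found filename arrives as `$1`.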

This command pattern is similar to what you will find when using the mfind command, which finds objects in the Manta directory hierarchy.

Let's go back to the identify command for a moment and I will show you a simple way to do a one-machine MapReduce construct on your notebook. This will help you understand how things work on Manta, and show how you can mock up a job on Unix and move it onto the Manta Storage Service for computing.

The One Machine Unix Map
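The map phase is another find one-liner; a minimal sketch (the `${1%.jpg}.id` output naming is my choice here, assuming the renamed .jpg images are in the current directory):

```shell
# Map phase: run identify once per image, writing each one-line
# result to its own .id file alongside the image.
find . -name "*.jpg" -exec sh -c 'identify "$1" > "${1%.jpg}.id"' _ {} \;
```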

This runs identify separately and outputs one file for each image containing the one-line output of identify:

$ ls -al *.id | head
-rw-r--r--  1 cwvhogue  staff  74 20 Sep 08:49 00000201.id
-rw-r--r--  1 cwvhogue  staff  74 20 Sep 08:49 00000301.id
-rw-r--r--  1 cwvhogue  staff  74 20 Sep 08:49 00000401.id
-rw-r--r--  1 cwvhogue  staff  74 20 Sep 08:49 00000501.id
...
$ more 00000201.id
..jpg JPEG 372x672 372x672+0+0 8-bit sRGB 25.7KB 0.010u 0:00.000

So now you have all the outputs.

The One Machine Unix Reduce

We can do a reduce phase like so:
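A sketch of that reduce step:

```shell
# Reduce phase: find every per-image .id file and concatenate their
# contents into a single output file.
find . -name "*.id" | xargs cat > mr_identify.txt
```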

This finds all 4,596 *.id files we made in the map phase and passes their names to the Unix xargs command. The xargs command hands the filenames as arguments to the cat command, which concatenates their contents into a single output file.

Voila, reduced! All of those small one line outputs from identify are collected into a single file.

$ head mr_identify.txt
..jpg JPEG 372x672 372x672+0+0 8-bit sRGB 25.7KB 0.010u 0:00.000
..jpg JPEG 598x418 598x418+0+0 8-bit sRGB 31.5KB 0.000u 0:00.009
..jpg JPEG 417x600 417x600+0+0 8-bit sRGB 26.6KB 0.000u 0:00.000
...

Now you have a file that looks a lot like the one previously made. Recall the wild card version we did earlier? Run it again over the name-corrected JPEG files with:

identify *.jpg > local_identify.txt

And now compare the two:

$ cat local_identify.txt | wc -l
    4596
$ cat mr_identify.txt | wc -l
    4596
$ head local_identify.txt
.jpg JPEG 372x672 372x672+0+0 8-bit sRGB 25.7KB 0.000u 0:00.009
.jpg[1] JPEG 598x418 598x418+0+0 8-bit sRGB 31.5KB 0.000u 0:00.000
.jpg[2] JPEG 417x600 417x600+0+0 8-bit sRGB 26.6KB 0.000u 0:00.000
.jpg[3] JPEG 414x604 414x604+0+0 8-bit sRGB 18.3KB 0.000u 0:00.000
...
$ head mr_identify.txt
..jpg JPEG 372x672 372x672+0+0 8-bit sRGB 25.7KB 0.010u 0:00.000
..jpg JPEG 598x418 598x418+0+0 8-bit sRGB 31.5KB 0.000u 0:00.009
..jpg JPEG 417x600 417x600+0+0 8-bit sRGB 26.6KB 0.000u 0:00.000
..jpg JPEG 414x604 414x604+0+0 8-bit sRGB 18.3KB 0.000u 0:00.009
...

So if you are with me so far, you can now see how MapReduce works on a single machine using simple Unix commands, and get an idea how wildcard expansion is facilitated by the Unix find command.

Certainly the wildcard version is faster on your notebook than the MapReduce construct. But this is not a Big Data example, and Big Data scale is where MapReduce begins to matter.

The Manta MapReduce Version - Computing on the Object Store

So Manta is not only an object store, it is also a massive distributed computer!

Let me show you how to take what you just did on your notebook and run it on the Manta Storage Service.

First, let's put the data up onto Manta, starting with a tar archive of the JPEG images with the proper .jpg filenames you just made above, using these commands:

mkdir getty
mv *.jpg getty
tar -cvf getty.tar getty
mput -f getty.tar /$MANTA_USER/stor

Depending on your upload bandwidth this may take a while. This is from my home internet provider:

$ mput -f getty.tar /$MANTA_USER/stor
...ue/stor/getty.tar [=======================>] 100% 163.95MB  55.27KB/s 50m37s

While you are waiting, let me cover some of the basics about Manta's default directories.

Manta provides a number of Command Line Utilities that resemble familiar Unix commands, including mls, mfind and muntar.

The Manta directories that come with your account are as follows:

/manta-username/stor
/manta-username/public
/manta-username/reports
/manta-username/jobs

Your private data goes into stor. Public data that is accessible to anyone on the web is published by simply putting it into public. The reports directory provides you with access log and billing information.

The jobs directory holds information about compute jobs you run on Manta with the mjob command, including tracking information for every job you run, job outputs, and all the error messages that programs may emit while running.


Once the upload is complete, the mput command will have copied the getty.tar object from your computer into the /stor directory of your Manta account.

$ mls /$MANTA_USER/stor
getty.tar

Now we will unpack the archive and create objects in /$MANTA_USER/public, by running the muntar command inside a Manta job along these lines:

echo /$MANTA_USER/stor/getty.tar | mjob create -o -m 'muntar -f $MANTA_INPUT_FILE /$MANTA_USER/public'

added 1 input to d228bed1-a4f4-47c9-b7bf-44e614d654d8
...
/cwvhogue/public/getty.jpg
/cwvhogue/public/getty.jpg
/cwvhogue/public/getty.jpg
/cwvhogue/public/getty.jpg
/cwvhogue/public/getty.jpg

This creates the /$MANTA_USER/public/getty subdirectory and makes objects on the Manta Storage Service from the contents of the Unix filesystem stored in the getty.tar archive.

You can see them in any browser like this; they are now public and accessible to anyone:

https://us-east.manta.joyent.com/cwvhogue/public/getty.jpg

Importantly, you have just distributed copies of your objects onto more than one high performance multicore server in the Joyent Manta Storage Service, where they are set up for fast computing without moving the stored objects.

Now we are ready to do the validation with ImageMagick identify on Manta.

Recall the One Machine MapReduce example - the identify command was run on each file on your notebook, making lots of one-line files.

Then cat was used to collect these all into one file. Both of these were initiated from the find command.

On Manta, the MapReduce process that does the equivalent of the find and identify commands uses mfind. The mjob command is like the -exec part of Unix find, but with distributed computing SUPERPOWERS.

Go ahead and try it, substituting your own username:

mfind -n 'jpg$' /$MANTA_USER/public/getty | mjob create -o -m 'identify $MANTA_INPUT_FILE' -r cat

b1011a15-64a8-463d-8329-de1909c149fb
added 1000 inputs to b1011a15-64a8-463d-8329-de1909c149fb
added 1000 inputs to b1011a15-64a8-463d-8329-de1909c149fb
added 1000 inputs to b1011a15-64a8-463d-8329-de1909c149fb
added 1000 inputs to b1011a15-64a8-463d-8329-de1909c149fb
added 596 inputs to b1011a15-64a8-463d-8329-de1909c149fb
mjob: error: job b1011a15-64a8-463d-8329-de1909c149fb had 1 error

Here mfind is the Manta version of the Unix find command. The pattern -n "jpg$" is a JavaScript regular expression, as mfind is a Node.js application. The results of mfind are returned to your notebook and piped via the Unix | pipe into the mjob command.

The mjob command takes this list of Manta Objects and executes the identify command by distributing it across the Manta compute nodes. The MapReduce map phase command comes after the -m flag.

The reduce phase is again the cat command, just as in the One Machine Reduce example above, and it comes after the reduce phase flag, -r.

The job number in this case is b1011a15-64a8-463d-8329-de1909c149fb. You can retrieve the job with (use your own job number):

$ mjob get b1011a15-64a8-463d-8329-de1909c149fb
{
  "id": "b1011a15-64a8-463d-8329-de1909c149fb",
  "name": "",
  "state": "done",
  "cancelled": false,
  "inputDone": true,
  "stats": {
    "errors": 1,
    "outputs": 1,
    "retries": 0,
    "tasks": 4597,
    "tasksDone": 4597
  },
  "timeCreated": "2013-09-20T17:49:05.830Z",
  "timeDone": "2013-09-20T17:52:05.926Z",
  "timeArchiveStarted": "2013-09-20T17:55:48.277Z",
  "timeArchiveDone": "2013-09-20T17:52:14.489Z",
  "phases": [
    {
      "exec": "identify $MANTA_INPUT_FILE",
      "type": "map"
    },
    {
      "exec": "cat",
      "type": "reduce"
    }
  ],
  "options": {}
}

The output of cat is stored in your jobs directory.

You can retrieve the output object store key with the command:

$ mjob outputs b1011a15-64a8-463d-8329-de1909c149fb
/cwvhogue/jobs/b1011a15-64a8-463d-8329-de1909c149fb/stor/reduce.1.124d6c06-07ba-4481-862a-31d68b530e6f

Then you can pull the output file down to your computer for inspection with mget, like so:

$ mget /cwvhogue/jobs/b1011a15-64a8-463d-8329-de1909c149fb/stor/reduce.1.124d6c06-07ba-4481-862a-31d68b530e6f > manta_identify.txt
...862a-31d68b530e6f [=======================>] 100% 452.30KB

(Yes, there is one error in this job; ImageMagick has difficulties with one of the .jpg files - looking into it...)

Now you can see the results of the one-line MapReduce job you just performed on Manta.

$ head manta_identify.txt
/manta/cwvhogue/public/getty.jpg JPEG 405x617 405x617+0+0 8-bit sRGB 33KB 0.000u 0:00.000
/manta/cwvhogue/public/getty.jpg JPEG 418x599 418x599+0+0 8-bit sRGB 44.5KB 0.000u 0:00.000
/manta/cwvhogue/public/getty.jpg JPEG 572x437 572x437+0+0 8-bit sRGB 32.9KB 0.000u 0:00.000
/manta/cwvhogue/public/getty.jpg JPEG 576x434 576x434+0+0 8-bit sRGB 27.8KB 0.000u 0:00.000
/manta/cwvhogue/public/getty.jpg JPEG 622x402 622x402+0+0 8-bit sRGB 25.4KB 0.000u 0:00.000

The MapReduce results are concatenated in the order they appear in the mjob distributed compute queue, so they are not sorted.

You can sort the results by stripping off the leading directory name with sed 's/\/manta\/cwvhogue\/public\/getty\///' and then sorting numerically, like so:

$ cat manta_identify.txt | sed 's/\/manta\/cwvhogue\/public\/getty\///' | sort -n > manta_identify_sorted.txt
$ head manta_identify_sorted.txt
.jpg JPEG 372x672 372x672+0+0 8-bit sRGB 25.7KB 0.000u 0:00.000
.jpg JPEG 598x418 598x418+0+0 8-bit sRGB 31.5KB 0.000u 0:00.000
.jpg JPEG 417x600 417x600+0+0 8-bit sRGB 26.6KB 0.010u 0:00.000
.jpg JPEG 414x604 414x604+0+0 8-bit sRGB 18.3KB 0.000u 0:00.000
.jpg JPEG 518x483 518x483+0+0 8-bit sRGB 28.5KB 0.000u 0:00.000
.jpg JPEG 408x613 408x613+0+0 8-bit sRGB 27.8KB 0.000u 0:00.009

Graphing Image Size Distribution in R

In Part 1 I made a graph of the image size distribution of the original Getty Images in R. I will show you how I did this using the manta_identify_sorted.txt file (or, if you skipped that part, you can use the mr_identify.txt file).

And since the data we have includes the width and height, we will also use R to plot the aspect ratio across the image set, and retrieve the widest and tallest images from within the set. For this I use awk, which addresses the columns of the input file through numbered variables ($1, $2 and so on).

To convert this to a .csv comma separated value file, which you can read into R (or a spreadsheet program), I use awk to extract the columns with the filename, the image pixel dimensions and the file size. Then I use a number of sed commands to add a space between the number and the units of the size, i.e. 25.7KB becomes 25.7 KB, and to split the dimensions, so 372x672+0+0 becomes 372x672 and then 372 672. Finally awk prints each of these tidied-up lines with commas between the fields.

If you are starting from the file you made on your notebook, use this:
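A sketch of that pipeline, starting from mr_identify.txt (the field numbers $7, $1 and $3 assume the identify output layout shown above; the final sort -n orders the lines by size):

```shell
# Extract size ($7), filename ($1) and dimensions ($3) from each
# identify line, split "25.7KB" into "25.7 KB" and "372x672" into
# "372 672", then print the fields comma-separated, sorted by size.
cat mr_identify.txt | \
  awk '{print $7, $1, $3}' | \
  sed 's/KB/ KB/' | \
  sed 's/x/ /' | \
  awk '{print $1", "$2","$3", "$4", "$5}' | \
  sort -n > Getty_Filesizes.csv
```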

This is what your Getty_Filesizes.csv file should look like:

$ head Getty_Filesizes.csv
11.7, KB,.jpg, 399, 626
12.5, KB,.jpg, 446, 561
12.5, KB,.jpg, 508, 492
12.9, KB,.jpg, 588, 425
12.9, KB,.jpg, 570, 439
13, KB,.jpg, 344, 726
13, KB,.jpg, 398, 628
13, KB,.jpg, 568, 440
13.6, KB,.jpg, 543, 460
13.7, KB,.jpg, 399, 626

All sizes should be in KB, so this command, which uses grep to find any lines without the KB string, should return nothing:

grep -v -e 'KB' Getty_Filesizes.csv

Now we are ready to load this file into R. Here is the R session I used to make the graphs. The lines to type in start with the R prompt >.

$ R
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)
...

First we will load in the Getty_Filesizes.csv file and look at it with the R ls() and head() commands:

> getty_sizes<-read.csv(header=FALSE, "Getty_Filesizes.csv")
> ls()
[1] "getty_sizes"
> head(getty_sizes)
    V1  V2   V3  V4  V5
1 11.7  KB .jpg 399 626
2 12.5  KB .jpg 446 561
3 12.5  KB .jpg 508 492
4 12.9  KB .jpg 588 425
5 12.9  KB .jpg 570 439
6 13.0  KB .jpg 344 726

Then we can plot a histogram of the file size distribution, with these two commands:

> hist(getty_sizes[,1], breaks=1000, main="Getty Open Image Size at 0.25 Megapixel", xlab="KB")
> rug(getty_sizes[,1])


I was interested in finding the Getty Open Content images with the most extreme aspect ratios, really long or really tall graphics.

So with the information in the file, we can make an x-y plot of the width and height of all the images. These are all approximately the same area in number of pixels, so we get a nice smooth curve. Images with dimensions 500 x 500 would be perfectly square.

plot(getty_sizes[,5] ~ getty_sizes[,4], main="Aspect Ratio", xlab="width (pixels)", ylab="height (pixels)")

There are outliers at either end of the graph! They can be retrieved using R's which.max() command, which returns the array index of the maximum value in a vector.

For the widest image shown at the top of the blog post:

> getty_sizes[which.max(getty_sizes[,4]),]
     V1  V2   V3   V4  V5
4216 54  KB .jpg 1473 170

And for the tallest image:

> getty_sizes[which.max(getty_sizes[,5]),]
       V1  V2   V3  V4  V5
2027 34.6  KB .jpg 265 943

That wraps up Part 2. In Part 3 I will go over image resizing and show you just how fast Manta is on the Getty Open Content originals compared to running the same conversion on my notebook. After that I will cover how to extract the XML metadata buried inside the Getty Open Content images and MapReduce it into a file of one-line file descriptions.

If you are interested to learn more about R - see my video on Simple Graphing in R.

Venus on the Waves; François Boucher, French, 1703 - 1770; 1769; Oil on canvas; Unframed: 265.7 x 76.5 cm (104 5/8 x 30 1/8 in.),Framed: 273.1 x 86.7 x 6.4 cm (107 1/2 x 34 1/8 x 2 1/2 in.); 71.PA.54



Post written by Christopher Hogue, Ph.D.