FRAMINGHAM, 7 JUNE 2010 - One of the promises of the public cloud is the opportunity to quickly dial up server resources in order to do a job that calls for heavy-duty batch processing. But first you need a way to manage the life cycle of that job. Fortunately, there are tools that can automate setting up and tearing down jobs in the public cloud.
In this groundbreaking test, we looked at products from RightScale, Appistry and Tap In Systems that automate and manage launching the job, scaling cloud resources, making sure you get the results you want, storing those results, then shutting down the pay-as-you-go process.
We found that each product gets to the finish line, but they each require some level of custom code and they each take different, vastly circuitous routes.
We liked RightScale's ability to both monitor and control application use, as well as its wide base of template controls and thoughtfulness for overall control. RightScale's RightGrid methodology manages the full life cycle of apps and instances, and gave us the feeling that hardware could truly be disposable in the cloud.
Yet with a bit of work, we found that both Appistry and Tap In Systems offered task automation components that could also be successful for cloud-based jobs.
In our first public cloud management test, we focused on the ability of products from RightScale, Tap In Systems and Cloudkick to simply monitor public clouds like Amazon's EC2, GoGrid and Rackspace.
This time around, our test bed was narrowed to Amazon's public cloud, and we used a variety of Amazon cloud services, including Elastic Compute Cloud (EC2) server resources, Simple Queue Service (SQS) queuing system, and Simple Storage Service (S3).
The good news for enterprises is that Amazon's pay-per-usage model can be a major cost saver. In this real-world test, we were able to complete our tasks using an extraordinarily low amount of our Amazon processing budget.
Similar batch job cost savings can be realized using Amazon competitors like GoGrid, RackSpace and others, but only if the tasks are automated using the type of cloud management tools that we tested here.
The basic procedure was similar for all three products. A job that needs to be performed requires application code, data files, a place to process the data (the cloud), and a place to put the results.
There are two options: package the job as a bundle that defines the code, data, outputs and options, or do that and also have a controller receive messages from the job in progress. The second option let us either take pre-defined actions based on the messages or change what happens in the middle of a job.
First, we needed to create an Amazon execution image with the applications we would be using for automation. We chose ffmpeg, an application suited towards video rendering jobs for processing by arrays.
Once created, we bundled the image and uploaded it to Amazon so we would have a copy to start with. Each product then varied in terms of controlling the life cycle of the bundle. Typically, the life cycle is the sequence of events that coordinates the process of doing jobs, gathering the results, storing them, and reporting success/failure (there will inevitably be both).
We gauged success by the degree of built-in controls, the amount of application customization that was necessary, how the management application would either programmatically or automatically scale resources to execute the job (by reading CPU or other resources, then adjusting jobs to add servers or resources), and how it communicated messages among job executors and coordinating processes.
RightScale's flexibility became readily apparent early in our testing. RightScale's ServerTemplates can be modified, and the orchestration needed to perform jobs from beginning to end doesn't require bundling all components prior to job execution, as the other vendors did.
By modifying the ServerTemplates, we didn't need to create our own bundled image on Amazon using its EC2 tools, which made the process that much simpler. But, like the other cloud management providers we tested, RightScale requires a bit of scripting work to make it useful.
RightScale uses two kinds of server application platforms. The first type of server array is queue-controlled. Jobs are placed into a queue, and depending upon resources that have been set, they execute in a first-in/first-out fashion. The second type uses monitored alerts to trigger actions that branch the job's processing instructions.
RightScale's RightGrid platform is based on the queue-based array model, which we focused on since it's the more convenient job control mechanism.
The queue-based processes connect with Amazon's SQS, allowing RightGrid to pick desired jobs from a queue for service and processing. It's like an assembly line for batch jobs.
A RightGrid job coordinator (which is a server process located somewhere) uses a script that sends the work via the Amazon SQS. The worker instances pick up the job, get the salient work files, then do the work. If there's a lot of work in the queue, depending on our settings, the array could launch more servers. Input, output, and audit queues are automatically created. We borrowed the framework for these queues from RightScale's demo application.
As an example, running a job might consist of building a job process, sending the job to the input queue, waiting for messages regarding job starts, delivering messages about progress, expansion of resources to service the job, or job completion, then the act of storing results to Amazon's S3 storage.
Modified representative scripts handle it all. The data from the output queue can then be sent into a database, as an example.
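The queue-driven life cycle described above can be sketched in a few lines. This is a minimal illustration, not RightScale's code: in-memory queues stand in for Amazon SQS, and the function names, the scaling rule in `workers_needed` and the `fake_encode` placeholder are all assumptions made for the example.

```python
from collections import deque
from math import ceil

# In-memory stand-ins for the input, output and audit queues.
# A real deployment would use Amazon SQS; nothing here calls AWS.
input_queue = deque()
output_queue = deque()
audit_queue = deque()


def submit_job(job_id, source_file):
    """Coordinator side: place a work item on the input queue."""
    input_queue.append({"id": job_id, "source": source_file})
    audit_queue.append(("submitted", job_id))


def workers_needed(queue_depth, jobs_per_worker=2, min_workers=1, max_workers=5):
    """Hypothetical scaling rule: one worker per `jobs_per_worker` queued
    jobs, clamped to the configured minimum and maximum array size."""
    wanted = ceil(queue_depth / jobs_per_worker)
    return max(min_workers, min(max_workers, wanted))


def run_one_shot_worker(process):
    """Worker side: take the oldest job (first in, first out), process it,
    and report the outcome on the output and audit queues."""
    if not input_queue:
        return None
    job = input_queue.popleft()
    audit_queue.append(("started", job["id"]))
    try:
        result = process(job["source"])
        output_queue.append({"id": job["id"], "result": result})
        audit_queue.append(("completed", job["id"]))
    except Exception as exc:  # failures are reported, not hidden
        audit_queue.append(("failed", job["id"], str(exc)))
    return job["id"]


def fake_encode(source_file):
    """Placeholder for the real work, such as an ffmpeg run."""
    return source_file.rsplit(".", 1)[0] + ".mp4"


if __name__ == "__main__":
    for n, name in enumerate(["a.avi", "b.avi", "c.avi"], start=1):
        submit_job(n, name)
    print("workers wanted:", workers_needed(len(input_queue)))
    while run_one_shot_worker(fake_encode) is not None:
        pass
    print(list(output_queue))
```

The audit queue records both successes and failures, which mirrors the point that a job life cycle has to report both.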
Variations on a script
We weren't locked into one language for scripting the templates, as RightGrid can use many languages, including Ruby, Python, bash shell, Java, plus whatever commands we could invoke from a command line.
We found it's easiest to use RightScale's pre-made configuration message encoding system, which is written in Ruby. The entire grid doesn't have to be written in Ruby, just the worker control files. Workers are process controllers that come in two varieties: one-shot and persistent.
RightGrid commonly uses one-shot workers to service processing queues. Persistent workers are used when the application requires a lot of front-end startup processing.
Alert-based arrays can scale up or down based on certain conditions (such as CPU usage, memory usage). This is useful for scalable applications such as Web sites. There are a couple of familiar options when creating this type of array, such as min/max number of servers, just like queue-based arrays. But mostly alert-based arrays are completely different.
A few options are worth noting. Each alert has decision threshold, resize by, and resize calm time options, which determine when, by how many servers, and how often the array scales up or down. Many conditions can be checked for this, not just CPU and memory.
RightScale demands a knowledge of scripting (and perhaps Ruby) to launch scalable, event- or queue-driven jobs. Between that job control and its superior monitoring, we liked RightScale the most for scalable batch jobs. Its downside is that it works with only a couple of cloud providers today, Rackspace and Amazon, although it can use them interchangeably. It used paid online resources efficiently, and managed the life cycle of job control very well.
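To make the three alert options concrete, here is a rough sketch of how such a policy could be evaluated. It is an assumption-laden illustration rather than RightScale's implementation: we treat the decision threshold as the share of servers that must agree before the array resizes, and the `ArrayPolicy` and `next_size` names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class ArrayPolicy:
    """Hypothetical container for the alert-based array settings."""
    min_servers: int = 1
    max_servers: int = 10
    decision_threshold: float = 0.51  # share of servers that must agree (assumed meaning)
    resize_by: int = 2                # servers added or removed per action
    resize_calm_time: int = 15        # minutes to wait between actions


def next_size(policy, current, votes_up, votes_down, minutes_since_resize):
    """Return the new server count, given how many of the `current` servers
    (at least one) raised a grow alert or a shrink alert."""
    if minutes_since_resize < policy.resize_calm_time:
        return current  # still in the calm period: do nothing
    if votes_up / current >= policy.decision_threshold:
        return min(policy.max_servers, current + policy.resize_by)
    if votes_down / current >= policy.decision_threshold:
        return max(policy.min_servers, current - policy.resize_by)
    return current


if __name__ == "__main__":
    policy = ArrayPolicy()
    # Three of four servers report high CPU, 20 minutes after the last resize.
    print(next_size(policy, current=4, votes_up=3, votes_down=0,
                    minutes_since_resize=20))
```

The calm time check comes first so that a burst of alerts cannot trigger back-to-back resizes.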
Tap In Control Plan Editor
Tap In Control Plan Editor is an automation tool built on the Petri Net model, a mathematical notation for describing distributed systems such as the cloud.
With the Tap In tools, it's possible to create a Plan to automate tasks depending on certain Plan conditions. There are a number of example Plans included with the program, and numerous specific functions are available for cloud platforms such as GoGrid, Amazon EC2, Rackspace, and Terremark vCloud.
The Plan Tool is central to how Tap In Control Plan treats jobs. At each Plan branch, there can be different conditions in which you can run scripts, which can be written in Ruby, Java or Groovy. You can have inputs and outputs from each branch, which can be passed on to the scripts or the next point in the job process.
We found that Tap In's sample scripts were out of date (the latest versions of Ruby scripts were included but old files were referenced in the samples), and we had to do a bit of modification to them to get them to work. Fortunately, the sample code didn't need to be changed much -- just the 'include' files and cloud site logon information. This should be fixed by the next release, according to Tap In.
The Tap In user interface is similar to a flow chart editor where we could put circles, squares and triangles, then connect them together. Although this seems simple, the diagrams can become quite complex based on the Petri Net model controls of these symbols. Your Boolean logic and math classes pay off here.
Code written in Java, Ruby or Groovy can run within each circle in the Plan; the circles are called places. Code can also be placed in the transition shapes. As an example, a square symbol means an OR transition, a hexagon means an AND transition, and a triangle means a thread split/join. Triangles can also represent more advanced Boolean/Von Neumann states such as interrupts and interrupt handlers.
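For readers new to Petri nets, the core idea fits in a short sketch. This is a textbook-style net, not Tap In's engine: a transition fires only when every one of its input places holds a token, which corresponds to the AND case. The OR and split/join shapes are not modelled, and the class and place names are invented for illustration.

```python
class PetriNet:
    """A tiny Petri net: places hold tokens, transitions move them."""

    def __init__(self, marking):
        self.marking = dict(marking)   # place name -> token count
        self.transitions = {}          # name -> (input places, output places)

    def add_transition(self, name, inputs, outputs):
        # Each input place should be listed once in this simplified model.
        self.transitions[name] = (list(inputs), list(outputs))

    def enabled(self, name):
        """A transition is enabled when all of its input places have a token."""
        inputs, _ = self.transitions[name]
        return all(self.marking.get(p, 0) > 0 for p in inputs)

    def fire(self, name):
        """Consume one token per input place and produce one per output place.
        Returns False, changing nothing, if the transition is not enabled."""
        if not self.enabled(name):
            return False
        inputs, outputs = self.transitions[name]
        for p in inputs:
            self.marking[p] -= 1
        for p in outputs:
            self.marking[p] = self.marking.get(p, 0) + 1
        return True


if __name__ == "__main__":
    net = PetriNet({"start": 1, "file_waiting": 1})
    net.add_transition("launch", ["start"], ["server_ready"])
    net.add_transition("encode", ["file_waiting", "server_ready"], ["done"])
    print(net.fire("encode"))   # not enabled yet: no server is ready
    print(net.fire("launch"))
    print(net.fire("encode"))
    print(net.marking)
```

In this toy net, encoding cannot begin until both a file is waiting and a server is ready, which is the kind of coordination the diagram shapes express visually.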
We used Ruby as a test language. After creating a place or other transition, we used a line to connect two places. We then built a basic Control Plan model based on Tap In's Amazon sample files.
The idea for our Control Plan was to perform a job that would scale up by launching more instances when there were video files in an Amazon S3 bucket. Inside this bucket were videos that we wanted to encode to a different format.
In our test model, we created a place to begin, and looked to the next place in the diagram to check our queue. The CheckQueue place contained inputs for our Amazon EC2 credentials, and the credentials were passed into the Ruby script. The script then checked if there were any video files in the Amazon S3 bucket.
If the queue was empty, the program logic would wait for a few minutes, and check again. But if files were found, the program logic would start the scale-up process. The process would check if our server image was available. If our server image couldn't be found for some reason, then we would stop the process cold.
Otherwise the application logic would launch an instance, move the video file to another bucket directory, and send a message to the instance to start processing the video, using the user-data for Amazon instances.
Next, the test application logic checked to make sure the instance was up and running. If it wasn't, the application logic kept checking and waiting until the instance came alive. Our video rendering app would then run, depositing its result back into our preset Amazon S3 bucket.
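The flow of our test plan can be summarized in code. The sketch below is written in Python rather than the Ruby we used in the Plan, and it is only an outline under stated assumptions: every callable passed to `run_plan` is a hypothetical hook standing in for a cloud call, and no real Amazon API is invoked.

```python
import time


def run_plan(list_videos, image_available, launch_instance, move_to_processing,
             is_running, wait_seconds=0, max_checks=10, sleep=time.sleep):
    """Walk one pass of the plan and return a short status string.
    All callable arguments are hypothetical hooks supplied by the caller."""
    # CheckQueue: poll until there is something to encode.
    for _ in range(max_checks):
        videos = list_videos()
        if videos:
            break
        sleep(wait_seconds)
    else:
        return "queue empty"

    # Stop cold if the server image cannot be found.
    if not image_available():
        return "image missing"

    # Launch an instance, passing the file name as user data, then move
    # the file so it is not picked up a second time.
    video = videos[0]
    instance = launch_instance(user_data=video)
    move_to_processing(video)

    # Keep checking until the instance reports that it is running.
    for _ in range(max_checks):
        if is_running(instance):
            return "processing " + video
        sleep(wait_seconds)
    return "instance never started"


if __name__ == "__main__":
    status = run_plan(
        list_videos=lambda: ["clip.avi"],
        image_available=lambda: True,
        launch_instance=lambda user_data: "instance-1",  # made-up identifier
        move_to_processing=lambda video: None,
        is_running=lambda instance: True,
    )
    print(status)
```

Passing the cloud calls in as arguments keeps the control logic testable on its own, much as the Control Plan Editor lets each place be stepped through separately.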
The Control Plan Editor allowed us to go through, step-by-step, to test/debug our scripts. There is a built-in source code editor to modify the scripts, along with a console to show output from each script run. Each place/transition has a separate console available to watch progress as well.
The documentation on Tap In's Wiki is useful for digging deeper in the Petri Net model.
There are a lot of sample files to play around with, as well as numerous scripts. It's also possible to connect with Tap In Systems' other product, Tap In Monitor, to get detailed monitoring information about your instances.
The developed model can be run again and again, but it does not really run in the background; you must keep it open. Tap In Systems says it is working on a control plan server that will be able to use these models in a more automated way via the XML files that are generated.
Tap In Systems Control Plan has several qualities going for it: interesting visual flow control of distributed systems with excellent control logic, a diverse number of sample plans and scripts to glean from, the best compatibility with cloud service providers we've tested so far, and capacity for enormously sophisticated cloud models.
A couple of weaknesses hobble Tap In Systems. Its monitoring of systems and instances is far less mature than RightScale's. And Tap In's Control Plans have a fairly steep learning curve. Nonetheless, there's great power here, and lots of geek appeal. At press time, a Control Plan Server became available that orchestrates multiple Control Plans.
Appistry -- CloudIQ
In contrast to RightScale and Tap In Systems, Appistry doesn't have any automation for scaling the number of instances used for application processing. New job-processing instances must be built manually, which is inconvenient by comparison.
Appistry lends itself towards more persistent application use, rather than the concept of the public cloud's 'disposable hardware'. Instead, Appistry instances can be pre-defined to accept distributed work among its pre-allocated worker instances.
Fortunately, other Appistry processing parts can be highly automated, and Appistry lends its power specifically towards Web instances and development. To get the best automation results, we found that you need to build your applications more or less around Appistry, although it's not required. Appistry applications need, at minimum, a wrapper built around the application unless Java, .Net, and/or C/C++ code is used to talk to Appistry's CloudIQ Engine.
We tested Appistry on Amazon EC2, although it's also supported on GoGrid or private clouds. Installation is easy, although after launching the first instance, we needed to copy the Private DNS host address into the user data of each subsequent instance so the others can find the first instance on Amazon Machine Images.
If you use it on your own private cloud (on your own network), the instances should be able to find each other through multicast. We occasionally found communications problems among components of Appistry instances and processes on Amazon.
The CloudIQ Engine is a runtime container for Java, .Net and/or C/C++ code. It's also possible to create other 'wrappers' around code and executables in other languages. The console displays fabrics, which are the framework of cloud instances that workers process within.
The console can be accessed on any of the instances within the fabric, and the fabrics can be woven together through instances of the Appistry Network Bridge. Console access requires a browser, an instance of Java, and Adobe Air. The CloudIQ engine can launch tasks which will then be taken care of by the fabric workers.
The CloudIQ Platform user interface divides a fabric into applications, services, packages and workers. Monitored applications are fabric processes that use services, which exist in packages, which are in turn attended to by workers. The fabric's work output is homogeneous, as workers have identical processes running on them. The fabrics can be linked together to create dependencies among the workers' discrete fabric processes.
CloudIQ Storage is similar in concept to Amazon S3, and in a way competes with S3. Each instance of CloudIQ Storage can be in a different location, but they all work together as one group and look like one virtual drive. Generally, CloudIQ files are synced with each other (that is, the same files are located on each storage location).
In the case of the Amazon Appistry images, the CloudIQ storage is built into the image, which means that by default the storage will disappear along with the instances, unless of course you change the default directories to Amazon Elastic Block Storage (EBS) volumes. This also means that storage is pre-allocated, and finite within the instance by default.
In our testing, we created a wrapper program to launch the ffmpeg video rendering application. We used the CloudIQ engine coded in such a way that if we launched the client multiple times it would distribute a task to another fabric worker. When the work was done, we copied the results over to a single EBS volume attached to the first instance. To access the files in the storage and control the storage process, we could use the 'curl' command to send HTTP requests for operations such as delete, deploy, get, put and stop, among others.
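A wrapper of the kind mentioned above can be quite small. The following is a generic sketch, not the program we ran and not tied to any Appistry API: it only builds and runs an ffmpeg command line, and the function names, the `.mp4` default and the injectable `runner` argument are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def build_command(source, target_dir, extension=".mp4"):
    """Build an ffmpeg command that converts `source` into `target_dir`.
    `-y` overwrites an existing output file; `-i` names the input."""
    source = Path(source)
    target = Path(target_dir) / (source.stem + extension)
    return ["ffmpeg", "-y", "-i", str(source), str(target)]


def render(source, target_dir, runner=subprocess.run):
    """Run the conversion and report success as True or False.
    `runner` can be swapped out so the logic is testable without ffmpeg."""
    command = build_command(source, target_dir)
    completed = runner(command, capture_output=True, text=True)
    return completed.returncode == 0


if __name__ == "__main__":
    # Prints the command only; call render() on a machine with ffmpeg installed.
    print(build_command("videos/clip.avi", "rendered"))
```

Returning a simple success flag gives whatever distributes the work a clear signal for retrying or reporting a failed render.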
There are three different types of programs installed onto a fabric: a fabric application that's a batch processing application or computing application, a service such as Tomcat, Weblogic, or Apache, or a package such as Java Development Kit (JDK), Ruby, RPM, or command line installation like "yum install".
Appistry is a sophisticated construction set for distributed cloud computing, but generally for more persistent applications. Its monitoring and reporting infrastructure relies on mostly external tools, when compared to the instance monitoring capabilities of RightScale and Tap In Systems Control Plan.
Appistry can use a variety of code that can be linked in with the Appistry APIs to produce a distributed system (or set of systems) if you're adept at coding the project, and Appistry's success is fully dependent on lots of custom coding. The results, however, could be very useful. But first, you need to get through the 1,400 pages of documentation. Fortunately, paid customers get dedicated systems engineering help, and architectural support is available as well.