Apache Spark is an open-source distributed processing engine used for big data workloads. According to Amazon, its use of in-memory caching makes it particularly suited to batch processing, streaming, graph databases and machine learning.

Amazon's announcement follows one from IBM earlier this week that it has devoted 3,500 researchers and developers to maintaining and developing Spark, as it too prepares to offer Spark as a service later this month.


EMR supports Spark version 1.3.1 and utilises Hadoop YARN as the cluster manager. Running Spark on top of EMR has been possible before, but the integrated support should make using the engine more straightforward. IT staff can create a cluster from the AWS Management Console, for example. Spark applications developed using Scala, Python, Java, and SQL can all run on EMR.

Amazon and IBM will go head to head later this month, when IBM also starts offering a Spark service. IBM said on Monday that its service will allow developers to build and run their own machine learning algorithms.

Amazon's pricing is based on the cost of the underlying EC2 instances and a separate charge for the processing service.

Running Spark on EMR with a basic c3.xlarge instance costs US$0.263 per hour on-demand, while using the more capable c3.8xlarge instance costs US$1.95 per hour. There are also more expensive instances with large amounts of memory or storage to choose from (so-called memory and storage optimised instances). The per-instance price then has to be multiplied by the number of nodes used.
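To make the arithmetic concrete, here is a minimal Python sketch of that multiplication. It assumes the two hourly figures quoted above are the full per-node rates (EC2 plus the EMR charge); the function name, the node counts and the running time are hypothetical, and the sketch ignores details such as partial-hour billing or data transfer fees.

```python
# Per-node on-demand rates in US dollars per hour, as quoted in the article.
# Assumption: each figure already includes both the EC2 and the EMR charge.
HOURLY_RATE_USD = {
    "c3.xlarge": 0.263,
    "c3.8xlarge": 1.95,
}


def estimate_cluster_cost(instance_type, nodes, hours):
    """Return a rough cluster cost: hourly rate x number of nodes x hours."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    if hours < 0:
        raise ValueError("hours must not be negative")
    return HOURLY_RATE_USD[instance_type] * nodes * hours


if __name__ == "__main__":
    # Hypothetical example: ten c3.xlarge nodes running for one day.
    cost = estimate_cluster_cost("c3.xlarge", nodes=10, hours=24)
    print(f"Estimated cost: US${cost:.2f}")
```

On these assumptions, ten c3.xlarge nodes running for 24 hours would come to about US$63.12, while the same cluster built from c3.8xlarge instances would cost about US$468.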