Apache Spark is a cluster computing framework for large-scale data processing. Spark does not use MapReduce as its execution engine; however, it is closely integrated with the Hadoop ecosystem and can run on YARN, use Hadoop file formats, and read from and write to HDFS. Note that we chose to decommission our Hadoop cluster at Princeton in favor of using GPFS on our current clusters. Spark works very well with GPFS.
Spark is best known for its ability to cache large datasets in memory between intermediate calculations. Its API is simpler than that of MapReduce: the primary API is Scala, but Python, R and Java are also supported.
At the beginning of the tutorial we will learn how to launch and use the Spark shell. Next we will show how to prepare a simple Spark word count application in Python and Scala and run it in the interactive shell, or in client or cluster mode using the YARN scheduler.
For more details on Spark, one can refer to the external documentation.
Interactive Spark shell
Spark provides an interactive shell, which gives a simple way to learn the API as well as to analyze datasets interactively. One can launch the shell by simply running:
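spark-shell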
This will launch the Spark shell with a Scala interpreter. Alternatively run:
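pyspark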
to use the Python interpreter.
RDDs, transformations and actions

Spark's main abstraction is the resilient distributed dataset (RDD), a fault-tolerant collection of elements that can be operated on in parallel across the nodes of the cluster. One way to create an RDD is to parallelize an existing collection in the driver program using the parallelize method of the SparkContext (available as sc in the shell):
input_data = [1, 2, 3, 4, 5]
distrData = sc.parallelize(input_data)
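Once created, the distributed dataset can be operated on in parallel. For example, summing the elements of the list:

distrData.reduce(lambda a, b: a + b)  # returns 15 to the driver program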
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. An example of a transformation is a map, and an example of an action is a reduce:
lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)
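If lineLengths were needed again later, it could be kept in memory (this is the caching mentioned above) by persisting it before the reduce; a minimal sketch:

lineLengths.persist()  # keep lineLengths in memory after it is first computed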
The word count application in Python reads a text file, splits each line into words, counts the occurrences of each word, and prints the result:

from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile("hdfs://...")
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))
    sc.stop()
The same word count application in Scala:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("ScalaWordCount")
val sc = new SparkContext(conf)
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
val output = counts.collect()
Packaging Scala Spark applications with SBT
A minimal build.sbt file for the application looks like this:

name := "Hello World"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // Spark dependency
  "org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided"
)
Spark job submission and monitoring
Once packaged, an application is launched with the spark-submit script:

spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
The main command line arguments are summarized below:

|Command line argument||Definition|
|--class||The entry point for your application|
|--master||The master URL, e.g. local or yarn|
|--deploy-mode||Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client)|
|--conf||Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap "key=value" in quotes.|
|application-jar||Path to a bundled jar including your application and all dependencies|
|application-arguments||Arguments passed to the main method of your main class, if any|
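As a concrete illustration (the script name wordcount.py is hypothetical and assumes the Python word count above has been saved locally), the application could be submitted to YARN in client mode with:

spark-submit --master yarn --deploy-mode client wordcount.py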