The big data (bd) cluster is an SGI Hadoop Linux cluster consisting of 6 data nodes and 4 service nodes, all with Intel Xeon E5-2680 v2 CPUs @ 2.80 GHz.
Apache Hadoop is a reliable, scalable platform for storage and analysis. Its primary components are HDFS, a filesystem designed for storing very large files with streaming data access patterns on clusters of commodity hardware, and YARN, a cluster resource management system that allows any distributed program (not just MapReduce) to run on data in a Hadoop cluster.
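As a quick orientation, files in HDFS are manipulated with the `hdfs dfs` command rather than ordinary Unix tools; a minimal sketch (the paths and file name below are placeholders):

```
# List your HDFS home directory
hdfs dfs -ls /user/$USER

# Copy a local file into HDFS, then stream it back out
hdfs dfs -put data.csv /user/$USER/data.csv
hdfs dfs -cat /user/$USER/data.csv | head
```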
MapReduce is a programming model for data processing. The model itself is simple, but it takes some effort to restructure existing code to fit it. Hadoop can run MapReduce programs written in various languages, including Java, Python, C++, and R. The MapReduce tutorial describes the anatomy of a MapReduce application and the guidelines for submitting it on Hadoop.
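As a sketch of what a submission looks like, a Python mapper and reducer can be run through Hadoop Streaming; the jar location, script names, and HDFS paths below are assumptions, so adjust them to the cluster's installation (the tutorial covers the specifics):

```
# Submit a streaming job; mapper.py and reducer.py are your own scripts
# (the jar path is an assumption about where Hadoop is installed)
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -input /user/$USER/input \
    -output /user/$USER/output
```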
Apache Spark is a cluster computing framework for large-scale data processing. Spark does not use MapReduce as an execution engine; however, it is closely integrated with the Hadoop ecosystem and can run on YARN, use Hadoop file formats, and read from HDFS storage. Spark is best known for its ability to cache large datasets in memory between jobs. Its default API is simpler than MapReduce: the favored interface is Scala, but Python is also supported. The Spark tutorial provides an introduction to the Spark framework and the guidelines for submitting jobs in client and cluster modes.
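As a rough illustration of the two submission modes the tutorial describes (`my_job.py` is a placeholder for your own script):

```
# Client mode: the driver runs on the node you submit from (handy for interactive work)
spark-submit --master yarn --deploy-mode client my_job.py

# Cluster mode: the driver runs inside the cluster (better for long production jobs)
spark-submit --master yarn --deploy-mode cluster my_job.py
```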
To register for the bd cluster, send an email to email@example.com indicating your user name and your major professor.
Login using SSH as follows:
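(The hostname below is an assumption based on the cluster's short name; use the address provided when your account is created.)

```
# Replace <NetID> with your user name
ssh <NetID>@bd.princeton.edu
```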
If connecting from off-campus, first do one of the following:
- set up an Aventail Secure Remote Access (SRA) connection, and then connect to the cluster directly
- ssh to a host on the wired on-campus network (e.g., nobel.princeton.edu), and then ssh to the cluster from there, as shown below
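For the second option the session is a two-hop ssh; again, the cluster hostname is an assumption:

```
# First hop: a host on the wired campus network
ssh <NetID>@nobel.princeton.edu
# Second hop: from there to the cluster
ssh <NetID>@bd.princeton.edu
```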
| Node Type | Processor | Memory per CPU | Interconnect | Processing Power | Memory per Node | Local Disk |
|---|---|---|---|---|---|---|
| Head node | 2.80 GHz Xeon E5-2680 v2 | 2.4 GB | 10 Gb Ethernet | 40 CPUs | 98 GB | 500 GB |
| Service node | 2.80 GHz Xeon E5-2680 v2 | 10.7 GB | 10 Gb Ethernet | 24 CPUs | 256 GB | 500 GB |
| Worker node | 2.80 GHz Xeon E5-2680 v2 | 10.7 GB | 10 Gb Ethernet | 24 CPUs | 256 GB | 20 TB |

Theoretical performance: 448 GFLOPS total.
The software configuration includes the core tools of the Hadoop software stack and the Python libraries needed for basic math, text processing, machine learning, and time series analysis. The Spark distribution is installed with its machine learning and graph analysis libraries and with the Spark Streaming tools for real-time analysis. The YARN resource management tool is also installed. A partial cluster software configuration is described in the table below.
| Component | Details |
|---|---|
| Operating System | Springdale Linux v6 |
| Python libraries | Anything that can be installed with the Anaconda and pip package managers (see the example below) |
| Spark libraries | Spark ML, MLlib, GraphX, Spark Streaming |
| Build tools | Scala build tool (sbt), Maven |
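For example, extra Python packages can be added to a personal environment with either manager; the environment and package names here are only illustrations:

```
# Create a personal conda environment with a few packages
conda create --name myenv numpy pandas scikit-learn

# Or install a package into your user site-packages with pip
pip install --user statsmodels
```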