BigData (bd)

The BigData (bd) cluster is an SGI Hadoop Linux cluster consisting of 6 data nodes and 4 service nodes, all with Intel Xeon E5-2680 v2 CPUs @ 2.80 GHz.

Each worker node has 256 GB of memory, and all nodes are interconnected by 10-gigabit Ethernet.
The Hadoop cluster runs in High Availability (HA) mode, meaning it is configured without single points of failure. To achieve HDFS high availability, two separate machines are configured as NameNodes.

Apache Hadoop is a reliable, scalable platform for storage and analysis. Its primary components are:

  • HDFS, a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware
  • YARN, a cluster resource management system that allows any distributed program (not just MapReduce) to run on data in a Hadoop cluster
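For a concrete feel for how files move in and out of HDFS, the following minimal sketch drives the standard hdfs dfs command-line interface from Python (a sketch only, assuming Python 3.7 or later; the file and directory paths are hypothetical examples):

import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` subcommand and return its standard output."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Copy a local file into HDFS, list the target directory, then read it back.
hdfs("-put", "data.csv", "/user/myusername/data.csv")
print(hdfs("-ls", "/user/myusername"))
print(hdfs("-cat", "/user/myusername/data.csv"))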

MapReduce is a programming model for data processing. It is simple, yet it takes some effort to redesign existing code to fit the model. Hadoop can run MapReduce programs written in various languages, including Java, Python, C++, and R. The Map-Reduce tutorial describes the anatomy of a MapReduce application and the submission guidelines on Hadoop.
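To make the model concrete, here is a minimal word-count sketch in the Hadoop Streaming style: a mapper and a reducer written as two small Python scripts that read from standard input (the names mapper.py and reducer.py are hypothetical, and Python 3 is assumed; see the Map-Reduce tutorial for the actual submission guidelines on this cluster):

# mapper.py: emit "<word><TAB>1" for every word read from standard input.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py: sum the counts per word. Hadoop Streaming delivers the mapper
# output sorted by key, so all lines for a given word arrive consecutively.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

The pair can be tested locally with a pipeline such as cat input.txt | python mapper.py | sort | python reducer.py before being submitted to the cluster with the Hadoop Streaming jar.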

Apache Spark is a cluster computing framework for large-scale data processing. Spark does not use MapReduce as its execution engine; however, it is closely integrated with the Hadoop ecosystem and can run on YARN, use Hadoop file formats, and read from HDFS storage. Spark is best known for its ability to cache large datasets in memory between jobs. Its default API is simpler than MapReduce: the favored interface is Scala, but there is also support for Python. The Spark tutorial provides an introduction to the Spark framework and the submission guidelines for the client and cluster modes.
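As a minimal PySpark sketch of the in-memory caching described above (assuming the SparkSession API of Spark 2.x or later; the HDFS path is a hypothetical example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Read a text file from HDFS and mark it for caching in memory.
lines = spark.read.text("hdfs:///user/myusername/data.txt").cache()

# The first action reads from HDFS and populates the cache; the second
# is served from memory instead of re-reading the file.
print(lines.count())
print(lines.filter(lines.value.contains("error")).count())

spark.stop()

A script like this would be submitted with spark-submit in either client or cluster mode, as described in the Spark tutorial.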

To register for the BigData cluster, simply send an email to cses@princeton.edu indicating your user name and your major professor.

Log in using SSH as follows:

ssh myusername@bd.rc.princeton.edu

If connecting from off-campus, do one of the following:

  • set up an Aventail Secure Remote Access (SRA) connection, and then connect to the cluster directly
  • ssh to a host on the wired on-campus network (e.g., nobel.princeton.edu), and then ssh to the cluster from there

Hardware Configuration

Node type      Processor                      Memory per CPU   Interconnect     CPUs per node                          Memory per node   Local disk
Head node      2.80 GHz Xeon CPU E5-2680 v2   2.4 GB           10 Gb Ethernet   40 (2 threads/core, 10 cores/socket)   98 GB             500 MB
Service node   2.80 GHz Xeon CPU E5-2680 v2   10.7 GB          10 Gb Ethernet   24 (2 threads/core, 10 cores/socket)   256 GB            500 MB
Worker node    2.80 GHz Xeon CPU E5-2680 v2   10.7 GB          10 Gb Ethernet   24 (2 threads/core, 10 cores/socket)   256 GB            20 TB

Theoretical performance: 448 GFLOPS total.

Software Configuration

The software configuration includes the core tools of the Hadoop software stack and the Python libraries needed for basic math, text processing, machine learning, and time series analysis. The Spark distribution ships with machine learning and graph analysis libraries and with Spark Streaming tools for real-time analysis. The YARN resource management tool is also installed. A partial cluster software configuration is described in the table below.

Functionality      Packages
Operating System   Springdale Linux v6
Job scheduler      YARN
Python libraries   Anything that can be installed with the Anaconda and pip package managers
Spark libraries    Spark ML, MLlib, GraphX, Spark Streaming
Build tools        Scala build tool, Maven