Introduction to Big Data with Apache Spark, 12/10-12/11

Dec 10, 2015, 2:00 pm to Dec 11, 2015, 5:00 pm
121 Lewis Library
Event Description
Data analysis is an integral part of most research projects in an academic environment, and the ability to analyze data quickly and efficiently is a prerequisite for successful research. This course will attempt to articulate the expected output of a scientist performing data-intensive research and then teach how to use the Spark and Hadoop software stack to meet these expectations. The course will cover a set of mini-case studies, including Web Mining, Text Processing and Classification, and Image Processing, with exercises that teach students how to manipulate data sets using parallel processing with PySpark and Spark SQL.

This two-part course requires a programming background and experience with Python at an intermediate level. All exercises will use a mix of PySpark and Spark SQL (parts of Apache Spark), but previous experience with Spark or distributed computing is not required. It is strongly recommended that students take Introduction to Python for Scientific Computing or Getting Started with Python if they need to learn or refresh their Python knowledge.

Alexey Svyatkovskiy is a programmer and Big Data expert at Princeton University. He holds a Ph.D. in particle physics and has over 5 years of experience in large-scale data analysis, physics analysis, and machine learning for the CMS experiment at the CERN Large Hadron Collider.

Please register online at the training website, or contact Andrea Rubinstein at /258-1397.