Advanced Data Analysis Techniques with Apache Spark, 12/13/17

Dec 13, 2017, 11:00 am - 4:30 pm
Room 347, Visualization Lab
Event Description
Data analysis is an integral part of most research projects in an academic environment. The ability to analyze data quickly and efficiently is a prerequisite for successful research. In the modern fast-paced world of Big Data, Apache Hadoop's MapReduce framework for batch processing has been outgrown by Apache Spark, which boasts speeds 10-100x faster than Hadoop and set the world record in large-scale sorting.

This workshop requires a programming background and experience with Python at an *intermediate* level. All exercises will use a mix of PySpark and Spark SQL (parts of Apache Spark), but previous experience with Spark or distributed computing is not required. It is strongly recommended that registrants take Introduction to Python for Scientific Computing or Introduction to Programming Using Python (offered by PICSciE) to learn or refresh their Python knowledge.

This training will attempt to articulate the expected output of scientists performing data-intensive research and then teach how to use the Spark and Hadoop software stack to meet these expectations. It will also cover a set of mini case studies, including Web Mining, Text Classification, and Image Processing exercises, that will teach students how to manipulate data sets using distributed processing with PySpark and Spark SQL. Upon completion of this course, participants will be capable of creating their own performance-maximized Spark applications.

Alexey Svyatkovskiy is a Big Data, Software and Programming Analyst with the Princeton Institute for Computational Science & Engineering (PICSciE). He holds a PhD in particle physics and has over 5 years of experience in large-scale data analysis and machine learning. His work has been presented at the IEEE Big Data conference and Spark Summit.

Please register online at the training website, or contact Andrea Rubinstein at /258-1397.