Event Details

COURSE OVERVIEW

This course provides instruction on the processes and practice of data science, including machine learning and natural language processing. It covers tools and programming languages (Python, IPython, Mahout, Pig, NumPy, pandas, SciPy, scikit-learn), the Natural Language Toolkit (NLTK), and Spark MLlib.

COURSE CONTENT

DAY 1: AN INTRODUCTION TO HADOOP AND DATA SCIENCE

OBJECTIVES

  • Using Hadoop for Data Science

  • The Hadoop Distributed File System

  • The MapReduce Framework

  • Hadoop 2 and YARN

  • Machine Learning from Data

LABS

  • Setting up the Lab Environment

  • Using HDFS Commands

  • Demonstration: Understanding MapReduce

  • Using Apache Mahout for Machine Learning

DAY 2: AN INTRODUCTION TO APACHE PIG AND PYTHON

OBJECTIVES

  • Introduction to Apache Pig

  • Python Programming

  • Analyzing Data with Python

  • Running Python on Hadoop

  • Machine Learning Algorithms

LABS

  • Getting Started with Apache Pig

  • Using the IPython Notebook

  • Demonstration: Understanding the NumPy Package

  • Demonstration: The Pandas Library

  • Performing Data Analysis with Python

  • Interpolating Data Points

  • Defining User-Defined Functions in Python

  • Streaming Python with Apache Pig

  • Exploring Data with Apache Pig

  • Demonstration: Classification with Scikit-Learn

  • Computing K-Nearest Neighbor

  • Generating a K-Means Clustering

DAY 3: MACHINE LEARNING ALGORITHMS

OBJECTIVES

  • Machine Learning Algorithms Continued

  • Natural Language Processing

  • Apache Spark MLlib

  • Taking Data Science to Production

LABS

  • Demonstration: POS Tagging Using a Decision Tree

  • Using the Python Natural Language Toolkit

  • Classifying Text Using Naïve Bayes

  • Using Spark Transformations and Actions

  • Using Spark MLlib

  • Creating a Spam Classifier Using Spark MLlib

  • When: Wed Dec. 26, 9:00 am to Fri Dec. 28, 5:00 pm
