This course provides instruction on the processes and practice of data science, including machine learning and natural language processing. It covers tools and programming languages (Python, IPython, Mahout, Pig, NumPy, pandas, SciPy, scikit-learn), the Natural Language Toolkit (NLTK), and Spark MLlib.
DAY 1: AN INTRODUCTION TO HADOOP AND DATA SCIENCE
OBJECTIVES
Using Hadoop for Data Science
The Hadoop Distributed File System
The MapReduce Framework
Hadoop 2 and YARN
Machine Learning from Data
LABS
Setting up the Lab Environment
Using HDFS Commands
Demonstration: Understanding MapReduce
Using Apache Mahout for Machine Learning
DAY 2: AN INTRODUCTION TO APACHE PIG AND PYTHON
Introduction to Apache Pig
Python Programming
Analyzing Data with Python
Running Python on Hadoop
Machine Learning Algorithms
Getting Started with Apache Pig
Using the IPython Notebook
Demonstration: Understanding the NumPy Package
Demonstration: The Pandas Library
Performing Data Analysis with Python
Interpolating Data Points
Defining User-Defined Functions in Python
Streaming Python with Apache Pig
Exploring Data with Apache Pig
Demonstration: Classification with Scikit-Learn
Computing K-Nearest Neighbor
Generating a K-Means Clustering
DAY 3: MACHINE LEARNING ALGORITHMS
Machine Learning Algorithms Continued
Natural Language Processing
Apache Spark MLlib
Taking Data Science to Production
Demonstration: POS Tagging Using a Decision Tree
Using the Python Natural Language Toolkit
Classifying Text Using Naïve Bayes
Using Spark Transformations and Actions
Using Spark MLlib
Creating a Spam Classifier Using Spark MLlib
When: Wed Dec. 26, 9:00 am to Fri Dec. 28, 5:00 pm