Introduction to Data Mining:
Applications; Nature of the Problem; Classification Problems in Real Life: Email Spam, Handwritten Digit Recognition, Image Segmentation, Speech Recognition, DNA Expression Microarray, DNA Sequence Classification. Exploratory Data Analysis (EDA): What is Data; Numerical Summarization; Measures of Similarity and Dissimilarity; Proximity Distance: Euclidean Distance, Minkowski Distance, Mahalanobis Distance. Visualization: Tools for Displaying Single Variables, Tools for Displaying Relationships Between Two Variables, Tools for Displaying More Than Two Variables. R Scripts: R Library ggplot2, R Markdown.
Statistical Learning and Model Selection:
Prediction Accuracy: Prediction Error, Training and Test Error as a Function of Model Complexity, Overfitting a Model, Bias-Variance Trade-off. Cross-Validation: Holdout Sample (Training and Test Data), Three-Way Split (Training, Validation and Test Data), Cross-Validation, Random Subsampling, K-Fold Cross-Validation, Leave-One-Out Cross-Validation, with examples for each.
Linear Regression and Variable Selection:
Meaning: Review of Expectation, Variance, Frequentist Basics, Parameter Estimation, Linear Methods, Point Estimate, Example Results, Theoretical Justification, R Scripts. Variable Selection: Variable Selection for the Linear Model, R Scripts.
Regression Shrinkage Methods and Tree-Based Methods:
Meaning, Types: Ridge Regression, Compare Squared Loss for Ridge Regression, More on Coefficient Shrinkage, The Lasso. Tree-Based Methods: Construct the Tree, The Impurity Function, Estimate the Posterior Probabilities of Classes in Each Node, Advantages of the Tree-Structured Approach, Variable Combinations, Missing Values, Right-Sized Tree via Pruning, Bagging and Random Forests, R Scripts, Bagging, From Bagging to Random Forests, Boosting.
Principal Components Analysis and Classification:
Singular Value Decomposition (SVD), Principal Components, Principal Components Analysis (PCA), Geometric Interpretation, Acquire Data. Classification: Classification Error Rate, Bayes Classification Rule, Linear Methods for Classification. Logistic Regression: Assumptions, Comparison with Linear Regression on Indicators, Fitting Based on Optimization Criterion, Binary Classification, Multiclass Case (K ≥ 3). Discriminant Analysis: Class Density Estimation, Linear Discriminant Analysis, Optimal Classification.
Support Vector Machines:
Overview, When Data is Linearly Separable, Support Vector Classifier, When Data is NOT Linearly Separable, Kernel Functions, Multiclass SVM.
Assessment Details (both CIE and SEE)
Continuous Internal Evaluation:
There shall be a maximum of 50 CIE Marks. A candidate shall obtain not less than 50% of the maximum marks prescribed for the CIE.
CIE Marks shall be based on:
a) Tests (for 25 Marks) and
b) Assignments, presentations, quizzes, simulation, experimentation, mini project, oral examination, field work, class participation, etc. (for 25 Marks) conducted in the respective course. Course instructors are given autonomy in choosing a few of the above based on subject relevance and should maintain the necessary supporting documents for the same.
Semester End Examination:
The SEE question paper will be set for 100 marks and the marks scored will be proportionately reduced to 50.
Suggested Learning Resources:
Books
1. John W. Tukey. "Exploratory Data Analysis", 1st Edition, ISBN-13: 978-0201076165, ISBN-10: 0201076160
2. Foster Provost and Tom Fawcett. "Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking", O'Reilly Media, latest edition, ISBN-13: 978-1449361327
3. Hadley Wickham and Garrett Grolemund. "R for Data Science: Import, Tidy, Transform, Visualize, and Model Data", O'Reilly Media, Inc., 2016, ISBN-10: 1491910364, ISBN-13: 9781491910368
4. Cathy O'Neil and Rachel Schutt. "Doing Data Science: Straight Talk from the Frontline", O'Reilly Media, Inc., 2013, ISBN-10: 144936389X, ISBN-13: 9781449363895
Course Outcomes:
At the end of the course the student will be able to:
CO1 Understand Data Mining and its importance. L2
CO2 Apply knowledge of research design for business problems. L3
CO3 Analyze the cause-and-effect relationship between the variables from the analysis. L4
CO4 Evaluate regression and decision-tree-based methods to solve business problems. L5