22MBABA304 Exploratory Data Analysis for Business: Syllabus for MBA


Unit-1 Introduction to Data Mining 8 hours

Introduction to Data Mining:

Applications - Nature of the Problem - Classification Problems in Real Life: Email Spam, Handwritten Digit Recognition, Image Segmentation, Speech Recognition, DNA Expression Microarray, DNA Sequence Classification. Exploratory Data Analysis (EDA) - What is Data - Numerical Summarization - Measures of Similarity and Dissimilarity, Proximity Distance: Euclidean Distance, Minkowski Distance, Mahalanobis Distance. Visualization - Tools for Displaying Single Variables - Tools for Displaying Relationships Between Two Variables - Tools for Displaying More Than Two Variables. R Scripts - R Library: ggplot2 - R Markdown.

Unit-2 Statistical Learning and Model Selection 8 hours

Statistical Learning and Model Selection:

Prediction Accuracy - Prediction Error, Training and Test Error as a Function of Model Complexity, Overfitting a Model, Bias-Variance Trade-off. Cross Validation - Holdout Sample: Training and Test Data, Three-way Split: Training, Validation and Test Data, Cross-Validation, Random Subsampling, K-fold Cross-Validation, Leave-One-Out Cross-Validation, with examples for each.

Unit-3 Linear Regression and Variable Selection 8 hours

Linear Regression and Variable Selection:

Meaning- Review Expectation, Variance, Frequentist Basics, Parameter Estimation, Linear Methods, Point Estimate, Example Results, Theoretical Justification, R Scripts. Variable Selection- Variable Selection for the Linear Model, R Scripts.

Unit-4 Regression Shrinkage Methods and Tree-Based Methods 9 hours

Regression Shrinkage Methods and Tree-Based Methods:

Meaning, Types- Ridge Regression, Compare Squared Loss for Ridge Regression, More on Coefficient Shrinkage, The Lasso. Tree Based Methods- Construct the Tree, The Impurity Function, Estimate the Posterior Probabilities of Classes in Each Node, Advantages of the Tree-Structured Approach, Variable Combinations, Missing Values, Right Sized Tree via Pruning, Bagging and Random Forests, R Scripts, Bagging, From Bagging to Random Forests, Boosting

Unit-5 Principal Components Analysis and Classification 10 hours

Principal Components Analysis and Classification:

Singular Value Decomposition (SVD), Principal Components, Principal Components Analysis (PCA), Geometric Interpretation, Acquire Data. Classification - Classification Error Rate, Bayes Classification Rule, Linear Methods for Classification. Logistic Regression - Assumptions, Comparison with Linear Regression on Indicators, Fitting based on Optimization Criterion, Binary Classification, Multiclass Case (K ≥ 3). Discriminant Analysis - Class Density Estimation, Linear Discriminant Analysis, Optimal Classification.

Unit-6 Support Vector Machines 7 hours

Support Vector Machines:

Overview, When Data is Linearly Separable, Support Vector Classifier, When Data is NOT Linearly Separable, Kernel Functions, Multiclass SVM.

Assessment Details (both CIE and SEE)

  • The weightage of the Continuous Internal Evaluation (CIE) is 50% and that of the Semester End Exam (SEE) is 50%.
  • The minimum passing mark for the CIE is 50% of the maximum marks of the CIE.
  • The minimum passing mark for the SEE is 40% of the maximum marks of the SEE.
  • A student shall be deemed to have satisfied the academic requirements (passed) and earned the credits allotted to each course if the student secures not less than 50% in the sum total of the CIE (Continuous Internal Evaluation) and SEE (Semester End Examination) taken together.

Continuous Internal Evaluation:

There shall be a maximum of 50 CIE Marks. A candidate shall obtain not less than 50% of the maximum marks prescribed for the CIE.

CIE Marks shall be based on:

a) Tests (for 25 Marks) and

b) Assignments, presentations, quizzes, simulation, experimentation, mini project, oral examination, field work, class participation, etc. (for 25 Marks) conducted in the respective course. Course instructors are given autonomy in choosing a few of the above based on subject relevance and should maintain the necessary supporting documents for the same.

 

Semester End Examination:

The SEE question paper will be set for 100 marks and the marks scored will be proportionately reduced to 50.

  • The question paper will have 8 full questions carrying equal marks.
  • Each full question is for 20 marks with 3 sub-questions.
  • Each full question will have sub-questions covering all the topics.
  • Students will have to answer five full questions: any four from questions one to seven, with sub-questions in the pattern of 3, 7 and 10 marks, plus question eight, which is compulsory.
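
The marks arithmetic above can be sketched in a few lines of Python. This is only an illustration, not an official calculator: the function name `evaluate`, the `Result` class and the constant names are hypothetical, and the sketch assumes that all three stated minimums (CIE, SEE and the combined total) must be met together.

```python
from dataclasses import dataclass

# Maximum marks taken from the assessment details.
CIE_MAX = 50          # Continuous Internal Evaluation
SEE_PAPER_MAX = 100   # SEE question paper is set for 100 marks
SEE_SCALED_MAX = 50   # SEE score is proportionately reduced to 50


@dataclass
class Result:
    see_scaled: float   # SEE marks after reduction to 50
    total: float        # CIE + scaled SEE, out of 100
    cie_ok: bool        # CIE >= 50% of CIE maximum
    see_ok: bool        # SEE >= 40% of SEE maximum
    total_ok: bool      # total >= 50% of combined maximum

    @property
    def passed(self) -> bool:
        # Assumption: every minimum must hold at the same time.
        return self.cie_ok and self.see_ok and self.total_ok


def evaluate(cie_marks: float, see_paper_marks: float) -> Result:
    """Apply the stated CIE/SEE rules to one student's marks."""
    see_scaled = see_paper_marks * SEE_SCALED_MAX / SEE_PAPER_MAX
    total = cie_marks + see_scaled
    return Result(
        see_scaled=see_scaled,
        total=total,
        cie_ok=cie_marks >= 0.50 * CIE_MAX,
        see_ok=see_scaled >= 0.40 * SEE_SCALED_MAX,
        total_ok=total >= 0.50 * (CIE_MAX + SEE_SCALED_MAX),
    )


if __name__ == "__main__":
    r = evaluate(cie_marks=30, see_paper_marks=45)
    print(r.see_scaled, r.total, r.passed)
```

For example, a student with 25 in the CIE and 40 out of 100 in the SEE meets both individual minimums, but the combined total of 45 out of 100 falls short of the 50% requirement.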

 

Suggested Learning Resources:

Books

1. John W. Tukey “Exploratory Data Analysis”, 1st Edition, ISBN13: 978-0201076165, ISBN10: 0201076160

2. Foster Provost and Tom Fawcett. “Data Science for Business: What you need to know about data mining and data-analytic thinking”. O'Reilly Media, latest edition, ISBN-13: 978-1449361327

3. Hadley Wickham, Garrett Grolemund. "R for Data Science: Import, Tidy, Transform, Visualize, and Model Data", Publisher: "O'Reilly Media, Inc.", 2016, ISBN 1491910364, 9781491910368

4. Cathy O'Neil, Rachel Schutt. "Doing Data Science: Straight Talk from the Frontline", Publisher: "O'Reilly Media, Inc.", 2013, ISBN 144936389X, 9781449363895

 

Course outcomes:

At the end of the course the student will be able to:

CO1 Understand Data Mining and its importance. L2

CO2 Apply knowledge of research design for business problems L3

CO3 Analyze the cause and effect relationship between the variables from the analysis L4

CO4 Evaluate regression and decision-tree-based methods to solve business problems L5

Last Updated: Tuesday, January 24, 2023