On this web page you will find:

- The class outline. For each section, you can obtain the class notes in PDF and the R code used to generate the analyses and graphs.
- Links for homework: the data needed, assignment sheets in PDF, and the LaTeX files.
- Books often referenced.
- Computing resources.
- General class information.

Lecture | Title | Description | Notes | Code
---|---|---|---|---
NA | Review | Stuff you should know: basics of probability, the central limit theorem, and inference | NA | 
1 | Introduction to Regression and Prediction | We will describe linear regression in the context of a prediction problem. | | R
2 | Overview of Supervised Learning | Regression for predicting bivariate data, K nearest neighbors (KNN), bin smoothers, and an introduction to the bias/variance trade-off. | | R
3-4 | Linear Methods for Regression | Subset selection and ridge regression. We will use the singular value decomposition (SVD) and principal component analysis (PCA) to understand these methods. | | R
5 | Linear Methods for Classification | Linear regression, linear discriminant analysis (LDA), and logistic regression. | | R
6 | Kernel Methods | Kernel smoothers, including loess. We will briefly describe two-dimensional smoothers. We will also define degrees of freedom in the context of smoothing and learn about density estimators. | | R
7 | Model Assessment and Selection | We revisit the bias/variance trade-off. We describe how Monte Carlo simulations can be used to assess bias and variance. We then introduce cross-validation, AIC, and BIC. | | R
8 | The Bootstrap | We give a short introduction to the bootstrap and demonstrate its utility in smoothing problems. | | R
9-10 | Splines, Wavelets, and Friends | We give intuitive and mathematical descriptions of splines and wavelets. We use the SVD to understand them better and see connections with signal processing methods. | | R
11-12 | Additive Models, GAM, and Neural Networks | We move back to cases with many covariates. We introduce projection pursuit and additive models as well as generalized additive models. We briefly describe neural networks and explain the connection to projection pursuit. | | NA
13-14 | CART, Boosting, and Additive Trees | We introduce classification and regression trees (CART) as well as more modern versions such as random forests. | | archive for CART, archive for others
15 | Model Averaging | Bayesian statistics, boosting, and bagging. | | NA
16 | Clustering algorithms | Notes and code taken from my microarray class. | | R
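A recurring theme of the early lectures is KNN regression and the bias/variance trade-off. The course materials are in R; the sketch below is an equivalent Python illustration on synthetic data (the function `knn_predict` and all data are illustrative assumptions, not course code).

```python
# Minimal KNN regression sketch (lectures 1-2 themes).
# Synthetic data only; not one of the course datasets.
import numpy as np

def knn_predict(x_train, y_train, x_new, k):
    """Predict at each x_new by averaging the y-values of the k nearest training points."""
    dist = np.abs(x_train[None, :] - x_new[:, None])   # pairwise distances
    nearest = np.argsort(dist, axis=1)[:, :k]          # indices of k nearest neighbors
    return y_train[nearest].mean(axis=1)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

grid = np.linspace(0, 2 * np.pi, 50)
smooth = knn_predict(x, y, grid, k=25)   # large k: smoother fit, more bias
wiggly = knn_predict(x, y, grid, k=1)    # k=1: interpolates the noise, high variance
```

With k = 1 the fit reproduces each training point exactly (zero training error), which is precisely where the variance side of the trade-off bites.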

**Homework:**

- Homework 1 [Due 4/10]:
Look through the top journal in your field for a paper in which a regression analysis was performed, many covariates were available, and p-values were reported. If your field is mathematical (statistics, biostatistics, engineering, etc.), look through the top journal of your favorite public health application. If you don't have one, use the American Journal of Epidemiology (there should be plenty of regression analyses in this journal).

  - Discuss how the model was motivated. Deductively, empirically, both, or neither?
  - Give your thoughts on their model choice. Could they have done something differently? Are the results described model driven?
  - Where does the p in p-value come from? That is, where does the randomness come from? A random sample, randomization, or nature...? If nature, write a paragraph explaining how.

- Homework 2 [Due 4/17]
  - Use this training data to predict the outcomes for this data. You should give the 500 predictions and an estimate of the number of mistakes you've made. Please send a text file with only the predictions (separated by spaces), and include a description of what you did. Whoever predicts best wins first prize; whoever best estimates the number of mistakes they make comes in second. Prizes will be handed out.
  - Derive the discriminant function for LDA (third equation on page 75) and show it is linear.
  - Show that LDA and regression are equivalent when the outcomes are binary.
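For the LDA item, the form being derived can be sketched as follows, under the standard assumptions (Gaussian class densities with a common covariance matrix Σ, class priors π_k); the exact equation numbering on page 75 of the textbook is not reproduced here:

```latex
% Log posterior odds between classes k and l. The quadratic terms in x
% cancel because the classes share the covariance \Sigma, so what remains
% is linear in x -- which is what the homework asks you to show.
\log \frac{\Pr(G = k \mid x)}{\Pr(G = \ell \mid x)}
  = \log \frac{\pi_k}{\pi_\ell}
  - \tfrac{1}{2}\,(\mu_k + \mu_\ell)^{T} \Sigma^{-1} (\mu_k - \mu_\ell)
  + x^{T} \Sigma^{-1} (\mu_k - \mu_\ell)
```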

- Homework 3 [Due 4/24]
  - Download the Strontium data [text file] and fit polynomials of degree 1, 2, 3, 4, 6, and 12, a regression spline (you pick the knots), and a smoothing spline. Make plots of the data and the fitted curves.
  - Write a paragraph describing your project.
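The polynomial and smoothing-spline fits above would normally be done in R (e.g., `lm(y ~ poly(x, d))` and `smooth.spline`); the sketch below is an equivalent Python illustration. The Strontium data file is not reproduced here, so synthetic data stands in for it, and the smoothing parameter `s` is an ad hoc assumption.

```python
# Illustrative sketch only: synthetic data stands in for the Strontium data.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 100))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Polynomial fits of the degrees listed in the assignment.
# Polynomial.fit rescales x internally, keeping degree 12 well conditioned.
fits = {d: np.polynomial.Polynomial.fit(x, y, d) for d in (1, 2, 3, 4, 6, 12)}

# Cubic smoothing spline; s controls the bias/variance trade-off.
spline = UnivariateSpline(x, y, s=len(x) * 0.3**2)

# Training RSS for each polynomial degree (higher degree => nested model,
# so training RSS cannot increase).
rss = {d: float(np.sum((y - p(x)) ** 2)) for d, p in fits.items()}
```

Plotting the fitted curves over a fine grid (e.g., with `matplotlib`) then mirrors the "plots of the data and the fitted curves" part of the assignment.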

- Homework 4 [Due 5/1]
  - From this data, give your best estimate of y (yhat) and confidence bands for each of the given x-values. First prize goes to the smallest RSS; second prize goes to the submission whose confidence bands contain the true f(x) entirely with the smallest area between the bands.
  - Turn in the first draft of your project.
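One generic way to produce pointwise confidence bands, in the spirit of the bootstrap lecture, is to refit the model on resampled data; the assignment does not mandate this method, and the sketch below uses synthetic data and a simple linear fit purely for illustration.

```python
# Hedged sketch: bootstrap percentile bands around a fitted curve.
# Synthetic data; the actual homework data file is not reproduced here.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = 2 * x + rng.normal(scale=0.2, size=x.size)

B = 500
n = x.size
grid = x
boot_fits = np.empty((B, grid.size))
for b in range(B):
    idx = rng.integers(0, n, n)            # resample (x, y) pairs with replacement
    coef = np.polyfit(x[idx], y[idx], 1)   # refit the simple linear model
    boot_fits[b] = np.polyval(coef, grid)

yhat = np.polyval(np.polyfit(x, y, 1), grid)          # full-data estimate
lower, upper = np.percentile(boot_fits, [2.5, 97.5], axis=0)  # 95% pointwise bands
```

The area between the bands (the second-prize criterion) can then be approximated by `np.trapz(upper - lower, grid)`.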

- Project [Last day of class]

**Data-sets:**

- All data except vowel training [Zip file]
- Prostate Cancer Data [Description, R image, CSV file]
- Vowel Training Data [Description, Train, Test, R Image]
- Strontium data [text file]
- CD4 data [text file]
- All Mouse data [text file]
- Mouse Body Temperature data [text file]
- Diabetes data [text file]
- Kyphosis data [text file]
- Microarray data [text file]
- Cholostyramine data [text file]
- Intensity data [text file]
- gam.datasets data [S-Plus file]
- Pollution data sets [data in CSV, variable descriptions]

- T. Hastie, R. Tibshirani, and J. Friedman (2001). *The Elements of Statistical Learning*. Springer-Verlag: New York. [Web Page]
- W. N. Venables and B. D. Ripley (2002). *Modern Applied Statistics with S*. Springer-Verlag: New York.
- B. D. Ripley (1996). *Pattern Recognition and Neural Networks*. Cambridge University Press.

- Course title: Statistical Learning: Algorithmic and Nonparametric Approaches (140.649)
- Lab Hour:
- Instructor: Rafael Irizarry
- Department of Biostatistics
- Phone 410-614-5157, email: rafa@jhu.edu
- I assume you know: Linear algebra and statistical principles at a 651--654 level.
- It will be useful to learn one of the following programming languages: R (recommended), S-Plus, or MATLAB.
- Grading: 3 homeworks 60%, 2 quizzes 20%, 1 project 20%
- Course description: Teaches public health students to use modern, computationally based methods for exploring and drawing inferences from data. After a brief review of probability, the central limit theorem, and inference, the course covers resampling methods, non-parametric regression, prediction, dimension reduction, and clustering. Specifically covers: Monte Carlo simulation, the bootstrap, cross-validation, splines, locally weighted regression, CART, random forests, neural networks, support vector machines, and hierarchical clustering.
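Of the techniques the course description lists, cross-validation admits a particularly compact sketch. In the course itself this would be done in R; the Python illustration below uses synthetic data, and the helper `kfold_mse` is an assumption for illustration only.

```python
# Hedged sketch of k-fold cross-validation for choosing polynomial degree.
# Synthetic data; all names here are illustrative, not course code.
import numpy as np

def kfold_mse(x, y, degree, k=5):
    """Estimate test MSE of a degree-`degree` polynomial fit by k-fold CV."""
    n = len(x)
    folds = np.array_split(np.random.default_rng(0).permutation(n), k)
    errs = []
    for hold in folds:
        train = np.setdiff1d(np.arange(n), hold)       # all indices not held out
        coef = np.polyfit(x[train], y[train], degree)  # fit on training folds
        errs.append(np.mean((y[hold] - np.polyval(coef, x[hold])) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 120)
y = x**3 - x + rng.normal(scale=0.1, size=x.size)

# CV should prefer the cubic model for data generated from a cubic.
cv_linear = kfold_mse(x, y, 1)
cv_cubic = kfold_mse(x, y, 3)
```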

Last updated: 4/18/2006