---
title: "R Tutorial for Stat 331"
author: "Tae Hyun Kim"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, comment = NA, warning = FALSE, message = FALSE)
```

## Introduction

R is a popular programming language among data analysts because of its intuitive syntax and its open-source nature: it is free, and anyone can contribute. In this tutorial, I will go over the following points.

1. Basic syntax and data structure
2. Data manipulation in R
3. R markdown documents

## Basic Syntax and Data Structure

```{r, comment=NA}
# You can write a comment using the pound symbol.
# You can get help by using a question mark and the command name
?mean
?var
?data.frame
```

Assign a value to a variable using either '<-' or '='

```{r}
x <- 5
x = 5
```

Display the value of the variable with 'print'

```{r}
print(x)
```

You can find the type of a variable with 'class'

```{r}
class(x)
```

Manipulating scalars is fairly intuitive

```{r}
y = 3
print(x+y)
print(x-y)
print(x/y)
print(x*y)
```

### Vectors

There are other types of data structure. Let's create a vector. 'rep' repeats its first argument the number of times given by its second argument, so rep(5, 3) returns three fives. 'numeric' returns a numeric vector of zeros of the given length. 'c' concatenates the values you give it into a vector.

```{r}
y = rep(1, 5)
print(y)
z = numeric(5)
print(z)
w = c(1, 6, -4, 2, 3)
print(w)
```

The class of a vector is the type of its elements, because a vector can only contain elements of the same type.

```{r}
print(class(y))
print(class(z))
print(class(w))
```

You can do element-wise arithmetic on vectors.

```{r}
print(y + w)
print(w * z)
```

You can also create a sequence

```{r}
y = -3:5
print(y)
```

Or, if you want the sequence to have a different increment, you can use 'seq'

```{r}
y = seq(from = -15, to = 20, by = 5)
print(y)
```

You can access a specific element of a vector using brackets.
```{r}
print(w[3])
w[3] = 1000
print(w)
```

Or you can access multiple elements of a vector

```{r}
print(w[2:3])
```

You can compute the length, mean, variance, and standard deviation of a vector with intuitive function calls

```{r}
length(w)
mean(w)
var(w)
sd(w)
```

Note that the 'var' and 'sd' commands compute the sample versions, not the population versions: they divide by n - 1 rather than n. It's important to distinguish the two, especially in this course. Take a look at the example below, which compares the two variance formulas.

$$var_{population}(x) = \sum_{i=1}^{n} \frac{(x_i-\bar{x})^2}{n}$$

$$var_{sample}(x) = \sum_{i=1}^{n} \frac{(x_i-\bar{x})^2}{n-1}$$

```{r}
x = c(1,2,3,4,5)
m = mean(x)
sumsquares = (1-m)^2 + (2-m)^2 + (3-m)^2 + (4-m)^2 + (5-m)^2
sumsquares / length(x) #population variance formula
sumsquares / (length(x)-1) #sample variance formula
var(x) #R default variance computation
```

### Matrices

Now, let's create a matrix. I introduce two ways. The first fills the matrix with a single repeated value: here, a matrix of fours with 3 rows and 2 columns.

```{r}
A = matrix(4, 3, 2)
print(A)
```

Or you can be more specific. The two lines below create exactly the same matrix.

```{r}
B1 = matrix(c(1, 3, 2, 6, 4, -5), nrow = 3)
B2 = matrix(c(1, 3, 2, 6, 4, -5), ncol = 2)
print(B1)
print(B2)
```

Let's call this matrix B, instead of B1 and B2

```{r}
B = B1
```

You can remove objects from your environment with 'rm'. This helps you free memory when you work with large data sets. You won't need it within the scope of this class.

```{r}
rm(B1)
rm(B2)
```

You can access specific elements in matrices, too. The command below returns the elements in rows 1 and 3 of column 2.

```{r}
B[c(1, 3), 2]
```

### Logicals

Logical values are either TRUE or FALSE. Let's check whether x equals 5. Note that we use two equal signs. Since x is now the vector c(1,2,3,4,5), the comparison is made element by element.

```{r}
x==5
```

You can also assign this result to y

```{r}
y = (x==5)
print(y)
```

To check whether something is NOT equal to something else, use '!='

```{r}
y = (x != 5)
print(y)
```

### Lists

I mentioned earlier that vectors can only take elements of the same type.
A list, on the other hand, can hold objects of different types.

```{r}
ListA = list('dog', c(2, 4, 5), -4, x==6)
class(ListA)
print(ListA)
class(ListA[[1]])
class(ListA[[2]])
class(ListA[[3]])
class(ListA[[4]])
```

## Data Manipulation in R

You need to set your working directory first to access a data set stored on your computer. You can check where you are with 'getwd'

```{r, comment = NA}
getwd()
```

You can check what files are currently in your directory like this

```{r}
list.files()
```

If you are not in the correct working directory, you can change it with 'setwd'. The example below uses my personal directory.

```{r}
setwd('~/Desktop/S331')
```

Now, I will read the data set in txt format and assign it the name 'school'. This data set is under 'Schools Data' on the course website. Make sure the data set is in your current directory.

```{r}
school = read.table('schools.txt', header = TRUE)
```

Alternatively, you can specify the full path to the data

```{r}
school = read.table('~/Desktop/S331/schools.txt', header = TRUE)
```

'header = TRUE' means that the first row of the txt file contains the column names. You can also read csv files with 'read.csv'. Check the R manual to learn more about read.table and read.csv.

```{r}
class(school)
```

You can read data in RData format with 'load'. This data set is under 'School District Data' on the course website.

```{r}
load('schooldistrict.RData')
class(schooldistrict)
# Note that the object name is the same as the file name in this case, but it could be different.
# Check your 'Environment' pane in RStudio.
```

Here are some basic commands useful in initial data analysis

```{r}
head(school) #lets you see the first few rows of the data
tail(school) #lets you see the last few rows of the data
nrow(school) #number of rows
ncol(school) #number of columns
dim(school) #dimension of the data
colnames(school) #column names
```

Just to make things simpler, I will take a subset of the data and work with that.
```{r}
school2 = school[1:100, 1:3] #subset the data, rows from 1 to 100, columns from 1 to 3
summary(school2) #lets you see the summary of each column
str(school2) #shows the structure of the data ('str' stands for structure)
```

Besides using the column index, you can also access a column by its name using the dollar sign.

```{r}
school2$SCHNO
mean(school2$SCHNO)
```

You can also subset in more specific ways. I will take only the rows with value '1' in the LOCALE01 column. First, check what the LOCALE01 column looks like

```{r}
school2$LOCALE01
```

Then, take the subset, and find out how many rows have value 1

```{r}
newsubset = school2[school2$LOCALE01==1, ]
nrow(newsubset)
```
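
If you want to try this kind of subsetting without the course data, the chunk below is a minimal sketch on a small made-up data frame. The name `toy` and its columns `id`, `locale`, and `score` are hypothetical and are not part of the schools data. It only illustrates how a logical vector selects rows.

```{r}
# Hypothetical toy data frame (NOT the course data set)
toy = data.frame(id = 1:6,
                 locale = c(1, 2, 1, 3, 1, 2),
                 score = c(10, 20, 30, 40, 50, 60))

# The comparison returns one TRUE/FALSE value per row
keep = (toy$locale == 1)
print(keep)

# Rows where keep is TRUE are selected; leaving the column index empty keeps all columns
toy1 = toy[keep, ]
nrow(toy1)
mean(toy1$score)

# sum() counts the TRUE values, so it gives the same row count as nrow(toy1)
sum(keep)
```

Storing the logical vector in its own variable first makes it easy to inspect which rows will be kept before you subset.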