Stats Lab Homework using R

LAB HOMEWORK INSTRUCTIONS

——————————————————————-

#save your plots as an image file and upload separately (export > save as image)

#save them as png with filename your_name_(question number)_(histogram/plot)

#You are told that in Providence, Rhode Island, the height of women averages 64 inches (5’4″) with a standard deviation of 1.5 inches.

#Of the 178 thousand people in Providence, Rhode Island, 52% are women.
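
#A small arithmetic sketch (not part of the original assignment text):
#52% of 178 thousand gives the number of women described above.
#The names total_people, prop_women and n_women are hypothetical.

```r
total_people <- 178000   # 178 thousand residents
prop_women   <- 0.52     # 52% of residents are women
n_women <- total_people * prop_women   # number of women in the population
n_women
```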

#PART 1

#1.1) Simulate the population described above. Call the population variable popri

#1.2) Compute the mean and sd of this population.

#How do this mean and sd compare to the true mean (64) and sd (1.5)?

#1.3) Plot a density plot of this population. What does this distribution look like?

#PART 2

#2.1) Take a random sample of popri containing 10 people and call the variable sam10

#2.2) Compute the mean and sd of the sample sam10

#2.3) Take a random sample of popri containing 1000 people and call the variable sam1000.

#2.4) Compute the mean and sd of the sample sam1000

#2.5) Which is closer to the true population mean and sd: the sample with 10 people or the sample with 1000 people? Why?

#PART 3

#3.1) Create a matrix called samsri containing 200 random samples of 500 subjects in each sample from the population variable popri.

#HINT: Declare an empty matrix and then use a ‘for loop’ to fill it

#3.2) Create a vector called samsri200means that contains the means for each column (sample) from samsri.

#3.3) Plot a density plot of samsri200means (0.5 pt). What does this distribution look like?



——————————————————————————————

HERE’S A LECTURE ON WHAT WE LEARNED IN CLASS FOR THE LAB HOMEWORK DOWN BELOW FOR YOUR REFERENCE

#Lab 4-Contents

#0. Review of Normal Probability Distribution Functions

#1. Simulating Populations using Random Variables

#2. Taking Samples from a Population: The Sampling Distribution

#3. Programming in R: Using Loops

#4. Programming in R: The apply function

#5. Sampling Distribution of the Uniform Distribution

#——————————————————–

# 0. Review of Normal Probability Distribution Functions

#——————————————————–

#Last week we learned how to calculate:

#1) Probabilities from a Normal Distribution using pnorm(q, mean, sd), where q is a value (quantile)

#Ex: What is the probability of a student getting a 75 or less on the exam?

#2) Quantiles from a Normal Distribution using qnorm(p, mean, sd), where p is a probability

#Ex: What score would a student have to achieve to be in the top 10% on the exam?

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

#EXERCISE 0-1: What is the probability of a student

#getting a 75 or less on the exam given that

#the scores on the exam follow a normal distribution

#of mean 78, and standard deviation of 10?

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

pnorm(75, 78, 10)

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

#EXERCISE 0-2: What score would a student have to achieve

#to be in the top 10% on the exam given

#the scores on the exam follow a normal distribution of mean 78,

#and standard deviation of 10?

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

qnorm(.9, 78, 10)
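
#A minimal sketch of how pnorm() and qnorm() relate, reusing the exam numbers above.
#The variable names (p_low, p_high, p_high_alt, score_back, cutoff) are hypothetical.
#It shows the upper tail computed two ways, and that qnorm() undoes pnorm().

```r
# P(score <= 75) for a Normal(mean = 78, sd = 10) exam
p_low <- pnorm(75, mean = 78, sd = 10)

# P(score > 75): two equivalent ways to get the upper tail
p_high     <- 1 - p_low
p_high_alt <- pnorm(75, mean = 78, sd = 10, lower.tail = FALSE)

# qnorm() is the inverse of pnorm(): feeding the probability back returns the score
score_back <- qnorm(p_low, mean = 78, sd = 10)

# The top 10% cutoff is the 90th percentile
cutoff <- qnorm(0.9, mean = 78, sd = 10)

p_low; p_high; score_back; cutoff
```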

#——————————————————–

#1. Simulating Populations using Random Variables

#——————————————————–

#The difference between a population and a sample:

#Your sample is the group of individuals who participate

#in your study, and your population is the broader group

#of people to whom your results will apply.

#Therefore: “population” in statistics includes all members of a defined group.

#A part of the population is called a sample.

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#Last week (Lab 3), we used the command rnorm()

#to create a variable with a normal distribution

#Random Normal variable: rnorm(n, mean, sd)

#NOTE: We can specify how large our population is, the mean, and the SD.

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

#EXERCISE 1-1:

# Create a normally distributed population variable called pop

# that consists of 10,000 subjects with mean of 15 and sd of 2

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

#set.seed(1)

pop = rnorm(n=10000, mean=15, sd=2)

#Let’s verify that this did what we wanted.

hist(pop); mean(pop); sd(pop)

#??????????????????????????????????????????????????????????????????????????????????#

#Thought Question 1: Why are your mean and sd for pop slightly different

#from mine? Or rather, why does no one get exactly a mean of 15 and sd of 2?

#??????????????????????????????????????????????????????????????????????????????????#
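
#A small sketch related to the commented-out set.seed(1) line above.
#It assumes nothing beyond base R; draw_a, draw_b and draw_c are hypothetical names.
#Setting the same seed before rnorm() reproduces the same draws; without it, the draws differ.

```r
set.seed(1)
draw_a <- rnorm(n = 10000, mean = 15, sd = 2)

set.seed(1)                                     # same seed again
draw_b <- rnorm(n = 10000, mean = 15, sd = 2)   # identical to draw_a

draw_c <- rnorm(n = 10000, mean = 15, sd = 2)   # no reseed: a fresh set of draws

identical(draw_a, draw_b)    # same seed, same numbers
identical(draw_a, draw_c)    # different draws
mean(draw_a); mean(draw_c)   # both close to 15, but not exactly 15
```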

#??????????????????????????????????????????????????????????????????????????????????#

# Now, let’s pretend this variable pop is a population of

# undergrads + graduate students at USC who have ever used marijuana

#Thought Question 2: Considering that USC is ~20k students,

#In reality, could I actually collect this information from every USC

#student to form this distribution? What should I do instead?

#??????????????????????????????????????????????????????????????????????????????????#

#—————————————————————

#2. Taking Samples from a Population: The Sampling Distribution

#—————————————————————

#Most research studies deal with samples

#We can take a sample from data we consider to be our population

#by using the sample() function in R

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#

#Random sample: sample(x, size, replace=TRUE)

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#
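
#A tiny sketch of what the replace argument does, using a made-up population of 5 values.
#small_pop, with_repl and without_repl are hypothetical names, not part of the lab.

```r
set.seed(2)
small_pop <- 1:5

# replace = TRUE: each draw is put back, so the same value can appear more than once
with_repl <- sample(x = small_pop, size = 5, replace = TRUE)

# replace = FALSE: each value can be drawn at most once, so this is a reshuffle
without_repl <- sample(x = small_pop, size = 5, replace = FALSE)

with_repl
without_repl
```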

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

#Example 1: Let’s pretend we are researchers.

#While ideally we would like to study the POPULATION of marijuana users

#at USC, we realize that we only have funding to ask 200 students.

#We can see what our data might look like if we take a sample from

#the population variable pop.

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

sam = sample(x=pop, size=200, replace=TRUE)

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

#Exercise 2-1: Calculate the mean and SD of the sample sam.

#How do these results differ from the mean

# and SD of the pop variable?

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

mean(sam); sd(sam); hist(sam)

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

#Exercise 2-2:

# A) Create a random sample called sam20 from pop containing 20 subjects.

# B) Create a random sample called sam750 from pop containing 750 subjects.

# C) Compute the Means and SDs for sam20 and sam750. Create Histograms for both.

# D) How do the means from each sample compare to the true population mean of 15?

# E) Is there an association of the number of people in the sample with the magnitude of

# the difference from the population mean?

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

#A)

sam20=sample(pop, 20, replace=TRUE)

#B)

sam750=sample(pop, 750, replace=TRUE)

#C)

mean(sam20); mean(sam750)

sd(sam20); sd(sam750)

plot(density(sam20)); lines(density(sam750), col="red") #straight quotes are required; curly quotes cause a syntax error

#D)

abs(15-mean(sam20)); abs(15-mean(sam750))

#E)

#As the number of people in the sample goes up,

#the sample mean tends to be closer to the true population mean
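
#A hedged sketch of the answer to E): a single pair of samples can be misleading,
#so this repeats the comparison 500 times per sample size and averages the error.
#demo_pop, sizes and avg_err are hypothetical names; sapply() and replicate() are
#base R helpers used here in place of a loop.

```r
set.seed(3)
demo_pop <- rnorm(10000, mean = 15, sd = 2)   # stand-in for pop
sizes <- c(20, 750)

# For each sample size, draw 500 samples and average |15 - sample mean|
avg_err <- sapply(sizes, function(n) {
  mean(replicate(500, abs(15 - mean(sample(demo_pop, n, replace = TRUE)))))
})
names(avg_err) <- sizes

avg_err           # average distance from 15 for n = 20 and n = 750
2 / sqrt(sizes)   # theoretical standard error of the mean, sd / sqrt(n)
```

#The larger sample has the smaller average error, in line with sd / sqrt(n).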

#—————————————————————

#3. Programming in R: Using Loops

#—————————————————————

#In practice, as researchers we almost always have samples

#and NEVER really know the true population

#Simulating a population and taking samples from it can tell us something

#about how well a given estimator (mean, trimmed mean, median etc.)

#represents a distribution (e.g. normal vs. skewed)

#To begin to understand how taking samples can give us information about an estimator,

#we need to take MANY samples from our simulated population.

#Let’s say we wanted to have 100 different samples of pop with 200 subjects in each sample.

#We could do this two ways:

#1) Write the function sample() many times

sam1=sample(pop, 200, replace=TRUE)

sam2=sample(pop, 200, replace=TRUE)

# …

sam100=sample(pop, 200, replace=TRUE)

#2) Or use a loop

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#

#Loop over ii: for (ii in X:Y) { # ii is the counter

#COMMANDS WITH ii # X is the first value of the counter

#} # Y is the last value of the counter

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#

# x=23+24;x

#print(x)

#Here is a simple loop going from a value of 1 to 10

for (jj in 1:10) {

print(jj)

} #NOTE: When executing the loop,

#you MUST highlight and run the entire loop from { to } including the brackets.

# jj will take the values specified with the “in 1:10” argument

#At first jj = 1, then it increases by 1 (1,2,3,4,5…) until it reaches 10

#Also, we can name the counter whatever we want.

#Below I’ve used kk to demonstrate this

for (kk in 1:15) {

print(kk)

}

#Back to our goal:

#We want to be able to take 100 different samples of pop with 200 people in each sample

#A good way to do this is to first create an EMPTY matrix to put our data

#into using the matrix() command

mysams = matrix(NA, ncol=100, nrow=200) #? Why ncol=100 and nrow=200?

#Then we can use a loop to place each of our samples into a column of this empty matrix called “mysams”

for (ii in 1:100) {

mysams[ ,ii] = sample(pop, size=200, replace=TRUE)

}

#Look to your right and double click over ‘mysams’
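
#A sketch of the same matrix-filling idea, plus one possible shortcut.
#It assumes a stand-in population demo_pop; loop_sams and rep_sams are hypothetical names.
#replicate() is a base R function that repeats an expression and, when every result
#has the same length, binds the results as columns of a matrix.

```r
set.seed(4)
demo_pop <- rnorm(10000, mean = 15, sd = 2)   # stand-in for pop

# Loop version: one column per sample, one row per subject
loop_sams <- matrix(NA, nrow = 200, ncol = 100)
for (ii in 1:100) {
  loop_sams[, ii] <- sample(demo_pop, size = 200, replace = TRUE)
}

# replicate() version: 100 samples of 200, also stored column by column
rep_sams <- replicate(100, sample(demo_pop, size = 200, replace = TRUE))

dim(loop_sams)   # 200 rows, 100 columns
dim(rep_sams)    # same shape
```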

#—————————————————————

#4. Programming in R: The apply function

#—————————————————————

# We just learned how to use a loop to take MANY random samples

# from a population variable. While the purpose of this

# may not be clear just yet, it will be later on in the semester.

#Once we have our dataset containing 100 samples of 200 people, I’d like to find out the mean of each sample

#I could do this two ways:

#1) By manually doing it

mean(mysams[,1])

mean(mysams[,2])

#…

mean(mysams[,100])

#2) By using a loop

for (jj in 1:100) {

print( mean(mysams[,jj]) ) #I have to use print() here because things in loops don’t get output to the screen without it

}

#However, I don’t just want to KNOW the means of each sample, instead I’d like to have a variable

#where each observation is the mean of a given sample so that I can analyse the means of the samples

#We can do this by first creating an empty Vector of length 100

sam100means = numeric(100)

#And then using a loop to populate the vector

for (jj in 1:100) {

sam100means[jj] = mean(mysams[,jj])

}

#With this, I can examine the average (mean) of the means for each sample

mean(sam100means)

#And their distribution

hist(sam100means)

#There is an easier way to get this sam100means variable,

#We can use the apply() function!

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#

#Collapse Data (apply): apply(X, MARGIN, FUN)

# X=dataset; MARGIN: 1=Rows, 2=Columns; FUN=Function

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#
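
#A tiny sketch of what MARGIN does, on a made-up 2 x 3 matrix.
#m, row_sums and col_sums are hypothetical names used only for this illustration.

```r
m <- matrix(1:6, nrow = 2)   # filled column by column: columns are (1,2), (3,4), (5,6)

row_sums <- apply(X = m, MARGIN = 1, FUN = sum)   # one result per row
col_sums <- apply(X = m, MARGIN = 2, FUN = sum)   # one result per column

row_sums   # 9 12
col_sums   # 3 7 11
```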

#I will create a variable called sam100means2 containing the column means of mysams

sam100means2 = apply(X=mysams, MARGIN=2, FUN=mean)

#In the above: MARGIN=2 tells R to do the operation on the columns

#FUN=mean tells R to take the mean

#A Density plot of this:

plot(density(sam100means2))
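
#A hedged check of the apply() result, on a stand-in for mysams.
#demo_pop, demo_sams, means_apply and means_col are hypothetical names.
#colMeans() is a base R shortcut for column means; the last two lines compare the
#spread of the sample means with the standard error sd / sqrt(n).

```r
set.seed(5)
demo_pop  <- rnorm(10000, mean = 15, sd = 2)
demo_sams <- replicate(100, sample(demo_pop, size = 200, replace = TRUE))

means_apply <- apply(X = demo_sams, MARGIN = 2, FUN = mean)
means_col   <- colMeans(demo_sams)

all.equal(means_apply, means_col)   # both approaches give the same 100 means

sd(means_apply)            # spread of the 100 sample means
sd(demo_pop) / sqrt(200)   # standard error predicted from the population sd
```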

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

#Exercise 4:

# A) Create a variable called sam100sd that contains the standard deviations of each

# sample from mysams. Use whatever method you prefer to do this.

# B) Show a density plot of the SDs from mysams

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

#A)

sam100sd = apply(X=mysams, MARGIN=2, FUN=sd)

#Alternately

sam100sd1=numeric(100)

for (i in 1:100) {

sam100sd1[i]=sd(mysams[, i])

}

#B)

plot(density(sam100sd))

#—————————————————————

#5. Sampling Distribution of the Uniform Distribution

#—————————————————————

#While we’ve seen that the distribution of means

#from random samples taken from a NORMAL population

#is normally distributed, what if our population is not normally distributed?

#Let’s see for example the uniform distribution

# which can be created using the runif() function

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#

#Random Uniform variable: runif(n)

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#

#I’ll create a uniform population distribution of 10,000 subjects

popunif = runif(10000) #Uniform distribution on [0, 1]

#This distribution looks like:

hist(popunif)

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

#Exercise 5:

# A) Create a matrix called unifsams that contains 150 random samples of 250 subjects from popunif

# B) Create a variable called unif150means that contains the means for each of the 150 samples

# C) Create a density plot of means in unif150means. What does this plot look like?

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

#A)

unifsams = matrix(NA, ncol=150, nrow=250)

for (ii in 1:150) {

unifsams[,ii] = sample(popunif, 250, replace=TRUE)

}

#B)

unif150means = apply(unifsams, 2, mean)

#C)

plot(density(unif150means ))

#Approximately normally distributed: remember the Central Limit Theorem.

# Read section 5.3.2 of the book for a theoretical explanation of this last exercise

#Section:5.3.2 Approximating the Sampling Distribution of the Sample Mean: The General Case
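
#A numerical sketch of the Central Limit Theorem claim for the uniform case.
#demo_unif and demo_means are hypothetical names. For a uniform distribution on [0, 1]
#the mean is 0.5 and the sd is sqrt(1/12), so the sample means should center near 0.5
#with spread near sqrt(1/12) / sqrt(250).

```r
set.seed(6)
demo_unif <- runif(10000)   # stand-in for popunif

# 150 samples of 250 subjects; keep only the mean of each sample
demo_means <- replicate(150, mean(sample(demo_unif, size = 250, replace = TRUE)))

mean(demo_means)           # close to 0.5
sd(demo_means)             # close to the value on the next line
sqrt(1 / 12) / sqrt(250)   # theoretical standard error of the mean

# plot(density(demo_means))   # uncomment to view the bell-shaped curve
```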