Harvard Stats Homework Using R Lab 5 Help

Homework - Lab 5:

#The central limit theorem (CLT) states that as the sample size gets sufficiently large, the distribution of the sample means will be approximately normal, whatever the shape of the population (as long as its variance is finite).

#In addition, the CLT has been used to justify the fact that for many of our statistics we rely upon computing the mean (not median or trimmed mean) of our samples


#There are a few problems with the CLT.

#1) How large a sample is needed?

#2) It seems that our experiments with the contaminated normal may contradict this.

#In this homework assignment you will investigate the CLT further.


#PART 1 – The Central Limit Theorem under Normality.

#1.1) Simulate a standard normal population of 1 million people called pop1

#1.2) Draw 5000 samples of size 20 and put these in sam20. Draw 5000 samples of size 50 and put these in sam50.

#1.3) Create variables called sam20means and sam50means that contain the means of the samples. Use a density plot to show the sampling distributions of sam20means and sam50means together.

#1.4) Compare the Standard Error (SE) of the sampling distributions. Which sample size creates better estimates of the population mean (i.e., has the lowest SE)?


#PART 2 – The Central Limit Theorem under Non-Normality

#2.1) Simulate a contaminated normal population of 1 million people called pop2 using cnorm(), where 30% (epsilon=0.3) of the data have an SD of 30 (k=30).

#2.2) Draw 5000 samples of size 30 and put these in sam30. Draw 5000 samples of size 100 and put these in sam100.

#2.3) Create variables called sam30means, sam30tmeans, sam100means, sam100tmeans that represent the means AND trimmed means for the samples.

#2.4) Use a density plot to show the sampling distribution of the means and trimmed means for these variables.

#2.5) Compare the Standard Error (SE) of the sampling distributions.

#2.6) Which would be better here: a larger sample size using the mean as the location estimator OR a smaller sample using the trimmed mean?

#2.7) Which location estimator performs the best, regardless of sample size?



————————————————————————————————————————————————-

Lab 5 lecture notes:

#Lab 5-Contents

# 1. Sampling Distribution of the Mean,

# Median, and Trimmed Mean under Normality

# 2. Sampling Distribution of the Mean,

# Median, and Trimmed Mean under Non-Normality

# 3. The Central Limit Theorem

# Last week we saw that when we had a Normal or Uniform population,

# the means of random samples taken from that population

#were approximately normally distributed.

#Today we are going to investigate the distributions of the mean,

#median, and trimmed mean from samples coming from Normal

# and non-normal populations.

#———————————————————————————

# 1. Sampling Distribution of the Mean, Median,

# and Trimmed Mean under Normality

#———————————————————————————

#Let’s start by generating a standard normal distribution (mean=0, SD=1) for 1 million subjects

pop1 = rnorm(1000000, mean=0, sd=1)

#We will use this as our population from a normal distribution

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

#EXERCISE 1-1:

#A) Find the mean, median, trimmed mean (using tmean() ), and sd of pop1

#B) Draw a density plot of pop1

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

#A)

mean(pop1); median(pop1);

tmean(pop1); sd(pop1)

#B)

plot(density(pop1))
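
# Note: tmean() is not a base R function. These notes appear to assume it has
# been loaded from Dr. Wilcox's function collection, where (as far as I know)
# it defaults to a 20% trimmed mean; check your own copy to confirm.
# The sketch below shows what a 20% trimmed mean does, using base R's
# mean(x, trim=0.2) and a manual version. The names x_demo, g, tm_base and
# tm_manual are made up for this illustration.

```r
set.seed(1)
x_demo = rnorm(1000)

# Base R: drop the lowest 20% and highest 20% of values, then average the rest
tm_base = mean(x_demo, trim = 0.2)

# The same thing done by hand
x_sorted = sort(x_demo)
g = floor(0.2 * length(x_sorted))   # number of values cut from each tail
tm_manual = mean(x_sorted[(g + 1):(length(x_sorted) - g)])

tm_base; tm_manual   # the two values should agree
```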

#Like we did last week, we are going to want to take random samples

# from our population and then compute a measure of central tendency

#(e.g., mean, median, trimmed mean) for each sample and examine

#the distribution of this measure.

#We are going to take 5000 samples of 20 subjects

sam1 = matrix(NA, nrow=20, ncol=5000) # empty matrix: one row per subject, one column per sample

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

# EXERCISE 1-2: Use a loop to draw 5000 samples of size 20 from pop1

# and place the samples in sam1

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

for (ii in 1:5000) {

sam1[ , ii] = sample(pop1, 20, replace=TRUE)

}
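
# As an aside, the same kind of sample matrix can be built without an explicit
# loop by using replicate(). This is only an alternative sketch, not the method
# the exercise asks for. pop_demo and sam_demo are hypothetical names, and a
# smaller population is used so that it runs quickly.

```r
set.seed(1)
pop_demo = rnorm(100000, mean = 0, sd = 1)

# replicate() repeats the expression 5000 times and binds the results
# together as columns, giving a 20 x 5000 matrix (one sample per column)
sam_demo = replicate(5000, sample(pop_demo, 20, replace = TRUE))

dim(sam_demo)
```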

# Now that we have our matrix containing all 5000 samples (i.e., sam1)

# we can begin to create variables for each of our location measures

#I’ll start us off with the mean

sam1means = apply(sam1, 2, mean) # the 2 means "apply the function to each column" (1 would mean each row)

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

# EXERCISE 1-3: Use the apply function to generate

# the variables sam1meds (medians) and sam1tmeans (trimmed mean)

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

sam1meds = apply(sam1, 2, median)

sam1tmeans = apply(sam1, 2, tmean)

#Let’s look at the distributions of each of these location estimators

plot(density(sam1means))

lines(density(sam1meds), col="red")

lines(density(sam1tmeans), col="blue")

abline(v = mean(pop1), lty=2) #Add in a line for the pop1 mean

#??????????????????????????????????????????????????????????????#

#Thought Question 1: Which location estimator performs the best

#for data coming from a normal population? Why?

#??????????????????????????????????????????????????????????????#

# One of the ways we can determine which location estimator

# performs the best is by looking at the standard deviation

# of the estimator across all the samples.

# The estimator with the lowest SD will have the least amount

# of variability across the samples.

# The standard deviation of a location estimator across samples

# is more commonly called the Standard Error or SE

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

# EXERCISE 1-4: Find the Standard Error of the sample means,

# medians, and trimmed means. Based upon the SE, which

# location estimator is the best for samples coming from

# a normal population?

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

sd(sam1means); sd(sam1meds); sd(sam1tmeans)

#The mean performs the best.

# In real life, we generally cannot go out and collect multiple samples

# from a population, so we estimate the Standard Error of the mean from a single sample using a formula:

# SE = sd(sample) / sqrt(sample N)
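
# To see how the formula relates to what we did above, the sketch below compares
# three numbers for samples of size 20 from a standard normal population:
# the SD of many sample means, the formula applied to one single sample,
# and the theoretical value sigma/sqrt(N) = 1/sqrt(20).
# All names ending in _demo, and the se_* variables, are made up for this illustration.

```r
set.seed(1)
pop_demo = rnorm(100000, mean = 0, sd = 1)
sam_demo = replicate(5000, sample(pop_demo, 20, replace = TRUE))

# 1) SE found the "simulation" way: SD of the 5000 sample means
se_empirical = sd(colMeans(sam_demo))

# 2) SE estimated from ONE sample using the formula
one_sample = sam_demo[, 1]
se_formula = sd(one_sample) / sqrt(length(one_sample))

# 3) Theoretical SE when the population SD is known to be 1
se_theory = 1 / sqrt(20)

se_empirical; se_formula; se_theory
```

# The first and third values should be very close. The second is based on only
# 20 observations, so it will bounce around from sample to sample.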

#———————————————————————————

# 2. Sampling Distribution of the Mean, Median,

# and Trimmed Mean under Non-Normality

#———————————————————————————

# Normal distributions generally have very few outliers.

# However, when outliers begin to occur more frequently, some of the

# basic assumptions about normal distributions are no longer true

# (as we are about to see).

# One distribution that is like a normal distribution,

# but with more outliers is called a mixed or contaminated

# normal distribution and it is a result of two populations mixing together.

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

#EXAMPLE 1: “a” will be a mix of TWO populations 1: with SD=1 and 2: with SD=2

a=c(rnorm(5000, 0, 1), rnorm(5000, 0, 2))

#Let’s compare this to b, which is from ONE normal population but with the same mean and SD as a

b=rnorm(10000, mean(a), sd(a))

plot(density(a))

lines(density(b), col="red")

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

#??????????????????????????????????????????????????????????????#

#Thought Question 2: How are a and b from Example 1 different?

#??????????????????????????????????????????????????????????????#

#Thankfully, rather than having to create contaminated normal distributions the hard way, we can just use

#a function provided to us by Dr. Wilcox called cnorm()

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#

#Contaminated/Mix Normal Distribution: cnorm(n, epsilon=0.1, k=10)

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#

#Let’s look at the options for the contaminated normal distribution:

#cnorm() combines two normal distributions:

#1) A standard normal (mean=0, sd=1) for a proportion 1-epsilon of the data

#2) A normal of mean=0 and sd=k for a proportion epsilon of the data
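
# cnorm() is not part of base R, so it has to be loaded from Dr. Wilcox's functions
# before it can be used. To make the description above concrete, here is a minimal
# sketch of how such a generator could be written. cnorm_sketch is a hypothetical
# name and this is NOT Dr. Wilcox's actual code, only an illustration of the idea.

```r
cnorm_sketch = function(n, epsilon = 0.1, k = 10) {
  stopifnot(epsilon >= 0, epsilon <= 1, k > 0)
  z = rnorm(n)                        # start with standard normal values
  contaminated = runif(n) <= epsilon  # pick roughly epsilon of them at random
  z[contaminated] = k * z[contaminated]  # stretch those so their SD becomes k
  z
}

set.seed(1)
z_demo = cnorm_sketch(100000, epsilon = 0.5, k = 2)

# With both components centered at 0, the variance should be near
# 0.5 * 1^2 + 0.5 * 2^2 = 2.5
mean(z_demo); var(z_demo)
```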

#If we were trying to re-create the variable a we made in example 1 we would have to do:

z=cnorm(10000, epsilon=0.5, k=2)

plot(density(a))

lines(density(z), col="blue")

#Which looks very very similar to a!

#Let’s create a second population called pop2 from a contaminated normal distribution

pop2 = cnorm(1000000, epsilon=0.1, k=10)

#The mean, sd, and plot of which are:

mean(pop2); sd(pop2); plot(density(pop2))
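
# As a check on the sd() value above: when both parts of the mixture have mean 0,
# the variance of the mixture is the weighted sum of the two variances.
# The short calculation below works this out for epsilon=0.1 and k=10.
# The names var_theory and sd_theory are made up for this illustration.

```r
epsilon = 0.1
k = 10

# (1 - epsilon) of the data have variance 1^2, epsilon of the data have variance k^2
var_theory = (1 - epsilon) * 1^2 + epsilon * k^2   # 0.9 + 10 = 10.9
sd_theory = sqrt(var_theory)                       # about 3.30

var_theory; sd_theory
```

# So even though 90% of the population has an SD of 1, the overall SD is more
# than three times larger because of the 10% of values with an SD of 10.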

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

#EXERCISE 2:

#A) Create an empty matrix called sam2 to contain 5000 samples

#of 20 observations each

#B) Populate sam2 with 5000 random samples of size 20 from pop2

#C) Compute the mean (sam2means), median (sam2meds),

#and trimmed mean (sam2tmeans) for each sample

#D) Create an overlaid density plot of each location estimator WITH the pop2

#mean as a vertical line

#E) Find the SE of each location estimator

#F) Based upon the SE, which location estimator is the best

# for samples coming from a contaminated normal distribution

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

#A)

#B)

#C)

#D)

#E)

#F)

#———————————————————————————

# 3. The Central Limit Theorem

#———————————————————————————

#We’ve discovered a few things today:

#1) When a population comes from a normal distribution,

# the mean will be the best location estimator of the samples

#2) When a population comes from a mixed/contaminated normal distribution,

# the trimmed mean is the best location estimator

# These observations are related to the Central Limit Theorem (CLT)

# that is discussed in Section 5.3 of the book (page 85)

# The CLT states that as the sample size gets sufficiently large,

# the distribution of the sample means will be approximately normal.

# We saw a demonstration of this last week when we looked at the means

# from the uniform distribution.
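
# As a reminder of what that looks like, here is a small sketch (not last week's
# actual code) that draws 5000 samples from a uniform(0, 1) distribution at two
# sample sizes and overlays the densities of the sample means.
# The names means_n2 and means_n30 are made up for this illustration.

```r
set.seed(1)

# Means of samples of size 2 and of size 30 from a uniform(0, 1) population
means_n2 = replicate(5000, mean(runif(2)))
means_n30 = replicate(5000, mean(runif(30)))

# With N=2 the density is roughly triangular; with N=30 it is narrow and bell shaped
plot(density(means_n30), xlim = c(0, 1), main = "Means of uniform samples")
lines(density(means_n2), col = "red")
abline(v = 0.5, lty = 2)   # the population mean of a uniform(0, 1)

# The SE shrinks as N grows: theory says sqrt(1/12) / sqrt(N)
sd(means_n2); sd(means_n30)
```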

# The CLT has been used to justify the fact that for many of our statistics

# we rely upon computing the mean (not median or trimmed mean) of our samples

#There are a few problems with the CLT.

#1) How large a sample do we need?

#2) It seems that our experiments with the contaminated normal may contradict this.

#In the homework you will investigate this further