Problem statement
Your team has a new project: to develop a smartphone app that your customers can download and use.
Challenge
Your boss has just presented the project to the investors, but they don’t want to finance it without a prior success analysis. They require a well-founded estimate of the expected proportion of app users based on a sample. And as if that weren’t enough, they want the sample size to be chosen so that
- an estimation precision of \(1\%\)
- at the high probability level of \(95\%\)
is guaranteed.
Your boss is beside herself: ‘What kind of estimation precision?! If we carry out a survey, we can precisely calculate the proportion of app users and their average willingness to pay! Or not?!’
Can you help your boss understand the investors’ concerns?
Your mission
Surveys are expensive! Tell your boss how large the minimum sample size must be so that the required estimation precision is achieved at the 95% certainty level and the investors are satisfied.
Your intermediate steps:
- Explain to your boss why the proportion of potential app users from the survey may not exactly match the proportion of potential app users in the entire customer base.
- Find out what the distribution of the corresponding estimator for the share of app users looks like, and whether you can use it to determine the sample size needed for the required precision.
Problem statement
How large should the minimum sample size be so that the required estimation precision of \(1\%\) is achieved at the certainty level of \(95\%\) and the investors are satisfied?
Sampling as a random process
When participants in a customer survey are selected randomly with equal probability, this corresponds to random sampling.
From the textbook…
A sample is used to obtain information about the behavior of a characteristic in the population.
\((X_1,\ldots,X_n)\) denotes a random sample of size \(n\) of \(X\) if the \(X_i, i=1,\ldots,n\), are independent and follow the same distribution as \(X\).
The observed realizations \(x_1,\ldots, x_n\) of the random vector \((X_1,\ldots,X_n)\) are called sample realizations.
An estimator is a function of a random sample. Its realization, an estimate, always depends on a specific sample realization and varies if a new sample is taken.
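The difference between an estimator and an estimate can be sketched in R. This is a minimal illustration, not part of the course code: the population below is simulated, and its size (10,000) and true share of app users (0.3) are assumptions chosen purely for the example, as is the helper name `estimator`.

```r
# Hypothetical population: 10,000 customers, 30% of whom would use the app.
# Both numbers are assumptions for illustration only.
set.seed(1)
population <- rbinom(n = 10000, size = 1, prob = 0.3)

# The estimator is a function of the sample: here, the sample proportion.
estimator <- function(x) mean(x)

# Three different samples of size 50 lead to three different estimates.
estimates <- replicate(3, estimator(sample(population, size = 50, replace = TRUE)))

print(mean(population))  # share in the whole (simulated) population
print(estimates)         # estimates vary from sample to sample
```

Each run of `sample()` produces a new sample realization, so the same estimator returns a different estimate each time.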
Your approach
That sounds very abstract at first… You can see what happens when a sample is taken in the animation.
To better understand how a sample is constructed, try taking a (simulated) sample from the customer database and use it to estimate the average customer age.
Customer age is stored in the system, so here you have the full population of all customers and can determine the population mean age (the average over all customers) directly. This allows you to simulate the sampling process and to compare the estimates with the true average age of all customers.
Your experiments
Now simulate the sample survey:
- Randomly select \(n=20\) customers (their ages are your survey values, i.e. the sample values).
- Calculate the average age of these selected customers.
- Compare your estimate with the true average of the population.
Repeat the steps several times and see what happens (See the interactive database app \(\downarrow\)).
Now increase the sample size and repeat the steps again. Will the estimation results be more accurate?
Interactive database app
[Interactive app: choose the sample size n and draw a sample of n customers. The app shows the mean age in the sample, the mean age in the population, and the estimates based on different samples.]
Coding
The relevant customer data is contained in the table named kunden, column alter. Program the calculation of the mean customer age on the population level.
Customer database:
Modify the following code to compute the average customer age and its standard deviation (use the data column kunden$alter for your calculations):
mean(NA)
sd(NA)
mean(kunden$alter)
sd(kunden$alter)
Now draw a sample with sample size \(n=20\) using sample(x=kunden$alter, size=20, replace=TRUE) and calculate the sample mean.
mean(NA)
mean(sample(x=kunden$alter, size=20, replace=TRUE))
You can execute the line of code above multiple times so that you can see what happens when a new sample is drawn.
Alternatively, you can program repeated sampling. Complete the fourth line in the following code window:
wiederholungen=100 # number of desired repetitions of the experiment
schaetzwerte<-numeric(wiederholungen) # placeholder for the estimates
for (i in 1:wiederholungen){
schaetzwert<-mean(NA) # estimate from the current draw
schaetzwerte[i]<-schaetzwert # store all estimates in one vector
}
plot(schaetzwerte, pch=16, col=2, main = "Estimates from different sample draws", xlab="Draw", ylab="Value", ylim=c(20,40))
abline(h=30) # population mean
# Line four:
schaetzwert<-mean(sample(x=kunden$alter, size=20, replace=TRUE))
Increase the size of each sample in the fourth line (e.g. size = 200). What changes?
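The effect of the sample size can also be quantified rather than just read off the plot. The sketch below does not use the course table: `ages_sim` is a simulated stand-in for kunden$alter (normal with mean 30 and standard deviation 10, an assumption), and `draw_estimates` is a hypothetical helper name.

```r
set.seed(42)
# Simulated stand-in for kunden$alter (assumption: mean 30, sd 10).
ages_sim <- rnorm(n = 5000, mean = 30, sd = 10)

# Draw `repetitions` samples of the given size and return their sample means.
draw_estimates <- function(size, repetitions = 1000) {
  replicate(repetitions, mean(sample(x = ages_sim, size = size, replace = TRUE)))
}

sd_20  <- sd(draw_estimates(size = 20))   # spread of estimates for n = 20
sd_200 <- sd(draw_estimates(size = 200))  # spread of estimates for n = 200

print(c(sd_20 = sd_20, sd_200 = sd_200))
print(sd_20 / sd_200)  # compare with sqrt(200 / 20)
```

With ten times as many observations per sample, the spread of the estimates should shrink by a factor of roughly \(\sqrt{10}\), not by a factor of 10.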
Distribution of the estimator
Our estimator, the sample mean \(\bar{X}=\frac 1n \sum_{i=1}^n X_i\), is a random variable before the sampling takes place. Its values change depending on the sample realization \(x_1,x_2,\ldots, x_n\). The values fluctuate less as the sample size \(n\) increases.
But how exactly does the fluctuation depend on the sample size?
The answer is provided by the central limit theorem (CLT).
If the sample variables \(X_1,X_2,\ldots, X_n\) are independent and identically distributed with a finite mean \(\mu\) and a finite variance \(\sigma^2\), then for large \(n\) the sample mean is approximately normally distributed: \[\bar{X}\stackrel{a}{\sim} N(\mu, \frac{\sigma^2}{n}),\] or in its standardized version: \[\frac{\bar X-\mu}{\sigma/\sqrt{n}}\stackrel{a}{\sim}N(0;1).\]
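The CLT can be checked by simulation even when the underlying distribution is far from normal. The following sketch assumes an exponential distribution with rate 1 (mean 1, standard deviation 1); this choice, the sample size 50 and the number of repetitions are illustrative assumptions.

```r
set.seed(123)
n <- 50
mu <- 1      # mean of Exp(rate = 1)
sigma <- 1   # standard deviation of Exp(rate = 1)

# 5000 standardized sample means from a clearly skewed distribution
z <- replicate(5000, (mean(rexp(n, rate = 1)) - mu) / (sigma / sqrt(n)))

# Share of standardized means inside [-q, q] with q = qnorm(0.975);
# if the normal approximation is good, this is close to 0.95.
coverage <- mean(abs(z) <= qnorm(0.975))
print(coverage)
```

A histogram of `z` (for example via `hist(z)`) gives a visual impression of how close the standardized means are to the standard normal shape.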
You can try out how it all works in practice in the interactive database app below.
Estimate based on a sample
\((X_1,\ldots,X_n)\) denotes a random sample of size \(n\) of \(X\).
The estimator \(\bar{X}=\frac 1n \sum_{i=1}^n X_i\) is a random variable before sampling
with the approximate distribution \(\bar{X}\stackrel{a}{\sim} N(\mu, \frac{\sigma^2}{n})\), or standardized: \(\frac{\bar X-\mu} {\sigma/\sqrt{n}}\stackrel{a}{\sim}N(0;1).\)
The actual task
Can you find the required sample size \(n\) so that, in your survey simulation, the mean age of the customers can be estimated with a \(\color{red}{\text{precision}}\) of half a year (\(\color{blue}{0.5}\)) with a probability of at least \(\color{green}{95\%}\)?
This means \(\mathbb P(\color{red}{|\bar X-\mu|}\leq \color{blue}{0.5})\geq \color{green}{0.95}\) or, equivalently, \(\mathbb P(\color{blue}{-0.5}\leq \color{red}{\bar X-\mu}\leq \color{blue}{0.5})\geq \color{green}{0.95}\). Use \(\frac{\bar X-\mu}{\sigma/\sqrt{n}}\stackrel{a}{\sim}N(0;1)\) and try to solve for \(n\).
Hint
\[\begin{align*}&\mathbb P(\color{blue}{-0.5}\leq \color{red}{\bar X-\mu}\leq \color{blue}{0.5})\geq \color{green}{0.95}\\ &\Leftrightarrow \mathbb P\left(-\frac{\color{blue}{0.5}}{\sigma/\sqrt{n}}\leq \frac{\color{red}{\bar X-\mu}}{\sigma/\sqrt{n}}\leq \frac{\color{blue}{0.5}}{\sigma/\sqrt{n}}\right)\geq \color{green}{0.95}. \end{align*}\] Because the standardized mean is approximately standard normal, \[\mathbb P\left(-q_{\color{green}{0.975}}\leq \frac{\color{red}{\bar X-\mu}}{\sigma/\sqrt{n}}\leq q_{\color{green}{0.975}}\right)\approx \color{green}{0.95},\] where \(q_{\alpha}\) is the \(\alpha\)-quantile of the standard normal distribution and \(q_{0.025}=-q_{0.975}\) by symmetry. Note: \(0.975 = 1-(1-\color{green}{0.95})/2 = (1+\color{green}{0.95})/2\). In this case we use \(q_{0.975}= 1.96.\) The requirement is therefore met if the standardized bound is at least as large as this quantile: \[\frac{\color{blue}{0.5}}{\sigma/\sqrt n}\geq q_{\color{green}{0.975}},\] which gives \[n \geq \left(\frac{q_{\color{green}{0.975}}\cdot \sigma}{\color{blue}{0.5}}\right)^2.\]
Note that the sample size depends on the unknown variance in the population. In practice, an estimate of the variance is used. To be on the safe side, you should assume a rather high value.
When it comes to estimating a share (a proportion), it’s a little easier: the variance \(p(1-p)\) of a single 0/1 observation is at most \(\frac 14\).
Coding
With the help of the central limit theorem, the distribution of sample means is approximated by the normal distribution: \[\overline X\stackrel{a}{\sim} N(\mu,\frac{\sigma^2}{n})\]
You have already determined the population parameters (mean \(30\) and variance \(100\)) from the customer table. Now calculate the probability:
\[\mathbb P(29.5\leq\overline X\leq 30.5)\]
using the R function pnorm() and \(n=20\).
pnorm(q=30.5, mean=NA, sd=NA) - pnorm(q=29.5, mean=NA, sd=NA)
pnorm(q=30.5, mean=30, sd=sqrt(5)) - pnorm(q=29.5, mean=30, sd=sqrt(5))
Next, calculate the probability:
\[\mathbb P(29.5\leq\overline X\leq 30.5)\]
using the R function pnorm() and \(n=1000\).
pnorm(q=30.5, mean=NA, sd=NA) - pnorm(q=29.5, mean=NA, sd=NA)
pnorm(q=30.5, mean=30, sd=sqrt(0.1)) - pnorm(q=29.5, mean=30, sd=sqrt(0.1))
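The two pnorm() results can be cross-checked by simulation. The sketch below takes the population mean 30 and variance 100 from the text and additionally assumes that ages are normally distributed, which the course does not state; `prob_within` and `sim_within` are hypothetical helper names.

```r
set.seed(7)
mu <- 30      # population mean from the text
sigma <- 10   # population standard deviation (variance 100)

# Normal-approximation probability that the sample mean lies within mu +/- g
prob_within <- function(n, g = 0.5) {
  se <- sigma / sqrt(n)
  pnorm(mu + g, mean = mu, sd = se) - pnorm(mu - g, mean = mu, sd = se)
}

# Simulated counterpart (assumption: normally distributed ages)
sim_within <- function(n, g = 0.5, reps = 5000) {
  means <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))
  mean(abs(means - mu) <= g)
}

p_20 <- prob_within(20)
p_1000 <- prob_within(1000)
print(c(formula = p_20, simulation = sim_within(20)))
print(c(formula = p_1000, simulation = sim_within(1000)))
```

The formula gives roughly 0.18 for \(n=20\) and roughly 0.89 for \(n=1000\), so even 1000 customers are not enough to reach the 95% level at a precision of half a year.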
Find the 95% quantile of \(\bar X\) using the R function qnorm() and \(n=1000\).
You need the number \(q_{0.95}\) such that: \[\begin{align} \mathbb P(\bar X\leq q_{0.95}) = 0.95. \end{align}\]
qnorm(p=0.95, mean=NA, sd=NA)
qnorm(p=0.95, mean=30, sd=sqrt(0.1))
Now you can implement the formula for calculating the required sample size when a certain precision genauigk and a certainty level wniveau are given.
Hint
Note: due to the central limit theorem, \(\frac{\bar X-\mu}{\sigma/\sqrt{n}}\stackrel{a}\sim N(0;1)\). If the precision genauigk and the certainty level wniveau are specified, this means:
\(\mathbb P(|\bar X-\mu|\leq \text{genauigk})\stackrel{!}\geq \text{wniveau}\),
or standardized:
\(\mathbb P\left(-\sqrt{n}\,\text{genauigk}/\sigma\leq\frac{\bar X-\mu}{\sigma/\sqrt{n}}\leq \sqrt{n}\,\text{genauigk}/\sigma\right)\stackrel{!}\geq \text{wniveau}\).
After rearranging, the latter leads to:
\(2\Phi(\sqrt{n}\,\text{genauigk}/\sigma)-1\stackrel{!}\geq \text{wniveau}\)
and consequently
\(\sqrt{n}\,\text{genauigk}/\sigma\stackrel{!}\geq\Phi^{-1}\big((\text{wniveau}+1)/2\big)\).
Here \(\Phi(\cdot)\) denotes the distribution function of the standard normal distribution and \(\Phi^{-1}(\cdot)\) its quantile function. Try to follow the calculations and solve the result for \(n\) at the end.
Complete the following R function, which accepts as arguments the required precision genauigk, the probability level wniveau and the population variance varianz, and returns the required sample size.
Modify the following lines of code based on your calculations:
n_fun<-function(genauigk,wniveau,varianz){
n<-ceiling(NA)
return(n)
}
n_fun<-function(genauigk,wniveau,varianz){
n<-ceiling((qnorm((wniveau+1)/2,0,1)/genauigk)^2*varianz)
return(n)
}
n_fun(genauigk=0.5,wniveau=0.95,varianz=100)
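As a cross-check of the function call above, the same calculation can be done by hand with the rounded quantile \(q_{0.975}=1.96\) used earlier and \(\sigma=\sqrt{100}=10\):

\[\begin{align*} n &\geq \left(\frac{q_{0.975}\cdot \sigma}{0.5}\right)^2 = \left(\frac{1.96\cdot 10}{0.5}\right)^2 = 39.2^2 = 1536.64, \end{align*}\]

so at least \(n=1537\) customers are needed after rounding up. This is consistent with the earlier finding that \(n=1000\) is not yet sufficient for a precision of half a year at the 95% level.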
Optimal sample size
For certainty level \(w\), precision \(g\) and standard deviation (or its estimate) \(\sigma\), the sample size must satisfy
\(n \geq \left(\frac{q_{(1+w)/2}\cdot \sigma}{g}\right)^2.\)
Optimal sample size
The answer has to come!
You’ve experimented enough. The boss is pushing, and you have to provide an answer.
The estimator for the app user share is: \(\overline{X}=\frac 1n\sum_{i=1}^n X_i\) with each \[X_i=\begin{cases}1, & \text{if the } i\text{-th customer will use the app},\\ 0, & \text{otherwise}. \end{cases}\]
Similar to our “Customer Age” experiment, \(\overline X\stackrel{a}{\sim} N\Big(\mathbb E(X); \frac{\text{Var}(X)}{n}\Big)\) with \(\mathbb E(X)=p\) and \(\text{Var}(X)=p(1-p)\) (as \(X\) follows a Bernoulli distribution). That is, in your derived formula you have to replace \(\sigma^2\) with \(p(1-p)\), and then you can calculate the sample size, right?
The problem is: WE DON’T KNOW \(p\)! But the solution is simple. Consider the graph of \(f(p)=p(1-p)\). What is the maximum value of the function?
So use the maximum possible variance for your calculation!
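The maximum can be found with a short calculation. For \(f(p)=p(1-p)=p-p^2\) on \([0,1]\):

\[\begin{align*} f'(p) &= 1-2p = 0 \quad\Leftrightarrow\quad p=\tfrac 12,\\ f''(p) &= -2 < 0,\\ f\big(\tfrac 12\big) &= \tfrac 12\cdot\tfrac 12 = \tfrac 14. \end{align*}\]

So \(p(1-p)\leq \frac 14\) for every \(p\), and a sample size computed with the variance \(\frac 14\) is large enough whatever the true share turns out to be.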
Coding
Use the implemented function n_fun() to determine the required sample size for a precision of 0.01, a certainty level of 0.95 and a variance of 0.25 (modify the final line, the call to n_fun()):
n_fun<-function(genauigk,wniveau,varianz){
n<-ceiling((qnorm((wniveau+1)/2,0,1)/genauigk)^2*varianz)
return(n)
}
n_fun(genauigk=NA,wniveau=NA,varianz=NA)
n_fun<-function(genauigk,wniveau,varianz){
n<-ceiling((qnorm((wniveau+1)/2,0,1)/genauigk)^2*varianz)
return(n)
}
n_fun(genauigk=0.01,wniveau=0.95,varianz=0.25)
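To see how strongly the required precision drives the survey cost, the function can be evaluated for several precision values. The sketch repeats the definition of n_fun() from above so that it runs on its own; the vector `precisions` is an illustrative choice, not part of the task.

```r
# Same function as above, repeated so the example is self-contained.
n_fun <- function(genauigk, wniveau, varianz) {
  n <- ceiling((qnorm((wniveau + 1) / 2, 0, 1) / genauigk)^2 * varianz)
  return(n)
}

# Illustrative precision values (assumption): 5, 2 and 1 percentage points
precisions <- c(0.05, 0.02, 0.01)

# Required sample sizes at the 95% level with the maximum variance 0.25
sizes <- sapply(precisions, function(g) n_fun(genauigk = g, wniveau = 0.95, varianz = 0.25))

print(data.frame(precision = precisions, n = sizes))
```

Because the precision enters the formula squared, tightening it from 5 to 1 percentage points multiplies the required sample size by about 25.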