27.05.2024

Maria Osipenko

Parameter estimation

Problem statement

Your team has a new project: to develop a smartphone app that your customers can download and use.

Challenge

Your boss has just presented the project to the investors. But they don’t want to finance the project without a prior success analysis. They require a well-founded estimate of the expected proportion of app users based on a sample. And as if that wasn’t enough, they want the size of the sample to be chosen so that

  • the estimation precision of \(1\%\)
  • with the high probability level of \(95\%\)

is guaranteed.

Your boss is beside herself: ‘What kind of estimation precision?! If we carry out a survey, we can precisely calculate the proportion of app users and their average willingness to pay! Or can't we?!’

Can you help your boss to understand investors’ concerns?

Your mission

Surveys are expensive! How do you give your boss the answer to the question of how large the minimum sample size should be, so that the required estimation precision is achieved at the 95% certainty level, and the investors are satisfied?

Your intermediate steps:

  • Explain to your boss why the proportion of potential app users from the survey may not exactly match the proportion of potential app users in the entire customer base.
  • Find out what the distribution of the corresponding estimator for the app user share looks like and whether you can use it to determine the sample size needed for the required precision.

Problem statement

How large should the minimum sample size be so that the required estimation precision of \(1\%\) is achieved at the certainty level of \(95\%\) and the investors are satisfied?

Sampling as a random process

When participants in a customer survey are selected randomly with equal probability, this corresponds to random sampling.

From the textbook…

  • A sample is used to obtain information about the behavior of a characteristic in the population.

    • \((X_1,\ldots,X_n)\) is called a random sample of size \(n\) of \(X\) if the \(X_i, i=1,\ldots,n\), are independent and follow the same distribution as \(X\).

    • The observed realizations \(x_1,\ldots, x_n\) of the random sample \((X_1,\ldots,X_n)\) are called sample realizations.

  • An estimator is a function of a random sample and

  • its realization, an estimate, always depends on a specific sample realization and varies if a new sample is taken.

Your approach

That sounds very abstract at first… You can see what happens when a sample is taken in the animation:

In order to better understand the construction of a sample yourself, try taking a (simulated) sample from the customer database and create an estimate for the average customer age based on this.

(Customer age is stored in the system, i.e. you have the population of all customers here and can determine the average age in the population directly as the average over all customers. This allows you to simulate the sampling process and to compare the estimated values with the true average age of all customers.)

Your experiments

Now you simulate the sample survey:

  • randomly select \(n=20\) customers (these are your survey values, i.e. the sample values)

  • calculate the average age of these selected customers

  • compare your estimate with the real average of the population

Repeat the steps several times and see what happens (See the interactive database app \(\downarrow\)).

Now increase the sample size and repeat the steps again. Will the estimation results be more accurate?

Interactive database app

(Interactive app: set the sample size n and draw a sample of n customers. The app displays the mean age in the sample, the mean age in the population, and the estimates based on different samples.)


Coding

The relevant customer data is contained in the table named kunden, column alter. Program the calculation of the mean customer age on the population level.

Customer database:

Modify the following code to compute the average customer age and its standard deviation (use the data column kunden$alter for your calculations):

mean(NA)
sd(NA)
# Solution:
mean(kunden$alter)
sd(kunden$alter)

Now draw a sample with sample size \(n=20\) using sample(x=kunden$alter, size=20, replace=TRUE) and calculate the sample mean.

mean(NA)
# Solution:
mean(sample(x=kunden$alter, size=20, replace=TRUE))

You can execute the line of code above multiple times so that you can see what happens when a new sample is drawn.

Alternatively, you can program repeated sampling. Complete the fourth line in the following code window:

wiederholungen=100 # number of desired repetitions of the experiment
schaetzwerte<-numeric(wiederholungen) # placeholder for the estimates
for (i in 1:wiederholungen){
  schaetzwert<-mean(NA) # estimate from the current draw
  schaetzwerte[i]<-schaetzwert # collect all estimates in one vector
}
plot(schaetzwerte, pch=16, col=2, main = "Estimates from different sample draws", xlab="Draw", ylab="Value", ylim=c(20,40))
abline(h=30) # horizontal line at the population mean
# Line four (solution):
  schaetzwert<-mean(sample(x=kunden$alter, size=20, replace=TRUE))

Increase the size of each sample in the fourth line (e.g. size = 200). What changes?
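The effect of the sample size can also be tried out without access to the customer table. The following is a minimal sketch under stated assumptions: `alter_sim` is a simulated stand-in for `kunden$alter` (normal with mean 30 and standard deviation 10, chosen only to resemble the example), and `draw_means()` is a hypothetical helper, not part of the tutorial.

```r
set.seed(1)  # make the simulation reproducible

# Assumption: simulated stand-in for kunden$alter (mean 30, sd 10)
alter_sim <- rnorm(10000, mean = 30, sd = 10)

# Hypothetical helper: draw `reps` samples of size `n` and return their means
draw_means <- function(x, n, reps = 1000) {
  replicate(reps, mean(sample(x = x, size = n, replace = TRUE)))
}

means_20  <- draw_means(alter_sim, n = 20)
means_200 <- draw_means(alter_sim, n = 200)

# Spread of the estimates around the population mean for both sample sizes
sd_20  <- sd(means_20)
sd_200 <- sd(means_200)
c(n20 = sd_20, n200 = sd_200)
```

Under these assumptions the estimates for size 200 should scatter roughly a third as widely as those for size 20, since the spread shrinks with \(1/\sqrt{n}\).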

Distribution of the estimator

Our estimator - the sample mean \(\bar{X}=\frac 1n \sum_{i=1}^n X_i\) is a random variable before the sampling takes place. Its values change depending on the sample realization \(x_1,x_2,\ldots, x_n\). The values fluctuate less as the sample size \(n\) increases.

But what exactly is the relationship to the sample size?

The answer is provided by the central limit theorem (CLT).

If the sample variables \(X_1,X_2,\ldots, X_n\) are independent and identically distributed with a finite mean \(\mu\) and a finite variance \(\sigma^2\), then for large \(n\) the sample mean is approximately normally distributed: \[\bar{X}\stackrel{a}{\sim} N(\mu, \frac{\sigma^2}{n}),\] or, in its standardized version: \[\frac{\bar X-\mu}{\sigma/\sqrt{n}}\stackrel{a}{\sim}N(0;1).\]
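The statement can be checked by simulation. The sketch below is only an illustration under assumptions that are not in the tutorial: the population is exponential with rate 1 (so \(\mu=1\) and \(\sigma=1\)), a clearly non-normal choice, and the sample size is 50. If the approximation works, about 95% of the standardized sample means should fall between -1.96 and 1.96.

```r
set.seed(42)

n    <- 50      # sample size
reps <- 5000    # number of simulated samples

# Assumption: exponential population with rate 1, so mu = 1 and sigma = 1
mu    <- 1
sigma <- 1

# Standardized sample means (X_bar - mu) / (sigma / sqrt(n))
z <- replicate(reps, (mean(rexp(n, rate = 1)) - mu) / (sigma / sqrt(n)))

# Share of standardized means inside [-1.96, 1.96]; N(0,1) gives about 0.95
share_inside <- mean(abs(z) <= 1.96)
share_inside
```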

You can try out how it all works in practice in the interactive database app below.

Estimate based on a sample

  • \((X_1,\ldots,X_n)\) denotes a random sample of size \(n\) of \(X\).

  • The estimator \(\bar{X}=\frac 1n \sum_{i=1}^n X_i\) is a random variable before sampling

  • with the approximate distribution \(\bar{X}\stackrel{a}{\sim} N(\mu, \frac{\sigma^2}{n})\) or, standardized, \(\frac{\bar X-\mu} {\sigma/\sqrt{n}}\stackrel{a}{\sim}N(0,1).\)

The actual task

Can you find the required sample size \(n\) so that in your survey simulation with probability of at least \(\color{green} {95\%}\) the mean age of the customers can be estimated with \(\color{red}{precision}\) of half a year (\(\color{blue}{0.5}\))?

This means \(\mathbb P(\color{red}{|\bar X-\mu|}\leq \color{blue}{0.5})\geq \color{green}{0.95}\) or, equivalently, \(\mathbb P(\color{blue}{-0.5}\leq \color{red}{\bar X-\mu}\leq \color{blue}{0.5})\geq \color{green}{0.95}\). Use \(\frac{\bar X-\mu}{\sigma/\sqrt{n}}\stackrel{a}{\sim}N(0;1)\) and try to solve for \(n\).

Hint \[\begin{align*}&~~~~~~\mathbb P(\color{blue}{-0.5}\leq \color{red}{\overline X-\mu}\leq \color{blue}{0.5})\geq \color{green}{0.95}\\ &\Leftrightarrow \mathbb P\left(-\frac{\color{blue}{0.5}}{\sigma/\sqrt{n}}\leq \frac{\color{red}{\overline X-\mu}}{\sigma/\sqrt{n}}\leq \frac{\color{blue}{0.5}}{\sigma/\sqrt{n}}\right)\geq \color{green}{0.95}. \end{align*}\] By the central limit theorem, the standardized mean approximately satisfies \[\mathbb P\left(q_{\color{green}{0.025}}\leq \frac{\color{red}{\overline X-\mu}}{\sigma/\sqrt{n}}\leq q_{\color{green}{0.975}}\right) = \mathbb P\left(-q_{\color{green}{0.975}}\leq \frac{\color{red}{\overline X-\mu}}{\sigma/\sqrt{n}}\leq q_{\color{green}{0.975}}\right) = \color{green}{0.95},\] where \(q_{\alpha}\) is the \(\alpha\)-quantile of the standard normal distribution and \(q_{0.025}=-q_{0.975}\) by symmetry. Note: \(0.975 = 1-(1-\color{green}{0.95})/2 = (1+\color{green}{0.95})/2\). In this case we use \(q_{0.975}= 1.96.\) The required probability is therefore reached as soon as the interval bound is at least as large as the quantile: \[\frac{\color{blue}{0.5}}{\sigma/\sqrt n} \geq q_{\color{green}{0.975}},\] which, solved for \(n\), gives \[n \geq \left(\frac{q_{\color{green}{0.975}}\cdot \sigma}{\color{blue}{0.5}}\right)^2.\]

Note…

… that the sample size depends on the unknown variance in the population. In practice, an estimate of the variance is used. To be on the safe side, you should assume a value that is rather too high than too low.

When it comes to estimating a share (a proportion), it’s a little easier: the variance of a single 0/1 observation is at most \(\frac 14\).

Coding

With the help of the central limit theorem, the distribution of the sample mean is approximated by a normal distribution: \[\overline X\stackrel{a}{\sim} N(\mu,\frac{\sigma^2}{n})\]

You have already determined the population parameters (mean \(30\) and variance \(100\)) from the customer table. Now calculate the probability:

\[\mathbb P(29.5\leq\overline X\leq 30.5)\]

using the R function pnorm() and \(n=20\).

pnorm(q=30.5, mean=NA, sd=NA) - pnorm(q=29.5, mean=NA, sd=NA)
# Solution (sd of the sample mean: sqrt(100/20) = sqrt(5)):
pnorm(q=30.5, mean=30, sd=sqrt(5)) - pnorm(q=29.5, mean=30, sd=sqrt(5))

Next, calculate the probability:

\[\mathbb P(29.5\leq\overline X\leq 30.5)\]

using the R function pnorm() and \(n=1000\).

pnorm(q=30.5, mean=NA, sd=NA) - pnorm(q=29.5, mean=NA, sd=NA)
# Solution (sd of the sample mean: sqrt(100/1000) = sqrt(0.1)):
pnorm(q=30.5, mean=30, sd=sqrt(0.1)) - pnorm(q=29.5, mean=30, sd=sqrt(0.1))
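Both probabilities can also be obtained from the standardized form \(2\Phi(0.5\sqrt{n}/\sigma)-1\). The sketch below wraps this in a hypothetical helper `prob_within()` (the name and the default `g = 0.5` are assumptions for illustration) and uses the population values from the text, mean 30 and variance 100.

```r
# Population parameters taken from the text: mean 30, variance 100
mu    <- 30
sigma <- sqrt(100)

# Hypothetical helper: P(mu - g <= X_bar <= mu + g) under the normal approximation
prob_within <- function(n, g = 0.5) {
  se <- sigma / sqrt(n)   # standard deviation of the sample mean
  2 * pnorm(g / se) - 1   # probability of a symmetric interval around mu
}

# Probabilities for the two sample sizes used in the exercises
prob_within(c(20, 1000))
```

The two values should agree with the pnorm() differences above, roughly 0.18 for \(n=20\) and 0.89 for \(n=1000\), so even 1000 observations do not reach the 95% level for a precision of half a year.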

Find the 95% quantile of the distribution of \(\bar X\) using the R function qnorm() and \(n=1000\).

You need the number \(q_{0.95}\), such that: \[\begin{align} \mathbb P(\bar X\leq q_{0.95}) = 0.95. \end{align}\]

qnorm(p=0.95, mean=NA, sd=NA)
# Solution:
qnorm(p=0.95, mean=30, sd=sqrt(0.1))

Now you can implement the formula for calculating the minimum sample size when a certain precision genauigk and a certainty level wniveau are given.

Hint: by the central limit theorem, \(\frac{\bar X-\mu}{\sigma/\sqrt{n}}\stackrel{a}\sim N(0;1)\).
If the precision genauigk and the certainty level wniveau are specified, this means:
\(\mathbb P(|\bar X-\mu|\leq\) genauigk\()\stackrel{!}\geq\)wniveau,

or standardized:
\(\mathbb P(-\sqrt{n}\)genauigk\(/\sigma\leq\frac{\bar X-\mu}{\sigma/\sqrt{n}}\leq \sqrt{n }\) genauigk\(/\sigma)\stackrel{!}\geq\)wniveau.

After rearranging, the latter leads to:
\(2\Phi(\sqrt{n}\)genauigk\(/\sigma)-1\stackrel{!}\geq\)wniveau

and consequently
\(\sqrt{n}\)genauigk\(/\sigma\stackrel{!}\geq\Phi^{-1}\big((\)wniveau\(+1)/2\big)\).

Here \(\Phi(\cdot)\) denotes the distribution function of the standard normal distribution and \(\Phi^{-1}(\cdot)\) its quantile function. Try to understand the calculations and solve the result for \(n\) at the end.


Complete the following R function, which accepts as arguments the required precision genauigk, the probability level wniveau and the variance of the population varianz and outputs the required sample size.

Modify the following lines of code based on your calculations:

n_fun<-function(genauigk,wniveau,varianz){
  n<-ceiling(NA)
  return(n)
}
# Solution:
n_fun<-function(genauigk,wniveau,varianz){
  n<-ceiling((qnorm((wniveau+1)/2,0,1)/genauigk)^2*varianz)
  return(n)
}
n_fun(genauigk=0.5,wniveau=0.95,varianz=100)
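A quick plausibility check of the result is to plug the computed size back into \(2\Phi(\sqrt{n}\,g/\sigma)-1\) and compare it with the size one smaller. The helper `coverage()` below is a hypothetical name introduced for this sketch; `n_fun()` is repeated only so that the block runs on its own.

```r
# Same formula as n_fun() above, repeated so that this block is self-contained
n_fun <- function(genauigk, wniveau, varianz) {
  ceiling((qnorm((wniveau + 1) / 2) / genauigk)^2 * varianz)
}

# Hypothetical helper: probability that X_bar lies within +/- genauigk of mu
coverage <- function(n, genauigk, varianz) {
  2 * pnorm(sqrt(n) * genauigk / sqrt(varianz)) - 1
}

n_min <- n_fun(genauigk = 0.5, wniveau = 0.95, varianz = 100)

coverage(n_min,     genauigk = 0.5, varianz = 100)  # at least 0.95
coverage(n_min - 1, genauigk = 0.5, varianz = 100)  # just below 0.95
```

With the values from the age example this gives a minimum size of 1537, and one observation fewer falls just short of the 95% level.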

Optimal sample size

  • For certainty level \(w\), precision \(g\) and standard deviation (or its estimate) \(\sigma\)

  • \(n \geq \left(\frac{q_{(1+w)/2}\cdot \sigma}{g}\right)^2.\)

The answer has to come!

You have experimented enough. The boss is pushing, and you have to provide an answer.

The estimator for the app user share is: \(\overline{X}=\frac 1n\sum_{i=1}^n X_i\) with each \[X_i=\begin{cases}1, i-\text{th customer will use the app},\\ 0, \text{else}. \end{cases}\]

Similar to our “Customer Age” experiment, \(\overline X\stackrel{a}{\sim} N\Big(\mathbb E(X); \frac{\text{Var}(X)}{n}\Big)\) with \(\mathbb E(X)=p\) and \(\text{Var}(X)=p(1-p)\) (as \(X\) follows a Bernoulli distribution). That is, in your derived formula you have to replace \(\sigma^2\) with \(p(1-p)\), and then you can calculate the sample size, right?

The problem is: WE DON’T KNOW \(p\)! But the solution is simple. Consider the graph of \(f(p)=p(1-p)\) for \(p\) between 0 and 1. What is the maximum value of the function?

So use the maximum possible variance for your calculation!
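As a short check of that maximum, the function can be differentiated directly: \[f(p)=p(1-p)=p-p^2,\qquad f'(p)=1-2p=0 \;\Leftrightarrow\; p=\tfrac 12,\qquad f''(p)=-2<0.\] So \(f\) attains its maximum at \(p=\frac 12\) with \(f(\frac 12)=\frac 14\), and \(\text{Var}(X)=p(1-p)\leq \frac 14\) holds for every possible \(p\).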

Coding

Use the implemented function n_fun() to determine the minimum sample size for a precision of 0.01, a certainty level of 0.95 and a variance of 0.25 (modify line 6):

n_fun<-function(genauigk,wniveau,varianz){
  n<-ceiling((qnorm((wniveau+1)/2,0,1)/genauigk)^2*varianz)
  return(n)
}

n_fun(genauigk=NA,wniveau=NA,varianz=NA)
# Solution:
n_fun<-function(genauigk,wniveau,varianz){
  n<-ceiling((qnorm((wniveau+1)/2,0,1)/genauigk)^2*varianz)
  return(n)
}

n_fun(genauigk=0.01,wniveau=0.95,varianz=0.25)
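To see why the maximum variance is the safe choice, the same function can be evaluated for several candidate shares. This is only an illustrative sketch: the values in `p_grid` are hypothetical, and the true \(p\) remains unknown.

```r
# Same formula as n_fun() above, repeated so that this block is self-contained
n_fun <- function(genauigk, wniveau, varianz) {
  ceiling((qnorm((wniveau + 1) / 2) / genauigk)^2 * varianz)
}

# Hypothetical candidate values for the unknown share p
p_grid <- c(0.1, 0.3, 0.5, 0.7, 0.9)

# Required sample size for each p, using the Bernoulli variance p * (1 - p)
n_by_p <- sapply(p_grid, function(p) {
  n_fun(genauigk = 0.01, wniveau = 0.95, varianz = p * (1 - p))
})

data.frame(p = p_grid, n = n_by_p)
```

The largest requirement occurs at \(p=0.5\), where the function returns 9604. A sample of that size therefore meets the investors' conditions whatever the true share turns out to be.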

Summary