Benford’s law

What is Benford’s law? 🤷

Wikipedia:

“Benford’s law, also known as the Newcomb–Benford law, the law of anomalous numbers, or the first-digit law, is an observation that in many real-life sets of numerical data, the leading digit is likely to be small. In sets that obey the law, the number 1 appears as the leading significant digit about 30% of the time, while 9 appears as the leading significant digit less than 5% of the time. If the digits were distributed uniformly, they would each occur about 11.1% of the time.”

In other words, the law asserts that the leading digits of the numbers we see in everyday life are not equally probable, as one might expect, but instead follow the probability mass function below:

\[p_X(x) = \begin{cases}\log_{10}(1+\frac 1x),& x=1,2,\ldots,9\\ 0,&\text{otherwise.}\end{cases}\]

where \(X\) denotes a random variable with values \(1,2,\ldots,9.\)
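
One quick way to see that these probabilities form a valid distribution is to add them up. The following worked sum is our addition and uses only the formula above; the product inside the logarithm telescopes, so the total is exactly 1:

\[\sum_{x=1}^{9}\log_{10}\left(1+\frac 1x\right)=\sum_{x=1}^{9}\log_{10}\frac{x+1}{x}=\log_{10}\left(\frac 21\cdot\frac 32\cdots\frac{10}{9}\right)=\log_{10}10=1.\]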

Discuss with your peers:

Why is \(X\) a random variable? What kind of random variable is \(X\)? Think about situations in which you can observe values of \(X\).
Which values does \(X\) take for the following sample of prices: \(1.95, 11.99, 2.31, 6.42, 2.93\)?

Answer:

\(x_1=1,x_2=1,x_3=2,x_4=6,x_5=2.\)
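
The leading digit can also be extracted programmatically. The following is a minimal sketch in R, not part of the original exercise: the helper name `leading_digit` is our own, and it assumes strictly positive numbers.

```r
# Hypothetical helper (our own name, not a base R function):
# leading significant digit of strictly positive numbers.
leading_digit = function(v) {
  stopifnot(all(v > 0))
  # Scale each number into [1, 10) by dividing by the matching power of ten,
  # then drop the fractional part.
  floor(v / 10^floor(log10(v)))
}

prices = c(1.95, 11.99, 2.31, 6.42, 2.93)
leading_digit(prices)
```

For the five prices above this yields the digits 1, 1, 2, 6, 2. Because the helper relies on floating-point arithmetic, numbers extremely close to a power of ten may need extra care.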

Benford’s law

\[p_X(x) = \begin{cases}\log_{10}(1+\frac 1x),& x=1,2,\ldots,9\\ 0,&\text{otherwise.}\end{cases}\]

Exercise 1. Construct a table specifying the distribution of \(X.\) (We encourage you to use Excel/Calc or R for this.)

If you use R, the following tips may help:

With the function seq(from,to,by) you can create the sequence of \(X\)-values.
Then use the function log10(number) to calculate the probabilities.
You can use rbind(row1,row2) to put the values and the probabilities together in one table.
Finally, the function kable(table, caption) from the knitr package displays your output as a nicely formatted table.

You can run your code in the following chunk.

library(knitr) # provides kable()
x = seq(from=1,to=9,by=1) # construct the number sequence 1,2,...,9 for the X-values
probs = log10(1 + 1/x) # compute the probabilities for the X-values using the formula above
tabd = rbind(x,probs) # table with the values and the corresponding probabilities
kable(tabd, caption = "Probability distribution for X")

Quick check (round your answers to four decimal places):

If you wish, you can use the following chunk to compute the probability that \(X\) is even using R:

sum(probs[seq(from=2,to=9,by=2)]) # sum probs for the values which are even

If you wish, you can use the following chunk to compute the probability that \(X\) is at least \(3\) using R:

sum(probs[x>=3]) # sum probs for the values which are greater than or equal to 3

The most probable value in this distribution is clearly \(1.\) In the lecture, we talked about the expected value, which is the mean we expect to see if we observe very many (in the limit, infinitely many) values of a random variable. So, what is the expected value of \(X\)? We also talked about how much the values of a random variable vary. To quantify this, one can compute the variance. So, what is the variance of \(X\) under Benford’s law?

Recall the formulas for computing the expectation and the variance of a discrete random variable:

\[\mathbb E(X) = x_1p_1 + x_2p_2 + \ldots + x_np_n=\sum_{i=1}^n x_ip_i,\] \[\text{Var}(X) = \color{red}{\mathbb E(X^2)} - \color{blue}{\left(\mathbb E(X)\right)^2 }= \color{red}{\sum_{i=1}^n x_i^2\cdot p_i} - \color{blue}{\left(\sum_{i=1}^n x_ip_i\right)^2}.\]

Exercise 2. Compute the expectation and the variance of \(X.\) If you wish, you can use the following chunk to compute the expectation and the variance using R:

EX = sum(x*probs) # sum values times probs
VarX = sum(x^2*probs) - (EX)^2
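
The interpretation of the expected value as a long-run mean can be checked by simulation. The sketch below is our addition: it draws a large sample from Benford’s law with the base R function `sample()` and compares the sample mean and sample variance with the theoretical values. The seed and the sample size `n` are arbitrary assumptions.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
x = 1:9                           # possible leading digits
probs = log10(1 + 1/x)            # Benford probabilities
EX = sum(x * probs)               # theoretical expectation
VarX = sum(x^2 * probs) - EX^2    # theoretical variance

n = 100000                        # assumed sample size; any large number works
draws = sample(x, size = n, replace = TRUE, prob = probs)

c(theoretical = EX, simulated = mean(draws))    # expectation vs. sample mean
c(theoretical = VarX, simulated = var(draws))   # variance vs. sample variance
```

With a sample this large, the simulated values should lie close to the theoretical ones; rerunning with a smaller `n` shows larger discrepancies.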

Samples from Benford’s law 🔬

We now also provide a real sample of \(50\) prices from https://www.edeka24.de.

A sample of real prices (no advertisement intended).

Make a list of the first digits of the prices shown. If you use R, you can enter the digits, separated by commas, in a vector named digits, like this: digits = c(7,2,3,6,8,...).

digits = c()

digits = c(7,2,3,6,8,
            1,1,2,3,3,
            1,3,9,1,2,
            2,1,3,3,3,
            8,4,2,2,6,
            1,1,1,1,1,
            1,1,1,2,2,
            2,1,5,1,2,
            1,1,5,7,2,
            3,4,4,3,4
            )

Probabilities

\[\tiny \begin{array}{rrrrrrrrrr} \hline x& 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ \hline p_X& 0.30 & 0.18 & 0.12 & 0.10 & 0.08 & 0.07 & 0.06 & 0.05 & 0.05 \\ \hline \end{array}\]

Exercise 3. Compute the relative sample frequencies (empirical probabilities) of the digits. Construct a table specifying the unique observed values and their relative frequencies. You can use tabfr = table(digits) to compute the absolute frequencies and prop.table(tabfr) to print the table of relative frequencies.

tabfr = table(digits)
prop.table(tabfr)
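
To compare the sample with the theoretical distribution, the relative frequencies and the Benford probabilities can be placed in one table. The following sketch is our addition; the names `benford`, `relfr` and `comparison` are our own, and `factor(digits, levels = x)` is used so that a digit that never occurs would still appear with frequency 0.

```r
# Leading digits of the 50 sample prices (same vector as above)
digits = c(7,2,3,6,8, 1,1,2,3,3, 1,3,9,1,2, 2,1,3,3,3, 8,4,2,2,6,
           1,1,1,1,1, 1,1,1,2,2, 2,1,5,1,2, 1,1,5,7,2, 3,4,4,3,4)

x = 1:9
benford = log10(1 + 1/x)                                 # theoretical probabilities
relfr = prop.table(table(factor(digits, levels = x)))    # relative frequencies for 1..9

# One row per distribution, one column per digit
comparison = rbind(empirical = as.numeric(relfr), benford = benford)
colnames(comparison) = x
round(comparison, 2)
```

Reading the table column by column makes it easy to see for which digits the sample is close to Benford’s law and where it deviates.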

Quick check (round your answers to four decimal places):

If you wish, you can use the following chunk to compute the relative frequency of the digit 2 and of the digits greater than or equal to 3 using R:

tabfr["2"]/length(digits) # absolute frequency of digit 2 (selected by name) divided by the sample size

sum(digits >= 3)/length(digits) # number of digits >= 3 divided by the sample size

Aren’t the relative frequencies roughly comparable to the ideal probabilities of Benford’s law? Discuss with your peers why the relative frequencies in a small sample are likely to deviate from the ideal probabilities even if Benford’s law is the underlying distribution generating the values in the sample.
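
The effect of the sample size can also be explored by simulation. The sketch below is our addition and only one of many possible ways to measure the deviation: the helper `max_gap` (our own name) draws one sample of size `n` from Benford’s law and returns the largest absolute difference between a relative frequency and the corresponding probability. The seed, the number of repetitions (200) and the two sample sizes are arbitrary assumptions.

```r
set.seed(2)                      # arbitrary seed, only for reproducibility
x = 1:9
probs = log10(1 + 1/x)           # Benford probabilities

# Hypothetical helper: largest absolute gap between relative frequencies
# and Benford probabilities in one simulated sample of size n.
max_gap = function(n) {
  draws = sample(x, size = n, replace = TRUE, prob = probs)
  relfr = as.numeric(prop.table(table(factor(draws, levels = x))))
  max(abs(relfr - probs))
}

# Average the gap over 200 simulated samples for a small and a large n
gap_small = mean(replicate(200, max_gap(50)))
gap_large = mean(replicate(200, max_gap(5000)))
c(n50 = gap_small, n5000 = gap_large)
```

Even though every simulated sample really does come from Benford’s law, the average gap for samples of size 50 is noticeably larger than for samples of size 5000.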

Useful R-functions

digits = c(...) to specify the vector of sample values,
tabfr = table(digits) to compute the absolute frequencies,
prop.table(tabfr) to compute the relative frequencies.

Repeat the analysis for the kilogram prices with the sample values:

\[\begin{array}{l}1,6,3,2,1,2,2,5,1,1,\\ 1,3,1,1,3,5,5,6,6,6,\\ 8,1,2,2,2,6,6,5,6,9,\\ 3,9,9,5,5,4,3,5,9,8,\\ 7,8,2,2,9,6,2,2,2,4\end{array}\]

Do you again observe some correspondence with Benford’s law?

# no hints this time :)

Maria Osipenko

13.03.2024
