---
title: Statistical programming
title-slide-attributes:
data-background-image: mvtec-cover-statistical-programming-4x3.png
data-background-size: contain
data-background-opacity: "0.1"
subtitle: Overview of probability. Simulation
author: Marc Comas-Cufí
format:
revealjs:
self-contained: true
smaller: true
logo: mvtec-cover-statistical-programming-4x3.png
fontsize: 12pt
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, fig.align = 'center', dev.args = list(bg = 'transparent'),
collapse=TRUE, comment = "#>")
```
## Today's session
```{r, echo=FALSE, results='asis'}
cat(readr::read_lines("session-04-content.md"), sep='\n')
```
# A brief review of probability
## Event {.smaller}
* An _event_ is a set of possible results of a particular experiment.
* The event containing all possible results is called the _sample space_.
* For completeness, the _impossible event_ is also defined; it is denoted $\emptyset$.
* Example:
    * "Face 6" is a possible event when rolling a six-sided die.
* Events are nicely represented with Euler diagrams:
```{r, fig.align='center', fig.cap='Euler diagram for three events A, B and C'}
knitr::include_graphics('session-04-presentation/euler_diagram.svg')
```
## Basic event operations {.smaller}
Given two events $A$ and $B$:
* $A \cap B$. Intersection of events $A$ and $B$: the event in which both $A$ and $B$ occur.
```{r, fig.align='center', out.height="80px"}
knitr::include_graphics('session-04-presentation/Venn0001.svg')
```
* $A \cup B$. Union of events $A$ and $B$: the event in which either $A$ or $B$ (or both) occurs.
```{r, fig.align='center', out.height="80px"}
knitr::include_graphics('session-04-presentation/Venn0111.svg')
```
* $A^c$. Complement of the event $A$: the event in which $A$ does not occur.
```{r, fig.align='center', out.height="80px"}
knitr::include_graphics('session-04-presentation/Venn10.svg')
```
## Probability (of an event) {.smaller}
* The _probability of an event_ measures "how likely" the event is, or "how big" the event is with respect to the sample space.
* The probability of an event $A$ is denoted as $\text{P}(A)$.
* Probabilities are quantities between 0 and 1, i.e. for any event $A$, $0 \leq \text{P}(A) \leq 1$.
* $\text{P}(\emptyset) = 0$ and $\text{P}(\Omega) = 1$, where $\Omega$ denotes the sample space.
What are probabilities?
* __Frequentist__. Probabilities are long run relative frequencies of events.
* __Bayesian__. Probabilities are used to quantify our uncertainty about events.
* Example:
    * The event "Face 6" has probability $1/6$ when rolling a six-sided die.
## Conditional probability {.smaller}
The conditional probability of one event $A$ given another event $B$ is defined as
$$
\text{P}(A|B) = \frac{\text{P}(A\cap B)}{\text{P}(B)}.
$$
* $\text{P}(A|B)$ measures how likely $A$ is to happen once we know that $B$ has happened.
* We can think of conditioning as reducing the sample space to the outcomes where $B$ occurs.
* Example: the event "Face 6" has probability $1/3$ once we are told the result is even.
__Bayes' rule__
For two events $A$ and $B$,
$$
\begin{aligned}
\text{P}(A|B) &= \frac{\text{P}(A)\, \text{P}(B|A)}{\text{P}(B)}\\
&= \frac{\text{P}(A)\, \text{P}(B|A)}{\text{P}(A)\, \text{P}(B|A) + \text{P}(A^c)\, \text{P}(B|A^c)}.
\end{aligned}
$$
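As a quick sanity check (a simulation sketch, not part of the original deck), we can verify Bayes' rule numerically with the die example above, where $A$ = "face is 6" and $B$ = "face is even":
```{r, echo=TRUE}
rolls = sample(1:6, 100000, replace = TRUE)
A = rolls == 6
B = rolls %% 2 == 0
mean(A[B])                      # direct estimate of P(A|B), close to 1/3
mean(A) * mean(B[A]) / mean(B)  # Bayes' rule: P(A) P(B|A) / P(B)
```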
## Simulating events with R (1) {.smaller}
```{r, include=FALSE}
library(tidyverse)
set.seed(10)
```
* We can shuffle a set of possible outcomes with the function `sample()`.
```{r, echo=TRUE}
cards = c('2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K', 'A')
sample(cards)
```
* We can pick a given number of elements:
```{r, echo=TRUE}
sample(cards, 5)
```
* By default, `sample()` picks elements without replacement. To sample with replacement:
```{r, echo=TRUE}
sample(cards, 5, replace = TRUE)
```
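Relative frequencies over many draws approximate probabilities (the frequentist view). A minimal sketch, not in the original deck:
```{r, echo=TRUE}
# Estimate P("face 6") as a long-run relative frequency
mean(sample(1:6, 100000, replace = TRUE) == 6)  # close to 1/6
```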
## Simulating events with R (2) {.smaller}
* We can assign a probability to each event:
```{r, echo=TRUE}
breast_cancer = c('yes', 'no')
sample(breast_cancer, 200, replace = TRUE, prob = c(0.004, 0.996))
```
## Activity: Medical diagnosis {.smaller background-color=#CCE5FF}
Suppose you are a woman in your 40s, and you decide to have a medical test for breast cancer called a mammogram.
* If the test is positive, what is the probability you have cancer?
Information:
* The probability of having breast cancer in your 40s is 0.004.
* The test has a __sensitivity__ of 80%. In other words, if you have breast cancer, the test will be positive with probability 0.8.
* The test has a __specificity__ of 90%. In other words, if you don't have breast cancer, the test will be negative with probability 0.9.
## Activity: Medical diagnosis {.smaller background-color=#CCE5FF}
```{r, echo=TRUE}
library(tidyverse)
N = 100000
test_result_sampling = function(breast_cancer){
  if(breast_cancer == 'yes'){
    sample(c('+', '-'), 1, prob = c(0.8, 0.2))  # sensitivity: P(+ | cancer) = 0.8
  }else{
    sample(c('+', '-'), 1, prob = c(0.1, 0.9))  # specificity: P(- | no cancer) = 0.9
  }
}
women40s = tibble(
  breast_cancer = sample(c('yes', 'no'), N, replace = TRUE, prob = c(0.004, 0.996)),
  test_result = map_chr(breast_cancer, ~test_result_sampling(.x))
)
```
* Among the women with a positive test, what proportion have breast cancer?
```{r, include=FALSE}
women40s %>%
  group_by(test_result) %>% # or filter(test_result == '+')
  summarise(with_breast_cancer = mean(breast_cancer == 'yes'))
```
## Activity: Medical diagnosis {.smaller background-color=#CCE5FF}
Use probability theory to calculate the exact probability of a woman having breast cancer given that she got a positive test.
__Hint.__ Consider the following events:
* $A$ = "To have cancer", $A^c$ = "Not to have cancer" and
* $B$ = "Test is positive".
## Independence between events {.smaller}
* Two events $A$ and $B$ are said to be independent, denoted $A \perp B$, if
$$
\text{P}(A) = \text{P}(A|B).
$$
* An equivalent, more practical, definition: $A$ and $B$ are independent if
$$
\text{P}(A \cap B) = \text{P}(A) \text{P}(B).
$$
## Conditional independence {.smaller}
* Two events $A$ and $B$ are said to be conditionally independent given $C$, denoted $A \perp B | C$, if
$$
\text{P}(A|C) = \text{P}(A|B \cap C),
$$
or, equivalently,
$$
\text{P}(A \cap B | C) = \text{P}(A |C)\text{P}(B |C).
$$
## Activity {.smaller background-color=#CCE5FF}
* Check if the event __A = "the face is even"__ and the event __B = "the face is less than or equal to 4"__ are independent when rolling a six-sided die (a simulation sketch follows below).
* Check if the event __A = "the face of the blue die is 6"__ and the event __B = "the face of the red die is 6"__ are independent when rolling a blue and a red six-sided die.
* Check if the event __A = "the face of the blue die is 6"__ and the event __B = "the face of the red die is 6"__ are conditionally independent given the event __C = "the sum is even"__, when rolling a blue and a red six-sided die.
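One possible way to check the first case empirically (a sketch; the exact probabilities can also be computed by counting faces):
```{r, echo=TRUE, eval=FALSE}
rolls = sample(1:6, 100000, replace = TRUE)
A = rolls %% 2 == 0   # face is even
B = rolls <= 4        # face is <= 4
mean(A & B)           # estimate of P(A and B)
mean(A) * mean(B)     # estimate of P(A) P(B); equal if A and B are independent
```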
## Activity: Monty Hall problem {background-color=#CCE5FF}
> Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, "Do you want to pick door No. 2?" Is it to your advantage to switch your choice?
<!--
Consider these two events: $A$ = "car is selected" and $B$ = "goat is selected".
* __Not switching.__
* We start on event $B$ with probability $\frac{2}{3}$.
* Starting on event $A$ with probability $\frac{1}{3}$.
* __Switching.__
* We start on event $B$ with probability $\frac{2}{3}$. After switching, we stay on event $A$.
* Starting on event $A$ with probability $\frac{1}{3}$. After switching, we stay on event $B$.
-->
```{r, eval=FALSE}
monty_hall = function(CHANGE = FALSE){
  DOORS = sample(c('car', 'goat', 'goat'))
  PICKED = sample(1:3, 1)
  # The host opens a goat door that was not picked. rep(..., 2) guards against
  # sample()'s special case: with a length-1 vector x, sample(x, 1) draws from 1:x.
  DISCARD = sample(rep((1:3)[(DOORS != 'car') & (1:3 != PICKED)], 2), 1)
  if(CHANGE){
    FINAL_PICK = setdiff(1:3, c(PICKED, DISCARD))
  }else{
    FINAL_PICK = PICKED
  }
  DOORS[FINAL_PICK]
}
prop.table( table(replicate(1000, monty_hall(CHANGE = FALSE))) )
prop.table( table(replicate(1000, monty_hall(CHANGE = TRUE))) )
```
## Random variables {.smaller}
```{r}
theme_set(theme_minimal())
```
* A random variable (r.v.) is a function from a sample space to a measurable space, like the real numbers.
* We can think of a random variable as a variable taking values randomly.
* If the values are discrete, the function assigning a probability to each value is called the _probability mass function_ (pmf).
```{r, fig.align='center', out.height="300px"}
knitr::include_graphics('session-04-presentation/Random_Variable_as_a_Function-en.svg')
```
## Discrete random variables {.smaller}
* Discrete r.v. take values in a countable set of elements, for example a finite set of numbers or the integers.
* Discrete r.v. can be characterised by giving $\text{P}(X = x)$ for each possible value $x$. The function $p(x) = \text{P}(X = x)$ is called the _probability mass function_ (pmf).
* __Example:__ The discrete r.v. $X$ is uniquely determined by
    * $X \in \{1, 2, 3\}$ and
    * the pmf: $p(1) = 0.25$, $p(2) = 0.5$ and $p(3) = 0.25$.
```{r, fig.height=3, out.width="80%"}
pmf = tibble(
  X = c(1,2,3),
  Probability = c(0.25, 0.5, 0.25)
)
ggplot(data = pmf) +
  geom_bar(aes(x=X, y = Probability), col = 'black', width = 1, stat = 'identity', alpha = 0.4) +
  labs(title = 'Probability mass function', x = '') +
  scale_y_continuous(limits = c(0,1)) +
  scale_x_continuous(breaks = 1:3, minor_breaks = NULL, labels = sprintf("X=%d", 1:3))
```
<!-- * Valid events for a discrete random variable $X$ are: $X = x$, $X \leq x$, $X \neq x$, ...-->
## Continuous random variables {.smaller}
* Continuous random variables take values in the real line.
* Possible events for a continuous r.v. are combinations of subsets of the real line.
* Continuous r.v. are characterised by providing $\text{P}(X \in E)$ for any event $E$.
* The _probability density function_ (pdf) of a continuous r.v. is a function $f(x)$ such that $\text{P}(X \in E) = \int_{x \in E} f(x)\, dx$.
* __Example:__ The continuous r.v. $X$ is uniquely determined by the pdf $f(x)$ defined as $f(x) = \frac{1}{3}$ if $x \in (0,3)$ and $f(x) = 0$ otherwise.
```{r, fig.height=2.5, out.width="80%"}
df = tibble(
  X = seq(-2,5,length.out=2000),
  Density = if_else(0<=X & X <=3, 1/3, 0)
)
yup = 0.003
ggplot(data = df) +
  geom_area(aes(x=X, y = Density), alpha = 0.4) +
  geom_segment(aes(x=0, xend = 3, y = 1/3, yend = 1/3), size = 0.5) +
  geom_segment(aes(x=-2, xend = 0, y = yup, yend = yup), size = 0.5) +
  geom_segment(aes(x=3, xend = 5, y = yup, yend = yup), size = 0.5) +
  labs(title = 'Probability density function', x = 'x') +
  scale_y_continuous(limits = c(0,0.6))
```
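A quick empirical check of this pdf (a sketch, not in the original deck): simulate from the uniform distribution on $(0,3)$ and estimate $\text{P}(X \leq 1)$, which should be $\int_0^1 \frac{1}{3}\, dx = 1/3$.
```{r, echo=TRUE}
x = runif(100000, min = 0, max = 3)  # draws with density 1/3 on (0, 3)
mean(x <= 1)                         # estimate of P(X <= 1) = 1/3
```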
<!--
## Cumulative distribution function
* The _cumulative distribution function_ cdf is the function $F(x)$ such that $F(x) = \text{P}(X \leq x)$.
* $F(x) = \sum_{z < x} p(z)$ for a discrete r.v. and
* $F(x) = \int_{z < x} f(z) dz$ for a continuous r.v.
-->
## The expected value or mean of a r.v. {.smaller}
* The _expected value_ of a r.v. $X$, denoted $\mathbb{E}[X]$, is the average value we expect over a long run of realisations.
* For a discrete r.v. $X$, the expected value is
$$
\mathbb{E}[X] = \sum_{x} x \;p(x) \;\;\;\;\;\;\;\;\;\;[\sum_{x} x \;\text{P}(X = x)]
$$
* For a continuous r.v. $X$, the expected value is
$$
\mathbb{E}[X] = \int_{-\infty}^{\infty} x \;f(x) dx
$$
* Example: the expected value of the r.v. given by the face shown when rolling a six-sided die is
$$
1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = 3.5
$$
```{r, echo=TRUE}
mean( sample(1:6, 1000000, replace = TRUE) )
```
## More expected values {.smaller}
* The variance of a r.v., denoted $\text{var}[X]$, is defined as
$$
\text{var}[X] = \mathbb{E}[ \,(X - \mathbb{E}[X])^2\, ].
$$
* The covariance between two r.v.'s $X$ and $Y$, denoted $\text{cov}[X,Y]$, is defined as
$$
\text{cov}[X,Y] = \mathbb{E}[ \,(X - \mathbb{E}[X]) (Y - \mathbb{E}[Y]) \, ].
$$
* See visualisation at <https://seeing-theory.brown.edu/basic-probability/index.html>
* For any function $g(x)$, we can define the expected value of $g(X)$ as
    * $\mathbb{E}[g(X)] = \sum_{x} g(x) \;p(x)$ when $X$ is discrete, or
    * $\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) \;f(x) dx$ when $X$ is continuous.
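These expected values can all be approximated by simulation. A minimal sketch (not in the original deck) for the die example, with $g(x) = x^2$:
```{r, echo=TRUE}
x = sample(1:6, 100000, replace = TRUE)
mean(x^2)  # E[X^2] = 91/6, approximately 15.17
var(x)     # var[X] = E[X^2] - E[X]^2, approximately 2.92
```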
# Some common probability distributions
## Categorical distribution {.smaller}
The categorical distribution is the distribution obtained after associating the natural numbers $1$ to $k$ with a set of events having probabilities $\boldsymbol\pi = (\pi_1, \dots, \pi_k)$.
* For a categorical r.v. $X$, denoted $X \sim Cat(\boldsymbol\pi = (\pi_1, \dots, \pi_k))$,
    * $\mu = \mathbb{E}[X] = \sum_{i \leq k} i\; \pi_i$ and
    * $\text{var}[X] = \sum_{i \leq k} (i - \mu)^2\; \pi_i$.
```{r, fig.height=3, out.width="80%"}
pmf = tibble(
  X = 1:4,
  Probability = c(1/8, 2/8, 3/8, 2/8)
)
ggplot(data = pmf) +
  geom_bar(aes(x=X, y = Probability), col = 'black', width = 1, stat = 'identity', alpha = 0.4) +
  labs(title = latex2exp::TeX('Probability mass function for a $Cat(\\mathbf{\\pi} = (1/8, 2/8, 3/8, 2/8))$'), x = '') +
  scale_y_continuous(limits = c(0,0.5)) +
  scale_x_continuous(breaks = 1:4, minor_breaks = NULL, labels = sprintf("X=%d", 1:4))
```
## Gaussian (normal) distribution {.smaller}
The gaussian (normal) distribution is the distribution whose density decays exponentially (at a rate governed by $1/\sigma^2$) for values far from a certain centre ($\mu$).
* For a gaussian r.v. $X$, $X \sim N(\mu, \sigma)$,
    * $\mathbb{E}[X] = \mu$ and
    * $\text{var}[X] = \sigma^2$.
```{r, fig.height=3, out.width="80%"}
df = tibble(
  X = seq(-3,3,length.out=100),
  Density = dnorm(X)
)
ggplot(data = df) +
  geom_area(aes(x=X, y = Density), col = 'black', alpha = 0.4) +
  labs(title = latex2exp::TeX('Density function for a $N(\\mu = 0,\\sigma = 1)$'), x = 'x') +
  scale_y_continuous(limits = c(0,0.5))
```
## Binomial distribution {.smaller}
The binomial distribution counts the number of successes after repeating a certain experiment $n$ times, when the probability of success is $\pi$.
* For a binomial r.v. $X$, $X \sim Bin(n, \pi)$,
    * $\mathbb{E}[X] = n \pi$ and
    * $\text{var}[X] = n \pi (1-\pi)$.
```{r, fig.height=3, out.width="80%"}
pmf = tibble(
  X = 0:10,
  Probability = dbinom(X, size = 10, prob = 0.25)
)
ggplot(data = pmf) +
  geom_bar(aes(x=X, y = Probability), col = 'black', width = 1, stat = 'identity', alpha = 0.4) +
  labs(title = latex2exp::TeX('Probability mass function for a $Bin(n = 10,\\pi = 0.25)$'), x = '') +
  scale_y_continuous(limits = c(0,0.3)) +
  scale_x_continuous(breaks = 0:10, minor_breaks = NULL, labels = sprintf("X=%d", 0:10))
```
## Poisson distribution {.smaller}
The Poisson distribution counts the number of successes detected in an interval of time or space, when the expected number of successes is $\lambda$.
* For a Poisson r.v. $X$, $X \sim Poiss(\lambda)$,
    * $\mathbb{E}[X] = \lambda$ and
    * $\text{var}[X] = \lambda$.
```{r, fig.height=3, out.width="80%"}
pmf = tibble(
  X = 0:13,
  Probability = dpois(X, lambda = 3)
)
ggplot(data = pmf) +
  geom_bar(aes(x=X, y = Probability), col = 'black', width = 1, stat = 'identity', alpha = 0.4) +
  labs(title = latex2exp::TeX('Probability mass function for a $Pois(\\lambda = 3)$'), x = '') +
  scale_y_continuous(limits = c(0,0.3)) +
  scale_x_continuous(breaks = 0:13, minor_breaks = NULL, labels = sprintf("X=%d", 0:13))
```
## Generating r.v. with R {.smaller}
* Categorical r.v.
```{r, echo=TRUE}
sample(1:4, 6, prob = c(1/8,2/8,3/8,2/8), replace=TRUE)
```
* Gaussian r.v.
```{r, echo=TRUE}
rnorm(6, mean = 0, sd = 1)
```
* Binomial r.v.
```{r, echo=TRUE}
rbinom(6, size = 10, prob = 0.25)
```
* Poisson r.v.
```{r, echo=TRUE}
rpois(6, lambda = 3)
```
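As a check (a sketch, not in the original deck), the empirical moments of a large simulated sample should match the theoretical ones; e.g. for the binomial above, $\mathbb{E}[X] = n\pi = 2.5$ and $\text{var}[X] = n\pi(1-\pi) = 1.875$:
```{r, echo=TRUE}
x = rbinom(100000, size = 10, prob = 0.25)
mean(x)  # close to 10 * 0.25 = 2.5
var(x)   # close to 10 * 0.25 * 0.75 = 1.875
```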
## Activity {background-color=#CCE5FF}
Think about possible variables that can be modelled with one of the four distributions seen above:
* Categorical distribution
* Gaussian distribution
* Binomial distribution
* Poisson distribution
Think about possible parameters to use.
<!--
## Conditional independence for r.v.'s
* For random variables, conditional independence, $X \perp Y | Z$, holds if for some functions $g$ and $h$ we have
$$
p(x, y| z) = g(x, z) h(y, z)
$$
for all possible values $x$, $y$ and $z$ of $X$, $Y$ and $Z$ respectively.
* If $X$ and $Y$ are independent, then $cor(X,Y) = 0$.
```{r, fig.height=3, out.width="80%"}
set.seed(0)
X = tibble(x = replicate(1000, cor(rnorm(100), rbinom(100, 10, 0.4))))
ggplot(data=X) +
geom_histogram(aes(x = x), col = 'black', alpha = 0.4, bins = 20) +
labs(y = 'Frequencies', title = latex2exp::TeX('Correlation between independent $X$ and $Y$ distribution observed with simulation'),
subtitle = latex2exp::TeX('1000 simulations with $X\\sim N(\\mu = 0, \\sigma = 1)$ and $Y \\sim Bin(n=10, \\pi = 0.4)$'))
```
-->
## Joint probability distribution
The _joint probability distribution_ is a multivariate model for random variables. The joint probability distribution of r.v. $X_1, \dots, X_k$ is completely determined by providing
* the probability of all possible events for discrete r.v.'s
$$
p(x_1,\dots,x_k) = \text{P}(X_1 = x_1 \cap \cdots \cap X_k = x_k),
$$
* or the multivariate density for continuous r.v.'s
$$
f(x_1,\dots,x_k).
$$
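For discrete r.v.'s, a joint pmf can be estimated from simulated data with a cross table. A minimal sketch (a hypothetical example with two independent dice):
```{r, echo=TRUE}
x1 = sample(1:6, 100000, replace = TRUE)
x2 = sample(1:6, 100000, replace = TRUE)
prop.table(table(x1, x2))[1:3, 1:3]  # estimated joint pmf; each cell close to 1/36
```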
## Transformations of random variables
* If we transform a r.v. with a function $g(x)$, we get another r.v.
* Linear transformations. If we linearly transform a r.v. $\boldsymbol X=(X_1,\dots,X_k)$ with $f(\boldsymbol x) = \boldsymbol A \boldsymbol x + \boldsymbol b$, then
    * $\mathbb{E}[f(\boldsymbol X)] = \boldsymbol A \boldsymbol\mu + \boldsymbol b$ where $\boldsymbol\mu = \mathbb{E}[\boldsymbol X]$, and
    * $\text{cov}[f(\boldsymbol X)] = \boldsymbol A \boldsymbol\Sigma \boldsymbol A^{T}$ where $\boldsymbol\Sigma = \text{cov}[\boldsymbol X]$.
* For non-linear transformations we can simulate the r.v. to obtain these expected values.
__Example:__ The log-normal distribution is defined as the distribution obtained after exponentiating a gaussian r.v.: $f(x) = e^x$, $Y = e^X$, where $X \sim N(\mu,\sigma)$. Check, by simulating a gaussian r.v. (see the sketch below), that
* $\mathbb{E}[Y] = e^{\mu+\sigma^2/2}$ and
* $\text{var}[Y] = (e^{\sigma^2}-1) e^{2\mu+\sigma^2}$.
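A minimal sketch for this check; the parameter values $\mu = 0$, $\sigma = 0.5$ are just an assumption for illustration:
```{r, echo=TRUE}
mu = 0; sigma = 0.5
y = exp(rnorm(1000000, mean = mu, sd = sigma))
mean(y); exp(mu + sigma^2/2)                        # both approximately 1.13
var(y);  (exp(sigma^2) - 1) * exp(2*mu + sigma^2)   # both approximately 0.36
```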
## The central limit theorem
Let $X_1, \dots, X_n$ be independent, identically distributed r.v. with expected value $\mathbb{E}[X]$ and variance $\text{var}[X]$. Then, when $n$ is large, approximately
$$
\bar{X} = \frac{X_1 + \cdots + X_n}{n} \sim N(\mathbb{E}[X], \sqrt{\text{var}[X]/n})
$$
<hr>
The next plot shows the distribution of 1000 realisations of a r.v. obtained as the average of 30 binomial r.v.'s with parameters $n=10$ and $\pi=0.25$. The blue line represents the gaussian distribution with $\mu=n \pi$ and $\sigma=\sqrt{n \pi (1-\pi) / 30}$.
```{r, fig.height=3, out.width="80%"}
n_ = 10
pi_ = 0.25
X_mean = tibble(isim = 1:1000) %>%
  mutate(
    X = map(isim, ~rbinom(30, n_, pi_)),
    x_n = map_dbl(X, length),
    x_m = map_dbl(X, mean),
    x_s = map_dbl(X, sd))
MU = n_ * pi_
SIGMA = sqrt(n_ * pi_ * (1-pi_) / 30)
X_norm = tibble(x = seq(1.5, 3.5, length.out=1000),
                fx = dnorm(x, MU, SIGMA))
ggplot() +
  geom_histogram(data=X_mean, aes(x=x_m, y = ..density..), bins = 20, alpha=0.4, col = 'black') +
  geom_line(data=X_norm, aes(x=x, y = fx), col = 'blue') +
  labs(x = latex2exp::TeX("$\\frac{X_1+\\cdots+X_{30}}{30}$") )
```
## The central limit theorem
Let $X_1, \dots, X_n$ be independent, identically distributed r.v. with expected value $\mathbb{E}[X]$ and variance $\text{var}[X]$. Then, when $n$ is large, approximately
$$
\frac{\bar{X} - \mathbb{E}[X]}{\sqrt{\text{var}[X]/n}} \sim N(0, 1)
$$
<hr>
The next plot shows the distribution of 1000 standardised realisations of a r.v. obtained as the average of 30 binomial r.v.'s with parameters $n=10$ and $\pi=0.25$. The blue line represents the gaussian distribution with $\mu=0$ and $\sigma=1$.
```{r, fig.height=3, out.width="80%"}
Z_mean = X_mean %>%
  mutate(
    z_m = (x_m - MU) / SIGMA )
Z_norm = tibble(x = seq(-3, 3, length.out=1000),
                fx = dnorm(x))
ggplot() +
  geom_histogram(data=Z_mean, aes(x=z_m, y = ..density..), bins = 20, alpha=0.4, col = 'black') +
  geom_line(data=Z_norm, aes(x=x, y = fx), col = 'blue') +
  labs(x = latex2exp::TeX("$\\frac{\\bar{X} - E\\[X\\]}{\\sqrt{var\\[X\\]/n}}$") )
```
## Sampling distribution
If $x_1, \dots, x_n$ is an i.i.d. sample with expected value $\mathbb{E}[X]$, then, when $n$ is large, approximately
$$
\frac{\bar{x} - \mathbb{E}[X]}{s_x/\sqrt{n}} \sim t_{n-1}
$$
<hr>
The next plot shows the distribution of 1000 standardised realisations of a r.v. obtained as the average of 30 binomial r.v.'s with parameters $n=10$ and $\pi=0.25$. The normal and Student's t distributions are compared.
```{r, fig.height=3, out.width="80%"}
T_mean = X_mean %>%
  mutate(
    t_m = (x_m - MU) / (x_s/sqrt(x_n))
  )
T_norm = tibble(x = seq(-5,5, length.out=1000),
                fx = dt(x, 29))
ggplot() +
  geom_histogram(data=T_mean, aes(x=t_m, y = ..density..), bins = 20, alpha=0.4, col = 'black') +
  geom_line(data=Z_norm, aes(x=x, y = fx, col = 'Normal')) +
  geom_line(data=T_norm, aes(x=x, y = fx, col = 't-Student')) +
  labs(x = latex2exp::TeX("$\\frac{\\bar{X} - E\\[X\\]}{\\sqrt{var\\[X\\]/n}}$"), col = '') +
  scale_color_manual(values = c('blue', 'red')) +
  theme(legend.position = 'top')
```
## Final activity {background-color=#CCE5FF}
* Fix some r.v. $Y$ obtained after applying some transformation $f(x)$.
    * For example, transform a certain binomial distribution with the function $f(x) = \sin(x)$.
* Take a sample of size $50$ and calculate the mean of this sample. We call it $x_1$.
* Repeat the process 1000 times, obtaining a sample of means: $X_m = \{x_1, \dots, x_{1000}\}$.
* Visualise the resulting distribution of the sample $X_m$.
* Think about how we can use the $x_i$ to obtain information about $\mathbb{E}[Y]$ (a starting sketch follows below).
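One possible starting sketch for this activity; the binomial parameters here are just an assumption:
```{r, echo=TRUE, eval=FALSE}
# 1000 means of samples of size 50 from Y = sin(X), with X ~ Bin(10, 0.25)
x_m = replicate(1000, mean(sin(rbinom(50, size = 10, prob = 0.25))))
ggplot(data = tibble(x_m)) +
  geom_histogram(aes(x = x_m), bins = 20, col = 'black', alpha = 0.4)
```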
# That's all for today
## Next week session
```{r, echo=FALSE, results='asis'}
cat(readr::read_lines("session-05-content.md"), sep='\n')
```