Estimating Proportions

Goal of Statistics

Statistics seeks to describe characteristics of a broad group (population) using only a subset of information (sample). For instance, making statements about all of Chico’s graduates would be difficult; we’d first have to find them all and then extract data from each person. Instead, statistics uses a sample of all graduates to infer characterstics about the population of Chico’s graduates.

In proper language, a statistician uses a radom sample to calculate sample statistics which provide estimates of population parameters. Relative to the image above, the population is depicted on the left, the sample is depicted on the right, and the arrows indicate we are to take a simple random sample of individuals (or observations) from the population. With these data, we make inferences about the population parameters. The discipline of statistics studies how to properly use data to make best guesses about the population. To be useful, we must carefully interpret these best guesses.

At the mathematical level, the population is described by a function, and characteristics of the population are parameters that give structure to these functions. Data is then theoretically generated from the population’s function. The data, thus carrying information about this function, are used to estimate the population parameters. As the data side is likely to be more tangible, we’ll start there.

Data

Since this class is both a introduction to R and a statistics course, we’ll waste no time introducing R code. Let’s load two of the most common packages, read in a data set (which R calls a dataframe), and then make a plot. For every analysis, big or small, that we perform in this class these three steps should be the very first.

library(dplyr)
library(ggplot2)
update_geom_defaults("point", list(colour = "blue"))
update_geom_defaults("density", list(colour = "blue"))
update_geom_defaults("path", list(colour = "blue"))
old <- theme_set(theme_bw() + theme(text = element_text(size=18)))
df %>%
select('SuperFamily', 'Family') %>%
sample_n(6)
<th scope=col>SuperFamily</th><th scope=col>Family</th>
FeliformiaViverridae
FeliformiaFelidae
FeliformiaFelidae
CaniformiaCanidae
FeliformiaViverridae
CaniformiaMustelidae
ggplot() +
geom_point(data=df, aes(SB, BW, color=SuperFamily), size=3)

If these data are truly a random sample (and we’re to believe they are), then the proportions of the colors (not the numbers) depict a population parameter. Here, $p$ might be the population proportion of animals from the order Carnivora that are in the Super Family Caniformia. As we don’t know what value $p$ takes on, we will estimate it with data.

As far as this class is concerned, estimating population parameters from data takes quite a bit of machinery. The first necessary piece is the (assumed) functional form that represents a proportion. A common choice for proportions is the Bernoulli distribution. The Bernoulli distribution will provide us with a function, dependent on some unkown value $p$, from which we collect data and then manipulate to estimate $p$.

Bernoulli Distribution

The probability density function of the Bernoulli distribution

for $x \in {0, 1}$ and $p \in [0, 1]$. Notice that $x$ only takes on a finite set of values. When a random variable can take on only a countable number of values, it is called a discrete random variable.

bernoulli <- function(x, p) {
p^x * (1 - p)^(1 - x)
}
x <- 0:1
p <- 0.25
df <- data.frame(x = x, f = bernoulli(x, p))
ggplot(df, aes(factor(x), f)) +
geom_point(size=3) +
labs(x = 'x', y = 'f(x)')

Example

Since $x$ only ever takes on two values $0$ or $1$, this matches perfectly with our binary categorical variable SuperFamily. The trick is, the levels of the variable SuperFamily will correspond to the values that the input $x$ of the Bernoulli random variable can take on, namely $0$ and $1$. How we map from $\{Canfiromia, Feliformia\}$ to $\{0, 1\}$ is mathematically unimportant, but convention suggests that you are interested in one of the two levels more than the other.

Symbollically, we write $X_n \sim_{iid} \text{Bernoulli}(p)$ for $n = 1, \ldots, N$. The random variables $X_n$ correspond to the sequence $0$s and $1$s that tell us which observations belong to the Super Family Caniformia. The population parameter $p$ is unwknown, but can be estimated with the data $X_n$. Notice that for Bernoulli data, the sample (because it’s applied to data) mean returns a proportion since at most the sum of $N$ $1$s is $N$.

library(dplyr)
carnivora <- read.csv("https://raw.githubusercontent.com/roualdes/data/master/carnivora.csv") %>%
mutate(Caniformia = as.numeric(SuperFamily == 'Caniformia')) # interested in Canfiformia