6.1 General MLE framework
Let us start with an admittedly convoluted analogy. Suppose there are three possible flavors of ice cream in the world: chocolate, vanilla, and pistachio. We see a puddle of melted green ice cream, and we wish to guess what flavor it is without tasting it, for sanitary reasons. One method of arriving at a guess (or estimator) is to pick the flavor that has the highest likelihood of producing a green puddle out of all the possible flavors. That is, for each flavor, we think about the probability that that specific flavor could melt into a green puddle, and we choose the flavor associated with the highest conditional probability of producing a green puddle. In this example, because the puddle is green, we would guess pistachio. That is, out of the possible puddle-generating ice cream flavors, pistachio maximizes the probability of observing the actual data that we have collected (the puddle).[7]
In MLE, we are going to choose the estimator value $\hat{\theta}$ that is associated with the highest probability of observing the data that we have in our sample, conditional on that $\hat{\theta}$. For example, suppose our $x_i \sim f(x_i \mid \theta)$ for some unknown $\theta$ and we observe one observation of $x_i$:
$$\hat{\theta}_{MLE} = \arg\max_{\hat{\theta}} \; f(x_i \mid \hat{\theta})$$
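To make the argmax concrete, here is a small worked example (a hypothetical Bernoulli data-generating process, used purely for illustration; it is not assumed anywhere above). With a single draw $x_i \in \{0, 1\}$ from $f(x_i \mid \theta) = \theta^{x_i}(1-\theta)^{1-x_i}$,
$$\hat{\theta}_{MLE} = \arg\max_{\hat{\theta} \in [0,1]} \; \hat{\theta}^{x_i}(1-\hat{\theta})^{1-x_i} = x_i,$$
so observing $x_i = 1$ leads us to guess $\hat{\theta} = 1$: the parameter value under which the observation we actually saw was most likely, much like the green puddle pointing to pistachio.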
More often, we will have many observations of $x_i$, so the MLE estimator will be the maximizing input for the joint distribution of all of these independent $x$ draws:
$$\hat{\theta}_{MLE} = \arg\max_{\hat{\theta}} \; \underbrace{\prod_{i=1}^{N} f(x_i \mid \hat{\theta})}_{L(\hat{\theta})}$$
We can call $L(\hat{\theta})$ the likelihood function. This is the same object that we used to construct likelihood ratios earlier this semester. Note that it is not a distribution! That is, it is not a PDF; it would not make sense to integrate it out to get an expected $\hat{\theta}$. It is a function of $\hat{\theta}$ that takes the observed data as given. This is distinct from a distribution that characterizes the probabilities of random outcomes $x_i$ given a parameter value.[8]
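As a worked illustration of maximizing $L(\hat{\theta})$, continue the hypothetical Bernoulli example from above (again an illustration, not anything assumed in these notes). With $N$ independent draws, $L(\hat{\theta}) = \prod_{i=1}^{N} \hat{\theta}^{x_i}(1-\hat{\theta})^{1-x_i}$. Because the log is a monotone transformation, maximizing $\log L(\hat{\theta})$ gives the same argmax, and
$$\log L(\hat{\theta}) = \Big(\sum_{i} x_i\Big)\log\hat{\theta} + \Big(N - \sum_{i} x_i\Big)\log(1-\hat{\theta}),$$
whose first-order condition yields $\hat{\theta}_{MLE} = \frac{1}{N}\sum_{i} x_i$, the sample proportion of ones.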
The logic here is that our data was generated by some data-generating process with parameters $\theta$. It seems reasonable that a good estimator $\hat{\theta}$ of these population parameters would be the one that is the most likely, out of all possible parameter values, to actually produce the data that we observe.
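In most applications this maximization is done numerically rather than by hand. Here is a minimal sketch of that idea (assuming, purely for illustration, normally distributed data with unknown mean and known variance 1, and using scipy's scalar optimizer; none of these choices come from the notes above):

import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical sample: 200 draws from a Normal(mu = 2, sigma = 1) DGP
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=200)

def neg_log_likelihood(mu):
    # log L(mu) = sum_i log f(x_i | mu), with f the Normal(mu, 1) density;
    # we return the negative because the optimizer minimizes by convention
    return -np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2)

res = minimize_scalar(neg_log_likelihood)  # numerically finds argmax_mu L(mu)
print(res.x, x.mean())  # the numerical MLE should sit right on the sample mean

In this normal-mean case the analytical MLE is the sample mean, so the numerical argmax here just confirms what we could have derived by hand.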
Why do we need other methods of deriving estimators (i.e. why isn’t OLS enough)?
• The simplest answer is that the world is complicated and the data-generating processes that we are trying to understand might
not be appropriately characterized by a linear function. For instance, our “y variable” may be a binary variable for which a
linear regression might not be appropriate (more on this when we get to probit and logit).
• In some sense, maximum likelihood estimators are also more appealing than a mindless application of OLS, since they require one to be careful about specifying what one believes to be the statistical nature of the data-generating process.[9]
Is maximum likelihood estimation a Bayesian estimator?
• This is a reasonable point of confusion, since likelihood functions can seem quite like conditional probabilities, which are of course a very Bayesian concept. However, MLE is a firmly frequentist (classical) econometric method and could only be considered a Bayesian method in the edge case where the prior over possible parameters is uniform. The equivalent of MLE in a Bayesian framework is called maximum a posteriori (MAP) estimation.
• To see the difference more clearly, let's look at the Bayesian approach to deriving an estimator. If we were Bayesians, we would derive a distribution function
$$\Pr(\theta \mid \text{data}) = \frac{\Pr(\text{data} \mid \theta)\Pr(\theta)}{\Pr(\text{data})}$$
(Bayes' rule). One would typically get a distribution of possible $\theta$'s with different associated probabilities. We could get a point estimate $\hat{\theta}$ by just picking
$$\hat{\theta}_{Bayesian} = \arg\max_{\hat{\theta}} \Pr(\hat{\theta} \mid \text{data}).$$
Notice that this Bayesian approach requires us to take a stand on a prior $\Pr(\theta)$.
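To see the role of the prior concretely, here is a small sketch of the MAP idea for the same hypothetical normal-mean problem used above (again an illustration, not anything specified in these notes). MAP maximizes $\log \Pr(\text{data} \mid \theta) + \log \Pr(\theta)$, so with a flat prior the $\log \Pr(\theta)$ term is constant and the MAP estimate coincides with the MLE, while an informative prior pulls the estimate toward the prior mean.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=50)  # hypothetical data

def log_likelihood(mu):
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2)

def log_prior_flat(mu):
    return 0.0  # uniform prior: a constant that drops out of the argmax

def log_prior_normal(mu, m0=0.0, s0=0.5):
    return -0.5 * ((mu - m0) / s0) ** 2  # Normal(m0, s0^2) prior, up to a constant

def map_estimate(log_prior):
    # argmax_mu [ log Pr(data | mu) + log Pr(mu) ], found by minimizing the negative
    return minimize_scalar(lambda mu: -(log_likelihood(mu) + log_prior(mu))).x

print(map_estimate(log_prior_flat))    # matches the MLE (the sample mean)
print(map_estimate(log_prior_normal))  # pulled toward the prior mean m0 = 0
print(x.mean())                        # MLE, for comparison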
[7] In some ways, this is a stupidly obvious process that our brain does all the time without us consciously thinking about it. If we need to make a guess based on imperfect information, we try to guess the option that is most consistent with the information we have been given. MLE just formalizes this.
[8] There is actually a fair bit of writing on this point. If you are interested, a quick Google search of "MLE likelihood vs. probability" will turn up many forums on the topic. In the interest of brevity, I will leave those further explorations to the reader to do on their own time.
[9] What happens if we get the statistical nature of the DGP wrong? For example, what should we think of an MLE estimator when we have assumed our data is normal but it is actually Poisson? The short answer is that MLE will give us the likelihood-maximizing estimator within the class of the statistical distribution we have assumed for our model, but depending on which distribution the data is actually generated from and which model we have used, our MLE estimator might not be so great. In fact, we might want to take a step back and select our model to minimize the impact of mis-specification itself. This motivates a literature on model selection criteria, the most common of which is the Kullback-Leibler information criterion. A more thorough discussion is outside the scope of this particular set of notes.