6.1 General MLE framework
Let us start with an admittedly convoluted analogy. Suppose there are three possible flavors of ice cream in the world: chocolate, vanilla, and pistachio. We see a puddle of melted green ice cream, and we wish to guess what flavor it is without tasting it, for sanitary reasons. One method of arriving at a guess (or estimator) is to pick the flavor that has the highest likelihood of producing a green puddle out of all the possible flavors. That is, for each flavor, we think about the probability that that specific flavor could melt into a green puddle, and we choose the flavor associated with the highest conditional probability of producing a green puddle. In this example, because the puddle is green, we would guess pistachio. That is, out of the possible puddle-generating ice cream flavors, pistachio maximizes the probability of observing the actual data that we have collected (the puddle).[7]
In MLE, we are going to choose the estimator value $\hat{\theta}$ that is associated with the highest probability of observing the data that we have in our sample, conditional on that $\hat{\theta}$. For example, suppose our $x_i \sim f(x_i \mid \theta)$ for some unknown $\theta$ and we observe one observation of $x_i$:
$$\hat{\theta}_{MLE} = \arg\max_{\hat{\theta}} \; f(x_i \mid \hat{\theta})$$
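To make the argmax concrete, here is a small worked example (a hypothetical Bernoulli data-generating process, used purely for illustration; it is not assumed anywhere above). With a single draw $x_i \in \{0, 1\}$ from $f(x_i \mid \theta) = \theta^{x_i}(1-\theta)^{1-x_i}$,
$$\hat{\theta}_{MLE} = \arg\max_{\hat{\theta} \in [0,1]} \; \hat{\theta}^{x_i}(1-\hat{\theta})^{1-x_i} = x_i,$$
so observing $x_i = 1$ leads us to guess $\hat{\theta} = 1$: the parameter value under which the observation we actually saw was most likely, much like the green puddle pointing to pistachio.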
More often, we will have many observations of $x_i$, so the MLE estimator will be the maximizing input for the joint distribution of all of these independent $x$ draws:
$$\hat{\theta}_{MLE} = \arg\max_{\hat{\theta}} \; \underbrace{\prod_{i=1}^{N} f(x_i \mid \hat{\theta})}_{L(\hat{\theta})}$$
We can call $L(\hat{\theta})$ the likelihood function. This is the same object that we used to construct likelihood ratios earlier this semester. Note that it is not a distribution! That is, it is not a PDF; it would not make sense to integrate it out to get an expected $\hat{\theta}$. It is a function of $\hat{\theta}$ that takes the observed data as given. This is distinct from a distribution that characterizes the probabilities of random outcomes $x_i$ given a parameter value.[8]
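As a worked illustration of maximizing $L(\hat{\theta})$, continue the hypothetical Bernoulli example from above (again an illustration, not anything assumed in these notes). With $N$ independent draws, $L(\hat{\theta}) = \prod_{i=1}^{N} \hat{\theta}^{x_i}(1-\hat{\theta})^{1-x_i}$. Because the log is a monotone transformation, maximizing $\log L(\hat{\theta})$ gives the same argmax, and
$$\log L(\hat{\theta}) = \Big(\sum_{i} x_i\Big)\log\hat{\theta} + \Big(N - \sum_{i} x_i\Big)\log(1-\hat{\theta}),$$
whose first-order condition yields $\hat{\theta}_{MLE} = \frac{1}{N}\sum_{i} x_i$, the sample proportion of ones.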
The logic here is that our data was generated by some data-generating process with parameters $\theta$. It seems reasonable that a good estimator $\hat{\theta}$ of these population parameters would be the one that is the most likely, out of all possible parameter values, to actually produce the data that we observe.
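In most applications this maximization is done numerically rather than by hand. Here is a minimal sketch of that idea (assuming, purely for illustration, normally distributed data with unknown mean and known variance 1, and using scipy's scalar optimizer; none of these choices come from the notes above):

import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical sample: 200 draws from a Normal(mu = 2, sigma = 1) DGP
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=200)

def neg_log_likelihood(mu):
    # log L(mu) = sum_i log f(x_i | mu), with f the Normal(mu, 1) density;
    # we return the negative because the optimizer minimizes by convention
    return -np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2)

res = minimize_scalar(neg_log_likelihood)  # numerically finds argmax_mu L(mu)
print(res.x, x.mean())  # the numerical MLE should sit right on the sample mean

In this normal-mean case the analytical MLE is the sample mean, so the numerical argmax here just confirms what we could have derived by hand.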
Why do we need other methods of deriving estimators (i.e. why isn’t OLS enough)?
• The simplest answer is that the world is complicated and the data-generating processes that we are trying to understand might
not be appropriately characterized by a linear function. For instance, our “y variable” may be a binary variable for which a
linear regression might not be appropriate (more on this when we get to probit and logit).
• In some sense, maximum likelihood estimators are also more appealing than a mindless application of OLS, since they require one to be careful about specifying what one believes to be the statistical nature of the data-generating process.[9]
Is maximum likelihood estimation a Bayesian estimator?
• This is a reasonable point of confusion, since likelihood functions can seem quite like conditional probabilities, which are of course a very Bayesian concept. However, MLE is a firmly frequentist (classical) econometric method and could only be considered a Bayesian method in the edge case where the prior over possible parameters is uniform. The equivalent of MLE in a Bayesian framework is called maximum a posteriori (MAP) estimation.
• To see the difference more clearly, let's look at the Bayesian approach to deriving an estimator. If we were Bayesians, we would derive a distribution function
$$\Pr(\theta \mid \text{data}) = \frac{\Pr(\text{data} \mid \theta)\Pr(\theta)}{\Pr(\text{data})}$$
(Bayes' rule). One would typically get a distribution of possible $\theta$'s with different associated probabilities. We could get a point estimate $\hat{\theta}$ by just picking
$$\hat{\theta}_{Bayesian} = \arg\max_{\hat{\theta}} \Pr(\hat{\theta} \mid \text{data}).$$
Notice that this Bayesian approach requires us to take a stand on a prior $\Pr(\theta)$.
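To see the role of the prior concretely, here is a small sketch of the MAP idea for the same hypothetical normal-mean problem used above (again an illustration, not anything specified in these notes). MAP maximizes $\log \Pr(\text{data} \mid \theta) + \log \Pr(\theta)$, so with a flat prior the $\log \Pr(\theta)$ term is constant and the MAP estimate coincides with the MLE, while an informative prior pulls the estimate toward the prior mean.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=50)  # hypothetical data

def log_likelihood(mu):
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2)

def log_prior_flat(mu):
    return 0.0  # uniform prior: a constant that drops out of the argmax

def log_prior_normal(mu, m0=0.0, s0=0.5):
    return -0.5 * ((mu - m0) / s0) ** 2  # Normal(m0, s0^2) prior, up to a constant

def map_estimate(log_prior):
    # argmax_mu [ log Pr(data | mu) + log Pr(mu) ], found by minimizing the negative
    return minimize_scalar(lambda mu: -(log_likelihood(mu) + log_prior(mu))).x

print(map_estimate(log_prior_flat))    # matches the MLE (the sample mean)
print(map_estimate(log_prior_normal))  # pulled toward the prior mean m0 = 0
print(x.mean())                        # MLE, for comparison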
[7] In some ways, this is a stupidly obvious process that our brain does all the time without us consciously thinking about it. If we need to make a guess based on imperfect information, we try to guess the option that is most consistent with the information we have been given. MLE just formalizes this.
[8] There is actually a fair bit of writing on this point. If you are interested, a quick Google search of "MLE likelihood vs. probability" will turn up many forums on the topic. In the interest of brevity, I will leave those further explorations to the reader to do on their own time.
[9] What happens if we get the statistical nature of the DGP wrong? For example, what should we think of an MLE estimator when we have assumed our data is normal but it is actually Poisson? The short answer is that MLE will give us the likelihood-maximizing estimator within the class of the statistical distribution we have assumed for our model, but depending on which distribution the data is actually generated from and which model we have used, our MLE estimator might not be so great. In fact, we might want to take a step back and select our model to minimize the impact of mis-specification itself. This motivates a literature on model selection criteria, the most common of which is the Kullback-Leibler information criterion. A more thorough discussion is outside the scope of this particular set of notes.