An Intro to Text Analysis for Social Scientists
Patrick van Kessel, Senior Data Scientist
Pew Research Center
12/12/19 AAPOR Webinar
Agenda
Basic principles: how to convert text into quantitative data
Overview of common methods: a map of useful analysis tools
Demo: text analysis in action
The role of text in social research
Why text?
Free of assumptions
Potential for richer insights relative to closed-format responses
If organic, then data collection costs are often negligible
The role of text in social research
Where do I find it?
Open-ended surveys / focus groups / transcripts / interviews
Social media data (tweets, FB posts, etc.)
Long-form content (articles, notes, logs, etc.)
The role of text in social research
What makes it challenging?
Messy
“Data spaghetti” with little or no structure
Sparse
Low information-to-data ratio (lots of hay, few needles)
Often organic (rather than designed)
Can be naturally generated by people and processes
Often without a research use in mind
Data selection and preparation
Data selection and preparation
Know your objective and subject matter (if needed, find a subject matter expert)
Get familiar with the data
Don’t make assumptions - know your data, quirks and all
Data selection and preparation
Text Acquisition and Preparation
Select relevant data (text corpus)
Content
Metadata
Prepare the input file
Determine unit of analysis
Process text to get one document per unit of analysis
Image credit: http://www.nickmilton.com/2016/12/garbage-lessons-in-garbage-knowledge-out.html
(Pre-)Processing
Turning text into data
Turning text into data
Image credit: https://www.softwareadvice.com/resources/what-is-text-analytics/
Turning text into data
How do we sift through text and produce insight?
Might first try searching for keywords
How many times is “analysis” mentioned?
Raw Documents
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
Turning text into data
But a literal search for "analysis" misses "analyzing" (document 2) and "analytics" (document 3)
Turning text into data
Variations of words can have the same meaning but look completely different to a computer
Raw Documents
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
Turning text into data
Regular Expressions
A more sophisticated solution: regular expressions
Syntax for defining string (text) patterns
Raw Documents
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
Turning text into data
Regular Expressions
Can use to search text or extract specific chunks
Example use cases:
Extracting dates
Finding URLs
Identifying names/entities
https://regex101.com/
http://www.regexlib.com/
Image credit: https://www.smashingmagazine.com/2009/06/essential-guide-to-regular-expressions-tools-tutorials-and-resources/
Turning text into data
Regular Expressions
\banaly[a-z]+\b
Raw Documents
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
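As a minimal sketch of how this pattern behaves on the running example, using Python's built-in re module:

import re

documents = [
    "Text analysis is fun",
    "I enjoy analyzing text data",
    "Data science often involves text analytics",
]

# \b marks word boundaries, so this matches any word starting with "analy"
pattern = re.compile(r"\banaly[a-z]+\b")

for doc in documents:
    print(pattern.findall(doc.lower()))
# ['analysis'] / ['analyzing'] / ['analytics']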
Turning text into data
Regular Expressions
Regular expressions can be extremely powerful…
...and terrifyingly complex:
URLS: ((https?:\/\/(www\.)?)?[-a-zA-Z0-9@:%._\+~#=]{2,4096}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*))
DOMAINS: (?:http[s]?\:\/\/)?(?:www(?:s?)\.)?([\w\.\-]+)(?:[\\\/](?:.+))?
MONEY: \$([0-9]{1,3}(?:(?:\,[0-9]{3})+)?(?:\.[0-9]{1,2})?)\s
Turning text into data
Pre-processing
Great, but we can’t write patterns for everything
Words are messy and have a lot of variation
We need to collapse semantically similar words together
We need to clean / pre-process
Raw Documents
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
Turning text into data
Pre-processing
Common first steps:
Spell check / correct
Remove punctuation / expand contractions
Raw Documents Processed Documents
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
can’t -> cannot
they’re -> they_are
doesn’t -> does_not
Turning text into data
Pre-processing
Now to collapse words with the same meaning
We do this with stemming or lemmatization
Break words down to their roots
Raw Documents Processed Documents
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
Turning text into data
Pre-processing
Stemming is more conservative
There are many different stemmers
Here’s the Porter stemmer (1980)
Raw Documents -> Processed Documents
1 Text analysis is fun -> Text analysi is fun
2 I enjoy analyzing text data -> I enjoy analyz text data
3 Data science often involves text analytics -> Data scienc often involv text analyt
Turning text into data
Pre-processing
The Lancaster stemmer (1990) is newer and more aggressive
Truncates words a LOT
Raw Documents -> Processed Documents
1 Text analysis is fun -> text analys is fun
2 I enjoy analyzing text data -> I enjoy analys text dat
3 Data science often involves text analytics -> dat sci oft involv text analys
Turning text into data
Pre-processing
Lemmatization uses linguistic relationships and parts of speech to collapse words down to their root form - so you get actual words ("lemmas"), not stems
WordNet Lemmatizer
Raw Documents -> Processed Documents
1 Text analysis is fun -> text analysis is fun
2 I enjoy analyzing text data -> I enjoy analyze text data
3 Data science often involves text analytics -> data science often involve text analytics
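A quick way to compare the three approaches side by side, sketched with NLTK (assumes the WordNet data has been downloaded once; exact outputs may vary slightly by version):

import nltk
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time download for the lemmatizer

porter, lancaster, lemmatizer = PorterStemmer(), LancasterStemmer(), WordNetLemmatizer()
for word in ["analysis", "analyzing", "analytics"]:
    print(
        porter.stem(word),                    # e.g. analysi / analyz / analyt
        lancaster.stem(word),                 # e.g. analys / analys / analys
        lemmatizer.lemmatize(word, pos="v"),  # as a verb: analyzing -> analyze
    )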
Turning text into data
Pre-processing
Picking the right method depends on how much you want to preserve nuance or collapse meaning
We’ll stick with Lancaster
Raw Documents -> Processed Documents
1 Text analysis is fun -> text analys is fun
2 I enjoy analyzing text data -> I enjoy analys text dat
3 Data science often involves text analytics -> dat sci oft involv text analys
Turning text into data
Pre-processing
Finally, we need to remove words that don’t carry much meaning on their own
These are called “stopwords”
Can expand standard stopword lists with custom words
Raw Documents -> Processed Documents
1 Text analysis is fun -> text analys fun
2 I enjoy analyzing text data -> enjoy analys text dat
3 Data science often involves text analytics -> dat sci oft involv text analys
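A minimal sketch of this step with NLTK's standard English stopword list (the custom addition is a hypothetical placeholder), followed by Lancaster stemming:

import nltk
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer

nltk.download("stopwords", quiet=True)

stops = set(stopwords.words("english"))
stops.update({"placeholder"})  # hypothetical domain-specific additions

stemmer = LancasterStemmer()
doc = "I enjoy analyzing text data"
tokens = [w for w in doc.lower().split() if w not in stops]
print([stemmer.stem(t) for t in tokens])  # ['enjoy', 'analys', 'text', 'dat']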
Turning text into data
Pre-processing
A word of caution: there aren’t any universal rules for making pre-processing decisions
Do what makes sense for your data - but be cautious of the researcher degrees of freedom involved
See:
Denny and Spirling, 2016. “Assessing the Consequences of Text Pre-processing Decisions”
Denny and Spirling, 2018. “Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do About It”
Turning text into data
Tokenization
Now we need to tokenize
Break words apart according to certain rules
Usually breaks on whitespace and punctuation
What’s left are called “tokens”
Sequences of one or more adjacent tokens are called “ngrams” (one token is a unigram, a pair is a bigram, and so on)
Turning text into data
Tokenization
We can express the presence of each “ngram” as a column
This is often called a “term frequency matrix”
Here are unigrams
   text  analys  fun  enjoy  dat  sci  oft  involv
1    1     1      1
2    1     1           1     1
3    1     1                 1    1    1     1
Turning text into data
Tokenization
We can express the presence of each “ngram” as a column
This is often called a “term frequency matrix”
And here are bigrams
   text_analys  analys_fun  enjoy_analys  analys_text  text_dat  dat_sci  sci_oft  oft_involv
1       1           1
2                               1             1            1
3       1                                                            1        1         1
Turning text into data
Tokenization
If we want to characterize the whole corpus, we can just look at the most frequent words
Here’s the “term frequency matrix”:
        text  analys  fun  enjoy  dat  sci  oft  involv
1         1     1      1
2         1     1           1     1
3         1     1                 1    1    1     1
total     3     3      1    1     2    1    1     1
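A sketch of building this matrix with scikit-learn's CountVectorizer, using the stemmed documents from the slides:

from sklearn.feature_extraction.text import CountVectorizer

processed_docs = [
    "text analys fun",
    "enjoy analys text dat",
    "dat sci oft involv text analys",
]

# binary=True records presence/absence; ngram_range=(2, 2) would give bigrams
vectorizer = CountVectorizer(binary=True)
tf = vectorizer.fit_transform(processed_docs)

print(vectorizer.get_feature_names_out())  # the term columns
print(tf.toarray())                        # one row per document
print(tf.toarray().sum(axis=0))            # corpus-wide term frequencies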
Turning text into data
TF-IDF
But what if we want to distinguish documents from each other?
We know these documents are about text analysis
What makes them unique?
Image credit: https://medium.com/@imamun/creating-a-tf-idf-in-python-e43f05e4d424
Turning text into data
TF-IDF
Divide word frequencies by the number of documents they appear in
Down-weight words that are common; log-scale emphasizes unique words
Several variants that add smoothing
Image credit: https://sites.temple.edu/tudsc/2017/03/30/measuring-similarity-between-texts-in-python/tfidf-equations/
Image credit: https://medium.com/@imamun/creating-a-tf-idf-in-python-e43f05e4d424
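Written out, the classic (unsmoothed) version of the weighting described above, where tf(t, d) is term t's frequency in document d, N is the number of documents, and df(t) is the number of documents containing t (smoothed variants add constants to avoid dividing by zero):

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}$$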
Turning text into data
TF-IDF
The overall distribution of words is still largely preserved
But now we’re emphasizing what makes each document unique
Within each document, we’re now highlighting distinctive terms
        text  analys  fun  enjoy  dat  sci  oft  involv
1         1     1     2.1
2         1     1          2.1    1.4
3         1     1                 1.4  2.1  2.1   2.1
total     3     3     2.1  2.1    2.8  2.1  2.1   2.1
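A sketch that reproduces these weights with scikit-learn (norm=None and smooth_idf=False match the simple variant shown here; scikit-learn's idf adds 1, so corpus-wide terms like "text" keep a weight of 1):

from sklearn.feature_extraction.text import TfidfVectorizer

processed_docs = [
    "text analys fun",
    "enjoy analys text dat",
    "dat sci oft involv text analys",
]

vectorizer = TfidfVectorizer(norm=None, smooth_idf=False)
tfidf = vectorizer.fit_transform(processed_docs)
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(1))  # matches the weights in the table above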
Turning text into data
TF-IDF
TF-IDF is an extremely common and useful way to convert text into quantitative features
It’s often all you need
But there are other, more complex ways to quantify text
Turning words into numbers
Part-of-Speech Tagging
Sometimes you care about how a word is used
Can use pre-trained part-of-speech (POS) taggers
Can also help with things like negation
“Happy” vs. “NOT happy”
Image credit: http://nltk.sourceforge.net/doc/en/ch03.html
Image credit: https://www.nltk.org/book_1ed/ch05.html
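A sketch with NLTK's pre-trained tagger (resource names and exact tags can vary by NLTK version):

import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("I enjoy analyzing text data")
print(nltk.pos_tag(tokens))
# e.g. [('I', 'PRP'), ('enjoy', 'VBP'), ('analyzing', 'VBG'), ('text', 'NN'), ('data', 'NNS')]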
Turning words into numbers
Named Entity Recognition
Might also be interested in people, places, organizations, etc.
Like POS taggers, named entity extractors use trained models
Image credit: http://inspiratron.org/blog/2019/04/15/building-named-entity-recognizer-ner-using-conditional-random-fields-crf/
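A sketch using spaCy's pre-trained pipeline (assumes the small English model has been installed via python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Pew Research Center is based in Washington, D.C.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Pew Research Center ORG / Washington, D.C. GPE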
Turning words into numbers
Word Embeddings
Other methods can quantify words not by frequency, but by their relationships to other words
Word2vec uses a sliding window to read words and learn their relationships; each word gets a vector in N-dimensional space
Pretrained model: https://code.google.com/archive/p/word2vec/
Image credit: https://www.tensorflow.org/tutorials/text/word_embeddings
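A sketch of training a toy model with gensim (4.x API); real applications need far more text, or the pretrained Google News vectors linked above:

from gensim.models import Word2Vec

sentences = [
    ["text", "analysis", "is", "fun"],
    ["i", "enjoy", "analyzing", "text", "data"],
    ["data", "science", "often", "involves", "text", "analytics"],
]

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, seed=1)
print(model.wv["text"].shape)         # (50,): one vector per word
print(model.wv.most_similar("text"))  # nearest neighbors in the learned space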
Analysis
Finding patterns in text data
Finding patterns in text data
Two types of approaches:
Unsupervised NLP: automated, extracts structure from the data
Clustering
Topic modeling
Mutual information
Supervised NLP: requires training data, learns to predict labels and classes
Classification
Regression
Finding patterns in text data
Unsupervised methods
Collocation / phrase detection
A simple way to get a quick sense of common phrases
Bigrams are a form of “collocation” - a more general term for words that occur together
Code modified from: https://www.nltk.org/howto/collocations.html
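A sketch adapted from that NLTK how-to, using the Genesis corpus it ships with:

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("genesis", quiet=True)
words = nltk.corpus.genesis.words("english-web.txt")

finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)  # ignore pairs seen fewer than 3 times
print(finder.nbest(BigramAssocMeasures.pmi, 10))  # top 10 collocations by PMI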
Finding patterns in text data
Unsupervised methods
Co-occurrence matrices
We can also find words that occur in the same documents together (not just next to each other)
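One sketch: multiply a binary document-term matrix by its transpose, which yields a term-by-term co-occurrence matrix:

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "text analys fun",
    "enjoy analys text dat",
    "dat sci oft involv text analys",
]

X = CountVectorizer(binary=True).fit_transform(docs)

# entry (i, j) = number of documents containing both term i and term j;
# the diagonal holds each term's document frequency
print((X.T @ X).toarray())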
Finding patterns in text data
Unsupervised methods
Might want to compare documents (or words) to one another
Possible applications
Spelling correction
Document deduplication
Measure similarity of language
Politicians’ speeches
Movie reviews
Product descriptions
Image credit: https://pibytes.wordpress.com/2013/02/02/deduplication-internals-part-1/
Finding patterns in text data
Unsupervised methods
Levenshtein distance
Compute the number of steps needed to turn one word/document into another
Can express as a ratio (percent of the word/document that needs to change)
Image credit: http://web.stanford.edu/~jurafsky/slp3/2.pdf
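A minimal dynamic-programming sketch (dedicated libraries are faster for real workloads):

def levenshtein(a, b):
    """Minimum insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]

print(levenshtein("analysis", "analytics"))  # 2
# one simple ratio: 1 - distance / max(len(a), len(b))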
Finding patterns in text data
Unsupervised methods
Cosine similarity
Compute the “angle” between two word vectors
TF-IDF: axes are the weighted frequencies for each word
Word2Vec: axes are the learned dimensions from the model
Image credit: https://www.machinelearningplus.com/nlp/cosine-similarity/
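A sketch with scikit-learn, computing pairwise similarity over TF-IDF vectors:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Text analysis is fun",
    "I enjoy analyzing text data",
    "Data science often involves text analytics",
]

tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf).round(2))  # 3x3 document-by-document matrix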
Finding patterns in text data
Unsupervised methods
Clustering
Algorithms that use word vectors (TF-IDF, Word2Vec, etc.) to identify structural groupings between observations (words, documents)
K-Means is a very commonly used one
Image credit: http://mnemstudio.org/clustering-k-means-introduction.htm
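A sketch of K-Means over TF-IDF features; the fourth, off-topic document is a hypothetical addition so there is something to separate:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Text analysis is fun",
    "I enjoy analyzing text data",
    "Data science often involves text analytics",
    "The stock market fell sharply today",  # hypothetical off-topic document
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf)
print(kmeans.labels_)  # cluster assignment for each document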
Finding patterns in text data
Unsupervised methods
Hierarchical/agglomerative clustering
Start with all observations, use a rule to pair them up, and repeat until only one group remains
Image credit: https://scrnaseq-course.cog.sanger.ac.uk/website/biological-analysis.html
Image credit: https://rpkgs.datanovia.com/factoextra/reference/fviz_dend.html
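A sketch with SciPy's agglomerative (Ward) linkage; the dendrogram is the kind of figure credited above:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Text analysis is fun",
    "I enjoy analyzing text data",
    "Data science often involves text analytics",
]

tfidf = TfidfVectorizer().fit_transform(docs).toarray()
Z = linkage(tfidf, method="ward")  # repeatedly merge the closest groups
dendrogram(Z, labels=["doc1", "doc2", "doc3"])
plt.show()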
Finding patterns in text data
Unsupervised methods
Network analysis
Can also get creative
After all, we’re just working with columns and numbers
Example: link words together by their strongest correlations
Image credit: https://www.linkedin.com/in/patrick-van-kessel
Finding patterns in text data
Unsupervised methods
Pointwise mutual information
Based on information theory
Compares conditional and joint probabilities to measure the likelihood of a word occurring with a category/outcome, beyond random chance
Image credit: https://www.people-press.org/2017/02/23/partisan-language-in-congressional-outreach/
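A sketch of the computation with hypothetical counts (how often a word, a category, and both appear across 1,000 documents):

import numpy as np

n_docs = 1000
p_word = 100 / n_docs      # P(word): hypothetical counts
p_category = 200 / n_docs  # P(category)
p_both = 60 / n_docs       # P(word and category together)

# PMI > 0 means the pair co-occurs more often than chance would predict
pmi = np.log2(p_both / (p_word * p_category))
print(round(pmi, 2))  # 1.58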
Finding patterns in text data
Unsupervised methods
Topic modeling
Algorithms that characterize documents in terms of topics (groups of words)
Find the topics that best fit the data
Image credit: http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
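A sketch with scikit-learn's implementation of LDA (the model from the Blei et al. paper credited above); real corpora need many more documents:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Text analysis is fun",
    "I enjoy analyzing text data",
    "Data science often involves text analytics",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:3]  # three highest-weighted words per topic
    print(f"Topic {i}:", [terms[j] for j in top])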
Finding patterns in text data
Supervised methods
Often we want to categorize documents
Unsupervised methods can help
But often we need to read and label them ourselves
Classification models can take labeled data and learn to make predictions
Finding patterns in text data
Supervised methods
Steps:
Label a sample of documents
Break your sample into two sets: a training sample and a test sample
Train a model on the training sample
Evaluate it on the test sample
Apply it to the full set of documents to make predictions
Finding patterns in text data
Supervised methods
First you need to develop a codebook
Codebook: set of rules for labeling and categorizing documents
The best codebooks have clear rules for hard cases, and lots of examples
Categories should be MECE: mutually exclusive and collectively exhaustive
Finding patterns in text data
Supervised methods
Need to validate the codebook by measuring interrater reliability
Makes sure your measures are consistent, objective, and reproducible
Multiple people code the same document
Image credit: https://socialresearchmethods.net/kb/reltypes.php
Finding patterns in text data
Supervised methods
Various metrics to test whether their agreement is high enough
Krippendorff’s alpha
Cohen’s kappa
Can also compare coders against a gold standard, if available
Image credit: https://www.researchgate.net/figure/Interpretation-of-Cohens-Kappa-Values_tbl2_302869046
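A sketch of computing Cohen's kappa for two coders with scikit-learn (the labels here are hypothetical):

from sklearn.metrics import cohen_kappa_score

coder_a = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical labels from coder A
coder_b = [1, 0, 1, 0, 0, 1, 0, 1]  # hypothetical labels from coder B
print(cohen_kappa_score(coder_a, coder_b))  # 0.5: moderate agreement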
Finding patterns in text data
Supervised methods
Mechanical Turk can be a great way to code a lot of documents
Have 5+ Turkers code a large sample of documents
Collapse them together with a rule
Code a subset in-house, and compute reliability
Image credit: https://machmachines.com/make-some-extra-cash-with-amazon-mechanical-turk060515/
Finding patterns in text data
Supervised methods
After coding, split your sample into two sets (~80/20)
One for training, one for testing
We do this to check for (and avoid) overfitting
Image credit: https://medium.com/ml-research-lab/under-fitting-over-fitting-and-its-solution-dc6191e34250
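A sketch with scikit-learn; the texts and labels are hypothetical placeholders for your coded sample:

from sklearn.model_selection import train_test_split

# hypothetical placeholders for your hand-coded sample
texts = ["doc one", "doc two", "doc three", "doc four", "doc five"]
labels = [0, 1, 0, 1, 0]

# 80% for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)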
Finding patterns in text data
Supervised methods
Next step is called feature extraction or feature selection
Need to extract “features” from the text
TF-IDF
Word2Vec vectors
Can also utilize metadata, if potentially useful
Finding patterns in text data
Supervised methods
Select a classification algorithm
A common choice for text data is the support vector machine (SVM)
Similar to regression, SVMs find the line that best separates two or more groups
Can also use non-linear “kernels” for better fits (radial basis function, etc.)
XGBoost is a newer and very promising algorithm
Image credit: https://towardsdatascience.com/support-vector-machine-vs-logistic-regression-94cc2975433f
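A sketch of a TF-IDF + linear SVM pipeline, continuing the hypothetical split above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

model = Pipeline([
    ("tfidf", TfidfVectorizer()),  # feature extraction
    ("svm", LinearSVC()),          # linear decision boundary
])
model.fit(X_train, y_train)
predictions = model.predict(X_test)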
Finding patterns in text data
Supervised methods
Time to evaluate performance
Lots of different metrics, depending on what you care about
Often we care about precision/recall
Precision: did you pick out mostly needles or mostly hay?
Recall: how many needles did you miss?
Other metrics:
Matthews correlation coefficient
Brier score
Overall accuracy
Image credit: https://en.wikipedia.org/wiki/Precision_and_recall
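Continuing the sketch, scikit-learn reports most of these in a couple of calls:

from sklearn.metrics import classification_report, matthews_corrcoef

print(classification_report(y_test, predictions))  # precision, recall, F1, accuracy
print(matthews_corrcoef(y_test, predictions))      # robust to class imbalance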
Finding patterns in text data
Supervised methods
Doing just one split leaves a lot up to chance
To bootstrap a better estimate of the model’s performance, it’s best to use K-fold cross-validation
Splits your data into train/test sets multiple times and averages the performance metrics
Ensures that you didn’t just get lucky (or unlucky)
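Continuing the sketch, scikit-learn will handle the folds for you:

from sklearn.model_selection import cross_val_score

# with a real-sized sample you'd use cv=5; the tiny toy sample above only supports 2 folds
scores = cross_val_score(model, texts, labels, cv=2, scoring="accuracy")
print(scores.mean(), scores.std())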
Finding patterns in text data
Supervised methods
Model not working well?
You probably need to tune your parameters
You can use a grid search to test out different combinations of model parameters and feature extraction methods
Many software packages can automatically help you pick the best combination to maximize your model’s performance
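A sketch of a grid search over both the feature-extraction and model parameters of the pipeline above:

from sklearn.model_selection import GridSearchCV

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    "svm__C": [0.1, 1, 10],                  # regularization strength
}
search = GridSearchCV(model, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_)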
Finding patterns in text data
Supervised methods
Suggested design:
Large training sample, coded by Turkers
Small evaluation sample, coded by Turkers and in-house experts
Compute IRR between Turk and experts
Train model on training sample, use 5-fold cross-validation
Apply model to evaluation sample, compare results against in-house coders and Turkers
Finding patterns in text data
Supervised methods
Some (but not all) models produce probabilities along with their classifications
Ideally you fit the model using your preferred scoring metric/function
But you can also use post-hoc probability thresholds to adjust your model’s predictions
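A sketch of post-hoc thresholding with a model that exposes probabilities (logistic regression here, since LinearSVC does not provide predict_proba):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

clf = Pipeline([("tfidf", TfidfVectorizer()), ("lr", LogisticRegression())])
clf.fit(X_train, y_train)

probs = clf.predict_proba(X_test)[:, 1]  # P(class 1) for each document
print((probs >= 0.7).astype(int))        # stricter cutoff than the 0.5 default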
Tools and Resources
Open-source tools
Python
NLTK, scikit-learn, pandas, numpy, scipy, gensim, spacy, etc.
R
https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
Java
Stanford Core NLP + many other useful libraries
https://nlp.stanford.edu/software/
Image credit: https://stackoverflow.com/
Commercial tools
Cloud-based NLP
Amazon Comprehend
Google Cloud Natural Language
IBM Watson NLU
Software
SPSS Text Modeler
Provalis WordStat
Image credit: https://provalisresearch.com/products/content-analysis-software/
Time for a demo!
https://bit.ly/2rlCOUG
Full link: https://colab.research.google.com/github/patrickvankessel/AAPOR-Text-Analysis-2019/blob/master/Tutorial.ipynb
GitHub repo: https://github.com/patrickvankessel/AAPOR-Text-Analysis-2019
Feel free to reach out:
Special thanks to Michael Jugovich for help putting these materials together for previous workshops
Thank you!
https://bit.ly/2rlCOUG
Full link: https://colab.research.google.com/github/patrickvankessel/AAPOR-Text-Analysis-2019/blob/master/Tutorial.ipynb
GitHub repo: https://github.com/patrickvankessel/AAPOR-Text-Analysis-2019
Feel free to reach out:
Special thanks to Michael Jugovich for help putting these materials together for previous workshops