An Intro to Text Analysis for Social Scientists
Patrick van Kessel, Senior Data Scientist
Pew Research Center
12/12/19 AAPOR Webinar
Agenda
Basic principles: how to convert text into quantitative data
Overview of common methods: a map of useful analysis tools
Demo: text analysis in action
The role of text in social research
Why text?
Free of assumptions
Potential for richer insights relative to closed-format responses
If organic, then data collection costs are often negligible
The role of text in social research
Where do I find it?
Open-ended surveys / focus groups / transcripts / interviews
Social media data (tweets, FB posts, etc.)
Long-form content (articles, notes, logs, etc.)
The role of text in social research
What makes it challenging?
Messy
“Data spaghetti” with little or no structure
Sparse
Low information-to-data ratio (lots of hay, few needles)
Often organic (rather than designed)
Can be naturally generated by people and processes
Often without a research use in mind
Data selection and preparation
Data selection and preparation
Know your objective and subject matter (if needed, find a subject matter expert)
Get familiar with the data
Don’t make assumptions - know your data, quirks and all
Data selection and preparation
Text Acquisition and Preparation
Select relevant data (text corpus)
Content
Metadata
Prepare the input file
Determine unit of analysis
Process text to get one document per unit of analysis
Image credit: http://www.nickmilton.com/2016/12/garbage-lessons-in-garbage-knowledge-out.html
(Pre-)Processing
Turning text into data
Turning text into data
Image credit: https://www.softwareadvice.com/resources/what-is-text-analytics/
Turning text into data
How do we sift through text and produce insight?
Might first try searching for keywords
How many times is “analysis” mentioned?
Raw Documents
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
Turning text into data
But a literal search for "analysis" misses "analyzing" (document 2) and "analytics" (document 3)
Turning text into data
Variations of words can have the same meaning but look completely different to a computer
Raw Documents
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
Turning text into data
Regular Expressions
A more sophisticated solution: regular expressions
Syntax for defining string (text) patterns
Raw Documents
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
Turning text into data
Regular Expressions
Can use to search text or extract specific chunks
Example use cases:
Extracting dates
Finding URLs
Identifying names/entities
https://regex101.com/
http://www.regexlib.com/
Image credit: https://www.smashingmagazine.com/2009/06/essential-guide-to-regular-expressions-tools-tutorials-and-resources/
Turning text into data
Regular Expressions
\banaly[a-z]+\b
Raw Documents
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
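As a minimal sketch of how this pattern behaves on the running example, using Python's built-in re module:

import re

documents = [
    "Text analysis is fun",
    "I enjoy analyzing text data",
    "Data science often involves text analytics",
]

# \b marks word boundaries, so this matches any word starting with "analy"
pattern = re.compile(r"\banaly[a-z]+\b")

for doc in documents:
    print(pattern.findall(doc.lower()))
# ['analysis'] / ['analyzing'] / ['analytics']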
Turning text into data
Regular Expressions
Regular expressions can be extremely powerful…
...and terrifyingly complex:
URLS: ((https?:\/\/(www\.)?)?[-a-zA-Z0-9@:%._\+~#=]{2,4096}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*))
DOMAINS: (?:http[s]?\:\/\/)?(?:www(?:s?)\.)?([\w\.\-]+)(?:[\\\/](?:.+))?
MONEY: \$([0-9]{1,3}(?:(?:\,[0-9]{3})+)?(?:\.[0-9]{1,2})?)\s
Turning text into data
Pre-processing
Great, but we can’t write patterns for everything
Words are messy and have a lot of variation
We need to collapse semantically similar words together
We need to clean / pre-process
Raw Documents
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
Turning text into data
Pre-processing
Common first steps:
Spell check / correct
Remove punctuation / expand contractions
Raw Documents Processed Documents
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
can’t -> cannot
they’re -> they_are
doesn’t -> does_not
Turning text into data
Pre-processing
Now to collapse words with the same meaning
We do this with stemming or lemmatization
Break words down to their roots
Raw Documents Processed Documents
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
Turning text into data
Pre-processing
Stemming is more conservative
There are many different stemmers
Here’s the Porter stemmer (1980)
Raw Documents -> Processed Documents
1 Text analysis is fun -> Text analysi is fun
2 I enjoy analyzing text data -> I enjoy analyz text data
3 Data science often involves text analytics -> Data scienc often involv text analyt
Turning text into data
Pre-processing
The Lancaster stemmer (1990) is newer and more aggressive
Truncates words a LOT
Raw Documents -> Processed Documents
1 Text analysis is fun -> text analys is fun
2 I enjoy analyzing text data -> I enjoy analys text dat
3 Data science often involves text analytics -> dat sci oft involv text analys
Turning text into data
Pre-processing
Lemmatization uses linguistic relationships and parts of speech to collapse words down to their root form - so you get actual words ("lemmas"), not stems
WordNet Lemmatizer
Raw Documents -> Processed Documents
1 Text analysis is fun -> text analysis is fun
2 I enjoy analyzing text data -> I enjoy analyze text data
3 Data science often involves text analytics -> data science often involve text analytics
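A quick way to compare the three approaches side by side, sketched with NLTK (assumes the WordNet data has been downloaded once; exact outputs may vary slightly by version):

import nltk
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time download for the lemmatizer

porter, lancaster, lemmatizer = PorterStemmer(), LancasterStemmer(), WordNetLemmatizer()
for word in ["analysis", "analyzing", "analytics"]:
    print(
        porter.stem(word),                    # e.g. analysi / analyz / analyt
        lancaster.stem(word),                 # e.g. analys / analys / analys
        lemmatizer.lemmatize(word, pos="v"),  # as a verb: analyzing -> analyze
    )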
Turning text into data
Pre-processing
Picking the right method depends on how much you want to preserve nuance or collapse meaning
We’ll stick with Lancaster
Raw Documents -> Processed Documents
1 Text analysis is fun -> text analys is fun
2 I enjoy analyzing text data -> I enjoy analys text dat
3 Data science often involves text analytics -> dat sci oft involv text analys
Turning text into data
Pre-processing
Finally, we need to remove words that don’t carry much meaning on their own
These are called “stopwords”
Can expand standard stopword lists with custom words
Raw Documents -> Processed Documents
1 Text analysis is fun -> text analys fun
2 I enjoy analyzing text data -> enjoy analys text dat
3 Data science often involves text analytics -> dat sci oft involv text analys
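A minimal sketch of this step with NLTK's standard English stopword list (the custom addition is a hypothetical placeholder), followed by Lancaster stemming:

import nltk
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer

nltk.download("stopwords", quiet=True)

stops = set(stopwords.words("english"))
stops.update({"placeholder"})  # hypothetical domain-specific additions

stemmer = LancasterStemmer()
doc = "I enjoy analyzing text data"
tokens = [w for w in doc.lower().split() if w not in stops]
print([stemmer.stem(t) for t in tokens])  # ['enjoy', 'analys', 'text', 'dat']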
Turning text into data
Pre-processing
A word of caution: there aren’t any universal rules for making pre-processing decisions
Do what makes sense for your data - but be cautious of the researcher degrees of freedom involved
See:
Denny and Spirling, 2016. “Assessing the Consequences of Text Pre-processing Decisions”
Denny and Spirling, 2018. “Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do About It”
Turning text into data
Tokenization
Now we need to tokenize
Break words apart according to certain rules
Usually breaks on whitespace and punctuation
What’s left are called “tokens”
Sequences of one or more adjacent tokens are called “ngrams” (one token is a unigram, a pair is a bigram, and so on)
Turning text into data
Tokenization
We can express the presence of each “ngram” as a column
This is often called a “term frequency matrix”
Here are unigrams
   text  analys  fun  enjoy  dat  sci  oft  involv
1    1     1      1
2    1     1           1     1
3    1     1                 1    1    1     1
Turning text into data
Tokenization
We can express the presence of each “ngram” as a column
This is often called a “term frequency matrix”
And here are bigrams
   text_analys  analys_fun  enjoy_analys  analys_text  text_dat  dat_sci  sci_oft  oft_involv
1       1           1
2                               1             1            1
3       1                                                            1        1         1
Turning text into data
Tokenization
If we want to characterize the whole corpus, we can just look at the most frequent words
Here’s the “term frequency matrix”:
        text  analys  fun  enjoy  dat  sci  oft  involv
1         1     1      1
2         1     1           1     1
3         1     1                 1    1    1     1
total     3     3      1    1     2    1    1     1
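A sketch of building this matrix with scikit-learn's CountVectorizer, using the stemmed documents from the slides:

from sklearn.feature_extraction.text import CountVectorizer

processed_docs = [
    "text analys fun",
    "enjoy analys text dat",
    "dat sci oft involv text analys",
]

# binary=True records presence/absence; ngram_range=(2, 2) would give bigrams
vectorizer = CountVectorizer(binary=True)
tf = vectorizer.fit_transform(processed_docs)

print(vectorizer.get_feature_names_out())  # the term columns
print(tf.toarray())                        # one row per document
print(tf.toarray().sum(axis=0))            # corpus-wide term frequencies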
Turning text into data
TF-IDF
But what if we want to distinguish documents from each other?
We know these documents are about text analysis
What makes them unique?
Image credit: https://medium.com/@imamun/creating-a-tf-idf-in-python-e43f05e4d424
Turning text into data
TF-IDF
Divide word frequencies by the number of documents they appear in
Down-weight words that are common; log-scale emphasizes unique words
Several variants that add smoothing
Image credit: https://sites.temple.edu/tudsc/2017/03/30/measuring-similarity-between-texts-in-python/tfidf-equations/
Image credit: https://medium.com/@imamun/creating-a-tf-idf-in-python-e43f05e4d424
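Written out, the classic (unsmoothed) version of the weighting described above, where tf(t, d) is term t's frequency in document d, N is the number of documents, and df(t) is the number of documents containing t (smoothed variants add constants to avoid dividing by zero):

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}$$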
Turning text into data
TF-IDF
The overall distribution of words is still largely preserved
But now we’re emphasizing what makes each document unique
Within each document, we’re now highlighting distinctive terms
        text  analys  fun  enjoy  dat  sci  oft  involv
1         1     1     2.1
2         1     1          2.1    1.4
3         1     1                 1.4  2.1  2.1   2.1
total     3     3     2.1  2.1    2.8  2.1  2.1   2.1
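A sketch that reproduces these weights with scikit-learn (norm=None and smooth_idf=False match the simple variant shown here; scikit-learn's idf adds 1, so corpus-wide terms like "text" keep a weight of 1):

from sklearn.feature_extraction.text import TfidfVectorizer

processed_docs = [
    "text analys fun",
    "enjoy analys text dat",
    "dat sci oft involv text analys",
]

vectorizer = TfidfVectorizer(norm=None, smooth_idf=False)
tfidf = vectorizer.fit_transform(processed_docs)
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(1))  # matches the weights in the table above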
Turning text into data
TF-IDF
TF-IDF is an extremely common and useful way to convert text into quantitative features
It’s often all you need
But there are other, more complex ways to quantify text
Turning words into numbers
Part-of-Speech Tagging
Sometimes you care about how a word is used
Can use pre-trained part-of-speech (POS) taggers
Can also help with things like negation
“Happy” vs. “NOT happy”
Image credit: http://nltk.sourceforge.net/doc/en/ch03.html
Image credit: https://www.nltk.org/book_1ed/ch05.html
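A sketch with NLTK's pre-trained tagger (resource names and exact tags can vary by NLTK version):

import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("I enjoy analyzing text data")
print(nltk.pos_tag(tokens))
# e.g. [('I', 'PRP'), ('enjoy', 'VBP'), ('analyzing', 'VBG'), ('text', 'NN'), ('data', 'NNS')]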
Turning words into numbers
Named Entity Recognition
Might also be interested in people, places, organizations, etc.
Like POS taggers, named entity extractors use trained models
Image credit: http://inspiratron.org/blog/2019/04/15/building-named-entity-recognizer-ner-using-conditional-random-fields-crf/
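A sketch using spaCy's pre-trained pipeline (assumes the small English model has been installed via python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Pew Research Center is based in Washington, D.C.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Pew Research Center ORG / Washington, D.C. GPE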
Turning words into numbers
Word Embeddings
Other methods can quantify words not by frequency, but by their relationships to other words
Word2vec uses a sliding window to read words and learn their relationships; each word gets a vector in N-dimensional space
Pretrained model: https://code.google.com/archive/p/word2vec/
Image credit: https://www.tensorflow.org/tutorials/text/word_embeddings
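A sketch of training a toy model with gensim (4.x API); real applications need far more text, or the pretrained Google News vectors linked above:

from gensim.models import Word2Vec

sentences = [
    ["text", "analysis", "is", "fun"],
    ["i", "enjoy", "analyzing", "text", "data"],
    ["data", "science", "often", "involves", "text", "analytics"],
]

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, seed=1)
print(model.wv["text"].shape)         # (50,): one vector per word
print(model.wv.most_similar("text"))  # nearest neighbors in the learned space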
Analysis
Finding patterns in text data
Finding patterns in text data
Two types of approaches:
Unsupervised NLP: automated, extracts structure from the data
Clustering
Topic modeling
Mutual information
Supervised NLP: requires training data, learns to predict labels and classes
Classification
Regression
Finding patterns in text data
Unsupervised methods
Collocation / phrase detection
A simple way to get a quick sense of common phrases
Bigrams are a form of “collocation” - a more general term for words that occur together
Code modified from: https://www.nltk.org/howto/collocations.html
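A sketch adapted from that NLTK how-to, using the Genesis corpus it ships with:

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("genesis", quiet=True)
words = nltk.corpus.genesis.words("english-web.txt")

finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)  # ignore pairs seen fewer than 3 times
print(finder.nbest(BigramAssocMeasures.pmi, 10))  # top 10 collocations by PMI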
Finding patterns in text data
Unsupervised methods
Co-occurrence matrices
We can also find words that occur in the same documents together (not just next to each other)
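One sketch: multiply a binary document-term matrix by its transpose, which yields a term-by-term co-occurrence matrix:

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "text analys fun",
    "enjoy analys text dat",
    "dat sci oft involv text analys",
]

X = CountVectorizer(binary=True).fit_transform(docs)

# entry (i, j) = number of documents containing both term i and term j;
# the diagonal holds each term's document frequency
print((X.T @ X).toarray())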
Finding patterns in text data
Unsupervised methods
Might want to compare documents (or words) to one another
Possible applications
Spelling correction
Document deduplication
Measure similarity of language
Politicians’ speeches
Movie reviews
Product descriptions
Image credit: https://pibytes.wordpress.com/2013/02/02/deduplication-internals-part-1/
Finding patterns in text data
Unsupervised methods
Levenshtein distance
Compute the number of steps needed to turn one word/document into another
Can express as a ratio (percent of the word/document that needs to change)
Image credit: http://web.stanford.edu/~jurafsky/slp3/2.pdf
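A minimal dynamic-programming sketch (dedicated libraries are faster for real workloads):

def levenshtein(a, b):
    """Minimum insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]

print(levenshtein("analysis", "analytics"))  # 2
# one simple ratio: 1 - distance / max(len(a), len(b))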
Finding patterns in text data
Unsupervised methods
Cosine similarity
Compute the “angle” between two word vectors
TF-IDF: axes are the weighted frequencies for each word
Word2Vec: axes are the learned dimensions from the model
Image credit: https://www.machinelearningplus.com/nlp/cosine-similarity/
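A sketch with scikit-learn, computing pairwise similarity over TF-IDF vectors:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Text analysis is fun",
    "I enjoy analyzing text data",
    "Data science often involves text analytics",
]

tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf).round(2))  # 3x3 document-by-document matrix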
Finding patterns in text data
Unsupervised methods
Clustering
Algorithms that use word vectors (TF-IDF, Word2Vec, etc.) to identify structural groupings between observations (words, documents)
K-Means is a very commonly used one
Image credit: http://mnemstudio.org/clustering-k-means-introduction.htm
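A sketch of K-Means over TF-IDF features; the fourth, off-topic document is a hypothetical addition so there is something to separate:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Text analysis is fun",
    "I enjoy analyzing text data",
    "Data science often involves text analytics",
    "The stock market fell sharply today",  # hypothetical off-topic document
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf)
print(kmeans.labels_)  # cluster assignment for each document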
Finding patterns in text data
Unsupervised methods
Hierarchical/agglomerative clustering
Start with all observations, use a rule to pair them up, and repeat until only one group remains
Image credit: https://scrnaseq-course.cog.sanger.ac.uk/website/biological-analysis.html
Image credit: https://rpkgs.datanovia.com/factoextra/reference/fviz_dend.html
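A sketch with SciPy's agglomerative (Ward) linkage; the dendrogram is the kind of figure credited above:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Text analysis is fun",
    "I enjoy analyzing text data",
    "Data science often involves text analytics",
]

tfidf = TfidfVectorizer().fit_transform(docs).toarray()
Z = linkage(tfidf, method="ward")  # repeatedly merge the closest groups
dendrogram(Z, labels=["doc1", "doc2", "doc3"])
plt.show()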
Finding patterns in text data
Unsupervised methods
Network analysis
Can also get creative
After all, we’re just working with columns and numbers
Example: link words together by their strongest correlations
Image credit: https://www.linkedin.com/in/patrick-van-kessel
Finding patterns in text data
Unsupervised methods
Pointwise mutual information
Based on information theory
Compares conditional and joint probabilities to measure the likelihood of a word occurring with a category/outcome, beyond random chance
Image credit: https://www.people-press.org/2017/02/23/partisan-language-in-congressional-outreach/
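A sketch of the computation with hypothetical counts (how often a word, a category, and both appear across 1,000 documents):

import numpy as np

n_docs = 1000
p_word = 100 / n_docs      # P(word): hypothetical counts
p_category = 200 / n_docs  # P(category)
p_both = 60 / n_docs       # P(word and category together)

# PMI > 0 means the pair co-occurs more often than chance would predict
pmi = np.log2(p_both / (p_word * p_category))
print(round(pmi, 2))  # 1.58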
Finding patterns in text data
Unsupervised methods
Topic modeling
Algorithms that characterize documents in terms of topics (groups of words)
Find the topics that best fit the data
Image credit: http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
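A sketch with scikit-learn's implementation of LDA (the model from the Blei et al. paper credited above); real corpora need many more documents:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Text analysis is fun",
    "I enjoy analyzing text data",
    "Data science often involves text analytics",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:3]  # three highest-weighted words per topic
    print(f"Topic {i}:", [terms[j] for j in top])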
Finding patterns in text data
Supervised methods
Often we want to categorize documents
Unsupervised methods can help
But often we need to read and label them ourselves
Classification models can take labeled data and learn to make predictions
Finding patterns in text data
Supervised methods
Steps:
Label a sample of documents
Break your sample into two sets: a training sample and a test sample
Train a model on the training sample
Evaluate it on the test sample
Apply it to the full set of documents to make predictions
Finding patterns in text data
Supervised methods
First you need to develop a codebook
Codebook: set of rules for labeling and categorizing documents
The best codebooks have clear rules for hard cases, and lots of examples
Categories should be MECE: mutually exclusive and collectively exhaustive
Finding patterns in text data
Supervised methods
Need to validate the codebook by measuring interrater reliability
Makes sure your measures are consistent, objective, and reproducible
Multiple people code the same document
Image credit: https://socialresearchmethods.net/kb/reltypes.php
Finding patterns in text data
Supervised methods
Various metrics to test whether their agreement is high enough
Krippendorff’s alpha
Cohen’s kappa
Can also compare coders against a gold standard, if available
Image credit: https://www.researchgate.net/figure/Interpretation-of-Cohens-Kappa-Values_tbl2_302869046
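A sketch of computing Cohen's kappa for two coders with scikit-learn (the labels here are hypothetical):

from sklearn.metrics import cohen_kappa_score

coder_a = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical labels from coder A
coder_b = [1, 0, 1, 0, 0, 1, 0, 1]  # hypothetical labels from coder B
print(cohen_kappa_score(coder_a, coder_b))  # 0.5: moderate agreement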
Finding patterns in text data
Supervised methods
Mechanical Turk can be a great way to code a lot of documents
Have 5+ Turkers code a large sample of documents
Collapse them together with a rule
Code a subset in-house, and compute reliability
Image credit: https://machmachines.com/make-some-extra-cash-with-amazon-mechanical-turk060515/
Finding patterns in text data
Supervised methods
After coding, split your sample into two sets (~80/20)
One for training, one for testing
We do this to check for (and avoid) overfitting
Image credit: https://medium.com/ml-research-lab/under-fitting-over-fitting-and-its-solution-dc6191e34250
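A sketch with scikit-learn; the texts and labels are hypothetical placeholders for your coded sample:

from sklearn.model_selection import train_test_split

# hypothetical placeholders for your hand-coded sample
texts = ["doc one", "doc two", "doc three", "doc four", "doc five"]
labels = [0, 1, 0, 1, 0]

# 80% for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)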
Finding patterns in text data
Supervised methods
Next step is called feature extraction or feature selection
Need to extract “features” from the text
TF-IDF
Word2Vec vectors
Can also utilize metadata, if potentially useful
Finding patterns in text data
Supervised methods
Select a classification algorithm
A common choice for text data is the support vector machine (SVM)
Similar to regression, SVMs find the line that best separates two or more groups
Can also use non-linear “kernels” for better fits (radial basis function, etc.)
XGBoost is a newer and very promising algorithm
Image credit: https://towardsdatascience.com/support-vector-machine-vs-logistic-regression-94cc2975433f
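A sketch of a TF-IDF + linear SVM pipeline, continuing the hypothetical split above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

model = Pipeline([
    ("tfidf", TfidfVectorizer()),  # feature extraction
    ("svm", LinearSVC()),          # linear decision boundary
])
model.fit(X_train, y_train)
predictions = model.predict(X_test)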
Finding patterns in text data
Supervised methods
Time to evaluate performance
Lots of different metrics, depending on what you care about
Often we care about precision/recall
Precision: did you pick out mostly needles or mostly hay?
Recall: how many needles did you miss?
Other metrics:
Matthews correlation coefficient
Brier score
Overall accuracy
Image credit: https://en.wikipedia.org/wiki/Precision_and_recall
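Continuing the sketch, scikit-learn reports most of these in a couple of calls:

from sklearn.metrics import classification_report, matthews_corrcoef

print(classification_report(y_test, predictions))  # precision, recall, F1, accuracy
print(matthews_corrcoef(y_test, predictions))      # robust to class imbalance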
Finding patterns in text data
Supervised methods
Doing just one split leaves a lot up to chance
To bootstrap a better estimate of the model’s performance, it’s best to use K-fold cross-validation
Splits your data into train/test sets multiple times and averages the performance metrics
Ensures that you didn’t just get lucky (or unlucky)
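Continuing the sketch, scikit-learn will handle the folds for you:

from sklearn.model_selection import cross_val_score

# with a real-sized sample you'd use cv=5; the tiny toy sample above only supports 2 folds
scores = cross_val_score(model, texts, labels, cv=2, scoring="accuracy")
print(scores.mean(), scores.std())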
Finding patterns in text data
Supervised methods
Model not working well?
You probably need to tune your parameters
You can use a grid search to test out different combinations of model parameters and feature extraction methods
Many software packages can automatically help you pick the best combination to maximize your model’s performance
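A sketch of a grid search over both the feature-extraction and model parameters of the pipeline above:

from sklearn.model_selection import GridSearchCV

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    "svm__C": [0.1, 1, 10],                  # regularization strength
}
search = GridSearchCV(model, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_)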
Finding patterns in text data
Supervised methods
Suggested design:
Large training sample, coded by Turkers
Small evaluation sample, coded by Turkers and in-house experts
Compute IRR between Turk and experts
Train model on training sample, use 5-fold cross-validation
Apply model to evaluation sample, compare results against in-house coders and Turkers
Finding patterns in text data
Supervised methods
Some (but not all) models produce probabilities along with their classifications
Ideally you fit the model using your preferred scoring metric/function
But you can also use post-hoc probability thresholds to adjust your model’s predictions
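A sketch of post-hoc thresholding with a model that exposes probabilities (logistic regression here, since LinearSVC does not provide predict_proba):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

clf = Pipeline([("tfidf", TfidfVectorizer()), ("lr", LogisticRegression())])
clf.fit(X_train, y_train)

probs = clf.predict_proba(X_test)[:, 1]  # P(class 1) for each document
print((probs >= 0.7).astype(int))        # stricter cutoff than the 0.5 default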
Tools and Resources
Open-source tools
Python
NLTK, scikit-learn, pandas, numpy, scipy, gensim, spacy, etc.
R
https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
Java
Stanford Core NLP + many other useful libraries
https://nlp.stanford.edu/software/
Image credit: https://stackoverflow.com/
Commercial tools
Cloud-based NLP
Amazon Comprehend
Google Cloud Natural Language
IBM Watson NLU
Software
SPSS Text Modeler
Provalis WordStat
Image credit: https://provalisresearch.com/products/content-analysis-software/
Time for a demo!
https://bit.ly/2rlCOUG
Full link: https://colab.research.google.com/github/patrickvankessel/AAPOR-Text-Analysis-2019/blob/master/Tutorial.ipynb
GitHub repo: https://github.com/patrickvankessel/AAPOR-Text-Analysis-2019
Feel free to reach out:
Special thanks to Michael Jugovich for help putting these materials together for previous workshops
Thank you!
https://bit.ly/2rlCOUG
Full link: https://colab.research.google.com/github/patrickvankessel/AAPOR-Text-Analysis-2019/blob/master/Tutorial.ipynb
GitHub repo: https://github.com/patrickvankessel/AAPOR-Text-Analysis-2019
Feel free to reach out:
Special thanks to Michael Jugovich for help putting these materials together for previous workshops