12/12/2019
Finding patterns in text data
Supervised methods
● Select a classification algorithm
● A common choice for text data is the
support vector machine (SVM)
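● Similar to logistic regression, an SVM finds the line
that best separates two groups: the one with the widest
margin (more than two groups are handled by combining
several two-group classifiers)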
● Can also use non-linear “kernels” for
better fits (radial basis function, etc.)
● XGBoost (gradient-boosted decision trees) is a
newer and very promising alternative
Image credit: https://towardsdatascience.com/support-vector-machine-vs-logistic-regression-94cc2975433f
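To make the "line that best separates the groups" idea concrete, here is a minimal sketch of a linear SVM trained on bag-of-words counts. Everything in it is an assumption for illustration: the toy corpus, the helper names (featurize, decision, train_linear_svm, predict), and the choice of plain subgradient descent on the hinge loss as the training method. It uses only the standard library.

```python
from collections import Counter

# Hypothetical toy corpus: label +1 = "spam", -1 = "not spam".
DOCS = [
    ("cheap pills buy now", 1),
    ("buy cheap watches", 1),
    ("meeting agenda attached", -1),
    ("lunch meeting tomorrow", -1),
]


def featurize(text, vocab):
    """Turn a text into a bag-of-words count vector over vocab."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]


def decision(w, b, x):
    """Signed score: positive on one side of the line, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=200):
    """Minimise the L2-regularised hinge loss by subgradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # A point "violates the margin" if it is misclassified
            # or sits closer than 1 to the separating line.
            violated = yi * decision(w, b, xi) < 1
            # Regularisation step: shrink the weights slightly.
            w = [wj * (1 - lr * lam) for wj in w]
            if violated:
                # Hinge-loss step: push the line away from this point.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


VOCAB = sorted({word for text, _ in DOCS for word in text.split()})
X = [featurize(text, VOCAB) for text, _ in DOCS]
y = [label for _, label in DOCS]
w, b = train_linear_svm(X, y)


def predict(text):
    """Classify a text by which side of the learned line it falls on."""
    return 1 if decision(w, b, featurize(text, VOCAB)) >= 0 else -1


train_predictions = [predict(text) for text, _ in DOCS]
```

In practice you would reach for a library rather than hand-rolled training; scikit-learn, for example, provides LinearSVC and SVC (the latter supports kernel="rbf" for the non-linear case).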
Finding patterns in text data
Supervised methods
● Time to evaluate performance
● Lots of different metrics, depending on
what you care about
● Often we care about precision/recall
○ Precision: of everything you picked out, how much
is needles and how much is hay?
○ Recall: of all the needles in the haystack, how many
did you find (and how many did you miss)?
● Other metrics:
○ Matthews correlation coefficient
○ Brier score
○ Overall accuracy
Image credit: https://en.wikipedia.org/wiki/Precision_and_recall
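The metrics above can all be computed from the true labels and the classifier's outputs. The sketch below uses made-up labels and probabilities (y_true and y_prob are hypothetical, as is the 0.5 decision threshold) and the standard textbook formulas, with no external libraries.

```python
import math

# Hypothetical data: 1 = needle (positive class), 0 = hay (negative class).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.1, 0.4]  # predicted P(needle)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]    # assumed 0.5 threshold

# Confusion-matrix counts.
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # needles found
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # hay picked by mistake
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # needles missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # hay correctly left alone

# Note: these divisions assume non-zero denominators, which holds
# for this toy data but should be checked on real data.
precision = tp / (tp + fp)             # share of picks that are needles
recall = tp / (tp + fn)                # share of needles that were picked
accuracy = (tp + tn) / len(y_true)     # share of all calls that were right

# Matthews correlation coefficient: ranges from -1 to +1, 0 = chance level.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

# Brier score: mean squared error of the predicted probabilities
# (lower is better, 0 is perfect).
brier = sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
print(f"mcc={mcc:.3f} brier={brier:.3f}")
```

Precision, recall, accuracy and MCC depend only on the thresholded predictions, while the Brier score uses the probabilities themselves, so it also rewards well-calibrated confidence.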