Using NLP to compare tweets to political speeches

Given a political speech, which parts are most resonant, and why?

Using modern ML tools and the wealth of language on social media, we can analyze speeches through three lenses: the concepts they emphasize, the emotional tone they strike, and the partisan style they most resemble. Together, these perspectives reveal not just what is said, but how it lands with different audiences.

The code

  • I divided the speech into chunks and ran a basic sentiment analysis model on each one to find the most positive and most negative parts of the speech.

  • I took political tweets from the 2024 election cycle and ran TF-IDF on the tweets and transcript chunks together to build a shared vocabulary. I then ran LSA separately on the transcript and the tweets to extract the core concepts of each, found the chunk of the speech most relevant to the tweet concepts, and reported its cosine similarity.

  • I got another dataset of tweets from members of Congress that includes each member's party. I tried a series of models, picked the one with the best accuracy at predicting party from tweet text, and then ran it on the speech chunk chosen above to see whether the speech aligned more closely with Democratic or Republican talking points.

Sentiment Analysis

The speech I chose to analyze was Obama’s speech at the 2020 DNC.

Here’s an excerpt from the first few sentences of the speech:

“Good evening, everybody. As you’ve seen by now, this isn’t a normal convention. It’s not a normal time. So tonight, I want to talk as plainly as I can about the stakes in this election. Because what we do these next 76 days will echo through generations to come. I’m in Philadelphia, where our Constitution was drafted and signed. It wasn’t a perfect document. It allowed for the inhumanity of slavery and failed to guarantee women – and even men who didn’t own property – the right to participate in the political process.”

I divided the speech up into chunks of 6 sentences and looked at how the sentiment changed as the speech progressed.
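
Here’s a minimal sketch of the chunking and scoring step. The post doesn’t name the sentiment model, so NLTK’s VADER analyzer (whose compound score also runs from -1 to +1) is assumed here, along with a hypothetical path to the transcript file.

```python
# Minimal sketch: split the transcript into 6-sentence chunks and score each one.
# Assumes NLTK's VADER analyzer; the post doesn't specify which sentiment model was used.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("punkt")
nltk.download("vader_lexicon")

def chunk_sentences(text, size=6):
    """Group the transcript's sentences into chunks of `size` sentences."""
    sentences = nltk.sent_tokenize(text)
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]

speech_text = open("obama_dnc_speech.txt").read()   # hypothetical path to the transcript
chunks = chunk_sentences(speech_text)

sia = SentimentIntensityAnalyzer()
scores = [sia.polarity_scores(c)["compound"] for c in chunks]  # each score is in [-1, 1]

print("average sentiment:", sum(scores) / len(scores))
print("most positive chunk:", chunks[scores.index(max(scores))][:200])
print("most negative chunk:", chunks[scores.index(min(scores))][:200])
```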

The average sentiment score of the speech was 0.45, on a scale where –1 is strongly negative, around 0 is mixed/neutral, and +1 is strongly positive. Obama’s speech was thus, overall, moderately positive.

Note: Sentiment analysis is imperfect; individual positive or negative words can skew the scores when the broader context isn’t taken into account.

You can also use sentiment analysis to find the most positive and negative chunks of a speech.

Here they are:

Most positive chunk of the speech (score = 0.9914):

And in my friend Kamala Harris, he’s chosen an ideal partner who’s more than prepared for the job; someone who knows what it’s like to overcome barriers and who’s made a career fighting to help others live out their own American dream. Along with the experience needed to get things done, Joe and Kamala have concrete policies that will turn their vision of a better, fairer, stronger country into reality. They’ll get this pandemic under control, like Joe did when he helped me manage H1N1 and prevent an Ebola outbreak from reaching our shores. They’ll expand health care to more Americans, like Joe and I did ten years ago when he helped craft the Affordable Care Act and nail down the votes to make it the law. They’ll rescue the economy, like Joe helped me do after the Great Recession. I asked him to manage the Recovery Act, which jumpstarted the longest stretch of job growth in history.

Most negative chunk of the speech (score = -0.946):

We are going to bring those words, in our founding documents, to life. I’ve seen that same spirit rising these past few years. Folks of every age and background who packed city centers and airports and rural roads so that families wouldn’t be separated. So that another classroom wouldn’t get shot up. So that our kids won’t grow up on an uninhabitable planet. Americans of all races joining together to declare, in the face of injustice and brutality at the hands of the state, that Black Lives Matter, no more, but no less, so that no child in this country feels the continuing sting of racism.

Concept Alignment

Now that we have the sentiment of the speech, what about the key concepts that emerge from that speech? And can we compare them to concepts that the average person is talking about?

To get a sense of what the average person is talking about, I used a dataset of political tweets by random users. And I know, the average political tweeter is not the same as the average person, but it still sheds some light on what topics people are broadly discussing.

The dataset: https://github.com/sinking8/x-24-us-election.

I loaded one of the files from this dataset, which contains 50,000 tweets, and found a few things. To start, here’s a list of the most common words in people’s tweets.

Most common words:

  1. the — 40,615

  2. biden — 30,258

  3. to — 25,628

  4. a — 21,069

  5. t — 20,997

  6. and — 20,129

  7. is — 19,463

  8. of — 15,302

  9. s — 14,718

  10. you — 14,196
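
For context, raw counts like these can come from a simple frequency count over the tweet text, along the lines of this sketch (the filename and the `text` column name are assumptions about the dataset’s layout):

```python
# Raw word frequencies across the tweet dataset (before any TF-IDF weighting).
# The filename and the `text` column name are assumptions about the dataset layout.
import re
from collections import Counter

import pandas as pd

tweets = pd.read_csv("tweets.csv")          # one of the ~50,000-tweet files from the dataset
tokens = re.findall(r"[a-z]+", " ".join(tweets["text"]).lower())
for word, count in Counter(tokens).most_common(10):
    print(word, count)
```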

This isn’t very useful, since most of these words are filler. That’s where TF-IDF comes in: by combining term frequency with inverse document frequency, we can surface the words that actually carry meaning.

I ran TF-IDF on both the transcript chunks and the tweets to create a shared vocabulary, ranking words by their TF-IDF scores rather than their raw counts.
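
Here’s a minimal sketch of that step, assuming scikit-learn’s TfidfVectorizer and reusing the `tweets` dataframe and `chunks` list from the earlier sketches; the parameter choices (English stopwords, 5,000 features) are illustrative, not the post’s exact settings.

```python
# Fit one TF-IDF vocabulary over tweets + transcript chunks so both live in the
# same word space. Parameter choices here are illustrative, not the post's exact setup.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = list(tweets["text"]) + chunks          # shared corpus: tweets followed by speech chunks
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(corpus)

tweet_vecs = X[: len(tweets)]                   # rows for tweets
chunk_vecs = X[len(tweets):]                    # rows for transcript chunks

# Rank words by their mean TF-IDF weight across all documents.
terms = vectorizer.get_feature_names_out()
mean_tfidf = np.asarray(X.mean(axis=0)).ravel()
print(sorted(zip(terms, mean_tfidf), key=lambda t: -t[1])[:10])
```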

Now, I want to take these words and documents (tweets) and convert them into concepts.

TF-IDF maps each of the tweets into a vectorized word-space, and SVD/LSA compresses that into a lower-dimensional semantic concept space. Essentially, you can use SVD to distill all these tweet vectors into a list of concepts.

I ran LSA on the tweets and the transcript chunks separately to determine the most important concepts from each.

Each concept is a linear combo of words, and each of those words has a corresponding weight.
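
Here’s a minimal sketch of the LSA step, continuing from the TF-IDF sketch above and assuming scikit-learn’s TruncatedSVD; the number of concepts and the number of words shown per concept are illustrative.

```python
# LSA: compress the TF-IDF word space into a handful of "concepts" with truncated SVD,
# run separately on the tweets and the transcript chunks (which share one vocabulary).
import numpy as np
from sklearn.decomposition import TruncatedSVD

def top_concepts(doc_term_matrix, terms, n_concepts=8, n_words=8):
    svd = TruncatedSVD(n_components=n_concepts, random_state=0)
    svd.fit(doc_term_matrix)
    for i, component in enumerate(svd.components_):
        # Each concept is a linear combination of words; show the heaviest weights.
        top = np.argsort(-np.abs(component))[:n_words]
        print(f"Concept {i}:", [(terms[j], round(component[j], 3)) for j in top])
    return svd

tweet_lsa = top_concepts(tweet_vecs, terms)      # concepts from the tweets
chunk_lsa = top_concepts(chunk_vecs, terms)      # concepts from the speech
```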

Note: When I initially ran it, some of the words that showed up were “https,” “like,” “just,” “como,” “hola,” and others, so I filtered out certain stopwords and words in languages such as Spanish and Catalan. These are the new and improved concept vectors after all the filtering!

From the tweets:

Concept 0 — Biden-focused discourse

  • biden (0.882)

  • joe (0.258)

  • trump (0.206)

  • hunter (0.143)

  • maga (0.078)

  • donald (0.063)

  • vote (0.056)

  • people (0.054)

Concept 1 — MAGA vs. Biden

  • maga (0.937)

  • trump (0.183)

  • biden (-0.169)

  • gop (0.130)

  • donald (0.081)

  • people (0.051)

  • joe (-0.049)

  • rickydoggin (0.036)

Concept 2 — GOP/Trump (vs. Biden/MAGA split)

  • gop (0.715)

  • trump (0.462)

  • maga (-0.293)

  • donald (0.290)

  • biden (-0.189)

  • joe (-0.072)

  • vote (0.059)

  • people (0.049)

Concept 3 — GOP vs. Trump (Party vs. Individual)

  • gop (0.636)

  • trump (-0.600)

  • donald (-0.438)

  • biden (0.107)

  • maga (0.090)

  • joe (0.068)

  • convicted (-0.039)

  • felon (-0.038)

Concept 4 — Joe Biden personal references

  • joe (0.942)

  • biden (-0.276)

  • hunter (-0.107)

  • ashy_slashee (0.090)

  • donald (0.065)

  • joebiden (0.022)

  • conservative (0.021)

  • son (0.020)

Concept 5 — Hunter Biden + conservative critique

  • hunter (0.834)

  • conservative (0.287)

  • biden (-0.207)

  • trump (-0.099)

  • gop (-0.097)

  • conviction (0.096)

  • people (0.090)

  • gun (0.082)

Concept 6 — Conservative politics & voting

  • conservative (0.663)

  • hunter (-0.429)

  • people (0.225)

  • vote (0.195)

  • party (0.146)

  • gop (-0.137)

  • donald (-0.132)

  • think (0.108)

Concept 7 — Garland contempt hearings

  • garland (0.449)

  • contempt (0.371)

  • house (0.334)

  • conservative (-0.295)

  • hold (0.268)

  • merrick (0.262)

  • audio (0.240)

  • congress (0.189)

Concepts from the transcript of Obama’s speech:

Concept 0 — Joe Biden, Kamala Harris & Voting/Jobs

  • joe (0.295)

  • vote (0.225)

  • kamala (0.182)

  • years (0.158)

  • job (0.155)

  • right (0.148)

  • trying (0.144)

  • shown (0.137)

Concept 1 — Civil Rights & Leadership Perception

  • told (0.205)

  • looked (0.194)

  • joe (-0.189)

  • way (0.188)

  • kamala (-0.188)

  • black (0.175)

  • rights (0.152)

  • government (0.143)

Concept 2 — Constitution, Government & Political Expectations

  • political (0.204)

  • men (0.192)

  • joe (-0.188)

  • born (-0.175)

  • government (0.174)

  • expect (0.167)

  • constitution (0.157)

  • office (0.156)

Concept 3 — Responsibility, Office & Accountability

  • shown (0.287)

  • times (0.215)

  • responsibility (0.196)

  • office (0.194)

  • expect (0.164)

  • knows (-0.161)

  • joe (-0.138)

  • threatened (0.134)

Concept 4 — Lives, People & Wealth Inequality

  • lives (0.314)

  • people (0.139)

  • life (0.136)

  • political (-0.127)

  • convince (0.125)

  • expect (-0.125)

  • office (-0.124)

  • wealthy (0.122)

Concept 5 — Voting, Convincing & Social Class

  • vote (0.207)

  • shown (-0.204)

  • making (0.191)

  • people (0.174)

  • convince (0.160)

  • wealthy (0.156)

  • life (0.152)

  • government (-0.152)

Concept 6 — Religious Appeals & Blessings

  • bless (0.801)

  • god (0.599)

  • told (0.000)

  • jews (0.000)

  • sit (0.000)

  • rights (0.000)

  • losing (0.000)

  • fight (0.000)

Concept 7 — Jewish Identity, Struggle & Rights

  • jews (0.250)

  • sit (0.243)

  • losing (0.231)

  • fight (0.223)

  • told (0.221)

  • rights (0.212)

  • working (0.201)

  • children (0.199)

How similar are the tweets to the speech?

Cosine similarity measures how similar two texts are in terms of word distribution: a score of 1.0 means they are essentially identical, 0.0 means they are unrelated, and -1.0 means they point in opposite directions. I calculated the average cosine similarity between Obama’s speech and political discourse on Twitter and got a whopping -0.001.
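
Here’s a rough sketch of these comparisons, continuing from the TF-IDF and LSA sketches above. The post doesn’t spell out exactly which vectors were averaged, so this version compares the two sets of LSA concept vectors, averages their pairwise similarities, and then finds the single best tweet-concept/transcript-concept pair and the speech chunk that loads most heavily on it.

```python
# Compare speech and tweets in the shared word space. This is one plausible
# aggregation, not necessarily the exact one used in the post.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

tweet_concepts = tweet_lsa.components_           # shape: (n_concepts, n_terms)
chunk_concepts = chunk_lsa.components_

# (a) How similar are the two sets of concepts overall?
sims = cosine_similarity(chunk_concepts, tweet_concepts)
print("average similarity:", sims.mean())

# (b) Which tweet concept best matches which transcript concept?
c_idx, t_idx = np.unravel_index(np.argmax(sims), sims.shape)
print(f"best match: tweet concept {t_idx} <-> transcript concept {c_idx} "
      f"(cosine similarity = {sims[c_idx, t_idx]:.3f})")

# (c) Which speech chunk loads most heavily on that tweet concept?
chunk_scores = chunk_vecs @ tweet_concepts[t_idx]
best_chunk = chunks[int(np.argmax(chunk_scores))]
print("best chunk:", best_chunk[:300])
```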

What chunk of the speech is most conceptually aligned with the tweets?

I then used cosine similarity to find the part of the speech that was most aligned with the top concepts from the tweets. Here’s what I ended up with:

Best match: tweet concept 4 ↔ transcript concept 0 (cosine similarity=0.300)

Tweet concept top terms: ['joe', 'ashy_slashee', 'donald', 'joebiden', 'son', 'conservative', 'potus', 'pedophile', 'man', 'people']

Transcript concept top terms: ['joe', 'vote', 'kamala', 'years', 'job', 'right', 'trying', 'shown', 'lives', 'care']

Best chunk: When Joe listens to a parent who’s trying to hold it all together right now, he does it as the single dad who took the train back to Wilmington each and every night so he could tuck his kids into bed. When he meets with military families who’ve lost their hero, he does it as a kindred spirit; the parent of an American soldier; somebody whose faith has endured the hardest loss there is. For eight years, Joe was the last one in the room whenever I faced a big decision. He made me a better president – and he’s got the character and the experience to make us a better country. And in my friend Kamala Harris, he’s chosen an ideal partner who’s more than prepared for the job; someone who knows what it’s like to overcome barriers and who’s made a career fighting to help others live out their own American dream. Along with the experience needed to get things done, Joe and Kamala have concrete policies that will turn their vision of a better, fairer, stronger country into reality. They’ll get this pandemic under control, like Joe did when he helped me manage H1N1 and prevent an Ebola outbreak from reaching our shores. They’ll expand health care to more Americans, like Joe and I did ten years ago when he helped craft the Affordable Care Act and nail down the votes to make it the law.

Partisan Style

Does Obama talk like a party Democrat? Another way to analyze the speech is by looking at tweets from members of Congress, labeled with each member’s party. I trained a series of models to predict party from tweet text, picked the most accurate one, and then ran it on the transcript chunk selected above.

Here are the accuracy scores of the candidate models:

  • Gaussian Naive Bayes after a 2-component PCA: 0.601

  • Gaussian Naive Bayes after optimizing the number of PCA components (134): 0.724

  • Logistic regression (all features): 0.751

  • Perceptron (all features): 0.696

  • Decision tree (max_depth=50, min_samples_leaf=10): 0.637

  • Random forest (n_estimators=100, min_samples_leaf=10, max_depth=50): 0.779

  • XGBoost (n_estimators=100, max_depth=9): 0.804

  • Neural network (2 layers, 160 epochs): 0.844

Of these models, the neural network had the best accuracy score.

I used that model on the selected chunk of the transcript to predict whether it sounded more like a Democrat or a Republican, and it predicted Democrat. So Obama’s speech chunk matched party style closely.
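
Here’s a rough sketch of this step, reusing variable names from the earlier sketches. The congressional-tweet dataset’s filename and its `text`/`party` column names are assumptions, and scikit-learn’s MLPClassifier stands in for the neural network, whose exact architecture the post doesn’t specify.

```python
# Sketch of the party-prediction step. Filename and column names are assumptions,
# and MLPClassifier is a stand-in for the neural network described in the post.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

congress = pd.read_csv("congress_tweets.csv")        # hypothetical filename
vec = TfidfVectorizer(stop_words="english", max_features=5000)
X = vec.fit_transform(congress["text"])
y = congress["party"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "neural network": MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=160, random_state=0),
}
best_name, best_model, best_acc = None, None, 0.0
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = model.score(X_test, y_test)
    print(f"{name}: accuracy {acc:.3f}")
    if acc > best_acc:
        best_name, best_model, best_acc = name, model, acc

# Run the best model on the speech chunk selected in the concept-alignment step.
print(best_name, "predicts:", best_model.predict(vec.transform([best_chunk]))[0])
```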

Conclusions

The concepts in political tweets are tied heavily to people and their character, far more than to policy issues.

The party predictor based on member tweets correctly predicted that Obama’s speech was spoken by a Democrat, an early indicator that his speech is well-aligned with party rhetoric.

The cosine similarity between the transcript concepts and the tweet concepts was not high, meaning the concepts he talked about in the speech are not closely tied to what people on Twitter actually talk about.

Next

Data Visualization