Text Mining and Analysis Using Twitter Streaming API 📊

AUTHOR: Punchok Kerdsiri


Text mining is the application of natural language processing techniques and analytical methods to text data in order to derive relevant information. This post shows how to collect data from Twitter with the Twitter Streaming API, which lets us capture tweets in real time through a keyword filter.

In this case study I will use #WorldCup data to compare the popularity of four of the most popular football players during the FIFA World Cup Russia 2018 (Cristiano Ronaldo, Neymar, Lionel Messi and Luis Suarez) and to retrieve links to news resources such as tweets, websites and YouTube videos. In Part 1, I will explain how to connect to the Twitter Streaming API and how to get the data. In Part 2, I will show how to structure the data for analysis. Finally, in Part 3, I will explain how to filter the data and extract links from tweets.

— Part 1: Getting Started with the Twitter API —

Understanding the Twitter API

API stands for Application Programming Interface. It is a tool that makes interaction with computer programs and web services easy.

Getting API Credentials

  • Create a Twitter account if you do not already have one.
  • Go to https://apps.twitter.com/ and log in with your Twitter credentials.
  • Click “Create New App”.
  • Fill out the form, agree to the terms, and click “Create your Twitter application”.
  • On the next page, click on the “API keys” tab, and copy your “API key” and “API secret”.
  • Scroll down and click “Create my access token”, and copy your “Access token” and “Access token secret”.

Create a Twitter Streaming API script to show the results of the real-time filtered stream

We will use a Python library called Tweepy to connect to the Twitter Streaming API and download the data. If you don’t have Tweepy installed on your machine, follow the installation instructions in the Tweepy documentation.

Next, create a file called twitter_streaming.py that holds the streaming code. Make sure to enter your credentials into access_token, access_token_secret, consumer_key, and consumer_secret.
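A minimal sketch of such a script follows. It assumes Tweepy 3.x, where StreamListener exists; newer Tweepy releases changed the streaming interface, and the Twitter endpoint it relies on may no longer be available. The placeholder strings, the TRACK_TERMS list and the credentials_filled helper are illustrative assumptions, not the original listing.

```python
import sys

# Placeholders: replace with the four values copied from your Twitter app page.
access_token = "ENTER YOUR ACCESS TOKEN"
access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"
consumer_key = "ENTER YOUR API KEY"
consumer_secret = "ENTER YOUR API SECRET"

# Assumption: the term to follow; adjust to whatever you want to capture.
TRACK_TERMS = ["#WorldCup"]


def credentials_filled():
    """Return True once every placeholder above has been replaced."""
    values = (access_token, access_token_secret, consumer_key, consumer_secret)
    return all(v and not v.startswith("ENTER YOUR") for v in values)


def main():
    if not credentials_filled():
        print("Fill in your Twitter credentials first.", file=sys.stderr)
        return

    # Imported here so the check above works even without Tweepy installed.
    # These names are the Tweepy 3.x ones.
    from tweepy import OAuthHandler, Stream
    from tweepy.streaming import StreamListener

    class StdOutListener(StreamListener):
        """Print every raw tweet (a JSON string) on its own line."""

        def on_data(self, data):
            print(data.strip())
            return True  # keep the stream open

        def on_error(self, status):
            print(status, file=sys.stderr)

    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, StdOutListener())
    # Keep only tweets that contain one of the tracked terms.
    stream.filter(track=TRACK_TERMS)


if __name__ == "__main__":
    main()
```

Run from a terminal, a script like this prints one JSON document per tweet, so its output can be redirected to a file, for example with `python twitter_streaming.py > worldcup2018_twitter_data.txt`.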


The script outputs each matching tweet in JSON format, with more than 100 keys per tweet. I streamed for 2 hours to collect data from Twitter.

Capturing and Reading the Data

To capture the data for the analysis, I ran the script from the command line and stored its output in a text file.

The retrieved data is stored in worldcup2018_twitter_data.txt in JSON format. Each tweet contains a lot of additional information, for example:

"text":"Lionel Messi, Marcus Rojo\u2019s goals as Argentina best Nigeria

— Part 2: Structured data and analysis —

Import the necessary libraries:

  • json
  • pandas
  • matplotlib
  • re

Read the captured data from the text file
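A minimal sketch of the reading step, assuming the file holds one JSON document per line. Blank or truncated lines are skipped. The name load_tweets is illustrative.

```python
import json


def load_tweets(path):
    """Parse one tweet per line, skipping blank or malformed lines."""
    tweets = []
    with open(path, "r", encoding="utf-8") as tweets_file:
        for line in tweets_file:
            line = line.strip()
            if not line:
                continue
            try:
                tweets.append(json.loads(line))
            except json.JSONDecodeError:
                # An interrupted stream can leave a partial line behind.
                continue
    return tweets
```

For example, `tweets_data = load_tweets("data/worldcup2018_twitter_data.txt")` would return a list with one dictionary per captured tweet.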


Show the total number of captured tweets

Use the print() and len() functions to count the captured tweets.

The file data/worldcup2018_twitter_data.txt contains a total of 6008 tweets captured from Twitter.

Mapping the captured tweets from the JSON-format text file into a DataFrame
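A sketch of this mapping, assuming tweets_data is the list of parsed tweets and that only the text, language and country fields are needed (the choice of columns is an assumption). A short hand-made sample stands in for the captured data.

```python
import pandas as pd

# Hand-made sample in the shape of Twitter's tweet JSON (not real data).
tweets_data = [
    {"text": "Messi scores! #WorldCup", "lang": "en", "place": {"country": "Argentina"}},
    {"text": "Neymar #WorldCup", "lang": "pt", "place": None},
    {"text": "Ronaldo hat-trick #WorldCup", "lang": "en"},
]

tweets = pd.DataFrame({
    "text": [tweet.get("text") for tweet in tweets_data],
    "lang": [tweet.get("lang") for tweet in tweets_data],
    # "place" is often missing or null, so fall back to None.
    "country": [(tweet.get("place") or {}).get("country") for tweet in tweets_data],
})

# Number of tweets per language, most frequent first.
tweets_by_lang = tweets["lang"].value_counts()
print(tweets_by_lang.head(5))
```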


This shows the top 5 tweet languages: most tweets are in English (en), followed by Portuguese (pt), Spanish (es), French (fr) and Japanese (ja).


Drawing the Graph

To draw the graph we use the Matplotlib library, which offers many kinds of graphs. As a simple implementation, I use a bar graph to show the counts from above and find the top 10 languages among the 6008 captured tweets.
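A minimal sketch of such a bar graph, assuming the language codes are available as a pandas Series. The sample values, axis labels and output file name are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hand-made sample; in the analysis this would be the "lang" column.
langs = pd.Series(["en", "en", "pt", "es", "en", "pt", "fr", "ja"])

# Count tweets per language and keep the 10 most frequent.
top_langs = langs.value_counts().head(10)

fig, ax = plt.subplots()
ax.bar(list(top_langs.index), list(top_langs.values), color="red")
ax.set_xlabel("Languages")
ax.set_ylabel("Number of tweets")
ax.set_title("Top 10 languages")
fig.savefig("top_languages.png")
```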

Showing the same result in a different kind of graph: a pie chart


Drawing a graph of 10 countries that tweeted about ‘#WorldCup2’




— Part 3: Text Mining and Extracting Links —

Our main goals in these text mining tasks are to compare the popularity of Cristiano Ronaldo, Neymar, Lionel Messi and Luis Suarez, and to retrieve links to related news resources. We will do this in 3 steps:

  • Add tags to our tweets DataFrame in order to be able to manipulate the data easily.
  • Target tweets that have the “WorldCup” or “Fifa” keywords.
  • Extract links from the relevant tweets.

Define a function that converts the text, which may contain capital letters or mixed case, to lower case and then uses a search function to find a word in the text column.
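One way such a function could look, as a sketch: the name word_in_text and the sample sentence are illustrative, and the player names are matched as plain substrings.

```python
import re


def word_in_text(word, text):
    """Return True if `word` occurs in `text`, ignoring case."""
    if not isinstance(text, str):
        return False  # missing tweet text
    return re.search(re.escape(word.lower()), text.lower()) is not None


players = ["Ronaldo", "Neymar", "Messi", "Suarez"]
sample = "MESSI and neymar both scored at the #WorldCup"
mentions = {player: word_in_text(player, sample) for player in players}
print(mentions)
```

On the DataFrame, the same function can fill one column per player, for example `tweets["messi"] = tweets["text"].apply(lambda t: word_in_text("messi", t))`, after which `tweets["messi"].sum()` counts the tweets that mention him. Note that a plain substring match will miss accented spellings such as “Suárez”.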


Show the popularity ranking of the football players from above

Specifying Relevant Tweet text

In this part I’ll specify keywords in order to match the football players who were mentioned during World Cup 2018 with the keywords ‘FiFa2018’, ‘World Cup’ or ‘WorldCup’.

Map the Fifa2018 and WorldCup keywords that appear in the text to a column called relevant, which takes the value True if the tweet has any of these keywords; otherwise it takes the value False.

Create the relevant column by applying the keyword search to the text of each tweet
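A sketch of how the relevant column could be built, using the keywords named above in lower case. The helper is_relevant, the messi column and the sample tweets are illustrative assumptions.

```python
import pandas as pd

# Keywords from the text above, lower-cased for case-insensitive matching.
KEYWORDS = ["fifa2018", "world cup", "worldcup"]


def is_relevant(text):
    """True if the tweet text contains any keyword, ignoring case."""
    if not isinstance(text, str):
        return False
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


# Hand-made sample tweets (not real data).
tweets = pd.DataFrame({"text": [
    "Messi wins it for Argentina #WorldCup",
    "Neymar spotted at the airport",
    "Ronaldo ready for the World Cup",
]})

tweets["relevant"] = tweets["text"].apply(is_relevant)
tweets["messi"] = tweets["text"].str.contains("messi", case=False, na=False)

# Relevant tweets that also mention the player.
messi_relevant = tweets[tweets["relevant"] & tweets["messi"]]
print(len(messi_relevant))
```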

Show the matching keyword and football player name values that appear in the captured tweets

Extracting links from the relevant tweets

Now that we have extracted the relevant tweets, we want to retrieve the links they contain. We will start by creating a function that uses regular expressions to retrieve a link that starts with “http://” or “https://” from a text. This function returns the URL if one is found; otherwise it returns an empty string.
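A minimal sketch of such a function. The name extract_link and the exact pattern are assumptions; here a link is taken to run from the scheme up to the next whitespace.

```python
import re

# A link starts with http:// or https:// and runs to the next whitespace.
LINK_PATTERN = re.compile(r"https?://\S+")


def extract_link(text):
    """Return the first URL in `text`, or an empty string if there is none."""
    if not isinstance(text, str):
        return ""
    match = LINK_PATTERN.search(text)
    return match.group() if match else ""
```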

Next, we will add a column called link to our tweets DataFrame. This column will contain the URL information.

Next we will create a new DataFrame called tweets_relevant_with_link. This DataFrame is a subset of the tweets DataFrame and contains all relevant tweets that have a link.

We can now print out all the links for each football player.
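The last three steps can be sketched as follows. To keep the block self-contained, it uses pandas’ str.extract with the same kind of pattern in place of a separate helper function. The sample tweets, their URLs and the messi column are made up for illustration.

```python
import pandas as pd

# Hand-made sample with the columns built in the earlier steps (not real data).
tweets = pd.DataFrame({
    "text": [
        "Messi highlights https://example.com/messi #WorldCup",
        "Neymar at the World Cup",
        "Ronaldo interview http://example.com/ronaldo",
    ],
    "relevant": [True, True, False],
    "messi": [True, False, False],
})

# First http(s) link in each tweet; empty string when there is none.
tweets["link"] = tweets["text"].str.extract(r"(https?://\S+)", expand=False).fillna("")

# Keep only relevant tweets that carry a link.
tweets_relevant = tweets[tweets["relevant"]]
tweets_relevant_with_link = tweets_relevant[tweets_relevant["link"] != ""]

# Links from relevant tweets that mention Messi.
messi_links = tweets_relevant_with_link[tweets_relevant_with_link["messi"]]["link"]
print(messi_links.tolist())
```

Repeating the last selection with the column for another player would list that player’s links instead.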
