DataCareer Blog

Random Forest is a powerful ensemble learning method that can be applied to various prediction tasks, in particular classification and regression. The method builds on an ensemble of decision trees and therefore inherits their advantages, such as high accuracy, ease of use, and no need to scale the data. It also has an important additional benefit: because many trees are combined, it is far more resistant to overfitting than a single decision tree. In this tutorial, we will predict the price of diamonds from the Diamonds dataset (part of ggplot2) using a Random Forest regressor in R. We then visualize and analyze the resulting model, tune its hyperparameters, and look at the importance of the available features.

Loading and preparing data

```r
# Load the required packages
library(ggplot2)       # contains the diamonds dataset
library(dplyr)
library(caret)
library(randomForest)

# Import the dataset
diamond <- diamonds
head(diamond)
```

```
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
```

The dataset contains information on about 54,000 diamonds: the price plus 9 other attributes. Three of them (cut, color, clarity) are stored as ordered factors, and we need to encode them numerically.
```r
# Convert the ordered factors to integers
diamond$cut <- as.integer(diamond$cut)
diamond$color <- as.integer(diamond$color)
diamond$clarity <- as.integer(diamond$clarity)
head(diamond)
```

```
## # A tibble: 6 x 10
##   carat   cut color clarity depth table price     x     y     z
##   <dbl> <int> <int>   <int> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23      5     2       2  61.5    55   326  3.95  3.98  2.43
## 2 0.21      4     2       3  59.8    61   326  3.89  3.84  2.31
## 3 0.23      2     2       5  56.9    65   327  4.05  4.07  2.31
## 4 0.290     4     6       4  62.4    58   334  4.2   4.23  2.63
## 5 0.31      2     7       2  63.3    58   335  4.34  4.35  2.75
## 6 0.24      3     7       6  62.8    57   336  3.94  3.96  2.48
```

As mentioned above, one benefit of the Random Forest algorithm is that it does not require data scaling. So, to use it, we only need to define the features and the target we are trying to predict. We could create additional features by combining the available attributes; for simplicity, we skip that here. If you are trying to build the most accurate model possible, feature engineering (e.g. through interactions) is a key part of the work and deserves substantial time.

```r
# Create features and target
X <- diamond %>% select(carat, depth, table, x, y, z, clarity, cut, color)
y <- diamond$price
```

Training the model and making predictions

At this point, we have to split our data into training and test sets: 75% of the rows for training and the remaining 25% for testing.

```r
# Split data into training and test sets
index <- createDataPartition(y, p = 0.75, list = FALSE)
X_train <- X[index, ]
X_test  <- X[-index, ]
y_train <- y[index]
y_test  <- y[-index]

# Train the model
regr <- randomForest(x = X_train, y = y_train, maxnodes = 10, ntree = 10)
```

Now we have a trained model and can predict prices for the test data. We then compare the predicted values with the actual values and assess the accuracy of the model.
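The 75/25 split logic itself is language-agnostic; as an illustration, here is a minimal Python sketch of a random train/test split (the function name and seed are ours for illustration; note that createDataPartition additionally stratifies the split on the outcome):

```python
import random

def train_test_split(rows, test_frac=0.25, seed=7):
    """Shuffle row indices and split them into train/test index lists."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_frac))
    return idx[:cut], idx[cut:]

# Splitting 100 rows yields 75 training and 25 test indices
train_idx, test_idx = train_test_split(list(range(100)))
```

Every row lands in exactly one of the two partitions, which is the property the later evaluation depends on.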
To make the comparison more illustrative, we show it both as a table and as a plot of price against carat.

```r
# Make predictions
predictions <- predict(regr, X_test)

result <- X_test
result['price'] <- y_test
result['prediction'] <- predictions
head(result)
```

```
## # A tibble: 6 x 11
##   carat depth table     x     y     z clarity   cut color price prediction
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <int> <int> <int> <int>      <dbl>
## 1  0.24  62.8    57  3.94  3.96  2.48       6     3     7   336       881.
## 2  0.23  59.4    61  4     4.05  2.39       5     3     5   338       863.
## 3  0.2   60.2    62  3.79  3.75  2.27       2     4     2   345       863.
## 4  0.32  60.9    58  4.38  4.42  2.68       1     4     2   345       863.
## 5  0.3   62      54  4.31  4.34  2.68       2     5     6   348       762.
## 6  0.3   62.7    59  4.21  4.27  2.66       3     3     7   351       863.
```

```r
# Build a scatterplot of real vs. predicted prices
ggplot() +
  geom_point(aes(x = X_test$carat, y = y_test, color = 'red', alpha = 0.5)) +
  geom_point(aes(x = X_test$carat, y = predictions, color = 'blue', alpha = 0.5)) +
  labs(x = "Carat", y = "Price", color = "", alpha = 'Transparency') +
  scale_color_manual(labels = c("Predicted", "Real"), values = c("blue", "red"))
```

The figure shows that the predicted prices (blue points) match the real ones (red points) quite well, especially for small carat values. To evaluate the model more precisely, we look at the mean absolute error (MAE), the mean squared error (MSE), and the R-squared score.

```r
# Load the Metrics package
library(Metrics)
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##     precision, recall

print(paste0('MAE: ', mae(y_test, predictions)))
## [1] "MAE: 742.401258870433"
print(paste0('MSE: ', caret::postResample(predictions, y_test)['RMSE']^2))
## [1] "MSE: 1717272.6547428"
print(paste0('R2: ', caret::postResample(predictions, y_test)['Rsquared']))
## [1] "R2: 0.894548902990278"
```

The error values (MAE and MSE) are high. To improve the predictive power of the model, we should tune the hyperparameters of the algorithm.
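For reference, the three metrics can be computed from first principles. A short Python sketch (one caveat: caret's Rsquared is the squared correlation between predictions and observations, which can differ slightly from the 1 - SS_res/SS_tot definition used here):

```python
def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error: average squared residual."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

For example, with y_true = [1, 2, 3] and y_pred = [1, 2, 4], MAE and MSE are both 1/3 and R2 is 0.5.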
We could do this manually, but it would take a lot of time. To tune the parameters ntree (the number of trees in the forest) and maxnodes (the maximum number of terminal nodes a tree can have), we build a custom caret model that trains a Random Forest for each combination of the two parameters and compares the results.

Tuning the parameters

```r
# If training the model takes too long, try a lower value of N
N <- 500  # length(X_train)
X_train_ <- X_train[1:N, ]
y_train_ <- y_train[1:N]

seed <- 7
metric <- 'RMSE'

# Define a custom caret model that exposes maxnodes and ntree as tuning parameters
customRF <- list(type = "Regression", library = "randomForest", loop = NULL)
customRF$parameters <- data.frame(parameter = c("maxnodes", "ntree"),
                                  class = rep("numeric", 2),
                                  label = c("maxnodes", "ntree"))
customRF$grid <- function(x, y, len = NULL, search = "grid") {}
customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
  randomForest(x, y, maxnodes = param$maxnodes, ntree = param$ntree, ...)
}
customRF$predict <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
  predict(modelFit, newdata)
customRF$prob <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
  predict(modelFit, newdata, type = "prob")
customRF$sort <- function(x) x[order(x[, 1]), ]
customRF$levels <- function(x) x$classes

# Set up repeated cross-validation for the grid search
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, search = 'grid')

# Outline the grid of parameters
tunegrid <- expand.grid(.maxnodes = c(70, 80, 90, 100), .ntree = c(900, 1000, 1100))
set.seed(seed)

# Train the model
rf_gridsearch <- train(x = X_train_, y = y_train_, method = customRF,
                       metric = metric, tuneGrid = tunegrid, trControl = control)
```

Let's visualize the impact of the tuned parameters on RMSE. The plot shows how the model's performance changes across the parameter combinations. With maxnodes = 80 and ntree = 1000, the model performs best; we would use these parameters in the final model.
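Stripped of the caret machinery, an exhaustive grid search is just a nested loop over all parameter combinations, keeping the one with the best score. A minimal Python sketch, with a toy objective standing in for the cross-validated RMSE (the objective function here is hypothetical, chosen only so the example has a clear minimum):

```python
from itertools import product

def grid_search(evaluate, grid):
    """Try every parameter combination and return the one with the lowest score."""
    best_params, best_score = None, float("inf")
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective standing in for cross-validated RMSE (hypothetical)
rmse = lambda p: abs(p["maxnodes"] - 80) + abs(p["ntree"] - 1000) / 100

grid = {"maxnodes": [70, 80, 90, 100], "ntree": [900, 1000, 1100]}
best, score = grid_search(rmse, grid)
```

The cost is the product of the grid sizes times the cross-validation folds, which is why the tutorial subsamples the training set first.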
```r
plot(rf_gridsearch)

# Best parameters
rf_gridsearch$bestTune
##   maxnodes ntree
## 5       80  1000
```

Defining and visualizing variable importance

We used all available diamond features, but some carry more predictive power than others. Let's build a plot with the features on the y axis. The x axis shows the total decrease in node impurity from splitting on the variable, averaged over all trees; for regression it is measured by the residual sum of squares and therefore gives a rough idea of each feature's predictive power. Keep in mind, though, that a random forest does not allow for any causal interpretation.

```r
varImpPlot(rf_gridsearch$finalModel, main = 'Feature importance')
```

The figure shows that the size of the diamond (x, y, z refer to length, width, and depth) and its weight (carat) carry most of the predictive power.
For the past few years, tasks involving text and speech processing have become increasingly popular. Among the many research areas within Natural Language Processing and Machine Learning, sentiment analysis ranks very high. Sentiment analysis identifies and extracts subjective information from source data using data analysis and visualization, ML models for classification, and text mining. It helps to understand social opinion about a subject, which is why it is widely used in business and politics and usually applied to social media data.

Social networks as the main resource for sentiment analysis

Nowadays, social networks and forums are the main stage for people sharing opinions, which makes them especially interesting for researchers trying to gauge attitudes toward one subject or another. Sentiment analysis tackles the problem of analyzing textual data created by users of social networks, microblogging platforms, forums, and business platforms with regard to their opinions about a product, service, person, or idea. In the most common setup, text is classified into two classes (binary sentiment classification), positive and negative, but sentiment analysis can also be framed as a multi-class problem. Sentiment analysis can process hundreds of thousands of texts in a short period of time. This is another reason for its popularity: where people would need many hours to do the same work, sentiment analysis finishes in seconds.

Common approaches for classifying sentiment

Sentiment analysis of text data is commonly done via three methods: machine learning, dictionaries, and a hybrid of the two.

Learning-based approach

The machine learning approach is one of the most popular today. Using ML techniques, one can build a classifier that identifies different sentiment expressions in text.
Dictionary-based approach

The main idea of this approach is to use a bag of words with polarity scores that help establish whether a word has a positive, negative, or neutral connotation. Such an approach doesn't require a training set, so even a small amount of data can be classified. However, many words and expressions are still missing from sentiment dictionaries.

Hybrid approach

As the name suggests, this approach combines machine learning and lexicon-based techniques. Although it is not widely used, the hybrid approach often shows more promising results than either approach on its own.

In this article, we will implement a dictionary-based approach, so let's dig into its basics. Dictionary (or lexicon)-based sentiment analysis relies on special dictionaries and lexicons, a number of which are available for calculating sentiment in text. The main ones are:

afinn
bing
nrc

All three are sentiment dictionaries that help evaluate the valence of textual data by matching words that describe emotion or opinion.

Things to do before sentiment analysis

Before building the sentiment analyzer, a few steps must be taken. First of all, we need to state the problem we are going to explore and understand its objective. Since we will use data from Donald Trump's Twitter account, our objective is to analyze which connotation his latest tweets have. With the problem outlined, we need to prepare our data for examination. Data preprocessing is the usual first step in text and sentiment classification. Depending on the input data, various techniques can be applied to make the data more comprehensible and improve the effectiveness of the analysis. The most common steps are:

removing numbers
removing stopwords
removing punctuation

and so on.
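The core of the dictionary-based approach fits in a few lines: look up each word's polarity score and sum them. A minimal Python sketch with a hypothetical four-word lexicon (real lexicons such as bing contain thousands of entries):

```python
# Tiny illustrative polarity lexicon (hypothetical scores, not afinn/bing/nrc)
lexicon = {"great": 1, "honor": 1, "mistake": -1, "harassment": -1}

def score_text(text, lexicon):
    """Sum the polarity scores of the words found in the lexicon."""
    words = text.lower().split()
    return sum(lexicon.get(w, 0) for w in words)

def classify(text, lexicon):
    """Map the summed score to a positive/negative/neutral label."""
    s = score_text(text, lexicon)
    return "positive" if s > 0 else "negative" if s < 0 else "neutral"
```

Words absent from the lexicon simply contribute nothing, which is exactly the weakness noted above: coverage of the dictionary limits the quality of the classification.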
Building the sentiment classifier

The first step in building our classifier is installing the needed packages, which can be installed via install.packages("name of the package") directly from the development environment:

twitteR
dplyr
splitstackshape
tidytext
purrr

As soon as all the packages are installed, let's load them.

```r
library(twitteR)
library(dplyr)
library(splitstackshape)
library(tidytext)
library(purrr)
```

We are going to get tweets from Donald Trump's account directly from Twitter, so we need to provide Twitter API credentials.

```r
api_key <- "----"
api_secret <- "----"
access_token <- "----"
access_token_secret <- "----"

setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
## [1] "Using direct authentication"
```

Now it's time to fetch tweets from Donald Trump's account and convert the data into a dataframe.

```r
TrumpTweets <- userTimeline("realDonaldTrump", n = 3200)
TrumpTweets <- tbl_df(map_df(TrumpTweets, as.data.frame))
head(TrumpTweets)
```

Here is how our initial dataframe looks (the metadata columns replyToSN, truncated, statusSource, screenName, isRetweet, retweeted, longitude, and latitude are not shown here):

```
##   text                                                            created              retweetCount favoriteCount
## 1 It was my great honor to host Canadian Prime Minister           2019-06-20 17:49:52          3540         17424
##   @JustinTrudeau at the @WhiteHouse today!🇺🇸🇨🇦 https://t.co/orlejZ9FFs
## 2 Iran made a very big mistake!                                   2019-06-20 14:15:04         38351        127069
## 3 “The President has a really good story to tell. We have         2019-06-20 14:14:13          8753         36218
##   unemployment lower than we’ve seen in decades. We have peop… https://t.co/Pl2HsZbiRK
## 4 S&P opens at Record High!                                       2019-06-20 13:58:53          9037         43995
## 5 Since Election Day 2016, Stocks up almost 50%, Stocks gained    2019-06-20 00:12:31         16296         62468
##   9.2 Trillion Dollars in value, and more than 5,000,000… https://t.co/nOj2hCnU11
## 6 Congratulations to President Lopez Obrador — Mexico voted to    2019-06-19 23:01:59         20039         85219
##   ratify the USMCA today by a huge margin. Time for Congress to do the same here!
```

To prepare our data for classification, let's get rid of links and reshape the dataframe so that each row contains a single word.

```r
# Drop tweets that contain links
TrumpTweets <- TrumpTweets[-(grep('t.co', TrumpTweets$'text')), ]

# One word per row
TrumpTweets$'tweet' <- 'tweet'
TrumpTweets <- TrumpTweets[, c('text', 'tweet')]
TrumpTweets <- unnest_tokens(TrumpTweets, words, text)
```

And this is how our dataframe looks now:

```r
tail(TrumpTweets)
##   tweet words
##   tweet most
##   tweet of
##   tweet their
##   tweet people
##   tweet from
##   tweet venezuela

head(TrumpTweets)
##   tweet words
##   tweet iran
##   tweet made
##   tweet a
##   tweet very
##   tweet big
##   tweet mistake
```

Obviously, the dataframe still contains many words without useful content, so it's a good idea to get rid of them.
```r
# Remove stopwords
TrumpTweets <- anti_join(TrumpTweets, stop_words, by = c('words' = 'word'))
```

And here's the result:

```r
tail(TrumpTweets)
##   tweet words
##   tweet harassment
##   tweet russia
##   tweet informed
##   tweet removed
##   tweet people
##   tweet venezuela

head(TrumpTweets)
##   tweet words
##   tweet iran
##   tweet mistake
##   tweet amp
##   tweet record
##   tweet congratulations
##   tweet president
```

Much better, isn't it? Let's see how many times each word appears in Donald Trump's tweets.

```r
word_count <- dplyr::count(TrumpTweets, words, sort = TRUE)
head(word_count)
##   words     n
##   day       2
##   democrats 2
##   enjoy     2
##   florida   2
##   iran      2
##   live      2
```

Now it's time to create a dataframe with sentiments that will be used for tweet classification. We will use the bing dictionary, although you can easily use any other source.

```r
sentiments <- get_sentiments("bing")
sentiments <- dplyr::select(sentiments, word, sentiment)

TrumpTweets_sentiments <- merge(word_count, sentiments,
                                by.x = c('words'), by.y = c('word'))
```

Above, we performed a simple classification of the words in Trump's tweets using our sentiment bag of words. This is how the result looks:

```r
TrumpTweets_sentiments
##   words           n sentiment
##   beautiful       1 positive
##   burning         1 negative
##   congratulations 1 positive
##   defy            1 negative
##   enjoy           2 positive
##   harassment      1 negative
##   hell            1 negative
##   limits          1 negative
##   mistake         1 negative
##   scandal         1 negative
##   strong          1 positive
##   trump           1 positive
```

Let's look at the number of occurrences per sentiment.

```r
sentiments_count <- dplyr::count(TrumpTweets_sentiments, sentiment, sort = TRUE)
sentiments_count
##   sentiment n
##   negative  7
##   positive  5
```

We may also want to know the total count and the percentage of each sentiment.

```r
sentiments_sum <- sum(sentiments_count$'n')
sentiments_count$'percentage' <- sentiments_count$'n' / sentiments_sum
```

Let's now create an ordered dataframe for plotting the sentiment counts.
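The count-then-join step above can be sketched in Python with a Counter and a hypothetical two-word mini-lexicon standing in for the bing dictionary:

```python
from collections import Counter

# Tokenized words after stopword removal (sample values)
words = ["iran", "mistake", "record", "congratulations", "mistake"]
word_count = Counter(words)

# Hypothetical mini-lexicon standing in for the bing dictionary
bing = {"mistake": "negative", "congratulations": "positive"}

# Join word counts with the lexicon and aggregate per sentiment
sentiment_count = Counter()
for word, n in word_count.items():
    if word in bing:
        sentiment_count[bing[word]] += n
```

As in the R version, words that do not appear in the lexicon ("iran", "record") are silently dropped by the join.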
```r
sentiment_count <- rbind(sentiments_count)
sentiment_count <- sentiment_count[order(sentiment_count$sentiment), ]
sentiment_count
##   sentiment n percentage
##   negative  7  0.5833333
##   positive  5  0.4166667
```

And now it's time for the visualization. We will plot the results of our classifier.

```r
sentiment_count$'colour' <- as.integer(4)
barplot(sentiment_count$'n', names.arg = sentiment_count$'sentiment',
        col = sentiment_count$'colour', cex.names = .5)
barplot(sentiment_count$'percentage', names.arg = sentiment_count$'sentiment',
        col = sentiment_count$'colour', cex.names = .5)
```

Conclusion

Sentiment analysis is a great way to explore emotions and opinions among people. Today we explored the most common and accessible approach to sentiment analysis, which remains attractive in its simplicity and gives quite an informative result. However, different sentiment analysis methods and lexicons work better or worse depending on the problem and the text corpora. The result of the dictionary-based approach also depends heavily on how well the chosen dictionary matches the textual data to be classified; creating your own dictionary can be a good solution. Even so, dictionary-based methods often deliver results that compare well with those of more complex techniques.
The way other people think about a product or service has a big impact on our everyday decision making. People used to rely on the opinions of friends and relatives or on published reviews, but the Internet era has changed that significantly. Today, opinions are collected from people around the world via reviews on e-commerce sites as well as blogs and social networks. Transforming the gathered data into useful information on how a product or service is perceived requires sentiment analysis.

What is sentiment analysis and why do we need it

Sentiment analysis is the computational study of opinions, sentiments, and emotions expressed in textual data. Companies use sentiment analysis increasingly because the extracted information can help monetize their products and services. Words express various kinds of sentiment: they can be positive, negative, or neutral, with no emotional overtone. To analyze a text's sentiment, we need to understand the polarity of its words in order to classify sentiments into positive, negative, or neutral categories. This can be achieved with sentiment lexicons.

Common approaches for classifying sentiment

Sentiment analysis can be done in three ways: using ML algorithms, using dictionaries and lexicons, or combining these techniques. The ML-based approach has gained significant popularity, as it offers wide opportunities for identifying different sentiment expressions in text. For the lexicon-based approach, various dictionaries with polarity scores are available; such dictionaries help establish the connotation of a word. One advantage of this approach is that no training set is needed, so even a small piece of data can be classified successfully.
However, many words are still missing from sentiment lexicons, which somewhat diminishes classification results. Sentiment analysis based on a combination of ML and lexicon-based techniques is less popular, but it achieves more promising results than either approach used independently. Dictionaries form the central part of lexicon-based sentiment analysis; the most popular are afinn, bing, and nrc. All of them assign polarity scores that can be positive, negative, or neutral. For Python developers, two sentiment tools are particularly helpful: VADER and TextBlob. VADER is a rule- and lexicon-based sentiment analysis tool adapted to the sentiments found in social media posts; it uses a list of tokens labeled according to their semantic connotation. TextBlob is a useful library for text processing that provides general support for tasks like phrase extraction, sentiment analysis, and classification.

Things to do before sentiment analysis

In this tutorial, we will build a lexicon-based sentiment classifier of Donald Trump's tweets with the help of TextBlob, and look at which sentiments generally prevail in them. As with every data exploration, some steps are needed before the analysis: problem statement and data preparation. Since the theme of our study is already stated, let's concentrate on data preparation. We will get the tweets directly from Twitter; the data will arrive in an unordered form, so we need to organize it into a dataframe and clean it by removing links and stopwords.

Building the sentiment classifier

First of all, we have to install the packages needed for the task.
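With TextBlob, each text gets a polarity score in [-1, 1] via TextBlob(text).sentiment.polarity; mapping that score to a label is then a simple threshold rule. A minimal sketch of that mapping (the thresholds at zero are a common convention we adopt here, not part of TextBlob's API):

```python
def connotation(polarity):
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"
```

In the classifier below, such a rule turns the raw numeric scores into the positive/negative/neutral labels we want to count.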
In [41]: import tweepy import pandas as pd import numpy as np import matplotlib.pyplot as plt import re import nltk import nltk.corpus as corp from textblob import TextBlob  The next step is to connect our app to Twitter via Twitter API. Provide the needed credentials that will be used in our function for connection and extracting tweets from Donald Trump's account. In [4]: CONSUMER_KEY = "Key" CONSUMER_SECRET = "Secret" ACCESS_TOKEN = "Token" ACCESS_SECRET = "Secret" In [7]: def twitter_access(): auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET) auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET) api = tweepy.API(auth) return api twitter = twitter_access() In [9]: tweets = twitter.user_timeline("RealDonaldTrump", count=600) This is how our dataset looks: In [81]: tweets[0] Out[81]: Status(_api=<tweepy.api.API object at 0x7f987f0ce240>, _json={'created_at': 'Wed Jun 26 02:34:41 +0000 2019', 'id': 1143709133234954241, 'id_str': '1143709133234954241', 'text': 'Presidential Harassment!', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': []}, 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 25073877, 'id_str': '25073877', 'name': 'Donald J. 
Trump', 'screen_name': 'realDonaldTrump', 'location': 'Washington, DC', 'description': '45th President of the United States of America🇺🇸', 'url': 'https://t.co/OMxB0x7xC5', 'entities': {'url': {'urls': [{'url': 'https://t.co/OMxB0x7xC5', 'expanded_url': 'http://www.Instagram.com/realDonaldTrump', 'display_url': 'Instagram.com/realDonaldTrump', 'indices': [0, 23]}]}, 'description': {'urls': []}}, 'protected': False, 'followers_count': 61369691, 'friends_count': 47, 'listed_count': 104541, 'created_at': 'Wed Mar 18 13:46:38 +0000 2009', 'favourites_count': 7, 'utc_offset': None, 'time_zone': None, 'geo_enabled': True, 'verified': True, 'statuses_count': 42533, 'lang': None, 'contributors_enabled': False, 'is_translator': False, 'is_translation_enabled': True, 'profile_background_color': '6D5C18', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_tile': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/25073877/1560920145', 'profile_link_color': '1B95E0', 'profile_sidebar_border_color': 'BDDCAD', 'profile_sidebar_fill_color': 'C5CEC0', 'profile_text_color': '333333', 'profile_use_background_image': True, 'has_extended_profile': False, 'default_profile': False, 'default_profile_image': False, 'following': False, 'follow_request_sent': False, 'notifications': False, 'translator_type': 'regular'}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'is_quote_status': False, 'retweet_count': 10387, 'favorite_count': 48141, 'favorited': False, 'retweeted': False, 'lang': 'in'}, created_at=datetime.datetime(2019, 6, 26, 2, 34, 41), id=1143709133234954241, id_str='1143709133234954241', 
text='Presidential Harassment!', truncated=False, entities={'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': []}, source='Twitter for iPhone', source_url='http://twitter.com/download/iphone', in_reply_to_status_id=None, in_reply_to_status_id_str=None, in_reply_to_user_id=None, in_reply_to_user_id_str=None, in_reply_to_screen_name=None, author=User(_api=<tweepy.api.API object at 0x7f987f0ce240>, _json={'id': 25073877, 'id_str': '25073877', 'name': 'Donald J. Trump', 'screen_name': 'realDonaldTrump', 'location': 'Washington, DC', 'description': '45th President of the United States of America🇺🇸', 'url': 'https://t.co/OMxB0x7xC5', 'entities': {'url': {'urls': [{'url': 'https://t.co/OMxB0x7xC5', 'expanded_url': 'http://www.Instagram.com/realDonaldTrump', 'display_url': 'Instagram.com/realDonaldTrump', 'indices': [0, 23]}]}, 'description': {'urls': []}}, 'protected': False, 'followers_count': 61369691, 'friends_count': 47, 'listed_count': 104541, 'created_at': 'Wed Mar 18 13:46:38 +0000 2009', 'favourites_count': 7, 'utc_offset': None, 'time_zone': None, 'geo_enabled': True, 'verified': True, 'statuses_count': 42533, 'lang': None, 'contributors_enabled': False, 'is_translator': False, 'is_translation_enabled': True, 'profile_background_color': '6D5C18', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_tile': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/25073877/1560920145', 'profile_link_color': '1B95E0', 'profile_sidebar_border_color': 'BDDCAD', 'profile_sidebar_fill_color': 'C5CEC0', 'profile_text_color': '333333', 'profile_use_background_image': True, 'has_extended_profile': False, 
Status(..., user=User(id=25073877, id_str='25073877', name='Donald J. Trump', screen_name='realDonaldTrump', location='Washington, DC', description='45th President of the United States of America🇺🇸', url='https://t.co/OMxB0x7xC5', protected=False, followers_count=61369691, friends_count=47, listed_count=104541, created_at=datetime.datetime(2009, 3, 18, 13, 46, 38), geo_enabled=True, verified=True, statuses_count=42533, ...), geo=None, coordinates=None, place=None, contributors=None, is_quote_status=False, retweet_count=10387, favorite_count=48141, favorited=False, retweeted=False, lang='in')

(Heavily truncated here; the raw repr actually dumps the user's full profile twice, once as raw JSON and once as keyword arguments.) Not very informative, huh? Let's make our dataset look more legible.
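Under the hood, everything we need is exposed as a plain attribute on the Status object (tweet.text, tweet.retweet_count, and so on), so we never have to parse that wall of text ourselves. A minimal sketch with a hypothetical stand-in object, since constructing a live tweepy Status requires API access:

```python
from types import SimpleNamespace

# Hypothetical stand-in for one tweepy Status object -- a real one would
# come back from the API; only the attributes used below are filled in.
tweet = SimpleNamespace(
    text='Presidential Harassment!',
    retweet_count=10387,
    favorite_count=48141,
    source='Twitter for iPhone',
)

# Everything interesting is a plain attribute, so there is no need to
# dig through the giant repr shown above.
summary = {'tweets': tweet.text,
           'retweets': tweet.retweet_count,
           'favorites': tweet.favorite_count,
           'source': tweet.source}
print(summary['retweets'])  # → 10387
```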
In [101]:
tweetdata = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=["tweets"])

In [102]:
tweetdata["Created at"] = [tweet.created_at for tweet in tweets]
tweetdata["retweets"] = [tweet.retweet_count for tweet in tweets]
tweetdata["source"] = [tweet.source for tweet in tweets]
tweetdata["favorites"] = [tweet.favorite_count for tweet in tweets]

And this is how it looks now. Much better, isn't it?

In [103]:
tweetdata.head()

Out[103]:
   tweets                                             Created at           retweets  source              favorites
0  Presidential Harassment!                           2019-06-26 02:34:41  10387     Twitter for iPhone  48141
1  Senator Thom Tillis of North Carolina has real...  2019-06-25 22:20:42  11127     Twitter for iPhone  45202
2  Staff Sgt. David Bellavia - today, we honor yo...  2019-06-25 21:38:42  11455     Twitter for iPhone  48278
3  Today, it was my great honor to present the Me...  2019-06-25 20:27:19  10389     Twitter for iPhone  44485
4  ....Martha is strong on Crime and Borders, the...  2019-06-25 19:25:20   9817     Twitter for iPhone  52995

The next step is to clean the tweets of stop words that carry no meaning, and to enrich the dataset with each tweet's connotation (positive, negative, or neutral), its sentiment score, and its subjectivity.
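As an aside, the list comprehensions in In [101]–[102] each traverse tweets separately; the same frame can be built in one pass from a list of dicts. A sketch with hypothetical stand-in tweets rather than live Status objects:

```python
from types import SimpleNamespace
import pandas as pd

# Hypothetical stand-in tweets; real ones would be tweepy Status objects.
tweets = [
    SimpleNamespace(text='Presidential Harassment!',
                    created_at='2019-06-26 02:34:41',
                    retweet_count=10387, source='Twitter for iPhone',
                    favorite_count=48141),
    SimpleNamespace(text='....Martha is strong on Crime and Borders...',
                    created_at='2019-06-25 19:25:20',
                    retweet_count=9817, source='Twitter for iPhone',
                    favorite_count=52995),
]

# One pass over the tweets instead of five separate list comprehensions.
tweetdata = pd.DataFrame([{
    'tweets': t.text,
    'Created at': t.created_at,
    'retweets': t.retweet_count,
    'source': t.source,
    'favorites': t.favorite_count,
} for t in tweets])
print(tweetdata.shape)  # → (2, 5)
```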
In [104]:
stopword = corp.stopwords.words('english') + ['rt', 'https', 'co', 'u', 'go']

def clean_tweet(tweet):
    tweet = tweet.lower()
    filteredList = []
    global stopword
    tweetList = re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split()
    for i in tweetList:
        if not i in stopword:
            filteredList.append(i)
    return ' '.join(filteredList)

In [105]:
scores = []
status = []
sub = []
fullText = []
for tweet in tweetdata['tweets']:
    analysis = TextBlob(clean_tweet(tweet))
    fullText.extend(analysis.words)
    value = analysis.sentiment.polarity
    subject = analysis.sentiment.subjectivity
    if value > 0:
        sent = 'positive'
    elif value == 0:
        sent = 'neutral'
    else:
        sent = 'negative'
    scores.append(value)
    status.append(sent)
    sub.append(subject)

In [106]:
tweetdata['sentimental_score'] = scores
tweetdata['sentiment_status'] = status
tweetdata['subjectivity'] = sub
tweetdata.drop(tweetdata.columns[2:5], axis=1, inplace=True)

In [107]:
tweetdata.head()

Out[107]:
   tweets                                             Created at           sentimental_score  sentiment_status  subjectivity
0  Presidential Harassment!                           2019-06-26 02:34:41  0.000000           neutral           0.000000
1  Senator Thom Tillis of North Carolina has real...  2019-06-25 22:20:42  0.081481           positive          0.588889
2  Staff Sgt. David Bellavia - today, we honor yo...  2019-06-25 21:38:42  0.333333           positive          1.000000
3  Today, it was my great honor to present the Me...  2019-06-25 20:27:19  0.400000           positive          0.375000
4  ....Martha is strong on Crime and Borders, the...  2019-06-25 19:25:20  0.086667           positive          0.396667

For a better understanding of the obtained results, let's do some visualization.
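The regex and polarity thresholds in In [104]–[105] are easy to get subtly wrong, and they can be sanity-checked without any Twitter or TextBlob access. A self-contained sketch using the same regex, with a tiny inline stopword list standing in for the NLTK corpus (an assumption for illustration) and a classify helper mirroring the thresholds above:

```python
import re

# Tiny inline stopword list standing in for corp.stopwords.words('english');
# the article extends the NLTK list with ['rt', 'https', 'co', 'u', 'go'].
stopword = ['the', 'is', 'a', 'of', 'rt', 'https', 'co', 'u', 'go']

def clean_tweet(tweet):
    # Same regex as above: strips @mentions, URLs, and punctuation,
    # then drops stop words.
    tweet = tweet.lower()
    words = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)",
                   " ", tweet).split()
    return ' '.join(w for w in words if w not in stopword)

def classify(polarity):
    # The same bucketing as the loop in In [105].
    if polarity > 0:
        return 'positive'
    elif polarity == 0:
        return 'neutral'
    return 'negative'

print(clean_tweet("RT @user: Check https://t.co/abc, the day is GREAT!"))
# → check day great
print(classify(0.4))   # → positive
print(classify(0.0))   # → neutral
```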
In [109]:
positive = len(tweetdata[tweetdata['sentiment_status'] == 'positive'])
negative = len(tweetdata[tweetdata['sentiment_status'] == 'negative'])
neutral = len(tweetdata[tweetdata['sentiment_status'] == 'neutral'])

In [110]:
fig, ax = plt.subplots(figsize=(10, 5))
index = range(3)
plt.bar(index[2], positive, color='green', edgecolor='black', width=0.8)
plt.bar(index[0], negative, color='orange', edgecolor='black', width=0.8)
plt.bar(index[1], neutral, color='grey', edgecolor='black', width=0.8)
plt.legend(['Positive', 'Negative', 'Neutral'])
plt.xlabel('Sentiment Status', fontdict={'size': 15})
plt.ylabel('Sentimental Frequency', fontdict={'size': 15})
plt.title("Donald Trump's Twitter sentiment status", fontsize=20)

Out[110]: Text(0.5, 1.0, "Donald Trump's Twitter sentiment status")

Conclusion

Sentiment analysis is a great way to explore emotions and opinions in society. We created a basic sentiment classifier that can be used to analyze textual data from social networks. The lexicon-based approach also lets you build your own lexicon dictionaries, so you can fine-tune the sentiment scoring to the task, the textual data, and the goal of the analysis.