Word cloud using R

Last week I learned how to make a word cloud using R.

Here I will go through the step-by-step procedure for creating a word comparison cloud.

NOTE: Before starting, you need your Twitter API key and API secret, plus your access token and access token secret.

The following R packages are used in this project, so install and explore them.

1)twitteR

2)ROAuth

3)RCurl

4)stringr

5)RJSONIO

6)wordcloud

7)tm

Let me first outline the smaller steps involved. First extract the data (tweets) from Twitter and get the text from the extracted tweets. Then clean the text, which means removing extra spaces, punctuation and unnecessary numbers, and join the texts for each company into a single string. After that, remove stop words, create a corpus, create a term-document matrix and finally plot the word comparison cloud.

Here I took tweets from TATA DOCOMO and IDEA CELLULAR and built a word comparison cloud.
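Before running anything against Twitter, the cleaning and joining steps can be tried on a couple of made-up tweets. This is only a sketch in base R: the sample tweets and the helper name clean_sample are invented for illustration, and links are removed first here, which is a small variation on the order used in the full code.

```r
# Made-up tweets standing in for what userTimeline() would return
sample_tweets <- c(
  "RT @someone: Great data offers!!  Visit http://example.com/offer today",
  "  Network is down again... 2 days now  "
)

# Hypothetical helper that mirrors the cleaning steps described above
clean_sample <- function(x) {
  x <- tolower(x)
  x <- gsub("http\\S+", " ", x)      # drop links while the URL is still intact
  x <- gsub("\\brt\\b", " ", x)      # drop the retweet marker as a whole word
  x <- gsub("@\\w+", " ", x)         # drop @mentions
  x <- gsub("[[:punct:]]", " ", x)   # drop punctuation
  x <- gsub("[[:digit:]]", " ", x)   # drop numbers
  x <- gsub("\\s+", " ", x)          # squeeze runs of whitespace
  trimws(x)                          # strip leading and trailing spaces
}

cleaned <- clean_sample(sample_tweets)
joined  <- paste(cleaned, collapse = " ")  # one string per company
print(cleaned)
print(joined)
```

Replacing each unwanted piece with a space and squeezing the whitespace at the end keeps neighbouring words from being glued together.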

R Code:

#Collecting tweets from mobile companies
library(twitteR)
library(ROAuth)
library(RCurl)
library(stringr)
library(RJSONIO)
library(wordcloud)
library(tm)

# Declare Twitter API credentials (replace the placeholders with your own keys)

api_key <- "YOUR_API_KEY"

api_secret <- "YOUR_API_SECRET"

token <- "YOUR_ACCESS_TOKEN"

token_secret <- "YOUR_ACCESS_TOKEN_SECRET"

# Create Twitter Connection
setup_twitter_oauth(api_key, api_secret, token, token_secret)

# Idea Cellular tweets
idea_tweets = userTimeline("ideacellular", n=500)

# Tata Docomo tweets
tata_tweets = userTimeline("TataDocomo", n=500)

# get text
tata_txt = sapply(tata_tweets, function(x) x$getText())
idea_txt = sapply(idea_tweets, function(x) x$getText())
## clean text

clean.text = function(x)
{
  # convert to lower case
  x = tolower(x)
  # remove the retweet marker "rt" as a whole word only
  x = gsub("\\brt\\b", "", x)
  # remove @mentions
  x = gsub("@\\w+", "", x)
  # remove punctuation
  x = gsub("[[:punct:]]", "", x)
  # remove numbers
  x = gsub("[[:digit:]]", "", x)
  # remove links (with punctuation gone, a URL is one run starting with "http")
  x = gsub("http\\w+", "", x)
  # collapse tabs and repeated spaces into a single space
  x = gsub("[ \t]+", " ", x)
  # remove blank spaces at the beginning
  x = gsub("^ ", "", x)
  # remove blank spaces at the end
  x = gsub(" $", "", x)
  return(x)
}

## apply function clean.text to both sets of tweets

tata_clean = clean.text(tata_txt)
idea_clean = clean.text(idea_txt)

## join texts in a vector for each company

tata = paste(tata_clean, collapse=" ")
idea = paste(idea_clean, collapse=" ")

# put everything in a single vector
final = c(tata, idea)
final

##remove stop-words

final = removeWords(final, c(stopwords("english"), "amazon", "flipkart"))

# create corpus
corpus = Corpus(VectorSource(final))

# create term-document matrix

tdm = TermDocumentMatrix(corpus)

# convert as matrix
tdm = as.matrix(tdm)

# add column names
colnames(tdm) = c("tata", "idea")

# plot comparison cloud

comparison.cloud(tdm, random.order=FALSE, colors = c("#00B2FF", "red"), title.size=1.5, max.words=300)

# plot commonality cloud (it has no titles, so title.size is not needed here)

commonality.cloud(tdm, random.order=FALSE, colors = brewer.pal(8, "Dark2"))

Word comparison cloud:

[Image: word comparison cloud]

Commonality cloud:

[Image: commonality cloud]
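To get a feel for what the two cloud functions receive, here is a small sketch that builds a term-document matrix by hand in base R from two made-up documents. The documents are invented, and the two calculations at the end are only my understanding of the idea behind each cloud (deviation from the average share for the comparison cloud, the smallest share for the commonality cloud), not the package's exact code.

```r
# Two made-up, already cleaned documents (one per company)
docs <- c(tata = "network offer data offer recharge",
          idea = "network data plan plan roaming")

# Count terms per document with base R (tm's TermDocumentMatrix does this for real corpora)
tokens <- strsplit(docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))
tdm    <- sapply(tokens, function(tok) table(factor(tok, levels = terms)))
print(tdm)  # rows are terms, columns are companies

# Share of each term within its own document
rates <- sweep(tdm, 2, colSums(tdm), "/")

# Comparison cloud idea: how far a term's share is from its average across documents
deviation <- rates - rowMeans(rates)

# Commonality cloud idea: the smallest share a term has in any document
common <- apply(rates, 1, min)

print(round(deviation, 2))
print(round(common, 2))
```

Words used by only one company get a large deviation for that company and a commonality of zero, which is why they show up in the comparison cloud but not in the commonality cloud.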

Hope you enjoyed learning it 🙂

How to show plots of different variables in a single graph in R?

While working with scatter plots and box plots to visualise and analyse data, I got stuck on this question:

How can I combine two different plots, where each plot is between different variables?

This can be done using the gridExtra package in R.

Let’s take an example.

First let's create a data frame which contains the age, salary and experience of the employees of a company.

company = data.frame(name = LETTERS[1:25], age = sample(21:30, 25, replace = TRUE), salary = sample(25000:40000, 25, replace = TRUE), experience = sample(0:8, 25, replace = TRUE))

Let's create two different plots. qplot comes from the ggplot2 package, so load it first. Age is wrapped in factor() so that we get one box per age.

library(ggplot2)

q1 = qplot(factor(company$age), company$salary, geom = "boxplot")

q2 = qplot(company$experience, company$salary, geom = "point")

Then load the gridExtra package.

library(gridExtra)

The q1 and q2 plots can be combined into a single graph with the grid.arrange function.

grid.arrange(q1, q2, ncol=2)

# ncol=2 places the plots side by side in two columns

[Image: the box plot and the scatter plot side by side]

Note: You can also specify nrow if you want to arrange several plots in a given number of rows and columns.
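As a sketch of the nrow option, the example below arranges four plots in a 2 x 2 grid. It assumes the ggplot2 and gridExtra packages are installed; the extra plots p3 and p4 and the set.seed call are additions made only for this illustration.

```r
library(ggplot2)
library(gridExtra)

set.seed(1)  # only so that the made-up data is reproducible
company <- data.frame(
  name       = LETTERS[1:25],
  age        = sample(21:30, 25, replace = TRUE),
  salary     = sample(25000:40000, 25, replace = TRUE),
  experience = sample(0:8, 25, replace = TRUE)
)

p1 <- ggplot(company, aes(factor(age), salary)) + geom_boxplot() + labs(x = "age")
p2 <- ggplot(company, aes(experience, salary)) + geom_point()
p3 <- ggplot(company, aes(age, experience)) + geom_point()
p4 <- ggplot(company, aes(salary)) + geom_histogram(bins = 10)

# nrow and ncol together fix the grid: two rows of two plots each
layout <- grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)
```

grid.arrange draws the grid and also returns the layout invisibly, so it can be kept in a variable and saved later, for example with ggsave.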

Confusion Matrix

As the name suggests, people often get confused by the terms used in a confusion matrix. Hopefully you will not feel the same after reading this post.

A clear understanding of the confusion matrix is needed in the statistics part of data science.

So let's start with the definition of a confusion matrix, then explain the terms involved with an example, and at the end discuss which parameter is important for this example.

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.

Let's take the following example; the required details are given in the table.

A retail store's marketing team uses analytics to predict who is likely to buy a newly introduced high-end (expensive) product.

Buyer or not          Actual Negative   Actual Positive   Total
Predicted Negative    725               158               883
Predicted Positive    75                302               377
Total                 800               460               1260

The above matrix tells us the following:

  • 725 people did not buy the product in reality, and the team predicted the same.
  • 75 people did not buy the product in reality, but the team predicted that they would.
  • 158 people bought the product in reality, but the team predicted that they would not.
  • 302 people bought the product in reality, and the team predicted the same.

Let's now define the basic terms, which are whole numbers (not rates):

True positive (TP): These are cases in which the team predicted positive (likely to buy), and the person buys in reality.

True negative (TN): These are cases in which the team predicted negative (not likely to buy), and the person does not buy in reality.

False positive (FP): These are cases in which the team predicted positive (likely to buy), but the person does not buy in reality.

False negative (FN): These are cases in which the team predicted negative (not likely to buy), but the person buys in reality.

Note: FP is generally known as type-I error and FN is known as type-II error.
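The four counts can be read straight off the table. A minimal sketch in R, where the matrix name cm and the dimension labels are my own choices:

```r
# Counts from the table: rows are predictions, columns are actual outcomes
cm <- matrix(c(725, 158,
               75,  302),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("negative", "positive"),
                             actual    = c("negative", "positive")))

TP <- cm["positive", "positive"]  # predicted to buy, did buy
TN <- cm["negative", "negative"]  # predicted not to buy, did not buy
FP <- cm["positive", "negative"]  # predicted to buy, did not buy
FN <- cm["negative", "positive"]  # predicted not to buy, did buy

# Add the row and column totals to check them against the table
print(addmargins(as.table(cm)))
```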

The following is a list of rates that are often computed from a confusion matrix:

Accuracy: Overall, how often is the classifier correct?

Accuracy = (TP+TN)/(TP+TN+FP+FN) = 81.5%

Recall: When people are actually buying, how often does the team predict that correctly?

Recall = TP/(TP+FN) = 65.7%

This is also known as true positive rate or sensitivity.

Specificity: When people are actually not buying, how often does the team predict that they are not buying?

Specificity = (TN)/(TN+FP) = 90.6%

Precision: When the team predicts that people are buying, how often is it correct?

Precision = (TP)/(TP+FP) = 80.1%

F1-Score: This is the harmonic mean of the recall and precision.

F1-Score = (2*Recall*Precision)/(Recall+Precision) = 72.2%
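The same formulas can be checked in a few lines of R. This sketch only plugs the four counts from the table into the formulas; the variable names are my own.

```r
# Counts taken from the confusion matrix
TP <- 302; TN <- 725; FP <- 75; FN <- 158

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
recall      <- TP / (TP + FN)        # true positive rate, sensitivity
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)
f1          <- 2 * recall * precision / (recall + precision)

rates <- c(accuracy = accuracy, recall = recall, specificity = specificity,
           precision = precision, f1 = f1)

# Show the rates as percentages rounded to one decimal place
print(round(100 * rates, 1))
```

Computing F1 from the unrounded recall and precision avoids carrying a rounding error into the last digit.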

So now comes the big question.

Which parameter should the team be worried about?

FP, FN, or both equally?

Ans: FN

Why?

If the model predicts that a person will not buy, the product will not be marketed to him/her, and the team will lose customers, money and business. FP is not such a big worry, since only the cost of a phone call, an SMS or a catalog will be lost.
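To make the argument concrete, here is a rough sketch that puts a price on each kind of error. The two cost figures are purely hypothetical numbers picked for illustration; only the FP and FN counts come from the table.

```r
# Error counts from the confusion matrix
FP <- 75
FN <- 158

# Hypothetical costs, chosen only for illustration (not from the example)
cost_per_fp <- 50     # a wasted phone call, SMS or catalog
cost_per_fn <- 5000   # profit lost on a buyer who was never contacted

total_fp_cost <- FP * cost_per_fp
total_fn_cost <- FN * cost_per_fn

print(c(false_positives = total_fp_cost, false_negatives = total_fn_cost))
```

With any costs of this shape, where a missed buyer is far more expensive than a wasted contact, the false negatives dominate the total loss.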

And what is more important: Recall, Precision or Accuracy?

Ans: Recall, because recall is the rate that drops whenever a false negative occurs.

But this will not be the same in every case. It varies from case to case.

So a thorough understanding of the case is required before concluding which parameter is important.