Exploratory Data Analysis in R

Introduction

Exploratory data analysis (EDA) is an approach to summarizing the main characteristics of a dataset. It can be performed with various methods, among which data visualization plays a major role. The idea of EDA is to find out what the data can tell us beyond formal modeling or hypothesis testing. In other words, when we have few or no prior ideas about the patterns and relationships within the data, EDA helps us identify its main tendencies and properties. Based on that information, the researcher can evaluate the structure and nature of the available data, which makes it easier to formulate the questions and the purpose of further exploration. EDA is therefore a crucial step before feature engineering and can include part of the data preprocessing.

In this tutorial, we will show you how to perform a simple EDA using the Google Play Store Apps dataset. To begin with, let’s load all the libraries that we will need (install any that are missing with install.packages() first).

# Suppress warnings
options(warn = -1)

# Load libraries
library(ggplot2)
library(highcharter)
library(dplyr)
library(tidyverse)
library(corrplot)
library(RColorBrewer)
library(xts)
library(treemap)
library(lubridate)
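
The libraries above have to be installed before they can be loaded. As a minimal sketch (the helper name missing_packages is hypothetical and not part of any package), the following base R code reports which of them are not installed yet, so that they can be passed to install.packages():

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("ggplot2", "highcharter", "dplyr", "tidyverse", "corrplot",
            "RColorBrewer", "xts", "treemap", "lubridate")
to_install <- missing_packages(needed)

# Uncomment to install whatever is missing (requires an internet connection):
# if (length(to_install) > 0) install.packages(to_install)
print(to_install)
```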


Data overview

Play Store app data can give developers insight into the Android market. Each row of the dataset describes one app: its category, rating, size, and other characteristics. Here are the columns of our dataset:

  1. App - name of the application.
  2. Category - category of the app.
  3. Rating - application’s rating on Play Store.
  4. Reviews - number of the app’s reviews.
  5. Size - size of the app.
  6. Installs - number of installs of the app.
  7. Type - whether the app is free or paid.
  8. Price - price of the app (0 if free).
  9. Content Rating - target audience of the app.
  10. Genres - genre the app belongs to.
  11. Last Updated - date the app was last updated.
  12. Current Ver - current version of the application.
  13. Android Ver - minimum Android version required to run the app.

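Now, let’s load the data and view the first rows with the head() function:

# stringsAsFactors = TRUE reproduces the factor columns shown below (the default before R 4.0)
df <- read.csv("googleplaystore.csv", na.strings = c("NaN", "NA", ""), stringsAsFactors = TRUE)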

head(df)
##                                                    App       Category
## 1       Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN
## 2                                  Coloring book moana ART_AND_DESIGN
## 3   U Launcher Lite – FREE Live Cool Themes, Hide Apps ART_AND_DESIGN
## 4                                Sketch - Draw & Paint ART_AND_DESIGN
## 5                Pixel Draw - Number Art Coloring Book ART_AND_DESIGN
## 6                           Paper flowers instructions ART_AND_DESIGN
##   Rating Reviews Size    Installs Type Price Content.Rating
## 1    4.1     159  19M     10,000+ Free     0       Everyone
## 2    3.9     967  14M    500,000+ Free     0       Everyone
## 3    4.7   87510 8.7M  5,000,000+ Free     0       Everyone
## 4    4.5  215644  25M 50,000,000+ Free     0           Teen
## 5    4.3     967 2.8M    100,000+ Free     0       Everyone
## 6    4.4     167 5.6M     50,000+ Free     0       Everyone
##                      Genres     Last.Updated        Current.Ver
## 1              Art & Design  January 7, 2018              1.0.0
## 2 Art & Design;Pretend Play January 15, 2018              2.0.0
## 3              Art & Design   August 1, 2018              1.2.4
## 4              Art & Design     June 8, 2018 Varies with device
## 5   Art & Design;Creativity    June 20, 2018                1.1
## 6              Art & Design   March 26, 2017                1.0
##    Android.Ver
## 1 4.0.3 and up
## 2 4.0.3 and up
## 3 4.0.3 and up
## 4   4.2 and up
## 5   4.4 and up
## 6   2.3 and up

Knowing the data format is useful before starting the analysis. We can also review the column types using the str() function:

str(df)
## 'data.frame':    10841 obs. of  13 variables:
##  $ App           : Factor w/ 9660 levels "- Free Comics - Comic Apps",..: 7229 2563 8998 8113 7294 7125 8171 5589 4948 5826 ...
##  $ Category      : Factor w/ 34 levels "1.9","ART_AND_DESIGN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Rating        : num  4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
##  $ Reviews       : Factor w/ 6002 levels "0","1","10","100",..: 1183 5924 5681 1947 5924 1310 1464 3385 816 485 ...
##  $ Size          : Factor w/ 462 levels "1,000+","1.0M",..: 55 30 368 102 64 222 55 118 146 120 ...
##  $ Installs      : Factor w/ 22 levels "0","0+","1,000,000,000+",..: 8 20 13 16 11 17 17 4 4 8 ...
##  $ Type          : Factor w/ 3 levels "0","Free","Paid": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Price         : Factor w/ 93 levels "$0.99","$1.00",..: 92 92 92 92 92 92 92 92 92 92 ...
##  $ Content.Rating: Factor w/ 6 levels "Adults only 18+",..: 2 2 2 5 2 2 2 2 2 2 ...
##  $ Genres        : Factor w/ 120 levels "Action","Action;Action & Adventure",..: 10 13 10 10 12 10 10 10 10 12 ...
##  $ Last.Updated  : Factor w/ 1378 levels "1.0.19","April 1, 2016",..: 562 482 117 825 757 901 76 726 1317 670 ...
##  $ Current.Ver   : Factor w/ 2832 levels "0.0.0.2","0.0.1",..: 120 1019 465 2825 278 114 278 2392 1456 1430 ...
##  $ Android.Ver   : Factor w/ 33 levels "1.0 and up","1.5 and up",..: 16 16 16 19 21 9 16 19 11 16 ...
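
In the output above, Reviews, Installs, and Price are factors rather than numbers. Calling as.numeric() directly on a factor returns its internal level codes, not the values that are printed, so such columns should be converted to character first. A small sketch with made-up values illustrates the difference:

```r
# Made-up review counts stored as a factor, as in the str() output above
reviews <- factor(c("159", "967", "87510"))

# Level codes (positions in the sorted list of levels), not the real counts
codes <- as.numeric(reviews)

# Convert to character first to recover the actual numbers
values <- as.numeric(as.character(reviews))

print(codes)
print(values)
```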

As you can see, str() shows much of the same information as head(), but with the focus on data types rather than content. Next, summary() produces a per-column summary: quartiles and the number of NAs for numeric columns, and the most frequent values for factors:

summary(df)
##                                                 App       
##  ROBLOX                                           :    9  
##  CBS Sports App - Scores, News, Stats & Watch Live:    8  
##  8 Ball Pool                                      :    7  
##  Candy Crush Saga                                 :    7  
##  Duolingo: Learn Languages Free                   :    7  
##  ESPN                                             :    7  
##  (Other)                                          :10796  
##          Category        Rating          Reviews    
##  FAMILY      :1972   Min.   : 1.000   0      : 596  
##  GAME        :1144   1st Qu.: 4.000   1      : 272  
##  TOOLS       : 843   Median : 4.300   2      : 214  
##  MEDICAL     : 463   Mean   : 4.193   3      : 175  
##  BUSINESS    : 460   3rd Qu.: 4.500   4      : 137  
##  PRODUCTIVITY: 424   Max.   :19.000   5      : 108  
##  (Other)     :5535   NA's   :1474     (Other):9339  
##                  Size             Installs      Type           Price      
##  Varies with device:1695   1,000,000+ :1579   0   :    1   0      :10040  
##  11M               : 198   10,000,000+:1252   Free:10039   $0.99  :  148  
##  12M               : 196   100,000+   :1169   Paid:  800   $2.99  :  129  
##  14M               : 194   10,000+    :1054   NA's:    1   $1.99  :   73  
##  13M               : 191   1,000+     : 907                $4.99  :   72  
##  15M               : 184   5,000,000+ : 752                $3.99  :   63  
##  (Other)           :8183   (Other)    :4128                (Other):  316  
##          Content.Rating           Genres             Last.Updated 
##  Adults only 18+:   3   Tools        : 842   August 3, 2018: 326  
##  Everyone       :8714   Entertainment: 623   August 2, 2018: 304  
##  Everyone 10+   : 414   Education    : 549   July 31, 2018 : 294  
##  Mature 17+     : 499   Medical      : 463   August 1, 2018: 285  
##  Teen           :1208   Business     : 460   July 30, 2018 : 211  
##  Unrated        :   2   Productivity : 424   July 25, 2018 : 164  
##  NA's           :   1   (Other)      :7480   (Other)       :9257  
##              Current.Ver               Android.Ver  
##  Varies with device:1459   4.1 and up        :2451  
##  1.0               : 809   4.0.3 and up      :1501  
##  1.1               : 264   4.0 and up        :1375  
##  1.2               : 178   Varies with device:1362  
##  2.0               : 151   4.4 and up        : 980  
##  (Other)           :7972   (Other)           :3169  
##  NA's              :   8   NA's              :   3
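
The summary already hints at a data quality problem: the maximum Rating is 19, although Play Store ratings range from 1 to 5, and Category and Type contain the odd levels "1.9" and "0". This suggests at least one malformed row. A minimal sketch of a range check, using a made-up data frame in place of df:

```r
# Made-up sample standing in for df; the last row imitates a malformed record
apps <- data.frame(
  App    = c("App A", "App B", "App C"),
  Rating = c(4.1, NA, 19),
  stringsAsFactors = FALSE
)

# Rows whose rating is present but outside the valid 1 to 5 range
out_of_range <- apps[!is.na(apps$Rating) & (apps$Rating < 1 | apps$Rating > 5), ]
print(out_of_range)
```

On the real data, the same expression with df in place of apps would list the suspicious records so that they can be inspected or dropped.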

NA analysis

After getting acquainted with the dataset, we should check it for missing values (NA) and duplicates. Detecting and removing such records helps to build a more accurate model later. First, let’s analyze the missing values. We can review the result as a table:

sapply(df, function(x) sum(is.na(x)))
##            App       Category         Rating        Reviews           Size 
##              0              0           1474              0              0 
##       Installs           Type          Price Content.Rating         Genres 
##              0              1              0              1              0 
##   Last.Updated    Current.Ver    Android.Ver 
##              0              8              3

Or as a chart:

[Chart: Columns with NA values (Rating, Current.Ver, Android.Ver)]

As the table shows, five columns contain missing values, and the Rating column has by far the most (1,474). Let’s remove the rows that contain them.

df = na.omit(df)
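
Note that na.omit() drops every row that has a missing value in any column. If that is too aggressive, rows can be filtered on selected columns only. A sketch on a made-up data frame (the choice of columns is an assumption for illustration):

```r
# Made-up sample with missing values in two columns
apps <- data.frame(
  App         = c("App A", "App B", "App C", "App D"),
  Rating      = c(4.1, NA, 3.9, 4.7),
  Current.Ver = c("1.0", "2.1", NA, "1.2"),
  stringsAsFactors = FALSE
)

# Missing values per column (same idea as the sapply() call above)
na_counts <- colSums(is.na(apps))

# Drop a row if any column is missing
all_complete <- na.omit(apps)

# Drop a row only if Rating is missing
rating_complete <- apps[!is.na(apps$Rating), ]

print(na_counts)
print(nrow(all_complete))     # rows left after na.omit()
print(nrow(rating_complete))  # rows left when only Rating is checked
```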

Duplicate records removal

The next step is to check whether there are duplicates. We can compare the total number of rows with the number of unique rows.

n_unique <- nrow(distinct(df))
nrow(df) - n_unique
## [1] 474

After detecting duplicates, we need to remove them:

df=df[!duplicated(df), ]

With the data cleaned, we can move on to visual analysis.
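
Keep in mind that duplicated() marks a row only when all of its columns match an earlier row, and that it keeps the first occurrence. The summary earlier lists app names that occur several times (ROBLOX, for example), so it may be worth checking for repeated names as well. A sketch on made-up data:

```r
# Made-up sample: rows 1 and 2 are identical; row 3 repeats the name only
apps <- data.frame(
  App     = c("App A", "App A", "App A", "App B"),
  Reviews = c(100, 100, 120, 50),
  stringsAsFactors = FALSE
)

# Exact duplicates across all columns (the first occurrence is kept)
deduped <- apps[!duplicated(apps), ]

# Rows whose App name already appeared, even if other columns differ
repeated_names <- deduped[duplicated(deduped$App), ]

print(nrow(deduped))
print(repeated_names)
```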

Analysis using visualization tools

To start off, we will review the Category column. Let’s examine which categories are the most and the least popular:

df %>%
  # Installs is stored as text such as "10,000+"; strip "+" and "," before summing
  mutate(Installs = as.numeric(gsub("[+,]", "", as.character(Installs)))) %>%
  group_by(Category) %>%
  summarize(
    TotalInstalls = sum(Installs)
  ) %>%
  arrange(-TotalInstalls) %>%
  hchart('scatter', hcaes(x = "Category", y = "TotalInstalls", size = "TotalInstalls", color = "Category")) %>%
  hc_add_theme(hc_theme_538()) %>%
  hc_title(text = "Most popular categories (# of installs)")
[Chart: Most popular categories (# of installs), TotalInstalls by Category]

Here we can see that Game is the most popular category by installs. Interestingly, Education is close to the bottom, and Comics also ranks low. Next, we want to see what percentage of the apps falls into each category. A pie chart is not always the best type of visual, but it is a good option when you need to show shares of a whole. Let’s count the apps in each category and expand our color palette.

# Count apps per category, largest first
freq <- table(df$Category)
fr <- as.data.frame(freq)

fr <- fr %>%
  arrange(desc(Freq))

coul <- brewer.pal(12, "Paired")

# Interpolate the 12 base colors so that every category gets its own tone
coul <- colorRampPalette(coul)(nrow(fr))
op <- par(cex = 0.5)

pielabels <- sprintf("%s = %3.1f%s", fr$Var1,
                     100 * fr$Freq / sum(fr$Freq), "%")
pie(fr$Freq,
    labels = NA,
    clockwise = TRUE,
    col = coul,
    border = "black",
    radius = 0.5,
    cex = 1)
legend("right", legend = pielabels, bty = "n",
       fill = coul)

# Restore the previous graphics settings
par(op)
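
The percentages in the legend come from dividing each category count by the total. The same shares can be computed with prop.table(); here is a sketch with made-up categories:

```r
# Made-up category column
category <- c("FAMILY", "GAME", "FAMILY", "TOOLS", "FAMILY", "GAME")

# Counts per category, largest first
counts <- sort(table(category), decreasing = TRUE)

# Share of each category in percent, rounded to one decimal
shares <- round(100 * prop.table(counts), 1)

# Labels in the same "name = x%" form used for the legend
labels <- sprintf("%s = %.1f%%", names(shares), as.numeric(shares))
print(labels)
```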

We can see that Family is now the leader among the categories. Also, Education has a higher share here than Comics. Now, let’s take a closer look at app prices and review how many free apps are available in the Play Store.

tmp <- df %>%
  count(Type) %>%
  mutate(perc = round((n /sum(n))*100)) %>%
  arrange(desc(perc))

hciconarray(tmp$Type, tmp$perc, size = 5) %>%
  hc_title(text="Percentage of paid vs. free apps")

 

[Chart: Percentage of paid vs. free apps (Free, Paid)]

As you can see, 93% of the apps are free. Let’s see the median price in each category:

df %>%
  filter(Type == "Paid") %>%
  # Price is stored as text such as "$4.99"; strip the "$" before converting
  mutate(Price = as.numeric(gsub("\\$", "", as.character(Price)))) %>%
  group_by(Category) %>%
  summarize(
    Price = median(Price)
  ) %>%
  arrange(-Price) %>%
  hchart('treemap', hcaes(x = 'Category', value = 'Price', color = 'Price')) %>%
  hc_add_theme(hc_theme_elementary()) %>%
  hc_title(text="Median price per category") %>%
  hc_legend(align = "left", verticalAlign = "top",
            layout = "vertical", x = 0, y = 100)
[Chart: Median price per category (treemap)]

This chart is a treemap. In general, it is used to display hierarchical data and to summarize it by two values at once (tile size and color). Here we can see that the Parenting category has the highest median price, while Personalization and Social have the lowest.
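
The median is a reasonable choice for prices because it is less sensitive to a few very expensive apps than the mean. A sketch with made-up prices, using base R aggregate():

```r
# Made-up paid apps; one unusually expensive app in FINANCE
paid <- data.frame(
  Category = c("FINANCE", "FINANCE", "FINANCE", "GAME", "GAME"),
  Price    = c(0.99, 2.99, 399.99, 0.99, 4.99),
  stringsAsFactors = FALSE
)

# Median and mean price per category
med <- aggregate(Price ~ Category, data = paid, FUN = median)
avg <- aggregate(Price ~ Category, data = paid, FUN = mean)

print(med)  # the FINANCE median is the middle value, 2.99
print(avg)  # the FINANCE mean is pulled up by the 399.99 outlier
```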

Now, we will build a correlation heatmap after some data preprocessing.

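df <- df %>%
  mutate(
    # "10,000+" -> 10000
    Installs = gsub("\\+", "", as.character(Installs)),
    Installs = as.numeric(gsub(",", "", Installs)),
    # "19M" -> 19; sizes in kilobytes are set to 0; "Varies with device" becomes NA
    Size = gsub("M", "", Size),
    Size = ifelse(grepl("k", Size), 0, as.numeric(Size)),
    Rating = as.numeric(Rating),
    # Convert via character so that factor level codes are not used as values
    Reviews = as.numeric(as.character(Reviews)),
    # "$4.99" -> 4.99
    Price = as.numeric(gsub("\\$", "", as.character(Price)))
  ) %>%
  filter(
    Type %in% c("Free", "Paid")
  )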
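
The cleaning rules in the block above can also be written as small helper functions, which makes them easier to try out on a few values. This is only a sketch: the function names are hypothetical, and unlike the block above it converts kilobyte sizes to megabytes (dividing by 1024) instead of setting them to 0.

```r
# Hypothetical helpers mirroring the preprocessing rules above

# "10,000+" -> 10000
parse_installs <- function(x) as.numeric(gsub("[+,]", "", as.character(x)))

# "$4.99" -> 4.99, "0" -> 0
parse_price <- function(x) as.numeric(gsub("\\$", "", as.character(x)))

# "19M" -> 19, "512k" -> 0.5, anything else (e.g. "Varies with device") -> NA
parse_size_mb <- function(x) {
  x <- as.character(x)
  num <- suppressWarnings(as.numeric(gsub("[Mk]$", "", x)))
  ifelse(grepl("k$", x), num / 1024, ifelse(grepl("M$", x), num, NA_real_))
}

print(parse_installs(c("10,000+", "5,000,000+")))
print(parse_price(c("0", "$4.99")))
print(parse_size_mb(c("19M", "512k", "Varies with device")))
```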

extract <- c("Rating", "Reviews", "Size", "Installs", "Price")
df.extract <- df[extract]
# Check for rows with NaN reviews and a missing size
df.extract %>% filter(is.nan(Reviews), is.na(Size))
## [1] Rating   Reviews  Size     Installs Price   
## <0 rows> (or 0-length row.names)
df.extract <- na.omit(df.extract)
cor_matrix <- cor(df.extract)
corrplot(cor_matrix, method = "color", order = "AOE", addCoef.col = "grey")

The heatmap shows no strong relation between most pairs of columns. Next, let’s look at the number of installs by content rating.

tmp <- df %>%
  group_by(Content.Rating) %>%
  summarize(Total.Installs = sum(Installs)) %>%
  arrange(-Total.Installs)

highchart() %>%
  hc_chart(type = "funnel") %>%
  hc_add_series_labels_values(
    labels = tmp$Content.Rating, values = tmp$Total.Installs
    ) %>%
  hc_title(
    text="Number of Installs by Content Rating"
    ) %>%
  hc_add_theme(hc_theme_elementary())
[Chart: Number of Installs by Content Rating (funnel): Everyone, Teen, Everyone 10+, Mature 17+, Adults only 18+, Unrated]

As you might have guessed, apps rated Everyone account for the most installs, followed by apps rated Teen.

You may have noticed the hc_add_theme() line in the code. It applies a theme to the chart. Highcharter ships with an extensive list of themes to choose from.

One of the most popular chart types is the time series, which we will explore last. We will also convert the date column to a date type using the lubridate package.

# Get number of apps by last updated date
tmp <- df %>% count(Last.Updated)
# Transform date column type from text to date
tmp$Last.Updated<-mdy(tmp$Last.Updated)
# Transform data into time series
time_series <- xts(
  tmp$n, order.by = tmp$Last.Updated
)

highchart(type = "stock") %>% 
  hc_title(text = "Last updated date") %>% 
  hc_subtitle(text = "Number of applications by date of last update") %>% 
  hc_add_series(time_series) %>%
  hc_add_theme(hc_theme_economist())
[Chart: Last updated date, number of applications by date of last update (May 21, 2010 to Aug 8, 2018)]

Such a visualization is very convenient, as it offers zoom options, a range slider, date filtering, and hover tooltips. Using this chart, we can see that the number of apps grows toward the most recent last-update dates; in other words, most apps were updated recently.
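
Daily counts can be noisy, so it may help to aggregate them by month before plotting. A base R sketch with made-up dates (already parsed to the Date type, as mdy() would return them):

```r
# Made-up last-update dates
last_updated <- as.Date(c("2018-07-30", "2018-07-31", "2018-08-01",
                          "2018-08-02", "2018-08-03", "2017-03-26"))

# Number of apps per calendar month ("YYYY-MM")
per_month <- table(format(last_updated, "%Y-%m"))
print(per_month)
```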

 

Conclusion

To sum up, exploratory data analysis is a powerful tool for a comprehensive analysis of a dataset. In general, we can divide EDA into the following stages: data overview, NA analysis, duplicate records analysis, and data exploration. Starting with a review of the data structure, columns, and contents, we move on to assessing and preparing the data for further analysis. Finally, visual data exploration helps to find dependencies, distributions, and more.