Exploratory data analysis (EDA) is an approach to data analysis that summarizes the main characteristics of a dataset. It can be performed with various methods, among which data visualization plays a major role. The idea of EDA is to find out what the data can tell us beyond the formal modeling or hypothesis testing task. In other words, if we have few or no a priori ideas about the patterns and relationships within the data, exploratory data analysis helps us identify its main tendencies and properties. Based on this information, the researcher can evaluate the structure and nature of the available data, which makes it easier to formulate questions and define the purpose of further exploration. So, EDA is a crucial step before feature engineering and can involve part of data preprocessing.
In this tutorial, we will show you how to perform a simple EDA using the Google Play Store Apps data set. To begin with, let’s load all the libraries we will need (install any that are missing with install.packages() first).
# Suppress warnings
options(warn = -1)
# Load libraries (library() stops with an error if a package is missing,
# whereas require() only returns FALSE)
library(ggplot2)
library(highcharter)
library(dplyr)
library(tidyverse)
library(corrplot)
library(RColorBrewer)
library(xts)
library(treemap)
library(lubridate)

The Play Store apps data can give developers insight into the Android market. Each row of the dataset describes one app: its category, rating, size, and other characteristics. The dataset has 13 columns: App, Category, Rating, Reviews, Size, Installs, Type, Price, Content.Rating, Genres, Last.Updated, Current.Ver, and Android.Ver.
Now, let’s load the data and view the first rows. For that, we use the head() function:
df<-read.csv("googleplaystore.csv",na.strings = c("NaN","NA",""))
head(df)
## App Category
## 1 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN
## 2 Coloring book moana ART_AND_DESIGN
## 3 U Launcher Lite – FREE Live Cool Themes, Hide Apps ART_AND_DESIGN
## 4 Sketch - Draw & Paint ART_AND_DESIGN
## 5 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN
## 6 Paper flowers instructions ART_AND_DESIGN
## Rating Reviews Size Installs Type Price Content.Rating
## 1 4.1 159 19M 10,000+ Free 0 Everyone
## 2 3.9 967 14M 500,000+ Free 0 Everyone
## 3 4.7 87510 8.7M 5,000,000+ Free 0 Everyone
## 4 4.5 215644 25M 50,000,000+ Free 0 Teen
## 5 4.3 967 2.8M 100,000+ Free 0 Everyone
## 6 4.4 167 5.6M 50,000+ Free 0 Everyone
## Genres Last.Updated Current.Ver
## 1 Art & Design January 7, 2018 1.0.0
## 2 Art & Design;Pretend Play January 15, 2018 2.0.0
## 3 Art & Design August 1, 2018 1.2.4
## 4 Art & Design June 8, 2018 Varies with device
## 5 Art & Design;Creativity June 20, 2018 1.1
## 6 Art & Design March 26, 2017 1.0
## Android.Ver
## 1 4.0.3 and up
## 2 4.0.3 and up
## 3 4.0.3 and up
## 4 4.2 and up
## 5 4.4 and up
## 6 2.3 and up
It’s useful to see the data format before performing the analysis. We can also review the type of each column using the str() function:
str(df)
## 'data.frame': 10841 obs. of 13 variables:
## $ App : Factor w/ 9660 levels "- Free Comics - Comic Apps",..: 7229 2563 8998 8113 7294 7125 8171 5589 4948 5826 ...
## $ Category : Factor w/ 34 levels "1.9","ART_AND_DESIGN",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
## $ Reviews : Factor w/ 6002 levels "0","1","10","100",..: 1183 5924 5681 1947 5924 1310 1464 3385 816 485 ...
## $ Size : Factor w/ 462 levels "1,000+","1.0M",..: 55 30 368 102 64 222 55 118 146 120 ...
## $ Installs : Factor w/ 22 levels "0","0+","1,000,000,000+",..: 8 20 13 16 11 17 17 4 4 8 ...
## $ Type : Factor w/ 3 levels "0","Free","Paid": 2 2 2 2 2 2 2 2 2 2 ...
## $ Price : Factor w/ 93 levels "$0.99","$1.00",..: 92 92 92 92 92 92 92 92 92 92 ...
## $ Content.Rating: Factor w/ 6 levels "Adults only 18+",..: 2 2 2 5 2 2 2 2 2 2 ...
## $ Genres : Factor w/ 120 levels "Action","Action;Action & Adventure",..: 10 13 10 10 12 10 10 10 10 12 ...
## $ Last.Updated : Factor w/ 1378 levels "1.0.19","April 1, 2016",..: 562 482 117 825 757 901 76 726 1317 670 ...
## $ Current.Ver : Factor w/ 2832 levels "0.0.0.2","0.0.1",..: 120 1019 465 2825 278 114 278 2392 1456 1430 ...
## $ Android.Ver : Factor w/ 33 levels "1.0 and up","1.5 and up",..: 16 16 16 19 21 9 16 19 11 16 ...
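The str() output above shows that columns such as Reviews, Installs, and Price were read as factors. Two points are worth keeping in mind here. First, since R 4.0.0, read.csv() defaults to stringsAsFactors = FALSE, so reproducing this output on a recent R version would require passing stringsAsFactors = TRUE. Second, calling as.numeric() directly on a factor returns the internal level codes rather than the numbers printed in the labels. The minimal sketch below illustrates the second point on a small made-up vector; the variable names are illustrative and the values only mimic the format of the Installs column.

```r
# Toy factor in the same format as the Installs column (values are made up)
installs <- factor(c("10,000+", "500,000+", "10,000+"))

# Pitfall: as.numeric() on a factor returns the level codes, not the values
codes <- as.numeric(installs)

# Safer: convert to character, strip "+" and ",", then convert to numeric
values <- as.numeric(gsub("[+,]", "", as.character(installs)))

print(codes)
print(values)
```

The same pattern applies to any column whose numbers are wrapped in text, such as Price with its leading "$".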
As you can see, we got similar information as with the head() function, but here the focus is on data types rather than content. Now, we will use the summary() function, which for a data frame reports descriptive statistics for numeric columns and the most frequent values for factor columns:
summary(df)
## App
## ROBLOX : 9
## CBS Sports App - Scores, News, Stats & Watch Live: 8
## 8 Ball Pool : 7
## Candy Crush Saga : 7
## Duolingo: Learn Languages Free : 7
## ESPN : 7
## (Other) :10796
## Category Rating Reviews
## FAMILY :1972 Min. : 1.000 0 : 596
## GAME :1144 1st Qu.: 4.000 1 : 272
## TOOLS : 843 Median : 4.300 2 : 214
## MEDICAL : 463 Mean : 4.193 3 : 175
## BUSINESS : 460 3rd Qu.: 4.500 4 : 137
## PRODUCTIVITY: 424 Max. :19.000 5 : 108
## (Other) :5535 NA's :1474 (Other):9339
## Size Installs Type Price
## Varies with device:1695 1,000,000+ :1579 0 : 1 0 :10040
## 11M : 198 10,000,000+:1252 Free:10039 $0.99 : 148
## 12M : 196 100,000+ :1169 Paid: 800 $2.99 : 129
## 14M : 194 10,000+ :1054 NA's: 1 $1.99 : 73
## 13M : 191 1,000+ : 907 $4.99 : 72
## 15M : 184 5,000,000+ : 752 $3.99 : 63
## (Other) :8183 (Other) :4128 (Other): 316
## Content.Rating Genres Last.Updated
## Adults only 18+: 3 Tools : 842 August 3, 2018: 326
## Everyone :8714 Entertainment: 623 August 2, 2018: 304
## Everyone 10+ : 414 Education : 549 July 31, 2018 : 294
## Mature 17+ : 499 Medical : 463 August 1, 2018: 285
## Teen :1208 Business : 460 July 30, 2018 : 211
## Unrated : 2 Productivity : 424 July 25, 2018 : 164
## NA's : 1 (Other) :7480 (Other) :9257
## Current.Ver Android.Ver
## Varies with device:1459 4.1 and up :2451
## 1.0 : 809 4.0.3 and up :1501
## 1.1 : 264 4.0 and up :1375
## 1.2 : 178 Varies with device:1362
## 2.0 : 151 4.4 and up : 980
## (Other) :7972 (Other) :3169
## NA's : 8 NA's : 3
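The summary above already hints at data quality problems: the maximum Rating is 19, although Play Store ratings do not exceed 5. After getting acquainted with the dataset, we should therefore check it for missing values (NA) and duplicates. Detecting and removing such records helps to build a model with better accuracy. First, let’s analyze missing values. We can review the result as a table: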
sapply(df,function(x)sum(is.na(x)))
## App Category Rating Reviews Size
## 0 0 1474 0 0
## Installs Type Price Content.Rating Genres
## 0 1 0 1 0
## Last.Updated Current.Ver Android.Ver
## 0 8 3
The same counts can also be visualized as a bar chart with one bar per column.
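A minimal sketch of such a chart with ggplot2 follows. It uses a small made-up data frame named toy instead of df so that it runs on its own; the names toy, na_counts, column, and missing are illustrative choices, not part of the original tutorial.

```r
library(ggplot2)

# Small made-up data frame with a few missing values
toy <- data.frame(
  App    = c("a", "b", "c", "d"),
  Rating = c(4.1, NA, 3.9, NA),
  Type   = c("Free", "Paid", NA, "Free")
)

# Count missing values per column, as in the sapply() call above
na_counts <- data.frame(
  column  = names(toy),
  missing = sapply(toy, function(x) sum(is.na(x)))
)

# One bar per column, ordered from most to fewest missing values
p <- ggplot(na_counts, aes(x = reorder(column, -missing), y = missing)) +
  geom_col() +
  labs(x = "Column", y = "Number of missing values")
print(p)
```

Replacing toy with df would chart the counts for the Play Store data.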
As you can see, five columns contain missing values (Rating, Type, Content.Rating, Current.Ver, and Android.Ver), and the Rating column has by far the most. Let’s remove the rows that contain them.
df = na.omit(df)
The next step is to check whether there are duplicates. We can compare the total number of rows with the number of unique rows.
unique_rows <- nrow(df %>%
  distinct())
nrow(df) - unique_rows
## [1] 474
After detecting duplicates, we need to remove them:
df=df[!duplicated(df), ]
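The duplicate check and removal can be illustrated on a small made-up data frame. Note that duplicated() and distinct() only flag rows that are identical in every column, so rows for the same app that differ in, say, the number of reviews are kept. Checking duplicates on the App column alone, as in the last step below, is an optional extra and an assumption about what counts as a duplicate, not something this tutorial does.

```r
# Made-up data: the third row repeats the first one exactly,
# the fifth row has the same app name but a different review count
toy <- data.frame(
  App     = c("A", "B", "A", "C", "A"),
  Reviews = c(10, 20, 10, 30, 15)
)

# Number of rows that are identical to an earlier row in every column
n_exact <- sum(duplicated(toy))

# Keep the first occurrence of each fully identical row
deduped <- toy[!duplicated(toy), ]

# Optional: apps that still appear more than once by name
n_same_name <- sum(duplicated(deduped$App))

print(n_exact)
print(nrow(deduped))
print(n_same_name)
```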
Now that the data is cleaned, we can begin the visual analysis.
To start off, we will review the Category column. Let’s examine which categories are the most and the least popular:
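df %>%
  # Installs is stored as text such as "10,000+": strip "+" and "," before
  # converting, because as.numeric() on a factor returns level codes
  mutate(InstallsNum = as.numeric(gsub("[+,]", "", as.character(Installs)))) %>%
  group_by(Category) %>%
  summarize(
    TotalInstalls = sum(InstallsNum)
  ) %>%
  arrange(-TotalInstalls) %>%
  hchart('scatter', hcaes(x = "Category", y = "TotalInstalls", size = "TotalInstalls", color = "Category")) %>%
  hc_add_theme(hc_theme_538()) %>%
  hc_title(text = "Most popular categories (# of installs)")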
Here we can see that Game is the most popular category by installs. Interestingly, Education is among the least popular, and Comics is also near the bottom. Now, we want to see the percentage of apps in each category. Pie charts are often criticized as a visual, but when the goal is to show shares of a whole, they are a reasonable option. Let’s count the apps in each category and expand our color palette.
freq <- table(df$Category)
fr <- as.data.frame(freq)
fr <- fr %>%
  filter(Freq > 0) %>%  # drop factor levels left empty after cleaning
  arrange(desc(Freq))
coul <- brewer.pal(12, "Paired")
# Interpolate the palette so that every category gets its own color
coul <- colorRampPalette(coul)(nrow(fr))
op <- par(cex = 0.5)
pielabels <- sprintf("%s = %3.1f%s", fr$Var1,
                     100 * fr$Freq / sum(fr$Freq), "%")
pie(fr$Freq,
    labels = NA,
    clockwise = TRUE,
    col = coul,
    border = "black",
    radius = 0.5,
    cex = 1)
legend("right", legend = pielabels, bty = "n",
       fill = coul)
par(op)  # restore the previous graphics settings
We can see that Family is now the leader among the categories. Also, Education has a higher share here than Comics. Now, let’s take a closer look at the prices of the apps and review how many free apps are available in the Play Store.
tmp <- df %>%
count(Type) %>%
mutate(perc = round((n /sum(n))*100)) %>%
arrange(desc(perc))
hciconarray(tmp$Type, tmp$perc, size = 5) %>%
hc_title(text="Percentage of paid vs. free apps")
df %>%
  filter(Type == "Paid") %>%
  # Price is stored as text such as "$0.99": strip "$" before converting
  mutate(Price = as.numeric(gsub("\\$", "", as.character(Price)))) %>%
  group_by(Category) %>%
  summarize(
    Price = median(Price)
  ) %>%
  arrange(-Price) %>%
  hchart('treemap', hcaes(x = 'Category', value = 'Price', color = 'Price')) %>%
  hc_add_theme(hc_theme_elementary()) %>%
  hc_title(text = "Median price per category") %>%
  hc_legend(align = "left", verticalAlign = "top",
            layout = "vertical", x = 0, y = 100)
This chart is a treemap. In general, a treemap displays hierarchical data and can summarize two values at once, through tile size and color. Here we can see that the Parenting category has the highest median price, while Personalization and Social have the lowest.
Now, we will build a correlation heatmap after some data preprocessing.
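df <- df %>%
  mutate(
    # Convert through as.character() first: as.numeric() on a factor returns level codes
    Installs = gsub("\\+", "", as.character(Installs)),
    Installs = as.numeric(gsub(",", "", Installs)),
    Size = gsub("M", "", as.character(Size)),
    # Sizes given in kilobytes are approximated as 0; "Varies with device" becomes NA
    Size = ifelse(grepl("k", Size), 0, as.numeric(Size)),
    Rating = as.numeric(Rating),
    Reviews = as.numeric(as.character(Reviews)),
    Price = as.numeric(gsub("\\$", "", as.character(Price)))
  ) %>%
  filter(
    Type %in% c("Free", "Paid")
  )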
extract = c("Rating","Reviews","Size","Installs","Price")
df.extract = df[extract]
# Look for rows where Reviews is NaN and Size is missing
df.extract %>% filter(is.nan(Reviews), is.na(Size))
## [1] Rating Reviews Size Installs Price
## <0 rows> (or 0-length row.names)
df.extract = na.omit(df.extract)
cor_matrix = cor(df.extract)
corrplot(cor_matrix,method = "color",order = "AOE",addCoef.col = "grey")
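The preprocessing step above maps every size given in kilobytes to 0. If that loss of detail matters, sizes could be converted to megabytes instead. The helper below, parse_size_mb, is a hypothetical function written for this illustration only. It assumes sizes look like "19M", "512k", or "Varies with device", and that 1 MB corresponds to 1024 kB.

```r
# Hypothetical helper: convert size strings such as "19M" or "512k" to megabytes
parse_size_mb <- function(size) {
  size <- as.character(size)
  # Remove a trailing "M" or "k"; text that is still not a number becomes NA
  number <- suppressWarnings(as.numeric(sub("[Mk]$", "", size)))
  # Values given in kilobytes are divided by 1024 (assumed conversion factor)
  ifelse(grepl("k$", size), number / 1024, number)
}

sizes <- c("19M", "8.7M", "512k", "Varies with device")
size_mb <- parse_size_mb(sizes)
print(size_mb)
```

With this helper, the two Size lines in the mutate() call could be replaced by Size = parse_size_mb(Size), keeping small apps distinguishable from each other.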
The heatmap does not show a strong relationship between these columns. Next, let’s look at the number of installs by content rating.
tmp <- df %>%
group_by(Content.Rating) %>%
summarize(Total.Installs = sum(Installs)) %>%
arrange(-Total.Installs)
highchart() %>%
hc_chart(type = "funnel") %>%
hc_add_series_labels_values(
labels = tmp$Content.Rating, values = tmp$Total.Installs
) %>%
hc_title(
text="Number of Installs by Content Rating"
) %>%
hc_add_theme(hc_theme_elementary())
As the funnel shows, apps with the Teen content rating account for a noticeable share of installs.
You may have noticed the hc_add_theme() call in the code. It applies a theme to the chart. highcharter ships with an extensive list of themes, such as hc_theme_538(), hc_theme_elementary(), and hc_theme_economist(), which are used in this tutorial.
One of the most popular chart types is the time series, which we will explore last. We will also convert the date column to a date type using the lubridate package.
# Get number of apps by last updated date
tmp <- df %>% count(Last.Updated)
# Transform date column type from text to date
tmp$Last.Updated<-mdy(tmp$Last.Updated)
# Transform data into time series
time_series <- xts(
tmp$n, order.by = tmp$Last.Updated
)
highchart(type = "stock") %>%
hc_title(text = "Last updated date") %>%
hc_subtitle(text = "Number of applications by date of last update") %>%
hc_add_series(time_series) %>%
hc_add_theme(hc_theme_economist())
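Daily counts can be noisy, so it may help to aggregate updates by month before plotting. The sketch below does this on three dates copied from the head() output shown earlier; floor_date() from lubridate rounds each date down to the first day of its month. It assumes an English locale for parsing month names, and the variable names are illustrative.

```r
library(lubridate)

# Three Last.Updated values in the format used by the dataset
dates <- mdy(c("January 7, 2018", "January 15, 2018", "August 1, 2018"))

# Round each date down to the first day of its month
month_start <- floor_date(dates, unit = "month")

# Count apps per month
monthly <- as.data.frame(table(month = as.character(month_start)))
print(monthly)
```

The resulting monthly counts could then be passed to xts() in the same way as the daily counts above.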
To sum up, exploratory data analysis is a powerful tool for a comprehensive analysis of a dataset. In general, we can divide EDA into the following stages: data overview, missing value analysis, duplicate record analysis, and data exploration. So, starting with a review of the data structure, columns, and contents, we move on to assessing and preparing the data for further analysis. Finally, visual data exploration helps to find dependencies, distributions, and more.