A guide to Exploratory Data Analysis in Python

What is Exploratory Data Analysis

Exploratory data analysis (EDA) is an approach to studying the available data comprehensively in order to answer basic questions about it.

What distinguishes it from traditional analysis, which tests a priori hypotheses, is that EDA uses a variety of methods to surface potential systematic relationships in the data. It is open-ended in both time and technique, which makes it possible to spot unusual data fragments and correlations. As a result, you can examine the data more deeply and accurately and choose an appropriate model for further work.

In the Python ecosystem, a wide range of libraries simplify and streamline the exploration of a dataset. We will use the Google Play Store Apps dataset and walk through the main tasks of exploratory analysis to find out whether there are trends that can help in framing and solving a business problem.

Data overview

Before we start exploring the data, we must import the dataset and the Python libraries needed for further work. We will use the pandas library, a powerful tool for data analysis.

In [1]:
import pandas as pd
In [2]:
googleplaystore = pd.read_csv("googleplaystore.csv")

Let's explore the structure of our dataframe by viewing the first and the last 10 rows.

In [3]:
googleplaystore.head(10)
  | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver
0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up
1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up
2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up
3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up
4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up
5 | Paper flowers instructions | ART_AND_DESIGN | 4.4 | 167 | 5.6M | 50,000+ | Free | 0 | Everyone | Art & Design | March 26, 2017 | 1.0 | 2.3 and up
6 | Smoke Effect Photo Maker - Smoke Editor | ART_AND_DESIGN | 3.8 | 178 | 19M | 50,000+ | Free | 0 | Everyone | Art & Design | April 26, 2018 | 1.1 | 4.0.3 and up
7 | Infinite Painter | ART_AND_DESIGN | 4.1 | 36815 | 29M | 1,000,000+ | Free | 0 | Everyone | Art & Design | June 14, 2018 |  | 4.2 and up
8 | Garden Coloring Book | ART_AND_DESIGN | 4.4 | 13791 | 33M | 1,000,000+ | Free | 0 | Everyone | Art & Design | September 20, 2017 | 2.9.2 | 3.0 and up
9 | Kids Paint Free - Drawing Fun | ART_AND_DESIGN | 4.7 | 121 | 3.1M | 10,000+ | Free | 0 | Everyone | Art & Design;Creativity | July 3, 2018 | 2.8 | 4.0.3 and up
In [4]:
googleplaystore.tail(10)
      | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver
10831 |  | MAPS_AND_NAVIGATION | NaN | 38 | 9.8M | 5,000+ | Free | 0 | Everyone | Maps & Navigation | June 13, 2018 |  | 4.0 and up
10832 | FR Tides | WEATHER | 3.8 | 1195 | 582k | 100,000+ | Free | 0 | Everyone | Weather | February 16, 2014 | 6.0 | 2.1 and up
10833 | Chemin (fr) | BOOKS_AND_REFERENCE | 4.8 | 44 | 619k | 1,000+ | Free | 0 | Everyone | Books & Reference | March 23, 2014 | 0.8 | 2.2 and up
10834 | FR Calculator | FAMILY | 4.0 | 7 | 2.6M | 500+ | Free | 0 | Everyone | Education | June 18, 2017 | 1.0.0 | 4.1 and up
10835 | FR Forms | BUSINESS | NaN | 0 | 9.6M | 10+ | Free | 0 | Everyone | Business | September 29, 2016 | 1.1.5 | 4.0 and up
10836 | Sya9a Maroc - FR | FAMILY | 4.5 | 38 | 53M | 5,000+ | Free | 0 | Everyone | Education | July 25, 2017 | 1.48 | 4.1 and up
10837 | Fr. Mike Schmitz Audio Teachings | FAMILY | 5.0 | 4 | 3.6M | 100+ | Free | 0 | Everyone | Education | July 6, 2018 | 1.0 | 4.1 and up
10838 | Parkinson Exercices FR | MEDICAL | NaN | 3 | 9.5M | 1,000+ | Free | 0 | Everyone | Medical | January 20, 2017 | 1.0 | 2.2 and up
10839 | The SCP Foundation DB fr nn5n | BOOKS_AND_REFERENCE | 4.5 | 114 | Varies with device | 1,000+ | Free | 0 | Mature 17+ | Books & Reference | January 19, 2015 | Varies with device | Varies with device
10840 | iHoroscope - 2018 Daily Horoscope & Astrology | LIFESTYLE | 4.5 | 398307 | 19M | 10,000,000+ | Free | 0 | Everyone | Lifestyle | July 25, 2018 | Varies with device | Varies with device

We can see that the googleplaystore dataframe has missing values. To get a fuller picture of the data, let's do a few more things. First, we will use the pandas describe() method to get a statistical summary of the numerical columns in the dataset. We can also use the info() method to check the data type and the number of non-null values in each column, and the shape attribute (it is an attribute, not a method, so it takes no parentheses) to retrieve the number of rows and columns in the dataframe.

In [5]:
googleplaystore.describe()
      | Rating
count | 9367.000000
mean  | 4.193338
std   | 0.537431
min   | 1.000000
25%   | 4.000000
50%   | 4.300000
75%   | 4.500000
max   | 19.000000
In [6]:
googleplaystore.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
App               10841 non-null object
Category          10841 non-null object
Rating            9367 non-null float64
Reviews           10841 non-null object
Size              10841 non-null object
Installs          10841 non-null object
Type              10840 non-null object
Price             10841 non-null object
Content Rating    10840 non-null object
Genres            10841 non-null object
Last Updated      10841 non-null object
Current Ver       10833 non-null object
Android Ver       10838 non-null object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB
In [7]:
googleplaystore.shape
(10841, 13)
In [8]:
googleplaystore.dtypes
App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

So, what have we learned from these few steps? First, the dataset contains apps divided into various categories. Second, although columns such as "Reviews" contain numeric data, they are stored with a non-numeric type (object), which can cause problems during further processing.

We are also interested in the total number of apps and available categories in the dataset. To get the exact number of apps, we will find all the unique values in the corresponding column.

In [9]:
In [10]:
unique_categories = googleplaystore["Category"].unique()
In [11]:
       '1.9'], dtype=object)
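
As a rough illustration of the counting step described above, here is a minimal sketch on a tiny made-up frame. The name sample and its rows are assumptions for demonstration only and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for googleplaystore (values are made up)
sample = pd.DataFrame({
    "App": ["A", "B", "B", "C"],
    "Category": ["TOOLS", "GAME", "GAME", "TOOLS"],
})

n_apps = sample["App"].nunique()           # number of distinct app names
categories = sample["Category"].unique()   # distinct categories, in order of appearance
n_categories = len(categories)

print(n_apps, n_categories, list(categories))
```

nunique() returns the count directly, while unique() returns the values themselves, which is handy when we also want to inspect them for unexpected entries.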

Duplicate records removal

Datasets often contain duplicate records, which can degrade the quality and accuracy of the exploration. Such records also clutter the dataset, so we need to get rid of them.

In [14]:
googleplaystore.drop_duplicates(keep='first', inplace = True)
In [15]:
googleplaystore.shape
(10358, 13)

To remove duplicate rows from a dataset, pandas provides the customizable drop_duplicates() method, whose parameters control how the cleaning is done. "keep='first'" means that, for each group of identical rows, the first occurrence is kept and the remaining duplicates are dropped. "inplace = True" means that the changes are applied directly to the dataframe we are currently using rather than returned as a new copy.

As we can see above, our initial googleplaystore dataset contained 10841 rows. After removing duplicates, the number of rows decreased to 10358.
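
To make the effect of the keep parameter more concrete, here is a small sketch that compares its three accepted values on a made-up frame. The rows are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows: the first two are exact duplicates of each other
sample = pd.DataFrame({
    "App": ["A", "A", "B"],
    "Rating": [4.1, 4.1, 3.9],
})

first = sample.drop_duplicates(keep="first")  # keeps the first copy of each duplicate group
last = sample.drop_duplicates(keep="last")    # keeps the last copy of each duplicate group
none = sample.drop_duplicates(keep=False)     # drops every row that has a duplicate

print(list(first.index), list(last.index), list(none.index))
```

Without inplace=True, each call returns a new dataframe and leaves sample untouched, which makes it easy to compare the variants side by side.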

NA analysis

Another common problem in almost every dataset is columns with missing values. We will explore only the most common ways to clean missing values from a dataset.

First, let's look at the total number of missing values in every column of the dataset. One of the great things about pandas is that it lets users chain several operations into a single expression, which keeps the code compact.

In [14]:
Rating            1465
Current Ver          8
Android Ver          3
Content Rating       1
Type                 1
Last Updated         0
Genres               0
Price                0
Installs             0
Size                 0
Reviews              0
Category             0
App                  0
dtype: int64
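
One way to produce a per-column summary like the one above is to chain isnull(), sum() and sort_values(). The sketch below is an assumption about how such a chain could look, not necessarily the exact cell used here, and it runs on made-up data.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps in two columns
sample = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, np.nan, np.nan],
    "Type": ["Free", None, "Paid"],
})

# isnull() flags missing cells, sum() counts them per column,
# sort_values() puts the columns with the most gaps first
missing = sample.isnull().sum().sort_values(ascending=False)
print(missing)
```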

Now, let's get rid of all the rows with missing values. Although some statistical approaches allow us to impute missing data (for example, with the most common value or the mean), today we will work only with complete rows.

The pandas dropna() method also lets users set parameters that control how the data is processed. Here we specify that every row containing any NA value must be dropped and that the changes are applied directly to our dataframe.
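
For readers curious about the imputation alternative mentioned above, here is a minimal sketch on made-up data. It fills a numeric column with its mean and a text column with its most common value; this is only an illustration and is not applied to our dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows with one gap per column
sample = pd.DataFrame({
    "Rating": [4.0, np.nan, 5.0],
    "Type": ["Free", "Free", None],
})

imputed = sample.copy()
# Numeric column: replace gaps with the mean of the observed values
imputed["Rating"] = imputed["Rating"].fillna(imputed["Rating"].mean())
# Text column: replace gaps with the most frequent value (the mode)
imputed["Type"] = imputed["Type"].fillna(imputed["Type"].mode()[0])

print(imputed)
```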

In [16]:
googleplaystore.dropna(how ='any', inplace = True)
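
The sketch below, on made-up data, shows how the how and subset parameters of dropna() change which rows survive. It is meant only to illustrate the options, under the assumption that we want to compare them before committing to one.

```python
import numpy as np
import pandas as pd

# Made-up rows: row 0 is complete, row 1 lacks a rating, row 2 is entirely empty
sample = pd.DataFrame({
    "Rating": [4.0, np.nan, np.nan],
    "Type": ["Free", "Paid", None],
})

any_dropped = sample.dropna(how="any")            # drop rows with at least one gap
all_dropped = sample.dropna(how="all")            # drop rows only if every value is missing
subset_dropped = sample.dropna(subset=["Type"])   # look for gaps in selected columns only

print(len(any_dropped), len(all_dropped), len(subset_dropped))
```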

Let's now check the shape of the dataframe after all cleaning manipulations were performed.

In [17]:
googleplaystore.shape
(8886, 13)

If we look closer at the dataset and the dtypes output, we see that columns such as "Reviews", "Size", "Price" and "Installs" should definitely hold numeric values. So, let's see what values these columns contain in order to plan our next steps.

In [18]:
googleplaystore["Price"].unique()
array(['0', '$4.99', '$3.99', '$6.99', '$7.99', '$5.99', '$2.99', '$3.49',
       '$1.99', '$9.99', '$7.49', '$0.99', '$9.00', '$5.49', '$10.00',
       '$24.99', '$11.99', '$79.99', '$16.99', '$14.99', '$29.99',
       '$12.99', '$2.49', '$10.99', '$1.50', '$19.99', '$15.99', '$33.99',
       '$39.99', '$3.95', '$4.49', '$1.70', '$8.99', '$1.49', '$3.88',
       '$399.99', '$17.99', '$400.00', '$3.02', '$1.76', '$4.84', '$4.77',
       '$1.61', '$2.50', '$1.59', '$6.49', '$1.29', '$299.99', '$379.99',
       '$37.99', '$18.99', '$389.99', '$8.49', '$1.75', '$14.00', '$2.00',
       '$3.08', '$2.59', '$19.40', '$3.90', '$4.59', '$15.46', '$3.04',
       '$13.99', '$4.29', '$3.28', '$4.60', '$1.00', '$2.95', '$2.90',
       '$1.97', '$2.56', '$1.20'], dtype=object)
In [19]:
googleplaystore["Installs"].unique()
array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
       '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
       '1,000,000,000+', '1,000+', '500,000,000+', '100+', '500+', '10+',
       '5+', '50+', '1+'], dtype=object)
In [20]:
googleplaystore["Size"].unique()
array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
       '28M', '12M', '20M', '21M', '37M', '5.5M', '17M', '39M', '31M',
       '4.2M', '23M', '6.0M', '6.1M', '4.6M', '9.2M', '5.2M', '11M',
       '24M', 'Varies with device', '9.4M', '15M', '10M', '1.2M', '26M',
       '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k', '3.6M', '5.7M',
       '8.6M', '2.4M', '27M', '2.7M', '2.5M', '7.0M', '16M', '3.4M',
       '8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
       '2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
       '7.1M', '22M', '6.4M', '3.2M', '8.2M', '4.9M', '9.5M', '5.0M',
       '5.9M', '13M', '73M', '6.8M', '3.5M', '4.0M', '2.3M', '2.1M',
       '42M', '9.1M', '55M', '23k', '7.3M', '6.5M', '1.5M', '7.5M', '51M',
       '41M', '48M', '8.5M', '46M', '8.3M', '4.3M', '4.7M', '3.3M', '40M',
       '7.8M', '8.8M', '6.6M', '5.1M', '61M', '66M', '79k', '8.4M',
       '3.7M', '118k', '44M', '695k', '1.6M', '6.2M', '53M', '1.4M',
       '3.0M', '7.2M', '5.8M', '3.8M', '9.6M', '45M', '63M', '49M', '77M',
       '4.4M', '70M', '9.3M', '8.1M', '36M', '6.9M', '7.4M', '84M', '97M',
       '2.0M', '1.9M', '1.8M', '5.3M', '47M', '556k', '526k', '76M',
       '7.6M', '59M', '9.7M', '78M', '72M', '43M', '7.7M', '6.3M', '334k',
       '93M', '65M', '79M', '100M', '58M', '50M', '68M', '64M', '34M',
       '67M', '60M', '94M', '9.9M', '232k', '99M', '624k', '95M', '8.5k',
       '41k', '292k', '80M', '1.7M', '10.0M', '74M', '62M', '69M', '75M',
       '98M', '85M', '82M', '96M', '87M', '71M', '86M', '91M', '81M',
       '92M', '83M', '88M', '704k', '862k', '899k', '378k', '4.8M',
       '266k', '375k', '1.3M', '975k', '980k', '4.1M', '89M', '696k',
       '544k', '525k', '920k', '779k', '853k', '720k', '713k', '772k',
       '318k', '58k', '241k', '196k', '857k', '51k', '953k', '865k',
       '251k', '930k', '540k', '313k', '746k', '203k', '26k', '314k',
       '239k', '371k', '220k', '730k', '756k', '91k', '293k', '17k',
       '74k', '14k', '317k', '78k', '924k', '818k', '81k', '939k', '169k',
       '45k', '965k', '90M', '545k', '61k', '283k', '655k', '714k', '93k',
       '872k', '121k', '322k', '976k', '206k', '954k', '444k', '717k',
       '210k', '609k', '308k', '306k', '175k', '350k', '383k', '454k',
       '1.0M', '70k', '812k', '442k', '842k', '417k', '412k', '459k',
       '478k', '335k', '782k', '721k', '430k', '429k', '192k', '460k',
       '728k', '496k', '816k', '414k', '506k', '887k', '613k', '778k',
       '683k', '592k', '186k', '840k', '647k', '373k', '437k', '598k',
       '716k', '585k', '982k', '219k', '55k', '323k', '691k', '511k',
       '951k', '963k', '25k', '554k', '351k', '27k', '82k', '208k',
       '551k', '29k', '103k', '116k', '153k', '209k', '499k', '173k',
       '597k', '809k', '122k', '411k', '400k', '801k', '787k', '50k',
       '643k', '986k', '516k', '837k', '780k', '20k', '498k', '600k',
       '656k', '221k', '228k', '176k', '34k', '259k', '164k', '458k',
       '629k', '28k', '288k', '775k', '785k', '636k', '916k', '994k',
       '309k', '485k', '914k', '903k', '608k', '500k', '54k', '562k',
       '847k', '948k', '811k', '270k', '48k', '523k', '784k', '280k',
       '24k', '892k', '154k', '18k', '33k', '860k', '364k', '387k',
       '626k', '161k', '879k', '39k', '170k', '141k', '160k', '144k',
       '143k', '190k', '376k', '193k', '473k', '246k', '73k', '253k',
       '957k', '420k', '72k', '404k', '470k', '226k', '240k', '89k',
       '234k', '257k', '861k', '467k', '676k', '552k', '582k', '619k'],

First of all, let's remove the dollar sign from the "Price" column and convert the values to a numeric type.

In [21]:
googleplaystore['Price'] = googleplaystore['Price'].apply(lambda x: x.replace('$', '') if '$' in str(x) else x)
googleplaystore['Price'] = googleplaystore['Price'].apply(lambda x: float(x))

Next, we will work with the "Installs" column. We must remove the plus sign and the thousands separators, then convert the values to a numeric type.

In [22]:
googleplaystore['Installs'] = googleplaystore['Installs'].apply(lambda x: x.replace('+', '') if '+' in str(x) else x)
googleplaystore['Installs'] = googleplaystore['Installs'].apply(lambda x: x.replace(',', '') if ',' in str(x) else x)
googleplaystore['Installs'] = googleplaystore['Installs'].apply(lambda x: int(x))
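
The "Price" and "Installs" clean-up steps above can also be written with vectorized string methods instead of apply(). The following is a sketch on made-up values, offered as an alternative style rather than a replacement for the cells above.

```python
import pandas as pd

# Made-up values in the same textual format as the dataset columns
sample = pd.DataFrame({
    "Price": ["0", "$4.99", "$399.99"],
    "Installs": ["10,000+", "1,000,000+", "5+"],
})

# regex=False treats "$" as a literal character rather than a regex anchor
price = sample["Price"].str.replace("$", "", regex=False).astype(float)

# A character class removes both "+" and "," in a single pass
installs = sample["Installs"].str.replace(r"[+,]", "", regex=True).astype(int)

print(price.tolist(), installs.tolist())
```

The .str accessor works on the whole column at once, so no per-value lambda or membership check is needed.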

We also convert the "Reviews" column to a numeric type.

In [23]:
googleplaystore['Reviews'] = googleplaystore['Reviews'].apply(lambda x: int(x))
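
Note that int(x) raises a ValueError as soon as it meets a value that is not a plain integer string. If we are unsure whether a column is clean, pd.to_numeric with errors="coerce" can help locate the offending entries first. The malformed value in this sketch is made up for illustration.

```python
import pandas as pd

# Made-up review counts; the last entry is deliberately malformed
reviews = pd.Series(["159", "967", "3.0M"])

# Unparseable values become NaN instead of raising an exception
numeric = pd.to_numeric(reviews, errors="coerce")

# Use the NaN positions to look up the original problematic strings
bad = reviews[numeric.isna()]
print(bad.tolist())
```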

Finally, let's work with the "Size" column, which needs a more complex approach. This column contains mixed data: alongside numeric values given in either megabytes (M) or kilobytes (k), there are strings such as "Varies with device" that carry no numeric value. We also need to reconcile the values written in megabytes with those written in kilobytes.

In [24]:
# Mark "Varies with device" as missing so that it later converts to NaN
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: str(x).replace('Varies with device', 'NaN') if 'Varies with device' in str(x) else x)
# Megabytes: drop the "M" suffix
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: str(x).replace('M', '') if 'M' in str(x) else x)
# Remove thousands separators, if any
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: str(x).replace(',', '') if ',' in str(x) else x)
# Kilobytes: drop the "k" suffix and convert to megabytes
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: float(str(x).replace('k', '')) / 1000 if 'k' in str(x) else x)
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: float(x))
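
The same conversion can be expressed as a single helper function, which some readers may find easier to follow and test. The function name size_to_mb and the sample values are hypothetical. Unlike the cell above, this sketch returns NaN for anything that does not end in "M" or "k".

```python
import numpy as np
import pandas as pd

def size_to_mb(value):
    """Convert strings such as '19M' or '582k' to megabytes; return NaN otherwise."""
    text = str(value).replace(",", "").strip()
    if text.endswith("M"):
        return float(text[:-1])
    if text.endswith("k"):
        return float(text[:-1]) / 1000  # same divisor of 1000 as in the cell above
    return np.nan

# Made-up values in the same format as the "Size" column
sizes = pd.Series(["19M", "582k", "Varies with device", "8.5k"])
size_mb = sizes.apply(size_to_mb)
print(size_mb.tolist())
```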

Let's call the describe() method one more time. As we can see, we now have a statistical summary for all the columns that contain numeric values.

In [25]:
googleplaystore.describe()
      | Rating | Reviews | Size | Installs | Price
count | 8886.000000 | 8.886000e+03 | 7418.000000 | 8.886000e+03 | 8886.000000
mean  | 4.187959 | 4.730928e+05 | 22.760829 | 1.650061e+07 | 0.963526
std   | 0.522428 | 2.906007e+06 | 23.439210 | 8.640413e+07 | 16.194792
min   | 1.000000 | 1.000000e+00 | 0.008500 | 1.000000e+00 | 0.000000
25%   | 4.000000 | 1.640000e+02 | 5.100000 | 1.000000e+04 | 0.000000
50%   | 4.300000 | 4.723000e+03 | 14.000000 | 5.000000e+05 | 0.000000
75%   | 4.500000 | 7.131325e+04 | 33.000000 | 5.000000e+06 | 0.000000
max   | 5.000000 | 7.815831e+07 | 100.000000 | 1.000000e+09 | 400.000000

Building visualizations

Visualization is probably one of the most useful approaches in data analysis. Correlations and dependencies are not always visible in tabular data, and plots and diagrams can help depict them clearly.

Let's go through the different ways we can explore categories.

Exploring which categories have the largest number of apps

One of the most eye-catching ways to visualize such data is a word cloud. With a few lines of code, we can create an illustration that shows which categories have the most apps.
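
Before drawing anything, it can help to look at the raw counts that such an illustration is built from. The sketch below computes them with value_counts() on made-up categories; the frame and its contents are assumptions for demonstration only.

```python
import pandas as pd

# Made-up category labels standing in for the "Category" column
sample = pd.DataFrame({
    "Category": ["FAMILY", "GAME", "FAMILY", "TOOLS", "FAMILY", "GAME"],
})

# value_counts() returns the categories sorted by the number of rows in each
frequencies = sample["Category"].value_counts().to_dict()
print(frequencies)
```

A word-to-count mapping like this is the kind of input that the WordCloud.generate_from_frequencies() method of the wordcloud package accepts.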

In [30]:
import matplotlib.pyplot as plt
import wordcloud
from wordcloud import WordCloud
import seaborn as sns
color = sns.color_palette()

%matplotlib inline
In [33]:
from plotly import tools
from plotly.offline import iplot, init_notebook_mode
from IPython.display import Image
import plotly.offline as py
import plotly.graph_objs as go
import plotly.io as pio
import numpy as np