Data Analytics with Python: How I Analysed 550,000+ Rows of Data with Python

Data Analytics is becoming more important by the day. Companies and their executives need insights to make decisions, and those insights only come from analysing data. The Data and/or Business Intelligence Analyst in any organisation is faced with a wide variety of data, which he or she is expected to analyse effectively. A plethora of tools can be used to achieve this. In my case, I used Python.

I was confronted with a dataset of 12 columns and 600,000 rows. The columns covered entities such as location, device peculiarities, gender and so on. This document discusses the project under the following categories:

  1. Pre-processing
  2. Processing
  3. Visualization

Pre-Processing

The first step in this project was to import pandas (an open-source Python library) with the following syntax:

import pandas as pd

The convention is to use the alias pd for pandas, just to make it shorter and faster to type. The next step was to read the dataset into a pandas DataFrame. That was possible with:

df=pd.read_csv(r"the file path of the dataset")

I needed to confirm that the specified file was what had actually been imported into pandas. To do this, I used the following:

df.head(n)  # n is the number of rows to preview, usually between 5 and 10
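
Putting these first steps together, a minimal sketch might look like this (the file path shown is a placeholder, not the actual dataset):

import pandas as pd

df = pd.read_csv(r"C:\data\my_dataset.csv")  # placeholder path: point this at the real file
print(df.head(5))   # preview the first 5 rows
print(df.shape)     # confirm the number of rows and columns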

Having an intermediate knowledge of Python, I kept automation at the back of my mind while working with the dataset. So I decided to define a function for every operation I needed to perform on this particular dataset. The first major task in the pre-processing stage was to fill the missing values, recorded as NaN. For categorical variables, the convention is to fill the NaN(s) with the value that occurs the most. For numerical variables, the mean works just fine. The function defined to achieve this goes thus:

def fill_var(dataset):
    cat_var = dataset.select_dtypes(exclude=['number']).columns.tolist()
    num_var = dataset.select_dtypes(include=['number']).columns.tolist()
    for column in cat_var:
        # fill categorical NaNs with the most frequent value
        dataset[column] = dataset[column].fillna(dataset[column].value_counts().index[0])
    for column in num_var:
        # fill numerical NaNs with the column mean
        dataset[column] = dataset[column].fillna(dataset[column].mean())
    return dataset
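
As a quick sanity check (a hedged sketch, assuming df is the DataFrame read in earlier), the fill can be verified by counting whatever missing values remain:

df = fill_var(df)       # fill the missing values and reassign
print(df.isna().sum())  # every column should now report 0 missing values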

The dataset had a mixture of numerical and categorical variables. I needed to separate them to make my work easier as I progressed. This was achieved with:

def get_cat_var(dataset):
    # return only the categorical (non-numeric) columns
    cat_var = dataset.select_dtypes(exclude=['number'])
    return cat_var

def get_num_var(dataset):
    # return only the numerical columns
    num_var = dataset.select_dtypes(include=['number'])
    return num_var
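
For illustration, a quick sketch of how these helpers might be called (df is the DataFrame from earlier):

cat_df = get_cat_var(df)          # categorical columns only
num_df = get_num_var(df)          # numerical columns only
print(cat_df.columns.tolist())    # confirm which columns fell into each group
print(num_df.columns.tolist())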

The dataset I worked with had some occurrences of a certain value capitalized while others were not. For example, “lagos” written as “Lagos”. I discovered this by defining another function to get unique values.

def unique_values_catcher(dataset):
    for column in dataset:
        result=dataset[column].unique()
        print(result)

An alternative way I could have done this is:

def unique_values_catcher(dataset):
    unique=[]
    for column in dataset:
        unique.append(dataset[column].unique())
    return unique

In the case of the example cited above, to replace “Lagos” with “lagos”, I simply did this:

dataset['location_state'].replace(['Lagos'], ['lagos'], inplace=True)
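
Another option worth noting (a hedged sketch, assuming every value in the column should end up lowercase) is to normalize the case of the whole column at once rather than replacing values one by one:

dataset['location_state'] = dataset['location_state'].str.lower()  # 'Lagos' and 'lagos' both become 'lagos'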

Processing

With NaN(s) and other issues taken care of, I moved on to the processing stage. In this second stage, I needed to get the frequencies of categorical variables, normalize numerical variables and count unique values across multiple columns. To generate the frequencies of unique values, I simply did this:

def count_unique_values(dataset):
    cat_var = get_cat_var(dataset)
    for column in cat_var:
        # frequency of each unique value in the column
        result = dataset[column].value_counts()
        print(result)

Another way I could have done this is with the sidetable module. In that case, I would have to import sidetable:

import sidetable  # importing sidetable registers the .stb accessor on DataFrames

def unique_value_counter(dataset):
    for column in dataset:
        result = dataset.stb.freq([column])
        print(result)

Using sidetable will also give cumulative percentages and so on. The next thing I did in this project was to normalize the numerical variables. There are a number of ways to do this. Fortunately, I knew the formula for normalization as:

x_norm = a + ((x - A)(b - a)) / (B - A)

Where:
a = 0
b = 1
x = the value to be normalized
A = the minimum value of the column
B = the maximum value of the column

With a = 0 and b = 1, this reduces to (x - A) / (B - A).
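
To make the formula concrete with a quick worked example (the numbers are illustrative, not from the dataset): if a column has a minimum of 10 and a maximum of 60, the value 35 normalizes to (35 - 10) / (60 - 10) = 0.5.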

To do this in python with a function, I did the following:

def normalize_num_var(dataset):
    norm_num_var = get_num_var(dataset)
    for column in norm_num_var:
        # min-max normalization: (x - min) / (max - min)
        result = (dataset[column] - dataset[column].min()) / (
            dataset[column].max() - dataset[column].min())
        print(result)
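
The function above only prints the normalized series. If the normalized values should actually replace the originals, a variant along these lines (a hedged sketch with a hypothetical name, not the version used in the project) would write them back into the DataFrame:

def normalize_num_var_inplace(dataset):
    num_var = get_num_var(dataset)
    for column in num_var:
        col_min = dataset[column].min()
        col_max = dataset[column].max()
        # overwrite each numerical column with its min-max normalized values
        dataset[column] = (dataset[column] - col_min) / (col_max - col_min)
    return dataset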

The next thing I did after normalization was to count unique values across two columns with the use of a primary column. To give this more context, here is an example. If the first column in a dataset is state, with values like Lagos, Kaduna and so on, and other columns hold values like operating system and gender, I would want to know how many men are from Lagos, how many women are from Kaduna, which operating systems are prominent in Lagos and so on. To achieve this, I defined the following:

def corr_counts(dataset, primary_column):
    cat_att = get_cat_var(dataset)
    for column in cat_att:
        # count the values in each categorical column, broken down by the primary column
        result = dataset.groupby(primary_column)[column].value_counts()
        print(result)
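
A hedged usage sketch (the column name 'location_state' comes from the earlier example; df is the DataFrame from earlier):

corr_counts(df, 'location_state')  # e.g. gender counts and operating system counts per state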

Visualization

The last section dealt with visualizations. I also defined a function to achieve this. For this project, I used seaborn (you may choose to use matplotlib.pyplot):

import seaborn as sns
import matplotlib.pyplot as plt

def chart_plotter(dataset):
    cat_att = get_cat_var(dataset)
    for column in cat_att:
        counts = dataset[column].value_counts()
        # bar chart of each unique value against how often it occurs
        sns.barplot(x=counts.index, y=counts.values)
        plt.show()  # display each chart before moving on to the next column
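
Putting the pieces together, the whole flow from this post could be run in order (a hedged sketch; df and the helper functions are the ones defined above):

df = fill_var(df)                   # pre-processing: fill missing values
count_unique_values(df)             # processing: frequencies of categorical variables
normalize_num_var(df)               # processing: min-max normalization
corr_counts(df, 'location_state')   # processing: counts broken down by state
chart_plotter(df)                   # visualization: one bar chart per categorical column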

The analysis resulted in insights that helped the organisation make better decisions.

Get a glimpse of this project here: github.com/King-Ogunnowo/Data-Analytics-wit..

For any questions you might have, please shoot me a mail at:

Thank you for reading