In any Data Science or Business Analytics project, exploring the data is an important step. It helps us dive deep into the data to obtain numerical summaries and uncover trends that would otherwise stay hidden. Most of the time, a thorough analysis of the data is sufficient to facilitate decision making, so it is imperative to know the basic tools for analysing data and interpreting results. With the high adoption of Python as a preferred Data Science tool, we will look at a small case study about a product-based company and see how Python can be used to answer the questions that come up in a typical data exploration exercise.

The data considered here consists of Quantity sold, Revenue and Gross Margin information for various sports products sold in different countries. Other features are given in the form of product information, order method and type of retailer. The data is spread across 4 quarters from 2012 to 2014.

To fully understand the data, the following questions will be explored:

**Task 0**: How many missing/NA/Null values are present in the data

**Task 1**: Which order method is most widely used

**Task 2**: Which product has the maximum units sold on an overall basis

**Task 3**: Which country has the maximum value of Gross Margin per unit Quantity

**Task 4**: Which product under the 'Web' order method has the highest units sold for 2012

**Task 5**: Is this also true for 2013 and 2014

**Task 6**: Which product has seen the maximum decrease in Revenue from 2012 to 2014

**Task 7**: In the Product line column, Personal Accessories has seen a decline in Gross Margin

- Identify which product has contributed most to this

- Identify which country has contributed most to this

**Task 8**: Identify top products for each country with respect to Revenue from 2012 to 2014

**Download the file**: https://docs.google.com/spreadsheets/d/1aRD2-xZrxcARSHZAi-93fWQBuhr6D8TOgkBaT_umkvg/edit#gid=165274715

# importing pandas

# setting up directory

import pandas as pd

dr="Your computer path where file is saved"

# Reading the csv file into the console. I am using **Spyder** to analyse the data; the analysis can also be carried out in Anaconda

df=pd.read_csv(dr+"\\Gross_Margin_Data.csv")

# Total rows and columns

df.shape # Has 88475 rows and 11 columns

df.columns

# Inspecting first few records

df.head()

**Task 0**: How many missing/NA/Null values are present in the data

df.isna().sum()

Figure 1: Output of df.isna().sum() showing missing-value counts per column

# There are 581 NA entries in the Gross margin column

# These should be replaced by the mean of the Gross margin column

# so that they don't affect the results of any other analysis

df['Gross margin']=df['Gross margin'].fillna(df['Gross margin'].mean())

df.isna().sum()# Now there are no NA values in Gross Margin column

Figure 2: Missing-value counts after imputation, now zero for the Gross margin column
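**Task 1** and **Task 2** can be answered with quick aggregations on the cleaned data. A minimal sketch, assuming the column names shown above:

# Task 1: count how often each order method appears

df['Order method type'].value_counts()

# Task 2: total quantity sold per product, highest first

df.groupby('Product')['Quantity'].sum().sort_values(ascending=False).head()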

**Task 3**: Which country has the max value of Gross Margin per unit Quantity

# Here we need to aggregate the data at a country level

df1=df.groupby('Retailer country')[['Gross margin','Quantity']].sum()

df1.columns # Only Gross Margin and Quantity are present as columns

# We need to reset the index of df1 so that Retailer Country also becomes a column

df1=df1.reset_index()

# Now creating a column of Gross Margin/Quantity

df1['Unit_Measure']=df1['Gross margin']/df1['Quantity']
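To see which country comes out on top, the new column can be sorted. A minimal sketch:

df1.sort_values('Unit_Measure',ascending=False).head() # highest Gross Margin per unit Quantity first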

# We can see that Denmark has the highest value of Gross Margin/Unit Quantity sold

**Task 4**: Which product under the 'Web' order method has the highest units sold for 2012

Before we start on this, we first need to know which is the most preferred method of ordering products.

df['Order method type'].value_counts() # Web is the most preferred method

# Let's see which product ordered via Web has the most units sold
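A minimal sketch of that aggregation, filtering on the Web method and summing Quantity per product:

df[df['Order method type']=='Web'].groupby('Product')['Quantity'].sum().sort_values(ascending=False).head(5)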

# The product named Zone is the one ordered most through the Web medium

# Let's verify if this is true for the year 2012 as well

df['Year'].head()

df[(df['Order method type']=='Web') & (df['Year']==2012)].groupby(['Year','Product'])['Quantity'].sum().sort_values(ascending=False).head(5)

# For 2012 as well, the product 'Zone' has the most units sold

**Task 5**: Does the product named Zone also have the most sales under the Web order method in 2013 and 2014?

df2=df[df['Order method type']=='Web'].groupby(['Year','Product'])['Quantity'].sum()

df3=df2.reset_index()

df3=df3.sort_values(['Year','Quantity'],ascending=[True,False])

df3.groupby(['Year']).head(2) # Gives the top 2 products sold for each year under Web Method

# As we can see, for 2012 and 2013 Zone has the most sales, but for 2014

# Granite Carabiner has the most sales

# So Zone doesn't have the most sales for Web orders throughout 2012 to 2014

**Task 6**: Which product has seen the max decrease in Revenue from 2012 to 2014

df4=df.groupby(['Product','Year'])['Revenue'].sum().reset_index()

df4.shape

# We need to remove the rows corresponding to 2013 as only 2012 and 2014 are required

t_val=[2012,2014]

df5=df4[df4['Year'].isin(t_val)]

df5.shape

# Now let's pivot df5 based on the 'Year' column

p = df5.pivot(index='Product', columns='Year', values='Revenue')

p.tail(16)# There are some NaN in p

Figure 9: Pivoted revenue table p, showing NaN entries for some products

# We see that there are certain NA/Null values in p

# These are Product/Year combinations that have no entries in df5

# These entries have to be replaced with 0

p[2012]=p[2012].fillna(0)

p[2014]=p[2014].fillna(0)

# Checking to see if there are still any NA/Null values in p

p.isna().sum()

# Creating another column called Perc_Change

p['Perc_Change']=(p[2014]-p[2012])/p[2012]

p.head()

# Sorting p to get the highest percentage decrease
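A minimal sketch of that sort; the largest decrease corresponds to the most negative Perc_Change:

p.sort_values('Perc_Change').head() # most negative change first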

# So Sun Blocker has seen the maximum decrease in Revenue from 2012 to 2014

**Task 7**: In the Product line column, Personal Accessories has seen a decline in Gross Margin

# Task 7a: Identify which product has contributed most to this

# Task 7b: Identify which country has contributed most to this

**Task 7a**: Identify which product has contributed most to this

# Let's first check what the overall percentage decrease from 2013 to 2014 is

df7=df[(df['Product line']=='Personal Accessories') & (df['Year'].isin([2013,2014]))]

df7.groupby(['Year'])['Gross margin'].sum()
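The percentage change itself can be computed from that grouped result. A small sketch, assuming the Year values are integers:

gm=df7.groupby('Year')['Gross margin'].sum()

(gm[2014]-gm[2013])/gm[2013] # relative change from 2013 to 2014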

# We can see that there is an approx 35% dip in the Gross margin

# The decline in Gross Margin could be because of the following:

# the top products with the max Gross Margin in 2013 saw a dip in 2014

# Let's check this point

df8=df7.groupby(['Product','Year'])['Gross margin'].sum().reset_index()

df9=df8.pivot(index='Product',columns='Year',values='Gross margin').sort_values(2013,ascending=False)

df10=df9.fillna(0)

# Let's create a new column indicating the percentage decrease

df10['Perc_change']=(df10[2013]-df10[2014])/df10[2013]
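df10 is already sorted by the 2013 Gross margin, so a quick look at the top rows shows the biggest 2013 contributors and their change:

df10.head(5)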

# This gives the products that had the max Gross margin in 2013

# but saw a significant dip (the top 3 products had a dip of more than 50%)

# Let's also check how this compares with 2012

df11=df[(df['Product line']=='Personal Accessories')]
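A minimal sketch of the year-level comparison across all three years:

df11.groupby('Year')['Gross margin'].sum()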

# The Gross margins of 2012 and 2013 are comparable, so there has been a really

# significant drop in Gross margin in 2014

**Task 7b**: Identify which country has contributed most to this

df12=df7.groupby(['Retailer country','Year'])['Gross margin'].sum().reset_index()

df13=df12.pivot(index='Retailer country',columns='Year',values='Gross margin').sort_values(2013,ascending=False)

df14=df13.fillna(0)

# Let's create a new column indicating the percentage decrease

df14['Perc_change']=(df14[2013]-df14[2014])/df14[2013]

df14.head()

# The top 5 countries have all seen a dip in Gross Margin

# Now we are in a position to comment on the products as well as the countries

# that have contributed to the decrease in Gross Margin

**Task 8**: Identify top products for each country with respect to Revenue from 2012 to 2014

df15=df.groupby(['Retailer country','Product','Year'])['Revenue'].sum().reset_index()

df16=df15.sort_values(['Retailer country','Year','Revenue'],ascending=False)

df17=df16.groupby(['Retailer country','Year']).head(2)

df17.head(20)


Figure 18: First 20 rows of df17, the top products by Revenue for each country and year

# df17 can now be exported to a csv file to prepare reports
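A minimal export sketch; the output file name is just a hypothetical example:

df17.to_csv(dr+"\\Top_Products_By_Country.csv",index=False) # hypothetical file name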
