Data Wrangling in Python¶

This is a notebook by Ogechi Anoliefo.

1. Importing the library used¶

In [1]:
#in this case, I needed only pandas
import pandas as pd                 

2. Importing the datasets and merging them¶

In [2]:
#importing the sales dataset for 2018
data1 = pd.read_csv("C:/Users/ETIABA CHAMBER'S/Documents/Data_Analytics/My_portfolio/Online_store_sales/Year_2018.csv", encoding="ISO-8859-1")
data1.head()
Out[2]:
Invoice StockCode Description Quantity InvoiceDate Price Customer ID Country Region
0 493410 TEST001 This is a test product. 5 1/4/2018 9:24 4.50 12346.0 United Kingdom Europe
1 C493411 21539 RETRO SPOTS BUTTER DISH -1 1/4/2018 9:43 4.25 14590.0 United Kingdom Europe
2 493412 TEST001 This is a test product. 5 1/4/2018 9:53 4.50 12346.0 United Kingdom Europe
3 493413 21724 PANDA AND BUNNIES STICKER SHEET 1 1/4/2018 9:54 0.85 NaN United Kingdom Europe
4 493413 84578 ELEPHANT TOY WITH BLUE T-SHIRT 1 1/4/2018 9:54 3.75 NaN United Kingdom Europe
In [3]:
#importing the sales dataset for 2019
data2 = pd.read_csv("C:/Users/ETIABA CHAMBER'S/Documents/Data_Analytics/My_portfolio/Online_store_sales/Year_2019.csv", encoding="ISO-8859-1")
data2.head()
Out[3]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country Region
0 539993 22386 JUMBO BAG PINK POLKADOT 10 1/4/2019 10:00 1.95 13313.0 United Kingdom Europe
1 539993 21499 BLUE POLKADOT WRAP 25 1/4/2019 10:00 0.42 13313.0 United Kingdom Europe
2 539993 21498 RED RETROSPOT WRAP 25 1/4/2019 10:00 0.42 13313.0 United Kingdom Europe
3 539993 22379 RECYCLING BAG RETROSPOT 5 1/4/2019 10:00 2.10 13313.0 United Kingdom Europe
4 539993 20718 RED RETROSPOT SHOPPER BAG 10 1/4/2019 10:00 1.25 13313.0 United Kingdom Europe

Having studied the datasets, I observed that both contained the same columns under slightly different names, as the last two outputs above show. So I renamed the columns in both datasets to a common set of names.

In [4]:
#renaming data1 and data2 columns
data1.columns = ['Invoice_No', 'Stock_Code', 'Description', 'Quantity',
                 'Invoice_Date', 'Unit_Price', 'Customer_ID', 'Country', 'Region']
data2.columns = ['Invoice_No', 'Stock_Code', 'Description', 'Quantity',
                 'Invoice_Date', 'Unit_Price', 'Customer_ID', 'Country', 'Region']
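For reference, an equivalent sketch that maps each original header (taken from the previews above) to its new name explicitly, rather than relying on column order, could look like this:

#sketch: rename by explicit mapping instead of positional assignment
data1 = data1.rename(columns={'Invoice': 'Invoice_No', 'StockCode': 'Stock_Code',
                              'InvoiceDate': 'Invoice_Date', 'Price': 'Unit_Price',
                              'Customer ID': 'Customer_ID'})
data2 = data2.rename(columns={'InvoiceNo': 'Invoice_No', 'StockCode': 'Stock_Code',
                              'InvoiceDate': 'Invoice_Date', 'UnitPrice': 'Unit_Price',
                              'CustomerID': 'Customer_ID'})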
In [5]:
#checking that the column names had been properly renamed 
data1.info()
data2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503277 entries, 0 to 503276
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Invoice_No    503277 non-null  object 
 1   Stock_Code    503277 non-null  object 
 2   Description   500396 non-null  object 
 3   Quantity      503277 non-null  int64  
 4   Invoice_Date  503277 non-null  object 
 5   Unit_Price    503277 non-null  float64
 6   Customer_ID   401010 non-null  float64
 7   Country       503277 non-null  object 
 8   Region        503277 non-null  object 
dtypes: float64(2), int64(1), object(6)
memory usage: 34.6+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 519386 entries, 0 to 519385
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Invoice_No    519386 non-null  object 
 1   Stock_Code    519386 non-null  object 
 2   Description   518039 non-null  object 
 3   Quantity      519386 non-null  int64  
 4   Invoice_Date  519386 non-null  object 
 5   Unit_Price    519386 non-null  float64
 6   Customer_ID   392028 non-null  float64
 7   Country       519386 non-null  object 
 8   Region        519386 non-null  object 
dtypes: float64(2), int64(1), object(6)
memory usage: 35.7+ MB

Then I combined the two datasets into one.

In [6]:
#combining both datasets
df = pd.concat([data1, data2], ignore_index = True)
In [7]:
#checking that the datasets were properly combined
df.info()
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1022663 entries, 0 to 1022662
Data columns (total 9 columns):
 #   Column        Non-Null Count    Dtype  
---  ------        --------------    -----  
 0   Invoice_No    1022663 non-null  object 
 1   Stock_Code    1022663 non-null  object 
 2   Description   1018435 non-null  object 
 3   Quantity      1022663 non-null  int64  
 4   Invoice_Date  1022663 non-null  object 
 5   Unit_Price    1022663 non-null  float64
 6   Customer_ID   793038 non-null   float64
 7   Country       1022663 non-null  object 
 8   Region        1022663 non-null  object 
dtypes: float64(2), int64(1), object(6)
memory usage: 70.2+ MB
Out[7]:
Invoice_No Stock_Code Description Quantity Invoice_Date Unit_Price Customer_ID Country Region
0 493410 TEST001 This is a test product. 5 1/4/2018 9:24 4.50 12346.0 United Kingdom Europe
1 C493411 21539 RETRO SPOTS BUTTER DISH -1 1/4/2018 9:43 4.25 14590.0 United Kingdom Europe
2 493412 TEST001 This is a test product. 5 1/4/2018 9:53 4.50 12346.0 United Kingdom Europe
3 493413 21724 PANDA AND BUNNIES STICKER SHEET 1 1/4/2018 9:54 0.85 NaN United Kingdom Europe
4 493413 84578 ELEPHANT TOY WITH BLUE T-SHIRT 1 1/4/2018 9:54 3.75 NaN United Kingdom Europe

The combined dataset has 1,022,663 entries and 9 columns.
After successfully combining the datasets, it was time for me to clean the data.

3. Data Cleaning¶

~ Changing some column types¶

The first data cleaning step was to convert some columns to more appropriate types. The last output shows that the Customer_ID column is stored as float instead of object (string), and that the Invoice_Date column is stored as object instead of datetime. I converted both.

In [8]:
#changing Customer_ID column to object type
#(converting through nullable Int32 first strips the trailing '.0' from IDs like 12346.0)
df.Customer_ID = df.Customer_ID.astype('Int32').astype(object)
In [9]:
#changing Invoice_Date column to datetime
df['Invoice_Date'] = pd.to_datetime(df['Invoice_Date'], infer_datetime_format=True)
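Note that infer_datetime_format is deprecated in newer pandas releases (2.0+), where it is also no longer needed. A minimal equivalent sketch with an explicit format string, assuming the month-first "1/4/2018 9:24" layout seen in the previews above (the interpretation pandas' default parsing applied here):

#sketch: parse with an explicit format instead of format inference
df['Invoice_Date'] = pd.to_datetime(df['Invoice_Date'], format='%m/%d/%Y %H:%M')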
In [10]:
#checking to see that my columns were properly changed
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1022663 entries, 0 to 1022662
Data columns (total 9 columns):
 #   Column        Non-Null Count    Dtype         
---  ------        --------------    -----         
 0   Invoice_No    1022663 non-null  object        
 1   Stock_Code    1022663 non-null  object        
 2   Description   1018435 non-null  object        
 3   Quantity      1022663 non-null  int64         
 4   Invoice_Date  1022663 non-null  datetime64[ns]
 5   Unit_Price    1022663 non-null  float64       
 6   Customer_ID   793038 non-null   object        
 7   Country       1022663 non-null  object        
 8   Region        1022663 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(6)
memory usage: 70.2+ MB
~ Dropping bad data¶

In the dataset, there were test entries where IT staff made faux purchases to check that the web store was running smoothly. These faux purchases had the stock codes 'TEST001' or 'TEST002'. I counted the number of entries having such stock codes and dropped them.

In [11]:
#to get the number of test entries in the dataset 
df['Stock_Code'].str.contains('TEST', na=False).sum()
Out[11]:
17
In [12]:
#the output above shows there are 17 test entries in the dataset
#drop the test entries
df = df[~df.Stock_Code.str.contains("TEST")]
df.reset_index(drop=True, inplace=True)
df.head()
Out[12]:
Invoice_No Stock_Code Description Quantity Invoice_Date Unit_Price Customer_ID Country Region
0 C493411 21539 RETRO SPOTS BUTTER DISH -1 2018-01-04 09:43:00 4.25 14590 United Kingdom Europe
1 493413 21724 PANDA AND BUNNIES STICKER SHEET 1 2018-01-04 09:54:00 0.85 <NA> United Kingdom Europe
2 493413 84578 ELEPHANT TOY WITH BLUE T-SHIRT 1 2018-01-04 09:54:00 3.75 <NA> United Kingdom Europe
3 493413 21723 ALPHABET HEARTS STICKER SHEET 1 2018-01-04 09:54:00 0.85 <NA> United Kingdom Europe
4 493414 21844 RETRO SPOT MUG 36 2018-01-04 10:28:00 2.55 14590 United Kingdom Europe
In [13]:
#double-checking that all 17 test entries were successfully dropped
df['Stock_Code'].str.contains('TEST', na=False).sum()
Out[13]:
0
In [14]:
#checking that the number of rows left corresponds (should be 1022663 - 17)
df.shape
Out[14]:
(1022646, 9)

The outputs above show that all 17 test entries were successfully dropped.

I also want to point out that the dataset contains entries with negative values in the 'Quantity' column and invoice numbers starting with 'C'. These are cancelled orders. I did not remove them: they represent either orders cancelled before shipment (with the money refunded to the client) or orders that were returned, so the negative quantities cancel out against the entries for the original orders.
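As a quick sanity check (a sketch only, not part of the original workflow), one could confirm that the 'C'-prefixed invoices are indeed the entries carrying negative quantities:

#sketch: inspect the cancellation entries
cancelled = df[df['Invoice_No'].str.startswith('C', na=False)]
print(len(cancelled), (cancelled['Quantity'] < 0).all())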

~ Removing Irrelevant Columns¶

For my analysis I did not need the 'Description' column, as I could refer to the items by their stock codes. So I dropped it.

In [15]:
#dropping the description column and confirming it's been dropped
df.drop('Description', axis=1, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1022646 entries, 0 to 1022645
Data columns (total 8 columns):
 #   Column        Non-Null Count    Dtype         
---  ------        --------------    -----         
 0   Invoice_No    1022646 non-null  object        
 1   Stock_Code    1022646 non-null  object        
 2   Quantity      1022646 non-null  int64         
 3   Invoice_Date  1022646 non-null  datetime64[ns]
 4   Unit_Price    1022646 non-null  float64       
 5   Customer_ID   793022 non-null   object        
 6   Country       1022646 non-null  object        
 7   Region        1022646 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 62.4+ MB
~ Checking for missing values¶
In [16]:
#to get the number of missing values (if any) in each column
df.isnull().sum()
Out[16]:
Invoice_No           0
Stock_Code           0
Quantity             0
Invoice_Date         0
Unit_Price           0
Customer_ID     229624
Country              0
Region               0
dtype: int64

There were 229,624 missing values, all in the 'Customer_ID' column. For the purpose of this analysis, I replaced the missing customer IDs with 'Unknown'.

In [17]:
#replace all 229624 missing values for Customer_ID
df.Customer_ID = df.Customer_ID.fillna('Unknown')
In [18]:
#double-checking that there are no more missing values
df.isnull().sum()
Out[18]:
Invoice_No      0
Stock_Code      0
Quantity        0
Invoice_Date    0
Unit_Price      0
Customer_ID     0
Country         0
Region          0
dtype: int64
~ Checking for, and removing duplicate entries¶

The nature of this dataset was such that every individual column could legitimately contain repeated values; see the head below:

In [19]:
df.head(8)
Out[19]:
Invoice_No Stock_Code Quantity Invoice_Date Unit_Price Customer_ID Country Region
0 C493411 21539 -1 2018-01-04 09:43:00 4.25 14590 United Kingdom Europe
1 493413 21724 1 2018-01-04 09:54:00 0.85 Unknown United Kingdom Europe
2 493413 84578 1 2018-01-04 09:54:00 3.75 Unknown United Kingdom Europe
3 493413 21723 1 2018-01-04 09:54:00 0.85 Unknown United Kingdom Europe
4 493414 21844 36 2018-01-04 10:28:00 2.55 14590 United Kingdom Europe
5 493414 21533 12 2018-01-04 10:28:00 4.25 14590 United Kingdom Europe
6 493414 37508 2 2018-01-04 10:28:00 2.55 14590 United Kingdom Europe
7 493414 35001G 2 2018-01-04 10:28:00 4.25 14590 United Kingdom Europe

However, no combination of Invoice_No and Stock_Code should occur more than once, because the same invoice cannot have more than one entry for the same stock code. If a purchase included several units of a given item, the 'Quantity' value should increase rather than the same stock code appearing again on the same invoice. So to check for duplicate entries, I checked for duplicate combinations of Invoice_No and Stock_Code.

In [20]:
#creating a subset of the dataframe to view the duplicate entries 
df1 = df[df.duplicated(subset = ['Invoice_No', 'Stock_Code'], keep=False)]
df1.head()
Out[20]:
Invoice_No Stock_Code Quantity Invoice_Date Unit_Price Customer_ID Country Region
158 493435 21678 2 2018-01-04 12:57:00 0.85 13206 United Kingdom Europe
171 493435 21678 2 2018-01-04 12:57:00 0.85 13206 United Kingdom Europe
257 493442 20751 1 2018-01-04 13:36:00 2.10 13821 United Kingdom Europe
258 493442 20751 1 2018-01-04 13:36:00 2.10 13821 United Kingdom Europe
267 493442 20712 1 2018-01-04 13:36:00 1.95 13821 United Kingdom Europe
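The flagged rows above appear identical across every column. As a quick check (a sketch, not part of the original workflow), one could confirm that the Invoice_No/Stock_Code duplicates are full-row repeats, which makes keeping only the first occurrence safe:

#sketch: verify that rows flagged on (Invoice_No, Stock_Code) are exact duplicates of another row
flagged = df[df.duplicated(subset=['Invoice_No', 'Stock_Code'], keep=False)]
print(flagged.duplicated(keep=False).all())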
In [21]:
#to get the number of duplicate entries 
df.duplicated(subset = ['Invoice_No', 'Stock_Code']).sum()
Out[21]:
22963
In [22]:
#there were 22963 duplicate entries 
#dropping them
df2 = df.drop_duplicates(subset = ['Invoice_No', 'Stock_Code'], keep = 'first').reset_index(drop = True) 
In [23]:
#checking that the duplicate entries were successfully dropped, i.e. that the number of rows left corresponds: should be 1022646 - 22963
df2.shape
Out[23]:
(999683, 8)

999,683 entries left. The duplicate entries were successfully dropped.

4. Adding new columns¶

After the data cleaning step, I added a few new columns to the dataset to support a better descriptive analysis. I added a calculated column 'Item_Price_Total' holding the total amount for each item line on an invoice, and 'Month' and 'Year' columns extracted from the Invoice_Date column.

In [24]:
#adding Item_Price_Total column
df2['Item_Price_Total'] =  df2['Unit_Price'] * df2['Quantity']

#adding 'Month' column
df2['Month'] = df2['Invoice_Date'].dt.strftime('%b')

#adding 'Year' column
df2['Year'] = df2['Invoice_Date'].dt.year

df2.info()
df2.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999683 entries, 0 to 999682
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   Invoice_No        999683 non-null  object        
 1   Stock_Code        999683 non-null  object        
 2   Quantity          999683 non-null  int64         
 3   Invoice_Date      999683 non-null  datetime64[ns]
 4   Unit_Price        999683 non-null  float64       
 5   Customer_ID       999683 non-null  object        
 6   Country           999683 non-null  object        
 7   Region            999683 non-null  object        
 8   Item_Price_Total  999683 non-null  float64       
 9   Month             999683 non-null  object        
 10  Year              999683 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(2), object(6)
memory usage: 83.9+ MB
Out[24]:
Invoice_No Stock_Code Quantity Invoice_Date Unit_Price Customer_ID Country Region Item_Price_Total Month Year
0 C493411 21539 -1 2018-01-04 09:43:00 4.25 14590 United Kingdom Europe -4.25 Jan 2018
1 493413 21724 1 2018-01-04 09:54:00 0.85 Unknown United Kingdom Europe 0.85 Jan 2018
2 493413 84578 1 2018-01-04 09:54:00 3.75 Unknown United Kingdom Europe 3.75 Jan 2018
3 493413 21723 1 2018-01-04 09:54:00 0.85 Unknown United Kingdom Europe 0.85 Jan 2018
4 493414 21844 36 2018-01-04 10:28:00 2.55 14590 United Kingdom Europe 91.80 Jan 2018

5. Running some summaries for descriptive analysis¶

I ran some initial summaries of the data to get a first set of findings.

In [25]:
#calculating total order amount for each order, i.e. grouping the entries by Invoice_No and summing Item_Price_Total
Total_Order_Amount = df2.groupby('Invoice_No')['Item_Price_Total'].sum()
Total_Order_Amount.head()
Out[25]:
Invoice_No
491148    3882.66
491149     289.92
491150     435.60
491151     661.16
491152     436.80
Name: Item_Price_Total, dtype: float64
In [26]:
#calculating annual customer spend for each customer
Annual_Customer_Spend = df2.groupby(['Year', 'Customer_ID'])['Item_Price_Total'].sum()
Annual_Customer_Spend.head()
Out[26]:
Year  Customer_ID
2018  12346            59.68
      12347          1323.32
      12348           222.16
      12349          2671.14
      12351           300.93
Name: Item_Price_Total, dtype: float64
In [27]:
#calculating annual sales for each country
Annual_Sale_per_Country = df2.groupby(['Country', 'Year'])['Item_Price_Total'].sum()
Annual_Sale_per_Country.head(10)
Out[27]:
Country    Year
Australia  2018     30076.55
           2019    136359.42
Austria    2018     11585.20
           2019     10154.32
Bahrain    2018      2313.15
           2019       548.40
Belgium    2018     22204.93
           2019     40564.86
Bermuda    2018      1253.14
Brazil     2018       266.32
Name: Item_Price_Total, dtype: float64
In [28]:
#calculating annual sales for each region
Annual_Sale_per_Region = df2.groupby(['Region', 'Year'])['Item_Price_Total'].sum()
Annual_Sale_per_Region.head(10)
Out[28]:
Region        Year
Asia-Pacific  2018      50701.650
              2019     186426.990
Europe        2018    8991371.724
              2019    9098321.913
MEA           2018      16198.060
              2019      13175.860
The Americas  2018       7594.260
              2019       6540.900
Unspecified   2018       4937.530
              2019       4697.480
Name: Item_Price_Total, dtype: float64
In [29]:
#calculating total annual sales
Total_Annual_Sales = df2.groupby(['Year'])['Item_Price_Total'].sum()
Total_Annual_Sales
Out[29]:
Year
2018    9070803.224
2019    9309163.143
Name: Item_Price_Total, dtype: float64

6. Exporting my cleaned dataset¶

After I was done with the data wrangling process, I exported my cleaned dataset to CSV.

In [30]:
#write to CSV
df2.to_csv("C:/Users/ETIABA CHAMBER'S/Documents/Data_Analytics/My_portfolio/Online_store_sales_cleaned.csv", index = False)

P.S.: I could have continued with data visualization in Python, but I decided to use Power BI for visualization and further analysis.