Data Wrangling in SQL¶

For this project, I worked with 3 HR employee datasets namely:
i). Personal Details
ii). Department Details
iii). Recruitment and Job Details
Each of the datasets had 310 entries for employees of the company in the region. Using MySQL, my approach was to do some cleaning on the datasets individually, and then combine (join) all 3 datasets into one based on their common columns.

Step 1: Cleaning 'Personal Details' dataset¶

In [2]:
SELECT * 
FROM employee.personal_details 
LIMIT 10
EmployeeName EmployeeNumber Age State Zip DOB Sex MaritalStatus Citizenship Race
0 Brown, Mia 1103024456 30 MA 1450 11/24/1987 Female Married US Citizen Black or African American
1 LaRotonda, William 1106026572 34 MA 1460 4/26/1984 Male Divorced US Citizen Black or African American
2 Steans, Tyrone 1302053333 31 MA 2703 9/1/1986 Male Single US Citizen White
3 Howard, Estelle 1211050782 32 MA 2170 9/16/1985 Female Married US Citizen White
4 Singh, Nan 1307059817 30 MA 2330 5/19/1988 Female Single US Citizen White
5 Smith, Leigh Ann 711007713 30 MA 1844 6/14/1987 Female Married US Citizen Asian
6 LeBlanc, Brandon R 1102024115 33 MA 1460 6/10/1984 Male Married US Citizen White
7 Quinn, Sean 1206043417 33 MA 2045 11/6/1984 Male Married Eligible NonCitizen Black or African American
8 Boutwell, Bonalyn 1307060188 31 MA 2468 4/4/1987 Female Married US Citizen Asian
9 Foster-Baker, Amy 1201031308 39 MA 2050 4/16/1979 Female Married US Citizen White
In [3]:
#ensuring that all the rows of the 'personal details' dataset were successfully loaded into MySQL
SELECT COUNT(EmployeeName) AS Number
FROM employee.personal_details
Number
0 310

All rows were successfully loaded.

~ Checking for duplicate entries¶

In [4]:
SELECT EmployeeName, COUNT(EmployeeName)
FROM employee.personal_details
GROUP BY EmployeeName
HAVING COUNT(EmployeeName) > 1
EmployeeName COUNT(EmployeeName)

No duplicates

~ Removing irrelevant columns¶

Here, I dropped columns that were not going to be relevant in my analysis.

In [5]:
ALTER TABLE employee.personal_details
DROP Zip,
DROP DOB,
DROP Citizenship
In [6]:
#double checking that the columns were properly dropped
SELECT *
FROM employee.personal_details 
LIMIT 3
EmployeeName EmployeeNumber Age State Sex MaritalStatus Race
0 Brown, Mia 1103024456 30 MA Female Married Black or African American
1 LaRotonda, William 1106026572 34 MA Male Divorced Black or African American
2 Steans, Tyrone 1302053333 31 MA Male Single White

The columns were successfully dropped.

~ Handling missing values¶

In [7]:
SELECT *
FROM employee.personal_details
WHERE EmployeeNumber = ''
OR Age = ''
OR State = ''
OR Sex = ''
OR MaritalStatus = ''
OR Race = ''
EmployeeName EmployeeNumber Age State Sex MaritalStatus Race

No missing values.

Step 2: Cleaning 'Department Details' dataset¶

In [8]:
SELECT * 
FROM employee.department_details 
LIMIT 10
EmployeeName EmployeeID Department Position ManagerName ManagerID SpecialProjectsCount
0 Brown, Mia 1103024456 Admin Offices Accountant I Brandon R. LeBlanc 1 6
1 LaRotonda, William 1106026572 Admin Offices Accountant I Brandon R. LeBlanc 1 4
2 Steans, Tyrone 1302053333 Admin Offices Accountant I Brandon R. LeBlanc 1 5
3 Howard, Estelle 1211050782 Admin Offices Administrative Assistant Brandon R. LeBlanc 1 4
4 Singh, Nan 1307059817 Admin Offices Administrative Assistant Brandon R. LeBlanc 1 5
5 Smith, Leigh Ann 711007713 Admin Offices Administrative Assistant Brandon R. LeBlanc 1 4
6 Bunbury, Jessica 1504073368 Sales Area Sales Manager John Smith 17 0
7 Carter, Michelle 1403065721 Sales Area Sales Manager John Smith 17 0
8 Dietrich, Jenna 1408069481 Sales Area Sales Manager John Smith 17 0
9 Digitale, Alfred 1306059197 Sales Area Sales Manager John Smith 17 0
In [9]:
#ensuring that all the rows of the 'department details' dataset were successfully loaded into MySQL
SELECT COUNT(EmployeeName) AS Number
FROM employee.department_details
Number
0 310

~ Checking for duplicate entries¶

In [10]:
SELECT EmployeeName, COUNT(EmployeeName)
FROM employee.department_details
GROUP BY EmployeeName
HAVING COUNT(EmployeeName) > 1
EmployeeName COUNT(EmployeeName)

No duplicates.

~ Removing irrelevant columns¶

In [11]:
ALTER TABLE employee.department_details
DROP ManagerName,
DROP ManagerID,
DROP SpecialProjectsCount
In [12]:
#double checking that the columns were properly dropped
SELECT * 
FROM employee.department_details 
LIMIT 3
EmployeeName EmployeeID Department Position
0 Brown, Mia 1103024456 Admin Offices Accountant I
1 LaRotonda, William 1106026572 Admin Offices Accountant I
2 Steans, Tyrone 1302053333 Admin Offices Accountant I

~ Handling missing values¶

In [13]:
SELECT *
FROM employee.department_details
WHERE EmployeeID = ''
OR Department = ''
EmployeeName EmployeeID Department Position

The output above shows that there are no missing values.

Step 3: Cleaning 'Recruitment and Job Details' dataset¶

In [14]:
SELECT * 
FROM employee.recruitment_and_job_details 
LIMIT 10
EmployeeName EmployeeID RecruitmentSource DateOfHire PayRate EngagementSurvey EmployeeSatisfaction DateOfTermination TerminationReason
0 Brown, Mia 1103024456 Diversity Job Fair 10/27/2008 28.50 2.04 2 N/A - still employed
1 LaRotonda, William 1106026572 Website Banner Ads 1/6/2014 23.00 5.00 4 N/A - still employed
2 Steans, Tyrone 1302053333 Internet Search 9/29/2014 29.00 3.90 5 N/A - still employed
3 Howard, Estelle 1211050782 Pay Per Click - Google 2/16/2015 21.50 3.24 3 4/15/2015
4 Singh, Nan 1307059817 Website Banner Ads 5/1/2015 16.56 5.00 3 N/A - still employed
5 Smith, Leigh Ann 711007713 Diversity Job Fair 9/26/2011 20.50 3.80 4 9/25/2015 career change
6 Bunbury, Jessica 1504073368 Social Networks - Facebook Twitter etc 8/15/2011 55.00 3.14 5 8/2/2015 Another position
7 Carter, Michelle 1403065721 Billboard 8/18/2014 55.00 5.00 5 N/A - still employed
8 Dietrich, Jenna 1408069481 Website Banner Ads 2/20/2012 55.00 2.30 1 N/A - still employed
9 Digitale, Alfred 1306059197 Pay Per Click - Google 8/18/2014 56.00 3.60 5 N/A - still employed
In [15]:
#ensuring that all the entries of the 'recruitment and job details' dataset were sucessfully loaded into MySQL
SELECT COUNT(EmployeeName) AS Number
FROM employee.recruitment_and_job_details
Number
0 310

~ Checking for duplicate entries¶

In [16]:
SELECT EmployeeName, COUNT(EmployeeName)
FROM employee.recruitment_and_job_details
GROUP BY EmployeeName
HAVING COUNT(EmployeeName) > 1
EmployeeName COUNT(EmployeeName)

No duplicate entries.

~Removing irrelevant columns¶

In [17]:
ALTER TABLE employee.recruitment_and_job_details
DROP RecruitmentSource,
DROP EngagementSurvey,
DROP EmployeeSatisfaction
In [18]:
#double checking that the columns were properly dropped
SELECT * 
FROM employee.recruitment_and_job_details 
LIMIT 3
EmployeeName EmployeeID DateOfHire PayRate DateOfTermination TerminationReason
0 Brown, Mia 1103024456 10/27/2008 28.5 N/A - still employed
1 LaRotonda, William 1106026572 1/6/2014 23.0 N/A - still employed
2 Steans, Tyrone 1302053333 9/29/2014 29.0 N/A - still employed

~ Handling missing values¶

In [19]:
SELECT *
FROM employee.recruitment_and_job_details
WHERE EmployeeID = ''
OR DateOfHire = ''
OR PayRate = ''
OR DateOfTermination = ''
OR TerminationReason = '' 
LIMIT 10
EmployeeName EmployeeID DateOfHire PayRate DateOfTermination TerminationReason
0 Brown, Mia 1103024456 10/27/2008 28.50 N/A - still employed
1 LaRotonda, William 1106026572 1/6/2014 23.00 N/A - still employed
2 Steans, Tyrone 1302053333 9/29/2014 29.00 N/A - still employed
3 Howard, Estelle 1211050782 2/16/2015 21.50 4/15/2015
4 Singh, Nan 1307059817 5/1/2015 16.56 N/A - still employed
5 Carter, Michelle 1403065721 8/18/2014 55.00 N/A - still employed
6 Dietrich, Jenna 1408069481 2/20/2012 55.00 N/A - still employed
7 Digitale, Alfred 1306059197 8/18/2014 56.00 N/A - still employed
8 Friedman, Gerry 1204032843 3/7/2011 55.50 N/A - still employed
9 Gonzales, Ricardo 1411071481 5/12/2014 55.50 N/A - still employed

From the above output, it can be seen that there were missing values in the 'DateOfTermination' column. This is expected because the column had no entries for employees who were still working with the company. There were also missing values in 'TerminationReason' column. I replaced the missing values in both columns with "NULL".

In [20]:
#inputing "NULL" for entries with no termination reason
UPDATE employee.recruitment_and_job_details
SET TerminationReason = "NULL"
WHERE TerminationReason = ''
This result object does not return rows. It has been closed automatically.
In [21]:
#inputing "NULL" for entries where there was no termination date (as those employees are still with the company)
UPDATE employee.recruitment_and_job_details
SET DateOfTermination = "NULL"
WHERE DateOfTermination = '' 
This result object does not return rows. It has been closed automatically.
In [22]:
#double-checking that the missing values above were successfully replaced (ie. checking that misisng values no longer exist)
SELECT *
FROM employee.recruitment_and_job_details
WHERE EmployeeID = ''
OR DateOfHire = ''
OR PayRate = ''
OR DateOfTermination = ''
OR TerminationReason = ''
EmployeeName EmployeeID DateOfHire PayRate DateOfTermination TerminationReason

No more missing values.

Step 4: Joining the Datasets¶

After I was done cleaning all 3 datasets individually, I went ahead to join the datasets using common fields.

In [23]:
#joining the 3 datasets
SELECT * FROM employee.personal_details pd
LEFT JOIN employee.department_details dd ON pd.EmployeeName = dd.EmployeeName
                                        AND pd.EmployeeNumber = dd.EmployeeID 
LEFT JOIN employee.recruitment_and_job_details rj ON pd.EmployeeName = rj.EmployeeName  
                                                 AND pd.EmployeeNumber = rj.EmployeeID
LIMIT 6                                                 
EmployeeName EmployeeNumber Age State Sex MaritalStatus Race EmployeeName EmployeeID Department Position EmployeeName EmployeeID DateOfHire PayRate DateOfTermination TerminationReason
0 Brown, Mia 1103024456 30 MA Female Married Black or African American Brown, Mia 1103024456 Admin Offices Accountant I Brown, Mia 1103024456 10/27/2008 28.50 NULL N/A - still employed
1 LaRotonda, William 1106026572 34 MA Male Divorced Black or African American LaRotonda, William 1106026572 Admin Offices Accountant I LaRotonda, William 1106026572 1/6/2014 23.00 NULL N/A - still employed
2 Steans, Tyrone 1302053333 31 MA Male Single White Steans, Tyrone 1302053333 Admin Offices Accountant I Steans, Tyrone 1302053333 9/29/2014 29.00 NULL N/A - still employed
3 Howard, Estelle 1211050782 32 MA Female Married White Howard, Estelle 1211050782 Admin Offices Administrative Assistant Howard, Estelle 1211050782 2/16/2015 21.50 4/15/2015 NULL
4 Singh, Nan 1307059817 30 MA Female Single White Singh, Nan 1307059817 Admin Offices Administrative Assistant Singh, Nan 1307059817 5/1/2015 16.56 NULL N/A - still employed
5 Smith, Leigh Ann 711007713 30 MA Female Married Asian Smith, Leigh Ann 711007713 Admin Offices Administrative Assistant Smith, Leigh Ann 711007713 9/26/2011 20.50 9/25/2015 career change

My join was sucessfully done, and I wanted to save my result in a dataset called "joined_dataset". In order for me to do that, I had to rename the "EmployeeName" and "EmployeeID" columns in the "department_details" and "recruitment_and_job_details" datasets. This is because if I tried to create a table/dataset to save my join result, it would bring up an error of duplicate column name "EmployeeName", and duplicate column name "EmployeeId", since my join result has 3 "EmployeeName" columns and 2 "EmployeeID" columns.

In [24]:
#changing EmployeeName column for department_details dataset
ALTER TABLE employee.department_details
rename column EmployeeName to EmployeeName2
This result object does not return rows. It has been closed automatically.
In [25]:
#double-checking that the EmployeeName column had been changed
SELECT * FROM employee.department_details 
LIMIT 3
EmployeeName2 EmployeeID Department Position
0 Brown, Mia 1103024456 Admin Offices Accountant I
1 LaRotonda, William 1106026572 Admin Offices Accountant I
2 Steans, Tyrone 1302053333 Admin Offices Accountant I
In [26]:
#changing EmployeeName column for recruitment_and_job_details dataset
ALTER TABLE employee.recruitment_and_job_details
rename column EmployeeName to EmployeeName3  
This result object does not return rows. It has been closed automatically.
In [27]:
#changing EmployeeID column for recruitment_and_job_details dataset
ALTER TABLE recruitment_and_job_details
rename column EmployeeID to EmployeeID2    
This result object does not return rows. It has been closed automatically.
In [28]:
#double checking that the recruitment_and_job_details columns had been changed
SELECT * FROM employee.recruitment_and_job_details 
LIMIT 3
EmployeeName3 EmployeeID2 DateOfHire PayRate DateOfTermination TerminationReason
0 Brown, Mia 1103024456 10/27/2008 28.5 NULL N/A - still employed
1 LaRotonda, William 1106026572 1/6/2014 23.0 NULL N/A - still employed
2 Steans, Tyrone 1302053333 9/29/2014 29.0 NULL N/A - still employed
In [29]:
#storing my join result in a dataset called "joined_dataset"
CREATE TABLE joined_dataset
SELECT * FROM employee.personal_details pd
LEFT JOIN employee.department_details dd ON pd.EmployeeName = dd.EmployeeName2
                                        AND pd.EmployeeNumber = dd.EmployeeID 
LEFT JOIN employee.recruitment_and_job_details rj ON pd.EmployeeName = rj.EmployeeName3  
                                                 AND pd.EmployeeNumber = rj.EmployeeID2
This result object does not return rows. It has been closed automatically.
In [30]:
#taking a look at joined_dataset
SELECT *
FROM employee.joined_dataset 
LIMIT 3
EmployeeName EmployeeNumber Age State Sex MaritalStatus Race EmployeeName2 EmployeeID Department Position EmployeeName3 EmployeeID2 DateOfHire PayRate DateOfTermination TerminationReason
0 Brown, Mia 1103024456 30 MA Female Married Black or African American Brown, Mia 1103024456 Admin Offices Accountant I Brown, Mia 1103024456 10/27/2008 28.5 NULL N/A - still employed
1 LaRotonda, William 1106026572 34 MA Male Divorced Black or African American LaRotonda, William 1106026572 Admin Offices Accountant I LaRotonda, William 1106026572 1/6/2014 23.0 NULL N/A - still employed
2 Steans, Tyrone 1302053333 31 MA Male Single White Steans, Tyrone 1302053333 Admin Offices Accountant I Steans, Tyrone 1302053333 9/29/2014 29.0 NULL N/A - still employed

After taking a first glance at joined_dataset, the next thing I did was drop the extra 2 EmployeeName and EmployeeID columns ie. so that there would be only one EmployeeName column and one EmployeeID column.

In [31]:
#dropping the extra EmployeeName and EmployeeID columns
ALTER TABLE employee.joined_dataset
DROP EmployeeName2,
DROP EmployeeName3,
DROP EmployeeNumber,
DROP EmployeeID2
This result object does not return rows. It has been closed automatically.
In [32]:
#double-checking that the columns mentioned above had been dropped
SELECT* 
FROM employee.joined_dataset 
LIMIT 3
EmployeeName Age State Sex MaritalStatus Race EmployeeID Department Position DateOfHire PayRate DateOfTermination TerminationReason
0 Brown, Mia 30 MA Female Married Black or African American 1103024456 Admin Offices Accountant I 10/27/2008 28.5 NULL N/A - still employed
1 LaRotonda, William 34 MA Male Divorced Black or African American 1106026572 Admin Offices Accountant I 1/6/2014 23.0 NULL N/A - still employed
2 Steans, Tyrone 31 MA Male Single White 1302053333 Admin Offices Accountant I 9/29/2014 29.0 NULL N/A - still employed

The columns were successfully dropped.

After I had all my columns intact, the final thing I did was to move the EmployeeID column to come right after the EmployeeName column

In [33]:
#to bring the 'EmployeeID' column to be right after 'EmployeeName'
ALTER TABLE employee.joined_dataset
MODIFY EmployeeID int
AFTER EmployeeName
This result object does not return rows. It has been closed automatically.
In [34]:
#double-checking that the  columns had been properly rearranged
SELECT * 
FROM employee.joined_dataset 
LIMIT 3
EmployeeName EmployeeID Age State Sex MaritalStatus Race Department Position DateOfHire PayRate DateOfTermination TerminationReason
0 Brown, Mia 1103024456 30 MA Female Married Black or African American Admin Offices Accountant I 10/27/2008 28.5 NULL N/A - still employed
1 LaRotonda, William 1106026572 34 MA Male Divorced Black or African American Admin Offices Accountant I 1/6/2014 23.0 NULL N/A - still employed
2 Steans, Tyrone 1302053333 31 MA Male Single White Admin Offices Accountant I 9/29/2014 29.0 NULL N/A - still employed

At this point, I was done with data wrangling, and exported the combined dataset (joined_dataset) to Tableau for further exploration and visualisation.