DATA CLEANSING ASSIGNMENT 1
Data Cleansing: Microsoft’s Northwind Data Set
Ashlee Simmons
Department of Business, Liberty University
BUSI 675: Business Analysis
Dr. Mary Jo Odom-Dull
September 17, 2023
Author Note
Ashlee Simmons
Ashlee Simmons is now at the Department of Business, Liberty University.
I have no known conflict of interest to disclose.
Correspondence concerning this article should be addressed to Ashlee Simmons
Email: [email protected]
DATA CLEANSING ASSIGNMENT 2
Microsoft’s Northwind Database
A data set, which is a collection of data, is a useful way to track and analyze an
abundance of information. Data sets are commonly created by using tables consisting of
columns, variables, and rows for each piece of data relevant to the data set. However, before a
data set is used for informational, experimental, and decision purposes, data cleansing must take
place. The use of data cleansing allows for abnormal and corrupt data to be identified. From
there, the abnormal data can then be removed which will then lead to a more accurate data set
(Martinez-Luengo et al., 2019). This paper will specifically analyze data from Microsoft’s
Northwind database that has been transported into an excel sheet. The data consists of
information regarding the company’s customers, products, product categories, and transactions.
Each transaction examined represents a product purchased by a customer, however, if a customer
purchased multiple products at once, the data reflects multiple corresponding rows in the
transactions table to represent one row for each product purchased. Additionally, there is a
product ID column that allows one to look up names of customers, products, and product
categories. Upon evaluation, it becomes apparent that some of the IDs in the transaction excel
sheet have been corrupted. There can be three reasons for that. The first reason is if an ID in the
transaction excel sheet does not correspond to any customer, product, or product category. The
second reason is if the order IDs do not correspond to only one customer ID. The third and final
reason is if a given product ID does not always correspond to the same product category ID.
Aside from the corrupted IDs, there are also shipping dates that are blank. Shipping dates can be
blank if the orders have not yet been shipped, however, they should not be blank before the
corresponding order dates (Albright & Winston, 2019). With that being said, the goal of this
DATA CLEANSING ASSIGNMENT 3
paper is to discover the corrupt information within the transaction excel sheet and to provide
images of the data that is corrupted.
Sorting the Data
Order ID
For one to properly handle large datasets, one must develop a data management strategy
(Hosseinzadeh et al., 2023). Data cleansing is a great way to manage datasets, and test for
missing values and inconsistencies. This paper will specifically utilize Excel as a management
strategy to data cleanse the dataset. The first step in the data cleaning process is to use the sort
and filter function within Excel. This function can be used on each column, making it easier to
analyze the data given. One can sort by newest to oldest, oldest to newest, A to Z, Z to A, largest
to smallest, and the list goes on. Sorting is commonly used in order to create a more pleasing
visual to examine the data for any unusual values. To begin, one shall select the data in the order
ID column and utilize the sort function. To do this, one should click home in excel, following the
editing tab, then selecting sort and filter. From there, one should select custom sort. For the
column section, order ID should be chosen. For the sort on section, cell values should be chosen,
and for the order, smallest to largest should be selected. Upon sorting the data, one is able to
identify that there is no corrupt data in that portion of data. Appendix A consists of an image of
the original excel sheet before the order ID column was sorted and Appendix B consists of
directions to properly sort the order ID column from smallest to largest.
Customer ID
When gathering and analyzing data, there is always a relevant population (Albright &
Winston, 2019). The population are the members within a study. In the data being analyzed for
this report, the population consists of customers who have ordered product(s). With that being
DATA CLEANSING ASSIGNMENT 4
said, the next column that will be sorted from smallest to largest is the customer ID column,
using the same steps as one would follow for the order ID column, however, one should select
customer ID for the column section category instead. By sorting this data, one is able to
determine that the data for the customer ID column is free of corruption. In Appendix A, one is
able to view the customer ID column prior to being sorted. Appendix C includes an image of the
customer ID column after being sorted from smallest to largest.
Product ID and Product Category ID
Continuing the process of data cleansing throughout each column is vital. The amount of
information is very extensive. A large amount of information can be helpful, however, if all
items are not checked for inaccuracy, it has the ability to skew results (Bai, 2019). Next, one
should sort the product ID column from smallest to largest. Similar to the other columns, there
are no discrepancies found. One shall refer to Appendix A to view the product ID column prior
to being sorted. Furthermore, Appendix D will provide an image of the product ID column after
being sorted from smallest to largest. The product category ID column is to be sorted next. After
sorting the product category ID, it is safe to assume that there is no corrupt data. Appendix A
reflects the original order of the product category ID column, whereas Appendix E reflects the
product category column after being sorted from smallest to largest.
Order Date and Ship Date
Unfortunately, not all real-world data sets are perfect. In fact, obtaining data from many
external sources leads to problems with the data (Albright & Winston, 2019). If one does not
determine data discrepancies when collecting data, it can lead to misleading results. At this point,
the order ID, customer ID, product ID, and product category ID data have been sorted. It is now
time to check the order date and shipping date for corrupt data by sorting them smallest to
DATA CLEANSING ASSIGNMENT 5
largest. Appendix A reflects an image of both the order date column and the shipping date
column prior to being sorted. Upon sorting the order date column, one is able to determine that
the data in that particular column is sufficient. However, upon sorting the shipping date column,
one is able to identify corrupt data, as there are many blank dates. Appendix F provides an image
of the order date column after being sorted. Appendix G provides an image of the shipped date
column. Additionally, Appendix G has the blank dates in the shipped data column highlighted in
yellow. To further investigate, however, one must add an additional column to determine which
shipping dates were blank when they should not be. For example, shipping dates can be blank for
orders that have not shipped yet, but they should not be blank before the corresponding order
dates (Albright & Winston, 2019). The additional column is named “difference in date”, as that is
what needs to be identified. The formula that must be used is =F2-E-2, as those are the two
columns that must be analyzed to determine which blank dates constitute corrupt data. This
formula should then be dragged down to the last row of the dataset. Appendix H provides an
image with the new column created and the formula used to determine which blank dates are
corrupt. The items that have a negative number in the “difference in date” column reflect corrupt
data. Appendix I includes an image that shows the corrupt data. To conclude, the same method
was used to analyze the quantity, unit price, and discount columns. Upon sorting these columns,
one is able to determine that the data in those three columns are sufficient.
Conclusion
Data cleansing is an important step for one to take when analyzing a data set. This allows
one to search for suspicious data and make any necessary changes to fix the corrupt data
(Albright & Winston, 2019). While data cleansing is a tedious process, it is necessary. Excel
tools discussed throughout the paper are commonly used to detect suspicious data within a data
DATA CLEANSING ASSIGNMENT 6
set. Because the internet and technology continues to advance and develop, individuals are often
obtaining information from those sources. The access to data is easy and accessible, however, not
always accurate. This is because there are no agreed upon standards among databases, and many
publishers use different formats of data. Because of this, the data cleansing practice has become
widely used (Bai, 2019). For larger data sets, such as the one analyzed for this paper, it is likely
that there will be errors and inconsistent data, which, in turn, affects the data. Detecting and
repairing inconsistent data has become a challenge. According to Hosseinzadeh et al. (2023),
Excel is a great way to cleanse data, as this paper has done above. This paper utilized Excel and
was able to discover the corrupt data, which consisted of a number of blank shipping dates in the
transaction excel sheet. By detecting the corrupt data, one is now able to achieve accurate data
collection (Hosseinzadeh et al., 2023).
DATA CLEANSING ASSIGNMENT 7
References
Albright, S. C., & Winston, W. L. (2019). Business Analytics: Data Analysis & Decision Making
(7th ed.). Cengage Learning US. https://mbsdirect.vitalsource.com/books/9780357110065
Bai, Y. (2019). Data cleansing method of talent management data in wireless sensor network
based on data mining technology. EURASIP Journal on Wireless Communications and
Networking, 2019(1), 1-6. https://doi.org/10.1186/s13638-019-1342-3
Hosseinzadeh, M., Azhir, E., Ahmed, O. H., Ghafour, M. Y., Ahmed, S. H., Rahmani, A. M., &
Vo, B. (2023). Data cleansing mechanisms and approaches for big data analytics: A
systematic study. Journal of Ambient Intelligence and Humanized Computing, 14(1), 99-111.
https://doi.org/10.1007/s12652-021-03590-2
Martinez-Luengo, M., Shafiee, M., & Kolios, A. (2019). Data management for structural
integrity assessment of offshore wind turbine support structures: Data cleansing and missing
data imputation. Ocean Engineering, 173, 867-883.
https://doi.org/10.1016/j.oceaneng.2019.01.003
DATA CLEANSING ASSIGNMENT 8
Appendix A
Appendix A is pictured above. Appendix A reflects the excel sheet before any items are
sorted. Pay close attention to the order ID column in particular. In Appendix A, the order ID
column is not yet sorted. However, in Appendix C, the order ID column is sorted from smallest
to largest.
DATA CLEANSING ASSIGNMENT 9
Appendix B
Appendix B reflects the correct sequence to choose in order to properly sort the order ID
column from smallest to largest.
DATA CLEANSING ASSIGNMENT 10
Appendix C
Appendix C reflects an image of the data in the customer column sorted from smallest to
largest.
DATA CLEANSING ASSIGNMENT 11
Appendix D
Appendix D reflects an image of the data in the product ID column from smallest to
largest.
DATA CLEANSING ASSIGNMENT 12
Appendix E
Appendix E reflects an image of the data in the product category ID sorted from smallest
to largest.
DATA CLEANSING ASSIGNMENT 13
Appendix F
Appendix F reflects an image of the data in the order date column sorted from smallest to
largest.
DATA CLEANSING ASSIGNMENT 14
Appendix G
Appendix G reflects an image of the data in the shipped date column where there are
blank dates. The spots with blank dates are highlighted in yellow.
DATA CLEANSING ASSIGNMENT 15
Appendix H
Appendix H provides an image with the new column created and the formula used to
determine which blank dates are corrupt.
DATA CLEANSING ASSIGNMENT 16
Appendix I
Appendix I portrays the highlighted corrupt dates, as these rows have a negative number.