Business finance – accounting assignment final

DATA CLEANSING ASSIGNMENT 1

Data Cleansing: Microsoft’s Northwind Data Set

Ashlee Simmons

Department of Business, Liberty University

BUSI 675: Business Analysis

Dr. Mary Jo Odom-Dull

September 17, 2023

Author Note

Ashlee Simmons

Ashlee Simmons is now at the Department of Business, Liberty University.

I have no known conflict of interest to disclose.

Correspondence concerning this article should be addressed to Ashlee Simmons

Email: [email protected]

DATA CLEANSING ASSIGNMENT 2

Microsoft’s Northwind Database

A data set, which is a collection of data, is a useful way to track and analyze an

abundance of information. Data sets are commonly created by using tables consisting of

columns, variables, and rows for each piece of data relevant to the data set. However, before a

data set is used for informational, experimental, and decision purposes, data cleansing must take

place. The use of data cleansing allows for abnormal and corrupt data to be identified. From

there, the abnormal data can then be removed which will then lead to a more accurate data set

(Martinez-Luengo et al., 2019). This paper will specifically analyze data from Microsoft’s

Northwind database that has been transported into an excel sheet. The data consists of

information regarding the company’s customers, products, product categories, and transactions.

Each transaction examined represents a product purchased by a customer, however, if a customer

purchased multiple products at once, the data reflects multiple corresponding rows in the

transactions table to represent one row for each product purchased. Additionally, there is a

product ID column that allows one to look up names of customers, products, and product

categories. Upon evaluation, it becomes apparent that some of the IDs in the transaction excel

sheet have been corrupted. There can be three reasons for that. The first reason is if an ID in the

transaction excel sheet does not correspond to any customer, product, or product category. The

second reason is if the order IDs do not correspond to only one customer ID. The third and final

reason is if a given product ID does not always correspond to the same product category ID.

Aside from the corrupted IDs, there are also shipping dates that are blank. Shipping dates can be

blank if the orders have not yet been shipped, however, they should not be blank before the

corresponding order dates (Albright & Winston, 2019). With that being said, the goal of this

DATA CLEANSING ASSIGNMENT 3

paper is to discover the corrupt information within the transaction excel sheet and to provide

images of the data that is corrupted.

Sorting the Data

Order ID

For one to properly handle large datasets, one must develop a data management strategy

(Hosseinzadeh et al., 2023). Data cleansing is a great way to manage datasets, and test for

missing values and inconsistencies. This paper will specifically utilize Excel as a management

strategy to data cleanse the dataset. The first step in the data cleaning process is to use the sort

and filter function within Excel. This function can be used on each column, making it easier to

analyze the data given. One can sort by newest to oldest, oldest to newest, A to Z, Z to A, largest

to smallest, and the list goes on. Sorting is commonly used in order to create a more pleasing

visual to examine the data for any unusual values. To begin, one shall select the data in the order

ID column and utilize the sort function. To do this, one should click home in excel, following the

editing tab, then selecting sort and filter. From there, one should select custom sort. For the

column section, order ID should be chosen. For the sort on section, cell values should be chosen,

and for the order, smallest to largest should be selected. Upon sorting the data, one is able to

identify that there is no corrupt data in that portion of data. Appendix A consists of an image of

the original excel sheet before the order ID column was sorted and Appendix B consists of

directions to properly sort the order ID column from smallest to largest.

Customer ID

When gathering and analyzing data, there is always a relevant population (Albright &

Winston, 2019). The population are the members within a study. In the data being analyzed for

this report, the population consists of customers who have ordered product(s). With that being

DATA CLEANSING ASSIGNMENT 4

said, the next column that will be sorted from smallest to largest is the customer ID column,

using the same steps as one would follow for the order ID column, however, one should select

customer ID for the column section category instead. By sorting this data, one is able to

determine that the data for the customer ID column is free of corruption. In Appendix A, one is

able to view the customer ID column prior to being sorted. Appendix C includes an image of the

customer ID column after being sorted from smallest to largest.

Product ID and Product Category ID

Continuing the process of data cleansing throughout each column is vital. The amount of

information is very extensive. A large amount of information can be helpful, however, if all

items are not checked for inaccuracy, it has the ability to skew results (Bai, 2019). Next, one

should sort the product ID column from smallest to largest. Similar to the other columns, there

are no discrepancies found. One shall refer to Appendix A to view the product ID column prior

to being sorted. Furthermore, Appendix D will provide an image of the product ID column after

being sorted from smallest to largest. The product category ID column is to be sorted next. After

sorting the product category ID, it is safe to assume that there is no corrupt data. Appendix A

reflects the original order of the product category ID column, whereas Appendix E reflects the

product category column after being sorted from smallest to largest.

Order Date and Ship Date

Unfortunately, not all real-world data sets are perfect. In fact, obtaining data from many

external sources leads to problems with the data (Albright & Winston, 2019). If one does not

determine data discrepancies when collecting data, it can lead to misleading results. At this point,

the order ID, customer ID, product ID, and product category ID data have been sorted. It is now

time to check the order date and shipping date for corrupt data by sorting them smallest to

DATA CLEANSING ASSIGNMENT 5

largest. Appendix A reflects an image of both the order date column and the shipping date

column prior to being sorted. Upon sorting the order date column, one is able to determine that

the data in that particular column is sufficient. However, upon sorting the shipping date column,

one is able to identify corrupt data, as there are many blank dates. Appendix F provides an image

of the order date column after being sorted. Appendix G provides an image of the shipped date

column. Additionally, Appendix G has the blank dates in the shipped data column highlighted in

yellow. To further investigate, however, one must add an additional column to determine which

shipping dates were blank when they should not be. For example, shipping dates can be blank for

orders that have not shipped yet, but they should not be blank before the corresponding order

dates (Albright & Winston, 2019). The additional column is named “difference in date”, as that is

what needs to be identified. The formula that must be used is =F2-E-2, as those are the two

columns that must be analyzed to determine which blank dates constitute corrupt data. This

formula should then be dragged down to the last row of the dataset. Appendix H provides an

image with the new column created and the formula used to determine which blank dates are

corrupt. The items that have a negative number in the “difference in date” column reflect corrupt

data. Appendix I includes an image that shows the corrupt data. To conclude, the same method

was used to analyze the quantity, unit price, and discount columns. Upon sorting these columns,

one is able to determine that the data in those three columns are sufficient.

Conclusion

Data cleansing is an important step for one to take when analyzing a data set. This allows

one to search for suspicious data and make any necessary changes to fix the corrupt data

(Albright & Winston, 2019). While data cleansing is a tedious process, it is necessary. Excel

tools discussed throughout the paper are commonly used to detect suspicious data within a data

DATA CLEANSING ASSIGNMENT 6

set. Because the internet and technology continues to advance and develop, individuals are often

obtaining information from those sources. The access to data is easy and accessible, however, not

always accurate. This is because there are no agreed upon standards among databases, and many

publishers use different formats of data. Because of this, the data cleansing practice has become

widely used (Bai, 2019). For larger data sets, such as the one analyzed for this paper, it is likely

that there will be errors and inconsistent data, which, in turn, affects the data. Detecting and

repairing inconsistent data has become a challenge. According to Hosseinzadeh et al. (2023),

Excel is a great way to cleanse data, as this paper has done above. This paper utilized Excel and

was able to discover the corrupt data, which consisted of a number of blank shipping dates in the

transaction excel sheet. By detecting the corrupt data, one is now able to achieve accurate data

collection (Hosseinzadeh et al., 2023).

DATA CLEANSING ASSIGNMENT 7

References

Albright, S. C., & Winston, W. L. (2019). Business Analytics: Data Analysis & Decision Making

(7th ed.). Cengage Learning US. https://mbsdirect.vitalsource.com/books/9780357110065

Bai, Y. (2019). Data cleansing method of talent management data in wireless sensor network

based on data mining technology. EURASIP Journal on Wireless Communications and

Networking, 2019(1), 1-6. https://doi.org/10.1186/s13638-019-1342-3

Hosseinzadeh, M., Azhir, E., Ahmed, O. H., Ghafour, M. Y., Ahmed, S. H., Rahmani, A. M., &

Vo, B. (2023). Data cleansing mechanisms and approaches for big data analytics: A

systematic study. Journal of Ambient Intelligence and Humanized Computing, 14(1), 99-111.

https://doi.org/10.1007/s12652-021-03590-2

Martinez-Luengo, M., Shafiee, M., & Kolios, A. (2019). Data management for structural

integrity assessment of offshore wind turbine support structures: Data cleansing and missing

data imputation. Ocean Engineering, 173, 867-883.

https://doi.org/10.1016/j.oceaneng.2019.01.003

DATA CLEANSING ASSIGNMENT 8

Appendix A

Appendix A is pictured above. Appendix A reflects the excel sheet before any items are

sorted. Pay close attention to the order ID column in particular. In Appendix A, the order ID

column is not yet sorted. However, in Appendix C, the order ID column is sorted from smallest

to largest.

DATA CLEANSING ASSIGNMENT 9

Appendix B

Appendix B reflects the correct sequence to choose in order to properly sort the order ID

column from smallest to largest.

DATA CLEANSING ASSIGNMENT 10

Appendix C

Appendix C reflects an image of the data in the customer column sorted from smallest to

largest.

DATA CLEANSING ASSIGNMENT 11

Appendix D

Appendix D reflects an image of the data in the product ID column from smallest to

largest.

DATA CLEANSING ASSIGNMENT 12

Appendix E

Appendix E reflects an image of the data in the product category ID sorted from smallest

to largest.

DATA CLEANSING ASSIGNMENT 13

Appendix F

Appendix F reflects an image of the data in the order date column sorted from smallest to

largest.

DATA CLEANSING ASSIGNMENT 14

Appendix G

Appendix G reflects an image of the data in the shipped date column where there are

blank dates. The spots with blank dates are highlighted in yellow.

DATA CLEANSING ASSIGNMENT 15

Appendix H

Appendix H provides an image with the new column created and the formula used to

determine which blank dates are corrupt.

DATA CLEANSING ASSIGNMENT 16

Appendix I

Appendix I portrays the highlighted corrupt dates, as these rows have a negative number.

Place your order
(550 words)

Approximate price: $22

Calculate the price of your order

550 words
We'll send you the first draft for approval by September 11, 2018 at 10:52 AM
Total price:
$26
The price is based on these factors:
Academic level
Number of pages
Urgency
Basic features
  • Free title page and bibliography
  • Unlimited revisions
  • Plagiarism-free guarantee
  • Money-back guarantee
  • 24/7 support
On-demand options
  • Writer’s samples
  • Part-by-part delivery
  • Overnight delivery
  • Copies of used sources
  • Expert Proofreading
Paper format
  • 275 words per page
  • 12 pt Arial/Times New Roman
  • Double line spacing
  • Any citation style (APA, MLA, Chicago/Turabian, Harvard)

Our guarantees

Delivering a high-quality product at a reasonable price is not enough anymore.
That’s why we have developed 5 beneficial guarantees that will make your experience with our service enjoyable, easy, and safe.

Money-back guarantee

You have to be 100% sure of the quality of your product to give a money-back guarantee. This describes us perfectly. Make sure that this guarantee is totally transparent.

Read more

Zero-plagiarism guarantee

Each paper is composed from scratch, according to your instructions. It is then checked by our plagiarism-detection software. There is no gap where plagiarism could squeeze in.

Read more

Free-revision policy

Thanks to our free revisions, there is no way for you to be unsatisfied. We will work on your paper until you are completely happy with the result.

Read more

Privacy policy

Your email is safe, as we store it according to international data protection rules. Your bank details are secure, as we use only reliable payment systems.

Read more

Fair-cooperation guarantee

By sending us your money, you buy the service we provide. Check out our terms and conditions if you prefer business talks to be laid out in official language.

Read more