We all think we know what dirty data is, but it can mean very different things depending on who you speak to. At its most basic level, dirty data is anything incorrect. Within procurement specifically, it could be misspelled vendors, incorrect invoice descriptions, missing product codes, a lack of standard units of measure (e.g. ltr, l, liters), currency issues, duplicate invoices, or incorrectly or partially classified data.
Dirty data can affect the whole organization, and we all have an impact on, and responsibility for, the data we work with. Accurate data should be everyone’s responsibility, but currently across many organizations, data is the sole responsibility of one person or department, and everyone trusts them to make sure the data is accurate.
These people or departments tend to be specialists in data, analytics, and coding, not procurement. They don’t have the experience to know when a hotel should be classified as accommodation or as venue hire, or what direct, indirect, or tail spend is and its importance or priority.
How many times have you been working with a data set and noticed a small error but not said anything, or just manually corrected something from an automated report, just to get it out the door on time? It feels like too much of an inconvenience to find the right person to notify, so you just correct the error each time yourself, or you raise a ticket for the issue but never get around to checking if it’s resolved.
These small errors that you think aren’t that important can filter all the way up to the top of an organization through reports and dashboards where critical decisions are being made. It happens almost every day.
There are many ways this affects your organization, but one of the most widespread and noticeable impacts is around reporting and analytics. If you’re in senior management, you will most likely receive a dashboard from your team that you could be using to review cost savings, supplier negotiations, rationalization, forecasting, or budgets.
What if within that dashboard was £25k of cleaning spend under IBM? I can already hear you saying “that’s ridiculous.” Well, it is obvious when pointed out, but I have seen with my own eyes IBM classified as cleaning. It can happen easily and occurs more frequently than you might think.
Back to that dashboard that you are using to make decisions: you’ll see increased spend in your cleaning category and a decrease in your IT spend, which could affect discounts with your supplier, your forecast for the year, monitoring of contract compliance, and so on. It could even affect reporting of your inventory (for example, it appears you need more laptops), leading to unnecessary purchases.
When there are tens or hundreds of thousands of rows of data, errors will occur multiple times across many suppliers. And for the wider organization, this could affect demand planning, sales, marketing, and financial decisions.
And then there are technology implementations. Rarely is data preparation considered before the implementation of any new software or systems and there can even be the assumption that the software supplier will do this, which may not be the case. If they do provide that service, it might not be good enough.
It can be very far into the implementation process before this is uncovered, by which time staff have lost faith in the software: they are disengaged, claim it doesn’t work, or don’t trust it because “it’s wrong.”
At this point, it either costs a lot of money to fix and you have to hope staff will re-engage, or the project is abandoned. In either case, this can take months and cost thousands, if not millions, of pounds/euros/dollars in abandoned software or remedial work.
You might also be considering using or engaging with a third party supplier that uses AI, machine learning, or some form of automation. I can’t emphasize enough the importance of cleansing and preparing your data before using any of these tools.
Think back to the IBM example. Each quarter the data is refreshed automatically with the cleaning classification, so that £25k becomes £50k, then £75k the following quarter, and it’s only when the value becomes significant that someone notices the issue. By this stage, how many decisions have been based on this incorrect information?
So how do you fix dirty data? Truthfully, with a lot of hard work. There’s no magic bullet or miracle solution out there to improve the accuracy of your data; you have to use your team or an experienced professional to get the job done. Get your team to familiarize themselves with the data: if they are reviewing and maintaining it regularly, they will soon be able to spot errors quickly and efficiently.
If you think about data accuracy in terms of COAT, this will help to manage your data.
It should always be Consistent—everyone working to the same standards, Organized—categorized properly, and Accurate—correct. And only when you have these things will it also be Trustworthy—you wouldn't drive around in a car without regular inspection, would you?
Accurate data is important, but in its raw state it’s not the whole story. As a procurement professional, you’re tasked with ensuring the best prices for products or services, as well as ensuring contract compliance on those prices, cost reductions, and monitoring any maverick spend (just to name a few!).
Accurate data alone will not help achieve this. I strongly recommend supplier normalization and spend data classification to help quickly and efficiently manage your spend and suppliers, monitor pricing, and spot any potential misuse of budgets.
With a spreadsheet of spend transactions over a period of time such as 12 to 24 months, the first step should be supplier normalization, where a new column is added to consolidate several versions of the same company to get a true picture of spend with that one supplier. For example, I.B.M, IBM Ltd, and I.B.M. would all be normalized to IBM.
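To make the idea concrete, here is a minimal Python sketch of supplier normalization. The function name `normalize_supplier` and the list of legal suffixes are my own assumptions for illustration. Real supplier data will need a longer suffix list and manual review, since simple rules like these will mishandle some names (for example, names containing a web address).

```python
import re

# Hypothetical list of legal suffixes to drop; extend it for your own data.
LEGAL_SUFFIXES = {"ltd", "limited", "inc", "plc", "llc", "corp", "corporation"}

def normalize_supplier(name: str) -> str:
    """Return a consolidated supplier name to store in a new 'normalized' column."""
    # Remove dots so "I.B.M." becomes "IBM", then turn other punctuation into spaces.
    cleaned = name.replace(".", "")
    cleaned = re.sub(r"[^A-Za-z0-9& ]+", " ", cleaned)
    # Drop legal suffixes, collapse whitespace, and upper-case for consistency.
    words = [w for w in cleaned.split() if w.lower() not in LEGAL_SUFFIXES]
    return " ".join(words).upper()

raw_names = ["I.B.M", "IBM Ltd", "I.B.M."]
normalized = [normalize_supplier(n) for n in raw_names]
print(normalized)  # ['IBM', 'IBM', 'IBM']
```

Keeping the original supplier name in its own column and writing the result to a new one means you can always trace a normalized name back to what was on the invoice.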
Data can be classified using minimal information, such as supplier name, invoice/PO line description, and value. To get more from the data, other factors such as unit price can then be added. Where unit price information is not available, the overall value can be divided by the quantity.
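As a rough sketch of deriving a unit price, the snippet below takes the line value and divides it by the quantity. The field names and the sample rows are made up for illustration, and rows with a missing or zero quantity are left blank rather than guessed.

```python
from typing import Optional

def unit_price(line_value: float, quantity: float) -> Optional[float]:
    """Derive unit price as line value / quantity; None if quantity is missing or zero."""
    if not quantity:
        return None
    return round(line_value / quantity, 2)

# Made-up example rows; the column names are assumptions.
rows = [
    {"supplier": "IBM", "description": "Laptop", "value": 4500.00, "quantity": 5},
    {"supplier": "IBM", "description": "Support contract", "value": 1200.00, "quantity": 0},
]
for row in rows:
    row["unit_price"] = unit_price(row["value"], row["quantity"])

print([row["unit_price"] for row in rows])  # [900.0, None]
```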
A suitable taxonomy will then need to be found to classify the data. It can be an off-the-shelf product such as ProClass, UNSPSC, or PROC-HE, or a taxonomy can be customized so that it is specific to your organization or industry.
This initial stage may take months as you are working with large volumes of data. It might be worth considering outsourcing this initial task to professionals experienced in this area who will be able to complete the project in a shorter time with greater accuracy.
There are a number of ways to classify the data. To get started, look for keywords in the supplier name and then in the description column. The description of services could include hotel, taxi, cleaning services, cleaning products, and so on. However, it’s important to check the descriptions carefully before classifying, or errors could be introduced. A classic example is “taxi from hotel to restaurant”: depending on which keyword you search for first, it could end up being misclassified as transport or as venue costs.
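One way to guard against the “taxi from hotel to restaurant” problem is to flag any description that matches more than one category instead of trusting the search order. The sketch below is only an illustration: the keyword rules, the category names, and the `classify` function are all hypothetical, not a standard taxonomy.

```python
import re

# Hypothetical keyword-to-category rules; the categories are illustrative only.
KEYWORD_RULES = [
    ("taxi", "Transport"),
    ("hotel", "Accommodation"),
    ("cleaning", "Cleaning"),
]

def classify(description: str):
    """Return (category, needs_review) for one line description."""
    matched = []
    for keyword, category in KEYWORD_RULES:
        # Whole-word, case-insensitive match on the keyword.
        if re.search(rf"\b{re.escape(keyword)}\b", description, re.IGNORECASE):
            if category not in matched:
                matched.append(category)
    if not matched:
        return None, True  # nothing matched: needs a manual look
    # More than one category matched: keep the first but flag it for review.
    return matched[0], len(matched) > 1

print(classify("taxi from hotel to restaurant"))  # ('Transport', True)
print(classify("Office cleaning services"))       # ('Cleaning', False)
```

The flagged rows are the ones worth a human check, which keeps the manual effort focused on the ambiguous descriptions.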
I wouldn’t advise classifying row by row, as it could take more than twice as long to complete the file using this method. Start with keywords, followed by the highest-value suppliers, which you can get from a pivot table of the data if you’re working in Excel.
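If you are not working in Excel, the same “highest value first” ordering can be sketched in a few lines of Python. The rows below are made up, and I am assuming each row holds a normalized supplier name, a category (empty if not yet classified), and a value.

```python
from collections import Counter

# Made-up rows: (normalized supplier, category or None if unclassified, value)
rows = [
    ("IBM", None, 25000.0),
    ("Acme Cleaning", "Cleaning", 4000.0),
    ("City Taxis", None, 300.0),
    ("IBM", None, 12000.0),
]

# Total the spend that is still unclassified, per supplier.
unclassified_spend = Counter()
for supplier, category, value in rows:
    if category is None:
        unclassified_spend[supplier] += value

# Suppliers to review next, largest unclassified spend first.
review_order = [supplier for supplier, _ in unclassified_spend.most_common()]
print(review_order)  # ['IBM', 'City Taxis']
```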
Once classified, charts can be built to analyze the data. The analysis could include the top 80% of suppliers by spend, number of suppliers by category, unit price by product by month, spend by category, or spend by month.
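As a simple illustration of two of these views, the sketch below totals spend by category and finds the smallest group of top suppliers whose combined spend reaches 80% of the total. The transactions are invented, and the 80% cut-off is applied to cumulative spend, which is one reasonable reading of “top 80% of suppliers by spend”.

```python
from collections import defaultdict

# Invented classified transactions: (supplier, category, value)
transactions = [
    ("IBM", "IT", 50000.0),
    ("Acme Cleaning", "Cleaning", 25000.0),
    ("City Taxis", "Transport", 15000.0),
    ("Hotel Co", "Accommodation", 10000.0),
]

spend_by_supplier = defaultdict(float)
spend_by_category = defaultdict(float)
for supplier, category, value in transactions:
    spend_by_supplier[supplier] += value
    spend_by_category[category] += value

total = sum(spend_by_supplier.values())
ranked = sorted(spend_by_supplier.items(), key=lambda kv: kv[1], reverse=True)

# Add suppliers from the largest down until cumulative spend reaches 80% of total.
top_suppliers = []
running = 0.0
for supplier, spend in ranked:
    if running >= 0.8 * total:
        break
    top_suppliers.append(supplier)
    running += spend

print(top_suppliers)  # ['IBM', 'Acme Cleaning', 'City Taxis']
print(dict(spend_by_category))
```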
Patterns should start to emerge which could reveal unusually high or low spend in a category, irregular pricing, higher than expected use of services, or a higher than expected number of suppliers within a category.
Data accuracy is an investment, not a cost. Address the issues at the beginning—while it might seem like a costly exercise, you will undoubtedly spend less than if you have to resolve an issue further down the line with a time-consuming and costly data clean-up operation. And by involving the whole team or organization, it will be much easier to manage and maintain the most accurate data possible.
Spend data classification shows you the whole picture, as long as it’s accurate. You can get a true view of your spending, allowing improved cost savings, better contract compliance, and, possibly most important of all, the prevention of costly mistakes before they happen.