With AI, It’s Still a Buyer Beware World


Too many companies are investing in AI tools with minimal due diligence. Here are questions to ask to avoid a nasty surprise.

Artificial intelligence tools are big business. They reach into every aspect of life: allocating social housing, hiring top talent, diagnosing medical conditions, predicting traffic jams, forecasting stock prices, generating sales leads . . . the list goes on. The global AI market is expected to exceed US$1 trillion this decade. 

Whether it’s the media, business or governments, everyone appears to be discussing AI and how it will transform the world. Managers are adopting AI because they believe it will allow their organizations to carry out tasks and make decisions more quickly and accurately and at a lower cost. They are constantly being reassured that these tools will deliver on their promises, based on seemingly credible third-party performance claims using common assessment measures. 

Yet, all too often managers risk discovering a huge gap between expectations and reality. Far from improving performance, AI tools may lower the accuracy and quality of decisions and even undermine the knowledge capital the organization built up over decades.

If that sounds shocking, it should. As my research with co-authors Sarah Lebovitz and Natalia Levina shows, it is a risk that thousands of organizations take by failing to do the appropriate due diligence when adopting AI solutions. It is a risk that threatens not only to damage organizations and governments but, in some situations, even to cost lives.

Ground truth, more or less 

We were able to study, at close quarters, how five AI tools were evaluated for adoption in a renowned U.S. hospital that employed leading experts in their fields. The AI application was in diagnostic radiology. For more than 11 months, we observed managers testing and evaluating the AI tools at research conferences, workshops, symposia, vendor presentations and during 31 detailed evaluation meetings. We conducted 22 interviews and had many informal conversations. We also had access to a wide range of associated data. 

What we discovered was both surprising and highly concerning. It wasn’t that the medical professionals involved didn’t want to thoroughly evaluate the AI tools. Rather, they didn’t initially know the critical questions to ask to make a reliable assessment. As they probed further and looked beyond surface-level metrics, they discovered fundamental flaws in the way the AI had been trained and validated.  

At the heart of any AI software is its “ground truth”, the labelled data that represents (and is used to verify) the correct answer to the question the AI is trying to solve. The ground truth dataset for a cat identification AI, for example, might consist of labelled images of various breeds of cats. A picture of a cat could then be checked against the ground truth dataset to see what type of cat it is. 

Most applications for AI tools in organizations, however, are far more complex, whether it is deciding if a radiology image shows a malignant tumour, a candidate is a suitable hire or a startup is worth investing in. Far from being clear and absolute, what constitutes the ground truth is often up for discussion. In these cases, users need to be certain that the ground truth is a sufficiently verifiable version of the truth so that they can rely on it for their decision-making. 

Wherever possible, the ground truth dataset should be based on objective information. A radiological image for detecting malignant tumours, for example, should be checked against subsequent biopsy results. 

Objectivity is elusive 

Often the nature of a prediction problem means the labelling of the ground truth data is not necessarily objective. In such cases, the labelling should be performed and checked by sufficiently expert people, using their know-how and applying the relevant professional standards for that information.

In practice, this means the tool developers should interact with relevant expert practitioners and tap into their accumulated knowledge and experience, both to better understand the practices and processes involved and to codify as much of that know-how as possible.

Unfortunately, as with many other organizations making AI purchasing decisions, the managers in our study relied too heavily on a metric commonly used to assess AI performance: the AUC (Area Under the receiver operating characteristic Curve). 

The problem is that the AUC says little about the tool’s performance versus the performance of the people at the organization who will be using the tool. Instead, it measures how well the tool separates cases according to whatever ground truth labels were selected by the AI designers; that is, performance on their own terms.
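To make the point concrete, here is a minimal sketch of how an AUC figure depends on the labels it is scored against. It is not taken from our study: the scores and both label sets are invented for illustration, and the names `tool_scores`, `vendor_labels` and `expert_labels` are hypothetical. The function uses the standard pairwise reading of the AUC, namely the chance that a randomly chosen positive case receives a higher score than a randomly chosen negative one.

```python
from itertools import product


def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs in which the
    positive case gets the higher score; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical data: one set of tool outputs, two versions of the "truth".
tool_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
vendor_labels = [1, 1, 1, 1, 0, 0, 0, 0]  # labels chosen by the tool's designers
expert_labels = [1, 0, 1, 0, 1, 0, 1, 0]  # labels assigned by in-house experts

auc_vendor = auc(tool_scores, vendor_labels)
auc_expert = auc(tool_scores, expert_labels)

print(f"AUC against vendor labels: {auc_vendor:.3f}")  # 1.000
print(f"AUC against expert labels: {auc_expert:.3f}")  # 0.625
```

The tool's outputs are identical in both calculations. Only the definition of the correct answer changes, and with it the headline number, which is why a quoted AUC should always prompt the question of whose labels it was measured against.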

Once the medical professionals in our study looked beyond the AUC metric and began to put the AI tools under the spotlight, problems soon emerged. In a series of pilot studies, the medical professionals used their expert know-how to develop their own ground truth datasets and test the AI against them.  

In many cases, their results conflicted with the accuracy measures claimed for the tool. On closer examination it became clear that the ground truth used by the models had not been generated in a way that reflected how the experts arrived at their decisions in real life. 

Most organizations adopting AI do not go through such a rigorous process of evaluation. But failure to examine an AI tool properly may result in damaging fall-out from its poor performance. The behaviour of managers (often reluctant AI adopters) can also aggravate problems. They may back their own know-how against an AI tool, paying lip-service to using it while continuing to work as usual. 

The risk is that, as happened in at least one organization, senior managers credit the AI tool with the good results and make the staff redundant. By the time the organization realizes the hit it has taken to its performance and expertise, the damage can be very costly or even impossible to rectify.

Questions to ask 

The best strategy is to evaluate AI tools thoroughly before they are acquired, implemented and embedded in the organization. This means putting some key questions to whoever is pushing for the AI’s adoption, whether that’s the designer, vendor or the organization’s own data scientists. 

Find out exactly how the AI tool has been trained. Can it be objectively validated? How was the data labelled? Who did the labelling, and who validated it? How expert were they in their field? Was it done to the professional standards expected in that area? What evidence was used? Where does the data come from? How applicable is it to the exact use case?

Don’t be put off. Don’t accept jargon-filled responses that obscure the truth. Only when you are satisfied with the answers is it time to move to piloting the tool.

AI will be unimaginably transformative, in many cases for the better. Eventually the way AI tools are constructed will become more transparent, best practices will be established by stakeholders and the sale of AI tools will be better regulated.

But until then, our study shows that caveat emptor — when the buyer alone is responsible for checking the quality and suitability of goods before a purchase — must be the watchword for organizations thinking of adopting AI tools. 

At the moment, the burden of checking the merits of AI tools falls on the purchasers. And thorough due diligence is needed if organizations want to avoid a bad case of buyer’s remorse.

Hila Lifshitz-Assaf is professor of management at Warwick Business School and a faculty affiliate at Harvard University’s Lab for Innovation Science. Warwick and Smith School of Business are members of the Council on Business & Society and share expert commentaries such as this essay.