Sunday, January 29, 2023

Can We Trust our AI Systems? Why Testing Learning Systems is Different


Artificial Intelligence (AI) has become one of the major topics for organisations looking to improve their operations over the last decade. However, there is growing attention to the way these systems are trained and whether they can be trusted once put into operation. Traditional IT systems are subject to well-developed testing procedures; the testing of AI systems, by contrast, is considerably more complex and very different. This has consequences for the level of trust that can be placed in the systems and for the impact of the potential regulatory regimes that are being developed.


Artificial Intelligence (AI) has been claimed to be the most transformative technology of the last decade: the foundation of the era of Digital Transformation, able to unleash Industry 4.0 and allow organisations to be more efficient, more innovative and, ultimately, more competitive.

The reality, so far, is much more mundane. A range of reports suggests that the success rate of AI-based projects is in fact between 12% and 30%, that there are entire business sectors where AI does not seem to work at all (in spite of the claims of developers and vendors), and that in other sectors AI systems that seem to work very well in the development lab completely fail to deliver when implemented in real-world settings.

The question, therefore, is what do we need to do in order to be able to trust our AI systems? Are there regulatory systems or other sources of guidance that may be of assistance? Are there better approaches to testing the systems before implementation?

But before we start to investigate the answers to those questions, we need to ask an even more fundamental one: what do we mean by the term AI?

What is Artificial Intelligence?

The term AI, like the term Transportation, means very different things to different people. We rarely talk about transportation systems in general; more often we talk about aviation, maritime, road, rail, ships, aircraft, trains, lorries, buses, etc., to be specific, because of the many differences. The same is true of AI.

AI covers a wide range of technologies developed over a period of about seven decades, from knowledge- and rule-based systems, via symbolic logic and expert systems, to modern learning systems such as predictive analytics, image analysis, neural net systems and large language models like the recent high-profile ChatGPT system.

They are based on many different technologies, with attendant different levels of trust and explainability. For some technologies we can explain exactly how a decision was made, for some we can give a general explanation, and for some we have no means of explaining any part of the decision.

Testing Learning Systems

Most IT systems developers are familiar with modern software testing processes, technologies and skillsets. These work well with traditional non-AI systems and are used extensively during the initial system build, and during any further development, to ensure that the specifications have been met and that existing functionality has not been changed during modification. This is possible because of the detailed requirements specifications that govern the processing.

However, the requirements specification for a learning system (AI) is typically to find and learn patterns of interest in the training data, and the decisions that should follow from them. As a result, the behaviour of the pattern-finding system is almost totally dependent on the training data, provided that the software is capable of finding the patterns.
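This dependence on training data can be seen even in the simplest possible learner. The toy sketch below (entirely hypothetical data and labels) runs identical 1-nearest-neighbour code against two different training sets and produces opposite decisions for the same input:

```python
# Toy illustration: identical learning code, two different training sets,
# different behaviour. A 1-nearest-neighbour "learner" on 1-D data.
def predict(train, x):
    # Return the label of the training point closest to x.
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

# Two hypothetical sites collect different populations (values arbitrary).
train_a = [(1.0, "negative"), (2.0, "negative"), (9.0, "positive")]
train_b = [(1.0, "negative"), (6.0, "positive"), (9.0, "positive")]

x = 5.0
print(predict(train_a, x))  # negative — nearest training point is 2.0
print(predict(train_b, x))  # positive — nearest training point is 6.0
```

The code never changed; only the data did. Testing such a system therefore means testing the data and the trained behaviour together, not just the software.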

One of the big problems that has affected many AI systems is bias in the training data. Recently, it was reported that one mammogram analysis system which works for white women does not work for black women, because the training set contained only images from white women, reflecting the demography of patients at the training hospital.

At present, there are problems with how marketing teams use the standard measures of a trained system's accuracy and effectiveness; data is often presented selectively in ways that are actually misleading. This is particularly true in the field of medical diagnostic applications, where the operational data is heavily skewed towards negative results, with relatively few positives expected. One recently reported example was marketed with an accuracy of greater than 80%, yet it missed 86% of the true positive results it was supposed to detect for treatment purposes.
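The arithmetic behind this is worth making explicit. With hypothetical but plausible numbers (5% prevalence, chosen here only to match the reported 86% miss rate; the source gives no actual counts), a model can score impressively on accuracy while being nearly useless at the job it was bought for:

```python
# Hypothetical screening numbers showing how class imbalance lets a model
# report high accuracy while missing most true positives.
total = 1000                  # patients screened
positives = 50                # 5% prevalence of the condition
tp = 7                        # true positives the model catches
fn = positives - tp           # 43 positives missed
fp = 20                       # healthy patients wrongly flagged
tn = total - positives - fp   # 930 correctly cleared

accuracy = (tp + tn) / total  # dominated by the easy negatives
miss_rate = fn / positives    # the figure that matters clinically
sensitivity = tp / positives

print(f"accuracy:    {accuracy:.1%}")   # 93.7%
print(f"miss rate:   {miss_rate:.1%}")  # 86.0%
print(f"sensitivity: {sensitivity:.1%}")
```

A headline "over 80% accurate" and "misses 86% of true cases" can describe the same system; only metrics such as sensitivity and specificity, reported per class, reveal the difference.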

Regulatory and Advisory Frameworks

The draft EU AI Act is intended to regulate the use of AI systems that have the potential to cause harm to humans. It defines AI very widely, in order to cover all the statistically based approaches that can have a direct effect on humans. It identifies as high risk various application areas where there is potential for significant harm, such as education, financial services, medicine, justice and law, and employment.

It can be observed from many reports that many (if not most) failed AI projects tend to fall within the EU AIA high-risk areas, suggesting that such projects should be avoided.

Model Cards have been recommended as one way of demonstrating to users and regulators that AI systems have been appropriately designed and tested for the specific situations where they should work. They also provide guidance on situations where the AI is unlikely to be effective or should not be used.
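In practice a Model Card is structured documentation shipped with the model. The sketch below captures its spirit as a simple record; the field names and values are illustrative assumptions, not an official schema, and the model name is hypothetical:

```python
# A minimal, illustrative model-card record. Field names follow the spirit
# of the Model Cards proposal but are simplified assumptions, not a schema.
model_card = {
    "model": "mammogram-screening-v2 (hypothetical)",
    "intended_use": "Second-reader support for routine breast screening",
    "out_of_scope": ["Primary diagnosis without clinician review"],
    "training_data": {
        "source": "single-site hospital archive",
        "known_gaps": ["under-representation of non-white patients"],
    },
    "evaluation": {
        # A full card would break these out per demographic subgroup.
        "sensitivity": 0.14,
        "specificity": 0.98,
    },
    "caveats": ["Performance not validated outside the training site"],
}

for field in ("intended_use", "out_of_scope", "caveats"):
    print(f"{field}: {model_card[field]}")
```

The value for trust comes from the out-of-scope and caveats fields: they state in advance where the system should not be relied upon, which is exactly the information the bias and accuracy failures above were missing.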


AI is a collection of a range of technologies. We need to be more specific about which type of AI we are discussing, due to the wide differences in capability and explainability.

A process called Model Cards can help AI system developers, customers and regulators gain trust in such systems.

The combination of the EU AIA and Model Cards seems likely to become the international gold standard for AI systems.