Democratised AI — The Black Box Problem

Can you imagine modern life without electricity? I bet the idea seems ludicrous, and clearly, if you are reading this, you are among the 87% of the world population that enjoys access to electrical power. Andrew Ng, a household name in the Machine Learning community, famously stated that “AI is the new electricity”, which is a strikingly accurate analogy when one starts unpacking its implications.

Because, while electricity mostly produces perceptible, understandable and tangible effects, the results of using AI, or, to be precise, Machine Learning technology, are imperceptible, inscrutable and hidden from most of us. Nevertheless, Machine Learning pervades most aspects of modern society and industry, fuelled by incentives such as efficiency, optimisation, cost savings, increased insight and productivity.

There are several pitfalls to be aware of. Arguably, the worst outcome is not wasted time and money on a Machine Learning system that didn't perform as well as hoped. Far worse is a system that participates in perpetuating societal bias, hurting people and harming our society surreptitiously through discrimination: by limiting equal access to opportunities, by declining mortgages with no possibility of appeal, or by hiding job listings from candidates who don't fit the traditional profile.

Alternatively, the system could cause overt injury: for instance, autonomous vehicles that make fatal decisions based on poorly understood image recognition, or medical diagnoses and treatments recommended to patients on the basis of flawed reasoning.

There’s a little Black Box for everyone

This trend is fuelled by a fortuitous congruence of circumstances. The essential programming languages for Machine Learning are intuitive and high-level, and can easily be learned in a few weeks' time by someone already familiar with computer programming. And, with the availability of powerful open-source libraries and frameworks, coupled with high-quality data sets in the Public Domain and cheap cloud computing, Machine Learning has turned into a fairly low-barrier endeavour.

But these technical aspects aside, an inherently wonderful benefit of Machine Learning being a young research field is that the general rule from the onset has been to publish scientific papers and results openly, as pre-prints on arXiv and in Open Access journals and proceedings such as JMLR and PMLR. Tutorials and lectures are freely shared on YouTube and as free online university courses. This has led to a democratisation of Machine Learning knowledge, where skills and know-how really only depend on access to electricity, the internet and a laptop. In other words: Anyone can do Machine Learning today.

However, the mere fact that anyone and their grandma are able to get a Machine Learning system up and running does not necessarily imply that there is any understanding, or even intuition, of what is going on under the hood, neither during development nor after the system is put into production. Most such systems are magic black boxes, where one inserts data into one end, stirs it around for some indeterminate amount of time, and then a number plops out the other side.
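To make this concrete, here is a minimal sketch of that black box, assuming scikit-learn and a synthetic dataset; both the data and the model choice are illustrative assumptions, not a recommendation or anyone's production system:

```python
# A minimal sketch of the "magic black box": a few library calls produce
# predictions with no hint of how they were reached. The dataset and
# model choice here are illustrative assumptions, not recommendations.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for whatever data one pours into the box
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Stir for an indeterminate amount of time...
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# ...and a number plops out the other side. But why this number?
print(model.predict_proba(X[:1]))
```

Nothing in that output says which of the ten features drove the prediction, or whether the model's reasoning would survive contact with new data.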

So, how can we guard ourselves against potentially harmful outcomes if we don't know what is going on? Is it possible to peek inside this Magic Black Box that is Machine Learning, and understand how and why a Machine Learning system arrived at a particular result?

How to Good Science

1. Isolate the experimental setup to avoid contamination from external influences.

2. Choose proper metrics.

3. Accurately measure and store the sampled data.

4. Analyse and interpret the data.

5. Verify the results.

Especially important is The Zeroth Law: Remember to leave predetermined viewpoints at the door, and emotionally prepare for having the hypothesis completely disqualified…

These guidelines are applicable to Machine Learning as well, if one thinks of the training of the model as being the experiment, and the resulting output by the model as being the data sampled and collected during the execution of the experiment. Choosing suitable metrics, storing the output, and even verifying the results, are easily transferable concepts.
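The analogy can be sketched in code. The toy "experiment" below uses only the Python standard library, and every name, seed and number in it is an illustrative assumption: it isolates the setup with a fixed seed, chooses a metric up front, fits on one split, and measures and stores the output from a held-out split.

```python
# Sketch of a training run treated as a lab experiment: fix the setup,
# choose the metric up front, store the output, verify on held-out data.
# The "model", seed and sizes are illustrative assumptions.
import json
import random

random.seed(42)  # isolate the experimental setup: make the run reproducible

# Toy task: does a sample's value exceeding 0.5 predict label 1?
data = [random.random() for _ in range(200)]
labels = [1 if x > 0.5 else 0 for x in data]

split = 150  # fit on one portion, verify on data the "model" never saw
train, test = data[:split], data[split:]
test_y = labels[split:]

threshold = sum(train) / len(train)  # the single "trained" parameter

def predict(x):
    return 1 if x > threshold else 0

# Proper metric (accuracy), measured only on the held-out portion
accuracy = sum(predict(x) == y for x, y in zip(test, test_y)) / len(test)

# Store the sampled output so the "experiment" can be analysed later
result = {"seed": 42, "threshold": threshold, "test_accuracy": accuracy}
print(json.dumps(result))
```

A real training run is vastly more complex, but the discipline is the same: if the seed, the metric and the output are not pinned down, the "experiment" cannot be repeated or verified.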

The challenge lies in controlling for outside influences and interpreting the output, because the training data fed into a Machine Learning system contains both the desired signal that tests the hypothesis and external noise that contaminates and confounds the result. And, when interpreting the result, we need to be able to decouple the signal from the noise.
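One common sanity check for exactly this, among several, is a shuffled-label baseline: refit or rescore with the labels randomly permuted, and if the model does no better on real labels than on shuffled ones, the apparent result is likely noise or leakage. A deliberately tiny, stdlib-only illustration, where the data and the stand-in model are assumptions of this sketch:

```python
# Sketch: checking that a result reflects signal rather than noise by
# comparing it against a shuffled-label baseline. If the score on real
# labels is no better than on permuted ones, the "finding" is suspect.
# The data, model and sizes here are purely illustrative.
import random

random.seed(0)

n = 400
X = [random.gauss(0, 1) for _ in range(n)]
# Labels carry real signal (the sign of x) plus confounding noise
y = [1 if x + random.gauss(0, 0.5) > 0 else 0 for x in X]

def fit_and_score(xs, ys):
    # Trivial stand-in model: threshold at the mean, scored by accuracy
    t = sum(xs) / len(xs)
    return sum((x > t) == (label == 1) for x, label in zip(xs, ys)) / len(xs)

real = fit_and_score(X, y)

shuffled = y[:]          # destroy the signal, keep the label distribution
random.shuffle(shuffled)
baseline = fit_and_score(X, shuffled)

print(f"real labels: {real:.2f}  shuffled baseline: {baseline:.2f}")
```

The gap between the two scores is a crude estimate of how much genuine signal, as opposed to noise, the model has latched onto.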

Good Data Science requires Interpretability

Interpretability in Machine Learning is not a well-defined concept, because it is context dependent and means different things for different problems. But, arguably, one can take it to mean opening up the magic black box, rendering it transparent, and being able to interpret, explain and trust what is going on inside it. Speaking in general terms, to assess whether one has achieved Interpretability, I propose a checklist of questions that we should be able to answer with some degree of confidence, given our hypothesis, that is, the questions posed to the data.

In part two of this series I will elaborate on this checklist, supply examples to argue the relevance of these questions, and outline the potential consequences of failing to answer them.
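As a taste of what such interrogation can look like, one simple, widely used technique is permutation importance: shuffle one feature at a time and observe how much the chosen metric degrades. The sketch below uses only the standard library, and its data and deliberately trivial stand-in "model" are assumptions of the example, not of any real system:

```python
# Sketch: permutation importance, one simple way to start opening the
# box -- shuffle one feature at a time and watch how much the metric
# drops. The data and the stand-in "model" are toy assumptions.
import random

random.seed(1)

n = 300
# Feature 0 carries the signal; feature 1 is pure noise
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]
y = [1 if row[0] > 0 else 0 for row in X]

def score(rows, labels):
    # Stand-in model: predict from the sign of feature 0, score by accuracy
    return sum((row[0] > 0) == (lab == 1)
               for row, lab in zip(rows, labels)) / len(labels)

base = score(X, y)

importances = {}
for j in range(2):
    col = [row[j] for row in X]
    random.shuffle(col)                       # break this feature's link to y
    X_perm = [row[:] for row in X]
    for i, v in enumerate(col):
        X_perm[i][j] = v
    importances[j] = base - score(X_perm, y)  # score drop = importance

print(importances)
```

Here shuffling feature 0 should hurt the score while shuffling feature 1 should not; for real models, libraries such as scikit-learn provide `permutation_importance` off the shelf.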

Understanding data is easier than understanding science

Contrary to a common misconception, most data scientists are not mere programmers: they often hold a Master's Degree or PhD in a STEM field such as Mathematics, Statistics or Physics, have years of experience in doing "Good Science", and have acquired a generally sceptical attitude towards both data quality and model veracity. However, if one does decide to go down the retraining road, it is highly recommended that the team adhere to the Checklist of Interpretability and become equipped with the following:

- The knowledge required to identify issues and shortcomings of the data.

- The skills to properly pose the questions that one wants the data to answer.

- The abilities to interpret and explain the answers the system outputs.

And lastly, for everyone involved, including the product owner, to gain a deep appreciation for The Precautionary Principle:

Avoid using Machine Learning technology that is not fully understood in decision-making systems where the stakes are high or critical.

Getting Machine Learning education and training right will play a significant role in the improvement of our collective future. Achieving the common goals of halting climate change, fighting poverty and raising living standards in developing countries requires us to build technology that is resource- and energy-efficient. It must also be interpretable, explainable and proven worthy of our trust, so it does not cause unintended harm along the way.

After all, who would accept reduced access to electricity in order to lessen their ecological impact? Would you?
