BLUESKEYE AI's Explainable, Robust, and Adaptable Approach to Face and Voice AI

Pre-amble

Many people are rightly excited about the opportunity AI brings to preventing, detecting, and treating poor mental health. To do so, an ethical approach to AI is required, which includes making AI that is explainable and accurate regardless of who you are. And to achieve the greatest impact, it should be adaptable to be used for as many medical conditions as possible.

BLUESKEYE AI uses a unique approach to achieve explainable, robust, and adaptable analysis of medically relevant face and voice behaviour.

BLUESKEYE AI's explainable, robust, and adaptable approach to face and voice AI

A common concern about modern neural-network based AI systems is that they are black boxes, meaning that it is unknown how they make predictions. This is not because it’s impossible to follow how an input to an AI system is transformed into the output - that is actually quite simple, given that it’s a sequence of basic mathematical operations performed on the input data as it passes through one layer after another until it reaches the final output layer and produces a prediction. No, it is called a black box because deep neural networks perform so many operations on the input data that, unless we’re talking about very small networks, it’s usually meaningless to trace how the output came about from the input. What we are really interested in is the decision-making process of an AI, just as we’re interested in the reasoning done by a human. In court, we ask ‘did you see this car drive through a red light?’ - we do not ask for an explanation of how the witness saw the car, only that they saw the car cross the junction while simultaneously seeing that the traffic light was red.

BLUESKEYE AI takes a unique approach to address the issue of explainable AI, and in the process provides our customers in the health and automotive space with AI systems that are more robust and adaptable, as a bonus.

The BLUESKEYE process for face and voice analysis splits the problem of behaviour analysis into two stages. First, we use computer vision and signal processing to turn images and waveforms into behaviour primitive parameters, such as the activation of a facial muscle or the direction of your gaze. Second, we teach our AI to reason with those behaviour primitives to gain behaviour insights.
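
To make the two-stage split concrete, here is a minimal sketch in Python of how such a pipeline could be wired together. All of the names, primitive types, and placeholder values are illustrative assumptions, not our actual implementation.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class BehaviourPrimitives:
    """Stage-one output: a compact, human-readable description of behaviour."""
    action_unit_intensities: np.ndarray  # facial muscle activations
    gaze_direction: np.ndarray           # e.g. (yaw, pitch) in degrees
    head_pose: np.ndarray                # e.g. (yaw, pitch, roll) in degrees
    voice_features: np.ndarray           # e.g. pitch, energy, speech rate


def stage_one(frame: np.ndarray, audio: np.ndarray) -> BehaviourPrimitives:
    """Computer vision and signal processing: pixels and waveforms in,
    behaviour primitives out (placeholder values shown here)."""
    return BehaviourPrimitives(
        action_unit_intensities=np.zeros(17),
        gaze_direction=np.zeros(2),
        head_pose=np.zeros(3),
        voice_features=np.zeros(8),
    )


def stage_two(primitives: BehaviourPrimitives) -> float:
    """Reasoning over the primitives to produce a behaviour insight,
    e.g. a single risk or mood score (a stand-in for a learned model)."""
    features = np.concatenate([
        primitives.action_unit_intensities,
        primitives.gaze_direction,
        primitives.head_pose,
        primitives.voice_features,
    ])
    return float(features.mean())


# The raw image and audio never reach stage two directly.
insight = stage_two(stage_one(frame=np.zeros((224, 224, 3)), audio=np.zeros(16000)))
print(insight)
```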

First stage AI - behaviour primitives

The first stage AI looks at and listens to what you can readily observe. From an input image and the waveform of the voice, the behaviour primitive detection AI recognises the intensity of a person’s facial muscle actions, the direction of their gaze, their head pose, what they say, and their tone of voice. These observations made by the AI are all of a type that can be readily checked to verify that they’re correct, and they are made pretty much instantaneously. So while this stage doesn’t really explain how a pattern of pixels in an image results in the orientation of your eyeball, you can immediately verify that the prediction is correct.

This first stage is trained on millions of people’s face and voice data, collected with the appropriate permission in all kinds of environments. We’ve shown this stage to be unbiased with respect to apparent age, ethnicity, and gender, meaning that the performance for each group is equally good.
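
To illustrate what a per-group performance check can look like, the sketch below compares a model’s error across demographic groups on synthetic data. The groups, numbers, and acceptance margin are invented for the example and do not describe our validation protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical gaze-angle predictions and ground truth (in degrees),
# with a demographic group label attached to each sample.
y_true = rng.uniform(-30, 30, size=3000)
y_pred = y_true + rng.normal(0, 2.0, size=3000)
groups = rng.choice(["group_a", "group_b", "group_c"], size=3000)

# Mean absolute error per group.
errors = np.abs(y_true - y_pred)
per_group_mae = {g: errors[groups == g].mean() for g in np.unique(groups)}
print(per_group_mae)

# One simple criterion: the worst group's error should stay within a
# small margin of the best group's error.
assert max(per_group_mae.values()) - min(per_group_mae.values()) < 0.5
```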

What is really interesting is that the output of this first stage is now purely in terms of behaviours - did you raise an eyebrow? Did you say ‘yes!’? In what direction did you look? The actual images are gone, as is the original voice data. This means, by definition, that the illumination conditions, the background noise, or for that matter a person’s skin tone are not even available for any follow-up stage to be biased or confused by!

Second stage AI - behaviour insights

The goal of the second stage is to provide you and your customers with the behaviour insights you are after. This can be an indication of someone’s level of depression or fatigue, a window into their mood, or which of the instructions provided by their car engaged, distracted, or confused them.

The beauty of our two-stage approach is that the second stage takes as input the behaviour primitives output by the first stage - only about 200 parameters - whereas the first stage takes in images of the face and waveforms of the voice, totalling millions of parameters. If you have ever done any statistics, or even machine learning, you will know about the curse of dimensionality: the more parameters you have as input to your machine-learned AI, the more data you need to train it properly. Although deep learning networks are less sensitive to this than previous machine learning methods, the principle still holds, and our two-stage approach means that we need tens of thousands of times less data to establish our behaviour insights. With proper expertise in behaviour analysis, something that BLUESKEYE is unique in delivering, we can build our customers a proof-of-concept model with as little as 100 data points, and achieve production-ready systems with as little as 1,000 data points.
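
As a rough illustration of how little data a roughly 200-parameter input needs, the sketch below fits and cross-validates a simple stand-in second-stage classifier on 1,000 synthetic examples. The data, labels, and choice of model are placeholders, not one of our production models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# 1,000 examples of ~200 behaviour primitives each (synthetic stand-ins),
# with a binary behaviour-insight label driven by a handful of them.
n_samples, n_primitives = 1000, 200
X = rng.normal(size=(n_samples, n_primitives))
y = (X[:, :5].sum(axis=1) + rng.normal(size=n_samples) > 0).astype(int)

# A small model on a small input: cross-validation is feasible even at
# this data scale, which is far harder with millions of raw pixel inputs.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)
print(f"mean cross-validated accuracy: {scores.mean():.2f}")
```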

Benefits

BLUESKEYE AI’s two-stage approach to behaviour analysis has huge benefits. The most important is that the behaviour insights stage is highly explainable, because its predictions can be expressed in terms of readily human-interpretable behaviour primitives, such as whether you smiled or not. Tools such as SHAP analysis, among other techniques, can be applied to provide exactly this kind of insight.
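
For illustration, the sketch below applies the open-source shap package to a toy second-stage model trained on a handful of named behaviour primitives. The feature names, data, and model are invented, and the exact shap calls and return shapes can differ between shap versions.

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Invented behaviour primitives with human-readable names.
feature_names = ["smile_intensity", "brow_raise", "gaze_down", "head_tilt", "speech_rate"]
X = pd.DataFrame(rng.normal(size=(500, len(feature_names))), columns=feature_names)
y = (X["smile_intensity"] - X["gaze_down"] + rng.normal(size=500) > 0).astype(int)

# A toy behaviour-insight model.
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Attribute each prediction to the primitives that drove it.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Summarise which primitives push predictions up or down; indexing of the
# positive class differs between shap versions, so handle both layouts.
positive_class = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]
shap.summary_plot(positive_class, X)
```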

In the process of creating this explainable AI, we have gained a separation of concerns between different types of bias. The second stage AI is immune to appearance-based bias, such as the tone of your skin or wrinkles caused by age, while the first stage is immune to behaviour-based bias caused by someone’s cultural upbringing, such as the way Italians talk with their hands or the Japanese social rules about how and when to display anger. This separation of concerns means that we can target our data collection, algorithm design, and validation efforts at only one type of bias for each stage. Efficient.

We have also achieved an adaptable, data-friendly approach. With the first stage complete, any new behaviour insight model can re-use its capabilities, and because its input is so low-dimensional it can do so with very little new data for, say, a new medical condition. We can also separate AI architecture functionality: the first stage solves computer vision and audio signal processing problems such as illumination variation, signal noise, and perspective, while the second stage can forget about those problems and focus on long-term temporal dynamics, causality, inter-dependencies, co-occurrences, hidden latent variables, and so on.
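
As a small illustration of that division of labour, the sketch below summarises a synthetic time series of first-stage primitives into longer-term temporal statistics of the kind a second-stage model could reason over. The numbers and feature choices are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# A hypothetical 60 seconds of first-stage output at 30 fps:
# 1,800 frames, each described by ~200 behaviour primitives.
primitive_sequence = rng.normal(size=(1800, 200))

# Stage two never sees pixels or waveforms; it can work directly with
# longer-term dynamics of the primitives, for example:
temporal_features = np.concatenate([
    primitive_sequence.mean(axis=0),  # average expression levels over the minute
    primitive_sequence.std(axis=0),   # how variable each behaviour was
    primitive_sequence[-600:].mean(axis=0) - primitive_sequence[:600].mean(axis=0),  # drift from start to end
])

print(temporal_features.shape)  # 600 values summarising 1,800 frames
```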

Some critics would still argue that the first stage behaviour primitive detection remains a black box, and that you can’t explain how it went from the input pixels to the detection of, say, the direction of your gaze. This is not a useful criticism, though, because explanations are not required at that level. We accept the reading of a digital thermometer without an explanation of the material properties of its temperature-sensitive components or the digital processing applied to the signal; we only need to know that it has been shown to work in all conditions of its intended use. Where explainability is important is in decision making: which factors were crucial in a decision such as a medical diagnosis, or the decision to hand control of the car over to the autopilot. Our two-stage approach allows exactly that: we can explain the decisions made by the behaviour insights stage in readily understandable human terms such as facial muscle actions, head pose, and what someone says.

Acknowledgements

Thanks to Jamie Twycross for the link to that great SHAP article aimed at those who need to actually explain the outcome of a SHAP analysis!
