Hot Take: Transcripts are biometric data according to the EU AI Act

12 Feb

Written By Yca Tan

Over the last two weeks I had some fun re-reading large chunks of the EU AI Act for a customer. In particular, I was looking at how biometric data is treated. And then it struck me… given the way the EU AI Act interprets what a biometric is, I think voice transcripts would be classed as biometric data!

Transcripts are a literal ‘translation’ of someone’s speech into writing, word for word, hesitations, repetitions, grammar mistakes and all. Transcripts used to be done as a professional service by real people, but nowadays they can be done very reliably by AI tools. Good transcription services produce rich transcripts that include the start and end time of each word, from which you can get speech rates and pauses, as well as non-verbal actions such as coughing, laughing, etc. Sometimes there are even annotations for greater than normal pitch, loudness, or jitter.
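To make that concrete, here is a minimal sketch of how speech rate and pauses could be derived from word-level timestamps. The `Word` structure, the sample data, and the 0.3 second pause threshold are all assumptions made up for illustration; real transcription services each have their own output format.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording


def speech_rate_wpm(words):
    """Words per minute, measured from the first word's start to the last word's end."""
    if not words:
        return 0.0
    duration = words[-1].end - words[0].start
    if duration <= 0:
        return 0.0
    return len(words) * 60.0 / duration


def pauses(words, min_gap=0.3):
    """Silent gaps (seconds) between consecutive words that last at least min_gap."""
    return [
        round(nxt.start - cur.end, 3)
        for cur, nxt in zip(words, words[1:])
        if nxt.start - cur.end >= min_gap
    ]


# Hypothetical transcript fragment: "so ... um I think ... yes"
transcript = [
    Word("so", 0.0, 0.2),
    Word("um", 0.9, 1.1),
    Word("I", 1.2, 1.3),
    Word("think", 1.3, 1.6),
    Word("yes", 2.5, 3.0),
]

print(speech_rate_wpm(transcript))  # words per minute over the fragment
print(pauses(transcript))           # gaps of 0.3 s or longer
```

Even this toy version shows that a timestamped transcript carries behavioural signal (how fast someone talks, where they hesitate) on top of the words themselves.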

The EU AI Act gets a lot of bad press, and mostly for good reasons. However, despite this criticism, I think it does a pretty good job defining what biometric data is, and most importantly, what acceptable low risk or high risk use is, and where its use is entirely prohibited.

The EU AI Act defines biometric data as: “‘biometric data’ means personal data resulting from specific technical processing relating to the physical, physiological or behavioural characteristics of a natural person, such as facial images or dactyloscopic data;”.

Note that behavioural characteristics are part of the included biometric data.

So, on to my sudden thought. The voice is clearly a great source of biometric data. With the right technical processing, you can uniquely recognise people. We all know that from films; my personal favourite is the iconic 1992 film ‘Sneakers’ featuring hackers and voice recognition. A voice transcript creates a written encoding of what you say, keeping a lot of that voice information intact, if not all.

Can you derive someone’s behavioural characteristics? You betcha! I even think you can recognise an individual given enough text, but that’s not even necessary to count as biometric data according to the EU AI Act definition (it IS a requirement in GDPR, but the EU AI Act definition of biometric is broader). So, that makes a transcript biometric data.

Does it matter whether a transcript is considered biometric data under the Act or not? Not necessarily. As you may recall, the most important aspect of the Act is the classification of your system as low risk, high risk, or prohibited. There are many ways to use a transcript that would keep it a low risk system. For example, automatic note taking would be low-risk.

However, there are two popular uses of transcripts that would automatically make a system high-risk if transcripts were biometric data. Annex III point (1) of the Act states that any AI system using biometric data to do biometric categorisation (grouping) or emotion recognition is high-risk. That would include all applications of the popular sentiment analysis techniques, and crucially emotion recognition includes attitude and intention recognition, not just someone’s feelings. So that means using transcripts to get someone’s opinion on any topic would count as a high-risk system, if a transcript is a biometric. Biometric categorisation is using biometrics to group people, e.g. trying to group them by ethnicity, age, political opinion, sexual orientation, you name it. That is certainly possible with transcripts, and again, if transcripts were considered a biometric, these would be high-risk systems.

So, what is your opinion? Is a transcript biometric data? Should sentiment analysis and other transcript analysis use cases be classed as high-risk AI systems? Do you know of a study that either proves or disproves that you can use transcripts to identify people? Get in touch with me, I’d like to know!

I’m not a lawyer or legal expert. My understanding of the EU AI Act comes from my 20+ years as a computer scientist in AI and Automatic Human Behaviour Understanding, my interest in ethics, and my love of legal documents. They’re structured just like code, and equally full of bugs!
