In the virtual world we live in, it's more often than not that we see a picture of someone before hearing their voice, or talk to them on a conference call before meeting them in person. Then, when we do finally put a face to a voice or vice versa, it's startling how different our imagined version of this person is from reality. Apparently, this is a distinctly human problem. A recently published paper titled "Speech2Face: Learning the Face Behind a Voice" describes how its authors used artificial intelligence to compose a presumed face from short audio clips. The team has discovered how to reconstruct a face from a person's voice using their Speech2Face technology. If these researchers had been around at the dawn of the Internet era, Catfish would never have been an MTV show or a thing that still plagues so many lonely hopefuls today.
The Speech2Face team made it clear that their goal was "not to predict a recognizable image of the exact face, but rather to capture dominant facial traits of the person that are correlated with the input speech." In essence, they weren't looking to pick someone out of a lineup. They wanted to pull off a full-on Titanic moment: a computerized Jack painting Rose without ever being able to see her.
The authors did touch on ethical considerations, since security is so interwoven with facial information in today's world. They defended their method by stating that an exact reproduction of a person's face cannot be achieved with Speech2Face. So, no, you cannot hack into your significant other's phone with this, sorry. The technology's inability to reconstruct a perfect likeness is due to the fact that it is "trained to capture visual features (related to age, gender, etc.) that are common to many individuals." But, yes, you could avoid a "catfish" scenario. Speech2Face draws from this bank of common features to Picasso together what the subject should look like based on their speech, but instead of looking wonky like many of Picasso's works, the faces produced look glaringly average.
Speech2Face's expertise, like that of so many of us with Internet access, is founded on endless hours spent watching videos on YouTube. Its training set comprised videos of people talking. From there, the Speech2Face team created a neural network-based model that maps vocal attributes to facial features based on the video data set.
With this information, Speech2Face creates pairs of speech patterns and facial patterns. The face images are encoded into feature vectors by the technology's facial recognition model, and each voice waveform is converted into a type of spectrogram to account for the diversity and uniqueness of voices.
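To make the spectrogram step concrete, here is a minimal, illustrative sketch in plain NumPy of how a raw waveform gets turned into a log-magnitude spectrogram before being fed to a voice encoder. The function name and parameter values are my own choices for illustration, not the paper's actual preprocessing pipeline.

```python
import numpy as np

def log_spectrogram(waveform, frame_len=512, hop=256):
    """Slice a 1-D waveform into overlapping frames, apply a Hann
    window, and take the log-magnitude of each frame's FFT --
    a simple stand-in for the spectrogram a voice encoder consumes."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(waveform) - frame_len) // hop
    frames = np.stack([waveform[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(magnitude)  # shape: (n_frames, frame_len // 2 + 1)

# Example: one second of a synthetic 440 Hz tone at 16 kHz,
# standing in for a snippet of recorded speech.
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t)
spec = log_spectrogram(wave)
print(spec.shape)  # (61, 257)
```

The resulting time-by-frequency grid is what lets a neural network "look at" a voice the same way its face branch looks at an image.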
Besides helping jealous and/or suspicious mates or stealing someone's identity, you might be asking how the heck Speech2Face is useful. The team replied, "We believe that predicting face images directly from voice may support useful applications, such as attaching a representative face to phone/video calls based on the speaker's voice." Yeah, I'm going to stick with hacking into someone's phone.
Cohen, N. (2019, June 13). Connecting the dots between voice and a human face. Retrieved from https://techxplore.com/news/2019-06-dots-voice-human.html