We are so close to the tech in the movie Her


Her is a romantic sci-fi film. Set in a near, cyberpunk-tinged future, it follows Theodore, the protagonist, as he falls in love with Samantha, an AI voice assistant voiced by Scarlett Johansson. ScarJo performs the entire film as a disembodied voice. Without facial expressions, her emotions are conveyed solely through the rise and fall of her euphonious voice. She tells, but does not show.

The role must have been challenging for her, as it would be for anyone. But she pulled off one of the most formidable performances in film history with ease. Those half-hidden chuckles and gasps of air between sentences send subtle but clear messages to the audience. She achieved the next level of “show, don’t tell”: tell, but don’t show.

A decade after its release, Her is becoming real. The tech is here.

“Nothing can stop someone from cutting and pasting my image”

When it comes to deepfakes[1], the first thing that comes to mind is deepfake imagery. It has been around for a while now (and Johansson has suffered from it). Deep learning algorithms seamlessly graft the faces of celebrities onto the bodies of porn performers. A fake video has surfaced on the internet of President Obama saying words that he would never have said. FakeApp, the notorious face-swapping app, is the software used in the video. Any individual is only a few clicks away from being targeted the same way. Seeing is no longer believing.

Now, deepfake audio is around the corner. It has been a holy grail of the industry for years, because a convincing voice has to get two hard things right: understanding language and imitating emotions.

In Her, Samantha chats with Theodore flawlessly. She understands commands, metaphors, and emotions. Natural Language Processing (NLP) is the driving force behind this. Human language is difficult for computers to understand. NLP software acts as a translator that deciphers the meaning in speech. Thanks to advancements in the field, we have products such as Siri and Grammarly.

After understanding the context, Samantha responds with affection. Tone determines the mood of what is said. The same sentence can be spoken at a different pitch and land differently. I feel calm when I hear a slow, grainy voice. I feel the opposite when it is high-pitched.

The following video features a deepfake voice AI made by Sonantic. On my first listen, I had a hard time distinguishing the bot from a real person. I got goosebumps from the fact that an AI was flirting with me. That is surreal and horrifying.

Beyond the authenticity of the audio, the video is raising eyebrows and questions: is it right to deceive listeners? Why did the company personify its AI as a coquettish female? This is just one of many examples of the ubiquitous influence of sexism in the male-dominated tech industry. AI assistants tend to have female names and stereotypically feminine personalities: Apple’s Siri, Microsoft’s Cortana, Amazon’s Alexa, Xiaomi’s XiaoAi, and the list goes on. You get the idea.

While the tech is still maturing, one may be surprised by the sheer number of people already falling in love with AI: scripted bots trained to simulate dating. These bots exchange text messages with the user. It’s only a matter of time before computer-generated voices and images are added. The line between what’s real and what’s not will only get blurrier.

  1. Deepfakes are synthetic media in which a person in an existing image or video is replaced with someone else’s likeness. ↩︎