Sarcastic Robots? How Deep Convolutional Neural Networks Are Making AI Worryingly Human
Technology is incredible at many things. The world’s fastest supercomputer, for example, can perform 200,000 trillion calculations per second - it would take most of us far longer than that to work out how many zeroes that figure would have. Self-driving cars can detect potential hazards faster than human drivers, AI can build a comprehensive picture of a person by their Google search history, and mixed reality glasses can overlay digital information onto the real world. When it comes to language, though, machines struggle.
Developers have been able to build functioning chatbots for years - tech that can get the job done provided you present it with the right questions or demands. Creating something completely natural is another thing altogether. Google is getting close, but its Assistant is programmed to perform relatively simple, contextualised tasks. It does these convincingly, but if it were presented with complaints or queries from customers all day, for example, would it be able to determine tone and act accordingly no matter how niche the information it was being presented with?
Aiding Not Replacing
Potential corporate use-cases of this technology go further than deploying machines to talk to customers, though. Other supplementary applications are already being put into action. One slightly dystopian example comes from insurer MetLife, which is using a program called Cogito to monitor its call-centre agents to ensure they are treating each call with the appropriate enthusiasm.
The program can detect the tone of voice of both the worker and the customer, sending a notification if the worker’s tone is too greatly affected by a day of hearing about bereavement. It reminds them to be conscious of their tone and be as engaged and helpful as they can be for the customer. Emily Baker, a supervisor at MetLife, told Wired: “It’s represented by a cute little coffee cup.”
Another icon, a heart, appears when the customer’s emotional state is heightened, which can supplement an agent’s own judgment when dealing with sensitive situations. The close monitoring of an employee’s tone of voice sounds wildly overbearing but is (ostensibly) designed with the good intention of accounting for and mitigating employee fatigue, allowing them to make more money from commission.
The Lowest Form of Wit?
Sarcasm has been the obvious stumbling block for natural language processing ever since the technology was first discussed. Machines can accurately make sense of the words being spoken and even the emotion underlying a sentence, but gauging whether or not a comment was sarcastic presents a unique challenge.
A paper on sarcasm detection found that it is “very topic-dependent and highly contextual, therefore, sentiment and other contextual clues help to detect sarcasm from text. Pre-trained sentiment, emotion, and personality models are used to capture contextualized information from text.”
Emotions are deeply complex and humans often find it difficult to identify exactly the manner in which they are being spoken to. We are probably some way away from entrusting a machine to perform this and allowing it to act autonomously while expecting accurate results.
Enter Deep Convolutional Neural Networks
The next step in natural language processing is the introduction of deep convolutional neural networks. Primarily used to classify images, perform object recognition within scenes and cluster them by similarity, these are deep artificial neural networks that can identify everything from faces and street signs to tumors.
Deep convolutional neural networks represent a significant jump forward in deep learning for computer vision, but there are clear applications in text and natural language processing that are beginning to be explored.
BDJ spoke with Erik Cambria, Assistant Professor of Computer Science and Engineering at the Nanyang Technological University in Singapore. Erik’s work has explored the possibility that deep convolutional neural networks could be harnessed for tone detection in speech.
What Are Deep Convolutional Neural Networks?
“In order to properly explain what deep CNNs are, I need to explain what a CNN is first and then explain deep networks,” Erik says.
“CNNs were originally thought of for computer vision in order to take as much as possible into account contextual information. For each pixel in an image, CNNs could encode not only the pixel information but also the information about its neighbouring pixels. Later on, we realised that the same paradigm can also be applied to text if you replace pixels with words and neighbouring pixels with words that come before and after each target word. Before CNNs, each document was like a bag of words with no order and no context. After CNNs, no word was an island anymore!
“Now, deep networks. Neural networks were invented 75 years ago. The reason why they have become ’sexy’ again is that now, by using a cascade of multiple layers, we do not need to perform feature extraction anymore as the network can automatically learn multiple levels of representations that correspond to different levels of abstraction.
“Before deep nets, if I had to teach a neural network how to recognize images of cars, I had to specify features such as wheels and windshield first. After deep nets, I can simply show my neural network a lot of images of cars and this will automatically extract those features.”
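Erik's contrast between a bag of words and a convolution over neighbouring words can be made concrete with a toy sketch. This is an illustration only, not the model from his papers: the two-dimensional word vectors, the kernel values and the function names are all made up for the example, whereas a real network would learn its vectors and filters from data.

```python
from collections import Counter

def bag_of_words(tokens):
    """Count words, discarding order and context."""
    return Counter(tokens)

def context_windows(tokens, radius=1, pad="<pad>"):
    """Pair each word with the words before and after it."""
    padded = [pad] * radius + list(tokens) + [pad] * radius
    width = 2 * radius + 1
    return [tuple(padded[i:i + width]) for i in range(len(tokens))]

def conv1d(vectors, kernel):
    """Slide one filter over a sequence of word vectors (no padding)."""
    width = len(kernel)
    features = []
    for i in range(len(vectors) - width + 1):
        window = vectors[i:i + width]
        features.append(sum(
            weight * value
            for kernel_row, vector in zip(kernel, window)
            for weight, value in zip(kernel_row, vector)
        ))
    return features

# Hypothetical 2-dimensional word vectors, invented for this example.
embeddings = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [-1.0, 0.0]}
# Hypothetical filter spanning three consecutive words.
kernel = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

first = "dog bites man".split()
second = "man bites dog".split()

# Same words, so the bags are identical and the difference in meaning is lost.
same_bag = bag_of_words(first) == bag_of_words(second)
# The windows keep each word's neighbours, so the two sentences differ.
same_windows = context_windows(first) == context_windows(second)

feature_first = conv1d([embeddings[w] for w in first], kernel)
feature_second = conv1d([embeddings[w] for w in second], kernel)
print(same_bag, same_windows, feature_first, feature_second)
```

The two sentences contain exactly the same words, so a bag-of-words view cannot tell them apart, while the sliding filter produces a different value for each because it sees every word together with its neighbours.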
Recently, neural networks have been shown to improve the performance of speech recognition. A report from Microsoft goes one step further, finding that convolutional neural networks can reduce error rates even more and make these programs more effective. Beyond its practical purposes, this technology serves as an example of the versatility of deep learning: something initially used for image recognition can be just as effective in a different medium.
Why is Sarcasm so Difficult to Detect?
Why, then, is complex speech relatively straightforward for a machine to map but sarcasm so difficult to detect? Well, think back to an email or text conversation you might have had with a friend, in which sarcasm has been used. It is not always immediately obvious that the tone of the sentence was sarcastic. Often, it can get misconstrued when the speech is abstracted, even if we have the context and we know the person speaking. For machines, this detection can be impossible.
“It is difficult because in sarcastic speech one often says something but means something else,” Erik tells us. “In standard communication, we tend to convey our message in the most direct and unambiguous way. When we are being sarcastic, however, there is an implicit meaning or intention that goes beyond words.”
“It is a difficult task for machines as they do not usually go beyond text, e.g., they do not have common sense. But it is also difficult for humans sometimes who tend to disambiguate by collecting other multimodal clues such as tone of voice or facial expressions. So, if it is hard even for humans to detect written sarcasm, we cannot expect a machine to do better than us.”
On the Brink of a Breakthrough
How do deep convolutional neural networks relate to sarcasm detection, though? Yes, reducing inaccuracies in speech recognition is hugely important for the development of the technology, but this does not necessarily mean that they will be any better at detecting double meaning or the tone with which something is said. We put this to Erik, who believes we are not necessarily too far away from machines understanding sarcastic sentiment.
“We already have multimodal sarcasm detection systems,” he says. “In 2016, we have taken ‘A Deeper Look into Sarcastic Tweets Using Deep Convolutional Neural Networks’ (COLING paper). We kept improving the system after that with an algorithm called CASCADE (published in COLING 2018). And more recently, we have been applying multitask learning to jointly classify sentiment and sarcasm.
“The performance is quite good, especially considering that even humans sometimes cannot get sarcasm. But there is a huge gap between detecting sarcasm and understanding sarcasm. The same gap that there is today between natural language processing and natural language understanding.”
This is an important point - the fact that a machine can detect that a point was made sarcastically does not necessarily mean it has the capacity to understand the true meaning. Humans do this instinctively. Ordinarily, the meaning is the reverse of what the person is saying when speaking sarcastically, but this is not always the case and the opposite of perceived meaning isn’t always that easy to identify.
Tone is the next big leap that natural language processing has to take before we can build anything truly believable. Google Assistant’s ability to manage the unexpected tangents involved in booking a table at a restaurant is impressive, but you have to imagine it would be fairly easily tripped up by sarcasm or contextually understandable local slang.
Deep convolutional neural networks could bring natural language processing up to speed to plug this gap in ability, adding the capability to detect the nuances of tone. Sarcasm is a useful benchmark because it is one of the most difficult elements of speech to detect - if it can be successfully identified and understood by machines, the next level of personal assistant will be on the horizon.
Some AI projects - like the one used by MetLife - are built with the intention of aiding humans in their roles. Others, no doubt increasingly, will be geared more towards performing the role autonomously in the place of humans. Arguably the company best placed to make this a reality is Google, which announced at its Cloud Next conference last year that it is working on what it calls Contact Center AI software, alongside over a dozen partners including Cisco and Vonage.
The reasoning behind the development is that, often, call centre staff are bombarded with simple transactional or informational requests. This means repetitive and busy work for the staff, as well as greater pressure to grind through calls quickly (to the detriment of those with more complex problems). Using AI to handle the mundane would free up call centre staff to handle difficult queries and significantly reduce waiting time for callers. It is an example of AI’s gentle introduction, working alongside humans and only partially replacing them.
This is unlikely to be Google’s end goal, though. Projects like Duplex demonstrate that Google is getting close to making a machine that sounds indistinguishable from humans. If it can eventually create an AI capable of dealing with even incredibly complex problems, the whole landscape of customer service could be changed.
Illustrations by Kseniya Forbender
Editor responsible for this story: Margarita Khartanovich