Using Furhat and Rasa to Assist when You Forget a Word Mid-Sentence: A Student Group Project

Angus Addlesee · Published in Heartbeat · 7 min read · May 26, 2020

Introduction

Every single one of us will intermittently try to conjure the right word for a given moment, pausing mid-sentence to try and remember it. For example, forgetting the name of a place caused this hesitation:

“He was at that coffee shop on the corner called umm… Peppers”

We can all relate to this, but as cognition declines, these pauses become more common and more pronounced. As fellow humans, we’ll adapt and be patient, giving someone time to think—maybe even suggesting the completion as shown here:

A: “I got some fruit and…”
B: “vegetables?”
A: “Yeah, them. I need a coffee.”

Voice assistants (Amazon Alexa, Google Assistant, etc.) need to become more naturally interactive in order to do this. They often mistake these pauses for the end of a sentence and frustratingly reply with something like “I’m sorry, I didn’t quite catch that”.

The accessibility of voice assistants is now more crucial than ever, so a team of students have been exploring this challenge with the Interaction Lab at Heriot-Watt University.

[Image: a screenshot from the demo of our system, shown in the video below]

We’d planned to run a user evaluation with older adults, but this was not possible due to coronavirus. We’ve decided to make a small donation to The Open Door, a local charity (for us) that aims to reduce loneliness by providing support to older adults and those with additional support needs. During this period of isolation, they’ve been on the phone to provide a friendly voice to those who need it. If you’d like, you can read The Open Door’s story.

Team

This project was planned, designed, and built over a 12-week period as part of a course called “Conversational Agents and Spoken Language Processing”. If this is a class or an area that interests you, take a look at the MSc in Conversational AI at Heriot-Watt.

Students:

Pablo Lopez Santori (LinkedIn)
Arthidevi Balavignesh (ab237 at hw.ac.uk)
Meriam Moujahid
Sunbul M. Ahmad (sma20 at hw.ac.uk)
Sethu Lekshmy (LinkedIn)
Andeol Ginoux

Supervisors:

Angus Addlesee (Medium, Twitter, LinkedIn)
Verena Rieser (Twitter, LinkedIn)
Oliver Lemon (Twitter, LinkedIn)

What We Achieved: Developing our System

TLDR: While interacting with smart devices in a living room, our system can detect a pause mid-sentence and suggest what the user may be asking. If predicted correctly, the user can confirm that choice with just their voice. Our system then brings the full user request together and completes the action. We embodied our system using Furhat, and used Alana to handle complete or out-of-scope utterances.

The Data

Voice assistants in the home are often used to interact with smart-home devices. These devices can help people live in their own homes longer and more independently. For this reason, they’re even recommended by many related charities (on a case-by-case basis, of course).

In order to analyze which devices people interact with the most, we investigated the Fluent Speech Commands dataset. It contains over 30,000 annotated smart-home commands uttered by 97 speakers. From the data, we concluded that the living room was a good starting point. We chose four devices: the lights, heating, music, and TV. As a result, only 19.8% of the commands in the dataset were deemed out-of-scope.
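If you’d like a feel for that kind of analysis, here is a minimal sketch in Python using pandas. The file path and the device labels below are placeholders for illustration, not our actual analysis code, so adjust them to wherever your copy of the dataset lives:

```python
import pandas as pd

# The dataset ships CSV metadata annotating each command with an action,
# an object (device), and a location. The path below is a placeholder.
df = pd.read_csv("fluent_speech_commands_dataset/data/train_data.csv")

# Which devices/objects do people address most often?
print(df["object"].value_counts())

# Rough share of commands outside our four living-room devices.
# These labels are illustrative, not the dataset's exact vocabulary.
in_scope = {"lights", "heat", "music", "tv"}
out_of_scope = ~df["object"].str.lower().isin(in_scope)
print(f"Out-of-scope commands: {out_of_scope.mean():.1%}")
```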

Example Conversations

In order to illustrate our goal interactions, we’ve provided a couple of mock dialogues. You can watch some more examples in action at 1:58 in the video below.

Person: “Can you turn up my…”
System: “Turn up your heating or music?”
Person: “Ah, my music!”
System: “No problem, I have turned up your music.”

In the above example, the system detects the long pause and suggests two predictions. The user then selects the correct prediction by naming the device. Finally, the system combines the “music” selection with the previous “turn up” request to “turn up the music”.

Person: “Please turn on the…”
System: “Lights, heating, TV?”
Person: “The second one.”
System: “Ok, I have turned your heating on.”

In this example, three predictions were made and the user numerically selected the item. The intent to “turn on” and the selection of “heating” were then combined to take the action “turn on the heating”.

You can see three more examples like these working in this video (at 1:58).
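To make the selection and combination step concrete, here is a small, self-contained sketch of that logic in Python. The names are ours, purely for illustration; this is a simplification of what the dialogue manager described further down actually does:

```python
# Illustrative sketch of combining a pending intent with the user's selection.
ORDINALS = {"first": 0, "second": 1, "third": 2, "last": -1}

def resolve_selection(reply, predictions):
    """Map the user's follow-up to one of the suggested devices, or None."""
    reply = reply.lower()
    # The user names a device directly: "Ah, my music!"
    for device in predictions:
        if device in reply:
            return device
    # The user picks by position: "The second one."
    for word, index in ORDINALS.items():
        if word in reply:
            return predictions[index]
    return None  # no selection -> reset the state and say something friendly

def complete_request(pending_intent, reply, predictions):
    """Join the earlier partial request with the selected device."""
    device = resolve_selection(reply, predictions)
    return f"{pending_intent} the {device}" if device else None

# "Please turn on the..." -> predictions from the utterance completion model
print(complete_request("turn on", "The second one.", ["lights", "heating", "tv"]))
# -> "turn on the heating"
```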

Virtual Furhat

People travel great distances to have meetings in person instead of just phoning, and there is a similar difference between talking to Siri and talking to a voice assistant embodied in a robot. While on the phone, you miss many of the signals (nods, gaze, brow raising, etc.) that we use to guide our communication every day.

“Current voice systems cannot show facial expressions, indicate attention through eye contact, or move their lips.” — Beginning to Replicate Natural Conversation in Real Time

For these reasons, we integrated our system with Furhat to make the interactions more natural and more engaging. We planned to use a physical Furhat robot but switched to the virtual Furhat when the university closed.
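If you’d like to try driving Furhat (physical or virtual) yourself, the furhat-remote-api Python package exposes the robot’s speech, listening, and gestures. The snippet below is a generic sketch of that package rather than our project’s integration code, so treat the voice and gesture names as examples:

```python
from furhat_remote_api import FurhatRemoteAPI

# Connect to a running Furhat; the virtual robot in the SDK listens on localhost.
furhat = FurhatRemoteAPI("localhost")
furhat.set_voice(name="Matthew")

# Suggest the predicted completions out loud, with a gesture so the
# interruption feels less abrupt.
furhat.gesture(name="BigSmile")
furhat.say(text="Turn up your heating, or your music?", blocking=True)

# Listen for the reply; the recognised text would then be handed to the
# dialogue manager.
reply = furhat.listen()
print(reply)
```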

Getting More Technical

In the video above, the system architecture is summarized at 0:59. Our system contains the following components:

  • Speech Recognition (we used Furhat) - converts the user’s voice to text.
  • Incomplete Utterance Detector (LSTM) - as the name suggests, this processes the text to detect whether an utterance is complete or not. The Fluent Speech Commands dataset contained many similar utterances, so we identified key split points and trained an LSTM on the split utterances (a rough sketch of such a classifier follows this list).
  • Alana - handles any complete utterances. Alana is an open-domain conversational agent so can chat about almost anything.
  • Utterance Completion (Rasa) - receives a user’s incomplete utterance and predicts what the user wants to say. These predictions are filtered by likelihood and converted into rule-based natural language responses.
  • Text-to-Speech (Furhat) - converts this natural language prediction to audio and asks the user. The user’s response is converted to text.
  • Dialogue Manager (partly Rasa) - receives the user’s response and manages the interaction. Firstly, the user either selects a device (by naming it or selecting it from the list e.g. “the last one”) or indicates that the predictions were incorrect. If incorrect, the state resets and our system says something friendly to the user. If a selection is made, the selected device and intent from the utterance completion model are combined. The fully-resolved user intent, such as “turn off TV”, is the output.
  • Interface (rule-based) - receives the user’s intent and displays the action taken in the virtual living room. A confirmation of the action taken is also generated, converted to audio, and spoken out loud by Furhat.
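To give a rough idea of the incomplete utterance detector, the sketch below trains a tiny Keras LSTM to classify whether an utterance was cut short. The toy data, architecture, and hyperparameters are placeholders for illustration; they are not the exact model we trained on the split Fluent Speech Commands utterances:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Toy pairs: utterance -> 1.0 if it was cut at a split point (incomplete), else 0.0.
texts = ["turn up the", "turn up the music", "switch off the", "switch off the lights"]
labels = [1.0, 0.0, 1.0, 0.0]

# Map words to integer ids and pad to a fixed length.
vectorizer = layers.TextVectorization(output_mode="int", output_sequence_length=8)
vectorizer.adapt(texts)
X = vectorizer(tf.constant(texts))

model = tf.keras.Sequential([
    layers.Embedding(input_dim=len(vectorizer.get_vocabulary()) + 1, output_dim=32),
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid"),  # P(utterance is incomplete)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, tf.constant(labels), epochs=20, verbose=0)

# At run time, score a new utterance prefix.
score = model.predict(vectorizer(tf.constant(["please turn on the"])), verbose=0)
print(score)
```

In the full pipeline, this classifier sits in front of everything else: utterances it flags as complete go to Alana, while incomplete ones are passed to the Rasa completion model.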

Future Work

As mentioned at the beginning of this article, we planned a deeper evaluation, which was unfortunately postponed due to coronavirus. Despite this, we’re aware of many ways that our system could be improved:

  • Additional data. The Fluent Speech Commands dataset contains many utterances, but they aren’t particularly diverse. There are many repeated utterances and several commands that are not directly related to smart devices, such as “bring me my shoes”. Including a larger and more diverse range of commands could help improve several aspects of the system.
  • Extending device platforms and domains. We only handle a limited number of devices in the living room. With additional data, we could extend the capabilities of our system.
  • Improve incomplete utterance detection. The LSTM we trained had a 99.5% accuracy on the Fluent Speech Commands dataset. As mentioned above, however, this dataset is not very diverse. Training with additional data would make this model more robust to real user utterances.
  • Integration with more sophisticated end-of-turn (EOT) prediction. Our system relies on external end-of-turn prediction and is therefore still turn-based. To significantly improve the fluidity of the conversation, our system needs to integrate fully with an advanced EOT prediction model.
  • Implement a re-ranker. Due to time constraints, we didn’t manage to complete our plan for a re-ranker. This would filter and re-order the predictions, based upon contextual information. For example, not predicting the activation of the lights when all of the lights are already on. Using computer vision and Furhat’s inbuilt camera, we could even prioritize the object that the user is looking at.

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletters (Deep Learning Weekly and the Comet Newsletter), join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.
