For years, finding a reliable transcription service has proved notoriously difficult, verging on impossible. No wonder, then, that when Otter.ai launched into the fray it was seized upon by a baying mob of journalists quick to broadcast its attributes across Twitter.

So how is Otter's AI transcription service better than almost any other on the market, including those from tech giants like Amazon and Google?

© Otter.ai

Otter’s response hinges on 'data'. The team say that the software has been trained on millions of hours of audio input collected from a range of public sources including audio books, parliamentary proceedings, TV and radio programming, and podcasts.

According to the company, it makes use of creative proprietary algorithms that scour the web for usable audio segments. Even so, isolating training material remains difficult, given the mess of noise contained in raw data. “There is supervised learning and some unsupervised learning that does not require fully labelled data,” says Sam Liang, founder and CEO of Otter’s parent company, AISense.

The company believes these methods of data gathering have given it an edge over competitors. “Deep learning, which we are using, depends on a huge quantity of data and most AI companies don't actually have a sufficient amount of data to train their engine,” Liang explains. “We have thousands of machines in the cloud that are constantly crunching the data to further enhance the model.”

Although not yet perfect, Otter does a sound job on clear audio and is effectively free for individuals, offering up to 600 complimentary minutes per month. The success of the service has been reflected in a host of accolades, such as being named by Fast Company and Mashable as one of the best apps of the year.

Many have expressed incredulity over the generous pricing model of the service. However, as the saying goes ‘when the product is free, chances are, you are the product’. In this case, the Otter transcription service ingests users’ audio data and feeds it into the algorithms. The more than six million audio clips uploaded (as of February 2019) become part of a vast database of sound files. Users help refine the algorithms further still, by making corrections to the transcribed text.

The startup is also attempting to perfect two related techniques: diarization, the method of classifying separate speakers, and speaker recognition, which assigns each person’s voice a unique code, like a fingerprint. To learn your voice, the app requests that you record a sentence that contains your name and profession, as well as how you heard about the service. You're also encouraged to sync your Google contacts and calendar with the service and to label your interlocutors with their Google account. This is ostensibly to increase the ease of sharing transcriptions across offices.
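To make the 'fingerprint' idea concrete, here is a minimal sketch of how a voice segment might be matched to an enrolled speaker. It is not Otter's method, which the company has not published. It assumes each voice has already been reduced to a short numeric vector (an 'embedding') and compares vectors by cosine similarity; the names, the toy three-number vectors and the 0.8 threshold are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify_speaker(embedding, enrolled, threshold=0.8):
    """Return the enrolled name with the most similar voiceprint.

    Returns None when no voiceprint reaches the (hypothetical) threshold,
    i.e. the segment is treated as an unknown speaker.
    """
    best_name, best_score = None, threshold
    for name, voiceprint in enrolled.items():
        score = cosine_similarity(embedding, voiceprint)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical enrolled voiceprints; real embeddings have hundreds of values.
enrolled = {
    "alice": [0.9, 0.1, 0.2],
    "bob": [0.1, 0.8, 0.5],
}

# Toy embeddings for three consecutive segments of a conversation.
segments = [[0.88, 0.15, 0.18], [0.05, 0.9, 0.45], [-0.5, -0.5, 0.7]]
segment_labels = [identify_speaker(e, enrolled) for e in segments]
print(segment_labels)
```

Labelling each segment in turn is a crude stand-in for diarization. The third segment matches nobody closely enough and is left unlabelled, which is the case the app's enrolment sentence is designed to reduce.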

The company is also getting into the enterprise, initially with the launch of an Otter for Education service targeting universities and followed up in early 2019 with the launch of Otter for Teams. Current partners include Harvard University, Tulane University and enterprise partners such as Bridgewater Associates, Vice’s Virtue Worldwide creative agency and Moss Adams.

Otter has also announced a partnership with meeting and productivity software Zoom to offer an automatic meeting transcription service and, according to the team, has already processed millions of meetings - excellent news for the hungry algorithms.

Built from the ground up

Unlike other companies experimenting with voice, Otter doesn’t rely on technology sourced from the likes of Google, Microsoft, or IBM. “We don't depend on them; we built the whole system from the ground up,” says Liang.

This is for good reason. The voice problems that these companies are working on are primarily in the context of voice activated assistants - Alexa, Cortana or Siri - and how these interpret users’ questions. This is a somewhat simpler problem than the one facing Otter, which concerns decoding multiple different speakers in natural, meandering conversation.

“People speak much faster, people interrupt each other, people have different accents and different speaking speeds,” notes Liang. “Some people are in a position closer to the microphone, some farther away, so their volumes might be different.” These myriad factors all increase the difficulty of developing effective speech recognition software for this use case.

“We actually do a lot with noise tolerance training,” Liang continues, noting the pervasive rumbling of AC or whirring of fans that often clog up the audio. “When you're in a café or restaurant or at a conference event, there's a lot of background noise,” he says. “So we have to do a lot of assimilation of noise and inject noise into the training data to make our algorithms very robust against background noise.”
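Injecting noise into training data, as Liang describes, is commonly done by mixing a clean recording with a noise recording at a chosen signal-to-noise ratio (SNR). The sketch below shows that general technique only; Otter's actual pipeline is not public. The function names, the synthetic 440Hz tone standing in for speech, the random noise and the 10dB target are all assumptions made for illustration.

```python
import math
import random


def mean_power(samples):
    """Average squared amplitude of a list of audio samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to hit snr_db."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = mean_power(speech)
    noise_power = mean_power(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("both signals need non-zero power")
    # SNR(dB) = 10 * log10(speech_power / scaled_noise_power); solve for scale.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """SNR in dB of a noisy signal, given the clean speech it was built from."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))


# One second of a synthetic tone at 16kHz stands in for clean speech.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

noisy = mix_at_snr(speech, noise, snr_db=10.0)
print(round(measured_snr_db(speech, noisy), 3))
```

In practice the noise would be a recording of a café, fan or air conditioner rather than random numbers, and the same clean clip would be mixed at several SNR levels so that the model hears it under a range of conditions.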

Liang contrasts the complexity of the software running beneath the product with its deceptively straightforward appearance, comparing it to a self-driving car. “The UI is extremely simple - your car takes you to your destination. But to achieve that, there’s 30 years of work behind it.” Although Otter has been working on this issue for three years, since parent company AISense was founded in 2016, Liang notes that scientists have been grappling with these problems for 10 to 15 years, and that the company has drawn on this research.

The startup is focused not just on speech recognition but also on natural language processing (NLP), the task of deriving meaning from language. For the likes of Google and Amazon, this is necessary for the AI assistant to be able to craft appropriate responses to the user. Otter says that NLP is necessary for its own software to be able to analyse meeting notes, understand what was discussed and summarise it effectively - capabilities that are envisaged for the service down the line.

The startup combines an impressive array of talent, with employees hailing from Yahoo and Facebook, and most of the AI experts working on the problem reportedly plucked from Google’s AI team.

Liang himself is a former Google architect whose resumé includes dreaming up the ‘blue dot’ for Google Maps. How has the startup managed to attract this crop of tech talent? Liang says it’s the excitement of racing against tech giants on an intellectually challenging and less attempted voice problem: freeform dialogue. Otter's general manager Seamus McAteer puts it down to the pedigree of the co-founders and the startup’s ‘engineering-first’ approach.

Ambient future

At present, Otter says that it's focusing on transcription for the enterprise space. But there are signs that the startup’s ambitions are bigger than office productivity software.

The future of voice is estimated to be vast. As devices powered by the internet of things become more ubiquitous, and voice becomes increasingly integrated with retail, and AI assistants become call centre operators, there’s no doubt that voice’s future is within touching distance.

However, voice may have an even bigger role to play than most of us can imagine - with the dawn of 'ambient intelligence', which can be variously conceptualised as a class of technologies or an entire philosophy of the role that technology will come to play in our lives.

The outlook predicts that technology, like ambient music, will one day fade into something akin to the soothing murmur of background noise, pervading and subtly moulding our environment and consciousness without overtly announcing itself. Every metric - from both ourselves and our environments - will be measured and used to adjust optimal output. Devices and interfaces will melt away to leave in their place the seamless melding of technology with the physical world.

Voice is integral to this vision of the future because it removes the physical barrier between you and your desired outcome. “More and more we’re spending our time thinking about the devices,” Scott Huffman, Google’s VP of engineering for conversational search told Time in 2016 of ambient intelligence. “With these devices, voice is really the only option.” Technology will be everywhere, but we won’t see it.

Also bundled up in this view of ambient intelligence is technology's slow pivot from ‘understanding behaviour’ to ‘predicting behaviour’. The theory goes that with all this tracking and analysis, instead of simply explaining behaviour, systems will be able to foresee and proactively meet demands before they’ve been uttered, or even mentally assembled. This was the intention with Google Now, which began life with ambitions to become an ‘omniscient assistant’ that would combine voice search and analytics to predict users’ needs. It has since been cut down to a purely traffic and travel focused product.

Hidden in plain sight are clues that this is the world that AISense is betting on becoming part of. For starters, the platform’s trademarked name for its tech is ‘Ambient Voice Intelligence’.

"We believe the best technologies fit seamlessly into a user’s daily workflow, and are smartly contextual and personalised. That’s why we call our technology Ambient Voice Intelligence™. No wake words required. It’s always on," reads the site.

While Otter's site is more clearly targeted towards productivity and enterprise, AISense’s website hints at more sweeping ambitions. "What if you could easily analyse and share all your voice conversations?" it asks. "People spend a lot of time talking, but most of what they say is forgotten".

The word ambient crops up again in Liang and co-founder Yun Fu’s recent career. The pair previously worked for Alohar Mobile (a startup acquired by Alibaba), on an ‘ambient-sensing and location platform’ that allows mobile developers to build ‘contextually-aware’ apps that understand users’ behaviour and present them with customised services at the right place and time without prompting - dubbed a ‘predictive Siri’.

In a world where voice is the new medium, your unique voice profile becomes the equivalent of an IP address, and mining it for data becomes the new frontier, all the more so as emotional metadata (also known as sentiment analysis) is layered on top. While this might sound like science fiction right now, it’s very much where the world is headed. And in this world, tying a speaker's identity to a data profile will become increasingly important.

This is what Otter is working on, and what’s more, it’s already linked to Google accounts, priming it for acquisition by the giant at some point in the future. Otter has already betrayed its ideological alignment with the company by initially mining the audio clips people uploaded to better serve ads (only stopping once it was called out by ZDNet).

The ‘teach Otter your voice’ section instructs the user to say their name and profession. One can imagine the value that even the constellation of simply these three things - name, profession and unique voice print - linked to Google's data profile could have in the world of ambient intelligence.

Of course, ambitions of this kind remain conjecture at present and the company remains outwardly committed to breaking into the enterprise transcription space alone, but it seems unlikely that the team is unaware of the value its service could hold in an ambient, voice-powered future.