Papercup, the U.K.-based AI startup whose speech technology translates people’s voices into other languages, and which is already used in the video and television industry, has raised £8 million in funding.
The round was led by LocalGlobe and Sands Capital Ventures, alongside Sky, GMG Ventures, Entrepreneur First (EF) and BDMI. Papercup says the new capital will be used to fund further machine learning research and to expand its “human-in-the-loop” quality control functionality, which is used to improve and customise the quality of its AI-translated videos.
Meanwhile, Papercup’s existing angel investors include William Tunstall-Pedoe, the founder of Evi Technologies — the company acquired by Amazon to create Alexa — and Zoubin Ghahramani, former chief scientist and VP of AI at Uber and now part of the Google Brain leadership team.
Founded in 2017 by Jesse Shemen and Jiameng Gao, who met while going through EF’s company builder program, Papercup is building out an AI and machine learning-based system that it says is capable of translating a person’s voice and expressiveness into other languages. Unlike much text-to-speech output, the startup claims the resulting voice translation is “indistinguishable” from human speech, and, perhaps uniquely, it attempts to retain the characteristics of the original speaker’s voice.
Initially, the tech is being targeted at video producers; it is already used by Sky News, Discovery and YouTube stars Yoga with Adriene, along with DIY content creators. It is pitched as a far more scalable, and therefore lower-cost, alternative to purely human dubbing.
“Most of the world’s video and audio content is shackled to a single language,” says Papercup co-founder and CEO Shemen. “That includes billions of hours of videos on YouTube, millions of podcast episodes, tens of thousands of classes on Skillshare and Coursera, and thousands of hours of content on Netflix. Almost every content owner is scrambling to go international, but there is as yet no simple and cost-effective way to translate content beyond subtitling”.
For deep-pocketed studios, there is of course the option of high-end dubbing via a professional dubbing studio and voice actors, but this is far too expensive for most content owners. And even wealthy studios are often constrained in terms of how many languages they can accommodate.
“That leaves the mid and long tail of content owners — literally 99% of all content — stranded and incapable of reaching international audiences beyond subtitling,” says Shemen, which, of course, is where Papercup comes into play. “Our aim is to generate translated voices that sound as close to the original speaker as possible”.
To do that, he says, Papercup will need to tackle four things. First up is creating “natural-sounding” voices, i.e. how clear and human-like the synthetic voices sound. The second challenge is retaining emotion and pacing, reflecting how the original speaker expressed themselves (think: happy, sad, angry, etc.). Third is capturing the uniqueness of someone’s voice (e.g. Morgan Freeman, but in German). Lastly, the translated audio needs to be correctly aligned with the video itself.
Explains Shemen: “We started off by making our voices as human-like and natural sounding as possible, where we’ve made quite a significant leap in terms of quality by honing our technology to the task, and today we have one of the best Spanish speech synthesis systems in production.
“We’re now focusing on better retainment and transfer of the original emotion and expressiveness in the original speaker across languages, and meanwhile figuring out what it is exactly that makes for quality dubbing”.
The next challenge and arguably the toughest nut to crack is “speaker adaptation,” described as capturing the uniqueness of someone’s voice. “This is the last layer of adaptation,” notes the Papercup CEO, “but it was also one of our first breakthroughs in our research. While we have models that can accomplish this, we’re focusing more of our time on emotion and expressiveness”.
That’s not to say Papercup is entirely machine-powered, even if it might be one day. The company also employs a “human-in-the-loop” process to make corrections and adjustments to the translated audio track. This includes correcting any speech recognition or machine translation errors that crop up, adjusting the timing of the audio, and tweaking the emotion (e.g. happy, sad) and speed of the generated voice.
How much human-in-the-loop work is required depends on the type of content and the priorities of the content owner, i.e. how realistic or polished the resulting video needs to be. In other words, it isn’t an all-or-nothing proposition: good enough will be more than enough for a swathe of content owners at scale.
Asked about the technology’s beginnings, Shemen says Papercup started with research conducted by co-founder and CTO Jiameng Gao, “who is incredibly smart and oddly obsessed with speech processing”. Gao completed two Master’s degrees at the University of Cambridge (in machine learning and in speech and language technology) and wrote a thesis on speaker-adaptive speech processing. It was at Cambridge that he realised that something like Papercup was possible.
“When we started working together at Entrepreneur First at the end of 2017, we built our initial prototype systems that showed that this technology was even possible despite there being no precedent for it,” says Shemen. “Based on early conversations, the demand was clearly overwhelming for what we were building — it was just a function of actually building something that could be used in a production environment”.