Like Henry Higgins, the phonetician from George Bernard
Shaw’s play “Pygmalion,” Marius Cotescu and Georgi Tinchev recently
demonstrated how their student was trying to overcome pronunciation
difficulties.
The two data scientists, who work for Amazon in Europe, were
teaching Alexa, the company’s digital assistant. Their task: to help Alexa
master an Irish-accented English with the aid of artificial intelligence and
recordings from native speakers.
During the demonstration, Alexa spoke about a memorable
night out. “The party last night was great craic,” Alexa said with a lilt,
using the Irish word for fun. “We got ice cream on the way home, and we were
happy out.”
Tinchev shook his head. Alexa had dropped the R in “party,”
making the word sound flat, like pah-tee. Too British, he concluded.
The technologists are part of a team at Amazon working on a
challenging area of data science known as voice disentanglement. It is a tricky
issue that has gained new relevance amid a wave of AI developments, with
researchers believing the speech and technology puzzle can help make AI-powered
devices, bots, and speech synthesizers more conversational — that is, capable
of pulling off a multitude of regional accents.
Tackling voice disentanglement involves far more than
grasping vocabulary and syntax. A speaker’s pitch, timbre and accent often give
words nuanced meaning and emotional weight. Linguists call this language
feature “prosody,” something machines have had a hard time mastering.
Only in recent years, thanks to advances in AI, computer
chips and other hardware, have researchers made strides in cracking the voice
disentanglement issue, transforming computer-generated speech into something
more pleasing to the ear.
Such work may eventually converge with an explosion of
“generative AI,” a technology that enables chatbots to generate their own
responses, researchers said. Chatbots like ChatGPT and Bard may someday fully
act on users’ voice commands and respond verbally. At the same time, voice
assistants like Alexa and Apple’s Siri will become more conversational,
potentially rekindling consumer interest in a tech segment that had seemingly
stalled, analysts said.
Getting voice assistants such as Alexa, Siri, and Google
Assistant to speak multiple languages has been an expensive and protracted
process. Tech companies have hired voice actors to record hundreds of hours of
speech, which helped create synthetic voices for digital assistants. Advanced
AI systems known as “text-to-speech models” — because they convert text to
natural-sounding synthetic speech — are just beginning to streamline this
process.
The technology “is now able to create a human’s voice and
synthetic audio based on a text input, in different languages, accents and
dialects,” said Marion Laboure, a senior strategist at Deutsche Bank Research.
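As a rough illustration of that text-in, audio-out interface, here is a minimal sketch using pyttsx3, an open-source Python library that drives a computer's built-in synthesizer rather than the neural models the companies use:

```python
# Minimal text-to-speech sketch with the open-source pyttsx3 library.
# It drives the operating system's built-in synthesizer; this illustrates
# the text-in, audio-out interface, not Amazon's neural models.
import pyttsx3

engine = pyttsx3.init()          # pick up the platform's default voice
engine.setProperty("rate", 160)  # speaking speed, in words per minute
engine.say("The party last night was great craic.")
engine.runAndWait()              # block until the audio finishes playing
```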
Amazon has been under pressure to catch up to rivals like
Microsoft and Google in the AI race. In April, Andy Jassy, Amazon’s CEO, told
Wall Street analysts that the company planned to make Alexa “even more
proactive and conversational” with the help of sophisticated generative AI. And
Rohit Prasad, Amazon’s head scientist for Alexa, told CNBC in May that he saw
the voice assistant as a voice-enabled, “instantly available, personal AI.”
Irish Alexa made its commercial debut in November, after
nine months of training to comprehend an Irish accent and then to speak it.
“Accent is different from language,” Prasad said in an
interview. AI technologies must learn to extricate the accent from other parts
of speech, such as tone and frequency, before they can replicate the
peculiarities of local dialects — for instance, maybe the A is flatter and T’s
are pronounced more forcefully.
These systems must figure out these patterns “so you can
synthesize a whole new accent,” he said. “That’s hard.”
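One way to picture the extrication: keep what is said, who says it, and how it is pronounced in separate learned representations. The PyTorch sketch below is a toy assumption of that structure (every name and dimension here is illustrative, not Amazon's model), but it shows the payoff: once the factors are apart, the same voice can be recombined with a new accent.

```python
# Toy voice-disentanglement model: content (phonemes), speaker (timbre) and
# accent each get their own embedding space. Purely illustrative; this is
# not Amazon's architecture.
import torch
import torch.nn as nn

class DisentangledTTS(nn.Module):
    def __init__(self, n_phonemes=60, n_speakers=10, n_accents=5, dim=128):
        super().__init__()
        self.content = nn.Embedding(n_phonemes, dim)  # what is said
        self.speaker = nn.Embedding(n_speakers, dim)  # who is speaking
        self.accent = nn.Embedding(n_accents, dim)    # how it is pronounced
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.to_mel = nn.Linear(dim, 80)              # 80-bin mel-spectrogram frames

    def forward(self, phoneme_ids, speaker_id, accent_id):
        x = self.content(phoneme_ids)                  # (batch, time, dim)
        x = x + self.speaker(speaker_id).unsqueeze(1)  # mix in the voice identity
        x = x + self.accent(accent_id).unsqueeze(1)    # mix in the accent
        out, _ = self.decoder(x)
        return self.to_mel(out)                        # frames a vocoder would turn into audio

# Same phonemes, same speaker, a different accent embedding: same voice, new accent.
model = DisentangledTTS()
phonemes = torch.randint(0, 60, (1, 12))               # a dummy phoneme sequence
british = model(phonemes, torch.tensor([0]), torch.tensor([0]))
irish = model(phonemes, torch.tensor([0]), torch.tensor([1]))
print(british.shape, irish.shape)                      # torch.Size([1, 12, 80]) each
```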
Harder still was trying to get the technology to learn a new
accent largely on its own, from a different-sounding speech model. That’s what
Cotescu’s team tried in building Irish Alexa. They relied heavily on an
existing speech model of primarily British-English accents — with a far smaller
range of American, Canadian, and Australian accents — to train it to speak
Irish English.
The team contended with various linguistic challenges of
Irish English. The Irish tend to drop the H in “th,” for example, pronouncing
the letters as a hard T or a D, making “bath” sound like “bat” or even “bad.” Irish
English is also rhotic, meaning the R is pronounced rather than dropped. That means the R in
“party” will be more distinct than what you might hear out of a Londoner’s
mouth. Alexa had to learn these speech features and master them.
Irish English, said Cotescu, who is Romanian and was the
lead researcher on the Irish Alexa team, “is a hard one.”
The speech models that power Alexa’s verbal skills have been
growing more advanced in recent years. In 2020, Amazon researchers taught Alexa
to speak fluent Spanish from an English-language speech model.
Cotescu and the team saw accents as the next frontier of
Alexa’s speech capabilities. They designed Irish Alexa to rely more on AI than
on actors to build up its speech model. As a result, Irish Alexa was trained on
a relatively small corpus — about 24 hours of recordings by voice actors who
recited 2,000 utterances in Irish-accented English.
At the outset, when Amazon’s researchers fed the Irish
recordings to the still-learning Irish Alexa, some weird things happened.
Letters and syllables occasionally dropped out of the
response. S’s sometimes stuck together. A word or two, sometimes crucial ones,
were inexplicably mumbled and incomprehensible. In at least one case, Alexa’s
female voice dropped a few octaves, sounding more masculine. Worse, the
masculine voice sounded distinctly British, the kind of goof that might raise
eyebrows in some Irish homes.
“They are big black boxes,” Tinchev, a Bulgarian national
who is Amazon’s lead scientist on the project, said of the speech models. “You
have to have a lot of experimentation to tune them.”
That’s what the technologists did to correct Alexa’s “party”
gaffe. They disentangled the speech, word by word, phoneme (the smallest
audible sliver of a word) by phoneme, to pinpoint where Alexa was slipping and
fine-tune it. Then they fed Irish Alexa’s speech model more recorded voice data
to correct the mispronunciation.
The result: the R in “party” returned. But then the P
disappeared.
So the data scientists went through the same process again.
They eventually zeroed in on the phoneme that contained the missing P. Then
they fine-tuned the model further so the P sound returned and the R did not
disappear. Alexa was finally learning to speak like a Dubliner.
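The back-and-forth is, in effect, a regression-testing loop: fix one phoneme, then confirm the earlier fixes still hold. Here is a purely illustrative Python sketch of such a check, with ARPAbet-style phoneme labels and simulated model output standing in for the real system:

```python
# Toy re-creation of the fix-and-retest loop: compare synthesized phonemes
# against a reference pronunciation and flag what went missing. The phoneme
# labels are ARPAbet-style; the "rounds" simulate the party R/P episode.
REFERENCE = {"party": ["P", "AA", "R", "T", "IY"]}  # rhotic Irish target

rounds = [
    {"party": ["P", "AA", "T", "IY"]},       # round 1: R dropped, too British
    {"party": ["AA", "R", "T", "IY"]},       # round 2: R fixed, but the P vanished
    {"party": ["P", "AA", "R", "T", "IY"]},  # round 3: both sounds survive
]

for i, synthesized in enumerate(rounds, start=1):
    missing = [p for word, ref in REFERENCE.items()
               for p in ref if p not in synthesized[word]]
    if missing:
        print(f"round {i}: missing {missing}; fine-tune on targeted recordings")
    else:
        print(f"round {i}: all phonemes present; accent check passed")
```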
Two Irish linguists — Elaine Vaughan, who teaches at the
University of Limerick, and Kate Tallon, a doctoral student who works in the
Phonetics and Speech Laboratory at Trinity College Dublin — have since given
Irish Alexa’s accent high marks. The way Irish Alexa emphasized R’s and softened
T’s stuck out, they said, and Amazon got the accent as a whole right.
“It sounds authentic to me,” Tallon said.
Amazon’s researchers said they were gratified by the largely
positive feedback. That their speech models disentangled the Irish accent so
quickly gave them hope they could replicate accents elsewhere.
“We also plan to extend our methodology to accents of
languages other than English,” they wrote in a January research paper about the
Irish Alexa project.