We were all born with the instinct to speak and listen, but not to type on keyboards, tap touch screens, or click touchpads and mice.
Voice interaction offers users a new kind of experience: a more natural, hands-free approach that lets them use their computers in new contexts and in different ways.
We are experiencing significant progress in voice interaction. We can ask our phones if it’s going to rain tonight, navigate home or to work, do some fact checking, dictate messages, or order products.
It’s all very useful, but it’s certainly not the final stage of the technology’s development. This is where AI comes in, meaning the latest advancements in machine learning (ML) and, in particular, improved transfer learning.
So what are the current problems that the biggest tech companies have to deal with? What are the near-future predictions in this area?
Fairness of data is important. Data is always biased, even though it has to be accepted as factual, and it should be fair. Data processing is the responsibility of the tech companies, and ‘we are only as good as our data’ is no longer an acceptable excuse.
We have to remember that a lot of training data is heavily biased; for instance, a “doctor” is a man, a “nurse” is a woman, etc. AI does not understand gender, bias, or religion, and attempts to introduce these abstract concepts (abstract from the machine’s perspective) have failed. These problems are deemed to be extremely hard to fix.
Bias is not ‘just’ in historical data (a big problem for ML); it is also in the processing of data (stereotypes), which is an even harder problem. How can we rely on human developers or analysts and their stereotypes about the world? Humans label the data, automatically and manually, accept or reject models based on their results, create and execute tests, and accept or reject the test results.
Even worse, different countries have different cultures and different biases that are considered fair or unfair there, so unfortunately a model cannot simply be carried over to another language or country.
AI voice assistants have a female voice by default; isn’t that bias? One explanation is the gender stereotype in which a woman means care and help, while a man means authority and power. A less common kind of bias (compared to racial or gender bias) is generational bias, which is another demonstration of ageism.
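To make the “doctor”/“nurse” example above more concrete, here is a minimal sketch of how bias in historical text data could be surfaced by counting which gendered pronouns appear alongside an occupation. The toy corpus, the pronoun lists, and the function name `gender_counts` are all assumptions made for illustration; a real audit would need far more data and a much more careful method.

```python
import re
from collections import Counter

# Hypothetical toy corpus; real training data would be far larger.
CORPUS = [
    "The doctor said he would call back.",
    "The doctor finished his shift.",
    "The doctor said she was ready.",
    "The nurse said she was on her way.",
    "The nurse finished her shift.",
]

# Simplified pronoun lists (an assumption, not a complete lexicon).
MALE = {"he", "him", "his"}
FEMALE = {"she", "her", "hers"}


def gender_counts(sentences, occupation):
    """Count gendered pronouns in sentences that mention `occupation`."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        if occupation not in tokens:
            continue
        counts["male"] += sum(token in MALE for token in tokens)
        counts["female"] += sum(token in FEMALE for token in tokens)
    return counts


for job in ("doctor", "nurse"):
    result = gender_counts(CORPUS, job)
    print(job, "male:", result["male"], "female:", result["female"])
```

A strong imbalance in such counts does not prove that a model will behave unfairly, but it is a cheap first signal that the data carries the stereotype.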
A voice assistant’s personality has to be localized to the culture of the community in which it is being deployed. It must be supportive, polite, and native. Voice tone, accents, and spoken phrases are all based on studies of the local culture.
In the future, more personalization options will be available to adjust for the particular user.
A machine voice could sound like the user’s own voice. Some experts say this might be the best neutral solution; to others, it would be spooky to hear your own voice talking back to you.
Self-learning means that trained, configured, and deployed models are going to continue to learn on their own, based on feedback from millions of users, and then adapt to the preferences of a single user. Thanks to this continuous feedback loop, they can avoid repeating the same types of errors made in the past. Today this is done semi-automatically, with lots of human (expert) interventions and adjustments. The future, however, is more automatic and autonomous, which also means new threats such as model and data poisoning by adversaries.
Multimodel-voice models are unified, but maybe they shouldn’t be, as languages differ significantly (this approach is gaining attention).
Multiple simultaneous senses create a fuller context. This is very natural for humans, as we talk while we make gestures and look at images. Combining voice with image analysis (even of what is on the screen) provides more context, which will help in having more useful conversations with machines.
Developers are very important, and so are their challenges, such as how to develop AI solutions with less data, or how to develop them with less code. Democratization would enable all developers, not just ML experts and data scientists, to create ML solutions. Developers have different problems than end users, and they are able to diagnose what is slowing them down much more precisely, so their feedback is very useful.
Voice output is much harder to design and implement for a good user experience, as it is totally different from screens filled with tens of items, graphics, and text. Voice output should return one or two options at a time, even if that means more steps, so that the user is not overwhelmed with too many options.
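The “one or two options, more steps” idea can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name `voice_prompts`, the wording of the prompts, and the ‘more’ keyword are hypothetical and do not come from any particular assistant platform.

```python
def voice_prompts(options, per_turn=2):
    """Yield spoken prompts that offer at most `per_turn` options each."""
    for start in range(0, len(options), per_turn):
        chunk = options[start:start + per_turn]
        has_more = start + per_turn < len(options)
        prompt = "Would you like " + " or ".join(chunk) + "?"
        if has_more:
            # Offer a way to continue instead of reading the whole list.
            prompt += " You can also say 'more' to hear other options."
        yield prompt


for prompt in voice_prompts(["pizza", "sushi", "tacos"]):
    print(prompt)
```

A screen could show all three choices at once; the voice version trades that for an extra turn so the listener never has to remember a long list.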
Computers contain expansive knowledge about the world and about user profiles (basically a multi-model interaction), but also information about the world as it is relevant to the particular person and their experiences.
Context is something everybody attempts with limited success; however, everybody understands it is critical for more natural voice interactions, more natural dialogs, and better translations across multiple languages. We want conversations to last longer and to be more than simple questions and responses, and we don’t want to be disappointed when the AI loses the context.
English represents only 20% of web languages. There is now a larger focus on other languages, such as the top 10 languages of the Internet, which have been far behind in terms of focus.
With regard to ethics, we need to do better with AI. How? By improving its performance, accuracy, and accessibility, while making remote work and life better, even with relatively simple solutions such as valuable UX improvements like background blurring or noise cancellation. As an end result, users should perceive talking machines not as a threat or an annoyance but as useful assistants that help.
Unifying voice interaction into a single platform, like Alexa plus Siri plus Google Assistant all in one, would be good; however, it is very unlikely at the moment. For now, people have to choose between ecosystems and between assistants, based on language support, the mobile platform of their choice (iOS, Android, Alexa), etc. Interoperability across the different ecosystems is just beginning, and the major players now understand that the future requires their cooperation to give end users a seamless experience, no matter if they are Alexa, Android, or Apple users.
Some people argue for voices that sound as natural and human as possible, while others prefer a voice that is easily distinguishable as a robot’s. Scientists are working on the pros and cons of both, and there’s no clear answer about which option is better. It seems that the more natural a voice the better, but a very natural voice creeps out some users and makes them uncomfortable. (Am I talking to a machine or a human now? Many of us want to know which one, every time.)
Probably the safest solution would be to give users a choice between a more human voice and a more robotic voice, depending on their personal preferences.
People need to know what data is being collected, to be able to view this data, and to have the option to delete it. Only putting users in control enables true transparency and choice.
The more that happens on a device, the more control a user has over their privacy. The present trend is clearly towards more voice processing on the device instead of in the cloud. It also brings much lower latency, which means a more natural conversation. Additionally, avoiding a cloud connection provides more resilience when the internet connection is poor.
To the surprise of many, even Google and Amazon are embracing more on-device processing. For now, they still rely heavily on cloud processing; it will take months, if not years, for them to transition.
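One way to picture the on-device-first approach is a small routing function that only reaches for the cloud when the device cannot handle the request, the user has allowed it, and a connection is available. This is a hedged sketch, not how any vendor actually implements it: `recognize`, `Result`, and the stand-in recognizer functions below are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    text: str
    source: str  # "device" or "cloud"


def recognize(
    audio: bytes,
    on_device: Callable[[bytes], Optional[str]],
    cloud: Callable[[bytes], str],
    cloud_allowed: bool,
    online: bool,
) -> Optional[Result]:
    """Prefer on-device recognition; use the cloud only as a fallback."""
    text = on_device(audio)
    if text is not None:
        return Result(text, "device")
    # Audio leaves the device only with the user's consent and a connection.
    if cloud_allowed and online:
        return Result(cloud(audio), "cloud")
    return None


# Hypothetical stand-ins for real speech recognizers.
def tiny_local_model(audio: bytes) -> Optional[str]:
    return "lights on" if audio == b"short-command" else None


def big_cloud_model(audio: bytes) -> str:
    return "transcribed in the cloud"


print(recognize(b"short-command", tiny_local_model, big_cloud_model, True, False))
print(recognize(b"long-dictation", tiny_local_model, big_cloud_model, True, True))
```

In this sketch, simple commands keep working offline, and nothing is sent to the cloud unless the user has opted in.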
Competent, working voice interaction has been promised for so many years . . . and still, the major breakthrough lies ahead of us. Google and Amazon claim it will happen in less than five years.
The breakthrough, as they define it, is that the voice interface would be used as often as the other interaction methods on devices.
The ultimate goal is ambient computing, meaning computing without visible computers, in our homes, offices, cars, bikes, or pockets.
As long as it is much easier to achieve a specific personal or business goal by tapping on the screen or clicking and typing, voice will continue to be a secondary choice that is limited to safety-related situations (e.g., hands-free use in a car) or accessibility.
We’re glad to see major investments and progress from all the largest players in the area, as it enables us at Avenga to deliver even better conversational business solutions for our partners.