One of the main problems with the current generation of chatbots is that they require large amounts of training data. If you want your chatbot to recognize a specific intent, you need to provide it with a large number of sentences that express that intent. Until now, these large training corpora had to be generated manually, with one or more people writing many different sentences for each intent, vertical and language that the chatbot needed to recognize. Bitext Artificial Training Data technology (also called Natural Language Generation) automatically generates many different sentences with the same meaning as an original sentence, in order to automate the most resource-intensive part of the bot creation process. On the AI Assistant Stage at the Deep Learning Summit next week, Antonio Valderrábanos, CEO and Founder of Bitext, will present the company's latest work in a talk titled 'Artificial Training Data: How to Automate Your Bot Training'.
Bitext brings a unique approach to the natural language market. As experts in computational linguistics, we are continuously developing new tools designed to boost accuracy when machines read and understand human utterances. Our trailblazing culture has brought us to the forefront of this disruptive technology. Our unrivalled performance results have helped us gain the acknowledgement and trust of the largest companies in the world. In 2018, Bitext was selected as a “Cool Vendor in AI core technologies” in recognition of the company's innovative and game-changing approach to computational linguistics.
In advance of the summit, we caught up with Antonio to learn more about his work and the Bitext story:
Can you give me an overview of your background - how did you start your work in AI?
I got my PhD in Computational Linguistics 20 years ago, and have been working in the field ever since. I started out at R&D labs in IBM and Novell, working on getting computers to efficiently store and leverage language data; at the time, this data was commonly used by tools such as grammar checkers, which needed to understand the structure of text in order to detect errors.
What led you to found Bitext? What problem are you trying to solve?
I felt that a true linguistic approach to NLP was missing in the industry. Most efforts were focused on statistical techniques – learning from annotated training data – which had proved successful in speech recognition but resulted in “black boxes” which were nearly impossible to fine-tune or adapt for other purposes. So, if your NER model consistently makes a certain type of mistake, you need to dig through your training data to try to pinpoint which examples it may have learned that mistake from. Similarly, if you train your parser using news articles, and then try to apply it to social media, the results are going to be rather poor – the only way around this is to train it from scratch using a lot of manually tagged data from social media.
With our experience in practical commercial applications of NLP, we knew that a symbolic approach (with lexical, syntactic and semantic levels) had a role to play, especially if we wanted to handle different domains and languages consistently. As a result, we created a solid language-independent NLP stack that can be repurposed for many different NLP applications (POS tagging, entity extraction, sentiment analysis, parsing…), and we can add support for a new language in a month or two. Our lexicons and grammars are built in such a way that we can easily tweak them to handle different types of text (chatbots, headlines, reviews…) and domains with minimal effort.
What challenges is the industry currently facing?
In the space of AI assistants, reliability is still an issue. If I ask my phone to “show me restaurants but not Japanese” (perhaps because I ate sushi last night), I will invariably be shown Japanese restaurants nearby. Handling common conversational phenomena like negation and coordination is still a challenge for most assistants, and we believe this can be effectively dealt with using a linguistic approach.
For AI in general, the scarcity of training data and the cost associated with generating it is perhaps the number one challenge. This is why for the past year we have focused on using Natural Language Generation (NLG) to generate Artificial Training Data for chatbots.
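As a rough illustration of the general idea of artificial training data (not Bitext's actual technology, which is not described in detail here), the sketch below expands one intent into many labelled sentences by combining interchangeable phrase fragments. The function name `generate_variants`, the fragment lists and the `find_restaurant` intent label are all hypothetical, chosen only for this example.

```python
from itertools import product


def generate_variants(openers, requests, objects):
    """Return every opener/request/object combination as one sentence.

    This is a toy, template-style expansion; the fragment lists are
    hypothetical and written by hand for this example.
    """
    sentences = []
    for opener, request, obj in product(openers, requests, objects):
        # Join only the non-empty parts so an empty opener
        # does not leave a leading space in the sentence.
        sentence = " ".join(part for part in (opener, request, obj) if part)
        sentences.append(sentence)
    return sentences


# Hypothetical fragments for a single intent.
OPENERS = ["", "please", "could you"]
REQUESTS = ["show me", "find", "look up"]
OBJECTS = ["restaurants nearby", "places to eat near me"]

# Pair each generated sentence with the (hypothetical) intent label.
training_data = [
    (sentence, "find_restaurant")
    for sentence in generate_variants(OPENERS, REQUESTS, OBJECTS)
]

for sentence, intent in training_data[:3]:
    print(intent, "->", sentence)
```

Three openers, three requests and two objects yield 18 labelled sentences from a handful of hand-written fragments. A system built on lexicons and grammars would presumably go well beyond this kind of simple concatenation, for instance by varying word order or syntax.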
What can we do to encourage more diversity in AI?
Since the current generation of AI is mostly trained on data gathered from the real world, ensuring diversity is essential to prevent inadvertently introducing bias into our agents. I think that using AI itself for certain tasks can promote diversity; for example, if a company uses a carefully trained AI system to evaluate potential hires solely based on their merit rather than their gender or their ethnicity, we should expect their workforce to become more diverse. Of course, training such a system is not an easy task, because if we train it to emulate past hiring decisions made by humans, any unconscious biases present in the training data will creep into the AI model. In a sense, this is a potential problem with all kinds of training data for AI, which is why we advocate for a controlled human-in-the-loop approach to generating training data, rather than relying on purely manual processes.
How can the work you’re doing be applied to other industries?
Virtually any industry can benefit from automated assistants – from customer support and contact centers to search-based agents (such as e-commerce bots that act as front-ends to retail product catalogs). Providing natural language interfaces to search engines and databases is also one of our short-term goals.
How are you using AI for social good, or how can the work you’re doing be applied for social good in other areas?
From the beginning, we have placed a lot of emphasis on multilingual support in our technology. Developing tools and data for a new language opens the digital space to its speakers. If you only speak Telugu or Zulu and you can talk to your computer, your phone or your smart speaker in those languages, you won’t be left out of the AI revolution.
What’s next for you in your work?
Multilingual assistants, developed and deployed ASAP!
Where can we keep up to date with you? (Twitter, LinkedIn, website etc.)