Finding and using high-quality AI data has always been difficult for data scientists. AI requires large amounts of data, and acquiring it can be both time-consuming and costly. One of our exhibitors at the Deep Learning in Finance Summit, Appen, is working on changing just that. Appen is developing high-quality, human-annotated datasets for AI and machine learning. We spoke with Wilson Pang, CTO at Appen, to hear about his current work as well as the challenges of implementing AI.
1. Give us a bit of background on Appen and your role there
Appen develops high-quality, human-annotated datasets for machine learning and artificial intelligence. We work with leading companies across many different industries to scale their machine learning programs, and our training data helps to improve solutions like chatbots, speech recognition systems, search engines, social media platforms, and more. Most of our clients choose to partner with us because Appen is a one-stop shop for high-quality AI data. We work with our clients to design data collection and annotation programs that are specific to their needs, and our project managers ensure that the data meets our clients' quality standards. We can scale up quickly because we have a global crowd of over 1 million contractors, working in 130 countries and 180 languages, and we can handle many data types, including sensitive data, depending on our clients' needs.
My role is Chief Technology Officer, in charge of both product and engineering. My team consists of data scientists, engineers, and product managers. We’re building the world’s leading data labeling platform, which includes an AI-assisted tools system that makes data labeling much faster. The platform also includes a workforce management system which makes it easy to engage and grow our crowd community — as well as providing data insights, data quality assurance, and tools to make our project managers super-efficient.
2. How did you start your work in machine learning?
I started my career in machine learning in search, the first domain where machine learning was widely applied. I was very lucky to be one of the founding leaders of eBay’s Search Science team 10 years ago. We built the Search Science team from scratch and drove huge revenue increases by applying and optimizing machine learning algorithms. Once you see the power of data and machine learning, you can hardly stop. From there, I also built a data science team at eBay that worked on retail science (inventory price setting, supply & gap analysis, trending and seasonality detection, etc.), experimentation, and product experience optimization.
Later on I joined China’s largest online travel agency, CTrip, as Chief Data Officer to lead most of the machine learning and data initiatives in the company. My team drove hundreds of millions of dollars in revenue increases, as well as huge reductions in customer support costs there. In my experience, machine learning can be leveraged to resolve real industry problems, and data is one of the most important factors in building a great AI solution. Getting high-quality training data is very hard. Solving the AI data problem is a big challenge, and Appen is positioned well to solve that problem for our customers.
3. As a CTO, what does a typical day look like for you?
As CTO, I’m responsible for defining the technology vision and strategy, communicating them clearly within our organization and to our partners, attracting and nurturing great talent, and building a great culture for execution and innovation. A typical day can touch one or many of those areas. I also often provide guidance on planning and decisions, meet with external customers, research the latest developments in machine learning technology, and make sure our platform stays on the leading edge of our industry.
4. What challenges are you currently facing in your work, and how is deep learning helping you solve them?
AI needs a lot of data, and the data quality needs to be high. We all know “Garbage in, garbage out.” With our global crowd, Appen is one of the very few companies that can scale up easily to provide large volumes of high-quality data, but relying on humans alone can be costly. Moreover, optimizing for both quality and output in data collection and annotation is not an easy job. We’re using deep learning to reduce the cost per unit for our customers, without hurting the data quality. Our AI-assisted annotation service can pre-annotate data where appropriate.
For example, we use deep learning to pre-annotate images and then get our workers to adjust the result where it isn’t accurate. It makes our workers 10 times faster when annotating images. The same idea also applies to voice-to-text transcription, named entity extraction, and other related tasks.
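As a rough illustration of the pre-annotate-then-correct idea, the sketch below keeps a model's label unless a worker has supplied a correction. It is not Appen's actual platform: the `Annotation` type, the `merge_annotations` function, and the sample image IDs and labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Annotation:
    item_id: str
    label: str
    source: str  # "model" or "human"


def merge_annotations(pre_annotations: Dict[str, str],
                      corrections: Dict[str, str]) -> Dict[str, Annotation]:
    """Keep the model's label unless a worker supplied a correction."""
    merged = {}
    for item_id, model_label in pre_annotations.items():
        if item_id in corrections:
            merged[item_id] = Annotation(item_id, corrections[item_id], "human")
        else:
            merged[item_id] = Annotation(item_id, model_label, "model")
    return merged


# Hypothetical model output for three images.
pre = {"img1": "dog", "img2": "dog", "img3": "cat"}
# A worker reviews all three and only fixes the one that is wrong.
fixes = {"img2": "cat"}

final = merge_annotations(pre, fixes)
human_edits = sum(1 for a in final.values() if a.source == "human")
print(f"{human_edits} of {len(final)} items needed a manual correction")
```

The speed-up comes from the worker only touching the items the model got wrong, rather than labeling every item from scratch.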
5. Bias of models has always been a key issue when applying AI. How should companies approach mitigating bias when using AI?
Bias of models, or machine bias, is indeed a big issue. These issues are normally caused by problems within the training data. For example, if we are building an image classification model on training data in which most images feature dogs and very few feature cats, the model will most likely classify new images as dogs.
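To make the dog/cat example concrete, here is a small sketch with made-up numbers (95 dogs, 5 cats). It assumes the extreme case of a model that has collapsed to always predicting the majority class, and shows why overall accuracy can hide the problem.

```python
# Hypothetical, deliberately skewed evaluation set: 95 dogs, 5 cats.
labels = ["dog"] * 95 + ["cat"] * 5

# A degenerate "model" that always answers with the majority class.
predictions = ["dog"] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall for cats: how many of the real cats did the model find?
cat_total = labels.count("cat")
cat_found = sum(p == "cat" and y == "cat" for p, y in zip(predictions, labels))
cat_recall = cat_found / cat_total

print(f"accuracy:   {accuracy:.2f}")    # 0.95
print(f"cat recall: {cat_recall:.2f}")  # 0.00
```

The model looks 95% accurate while never recognizing a single cat, which is why per-class metrics matter on imbalanced data.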
So the mitigation of bias should also focus heavily on mitigating bias within the training data:
- Companies should have diverse tech team members in charge of both building models and creating training data.
- If the training data comes from internal systems, try to find the most comprehensive data, and experiment with different datasets and metrics.
- If training data is collected or processed by external partners, it is important to recruit diversified crowds so that the data can be more representative. Moreover, it is super-important to design the tasks and instructions correctly so that the crowd is not biased when providing data. Often, companies don’t do this proactively.
- Once the training data is created, it’s important to check if the data has any bias. Sometimes it can be difficult to visualize high-dimensional training data and check the balance. We are building powerful training data visualization and insight tools which will definitely help.
- It is up to us to determine the path machine learning algorithms take. As engineers and data scientists, we should carefully consider the prejudices we inherently carry when creating these technologies — and correct for them.
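One simple form of the balance check mentioned in the list above is to look at each label's share of the dataset. The sketch below is only an illustration and is unrelated to Appen's visualization tools: the `label_balance` function, the 10% `min_share` threshold, and the sample labels are assumptions chosen for the example.

```python
from collections import Counter


def label_balance(labels, min_share=0.10):
    """Return each label's share of the data and the labels below min_share."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()}
    underrepresented = sorted(l for l, s in shares.items() if s < min_share)
    return shares, underrepresented


# Hypothetical labels for an image dataset.
training_labels = ["dog"] * 900 + ["cat"] * 60 + ["rabbit"] * 40

shares, flagged = label_balance(training_labels)
for label, share in sorted(shares.items()):
    print(f"{label}: {share:.1%}")
print("underrepresented:", flagged)
```

A check like this only covers label balance. Bias along other dimensions, such as who collected the data or how the task was worded, still needs the process measures listed above.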
6. Which other industries are you most excited to see implementing AI for a positive impact in the next 5 years?
AI in the manufacturing industry is going to be the sweet spot where AI meets IoT. It will save a great deal of human labor, as well as boost productivity, efficiency, and profitability.
7. What are you most looking forward to at the Deep Learning in Finance Summit?
Machine learning within the finance industry leverages a lot of personal data, including transaction history, credit records, user behavior, etc. However, unstructured data like images, video, and text hasn’t been widely used. I’m really looking forward to seeing use cases leveraging this type of data, which will bring the industry to the next level in areas like fraud prevention and customer experience.
Appen will be exhibiting at the Deep Learning in Finance Summit on 19 and 20 March, so make sure to get your tickets for the chance to chat with them about how they can help you scale your machine learning programs.