Gamification and Collective Intelligence: Centaur Labs’ Approach to Reliable Medical AI

Erik Duhaime is the CEO of Centaur Labs, leading a specialized approach to data annotation for AI in the medical and life sciences. Centaur Labs has developed a unique, gamified system that uses a global network of medical professionals to deliver scalable and affordable data annotation for healthcare innovations.

C: Where do you see the biggest opportunities for healthcare firms with the latest developments in AI technology, and where could it make the most impact?

Erik Duhaime: I think there’s almost no area in the medical and life sciences that AI won’t touch in the next five to ten years. We’ve seen a lot of progress with computer vision models for diagnostics, but the recent wave of generative AI is especially powerful for text data.

Two areas where we’re seeing a lot of activity are patient-facing chatbots and drug discovery. Chatbots can help with tasks like triaging patients or assisting with prescriptions. On the drug discovery side, AI is helping to sift through the massive corpus of scientific literature—millions of papers—with large language models identifying potential new drug targets or relationships between genes and diseases.

C: How does ensuring accurate, reliable annotations on medical data meet your customers' needs? What are the main benefits of this approach, and does it have any direct applications for end users?

Erik Duhaime: Our customers are model developers in the medical and life sciences. We work with about half of the top 10 pharma companies today, as well as tech companies and disruptive startups developing new diagnostics and hardware.

I'll give two examples. One is a company called Eko, which makes digital stethoscopes that record heart and lung sounds. They had tens of thousands of recordings and wanted to develop algorithms to detect conditions like heart murmurs. To do that, they needed annotations on which recordings had murmurs and which didn’t. We helped them label these recordings, which led to multiple FDA-approved algorithms, including one for heart murmur detection.

Erik Duhaime: Another example is a large tech company developing a patient-facing chatbot. We've all heard about AI hallucinations, and while a hallucination might be harmless in a marketing context, it’s a serious issue in healthcare. The chatbot could potentially give harmful advice, like recommending a dangerous dosage of medicine. To prevent this, we help ensure the accuracy of their clinical database by reviewing every article on key clinical terms. We also evaluate the chatbot's outputs, checking if it's providing correct information, recognizing emergencies, and ensuring it's sensitive to patients with disabilities.

So, it's a two-step process—first, making sure the chatbot is fed accurate data, and second, evaluating its performance to ensure it continues to work safely and effectively.

C: Can you share more about the unique aspects of your annotation process? How does the gamified system and collective intelligence approach improve data quality?

Erik Duhaime: First, we’ve developed a gamified process where subject matter experts compete to tag data. This helps ensure quality because we're not relying on a single expert's opinion. Experts can disagree, so we leverage collective intelligence by having multiple people review the same data. If there's a disagreement on a particular case, we gather more votes to ensure accuracy. This process drives precision beyond what you'd get from simply hiring a few experts.

To facilitate this, we developed a unique app, DiagnosUs. The app is where a lot of this gamified process happens. Medical professionals use it to compete, and they also learn from it. We have thousands of five-star reviews on the App Store, and many medical students use it to improve their skills by receiving feedback on the cases they work on.

The key is that we don’t trust someone just because they're a doctor or medical student. We constantly test people. For example, if the task is classifying whether a skin lesion is cancerous, we mix in cases where we already know the answers with cases where we don’t. Even a highly experienced dermatologist might not do well if the task is outside their specialty or if they're not paying attention, given that it's tedious AI work rather than an in-person patient consult.

By tracking performance on these gold-standard cases where we know the ground truth, we can identify who is skilled at the task and who isn’t. If someone’s performance is poor on those known cases, we don’t count their votes on the unknown cases. This is different from how many companies in the space operate, where they might trust someone just because they have certain credentials. But just having the credentials doesn’t always mean the person will perform well at the task. So, this quality control process is essential for ensuring high-quality data annotation.
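Concretely, this quality-control loop can be thought of as gold-standard-filtered majority voting. Below is a minimal sketch in Python, not Centaur Labs' actual system; the data structures, function names, and thresholds (`min_accuracy`, `min_agreement`) are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

def annotator_accuracy(votes, gold_labels):
    """Accuracy of one annotator, measured only on gold-standard cases."""
    scored = [(case, label) for case, label in votes.items() if case in gold_labels]
    if not scored:
        return 0.0
    correct = sum(label == gold_labels[case] for case, label in scored)
    return correct / len(scored)

def aggregate_labels(all_votes, gold_labels, min_accuracy=0.8, min_agreement=0.7):
    """Majority-vote labels, counting only annotators who pass the gold check.

    all_votes: {annotator: {case_id: label}} -- includes gold and unknown cases.
    Returns final labels plus the cases that need more votes due to disagreement.
    """
    trusted = {a for a, votes in all_votes.items()
               if annotator_accuracy(votes, gold_labels) >= min_accuracy}

    tallies = defaultdict(Counter)
    for annotator, votes in all_votes.items():
        if annotator not in trusted:
            continue  # poor performance on known cases: votes don't count
        for case, label in votes.items():
            if case not in gold_labels:  # only aggregate the unknown cases
                tallies[case][label] += 1

    labels, needs_more_votes = {}, []
    for case, counts in tallies.items():
        label, top = counts.most_common(1)[0]
        if top / sum(counts.values()) >= min_agreement:
            labels[case] = label
        else:
            needs_more_votes.append(case)  # disagreement: gather more opinions
    return labels, needs_more_votes
```

The two thresholds capture the two ideas in the interview: the accuracy gate implements "we don't trust someone just because they're a doctor," and the agreement gate implements "if there's a disagreement on a particular case, we gather more votes."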

C: Beyond using experts to tag data, are there other ways Centaur Labs ensures accuracy and reliability in medical data annotations?

Erik Duhaime: Our main focus is keeping humans in the loop. Whether it’s annotating data initially, ensuring data quality, or evaluating model performance, human expertise is essential. And it has to be scalable—that’s one of the biggest bottlenecks we address.

Increasingly, we’re seeing clients move from model development to model evaluation and monitoring. It’s not just about building a model that gets good results or passes FDA approval. Once it’s deployed, you need to keep monitoring it to ensure it’s working as it should, especially as you apply it to different settings or patient populations. Models can drift or behave differently over time, so the work doesn’t end with initial development.

C: Why is it so important to keep humans in the loop with these kinds of AI technologies? How does this relate to the ethical considerations that healthcare organizations need to address?

Erik Duhaime: There’s a lot of work being done right now on how to best evaluate and monitor the safety of AI algorithms. One focus is creating something like a "nutritional label" for AI outputs, which would evaluate things like bias and performance across different settings. Another critical aspect is continual monitoring.

For example, if you’ve trained a model that works well for a particular patient population and then deploy it in another country or among a different demographic, there are ethical concerns. The model might have been approved based on biased data or assumptions that are no longer valid. If you’re not continuously monitoring it, you won’t be aware of biases that might emerge over time.

One example is a company we work with that has an operating room intelligence system. They deploy cameras in operating rooms, and when they move to a new hospital, they need to ensure the system still performs well. The surgical robots might look different, or the nurses might be wearing different uniforms. If the system was validated in a prestigious academic medical center and then deployed to a community hospital that looks systematically different, it may not perform as well.

If you're not continuously monitoring the robustness of models across different settings, you risk unintended consequences and biases, which can be both ethical and practical issues for healthcare organizations.

C: What are some of the most significant challenges your clients face as they work to improve their AI efforts, and how do you help them address those challenges?

Erik Duhaime: One of the biggest challenges is unlocking a massive amount of human expertise, which is essential to building a good AI model. A team of machine learning engineers can build a great architecture, but the quality of the model ultimately depends on the quality of the data.

Clients often face trade-offs—do they hire a couple of experts? Do they ask their chief medical officer to spend hours tagging data? Human expertise is a scarce resource, and that's where we come in. We change the economics of the situation by unlocking an unlimited supply of human expertise to improve data quality.

Another challenge we help with is defining quality. Experts may disagree with each other, or there may not be a clear alignment in the literature. We help our clients think about their goals and translate those into a data annotation strategy that aligns with what they need.

C: Looking to the future, what do you see as the next step for AI deployment in healthcare? What do you think the next 12 months will look like?

Erik Duhaime: I think the next 12 months will see an increasing focus on model monitoring and evaluation. Right now, a lot of developers are focused on building models and figuring out how to deploy them, but they’re also becoming more concerned with ensuring these models are safe and continue to work well over time.

Today, many of our clients are still in the model development phase, but we’re seeing more and more of them thinking about how to monitor their models. Most are sending batches of data every week or month to check for potential drift or areas for improvement.

Where I see the next big shift is moving away from manually sending batches of data for checks and instead having human experts continuously in the loop, reviewing cases in near real-time. The goal is to have this process baked into the data pipeline and infrastructure so that monitoring is more integrated, rather than something you check in on occasionally. It’s about building a deeper human-AI partnership into the AI workflows.
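As a rough illustration of what "baked into the data pipeline" might mean, the sketch below routes live model predictions alongside expert review results and flags when model–expert agreement slips below a baseline. The class, window size, and thresholds are assumptions for illustration, not Centaur Labs' implementation.

```python
from collections import deque

class DriftMonitor:
    """Track model-expert agreement on a rolling window of reviewed cases
    and flag when it drops below a baseline. All thresholds are illustrative."""

    def __init__(self, baseline_agreement=0.90, tolerance=0.05, window=500):
        self.baseline = baseline_agreement
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)  # 1 = expert agreed, 0 = disagreed

    def record(self, model_label, expert_label):
        """Call whenever a live prediction comes back from expert review."""
        self.outcomes.append(int(model_label == expert_label))

    def check(self):
        """Return a warning string if agreement has degraded, else None."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return None  # not enough reviewed cases yet to judge
        agreement = sum(self.outcomes) / len(self.outcomes)
        if agreement < self.baseline - self.tolerance:
            return (f"possible drift: agreement {agreement:.2%} "
                    f"vs baseline {self.baseline:.2%}")
        return None
```

The point of wiring something like this into the pipeline is that a weekly or monthly batch check becomes a continuous signal: each expert-reviewed case updates the window, so degradation in a new hospital or patient population surfaces as soon as enough reviewed cases accumulate, rather than at the next scheduled audit.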

Join Centaur Labs and RE•WORK at the upcoming AI in Healthcare & Pharma Summit, happening November 13-14, 2024, at The Colonnade Hotel in Boston!

Register Now