Developing a new drug is a complex process that can take 12 to 15 years and cost over a billion dollars. Chemical compounds undergoing this process are said to be in the ‘pipeline’, and only 2% of the compounds that go in one end come out at the other. This is mainly for two reasons: they either do not work or they are not safe. The entire pipeline can thus be visualized as a massive funnel in which compounds progress through successive layers of checks as they are tested for a range of properties falling under efficacy or safety.

These tests, or assays, need to walk a fine line between throughput and physiological relevance. For instance, it would be increasingly informative to test compounds in primary cells, organoids or whole organisms (in that order). However, costs grow exponentially with the complexity of these assay systems. As a result, the discovery process tends to favor high-throughput approaches early on while saving the more physiologically ‘realistic’ tests for the candidates in which there is greater confidence.

Image-based high content screening (HCS) offers a reasonable middle ground in this trade-off. It allows several thousand compounds (or doses thereof) to be run quickly on live cells while also capturing detailed phenotypic responses as images. High-throughput approaches can generally be viewed as brute-force searches of chemical space that require significant computational capacity to fully process their output. This is especially true for images, where each multi-channel image typically yields a representation with thousands of dimensions.
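
To give a rough sense of that scale, here is a back-of-the-envelope sketch in Python; the plate format, field count, channel count and image size are illustrative assumptions rather than figures from any particular screen.

```python
# Back-of-the-envelope scale of an image-based HCS campaign.
# All numbers below are illustrative assumptions, not figures from a real screen.

wells_per_plate = 384      # assumed plate format
fields_per_well = 4        # assumed imaging fields per well
channels = 5               # e.g. a multi-channel stain set
height = width = 2000      # assumed pixels per field
bytes_per_pixel = 2        # 16-bit camera

pixels_per_well = fields_per_well * channels * height * width
plate_gigabytes = wells_per_plate * pixels_per_well * bytes_per_pixel / 1e9

print(f"Raw pixels per well: {pixels_per_well:,}")
print(f"Raw data per plate:  ~{plate_gigabytes:.0f} GB")

# Even after feature extraction, each cell is typically summarised by a vector
# with thousands of dimensions, so a screen of a few thousand compounds quickly
# becomes a large-scale computational problem rather than a visual-inspection task.
```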

Tapping the full richness of these images demands a new, more question-agnostic approach to image analysis. Conventional approaches are bound by predefined features, and manually designing, selecting and extracting those features is a long, arduous process that requires a moderate level of expertise with image-analysis tools.
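
As a rough illustration of that conventional workflow, the sketch below computes a handful of hand-picked per-cell features with scikit-image. The synthetic image and the simple threshold-based segmentation are stand-ins for real microscopy data; production pipelines such as CellProfiler compute thousands of such features across every channel and cellular compartment.

```python
# Minimal sketch of the conventional, hand-engineered feature workflow.
import numpy as np
from skimage import draw, filters, measure

# Synthetic single-channel image with a few bright, roughly nuclear blobs.
image = np.zeros((256, 256), dtype=float)
rng = np.random.default_rng(0)
for r, c in rng.integers(30, 226, size=(8, 2)):
    rr, cc = draw.disk((r, c), radius=12, shape=image.shape)
    image[rr, cc] += 1.0
image = filters.gaussian(image, sigma=2) + rng.normal(0, 0.02, image.shape)

# Segmentation: a global threshold followed by connected components.
labels = measure.label(image > filters.threshold_otsu(image))

# Each feature below has to be chosen, named and validated by hand.
props = measure.regionprops_table(
    labels,
    intensity_image=image,
    properties=("area", "eccentricity", "solidity", "mean_intensity"),
)
features = np.column_stack(list(props.values()))  # cells x features matrix
print(features.shape)
```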

We explored the application of deep convolutional neural networks to address these concerns. These models automatically extract informative features from raw pixel values, significantly reducing time and effort while enabling a richer phenotypic analysis of the images. Scientists are generally not convinced by algorithms that learn relationships directly from data unless they can understand the reasoning behind the results. We therefore investigated feature embeddings, saliency maps and filter visualization as means of interrogating the models and interpreting their output. This was also essential for industrialization, where we were required to guarantee model robustness within a defined problem space or else lay out clear bounds of certainty.
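
For readers unfamiliar with these techniques, the sketch below shows minimal versions of two of them in PyTorch: pulling an image embedding out of a CNN and computing a gradient-based saliency map. A pretrained ResNet-18 and a random 3-channel input stand in for our actual models and data, which are not described here.

```python
# Minimal PyTorch sketch of two interrogation techniques:
# (1) using a CNN as a feature extractor to obtain per-image embeddings, and
# (2) a simple gradient-based saliency map.
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")  # downloads ImageNet weights
model.eval()

# (1) Feature embedding: drop the final classification layer.
embedder = torch.nn.Sequential(*list(model.children())[:-1])
image = torch.rand(1, 3, 224, 224)          # placeholder for a cell image
with torch.no_grad():
    embedding = embedder(image).flatten(1)  # 1 x 512 phenotypic fingerprint
print(embedding.shape)

# (2) Saliency: gradient of the top class score w.r.t. the input pixels
# highlights which regions of the image drive the prediction.
image.requires_grad_(True)
scores = model(image)
scores[0, scores.argmax()].backward()
saliency = image.grad.abs().max(dim=1).values  # 1 x 224 x 224 heat map
print(saliency.shape)
```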

Beyond algorithmic hurdles, we faced challenges around data infrastructure and around integrating into a pharmaceutical context. This is especially difficult in drug discovery research, where data are often varied and complex while lacking sufficient annotation in a standardized format. Industrializing deep learning thus hinges on efficient data access across a global organization and on tailoring workflows to handle an increased throughput that they weren't designed for. One of the goals of the group I'm part of at GSK is to tap into the growing interest around machine learning to accelerate these less exciting processes of data access and standardization.

My talk at the RE•WORK Deep Learning in Healthcare Summit will discuss these topics in greater detail, describing our effort to industrialize deep learning-based image analysis to accelerate high-throughput drug discovery.