Opening the Black Box: Interpretable Deep Learning for Genomics

Deep learning models are often described as "black boxes" because it is difficult to trace a prediction back to the input features that drove it. Although deep learning models are delivering increasingly advanced results on diverse problems, their lack of interpretability is a major problem.

In healthcare fields such as genomics, interpretability of models is paramount to success. For instance, if a model trained to predict DNA mutations that cause disease performs well, it has likely identified patterns that biologists would find valuable - but if the model is a black box, understanding how the prediction was made is difficult. Avanti Shrikumar, a PhD student in Computer Science at Stanford University, is working in this area to make models more interpretable, with a focus on applications in regulatory genomics.

At the Deep Learning in Healthcare Summit in Boston, on 25-26 May, Avanti will present novel algorithms that address significant limitations of previous approaches to interpretability, and will explore the potential of deep learning in genomics and beyond. I spoke to her ahead of the summit to learn more.

Can you tell us a bit about yourself and your work? Plus, give us a teaser of your presentation at the summit?

I'm a PhD student in the Kundaje lab at Stanford University. A major focus of my lab is applying deep learning techniques to understand genomic regulation. Broadly speaking, genomic regulation refers to the following problem: all cells in our body have essentially the same DNA sequence, but cells from different tissues behave very differently because a different set of genes is turned on in them. "Genomic regulation" refers to the processes that ensure this tissue-specific gene expression occurs correctly. If we had a perfect understanding of genomic regulation, we could design highly advanced stem cell therapy where we take skin cells and turn them into replacement cardiac tissue.
We would also have a vastly improved understanding of disease, because the majority of disease-associated DNA mutations occur not in genes, but in regions that don't code directly for genes yet play a role in gene regulation. In recent years, deep learning techniques have shown great promise in their ability to solve problems in regulatory genomics. However, a major barrier to adoption is their uninterpretable "black box" nature. Say you train a model to predict which DNA mutations will cause a disease. If your model gives breakthrough performance on this task, it has likely discovered patterns that we ourselves would like to understand - not so easy if the model is a black box. To address this need, my colleague Peyton Greenside and I have developed algorithms that take these so-called "black box" models and look inside them to understand why they work. These algorithms will be the focus of my talk.

How did you start your work in deep learning?

I started doing deep learning when I began my PhD. My adviser Anshul Kundaje was a believer in deep learning for genomics well before the first papers in the field came out. As for what started my work on interpretability: I was due to give my first-ever presentation at lab meeting and I didn't have any interesting results on my actual project, so at 3am I decided to hack up a simple idea for assigning importance scores to the inputs of a deep learning model (the presentation was at noon). I got reasonable results, and my adviser (who was much more excited by what I had done than I was) encouraged me to pursue it. The method later turned out to be a version of an existing technique called Layerwise Relevance Propagation, but it got us started on the interpretability journey, and the algorithms we have today are profoundly different from anything else in the literature - in other words, this time I am actually excited about them :-)

What key factors have enabled the recent advancements in deep learning for genomics?
Usually the answer to this question is "lots of data", which is certainly true for genomics; as the cost of genomic sequencing has dropped, the amount of available data has exploded. However, I think the best explanation for why the adoption in genomics is happening right now is the availability of good deep learning software packages. These packages make it exceptionally easy to port state-of-the-art techniques from computer vision and NLP (natural language processing) to other fields. Our lab is a big fan of Keras and TensorFlow, for instance.

Which areas of genomics do you think deep learning will benefit the most and why?

I am only qualified to make predictions about my specific field of expertise, which is regulatory genomics, but it's worth noting that deep learning has produced a lot of surprising advances on diverse problems such as predicting protein contact maps or identifying mutations from the output of a DNA sequencer. As for regulatory genomics: a fairly long-standing problem has been modeling "regulatory grammars" - that is, modeling complex interactions between the building blocks of regulation. We are just getting a handle on the smallest building block, which is the binding of individual regulatory proteins to DNA, and we know that these proteins interact to form protein complexes, which in turn interact to form regulatory modules, which in turn interact to form even more sophisticated structures. Most machine-learning approaches struggle to capture this. However, we know that deep learning has had great success in NLP, and language contains hierarchical structure that rivals the complexity one might expect in genomics. Many of us therefore feel that modeling regulatory grammars is one area where deep learning will yield substantial advances. And we'll need to have our interpretability techniques ready for when that happens :-)

How do you feel about being a woman in tech? Have you faced any challenges?

Actually yes - it's a somewhat alarming story.
I understand RE•WORK found me from a YouTube video of my talk at CEHG. A few days after that talk went online, a complete stranger who saw the talk tracked down my Facebook account and personal YouTube channel to proclaim his affection for me. I explained I wasn't interested - I told him I was gay. I was being truthful, but it didn't matter because he straight up didn't believe me and became quite belligerent, to the point where it was clear he wasn't entirely stable and I became concerned for my safety. I blocked him, disabled comments on my YouTube channel, and subscribed to a service called DeleteMe which removes your personal data from public records (something I highly recommend for any woman who can afford it). Even so, details like the lab where I work are fully public; I can't conceal them without doing professional damage to myself. I encountered an unavoidable trade-off between personal safety and public visibility, and I think it's something a lot of women face.

What advancements in deep learning and genomics do you hope to see in the next 3 years?

I hope to see a lot more intelligent customization of deep learning methods to genomics. Till now, people have been porting techniques straight from computer vision or NLP with few if any domain-specific modifications, but genomic data is a different beast and we shouldn't expect that an architecture optimized for something like NLP is best for genomics. My lab recently put out a preprint on accounting for the fact that DNA is double-stranded in our deep learning models - pretty much the simplest domain-specific modification you can think of. We hope it is the first of many in the years to come.

Avanti Shrikumar will be speaking at the Deep Learning in Healthcare Summit on 25-26 May. The summit will be held alongside the annual Deep Learning Summit in Boston - register your place here.

Confirmed speakers at the summit include Christhian Potes, Senior Scientist, Philips Research; Sergei Azernikov, Machine Learning Lead, Glidewell Labs; Hossein Estiri, Research Fellow, Harvard Medical School; Mason Victors, Lead Data Scientist, Recursion Pharmaceuticals; Muyinatu Bell, Professor, Johns Hopkins University; and more.

The London edition of the Deep Learning in Healthcare Summit takes place next week on 28 Feb - 1 March! Tickets are now limited - register your place here.
