Unboxing Black Boxes

By Scarlet Stadtler

If you are interested in the talk Scarlet Stadtler gave at CASUS, please check out this video:

CASUS Institute Seminar, Dr. Scarlet Stadtler, Forschungszentrum Jülich GmbH/Jülich Supercomputing Centre (JSC)

Scarlet is a member of the division “Federated Systems and Data” and of the group “Earth System Data Exploration” at the Institute for Advanced Simulation (IAS). Her research focuses on using machine learning (ML), especially deep learning algorithms, in weather and air quality research. A trained meteorologist, she currently wants to explore how tools from computer vision, such as video prediction, can be used to forecast meteorological variables and how these techniques can be applied to air pollutants such as (organic) aerosols.

Currently, a large number of ML studies are performed in the atmospheric sciences. Getting started with ML for atmospheric chemistry is challenging. Usually, researchers have a dataset they want to analyze with a specific research question in mind, but which ML algorithm do they choose? Most commonly, researchers try a couple of algorithms and pick the one with the highest cross-validation score, or the one with the best performance. But which is the most appropriate, most fitting and ‘best’ algorithm? How do we know?
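The common practice described above, trying a few algorithms and keeping the one with the highest cross-validation score, can be sketched as follows. This is a minimal illustration using scikit-learn on synthetic data. The candidate models, the R² scoring and the five-fold split are assumptions chosen for the example, not the setup used in the study.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for a real dataset (assumption: a tabular regression task).
X, y = make_regression(
    n_samples=300, n_features=8, n_informative=4, noise=10.0, random_state=0
)

# Hypothetical shortlist of candidate algorithms.
candidates = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Use the same shuffled 5-fold split for every candidate so scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Mean R^2 across folds for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    for name, model in candidates.items()
}

# The usual selection rule: keep the highest-scoring model.
best_name = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
print("selected:", best_name)
```

A rule like this ranks the candidates, but it does not say why one model scores higher or whether its predictions rest on physically sensible relationships.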

For a current study, Scarlet and her team consider not only evaluation metrics but also explainable AI methods, to get closer to an a priori informed decision about which algorithm to use. They build upon their previous study, in which they introduced an air quality benchmark dataset (AQ-Bench) with a defined task and a corresponding ML workflow.

AQ-Bench is the first global ozone metric benchmark dataset based on real observational data from the TOAR database. This dataset comes with a real-world task: deriving the air quality based on environmental conditions.

For atmospheric scientists, it is unsatisfying to derive air quality with a black-box model. To judge whether an ML model is trustworthy, the user needs to understand how it makes predictions. In particular, do these trained ML models fit our current understanding of atmospheric chemistry? In their study, Scarlet and her team use two different architectures, a Random Forest (RF) and a shallow Neural Network (NN). To understand how both models work, they use explainable AI methods. On the one hand, they want to understand the models’ predictions themselves. On the other hand, they examine how the models’ decisions fit the current understanding of atmospheric processes. The performance of the NN and the RF differs. Besides tracking accuracy and identifying the best model, the team wants to understand why the RF outperforms the NN on the AQ-Bench dataset. Thus, they dig deeper into the architectures using visualizations and explanations of individual examples.
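To give a flavor of what comparing an RF and a shallow NN with an explainability method can look like, here is a minimal sketch using scikit-learn. It is not the team's workflow: the data are synthetic rather than AQ-Bench, the network size is arbitrary, and permutation importance is used as one simple, model-agnostic stand-in for the explainable AI methods mentioned above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): only features 0 and 1 influence the target,
# features 2 to 4 are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = 3.0 * X[:, 0] + np.sin(2.0 * X[:, 1]) + 0.1 * rng.normal(size=600)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "rf": RandomForestRegressor(n_estimators=100, random_state=0),
    # "Shallow" here means a single hidden layer; inputs are standardized first.
    "nn": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    ),
}

test_scores = {}
importances = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # R^2 on held-out data: the "how good" part of the evaluation.
    test_scores[name] = model.score(X_test, y_test)
    # Drop in R^2 when one feature is shuffled: the "which inputs matter" part.
    result = permutation_importance(
        model, X_test, y_test, n_repeats=10, random_state=0
    )
    importances[name] = result.importances_mean

for name in models:
    print(name, "test R^2:", round(test_scores[name], 3))
    print(name, "feature importances:", np.round(importances[name], 3))
```

If both models rank the same inputs as important and that ranking matches domain knowledge, it lends some support to trusting their predictions; if the rankings diverge, that is a starting point for investigating why one architecture performs better than the other.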