Trustworthy Scientific Machine Learning

Funding Source: NSF CAREER

Budget: $572,765

Time: 09/2024 - 08/2029

Trustworthy machine learning for geo-distributed scientific data analytics.

Abstract: This project aims to develop a trustworthy optimization toolbox for geo-distributed scientific data analytics. It addresses a gap in current AI/ML practice: models trained on historical or regional data struggle with the complex and evolving dynamics of phenomena such as extreme weather events and climate change. The project pioneers optimization methods to enhance prediction robustness, explanation reliability, and scalable privacy protections, all of which are crucial for rare or unseen scenarios in safety-critical applications. It pursues three aims: bridging data topology and robust optimization, revolutionizing explainable machine learning for scientific discovery, and ensuring trustworthy collaborative learning. The project also integrates these advancements into education, promoting diversity and inclusion in STEM through interdisciplinary outreach and curricula.

Publications:

[AAAI'25 Oral] Kien X. Nguyen, Tang Li, Xi Peng. Interpretable Failure Detection with Human-Level Concepts. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2025. [PDF] [Code]
[ICML'24] Fengchun Qiao, Xi Peng. Ensemble Pruning for Out-of-distribution Generalization. In International Conference on Machine Learning, 2024. [PDF] [Code]
[ECCV'24 Strong Double Blind] Tang Li, Mengmeng Ma, Xi Peng. DEAL: Disentangle and Localize Concept-level Explanations for VLM. In European Conference on Computer Vision, 2024. [PDF] [Code]
[ICML'24] Mengmeng Ma, Tang Li, Xi Peng. Beyond Federation: Topology-aware Federated Learning for Generalization to Unseen Clients. In International Conference on Machine Learning, 2024. [PDF] [Code]
[CIKM'24] Kien X. Nguyen, Fengchun Qiao, Xi Peng. Adaptive Cascading Network for Continual Test-Time Adaptation. In Conference on Information and Knowledge Management, 2024. [PDF]

Open-Sourced Data:

SeafloorAI Dataset: It includes 696,000 sonar images, 827,000 annotated segmentation masks, 696,000 detailed language descriptions, and approximately 7 million question-answer pairs. This dataset is publicly available: [https://sites.google.com/udel.edu/seafloorai/homes]
Flooding Mapping Dataset: It includes 12,719 satellite images at 250-meter resolution, recording 98 large flood events that occurred in the U.S. between 2000 and 2021. This dataset is publicly available: [https://github.com/deep-real/TRO]

Open-Sourced Software:

Ensemble Pruning for OoD Generalization: A toolkit for ensemble-based robust optimization against distribution shifts. GitHub repo: [https://github.com/deep-real/TEP]
Ordinal Ranking of Concept Activation (ORCA): A lightweight, interpretable failure detection toolkit based on concept activation rankings. GitHub repo: [https://github.com/Nyquixt/ORCA]