Integration of Models and Data for Inference about Humans and Machines

To bridge the verification and explainability gap of data-driven approaches, we will investigate statistical, mathematical and computational tools that capture realistic prior knowledge about the underlying physical and social processes in intelligent behavior, in order to alleviate the burden of training data in learning, underwrite guarantees about system performance, or constrain inference in real time.

Research Highlights

1. Bayesian Methods for Inverse Scattering Problems

Computer experiments, that is, studies of real systems using mathematical models such as partial differential equations, have received increasing attention in science and engineering for the analysis of complex problems. Typically, computer experiments require a great deal of time and computing resources. It is therefore crucial to build a surrogate for the actual mathematical model from a finite sample of computer experiments and to use the surrogate for prediction, inference, and optimization. The Gaussian process (GP) model, also called kriging, is a widely used surrogate model because of its flexibility, its interpolating property, and its capacity for uncertainty quantification through the predictive distribution.
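The role of a GP surrogate can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy function `expensive_simulator` stands in for a costly PDE solve, and the zero-mean prior, squared-exponential kernel, and length scale are arbitrary choices rather than the models used in this project. The sketch shows the two properties named above: the posterior mean interpolates the simulator runs, and the posterior variance quantifies uncertainty at untried inputs.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_fit_predict(x_train, y_train, x_new, length_scale=0.2, jitter=1e-10):
    """Posterior mean and variance of a zero-mean, unit-variance GP."""
    K = sq_exp_kernel(x_train, x_train, length_scale)
    K = K + jitter * np.eye(len(x_train))  # small jitter for numerical stability
    K_s = sq_exp_kernel(x_new, x_train, length_scale)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = 1.0 - np.sum(v ** 2, axis=0)
    return mean, np.maximum(var, 0.0)  # clip tiny negative round-off

def expensive_simulator(x):
    """Hypothetical stand-in for a costly computer experiment."""
    return np.sin(2 * np.pi * x) + 0.5 * x

# A finite sample of simulator runs
x_train = np.linspace(0.0, 1.0, 8)
y_train = expensive_simulator(x_train)

# Cheap surrogate predictions, with uncertainty, at untried inputs
x_new = np.array([0.13, 0.5, 0.77])
mean, var = gp_fit_predict(x_train, y_train, x_new)

# Interpolating property: at the training inputs the surrogate
# reproduces the simulator output and the variance collapses
mean_at_train, var_at_train = gp_fit_predict(x_train, y_train, x_train)
```

In practice the kernel hyperparameters would be estimated from the data rather than fixed by hand as they are here.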

Despite extensive studies on GP modeling, developments for functional inputs are scarce. This work is motivated by an inverse scattering problem in which the computer simulations involve functional inputs, so the analysis and inference rely on a surrogate model that can take functional inputs into account. Figure 1 illustrates the idea of inverse scattering. Let the functional input g represent the material properties of an inhomogeneous isotropic scattering region of interest, shown in the middle of Figure 1. For a given functional input, the far-field pattern, u_s, is obtained by solving partial differential equations, which is computationally intensive. Given a new far-field pattern, the goal of inverse scattering is to recover the functional input using a surrogate model. Therefore, a crucial step in addressing this problem is to develop a surrogate model applicable to functional inputs.

Problems with functional inputs are frequently found in engineering applications of non-destructive testing, where measurements on the surface or exterior of an object are used to infer the interior structure. To address these problems, we introduce a new class of kernel functions for GPs with functional inputs. Based on the proposed GP models, the asymptotic convergence rates of the resulting mean squared prediction errors (MSPE) are rigorously derived. Inspired by the connection between the design of the input functions and the resulting convergence rates of MSPE, new classes of space-filling design ideas are developed in this research. Furthermore, based on the proposed GP surrogate, we will explore a Bayesian inverse framework and a calibration procedure to efficiently identify the functional input given an observed far-field pattern.
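To make the idea of a GP with functional inputs concrete, here is a minimal sketch. It is not the class of kernels proposed in this research. As an illustrative assumption, it uses a squared-exponential kernel applied to the L2 distance between input functions sampled on a common grid. The input family `make_inputs` and the scalar response `toy_response` are hypothetical stand-ins for the material-property functions and the expensive solver.

```python
import numpy as np

grid = np.linspace(0.0, 1.0, 51)  # common grid on which each g is sampled

def trapezoid(values, x):
    """Trapezoidal rule along the last axis."""
    return np.sum(0.5 * (values[..., 1:] + values[..., :-1]) * np.diff(x), axis=-1)

def functional_kernel(G1, G2, x, length_scale=0.5):
    """Assumed kernel: exp(-||g1 - g2||_{L2}^2 / (2 * length_scale^2))."""
    diff = G1[:, None, :] - G2[None, :, :]
    return np.exp(-0.5 * trapezoid(diff ** 2, x) / length_scale ** 2)

def make_inputs(coefs):
    """Hypothetical input family g(t) = a*sin(pi*t) + b*t."""
    a, b = coefs[:, :1], coefs[:, 1:]
    return a * np.sin(np.pi * grid) + b * grid

def toy_response(G):
    """Stand-in for the expensive solver: a scalar functional of g."""
    return trapezoid(G ** 2, grid)

rng = np.random.default_rng(0)
G_train = make_inputs(rng.uniform(-1.0, 1.0, size=(20, 2)))
y_train = toy_response(G_train)
G_new = make_inputs(np.array([[0.3, -0.4]]))

# GP posterior mean at a new functional input (zero-mean prior, small jitter)
K_raw = functional_kernel(G_train, G_train, grid)
K = K_raw + 1e-8 * np.eye(len(G_train))
k_new = functional_kernel(G_new, G_train, grid)
prediction = k_new @ np.linalg.solve(K, y_train)
```

Once such a surrogate is available, an inverse step could compare surrogate predictions against an observed response over candidate inputs. That step is omitted here because it depends on modeling choices the text does not specify.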

2. Statistical Properties of Decision Trees and Random Forests

In answering a major open question, our findings show that decision trees constructed with Classification and Regression Trees (CART) are consistent for regression and classification tasks, even when the number of predictor variables grows sub-exponentially with the sample size, under natural 0-norm and 1-norm sparsity constraints. The theory applies to a wide range of models, including generalized additive models with component functions that are continuous, of bounded variation, or, more generally, Borel measurable. Consistency holds for arbitrary joint distributions of the predictor variables, thereby accommodating continuous, discrete, and/or dependent data. Furthermore, it is also shown that these qualitative properties of individual trees are inherited by Breiman’s random forests.

Theory for variants of popular variable importance/ranking measures is also developed in this research. Decision trees and their ensembles are endowed with a rich set of diagnostic tools for ranking and screening variables in a predictive model. Despite the widespread use of tree-based variable importance measures, their theoretical properties have been challenging to pin down and therefore remain largely unexplored. To address this gap between theory and practice, we derived finite sample performance guarantees for variable selection in nonparametric models using a single-level CART decision tree (a decision stump). Under standard operating assumptions in the variable screening literature, the marginal signal strength of each variable and the ambient dimensionality can be considerably weaker and higher, respectively, than those required by state-of-the-art nonparametric variable selection methods (such as Nonparametric Independence Screening (NIS)). Furthermore, unlike previous marginal screening methods that attempt to directly estimate each marginal projection via a truncated basis expansion, the fitted model used here is a simple, parsimonious decision stump, thereby eliminating the need for tuning the number of basis terms.
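Marginal screening with decision stumps can be sketched as follows. This is an illustrative reading of the idea and not necessarily the exact procedure analyzed in the paper. For each variable, the best single CART-style split is found, the resulting decrease in variance is recorded as that variable's score, and the top-scoring variables are kept. The simulated data, the number of retained variables, and the function name `stump_impurity_reduction` are assumptions made for the example.

```python
import numpy as np

def stump_impurity_reduction(x, y):
    """Largest decrease in variance achieved by one split on x."""
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]
    n = len(ys)
    csum = np.cumsum(ys)
    n_left = np.arange(1, n)                 # left-node sizes for each split
    left_mean = csum[:-1] / n_left
    right_mean = (csum[-1] - csum[:-1]) / (n - n_left)
    # Variance decrease of a split equals
    # (n_L * n_R / n^2) * (mean_L - mean_R)^2
    gain = n_left * (n - n_left) / n ** 2 * (left_mean - right_mean) ** 2
    valid = xs[1:] > xs[:-1]                 # split only between distinct values
    return float(np.max(gain[valid])) if valid.any() else 0.0

# Simulated sparse model: only variables 0 and 3 affect the response
rng = np.random.default_rng(1)
n, p = 500, 50
X = rng.uniform(0.0, 1.0, size=(n, p))
y = 2.0 * X[:, 0] + 1.5 * (X[:, 3] > 0.5) + 0.3 * rng.standard_normal(n)

# One score per variable, then keep the two highest-scoring variables
scores = np.array([stump_impurity_reduction(X[:, j], y) for j in range(p)])
selected = np.argsort(scores)[::-1][:2]
```

Note that no basis expansion or number of basis terms has to be chosen: the only fitted object per variable is a single split point.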

3. Building Data-driven Models of Human Communication and Using Them to Assess and Improve Human-computer Interaction

Matthew Stone, in collaboration with PhD students Brian McMahan and Baber Khalid, has improved the methodology for building data-driven models of human communication and for using them to assess and improve human-computer interaction. McMahan and Stone (2020) report a new, latent-variable approach to quantify corpus-based evidence of the diverse kinds of reasoning speakers employ. These models can be used to recognize not only what speakers are thinking of but also what communicative strategy they are following; Khalid, Alikhani and Stone (2020) show how they can be leveraged in a reinforcement learning approach to dialogue planning. This enables interactive systems to give targeted, effective feedback, using context-sensitive clarification strategies that focus on key missing information, elicit correct answers that the system understands, and contribute to increasing dialogue success. Our ongoing work (currently under review) looks at integrating these empirical methods with best practices for human-centered design and software engineering to create new dialogue functionality. The project has shared the outcomes of this work, including code, data and visualizations.
McMahan and Stone
Khalid, Alikhani and Stone

The matcher agent from Khalid, Alikhani and Stone (2020) creatively and successfully clarifies which color patch a human crowd worker (director) intends to refer to.


  • Model Identification and Control of a Low-cost Mobile Robot with Omnidirectional Wheels using Differentiable Physics by Edgar Granados, Abdeslam Boularias, Kostas Bekris and Mridul Aanjaneya, to appear in ICRA 2022, 2022.
  • Sparse learning with CART for noiseless regression models, by J. M. Klusowski, to appear in IEEE Transactions on Information Theory, 2022.
  • Large scale prediction with decision trees, by J. M. Klusowski, Reject and resubmit to Journal of the American Statistical Association, 2022.
  • Optimal Simulator Selection, by Y. Hung, L.-H. Lin, and C. F. J. Wu (2022), Journal of the American Statistical Association, to appear.
  • A perturbation problem for transmission eigenvalues, by D. Ambrose, F. Cakoni and S. Moskow (2022), Research in the Mathematical Sciences (in press).
  • Singularities almost always scatter: Regularity results for non-scattering inhomogeneities, by F. Cakoni and M. Vogelius (2022), Communications on Pure and Applied Mathematics (in press).
  • Nonparametric variable screening with optimal decision stumps, by J. M. Klusowski and P. M. Tian, AISTATS, 2021.
  • Good classifiers are abundant in the interpolating regime, by R. Theisen, J. M. Klusowski, M. W. Mahoney, AISTATS, 2021.
  • Sharp analysis of a simple model for random forests, J. M. Klusowski, AISTATS, 2021.
  • Characterizing the SLOPE Trade-off: A Variational Perspective and the Donoho–Tanner Limit, by Z. Bu, J. M. Klusowski, C. Rush, W. J. Su, Revise and resubmit to Annals of Statistics, 2021.
  • Bayesian Indicator Selection Approach for the Gaussian Process Models in Computer Experiments, by F. Zhang, R.-B. Chen, Ying Hung, and X. Deng (2021), submitted.
  • Efficient Calibration for Imperfect Epidemic Models with Applications to the Analysis of COVID-19, by C.-L. Sung and Ying Hung (2021), Major revision to be submitted to the Journal of the Royal Statistical Society, Series C.
  • Varying Coefficient Frailty Models with Applications in Single Molecular Experiments, by Ying Hung, L.-H. Lin, and C. F. J. Wu (2021), Biometrics, to appear.
  • A spectral approach to non-destructive testing via electromagnetic waves, by F. Cakoni, S. Cogar and P. Monk (2021), IEEE Transactions on Antennas and Propagation 69, no 12, 8689-8697.
  • Transmission Eigenvalues, by F. Cakoni, D. Colton and H. Haddar (2021), AMS Notices, October Issue, 68 no 9, 1499-1510.
  • A note on transmission eigenvalues in electromagnetic scattering theory, by F. Cakoni, S. Meng and J. Xiao (2021), Inverse Problems and Imaging 15 no. 5, 999–1014.
  • Analysis of the linear sampling method for imaging penetrable obstacles in the time domain, by F. Cakoni, P. Monk and V. Selgas (2021), Analysis & PDEs, 4 no. 3, 667–688.
  • The interior transmission eigenvalue problem for elastic waves in media with obstacles, by F. Cakoni, P-Z Kow and J-N Wang (2021), Inverse Problems and Imaging, 15 no. 3, 445-474.
  • On corner scattering for operators of divergence form and applications to inverse scattering, by F. Cakoni, J. Xiao (2021), Comm. PDEs, 46, no. 3, 413-441.
  • On the discreteness of transmission eigenvalues for the Maxwell equation, by F. Cakoni, H.M. Nguyen (2021), SIAM J. Math Analysis, 53, no. 1, 888-913.
  • Target signatures for thin surfaces, by F. Cakoni, P. Monk and Y. Zhang (2021), Inverse Problems 38 025011.
  • A Generalized Gaussian Process Model for Computer Experiments with Binary Time Series, by C.-L. Sung, Y. Hung, W. Rittase, C. Zhu, and C. F. J. Wu (2020), Journal of the American Statistical Association, 115, 945-956.
  • Calibration for Computer Experiments with Binary Responses, by C.-L. Sung, Y. Hung, W. Rittase, C. Zhu, and C. F. J. Wu (2020), Journal of the American Statistical Association, 115, 1664-1674.
  • Subspace Differential Privacy, by Jie Gao, Ruobin Gong, Fang-Yi Yu, Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI-22), February 22-March 1st, 2022.
  • Log-time Prediction Markets for Interval Securities, by Miroslav Dudík, Xintong Wang, David M. Pennock, David M. Rothschild, AAMAS 2021: 465-473
  • Towards a Theory of Confidence in Market-Based Predictions, by Rupert Freeman, David M. Pennock, Daniel M. Reeves, David M. Rothschild, Bo Waggoner, ISIPTA 2021: 365-368
  • Designing a Combinatorial Financial Options Market, by Xintong Wang, David M. Pennock, Nikhil R. Devanur, David M. Rothschild, Biaoshuai Tao, Michael P. Wellman, EC 2021: 864-883
  • Beating Greedy For Approximating Reserve Prices in Multi-Unit VCG Auctions, by Mahsa Derakhshan, David M. Pennock, Aleksandrs Slivkins, SODA 2021: 1099-1118
  • Varying coefficient Frailty Models with applications in single molecular experiments, by Ying Hung, L.-H. Lin, and C.F.J. Wu, under review in Biometrics, 2020, [bib]

  • Optimal Crossover Designs for Quantitative Variables, by Ying Hung and L. Wang, under review, 2019, [bib]

  • Gaussian Process prediction using experimental design-based Subagging, by L. He and Ying Hung, under review, 2019, [bib]

  • A sequential split-and-conquer approach for the analysis of big dependent data in computer experiments, by Chengrui Li, Ying Hung, and Minge Xie, in The Canadian Journal of Statistics, Wiley Online Library, 2020, [pdf] [bib]

  • Discourse Coherence, Reference Grounding and Goal Oriented Dialogue, by Baber Khalid, Malihe Alikhani, Michael Fellner, Brian McMahan, and Matthew Stone, in The 24th Workshop on the Semantics and Pragmatics of Dialogue (Semdial), 2020, [pdf] [bib]

  • Analyzing Speaker Strategy in Referential Communication, by Brian McMahan and Matthew Stone, in Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2020, [pdf] [git] [vid] [bib]

  • That and There: Judging the Intent of Pointing Actions with Robotic Arms, by Malihe Alikhani, Baber Khalid, Rahul Shome, Chaitanya Mitash, Kostas Bekris and Matthew Stone, in Proceedings of AAAI, 2020, [pdf] [git] [bib]

  • Cross-modal Coherence Modeling for Caption Generation, by Malihe Alikhani, Piyush Sharma, Shengjie Li, Radu Soricut and Matthew Stone, in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, [pdf] [git] [bib]

  • Sparse learning with CART, by Jason M. Klusowski, NeurIPS, 2020, [pdf] [bib]

  • Good linear classifiers are abundant in the interpolating regime, by Ryan Theisen and Jason M. Klusowski and Michael W. Mahoney, arXiv preprint, 2020, [pdf] [bib]

  • Limiting Boundary Correctors for Periodic Microstructures and Inverse Homogenization Series, by Fioralba Cakoni, Shari Moskow, and Tayler Pangburn, in Inverse Problems, IOP Publishing, 2020, [pdf] [bib]


  • Special Semester Tomography Across the Scales Prequel Workshop – by Fioralba Cakoni at RICAM, Linz, Austria, October 11-15, 2021.
  • Plenary Speaker at the 6th Annual Meeting of SIAM Central States Section – by Fioralba Cakoni at The University of Kansas, USA, October 2-3, 2021.
  • Rees Distinguished Lectures two-lecture series – by Fioralba Cakoni at the University of Delaware, USA, April 29-30, 2021.
  • Imagine & Inverse Problems: One World Seminar – by Fioralba Cakoni at SIAG IS via Zoom, April 14, 2021.
  • Workshop on Tomographic Reconstructions and their Startling Applications – by Fioralba Cakoni at Schrödinger International Institute for Mathematics and Physics, Vienna, Austria via Zoom, March 15 – 25, 2021.
  • Distinguished Lecture at Michigan Technological Institute – by Fioralba Cakoni via Zoom, March 5, 2021.
  • Colloquium at Center for Applicable Mathematics, School of Mathematics of the Tata Institute of Fundamental Research – by Fioralba Cakoni via Zoom, Mumbai, India, January 18, 2022.
  • International Zoom Inverse Problems Seminar – by Fioralba Cakoni via Zoom, UC Irvine, California, USA, October 7, 2021.
  • Colloquium at the University of Arizona – by Fioralba Cakoni via Zoom, Tucson, Arizona, September 3, 2021.
  • Colloquium at Michigan University – by Fioralba Cakoni via Zoom, East Lansing, Michigan, March 22, 2021.
  • Colloquium at the University of Padova – by Fioralba Cakoni via Zoom, Italy, March 3, 2021.
  • A Two-Stage Framework for Constraint Optimization in Computer Experiments with Applications in Materials Science – paper presentation by Jiazhao Zhang and Ying Hung at Joint Statistical Meetings, August 6, 2020 [url]
  • Optimization of ReaxFF Parameters – seminar talk by Yao Song, Ying Hung, and Tirthankar Dasgupta at Joint Statistical Meetings, August 5, 2020 [url]
  • Good Linear Classifiers Are Abundant in the Interpolating Regime – paper presentation by Jason Klusowski at Joint Statistical Meetings, August 5, 2020 [url]
  • Spectral problems in Inverse Scattering for Inhomogeneous media – colloquium by Fioralba Cakoni at University of Maryland, October 31, 2019
  • Transmission Eigenvalues and inverse scattering – colloquium by Fioralba Cakoni at New York University, Abu Dhabi, UAE, February 9, 2020
  • Transmission Eigenvalues and invisibility in Euclidean and Hyperbolic geometry – seminar talk by Fioralba Cakoni at International Zoom Inverse Problems Seminar, May 7, 2020
  • Computational Methods for New Directions in Inverse Problems – workshop talk by Fioralba Cakoni at the Institute for Applied Mathematics and Computational Science, Texas A&M University, February 3-5, 2020 [url]