Seminar on Statistics and Data Science

This seminar series is organized by the research group in mathematical statistics and features talks on advances in methods of data analysis, statistical theory, and their applications.
The speakers are external guests as well as researchers from other groups at TUM.

All talks in the seminar series are listed in the Munich Mathematical Calendar.


During the winter term 2020, the seminars are held via Zoom. Information on how to access each seminar will be made available on this website on the day of the talk.

To stay up to date on upcoming presentations, please join our mailing list. You will receive an email to confirm your subscription.

Upcoming talks

14.04.2021 12:15 Mona Azadkia (ETH Zurich): A Simple Measure Of Conditional Dependence

We propose a coefficient of conditional dependence between two random variables $Y$ and $Z$, given a set of other variables $X_1, \ldots, X_p$, based on an i.i.d. sample. The coefficient has a long list of desirable properties, the most important of which is that under absolutely no distributional assumptions it converges to a limit in $[0, 1]$, where the limit is 0 if and only if $Y$ and $Z$ are conditionally independent given $X_1, \ldots, X_p$, and is 1 if and only if $Y$ is equal to a measurable function of $Z$ given $X_1, \ldots, X_p$. Moreover, it has a natural interpretation as a nonlinear generalization of the familiar partial $R^2$ statistic for measuring conditional dependence by regression. Using this statistic, we devise a new variable selection algorithm, called Feature Ordering by Conditional Independence (FOCI), which is model-free, has no tuning parameters, and is provably consistent under sparsity assumptions. A number of applications to synthetic and real datasets are worked out.
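As background, the unconditional building block of such a coefficient (Chatterjee's rank correlation; the conditional version in the talk replaces the sort by nearest neighbours in the conditioning variables) can be sketched in a few lines. The function name and the no-ties assumption are ours, not from the talk:

```python
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's rank coefficient xi_n(x, y): approximately 0 under
    independence, approaching 1 when y is a measurable function of x.
    Assumes no ties in y for simplicity."""
    n = len(x)
    # sort the y-values by the corresponding x-values
    y_sorted = np.asarray(y)[np.argsort(x, kind="stable")]
    # r_i = rank of y_sorted[i] among all y (1, ..., n; no ties assumed)
    r = np.argsort(np.argsort(y_sorted)) + 1
    # xi_n = 1 - 3 * sum |r_{i+1} - r_i| / (n^2 - 1)
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)
```

For a perfectly dependent sample the coefficient is close to 1, while for independent samples it fluctuates around 0 at scale $n^{-1/2}$.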

14.04.2021 13:15 Armeen Taeb (ETH Zurich): Latent-variable modeling: causal inference and false discovery control

Many driving factors of physical systems are latent or unobserved. Thus, understanding such systems and producing robust predictions crucially relies on accounting for the influence of the latent structure. I will discuss methodological and theoretical advances in two important problems in latent-variable modeling. The first problem focuses on developing false discovery methods for latent-variable models that are parameterized by low-rank matrices, where the traditional perspective on false discovery control is ill-suited due to the non-discrete nature of the underlying decision spaces. To overcome this challenge, I will present a geometric reformulation of the notion of a discovery as well as a specific algorithm to control false discoveries in these settings. The second problem aims to estimate causal relations among a collection of observed variables with latent effects. Given access to data arising from perturbations (interventions), I will introduce a regularized maximum-likelihood framework that provably identifies the underlying causal structure and improves robustness to distributional changes. Throughout, I will explore the utility of the proposed methodologies for real-world applications such as water resource management.

Previous talks

24.02.2021 12:15 Elisabeth Ullmann (TUM): Multilevel estimators for models based on partial differential equations

Many mathematical models of physical processes contain uncertainties due to incomplete models, measurement errors, and lack of knowledge about the model inputs. We consider processes which are formulated in terms of classical partial differential equations (PDEs). The challenge and novelty is that the PDEs contain random coefficient functions, e.g., transformations of Gaussian random fields. Random PDEs are much more flexible and can model more complex situations than classical PDEs with deterministic coefficients. However, each sample of a PDE-based model is extremely expensive. To alleviate the high cost, the numerical analysis community has developed so-called multilevel estimators, which work with a hierarchy of PDE models of different resolution and cost. We review the basic idea of multilevel estimators and discuss our own recent contributions: i) a multilevel best linear unbiased estimator to approximate the expectation of a scalar output quantity of interest associated with a random PDE [1, 2], ii) a multilevel sequential Monte Carlo method for Bayesian inverse problems [3], iii) a multilevel sequential importance sampling method to estimate the probability of rare events [4].

[1] D. Schaden, E. Ullmann: On multilevel best linear unbiased estimators. SIAM/ASA J. Uncert. Quantif. 8(2), pp. 601-635, 2020
[2] D. Schaden, E. Ullmann: Asymptotic analysis of multilevel best linear unbiased estimators. arXiv:2012.03658
[3] J. Latz, I. Papaioannou, E. Ullmann: Multilevel Sequential² Monte Carlo for Bayesian Inverse Problems. J. Comput. Phys. 368, pp. 154-178, 2018
[4] F. Wagner, J. Latz, I. Papaioannou, E. Ullmann: Multilevel sequential importance sampling for rare event estimation. SIAM J. Sci. Comput. 42(4), pp. A2062–A2087, 2020
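The telescoping idea behind multilevel estimators, E[Q_L] = E[Q_0] + Σ_l E[Q_l − Q_{l−1}], can be illustrated on a toy problem. The trapezoidal "model" and the per-level sample sizes below are illustrative stand-ins for an actual PDE solve, not the estimators of the talk:

```python
import numpy as np

def model(a, level):
    """Stand-in for a PDE solve: trapezoidal rule for the integral of
    exp(a*x) over [0, 1] on a grid with 2**level cells. Finer levels
    are more accurate and (in a real application) more expensive."""
    n = 2 ** level
    x = np.linspace(0.0, 1.0, n + 1)
    fx = np.exp(np.outer(a, x))
    return (fx[:, 1:-1].sum(axis=1) + 0.5 * (fx[:, 0] + fx[:, -1])) / n

def mlmc_mean(rng, levels, samples_per_level):
    """Multilevel Monte Carlo estimate of E[Q_L] via the telescoping sum
    E[Q_L] = E[Q_0] + sum_l E[Q_l - Q_{l-1}]. Each correction term uses
    coupled samples: the same random input a on both adjacent levels."""
    est = 0.0
    for level, n in zip(range(levels + 1), samples_per_level):
        a = rng.normal(0.0, 0.5, size=n)      # random coefficient
        q_fine = model(a, level)
        corr = q_fine if level == 0 else q_fine - model(a, level - 1)
        est += corr.mean()
    return est
```

Because the corrections have small variance, most samples can be spent on the cheap coarse level, which is the source of the cost savings.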

18.02.2021 17:00 Dorota Kurowicka (TU Delft): Simplified R-vine based forward regression

An extension of the D-vine based forward regression procedure to an R-vine forward regression is proposed. In this extension, any R-vine structure can be taken into account. Moreover, a new heuristic is proposed to determine which R-vine structure is most appropriate to model the conditional distribution of the response variable given the covariates. Simulations show that the performance of the heuristic is comparable to that of the D-vine based approach. Furthermore, it is explained how to extend the heuristic to situations where more than one response variable is of interest. Finally, the proposed R-vine regression is applied to a stress analysis of the manufacturing sector, showing its impact on the whole economy. Reference: Zhu, Kurowicka and Nane.

03.02.2021 16:00 Holger Dette (Ruhr-Universität Bochum): Testing relevant hypotheses in functional time series via self-normalization

In this paper we develop methodology for testing relevant hypotheses in a tuning-free way. Our main focus is on functional time series, but extensions to other settings are also discussed. Instead of testing for exact equality, for example the equality of two mean functions from two independent time series, we propose to test for a \textit{relevant} deviation under the null hypothesis. In the two-sample problem this means that the $L^2$-distance between the two mean functions is smaller than a pre-specified threshold. For such hypotheses, self-normalization, which was introduced by Shao (2010) and is commonly used to avoid the estimation of nuisance parameters, is not directly applicable. We develop new self-normalized procedures for testing relevant hypotheses and demonstrate the particular advantages of this approach in the comparison of eigenvalues and eigenfunctions.
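For the simplest case of a scalar mean, Shao's self-normalisation replaces a long-run variance estimate by a normaliser built from recursive partial means. The sketch below is a textbook scalar version (our notation), not the functional, relevant-hypothesis construction of the talk:

```python
import numpy as np

def sn_stat(x, mu0):
    """Self-normalised statistic for H0: E[X_t] = mu0 (in the spirit of
    Shao, 2010). The normaliser W is built from partial-sample means,
    so no nuisance (long-run variance) parameter has to be estimated."""
    n = len(x)
    cummean = np.cumsum(x) / np.arange(1, n + 1)     # recursive means
    # W = n^{-2} * sum_t t^2 (mean of first t obs - full mean)^2
    W = np.sum(np.arange(1, n + 1) ** 2 * (cummean - cummean[-1]) ** 2) / n**2
    return n * (cummean[-1] - mu0) ** 2 / W
```

Because W does not depend on mu0, the statistic grows with the squared distance between the sample mean and the hypothesised value, while its null limit is pivotal.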

20.01.2021 17:00 Marija Tepegjozova: Nonparametric C- and D-vine based quantile regression

Quantile regression is a field of steadily growing importance in statistical modeling. It complements linear regression, since computing a range of conditional quantile functions provides a more accurate model of the stochastic relationship among variables, especially in the tails. We introduce a novel nonrestrictive and highly flexible nonparametric quantile regression approach based on C- and D-vine copulas. Vine copulas allow for separate modeling of the marginal distributions and the dependence structure in the data, and can be expressed through a graph-theoretical model given by a sequence of trees. This way we obtain a quantile regression model that overcomes typical issues of quantile regression, such as quantile crossings, collinearity, and the need for transformations and interactions of variables. Our approach incorporates a two-step-ahead ordering of variables, maximizing the conditional log-likelihood of the tree sequence while taking into account the next two tree levels. We show that the nonparametric conditional quantile estimator is consistent. The performance of the proposed methods is evaluated in both low- and high-dimensional settings using simulated and real-world data. The results support the superior prediction ability of the proposed models.
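The target that any quantile regression method, vine-based or otherwise, aims at is the conditional quantile, characterised by the check (pinball) loss. A minimal sketch of this criterion, whose minimiser over constants is the empirical $\alpha$-quantile (this is standard background, not the vine-copula estimator of the talk):

```python
import numpy as np

def pinball_loss(q, y, alpha):
    """Check (pinball) loss: minimising it over q yields an
    alpha-quantile of y. Conditional versions of this criterion are
    what quantile regression models estimate."""
    u = y - q
    return np.mean(np.where(u >= 0, alpha * u, (alpha - 1.0) * u))
```

For example, a grid search over constants on the sample 1, ..., 100 with alpha = 0.9 lands in the interval of empirical 0.9-quantiles, [90, 91].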

09.12.2020 17:00 Thomas Nagler (Leiden University, NL): Stationary vine copula models for multivariate time series

Multivariate time series exhibit two types of dependence: across variables and across time points. Vine copulas are graphical dependence models that can conveniently capture both types of dependence in the same model. We derive the maximal class of graph structures that guarantees stationarity under a condition called translation invariance. Translation invariance is not only a necessary condition for stationarity, but also the only condition we can reasonably check in practice. In this sense, the new model class characterizes all practically relevant vine structures for modeling stationary time series. We propose computationally efficient methods for estimation, simulation, prediction, and uncertainty quantification, and show their validity via asymptotic results and simulations. The theoretical results allow for misspecified models and, even when specialized to the \emph{iid} case, go beyond what is available in the literature. The new model class is illustrated by an application to forecasting returns of a portfolio of 20 stocks, where it shows excellent forecast performance. The paper is accompanied by an open-source software implementation.

02.12.2020 13:00 Göran Kauermann (LMU): Nowcasting and Forecasting using COVID-19 data

We analyse the temporal and regional structure of COVID-19 infections, making use of the openly available data on registered cases in Germany published daily by the Robert Koch Institute (RKI). We demonstrate the necessity of applying nowcasting to cope with delayed reporting. Delayed reporting occurs because local health authorities report infections with delay, due to delayed test results, delayed reporting chains, or other issues not controllable by the RKI. A reporting delay also occurs for fatal cases, where the death occurs some time after the infection (unless post-mortem tests are applied). The talk gives a general discussion of nowcasting and applies it in two settings. First, we derive an estimate of the number of present-day infections that will, at a later date, prove to be fatal. Our district-level modelling approach allows us to disentangle spatial variation into a global pattern for Germany, district-specific long-term effects, and short-term dynamics, taking the demographic composition of the local population into account. (Joint work with Marc Schneble, Giacomo De Nicola & Ursula Berger.) The second application combines nowcasting with forecasting of infection numbers. This leads to a fore-nowcast, which is motivated methodologically. The method is suitable for all data reported with delay, and we demonstrate its usability on COVID-19 infections.
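The simplest form of a nowcasting correction, a textbook multiplicative inflation rather than the district-level model of the talk, divides each day's incomplete count by the probability that a case from that day has already been reported. The delay distribution below is made up for illustration and would in practice be estimated from historical reporting data:

```python
import numpy as np

def nowcast(reported_so_far, delay_cdf):
    """Multiplicative-inflation nowcast. reported_so_far[t] is the count
    for day t that has arrived by 'today' (the last day of the series);
    delay_cdf[d] is the probability a case is reported within d days.
    Recent days are inflated most, since fewer of their cases are in."""
    T = len(reported_so_far)
    est = np.empty(T)
    for t in range(T):
        days_elapsed = T - 1 - t
        p_reported = delay_cdf[min(days_elapsed, len(delay_cdf) - 1)]
        est[t] = reported_so_far[t] / p_reported
    return est
```

On data generated exactly from a known delay distribution, the correction recovers the true counts; real data additionally require estimating the delay distribution and quantifying the nowcast uncertainty.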

23.07.2020 13:00 Niki Kilbertus (MPI for Intelligent Systems & University of Cambridge): A class of algorithms for general instrumental variable models

I will start with a general motivation for cause-effect estimation and describe common challenges such as identifiability. We will then take a closer look at the instrumental variable setting and how an instrument can help for identification. Most approaches to achieve identifiability require one-size-fits-all assumptions such as an additive error model for the outcome. Instead, I will present a framework for partial identification, which provides lower and upper bounds on the causal treatment effect. Our approach leverages advances in gradient-based optimization for the non-convex objective and works in the most general case, where instrument, treatment and outcome are continuous. Finally, we demonstrate on a set of synthetic and real-world data that our bounds capture the causal effect when additive methods fail, providing a useful range of answers compatible with observation as opposed to relying on unwarranted structural assumptions.
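As a baseline for what the talk generalises: under the classical additive/linear assumptions, the instrumental-variable estimate is a ratio of covariances and removes the confounding bias that plain regression retains. The simulated data below are our own illustration, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
u = rng.normal(size=n)                 # unobserved confounder
z = rng.normal(size=n)                 # instrument: affects t, not y directly
t = z + u + 0.5 * rng.normal(size=n)   # treatment, confounded by u
y = 2.0 * t + u + rng.normal(size=n)   # outcome; true causal effect is 2

ols = np.cov(t, y)[0, 1] / np.var(t)          # biased by the confounder
iv = np.cov(z, y)[0, 1] / np.cov(z, t)[0, 1]  # Wald / 2SLS estimate
```

The partial-identification framework of the talk drops the additive-error assumption this estimator relies on, returning bounds on the effect instead of a point estimate.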

17.06.2020 12:15 Jürgen Pfeffer (TUM): The enemies of good social media samples

Thousands of researchers use social media data to analyze human behavior at scale. The underlying assumption is that millions of people leave digital traces, and by collecting these traces we can reconstruct the activities, topics, and opinions of groups or societies. Some data biases are obvious. For instance, most social media platforms do not represent the socio-demographic setup of society. Social bots can also obscure actual human activity on these platforms. Consequently, it is not trivial to use social media analyses to draw conclusions about societal questions. In this presentation, I will focus on a more specific question: do we even get good social media samples? In other words, do the social media data available to researchers represent the overall platform activity? I will show how nontransparent sampling algorithms create non-representative data samples, and how technical artifacts of hidden algorithms can create surprising side effects with potentially devastating implications for data sample quality.

27.05.2020 12:15 Reinhard Heckel (TUM): Early stopping in deep networks: Double descent and how to mitigate it

Over-parameterized models, in particular deep networks, often exhibit a ``double-descent'' phenomenon, where, as a function of model size, the error first decreases, then increases, and finally decreases again. This intriguing double-descent behavior also occurs as a function of training time, and it has been conjectured that such ``epoch-wise double descent'' arises because training time controls the model complexity. In this paper, we show that double descent arises for a different reason: it is caused by two overlapping bias-variance tradeoffs that occur because different parts of the network are learned at different speeds.

13.05.2020 12:15 Vanda Inacio De Carvalho (The University of Edinburgh, UK): Flexible nonparametric Bayesian density regression via dependent Dirichlet process mixture models and penalised splines

In many real-life applications, it is of interest to study how the distribution of a (continuous) response variable changes with covariates. Dependent Dirichlet process (DDP) mixtures of normal models, a Bayesian nonparametric method, successfully address this goal. The approach of considering covariate-independent mixture weights, also known as the single-weights dependent Dirichlet process mixture model, is very popular due to its computational convenience, but can have limited flexibility in practice. To overcome this lack of flexibility while retaining computational tractability, this work develops a single-weights DDP mixture of normals model in which the components' means are modelled using Bayesian penalised splines (P-splines). We coin our approach psDDP. A practically important feature of psDDP models is that all parameters have conjugate full conditional distributions, thus leading to straightforward Gibbs sampling. In addition, they allow the effect associated with each covariate to be learned automatically from the data. The validity of our approach is supported by simulations, and it is applied to a study concerning the association of a toxic metabolite with preterm birth.
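The penalised-spline building block can be sketched as penalised least squares on a spline basis. Here a truncated-power basis with a ridge penalty stands in for the B-spline-plus-difference-penalty version used in the talk, and the smoother fits only a mean curve rather than the full Bayesian mixture; function name and defaults are ours:

```python
import numpy as np

def pspline_fit(x, y, n_knots=20, degree=3, lam=1e-3):
    """Penalised-spline smoother: truncated-power basis plus a ridge
    penalty on the knot coefficients (a simple stand-in for B-splines
    with a difference penalty). Returns fitted values at x."""
    xs = (x - x.min()) / (x.max() - x.min())          # rescale for stability
    knots = np.linspace(0.0, 1.0, n_knots + 2)[1:-1]  # interior knots
    B = np.column_stack(
        [xs**d for d in range(degree + 1)]
        + [np.clip(xs - k, 0.0, None) ** degree for k in knots]
    )
    pen = np.zeros(B.shape[1])
    pen[degree + 1:] = lam        # penalise only the knot coefficients
    coef = np.linalg.solve(
        B.T @ B + np.diag(pen) + 1e-9 * np.eye(B.shape[1]), B.T @ y
    )
    return B @ coef
```

In the psDDP setting such spline means appear inside each mixture component, with conjugate priors on the coefficients enabling Gibbs updates.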