Organizer: Peter Mueller, UT Austin, USA



Simultaneous Edit and Imputation for Categorical Microdata

Statistical agencies and other organizations that collect and process data are often faced with data files that contain faulty values. When these errors result in inconsistent records, such as pregnant men or married toddlers, agencies usually correct them through a process known as edit-imputation. The dominant paradigm for edit-imputation, due to Fellegi and Holt (1976), separates the task into an error-localization phase and an imputation phase, and is based on finding the minimal set of changes needed to make each record consistent. While this approach has the advantage of minimizing the changes to the original data, it has the disadvantage of ignoring the distribution of the data during error localization, and thus produces biased imputations. It also ignores the uncertainty associated with the error-localization procedure. In this talk I introduce an alternative procedure for edit-imputation of categorical data based on joint modeling. The model includes a flexible representation of the underlying true values based on Dirichlet process mixtures with support only on the consistent responses; a model for the location of errors; and a model for the observed faulty data. Estimation is performed simultaneously using MCMC sampling. Through challenging data-based simulations I show how this method delivers results far superior to those obtained from the application of the Fellegi-Holt approach.
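The joint-modeling idea can be illustrated with a highly simplified sketch (not the talk's DP-mixture model, and a single MAP imputation rather than MCMC): a prior over the consistent cells of a small categorical table is combined with a per-field error model, and each possibly faulty record is mapped to the consistent cell with highest posterior probability. The two-field domain, the edit rule, and the error rate `eps` are all hypothetical.

```python
import itertools

# Toy domain: two binary fields (sex: 0=female, 1=male; pregnant: 0=no, 1=yes).
# Hypothetical edit rule: the cell (male, pregnant) is inconsistent.
CELLS = [c for c in itertools.product([0, 1], repeat=2) if c != (1, 1)]

def edit_impute(records, eps=0.05):
    """Map each record to a consistent cell by combining an empirical
    prior over consistent cells with a simple per-field error model:
    each observed field equals the true field with probability 1 - eps.
    (The actual method would sample this posterior within an MCMC.)"""
    # Empirical prior over consistent cells, with Laplace smoothing.
    counts = {c: 1.0 for c in CELLS}
    for r in records:
        if tuple(r) in counts:
            counts[tuple(r)] += 1.0
    total = sum(counts.values())
    prior = {c: counts[c] / total for c in CELLS}

    imputed = []
    for r in records:
        # Unnormalized posterior over true cells given the observed record.
        post = []
        for c in CELLS:
            lik = 1.0
            for obs, true in zip(r, c):
                lik *= (1 - eps) if obs == true else eps
            post.append(prior[c] * lik)
        imputed.append(CELLS[max(range(len(CELLS)), key=lambda i: post[i])])
    return imputed
```

Note how the prior enters error localization: an inconsistent record such as (male, pregnant) is resolved toward whichever consistent cell is both close in the error model and common in the data, rather than by a fixed minimal-change rule.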


Bayesian nonparametric clustering for large data sets

We propose two nonparametric Bayesian methods to cluster big data and apply them to cluster genes by patterns of gene-gene interaction. Both approaches define model-based clustering with nonparametric Bayesian priors and include an implementation that remains feasible for big data. The first method is based on a predictive recursion which requires a single cycle (or few cycles) of simple deterministic calculations for each observation under study. The second scheme is an exact method that divides the data into smaller subsamples and involves local partitions that can be determined in parallel. In a second step, the method requires only the sufficient statistics of each of these local clusters to derive global clusters. On simulated and benchmark data sets the proposed methods compare favorably with other clustering algorithms, including k-means, DP-means, DBSCAN, SUGS, streaming variational Bayes (SVB) and an EM algorithm. We apply the proposed approaches to cluster a large data set of gene-gene interactions extracted from the online search tool “Zodiac”.
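The two-stage idea of the second scheme can be sketched with a deterministic DP-means-style rule in one dimension (a stand-in for the talk's exact nonparametric method): each shard is clustered independently, each local cluster is reduced to its sufficient statistics (count, sum), and a second pass clusters the local means using only those statistics. The threshold `lam` and the one-pass assignment rule are illustrative assumptions.

```python
def dp_means_1d(points, lam):
    """One-pass DP-means-style assignment in 1-D: open a new cluster
    whenever a point lies farther than `lam` from every current cluster
    mean. Returns sufficient statistics [count, sum] per cluster."""
    stats = []  # each entry is [count, sum]
    for x in points:
        if stats:
            j = min(range(len(stats)),
                    key=lambda i: abs(x - stats[i][1] / stats[i][0]))
            if abs(x - stats[j][1] / stats[j][0]) <= lam:
                stats[j][0] += 1
                stats[j][1] += x
                continue
        stats.append([1, x])
    return stats

def two_stage_cluster(data, n_shards, lam):
    """Stage 1: cluster each shard independently (parallelizable).
    Stage 2: merge using only the local sufficient statistics, applying
    the same rule to the local cluster means and pooling counts/sums."""
    shards = [data[i::n_shards] for i in range(n_shards)]
    local = [s for shard in shards for s in dp_means_1d(shard, lam)]
    merged = []
    for n, tot in local:
        mean = tot / n
        if merged:
            j = min(range(len(merged)),
                    key=lambda i: abs(mean - merged[i][1] / merged[i][0]))
            if abs(mean - merged[j][1] / merged[j][0]) <= lam:
                merged[j][0] += n
                merged[j][1] += tot
                continue
        merged.append([n, tot])
    return [(n, tot / n) for n, tot in merged]
```

The key point is that stage 2 never revisits the raw observations: the counts and sums of the local clusters are enough to form the global partition, which is what makes the scheme parallel and communication-cheap.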


Bivariate BNP survival regression with partially semi-competing risks

In this work, we discuss modeling the time to infection under current status data with informative censoring. We model the bivariate event-time distribution (time to infection and time to symptoms) via a mixture of DP mixtures. The model formalizes partial ordering by imposing an order constraint between time to infection and time to symptoms with some probability, leaving a positive probability for symptoms due to other causes, i.e., no order constraint. The model allows one to probabilistically attribute symptoms to the infection of interest as a function of observed censoring times and covariates. An illustrative example using data from the Partner Notification Study conducted in Seattle, WA, will be presented.
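The partial-ordering mixture can be illustrated with a toy generative sketch (exponential components stand in for the DP mixtures, and the mixing weight `p_linked` is hypothetical): with probability `p_linked`, symptoms are caused by the infection and the order constraint t_inf <= t_symp is imposed via a positive delay; otherwise the two times are drawn independently.

```python
import random

def sample_event_times(n, p_linked=0.7, seed=0):
    """Draw (time_to_infection, time_to_symptoms, linked) triples.
    With probability p_linked the symptom is attributed to the infection,
    so t_symp = t_inf + delay with delay > 0 (order constraint holds);
    otherwise t_symp is drawn independently (symptom from another cause)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        t_inf = rng.expovariate(1.0)
        if rng.random() < p_linked:
            t_symp = t_inf + rng.expovariate(2.0)  # positive delay
            out.append((t_inf, t_symp, True))
        else:
            t_symp = rng.expovariate(1.0)  # independent of t_inf
            out.append((t_inf, t_symp, False))
    return out
```

In this toy version the attribution logic is already visible: a pair with symptoms preceding infection can only come from the unconstrained component, while pairs respecting the order are a posteriori attributed to either component in proportion to the mixture weights.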