Exchangeable sequences of clusters for microclustering tasks
Many clustering tasks require models that assume cluster sizes grow linearly with the size of the data set. However, for other clustering tasks, such as entity resolution (record linkage or deduplication), this assumption is undesirable. In most entity resolution settings, each entity gives rise to only a few records, yielding a large number of small clusters that remain small even as the number of records increases. This contrasts with classical clustering tasks accomplished with Bayesian random partition models, where one seeks to divide a given population or data set into a relatively small number of subpopulations or clusters whose sizes grow with the number of data points. In this work, we develop a flexible and tractable prior distribution for random partitions that is appropriate for microclustering tasks such as entity resolution. To do so, we move away from the traditional assumption of an exchangeable sequence of data points to that of an exchangeable sequence of clusters. We illustrate our proposed method using simulations and longitudinal data from a Social Diagnosis Survey from Poland.
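The contrast between the two regimes can be illustrated with a small simulation. The sketch below is not the prior developed in this work. It compares a standard Chinese restaurant process, whose largest cluster typically grows with the number of records, against a deliberately simplified stand-in for microclustering in which cluster sizes are drawn independently from a bounded distribution until all records are allocated. The function names, the uniform size distribution, and the truncation of the last cluster are all assumptions made for illustration.

```python
import random


def crp_cluster_sizes(n, alpha, rng):
    """Cluster sizes after seating n points with a Chinese restaurant process."""
    sizes = []
    for i in range(n):
        # Open a new cluster with probability alpha / (i + alpha); otherwise
        # join an existing cluster with probability proportional to its size.
        if rng.random() < alpha / (i + alpha):
            sizes.append(1)
        else:
            k = rng.choices(range(len(sizes)), weights=sizes)[0]
            sizes[k] += 1
    return sizes


def iid_cluster_sizes(n, max_size, rng):
    """Hypothetical stand-in for microclustering: sizes drawn i.i.d. uniformly
    on 1..max_size until n records are allocated. The last cluster is
    truncated so the sizes sum to n (a simplification, not exact conditioning).
    """
    sizes = []
    remaining = n
    while remaining > 0:
        s = min(rng.randint(1, max_size), remaining)
        sizes.append(s)
        remaining -= s
    return sizes


if __name__ == "__main__":
    rng = random.Random(1)
    for n in (250, 1000, 4000):
        crp = crp_cluster_sizes(n, alpha=1.0, rng=rng)
        micro = iid_cluster_sizes(n, max_size=4, rng=rng)
        print(n, "largest CRP cluster:", max(crp),
              "| largest bounded-size cluster:", max(micro))
```

In the second construction the largest cluster can never exceed `max_size`, whatever the value of `n`, which is the qualitative behaviour that entity resolution calls for.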
Bayesian joint modeling of competing risks and longitudinal data: A sequential approach
The statistical modeling of the information generated by medical follow-up is a very important challenge in the field of personalised medicine. A particular case is the joint modeling of longitudinal and time-to-event outcomes, where both a repeatedly measured biomarker and the elapsed time to the event of interest are collected on each individual in the study. A key characteristic of the learning process in this type of model is its dynamic nature. This is clear in biomedical studies, where data usually come from individual follow-up over time: when new information about a given patient is collected, physicians are interested in updating the relevant estimated and/or predicted outcomes. Focusing on this context, we propose a dynamic procedure for Bayesian joint modeling of competing risks and longitudinal data. We rely on sequential Monte Carlo methods, in which our primary interest is to reduce the processing time of the inferential update after obtaining new data in order to speed up clinical decision-making. Our approach is illustrated on a Spanish dataset involving patients on mechanical ventilation in intensive care units.
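The idea of updating inference when a new observation arrives, rather than refitting from scratch, can be sketched with a generic sequential Monte Carlo reweighting step. The example below is a toy under strong assumptions, not the joint competing-risks model described above: the unknown is a single mean parameter, observations are Gaussian with known standard deviation `sigma`, and the function name `smc_update` and the effective-sample-size threshold are hypothetical choices. A practical sampler would usually add a move (rejuvenation) step after resampling.

```python
import math
import random


def smc_update(particles, log_w, y_new, sigma, rng):
    """Absorb one new observation into a weighted particle approximation.

    particles: candidate values of the unknown mean (toy parameter).
    log_w: log-weights of the particles (unnormalised is fine).
    """
    # Reweight each particle by the Gaussian likelihood of the new observation.
    log_w = [lw - 0.5 * ((y_new - p) / sigma) ** 2
             for p, lw in zip(particles, log_w)]

    # Normalise on the log scale for numerical stability.
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    total = sum(w)
    w = [x / total for x in w]

    # Resample only when the effective sample size drops below half.
    ess = 1.0 / sum(x * x for x in w)
    if ess < len(particles) / 2:
        particles = rng.choices(particles, weights=w, k=len(particles))
        log_w = [0.0] * len(particles)
    else:
        log_w = [lw - m - math.log(total) for lw in log_w]
    return particles, log_w


if __name__ == "__main__":
    rng = random.Random(0)
    # Particles drawn from a wide Gaussian prior on the mean.
    particles = [rng.gauss(0.0, 3.0) for _ in range(1000)]
    log_w = [0.0] * len(particles)
    for y in (2.1, 1.8, 2.3):  # observations arriving one at a time
        particles, log_w = smc_update(particles, log_w, y, sigma=1.0, rng=rng)
```

Each call costs time linear in the number of particles and does not revisit earlier observations, which is what makes this style of update attractive when data arrive during follow-up.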
On estimating team member abilities based on aggregate team scores
We propose a method to estimate individuals’ ability based on their contribution to team performance. Estimation of individuals’ ability is of substantial interest in many applications, as in real-world situations individuals seldom work in isolation but instead work together in groups to accomplish specific tasks. We are particularly interested in settings in which teams compete against each other; this can include sports and industry applications, among others. With this in mind, we develop a novel Bayesian approach which provides a team ability score that is based on the estimated abilities of individuals belonging to the team. We assume that the available information is at the team level and comes from a series of encounters between teams. We study theoretical properties of the proposed approach and assess its performance using synthetic data and basketball data from the NBA.
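To make the setting concrete, the sketch below shows one simple way individual abilities could be tied to team-level outcomes. It is an illustrative assumption, not the model proposed above: team ability is taken to be the sum of the members' abilities, and the probability that one team beats another follows a Bradley-Terry style logistic function of the difference. All names (`team_ability`, `win_probability`, `log_likelihood`) and the example players are hypothetical.

```python
import math


def team_ability(abilities, team):
    """Team score as the sum of member abilities (an assumed aggregation)."""
    return sum(abilities[player] for player in team)


def win_probability(abilities, team_a, team_b):
    """Probability that team_a beats team_b under a logistic link."""
    diff = team_ability(abilities, team_a) - team_ability(abilities, team_b)
    return 1.0 / (1.0 + math.exp(-diff))


def log_likelihood(abilities, matches):
    """Log-likelihood of observed encounters.

    matches: list of (winning_team, losing_team) pairs, each team a tuple
    of player names. Only team-level results are used.
    """
    return sum(math.log(win_probability(abilities, winner, loser))
               for winner, loser in matches)


if __name__ == "__main__":
    abilities = {"ann": 0.8, "bo": -0.2, "cy": 0.1, "di": 0.3}
    matches = [(("ann", "bo"), ("cy", "di")),
               (("ann", "cy"), ("bo", "di"))]
    print(win_probability(abilities, ("ann", "bo"), ("cy", "di")))
    print(log_likelihood(abilities, matches))
```

In a Bayesian treatment, a prior would be placed on the entries of `abilities` and this likelihood would be combined with it. Because players appear in different line-ups across encounters, team-level results alone can carry information about individuals.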