Name: Guilherme Oliveira
Institution: Departamento de Computação, Centro Federal de Educação Tecnológica, Brazil
E-mail: guilhermeoliveira@cefetmg.br
Co-authors: Raffaele Argiento, Rosangela Loschi, Márcia Branco, Fabrizio Ruggeri, Renato Assunção

Abstract:
Modeling data from poor and socially deprived areas is a major challenge for data analysts because data quality is often questionable. When dealing with underreported count data, a common approach is to use models based on compound Poisson distributions. To be identifiable, such models require strong additional information about the probability of reporting the event in all areas of interest. This requirement may limit their use, mainly because reliable information about these probabilities is generally available only for areas where data are known to be better recorded. Motivated by this, we introduce a joint model for the data-generating and reporting processes that assumes a clustering structure which linearly correlates the reporting probabilities. This structure is defined by auxiliary clustering variables that allow the quality of the recorded data to be assessed. Under the proposed model, only prior information about the reporting probabilities in areas with the best data quality is required for model identification. Several features of the proposed methodology are studied through simulation. The model is applied to map early neonatal hospital mortality in Minas Gerais, a Brazilian state with heterogeneous characteristics and substantial socioeconomic inequality.
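
The identifiability issue mentioned above can be illustrated with a minimal sketch. It is not the proposed joint model. It assumes the simplest compound Poisson setting, in which the true count is Poisson with mean true_rate and each event is reported independently with probability report_prob. The function names, parameter values, and the truncation point n_max are hypothetical choices made for illustration. Under these assumptions, the reported count depends on the two parameters only through their product, so the data alone cannot separate them.

```python
import math


def poisson_pmf(k, mean):
    # Poisson probability computed on the log scale to avoid overflow.
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def binomial_pmf(k, n, p):
    # Probability that k out of n events are reported.
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def reported_count_pmf(y, true_rate, report_prob, n_max=200):
    # P(Y = y) when N ~ Poisson(true_rate) events occur and each one is
    # reported independently with probability report_prob.
    # The sum over N is truncated at n_max (an illustrative assumption).
    return sum(
        poisson_pmf(n, true_rate) * binomial_pmf(y, n, report_prob)
        for n in range(y, n_max + 1)
    )


# Two hypothetical parameter pairs with the same product (5.0).
pmf_a = [reported_count_pmf(y, 10.0, 0.50) for y in range(15)]
pmf_b = [reported_count_pmf(y, 20.0, 0.25) for y in range(15)]

# Largest difference between the two reported-count distributions.
max_gap = max(abs(a - b) for a, b in zip(pmf_a, pmf_b))
```

Here max_gap is zero up to floating-point error: a high event rate with poor reporting and a low event rate with good reporting yield the same distribution of reported counts, which is why extra information about the reporting probabilities is needed.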

Keywords: Compound Poisson distribution, model identifiability, neonatal mortality, underreporting.