Bayesian nonparametric inference for discovery probabilities: credible intervals and large sample asymptotics

Séminaire Probabilités & Statistique

17/11/2016 - 14:00 Julyan Arbel (LJK / Mistis) Room 306 - Bâtiment IMAG

Joint work with Stefano Favaro (University of Torino); Bernardo Nipoti (Trinity College, Dublin); Yee Whye Teh (University of Oxford)

The longstanding problem of estimating discovery probabilities dates back to World War II, when Alan Turing worked on breaking the Axis forces' Enigma cipher at Bletchley Park. The problem can be sketched as follows. An experimenter who samples units (say, animals) from a population and records their type (say, species) asks: what is the probability that the next sampled animal belongs to a species already observed a given number of times, or that it belongs to a previously unobserved species? Applications are not limited to ecology; they extend to bioinformatics, genetics, machine learning, and multi-armed bandits, among other fields.


In this talk I describe a Bayesian nonparametric (BNP) approach to the problem and compare it to the original and highly popular Good-Turing estimators. More specifically, I start by recalling some basics of the Dirichlet process, which is the cornerstone of the BNP paradigm. Then I present a closed-form expression for the posterior distribution of discovery probabilities, which naturally leads to simple credible intervals. Next I describe asymptotic approximations of the BNP estimators for large sample sizes, and conclude by illustrating the proposed results on a benchmark genomic dataset of Expressed Sequence Tags.
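
For readers new to the topic, the two kinds of point estimate compared in the talk can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's implementation: the function names, the toy sample, and the concentration parameter `theta` are hypothetical. The Good-Turing estimate of the probability that the next unit belongs to a species seen l times is (l + 1) m_{l+1} / n, where m_l is the number of species observed exactly l times in a sample of size n. The Dirichlet process version below is the posterior expected value that follows from its standard predictive rule; the closed-form posterior, the credible intervals, and the asymptotics discussed in the talk are not reproduced here.

```python
from collections import Counter


def frequency_counts(sample):
    """Return a mapping l -> m_l, the number of species observed exactly l times."""
    species_counts = Counter(sample)          # species -> number of occurrences
    return Counter(species_counts.values())   # l -> number of species with l occurrences


def good_turing(sample, l):
    """Good-Turing estimate of the l-discovery probability: (l + 1) * m_{l+1} / n."""
    n = len(sample)
    m = frequency_counts(sample)
    return (l + 1) * m.get(l + 1, 0) / n


def dirichlet_process_estimate(sample, l, theta):
    """Posterior expected l-discovery probability under a Dirichlet process prior.

    theta is the concentration parameter (an assumed value, chosen by the user).
    New species (l = 0): theta / (theta + n); otherwise l * m_l / (theta + n).
    """
    n = len(sample)
    if l == 0:
        return theta / (theta + n)
    m = frequency_counts(sample)
    return l * m.get(l, 0) / (theta + n)


# Toy sample (hypothetical): species "a" seen 3 times, "b" twice, "c", "d", "e" once.
sample = ["a", "a", "a", "b", "b", "c", "d", "e"]

for l in range(4):
    gt = good_turing(sample, l)
    dp = dirichlet_process_estimate(sample, l, theta=1.0)
    print(f"l={l}: Good-Turing={gt:.3f}, Dirichlet process={dp:.3f}")
```

Note the structural difference: the Good-Turing estimate for frequency l uses the count m_{l+1}, whereas the Dirichlet process estimate uses m_l together with the prior parameter theta. Both sets of estimates sum to one over l.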