The talk will start with a brief overview of querying incomplete information in databases and the main computational challenges it poses. Querying incomplete data invariably relies on a very coarse classification of query answers into those that are certain and those that are not. Such a classification is often very costly to compute, and we propose to refine it by measuring how close an answer is to certainty.
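To illustrate the distinction (with a toy example of our own, not one taken from the talk): consider a unary relation $R$ containing the values $1$ and $\bot$, where $\bot$ is a null. For the query $q(x) = R(x)$, the answer $1$ is certain, since it is returned under every interpretation of $\bot$; the answer $2$ is returned only under interpretations that map $\bot$ to $2$, so it is not certain, and under a random choice of the value of $\bot$ from a large domain it is returned with vanishingly small probability.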

This measure is defined as the probability that the query is true under a random interpretation of the missing information in a database. Since there are infinitely many such interpretations, to pick one at random we adopt the approach used in the study of asymptotic properties and 0-1 laws for logical sentences, and define the measure as the limit of a sequence of probabilities. We show that in the standard model of missing data the 0-1 law holds: for a very large class of queries this limit always exists and is either 0 or 1. Thus query answers are either almost certainly true or almost certainly false, and this classification behaves very well computationally. When databases satisfy constraints, the measure is defined as the conditional probability of the query being true given that the constraints are true. This can now be an arbitrary rational number, and it is always computable. Another refinement of the notion of certainty regards answers as better when a larger set of interpretations makes them true. We pinpoint the exact complexity of finding such best answers for first-order queries.
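To make the measure concrete, here is one way it could be formalized (a sketch under notational assumptions of our own; the talk may use a different but equivalent presentation). For a Boolean query $q$ and a database $D$ whose nulls are $\bot_1, \dots, \bot_k$, let
\[
  \mu_n(q, D) \;=\; \frac{\bigl|\{\, v : \{\bot_1, \dots, \bot_k\} \to \{1, \dots, n\} \mid v(D) \models q \,\}\bigr|}{n^k},
\]
the fraction of valuations of the nulls with values in $\{1, \dots, n\}$ under which $q$ is true, and define
\[
  \mu(q, D) \;=\; \lim_{n \to \infty} \mu_n(q, D).
\]
The 0-1 law then says that this limit exists and equals $0$ or $1$; in the presence of constraints $\Sigma$, the conditional version $\lim_{n \to \infty} \mu_n(q \wedge \Sigma, D)/\mu_n(\Sigma, D)$ can be an arbitrary rational number.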