Violetta Lonati

Pattern statistics in rational models

Tesi di dottorato discussa il 7 marzo 2005
Relatori: Proff. A. Bertoni e M. Goldwurm.



Riassunto (in inglese)

Probability on pattern occurrences in a random sequence of letters (generally called text) has been widely studied and has applications in many areas of bio-informatics, code theory and data compression, pattern matching, design and analysis of algorithms, games. Different aspects have been considered: the length of the longest matching, the moments and the distributions of the waiting times for first time occurrences of patterns, the distances between occurrences of patterns.

The thesis focuses on the frequency of occurrences of a repeated pattern in a random sequence of letters. If we assume to know the probabilistic model (and its parameters) that generates the text, the central question is: how many occurrences of a given pattern shall we expect in such a random sequence? Below, we shall refer to this question as the frequency problem.

Among the early motivations for the study of this problem, one should mention code synchronization and approximated pattern matching. However, the most recent applications are in molecular biology. Nowadays, biologists know large sets of DNA sequences and need quantitative tools and statistical methods to analyze them. Much information could be extracted from DNA sequences if they would be able to identify words that show relevant differences between their observed frequency and the frequency prediceted by a suitable model.

The frequency problem can be studied under different assumptions concerning the source that generates the text, or the pattern to search for through the text. In this thesis, pattern occurrences are studied in a new framework, where a stochastic model is defined via rational formal series (or, equivalently, by weighted automata) in non-commuting variables a and b . The rational symbol frequency problem is taken in exam: intuitively this concerns the study of the sequence of random variables Y_n representing the number of occurrences of the symbol a in words of length n chosen at random in {a,b}*, according to the probability distribution given by the rational model.

The thesis shows how the rational model can be viewed as a proper extension of the Markovian model usually considered in the literature. Indeed, the question of studying the number of occurrences of a regular pattern in a text generated by a Markovian source can always be translated into the rational symbol frequency problem for a suitable rational series over two non-commuting variables, while the converse does not hold in general.

The main contributions included in the thesis are estimating the moments of the random variable Y_n and determining local and central limit distributions of the sequence Y_n as n tends to infinity.

First, the transition matrix associated with the series defining the model is assumed to be primitive. In this case the mean and the variance of Y_n turn out to be asymptotically linear and precise expressions for the constants appearing in their asymptotic formulas are given. Moreover, a central limit theorem holds and a condition is provided that guarantees the existence of a Gaussian local limit theorem; to state this condition, a notion of symbol periodicity for weighted automata is introduced, in analogy with the classical periodicity theory of Perron--Frobenius for non-negative matrices. As an application of the previous analysis, one obtains an asymptotic estimation of the growth of the coefficients for a subclass of rational formal series in two commuting variables.

Thes results are then extended dropping the primitive hypothesis usually assumed in the literature. First, bicomponent models (defined by weighted automaton with two strongly connected components) are studied, obtaining in many cases limit distributions quite different from the Gaussian one. Finally, to deal with arbitrary non-primitive models, a general approach is presented. It is based on the decomposition of the weighted automaton defining the model into strongly connected components, in order to detect the elements that mainly determine the limit distribution. In the most relevant cases the limit distribution is explicitely determined: it is characterized by a unimodal density function defined by polynomials over adjacent intervals.