## Archive for **January 3rd, 2016**

## The generating function for Smirnov words (or: time until k consecutive results are the same)

**1. Alphabet **

Suppose we have an alphabet of size . Its generating function (using the variable to mark length) is simply , as contains elements of length each.

**2. Words **

Let denote the class of all words over the alphabet . There are many ways to find the generating function for .

** 2.1. **

We have

so its generating function is

** 2.2. **

To put it differently, in the symbolic framework, we have , so the generating function for is

** 2.3. **

We could have arrived at this with direct counting: the number of words of length is as there are choices for each of the letters, so the generating function is

**3. Smirnov words **

Next, let denote the class of Smirnov words over the alphabet , defined as words in which no two consecutive letters are identical. (That is, words in which for all , and for any .) Again, we can find the generating function for in different ways.

** 3.1. **

For any word in , by “collapsing” all runs of each letter, we get a Smirnov word. To put it differently, any word in can be obtained from a Smirnov word by “expanding” each letter into a nonempty sequence of that letter. This observation (see *Analytic Combinatorics*, pp. 204–205) lets us relate the generating functions of and as

which implicitly gives the generating function : we have

** 3.2. **

Alternatively, consider in an arbitrary word the first occurrence of a pair of repeated letters. Either this doesn’t happen at all (the word is a Smirnov word), or else, if it happens at position so that , then the part of the word up to position is a nonempty Smirnov word, the letter at position is the same as the previous letter, and everything after is an arbitrary word. This gives

or in terms of generating functions

giving

** 3.3. **

A minor variant is to again pick an arbitrary word and consider its first pair of repeated letters, happening (if it does) at positions and , but this time consider the prefix up to : either it is empty, or the pair of letters is different from the last letter of the prefix, giving us the decomposition

and corresponding generating function

so

which is the same as before after we cancel the factors.

** 3.4. **

We could have arrived at this result with direct counting. For , for a Smirnov word of length , we have choices for the first letter, and for each of the other letters, as they must not be the same as the previous letter, we have choices. This gives the number of Smirnov words of length as for , and so the generating function for Smirnov words is

again giving

**4. Words with bounded runs **

We can now generalize. Let denote the class of words in which no letter occurs more than times consecutively. (.) We can find the generating function for .

** 4.1. **

To get a word in we can take a Smirnov word and replace each letter with a nonempty sequence of up to occurrences of that letter. This gives:

so

** 4.2. **

Pick any arbitrary word, and consider its first occurrence of a run of letters. Either such a run does not exist (which means the word we picked is in ), or it occurs right at the beginning ( possibilities, one for each letter in the alphabet), or, if it occurs starting at position , then the part of the word up to position (the “prefix”) is a nonempty Smirnov word, positions to are occurrences of any of the letters other than the last letter of the prefix, and what follows is an arbitrary word. This gives

or in terms of generating functions

so

giving

** 4.3. **

Arriving at this via direct counting seems hard.

**5. Words that stop at a long run **

Now consider words in which we “stop” as soon we see consecutive identical letters. Let the class of such words be denoted (not writing to keep the notation simple). As before, we can find its generating function in multiple ways.

** 5.1. **

We get any word in by either immediately seeing a run of length and stopping, or by starting with a nonempty prefix in , and then stopping with a run of identical letters different from the last letter of the prefix. Thus we have

and

which gives

** 5.2. **

Alternatively, we can decompose any word by looking for its first run of identical letters. Either it doesn’t occur at all (the word we picked is in , or the part of the word until the end of the run belongs to and the rest is an arbitrary word, so

and

so

**6. Probability **

Finally we arrive at the motivation: suppose we keep appending a random letter from the alphabet, until we encounter the same letter times consecutively. What can we say about the length of the word thus generated? As all sequences of letters are equally likely, the probability of seeing any string of length is . So in the above generating function , the probability of our word having length is , and the probability generating function is therefore . This can be got by replacing with in the expression for : we have

In principle, this probability generating function tells us everything about the distribution of the length of the word. For example, its expected length is

(See this question on Quora for other powerful ways of finding this expected value directly.)

We can also find its variance, as

This variance is really too large to be useful, so what we would really like, is the shape of the distribution… to be continued.