Markov chains is one of the oldest ways to generate quite believable text. Rather than generating text by randomly selecting characters (possible but completely impractical), we use the Markov property that follows the chain of linked events, where what get generated next depends only on what is generated currently.
Typically generating text trough, a Markov process use the following steps:
- Tokenization: is the process of splitting text into tokens, these tokens can be at the character level or at the word level. When tokenizing the word level, each word is a token.
- Building a n-gram model: this model represents the transition probabilities between any n-gram and the next token.
- Using the model to generate new text: generating new text require a seed. The seed is setting the first n-gram manually. The Markov process will then generate the remaining text using the already build n-gram model.
Building a model
In this section we will use some python code to build a model.
To build a model we need to start from an existing text or a set of texts.
def buildModel(tokens, n): model = dict() if len(tokens) < n: return model for i in range(len(tokens) - n): gram = tuple(tokens[i:i + n]) next_token = tokens[i + n] if gram in model: model[gram].append(next_token) else: model[gram] = [next_token] #as this is a transition model, the last n-gram has no transition. #so we create or add a transition state to None gram = tuple(tokens[len(tokens) - n:]) if gram in model: model[gram].append(None) else: model[gram] = [None] return model
As the code shows creating a model is fairly easy. Once we have the token list, we create a dictionary where each key is one possible n-gram, the value linked with every key is a list of possible tokens that can follow the n-gram. The trick of keeping track of probabilities is to allow adding the same token as frequently as it appears in the text. To illustrate let consider the following sentence: ‘to be totally tense‘. The character ‘t’ is followed by the character ‘o’ 2 time on one time by the character e. To respect the fact that there is 2 times more probability for a ‘o’ than for an ‘e’ to follow a ‘t’ , The value associated with the key ‘t’ will be ‘[0, 0, e]’
Using a Model to generate text
To generate a text, we need the model build in the last section and a seed. Using the seed, we will generate the next token by using the model. The trick explained in the last section shows why a random selection based on the current n-gram will always respect the transition probabilities of the Markov Chain.
<pre>def generate(model, n, seed=None, max_iterations=100): if seed is None: seed = random.choice(model.keys()) output = list(seed) current = tuple(seed) for i in range(max_iterations): if current in model: possible_next_tokens = model[current] next_token = random.choice(possible_next_tokens) if next_token is None: break output.append(next_token) current = tuple(output[-n:]) else: break return output
The generation will stop when one to these 2 conditions is reached:
- Either we reach a terminal node in the transition probabilities.
- Or we reach the maximum interactions over the model.