Markov chains are one of the oldest ways to generate quite believable text. Rather than generating text by selecting characters at random (possible, but completely impractical), we rely on the Markov property, which follows a chain of linked events where what gets generated next depends only on what is generated currently.
Generating text through a Markov process typically uses the following steps:
- Tokenization: the process of splitting text into tokens. These tokens can be at the character level or at the word level; when tokenizing at the word level, each word is a token (see the sketch after this list).
- Building an n-gram model: this model represents the transition probabilities between any n-gram and the next token.
- Using the model to generate new text: generating new text requires a seed, which sets the first n-gram manually. The Markov process then generates the remaining text using the already built n-gram model.
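As a minimal sketch of the tokenization step (a hypothetical tokenize helper, not from the original post; real tokenizers usually also handle punctuation and casing), both levels fit in a couple of lines:

```python
def tokenize(text, level="word"):
    # Word level: split on whitespace; each word becomes a token.
    if level == "word":
        return text.split()
    # Character level: every character, including spaces, is a token.
    return list(text)

tokenize("to be", level="word")  # ['to', 'be']
tokenize("to be", level="char")  # ['t', 'o', ' ', 'b', 'e']
```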
Building a model
In this section we will use some Python code to build a model. To build a model, we need to start from an existing text (or a set of texts) and tokenize it.
```python
def buildModel(tokens, n):
    model = dict()
    if len(tokens) < n:
        return model
    for i in range(len(tokens) - n):
        gram = tuple(tokens[i:i + n])
        next_token = tokens[i + n]
        if gram in model:
            model[gram].append(next_token)
        else:
            model[gram] = [next_token]
    # As this is a transition model, the last n-gram has no transition,
    # so we create (or add to) a transition state to None.
    gram = tuple(tokens[len(tokens) - n:])
    if gram in model:
        model[gram].append(None)
    else:
        model[gram] = [None]
    return model
```
As the code shows, creating a model is fairly easy. Once we have the token list, we create a dictionary where each key is one possible n-gram, and the value linked with each key is a list of the tokens that can follow that n-gram. The trick for keeping track of probabilities is to add a token to the list as many times as it appears in the text. To illustrate, consider the sentence 'to be totally tense' at the character level: the character 't' is followed by 'o' two times, by 'a' once, and by 'e' once. To respect the fact that 'o' is twice as likely as 'a' or 'e' to follow a 't', the value associated with the key ('t',) will be ['o', 'o', 'a', 'e'].
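A quick run, assuming the buildModel function above and character-level tokens, confirms this:

```python
tokens = list("to be totally tense")  # character-level tokenization
model = buildModel(tokens, 1)
print(model[('t',)])  # ['o', 'o', 'a', 'e']
print(model[('e',)])  # [' ', 'n', None]: the final 'e' transitions to None
```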
Using a model to generate text
To generate text, we need the model built in the last section and a seed. Starting from the seed, we repeatedly generate the next token using the model. The trick explained in the last section is why a random selection from the list associated with the current n-gram will always respect the transition probabilities of the Markov chain.
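A minimal sketch of why this works: drawing uniformly from a list that contains duplicates reproduces the underlying frequencies. The follower list below is the one stored under ('t',) in the previous section:

```python
import random
from collections import Counter

followers = ['o', 'o', 'a', 'e']
draws = Counter(random.choice(followers) for _ in range(10000))
print(draws)  # 'o' shows up roughly twice as often as 'a' or 'e'
```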
```python
import random

def generate(model, n, seed=None, max_iterations=100):
    if seed is None:
        # random.choice needs a sequence, so convert the dict keys to a list
        seed = random.choice(list(model.keys()))
    output = list(seed)
    current = tuple(seed)
    for i in range(max_iterations):
        if current in model:
            possible_next_tokens = model[current]
            next_token = random.choice(possible_next_tokens)
            if next_token is None:
                break
            output.append(next_token)
            current = tuple(output[-n:])
        else:
            break
    return output
```
The generation stops when one of these two conditions is reached:
- Either we reach a terminal node in the transition probabilities (a next token of None).
- Or we reach the maximum number of iterations over the model.
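Putting the two functions together, a minimal end-to-end run (using the same example sentence as before) could look like this:

```python
tokens = list("to be totally tense")
model = buildModel(tokens, 2)
generated = generate(model, 2, seed=('t', 'o'))
print(''.join(generated))  # e.g. 'to be totally tense' or 'totally tense'
```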