I read about this awesome process called Markov Chains in this blog. It was really great how an engineer applied this concept to create unique and memorable IDs out of meaningless words.

Before this change, the unique IDs were difficult to read and memorize, but using Markov Chains, the software can generate memorable IDs using phonetics and almost-familiar words related to the language. In that case, the chosen language was Japanese.

I recommend that you read that blog; it's really fun and tells a great story from the problem through to the solution.

After that fantastic application, I decided to mimic that behavior in a service I was creating.

Maybe I will add a link here after I write a blog post about that service.

How Markov Chains work

According to Wikipedia, a Markov Chain is:

A stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

So how does this relate to creating unique words? Imagine having the power to know, with a certain probability, what a person is going to say when you only have the beginning of a word.

Well, as I understand it, a Markov Chain can tell you which letter is most probable after a certain letter.

For example, if you say a word that starts with "a", the Markov Chain can tell you which letter could come next. In order to know which letter comes next, the Markov Chain needs to be created from content. In this case, that content can be a list of words.
Imagine collecting the vast vocabulary from all the Spanish books: a Markov Chain created from those words would be a great way to create unique words that have never existed, though it can also spit out an existing word.

For me, Markov Chains are a way to spit out a possible outcome based on the data I gave them.

But before using it, you need to feed the Markov Chain; in this case, it is fed with words.
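
As a minimal sketch of what "feeding" means at the letter level, here is the idea with a tiny made-up two-word corpus (the real implementation below uses Don Quixote):

# Minimal sketch: letter-level transition probabilities from a tiny,
# made-up corpus.
corpus = ["casa", "cosa"]

counts = {}
for word in corpus:
    for curr, nxt in zip(word, word[1:]):
        counts.setdefault(curr, {})
        counts[curr][nxt] = counts[curr].get(nxt, 0) + 1

# Normalize the counts into probabilities
model = {
    curr: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
    for curr, nexts in counts.items()
}

print(model)
# {'c': {'a': 0.5, 'o': 0.5}, 'a': {'s': 1.0}, 's': {'a': 1.0}, 'o': {'s': 1.0}}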

The problem I solved using Markov Chains

The main problem was that users need to type a code when the service is paid, and it's difficult for someone to remember random numbers and letters, so I decided to look for memorable and unique IDs for clients.

The first question, the same as in the article, was:

What language do I use? My clients speak Spanish, so why not use Spanish to create those words?

Then, the second question was:

What content do I use to feed my Markov Chain?

I decided to use the book Don Quixote, by Miguel de Cervantes, mainly because it's in the public domain worldwide, and because Miguel de Cervantes wrote it using old words, trying to mimic and parody the language of medieval knights and chivalric romances.

Markov Chain implementation for the word generator

These are the steps I followed:

  1. Understand how a Markov Chain works; I recommend this video from YouTube.
  2. Obtain Don Quixote in txt format from Project Gutenberg (see the sketch after this list).
  3. Clean and collect the words.
  4. Feed the Markov Chain.
  5. Save the Markov Chain for the next time we generate a new word.
  6. Generate the word.
  7. Finish.
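
For step 2, a minimal sketch of fetching the plain-text file. The exact URL is an assumption on my part (Project Gutenberg hosts the Spanish "Don Quijote" as ebook #2000), so verify it on the site before using it:

# Hedged sketch: download the corpus once. The URL assumes Project
# Gutenberg ebook #2000 (the Spanish "Don Quijote"); check it first.
import urllib.request

URL = "https://www.gutenberg.org/cache/epub/2000/pg2000.txt"
urllib.request.urlretrieve(URL, "donquijote.txt")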

Code implementation

To better explain the code, I added comments to it, but basically the steps mentioned above are the way I built this.

markov_chain_words_generator.py
import json
import random
import string
import unicodedata
from pathlib import Path


class Serializer:
    help = 'Create a unique word based on a set of words'

    def __init__(self, handler):
        self.handler = handler  # file handler for the corpus
        self.clean_txt()

    # CLEAN AND COLLECT WORDS
    def clean_txt(self):
        self.cleaned_words = []
        for line in self.handler.readlines():
            cline = line.lower()
            cline = self.remove_accents(cline)
            cline = self.replace_spanish_punctuation(cline)
            cline = cline.strip()
            words = cline.split(" ")
            # Keep only tokens made entirely of letters
            cwords = filter(lambda word: word.isalpha(), words)
            self.cleaned_words += cwords

    def remove_accents(self, input_str):
        # Normalize the string to decompose accented characters
        nfkd_form = unicodedata.normalize('NFKD', input_str)
        # Keep only the base characters (remove diacritics)
        return ''.join(c for c in nfkd_form if not unicodedata.combining(c))

    def replace_spanish_punctuation(self, input_str):
        # Translation table that strips Spanish punctuation marks
        translation_table = str.maketrans({
            '¿': '',  # inverted question mark
            '¡': '',  # inverted exclamation mark
            '´': '',  # acute accent
            '¨': '',  # diaeresis (umlaut)
            '«': '',  # left angle quote
            '»': '',  # right angle quote
            '“': '',  # left curly quote
            '”': '',  # right curly quote
            '‘': '',  # left single quote
            '’': '',  # right single quote
            ',': '',
            ';': '',
            '-': '',
        })
        return input_str.translate(translation_table)

    def clean_spanish_text(self, input_str):
        # Remove accents, then strip punctuation
        no_accents = self.remove_accents(input_str)
        cleaned_text = self.replace_spanish_punctuation(no_accents)
        return cleaned_text

    def get_words(self):
        return self.cleaned_words

    def make_word(self, limit=10, start='a'):
        self.words = self.get_words()
        self.cwords = map(self.clean_spanish_text, self.words)
        filename = "dictoword.json"
        self.markov_model = self.create_markov_chain()
        self.save_model(filename)

        myword = self.generate_word(limit=limit, start=start)
        print(f"the final word is: {myword}")

    def save_model(self, filename):
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(self.markov_model, f, ensure_ascii=False, indent=4)

    def filename_exists(self, filename):
        return Path(filename).exists()

    def retrieve_model(self, filename):
        try:
            with open(filename, 'r', encoding='utf-8') as f:
                return json.load(f)
        except FileNotFoundError:
            print(f"The file {filename} does not exist.")
        except json.JSONDecodeError:
            print(f"The file {filename} contains invalid JSON.")

    # CREATE THE MARKOV MODEL WITH THE CONTENT
    def create_markov_chain(self):
        gd = {}
        # Count how often each character follows another
        for cword in self.cwords:
            for item in range(len(cword) - 1):
                ch_curr = cword[item]
                ch_next = cword[item + 1]
                if ch_curr not in gd:
                    gd[ch_curr] = {}
                gd[ch_curr][ch_next] = gd[ch_curr].get(ch_next, 0) + 1

        # Normalize the counts into transition probabilities
        for curr_state, transition in gd.items():
            total = sum(transition.values())
            for ch_next, value in transition.items():
                gd[curr_state][ch_next] = value / total

        return gd

    def get_random_char(self):
        letters = string.ascii_lowercase
        return random.choice(letters)

    def generate_word(self, limit=10, start='a'):
        n = 0
        curr_state = start
        random_word = start
        while n < limit:
            # Pick the next character, weighted by the transition probabilities
            next_state = random.choices(
                list(self.markov_model[curr_state].keys()),
                list(self.markov_model[curr_state].values()))
            curr_state = next_state[0]
            random_word += curr_state
            n += 1

        return random_word

And this is how it is used:

with open('donquijote.txt', 'r', encoding='utf-8') as f:
    s = Serializer(f)
    ch = s.get_random_char()
    s.make_word(5, ch)
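
Two details worth noting: make_word rebuilds and saves the model on every call even though retrieve_model and filename_exists already exist, and get_random_char can return a letter the model never saw, which would raise a KeyError inside generate_word. A minimal sketch of how those pieces could be wired together, as a method on Serializer (my own variation, not part of the original class):

# Hedged sketch: reuse the saved model and guard the start letter.
# A variation on make_word, not the original implementation.
def make_word_cached(self, limit=10, start='a'):
    filename = "dictoword.json"
    if self.filename_exists(filename):
        # Load the model saved by a previous run
        self.markov_model = self.retrieve_model(filename)
    else:
        self.cwords = map(self.clean_spanish_text, self.get_words())
        self.markov_model = self.create_markov_chain()
        self.save_model(filename)

    # Fall back to a known state if the start letter is not in the model
    if start not in self.markov_model:
        start = random.choice(list(self.markov_model.keys()))

    myword = self.generate_word(limit=limit, start=start)
    print(f"the final word is: {myword}")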

Basically, the most important part is this function:

def create_markov_chain(self):
    gd = {}
    # Count how often each character follows another
    for cword in self.cwords:
        for item in range(len(cword) - 1):
            ch_curr = cword[item]
            ch_next = cword[item + 1]
            if ch_curr not in gd:
                gd[ch_curr] = {}
            gd[ch_curr][ch_next] = gd[ch_curr].get(ch_next, 0) + 1

    # Normalize the counts into transition probabilities
    for curr_state, transition in gd.items():
        total = sum(transition.values())
        for ch_next, value in transition.items():
            gd[curr_state][ch_next] = value / total

    return gd

To create the Markov Chain, the process counts how often each pattern repeats. For example, take the string "aba": these three letters are related in the Markov Chain model like this:

[Figure: Markov chain explanation]
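
In code terms, feeding only "aba" into create_markov_chain produces this model; every transition has probability 1.0 because each letter has a single observed successor:

# The model create_markov_chain builds from the single word "aba":
# 'a' is followed by 'b' once, and 'b' is followed by 'a' once
{'a': {'b': 1.0}, 'b': {'a': 1.0}}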

Now imagine a letter with multiple probabilities for the next letter, where those letters in turn have their own probabilities for the letter after that.

[Figure: Markov Chain representation]
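
As a hypothetical illustration, a fragment of the saved dictoword.json could look like this (the probabilities here are invented; the real values depend entirely on the corpus):

# Hypothetical fragment of the saved model; probabilities are made up
{
    "c": {"a": 0.35, "o": 0.25, "e": 0.20, "i": 0.12, "u": 0.08},
    "q": {"u": 1.0}
}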

Now imagine generating a word using this Markov Chain:

[Figure: Words generator maker code; see generate_word above]

You can use Google Colab to run this code, but remember to use your own content. In my case I used Don Quixote by Miguel de Cervantes, but you can use other content in different languages.

Conclusion

That's all. I really like applying knowledge in ways that even I hadn't thought of. I didn't implement a filter like the blog did, but besides that, I think it's a good attempt.

One important thing: these IDs are not guaranteed to be unique; after some unknown number of runs, the generator can produce the same word it generated before. But for my purpose that is acceptable.
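
If collisions ever became a problem, a minimal way to enforce uniqueness would be to keep a set of already-issued IDs and retry on a repeat. This is a sketch of my own, not something the service currently does, and it assumes the model has already been built on the serializer:

import random

# Hedged sketch: retry until the generated word has not been issued before.
# `issued` would be persisted by the real service; here it is an in-memory set.
issued = set()

def unique_word(serializer, limit=5, max_tries=100):
    for _ in range(max_tries):
        # Pick a start letter the model actually knows about
        start = random.choice(list(serializer.markov_model.keys()))
        word = serializer.generate_word(limit=limit, start=start)
        if word not in issued:
            issued.add(word)
            return word
    raise RuntimeError("could not generate a fresh ID")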