Simple tokenization with Python

Tokenization is the process of splitting a string into smaller (elementary) units that are then processed further. For preprocessing arithmetic expressions such as 1+2*(3-4), there is no need for a sophisticated tokenizer; a single call to re.findall is enough:

import re

expr = "1+2*(3-4) / 7 - 29"
# match either one or more digits (an integer) or a single operator/parenthesis
tokens = re.findall(r'[0-9]+|[\(\)\+\-\*/]', expr)
# tokens = ['1', '+', '2', '*', '(', '3', '-', '4', ')', '/', '7', '-', '29']

Because the pattern matches only digits and operator characters, whitespace between tokens is skipped automatically. Note, however, that the pattern recognizes only integers; floating-point numbers are not handled.
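As a rough sketch (not part of the original snippet), the pattern could be extended to also accept decimal numbers by making the fractional part optional:

import re

expr = "1.5 + 2 * (3.25 - 4) / 7"
# \d+(?:\.\d+)? matches an integer or a decimal number such as 3.25;
# the second alternative matches a single operator or parenthesis
tokens = re.findall(r'\d+(?:\.\d+)?|[()+\-*/]', expr)
# tokens = ['1.5', '+', '2', '*', '(', '3.25', '-', '4', ')', '/', '7']

This still silently drops any character the pattern does not match, so a real tokenizer would typically also report unexpected input instead of ignoring it.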

