Tokenization is the process of splitting a string into smaller (elementary) units that are further processed afterwards. For the task of preprocessing arithmetic expressions such as 1+2*(3-4) there is no need for a sophisticated tokenizer. Instead, we can use re.findall:
import re expr = "1+2*(3-4) / 7 - 29" tokens = re.findall(r'[0-9]+|[\(\)\+\-\*/]', expr) # tokens = ['1', '+', '2', '*', '(', '3', '-', '4', ')', '/', '7', '-', '29']
The function also already tackles whitespaces and ignores them correctly. Note that the expression only identifies integers (no floating numbers) correctly.
References
- [1] re module reference