Compiling

  1. Python code -> Parse tree
  2. Parse tree -> AST
  3. Symbol table generated
  4. Control flow graph generated
  5. Bytecode generated from the control flow graph
  6. Bytecode optimization (peephole optimization)
  7. Code object generated
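
Each stage can be observed from Python itself. A minimal end-to-end sketch (the example source and the "<string>" filename are only illustrative):

import ast
import dis
import symtable

source = "x = 2 + 2"

tree = ast.parse(source)                                # parse tree -> AST
table = symtable.symtable(source, "<string>", "exec")   # symbol table
code_object = compile(tree, "<string>", "exec")         # code object with bytecode
print(table.get_identifiers())                          # names in the module scope, here just 'x'
dis.dis(code_object)                                    # human-readable bytecode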

1. Lexer

The lexer breaks the source code up into tokens, the individual "words" and symbols of the language.

  • Parser/tokenizer.c -> PyTokenizer_FromString
  • Parser/parsetok.c -> parsetok
  • Lib/tokenize.py

Tokenizing

A token is the name given to a particular kind of symbol in the source.

For example:

a = 4
if (a <= 3):
    print("hello")

so the lexer turns it into a list of tokens like the one below:

  • NAME: a
  • EQUAL: =
  • NUMBER: 4
  • IF: if
  • LPAREN: (
  • etc

python3 -m tokenize test.py

(image: _images/tokenize.png, the output of python3 -m tokenize test.py)
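
The same information is available programmatically through the tokenize module. A small sketch re-using the snippet above (exact_type makes operators show up as EQUAL, LPAREN, and so on):

import io
import tokenize

source = 'a = 4\nif (a <= 3):\n    print("hello")\n'
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.exact_type], repr(tok.string))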

2. Parsing

The parser does not know what the source file means; it only sees the tokens generated by the lexer, and the token stream hands them to the parser one at a time via next().
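
This hand-off can be imitated at the Python level with a token generator. An illustrative sketch only, since the real CPython parser consumes tokens in C rather than through this API:

import io
import tokenize

tokens = tokenize.generate_tokens(io.StringIO("x = 2 + 2").readline)
print(next(tokens))   # TokenInfo for NAME 'x'
print(next(tokens))   # TokenInfo for OP '='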

  • Python/pythonrun.c -> PyParser_ASTFromStringObject
>>> import parser
>>> code = "x = 2 + 2"
>>> st = parser.suite(code)
>>> print(parser.st2list(st))
[257, [269, [270, [271, [272, [274, [305, [309, [310, [311, [312, [315, [316, [317, [318, [319, [320, [321, [322, [323, [324, [1, 'x']]]]]]]]]]]]]]]]], [22, '='], [274, [305, [309, [310, [311, [312, [315, [316, [317, [318, [319, [320, [321, [322, [323, [324, [2, '2']]]]]], [14, '+'], [320, [321, [322, [323, [324, [2, '2']]]]]]]]]]]]]]]]]]], [4, '']]], [4, ''], [0, '']]

(The parser module shown here was deprecated in Python 3.9 and removed in 3.10.)

The grammar is designed for an LL(1) parser. Full Grammar specification: https://docs.python.org/3/reference/grammar.html

3. AST

The ast module builds the AST directly from source:

import ast

tree = ast.parse("x = 2 + 2")
print(ast.dump(tree))
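
The tree can also be traversed programmatically, for example with ast.walk (node class names are those of the current ast module):

import ast

tree = ast.parse("x = 2 + 2")
for node in ast.walk(tree):
    print(type(node).__name__)   # Module, Assign, Name, BinOp, ...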

AST example:

x = 1 + 1
y = x + 2
print(y)
(image: _images/AST.png, the AST for the example above)

Generated by Python AST Visualizer: https://vpyast.appspot.com/
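
The same structure can be printed as text with ast.dump; the indent argument requires Python 3.9 or newer:

import ast

source = "x = 1 + 1\ny = x + 2\nprint(y)\n"
print(ast.dump(ast.parse(source), indent=4))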

4. Compiler

Python/compile.c

compile() turns the AST into a code object, which dis can disassemble:

import ast
import dis

tree = ast.parse("x = 2 + 2")
code_object = compile(tree, 'test.py', mode='exec')
dis.dis(code_object)

# A whole source file can be compiled the same way:
c = compile(open('test.py').read(), 'test.py', 'exec')
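
The resulting code object carries the constants, names, and raw bytecode that dis renders. A small sketch (the values in the comments are approximate and may differ between CPython versions):

import ast
import dis

code_object = compile(ast.parse("x = 2 + 2"), 'test.py', mode='exec')
print(code_object.co_consts)   # (4, None): 2 + 2 has already been constant-folded
print(code_object.co_names)    # ('x',)
print(code_object.co_code)     # the raw bytecode as a bytes object
dis.dis(code_object)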