Yet another guided tour of CPython
(But this one is by the BDFL. :-)

Introduction

There are a lot of subdirectories in the CPython repo. The devguide has an overview, which is broader than this doc, but shallow.

In addition, the devguide has lots of useful instructions on how to get the source code, how to get it to compile, and so on, which I won’t repeat here. Skim at least the index!

I’m also not going to explain setting up your coding environment; whether you use Vim, Notepad, or Eclipse is not the topic here. I presume there are enough online guides explaining how to explore large C programs. If there’s interest I could explain how my Emacs setup works (hint: make TAGS), but I presume it’s not state-of-the-art, and I’m not interested in editor wars. If it works for you, great!

Most of the C code is concentrated in a few directories:

  • Include — header files
  • Objects — object implementations, from int to type
  • Python — interpreter, bytecode compiler and other essential infrastructure
  • Parser — parser, lexer and parser generator
  • Modules — stdlib extension modules, and main.c 
  • Programs — not much, but has the real main() function

When you’re on Linux or a BSD distribution (other than Mac) that’s all you need, and I recommend this.

For other platforms there are a few more places to look:

  • Mac — OS X-specific code
  • PC — Old Windows-specific code
  • PCBuild — Tools to build Python for modern Windows using Visual Studio

I believe you can still build Python on OS X without the Mac-specific stuff, it just won’t be a “framework build”. (You still need Xcode for the compiler and tools IIRC.) On Windows there’s not really a choice, you have to read PCBuild/readme.txt. (I suppose there’s Cygwin, but unless you’re already a Cygwin expert, I think it’s an uphill battle. And VS is actually pretty good.)

So I’d like to focus on the most important directories, which I consider Include, Objects and Python. The code in Parser is unusual (I’ll get to it eventually) and the code in Modules is really no different from any 3rd party C extension for Python — the existing Python/C API docs should suffice.

Chapter 1: getting to the prompt

But first let me slice things completely differently: Let’s explore what happens when you run python3 in your shell. I’ll start at main(), and I’d like to take a break when we’ve found where the famous >>> prompt is printed.

I’m exploring the source code for Python 3.5, which is the most recent production release. We keep the master copy of the source code in Mercurial (https://hg.python.org/cpython/file/3.5) but you can also find a recent copy on GitHub (https://github.com/python/cpython/tree/3.5). The devguide will tell you how to clone the Hg repo; if you’re a GitHub user you hopefully already know how to clone the Git repo! (You won’t need the Hg repo until you’re considering submitting patches, and hopefully before the end of 2016 you won’t need it at all, since we’re going to migrate to GitHub.)

Every C program starts its execution in main(). Python’s main() lives in Programs/python.c (though in older versions it was Modules/python.c). The code here almost immediately gets hairy, but I’m going to ignore most of the madness. The thing to note is that the code here just ends up calling Py_Main(), which is defined in Modules/main.c. On to that file! (PS. There’s an old historical reason why it lives in Modules, not in Python. But it’s not important here.)

The first slightly interesting thing that happens here is the initialization for hash randomization. This is a security feature that causes the key order for dicts to vary from run to run. As it affects the way dictionaries find their keys, it has to be initialized before any Python objects are created. The actual function that does this initialization, _PyRandom_Init(), lives in Python/random.c, but I recommend skipping it for now.
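To see the effect from the outside, here is a small sketch (mine, not code from CPython) that starts child interpreters with different values of the PYTHONHASHSEED environment variable and asks each for hash('abc'). The helper name hash_in_child is made up for this example.

```python
import os
import subprocess
import sys

def hash_in_child(seed):
    """Return hash('abc') as computed by a fresh interpreter with this PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash('abc'))"], env=env)
    return int(out.decode().strip())

# A fixed seed makes the hash reproducible from run to run...
print(hash_in_child("1") == hash_in_child("1"))
# ...while "random" (the default) picks a new seed for every process,
# so these two values will almost certainly differ.
print(hash_in_child("random"), hash_in_child("random"))
```

Since the seed is read from the environment at startup, this also illustrates why the initialization has to happen so early.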

There are a few more initialization calls (PySys_ResetWarnOptions() and _PyOS_ResetGetOpt()), and then we finally get to the code that parses the command line. Aargh, this looks low-level! Indeed it is. C is a low-level language. The code uses _PyOS_GetOpt(), which really is just GNU getopt in disguise. The code is found in Python/getopt.c and it’s really just a copy of GNU getopt with the type of argv changed from char ** to wchar_t **, which is a way to support Unicode characters on platforms that support it (the original argv is copied into this form in the main() function mentioned above).

If you’re curious about Python’s command line options, just type python3 -h. You’ll find every option listed there as a case in the switch here. The -h option (case 'h':) is the first one processed after the getopt loop terminates: the usage() function prints the message. (It’s somewhat complicated because usage() is also called when a faulty option is encountered, see the default: case.) There’s also a bit of code here to handle -V, which prints Python’s version. (Neither -h nor -V is handled in the getopt loop itself, which only records that they were seen; I suppose this implies that python3 -hV and python3 -Vh both print the usage message.)
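If you’d rather poke at these options from a script than from the shell, something like the following sketch works. It assumes sys.executable is a CPython binary; run_python is a helper name invented here.

```python
import subprocess
import sys

def run_python(*options):
    """Run a child interpreter with the given options; return stdout and stderr combined."""
    proc = subprocess.run([sys.executable, *options],
                          stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    return proc.stdout.decode(errors="replace")

help_text = run_python("-h")
version_text = run_python("-V")
print(help_text.splitlines()[0])   # first line of the usage message
print(version_text.strip())        # the version line
```

Calling run_python("-hV") and run_python("-Vh") is a quick way to check the guess about combined options against whatever version you have installed.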

Onward. We find that some environment variables are equivalent to setting certain options. But note that -E causes environment variables to be ignored. If you look for where this happens, you’ll find that we actually parse the command line twice, the first time just to look for -E. Also note that Py_GETENV() is a macro (we can tell because its name is upper case, except for the Py prefix); its definition is in Include/pydebug.h. Read its definition to find how -E works.
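Here is one way to watch -E at work without reading any C. The sketch below uses PYTHONDONTWRITEBYTECODE, which is documented as equivalent to the -B option, and inspects sys.dont_write_bytecode in a child process. The helper name is my own.

```python
import os
import subprocess
import sys

CHILD_CODE = "import sys; print(sys.dont_write_bytecode)"

def dont_write_bytecode(*options):
    """Report sys.dont_write_bytecode in a child that has PYTHONDONTWRITEBYTECODE=1 set."""
    env = dict(os.environ, PYTHONDONTWRITEBYTECODE="1")
    out = subprocess.check_output(
        [sys.executable, *options, "-c", CHILD_CODE], env=env)
    return out.decode().strip()

print(dont_write_bytecode())       # the environment variable is honored
print(dont_write_bytecode("-E"))   # the environment variable is ignored
```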

I’m going to fast-forward over a lot of following stuff, until we hit the call to Py_Initialize(). This function initializes the interpreter and basic objects. You can find its definition in Python/pylifecycle.c. There’s a lot going on here that I don’t want to explain, but the key insight at this point is that Python’s main() is just a client of the Python/C API. You can write your own C code that just calls Py_Initialize() and then executes some Python code using for example PyRun_SimpleString():

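#include <Python.h>

int
main(int argc, char **argv) {
    Py_Initialize();    /* set up the interpreter and the basic objects */
    PyRun_SimpleString("print('hello world')");
    Py_Finalize();      /* shut the interpreter down again */
    return 0;
}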

To get it running, you need a C compiler, the Python header, and a compiled version of Python as a statically linked or shared library file. (I’ll leave the exact commands as an exercise; the Embedding Python docs should get you started, and they have several longer versions of the above example.)

But I digress. The key thing is that once Py_Initialize() has run, the interpreter is open for business, and we can use all kinds of objects. (There was already some object use in the getopt loop, related to warning_options, but that’s skating on rather thin ice if you ask me, and it only uses lists and strings, some of the most basic data types.)

The interpreter is now ready enough to print the version string (which it only does when invoked without a file or command to execute, and even then only if standard input is connected to an interactive device like a terminal emulator).

A little more messiness culminates in a call to PySys_SetArgv(), which sets the Python-level variable sys.argv.

After this there’s some code that tries to import the readline module, which provides interactive command line editing at the >>> prompt. Sometimes this can’t be imported, and we clear the exception in that case. This is a pretty standard pattern and good to understand:

        PyObject *v;
        v = PyImport_ImportModule("readline");
        if (v == NULL)
            PyErr_Clear();
        else
            Py_DECREF(v);

The PyImport_ImportModule() call is one of the high-level entry points into the import machinery. It returns a new reference to the imported module object, or NULL if there was an error. It turns out that we are just importing it for its side effect, so we don’t care about the new reference, and we call Py_DECREF() on it to disown it. (In all likelihood this is just going to decrement the object’s reference counter and not free the object, because there’s still a reference to the imported module in sys.modules.)

If there’s an error, some global state regarding the exception is set, and it’s important that we clear that exception state. That’s what the PyErr_Clear() call is for. If we don’t do this, things will appear to be all right at first, until suddenly, deep inside some unrelated code, the exception here (likely ImportError) is reported as the outcome of some other innocent call. It’s an interesting puzzle to figure out why this is, but for now, I’ll just warn about the potential problem and emphasize the need to call PyErr_Clear() if you want to ignore an exception.
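For comparison, this is roughly how the same "import it for its side effect, and shrug if that fails" pattern looks at the Python level. It’s an analogy rather than a translation of the C code, and the function name and the bogus module name are invented for the example.

```python
import importlib
import sys

def import_for_side_effects(name):
    """Try to import a module; report success, but never let ImportError escape."""
    try:
        importlib.import_module(name)
    except ImportError:
        # The Python-level counterpart of PyErr_Clear(): once the except
        # block ends, the exception is gone and cannot resurface later.
        return False
    # Nothing to release on success; the module stays alive in sys.modules.
    return True

print(import_for_side_effects("readline"))              # depends on the platform
print(import_for_side_effects("no_such_module_xyzzy"))  # made-up name, always fails
```

In Python the except clause does the clearing for you, which is exactly why the explicit PyErr_Clear() call is so easy to forget when you switch to C.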

At this point we come to a watershed. If the -c or -m argument was used, we handle those and we’re essentially done. A command given with -c is run using the local function run_command(). You can tell it’s defined in the same file because its name is all_lowercase and doesn’t start with Py. This does some unicode fiddling and then calls PyRun_SimpleStringFlags(). Alternatively, -m is handled by the local function RunModule(). (Personally, I think this is using the wrong naming convention: it should be run_module(), although PEP 7 is silent on the issue. CPython contributors often bring in their own cultural baggage, and other projects written in C or C++ often use the CapWords convention. Think of it as someone with a French accent. :-) The body of RunModule() shows us what it’s really doing: it imports the runpy module and calls runpy._run_module_as_main() where the arguments are what’s passed to RunModule(): the module name and a flag indicating whether to set sys.argv[0]. (More about this in PEP 338.) Most of the complexity in RunModule() is typical for C code calling Python code: a lot of error checking and reference count handling, punctuated by slightly more interesting Python/C API calls like PyObject_GetAttrString() or PyObject_Call().
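The function runpy._run_module_as_main() is private, but runpy also has a documented cousin, run_module(), that you can play with from Python. The sketch below writes a throwaway module (demo_mod, a name I made up) into a temporary directory and runs it the way -m would, minus the sys.argv handling.

```python
import importlib
import os
import runpy
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # "demo_mod" is a throwaway module invented for this example.
    with open(os.path.join(tmpdir, "demo_mod.py"), "w") as f:
        f.write("seen_name = __name__\n")
    sys.path.insert(0, tmpdir)
    importlib.invalidate_caches()
    try:
        # Roughly what "python3 -m demo_mod" does, without touching sys.argv.
        module_globals = runpy.run_module("demo_mod", run_name="__main__")
    finally:
        sys.path.remove(tmpdir)

print(module_globals["seen_name"])
```

The module sees itself as __main__, which is the whole point of -m.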

Backing up slightly, if neither -c nor -m was given, there are two other possibilities: either we run code from a file, or we start up an interactive prompt. There’s a whole bunch of stuff here that I don’t recognize; at the bottom of this block there’s a call to run_file() which mostly just wraps PyRun_AnyFileExFlags(). This function lives in Python/pythonrun.c and is a wrapper that calls either PyRun_InteractiveLoopFlags() or PyRun_SimpleFileExFlags(), both defined in the same file. And this is what I’ve promised: the code that prints the >>> prompt lives in the former.

Let’s look at PyRun_InteractiveLoopFlags() in detail. If you look carefully you’ll see that it initializes the system prompt variables, sys.ps1 and sys.ps2, to ">>> " and "... ", respectively, unless they are already set. Because manipulation of attributes of the sys module is common and special, there are helper functions to do this: _PySys_GetObjectId() and _PySys_SetObjectId(). The leading underscore tells us that they’re not part of the public Python/C API; the _Py prefix is used because they are in fact global symbols from the linker’s perspective, since they live in a different file, Python/sysmodule.c. Another indication of how special sys is: even though it is technically an extension module, its C code lives in Python/, not in Modules/.
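The "unless they are already set" logic is simple enough to mimic in a few lines of Python. This is only a sketch of the idea, not a transcription of the C code; ensure_prompts is an invented name, and I use a SimpleNamespace as a stand-in for the sys module so the example doesn’t mess with your real prompts.

```python
import types

def ensure_prompts(sys_like):
    """Set ps1/ps2 to their defaults unless the attributes already exist."""
    if not hasattr(sys_like, "ps1"):
        sys_like.ps1 = ">>> "
    if not hasattr(sys_like, "ps2"):
        sys_like.ps2 = "... "
    return sys_like.ps1, sys_like.ps2

fresh = types.SimpleNamespace()              # stand-in for a sys with no prompts yet
custom = types.SimpleNamespace(ps1="py> ")   # stand-in for a sys with ps1 preset
print(ensure_prompts(fresh))
print(ensure_prompts(custom))
```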

And then, finally, we see the “Read Eval Print Loop” or REPL. That’s an old Lisp term for a program that reads an expression (or statement) from a console, evaluates it, and prints the result, repeatedly in a loop. And indeed this is where that loop lives in Python’s case. It’s a for (;;) loop, meaning it loops “forever”; actually, until the thing it calls to do the “REP” part returns a value named E_EOF, which (apparently) means that it encountered “End Of File”. That’s also a bit of an anachronism (presumably from the days of computer tapes), since there’s no file in sight. On UNIX it means that you typed Control-D; on Windows it’s Control-Z.
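If you’ve never written a REPL, here is a toy one in Python to show the shape of such a loop. It is a sketch under heavy simplifications (expressions only, one line each) and has nothing to do with how the C code is organized; tiny_repl and scripted_input are names invented here.

```python
def tiny_repl(read_line, write):
    """Read-eval-print loop for expressions; stops when read_line raises EOFError."""
    namespace = {}
    while True:                      # the moral equivalent of for (;;)
        try:
            source = read_line(">>> ")
        except EOFError:             # Control-D on UNIX, Control-Z on Windows
            break
        try:
            write(repr(eval(source, namespace)))
        except Exception as exc:     # report the error and keep looping
            write("%s: %s" % (type(exc).__name__, exc))

def scripted_input(lines):
    """Return a read_line() stand-in that replays canned lines, then signals EOF."""
    pending = iter(lines)
    def read_line(prompt):
        try:
            return next(pending)
        except StopIteration:
            raise EOFError from None
    return read_line

outputs = []
tiny_repl(scripted_input(["1 + 1", "'a' + 'b'", "1 / 0"]), outputs.append)
print(outputs)
```

To try it interactively, call tiny_repl(input, print); the builtin input() raises EOFError when it hits end of file.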

Follow the trail into PyRun_InteractiveOneObject(). This retrieves the sys.ps1 and sys.ps2 variables (named thus because the UNIX shell uses environment variables PS1 and PS2 to customize its prompts), using lots of waving of Unicode arms and error handling, and then calls PyParser_ASTFromFileObject() which asks the parser to do its thing. And still we haven’t seen where the prompt is printed! But we’re close. We follow the trail into PyParser_ParseFileObject(), from there into PyTokenizer_FromFile() … and out of it again, to parsetok(). PyParser_ParseFileObject() and parsetok() are both in Parser/parsetok.c, and you can tell that parsetok() is the main event. From the name we can also tell it’s old code!

In parsetok()  the first thing we see is a little easter egg, involving the concept of “Barry as BDFL”. That’s Barry Warsaw, a good friend and longtime core Python contributor, who also introduced the PEP process. Try to figure out how to invoke it and what it does! (There’s a clue further down.)

Next we see another “infinite” loop. The first key operation in this loop is the call to PyTokenizer_Get(), which reads one token from the input file; after a lot of distractions, the second is PyParser_AddToken(), which sends the token to the parser. (What the parser does with the token will be the subject of another chapter.)
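To get a feel for what "one token" means, you can use the standard library’s tokenize module. Note that this is a separate tokenizer exposed to Python code, not a window into Parser/tokenizer.c, but the tokens it produces are of the same kind. The helper name token_stream is mine.

```python
import io
import tokenize

def token_stream(source):
    """Return (token name, token text) pairs for a piece of source code."""
    readline = io.StringIO(source).readline
    return [(tokenize.tok_name[tok.type], tok.string)
            for tok in tokenize.generate_tokens(readline)]

for name, text in token_stream("x = 1\n"):
    print(name, repr(text))
```

Each pair corresponds to what one trip through the loop would hand to the parser.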

Once the loop is over, the key operation is, for once, not a function call: it’s n = ps->p_tree; which extracts the completed parse tree into the local variable n. Everywhere else, you’d think that a variable named n would be an integer, but here it’s a node. And this node is that prized possession, for which we worked so hard. We return it to the caller of parsetok(), PyParser_ParseFileObject(), which returns it to its caller, PyParser_ASTFromFileObject() (we’ll see soon what that function does with the node).

But where did our >>> prompt (or actually, ps1) get printed? Let’s find out. The last time we saw ps1 and ps2 being passed around was in PyParser_ParseFileObject(), which passed them into PyTokenizer_FromFile(). There they were assigned to members of tok, which is a struct tok_state, IOW it holds the state for the tokenizer. The last function’s purpose was to initialize this state. So parsetok() received the tokenizer state, and apparently the prompts get printed somehow by PyTokenizer_Get().

We’ve now arrived in Parser/tokenizer.c. The last function calls tok_get(), which is another very old piece of DNA (about the age of mitochondria). We follow it into tok_nextc() (lots of practice understanding infinite loops here!) which (occasionally) calls PyOS_Readline(). BTW there’s so much state in the tokenizer because of buffering: we read the input one physical line at a time. “Physical line” roughly means “until the first newline character”, as opposed to “logical line” which (in the context of parsing Python, and waving some hands) means more or less “until the end of a statement”. Initially the buffer is empty and we have to read the first physical line. This is where ps1 (i.e. >>>, usually, plus a space) is printed. When we’ve reached the end of that line and we haven’t hit the end of the statement, we must read another line, and that’s where ps2 is printed, typically ... (three dots plus a space).
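The physical-versus-logical distinction can be imitated in Python with the codeop module, whose compile_command() returns None when the source so far is an incomplete statement. The sketch below is my own approximation of the "which prompt comes next" decision, not the tokenizer’s actual algorithm; prompts_for is an invented name, and it doesn’t try to handle syntax errors.

```python
import codeop

def prompts_for(physical_lines, ps1=">>> ", ps2="... "):
    """Return the prompt an interactive session would show before each line."""
    prompts = []
    pending = []                       # physical lines of the current statement
    for line in physical_lines:
        prompts.append(ps2 if pending else ps1)
        pending.append(line)
        # compile_command() returns None while the statement is incomplete.
        if codeop.compile_command("\n".join(pending)) is not None:
            pending = []               # logical line finished; back to ps1
    return prompts

print(prompts_for(["total = (1 +", "2)", "print(total)"]))
```

The open parenthesis on the first physical line is what keeps the logical line going, so the second line gets ps2.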

If you haven’t tried this, I recommend playing with sys.ps1 and sys.ps2, to see their effect in action. Anyway, PyOS_Readline() lives in Python/myreadline.c. It can print the prompt in one of two ways. One is to use PyOS_StdioReadline(), which actually calls fprintf(stderr, "%s", prompt); which is as far as I will go. The other way is through the function pointer PyOS_ReadlineFunctionPointer, which is initialized (if not already set) to … PyOS_StdioReadline()!
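Here is one way to play with sys.ps1 non-interactively and, at the same time, confirm that the prompt goes to stderr. The sketch starts a child interpreter with -i (which forces the interactive loop even though standard input is a pipe), presets sys.ps1 via -c, and feeds it one expression. I’m assuming the child falls back to the plain stdio code path when stdin is not a terminal.

```python
import subprocess
import sys

# Start a child interpreter in interactive mode (-i) after presetting sys.ps1,
# then feed it a single expression on standard input.
proc = subprocess.run(
    [sys.executable, "-i", "-c", "import sys; sys.ps1 = 'py> '"],
    input=b"1 + 1\n",
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
child_stdout = proc.stdout.decode()
child_stderr = proc.stderr.decode()
print("stdout:", repr(child_stdout))
print("stderr:", repr(child_stderr))
```

The result of the expression lands on stdout, while the custom prompt shows up on stderr.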

This function pointer variable is however a public API and you can write a C extension that sets it to something else. A reason to set it would be if you’re embedding Python in some big application with its own GUI and you want that GUI to handle Python input. Or (as you can read in the public API docs), if you’re the readline module. And that’s the reason that readline was imported previously (right after calling PySys_SetArgv()): somewhere towards the end of that module’s initialization (a few lines before the bottom of Modules/readline.c) that function pointer variable is initialized to the call_readline() function defined in that file. In fact, much of the complexity of the interface around printing ps1 and ps2 is directly caused by the desire to use GNU readline(), which is (on UNIXoid systems, anyway) the gold standard for command line editing and history, similar to what you’re used to in Bash (in fact, it’s the same code: Bash itself uses GNU readline, even though the concept is much older than either of these). At the same time, not all systems support GNU readline(): Windows doesn’t (but it has its own command line editing built into its DOS-like console program which works just fine), and past MacOS systems didn’t either. So we’ve also encountered some code that deals with its absence.

I now call this chapter complete. We’ve seen where and how the >>> prompt is printed. In the next chapter I’ll discuss what happens next (though we’ve already seen a little bit, since the lexer and the parser are part of the story).

Chapter 2: lexer, parser, bytecode compiler

I haven’t got to this part yet. However, here’s an existing resource that explains the compiler design:


[This is how far I got. I’m still working on it, but I’ve gotten distracted by other things.]