There are a lot of subdirectories in the CPython repo. The devguide has an overview, which is broader than this doc, but shallow.
In addition, the devguide has lots of useful instructions on how to get the source code, how to get it to compile, and so on, which I won’t repeat here. Skim at least the index!
I’m also not going to explain setting up your coding environment; whether you use Vim, Notepad, or Eclipse is not the topic here. I presume there are enough online guides explaining how to explore large C programs. If there’s interest I could explain how my Emacs setup works (hint: make TAGS), but I presume it’s not state-of-the-art, and I’m not interested in editor wars. If it works for you, great!
Most of the C code is concentrated in a few directories:
Include — header files
Objects — object implementations, from int to type
Python — interpreter, bytecode compiler and other essential infrastructure
Parser — parser, lexer and parser generator
Modules — stdlib extension modules, and main.c
Programs — not much, but has the real main() function
When you’re on Linux or a BSD distribution (other than Mac), that’s all you need, and I recommend this.
For other platforms there are a few more places to look:
Mac — OS X-specific code
PC — Old Windows-specific code
PCBuild — Tools to build Python for modern Windows using Visual Studio
I believe you can still build Python on OS X without the Mac-specific stuff; it just won’t be a “framework build”. (You still need Xcode for the compiler and tools IIRC.) On Windows there’s not really a choice: you have to read PCBuild/readme.txt. (I suppose there’s Cygwin, but unless you’re already a Cygwin expert, I think it’s an uphill battle. And VS is actually pretty good.)
So I’d like to focus on the most important directories, which I consider Include, Objects and Python. The code in Parser is unusual (I’ll get to it eventually), and the code in Modules is really no different from any 3rd party C extension for Python; the existing Python/C API docs should suffice.
Chapter 1: getting to the prompt
But first let me slice things completely differently: let’s explore what happens when you run python3 in your shell. I’ll start at main(), and I’d like to take a break when we’ve found where the famous >>> prompt is printed.
I’m exploring the source code for Python 3.5, which is the most recent production release. We keep the master copy of the source code in Mercurial (https://hg.python.org/cpython/file/3.5) but you can also find a recent copy on GitHub (https://github.com/python/cpython/tree/3.5). The devguide will tell you how to clone the Hg repo; if you’re a GitHub user you hopefully already know how to clone the Git repo! (You won’t need the Hg repo until you’re considering submitting patches, and hopefully before the end of 2016 you won’t need it at all, since we’re going to migrate to GitHub.)
Every C program starts its execution in main(). Python’s main() lives in Programs/python.c (though in older versions it was Modules/python.c). The code here almost immediately gets hairy, but I’m going to ignore most of the madness. The thing to note is that the code here just ends up calling Py_Main(), which is defined in Modules/main.c. On to that file! (PS. There’s an old historical reason why it lives in Modules, not in Python. But it’s not important here.)
The first slightly interesting thing that happens here is the initialization for hash randomization. This is a security feature that causes the key order for dicts to vary from run to run. As it affects the way dictionaries find their keys, it has to be initialized before any Python objects are created. The actual function that does this initialization, _PyRandom_Init(), lives in Python/random.c, but I recommend skipping it for now.
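You can watch hash randomization from the outside without reading any C. The following is a small sketch, not code from the CPython tree. It starts fresh interpreters and compares the hash of the same string, relying on the documented PYTHONHASHSEED environment variable (0 turns randomization off, random asks for a fresh seed). The helper name hash_in_fresh_interpreter is made up for this example.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(seed):
    """Start a new interpreter and return hash('prompt') as computed there.

    Hypothetical helper for this example; `seed` is passed to the child
    process as the PYTHONHASHSEED environment variable.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    output = subprocess.check_output(
        [sys.executable, "-c", "print(hash('prompt'))"], env=env)
    return int(output)


# With the seed pinned to 0, randomization is disabled: two runs agree.
pinned_first = hash_in_fresh_interpreter(0)
pinned_second = hash_in_fresh_interpreter(0)
print("pinned runs agree:", pinned_first == pinned_second)

# With a random seed, separate runs will almost always disagree.
random_hashes = {hash_in_fresh_interpreter("random") for _ in range(3)}
print("distinct hashes over 3 random runs:", len(random_hashes))
```

Because the seed has to be chosen before the first string is hashed, it makes sense that the C code does this setup so early.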
There are a few more initialization calls (PySys_ResetWarnOptions() and _PyOS_ResetGetOpt()), and then we finally get to the code that parses the command line. Aargh, this looks low-level! Indeed it is. C is a low-level language. The code uses _PyOS_GetOpt(), which really is just GNU getopt in disguise. The code is found in Python/getopt.c and it’s essentially a copy of GNU getopt with the type of argv changed from char * to wchar_t *, which is a way to support Unicode characters on platforms that support it (the original argv is copied into this form in the main() function mentioned above).
If you’re curious about Python’s command line options, just type python3 -h. You’ll find every option listed there as a case in the switch here. The -h option (case 'h':) is the first one processed after the getopt loop terminates; the usage() function prints the message. (It’s somewhat complicated because usage() is also called when a faulty option is encountered, see the default: case.) There’s also a bit of code here to handle -V, which prints Python’s version. (Neither -h nor -V is handled in the getopt loop itself; I suppose this implies that python3 -hV and python3 -Vh both print the usage message.)
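To see why the order of -h and -V shouldn’t matter, here is a deliberately simplified model of that structure, written in Python rather than C. It is not a translation of Modules/main.c: parse_flags and its return strings are invented for this sketch, and it only knows the two options discussed above. The point is just that the loop records flags and the decisions are made afterwards, with help checked first.

```python
def parse_flags(argv):
    """Model a getopt-style loop that only records flags.

    Hypothetical helper: returns "usage", "version", "run", or
    "usage-error" for a list of command line arguments.
    """
    want_help = False
    want_version = False

    for arg in argv:
        if not arg.startswith("-") or arg == "-":
            break  # the first non-option argument ends option parsing
        for letter in arg[1:]:  # short options may be clustered: -hV
            if letter == "h":
                want_help = True
            elif letter == "V":
                want_version = True
            else:
                return "usage-error"  # stands in for the default: case

    # Decisions happen only after the loop, and help is checked first.
    if want_help:
        return "usage"
    if want_version:
        return "version"
    return "run"


for args in (["-hV"], ["-Vh"], ["-V"], ["script.py", "-h"]):
    print(args, "->", parse_flags(args))
```

In this model both -hV and -Vh end up in the help branch, which matches the guess above, provided the real code is structured the way it is described here.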
Onward. We find that some environment variables are equivalent to setting certain options. But note that -E causes environment variables to be ignored. If you look for where this happens, you’ll find that we actually parse the command line twice; the first time is just to look for -E. Also note that Py_GETENV() is a macro (we can tell because its name is upper case, except for the Py prefix); its definition is in Include/pydebug.h. Read its definition to find how -E works.
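The effect of -E is also easy to observe from Python itself. The sketch below assumes the documented PYTHONDONTWRITEBYTECODE variable (the environment equivalent of -B) and reads sys.flags in a child interpreter; flags_in_child is a name invented for this example, not anything in CPython.

```python
import os
import subprocess
import sys


def flags_in_child(*options):
    """Run a child interpreter with PYTHONDONTWRITEBYTECODE set.

    Hypothetical helper: returns the child's (dont_write_bytecode,
    ignore_environment) flags as a tuple of ints.
    """
    env = dict(os.environ, PYTHONDONTWRITEBYTECODE="1")
    code = ("import sys; "
            "print(int(sys.flags.dont_write_bytecode), "
            "int(sys.flags.ignore_environment))")
    output = subprocess.check_output(
        [sys.executable] + list(options) + ["-c", code], env=env)
    first, second = output.split()
    return int(first), int(second)


honours_env = flags_in_child()       # the variable acts like -B
ignores_env = flags_in_child("-E")   # -E makes the interpreter skip it
print("without -E:", honours_env)
print("with -E:   ", ignores_env)
```

Without -E the child reports the bytecode flag as set; with -E the same variable is present in the environment but has no effect.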
I’m going to fast-forward over a lot of the following stuff, until we hit the call to Py_Initialize(). This function initializes the interpreter and basic objects. You can find its definition in Python/pylifecycle.c. There’s a lot going on here that I don’t want to explain, but the key insight at this point is that Python’s main() is just a client of the Python/C API. You can write your own C code that just calls Py_Initialize() and then executes some Python code using, for example, PyRun_SimpleString():