Wednesday 25 March 2009

Under the Hood: the Apache 2.2 / PHP 5 Request lifecycle

This article describes what happens when a client initiates a new HTTP request for a .php resource. To follow along, you need some C programming knowledge and a GNU/Linux box.

I'm going to do it with hands-on examples. Although I'll describe Apache and PHP specifics, the article should also be interesting for intermediate C programmers who want to learn techniques for exploring complex source code written by others and figuring out how it works. These techniques work for practically any open source project out there.

Be careful: I build, compile and install everything as root because I know what I'm doing. If you aren't comfortable on the command line, don't enter any of these commands. That being said, I don't assume any responsibility for damage you may do to your system by following my instructions.

The first step is to compile all the programs involved with debugging information. Most projects expose options for this through their ./configure script. Also note that I'm going to do all the work in a temporary directory named /srv, so I don't pollute my system.

Every line below that begins with # followed by a space is a command you need to enter in your shell. Lines starting with (gdb) denote input for the debugger.

1. Laying out the directory structure:
# cd /
# mkdir -p /srv/_build/{apache,php}
# mkdir /srv/src

In /srv/src I'm going to download all the sources I need, but I'll compile them in /srv/_build/ to keep the source trees clean. The motivation for this will become clear later.

2. Download all the source archives.
# cd /srv/src

You may want to go to each project's official homepage and pick the mirror closest to your geographical location. The PHP download URL ends in "mirror", so I tell wget explicitly which file name to save it under:
# wget -c http://mirror.deri.at/apache/httpd/httpd-2.2.11.tar.gz
# wget -O php-5.2.9.tar.bz2 http://at2.php.net/get/php-5.2.9.tar.bz2/from/this/mirror
# tar xfz httpd-2.2.11.tar.gz
# tar xfj php-5.2.9.tar.bz2

3. Building apache for hackers
# cd /srv/_build/apache
The PHP manual (http://php.net/manual/en/install.unix.apache2.php) mentions only the --enable-so flag. Additionally, a search for "debug" through
# ../../src/httpd-2.2.11/configure --help|less
leads us to --enable-maintainer-mode. We also want --prefix=/srv and --with-mpm=prefork, so while in /srv/_build/apache type in:
# ../../src/httpd-2.2.11/configure --prefix=/srv --enable-maintainer-mode --enable-so --with-mpm=prefork
Please read the docs for each option. It is quite important to understand what an MPM is, for instance.
# make && make install
4. Testing whether apache works or not
The command
# ls /srv/bin/
should show us apachectl and httpd, among others, so:
# /srv/bin/apachectl start
If you see something like this:

(98)Address already in use: make_sock: could not bind to address [::]:80
(98)Address already in use: make_sock: could not bind to address 0.0.0.0:80
no listening sockets available, shutting down
Unable to open logs

Then it means the "clean" httpd of your distribution is already listening on port 80. Stop that one first by calling the apachectl on your path (find out the concrete path with the command "which apachectl"):
# apachectl stop
Note: apachectl and /srv/bin/apachectl are not the same. The former controls the "original" httpd process, whereas the latter controls the one in /srv/bin, which we've just compiled with debugging info.

Once nothing listens on port 80 any more, which you can check with
# netstat -tlnp
you can finally start the server:
# /srv/bin/apachectl start
Now point your browser to http://localhost, just to make sure everything runs smoothly.

5. Getting a taste for debugging
On this one I'm going to cheat a little, because I'm a noob and all I know right now is what I'm trying to do: debug apache.

So after firing up the browser and searching the web for apache debugging, I came across http://httpd.apache.org/dev/debugging.html, which shows how to start and debug httpd.
So first stop httpd:
# /srv/bin/apachectl stop
Then load the program in the debugger:
# gdb /srv/bin/httpd
At the gdb prompt, set the arguments properly, since we don't want to mess with the local configuration from /etc/ or similar (just in case we already had apache installed on our system before following the instructions in this article):
(gdb) set args -X -f /srv/conf/httpd.conf
Then run the process:
(gdb) run
Opening http://localhost should show the classical "It works!" message. The only difference is that now we can inspect the process live, while it's running.
But because we've started httpd with the -X parameter, the server stays in the foreground and keeps the terminal busy, so press CTRL+C to get back to the gdb prompt:
(gdb) info threads
This should confirm that apache is running one single thread (-X starts only one worker and keeps it attached to the console; you read the docs, didn't you?):
  1 Thread 0x7f2fe9d6d740 (LWP 16856)  0x00007f2fe82a29f0 in __accept_nocancel () from /lib/libpthread.so.0
This is important at least for our purpose, since we don't want to mess with multithreading, only understand the execution flow.
To really see what "happened" so far, print out a backtrace:
(gdb) bt
Shows something like this:

#0  0x00007f883f0fc9f0 in __accept_nocancel () from /lib/libpthread.so.0
#1  0x00007f883f96d362 in apr_socket_accept () from /usr/lib/libapr-1.so.0
#2  0x000000000047e95f in unixd_accept ()
#3  0x000000000047c6d1 in child_main ()
#4  0x000000000047c837 in make_child ()
#5  0x000000000047cdcc in ap_mpm_run ()
#6  0x0000000000428f36 in main ()

The only function here you need to know about is main(). The innermost frame, __accept_nocancel, looks interesting, but it's part of libpthread, and since I'm a noob I don't want to mess with threads yet, so I'll just ignore it. It's prefixed with __, which by C convention marks a name reserved for the implementation. That means the function is internal to the library and we shouldn't care about it, unless we want to have a closer look at the pthread library itself.
I do know, however, that APR stands for "Apache Portable Runtime", and I'd like to have a closer look at it, so I set a breakpoint on it:
(gdb) break apr_socket_accept
then kill the process and restart it
(gdb) kill
(gdb) run
*Note: gdb can auto-complete function names too, so after you've typed in "break apr_socket_" you can press TAB to see a list of functions matching that name. Cool, huh? :)
After "run", httpd stops at the breakpoint right away, before it has accepted any connection. Now go again to http://localhost and see what happens. The Linux kernel sees a connection request on port 80 and "knows" that the listening socket belongs to the httpd process, so it completes the TCP handshake and queues the connection. That's why you don't get a "connection timeout" or a similar error message from the browser. BUT the browser still waits for data, because the process is suspended in the debugger.
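
Why doesn't the browser get an error while httpd sits in the debugger? Here is a minimal C sketch of that situation. It is not apache's actual code: the function name connect_before_accept is made up for this illustration, and I assume a Linux box with a working loopback interface. The client's connect() succeeds although the server has not called accept() yet, because the kernel finishes the handshake and parks the connection in the listen queue.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 0 if a client could connect BEFORE the server called accept()
 * and the server then picked that connection up; -1 on any error.
 * (Hypothetical helper, written only for this article.) */
int connect_before_accept(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int result = -1;
    int conn = -1;
    int server = socket(AF_INET, SOCK_STREAM, 0);
    int client = socket(AF_INET, SOCK_STREAM, 0);

    if (server < 0 || client < 0)
        goto done;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                  /* let the kernel pick a free port */

    if (bind(server, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(server, 8) < 0 ||
        getsockname(server, (struct sockaddr *)&addr, &len) < 0)
        goto done;

    /* The "browser": this succeeds although nobody has called accept(). */
    if (connect(client, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto done;

    /* The "continue" in gdb: the server finally fetches the connection. */
    conn = accept(server, NULL, NULL);
    if (conn >= 0)
        result = 0;

done:
    if (conn >= 0)
        close(conn);
    if (client >= 0)
        close(client);
    if (server >= 0)
        close(server);
    return result;
}
```

Called from a main() of your own, it should return 0: the connection was established before accept() ran, which is the state your browser is in while gdb holds httpd at the breakpoint.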

So at the (gdb) prompt, continue the process. Position both windows, the browser and the terminal with the debugging session, so that you can see what happens in the browser window when you type this:
(gdb) continue
You'd see something similar to this:
Continuing.

Breakpoint 1, 0x00007f2478ac1330 in apr_socket_accept () from /usr/lib/libapr-1.so.0

Exactly: that means the httpd process served the request and came back to the breakpoint we've set on apr_socket_accept(). Now we can make a new request, just for the fun of it :)

You may ask yourself why I showed you that. Well, first of all I wanted to teach you some basic debugging techniques, and also give you a starting point for the steps you'll have to follow when you debug the PHP runtime.

So don't worry, just recap the concepts of debugging, program, process, request, thread, MPM (multi-processing module) you've learned so far, 'coz they're damn interesting stuff! :)

6. Compiling PHP for hackers
Wow, you've got that far? Congratulations!
Basically, all we need are the configure flags --enable-debug and --enable-maintainer-zts (zts stands for Zend Thread Safety). Additionally, we need to specify --prefix and --with-apxs2 to keep our build separate from anything already installed on the system.
All the other flags are there to make the resulting object code lighter, with fewer debugging symbols. You could strip much more from it; just read the help if you wish to:
# ../../src/php-5.2.9/configure --help|less
A leaner build means fewer distracting details that we shouldn't care about while learning the basics.

First, configure, compile and install PHP:
# cd /srv/_build/php
# ../../src/php-5.2.9/configure --prefix=/srv --with-apxs2=/srv/bin/apxs --enable-debug --enable-maintainer-zts --disable-cgi --enable-cli --without-pear --disable-xml --without-sqlite --without-mysql --disable-pdo --disable-libxml --disable-simplexml --disable-xmlreader --disable-xmlwriter --disable-dom --disable-spl
# make && make install
Just be patient, it may take a while. As a side note, since I'm waiting myself right now and have nothing better to do than write this article: do not rm -rf anything within /srv/_build. You may want to recompile later with different flags, and you could then reuse some of the already compiled object files (.o) to speed up the compilation.

Ok, PHP is now compiled, at least on my box, so we're going to integrate /srv/lib/libphp5.so into apache's /srv/conf/httpd.conf just as we'd integrate a "regular", non-debugging version.
At least on my system, the install step asks me to run:
# libtool --finish /srv/_build/php/libs

Now add the following line to /srv/conf/httpd.conf:

AddType application/x-httpd-php php

Note that the PHP build system has already added the line that loads the php5 module automatically by using apxs. If it didn't do it for you, then you will have to add that one too. Now create a new index.php file and add a phpinfo() call in it:
# vim /srv/htdocs/index.php
Restart httpd so that it picks up the changes, then open http://localhost/index.php. If you've set everything correctly, you should see the php info page :)

7. How to analyse something when you don't know where to start from
For example, how to see what exactly happens when there's a file like this
<?php
phpinfo();

which gets "called" by the user? Of course, you may think it's mission impossible, since the PHP and apache code bases are so huge you'd have no chance to inspect them entirely in a lifetime. You may be right, BUT the good news is you don't even need to.

The first way to do it is to load httpd in the debugger as you did before, set a breakpoint on apr_socket_accept, and then step into the code. Now you see, my friend, why we went through the hassle of debugging apache on its own first: if we hadn't done that, we wouldn't know where to set that breakpoint now.

I'll show you a demo :)

Just in case you've left a daemon running:
# killall httpd
Load into the debugger:
# gdb /srv/bin/httpd
(gdb) set args -X -f /srv/conf/httpd.conf
(gdb) break apr_socket_accept
(gdb) run
Visit http://localhost/index.php
You do remember why we're doing the following, right?
(gdb) continue
But the browser still seems to wait for something. If you look again at your debugging session, the execution flow of apache has run into apr_socket_accept again, and again, and again. To "get out of it", simply type continue at the gdb prompt a couple of times, until the phpinfo page is displayed.

We could continue this way, but it's frustrating: I couldn't get all the debugging symbols into apache without messing up my current installation, which I didn't want to do just for this one article. You may play around with options like --with-included-apr and the like.

However, I'll show you a much faster way to see the backtrace of what happens when a script containing a call to phpinfo() is requested.
The solution is to make assumptions based on good programming practices, at least concerning function naming in C. You already realized that both apache httpd and php are written in C, didn't you?
If you were the programmer, you'd most probably give the C function behind phpinfo(), which is exported into the runtime, a similar name. To search for it we can use the debugger's command "info functions <regexp>":
(gdb) info function phpinfo
All functions matching regular expression "phpinfo":

File /srv/src/php-5.2.9/ext/standard/info.c:
void register_phpinfo_constants(int, int, void ***);
void zif_phpinfo(int, zval *, zval **, zval *, int, void ***);

Great, so there are two functions in the php "executable" (in our scenario it's an apache module) that match that name, "register_phpinfo_constants" and "zif_phpinfo". From the function names alone we could exclude the former with almost 100% certainty, but we won't make assumptions and will set breakpoints on both of them:

(gdb) break register_phpinfo_constants
Breakpoint 2 at 0x7fd8672554a7: file /srv/src/php-5.2.9/ext/standard/info.c, line 990.
(gdb) break zif_phpinfo
Breakpoint 3 at 0x7fd867255748: file /srv/src/php-5.2.9/ext/standard/info.c, line 1013.

Great, gdb also tells us the memory address at which each breakpoint was placed, as well as the source file together with the line number. What more could we need to simply have a look at the concrete implementation? :)

Now
(gdb) continue
and request index.php again. We'll see which of the two functions gets hit, and then we'll know which one is the "phpinfo()" function in the runtime:

Breakpoint 3, zif_phpinfo (ht=0, return_value=0x1011f58, return_value_ptr=0x0, this_ptr=0x0, return_value_used=0,
    tsrm_ls=0xe98d50) at /srv/src/php-5.2.9/ext/standard/info.c:1013
1013        int argc = ZEND_NUM_ARGS();
Current language:  auto; currently c

Now you know :) Here's a little secret I don't expect you to know: the prefix zif stands for "zend internal function". gdb has stopped the process at the breakpoint, so you are already back at the (gdb) prompt. Do a backtrace:

(gdb) bt
#0  zif_phpinfo (ht=0, return_value=0x1011f58, return_value_ptr=0x0, this_ptr=0x0, return_value_used=0,
    tsrm_ls=0xe98d50) at /srv/src/php-5.2.9/ext/standard/info.c:1013
#1  0x00007fd86737a98a in zend_do_fcall_common_helper_SPEC (execute_data=0x7fff7151e040, tsrm_ls=0xe98d50)
    at /srv/src/php-5.2.9/Zend/zend_vm_execute.h:200
#2  0x00007fd867382f0b in ZEND_DO_FCALL_SPEC_CONST_HANDLER (execute_data=0x7fff7151e040, tsrm_ls=0xe98d50)
    at /srv/src/php-5.2.9/Zend/zend_vm_execute.h:1729
#3  0x00007fd86737a28d in execute (op_array=0x1011e10, tsrm_ls=0xe98d50)
    at /srv/src/php-5.2.9/Zend/zend_vm_execute.h:92
#4  0x00007fd867349d87 in zend_execute_scripts (type=8, tsrm_ls=0xe98d50, retval=0x0, file_count=3)
    at /srv/src/php-5.2.9/Zend/zend.c:1134
#5  0x00007fd8672c8f1f in php_execute_script (primary_file=0x7fff71520580, tsrm_ls=0xe98d50)
    at /srv/src/php-5.2.9/main/main.c:2023
#6  0x00007fd8673e4c4f in php_handler (r=0x103af78) at /srv/src/php-5.2.9/sapi/apache2handler/sapi_apache2.c:632
#7  0x0000000000441a6c in ap_run_handler ()
#8  0x0000000000442305 in ap_invoke_handler ()
#9  0x0000000000462c6c in ap_process_request ()
#10 0x000000000045fc78 in ap_process_http_connection ()
#11 0x000000000044ad92 in ap_run_process_connection ()
#12 0x000000000044b1c0 in ap_process_connection ()
#13 0x000000000047c754 in child_main ()
#14 0x000000000047c837 in make_child ()
#15 0x000000000047cdcc in ap_mpm_run ()
#16 0x0000000000428f36 in main ()

WOW, that's a lot to digest :)
First a little bit of theory. What you call PHP is actually made of two components (ok, ok, I'm lying, there are more): the Zend Engine and the rest of the functions, most of which get exported into the runtime.
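
To make "exported into the runtime" a bit more concrete, here is a small sketch of the naming trick behind symbols like zif_phpinfo. It is a simplified imitation and not PHP's real code: DEMO_FUNCTION, DEMO_FE, struct function_entry and call_by_name are names I made up, and real internal functions take the longer argument list you saw in the gdb output above. The idea is that a macro glues the zif_ prefix onto the C symbol, and a table maps the script-level name to that symbol.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for PHP_FUNCTION / PHP_FE, for illustration only. */
#define DEMO_FUNCTION(name) int zif_##name(int arg)
#define DEMO_FE(name)       { #name, zif_##name }

struct function_entry {
    const char *name;           /* name visible to scripts */
    int (*handler)(int arg);    /* C function that implements it */
};

/* These two lines define the C symbols zif_twice and zif_negate. */
DEMO_FUNCTION(twice)  { return arg * 2; }
DEMO_FUNCTION(negate) { return -arg; }

const struct function_entry demo_functions[] = {
    DEMO_FE(twice),
    DEMO_FE(negate),
    { NULL, NULL }              /* end marker */
};

/* Looks up a script-level name and calls the C function behind it.
 * Returns 0 and stores the result in *out, or -1 if the name is unknown. */
int call_by_name(const char *name, int arg, int *out)
{
    const struct function_entry *fe;

    for (fe = demo_functions; fe->name != NULL; fe++) {
        if (strcmp(fe->name, name) == 0) {
            *out = fe->handler(arg);
            return 0;
        }
    }
    return -1;
}
```

With this in place, a script-level call to "twice" ends up in the C function zif_twice, which is why a search for the script-level name in gdb also turns up the zif_ symbol.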

When you request a script, the Zend Engine first parses that file. I'm not going to go into the details here, but if you're interested feel free to buy yourself "The Dragon Book" by Aho, Sethi, Ullman & co. In short, "parsing" consists of two phases. In the first, the input is tokenized into indivisible entities. For example, in the PHP language, some tokens are "{", "-", ".", "function", "class" and any other keyword, that is, anything which cannot be divided further but which describes the language. The scanner (also called the lexer or tokenizer) feeds another component, the parser, with the tokens it reads. The parser's job is to put tokens into context. For example, the series of tokens

$array, [, 'message', ], =, 'hello world'

gives a simple statement. The "problem" here is that strings like 'message' or 'hello world' could mean multiple things. The parser's job is to deduce a string's meaning from its context: a string between [ and ] is the key for an array, while a string on the right side of the "=" operator is a constant string value. This is a rather complex process, so complex that people invented programs capable of generating the parser, including its so-called "parse tables", from grammar files that describe the syntax of the language. These programs are called parser generators.
If you're impatient and want to see a lexer and a parser generator in action, you could have a look at my blog entry How to create a math calculator in C.
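
If you want to see the tokenizing phase in isolation, here is a minimal sketch in C. This is not the Zend scanner; it's a hand-written toy that only knows variables, single-quoted strings and one-character punctuation. The names scan, struct token and the T_* constants are my own for this example.

```c
#include <ctype.h>
#include <stddef.h>

enum token_type { T_VARIABLE, T_STRING, T_PUNCT };

struct token {
    enum token_type type;
    const char *start;      /* points into the input, not NUL-terminated */
    size_t len;
};

/* Splits src into at most max tokens and returns how many were found,
 * or -1 on an unterminated string or when out is too small. */
int scan(const char *src, struct token *out, int max)
{
    const char *p = src;
    int n = 0;

    while (*p != '\0') {
        const char *start;
        enum token_type type;

        if (isspace((unsigned char)*p)) {   /* whitespace separates tokens */
            p++;
            continue;
        }
        if (n == max)
            return -1;

        start = p;
        if (*p == '$') {                    /* $name */
            p++;
            while (isalnum((unsigned char)*p) || *p == '_')
                p++;
            type = T_VARIABLE;
        } else if (*p == '\'') {            /* 'single quoted string' */
            p++;
            while (*p != '\0' && *p != '\'')
                p++;
            if (*p == '\0')
                return -1;                  /* closing quote is missing */
            p++;                            /* consume the closing quote */
            type = T_STRING;
        } else {                            /* anything else: one character */
            p++;
            type = T_PUNCT;
        }

        out[n].type = type;
        out[n].start = start;
        out[n].len = (size_t)(p - start);
        n++;
    }
    return n;
}
```

Feeding it the statement from above, scan("$array['message'] = 'hello world'", tokens, 8), yields the six tokens listed earlier. Note that the scanner has no idea that 'message' is an array key; that is decided in the next phase.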

As I said, it's a complex subject and I'll move on, leaving out many interesting details. Let's look at how PHP deals with the execution of scripts. After the Zend Engine has tokenized and parsed the input, it generates so-called OPCODES from it. These are similar to the CPU opcodes you may know from assembly language (like INC, MOV, CALL, RET, etc.), but at a higher level of abstraction and specific to PHP.

These opcodes are executed by another component, the Zend Virtual Machine (VM, as you can also see in file names like zend_vm_execute.h above), just like the binary code of an executable would be executed by the electronics of the CPU.
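
Here is a toy version of such a virtual machine in C, just to show the shape of the idea. It is nothing like the real Zend VM in size or design: the opcodes, struct op, struct vm and execute_ops are invented for this sketch, and I use a simple value stack for brevity. What it has in common with the backtrace above is that every opcode is implemented by its own C handler function, and a loop calls one handler after another.

```c
#include <stddef.h>

#define STACK_SIZE 16

struct vm {
    long stack[STACK_SIZE];
    size_t sp;                  /* number of values on the stack */
};

struct op;
typedef int (*op_handler)(struct vm *vm, const struct op *op);

struct op {
    op_handler handler;         /* C function that implements this opcode */
    long operand;               /* only used by handle_push */
};

/* Handlers return 0 to continue, 1 to stop, -1 on error. */
int handle_push(struct vm *vm, const struct op *op)
{
    if (vm->sp == STACK_SIZE)
        return -1;
    vm->stack[vm->sp++] = op->operand;
    return 0;
}

int handle_add(struct vm *vm, const struct op *op)
{
    (void)op;
    if (vm->sp < 2)
        return -1;
    vm->sp--;                                   /* pop the right operand */
    vm->stack[vm->sp - 1] += vm->stack[vm->sp]; /* add it to the left one */
    return 0;
}

int handle_return(struct vm *vm, const struct op *op)
{
    (void)vm;
    (void)op;
    return 1;
}

/* Calls one handler after another until one of them stops the run.
 * Stores the top of the stack in *result and returns 0, or -1 on error. */
int execute_ops(const struct op *ops, size_t count, long *result)
{
    struct vm vm = { {0}, 0 };
    size_t i;

    for (i = 0; i < count; i++) {
        int rc = ops[i].handler(&vm, &ops[i]);
        if (rc < 0)
            return -1;
        if (rc > 0) {
            if (vm.sp == 0)
                return -1;
            *result = vm.stack[vm.sp - 1];
            return 0;
        }
    }
    return -1;                  /* ran off the end without a return opcode */
}
```

A "program" is then just an array of struct op, for example push 2, push 40, add, return. Setting a breakpoint on handle_add in such a toy gives you a miniature version of the execute / handler frames you saw in the PHP backtrace.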

Just as a side note, these opcodes are cached by opcode caches like APC, which avoids repeated tokenization and compilation into VM opcodes. As you know, the basic difference between interpreted and compiled languages is that in interpreted languages the compilation occurs at each execution of the source code. Of course, saying "executing a script" is a colloquial expression. What is actually executed is the php binary or, in our scenario, the apache daemon, in which case PHP is only a module integrated into it.

A few more notes:
1. PHP can be seen as a monolithic block of functions. Most of the functions are provided by extensions bundled together with PHP (for example the gd library or tidy, and so on) and exported to the runtime. You can open up just about any .c source file of PHP and look for PHP_FUNCTION, and you'll find the functions which are exported.
2. This monolithic block of functions is called by external processes through so-called SAPIs (Server Application Programming Interfaces). For example, when you call php from the command line, you're using the CLI SAPI (you can find its source in "/srv/src/php-5.2.9/sapi/cli"). Similarly, the apache process communicates with the php module through the apache2handler SAPI.
3. Feel free to ask, since I'm pretty bored right now of so much writing. But ask smartly, or I will ignore your comment :)
4. Try to look at the source code for a few days before asking. Use fgrep to find functions in the source code, set breakpoints on them, use backtrace. If you need a deeper look at how a function works, set a breakpoint on the function above it in the backtrace (or have a look at the source code, and set a breakpoint on "file.c:<line>"), kill the process and run it again.
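
Note 2 can be sketched in a few lines of C. Again, this is a hypothetical, heavily simplified model and not PHP's real SAPI structure: struct demo_sapi, engine_run, cli_write, server_write and struct response are names I invented. The point is only that the engine talks to a struct of callbacks, so the same engine code can print to a terminal or fill a web server's response.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical SAPI: all the "engine" knows about the outside world. */
struct demo_sapi {
    const char *name;
    /* Called by the engine for every chunk of script output. */
    size_t (*write)(void *ctx, const char *data, size_t len);
    void *ctx;                  /* private data of the front end */
};

/* The "engine": produces output without knowing where it ends up. */
size_t engine_run(const struct demo_sapi *sapi)
{
    const char *body = "hello from the engine\n";
    return sapi->write(sapi->ctx, body, strlen(body));
}

/* A CLI-like front end writes to a stdio stream ... */
size_t cli_write(void *ctx, const char *data, size_t len)
{
    return fwrite(data, 1, len, (FILE *)ctx);
}

/* ... while a server-like front end appends to a response buffer. */
struct response {
    char buf[128];
    size_t used;
};

size_t server_write(void *ctx, const char *data, size_t len)
{
    struct response *r = ctx;

    if (len > sizeof(r->buf) - r->used)
        len = sizeof(r->buf) - r->used;     /* truncate, never overflow */
    memcpy(r->buf + r->used, data, len);
    r->used += len;
    return len;
}
```

To try it, fill in one struct demo_sapi with cli_write and stdout and another with server_write and a zeroed struct response, then pass each to engine_run(). The engine code stays the same in both cases, which is the whole trick behind running one PHP under the CLI and under apache.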

Happy hacking! :)

3 comments:

  1. thanks for taking you're time to explain this

  2. but the archive from:
    http://mirror.deri.at/apache/httpd/httpd-2.2.11.tar.gz
    is corrupted,

    http://apache.mirrors.evolva.ro/httpd/httpd-2.2.11.tar.gz worked

  3. You must use whichever mirror fits your geographical location, that's all. It worked for me, but maybe there was a transfer problem for you.
