Tuesday, 21 April 2009

How to hack your own single-instance application, for dummies

This material is intended for educational purposes ONLY. I do not assume any responsibility for wanted or unwanted damage done by the reader with the knowledge gained from this article.

Today I'm going to show you how to solve the most basic programming problem[1]: hacking a single-instance application. But what is a single-instance application? Well, it's an application of which only one instance can run at a time.

And since it would be illegal to hack Yahoo! Messenger (R), I'll create my own single-instance app which I'll hack afterwards :)

First, the code of the app:

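#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

int main(void) {
    /* Named mutex: every instance of the program asks for the same name */
    HANDLE hmutex = CreateMutex(NULL, FALSE, "some unique id");
    if (hmutex && GetLastError() == ERROR_ALREADY_EXISTS) {
        CloseHandle(hmutex); /* we never owned the mutex, so just drop the handle */
        printf("the program is already running\n");
        return EXIT_FAILURE;
    }
    printf("program's first instance\n");
    getchar(); /* keep the first instance alive until Enter is pressed */
    return EXIT_SUCCESS;
}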

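According to the MSDN, if the mutex already exists CreateMutex still returns a valid handle (to the existing mutex) and GetLastError reports ERROR_ALREADY_EXISTS. NULL is returned only when the call fails.

So basically, all we need to do is locate the call to CreateMutex and replace it with mov eax,0, since on 32-bit x86 the return value of such a function is passed back in EAX. With EAX set to 0, hmutex is NULL, the first half of the if condition is false and the "already running" branch is never taken.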

Follow these steps in OllyDbg:

1. Load the executable.
2. Select the binary code responsible for the call and fill it with NOPs (NOP is the x86 no-operation instruction).

3. Double click the first NOP and replace it with mov eax,0, the instruction that will make the following code think that CreateMutex was called and returned 0 in EAX, then click assemble. The remaining NOPs are ok, they keep the addresses of the instructions that follow unchanged.

4. At this point, you could run your patched process. Please note, I said process. Exactly, the changes you've made are now only in RAM, but you want them to be permanent in the program. So, right click and copy all modifications to executable, then choose "copy all" when prompted.

5. A new window will appear. This is the binary code of the patched executable. Simply save it to disk.

6. Voilà! Now you may start as many instances of the patched executable as you like.

I hope you've enjoyed it


[1] According to myself, it's not the "Hello, World!" program any more :-))

Wednesday, 25 March 2009

Under the Hood: the Apache 2.2 / PHP 5 Request lifecycle

This article describes what happens when a client initiates a new HTTP request for a .php resource. In order to follow it, you need some C programming knowledge and a GNU/Linux box.

I'm going to do it with hands-on examples. Although I'll describe Apache and PHP specifics, the article should be interesting for intermediate C programmers who want to learn techniques for inspecting complex source code written by others and finding out how it works, techniques which work for practically any open source project out there.

Be careful, I build, compile and install everything as root because I know what I'm doing. If you haven't mastered the CLI, then don't enter any of these commands. That being said, I don't assume any responsibility for any damage you may do to your system by following my instructions.

The first step is to compile all the programs involved with debugging information. Most software out there accepts such settings as options to its ./configure script. Also note that I'm going to do all the work in a separate directory named /srv, so I don't pollute my system.

Every line below written with constant-width fonts and beginning with # is a marker for whatever commands you'll need to enter in your shell. The lines starting with (gdb) denote input for the debugger.

1. Laying out the directory structure:
# cd /
# mkdir -p /srv/_build/{apache,php}
# mkdir /srv/src

In /srv/src I'm going to download all the source code I need, but I'll compile it in /srv/_build/, in order to keep the source tree clean. Later you'll see the motivation for this.

2. Download all the source archives.
# cd /srv/src

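You may want to go to each project's official homepage and pick the mirror closest to your geographical location. The -O option on the second wget makes sure the PHP archive is saved under the name that tar expects below:
# wget -c http://mirror.deri.at/apache/httpd/httpd-2.2.11.tar.gz
# wget -c -O php-5.2.9.tar.bz2 http://at2.php.net/get/php-5.2.9.tar.bz2/from/this/mirror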
# tar xfz httpd-2.2.11.tar.gz
# tar xfj php-5.2.9.tar.bz2

3. Building apache for hackers
# cd /srv/_build/apache
The PHP manual (http://php.net/manual/en/install.unix.apache2.php) mentions only the --enable-so flag. Additionally, a search through
# ../../src/httpd-2.2.11/configure --help|less
for "debug" leads us to --enable-maintainer-mode. Also don't forget we want --prefix=/srv and --with-mpm=prefork, so while in /srv/_build/apache type in:
# ../../src/httpd-2.2.11/configure --prefix=/srv --enable-maintainer-mode --enable-so --with-mpm=prefork
Please read the docs for each option. It is quite important to understand what an MPM (multi-processing module) is, for instance.
# make && make install
4. Testing whether apache works or not
The command
# ls /srv/bin/
should show us apachectl and httpd, among others, so:
# /srv/bin/apachectl start
If you see something like this:

(98)Address already in use: make_sock: could not bind to address [::]:80
(98)Address already in use: make_sock: could not bind to address
no listening sockets available, shutting down
Unable to open logs

Then it means the "clean" httpd of your distribution is already listening on port 80, so stop that one first by calling the apachectl on your path (find out the concrete path with the command "which apachectl"):
# apachectl stop
Note: apachectl and /srv/bin/apachectl are not the same. The former controls the "original" httpd process, whereas the latter controls the one in /srv/bin which we've just compiled with debugging info.

If nothing listens on port 80 now, which you can find out by using
# netstat -tlnp
you can finally start the server now:
# /srv/bin/apachectl start
Now point your browser to http://localhost, just to make sure everything runs smoothly

5. Getting a taste for debugging
On this one I'm going to cheat a little, because I'm a noob and all I know right now is what I'm trying to do: debugging apache.

So after firing up the browser and looking on the web for apache debugging, I came across http://httpd.apache.org/dev/debugging.html which shows me how to start and debug httpd.
So first stop httpd
# /srv/bin/apachectl stop
Then load the program in the debugger
# gdb /srv/bin/httpd
At the gdb prompt, set the arguments properly, since we don't want to mess with the local configuration from /etc/ or similar (just in case we already had apache installed on our system before following the instructions in this article):
(gdb) set args -X -f /srv/conf/httpd.conf
Then run the process:
(gdb) run
Opening http://localhost should show the classic "It works!" message. The only difference is that now we can inspect the process live, while it's running.
But because we've started httpd with the -X parameter, the daemon stays in the foreground and keeps the terminal busy, so press CTRL+C to get back to the gdb prompt:
(gdb) info threads
This should confirm that apache is running one single thread (that's what -X does: one worker, not detached from the console. You did read the docs, didn't you?):
  1 Thread 0x7f2fe9d6d740 (LWP 16856)  0x00007f2fe82a29f0 in __accept_nocancel () from /lib/libpthread.so.0
This is important at least for our purpose, since we don't want to mess with multithreading, only understand the execution flow.
To really see what "happened" so far, print out a backtrace:
(gdb) bt
Shows something like this:

#0  0x00007f883f0fc9f0 in __accept_nocancel () from /lib/libpthread.so.0
#1  0x00007f883f96d362 in apr_socket_accept () from /usr/lib/libapr-1.so.0
#2  0x000000000047e95f in unixd_accept ()
#3  0x000000000047c6d1 in child_main ()
#4  0x000000000047c837 in make_child ()
#5  0x000000000047cdcc in ap_mpm_run ()
#6  0x0000000000428f36 in main ()

The only function here you should already know about is main(). The innermost frame (#0, the function currently executing), __accept_nocancel, looks interesting, but it's part of libpthread, and since I'm a noob I don't want to mess with threads yet, so I'll just ignore it. It's prefixed with __, which by common C convention means that the function is internal to the library and we shouldn't care about it, unless we want to have a closer look at the pthread library itself.
I do know however, that APR stands for "Apache Portable Runtime", and I'd like to have a closer look at it, so I set a breakpoint on it:
(gdb) break apr_socket_accept
then kill the process and restart it
(gdb) kill
(gdb) run
*Note: gdb can auto-complete even function names, so after you've typed in "break apr_socket_" you can press TAB to see a list of functions matching that name. Cool, huh? :)
Now go again to http://localhost and see what happens. Right, since the Linux kernel sees a request on port 80, it "knows" that it should be routed to the httpd process, so you don't get a "connection timeout" or a similar error message from the browser. BUT the browser still waits for data, because in the gdb debugger, the process is suspended.

So at the (gdb) prompt, continue the process. You should position both windows, the browser and the terminal with the debugging session such that you see what happens in the browser window when you type this:
(gdb) continue
You'd see something similar to this:

Breakpoint 1, 0x00007f2478ac1330 in apr_socket_accept () from /usr/lib/libapr-1.so.0

Exactly, that means the httpd process served the request and came back to the breakpoint we've set on apr_socket_accept(). Now we can create a new request, just for the fun of it :)

You may ask yourself why I showed you that. Well, first of all I wanted to teach you some basic debugging techniques, and also give you a starting point for the steps you'll have to follow when you're going to debug the PHP runtime.

So don't worry, just recap the concepts of debugging, program, process, request, thread, MPM (multi-processing module) you've learned so far, 'coz they're damn interesting stuff! :)

6. Compiling PHP for hackers
Wow, you've got that far? Congratulations!
Basically, all we need are the configure flags --enable-debug and --enable-maintainer-zts (zts stands for zend thread safety). Additionally, we need to specify --prefix and --with-apxs2 just to keep our build separate from the already existing system.
All the other flags are there to make the resulting object code lighter, with fewer debugging symbols, and you could strip much more from it, just read the help if you wish to:
# ../../src/php-5.2.9/configure --help|less
This should leave us with fewer distracting details that we shouldn't care about while learning the basics.

First, configure, compile and install PHP:
# cd /srv/_build/php
# ../../src/php-5.2.9/configure --prefix=/srv --with-apxs2=/srv/bin/apxs --enable-debug --enable-maintainer-zts --disable-cgi --enable-cli --without-pear --disable-xml --without-sqlite --without-mysql --disable-pdo --disable-libxml --disable-simplexml --disable-xmlreader --disable-xmlwriter --disable-dom --disable-spl
# make && make install
Just be patient, it may take a while. Just as a note, since right now I'm waiting myself and I've got nothing better to do than write this article: do not rm -rf anything within /srv/_build, since you may want to recompile later on with different flags, and make can reuse some of the already compiled object files (.o) to speed up the compilation.

Ok, PHP is now compiled, at least on my box, so we're going to integrate /srv/lib/libphp5.so into apache's /srv/conf/httpd.conf just as we'd integrate a "regular", non-debugging version.
At least on my system, the build asks me to run:
# libtool --finish /srv/_build/php/libs

Now add the following line to /srv/conf/httpd.conf:

AddType application/x-httpd-php php

Note that the php5 module has been automatically added by the PHP build system by using apxs. If it didn't do it for you, then you will have to add that one too. Now create a new index.php file and add a phpinfo() call in it:
# vim /srv/htdocs/index.php
If you've set everything correctly, you should see the php info page :)

7. How to analyse something when you don't know where to start from
For example, how to see what exactly happens when there's a file like this

<?php phpinfo(); ?>

which gets "called" by the user? Of course, you may think it's an impossible mission, since the PHP and apache code bases are so huge you'd have no chance to inspect them entirely in a lifetime. You may be right, BUT the good news is you don't even need to.

The first way to do it is to load httpd in the debugger as you did before, set a breakpoint on apr_socket_accept, and then step into the code. Now you see, my friend, why the entire hassle of debugging only apache itself first? If we didn't do that, we couldn't know now where to set that breakpoint.

I'll show you a demo :)

Just in case you've left a daemon running:
# killall httpd
Load into the debugger:
# gdb /srv/bin/httpd
(gdb) set args -X -f /srv/conf/httpd.conf
(gdb) break apr_socket_accept
(gdb) run
Visit http://localhost/index.php
You do remember why we're doing the following, right?
(gdb) continue
But the browser still seems to wait for something. If you look again at your debugging session, the execution flow of apache has run into apr_socket_accept again, and again, a couple of times. To "get out of it", simply type continue at the gdb prompt a couple of times, until the phpinfo page is displayed.

We could continue this way, but it's frustrating: I couldn't get all the debugging symbols into apache without messing up my current installation, which I didn't want to do just for this one article. You may play around with options like --with-included-apr and the like.

However, I'll show you a much faster approach to really see the backtrace of what happens when a script with a call to phpinfo() in it is called.
The solution is to make assumptions based on good programming practices, at least concerning function naming in C. You already realized that both apache httpd and php are written in C, didn't you?
If you were the programmer, you'd most probably give the C function that implements phpinfo() a similar name. To search for it we can use the debugger's command "info functions <regexp>":
(gdb) info function phpinfo
All functions matching regular expression "phpinfo":

File /srv/src/php-5.2.9/ext/standard/info.c:
void register_phpinfo_constants(int, int, void ***);
void zif_phpinfo(int, zval *, zval **, zval *, int, void ***);

Great, so there are two functions in the php "executable" (in our scenario it's an apache module) that match that name, "register_phpinfo_constants" and "zif_phpinfo". From what you could guess out of the function names we can exclude the former with almost 100% certainty, but we'll not rely on assumptions and set breakpoints on both of them:

(gdb) break register_phpinfo_constants
Breakpoint 2 at 0x7fd8672554a7: file /srv/src/php-5.2.9/ext/standard/info.c, line 990.
(gdb) break zif_phpinfo
Breakpoint 3 at 0x7fd867255748: file /srv/src/php-5.2.9/ext/standard/info.c, line 1013.

Great, gdb also tells us the memory address at which those functions start, as well as the source code file, together with the line number. What more could we need to simply have a look at the concrete implementation? :)

(gdb) continue
and request index.php again. We'll see which of the two functions gets hit, and then we'll know which one is the "phpinfo()" function in the runtime:

Breakpoint 3, zif_phpinfo (ht=0, return_value=0x1011f58, return_value_ptr=0x0, this_ptr=0x0, return_value_used=0,
    tsrm_ls=0xe98d50) at /srv/src/php-5.2.9/ext/standard/info.c:1013
1013        int argc = ZEND_NUM_ARGS();
Current language:  auto; currently c

Now you know :) Just a little secret I don't expect you to know: the prefix zif stands for zend internal function. Since the breakpoint has already dropped us back at the gdb prompt, do a backtrace:

(gdb) bt
#0  zif_phpinfo (ht=0, return_value=0x1011f58, return_value_ptr=0x0, this_ptr=0x0, return_value_used=0,
    tsrm_ls=0xe98d50) at /srv/src/php-5.2.9/ext/standard/info.c:1013
#1  0x00007fd86737a98a in zend_do_fcall_common_helper_SPEC (execute_data=0x7fff7151e040, tsrm_ls=0xe98d50)
    at /srv/src/php-5.2.9/Zend/zend_vm_execute.h:200
#2  0x00007fd867382f0b in ZEND_DO_FCALL_SPEC_CONST_HANDLER (execute_data=0x7fff7151e040, tsrm_ls=0xe98d50)
    at /srv/src/php-5.2.9/Zend/zend_vm_execute.h:1729
#3  0x00007fd86737a28d in execute (op_array=0x1011e10, tsrm_ls=0xe98d50)
    at /srv/src/php-5.2.9/Zend/zend_vm_execute.h:92
#4  0x00007fd867349d87 in zend_execute_scripts (type=8, tsrm_ls=0xe98d50, retval=0x0, file_count=3)
    at /srv/src/php-5.2.9/Zend/zend.c:1134
#5  0x00007fd8672c8f1f in php_execute_script (primary_file=0x7fff71520580, tsrm_ls=0xe98d50)
    at /srv/src/php-5.2.9/main/main.c:2023
#6  0x00007fd8673e4c4f in php_handler (r=0x103af78) at /srv/src/php-5.2.9/sapi/apache2handler/sapi_apache2.c:632
#7  0x0000000000441a6c in ap_run_handler ()
#8  0x0000000000442305 in ap_invoke_handler ()
#9  0x0000000000462c6c in ap_process_request ()
#10 0x000000000045fc78 in ap_process_http_connection ()
#11 0x000000000044ad92 in ap_run_process_connection ()
#12 0x000000000044b1c0 in ap_process_connection ()
#13 0x000000000047c754 in child_main ()
#14 0x000000000047c837 in make_child ()
#15 0x000000000047cdcc in ap_mpm_run ()
#16 0x0000000000428f36 in main ()

WOW, that's a lot of things to digest :)
First a little bit of theory. What you call PHP is actually made of two components (ok, ok, I'm lying, there are more, actually): the zend engine and the rest of the functions, most of which get exported into the runtime.

When you request a script, the zend engine first parses that file. I'm not going to go into the details here, but if you're interested feel free to buy yourself "The Dragon Book" by Aho, Sethi and Ullman. In short, "parsing" consists of two phases. In the first one the input is tokenized into indivisible entities. For example, in the PHP language, some tokens are "{", "-", ".", "function", "class" and any other keyword - that is, anything which cannot be further divided but which describes the language. The scanner (also known as the lexer or tokenizer) feeds another component, the parser, with the tokens it reads. The parser's job is to put tokens into context. For example, the series of tokens

$array, [, 'message', ], =, 'hello world'

gives a simple statement. The "problem" here is that strings like 'message' or 'hello world' could mean multiple things. The parser's job is to deduce a string's meaning from its context: a string between [ and ] is the key for an array, while a string on the right side of the "=" operator is a constant string value. This is a rather complex process, so complex that people invented programs capable of generating the parsing code and its tables from grammar files that describe the syntax of the language. These programs are called parser generators.
If you're impatient and want to see a lexer and a parser generator in action, you could have a look at my blog entry How to create a math calculator in C.
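
You can also watch the scanner at work from PHP itself. Here is a minimal sketch, assuming the tokenizer extension is available (it usually is); token_get_all() and token_name() are real PHP functions, while describe_tokens() is just a helper name I made up for this example:

```php
<?php
// Lists the tokens the scanner produces for a piece of PHP source.
// describe_tokens() is a made-up helper, not part of PHP.
function describe_tokens($source) {
    $out = array();
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            // Named tokens arrive as array(token id, text, line number)
            if ($token[0] === T_WHITESPACE) {
                continue; // whitespace carries no meaning here, skip it
            }
            $out[] = token_name($token[0]) . ' ' . trim($token[1]);
        } else {
            // One-character tokens such as [ ] = ; arrive as plain strings
            $out[] = $token;
        }
    }
    return $out;
}

$source = '<?php $array[\'message\'] = \'hello world\';';
foreach (describe_tokens($source) as $line) {
    echo $line, PHP_EOL;
}
```

Both 'message' and 'hello world' are reported as the very same token type, T_CONSTANT_ENCAPSED_STRING. The scanner cannot tell an array key from a value; that only becomes clear one step later, once the tokens are seen in context.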

As I said, it's a complex subject and I'll move on leaving out many interesting details. Let's look at how PHP deals with the execution of scripts. After the zend engine has tokenized and parsed the input, it generates so-called OPCODES from it. These are similar to the CPU instructions you may know from assembly language (like INC, MOV, CALL, RET, etc), but at a higher level of abstraction and specific to PHP.

These opcodes are executed by another component, the Zend Virtual Machine (VM, as you can also see in file names like zend_vm_execute.h above), just like the binary code of an executable would be executed by the electronics of the CPU.
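
To get a feeling for what "executing opcodes" means, here is a toy sketch of such a machine. Careful: the opcode names PUSH, ADD and ECHO and the function run_toy_vm() are invented for this example. The real Zend VM has its own opcodes and works on much richer data structures, so take this only as an illustration of the dispatch loop idea:

```php
<?php
// A toy stack machine: each instruction is array(opcode, operand).
// The opcodes are made up for illustration, they are NOT Zend's.
function run_toy_vm(array $program) {
    $stack = array();
    $output = '';
    foreach ($program as $instruction) {
        list($opcode, $operand) = $instruction;
        switch ($opcode) {
            case 'PUSH': // put a constant on the stack
                $stack[] = $operand;
                break;
            case 'ADD':  // replace the two topmost values with their sum
                $b = array_pop($stack);
                $a = array_pop($stack);
                $stack[] = $a + $b;
                break;
            case 'ECHO': // append the topmost value to the output
                $output .= array_pop($stack);
                break;
            default:
                throw new Exception('unknown opcode ' . $opcode);
        }
    }
    return $output;
}

// Roughly what "echo 1 + 2;" could look like after compilation
$program = array(
    array('PUSH', 1),
    array('PUSH', 2),
    array('ADD', null),
    array('ECHO', null),
);
echo run_toy_vm($program), PHP_EOL;
```

The loop with the big switch is the interesting part: fetch an instruction, jump to its handler, repeat. The handler names you saw in the backtrace (like ZEND_DO_FCALL_SPEC_CONST_HANDLER) play the role of the case branches here.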

Just as a side note, these opcodes are cached by opcode caches like APC, thus avoiding repeated tokenization and compilation into VM opcodes. As you know, the basic difference between interpreted and compiled languages is that in interpreted languages the translation occurs at each execution of the source code. Of course, "executing a script" is a colloquial expression; what is actually executed is the php binary or, in our scenario, the apache daemon, in which case PHP is only a module integrated into it.

A few more notes:
1. PHP can be seen as a monolithic block of functions. Most of the functions are provided by extensions bundled together with PHP (for example the gd library or tidy, and so on) and exported to the runtime. You can open up just about any .c source file of PHP and look for PHP_FUNCTION, and you'll find the functions which are exported.
2. This monolithic block of functions is called by external processes through so-called SAPIs (Server Application Programming Interfaces). For example, when you call php from the command line, you're using the CLI SAPI (you can find its source in "/srv/src/php-5.2.9/sapi/cli"). Similarly, the apache process communicates with the php module through the apache2handler SAPI.
3. Feel free to ask, since I'm pretty bored right now of so much writing. But ask smartly, or I will ignore your comment :)
4. Try to look at the source code for a few days before asking. Use fgrep to find functions in the source code, set breakpoints on them, use backtrace. If you need a deeper look at how a function works, set a breakpoint on the function above it in the backtrace (or have a look at the source code, and set a breakpoint on "file.c:<line>"), kill the process and run it again.
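
Notes 1 and 2 can be observed from inside a script, too. A small sketch: php_sapi_name() and get_extension_funcs() are real PHP functions, while runtime_summary() is only a name I picked for this example:

```php
<?php
// Reports which SAPI is running this script and whether phpinfo()
// is among the functions exported by the "standard" extension.
function runtime_summary() {
    $standard = get_extension_funcs('standard');
    return array(
        'sapi' => php_sapi_name(),
        'phpinfo_in_standard' => is_array($standard) && in_array('phpinfo', $standard),
        'standard_function_count' => is_array($standard) ? count($standard) : 0,
    );
}

print_r(runtime_summary());
```

Run it once from the command line and once through the browser: the 'sapi' entry should change from cli to apache2handler, while the exported functions stay the same.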

Happy hacking! :)

Monday, 16 March 2009

How to decouple business logic from view logic and gain flexibility

A user asks: How can I improve the following code?

<div id="search_controls">
    <ul>
<?php
$entities_tab = ( $search_func != 'entities') ? "<a href=\"index.php?tab=entities$url_sm\">Entities</a>" : "Entities";
$entities_tab_class = ( $search_func != 'entities') ? "inactive_tab" : "active_tab";
?>
        <li class="<?php echo $entities_tab_class; ?>"><?php echo $entities_tab; ?></li>
<?php
$products_tab = ( $search_func != 'products') ? "<a href=\"index.php?tab=products$url_sm\">Products</a>" : "Products";
$products_tab_class = ( $search_func != 'products') ? "inactive_tab" : "active_tab";
?>
        <li class="<?php echo $products_tab_class; ?>"><?php echo $products_tab; ?></li>
<?php
$events_tab = ( $search_func != 'events') ? "<a href=\"index.php?tab=events$url_sm\">Events</a>" : "Events";
$events_tab_class = ( $search_func != 'events') ? "inactive_tab" : "active_tab";
?>
        <li class="<?php echo $events_tab_class; ?>"><?php echo $events_tab; ?></li>
    </ul>
<?php
$url_sm = ( $param_sm == "b" or $param_sm == NULL ) ? "&sm=a" : "&sm=b" ;
$sm_link_txt = ( $param_sm == "b" or $param_sm == NULL ) ? "Advance Search" : "Basic Search" ;
?>
    <a id="link_search_type" href="index.php<?php echo $url_tab.$url_sm.$url_sq; ?>"><?php echo $sm_link_txt ?></a>
</div>


A: separate the business logic from the view logic. The business logic should be restricted to working on data structures.
In the example above, you're generating some "search tabs" which look quite similar. So instead of mixing the decision code
(deciding what html each $*_tab should contain: $entities_tab, $products_tab, $events_tab) with the markup, you should gather
all of these in one place, since they are all the same thing: tabs. For this, you could use classes or arrays.

For the sake of simplicity, I'll use arrays.

So first ask yourself: what does every tab have? It has a label and an internal name. For example, the entities tab above
has the label "Entities" and the internal name "entities", which is used for URL generation. You could also consider that a tab
has content, but in our minimal example all the tabs get their content based on the other two properties mentioned above.

So our $search_controls would look like this:
$search_controls = array(
    array(
        'name' => 'entities',
        'label' => 'Entities'
    ),
    array(
        'name' => 'products',
        'label' => 'Products'
    ),
    array(
        'name' => 'events',
        'label' => 'Events'
    )
);

Now we also have the "search control" itself: a <div> embedding the search controls and a link for the "search type".
Besides, a search control also has an "active tab": the name of an existing tab which should get special visual treatment. So let's make up a $search_control containing this meta data (here "meta data" means additional data
which describes the data structure itself). So we're going to come up with something like this:

$search_control = array(
    'active' => 'entities',
    'id' => 'search_controls',
    'controls' => array(
        array(
            'name' => 'entities',
            'label' => 'Entities'
        ),
        array(
            'name' => 'products',
            'label' => 'Products'
        ),
        array(
            'name' => 'events',
            'label' => 'Events'
        )
    )
);

Now this looks nice. We have invented an internal representation for the data that's going to be rendered.
But why would you want to do this, after all? The answer is that it's much more flexible. If sometime in the future
you decide that you want to save the "search tabs" in your database or, to push it even further, if you
decide that every user can define their own favourite "search tabs", you will be able to pull this data from the database
without even touching the generation of the html code!
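
To make that a bit more concrete, here is a rough sketch of how rows coming from a database could be turned into the same structure. The $rows array only stands in for a query result, and build_search_control() as well as the column names 'name' and 'label' are assumptions made for this example:

```php
<?php
// Builds the $search_control structure from database-like rows.
// $rows is a stand-in for a query result; no database is involved here.
function build_search_control(array $rows, $active) {
    $controls = array();
    foreach ($rows as $row) {
        $controls[] = array(
            'name'  => $row['name'],
            'label' => $row['label'],
        );
    }
    return array(
        'active'   => $active,
        'id'       => 'search_controls',
        'controls' => $controls,
    );
}

$rows = array(
    array('name' => 'entities', 'label' => 'Entities'),
    array('name' => 'products', 'label' => 'Products'),
);
$search_control = build_search_control($rows, 'entities');
print_r($search_control);
```

Whatever renders the tabs only ever sees the resulting array, so it doesn't matter whether the tabs were typed in by hand or loaded per user.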

Besides, you could use the same algorithm to define new types of "controls" like the "search control",
which means less code, which in turn means less hassle and less room for bugs!

Code reuse becomes even more flexible if you store the internal representation not in multi-dimensional
arrays as above, but in classes.

Now enough talking, here's what a function for rendering the "search control" (or any such "control", for that matter)
could look like:

function render_control($controls) {
    $r = '<div id="'.$controls['id'].'"><ul>';
    foreach($controls['controls'] as $control) {
        if($controls['active'] === $control['name']) {
            $class = 'active_tab';
            $innerHtml = $control['label'];
        } else {
            $class = 'inactive_tab';
            $innerHtml = '<a href="index.php?tab='.$control['name'].'">'.$control['label'].'</a>';
        }
        $r .= '<li class="'.$class.'">'.$innerHtml.'</li>';
    }
    $r .= '</ul></div>';
    return $r;
}

As you can see, I did leave out some things, like the "$url_sm", since I don't know how and where that variable comes from.
However, if you need more meta data in your control, simply invent a new parameter to the function or a new member in $controls, like
$controls['id']. You could also add a 'title' for the <div> box itself, or more parameters which allow further customization
of the html tags used instead of the standard '<ul>', '<li>', '<a>' and so on.

The sky is the limit!

The lesson is: a smart programmer not only writes code, but writes it in a reusable and flexible way. Now you can throw data at
render_control() without writing a single piece of html; it will simply generate it.

Isn't that a relief? Be smart, write less, code more!
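
I mentioned that classes would be even more flexible than arrays. Here is one possible sketch of that idea. The class names Tab and TabControl are made up, and the escaping with htmlspecialchars() and urlencode() is my own addition on top of the array version:

```php
<?php
// One tab: an internal name (used in URLs) and a label (shown to the user).
class Tab {
    public $name;
    public $label;

    public function __construct($name, $label) {
        $this->name = $name;
        $this->label = $label;
    }
}

// A control holding several tabs, one of which is active.
class TabControl {
    private $id;
    private $active;
    private $tabs = array();

    public function __construct($id, $active) {
        $this->id = $id;
        $this->active = $active;
    }

    public function addTab(Tab $tab) {
        $this->tabs[] = $tab;
        return $this; // allows chaining addTab() calls
    }

    public function render() {
        $html = '<div id="' . htmlspecialchars($this->id) . '"><ul>';
        foreach ($this->tabs as $tab) {
            $label = htmlspecialchars($tab->label);
            if ($tab->name === $this->active) {
                $html .= '<li class="active_tab">' . $label . '</li>';
            } else {
                $href = 'index.php?tab=' . urlencode($tab->name);
                $html .= '<li class="inactive_tab"><a href="' . $href . '">' . $label . '</a></li>';
            }
        }
        return $html . '</ul></div>';
    }
}

$control = new TabControl('search_controls', 'entities');
$control->addTab(new Tab('entities', 'Entities'))
        ->addTab(new Tab('products', 'Products'));
echo $control->render(), PHP_EOL;
```

A different kind of control could now extend TabControl and override render() only, while the data handling stays where it is.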

Thursday, 5 March 2009

Small PHP optimizations anyone can (and should) do

As a forum moderator in the PHP area of the softpedia forum (Romanian), I've worked together with the community over time on putting together a list of do's and don'ts.

Recently I've run across this post by ABU NAWIM MOHAMMAD SAIFUL ISLAM from Bangladesh, which is far more complete.

Feel free to follow his advice, it's worth taking into account.

Checking whether an entry in a zip archive is a file or a directory with PHP

When processing .zip files with PHP, a common problem is differentiating between files and directories inside the archive. A simple piece of code like this shows what the zip extension returns for each entry:

$zip = zip_open('foo.zip');
while($entry = zip_read($zip)) {
    $entry_name = zip_entry_name($entry);
    echo 'name: ',$entry_name,PHP_EOL;
}

After having a closer look at the output, it becomes obvious that directory entries end in a trailing slash, and as such, we'll obviously do something like this:

<?php
$zip = zip_open('foo.zip');
while($entry = zip_read($zip)) {
    $entry_name = zip_entry_name($entry);
    echo 'name: ',$entry_name,PHP_EOL;
    // directory entries end in a trailing slash
    if('/' === substr($entry_name,-1)) {
        echo 'is a directory',PHP_EOL;
    } else {
        echo 'is a file',PHP_EOL;
    }
}

Easy huh? :-)
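
The zip_open() family has been deprecated since PHP 8.0, so on a recent PHP you may prefer the ZipArchive class. Here is a sketch of the same trailing-slash check with it, assuming the zip extension is loaded. is_directory_entry() and classify_zip_entries() are names I made up, and the demo builds its own throwaway archive so it doesn't depend on a foo.zip lying around:

```php
<?php
// The check itself: directory entries end in a trailing slash.
function is_directory_entry($name) {
    return substr($name, -1) === '/';
}

// Returns array(entry name => 'directory' or 'file'), or false on failure.
function classify_zip_entries($path) {
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return false; // the archive could not be opened
    }
    $result = array();
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = $zip->getNameIndex($i);
        $result[$name] = is_directory_entry($name) ? 'directory' : 'file';
    }
    $zip->close();
    return $result;
}

if (class_exists('ZipArchive')) {
    // Build a small temporary archive with one directory and one file
    $path = tempnam(sys_get_temp_dir(), 'zip');
    $zip = new ZipArchive();
    $zip->open($path, ZipArchive::CREATE | ZipArchive::OVERWRITE);
    $zip->addEmptyDir('docs');
    $zip->addFromString('docs/readme.txt', 'hello');
    $zip->close();

    print_r(classify_zip_entries($path));
    unlink($path); // clean up the temporary archive
} else {
    echo 'zip extension not available', PHP_EOL;
}
```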

Tuesday, 3 March 2009

Fetching tree data with PHP from MySQL with only one query

Sometimes you may want to store some tree data in your database, for example navigation menus, where each "node" can have children.

The most obvious way of fetching it would of course be to model the fetching algorithm on the nature of the data itself: recursively.

There is one problem with this method though: large trees will require dozens of queries, not to mention the storage in your client's runtime (e.g. PHP).

Here is another method for describing tree data, which requires at most three queries to fetch just about anything you could fetch with hundreds if not thousands of recursive calls.

First, the structure of the database table would look like this:

`id` int(10) unsigned NOT NULL auto_increment,
`order` int(10) unsigned NOT NULL COMMENT 'in which order to sort within the same parent',
`indent` int(10) unsigned NOT NULL COMMENT 'the indentation level',
`data` varchar(255) collate ascii_bin NOT NULL COMMENT 'a message to show',
KEY `order` (`order`,`indent`)

The "order" column is there so you don't have to mess with the auto incrementing id column. It allows you to reorder the nodes individually.
The "indent" column defines the "indentation" of the node - you may think of it as the node's depth inside the tree.

The simplest code would then look like this:
$res = mysql_query('SELECT * FROM `'.TABLE_NAV.'` ORDER BY `order`,`indent`');
echo '<div style="border: 1px solid black"><pre>',PHP_EOL;
while($item = mysql_fetch_assoc($res)) {
    echo str_repeat(' ',$item['indent']), $item['data'], PHP_EOL;
}
echo '</pre></div>',PHP_EOL;
I've added some indentation so you can get a feel for it.
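
If you'd rather have nested lists than indented text, the same flat result can be turned into <ul>/<li> markup in a single pass. A sketch under a few assumptions: the rows are already sorted by `order`, the first row has indent 0, and `indent` never grows by more than one from a row to the next. render_tree() is a made-up name and $rows stands in for the fetched query result:

```php
<?php
// Turns flat (indent, data) rows into nested <ul> lists in one pass.
function render_tree(array $rows) {
    $html = '';
    $depth = -1; // no list opened yet
    foreach ($rows as $row) {
        $indent = (int) $row['indent'];
        if ($indent > $depth) {
            // one level deeper: open a new list inside the current item
            $html .= str_repeat('<ul>', $indent - $depth);
        } else {
            // same level or back up: close the previous item
            // and every list that has ended
            $html .= '</li>' . str_repeat('</ul></li>', $depth - $indent);
        }
        $html .= '<li>' . htmlspecialchars($row['data']);
        $depth = $indent;
    }
    if ($depth >= 0) {
        // close whatever is still open
        $html .= '</li>' . str_repeat('</ul></li>', $depth) . '</ul>';
    }
    return $html;
}

$rows = array(
    array('indent' => 0, 'data' => 'Home'),
    array('indent' => 0, 'data' => 'Products'),
    array('indent' => 1, 'data' => 'Phones'),
    array('indent' => 1, 'data' => 'Laptops'),
    array('indent' => 0, 'data' => 'Contact'),
);
echo render_tree($rows), PHP_EOL;
```

Note that there is still no recursion and no extra query: the indent value alone tells the loop when to open and when to close a list.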

You could also group the tree data into "navigation panes" by just adding a new column to the table. Every node should then contain an id (not to be confused with the id column; you just make up some common identifiers, a number would be best) describing which "navigation pane" that node belongs to.

The upside of this method is efficiency while fetching the data. The downside is that you will need more work when deleting or reordering subtrees. That shouldn't be an issue for common cases, as you usually don't modify or delete the tree that often.

This organisational model does not protect your data's consistency. For example, when reordering a node on its own "indentation level", you will also need to reorder its children (to increment/decrement `order` by the same amount). The same goes for `indent`.
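
Finding out which rows belong to a node's subtree is the first step of any such move or delete. In this model, a subtree is simply the node itself plus all the rows directly after it that have a larger indent. A small sketch, assuming the rows are sorted by `order` and indexed from 0; subtree_slice() is a name I made up:

```php
<?php
// Returns the node at position $start together with all of its descendants.
function subtree_slice(array $rows, $start) {
    $end = $start + 1;
    // descendants are the directly following rows with a deeper indent
    while ($end < count($rows) && $rows[$end]['indent'] > $rows[$start]['indent']) {
        $end++;
    }
    return array_slice($rows, $start, $end - $start);
}

$rows = array(
    array('indent' => 0, 'data' => 'Home'),
    array('indent' => 0, 'data' => 'Products'),
    array('indent' => 1, 'data' => 'Phones'),
    array('indent' => 1, 'data' => 'Laptops'),
    array('indent' => 0, 'data' => 'Contact'),
);
foreach (subtree_slice($rows, 1) as $row) {
    echo str_repeat(' ', $row['indent']), $row['data'], PHP_EOL;
}
```

Once you have these rows, moving the subtree comes down to shifting their `order` values by the same amount, as described above.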

Here is a working PoC, just
  1. edit the "<edit me>" values in dbconf.php
  2. run install.php - no message should appear :-)
  3. view index.php
I hope you've found it useful and come back.

Note: the presented code is only a proof of concept, and there's much room for improvement. As such, I don't assume any responsibility for any harm the code may do to your system. I do, however, assume the responsibility for the concept itself.

Sunday, 19 October 2008

Understanding how the Internet and the Web work, for PHP programmers

In this article I'll try to explain how the platform on which you build websites works. Please read it carefully, take a deep breath from time to time, and try to brainstorm around with what you read.
And please, PLEASE be very careful about the terminology. It will help you a lot - for example when you ask for help and need to be precise in what you ask.

The Internet is "made" of so-called services. Some of the most known services are: the web (for inter-connected documents), pop3 or imap (for e-mail reading and writing), irc (for live chat), file transfers (FTP).

In order to use an Internet service, you need a special program called a client, which is designed specifically for that service. So you've got a multitude of services, each with its own types of clients. The service called World Wide Web, for example, is so widely used that web clients got a special name: browsers.

Many software vendors have created their own browsers, so now we have programs like Microsoft Internet Explorer, Mozilla Firefox, Opera, Google Chrome and Safari, to name the most used.

But why do we need a client after all? Enter the world of protocols.
Imagine the following scenario, which is seen in (almost) every service:

There is a computer sitting somewhere on the network and waiting for requests. Such a computer is called a server. On the server, which is the physical machine, there is a program running in the background which processes the requests sent by the clients. This program is called a daemon. The administrator of the server may say colloquially that "the server is up and running", but what she actually means is: "the server is connected to the Internet and the daemon is listening for new requests".

But keep in mind that every service is different (a little later I'll explain why). So, just as we've got different clients for different services, there are also different daemons for each service. Examples of such daemons: Apache and IIS for the www service, UnrealIRCd for live chat via IRC, sendmail for e-mail, etc. Remember: these are the actual executable files, just like "firefox.exe". In contrast to that, the notions of client and server are generic classifications for types of software.

Now, back to the initial question: why do we need clients and daemons for every existing service on the Internet? Because these two types of programs communicate in a language called a protocol, and each service has its own protocols. You may wonder why they are different. Well, because every service has a different aim. For example, writing e-mail is not the same as publishing a document on the web: an e-mail needs one or more receivers, but a document on the web is visible to everyone and doesn't have a receiver per se.

For example, the service World Wide Web, or the web or www for short, uses a protocol called hypertext transfer protocol (abbr. HTTP). Its name is the beginning of every "web address" you enter in your browser, i.e. "http://". You may also have seen "news://", "mailto:" or "irc://" for other protocols out there, each of which is the communication language of a specific service. Technically, such an address is called a URL or URI.

Usually, on a server (I repeat: this is the physical machine) there may be more than one daemon running concurrently and listening for requests. But how should the operating system of that server know which connection goes to which daemon? The OS itself has no notion of these service protocols, it only recognizes "connection requests" (at the TCP/IP level) and must forward them to the right program (to the right daemon).

The secret lies in the so-called ports. A port is a number between 1 and 65535. When the client (e.g. Firefox) opens the TCP/IP connection, it also states the number of the port "via" which it wants to connect. Saying "via" is not quite correct, since the port is only a number which the operating system (abbr. OS) uses to associate incoming connections with programs (here: daemons), but for the sake of clarity you may imagine a port as a "communication channel".
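If you want to see the port idea in action without leaving your own machine, the following Python sketch may help. It starts two toy "daemons" on the same computer, each bound to a different port chosen by the OS, and lets a client reach one or the other purely by port number. The function names `start_daemon` and `ask` are invented for this illustration, and the toy daemons answer exactly one connection each.

```python
import socket
import threading


def start_daemon(greeting):
    """Start a one-shot TCP daemon on a free localhost port; return the port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free port
    server.listen(1)
    port = server.getsockname()[1]

    def serve():
        conn, _ = server.accept()  # wait for a single client
        with conn:
            conn.sendall(greeting)
        server.close()

    threading.Thread(target=serve, daemon=True).start()
    return port


def ask(port):
    """Connect to the given port on localhost and read until the daemon hangs up."""
    with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
        chunks = []
        while True:
            data = client.recv(1024)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks)


web_port = start_daemon(b"hello from the web daemon")
chat_port = start_daemon(b"hello from the chat daemon")
chat_reply = ask(chat_port)  # same machine, the port alone picks the daemon
web_reply = ask(web_port)
```

Both daemons live on the same IP address; the only thing that tells the OS which one should receive a connection is the port number the client asked for.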

Had enough theory? Let's look at how the things I've explained so far look in real life, with a hands-on example.

I'm going to show you what a browser does when you type "http://www.google.com" into your address bar. For this, I need to use a program called telnet. It is really basic: all it does is create a socket (read: a connection) via TCP/IP on the port I tell it to. It has no notion of protocols, but that's exactly what we need, since we're going to talk to the server in the language HTTP manually - something that the browser would do automatically for us.

Open the CLI of your operating system (CLI - command line interface; in Windows XP, this can be achieved by Start -> run -> type in "cmd"; on *NIX, this is the shell, accessible through a terminal). A black and unfriendly window will appear.

Type in the following:
telnet google.com 80
80 is the standard port for the www service.

After the connection is established, type this text, but type it quickly:
GET / HTTP/1.1
Host: www.google.com

Attention: the request line is case sensitive, which means lower/UPPER case is important! (Header names such as "Host" are not, but it does no harm to type them exactly as shown.)
Also note that you must press return twice after "www.google.com", that is, you must mark the end of the request with an empty line.
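For the curious, here is roughly what you just typed, written as a small Python sketch. On the wire, every HTTP line ends with a carriage return plus line feed ("\r\n"), and the empty line is simply one more "\r\n". The function names `build_request` and `send_request` are my own; `send_request` plays the role of telnet and only returns the first chunk of the reply.

```python
import socket


def build_request(host, path="/"):
    """Return the raw bytes of a minimal HTTP/1.1 GET request."""
    # Request line, Host header, then the empty line that ends the request.
    request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n"
    return request.encode("ascii")


def send_request(host, request, port=80, timeout=5.0):
    """Open a TCP connection like telnet does, send the request,
    and return the first chunk (up to 4096 bytes) of the reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(request)
        return conn.recv(4096)


raw = build_request("www.google.com")
```

Calling `send_request("www.google.com", raw)` would then do over the network what the telnet session does by hand.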

The entire communication between the client (here: telnet) and the server (here: google.com) on port 80 would then look something like:
GET / HTTP/1.1
Host: www.google.com

HTTP/1.1 302 Found
Location: http://www.google.at/
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=f8df1f836de11e39:TM=1224414258:LM=1224414258:S=l1a-I88j2a0boHIM; expires=Tue, 19-Oct-2010 11:04:18 GMT; path=/; domain=.google.com
Date: Sun, 19 Oct 2008 11:04:18 GMT
Server: gws
Content-Length: 218

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<H1>302 Moved</H1>
The document has moved
<A HREF="http://www.google.at/">here</A>.
Connection closed by foreign host.
The first line of our HTTP request contains the request method (here we used GET; another method would be POST, which you may have heard of if you have programmed in PHP). The slash "/" is the resource we request. If we wanted to access the URL http://www.google.com/support/, we would have to enter "GET /support/ HTTP/1.1".

"HTTP" stands for the protocol used, and 1.1 is the version of the protocol.

The HTTP field "Host" tells the Google daemon that we're referring to the "www" server. That's the www in "www.google.com": Google has other servers as well, like "mail.google.com", and the daemon needs to know which one we're sending the request to. Note: this association happens at the application level (at the daemon level), as opposed to the ports, which have a meaning at the OS level.
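The way a URL falls apart into "where to connect" and "what to ask for" can be sketched in a few lines of Python. The helper `request_parts` is a made-up name, and it assumes a plain "http://" URL, so the port defaults to 80 when the URL names none.

```python
from urllib.parse import urlsplit


def request_parts(url):
    """Split an http:// URL into (host, port, request line)."""
    parts = urlsplit(url)
    path = parts.path or "/"  # no path at all means the root resource
    if parts.query:
        path += "?" + parts.query
    port = parts.port or 80  # assumption: plain HTTP on its standard port
    return parts.hostname, port, f"GET {path} HTTP/1.1"


host, port, request_line = request_parts("http://www.google.com/support/")
```

The host and port are used to open the TCP connection, the host is repeated in the "Host" field, and the request line carries the rest of the URL.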

What we sent to the server is called the request headers. After them come the response headers, and then the response body itself, if any. These three sections are separated from each other by an empty line. That's the reason you had to press return twice when sending the HTTP request. The communication language HTTP (i.e. the HTTP protocol) specifies this.
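Here is a small Python sketch of how a client might take such a response apart at the empty line. The name `split_response` is hypothetical, and the parsing is deliberately naive: a header that occurs several times (such as Set-Cookie) keeps only its last value, which is enough for reading along but not for a real client.

```python
def split_response(raw):
    """Split raw response bytes into (status code, reason, headers, body)."""
    # The first empty line ("\r\n\r\n") separates the headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lower case.
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body


sample = (
    b"HTTP/1.1 302 Found\r\n"
    b"Location: http://www.google.at/\r\n"
    b"Server: gws\r\n"
    b"\r\n"
    b"<H1>302 Moved</H1>"
)
status, reason, headers, body = split_response(sample)
```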

From the HTTP response I can see that the server understood my request, and that the status code is 302. This code means that the resource currently lives somewhere else, and the server is kind enough to tell me where in the "Location" header: http://www.google.at/. The daemon detects, based on my IP address, that my geographical location is Austria; you will probably be redirected to your own country's site in the same way. So you need to open a new connection on port 80 to the host it names and send a new HTTP request, just as I showed you above.
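The decision a browser makes at this point could look something like the following Python sketch. The function name `next_request` and the set of status codes treated as redirects are my own choices for illustration, and the sketch assumes the "Location" header holds a full URL, as it does in the response above.

```python
from urllib.parse import urlsplit

# Status codes that ask the client to look at another address.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def next_request(status, headers):
    """Return (host, path) for the follow-up request, or None if there is none."""
    if status not in REDIRECT_CODES:
        return None
    location = headers.get("location")
    if not location:
        return None
    parts = urlsplit(location)  # assumption: Location is an absolute URL
    return parts.hostname, parts.path or "/"


target = next_request(302, {"location": "http://www.google.at/"})
```

With the result in hand, the client connects to that host on port 80 and sends "GET / HTTP/1.1" with the new "Host" value.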

You will finally get the HTML code of the website. From this point onwards, a web browser would do things like:
  • rendering the markup code in its canvas
  • looking for external resources like images, frames/iframes, javascript scripts, css style sheets etc, and creating new http requests for each of them; after this step, images would appear as being part of the html document, but in fact they are separate resources, with their own URLs
  • executing any client-side code, like javascript scripts
But since we're using telnet as a client, which has no knowledge about what HTML means, it simply shows us the markup and then closes the connection.
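The second step, looking for external resources, can be sketched with Python's standard HTML parser. The class name `ResourceFinder` and the short tag table are simplifications of mine: a real browser knows many more places where URLs can hide, and it treats different kinds of "link" elements differently.

```python
from html.parser import HTMLParser


class ResourceFinder(HTMLParser):
    """Collect the URLs a browser would have to request separately."""

    # tag name -> attribute that points at an external resource
    SOURCES = {
        "img": "src",
        "script": "src",
        "iframe": "src",
        "frame": "src",
        "link": "href",
    }

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = self.SOURCES.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(value)


def find_resources(markup):
    """Return the resource URLs found in the markup, in document order."""
    finder = ResourceFinder()
    finder.feed(markup)
    finder.close()
    return finder.urls


page = (
    '<html><head><link rel="stylesheet" href="style.css">'
    '<script src="app.js"></script></head>'
    '<body><img src="logo.png"><p>Hello</p></body></html>'
)
found = find_resources(page)
```

Each URL in the result stands for one more HTTP request of the kind shown earlier.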

Feel free to play around with what you've learned so far, and ask if something is unclear.
Here are some questions which may help you brainstorm:
  1. Why is javascript not a reliable way of validating input?
  2. Why can't you trust some information in $_SERVER[], like $_SERVER['HTTP_USER_AGENT']?
  3. Why does the error "headers already sent" actually exist? (You probably know it already; it appears when you don't call session_start() appropriately.)