Sunday 19 October 2008

Understanding how the Internet and the Web work, for PHP programmers

In this article I'll try to explain how the platform on which you build websites works. Please read it carefully, take a deep breath from time to time, and take a moment to think through what you have read.
And please, PLEASE be very careful about the terminology. It will help you a lot, for example when you ask for help and need to be precise about what you are asking.


The Internet is "made" of so-called services. Some of the best-known services are: the web (for interconnected documents), e-mail (SMTP for sending, POP3 or IMAP for reading), IRC (for live chat), and FTP (for file transfers).

In order to use an Internet service, you need a special program called a client, which is designed specifically for that service. So you've got a multitude of services, each with its own types of clients. The service called the World Wide Web, for example, is so widely used that web clients were given a special name: browsers.

Many software vendors created their own browsers, so now we have programs like Microsoft Internet Explorer, Mozilla Firefox, Opera, Google Chrome, and Safari, to name the most used.

But why do we need a client after all? Enter the world of protocols.
Imagine the following scenario, which is seen in (almost) every service:

There is a computer sitting somewhere on the network and waiting for requests. Such a computer is called a server. On the server, which is the physical machine, there is a program running in the background that processes the requests sent by the clients. This program is called a daemon. The administrator of the server may say colloquially that "the server is up and running", but what she actually means is: "the server is connected to the Internet and the daemon is listening for new requests".

But keep in mind that every service is different (a little later I'll explain why). So, just as we've got different clients for different services, there are also different daemons for each service. Examples of such daemons: Apache and IIS for the www service, UnrealIRCd for live chat via IRC, sendmail for e-mail, etc. Remember: these are the actual executable files, just like "firefox.exe". In contrast, the notions of client and server are generic classifications for types of software.

Now, back to the initial question: why do we need clients and daemons for every existing service on the Internet? Because these two types of programs communicate in a language called a protocol, and each service has its own protocols. You may wonder why they are different. Well, because every service has a different aim. For example, writing e-mail is not the same as publishing a document on the web: an e-mail needs one or more recipients, but a document on the web is visible to everyone and doesn't have a recipient per se.

For example, the service World Wide Web, or the web or www for short, uses a protocol called the Hypertext Transfer Protocol (abbr. HTTP). Its name is the beginning of every "web address" you enter in your browser, i.e. "http://". You may have also seen "news://", "mailto:" or "irc://", which stand for other protocols out there, each of which is the communication language of a specific service. Technically, the address is called a URL or URI.
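To see that the protocol prefix is just one part of such an address, here is a minimal sketch that takes a URL apart. It uses Python and its standard library rather than PHP, purely for brevity; that choice is an assumption of this example and not something the article depends on.

```python
from urllib.parse import urlsplit

# Split a web address into its parts.
parts = urlsplit("http://www.google.com/support/")

print(parts.scheme)    # the protocol prefix: http
print(parts.hostname)  # the server name: www.google.com
print(parts.path)      # the resource on that server: /support/
```

The scheme tells the client which protocol to speak, the host name tells it which server to contact, and the path is what it asks that server for.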

Usually, on a server (I repeat: this is the physical machine) there may be more than one daemon running concurrently and listening for requests. But how should the operating system of that server know which connection goes to which daemon? The OS itself has no notion of protocols. It only recognizes "connection requests" (at the TCP/IP level) and must forward them to the right program (to the right daemon).

The secret lies in the so-called ports. A port is a number between 1 and 65535. When the client (e.g. Firefox) initiates the TCP/IP connection, it also specifies the number of the port "via" which it wants to connect. Saying "via" is not quite correct, since the port is only a number that the operating system (abbr. OS) uses to associate incoming connections with programs (here: daemons), but for the sake of clarity you may imagine a port as a "communication channel".
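The association between a port number and a listening program can be tried out on a single machine. The following is a rough sketch in Python (an assumption of this example, as is the helper name tiny_daemon): a toy daemon listens on a port chosen by the OS, and a client reaches it only by naming that port.

```python
import socket
import threading

def tiny_daemon(listener: socket.socket) -> None:
    """Accept one connection, answer it, and close it."""
    conn, _addr = listener.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(b"daemon got: " + data)

# Port 0 asks the OS for any free port; a real web daemon would ask for 80.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

worker = threading.Thread(target=tiny_daemon, args=(listener,))
worker.start()

# The client names the port when it connects; the OS hands the
# connection to whichever program is listening on that number.
with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"hello")
    reply = client.recv(1024)

worker.join()
listener.close()
print(port, reply)
```

The port number printed will differ from run to run, because the OS picks it. The daemon never learns which client program connected; it only sees the bytes that arrive.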

Have you had enough theory? Let's look at how the things I have explained so far look in real life, with a hands-on example.

I'm going to show you what a browser does when you type "http://www.google.com" in your address bar. For this, I need to use a program called telnet. It is really basic: all it does is create a socket (read: a connection) via TCP/IP on the port I tell it to. It has no notion of protocols, but that's exactly what we need, since we're going to talk HTTP to the server manually, something that the browser would do automatically for us.

Open the CLI of your operating system (CLI - command line interface; in Windows XP, this can be achieved by Start -> run -> type in "cmd"; on *NIX, this is the shell, accessible through a terminal). A black and unfriendly window will appear.

Type in the following:
telnet google.com 80
80 is the standard port for the www service.

After the connection is established, type this text, but type it quickly:
GET / HTTP/1.1
Host: www.google.com

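Attention: the first line of the request is case sensitive, which means lower/UPPER case is important! "GET" and "HTTP/1.1" must be typed in upper case; header names such as "Host" are not case sensitive.
Also note that you must press return twice after "www.google.com", that is, you must mark the end of the request with an empty line.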
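What you just typed by hand can also be assembled in code. The sketch below (Python, with the hypothetical helper name build_get_request) only builds the bytes of the request and does not send them, so it runs without a network connection.

```python
def build_get_request(host: str, path: str = "/") -> bytes:
    """Return the bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",  # method, resource, protocol version
        f"Host: {host}",         # which site on that server we mean
        "",                      # the empty line that ends the request
        "",
    ]
    # HTTP ends every line with the two characters CR and LF.
    return "\r\n".join(lines).encode("ascii")

request = build_get_request("www.google.com")
print(request)
```

The two empty strings at the end of the list produce the final line ending plus the empty line, which is the equivalent of pressing return twice.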

The entire communication between the client (here: telnet) and the server (here: google.com) on port 80 would then look something like:
GET / HTTP/1.1
Host: www.google.com

HTTP/1.1 302 Found
Location: http://www.google.at/
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=f8df1f836de11e39:TM=1224414258:LM=1224414258:S=l1a-I88j2a0boHIM; expires=Tue, 19-Oct-2010 11:04:18 GMT; path=/; domain=.google.com
Date: Sun, 19 Oct 2008 11:04:18 GMT
Server: gws
Content-Length: 218


<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
<A HREF="http://www.google.at/">here</A>.
</BODY></HTML>
Connection closed by foreign host.

The first line of our HTTP request contains the request method (here we used GET; another method is POST, which you may have heard of if you have programmed in PHP). The slash "/" is the resource we request. If we wanted to access the URL http://www.google.com/support/, we would have to enter "GET /support/ HTTP/1.1".

"HTTP" stands for the protocol used, and 1.1 is the version of the protocol.

The HTTP field "Host" tells the Google daemon which site we're referring to, here "www.google.com". Google has other host names as well, like "mail.google.com", and the daemon needs to know which one we're sending the request to. Note: this association happens at the application level (the daemon level), as opposed to ports, which have a meaning at the OS level.
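How a daemon might use the "Host" field can be sketched in a few lines. This is an illustration only, not how any particular web server is implemented; the SITES table, its directories, and the function name pick_site are hypothetical.

```python
# Hypothetical mapping from host names to site directories.
SITES = {
    "www.example.com": "/var/www/main",
    "mail.example.com": "/var/www/webmail",
}

def pick_site(raw_request: bytes):
    """Return the directory for the Host header, or None if unknown."""
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("ascii")
    for line in head.split("\r\n")[1:]:      # skip the request line
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":   # header names ignore case
            return SITES.get(value.strip().lower())
    return None

print(pick_site(b"GET / HTTP/1.1\r\nHost: mail.example.com\r\n\r\n"))
```

Both host names can point to the same machine and the same port; only the "Host" line lets the daemon tell the two sites apart. The sketch ignores details such as a port number appended to the host name.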

What we sent to the server is called the request headers. After them come the response headers, and then the response body, if any. These three sections are separated from each other by an empty line. That's the reason you had to press return twice when sending the HTTP request. The communication language HTTP (i.e. the HTTP protocol) specifies this.
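The role of the empty line becomes obvious when you take a response apart in code. Below is a minimal sketch in Python, using a shortened copy of the response shown earlier; the function name split_response is made up for this example.

```python
RAW_RESPONSE = (
    b"HTTP/1.1 302 Found\r\n"
    b"Location: http://www.google.at/\r\n"
    b"Content-Type: text/html; charset=UTF-8\r\n"
    b"\r\n"
    b"<HTML><BODY>The document has moved</BODY></HTML>"
)

def split_response(raw: bytes):
    """Split a raw HTTP response into status code, headers and body."""
    head, _, body = raw.partition(b"\r\n\r\n")    # the first empty line
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    status = int(status_line.split(" ", 2)[1])     # "HTTP/1.1 302 Found"
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers, body

status, headers, body = split_response(RAW_RESPONSE)
print(status, headers["location"])
```

Everything before the first empty line is headers, everything after it is the body. The sketch keeps only the last value when a header is repeated, which is a simplification.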

From the HTTP response I can see that the server found what I asked for, but the status code 302 means the resource currently lives at a different address. The server is kind enough to tell me, in the "Location" header, that this address is http://www.google.at/. The daemon detected, based on my IP address, that I am in Austria; it may do something similar for you. So you need to create a new HTTP request on port 80 to the address it gives you, just as I showed you above.
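Deciding where the next request has to go can be sketched as follows. The function name next_target and the set of redirect codes it checks are choices made for this example, and it assumes the "Location" header contains a full URL.

```python
from urllib.parse import urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}

def next_target(status: int, headers: dict):
    """Return (host, port, path) to request next, or None if no redirect."""
    location = headers.get("location")
    if status not in REDIRECT_CODES or not location:
        return None
    parts = urlsplit(location)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.hostname, port, parts.path or "/"

print(next_target(302, {"location": "http://www.google.at/"}))
```

For the response above this gives the host "www.google.at", port 80 and the resource "/", which is exactly what you would type into a new telnet session. A browser does this for you without asking.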

You will finally get the HTML code of the website. From this point onwards, a web browser would do things like:
  • rendering the markup code in its canvas
  • looking for external resources like images, frames/iframes, JavaScript files, CSS style sheets etc., and creating a new HTTP request for each of them; after this step, images appear to be part of the HTML document, but in fact they are separate resources with their own URLs
  • executing any client-side code, like JavaScript
But since we're using telnet as a client, which has no knowledge about what HTML means, it simply shows us the markup and then closes the connection.
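The second step in the list, looking for external resources, can be imitated with Python's built-in HTML parser. This is a rough sketch and far from what a real browser does; the class name ResourceFinder and the sample page are invented for the example.

```python
from html.parser import HTMLParser

class ResourceFinder(HTMLParser):
    """Collect URLs a browser would fetch with extra requests."""
    WANTED = {
        ("img", "src"), ("script", "src"), ("iframe", "src"),
        ("frame", "src"), ("link", "href"),
    }

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if (tag, name) in self.WANTED and value:
                self.urls.append(value)

page = """<html><head><link rel="stylesheet" href="/style.css">
<script src="/app.js"></script></head>
<body><img src="/logo.png"><a href="/about">About</a></body></html>"""

finder = ResourceFinder()
finder.feed(page)
print(finder.urls)
```

The link to "/about" is not collected, because a browser requests it only when you click on it. Each collected URL, on the other hand, means one more HTTP request of the kind you sent by hand.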

Feel free to play around with what you've learned so far, and ask if something is unclear.
Here are some questions which may help you think it through:
  1. Why is JavaScript not a reliable way of validating input?
  2. Why can't you trust some information in $_SERVER[], like $_SERVER['HTTP_USER_AGENT']?
  3. Why does the error "headers already sent" actually exist? (You probably know it already; it appears when you don't call session_start() appropriately.)

Wednesday 15 October 2008

The Internal Workings of PHP (Die interne Funktionsweise von PHP)


Abstract


This paper looks into the internal workings of PHP, a programming language that is used to generate and control the output of a webserver. In order to understand the fundamental internal processes of PHP and their causes, the structure of the Internet with focus on the Web is herein described. With this knowledge it is possible to easily understand the roles of PHP's components.
The execution flow of PHP is dissected alongside the layout of its source code repository, and a couple of key functions and their roles are presented. Syntactic analysis of the PHP script at runtime is pertinent to an understanding of the system as a whole, and thus the importance of grammars is also addressed. Since PHP is very complex and a complete study would go beyond the scope of this paper, methods of independent analysis are also presented.
Lastly, it is shown with an extension how the PHP-runtime can be enhanced.

Summary (translated from the German "Zusammenfassung")

This paper deals with the internal workings of PHP, a programming language used to generate and control the output of a web server. In order to understand fundamental PHP-internal processes and their causes, the structure of the Internet, with a focus on the Web, is introduced first. With this knowledge, the roles of the components that make up PHP can be understood more easily.
The execution flow of PHP is examined together with the layout of the PHP source repository, and a few key functions and their roles are introduced. The syntactic analysis of the PHP script plays a special role at runtime, and therefore the importance of grammars is discussed as well. Since PHP is very complex and a complete study would go beyond the scope of this paper, methods for analysing PHP on your own are presented.
Finally, a practical extension is used to show how the PHP runtime environment can be enhanced.

Download it here.
Sorry, it's in German only. But you are welcome to help with the translation, so that many other programmers around the globe can benefit from it :-)

Tuesday 14 October 2008

How to Create a Math Calculator in C, the right Way

Here is a small PoC on how to build a calculator with a lexer (re2c) and a parser generator (lemon) in C. It supports the four basic mathematical operations and parentheses.
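To give an idea of the two stages without downloading the archive, here is a small sketch of the same split into lexer and parser. It is not the code from the PoC: it is written in Python instead of C, uses a regular expression in place of re2c, and a hand-written recursive-descent parser in place of the parser that lemon generates.

```python
import re

def evaluate(text: str) -> float:
    """Evaluate +, -, *, / and parentheses with the usual precedence."""
    tokens = re.findall(r"\d+(?:\.\d+)?|\S", text)   # lexer step
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def expr():      # expr := term (("+" | "-") term)*
        value = term()
        while peek() in ("+", "-"):
            op, right = take(), term()
            value = value + right if op == "+" else value - right
        return value

    def term():      # term := factor (("*" | "/") factor)*
        value = factor()
        while peek() in ("*", "/"):
            op, right = take(), factor()
            value = value * right if op == "*" else value / right
        return value

    def factor():    # factor := NUMBER | "(" expr ")"
        token = take()
        if token == "(":
            value = expr()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token is not None and token[0].isdigit():
            return float(token)
        raise SyntaxError(f"unexpected token: {token!r}")

    result = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token: {peek()!r}")
    return result

print(evaluate("2 * (3 + 4) - 10 / 4"))  # 11.5
```

The grammar rules in the comments are the part that a parser generator such as lemon would take as its input; here they are simply coded by hand, one function per rule.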

Thanks to MacVicar for his help.

I'm not sure if it works on other platforms besides Linux, as that's all I have (not!), but hey, feel free to improve it! :-)

calc-0.1

Thursday 9 October 2008

How to set up PHP on a development box, the right way

By default, PHP is not configured for a development machine, but for public servers. This is a list of settings I recommend putting in php.ini on your development box:
error_reporting = E_ALL|E_STRICT
display_errors = On
short_open_tag = Off
asp_tags = Off
display_startup_errors = On

magic_quotes_gpc = Off
output_buffering = Off
allow_call_time_pass_reference = Off
zlib.output_compression = Off
track_errors = On
register_globals = Off
date.timezone should be set according to your time zone. A list of timezones can be found at http://us3.php.net/timezones
session.auto_start = 0
tidy.clean_output = Off

implicit_flush = Off
log_errors = On
ignore_repeated_errors = On
report_memleaks = On
You may also want to test your scripts with
safe_mode = On
Do not forget to restart the http daemon after saving your changes to php.ini.

Hint: you don't have to look for every configuration directive manually. Use your text editor's "find" function, which is usually available by pressing CTRL+F in the editor.
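If you would rather let a script do the comparison, the idea can be sketched as below. This is an illustration in Python, not a complete php.ini parser: it only checks a handful of the directives listed above, and the function names read_directives and differences are made up for the example.

```python
RECOMMENDED = {
    "display_errors": "On",
    "short_open_tag": "Off",
    "register_globals": "Off",
    "log_errors": "On",
}

def read_directives(ini_text: str) -> dict:
    """Parse 'name = value' lines, ignoring comments and [sections]."""
    found = {}
    for line in ini_text.splitlines():
        line = line.split(";", 1)[0].strip()      # ';' starts a comment
        if not line or line.startswith("[") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        found[name.strip()] = value.strip().strip('"')
    return found

def differences(ini_text: str) -> dict:
    """Return {directive: (found, recommended)} for every mismatch."""
    found = read_directives(ini_text)
    return {
        name: (found.get(name), wanted)
        for name, wanted in RECOMMENDED.items()
        if (found.get(name) or "").lower() != wanted.lower()
    }

sample = """[PHP]
display_errors = Off   ; typical for a public server
short_open_tag = Off
log_errors = On
"""
print(differences(sample))
```

A directive that is missing from the file is reported as a mismatch too, although PHP would then fall back to its built-in default, so treat the result as a list of things to look at rather than a verdict.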

My First Post

Everybody is doing backyard journalism these days (not that the "professional" kind is any better), so I decided to give it a try as well.

I don't know how much time I will dedicate to it, but here I intend to post programming advice, tips and tricks, problems I have run into and their solutions, thoughts about what surrounds us, and so on.

PS: no, not a "personal blog", and not "I woke up, I ate, I got dressed ..." either ;)