Sunday 19 October 2008

Understanding how the Internet and the Web work, for PHP programmers

In this article I'll try to explain how the platform on which you build websites works. Please read it carefully, take a deep breath from time to time, and think through what you have read.
And please, PLEASE be very careful about the terminology. It will help you a lot, for example when you ask for help and need to be precise about what you are asking.


The Internet is "made" of so-called services. Some of the best-known services are: the web (for interconnected documents), e-mail (sent with SMTP and read with POP3 or IMAP), live chat (IRC), and file transfer (FTP).

In order to use an Internet service, you need a special program called a client, which is designed specifically for that service. So you've got a multitude of services, each with its own types of clients. The service called the World Wide Web, for example, is so widely used that people gave web clients a special name: browsers.

Many software vendors have created their own browsers, so now we have programs like Microsoft Internet Explorer, Mozilla Firefox, Opera, Google Chrome and Safari, to name the most used.

But why do we need a client after all? Enter the world of protocols.
Imagine the following scenario, which is seen in (almost) every service:

There is a computer sitting somewhere on the network and waiting for requests. Such a computer is called a server. On the server, which is the physical machine, a program runs in the background and processes the requests sent by the clients. This program is called a daemon. The administrator of the server may say colloquially that "the server is up and running", but what she actually means is: "the server is connected to the Internet and the daemon is listening for new requests".

But keep in mind that every service is different (a little later I'll explain why). So, just as we've got different clients for different services, there are also different daemons for each service. Examples of such daemons: Apache and IIS for the www service, UnrealIRCd for live chat via IRC, sendmail for e-mail, etc. Remember: these are the actual executable files, just like "firefox.exe". In contrast, the notions of client and server are generic classifications for types of software.

Now, back to the initial question: why do we need clients and daemons for every existing service on the Internet? Because these two types of programs communicate in a language called a protocol, and each service has its own protocols. You may wonder why they are different. Well, because every service has a different aim. For example, writing an e-mail is not the same as publishing a document on the web: an e-mail needs one or more recipients, but a document on the web is visible to everyone and doesn't have a recipient per se.

For example, the service World Wide Web, or the web for short, or www, uses a protocol called the Hypertext Transfer Protocol (abbr. HTTP). This is where the beginning of every "web address" you enter in your browser comes from, i.e. "http://". You may also have seen "news://", "mailto:" or "irc://"; these prefixes point to other services, each of which has its own communication language. Technically, the address is called a URL or URI.
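To make the parts of such an address visible, here is a minimal sketch that takes a URL apart. Python is used purely for illustration, and the example address is an assumption chosen to match the ones used in this article.

```python
from urllib.parse import urlsplit

# Example address, chosen for illustration only.
parts = urlsplit("http://www.google.com/support/")

print(parts.scheme)    # the protocol part, written before "://"
print(parts.hostname)  # the server we want to talk to
print(parts.path)      # the resource we want from that server
```

The client looks at the first part to decide which language (protocol) it has to speak, and at the second part to decide which server to connect to.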

Usually, on a server (I repeat: this is the physical machine) there may be more than one daemon running concurrently and listening for requests. But how should the operating system of that server know which connection goes to which daemon? The OS itself has no notion of the protocols spoken by the services; it only recognizes "connection requests" (at the TCP/IP level) and must forward them to the right program (to the right daemon).

The secret lies in the so-called ports. A port is a number between 1 and 65535. When the client (e.g. Firefox) opens the TCP/IP connection, it also states the number of the port "via" which it wants to connect. Saying "via" is not quite correct, since the port is only a number that the operating system (abbr. OS) uses to associate incoming connections with programs (here: daemons). For the sake of clarity, though, you may imagine a port as a "communication channel".
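The role of the port can be tried out on a single machine. The following is only a sketch: it starts two pretend "daemons" on the local machine (address 127.0.0.1), lets the OS pick a free port for each, and then connects to each of them by port number. The function names and the greeting texts are made up for this illustration; real daemons are far more elaborate.

```python
import socket
import threading

def start_daemon(greeting):
    """Start a tiny pretend daemon on a free local port and return that port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0 means: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    def serve_one_client():
        conn, _addr = listener.accept()   # wait for one incoming connection
        with conn:
            conn.sendall(greeting)
        listener.close()

    threading.Thread(target=serve_one_client, daemon=True).start()
    return port

def ask(port):
    """Connect to the given port on this machine and read the whole reply."""
    with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
        chunks = []
        while True:
            data = client.recv(1024)
            if not data:              # the daemon closed the connection
                break
            chunks.append(data)
        return b"".join(chunks)

web_port = start_daemon(b"hello from the pretend web daemon\n")
chat_port = start_daemon(b"hello from the pretend chat daemon\n")

# Same machine, two daemons: only the port number decides who answers.
print(web_port, ask(web_port))
print(chat_port, ask(chat_port))
```

Both daemons run on the same machine, yet each connection reaches the right one, because the OS matches the port number of the incoming connection with the program that is listening on it.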

Had enough theory? Let's look at how what I've explained so far looks in real life, with a hands-on example.

I'm going to show you what a browser does when you type "http://www.google.com" in your address bar. For this, I need to use a program called telnet. It is really basic: all it does is create a socket (read: a connection) via TCP/IP on the port I tell it to. It has no notion of protocols, but that's exactly what we need, since we're going to talk to the server in the language HTTP manually, something that the browser would do automatically for us.

Open the CLI of your operating system (CLI - command line interface; in Windows XP, this can be achieved by Start -> run -> type in "cmd"; on *NIX, this is the shell, accessible through a terminal). A black and unfriendly window will appear.

Type in the following:
telnet google.com 80
80 is the standard port for the www service.

After the connection is established, type the following text, and type it quickly, because the server will close a connection that stays silent for too long:
GET / HTTP/1.1
Host: www.google.com

Attention: the method name and the protocol version are case sensitive, so type "GET" and "HTTP/1.1" in UPPER case exactly as shown! (Header names such as "Host" are not case sensitive.)
Also note that you must press return twice after "www.google.com", that is, you must mark the end of the request with an empty line.
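What you type by hand can also be put together by a program. The sketch below builds the same request as a sequence of bytes; the function name build_get_request is made up for this illustration. On the wire, HTTP ends each line with a carriage return plus a line feed ("\r\n"), and the empty line at the end marks the end of the request.

```python
def build_get_request(host, resource="/"):
    """Return the bytes a client would send for a simple GET request."""
    lines = [
        f"GET {resource} HTTP/1.1",
        f"Host: {host}",
        "",   # together with the next entry this produces the empty line
        "",   # that marks the end of the request
    ]
    return "\r\n".join(lines).encode("ascii")

request = build_get_request("www.google.com")
print(request)
```

A client would then open a TCP/IP connection to port 80 of the server and send exactly these bytes, which is what telnet does with your keystrokes.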

The entire communication between the client (here: telnet) and the server (here: google.com) on port 80 would then look something like:
GET / HTTP/1.1
Host: www.google.com

HTTP/1.1 302 Found
Location: http://www.google.at/
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=f8df1f836de11e39:TM=1224414258:LM=1224414258:S=l1a-I88j2a0boHIM; expires=Tue, 19-Oct-2010 11:04:18 GMT; path=/; domain=.google.com
Date: Sun, 19 Oct 2008 11:04:18 GMT
Server: gws
Content-Length: 218


<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
<A HREF="http://www.google.at/">here</A>.
</BODY></HTML>
Connection closed by foreign host.

The first line of our HTTP request contains the request method (here we used GET; another method is POST, which you may have heard of if you have programmed in PHP). The slash "/" is the resource we request. If we wanted to access the URL http://www.google.com/support/, we would have to enter "GET /support/ HTTP/1.1".

"HTTP" stands for the protocol used, and 1.1 is the version of the protocol.
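A daemon has to take this first line apart before it can do anything else. As a minimal sketch (the variable names are chosen for this illustration, and a real daemon would also have to cope with malformed input):

```python
request_line = "GET /support/ HTTP/1.1"

# The three parts are separated by single spaces.
method, resource, protocol = request_line.split(" ")
name, version = protocol.split("/")

print(method)    # GET
print(resource)  # /support/
print(name)      # HTTP
print(version)   # 1.1
```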

The HTTP header field "Host" tells the Google daemon which host name we are addressing, here "www.google.com". This is necessary because Google has other host names as well, like "mail.google.com", and one daemon may answer for several of them, so it needs to know which one our request is meant for. Note: this association happens at the application level (at the daemon level), as opposed to the ports, which have a meaning at the OS level.
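How a daemon might use the "Host" field can be sketched in a few lines. Everything here is hypothetical: the host names, the page contents and the function names are invented for the illustration, and real web servers do this through their configuration rather than a hard-coded table.

```python
# Hypothetical sites served by one and the same daemon.
SITES = {
    "www.example.com": "<h1>Welcome to the main site</h1>",
    "mail.example.com": "<h1>Please log in to read your mail</h1>",
}

def read_host_header(raw_request):
    """Return the value of the Host header of a raw request, or None."""
    for line in raw_request.split("\r\n")[1:]:   # skip the request line
        if line == "":
            break                                # empty line: headers are over
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":       # header names ignore case
            return value.strip().lower()
    return None

def choose_site(raw_request):
    """Pick the page to serve, based only on the Host header."""
    host = read_host_header(raw_request)
    return SITES.get(host, "<h1>Unknown host</h1>")

print(choose_site("GET / HTTP/1.1\r\nHost: mail.example.com\r\n\r\n"))
```

Both requests would arrive on the same port of the same machine; only the "Host" line tells the daemon which site is meant.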

What we sent to the server is the request line followed by the request headers. The server answers with the response headers and then with the response itself (the body), if any. These sections are separated from one another by an empty line. That's the reason you had to press return twice when sending the HTTP request. The communication language HTTP (i.e. the HTTP protocol) specifies this.
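The empty line is also what lets a client split the answer into its parts. Below is a minimal sketch, using a shortened version of the response shown above; the function name parse_response is made up, and real responses can be more complicated than this code assumes.

```python
# Shortened version of the response from the telnet session.
RAW_RESPONSE = (
    "HTTP/1.1 302 Found\r\n"
    "Location: http://www.google.at/\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "\r\n"
    "<HTML><BODY>The document has moved</BODY></HTML>"
)

def parse_response(raw):
    """Split a raw response into status code, reason, headers and body."""
    head, _, body = raw.partition("\r\n\r\n")      # the first empty line
    status_line, *header_lines = head.split("\r\n")
    _protocol, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")       # split at the first colon only
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body

status, reason, headers, body = parse_response(RAW_RESPONSE)
print(status, reason)
print(headers["location"])
print(body)
```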

From the HTTP response I can see that the server understood my request; the status code is 302 ("Found"), which means that the document is temporarily available at a different address. The server is kind enough to tell me, in the "Location" field, that this address is http://www.google.at/. The daemon detected, based on my IP address, that my geographical location is Austria; it may redirect you to a different address. So you now need to make a new HTTP request on port 80 to the address it tells you, just as I showed you above.
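A browser makes this decision automatically. The sketch below shows the idea under simplifying assumptions: the "Location" value is a complete "http://" address, the headers are given as a dictionary with lower-case names, and the function name next_request and the set of codes are chosen for this illustration.

```python
from urllib.parse import urlsplit

# Status codes that ask the client to look at another address.
REDIRECT_CODES = {301, 302, 303, 307, 308}

def next_request(status, headers):
    """Return (host, port, resource) for the follow-up request, or None."""
    if status not in REDIRECT_CODES or "location" not in headers:
        return None                          # nothing to follow
    target = urlsplit(headers["location"])
    resource = target.path or "/"
    if target.query:
        resource += "?" + target.query
    return target.hostname, target.port or 80, resource

print(next_request(302, {"location": "http://www.google.at/"}))
```

The result tells the client where to connect next (host and port) and which resource to name in the first line of the new request.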

You will finally get the HTML code of the website. From this point onwards, a web browser would do things like:
  • rendering the markup code in its canvas
  • looking for external resources like images, frames/iframes, javascript scripts, css style sheets etc, and creating new http requests for each of them; after this step, images would appear as being part of the html document, but in fact they are separate resources, with their own URLs
  • executing any client-side code, like javascript scripts
But since we're using telnet as a client, which has no knowledge of what HTML means, it simply shows us the markup as plain text.
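The second step in the list above, looking for external resources, can be sketched as follows. The page, its address and the class name ResourceFinder are invented for the illustration, and the code is a simplification: it treats every link, script, img and iframe tag as something to fetch, which a real browser decides far more carefully.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ResourceFinder(HTMLParser):
    """Collect the URLs of external resources referenced by a page."""

    # Which attribute carries the URL, per tag.
    URL_ATTRIBUTES = {"img": "src", "script": "src", "iframe": "src", "link": "href"}

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRIBUTES.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # Turn relative addresses into complete URLs.
                self.resources.append(urljoin(self.page_url, value))

# Made-up page, standing in for the HTML a server would return.
PAGE = """
<html><head>
<link rel="stylesheet" href="/css/site.css">
<script src="js/menu.js"></script>
</head><body>
<img src="logo.png" alt="logo">
</body></html>
"""

finder = ResourceFinder("http://www.example.com/shop/index.html")
finder.feed(PAGE)
for url in finder.resources:
    print(url)   # each of these needs its own HTTP request
```

Every address in the result is a separate resource with its own URL, so displaying this one small page already takes four HTTP requests: one for the document and three for the resources it refers to.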

Feel free to play around with what you've learned so far, and ask if something is unclear.
Here are some questions which may help you think it through:
  1. Why is javascript not a reliable way of validating input?
  2. Why can't you trust some information in $_SERVER[], like $_SERVER['HTTP_USER_AGENT']?
  3. Why does the error "headers already sent" actually exist? (You probably know it already; it appears, for example, when you call session_start() after some output has already been sent.)
