Chapter 5. The Apache Web server

Table of Contents
An introduction to the World Wide Web
The case study

I have based this thesis on a case study of the group developing the Apache Web server. Before looking more closely at the case study itself, I will give a quick overview of the project's background and the technology the group is working on.

An introduction to the World Wide Web

While employed at the international CERN laboratory in Switzerland, Tim Berners-Lee had been working on a documentation system for better organizing the research material produced on the site. The basis for his efforts was the idea that it would be better to organize scientific material in a way that resembles the human mind, as semantic knowledge networks. The base technology for his project was hypertext. By making use of hypertextual links, Berners-Lee found it easier to connect relevant research material. As he simply wanted to retrieve documents from and put documents to servers, he opted for a simple network communication protocol for this new technology, which he had named the World Wide Web [BERNERSLEE1999](Berners-Lee 1999).

The technology

Berners-Lee called the network communication protocol the Hypertext Transfer Protocol, HTTP for short. It is a stateless protocol that belongs to the application layer of the TCP/IP stack, running on top of TCP. At heart the Web is a traditional client/server architecture. Through HTTP client software, the user requests documents from an HTTP server. The server responds by returning the requested document, and then closes the connection. A client in this context is "[a]n application program that establishes connections for the purpose of sending requests" [RFC1945](Berners-Lee and Fielding 1996, p. 4). For the end user, the HTTP client is a Web browser, but it can in reality be any piece of software using HTTP to initiate communication with another HTTP-enabled piece of software. The server is "[a]n application program that accepts connections in order to service requests by sending back responses" [RFC1945](Berners-Lee and Fielding 1996, p. 4). Apart from the data transfer protocol, there were two more crucial elements of the Web: the Uniform Resource Locator and the Hypertext Markup Language.
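To make the request/response cycle concrete, the following is a minimal sketch of a single HTTP/1.0 transaction, written here in C with the standard BSD sockets interface. The host name www.example.org is a hypothetical stand-in, not something taken from the case study: the client opens a TCP connection, sends one GET request, and prints whatever the server returns until the server closes the connection.

    /* A minimal sketch of one HTTP/1.0 request/response cycle.
     * The host name is a hypothetical stand-in. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>

    int main(void)
    {
        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof hints);
        hints.ai_family = AF_UNSPEC;      /* IPv4 or IPv6 */
        hints.ai_socktype = SOCK_STREAM;  /* HTTP rides on top of TCP */

        if (getaddrinfo("www.example.org", "80", &hints, &res) != 0)
            return 1;

        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) != 0)
            return 1;

        /* The client requests a document from the server ... */
        const char *request = "GET / HTTP/1.0\r\nHost: www.example.org\r\n\r\n";
        write(fd, request, strlen(request));

        /* ... the server returns it and closes the connection; no state
         * is carried over to the next request (statelessness). */
        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)
            fwrite(buf, 1, (size_t)n, stdout);

        close(fd);
        freeaddrinfo(res);
        return 0;
    }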

The Uniform Resource Locator, URL, is the common naming format devised by Berners-Lee [RFC1738](Berners-Lee et al. 1994). It is meant as a meta naming scheme for the World Wide Web, identifying both host and service. The URL consists of three elements: <network protocol>://<server name>/<file path>. During HTTP transactions the protocol element is http. While the URL was conceived for the Web, Berners-Lee realized that users need not limit themselves to just one network protocol. Instead of ignoring the existing Internet protocols and standards, he chose to embrace them with his own technology. That way it is possible to retrieve documents from, say, an FTP server using a Web browser. The URL's second element, the server name, is an IP address or host name. The URL's final part, the file path, indicates which file to retrieve from the server. File paths are handled in a traditional manner with directory structures and file names.
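As a small illustration, the following C sketch splits a URL of this form into its three elements. The URL itself is hypothetical, and the parsing is deliberately naive: it handles only the <network protocol>://<server name>/<file path> shape described above.

    /* Sketch: split a URL into protocol, server name, and file path.
     * The URL is a hypothetical example. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char url[] = "http://www.example.org/pub/thesis/chapter5.html";

        char *sep = strstr(url, "://");      /* marks the end of the protocol */
        if (sep == NULL)
            return 1;
        *sep = '\0';
        const char *protocol = url;
        const char *server = sep + 3;

        const char *slash = strchr(server, '/'); /* start of the file path */
        const char *path = (slash != NULL) ? slash : "/";
        int server_len = (slash != NULL) ? (int)(slash - server)
                                         : (int)strlen(server);

        printf("protocol: %s\n", protocol);
        printf("server:   %.*s\n", server_len, server);
        printf("path:     %s\n", path);
        return 0;
    }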

The third and final piece of Berners-Lee's Web technology is his document markup language, the Hypertext Markup Language, or HTML for short [RFC1866](Berners-Lee and Connolly 1995). While it is possible to transfer all kinds of files and documents using HTTP, Berners-Lee wrote his own document language to support the linking of information on the Web. In the spirit of his embrace-and-extend ideology, Berners-Lee decided to base it on SGML, as that was one of the major document formats used at the CERN research site. With this markup language scientists would be able to embed the hypertext links that Berners-Lee considered the core of his technology. As another part of his embrace-and-extend strategy, it would also be possible to create links to documents in non-HTML formats.
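A small, hypothetical fragment in the early HTML style of RFC 1866 shows the idea: the A element embeds a hypertext link in the running text, and the link may just as well point to a document in a non-HTML format.

    <!-- Hypothetical example of a hypertext link in an HTML document. -->
    <HTML>
    <HEAD><TITLE>Measurement results</TITLE></HEAD>
    <BODY>
    <P>The raw measurements are collected in
    <A HREF="http://www.example.org/data/results.ps">a PostScript report</A>,
    a document in a non-HTML format.</P>
    </BODY>
    </HTML>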

Growth

With the three basic building blocks in place (the URL, HTTP, and HTML), Berners-Lee's first step in spreading his new technology was to show its practical use. After much work he convinced a local systems operator at CERN to use the technology for the research site's on-line address book. Berners-Lee wrote the software himself. He had first written the HTTP client and server for his own NeXT computer. NeXT was even then an obscure platform, and in order for the software to run at CERN it would have to be ported to Unix. Being connected to the Internet through CERN, Berners-Lee tried his luck at recruiting someone to help do the port. The effort was fruitless, and he ended up porting the server himself. The port would come to be known as the CERN httpd. The d at the end of httpd indicates that the Web server runs as a daemon process, a Unix process that runs in the background.

Berners-Lee did not restrict the use of his software through copyright, nor did he charge for its use; he merely requested to be attributed for his work on the technology. He then posted the source code to several Usenet groups. The immediate response was minor, but over time a substantial community grew around the Web technology. Eventually the community and user base grew so large that Berners-Lee realized he alone could no longer direct and enhance the Web; he needed a way to direct and unify the movement surrounding it. As Berners-Lee did not want to commercialize the technology, he first tried pitching it to the IETF, with no success: his first Internet proposal, the URL, took too long to establish itself through the standardization process. A bit disillusioned, Berners-Lee started looking for other approaches. He made contact with the people who had set up and managed the X Consortium, an open industry consortium that develops and maintains the X Window System graphical user interface. With their help Berners-Lee founded the World Wide Web Consortium, the W3C for short, which was to handle future development of the Web. The W3C is today an industry-wide consortium with participants from major computer companies, educational institutions, governments, and more. It is the maintainer of the Web technology, developing new standards for the Web community. Yet the networking part of the Web (the URL and HTTP) is still handled by the IETF.

A patchy Web server

Several individuals got into Berners-Lee's technology from the start. Unfortunately for Berners-Lee's hopes of spreading the Web, most of these individuals were interested in his technology as a curiosity, something to implement for their own needs. They did not share Berners-Lee's vision of a knowledge network. The early Web browsers were either implemented in obscure programming languages, making them all but impossible to port, or they were dropped like a hot potato once the term project was over. One group that got into the Web early on, always on the cutting edge of Internet technology, was a band of computer enthusiasts at the University of Illinois' National Center for Supercomputing Applications (NCSA). Of these, Rob McCool and Marc Andreessen were the most prominent. They were developing their own Web browser called Mosaic, and their own Web server, the NCSA httpd. Both applications extended and improved Berners-Lee's original Web browser and server. Probably the most visible new feature added by the NCSA group was their browser's multimedia capability. Their Mosaic browser was the first to incorporate both pictures and text in a single Web page.

The original NCSA Web server can be said to be a product of the experimental approach attributed to hackers and hackerdom. It was a relatively good piece of technology, but it was far from bug-free and not a particularly neat piece of programming. Its source code had been released under a non-restrictive license, and the software was in use at an increasing number of commercial and educational Web sites across the United States and Europe. Because of its many bugs and somewhat lacking functionality, a community grew up around the NCSA Web server's code base. It was a community consisting mainly of Web masters, systems operators running their own Web sites, who pooled their individual work to make the NCSA Web server less buggy. They were collaborating to enhance their joint technology, sharing patches that fixed bugs in the original code base and enhanced it with new features.

"A patch is [a] temporary addition to a piece of code, usually as a quick-and-dirty remedy to an existing bug or misfeature. A patch may or may not work, and may or may not eventually be incorporated permanently into the program" [RAYMOND1998](Raymond 1998, p. 349). The issue at stake was that neither of these patches had been integrated with the Web server's code base. This raised two problems. The first was one of installation. Installing the NCSA Web server had become quite a task. First the original source code had to be downloaded. Then it was a question of locating, downloading, and applying the important patches in circulation before compiling source code. At this point the second problem surfaced. As there were a fair amount of patches in circulation, the would-be system operator of an NCSA Web server could never be certain that none of the patches applied were in conflict with each other. Such a conflict would at best be time-consuming to track down, at worst next to impossible as the would-be Web master had little or no knowledge of the internal workings of the NCSA Web server.

This loosely organized group of Web masters was largely ignored by the NCSA development team. By early 1995 the then-current NCSA Web server release, 1.3, had been out for the better part of a year without its original developers showing any interest in updating the source code. None of the many patches circulating on the Internet had been applied to the original code base. Disillusioned by the NCSA developers' lack of interest in the Web server, the group came to organize itself around a mailing list they called new-httpd. It was a forum where they could meet and exchange their patches, cooperating on making the Web server easier to install.

The original NCSA software was beginning to look more and more like a patchwork quilt, and the new-httpd crowd was beginning to realize that it needed an overhaul. Under the existing licensing policy of the NCSA server it was possible for them to use the original NCSA Web server code in their own product without problems. While the initial intention had been to form a community that shared its patches, eliminating redundant work where several people fixed the same bugs, they soon came to realize that they wanted to create their own HTTP server. By February 1995 the patches were piling up; something had to be done. The new-httpd crowd chose to unleash their wry tongues on the original NCSA code base. Just as Unix had once been a pun on its predecessor Multics, the new-httpd crowd called their Web server Apache, a pun based on the fact that the NCSA Web server had become "a patchy" Web server.