The research process

The interpretive approach to computer science is an iteration between fragments and their context [MEYERS1999](Klein and Myers 1999). This has been the main approach taken throughout the work on this thesis. In this part I retrace the steps taken in collecting and interpreting data. Looking back at the work, I was working within two hermeneutic circles: the first circle was understanding what would be required of a case study; the second circle was the Apache project itself.

Choosing a case study

I spent the first two months of working on this thesis looking for appropriate case study candidates. First I identified a large number of interesting projects that could be potential case studies. My primary concern during this phase was that I wanted to study hacking, but I had no clear idea of how best to approach the matter. In practice, I surfed the World Wide Web looking for projects connected with the Linux operating system and the IETF standardization body. After a scant month of work, I had a handful of potential candidates. Among these were the Linux kernel, the Debian GNU/Linux distribution, the IPv6 and LDAP IETF working groups, the KDE desktop environment, the PHP programming language, the Perl programming language, and the Apache Web server, to mention only those that were my favorites at the time.

Once a handful of potential candidates had been identified, I started looking for an approach to selecting the most appropriate case study among them. The approach chosen was to identify criteria a case study needed to fulfill for it to be approachable. This was an iterative process. In each iteration I identified new criteria, and candidate after candidate was excluded. Each time new criteria were formulated, I had to look more closely into the remaining candidates and compare my findings in order to formulate further criteria for my case study. With each iteration I delved deeper into the remaining candidates. In the end I was left with three potential candidates that seemed to fulfill the criteria identified: the Linux kernel, the KDE desktop environment, and the Apache Web server.

The criteria

The criteria used in selecting the right case to study can be split in two. First, there are the criteria directly connected to what I wanted to study; they stem from my initial concern, that I wanted to study hacking in practice. The other criteria are of the pragmatic kind, connected with how good a data foundation a candidate offers for a case study.

The aim of this thesis is to study the process of software systems development within a hacker community. For the best possible historical documentation, I chose to look into a distributed hacker community whose primary mode of communication is e-mail via a mailing list. That way, if the community keeps an archive of its mailing list, I would have a close to complete record of the development process. The first criterion for choosing a case study was therefore that it had to be a hacker community. The second criterion consequently became that the community had to be geographically distributed, and thereby depend on e-mail communication. Not just any distributed team would do, for there had to be an archive keeping complete records of the community's activities. A complete mailing list archive was therefore the third criterion. A fourth, pragmatic criterion was added: the e-mail archives had to be more or less complete, not fragmented, as important bits could not be missing. These four criteria were connected with the traceability of the development effort, requirements the case had to fulfill in order to form a sufficient data basis for a case study.

To be able to study the development effort over time, it became apparent that the mailing list archives had to be of a certain size, stretching over a long period of time. This would be the only way to study how the development team and its practices evolved over time. No fixed time-frame was set for this fifth criterion, but it was implicit that the mailing list archives needed to stretch over several years in order to be of any use.

Next I looked closer at the kind of project I wanted to study. Small-scale projects do not suffer from the same inherent problems as larger, more complex software systems development efforts. This led to a sixth criterion for choosing a case study: the software development effort undertaken by the hacker community had to be significant. No absolute size was set, but it was implicit that the project needed to be larger than any single individual could possibly keep track of on his own. I was looking for a case where total opacity and top-down control were simply not possible due to the sheer size of the project. There was another side to this sixth criterion. To better reflect the reality of the software industry, I was not interested in studying a group of close friends knowing each other well, but rather a large community with the occasional conflict and clash of personalities.

As I also wanted to look at innovation in software systems development, potential candidates had to be more than simply chasing tail-lights, the mere imitation of previous work that Valloppillil [VALLOPPOLLI1998](1998) has observed in a number of hacker projects. The development effort had to push the boundaries of software in a way that could be studied. From this, the seventh criterion gave itself: the software development effort had to be innovative and create new technology. From this followed an eighth criterion, more of the pragmatic kind: the technology being developed had to have a broader appeal. It had to be more than an obscure encryption algorithm, no matter how innovative; the technology had to appeal to an audience outside of the community. Put another way, the case study had to be of such a kind that other people would be interested in reading about it.

The final criterion did not present itself until I had chosen a case study that had to be rejected. The ninth and final criterion: important decisions and arguments had to take place on the list. My first case's mailing list did not reflect the executive decisions, nor how they were being made, and as such proved impossible to pursue. The nine criteria for choosing a case study consequently became:

  1. The case has to be a hacker community.

  2. The community has to be geographically distributed.

  3. The community's prime channel of communication is e-mail via mailing list(s), for which there are archives.

  4. The community's e-mail archives have to be more or less complete. Important or large sections should not be missing.

  5. The mailing list archives must be of a considerable size, spanning a considerable period of time.

  6. The software development effort undertaken by the community has to be significant.

  7. The community must do more than merely chase tail-lights.

  8. The technology developed by the community has to have a broad appeal.

  9. Important decisions and arguments within the community have to be conducted on the mailing list.

The final choices

After going through the potential cases and eliminating those not satisfying the criteria I had drawn up, I was left with three candidates. At this stage I realized it was more a question of making a choice I was comfortable with than coming up with the ideal case study. The only way to see if a case was suited was to make a choice and start working on it. As a Linux user, I felt my only two real alternatives were the Linux kernel or the KDE desktop environment. At the time, KDE was regarded as the killer application that would wrest the desktop hegemony from Microsoft. KDE was a tempting choice because of this, even though it did have certain aspects of chasing the tail-lights of Microsoft's Windows desktop. The Linux kernel, however, looked like a much more interesting case study because of my interest in operating systems. In addition to being a case study of hacking in action, I assumed there were technical sides to the discussion that would prove interesting. The Apache Web server did not tempt me much at the time. It was undoubtedly the most widely used Web server on the Internet, but it had its vocal insider representatives. I very much feared that choosing Apache would render my thesis irrelevant, as some of these more vocal representatives might already have published any results I would find.

My first choice of case study was therefore the Linux kernel. Linux is a hacker project that has received an enormous amount of attention. It is a technically complex project, with contributors from across the globe. There is ample access to a mailing list for Linux developers, so the source material seemed to be in order. Yet, after a period of study it became apparent that the mailing list did not track any major decision-making processes. After a month or so I got in touch with one of the leading Linux developers, David Miller, asking him how the decision-making process was handled within the Linux community. He replied that most of the important decisions were actually being made in private e-mail between a handful of central developers (personal e-mail, October 5, 1999).

It had become apparent that the Linux kernel was not suitable as a case study. It did not fulfill one of the more important criteria: no important decisions were being carried out on the mailing list. While not entirely back at square one, this set me back quite some time, and I had to select a new case. At this time the KDE desktop project was no longer as hot as it had initially been. It was embroiled in a licensing controversy over the Qt widget library. Qt is the core technological basis of the KDE graphical user interface: a commercial cross-platform C++ library for creating GUIs, with a license that allowed the developers of KDE to use it without licensing fees. This, combined with the project's chasing tail-lights factor, put me off. Despite my initial reluctance, I ended up choosing Apache for my case study.

Selecting the material

Based on my initial work on the Linux kernel mailing list, I realized that getting through the sheer amount of e-mail stored in the new-httpd archives would be too time-consuming. I had complete archives of the new-httpd mailing list from March 1995 up to the present. The amount of data made me look for a way of making the archives more accessible. The Apache project stores its mailing list archives as large text files, each file containing all the e-mails sent through the mailing list over the duration of a month. The format made reading the archives extremely hard. To remedy this situation I chose a tool called Hypermail, which processes mailing list archives into a set of Web pages. The amount of e-mail to read was still immense, and I chose to do some initial statistical analysis of the mailing list traffic in order to locate time slices to concentrate my efforts on and to identify possible prime movers within the development group.
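Since the archives are stored in the standard Unix mailbox format, they lend themselves to programmatic processing. As a minimal sketch, assuming one mbox file per month (the file name below is hypothetical), Python's standard mailbox module can walk such an archive:

    import mailbox

    # Hypothetical file name for one month of the new-httpd archive,
    # a single file in Unix mailbox (mbox) format.
    archive = mailbox.mbox("new-httpd.199503")

    for message in archive:
        # Each message carries the usual RFC 822 headers.
        print(message["Date"], "|", message["From"], "|", message["Subject"])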

Limiting the data

The new-httpd archives are complete from the inception of the mailing list in March 1995. To limit the data material, I drew a line at December 1999 and concentrated on the material between these two dates. The choice was influenced by the Apache developers' migration to the Apache2 architecture, an entirely new code base that I had no knowledge of. It seemed like a convenient place to stop, as I had by then tracked the life cycle of the first generation of the Apache Web server.

Hypermail

Hypermail is a mailbox-to-HTML converter with thread support. It takes a file in Unix mailbox format and generates a set of cross-referenced HTML documents. Hypermail is itself free software licensed under the GNU General Public License, available from http://www.hypermail.org. The application creates a single HTML file for every e-mail in the mailbox being processed. Each file contains links to other articles, so that the entire archive can be browsed in a number of ways by following links. Each file generated contains (where applicable):

  • the subject of the article,

  • the name and e-mail address of the sender,

  • the date the article was sent,

  • links to the next and previous messages in the archive,

  • a link to the message the article is in reply to, and

  • a link to the message next in the current thread.

In addition, Hypermail converts references to e-mail addresses and URLs within each message into hyperlinks so they can be selected. E-mail addresses can be converted into mailto: URLs or into links to a CGI mail program.

For every mailbox being processed, Hypermail generates four index pages. Three are index listings sorting the e-mails by date, subject, and author respectively. The fourth index page is chronological, tracing the flow of discussions in a threaded model; it provides a good overview of the discussions taking place. When filtered through Hypermail, each month is represented with a complete index of the messages sent to the list, with links to the individual messages. In addition to providing the number of e-mails sent to the mailing list, the HTML pages generated by Hypermail allow the user to browse the archive by thread, author, and date. It is therefore easy to read off the mailing list traffic (the number of e-mails processed on a monthly basis), the active participants (who sends e-mail to the list), and the major participants (how many e-mails each participant sends).

Statistical analysis of the archives

To get an overview of the mailing list activity, I drew a graph with each month on the X-axis and the number of e-mails posted to the mailing list on the Y-axis. The intention was to gain an indication of the effort spent developing the Web server over time. Iterating between reading the mailing list archives and looking at the graph, I hoped to gain some insight into the ebb and flow of the development effort. The underlying assumption is that the introduction of new technology and/or the emergence of a disputed topic would increase the number of e-mails sent to the list. As such, I could use the graph as a guide for locating interesting points in the history of the Apache project.
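To illustrate the tallying behind such a graph, here is a minimal sketch in Python, again assuming one mbox file per month under a hypothetical naming scheme; matplotlib stands in for whatever plotting tool is at hand:

    import glob
    import mailbox
    import matplotlib.pyplot as plt

    # Assumed naming scheme: one mbox file per month, e.g. new-httpd.199503.
    months = sorted(glob.glob("new-httpd.*"))
    counts = [len(mailbox.mbox(path)) for path in months]

    # Month on the X-axis, number of e-mails on the Y-axis.
    labels = [path.split(".")[-1] for path in months]
    plt.plot(range(len(months)), counts)
    plt.xticks(range(len(months)), labels, rotation=90)
    plt.xlabel("Month")
    plt.ylabel("E-mails posted to new-httpd")
    plt.show()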

I also generated yearly graphs for better granularity. The graphs can be found in the appendix.

To gain an impression of the distribution of activity within the Apache group, I drew a diagram showing each individual developer's activity, measured in the number of e-mails sent to the mailing list per month. Using the total number of e-mails sent to the mailing list that month, a figure provided by Hypermail's index listings, I calculated what percentage of the total traffic each participant contributed. Together with a closer reading of the archives, I hoped this would help me single out the prime movers within the community. The assumption behind the diagram is that active participants in the development effort will contribute more on the mailing list. As it turned out, this assumption is somewhat wrong. The diagram favors vocal participants, and a closer reading of the mailing list would show that there is not necessarily a correlation between being vocal and the amount contributed. The diagram did prove useful as an initial indication of the community's prime movers.

To make these pie diagrams more easily readable, I chose to reduce the number of entries by collecting all developers who had posted fewer than 10 messages to new-httpd within the space of a month into a separate category, Other. I also hoped this would indicate whether active participants stuck with the development effort, or whether there was a rapid turnover of developers. It could indicate whether there is a core of developers staying throughout the entire process, with other key developers joining the team for shorter or longer periods of time. Such an analysis could give me an indication of how knowledge is kept in the development team: whether it is kept there through tradition, passed down by key members, or through some other mechanism.
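As an illustration of this tallying, a minimal Python sketch using the same hypothetical monthly mbox file as above. Note that in practice the same developer may post from several addresses, so the sender names would need normalization:

    import mailbox
    from collections import Counter

    archive = mailbox.mbox("new-httpd.199503")  # hypothetical file name

    # Tally messages per sender for the month.
    per_author = Counter(message["From"] for message in archive)
    total = sum(per_author.values())

    # Collect everyone with fewer than 10 messages into "Other".
    shares = {"Other": 0}
    for author, count in per_author.items():
        if count < 10:
            shares["Other"] += count
        else:
            shares[author] = count

    for author, count in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{author}: {count} messages, {100 * count / total:.1f}% of traffic")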

A corrective to the analysis above is the number of discussion threads each developer has started per month. The assumption is that the most active participants are those who initiate discussions, and that the number of threads a single participant has started can provide an indication of that participant's effort in pulling the process forward. This analysis will always have to depend on the results of the mailing list traffic analysis, as an implicit assumption here is that a low volume of traffic indicates a loss of interest among the participants. As with the total messages per month figure above, the total number of e-mails sent per individual developer does not give an accurate description of each developer's activity in the project. However, combined with the number of threads started, a fairly accurate picture of who contributes may be formed.
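A rough way to count thread starters, under the simplifying assumption that a message carrying neither an In-Reply-To nor a References header opens a new thread:

    import mailbox
    from collections import Counter

    archive = mailbox.mbox("new-httpd.199503")  # hypothetical file name

    # A message that replies to nothing is taken to start a new thread.
    starters = Counter(
        message["From"]
        for message in archive
        if message["In-Reply-To"] is None and message["References"] is None
    )

    for author, threads in starters.most_common():
        print(f"{author}: {threads} thread(s) started")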

These pie diagrams can be found in the appendix.

Choosing material of interest

The statistical analysis never yielded the expected result of crystallizing certain periods as particularly interesting. It did show a steady increase in traffic on the new-httpd mailing list, though. The only real hint the analysis gave me was a slump in activity from May to July 1995. The significance of this slump did not occur to me until fairly late in the work on this thesis; at first I disregarded it. Instead I chose a new, two-angled approach to the mailing list material. The first angle into the material was based on getting an overview of the mailing list archives. I simply started skimming through the monthly indexes generated by Hypermail, starting with March 1995. I looked closer into topics that seemed particularly interesting, especially long discussions (measured in the number of e-mails in the discussion thread) and recurring subjects. The assumption is that both recurring topics and long discussions are significant to the project participants, as they put a lot of effort into them. By making notes and rereading older archive indexes, this approach provided me with a number of episodes and topics that proved interesting.
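Thread length can also be approximated programmatically. A minimal sketch, under the simplifying assumption that messages sharing a subject line (with any "Re:" prefixes stripped) belong to the same thread, and using the same hypothetical file name as above:

    import mailbox
    import re
    from collections import Counter

    archive = mailbox.mbox("new-httpd.199503")  # hypothetical file name

    def normalize(subject):
        # Fold "Re: Re: topic" and "topic" into the same thread key.
        return re.sub(r"^(re:\s*)+", "", (subject or "").strip(), flags=re.IGNORECASE)

    # Thread length approximated by the number of messages sharing a subject.
    threads = Counter(normalize(message["Subject"]) for message in archive)

    for subject, length in threads.most_common(20):
        print(f"{length:4d}  {subject}")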

The second angle into the material was more selective. Based on the project time-line, I looked closer into the periods around major releases. In addition I used a clue provided by the article Open Source as a Business Strategy [BEHLENDORF1999](Behlendorf 1999), which briefly mentions a controversy between America Online and the Apache group during December 1996. The assumption here is that these are periods requiring decision-making: for a release, decisions have to be made as to what goes into it, and the AOL controversy would have required the group to consider alternative approaches to the problem.

In order to get an inside view of a hacker community, I arranged two formal interviews with Stig Bakken, a prominent member of the PHP Core Team. PHP is a very successful hacker project, and Bakken is a local hacker. Since I already knew him, I set up the two interviews during the first six months of working on the thesis. The aim of these interviews was to break into the hermeneutic circle and find an approach to the material at hand. Since PHP also comes with a plug-in module for Apache, Bakken knew something about the Apache project.

While I do not cite Bakken anywhere in the thesis, his influence on my understanding of the Apache hacker community is significant. I do not agree with some of the views he expressed during the interviews, but they helped make me aware of issues that I would otherwise have ignored.

Based on my two angles into the material, and supported by the interviews with Stig Bakken, I collected a number of episodes. At first these episodes spanned from March 1995 to the end of 1999. As work progressed, I started limiting my scope, and episodes were removed one by one. This process of elimination was iterative, in that episodes were taken out and later reinserted. The process of selecting which episodes to use continued right up to the end of the work on this thesis. The final choice was a result of the dynamics between data and theory. I leave it at that for now, as the role of theory is discussed later in this chapter.

The final choice of episodes is actually limited to a very small time slice, from March 1995 until August of the same year. Central here is the slump in activity that showed up in the initial statistical analysis. It would prove central in that the slump indicates a breakdown in the Apache project. With the final choice of episodes I end up breaking two of the criteria set down when choosing the case study. While initially concerned with studying the case over a long period of time, I end up with a time slice of less than half a year. During this period the development group is still fairly small, counting less than twenty individuals with only a handful of active participants. It turned out that these criteria were not necessary to make the point I was getting at.