SourceForge for the 21st Century

Lately I’ve been thinking a lot about continuous deployment for reasons I’m not quite yet at liberty to disclose. This has inspired me to improve the XOM release process, to make it more of a one click process, or, to be more accurate, a one ant target process. I can now release a new version simply by typing:

$ ant -Dpassword=secret -Dwebpassword=other_secret release

This not only builds the entire project; it also tags the release in CVS, uploads the zip and tar.gz files to IBiblio, and uploads the documentation to my web host. It doesn’t yet file a bug to upload the Maven artifacts, but I’m working on that.
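
For concreteness, here is a rough Python sketch of the kind of steps such a one-command release automates. The host names, paths, CVS module, and version number are hypothetical placeholders, not XOM’s actual build configuration, and the sketch assumes CVSROOT and ssh keys are already set up.

import subprocess

def run(*command):
    """Run one step of the release and stop everything if it fails."""
    subprocess.check_call(command)

def release(version):
    tag = "RELEASE_" + version.replace(".", "_")
    run("ant", "clean", "dist")               # build the jars, archives, and documentation
    run("cvs", "rtag", tag, "project")        # tag the release in the CVS repository
    for archive in ("project-%s.zip" % version, "project-%s.tar.gz" % version):
        run("scp", "build/" + archive, "user@downloads.example.org:/public/downloads/")
    run("scp", "-r", "build/apidocs", "user@www.example.org:/var/www/project/docs/")

if __name__ == "__main__":
    release("1.3.0")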

During the process of setting this up, I realized that my organization is a little backwards. In particular, I’m pushing all the artifacts from my local system. Instead, I should merely be committing everything to the source code control repository; tagging a release; and then having the further downstream artifacts like the zip and tar.gz files and documentation pulled from source code control onto the Web servers.

There are some commercial products that are organized like this, including ThoughtWorks’s Cruise, but none of the major open source hosting sites such as SourceForge and java.net work like this. Certainly, SourceForge and similar sites have been major contributors to the open source revolution. They have enabled hobbyist developers working in their garages to use tools and techniques of software development that were previously limited to corporations. They have enabled far-flung developers around the world to collaborate with each other far more effectively than they could by e-mailing each other tar files. They have removed the burden of system administration from many programmers, thus enabling them to devote more time to writing code. Make no mistake: SourceForge et al. are a real force for good in the community.

That said, the state of the art in software development has moved forward significantly since these sites were founded. CVS has mostly been replaced by Subversion. On some projects, Subversion has in turn been replaced by distributed version control systems such as Git and Mercurial. Unit testing and test-driven development have moved from extreme practices to standard operating procedure. Continuous integration using products like Hudson and CruiseControl is routine. Nonetheless, most project hosting sites still offer little beyond a source code repository, a bug tracker, and some webspace. Not that those things aren’t important, but we can do so much more.

It’s time to think about what a modern project hosting site might want to offer and what it might look like.

Continuous Integration

The first step forward, and possibly the hardest, is to add continuous integration capabilities to the existing project hosting repositories. Every time code is checked into SourceForge or java.net or code.google.com, the project should be built and the tests should be run. Technically, the hard part is understanding every project’s unique build infrastructure. Some projects use ant; some use make; some use maven; and some roll their own. Maven is probably the most constrained of the lot. For the others, it will be necessary to ask project owners which targets to run for which tasks. It’s probably a good idea to auto-generate basic ant or maven or make scripts for new projects.
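
As a sketch of how a hosting site might cope with that variety, it could keep a small per-project descriptor that maps each task to the commands the owner says to run. Everything below, from the target names to the dispatch table, is a hypothetical illustration rather than any existing site’s configuration.

import subprocess

# What a project owner might declare when registering a project.
# The target names are only examples; real projects vary widely.
BUILD_COMMANDS = {
    "ant":   {"build": ["ant", "compile"], "test": ["ant", "test"]},
    "maven": {"build": ["mvn", "compile"], "test": ["mvn", "test"]},
    "make":  {"build": ["make"],           "test": ["make", "check"]},
}

def run_continuous_integration(checkout_dir, tool, overrides=None):
    """Build and test a freshly checked-out project with its declared tool."""
    commands = overrides or BUILD_COMMANDS[tool]
    for phase in ("build", "test"):
        subprocess.check_call(commands[phase], cwd=checkout_dir)

# For example: run_continuous_integration("/srv/checkouts/someproject", "ant")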

Beyond merely building the code, running the tests has some very serious security implications. Currently project hosting sites do not run third-party code. They store it; they display it; they make it available and bundle it up as tar and zip files; but they don’t even compile it, much less run it. Running arbitrary third-party Java and C code submitted by any random teenager with attention deficit disorder somewhere on the Internet is begging for trouble. Ten years ago I would’ve thought this was insane and impossible. But now, just maybe, we can do it.

In fact, there are several services on the Internet today that will run arbitrary third-party code for all comers. Amazon’s EC2 service lets anybody with a credit card run what amounts to a complete rooted Linux box on Amazon’s network. Google’s AppEngine lets more or less anyone, credit card or no, run Python and Java code inside Google’s cloud. And these are hardly the only such services. Advances in virtualization and security sandboxing have made this possible. That said, it certainly helps to have a real user attached to any code that you run so you know who to blame when it starts spamming the world. However, when an application goes rogue, whether through malice or incompetence, it is possible to shut it down quickly. It is possible to limit the resources used by any one test suite, and to limit what else it can see on the same filesystem and the same network.
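
One building block for that kind of confinement, assuming a Unix host, is simply capping the CPU time and memory available to an untrusted test run. This is only a sketch of that one layer; real isolation would add a jail, a virtual machine, or a service like EC2 on top of it, plus a wall-clock timeout and network restrictions.

import resource
import subprocess

def limit_resources():
    # Ten minutes of CPU time and one gigabyte of address space for the
    # test process; the limits are inherited by anything it spawns.
    resource.setrlimit(resource.RLIMIT_CPU, (600, 600))
    resource.setrlimit(resource.RLIMIT_AS, (1 << 30, 1 << 30))

def run_untrusted_tests(checkout_dir):
    """Run a project's tests with hard resource limits applied in the child process."""
    return subprocess.call(
        ["ant", "test"],
        cwd=checkout_dir,
        preexec_fn=limit_resources,  # runs in the child after fork, before exec
    )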

Codehaus uses Atlassian Bamboo to provide continuous integration, including test running, for their projects. However, they’re a relatively small site that’s somewhat picky about the projects they host. They do use a separate server for the continuous integration. I’m not sure what other security precautions, if any, they put in place. Launchpad builds Ubuntu packages, but I’m not sure if they run tests. JavaForge builds Java code and runs the unit tests, apparently on top of Amazon EC2. Assembla will build and run tests, and also uses Amazon EC2. Both thereby delegate some of the security issues to Amazon’s virtualized systems.

Submit Queue

Once we’ve solved the problem of running continuous integration servers on project hosting sites, the next step is to flip them around. The usual process is to commit code to the repository and have the continuous integration server pull the code out of the repository. Then, if the build or tests fail, the continuous integration server goes into red mode and sends out alerts. Wouldn’t it be better if the server never turned red in the first place?

What should happen is that new code gets sent directly to the continuous integration server rather than to the source code repository. The continuous integration server pulls the latest known good build from the repository. Then it patches the new code into that build and runs the tests. If the tests pass, the continuous integration server commits the code to the repository. If the tests fail, the code is never committed at all.
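
Here is a rough sketch of that loop, assuming a Subversion repository, an Ant test target, and a plain patch file; no hosting site actually prescribes these particular tools.

import subprocess
import tempfile

def try_to_commit(repository_url, patch_file, log_message):
    """Apply a submitted patch to the last known good code and commit only if the tests pass."""
    workspace = tempfile.mkdtemp(prefix="submit-queue-")
    subprocess.check_call(["svn", "checkout", repository_url, workspace])
    with open(patch_file) as diff:
        subprocess.check_call(["patch", "-p0"], stdin=diff, cwd=workspace)
    try:
        subprocess.check_call(["ant", "test"], cwd=workspace)
    except subprocess.CalledProcessError:
        return False                 # the build or tests failed; the repository is never touched
    subprocess.check_call(["svn", "commit", "-m", log_message], cwd=workspace)
    return True

A production version would also have to handle files the patch adds or deletes, serialize concurrent submissions, and clean up failed workspaces, but the essential property is that the trunk only ever contains code that has passed the tests.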

So far as I know, no current project hosting site offers this, and it’s a relatively uncommon feature even among self-hosted projects. However, it’s a critical one, especially when accepting contributions from the wide world of programmers, not all of whom have yet learned the importance of test-driven development. I suppose such a site could also perform other checks on the source code. For example, it could verify coding conventions or measure the incremental code coverage before and after the check-in. It could automatically reject any patches that did not meet some predetermined measures of quality. That said, automated checks tend to be better used as additional data for humans to evaluate rather than as hard and fast rules. One way this can happen is by offering code metrics to code reviewers. This brings us to the next improvement in the code hosting ecosystem.

Code Reviews

Committing code, even assuming all the tests pass, is still a serious operation. Most open source projects don’t want to allow just anyone to commit code willy-nilly. Usually there’s a core group of committers that reviews all incoming patches and decides whether to accept them, reject them, or send them back for further work. This is somewhat labor-intensive for both the reviewer and the reviewee.

However, if we move to a submit queue-based system, this can become somewhat more straightforward. The continuous integration server can check every incoming patch regardless of the submitter’s status. If the tests pass, it can send an automatic request for review to a project committer. If the committer approves the change, then the continuous integration server can commit it to the source code control repository.
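
Under a submit queue like the one sketched earlier, the review becomes one more gate between the test run and the commit. The helper functions below are placeholders for whatever review tool a site integrates, not a real API.

def handle_submission(patch, tests_pass, request_review, is_approved, commit):
    """Route a patch through the tests and a human review before it reaches the repository."""
    if not tests_pass(patch):
        return "rejected: the build or tests failed"
    review = request_review(patch)   # automatic request for review to a project committer
    if is_approved(review):
        commit(patch)                # the continuous integration server, not the author, commits
        return "committed"
    return "sent back for further work"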

Indeed, it’s probably a good idea to require code reviews for all submitted changes, not just those from new users. After all, it’s not like the project’s owners are immune from introducing bugs. In fact, they probably introduce more than anybody else, if for no other reason than that they commit more code than anyone else. Code reviews are well known for increasing the quality of a code base and avoiding stupid errors, yet they’re one of the lesser-used software development practices among open-source programmers. It’s time for that to change. Web-based code review interfaces such as Guido van Rossum’s Rietveld have the potential to really move the community forward here. We should integrate this technology or something equivalent into project hosting sites. code.google.com already offers code review, and a few others like BitBucket do too. The rest should follow.

One-Button Deployment

The final stage of software development is deployment. Eventually the software has to ship to and be installed by its intended users. Here is one area where open source projects have a significantly easier time than a lot of commercial projects, especially enterprise projects. The deployment process for many open source projects consists of little more than uploading a few jar files and some documentation to the right directories on the right Web servers. This should become a one-button operation.

All project owners should have to do to release a new version is choose a version number and push a button. The server should pull all the code, documentation, and configuration information out of the source code repository; build everything; and put all the finished artifacts in the right locations. No further manual work should be required. This does require that absolutely everything needed to release goes into the repository: not only code but also HTML files, images, config files, and more. The only things that don’t go into the repository are the artifacts that are built from these components: jar files, zip files, Javadoc, etc.
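
A server-side sketch of that button might look something like the following, assuming a Subversion repository with a conventional tags directory and an Ant dist target; the layout and paths are placeholders.

import shutil
import subprocess
import tempfile

def release(repository_url, version, web_root="/var/www/project"):
    """Check out a release tag, build it, and publish the artifacts, with nothing taken from a developer's machine."""
    tag_url = "%s/tags/%s" % (repository_url, version)
    workspace = tempfile.mkdtemp(prefix="release-")
    subprocess.check_call(["svn", "checkout", tag_url, workspace])  # code, docs, config: everything
    subprocess.check_call(["ant", "dist"], cwd=workspace)           # jars, zips, Javadoc
    shutil.copytree(workspace + "/build/dist", web_root + "/downloads/" + version)
    shutil.copytree(workspace + "/build/apidocs", web_root + "/apidocs/" + version)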

Maven comes close to this, but it still builds and deploys from a local system rather than from the version control repository. This should be turned around. Ideally, maven deploy should work with nothing more than a pom.xml file. Deploying shouldn’t need to access the local maven repository or the local copy of the source code at all.

Summary

There might even be a startup idea in here somewhere. Open source projects aren’t the only ones that would like to offload some of the routine system administration tasks involved in running source code control repositories, continuous integration servers, bug trackers, and deployment pipelines. More likely, what’s really needed are some tweaks in and additions to the existing project hosting services. Or perhaps we can even take advantage of the advances in virtualization technology to install these services on top of Amazon EC2 and similar platforms.

But one thing is for certain: if open source projects are to keep pace with and surpass closed systems, then their software development practices need to be at least as good as, and probably better than, the state of the art in the overall software development community. In order to do that, it’s time to upgrade our tools.

5 Responses to “SourceForge for the 21st Century”

  1. James Abley Says:

    Excellent post. I think Hadoop has elements of this [1] – a patch is submitted to JIRA and Hudson tries it out, reporting on the quality of the patch.

    [1] http://1060.org/blogxter/entry?publicid=702D3B2007CC75DD4C3D4E73F4DD5390&token=

  2. Assaf Says:

    Check the ecosystem around Git.

    I use Github which offers Web hooks for integrating with 3rd party services. I have one that kicks off continuous integration (RunCodeRun) and another that kicks off code complexity analysis/coverage report (Devver) each time I push a change. Third one notifies rdoc.info, which pulls the changes to update online documentation.

    On big projects, I recommend each developer work on their own fork. To get (fully tested) changes into the master repository, the developer makes a pull request which shows up in the fork queue. There’s an awesome UI for conducting code review on these changes before accepting them.

    Services like Heroku allow you to deploy your application by simply pushing it into a designated Git repository. I use Capistrano (self-hosted) to similar effect.

  3. Joerg Says:

    FYI – http://www.projectlocker.com has a free plan (5 users and 500MB) with source repo, wiki … there is also a continuous build/test system available and auto deployments, but I guess the latter features have monthly costs, i.e. not included with the free plan ;-(

  4. mike Says:

    Hi,

    Great article Elliotte. Now you are talking about software and how to make good software 🙂
    I have my project on sourceforge and I also wanted a quick release process when I have changed the code. Do you mind posting your ant script?

    //mike

  5. Ali Says:

    Elliot Rusty Harold,

    You say that “SourceForge et al. are a real force for good in the community.”

    Since a couple of months ago, sourceforge has been blocking access to downloads from its entire domain for countries that are embargoed by America. The countries include Iran, Syria, North Korea, Iraq (after seven years of occupation by America and the supposed transformation of that country into a modern democracy, whatever that is), and Sudan, and …

    As such I am not able to download your XOM (which I used to do) and everything else GPL or LGPL or otherwise, as long as it resides on sourceforge. There is nothing in the GPL and its derivatives that mentions American embargoes or the assortment of American-designated “Hitlers” of one sort or another; still, sourceforge has gone ahead and blocked access. That does not make sourceforge a force for good. It makes it a force for sending the GPL’ed programs that are made and donated with the best of intentions down into the black hole of some misbegotten American policy, thereby becoming a tool of those who have brought death to over a million Iraqis and destruction and misery to untold millions in Iraq and Afghanistan and all over the world.

    Certainly, sourceforge did not wish it to work like this. However, what is more certain is that those, including you, who have donated their code to the public at large, did not wish their code to be treated like this.

    Or did you? 🙂