Spam, Spam, Spam, Spam, Spam, Spam, Spam, Spam, Spam, Spam, … Spam!

Tuesday, November 30th, 2004

The Cafes seems to be off and running. There were a few initial glitches that I have now cleaned up. Today’s project is to make the staging server work enough like the production server that I can use it for testing and debugging without affecting the production server. Yesterday I got stymied by a slight difference in how the PHP engines were configured. (The staging server didn’t have libtidy support that the site relies on heavily.)

At least three people tried to post with fictional or nospam e-mail addresses. Sorry. That won’t work. Anonymous posters are not supported. You must supply a valid e-mail address at least once to post, and it will be verified. It’s sad, but the issue users have raised most consistently is an unwillingness to provide an e-mail address due to fear of spam and worm droppings. While I hate spam as much as the next person, I am loath to break a useful feature like mailto links just to avoid spambots. It’s the wrong solution to the problem. I am a big fan of spam filters, including realtime black hole lists. If you’re not using them, you should be. If your ISP isn’t using them, you should find a new ISP. But in the meantime, I do wonder if there might be a middle ground that confuses spambots, Microsoft worms, and other venomous spiders without putting any noticeable roadblocks in the path of legitimate users.


On Iterators and Indexes

Monday, November 29th, 2004

Here’s a neat little trick Wolfgang Hoschek showed me. When iterating across a list, if you don’t care about the order in which you iterate, why not do it backwards? Like so:

for (int i = list.size(); --i >= 0; ) { /* process list.get(i) here */ }
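For readers who want to try the loop, here is a minimal, self-contained sketch. The class and method names (ReverseIteration, visitBackwards) are made up for illustration. The excerpt above stops before giving a rationale, so the comment about size() being evaluated only once is my assumption about the appeal, not something the post states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ReverseIteration {

    // Returns the elements in the order the backwards loop visits them.
    // list.size() is evaluated once, in the initializer; the loop test
    // only compares the decremented index against zero.
    // (That this is the motivation is an assumption, see the text above.)
    static List<String> visitBackwards(List<String> list) {
        List<String> visited = new ArrayList<>();
        for (int i = list.size(); --i >= 0; ) {
            visited.add(list.get(i));
        }
        return visited;
    }

    public static void main(String[] args) {
        // Visits index 2, then 1, then 0.
        System.out.println(visitBackwards(Arrays.asList("a", "b", "c")));
    }
}
```

An empty list is handled without a special case: the index starts at 0, the pre-decrement makes it -1, and the body never runs.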



Monday, November 29th, 2004

Most current XML APIs fall into one of two broad classes: event-based APIs like SAX and XNI or tree-based APIs like DOM, XOM, and JDOM. Most programmers find the tree-based APIs easier to use, but such APIs are less efficient, especially with respect to memory usage. An in-memory tree tends to be several times larger than the document it models. Thus tree APIs are normally not practical for documents larger than a few megabytes in size or in memory-constrained environments such as J2ME. In these situations, a streaming API such as SAX or XNI is normally preferred. A streaming API uses much less memory than a tree API since it doesn’t have to hold the entire document in memory at the same time. It can process the document in small pieces. Furthermore, streaming APIs are fast. They can start generating output from the input almost immediately without waiting for the entire document to be read. They don’t have to build excessively complicated tree data structures they’ll just pull apart again into smaller pieces. However, the common streaming APIs like SAX are all push APIs. They feed the content of the document to the application as soon as they see it, whether the application is ready to receive that data or not. SAX and XNI are fast and efficient, but the patterns they require programmers to adopt are unfamiliar and uncomfortable to many developers.
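To make the push model concrete, here is a small sketch using the SAX classes that ship with the JDK. The class and method names (SaxPushSketch, NameCollector, elementNames) are hypothetical. The application only registers a handler; the parser decides when startElement is called.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxPushSketch {

    // The handler is a passive observer: the parser pushes each
    // start-tag event into it as soon as the tag is read.
    static class NameCollector extends DefaultHandler {
        final List<String> names = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes) {
            names.add(qName);
        }
    }

    // Parses the document and returns element names in document order.
    static List<String> elementNames(String xml) {
        NameCollector handler = new NameCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Checked parser exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
        return handler.names;
    }

    public static void main(String[] args) {
        System.out.println(elementNames("<a><b/><c/></a>"));
    }
}
```

Note that any state the application needs between events has to live in fields of the handler, which is part of what makes this style feel unfamiliar.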

Pull APIs are a more comfortable alternative for streaming processing of XML. A pull API is based around the more familiar iterator design pattern rather than the less well-known observer design pattern. In a pull API, the client program asks the parser for the next piece of information rather than the parser telling the client program when the next datum is available. In a pull API the client program drives the parser. In a push API, the parser drives the client.

Just about two years ago, I wrote an article discussing what until now has been the preeminent pull API, XMLPULL. This article identified a number of problems with XMLPULL. The last two paragraphs of that article summed up:

These problems are not casual bugs. They are deliberate design decisions, based on a desire to reduce the footprint of XMLPULL to the minimum possible for J2ME environments. None of these problems are likely to be fixed in the future. The trade-offs made in the name of size may be acceptable if you’re working in J2ME. They are completely unacceptable in a desktop or server environment. Thus XMLPULL seems destined to remain a niche API for developers seeking efficiency at all costs.

Nonetheless, there are some interesting ideas here. Most importantly, the problems I’ve identified stem from implementation issues, not from anything fundamental to a pull-based model for XML processing. A future pull-API that learned from XMLPULL’s mistakes could easily become a real alternative to SAX and DOM.

Now it’s two years later, and I am very pleased to report that the next generation API is here. BEA Systems, working in conjunction with Sun, XMLPULL developers Stefan Haustein and Aleksandr Slominski, XML heavyweight James Clark, and others in the Java Community Process are on the verge of releasing StAX, the Streaming API for XML. StAX is a pull parsing API for XML that avoids most of the pitfalls I noted in XMLPULL. XMLPULL was a nice proof of concept. StAX is suitable for real work.
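As a rough illustration of the pull style, the sketch below uses the javax.xml.stream package as it was eventually standardized and bundled with the JDK. The API was not yet final when this was written, so draft versions may differ in detail. The class and method names (StaxPullSketch, elementNames) are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxPullSketch {

    // Returns element names in document order. The client drives the
    // parser: nothing is read until the loop asks for the next event.
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            // Wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(elementNames("<a><b/><c/></a>"));
    }
}
```

Because the loop belongs to the client, it can stop early, skip events, or keep its state in ordinary local variables, just as it would with any other iterator.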


Welcome to The Cafes

Wednesday, November 17th, 2004

Welcome to The Cafes, a new site from Elliotte Rusty Harold for content that’s longer than a blog posting and shorter than a book.

This site is currently alpha at best. While I encourage you to bang on it, please do not expect it to be stable, at least not yet. As The Mythical Man-Month taught us, “Plan to throw one away. You will anyway.” This is the one I plan to throw away. If you post really long, well thought-out comments, please, please, please save your work on your local system first. I make no promises that early comments will be preserved here or even accepted into the database in the first place. The comments system is held together by duct tape and spackle (and PHP, and MySQL, and a few other tools). Even if it actually works for long enough for a comment to be posted, the likelihood I’ll be able to preserve all those comments over a period of years or even weeks is not good. I expect I will need to do major reengineering on the back end sooner rather than later that may involve throwing it out and starting over from scratch.