Comments on: Murphy’s Law of Co-occurrence Constraints

By: bloodnok

bloodnok — Wed, 17 Jan 2007 01:41:01 +0000

cool! and it corrected the &-#-8-2-1-7-; in my prior comment …

By: bloodnok

bloodnok — Wed, 17 Jan 2007 01:18:34 +0000

one of the oddities of your word press installation is its insistence on using extended-iso characters where a simple apostrophe would do. the result is on your main page toc, this topic is “Murphy’s Law of Co-occurrence Constraints”. it’s like this with firefox on osx (both ppc and intel) and win98 ($deity help us) as well as internet exploder 6 (on the same inept o/s).

By: Rick Jelliffe

Rick Jelliffe — Mon, 01 Jan 2007 14:29:04 +0000

So, if there is indeed a tradeoff between the ease of declarative languages and the generality of procedural or functional languages (and assuming that Schematron is in the former rather than the latter category), then the question becomes, should a declarative schema language that handles co-occurrence constraints be biased towards completeness (expressiveness) or familiarity?

With Schematron, I strongly adopted the bias to familiarity: even though XPaths (especially XPath 1) is limited, the issue of ease-of-use is the core issue for schema languages, especially ones that fit in the vague layer (to many people) between what grammars can do and what Java or .NET or Python can do. (Contrast it with an approach of, say, building a prolog-like rule system for validation.)

The nice side-effect is that by keeping the declarative language extremely simple, we then have room to model the kinds of systematic variations in documents (in Schematron, using phases, diagnostics and abstract patterns) without causing brain explosions.

Another aspect to streaming evaluation of schemas is the issue of fail-fast. Sometimes what is required of validation is for the document to be failed as soon as any error is found. It is perfectly possible to detect in a path-based constraint language (such as Schematron) which constraints required random access and which constraints can be evaluated in a streaming fashion. And to evaluate the streaming ones on the fly through SAX and STAX (or the equivalent) while deferring the random access constraints to a random-axis parse. See http://www.topologi.com/resources/pdfs/SchematronHeuristic.pdf for an example, sorry about the PDF.

By: Eliot Kimber

Eliot Kimber — Thu, 21 Dec 2006 20:15:53 +0000

The other big constraint language I’ve worked with a lot is STEP (ISO 10303), which is designed to represent complex system of mostly mechanical parts (but we were trying to apply it to SGML-structured content and also to the abstract structures of HyTime groves). One of the things that STEP includes is the notion of “population constraints”, which is really another way to say “co-constraints”. STEP’s solution was to include an expression language that just let you directly say things like “if the card is AmEx then the number must start with 3 and you can have no more than four instances of the blarg subcomponent”. You run into the same problem using UML to do data modeling–at some point you need to use algorithms to express the full constraints imposed by the model.

That can lead to the desire to then create a standard expression language, but at least in the pragmatic world of XML processing, that hardly seems worth the effort given a small set of widely-used languages (XSLT, Java, Python, Perl, C#, JavaScript) that are all (with the possible exception of Perl) more or less easy to understand from the point of *documenting* the rules (as opposed to implementing the rules).

I also find it interesting that in Marinelli’s paper they said that they didn’t try to implement validation of constraints that flow backward from later elements to earlier elements, for the simple reason that in the worst case it requires you to buffer the entire document tree before reporting anything, at which point you’ve lost all the value of a streaming approach. Which sort of calls into question the whole utility of trying to apply co-constraints in that environment at all.

That is, it seems always come down to the fact that some part of your validation has to happen in the processor after whatever validation you can do declaratively, so there just doesn’t seem to be a lot of value in trying to provide incremental improvements to the expressiveness of declarative constraint languages.

By: John Cowan

John Cowan — Thu, 21 Dec 2006 15:29:19 +0000

Right, exactly. So custom types aren’t sufficient for this: they can only validate each type independently. Instead you need a RELAX NG schema containing a rule like this:

citystatezip =
element city {“New York”}, element state {“NY”}, element zip {“10001”} |
element city {“New York”}, element state {“NY”}, element zip {“10002”} |
element city {“New York”}, element state {“NY”}, element zip {“10003”} |
element city {“Beverly Hills”}, element state {“CA”}, element zip {“90210”} |
…

and so on for all 50,000 or so possibilities. There’s nothing actually *wrong* with such a schema, except that it may overflow validation tools.

Schematron, though, can probably do the trick both cleanly and correctly with a clever use of the XSLT document funtion.

By: Elliotte Rusty Harold

Elliotte Rusty Harold — Thu, 21 Dec 2006 12:44:27 +0000

By writing the zip code verification as a custom type implemented in Java, you could defer the finite enumeration to a database or web service. You wouldn’t construct a schema. You’d just talk to a database. The problem is that verifying the details requires content from two separate elements.

By: John Cowan

John Cowan — Wed, 20 Dec 2006 14:42:39 +0000

The reason RELAX NG can’t handle the last case is not that it requires access to external data, but that the finite enumeration involved (of all valid city/state/zip triplets) is just too fardling big. There are something like 43,000 zipcodes in the U.S., and there is often more than one city name that they validly map to (see the Wikipedia article for details). You could (mechanically) construct a schema that represents that mapping, but I have no idea how long it would take to match against it.

BTW, Discover card numbers have a first digit of 6, and Diners Club card numbers assigned outside North America have 3. Full details at http://en.wikipedia.org/wiki/Credit_card_numbers .