Murphy’s Law of Co-occurrence Constraints
Co-occurrence constraints are a perennial topic at XML conferences because the usual schema languages (DTDs, W3C Schemas, RELAX NG) can’t handle them. Consequently they’re a fertile source of papers like XML 2006’s keynote from Paolo Marinelli on Co-constraint Validation in a Streaming Context.
However, I mentioned in hallway conversation that I wasn’t sure how common or necessary co-occurrence constraints really were. In fact, I didn’t think I’d ever found one in the real world. Naturally, two days later I stumbled across several of them in a very common real-world example.
I was putting together a schema for order information for an online store. I’m sure you’ve seen dozens, probably hundreds, of these. One piece of this is the credit card information, for which a typical element looks like this:
<CreditCard>
  <Name>Elliotte Rusty Harold</Name>
  <Number>5123 4567 8901 2345</Number>
  <Type>Mastercard</Type>
  <CVV2>314</CVV2>
  <Expiration>2007-01</Expiration>
  <Address1>6 Metrotech Center</Address1>
  <Address2>Dept. of Computer Science</Address2>
  <City>Brooklyn</City>
  <State>NY</State>
  <Zip>11201</Zip>
</CreditCard>
Now imagine we want to validate that. There are actually several co-occurrence constraints just in those ten fields:
- If the card type is American Express, then the first digit of the card is 3
- If the card type is Visa, then the first digit of the card is 4
- If the card type is Mastercard, then the first digit of the card is 5
- If the card type is American Express, the security code (CVV2) is four digits; otherwise it’s three digits.
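These rules are trivial to state imperatively, which is exactly what makes their absence from grammar-based schema languages so irritating. A minimal Python sketch (the function name and error messages are mine, not from any schema tooling) of just the type/prefix and CVV2-length rules:

```python
# Check the co-occurrence constraints listed above for one <CreditCard> record.
def check_card(card_type, number, cvv2):
    """Return a list of constraint violations (empty list means valid)."""
    errors = []
    first_digit = number.replace(" ", "")[0]
    prefixes = {"American Express": "3", "Visa": "4", "Mastercard": "5"}
    expected = prefixes.get(card_type)
    if expected is not None and first_digit != expected:
        errors.append("%s numbers must start with %s" % (card_type, expected))
    # American Express security codes are four digits; all others are three.
    cvv_length = 4 if card_type == "American Express" else 3
    if len(cvv2) != cvv_length:
        errors.append("%s security codes must be %d digits" % (card_type, cvv_length))
    return errors
```

For the sample document above, `check_card("Mastercard", "5123 4567 8901 2345", "314")` returns an empty list; swap the type to Visa and the prefix rule fires.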
Of course there are also quite a few other things besides co-occurrence constraints we can’t validate in the schema:
- The credit card is authorized for the purchase.
- The expiration date is in the future.
- The credit card checksum is correct.
- The zip code matches the city and state.
I could actually write RELAX NG extension functions to handle the first three, though that might not be the best architecture for the problem. The rule that the zip code must match the city and state is actually a co-occurrence constraint that requires access to external data, and RELAX NG custom type libraries can’t handle that.
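Of the three, the checksum is the most mechanical: credit card numbers use the well-known Luhn algorithm. A sketch of that check in Python, independent of any schema language:

```python
# Luhn checksum: double every second digit from the right; if doubling
# produces a two-digit number, subtract 9. Valid numbers sum to a
# multiple of 10.
def luhn_valid(number):
    digits = [int(d) for d in number.replace(" ", "")]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

This is the sort of thing a RELAX NG custom type library in Java could do, since it only needs the content of the single Number element.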
Declarative schemas may be a useful tool, and are easier to write than imperative validation code. However, they’re rarely able to check everything you need to check. Schema validation can be the first step in deciding whether to accept a document. It usually shouldn’t be the last.
December 20th, 2006 at 9:42 am
The reason RELAX NG can’t handle the last case is not that it requires access to external data, but that the finite enumeration involved (of all valid city/state/zip triplets) is just too fardling big. There are something like 43,000 zipcodes in the U.S., and there is often more than one city name that they validly map to (see the Wikipedia article for details). You could (mechanically) construct a schema that represents that mapping, but I have no idea how long it would take to match against it.
BTW, Discover card numbers have a first digit of 6, and Diners Club card numbers assigned outside North America have 3. Full details at http://en.wikipedia.org/wiki/Credit_card_numbers .
December 21st, 2006 at 7:44 am
By writing the zip code verification as a custom type implemented in Java, you could defer the finite enumeration to a database or web service. You wouldn’t construct a schema. You’d just talk to a database. The problem is that verifying the details requires content from two separate elements.
December 21st, 2006 at 10:29 am
Right, exactly. So custom types aren’t sufficient for this: they can only validate each type independently. Instead you need a RELAX NG schema containing a rule like this:
citystatezip =
element city {"New York"}, element state {"NY"}, element zip {"10001"} |
element city {"New York"}, element state {"NY"}, element zip {"10002"} |
element city {"New York"}, element state {"NY"}, element zip {"10003"} |
element city {"Beverly Hills"}, element state {"CA"}, element zip {"90210"} |
…
and so on for all 50,000 or so possibilities. There’s nothing actually *wrong* with such a schema, except that it may overflow validation tools.
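The mechanical construction itself is trivial; a Python sketch (the helper name is mine) that emits exactly such an alternation from a table of triplets:

```python
# Emit a RELAX NG compact-syntax pattern enumerating every valid
# (city, state, zip) triplet, one alternative per row.
def emit_citystatezip(triplets):
    alternatives = [
        'element city {"%s"}, element state {"%s"}, element zip {"%s"}'
        % (city, state, zipcode)
        for city, state, zipcode in triplets
    ]
    return "citystatezip =\n  " + " |\n  ".join(alternatives)

print(emit_citystatezip([
    ("New York", "NY", "10001"),
    ("Beverly Hills", "CA", "90210"),
]))
```

Feed it the full national table and you get a legal schema; whether any validator survives compiling a pattern with 50,000 branches is another question.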
Schematron, though, can probably do the trick both cleanly and correctly with a clever use of the XSLT document function.
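Something along these lines might work, assuming an XSLT-based Schematron implementation and a hypothetical lookup file zipcodes.xml containing entry elements with city, state, and zip attributes:

```xml
<sch:pattern xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:rule context="CreditCard">
    <sch:assert test="document('zipcodes.xml')
                      //entry[@zip = current()/Zip
                              and @city = current()/City
                              and @state = current()/State]">
      The zip code must match the city and state.
    </sch:assert>
  </sch:rule>
</sch:pattern>
```

The enumeration lives in an external document rather than in the schema itself, and the assertion can compare content from several sibling elements at once, which is precisely what a custom type library can’t do.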
December 21st, 2006 at 3:15 pm
The other big constraint language I’ve worked with a lot is STEP (ISO 10303), which is designed to represent complex systems of mostly mechanical parts (but we were trying to apply it to SGML-structured content and also to the abstract structures of HyTime groves). One of the things that STEP includes is the notion of “population constraints”, which is really another way to say “co-constraints”. STEP’s solution was to include an expression language that just let you directly say things like “if the card is AmEx then the number must start with 3 and you can have no more than four instances of the blarg subcomponent”. You run into the same problem using UML to do data modeling–at some point you need to use algorithms to express the full constraints imposed by the model.
That can lead to the desire to then create a standard expression language, but at least in the pragmatic world of XML processing, that hardly seems worth the effort given a small set of widely-used languages (XSLT, Java, Python, Perl, C#, JavaScript) that are all (with the possible exception of Perl) more or less easy to understand from the point of *documenting* the rules (as opposed to implementing the rules).
I also find it interesting that in Marinelli’s paper they said that they didn’t try to implement validation of constraints that flow backward from later elements to earlier elements, for the simple reason that in the worst case it requires you to buffer the entire document tree before reporting anything, at which point you’ve lost all the value of a streaming approach. Which sort of calls into question the whole utility of trying to apply co-constraints in that environment at all.
That is, it always seems to come down to the fact that some part of your validation has to happen in the processor after whatever validation you can do declaratively, so there just doesn’t seem to be a lot of value in trying to provide incremental improvements to the expressiveness of declarative constraint languages.
January 1st, 2007 at 9:29 am
So, if there is indeed a tradeoff between the ease of declarative languages and the generality of procedural or functional languages (and assuming that Schematron is in the former rather than the latter category), then the question becomes, should a declarative schema language that handles co-occurrence constraints be biased towards completeness (expressiveness) or familiarity?
With Schematron, I strongly adopted the bias to familiarity: even though XPath (especially XPath 1) is limited, ease of use is the core issue for schema languages, especially ones that fit in the (to many people) vague layer between what grammars can do and what Java or .NET or Python can do. (Contrast it with an approach of, say, building a Prolog-like rule system for validation.)
The nice side-effect is that by keeping the declarative language extremely simple, we then have room to model the kinds of systematic variations in documents (in Schematron, using phases, diagnostics and abstract patterns) without causing brain explosions.
Another aspect to streaming evaluation of schemas is the issue of fail-fast. Sometimes what is required of validation is for the document to be failed as soon as any error is found. It is perfectly possible in a path-based constraint language (such as Schematron) to detect which constraints require random access and which can be evaluated in a streaming fashion, and then to evaluate the streaming ones on the fly through SAX and StAX (or the equivalent) while deferring the random-access constraints to a second, random-access parse. See http://www.topologi.com/resources/pdfs/SchematronHeuristic.pdf for an example; sorry about the PDF.
January 16th, 2007 at 8:18 pm
one of the oddities of your word press installation is its insistence on using extended-iso characters where a simple apostrophe would do. the result is on your main page toc, this topic is “Murphy’s Law of Co-occurrence Constraints”. it’s like this with firefox on osx (both ppc and intel) and win98 ($deity help us) as well as internet exploder 6 (on the same inept o/s).
January 16th, 2007 at 8:41 pm
cool! and it corrected the &-#-8-2-1-7-; in my prior comment …