[relaxng-user] re: Document/schema association proposal

David Tolpin dvd at davidashen.net
Wed Dec 31 00:50:25 ICT 2003


Hi,

I have been looking at the Document/schema association proposal,
http://groups.yahoo.com/group/emacs-nxml-mode/message/259; I was going
to implement it. 

Essentially, the problem the proposal addresses is classifying

1) strings (URIs)
2) xml files

according to a number of templates and their ordering. The rules in
the proposal are about matching the strings and files against informally
defined templates.

Since this is done within the framework of a language (Relax NG) whose purpose
is to define templates for xml documents in a regular way, it would be
natural to just use Relax NG itself for classification.

In a similar way, regular expressions for strings, in any of the accepted syntaxes,
can be used to filter URIs.

Below is a sample document/schema association, expressed in terms of regular
expressions for URIs and Relax NG grammars for documents. It is actually the utility
I am using for my work. The glue is perl, but it could as well be written in xml
and interpreted using e-lisp, C, or any other language.

In particular, the XML template for XSLT documents checks for either the root element
being stylesheet or transform, or a child of the root element being template (for
literal result elements as documents).

my $TMPDIR="/tmp";
my $RNGDIR="/usr/local/share/rng-c";
my %SCHEMAS=(
  docbook=>"docbook.rnc",
  xslt=>"xslt.rnc",
  xhtml=>"xhtml.rnc",
  xslfo=>"fo.rnc",
  relaxng=>"relaxng.rnc"
);

my $ANY=<<END;
any = (element * {any}|attribute * {text}|text)*
END

my $XML=<<END;
start = element * {any}
$ANY
END

my $DOCBOOK=<<END;
start = element (book|article|part|chapter|section) { any }
$ANY
END

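# matches documents that contain no element from the XSLT namespace;
# a document that is XML but fails this grammar is treated as XSLT
my $NOTXSLT=<<END;
default namespace xsl = "http://www.w3.org/1999/XSL/Transform"
start = element * - xsl:* {not-xsl}
not-xsl = (element * - xsl:* {not-xsl}|attribute * {text}|text)*
END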

my $RELAXNG=<<END;
default namespace rng = "http://relaxng.org/ns/structure/1.0"
start = element grammar {any}
$ANY
END

my $type;
for($ARGV[0]) {
  $type=(
  valid($_,$XML) 
    and (valid($_,$DOCBOOK) and "docbook"
      or!valid($_,$NOTXSLT) and "xslt"
      or valid($_,$RELAXNG) and "relaxng")
  or /\.x?ht(ml?)?$/	and "xhtml"
  or /\.xsl$/		and "xslt"
  or /\.dbx$/ 		and "docbook"
  or /\.fo$/ 		and "xslfo"
  or /(.*)\.[^.\/]*$/ and do {my $f=$1.".rnc"; -f $f ? $f : 0;} # same name, extension replaced by .rnc
  or "unknown"
  );
}
print exists $SCHEMAS{$type}?$RNGDIR."/".$SCHEMAS{$type}:$type;

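# returns true if rnv accepts the document at $uri against the grammar in $rnc
sub valid {
  my ($uri,$rnc)=@_;
  my $f=$TMPDIR."/".$$.".rnc";

  # rnv reads the grammar from a file, so write it to a temporary one
  open RNC,">$f" or return 0; print RNC $rnc; close RNC;
  system("rnv -q $f $uri 2>/dev/null");
  unlink $f; return !$?;
}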

In particular, the NOTXSLT rules make it possible to recognize literal result
elements as documents by their contents. My first thought had been to use XSLT to
match XML documents, but then I realized that Relax NG is better suited for this purpose.
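
As a side note, the URI half of the classification does not have to be a chain
of "or" expressions; the same rules can be kept as an ordered table, which is
closer to what a version written in xml would look like. The following is only
a sketch: the patterns and type names are the ones used in the script, while
the table layout and the name classify_uri are made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Ordered rules: the first pattern that matches the URI wins.
# Patterns and type names are taken from the script above; the
# table layout and classify_uri are hypothetical, for illustration.
my @URI_RULES=(
  [qr/\.x?ht(ml?)?$/ => "xhtml"],
  [qr/\.xsl$/        => "xslt"],
  [qr/\.dbx$/        => "docbook"],
  [qr/\.fo$/         => "xslfo"],
);

sub classify_uri {
  my ($uri)=@_;
  for my $rule (@URI_RULES) {
    my ($re,$type)=@$rule;
    return $type if $uri=~$re;
  }
  return "unknown";  # no rule matched
}

# classify every URI given on the command line
print classify_uri($_),"\n" for @ARGV;
```

Adding or reordering a rule then means editing the table only; the content
checks with Relax NG would still run before it, as in the script.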

One can argue that it is a slow approach. Determining the file type by content for
a 5-megabyte file takes 1.5 seconds with rnv, and it parses the whole document
even if the root element is notAllowed.

Sincerely,
David Tolpin

