[relaxng-user] Latest proposal for smart regexes in RELAX NG
David Tolpin
dvd at davidashen.net
Thu May 6 12:17:59 ICT 2004
> > I am asking because I have a feeling that the string-based (as opposed
> > to XML) syntax for regular expressions (unix-like, adopted by W3C Schema)
> > is the compact syntax. It is easy to write and convenient to use,
> > and actually needs just one addition: ability to compose a regular
> > expression from parts.
>
> It's neither easy to write nor convenient to use, it's just that we're all
> so used to it. It optimizes the wrong things ("[A-Z]" rather than "[:ucalpha:]"),
I don't think so. I am confident that one of the most significant
benefits we get from Unicode is that we can write A-Z instead of [:ascalpha:]
(and not ucalpha, as these are different things). Letters and numbers
are ordered and sit at fixed places. A-Z is always ABCDEFGHIJKLMNOPQRSTUVWXYZ;
it was not so in the past, and I am glad it is now.
For ucalpha there is a different optimization in XML Schema Regular Expressions
(used with Relax NG), namely \p{L}. It is no more cryptic than :ucalpha:, and it
refers directly to Unicode character classes -- much easier for those who know
what they are.
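
A minimal sketch of the difference between A-Z and \p{L}, in Python. It rests
on assumptions: Python's standard re module does not support \p{L}, so the
sketch emulates it with unicodedata.category, and the helper names
is_ascii_upper and is_letters are made up for illustration.

```python
import re
import unicodedata

# [A-Z] is a plain code point range: exactly the 26 ASCII capitals.
ASCII_UPPER = re.compile(r"[A-Z]+")


def is_ascii_upper(text):
    # True when the whole string consists of the letters A..Z only.
    return ASCII_UPPER.fullmatch(text) is not None


def is_letters(text):
    # Emulates \p{L}+ : every character must have a Unicode general
    # category starting with "L" (Lu, Ll, Lt, Lm, Lo).
    return bool(text) and all(
        unicodedata.category(ch).startswith("L") for ch in text
    )


# The range A-Z enumerated by code point, in order.
A_TO_Z = "".join(chr(c) for c in range(ord("A"), ord("Z") + 1))
```

With these helpers, is_ascii_upper("RELAX") holds while is_ascii_upper("ÉCOLE")
does not, whereas is_letters accepts both, which is the sense in which the two
notations are "different things".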
>
> > Strings are not trees.
>
> Regular expressions *are* trees: they are compositions of sequence, choice,
> and zeroOrOne. Everything else is just syntactic sugar. (Note that I
> do not provide for backreferences, a la \1, \2, ..., which make the regular
> expressions no longer regular.)
1) Strings are not trees. XML documents are trees. That's why
regular expressions (which can be represented either in tree-like
(XML) form or as a sequence of instructions (traditional
string regular expressions)) should provide
- an XML structured representation (or a compact but still structured, tree-like one)
for XML documents as a whole
- a string representation to match strings.
One can propose an XML representation of string regular expressions
to ease processing. But just as the XML syntax is the base syntax
for XML regular expressions, with the compact syntax designed to
make life easier in certain environments, the string syntax is the base
syntax for string regular expressions; an XML syntax can be
provided as a convenience, but it must map to the string syntax
bidirectionally.
2) XML Schema Regular expressions do not have back-references.
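
To illustrate point 1, here is a rough sketch of one direction of such a
mapping: a tree built from sequence, choice and zeroOrOne nodes is serialized
to the string syntax, and a part is reused to compose a larger expression.
The class names (Lit, Seq, Choice, Opt) and the function to_string are
hypothetical, not taken from any proposal; the reverse direction (parsing the
string back into a tree) is left out, and Python's re.fullmatch stands in for
the implicitly anchored matching of XML Schema regular expressions.

```python
import re
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Lit:
    text: str  # literal characters


@dataclass(frozen=True)
class Seq:
    parts: Tuple["Node", ...]  # one part after another


@dataclass(frozen=True)
class Choice:
    alts: Tuple["Node", ...]  # exactly one of the alternatives


@dataclass(frozen=True)
class Opt:
    body: "Node"  # zeroOrOne


Node = Union[Lit, Seq, Choice, Opt]


def to_string(node):
    # Serialize the tree to the traditional string syntax.
    if isinstance(node, Lit):
        return re.escape(node.text)
    if isinstance(node, Seq):
        return "".join(to_string(p) for p in node.parts)
    if isinstance(node, Choice):
        return "(" + "|".join(to_string(a) for a in node.alts) + ")"
    if isinstance(node, Opt):
        return "(" + to_string(node.body) + ")?"
    raise TypeError("unknown node: %r" % (node,))


# Compose a larger expression from a named part.
colour = Seq((Lit("colo"), Opt(Lit("u")), Lit("r")))
word = Choice((colour, Lit("hue")))
pattern = to_string(word)
```

The string form stays the thing that is actually matched against text; the
tree is only a convenience for building and processing it.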