Richard
I accept that there are some users for whom the ability to have a
complete DTD as part of the file is important. It's on the "to do"
list. On the "done" list is the removal of the "editorial" remarks
from the xml.pl page, which might have given the impression that DTD
handling was left out on principle.
I'm persuaded that for version 2.0 (RSN):
1) The default should be to merge CDATA and PCDATA for input. We still
need to be able to write CDATA, for backwards compatibility, so we
should be able to read it independently, for symmetry.
2) The default should be to ignore comments. We still need to be able
to write comments, so we should be able to read them for
symmetry. (Note: Even the latest "standards compliant" browsers -
IE 6.0 and NS 6.1 - need their in-line Javascript in a comment, even in
XHTML.)
3) To get XHTML formatting right, both for input and output, we need
to validate. To make validation worthwhile, I'll want more than the
DTD offers: I want to check more of the validity constraints and more
of the attribute datatypes. It would be good to interleave the
validation with the parsing too. I don't know which schema language
would be best - I'll code what I need directly in Prolog to start
with, and see where that leads; a first sketch follows below.
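For instance, the constraint that XHTML's <ul> must contain one or
more <li> elements can be coded directly (a sketch only; the
element(Tag,Atts,Content) term form follows the examples later in
this thread, and valid/1 is a hypothetical entry point, not existing
code):

    % XHTML: <ul> contains one or more <li> elements, nothing else.
    valid(element(ul, _Atts, [Item|Items])) :-
        all_li([Item|Items]).

    all_li([]).
    all_li([element(li, _Atts, _Content)|Items]) :-
        all_li(Items).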
Regards
John Fletcher
----- Original Message -----
From: "Richard A. O'Keefe" <ok(a)atlas.otago.ac.nz>
To: <ciao-users(a)clip.dia.fi.upm.es>; <john(a)binding-time.co.uk>; <ok(a)atlas.otago.ac.nz>
Sent: Tuesday, December 18, 2001 10:46 PM
Subject: Re: Database and memory limitations
> > > (b) Binding Time may think it "unlikely that an application could make any
> > > use of a document that defines its own" DTD, but I have numerous
> > > examples. Since an application doesn't get its semantic information
> > > from a DTD in the first place, the reason given is unsound.
> >
> > I wasn't saying that the semantic information was in the DTD,
> > rather that applications must recognize the nodes, and probably the
> > structure of the document, to "make sense of it".
> >
> Yes, the application must recognize the nodes.
> No, this recognition does NOT have to be by means of checking
> the element type. If we have
>
> <!ELEMENT title (#PCDATA)>
> <!ATTLIST title ARCFORM NMTOKEN #FIXED "h2">
>
> then an application can look at the ARCFORM attribute of an element
> and say "oh yes, this is a level 2 heading, I know what to do with that".
> Note that the content model of this <title> is a strict sublanguage of
> the content model of an <h2>; in general, the thing that matters is that
> the content AFTER MAPPING should be a sublanguage.
>
> If we make that a default, instead of a fixed, attribute,
> then particular instances can over-ride it.
>
> <!ATTLIST title ARCFORM (h1|h2|h3|h4|h5|h6) "h2">
>
> so <title> maps to h2, and <title ARCFORM="h1"> maps to h1.
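>
> In Prolog terms that dispatch might look like this (a sketch only;
> the element(Tag,Atts,Content) form and Name="Value" attribute pairs
> follow the examples in this message, and the heading(Level,Content)
> result term is purely illustrative):
>
>     % Dispatch on the ARCFORM attribute, not on the element type.
>     render(element(_Tag, Atts, Content), Rendering) :-
>         member('ARCFORM' = Form, Atts),
>         render_form(Form, Content, Rendering).
>
>     render_form("h1", Content, heading(1, Content)).
>     render_form("h2", Content, heading(2, Content)).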
>
> > This means that some fixed terms must have been defined
> > beforehand, assuming that we're talking about communication
> > amongst two or more systems. Why not map these terms onto a
> > fixed DTD, or Schema, rather than adding another level of indirection?
> >
> Because a "union" DTD or Schema will be very permissive.
> (Look at HTML, for example. The separation between block level and inline
> level is _not_ very clear; inline content is allowed in most places where
> you'd expect block content.)
>
> Because the terms of such a DTD or Schema may not be the best ones for
> the task at hand.
>
> Let's take just one more example. Imagine that I'm writing a specific
> document, and in this document I need to have a list of people. Each
> person has a name and a list of projects. Then I want to use tags
> like <name> and <projects> rather than <dt> and <dd>. And I want to
> control the content model of these things; <projects> must contain
> project description, not arbitrary HTML content.
>
> This happens to be an example I'm actually editing right now.
> I have *ONE* such document. When the document is revised, the chances
> are the grammar will be revised as well. There is no point in making
> the DTD a separate file.
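>
> (An illustrative sketch - not the actual grammar - of how small such
> a document-specific DTD can be:
>
>     <!ELEMENT people   (person+)>
>     <!ELEMENT person   (name, projects)>
>     <!ELEMENT name     (#PCDATA)>
>     <!ELEMENT projects (project+)>
>     <!ELEMENT project  (#PCDATA)>
>
> Revising it in place along with the document is trivial.)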
>
> > > More
> > > interestingly, IBM's DARWIN approach shows that an application _can_
> > > get a lot of semantic information from a DTD, via the systematic use
> > > of #FIXED attributes.
> >
> > Defining the meaning of an element through a large set of
> > "#FIXED" attributes seems back-to-front to me. I would choose
> > to have a fixed set of tags, and as many attributes as are
> > needed, using the attribute values to parameterize the semantics. I
> > think that is more in the spirit of XML.
> >
> I strongly recommend that anyone who is interested in SGML/XML processing
> should look at the DARWIN approach. They show how you can implement an
> object-oriented model in SGML, where you can say "this element type is like
> that element type with these extensions".
>
> I would say that using a fixed set of tags is about as opposed to the
> spirit of XML as you can possibly get. XML is about using *semantic*
> markup, and that specifically includes application-specific and even
> document-specific element types. Attributes should be inferred by
> the processor from content and context whenever possible.
>
> > > The thing which makes the Binding Time parser unusable to me in its
> > > present form is that it's based on the usual mish-mash of markup-sensitive
> > > and structure-controlled approach.
> >
> > I think it might be XML that's the "mish-mash",
>
> XML provides two parsing models: validated, which fits structure-
> controlled applications very well, and well-formed, which is the mish-mash
> I am complaining of, and which is suitable neither for markup-sensitive
> applications (because information they might care about is lost) nor for
> structure-controlled applications (because the information they need isn't
> there).
>
> > xml.pl is trying to simplify it as far as possible - arguably
> > farther than is possible. Nevertheless, I think it gives a good mix of
> > generality and ease of use.
> >
> Except that it gets the rules for white-space handling in attribute values
> wrong, and it doesn't let you use general entities to build documents out
> of pieces, and it doesn't handle white space in text correctly, and ...
>
> I actually tried the "<p> <em>x</em> <em>y</em> </p>" example, and xml.pl
> did produce the wrong answer. And this is ultimately related to the fact
> that it doesn't look at the DTD.
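>
> (In mixed content the white space between the two <em> elements is
> significant, so a correct parse keeps it, giving a term shaped
> roughly like
>     element(p,[],[" ",element(em,[],"x")," ",element(em,[],"y")," "])
> - the shape, not necessarily xml.pl's exact representation - whereas
> stripping it runs "x" and "y" together when the document is
> rendered.)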
>
> > > In any XML parser, it ought to be
> > > possible for an application to say "I am a structure-controlled
> > > application. Do NOT split CDATA out separately. Act as if comments were
> > > not there at all. DO distinguish element content white space from other
> > > white space, in fact, don't give me any element content white space."
> >
> > The problem with both CDATA sections and Comments is that some
> > applications, like XHTML and SVG, expect JavaScript to be
> > delivered in them, so the default behaviour has to be to
> > preserve comments on input and to distinguish between PCDATA and
> > CDATA for output.
> >
> There are some non-sequiturs there.
>
> Yes, XHTML expects Javascript,
> yes, Javascript *****may***** be embedded in CDATA,
> no, Javascript does not *have* to be embedded in CDATA,
> so an XHTML processor that handles Javascript (which many do not)
> has to recognise Javascript WHETHER IT IS IN A CDATA SECTION OR NOT.
> In fact, having to distinguish between CDATA sections and other character
> data makes an XHTML processor's job *harder*, not easier.
>
> XHTML has CDATA sections precisely so that Javascript should NOT
> be embedded in comments. That's an old hack for HTML, not for XHTML.
> Anyone who puts Javascript in an XHTML comment deserves to have it ignored.
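>
> (For reference, the XHTML pattern is something like
>
>     <script type="text/javascript">
>     //<![CDATA[
>     var ok = (1 < 2) && ("a" > "");
>     //]]>
>     </script>
>
> where the CDATA section lets the unescaped < and & through, and the
> // lines hide the section delimiters from the Javascript engine; no
> comment is involved.)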
>
> But above all, the fact that XHTML or SVG require this (if they do),
> does ***not*** mean that distinguishing CDATA from other character
> data and reporting comments have to be the defaults. It only means
> that they have to be possible, so that XHTML and SVG applications
> can ask for them. There are very very many uses of XML that are not
> XHTML, not SVG, and do not include Javascript.
>
> > It's a pragmatic requirement, until all applications recognize
> > CDATA sections and PCDATA as interchangeable and stop using comments
> > corruptly.
> >
> It is a pragmatic requirement that these things be *POSSIBLE*,
> not that they be *defaults*. The distinction having been enshrined in
> SAX, DOM, and Infoset, what are the odds that generators ever get it right?
>
> > Beyond that, I've elected to have the calling application ignore
> > what it doesn't need, rather than provide switches in the
> > parser. If a consensus in favour of "switches" emerges, that will
> > change.
> >
> But there ARE options in the parser. The first argument of xml_parse/3
> is a list of them. Currently there are two:
> format(bool)              true -> strip white space (incorrectly)
> extended_characters(bool) true -> use XHTML character entities.
> Why not
> cdata(bool)               true -> return cdata(_) terms
> comment(bool)             true -> return comment(_) terms
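> A structure-controlled application could then say
>
>     xml_parse( [cdata(false),comment(false)], Chars, Document )
>
> (with the hypothetical option names just proposed) and be done with it.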
>
> Electing to have the calling application ignore what it does not need
> puts the burden on the application. It doesn't make sense to me to force
> applications to deal with constructs that have no value to them.
>
> It's worse than that. If I have
> <p>Example: <![CDATA[<foo bar="ugh">]]>.</p>
> what I _want_ is
> element(p,[],["Example: <foo bar=""ugh"">.])
> but what I _get_ is
> element(p,[],["Example: ",cdata("<foo bar=""ugh"">"),"."])
> which isn't even the right number of children.
>
> I don't see why every application should have to include its own
> code for the common task of stripping out comment nodes, and
> pasting sequences of plain text and cdata into single plain text items.
>
> It is more efficient to have a means of never generating these things
> in the first place.
>
> Second best would be for the package to include an
> xml_normalize(Kludgy, /*->*/ Cleaned)
> predicate.
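> Something along these lines would do (a sketch, assuming the term
> forms used above - plain text as character-code lists, cdata(_) and
> comment(_) terms, element(Tag,Atts,Children)):
>
>     % Drop comment(_) terms, turn cdata(Chars) into plain text, and
>     % paste adjacent text items into one, at every level of the tree.
>     xml_normalize([], []).
>     xml_normalize([comment(_)|Xs], Ys) :- !,
>         xml_normalize(Xs, Ys).
>     xml_normalize([cdata(Chars)|Xs], Ys) :- !,
>         xml_normalize([Chars|Xs], Ys).
>     xml_normalize([element(Tag,Atts,Kids0)|Xs],
>                   [element(Tag,Atts,Kids)|Ys]) :- !,
>         xml_normalize(Kids0, Kids),
>         xml_normalize(Xs, Ys).
>     xml_normalize([X|Xs], Ys) :-
>         xml_normalize(Xs, Ys1),
>         (   is_text(X), Ys1 = [Y|Ys2], is_text(Y)
>         ->  append(X, Y, XY),
>             Ys = [XY|Ys2]
>         ;   Ys = [X|Ys1]
>         ).
>
>     is_text(Text) :- is_list(Text).   % a list of character codes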
>
> > > You really can't get XHTML white space handling right without knowing
> > > what is element content and what is mixed content, which means processing
> > > the DTD.
> >
> > I've fixed my "defaulty" explanation, if not the code. (One
> > could fix that specific XHTML problem, simply by distinguishing
> > between 'block' and 'inline' elements, rather than processing
> > the whole DTD. It's a hack, but it would be an effective one.)
> >
> It would be, if I were parsing XHTML, which I'm usually not.
>
> > I think DTDs are very high cost/low value in most cases. For
> > example, I'm sure that there must be a way of capturing more of
> > XHTML's validity constraints, more economically, than the DTD, or XML
> > Schema, manages. For "architectural forms", #FIXED attributes seem
> > rather limited when compared with XLink (URLs) and Namespaces.
> >
> DTDs are very low cost compared with Namespaces, and trivial compared
> with the cost of XLink. Of course there's a better way to capture that
> stuff: it's called Prolog. Namespaces are very easy to implement in an
> XML parser (I've done it), but they impose a high cost in every application
> that uses them, because they are so clumsy. DTDs have some cost in the
> parser, but many applications can benefit from knowing they have a structurally
> well-formed document. (Not HTML or XHTML ones, of course, because HTML and
> XHTML have so little structure.)
>
> DTDs can do more for you than most people realise.
> They were by design limited to what could be efficiently implemented in
> limited memory, but you can do quite a lot.
> Modular XHTML makes good use of them.
>
> There are other schema languages for XML, such as RELAX and TREX and ...
>