"John Fletcher" <john(a)binding-time.co.uk> wrote:
> TBH, I'm not convinced by SWI's approach of encoding text nodes (CDATA and PCDATA) as atoms. It is a compact representation, but often the content of the text nodes (including attribute values) has some internal structure that needs to be "micro-parsed". I've been using Prolog to read and write XML quite a lot, and I found the "text as atoms" approach meant having to convert between atoms and chars far too often.
I was describing what SWI's library does, not what I actually prefer. My uses of SGML are predominantly textual, and I find that I prefer using lists of character codes to represent text. (Amongst other things, that makes it easy to match, generate, and transform text using DCGs.)
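To illustrate the DCG point, here is a minimal sketch of my own (the predicate names are mine; it assumes the double_quotes flag maps "..." to code lists, as in SWI's classic mode): micro-parsing an attribute value such as an SVG-style coordinate pair is direct when the text already is a list of character codes.

```prolog
% Sketch: micro-parse an attribute value such as '12,34' into X-Y.
% Assumes the text arrives as a list of character codes and that the
% double_quotes flag is codes, so "," denotes the code list [0',].

coord(X-Y) --> integer(X), ",", integer(Y).

integer(N) --> digits(Ds), { number_codes(N, Ds) }.

digits([D|Ds]) --> digit(D), digits(Ds).
digits([D])    --> digit(D).

digit(D) --> [D], { code_type(D, digit) }.

% ?- atom_codes('12,34', Cs), phrase(coord(P), Cs).
% P = 12-34.
```

Had the text node been delivered as an atom, every such site would need an atom_codes/2 conversion first, which is exactly the overhead being complained about.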
There are plenty of intermediate possibilities, such as representing text as lists of tokens.
> It might be that the XML applications I've been dealing with (XHTML, SVG and SMIL) are especially prone to this, but leaving the text nodes as chars has worked to advantage (see http://www.binding-time.co.uk/xml.pl.shtml ). The files themselves tend to be quite small (average 10k, maximum 250k; they are intended for transmission, after all), so memory usage hasn't been a problem. ... Perhaps RDF is different in this respect.
There are some good things about the SWI kit: (1) The source is available; if you don't like atoms for text, don't have them. (2) The parser is fast. It's not the fastest XML parser around, but it's the fastest SGML parser I've managed to get my hands on. (3) It's an SGML parser, not just an XML parser. (4) Because the actual parser is in C, it's comparatively straightforward to plug it into languages other than Prolog and dialects other than SWI.
Thanks for the link to http://www.binding-time.co.uk/xml.pl.shtml. There are a number of points I would take issue with there:
(a) &apos; may not be a standard _HTML_ character entity (heck, HTML 3.2 accidentally dropped &quot;), but it _is_ a standard XML one, and it is explicitly present (as the fifth entity) in "-//W3C//ENTITIES Special for XHTML//EN".
(b) Binding Time may think it "unlikely that an application could make any use of a document that defines its own" DTD, but I have numerous examples. Since an application doesn't get its semantic information from a DTD in the first place, the reason given is unsound. More interestingly, IBM's DARWIN approach shows that an application _can_ get a lot of semantic information from a DTD, via the systematic use of #FIXED attributes.
The thing which makes the Binding Time parser unusable to me in its present form is that it's based on the usual mish-mash of the markup-sensitive and structure-controlled approaches. In any XML parser, it ought to be possible for an application to say "I am a structure-controlled application. Do NOT split CDATA out separately. Act as if comments were not there at all. DO distinguish element content white space from other white space; in fact, don't give me any element content white space."
The control "strip layout when no character data [sic.; but layout IS character data] appears between two elements" is blessed by XSLT, so it is a handy thing to have, but it gets the "element content white space" notion quite wrong. The result is that if you write "<ul> <li>x</li> <li>y</li> </ul>" it is as if you had written "<ul><li>x</li><li>y</li></ul>", which is GOOD; but if you write "<p> <em>x</em> <em>y</em> </p>" it is as if you had written "<p><em>x</em><em>y</em></p>", which is BAD.
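To make the failure mode concrete, here is a sketch (the predicate names are mine, over the common element(Tag, Attributes, Children) tree with text nodes as atoms) of the naive strip-layout rule:

```prolog
% Naive "strip layout" rule: drop any text child that is all layout
% characters. This is correct for element content (ul), but wrong for
% mixed content (p), where the inter-element space is significant.

strip_layout(element(Tag, Atts, Children0), element(Tag, Atts, Children)) :-
    exclude(layout_only, Children0, Children1),
    maplist(strip_child, Children1, Children).

strip_child(element(T, A, C), Stripped) :-
    !,
    strip_layout(element(T, A, C), Stripped).
strip_child(Text, Text).

layout_only(Text) :-
    atom(Text),
    atom_codes(Text, Cs),
    Cs \== [],
    forall(member(C, Cs), code_type(C, space)).

% Applied to the tree for "<p> <em>x</em> <em>y</em> </p>" this yields
% the tree for "<p><em>x</em><em>y</em></p>" -- exactly the BAD case
% above. Knowing that p has mixed content, and hence that the rule must
% not fire there, requires the DTD's content model.
```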
You really can't get XHTML white space handling right without knowing what is element content and what is mixed content, which means processing the DTD.
Another reason for processing the DTD, of course, is to get #FIXED attribute values, defaulted attribute values, and special attribute types such as NMTOKENS, correct. #FIXED attributes, as I mentioned above, are an extremely useful tool for "poor man's architectural forms".
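For readers unfamiliar with the trick, a hypothetical DTD fragment (mine, not taken from DARWIN) shows how a #FIXED attribute carries architectural information without the instance ever mentioning it:

```dtd
<!-- The instance just writes <step>...</step>; a DTD-aware parser
     nevertheless reports role="listitem" on every step element,
     so a generic application can treat step as a list item. -->
<!ELEMENT step (#PCDATA)>
<!ATTLIST step role CDATA #FIXED "listitem">
```

A parser that ignores the DTD never sees the role attribute at all, which is one more way markup-sensitive shortcuts lose information.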
Finally, while trying to download, I got several 404 responses, with the text "The requested URL: /cgi-bin/USER-linklog.pl was not found on this server".