"John Fletcher" <john(a)binding-time.co.uk> wrote:
> TBH, I'm not convinced by SWI's approach of encoding text nodes (CDATA and PCDATA) as atoms. It is a compact representation, but often the content of the text nodes (including attribute values) has some internal structure that needs to be "micro-parsed". I've been using Prolog to read and write XML quite a lot, and I found the "text as atoms" approach meant having to convert between atoms and chars far too often.
I was describing what SWI's library does, not what I actually prefer. My uses of SGML are predominantly textual, and I find that I prefer using lists of character codes to represent text. (Amongst other things, that makes it easy to match, generate, and transform text using DCGs.)
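To illustrate the DCG point, here is a minimal sketch of my own (the predicate names are mine; it assumes the double_quotes flag maps "..." to code lists, as in SWI's classic mode): micro-parsing an attribute value such as an SVG-style coordinate pair is direct when the text already is a list of character codes.

```prolog
% Sketch: micro-parse an attribute value such as '12,34' into X-Y.
% Assumes the text arrives as a list of character codes and that the
% double_quotes flag is codes, so "," denotes the code list [0',].

coord(X-Y) --> integer(X), ",", integer(Y).

integer(N) --> digits(Ds), { number_codes(N, Ds) }.

digits([D|Ds]) --> digit(D), digits(Ds).
digits([D])    --> digit(D).

digit(D) --> [D], { code_type(D, digit) }.

% ?- atom_codes('12,34', Cs), phrase(coord(P), Cs).
% P = 12-34.
```

Had the text node been delivered as an atom, every such site would need an atom_codes/2 conversion first, which is exactly the overhead being complained about.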
There are plenty of intermediate possibilities, such as representing text as lists of tokens.
> It might be that the XML applications I've been dealing with (XHTML, SVG and SMIL) are especially prone to this, but leaving the text nodes as chars has worked to advantage (see http://www.binding-time.co.uk/xml.pl.shtml ). The files themselves tend to be quite small (average 10k, maximum 250k; they are intended for transmission, after all), so memory usage hasn't been a problem. ... Perhaps RDF is different in this respect.
There are some good things about the SWI kit: (1) The source is available; if you don't like atoms for text, don't have them. (2) The parser is fast. It's not the fastest XML parser around, but it's the fastest SGML parser I've managed to get my hands on. (3) It's an SGML parser, not just an XML parser. (4) Because the actual parser is in C, it's comparatively straightforward to plug it into languages other than Prolog and dialects other than SWI.
Thanks for the link to http://www.binding-time.co.uk/xml.pl.shtml. There are a number of points I would take issue with there:
(a) &apos; may not be a standard _HTML_ character entity (heck, HTML 3.2 accidentally dropped &quot;), but it _is_ a standard XML one, and it is explicitly present (as the fifth entity) in "-//W3C//ENTITIES Special for XHTML//EN".
(b) Binding Time may think it "unlikely that an application could make any use of a document that defines its own" DTD, but I have numerous examples. Since an application doesn't get its semantic information from a DTD in the first place, the reason given is unsound. More interestingly, IBM's DARWIN approach shows that an application _can_ get a lot of semantic information from a DTD, via the systematic use of #FIXED attributes.
The thing which makes the Binding Time parser unusable to me in its present form is that it's based on the usual mish-mash of the markup-sensitive and structure-controlled approaches. In any XML parser, it ought to be possible for an application to say "I am a structure-controlled application. Do NOT split CDATA out separately. Act as if comments were not there at all. DO distinguish element content white space from other white space; in fact, don't give me any element content white space."
The control "strip layout when no character data [sic.; but layout IS character data] appears between two elements" is blessed by XSLT, so it is a handy thing to have, but it gets the "element content white space" notion quite wrong. The result is that if you write "<ul> <li>x</li> <li>y</li> </ul>" it is as if you had written "<ul><li>x</li><li>y</li></ul>", which is GOOD; but if you write "<p> <em>x</em> <em>y</em> </p>" it is as if you had written "<p><em>x</em><em>y</em></p>", which is BAD.
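To make the failure mode concrete, here is a sketch (the predicate names are mine, over the common element(Tag, Attributes, Children) tree with text nodes as atoms) of the naive strip-layout rule:

```prolog
% Naive "strip layout" rule: drop any text child that is all layout
% characters. This is correct for element content (ul), but wrong for
% mixed content (p), where the inter-element space is significant.

strip_layout(element(Tag, Atts, Children0), element(Tag, Atts, Children)) :-
    exclude(layout_only, Children0, Children1),
    maplist(strip_child, Children1, Children).

strip_child(element(T, A, C), Stripped) :-
    !,
    strip_layout(element(T, A, C), Stripped).
strip_child(Text, Text).

layout_only(Text) :-
    atom(Text),
    atom_codes(Text, Cs),
    Cs \== [],
    forall(member(C, Cs), code_type(C, space)).

% Applied to the tree for "<p> <em>x</em> <em>y</em> </p>" this yields
% the tree for "<p><em>x</em><em>y</em></p>" -- exactly the BAD case
% above. Knowing that p has mixed content, and hence that the rule must
% not fire there, requires the DTD's content model.
```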
You really can't get XHTML white space handling right without knowing what is element content and what is mixed content, which means processing the DTD.
Another reason for processing the DTD, of course, is to get #FIXED attribute values, defaulted attribute values, and special attribute types such as NMTOKENS, correct. #FIXED attributes, as I mentioned above, are an extremely useful tool for "poor man's architectural forms".
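For readers unfamiliar with the trick, a hypothetical DTD fragment (mine, not taken from DARWIN) shows how a #FIXED attribute carries architectural information without the instance ever mentioning it:

```dtd
<!-- The instance just writes <step>...</step>; a DTD-aware parser
     nevertheless reports role="listitem" on every step element,
     so a generic application can treat step as a list item. -->
<!ELEMENT step (#PCDATA)>
<!ATTLIST step role CDATA #FIXED "listitem">
```

A parser that ignores the DTD never sees the role attribute at all, which is one more way markup-sensitive shortcuts lose information.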
Finally, while trying to download, I got several 404 responses, with the text "The requested URL: /cgi-bin/USER-linklog.pl was not found on this server".