Rethinking websites as single documents

James Tauber
Associate Researcher,
Curtin University of Technology

One Electronic Document != One File

Most systems for creating or processing documents do not require the document to be stored in a single file. Word processors generally have a notion of a master document with subdocuments in different files. Most text formatting systems have some inclusion facility. In particular, SGML and XML have entities.

The ability to use parsed entities in XML leads to the option of a document having a physical structure that is synchronous but not identical to the logical structure. Furthermore, external parsed entities provide for a separation between document and file. A single XML document may take up multiple files.

It is this author's view that the physical structure should be transparent to a processing system dealing with the logical structure. An XSL engine, for example, should have no knowledge of the physical structure of the document it is processing. If it does, then we lose the benefit of the separation between document and file: the processing becomes determined in part by the physical structure, which should not be the case. One exception is XML editors, which should allow user control over both the logical and physical structure.

Work on XML Fragments deals largely with the reverse of external parsed entities. Rather than *constructing* a single XML document from multiple entities, the work is about breaking up a single XML document into entities that can be transmitted individually.

One Electronic Document = One Printed Document

Most publishing systems take the view of a one-to-one correspondence between electronic document and printed document.

           [Electronic Document] ---> [Printed Document]

Generic markup systems like SGML and XML use stylesheets that specify how the logical structure of the electronic document is presented on the page.

           [Electronic Document] + [Stylesheet] ---> [Printed Document]

The separation of content from form that this affords can be taken advantage of by having different stylesheets that produce different forms of the document. If the stylesheet language is powerful enough, the different documents may differ not only in styling but also in what content has been carried over from the source.

           [Electronic Document] + [Stylesheet 1] ---> [Printed Document 1]
           [Electronic Document] + [Stylesheet 2] ---> [Printed Document 2]
           [Electronic Document] + [Stylesheet 3] ---> [Printed Document 3]
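
As a rough illustration of the diagram above, the following Python sketch applies two "stylesheets" to one source document. It is not XSL: the two functions merely stand in for stylesheets, and the element names (report, title, summary, detail) are invented for the example. The second function differs from the first both in styling and in how much content it carries over.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document; the element names are invented for illustration.
SOURCE = """<report>
  <title>Annual Report</title>
  <summary>Sales grew.</summary>
  <detail>Widget sales rose in every region.</detail>
</report>"""


def full_style(doc):
    """Stand-in for 'Stylesheet 1': carries over every part of the source."""
    return "\n".join([
        doc.findtext("title").upper(),
        doc.findtext("summary"),
        doc.findtext("detail"),
    ])


def brief_style(doc):
    """Stand-in for 'Stylesheet 2': different styling, and the detail is dropped."""
    return "{}: {}".format(doc.findtext("title"), doc.findtext("summary"))


# One electronic document, two output documents.
document = ET.fromstring(SOURCE)
full_output = full_style(document)
brief_output = brief_style(document)
```

The source is parsed once and never modified; everything that differs between the two outputs lives in the styling functions.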

This notion of a single XML document resulting in a single output document (or set of alternative single output documents) makes a lot of sense for print documents. An XML document representing a 20-page report can be transformed to a single output document representing 20 pages.

This is not the case for web documents.

One Electronic Document != One Web Page

Let us first take the case where the output document is represented in HTML. An XSL stylesheet enables us to quite powerfully transform our single XML document into a single HTML document. A single HTML document is not generally appropriate, though, where the document would be 20 pages long in print. The reason is that there is a one-to-one correspondence between HTML document and web page. If you have one HTML document, you have a single web page.

If one wanted a large document to be transformed into multiple HTML documents, one would have to either:

have some proprietary post-processing to break up a single result document into multiple HTML documents, or
represent the original document as multiple XML documents.

The problem with the former is the "proprietary" nature of the post-processing. This can easily be solved if multiple output files are introduced to XSL as has been proposed by this author and others. XML processing tools like SAXON already extend XSL with the provision of such a feature.

The problem with the latter is partly that it confuses physical and logical structure. A document should be a logical whole (perhaps with multiple physical files), and decisions about how to carve it up into multiple HTML documents should be part of the styling phase, not the initial authoring. Another problem is that we lose the ability to easily generate navigation between the various HTML documents that make up our conceptually single document.
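
To make the argument concrete, here is a minimal Python sketch of the styling-phase alternative: one logical document goes in, several HTML documents come out, and the navigation between them is generated from each section's position in the whole. This is not XSL and not any proposed multiple-output extension; the element names, the page1.html naming scheme and the split_into_pages function are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical report; element names and file names are invented for illustration.
SOURCE = """<report>
  <section title="Introduction"><para>Why widgets matter.</para></section>
  <section title="Results"><para>Sales grew.</para></section>
  <section title="Outlook"><para>More widgets.</para></section>
</report>"""


def split_into_pages(xml_text):
    """Return {filename: html}, with one HTML document per <section>."""
    sections = ET.fromstring(xml_text).findall("section")
    names = ["page{}.html".format(i + 1) for i in range(len(sections))]
    pages = {}
    for i, section in enumerate(sections):
        body = ["<h1>{}</h1>".format(escape(section.get("title")))]
        body += ["<p>{}</p>".format(escape(p.text or ""))
                 for p in section.findall("para")]
        # Navigation comes from the section's position in the whole document,
        # which is known only because the source is one logical document.
        nav = []
        if i > 0:
            nav.append('<a href="{}">Previous</a>'.format(names[i - 1]))
        if i < len(sections) - 1:
            nav.append('<a href="{}">Next</a>'.format(names[i + 1]))
        pages[names[i]] = "<html><body>{}<p>{}</p></body></html>".format(
            "".join(body), " | ".join(nav))
    return pages


pages = split_into_pages(SOURCE)
```

Had the report been authored as three separate XML documents, the code producing each page would have no view of its neighbours, and the Previous and Next links would have to be maintained by hand.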

Now let us take the situation where the output document is represented in a vocabulary of online flow objects. One might imagine that the flow objects in XSL will allow one to construct a sequence of scrolling page objects. In such a case, we have a single output document (in the XSL FO vocabulary) but multiple web pages.

There are certainly benefits to this last approach, including reduced round tripping to the server.

A Web Site = One Document

We have considered the case where a pre-existing 20-page report is split up into multiple web pages with navigation between them. If the final version of XSL includes the ability to output multiple files, then both the splitting and the navigation will be quite simple.

This author would like to suggest that there is a fine (if not non-existent) line between a 20-page report and a larger collection of related web pages, possibly an entire site.

Only part of a web page is what could really be considered "content" specific to that page. Much of what appears on a page is standard across all pages in the collection (footers, etc.) or is dependent on the page's place in relation to other web pages on the site. Some links are driven by and specific to the content. This is hypertext in the true sense. Many links, however, are purely navigational.

What, if any, is the difference between an automatically generated table of contents put on every web page of a report split into multiple HTML documents and the hierarchical navigation on a web site? This author suggests little or none at all.

The easiest way to generate navigation based on hierarchical relationships between web pages is to represent those relationships in XML. One can give each web page a unique identifier and then express a site hierarchy that references those identifiers:

           <Site Name="ACME, Inc">
             <Section Name="Products">
               <Page Ref="productintro"/>
               <Section Name="Widgets">
                 <Page Ref="widgetintro"/>
                 <Page Ref="widget1"/>
                 <Page Ref="widget2"/>
                 ...
               </Section>
               ...
             </Section>
             <Section Name="Company">
               <Page Ref="jobs"/>
               ...
             </Section>
             ...
           </Site>
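
As a sketch of how navigation might be derived from such a hierarchy, the following Python walks the Site document and records, for each page identifier, the chain of enclosing sections (a "breadcrumb" trail). The breadcrumbs function is hypothetical, and the sample uses only the pages shown above; the parts elided with "..." are simply left out.

```python
import xml.etree.ElementTree as ET

# The hierarchy from the example above, without the elided ("...") parts.
SITE = """<Site Name="ACME, Inc">
  <Section Name="Products">
    <Page Ref="productintro"/>
    <Section Name="Widgets">
      <Page Ref="widgetintro"/>
      <Page Ref="widget1"/>
      <Page Ref="widget2"/>
    </Section>
  </Section>
  <Section Name="Company">
    <Page Ref="jobs"/>
  </Section>
</Site>"""


def breadcrumbs(site_xml):
    """Map each page identifier to the names of its enclosing Site and Sections."""
    trails = {}

    def walk(node, trail):
        for child in node:
            if child.tag == "Page":
                trails[child.get("Ref")] = trail
            elif child.tag == "Section":
                # Descend with the section's name appended to the trail.
                walk(child, trail + [child.get("Name")])

    root = ET.fromstring(site_xml)
    walk(root, [root.get("Name")])
    return trails


trails = breadcrumbs(SITE)
# trails["widget1"] is ["ACME, Inc", "Products", "Widgets"]
```

The same traversal could equally produce sibling lists or a full table of contents; the point is only that every navigational link is a function of the hierarchy and not of any individual page's content.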

What is interesting is that we are largely faking a physical structure here. We have a bunch of XML document entities (one for each web page); we give each a unique name and reference them in another XML document. Why not make the source for each web page an external entity of a single document representing the hierarchy of the site?

           <Site Name="ACME, Inc">
             &products;
             &company;
             ...
           </Site>



           <Section Name="Products">
             &productintro;
             &widgets;
             ...
           </Section>

            
           [and so on...]
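
The entity references above need declarations, which are not shown. The Python sketch below supplies some under stated assumptions: internal entities stand in for the external ones, because Python's standard-library parser does not fetch external entities by default; the Page elements are placeholders for real page content; and the widgets entity is left out for brevity. It then checks that the entity-based document and an equivalent single-file document yield the same logical structure.

```python
import xml.etree.ElementTree as ET

# Internal entities stand in for external files. On a real site each declaration
# might instead read <!ENTITY company SYSTEM "company.xml"> (file name hypothetical).
WITH_ENTITIES = """<!DOCTYPE Site [
  <!ENTITY productintro '<Page Ref="productintro"/>'>
  <!ENTITY products '<Section Name="Products">&productintro;</Section>'>
  <!ENTITY company '<Section Name="Company"><Page Ref="jobs"/></Section>'>
]>
<Site Name="ACME, Inc">&products;&company;</Site>"""

# The same logical document written out as a single physical unit.
SINGLE_FILE = (
    '<Site Name="ACME, Inc">'
    '<Section Name="Products"><Page Ref="productintro"/></Section>'
    '<Section Name="Company"><Page Ref="jobs"/></Section>'
    '</Site>'
)


def outline(node):
    """Reduce a tree to nested (tag, attributes, children) tuples."""
    return (node.tag, sorted(node.attrib.items()), [outline(c) for c in node])


# The parser expands the entities, so the processing code never sees them.
same_logical_structure = (
    outline(ET.fromstring(WITH_ENTITIES)) == outline(ET.fromstring(SINGLE_FILE))
)
```

Code that works on the parsed tree cannot tell which of the two physical layouts it was given, which is the transparency argued for earlier.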

Of course, there is no reason why many or all of the web pages couldn't actually be in the document entity itself, although the use of different entities improves maintainability and reusability.

But notice, we now have a single XML document (possibly spread over many files) that represents an entire website.