These are just some random jottings I made to straighten things out in my head. I thought I might as well stick them here in case anyone's interested.
(Fair warning: if you're not interested in designing content management systems then you will find this document tedious in the extreme.)
Content management requires a storage mechanism of some sort. At a low level that can be a standard database, but the important thing is the higher level structure of the data. It seems sensible to me to think about this as a file system structure, consisting of files organised hierarchically into directories/folders. A file is a chunk of data with some metadata (a name, timestamps, and perhaps other values useful for websites, like the name of the creator). A folder is a named collection of files and other folders.
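To make the file/folder model concrete, here is a minimal sketch in Python. The names File, Folder and walk, and the exact metadata fields, are my own assumptions for illustration; a real system would keep all this in database tables rather than in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Iterator, Optional, Tuple


@dataclass
class File:
    """A chunk of data plus some metadata (field names are illustrative)."""
    name: str
    data: bytes
    creator: Optional[str] = None
    modified: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class Folder:
    """A named collection of files and other folders."""
    name: str
    files: Dict[str, File] = field(default_factory=dict)
    folders: Dict[str, "Folder"] = field(default_factory=dict)

    def walk(self, prefix: str = "") -> Iterator[Tuple[str, File]]:
        """Yield (path, file) pairs for every file below this folder."""
        for name in sorted(self.files):
            yield f"{prefix}/{name}", self.files[name]
        for name in sorted(self.folders):
            yield from self.folders[name].walk(f"{prefix}/{name}")
```

Walking the tree like this is roughly what an administrative interface would do to list or search files by category.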
The hierarchical structure described by folders will be useful for some websites, for categorising documents or other data. Many websites have a hierarchical URL structure, which could be generated directly from the folder structure. Others might have a flat or date-based URL space (blogs or news websites, for example) but still organise files into categories for indexing purposes. Even if it has no effect on the URL or website structure, folders might be useful for finding files in the CMS administrative interface.
Files might be used in various different ways to generate the website. The ones that come to mind are:
- Some files must be published as they are. They have a corresponding URL which will return their data without modification. The publishing system or webserver must be able to associate a mime type with each file served in this way. Examples are most images that appear on a website, and CSS and JavaScript files.
- Some files must be processed or filtered in some way before being returned to the browser. Usually these will be ‘documents’. A document contains textual content and markup in some form and needs to be turned into HTML or XHTML, or perhaps other formats such as PDF. It may also be desirable to filter other data, for example scaling images. Filtered files also need a mime type to be served, but it is a property of the filtering process rather than the original file. A single file may appear on the website in HTML and PDF, having been processed in different ways.
- Some files do not appear directly on the website. These might be library images that are simply kept in the system in case they are needed, so that they can be moved into place and published easily when the time comes. They should not get in the way when they are not being used. Files may also be the most appropriate place to store ancillary information used when generating parts of the website, even if they do not themselves produce resources that can be accessed through URLs.
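The three cases above could be sketched roughly as follows. Everything here (Mode, FILTERS, publish, and the toy to_html filter) is a hypothetical name invented for illustration. The one real library call is mimetypes.guess_type from the Python standard library, which guesses a mime type from a file name.

```python
import html
import mimetypes
from enum import Enum


class Mode(Enum):
    VERBATIM = "verbatim"        # published exactly as stored
    FILTERED = "filtered"        # processed before being served
    UNPUBLISHED = "unpublished"  # kept in the CMS but given no URL


def to_html(data: bytes) -> bytes:
    """Toy filter: wrap each blank-line-separated paragraph in <p> tags."""
    text = data.decode("utf-8")
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    return "".join(f"<p>{html.escape(p)}</p>" for p in paras).encode("utf-8")


# The mime type belongs to the filter, not to the original file.
FILTERS = {"html": ("text/html", to_html)}


def publish(name, data, mode, filter_name=None):
    """Return (mime_type, body), or None if the file is not published."""
    if mode is Mode.UNPUBLISHED:
        return None
    if mode is Mode.VERBATIM:
        mime, _encoding = mimetypes.guess_type(name)
        return (mime or "application/octet-stream", data)
    mime, apply_filter = FILTERS[filter_name]
    return (mime, apply_filter(data))
```

A second entry in FILTERS (a PDF generator, say) would let the same file appear on the site in two formats, each with its own mime type.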
The file system structure stored in the CMS must map to a URL structure that is visible to users of the website. If the site is published entirely dynamically, with each resource pulled out of the database for each request, then this is enough. If however the content is first copied to static files on the webserver, then the file system structure must also be mapped to a standard directory and filename hierarchy that can be written to disk, and this structure on disk must be easy to map to URLs by the webserver. (Publishing static files in this way is important to very busy sites and to people publishing on remote servers over which they don't have much control.)
URL schemes can be many and varied, and a general-purpose CMS needs to handle that. So the mapping to URLs needs to be fairly flexible, but I don't think it need allow any possible URL scheme to be constructed. It should be easy to publish things with a tasteful URL scheme generated by default. More esoteric URL schemes can be accommodated with mod_rewrite or other dynamic fiddling.
Here are some example URLs for discussion, designed according to what I consider best practice (see Cool URIs don't change):
- http://www.example.com/
- http://www.example.com/vegetables/lettuce
- http://www.example.com/vegetables/lettuce/crisphead.jpg
- http://www.example.com/blog/2004/09/file_system_cms_notes
- http://www.example.com/search?q=iceberg+lettuce
(Apologies to anyone led here searching for lettuce information. For some reason it was the first thing that came into my head. Probably my body crying out for a healthier diet.)
A file might map to one or more URLs (or perhaps none), and a URL may exist even if there is no specific file behind it. For example, #1 above might show a general index page listing available documents, in which case it doesn't necessarily correspond to a particular document itself. More importantly, a site organised hierarchically might have an arbitrary number of section index pages (for example listing all articles and subsections related to vegetables), and it would be inconvenient to have to make files in the CMS to generate each of those. Of course, you might want to have a file there to provide editorial content or something, but that won't always be the case. It seems like it might make sense for folders to generate URLs as well as files.
There are other situations when a file might have other files attached to it. A document about lettuce might produce URL #2, but an associated image file could produce URL #3. In this case it looks like a file is behaving more like a folder, containing other files as well as its content. Perhaps files and folders should be the same thing. If this is the case, are there still cases in which URLs don't have an associated file?

Dictionary website thought experiment
Consider a dictionary website. Most of the URLs return definitions of words or indexes of various kinds. Since this is fairly specialised information it makes sense to store it in a custom DB structure, rather than packaging it in the CMS filesystem structure. It could be done with some new tables in the same DB as the CMS, or somewhere else. We could use completely custom software for this, but a general purpose CMS would mean the site can publish articles or press releases or whatever without any grief. Experience has taught me that it's useful for the CMS to know about all the URLs involved in the site, so it would be most convenient if it handled the dictionary definition pages too, although it might be acceptable to use mod_rewrite and some specialised software just for those if they were completely separate. That would still mean that the site design (templates and stuff) would be in two separate places though.
The best way I can think of to handle this case is to have some object in the CMS file structure that generates URLs for all the dictionary definition pages. If pages are served dynamically then the requests would be directed to this overall definition document, which would have code associated with it to generate the appropriate page, using the specialised database. OK, so rather than show that URLs might not have a file associated with them I've found a situation where a potentially huge number of URLs come out of one file. If there's some custom code to handle generating the URLs, and it has access to a sensible API, then there's no reason why this couldn't work. The dictionary database might have timestamps that the code could use to find updated entries that need adding or modifying.
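Here is a rough sketch of what the code attached to that one definition object might look like. The definitions table, its columns, and the function names are all assumptions made up for the example. I've used SQLite from the Python standard library as a stand-in for the specialised database, with timestamps stored as ISO date strings so that they compare correctly as text.

```python
import html
import sqlite3
from urllib.parse import quote

# Hypothetical schema for the specialised dictionary database.
SCHEMA = """
CREATE TABLE definitions (
    word TEXT PRIMARY KEY,
    definition TEXT NOT NULL,
    updated TEXT NOT NULL  -- ISO 8601 date, e.g. 2004-09-01
)
"""


def definition_urls(db, base_url, since=None):
    """Yield one URL per dictionary entry, optionally only those updated after `since`."""
    sql = "SELECT word FROM definitions"
    params = ()
    if since is not None:
        sql += " WHERE updated > ?"
        params = (since,)
    sql += " ORDER BY word"
    base = base_url.rstrip("/")
    for (word,) in db.execute(sql, params):
        yield f"{base}/{quote(word)}"


def render_definition(db, word):
    """Build the page body for one word, or return None if it is not defined."""
    row = db.execute(
        "SELECT definition FROM definitions WHERE word = ?", (word,)
    ).fetchone()
    if row is None:
        return None
    return f"<h1>{html.escape(word)}</h1><p>{html.escape(row[0])}</p>"
```

For static publishing, the since argument is where the database timestamps would come in: only entries changed since the last publishing run need to be regenerated.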
Objects
OK, I'm now happy that there should be one unified concept for objects in the file system. I'll call it the object for want of a better word. An object can have some content (either binary data or textual content that will be filtered) and it can contain other objects, acting like a folder.
Objects with null content (folders) either don't generate a URL (acting only as a container for organising other objects) or somehow generate an index page. In general there needs to be some way of signalling how a particular object is treated when generating URLs or output, so it needs to have some sort of code attached to it. Since there will often be groups of objects that need to behave the same way it seems sensible to make each object belong to a class of some sort. A class would somehow dictate how objects behave when generating URLs or when being filtered for output. Some classes might just specify that the output is the same as the content (for image files and things). There would likely be some predefined classes for this sort of thing.
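A minimal sketch of objects and classes, to check the idea hangs together. The names Obj, ObjectClass, Verbatim and Index are placeholders of my own, and a real class would certainly need more hooks than a single render method.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


class ObjectClass:
    """Default behaviour: a pure container that produces no output."""

    def render(self, obj: "Obj") -> Optional[bytes]:
        return None


class Verbatim(ObjectClass):
    """Output is the same as the content (image files and the like)."""

    def render(self, obj: "Obj") -> Optional[bytes]:
        return obj.content


class Index(ObjectClass):
    """Generate a plain listing of the names of the contained objects."""

    def render(self, obj: "Obj") -> Optional[bytes]:
        return "\n".join(sorted(obj.children)).encode("utf-8")


@dataclass
class Obj:
    """One unified thing: optional content plus optional child objects."""
    name: str
    cls: ObjectClass = field(default_factory=ObjectClass)
    content: Optional[bytes] = None
    children: Dict[str, "Obj"] = field(default_factory=dict)
```

The point of the indirection is that changing how every image or every section index behaves means changing one class, not touching each object.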
Objects need identifiers. They'll presumably get a unique ID number in the database, but those won't stay the same if an object is exported from one CMS installation and imported into another. Solution: make sure every object has a name. If names are unique among sibling objects then we can identify one just as the Unix filesystem model does, as a sequence of names starting from the root. An image file's name might map into its filename. Names of other objects might also be used in URLs, by default.
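The name-based identifiers could work something like this sketch. Obj, path and resolve are hypothetical names, and I've assumed / as the separator to match the Unix model.

```python
class Obj:
    """An object identified by its name, which is unique among its siblings."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}
        if parent is not None:
            if name in parent.children:
                raise ValueError(f"sibling named {name!r} already exists")
            parent.children[name] = self

    def path(self):
        """Return the sequence of names from the root, joined Unix-style."""
        parts = []
        node = self
        while node.parent is not None:
            parts.append(node.name)
            node = node.parent
        return "/" + "/".join(reversed(parts))


def resolve(root, path):
    """Find the object a path refers to; raises KeyError if there is none."""
    node = root
    for name in path.strip("/").split("/"):
        if name:  # skip the empty segment produced by the path "/"
            node = node.children[name]
    return node
```

Because a path depends only on names, it survives an export and import between installations, which a database ID number would not.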

An object's class dictates what URLs are generated when it is published, probably by having a bit of code in the class to do it. If there's no code set then we use a default algorithm. How about this:
- If the object has a ‘URL property’ of some sort, then we use that unchanged. This could be a useful override for special cases, but more importantly it defines the URLs for the objects at the root of each website. That's how we know what domain we're publishing to.
- Failing that, get the URL of the parent object. Add a / if it doesn't already end in one, and stick the object's name on the end.
- It might be sensible to have a setting somewhere for adding an extra / to the end, because that would map better to default Apache configurations. In that case, if we were publishing to static files, the publishing code would simply call all documents index.html or whatever.
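The default algorithm is short enough to sketch in full. The names Obj, default_url and static_path are assumptions for illustration. For simplicity the trailing-slash setting is a plain argument that applies to every object it is called for, which a real system would probably switch off for things like images.

```python
from urllib.parse import urlsplit


class Obj:
    """Just enough of an object to test URL generation."""

    def __init__(self, name, parent=None, url=None):
        self.name = name
        self.parent = parent
        self.url = url  # the optional 'URL property'


def default_url(obj, trailing_slash=False):
    """Apply the default URL rules described above."""
    # Rule 1: an explicit URL property is used unchanged.
    if obj.url is not None:
        return obj.url
    if obj.parent is None:
        raise ValueError("a root object needs a URL property")
    # Rule 2: parent's URL, a / if needed, then the object's name.
    base = default_url(obj.parent, trailing_slash)
    if not base.endswith("/"):
        base += "/"
    url = base + obj.name
    # Rule 3: optional extra / on the end.
    if trailing_slash:
        url += "/"
    return url


def static_path(url, index_name="index.html"):
    """Map a URL to a relative file path for static publishing."""
    path = urlsplit(url).path.lstrip("/")
    if path == "" or path.endswith("/"):
        return path + index_name
    return path
```

With the trailing slash turned on, the lettuce document would be written to vegetables/lettuce/index.html, which a default Apache configuration serves without any extra work.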
I think this design sounds right. Let's test it with the example URLs defined above:
- There's an object for the www.example.com homepage, which sets the base URL for everything inside it. That object might have some content, or the content might be generated automatically from the objects inside (just a listing of recent blog entries, or whatever). This object might belong to a special class, because the homepage of a site often behaves a bit differently from all the rest.
- Inside the homepage object there's a vegetables object, and inside that one a lettuce object. Their names generate this URL automatically, using the default URL algorithm. This object presumably has some content about lettuce, and its class will be such that it produces an HTML page from that content. It might even generate an additional URL for the printer friendly version.
- Inside the lettuce object there's one with the name crisphead.jpg whose content is a JPEG image. It doesn't have to have a .jpg extension, but that seems sensible. An HTML page might easily turn into a CGI script or PHP file later, but a JPEG image is likely to always be a JPEG image, and without an extension there's no telling what IE will do with its mime type.
- This object might be a bit different. For a blog it might not make as much sense to use the default URL algorithm. So perhaps we have all blog entries as objects inside a single blog object, or perhaps we use other objects to organise them into categories. If the categories aren't used on the site then this would be a purely internal administrative convenience. Each blog article might belong to some class that defines its own URL generation code, which in turn uses the publication date and the name of the object (acting as its slug) to generate the URL. There's no need for us to manually create objects for all the years and months in which articles are published. There is one tricky thing here: we'll want to create a single ‘year archive’ object, and another ‘month archive’ one, to generate URLs for all the date-based index pages. Somehow these objects need to be republished whenever a new blog article gets published.
- There's a search object for the search engine interface. For a dynamic site this would have publication code that does the search and generates a page of results. For static publication it might generate a CGI script or something. The query string part of the URL should probably not be considered part of the URL for our purposes (identifying objects).
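The blog and search cases are the two that step outside the default algorithm, so here is a small sketch of each. The function names are hypothetical, and the shape of the archive URLs (the article URL with trailing parts chopped off) is my assumption rather than anything settled above.

```python
from datetime import date
from urllib.parse import urlsplit, urlunsplit


def blog_url(base, published, slug):
    """Build an article URL from its publication date and its name (slug)."""
    return f"{base.rstrip('/')}/{published:%Y}/{published:%m}/{slug}"


def archives_to_republish(base, published):
    """URLs of the year and month index pages affected by a new article."""
    base = base.rstrip("/")
    return [
        f"{base}/{published:%Y}",
        f"{base}/{published:%Y}/{published:%m}",
    ]


def identity_url(url):
    """Drop the query string and fragment, leaving the part that identifies an object."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```

When an article is published, the code for its class could call something like archives_to_republish to work out which pages the two archive objects need to regenerate.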
Well, I feel better for getting that out of my system. With a software design in one's head trying to get out, who can sleep?