I've been thinking about how to provide a database schema with a library that uses it (in particular, a search engine module for Perl). I want to be able to port my library to multiple database engines and make it easy for people to set up a database with the appropriate tables defined. I can provide an SQL file to feed in to the DBMS, but if I need a separate one for each different backend database it'll get hard to maintain.
Enter XML: it sounds sensible to me to provide the schema in an XML format, and have a program that can turn it in to SQL suitable for various RDBMSs. The XML file I distribute would describe the tables and columns in a reasonably readable form, and sufficiently abstracted that the same input can work across all the very different SQL syntaxes out there.
There are other things I might be able to do with the data once it's in XML: e.g., generating GraphViz input from it to draw an entity-relationship diagram, or generating documentation in HTML.
I've tried Googling for things like this, but so far haven't been able to find anything relevant. All I found were about a million XML languages for talking to ‘XML databases’, which I don't want to get involved with, and some discussion among PHP hackers which doesn't look anything like what I want.
So I've been thinking about a suitable format, to see if it would be simple enough to hack something up myself. There are quite a few issues, but I should be able to put something together sufficient for my needs.
I've come up with an example, showing the three most important tables in my search engine database. They map documents to the words they contain, using a junction table in between:

Here's a rough idea for some XML to describe that structure (I fortunately resisted the urge to use ‘tuple’ as an element name):
<database> <!-- Documents which have been indexed --> <table name="document"> <primary-key columns="id"/> <column name="id" type="autoincrement" null="no"/> <column name="filename" type="text" null="yes"/> <column name="url" type="text" null="yes"/> <column name="title" type="text" null="yes"/> <column name="hits" type="int" null="no" default="0"/> <column name="time_indexed" type="datetime" null="no"/> <column name="external_id" type="int" null="yes"/> </table> <!-- Words found in documents --> <table name="word"> <primary-key columns="id"/> <unique-key columns="word"/> <column name="id" type="autoincrement" null="no"/> <column name="word" type="varchar" size="127" null="no"/> </table> <!-- Postings - occurances of words in documents, at particular positions --> <table name="posting"> <column name="word_id" type="int" null="no"> <references table="word" column="id"/> </column> <column name="document_id" type="int" null="no"> <references table="document" column="id"/> </column> <column name="position" type="int" null="no"/> <index columns="word_id"/> <index columns="document_id"/> <index columns="document_id word_id"/> </table> </database>
That works for me. It should be easy enough to parse and generate
some SQL create table statements. The type names would be
mapped to the best equivalents in your DBMS. The null
attribute would generate nothing or a not null bit in the SQL.
I've got indexes and primary keys supplying a list of columns they
apply to (separated by whitespace, like NMTOKENS or whatever).
Inserting data into tables
A database might need to have some data preinstalled in it. I think it would make sense to allow our XML format to supply that so that we can get everything set up in one go. If all we want to do is get a few records put in when the database is created, then a simple case might look like this:
<database> <table name="user"> <primary-key columns="id"/> <column name="id" type="autoincrement" null="no"/> <column name="username" type="text" null="no"/> <column name="email-address" type="text" null="yes"/> </table> <record table="user"> <value column="username">fred</value> <value column="email-address">fred@example.com</value> </record> </database>
The id column gets given a value automatically.
The main difficulty with this is that records might refer to other records, either in the same table or a different table, by ID number. If we're inserting both of them then we need to make sure that the automatically generated ID in one record gets used in the record that refers to it. Perhaps we could tag records with an identifier of some sort and refer to them through that. The SQL generator would replace the tag with the appropriate ID or SQL code to work out what it is.
This example uses a group table. The user table
references it with the group-id column. An application might
come with a single predefined ‘admin’ user, who gets full
privileges on the system because they are in the ‘administrators’
group:
<record table="group" id="admin-group"> <value column="groupname">administrators</value> </record> <record table="user"> <value column="username">admin</value> <value column="password">frobnitz</value> <value column="group-id" record="admin-group"/> </record>
Table and column identifiers
The user table above includes a column called
email-address. The hyphen presents problems, because
a particular RDBMS might not support it in column identifiers.
I think we need to allow at least ASCII letters, digits and
some sort of separator character in identifiers. Perhaps using
underscore would be enough? Is that supported by all
interesting databases engines?
SQL also has keywords which get in the way of user-defined
identifiers. So, for example, Postgres doesn't like it if you
create a table called user, unless you escape the
identifier with double quotes:
create table "user" (...);
We could have the implementation do RDBMS-specific escaping of identifiers to get around most of this. MySQL allows escaping with backticks, but seems to allow pretty much anything inside them. Are there interesting RDBMSs which don't allow SQL keywords to be used in identifiers at all? If so, we would have to define a set of names which aren't allowed, but I'd rather avoid that.
The other issue here is case sensitivity. There is at least some case-sensitivity in RDBMSs (Postgresql table identifiers appear to be case-sensitive). It's probably best to preserve case in identifiers, and insist that you spell them right in the XML. That might also make implementation simpler.
Autoincrement type
I've used the type name autoincrement for columns which contain
unique ID numbers picked automatically by the database. That would
map to serial in Postgres, for example. Every RDBMS seems
to have a different name for that feature, but I suspect any worth
using implement it in some way.
I don't know yet whether it would be worth supporting full
SQL ‘sequences’ (which are the same idea, but allow you
to adjust or query the next ID to be used). Postgres and Oracle
support them, and SQL Server has the identity, type which
seems to be equivalent. Last time I looked, though, MySQL didn't
have these. I've never needed full support for sequences, so it
may be easiest not to support them (in terms of mapping the XML
to the largest number of RDBMSs possible).
The only use of sequences I've seen (in my limited experience of stupidly complicated database schemas) is in Bricolage, which seems to use them to reserve low-numbered IDs for predefined things, starting records that are added by users at number 1024. That might make it simpler to add new predefined records, but I suspect that it's not necessary, especially if we have the tagged records feature shown above.
It may make sense for autoincrement columns never to allow
NULL values. Not sure about that.
Updating a schema
If a new version of an application needs a slightly modified database schema (a new column or table is added, for example) then we need to update it somehow. Bricolage does this by running little Perl scripts which use SQL to modify the database. That's completely flexible, but doesn't fit with the XML idea very well.
I thought for a while about whether there could be code to work out and implement the differences between one version of the schema and the next, or between the current database schema and the new XML description, adding missing stuff and deleting stuff that's been removed. I don't think that would work well though, because there would be no way to do renames or type changes without losing the data in a column, which I would expect to be one of the more likely changes in a software upgrade.
It might be sensible to have an update syntax for making changes to the database schema. So we might have stuff like this:
<rename-column table="user" old-name="email-address" new-name="email"/> <add-column table="user"> <column name="timeout" type="int" default="3600"/> </add-column>
If I do that sort of thing then it would also make sense
(because it wouldn't cost much more, and we'll probably need it)
to have similar things for inserting, deleting and updating records.
The <insert> element, or whatever it was called,
would replace the declarative-style <record>
thing I initially thought of. If the format could be used to make
changes to an existing database, then that might open up other
interesting possibilities. It would make it something like
XUpdate, except for
ye olde non-XML databases.
There are always going to be some database mangling operations
you can't do without a full programming language. Example:
a new version of a program decides that all the password
fields should now be MD5 hashed for security. I wouldn't be
able to make that sort of change with my XML, unless I did something
insane like incorporating
Superx++ or
XEXPR :-)
Too-much-caffeine mutterings
Indexes might be more complicated than the simple examples above. Some databases have scary indexing features, like the Postgres ‘R-tree’ index type, which can be used to optimise searches for collisions between geometrical shapes(!). And even for less mad things, what do we do about applications where performance is a big deal (like my search engine) and the programmer wants to say “use a hash with Postgres because it's slightly faster for such-and-such a lookup, but with Oracle…”
It might be nice to have optional ‘annotations’, in an
element called <doc> or something, to provide
brief descriptions of tables and columns. There might be
possibilities for generating readable DB schema
documentation from the XML.
Crazy idea: I was just thinking about what I said about XUpdate above. What if you extended the syntax to cover all the SQL we actually use in typical applications… you could do scary things like write a wrapper round DBI (or other database library of your choice) so that you could write all queries in XML and have it translated to the appropriate format. That could solve a lot of SQL portability problems, probably at the expense of some programmer sanity. I'm not at all sure about this… SQL may be a bit clunky, but it's usually quite straight-forward and readable.
Oh, hang on, we could have a DBMS-independent non-XML syntax, very similar to SQL, which gets transformed through the XML intermediate format… gah!
…but we'd lose the niceness of being able to try out SQL interactively and then paste the finished product into a program. Unless we wrote a command line generic database client that also did the translation, and… no Qef, time for bed!