Graham Bowden's ThoughtWorld

Using XML

It's the buzz! It's the groove! Be into XML or you'll miss the boat!

Yeah right...

XML is definitely a good idea. A standard data format (and description format) that can be used between applications, between platforms, between countries, between organisations.

XML is fairly simple and it looks like HTML so it feels familiar. However, making use of XML will be more complicated.

Concepts of XML

XML is self defining

XML states with each item of data what that data is. It does this by means of a "tag". Tags can be nested within other tags (like <TABLE>, <TR> and <TD> in HTML), giving the data item a "context". An XML document can be checked against the simple syntax rules (that the tags are properly nested and so on) to show that the document is "well formed". Additionally, an XML document can be validated against a definition of a tag structure.
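As a minimal sketch of the "well formed" check, the following uses the parser in Python's standard library. The function name and the sample documents are my own illustrations. Note that this parser checks nesting and syntax only; it does not validate a document against a definition of a tag structure, which would need a validating parser.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the text parses as XML, i.e. its tags are properly nested."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Illustrative documents: the second closes its tags in the wrong order.
good = "<order><product_code>P123456</product_code></order>"
bad = "<order><product_code>P123456</order></product_code>"

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```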

Level of structure of communication

Many of our systems communicate with one another now. They send predefined, highly structured files to one another, they read each other's tables, they may even use real time messaging. The whole communication has to be agreed in advance. With XML, we could define tag structures that allow a machine to understand parts of a message rather than the complete message. For instance, with standard XML, a machine can read a well formed document and know it is XML. It can then, say, display the tag structure in a tree format. If it has access to the tag definition (the Document Type Definition or DTD), then the system has an "understanding" of the structure.
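To illustrate how little a program needs to know in order to show the tag structure as a tree, here is a short sketch. The message and its tag names are made up for the example; the program has no knowledge of what any of the tags mean.

```python
import xml.etree.ElementTree as ET

def tag_tree(element, depth=0):
    """Return the tag names of an element and its descendants, indented by depth."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(tag_tree(child, depth + 1))
    return lines

# Hypothetical message; any well formed document would do.
message = (
    "<message><customer><name>A. Client</name></customer>"
    "<product_code>ABC1</product_code></message>"
)

print("\n".join(tag_tree(ET.fromstring(message))))
# message
#   customer
#     name
#   product_code
```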

The varied ways we communicate around our systems will become more apparent, as we move from the older interface methods to a more comprehensive approach.

This will form the first issue. For the early adopters, it will be a case of attempting to formulate a format for each type of message. Gradually, standards will emerge and the process of natural selection will determine the ones that achieve market dominance.

XML can give us a way of formatting semi structured information such that some, if not all, of that data can be used by other applications. This data can then be incorporated as XML portions within standard databases and OLAP tools.

For example, a contacts database holds summaries of client conversations and other communications. A client mentions that they are interested in product ABC1 at a slightly lower price. The contact database application reads the XML text and "sees" ABC1. The application looks up how to handle PRODUCT_CODE, calls the product database and obtains information on the product. This is then displayed in a little yellow box when the mouse rolls over the data. The same thing happens when using email - roll over a PRODUCT_CODE and the information is obtained. A right click and the product database update form is initiated for this product. Two more clicks and the text of the message could be added to the database.

In order to achieve this, we would need a generic XML processor as part of the operating systems. The DTD (Document Type Definition) would determine the processing of the XML tag via NOTATION. This may need to be extended to cover the different events and handling methods.

At the individual level, each method may need to be enabled, disabled or changed - I might not want to display the product database, but check the finance database instead. For employee numbers, as I am not in HR I would prefer to not continuously get the message "You are not authorised to access this system".
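The contacts example and the per-user settings could be sketched roughly as follows. Everything here is an assumption made for illustration: the handler table stands in for whatever the generic XML processor would provide, PRODUCTS stands in for the product database, and the tag names other than PRODUCT_CODE are invented. A tag with no handler is simply left alone, so a user outside HR never triggers a lookup on employee numbers.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the product database.
PRODUCTS = {"ABC1": "Widget, list price 10.00"}

def lookup_product(code):
    return PRODUCTS.get(code, "unknown product")

def lookup_finance(code):
    # Hypothetical alternative handler a user might prefer.
    return f"finance record for {code}"

# Default handlers: tag name -> function called with the element text.
DEFAULT_HANDLERS = {"PRODUCT_CODE": lookup_product}

def annotate(xml_text, handlers):
    """Return (tag, text, info) for each element whose tag has a handler."""
    results = []
    for element in ET.fromstring(xml_text).iter():
        handler = handlers.get(element.tag)
        if handler is not None:
            results.append((element.tag, element.text, handler(element.text)))
    return results

note = (
    "<NOTE>Client is interested in <PRODUCT_CODE>ABC1</PRODUCT_CODE> "
    "at a slightly lower price. Taken by <EMPLOYEE_NUMBER>E42</EMPLOYEE_NUMBER>.</NOTE>"
)

print(annotate(note, DEFAULT_HANDLERS))
# [('PRODUCT_CODE', 'ABC1', 'Widget, list price 10.00')]

# One user changes the method for product codes; another disables it.
finance_user = dict(DEFAULT_HANDLERS, PRODUCT_CODE=lookup_finance)
no_lookups = {}
print(annotate(note, finance_user))
# [('PRODUCT_CODE', 'ABC1', 'finance record for ABC1')]
print(annotate(note, no_lookups))
# []
```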

Furthermore, systems may not require a human operator to make certain interpretations of messages, even if the system cannot fully understand the message. For instance, when writing an email about a product, if the product database itself is copied in, then it will file the message as comments about the product. Other parts of the message that it can interpret may create additional links (Customer number, type of enquiry etc.). It may be that only the product manager can feed the database with these items.

I believe that the successful formats will be extensible just like XML and will define certain aspects of the message. In our examples, products, customers and employees may be defined in different DTDs. These are then united for an organisation, department or user. These "standard" building blocks could be for the method of return, the file generation information, the version of the file, the version of its format as well as for the application level information. So that the systems can cope with this somewhat disorganised array of formats, receiving and possibly sending systems will scan for parts of the message that they can understand. This will lead to systems having some understanding of a message even if they cannot wholly interpret it.
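A receiving system that scans for the parts it understands might look something like this sketch. The tag names and the set of "known" tags are assumptions for the example; the point is only that the unknown parts are skipped rather than causing the whole message to be rejected.

```python
import xml.etree.ElementTree as ET

def partial_read(xml_text, known_tags):
    """Split a message into the values this system understands and the tags it skipped."""
    understood, skipped = {}, set()
    for element in ET.fromstring(xml_text).iter():
        if element.tag in known_tags:
            understood.setdefault(element.tag, []).append((element.text or "").strip())
        else:
            skipped.add(element.tag)
    return understood, sorted(skipped)

# Hypothetical message mixing several building blocks.
message = (
    "<message><format_version>2</format_version>"
    "<customer_number>C1001</customer_number>"
    "<product_code>P123456</product_code>"
    "<free_text>Please call back.</free_text></message>"
)

# This system only knows about customers and products.
understood, skipped = partial_read(message, {"customer_number", "product_code"})
print(understood)  # {'customer_number': ['C1001'], 'product_code': ['P123456']}
print(skipped)     # ['format_version', 'free_text', 'message']
```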

Writing Defined Fields

The attraction of a product code being highlighted as such for the reader is obvious, but would the writer be able to define the actions that should be done with their particular data item?

Well, with web applications this may become a limited possibility, but as the advantages are mainly for the reader, I believe that the administration of the actions is going to fall mainly to the reader to define (of course in practice this may well be the user's IT specialists). Within free text fields the text could be scanned for data that "looks like" it could be a particular type of item. These could be defined in a "regular expression" type way. For instance, text could be scanned for six digit numbers prefixed by a "P" to find product codes. When found, the XML tags could be added to surround the data automatically. Further items that may be less easily scanned for automatically (e.g. price of item) could be highlighted and tagged by the user from their palette of tags (which could be defined by the DTD and the context).
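The "P" plus six digits scan is easy to sketch with a regular expression. The function name and the PRODUCT_CODE tag are chosen for illustration, and the free text is escaped first so that any stray "&" or "<" in it does not break the resulting XML.

```python
import re
from xml.sax.saxutils import escape

# A "P" followed by exactly six digits, not embedded in a longer word or number.
PRODUCT_CODE = re.compile(r"\bP\d{6}\b")

def tag_product_codes(text):
    """Escape free text for XML, then wrap anything that looks like a product code in tags."""
    return PRODUCT_CODE.sub(
        lambda match: f"<PRODUCT_CODE>{match.group(0)}</PRODUCT_CODE>",
        escape(text),
    )

print(tag_product_codes("Client asked about P123456 & P1234567."))
# Client asked about <PRODUCT_CODE>P123456</PRODUCT_CODE> &amp; P1234567.
```

The seven digit number is left untagged because it does not fit the pattern; a real scan would of course produce some false matches, which is one reason the reader needs to be able to adjust the rules.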

Break Away From Flat Tables

There has been a tendency for us to think "two dimensionally" about our data. We take data from our relational database tables and put the information into spreadsheets. Although the relational database is capable of describing any data structure, it and our current tools focus on taking data from two dimensional tables and creating a new two dimensional table. This thinking has stretched to our file definitions - we typically (but not exclusively) create files that contain only one table of information. If we wish to create a second table, we create a new file or, if there is a parent child relationship, "denormalise" by repeating the parent data.

XML creates a tree structure which can store parent child relationships, or allow us to store two (or more) types of data that are unrelated. We no longer have to denormalise to represent parent child relationships. We can represent orphans and widows (if these are to be allowed). We can allow parts of our text areas to be structured. This freedom not only allows us to communicate our data in a more meaningful way, but will also signal developments in the way we process our data. We will develop a new approach to processing, based on the XML object model. Already, applications are adding an "explorer style" tree structure on the left hand side to allow an overview of a complex group of entities. My belief is that the "detail" shown on the right hand side will develop some new GUI widgets.
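As a small sketch of the difference, the rows below are a denormalised flat table in which the customer data is repeated on every order line, and the function rebuilds them as a tree in which each customer appears once. The column layout and element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Denormalised rows: (customer id, customer name, order id, product code).
rows = [
    ("C1", "Acme", "O1", "P123456"),
    ("C1", "Acme", "O2", "P654321"),
    ("C2", "Bolt", "O3", "P123456"),
]

def rows_to_tree(rows):
    """Build a customers tree in which each parent is stored once with its children nested."""
    root = ET.Element("customers")
    seen = {}
    for customer_id, name, order_id, product_code in rows:
        customer = seen.get(customer_id)
        if customer is None:
            customer = ET.SubElement(root, "customer", id=customer_id, name=name)
            seen[customer_id] = customer
        ET.SubElement(customer, "order", id=order_id, product_code=product_code)
    return root

tree = rows_to_tree(rows)
print(ET.tostring(tree, encoding="unicode"))
```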

One could then say, why stop at the tree structure for new GUI widgets? Why not look at linked lists, linked lists of linked lists, indexing, unordered sets and any other data structure we may like to dream up? Well, I think the answer is one of commonplace acceptance and understanding. The three dimensional spreadsheet was a minor leap from the two dimensional one. The tree structure, given its use in "explorer style" overviews, is now widely used and understood. With the advent of XML, the tree structure will gain a new level of acceptance.

The Pitfalls

XML can store data

Why don't we use XML for our entire database storage? Although I would not rule it out, at the current level of technology, XML could only be used for small, simple databases.

Joining is awkward, and not independent of physical storage

As we have noted, XML can represent parent child relationships. However, the nested form of this representation allows a child only a single parent. We can, of course, use relational foreign key techniques to represent other connections, but then if we use both techniques, we need two different processes to enable these "joins". XSL and XQL are not yet clever enough to be able to insulate us from the mechanics of obtaining the data. There is a further type of join possible in XML - using links. This is a design time relationship, and could be indexed and cached.
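The two kinds of "join" can be seen side by side in this sketch. The document layout and the product_ref attribute are assumptions for the example. An order is nested under its one customer, but has to point at its product by key, and the code that follows each relationship is quite different.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<data>
  <products>
    <product id="P123456" name="Widget"/>
  </products>
  <customer name="Acme">
    <order id="O1" product_ref="P123456"/>
  </customer>
</data>
""")

# Join 1: nesting. The parent is found by walking the tree.
orders_by_customer = [
    (customer.get("name"), order.get("id"))
    for customer in doc.iter("customer")
    for order in customer.findall("order")
]

# Join 2: foreign key. The program builds its own index and matches values.
product_names = {p.get("id"): p.get("name") for p in doc.iter("product")}
orders_with_product = [
    (order.get("id"), product_names[order.get("product_ref")])
    for order in doc.iter("order")
]

print(orders_by_customer)   # [('Acme', 'O1')]
print(orders_with_product)  # [('O1', 'Widget')]
```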

Verbose

XML is by definition verbose. Each time we create a new element, we state the name of that element at least once, and very often twice. This can greatly increase our storage requirements. The more data there is to process the more time it takes to process. Therefore as a mass storage medium, XML is not well adapted - certainly not until there is a cross-platform, standard method of compression.
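A rough way to see the overhead is to write the same records as a flat delimited file and as XML and compare the sizes. The records and field names are invented, and the exact figures will depend on the data and the tag names chosen, so this is an illustration rather than a measurement.

```python
import csv
import io
import xml.etree.ElementTree as ET

fields = ("code", "name", "price")
records = [("P123456", "Widget", "10.00"), ("P654321", "Gadget", "12.50")]

def as_csv(records):
    """Write the records as a delimited file with one header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(fields)
    writer.writerows(records)
    return buffer.getvalue()

def as_xml(records):
    """Write the records as XML, naming every element in an opening and a closing tag."""
    root = ET.Element("products")
    for record in records:
        product = ET.SubElement(root, "product")
        for field, value in zip(fields, record):
            ET.SubElement(product, field).text = value
    return ET.tostring(root, encoding="unicode")

csv_size = len(as_csv(records))
xml_size = len(as_xml(records))
print(csv_size, xml_size)
```

In the flat file the field names appear once in the header; in the XML they appear twice for every value, which is where the extra size comes from.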

Summary

Applications will develop links between each other based not on programmatic design but on the common data elements they contain. These can come from defined fields such as a product code, but also from text fields that contain defined fields.

XML can be seen as an advance in the way systems and their applications can communicate, but we have to be careful not to throw away many of the lessons of the past and try to use XML for everything.
