There's an old joke to the effect that the world can be divided into two groups of people: those who divide the world into two groups of people and those who don't. Similarly, we've typically regarded data as either structured – broadly, databases built and maintained by a fully-fledged DBMS – or unstructured, meaning everything else. However, there's a large grey area in between, naturally called semi-structured data, and in many ways semi-structured data is the most interesting of all. In order to understand why this should be, and why the distinctions are important, we first need to be clear what we mean by each of these terms.
Let's take a concrete example. Consider the different forms a customer contact database might take. At its most chaotic it could be a collection of yellow Post-It notes, business cards with cryptic comments scribbled on the back and letters in filing cabinets somewhere. It's almost exclusively text, can take many physical forms and has no inherent structure; any particular piece of information may or may not exist, and there are no rules for formulating queries. That's unstructured.
At the other extreme you could have a DBMS-maintained customer contact database. Every contact, whether in person, by phone, by e-mail or by letter, is documented: the date, time and people involved are logged, the discussions summarised and the location of any other records such as letters or phone transcripts noted. That's structured. The data exists in a rigidly defined format, stored electronically. There is a formal data model and the database file structure has a one-to-one correspondence to it. The physical file(s) containing the database itself is only part of the whole thing – it requires both the DBMS that built it and the underlying data model it was built from to give it meaning, and queries must follow explicit rules.
This concept of structured data implies rather more than just a rigid file format. For a start, it's not text.
Indeed, its value, the information content, arises from the relationships between data items rather than from the individual data items themselves. A time and date (whether stored in binary or text form) has no meaning in itself; the importance lies in relating it to a particular phone call made at that time. Without the context, the web of relationships to other data items, an individual data item is meaningless.
Another important aspect of structured data is that the dataset is guaranteed complete, in the sense that there is nothing missing; there are no holes. There is no way to store anything not defined in the data model, and everything defined in the data model must have a value (even if it's only some sort of "VALUE ABSENT" flag) – for example, every contact has a date and time associated with it because the applications used to create the entries require it. It is this completeness that identifies structured data, not the way it's stored.
Semi-structured data
Most data will, clearly, be unstructured. That said, there are many unstructured datasets with some, but not all, of the characteristics of structured data. They may be incomplete, or not entirely relational, or in some other way fail to match all the criteria for structuredness.
A good example would be a set of related spreadsheets. Suppose you've chosen to keep your customer contact records in such a way. Each salesperson, or helpdesk worker, or whoever, has their own version of a customer contact spreadsheet in which they enter the relevant details. All the spreadsheets are then collated once a month. Simple, cheap and workable and certainly a vast improvement over a bunch of Post-It notes.
It's clearly not structured though. There's no mechanism to enforce completeness. Neither is it centralised – the individual spreadsheets are probably scattered across the intranet. On the other hand it's not entirely unstructured either. There's an implied data model since users can only enter information for which the spreadsheet has a space. Individual data items, such as dates and times or contact names, have no meaning out of context, and the files themselves are in a rigid format that requires the original software for reading. In fact, put that way it sort-of looks structured but not entirely so. Semi-structured in other words:
* It's not just text, but such relations as exist are local rather than dataset-wide.
* It's not rigidly structured; such structure as exists is irregular, partial and implicit rather than explicitly required by the data model.
* The data comprises objects. The data model is a network of nodes in which different branches have different sizes and properties.
* Queries may follow the network but may just as easily ignore the data model altogether.
* The data model is ad hoc, growing as requirements change. It must accommodate many potential variations and must be easily adaptable for new types of input.
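To make the contrast a little more tangible, here is a minimal Python sketch. Everything in it is an assumption made for illustration rather than anything prescribed above: the in-memory SQLite table stands in for a "fully-fledged DBMS", and the table name, the field names, the helper functions insert_structured and field_coverage, and the sample rows are all hypothetical. It tries to load collated spreadsheet rows into a table whose schema demands completeness, then shows the implicit data model that the rows themselves carry.

```python
import sqlite3
from collections import Counter

# Hypothetical field names for a customer contact record (illustration only).
FIELDS = ("customer", "contact_date", "contact_time", "summary")

# Structured: the schema is the explicit data model, and NOT NULL means the
# DBMS itself refuses any record with a hole in it.
db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE contact (
        id           INTEGER PRIMARY KEY,
        customer     TEXT NOT NULL,
        contact_date TEXT NOT NULL,
        contact_time TEXT NOT NULL,
        summary      TEXT NOT NULL
    )
    """
)


def insert_structured(record):
    """Try to store a record; return True if the DBMS accepts it."""
    # Fields the record lacks become NULL. Fields the schema does not define
    # (such as "notes") are dropped, because there is nowhere to store them.
    params = {name: record.get(name) for name in FIELDS}
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO contact "
                "(customer, contact_date, contact_time, summary) "
                "VALUES (:customer, :contact_date, :contact_time, :summary)",
                params,
            )
        return True
    except sqlite3.IntegrityError:
        return False


def field_coverage(records):
    """Count how many records carry each field: the implicit data model."""
    return Counter(name for record in records for name in record)


# Semi-structured: made-up rows as they might look after a monthly collation.
# Each row has whatever its author chose to fill in, plus the odd extra column.
collated = [
    {"customer": "Acme", "contact_date": "2024-03-01",
     "contact_time": "10:15", "summary": "Renewal call"},
    {"customer": "Acme", "contact_date": "2024-03-04",
     "notes": "left voicemail"},
    {"customer": "Bolt Ltd", "summary": "Complaint",
     "follow_up": "send letter"},
]

coverage = field_coverage(collated)
accepted = [insert_structured(record) for record in collated]

print(dict(coverage))
print(accepted)
```

With these sample rows only the first is accepted by the table, because the other two leave required fields empty. The coverage count, on the other hand, is built from the data rather than imposed on it: every row names a customer, only one records a time, and two fields turn up that no schema anticipated.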
So far all this probably seems pretty academic. One might reasonably ask what difference it actually makes to a working system administrator? In the real world does it matter whether the servers are storing structured, unstructured or semi-structured data? Well, yes it does. Quite a lot in fact. Understanding what kind of data you're dealing with is essential when planning data management strategies. But that will be the subject of Part II...