The first part of this feature looked at the key difference between structured and unstructured data. One of these is the way in which the value of a structured database lies in the relationships between data items rather than in the data items themselves. This has a fundamental effect on the way that backup and archiving are handled. For instance, a collection of word-processor files is fundamentally unstructured. Not only is there no mechanism (leaving aside file names, which may or may not mean anything) to relate individual documents to each other but data elements within each document are unrelated. A letter can be subdivided into its component sentences and the sentences retain whatever meaning they had. It's perfectly possible to lose or fragment part of such a dataset without destroying the value of whatever remains. So when dealing with unstructured data it may be all right to leave individual data owners to look after their own files. (It may even be better, since a system administrator has no way of knowing what's important and what isn't, what needs backup and what doesn't, and what can usefully be archived and removed from the system altogether.) At the other extreme an Oracle database must be preserved in its entirety. Lose any part of it and you lose the whole. Backup and archiving are important management tasks, to be decided at corporate level. Moreover if a structured database is replicated, mechanisms must be put in place to ensure every copy is identical or the relationships between data items (and hence the value of the data themselves) will soon become polluted with ambiguities. In practice structured datasets must be centralised, and centrally managed, to ensure completeness and reliability. Semi-structured data banks
Semi-structured datasets, often described as databanks, predictably fall somewhere in between. Different elements of the dataset may even require different treatment. A databank will be structured in the sense that the information content springs from the organisation rather than the individual data elements (as in a spreadsheet for example) but unstructured in that its component files are not self-describing and metadata will have to be stored independently to preserve the value of the information. This suggests the databank itself need not be centralised but a certain degree of central management is necessary to ensure the metadata is preserved intact. A Lotus Notes database is a perfect example of a semi-structured dataset. The core database must be centrally managed, and backed up, but responsibility for many files can be (and often is) left to the individual data owners. Losing one or two files may affect some people's work but won't compromise the entire dataset and in the meantime the data owners know what matters and what doesn't. The choice of physical storage medium is another issue affected by data structure (or lack of it). Semi-structured databanks are ideal for RAID. Unstructured data collections may be RAID'ed but the benefits are more limited – RAID can do no harm but frankly, regular backups are more likely to be useful (the greatest risk is from loss or corruption through user error, not hardware failure). A fully-structured dataset, in the shape of a DBMS-maintained database, will have its own security and integrity mechanisms (journaling, rollback etc.). Prudence suggests mirroring on independent hardware, at independent sites. Planning, building, maintaining and querying
Data structure has even more subtle and profound implications on planning, building, maintaining and querying a dataset. Building an unstructured data collection requires no effort beyond creating the data elements that comprise it. There's no advance planning and maintenance can be ad hoc, as needed. (Again, take the example of a collection of word-processor documents – you don't plan in advance which documents are written, you just write them and that's it, job done.) A DBMS-maintained database, on the other hand, requires both advance planning (the data model, applications for adding data and querying, etc.) and ongoing management. A semi-structured databank will generally need less advance planning but may well take rather more management over the longer term (because there are fewer automated tools and the dataset is more fragmentary). There is a reward, though, for the extra work. While one can't say anything useful about querying an unstructured data collection (apart from the obvious, that it's likely to be hard) querying structured and semi-structured datasets is mostly routine, using well established tools (typically SQL for structured data and proprietary tools or – increasingly – XML for semi-structured). Large structured or semi-structured datasets will obviously take more setting up and maintenance than small ones, but the benefit will be proportionately greater. All of which raises a nice philosophical point about work, entropy and size of organisation. It takes a lot of work to overcome the natural tendency for entropy to increase, the more so the larger the organisation, but we'll leave that for another day.