Data underpins nearly every business these days. Generally speaking, though, only a few have a single, atomic set of data that underpins everything they do; most have a number of separate repositories of data that they update and interrogate independently.
When part of the business decides that it needs to use a data source it wasn’t previously using, it is generally the technology teams that are tasked with making the data available. The networking and systems' teams provide the physical connection, the access protocols and the access control, while the development team provides the high-level interface that presents the data to the users in the required way.
In trivial cases this isn’t a big job: it could simply be a case of pointing the new users’ web browsers at a URL they weren’t previously aware of, and perhaps setting a couple of ACL entries.
In most cases, though, the task isn’t so simple: it’s common to have to dig about in multiple data sources to find the data you need. So in this feature we’ll look at the issues of providing access to heterogeneous data sources so they can be made available straightforward, to developers and other data users.
In an ideal situation, an application developer would be provided with a set of rigidly-defined functions or objects that give him access to the data he needs; he shouldn’t need to worry particularly about precisely how that data is stored, or for that matter where. So he should be able, for instance, to pass a postcode to a “FindCustomer()” method, and to have the appropriate data returned in a predefined format.
What the developer of the data access layer (which sits somewhere below the application layer) has to do is produce a magic “black box” which takes high-level calls and does the hard work of delving into the data sources.
Abstraction and data dictionaries
In order to bring together two or more data sources, step one is to define some basic data structures that span the universe of both the data sources you’re working with and the data requirements of the applications you’re developing or purchasing. This is referred to as a data dictionary (“DD”).
The idea is to wean the developer off thinking of data as strings, numbers, times, dates and such like. Instead, you present a series of home-grown data types called things like “Customer,” “Email Address” and “Telephone number.” No more will the developer define a variable as being an “Int32” – if it’s a customer ID he’ll define it as a “CustomerID.”
Once you’ve defined this abstract data interface, you have the rather more complex task of translating between the concepts it defines and the real data underneath. In a read-only world (i.e. one where you’re pulling the data out of the data repository) this is relatively straightforward; when you’re writing back to the repository, though, things can get very complicated very quickly.
Let’s illustrate the problem with a simple example. Imagine we have two back-end systems, each of which has the concept of a surname. On system A it’s a string with an upper limit of 30 characters; on B it’s a string with an upper limit of 40 characters. If all we do is read from these data repositories, we simply have to define a “Surname” in our DD based on the less restrictive of these two options – i.e. a 40-character string. This means that whatever surnames are pulled out of the back-end systems, we know they will be acceptable to our DD (and thus whatever new systems we’re implementing that are based on it).
What happens, though, when we wish to write back to one or more of our back-end systems? If we simply use our 40-character surname example, things will obviously break when we try to write (say) a 35-character surname to system A. There’s only one thing we can do, then: as with any computing problem, we compromise.
The way to deal with data discrepancies is to go through a process of agreeing what needs to be written back, to where, and in what form – and putting it in the context of the current dataset.
In many cases the answer will be easy. Life seems tricky in our example, but a quick check of the current data we hold might reveal that we don’t actually have any entries with surnames of more than 30 characters: if this is the case, the problem’s gone away. Alternatively, it may be that we don’t actually need to write surnames back to system A at all – so again, the problem has gone away. There will, of course, be cases where there isn’t an easy answer; in such cases you simply have to work out the approach that minimises the adverse effects on the database.
In the majority of cases, though, you’ll be faced with a relatively straightforward task of translating between date formats, telephone number formats, and so on.
Although in its basic form a DD is a series of definitions, you will generally want to build constraints in. So you might define that an email address must, for example, be a string of characters where there is at least one character, then an @, then at least one other character. If you’re going to do the job properly, you might go a step further and define your constraints according to the formal RFC2821 standard. Then you’d probably build your DD components into more complex objects – so once you’d defined what a title, a forename and a surname were you might define a “Person” as an object that included a compulsory surname plus an optional title and forename.
Of course, as your objects become more complex, with more constraints and integrity rules, the task of fitting the data into the legacy systems becomes more onerous, but you’ll trade this off against the improved consistency and quality you’ll get in any new and/or revised data that is introduced into the system.
One big problem with any data integration exercise is dealing with invalid information. Say you’re defining a DD element to represent a telephone number; you might include a country code, area code and local component, all of which are strings of digits. This is nice and easy in principle, and it’s not too tricky to build translation functions that will (for example), but how many times have you come across telephone numbers that say things like “020 8172 3857 WORK”? In addition to the transformation operations you’ll have to devise, then, a data cleansing exercise may well also be in order.
In reality, virtualisation of data is simply the task of implementing the abstract concepts you’ve defined. Your “virtual” customer database could therefore be an SOA service that consumer applications make calls to in order to read and write customer data according to the abstract concepts defined in your DD. So as well as not caring what the underlying types of the data objects are (as achieved by the abstraction phase) he now doesn’t even care how he gets to the data objects, or where they are (the virtualisation part of the task).
The final aspect we need to consider for data integration is that of optimisation. It’s all very well having clever mechanisms for interfacing to legacy systems and translating between the old and new data formats, but there’s usually some kind of time and/or performance constraint on the data.
Say you’re a travel company and your new marketing manager wants to do some in-depth data mining on the customer database in order to target mailshots more accurately than has been done in the past. This might seem sensible in principle, but if the sources of the customer and historical booking data are on different machines, there may be no way to directly query these repositories to get out the desired data. And even if there is, it may place an undue load on one or move live systems – perhaps slowing down the booking server, or making the website perform lethargically.
No surprises, then, that as well as considering how to connect to the data and how to interact with it, you also need to consider whether to add technology in order to maintain the performance of the legacy systems now that you’re adding new systems that interact with them. One company we know of solved this problem with a simple technology fix: they turned on replication on their sales system so that the data was replicated in near-real time to a second machine (actually an existing server with modest capacity).
By running their marketing data analyses (which are, by their nature, read-only operations) on this replica, there is negligible impact on the live server. Another company has decided to invest in a product that pulls data from a number of sources into its own internal database and provides fast ad-hoc query access to that database; a not dissimilar but equally useful approach.
The principles of virtualising and optimising data when accessing disparate repositories are relatively simple. This isn’t to say they’re not tedious (defining data transformations can become very laborious, very quickly) or that there will always be an answer (you may, from time to time, have to say: “We just can’t write this type of data to that repository” and find a way to work around the need to do so).
But if you take it one sensible step at a time (physical machine access, then file/application access, then data translation, then the implementation of constraints, then addressing performance issues) the answers may well drop out much faster than you expected.