One of the first and more difficult steps in defining an SOA is developing a complete semantic and service-level understanding of your domain.

Although the work itself is pretty straightforward, it typically takes a huge amount of time and effort. The enabling technologies and tools currently available are also complex and still emerging. So it pays to spend some time up front planning exactly how you're going to approach this step and which tools you might use to make the job easier.

Why are we doing this? Because you can't deal with information you don't understand, including information bound to behaviour in applications or existing services.

It's extremely important to identify all the application semantics - metadata, if you will - and services that exist in your domain, so that you can deal properly with that data and those services and understand their inner workings. Remember, the goal here is to create a service-level abstraction of existing systems, and at this point, you're merely figuring out what's there.

The data landscape
An understanding of application semantics establishes the way and form in which a particular application refers to properties of the business process. For example, a customer number in one application may have a completely different value and meaning in another application.
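
To make the customer-number example concrete, here is a minimal sketch of a semantic catalogue in Python. The application names, field names, and meanings are all hypothetical; the point is only that recording what each application means by a field makes the mismatches visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSemantics:
    """What one application means by a field (every value below is hypothetical)."""
    application: str
    field_name: str
    data_type: str
    meaning: str


# Two hypothetical applications that both expose a "customer number".
CATALOGUE = [
    FieldSemantics("billing", "cust_no", "int", "account that receives invoices"),
    FieldSemantics("crm", "customer_number", "str", "individual contact person"),
]


def meanings_by_application(catalogue, field_names):
    """Return {application: meaning} for fields that are assumed to be 'the same'."""
    return {f.application: f.meaning for f in catalogue if f.field_name in field_names}


conflicts = meanings_by_application(CATALOGUE, {"cust_no", "customer_number"})
for app, meaning in sorted(conflicts.items()):
    print(f"{app}: {meaning}")
```

If the meanings differ, as they do here, the two fields can't simply be wired together at integration time; someone has to decide how one maps to the other.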

Understanding the semantics of an application goes a long way toward ensuring that there will be no contradictory information when the application is integrated with other applications at the information or service levels - which is really what SOA is all about. Achieving consistent application semantics requires an application integration "Rosetta stone," and as such represents one of the major challenges to creating your SOA.

Defining application semantics is a tough job, because many of the existing systems you'll be dealing with are older, proprietary, or both. The first step in identifying and locating semantics is to create a list of candidate systems. This list will make it possible to determine which data repositories exist in support of those candidate systems.

Any technology that can reverse-engineer existing physical and logical database schemas will prove helpful in identifying data within the problem domains. Although the schema and database model may give insight into the structure of the database or databases, they cannot determine how that information is used within the context of the application or service. For the most part, it takes a human being to look at each item and determine what it represents in the context of the SOA.
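
As a rough illustration of what schema reverse-engineering gives you, the sketch below reads table and column definitions back out of a database. It assumes SQLite, purely because Python ships with it, and the tables are invented for the example; a real legacy system would need its own catalogue queries or a dedicated tool.

```python
import sqlite3


def describe_schema(conn):
    """Map each table name to a list of (column name, declared type) pairs."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    schema = {}
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [(col[1], col[2]) for col in columns]
    return schema


# Hypothetical stand-in for an existing application database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cust_no INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE invoice (id INTEGER PRIMARY KEY, cust_no INTEGER, total REAL)")

schema = describe_schema(conn)
for table, columns in schema.items():
    print(table, columns)
```

Notice what the output can't tell you: whether cust_no identifies a company, a person, or a billing account. That part still takes a human being.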

Applications and services
Next, it's time to focus on existing applications and the services they offer. Within legacy transaction-based apps, web apps, and client/server apps, you'll discover most of the services that will form the basis of your SOA. Eventually, they will probably need to be reprovisioned to meet the requirements of the new architecture. For now, all you need to do is identify them.

Service interfaces - that is, interfaces in existing applications that can be addressed by other apps - are quirky and complex, much more so than data. They differ greatly from one application to the next, whether custom or proprietary.

What's more, many interfaces, despite what the application vendors or developers may claim, are not really service interfaces at all, and you must know the difference. By definition, a service must provide both behaviour and information. Interfaces that provide information alone should not be included in your service inventory.

It is important to devote time to validating your assumptions about services, including where each service exists, its purpose, the information bound to it, its dependencies (what other services it may call upon), and any security issues.
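
One way to keep track of those assumptions is a simple record per candidate service. The sketch below is only an illustration in Python: the field names, the two example entries, and the choice of which items count as "still unvalidated" are all assumptions, not a standard inventory format.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRecord:
    """One candidate service found in an existing application (hypothetical shape)."""
    name: str
    location: str = ""        # where the service exists
    purpose: str = ""         # what the service is for
    information: list = field(default_factory=list)   # data bound to the service
    behaviour: list = field(default_factory=list)     # operations it performs
    dependencies: list = field(default_factory=list)  # other services it calls
    security_notes: str = ""  # known security issues or constraints

    def is_service(self):
        """A service must provide both behaviour and information."""
        return bool(self.behaviour) and bool(self.information)

    def unvalidated(self):
        """Names of the assumptions that nobody has filled in yet."""
        checks = {
            "location": self.location,
            "purpose": self.purpose,
            "security_notes": self.security_notes,
        }
        return [name for name, value in checks.items() if not value]


# Hypothetical inventory entries.
check_credit = ServiceRecord(
    name="CheckCredit",
    location="legacy order system",
    purpose="approve or reject a credit request",
    information=["customer number", "credit limit"],
    behaviour=["evaluate credit"],
)
customer_extract = ServiceRecord(
    name="CustomerExtract",
    location="CRM",
    information=["customer list"],
)

for record in (check_credit, customer_extract):
    print(record.name, record.is_service(), record.unvalidated())
```

The second entry exposes information but no behaviour, so it stays out of the service inventory, and both entries still have open questions to chase down.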

Collection tools
A services directory is the best place to aggregate all that painstakingly collected information about services and data. As with other directories, this is a repository for gathered information about available services, along with the documentation (or pointers to documentation) for each service - what it does, information passed to a service, information coming from a service, and so on. This directory is used (along with application semantics) to define the points of integration for all systems in the domain.
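
Stripped to its essentials, a directory of this kind might look like the Python sketch below. Everything in it is hypothetical (the service names, the information items, the documentation paths), and it stands in for no particular product; it simply shows how recorded inputs and outputs point to candidate integration points.

```python
# Hypothetical in-memory services directory, keyed by service name.
directory = {}


def publish(name, does, inputs, outputs, docs):
    """Record what a service does, what goes in, what comes out, and where the docs are."""
    directory[name] = {
        "does": does,
        "inputs": set(inputs),
        "outputs": set(outputs),
        "docs": docs,  # pointer to documentation, not the documentation itself
    }


def producers_of(item):
    """Services that return the given piece of information."""
    return sorted(n for n, entry in directory.items() if item in entry["outputs"])


def consumers_of(item):
    """Services that need the given piece of information passed in."""
    return sorted(n for n, entry in directory.items() if item in entry["inputs"])


publish("CheckCredit", "rates a customer's credit",
        inputs=["customer_number"], outputs=["credit_rating"],
        docs="docs/check_credit.txt")
publish("ApproveOrder", "approves or rejects an order",
        inputs=["order_id", "credit_rating"], outputs=["approval_status"],
        docs="docs/approve_order.txt")

# A candidate integration point: one service produces what another consumes.
print(producers_of("credit_rating"), "->", consumers_of("credit_rating"))
```

Matching names is only a first pass. Whether the two services mean the same thing by "credit_rating" is a question for the application semantics gathered earlier.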

You could certainly do this using Excel or a small database, but today's SOA registries or repositories do a much better job. As noted previously, an SOA registry is a resource that enterprises use to publish, discover, and consume web services. A repository, on the other hand, holds additional content such as XML Schemas, DTDs (Document Type Definitions), and WSDL documents. Think of a repository as a persistence mechanism that stores information published to the registry. Today, pretty much all the products in this area have both capabilities, whether they call themselves registries or repositories.

One of the better-known offerings is Infravio's X-Registry Platform, which calls itself an SOA governance platform and allows designers to maintain both semantic and service information. (Governance involves the development, management, and enforcement of policies that set parameters for how services should be built and accessed.) Then there's Systinet Registry, recently acquired by Mercury Interactive, which provides a simple, standards-based means for publishing and discovering reusable business services and SOA artifacts. Systinet 2 is a governance and lifecycle platform built around the registry.

Other registry/repository products include Flashline SOA, RainingData's TigerLogic SOA Repository, and various functionality built into the "stacks" offered by BEA, IBM, Sun, and other big vendors - BEA actually licenses the Systinet Registry. All of these are evolving rapidly and most integrate with governance products. They make it easier to manage data and service information you've collected, but unfortunately they can't do much to relieve the hard labour of gathering that information in the first place.