Relational databases form the backbone of thousands of companies. They provide an efficient way to store and retrieve data, there are dozens of products out there that implement relational database functionality (including commercial, inexpensive and Open Source offerings) and the majority of modern development languages and systems (Java, Visual BASIC, ASP, Perl, PHP, Delphi) provide one or more mechanisms for integrating easily with relational database systems.
In this RTFM we'll look at the basic components of a relational database - we'll go into more detail in future articles.
The basis of a relational database is a "table". Think of a table like a spreadsheet - you'd give each column ("field") a title, and you'd put one data item on each row of the spreadsheet. For example, we might have a table called CUSTOMER with the following spreadsheet-like form:
As you can probably deduce from our example, the fields have certain data formats. The customer ID appears to be numeric, and the other three fields are textual. Although most database systems allow you to have each field as a text field of arbitrary length, these are unwieldy (and in many cases they're so inefficient to search that the database doesn't even let you do searches on unlimited-length text fields). So we generally impose some kind of limits on our fields - we might, for instance, dictate that the CustomerID was a 64-bit integer, or that the surname was a string of characters with a maximum length of 50.
Taking this a step further, we might want to apply some kind of constraint to each field. The most common constraint is to dictate that a particular field cannot be blank ("null" in database parlance") - if you're the DVLA, your database would certainly be constrained to ensure that the driver number, title, forename, surname, house number, street, town and postcode fields were all completed for all customer records. Other constraints might be that a numeric field lies between two extremity values, or that the Surname field in our example may only comprise alphabetic character and hyphens.
The meat of a relational database's functionality is in the ability to relate tables to each other based on common fields. Let's assume we're an e-commerce company and we have a database of orders. Clearly each order needs to be related to a customer, and we achieve this not by including the customer's full details in every order, but by storing orders and customers in separate tables and referring to each customer by inserting his or her CustomerID in the ORDER table:
When it comes to shipping an order, the system can pick up the user ID from the ORDER table and use it to refer to the CUSTOMER table to pick up the customer's details (obviously we're using a contrived example - in a real customer table you'd have some address information as well as just each person's name). The point of splitting this data into chunks (also known as "normalising" it) is that every item of a particular kind is stored in precisely one place. So when a customer moves home, you just change his or her address in the CUSTOMER table and this change is automatically available to any other table that references the customer records.
Now we've discussed the way that tables inter-relate, let's go back for a moment to the concept of constraints. We've already said that within a table we can apply constraints such as "cannot be null" - but it's also possible to apply constraints that relate tables together.
This is achieved using the concept of a "foreign key". Put simply, a foreign key constraint takes the form "This field in this table must contain a value that already exists in that field of that table over there". So in our ORDER table, we could define a constraint on the CustomerID field and point it at the CustomerID field of the CUSTOMER table. Were someone to attempt to create an order with a CustomerID that doesn't exist in the CUSTOMER table, the system would spot this fact and would throw an error message. Foreign key constraints are the holy grail for anyone who wants to maintain the integrity of the relationships between the tables in the database.
When you define a table structure, the system has to store it in some way in order to be able to access it efficiently. The most common means of achieving this is to include a "primary key" for each table - a unique identifying number for that record. In our example, the CustomerID field would uniquely identify each customer in the CUSTOMER table, and the system would store its data such that it was efficient to extract customer data when presented with requests such as "Get me the details of customer 10005". Likewise, the ORDER table's primary key would be the OrderID field - this is the reference mechanism that other tables are likely to use to link to the ORDER table, and so it makes sense for the system to organise itself around that field.
You can combine fields together in an attempt to guarantee uniqueness. So you might have multiple different CUSTOMER records for a given person, because for customer service reasons you probably want to keep old address information instead of overwriting it when someone moves house. In this instance, you might store a version number for each customer, which increments by 1 each time the customer details change, and you could combine this version number with the customer ID to guarantee uniqueness for the primary key.
Some tables simply won't have primary keys, though, because it's not relevant for them to do so. Because a primary key must be unique for each row, it can only be applied when it's possible to guarantee that the primary key will be unique. If you have, say, a table that stores a lot of free-text transaction logs, the chances are that you won't have a primary key.
The primary key is the first step in efficient data lookups, but you can only have one per table. So what happens if you want to perform efficient searches on fields other than those nominated as primary keys - say the Surname field in our CUSTOMER table? Unless you do something, each search for a person with a particular surname will result in the database trawling through every record in sequence in an attempt to find matches.
By defining an index on a database, you can help the system make searches on fields other than the primary key more efficient. So in our customer table, you'd probably define an index on the Surname field. When you define an index on a field the system trawls over the data that's there and builds a new data structure on the disk whose content is ordered such that it's fast to find records based on the content of the surname field.
Indexing is a trade-off of speed against efficiency. You could, conceivably, index every table on every field, and in theory you'd never have to worry about your searches slowing down. The problems with this approach are twofold, though. First, every index takes disk and RAM in your database system - and for a big table, an index will be big. Second, every time you add, remove or amend a record in the database, the system has to reflect the change in all indexes that relate to the tables you've changed - so the more indexes you have on a table, the more updates it will have to perform and the slower your insertions, deletions and updates will run.
In this RTFM, we've looked at the seven key constructors of a relational database, which we'll come back to many times in future features. There is an eighth important thing to bear in mind about relational databases, though: all mainstream relational databases support these concepts, but don't necessarily implement them in the same way. The concepts may be general, but the implementations vary, and by learning the specifics of your particular database system you can inevitably add efficiency. SQL Server, for example, has special types of index called "clustered indexes", which bring performance benefits but which you have to plan carefully because there are limits to how many you can employ per table. And if you're clever about how you build your indexes in Oracle, you can actually perform entire lookups solely from indexes, without ever touching the data in a table.
So get to grips with the generic concepts, and then build on them with an application-specific course.