The 2007 New York Times Bestseller, The Black Swan: The Impact of the Highly Improbable by Naseem Nicholas Taleb covers events characterised as 1) being outliers beyond normal expectation, 2) having extreme impact, and 3) being able to be rationalised after the fact. Are data consistency problems in RDBMS applications black swans in IT?
A common misconception of applications built on the top of a Relational Database Management System (RDBMS) is that they are virtually free of data consistency problems. This assumption hinges on the condition that applications use RDBMS’s transaction support correctly. But evidently that’s not quite the case.
A recent study by researchers from Purdue University found that more than one hundred transaction-related data-consistency defects in popular web modules such as OpenCart and phpBB. OpenCart is a widely available shopping cart module with over a million hits. One key function is managing coupons to ensure they are used no more than a maximum number of times.
The study finds that the code where the coupon’s usage count is read from the database, and the code where the count is updated, reside in a two separate transactions. As a result, an over selling of coupons can happen when two or more users use the coupon at the same time. To perform correctly, the code should have been encapsulated in a single transaction.
This cross-transaction data flow issue is not new and it is a fairly common topic discussed in basic database courses. However, the study by Purdue indicates that this issue is more pervasive than many anticipated. Such a defect could be disastrous for some systems and is difficult to detect.
It is not easy to spot during manual code review, as different database queries are often encapsulated in different classes and methods. It is also difficult to catch during the testing phase as it requires precise timing for interleaving of transactions to produce a wrong result.
The key in discovering these defects is to locate data dependencies between different database queries. If the value used in an update query comes from a read query, chances are these queries ought to be enclosed in the same transaction. We are not aware of any readily-available tools that can detect this defect automatically.
However, we can configure dynamic tainting tools, such as Python tainting mode, CORE GRASP etc., to do so. Dynamic tainting tools mark inputs that percolate through the system and can locate data dependencies between two variables in a piece of code.
Many of these tools are originally designed for detecting security vulnerabilities such as SQL injection by tracing the data flow of raw user input from the GUI into SQL query strings without going through any security check.
To locate data-consistency defects mentioned above, dynamic tainting tools have to be manually reconfigured by explicitly specifying the location of each read and update queries. This manual effort is not for the faint of heart, especially for large systems.
But one can mitigate this issue by focusing on critical modules where defects like this can have significant implications. Ultimately, it is the developer’s responsibility to ensure the database transactions are implemented correctly.
Data consistency problems do occur in RDBMS applications. They are difficult to detect and become increasingly so as we incorporate and reuse software modules by others.
But, depending on the consequences of failure, identifying the vulnerability is worth the effort to “turn the Black Swans white.” Taleb states that what may be a black swan surprise for a turkey is not a black swan surprise to its butcher. So “avoid being the turkey” and consider data consistency issues in your RDBMS applications.
By Teresa Tung, Manager, Accenture Technology Labs and Chen Fu, researcher, Accenture Technology Labs