Data is generally thought to be the lifeblood of 21st-century business, but the so-called “information age” is also fraught with risk. Cybercrime continues to grow rapidly, and evidence of fraud, abuse and criminal activity often resides on a multitude of electronic systems.

Most companies are still using rules-based queries and analytics tools to identify fraud, which rely on the individual to ask questions of the data, based on what is currently known. This approach requires both time and luck to uncover inconsistencies.

Ernst & Young’s forensic analytics team aims to provide its clients with new ways to identify, predict and reduce fraud and improve business efficiencies. By analysing both structured and unstructured data, the company claims to be able to take a proactive approach to fraud detection.

“We've got a lot of experience working with different organisations, and working with many different kinds of data, so within the realms of our own experience we tend to very quickly be able to spot when the fraud's being committed,” said Rashmi Joshi, Director of Ernst & Young’s IT forensics team.

“It's a question of looking at the data distributions. Mathematical algorithms are very effective at picking out any deviations from certain parameters in the data, so we can spot the unknown frauds as well as the known frauds.”
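Joshi does not describe EY's models in detail, but the idea of flagging deviations from a distribution's parameters can be sketched with a simple z-score test. This is an invented illustration, not EY's method; the data and threshold are placeholders.

```python
# Hypothetical illustration: flag transactions whose amounts deviate
# sharply from the mean of the distribution, using a simple z-score test.
# The payment data and threshold below are invented for the example.
from statistics import mean, stdev

def flag_outliers(amounts, z_threshold=2.5):
    """Return amounts more than z_threshold standard deviations from the mean."""
    mu = mean(amounts)
    sigma = stdev(amounts)
    return [a for a in amounts if abs(a - mu) / sigma > z_threshold]

payments = [102, 98, 105, 97, 101, 99, 103, 100, 950]  # one suspicious payment
print(flag_outliers(payments))  # only the anomalous 950 is flagged
```

Real forensic models are far more sophisticated, but the principle is the same: a "known" fraud pattern can be queried for directly, while an "unknown" one shows up as a statistical deviation.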

Commercial tools and bespoke algorithms

Ernst & Young uses a variety of commercial software tools in order to identify fraud. These include data visualisation technology from Tableau, SAS business analytics software, IBM's SPSS predictive analytics and data mining software from Megaputer.

“We're quite vendor agnostic. We don't have preferred partnerships, we just want to make sure we're using the right tool to analyse the data,” said Joshi.

In cases where a number of complex data sets need to be correlated, the company also creates its own bespoke mathematical models and algorithms to pull out anomalies and predictors of fraud.

“It's one thing having lots and lots of data. It's another thing being able to recognise what data it is you actually need.”

Joshi said that analysing numerical transactional (structured) data can be very useful, but Ernst & Young also has a service known as Fraud Triangle Analytics, which enables clients to monitor their employees' emails through analysis of unstructured data.

The Fraud Triangle is based on the research of criminologist Donald Cressey, who argued that for fraud to occur, three factors must be present: rationalisation, incentive and opportunity. Together with the FBI, Ernst & Young has developed a library of 3,000 keywords associated with these factors.

If structured data indicates that fraud has been committed within an organisation and a certain group of employees is implicated, Ernst & Young can analyse the language used in their emails and uncover the sentiments involved. If keywords associated with all three factors are identified, those employees are likely to be responsible for the fraud.
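The 3,000-keyword library is proprietary, but the scoring logic described here can be sketched as follows. The keyword lists, factor names and email text are invented placeholders for illustration only.

```python
# Illustrative sketch: the real 3,000-keyword library is proprietary,
# so these tiny keyword sets are invented stand-ins for the three
# Fraud Triangle factors.
FACTORS = {
    "incentive":       {"bonus", "target", "quota", "debt"},
    "opportunity":     {"override", "bypass", "unsupervised"},
    "rationalisation": {"deserve", "owed", "everyone does it"},
}

def fraud_triangle_score(text):
    """Count keyword hits per factor; hits on all three factors raise the risk."""
    lower = text.lower()
    hits = {factor: sum(1 for kw in kws if kw in lower)
            for factor, kws in FACTORS.items()}
    all_three = all(count > 0 for count in hits.values())
    return hits, all_three

email = "I deserve this bonus - I can override the approval unsupervised."
hits, flagged = fraud_triangle_score(email)
print(hits, flagged)  # hits on all three factors, so the email is flagged
```

An email that trips only one or two factors would score lower; the "fraud score" Joshi mentions presumably aggregates such hits across an employee's mailbox.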

“We have a prototype that we developed on a subset of the Enron data, and very quickly through applying this we can see who the main culprits are through their fraud score,” said Joshi.

“So that's a good example of using structured data and unstructured data together. Through the combination of both you've got a very comprehensive way of tackling fraud.”

Social media can also be a valuable source of data. Ernst & Young uses social network analytics to examine suspects' social connections in order to establish their footprint in society. This can be very useful for social services, for example, to help them understand and identify benefit cheats.
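One simple building block of social network analytics is looking for undeclared links between individuals. The toy graph below is entirely invented; it just shows the kind of mutual-contact query that such analysis relies on.

```python
# Toy sketch (all names invented): represent known relationships as a
# graph and look for undeclared links, e.g. two supposedly unconnected
# benefit claimants who share an unusually large circle of mutual contacts.
connections = {
    "claimant_a": {"p1", "p2", "p3", "p4"},
    "claimant_b": {"p2", "p3", "p4", "p5"},
    "claimant_c": {"p9"},
}

def shared_contacts(person1, person2, graph):
    """Return the mutual contacts between two people in the graph."""
    return graph[person1] & graph[person2]

print(shared_contacts("claimant_a", "claimant_b", connections))
# claimant_a and claimant_b share several contacts; claimant_c is isolated
```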

The human factor

Joshi said that computers alone cannot do the job of analytics. Human intervention is also essential to provide interpretation; otherwise there is a tendency to extrapolate too far, and the findings become meaningless.

She gave the example of a recent piece of statistical analysis that Ernst & Young conducted in relation to a large comparison website, in order to predict the range of prices that customers would be willing to pay when presented with different pricing comparisons.

When performing the logistic regression analysis, Joshi's team noticed that the software was converging extremely quickly and, although all the green lights were on, it was clear that something was not right.

The team decided to investigate the underlying algorithms and discovered problems with the convergence, which meant the model was not optimal and was therefore invalid.
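The article does not say what went wrong in EY's model, but one classic way a logistic regression can look healthy while the fit is invalid is when the data are perfectly separable: the optimiser's "green lights" come on even though no finite optimum exists. The sketch below (invented data, not EY's case) fits a one-feature logistic regression by gradient descent and reports the final gradient norm, which is stronger evidence of genuine convergence than the loop simply stopping.

```python
# Minimal sketch with invented data: check convergence yourself rather
# than trusting the software's status flags. With perfectly separable
# data the weights drift upward and the gradient never truly vanishes,
# so the iteration cap is hit - a warning sign worth investigating.
import math

def fit_logistic(xs, ys, lr=0.1, max_iter=5000, tol=1e-8):
    w, b = 0.0, 0.0
    for it in range(max_iter):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += (p - y)
        gw /= len(xs); gb /= len(xs)
        w -= lr * gw; b -= lr * gb
        grad_norm = math.hypot(gw, gb)
        if grad_norm < tol:                     # genuine convergence criterion
            return w, b, it, grad_norm
    return w, b, max_iter, grad_norm            # hit the cap: inspect before trusting

xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]  # perfectly separable at ~2.25
ys = [0, 0, 0, 1, 1, 1]
w, b, iters, gnorm = fit_logistic(xs, ys)
print(f"w={w:.2f} b={b:.2f} iters={iters} grad_norm={gnorm:.1e}")
```

Here the model classifies the training points correctly, yet the gradient norm never reaches the tolerance: the maximum-likelihood estimate does not exist for separable data, so any reported coefficients are unreliable, much as the converged-looking but invalid model Joshi describes.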