Securing Structured vs. Unstructured Data: What's the Difference?

I recently sat down with TDWI's James E. Powell to talk data security. It was a great chance to talk some inside baseball with a pro in the field.

Upside: When it comes to protecting data, I thought data was data. However, you've told me that there's a difference between securing "traditional" (presumably structured) data and securing unstructured data. What are these differences?

Scott Lucas: There's a wide gulf between security for structured and unstructured data for three important reasons: unstructured data is more complex, its access and management are less consistent, and there are fewer controls available to secure it.

Let me address complexity first. The meaning of data in a database is consistent, straightforward, and easy to understand. Data stored in databases can be complex, to be sure, but its importance and sensitivity aren't a mystery.

Unstructured data, on the other hand, is wild and woolly. User-generated content ranges from intellectual property to contracts to sensitive HR information and everything in between. That complexity makes it tough to determine what's important and which security policies are appropriate.

Accessing and managing structured data is also more consistent. Databases exist in well-understood locations with well-defined access methods (such as APIs). This consistency means database security tasks are bounded and solvable. Unstructured data has almost no consistency. It's found on premises, in cloud applications, and in cloud storage. There are innumerable ways to access it and (at least with current tools) no way to consistently manage it.

Finally, options to control access to structured data are available and understood. Databases offer fine-grained access privileges that can be implemented and centrally managed by security professionals. In contrast, users are primarily responsible for controlling access to unstructured data. That's an important enough point that I want to say it again. Users make critical security decisions for the files they own. They email, link-share, and put files in public folders that can easily expose data to loss and theft -- and it all happens well beyond the reach of security teams.

I don't believe GDPR, for example, makes a distinction between structured and unstructured data.

You're correct. GDPR and other regulatory regimes don't care whether the protected data is structured or unstructured -- which is a big reason why organizations care so much about their unstructured content. As I noted, the path to structured data security is clear. It might not be easy or inexpensive to do it right, but the tools and techniques are available and they work. Not so with unstructured data.

To explain the need for rules, I'm going to dig into one of the industry's most popular approaches to data security. Data loss prevention (DLP) tools control how data flows across defined control points (such as the network perimeter). To make these permit/deny decisions, they need two pieces of information: what the document is and how it should be handled.

Both tasks today rely on complex rules and policies. If you want to know if a document contains personally identifiable information (PII), for example, you need a rule that tells the tool how to find it (for example, a pattern that matches a Social Security number). You also need a policy that outlines how PII is handled. Given the wide diversity of content in unstructured data, it's no surprise that these rules and policies quickly grow to be unmanageable beasts.
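To make the rule/policy split concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular DLP product works: the names SSN_PATTERN, POLICY, identify, and decide are hypothetical, and the pattern assumes Social Security numbers written with dashes.

```python
import re

# Hypothetical rule: find U.S. Social Security numbers written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Hypothetical policy: how each kind of content is handled at a control point.
POLICY = {"pii": "deny", "unknown": "permit"}


def identify(text: str) -> str:
    """Rule step: decide what the document is."""
    return "pii" if SSN_PATTERN.search(text) else "unknown"


def decide(text: str) -> str:
    """Policy step: decide how the document should be handled."""
    return POLICY[identify(text)]


print(decide("Employee SSN: 123-45-6789"))  # deny
print(decide("Employee SSN: 123 45 6789"))  # permit (the rule misses this format)
```

The second call shows why these configurations grow: a number written with spaces slips past the rule, so each new format, document type, or exception needs another rule or policy entry.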

Why the complex configurations?

Configuration is what has to be done before the tool can function, and in this space the configuration of a new tool can take months. Policies in particular are very organization-dependent and have to be established before any unstructured data security solution works. It can take months of testing and finagling to ensure these policies are properly tuned to manage risk without slowing down the business.

There's also a big configuration requirement for solutions that rely on end-user classification.

What are the methods security experts have tried?

There are two tracks organizations take to secure unstructured data, either alone or in combination.

First, as I pointed out, DLP-style solutions use rules to identify data and policies to make control decisions. Hopefully I've explained that sufficiently.

The other major approach uses data classification or "tags" to identify critical data. Classification is, on the surface, very appealing because it relies on the document's owner to identify content. Because the owner is also the expert on the content, they're in the best position to determine if the document is sensitive, confidential, contains PII, or what have you.

Using a liberal definition of "configuration," these approaches need extensive "organizational configuration" to work. The categories need to be well designed for easy end-user understanding (What's the difference, for example, between "Confidential" and "Highly Confidential"?) and users must be trained in how and when to tag files and what these category labels mean.

What's worked, what hasn't, and why?

Both of these methods have significant drawbacks. Rules-based approaches, as I hinted at earlier, can grow into a tangle of rules, policies, exceptions, and corrections that's really hard to manage. When a rule or policy doesn't work, or it keeps users from doing what they want to do, someone has to diagnose and fix the issue. Gartner acknowledged the problem in their 2017 report titled "It's Time to Redefine Data Loss Prevention."

Classification programs suffer from a different set of problems, all related to the vagaries of any IT program that depends on end-user cooperation. Users are, well, users, and they don't always understand or comply with directives from the IT security team. The classification approach usually leaves a lot of data unclassified and out in the cold, unprotected.
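A small Python sketch can show where the coverage gap comes from. Everything here is hypothetical and chosen for illustration: the label names, the handling text, and the Document, handling_for, and coverage helpers are assumptions, not part of any real classification product.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical label scheme an organization might define and train users on.
HANDLING = {
    "Public": "share freely",
    "Confidential": "internal only",
    "Highly Confidential": "named recipients only",
}

UNPROTECTED = "no handling rule applies"


@dataclass
class Document:
    name: str
    label: Optional[str] = None  # set by the owner; None if never tagged


def handling_for(doc: Document) -> str:
    """Look up handling by tag; untagged or mistagged files match nothing."""
    if doc.label is None:
        return UNPROTECTED
    return HANDLING.get(doc.label, UNPROTECTED)


def coverage(docs: List[Document]) -> float:
    """Fraction of documents that carry a recognized label."""
    if not docs:
        return 0.0
    return sum(1 for d in docs if d.label in HANDLING) / len(docs)


docs = [
    Document("offer_letter.docx", "Confidential"),
    Document("roadmap.pptx", "Highly Confidential"),
    Document("salaries.xlsx"),             # owner never tagged it
    Document("contract.pdf", "Private"),   # label outside the scheme
]
print(coverage(docs))  # 0.5
```

In this sketch the tool can only act on what users tagged correctly, so half of the sample files, including the salary spreadsheet, fall outside any handling rule.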

What's AI got to do with it?

AI adds two key capabilities to the unstructured data security mix. First, it's very good at categorizing unstructured data. Categorization addresses the complexity problem by revealing file meaning and content -- even across a very diverse set of data. It doesn't need rules or configurations to do the job, and it doesn't ask end users for any help. It just works, continuously and autonomously.

Second, AI makes it possible for security professionals to extract the policies that ought to apply to specific data types without writing any policies themselves. Here's how that works. Once a set of files has been categorized, it's possible to establish a baseline set of security practices those files follow. You can, for example, draw some conclusions about who should have access to a certain set of legal files or where those files should be stored by looking at how those files are managed. Spotting outliers to those security practices reveals policy violations (and risk) without explicitly defining a single policy. At Concentric, we call this process Risk Distance analysis.
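The baseline-and-outlier idea can be sketched in a few lines of Python. This is not Concentric's Risk Distance analysis, whose details aren't described here; it is a simple frequency heuristic under stated assumptions: the files are already grouped into one category, access is modeled as a set of group names per file, and find_outliers, threshold, and the sample data are all hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def find_outliers(files: Dict[str, Set[str]], threshold: float = 0.5) -> Dict[str, Set[str]]:
    """Flag access grants that are unusual among files in the same category.

    files maps a file name to the set of groups that can access it.
    A group belongs to the baseline if it can access at least `threshold`
    of the files; any grant outside the baseline is reported.
    """
    total = len(files)
    counts = Counter(group for groups in files.values() for group in groups)
    baseline = {g for g, c in counts.items() if c / total >= threshold}
    return {
        name: groups - baseline
        for name, groups in files.items()
        if groups - baseline
    }


# Hypothetical set of files already categorized as legal documents.
legal_files = {
    "nda_acme.docx": {"legal"},
    "msa_globex.docx": {"legal", "finance"},
    "lease_hq.pdf": {"legal", "finance"},
    "settlement.docx": {"legal", "everyone"},
}
print(find_outliers(legal_files))  # {'settlement.docx': {'everyone'}}
```

Nobody wrote a policy saying legal files must not be shared with everyone. The sketch infers the norm from how peer files are managed and surfaces the one file that departs from it.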

How can it be used and what benefits/drawbacks are there?

Concentric's Semantic Intelligence focuses on securing unstructured data. Customers use our product to reduce their risk of data loss without increasing costs or needing expert security personnel. In use, you can remediate issues directly from the tool (for example, updating access permissions on groups of files). You can also integrate it with other security tools in an organization's stack (one approach uses our automated file classification as a way to signal existing DLP tools). Finally, our tool can notify the security team and/or end users when a file needs attention.

Besides the need to meet compliance mandates, what other motivations are there for securing unstructured data?

The files and documents created and managed by end users contain all sorts of valuable data. We see four types of business-critical data in almost every organization:

  • Intellectual property: source code, patent applications, and product designs
  • Operational secrets: income, bookings, price lists, and contracts
  • Strategic information: forecasts, road maps, and product launches
  • Regulated data: names, emails, government IDs, and credit card numbers

Compliance is certainly top of mind for many in IT, but regulated data is not the only data that needs protecting.

You can read the original article here.
