Dark Data & Shadow Servers: Finding the Unknown in the Cloud

Data Discovery

Data comes in many formats, classifications, and sensitivities. However, from a security professional’s perspective, arguably the two most important categories are the data you know and the data you don’t.

Unknown data is not invisible data; it is vulnerable data, because it has never been properly inventoried or classified. In security, you can’t protect what you can’t find, and that creates a new imperative for security professionals: discover unseen data and shadow services.

Shadow Data vs. Dark Data

While they share many similarities, shadow data and dark data have a few key distinctions. Good decision-making within organizations frequently depends on data insights, but dark data delivers none: it is data that businesses collect yet never use for any specific purpose.

Many organizations collect and store far more data than they actually need. Studies suggest that around 55 percent of an organization’s data assets are dark data. This leaves many businesses data rich but information poor: they have gathered the data but lack the tools to make sense of it.

Shadow data is data that does not fall directly under the responsibility of an organization’s IT department, which means IT is not in charge of updating or protecting it. Because shadow data is created by and for end users, it is often entirely unknown to IT teams.

The main difference between dark data and its more shadowy counterpart is that dark data is unknown yet still generated within an organization’s IT infrastructure, while shadow data sits outside the organization’s data management framework entirely. Both, however, can leave your organization vulnerable to data breaches and non-compliance.

Risk Mitigation for Dark Data in the Cloud

Dark data is more susceptible to breaches and leaks than known data because it is unmanaged, unregulated, and unreliable. Knowing what type of data it is, where it lives, whom it belongs to, and how it is being used is critical to protecting your sensitive data.

Because they store huge amounts of information in the cloud, organizations face a major challenge in keeping an accurate, up-to-date inventory of all of their data. Without the proper tools to consolidate that data and scan across structured and unstructured data types, they are left in the dark about sensitive and regulated data. On top of the resource cost of storing unused and unknown data, they face potential breaches and non-compliance with global regulations.

To reduce the amount of dark data in the first place, organizations must prioritize data minimization: limit the data they collect and retain it only for as long as it is needed. The less data stored, the smaller the attack surface exposed to external and internal threats. By reducing the amount of obsolete cloud data, organizations can protect their sensitive data, stay compliant with regulations, and increase operational efficiency.
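As a concrete illustration of retention-driven minimization, the sketch below applies an S3 lifecycle rule that expires objects after a fixed window. The bucket name, prefix, and retention period are hypothetical, and it assumes boto3 with AWS credentials already configured; it is one minimal example of enforcing retention, not a complete minimization program.

```python
# Minimal sketch: enforce a retention window on one S3 prefix via a lifecycle rule.
# Bucket name, prefix, and retention period are hypothetical examples.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-analytics-exports",            # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-stale-exports",
                "Filter": {"Prefix": "exports/"},   # apply only to this prefix
                "Status": "Enabled",
                "Expiration": {"Days": 365},        # delete objects older than one year
            }
        ]
    },
)
```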

Shadow Server Impacts on Cloud Data

In the cloud, dark data and shadow servers pose some unique challenges. Standing up a shadow server or service in a cloud platform like AWS, GCP, or Azure takes seconds. Moreover, a shadow server can be fleeting, appearing and disappearing within hours, days, or weeks.

Given how easy it is to provision anything in the cloud, detection of new shadow servers can’t be limited just to IT. It needs to cover other use cases, such as engineers building out new applications, QA teams leveraging sensitive test data, and data scientists developing new AI algorithms.

This problem is compounded by dark data. Not only does every shadow server hold dark data, but the rapid adoption of cloud data lakes, warehouses, and lakehouses means unknown data is accumulating at a pace unparalleled in IT history.

Services like Snowflake, Redshift, Synapse, S3, BigQuery, and Databricks are collecting data from untold sources and departments at rapid rates. Tracking this data with traditional methods is nearly impossible. You need to monitor the ingress and egress points (ETL and ELT pipelines) and rescan for changes in a cost-effective way.

Lastly, as organizations move their unstructured data to the cloud, dark data identification is no longer limited to large data lake structures; it extends to large files, images, and blob repositories as well. This requires not only the ability to scan this kind of infrastructure but also to detect change dynamically and economically.

Dark Data Discovery in the Cloud

Whether your organization knows it or not, dark data exists. The question is: how do you go about discovering it?

Dark data is rarely stumbled across, so finding, identifying, and remediating it typically calls for intentional effort. The manual work required for this challenging endeavor often begins with a massive audit of all of your cloud data stores.
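To make that concrete, the sketch below shows what the first pass of such an audit might look like on AWS: enumerating S3 buckets and RDS instances into a simple inventory. It assumes boto3 and configured AWS credentials; a real audit would span every account, region, and data service, so treat this as a starting point rather than a complete method.

```python
# Minimal sketch: build a first-pass inventory of AWS data stores (S3 + RDS).
import boto3

def inventory_aws_data_stores(region="us-east-1"):
    inventory = []

    # S3 buckets are account-wide
    s3 = boto3.client("s3", region_name=region)
    for bucket in s3.list_buckets()["Buckets"]:
        inventory.append({"type": "s3_bucket",
                          "name": bucket["Name"],
                          "created": bucket["CreationDate"].isoformat()})

    # RDS instances are regional
    rds = boto3.client("rds", region_name=region)
    for page in rds.get_paginator("describe_db_instances").paginate():
        for db in page["DBInstances"]:
            inventory.append({"type": "rds_instance",
                              "name": db["DBInstanceIdentifier"],
                              "engine": db["Engine"]})
    return inventory

if __name__ == "__main__":
    for item in inventory_aws_data_stores():
        print(item)
```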

Investing in the right software to discover, manage, and reduce your dark data can help your organization protect its sensitive data and cut storage costs. Data stored in the cloud that isn’t being used brings unnecessary risk when it comes to breaches, leaks, and regulatory compliance.

SmallID’s Big Thinking for Detecting Dark Data and Shadow Servers at Cloud Scale

This is where SmallID comes in, with several advanced features for identifying dark data and shadow servers at cloud scale in a cost-effective way.

First, SmallID automates the discovery of servers in minutes across AWS, GCP, Azure, and SaaS. Moreover, it can do this on a continuous basis to detect new workloads, whether permanent or ephemeral.
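This is not SmallID’s implementation, but the underlying idea of continuous discovery can be sketched in a few lines: periodically list compute workloads and flag anything missing from a known inventory. The example below uses EC2; the in-memory inventory and the polling interval are hypothetical stand-ins, and it assumes boto3 with AWS credentials.

```python
# Hedged sketch: poll EC2 and flag workloads that are not yet in the inventory.
import time
import boto3

KNOWN_INSTANCES = set()  # in practice this would be a persistent inventory or CMDB

def discover_new_instances(region="us-east-1"):
    ec2 = boto3.client("ec2", region_name=region)
    new = []
    for page in ec2.get_paginator("describe_instances").paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                instance_id = instance["InstanceId"]
                if instance_id not in KNOWN_INSTANCES:
                    KNOWN_INSTANCES.add(instance_id)
                    new.append(instance_id)
    return new

while True:
    for instance_id in discover_new_instances():
        print(f"Unregistered workload found: {instance_id}")  # queue it for scanning
    time.sleep(300)  # re-check every five minutes to catch ephemeral servers
```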

Once found, these shadow servers are automatically registered and scanned within SmallID to detect, classify, and inventory the dark data they contain. The shadow servers are then monitored for changes in their data posture. The detection of new tables or schemas can trigger differential scans of just the new data, ensuring scanning efficiency and cost minimization.
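To illustrate differential scanning in general terms (again, not SmallID’s internals), the sketch below lists tables from a warehouse’s INFORMATION_SCHEMA, diffs them against the set already scanned, and queues only the new ones. The Snowflake connection parameters and the scan_table step are hypothetical placeholders; it assumes the snowflake-connector-python package.

```python
# Hedged sketch: find tables added since the last scan and queue only those.
import snowflake.connector

already_scanned = {("ANALYTICS", "ORDERS"), ("ANALYTICS", "CUSTOMERS")}  # prior inventory

conn = snowflake.connector.connect(
    account="example_account",      # hypothetical connection details
    user="example_user",
    password="example_password",
    database="PROD",
)

cur = conn.cursor()
cur.execute(
    "SELECT table_schema, table_name FROM information_schema.tables "
    "WHERE table_type = 'BASE TABLE'"
)
current_tables = {(schema, name) for schema, name in cur.fetchall()}

for schema, name in sorted(current_tables - already_scanned):
    print(f"New table detected: {schema}.{name} -- scheduling a differential scan")
    # scan_table(conn, schema, name)  # hypothetical classification step
```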

Second, large data repositories such as Snowflake, S3, BigQuery, Redshift, and Databricks can be automatically scanned for new data in two ways. Data pipelines flowing into the data stores can be scanned at wire speed and used to automatically update the SmallID data inventory or catalog.

Alternatively, the data stores themselves can be scanned at scale with built-in cost protection features such as predictive scanning, optional sampling, and native differential scanning for changes.
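Sampling is the simplest of those cost controls to picture. The sketch below, a generic illustration rather than SmallID’s scanner, classifies a random sample of objects in a bucket instead of every object. The bucket name, sample size, and detection patterns are hypothetical, and it assumes boto3 with AWS credentials.

```python
# Hedged sketch: regex-classify a random sample of S3 objects to bound scan cost.
import random
import re
import boto3

PATTERNS = {
    "email": re.compile(rb"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(rb"\b\d{3}-\d{2}-\d{4}\b"),
}

def sample_scan(bucket="example-data-lake", sample_size=50):
    s3 = boto3.client("s3")
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))

    findings = {}
    for key in random.sample(keys, min(sample_size, len(keys))):
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read(1_000_000)  # first ~1 MB
        hits = [label for label, pattern in PATTERNS.items() if pattern.search(body)]
        if hits:
            findings[key] = hits
    return findings

if __name__ == "__main__":
    print(sample_scan())
```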

Lastly, for unstructured dark data, SmallID offers a number of innovations for scanning at cloud scale with cost efficiency. For files in repositories such as O365, SharePoint Online, Box, GDrive, or Hadoop HDFS, SmallID invented Hyperscan to pre-identify large amounts of sensitive data faster. SmallID also offers native integrations with indexes, where they exist, to locate specific user data or sensitive data at lightning speed.

SmallID also supports multi-format files in blob storage such as S3 and Azure Blob Storage. This includes common file types like Excel, Word, PowerPoint, and PDF as well as more esoteric types such as images, Parquet, Avro, and ORC files.
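As a rough picture of what scanning one of those esoteric formats involves (independent of SmallID’s own engine), the sketch below reads a Parquet file with pyarrow and regex-checks its string columns for likely email values. The file path and pattern are hypothetical; it assumes the pyarrow package is installed.

```python
# Hedged sketch: scan the string columns of a Parquet file for likely email values.
import re

import pyarrow.parquet as pq

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

table = pq.read_table("example_export.parquet")  # hypothetical local or mounted file

for column_name in table.column_names:
    values = [v for v in table.column(column_name).to_pylist() if isinstance(v, str)]
    matches = sum(1 for v in values if EMAIL.search(v))
    if matches:
        print(f"Column '{column_name}' contains {matches} likely email values")
```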

SmallID has you covered with automated detection and inventory of dark data and shadow servers. Try SmallID for free today and start discovering the unknown in your cloud.