How Secret Detection Tools Spot Leaks


Detecting leaked secrets in your repositories is harder than ever. Over half of credentials leaked on GitHub today are generic, lacking recognizable patterns like AWS or Stripe keys.
Modern secret scanners use multiple techniques beyond simple regex strings to maximize detection and minimize false positives. This article gives an overview of these techniques, with code examples to illustrate how they are implemented in practice.
Categorizing Secrets
Secrets serve different functions and come in various formats. Common categories include API keys and tokens, database and service credentials, cryptographic keys and certificates, and infrastructure secrets such as connection strings and webhook URLs.
Secrets detection typically prioritizes structured, identifiable formats. However, security teams should also consider other sensitive data like Personally Identifiable Information (PII), which includes data such as names, emails, and identification numbers. Although secrets scanning tools mainly focus on system credentials, many also offer basic capabilities to flag common PII patterns.
Secrets Detection vs Secrets Validation
Secret scanning typically comprises both detection and validation techniques:
- Detection Techniques: Identifying data that may be exposed sensitive information. This phase focuses on minimizing false negatives (missed secrets) while maintaining reasonable precision.
- Validation Techniques: Confirming that a potential secret is actually sensitive. This phase minimizes false positives (non-sensitive data incorrectly flagged).
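The two phases above can be combined into a simple pipeline: a permissive detector proposes candidates, and a stricter validator filters them. The pattern and rules below are invented for this sketch, not taken from any particular tool:

```python
import re

# Hypothetical, deliberately permissive detection pattern:
# any long run of base62-ish characters is a candidate.
CANDIDATE_RX = re.compile(r"\b[A-Za-z0-9_\-]{20,}\b")

def detect(text):
    """Detection phase: favor recall, tolerate some noise."""
    return CANDIDATE_RX.findall(text)

def validate(candidate):
    """Validation phase: favor precision. Toy rule: a candidate
    counts only if it mixes letters and digits."""
    has_alpha = any(c.isalpha() for c in candidate)
    has_digit = any(c.isdigit() for c in candidate)
    return has_alpha and has_digit

def scan(text):
    return [c for c in detect(text) if validate(c)]
```

Real scanners use far richer rules in both phases, but the division of labor is the same: detection casts a wide net, validation decides what is worth alerting on.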
Secrets Detection Techniques
Regex
Regular expressions (regex) are a foundational technique for secrets detection, employed by almost every open-source and commercial tool, as well as natively within the push protection offered by most VCS platforms (GitHub, GitLab, Bitbucket).
Regex defines specific patterns of characters. Scanning tools search through text content (source code, configuration files, logs, commit messages, etc.) for strings that match these predefined patterns. Many of these patterns can be found in open-source repos, some of which were created specifically to crawl public repos for potentially sensitive information.
Here are some common examples of regex patterns used for secret scanning:
- AWS access key ID: AKIA[0-9A-Z]{16}
- Stripe live secret key: sk_live_[0-9a-zA-Z]{24}
- GitHub personal access token: ghp_[A-Za-z0-9]{36}
Because regex matches based on structure alone, it can flag strings that resemble secrets but are not, such as example keys, test credentials, non-sensitive UUIDs, commit hashes, or even variable names. This "noise" can overwhelm security and development teams, leading to alert fatigue where legitimate alerts are ignored.
Conversely, regex will fail to detect secrets that don't conform to a predefined pattern, such as custom or non-standard formats and most forms of PII. For example, street addresses and postal codes vary too widely between regions to be reliably captured by regex alone.
Here's a simple example of how regex can be used in practice to scan a file for secrets:
```python
import re, pathlib

PATTERNS = {name: re.compile(rx) for name, rx in {
    "AWS_ACCESS_KEY": r"AKIA[0-9A-Z]{16}",
    "STRIPE_KEY": r"sk_live_[0-9a-zA-Z]{24}",
    # Anchoring on "eyJ" (Base64 of '{"') avoids matching arbitrary dotted strings
    "JWT": r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_+/=-]*",
}.items()}

def scan_file(path):
    hits = []
    for n, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
        for tag, rx in PATTERNS.items():
            if rx.search(line):
                hits.append((n, tag, line.strip()))
    return hits

for p in pathlib.Path(".").rglob("*.*"):
    if p.is_file():  # rglob can also yield directories with dots in their names
        for n, tag, snippet in scan_file(p):
            print(f"{p}:{n}: {tag}: {snippet[:120]}")
```
Entropy Filters
This method leverages information theory to measure the randomness or unpredictability of a string. Many secrets, especially machine-generated ones like API keys or tokens, consist of random characters and thus exhibit high entropy compared to human-readable text or structured code. Entropy analysis is particularly effective at detecting secrets that lack a well-defined pattern or structure, including generic keys or custom tokens that regex might miss.
The Shannon entropy formula provides a theoretical basis for quantifying this randomness:
H(X) = -∑[p(x_i) * log_2(p(x_i))]
Where:
- H(X) is the entropy of string X
- p(x_i) is the probability of character x_i appearing in the string
- The sum is calculated over all unique characters in the string
Here's an example implementation:
```python
import math
from collections import Counter

def calculate_shannon_entropy(string):
    """Calculate the Shannon entropy of a string."""
    if not string:
        return 0
    counts = Counter(string)
    entropy = 0
    for count in counts.values():
        probability = count / len(string)
        entropy -= probability * math.log2(probability)
    return entropy

# Example usage
api_key = "AIzaSyC8kHEJzM9XVCDnAMxPy7v1uGOCM9xzUcM"  # High entropy
english_text = "This is a normal sentence with words"  # Lower entropy
print(f"API Key entropy: {calculate_shannon_entropy(api_key)}")
print(f"English text entropy: {calculate_shannon_entropy(english_text)}")
```
Some secret detection tools calculate an entropy score for candidate strings and flag those exceeding defined thresholds as potential secrets:
- Base64 character set: typically >4.5
- Hexadecimal strings: typically >3.0
- Alphanumeric strings: typically >3.7
These thresholds can be configured to tune sensitivity based on specific environments and needs.
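Putting the two ideas together, a scanner might classify a candidate's character set and then apply the matching threshold. The threshold values below are the ones listed above; the classification logic is an illustrative sketch, not any specific tool's implementation:

```python
import math
import string
from collections import Counter

def shannon_entropy(s):
    """Shannon entropy in bits per character."""
    counts = Counter(s)
    return -sum((c / len(s)) * math.log2(c / len(s)) for c in counts.values())

def is_high_entropy(s):
    """Flag a string whose entropy exceeds a charset-specific threshold."""
    if not s:
        return False
    chars = set(s)
    if chars <= set(string.hexdigits):
        threshold = 3.0  # hexadecimal strings
    elif chars <= set(string.ascii_letters + string.digits):
        threshold = 3.7  # alphanumeric strings
    else:
        threshold = 4.5  # assume a wider (e.g. Base64) alphabet
    return shannon_entropy(s) > threshold
```

Keying the threshold to the alphabet matters because a random hex string simply cannot reach the entropy of a random Base64 string of the same length; a single global cutoff would either miss the former or over-flag the latter.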
Entropy analysis is unfortunately prone to false positives. Many non-secret strings exhibit high randomness, such as:
- Cryptographic hashes (MD5, SHA)
- UUIDs and GUIDs
- Compressed data
- Encoded content (Base64)
- Random identifiers in compiled code
Therefore, entropy checks are most effective when combined with other techniques like keyword analysis or validation techniques to improve precision.
Context Analysis
These techniques move beyond analyzing the secret string in isolation, considering its surrounding code, variable names, and external validity to improve accuracy.
Keyword Analysis
Context analysis examines the surrounding content to improve detection precision. Secret values often appear near keywords like password, api_key, token, or secret. Detection tools search for these indicator variables when analyzing potential secrets:
```python
def has_suspicious_context(line, suspected_secret_idx, suspected_secret_len):
    context_window = line[max(0, suspected_secret_idx - 30):suspected_secret_idx + suspected_secret_len + 30]
    suspicious_keywords = ["password", "secret", "token", "api", "key", "auth", "credential"]
    return any(keyword in context_window.lower() for keyword in suspicious_keywords)
```
Code Structure Analysis
More advanced context analysis examines code structures like:
- Assignment operations (apiKey = "AKIAXXXXXXXXXXXXXXXX")
- Function calls (authenticate("p@ssw0rd123"))
- HTTP authentication headers (Authorization: Bearer eyJhbGciOi...)
- Config file patterns (PASSWORD=SomeValue)
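The structural patterns above can be approximated with regexes keyed to assignments, headers, and config lines. The pattern names and exact expressions below are illustrative, assuming fairly conventional code formatting:

```python
import re

STRUCTURE_PATTERNS = {
    # variable = "value" assignments with a credential-like name
    "assignment": re.compile(
        r"""(?i)\b(?:api_?key|password|secret|token)\s*[:=]\s*["']([^"']+)["']"""
    ),
    # Authorization: Bearer <token> HTTP headers
    "auth_header": re.compile(r"Authorization:\s*Bearer\s+(\S+)"),
    # KEY=value lines in .env-style config files
    "env_var": re.compile(r"(?im)^[A-Z0-9_]*(?:KEY|PASSWORD|TOKEN|SECRET)=(\S+)$"),
}

def find_structured_secrets(text):
    """Return (pattern_name, captured_value) pairs for structural matches."""
    hits = []
    for kind, rx in STRUCTURE_PATTERNS.items():
        for m in rx.finditer(text):
            hits.append((kind, m.group(1)))
    return hits
```

Because these patterns anchor on the structure around the value rather than the value itself, they can catch low-entropy, generic secrets (like `PASSWORD=SomeValue`) that both regex-on-the-value and entropy analysis would miss.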
AI / Machine Learning Techniques
Machine Learning (ML) classifiers and Large Language Models (LLMs) enhance detection precision by learning from labeled examples and code context.
The main challenges with ML-based approaches are:
- Requiring sufficient training data of real secrets (which are sensitive by nature)
- Computational overhead during scanning
- The need for periodic retraining as coding patterns evolve
Classification Models
Classification models can be trained on datasets containing labeled examples of:
- True secrets (labeled by type)
- False positives (high-entropy strings that aren't secrets)
- Surrounding context (code patterns, variable names)
These models, when trained correctly, may be able to identify subtle patterns beyond what regex or rules can express.
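The exact feature set is model-specific, but as an illustrative sketch (the feature names here are invented, not from any published model), such a classifier might consume features like these:

```python
import math
import re
from collections import Counter

KEYWORDS = ("password", "secret", "token", "key", "auth")

def extract_features(candidate, context_line):
    """Toy feature vector a secrets classifier might be trained on."""
    entropy = 0.0
    if candidate:
        counts = Counter(candidate)
        entropy = -sum((c / len(candidate)) * math.log2(c / len(candidate))
                       for c in counts.values())
    return {
        "length": len(candidate),
        "entropy": entropy,
        "digit_ratio": sum(c.isdigit() for c in candidate) / max(len(candidate), 1),
        "has_keyword_context": any(k in context_line.lower() for k in KEYWORDS),
        # A known false-positive shape the model can learn to discount
        "looks_like_uuid": bool(re.fullmatch(
            r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
            candidate.lower())),
    }
```

Feeding the model explicit negatives (like the UUID shape) is what lets it learn the boundary between "random-looking" and "actually sensitive" that fixed thresholds cannot express.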
Large Language Models
Recent research explores using LLMs for secrets detection. These models understand code semantics and can identify:
- Unusual patterns of credential usage
- Semantic meaning of variables from context
- Novel or obfuscated secrets missed by rule-based systems
Validation Techniques
Checksums & Modulus Tests
Many credential formats incorporate internal validation mechanisms. For example, credit card numbers use the Luhn algorithm (mod-10), so a scanner can discard candidate numbers that fail the checksum without making any network call.
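A minimal Luhn check, sketched directly from the algorithm's public definition:

```python
def luhn_valid(number):
    """Return True if the digit string passes the Luhn mod-10 checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Double every second digit from the right; subtract 9 if the result exceeds 9
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

A scanner can run this on anything that merely looks like a card number; roughly 90% of random 16-digit strings fail the check, which removes a large class of false positives for free.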
Provider API Verification
One of the most definitive validation techniques involves making a non-destructive API call to verify whether the credential is valid. Here is an example for a Stripe API key:

```python
import requests

def verify_stripe_key(api_key):
    """Verify whether a Stripe API key is live without performing actions."""
    # Use a safe, read-only list endpoint with a minimal page size
    response = requests.get(
        "https://api.stripe.com/v1/customers",
        params={"limit": 1},
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=5,
    )
    if response.status_code == 401:
        return "INVALID"  # Key is not valid
    elif response.status_code in (200, 403):
        return "VALID"  # Key works but may have restricted permissions
    else:
        return "UNKNOWN"  # Cannot determine
```

Lighter-weight metadata checks can also rule out obvious false positives before any network call. For example, GitHub personal access tokens follow a documented prefix and length scheme:
```python
def validate_github_token_metadata(token):
    """Validate GitHub token metadata beyond a bare regex match."""
    # Check prefix (classic token types; fine-grained github_pat_ tokens differ)
    valid_prefixes = ("ghp_", "gho_", "ghu_", "ghs_", "ghr_")
    if not token.startswith(valid_prefixes):
        return False
    # Check length (36 characters after the 4-character prefix for classic tokens)
    value = token[4:]
    if len(value) != 36:
        return False
    # Check character set (Base62: letters and digits)
    if not all(c.isalnum() for c in value):
        return False
    return True
```
Proof-of-Possession Checks
Proof-of-possession validation leverages the mathematical properties of cryptographic credentials to verify their correctness. Unlike API verification which checks if a credential is active, these techniques validate that the credential adheres to its expected cryptographic structure and properties.
SSH Key Validation
SSH private keys follow strict format and cryptographic requirements. A valid key must contain properly formatted headers and be able to derive a corresponding public key. This implementation uses OpenSSH's ssh-keygen to attempt public key derivation:
```python
def validate_ssh_key(private_key_content):
    """Verify if an SSH private key is valid and extract the public key."""
    import tempfile
    import subprocess
    import os
    with tempfile.NamedTemporaryFile(delete=False) as temp:
        temp.write(private_key_content.encode())
        temp_path = temp.name
    try:
        # Set proper permissions
        os.chmod(temp_path, 0o600)
        # Try to extract public key - will fail if malformed
        result = subprocess.run(
            ["ssh-keygen", "-y", "-f", temp_path],
            capture_output=True,
            text=True,
            check=False
        )
        if result.returncode == 0:
            public_key = result.stdout.strip()
            # Could further validate by comparing against known_hosts
            return {"valid": True, "public_key": public_key}
        else:
            return {"valid": False, "error": result.stderr}
    finally:
        os.unlink(temp_path)
```
This validation works because of the asymmetric cryptography underpinning SSH keys. A private key contains enough information to deterministically generate its corresponding public key. If this derivation succeeds, the key follows proper cryptographic structure. Advanced validation can take this further by checking if the derived public key matches any authorized keys in your infrastructure.
JWT Token Validation
JSON Web Tokens can be validated by checking their three-part structure (header, payload, signature) and attempting to decode the components:
```python
def validate_jwt_structure(token):
    """Validate that a JWT token has the correct structure."""
    import base64
    import json
    parts = token.split('.')
    if len(parts) != 3:
        return False
    # Try to decode header and payload
    try:
        # Add padding if needed
        header = parts[0]
        if len(header) % 4 != 0:
            header += '=' * (4 - len(header) % 4)
        payload = parts[1]
        if len(payload) % 4 != 0:
            payload += '=' * (4 - len(payload) % 4)
        # Decode (JWTs use URL-safe Base64) and parse as JSON
        header_data = json.loads(base64.urlsafe_b64decode(header).decode('utf-8'))
        payload_data = json.loads(base64.urlsafe_b64decode(payload).decode('utf-8'))
        # Check for required fields
        if 'alg' not in header_data:
            return False
        return True
    except Exception:
        return False
```
Code scanning tools can use these checks to quickly identify false positives and assess risk levels without making external API calls.
Allowlist/Denylist Mechanisms
Modern tools implement allowlisting through:
- Secure hashing (never storing actual secrets)
- In-line markup (e.g., # pragma: allowlist secret)
- Path-based exclusions for test directories
- Configuration files with explanations for each allowlisted item
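The mechanisms above can be sketched in a single allowlist check. The directory names and sentinel value below are placeholders, and the pragma string mirrors the in-line markup mentioned above:

```python
import hashlib

# Store only SHA-256 hashes of approved findings, never the raw values
ALLOWLIST_HASHES = {
    hashlib.sha256(b"EXAMPLE_TEST_KEY_NOT_A_SECRET").hexdigest(),
}

def is_allowlisted(finding, line, path):
    """Return True if a finding should be suppressed."""
    # In-line markup: the developer explicitly vouches for this line
    if "pragma: allowlist secret" in line:
        return True
    # Path-based exclusions for test fixtures (placeholder directories)
    if path.startswith(("tests/", "fixtures/")):
        return True
    # Hash-based allowlist: compare digests, so the secret itself is never stored
    digest = hashlib.sha256(finding.encode()).hexdigest()
    return digest in ALLOWLIST_HASHES
```

Hashing is the key design choice here: the allowlist file can be committed to the repo without itself becoming a secrets leak.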
Conclusion
Secrets detection is no longer "just a big regex file." Modern tools weave together multiple techniques to strike a balance between catching everything and crying wolf.
Enterprise-ready solutions like our secret scanners for Bitbucket, Confluence and Jira combine detection and validation techniques into a battle-tested platform. Our scanners are the trusted choice for Fortune 500 companies looking to prevent costly data breaches from leaked secrets.
You can try all of our secret scanners for free for 30 days directly from the Atlassian marketplace: