How Secret Detection Tools Spot Leaks


Detecting leaked secrets in your repositories is harder than ever. Over half of credentials leaked on GitHub today are generic, lacking recognizable patterns like AWS or Stripe keys.
Modern secret scanners use multiple techniques beyond simple regex strings to maximize detection and minimize false positives. This article gives an overview of these techniques, with code examples to illustrate how they are implemented in practice.
Categorizing Secrets
Secrets serve different functions and come in various formats. Common categories include API keys and tokens, database and service credentials, cryptographic keys and certificates, and infrastructure secrets such as connection strings and webhook URLs.
Secrets detection typically prioritizes structured, identifiable formats. However, security teams should also consider other sensitive data like Personally Identifiable Information (PII), which includes data such as names, emails, and identification numbers. Although secrets scanning tools mainly focus on system credentials, many also offer basic capabilities to flag common PII patterns.
Secrets Detection vs Secrets Validation
Secret scanning typically comprises both detection and validation techniques:
- Detection Techniques: Identifying data that may be exposed sensitive information. This phase focuses on minimizing false negatives (missed secrets) while maintaining reasonable precision.
- Validation Techniques: Confirming that a potential secret is actually sensitive. This phase minimizes false positives (non-sensitive data incorrectly flagged).
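The two phases above can be combined into a simple pipeline: a permissive detector proposes candidates, and a stricter validator filters them. The pattern and rules below are invented for this sketch, not taken from any particular tool:

```python
import re

# Hypothetical, deliberately permissive detection pattern:
# any long run of base62-ish characters is a candidate.
CANDIDATE_RX = re.compile(r"\b[A-Za-z0-9_\-]{20,}\b")

def detect(text):
    """Detection phase: favor recall, tolerate some noise."""
    return CANDIDATE_RX.findall(text)

def validate(candidate):
    """Validation phase: favor precision. Toy rule: a candidate
    counts only if it mixes letters and digits."""
    has_alpha = any(c.isalpha() for c in candidate)
    has_digit = any(c.isdigit() for c in candidate)
    return has_alpha and has_digit

def scan(text):
    return [c for c in detect(text) if validate(c)]
```

Real scanners use far richer rules in both phases, but the division of labor is the same: detection casts a wide net, validation decides what is worth alerting on.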
Secrets Detection Techniques
Regex
Regular expressions (regex) are a foundational technique for secrets detection, employed by almost every open-source and commercial tool, as well as natively within the push protection offered by most VCS platforms (GitHub, GitLab, Bitbucket).
Regex defines specific patterns of characters. Scanning tools search through text content (source code, configuration files, logs, commit messages, etc.) for strings that match these predefined patterns. Many of these patterns can be found in open-source repos, some of which were created specifically to crawl public repos for potentially sensitive information.
Here are some common examples of regex patterns used for secret scanning:
- AWS access key ID: AKIA[0-9A-Z]{16}
- Stripe live secret key: sk_live_[0-9a-zA-Z]{24}
- GitHub personal access token: ghp_[A-Za-z0-9]{36}
Because regex matches based on structure alone, it can flag strings that resemble secrets but are not, such as example keys, test credentials, non-sensitive UUIDs, commit hashes, or even variable names. This "noise" can overwhelm security and development teams, leading to alert fatigue where legitimate alerts are ignored.
Conversely, regex will fail to detect secrets that don't conform to a predefined pattern, such as custom or non-standard formats and most forms of PII. For example, street addresses and postal codes vary too widely between regions to be reliably captured by regex alone.
Here's a simple example of how regex can be used in practice to scan a file for secrets:
```python
import re, pathlib

PATTERNS = {name: re.compile(rx) for name, rx in {
    "AWS_ACCESS_KEY": r"AKIA[0-9A-Z]{16}",
    "STRIPE_KEY": r"sk_live_[0-9a-zA-Z]{24}",
    # Anchoring on "eyJ" (Base64 of '{"') avoids matching arbitrary dotted strings
    "JWT": r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_+/=-]*",
}.items()}

def scan_file(path):
    hits = []
    for n, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
        for tag, rx in PATTERNS.items():
            if rx.search(line):
                hits.append((n, tag, line.strip()))
    return hits

for p in pathlib.Path(".").rglob("*.*"):
    if p.is_file():  # rglob can also yield directories with dots in their names
        for n, tag, snippet in scan_file(p):
            print(f"{p}:{n}: {tag}: {snippet[:120]}")
```
Entropy Filters
This method leverages information theory to measure the randomness or unpredictability of a string. Many secrets, especially machine-generated ones like API keys or tokens, consist of random characters and thus exhibit high entropy compared to human-readable text or structured code. Entropy analysis is particularly effective at detecting secrets that lack a well-defined pattern or structure, including generic keys or custom tokens that regex might miss.
The Shannon entropy formula provides a theoretical basis for quantifying this randomness:
H(X) = -∑[p(x_i) * log_2(p(x_i))]
Where:
- H(X) is the entropy of string X
- p(x_i) is the probability of character x_i appearing in the string
- The sum is calculated over all unique characters in the string
Here's an example implementation:
```python
import math
from collections import Counter

def calculate_shannon_entropy(string):
    """Calculate the Shannon entropy of a string."""
    if not string:
        return 0
    counts = Counter(string)
    entropy = 0
    for count in counts.values():
        probability = count / len(string)
        entropy -= probability * math.log2(probability)
    return entropy

# Example usage
api_key = "AIzaSyC8kHEJzM9XVCDnAMxPy7v1uGOCM9xzUcM"  # High entropy
english_text = "This is a normal sentence with words"  # Lower entropy
print(f"API Key entropy: {calculate_shannon_entropy(api_key)}")
print(f"English text entropy: {calculate_shannon_entropy(english_text)}")
```
Some secret detection tools calculate an entropy score for candidate strings and flag those exceeding defined thresholds as potential secrets:
- Base64 character set: typically >4.5
- Hexadecimal strings: typically >3.0
- Alphanumeric strings: typically >3.7
These thresholds can be configured to tune sensitivity based on specific environments and needs.
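Putting the two ideas together, a scanner might classify a candidate's character set and then apply the matching threshold. The threshold values below are the ones listed above; the classification logic is an illustrative sketch, not any specific tool's implementation:

```python
import math
import string
from collections import Counter

def shannon_entropy(s):
    """Shannon entropy in bits per character."""
    counts = Counter(s)
    return -sum((c / len(s)) * math.log2(c / len(s)) for c in counts.values())

def is_high_entropy(s):
    """Flag a string whose entropy exceeds a charset-specific threshold."""
    if not s:
        return False
    chars = set(s)
    if chars <= set(string.hexdigits):
        threshold = 3.0  # hexadecimal strings
    elif chars <= set(string.ascii_letters + string.digits):
        threshold = 3.7  # alphanumeric strings
    else:
        threshold = 4.5  # assume a wider (e.g. Base64) alphabet
    return shannon_entropy(s) > threshold
```

Keying the threshold to the alphabet matters because a random hex string simply cannot reach the entropy of a random Base64 string of the same length; a single global cutoff would either miss the former or over-flag the latter.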
Entropy analysis is unfortunately prone to false positives. Many non-secret strings exhibit high randomness, such as:
- Cryptographic hashes (MD5, SHA)
- UUIDs and GUIDs
- Compressed data
- Encoded content (Base64)
- Random identifiers in compiled code
Therefore, entropy checks are most effective when combined with other techniques like keyword analysis or validation techniques to improve precision.
Context Analysis
These techniques move beyond analyzing the secret string in isolation, considering its surrounding code, variable names, and external validity to improve accuracy.
Keyword Analysis
Context analysis examines the surrounding content to improve detection precision. Secret values often appear near keywords like password, api_key, token, or secret. Detection tools search for these indicator variables when analyzing potential secrets:
```python
def has_suspicious_context(line, suspected_secret_idx, suspected_secret_len):
    context_window = line[max(0, suspected_secret_idx - 30):suspected_secret_idx + suspected_secret_len + 30]
    suspicious_keywords = ["password", "secret", "token", "api", "key", "auth", "credential"]
    return any(keyword in context_window.lower() for keyword in suspicious_keywords)
```
Code Structure Analysis
More advanced context analysis examines code structures like:
- Assignment operations (apiKey = "AKIAXXXXXXXXXXXXXXXX")
- Function calls (authenticate("p@ssw0rd123"))
- HTTP authentication headers (Authorization: Bearer eyJhbGciOi...)
- Config file patterns (PASSWORD=SomeValue)
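The structural patterns above can be approximated with regexes keyed to assignments, headers, and config lines. The pattern names and exact expressions below are illustrative, assuming fairly conventional code formatting:

```python
import re

STRUCTURE_PATTERNS = {
    # variable = "value" assignments with a credential-like name
    "assignment": re.compile(
        r"""(?i)\b(?:api_?key|password|secret|token)\s*[:=]\s*["']([^"']+)["']"""
    ),
    # Authorization: Bearer <token> HTTP headers
    "auth_header": re.compile(r"Authorization:\s*Bearer\s+(\S+)"),
    # KEY=value lines in .env-style config files
    "env_var": re.compile(r"(?im)^[A-Z0-9_]*(?:KEY|PASSWORD|TOKEN|SECRET)=(\S+)$"),
}

def find_structured_secrets(text):
    """Return (pattern_name, captured_value) pairs for structural matches."""
    hits = []
    for kind, rx in STRUCTURE_PATTERNS.items():
        for m in rx.finditer(text):
            hits.append((kind, m.group(1)))
    return hits
```

Because these patterns anchor on the structure around the value rather than the value itself, they can catch low-entropy, generic secrets (like `PASSWORD=SomeValue`) that both regex-on-the-value and entropy analysis would miss.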
AI / Machine Learning Techniques
Machine Learning (ML) classifiers and Large Language Models (LLMs) enhance detection precision by learning from labeled examples and code context.
The main challenges with ML-based approaches are:
- Requiring sufficient training data of real secrets (which are sensitive by nature)
- Computational overhead during scanning
- The need for periodic retraining as coding patterns evolve
Classification Models
Classification models can be trained on datasets containing labeled examples of:
- True secrets (labeled by type)
- False positives (high-entropy strings that aren't secrets)
- Surrounding context (code patterns, variable names)
These models, when trained correctly, may be able to identify subtle patterns beyond what regex or rules can express.
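The exact feature set is model-specific, but as an illustrative sketch (the feature names here are invented, not from any published model), such a classifier might consume features like these:

```python
import math
import re
from collections import Counter

KEYWORDS = ("password", "secret", "token", "key", "auth")

def extract_features(candidate, context_line):
    """Toy feature vector a secrets classifier might be trained on."""
    entropy = 0.0
    if candidate:
        counts = Counter(candidate)
        entropy = -sum((c / len(candidate)) * math.log2(c / len(candidate))
                       for c in counts.values())
    return {
        "length": len(candidate),
        "entropy": entropy,
        "digit_ratio": sum(c.isdigit() for c in candidate) / max(len(candidate), 1),
        "has_keyword_context": any(k in context_line.lower() for k in KEYWORDS),
        # A known false-positive shape the model can learn to discount
        "looks_like_uuid": bool(re.fullmatch(
            r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
            candidate.lower())),
    }
```

Feeding the model explicit negatives (like the UUID shape) is what lets it learn the boundary between "random-looking" and "actually sensitive" that fixed thresholds cannot express.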
Large Language Models
Recent research explores using LLMs for secrets detection. These models understand code semantics and can identify:
- Unusual patterns of credential usage
- Semantic meaning of variables from context
- Novel or obfuscated secrets missed by rule-based systems
Validation Techniques
Checksums & Modulus Tests
Many credential formats incorporate internal validation mechanisms. For example, credit card numbers use the Luhn algorithm (mod-10), so a scanner can discard candidate numbers that fail the checksum without making any network call.
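A minimal Luhn check, sketched directly from the algorithm's public definition:

```python
def luhn_valid(number):
    """Return True if the digit string passes the Luhn mod-10 checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Double every second digit from the right; subtract 9 if the result exceeds 9
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

A scanner can run this on anything that merely looks like a card number; roughly 90% of random 16-digit strings fail the check, which removes a large class of false positives for free.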
Provider API Verification
One of the most definitive validation techniques involves making a non-destructive API call to verify whether the credential is valid. Here is an example for a Stripe API key:

```python
import requests

def verify_stripe_key(api_key):
    """Verify whether a Stripe API key is live without performing actions."""
    # Use a safe, read-only list endpoint with a minimal page size
    response = requests.get(
        "https://api.stripe.com/v1/customers",
        params={"limit": 1},
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=5,
    )
    if response.status_code == 401:
        return "INVALID"  # Key is not valid
    elif response.status_code in (200, 403):
        return "VALID"  # Key works but may have restricted permissions
    else:
        return "UNKNOWN"  # Cannot determine
```

Lighter-weight metadata checks can also rule out obvious false positives before any network call. For example, GitHub personal access tokens follow a documented prefix and length scheme:
```python
def validate_github_token_metadata(token):
    """Validate GitHub token metadata beyond a bare regex match."""
    # Check prefix (classic token types; fine-grained github_pat_ tokens differ)
    valid_prefixes = ("ghp_", "gho_", "ghu_", "ghs_", "ghr_")
    if not token.startswith(valid_prefixes):
        return False
    # Check length (36 characters after the 4-character prefix for classic tokens)
    value = token[4:]
    if len(value) != 36:
        return False
    # Check character set (Base62: letters and digits)
    if not all(c.isalnum() for c in value):
        return False
    return True
```
Proof-of-Possession Checks
Proof-of-possession validation leverages the mathematical properties of cryptographic credentials to verify their correctness. Unlike API verification which checks if a credential is active, these techniques validate that the credential adheres to its expected cryptographic structure and properties.
SSH Key Validation
SSH private keys follow strict format and cryptographic requirements. A valid key must contain properly formatted headers and be able to derive a corresponding public key. This implementation uses OpenSSH's ssh-keygen to attempt public key derivation:
```python
def validate_ssh_key(private_key_content):
    """Verify if an SSH private key is valid and extract the public key."""
    import tempfile
    import subprocess
    import os
    with tempfile.NamedTemporaryFile(delete=False) as temp:
        temp.write(private_key_content.encode())
        temp_path = temp.name
    try:
        # Set proper permissions
        os.chmod(temp_path, 0o600)
        # Try to extract public key - will fail if malformed
        result = subprocess.run(
            ["ssh-keygen", "-y", "-f", temp_path],
            capture_output=True,
            text=True,
            check=False
        )
        if result.returncode == 0:
            public_key = result.stdout.strip()
            # Could further validate by comparing against known_hosts
            return {"valid": True, "public_key": public_key}
        else:
            return {"valid": False, "error": result.stderr}
    finally:
        os.unlink(temp_path)
```
This validation works because of the asymmetric cryptography underpinning SSH keys. A private key contains enough information to deterministically generate its corresponding public key. If this derivation succeeds, the key follows proper cryptographic structure. Advanced validation can take this further by checking if the derived public key matches any authorized keys in your infrastructure.
JWT Token Validation
JSON Web Tokens can be validated by checking their three-part structure (header, payload, signature) and attempting to decode the components:
```python
def validate_jwt_structure(token):
    """Validate that a JWT token has the correct structure."""
    import base64
    import json
    parts = token.split('.')
    if len(parts) != 3:
        return False
    # Try to decode header and payload
    try:
        # Add padding if needed
        header = parts[0]
        if len(header) % 4 != 0:
            header += '=' * (4 - len(header) % 4)
        payload = parts[1]
        if len(payload) % 4 != 0:
            payload += '=' * (4 - len(payload) % 4)
        # Decode (JWTs use URL-safe Base64) and parse as JSON
        header_data = json.loads(base64.urlsafe_b64decode(header).decode('utf-8'))
        payload_data = json.loads(base64.urlsafe_b64decode(payload).decode('utf-8'))
        # Check for required fields
        if 'alg' not in header_data:
            return False
        return True
    except Exception:
        return False
```
Code scanning tools can use these checks to quickly identify false positives and assess risk levels without making external API calls.
Allowlist/Denylist Mechanisms
Modern tools implement allowlisting through:
- Secure hashing (never storing actual secrets)
- In-line markup (e.g., # pragma: allowlist secret)
- Path-based exclusions for test directories
- Configuration files with explanations for each allowlisted item
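The mechanisms above can be sketched in a single allowlist check. The directory names and sentinel value below are placeholders, and the pragma string mirrors the in-line markup mentioned above:

```python
import hashlib

# Store only SHA-256 hashes of approved findings, never the raw values
ALLOWLIST_HASHES = {
    hashlib.sha256(b"EXAMPLE_TEST_KEY_NOT_A_SECRET").hexdigest(),
}

def is_allowlisted(finding, line, path):
    """Return True if a finding should be suppressed."""
    # In-line markup: the developer explicitly vouches for this line
    if "pragma: allowlist secret" in line:
        return True
    # Path-based exclusions for test fixtures (placeholder directories)
    if path.startswith(("tests/", "fixtures/")):
        return True
    # Hash-based allowlist: compare digests, so the secret itself is never stored
    digest = hashlib.sha256(finding.encode()).hexdigest()
    return digest in ALLOWLIST_HASHES
```

Hashing is the key design choice here: the allowlist file can be committed to the repo without itself becoming a secrets leak.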
Conclusion
Secrets detection is no longer "just a big regex file." Modern tools weave together multiple techniques to strike a balance between catching everything and crying wolf.
Enterprise-ready solutions like our secret scanners for Bitbucket, Confluence and Jira combine detection and validation techniques into a battle-tested platform. Our scanners are the trusted choice for Fortune 500 companies looking to prevent costly data breaches from leaked secrets.
You can try all of our secret scanners for free for 30 days directly from the Atlassian marketplace: