How Secret Detection Tools Spot Leaks

Detecting leaked secrets in your repositories is harder than ever. Over half of credentials leaked on GitHub today are generic, lacking recognizable patterns like AWS or Stripe keys.

Modern secret scanners use multiple techniques beyond simple regex strings to maximize detection and minimize false positives. This article gives an overview of these techniques, with code examples to illustrate how they are implemented in practice.

Categorizing Secrets

Secrets serve different functions and come in various formats. Common categories include:

| Type | Typical format clues | Risk if leaked |
|---|---|---|
| API key / token | Fixed prefix (AKIA…, sk_live_…, xoxb-…) or high-entropy slug | Service impersonation / resource abuse |
| Database credentials | URI (mysql://user:pass@host/db) | Data exfiltration |
| Passwords | password= assignment | Account takeover |
| Private key | PEM header -----BEGIN.*PRIVATE KEY----- | Server takeover, message forging |
| JWT / OAuth | Three base64url chunks separated by . | Session hijack |
| Generic secret | No structure; just looks random | Hardest to spot |

Secrets detection typically prioritizes structured, identifiable formats. However, security teams should also consider other sensitive data like Personally Identifiable Information (PII), which includes data such as names, emails, and identification numbers. Although secrets scanning tools mainly focus on system credentials, many also offer basic capabilities to flag common PII patterns.

Secrets Detection vs Secrets Validation

Secret scanning typically comprises both detection and validation techniques:

  1. Detection Techniques: Identifying data that may be exposed sensitive information. This phase focuses on minimizing false negatives (missed secrets) while maintaining reasonable precision.
  2. Validation Techniques: Confirming that a potential secret is actually sensitive. This phase minimizes false positives (non-sensitive data incorrectly flagged).

Secrets Detection Techniques

| Technique | What it Looks For | Strengths | Weaknesses |
|---|---|---|---|
| Regex patterns | Strings that match a well-known structure | Near-perfect recall for branded keys | Sample keys & UUIDs trigger noise |
| Entropy filters | High-randomness blobs ≥ N chars | Catches custom / generic tokens | Hashes & compressed data look random too |
| Static context | Suspicious variable names, file types, comments | Cheap precision boost | Still rule-based; easy to bypass |
| ML classifiers | Learned patterns in code + context | Adapts to novel formats | Needs labelled data & tuning |

Regex

Regular expressions (regex) are a foundational technique for secrets detection, employed by almost every open-source and commercial tool, as well as natively within the push protection offered by most VCS platforms (GitHub, GitLab, Bitbucket).

Regex defines specific patterns of characters. Scanning tools search through text content (source code, configuration files, logs, commit messages, etc.) to find strings that match these predefined patterns. Many of these patterns can be found in open-source repos, some of which were created specifically to crawl public repos for potentially sensitive information.

Here are some common examples of regex patterns that can be used for secret scanning:

| Secret | Regex pattern |
|---|---|
| AWS Access Key ID | AKIA[0-9A-Z]{16} |
| AWS Secret Access Key | (?<![A-Z0-9])[A-Za-z0-9/+=]{40}(?![A-Z0-9]) |
| Stripe live secret key | sk_live_[0-9a-zA-Z]{24} |
| GitHub PAT | gh[pousr]_[A-Za-z0-9_]{36} |
| Slack bot/user token | xox[aboprs]-[0-9a-zA-Z-]{10,48} |
| Google API key | AIza[0-9A-Za-z\-_]{35} |
| JWT (structure check) | [A-Za-z0-9-_]+\.[A-Za-z0-9-_]+\.[A-Za-z0-9-_+/=]* |
| SSH private-key header | -----BEGIN (RSA\|OPENSSH\|DSA\|EC) PRIVATE KEY----- |

Because regex matches based on structure alone, it can flag strings that resemble secrets but are not, such as example keys, test credentials, non-sensitive UUIDs, commit hashes, or even variable names. This "noise" can overwhelm security and development teams, leading to alert fatigue where legitimate alerts are ignored.

Conversely, regex will fail to detect secrets that don't conform to a predefined pattern, such as custom or non-standard formats and most forms of PII. For example, street addresses and postal codes vary too widely between regions to be reliably captured by regex alone.

Here's a simple example of how regex can be used in practice to scan a file for secrets:

import re, pathlib

# Compile each pattern once so the scan loop stays fast
PATTERNS = {name: re.compile(rx) for name, rx in {
    "AWS_ACCESS_KEY": r"AKIA[0-9A-Z]{16}",
    "STRIPE_KEY":     r"sk_live_[0-9a-zA-Z]{24}",
    "JWT":            r"[A-Za-z0-9-_]+\.[A-Za-z0-9-_]+\.[A-Za-z0-9-_+/=]*",
}.items()}

def scan_file(path):
    """Return (line number, pattern name, matching line) tuples for one file."""
    hits = []
    for n, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
        for tag, rx in PATTERNS.items():
            if rx.search(line):
                hits.append((n, tag, line.strip()))
    return hits

# Walk the working tree and report every match
for p in pathlib.Path(".").rglob("*.*"):
    if p.is_file():
        for n, tag, snippet in scan_file(p):
            print(f"{p}:{n}: {tag}: {snippet[:120]}")

Entropy Filters

This method leverages information theory to measure the randomness or unpredictability of a string. Many secrets, especially machine-generated ones like API keys or tokens, consist of random characters and thus exhibit high entropy compared to human-readable text or structured code. Entropy analysis is particularly effective at detecting secrets that lack a well-defined pattern or structure, including generic keys or custom tokens that regex might miss.

The Shannon entropy formula provides a theoretical basis for quantifying this randomness:

H(X) = -∑[p(x_i) * log_2(p(x_i))]

Where:

  1. H(X) is the entropy of string X
  2. p(x_i) is the probability of character x_i appearing in the string
  3. The sum is calculated over all unique characters in the string
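
For example, for the string "aab": p(a) = 2/3 and p(b) = 1/3, so H = -(2/3 · log2(2/3) + 1/3 · log2(1/3)) ≈ 0.92 bits per character. A machine-generated token, whose characters are drawn nearly uniformly from a large alphabet, typically scores above 4 bits per character.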

Here's an example implementation:

import math
from collections import Counter

def calculate_shannon_entropy(string):
    """Calculate the Shannon entropy of a string."""
    if not string:
        return 0
        
    counts = Counter(string)
    
    entropy = 0
    for count in counts.values():
        probability = count / len(string)
        entropy -= probability * math.log2(probability)
        
    return entropy

# Example usage
api_key = "AIzaSyC8kHEJzM9XVCDnAMxPy7v1uGOCM9xzUcM"  # High entropy
english_text = "This is a normal sentence with words"  # Lower entropy

print(f"API Key entropy: {calculate_shannon_entropy(api_key)}")
print(f"English text entropy: {calculate_shannon_entropy(english_text)}")

Some secret detection tools calculate an entropy score for candidate strings and flag those exceeding defined thresholds as potential secrets:

  1. Base64 character set: typically >4.5
  2. Hexadecimal strings: typically >3.0
  3. Alphanumeric strings: typically >3.7

These thresholds can be configured to tune sensitivity based on specific environments and needs.
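
For instance, a scanner might select a threshold based on the candidate's character set before scoring it, reusing calculate_shannon_entropy from above. This is a minimal sketch; the charset-detection logic and threshold values are illustrative:

import string

def flag_high_entropy(candidate):
    """Flag a candidate using per-charset entropy thresholds."""
    charsets = [
        (set("0123456789abcdef"), 3.0),                             # hexadecimal
        (set(string.ascii_letters + string.digits), 3.7),           # alphanumeric
        (set(string.ascii_letters + string.digits + "+/="), 4.5),   # Base64
    ]
    # Match the smallest character set first, then apply its threshold
    for charset, threshold in charsets:
        if set(candidate.lower()) <= charset:
            return calculate_shannon_entropy(candidate) > threshold
    return False

print(flag_high_entropy("AIzaSyC8kHEJzM9XVCDnAMxPy7v1uGOCM9xzUcM"))  # True (entropy ≈ 4.6 > 3.7)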

Entropy analysis is unfortunately prone to false positives. Many non-secret strings exhibit high randomness, such as:

  1. Cryptographic hashes (MD5, SHA)
  2. UUIDs and GUIDs
  3. Compressed data
  4. Encoded content (Base64)
  5. Random identifiers in compiled code

Therefore, entropy checks are most effective when combined with other techniques like keyword analysis or validation techniques to improve precision.

Context Analysis

These techniques move beyond analyzing the secret string in isolation, considering its surrounding code, variable names, and external validity to improve accuracy.

Keyword Analysis

Context analysis examines the surrounding content to improve detection precision. Secret values often appear near keywords like password, api_key, token, or secret. Detection tools search for these indicator variables when analyzing potential secrets:

def has_suspicious_context(line, suspected_secret_idx, suspected_secret_len):
    context_window = line[max(0, suspected_secret_idx - 30):suspected_secret_idx + suspected_secret_len + 30]
    suspicious_keywords = ["password", "secret", "token", "api", "key", "auth", "credential"]
    return any(keyword in context_window.lower() for keyword in suspicious_keywords)
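
A quick usage example (the sample line and offsets are illustrative):

line = 'db_password = "hJ8kQ2mZ9pLw4xT"'
start = line.index('"') + 1                      # position of the suspected secret
print(has_suspicious_context(line, start, 15))   # True: "password" falls inside the window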

Code Structure Analysis

More advanced context analysis examines code structures like:

  1. Assignment operations (apiKey = "AKIAXXXXXXXXXXXXXXXX")
  2. Function calls (authenticate("p@ssw0rd123"))
  3. HTTP authentication headers (Authorization: Bearer eyJhbGciOi...)
  4. Config file patterns (PASSWORD=SomeValue)
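
As a minimal sketch, an assignment-aware matcher can pair an indicator name with the value assigned to it; the keyword list and value pattern below are illustrative, not exhaustive:

import re

# Pair an indicator name with an assigned value: apiKey = "...", PASSWORD=..., etc.
ASSIGNMENT_RE = re.compile(
    r'(?i)\b(api[_-]?key|secret|token|passwd|password|auth)\b'  # indicator name
    r'\s*[:=]\s*'                                               # assignment / config separator
    r'["\']?([A-Za-z0-9/+_.=-]{8,})["\']?'                      # candidate value
)

match = ASSIGNMENT_RE.search('apiKey = "AKIAXXXXXXXXXXXXXXXX"')
if match:
    print(match.group(1), "->", match.group(2))  # apiKey -> AKIAXXXXXXXXXXXXXXXX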

AI / Machine Learning Techniques

Machine Learning (ML) classifiers and Large Language Models (LLMs) enhance detection precision by learning from labeled examples and code context.

The main challenges with ML-based approaches are:

  1. Requiring sufficient training data of real secrets (which are sensitive by nature)
  2. Computational overhead during scanning
  3. The need for periodic retraining as coding patterns evolve

Classification Models

Models can be trained specifically using datasets with labeled examples of:

  1. True secrets (labeled by type)
  2. False positives (high-entropy strings that aren't secrets)
  3. Surrounding context (code patterns, variable names)

These models, when trained correctly, may be able to identify subtle patterns beyond what regex or rules can express.
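
As an illustrative sketch only (a production classifier would need thousands of labeled examples and richer context features), a character-n-gram model can learn the "shape" of secret strings:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Character n-grams capture prefixes, charset, and length effects
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)

# Toy training set: 1 = secret, 0 = not a secret
samples = [
    "AKIAIOSFODNN7EXAMPLE",                # AWS-style key
    "sk_live_4eC39HqLyjWDarjtT1zdp7dc",    # Stripe-style key
    "d41d8cd98f00b204e9800998ecf8427e",    # MD5 hash (common false positive)
    "get_user_profile_by_id",              # ordinary identifier
]
labels = [1, 1, 0, 0]
model.fit(samples, labels)

# Probability that an unseen string is a secret
print(model.predict_proba(["xoxb-1234567890-abcdefghijklmnop"])[:, 1])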

Large Language Models

Recent research explores using LLMs for secrets detection. These models understand code semantics and can identify:

  1. Unusual patterns of credential usage
  2. Semantic meaning of variables from context
  3. Novel or obfuscated secrets missed by rule-based systems

Validation Techniques

| Technique | What it confirms | Strengths | Weaknesses |
|---|---|---|---|
| Checksum / modulus | Internal checksum matches (e.g., AWS AKI has a mod-11 check) | Fast, no external calls | Limited to formats with a checksum |
| Provider API ping | Credential actually authorizes (token introspection) | Near-zero false positives | Requires an outbound call; must be done carefully |
| Allowlist / Denylist | Agreed false positives auto-suppressed | Keeps dev signal clean | Lists drift if not maintained |

Checksums & Modulus Tests

Many credential formats incorporate internal validation mechanisms. For example, credit card numbers use the Luhn algorithm (mod-10):
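
A minimal sketch of that check:

def luhn_check(number):
    """Return True if a numeric string passes the Luhn (mod-10) checksum."""
    digits = [int(d) for d in number]
    # Double every second digit from the right, subtracting 9 when the result exceeds 9
    for i in range(len(digits) - 2, -1, -2):
        digits[i] = digits[i] * 2
        if digits[i] > 9:
            digits[i] -= 9
    return sum(digits) % 10 == 0

print(luhn_check("4242424242424242"))  # True: a well-known test card number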

API tokens can embed similar internal structure. For example, a GitHub personal access token can be checked against its published format (prefix, length, character set) before any network call is made:

def validate_github_token_metadata(token):
    """Validate GitHub token metadata beyond the regex pattern."""
    import string
    
    # Check prefix (personal access token types)
    valid_prefixes = ("ghp_", "gho_", "ghu_", "ghs_", "ghr_")
    if not token.startswith(valid_prefixes):
        return False
        
    # Check length (36 characters after the 4-character prefix, matching the regex above)
    value = token[4:]
    if len(value) != 36:
        return False
        
    # Check character set (alphanumeric plus underscore, matching the regex above)
    if not all(c in string.ascii_letters + string.digits + "_" for c in value):
        return False
        
    return True

Provider API Verification

One of the most definitive validation techniques involves making a non-destructive API call to verify if the credential is valid.

Here is an example for a Stripe API key:

def verify_stripe_key(api_key):
    """Check whether a Stripe API key is live without performing any write actions."""
    import requests
    
    # Use a safe, read-only endpoint; limit=1 keeps the request minimal
    url = "https://api.stripe.com/v1/customers"
    
    response = requests.get(
        url,
        headers={"Authorization": f"Bearer {api_key}"},
        params={"limit": 1},
        timeout=5,
    )
    
    if response.status_code == 401:
        return "INVALID"  # Key is not valid
    elif response.status_code in (200, 403):
        return "VALID"    # Key works but may have restricted permissions
    else:
        return "UNKNOWN"  # Cannot determine

Proof-of-Possession Checks

Proof-of-possession validation leverages the mathematical properties of cryptographic credentials to verify their correctness. Unlike API verification, which checks whether a credential is active, these techniques validate that the credential adheres to its expected cryptographic structure and properties.

SSH Key Validation

SSH private keys follow strict format and cryptographic requirements. A valid key must contain properly formatted headers and be able to derive a corresponding public key. This implementation uses OpenSSH's ssh-keygen to attempt public key derivation:

def validate_ssh_key(private_key_content):
    """Verify if an SSH private key is valid and extract the public key."""
    import tempfile
    import subprocess
    import os
    
    with tempfile.NamedTemporaryFile(delete=False) as temp:
        temp.write(private_key_content.encode())
        temp_path = temp.name
    
    try:
        # Set proper permissions
        os.chmod(temp_path, 0o600)
        
        # Try to extract public key - will fail if malformed
        result = subprocess.run(
            ["ssh-keygen", "-y", "-f", temp_path], 
            capture_output=True, 
            text=True,
            check=False
        )
        
        if result.returncode == 0:
            public_key = result.stdout.strip()
            # Could further validate by comparing against known_hosts
            return {"valid": True, "public_key": public_key}
        else:
            return {"valid": False, "error": result.stderr}
    finally:
        os.unlink(temp_path)

This validation works because of the asymmetric cryptography underpinning SSH keys. A private key contains enough information to deterministically generate its corresponding public key. If this derivation succeeds, the key follows proper cryptographic structure. Advanced validation can take this further by checking if the derived public key matches any authorized keys in your infrastructure.

JWT Token Validation

JSON Web Tokens can be validated by checking their three-part structure (header, payload, signature) and attempting to decode the components:

def validate_jwt_structure(token):
    """Validate that a JWT token has the correct three-part structure."""
    import base64
    import json
    
    parts = token.split('.')
    if len(parts) != 3:
        return False
        
    def b64url_decode(segment):
        # JWT segments use the URL-safe alphabet with padding stripped
        return base64.urlsafe_b64decode(segment + '=' * (-len(segment) % 4))
        
    try:
        # Decode and parse the header and payload as JSON
        header_data = json.loads(b64url_decode(parts[0]).decode('utf-8'))
        payload_data = json.loads(b64url_decode(parts[1]).decode('utf-8'))
        
        # Both must be JSON objects, and the header must name an algorithm
        if not isinstance(header_data, dict) or not isinstance(payload_data, dict):
            return False
        if 'alg' not in header_data:
            return False
            
        return True
    except Exception:
        return False

Code scanning tools can use these checks to quickly identify false positives and assess risk levels without making external API calls.

Allowlist/Denylist Mechanisms

Modern tools implement allowlisting through:

  1. Secure hashing (never storing actual secrets; see the sketch after this list)
  2. In-line markup (e.g., # pragma: allowlist secret)
  3. Path-based exclusions for test directories
  4. Configuration files with explanations for each allowlisted item
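
A minimal sketch of the hashing approach (the ALLOWLIST set and its contents are hypothetical):

import hashlib

# SHA-256 digests of approved false positives; plaintext values are never stored
ALLOWLIST = {
    "5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8",  # "password"
}

def is_allowlisted(candidate):
    """Suppress findings whose hash matches an approved false positive."""
    return hashlib.sha256(candidate.encode()).hexdigest() in ALLOWLIST

print(is_allowlisted("password"))  # True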

Conclusion

Secrets detection is no longer "just a big regex file." Modern tools weave together multiple techniques to strike a balance between catching everything and crying wolf.

Enterprise-ready solutions like our secret scanners for Bitbucket, Confluence and Jira combine detection and validation techniques into a battle-tested platform. Our scanners are the trusted choice for Fortune 500 companies looking to prevent costly data breaches from leaked secrets.

You can try all of our secret scanners for free for 30 days directly from the Atlassian marketplace: