llm-redact: Find Sensitive Data in Your Repos Before It Reaches an LLM
At Lab34, we build AI tools that help organizations adopt AI safely and at scale. Today we are open-sourcing llm-redact: a command-line tool that scans cloned repositories for secrets, credentials, and personally identifiable information, so you can build guardrail lists and redaction rules before any content reaches an LLM.
We originally built llm-redact while working on the Guard Rails feature for LLM Proxy. We needed a way to audit existing codebases across entire organizations (hundreds of repositories, their files, and their git history) to understand what sensitive data existed and where. That audit tool became llm-redact.
The Problem
Companies adopting LLM-assisted development face a common set of challenges:
- Secrets in code. API keys, database connection strings, private keys, and tokens end up in source files, configuration, and environment templates. Developers paste them into prompts without thinking twice.
- Secrets in git history. A credential that was "removed" three commits ago still lives in the diff history. If an LLM tool reads git context, it reads the secret.
- PII in unexpected places. Email addresses, credit card numbers, national IDs, and internal hostnames appear in test fixtures, seed data, comments, and log samples.
- No visibility. Without scanning, there is no centralized inventory of where sensitive data lives across your repositories. You cannot build redaction rules for things you do not know about.
llm-redact solves all of these with two complementary scanning approaches.
Two Scanners
Regex Scanner
A fast, pattern-based scanner with 45 compiled regex patterns across 8 categories. It scans both file contents and git history (commit messages and diffs), catching secrets that were committed and later removed.
The patterns cover:
- API keys and tokens: AWS, GitHub, GitLab, Slack, Stripe, Google, Heroku, Twilio, SendGrid, npm, PyPI, JWTs, and generic patterns
- Passwords and secrets: assignments, environment variables, basic auth URLs
- Private keys and certificates: RSA, DSA, EC, OpenSSH, PGP, PKCS8
- Database connection strings: PostgreSQL, MySQL, MongoDB, Redis, MSSQL, JDBC
- IP addresses and internal URLs: private ranges, localhost references
- Email addresses
- Credit card numbers: Visa, Mastercard, Amex, Discover
- National IDs: US SSN, UK NINO
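To make the approach concrete, here is a minimal sketch of what category-keyed compiled patterns and a file/diff scan might look like. The specific regexes below are common public signatures used for illustration; they are assumptions, not the tool's actual 45 patterns.

```python
import re

# Illustrative excerpt: a few widely documented secret signatures per category.
PATTERNS = {
    "api_key": [
        re.compile(r"AKIA[0-9A-Z]{16}"),      # AWS access key ID
        re.compile(r"ghp_[A-Za-z0-9]{36}"),   # GitHub personal access token
    ],
    "database": [
        re.compile(r"postgres(?:ql)?://\S+:\S+@\S+"),  # credentials in a DSN
    ],
}

def scan_text(text):
    """Return (category, match) pairs for every pattern hit in a string."""
    hits = []
    for category, regexes in PATTERNS.items():
        for rx in regexes:
            hits.extend((category, m.group()) for m in rx.finditer(text))
    return hits

def scan_diff(diff_text):
    """Scan only the lines a commit added, so secrets deleted later still surface."""
    added = "\n".join(
        line[1:] for line in diff_text.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    )
    return scan_text(added)

print(scan_text("DATABASE_URL=postgresql://admin:s3cret@db.internal:5432/app"))
print(scan_diff('+AWS_ACCESS_KEY_ID = "AKIAIOSFODNN7EXAMPLE"'))
```

Scanning history this way is why a key that was "removed" in a later commit is still found: the hunk that originally added it never leaves the diff log.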
```shell
# Scan a directory of repos with regex patterns
./run-scan.sh scan --path /path/to/repos --depth 2
```

Ollama Scanner
For things regex cannot catch (hardcoded internal hostnames, encoded secrets, sensitive comments), the Ollama scanner sends each file to a local LLM for context-aware analysis. Because it uses Ollama, everything stays on your machine. No data leaves your network.
```shell
# Scan with a local Ollama model
./run-scan.sh ollama --path /path/to/repos --model gemma3:27b --depth 3
```

The two scanners are complementary. Run the regex scanner first for speed and coverage, then run the Ollama scanner on the same repos to catch what patterns miss.
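A context-aware pass like this can be sketched against Ollama's local HTTP API (`POST /api/generate` on port 11434, Ollama's documented default). The prompt wording and response handling here are illustrative assumptions, not the tool's real implementation:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_request(file_path, file_text, model="gemma3:27b"):
    """Build the JSON payload asking the local model to flag sensitive content."""
    prompt = (
        "You are a security auditor. List any secrets, credentials, internal "
        f"hostnames, or PII in the following file ({file_path}), one per line. "
        "Reply NONE if the file is clean.\n\n" + file_text
    )
    return {"model": model, "prompt": prompt, "stream": False}

def scan_file(file_path, file_text, model="gemma3:27b"):
    """Send one file to the local Ollama instance and return the model's reply."""
    payload = json.dumps(build_request(file_path, file_text, model)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because the request never leaves localhost, the file contents stay on the machine running the scan.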
Multi-Repo Discovery
Both scanners support scanning a single repository or an entire directory tree of cloned repos. Point llm-redact at a parent directory and set the --depth flag:
```
/repos/
  org-a/
    api-service/   <- git repo
    frontend/      <- git repo
  org-b/
    backend/       <- git repo
```

```shell
./run-scan.sh scan --path /repos --depth 2
```

All three repositories are discovered and scanned automatically. This is how we use it internally: pointing it at a directory containing hundreds of cloned repos from across the organization.
Output Formats
Results can be rendered as a terminal table or exported for further processing:
| Format | Behaviour |
|---|---|
| table | Colored terminal table grouped by repository |
| json | One .json file per repository |
| txt | One .txt file per repository, one line per finding |
| csv | Single findings.csv with a repository column |
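The CSV shape can be sketched with the standard library. The column names below are assumptions for illustration, not necessarily the real findings.csv header:

```python
import csv
import io

# Hypothetical column layout; the tool's actual header may differ.
FIELDS = ["repository", "file", "line", "category", "match"]

def write_findings_csv(findings, stream):
    """Write one row per finding, with the repository as the first column."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(findings)

findings = [
    {"repository": "api-service", "file": "config.py", "line": 12,
     "category": "api_key", "match": "AKIA..."},
]
buf = io.StringIO()
write_findings_csv(findings, buf)
print(buf.getvalue())
```

Keeping the repository in every row is what makes a single findings.csv workable across hundreds of repos: the file can be filtered and pivoted without any per-repo bookkeeping.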
```shell
# Export findings as CSV
./run-scan.sh scan --path /path/to/repos --output csv --output-dir ./results
```

Key Options at a Glance
| Flag | Description | Default |
|---|---|---|
| --path | Path to a repo or directory of repos | required |
| --depth | Max depth to search for git repos | 2 |
| --output | Output format: table, json, txt, csv | table |
| --output-dir | Directory for file-based output | - |
| --include / --exclude | Glob patterns to filter files | all files |
| --no-git-history | Skip commit and diff scanning | false |
| --history-depth | Max commits to scan per repo | all |
| --workers | Parallel workers for file scanning | CPU count |
| --processes | Repos to scan in parallel | 1 |
| --model | Ollama model name (ollama scanner only) | - |
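The --workers flag maps naturally onto a worker pool that fans file scans out in parallel. A minimal sketch of the idea (with one illustrative pattern, not the tool's actual scheduler):

```python
import re
from concurrent.futures import ThreadPoolExecutor

PATTERN = re.compile(r"ghp_[A-Za-z0-9]{36}")  # one illustrative token signature

def scan_one(item):
    """Scan a single (path, text) pair; returns (path, [matches])."""
    path, text = item
    return path, PATTERN.findall(text)

def scan_all(files, workers=4):
    """Fan file scans out across `workers` threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scan_one, files))

files = [
    ("a.py", "token = 'ghp_" + "a" * 36 + "'"),
    ("b.py", "clean file"),
]
print(scan_all(files))
```

Because `pool.map` preserves input order, the report stays deterministic regardless of how many workers run.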
Getting Started
```shell
# Clone the repo
git clone https://github.com/lab34/llm-redact.git
cd llm-redact

# Run setup (creates venv, installs dependencies)
./setup.sh

# Scan with regex patterns
./run-scan.sh scan --path /path/to/repos

# Scan with a local Ollama model
./run-scan.sh ollama --path /path/to/repos --model llama3
```

The setup.sh script detects your Python installation, creates a virtual environment, and installs all dependencies. The run-scan.sh wrapper activates the venv automatically, so there is nothing else to manage.
How It Works
llm-redact is a Python CLI backed by two independent scanning engines. There are no external services to manage; everything runs locally.
```
Your Repositories
        |
        v
   [llm-redact]
        |
        +-- Repo Discovery  (walks for .git/ dirs up to --depth)
        +-- Regex Scanner   (45 patterns across files + git history)
        +-- Ollama Scanner  (local LLM analysis per file)
        |
        v
 [Findings Report]
 (table, json, txt, csv)
```

The regex scanner uses compiled patterns and parallel workers for speed. The Ollama scanner respects file size limits and can target specific file types with --include globs, so you can focus the LLM on configuration and environment files where secrets are most likely to appear.
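Include/exclude filtering of this kind can be sketched with `fnmatch` globs. The precedence below (excludes win, and an empty include list means everything) is an assumed semantic for illustration:

```python
from fnmatch import fnmatch

def should_scan(path, include=(), exclude=()):
    """Apply --include/--exclude style globs; excludes take precedence."""
    if any(fnmatch(path, pat) for pat in exclude):
        return False
    if not include:                      # no include globs: scan all files
        return True
    return any(fnmatch(path, pat) for pat in include)

# Focus the (slower) LLM pass on config and env files only:
print(should_scan("config/settings.yaml", include=["*.yaml", "*.env*"]))
print(should_scan("src/main.py", include=["*.yaml", "*.env*"]))
```

Narrowing the LLM pass this way trades coverage for speed: the regex scanner still sees every file, while the model only reads the files most likely to hold secrets.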
Why We Built This
When we built the Guard Rails feature for LLM Proxy, we realized that writing redaction rules is only half the problem. You first need to know what to redact and where it lives. For a single repository that is manageable. For an organization with hundreds of repos, years of git history, and dozens of teams, it is not.
We looked at existing secret scanning tools and found them focused on CI/CD gatekeeping (blocking commits) rather than auditing existing codebases at scale. We needed something that could scan an entire organization's worth of cloned repos in one pass and produce a report we could use to build guardrail configurations.
Now we are releasing it as open source under the MIT license, because we believe every company using LLM tooling should be able to audit their codebases for sensitive data before adopting AI-assisted development.
Use Cases
- Security teams auditing repositories before enabling LLM-powered coding tools across the organization.
- Regulated industries (finance, healthcare, government) that must ensure sensitive data cannot reach external APIs.
- Platform teams building guardrail configurations for LLM Proxy or similar proxy layers.
- DevOps and SRE teams looking for leaked credentials in git history across hundreds of repositories.
- Compliance audits: produce a CSV report of every secret and PII finding across your entire codebase.
Open Source
llm-redact is MIT-licensed and available on GitHub. We welcome contributions, bug reports, and feature requests.
- GitHub: github.com/lab34/llm-redact
- Requirements: Python 3.9+, Git, and optionally Ollama for LLM-based scanning