llm-redact: Find Sensitive Data in Your Repos Before It Reaches an LLM
At Lab34, we build AI tools that help organizations adopt AI safely and at scale. Today we are open-sourcing llm-redact: a command-line tool that scans cloned repositories for secrets, credentials, and personally identifiable information, so you can build guardrail lists and redaction rules before any content reaches an LLM.
We originally built llm-redact while working on the Guard Rails feature for LLM Proxy. We needed a way to audit existing codebases across entire organizations (hundreds of repositories, their files, and their git history) to understand what sensitive data existed and where. That audit tool became llm-redact.
The Problem
Companies adopting LLM-assisted development face a common set of challenges:
- Secrets in code. API keys, database connection strings, private keys, and tokens end up in source files, configuration, and environment templates. Developers paste them into prompts without thinking twice.
- Secrets in git history. A credential that was "removed" three commits ago still lives in the diff history. If an LLM tool reads git context, it reads the secret.
- PII in unexpected places. Email addresses, credit card numbers, national IDs, and internal hostnames appear in test fixtures, seed data, comments, and log samples.
- No visibility. Without scanning, there is no centralized inventory of where sensitive data lives across your repositories. You cannot build redaction rules for things you do not know about.
llm-redact solves all of these with two complementary scanning approaches.
Two Scanners
Regex Scanner
A fast, pattern-based scanner with 45 compiled regex patterns across 8 categories. It scans both file contents and git history (commit messages and diffs), catching secrets that were committed and later removed.
The patterns cover:
- API keys and tokens: AWS, GitHub, GitLab, Slack, Stripe, Google, Heroku, Twilio, SendGrid, npm, PyPI, JWTs, and generic patterns
- Passwords and secrets: assignments, environment variables, basic auth URLs
- Private keys and certificates: RSA, DSA, EC, OpenSSH, PGP, PKCS8
- Database connection strings: PostgreSQL, MySQL, MongoDB, Redis, MSSQL, JDBC
- IP addresses and internal URLs: private ranges, localhost references
- Email addresses
- Credit card numbers: Visa, Mastercard, Amex, Discover
- National IDs: US SSN, UK NINO
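To make the approach concrete, here is a minimal sketch of what category-keyed compiled patterns and a file/diff scan might look like. The specific regexes below are common public signatures used for illustration; they are assumptions, not the tool's actual 45 patterns.

```python
import re

# Illustrative excerpt: a few widely documented secret signatures per category.
PATTERNS = {
    "api_key": [
        re.compile(r"AKIA[0-9A-Z]{16}"),      # AWS access key ID
        re.compile(r"ghp_[A-Za-z0-9]{36}"),   # GitHub personal access token
    ],
    "database": [
        re.compile(r"postgres(?:ql)?://\S+:\S+@\S+"),  # credentials in a DSN
    ],
}

def scan_text(text):
    """Return (category, match) pairs for every pattern hit in a string."""
    hits = []
    for category, regexes in PATTERNS.items():
        for rx in regexes:
            hits.extend((category, m.group()) for m in rx.finditer(text))
    return hits

def scan_diff(diff_text):
    """Scan only the lines a commit added, so secrets deleted later still surface."""
    added = "\n".join(
        line[1:] for line in diff_text.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    )
    return scan_text(added)

print(scan_text("DATABASE_URL=postgresql://admin:s3cret@db.internal:5432/app"))
print(scan_diff('+AWS_ACCESS_KEY_ID = "AKIAIOSFODNN7EXAMPLE"'))
```

Scanning history this way is why a key that was "removed" in a later commit is still found: the hunk that originally added it never leaves the diff log.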
```shell
# Scan a directory of repos with regex patterns
./run-scan.sh scan --path /path/to/repos --depth 2
```

Ollama Scanner
For things regex cannot catch (hardcoded internal hostnames, encoded secrets, sensitive comments), the Ollama scanner sends each file to a local LLM for context-aware analysis. Because it uses Ollama, everything stays on your machine. No data leaves your network.
```shell
# Scan with a local Ollama model
./run-scan.sh ollama --path /path/to/repos --model gemma3:27b --depth 3
```

The two scanners are complementary. Run the regex scanner first for speed and coverage, then run the Ollama scanner on the same repos to catch what patterns miss.
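A context-aware pass like this can be sketched against Ollama's local HTTP API (`POST /api/generate` on port 11434, Ollama's documented default). The prompt wording and response handling here are illustrative assumptions, not the tool's real implementation:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_request(file_path, file_text, model="gemma3:27b"):
    """Build the JSON payload asking the local model to flag sensitive content."""
    prompt = (
        "You are a security auditor. List any secrets, credentials, internal "
        f"hostnames, or PII in the following file ({file_path}), one per line. "
        "Reply NONE if the file is clean.\n\n" + file_text
    )
    return {"model": model, "prompt": prompt, "stream": False}

def scan_file(file_path, file_text, model="gemma3:27b"):
    """Send one file to the local Ollama instance and return the model's reply."""
    payload = json.dumps(build_request(file_path, file_text, model)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because the request never leaves localhost, the file contents stay on the machine running the scan.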
Multi-Repo Discovery
Both scanners support scanning a single repository or an entire directory tree of cloned repos. Point llm-redact at a parent directory and set the --depth flag:
```
/repos/
  org-a/
    api-service/   <- git repo
    frontend/      <- git repo
  org-b/
    backend/       <- git repo
```

```shell
./run-scan.sh scan --path /repos --depth 2
```

All three repositories are discovered and scanned automatically. This is how we use it internally: pointing it at a directory containing hundreds of cloned repos from across the organization.
Output Formats
Results can be rendered as a terminal table or exported for further processing:
| Format | Behaviour |
|---|---|
| table | Colored terminal table grouped by repository |
| json | One .json file per repository |
| txt | One .txt file per repository, one line per finding |
| csv | Single findings.csv with a repository column |
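The CSV shape can be sketched with the standard library. The column names below are assumptions for illustration, not necessarily the real findings.csv header:

```python
import csv
import io

# Hypothetical column layout; the tool's actual header may differ.
FIELDS = ["repository", "file", "line", "category", "match"]

def write_findings_csv(findings, stream):
    """Write one row per finding, with the repository as the first column."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(findings)

findings = [
    {"repository": "api-service", "file": "config.py", "line": 12,
     "category": "api_key", "match": "AKIA..."},
]
buf = io.StringIO()
write_findings_csv(findings, buf)
print(buf.getvalue())
```

Keeping the repository in every row is what makes a single findings.csv workable across hundreds of repos: the file can be filtered and pivoted without any per-repo bookkeeping.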
```shell
# Export findings as CSV
./run-scan.sh scan --path /path/to/repos --output csv --output-dir ./results
```

Key Options at a Glance
| Flag | Description | Default |
|---|---|---|
| --path | Path to a repo or directory of repos | required |
| --depth | Max depth to search for git repos | 2 |
| --output | Output format: table, json, txt, csv | table |
| --output-dir | Directory for file-based output | - |
| --include / --exclude | Glob patterns to filter files | all files |
| --no-git-history | Skip commit and diff scanning | false |
| --history-depth | Max commits to scan per repo | all |
| --workers | Parallel workers for file scanning | CPU count |
| --processes | Repos to scan in parallel | 1 |
| --model | Ollama model name (ollama scanner only) | - |
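The --workers flag maps naturally onto a worker pool that fans file scans out in parallel. A minimal sketch of the idea (with one illustrative pattern, not the tool's actual scheduler):

```python
import re
from concurrent.futures import ThreadPoolExecutor

PATTERN = re.compile(r"ghp_[A-Za-z0-9]{36}")  # one illustrative token signature

def scan_one(item):
    """Scan a single (path, text) pair; returns (path, [matches])."""
    path, text = item
    return path, PATTERN.findall(text)

def scan_all(files, workers=4):
    """Fan file scans out across `workers` threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scan_one, files))

files = [
    ("a.py", "token = 'ghp_" + "a" * 36 + "'"),
    ("b.py", "clean file"),
]
print(scan_all(files))
```

Because `pool.map` preserves input order, the report stays deterministic regardless of how many workers run.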
Getting Started
```shell
# Clone the repo
git clone https://github.com/lab34/llm-redact.git
cd llm-redact

# Run setup (creates venv, installs dependencies)
./setup.sh

# Scan with regex patterns
./run-scan.sh scan --path /path/to/repos

# Scan with a local Ollama model
./run-scan.sh ollama --path /path/to/repos --model llama3
```

The setup.sh script detects your Python installation, creates a virtual environment, and installs all dependencies. The run-scan.sh wrapper activates the venv automatically, so there is nothing else to manage.
How It Works
llm-redact is a Python CLI backed by two independent scanning engines. There are no external services to manage; everything runs locally.
```
Your Repositories
        |
        v
   [llm-redact]
        |
        +-- Repo Discovery  (walks for .git/ dirs up to --depth)
        +-- Regex Scanner   (45 patterns across files + git history)
        +-- Ollama Scanner  (local LLM analysis per file)
        |
        v
 [Findings Report]
 (table, json, txt, csv)
```

The regex scanner uses compiled patterns and parallel workers for speed. The Ollama scanner respects file size limits and can target specific file types with --include globs, so you can focus the LLM on configuration and environment files where secrets are most likely to appear.
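Include/exclude filtering of this kind can be sketched with `fnmatch` globs. The precedence below (excludes win, and an empty include list means everything) is an assumed semantic for illustration:

```python
from fnmatch import fnmatch

def should_scan(path, include=(), exclude=()):
    """Apply --include/--exclude style globs; excludes take precedence."""
    if any(fnmatch(path, pat) for pat in exclude):
        return False
    if not include:                      # no include globs: scan all files
        return True
    return any(fnmatch(path, pat) for pat in include)

# Focus the (slower) LLM pass on config and env files only:
print(should_scan("config/settings.yaml", include=["*.yaml", "*.env*"]))
print(should_scan("src/main.py", include=["*.yaml", "*.env*"]))
```

Narrowing the LLM pass this way trades coverage for speed: the regex scanner still sees every file, while the model only reads the files most likely to hold secrets.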
Why We Built This
When we built the Guard Rails feature for LLM Proxy, we realized that writing redaction rules is only half the problem. You first need to know what to redact and where it lives. For a single repository that is manageable. For an organization with hundreds of repos, years of git history, and dozens of teams, it is not.
We looked at existing secret scanning tools and found them focused on CI/CD gatekeeping (blocking commits) rather than auditing existing codebases at scale. We needed something that could scan an entire organization's worth of cloned repos in one pass and produce a report we could use to build guardrail configurations.
Now we are releasing it as open source under the MIT license, because we believe every company using LLM tooling should be able to audit their codebases for sensitive data before adopting AI-assisted development.
Use Cases
- Security teams auditing repositories before enabling LLM-powered coding tools across the organization.
- Regulated industries (finance, healthcare, government) that must ensure sensitive data cannot reach external APIs.
- Platform teams building guardrail configurations for LLM Proxy or similar proxy layers.
- DevOps and SRE teams looking for leaked credentials in git history across hundreds of repositories.
- Compliance audits: produce a CSV report of every secret and PII finding across your entire codebase.
Open Source
llm-redact is MIT-licensed and available on GitHub. We welcome contributions, bug reports, and feature requests.
- GitHub: github.com/lab34/llm-redact
- Requirements: Python 3.9+, Git, and optionally Ollama for LLM-based scanning