Stop LLM failures from reaching prod.
PromptProof runs in CI to catch hallucinations, prompt regressions, and unsafe outputs — before merge. Deterministic checks, regression comparisons, and cost budgets. No live model calls.


How it works
Three simple steps to bulletproof your LLM outputs
Define expectations
Write simple rules/tests for your model outputs (JSON schema, regex, custom checks).
# .promptproof.yml
tests:
  - name: no-hallucination
    grounding:
      method: semantic_similarity
      threshold: 0.85
  - name: valid-json
    schema:
      type: object
      required: ["status", "data"]
Run in CI
We run your checks against recorded fixtures on every PR. No live model calls in CI.
# .github/workflows/promptproof.yml
name: PromptProof
on: [pull_request]
jobs:
  proof:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: promptproof/action@v0
        with:
          config: .promptproof.yml
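The recorded fixtures are model outputs captured ahead of time and stored alongside the repo, so CI never has to call a model. The fixture format isn't shown on this page; the layout below is a hypothetical sketch only.

# fixtures/support-reply.yml (hypothetical fixture layout)
prompt: "Summarize the customer's issue"
output: '{"status": "open", "data": {"summary": "Billing question"}}'
metadata:
  model: gpt-4o          # model that produced the recorded output
  cost_usd: 0.0021       # recorded cost; assumed to feed cost-budget checks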
Block risky merges
Fail the check when outputs violate policy. Fix, re-run, merge.
✗ PromptProof — Policy Violations (2)
test: no-pii-leak
✗ Found PII: email@example.com
test: output-format
✗ Missing required field: status
Fix violations and re-run checks.
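Cost and latency budgets (listed in the roadmap below) would gate merges the same way as policy checks. A sketch with assumed key names, since the budget syntax isn't finalized:

# .promptproof.yml (budget gates; key names are assumptions)
budgets:
  max_cost_usd_per_run: 0.50    # fail if the recorded run cost exceeds this
  max_latency_p95_ms: 2000      # fail if recorded p95 latency exceeds this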
Coming in ~7-10 days
npx promptproof init # scaffold policies & fixtures
npx promptproof run # run checks locally
LLM Failures Zoo
Real anonymized examples of LLM failures caught in production. Learn what to test before it's too late.
JSON field drift breaks downstream parser
Model returns null instead of the expected string type, causing a parser crash (see the schema sketch after this list)
PII slip in support reply
Full email and phone number exposed in customer support response
Tool hallucination triggers phantom calendar event
Model invents non-existent tool function causing system errors
Summary invents fact with high confidence
Model adds information not present in source text
Refusal regression after prompt refactor
Model starts refusing legitimate requests after prompt update
Unsafe SQL generation allows injection
Generated SQL query vulnerable to injection attacks
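The first case above, JSON field drift, is exactly what a type-strict schema rule is for. A sketch reusing the config shape from the "Define expectations" example, added to the same tests list (the property-level keys are an assumption):

# .promptproof.yml: type-strict rule against field drift (illustrative)
  - name: no-field-drift
    schema:
      type: object
      required: ["status"]
      properties:
        status:
          type: string    # fails when the model drifts to null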
Roadmap
Building in public. Ship fast, iterate faster.
Launched GitHub Action
Now
- Published on GitHub Marketplace
- Collecting feedback and use-cases
- Sample reports & demo template
- Core documentation
Contracts & CLI polish
Now → 1-2 weeks
- Deterministic checks: schema, regex, list/set, bounds, file diff
- Budgets: cost and latency gates
- CLI usability improvements
- Templates and examples
Distribution
2-3 weeks
- NPM/PyPI packages
- Multi-language examples
- CI platform integrations
- Early design partners
Scale
Later
- Hosted dashboard
- Team collaboration
- Advanced analytics
- Pricing experiments
Join the early access
Be among the first to bulletproof your LLM outputs. Shape the future of AI testing.