Structural code quality analysis for Python
CodeClone provides deterministic structural code quality analysis for Python. It detects architectural duplication, computes quality metrics, and enforces CI gates — all with baseline-aware governance that separates known technical debt from new regressions. An optional MCP interface exposes the same canonical analysis pipeline to AI agents and IDEs.
Docs: orenlab.github.io/codeclone · Live sample report: orenlab.github.io/codeclone/examples/report/
Note: This README and docs site track the in-development v2.0.x line from main. For the latest stable CodeClone documentation (v1.4.4), see the v1.4.4 README and the v1.4.4 docs tree.
- Clone detection — function (CFG fingerprint), block (statement windows), and segment (report-only) clones
- Structural findings — duplicated branch families, clone guard/exit divergence and clone-cohort drift (report-only)
- Quality metrics — cyclomatic complexity, coupling (CBO), cohesion (LCOM4), dependency cycles, dead code, health score
- Baseline governance — separates accepted legacy debt from new regressions and lets CI fail only on what changed
- Reports — interactive HTML, deterministic JSON/TXT plus Markdown and SARIF projections from one canonical report
- MCP server — optional read-only MCP surface for AI agents and IDEs, designed as a budget-aware guided control surface for agentic development
- CI-first — deterministic output, stable ordering, exit code contract, pre-commit support
- Fast — incremental caching, parallel processing, warm-run optimization, and reproducible benchmark coverage
pip install codeclone # or: uv tool install codeclone
codeclone . # analyze
codeclone . --html # HTML report
codeclone . --html --open-html-report # open in browser
codeclone . --json --md --sarif --text # all formats
codeclone . --ci # CI mode

More examples
# timestamped report snapshots
codeclone . --html --json --timestamped-report-paths
# changed-scope gating against git diff
codeclone . --changed-only --diff-against main
# shorthand: diff source for changed-scope review
codeclone . --paths-from-git-diff HEAD~1

Run without install
uvx codeclone@latest .

# 1. Generate baseline (commit to repo)
codeclone . --update-baseline
# 2. Add to CI pipeline
codeclone . --ci

What --ci enables
The --ci preset equals --fail-on-new --no-color --quiet.
When a trusted metrics baseline is loaded, CI mode also enables
--fail-on-new-metrics.
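The preset expansion described above can be pictured with a small sketch. This is only an illustration of the documented behavior, not CodeClone's argument parser; the function name `expand_ci_preset` and the `trusted_metrics_baseline` parameter are hypothetical.

```python
# Flags the README states that --ci is equivalent to.
CI_PRESET = ("--fail-on-new", "--no-color", "--quiet")


def expand_ci_preset(argv, trusted_metrics_baseline=False):
    """Return argv with --ci replaced by the flags it stands for (illustrative only)."""
    expanded = []
    for arg in argv:
        if arg == "--ci":
            expanded.extend(CI_PRESET)
            # Per the README, metrics regression gating is added only when
            # a trusted metrics baseline has been loaded.
            if trusted_metrics_baseline:
                expanded.append("--fail-on-new-metrics")
        else:
            expanded.append(arg)
    return expanded


print(expand_ci_preset([".", "--ci"]))
print(expand_ci_preset([".", "--ci"], trusted_metrics_baseline=True))
```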
CodeClone also ships a composite GitHub Action for PR and CI workflows:
- uses: orenlab/codeclone/.github/actions/codeclone@main
  with:
    fail-on-new: "true"
    sarif: "true"
    pr-comment: "true"

It can:
- run baseline-aware gating
- generate JSON and SARIF reports
- upload SARIF to GitHub Code Scanning
- post or update a PR summary comment
Action docs: .github/actions/codeclone/README.md
# Metrics thresholds
codeclone . --fail-complexity 20 --fail-coupling 10 --fail-cohesion 4 --fail-health 60
# Structural policies
codeclone . --fail-cycles --fail-dead-code
# Regression detection vs baseline
codeclone . --fail-on-new-metrics

repos:
  - repo: local
    hooks:
      - id: codeclone
        name: CodeClone
        entry: codeclone
        language: system
        pass_filenames: false
        args: [ ".", "--ci" ]
        types: [ python ]

CodeClone ships an optional read-only MCP server for AI agents and IDE clients.
# install the MCP extra
pip install "codeclone[mcp]"
# local agents (Claude Code, Codex, Copilot, Gemini CLI)
codeclone-mcp --transport stdio
# remote / HTTP-only clients
codeclone-mcp --transport streamable-http --port 8000

20 tools + 10 resources: deterministic, baseline-aware, and read-only. The server never mutates source files, baselines, or repo state.
Payloads are optimized for LLM context: compact summaries by default, full detail on demand. The cheapest useful path is also the most obvious path: first-pass triage stays compact, and deeper detail is explicit.
Recommended agent flow:
analyze_repository or analyze_changed_paths → get_run_summary or get_production_triage →
list_hotspots or check_* → get_finding → get_remediation
Docs: MCP usage guide · MCP interface contract
CodeClone can load project-level configuration from pyproject.toml:
[tool.codeclone]
min_loc = 10
min_stmt = 6
baseline = "codeclone.baseline.json"
skip_metrics = false
quiet = false
html_out = ".cache/codeclone/report.html"
json_out = ".cache/codeclone/report.json"
md_out = ".cache/codeclone/report.md"
sarif_out = ".cache/codeclone/report.sarif"
text_out = ".cache/codeclone/report.txt"
block_min_loc = 20
block_min_stmt = 8
segment_min_loc = 20
segment_min_stmt = 10

Precedence: CLI flags > pyproject.toml > built-in defaults.
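One way to picture the precedence rule is a layered lookup. The sketch below is not CodeClone's loader; `resolve_config` and the `BUILTIN_DEFAULTS` values are assumptions used only to show how the three layers could stack.

```python
from collections import ChainMap

# Hypothetical built-in defaults; the real values are defined by CodeClone itself.
BUILTIN_DEFAULTS = {"min_loc": 10, "min_stmt": 6, "quiet": False}


def resolve_config(cli_flags, pyproject_section, defaults=BUILTIN_DEFAULTS):
    """Return effective settings: CLI flags > pyproject.toml > built-in defaults."""
    # Drop CLI options that were not passed (None) so they do not mask lower layers.
    explicit_cli = {key: value for key, value in cli_flags.items() if value is not None}
    # ChainMap searches its mappings left to right, so earlier layers win.
    return dict(ChainMap(explicit_cli, pyproject_section, defaults))


effective = resolve_config(
    cli_flags={"min_loc": 15, "quiet": None},
    pyproject_section={"min_loc": 12, "quiet": True},
)
print(effective)
```

Here `min_loc` comes from the command line, `quiet` from pyproject.toml, and `min_stmt` falls back to the default.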
Baselines capture the current duplication state. Once committed, they become the CI reference point.
- Clones are classified as NEW (not in baseline) or KNOWN (accepted debt)
- --update-baseline writes both clone and metrics snapshots
- Trust is verified via generator, fingerprint_version, and payload_sha256
- In --ci mode, an untrusted baseline is a contract error (exit 2)
Full contract: Baseline contract
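The NEW versus KNOWN split amounts to a set-membership check against the baseline. The following is a minimal sketch under that assumption; `classify_clones` and the id strings are hypothetical and do not reflect CodeClone's actual baseline format.

```python
def classify_clones(current_ids, baseline_ids):
    """Split clone group ids into NEW and KNOWN relative to a baseline."""
    baseline = set(baseline_ids)
    result = {"NEW": [], "KNOWN": []}
    # Sort so the output order is stable from run to run.
    for clone_id in sorted(current_ids):
        bucket = "KNOWN" if clone_id in baseline else "NEW"
        result[bucket].append(clone_id)
    return result


# Hypothetical ids, purely for illustration.
status = classify_clones({"fn:a1", "fn:b2", "blk:c3"}, {"fn:a1"})
print(status)

# A fail-on-new style gate would only look at the NEW bucket.
should_fail = bool(status["NEW"])
print(should_fail)
```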
| Code | Meaning |
|---|---|
| 0 | Success |
| 2 | Contract error: untrusted baseline, invalid config, unreadable sources in CI |
| 3 | Gating failure: new clones or metric threshold exceeded |
| 5 | Internal error |
Contract errors (2) take precedence over gating failures (3).
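For scripts that wrap the CLI, the exit code contract can be mirrored in a small enum. The codes come from the table above; the `resolve_exit_code` helper is a hypothetical illustration of the stated precedence, not CodeClone's implementation.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    SUCCESS = 0
    CONTRACT_ERROR = 2
    GATING_FAILURE = 3
    INTERNAL_ERROR = 5


def resolve_exit_code(contract_errors, gating_failures):
    """Pick one exit code; contract errors take precedence over gating failures."""
    if contract_errors:
        return ExitCode.CONTRACT_ERROR
    if gating_failures:
        return ExitCode.GATING_FAILURE
    return ExitCode.SUCCESS


# Both kinds of problem present: the contract error wins.
code = resolve_exit_code(["untrusted baseline"], ["new clone group"])
print(int(code), code.name)
```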
| Format | Flag | Default path |
|---|---|---|
| HTML | --html | .cache/codeclone/report.html |
| JSON | --json | .cache/codeclone/report.json |
| Markdown | --md | .cache/codeclone/report.md |
| SARIF | --sarif | .cache/codeclone/report.sarif |
| Text | --text | .cache/codeclone/report.txt |
All report formats are rendered from one canonical JSON report document.
- --open-html-report opens the generated HTML report in the default browser and requires --html.
- --timestamped-report-paths appends a UTC timestamp to default report filenames for bare report flags such as --html or --json. Explicit report paths are not rewritten.
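The timestamping rule (default paths get a UTC suffix, explicit paths are left alone) could look roughly like the sketch below. The helper name `resolve_report_path` and the timestamp format are assumptions; the README does not specify the exact filename layout CodeClone produces.

```python
from datetime import datetime, timezone
from pathlib import Path


def resolve_report_path(default_path, explicit_path=None, timestamped=False, now=None):
    """Return the output path, adding a UTC stamp only to default paths."""
    # An explicit path given by the user is never rewritten.
    if explicit_path is not None:
        return Path(explicit_path)
    path = Path(default_path)
    if not timestamped:
        return path
    moment = now if now is not None else datetime.now(timezone.utc)
    # Assumed format for illustration; the real suffix may differ.
    stamp = moment.strftime("%Y%m%dT%H%M%SZ")
    return path.with_name(f"{path.stem}-{stamp}{path.suffix}")


fixed = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(resolve_report_path(".cache/codeclone/report.html", timestamped=True, now=fixed))
print(resolve_report_path(".cache/codeclone/report.html", explicit_path="out/my.html", timestamped=True))
```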
The docs site also includes live example HTML/JSON/SARIF reports generated from the current codeclone repository.
Structural findings include:
- duplicated_branches
- clone_guard_exit_divergence
- clone_cohort_drift
CodeClone keeps dead-code detection deterministic and static by default. When a symbol is intentionally invoked through runtime dynamics (for example framework callbacks, plugin loading, or reflection), suppress the known false positive explicitly at the declaration site:
# codeclone: ignore[dead-code]
def handle_exception(exc: Exception) -> None:
    ...

class Middleware:  # codeclone: ignore[dead-code]
    ...

Dynamic/runtime false positives are resolved via explicit inline suppressions, not via broad heuristics.
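To see why declaration-site comments keep suppression deterministic, here is a rough sketch of how such markers could be located with the standard `tokenize` module. It is not CodeClone's parser; the regex and the `suppressed_lines` helper are assumptions that only cover the single-rule form shown above.

```python
import io
import re
import tokenize

# Matches the comment form shown in the README, e.g. "# codeclone: ignore[dead-code]".
SUPPRESSION = re.compile(r"#\s*codeclone:\s*ignore\[([\w-]+)\]")


def suppressed_lines(source):
    """Map line number -> suppressed rule name for each marker comment."""
    found = {}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Only real comment tokens count, so markers inside strings are ignored.
        if tok.type == tokenize.COMMENT:
            match = SUPPRESSION.search(tok.string)
            if match:
                found[tok.start[0]] = match.group(1)
    return found


sample = '''# codeclone: ignore[dead-code]
def handle_exception(exc):
    ...

class Middleware:  # codeclone: ignore[dead-code]
    ...
'''
print(suppressed_lines(sample))
```

A checker built this way would still need to associate each marker line with the declaration on or directly below it.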
Canonical JSON report shape (v2.2)
{
"report_schema_version": "2.2",
"meta": {
"codeclone_version": "2.0.0b3",
"project_name": "...",
"scan_root": ".",
"report_mode": "full",
"analysis_thresholds": {
"design_findings": {
"...": "..."
}
},
"baseline": {
"...": "..."
},
"cache": {
"...": "..."
},
"metrics_baseline": {
"...": "..."
},
"runtime": {
"analysis_started_at_utc": "...",
"report_generated_at_utc": "..."
}
},
"inventory": {
"files": {
"...": "..."
},
"code": {
"...": "..."
},
"file_registry": {
"encoding": "relative_path",
"items": []
}
},
"findings": {
"summary": {
"...": "..."
},
"groups": {
"clones": {
"functions": [],
"blocks": [],
"segments": []
},
"structural": {
"groups": []
},
"dead_code": {
"groups": []
},
"design": {
"groups": []
}
}
},
"metrics": {
"summary": {},
"families": {}
},
"derived": {
"suggestions": [],
"overview": {
"families": {},
"top_risks": [],
"source_scope_breakdown": {},
"health_snapshot": {},
"directory_hotspots": {}
},
"hotlists": {
"most_actionable_ids": [],
"highest_spread_ids": [],
"production_hotspot_ids": [],
"test_fixture_hotspot_ids": []
}
},
"integrity": {
"canonicalization": {
"version": "1",
"scope": "canonical_only"
},
"digest": {
"algorithm": "sha256",
"verified": true,
"value": "..."
}
}
}

Canonical contract: Report contract and Dead-code contract
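Because every format is projected from this one JSON document, downstream tooling can read it directly. The sketch below counts clone groups using only keys visible in the shape above; `count_clone_groups` is a hypothetical helper and the sample document is a trimmed stand-in, not real output.

```python
import json


def count_clone_groups(report):
    """Count clone groups per kind, using keys from the report shape above."""
    clones = report["findings"]["groups"]["clones"]
    return {kind: len(clones.get(kind, [])) for kind in ("functions", "blocks", "segments")}


# Trimmed stand-in for a real report; group entries are placeholders.
sample_report = json.loads("""
{
  "report_schema_version": "2.2",
  "findings": {
    "groups": {
      "clones": {"functions": [{}, {}], "blocks": [{}], "segments": []}
    }
  }
}
""")

counts = count_clone_groups(sample_report)
print(counts)
```

With a real run, the input would be loaded from the JSON report path, for example `.cache/codeclone/report.json`.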
- Parse — Python source to AST
- Normalize — canonical structure (robust to renaming, formatting)
- CFG — per-function control flow graph
- Fingerprint — stable hash computation
- Group — function, block, and segment clone groups
- Metrics — complexity, coupling, cohesion, dependencies, dead code, health
- Gate — baseline comparison, threshold checks
Architecture: Architecture narrative · CFG semantics: CFG semantics
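The Parse, Normalize, and Fingerprint steps can be illustrated with the standard `ast` and `hashlib` modules. This is a deliberately simplified sketch: it hashes a normalized AST rather than a control flow graph, and it collapses every identifier to one placeholder, which is cruder than what CodeClone does. The names `_Normalizer` and `fingerprint` are hypothetical.

```python
import ast
import hashlib


class _Normalizer(ast.NodeTransformer):
    """Replace identifiers with a placeholder so renaming does not change the shape."""

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = "_"
        node.annotation = None
        return node

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)


def fingerprint(source):
    """Hash the normalized AST of a source snippet."""
    tree = _Normalizer().visit(ast.parse(source))
    # ast.dump omits line and column attributes by default, so layout differences vanish.
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()


a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
b = "def add_all(items):\n    acc = 0\n    for item in items:   acc += item\n    return acc\n"
c = "def total(xs):\n    return sum(xs)\n"

print(fingerprint(a) == fingerprint(b))  # same structure, different names and layout
print(fingerprint(a) == fingerprint(c))  # different structure
```

Grouping then reduces to collecting functions that share a fingerprint.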
| Topic | Link |
|---|---|
| Contract book (start here) | Contracts and guarantees |
| Exit codes | Exit codes and failure policy |
| Configuration | Config and defaults |
| Baseline contract | Baseline contract |
| Cache contract | Cache contract |
| Report contract | Report contract |
| Metrics & quality gates | Metrics and quality gates |
| Dead code | Dead-code contract |
| Docker benchmark contract | Benchmarking contract |
| Determinism | Determinism policy |
Reproducible Docker Benchmark
./benchmarks/run_docker_benchmark.sh

The wrapper builds benchmarks/Dockerfile, runs isolated container benchmarks, and writes results to .cache/benchmarks/codeclone-benchmark.json.
Use environment overrides to pin the benchmark envelope:
CPUSET=0 CPUS=1.0 MEMORY=2g RUNS=16 WARMUPS=4 \
  ./benchmarks/run_docker_benchmark.sh

Performance claims are backed by the reproducible benchmark workflow documented in Benchmarking contract.
- Code: MPL-2.0
- Documentation: MIT
Versions released before this change remain under their original license terms.
- Issues: https://github.com/orenlab/codeclone/issues
- PyPI: https://pypi.org/project/codeclone/
- Licenses: MPL-2.0 · MIT docs