588 lines
26 KiB
Markdown
588 lines
26 KiB
Markdown
# Semantic Batching and Output Chunking Design
|
||
|
||
**Date:** 2026-05-24
|
||
**Status:** Draft
|
||
**Branch:** `feat/semantic-batching-and-output-chunking`
|
||
**Issue:** [#159](https://github.com/Lum1104/Understand-Anything/issues/159) — Frequently seeing output limit exceeded
|
||
|
||
---
|
||
|
||
## Problem
|
||
|
||
The `/understand` skill's Phase 2 dispatches `file-analyzer` subagents in batches of 20-30 files each (`skills/understand/SKILL.md:282`). Two issues compound on output-constrained LLM backends (notably Bedrock OPUS with default max_tokens of 4096-8192):
|
||
|
||
1. **Output cap pressure.** Each `file-analyzer` writes one `batch-<N>.json` containing all nodes (file + functions + classes) and edges for its batch. For 25 dense files the JSON content easily exceeds the per-turn `Write(content=...)` token budget. The agent improvises by entering an undefined "minimal output mode" and drops nodes/edges silently. Issue #159 reports this for OPUS on Bedrock at the 100-file scale.
|
||
|
||
2. **Count-based batching breaks module semantics.** Files are batched by count, not by logical relationship. Files that import each other (and would together form an `auth` module, an `api` module, etc.) get split across batches. The file-analyzer only sees within-batch edges confidently; `calls`/`related`/`inherits`/`implements` edges between modules get dropped at batch boundaries.
|
||
|
||
The existing `recover_imports_from_scan` in `merge-batch-graphs.py:913` is a deterministic safety net for `imports` edges — but it cannot recover semantic edges (calls / related / inherits / implements). Those are lost.
|
||
|
||
---
|
||
|
||
## Goals
|
||
|
||
- Eliminate "Batch X failed (output limit)" from `/understand` runs on Bedrock OPUS for projects up to 500 files.
|
||
- Improve cross-batch semantic edge coverage by replacing count-based batching with Louvain community detection on the import graph.
|
||
- Maintain `imports` edge coverage parity (no regression on existing safety net).
|
||
- Stay within one PR — defer broader refactors to follow-ups (Section "Out of scope").
|
||
|
||
## Non-goals
|
||
|
||
- Refactoring Phase 1 / 2 tree-sitter usage to deduplicate per-batch extraction.
|
||
- Adding LLM-generated file summaries to neighborMap.
|
||
- Auto-tuning output thresholds per provider.
|
||
|
||
---
|
||
|
||
## Architecture
|
||
|
||
Pipeline before:
|
||
|
||
```
|
||
Phase 1 project-scanner → scan-result.json (files + importMap)
|
||
Phase 2 file-analyzer (×N concur) → batch-<i>.json (one per batch; SKILL.md prose batching)
|
||
Phase 2末 merge-batch-graphs.py → assembled-graph.json
|
||
```
|
||
|
||
Pipeline after:
|
||
|
||
```
|
||
Phase 1 project-scanner → scan-result.json (unchanged)
|
||
Phase 1.5 compute-batches.mjs → batches.json (NEW — semantic batching + neighborMap)
|
||
Phase 2 file-analyzer (×N concur) → batch-<i>.json (single) OR batch-<i>-part-<k>.json (split)
|
||
Phase 2末 merge-batch-graphs.py → assembled-graph.json (verified, no code change)
|
||
```
|
||
|
||
**Phase 1.5 single responsibility:** topology decision + neighborMap construction. Pure algorithm — reads `scan-result.json`, writes `batches.json`, no LLM calls.
|
||
|
||
**Phase 2 changes:** SKILL.md stops doing prose batching; iterates `batches.json` and dispatches one file-analyzer per batch.
|
||
|
||
**file-analyzer changes:** consumes neighborMap; self-checks output size before writing; splits into `batch-<i>-part-<k>.json` when above thresholds.
|
||
|
||
**merge-batch-graphs.py:** no code changes — the `batch-*.json` glob and sort-key regex already accept multi-part naming. Test fixture and stderr report enhancement added.
|
||
|
||
---
|
||
|
||
## Component 1 — `compute-batches.mjs`
|
||
|
||
**Location:** `understand-anything-plugin/skills/understand/compute-batches.mjs`
|
||
|
||
**Invocation:** `node <SKILL_DIR>/compute-batches.mjs $PROJECT_ROOT [--changed-files=<path>]`
|
||
|
||
**Input:** `$PROJECT_ROOT/.understand-anything/intermediate/scan-result.json`
|
||
|
||
**Output:** `$PROJECT_ROOT/.understand-anything/intermediate/batches.json`
|
||
|
||
### Dependencies
|
||
|
||
Added to `understand-anything-plugin/package.json`:
|
||
|
||
- `graphology` (~10KB)
|
||
- `graphology-communities-louvain` (~30KB)
|
||
|
||
Reuses `@understand-anything/core`'s `TreeSitterPlugin` and `PluginRegistry` (already imported by `extract-structure.mjs`).
|
||
|
||
### Algorithm
|
||
|
||
```
|
||
1. Load scan-result.json.
|
||
|
||
2. Partition files by fileCategory:
|
||
- codeFiles = files where fileCategory === "code"
|
||
- nonCodeFiles = the rest
|
||
|
||
3. Code batching (Louvain on import graph):
|
||
a. Build undirected graph: nodes = codeFiles, edges = importMap relations
|
||
(weight=1, undirected so import and imported-by both count).
|
||
b. Run graphology-communities-louvain → community assignment per file.
|
||
c. For any community with size > 35 (max): split via edge-betweenness greedy
|
||
cut (or simpler weakly-connected-component partition) until each
|
||
sub-community ≤ 35. Log warning per split.
|
||
(Whether this branch fires is decided by the implementation prototype
|
||
step — see "Prototype-first implementation" below.)
|
||
d. Communities with size < 5 are kept as-is. Wasted dispatches are
|
||
bounded by the 5-concurrent cap, and the alternative ("merge small")
|
||
adds edge cases without proportional value.
|
||
|
||
4. Non-code batching (hardcoded heuristics, moved from SKILL.md prose):
|
||
- Group A: For each directory containing a `Dockerfile`, bundle that
|
||
directory's `Dockerfile` + any `docker-compose.*` + any
|
||
`.dockerignore` → one batch per such directory (so multi-service
|
||
repos with several Dockerfiles get one batch per service).
|
||
- Group B: `.github/workflows/*.yml` files → one batch.
|
||
- Group C: `.gitlab-ci.yml` + files under `.circleci/` → one batch.
|
||
- Group D: SQL files under any `migrations/` or `migration/` directory,
|
||
sorted by filename → one batch per directory.
|
||
- Group E: All other non-code files grouped by their immediate parent
|
||
directory, max 20 per batch.
|
||
|
||
5. Assign batchIndex: code communities first (1..N), non-code groups
|
||
second (N+1..M).
|
||
|
||
6. Exports extraction:
|
||
- For each code file, run TreeSitterPlugin.extract() and collect
|
||
top-level exports (function names, class names, exported const names).
|
||
- Per-file failures: catch, set exports = [], emit warning.
|
||
- Non-code files: exports = [].
|
||
|
||
7. Construct neighborMap (1-hop):
|
||
For each file F in batch B:
|
||
neighborMap[F.path] = [
|
||
{ path: G.path, batchIndex: G.batch, symbols: G.exports }
|
||
for G in importMap[F.path] ∪ reverseImportMap[F.path]
|
||
where G.batch ≠ B
|
||
]
|
||
If neighborMap[F.path].length > 50, truncate to top 50 by neighbor
|
||
degree (highest-imported neighbors kept), emit warning.
|
||
|
||
8. Construct batchImportData:
|
||
For each batch B:
|
||
batchImportData[F.path] = importMap[F.path] for F in B.files
|
||
|
||
9. Write batches.json.
|
||
|
||
Fallback (script-internal): If steps 3a-3c throw, catch → emit warning
|
||
→ assign batches by alphabetical chunking (12 files per code batch).
|
||
Steps 4, 6, 7, 8 still run normally. Set `algorithm: "count-fallback"`
|
||
in the output.
|
||
```
|
||
|
||
### Louvain implementation
|
||
|
||
Use `graphology-communities-louvain`'s default modularity-greedy algorithm:
|
||
|
||
```js
|
||
import Graph from 'graphology';
|
||
import louvain from 'graphology-communities-louvain';
|
||
|
||
const graph = new Graph({ type: 'undirected' });
|
||
for (const file of codeFiles) graph.addNode(file.path);
|
||
for (const [src, targets] of Object.entries(importMap)) {
|
||
for (const tgt of targets) {
|
||
if (graph.hasNode(src) && graph.hasNode(tgt) && !graph.hasEdge(src, tgt)) {
|
||
graph.addEdge(src, tgt);
|
||
}
|
||
}
|
||
}
|
||
const communities = louvain(graph); // { nodeId: communityId }
|
||
```
|
||
|
||
### Output schema (`batches.json`)
|
||
|
||
```json
|
||
{
|
||
"schemaVersion": 1,
|
||
"algorithm": "louvain",
|
||
"totalFiles": 100,
|
||
"totalBatches": 7,
|
||
"batches": [
|
||
{
|
||
"batchIndex": 1,
|
||
"files": [
|
||
{ "path": "src/auth/login.ts", "language": "typescript",
|
||
"sizeLines": 120, "fileCategory": "code" }
|
||
],
|
||
"batchImportData": {
|
||
"src/auth/login.ts": ["src/auth/session.ts", "src/db/users.ts"]
|
||
},
|
||
"neighborMap": {
|
||
"src/auth/login.ts": [
|
||
{ "path": "src/db/users.ts", "batchIndex": 3,
|
||
"symbols": ["User", "findById", "createUser"] }
|
||
]
|
||
}
|
||
}
|
||
]
|
||
}
|
||
```
|
||
|
||
`algorithm` is `"louvain"` on the happy path, `"count-fallback"` when the Louvain branch crashed.
|
||
|
||
### `--changed-files` mode
|
||
|
||
When invoked with `--changed-files=<path>`, the script:
|
||
|
||
- Loads file paths from `<path>` (one per line).
|
||
- Still builds the full project import graph (for accurate neighborMap construction).
|
||
- Only emits batches containing changed files.
|
||
- neighborMap entries reference unchanged files with their batchIndex from the deterministic full-graph Louvain re-run. The seed is fixed so the assignment is reproducible across incremental invocations.
|
||
|
||
### Prototype-first implementation
|
||
|
||
Before writing the full script, build a minimal skeleton:
|
||
|
||
1. Load `scan-result.json` from this repo's `.understand-anything/` directory (if absent, generate via `/understand --full`).
|
||
2. Run Louvain only — no size enforcement, no neighborMap.
|
||
3. Print community size distribution.
|
||
4. Decide: do real-world communities cluster in [5, 35]? If yes, size enforcement branch may be unnecessary or trivially defensive. If no, implement edge-betweenness split.
|
||
|
||
This gates the more speculative code (size enforcement) on empirical observation rather than upfront design.
|
||
|
||
---
|
||
|
||
## Component 2 — `skills/understand/SKILL.md` changes
|
||
|
||
### Add — Phase 1.5 section (after Phase 1)
|
||
|
||
```markdown
|
||
## Phase 1.5 — BATCH
|
||
|
||
Report: `[Phase 1.5/7] Computing semantic batches...`
|
||
|
||
Run the bundled batching script:
|
||
\`\`\`bash
|
||
node <SKILL_DIR>/compute-batches.mjs $PROJECT_ROOT
|
||
\`\`\`
|
||
|
||
Reads `.understand-anything/intermediate/scan-result.json`, writes
|
||
`.understand-anything/intermediate/batches.json`.
|
||
|
||
Capture stderr. Append any line starting with `Warning:` to
|
||
$PHASE_WARNINGS for the final report.
|
||
|
||
If the script exits non-zero, the failure is hard — relay the full
|
||
stderr to the user as a Phase 1.5 failure. Do not attempt to recover;
|
||
the script's internal fallback (count-based) already handles recoverable
|
||
issues. A non-zero exit means a fundamental problem (missing input file,
|
||
malformed JSON, etc.).
|
||
```
|
||
|
||
### Replace — Phase 2 ANALYZE section (current SKILL.md:280-332)
|
||
|
||
Delete the existing "Batch the file list from Phase 1 into groups of 20-30 files each" prose, the non-code grouping prose (now in compute-batches), and the dispatch-time `batchImportData` construction prose (now provided in batches.json). Replace with:
|
||
|
||
```markdown
|
||
## Phase 2 — ANALYZE
|
||
|
||
### Full analysis path
|
||
|
||
Load `.understand-anything/intermediate/batches.json` (produced by
|
||
Phase 1.5). Iterate the `batches[]` array.
|
||
|
||
Report: `[Phase 2/7] Analyzing files — <totalFiles> files in
|
||
<totalBatches> batches (up to 5 concurrent)...`
|
||
|
||
For each batch, dispatch a `file-analyzer` subagent (up to 5
|
||
concurrent). Dispatch prompt template:
|
||
|
||
> Analyze these files and produce GraphNode and GraphEdge objects.
|
||
> Project root: `$PROJECT_ROOT`
|
||
> Project: `<projectName>`
|
||
> Languages: `<languages>`
|
||
> Batch: `<batchIndex>/<totalBatches>`
|
||
> Skill directory: `<SKILL_DIR>`
|
||
> Output: write to
|
||
> `$PROJECT_ROOT/.understand-anything/intermediate/batch-<batchIndex>.json`
|
||
> (single-file mode) OR `batch-<batchIndex>-part-<k>.json` (split mode,
|
||
> per Step B of your output protocol).
|
||
>
|
||
> Pre-resolved import data (use directly — do NOT re-resolve from source):
|
||
> \`\`\`json
|
||
> <batchImportData JSON inline from batches.json[i].batchImportData>
|
||
> \`\`\`
|
||
>
|
||
> Cross-batch neighbors with their exported symbols (confidence boost
|
||
> for cross-batch edges):
|
||
> \`\`\`json
|
||
> <neighborMap JSON inline from batches.json[i].neighborMap>
|
||
> \`\`\`
|
||
>
|
||
> Files to analyze:
|
||
> 1. `<path>` (<sizeLines> lines, language: `<language>`,
|
||
> fileCategory: `<fileCategory>`)
|
||
> ...
|
||
|
||
$LANGUAGE_DIRECTIVE
|
||
|
||
After ALL batches complete, run the merge-and-normalize script:
|
||
\`\`\`bash
|
||
python <SKILL_DIR>/merge-batch-graphs.py $PROJECT_ROOT
|
||
\`\`\`
|
||
|
||
(Rest of Phase 2 unchanged.)
|
||
```
|
||
|
||
### Replace — Incremental update path (current SKILL.md:355-366)
|
||
|
||
```markdown
|
||
### Incremental update path
|
||
|
||
Run compute-batches.mjs with `--changed-files=<path>`, where `<path>`
|
||
is a temp file listing changed file paths (one per line). The script
|
||
reuses the full project's import graph for neighborMap computation
|
||
but only emits batches containing changed files. Dispatch file-analyzer
|
||
subagents per the same template as the full path.
|
||
```
|
||
|
||
### Line budget
|
||
|
||
Net added LLM-context prose: Phase 1.5 (~12 lines) + Phase 2 template clarifications (~5 lines) − removed batching prose (~15 lines) − removed batchImportData construction prose (~6 lines) ≈ **−4 lines**.
|
||
|
||
---
|
||
|
||
## Component 3 — `agents/file-analyzer.md` changes
|
||
|
||
### Add — Cross-batch context section
|
||
|
||
Insert after "Step 1: Input file construction":
|
||
|
||
```markdown
|
||
### Cross-batch context (neighborMap)
|
||
|
||
Your dispatch prompt includes a `neighborMap` — for each file in your
|
||
batch, it lists project-internal neighbors in OTHER batches (files that
|
||
import yours or that you import), with their exported symbols.
|
||
|
||
Use neighborMap as a confidence boost for cross-batch edges (`calls`,
|
||
`related`, `inherits`, `implements` to nodes outside your batch):
|
||
|
||
- If your source clearly references a symbol that appears in some
|
||
`neighbor.symbols`, emit the edge to
|
||
`function:<neighbor.path>:<symbol>` or
|
||
`class:<neighbor.path>:<symbol>` with confidence.
|
||
- If your source references a cross-batch symbol that is NOT in
|
||
neighborMap (the project-scanner may not have extracted it), you may
|
||
still emit the edge if you saw it explicitly in the imported file's
|
||
surface — but prefer matching neighborMap symbols when available.
|
||
- Imports continue to use `batchImportData` (fully resolved), not
|
||
neighborMap.
|
||
|
||
The merge script's dangling-edge dropper is the safety net for
|
||
genuinely unresolvable targets.
|
||
```
|
||
|
||
### Replace — Writing Results section (current file-analyzer.md:467-475)
|
||
|
||
```markdown
|
||
## Writing Results — single or multi-part
|
||
|
||
**Step A — Compute totals.**
|
||
\`\`\`
|
||
nodeCount = nodes.length
|
||
edgeCount = edges.length
|
||
\`\`\`
|
||
|
||
**Step B — Decide split.**
|
||
- If `nodeCount ≤ 60` AND `edgeCount ≤ 120`: write ONE file to
|
||
`.understand-anything/intermediate/batch-<batchIndex>.json`. Done.
|
||
Skip to Step E.
|
||
- Otherwise: `parts = ceil(max(nodeCount / 60, edgeCount / 120))`.
|
||
|
||
**Step C — Partition.**
|
||
Sort files in your batch alphabetically by path. Chunk them sequentially
|
||
into `parts` groups of size `ceil(N / parts)`. For each part:
|
||
- All nodes whose `filePath` is in this part's files (for non-file
|
||
nodes like `module`/`concept`, use the file they belong to).
|
||
- All edges whose `source` is in this part's nodes (target may be
|
||
anywhere — same part, different part of same batch, different batch).
|
||
|
||
**Step D — Write each part.**
|
||
Write part `k` (1-indexed) to
|
||
`.understand-anything/intermediate/batch-<batchIndex>-part-<k>.json`.
|
||
Each part is a valid GraphFragment: `{ "nodes": [...], "edges": [...] }`.
|
||
|
||
**Step E — Self-validate.**
|
||
For each file written, verify:
|
||
- Valid JSON.
|
||
- `nodes` array exists and is well-formed.
|
||
- For every edge: `source` and `target` both appear as either (a) a
|
||
node `id` in this part's nodes, OR (b) a `file:<path>` reference
|
||
where `<path>` is in `neighborMap` or `batchImportData`, OR (c) a
|
||
`function:<path>:<symbol>` / `class:<path>:<symbol>` reference where
|
||
`<symbol>` is in some `neighbor.symbols`.
|
||
|
||
If validation fails on a part, do NOT silently rebuild. Respond with
|
||
an explicit error stating which part failed, which edge(s) failed
|
||
validation, and why. The dispatching session can then retry.
|
||
|
||
**Step F — Respond.**
|
||
Respond with ONLY a brief text summary: parts written (1 or more),
|
||
total nodes/edges across all parts, any files skipped. Do NOT include
|
||
JSON content in the response.
|
||
```
|
||
|
||
### Threshold rationale
|
||
|
||
`60 nodes / 120 edges per part` derives from:
|
||
|
||
- File node JSON serialized ≈ 150-300 chars; function/class ≈ 80-150 chars; edge ≈ 100-150 chars.
|
||
- 60 nodes + 120 edges ≈ 25-35KB JSON ≈ 7000-9000 output tokens (JSON tokenization is dense).
|
||
- Bedrock OPUS default `max_tokens` 4096-8192 → ~10% safety margin.
|
||
|
||
These constants live as file-analyzer.md prose for now. Auto-tuning per provider is deferred to follow-up.
|
||
|
||
---
|
||
|
||
## Component 4 — `merge-batch-graphs.py` (verify-only)
|
||
|
||
### Confirmed compatibility
|
||
|
||
The existing glob and sort-key already handle multi-part files transparently:
|
||
|
||
- `intermediate_dir.glob("batch-*.json")` matches `batch-3-part-1.json`.
|
||
- `re.search(r"batch-(\d+)", p.stem)` extracts `3` from `batch-3-part-1`, giving the same sort key as `batch-3.json`. Python `sorted` is stable, so parts load in lexicographic tie-break order.
|
||
- `merge_and_normalize` walks `all_nodes.extend(...)` / `all_edges.extend(...)`; load order does not affect dedup correctness.
|
||
- `recover_imports_from_scan` operates on the merged graph — transparent to multi-part inputs.
|
||
- `link_tests` operates on the merged node pool — transparent.
|
||
|
||
No code change required for correctness.
|
||
|
||
### Add — Multi-part awareness in stderr report
|
||
|
||
`merge-batch-graphs.py:1026` currently prints `Found {N} batch files:`. Enhance:
|
||
|
||
```python
|
||
from collections import defaultdict
|
||
by_batch = defaultdict(list)
|
||
for f in batch_files:
|
||
m = re.match(r"batch-(\d+)(?:-part-(\d+))?\.json", f.name)
|
||
if m:
|
||
by_batch[int(m.group(1))].append(f.name)
|
||
|
||
logical_count = len(by_batch)
|
||
multi_part = sum(1 for files in by_batch.values() if len(files) > 1)
|
||
print(
|
||
f"Found {len(batch_files)} batch files "
|
||
f"({logical_count} logical batches, {multi_part} multi-part)",
|
||
file=sys.stderr,
|
||
)
|
||
```
|
||
|
||
### Add — Missing-part warning
|
||
|
||
After grouping, detect logical batches with non-contiguous part numbers (e.g. parts `{2, 3}` present but `1` missing) and emit:
|
||
|
||
```
|
||
Warning: merge: batch <i> has parts {<set>} but missing part {<missing>}
|
||
— possible truncated write — affected nodes/edges may be lost
|
||
```
|
||
|
||
---
|
||
|
||
## Failure modes & observability
|
||
|
||
| Failure point | Behavior | Safety net | Required warning text |
|
||
|---|---|---|---|
|
||
| Louvain library throws | exception | Script-internal: catch → count-based fallback (12 files/batch); neighborMap still built | `Warning: compute-batches: Louvain failed (<msg>) — falling back to count-based grouping (12 files/batch) — module semantic boundaries lost` |
|
||
| tree-sitter exports per-file failure | empty exports | symbols=[] in neighborMap | `Warning: compute-batches: exports extraction failed for <path> (<msg>) — symbols=[] in neighborMap — cross-batch edges to this file limited to file-level` |
|
||
| Louvain produces oversized community | size > 35 | Edge-betweenness split | `Warning: compute-batches: community size <N> > max 35 — splitting via edge-betweenness — modularity may decrease` |
|
||
| compute-batches complete crash | exit non-zero, no batches.json | SKILL.md surfaces full stderr to user; no Phase 2 fallback | (script's own error to stderr; SKILL.md relays verbatim) |
|
||
| neighborMap truncation | > 50 neighbors | Top-50 by degree kept | `Warning: compute-batches: neighborMap for <path> truncated from <N> to top 50 (by neighbor degree)` |
|
||
| file-analyzer part JSON malformed | `load_batch` skips | Existing `load_batch:139` warns and skips | (existing — verify the warning is not swallowed) |
|
||
| Missing part in multi-part batch | gap in parts | merge detects and warns | `Warning: merge: batch <i> has parts {<set>} but missing part {<missing>} — possible truncated write — affected nodes/edges may be lost` |
|
||
| file-analyzer dangling edges | source/target missing | merge drops, adds to `unfixable` (existing) | (existing) |
|
||
| file-analyzer dispatch fails | subagent error | existing retry-once mechanism | (existing) |
|
||
|
||
### Observability invariant
|
||
|
||
Every fallback / degrade / drop MUST:
|
||
|
||
1. Write a stderr line in `Warning: <component>: <what happened> — <why> — <impact>` format.
|
||
2. Bubble up to `$PHASE_WARNINGS` (SKILL.md existing mechanism) → user-facing Phase 7 final report.
|
||
3. Never use silent `catch {}` / `except: pass`. Code review treats this as a blocker.
|
||
|
||
### Invariants
|
||
|
||
1. **scan-result.json is source of truth.** Any batching/topology change preserves importMap; `recover_imports_from_scan` always restores `imports` edges.
|
||
2. **Dangling-edge dropper is final defense.** No batch-generated edge can connect to a nonexistent node in the assembled graph.
|
||
3. **No silent fallback.** `batches.json` missing → loud failure. Internal compute-batches fallback → loud warning that bubbles to user.
|
||
|
||
---
|
||
|
||
## Testing
|
||
|
||
### Unit tests — `compute-batches.mjs`
|
||
|
||
New file: `understand-anything-plugin/skills/understand/test_compute_batches.test.mjs` (Vitest).
|
||
|
||
Required cases:
|
||
|
||
- **Louvain basic:** 3 disjoint cliques → 3 batches.
|
||
- **Empty importMap:** independent files → count-fallback batches by alphabetical chunking.
|
||
- **Oversized community:** 50-node complete graph → split triggered, all sub-batches ≤ 35.
|
||
- **Non-code grouping A:** `Dockerfile` + `docker-compose.yml` + `.dockerignore` siblings → one batch per directory cluster.
|
||
- **Non-code grouping B:** `.github/workflows/*.yml` → one batch.
|
||
- **Non-code grouping C:** SQL migrations under `migrations/` → one batch per directory.
|
||
- **Mixed code + non-code:** non-code batchIndex follows code batches.
|
||
- **neighborMap correctness:** file A imports file B across batches → `neighborMap[A]` contains `{path: B, batchIndex: B's, symbols: B's exports}`.
|
||
- **neighborMap excludes same-batch:** A and C in same batch → `neighborMap[A]` does not contain C.
|
||
- **Exports failure tolerance:** mock TreeSitter to throw on one file → `exports = []` for that file, others unaffected.
|
||
- **`--changed-files`:** input subset → output contains only batches with changed files; neighborMap may reference unchanged files.
|
||
- **Fallback triggers:** mock Louvain throw → `algorithm` field = `"count-fallback"`, warning in stderr.
|
||
- **Warning assertion per fallback:** for each of {Louvain crash, exports failure, oversize split, neighborMap truncation}, assert the exact warning string appears in stderr.
|
||
|
||
### Unit tests — `merge-batch-graphs.py`
|
||
|
||
New test class `TestMultiPart` in `test_merge_batch_graphs.py`:
|
||
|
||
- Two parts of one logical batch: `batch-1-part-1.json` + `batch-1-part-2.json` → assembled contains all nodes/edges from both.
|
||
- Three parts of one logical batch.
|
||
- Cross-part edges: edge with source in part-1, target node in part-2 → connected after merge.
|
||
- Malformed part-1 + valid part-2: part-1 skipped with warning, part-2 contents present.
|
||
- Mixed single-batch and multi-part inputs.
|
||
- Missing part detection: `batch-1-part-2.json` + `batch-1-part-3.json` (no part-1) → warning emitted with exact text.
|
||
- stderr format: assert `"X logical batches, Y multi-part"` appears.
|
||
|
||
### Integration — PR acceptance gate (manual)
|
||
|
||
Documented in the PR's Test plan:
|
||
|
||
- [ ] `pnpm install` (graphology installs cleanly).
|
||
- [ ] `pnpm --filter @understand-anything/core build`.
|
||
- [ ] Run `/understand --full` on this repo (Understand-Anything itself):
|
||
- `batches.json` generated; community size distribution sanity-check (mix of small and medium batches).
|
||
- At least one batch produces multi-part output.
|
||
- `assembled-graph.json` node/edge counts within expected range vs current main.
|
||
- Dashboard renders normally.
|
||
- Phase 7 final report includes any `$PHASE_WARNINGS` from compute-batches (visually verify warnings reach user-facing output, not just stderr).
|
||
- [ ] Run on a ~100-file repo matching ayushghosh's scenario; confirm no "output limit" errors.
|
||
- [ ] Run on a 5-10 file small repo: fallback path (all one batch) works correctly.
|
||
|
||
### Not tested
|
||
|
||
- Louvain algorithm correctness (trust `graphology-communities-louvain`'s own tests).
|
||
- Performance benchmarks (sub-second on 100-500 files is empirical; not gated).
|
||
- Multiple LLM provider output-cap variations (thresholds are conservative for Bedrock OPUS; first-party Anthropic is more permissive).
|
||
|
||
---
|
||
|
||
## Out of scope (tracked for follow-up)
|
||
|
||
### Tree-sitter deduplication
|
||
|
||
Currently Phase 1 (project-scanner), Phase 1.5 (compute-batches), and Phase 2 (file-analyzer per-batch) each run tree-sitter independently. Consolidating into a single Phase 1.5 structure extraction would simplify file-analyzer and save time on large projects. Defer because it requires reorganizing file-analyzer's protocol significantly.
|
||
|
||
### neighborMap LLM summaries
|
||
|
||
Adding one-sentence summaries per file to neighborMap would enable file-analyzer to emit `related` edges across batches with semantic justification. Requires a new lightweight summary-pass agent; defer until the tree-sitter dedup lands (Phase 1.5 will already have full structure → cheaper to add).
|
||
|
||
### Adaptive thresholds
|
||
|
||
`60 nodes / 120 edges` are conservative for Bedrock OPUS. Anthropic first-party supports much larger output caps. Adding a `--output-cap=<N>` CLI to compute-batches and propagating to file-analyzer would unlock larger parts on permissive backends. Track real-world part counts before implementing.
|
||
|
||
### Cross-batch edge audit
|
||
|
||
A post-merge audit comparing neighborMap-suggested edges vs actually-emitted edges would surface gaps. Mirror the existing `recover_imports_from_scan` pattern. Requires preserving `batches.json` for merge-time consumption.
|
||
|
||
### Multi-language monorepo handling
|
||
|
||
Multi-language repos (TS + Python) tend to naturally split via Louvain (no cross-language imports). Bridge files (OpenAPI, protobuf) might create odd communities. Address only if real reports surface.
|
||
|
||
---
|
||
|
||
## Implementation order
|
||
|
||
1. **Prototype:** minimal `compute-batches.mjs` skeleton — load scan-result.json, run Louvain, print community sizes. Run against this repo's `scan-result.json` (generate if missing via `/understand --full`). Decide whether size-enforcement branch is needed; if needed, choose between edge-betweenness and weakly-connected-component split.
|
||
2. Add exports extraction (reuse TreeSitterPlugin).
|
||
3. Add neighborMap construction + batchImportData passthrough.
|
||
4. Add non-code grouping heuristics (Groups A-E).
|
||
5. Add fallback path + warning emissions for every failure mode listed in the Failure modes table.
|
||
6. Write unit tests for compute-batches (per Testing section), including warning-text assertions.
|
||
7. Modify `agents/file-analyzer.md` — add Cross-batch context section, replace Writing Results.
|
||
8. Modify `skills/understand/SKILL.md` — add Phase 1.5, replace Phase 2 ANALYZE batching prose, replace incremental path.
|
||
9. Add multi-part stderr report + missing-part warning to `merge-batch-graphs.py`.
|
||
10. Write unit tests for `merge-batch-graphs.py` multi-part handling.
|
||
11. Add `graphology` + `graphology-communities-louvain` to `understand-anything-plugin/package.json`.
|
||
12. Run integration acceptance gate.
|
||
13. Bump version in all five `package.json` / `plugin.json` files per the project's CLAUDE.md versioning rule.
|