Files

qiaoxinjiu e31a75d2bb Add under-anything knowledge dashboard

2026-05-27 15:40:32 +08:00

26 KiB

Raw Permalink Blame History

Semantic Batching and Output Chunking Design

Date: 2026-05-24 Status: Draft Branch: feat/semantic-batching-and-output-chunking Issue: #159 — Frequently seeing output limit exceeded

Problem

The /understand skill's Phase 2 dispatches file-analyzer subagents in batches of 20-30 files each (skills/understand/SKILL.md:282). Two issues compound on output-constrained LLM backends (notably Bedrock OPUS with default max_tokens of 4096-8192):

Output cap pressure. Each file-analyzer writes one batch-<N>.json containing all nodes (file + functions + classes) and edges for its batch. For 25 dense files the JSON content easily exceeds the per-turn Write(content=...) token budget. The agent improvises by entering an undefined "minimal output mode" and drops nodes/edges silently. Issue #159 reports this for OPUS on Bedrock at the 100-file scale.
Count-based batching breaks module semantics. Files are batched by count, not by logical relationship. Files that import each other (and would together form an auth module, an api module, etc.) get split across batches. The file-analyzer only sees within-batch edges confidently; calls/related/inherits/implements edges between modules get dropped at batch boundaries.

The existing recover_imports_from_scan in merge-batch-graphs.py:913 is a deterministic safety net for imports edges — but it cannot recover semantic edges (calls / related / inherits / implements). Those are lost.

Goals

Eliminate "Batch X failed (output limit)" from /understand runs on Bedrock OPUS for projects up to 500 files.
Improve cross-batch semantic edge coverage by replacing count-based batching with Louvain community detection on the import graph.
Maintain imports edge coverage parity (no regression on existing safety net).
Stay within one PR — defer broader refactors to follow-ups (Section "Out of scope").

Non-goals

Refactoring Phase 1 / 2 tree-sitter usage to deduplicate per-batch extraction.
Adding LLM-generated file summaries to neighborMap.
Auto-tuning output thresholds per provider.

Architecture

Pipeline before:

Phase 1   project-scanner          → scan-result.json (files + importMap)
Phase 2   file-analyzer (×N concur) → batch-<i>.json (one per batch; SKILL.md prose batching)
Phase 2末 merge-batch-graphs.py    → assembled-graph.json

Pipeline after:

Phase 1   project-scanner          → scan-result.json (unchanged)
Phase 1.5 compute-batches.mjs      → batches.json (NEW — semantic batching + neighborMap)
Phase 2   file-analyzer (×N concur) → batch-<i>.json (single) OR batch-<i>-part-<k>.json (split)
Phase 2末 merge-batch-graphs.py    → assembled-graph.json (verified, no code change)

Phase 1.5 single responsibility: topology decision + neighborMap construction. Pure algorithm — reads scan-result.json, writes batches.json, no LLM calls.

Phase 2 changes: SKILL.md stops doing prose batching; iterates batches.json and dispatches one file-analyzer per batch.

file-analyzer changes: consumes neighborMap; self-checks output size before writing; splits into batch-<i>-part-<k>.json when above thresholds.

merge-batch-graphs.py: no code changes — the batch-*.json glob and sort-key regex already accept multi-part naming. Test fixture and stderr report enhancement added.

Component 1 — `compute-batches.mjs`

Location: understand-anything-plugin/skills/understand/compute-batches.mjs

Invocation: node <SKILL_DIR>/compute-batches.mjs $PROJECT_ROOT [--changed-files=<path>]

Input: $PROJECT_ROOT/.understand-anything/intermediate/scan-result.json

Output: $PROJECT_ROOT/.understand-anything/intermediate/batches.json

Dependencies

Added to understand-anything-plugin/package.json:

graphology (~10KB)
graphology-communities-louvain (~30KB)

Reuses @understand-anything/core's TreeSitterPlugin and PluginRegistry (already imported by extract-structure.mjs).

Algorithm

1. Load scan-result.json.

2. Partition files by fileCategory:
   - codeFiles = files where fileCategory === "code"
   - nonCodeFiles = the rest

3. Code batching (Louvain on import graph):
   a. Build undirected graph: nodes = codeFiles, edges = importMap relations
      (weight=1, undirected so import and imported-by both count).
   b. Run graphology-communities-louvain → community assignment per file.
   c. For any community with size > 35 (max): split via edge-betweenness greedy
      cut (or simpler weakly-connected-component partition) until each
      sub-community ≤ 35. Log warning per split.
      (Whether this branch fires is decided by the implementation prototype
      step — see "Prototype-first implementation" below.)
   d. Communities with size < 5 are kept as-is. Wasted dispatches are
      bounded by the 5-concurrent cap, and the alternative ("merge small")
      adds edge cases without proportional value.

4. Non-code batching (hardcoded heuristics, moved from SKILL.md prose):
   - Group A: For each directory containing a `Dockerfile`, bundle that
     directory's `Dockerfile` + any `docker-compose.*` + any
     `.dockerignore` → one batch per such directory (so multi-service
     repos with several Dockerfiles get one batch per service).
   - Group B: `.github/workflows/*.yml` files → one batch.
   - Group C: `.gitlab-ci.yml` + files under `.circleci/` → one batch.
   - Group D: SQL files under any `migrations/` or `migration/` directory,
     sorted by filename → one batch per directory.
   - Group E: All other non-code files grouped by their immediate parent
     directory, max 20 per batch.

5. Assign batchIndex: code communities first (1..N), non-code groups
   second (N+1..M).

6. Exports extraction:
   - For each code file, run TreeSitterPlugin.extract() and collect
     top-level exports (function names, class names, exported const names).
   - Per-file failures: catch, set exports = [], emit warning.
   - Non-code files: exports = [].

7. Construct neighborMap (1-hop):
   For each file F in batch B:
     neighborMap[F.path] = [
       { path: G.path, batchIndex: G.batch, symbols: G.exports }
       for G in importMap[F.path] ∪ reverseImportMap[F.path]
       where G.batch ≠ B
     ]
   If neighborMap[F.path].length > 50, truncate to top 50 by neighbor
   degree (highest-imported neighbors kept), emit warning.

8. Construct batchImportData:
   For each batch B:
     batchImportData[F.path] = importMap[F.path]  for F in B.files

9. Write batches.json.

Fallback (script-internal): If steps 3a-3c throw, catch → emit warning
→ assign batches by alphabetical chunking (12 files per code batch).
Steps 4, 6, 7, 8 still run normally. Set `algorithm: "count-fallback"`
in the output.

Louvain implementation

Use graphology-communities-louvain's default modularity-greedy algorithm:

import Graph from 'graphology';
import louvain from 'graphology-communities-louvain';

const graph = new Graph({ type: 'undirected' });
for (const file of codeFiles) graph.addNode(file.path);
for (const [src, targets] of Object.entries(importMap)) {
  for (const tgt of targets) {
    if (graph.hasNode(src) && graph.hasNode(tgt) && !graph.hasEdge(src, tgt)) {
      graph.addEdge(src, tgt);
    }
  }
}
const communities = louvain(graph); // { nodeId: communityId }

Output schema (`batches.json`)

{
  "schemaVersion": 1,
  "algorithm": "louvain",
  "totalFiles": 100,
  "totalBatches": 7,
  "batches": [
    {
      "batchIndex": 1,
      "files": [
        { "path": "src/auth/login.ts", "language": "typescript",
          "sizeLines": 120, "fileCategory": "code" }
      ],
      "batchImportData": {
        "src/auth/login.ts": ["src/auth/session.ts", "src/db/users.ts"]
      },
      "neighborMap": {
        "src/auth/login.ts": [
          { "path": "src/db/users.ts", "batchIndex": 3,
            "symbols": ["User", "findById", "createUser"] }
        ]
      }
    }
  ]
}

algorithm is "louvain" on the happy path, "count-fallback" when the Louvain branch crashed.

`--changed-files` mode

When invoked with --changed-files=<path>, the script:

Loads file paths from <path> (one per line).
Still builds the full project import graph (for accurate neighborMap construction).
Only emits batches containing changed files.
neighborMap entries reference unchanged files with their batchIndex from the deterministic full-graph Louvain re-run. The seed is fixed so the assignment is reproducible across incremental invocations.

Prototype-first implementation

Before writing the full script, build a minimal skeleton:

Load scan-result.json from this repo's .understand-anything/ directory (if absent, generate via /understand --full).
Run Louvain only — no size enforcement, no neighborMap.
Print community size distribution.
Decide: do real-world communities cluster in [5, 35]? If yes, size enforcement branch may be unnecessary or trivially defensive. If no, implement edge-betweenness split.

This gates the more speculative code (size enforcement) on empirical observation rather than upfront design.

Component 2 — `skills/understand/SKILL.md` changes

Add — Phase 1.5 section (after Phase 1)

## Phase 1.5 — BATCH

Report: `[Phase 1.5/7] Computing semantic batches...`

Run the bundled batching script:
\`\`\`bash
node <SKILL_DIR>/compute-batches.mjs $PROJECT_ROOT
\`\`\`

Reads `.understand-anything/intermediate/scan-result.json`, writes
`.understand-anything/intermediate/batches.json`.

Capture stderr. Append any line starting with `Warning:` to
$PHASE_WARNINGS for the final report.

If the script exits non-zero, the failure is hard — relay the full
stderr to the user as a Phase 1.5 failure. Do not attempt to recover;
the script's internal fallback (count-based) already handles recoverable
issues. A non-zero exit means a fundamental problem (missing input file,
malformed JSON, etc.).

Replace — Phase 2 ANALYZE section (current SKILL.md:280-332)

Delete the existing "Batch the file list from Phase 1 into groups of 20-30 files each" prose, the non-code grouping prose (now in compute-batches), and the dispatch-time batchImportData construction prose (now provided in batches.json). Replace with:

## Phase 2 — ANALYZE

### Full analysis path

Load `.understand-anything/intermediate/batches.json` (produced by
Phase 1.5). Iterate the `batches[]` array.

Report: `[Phase 2/7] Analyzing files — <totalFiles> files in
<totalBatches> batches (up to 5 concurrent)...`

For each batch, dispatch a `file-analyzer` subagent (up to 5
concurrent). Dispatch prompt template:

> Analyze these files and produce GraphNode and GraphEdge objects.
> Project root: `$PROJECT_ROOT`
> Project: `<projectName>`
> Languages: `<languages>`
> Batch: `<batchIndex>/<totalBatches>`
> Skill directory: `<SKILL_DIR>`
> Output: write to
> `$PROJECT_ROOT/.understand-anything/intermediate/batch-<batchIndex>.json`
> (single-file mode) OR `batch-<batchIndex>-part-<k>.json` (split mode,
> per Step B of your output protocol).
>
> Pre-resolved import data (use directly — do NOT re-resolve from source):
> \`\`\`json
> <batchImportData JSON inline from batches.json[i].batchImportData>
> \`\`\`
>
> Cross-batch neighbors with their exported symbols (confidence boost
> for cross-batch edges):
> \`\`\`json
> <neighborMap JSON inline from batches.json[i].neighborMap>
> \`\`\`
>
> Files to analyze:
> 1. `<path>` (<sizeLines> lines, language: `<language>`,
>    fileCategory: `<fileCategory>`)
> ...

$LANGUAGE_DIRECTIVE

After ALL batches complete, run the merge-and-normalize script:
\`\`\`bash
python <SKILL_DIR>/merge-batch-graphs.py $PROJECT_ROOT
\`\`\`

(Rest of Phase 2 unchanged.)

Replace — Incremental update path (current SKILL.md:355-366)

### Incremental update path

Run compute-batches.mjs with `--changed-files=<path>`, where `<path>`
is a temp file listing changed file paths (one per line). The script
reuses the full project's import graph for neighborMap computation
but only emits batches containing changed files. Dispatch file-analyzer
subagents per the same template as the full path.

Line budget

Net added LLM-context prose: Phase 1.5 (~12 lines) + Phase 2 template clarifications (~5 lines) − removed batching prose (~15 lines) − removed batchImportData construction prose (~6 lines) ≈ −4 lines.

Component 3 — `agents/file-analyzer.md` changes

Add — Cross-batch context section

Insert after "Step 1: Input file construction":

### Cross-batch context (neighborMap)

Your dispatch prompt includes a `neighborMap` — for each file in your
batch, it lists project-internal neighbors in OTHER batches (files that
import yours or that you import), with their exported symbols.

Use neighborMap as a confidence boost for cross-batch edges (`calls`,
`related`, `inherits`, `implements` to nodes outside your batch):

- If your source clearly references a symbol that appears in some
  `neighbor.symbols`, emit the edge to
  `function:<neighbor.path>:<symbol>` or
  `class:<neighbor.path>:<symbol>` with confidence.
- If your source references a cross-batch symbol that is NOT in
  neighborMap (the project-scanner may not have extracted it), you may
  still emit the edge if you saw it explicitly in the imported file's
  surface — but prefer matching neighborMap symbols when available.
- Imports continue to use `batchImportData` (fully resolved), not
  neighborMap.

The merge script's dangling-edge dropper is the safety net for
genuinely unresolvable targets.

Replace — Writing Results section (current file-analyzer.md:467-475)

## Writing Results — single or multi-part

**Step A — Compute totals.**
\`\`\`
nodeCount = nodes.length
edgeCount = edges.length
\`\`\`

**Step B — Decide split.**
- If `nodeCount ≤ 60` AND `edgeCount ≤ 120`: write ONE file to
  `.understand-anything/intermediate/batch-<batchIndex>.json`. Done.
  Skip to Step E.
- Otherwise: `parts = ceil(max(nodeCount / 60, edgeCount / 120))`.

**Step C — Partition.**
Sort files in your batch alphabetically by path. Chunk them sequentially
into `parts` groups of size `ceil(N / parts)`. For each part:
- All nodes whose `filePath` is in this part's files (for non-file
  nodes like `module`/`concept`, use the file they belong to).
- All edges whose `source` is in this part's nodes (target may be
  anywhere — same part, different part of same batch, different batch).

**Step D — Write each part.**
Write part `k` (1-indexed) to
`.understand-anything/intermediate/batch-<batchIndex>-part-<k>.json`.
Each part is a valid GraphFragment: `{ "nodes": [...], "edges": [...] }`.

**Step E — Self-validate.**
For each file written, verify:
- Valid JSON.
- `nodes` array exists and is well-formed.
- For every edge: `source` and `target` both appear as either (a) a
  node `id` in this part's nodes, OR (b) a `file:<path>` reference
  where `<path>` is in `neighborMap` or `batchImportData`, OR (c) a
  `function:<path>:<symbol>` / `class:<path>:<symbol>` reference where
  `<symbol>` is in some `neighbor.symbols`.

If validation fails on a part, do NOT silently rebuild. Respond with
an explicit error stating which part failed, which edge(s) failed
validation, and why. The dispatching session can then retry.

**Step F — Respond.**
Respond with ONLY a brief text summary: parts written (1 or more),
total nodes/edges across all parts, any files skipped. Do NOT include
JSON content in the response.

Threshold rationale

60 nodes / 120 edges per part derives from:

File node JSON serialized ≈ 150-300 chars; function/class ≈ 80-150 chars; edge ≈ 100-150 chars.
60 nodes + 120 edges ≈ 25-35KB JSON ≈ 7000-9000 output tokens (JSON tokenization is dense).
Bedrock OPUS default max_tokens 4096-8192 → ~10% safety margin.

These constants live as file-analyzer.md prose for now. Auto-tuning per provider is deferred to follow-up.

Component 4 — `merge-batch-graphs.py` (verify-only)

Confirmed compatibility

The existing glob and sort-key already handle multi-part files transparently:

intermediate_dir.glob("batch-*.json") matches batch-3-part-1.json.
re.search(r"batch-(\d+)", p.stem) extracts 3 from batch-3-part-1, giving the same sort key as batch-3.json. Python sorted is stable, so parts load in lexicographic tie-break order.
merge_and_normalize walks all_nodes.extend(...) / all_edges.extend(...); load order does not affect dedup correctness.
recover_imports_from_scan operates on the merged graph — transparent to multi-part inputs.
link_tests operates on the merged node pool — transparent.

No code change required for correctness.

Add — Multi-part awareness in stderr report

merge-batch-graphs.py:1026 currently prints Found {N} batch files:. Enhance:

from collections import defaultdict
by_batch = defaultdict(list)
for f in batch_files:
    m = re.match(r"batch-(\d+)(?:-part-(\d+))?\.json", f.name)
    if m:
        by_batch[int(m.group(1))].append(f.name)

logical_count = len(by_batch)
multi_part = sum(1 for files in by_batch.values() if len(files) > 1)
print(
    f"Found {len(batch_files)} batch files "
    f"({logical_count} logical batches, {multi_part} multi-part)",
    file=sys.stderr,
)

Add — Missing-part warning

After grouping, detect logical batches with non-contiguous part numbers (e.g. parts {2, 3} present but 1 missing) and emit:

Warning: merge: batch <i> has parts {<set>} but missing part {<missing>}
  — possible truncated write — affected nodes/edges may be lost

Failure modes & observability

Failure point	Behavior	Safety net	Required warning text
Louvain library throws	exception	Script-internal: catch → count-based fallback (12 files/batch); neighborMap still built	`Warning: compute-batches: Louvain failed (<msg>) — falling back to count-based grouping (12 files/batch) — module semantic boundaries lost`
tree-sitter exports per-file failure	empty exports	symbols=[] in neighborMap	`Warning: compute-batches: exports extraction failed for <path> (<msg>) — symbols=[] in neighborMap — cross-batch edges to this file limited to file-level`
Louvain produces oversized community	size > 35	Edge-betweenness split	`Warning: compute-batches: community size <N> > max 35 — splitting via edge-betweenness — modularity may decrease`
compute-batches complete crash	exit non-zero, no batches.json	SKILL.md surfaces full stderr to user; no Phase 2 fallback	(script's own error to stderr; SKILL.md relays verbatim)
neighborMap truncation	> 50 neighbors	Top-50 by degree kept	`Warning: compute-batches: neighborMap for <path> truncated from <N> to top 50 (by neighbor degree)`
file-analyzer part JSON malformed	`load_batch` skips	Existing `load_batch:139` warns and skips	(existing — verify the warning is not swallowed)
Missing part in multi-part batch	gap in parts	merge detects and warns	`Warning: merge: batch <i> has parts {<set>} but missing part {<missing>} — possible truncated write — affected nodes/edges may be lost`
file-analyzer dangling edges	source/target missing	merge drops, adds to `unfixable` (existing)	(existing)
file-analyzer dispatch fails	subagent error	existing retry-once mechanism	(existing)

Observability invariant

Every fallback / degrade / drop MUST:

Write a stderr line in Warning: <component>: <what happened> — <why> — <impact> format.
Bubble up to $PHASE_WARNINGS (SKILL.md existing mechanism) → user-facing Phase 7 final report.
Never use silent catch {} / except: pass. Code review treats this as a blocker.

Invariants

scan-result.json is source of truth. Any batching/topology change preserves importMap; recover_imports_from_scan always restores imports edges.
Dangling-edge dropper is final defense. No batch-generated edge can connect to a nonexistent node in the assembled graph.
No silent fallback. batches.json missing → loud failure. Internal compute-batches fallback → loud warning that bubbles to user.

Testing

Unit tests — `compute-batches.mjs`

New file: understand-anything-plugin/skills/understand/test_compute_batches.test.mjs (Vitest).

Required cases:

Louvain basic: 3 disjoint cliques → 3 batches.
Empty importMap: independent files → count-fallback batches by alphabetical chunking.
Oversized community: 50-node complete graph → split triggered, all sub-batches ≤ 35.
Non-code grouping A: Dockerfile + docker-compose.yml + .dockerignore siblings → one batch per directory cluster.
Non-code grouping B: .github/workflows/*.yml → one batch.
Non-code grouping C: SQL migrations under migrations/ → one batch per directory.
Mixed code + non-code: non-code batchIndex follows code batches.
neighborMap correctness: file A imports file B across batches → neighborMap[A] contains {path: B, batchIndex: B's, symbols: B's exports}.
neighborMap excludes same-batch: A and C in same batch → neighborMap[A] does not contain C.
Exports failure tolerance: mock TreeSitter to throw on one file → exports = [] for that file, others unaffected.
--changed-files: input subset → output contains only batches with changed files; neighborMap may reference unchanged files.
Fallback triggers: mock Louvain throw → algorithm field = "count-fallback", warning in stderr.
Warning assertion per fallback: for each of {Louvain crash, exports failure, oversize split, neighborMap truncation}, assert the exact warning string appears in stderr.

Unit tests — `merge-batch-graphs.py`

New test class TestMultiPart in test_merge_batch_graphs.py:

Two parts of one logical batch: batch-1-part-1.json + batch-1-part-2.json → assembled contains all nodes/edges from both.
Three parts of one logical batch.
Cross-part edges: edge with source in part-1, target node in part-2 → connected after merge.
Malformed part-1 + valid part-2: part-1 skipped with warning, part-2 contents present.
Mixed single-batch and multi-part inputs.
Missing part detection: batch-1-part-2.json + batch-1-part-3.json (no part-1) → warning emitted with exact text.
stderr format: assert "X logical batches, Y multi-part" appears.

Integration — PR acceptance gate (manual)

Documented in the PR's Test plan:

pnpm install (graphology installs cleanly).
pnpm --filter @understand-anything/core build.
Run /understand --full on this repo (Understand-Anything itself):
- batches.json generated; community size distribution sanity-check (mix of small and medium batches).
- At least one batch produces multi-part output.
- assembled-graph.json node/edge counts within expected range vs current main.
- Dashboard renders normally.
- Phase 7 final report includes any $PHASE_WARNINGS from compute-batches (visually verify warnings reach user-facing output, not just stderr).
Run on a ~100-file repo matching ayushghosh's scenario; confirm no "output limit" errors.
Run on a 5-10 file small repo: fallback path (all one batch) works correctly.

Not tested

Louvain algorithm correctness (trust graphology-communities-louvain's own tests).
Performance benchmarks (sub-second on 100-500 files is empirical; not gated).
Multiple LLM provider output-cap variations (thresholds are conservative for Bedrock OPUS; first-party Anthropic is more permissive).

Out of scope (tracked for follow-up)

Tree-sitter deduplication

Currently Phase 1 (project-scanner), Phase 1.5 (compute-batches), and Phase 2 (file-analyzer per-batch) each run tree-sitter independently. Consolidating into a single Phase 1.5 structure extraction would simplify file-analyzer and save time on large projects. Defer because it requires reorganizing file-analyzer's protocol significantly.

neighborMap LLM summaries

Adding one-sentence summaries per file to neighborMap would enable file-analyzer to emit related edges across batches with semantic justification. Requires a new lightweight summary-pass agent; defer until the tree-sitter dedup lands (Phase 1.5 will already have full structure → cheaper to add).

Adaptive thresholds

60 nodes / 120 edges are conservative for Bedrock OPUS. Anthropic first-party supports much larger output caps. Adding a --output-cap=<N> CLI to compute-batches and propagating to file-analyzer would unlock larger parts on permissive backends. Track real-world part counts before implementing.

Cross-batch edge audit

A post-merge audit comparing neighborMap-suggested edges vs actually-emitted edges would surface gaps. Mirror the existing recover_imports_from_scan pattern. Requires preserving batches.json for merge-time consumption.

Multi-language monorepo handling

Multi-language repos (TS + Python) tend to naturally split via Louvain (no cross-language imports). Bridge files (OpenAPI, protobuf) might create odd communities. Address only if real reports surface.

Implementation order

Prototype: minimal compute-batches.mjs skeleton — load scan-result.json, run Louvain, print community sizes. Run against this repo's scan-result.json (generate if missing via /understand --full). Decide whether size-enforcement branch is needed; if needed, choose between edge-betweenness and weakly-connected-component split.
Add exports extraction (reuse TreeSitterPlugin).
Add neighborMap construction + batchImportData passthrough.
Add non-code grouping heuristics (Groups A-E).
Add fallback path + warning emissions for every failure mode listed in the Failure modes table.
Write unit tests for compute-batches (per Testing section), including warning-text assertions.
Modify agents/file-analyzer.md — add Cross-batch context section, replace Writing Results.
Modify skills/understand/SKILL.md — add Phase 1.5, replace Phase 2 ANALYZE batching prose, replace incremental path.
Add multi-part stderr report + missing-part warning to merge-batch-graphs.py.
Write unit tests for merge-batch-graphs.py multi-part handling.
Add graphology + graphology-communities-louvain to understand-anything-plugin/package.json.
Run integration acceptance gate.
Bump version in all five package.json / plugin.json files per the project's CLAUDE.md versioning rule.

26 KiB Raw Permalink Blame History Unescape Escape