# Understand Anything: Universal File Type Support

**Date**: 2026-03-28
**Status**: Approved
**Approach**: Big Bang, all file types in one release

## Goals

1. Extend Understand Anything to analyze **any** file type, not just code
2. Support both holistic project enrichment (non-code files enrich code graphs) and standalone analysis (docs-only repos, SQL schema collections, IaC projects)
3. Maintain backward compatibility with existing code-only analysis

## Supported File Types (26 new)

### Documentation (3)

| Type | Extensions | Parser | Node Types |
|------|-----------|--------|------------|
| Markdown | `.md`, `.mdx` | LLM + regex heading extraction | `document` |
| reStructuredText | `.rst` | LLM | `document` |
| Plain text | `.txt` | LLM | `document` |

### Configuration (5)

| Type | Extensions | Parser | Node Types |
|------|-----------|--------|------------|
| YAML | `.yaml`, `.yml` | `yaml` npm package | `config` |
| JSON | `.json`, `.jsonc` | `JSON.parse` / `jsonc-parser` | `config`, `schema` |
| TOML | `.toml` | `@iarna/toml` or similar | `config` |
| .env | `.env`, `.env.*` | Regex line parser | `config` |
| XML | `.xml` | LLM (optionally `fast-xml-parser`) | `config` |

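The "Regex line parser" listed for `.env` files could look roughly like the sketch below. The `EnvEntry` shape and the `parseEnv` name are assumptions made for illustration; the spec only states that the parser is regex-based and yields `config` nodes.

```typescript
// Hypothetical result shape; not defined anywhere in the spec.
interface EnvEntry {
  key: string;
  value: string;
  line: number; // 1-based line number
  references: string[]; // names used as ${NAME} inside the value
}

// Parse KEY=VALUE lines, skipping blank lines, comments, and anything else.
function parseEnv(content: string): EnvEntry[] {
  const entries: EnvEntry[] = [];
  const linePattern = /^\s*(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*)$/;
  content.split(/\r?\n/).forEach((raw, index) => {
    if (/^\s*(#|$)/.test(raw)) return; // comment or blank line
    const match = linePattern.exec(raw);
    if (!match) return;
    const key = match[1] ?? "";
    // Strip one pair of matching surrounding quotes, if present.
    const value = (match[2] ?? "").trim().replace(/^(["'])(.*)\1$/, "$2");
    const references: string[] = [];
    const refPattern = /\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g;
    let ref: RegExpExecArray | null;
    while ((ref = refPattern.exec(value)) !== null) {
      references.push(ref[1] ?? "");
    }
    entries.push({ key, value, line: index + 1, references });
  });
  return entries;
}
```

Each entry could then become a `config` node, with `references` feeding edges between variables. Multi-line values and inline comments are deliberately out of scope here.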
### Infrastructure & DevOps (7)

| Type | Extensions | Parser | Node Types |
|------|-----------|--------|------------|
| Dockerfile | `Dockerfile`, `Dockerfile.*`, `.dockerfile` | Custom instruction parser | `service`, `pipeline` |
| Docker Compose | `docker-compose.yml`, `compose.yml` | YAML parser + service extraction | `service` |
| Terraform | `.tf`, `.tfvars` | Regex block parser | `resource` |
| Kubernetes | K8s YAML (detected by `apiVersion` field) | YAML + kind detection | `service`, `resource` |
| GitHub Actions | `.github/workflows/*.yml` | YAML + job/step extraction | `pipeline` |
| Jenkinsfile | `Jenkinsfile` | LLM (Groovy DSL) | `pipeline` |
| Makefile | `Makefile`, `*.mk` | Regex target parser | `pipeline` |

### Data & Schema (6)

| Type | Extensions | Parser | Node Types |
|------|-----------|--------|------------|
| SQL | `.sql` | Simple DDL parser | `table`, `endpoint` |
| GraphQL | `.graphql`, `.gql` | Regex type/query parser | `schema`, `endpoint` |
| OpenAPI/Swagger | `openapi.yaml`, `swagger.json` | YAML/JSON + path extraction | `endpoint`, `schema` |
| Protocol Buffers | `.proto` | Regex message/service parser | `schema` |
| JSON Schema | `*.schema.json` | JSON + `$ref`/`$defs` extraction | `schema` |
| CSV/TSV | `.csv`, `.tsv` | Header row extraction | `table` |

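To make the "Simple DDL parser" row concrete, here is a minimal sketch that pulls table and column names out of `CREATE TABLE` statements. `TableInfo`, `splitTopLevel`, and `extractTables` are hypothetical names, and the approach (a header regex plus parenthesis matching) is one plausible reading of "simple", not the confirmed design.

```typescript
// Hypothetical result shape; the spec only names the `table` node type.
interface TableInfo {
  name: string;
  columns: string[];
}

// Leading words that start a table constraint rather than a column definition.
const CONSTRAINT_KEYWORDS = ["PRIMARY", "FOREIGN", "UNIQUE", "CONSTRAINT", "CHECK", "KEY", "INDEX"];

// Split on commas that are not nested in parentheses, so DECIMAL(10, 2) stays whole.
function splitTopLevel(body: string): string[] {
  const parts: string[] = [];
  let depth = 0;
  let current = "";
  for (let i = 0; i < body.length; i++) {
    const ch = body.charAt(i);
    if (ch === "(") depth++;
    if (ch === ")") depth--;
    if (ch === "," && depth === 0) {
      parts.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  parts.push(current);
  return parts;
}

function extractTables(sql: string): TableInfo[] {
  const tables: TableInfo[] = [];
  const header = /CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?["`]?([\w.]+)["`]?\s*\(/gi;
  let match: RegExpExecArray | null;
  while ((match = header.exec(sql)) !== null) {
    // Walk forward from the opening parenthesis to its matching close.
    const start = header.lastIndex;
    let depth = 1;
    let end = start;
    while (end < sql.length && depth > 0) {
      const ch = sql.charAt(end);
      if (ch === "(") depth++;
      else if (ch === ")") depth--;
      end++;
    }
    const body = sql.slice(start, depth === 0 ? end - 1 : end);
    const columns = splitTopLevel(body)
      .map((part) => part.trim().split(/\s+/)[0] ?? "")
      .filter((word) => word !== "" && CONSTRAINT_KEYWORDS.indexOf(word.toUpperCase()) === -1)
      .map((word) => word.replace(/["`]/g, ""));
    tables.push({ name: match[1] ?? "", columns });
  }
  return tables;
}
```

Foreign keys, views, and indexes would need further patterns; comments and string literals containing parentheses are not handled in this sketch.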
### Shell & Scripts (3)

| Type | Extensions | Parser | Node Types |
|------|-----------|--------|------------|
| Shell | `.sh`, `.bash`, `.zsh` | Regex function parser | `file`, `function` |
| PowerShell | `.ps1`, `.psm1` | LLM | `file`, `function` |
| Batch | `.bat`, `.cmd` | LLM | `file` |

### Markup (2)

| Type | Extensions | Parser | Node Types |
|------|-----------|--------|------------|
| HTML | `.html`, `.htm` | LLM (tag structure) | `document` |
| CSS/SCSS/Less | `.css`, `.scss`, `.less` | LLM | `file` |

## Schema Extensions

### New Node Types (8)

Added to the existing node types (`file`, `function`, `class`, `module`, `concept`):

| Node Type | Purpose | Example |
|-----------|---------|---------|
| `config` | Configuration files and key settings | `package.json`, `tsconfig.json`, env vars |
| `document` | Documentation, prose, guides | `README.md`, API docs |
| `service` | Deployable services/containers | Docker containers, K8s Deployments |
| `table` | Data tables, database objects | SQL tables, CSV datasets |
| `endpoint` | API routes, queries, mutations | REST paths, GraphQL queries |
| `pipeline` | CI/CD workflows, build steps | GitHub Actions jobs, Makefile targets |
| `schema` | Type definitions for data interchange | Protobuf messages, JSON Schema |
| `resource` | Infrastructure resources | Terraform resources, K8s ConfigMaps |

### New Edge Types (8)

Added to the existing 18 edge types:

| Edge Type | Category | Meaning | Example |
|-----------|----------|---------|---------|
| `deploys` | Infrastructure | Service deploys code | Dockerfile -> app source |
| `serves` | Infrastructure | Service exposes endpoint | K8s Service -> API endpoint |
| `migrates` | Data flow | Migration modifies table | SQL migration -> table |
| `documents` | Semantic | Doc describes code | README -> module |
| `provisions` | Infrastructure | IaC creates resource | Terraform -> AWS resource |
| `routes` | Behavioral | Routes traffic to service | nginx config -> service |
| `defines_schema` | Data flow | Defines data shape | Protobuf -> endpoint |
| `triggers` | Behavioral | Triggers pipeline/action | Git push -> GitHub Actions |

### Schema Validation Auto-Fix Aliases

New node type aliases:
- `container` -> `service`, `migration` -> `table`, `workflow` -> `pipeline`
- `route` -> `endpoint`, `doc` -> `document`, `setting` -> `config`, `infra` -> `resource`

New edge type aliases:
- `describes` -> `documents`, `creates` -> `provisions`, `exposes` -> `serves`

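The alias lists above amount to two lookup tables. The sketch below shows one way the auto-fix step might apply them; the `resolveAlias` helper and the choice to trim and lower-case input before lookup are assumptions, not behavior the spec confirms.

```typescript
// Alias tables copied from the lists above.
const NODE_TYPE_ALIASES: Record<string, string> = {
  container: "service",
  migration: "table",
  workflow: "pipeline",
  route: "endpoint",
  doc: "document",
  setting: "config",
  infra: "resource",
};

const EDGE_TYPE_ALIASES: Record<string, string> = {
  describes: "documents",
  creates: "provisions",
  exposes: "serves",
};

// Return the canonical type for an alias, or the normalized (trimmed,
// lower-cased) input when no alias applies. Normalization is an assumption.
function resolveAlias(type: string, aliases: Record<string, string>): string {
  const key = type.trim().toLowerCase();
  // hasOwnProperty guards against inherited keys such as "constructor".
  if (!Object.prototype.hasOwnProperty.call(aliases, key)) return key;
  return aliases[key] ?? key;
}
```

For example, `resolveAlias("container", NODE_TYPE_ALIASES)` yields `service`, while an already canonical value such as `service` passes through unchanged.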
## Plugin Architecture Changes

### Generalized AnalyzerPlugin Interface

```typescript
interface AnalyzerPlugin {
  name: string;
  languages: string[];
  analyzeFile(filePath: string, content: string): StructuralAnalysis;
  resolveImports?(filePath: string, content: string): ImportResolution[]; // Now optional
  extractCallGraph?(filePath: string, content: string): CallGraphEntry[];
  extractReferences?(filePath: string, content: string): ReferenceResolution[]; // NEW
}

interface ReferenceResolution {
  source: string; // File making the reference
  target: string; // Referenced file or identifier
  type: string; // Reference type: "file", "image", "schema", "service"
  line?: number;
}
```

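As an illustration of what an `extractReferences` implementation might return, the sketch below collects relative links and images from Markdown. The function name `extractMarkdownReferences` and the link regex are assumptions; only the `ReferenceResolution` shape comes from the interface above.

```typescript
// Mirrors the ReferenceResolution interface above so the sketch compiles on its own.
interface ReferenceResolution {
  source: string;
  target: string;
  type: string;
  line?: number;
}

// Hypothetical helper: report relative Markdown links and images as references.
function extractMarkdownReferences(filePath: string, content: string): ReferenceResolution[] {
  const references: ReferenceResolution[] = [];
  // Matches [text](target) and ![alt](target), with an optional title after the target.
  const linkPattern = /(!?)\[[^\]]*\]\(([^)\s]+)[^)]*\)/g;
  content.split(/\r?\n/).forEach((text, index) => {
    linkPattern.lastIndex = 0;
    let match: RegExpExecArray | null;
    while ((match = linkPattern.exec(text)) !== null) {
      // Drop any #fragment so the target is a plain path.
      const target = (match[2] ?? "").split("#")[0] ?? "";
      // Skip external URLs (anything with a scheme) and pure in-page anchors.
      if (target === "" || /^[a-z][a-z0-9+.-]*:/i.test(target)) continue;
      references.push({
        source: filePath,
        target,
        type: match[1] === "!" ? "image" : "file",
        line: index + 1,
      });
    }
  });
  return references;
}
```

A Markdown plugin could expose this function as its `extractReferences` method and leave `resolveImports` undefined, which the now-optional signature permits.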
### Extended StructuralAnalysis

```typescript
interface StructuralAnalysis {
  // Existing (unchanged)
  functions: FunctionInfo[];
  classes: ClassInfo[];
  imports: ImportInfo[];
  exports: ExportInfo[];
  // New (all optional for backward compat)
  sections?: SectionInfo[]; // Documents: headings, chapters
  definitions?: DefinitionInfo[]; // Schemas: types, messages, tables
  services?: ServiceInfo[]; // Infra: containers, deployments
  endpoints?: EndpointInfo[]; // APIs: routes, queries
  steps?: StepInfo[]; // Pipelines: jobs, stages, targets
  resources?: ResourceInfo[]; // IaC: terraform resources, K8s objects
}
```

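Because the new fields are optional, consumers need to treat a missing field as empty. The sketch below shows that pattern with a simplified `Named` stand-in for the real info types, which are not defined in this document; `summarize` is a hypothetical helper.

```typescript
// Simplified stand-in for FunctionInfo, SectionInfo, etc. (assumption).
type Named = { name: string };

interface StructuralAnalysis {
  functions: Named[];
  classes: Named[];
  imports: Named[];
  exports: Named[];
  sections?: Named[];
  definitions?: Named[];
  services?: Named[];
  endpoints?: Named[];
  steps?: Named[];
  resources?: Named[];
}

// Count entries per field, treating an absent optional field as an empty list.
function summarize(analysis: StructuralAnalysis): Record<string, number> {
  return {
    functions: analysis.functions.length,
    classes: analysis.classes.length,
    imports: analysis.imports.length,
    exports: analysis.exports.length,
    sections: (analysis.sections ?? []).length,
    definitions: (analysis.definitions ?? []).length,
    services: (analysis.services ?? []).length,
    endpoints: (analysis.endpoints ?? []).length,
    steps: (analysis.steps ?? []).length,
    resources: (analysis.resources ?? []).length,
  };
}
```

An analysis object produced by an existing code-only plugin, which carries just the four original fields, passes through `summarize` with zeros for every new field.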
### Custom Parsers (12)

All are lightweight: mostly regex-based, with minimal dependencies.

| Parser | Implementation | Extracts |
|--------|---------------|----------|
| `MarkdownParser` | Regex | Headings, links, code blocks, front matter |
| `YAMLParser` | `yaml` npm | Key hierarchy, anchors, multi-doc |
| `JSONParser` | Built-in `JSON.parse` (`jsonc-parser` for `.jsonc`) | Key structure, `$ref`/`$defs` |
| `TOMLParser` | `@iarna/toml` | Section structure |
| `EnvParser` | Regex | Variable names and references |
| `DockerfileParser` | Regex | FROM stages, EXPOSE ports, COPY sources |
| `SQLParser` | Regex | CREATE TABLE/VIEW/INDEX, columns, foreign keys |
| `GraphQLParser` | Regex | Types, queries, mutations, subscriptions |
| `ProtobufParser` | Regex | Messages, services, enums, RPCs |
| `TerraformParser` | Regex | Resources, modules, variables, outputs |
| `MakefileParser` | Regex | Targets, dependencies, variables |
| `ShellParser` | Regex | Functions, sourced files |

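The regex-based parsers in the table share a simple line-by-line shape. As one example, a `MakefileParser` might extract targets and dependencies along these lines; `MakeTarget` and `parseMakefileTargets` are illustrative names, and variable extraction is left out.

```typescript
// Hypothetical result shape for one Makefile rule.
interface MakeTarget {
  name: string;
  dependencies: string[];
  line: number; // 1-based line number
}

function parseMakefileTargets(content: string): MakeTarget[] {
  const targets: MakeTarget[] = [];
  // A rule line starts at column 0, e.g. "build: deps lint".
  // The (?!=) lookahead skips ":=" assignments; recipe lines start with a tab
  // and therefore never match. Trailing "# comments" are excluded by [^#]*.
  const rulePattern = /^([A-Za-z0-9_.\/-]+)\s*:(?!=)\s*([^#]*)/;
  content.split(/\r?\n/).forEach((text, index) => {
    const match = rulePattern.exec(text);
    if (!match) return;
    const name = match[1] ?? "";
    if (name.charAt(0) === ".") return; // skip special targets such as .PHONY
    const dependencies = (match[2] ?? "")
      .trim()
      .split(/\s+/)
      .filter((dep) => dep !== "");
    targets.push({ name, dependencies, line: index + 1 });
  });
  return targets;
}
```

Each target would map to a `pipeline` node, and each dependency to an edge between targets. Pattern rules, multiple targets on one line, and `::=` assignments are not covered by this sketch.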
## Agent Pipeline Changes

### Project Scanner

1. Scan ALL file types (remove code-only filter)
2. Tag each file with category: `code`, `config`, `docs`, `infra`, `data`, `script`, `markup`
3. Smart batch grouping: keep related files together (e.g., Dockerfile + docker-compose.yml)

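The tagging step could be a path-based classifier like the sketch below. The mapping from the file-type tables to the seven categories (for instance, Makefile and Terraform to `infra`, OpenAPI to `data`) is an inference from the table groupings, and `categorizeFile` is a hypothetical name.

```typescript
type FileCategory = "code" | "config" | "docs" | "infra" | "data" | "script" | "markup";

// Extension fallback table, derived from the Supported File Types tables (assumed mapping).
const EXTENSION_CATEGORIES: Record<string, FileCategory> = {
  ".md": "docs", ".mdx": "docs", ".rst": "docs", ".txt": "docs",
  ".yaml": "config", ".yml": "config", ".json": "config", ".jsonc": "config",
  ".toml": "config", ".xml": "config",
  ".tf": "infra", ".tfvars": "infra", ".mk": "infra",
  ".sql": "data", ".graphql": "data", ".gql": "data", ".proto": "data",
  ".csv": "data", ".tsv": "data",
  ".sh": "script", ".bash": "script", ".zsh": "script",
  ".ps1": "script", ".psm1": "script", ".bat": "script", ".cmd": "script",
  ".html": "markup", ".htm": "markup", ".css": "markup", ".scss": "markup", ".less": "markup",
};

function categorizeFile(filePath: string): FileCategory {
  const normalized = filePath.replace(/\\/g, "/");
  const lower = normalized.slice(normalized.lastIndexOf("/") + 1).toLowerCase();

  // Name-based rules run first because they are more specific than extensions.
  if (/^dockerfile(\..+)?$/.test(lower) || /\.dockerfile$/.test(lower)) return "infra";
  if (/^(docker-)?compose\.ya?ml$/.test(lower)) return "infra";
  if (/(^|\/)\.github\/workflows\/[^/]+\.ya?ml$/.test(normalized)) return "infra";
  if (lower === "jenkinsfile" || lower === "makefile") return "infra";
  if (/^(openapi|swagger)\.(ya?ml|json)$/.test(lower) || /\.schema\.json$/.test(lower)) return "data";
  if (/^\.env(\..+)?$/.test(lower)) return "config";

  // Anything without a known extension is treated as code, as before.
  const dot = lower.lastIndexOf(".");
  const ext = dot >= 0 ? lower.slice(dot) : "";
  return EXTENSION_CATEGORIES[ext] ?? "code";
}
```

Kubernetes manifests cannot be recognized from the path alone; since the spec detects them by the `apiVersion` field, that check would need file content and is omitted here.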
### File Analyzer

Type-aware prompt templates by category:

- **Code**: Current behavior (functions, classes, imports, call graph)
- **Config**: Extract key settings, what they configure, which code files they affect
- **Documentation**: Extract sections, key concepts, which code components are documented
- **Infrastructure**: Extract services, ports, volumes, dependencies, which code they deploy
- **Data/Schema**: Extract tables, columns, types, relationships, which code consumes this data
- **Pipelines**: Extract jobs, steps, triggers, which code/infra they build/deploy

### Cross-Type Reference Resolution

Post-analysis step connecting:
- Dockerfile `COPY` -> source code directories
- CI config `run: npm test` -> test files
- K8s manifest `image:` -> Dockerfile
- SQL foreign keys -> other tables
- OpenAPI `$ref` -> schema definitions
- Markdown links -> referenced files

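The first connection in the list above, Dockerfile `COPY` to source directories, can be sketched as follows. The `resolveDockerfileCopies` name and the rule for matching a source against the project file list are assumptions; the spec does not say how targets are validated.

```typescript
// Mirrors the ReferenceResolution interface from the plugin section.
interface ReferenceResolution {
  source: string;
  target: string;
  type: string;
  line?: number;
}

// Hypothetical resolver: link each COPY source to a path that exists in the project.
function resolveDockerfileCopies(
  dockerfilePath: string,
  content: string,
  projectFiles: string[]
): ReferenceResolution[] {
  const references: ReferenceResolution[] = [];
  content.split(/\r?\n/).forEach((text, index) => {
    const match = /^\s*COPY\s+(.+)$/i.exec(text);
    if (!match) return;
    const tokens = (match[1] ?? "").trim().split(/\s+/);
    // COPY --from=<stage> reads from another build stage, not from the project tree.
    if (tokens.some((token) => /^--from=/.test(token))) return;
    const args = tokens.filter((token) => token.indexOf("--") !== 0);
    // The last argument is the destination inside the image; the rest are sources.
    args.slice(0, -1).forEach((source) => {
      const trimmed = source.replace(/^\.\//, "").replace(/\/+$/, "");
      const target = trimmed === "" ? "." : trimmed;
      const exists = projectFiles.some(
        (file) => target === "." || file === target || file.indexOf(target + "/") === 0
      );
      if (exists) {
        references.push({ source: dockerfilePath, target, type: "file", line: index + 1 });
      }
    });
  });
  return references;
}
```

References that resolve could then be turned into `deploys` edges. The JSON-array form of `COPY`, wildcards, and line continuations are not handled in this sketch.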
### Architecture Analyzer

New pattern detection:
- Deployment topology: Dockerfile -> compose -> K8s chain
- Data flow: Schema -> migration -> API endpoint -> client code
- Documentation coverage: which modules have docs vs. not
- Configuration dependency: which config files affect which code paths

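Documentation coverage, for instance, reduces to checking which `module` nodes are the target of a `documents` edge. The sketch below assumes minimal `GraphNode` and `GraphEdge` shapes with `id`, `type`, `source`, and `target` fields; the real graph schema is not shown in this document.

```typescript
// Minimal graph shapes assumed for illustration.
interface GraphNode {
  id: string;
  type: string;
}

interface GraphEdge {
  source: string;
  target: string;
  type: string;
}

interface CoverageReport {
  documented: string[];
  undocumented: string[];
  ratio: number; // documented modules / all modules, 0 when there are no modules
}

function documentationCoverage(nodes: GraphNode[], edges: GraphEdge[]): CoverageReport {
  // Collect every node id that some `documents` edge points at.
  const documentedIds: Record<string, boolean> = {};
  edges.forEach((edge) => {
    if (edge.type === "documents") documentedIds[edge.target] = true;
  });
  const modules = nodes.filter((node) => node.type === "module");
  const documented = modules.filter((m) => documentedIds[m.id] === true).map((m) => m.id);
  const undocumented = modules.filter((m) => documentedIds[m.id] !== true).map((m) => m.id);
  const ratio = modules.length === 0 ? 0 : documented.length / modules.length;
  return { documented, undocumented, ratio };
}
```

The `undocumented` list is the part most likely to be useful in the dashboard, since it names the modules that have no linked docs.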
### Tour Builder

Include non-code tour stops:
- Project overview from the README
- Containerization walkthrough from the Dockerfile
- Database schema walkthrough from SQL migrations
- CI/CD pipeline explanation

## Dashboard Visualization

### New Node Visual Styles

| Node Type | Shape | Color | Icon |
|-----------|-------|-------|------|
| `config` | Rounded rect | Teal (#5eead4) | Gear |
| `document` | Rounded rect | Sky blue (#7dd3fc) | Document |
| `service` | Hexagon | Violet (#a78bfa) | Container/Box |
| `table` | Rectangle | Emerald (#6ee7b7) | Grid |
| `endpoint` | Pill/Stadium | Orange (#fdba74) | Arrow-right |
| `pipeline` | Rounded rect | Rose (#fda4af) | Play/Workflow |
| `schema` | Diamond | Amber (#fcd34d) | Blueprint |
| `resource` | Cloud shape | Indigo (#a5b4fc) | Cloud |

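In the dashboard code, the style table above could be held as a typed lookup. The shape and icon identifiers below, the `styleFor` helper, and the gray fallback for existing node types are all placeholders chosen for this sketch; only the hex colors come from the table.

```typescript
type NewNodeType =
  | "config" | "document" | "service" | "table"
  | "endpoint" | "pipeline" | "schema" | "resource";

interface NodeStyle {
  shape: string;
  color: string;
  icon: string;
}

// Colors copied from the table; shape and icon keys are illustrative identifiers.
const NODE_STYLES: Record<NewNodeType, NodeStyle> = {
  config: { shape: "rounded-rect", color: "#5eead4", icon: "gear" },
  document: { shape: "rounded-rect", color: "#7dd3fc", icon: "document" },
  service: { shape: "hexagon", color: "#a78bfa", icon: "container" },
  table: { shape: "rectangle", color: "#6ee7b7", icon: "grid" },
  endpoint: { shape: "pill", color: "#fdba74", icon: "arrow-right" },
  pipeline: { shape: "rounded-rect", color: "#fda4af", icon: "play" },
  schema: { shape: "diamond", color: "#fcd34d", icon: "blueprint" },
  resource: { shape: "cloud", color: "#a5b4fc", icon: "cloud" },
};

// Hypothetical fallback for node types that are not in the table above.
const DEFAULT_STYLE: NodeStyle = { shape: "rectangle", color: "#9ca3af", icon: "file" };

function styleFor(nodeType: string): NodeStyle {
  return Object.prototype.hasOwnProperty.call(NODE_STYLES, nodeType)
    ? NODE_STYLES[nodeType as NewNodeType]
    : DEFAULT_STYLE;
}
```

Typing the table as `Record<NewNodeType, NodeStyle>` makes the compiler flag a missing entry if another node type is added to the union later.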
### Graph Layout

1. Layer grouping by category: non-code nodes cluster separately from code nodes
2. Legend update with 8 new node types
3. Filter controls: checkboxes to show/hide each file category

### Sidebar Enhancements

NodeInfo panel updates per node type:
- **Config**: key-value pairs, referencing code files
- **Document**: heading outline, linked code components
- **Service**: ports, volumes, dependencies, deployed code
- **Table**: columns, types, foreign key relationships
- **Endpoint**: HTTP method, path, request/response schema
- **Pipeline**: jobs, triggers, deployed targets
- **Schema**: fields, nested types, consumers
- **Resource**: provider, type, dependencies

ProjectOverview panel: add "File Types" breakdown (code vs. non-code distribution).

## New Dependencies

- `yaml`: YAML parsing (widely used, ~50KB)
- `@iarna/toml`: TOML parsing (~30KB)
- `jsonc-parser`: JSON with comments (~20KB)

No tree-sitter WASM additions. All other custom parsers are regex-based or use the built-in `JSON.parse`, so they add no dependencies.

## Backward Compatibility

- All new `StructuralAnalysis` fields are optional
- `resolveImports` becomes optional on `AnalyzerPlugin`
- Existing `LanguageConfig` entries are unchanged
- Schema validation auto-fixes the new type aliases (for example, `container` -> `service`)
- Existing knowledge graphs remain valid (new types are additive)