11 KiB
11 KiB
Understand Anything: Universal File Type Support
Date: 2026-03-28 Status: Approved Approach: Big Bang — all file types in one release
Goals
- Extend Understand Anything to analyze any file type, not just code
- Support both holistic project enrichment (non-code files enrich code graphs) and standalone analysis (docs-only repos, SQL schema collections, IaC projects)
- Maintain backward compatibility with existing code-only analysis
Supported File Types (26 new)
Documentation (3)
| Type | Extensions | Parser | Node Types |
|---|---|---|---|
| Markdown | .md, .mdx |
LLM + regex heading extraction | document |
| reStructuredText | .rst |
LLM | document |
| Plain text | .txt |
LLM | document |
Configuration (5)
| Type | Extensions | Parser | Node Types |
|---|---|---|---|
| YAML | .yaml, .yml |
yaml npm package |
config |
| JSON | .json, .jsonc |
JSON.parse / jsonc-parser |
config, schema |
| TOML | .toml |
@iarna/toml or similar |
config |
| .env | .env, .env.* |
Regex line parser | config |
| XML | .xml |
LLM (optionally fast-xml-parser) |
config |
Infrastructure & DevOps (7)
| Type | Extensions | Parser | Node Types |
|---|---|---|---|
| Dockerfile | Dockerfile, Dockerfile.*, .dockerfile |
Custom instruction parser | service, pipeline |
| Docker Compose | docker-compose.yml, compose.yml |
YAML parser + service extraction | service |
| Terraform | .tf, .tfvars |
Regex block parser | resource |
| Kubernetes | K8s YAML (detected by apiVersion field) |
YAML + kind detection | service, resource |
| GitHub Actions | .github/workflows/*.yml |
YAML + job/step extraction | pipeline |
| Jenkinsfile | Jenkinsfile |
LLM (Groovy DSL) | pipeline |
| Makefile | Makefile, *.mk |
Regex target parser | pipeline |
Data & Schema (6)
| Type | Extensions | Parser | Node Types |
|---|---|---|---|
| SQL | .sql |
Simple DDL parser | table, endpoint |
| GraphQL | .graphql, .gql |
Regex type/query parser | schema, endpoint |
| OpenAPI/Swagger | openapi.yaml, swagger.json |
YAML/JSON + path extraction | endpoint, schema |
| Protocol Buffers | .proto |
Regex message/service parser | schema |
| JSON Schema | *.schema.json |
JSON + $ref/$defs extraction |
schema |
| CSV/TSV | .csv, .tsv |
Header row extraction | table |
Shell & Scripts (3)
| Type | Extensions | Parser | Node Types |
|---|---|---|---|
| Shell | .sh, .bash, .zsh |
Regex function parser | file, function |
| PowerShell | .ps1, .psm1 |
LLM | file, function |
| Batch | .bat, .cmd |
LLM | file |
Markup (2)
| Type | Extensions | Parser | Node Types |
|---|---|---|---|
| HTML | .html, .htm |
LLM (tag structure) | document |
| CSS/SCSS/Less | .css, .scss, .less |
LLM | file |
Schema Extensions
New Node Types (8)
Added to the existing file | function | class | module | concept:
| Node Type | Purpose | Example |
|---|---|---|
config |
Configuration files and key settings | package.json, tsconfig.json, env vars |
document |
Documentation, prose, guides | README.md, API docs |
service |
Deployable services/containers | Docker containers, K8s Deployments |
table |
Data tables, database objects | SQL tables, CSV datasets |
endpoint |
API routes, queries, mutations | REST paths, GraphQL queries |
pipeline |
CI/CD workflows, build steps | GitHub Actions jobs, Makefile targets |
schema |
Type definitions for data interchange | Protobuf messages, JSON Schema |
resource |
Infrastructure resources | Terraform resources, K8s ConfigMaps |
New Edge Types (8)
Added to the existing 18 edge types:
| Edge Type | Category | Meaning | Example |
|---|---|---|---|
deploys |
Infrastructure | Service deploys code | Dockerfile -> app source |
serves |
Infrastructure | Service exposes endpoint | K8s Service -> API endpoint |
migrates |
Data flow | Migration modifies table | SQL migration -> table |
documents |
Semantic | Doc describes code | README -> module |
provisions |
Infrastructure | IaC creates resource | Terraform -> AWS resource |
routes |
Behavioral | Routes traffic to service | nginx config -> service |
defines_schema |
Data flow | Defines data shape | Protobuf -> endpoint |
triggers |
Behavioral | Triggers pipeline/action | Git push -> GitHub Actions |
Schema Validation Auto-Fix Aliases
New node type aliases:
container->service,migration->table,workflow->pipelineroute->endpoint,doc->document,setting->config,infra->resource
New edge type aliases:
describes->documents,creates->provisions,exposes->serves
Plugin Architecture Changes
Generalized AnalyzerPlugin Interface
interface AnalyzerPlugin {
name: string;
languages: string[];
analyzeFile(filePath: string, content: string): StructuralAnalysis;
resolveImports?(filePath: string, content: string): ImportResolution[]; // Now optional
extractCallGraph?(filePath: string, content: string): CallGraphEntry[];
extractReferences?(filePath: string, content: string): ReferenceResolution[]; // NEW
}
interface ReferenceResolution {
source: string; // File making the reference
target: string; // Referenced file or identifier
type: string; // Reference type: "file", "image", "schema", "service"
line?: number;
}
Extended StructuralAnalysis
interface StructuralAnalysis {
// Existing (unchanged)
functions: FunctionInfo[];
classes: ClassInfo[];
imports: ImportInfo[];
exports: ExportInfo[];
// New (all optional for backward compat)
sections?: SectionInfo[]; // Documents: headings, chapters
definitions?: DefinitionInfo[]; // Schemas: types, messages, tables
services?: ServiceInfo[]; // Infra: containers, deployments
endpoints?: EndpointInfo[]; // APIs: routes, queries
steps?: StepInfo[]; // Pipelines: jobs, stages, targets
resources?: ResourceInfo[]; // IaC: terraform resources, K8s objects
}
Custom Parsers (12)
All lightweight — mostly regex-based, minimal dependencies:
| Parser | Implementation | Extracts |
|---|---|---|
MarkdownParser |
Regex | Headings, links, code blocks, front matter |
YAMLParser |
yaml npm |
Key hierarchy, anchors, multi-doc |
JSONParser |
Built-in JSON.parse |
Key structure, $ref/$defs |
TOMLParser |
@iarna/toml |
Section structure |
EnvParser |
Regex | Variable names and references |
DockerfileParser |
Regex | FROM stages, EXPOSE ports, COPY sources |
SQLParser |
Regex | CREATE TABLE/VIEW/INDEX, columns, foreign keys |
GraphQLParser |
Regex | Types, queries, mutations, subscriptions |
ProtobufParser |
Regex | Messages, services, enums, RPCs |
TerraformParser |
Regex | Resources, modules, variables, outputs |
MakefileParser |
Regex | Targets, dependencies, variables |
ShellParser |
Regex | Functions, sourced files |
Agent Pipeline Changes
Project Scanner
- Scan ALL file types (remove code-only filter)
- Tag each file with category:
code,config,docs,infra,data,script,markup - Smart batch grouping: keep related files together (e.g., Dockerfile + docker-compose.yml)
File Analyzer
Type-aware prompt templates by category:
- Code: Current behavior (functions, classes, imports, call graph)
- Config: Extract key settings, what they configure, which code files they affect
- Documentation: Extract sections, key concepts, which code components are documented
- Infrastructure: Extract services, ports, volumes, dependencies, which code they deploy
- Data/Schema: Extract tables, columns, types, relationships, which code consumes this data
- Pipelines: Extract jobs, steps, triggers, which code/infra they build/deploy
Cross-Type Reference Resolution
Post-analysis step connecting:
- Dockerfile
COPY-> source code directories - CI config
run: npm test-> test files - K8s manifest
image:-> Dockerfile - SQL foreign keys -> other tables
- OpenAPI
$ref-> schema definitions - Markdown links -> referenced files
Architecture Analyzer
New pattern detection:
- Deployment topology: Dockerfile -> compose -> K8s chain
- Data flow: Schema -> migration -> API endpoint -> client code
- Documentation coverage: which modules have docs vs. not
- Configuration dependency: which config files affect which code paths
Tour Builder
Include non-code tour stops:
- Project README overview
- Dockerfile containerization
- SQL migration database schema
- CI/CD pipeline explanation
Dashboard Visualization
New Node Visual Styles
| Node Type | Shape | Color | Icon |
|---|---|---|---|
config |
Rounded rect | Teal (#5eead4) | Gear |
document |
Rounded rect | Sky blue (#7dd3fc) | Document |
service |
Hexagon | Violet (#a78bfa) | Container/Box |
table |
Rectangle | Emerald (#6ee7b7) | Grid |
endpoint |
Pill/Stadium | Orange (#fdba74) | Arrow-right |
pipeline |
Rounded rect | Rose (#fda4af) | Play/Workflow |
schema |
Diamond | Amber (#fcd34d) | Blueprint |
resource |
Cloud shape | Indigo (#a5b4fc) | Cloud |
Graph Layout
- Layer grouping by category — non-code nodes cluster separately from code nodes
- Legend update with 8 new node types
- Filter controls — checkboxes to show/hide each file category
Sidebar Enhancements
NodeInfo panel updates per node type:
- Config: key-value pairs, referencing code files
- Document: heading outline, linked code components
- Service: ports, volumes, dependencies, deployed code
- Table: columns, types, foreign key relationships
- Endpoint: HTTP method, path, request/response schema
- Pipeline: jobs, triggers, deployed targets
- Schema: fields, nested types, consumers
- Resource: provider, type, dependencies
ProjectOverview panel: add "File Types" breakdown (code vs. non-code distribution).
New Dependencies
yaml— YAML parsing (already common, ~50KB)@iarna/toml— TOML parsing (~30KB)jsonc-parser— JSON with comments (~20KB)
No tree-sitter WASM additions. All other parsers are regex-based with zero dependencies.
Backward Compatibility
- All new
StructuralAnalysisfields are optional resolveImportsbecomes optional onAnalyzerPlugin- Existing
LanguageConfigentries unchanged - Schema validation auto-fixes new type aliases
- Existing knowledge graphs remain valid (new types are additive)