Observation Pipeline, Semantic Classification, Webhook Architecture
TL;DR
Neural Memory's foundation layer is now live. The observation pipeline captures engineering events from GitHub and Vercel in real time and processes them through significance scoring, AI classification, and multi-view embedding generation. Events are automatically categorized into 14 engineering categories using Claude Haiku, with a regex fallback. The webhook architecture includes cryptographic signature verification, timestamp validation, and raw payload storage for audit trails.
Real-time event capture, AI classification, and production-ready webhook infrastructure
Observation Pipeline
The neural observation pipeline captures engineering activity from your connected sources and transforms it into searchable memory. Events flow through significance scoring, AI classification, entity extraction, and multi-view embedding generation before storage.
What's included:
Significance scoring filters out low-value events (threshold: 40/100). High-value events such as releases (75), deployment failures (70), and PR merges (60) pass through automatically; routine commits (30) and trivial changes are dropped.
Multi-view embeddings generate three vectors per observation: title-only for headline searches, full content for detailed queries, and a balanced summary view. All three are stored in Pinecone with pre-computed observation IDs for direct lookup.
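The pre-computed ID scheme can be sketched as follows. The `observationId:view` format here is illustrative, not the actual naming convention:

```typescript
// Hypothetical ID scheme for the three per-observation vectors.
type View = "title" | "content" | "summary";

const VIEWS: View[] = ["title", "content", "summary"];

// Deterministic IDs allow a direct fetch-by-ID from Pinecone when the
// observation is already known, instead of a similarity query.
function vectorId(observationId: string, view: View): string {
  return `${observationId}:${view}`;
}

function vectorIds(observationId: string): string[] {
  return VIEWS.map((view) => vectorId(observationId, view));
}
```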
Entity extraction identifies API endpoints, file paths, issue references, @mentions, and environment variables from event content. Entities are deduplicated and tracked with occurrence counts.
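A minimal sketch of deduplicated extraction with occurrence counts, using an illustrative subset of patterns (issue references, @mentions, environment variables); the production patterns and the full rule set are not published:

```typescript
// Illustrative patterns only; actual extraction also covers API
// endpoints and file paths.
const ENTITY_PATTERNS: RegExp[] = [
  /#\d+/g,                 // issue references like #42
  /@[a-zA-Z0-9_-]+/g,      // @mentions
  /\b[A-Z][A-Z0-9_]{2,}\b/g, // SCREAMING_CASE environment variables
];

// Dedupe entities and track occurrence counts, capped at 50 distinct
// entities per observation (per the stated limitation).
function extractEntities(text: string, limit = 50): Map<string, number> {
  const counts = new Map<string, number>();
  for (const pattern of ENTITY_PATTERNS) {
    for (const match of text.matchAll(pattern)) {
      if (counts.size >= limit && !counts.has(match[0])) continue;
      counts.set(match[0], (counts.get(match[0]) ?? 0) + 1);
    }
  }
  return counts;
}
```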
Cluster assignment groups related observations using embedding similarity (40 points), entity overlap (30 points), actor overlap (20 points), and temporal proximity (10 points). Threshold: 60/100 to join an existing cluster.
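The weighted cluster score reduces to a simple linear combination. This sketch assumes each signal is normalized to 0..1 before weighting (the normalization itself is not specified in these notes):

```typescript
interface ClusterSignals {
  embeddingSimilarity: number; // 0..1, e.g. cosine similarity
  entityOverlap: number;       // 0..1 shared-entity ratio
  actorOverlap: number;        // 0..1 shared-actor ratio
  temporalProximity: number;   // 1 = same moment, 0 = at the 7-day edge
}

const CLUSTER_JOIN_THRESHOLD = 60; // out of 100

function clusterScore(s: ClusterSignals): number {
  return 40 * s.embeddingSimilarity
       + 30 * s.entityOverlap
       + 20 * s.actorOverlap
       + 10 * s.temporalProximity;
}

function joinsCluster(s: ClusterSignals): boolean {
  return clusterScore(s) >= CLUSTER_JOIN_THRESHOLD;
}
```

Note that a strong embedding match alone (40 points) cannot join a cluster; at least one corroborating signal is required to clear 60.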
Example: Significance Scoring
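A minimal sketch of the significance gate. The base scores (75/70/60/30) and the 40/100 threshold come from the notes above; the event-type names are illustrative:

```typescript
// Hypothetical base-score table; only the numbers are from the release notes.
const BASE_SCORES: Record<string, number> = {
  "release.published": 75,
  "deployment.error": 70,
  "pull_request.merged": 60,
  "push.commit": 30,
};

const SIGNIFICANCE_THRESHOLD = 40;

function significanceScore(eventType: string): number {
  return BASE_SCORES[eventType] ?? 0;
}

// Only events clearing the gate proceed to classification and embedding.
function passesGate(eventType: string): boolean {
  return significanceScore(eventType) >= SIGNIFICANCE_THRESHOLD;
}
```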
Limitations:
- Significance threshold (40) is global; per-workspace configuration is planned
- Entity extraction is limited to 50 entities per observation
- Cluster lookback window is 7 days
Semantic Classification
Every observation is classified into one of 14 engineering categories using Claude Haiku. Classification drives cluster organization, topic extraction, and future retrieval filtering.
Categories:
| Category | Description |
|---|---|
| bug_fix | Bug fixes, patches, error corrections |
| feature | New features, additions, implementations |
| refactor | Code restructuring, cleanup |
| documentation | Docs, README, comments |
| testing | Tests, specs, coverage |
| infrastructure | CI/CD, pipelines, Docker |
| security | Security fixes, auth changes |
| performance | Optimizations, speed improvements |
| incident | Outages, emergencies, hotfixes |
| decision | ADRs, architecture decisions |
| discussion | RFCs, proposals, design discussions |
| release | Version releases, changelogs |
| deployment | Deployments, shipping to production |
| other | Doesn't fit other categories |
How it works:
- Claude Haiku receives event details (source, type, title, body truncated to 1000 chars)
- It returns a primary category, up to 3 secondary categories, up to 5 topics, and a confidence score
- Temperature 0.2 keeps classification consistent across runs (low, though not strictly deterministic)
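The request assembly can be sketched as below. Only the 1000-character body truncation is from the notes; the field layout and the output-schema instruction are hypothetical:

```typescript
interface ClassificationRequest {
  source: string; // e.g. "github"
  type: string;   // e.g. "pull_request.closed"
  title: string;
  body: string;
}

// Hypothetical prompt builder for the classification call.
function buildClassificationPrompt(e: ClassificationRequest): string {
  const body = e.body.slice(0, 1000); // truncate body to 1000 chars before sending
  return [
    `Source: ${e.source}`,
    `Type: ${e.type}`,
    `Title: ${e.title}`,
    `Body: ${body}`,
    "Return JSON: { category, secondary (max 3), topics (max 5), confidence }",
  ].join("\n");
}
```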
Fallback:
If the LLM call fails (timeout, rate limit), regex patterns classify events by keyword matching.
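An illustrative sketch of such a fallback; the keywords and their ordering here are assumptions, not the production pattern set:

```typescript
// First matching pattern wins, so more specific categories are checked first.
const FALLBACK_PATTERNS: Array<[string, RegExp]> = [
  ["security", /vulnerab|CVE-\d+|\bauth\b|\bxss\b|\bcsrf\b/i],
  ["incident", /\b(outage|hotfix|sev[0-9])\b/i],
  ["bug_fix", /\b(fix(es|ed)?|bug|patch(es|ed)?)\b/i],
  ["testing", /\b(test(s|ing)?|spec|coverage)\b/i],
  ["documentation", /\b(docs?|readme|changelog)\b/i],
  ["release", /\brelease\b|v\d+\.\d+\.\d+/i],
];

function fallbackClassify(title: string): string {
  for (const [category, pattern] of FALLBACK_PATTERNS) {
    if (pattern.test(title)) return category;
  }
  return "other";
}
```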
Limitations:
- Classification results (category, confidence) are not stored in the database; only the topics array persists
- No accuracy metrics are tracked in production
- The confidence threshold (0.6) is defined but not enforced
Webhook Architecture
Production-ready webhook infrastructure receives events from GitHub and Vercel with cryptographic verification, replay protection, and complete audit trails.
Supported Events:
| Source | Events |
|---|---|
| GitHub | push (default branch), pull_request (opened/closed/reopened/ready_for_review), issues (opened/closed/reopened), release (published), discussion (created/answered) |
| Vercel | deployment.created, deployment.succeeded, deployment.ready, deployment.error, deployment.canceled |
Security measures:
- Signature verification: HMAC SHA-256 (GitHub) and HMAC SHA-1 (Vercel) with timing-safe comparison
- Replay protection: 5-minute timestamp validation window with 60-second clock-skew tolerance
- Audit trail: raw JSON payloads stored permanently in the workspace_webhook_payloads table
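For the GitHub side, verification can be sketched with Node's crypto primitives. GitHub sends the HMAC-SHA-256 of the raw body in the `X-Hub-Signature-256` header as `sha256=<hex>`; the timestamp window and skew values below are the ones stated above:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

function verifyGitHubSignature(rawBody: string, signatureHeader: string, secret: string): boolean {
  const expected = "sha256=" + createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signatureHeader);
  // timingSafeEqual throws on length mismatch, so compare lengths first;
  // the expected digest length is constant, so this leaks nothing secret.
  return a.length === b.length && timingSafeEqual(a, b);
}

const WINDOW_MS = 5 * 60 * 1000; // 5-minute validation window
const SKEW_MS = 60 * 1000;       // 60-second clock-skew tolerance

function isTimestampFresh(eventMs: number, nowMs: number): boolean {
  return eventMs <= nowMs + SKEW_MS && nowMs - eventMs <= WINDOW_MS;
}
```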
Processing architecture:
Example: SourceEvent structure
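A hypothetical shape for the normalized event that source transformers emit; the field names are illustrative, not the actual schema:

```typescript
interface SourceEvent {
  source: "github" | "vercel";
  type: string;            // e.g. "pull_request.closed", "deployment.error"
  externalId: string;      // provider-side delivery/event ID
  title: string;
  body: string;
  actor: { login: string; githubUserId?: number };
  commitSha?: string;      // used for cross-source correlation
  occurredAt: string;      // ISO-8601 timestamp
  rawPayloadRef: string;   // pointer into workspace_webhook_payloads
}
```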
Cross-source correlation:
Vercel deployments are linked to GitHub users via commit SHA. When a GitHub push arrives with the same commit, the Vercel observation's actor is updated with the numeric GitHub user ID.
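The correlation above can be sketched as a SHA-keyed lookup; an in-memory map stands in for the database here, and the function names are illustrative:

```typescript
interface PendingDeployment {
  observationId: string;
  actorGithubUserId?: number;
}

const deploymentsBySha = new Map<string, PendingDeployment>();

// Record a Vercel deployment observation keyed by its commit SHA.
function recordVercelDeployment(commitSha: string, observationId: string): void {
  deploymentsBySha.set(commitSha, { observationId });
}

// When a GitHub push for the same SHA arrives, backfill the numeric
// GitHub user ID onto the Vercel observation's actor.
function onGitHubPush(commitSha: string, githubUserId: number): boolean {
  const deployment = deploymentsBySha.get(commitSha);
  if (!deployment) return false;
  deployment.actorGithubUserId = githubUserId;
  return true;
}
```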
Limitations:
- Only GitHub and Vercel sources are implemented (Linear, Sentry, PagerDuty planned)
- No circuit breaker for failing transformers
- No rate limiting at the webhook endpoint level
- No manual reprocessing UI
Why We Built It This Way
The observation pipeline uses a significance scoring gate before AI classification to minimize LLM costs. Only events scoring 40+ undergo classification and embedding generation. This keeps costs predictable while ensuring high-value events like security patches and releases are always captured.
Multi-view embeddings (title, content, summary) optimize retrieval for different query types. When searching for "authentication bug", the title embedding finds headline matches, while the content embedding surfaces detailed discussions. The summary view balances both for general queries.
Raw webhook payload storage enables replay and debugging. When something goes wrong, you can inspect the exact JSON received, re-trigger processing, or audit what happened.