The Modern Pipeline Processor for
High-Throughput Vector Data
Ingest terabytes of unstructured data with scalable, rewindable pipelines in any language
Pipestream AI is an open-source platform designed to solve the headaches of massive unstructured data processing. Payloads are streamed and processed in a reactive, claim-check architecture, with asynchronous checkpoints stored in S3 and coordinated through Kafka. Process, chunk, and index massive datasets for RAG and Vector Search at scale.
Typical Pipelines
Traditional tools built for rows and columns choke on the unstructured chaos of the AI era.
- Real Scaling: Sequential pipelines block on slow steps (like OCR), causing backpressure that halts the system.
- Fast and Low Cost: Pushing heavy payloads (PDFs, images) through brokers kills throughput and explodes retention costs.
- Data Intact, No Schema Drift: Automated crawlers guess data types incorrectly, creating brittle pipelines that break silently.
Pipestream AI
A reactive, claim-check architecture built specifically for massive unstructured data streams.
- Reactive & Non-Blocking: Built on Quarkus and Mutiny, a single pod handles thousands of concurrent streams with minimal memory (see the sketch after this list).
- Claim-Check Pattern: Kafka carries the envelope (metadata); S3 and gRPC carry the heavy payload.
- Strict Schemas: Integration with Apicurio Registry ensures strict Protobuf contracts, preventing silent failures caused by schema drift.
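As a rough illustration of the reactive core (not Pipestream's actual code), a Mutiny stage can merge thousands of in-flight parses without ever parking a thread; `Envelope` and `parse()` are hypothetical stand-ins:

```java
import io.smallrye.mutiny.Multi;
import io.smallrye.mutiny.Uni;

public class ReactiveParseStage {

    // Hypothetical stand-in for Pipestream's PipeDoc envelope.
    record Envelope(String docId, String s3Uri) {}

    // Each envelope is parsed asynchronously and results are merged as
    // they arrive, so one slow document (e.g. a huge OCR job) never
    // blocks the thousands of streams sharing the same event loop.
    Multi<String> parseAll(Multi<Envelope> envelopes) {
        return envelopes.onItem().transformToUniAndMerge(this::parse);
    }

    // Stand-in for a non-blocking call to a parser module (e.g. over gRPC).
    Uni<String> parse(Envelope env) {
        return Uni.createFrom().item(() -> "text extracted from " + env.s3Uri());
    }
}
```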
Multi-Strategy Intelligence
Apply any number of chunking strategies and embedding models in a single crawl. Store all resulting vectors in the same OpenSearch record, opening the door to true A/B testing. Reindex your data without crawling it again.
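As a rough illustration, an index mapping could reserve one k-NN field per strategy, so every chunk record carries all of its vectors. Field names and dimensions here are illustrative (1536 for an Ada-002-style model, 4096 for a Llama-3-style model), assuming OpenSearch's k-NN plugin:

```json
{
  "mappings": {
    "properties": {
      "doc_id":        { "type": "keyword" },
      "text":          { "type": "text" },
      "chunk_ada002":  { "type": "knn_vector", "dimension": 1536 },
      "chunk_llama3":  { "type": "knn_vector", "dimension": 4096 }
    }
  }
}
```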
Claim-Check Architecture
We separate the Control Plane (Kafka) from the Data Plane (S3 + gRPC). Process terabytes of data in a horizontally scaled cluster without ever clogging your message brokers.
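A minimal sketch of the pattern, assuming the AWS SDK v2 and the plain Kafka client (topic and bucket names are invented):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

import java.nio.file.Path;

public class ClaimCheckProducer {

    private final S3Client s3 = S3Client.create();
    private final KafkaProducer<String, String> kafka; // configured elsewhere

    ClaimCheckProducer(KafkaProducer<String, String> kafka) {
        this.kafka = kafka;
    }

    void ingest(String docId, Path payload) {
        // Data plane: the heavy payload goes straight to object storage.
        String key = "raw/" + docId + ".pdf";
        s3.putObject(PutObjectRequest.builder()
                .bucket("pipestream-payloads")   // hypothetical bucket
                .key(key)
                .build(),
            RequestBody.fromFile(payload));

        // Control plane: Kafka carries only a pointer-sized envelope,
        // so broker throughput and retention costs stay flat.
        String envelope = "{\"docId\":\"" + docId
                + "\",\"s3Uri\":\"s3://pipestream-payloads/" + key + "\"}";
        kafka.send(new ProducerRecord<>("pipedocs", docId, envelope));
    }
}
```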
Time Travel & Rewind
Updated your embedding model? Reset the consumer offset. The pipeline replays the state instantly, skipping expensive ingestion steps by fetching raw text from S3.
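In practice a rewind is ordinary Kafka offset surgery. A sketch using the Kafka Admin API, with group and topic names invented:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;

public class Rewind {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Rewind the embedder group to offset 0 on partition 0.
            // The group must be stopped while offsets are rewritten.
            admin.alterConsumerGroupOffsets(
                "embedder-group",                          // hypothetical group id
                Map.of(new TopicPartition("pipedocs", 0),  // hypothetical topic
                       new OffsetAndMetadata(0L)))
                .all().get();
        }
        // On restart the embedder replays every envelope, fetching the
        // already-parsed text from S3 instead of re-running ingestion.
    }
}
```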
High-Speed gRPC
Processing steps communicate via strictly typed internal gRPC streams. Optimized flow control sustains hundreds of megabytes per second of throughput per stream.
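A hedged sketch of what a streaming hop can look like with grpc-java; the service and message types are placeholders for whatever protoc generates from your contract:

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.stub.StreamObserver;

public class ChunkStreamClient {

    // ChunkerGrpc, Blob and Chunk are placeholders for classes protoc
    // would generate from a hypothetical service such as:
    //   service Chunker { rpc Chunk(stream Blob) returns (stream Chunk); }
    public static void main(String[] args) throws InterruptedException {
        ManagedChannel channel = ManagedChannelBuilder
            .forAddress("chunker", 50051)
            .usePlaintext()                      // internal cluster traffic
            .build();

        ChunkerGrpc.ChunkerStub stub = ChunkerGrpc.newStub(channel);

        // Bidirectional stream: frames go up, chunks come back, and gRPC's
        // built-in flow control applies backpressure per stream.
        StreamObserver<Blob> upstream = stub.chunk(new StreamObserver<Chunk>() {
            @Override public void onNext(Chunk chunk)  { /* accumulate into the PipeDoc */ }
            @Override public void onError(Throwable t) { t.printStackTrace(); }
            @Override public void onCompleted()        { channel.shutdown(); }
        });

        for (Blob frame : framesFromS3()) {      // payload frames, e.g. 1 MB each
            upstream.onNext(frame);
        }
        upstream.onCompleted();
        channel.awaitTermination(30, java.util.concurrent.TimeUnit.SECONDS);
    }

    static Iterable<Blob> framesFromS3() { return java.util.List.of(); } // elided
}
```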
Democratized Config
Dynamic JSON Forms generate front ends without coding, letting admins and data scientists tweak NLP parameters (like LLM temperature) from the UI without touching backend code.
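For example, a module might publish a JSON Schema like the one below (parameters invented for illustration); the UI renders it as a bounded form with no frontend code:

```json
{
  "type": "object",
  "properties": {
    "temperature": {
      "type": "number",
      "minimum": 0.0,
      "maximum": 2.0,
      "default": 0.7,
      "description": "LLM sampling temperature"
    },
    "chunk_size": {
      "type": "integer",
      "minimum": 64,
      "default": 512,
      "description": "Tokens per chunk"
    }
  }
}
```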
Dynamic Graph
Fan in from multiple pipeline steps; fan out to multiple pipeline steps (A/B testing). The pipeline is a graph, not a line, so you can branch and share data between steps.
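As a toy illustration (step names invented), the engine's routing decision is nothing more than a lookup in an adjacency list over steps:

```java
import java.util.List;
import java.util.Map;

public class PipelineGraph {

    // Toy routing table: the parser fans out to two competing chunkers
    // (an A/B test), and both fan back in to the same sink.
    static final Map<String, List<String>> ROUTES = Map.of(
        "parser",          List.of("chunker-512", "chunker-1024"),
        "chunker-512",     List.of("opensearch-sink"),
        "chunker-1024",    List.of("opensearch-sink"),
        "opensearch-sink", List.of()
    );

    // Fan-out means the same document is dispatched to every successor.
    static List<String> nextSteps(String currentStep) {
        return ROUTES.getOrDefault(currentStep, List.of());
    }
}
```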
Architecture Overview
The Document Hop
Pipestream operates on a "Document Hop" model. The Engine is the central router, moving PipeDocs (rich metadata envelopes) between specialized modules. Think of it as the TCP of search processing.
1. Data Loading
Assets are ingested via Connectors. Data can be saved to S3 at any time; metadata is tracked through the Engine.
2. Transformation
Assets are transformed to text (Parsing). This step is skippable on "Rewind" operations.
3. Enhancement (Fan-Out)
Text is chunked and embedded. Run multiple chunkers and embedders in parallel, accumulating results in the PipeDoc.
4. Sink
All accumulated vectors (from all strategies) are indexed into OpenSearch for hybrid retrieval.
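To make the hops concrete, here is a deliberately simplified view of the envelope that accumulates state across the four steps; the real PipeDoc is a Protobuf contract, and these field names are illustrative:

```java
import java.util.List;
import java.util.Map;

// Simplified sketch of a PipeDoc: the envelope grows at every hop.
public class PipeDocSketch {
    String docId;                                  // assigned at load time (step 1)
    String s3Uri;                                  // raw asset on the data plane
    String text;                                   // filled by the parser (step 2)
    Map<String, List<float[]>> vectorsByStrategy;  // grows during fan-out (step 3)
    // At the sink (step 4), every strategy's vectors land in one
    // OpenSearch record, keyed by docId.
}
```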
Use Cases
Enterprise Search
Index internal documents, wikis, and knowledge bases with hybrid search capabilities (Lexical + Semantic).
RAG & LLM Context
Feed clean, chunked context to LLMs. Test different chunk sizes to optimize retrieval quality.
Multi-Model A/B Testing
Run OpenAI Ada and Llama3 embeddings side-by-side in the same index to compare performance.
Document Intelligence
Extract metadata and text from diverse formats (PDF, Video, Audio) with dedicated microservice parsers.
Massive Migration
Move terabytes of legacy data into modern Vector Stores using high-throughput gRPC streams.
Hybrid Search
Combine OpenSearch BM25 ranking with Dense Vector Retrieval for superior accuracy.
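Under the hood this maps to OpenSearch's hybrid query (available since 2.10 with the neural-search plugin and a normalization search pipeline). A sketch against the illustrative mapping shown earlier, with the query vector truncated to three dimensions for display:

```json
{
  "query": {
    "hybrid": {
      "queries": [
        { "match": { "text": { "query": "quarterly revenue" } } },
        {
          "knn": {
            "chunk_ada002": {
              "vector": [0.013, -0.094, 0.127],
              "k": 10
            }
          }
        }
      ]
    }
  }
}
```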
Open Source & MIT Licensed
Pipestream AI bridges the gap between backend engineering and data science. Run it on a local workstation or scale it out to index a Fortune 500 company's entire document history.