The Modern Pipeline Processor for
High-Throughput Vector Data
Ingest terabytes of unstructured data with scalable, rewindable pipelines in any language
Pipestream AI is an open-source platform designed to solve the headaches of massive unstructured data processing. Payloads are streamed and processed in a reactive, claim-check architecture, with asynchronous checkpoints stored in S3 and coordinated through Kafka. Process, chunk, and index massive datasets for RAG and Vector Search at scale.
Typical Pipelines
Traditional tools built for rows and columns choke on the unstructured chaos of the AI era.
- Real Scaling: Sequential pipelines block on slow steps (like OCR), causing backpressure that halts the system.
- Fast and Low Cost: Pushing heavy payloads (PDFs, images) through brokers kills throughput and explodes retention costs.
- Data Intact, No Schema Drift: Automated crawlers guess data types incorrectly, creating brittle pipelines that break silently.
Pipestream AI
A reactive, claim-check architecture built specifically for massive unstructured data streams.
- Reactive & Non-Blocking: Built on Quarkus and Mutiny, a single pod handles thousands of concurrent streams with minimal memory (see the sketch after this list).
- Claim-Check Pattern: Kafka carries the envelope (metadata); S3 and gRPC carry the heavy payload.
- Strict Schemas: Integration with Apicurio Registry ensures strict Protobuf contracts, preventing silent failures caused by schema drift.
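As a rough illustration of the reactive core (not Pipestream's actual code), a Mutiny stage can merge thousands of in-flight parses without ever parking a thread; `Envelope` and `parse()` are hypothetical stand-ins:

```java
import io.smallrye.mutiny.Multi;
import io.smallrye.mutiny.Uni;

public class ReactiveParseStage {

    // Hypothetical stand-in for Pipestream's PipeDoc envelope.
    record Envelope(String docId, String s3Uri) {}

    // Each envelope is parsed asynchronously and results are merged as
    // they arrive, so one slow document (e.g. a huge OCR job) never
    // blocks the thousands of streams sharing the same event loop.
    Multi<String> parseAll(Multi<Envelope> envelopes) {
        return envelopes.onItem().transformToUniAndMerge(this::parse);
    }

    // Stand-in for a non-blocking call to a parser module (e.g. over gRPC).
    Uni<String> parse(Envelope env) {
        return Uni.createFrom().item(() -> "text extracted from " + env.s3Uri());
    }
}
```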
Multi-Strategy Intelligence
Apply any number of chunking strategies and embedding models in a single crawl. Store all resulting vectors in the same OpenSearch record, opening the door to true A/B testing. Reindex your data without crawling it again.
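As a rough illustration, an index mapping could reserve one k-NN field per strategy, so every chunk record carries all of its vectors. Field names and dimensions here are illustrative (1536 for an Ada-002-style model, 4096 for a Llama-3-style model), assuming OpenSearch's k-NN plugin:

```json
{
  "mappings": {
    "properties": {
      "doc_id":        { "type": "keyword" },
      "text":          { "type": "text" },
      "chunk_ada002":  { "type": "knn_vector", "dimension": 1536 },
      "chunk_llama3":  { "type": "knn_vector", "dimension": 4096 }
    }
  }
}
```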
Claim-Check Architecture
We separate the Control Plane (Kafka) from the Data Plane (S3 + gRPC). Process terabytes of data in a horizontally scaled cluster without ever clogging your message brokers.
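A minimal sketch of the pattern, assuming the AWS SDK v2 and the plain Kafka client (topic and bucket names are invented):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

import java.nio.file.Path;

public class ClaimCheckProducer {

    private final S3Client s3 = S3Client.create();
    private final KafkaProducer<String, String> kafka; // configured elsewhere

    ClaimCheckProducer(KafkaProducer<String, String> kafka) {
        this.kafka = kafka;
    }

    void ingest(String docId, Path payload) {
        // Data plane: the heavy payload goes straight to object storage.
        String key = "raw/" + docId + ".pdf";
        s3.putObject(PutObjectRequest.builder()
                .bucket("pipestream-payloads")   // hypothetical bucket
                .key(key)
                .build(),
            RequestBody.fromFile(payload));

        // Control plane: Kafka carries only a pointer-sized envelope,
        // so broker throughput and retention costs stay flat.
        String envelope = "{\"docId\":\"" + docId
                + "\",\"s3Uri\":\"s3://pipestream-payloads/" + key + "\"}";
        kafka.send(new ProducerRecord<>("pipedocs", docId, envelope));
    }
}
```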
Time Travel & Rewind
Updated your embedding model? Reset the consumer offset. The pipeline replays the state instantly, skipping expensive ingestion steps by fetching raw text from S3.
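In practice a rewind is ordinary Kafka offset surgery. A sketch using the Kafka Admin API, with group and topic names invented:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;

public class Rewind {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Rewind the embedder group to offset 0 on partition 0.
            // The group must be stopped while offsets are rewritten.
            admin.alterConsumerGroupOffsets(
                "embedder-group",                          // hypothetical group id
                Map.of(new TopicPartition("pipedocs", 0),  // hypothetical topic
                       new OffsetAndMetadata(0L)))
                .all().get();
        }
        // On restart the embedder replays every envelope, fetching the
        // already-parsed text from S3 instead of re-running ingestion.
    }
}
```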
High-Speed gRPC
Processing steps communicate via strictly typed internal gRPC streams. Optimized flow control sustains hundreds of megabytes per second of throughput per stream.
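A hedged sketch of what a streaming hop can look like with grpc-java; the service and message types are placeholders for whatever protoc generates from your contract:

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.stub.StreamObserver;

public class ChunkStreamClient {

    // ChunkerGrpc, Blob and Chunk are placeholders for classes protoc
    // would generate from a hypothetical service such as:
    //   service Chunker { rpc Chunk(stream Blob) returns (stream Chunk); }
    public static void main(String[] args) throws InterruptedException {
        ManagedChannel channel = ManagedChannelBuilder
            .forAddress("chunker", 50051)
            .usePlaintext()                      // internal cluster traffic
            .build();

        ChunkerGrpc.ChunkerStub stub = ChunkerGrpc.newStub(channel);

        // Bidirectional stream: frames go up, chunks come back, and gRPC's
        // built-in flow control applies backpressure per stream.
        StreamObserver<Blob> upstream = stub.chunk(new StreamObserver<Chunk>() {
            @Override public void onNext(Chunk chunk)  { /* accumulate into the PipeDoc */ }
            @Override public void onError(Throwable t) { t.printStackTrace(); }
            @Override public void onCompleted()        { channel.shutdown(); }
        });

        for (Blob frame : framesFromS3()) {      // payload frames, e.g. 1 MB each
            upstream.onNext(frame);
        }
        upstream.onCompleted();
        channel.awaitTermination(30, java.util.concurrent.TimeUnit.SECONDS);
    }

    static Iterable<Blob> framesFromS3() { return java.util.List.of(); } // elided
}
```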
Democratized Config
Dynamic JSON Forms generate front ends without coding, letting admins and data scientists tweak NLP parameters (like LLM temperature) from the UI without touching backend code.
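For example, a module might publish a JSON Schema like the one below (parameters invented for illustration); the UI renders it as a bounded form with no frontend code:

```json
{
  "type": "object",
  "properties": {
    "temperature": {
      "type": "number",
      "minimum": 0.0,
      "maximum": 2.0,
      "default": 0.7,
      "description": "LLM sampling temperature"
    },
    "chunk_size": {
      "type": "integer",
      "minimum": 64,
      "default": 512,
      "description": "Tokens per chunk"
    }
  }
}
```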
Dynamic Graph
Fan in from multiple pipeline steps; fan out to multiple pipeline steps (A/B testing). The pipeline is a graph, not a line, so you can branch and share data between steps.
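As a toy illustration (step names invented), the engine's routing decision is nothing more than a lookup in an adjacency list over steps:

```java
import java.util.List;
import java.util.Map;

public class PipelineGraph {

    // Toy routing table: the parser fans out to two competing chunkers
    // (an A/B test), and both fan back in to the same sink.
    static final Map<String, List<String>> ROUTES = Map.of(
        "parser",          List.of("chunker-512", "chunker-1024"),
        "chunker-512",     List.of("opensearch-sink"),
        "chunker-1024",    List.of("opensearch-sink"),
        "opensearch-sink", List.of()
    );

    // Fan-out means the same document is dispatched to every successor.
    static List<String> nextSteps(String currentStep) {
        return ROUTES.getOrDefault(currentStep, List.of());
    }
}
```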
Architecture Overview
The Document Hop
Pipestream operates on a "Document Hop" model. The Engine is the central router, moving PipeDocs (rich metadata envelopes) between specialized modules. Think of it as the TCP of search processing.
1. Data Loading
Assets are ingested via Connectors. Data can be saved to S3 at any time; metadata is tracked through the Engine.
2. Transformation
Assets are transformed to text (Parsing). This step is skippable on "Rewind" operations.
3. Enhancement (Fan-Out)
Text is chunked and embedded. Run multiple chunkers and embedders in parallel, accumulating results in the PipeDoc.
4. Sink
All accumulated vectors (from all strategies) are indexed into OpenSearch for hybrid retrieval.
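To make the hops concrete, here is a deliberately simplified view of the envelope that accumulates state across the four steps; the real PipeDoc is a Protobuf contract, and these field names are illustrative:

```java
import java.util.List;
import java.util.Map;

// Simplified sketch of a PipeDoc: the envelope grows at every hop.
public class PipeDocSketch {
    String docId;                                  // assigned at load time (step 1)
    String s3Uri;                                  // raw asset on the data plane
    String text;                                   // filled by the parser (step 2)
    Map<String, List<float[]>> vectorsByStrategy;  // grows during fan-out (step 3)
    // At the sink (step 4), every strategy's vectors land in one
    // OpenSearch record, keyed by docId.
}
```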
Use Cases
Enterprise Search
Index internal documents, wikis, and knowledge bases with hybrid search capabilities (Lexical + Semantic).
RAG & LLM Context
Feed clean, chunked context to LLMs. Test different chunk sizes to optimize retrieval quality.
Multi-Model A/B Testing
Run OpenAI Ada and Llama3 embeddings side-by-side in the same index to compare performance.
Document Intelligence
Extract metadata and text from diverse formats (PDF, Video, Audio) with dedicated microservice parsers.
Massive Migration
Move terabytes of legacy data into modern Vector Stores using high-throughput gRPC streams.
Hybrid Search
Combine OpenSearch BM25 ranking with Dense Vector Retrieval for superior accuracy.
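Under the hood this maps to OpenSearch's hybrid query (available since 2.10 with the neural-search plugin and a normalization search pipeline). A sketch against the illustrative mapping shown earlier, with the query vector truncated to three dimensions for display:

```json
{
  "query": {
    "hybrid": {
      "queries": [
        { "match": { "text": { "query": "quarterly revenue" } } },
        {
          "knn": {
            "chunk_ada002": {
              "vector": [0.013, -0.094, 0.127],
              "k": 10
            }
          }
        }
      ]
    }
  }
}
```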
Open Source & MIT Licensed
Pipestream AI bridges the gap between backend engineering and data science. Run it on a local workstation or scale it out to index a Fortune 500 company's entire document history.