Anomaly detection

This is not a finished-project post. These are working notes on what I am building, why I am building it, what is working, what is not working yet, and where it is going. I am writing this now, mid-build, because it helps me think clearly.

The Problem

Most anomaly detection systems in streaming pipelines assume the data distribution stays roughly stable over time. Fix a threshold, train a model, deploy it, watch it degrade. In real markets, that assumption breaks constantly. Oil prices behave differently in stable periods versus volatile ones. A spike that would be anomalous in January might be routine in March after a geopolitical event. The environment shifts, and the model does not.

This is called concept drift, a well-studied problem in ML research. There is a lot of theoretical work behind it, but not much on what it actually looks like to handle it inside a production streaming pipeline, end to end, with real latency and throughput constraints. That gap is what this project is trying to address.

The core question: can I build a streaming anomaly detection system that detects when the data distribution has shifted, automatically recalibrates its models in response, and does all of this without stopping the stream?

I think it is possible. This project is my attempt to build and evaluate one.

The Architecture

The pipeline has six layers, each doing a specific job:

Live Market Data (CoinDCX WebSocket / MCX Crude Oil historical)
        ↓
Kafka (KRaft, Avro Schema Registry)
        ↓
Apache Flink (windowed aggregations + drift detection layer)
        ↓
Adaptive Anomaly Detection Layer (ADWIN + model recalibration)
        ↓
TimescaleDB (hypertables) + Redis (live cache)
        ↓
Grafana dashboards

Kafka handles ingestion. I am using Confluent Cloud's free tier with KRaft (no ZooKeeper) and Avro schema registry for typed, versioned messages. Every tick lands on mcx-crude-prices as a structured event with symbol, price, volume, and timestamp. Schema registry means that if the message structure ever changes, downstream consumers do not break silently.
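As a rough illustration of the tick event described above, the Avro schema could look something like the following. Only the four field names come from this post; the record name, namespace, and field types are assumptions made for the sketch.

```python
import json

# Hypothetical Avro schema for one tick on the mcx-crude-prices topic.
# Only the four field names are from the post; the record name,
# namespace, and types are illustrative assumptions.
TICK_SCHEMA = {
    "type": "record",
    "name": "Tick",
    "namespace": "example.marketdata",
    "fields": [
        {"name": "symbol", "type": "string"},
        {"name": "price", "type": "double"},
        {"name": "volume", "type": "double"},
        # Event time as epoch milliseconds, using Avro's logical type.
        {"name": "timestamp",
         "type": {"type": "long", "logicalType": "timestamp-millis"}},
    ],
}


def schema_json() -> str:
    """Serialise the schema to the JSON string a schema registry stores."""
    return json.dumps(TICK_SCHEMA)


def conforms(event: dict) -> bool:
    """Cheap structural check: same field names, nothing missing or extra."""
    expected = {field["name"] for field in TICK_SCHEMA["fields"]}
    return set(event) == expected
```

The `conforms` helper is only a stand-in for the real validation an Avro serializer performs; it checks field names, not types.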

Apache Flink handles stream processing. One-minute tumbling windows compute OHLCV aggregations (open, high, low, close, volume) and VWAP continuously. This is the layer where the research contribution lives — the drift detection operator runs inline here, not as a separate offline process.
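To make the window computation concrete, here is a plain-Python sketch of what a one-minute tumbling window produces. It is not the Flink job itself; it assumes ticks arrive as `(timestamp_ms, price, volume)` tuples in event-time order, which a real job would not be able to assume.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # one-minute tumbling windows


def aggregate(ticks):
    """Group (timestamp_ms, price, volume) ticks into OHLCV + VWAP bars.

    Ticks are assumed to arrive in event-time order; a real Flink job
    would rely on watermarks to deal with late or out-of-order events.
    """
    buckets = defaultdict(list)
    for ts, price, volume in ticks:
        # Tumbling: every tick belongs to exactly one window.
        window_start = ts - ts % WINDOW_MS
        buckets[window_start].append((price, volume))

    bars = {}
    for start, rows in sorted(buckets.items()):
        prices = [p for p, _ in rows]
        total_volume = sum(v for _, v in rows)
        notional = sum(p * v for p, v in rows)
        bars[start] = {
            "open": prices[0],
            "high": max(prices),
            "low": min(prices),
            "close": prices[-1],
            "volume": total_volume,
            # VWAP = sum(price * volume) / sum(volume); undefined at zero volume.
            "vwap": notional / total_volume if total_volume else None,
        }
    return bars
```

For example, three ticks at 100, 102, and 101 with volumes 1, 3, and 1 inside the same minute give a VWAP of (100 + 306 + 101) / 5 = 101.4.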

The drift detection layer uses two algorithms from the River library: ADWIN (Adaptive Windowing) and Page-Hinkley. ADWIN maintains a sliding window over incoming statistics and signals when the distribution within the window has shifted significantly. Page-Hinkley detects persistent directional shifts in a stream, which is useful for catching gradual regime changes rather than sudden jumps. When either detector fires, it publishes a drift event to a dedicated Kafka topic and triggers model recalibration downstream.

The anomaly detection layer runs three detectors in parallel, deliberately:

A static threshold CEP rule engine — the baseline, fixed thresholds for spike, drop, and volatility events
CUSUM (Cumulative Sum control chart) — adaptive, responds to persistent deviations from a running baseline
Isolation Forest — a tree-based ML model trained on recent windows, retrained when drift is detected
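The static baseline in the list above can be sketched as three fixed rules applied to each one-minute bar. The threshold values and the exact rule definitions here are placeholders I made up for illustration, not the project's actual settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # All three values are made-up placeholders.
    spike_pct: float = 2.0   # close-over-close rise, in percent
    drop_pct: float = 2.0    # close-over-close fall, in percent
    range_pct: float = 3.0   # (high - low) / low, in percent


def check_bar(prev_close, bar, t=Thresholds()):
    """Return the names of the static rules that a bar violates."""
    alerts = []
    change = (bar["close"] - prev_close) / prev_close * 100
    if change >= t.spike_pct:
        alerts.append("spike")
    if change <= -t.drop_pct:
        alerts.append("drop")
    if (bar["high"] - bar["low"]) / bar["low"] * 100 >= t.range_pct:
        alerts.append("volatility")
    return alerts
```

Because the thresholds never move, a rule tuned for a calm period will either fire constantly or stay silent once volatility changes, which is exactly the behaviour the adaptive detectors are meant to be compared against.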

The evaluation design is to run all three in parallel. The paper will compare detection accuracy, adaptation latency, and false positive rates across all three under controlled drift scenarios. The hypothesis is that the adaptive detectors significantly outperform the static threshold approach under drift, and that the ML-based approach adapts faster than CUSUM but with higher computational overhead.

TimescaleDB stores everything with time partitioning: raw ticks go in a hypertable with compression after seven days, while OHLCV aggregates, alerts, and drift events each get their own table. Redis caches the current live state separately so high-frequency dashboard queries do not hit the analytical database.
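For the raw-tick table, the setup could look roughly like this. The table and column names are hypothetical, and segmenting compressed data by symbol is my assumption; only the hypertable and the seven-day compression window come from the description above.

```python
# Hypothetical table and column names. Only the hypertable and the
# seven-day compression window are from the post.
RAW_TICKS_DDL = [
    """
    CREATE TABLE IF NOT EXISTS raw_ticks (
        ts      TIMESTAMPTZ      NOT NULL,
        symbol  TEXT             NOT NULL,
        price   DOUBLE PRECISION NOT NULL,
        volume  DOUBLE PRECISION NOT NULL
    )
    """,
    # Partition the table by time.
    "SELECT create_hypertable('raw_ticks', 'ts', if_not_exists => TRUE)",
    # Enable native compression, grouping compressed rows by symbol
    # (the segmentby column is an assumption).
    "ALTER TABLE raw_ticks SET (timescaledb.compress, "
    "timescaledb.compress_segmentby = 'symbol')",
    # Compress chunks once they are older than seven days.
    "SELECT add_compression_policy('raw_ticks', INTERVAL '7 days')",
]


def apply_ddl(cursor, statements=RAW_TICKS_DDL):
    """Run each statement on any DB-API 2.0 cursor (psycopg2, for example)."""
    for statement in statements:
        cursor.execute(statement)
```

The statements are meant for a one-time setup; as written, the compression policy call is not guarded against being run twice.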

The Algorithms

ADWIN works by maintaining a window of recent observations and continuously testing whether the mean of the older part of the window differs significantly from the mean of the newer part, across the possible split points. When it finds such a split, it signals drift and drops the older part, so the window shrinks to the most recent stable segment. It is parameter-light (the main setting is a confidence value) and handles gradual as well as sudden drift.
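To show the idea rather than the library, here is a deliberately naive ADWIN-style detector. It is not River's implementation: it assumes inputs are scaled to [0, 1], uses the Hoeffding-style cut threshold from the original ADWIN description, and checks every split point on every update, which the real algorithm avoids by compressing the window into buckets.

```python
import math


class SimpleAdwin:
    """Naive ADWIN-style detector for values in [0, 1] (illustration only).

    Every split of the window is checked on each update, so the cost is
    O(n) per observation.
    """

    def __init__(self, delta=0.002):
        self.delta = delta  # confidence parameter
        self.window = []

    def _find_cut(self):
        """Return the first split index whose two sides differ too much."""
        n = len(self.window)
        total = sum(self.window)
        head = 0.0
        for i in range(1, n):
            head += self.window[i - 1]
            n0, n1 = i, n - i
            mean0, mean1 = head / n0, (total - head) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)
            eps = math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * m))
            if abs(mean0 - mean1) > eps:
                return i
        return None

    def update(self, x):
        """Add x; return True if drift was detected and the window shrank."""
        self.window.append(x)
        drift = False
        cut = self._find_cut()
        while cut is not None:
            self.window = self.window[cut:]  # drop the older sub-window
            drift = True
            cut = self._find_cut()
        return drift
```

Feeding it 200 values around 0.2 followed by values around 0.8 leaves the window untouched through the first segment and triggers a cut shortly after the change, after which the window contains only post-change data.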

Page-Hinkley accumulates a running sum of deviations from a reference mean and signals when the cumulative sum exceeds a threshold. It is better suited for detecting slow, persistent shifts than ADWIN, which makes them complementary.
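A compact two-sided version of the Page-Hinkley test, written from the textbook description rather than copied from River. The parameter names and default values (`delta`, `threshold`) are my choices for the sketch.

```python
class PageHinkley:
    """Two-sided Page-Hinkley test (a sketch, not River's implementation)."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # deviation magnitude that is tolerated
        self.threshold = threshold  # alarm level, often written as lambda
        self.reset()

    def reset(self):
        self.n = 0
        self.mean = 0.0
        self.cum_up = 0.0    # cumulative deviation for the upward test
        self.min_up = 0.0
        self.cum_down = 0.0  # cumulative deviation for the downward test
        self.max_down = 0.0

    def update(self, x):
        """Add x; return True when a persistent shift is detected."""
        self.n += 1
        self.mean += (x - self.mean) / self.n  # running reference mean
        self.cum_up += x - self.mean - self.delta
        self.min_up = min(self.min_up, self.cum_up)
        self.cum_down += x - self.mean + self.delta
        self.max_down = max(self.max_down, self.cum_down)
        drift = (self.cum_up - self.min_up > self.threshold
                 or self.max_down - self.cum_down > self.threshold)
        if drift:
            self.reset()  # start over with a fresh reference mean
        return drift
```

On a flat stream followed by a slow linear ramp, the cumulative sum stays at its minimum during the flat part and then climbs steadily once the ramp starts, so the alarm comes from the accumulated small deviations rather than from any single large one.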

CUSUM similarly accumulates deviations but resets to zero when the sum goes negative. It is directional, detecting shifts upward or downward. Unlike ADWIN and Page-Hinkley, CUSUM runs here as an anomaly detector rather than a drift detector: it flags individual anomalous readings rather than distributional shifts.
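A minimal two-sided CUSUM along these lines might look like the sketch below. Tracking the baseline with an exponentially weighted average is my assumption about what "running baseline" means, and `k` and `h` are in raw price units here rather than standard deviations as in the textbook version.

```python
class Cusum:
    """Two-sided CUSUM against an exponentially weighted running baseline."""

    def __init__(self, k=0.5, h=5.0, alpha=0.05):
        self.k = k          # slack: deviations smaller than k are ignored
        self.h = h          # decision threshold
        self.alpha = alpha  # smoothing factor for the baseline (assumption)
        self.baseline = None
        self.pos = 0.0      # accumulated upward deviation
        self.neg = 0.0      # accumulated downward deviation

    def update(self, x):
        """Add x; return True when either cumulative sum exceeds h."""
        if self.baseline is None:
            self.baseline = x
            return False
        dev = x - self.baseline
        # Each sum is clamped at zero, so it restarts after drifting back.
        self.pos = max(0.0, self.pos + dev - self.k)
        self.neg = max(0.0, self.neg - dev - self.k)
        alarm = self.pos > self.h or self.neg > self.h
        if alarm:
            self.pos = self.neg = 0.0
        else:
            # Leave the baseline untouched on the reading that raised the alarm.
            self.baseline += self.alpha * dev
        return alarm
```

With the defaults, a price that sits at 100 and then jumps to 103 raises an alarm on the third elevated reading: the upward sum goes 2.5, then 4.85, then crosses 5.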

Isolation Forest is the ML-based baseline detector. It works by randomly partitioning the feature space and measuring how many splits it takes to isolate a point — anomalies are isolated quickly because they sit in sparse regions of the space. The key research question is whether retraining it on a post-drift window is sufficient to restore detection accuracy, and how quickly.
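The isolation idea can be shown without scikit-learn. This sketch is not a full Isolation Forest: it grows only the branch that the query point falls into, re-draws the random splits on every call, and skips the usual path-length normalisation. All function names and defaults are mine.

```python
import random


def _path_length(x, data, rng, depth=0, max_depth=10):
    """Depth at which x is isolated by random axis-aligned splits of data."""
    if depth >= max_depth or len(data) <= 1:
        return depth
    dim = rng.randrange(len(x))
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that x falls on.
    if x[dim] < split:
        side = [row for row in data if row[dim] < split]
    else:
        side = [row for row in data if row[dim] >= split]
    return _path_length(x, side, rng, depth + 1, max_depth)


def mean_path_length(x, data, n_trees=100, sample_size=256, seed=0):
    """Average isolation depth of x over n_trees random partitionings."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, min(sample_size, len(data)))
        total += _path_length(x, sample, rng)
    return total / n_trees
```

A point far from a tight cluster is usually cut off by the first split on the dimension where it is extreme, so its average path is short; a point in the middle of the cluster needs many more splits. Retraining after drift amounts to swapping `data` for a post-drift window.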

The Tools

Layer	Tool	Why
Ingestion	CoinDCX WebSocket API	Free, 24/7, high frequency — good for development
Historical data	MCX Crude Oil via jugaad-data	Real Indian market data for paper evaluation
Message broker	Apache Kafka (Confluent Cloud)	Production-grade, free tier available
Schema management	Confluent Schema Registry + Avro	Versioned, typed messages
Stream processing	Apache Flink (PyFlink 1.18.1)	Industry standard, stateful windowing
Drift detection	River library (ADWIN, Page-Hinkley)	Best Python implementation of online ML
Anomaly detectors	scikit-learn (Isolation Forest), custom CUSUM	Pluggable, comparable
Time-series storage	TimescaleDB	PostgreSQL extension, hypertables, compression
Live cache	Redis	Separates live queries from analytical queries
Dashboards	Grafana	Real-time visualisation

The Setup

I want to be honest about this because most project posts are not.

Getting PyFlink running on Windows without Docker is genuinely painful. The apache-flink pip package does not build cleanly on Python 3.13 or 3.14 because apache-beam's build scripts depend on pkg_resources, which newer setuptools dropped. The fix is pinning setuptools<81 and using Python 3.11 explicitly. Flink's binary distribution on Windows ships only .sh scripts, no .bat files, so you need Git Bash to run them. Java 17 requires additional --add-opens and --add-exports flags in flink-conf.yaml. And connecting to Confluent Cloud requires SASL/SSL configuration that PyFlink's shaded Kafka connector does not always handle cleanly with inline JAAS config strings.
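On the SASL/SSL point, the sketch below builds the connection properties for Confluent Cloud. The property keys are standard Kafka client settings. The shaded class prefix is an assumption that applies when the fat SQL Kafka connector jar is on the classpath; with an unshaded connector plus a separate kafka-clients jar, the plain class name is the one to use. The function name and arguments are hypothetical.

```python
PLAIN_LOGIN_MODULE = "org.apache.kafka.common.security.plain.PlainLoginModule"
# Assumption: the fat SQL Kafka connector relocates Kafka classes under
# this prefix. Check the jar you actually ship before relying on it.
SHADED_PREFIX = "org.apache.flink.kafka.shaded."


def confluent_properties(bootstrap, api_key, api_secret, shaded=True):
    """Build Kafka client properties for a SASL_SSL / PLAIN connection."""
    module = (SHADED_PREFIX if shaded else "") + PLAIN_LOGIN_MODULE
    return {
        "bootstrap.servers": bootstrap,
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        # The trailing semicolon is part of the JAAS syntax.
        "sasl.jaas.config": (
            f'{module} required username="{api_key}" password="{api_secret}";'
        ),
    }
```

Keeping the `shaded` switch explicit makes it easy to try both class names when the connector rejects the inline JAAS string.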

All of it takes longer than it should. This is the kind of operational friction that research papers never mention but practitioners always hit, and documenting it here is part of what makes this project useful beyond the paper itself.

The producer is running, Kafka is receiving live ticks, TimescaleDB is storing them, and PyFlink is partially up. The windowed aggregation job and the drift detection layer are the immediate next steps.

Where This Is Going

The research contribution is architectural. The novelty is not in the drift detection algorithms themselves (ADWIN and Page-Hinkley are well established), and it is not in the anomaly detectors either. The contribution is the closed-loop integration: drift detected inline, model recalibrated inline, stream uninterrupted.

The evaluation will use three drift scenarios (sudden, gradual, and recurring), injected synthetically into the CoinDCX stream during development, then validated on MCX Crude Oil historical data using the March 2020 crude oil crash as a natural, extreme drift event. The March 2020 scenario is particularly interesting because it is a well-documented external shock with a known onset date, making it possible to evaluate detection latency precisely.
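One simple way to inject the three scenarios is to add a synthetic offset to the price series from a known index onward. This sketch models drift as a shift in the mean only, which is an assumption; real regime changes also move variance. The function name and its parameters are hypothetical.

```python
def inject_drift(prices, kind, start, shift=5.0, ramp=50, period=100):
    """Return a copy of prices with a synthetic mean shift from index start.

    kind: "sudden"    -> full shift from start onward
          "gradual"   -> shift ramps in linearly over `ramp` points
          "recurring" -> shift switches on and off every `period` points
    """
    out = []
    for i, p in enumerate(prices):
        if i < start:
            offset = 0.0
        elif kind == "sudden":
            offset = shift
        elif kind == "gradual":
            offset = shift * min(1.0, (i - start + 1) / ramp)
        elif kind == "recurring":
            offset = shift if ((i - start) // period) % 2 == 0 else 0.0
        else:
            raise ValueError(f"unknown drift kind: {kind}")
        out.append(p + offset)
    return out
```

Because `start` is known exactly, detection latency is just the index of the first drift signal minus `start`, which is the same measurement the March 2020 replay is meant to give on real data.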

The same architecture applies directly to IoT sensor environments: home activity monitoring, industrial telemetry, network anomaly detection. The financial market domain is the development testbed. The research implications are broader.

What Is Next

Get the PyFlink windowed aggregation job running end to end
Implement the ADWIN drift detection operator as a Flink ProcessFunction
Build the adaptive model recalibration loop
Instrument Grafana dashboards for all layers
Begin the MCX historical data replay experiments

I will update this post as the project develops. The code is at github.com/SabihaKhanum — the repo will be public once the core pipeline is stable.
