Machine Learning with Spark Streaming – MLlib

Scalable classification using PySpark and MLlib on streaming data

Project Description

This project implements a real-time spam detection system using Apache Spark Streaming and MLlib, built for high-throughput text streams. The pipeline uses several classifiers (Perceptron, Multinomial Naïve Bayes, and an SGD-trained linear classifier) to analyze and categorize incoming messages as they arrive.
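As a rough illustration of classifying messages on the fly, the sketch below updates a perceptron one micro-batch at a time and scores each batch before training on it. It is plain Python rather than the project's code or any Spark API, and every name in it (`OnlinePerceptron`, `bag_of_words`, `process_stream`) is hypothetical. In a Spark Streaming job, a per-batch step of this kind would typically be applied to each micro-batch, for example through `DStream.foreachRDD`.

```python
class OnlinePerceptron:
    """Binary perceptron updated incrementally (1 = spam, 0 = ham)."""

    def __init__(self):
        self.weights = {}  # sparse weights keyed by token
        self.bias = 0.0

    def predict(self, features):
        score = self.bias + sum(
            self.weights.get(token, 0.0) * count
            for token, count in features.items()
        )
        return 1 if score > 0 else 0

    def partial_fit(self, examples):
        """Apply the perceptron rule to each (features, label) pair."""
        for features, label in examples:
            error = label - self.predict(features)  # -1, 0, or +1
            if error:
                for token, count in features.items():
                    self.weights[token] = (
                        self.weights.get(token, 0.0) + error * count
                    )
                self.bias += error


def bag_of_words(text):
    """Count whitespace-separated, lowercased tokens in a message."""
    counts = {}
    for token in text.lower().split():
        counts[token] = counts.get(token, 0) + 1
    return counts


def process_stream(model, batches):
    """Score each micro-batch first, then train on it."""
    accuracies = []
    for batch in batches:
        examples = [(bag_of_words(text), label) for text, label in batch]
        correct = sum(model.predict(f) == y for f, y in examples)
        accuracies.append(correct / len(examples))
        model.partial_fit(examples)
    return accuracies
```

Scoring before training means each accuracy figure reflects messages the model has not yet seen, which is closer to how a streaming classifier is judged in practice.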

Built on PySpark, the system reports a 35% improvement in classification speed, which helps it scale to large volumes of streaming data. Feature extraction and vectorization run in real time on Spark’s resilient distributed datasets (RDDs), which keeps the pipeline robust under production-like conditions.
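One common way to vectorize text in a stream is the hashing trick, which needs no vocabulary pass over the data and so can be applied to each message independently. The sketch below is a minimal plain-Python version of that idea, similar in spirit to MLlib's `HashingTF`. It is not the project's implementation: the function names, the `NUM_FEATURES` width, and the use of `zlib.crc32` as the hash function are all assumptions made for illustration.

```python
import re
import zlib

NUM_FEATURES = 1 << 10  # hypothetical vector width


def tokenize(text):
    """Lowercase the message and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def hash_vectorize(text, num_features=NUM_FEATURES):
    """Map a message to a sparse {index: count} term-frequency vector."""
    vector = {}
    for token in tokenize(text):
        # crc32 is stable across processes, unlike Python's built-in hash().
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] = vector.get(index, 0) + 1
    return vector


def vectorize_batch(messages, num_features=NUM_FEATURES):
    """Vectorize one micro-batch; each message is handled independently."""
    return [hash_vectorize(message, num_features) for message in messages]
```

Because the mapping is stateless, the same function can be applied to every record of every batch without coordination between workers. The trade-off is that distinct tokens can collide on the same index when `num_features` is small.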

This project demonstrates the practical application of distributed machine learning to unbounded data streams such as email, chat, or sensor input, combining performance optimization with accurate, real-time inference.
