Exposing Deepfakes with Vision Transformers

A ViT-based pipeline for deepfake detection and explainability using the DFDC dataset

Project Description

This project explores Vision Transformers (ViT) for detecting deepfakes in video content, with a strong focus on explainability. Using the Deepfake Detection Challenge (DFDC) dataset (~100 GB), I implemented a video classification pipeline in PyTorch that reaches a test accuracy of 93.52%. The pipeline is optimized for GPU training and designed to handle large-scale frame extraction efficiently.
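The frame-handling side of such a pipeline can be illustrated with a minimal sketch. It assumes evenly spaced frame sampling and simple averaging of per-frame "fake" probabilities into a video-level label. Neither choice is confirmed by the project itself, and the helper names `sample_frame_indices` and `aggregate_video_score` are hypothetical.

```python
from statistics import mean
from typing import List, Tuple


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick up to num_samples evenly spaced frame indices from a video.

    Assumption: uniform sampling; the project may use a different strategy.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Short video: use every frame.
        return list(range(total_frames))
    step = total_frames / num_samples
    # Take the midpoint of each of the num_samples equal-length segments.
    return [int(step * i + step / 2) for i in range(num_samples)]


def aggregate_video_score(
    frame_fake_probs: List[float], threshold: float = 0.5
) -> Tuple[float, str]:
    """Average per-frame fake probabilities into one video-level decision.

    Assumption: mean pooling with a fixed threshold (hypothetical default 0.5).
    """
    if not frame_fake_probs:
        raise ValueError("need at least one frame probability")
    score = mean(frame_fake_probs)
    return score, ("FAKE" if score >= threshold else "REAL")
```

In use, the sampled indices would select which frames to decode and pass through the ViT, and the resulting per-frame probabilities would be fed to `aggregate_video_score`. Sampling a fixed number of frames per video keeps decoding cost bounded regardless of clip length.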

To make the black-box behavior of ViTs more interpretable, I integrated attention heatmaps with OpenCV-based frame-level analysis, which makes it possible to trace visually which regions the model focused on when making a prediction. In 85% of cases, these visualizations aligned with the facial regions that were actually manipulated, a strong indicator of model transparency.
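The heatmap overlay idea can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: it assumes the attention from the class token to each patch has already been extracted as a flat vector (for example 14 x 14 = 196 values for a 224-pixel input with 16-pixel patches), and it uses plain NumPy so the example stays self-contained. The function names are hypothetical. In an OpenCV-based version, `cv2.resize`, `cv2.applyColorMap`, and `cv2.addWeighted` could replace the upsampling, coloring, and blending steps.

```python
import numpy as np


def attention_to_heatmap(cls_attention, grid_size: int, image_size: int) -> np.ndarray:
    """Turn a class-token-to-patch attention vector into an image-sized map in [0, 1].

    Assumption: cls_attention holds grid_size * grid_size values, one per patch.
    """
    if image_size % grid_size != 0:
        raise ValueError("image_size must be a multiple of grid_size")
    attn = np.asarray(cls_attention, dtype=np.float32).reshape(grid_size, grid_size)
    # Min-max normalize so the strongest patch maps to 1 and the weakest to 0.
    span = attn.max() - attn.min()
    attn = (attn - attn.min()) / span if span > 0 else np.zeros_like(attn)
    # Nearest-neighbour upsampling: each patch becomes a (scale x scale) block.
    scale = image_size // grid_size
    return np.kron(attn, np.ones((scale, scale), dtype=np.float32))


def overlay_heatmap(frame: np.ndarray, heatmap: np.ndarray, alpha: float = 0.4) -> np.ndarray:
    """Blend the heatmap into channel 0 of a uint8 (H, W, 3) frame."""
    colored = np.zeros_like(frame, dtype=np.float32)
    colored[..., 0] = heatmap * 255.0
    blended = (1.0 - alpha) * frame.astype(np.float32) + alpha * colored
    return np.clip(blended, 0, 255).astype(np.uint8)
```

The blended frame can then be inspected next to the original to check whether high-attention patches fall on the manipulated facial area.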

The system combines deep learning, computer vision, and model explainability to tackle an urgent real-world problem where trust, ethics, and AI converge.