Anomaly detection on audio data via Vision Transformers

Bechir Trabelsi
10 min read · Dec 25, 2024


In the world of machine learning, anomaly detection is a critical task across domains such as predictive maintenance, fraud detection, and quality assurance. Audio anomaly detection, in particular, involves identifying unusual or unexpected patterns in audio data, which is especially relevant in industrial applications where machinery’s health can be monitored through sound. This article outlines an innovative approach using spectrograms and vision-based models to tackle the challenge of detecting audio anomalies.

Problem Statement

Detecting anomalies in audio signals presents unique challenges:

  • High Dimensionality: Audio signals are rich and complex, which makes them hard to analyze directly without transforming them into a more manageable format.
  • Variety of Anomalies: Anomalies in audio signals can range from subtle changes in frequency to more obvious disruptions, which complicates detection.
  • Lack of Labels: Real-world audio anomaly datasets often lack labels, making it difficult to train supervised models.

To address these issues, we convert audio signals into spectrograms — a visual representation of audio — and leverage computer vision techniques for anomaly detection. This method allows us to utilize powerful vision models and frameworks for analyzing audio data effectively.

Let's start by loading our dataset from Kaggle:

!kaggle datasets download vuppalaadithyasairam/anomaly-detection-from-sound-data-fan

We use a subset of the Task 2 dataset from the DCASE 2020 Challenge, which consists of audio recordings of machines in operation. The dataset is split into training, validation, and testing sets, allowing us to train and evaluate our anomaly detection model.

  • Training: https://zenodo.org/record/3678171
  • Validation: https://zenodo.org/record/3727685
  • Testing: https://zenodo.org/record/3841772

Step 1: Transforming Audio into Spectrograms

A spectrogram represents the frequency content of an audio signal over time. It’s generated using the Short-Time Fourier Transform (STFT). In this approach, we employ the Mel-spectrogram, which maps frequencies to the Mel scale to mimic human auditory perception.

import os
import librosa
import librosa.display
import matplotlib.pyplot as plt
from tqdm import tqdm
import numpy as np
from multiprocessing import Pool, cpu_count
import gc
import random

# Define input and output directories
input_dirs = {
    "train": "/content/dev_data_fan/train",
    "test": "/content/dev_data_fan/test",
    "validation": "/content/dev_data_fan/validation"
}
output_base_dir = "/content/spectrograms"

# Function to create a spectrogram and save it as an image
def audio_to_spectrogram(args):
    audio_path, output_path = args
    try:
        # Load audio file (sr=None keeps the original sampling rate)
        y, sr = librosa.load(audio_path, sr=None)

        # Generate Mel spectrogram and convert power to decibels
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=8000)
        S_dB = librosa.power_to_db(S, ref=np.max)

        # Plot and save as an image
        plt.figure(figsize=(6, 6))
        librosa.display.specshow(S_dB, sr=sr, cmap='viridis', x_axis='time', y_axis='mel', fmax=8000)
        plt.axis('off')
        plt.tight_layout(pad=0)
        plt.savefig(output_path, bbox_inches='tight', pad_inches=0)
        plt.close('all')  # Explicitly close the figure

        # Cleanup to free memory
        del y, S, S_dB
        gc.collect()
    except Exception as e:
        print(f"Error processing {audio_path}: {e}")

# Wrapper function to process files in parallel
def process_directory(input_dir, output_dir, sample_size=1000):
    os.makedirs(output_dir, exist_ok=True)

    # List all .wav files and sample up to `sample_size`
    all_files = [f for f in os.listdir(input_dir) if f.endswith('.wav')]
    sampled_files = random.sample(all_files, min(len(all_files), sample_size))

    # Prepare arguments for multiprocessing
    tasks = []
    for file_name in sampled_files:
        input_path = os.path.join(input_dir, file_name)
        output_path = os.path.join(output_dir, f"{os.path.splitext(file_name)[0]}.png")
        tasks.append((input_path, output_path))

    # Process in parallel with a small chunksize to avoid memory spikes
    with Pool(cpu_count()) as pool:
        list(tqdm(pool.imap_unordered(audio_to_spectrogram, tasks, chunksize=10),
                  total=len(tasks),
                  desc=f"Processing {os.path.basename(input_dir)}"))

# Process each dataset split
for dataset_type, input_dir in input_dirs.items():
    output_dir = os.path.join(output_base_dir, dataset_type)
    process_directory(input_dir, output_dir, sample_size=1000)

print("Spectrogram conversion completed.")

This Python script converts audio files from multiple directories into spectrogram images. It begins by defining input directories (train, test, and validation) and an output directory for storing the spectrograms.

The audio_to_spectrogram function processes each audio file: it loads the file using librosa, generates a Mel spectrogram, converts it to decibels, and saves it as an image using matplotlib.

A parallelized wrapper function, process_directory, manages the conversion process, allowing efficient processing of large datasets. It randomly samples up to a specified number of .wav files (sample_size) from each directory, ensuring the pipeline is scalable and adaptable to different dataset sizes.

By using Python's multiprocessing capabilities, the script maximizes resource utilization, speeding up the spectrogram creation process. Finally, it iterates through the dataset types (train, test, and validation) to create and store spectrograms, preparing the data for downstream tasks such as anomaly detection.

The resulting spectrograms should look something like this:

[Figure: spectrogram of a normal audio signal]
[Figure: spectrogram of an audio signal with an anomaly]
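To spot-check the conversion, we can open a couple of the saved PNGs side by side. Below is a minimal sketch; the filenames are hypothetical placeholders, so substitute any normal and anomalous spectrogram produced by the previous step:

import matplotlib.pyplot as plt
from PIL import Image

# Hypothetical filenames - replace with actual files under /content/spectrograms
examples = {
    "normal": "/content/spectrograms/train/normal_id_00_00000000.png",
    "anomaly": "/content/spectrograms/test/anomaly_id_00_00000000.png",
}

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
for ax, (name, path) in zip(axes, examples.items()):
    ax.imshow(Image.open(path))  # Display the saved spectrogram image
    ax.set_title(name)
    ax.axis("off")
plt.show()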

Next, we create a DataFrame that maps each file path to a label:

import os
import pandas as pd

# Define the base directory for spectrograms
spectrogram_base_dir = "/content/spectrograms"

# Initialize a list to store mappings
data = []

# Iterate over dataset splits
for dataset_type in ["train", "test", "validation"]:
    dataset_dir = os.path.join(spectrogram_base_dir, dataset_type)
    for file_name in os.listdir(dataset_dir):
        if file_name.endswith('.png'):  # Process only image files
            # Determine the label
            label = 0 if file_name.startswith("normal") else 1
            # Append the mapping to the list
            data.append({
                "filepath": os.path.join(dataset_dir, file_name),
                "label": label,
                "set": dataset_type
            })

df = pd.DataFrame(data)
df.head()

Each file is labeled by checking whether its filename starts with "normal" (label 0); all other files are treated as anomalies (label 1).
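A quick sanity check on the resulting DataFrame, for instance counting normal versus anomalous files per split, helps catch labeling mistakes early:

# Count normal (0) and anomalous (1) spectrograms in each split
print(df.groupby(["set", "label"]).size())

# Overall class balance
print(df["label"].value_counts(normalize=True))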

Next, we define a pipeline for loading spectrogram images and their associated labels into PyTorch-compatible datasets and data loaders.

from torchvision import transforms
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import pandas as pd

# Custom Dataset class
class SpectrogramDataset(Dataset):
    def __init__(self, df, transform=None):
        self.df = df
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image_path = row['filepath']
        label = row['label']
        image = Image.open(image_path).convert("RGB")  # Ensure image is RGB

        if self.transform:
            image = self.transform(image)

        return image, label

# Define image transformations
transform = transforms.Compose([
    transforms.Resize((224, 224)),  # Resize to the CLIP input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])  # CLIP-style normalization
])

# Create datasets and loaders
df_train = df[df['set'] == 'train']
df_val = df[df['set'] == 'validation']

train_dataset = SpectrogramDataset(df_train, transform=transform)
val_dataset = SpectrogramDataset(df_val, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

The SpectrogramDataset class is a custom dataset implementation that reads a DataFrame containing file paths and labels. Each spectrogram image is loaded using the Python Imaging Library (PIL), converted to RGB format, and optionally transformed using a series of preprocessing steps defined by transform. The transformations include resizing the images to 224x224 pixels (a typical input size for vision models like CLIP), converting them to tensors, and normalizing them for consistency with the model's expected input range. Two datasets—train_dataset and val_dataset—are created using filtered DataFrame subsets for the training and validation splits. These datasets are wrapped into DataLoader objects to enable efficient batch processing and shuffling during training. This setup ensures the spectrogram data is correctly prepared and fed into the machine learning model for training and evaluation.

We can verify the shape of a batch from our dataset:

# Check the size of a batch from the DataLoader
for images, labels in train_loader:  # The DataLoader yields (images, labels)
    print(f"Batch size: {images.size(0)}")
    print(f"Image shape: {images.shape}")  # (batch_size, channels, height, width)
    break  # Only inspect the first batch

# Batch size: 32
# Image shape: torch.Size([32, 3, 224, 224])

The Autoencoder Approach

An autoencoder is a neural network designed to reconstruct its input. During training, the autoencoder learns to compress and reconstruct normal data. For unseen data, anomalies result in a higher reconstruction error.

Architecture:

  • Encoder: Compresses the input spectrogram into a latent representation.
  • Decoder: Reconstructs the spectrogram from the latent representation.

Autoencoders are ideal for anomaly detection in this case because they are trained to reconstruct ‘normal’ data. The reconstruction error can then be used as an anomaly score, where higher errors indicate anomalies, as the model is not familiar with such data.
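In other words, the anomaly score of a sample is simply its per-sample reconstruction MSE. A minimal sketch of that computation (the same expression appears in the evaluation code further down):

import torch

def anomaly_score(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    """Per-sample mean squared reconstruction error.

    x and x_hat are batches of shape (batch, channels, height, width);
    higher scores indicate samples the model reconstructs poorly,
    i.e. likely anomalies.
    """
    return torch.mean((x - x_hat) ** 2, dim=[1, 2, 3])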

Let's build it:


from transformers import CLIPVisionModel
import torch
import torch.nn as nn

# Select the computation device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Reconstruction loss
criterion = nn.MSELoss()


class Autoencoder(nn.Module):
    def __init__(self, vision_model):
        super(Autoencoder, self).__init__()
        self.encoder = vision_model.vision_model  # Use the CLIP visual encoder
        self.decoder = nn.Sequential(
            nn.Linear(vision_model.config.hidden_size, 512),
            nn.ReLU(),
            nn.Linear(512, 1024),
            nn.ReLU(),
            nn.Linear(1024, 224 * 224),  # Spectrograms are resized to 224x224
            nn.Sigmoid(),  # Output pixel values in the range [0, 1]
        )

    def forward(self, x):
        embeddings = self.encoder(pixel_values=x).pooler_output  # Extract pooled features
        reconstructions = self.decoder(embeddings)  # Reconstruct the spectrogram
        return reconstructions


def train_autoencoder(model, train_loader, criterion, optimizer, epochs=3):
    model.train()
    for epoch in range(epochs):
        train_loss = 0.0
        for images, _ in train_loader:  # Labels are not needed
            images = images.to(device)

            optimizer.zero_grad()
            reconstructions = model(images)

            # Reshape the flat output to a single-channel 224x224 image
            reconstructions = reconstructions.view(-1, 1, 224, 224)

            # Reconstruction loss; the single-channel output is broadcast
            # against the 3-channel input by MSELoss
            loss = criterion(reconstructions, images)
            loss.backward()
            optimizer.step()

            train_loss += loss.item()

        print(f"Epoch {epoch+1}, Train Loss: {train_loss / len(train_loader)}")


vision_model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
autoencoder = Autoencoder(vision_model).to(device)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-4)

# Train the autoencoder
train_autoencoder(autoencoder, train_loader, criterion, optimizer, epochs=5)

This code implements an autoencoder-based approach for anomaly detection in spectrograms by utilizing the pre-trained CLIP vision model from Hugging Face. It begins by setting up the computation device (GPU or CPU) and defining the loss function, which is Mean Squared Error (MSE), to measure the reconstruction error between the original and reconstructed spectrograms. The autoencoder architecture consists of an encoder, which leverages the CLIP vision model’s pre-trained visual encoder to extract high-level features from input spectrograms, and a custom decoder built using fully connected layers. The decoder maps the embeddings back to pixel-level reconstructions, with layers that progressively expand the feature dimensions and apply activation functions like ReLU for non-linearity and Sigmoid to normalize the pixel outputs to the range [0, 1]. The forward method of the autoencoder generates embeddings from the encoder and passes them through the decoder to produce reconstructed spectrograms.

The training function, train_autoencoder, uses batches of spectrograms from a data loader, processes them through the autoencoder, calculates the reconstruction loss, and updates the model weights using backpropagation. It accumulates the training loss over batches and reports progress at the end of each epoch. The training loop focuses on learning to reconstruct normal spectrograms, enabling the model to implicitly identify anomalies during inference by detecting deviations in reconstruction quality. Before training, the CLIP vision model is loaded with its pre-trained weights, and the autoencoder is initialized by attaching the decoder to the encoder. An Adam optimizer is used to adjust the model's parameters efficiently. This setup allows for robust anomaly detection, leveraging the power of pre-trained vision models and reconstruction-based learning without requiring explicit labels for anomalies. This approach is scalable, modular, and effective for scenarios where anomalies are rare or undefined during training.
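One detail worth calling out: as written, the optimizer updates both the decoder and the CLIP encoder, so the encoder is fine-tuned along with the decoder. A common alternative, sketched below (not part of the original setup), is to freeze the pre-trained encoder and train only the decoder, which is cheaper and can be more stable on small training sets:

# Optional variant: freeze the CLIP encoder so only the decoder is trained
for param in autoencoder.encoder.parameters():
    param.requires_grad = False

# Re-create the optimizer over the trainable (decoder) parameters only
optimizer = torch.optim.Adam(
    (p for p in autoencoder.parameters() if p.requires_grad), lr=1e-4
)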

Once we have trained our autoencoder, we can proceed to evaluate it:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


def evaluate_anomalies(model, test_loader, threshold=0.01):
    model.eval()
    anomaly_scores = []
    true_labels = []

    with torch.no_grad():
        for images, labels in test_loader:
            images = images.to(device)
            reconstructions = model(images).view(-1, 1, 224, 224)
            # Per-sample MSE between input and reconstruction
            # (the single-channel reconstruction is broadcast over the RGB channels)
            errors = torch.mean((images - reconstructions) ** 2, dim=[1, 2, 3])

            anomaly_scores.extend(errors.cpu().numpy())
            true_labels.extend(labels.cpu().numpy())

    # Determine anomalies based on the threshold: 1 = anomaly, 0 = normal
    predictions = [1 if score > threshold else 0 for score in anomaly_scores]

    # Evaluate metrics
    accuracy = accuracy_score(true_labels, predictions)
    precision = precision_score(true_labels, predictions, zero_division=0)
    recall = recall_score(true_labels, predictions, zero_division=0)
    f1 = f1_score(true_labels, predictions, zero_division=0)

    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")

    return anomaly_scores, predictions

Above we define the evaluate_anomalies function, which assesses the performance of the trained autoencoder in detecting anomalies using a test dataset. The function operates in evaluation mode, ensuring the model does not update its weights. It calculates anomaly scores for each sample and compares them against a predefined threshold to classify samples as normal or anomalous. Specifically, the function processes batches of test images through the model to obtain their reconstructions, then computes the Mean Squared Error (MSE) between the original and reconstructed spectrograms for each sample. This error serves as the anomaly score, where higher errors indicate a greater likelihood of anomaly.

The function uses a threshold to convert the anomaly scores into binary predictions: samples with scores exceeding the threshold are labeled as anomalies (1), while others are classified as normal (0). It also collects the true labels from the test data for comparison. After processing all test samples, the function evaluates the model’s performance using standard classification metrics: accuracy, precision, recall, and F1-score. These metrics provide a comprehensive view of the model’s ability to detect anomalies while minimizing false positives and false negatives.
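The threshold above is fixed at 0.01. A data-driven alternative, sketched below (not part of the original code; estimate_threshold is a hypothetical helper), is to compute reconstruction errors on normal samples and take a high percentile of that distribution as the cut-off:

import numpy as np

def estimate_threshold(model, loader, percentile=95.0):
    """Estimate an anomaly threshold from the reconstruction errors of normal samples."""
    model.eval()
    normal_scores = []
    with torch.no_grad():
        for images, labels in loader:
            images = images.to(device)
            reconstructions = model(images).view(-1, 1, 224, 224)
            errors = torch.mean((images - reconstructions) ** 2, dim=[1, 2, 3])
            # Keep only samples labelled as normal (label == 0)
            mask = (labels == 0).to(errors.device)
            normal_scores.extend(errors[mask].cpu().numpy())
    return float(np.percentile(normal_scores, percentile))

# Usage:
# threshold = estimate_threshold(autoencoder, train_loader)
# anomaly_scores, predictions = evaluate_anomalies(autoencoder, val_loader, threshold=threshold)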

The model yields the following results without any hyperparameter tuning:

The results show that the model achieves high recall, correctly identifying all anomalies, but the precision suggests there is room for improvement. A more balanced precision-recall tradeoff could be achieved through hyperparameter tuning or more advanced techniques such as data-driven anomaly-score thresholds. These results are still impressive given that we did not perform any hyperparameter tuning.

Below is a detailed breakdown of the individual metrics:

  1. Accuracy (0.7920): This indicates that the model correctly identified the majority of normal and anomalous instances, but there is still room for improvement, particularly in distinguishing between the two classes with high precision and recall.
  2. Precision (0.7920): Precision measures how many of the predicted anomalies were actually anomalies. With a value of 0.792, the model appears to be reasonably good at detecting anomalies, though there may still be some false positives that could be reduced with tuning or more advanced techniques.
  3. Recall (1.0000): Achieving a perfect recall score means that the model correctly identified all anomalies in the dataset. This suggests that the model is highly sensitive to anomalies, but this comes at the cost of precision (i.e., possibly flagging some normal data as anomalies).
  4. F1-Score (0.8839): The F1-score balances precision and recall, and a value of 0.8839 is very strong, indicating that the model performs well overall in detecting anomalies while maintaining a balance between false positives and false negatives.

Advantages of This Method

  1. Leverages Vision Models: Spectrograms unlock the power of pre-trained vision architectures.
  2. Scalable: Easily extendable to other domains by changing the data input.
  3. Unsupervised Training: Requires only normal data for training, reducing the need for labeled anomalies.
  4. Explainable Results: Reconstruction errors provide intuitive insights into anomalies.
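To make the last point concrete, here is a rough sketch of how the per-pixel reconstruction error could be visualised for a single spectrogram (illustrative only; plot_error_map is a hypothetical helper that reuses the model, device, and transforms defined above):

import torch
import matplotlib.pyplot as plt

def plot_error_map(model, image_tensor):
    """Visualise where the reconstruction error is concentrated.

    image_tensor: one transformed spectrogram of shape (3, 224, 224).
    """
    model.eval()
    with torch.no_grad():
        x = image_tensor.unsqueeze(0).to(device)       # add a batch dimension
        reconstruction = model(x).view(1, 1, 224, 224)
        # Squared error per pixel, averaged over the colour channels
        error_map = ((x - reconstruction) ** 2).mean(dim=1).squeeze().cpu().numpy()

    plt.imshow(error_map, cmap="hot")
    plt.title("Per-pixel reconstruction error")
    plt.colorbar()
    plt.axis("off")
    plt.show()

# Example: inspect the first validation sample
# image, label = val_dataset[0]
# plot_error_map(autoencoder, image)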

Conclusion

In this article, we demonstrated how transforming audio signals into spectrograms and applying vision models for anomaly detection offers a scalable and effective approach for industrial applications.

By converting audio anomalies into a visual problem, we tap into the rich ecosystem of image-based machine learning. This methodology is not only versatile but also achieves excellent performance with minimal feature engineering. As a next step, further enhancements could include pre-training on a larger spectrogram dataset or exploring advanced anomaly detection techniques like variational autoencoders or diffusion models.
