Customer Data Analysis Platform

Unstructured, multi-modal comms turned into clustered insight at scale.

Role: AI Engineer — design & build
Year: 2025
Access: Private — walkthrough on request

Databricks
Transcription
UMAP
HDBSCAN
Embeddings
Claude (Sonnet 4.6)

Problem

Tenants sit on years of customer communication — emails and call recordings — locked in formats nobody mines: ZIP, PST, MSG. That archive is exactly where the real signal lives about how customers actually work and where they struggle. But it's unstructured, multi-modal, and at a scale no human can read. Without structured insight from it, any tenant-specific agent or automation is built on guesswork.

My role

AI Engineer — I built the pipeline end to end: ingestion, audio normalisation and transcription, email extraction, cleaning, embedding, dimensionality reduction, clustering, representative sampling, and the final LLM analysis stage.

Approach & architecture

The pipeline turns a messy archive upload into named, clustered insight. Crucially, it never tries to send everything to an LLM — it clusters first, then sends only representative conversations for interpretation.

Unstructured archives → clustered insight

1. Ingest. Tenants upload ZIP/PST/MSG via SharePoint; files are copied into Databricks volumes and archives extracted.
2. Normalise audio. Recordings are standardised to 16 kHz mono MP3 and split into ≤20-minute chunks for reliable transcription.
3. Transcribe & extract. Calls are transcribed, email content is extracted, and text is cleaned.
4. Embed. Conversations are turned into vector embeddings.
5. Reduce & cluster. UMAP reduces dimensionality; HDBSCAN finds natural clusters without pre-specifying how many.
6. Sample & interpret. Representative conversations per cluster are sent to Claude (Sonnet 4.6) to surface use-cases, workflows and interaction patterns.
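The audio normalisation step can be sketched roughly as follows. This is a minimal illustration, not the project's code: it assumes ffmpeg is the conversion tool (the write-up does not name one), and the names `plan_chunks` and `ffmpeg_args` are hypothetical. It plans chunk boundaries of at most 20 minutes and builds, without running, a command that produces 16 kHz mono MP3.

```python
from typing import List, Tuple

MAX_CHUNK_SECONDS = 20 * 60  # the 20-minute ceiling from the normalisation step


def plan_chunks(duration_s: float,
                max_chunk_s: float = MAX_CHUNK_SECONDS) -> List[Tuple[float, float]]:
    """Return (start, end) offsets in seconds that cover the whole recording."""
    duration_s = float(duration_s)
    chunks: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks


def ffmpeg_args(src: str, dst: str, start_s: float, end_s: float) -> List[str]:
    """Build (but do not run) an ffmpeg command for one chunk.

    Assumes ffmpeg is installed; the .mp3 extension on dst selects MP3 output.
    """
    return [
        "ffmpeg", "-y",
        "-ss", str(start_s),         # seek to the chunk start
        "-t", str(end_s - start_s),  # read only this chunk's duration
        "-i", src,
        "-ac", "1",                  # downmix to mono
        "-ar", "16000",              # resample to 16 kHz
        dst,
    ]


if __name__ == "__main__":
    # A 50-minute call becomes two full chunks and one 10-minute remainder.
    for i, (start, end) in enumerate(plan_chunks(50 * 60)):
        print(ffmpeg_args("call.wav", f"call_{i:03d}.mp3", start, end))
```

Planning boundaries separately from running the conversion keeps the chunking rule easy to test, and the command list can be handed to `subprocess.run` once the plan looks right.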

Cluster-then-sample keeps the LLM stage affordable and the conclusions grounded in the whole corpus, not a handful of anecdotes.
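One plausible way to choose representatives is to take the conversations closest to each cluster's centroid. The sketch below assumes that approach and cosine similarity; the write-up does not say which selection rule was used, and `pick_representatives` is a hypothetical name. It also assumes the HDBSCAN convention that label -1 marks noise, so those points are never sampled.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def _unit(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def pick_representatives(embeddings: Sequence[Sequence[float]],
                         labels: Sequence[int],
                         per_cluster: int = 3) -> Dict[int, List[int]]:
    """Map each cluster label to the indices of its most central members."""
    members: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:  # skip noise points
            members[label].append(idx)

    reps: Dict[int, List[int]] = {}
    for label, idxs in members.items():
        units = [_unit(embeddings[i]) for i in idxs]
        dim = len(units[0])
        # Centroid of the unit vectors, re-normalised for cosine comparison.
        centroid = _unit([sum(u[d] for u in units) / len(units) for d in range(dim)])
        scored = sorted(
            zip(idxs, units),
            key=lambda pair: -sum(a * b for a, b in zip(pair[1], centroid)),
        )
        reps[label] = [i for i, _ in scored[:per_cluster]]
    return reps
```

Only the returned indices need to reach the LLM stage, so the prompt size scales with the number of clusters rather than the size of the archive.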

Hard parts

Messy, multi-modal ingestion. PST/MSG/ZIP are awkward archive formats and audio quality varies wildly. Normalisation and ≤20-minute chunking were what made transcription reliable rather than flaky.
Scale and cost. You cannot transcribe and embed an archive and then feed all of it to an LLM. Clustering first, then sampling representatives per cluster, is what makes the economics work.
Why UMAP + HDBSCAN. Density-based clustering finds natural groupings without committing to a fixed number of clusters up front; UMAP first projects the embeddings into a lower-dimensional space, which makes density estimation tractable. Tuning both so the clusters were genuinely meaningful was the core data-science work.
Representativeness. Sample selection had to actually represent each cluster, so the model's summary generalises instead of fixating on an outlier.
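The UMAP + HDBSCAN stage described above might look like the following sketch. It assumes the `umap-learn` and `hdbscan` Python packages, and the parameter values are illustrative starting points, not the tuned settings from the project. `cluster_embeddings` and `summarise_labels` are hypothetical names; the summary helper shows the kind of quick check (cluster count, noise share) that tuning tends to rely on.

```python
from collections import Counter
from typing import Dict, Iterable

# Illustrative starting values only; the real settings were tuned per corpus.
UMAP_PARAMS = {
    "n_neighbors": 15,
    "n_components": 5,   # low-dimensional target space for clustering
    "min_dist": 0.0,     # pack similar points tightly, which suits density clustering
    "metric": "cosine",
    "random_state": 42,
}
HDBSCAN_PARAMS = {
    "min_cluster_size": 15,
    "min_samples": 5,
    "metric": "euclidean",
}


def cluster_embeddings(embeddings):
    """Reduce with UMAP, then cluster with HDBSCAN.

    Returns one integer label per row; -1 marks noise.
    """
    # Imported lazily so summarise_labels works without these packages installed.
    import umap  # provided by the umap-learn package
    import hdbscan

    reduced = umap.UMAP(**UMAP_PARAMS).fit_transform(embeddings)
    return hdbscan.HDBSCAN(**HDBSCAN_PARAMS).fit_predict(reduced)


def summarise_labels(labels: Iterable[int]) -> Dict[str, float]:
    """Report cluster count, noise share and largest cluster size."""
    labels = [int(label) for label in labels]
    counts = Counter(labels)
    noise = counts.pop(-1, 0)
    return {
        "n_clusters": len(counts),
        "noise_fraction": noise / len(labels) if labels else 0.0,
        "largest_cluster": max(counts.values(), default=0),
    }
```

A high noise fraction or a single dominant cluster is usually a sign that `min_cluster_size` or `n_neighbors` needs another look before any samples go to the LLM.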

Impact

Turned unstructured, multi-modal archives into clustered, named insight at scale.
The output feeds tenant-specific agents and automations — closing the loop with the agent-building work above.
‹defensible metric: volume of comms processed / distinct use-cases surfaced per tenant›.
