Multi-modal data → insight

Joblogic

Customer Data Analysis Platform

Unstructured, multi-modal comms turned into clustered insight at scale.

Role
AI Engineer — design & build
Year
2025
Access
Private — walkthrough on request
  • Databricks
  • Transcription
  • UMAP
  • HDBSCAN
  • Embeddings
  • Claude (Sonnet 4.6)

Problem

Tenants sit on years of customer communication — emails and call recordings — locked in formats nobody mines: ZIP, PST, MSG. That archive is exactly where the real signal lives about how customers actually work and where they struggle. But it's unstructured, multi-modal, and at a scale no human can read. Without structured insight from it, any tenant-specific agent or automation is built on guesswork.

My role

AI Engineer — I built the pipeline end to end: ingestion, audio normalisation and transcription, email extraction, cleaning, embedding, dimensionality reduction, clustering, representative sampling, and the final LLM analysis stage.

Approach & architecture

The pipeline turns a messy archive upload into named, clustered insight. Crucially, it never tries to send everything to an LLM — it clusters first, then sends only representative conversations for interpretation.

Unstructured archives → clustered insight
  1. Ingest

    Tenants upload ZIP/PST/MSG via SharePoint; files are copied into Databricks volumes and archives extracted.

  2. Normalise audio

    Recordings are standardised to 16 kHz mono MP3 and split into ≤20-minute chunks for reliable transcription.

  3. Transcribe & extract

    Calls are transcribed, email content is extracted, and text is cleaned.

  4. Embed

    Conversations are turned into vector embeddings.

  5. Reduce & cluster

    UMAP reduces dimensionality; HDBSCAN finds natural clusters without pre-specifying how many.

  6. Sample & interpret

    Representative conversations per cluster are sent to Claude (Sonnet 4.6) to surface use-cases, workflows and interaction patterns.
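
The audio normalisation step could look roughly like the sketch below. It is an illustration under stated assumptions, not the production code: it assumes the `ffmpeg` binary is installed, and the function names, the output naming scheme and the 64 kbit/s bitrate are hypothetical. It relies on ffmpeg's segment muxer, which cuts on MP3 frame boundaries, so chunk lengths land at approximately the requested 20 minutes.

```python
import subprocess
from pathlib import Path

CHUNK_SECONDS = 20 * 60  # 20-minute ceiling from the pipeline description


def build_ffmpeg_command(src: Path, out_dir: Path,
                         chunk_seconds: int = CHUNK_SECONDS) -> list[str]:
    """Build an ffmpeg call: 16 kHz mono MP3, split into fixed-length chunks."""
    pattern = out_dir / f"{src.stem}_%03d.mp3"  # hypothetical naming scheme
    return [
        "ffmpeg", "-y",
        "-i", str(src),
        "-vn",                # ignore any video stream
        "-ac", "1",           # mono
        "-ar", "16000",       # 16 kHz sample rate
        "-c:a", "libmp3lame",
        "-b:a", "64k",        # assumed bitrate, not stated in the write-up
        "-f", "segment",      # ffmpeg's segment muxer does the splitting
        "-segment_time", str(chunk_seconds),
        str(pattern),
    ]


def normalise_and_chunk(src: Path, out_dir: Path) -> list[Path]:
    """Run ffmpeg and return the chunk files in order."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_command(src, out_dir), check=True)
    return sorted(out_dir.glob(f"{src.stem}_*.mp3"))
```

Keeping the command builder separate from the subprocess call makes the settings easy to inspect and unit-test without touching real audio.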

Cluster-then-sample keeps the LLM stage affordable and the conclusions grounded in the whole corpus, not a handful of anecdotes.
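
As a rough sketch of the reduce-and-cluster step, assuming the open-source `umap-learn` and `hdbscan` Python packages: every parameter value below (`n_neighbors`, `n_components`, `min_dist`, `min_cluster_size`, the seed) is an illustrative placeholder, not the tuned configuration, and the function names are hypothetical.

```python
from collections import defaultdict


def reduce_and_cluster(embeddings, n_components=5, min_cluster_size=15, seed=42):
    """Return one HDBSCAN label per embedding; -1 marks noise."""
    # Imported lazily so the grouping helper below works without these packages.
    import umap      # provided by the umap-learn package
    import hdbscan

    reducer = umap.UMAP(
        n_neighbors=15,
        n_components=n_components,
        min_dist=0.0,          # tight packing tends to suit density clustering
        metric="cosine",
        random_state=seed,
    )
    reduced = reducer.fit_transform(embeddings)

    # No cluster count is passed: HDBSCAN derives it from the density structure.
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                                metric="euclidean")
    return clusterer.fit_predict(reduced).tolist()


def group_by_cluster(labels):
    """Map cluster id -> row indices, dropping HDBSCAN's noise label (-1)."""
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        if label != -1:
            groups[label].append(index)
    return dict(groups)
```

Points labelled -1 are treated as noise and left out of the groups, so only conversations that belong to a dense region reach the sampling stage.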

Hard parts

  • Messy, multi-modal ingestion. PST/MSG/ZIP are awkward archive formats and audio quality varies wildly. Normalisation and ≤20-minute chunking were what made transcription reliable rather than flaky.
  • Scale and cost. An archive can be transcribed and embedded in full, but feeding all of it to an LLM is not affordable. Clustering first, then sampling representatives per cluster, is what makes the economics work.
  • Why UMAP + HDBSCAN. Density-based clustering finds natural groupings without committing to a fixed number of clusters up front; UMAP makes that tractable in high dimensions. Tuning both so the clusters were genuinely meaningful was the core data-science work.
  • Representativeness. Sample selection had to actually represent each cluster, so the model's summary generalises instead of fixating on an outlier.
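
One simple way to pick representatives, shown here only as an assumption since the write-up does not say which selection rule was used, is to rank each cluster's members by cosine similarity to the cluster centroid and keep the top few. The names `pick_representatives` and `k` are hypothetical.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_representatives(vectors, member_indices, k=3):
    """Return up to k member indices closest to the cluster centroid."""
    if not member_indices:
        return []
    members = [vectors[i] for i in member_indices]
    dims = len(members[0])
    centroid = [sum(v[d] for v in members) / len(members) for d in range(dims)]
    ranked = sorted(member_indices,
                    key=lambda i: _cosine(vectors[i], centroid),
                    reverse=True)
    return ranked[:k]


# Toy cluster: three similar vectors and one outlier (index 3).
vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
representatives = pick_representatives(vectors, [0, 1, 2, 3], k=2)
```

In the toy example the outlier ranks last, so it never reaches the LLM prompt. Density-based clusters are not always compact around a centroid, so ranking by the clusterer's own membership strength is another option worth comparing.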

Impact

  • Turned unstructured, multi-modal archives into clustered, named insight at scale.
  • The output feeds tenant-specific agents and automations — closing the loop with the agent-building work above.
  • Defensible metric: volume of comms processed / distinct use-cases surfaced per tenant.