Multimodal Content Enrichment Pipeline for Product Data

A scalable LLM-powered system that automates product content translation and marketplace listing generation for 100k+ SKUs across multiple platforms, achieving 96% auto-mapping accuracy and a 72% reduction in manual localization effort.

Overview & Goal

A fast-growing European e-commerce company needed to publish 100k+ SKUs across multiple marketplaces, each with unique listing templates, attribute requirements, and tone guidelines. Their product data existed in proprietary databases with inconsistent coverage—some SKUs lacked descriptions, others had missing attributes.

The goal: build a Python/Django system that could automatically ingest CSV data, normalize it to a canonical schema, and generate platform-compliant listings in multiple languages with image-aware content enrichment.

Results

The system achieved remarkable efficiency gains:

  • 96% accuracy auto-mapping unfamiliar CSV headers to canonical fields (improving to 98% with learned corrections)
  • 72% reduction in manual localization effort and 38% faster time-to-listing
  • 11–17% lift in product page conversion on A/B test markets
  • Less than 2% rejection rate by marketplaces (down from 12% baseline)

Challenges

Building a multilingual, multi-marketplace content pipeline involved complex technical and business challenges:

  • Unknown CSV Headers: Vendors used idiosyncratic headers like "Comp Width", "W (mm)", "shoe_upper" that needed intelligent mapping
  • Platform Heterogeneity: Each marketplace enforced unique field sets, lengths, enums, and SEO tone requirements
  • LLM Hallucinations: Preventing factual errors and maintaining grounding in source data
  • Multilingual Quality: Maintaining brand tone and technical accuracy across 10+ languages
  • Cost & Latency Control: Managing LLM API costs and processing times for large catalogs

Solution

We engineered a comprehensive Python/Django system with OpenAI integration that transforms messy product data into high-quality, localized marketplace listings through intelligent automation and validation.

Key System Components

1. Intelligent Data Ingestion

CSV uploads pass through dual-engine header mapping that combines embeddings-based similarity matching with symbolic rules for semantic understanding. An active learning loop lets human corrections improve accuracy over time, and Pydantic validation ensures data quality from ingestion onward.
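
As a minimal sketch of the dual-engine mapping (assuming the openai Python SDK and an embedding model; the canonical fields and symbolic rules shown are illustrative, not the production set), the ingestion step might look roughly like this:

```python
import re

import numpy as np
from openai import OpenAI  # assumes the openai>=1.0 Python SDK

client = OpenAI()

# Illustrative subset of the canonical schema fields
CANONICAL_FIELDS = ["width_mm", "upper_material", "color", "heel_height_mm"]

# Symbolic rules catch idiosyncratic vendor headers before the embedding fallback
SYMBOLIC_RULES = [
    (re.compile(r"comp\s*width|w\s*\(?mm\)?", re.I), "width_mm"),
    (re.compile(r"upper", re.I), "upper_material"),
]

def embed(texts):
    """Embed a list of strings with an OpenAI embedding model."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def map_header(header, field_vecs, threshold=0.6):
    """Return (canonical_field, confidence); below-threshold matches go to human review."""
    for pattern, canonical_field in SYMBOLIC_RULES:
        if pattern.search(header):
            return canonical_field, 1.0
    h = embed([header])[0]
    sims = field_vecs @ h / (np.linalg.norm(field_vecs, axis=1) * np.linalg.norm(h))
    best = int(sims.argmax())
    if sims[best] >= threshold:
        return CANONICAL_FIELDS[best], float(sims[best])
    return None, float(sims[best])  # unmapped headers are queued for review

field_vecs = embed(CANONICAL_FIELDS)
print(map_header("Comp Width", field_vecs))  # caught by a symbolic rule
print(map_header("Colour", field_vecs))      # falls back to embedding similarity
```

Human corrections could then be persisted as new symbolic rules or few-shot hints, which is one way to realize the learned-corrections loop behind the 96% → 98% improvement.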

2. Multi-Modal Content Generation

OpenAI vision models extract product attributes from images while text models generate localized content. Structured JSON outputs via function calling ensure reliable data contracts. Image-aware descriptions provide richer, more compelling copy.
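
A hedged sketch of that contract, using OpenAI function calling with a vision-capable chat model (the tool name `record_product_attributes` and the attribute fields are illustrative):

```python
import json

from openai import OpenAI

client = OpenAI()

# Illustrative JSON Schema contract for attributes extracted from a product image
extract_attributes_tool = {
    "type": "function",
    "function": {
        "name": "record_product_attributes",
        "description": "Record product attributes visible in the image.",
        "parameters": {
            "type": "object",
            "properties": {
                "color": {"type": "string"},
                "material": {"type": "string"},
                "pattern": {"type": "string"},
            },
            "required": ["color", "material", "pattern"],
        },
    },
}

def extract_from_image(image_url: str) -> dict:
    """Ask a vision-capable model to return image-derived attributes as structured JSON."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the product attributes from this image."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        tools=[extract_attributes_tool],
        tool_choice={"type": "function", "function": {"name": "record_product_attributes"}},
    )
    call = resp.choices[0].message.tool_calls[0]
    return json.loads(call.function.arguments)  # expected to follow the declared schema
```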

3. Platform Adaptation Engine

Canonical product schema transforms to platform-specific formats (Amazon, Shopify, Zalando, bol.com). Constraint-aware generation respects field lengths, mandatory attributes, and marketplace policies. Automated compliance linting prevents rejections.
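
The compliance-linting idea can be illustrated with a small validation pass; the per-marketplace rules below are invented stand-ins for the real configurations:

```python
from dataclasses import dataclass

# Illustrative per-marketplace constraints; real configs would live in the database
PLATFORM_RULES = {
    "amazon":  {"title_max": 200, "required": {"brand", "bullet_points"}, "condition_enum": {"New", "Used"}},
    "zalando": {"title_max": 70,  "required": {"brand", "material"},      "condition_enum": {"New"}},
}

@dataclass
class CanonicalProduct:
    title: str
    brand: str
    material: str
    condition: str
    bullet_points: list

def lint_listing(product: CanonicalProduct, platform: str) -> list:
    """Return the compliance violations that would get this listing rejected."""
    rules = PLATFORM_RULES[platform]
    issues = []
    if len(product.title) > rules["title_max"]:
        issues.append(f"title exceeds {rules['title_max']} chars")
    for name in rules["required"]:
        if not getattr(product, name, None):
            issues.append(f"missing required field: {name}")
    if product.condition not in rules["condition_enum"]:
        issues.append(f"condition '{product.condition}' not allowed")
    return issues
```

In the pipeline, a listing would only be pushed to a marketplace once this kind of lint pass comes back empty.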

4. Quality & Governance Framework

Pydantic models enforce strict validation contracts. Human-in-the-loop review for low-confidence cases. Audit trails track every generation with prompt versioning and content hashing for full observability.
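
A minimal sketch of such a contract together with the audit record, assuming Pydantic v2 (the field constraints, confidence score, and review threshold are illustrative):

```python
import hashlib

from pydantic import BaseModel, Field

class GeneratedListing(BaseModel):
    """Validation contract every generated listing must satisfy before it is persisted."""
    sku: str
    language: str = Field(pattern=r"^[a-z]{2}$")
    title: str = Field(min_length=10, max_length=200)
    description: str = Field(min_length=50)
    confidence: float = Field(ge=0.0, le=1.0)

REVIEW_THRESHOLD = 0.8  # illustrative cutoff for human-in-the-loop routing

def accept_or_route(raw: dict, prompt_version: str) -> dict:
    """Validate a raw LLM output and build the audit record stored alongside it."""
    listing = GeneratedListing.model_validate(raw)  # raises on any contract violation
    return {
        "sku": listing.sku,
        "prompt_version": prompt_version,
        "content_hash": hashlib.sha256(listing.description.encode()).hexdigest(),
        "needs_review": listing.confidence < REVIEW_THRESHOLD,
    }
```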

Technical Architecture

  • Core Stack: Python, Django + DRF, Pydantic for contracts, PostgreSQL, Redis
  • Orchestration: Celery with backpressure control and adaptive model routing (sketched after this list)
  • AI Integration: OpenAI text + vision APIs with structured outputs and cost optimization
  • Localization: Per-language glossaries, term locking, and style guides per marketplace
  • Monitoring: Real-time dashboards for cost, latency, and quality metrics per SKU and market
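
For the orchestration layer referenced above, here is a rough sketch of how backpressure and adaptive model routing can be expressed in a Celery task (broker URL, rate limit, and model names are placeholders):

```python
from celery import Celery

app = Celery("enrichment", broker="redis://localhost:6379/0")  # placeholder broker URL

@app.task(
    rate_limit="30/m",           # backpressure: cap LLM calls per worker
    autoretry_for=(Exception,),  # retry transient API failures
    retry_backoff=True,
    max_retries=3,
)
def generate_listing(sku: str, market: str, creative: bool = False):
    """Generate one listing; route creative copy to a premium model, the rest to a cheap one."""
    model = "gpt-4o" if creative else "gpt-4o-mini"  # adaptive model routing
    # ... invoke the generation pipeline with the selected model ...
    return {"sku": sku, "market": market, "model": model}
```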

Key Innovations

  • Grounded Generation: LLMs instructed to use canonical data as the single source of truth, preventing hallucinations
  • Fusion Prompting: Combining structured attributes with visual evidence for richer, more accurate descriptions (see the sketch after this list)
  • Adaptive Cost Control: Cheaper models for simple transformations, premium models for creative content
  • Active Learning: System improves accuracy through feedback loops and learned corrections
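
To make grounded generation and fusion prompting concrete, here is a rough prompt-builder sketch; the attribute names and values are invented for illustration:

```python
canonical = {"sku": "SKU-123", "width_mm": 95, "upper_material": "full-grain leather"}
vision_facts = {"color": "dark brown", "pattern": "brogue perforation"}  # from the vision step

def build_fusion_prompt(canonical: dict, vision_facts: dict, language: str) -> str:
    """Fuse structured attributes with visual evidence and pin the model to those facts."""
    facts = "\n".join(f"- {k}: {v}" for k, v in {**canonical, **vision_facts}.items())
    return (
        f"Write a product description in {language}.\n"
        "Use ONLY the facts listed below. Do not invent attributes that are not listed.\n"
        f"Facts:\n{facts}"
    )

print(build_fusion_prompt(canonical, vision_facts, "German"))
```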

If you're looking for scalable SaaS design, deep integration with complex marketplace APIs, or LLM-powered tooling for real-world operations, this project is a proven case study of robust, end-to-end execution.

Project Info

  • Role: Software Engineer
  • Type: Enterprise LLM System
  • Date: 2024
  • Scale: 100k+ SKUs

Tech Stack

  • Backend: Python, Django + DRF
  • AI/ML: OpenAI API, Vision Models
  • Validation: Pydantic, Custom Validators
  • Database: PostgreSQL, Redis
  • Orchestration: Celery, Task Queues
  • Monitoring: Grafana, Sentry