🎓 LinguaAI System Architecture

Language Learning Platform with Conversational AI Tutor and Real-Time Pronunciation Feedback

📌 How to Use These Diagrams

1️⃣ Complete System Architecture
Overview: This diagram shows the complete LinguaAI system across all five architectural layers:
  • Presentation Layer: React Web App + Browser Extension
  • Application Layer: FastAPI Gateway + Socket.io Server
  • AI/ML Layer: Whisper (ASR) + Fine-tuned Whisper (MDD) + PER Calculator + LLM
  • Processing Modules: OCR, Profanity Filter, Waveform Generator, Gamification Engine
  • Data Layer: PostgreSQL (Supabase)
```mermaid
%%{init: {'flowchart': {'nodeSpacing': 150, 'rankSpacing': 120}}}%%
graph TB
    subgraph PresentationLayer["PRESENTATION LAYER<br/>(Client-Side)"]
        ReactWeb["React Web Application<br/>• Pronunciation feedback UI<br/>• Contextual scenarios<br/>• Gamification (badges, streaks)<br/>• Analytics dashboard (D3.js)<br/>• Flashcards, OCR scanning"]
        BrowserExt["Browser Extension<br/>• Word selection from web<br/>• One-click dictionary add<br/>• Context capture"]
    end
    subgraph ApplicationLayer["APPLICATION LAYER<br/>(Backend Services)"]
        FastAPI["FastAPI Gateway<br/>• Authentication & Authorization (JWT)<br/>• Request routing & validation<br/>• OWASP compliance<br/>• Rate limiting & load balancing<br/><br/>Endpoints:<br/>/api/auth/* | /api/user/*<br/>/api/dictionary/* | /api/scenarios/*<br/>/api/flashcards/* | /api/progress/*<br/>/api/ocr/scan"]
        SocketIO["Socket.io Server<br/>• Real-time WebSocket<br/>• Live pronunciation feedback<br/>• Waveform streaming<br/>• Session management"]
    end
    subgraph AIMLLayer["AI/ML LAYER<br/>(Machine Learning Services)"]
        ASR["ASR Service<br/>(Whisper - OpenAI)<br/>• Speech-to-text transcription<br/>• Open conversation handling<br/>• 95% accuracy with noise cancellation<br/><br/>DSAI: Collection Phase<br/>• Audio data collection<br/>• Transcription storage"]
        MDD["MDD Service<br/>(Fine-tuned Whisper)<br/>• Phoneme-level analysis<br/>• Mispronunciation detection<br/>• Native language adaptation<br/>(e.g., Arabic B/P confusion)<br/><br/>DSAI: Analysis & Modeling<br/>• Error categorization<br/>• Pattern recognition"]
        PER["PER Calculator<br/>(Custom Algorithm)<br/>• Phone Error Rate calculation<br/>• Phoneme comparison<br/>• Pronunciation score (0-100%)<br/>• Error identification<br/><br/>DSAI: Evaluation Phase<br/>• F1-score ≥ 0.90"]
        LLM["Flashcard Generation<br/>(LLM)<br/>• Generate from conversation<br/>• Extract key vocabulary<br/>• Contextual examples<br/>• Spaced repetition logic<br/><br/>DSAI: Analysis Phase<br/>• Text analysis<br/>• Vocabulary contextualization"]
    end
    subgraph ProcessingLayer["PROCESSING MODULES LAYER<br/>(Specialized Services)"]
        Profanity["Profanity Filter<br/>• Content filtering<br/>• Safe learning environment<br/>• Text processing"]
        OCR["OCR Module<br/>(Tesseract.js)<br/>• Text extraction from images<br/>• Server-side processing<br/>• Vocabulary extraction"]
        Waveform["Waveform Generator<br/>(FFmpeg)<br/>• Audio visualization<br/>• Server-side rendering<br/>• Real-time waveform"]
        Gamification["Gamification Engine<br/>(Gamify.js)<br/>• Points, badges, streaks<br/>• Engagement tracking<br/>• Reward system<br/>• Observer pattern<br/>(Separate Microservice)"]
    end
    subgraph DataLayer["DATA LAYER<br/>(Persistent Storage)"]
        PostgreSQL["PostgreSQL Database<br/>(Supabase)<br/>• ACID-compliant relational DB<br/>• GDPR-compliant anonymization<br/>• Encrypted storage with RBAC<br/><br/>Tables:<br/>• users (profiles, native language)<br/>• dictionaries (vocabulary, context)<br/>• conversations (transcripts, scenarios)<br/>• pronunciation_logs (PER scores, errors)<br/>• flashcards (vocab, repetition schedule)<br/>• progress (session, accuracy trends)<br/>• gamification (points, badges, streaks)<br/>• mistake_tracking (error patterns)<br/><br/>DSAI: Data Storage & Retrieval<br/>• Time-series analytics<br/>• Mistake pattern insights"]
    end

    ReactWeb -->|"HTTPS/REST"| FastAPI
    ReactWeb -->|"WebSocket"| SocketIO
    BrowserExt -->|"HTTPS/REST"| FastAPI
    FastAPI -->|"HTTP"| ASR
    FastAPI -->|"HTTP"| MDD
    SocketIO -->|"HTTP"| MDD
    SocketIO -->|"HTTP"| Waveform
    ASR -->|"HTTP"| MDD
    MDD -->|"HTTP"| PER
    PER -->|"HTTP"| MDD
    MDD -->|"HTTP"| SocketIO
    MDD -->|"HTTP"| FastAPI
    FastAPI -->|"HTTP"| LLM
    LLM -->|"HTTP"| FastAPI
    FastAPI -->|"HTTP"| Profanity
    Profanity -->|"HTTP"| FastAPI
    FastAPI -->|"HTTP"| OCR
    OCR -->|"HTTP"| FastAPI
    FastAPI -->|"HTTP"| Gamification
    Gamification -->|"HTTP"| FastAPI
    Waveform -->|"HTTP"| SocketIO
    FastAPI -->|"PostgreSQL"| PostgreSQL
    ASR -->|"PostgreSQL"| PostgreSQL
    MDD -->|"PostgreSQL"| PostgreSQL
    LLM -->|"PostgreSQL"| PostgreSQL
    OCR -->|"PostgreSQL"| PostgreSQL
    Gamification -->|"PostgreSQL"| PostgreSQL
    PostgreSQL -->|"PostgreSQL"| FastAPI

    classDef presentationStyle fill:#E3F2FD,stroke:#1976D2,stroke-width:3px,color:#000
    classDef applicationStyle fill:#F3E5F5,stroke:#7B1FA2,stroke-width:3px,color:#000
    classDef aimlStyle fill:#E8F5E9,stroke:#388E3C,stroke-width:3px,color:#000
    classDef processingStyle fill:#FFF3E0,stroke:#F57C00,stroke-width:3px,color:#000
    classDef dataStyle fill:#FCE4EC,stroke:#C2185B,stroke-width:3px,color:#000
    class ReactWeb,BrowserExt presentationStyle
    class FastAPI,SocketIO applicationStyle
    class ASR,MDD,PER,LLM aimlStyle
    class Profanity,OCR,Waveform,Gamification processingStyle
    class PostgreSQL dataStyle
```
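The Gamification Engine node above notes an observer pattern. A minimal Python sketch of that design follows; the event name `session_completed`, the flat 10-point reward, and the `consistent_learner` badge threshold are illustrative assumptions, not the production rules:

```python
# Observer-pattern sketch for the Gamification Engine.
# Event names, point values, and badge thresholds are assumptions
# made for illustration, not the service's actual configuration.

class GamificationEngine:
    """Subject: publishes learning events to registered observers."""

    def __init__(self):
        self._observers = []

    def subscribe(self, observer):
        self._observers.append(observer)

    def publish(self, event, payload):
        for observer in self._observers:
            observer.on_event(event, payload)


class PointsTracker:
    """Observer: awards points for each completed session."""

    def __init__(self):
        self.points = 0

    def on_event(self, event, payload):
        if event == "session_completed":
            self.points += 10  # assumed flat reward per session


class BadgeTracker:
    """Observer: grants a badge once a points threshold is crossed."""

    def __init__(self, points_tracker, threshold=30):
        self.points_tracker = points_tracker
        self.threshold = threshold
        self.badges = []

    def on_event(self, event, payload):
        if (event == "session_completed"
                and self.points_tracker.points >= self.threshold
                and "consistent_learner" not in self.badges):
            self.badges.append("consistent_learner")


engine = GamificationEngine()
points = PointsTracker()
badges = BadgeTracker(points, threshold=30)
engine.subscribe(points)   # subscribed first, so points update before badge checks
engine.subscribe(badges)

for _ in range(3):
    engine.publish("session_completed", {"user_id": 1})

print(points.points)  # 30
print(badges.badges)  # ['consistent_learner']
```

Because the engine runs as a separate microservice, new reward rules can be added as observers without touching the FastAPI gateway.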
2️⃣ Pronunciation Feedback Flows
Two Different Processing Paths:
  • Flow 1 (Predefined Text): For practice exercises where users read given text. Direct phoneme analysis for faster, focused feedback.
  • Flow 2 (Open Conversation): For contextual scenarios (restaurant, travel). Includes ASR transcription followed by pronunciation analysis.
```mermaid
%%{init: {'flowchart': {'nodeSpacing': 150, 'rankSpacing': 100, 'curve': 'basis'}}}%%
flowchart TB
    subgraph Flow1["FLOW 1: Predefined Text Practice<br/>(Pronunciation Exercises)"]
        direction LR
        User1["👤 User<br/>Reads given text"]
        MDD1["MDD Service<br/>(Fine-tuned Whisper)<br/>Phoneme Detection"]
        PER1["PER Calculator<br/>Calculate Phone<br/>Error Rate"]
        Feedback1["✅ Feedback<br/>• Pronunciation score<br/>• Phoneme errors<br/>• Corrections"]
        User1 -->|"🎤 Audio"| MDD1
        MDD1 -->|"Phoneme<br/>predictions"| PER1
        PER1 -->|"PER score<br/>Error details"| Feedback1
    end
    spacer1[" "]:::spacer
    subgraph Flow2["FLOW 2: Open Conversation<br/>(Contextual Scenarios)"]
        direction LR
        User2["👤 User<br/>Speaks freely"]
        ASR2["ASR Service<br/>(Whisper)<br/>Speech Recognition"]
        Text2["📝 Text<br/>Transcription"]
        MDD2["MDD Service<br/>(Fine-tuned Whisper)<br/>Phoneme Detection"]
        PER2["PER Calculator<br/>Calculate Phone<br/>Error Rate"]
        Feedback2["✅ Feedback<br/>• Transcription<br/>• Pronunciation score<br/>• Phoneme errors<br/>• Corrections"]
        User2 -->|"🎤 Audio"| ASR2
        ASR2 -->|"Text"| Text2
        Text2 -->|"Transcription<br/>+ Audio"| MDD2
        MDD2 -->|"Phoneme<br/>predictions"| PER2
        PER2 -->|"PER score<br/>Error details"| Feedback2
    end
    Flow1 ~~~ spacer1
    spacer1 ~~~ Flow2

    classDef userStyle fill:#BBDEFB,stroke:#1976D2,stroke-width:2px,color:#000
    classDef asrStyle fill:#C8E6C9,stroke:#388E3C,stroke-width:2px,color:#000
    classDef mddStyle fill:#C8E6C9,stroke:#388E3C,stroke-width:2px,color:#000
    classDef perStyle fill:#FFF9C4,stroke:#F57F17,stroke-width:2px,color:#000
    classDef feedbackStyle fill:#C5E1A5,stroke:#689F38,stroke-width:2px,color:#000
    classDef textStyle fill:#E1BEE7,stroke:#7B1FA2,stroke-width:2px,color:#000
    classDef spacer fill:transparent,stroke:transparent,color:transparent
    class User1,User2 userStyle
    class ASR2 asrStyle
    class MDD1,MDD2 mddStyle
    class PER1,PER2 perStyle
    class Feedback1,Feedback2 feedbackStyle
    class Text2 textStyle
    class spacer1 spacer
```
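The PER Calculator step shared by both flows reduces to an edit-distance comparison between the reference and predicted phoneme sequences. A minimal sketch follows; the ARPAbet-style phoneme symbols in the example are illustrative, not the MDD service's actual phoneme inventory:

```python
# Phone Error Rate (PER) sketch: Levenshtein edit distance over phoneme
# sequences, normalized by the reference length. Example phonemes are
# illustrative ARPAbet-style symbols, not the service's real inventory.

def phone_error_rate(reference, predicted):
    """PER = (substitutions + deletions + insertions) / len(reference)."""
    m, n = len(reference), len(predicted)
    # dp[i][j] = edit distance between reference[:i] and predicted[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == predicted[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[m][n] / m if m else 0.0


def pronunciation_score(reference, predicted):
    """Map PER onto the 0-100% score shown in the feedback UI."""
    return round(max(0.0, 1.0 - phone_error_rate(reference, predicted)) * 100)


# "think" /TH IH NG K/ pronounced /S IH NG K/ (a common TH -> S substitution)
ref = ["TH", "IH", "NG", "K"]
pred = ["S", "IH", "NG", "K"]
print(phone_error_rate(ref, pred))    # 0.25
print(pronunciation_score(ref, pred))  # 75
```

Normalizing by the reference length is what lets the same score scale apply to both the short predefined-text exercises and longer open-conversation utterances.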
3️⃣ DSAI Data Lifecycle
Complete Data Science Workflow: This diagram demonstrates the end-to-end data lifecycle aligned with DSAI program requirements, showing how data flows through collection, analysis, modeling, evaluation, and deployment phases.
```mermaid
graph TD
    subgraph Phase1["PHASE 1: COLLECTION"]
        C1["ASR Service<br/>📊 Data Sources:<br/>• Audio recordings<br/>• User speech samples<br/>• Conversation transcripts<br/>• User interaction logs"]
    end
    subgraph Phase2["PHASE 2: ANALYSIS"]
        A1["MDD Service<br/>🔍 Analysis:<br/>• Phoneme-level error categorization<br/>• Native language influence patterns<br/>• Mistake pattern clustering"]
        A2["Flashcard Service<br/>📚 Analysis:<br/>• Vocabulary extraction<br/>• Contextualization<br/>• Text analysis from transcripts"]
    end
    subgraph Phase3["PHASE 3: MODELING"]
        M1["Fine-tuned Whisper<br/>🤖 Machine Learning:<br/>• Fine-tuning for phoneme detection<br/>• Native language adaptation<br/>• Error pattern recognition models<br/><br/>📈 Validation:<br/>• 10-fold cross-validation<br/>• 10K audio samples"]
    end
    subgraph Phase4["PHASE 4: EVALUATION"]
        E1["PER Calculator<br/>📊 Metrics:<br/>• Phone Error Rate (PER)<br/>• Model accuracy ≥ 90%<br/>• F1-score ≥ 0.90<br/>• Pronunciation accuracy ≥ 85%<br/>• Vocabulary retention ≥ 75%<br/><br/>🛠️ Tools:<br/>• Confusion matrices<br/>• ROC curves"]
    end
    subgraph Phase5["PHASE 5: DEPLOYMENT & MONITORING"]
        D1["Production System<br/>🚀 Deployment:<br/>• Real-time pronunciation feedback<br/>• Progress analytics dashboards<br/>• Continuous performance tracking<br/><br/>📡 Monitoring:<br/>• 99% uptime<br/>• Latency < 2s"]
        D2["PostgreSQL Database<br/>💾 Storage:<br/>• Time-series analytics<br/>• Structured data for insights<br/>• Aggregation pipelines<br/>• Mistake pattern insights"]
    end
    C1 -->|"Raw audio data<br/>Transcripts"| A1
    C1 -->|"Conversation<br/>transcripts"| A2
    A1 -->|"Error patterns<br/>Categorized data"| M1
    A2 -->|"Extracted<br/>vocabulary"| M1
    M1 -->|"Model predictions<br/>Phoneme detections"| E1
    E1 -->|"Performance metrics<br/>Validated models"| D1
    E1 -->|"Evaluation results"| D2
    D1 -->|"Production data"| D2
    D2 -.->|"Feedback loop<br/>Continuous improvement"| C1

    classDef collectionStyle fill:#E3F2FD,stroke:#1976D2,stroke-width:3px,color:#000
    classDef analysisStyle fill:#E8F5E9,stroke:#388E3C,stroke-width:3px,color:#000
    classDef modelingStyle fill:#FFF3E0,stroke:#F57C00,stroke-width:3px,color:#000
    classDef evaluationStyle fill:#FCE4EC,stroke:#C2185B,stroke-width:3px,color:#000
    classDef deploymentStyle fill:#F3E5F5,stroke:#7B1FA2,stroke-width:3px,color:#000
    class C1 collectionStyle
    class A1,A2 analysisStyle
    class M1 modelingStyle
    class E1 evaluationStyle
    class D1,D2 deploymentStyle
```
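The evaluation targets in Phase 4 (e.g., F1-score ≥ 0.90) come down to standard precision/recall arithmetic over the mispronunciation-detection confusion matrix. A short sketch, with confusion-matrix counts invented purely for illustration:

```python
# Precision, recall, and F1 for mispronunciation detection.
# The confusion-matrix counts below are hypothetical examples,
# not measured results from the LinguaAI models.

def f1_metrics(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical validation run: 920 mispronunciations correctly flagged,
# 60 false alarms, 80 missed errors.
precision, recall, f1 = f1_metrics(tp=920, fp=60, fn=80)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

A run like this one would fall just short of the 0.90 F1 target on recall-heavy misses, which is exactly the kind of gap the feedback loop back to the Collection phase is meant to close.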

🎯 Key Architecture Highlights

💻 Technology Stack

Frontend

  • React (Web)
  • Chrome Extension API
  • D3.js (Visualization)

Backend

  • FastAPI (Python)
  • Socket.io

AI/ML

  • Whisper (ASR)
  • Fine-tuned Whisper (MDD)
  • LLM (Flashcard Gen)

Processing

  • Tesseract.js (OCR)
  • FFmpeg (Audio/Waveform)
  • Gamify.js

Database

  • Supabase (PostgreSQL)