Quote Guessing Game
This project implements a pipeline for acquisition, normalization, and verification to identify quote authors in real time. Starting from public sources, the system performs controlled web scraping, enriches each record with metadata (biography/image), and exposes a lightweight Flask API that validates user input via a typo-tolerant string matcher. The goal is to showcase an end-to-end, data-centric flow—ingest → clean → entity matching → metrics—with explicit choices around performance, resilience, and user experience.
At the core of the verifier, I combine canonical normalization (lowercasing and stripping non-alphanumerics) with simple heuristics (accept by exact last name or full name) and a similarity score using difflib.SequenceMatcher with a 0.85 threshold. This balances precision and usability: it accepts "Twian" for "Twain" (minor typo) or just the last name, while rejecting ambiguous matches. The threshold is tunable and accompanied by ablation tests to quantify effects on precision/recall. On the UX side, the system serves hints (birth date/location, first/last-name initial) and author portraits, preferring Wikipedia and gracefully degrading to fallbacks (Unsplash/Robohash/UI-Avatars) so the interface never breaks.
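The verifier described above can be sketched as follows. This is a minimal illustration, not the project's actual code: function names and the exact heuristics are assumptions, but the pieces (regex normalization, full-name/last-name acceptance, `difflib.SequenceMatcher` with a 0.85 threshold) follow the description.

```python
# Hedged sketch of the answer verifier: canonical normalization, name
# heuristics, and a similarity fallback. Names/threshold are illustrative.
import re
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.85  # tunable, as described in the text

def normalize(text: str) -> str:
    """Lowercase and strip non-alphanumerics (spaces collapse away too)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def check_answer(guess: str, author: str) -> bool:
    norm_guess = normalize(guess)
    norm_full = normalize(author)
    norm_last = normalize(author.split()[-1])
    # Heuristics: accept an exact full name or an exact last name.
    if norm_guess in (norm_full, norm_last):
        return True
    # Fuzzy fallback: best similarity against full name or last name.
    ratio = max(
        SequenceMatcher(None, norm_guess, norm_full).ratio(),
        SequenceMatcher(None, norm_guess, norm_last).ratio(),
    )
    return ratio >= SIMILARITY_THRESHOLD
```

Because `SequenceMatcher` ratios depend on string length, the 0.85 cutoff is stricter on short names; that is exactly the kind of effect the ablation tests over the threshold are meant to quantify.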
From a data engineering perspective, ingestion is paginated with a persistent HTTP session, timeouts, and brief sleeps to be polite to the origin. Quotes are cached in memory (~100) to cut latency for subsequent games; endpoints are idempotent and return compact JSON with clear error codes. The design is explainable and portable, and it leaves room to swap the matcher for edit-distance methods (Levenshtein/Jaro-Winkler) or embeddings (e.g., SBERT) for multilingual names and semantic disambiguation.
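The ingestion loop and cache can be outlined like this. To keep the sketch self-contained, the page fetcher is injected as a callable (the real project would plug in a `requests.Session` GET plus BeautifulSoup parsing there); all names are assumptions.

```python
# Illustrative skeleton of paginated, polite ingestion with an in-memory
# cache (~100 quotes). fetch_page is a stand-in for the real HTTP + parse
# step; everything here is a sketch, not the project's actual code.
import time
from typing import Callable, Dict, List

QUOTE_CACHE: List[Dict[str, str]] = []  # filled once, reused across games

def load_quotes(
    fetch_page: Callable[[int], List[Dict[str, str]]],
    max_quotes: int = 100,
    delay_s: float = 0.5,
) -> List[Dict[str, str]]:
    """Paginate until the source runs dry or the cache target is reached."""
    if QUOTE_CACHE:                    # idempotent: reuse the warm cache
        return QUOTE_CACHE[:max_quotes]
    page = 1
    while len(QUOTE_CACHE) < max_quotes:
        try:
            batch = fetch_page(page)   # e.g. GET + parse with BeautifulSoup
        except Exception:
            break                      # degrade gracefully on fetch errors
        if not batch:                  # empty page: no more results
            break
        QUOTE_CACHE.extend(batch)
        page += 1
        time.sleep(delay_s)            # brief sleep to be polite to the origin
    return QUOTE_CACHE[:max_quotes]
```

Injecting the fetcher also makes the loop trivially testable with a fake page source, which is how the error-handling and idempotency claims can be verified without hitting the network.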
Stack & technical contract (selected)
- Language/serving: Python 3, Flask (API + server-rendered views).
- Acquisition: Requests + BeautifulSoup (paged scraping with error handling).
- Matching: re (regex) + difflib.SequenceMatcher (configurable threshold).
- Enrichment: Wikipedia REST API; image fallbacks for visual continuity.
- Operations: in-memory cache, lightweight JSON responses, robust timeouts/try-except.
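The "image fallbacks for visual continuity" item can be sketched as an ordered candidate list: prefer a Wikipedia portrait when the enrichment step found one, then degrade to generic services so the UI always has something to render. The fallback URL patterns below are assumptions based on those services' public URL schemes, not the project's actual code.

```python
# Sketch of the portrait fallback chain: the first URL the client can load
# wins. wiki_url comes from the Wikipedia REST API lookup, if it succeeded.
from typing import List, Optional
from urllib.parse import quote

def portrait_candidates(author: str, wiki_url: Optional[str]) -> List[str]:
    """Ordered image URLs, best source first."""
    name = quote(author)
    candidates: List[str] = []
    if wiki_url:                       # preferred: real portrait from Wikipedia
        candidates.append(wiki_url)
    candidates += [                    # assumed fallback URL patterns
        f"https://source.unsplash.com/200x200/?portrait,{name}",
        f"https://robohash.org/{name}.png",
        f"https://ui-avatars.com/api/?name={name}",
    ]
    return candidates
```

Keeping the chain as plain data (a list of URLs) lets the client try each in order, which is what makes the "interface never breaks" guarantee cheap to uphold.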
For evaluation, I synthesize a labeled set with controlled perturbations (typos, last-name-only, diacritics) and report accuracy, precision/recall, plus FNR to spot unfair rejections. I also monitor endpoint latency and image-fallback rate as operational KPIs. The project uses only public data—no PII—so it's safe to demonstrate in a portfolio.
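The evaluation harness can be sketched as two small functions: one that perturbs gold author names into labeled cases (exact, last-name-only, typo positives; wrong-author negatives), and one that scores any matcher against them. The perturbation choices and field layout here are illustrative assumptions.

```python
# Hedged sketch of the evaluation harness: synthesize a labeled set with
# controlled perturbations, then report accuracy, precision, recall, and FNR.
from typing import Callable, Dict, List, Tuple

Case = Tuple[str, str, bool]  # (guess, gold author, expected acceptance)

def make_labeled_set(authors: List[str]) -> List[Case]:
    cases: List[Case] = []
    for i, a in enumerate(authors):
        cases.append((a, a, True))                       # exact match
        cases.append((a.split()[-1], a, True))           # last-name-only
        cases.append((a[:-2] + a[-1] + a[-2], a, True))  # typo: swap last two chars
        other = authors[(i + 1) % len(authors)]
        cases.append((other, a, False))                  # wrong author
    return cases

def evaluate(match: Callable[[str, str], bool], cases: List[Case]) -> Dict[str, float]:
    tp = fp = tn = fn = 0
    for guess, gold, label in cases:
        pred = match(guess, gold)
        if label and pred:       tp += 1
        elif label and not pred: fn += 1
        elif not label and pred: fp += 1
        else:                    tn += 1
    n = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / n,
        "precision": tp / (tp + fp) if tp + fp else 1.0,
        "recall": tp / (tp + fn) if tp + fn else 1.0,
        "fnr": fn / (tp + fn) if tp + fn else 0.0,       # unfair rejections
    }
```

Running this over a strict exact-match baseline versus the fuzzy verifier makes the FNR gap concrete: exact matching rejects every last-name-only and typo case, which is precisely the unfairness the tolerant matcher is meant to remove.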
Primary endpoints (API)
- /api/load-quotes (load & cache quotes)
- /api/new-game (serve a random quote)
- /api/check-answer (error-tolerant verification)
- /api/get-hint (biographical/initial hints)
- /api/get-image (portraits with graceful degradation)
In business terms, the pattern is easily parameterized: replace the quotes source with a corporate catalog (authors, products, works, SKUs), adjust the similarity threshold, and turn "hints" into internal metadata (launch date, category, tags). The same architecture underpins typo-tolerant enterprise search, validation of customer/product names, or knowledge games in educational settings.
Industries & adaptations
- EdTech / e-learning: quizzes with flexible, explainable answer checking.
- Media / Publishing: trivia and engagement using in-house author/work catalogs.
- Marketing & Community: branded challenges with hints and spokesperson/product images.
- Museums / Culture: interactive guides identifying artists/works with curated metadata.
- Retail / Internal catalogs: fuzzy matching for products/SKUs with DAM-hosted images.
- Support / Helpdesk: term normalization and article suggestions (natural extension to NLP/embeddings).