Quote Guessing Game

This project implements a pipeline for acquisition, normalization, and verification to identify quote authors in real time. Starting from public sources, the system performs controlled web scraping, enriches each record with metadata (biography/image), and exposes a lightweight Flask API that validates user input via a typo-tolerant string matcher. The goal is to showcase an end-to-end, data-centric flow—ingest → clean → entity matching → metrics—with explicit choices around performance, resilience, and user experience.

At the core of the verifier, I combine canonical normalization (lowercasing and stripping non-alphanumerics) with simple heuristics (accept an exact last-name or full-name match) and a similarity score from difflib.SequenceMatcher with a 0.85 threshold. This balances precision and usability: it accepts "Twaine" for "Twain" (a minor typo) or just the last name, while rejecting ambiguous matches. The threshold is tunable and accompanied by ablation tests that quantify its effect on precision/recall. On the UX side, the system serves hints (birth date/location, first/last-name initials) and author portraits, preferring Wikipedia and degrading gracefully to fallbacks (Unsplash/Robohash/UI-Avatars) so the interface never breaks.
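
The verification logic described above can be sketched roughly as follows; the function and variable names are illustrative, not the project's actual code:

```python
import re
from difflib import SequenceMatcher

THRESHOLD = 0.85  # tunable; adjusted against the ablation results

def normalize(name: str) -> str:
    """Canonical form: lowercase with non-alphanumerics stripped."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def is_correct(guess: str, author: str, threshold: float = THRESHOLD) -> bool:
    """Accept an exact full-name or last-name match; otherwise fall back
    to a similarity score against both candidates."""
    norm_guess = normalize(guess)
    if not norm_guess:
        return False
    full_name = normalize(author)
    last_name = normalize(author.split()[-1])
    if norm_guess in (full_name, last_name):
        return True
    score = max(SequenceMatcher(None, norm_guess, c).ratio()
                for c in (full_name, last_name))
    return score >= threshold
```

Scoring against both the full name and the last name keeps a last-name-only guess with a small typo from being penalized by the missing first name.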

From a data engineering perspective, ingestion is paginated with a persistent HTTP session, timeouts, and brief sleeps to be polite to the origin. Quotes are cached in memory (~100) to cut latency for subsequent games; endpoints are idempotent and return compact JSON with clear error codes. The design is explainable and portable, and it leaves room to swap the matcher for edit-distance methods (Levenshtein/Jaro-Winkler) or embeddings (e.g., SBERT) for multilingual names and semantic disambiguation.
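
A minimal sketch of that polite, paginated ingestion loop; the base URL and CSS selectors assume a quotes.toscrape.com-style source and are illustrative:

```python
import time
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://quotes.toscrape.com"  # illustrative public source
quote_cache: list[dict] = []  # in-memory cache for subsequent games

def parse_quotes(html: str) -> list[dict]:
    """Extract (text, author) pairs from one page of markup."""
    soup = BeautifulSoup(html, "html.parser")
    return [{"text": b.select_one("span.text").get_text(strip=True),
             "author": b.select_one("small.author").get_text(strip=True)}
            for b in soup.select("div.quote")]

def load_quotes(max_pages: int = 10, delay: float = 0.5) -> list[dict]:
    """Paginate with a persistent session, timeouts, and brief sleeps."""
    session = requests.Session()  # reuses the TCP connection across pages
    quotes: list[dict] = []
    for page in range(1, max_pages + 1):
        try:
            resp = session.get(f"{BASE_URL}/page/{page}/", timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            break  # degrade gracefully on network errors
        batch = parse_quotes(resp.text)
        if not batch:
            break  # walked past the last page
        quotes.extend(batch)
        time.sleep(delay)  # be polite to the origin
    quote_cache[:] = quotes  # refresh the cache in place
    return quotes
```

Separating parsing from fetching keeps the extraction logic testable without touching the network.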

Stack & technical contract (selected)

  • Language/serving: Python 3, Flask (API + server-rendered views).

  • Acquisition: Requests + BeautifulSoup (paged scraping with error handling).

  • Matching: re (regex) + difflib.SequenceMatcher (configurable threshold).

  • Enrichment: Wikipedia REST API; image fallbacks for visual continuity.

  • Operations: in-memory cache, lightweight JSON responses, robust timeouts/try-except.
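
The enrichment step with graceful degradation might look like the sketch below, using Wikipedia's REST page-summary endpoint; the helper names and the single generated-avatar fallback are illustrative simplifications of the fallback chain:

```python
import requests

WIKI_SUMMARY = "https://en.wikipedia.org/api/rest_v1/page/summary/{}"

def fallback_image_url(author: str) -> str:
    """Deterministic generated avatar so the UI never shows a broken image."""
    return "https://ui-avatars.com/api/?name=" + author.replace(" ", "+")

def get_author_image(author: str) -> str:
    """Prefer a Wikipedia portrait; degrade to a generated avatar."""
    try:
        resp = requests.get(WIKI_SUMMARY.format(author.replace(" ", "_")),
                            timeout=5)
        if resp.ok:
            thumb = resp.json().get("thumbnail", {}).get("source")
            if thumb:
                return thumb
    except requests.RequestException:
        pass  # network failure: fall through to the fallback
    return fallback_image_url(author)
```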

For evaluation, I synthesize a labeled set with controlled perturbations (typos, last-name-only, diacritics) and report accuracy, precision/recall, plus FNR to spot unfair rejections. I also monitor endpoint latency and image-fallback rate as operational KPIs. The project uses only public data—no PII—so it's safe to demonstrate in a portfolio.
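
One way to synthesize such a labeled set, shown here with a self-contained stand-in verifier; the authors, the single perturbation, and the threshold are illustrative (diacritics cases omitted for brevity):

```python
import re
from difflib import SequenceMatcher

def check(guess: str, author: str, threshold: float = 0.85) -> bool:
    """Stand-in verifier: exact full/last name, else fuzzy similarity."""
    norm = lambda s: re.sub(r"[^a-z0-9]", "", s.lower())
    g, full, last = norm(guess), norm(author), norm(author.split()[-1])
    if g in (full, last):
        return True
    return max(SequenceMatcher(None, g, c).ratio()
               for c in (full, last)) >= threshold

def typo(name: str) -> str:
    """Deterministic perturbation: transpose the last two characters."""
    return name[:-2] + name[-1] + name[-2]

authors = ["Albert Einstein", "Jane Austen", "Oscar Wilde", "Mark Twain"]
cases = []  # (guess, true author, expected verdict)
for a in authors:
    cases.append((a, a, True))               # exact match
    cases.append((a.split()[-1], a, True))   # last name only
    cases.append((typo(a), a, True))         # minor typo
    cases += [(o, a, False) for o in authors if o != a]  # wrong author

tp = sum(check(g, a) for g, a, ok in cases if ok)
fn = sum(not check(g, a) for g, a, ok in cases if ok)
fp = sum(check(g, a) for g, a, ok in cases if not ok)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn)
print(f"precision={precision:.2f} recall={recall:.2f} FNR={fn/(tp+fn):.2f}")
# → precision=1.00 recall=1.00 FNR=0.00 for this small set
```

Sweeping `threshold` over this harness is how the ablation quantifies the precision/recall trade-off.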

Primary endpoints (API)

  • /api/load-quotes (load & cache quotes),

  • /api/new-game (serve a random quote),

  • /api/check-answer (error-tolerant verification),

  • /api/get-hint (biographical/initial hints),

  • /api/get-image (portraits with graceful degradation).
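
A compact sketch of two of these endpoints, assuming the in-memory cache described above; the handler bodies and the seed quote are illustrative:

```python
import random
from difflib import SequenceMatcher
from flask import Flask, jsonify, request

app = Flask(__name__)

# illustrative in-memory store; the project populates it via /api/load-quotes
QUOTES = [{"text": "The secret of getting ahead is getting started.",
           "author": "Mark Twain"}]
current: dict = {}  # the quote for the active game

@app.route("/api/new-game")
def new_game():
    current.update(random.choice(QUOTES))
    return jsonify({"quote": current["text"]})

@app.route("/api/check-answer", methods=["POST"])
def check_answer():
    if not current:
        return jsonify({"error": "no active game"}), 400  # clear error code
    guess = (request.get_json(silent=True) or {}).get("guess", "")
    ratio = SequenceMatcher(None, guess.lower(),
                            current["author"].lower()).ratio()
    return jsonify({"correct": ratio >= 0.85, "similarity": round(ratio, 2)})
```

Handlers stay idempotent and return compact JSON, so they can be exercised directly with Flask's test client.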

In business terms, the pattern is easily parameterized: replace the quotes source with a corporate catalog (authors, products, works, SKUs), adjust the similarity threshold, and turn "hints" into internal metadata (launch date, category, tags). The same architecture underpins typo-tolerant enterprise search, validation of customer/product names, or knowledge games in educational settings.

Industries & adaptations

  • EdTech / e-learning: quizzes with flexible, explainable answer checking.

  • Media / Publishing: trivia and engagement using in-house author/work catalogs.

  • Marketing & Community: branded challenges with hints and spokesperson/product images.

  • Museums / Culture: interactive guides identifying artists/works with curated metadata.

  • Retail / Internal catalogs: fuzzy matching for products/SKUs with DAM-hosted images.

  • Support / Helpdesk: term normalization and article suggestions (natural extension to NLP/embeddings).

© 2023 Alexsandra Ortiz. All rights reserved.