TEDx Talks Recommendation System with Machine Learning

This Python project builds a recommendation system for TEDx talks using natural language processing (NLP) and machine learning techniques. It takes a user's input (e.g., a talk description or topic) and recommends similar TEDx talks by measuring textual similarity. The system is embedded into a website using Flask, a lightweight web framework. 

Concepts

Pandas: Data manipulation library, useful for loading and managing datasets.

TF-IDF (Term Frequency–Inverse Document Frequency): A statistical measure used to evaluate how important a word is to a document relative to a collection (corpus) of documents.

Cosine Similarity: A metric that measures the cosine of the angle between two non-zero vectors. Common in text analysis to measure similarity.

Pearson Correlation Coefficient: A statistical metric that quantifies linear correlation between two variables.

NLTK (Natural Language Toolkit): A suite of tools for text processing in Python.

Stopwords: Common words (like "the", "is", "and") that are usually removed in text preprocessing because they carry less meaningful information.


1) The Dataset

The dataset (tedx_dataset.csv) contains information about various TEDx talks, including speaker names and talk descriptions (the details column).

Why use a dataset? We need historical data to compare new inputs against.
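A minimal sketch of loading and inspecting the data with Pandas. Since tedx_dataset.csv is not bundled here, a toy DataFrame stands in for it, and the column names (main_speaker, details) are assumptions based on the description above; the real file would be loaded with pd.read_csv("tedx_dataset.csv").

```python
import pandas as pd

# Toy stand-in for tedx_dataset.csv; the real file is loaded the same way
# with pd.read_csv("tedx_dataset.csv"). Column names are assumed here.
data = pd.DataFrame({
    "main_speaker": ["Alice Rivera", "Ben Okoye"],
    "details": [
        "A talk about machine learning and creativity.",
        "How cities can adapt to climate change.",
    ],
})

# Basic inspection: shape and missing values in the text column
print(data.shape)
print(data["details"].isna().sum())
```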


2) TF-IDF and Vectorization

What is TF-IDF?

Mathematically:

  • TF (Term Frequency): How often a word appears in a document.

  • IDF (Inverse Document Frequency): How rare a word is across all documents.

TF-IDF = TF × log(N / DF)

Where:

  • N = total number of documents

  • DF = number of documents containing the term
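The formula above can be worked through by hand on a toy corpus. This is a sketch of the raw TF × log(N / DF) definition, not of any particular library's implementation:

```python
import math

# Toy corpus of four "documents" (talk descriptions)
docs = [
    "machine learning for art",
    "deep learning in medicine",
    "the art of storytelling",
    "storytelling and memory",
]

def tf_idf(term, doc, corpus):
    words = doc.split()
    tf = words.count(term) / len(words)           # term frequency in this document
    df = sum(term in d.split() for d in corpus)   # number of documents containing the term
    return tf * math.log(len(corpus) / df)        # TF × log(N / DF)

# "learning" appears in 2 of 4 documents, so IDF = log(4 / 2) = log 2,
# and TF in the first document is 1/4.
score = tf_idf("learning", docs[0], docs)
print(score)  # (1/4) × log 2 ≈ 0.1733
```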

📊 Algebraic Insight

Each document (TED talk) is represented as a vector in an n-dimensional space, where each dimension is a word in the vocabulary. This lets us compare talks using vector-based similarity.

Text Preprocessing

Before feeding text into any machine learning model, we normalize it:

  • Lowercasing helps ensure consistent comparison.

  • Removing stopwords improves focus on meaningful words.

  • Removing punctuation eliminates unnecessary symbols.

✏️ Example

Original: "The talk is amazing!" → Lowercase: "the talk is amazing!" → No punctuation: "the talk is amazing" → No stopwords: "talk amazing"
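The pipeline above can be sketched in a few lines. A small hardcoded stopword set stands in for NLTK's full English list (nltk.corpus.stopwords.words("english"), which requires a one-time nltk.download("stopwords")):

```python
import string

# Small stand-in for NLTK's English stopword list
STOPWORDS = {"the", "is", "and", "a", "an", "of", "to", "in"}

def preprocess(text):
    text = text.lower()                                                # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))   # strip punctuation
    words = [w for w in text.split() if w not in STOPWORDS]            # drop stopwords
    return " ".join(words)

print(preprocess("The talk is amazing!"))  # talk amazing
```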

3) Measuring Similarities

Concepts

  • Cosine Similarity is calculated as:

    cos(θ) = (A · B) / (||A|| × ||B||)

    where A and B are the TF-IDF vectors of two talks.

  • Pearson Correlation checks how two variables change together.

🔍 Why both?

  • Cosine captures directional similarity (semantics).

  • Pearson captures linear correlation after mean-centering, which can surface similarity patterns that cosine alone misses.
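Both metrics can be sketched with NumPy on toy vectors. Pearson correlation is equivalent to cosine similarity applied to mean-centered vectors, which makes the relationship between the two explicit:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(θ) = (a · b) / (||a|| × ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def pearson(a, b):
    # Pearson = cosine similarity of the mean-centered vectors
    return cosine_similarity(a - a.mean(), b - b.mean())

a = np.array([0.2, 0.0, 0.7, 0.1])
b = np.array([0.4, 0.0, 1.4, 0.2])  # same direction as a (scaled by 2)

print(cosine_similarity(a, b))  # 1.0 — identical direction
print(pearson(a, b))            # 1.0 — perfect linear correlation
```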

4) Recommend Based on Similarities

  1. Calculate similarity between input and all TED talks.

  2. Rank them based on cosine and Pearson scores.

  3. Return top 5 most similar talks.
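The three steps above can be combined into an end-to-end sketch. This version ranks by cosine similarity only (a Pearson score could be blended in the same way); the function name recommend and the toy talk list are illustrative, not from the original project:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

talks = [
    "machine learning for art",
    "deep learning in medicine",
    "the art of storytelling",
    "storytelling and memory",
    "robots and machine ethics",
]

vectorizer = TfidfVectorizer(stop_words="english")
talk_matrix = vectorizer.fit_transform(talks)

def recommend(query, top_n=5):
    # 1. Vectorize the input with the same vocabulary as the corpus
    query_vec = vectorizer.transform([query])
    # 2. Similarity between the input and every talk
    scores = cosine_similarity(query_vec, talk_matrix).ravel()
    # 3. Indices of the top_n most similar talks, best first
    ranked = scores.argsort()[::-1][:top_n]
    return [talks[i] for i in ranked]

print(recommend("a talk about machine learning", top_n=2))
```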

© 2023 All rights reserved
Alexsandra Ortiz