TED Talks Recommendation System with Machine Learning
This Python project builds a recommendation system for TEDx talks using natural language processing (NLP) and machine learning techniques. It takes a user's input (e.g., a talk description or topic) and recommends similar TEDx talks by measuring textual similarity. The system is embedded into a website using Flask, a lightweight web framework.
Concepts
- Pandas: A data manipulation library, useful for loading and managing datasets.
- TF-IDF (Term Frequency–Inverse Document Frequency): A statistical measure of how important a word is to a document within a collection (corpus) of documents.
- Cosine Similarity: A metric that measures the cosine of the angle between two non-zero vectors. Common in text analysis to measure similarity.
- Pearson Correlation Coefficient: A statistical metric that quantifies linear correlation between two variables.
- NLTK (Natural Language Toolkit): A suite of tools for text processing in Python.
- Stopwords: Common words (like "the", "is", "and") that are usually removed in text preprocessing because they carry little meaningful information.
Dataset: The dataset (tedx_dataset.csv) contains information about various TEDx talks, including speaker names and talk descriptions (the details column).
Why use a dataset? We need historical data to compare new inputs against.
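Loading the dataset with Pandas might look like the sketch below. The `details` column comes from the description above; the `main_speaker` column name and the sample rows are hypothetical placeholders (here fed through a `StringIO` so the snippet is self-contained; the real project would call `pd.read_csv("tedx_dataset.csv")`):

```python
from io import StringIO

import pandas as pd

# A tiny stand-in for tedx_dataset.csv (hypothetical rows for illustration).
csv_data = StringIO(
    "main_speaker,details\n"
    "Jane Doe,A talk about climate innovation\n"
    "John Smith,Exploring the future of AI\n"
)

# In the real project: df = pd.read_csv("tedx_dataset.csv")
df = pd.read_csv(csv_data)

print(df.shape)             # number of (rows, columns)
print(df["details"].iloc[0])  # first talk description
```

From here, the `details` column is the text we preprocess and vectorize in the sections that follow.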

1) Text Preprocessing
Before feeding text into any machine learning model, we normalize it:
- Lowercasing ensures consistent comparison.
- Removing punctuation eliminates unnecessary symbols.
- Removing stopwords keeps the focus on meaningful words.
✏️ Example
Original: "The talk is amazing!" → Lowercase: "the talk is amazing!" → No punctuation: "the talk is amazing" → No stopwords: "talk amazing"
2) TF-IDF and Vectorization
What is TF-IDF?

Mathematically:
- TF (Term Frequency): How often a word appears in a document.
- IDF (Inverse Document Frequency): How rare a word is across all documents.
TF-IDF = TF × log(N / DF)
Where:
- N = total number of documents
- DF = number of documents containing the term
📊 Algebraic Insight
Each document (TED talk) is represented as a vector in an n-dimensional space, where each dimension corresponds to a word in the vocabulary. This lets us compare talks using vector-based similarity.
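The preprocessing and vectorization steps above can be sketched together. The small `STOPWORDS` set here is a stand-in for NLTK's full English list (`nltk.corpus.stopwords.words("english")`, which requires a one-time download), and the sample talk texts are hypothetical:

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer

# Small stand-in for nltk.corpus.stopwords.words("english").
STOPWORDS = {"the", "is", "and", "a", "an", "of", "to", "in", "about"}

def preprocess(text: str) -> str:
    """Lowercase, strip punctuation, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    return " ".join(w for w in text.split() if w not in STOPWORDS)

talks = [
    "The talk is amazing!",
    "An amazing talk about machine learning.",
]
cleaned = [preprocess(t) for t in talks]
print(cleaned)  # ['talk amazing', 'amazing talk machine learning']

# Each cleaned description becomes a TF-IDF vector; rows are documents,
# columns are vocabulary words.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(cleaned)
print(tfidf_matrix.shape)
```

scikit-learn's `TfidfVectorizer` uses a smoothed variant of the `TF × log(N / DF)` formula above, but the idea is the same: rare, frequent-in-this-document words get the highest weights.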
3) Measuring Similarities

Concepts
- Cosine Similarity is calculated as:

  cos(θ) = (A · B) / (‖A‖ ‖B‖)

  where A and B are the TF-IDF vectors of the two talks.
- Pearson Correlation checks how two variables change together.
🔍 Why both?
- Cosine captures directional similarity (semantics).
- Pearson captures linear correlation, which can add another layer of meaning.
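Both metrics can be computed directly with NumPy. The two vectors below are hypothetical TF-IDF vectors over a shared four-word vocabulary, chosen only to illustrate the calculation:

```python
import numpy as np

# Two hypothetical TF-IDF vectors over a shared vocabulary.
a = np.array([0.2, 0.0, 0.7, 0.1])
b = np.array([0.1, 0.3, 0.6, 0.0])

# Cosine similarity: cos(theta) = (a . b) / (||a|| * ||b||)
cosine = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Pearson correlation: covariance normalized by the standard deviations;
# np.corrcoef returns a 2x2 correlation matrix, we take the off-diagonal.
pearson = np.corrcoef(a, b)[0, 1]

print(round(cosine, 3), round(pearson, 3))
```

Both values fall in [-1, 1]; here cosine ends up slightly higher than Pearson because cosine ignores each vector's mean while Pearson subtracts it out.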
4) Recommend Based on Similarities
- Calculate the similarity between the input and all TED talks.
- Rank them based on cosine and Pearson scores.
- Return the top 5 most similar talks.
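The ranking step might be sketched as follows. The titles and scores are hypothetical, and averaging the two metrics is just one simple way to combine them (the source does not specify how the scores are merged):

```python
import numpy as np

# Hypothetical similarity scores between the user's input and each talk.
titles = ["Talk A", "Talk B", "Talk C", "Talk D", "Talk E", "Talk F"]
cosine_scores = np.array([0.91, 0.10, 0.75, 0.40, 0.88, 0.55])
pearson_scores = np.array([0.85, 0.05, 0.80, 0.30, 0.90, 0.50])

# One illustrative way to combine the two metrics: average them.
combined = (cosine_scores + pearson_scores) / 2

# Rank talks by combined score (descending) and keep the top 5.
top5_idx = np.argsort(combined)[::-1][:5]
top5 = [titles[i] for i in top5_idx]
print(top5)
```

In the Flask app, this list of titles (plus descriptions) would be rendered back to the user as the recommendations.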
