Evaluating information retrieval with Normalized Discounted Cumulative Gain (NDCG@K) and Redis Vector Database
Full Colab notebook on GitHub.
Evaluating information retrieval (IR) systems is essential for informed design decisions and understanding system performance. Successful companies like Amazon and Google rely heavily on IR systems, with Amazon attributing over 35% of sales and Google 70% of YouTube views to their recommender systems. Effective evaluation measures are key to building and refining these systems.
In this blog, we will use Normalized Discounted Cumulative Gain (NDCG@K) to evaluate the performance of both the base embedding model and the fine-tuned embedding model, and assess whether fine-tuning actually improves retrieval.
We will use Redis as a vector database, serving as a persistent store for the embeddings, and RedisVL as the Python client library.
Normalized Discounted Cumulative Gain (NDCG@K)
NDCG evaluates retrieval quality by assigning ground-truth relevance grades to database elements. For example, highly relevant results might be graded 5, partially relevant ones 2–4, and irrelevant ones 1. NDCG sums the grades of the retrieved items but applies a log-based discount to account for result order: positions near the top of the list carry the most weight, so filling them with irrelevant items costs the most, and the metric rewards placing relevant items earlier in the results.
NDCG (Normalized Discounted Cumulative Gain) improves upon Cumulative Gain (CG) by accounting for the importance of rank positions, as users prioritize top results. It uses a discount factor to give higher weight to top-ranked items.
To calculate NDCG@K:
- DCG (Discounted Cumulative Gain): Sum the relevance scores of the top-K results, with a log-based discount for lower-ranked items:

  DCG@K = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i + 1)}

  where:
  - rel_i: Relevance score of the document at position i.
  - i: Rank position of the document (1-based index).
  - log2(i + 1): Discount factor that reduces the impact of lower-ranked items.
- IDCG (Ideal DCG): Compute DCG for the ideal ranking (highest relevance first).
- NDCG: Normalize DCG by dividing it by IDCG (NDCG@K = DCG@K / IDCG@K), ensuring a score between 0 and 1, where 1 indicates a perfect ranking. If IDCG is 0, NDCG is set to 0. A short worked example follows this list.
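As a quick worked example (the scores are invented for illustration), suppose the top-5 retrieved items have relevance scores [3, 2, 3, 0, 1]:

  DCG@5 = \frac{3}{\log_2 2} + \frac{2}{\log_2 3} + \frac{3}{\log_2 4} + \frac{0}{\log_2 5} + \frac{1}{\log_2 6} \approx 3 + 1.262 + 1.5 + 0 + 0.387 = 6.149

The ideal ordering [3, 3, 2, 1, 0] gives IDCG@5 ≈ 6.323, so NDCG@5 ≈ 6.149 / 6.323 ≈ 0.972.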
How to Define Relevance Scores
Relevance scores quantify how useful a retrieved item is for a given query. Scores can be defined based on the importance of the result:
- Graded Relevance (this can be customized to your problem; see the sketch after this list):
  - Top 1 position: Score = 3 (Highly relevant).
  - Top 2 or 3 positions: Score = 2 (Moderately relevant).
  - Top 4 or 5 positions: Score = 1 (Barely relevant).
  - Beyond top 5: Score = 0 (Not relevant).
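A minimal sketch of this scheme (the function name `graded_relevance` and its rank-based signature are my own, not from the original post):

```python
def graded_relevance(rank: int) -> int:
    """Map the 1-based rank at which the ground-truth item was retrieved
    to a graded relevance score, following the scheme above."""
    if rank == 1:
        return 3  # highly relevant
    if rank in (2, 3):
        return 2  # moderately relevant
    if rank in (4, 5):
        return 1  # barely relevant
    return 0      # not relevant
```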
Why Use Graded Relevance?
Graded relevance captures varying degrees of relevance, offering richer feedback for evaluating rankings. It reflects that some results may be useful but not perfect, while others are completely irrelevant.
Let’s get back to our steps to calculate NDCG.
Step-by-Step NDCG Calculation
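Here is a minimal, self-contained Python sketch of the calculation (the function names `dcg_at_k` and `ndcg_at_k` are my own, not from the notebook). It reproduces the worked example above:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted Cumulative Gain over the top-k relevance scores."""
    return sum(
        rel / math.log2(i + 2)  # enumerate is 0-based: position i+1 -> log2(i+2)
        for i, rel in enumerate(relevances[:k])
    )

def ndcg_at_k(relevances, k):
    """NDCG@K: DCG of the actual ranking divided by the ideal (sorted) DCG."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant items at all
    return dcg_at_k(relevances, k) / ideal

# Relevance scores of the top-5 retrieved items, in retrieval order
print(round(ndcg_at_k([3, 2, 3, 0, 1], k=5), 3))  # 0.972
```

Sorting the retrieved scores is a common approximation of the ideal ranking; if you know the full set of ground-truth relevance grades for a query, compute IDCG from those instead.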
Why Use NDCG?
- Position Sensitivity: Emphasizes higher-ranked items, aligning with user behavior.
- Graded Relevance: Supports varying relevance levels, enhancing versatility.
- Comparability: Normalization enables comparison across queries or datasets.
Limitations of NDCG:
- Complexity: Relies on subjective and challenging-to-obtain graded relevance judgments.
- Bias Toward Longer Lists: Sensitive to the length of ranked lists during normalization.
Redis Vector Database and NDCG@10 Calculation
The objective is to compute embeddings for all answers using both the base and fine-tuned embedding models and store them in two separate vector indexes. These embeddings will be used to retrieve answers given the corresponding questions as queries. After retrieval, the NDCG@K score will be calculated for both sets of embeddings to compare their performance.
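Below is a minimal sketch of this pipeline using RedisVL. The index name, prefix, field names, embedding dimension (384), and Redis URL are illustrative assumptions, and `embed_base`, `answers`, and `question` are hypothetical stand-ins for your embedding function and dataset; the same steps are repeated with the fine-tuned model against a second index.

```python
import numpy as np
from redisvl.index import SearchIndex
from redisvl.query import VectorQuery

# Illustrative schema: one index per embedding model. The fine-tuned model
# gets its own index (e.g. "answers_ft") built the same way.
schema = {
    "index": {"name": "answers_base", "prefix": "answers_base"},
    "fields": [
        {"name": "answer_id", "type": "tag"},
        {
            "name": "embedding",
            "type": "vector",
            "attrs": {
                "dims": 384,  # must match your embedding model
                "distance_metric": "cosine",
                "algorithm": "flat",
                "datatype": "float32",
            },
        },
    ],
}

index = SearchIndex.from_dict(schema, redis_url="redis://localhost:6379")
# (On older RedisVL versions: index = SearchIndex.from_dict(schema); index.connect(url).)
index.create(overwrite=True)

# Persist one embedding per answer; hash storage expects raw float32 bytes.
index.load([
    {
        "answer_id": str(i),
        "embedding": np.asarray(embed_base(ans), dtype=np.float32).tobytes(),
    }
    for i, ans in enumerate(answers)
])

# Retrieve the top-10 answers for a question.
query = VectorQuery(
    vector=np.asarray(embed_base(question), dtype=np.float32).tolist(),
    vector_field_name="embedding",
    return_fields=["answer_id"],
    num_results=10,
)
retrieved_ids = [hit["answer_id"] for hit in index.query(query)]
```

From `retrieved_ids`, locate the rank of the ground-truth answer for each question, convert it to relevance scores with `graded_relevance`, and pass those scores to `ndcg_at_k` from the sketch above; averaging over all questions yields NDCG@10 for each model.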
Please find the rest of the blog on LinkedIn: