{"id":2937,"date":"2025-12-17T15:30:49","date_gmt":"2025-12-17T15:30:49","guid":{"rendered":"https:\/\/blog.samarthya.me\/wps\/?p=2937"},"modified":"2025-12-17T15:30:59","modified_gmt":"2025-12-17T15:30:59","slug":"mastering-tf-idf-a-gamified-journey","status":"publish","type":"post","link":"https:\/\/blog.samarthya.me\/wps\/2025\/12\/17\/mastering-tf-idf-a-gamified-journey\/","title":{"rendered":"Mastering TF-IDF: A Gamified Journey!"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large is-style-rounded\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"559\" src=\"https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/12\/tf-idf1-1-1024x559.png\" alt=\"\" class=\"wp-image-2940\" srcset=\"https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/12\/tf-idf1-1-1024x559.png 1024w, https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/12\/tf-idf1-1-300x164.png 300w, https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/12\/tf-idf1-1.png 1408w, https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/12\/tf-idf1-1-300x164@2x.png 600w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Understanding how computers &#8220;read&#8221; and understand text is a fascinating field. One of the most fundamental techniques for identifying important keywords in a document, relative to a collection of documents, is <strong>TF-IDF (Term Frequency-Inverse Document Frequency)<\/strong>.<\/p>\n\n\n\n<p>I recently embarked on a gamified learning challenge to demystify TF-IDF, breaking it down into its core components. This post summarizes my adventure!<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>The Core Idea: Why TF-IDF Matters<\/strong><\/h3>\n\n\n\n<p>Imagine you have a huge library of books. If you want to find out what a specific book is <em>really<\/em> about, just looking at words that appear frequently isn&#8217;t enough. 
&#8220;The&#8221; or &#8220;is&#8221; might appear often, but they tell you nothing unique. TF-IDF helps us find words that are:<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Frequent within a specific document (TF)<\/strong><\/li>\n\n\n\n<li><strong>Rare across the entire collection of documents (IDF)<\/strong><\/li>\n<\/ol>\n\n\n\n<p>When a word meets both criteria, it&#8217;s likely a powerful keyword that defines that document&#8217;s content.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Level 1: Term Frequency (TF) \u2013 How Often Does a Word Appear?<\/h3>\n\n\n\n<p>Term Frequency is the simplest part. It&#8217;s just a count of how often a word (term) appears in a document, normalized by the total number of words in that document.<\/p>\n\n\n\n<p>Formula:<\/p>\n\n\n\n<div class=\"wp-block-math\"><math display=\"block\"><semantics><mrow><mtext>TF<\/mtext><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>t<\/mi><mo separator=\"true\">,<\/mo><mi>d<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>=<\/mo><mfrac><mrow><mtext>Number&nbsp;of&nbsp;times&nbsp;term&nbsp;<\/mtext><mi>t<\/mi><mtext>&nbsp;appears&nbsp;in&nbsp;document&nbsp;<\/mtext><mi>d<\/mi><\/mrow><mrow><mtext>Total&nbsp;number&nbsp;of&nbsp;words&nbsp;in&nbsp;document&nbsp;<\/mtext><mi>d<\/mi><\/mrow><\/mfrac><\/mrow><annotation encoding=\"application\/x-tex\">\\text{TF}(t, d) = \\frac{\\text{Number of times term } t \\text{ appears in document } d}{\\text{Total number of words in document } d}<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Example:<\/strong><\/h4>\n\n\n\n<p>Let&#8217;s say we have <strong>Document A<\/strong>:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>&#8220;The cat sat on the mat. 
The cat is black.&#8221;<\/p>\n<\/blockquote>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Total words in Document A:<\/strong> 10<\/li>\n\n\n\n<li><strong>Term &#8216;cat&#8217; appears:<\/strong> 2 times<\/li>\n<\/ul>\n\n\n\n<p>Calculation:<\/p>\n\n\n\n<div class=\"wp-block-math\"><math display=\"block\"><semantics><mrow><mtext>TF<\/mtext><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mtext>cat<\/mtext><mo separator=\"true\">,<\/mo><mtext>A<\/mtext><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>=<\/mo><mfrac><mn>2<\/mn><mn>10<\/mn><\/mfrac><mo>=<\/mo><mn>\ud835\udfce.\ud835\udfd0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\text{TF}(\\text{cat}, \\text{A}) = \\frac{2}{10} = \\mathbf{0.2}<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p>A higher TF means the word is more important <em>to that specific document<\/em>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Level 2: Inverse Document Frequency (IDF) \u2013 How Unique is a Word Across Documents?<\/h3>\n\n\n\n<p>Here&#8217;s where TF-IDF gets clever. IDF helps us ignore common words (like &#8220;the&#8221;) that appear in almost every document. 
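<\/p>

<p>Before moving on, the Term Frequency calculation from Level 1 can be checked with a short Python sketch (a minimal illustration: the <code>tf<\/code> helper and its lowercase-and-split tokenization are my own simplification, not a library API):<\/p>

```python
import string

def tf(term, document):
    # Lowercase, drop punctuation, and split on whitespace.
    words = document.lower().translate(str.maketrans('', '', string.punctuation)).split()
    # Count of the term divided by the total number of words in the document.
    return words.count(term) / len(words)

doc_a = 'The cat sat on the mat. The cat is black.'
print(tf('cat', doc_a))  # 2 / 10 = 0.2
```

<p>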
It assigns a higher score to words that are rare across our entire collection of documents (our &#8220;corpus&#8221;).<\/p>\n\n\n\n<p>Formula (General form):<\/p>\n\n\n\n<div class=\"wp-block-math\"><math display=\"block\"><semantics><mrow><mtext>IDF<\/mtext><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>t<\/mi><mo separator=\"true\">,<\/mo><mi>D<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>=<\/mo><mrow><mi>log<\/mi><mo>\u2061<\/mo><\/mrow><mrow><mo fence=\"true\" form=\"prefix\">(<\/mo><mfrac><mrow><mtext>Total&nbsp;number&nbsp;of&nbsp;documents&nbsp;<\/mtext><mi>N<\/mi><\/mrow><mrow><mtext>Number&nbsp;of&nbsp;documents&nbsp;containing&nbsp;term&nbsp;<\/mtext><mi>t<\/mi><\/mrow><\/mfrac><mo fence=\"true\" form=\"postfix\">)<\/mo><\/mrow><\/mrow><annotation encoding=\"application\/x-tex\">\\text{IDF}(t, D) = \\log \\left( \\frac{\\text{Total number of documents } N}{\\text{Number of documents containing term } t} \\right)<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p><strong>Our Corpus (4 Documents):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>D1:<\/strong> &#8220;The <strong>cat<\/strong> sat on the mat.&#8221;<\/li>\n\n\n\n<li><strong>D2:<\/strong> &#8220;The dog chased the <strong>frisbee<\/strong>.&#8221;<\/li>\n\n\n\n<li><strong>D3:<\/strong> &#8220;I like my <strong>cat<\/strong>.&#8221;<\/li>\n\n\n\n<li><strong>D4:<\/strong> &#8220;My neighbor&#8217;s <strong>cat<\/strong> is cute.&#8221;<\/li>\n<\/ul>\n\n\n\n<p><strong>Calculations (using natural logarithm):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>For &#8216;cat&#8217;: Appears in D1, D3, D4 (3 documents out of 4) <\/li>\n<\/ul>\n\n\n\n<div class=\"wp-block-math\"><math display=\"block\"><semantics><mrow><mtext>IDF<\/mtext><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mtext>cat<\/mtext><mo separator=\"true\">,<\/mo><mi>D<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>=<\/mo><mrow><mi>ln<\/mi><mo>\u2061<\/mo><\/mrow><mo form=\"prefix\" 
stretchy=\"false\">(<\/mo><mn>4<\/mn><mi>\/<\/mi><mn>3<\/mn><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>\u2248<\/mo><mn>\ud835\udfce.\ud835\udfd0\ud835\udfd6\ud835\udfd5<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\text{IDF}(\\text{cat}, D) = \\ln(4\/3) \\approx \\mathbf{0.287}<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<ul class=\"wp-block-list\">\n<li>For &#8216;frisbee&#8217;: Appears in D2 (1 document out of 4)<\/li>\n<\/ul>\n\n\n\n<div class=\"wp-block-math\"><math display=\"block\"><semantics><mrow><mtext>IDF<\/mtext><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mtext>frisbee<\/mtext><mo separator=\"true\">,<\/mo><mi>D<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>=<\/mo><mrow><mi>ln<\/mi><mo>\u2061<\/mo><\/mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mn>4<\/mn><mi>\/<\/mi><mn>1<\/mn><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>\u2248<\/mo><mn>\ud835\udfcf.\ud835\udfd1\ud835\udfd6\ud835\udfd4<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\text{IDF}(\\text{frisbee}, D) = \\ln(4\/1) \\approx \\mathbf{1.386}<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p>Notice: &#8216;frisbee&#8217; (rare) gets a much higher IDF than &#8216;cat&#8217; (relatively common).<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Level 3: The TF-IDF Score \u2013 Combining Frequency and Uniqueness<\/h3>\n\n\n\n<p>The final TF-IDF score is simply the product of TF and IDF. 
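<\/p>

<p>The Level 2 numbers, and the product itself, can be reproduced with a short Python sketch (again a simplification of my own: <code>idf<\/code> uses a plain substring test instead of real tokenization, and the apostrophe in D4 is dropped for brevity):<\/p>

```python
import math

def idf(term, corpus):
    # ln(total documents / documents containing the term).
    # Substring matching is a simplification; real code would tokenize.
    matches = sum(1 for doc in corpus if term in doc.lower())
    return math.log(len(corpus) / matches)

corpus = [
    'The cat sat on the mat.',      # D1
    'The dog chased the frisbee.',  # D2
    'I like my cat.',               # D3
    'My neighbors cat is cute.',    # D4 (apostrophe dropped)
]

print(round(idf('cat', corpus), 4))      # ln(4/3), prints 0.2877
print(round(idf('frisbee', corpus), 4))  # ln(4/1), prints 1.3863

# The TF-IDF score is just the product: TF x IDF.
print(round(0.2 * idf('cat', corpus), 4))      # 0.0575 (0.0574 if you use the rounded IDF 0.287)
print(round(0.1 * idf('frisbee', corpus), 4))  # 0.1386
```

<p>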
This score highlights words that are both frequent in a document <em>and<\/em> unique across the corpus.<\/p>\n\n\n\n<p>Formula:<\/p>\n\n\n\n<div class=\"wp-block-math\"><math display=\"block\"><semantics><mrow><mtext>TF-IDF<\/mtext><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>t<\/mi><mo separator=\"true\">,<\/mo><mi>d<\/mi><mo separator=\"true\">,<\/mo><mi>D<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>=<\/mo><mtext>TF<\/mtext><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>t<\/mi><mo separator=\"true\">,<\/mo><mi>d<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>\u00d7<\/mo><mtext>IDF<\/mtext><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>t<\/mi><mo separator=\"true\">,<\/mo><mi>D<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\text{TF-IDF}(t, d, D) = \\text{TF}(t, d) \\times \\text{IDF}(t, D)<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Putting it Together:<\/strong><\/h4>\n\n\n\n<ul start=\"1\" class=\"wp-block-list\">\n<li><strong>TF for &#8216;cat&#8217; in Document A:<\/strong> 0.2<\/li>\n\n\n\n<li><strong>IDF for &#8216;cat&#8217; in Corpus:<\/strong> 0.287 <\/li>\n<\/ul>\n\n\n\n<div class=\"wp-block-math\"><math display=\"block\"><semantics><mrow><mtext>TF-IDF<\/mtext><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mtext>cat<\/mtext><mo separator=\"true\">,<\/mo><mtext>A<\/mtext><mo separator=\"true\">,<\/mo><mi>D<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>=<\/mo><mn>0.2<\/mn><mo>\u00d7<\/mo><mn>0.287<\/mn><mo>\u2248<\/mo><mn>\ud835\udfce.\ud835\udfce\ud835\udfd3\ud835\udfd5\ud835\udfd2<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\text{TF-IDF}(\\text{cat}, \\text{A}, D) = 0.2 \\times 0.287 \\approx \\mathbf{0.0574}<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<ul start=\"1\" class=\"wp-block-list\">\n<li><strong>Hypothetical TF for &#8216;frisbee&#8217; in a document:<\/strong> 0.1 (e.g., if 
&#8216;frisbee&#8217; appeared once in a 10-word document)<\/li>\n\n\n\n<li><strong>IDF for &#8216;frisbee&#8217; in Corpus:<\/strong> 1.386<\/li>\n<\/ul>\n\n\n\n<div class=\"wp-block-math\"><math display=\"block\"><semantics><mrow><mtext>TF-IDF<\/mtext><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mtext>frisbee<\/mtext><mo separator=\"true\">,<\/mo><mtext>hypothetical&nbsp;doc<\/mtext><mo separator=\"true\">,<\/mo><mi>D<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>=<\/mo><mn>0.1<\/mn><mo>\u00d7<\/mo><mn>1.386<\/mn><mo>\u2248<\/mo><mn>\ud835\udfce.\ud835\udfcf\ud835\udfd1\ud835\udfd6\ud835\udfd4<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\text{TF-IDF}(\\text{frisbee}, \\text{hypothetical doc}, D) = 0.1 \\times 1.386 \\approx \\mathbf{0.1386}<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p><strong>Result:<\/strong> Even with a lower term frequency, &#8216;frisbee&#8217; has a higher overall TF-IDF score because its uniqueness (high IDF) boosts its importance. <\/p>\n\n\n\n<p>This is the magic of TF-IDF!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Understanding how computers &#8220;read&#8221; and understand text is a fascinating field. One of the most fundamental techniques for identifying important keywords in a document, relative to a collection of documents, is TF-IDF (Term Frequency-Inverse Document Frequency). I recently embarked on a gamified learning challenge to demystify TF-IDF, breaking it down into its core components. 
This [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":2941,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"footnotes":""},"categories":[34],"tags":[],"class_list":["post-2937","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technical"],"_links":{"self":[{"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/posts\/2937","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/comments?post=2937"}],"version-history":[{"count":2,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/posts\/2937\/revisions"}],"predecessor-version":[{"id":2942,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/posts\/2937\/revisions\/2942"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/media\/2941"}],"wp:attachment":[{"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/media?parent=2937"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/categories?post=2937"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/tags?post=2937"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}