{"id":2916,"date":"2025-11-08T16:58:18","date_gmt":"2025-11-08T16:58:18","guid":{"rendered":"https:\/\/blog.samarthya.me\/wps\/?p=2916"},"modified":"2025-11-08T16:58:38","modified_gmt":"2025-11-08T16:58:38","slug":"data-mining-essentials","status":"publish","type":"post","link":"https:\/\/blog.samarthya.me\/wps\/2025\/11\/08\/data-mining-essentials\/","title":{"rendered":"Data Mining Essentials"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full is-style-default\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"1024\" src=\"https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/11\/manhattan_distance.png\" alt=\"\" class=\"wp-image-2917\" srcset=\"https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/11\/manhattan_distance.png 1024w, https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/11\/manhattan_distance-150x150@2x.png 300w, https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/11\/manhattan_distance-150x150.png 150w, https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/11\/manhattan_distance-300x300@2x.png 600w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Whether you&#8217;re preparing for a quiz or just brushing up on fundamentals, this guide distills the key concepts from Data Mining into bite-sized, memorable chunks. Let&#8217;s dive in!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Understanding Machine Learning Tasks<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Regression vs. Classification: Know Your Output<\/strong><\/h3>\n\n\n\n<p>The fundamental distinction in supervised learning comes down to what you&#8217;re predicting:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Regression<\/strong> tackles continuous outputs; think house prices, rainfall amounts, or quarterly revenue. 
If you can measure it on a scale with infinite precision, it&#8217;s <code>regression<\/code> territory.<\/li>\n\n\n\n<li><strong>Classification<\/strong> handles categorical outcomes. This includes <strong>binary classification<\/strong> (yes\/no, subscribe\/don&#8217;t subscribe, spam\/not spam) and <strong>multiclass classification<\/strong> (small\/medium\/large, or identifying flower species).<\/li>\n<\/ul>\n\n\n\n<p><strong>Pro tip<\/strong>: If someone asks &#8220;<code>how much?<\/code>&#8221; \u2192 regression. If they ask &#8220;<code>which category?<\/code>&#8221; \u2192 classification.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Unsupervised Learning: Finding Hidden Patterns<\/strong><\/h3>\n\n\n\n<p>Unlike supervised learning where we have labeled outcomes, unsupervised learning discovers structure in unlabeled data. <strong>Clustering<\/strong> is the star example: grouping customers into segments based on behavior without predefined labels. No one tells the algorithm what the &#8220;<code>right<\/code>&#8221; groups are; it finds natural patterns on its own.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Data Preparation Fundamentals<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Handling Missing Values<\/strong><\/h3>\n\n\n\n<p>Real-world data is messy. 
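As a taste of what a fix looks like, here is a minimal Python sketch of mean imputation on a toy column (all values are invented for illustration):

```python
import statistics

# Toy "messy" column: None marks a missing entry (values invented).
ages = [25, None, 40, 35, None, 30]

# Mean imputation: replace each hole with the mean of the observed values.
observed = [v for v in ages if v is not None]
mean_age = statistics.mean(observed)  # (25 + 40 + 35 + 30) / 4 = 32.5
filled = [v if v is not None else mean_age for v in ages]

print(filled)  # [25, 32.5, 40, 35, 32.5, 30]
```

Libraries such as pandas (`fillna`) and scikit-learn (`SimpleImputer`) do the same job at scale.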
When you encounter missing values in numerical features, you have options:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Remove<\/strong> rows (if only a few are missing) or columns (if many values are missing)<\/li>\n\n\n\n<li><strong>Impute<\/strong> by replacing missing values with statistical estimates: mean, median, or mode<\/li>\n<\/ul>\n\n\n\n<p><strong>Imputation<\/strong> is generally preferred because it preserves your dataset size and provides reasonable estimates based on existing data.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Outliers: The Troublemakers<\/strong><\/h3>\n\n\n\n<p>An <strong>outlier<\/strong> is a data point that significantly deviates from other observations\u2014think of a $50 million house in a neighborhood where most homes sell for $300,000. These points can disproportionately influence your model&#8217;s fit, especially in linear regression.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Feature Scaling: Leveling the Playing Field<\/strong><\/h3>\n\n\n\n<p>Imagine you&#8217;re building a model with two features: annual income ($20,000-$200,000) and age (18-80). Without scaling, income would dominate simply because its numbers are larger.<\/p>\n\n\n\n<p><strong>Feature scaling<\/strong> (through normalization or standardization) ensures all features contribute proportionally to the model. This is especially critical for distance-based algorithms like K-Nearest Neighbors, where unscaled features would completely skew distance calculations.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Distance Metrics: Measuring Similarity<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Manhattan Distance<\/strong><\/h3>\n\n\n\n<p>Picture yourself navigating city blocks. 
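That block-by-block walk translates directly into code; here is a minimal Python sketch, using the same points as the worked example in the text:

```python
# Manhattan (city-block) distance: sum the absolute differences
# along each dimension of the two points.
def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

print(manhattan_distance((3, 8, 2), (1, 4, 5)))  # |3-1| + |8-4| + |2-5| = 9
```

Swapping the absolute differences for squared differences (plus a square root) gives Euclidean distance; either way the result is never negative.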
Manhattan distance sums the absolute differences across all dimensions:<\/p>\n\n\n\n<p>For points A = (3, 8, 2) and B = (1, 4, 5): Manhattan Distance = |3-1| + |8-4| + |2-5| = 2 + 4 + 3 = <strong>9<\/strong><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Hamming Distance<\/strong><\/h3>\n\n\n\n<p>For categorical or binary data, Hamming distance counts mismatches. Comparing &#8220;10110&#8221; and &#8220;11101&#8221; gives us 3 differences (positions 2, 4, and 5), so Hamming Distance = <strong>3<\/strong>.<\/p>\n\n\n\n<p><strong>Key Property<\/strong>: Both Manhattan and Euclidean distances are always <strong>non-negative<\/strong>. Distance measures separation\u2014it can be zero (identical points) or positive, but never negative.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Model Training and Validation<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Train-Test split<\/h3>\n\n\n\n<p>Why split your data? <strong>To evaluate the model&#8217;s ability to generalize to unseen data.<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Training set<\/strong>: Teaches the model patterns<\/li>\n\n\n\n<li><strong>Test set<\/strong>: Provides an honest assessment of performance on new data<\/li>\n<\/ul>\n\n\n\n<p>Common splits are 80\/20 or 70\/30. This isn&#8217;t about increasing data or reducing computation\u2014it&#8217;s about validation.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Underfitting vs. Overfitting<\/strong><\/h3>\n\n\n\n<p>Think Goldilocks and the Three Bears:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Underfitting<\/strong>: Model is too simple, performing poorly on both training AND test data. It hasn&#8217;t captured the underlying patterns.<\/li>\n\n\n\n<li><strong>Overfitting<\/strong>: Model is too complex, memorizing the training data (including noise) and performing well on training but poorly on test data.<\/li>\n<\/ul>\n\n\n\n<p>The goal? 
A model that&#8217;s &#8220;<code>just right<\/code>&#8221;: complex enough to capture patterns but not so complex it memorizes noise.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Hyperparameter tuning<\/h3>\n\n\n\n<p><strong><code>Hyperparameters<\/code><\/strong> control the learning process (like the number of neighbors in KNN or regularization strength). <strong>Hyperparameter tuning<\/strong> is the systematic process of finding optimal values.<\/p>\n\n\n\n<p><strong>Critical insight<\/strong>: Never use your test set for tuning! This causes <strong>data leakage<\/strong>\u2014where information inappropriately influences your model, leading to overly optimistic performance estimates.<\/p>\n\n\n\n<p><strong>Best practice<\/strong>: Split training data into a reduced training set and a validation set for tuning, then use the untouched test set for final evaluation.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Data leakage<\/h2>\n\n\n\n<p><strong>Data leakage<\/strong> occurs when information from outside the training dataset (like test set data, future information, or target-related data) is used to create features or train the model. This creates unrealistically good performance that won&#8217;t hold up in production.<\/p>\n\n\n\n<p>A classic example: Using the test set to select hyperparameters &#8220;leaks&#8221; information and biases your generalization error estimate.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Quick Reference<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>When to Use What:<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"has-medium-font-size\">Predicting house prices based on size? \u2192 <strong>Linear Regression<\/strong><\/li>\n\n\n\n<li class=\"has-medium-font-size\">Determining if a customer will buy (Yes\/No)? \u2192 <strong>Binary Classification<\/strong> (use Logistic Regression or similar)<\/li>\n\n\n\n<li class=\"has-medium-font-size\">Forecasting quarterly revenue? 
\u2192 <strong>Regression<\/strong><\/li>\n\n\n\n<li class=\"has-medium-font-size\">Grouping customers without predefined labels? \u2192 <strong>Clustering<\/strong> (unsupervised)<\/li>\n\n\n\n<li class=\"has-medium-font-size\">Predicting rainfall amount in mm? \u2192 <strong>Regression<\/strong><\/li>\n\n\n\n<li class=\"has-medium-font-size\">Output type determines task: Continuous = regression, Categorical = classification<\/li>\n\n\n\n<li class=\"has-medium-font-size\"><strong>Train-test split = generalization test<\/strong>, not data augmentation<\/li>\n\n\n\n<li class=\"has-medium-font-size\"><strong>Poor on both sets = underfitting<\/strong>; <strong>Great on training, poor on test = overfitting<\/strong><\/li>\n\n\n\n<li class=\"has-medium-font-size\"><strong>Distances are always non-negative<\/strong> (they measure separation)<\/li>\n\n\n\n<li class=\"has-medium-font-size\"><strong>Imputation fills missing values<\/strong>; don&#8217;t confuse with one-hot encoding (for categories)<\/li>\n\n\n\n<li class=\"has-medium-font-size\"><strong>Feature scaling prevents large-value dominance<\/strong>, especially in distance-based algorithms<\/li>\n\n\n\n<li class=\"has-medium-font-size\"><strong>Data leakage = inappropriate information flow<\/strong> \u2192 ruins model validity<\/li>\n\n\n\n<li class=\"has-medium-font-size\"><strong>Hyperparameter tuning uses validation sets<\/strong>, never the test set<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><strong>Final Thought<\/strong>: Machine learning success isn&#8217;t just about fancy algorithms\u2014it&#8217;s about understanding your problem type, preparing your data properly, and validating honestly. 
Master these fundamentals, and you&#8217;ve built a solid foundation for everything that follows.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Whether you&#8217;re preparing for a quiz or just brushing up on fundamentals, this guide distills the key concepts from Data Mining into bite-sized, memorable chunks. Let&#8217;s dive in! Understanding Machine Learning Tasks Regression vs. Classification: Know Your Output The fundamental distinction in supervised learning comes down to what you&#8217;re predicting: Pro tip: If someone asks [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":2918,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"footnotes":""},"categories":[34],"tags":[345,352,346],"class_list":["post-2916","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technical","tag-ai","tag-datamining","tag-ml"],"_links":{"self":[{"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/posts\/2916","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/comments?post=2916"}],"version-history":[{"count":2,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/posts\/2916\/revisions"}],"predecessor-version":[{"id":2920,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/posts\/2916\/revisions\/2920"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/media\/2918"}],"wp:attachment":[{"href":"https:\/\/blog.samarthya.me\/wps\/w
p-json\/wp\/v2\/media?parent=2916"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/categories?post=2916"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/tags?post=2916"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}