Data Mining Review Note

Introduction

This material covers key points of BBU6504: Data Mining.

Note

Some parts of this material are adapted from the BBU6504 lecture slides; it is intended for personal study and educational use only.

Lecture 1: Introduction

What is Data Mining?

KDD - Knowledge Discovery in Databases

  1. Extract previously unknown and potentially useful information from data

  2. Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

    IMG_D4D66C50C7C2-1.jpeg

Attribute

| Type | Ordered | Differences Meaningful | True Zero | Ratios Valid |
|---|---|---|---|---|
| Nominal | No | No | No | No |
| Ordinal | Yes | No | No | No |
| Interval | Yes | Yes | No | No |
| Ratio | Yes | Yes | Yes | Yes |

Noise

For objects, noise is an extraneous object. For attributes, noise refers to modification of original values.

Outliers

Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set.

IMG_3CE51D76A85A-1.jpeg

Duplicate Data

Duplicate data refers to records that are exactly or nearly identical.

Cause: It’s common when merging datasets from heterogeneous sources.

Example:

How to Handle:

Data cleaning: Process of dealing with duplicate data issues

When Not to Remove:

Lecture 2: Quality, Preprocessing, and Measures

Preprocessing

Aggregation

image.png

Sampling

IMG_E1B97431D974-1.jpeg

image.png

Curse of Dimensionality

image.png

1. Data becomes sparse

In high-dimensional space, the distances between most pairs of data points become roughly equal and density differences become less pronounced. Many algorithms (especially distance-based ones such as KNN and clustering) then struggle to judge which points are truly "close".

2. Samples become less representative

In high-dimensional space, exponentially more samples may be needed to cover the whole space. The data at hand may not be representative at all, so the trained model generalizes poorly.

3. Classifier performance degrades

For classification tasks, too many dimensions can lead to:

4. Clustering quality drops

Clustering relies on distances between points to form clusters, but in high dimensions distances lose their meaning, so the cluster structure becomes blurred and the results degrade.
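
The distance-concentration effect in point 1 can be checked empirically. Below is a minimal sketch (my own illustration, not from the slides) that measures how the gap between the nearest and farthest neighbour shrinks as dimensionality grows.

```python
# Minimal sketch: distance concentration in high dimensions (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000):
    X = rng.random((500, d))                  # 500 random points in d dimensions
    q = X[0]                                  # pick one point as the query
    dists = np.linalg.norm(X[1:] - q, axis=1)
    # Relative contrast: how much farther the farthest point is than the nearest.
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast={contrast:.3f}")
# The contrast shrinks as d grows, i.e. "near" and "far" neighbours look alike.
```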

1. Dimensionality Reduction

Goal: Transform data into a lower-dimensional space while preserving as much useful information as possible.

Benefits:

Techniques:
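
As one example of such a technique, here is a minimal PCA sketch (assuming scikit-learn is available; the data is randomly generated purely for illustration):

```python
# Minimal PCA sketch (assumes scikit-learn; illustrative only).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 samples, 10 original features

pca = PCA(n_components=2)             # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)      # shape: (100, 2)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)  # fraction of variance kept per component
```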


2. Feature Subset Selection

Goal: Select a subset of the most relevant features instead of transforming them.

Types of Features to Remove:

Benefits:

Binarization

IMG_15D9D128DF72-1.jpeg

Attribute Transformation

IMG_44A608052579-1.jpeg

Missing Values

image.png

| Attribute type | Strategy |
|---|---|
| Continuous | Use mean, median, or the average of the nearest neighbors |
| Categorical | Use the mode (most frequent value) among the nearest neighbors |
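
A minimal pandas sketch of both strategies (the column names and values are hypothetical, not from the slides):

```python
# Hypothetical example of simple imputation for missing values (pandas assumed).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [50.0, np.nan, 70.0, 65.0],      # continuous attribute
    "city":   ["A", "B", np.nan, "A"],         # categorical attribute
})

df["income"] = df["income"].fillna(df["income"].mean())      # mean imputation
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])    # mode imputation
print(df)
```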

Similarity Measures

Minkowski Distance

image.png
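
Since the slide image is not reproduced here, the standard definition of the Minkowski distance between points $x$ and $y$ with parameter $r$ is:

$d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}$

$r = 1$ gives the Manhattan (city-block) distance, $r = 2$ the Euclidean distance, and $r \to \infty$ the supremum (Chebyshev) distance.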

SMC and Jaccard

image.png
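
For binary vectors, with $f_{01}, f_{10}, f_{11}, f_{00}$ counting the attribute positions where the two objects take the corresponding value combinations, the standard definitions are:

$SMC = \frac{f_{11} + f_{00}}{f_{01} + f_{10} + f_{11} + f_{00}}, \qquad J = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}$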

Cosine Similarity

image.png
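
The standard cosine similarity between vectors $d_1$ and $d_2$:

$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}$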

Correlation

image.png
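
The (Pearson) correlation between attributes $x$ and $y$ is their standardized covariance:

$corr(x, y) = \frac{covariance(x, y)}{std(x)\, std(y)} = \frac{s_{xy}}{s_x s_y}$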

| Measure | Type | Applicable data | Focus | Recommended for | Not suitable for |
|---|---|---|---|---|---|
| Euclidean Distance | Distance | Numeric (continuous) | Differences in magnitude | Clustering (e.g., K-means), low-dimensional continuous data | High-dimensional sparse data |
| Cosine Similarity | Similarity | Numeric (sparse vectors) | Direction/proportional structure (ignores magnitude) | Text vectors, user interests, TF-IDF | Tasks where absolute magnitude matters |
| Correlation | Similarity | Numeric | Trend of co-variation (linear relationship) | Comparing time-series trends, rating patterns | Ignores absolute values; not for magnitude-focused tasks |
| SMC (Simple Matching Coefficient) | Similarity | Binary (symmetric) | Both 0-0 and 1-1 matches count | Votes, gender, preferences | Sparse behavioural data (mostly 0s) |
| Jaccard Similarity | Similarity | Binary (asymmetric) | 1-1 co-occurrences only, ignores 0s | Medical records, user behaviour, tag data | Dense features; symmetric binary attributes |

Clustering

  1. Well-separated
  2. Prototype-based
  3. Density-based

OR

  1. Hierarchical
  2. Partitional
  3. Density-based

K-means

K-means has problems when clusters have differing sizes, differing densities, or non-globular shapes.

K-means also has problems when the data contains outliers.

DBSCAN

image.png

image.png

image.png
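
A minimal scikit-learn sketch of running DBSCAN with its two parameters, Eps (neighbourhood radius, `eps`) and MinPts (`min_samples`); the data and parameter values below are hypothetical:

```python
# Minimal DBSCAN sketch (assumes scikit-learn; parameter values are hypothetical).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered points that should end up as noise.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.2, size=(50, 2)),
    rng.normal(loc=3.0, scale=0.2, size=(50, 2)),
    rng.uniform(low=-2, high=5, size=(10, 2)),
])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)   # eps = Eps, min_samples = MinPts
print(set(db.labels_))                        # cluster ids; -1 marks noise points
```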

DIANA (Divisive Analysis)

image.png

image.png

Agglomerative Hierarchical Clustering

Advantages and Disadvantages of Linkage

1. MIN (Single Linkage)

Advantages:

Disadvantages:


2. MAX (Complete Linkage)

Advantages:

Disadvantages:


3. Group Average Linkage

Advantages:

Disadvantages:


Problems and Limitations

Q: How to overcome the limitation that merges are final (cannot be undone) in hierarchical clustering?

This limitation means that once two clusters are merged, the decision cannot be undone. To overcome this:

  1. Post-processing by moving branches in the dendrogram:

    After constructing the initial hierarchy, the structure of the dendrogram can be refined by moving subtrees (branches) to new positions to better optimize a global objective function (e.g., minimizing overall variance or improving silhouette score). This allows corrections to be made to earlier poor merging decisions.

  2. Hybrid approach using partitional clustering first:

    Another method is to first use a partitional algorithm like K-means to divide the data into many small, dense clusters. Then, hierarchical clustering is applied to these clusters instead of individual data points. This reduces the impact of noise and prevents early incorrect merges, leading to a more robust hierarchy.

Advanced Cluster Analysis

Silhouette

image.png
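
The standard silhouette coefficient of a point $i$ is

$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$

where $a(i)$ is the average distance from $i$ to the other points in its own cluster and $b(i)$ is the smallest average distance from $i$ to the points of any other cluster; values close to 1 indicate a well-placed point.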

Cluster 1 contains {P1, P2}, Cluster 2 contains {P3, P4}. The dissimilarity matrix that we obtain from the similarity matrix is the following:

image.png

image.png

Decision Tree

Confusion Matrix

image.png

Impurity Calculation (will be assessed)

image.png
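
The three standard impurity measures for a node $t$, with $p(j \mid t)$ the fraction of class-$j$ records at $t$:

$Gini(t) = 1 - \sum_{j} p(j \mid t)^2, \qquad Entropy(t) = -\sum_{j} p(j \mid t) \log_2 p(j \mid t), \qquad Error(t) = 1 - \max_{j} p(j \mid t)$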

Gini gain

image.png
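
When a node is split into $k$ children, the quality of the split is the weighted impurity of the children, and the Gini gain is the reduction relative to the parent:

$Gini_{split} = \sum_{i=1}^{k} \frac{n_i}{n} \, Gini(i), \qquad Gain = Gini(\text{parent}) - Gini_{split}$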

Gain ratio

Node impurity measures tend to prefer splits that result in a large number of partitions, each being small but pure.

image.png
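
Gain ratio corrects for this bias by dividing the information gain by the split information, which grows with the number of partitions:

$GainRatio = \frac{Gain_{split}}{SplitInfo}, \qquad SplitInfo = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$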

Finding The Best Split for Continuous Attributes

For efficient computation, $O(n \log n)$ (sorting-based) compared to the brute-force $O(n^2)$ approach:

image.png
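
A minimal sketch (my own illustration, with made-up values) of the sorting-based approach: sort the records by the attribute once ($O(n \log n)$), then sweep over the candidate split points while updating the class counts incrementally, so each candidate costs $O(1)$ instead of a full $O(n)$ pass.

```python
# Sketch of the sorted-sweep search for the best binary split on a continuous
# attribute using the Gini index (illustrative; values are hypothetical).
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_split(values, labels):
    data = sorted(zip(values, labels))                   # O(n log n)
    classes = sorted(set(labels))
    left = {c: 0 for c in classes}
    right = {c: labels.count(c) for c in classes}
    n = len(labels)
    best = (float("inf"), None)
    for i in range(1, n):                                # O(n) candidate sweep
        v, c = data[i - 1]
        left[c] += 1                                     # incremental count update
        right[c] -= 1
        if data[i][0] == v:                              # split only between distinct values
            continue
        w_gini = (i / n) * gini(list(left.values())) + \
                 ((n - i) / n) * gini(list(right.values()))
        best = min(best, (w_gini, (v + data[i][0]) / 2))
    return best                                          # (weighted Gini, threshold)

print(best_split([60, 70, 75, 85, 90, 95, 100, 120, 125, 220],
                 ["N", "N", "N", "Y", "Y", "Y", "N", "N", "N", "N"]))
```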

Pros & Cons

Advantages:

Disadvantages:

Q: Why would we choose Naive Bayes instead of decision trees in some scenarios?

In certain scenarios, Naive Bayes is preferred over decision trees for the following reasons:

  1. High-dimensional data: Naive Bayes performs well with high-dimensional data, such as text classification, where each word is a feature. It remains effective even with a small number of training examples. In contrast, decision trees are prone to overfitting in such cases.

  2. Faster training and prediction: Naive Bayes is computationally efficient because it only requires calculating probabilities. Decision trees need to search for optimal splits, which is more time-consuming.

  3. Better performance with independent features: If features are mostly conditionally independent, Naive Bayes can be highly accurate. Decision trees may overfit when feature interactions are weak or noisy.

  4. Robust to noise and missing data: Naive Bayes handles noisy and missing data gracefully, while decision trees can be sensitive and may produce overly complex models.

  5. Small dataset:

    Naive Bayes requires less data to estimate parameters, making it more suitable when training data is limited. Decision trees often need more data to build stable and generalizable models.

  6. Flexible decision boundaries:

    Decision trees can only form rectilinear (axis-aligned) boundaries, limiting their ability to model complex or diagonal class separations. Naive Bayes, especially with Gaussian distributions, can create elliptical or non-linear decision boundaries.

Classification-Evaluation

Overfitting and Underfitting

What is overfitting and underfitting?

Optimistic and Pessimistic

  1. Optimistic: use the training error directly as the estimate of the generalisation error.
  2. Pessimistic: add a penalty term for model complexity to the training error (see below).

image.png

MDL

The model is encoded in compressed form before being passed to B. The cost of transmission is equal to the cost of model coding.

The total cost of transmission is:

$Cost(Model,Data) = Cost(Data|Model) + Cost(Model)$

Cost(Data|Model): encodes the misclassification errors.

Cost(Model): the model coding overhead.

IMG_FB50D1BD7F40-1.jpeg

So the total cost should be:

$Cost(Model, Data) = N(\text{Error}) \times \log_2 n + N(\text{Internal}) \times \log_2 m + N(\text{Leaf}) \times \log_2 k$

where n is the number of training instances, m is the number of attributes and k is the number of classes
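
As a small, purely hypothetical worked example of this formula: with $n = 16$ training instances, $m = 4$ attributes, $k = 2$ classes, and a tree with 3 internal nodes, 4 leaves, and 2 training errors,

$Cost = 2 \times \log_2 16 + 3 \times \log_2 4 + 4 \times \log_2 2 = 8 + 6 + 4 = 18 \text{ bits}$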

image.png

image.png

Handling Overfitting

Pre-pruning

Stop the algorithm before it becomes a fully-grown tree

  1. Stop if all instances belong to the same class

  2. Stop if all the attribute values are the same

More restrictive conditions:

  1. Stop if number of instances is less than some user- specified threshold
  2. Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
  3. Stop if estimated generalization error falls below certain threshold

Post-pruning

Subtree replacement

  1. If generalisation error improves after trimming, replace sub-tree by a leaf node
  2. Class label of leaf node is determined from majority class of instances in the sub-tree.

Subtree raising

  1. Replace subtree with most frequently used branch

Post-pruning tends to produce better results than pre-pruning because its pruning decisions are based on a fully grown decision tree, whereas pre-pruning may terminate the growth of the tree too early. However, with post-pruning, when a subtree is cut off, the effort spent growing that part of the complete tree is wasted.

8.1 Alternative Classification Techniques

Rule-based Classifier

Rule coverage and accuracy

IMG_8BF01670A94D-1.jpeg
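
For a rule $A \rightarrow y$ over a dataset $D$, the standard definitions are:

$Coverage(r) = \frac{|A|}{|D|}, \qquad Accuracy(r) = \frac{|A \cap y|}{|A|}$

where $|A|$ is the number of records that satisfy the rule antecedent and $|A \cap y|$ is the number of those records whose class label is $y$.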

Rule Evaluation

IMG_9ECBD94DE1F1-1.jpeg


image.png

Direct Method-RIPPER

RIPPER (Repeated Incremental Pruning to Produce Error Reduction) is a rule-based classification algorithm designed to efficiently generate a set of interpretable rules, especially for large datasets. It follows a direct rule-learning approach through sequential covering, pruning, and MDL-based stopping criteria.


Handling Class Labels


Growing a Rule
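
RIPPER grows a rule greedily, adding the conjunct that maximizes FOIL's information gain (the standard formulation). If the current rule $R_0$ covers $p_0$ positive and $n_0$ negative examples, and the candidate extension $R_1$ covers $p_1$ and $n_1$, the gain is:

$Gain(R_0, R_1) = p_1 \times \left( \log_2 \frac{p_1}{p_1 + n_1} - \log_2 \frac{p_0}{p_0 + n_0} \right)$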


Building a Rule Set

Indirect Method-C4.5

C4.5rules is an indirect rule-based classification method derived from the C4.5 decision tree algorithm. Instead of building rules directly from data, it extracts and simplifies rules from an unpruned decision tree, and then organizes them using the minimum description length (MDL) principle.


🔹 Rule Extraction and Simplification


🔹 Pessimistic Error Estimate

To avoid overfitting, C4.5 uses a pessimistic estimate of a rule/tree’s error:

$e_g(T) = \frac{e(T) + l(T)}{N_t}$

Where:

This formula penalizes complex models and favors generalization.


🔹 Class-Based Rule Set Organization

Instead of ordering rules linearly, C4.5rules orders subsets of rules grouped by class label. Each class has its own rule subset.

For each class subset:

$\text{Description Length} = L(\text{error}) + g \cdot L(\text{model})$

Where:

Finally, order the classes by increasing total description length, prioritizing the most compact and accurate rule sets.


🔹 C4.5rules vs RIPPER – Rule Comparison Example

Given a dataset of animals with features like “Give Birth”, “Can Fly”, “Live in Water”, etc., both algorithms generate rule sets for classification.

C4.5rules:

RIPPER:

| Feature | C4.5rules | RIPPER |
|---|---|---|
| Method Type | Indirect (tree-based) | Direct (rule growing) |
| Rule Source | Decision tree paths | Data-driven search |
| Rule Simplification | Pessimistic error pruning | Post-pruning with FOIL gain |
| Class Handling | Class-based subsets (ordered by MDL) | Class-wise sequential covering |
| Default Class | Explicit rules | Implicit (default class at end) |

KNN (know)

Bayes Classifier

IMG_E8D574F7544B-1.jpeg


Estimate Probabilities from Data

In Naïve Bayes classification, we need to estimate the conditional probability of attributes given a class:

$P(X_i \mid Y_j)$

For continuous attributes, there are two main strategies:


1. Discretization


2. Probability Density Estimation

$P(X_i = x \mid Y_j) = \frac{1}{\sqrt{2\pi \sigma_{ij}^2}} \cdot e^{-\frac{(x - \mu_{ij})^2}{2 \sigma_{ij}^2}}$
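
A minimal sketch (with made-up numbers) of estimating $\mu_{ij}$ and $\sigma_{ij}^2$ from the training records of one class and plugging them into the Gaussian density above:

```python
# Hypothetical sketch: Gaussian density estimate for P(Income = 120 | Evade = No).
import math

income_no = [125, 100, 70, 120, 60, 220, 75, 90]   # hypothetical class-"No" values
mu = sum(income_no) / len(income_no)
var = sum((x - mu) ** 2 for x in income_no) / (len(income_no) - 1)  # sample variance

def gaussian(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

print(mu, var, gaussian(120, mu, var))
```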


Example

Suppose we have a dataset with the attribute “Taxable Income” and class “Evade = No”:


image.png

image.png

8-2 Artificial Neural Networks

ANN

An Artificial Neural Network (ANN) is a machine learning model inspired by the structure of the human brain. It is composed of layers of interconnected nodes (neurons) that learn to map inputs to outputs through training.

Structure of ANN


How ANN Learns

  1. Forward Propagation: Inputs pass through the network to generate an output.
  2. Loss Computation: Measures how far the prediction is from the true label.
  3. Backpropagation: Computes gradients of the loss with respect to weights.
  4. Parameter Update: Uses gradient descent to adjust weights and reduce the loss.
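
To make the four steps concrete, here is a minimal numpy sketch (my own illustration, not from the slides) of one training iteration for a single-hidden-layer network with sigmoid activations and squared-error loss.

```python
# Minimal one-hidden-layer network: forward pass, loss, backprop, gradient step.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((4, 3))                               # 4 samples, 3 input features
y = rng.integers(0, 2, size=(4, 1)).astype(float)    # hypothetical binary targets

W1, b1 = rng.normal(size=(3, 5)), np.zeros((1, 5))   # hidden layer: 5 neurons
W2, b2 = rng.normal(size=(5, 1)), np.zeros((1, 1))   # output layer: 1 neuron
lr = 0.1

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# 1. Forward propagation
h = sigmoid(X @ W1 + b1)
out = sigmoid(h @ W2 + b2)

# 2. Loss computation (squared-error loss)
loss = 0.5 * np.sum((out - y) ** 2)

# 3. Backpropagation: error terms layer by layer (chain rule)
d_out = (out - y) * out * (1 - out)          # error term at the output layer
d_h = (d_out @ W2.T) * h * (1 - h)           # error term at the hidden layer

# 4. Parameter update (gradient descent)
W2 -= lr * (h.T @ d_out);  b2 -= lr * d_out.sum(axis=0, keepdims=True)
W1 -= lr * (X.T @ d_h);    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(loss)
```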

Back-propagation

image.png

In backpropagation, we start at the output layer and compute the error term for each neuron layer by layer (that is, the derivative of the loss function with respect to that layer's activations); these error terms, combined with the activations of the previous layer, then give the gradients of the current layer's weights.

8-3 SVM

Find hyperplane maximizes the margin.

image.png

image.png


Learning Nonlinear SVM: The Kernel Trick

Support Vector Machines (SVMs) are originally designed for linear classification. To handle nonlinear data, we use the Kernel Trick — a powerful method to implicitly map input data to a higher-dimensional space without actually performing the transformation.


Core Idea: Kernel Trick

Instead of explicitly mapping an input $x$ to a high-dimensional space via $\phi(x)$, we compute the inner product in that space using a kernel function:

$\phi(x_i) \cdot \phi(x_j) = K(x_i, x_j)$
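
A quick numpy check (my own illustration) that the degree-2 polynomial kernel really equals a dot product in a mapped feature space: for 2-D inputs, $\phi(x) = (x_1^2,\ x_2^2,\ \sqrt{2}x_1x_2,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ 1)$ satisfies $\phi(x) \cdot \phi(y) = (x \cdot y + 1)^2$.

```python
# Verify the kernel trick for the degree-2 polynomial kernel K(x, y) = (x.y + 1)^2.
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

lhs = phi(x) @ phi(y)          # explicit mapping, then dot product
rhs = (x @ y + 1) ** 2         # kernel evaluated in the original 2-D space
print(lhs, rhs)                # both equal 4.0 here
```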


What Is a Kernel Function?

A kernel function is defined in the original input space and returns the dot product of the corresponding mapped vectors in a high-dimensional feature space.


Common Kernel Functions:

| Kernel Type | Formula | Notes |
|---|---|---|
| Polynomial Kernel | $K(x, y) = (x \cdot y + 1)^p$ | Captures polynomial interactions |
| Gaussian (RBF) | $K(x, y) = e^{-\frac{\lVert x - y \rVert^2}{2\sigma^2}}$ | |
| Sigmoid Kernel | $K(x, y) = \tanh(kx \cdot y - \delta)$ | Inspired by neural activation |

Why It Matters


Kernel Functions: Advantages and Validity Conditions

Advantages of Using Kernel Functions

Kernel functions are central to nonlinear SVMs, allowing computations in high-dimensional spaces without explicitly mapping the data:

This makes SVMs with kernel functions computationally efficient while still allowing flexible decision boundaries.


Not All Functions Are Valid Kernels

A valid kernel must correspond to an inner product in some feature space. Arbitrary functions may not satisfy the mathematical properties required to be used as kernels.

IMG_A8A53F0A7565-1.jpeg

Support vectors

In Support Vector Machines (SVM), support vectors are the data samples that lie closest to the decision boundary (the separating hyperplane). These are the critical points that directly influence the position and orientation of the hyperplane.

What Are Support Vectors?


What Do They Do?

Association Analysis

image.png

IMG_ABA35CD8BB29-1.jpeg

IMG_C0C6104B573D-1.jpeg

Rules originating from the same itemset have identical support but can have different confidence.

If the itemset is infrequent, then all six candidate rules can be pruned immediately without computing their confidence values.
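
For a rule $X \rightarrow Y$ over $N$ transactions, with $\sigma(\cdot)$ denoting the support count, the standard definitions are:

$s(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{N}, \qquad c(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}$

This is why all rules from the same itemset share the same support (the numerator is fixed by the itemset), while their confidence differs with the choice of antecedent $X$.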


Association Rule Mining Task

Association rule mining is a key technique in data mining, often used in market basket analysis to discover relationships between items in transactional data. The process is typically divided into two main steps.


Step 1: Frequent Itemset Generation


Step 2: Rule Generation


Brute-Force Approach (Inefficient Baseline)

  1. List all possible association rules from all itemsets.
  2. Compute the support and confidence for each rule.
  3. Prune any rules that fail to meet the minsup or minconf thresholds.
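
For $d$ items, this brute-force search space is huge: the total number of possible rules is

$R = 3^d - 2^{d+1} + 1$

For example, $d = 6$ items already give $3^6 - 2^7 + 1 = 602$ candidate rules.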

image.png

image.png

Candidate Generation: $F_{k-1} \times F_{k-1}$ (k ≥ 3)

image.png

IMG_F9243C484116-1.jpeg
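
A minimal sketch (my own illustration) of the $F_{k-1} \times F_{k-1}$ merge step: two frequent $(k-1)$-itemsets are merged only if their first $k-2$ items are identical, and each resulting candidate is then pruned if any of its $(k-1)$-subsets is not frequent.

```python
# Sketch of F(k-1) x F(k-1) candidate generation with subset-based pruning.
from itertools import combinations

def generate_candidates(freq_k_minus_1):
    """freq_k_minus_1: list of sorted tuples, each a frequent (k-1)-itemset."""
    freq_set = set(freq_k_minus_1)
    candidates = []
    for a, b in combinations(sorted(freq_k_minus_1), 2):
        # Merge only if the first k-2 items agree.
        if a[:-1] == b[:-1]:
            cand = a + (b[-1],)
            # Prune: every (k-1)-subset of the candidate must itself be frequent.
            if all(tuple(s) in freq_set
                   for s in combinations(cand, len(cand) - 1)):
                candidates.append(cand)
    return candidates

# Hypothetical frequent 2-itemsets over items {A, B, C, D}.
F2 = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")]
print(generate_candidates(F2))   # -> [('A', 'B', 'C'), ('A', 'B', 'D')]
```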

Note that confidence is not considered during frequent itemset generation; it only comes into play at the rule generation step.

Rule Generation

Once frequent itemset generation is complete, every rule derived from a frequent itemset automatically satisfies the minimum support threshold.

image.png

Rule Generation: Confidence and Anti-Monotonicity

When generating association rules from frequent itemsets, confidence is used to measure the strength of a rule. Understanding how confidence behaves with respect to the rule structure is critical for pruning and efficiency.


Confidence Is Not Anti-Monotonic in General


But Confidence Is Anti-Monotonic Within the Same Frequent Itemset
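
For rules generated from the same frequent itemset, say $\{A, B, C, D\}$, moving items from the antecedent to the consequent can only lower the confidence:

$c(ABC \rightarrow D) \ge c(AB \rightarrow CD) \ge c(A \rightarrow BCD)$

since all three rules share the numerator $\sigma(ABCD)$, while the denominator grows from $\sigma(ABC)$ to $\sigma(AB)$ to $\sigma(A)$.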


Why It Matters


Deep Learning

Why a deep network instead of a fat (wide) network?

Stronger Expressive Power: Deep networks apply multiple layers of nonlinear transformations, allowing them to model complex and abstract relationships in data that shallow networks cannot easily capture — even if made wider.

Efficiency in Parameter Use: Although a single hidden layer network (fat network) can theoretically approximate any function (Universal Approximation Theorem), it may require an exponentially large number of neurons, making it computationally expensive and inefficient.

Better Generalization: Deep architectures can learn hierarchical features (e.g., edges → shapes → objects), which improves their ability to generalize to new, unseen data. Shallow networks often fail to capture such levels of abstraction.

The slides mention:

“Deep networks achieve lossless transfer of information by implicitly transferring information, reducing human bias and improving generalization.”

Simplified End-to-End Learning: Deep networks can learn directly from raw input data, minimizing the need for manual feature engineering. In contrast, shallow networks often require significant pre-processing or domain-specific feature design.

Scalability with Big Data: Deep models can leverage large datasets more effectively by automatically extracting statistical patterns in the data, whereas shallow networks tend to saturate quickly and underfit complex distributions.

In Summary:

Deep networks outperform fat ones by learning layered feature representations, using parameters more efficiently, generalizing better, and achieving state-of-the-art performance in complex tasks.

NN basic

Activation function

image.png

image.png

image.png

image.png

CNN

RNN

LLM

🔍 Comparison: One-hot Encoding vs Distributed Representation

| Aspect | One-hot Encoding | Distributed Representation |
|---|---|---|
| Definition | A sparse binary vector with only one dimension set to 1. | A dense, real-valued vector that captures word semantics. |
| Dimensionality | High (equals vocabulary size). | Fixed and low (e.g., 100–300). |
| Semantic Capacity | No semantic information encoded. | Captures similarity between words (e.g., “cat” and “dog” are close). |
| Scalability | Poor: grows with vocabulary size. | Good: constant size regardless of vocabulary. |
| Example | “cat” = [1, 0, 0, 0, 0, 0] | “cat” ≈ [0.2, 0.6, 0.1, 0.7, 0.1] |
| Computation | Simple to create, no training needed. | Requires training (e.g., Word2Vec, GloVe, BERT). |
| Storage Cost | High memory usage. | Low memory footprint. |
| Interpretability | Easy to understand, purely symbolic. | Hard to interpret directly, but powerful. |
| Context Sensitivity | No: static and context-free. | Can be static (Word2Vec) or contextual (e.g., BERT). |
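
A minimal numpy sketch contrasting the two representations (the vocabulary and embedding values are made up purely for illustration; real embeddings would be learned, e.g., by Word2Vec):

```python
# One-hot vs dense (distributed) word vectors (hypothetical toy vocabulary).
import numpy as np

vocab = ["cat", "dog", "car"]
idx = {w: i for i, w in enumerate(vocab)}

# One-hot: sparse, dimension = |vocab|, all distinct words equally dissimilar.
one_hot = np.eye(len(vocab))
print(one_hot[idx["cat"]] @ one_hot[idx["dog"]])   # 0.0 -> no notion of similarity

# Distributed: dense low-dimensional vectors; values here are invented.
emb = {
    "cat": np.array([0.8, 0.1, 0.6]),
    "dog": np.array([0.7, 0.2, 0.5]),
    "car": np.array([0.1, 0.9, 0.0]),
}
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(emb["cat"], emb["dog"]), cos(emb["cat"], emb["car"]))  # cat~dog > cat~car
```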

One-hot Encoding

Advantages:

Disadvantages:

Distributed Representation

Advantages:

Disadvantages:

Data security and privacy