The goal is to detect patterns and regularities in data
Approaches
Supervised learning: Availability of labeled training data
Unsupervised learning: No labeled training data available
Semi-supervised learning: Small set of labeled training data and a large amount of unlabeled data
1. Patterns
Formalization
Euclidean vector: geometric object with magnitude and direction
Vector space: collection of vectors that can be added together and multiplied by numbers.
Feature vector: n-dimensional vector
Feature space: Vector space associated with the vectors
Examples: Features
Images: pixel values.
Texts: Frequency of occurrence of textual phrases.
1. Patterns
Formalization
Feature construction [1]: construction of new features from already available features
Feature construction operators
Equality operators, arithmetic operators, array operators (min, max, average etc.)...
Example
Let Year of Birth and Year of Death be two existing features.
A new feature Age = Year of Death - Year of Birth
[1] https://en.wikipedia.org/wiki/Feature_vector
1. Patterns
Formalization: Supervised learning
Let N be the number of training examples
Let X be the input feature space
Let Y be the output feature space (of labels)
Let {(x1, y1),...,(xN, yN)} be the N training examples, where
xi is the feature vector of ith training example.
yi is its label.
The goal of a supervised learning algorithm is to find g: X → Y, where
g is one of the functions from the set of possible functions G (the hypothesis space)
Let F denote the space of scoring functions f: X × Y → R
g can be defined via such a scoring function: g(x) returns the y ∈ Y with the highest score f(x, y)
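As a small illustration of this setup (not part of the original slides), the Python sketch below uses a deliberately tiny hypothesis space G: 1-D threshold functions. The toy training examples and the error-counting selection criterion are assumptions made purely for illustration.

training_examples = [(1.0, 0), (2.0, 0), (3.0, 1), (4.5, 1), (5.0, 1)]  # (x_i, y_i) pairs

def errors(threshold):
    """Number of training examples misclassified by g_t(x) = 1 if x >= t else 0."""
    return sum(1 for x, y in training_examples if (x >= threshold) != (y == 1))

# Search the (tiny) hypothesis space G: thresholds taken from the training inputs.
candidate_thresholds = [x for x, _ in training_examples]
best_t = min(candidate_thresholds, key=errors)

def g(x):
    """The learned mapping g: X -> Y."""
    return 1 if x >= best_t else 0

print(best_t, g(2.5), g(4.0))  # -> 3.0 0 1

Real learning algorithms search much richer hypothesis spaces, but the structure is the same: labeled examples in, a function g: X → Y out.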
1. Patterns
Formalization: Unsupervised learning
Let X be the input feature space
Let Y be the output feature space (of labels)
The goal of an unsupervised learning algorithm is to
find a mapping X → Y without any labeled training examples
1. Patterns
Formalization: Semi-supervised learning
Let X be the input feature space
Let Y be the output feature space (of labels)
Let {(x1, y1),...,(xl, yl)} be the set of l labeled training examples
Let {xl+1,...,xl+u} be the set of u unlabeled feature vectors from X.
The goal of a semi-supervised learning algorithm is to perform either
Transductive learning, i.e., find the correct labels for {xl+1,...,xl+u}, or
Inductive learning, i.e., find the correct mapping X → Y
2. Data Mining
Tasks in Data Mining
Classification
Clustering
Regression
Sequence Labeling
Association Rules
Anomaly Detection
Summarization
2.1. Classification
Generalizing known structure to apply to new data
Identifying the set of categories to which an object belongs
Binary vs. Multiclass classification
2.1. Classification
Applications
Spam vs Non-spam
Document classification
Handwriting recognition
Speech Recognition
Internet Search Engines
2.1. Classification
Formal definition
Let X be the input feature space
Let Y be the output feature space (of labels)
The goal of a classification algorithm (or classifier) is to find
{(x1, y1),...,(xl, yl)}, i.e., to assign a known label to every input feature vector, where
xi ∈ X
yi ∈ Y
|X| = l
|Y| = k
l >= k
2.1. Classification
Classifiers
A classifier is an algorithm that implements classification
Two types of classifiers:
Binary classifiers assign an object to one of two classes
Multiclass classifiers assign an object to one of more than two classes
2.1. Classification
Linear Classifiers
A linear function assigning a score to each possible category by combining the feature vector of an instance with a vector of weights, using a dot product.
Formalization:
Let X be the input feature space and xi ∈ X
Let βk be the vector of weights for category k
score(xi, k) = xi · βk is the score for assigning category k to instance xi. The category with the highest score is assigned to the instance.
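A minimal Python sketch of this dot-product scoring; the weight vectors βk and the feature values below are made-up toy numbers, not trained parameters.

import numpy as np

# One weight vector beta_k per category (toy values, not learned).
betas = {
    "cat": np.array([1.0, -0.5, 2.0]),
    "dog": np.array([0.5, 1.5, -1.0]),
}

def score(x, k):
    """score(x_i, k) = x_i . beta_k"""
    return float(np.dot(x, betas[k]))

def classify(x):
    """Assign the category with the highest score."""
    return max(betas, key=lambda k: score(x, k))

x_i = np.array([0.2, 1.0, 0.1])
print({k: round(score(x_i, k), 2) for k in betas}, classify(x_i))  # 'dog' wins here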
2.1. Classification
Let
tp: number of true positives
tn: number of true negatives
fp: number of false positives
fn: number of false negatives
Then
Accuracy a = (tp + tn) / (tp + tn + fp + fn)
Precision p = tp / (tp + fp)
Recall r = tp / (tp + fn)
Specificity s = tn / (tn + fp)
F1-score f1 = 2 * ((p * r) / (p + r))
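The measures above are straightforward to compute once the four counts are known; the following Python sketch uses made-up counts for illustration.

# Made-up counts of true/false positives/negatives.
tp, tn, fp, fn = 40, 45, 5, 10

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)
specificity = tn / (tn + fp)
f1          = 2 * (precision * recall) / (precision + recall)

print(f"a={accuracy:.2f} p={precision:.2f} r={recall:.2f} s={specificity:.2f} f1={f1:.2f}")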
2.1. Classification
Confusion Matrix
A table in which each row represents the instances of an actual class and each column the instances of a predicted class, making it easy to see which classes the classifier confuses.
2.1. Classification
Multiclass classification
Transformation to binary
One-vs.-rest (One-vs.-all)
One-vs.-one
Extension from binary
Neural networks
k-nearest neighbours
2.1. Classification
One-vs.-rest (One-vs.-all) strategy: train one binary classifier per class, with the samples of that class as positives and all other samples as negatives; an instance is assigned the class whose classifier gives the highest score.
2.1. Classification
One-vs.-one strategy: train one binary classifier for each pair of classes; an instance is assigned the class that wins the most pairwise votes.
2.2. Clustering
Discovering groups and structures in the data without using known labels or structures
Objects in a cluster are more similar to each other than to objects in other clusters
2.2. Clustering
Applications
Social network analysis
Image segmentation
Recommender systems
Grouping of shopping items
2.2. Clustering
Formal definition
Let X be the input feature space
The goal of clustering is to find k subsets of X, in such a way that
C1 ∪ ... ∪ Ck ∪ Coutliers = X and
Ci ∩ Cj = ∅ for i ≠ j, 1 ≤ i, j ≤ k
Coutliers may consist of outlier instances (data anomaly)
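One common way to produce such a partition C1, ..., Ck is k-means (a centroid model, see the next slide). The Python sketch below is a minimal illustration on made-up 2-D data; handling of Coutliers is omitted.

import numpy as np

rng = np.random.default_rng(0)
# Two made-up, well-separated blobs of 2-D points.
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(5.0, 0.5, (20, 2))])
k = 2

centroids = X[rng.choice(len(X), size=k, replace=False)]
for _ in range(10):
    # Assignment step: each point joins the cluster of its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid becomes the mean of its cluster (kept if the cluster is empty).
    centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                          for j in range(k)])

clusters = [X[labels == j] for j in range(k)]  # C_1, ..., C_k
print(centroids.round(2))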
2.2. Clustering
Cluster models
Centroid models: cluster represented by a single mean vector
Connectivity models: distance connectivity
Distribution models: clusters modeled using statistical distributions
Density models: clusters as connected dense regions in the data space
Subspace models
Group models
Graph-based models
Neural models
2.3. Regression
Finding a function which models the data
Assigns a real-valued output to each input
Estimating the relationships among variables
Relationship between a dependent variable ('criterion variable') and one or more independent variables ('predictors').
2.3. Regression
Applications
Prediction
Forecasting
Machine learning
Finance
2.3. Regression
Formal definition
A function that maps a data item to a prediction variable
Let X be the independent variables
Let Y be the dependent variables
Let β be the unknown parameters (scalar or vector)
The goal of a regression model is to approximate Y as a function of X and β, i.e.,
Y ≅ f(X,β)
2.3. Regression
Linear regression
straight line: yi = β0 + β1xi + εi OR
parabola: yi = β0 + β1xi + β2xi² + εi
2.3. Regression
Linear regression
straight line: yi = β0 + β1xi + εi OR
Fitted value: ŷi = β0 + β1xi
Residual: ei = yi - ŷi
Sum of squared residuals: SSE = Σ ei², where 1 ≤ i ≤ n
The goal is to minimize SSE
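A minimal Python sketch of fitting the straight-line model by minimizing the SSE (ordinary least squares); the data points are made up for illustration.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])          # made-up observations

# Design matrix with an intercept column; lstsq minimizes ||A.beta - y||^2.
A = np.column_stack([np.ones_like(x), x])
(b0, b1), *_ = np.linalg.lstsq(A, y, rcond=None)

y_hat = b0 + b1 * x                               # fitted values
sse = float(np.sum((y - y_hat) ** 2))             # sum of squared residuals
print(round(b0, 3), round(b1, 3), round(sse, 3))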
2.4. Sequence Labeling
Assigning a class to each member of a sequence of values
2.4. Sequence Labeling
Applications
Part of speech tagging
Linguistic translation
Video analysis
Handwriting recognition
Information extraction
2.4. Sequence Labeling
Formal definition
Let X be the input feature space
Let Y be the output feature space (of labels)
Let 〈x1,...,xT〉 be a sequence of length T.
The goal of sequence labeling is to generate a corresponding sequence
〈y1,...,yT〉 of labels
where xt ∈ X and yt ∈ Y for 1 ≤ t ≤ T
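As a toy illustration only (not a real sequence labeler), the Python sketch below assigns each element the label most frequently seen for it in a tiny, made-up training set, a "most frequent tag" baseline for part-of-speech tagging.

from collections import Counter, defaultdict

# Tiny made-up training set of (word, tag) sequences.
training = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

counts = defaultdict(Counter)
for sentence in training:
    for word, tag in sentence:
        counts[word][tag] += 1

def label_sequence(words, default="NOUN"):
    """Map <x_1,...,x_T> to <y_1,...,y_T> using the most frequent tag per word."""
    return [counts[w].most_common(1)[0][0] if w in counts else default for w in words]

print(label_sequence(["the", "dog", "sleeps"]))  # ['DET', 'NOUN', 'VERB']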
2.5. Association Rules
Association Rules
Searches for relationships between variables
2.5. Association Rules
Applications
Web usage mining
Intrusion detection
Affinity analysis
2.5. Association Rules
Formal definition
Let I be a set of n binary attributes called items
Let T be a set of m transactions called the database; each transaction in T is a subset of I
Let I = {i1,...,in} and T = {t1,...,tm}
The goal of association rule learning is to find
X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅
X is the antecedent
Y is the consequent
2.5. Association Rules
Formal definition
Support: how frequently an itemset appears in the database
supp(X) = |{t ∈ T : X ⊆ t}| / |T|
Confidence: how frequently the rule has been found to be true.
conf(X ⇒ Y) = supp(X ∪ Y)/supp(X)
Lift: the ratio of the observed support to that of the expected if X and Y were independent
lift(X ⇒ Y) = supp(X ∪ Y)/(supp(X) ⨉ supp(Y))
2.5. Association Rules
Example
{bread, butter} ⇒ {milk}
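The Python sketch below computes support, confidence, and lift for this rule over a small, made-up transaction database.

# Made-up transaction database T.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
]

def supp(itemset):
    """Fraction of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

X, Y = {"bread", "butter"}, {"milk"}
support = supp(X | Y)
confidence = supp(X | Y) / supp(X)
lift = supp(X | Y) / (supp(X) * supp(Y))
print(round(support, 2), round(confidence, 2), round(lift, 2))  # 0.4 0.67 0.83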
2.6. Anomaly Detection
Identification of unusual data records
Approaches
Unsupervised anomaly detection
Supervised anomaly detection
Semi-supervised anomaly detection
2.6. Anomaly Detection
Applications
Intrusion detection
Fraud detection
Remove anomalous data
System health monitoring
Event detection in sensor networks
Misuse detection
2.6. Anomaly Detection
Characteristics
Unexpected bursts in activity
2.6. Anomaly Detection
Formalization
Let Y be a set of measurements
Let PY(y) be a statistical model for the distribution of Y under 'normal' conditions.
Let T be a user-defined threshold.
A measurement y is declared an outlier (anomaly) if PY(y) < T
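A minimal Python sketch of this threshold test, assuming (purely for illustration) that PY is a normal distribution fitted to a handful of made-up "normal" measurements.

import math

# Made-up 'normal' measurements used to fit P_Y.
normal_data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
mu = sum(normal_data) / len(normal_data)
var = sum((y - mu) ** 2 for y in normal_data) / len(normal_data)

def p_y(y):
    """Density of the fitted normal distribution P_Y."""
    return math.exp(-((y - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

T = 0.05  # user-defined threshold
for y in [10.0, 12.5]:
    print(y, "outlier" if p_y(y) < T else "normal")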
2.7. Summarization
Providing a more compact representation of the data set
Report Generation
2.7. Summarization
Applications
Keyphrase extraction
Document summarization
Search engines
Image summarization
Video summarization: Finding important events from videos
2.7. Summarization
Formalization: Multidocument summarization
Let C = {D1, ..., Dk} be a document collection of k documents
Each document Dj consists of textual units (words, sentences, paragraphs, etc.)
Let D = {t1, ..., tn} be the complete set of all textual units from all documents, where
ti ∈ D if and only if ∃ Dj such that ti ∈ Dj
S ⊆ D constitutes a summary
Two scoring functions
Rel(i): relevance of textual unit i in the summary
Red(i,j): Redundancy between two textual units ti, tj
2.7. Summarization
Formalization: Multidocument summarization
Scoring for a summary S
s(S) is the score of summary S: s(S) = Σti∈S Rel(i) - Σti,tj∈S Red(i,j)
l(i) is the length of textual unit ti
K is the fixed maximum length of the summary, giving the constraint Σti∈S l(i) ≤ K
The goal is to find the summary S that maximizes s(S) subject to the length constraint
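A simple greedy Python sketch of this objective (not the exact inference procedure from the cited paper): pick textual units that add relevance while penalizing redundancy, until the length budget K is exhausted. All relevance, redundancy, and length values are made up.

units = ["u1", "u2", "u3", "u4"]                       # candidate textual units
rel = {"u1": 0.9, "u2": 0.8, "u3": 0.7, "u4": 0.3}     # Rel(i)
red = {frozenset(["u1", "u2"]): 0.6,                   # Red(i, j); unlisted pairs are 0
       frozenset(["u1", "u3"]): 0.1,
       frozenset(["u2", "u3"]): 0.2}
length = {"u1": 20, "u2": 25, "u3": 15, "u4": 10}      # l(i)
K = 40                                                 # maximum summary length

def gain(candidate, summary):
    """Marginal contribution of a unit: relevance minus redundancy with the summary."""
    return rel[candidate] - sum(red.get(frozenset([candidate, s]), 0.0) for s in summary)

summary, used = [], 0
remaining = list(units)
while remaining:
    best = max(remaining, key=lambda u: gain(u, summary))
    if used + length[best] > K or gain(best, summary) <= 0:
        break
    summary.append(best)
    used += length[best]
    remaining.remove(best)

print(summary)  # ['u1', 'u3'] with these toy values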
2.7. Summarization
Finding a representative subset of the entire data set
Approaches
Extraction: Selecting a subset of existing words, phrases, or sentences in the original text without any modification
Abstraction: Build an internal semantic representation and then use natural language generation techniques
2.7. Summarization
Extractive summarization
Approaches
Generic summarization: Obtaining a generic summary
Query relevant summarization: Summary relevant to a query
3. Algorithms
Support Vector Machines (SVM)
Stochastic Gradient Descent (SGD)
Nearest-Neighbours
Naive Bayes
Decision Trees
Ensemble Methods (Random Forest)
3.1. Support Vector Machines (SVM)
Introduction
Supervised learning approach
Binary classification algorithm
Constructs a hyperplane ensuring the maximum separation between two classes
3.1. Support Vector Machines (SVM)
Hyperplane
A hyperplane of an n-dimensional space is a subspace of dimension n-1
Examples
The hyperplane of a 2-dimensional space is a 1-dimensional line
The hyperplane of a 3-dimensional space is a 2-dimensional plane
3.1. Support Vector Machines (SVM)
Formal definition
The goal of an SVM is to estimate a function f: RN → {+1,-1}, i.e.,
if x1,...,xl ∈ RN are the l input data points with labels y1,...,yl ∈ {+1,-1},
learn f from the training data (x1,y1),...,(xl,yl) ∈ RN ⨉ {+1,-1}
Any hyperplane can be written by the equation using set of input points x
w.x - b = 0, where
w ∈ RN, a normal vector to the plane
b ∈ R
A decision function is given by f(x) = sign(w.x - b )
3.1. Support Vector Machines (SVM)
Formal definition
If the training data are linearly separable, two hyperplanes can be selected
They separate the two classes of data so that the distance between them is as large as possible.
The hyperplanes can be given by the equations
w.x - b = 1
w.x - b = -1
The distance between the two hyperplanes can be given by 2/||w||
Region between these two hyperplanes is called margin.
Maximum-margin hyperplane is the hyperplane that lies halfway between them.
3.1. Support Vector Machines (SVM)
Formal definition
In order to prevent data points from falling into the margin, the following constraints are added
w.xi - b >= 1, if yi = 1
w.xi - b <= -1, if yi = -1
yi(w.xi - b) >= 1 for 1<= i <= n
The goal is to minimize ||w|| subject to yi(w.xi - b) >= 1 for 1<= i <= n
Solving for both w and b gives our classifier
f(x) = sign(w.x - b)
Max-margin hyperplane is completely determined by the points that lie nearest to it, called the support vectors
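The Python sketch below only evaluates the decision function f(x) = sign(w.x - b) and the margin 2/||w|| for a fixed, made-up hyperplane; it does not solve the optimization problem above.

import numpy as np

w = np.array([2.0, -1.0])   # made-up normal vector of the hyperplane
b = 0.5                     # made-up offset

def f(x):
    """Decision function f(x) = sign(w.x - b)."""
    return int(np.sign(np.dot(w, x) - b))

margin = 2.0 / np.linalg.norm(w)   # distance between the two margin hyperplanes
print(f(np.array([1.0, 0.0])), f(np.array([-1.0, 1.0])), round(margin, 3))  # 1 -1 0.894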
3.1. Support Vector Machines (SVM)
Data mining tasks
Classification (Multi-class classification)
Regression
Anomaly detection
3.1. Support Vector Machines (SVM)
Applications
Text and hypertext categorization
Image classification
Handwriting recognition
3.2. Stochastic Gradient Descent (SGD)
A stochastic approximation of the gradient descent optimization
Iterative method for minimizing an objective function that is written as a sum of differentiable functions.
Finds minima or maxima by iteration
3.2. Stochastic Gradient Descent
Gradient
Multi-variable generalization of the derivative.
Gives slope of the tangent of the graph of a function
Gradient points in the direction of the greatest rate of increase of a function
Magnitude of gradient is the slope of the graph in that direction
3.2. Stochastic Gradient Descent
Gradient vs Derivative
Derivatives defined on functions of single variable
Gradient defined on functions of multiple variables
Gradient is a vector-valued function (range is a vector)
Derivative is a scalar-valued function
3.2. Stochastic Gradient Descent
Gradient descent
First-order iterative optimization algorithm for finding the minimum of a function.
A local minimum is found by taking steps proportional to the negative of the gradient of the function at the current point.
3.2. Stochastic Gradient Descent
Standard gradient descent method
Let's take the problem of minimizing an objective function
Q(w) = (1/n) Σ Qi(w), where 1 ≤ i ≤ n
The summand function Qi is associated with the i-th observation in the data set.
Standard (batch) gradient descent update: w = w - η.∇Q(w)
3.2. Stochastic Gradient Descent
Iterative method
Choose an initial vector of parameters w and a learning rate η.
Repeat until an approximate minimum is obtained:
Randomly shuffle the examples in the training set.
For i = 1,...,n: w = w - η.∇Qi(w)
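A minimal Python sketch of this update rule for a least-squares objective Qi(w) = (w·xi - yi)²/2; the toy data, learning rate, and number of passes are assumptions made for illustration.

import random

# Toy data generated from y = 3 * x (bias folded into the single feature).
data = [(x / 10.0, 3.0 * x / 10.0) for x in range(1, 21)]

w, eta = 0.0, 0.1                    # initial parameter and learning rate
for _ in range(50):                  # repeat until an approximate minimum
    random.shuffle(data)             # randomly shuffle the training examples
    for x_i, y_i in data:
        grad = (w * x_i - y_i) * x_i     # gradient of Q_i(w) = (w*x_i - y_i)^2 / 2
        w = w - eta * grad               # SGD update
print(round(w, 3))  # close to 3.0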
3.2. Stochastic Gradient Descent
Applications
Classification
Regression
3.3. Nearest-Neighbours
k-nearest neighbors algorithm
k-NN classification: the output is a class membership
(an object is classified by a majority vote of its k nearest neighbours)
k-NN regression: the output is a property value for the object
(the average of the values of its k nearest neighbours)
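A minimal Python sketch of k-NN classification by majority vote over Euclidean distance; the toy training points and the choice of k are made up.

import math
from collections import Counter

# Made-up labelled training points.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((5.2, 4.8), "B")]

def knn_classify(x, k=3):
    """Majority vote among the k nearest neighbours (Euclidean distance)."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_classify((1.1, 1.1)))  # 'A'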
3.3. Nearest-Neighbours
Applications
Regression
Anomaly detection
3.4. Naive Bayes classifiers
A collection of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.
3.4. Naive Bayes classifiers
Applications
Document classification (spam/non-spam)
3.4. Naive Bayes classifiers
Bayes' Theorem
If A and B are events,
P(A) and P(B) are the probabilities of observing A and B independently of each other,
P(A|B) is the conditional probability of event A occurring given that B is true,
P(B|A) is the conditional probability of event B occurring given that A is true, and
P(B) ≠ 0, then
P(A|B) = (P(B|A).P(A))/P(B)
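A small numeric illustration of the theorem in Python, with made-up probabilities for a spam-filtering scenario (A = "message is spam", B = "message contains the word 'offer'").

p_a = 0.2             # P(A): prior probability that a message is spam
p_b_given_a = 0.6     # P(B|A): probability that a spam message contains 'offer'
p_b_given_not_a = 0.05

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # 0.75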
3.5. Decision Trees
Decision support tool
Tree-like model of decisions and their possible consequences
3.5. Decision Trees
Applications
Classification
Regression
Decision Analysis: identifying strategies to reach a goal
Operations Research
3.6. Ensemble Methods (Random Forest)
Definition
A combination of multiple learning algorithms used to obtain better predictive performance than could be obtained from any of the constituent algorithms alone.
Random forests are obtained by building multiple decision trees at training time and aggregating their predictions.
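A minimal usage sketch, assuming scikit-learn is available; the toy data and the forest size are illustrative choices, not recommendations.

from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # toy feature vectors
y = [0, 1, 1, 0]                       # toy labels

# An ensemble of 10 decision trees built at training time.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)
print(clf.predict([[0, 1], [1, 1]]))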
3.6. Ensemble Methods (Random Forest)
Data mining tasks
Multiclass classification
Multilabel classification (the problem of assigning one or more labels to each instance; there is no limit on the number of classes an instance can be assigned to)
Regression
Anomaly detection
4. Feature Selection
Definition
Process of selecting a subset of relevant features
Used in domains with large number of features and comparatively few sample points
4. Feature Selection
Applications
Analysis of written texts
Analysis of DNA microarray data
4. Feature Selection
Formal definition [8]
Let X be the original set of n features, i.e., |X| = n
Let wi be the weight assigned to feature xi∈ X
Binary feature selection assigns binary weights, whereas continuous feature selection assigns weights that preserve the order of relevance of the features.
Let J(X') be an evaluation measure, defined as J: X' ⊆ X → R
The feature selection problem may be defined in the following three ways
Fix |X'| = m < n; find the X' ⊂ X for which J(X') is maximal
Choose a threshold J0; find an X' ⊆ X such that J(X') ≥ J0
Find a compromise among minimizing |X'| and maximizing J(X')
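A greedy forward-selection sketch in Python for the first problem statement above (fixed subset size m); the feature names and the toy evaluation measure J are assumptions made for illustration only.

features = ["f1", "f2", "f3", "f4"]

# Toy evaluation measure J: individual scores minus a penalty for one redundant pair.
scores = {"f1": 0.8, "f2": 0.7, "f3": 0.6, "f4": 0.1}

def J(subset):
    penalty = 0.5 if {"f1", "f2"} <= set(subset) else 0.0
    return sum(scores[f] for f in subset) - penalty

def forward_selection(m):
    """Greedily grow X' until |X'| = m, adding the feature that most improves J."""
    selected = []
    while len(selected) < m:
        best = max((f for f in features if f not in selected),
                   key=lambda f: J(selected + [f]))
        selected.append(best)
    return selected

chosen = forward_selection(2)
print(chosen, round(J(chosen), 2))  # ['f1', 'f3'] 1.4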
References
Research articles
From data mining to knowledge discovery in databases, Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth, AI Magazine Volume 17 Number 3 (1996)
Survey of Clustering Data Mining Techniques, Pavel Berkhin
Mining association rules between sets of items in large databases, Agrawal, Rakesh, Tomasz Imieliński, and Arun Swami. Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (SIGMOD 1993), p. 207.
Comparisons of Sequence Labeling Algorithms and Extensions, Nguyen, Nam, and Yunsong Guo. Proceedings of the 24th international conference on Machine learning. ACM, 2007.
References
Research articles
An Analysis of Active Learning Strategies for Sequence Labeling Tasks, Settles, Burr, and Mark Craven. Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, 2008.
Anomaly detection in crowded scenes, Mahadevan, Vijay, et al. Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2010.
A Study of Global Inference Algorithms in Multi-Document Summarization. McDonald, Ryan. European Conference on Information Retrieval. Springer, Berlin, Heidelberg, 2007.
Feature selection algorithms: A survey and experimental evaluation, Molina, Luis Carlos, Lluís Belanche, and Àngela Nebot. Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002). IEEE, 2002.
Support vector machines, Hearst, Marti A., et al. IEEE Intelligent Systems and their applications 13.4 (1998): 18-28.