Data Mining
John Samuel
CPE Lyon
Year: 2019-2020
Email: john(dot)samuel(at)cpe(dot)fr
Goals
Lifecycle of data
Data acquisition and storage
Data extraction and integration
Pre-treatment of data
Data transformation
ETL
Data analysis
Data visualisation
1. Lifecycle of data
Data lifecycle: Data, Knowledge, Insights, Actions
1.1. From Data to Knowledge
Data acquisition
Data Extraction
Data Cleaning
Data Transformation
Data analysis modeling
Data Storage
Analysis
Visualisation
Major steps of data analysis
1.1.1. Data Acquisition
1.1.2. ETL (Extraction, Transformation and Loading)
Data Extraction
Data Cleaning
Data Transformation
Loading data to information stores
ETL (Extraction, Transformation and Loading)
1.1.3. Data Analysis
1.1.4. Data Visualization
2. Data Acquisition and Storage
2.1. Data acquisition
Surveys
Manual surveys
Online surveys
Sensors
Temperature, pressure, humidity, rainfall
Acoustic, navigation
Proximity, presence sensors
Social networks
Video surveillance cameras
Web
https://en.wikipedia.org/wiki/List_of_sensors
2.2. Data storage formats
Binary and Textual Files
CSV/TSV
XML
JSON
Media (Images/Audio/Video)
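As a rough illustration of the textual formats listed above, the sketch below writes the same small record set as CSV, JSON and XML using only the Python standard library, then reads it back. The sensor readings and field names are invented for the example.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical records, invented for illustration.
readings = [
    {"sensor": "t1", "temperature": 21.5},
    {"sensor": "t2", "temperature": 19.0},
]

# CSV: one line per record, values separated by commas.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["sensor", "temperature"], lineterminator="\n")
writer.writeheader()
writer.writerows(readings)
csv_text = buffer.getvalue()

# JSON: nested lists and objects; numbers keep their type.
json_text = json.dumps(readings)

# XML: a tree of elements; here every value is stored as attribute text.
root = ET.Element("readings")
for r in readings:
    ET.SubElement(root, "reading", sensor=r["sensor"], temperature=str(r["temperature"]))
xml_text = ET.tostring(root, encoding="unicode")

# Reading the data back: CSV and XML return strings, JSON returns numbers.
from_csv = list(csv.DictReader(io.StringIO(csv_text)))
from_json = json.loads(json_text)
from_xml = [dict(e.attrib) for e in ET.fromstring(xml_text)]
```

Note that only the JSON round trip restores the temperature as a number; with CSV and XML the consumer has to convert the text itself.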
2.3. Types of data stores
Structured data stores
Relational databases
Object-oriented databases
Unstructured data stores
Filesystems
Content-management systems
Document collections
Semi-structured data stores
Filesystems
NoSQL data stores
Unstructured vs. Structured vs. Semi-structured
2.3.1. ACID Transactions
Atomicity: Each transaction must be "all or nothing".
Consistency: Every transaction must bring the database from one valid state to another.
Isolation: Concurrent execution of transactions must leave the database in the same state as sequential execution would.
Durability: Once committed, a transaction must remain committed, even after power losses or crashes.
https://en.wikipedia.org/wiki/ACID
Ensure the validity of databases even in case of errors or power failures
Important in the banking sector
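The "all or nothing" property can be tried out with a small sketch using Python's built-in sqlite3 module. The account table, the names and the amounts are assumptions made for the example; the CHECK constraint stands in for a banking rule that balances may not become negative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, source, target, amount):
    """Move money between two accounts inside a single transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, target)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, source)
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account"))

ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would make alice's balance negative
final = balances(conn)
```

In the second call the credit to bob has already been executed when the debit violates the constraint; the rollback undoes both, so the balances stay as they were after the first transfer.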
2.3.2. Types of data stores
Relational databases
Object-oriented databases
NoSQL (Not only SQL) data stores
NewSQL
2.3.3. NoSQL
Compromises on consistency
Focus on availability and speed
2.3.4. Types of NoSQL stores
Column-oriented database
Document-oriented database
Key-value database
Graph-oriented database
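The four data models above can be contrasted with plain Python structures. This is only an in-memory sketch of how the data is shaped and queried, not how any particular NoSQL product works; all names and values are invented.

```python
# Key-value: opaque values looked up by key only.
key_value = {"user:1": '{"name": "Ada"}', "user:2": '{"name": "Alan"}'}

# Document-oriented: each value is a structured document that can be queried.
documents = {
    "user:1": {"name": "Ada", "languages": ["Python", "C"]},
    "user:2": {"name": "Alan", "languages": ["Python"]},
}

# Column-oriented: all values of one column are stored together.
columns = {"id": [1, 2], "name": ["Ada", "Alan"], "age": [36, 41]}

# Graph-oriented: nodes connected by labelled edges.
edges = [("Ada", "knows", "Alan"), ("Alan", "knows", "Grace")]

def find_documents(docs, language):
    """Query inside documents, which a pure key-value store cannot do."""
    return sorted(k for k, d in docs.items() if language in d["languages"])

def column_average(cols, name):
    """Aggregate one column without touching the others."""
    values = cols[name]
    return sum(values) / len(values)

def neighbours(edge_list, node, label):
    """Follow edges with a given label starting from one node."""
    return [target for source, l, target in edge_list if source == node and l == label]
```

For example, `find_documents(documents, "C")` inspects document contents, `column_average(columns, "age")` reads a single column, and `neighbours(edges, "Ada", "knows")` traverses the graph.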
3. Data Extraction and Integration
3.1. Data extraction techniques
Data dumps
Downloading complete data dumps
Downloading selective data dumps
Periodical polling of data feeds (e.g., blogs, news feeds)
Data streams
Subscribing to data streams (push notifications)
3.2. Query interfaces
Query endpoints supporting declarative languages
SQL
SPARQL
Automated or manual search (and filter) options
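A declarative query states what is wanted and leaves the how to the data store. The sketch below uses an in-memory SQLite database as a stand-in for a query endpoint; the table and the function name languages_since are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE language (name TEXT, paradigm TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO language VALUES (?, ?, ?)",
    [
        ("C", "imperative", 1972),
        ("Haskell", "functional", 1990),
        ("Python", "multi-paradigm", 1991),
    ],
)

def languages_since(conn, year):
    """Return the names of languages that appeared in or after the given year."""
    query = "SELECT name FROM language WHERE year >= ? ORDER BY year"
    return [name for (name,) in conn.execute(query, (year,))]

recent = languages_since(conn, 1990)
```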
3.3. Crawlers for web pages
Web crawlers: navigating the entire web using hyperlinks
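Link-following can be sketched as a breadth-first traversal. To keep the example runnable offline, the pages live in a dictionary and the fetch function is passed in as a parameter; a real crawler would download pages over HTTP and respect robots.txt. The URLs and page contents are invented.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=100):
    """Visit pages breadth-first; fetch(url) returns HTML text or None."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited

# Hypothetical in-memory "web" used instead of network access.
PAGES = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.org/b": '<a href="/missing">?</a>',
}
order = crawl("http://example.org/", PAGES.get)
```

The `seen` set is what prevents the crawler from looping forever when pages link back to each other.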
3.4. Application Programming Interface (API)
Web operations (CRUD) to manipulate externally managed resources
Requires programmers to develop wrappers for web service integration
API (programming interface)
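A wrapper of the kind mentioned above might map each CRUD operation to an HTTP method. The sketch below only builds the requests and does not send them; the base URL, the resource path and the class name ResourceClient are hypothetical, and the CRUD-to-method mapping is a common convention rather than a rule every web API follows.

```python
import json
from urllib.request import Request

# Conventional (not universal) mapping of CRUD operations to HTTP methods.
CRUD_TO_HTTP = {"create": "POST", "read": "GET", "update": "PUT", "delete": "DELETE"}

class ResourceClient:
    """Hypothetical wrapper around a web API exposing resources by URL."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def build(self, operation, resource, payload=None):
        """Prepare (but do not send) the HTTP request for one CRUD operation."""
        data = json.dumps(payload).encode("utf-8") if payload is not None else None
        headers = {"Content-Type": "application/json"} if data else {}
        return Request(
            f"{self.base_url}/{resource}",
            data=data,
            headers=headers,
            method=CRUD_TO_HTTP[operation],
        )

client = ResourceClient("https://api.example.org/v1/")
request = client.build("update", "books/42", {"title": "Data Mining"})
# Sending it would be: urllib.request.urlopen(request)
```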
4. Pre-treatment of Data
4.1 Data Cleaning: Types of Errors
Syntactical errors
Semantic errors
Data coverage errors
4.1.1. Syntactical errors
Lexical errors (e.g., user entered a string instead of a number)
Data format errors (e.g., order of last name, first name)
Irregular data errors (e.g., use of different units of measurement)
4.1.2. Semantic errors
Violation of integrity constraints
Contradiction
Duplication
Invalid data (undetected despite the presence of triggers and integrity constraints)
4.1.3. Coverage errors
Missing values
Missing data
4.2.1. Handling Syntactical errors
Validation using a schema (e.g., XSD, JSON Schema)
Data transformation
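The two steps can be sketched together: first validate each raw record, then transform the valid ones into a uniform shape. The hand-written checks below stand in for a real schema language, and the record fields, names and units are assumptions made for the example.

```python
def validate(record):
    """Return the list of problems found in one raw record (all fields are strings)."""
    problems = []
    try:
        float(record["height"])
    except ValueError:
        problems.append("height is not a number")  # lexical error
    if record["unit"] not in ("cm", "m"):
        problems.append("unknown unit")
    if "," not in record["name"]:
        problems.append("name is not 'Last, First'")  # data format error
    return problems

def transform(record):
    """Normalise a valid record: name as 'First Last', height in centimetres."""
    last, first = [part.strip() for part in record["name"].split(",", 1)]
    height = float(record["height"])
    if record["unit"] == "m":  # irregular data: mixed units
        height *= 100
    return {"name": f"{first} {last}", "height_cm": height}

# Hypothetical raw input.
raw = [
    {"name": "Doe, Jane", "height": "1.5", "unit": "m"},
    {"name": "Smith, John", "height": "178", "unit": "cm"},
    {"name": "Alex Martin", "height": "tall", "unit": "cm"},
]
clean = [transform(r) for r in raw if not validate(r)]
rejected = [(r["name"], validate(r)) for r in raw if validate(r)]
```

Rejected records are kept with their list of problems so that they can be reported rather than silently dropped.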
4.2.2. Handling Semantic errors
Duplicate elimination, e.g., by specifying integrity constraints such as functional dependencies
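A minimal sketch of both ideas: duplicates are removed by treating some columns as a key, and a functional dependency (here zip determines city) is checked to expose contradictions. The rows are invented, and the function names are hypothetical.

```python
def remove_duplicates(rows, key):
    """Keep the first row for every combination of values in the key columns."""
    seen = set()
    unique = []
    for row in rows:
        k = tuple(row[column] for column in key)
        if k not in seen:
            seen.add(k)
            unique.append(row)
    return unique

def fd_violations(rows, determinant, dependent):
    """Check the dependency determinant -> dependent.

    Returns every determinant value that maps to more than one dependent value.
    """
    mapping = {}
    for row in rows:
        mapping.setdefault(row[determinant], set()).add(row[dependent])
    return {k: sorted(v) for k, v in mapping.items() if len(v) > 1}

# Hypothetical rows containing one duplicate and one contradiction.
rows = [
    {"id": 1, "zip": "69100", "city": "Villeurbanne"},
    {"id": 1, "zip": "69100", "city": "Villeurbanne"},
    {"id": 2, "zip": "69100", "city": "Lyon"},
    {"id": 3, "zip": "75001", "city": "Paris"},
]
unique = remove_duplicates(rows, ["id"])
violations = fd_violations(unique, "zip", "city")
```

The check only reports that the two rows disagree; deciding which value is correct still needs a rule or a human.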
4.2.3. Handling Coverage errors
Interpolation techniques
External data sources
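One simple interpolation technique is linear interpolation between the nearest known neighbours. The sketch below assumes evenly spaced measurements with None marking a missing value; gaps at the start or end are left untouched because they have only one neighbour.

```python
def interpolate(values):
    """Fill None gaps by linear interpolation between the nearest known values."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

# Hypothetical temperature series with missing readings.
temperatures = [10.0, None, None, 16.0, None, 20.0, None]
completed = interpolate(temperatures)
```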
4.2.4. Administrators and handling errors
User feedback
Alerts and triggers
5. Data Transformation
5.1 Languages
Template languages
XSLT
AWK
Sed
Programming languages like Perl
6. ETL
6.1. ETL (Extraction, Transformation and Loading)
Data Extraction
Data Cleaning
Data Transformation
Loading data to information stores
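The four steps can be chained into a small pipeline. In this sketch a CSV string plays the role of the data source and an in-memory SQLite database plays the role of the information store; the column names and cleaning rules are assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source data with untidy names and a missing quantity.
SOURCE = """product,quantity,price
 pen ,3,1.50
notebook,,4.00
Pen,2,1.50
"""

def extract(text):
    """Extraction: parse the raw source into records."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Cleaning: drop rows without a quantity, normalise product names."""
    return [
        dict(r, product=r["product"].strip().lower())
        for r in rows
        if r["quantity"].strip()
    ]

def transform(rows):
    """Transformation: convert types and compute the revenue per row."""
    return [
        (r["product"], int(r["quantity"]), int(r["quantity"]) * float(r["price"]))
        for r in rows
    ]

def load(conn, rows):
    """Loading: write the prepared rows to the target store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (product TEXT, quantity INTEGER, revenue REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(clean(extract(SOURCE))))
totals = warehouse.execute(
    "SELECT product, SUM(quantity), SUM(revenue) FROM sales GROUP BY product"
).fetchall()
```

Because " pen " and "Pen" are normalised during cleaning, they end up as one product in the target table.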
6.2.1. Models for data analysis
Multidimensional data analysis
Dimensions
Attributes
Levels
Hierarchies
Facts
Measures
Multidimensional data analysis: Examples
Dimensions (e.g., spatio-temporal dimensions, product)
Attributes (e.g., name, manufacturer)
Levels (e.g., day, month, quarter, store, city, country)
Hierarchies (e.g., Day-Month-Quarter-Year, Store-City-Country)
Facts
Measures (e.g., Number of products sold/unsold)
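The vocabulary above can be made concrete with a tiny fact table. In this sketch each fact carries a measure (units sold) and references two dimensions, time and store; the hierarchies Day-Month and Store-City-Country are used to aggregate the measure at a coarser level. All stores, dates and numbers are invented.

```python
from collections import defaultdict

# Dimension table: levels of the Store-City-Country hierarchy.
STORES = {
    "S1": {"city": "Lyon", "country": "France"},
    "S2": {"city": "Paris", "country": "France"},
    "S3": {"city": "Geneva", "country": "Switzerland"},
}

# Fact table: (day, store, product, units_sold); units_sold is the measure.
FACTS = [
    ("2020-01-15", "S1", "pen", 10),
    ("2020-01-20", "S2", "pen", 5),
    ("2020-02-03", "S1", "ink", 7),
    ("2020-02-10", "S3", "pen", 4),
]

def roll_up(facts, time_level, store_level):
    """Sum the measure at the requested level of each hierarchy.

    time_level is "day" or "month"; store_level is "store", "city" or "country".
    """
    totals = defaultdict(int)
    for day, store, _product, units in facts:
        time_key = day[:7] if time_level == "month" else day
        store_key = store if store_level == "store" else STORES[store][store_level]
        totals[(time_key, store_key)] += units
    return dict(totals)

by_month_country = roll_up(FACTS, "month", "country")
```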
6.2.2. Star Schema
6.2.3. Data Cubes
Data cubes for online analytical processing (OLAP)
OLAP Cube operations
Slice
Dice
Drill up/down
Pivot
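The cube operations can be sketched on a dictionary that maps coordinates (year, city, product) to a measure. This is an illustrative toy model with invented numbers, not the interface of an OLAP engine; drill up is simplified here to aggregating one dimension away entirely.

```python
from collections import defaultdict

DIMENSIONS = ("year", "city", "product")

# Hypothetical cube: coordinates -> units sold.
CUBE = {
    (2019, "Lyon", "pen"): 10,
    (2019, "Paris", "pen"): 5,
    (2020, "Lyon", "pen"): 8,
    (2020, "Lyon", "ink"): 7,
    (2020, "Paris", "ink"): 4,
}

def slice_(cube, dim, value):
    """Slice: fix one dimension to a single value and drop it from the keys."""
    i = DIMENSIONS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}

def dice(cube, **allowed):
    """Dice: keep cells whose coordinates lie in the allowed sets (same dimensions)."""
    return {
        k: v
        for k, v in cube.items()
        if all(k[DIMENSIONS.index(d)] in values for d, values in allowed.items())
    }

def drill_up(cube, dim):
    """Drill up (simplified): sum the measure over all values of one dimension."""
    i = DIMENSIONS.index(dim)
    totals = defaultdict(int)
    for k, v in cube.items():
        totals[k[:i] + k[i + 1:]] += v
    return dict(totals)

def pivot(table):
    """Pivot: swap the two axes of a two-dimensional table."""
    return {(b, a): v for (a, b), v in table.items()}

sales_2020 = slice_(CUBE, "year", 2020)
lyon_pens = dice(CUBE, city={"Lyon"}, product={"pen"})
by_year_product = drill_up(CUBE, "city")
by_product_city = pivot(sales_2020)
```

Drilling down is the reverse direction: going back from `by_year_product` to the per-city cells requires the detailed cube, which is why the finest level is kept.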
6.2.4. Snowflake Schema
6.2. ETL: From one data store to another
From: Data sources
Internal or external databases
Web Services
To: Data warehouses
Enterprise warehouses
Web warehouses
7. Data Analysis
Activities of data analysis
Retrieve values
Filter
Compute derived values
Find extremum
Sort
Determine range
Characterize distribution
Find anomalies
Cluster
Correlate
Contextualization
https://en.wikipedia.org/wiki/Data_analysis
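Several of these activities fit in a few lines of standard-library Python. The two series are invented, the two-standard-deviation rule for outliers is just one common heuristic, and the Pearson coefficient is written out by hand to avoid depending on a recent Python version.

```python
import statistics

# Hypothetical daily measurements.
daily_sales = [10, 12, 11, 13, 12, 40]
daily_visits = [100, 120, 110, 130, 120, 400]

filtered = [x for x in daily_sales if x > 11]                   # filter
derived = [s / v for s, v in zip(daily_sales, daily_visits)]    # compute derived values
extremes = (min(daily_sales), max(daily_sales))                 # find extremum
ordered = sorted(daily_sales)                                   # sort
value_range = max(daily_sales) - min(daily_sales)               # determine range
mean = statistics.mean(daily_sales)                             # characterize distribution
spread = statistics.pstdev(daily_sales)
outliers = [x for x in daily_sales if abs(x - mean) > 2 * spread]  # unusual values

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return covariance / norm

correlation = pearson(daily_sales, daily_visits)                # correlate
```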
8. Data Visualization
8.1. Data Visualization
Time-series
Ranking
Part-to-whole
Deviation
Sort
Frequency distribution
Correlation
Nominal comparison
Geographic or geospatial
https://en.wikipedia.org/wiki/Data_visualization
8.2. Data Visualization: Examples
Bar-chart (Nominal comparison)
Pie-chart (part-to-whole)
Histograms (frequency-distribution)
Scatter-plot (correlation)
Network
Line-chart (time-series)
Treemap
Gantt chart
Heatmap
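As one example from this list, a histogram shows a frequency distribution by counting how many values fall into each bin. A plotting library would normally draw it; to stay dependency-free, the sketch below renders the bars as text. The ages and the bin width are invented.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count values per bin [k*w, (k+1)*w) and return sorted (bin_start, count) pairs."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return sorted(counts.items())

def render(bins, symbol="#"):
    """Draw one text bar per bin, its length equal to the count."""
    return "\n".join(f"{start:>3} | {symbol * count}" for start, count in bins)

# Hypothetical ages of survey participants.
ages = [21, 22, 25, 31, 34, 35, 38, 42, 57]
bins = histogram(ages, 10)
chart = render(bins)
```

Printing `chart` gives one row per age bracket, so the tallest bar marks the most frequent bracket.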
Pie Chart
Programming Language Paradigms (Bubble Chart)
Timeline of Programming Languages (using Histropedia)
Influence Graph of Programming Languages
k Predominant colours
RGB Scatter plots (Comparison)