Massive Data Processing

John Samuel
CPE Lyon

Year: 2022-2023
Email: john(dot)samuel(at)cpe(dot)fr

Objectives

Virtual machines
Containers: Docker
Orchestration: Kubernetes
Data analysis: Hadoop/HBase
Data analysis: Hive
Data analysis: Spark

Virtual machines

A virtual machine is the illusion of a computing device, created by emulation software or instantiated on a hypervisor.
The emulator simulates the presence of the following resources:
- processor
- memory
- hard disk
- operating system
Advantages:
- software portability
- management of legacy systems
- isolation of applications for security reasons
- emulation of several machines on a single machine

Hypervisor

A hypervisor is a virtualization platform that lets several operating systems run on the same physical machine at the same time.
Two categories: native and hosted
Type 1: native
- software that runs directly on the hardware platform
- examples: Xen, Oracle VM, Microsoft Hyper-V
Type 2: hosted
- software that runs inside another operating system
- it abstracts the guest operating systems from the host operating system
- a guest operating system runs as a process on the host
- examples: QEMU, VirtualBox

Containers: LXC (Linux Containers)

LXC is a virtualization system that uses isolation as its partitioning method at the operating-system level.
It virtualizes the execution environment (processor, RAM, network, file system, ...) rather than the machine.

Containers: Docker

Docker is free software for running applications in software containers.
It can package an application and its dependencies in a virtual container.

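FROM ubuntu:latest

# MAINTAINER is deprecated; a LABEL carries the same information
LABEL maintainer="John Samuel"

# Install Apache and git
RUN apt-get -y update && \
    apt-get -y upgrade && \
    apt-get -y install apache2 git

# Fetch the website sources
RUN git clone https://github.com/johnsamuelwrites/johnsamuelwrites.github.io

# Replace the default document root with the cloned site
RUN rm -rf /var/www/html

RUN mv johnsamuelwrites.github.io /var/www/html

RUN echo "ServerName localhost" >>/etc/apache2/apache2.conf

EXPOSE 80

# Keep Apache in the foreground so the container stays alive
CMD ["apachectl", "-D", "FOREGROUND"]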

Building the Docker image

docker build -t johnsamuel .

Running the Docker image

docker run -dit -p 8080:80 johnsamuel

Open the link: http://localhost:8080/

Rebuilding the Docker image

docker build --no-cache -t johnsamuel .

Docker Compose

A tool for defining and running multi-container Docker applications.
It uses YAML files to configure the application's services, and creates and starts all the containers with a single command.
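
As a rough illustration of what such a YAML file contains, the Python sketch below renders a minimal Compose file for the image built earlier. The service name `web` and the helper `render_compose` are hypothetical; the image name `johnsamuel` and the port mapping 8080:80 come from the commands above. The YAML is written by hand to stay dependency-free, so the helper only handles this simple shape.

```python
def render_compose(services):
    """Render a minimal Compose file (services with image and ports only)."""
    lines = ["services:"]
    for name, spec in services.items():
        lines.append(f"  {name}:")
        lines.append(f"    image: {spec['image']}")
        if spec.get("ports"):
            lines.append("    ports:")
            for mapping in spec["ports"]:
                # Quote port mappings so YAML parsers keep them as strings.
                lines.append(f'      - "{mapping}"')
    return "\n".join(lines) + "\n"


# Hypothetical service name "web"; image and ports come from the slides.
compose_text = render_compose(
    {"web": {"image": "johnsamuel", "ports": ["8080:80"]}}
)
print(compose_text)
```

Saved as docker-compose.yml, a file like this would typically be started with `docker compose up -d`.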

Orchestration: Kubernetes

Kubernetes is a container orchestration system that automates the deployment, scaling and management of applications.
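
Automated scaling can be pictured as a control loop that compares a desired state with the observed one and corrects the difference. The sketch below is a toy model in plain Python, not the Kubernetes API: `reconcile` and the instance names are hypothetical and only illustrate the principle.

```python
def reconcile(desired_replicas, running):
    """Return the list of running instances after one reconciliation step.

    Toy model: `running` is a list of instance names; missing replicas are
    started and surplus ones are stopped until the count matches.
    """
    running = list(running)
    next_id = 0
    while len(running) < desired_replicas:
        name = f"web-{next_id}"
        if name not in running:
            running.append(name)  # "start" a new instance
        next_id += 1
    while len(running) > desired_replicas:
        running.pop()  # "stop" the most recently added instance
    return running


state = reconcile(3, ["web-0"])  # scale up from 1 to 3 instances
print(state)
state = reconcile(2, state)      # scale down to 2 instances
print(state)
```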

HBase

A distributed, non-relational database management system that provides structured storage for large tables.
A column-oriented database.
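
One way to picture the column-oriented model is as a map from row key to column family, to column qualifier, to value. The sketch below mimics that layout with nested Python dictionaries. It is not the HBase client API, and the table contents, the `info` family and the helper names `put` and `get` are hypothetical.

```python
from collections import defaultdict

# row key -> column family -> qualifier -> value
table = defaultdict(lambda: defaultdict(dict))


def put(row, family, qualifier, value):
    """Store one cell, addressed by row key and family:qualifier."""
    table[row][family][qualifier] = value


def get(row, family, qualifier):
    """Return one cell, or None when the row or column is absent."""
    return table.get(row, {}).get(family, {}).get(qualifier)


put("Eiffel", "info", "year", 1985)
put("Squeak", "info", "year", 1996)
# Rows need not share the same set of columns.
put("Squeak", "info", "paradigm", "object-oriented")

print(get("Squeak", "info", "year"))
print(get("Eiffel", "info", "paradigm"))
print(sorted(table))  # row keys, sorted here for display
```

Because a missing cell simply has no entry, sparse rows cost nothing to store in this layout.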

Hive

Data warehouse software built on top of Apache Hadoop to support querying and analysing data.
It offers an SQL-like interface for querying data stored in the various databases and file systems that integrate with Hadoop.

        $  head /home/john/Downloads/query.csv
             itemLabel,year
             Amiga E,1993
             Embarcadero Delphi,1995
             Sather,1990
             Microsoft Small Basic,2008
             Squeak,1996
             AutoIt,1999
             Eiffel,1985
             Eiffel,1986
             Kent Recursive Calculator,1981

        $ export HADOOP_HOME="..."
        $ ./hive
        hive> set hive.metastore.warehouse.dir=${env:HOME}/hive/warehouse;
        hive> create database mydb;
        hive> use mydb;

        $./hive
        hive> use mydb;
        hive> CREATE TABLE IF NOT EXISTS
             proglang (name String, year int)
             COMMENT "Programming Languages"
             ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
             LINES TERMINATED BY '\n'
             STORED AS TEXTFILE;
        hive> LOAD DATA LOCAL INPATH '/home/john/Downloads/query.csv'
             OVERWRITE INTO TABLE proglang;

        $./hive
        hive> SELECT * from proglang;
        hive> SELECT * from proglang where year > 1980;
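
To see what these two queries return without a Hive installation, the sketch below runs the same SQL on the ten-line excerpt of query.csv shown earlier, using Python's built-in sqlite3 as a stand-in. SQLite is not Hive: the types and the storage layer differ, and only the query semantics are illustrated. The CSV header is skipped here, whereas a plain LOAD DATA in Hive would load it as an ordinary row unless the table is told to skip it.

```python
import csv
import io
import sqlite3

# The excerpt of query.csv shown by `head` earlier.
SAMPLE = """itemLabel,year
Amiga E,1993
Embarcadero Delphi,1995
Sather,1990
Microsoft Small Basic,2008
Squeak,1996
AutoIt,1999
Eiffel,1985
Eiffel,1986
Kent Recursive Calculator,1981
"""

# DictReader consumes the header line and yields one dict per data row.
rows = [
    (record["itemLabel"], int(record["year"]))
    for record in csv.DictReader(io.StringIO(SAMPLE))
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE proglang (name TEXT, year INTEGER)")
conn.executemany("INSERT INTO proglang VALUES (?, ?)", rows)

all_rows = conn.execute("SELECT * FROM proglang").fetchall()
after_1980 = conn.execute("SELECT * FROM proglang WHERE year > 1980").fetchall()
print(len(all_rows), len(after_1980))
```

Every row of the excerpt is later than 1980, so both queries return the same rows on this sample.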

        $./hive
        hive> DELETE from proglang where year=1980;
	FAILED: SemanticException [Error 10294]: Attempt to do update
	  or delete using transaction manager that does not support these operations.

        $./hive
        hive> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
        hive> DELETE from proglang where year=1980;
	FAILED: RuntimeException [Error 10264]: To use
	  DbTxnManager you must set hive.support.concurrency=true
        hive> set hive.support.concurrency=true;
        hive> DELETE from proglang where year=1980;
	FAILED: SemanticException [Error 10297]: Attempt to do update
	  or delete on table mydb.proglang that is not transactional
        hive> ALTER TABLE proglang set TBLPROPERTIES ('transactional'='true') ;
	FAILED: Execution Error, return code 1 from
          org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table.
          The table must be stored using an ACID compliant format
	  (such as ORC): mydb.proglang

        $./hive
        hive> use mydb;
        hive> CREATE TABLE IF NOT EXISTS
             proglangorc (name String, year int)
             COMMENT "Programming Languages"
             ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
             LINES TERMINATED BY '\n'
             STORED AS ORC;
        hive> LOAD DATA LOCAL INPATH '/home/john/Downloads/query.csv'
             OVERWRITE INTO TABLE proglangorc;
	FAILED: SemanticException Unable to load data to destination table.
          Error: The file that you are trying to load does not match
	   the file format of the destination table.

        $./hive
        hive> insert overwrite table proglangorc select * from proglang;
        hive> DELETE from proglangorc where year=1980;
	FAILED: SemanticException [Error 10297]: Attempt to do update
	  or delete on table mydb.proglangorc that is not transactional
        hive> ALTER TABLE proglangorc set TBLPROPERTIES ('transactional'='true') ;
        hive> DELETE from proglangorc where year=1980;
        hive> SELECT count(*) from proglangorc;
        hive> SELECT count(*) from proglangorc where year=1980;

Spark

        $ ./pyspark
        >>> # Total number of characters: map each line to its length, then sum
        >>> lines = sc.textFile("/home/john/Downloads/query.csv")
        >>> lineLengths = lines.map(lambda s: len(s))
        >>> totalLength = lineLengths.reduce(lambda a, b: a + b)
        >>> print(totalLength)

Spark

        $ ./pyspark
        >>> # Total number of words: map each line to its word count, then sum
        >>> lines = sc.textFile("/home/john/Downloads/query.csv")
        >>> lineWordCount = lines.map(lambda s: len(s.split()))
        >>> totalWords = lineWordCount.reduce(lambda a, b: a + b)
        >>> print(totalWords)
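
The two PySpark sessions above follow the same map-then-reduce pattern. The sketch below reproduces it with Python's built-in `map` and `functools.reduce` on the excerpt of query.csv shown earlier, so it runs without Spark. It is single-process and only illustrates the data flow, not Spark's distributed execution.

```python
from functools import reduce

# The excerpt of query.csv shown earlier, one string per line (no newline),
# which is also how sc.textFile delivers lines.
lines = [
    "itemLabel,year",
    "Amiga E,1993",
    "Embarcadero Delphi,1995",
    "Sather,1990",
    "Microsoft Small Basic,2008",
    "Squeak,1996",
    "AutoIt,1999",
    "Eiffel,1985",
    "Eiffel,1986",
    "Kent Recursive Calculator,1981",
]

# map: one value per line; reduce: combine the values pairwise.
line_lengths = map(len, lines)
total_length = reduce(lambda a, b: a + b, line_lengths)

line_word_counts = map(lambda s: len(s.split()), lines)
total_words = reduce(lambda a, b: a + b, line_word_counts)

print(total_length, total_words)
```

Because addition is associative and commutative, the reduce step can be split across partitions and merged in any order, which is the property Spark relies on for `reduce`.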

Apache Spark and Jupyter

        $ export SPARK_HOME='.../spark/spark-x.x.x-bin-hadoopx.x/bin'
        $ export PYSPARK_PYTHON=/usr/bin/python3
        $ export PYSPARK_DRIVER_PYTHON=jupyter
        $ export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
        $ ./pyspark

Apache Spark, Jupyter and Hive

        from pyspark.sql import HiveContext

        sqlContext = HiveContext(sc)
        sqlContext.sql("use default")
        sqlContext.sql("show tables").show()

Output:

        +--------+---------+-----------+
        |database|tableName|isTemporary|
        +--------+---------+-----------+
        | default| proglang|      false|
        | default|proglang2|      false|
        +--------+---------+-----------+

Apache Spark, Jupyter and Hive

        result = sqlContext.sql("SELECT count(*) FROM proglang")
        result.show()

Output:

        +--------+
        |count(1)|
        +--------+
        |     611|
        +--------+

        print(type(result))

Output:

        <class 'pyspark.sql.dataframe.DataFrame'>

Apache Spark, Jupyter, Hive and Pandas

        import pandas as pd

        # A Spark DataFrame can be converted to a pandas DataFrame
        result = sqlContext.sql("SELECT count(*) as count FROM proglang")
        resultFrame = result.toPandas()
        print(resultFrame)

        # Number of rows per year, computed in pandas
        result = sqlContext.sql("SELECT * FROM proglang")
        resultFrame = result.toPandas()
        groups = resultFrame.groupby('year').count()
        print(groups)
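
`groupby('year').count()` counts the rows that share each year. For readers without a Spark or pandas setup, the sketch below computes the same kind of tally with `collections.Counter` on the excerpt of query.csv shown earlier. `count_by` is a hypothetical helper, and unlike pandas it returns a single count per key rather than one count per remaining column.

```python
from collections import Counter

# Data rows from the excerpt of query.csv shown earlier.
rows = [
    {"name": "Amiga E", "year": 1993},
    {"name": "Embarcadero Delphi", "year": 1995},
    {"name": "Sather", "year": 1990},
    {"name": "Microsoft Small Basic", "year": 2008},
    {"name": "Squeak", "year": 1996},
    {"name": "AutoIt", "year": 1999},
    {"name": "Eiffel", "year": 1985},
    {"name": "Eiffel", "year": 1986},
    {"name": "Kent Recursive Calculator", "year": 1981},
]


def count_by(records, key):
    """Count how many records share each value of `key`."""
    return Counter(record[key] for record in records)


by_year = count_by(rows, "year")
by_name = count_by(rows, "name")
print(by_year[1985], by_name["Eiffel"])
```

In this excerpt every year occurs once, while grouping by name shows that Eiffel is listed twice.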

Sentiment analysis

        import nltk
        nltk.download('vader_lexicon')

        from nltk.sentiment.vader import SentimentIntensityAnalyzer

        sia = SentimentIntensityAnalyzer()

        sentiment = sia.polarity_scores("this movie is good")
        print(sentiment)

        sentiment = sia.polarity_scores("this movie is not very good")
        print(sentiment)

        sentiment = sia.polarity_scores("this movie is bad")
        print(sentiment)

Output:

        {'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}
        {'neg': 0.344, 'neu': 0.656, 'pos': 0.0, 'compound': -0.3865}
        {'neg': 0.538, 'neu': 0.462, 'pos': 0.0, 'compound': -0.5423}
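
The `compound` value is a normalized form of the summed word valences. Assuming the normalization used by the VADER implementation is x / sqrt(x^2 + 15), and assuming lexicon valences of 1.9 for "good" and -2.5 for "bad", the sketch below reproduces the first and third compound scores above. The second sentence also involves negation and intensifier rules that this sketch leaves out.

```python
import math

ALPHA = 15  # normalization constant, assumed from the VADER implementation


def normalize(valence_sum, alpha=ALPHA):
    """Map an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)


# Assumed lexicon valences; the other words in each sentence are neutral.
good_compound = round(normalize(1.9), 4)   # "this movie is good"
bad_compound = round(normalize(-2.5), 4)   # "this movie is bad"
print(good_compound, bad_compound)
```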

Data lake

A system or repository of data stored in its natural/raw format, usually object blobs or files.
It prioritizes fast, high-volume storage of heterogeneous data.
It can hold all of the following types of data:
- structured data
- unstructured data
- semi-structured data

