Architecture of Information Systems
Search Engine
John Samuel
CPE Lyon
Year: 2017-2018
Email: john(dot)samuel(at)cpe(dot)fr
Outline: Search engine
Frontend development
Backend development
Application programming interface
Search Engine
Target Audience
Regular users
Domain Experts
Frontend development
Search interface
One-box search
Advanced search (filters)
Personalized user experience
Interface: Simple search (One box)
Queries
Keywords
Natural language queries
Search Results
Links
Structured response
Natural language response
Leonardo da Vinci (October 2017 Google results)
Advanced search (filters)
Filter search results (Multiple boxes)
Artists on Histropedia
Advanced search (filters)
Location of Archaeological sites (Wikidata)
Advanced search (filters)
Why filters?
Reduce information overload
Precise queries
Interactive search
Advanced Search in one-box
Operators
AND
OR
NOT
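The three operators above map onto set operations over the documents that match each keyword. A minimal sketch, assuming a toy term-to-documents mapping (`INDEX` and the function names are hypothetical, chosen only for illustration):

```python
# Toy index: term -> set of document ids (hypothetical data for illustration).
INDEX = {
    "leonardo": {1, 2, 3},
    "vinci": {1, 2},
    "dicaprio": {3},
    "painting": {1, 4},
}


def docs(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return INDEX.get(term.lower(), set())


def search_and(a, b):
    """Documents that contain both terms: set intersection."""
    return docs(a) & docs(b)


def search_or(a, b):
    """Documents that contain at least one term: set union."""
    return docs(a) | docs(b)


def search_not(a, b):
    """Documents that contain a but NOT b: set difference."""
    return docs(a) - docs(b)


print(sorted(search_and("leonardo", "vinci")))     # [1, 2]
print(sorted(search_or("vinci", "painting")))      # [1, 2, 4]
print(sorted(search_not("leonardo", "dicaprio")))  # [1, 2]
```

A real one-box search would also need a parser for the query string; this sketch only shows how each operator combines result sets.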
Advanced Search in one-box
Mnemonics
Bangs (DuckDuckGo)
Personalized user experience
Time and location (Internationalization)
Weather (weather.com)
Personalized user experience
Past user search queries
User privacy
Backend development
Data collection
Data storage
Configuration
Logging
Dashboard
Security
Data collection
Data ownership
Internal data (e.g., official website, internal wikis, databases etc.)
External data (e.g., other websites, wikis, open data)
Data model (Data and Schema)
Unstructured data (e.g., documents, texts, web pages etc.)
Semi-structured data (e.g., JSON/XML files etc.)
Structured data (e.g., relational databases, linked data)
Data collection
Data sources
Web pages
Documents, texts
Sensors
Databases
...
Data collection
Data acquisition
Data dumps
Crawlers
Web scraping
Application Programming Interface (API)
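As an illustration of web scraping, the sketch below extracts the links from an HTML page using only the Python standard library. The page content (`SAMPLE_PAGE`) and the names `LinkExtractor` and `extract_links` are assumptions made for the example; a crawler would fetch real pages and follow the extracted links.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the list of link targets found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Hypothetical page content, used here instead of a network request.
SAMPLE_PAGE = (
    '<html><body><a href="https://example.org/a">A</a> '
    '<a href="/b">B</a><a name="x">no link</a></body></html>'
)

print(extract_links(SAMPLE_PAGE))  # ['https://example.org/a', '/b']
```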
Data collection
Data cleaning and transformation
Accuracy (e.g., verification with external sources)
Validity (e.g., detect constraint violations)
Uniformity (e.g., units)
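Validity and uniformity checks can be sketched as a small cleaning pass. The record layout (`name`, `height`, `unit`) and the conversion table are assumptions made for this example, not part of any particular dataset:

```python
# Conversion factors to metres (assumed set of accepted units).
TO_METRES = {"m": 1.0, "km": 1000.0, "cm": 0.01}


def clean(records):
    """Split records into cleaned rows and rejected rows.

    Validity: the unit must be known and the height a non-negative number.
    Uniformity: every accepted height is converted to metres.
    """
    valid, rejected = [], []
    for rec in records:
        height = rec.get("height")
        unit = rec.get("unit")
        if unit not in TO_METRES or not isinstance(height, (int, float)) or height < 0:
            rejected.append(rec)
            continue
        valid.append({"name": rec["name"], "height_m": height * TO_METRES[unit]})
    return valid, rejected


rows = [
    {"name": "A", "height": 2, "unit": "km"},
    {"name": "B", "height": -1, "unit": "m"},     # constraint violation
    {"name": "C", "height": 150, "unit": "cm"},
    {"name": "D", "height": 3, "unit": "miles"},  # unknown unit
]
cleaned, bad = clean(rows)
print([r["name"] for r in cleaned])  # ['A', 'C']
print([r["name"] for r in bad])      # ['B', 'D']
```

Accuracy checks (verification against external sources) are not covered here, since they depend on which reference source is available.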
Data storage
Model
Indexation
Query optimization
Caching
Replication
Backup
Data Model
Database schema
Schema-less
Data storage
Relational Databases
Object-oriented Databases
NoSQL databases (e.g., graph databases)
NewSQL databases (SQL + ACID guarantees)
Document indices and Query Optimization
Document indices
Forward index
Inverted index
Database Indexation
Query Optimization
Join ordering
Cost estimation
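The two document indices listed above can be illustrated on a tiny corpus: a forward index maps each document to its terms, and an inverted index maps each term to the documents containing it. This is a minimal sketch; the corpus and the whitespace tokenizer are assumptions for the example.

```python
from collections import defaultdict

# Hypothetical two-document corpus.
DOCUMENTS = {
    1: "Leonardo da Vinci painted the Mona Lisa",
    2: "The Mona Lisa hangs in the Louvre",
}


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_forward_index(documents):
    """Forward index: document id -> list of terms in the document."""
    return {doc_id: tokenize(text) for doc_id, text in documents.items()}


def build_inverted_index(forward_index):
    """Inverted index: term -> set of document ids containing the term."""
    inverted = defaultdict(set)
    for doc_id, terms in forward_index.items():
        for term in terms:
            inverted[term].add(doc_id)
    return dict(inverted)


forward = build_forward_index(DOCUMENTS)
inverted = build_inverted_index(forward)

print(forward[2])                  # ['the', 'mona', 'lisa', 'hangs', 'in', 'the', 'louvre']
print(sorted(inverted["mona"]))    # [1, 2]
print(sorted(inverted["louvre"]))  # [2]
```

Keyword lookup on the inverted index is a single dictionary access, whereas answering the same query from the forward index would require scanning every document.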
Caching
Frequently asked questions and cached responses
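Caching responses to frequent queries can be sketched with `functools.lru_cache` from the Python standard library. The `answer` function stands in for an expensive backend query and is hypothetical; `CALLS` only counts how often the backend is actually reached.

```python
from functools import lru_cache

# Counter used to show how often the (simulated) backend is reached.
CALLS = {"count": 0}


@lru_cache(maxsize=128)
def answer(query):
    """Simulated expensive search; results are cached per query string."""
    CALLS["count"] += 1
    return f"results for {query!r}"


answer("leonardo da vinci")
answer("leonardo da vinci")  # served from the cache

print(CALLS["count"])            # 1
print(answer.cache_info().hits)  # 1
```

An in-process cache like this is lost on restart and is not shared between servers; a shared cache would be a separate design choice.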
Replication and Backup
Replication (Primary-dependent replica)
Resource management and configuration
Availability (Wikipedia)
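Availability is commonly expressed as the fraction of time a service is up. A minimal sketch of that calculation, assuming a 365-day year (the function names are illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def availability(uptime_hours, downtime_hours):
    """Fraction of the observed period during which the service was up."""
    return uptime_hours / (uptime_hours + downtime_hours)


def allowed_downtime_hours(target, period_hours=HOURS_PER_YEAR):
    """Downtime budget (in hours) that still meets the target availability."""
    return (1.0 - target) * period_hours


for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {allowed_downtime_hours(target):.2f} h/year")
```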
Resource management and configuration
Machines (servers, disks etc.)
Software packages and dependencies
Energy consumption
Deployment
Development setup
Pre-production setup
Production setup
Packaging
Containers (e.g., Linux containers)
Load balancing
Server-side
Client-side
Selective Testing
A/B Testing
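For A/B testing, each user should see the same variant on every visit. One common way to achieve this is to hash a stable user identifier; the sketch below assumes such an identifier exists, and the experiment name `search-ui` is hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment="search-ui", variants=("A", "B")):
    """Deterministically map a user to one variant of an experiment.

    The same (experiment, user_id) pair always yields the same variant,
    so no assignment table needs to be stored.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return variants[int(digest, 16) % len(variants)]


for uid in ("alice", "bob", "carol"):
    print(uid, assign_variant(uid))
```

Including the experiment name in the hash keeps assignments independent across experiments.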
Logging
Access logs
Error logs
Event logs
Transaction logs
Logging
Why logs?
Debugging
Security (e.g., detect intrusion)
Database rollbacks
Audit
Analysis (e.g., detecting patterns, resource planning)
Logging
IP address
User
Resource ID
...
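An access log entry carrying the fields above can be sketched with Python's standard `logging` module. The logger name, the format string, and the sample values are assumptions for the example; the output is written to an in-memory buffer rather than a log file.

```python
import io
import logging


def make_access_logger(stream):
    """Create a logger whose records include IP, user and resource ID."""
    logger = logging.getLogger("access")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s ip=%(ip)s user=%(user)s "
        "resource=%(resource_id)s %(message)s"
    ))
    logger.addHandler(handler)
    logger.propagate = False  # keep entries out of the root logger
    return logger


buffer = io.StringIO()
access_log = make_access_logger(buffer)

# The extra fields are supplied per call; the values here are made up.
access_log.info(
    "GET /search",
    extra={"ip": "192.0.2.10", "user": "alice", "resource_id": "doc-42"},
)
print(buffer.getvalue(), end="")
```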
Dashboard
Wikimedia (Grafana: 5th October 2017)
Dashboard
Wikimedia (Availability: 5th October 2017)
Dashboard
Performance Metrics (active users, queries served etc.)
Real-time metrics (e.g., downtime, latency, throughput)
Dashboard
Email Alerts
Visual indicators
Security
Data protection
Logged-in users or public access
Third party access
Login (Wikipedia)
Security
Authentication
Authorization to third party access
OpenID
Mozilla Persona (2011-2016)
Detecting security vulnerabilities
Intrusion
SQL injection
Cross-site scripting
Denial of service
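SQL injection, one of the vulnerabilities listed above, typically arises when user input is concatenated into a query string. The sketch below contrasts that with a parameterized query, using an in-memory SQLite database; the `books` table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO books (title) VALUES (?)",
    [("Notebooks of Leonardo",), ("Lives of the Artists",)],
)


def search_unsafe(term):
    """Vulnerable: the user input becomes part of the SQL text."""
    query = "SELECT title FROM books WHERE title = '" + term + "'"
    return conn.execute(query).fetchall()


def search_safe(term):
    """Parameterized: the input is passed as data, never parsed as SQL."""
    return conn.execute(
        "SELECT title FROM books WHERE title = ?", (term,)
    ).fetchall()


malicious = "x' OR '1'='1"
print(len(search_unsafe(malicious)))  # 2 (every row leaks)
print(len(search_safe(malicious)))    # 0 (no title equals the input)
```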
Application programming interface
Service-oriented (SOAP)
Resource-oriented (REST)
API: Data formats
XML
JSON
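The same record can be exposed in either format. A minimal sketch using the Python standard library; the record and the root tag `artwork` are assumptions for the example:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record returned by an API.
record = {"id": "42", "title": "Mona Lisa", "artist": "Leonardo da Vinci"}


def to_json(rec):
    """Serialize the record as a JSON object."""
    return json.dumps(rec, sort_keys=True)


def to_xml(rec, root_tag="artwork"):
    """Serialize the record as XML, one child element per field."""
    root = ET.Element(root_tag)
    for key, value in rec.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


print(to_json(record))
# {"artist": "Leonardo da Vinci", "id": "42", "title": "Mona Lisa"}
print(to_xml(record))
# <artwork><id>42</id><title>Mona Lisa</title><artist>Leonardo da Vinci</artist></artwork>
```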
API: (CRUDL) Operations
Create
Read
Update
Delete
List
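The five operations can be sketched as an in-memory resource store. In a REST-style API they are often mapped to HTTP methods (noted in the comments as a common convention, not a requirement); the class and method names are hypothetical.

```python
import itertools


class ResourceStore:
    """In-memory store exposing the CRUDL operations."""

    def __init__(self):
        self._items = {}
        self._ids = itertools.count(1)

    def create(self, data):
        """Create: often POST on the collection. Returns the new id."""
        resource_id = next(self._ids)
        self._items[resource_id] = dict(data)
        return resource_id

    def read(self, resource_id):
        """Read: often GET on a single resource."""
        return dict(self._items[resource_id])

    def update(self, resource_id, data):
        """Update: often PUT or PATCH on a single resource."""
        self._items[resource_id].update(data)

    def delete(self, resource_id):
        """Delete: often DELETE on a single resource."""
        del self._items[resource_id]

    def list_ids(self):
        """List: often GET on the collection."""
        return sorted(self._items)


store = ResourceStore()
first = store.create({"title": "Mona Lisa"})
second = store.create({"title": "The Last Supper"})
store.update(first, {"artist": "Leonardo da Vinci"})
store.delete(second)

print(store.list_ids())  # [1]
print(store.read(first))  # {'title': 'Mona Lisa', 'artist': 'Leonardo da Vinci'}
```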
API: Data dumps
Complete data dumps
Selective data dumps
Application programming interface
HTTP
Software development kits (SDKs)
Interface definition
Human-readable documentation
Machine-readable documentation (WSDL, WADL etc.)
Human and machine-readable documentation (microformats, semantic web languages)
Human-readable Documentation
Read documentation
Develop application to integrate
Add business logic, if any
Machine-readable Documentation
Integrate automatically (fully autonomous solution)
Add business logic, if any
Quality of service
Resource usage limits
Limits on API call count (per user, IP)
Limits on data transfer
Temporary blocks
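A limit on API call count per user or IP can be sketched as a fixed-window counter. This is one simple strategy among several; the class name, the limits, and the injectable `clock` parameter (used to make the example deterministic) are assumptions.

```python
import time


class FixedWindowLimiter:
    """Allow at most max_calls per key within each window of window_seconds."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock
        self._windows = {}  # key -> (window start time, calls in window)

    def allow(self, key):
        """Return True if the call is within the limit, False if blocked."""
        now = self.clock()
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # the previous window expired
        if count >= self.max_calls:
            self._windows[key] = (start, count)
            return False  # temporary block until the window ends
        self._windows[key] = (start, count + 1)
        return True


# Simulated clock so the example does not need to sleep.
fake_time = [0.0]
limiter = FixedWindowLimiter(max_calls=2, window_seconds=60, clock=lambda: fake_time[0])

print([limiter.allow("192.0.2.10") for _ in range(3)])  # [True, True, False]
fake_time[0] = 61.0
print(limiter.allow("192.0.2.10"))                      # True
```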
Quality of service
Analysis on frequently made API calls
Resource planning and allocation
Security
No password
Basic authentication (e.g., username, password)
(Open) authentication protocols (e.g., OAuth)
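For basic authentication, the client sends the username and password in an `Authorization` header, Base64-encoded. A minimal sketch of building that header (the function name is illustrative; the credentials are the well-known example pair from the HTTP Basic authentication specification):

```python
import base64


def basic_auth_header(username, password):
    """Build the HTTP Authorization header for basic authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}


print(basic_auth_header("Aladdin", "open sesame"))
# {'Authorization': 'Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=='}
```

Base64 is an encoding, not encryption, so this scheme should only be used over an encrypted connection such as HTTPS.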
OAuth
Project
Virtual Library
Project
Target audience
Project
References
https://en.wikipedia.org/wiki/Information_system
https://en.wikipedia.org/wiki/Search_engine_(computing)
https://query.wikidata.org/
https://weather.com/
https://duckduckgo.com/bang
http://commoncrawl.org/
https://en.wikipedia.org/wiki/High_availability
https://grafana.wikimedia.org
https://status.wikimedia.org
http://highscalability.com/
Image credits
Wikimedia Commons
http://histropedia.com/timeline/5bnj79bjyr/Artists
https://pixabay.com/