Questions: First session
- Year: 2017-2018
- Duration: 2 hours
- Total: 15 points
- Documents: allowed
- Type of allowed documents: All documents allowed
- Electronic devices: not allowed
Question 1
Consider an enterprise that has multiple interlinked internal websites built with HTML and CSS. Your goal is to collect all the web pages of these websites to create a central data store. You have access to the websites (i.e., the HTML and CSS) but not to the databases behind them. What is your approach to downloading all the web pages? What are the limitations of your approach? (1.5 points)
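One possible approach is a breadth-first crawl that follows links within the enterprise domain. The sketch below is only an illustration under stated assumptions: the `fetch` callable, the `intranet.example` URLs and the in-memory `SITE` dictionary are hypothetical stand-ins for real HTTP requests, and pages reachable only through scripts or forms would be missed.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href targets of <a> and <link> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "link"):
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, allowed_prefix):
    """Breadth-first crawl; returns {url: content} for every reachable page."""
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue:
        url = queue.popleft()
        body = fetch(url)  # fetch is assumed to return text, or None on failure
        if body is None:
            continue
        pages[url] = body
        parser = LinkParser()
        parser.feed(body)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            # Stay inside the enterprise sites and never revisit a page.
            if target.startswith(allowed_prefix) and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages


# Hypothetical in-memory site standing in for real HTTP requests.
SITE = {
    "http://intranet.example/index.html":
        '<link href="style.css"><a href="about.html">About</a>',
    "http://intranet.example/about.html":
        '<a href="index.html">Home</a><a href="http://external.example/">Out</a>',
    "http://intranet.example/style.css": "body { color: black; }",
}
pages = crawl("http://intranet.example/index.html", SITE.get,
              "http://intranet.example/")
```

In a real setting, `fetch` would wrap an HTTP client and should respect access rules and rate limits.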
Question 2.a
What is a NoSQL data store? What are the different types of NoSQL data stores? Briefly describe each one of them. (1 point)
Question 2.b
Consider a sensor with five different types of measurement capabilities: luminosity, pressure, ultraviolet rays, temperature, and humidity. How would you represent the daily data collected by this sensor in a NoSQL data store of your choice? Explain with an example. (1 point)
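As one illustration, a document-oriented store could hold one document per sensor per day. The sketch below builds such a document as a Python dictionary and serializes it to JSON. The field names, the identifier, the date and all values are made up for this example; they do not come from a real sensor or a specific data store.

```python
import json

# One document per sensor per day (all names and values are hypothetical).
daily_reading = {
    "sensor_id": "sensor-001",
    "date": "2018-01-15",
    "measurements": [
        {"time": "00:00", "luminosity": 0.0, "pressure": 1013.2,
         "uv": 0.0, "temperature": 4.5, "humidity": 81.0},
        {"time": "12:00", "luminosity": 540.0, "pressure": 1012.8,
         "uv": 2.1, "temperature": 9.0, "humidity": 65.0},
    ],
}

# A document store would typically accept this JSON representation.
document = json.dumps(daily_reading)
restored = json.loads(document)
```

Nesting the readings inside the daily document keeps all five measurement types together, at the cost of larger documents when the sampling rate is high.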
Question 3
Data cleaning is a major step before doing data analysis. Why? What are the different types of errors in the data? How do you deal with them? (1 point)
Question 4
What are the differences between classification and clustering algorithms? What algorithms did you use for performing classification and clustering during your practical sessions? What are the advantages and disadvantages of these algorithms? (1 point)
Question 5.a
Consider a CSV file containing the following columns: Country, City, Year, and Population, i.e., it records the population of each city (of a country) for every year since 1900. Your goal is to write a Python program that reads this CSV file and performs the following:
- Find the city with the maximum population in the year 2000
- For every country, compute the average population of the cities in the year 2000
(2 points)
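A minimal sketch of one way to approach both tasks with the standard `csv` module follows. It assumes the header row matches the column names given in the question and that Population holds integers. The function name `summarize_year` and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def summarize_year(csv_file, year=2000):
    """Return (largest_city, {country: average city population}) for one year."""
    largest_city, largest_pop = None, -1
    totals = defaultdict(lambda: [0, 0])  # country -> [population sum, city count]
    for row in csv.DictReader(csv_file):
        if int(row["Year"]) != year:
            continue
        population = int(row["Population"])
        if population > largest_pop:
            largest_city, largest_pop = row["City"], population
        totals[row["Country"]][0] += population
        totals[row["Country"]][1] += 1
    averages = {country: s / n for country, (s, n) in totals.items()}
    return largest_city, averages


# Hypothetical sample data, for illustration only.
SAMPLE = """Country,City,Year,Population
A,Alpha,1999,90
A,Alpha,2000,100
A,Beta,2000,300
B,Gamma,2000,250
"""
city, averages = summarize_year(io.StringIO(SAMPLE))
```

With a real file, the `io.StringIO` object would be replaced by `open(path, newline="")`.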
Question 5.b
We assume that the CSV file of city population data does not contain any errors and has complete population data from the year 1900 to 2017. Your next goal is to predict the population of different cities in the year 2050. Write a Python program to achieve this prediction task. (2.5 points)
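One simple baseline is to fit a straight line to each city's yearly population by ordinary least squares and extrapolate it to 2050. The sketch below does this in plain Python for a single city. The linear trend is an assumption made for illustration (real population growth is often not linear), and the function names and the sample series are hypothetical.

```python
def fit_line(years, populations):
    """Ordinary least squares fit of population = slope * year + intercept."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(populations) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, populations))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(years, populations, target_year=2050):
    """Extrapolate the fitted line to the target year."""
    slope, intercept = fit_line(years, populations)
    return slope * target_year + intercept


# Hypothetical city growing by exactly 50 inhabitants per year from 1900.
years = list(range(1900, 2018))
populations = [1000 + 50 * (y - 1900) for y in years]
prediction = predict(years, populations)
```

For the full task, the same function would be applied to the series of each (Country, City) pair read from the CSV file.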
Question 6.a
What is a perceptron? (1 point)
Question 6.b
What is a recurrent neural network? How does it differ from other artificial neural networks? (1 point)
Question 7.a
You have been asked to build a recommendation system of images for your project. Give an overview of your system, detailing the various steps, algorithms and the architecture. Compare your work with the lifecycle of data. What are the steps that you used and what did you miss? (1 point)
Question 7.b
Before downloading and using data from a website, what must you consider? What approach did you take in this regard for your project? (1 point)
Question 8
Given below is a table detailing users' color preferences. The table consists of 5 color columns and 10 rows. Each row corresponds to one user and each column to one color; every value is either 0 or 1. If a value is 0, the user does not like the color; if the value is 1, the user likes the color. Find all possible association rules from this table. Justify your answer. (1 point)
User | C1 | C2 | C3 | C4 | C5 |
U1 | 0 | 0 | 0 | 0 | 0 |
U2 | 1 | 1 | 1 | 1 | 1 |
U3 | 1 | 1 | 1 | 1 | 1 |
U4 | 1 | 1 | 1 | 1 | 1 |
U5 | 0 | 0 | 0 | 0 | 0 |
U6 | 0 | 1 | 1 | 0 | 0 |
U7 | 0 | 0 | 0 | 0 | 0 |
U8 | 0 | 0 | 0 | 0 | 0 |
U9 | 0 | 1 | 1 | 0 | 1 |
U10 | 1 | 0 | 0 | 1 | 0 |
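The support and confidence of candidate rules can be checked mechanically. The sketch below is restricted to rules with a single color on each side and uses a minimum support of 0.4 and a minimum confidence of 1.0. Both thresholds are assumptions chosen for illustration, since the question does not fix them, and the function names are hypothetical.

```python
from itertools import permutations

COLORS = ["C1", "C2", "C3", "C4", "C5"]
# Rows copied from the table above.
PREFERENCES = {
    "U1": [0, 0, 0, 0, 0], "U2": [1, 1, 1, 1, 1], "U3": [1, 1, 1, 1, 1],
    "U4": [1, 1, 1, 1, 1], "U5": [0, 0, 0, 0, 0], "U6": [0, 1, 1, 0, 0],
    "U7": [0, 0, 0, 0, 0], "U8": [0, 0, 0, 0, 0], "U9": [0, 1, 1, 0, 1],
    "U10": [1, 0, 0, 1, 0],
}


def users_liking(color):
    """Set of users whose value for the given color is 1."""
    index = COLORS.index(color)
    return {user for user, row in PREFERENCES.items() if row[index] == 1}


def pairwise_rules(min_support=0.4, min_confidence=1.0):
    """Rules X => Y between single colors that meet both thresholds."""
    rules = []
    total = len(PREFERENCES)
    for antecedent, consequent in permutations(COLORS, 2):
        lhs = users_liking(antecedent)
        if not lhs:
            continue
        both = lhs & users_liking(consequent)
        support = len(both) / total        # share of users liking both colors
        confidence = len(both) / len(lhs)  # share of X-likers who also like Y
        if support >= min_support and confidence >= min_confidence:
            rules.append((antecedent, consequent, support, confidence))
    return rules


rules = pairwise_rules()
```

Under these thresholds the rules found are C1 => C4, C4 => C1, C2 => C3, C3 => C2, C5 => C2 and C5 => C3. Lowering the confidence threshold, or allowing several colors on either side, would yield more rules.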