Author: Paw Hermansen
Date: November 13, 2018
This notebook contains the very short version of my Capstone Project for the Coursera/IBM course series IBM Data Science Professional Certificate Specialization.
The full version with all the details can be found in https://github.com/pawhermansen/Coursera_Capstone/blob/master/IsCopenhagenLikeParis.ipynb.
Travel guide publisher Lonely Planet recently put the Danish capital Copenhagen at the top of its list of the best cities to visit in 2019. Copenhagen has a lot to offer, says Lonely Planet, mentioning the cyclists, the many green spaces, the old and new architecture, the great museums, Tivoli Gardens, the galleries, and the restaurants, including fancy New Nordic restaurants and even marvelous street food markets and indie bars.
That made some Copenhageners claim in the local newspapers that Copenhagen is like Paris in the summer. If this is true it will be interesting not only for tourists trying to find new and exciting destinations but also for the Copenhagen tourist association, visitcopenhagen.dk, that could direct its marketing to compete directly against other cities like Paris.
It is not stated exactly in what way Copenhagen and Paris are supposed to be alike. It is clearly not the weather, because the Copenhageners compare Copenhagen in the summer to Paris and do not mention the winter. It is also clearly not the language, even though Copenhagen and Paris are alike in that both speak a language that is incomprehensible to nearly everyone else. In Copenhagen, however, nearly everyone also speaks fluent English, which is certainly not the case in Paris.
The likeness between Copenhagen and Paris is probably more a feeling that when you walk around in either city you will see the same kinds and distribution of restaurants, bars, sights, bakeries and all other kinds of venues, and this is the definition of likeness that I choose to investigate.
This notebook uses tools from Data Science and Machine Learning to investigate whether Copenhagen is like Paris in the sense described above.
My approach is to divide each of the two cities into neighborhoods that I will treat as homogeneous with respect to their venue types.
Then, to see how alike the Copenhagen and the Paris neighborhoods are, I will run a cluster analysis of all the neighborhoods together based on the frequencies of their venue types. Clustering is a method of so-called unsupervised learning where the algorithm takes data-points that are not categorized or grouped beforehand and groups them into a given number of clusters based on their likeness. Here the list of venue type frequencies of each neighborhood is a data-point.
When I do this I actually cheat the cluster algorithm a little bit because the data-points are categorized beforehand with their city. I do not, however, reveal this to the cluster algorithm that will group the neighborhoods exclusively based on the venue type frequencies. My conclusion will be based on the result of the clustering algorithm compared to which city each neighborhood in each cluster is part of.
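The idea of clustering on venue type frequencies while keeping the city labels hidden can be illustrated with a small sketch. The four neighborhoods, the three venue types, and all numbers below are made up for illustration and are not taken from the notebook's data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical venue type frequencies for four made-up neighborhoods.
# Columns: Coffee Shop, Bar, Park. Each row sums to 1.
frequencies = np.array([
    [0.8, 0.1, 0.1],   # neighborhood A
    [0.7, 0.2, 0.1],   # neighborhood B
    [0.1, 0.1, 0.8],   # neighborhood C
    [0.1, 0.2, 0.7],   # neighborhood D
])

# The city of each neighborhood is known beforehand but is kept aside;
# it is never passed to the clustering algorithm.
cities = ['Copenhagen', 'Paris', 'Copenhagen', 'Paris']

# Group the neighborhoods into two clusters using only the frequencies.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(frequencies)
labels = kmeans.labels_

# Only afterwards are the cluster labels compared with the cities.
for city, label in zip(cities, labels):
    print(city, label)
```

In this toy data the two coffee-heavy neighborhoods land in one cluster and the two park-heavy ones in the other, so each cluster mixes the two cities.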
For example, if I cluster all the neighborhoods into two or more clusters and all the Copenhagen neighborhoods end up in their own clusters while all the Paris neighborhoods end up in other clusters, then the neighborhoods of each city are more like each other than like those of the other city. But if the neighborhoods end up mixed in clusters across the two cities, then you are right in claiming that Copenhagen, or at least some neighborhoods of Copenhagen, are like Paris.
If the two cities in fact have neighborhoods that are alike then the groups will show which neighborhoods from the two cities are most like each other.
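Whether a clustering mixes the two cities can be checked by counting the neighborhoods per cluster and city. The sketch below uses made-up neighborhood names and cluster labels; the column names follow those used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical clustering result: five made-up neighborhoods in three clusters.
df_result = pd.DataFrame({
    'Neighborhood': ['N1', 'N2', 'N3', 'N4', 'N5'],
    'City': ['Copenhagen', 'Copenhagen', 'Paris', 'Paris', 'Paris'],
    'Cluster': [0, 1, 1, 2, 2],
})

# Count neighborhoods per cluster (rows) and city (columns).
mix = pd.crosstab(df_result['Cluster'], df_result['City'])
print(mix)

# A cluster is "mixed" if it contains at least one neighborhood from each city.
mixed_clusters = mix[(mix > 0).all(axis=1)].index.tolist()
print('Mixed clusters:', mixed_clusters)
```

Here only cluster 1 contains neighborhoods from both cities; clusters 0 and 2 are pure Copenhagen and pure Paris respectively.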
For the Paris neighborhoods I use the 20 so-called arrondissements of Paris that are administrative zones of Paris. The name and the geographical coordinates of the location of each arrondissement can be downloaded in different formats from the Paris Data website - the data is covered by the Open Database License (ODbL).
Later I will use Foursquare to find venues in each neighborhood. To get the venues, Foursquare requires the geographical latitude and longitude of each neighborhood center and a maximum distance from that center to search. I select a reasonable distance for each neighborhood center and end up with the search-areas shown in the following map of Paris.
import folium
def createTownMap(df, zoom = 12):
    # Create map centered around the mean latitude and longitude values
    latitude = df['Latitude'].mean()
    longitude = df['Longitude'].mean()
    townmap = folium.Map(location=[latitude, longitude], zoom_start=zoom)
    # Add the search-areas to map.
    for lat, lng, neighborhood, radius in zip(df['Latitude'],
                                              df['Longitude'],
                                              df['Neighborhood'],
                                              df['Distance to Nearest'] / 2):
        label = folium.Popup(neighborhood, parse_html=True)
        folium.Marker(
            [lat, lng],
            popup=neighborhood).add_to(townmap)
        folium.Circle(
            radius=radius,
            location=[lat, lng],
            popup=label,
            color='blue',
            stroke=False,
            fill=True,
            fill_opacity=0.2).add_to(townmap)
    return townmap
import pandas as pd
df_parisNeighborhoods = pd.read_csv('data/paris_neighborhoods.csv')
createTownMap(df_parisNeighborhoods)
The Copenhagen neighborhoods are a little more difficult to get. From the Wikipedia article Bydele i Københavns kommune ("neighborhoods in Copenhagen Commune") I collect the ten administrative areas of Copenhagen.
The neighborhood Indre By ("Inner City") can be subdivided into smaller functional neighborhoods, but it turns out that Foursquare, which I will use later, has too few venues registered for some of the smaller neighborhoods, so I keep Indre By as one neighborhood.
I also include Frederiksberg, which is not administratively a part of the Copenhagen Commune but geographically lies inside the borders of Copenhagen (see https://www.quora.com/Why-is-Frederiksberg-not-a-part-of-Copenhagen for more information about this curiosity).
As for the Paris neighborhoods, I select a reasonable distance for each neighborhood center and end up with the search-areas shown in the following map of Copenhagen.
import pandas as pd
df_cphNeighborhoods = pd.read_csv('data/cph_neighborhoods.csv')
createTownMap(df_cphNeighborhoods)
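The radius of each search-area circle in the two maps is half of the `Distance to Nearest` column. How that column was computed is not shown in this short version. One plausible reading, assumed in the sketch below, is that it holds the distance in meters from each neighborhood center to the closest other center. The helper names `haversine_m` and `add_distance_to_nearest` and the three example coordinates are hypothetical.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between points given in degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

def add_distance_to_nearest(df):
    """Return a copy of df with the distance to the nearest other center."""
    lat = df['Latitude'].to_numpy()
    lng = df['Longitude'].to_numpy()
    # Pairwise distance matrix via broadcasting: entry [i, j] is the
    # distance from center i to center j.
    dist = haversine_m(lat[:, None], lng[:, None], lat[None, :], lng[None, :])
    # Ignore the zero distance from each center to itself.
    np.fill_diagonal(dist, np.inf)
    result = df.copy()
    result['Distance to Nearest'] = dist.min(axis=1)
    return result

# Made-up centers, not the real neighborhood coordinates.
df_example = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C'],
    'Latitude': [55.0, 55.0, 55.1],
    'Longitude': [12.0, 12.01, 12.0],
})
df_example = add_distance_to_nearest(df_example)
print(df_example)
```

Under this assumption, using half of the nearest distance as the radius guarantees that no two search-area circles overlap, so no venue is counted for two neighborhoods.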
Foursquare is a service that you can use to find the best places to eat, drink, shop, or visit in any city in the world. It also offers access through an open API with some limitations; registration is required.
We can call the Foursquare API to get a list of venues and their types within a certain distance from any location in Copenhagen and Paris. This means that for our purpose each neighborhood is defined as a center location and a radius around this center.
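A request of this kind can be sketched as follows. The sketch assumes the legacy Foursquare v2 `venues/explore` endpoint as it was available around 2018; whether this notebook used exactly that endpoint is an assumption, and the endpoint may have changed or been retired since. The function name `build_explore_url`, the default version date, and the credentials are placeholders. The code only builds the URL and sends nothing.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, lat, lng, radius_m,
                      limit=100, version='20180605'):
    """Build a URL for the (assumed) legacy v2 venues/explore endpoint."""
    params = {
        'client_id': client_id,          # obtained by registering
        'client_secret': client_secret,  # obtained by registering
        'v': version,                    # API version date (placeholder)
        'll': '{},{}'.format(lat, lng),  # neighborhood center
        'radius': int(radius_m),         # search radius in meters
        'limit': limit,                  # maximum number of venues returned
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# Placeholder credentials and a made-up center and radius.
url = build_explore_url('MY_ID', 'MY_SECRET', 48.8566, 2.3522, 500)
print(url)
```

The resulting URL could then be fetched with any HTTP client, once per neighborhood, and the returned venue names and categories collected into one table.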
I get the venues for all the Copenhagen and Paris neighborhoods from Foursquare and end up with a large table of 2775 venues in total. The first ten rows are shown below.
df_venues = pd.read_csv('data/venues.csv')
df_venues.head(10)
df_categories = pd.DataFrame(df_venues['Venue Category'].unique(), columns = ['Venue Category'])
print('There are {} unique categories.'.format(len(df_categories)))
That is quite a lot of different venue categories, so there will certainly be no problem in expressing the differences between the neighborhoods.
On the other hand, the venue categories might be too detailed, for example for restaurants, which are categorized by the country their cuisine originates from. After seeing the first results of the clustering it might become relevant to consider whether, for example, a Scandinavian Restaurant in Copenhagen should be counted as different from a French Restaurant in Paris.
df_restaurantCategories = df_categories[df_categories['Venue Category'].str.contains("Restaurant")]
print('Number of different Restaurant categories in the venues data is', len(df_restaurantCategories))
df_restaurantCategories.head(10)
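If the restaurant categories listed above turn out to be too fine-grained, one possible way to merge them into a single category is sketched below. The example rows and the column name `Simplified Category` are made up for illustration, and this is not necessarily how the categories were simplified in the full notebook.

```python
import pandas as pd

# Made-up venue categories for illustration.
df_example = pd.DataFrame({
    'Venue Category': ['French Restaurant', 'Scandinavian Restaurant',
                       'Wine Bar', 'Bakery'],
})

# Mark every category whose name contains the word "Restaurant".
is_restaurant = df_example['Venue Category'].str.contains('Restaurant')

# Keep non-restaurant categories as they are and replace all restaurant
# categories with the single category "Restaurant".
df_example['Simplified Category'] = df_example['Venue Category'].where(
    ~is_restaurant, 'Restaurant')
print(df_example)
```

With this mapping a Scandinavian Restaurant and a French Restaurant count as the same venue type, while all other categories are left untouched.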
The table below shows that Foursquare returned three times as many venues per square kilometer for Paris when compared to Copenhagen.
This could indicate that Paris has more venues that are interesting enough to make it into Foursquare, but I think it is much more likely that the Foursquare app is simply more popular in France than in Denmark. Because the clustering is based on relative frequencies within each neighborhood rather than on absolute counts, I assume that this difference has no influence on the results in this notebook.
import numpy as np
df_parisSearchAreas = np.square(df_parisNeighborhoods['Distance to Nearest'] / 2) * 3.1416
df_cphSearchAreas = np.square(df_cphNeighborhoods['Distance to Nearest'] / 2) * 3.1416
df_venuesByCity = df_venues.groupby('City').size().reset_index(name='Venue count')
df_venuesByCity['Search Area in m2'] = [df_cphSearchAreas.sum(), df_parisSearchAreas.sum()]
df_venuesByCity['Venues per km2'] = 1e6 * df_venuesByCity['Venue count'] / df_venuesByCity['Search Area in m2']
df_venuesByCity['Venues per km2'] = df_venuesByCity['Venues per km2'].map('{:,.2f}'.format)
df_venuesByCity
For the clustering algorithm I need a table of the frequency of occurrence of each venue category in each neighborhood. After a lot of hard work I end up with a table whose first ten rows are shown below. Most of the numbers are 0, but the sum of all numbers in each neighborhood, i.e. each row, is 1.
Actually I also make another table that I call simplified because it only contains one Restaurant category instead of a category for each nationality of restaurants. Both tables will be used for clustering in a moment.
df_freq = pd.read_csv('data/frequencies.csv')
df_freq.head(10)
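The frequency table loaded above comes from a prepared file. A table of this shape could be built from a venues table roughly as in the sketch below, which counts venues per neighborhood and category and then divides each row by its total. The example rows are made up, and the steps in the full notebook may differ.

```python
import pandas as pd

# Made-up venues for two neighborhoods.
df_example = pd.DataFrame({
    'City': ['Copenhagen'] * 4 + ['Paris'] * 2,
    'Neighborhood': ['Indre By'] * 4 + ['Louvre'] * 2,
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Bar', 'Plaza',
                       'Coffee Shop', 'Museum'],
})

# Count venues per neighborhood (rows) and category (columns);
# combinations that do not occur get a count of 0.
counts = (df_example
          .groupby(['City', 'Neighborhood', 'Venue Category'])
          .size()
          .unstack(fill_value=0))

# Divide each row by its total so that every row sums to 1.
df_exampleFreq = counts.div(counts.sum(axis=1), axis=0).reset_index()
print(df_exampleFreq)
```

In this toy table Indre By gets 0.5 for Coffee Shop and 0.25 each for Bar and Plaza, while Louvre gets 0.5 each for Coffee Shop and Museum.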
To be able to compare the neighborhoods myself after clustering them, I create a table of the top ten venue categories for each neighborhood for both the full and the simplified categories. The full table is shown below.
df_topTenVenues = pd.read_csv('data/topTenVenues.csv')
df_topTenVenues
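A table like the one above could be derived from the frequency table by sorting each neighborhood's categories by frequency and keeping the first few. The sketch below shows this on made-up numbers with the top two categories instead of the top ten; the helper name `top_categories` and the `Top N` column names are hypothetical.

```python
import pandas as pd

# Made-up frequency table for two neighborhoods.
df_exampleFreq = pd.DataFrame({
    'Neighborhood': ['Indre By', 'Louvre'],
    'Coffee Shop': [0.5, 0.2],
    'Bar': [0.3, 0.1],
    'Museum': [0.2, 0.7],
})

def top_categories(df_freq, n):
    """Return the n most frequent venue categories for each neighborhood."""
    rows = []
    for _, row in df_freq.set_index('Neighborhood').iterrows():
        # Sort this neighborhood's categories from most to least frequent.
        top = row.sort_values(ascending=False).index[:n]
        rows.append(list(top))
    columns = ['Top {}'.format(i + 1) for i in range(n)]
    result = pd.DataFrame(rows, columns=columns)
    result.insert(0, 'Neighborhood', df_freq['Neighborhood'].values)
    return result

df_exampleTop = top_categories(df_exampleFreq, 2)
print(df_exampleTop)
```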
Just for fun, the top venue categories of Copenhagen and Paris can be shown separately as the following word clouds.
from wordcloud import WordCloud
import matplotlib.pyplot as plt
def wordCloud(city):
    df = df_topTenVenues[df_topTenVenues['City'] == city].iloc[:,1:]
    df = df.replace(r' ','\u00a0', regex=True)
    words = ''
    for i in range(10):
        words = words + ' ' + ' '.join(df.iloc[:,i])
    wordcloud = WordCloud(width = 2400, height=1200, background_color='white').generate(words)
    plt.imshow(wordcloud)
    plt.axis('off')
    return plt
plt = wordCloud('Copenhagen')
plt.show()
plt = wordCloud('Paris')
plt.show()
I make four different clusterings.
I will just show one example in detail.
from sklearn.cluster import KMeans
kclusters = 6
# Drop the identifying columns so that only the frequencies are clustered
df_freqClustering = df_freq.drop(columns=['Neighborhood', 'City'])
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_freqClustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_
To visualize the result I create a table for each cluster (group) that shows the neighborhoods in the group and each neighborhood's top ten venue types.
It turns out that most of the groups have no mixing between the two cities at all; only one group shows minimal mixing, in that it contains Indre By (Inner City) of Copenhagen together with Paris neighborhoods. It seems that coffee shops play a role in the clustering, but cafés, bars, wine bars and plazas might also play a role.
This result is perhaps not a very strong proof of likeness between Copenhagen and Paris but at least it suggests that Inner City of Copenhagen is somewhat like the Paris neighborhoods Entrepôt, Louvre, Luxembourg, and Temple.
from IPython.display import Markdown, display
def printClusters(df, labels):
    df.insert(2, 'Cluster', labels)
    for cluster in range(kclusters):
        countCopenhagen = df[(df['Cluster'] == cluster) & (df['City'] == 'Copenhagen')].shape[0]
        countParis = df[(df['Cluster'] == cluster) & (df['City'] == 'Paris')].shape[0]
        print()
        display(Markdown('**Group {} with {} Copenhagen and {} Paris neighborhoods:**'.format(cluster, countCopenhagen, countParis)))
        display(df[df['Cluster'] == cluster])
printClusters(df_topTenVenues.copy(), kmeans.labels_)
The claim that Copenhagen is like Paris is somewhat supported by this investigation using tools from Data Science and Machine Learning and using Foursquare venue type data to define likeness. The evidence is not overwhelming in that some of the experiments only showed minimal mixing between the neighborhoods of Copenhagen and Paris.
The mixing that does occur, however, seems to be consistent across the experiments, and it shows that several Copenhagen neighborhoods, especially Indre By (Inner City), resemble the Paris neighborhoods Entrepôt, Louvre, Luxembourg, Temple, and Buttes-Chaumont.