The Hackathon Team was asked to build a textual string comparison system between the title and description of each project (OpenCoesione) and the museum denomination (ISTAT). The requested deliverables were:
A connection matrix containing both datasets' keys ("COD_LOCALE_PROGETTO" for OpenCoesione, "OC_COD_MUSEO" for ISTAT).
The developed algorithms and code (open source), together with a reliability strategy.
A final presentation.
This is a text mining problem. To find the requested associations, the team decided that the best strategy was to build similarity measures between the two datasets' key fields and to base each association on the highest-scoring museum (ISTAT) for each project (OpenCoesione).
Although several methods were available, the most suitable one seemed to be TF-IDF. A brief introduction to the method is presented in the next paragraphs. Then, some code chunks are shown and commented in detail so that the reader can follow the logic of the procedure.
In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use tf–idf.
Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields, including text summarization and classification.
Typically, the tf-idf weight is composed of two terms: the first computes the normalized term frequency (TF), i.e. the number of times a word appears in a document divided by the total number of words in that document; the second is the inverse document frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.
$ \mathbf{tf_{i,j} = \frac{n_{i,j}}{|d_j|}}$
where $n_{i,j}$ is the number of occurrences of the term $ i $ in the document $j$, and the denominator $ |d_j| $ is the length of the document $j$ expressed as the number of words it contains.
$ \mathbf{idf_{i} = \log \frac{|D|}{|\{d: i \in d\}|}} $
where $|D|$ is the number of documents in the corpus, and the denominator is the number of documents that contain the term $i$. The tf–idf weight is then the product of the two: $ \mathbf{tfidf_{i,j} = tf_{i,j} \cdot idf_{i}} $.
Deeper analyses can be found in various books, such as "Mining of Massive Datasets".
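As a hedged illustration of the two formulas above (a toy example, not the team's library code), TF and IDF can be computed on a tiny corpus of museum-like denominations:

```python
import math

# Toy corpus: each "document" is a denomination, tokenized into words.
corpus = [
    "museo archeologico nazionale".split(),
    "museo arte moderna".split(),
    "parco archeologico".split(),
]

def tf(term, doc):
    # n_{i,j} / |d_j|: occurrences of the term divided by the document length
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # log(|D| / |{d : term in d}|)
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

# "museo" appears in 2 of 3 documents, so it weighs less than "arte",
# which appears in only 1.
w_museo = tf("museo", corpus[0]) * idf("museo", corpus)
w_arte = tf("arte", corpus[1]) * idf("arte", corpus)
print(w_museo, w_arte)
```

Common terms like "museo" receive low weights, which is exactly why TF-IDF is well suited to matching museum denominations: the distinctive words dominate the score.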
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import OpenCoesioneLib as ocl
from tqdm import tqdm
#import googlemaps
from datetime import datetime
import folium
import branca.colormap as cm
import pickle
from folium.plugins import MarkerCluster
istat_data = pd.read_csv(r"C:\Users\claba\Desktop\OpenCoesione\Datasets\istat_dati.csv", engine = "python", sep = ";")
progetti_data = pd.read_csv(r"C:\Users\claba\Desktop\OpenCoesione\Datasets\progetti_focus_turismo_20190630.csv", engine = "python", sep = ";")
istat_cod = pd.read_csv(r"C:\Users\claba\Desktop\OpenCoesione\Datasets\istat_rev.csv",sep = ";", engine = "python", dtype={"COD_REGIONE": str,"COD_PROVINCIA": str,"COD_COMUNE": str}, encoding='cp1252')
istat_data = istat_data.drop(['COD_REGIONE','COD_PROVINCIA','COD_COMUNE'], axis=1).merge(istat_cod, on = "OC_COD_MUSEO", how = 'left')
istat_data[['COD_REGIONE']] = pd.to_numeric(istat_data['COD_REGIONE'], errors='ignore')
progetti_cultura_data = progetti_data.loc[(progetti_data["CLASSE"] == "Cultura") & ((progetti_data["CUP_COD_NATURA"] == "03") | (progetti_data["CUP_COD_NATURA"] == 3))]
Before proceeding with the definition of the algorithm, the team needed to preprocess the documents. To do this, we loaded the datasets and preprocessed them entirely with functions implemented and explained in our library, chief among them the "PreProcessing" function.
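The library's "PreProcessing" function itself is not reproduced here. As a sketch only, a typical cleaning pipeline for this kind of task (lowercasing, punctuation and digit removal, stop-word removal — assumed steps, not necessarily the exact ones implemented in `ocl.preProcessing`) might look like:

```python
import re

# Assumed stop-word list; the actual one would live in OpenCoesioneLib
STOPWORDS = {"di", "del", "della", "e", "il", "la", "per"}

def preprocess(text):
    # Lowercase, replace punctuation/digits with spaces, drop stop words
    text = text.lower()
    text = re.sub(r"[^a-zàèéìòù\s]", " ", text)
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(preprocess("Restauro del Museo Civico (fase 2)"))
# → "restauro museo civico fase"
```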
These are the preprocessing chunks:
remove_words = ["ASILO","SCUOLA","MATERNA","ELEMENTARE","SUPERIORE","TEATRALE","SCOLASTICO","LICEO","CLASSICO"]
for el in remove_words:
    progetti_cultura_data = progetti_cultura_data[~progetti_cultura_data.OC_TITOLO_PROGETTO.str.contains(el)]
progetti_cultura_data.reset_index(inplace = True)
idx = []
for i in progetti_cultura_data.index:
    try:
        # Flag rows whose DEN_REGIONE contains ':' or equals
        # 'AMBITO NAZIONALE' (i.e. not attributable to a single region)
        if ':' in progetti_cultura_data.loc[i]['DEN_REGIONE']:
            idx.append(i)
        if 'AMBITO NAZIONALE' == progetti_cultura_data.loc[i]['DEN_REGIONE']:
            idx.append(i)
    except:
        # DEN_REGIONE is missing (NaN) for some rows
        print(i)
progetti_cultura_data = progetti_cultura_data.drop(idx)
progetti_cultura_data[['COD_REGIONE']] = pd.to_numeric(progetti_cultura_data['COD_REGIONE'], errors='ignore')
progetti_cultura_data = progetti_cultura_data[progetti_cultura_data.COD_REGIONE <= 20]
prov_foo_cleaning = np.vectorize(ocl.prov_foo_cleaning)
progetti_cultura_data[['COD_PROVINCIA']] = progetti_cultura_data[['COD_PROVINCIA']].apply(prov_foo_cleaning)
istat_data["Index_Copy"] = range(len(istat_data))
progetti_cultura_data["Index_Copy"] = range(len(progetti_cultura_data))
istat_data["DENOMINAZIONE_NP"] = istat_data["DENOMINAZIONE"] # Keep the unprocessed denomination for the final dataset
istat_data["DENOMINAZIONE"] = istat_data["DENOMINAZIONE"].apply(lambda x: ocl.preProcessing(x))
[dictionary, inverted_index] = ocl.create_dictionary_and_inverted_index(istat_data)
idfi = []
for term in dictionary:
    idfi.append(ocl.IDFi(term, inverted_index, len(istat_data)))
inverted_index2 = ocl.create_inverted_index_with_TFIDF(istat_data, len(istat_data), dictionary, idfi)
progetti_cultura_data["DESCRITTORI_AGGREGATI"] = progetti_cultura_data["OC_TITOLO_PROGETTO"] + " " + progetti_cultura_data["OC_SINTESI_PROGETTO"]
progetti_cultura_data["DESCRITTORI_AGGREGATI"] = progetti_cultura_data["DESCRITTORI_AGGREGATI"].apply(lambda x: ocl.preProcessing(x))
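The functions `create_dictionary_and_inverted_index` and `create_inverted_index_with_TFIDF` belong to the team's library and are not shown here. Based only on their names (an assumption about their structure), an inverted index maps each term to the documents that contain it, which is what makes per-term scoring fast:

```python
from collections import defaultdict

# Two toy "documents", keyed like the row_<Index_Copy> ids used below
docs = {"row_0": "museo archeologico", "row_1": "museo arte"}

def build_inverted_index(docs):
    # Map each term to the list of document ids containing it
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term in sorted(set(text.split())):
            index[term].append(doc_id)
    return dict(index)

index = build_inverted_index(docs)
print(index["museo"])  # both documents contain "museo"
```

A TF-IDF-weighted variant would store `(doc_id, weight)` pairs in each posting list instead of bare ids.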
To drastically boost efficiency and the quality of the results, the team decided to group both datasets by province and compare only the corresponding pairs of groups. The following lines implement this procedure and, based on the previous computations, associate with each project the highest-scoring museum, provided the confidence exceeds a threshold $\epsilon$.
For projects whose province is not specified, the same method is applied at the region level.
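The actual scoring relies on `ocl.score`, which is not reproduced here. To illustrate the group-then-argmax-with-threshold logic in isolation, here is a simplified sketch that substitutes raw term counts and cosine similarity for the full TF-IDF weights (an illustrative stand-in, not the team's method):

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(project_text, museum_texts, eps=0.4):
    # Return (index, score) of the highest-scoring museum in the group,
    # or None if no museum clears the threshold epsilon.
    q = Counter(project_text.split())
    scores = [cosine(q, Counter(m.split())) for m in museum_texts]
    best = max(range(len(scores)), key=scores.__getitem__)
    return (best, scores[best]) if scores[best] > eps else None

museums = ["museo archeologico nazionale", "museo arte moderna"]
print(best_match("restauro museo arte", museums))
```

Grouping by province means each project is only compared against the museums of its own province, shrinking the candidate set by roughly two orders of magnitude.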
# Initialize the score column at the threshold epsilon = 0.4 and the association
# columns at the sentinel "Nessuna associazione trovata" ("no association found")
progetti_cultura_data["ACTUAL_SCORE"] = [0.4]*len(progetti_cultura_data)
progetti_cultura_data["ASSOCIATED_MUSEUM"] = ["Nessuna associazione trovata"]*len(progetti_cultura_data)
progetti_cultura_data["ASSOCIATED_MUSEUM_CODE"] = ["Nessuna associazione trovata"]*len(progetti_cultura_data)
progetti_cultura_data["ASSOCIATED_MUSEUM_TYPE"] = ["Nessuna associazione trovata"]*len(progetti_cultura_data)
prov_groups_project = progetti_cultura_data.groupby("COD_PROVINCIA")
prov_groups_istat = istat_data.groupby("COD_PROVINCIA")
provs = set(istat_data["COD_PROVINCIA"])
for prov in provs:
    try:
        istat_prov = prov_groups_istat.get_group(prov)
        project_prov = prov_groups_project.get_group(prov)
        for idx_doc in range(len(istat_prov)):
            document = istat_prov.iloc[idx_doc][["DENOMINAZIONE"]]
            document = list(document)
            document = ("row_" + str(int(istat_prov.iloc[idx_doc][["Index_Copy"]])), document[0])
            for idx_query in range(len(project_prov)):
                pquery = project_prov.iloc[idx_query][["DESCRITTORI_AGGREGATI"]]
                pquery = list(pquery)
                similarity = ocl.score(pquery[0], document, inverted_index, inverted_index2, dictionary, idfi)
                if similarity > float(project_prov.iloc[idx_query][["ACTUAL_SCORE"]]):
                    # Positional columns 194-197 hold ACTUAL_SCORE, ASSOCIATED_MUSEUM,
                    # ASSOCIATED_MUSEUM_CODE and ASSOCIATED_MUSEUM_TYPE respectively
                    progetti_cultura_data.iloc[project_prov.iloc[idx_query]["Index_Copy"], 195] = istat_prov.iloc[idx_doc]["DENOMINAZIONE_NP"]
                    progetti_cultura_data.iloc[project_prov.iloc[idx_query]["Index_Copy"], 194] = similarity
                    progetti_cultura_data.iloc[project_prov.iloc[idx_query]["Index_Copy"], 196] = istat_prov.iloc[idx_doc]["OC_COD_MUSEO"]
                    progetti_cultura_data.iloc[project_prov.iloc[idx_query]["Index_Copy"], 197] = istat_prov.iloc[idx_doc]["TIPOL1"]
    except KeyError:
        # No project group exists for this province
        print("Nessun progetto per questa provincia: " + str(prov))
reg_not_prov_groups_project = prov_groups_project.get_group("nan").groupby("COD_REGIONE")
reg_groups_istat = istat_data.groupby("COD_REGIONE")
regs = set(istat_data["COD_REGIONE"])
for reg in regs:
    try:
        istat_reg = reg_groups_istat.get_group(reg)
        project_reg = reg_not_prov_groups_project.get_group(reg)
        for idx_doc in range(len(istat_reg)):
            document = istat_reg.iloc[idx_doc][["DENOMINAZIONE"]]
            document = list(document)
            document = ("row_" + str(int(istat_reg.iloc[idx_doc][["Index_Copy"]])), document[0])
            for idx_query in range(len(project_reg)):
                pquery = project_reg.iloc[idx_query][["DESCRITTORI_AGGREGATI"]]
                pquery = list(pquery)
                similarity = ocl.score(pquery[0], document, inverted_index, inverted_index2, dictionary, idfi)
                if similarity > float(project_reg.iloc[idx_query][["ACTUAL_SCORE"]]):
                    # Same positional columns 194-197 as in the province loop above
                    progetti_cultura_data.iloc[project_reg.iloc[idx_query]["Index_Copy"], 195] = istat_reg.iloc[idx_doc]["DENOMINAZIONE_NP"]
                    progetti_cultura_data.iloc[project_reg.iloc[idx_query]["Index_Copy"], 194] = similarity
                    progetti_cultura_data.iloc[project_reg.iloc[idx_query]["Index_Copy"], 196] = istat_reg.iloc[idx_doc]["OC_COD_MUSEO"]
                    progetti_cultura_data.iloc[project_reg.iloc[idx_query]["Index_Copy"], 197] = istat_reg.iloc[idx_doc]["TIPOL1"]
    except KeyError:
        # No project group exists for this region
        print("Nessun progetto per questa regione: " + str(reg))
This is how the final merged dataset looks:
final_matrix = progetti_cultura_data[["\"COD_LOCALE_PROGETTO\"","ASSOCIATED_MUSEUM_CODE","OC_TITOLO_PROGETTO","OC_SINTESI_PROGETTO","ASSOCIATED_MUSEUM","ACTUAL_SCORE","ASSOCIATED_MUSEUM_TYPE"]]
final_matrix.head()
Saving the results ($\epsilon = .4$):
final_matrix.to_csv("Final_Matrix.csv")
The proposed algorithm has one main parameter, $\epsilon$, the similarity threshold for a "Project-Museum" pair. Its value is fundamental: as usual in machine learning and statistics, the choice of $\epsilon$ determines the rates of true/false positives and true/false negatives. Based on empirical experiments in which the team manually labeled the associations as correct/incorrect, $\epsilon$ was chosen equal to $0.4$. In particular, we analyzed a sample of 200 observations and estimated the error rates using asymptotically normal confidence intervals.
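The label counts from the 200-observation sample are not reported in this section. As a sketch with a hypothetical count (the 170 below is illustrative, not the team's actual result), the asymptotically normal (Wald) interval for a precision estimate is:

```python
import math

def wald_interval(successes, n, z=1.96):
    # Asymptotically normal (Wald) confidence interval for a proportion:
    # p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# Hypothetical: if 170 of the 200 manually labeled associations were correct,
# the estimated precision would be 0.85 with a ~±0.05 interval at 95% level.
lo, hi = wald_interval(170, 200)
print(round(lo, 3), round(hi, 3))
```

With $n = 200$ the half-width is at most $1.96\sqrt{0.25/200} \approx 0.07$, so the sample size is enough to compare candidate thresholds at this granularity.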
"""
final_presi=final_matrix[final_matrix["ACTUAL_SCORE"]>.4]
tip = ["Arte (da medievale a tutto l'800)","Industriale e/o d'impresa","Altro (specificare)",
"Area archeologica","Parco archeologico","Altro (specificare)","Chiesa, edificio o complesso monumentale a carattere religioso",
"Villa o palazzo di interesse storico o artistico", "Parco o giardino di interesse storico o artistico",
"Architettura fortificata o militare",
"Architettura civile di interesse storico o artistico","Arte moderna e contemporanea (dal '900 ai giorni nostri)",
"Manufatto archeologico","Manufatto di archeologia industriale","Religione e culto","Archeologia","Storia", "Storia naturale e scienze naturali",
"Scienza e tecnica","Etnografia e antropologia","Tematico e/o specializzato"]
my_tab = pd.crosstab(index=final_presi["ASSOCIATED_MUSEUM_TYPE"],
columns="count")
my_tab.index=tip
fig = my_tab.plot.bar()
"""
The team developed a dynamic visualization of the map of Italy showing the number of museums in each area; each museum's marker is colored differently based on the number of projects the museum obtained. The code used follows:
lock = final_matrix
istat = pd.read_csv(r"C:\Users\claba\Desktop\OpenCoesione\Datasets\istat_dati.csv", engine = "python", sep = ";")
# Run ONLY IF the pickle files below were lost
"""
# googleMaps Api initialization. Put your personal key for google cloud platform
geolocator = googlemaps.Client(key='***')
# querying gmaps to return (latitude, longitude) given the denomination of the museum and the municipality
place = []
not_found = []
for i in istat.index:
    try:
        query = geolocator.geocode(str(istat.iloc[i]['DENOMINAZIONE'] + ' ' + istat.iloc[i]['COMUNE']))
        if query != None:
            print(istat.loc[i]['DENOMINAZIONE'], (query[0]['geometry']['location']['lat'], query[0]['geometry']['location']['lng']))
            place.append((query[0]['geometry']['location']['lat'], query[0]['geometry']['location']['lng']))
        else:
            not_found.append(i)
    except:
        not_found.append(i)
#saving the locations of places found
pickle_out = open("places_found","wb")
pickle.dump(place, pickle_out)
pickle_out.close()
# saving index of places NOT found
pickle_out = open("places_not_found","wb")
pickle.dump(not_found, pickle_out)
pickle_out.close()
"""
#reading pickle with coordinates found with gmaps API
pickle_in = open("places_found","rb")
loc = pickle.load(pickle_in)
#reading indexes not found with gmaps APi
pickle_in = open("places_not_found","rb")
not_found = pickle.load(pickle_in)
# Define the marker color as a function of the number of projects per museum
def colorfunction():
    # relies on the loop variable idx set in the marker loop below
    projects_ = int(lock[lock['ASSOCIATED_MUSEUM'] == istat.loc[idx]['DENOMINAZIONE']].count()[0])
    if projects_ == 1:
        col = 'yellow'
    elif 1 < projects_ <= 3:
        col = 'orange'
    elif 3 < projects_ <= 7:
        col = 'red'
    elif projects_ > 7:
        col = 'black'
    else:
        col = 'lightgray'
    return col
# define the legend of the map
legenda= cm.StepColormap(['yellow','orange','red','black'], index=[1, 3, 7, 10 ], vmin=0, vmax=10)
# museum_code contains only the code of the museums found by gmaps
museum_code = []
for idx in range(len(istat)):
    if idx not in not_found:
        museum_code.append(str(istat.loc[idx]['OC_COD_MUSEO']))
# creating the map of the museums, clustering them by proximity
Map=folium.Map(
location=(42.442699, 13.005525),
zoom_start=5.5,
tiles="cartoDBpositron"
)
# adding a marker only if the museum was found by gmaps
i = 0
mc = MarkerCluster()
for idx in range(len(istat)):
    if idx not in not_found:
        m = museum_code[i]
        mc.add_child(folium.Marker(icon=folium.Icon(color=colorfunction()), popup=folium.Popup(m), location=loc[i]))  # popup is the museum code
        i += 1
colormap = legenda
colormap.caption = 'Number of projects per museum'
Map.add_child(colormap)
Map.add_child(mc)
Map.save('map.html')