OpenCoesione Hackathon

Analysis on Public Cultural Projects and Museums

Claudio Battiloro, Alessandro Flaborea, Federica Spoto, Davide Facchinelli, Livia Lilli, Riccardo Cervelli


Main Task: DataLinkage

The hackathon team was asked to build a textual string comparison system between the title and description of a project (OpenCoesione) and the museum denomination (ISTAT).

Output:

  • Connection matrix containing both datasets' keys ("COD_LOCALE_PROGETTO" for OpenCoesione, "OC_COD_MUSEO" for ISTAT).

  • Developed algorithms and code (open source) and reliability strategy.

  • Final presentation.

Algorithms

This is a text mining problem. To find the requested associations, the team decided that the best strategy was to build similarity measures between the two datasets' keys and to base each association on the highest-scoring museum (ISTAT) for each project (OpenCoesione).

Although several methods were available, the most suitable one seemed to be TFIDF. A brief introduction and explanation of the method is presented in the next paragraphs. Then, some code chunks are shown and commented in detail to let the reader follow the logic of the procedure.

A gentle introduction to TFIDF

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use tf–idf.

Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields, including text summarization and classification.

Typically, the tf-idf weight is composed of two terms: the first computes the normalized Term Frequency (TF), i.e. the number of times a word appears in a document divided by the total number of words in that document; the second is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.

  • TF: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in long documents than in short ones. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:

$ \mathbf{tf_{i,j} = \frac{n_{i,j}}{|d_j|}}$

where $n_{i,j}$ is the number of occurrences of the term $ i $ in the document $j$, and the denominator $ |d_j| $ is the length of the document $j$, expressed as the number of words it contains.

  • IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as "is", "of", and "that", may appear many times but carry little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

$ \mathbf{idf_{i} = \log \frac{|D|}{|\{d: i \in d\}|}} $

where $|D|$ is the number of the documents in the corpus, and the denominator is the number of documents that contain the term $i$.
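As a toy illustration of the two formulas above (this is a minimal sketch, not the team's `OpenCoesioneLib` implementation), tf-idf can be computed over a small tokenized corpus like this:

```python
import math

# Toy corpus: each "document" is a museum denomination, already tokenized.
corpus = [
    ["museo", "arte", "moderna"],
    ["museo", "archeologico", "nazionale"],
    ["teatro", "comunale"],
]

def tf(term, doc):
    # Term frequency: occurrences of `term` in `doc`, normalized by length.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency: log of |D| over documents containing `term`.
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "museo" appears in 2 of 3 documents, so its idf (and weight) is low;
# "teatro" appears in only 1 of 3, so it gets a higher weight.
print(tfidf("museo", corpus[0], corpus))
print(tfidf("teatro", corpus[2], corpus))
```

Note how the common term "museo" is automatically down-weighted even where it is frequent, which is exactly why tf-idf suits the museum-denomination matching task.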

Deeper analyses can be found in various books, such as "Mining of Massive Datasets".

Step 1: Importing useful libraries and loading Data

  • Importing Libraries:
In [68]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import OpenCoesioneLib as ocl
from tqdm import tqdm
#import googlemaps
from datetime import datetime
import folium
import branca.colormap as cm
import pickle
from folium.plugins import MarkerCluster
  • Loading the data and filtering on the projects to analyze:
In [54]:
istat_data = pd.read_csv(r"C:\Users\claba\Desktop\OpenCoesione\Datasets\istat_dati.csv", engine = "python", sep = ";")
progetti_data = pd.read_csv(r"C:\Users\claba\Desktop\OpenCoesione\Datasets\progetti_focus_turismo_20190630.csv", engine = "python", sep = ";")
istat_cod = pd.read_csv(r"C:\Users\claba\Desktop\OpenCoesione\Datasets\istat_rev.csv",sep = ";",  engine = "python", dtype={"COD_REGIONE": str,"COD_PROVINCIA": str,"COD_COMUNE": str}, encoding='cp1252')
istat_data = istat_data.drop(['COD_REGIONE','COD_PROVINCIA','COD_COMUNE'], axis=1).merge(istat_cod, on = "OC_COD_MUSEO", how = 'left')
istat_data[['COD_REGIONE']] = pd.to_numeric(istat_data['COD_REGIONE'], errors='ignore')
progetti_cultura_data = progetti_data.loc[(progetti_data["CLASSE"] == "Cultura") & ((progetti_data["CUP_COD_NATURA"] == "03") | (progetti_data["CUP_COD_NATURA"] == 3))]

Step 2: PreProcessing

Before proceeding with the definition of the algorithm, the team needed to pre-process the documents. To do this, we decided to load the datasets and pre-process them entirely with functions implemented and documented in our library. The "PreProcessing" function implements:

  • the Tokenizing of the documents
  • the removal of stop-words
  • the stemming of words
  • the removal of non-ASCII characters
  • the removal of punctuation
  • the replacing of numbers with their literal representation
  • the conversion of words in lower case
  • other functions
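The team's `preProcessing` function lives in `OpenCoesioneLib`, which is not reproduced here. A minimal stand-in covering most of the steps above (the stop-word list and number mapping are illustrative, not the ones actually used; stemming is omitted) might look like:

```python
import string
import unicodedata

# Illustrative Italian stop-words and number-to-word mapping (hypothetical,
# not the lists used by OpenCoesioneLib.preProcessing).
STOP_WORDS = {"di", "del", "della", "e", "il", "la", "un", "una"}
NUMBER_WORDS = {"1": "uno", "2": "due", "3": "tre"}

def pre_process(text):
    # Convert to lower case.
    text = text.lower()
    # Remove non-ASCII characters (accents are decomposed and dropped).
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    # Remove punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenize on whitespace.
    tokens = text.split()
    # Replace numbers with a literal representation where known.
    tokens = [NUMBER_WORDS.get(t, t) for t in tokens]
    # Remove stop-words.
    return [t for t in tokens if t not in STOP_WORDS]

print(pre_process("Museo della Scienza, Sala 2"))
# → ['museo', 'scienza', 'sala', 'due']
```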

These are the preprocessing chunks:

  • Removing Terms that will lead to a misclassification
In [55]:
remove_words = ["ASILO","SCUOLA","MATERNA","ELEMENTARE","SUPERIORE","TEATRALE","SCOLASTICO","LICEO","CLASSICO"]
for el in remove_words:
    progetti_cultura_data = progetti_cultura_data[~progetti_cultura_data.OC_TITOLO_PROGETTO.str.contains(el)]
progetti_cultura_data.reset_index(inplace = True)
  • Making the codification for Italian regions homogeneous
In [56]:
idx = []
for i in progetti_cultura_data.index:
    try:
        if ':' in progetti_cultura_data.loc[i]['DEN_REGIONE']:
            idx.append(i)
        if 'AMBITO NAZIONALE' ==  progetti_cultura_data.loc[i]['DEN_REGIONE']:
            idx.append(i)
    except:
        print(i)
        
progetti_cultura_data = progetti_cultura_data.drop(idx)
progetti_cultura_data[['COD_REGIONE']] = pd.to_numeric(progetti_cultura_data['COD_REGIONE'], errors='ignore')
progetti_cultura_data = progetti_cultura_data[progetti_cultura_data.COD_REGIONE <= 20]
  • Making the codification for Italian provinces homogeneous
In [57]:
prov_foo_cleaning = np.vectorize(ocl.prov_foo_cleaning) 
progetti_cultura_data[['COD_PROVINCIA']] = progetti_cultura_data[['COD_PROVINCIA']].apply(prov_foo_cleaning)
  • The following lines duplicate the indices of projects and museums. This is useful to keep track of the records' positions, since the dataframe will be reorganized in the next processing steps.
In [58]:
istat_data["Index_Copy"] = range(len(istat_data))
progetti_cultura_data["Index_Copy"] = range(len(progetti_cultura_data))
  • With these lines, the ISTAT dataset is processed, the dictionary and the inverted index are created, and the $idf_i$ are computed.
In [59]:
istat_data["DENOMINAZIONE_NP"] = istat_data["DENOMINAZIONE"] # Keep the unprocessed description for the final dataset
istat_data["DENOMINAZIONE"] = istat_data["DENOMINAZIONE"].apply(lambda x: ocl.preProcessing(x))
[dictionary, inverted_index] = ocl.create_dictionary_and_inverted_index(istat_data)

idfi = []
for term in dictionary:
    idfi.append(ocl.IDFi(term,inverted_index,len(istat_data)))

inverted_index2 = ocl.create_inverted_index_with_TFIDF(istat_data, len(istat_data), dictionary, idfi)
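The `create_dictionary_and_inverted_index` helper belongs to the team's library and is not shown here. As a hedged sketch, an inverted index in this sense is simply a map from each term to the set of documents containing it:

```python
def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it.

    `docs` is a dict {doc_id: list_of_tokens}; this is a simplified
    stand-in for ocl.create_dictionary_and_inverted_index.
    """
    index = {}
    for doc_id, tokens in docs.items():
        for term in set(tokens):
            index.setdefault(term, set()).add(doc_id)
    return index

docs = {
    "row_0": ["museo", "arte", "moderna"],
    "row_1": ["museo", "archeologico"],
}
index = build_inverted_index(docs)
print(index["museo"])  # → {'row_0', 'row_1'}
```

Looking up a term in this structure immediately yields its document frequency (the size of the set), which is what the $idf_i$ computation above needs.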
  • To make the procedure more efficient, we aggregate all the useful features in order to process them together.
In [60]:
progetti_cultura_data["DESCRITTORI_AGGREGATI"] = progetti_cultura_data["OC_TITOLO_PROGETTO"]  + " " + progetti_cultura_data["OC_SINTESI_PROGETTO"]# + " " +  progetti_cultura_data[""]
progetti_cultura_data["DESCRITTORI_AGGREGATI"] = progetti_cultura_data["DESCRITTORI_AGGREGATI"].apply(lambda x: ocl.preProcessing(x))

Step 3: Core Algorithm and TFIDF implementations

To drastically improve efficiency and results, the team decided to group both datasets by province and compare only the corresponding pairs of groups. The following lines implement this procedure and, based on the previous computations, associate with each project the highest-scoring museum, provided the confidence exceeds a threshold $\epsilon$.

For the projects whose province is not specified, the region is used to perform the same method.
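The `ocl.score` function itself is not reproduced here. As an illustration of the kind of similarity it computes (a hedged sketch, assuming a TF-IDF weighted cosine similarity with an illustrative idf table, not the library's actual implementation):

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    # Normalized term frequency times idf, for terms with a known idf.
    counts = Counter(tokens)
    n = len(tokens)
    return {t: (c / n) * idf[t] for t, c in counts.items() if t in idf}

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy idf table (illustrative values, not computed from the real corpus).
idf = {"museo": 0.4, "arte": 1.1, "moderna": 1.1, "teatro": 1.1}

query = ["museo", "arte", "moderna"]  # project descriptors
document = ["museo", "arte"]          # museum denomination
print(cosine(tfidf_vector(query, idf), tfidf_vector(document, idf)))
```

A score like this lies in $[0, 1]$, which is why a single threshold $\epsilon$ can be used to accept or reject an association.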

In [61]:
progetti_cultura_data["ACTUAL_SCORE"] = [0.4]*len(progetti_cultura_data)
progetti_cultura_data["ASSOCIATED_MUSEUM"] = ["Nessuna associazione trovata"]*len(progetti_cultura_data)
progetti_cultura_data["ASSOCIATED_MUSEUM_CODE"] = ["Nessuna associazione trovata"]*len(progetti_cultura_data)
progetti_cultura_data["ASSOCIATED_MUSEUM_TYPE"] = ["Nessuna associazione trovata"]*len(progetti_cultura_data)


prov_groups_project = progetti_cultura_data.groupby("COD_PROVINCIA")
prov_groups_istat = istat_data.groupby("COD_PROVINCIA")
provs = set(istat_data["COD_PROVINCIA"])
for prov in provs:
    try:
        istat_prov = prov_groups_istat.get_group(prov)
        project_prov = prov_groups_project.get_group(prov)
        for idx_doc in range(len(istat_prov)):
            document = istat_prov.iloc[idx_doc][["DENOMINAZIONE"]]
            document = list(document)
            document = ("row_"+str(int(istat_prov.iloc[idx_doc][["Index_Copy"]])),document[0])
            for idx_query in range(len(project_prov)):
                    pquery = project_prov.iloc[idx_query][["DESCRITTORI_AGGREGATI"]]
                    pquery = list(pquery)
                    similarity = ocl.score(pquery[0], document, inverted_index, inverted_index2, dictionary, idfi)
                    if  similarity > float(project_prov.iloc[idx_query][["ACTUAL_SCORE"]]):
                        progetti_cultura_data.iloc[project_prov.iloc[idx_query]["Index_Copy"],195] = istat_prov.iloc[idx_doc]["DENOMINAZIONE_NP"]
                        progetti_cultura_data.iloc[project_prov.iloc[idx_query]["Index_Copy"],194]  = similarity
                        progetti_cultura_data.iloc[project_prov.iloc[idx_query]["Index_Copy"],196] = istat_prov.iloc[idx_doc]["OC_COD_MUSEO"]    
                        progetti_cultura_data.iloc[project_prov.iloc[idx_query]["Index_Copy"],197] = istat_prov.iloc[idx_doc]["TIPOL1"]    

    except:
        print("Nessun progetto per questa provincia: "+str(prov))

reg_not_prov_groups_project = prov_groups_project.get_group("nan").groupby("COD_REGIONE")
reg_groups_istat = istat_data.groupby("COD_REGIONE")
regs = set(istat_data["COD_REGIONE"])
for reg in regs:
    try:
        istat_reg = reg_groups_istat.get_group(reg)
        project_reg = reg_not_prov_groups_project.get_group(reg)
        for idx_doc in range(len(istat_reg)):
            document = istat_reg.iloc[idx_doc][["DENOMINAZIONE"]]
            document = list(document)
            document = ("row_"+str(int(istat_reg.iloc[idx_doc][["Index_Copy"]])),document[0])
            for idx_query in range(len(project_reg)):
                    pquery = project_reg.iloc[idx_query][["DESCRITTORI_AGGREGATI"]]
                    pquery = list(pquery)
                    similarity = ocl.score(pquery[0], document, inverted_index, inverted_index2, dictionary, idfi)
                    if  similarity > float(project_reg.iloc[idx_query][["ACTUAL_SCORE"]]):
                        progetti_cultura_data.iloc[project_reg.iloc[idx_query]["Index_Copy"],195] = istat_reg.iloc[idx_doc]["DENOMINAZIONE_NP"]
                        progetti_cultura_data.iloc[project_reg.iloc[idx_query]["Index_Copy"],194]  = similarity
                        progetti_cultura_data.iloc[project_reg.iloc[idx_query]["Index_Copy"],196] = istat_reg.iloc[idx_doc]["OC_COD_MUSEO"]
                        progetti_cultura_data.iloc[project_reg.iloc[idx_query]["Index_Copy"],197] = istat_reg.iloc[idx_doc]["TIPOL1"]    
    except:
        print("Nessun progetto per questa regione: "+str(reg))       
Nessun progetto per questa provincia: 021
Nessun progetto per questa regione: 1
Nessun progetto per questa regione: 2
Nessun progetto per questa regione: 3
Nessun progetto per questa regione: 4
Nessun progetto per questa regione: 7
Nessun progetto per questa regione: 9
Nessun progetto per questa regione: 10
Nessun progetto per questa regione: 11
Nessun progetto per questa regione: 12

The output matrix

This is what the final merged dataset looks like:

In [62]:
final_matrix = progetti_cultura_data[["\"COD_LOCALE_PROGETTO\"","ASSOCIATED_MUSEUM_CODE","OC_TITOLO_PROGETTO","OC_SINTESI_PROGETTO","ASSOCIATED_MUSEUM","ACTUAL_SCORE","ASSOCIATED_MUSEUM_TYPE"]]
final_matrix.head()
Out[62]:
"COD_LOCALE_PROGETTO" ASSOCIATED_MUSEUM_CODE OC_TITOLO_PROGETTO OC_SINTESI_PROGETTO ASSOCIATED_MUSEUM ACTUAL_SCORE ASSOCIATED_MUSEUM_TYPE
0 11FR33390 Nessuna associazione trovata INTERVENTO DI RECUPERO, VALORIZZAZIONE, CONSOL... INTERVENTO DI RECUPERO, VALORIZZAZIONE, CONSOL... Nessuna associazione trovata 0.4 Nessuna associazione trovata
1 11FR35244 Nessuna associazione trovata OPERA 7749 INTERVENTO DI COMPLETAMENTO DEL PROGETTO DI AM... Nessuna associazione trovata 0.4 Nessuna associazione trovata
2 13TO10360.14072017.115000019_1020 Nessuna associazione trovata (10360.14072017.115000019) TEATROGG SINTESI DEL PROGETTO - L?INTERVENTO PREVEDE IL... Nessuna associazione trovata 0.4 Nessuna associazione trovata
3 13TO10360.14072017.115000211_1141 Nessuna associazione trovata (10360.14072017.115000211) TEATRO DI BOCCHEGGIANO L'INTERVENTO PREVISTO CONSISTE NEL MIGLIORAMEN... Nessuna associazione trovata 0.4 Nessuna associazione trovata
4 13TO10360.14072017.115000235_1242 Nessuna associazione trovata (10360.14072017.115000235) RIQUALIFICAZIONE EN... INTERVENTO DI RIQUALIFICAZIONE ENERGETICA CONS... Nessuna associazione trovata 0.4 Nessuna associazione trovata

Saving the results ($\epsilon=.4$):

In [41]:
final_matrix.to_csv("Final_Matrix.csv")

Results

Associations Accuracy

The proposed algorithm has one main parameter, $\epsilon$, the similarity threshold for a "Project-Museum" pair. Its value is crucial: as usual in machine learning and statistics, the choice of $\epsilon$ determines the percentage of true/false positives and true/false negatives. Based on empirical experiments in which the team manually labeled the associations as correct/incorrect, $\epsilon$ was chosen equal to $0.4$. In particular, we analyzed a sample of 200 observations and obtained (using asymptotically normal confidence intervals):

  • $\epsilon = 0.3$: $40$% of the museums are matched, with $52 \pm 3$% of the matchings correct.
  • $\epsilon = 0.4$: $30$% of the museums are matched, with $63 \pm 4$% of the matchings correct.
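An asymptotically normal (Wald) confidence interval for a proportion can be computed with a short sketch. The counts below are illustrative (a 63% point estimate over a hypothetical 200 labeled cases); the document's margins depend on the sub-sample of matched projects actually labeled:

```python
import math

def proportion_ci(successes, n, z=1.96):
    """95% Wald (asymptotically normal) confidence interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, half_width

# Illustrative numbers, not the team's exact labeling counts.
p, hw = proportion_ci(126, 200)
print(f"{p:.0%} +/- {hw:.1%}")
```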

Results Distribution

  • Using $\epsilon = 0.4$, the team observed that almost all provinces, except those with zero or near-zero assigned projects, have roughly the same number of projects, confirming the policy explained by the OpenCoesione team.
  • Moreover, the distribution among the various kinds of projects, based on the found associations, is shown below:
In [70]:
"""
final_presi=final_matrix[final_matrix["ACTUAL_SCORE"]>.4]
tip = ["Arte (da medievale a tutto l'800)","Industriale e/o d'impresa","Altro (specificare)",
 "Area archeologica","Parco archeologico","Altro (specificare)","Chiesa, edificio o complesso monumentale a carattere religioso",
"Villa o palazzo di interesse storico o artistico", "Parco o giardino di interesse storico o artistico",
 "Architettura fortificata o militare",
"Architettura civile di interesse storico o artistico","Arte moderna e contemporanea (dal '900 ai giorni nostri)",
"Manufatto archeologico","Manufatto di archeologia industriale","Religione e culto","Archeologia","Storia", "Storia naturale e scienze naturali",
 "Scienza e tecnica","Etnografia e antropologia","Tematico e/o specializzato"]

my_tab = pd.crosstab(index=final_presi["ASSOCIATED_MUSEUM_TYPE"], 
                      columns="count")
my_tab.index=tip
fig = my_tab.plot.bar()
"""

Visualizations

The team developed a dynamic visualization of the map of Italy showing the number of museums in a given area; each museum's marker is colored differently based on the number of projects the museum obtained. The code used follows:

In [66]:
lock = final_matrix
istat = pd.read_csv(r"C:\Users\claba\Desktop\OpenCoesione\Datasets\istat_dati.csv", engine = "python", sep = ";")
# Run ONLY IF the cached pickle files are lost
"""
# googleMaps Api initialization. Put your personal key for google cloud platform
geolocator = googlemaps.Client(key='***')


# querying gmaps to return (latitude, longitude) given the denomination of the museum and the municipality
place = []
not_found = []
for i in istat.index:

    try:
        
        query = geolocator.geocode(str(istat.iloc[i]['DENOMINAZIONE'] + ' ' + istat.iloc[i]['COMUNE']))
        if query != None:
            print(istat.loc[i]['DENOMINAZIONE'], (query[0]['geometry']['location']['lat'], query[0]['geometry']['location']['lng']))

            place.append((query[0]['geometry']['location']['lat'], query[0]['geometry']['location']['lng']))
        else: 
            not_found.append(i)

    except:
        not_found.append(i)

        

#saving the locations of places found 
pickle_out = open("places_found","wb")
pickle.dump(place, pickle_out)
pickle_out.close()
        
# saving index of places NOT found
pickle_out = open("places_not_found","wb")
pickle.dump(not_found, pickle_out)
pickle_out.close()     
"""


#reading pickle with coordinates found with gmaps API
pickle_in = open("places_found","rb")
loc = pickle.load(pickle_in)


#reading indexes not found with gmaps APi
pickle_in = open("places_not_found","rb")
not_found = pickle.load(pickle_in)

# Define the function of markers colors
def colorfunction():
    projects_ = int(lock[lock['ASSOCIATED_MUSEUM'] == istat.loc[idx]['DENOMINAZIONE']].count()[0])
    if (projects_ == 1): 
        col='yellow'
    elif (projects_ <= 3) and (projects_ > 1): 
        col='orange'
    elif (projects_ <= 7) and (projects_ > 3): 
        col='red'
    elif projects_ > 7: 
        col='black'
    else:
        col = 'lightgray'
    return col



# define legenda of the map
legenda= cm.StepColormap(['yellow','orange','red','black'], index=[1, 3, 7, 10 ], vmin=0, vmax=10)


# museum_code contains only the code of the museums found by gmaps
museum_code = []
for idx in range(len(istat)): 
    if idx not in not_found:
        museum_code.append(str(istat.loc[idx]['OC_COD_MUSEO']))

# creating the map of the museums, clustering them by proximity
Map=folium.Map(
    location=(42.442699, 13.005525),
    zoom_start=5.5, 
    tiles="cartoDBpositron"
)

#adding marker only if found by gmaps
i = 0
mc = MarkerCluster()
for idx in range(len(istat)): 
    if idx not in not_found:
        m = museum_code[i]
        mc.add_child(folium.Marker(icon=folium.Icon(color=colorfunction()), popup = folium.Popup(m), location=loc[i])) #popup is the museum code
        i+=1

        
colormap = legenda
colormap.caption = 'Number of projects per museum'
Map.add_child(colormap)


Map.add_child(mc)
Map.save('map.html')