INTERNATIONAL POSTGRADUATE COURSE ON GEOGRAPHIC INFORMATION TECHNOLOGIES
31 May –
Summer school teaching unit
“Geospatial data mining”
Fernando Lucas Bação
http://www.isegi.unl.pt/ensino/docentes/fbacao/index.html
Instituto Superior de Estatística e Gestão de Informação
Universidade Nova de Lisboa
Session 1: June 2: 90 min. + 90 min. (morning)
Session 2: June 3: 90 min. + 90 min. (afternoon)
Session 1:
The first session introduces the basic concepts of data mining and knowledge discovery, with emphasis on geospatial (or geographical) data mining and on what the geo prefix implies. Data mining is a very wide field, which can only be introduced superficially in a 180-minute session. The approach taken here is to give a brief overview of the data mining process, including the major steps involved in preparing the data. The second part of the session is dedicated to the Self-Organizing Map (SOM), an artificial neural network with applications in clustering and classification tasks. A particular variant of the SOM, the Geo-SOM, will also be presented. In summary, session 1 will cover:
1. Definition of data mining;
2. Geospatial data mining;
3. Typical tasks in data mining;
4. Different types of models;
5. Overview of the basic aspects of:
a. problem definition;
b. collecting data;
c. preparing the data;
d. pre-processing the data;
6. Tools:
a. the self-organizing map (SOM);
b. the Geo-SOM.
Session 2:
The practical session will be based on exploring the SOM and applying it to the development of a geodemographic typology. The software used will be SOM_PAK, which is freely available on the internet. SOM_PAK is relatively simple, and it should be fairly easy for students to produce classifications for the study area. The data refer to the enumeration districts (EDs) of the city of Lisbon. Students will be given the shapefile of the EDs together with 65 variables describing the socio-economic reality of the EDs, from which they will produce a typology for the city of Lisbon.
Students will use SOM_PAK to classify the EDs and export the results to a GIS package in order to map the study region. Before using SOM_PAK, students should complete some preprocessing tasks, such as normalizing the variables, building appropriate ratios, and performing some basic univariate analysis of the variables.
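The preprocessing tasks mentioned above (building ratios and normalizing) can be illustrated with a minimal Python sketch. It is not part of the course material: the function names, the z-score choice of normalization, the handling of empty districts, and the sample counts are all assumptions made for illustration. The text layout produced by `to_som_pak` (vector dimension on the first line, then one vector per line) follows the usual description of SOM_PAK data files and should be checked against the SOM_PAK documentation before use.

```python
import statistics


def build_ratio(numerators, denominators):
    """Turn raw counts into ratios, e.g. elderly residents / total residents.

    Assumption: a district with a zero denominator gets a ratio of 0.0.
    """
    return [n / d if d else 0.0 for n, d in zip(numerators, denominators)]


def z_score(values):
    """Standardize one variable to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant variable carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def to_som_pak(rows):
    """Format vectors as text: dimension on line 1, then one vector per line."""
    lines = [str(len(rows[0]))]
    lines += [" ".join(f"{v:.4f}" for v in row) for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical counts for four enumeration districts (not the course data).
elderly = [120, 45, 300, 80]
residents = [800, 900, 1000, 400]

share_elderly = build_ratio(elderly, residents)
normalized = z_score(share_elderly)

# One row per district, one column per (normalized) variable.
print(to_som_pak([[v] for v in normalized]))
```

Ratios remove the effect of district size, and standardizing puts all variables on a comparable scale, so that no single variable dominates the distance calculations used by the SOM.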
Finally, students will compare the typologies developed by the different groups, discussing the choices made during the process and the results achieved. The basic steps for the practical session will be:
1. Mapping and analysis of the study region;
2. Univariate analysis of some important variables;
3. Variable selection and preprocessing (normalization);
4. Development of a classification based on the SOM_PAK;
5. Analysis of the different U-Matrices produced;
6. Import of the data into a GIS;
7. Production of different mappings of the classification;
8. Discussion.
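To make steps 4 and 5 more concrete, the following is a minimal pure-Python sketch of SOM training and of a simplified U-matrix. It is not SOM_PAK's implementation: the function names, the linear decay schedules, the Gaussian neighborhood, the rectangular grid, and the toy data are all assumptions chosen for brevity. The U-matrix here is the simplified variant that stores, for each unit, the mean distance to its grid neighbors.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the (row, col) of the unit whose weights are closest to x."""
    best, best_d = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = math.dist(w, x)
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small rectangular SOM with online updates."""
    rng = random.Random(seed)
    dim = len(data[0])
    radius0 = max(rows, cols) / 2
    # Initialize each unit from a randomly chosen data vector.
    weights = [[list(rng.choice(data)) for _ in range(cols)]
               for _ in range(rows)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):
            frac = step / total_steps
            lr = lr0 * (1 - frac)                    # learning rate decays
            radius = max(radius0 * (1 - frac), 0.5)  # neighborhood shrinks
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2 * radius ** 2))
                    w = weights[r][c]
                    for k in range(dim):
                        w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights


def u_matrix(weights):
    """Mean distance from each unit to its 4-connected grid neighbors."""
    rows, cols = len(weights), len(weights[0])
    um = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            ds = [math.dist(weights[r][c], weights[nr][nc])
                  for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                  if 0 <= nr < rows and 0 <= nc < cols]
            um[r][c] = sum(ds) / len(ds) if ds else 0.0
    return um


# Hypothetical, already normalized data: two well-separated groups.
data = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
        [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
weights = train_som(data, rows=1, cols=4)
um = u_matrix(weights)
print(um)
```

High values in the U-matrix mark units whose neighbors are far away in variable space, which is how cluster boundaries show up when the matrix is inspected; low values indicate homogeneous regions of the map.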
The goal is that students completing this teaching unit should be able to answer the following seven questions:
- What are the characterizing features of Data Mining?
- Why is Data Mining interesting in the context of GIScience? Is there any “added value”?
- What are the implications of the geo prefix in Geographic Data Mining?
- What are the different stages in the Data Mining process, and what characterizes each?
- How does the Self-Organizing Map work, in general terms?
- How can U-matrices be used to define clusters and homogeneous areas?
- Are there ways of introducing spatial reasoning into the SOM?
Reading assignments:
Data Mining and Knowledge Discovery (General Theory and Context)
SOM Demos
SOM Software
SOM Documentation
Geospatial Data Mining