INTERNATIONAL POSTGRADUATE COURSE ON

GEOGRAPHIC INFORMATION TECHNOLOGIES

 

31 May – 11 June 2004

 

Universitat Jaume I

Castellón, Spain

 

Summer school teaching unit

“Geospatial data mining”

Fernando Lucas Bação

bacao@isegi.unl.pt

http://www.isegi.unl.pt/ensino/docentes/fbacao/index.html

New Technologies Laboratory

Instituto Superior de Estatística e Gestão de Informação

Universidade Nova de Lisboa

 

 

Hours

Session 1: June 2: 90 min. + 90 min. (morning)

Session 2: June 3: 90 min. + 90 min. (afternoon)

 

 

Content

 

Session 1:

 

The first session introduces the basic concepts of data mining and knowledge discovery, with emphasis on geospatial (or geographical) data mining and on what the geo prefix implies. Data mining is a very wide field that can only be superficially introduced in a 180-minute session, so the first part gives a brief overview of the data mining process, including the major steps involved in preparing the data. The second part is dedicated to the Self-Organizing Map (SOM), an artificial neural network with applications in clustering and classification tasks, and to one particular variant of it, the Geo-SOM. In summary, session 1 will cover:

 

1.      Definition of data mining;

 

2.      Geospatial data mining;

 

3.      Typical tasks in data mining;

 

4.      Different types of models;

 

5.      Overview of the basic aspects on:

a.       problem definition;

b.      collecting data;

c.       preparing the data;

d.      pre-processing the data;

 

6.      Tools:

a.       the self-organizing map (SOM)

b.      the Geo-SOM
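
As a rough illustration of item 6a, the following is a minimal sketch of online SOM training in plain Python. It is not the SOM_PAK implementation: the function names (`train_som`, `best_matching_unit`), the 4 x 4 grid, the linear decay of the learning rate and neighbourhood radius, and the toy data are all assumptions chosen for brevity.

```python
import math
import random


def best_matching_unit(codebook, x):
    """Return the (row, col) of the unit whose weight vector is closest to x."""
    best, best_d2 = None, float("inf")
    for r, row in enumerate(codebook):
        for c, w in enumerate(row):
            d2 = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d2 < best_d2:
                best, best_d2 = (r, c), d2
    return best


def train_som(data, rows=4, cols=4, epochs=30, lr0=0.5, sigma0=2.0, seed=0):
    """Train a rectangular SOM with online updates and a Gaussian neighbourhood."""
    rng = random.Random(seed)
    # Initialise every unit with a randomly chosen input pattern.
    codebook = [[list(rng.choice(data)) for _ in range(cols)] for _ in range(rows)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # learning rate decays to 0
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # radius shrinks, floor at 0.5
            br, bc = best_matching_unit(codebook, x)
            for r in range(rows):
                for c in range(cols):
                    # Neighbourhood is measured on the output grid, not in data space.
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                    w = codebook[r][c]
                    for i in range(len(w)):
                        w[i] += lr * h * (x[i] - w[i])
            step += 1
    return codebook


# Toy data (hypothetical): two well-separated groups of 2-D patterns.
toy = [(0.1, 0.0), (0.0, 0.2), (0.2, 0.1), (9.9, 10.0), (10.0, 9.8), (10.1, 10.2)]
som = train_som(toy)
print(best_matching_unit(som, (0.0, 0.0)), best_matching_unit(som, (10.0, 10.0)))
```

Patterns from the two groups should end up mapped to different units of the grid, which is the property the practical session exploits to build a typology.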

 

 

Session 2:

 

The practical session is based on exploring the SOM and applying it to the development of a geodemographic typology. The software used will be SOM_PAK, which is freely available on the internet. SOM_PAK is relatively simple, and it should be fairly easy for students to produce classifications for the study area. The data refer to the enumeration districts (EDs) of the city of Lisbon. Students will be given the shapefile of the EDs together with 65 variables describing their socio-economic characteristics, from which they will produce a typology for the city of Lisbon.

 

Students will use SOM_PAK to classify the EDs and export the results to a GIS package in order to map the study region. Before using SOM_PAK, students should complete some preprocessing tasks, such as normalizing the variables, building appropriate ratios and performing some basic univariate analysis of the variables.
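
The preprocessing mentioned above could look roughly like the sketch below, which builds ratios from raw counts, standardises them, and formats the result as text for SOM_PAK. The variable names and counts are hypothetical (they are not the Lisbon data), z-score standardisation is only one possible normalization, and the file layout (vector dimension on the first line, then one vector per line followed by a label) is our understanding of the SOM_PAK data format, which should be checked against the SOM_PAK manual.

```python
import statistics


def ratio(numerators, denominators):
    """Turn raw counts into ratios (e.g. unemployed / population); 0.0 where the base is 0."""
    return [n / d if d else 0.0 for n, d in zip(numerators, denominators)]


def zscore(values):
    """Standardise a variable to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]  # constant variable: nothing to scale
    return [(v - mean) / sd for v in values]


def to_som_pak(columns, labels):
    """Format variables as text: first line is the vector dimension,
    then one vector per line followed by its label (assumed SOM_PAK layout)."""
    lines = [str(len(columns))]
    for i, label in enumerate(labels):
        values = " ".join(f"{col[i]:.4f}" for col in columns)
        lines.append(f"{values} {label}")
    return "\n".join(lines) + "\n"


# Hypothetical counts for three enumeration districts (not the course data).
ed_ids = ["ED001", "ED002", "ED003"]
population = [1200, 800, 1500]
unemployed = [60, 80, 45]
over_65 = [240, 80, 450]

unemployment_rate = ratio(unemployed, population)
elderly_rate = ratio(over_65, population)
som_input = to_som_pak([zscore(unemployment_rate), zscore(elderly_rate)], ed_ids)
print(som_input)
```

Working with ratios rather than raw counts keeps large EDs from dominating the classification, and standardising gives every variable a comparable weight in the distance computations of the SOM.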

 

Finally, students will compare the typologies developed by the different groups, discussing the options taken during the process and the results achieved. The basic steps for the practical session will be:

 

1.      Mapping and analysis of the study region;

 

2.      Univariate analysis of some important variables;

 

3.      Choice of variables and preprocessing (normalization);

 

4.      Development of a classification based on the SOM_PAK;

 

5.      Analysis of the different U-Matrices produced;

 

6.      Import the data into a GIS;

 

7.      Produce different mappings of the classification;

 

8.      Discussion;
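
For step 5, the sketch below shows one simplified way of deriving a U-matrix from a trained codebook: each unit receives the average Euclidean distance between its weight vector and those of its 4-connected grid neighbours. This is an assumption-laden illustration rather than the computation performed by any particular package. The function name `u_matrix`, the rectangular grid and the toy codebook are hypothetical, and fuller U-matrix variants also store the distances between pairs of units.

```python
import math


def u_matrix(codebook):
    """Average distance from each unit to its 4-connected neighbours on the grid."""
    rows, cols = len(codebook), len(codebook[0])
    um = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(codebook[r][c], codebook[nr][nc]))
            um[r][c] = sum(dists) / len(dists) if dists else 0.0
    return um


# Toy 1 x 3 codebook (hypothetical): two identical units and one distant unit.
codebook = [[[0.0, 0.0], [0.0, 0.0], [10.0, 0.0]]]
print(u_matrix(codebook))  # → [[0.0, 5.0, 10.0]]
```

Low values indicate units that are close to their neighbours (the interior of a cluster), while ridges of high values suggest boundaries between clusters.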

 

 

Goals

 

On completing this teaching unit, students should be able to answer the following seven questions:

 

•     What are the characterizing features of Data Mining?

•     Why is Data Mining interesting in the context of GIScience? Is there any “added value”?

•     What are the implications of the geo prefix in Geographic Data Mining?

•     What are the different stages in the Data Mining process?

•     How does the Self-Organizing Map work, in general terms?

•     How can U-matrices be used to define clusters and homogeneous areas?

•     Are there ways of introducing spatial reasoning into the SOM?

 

 

Students’ preparation in advance

 

Reading assignments:

 

 

 

Software downloads or URLs

 

 

Data Mining and Knowledge Discovery (General Theory and Context)

 

  1. Data Mining: Statistics and More? – from David J. Hand, who also wrote this great book

  2. Statistics and Data Mining: Intersecting Disciplines – another one from David J. Hand, important for understanding the context of Data Mining

  3. "Data Mining and Statistics: What's the Connection?" – from Jerome H. Friedman

  4. Knowledge Discovery and Data Mining: towards a unifying framework – from Fayyad (he also wrote this one), Piatetsky-Shapiro and Smyth

  5. Data Mining in Soft Computing Framework: A Survey – an article on the applications of soft computing tools in data mining, from Mitra, Pal and Mitra

  6. Kurt Thearling – lots of papers from Kurt, with special emphasis on the business side. These texts can be very useful if you are looking for an introduction to Data Mining.

  7. KDNuggets – THE BEST SITE about Knowledge Discovery. Here you can find almost anything about KD, from jobs to software.

  8. Dorian Pyle – the well-known author of this book has a website with lots of useful information.

 

 

SOM Demos

 

  1. SOFM – my favourite demo of the workings of a SOM. Whenever I need to explain the SOM, this always seems the easiest way to do it.

  2. Interactive Self-Organizing Map demonstrations – two applets from the Helsinki University of Technology (HUT) people in Finland.

  3. Our own demos, developed by Roberto Henriques, one of our whiz programmers. For now this is only a movie; in the future we will develop an applet for the internet. The software was developed for geographic applications, but it also seems to be a great tool for understanding the basics of the SOM (under development, ready on May 15).

 

 

SOM Software

 

  1. SOM_PAK – the software from Kohonen and associates, which we will be using in the practical assignments of the course.

  2. SOM_PAK Manual – instructions for using SOM_PAK.

 

 

SOM Documentation

 

  1. Data exploration using self-organizing maps – Samuel Kaski’s PhD thesis; here you can find more of his papers.

  2. Using SOM in Data Mining – thesis from Juha Vesanto

  3. Clustering of the Self-Organizing Map – from Vesanto and Alhoniemi

  4. Geo-Self-Organizing Map (Geo-SOM) for building and exploring homogeneous regions – paper submitted to GIScience 2004

 

 

Geospatial Data Mining

 

  1. Geographical data mining: key design issues – a very interesting paper from Stan Openshaw giving his personal view on the desirable characteristics of geographical data mining tools.

  2. Computing and the science of Geography: the postmodern turn and the geocomputational twist – a philosophical perspective on the geocomputational movement within geography, by Bill Macmillan

  3. Is inductive machine learning just another wild goose (or might it lay the golden egg)? – the perspective from Mark Gahegan

  4. Geospatial Data Mining and Knowledge Discovery – a joint perspective from some of the leaders in the field

  5. Geocomputation Techniques For Spatial Analysis: Is It The Case For Health Data Sets? – from Gilberto Câmara and António Monteiro, INPE Brazil

  6. Geographic Data Mining and Knowledge Discovery (chapter of “Handbook of Geographic Information Science”, Blackwell, in press) – from Harvey Miller, editor of Geographic Data Mining & Knowledge Discovery.

  7. GeoComputation Conference – probably the most important conference on computationally intensive methodologies for quantitative geography. Lots of papers and abstracts are available online, going back to the first conference, held in Leeds in 1996.