Data Mining in the Presence of Quantitatively and Qualitatively Diverse Information

IDM-0415190

Anne M. Denton <anne.denton@ndsu.edu>

North Dakota State University
Homepage: http://www.cs.ndsu.nodak.edu/~adenton/

Abstract

Real data often show a more complex structure than is assumed in much of statistics, machine learning, and data mining. Objects may be characterized by diverse types of information, such as numerical quantities, text, and properties of a network neighborhood. The goal of the project is to develop techniques to integrate information components that differ both quantitatively and qualitatively. Classification algorithms that are based on homogeneous attributes can be evaluated exclusively by their overall classification quality. In the presence of qualitatively and quantitatively diverse information, however, the space of all relevant combinations of techniques and parameters is too large to be evaluated with any reasonable amount of test data. Three goals are pursued:
  1. Define intermediate, homogeneous attributes that allow effective use of uniform classification and clustering techniques.
  2. Develop robust criteria that allow identification of suitable intermediate attributes and do not rely exclusively on overall classification accuracy.
  3. Develop efficient and effective approaches to generate intermediate attributes from data with network connectivity, time-dependent data, text, and other types of data.
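As a rough illustration of the first goal, the sketch below maps three qualitatively different kinds of information (a numerical quantity, text, and a network neighborhood) onto intermediate attributes that are all plain numbers, so that a single uniform classifier or clustering method could consume them. It is a minimal sketch under stated assumptions, not the project's method: the toy records, the keyword set, the partial labels, and all function names are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical toy records (not project data): each object has a numeric
# quantity, free text, and a list of neighbor ids in a network.
objects = {
    "a": {"value": 2.0, "text": "kinase binding protein", "neighbors": ["b", "c"]},
    "b": {"value": 4.0, "text": "membrane transport", "neighbors": ["a"]},
    "c": {"value": 6.0, "text": "protein kinase activity", "neighbors": ["a", "b"]},
}
known_labels = {"b": 1, "c": 0}   # assumed partial class labels
keywords = {"kinase", "protein"}  # assumed vocabulary of interest


def numeric_attribute(obj_id):
    """Standardize the numeric quantity (z-score over all objects)."""
    values = [o["value"] for o in objects.values()]
    mu, sigma = mean(values), pstdev(values)
    return (objects[obj_id]["value"] - mu) / sigma if sigma else 0.0


def text_attribute(obj_id):
    """Fraction of tokens in the text that belong to the keyword set."""
    tokens = objects[obj_id]["text"].lower().split()
    return sum(t in keywords for t in tokens) / len(tokens) if tokens else 0.0


def network_attribute(obj_id):
    """Fraction of labeled neighbors that carry the positive label."""
    labels = [known_labels[n] for n in objects[obj_id]["neighbors"]
              if n in known_labels]
    return sum(labels) / len(labels) if labels else 0.0


def intermediate_attributes(obj_id):
    """Combine the three sources into one homogeneous numeric vector."""
    return [numeric_attribute(obj_id),
            text_attribute(obj_id),
            network_attribute(obj_id)]


features = {obj_id: intermediate_attributes(obj_id) for obj_id in objects}
```

Each entry of `features` is a fixed-length list of floats, regardless of which source a component came from. Whether such attributes are suitable is exactly the question the second goal addresses; this sketch does not attempt to answer it.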
Starting with a specific classification problem in bioinformatics, the project attempts to find solutions that are applicable to a wide range of data mining problems. The work is well suited to teaching students a broad range of research activities, from fundamental concepts to applications, in both thesis and course work. Results will be relevant to a large number of practical applications in bioinformatics and other sciences.