Only one person per topic (first come, first served; email your request to me).
Please check the schedule before emailing your topic choice to me, to make sure
it has not already been chosen by one of your colleagues.
Your paper should be of high quality in terms of style and correctness.
This paper project is YOUR project.  I am suggesting topics but if you
choose one of my suggested topics, that makes it your topic.  You should
choose one of my suggestions only if you understand it and think it has
potential as a paper topic for you.  The suggestions are meant to help
you find a suitable topic but are not intended to limit you to these topics.

Be sure to include in your paper:

INTRODUCTION AND CONTEXT.
Research the "area" of the topic: put it in the context of what has been done
by others, what is still left to do, and what you are contributing relative
to what has been written by others.
This usually forms the Introduction section of
your paper and should be about 300-2000 words in length or longer.

MAIN NEW CONTRIBUTION (KILLER IDEA).
Detail your contribution so that it can be followed by a reader who
is new to the area.  (This can be an expansion of the "what you are
contributing" part of your Introduction.)  It would be best to
have just one killer idea and do it well.
This section should be about 800-3000 words in length or longer.

PROOF.
Prove that your idea is correct and makes the contribution you claim it
does (i.e., it is a "killer" idea).  This differs with topic and area.
If the contribution or "killer idea" involves random variations
(stochasticity), a simulation may be the right way to do the proof.
If not, an analytic model (assumptions, formulas and analysis of results)
may do the job.

Actual experimentation may also be possible, though that involves
prototyping the system itself.  We can email-talk about this section on
an individual basis.
This section should be about 500-3000 words in length or longer.

CONCLUSION.
Summarize the most important points and contributions you have made.
Note that you will be "telling us what you're going to do" in the Intro;
"doing it" in the Idea and Proof Sections; and then
"telling us what you did" in the Conclusion Section.
Thus, you will say the same thing thrice, in different ways, for
different purposes and at different levels of depth.
This section should be about 200-1000 words in length or longer.

Thank you and good luck.

****************************************************************


A very hot topic area, which overlaps with web search querying and analysis,
software engineering reachability graph analysis and control flow graph
analysis, sales graph analysis, and bioinformatics interaction analysis,
is the need to analyze multiple interactions for common "strong interaction cells".

Website Interactions:
 
Two websites interact if one contains the other's URL. This is a
"directional interaction" and is modeled by a directed graph in which the nodes
of the graph are websites (URLs), and there is a directed edge running from a
URL to each of the URLs it contains on its website. One can simply analyze
"existence" of references (an unlabeled directed edge iff the source URL
contains one or more instances of the destination URL), or one can analyze
"strength" of reference (a labeled graph in which the label on any edge
records the number of times the destination URL occurs on the source URL page).

Two websites interact iff they are referenced on the same webpage
(an undirected graph, which can be labeled with "strength" = the number of
different pages referencing both, or just "existence" = an unlabeled edge iff
they are co-referenced).

Two websites interact iff a given user goes from one site immediately
to the other site during a web surfing session (this is "directional", so it is
modeled by a directed graph which can have labeled edges (count of user traces)
or just existence (at least one user trace)).
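
The first two website models can be sketched in a few lines. This is only a
minimal illustration, assuming each page has already been reduced to the list
of URLs it contains; the `pages` data and the function names `link_graph` and
`coreference_graph` are made up for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: page URL -> list of URLs found on that page.
pages = {
    "a.example": ["b.example", "c.example", "b.example"],
    "b.example": ["c.example"],
    "c.example": [],
    "d.example": ["b.example", "c.example"],
}

def link_graph(pages):
    """Directed graph: edge (src, dst) labeled with how often dst occurs on src."""
    edges = Counter()
    for src, links in pages.items():
        for dst in links:
            edges[(src, dst)] += 1
    return edges

def coreference_graph(pages):
    """Undirected graph: edge (u, v), u < v, labeled with the number of pages
    that reference both u and v."""
    edges = Counter()
    for links in pages.values():
        for u, v in combinations(sorted(set(links)), 2):
            edges[(u, v)] += 1
    return edges

strength = link_graph(pages)         # labeled ("strength") version
existence = set(strength)            # dropping the labels gives "existence"
co_ref = coreference_graph(pages)    # co-reference strength
```

Here `strength` keeps the edge labels, while `existence` is simply the set of
edges with the labels thrown away.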


Bioinformatics:

Two genes (or proteins) interact iff their expression profiles in a
microarray experiment are similar enough (this would be an unlabelled
undirected graph - it could include "strength" = the level of similarity as
an undirected edge label).

Two genes (or proteins) interact iff the proteins they code for
interact (in some particular way, e.g., they occur in the same pathway,
combine into the same complex, etc.).

Two genes (or proteins) interact iff they are co-referenced in the
same document in the PubMed literature (again, "existence" or "strength" are
possible). Actually, this third criterion does not hold most of the time.
Authors may refer to other genes in contexts that differ from that of
the genes they are working on, so a co-mention does not mean that there is an
interaction among the genes mentioned in the same document. This is where
the GENE ONTOLOGY (GO) comes into the picture. GO has several evidence codes
which signify how the functions of the genes/gene products were assigned.
For the GO evidence codes, see http://www.geneontology.org/GO.evidence.shtml#ic.
Before confirming an interaction among genes in the same document, it
would be good to cross-check with the GO evidence codes.
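
The first bioinformatics model (similar expression profiles) might be sketched
as follows. The text above does not say which similarity measure to use; the
Pearson correlation, the 0.9 threshold, and the `profiles` data are all
assumptions chosen only for illustration.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles.
    Assumes neither profile is constant (otherwise the denominator is 0)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical expression profiles: gene -> expression level per experiment.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
}

def interaction_graph(profiles, threshold=0.9):
    """Undirected graph: edge (u, v) labeled with the similarity, kept only
    when the similarity reaches the threshold."""
    edges = {}
    for u, v in combinations(sorted(profiles), 2):
        r = pearson(profiles[u], profiles[v])
        if r >= threshold:
            edges[(u, v)] = r
    return edges

edges = interaction_graph(profiles)
```

With this toy data only g1 and g2 are joined, since g3 runs in the opposite
direction to both of them.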

Software Engineering Interactions:

Two programs interact iff the same code segment occurs in both
(undirected, either labeled with a strength = e.g., the number of times that
segment occurs, or existence = just whether the segment co-exists at least once or not).

Two programs interact iff they call the same program.

Two programs interact iff they are called by the same program.

Two programs interact iff they contain roughly the same set of
constants (variables) with respect to some ontology or standards listing.

Two programs interact iff they have the same aspect designation.




Sales Analysis:

Two products interact iff they co-occur at checkout 80% of the time
(or with some other threshold support =  "% of market baskets").

Two products interact iff, when one occurs in a market basket, 80% of
the time the other will also (this is the "directed graph" version of the
first item above and has to do with the "confidence" of the association).

Two products interact iff the same salesman sells both.

Two products interact iff they are sold in the same region (at a
threshold level or, as a labeled graph, with each edge labeled with the number sold).

Two products interact iff they are sold at a threshold level during
the same season (e.g., in December).
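
The support and confidence thresholds in the first two sales items can be made
concrete with a small worked example. The `baskets` data and the function
names are invented for illustration; the 80% threshold is the one used above.

```python
# Hypothetical market baskets (one set of products per checkout).
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

def count(items, baskets):
    """Number of baskets that contain every product in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets)

def support(items, baskets):
    """Fraction of all baskets containing every product in `items`."""
    return count(items, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction that also
    contain `consequent` (assumes the antecedent occurs at least once)."""
    both = set(antecedent) | set(consequent)
    return count(both, baskets) / count(antecedent, baskets)

def directed_edge(a, b, baskets, threshold=0.8):
    """Directed 'interaction' a -> b under the confidence threshold."""
    return confidence({a}, {b}, baskets) >= threshold
```

In this data eggs -> milk is an edge (every basket with eggs also has milk)
but milk -> eggs is not, which is why the confidence version needs a directed
graph while the support version does not.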


Security Applications:

Two persons interact iff they are from the same neighborhood (or city
or state or country).

Two persons interact iff they are in the same occupation.

Two persons interact iff they have similar records (employment records,
criminal records, etc.).

Two persons interact iff they belong to the same organizations.

Two events interact iff they are attended by similar sets of attendees.

Two locations interact iff they are visited by similar sets of people.

Two locations interact iff they are associated with similar events.
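
"Similar sets" in the last three items needs a set-similarity measure. One
common choice, assumed here only for illustration (the text does not name
one), is the Jaccard coefficient; the `attendees` data and the 0.5 threshold
are made up.

```python
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical data: event -> set of attendees.
attendees = {
    "event1": {"ann", "bob", "cho", "dee"},
    "event2": {"ann", "bob", "cho"},
    "event3": {"eve", "fay"},
}

def similar_pairs(sets_by_name, threshold=0.5):
    """Undirected edges between names whose sets are similar enough."""
    return {
        (u, v): jaccard(sets_by_name[u], sets_by_name[v])
        for u, v in combinations(sorted(sets_by_name), 2)
        if jaccard(sets_by_name[u], sets_by_name[v]) >= threshold
    }

event_edges = similar_pairs(attendees)
```

The same function applies unchanged to locations and their visitors, or to
locations and their associated events.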


In all of these interactions, the graph model is central and one is looking
for strong clusters of nodes (nodes that are strongly associated with one
another via the edge set).
How do we find the strong clusters?
What do we mean precisely by a cluster?


Notes:

Using vertical technologies to search out common clusters or quasi-clusters
or "cliques" should be very valuable in bioinformatics as well as in web
analysis. For instance, there are thousands of interaction graphs of interest
over a given set of genes (proteins). Using vertical technology, it is
possible to construct an index attribute and an order attribute for each
interaction graph and to analyze them (using Dr. Daxin Jiang's methods or
other methods - e.g., OPTICS-like) directly.

You can find Dr. Daxin Jiang's work here:
Jiang_TKDE_paper

A shorter version (preliminary work): Jiang_paper

To get some understanding of ordering-based algorithms, you can get the OPTICS paper here: OPTICS paper


For each data set (either a micro-array dataset or an interaction graph data
set), construct two derived attributes: the step count attribute from the
ordering, and the index attribute.  For each pair of such added attributes, we
can quickly search for the pulses using vertical technology (it is just a
matter of looking for those genes where the index exceeds a threshold, and
moving that threshold down until the user feels he/she has found the
appropriate pulses).  These "pulse genes" have a step number in the ordering.
For each pulse gene, we can quickly extract the forward subinterval from that
pulse to the next.

Each such search will give us the "flat region", from which the strong cluster
associated with that pulse can be extracted.  So we will have a vertical
"mask" defining each strong cluster from each dataset. We can quickly "AND"
those to find common strong clusters using vertical technology.
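
The pulse search and the final "AND" can be sketched as below. This is not
Dr. Jiang's tool and not a P-tree implementation: plain Python integers stand
in for the vertical bit masks, genes are assumed to be numbered 0..n-1, and
the choice to include the pulse step itself in its interval is an assumption.
All names and data are hypothetical.

```python
def cluster_masks(ordering, index, threshold):
    """ordering[s] is the gene at step s; index[s] is its index attribute.
    Returns one bit mask per pulse, covering the steps from that pulse up to
    (but not including) the next pulse."""
    pulses = [s for s, value in enumerate(index) if value > threshold]
    masks = []
    for k, start in enumerate(pulses):
        end = pulses[k + 1] if k + 1 < len(pulses) else len(ordering)
        mask = 0
        for gene in ordering[start:end]:
            mask |= 1 << gene          # set the bit for this gene
        masks.append(mask)
    return masks

def common_clusters(masks_a, masks_b, min_size=2):
    """AND every pair of masks; keep intersections with enough genes."""
    common = []
    for a in masks_a:
        for b in masks_b:
            both = a & b
            if bin(both).count("1") >= min_size:
                common.append(both)
    return common

def members(mask):
    """Decode a mask back into a sorted list of gene numbers."""
    return [g for g in range(mask.bit_length()) if (mask >> g) & 1]

# Two hypothetical datasets over the same six genes.
masks_a = cluster_masks([0, 1, 2, 3, 4, 5], [9, 1, 1, 8, 1, 1], threshold=5)
masks_b = cluster_masks([1, 2, 5, 0, 3, 4], [8, 1, 1, 9, 2, 1], threshold=5)
shared = [members(m) for m in common_clusters(masks_a, masks_b)]
```

Lowering `threshold` produces more pulses and therefore finer masks, which
corresponds to the user moving the threshold down as described above.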

With this minor extension to Dr. Daxin Jiang's wonderful tool and a re-coding
of the tool for vertical data, one could analyze across multiple interaction
graphs (I think that is the main exciting application), across multiple
micro-arrays (maybe?), and across multiple web graphs. When the dataset is
very large, scalability becomes a very important issue.



A new method of classification or clustering based on Derived
Attributes that are "Walk-based".  The walks can be based on Z-ordering,
Hilbert ordering or another ordering (or random walks?).
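
As one possible reading of a "walk-based" derived attribute, the sketch below
computes a Z-order (Morton) value for two-dimensional integer points by
interleaving coordinate bits, then uses the position in the sorted walk as a
step count attribute. The function name, the 8-bit default, and the sample
points are assumptions for illustration.

```python
def z_value(x, y, bits=8):
    """Interleave the low `bits` bits of x and y (x in the even positions,
    y in the odd positions). Coordinates must be in range(2 ** bits)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# Hypothetical points to be classified or clustered.
points = [(1, 1), (0, 0), (3, 5), (0, 1), (1, 0)]

# The Z-ordering walk visits the points in increasing z_value.
walk = sorted(points, key=lambda p: z_value(*p))

# Derived attribute: the step at which each point is reached on the walk.
step_count = {p: step for step, p in enumerate(walk)}
```

A Hilbert ordering or a random walk would replace only `z_value`; the step
count attribute is derived from the walk in the same way.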

A new method of classification or clustering based on some
statistic derived from the covariance matrix of a training space
or space to be clustered.

A new method of classification or clustering based on some
combination of derived attributes that are variation-based and walk-based.

An automatic alerter system to be used by Software Engineers
which will automatically alert the development team when some type
of "bad situation" or "dangerous practice" is detected (e.g., within
a system such as CVS for storing "development versions", when a version
is "checked in", the alerter analyzer would immediately analyze it (classify
based on the database of past development projects done with the CVS system?)
for the "exceptional situation").

Decision Tree Induction Classification Implementation and Performance Analysis for Numeric Data.
Implement a new method of Decision Tree Induction classification data
mining.  Prove that your method performs well compared to ID3, C4.5, C5
or other known methods for at least one type of data.

Decision Tree Induction Classification Implementation and Performance Analysis for Categorical Data.
Implement a new C4.5-like or C5-like decision tree induction classification data mining
method and prove it compares well to C4.5, C5, or other known methods for categorical data.

Bayesian Classification Implementation and Performance Analysis.
Implement a new method of Bayesian classification data mining and prove
it compares well to known methods.

Neural Network Classification Implementation and Performance Analysis.
Implement a new method of Neural Network classification data mining and
prove it compares well to known methods (how well does it scale to
large datasets?)

K-Nearest Neighbor Classification Implementation and Performance Analysis or
K-Most Similar Classification Implementation and Performance Analysis.
Implement a new method and prove it compares well to known methods.

Density-Based Classification Implementation and Performance Analysis.
Implement a new method of Density-Based classification and prove it
compares well to known methods.

Genetic-Algorithm-Based Data Mining Implementation and Performance Analysis.
Implement a new method of Genetic-Algorithm-Based classification and prove it
compares well to known methods.

Simulated Annealing-Based Classification Implementation and Performance Analysis.
Implement a new method of Simulated-Annealing-Based classification and prove it
compares well to known methods.

Tabu-Search-Based Classification Implementation and Performance Analysis.
Implement a new method of Tabu-Search-Based classification and prove it
compares well to known methods.

Rough-Set-Based Classification Implementation and Performance Analysis.
Implement a new method of Rough-Set-Based classification and prove it
compares well to known methods.

Fuzzy-Set-Based Classification Implementation and Performance Analysis.
Implement a new method of Fuzzy-Set-Based classification and prove it
compares well to known methods.

Markov-Modeling-Based Classification and Performance Analysis.
(Hidden Markov Model based, ...)
Implement a new method of Markov-Chain-Based classification and prove it
compares well to known methods.  A reference is "Evaluation of Techniques
for Classifying Biological Sequences", Deshpande and Karypis, PAKDD Conference
2002, Springer-Verlag Lecture Notes in Artificial Intelligence 2336, p. 417.

Multiple-Regression Data Mining Implementation and Performance Analysis.
Implement a new method of multiple-regression-based data mining and prove it
compares well to known methods.

Non-linear-Regression Data Mining Implementation and Performance Analysis.
Implement a new method of non-linear-regression-based data mining and prove it
compares well to known methods.

Poisson-Regression Data Mining Implementation and Performance Analysis.
Implement a new method of Poisson-regression-based data mining and prove it
compares well to known methods.

Association Rule Mining Implementation and Performance Analysis.
Implement a new method of Association Rule Mining and
prove it compares well to known methods (e.g., Frequent Pattern Tree methods).

Multilevel Association Rule Mining Implementation and Performance Analysis.
Implement a new method of Multilevel Association Rule Mining and
prove it compares well to known methods.

Counts-count Association Rule Mining Implementation and Performance Analysis.
Implement a new method of Counts-count Association Rule Mining and
prove it compares well to known methods.  Counts-count ARM means that
the method takes account of the number of each item in a market basket,
not just whether or not the item is bought (1 or more).

K-Means Clustering.
Implement a new method of K-Means Clustering and prove it
compares well to known methods.

K-Medoids Clustering.
Implement a new method of K-Medoids Clustering and prove it
compares well to known methods.

K-Nearest Neighbor Clustering.
Implement a new method of K-Nearest Neighbor Clustering and prove it
compares well to known methods.  A reference is "Clustering Using a Similarity
Measure Based on Shared Near Neighbors", Jarvis and Patrick, IEEE Transactions
on Computers, Vol. c-22, No. 11, November 1973.

Agglomerative Hierarchical Clustering.
Implement a new method of Agglomerative Hierarchical Clustering and prove it
compares well to known methods such as AGNES.

Divisive Hierarchical Clustering.
Implement a new method of Divisive Hierarchical Clustering and prove it
compares well to known methods such as DIANA.

Hierarchical clustering similar to BIRCH.
Implement a new method similar to BIRCH clustering
and prove it compares well to known methods such as BIRCH itself.

Clustering similar to CURE.
Implement a new method similar to CURE clustering
and prove it compares well to known methods such as CURE itself.

Clustering similar to OPTICS.
Implement a new method similar to OPTICS clustering
and prove it compares well to known methods such as OPTICS itself.

Clustering similar to DBSCAN.
Implement a new method similar to DBSCAN clustering
and prove it compares well to known methods such as DBSCAN itself.

Grid-based clustering similar to STING.
Implement a new method similar to STING clustering
and prove it compares well to known methods such as STING itself.

Grid-based clustering similar to CLIQUE.
Implement a new method similar to CLIQUE clustering
and prove it compares well to known methods such as CLIQUE itself.

CLARANS partitioning clustering.
Implement a new clustering method similar to CLARANS
and prove it compares well to known methods such as CLARANS itself.

Hierarchical clustering similar to ROCK.
Implement a new method similar to ROCK clustering
and prove it compares well to known methods such as ROCK itself.

Hierarchical clustering similar to CHAMELEON.
Implement a new method similar to CHAMELEON clustering
and prove it compares well to known methods such as CHAMELEON itself.

Density-based clustering similar to DENCLUE.
Implement a new method similar to DENCLUE clustering
and prove it compares well to known methods such as DENCLUE itself.

Statistics-based clustering similar to COBWEB.
Implement a new method similar to COBWEB clustering
and prove it compares well to known methods such as COBWEB itself.

Statistics-based clustering similar to CLASSIT.
Implement a new method similar to CLASSIT clustering
and prove it compares well to known methods such as CLASSIT itself.

Statistics-based clustering similar to AutoClass.
Implement a new method similar to AutoClass clustering
and prove it compares well to known methods such as AutoClass itself.

"K-Clustering using P-trees"
  1. Code statistics such as sum, mean, and variance using the DMI.
  2. Focus on our k-clustering algorithm.
  3. Compare performance with K-means, mean-splitting, and variance-based algorithms.
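
For step 1, the arithmetic behind computing sum, mean, and variance from
vertical bit slices can be sketched as follows. This does not use the DMI
(whose calls are not shown here); Python integers stand in for the bit
vectors, and all function names and data are hypothetical. The idea is that
the sum over a masked set of rows is the sum over bit positions i of
2^i times the number of masked rows having bit i set.

```python
def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")

def bit_slices(values, bits):
    """slices[i] has bit r set iff row r's value has bit i set."""
    slices = [0] * bits
    for row, v in enumerate(values):
        for i in range(bits):
            if (v >> i) & 1:
                slices[i] |= 1 << row
    return slices

def masked_sum(slices, mask):
    """Sum of the values in the rows selected by `mask`."""
    return sum((1 << i) * popcount(s & mask) for i, s in enumerate(slices))

def masked_sum_sq(slices, mask):
    """Sum of squared values, from pairwise ANDs of the bit slices."""
    return sum((1 << (i + j)) * popcount(si & sj & mask)
               for i, si in enumerate(slices)
               for j, sj in enumerate(slices))

def mean_and_variance(slices, mask):
    """Mean and population variance over the masked rows (mask must be non-zero)."""
    n = popcount(mask)
    mean = masked_sum(slices, mask) / n
    variance = masked_sum_sq(slices, mask) / n - mean ** 2
    return mean, variance

# Hypothetical 3-bit attribute over four rows.
slices = bit_slices([3, 5, 6, 2], bits=3)
all_rows = 0b1111
first_two = 0b0011
```

Only ANDs and bit counts are needed, so no row is ever scanned horizontally;
a cluster is represented by its mask, which is the form a k-clustering
algorithm on vertical data would maintain.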

Support Vector Machine-like Classification method.