I have these thoughts regarding the exciting area of PkNNC applied to text:
Given a corpus of documents,
CORPUS( DOC_ID, TF1, IDF1, TF2, IDF2, . . ., TFn, IDFn, CLASS)
where the vocabulary is
VOCABULARY(T1, . . ., Tn)
(of terms, stems, or whatever is chosen as the vocabulary unit)
and the VOCABULARY is over an alphabet,
ALPHABET(c1 , . . . , cm)
of characters: a-z;
or, in the case of genomics, {a,t,g,c};
or, in the case of proteomics, {a1,...,a20} (the 20 amino acids).
This covers any text corpus over a vocabulary drawn from an alphabet.
Multiplying TF*IDF amounts to a weighting of (1,1), i.e. TF and IDF are weighted equally.
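To make the (1,1) remark concrete, here is a minimal sketch. It assumes the two weights act as exponents (a, b) on TF and IDF, so that (1,1) gives back the plain product. That reading is my assumption, not something fixed above, and the function name and sample values are hypothetical.

```python
def weighted_tfidf(tf, idf, a=1.0, b=1.0):
    """Return tf**a * idf**b (assumed form of the weighting).

    With (a, b) = (1, 1) this is the ordinary TF*IDF product.
    Assumes tf and idf are non-negative and a, b >= 0.
    """
    return (tf ** a) * (idf ** b)


# Hypothetical TF and IDF values for one term in one document.
tf, idf = 4.0, 2.0
plain = weighted_tfidf(tf, idf)              # equal weighting (1, 1)
damped = weighted_tfidf(tf, idf, a=0.5)      # TF counts for less than IDF
print(plain, damped)
```

Other families of weightings (for example a log-damped TF) would fit the same function shape; the exponent form is just the simplest one that has (1,1) as its neutral point.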
Explore the other possible weightings based on??
using genetic algorithms,
tabu search,
simple hill-climbing,
...
So for a CORPUS of already classified documents,
set aside part as test data and
use the rest as training data.
Use GAs or another method to optimize the weights.
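A rough sketch of that loop, using simple hill-climbing in place of a GA to keep it short. Everything here is an assumption for illustration: the weights are taken as exponents (a, b) on TF and IDF, the classifier is a plain 1-nearest-neighbour scan (not PkNNC), fitness is leave-one-out accuracy on the training part, and the tiny CORPUS is made up.

```python
import random


def vec(tfs, idfs, a, b):
    # Weighted document vector: one tf**a * idf**b entry per vocabulary term.
    return [(t ** a) * (i ** b) for t, i in zip(tfs, idfs)]


def dist2(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def predict(query, refs, a, b):
    """Class of the nearest reference document under weights (a, b)."""
    q = vec(query[0], query[1], a, b)
    nearest = min(refs, key=lambda d: dist2(q, vec(d[0], d[1], a, b)))
    return nearest[2]


def loo_accuracy(train, a, b):
    # Fitness: leave-one-out accuracy; note this rescans the training set.
    hits = 0
    for i, d in enumerate(train):
        hits += predict(d, train[:i] + train[i + 1:], a, b) == d[2]
    return hits / len(train)


def split(corpus, test_fraction=0.25, seed=0):
    # Set aside part of the classified corpus as test data.
    docs = corpus[:]
    random.Random(seed).shuffle(docs)
    n_test = max(1, int(len(docs) * test_fraction))
    return docs[n_test:], docs[:n_test]


def hill_climb(train, steps=50, step=0.25, seed=0):
    # Start from equal weighting (1, 1); keep a random neighbour only if it
    # strictly improves the fitness. Weights are clamped at 0.
    rng = random.Random(seed)
    a, b = 1.0, 1.0
    best = loo_accuracy(train, a, b)
    for _ in range(steps):
        na = max(0.0, a + rng.choice([-step, 0.0, step]))
        nb = max(0.0, b + rng.choice([-step, 0.0, step]))
        score = loo_accuracy(train, na, nb)
        if score > best:
            a, b, best = na, nb, score
    return a, b, best


# Made-up corpus: (TF list, IDF list, CLASS) per document, two terms.
IDF = [1.0, 2.0]
CORPUS = [([3, 0], IDF, "x"), ([2, 1], IDF, "x"), ([4, 0], IDF, "x"),
          ([3, 1], IDF, "x"), ([0, 3], IDF, "y"), ([1, 2], IDF, "y"),
          ([0, 4], IDF, "y"), ([1, 3], IDF, "y")]

train, test = split(CORPUS)
a, b, score = hill_climb(train)
test_acc = sum(predict(d, train, a, b) == d[2] for d in test) / len(test)
print(a, b, score, test_acc)
```

The weights are tuned only on the training part; the test part is touched once at the end. A GA or tabu search would replace hill_climb but could reuse the same fitness function.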
P-trees may be the way to go here since:
1. Using scanning methods means that the CORPUS must be
rescanned at each step of the GA (prohibitive).
2. Using P-trees, one can apply ISoC stopping techniques,
other???? and do it quickly.
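To illustrate why a vertical layout avoids the rescan, here is a toy sketch. It is not a P-tree (there is no tree structure or compression, and nothing about ISoC); it only shows the underlying idea as I understand it: store each bit position of a column as one bit vector over all rows, then get counts by ANDing bit vectors instead of visiting rows. Function names and data are hypothetical.

```python
def bit_slices(values, nbits):
    """Build vertical bit slices of an integer column.

    slices[j] is a Python int used as a bitmask: bit r is set
    when bit j of values[r] is 1.
    """
    slices = [0] * nbits
    for row, v in enumerate(values):
        for j in range(nbits):
            if (v >> j) & 1:
                slices[j] |= 1 << row
    return slices


def count_equal(slices, nrows, target):
    """Count rows whose value equals target, using only AND / NOT on slices.

    Assumes target fits in len(slices) bits.
    """
    all_rows = (1 << nrows) - 1
    mask = all_rows
    for j, s in enumerate(slices):
        # Keep rows whose bit j agrees with bit j of the target.
        mask &= s if (target >> j) & 1 else (all_rows & ~s)
    return bin(mask).count("1")


# Made-up column of small term frequencies for one term over six documents.
tf_column = [3, 1, 3, 0, 2, 3]
slices = bit_slices(tf_column, nbits=2)
print(count_equal(slices, len(tf_column), 3))
```

The slices are built once; each later count is a handful of bitwise operations, which is the property that would matter when a GA asks for many counts in a row.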