Data mining for Tunes doc? (was: Preliminary Review database)

Massimo Dentico m.dentico@galactica.it
Mon, 12 Jun 2000 18:57:55 +0200


David Manifold <dem@tunes.org> wrote:
> 
> Right now I have a minimally functional Review database.  You can browse
> it, and I can create user accounts which can only add new data.  There is
> hardly any data in it yet, but if you want to contribute seriously, send
> me an email and I can set you up an account to edit it through the web.
> [...]
> I am working closely with water and coreyr to coordinate what is happening
> in that department. 


Brian Rice <water@tscnet.com> wrote:
> [...]
> Tunes members are more than welcome to contact Corey, the DB administrator 
> for the moment, and ask for access to manipulate nodes. I'm not leaving his 
> email address here. I suggest you use the #tunes IRC channel to contact him 
> and discuss it there.
> [...]

Dear David and Brian,

I very much appreciate your (and others') efforts to reorganize and
improve the Tunes project documentation. Feel free to create an
account for me and send me the details by e-mail, possibly along with
guidelines for the new document structure (sorry Brian, I'm afraid
chatting in English is not practical for me: I have no real-time
performance ... well, even my batch performance is not much better :-).
I think that, in this phase, I can easily help with the migration of
the previous documentation.

Just one small perplexity: with this whole restructuring we break
every external link. Is it absolutely impossible to keep some
compatibility with the old static structure? For example, by turning
the old pages into indexes to the new information, with a link at
every anchor instead of the information itself?
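To make the idea concrete, a minimal sketch (the old-anchor to new-node mapping and the URLs are purely hypothetical; the real pairs would come out of the migration itself):

```python
# Hypothetical mapping from anchors of an old static page to the
# database nodes that now hold the same information.
OLD_TO_NEW = {
    "Review.html#Self":   "http://tunes.org/cgi-bin/db?node=Self",
    "Review.html#Oberon": "http://tunes.org/cgi-bin/db?node=Oberon",
}

def index_page(old_page, mapping):
    """Build an HTML index that lists, for each anchor of the old
    page, a link to the node that now contains that information."""
    entries = [
        '<li><a href="%s">%s</a></li>' % (new, old.split("#", 1)[1])
        for old, new in sorted(mapping.items())
        if old.startswith(old_page + "#")
    ]
    return ("<html><body><p>This page has moved; its sections are now "
            "database nodes:</p><ul>%s</ul></body></html>"
            % "".join(entries))
```

The generated page would live at the old URL, so external links keep working and readers land one click away from the new content.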

I would also like to propose an idea for your attention: the big
problem with this unstructured textual information is creating and
maintaining useful classifications and cross-links. With traditional
DB techniques this requires massive human intervention, which is
boring and time-consuming.

I think it is possible to use statistical and other machine learning
methods to overcome (at least partially) these problems. The field
seems quite well developed, with commercial applications already
available. In fact, there is at least one start-up that has grown
greatly in recent years with these methods: Autonomy.

However, I don't know how difficult these techniques are to implement,
nor whether it is practical to explore the subject at this moment: it
is only a suggestion.
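As an illustration of how little machinery the basic idea needs, here is a minimal sketch of the Naive Bayes text classification referenced below (pure Python; the labels and sample texts are only hypothetical examples, not Tunes data):

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (text, label) pairs.
    Returns class priors, per-class word counts, and the vocabulary."""
    label_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        label_docs[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    total = sum(label_docs.values())
    priors = {label: n / total for label, n in label_docs.items()}
    return priors, word_counts, vocab

def classify(text, priors, word_counts, vocab):
    """Pick the label maximizing log P(label) + sum of log P(word|label),
    with Laplace (add-one) smoothing for unseen words."""
    words = text.lower().split()
    best, best_score = None, float("-inf")
    for label, prior in priors.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(prior)
        for w in words:
            score += math.log((counts[w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best
```

A classifier like this, trained on a handful of hand-labelled nodes, could suggest a category for each new document, leaving to humans only the corrections.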

Some references:

  CMU World Wide Knowledge Base (Web->KB) project
  - http://www.cs.cmu.edu/~webkb/

  Bow: A Toolkit for Statistical Language Modeling, Text Retrieval,
  Classification and Clustering
  - http://www.cs.cmu.edu/~mccallum/bow/

  Naive Bayes algorithm for learning to classify text
  - http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/naive-bayes.html

  Central Inductive Agency
  - http://www.csse.monash.edu.au/~lloyd/tildeMML/Intro/index.html

  Data Mining from spam
  - http://www.hpl.hp.com/personal/Tom_Fawcett/mining-spam/index.html

  Data Mining and CRM (Kurt Thearling)
  - http://www3.shore.net/~kht/index.htm

About Autonomy:

  Wired 8.02: The Quest for Meaning
  - http://www.wired.com/wired/archive/8.02/autonomy.html

  Michael Lynch CEO Autonomy
  - http://industrystandard.net/people/display/0,1157,1889,00.html

  Autonomy - Knowledge Management and New Media Content Solutions
  - http://www.autonomy.com/

One of their products is free for personal use, but unfortunately
only on Windows:

  Autonomy Kenjin
  - http://www.kenjin.com/

Best regards.

-- 
Massimo Dentico