J. A. Goldman, W. W. Chu, D. S. Parker, R. M. Goldman

Term Domain Distribution Analysis: a Data Mining Tool for Text Databases

Keywords: Data Mining, Text Analysis, Text Mining, Lung Cancer

In this paper, we give a case history illustrating the real-world application of a useful technique for data mining of text databases. The technique, which we call Term Domain Distribution Analysis (TDDA), consists of keeping track of term frequencies for specific finite domains and reporting significant deviations from the standard frequency distributions over these domains as hypotheses. TDDA is part of a larger framework, the Digital Filter Model, for data mining of text documents. In the case study presented, the domain of terms was the pair {right, left}, over which we expected a uniform distribution. In analyzing term frequencies in a thoracic lung cancer database, the TDDA technique led to the surprising discovery that primary thoracic lung cancer tumors appear in the right lung more often than in the left lung, with a ratio of 3:2. Treating the text discovery as a hypothesis, we verified this relationship against the medical literature in which primary lung tumor sites were reported, using a standard χ² statistic. We subsequently developed a working theoretical model of lung cancer that may explain the discovery. This discovery and our model may change how oncologists view the mechanisms of primary lung tumor location.
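
The core idea of the abstract (count the terms of a finite domain, then test the counts against an expected distribution) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the tokenization rule, and the toy report strings are all assumptions made for illustration, and the uniform expectation applies only to a domain such as {right, left}. The p-value helper is valid only for a two-term domain (one degree of freedom).

```python
import math
import re
from collections import Counter


def term_domain_counts(documents, domain):
    """Count how often each term of a finite domain occurs across documents."""
    wanted = set(domain)
    counts = Counter()
    for text in documents:
        # Assumed tokenization: lowercase alphabetic runs only.
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in wanted:
                counts[token] += 1
    return [counts[term] for term in domain]


def chi_square_uniform(observed):
    """Chi-square goodness-of-fit statistic against a uniform distribution."""
    total = sum(observed)
    if total == 0:
        raise ValueError("no domain terms were observed")
    expected = total / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)


def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical toy reports, invented for illustration only (not study data).
reports = [
    "mass in the right upper lobe",
    "nodule in the left lower lobe",
    "right hilar mass",
    "lesion in the right middle lobe",
]
domain = ["right", "left"]

observed = term_domain_counts(reports, domain)
stat = chi_square_uniform(observed)
p = p_value_df1(stat)
print(observed, stat, round(p, 4))
```

With so few toy reports the deviation from uniform is not significant; a skew only becomes a hypothesis worth checking against the literature once the counts are large enough for the statistic to exceed the chosen significance threshold.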

Methods of Information in Medicine, Schattauer

Print ISSN: 0026-1270
Volume: 38, 06/1999
Pages: 96 - 101
