An Information Theory Approach for Validating Clusters in Microarray Data
Sudhakar Jonnalagadda and Rajagopalan Srinivasan*
Department of Chemical and Biomolecular Engineering, National
*Corresponding Author: Email: chergs@nus.edu.sg
Abstract
Cluster validation is commonly used to evaluate the quality of the partition produced by a clustering algorithm. In this paper, we present a novel method to assess the quality of clustering in gene expression data. In contrast to methods that are based entirely on intra- and inter-cluster distances, our approach considers the dynamics and rearrangement of elements when a new cluster is introduced. Cluster quality is measured by the change in information, and the partition with the highest total information is selected. We illustrate the efficacy of the proposed method using two microarray data sets and two artificial data sets, and discuss its advantages and limitations.
Keywords: Cluster validation, Gene expression, Information theory.
Data sets and Results
This supplementary information contains all the data sets, together with the data partitions corresponding to the clusters predicted by our method. All files are available as Excel sheets.
Data set 1: This is a two-dimensional synthetic data set. It consists of 300 elements grouped into 3 well-defined clusters.
The complete data set, along with the cluster indices of all elements at the best partition (based on our method), is available here
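For readers who want a stand-in with the same shape as Data set 1, the following is a minimal sketch that generates 300 two-dimensional points in 3 well-separated groups, each point carrying a cluster index. It is not the data set used in the paper: the function name, the cluster centers, the equal cluster sizes, the spread, and the seed are all assumptions chosen for illustration.

```python
import random


def make_synthetic_clusters(centers, points_per_cluster=100, spread=0.5, seed=0):
    """Return (points, labels) for Gaussian blobs around the given 2-D centers.

    Labels are 1-based cluster indices, one per point.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    points, labels = [], []
    for label, (cx, cy) in enumerate(centers, start=1):
        for _ in range(points_per_cluster):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
            labels.append(label)
    return points, labels


# Hypothetical centers: the coordinates of the real data set are not stated here.
CENTERS = [(0.0, 0.0), (5.0, 0.0), (2.5, 4.0)]

points, labels = make_synthetic_clusters(CENTERS)
print(len(points), sorted(set(labels)))
```

With three centers and the default of 100 points per cluster, the sketch yields 300 labeled points, which can be passed to any clustering algorithm to check that a validation measure recovers 3 clusters.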
Data set 2: This 10-dimensional artificial data set, consisting of 480 objects, was generated by Doulaye Dembélé and Philippe Kastner (2003). It contains 14 clusters.
This data set can be downloaded from http://www-igbmc.u-strasbg.fr/projets/fcm/
The same data with cluster indices is available here
Data set 3 (serum data): This real gene expression data set is described and used by Iyer et al. (1999). It comprises the transcriptional response of human fibroblasts to serum. The complete data set is available at: www.sciencemag.org/feature/data/984559.shl
The data with cluster indices corresponding to 6 clusters (predicted by our method) is available here
Data set 4 (DLBCL data): This gene expression data set is the one used in Fig. 1 of Alizadeh et al. (2000). It contains 4026 genes whose expression levels are measured across 96 samples. The samples comprise three prevalent adult lymphoid malignancies (DLBCL, FL and CLL) and normal lymphocyte subpopulations under a range of activation conditions.
This data set can be downloaded from: http://llmpp.nih.gov/lymphoma/data.shtml
The clustered data with 4 clusters (predicted by the proposed method) is available here
References
[1] Alizadeh et al. (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503-511.
[2] Dembélé, D., Kastner, P. (2003) Fuzzy C-means method for clustering microarray data. Bioinformatics 19, 973-980.
[3] Iyer et al. (1999) The transcriptional program in the response of human fibroblasts to serum. Science 283, 83-87.