An Information Theory Approach for Validating Clusters in Microarray Data

 

Sudhakar Jonnalagadda and Rajagopalan Srinivasan*

 

Department of Chemical and Biomolecular Engineering,

National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260

*Corresponding Author: Email: chergs@nus.edu.sg

 

 

Abstract

Cluster validation is commonly used to evaluate the quality of the partition produced by a clustering algorithm. In this paper, we present a novel method to assess the quality of clustering in gene expression data. In contrast to methods that are based solely on intra- and inter-cluster distances, our approach considers the dynamics and rearrangement of elements when a new cluster is introduced. Cluster quality is measured by the change in information, and the partition with the highest total information is selected. We illustrate the efficacy of the proposed method using two microarray data sets and two artificial data sets, and discuss its advantages and limitations.

Key words: Cluster validation, Gene Expression, Information Theory.


 

Data sets and Results

 

This supplementary information contains all the data sets, together with the data partitions corresponding to the clusters predicted by our method. All files are available as Excel sheets.

 

Data set 1: This is a two-dimensional synthetic data set. It consists of 300 elements grouped into 3 well-defined clusters.

The complete data set, along with cluster indices for all elements at the best partition (based on our method), is available here
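
As an illustration of how a file with cluster indices might be inspected, the following minimal Python sketch counts the number of elements assigned to each cluster. It assumes the Excel sheet has been exported to CSV with one element per row, no header row, and the cluster index in the last column; this layout, the function name `cluster_sizes`, and the sample values are hypothetical and are not taken from the actual files.

```python
import csv
import io
from collections import Counter


def cluster_sizes(csv_text, index_column=-1):
    """Count how many elements carry each cluster index.

    Assumes one element per row, no header row, and the cluster index
    in `index_column` (hypothetical layout; adjust to the actual sheet).
    """
    sizes = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        sizes[row[index_column].strip()] += 1
    return dict(sizes)


# Tiny made-up sample: x, y, cluster index (not taken from Data set 1).
sample = "0.1,0.2,1\n0.3,0.1,1\n5.0,5.2,2\n9.8,0.4,3\n"
print(cluster_sizes(sample))
```

The same tally can be used to check that the number of distinct indices in a file matches the number of clusters reported for that data set.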

 

Data set 2: This 10-dimensional artificial data set, consisting of 480 objects, was generated by Doulaye Dembélé and Philippe Kastner (2003). It contains 14 clusters.

This data set can be downloaded from http://www-igbmc.u-strasbg.fr/projets/fcm/

The same data with cluster indices is available here

 

Data set 3 (serum data): This real gene expression data set is described and used by Iyer et al. (1999). It comprises the transcriptional response of human fibroblasts to serum. The complete data set is available at: www.sciencemag.org/feature/data/984559.shl

Data with cluster indices corresponding to 6 clusters (predicted by our method) is available here

 

Data set 4 (DLBCL data): This gene expression data set is the one used in Fig. 1 of Alizadeh et al. (2000). It contains 4026 genes whose expression levels are measured across 96 samples. The samples comprise three prevalent adult lymphoid malignancies (DLBCL, FLL and CLL) and normal lymphocyte subpopulations under a range of activation conditions.

This data set can be downloaded from: http://llmpp.nih.gov/lymphoma/data.shtml

Clustered data with 4 clusters (predicted by the proposed method) is available here

 

References

 

[1] Alizadeh et al. (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature 403, 503–511.

[2] Dembélé, D., Kastner, P. (2003) Fuzzy C-means method for clustering microarray data, Bioinformatics 19, 973–980.

[3] Iyer et al. (1999) The transcriptional program in the response of human fibroblasts to serum, Science 283, 83–87.