Dimension reduction and identification of data clusters under Gaussian and non-Gaussian setups

English

Séminaire Probabilités & Statistique

21/06/2012 - 14:00, Asis Kumar Chattopadhyay (Department of Statistics / Calcutta University), Salle 2 - Tour IRMA

In many real-life situations, both the number of variables under consideration and the number of observations are very large. To analyze such multivariate data, it is necessary to reduce the dimension properly; a smaller dimension is needed for further analysis such as classification or clustering. In statistics, Principal Component Analysis (PCA) is the most popular dimension reduction technique. Although PCA is basically an exploratory technique, making inferences requires a normality assumption on the underlying multivariate distribution. The eigenvalues and eigenvectors of the covariance or correlation matrix are the main ingredients of a PCA: the eigenvectors determine the directions of maximum variability, whereas the eigenvalues specify the variances along those directions. In practice, decisions regarding the quality of the principal component approximation should be made on the basis of the eigenvalue-eigenvector pairs. To study the sampling distributions of their estimates, the multivariate normality assumption becomes necessary, as the problem is otherwise too difficult. Principal components (PCs) are a sequence of projections of the data, constructed so that they are uncorrelated and ordered in decreasing variance. The PCs of a p-dimensional data set provide a sequence of best linear approximations. Since only a few (say, m << p) of these linear combinations may explain a large percentage of the variation in the data, one can retain just those m components, instead of the original p variables, for further analysis.
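
As a rough illustration of this procedure, the following Python sketch retains the smallest number m of components explaining a fixed share of the variance. It assumes scikit-learn is available; the simulated data and the 90% variance cut-off are illustrative choices, not taken from the talk.

```python
# Minimal PCA dimension-reduction sketch (illustrative assumptions:
# simulated Gaussian data, 90% explained-variance threshold).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))  # toy data: n = 500 observations, p = 20 variables

pca = PCA().fit(X)  # equivalent to an eigendecomposition of the sample covariance matrix

# pca.explained_variance_ holds the eigenvalues (component variances);
# pca.components_ holds the corresponding eigenvectors (directions).
cum_share = np.cumsum(pca.explained_variance_ratio_)
m = int(np.searchsorted(cum_share, 0.90)) + 1  # smallest m explaining >= 90% of variance

scores = PCA(n_components=m).fit_transform(X)  # n x m matrix of PC scores
print(f"retained m = {m} of p = 20 components; scores shape = {scores.shape}")
```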

More recently, independent component analysis (ICA) has emerged as a strong competitor to PCA and factor analysis. ICA was primarily developed for non-Gaussian data, in order to find independent components (rather than merely uncorrelated ones, as in PCA) responsible for a large part of the variation. ICA separates statistically independent components, i.e. the original source data, from an observed set of data mixtures. Not all information in a multivariate data set is equally important; we need to extract the most useful part, and ICA extracts and reveals such useful information from the whole data set. This technique has been applied in various fields such as speech processing, brain imaging and stock prediction. Although ICA has already been used for the analysis of astronomical data, until now it has not been used for the purpose of clustering. Here, classification of galaxies has been carried out using two recently developed methods, viz. Independent Component Analysis (ICA) combined with K-means clustering, and Clustering in Arbitrary Subspace based on Hough Transform (CASH), for different data sets.
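
A minimal sketch of such an ICA-plus-K-means pipeline is given below, using scikit-learn's FastICA and KMeans as stand-ins; the simulated mixtures and the numbers of components and clusters are illustrative assumptions, not the speaker's actual setup.

```python
# Illustrative ICA + K-means pipeline (assumed toy data, not the talk's galaxy sets).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import FastICA

rng = np.random.default_rng(1)
S = rng.laplace(size=(1000, 3))   # non-Gaussian, statistically independent sources
A = rng.standard_normal((3, 6))   # unknown mixing matrix
X = S @ A                         # observed mixtures, 1000 objects x 6 variables

# Unmix: estimate the independent components from the observed mixtures.
ica = FastICA(n_components=3, random_state=1)
S_est = ica.fit_transform(X)      # 1000 x 3 estimated source signals

# Cluster the objects in the reduced, independent-component space.
labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(S_est)
print(np.bincount(labels))        # cluster sizes
```

Clustering on the independent components rather than the raw variables is the point of the combination: the reduced coordinates are both lower-dimensional and free of the linear dependencies that can distort K-means distances.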

The first two data sets consist of dwarf galaxies and their globular clusters, whose distributions are non-Gaussian in nature. The third is a larger set containing a wider range of galaxies, from dwarfs to giants, in 56 clusters of galaxies. Morphological classification of galaxies is subjective in nature and, as a result, cannot properly explain the formation mechanism and other related issues under the influence of different correlated variables through a proper scientific approach. Hence objective classification using the above-mentioned methods is preferred to overcome these shortcomings.