B3038 - Big Data in the Social Sciences: Statistical methods for multi-source high-dimensional data - 10/01/2018

B number: 
B3038
Principal applicant name: 
Katrijn Van Deun | Tilburg University (The Netherlands)
Co-applicants: 
Mr. Niek de Schipper, Mr. Soogeun Park, Dr. Davide Vidotto, Mr. Shuai Yuan, Mrs. Pia Tio, Mr. Zhengguo Gu, Mr. Aaron Carmack
Title of project: 
Big Data in the Social Sciences: Statistical methods for multi-source high-dimensional data
Proposal summary: 

Social science research has entered the era of big data: many detailed measurements are taken and multiple sources of information are used to unravel complex multivariate relations. For example, in studying obesity as the outcome of environmental and genetic influences, researchers increasingly collect survey, dietary, biomarker and genetic data from the same individuals. Such novel integrated research can inform health strategies to prevent obesity.
Although linked multi-source data with more variables than samples (so-called high-dimensional data) form an extremely rich resource for research, extracting meaningful and integrated information is challenging and not appropriately addressed by current statistical methods. A first problem is that relevant information is hidden in a bulk of irrelevant variables. Second, the sources are often very heterogeneous, which may obscure the links between shared mechanisms. A statistical framework is needed to select the relevant groups of variables within each source and link them across data sources. In this project we develop a new framework by extending principal component analysis to common components defined by relevant clusters of variables. We use it both for exploration and for outcome modelling of linked high-dimensional social science and (epi)genetic data.
The advanced component analysis method will be a novel, widely applicable method for knowledge extraction that also allows for more accurate predictions in many social science contexts with big data. In addition, the proposed empirical study will generate important insights into gene-environment interaction in socially relevant outcomes like obesity.

Date proposal received: 
Tuesday, 9 January, 2018
Date proposal approved: 
Tuesday, 9 January, 2018
Keywords: 
Statistics/methodology, Obesity, Computer simulations/modelling/algorithms, Statistical methods, BMI, Environment - environmental exposure, pollution, Epigenetics, Methods - e.g. cross cohort analysis, data mining, mendelian randomisation, etc.