B998 - A graphical model approach to pedigree construction using constrained optimisation - 26/05/2010

B number: 
B998
Principal applicant name: 
Dr James Cussens (University of York, UK)
Co-applicants: 
Dr Nuala Sheehan (University of Leicester, UK), Prof Paul Burton (University of Leicester, UK), Prof George Davey Smith (University of Bristol, UK)
Title of project: 
A graphical model approach to pedigree construction using constrained optimisation
Proposal summary: 

I am currently preparing a grant application for the MRC Methodology panel with a computer scientist from York, James Cussens (PI), to investigate the potential of powerful graph-learning algorithms for reconstructing pedigrees with maximum likelihood from genetic marker data. This is a problem that I have been working on for a few years with a Norwegian collaborator, Thore Egeland, but our emphasis has been very much on forensic identification applications and we have taken a more traditional statistical genetics approach. A pedigree is a special type of graph in that nodes representing individuals can have at most two parents, and the two parents must be labelled as being of opposite sex. Consequently, general graph learning algorithms are not very efficient unless they are suitably modified, although we do not have the additional problem encountered in other applications of having to learn the probabilities on the edges (arrows) between nodes, as these derive from our knowledge of the genetic model under consideration. James has been in communication with us for some time and has considered a number of approaches to this specialised graph learning problem. He has decided that an approach based on constrained optimisation via integer linear programming is the most promising. Indeed, initial results for a simplified situation in which all individuals are observed and totally ordered by age are promising, and much faster than competing methods, so we are quite optimistic.
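The parent-assignment view of the problem can be sketched as a toy optimisation. This is purely illustrative: the names, scores and brute-force solver below are invented for the example and are not the proposed ILP machinery, which would handle the same 0/1 variables and objective at scale.

```python
from itertools import product

# Illustrative sketch: pedigree selection as a 0/1 assignment problem.
# For each child c we choose one parent pair (f, m) from its candidate
# pairs, maximising a sum of per-family log-likelihood scores -- the
# objective an integer linear programme would optimise.  Scores here
# are made up for the example.

def best_pedigree(children, candidate_pairs, score):
    """Exhaustively solve the assignment an ILP would solve at scale.

    children:        list of child identifiers
    candidate_pairs: dict child -> list of (father, mother) tuples
    score:           dict (child, father, mother) -> log-likelihood
    """
    best, best_val = None, float("-inf")
    # One 0/1 variable per (child, pair); the "exactly one parent pair
    # per child" constraint becomes iteration over each child's pairs.
    for choice in product(*(candidate_pairs[c] for c in children)):
        val = sum(score[(c, f, m)] for c, (f, m) in zip(children, choice))
        if val > best_val:
            best, best_val = dict(zip(children, choice)), val
    return best, best_val

# Toy data: two children, two candidate parent pairs each.
scores = {
    ("c1", "f1", "m1"): -1.0, ("c1", "f2", "m1"): -3.0,
    ("c2", "f1", "m1"): -0.5, ("c2", "f2", "m2"): -2.5,
}
pairs = {"c1": [("f1", "m1"), ("f2", "m1")],
         "c2": [("f1", "m1"), ("f2", "m2")]}
ped, ll = best_pedigree(["c1", "c2"], pairs, scores)
```

Brute force is exponential in the number of children, which is exactly why an ILP solver, with its branch-and-cut machinery, is attractive for the real problem.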

Classical approaches to pedigree reconstruction developed 40 years ago, such as the sequential algorithm of Thompson (a "greedy algorithm" in our terminology), typically search for "good" but not necessarily optimal pedigrees by starting with sets of individuals that are most likely to be siblings, then finding the most likely parents for these sibships and moving on from there. The resulting reconstruction is heavily dependent on the decisions made about nuclear families at any given stage. Information on the maximal size of a sibship (which will vary across human populations and differ widely from animal and plant populations), the minimal number of years to reach sexual maturity (generation gap), age differences between parents etc. is incorporated in these algorithms at various stages in order to ensure that a realistic structure is produced. These are all "constraints", and one of our first tasks will be to search the statistical literature for those that are known to be important and then find a way to express them as linear constraints. All these algorithms assume complete data, which we will typically not have. Searching over all possible pedigrees with missing individuals is much harder and is the interesting problem for the graph learning community. We will begin with a likelihood approach but will then move to a Bayesian framework where "hard" prior information (e.g. on certain known relationships) and "soft" prior information (e.g. this population is known to favour first cousin marriages) can all be incorporated. We will start with simulated data, moving from classic unlinked microsatellite-type data to linked markers and dense sets of SNPs: the likelihood calculations then become more problematic, and we will need to employ Markov chain Monte Carlo methods to approximate likelihoods that are intractable with exact methods.
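To make "express them as linear constraints" concrete, the following hypothetical sketch encodes two of the rules mentioned above, a sibship-size cap and a generation-gap rule, against the 0/1 parent-assignment variables. The limits and names are invented for the example, not values from the proposal.

```python
# Hypothetical sketch of how pedigree knowledge becomes linear
# constraints on 0/1 parent-assignment variables x[c, (f, m)].
# The numeric limits are illustrative only.

MAX_SIBSHIP = 15         # cap on children per parent pair
MIN_GENERATION_GAP = 12  # minimum years from parent's birth to child's

def admissible(child_age, father_age, mother_age):
    """Generation-gap rule: if violated, the variable x[c,(f,m)] is
    fixed to zero, i.e. the linear constraint x[c,(f,m)] <= 0 holds."""
    return (father_age - child_age >= MIN_GENERATION_GAP and
            mother_age - child_age >= MIN_GENERATION_GAP)

def sibship_rows(children, candidate_pairs):
    """Sibship cap: for each candidate pair p, emit the constraint
       sum over children c of x[c, p] <= MAX_SIBSHIP,
    represented here as (pair, variables in the sum, bound)."""
    rows = {}
    for c in children:
        for p in candidate_pairs[c]:
            rows.setdefault(p, []).append((c, p))
    return [(p, vs, MAX_SIBSHIP) for p, vs in rows.items()]
```

Both rules are linear in the decision variables (a fixed variable and a bounded sum respectively), which is what makes them directly usable by an integer linear programming solver.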

Besides the fact that this is a really nice graph theory problem, the main medical application is the potential for finding sets of related individuals from large Biobank populations for later linkage studies, or for homozygosity mapping, where distantly related individuals are better because shared environmental effects won't mask the relatively small effects of the rarer variants we're trying to find. Identifying the correct pedigree/relationship is crucial for forensic and wildlife population management applications, of course. It is also important for linkage analysis, but is overkill for homozygosity mapping, where you just need individuals with reasonable IBD-sharing probabilities. However, these general machine-learning algorithms are extremely efficient, so if we can tailor-make one for pedigrees that works quickly, we can obviously get distant relatives easily. We will then need to check sensitivity to getting the relationship right and investigate how best to choose relatives for homozygosity mapping from any constructed extended pedigree.
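As an illustration of the IBD-sharing point, once a pedigree has been constructed, pairwise kinship coefficients can be read off it with the standard recursion, and these are what one would screen when choosing relatives. The toy pedigree and function names below are invented for the example.

```python
# Illustrative sketch: kinship coefficients phi(i, j) from a constructed
# pedigree via the standard recursion.  The toy pedigree is invented.

# child -> (father, mother); founders do not appear as keys.
PED = {"child": ("dad", "mum"),
       "dad":   ("gf", "gm"),
       "uncle": ("gf", "gm")}

def depth(i):
    """Generations between i and the founders above it."""
    if i not in PED:
        return 0
    f, m = PED[i]
    return 1 + max(depth(f), depth(m))

def kinship(i, j):
    """Standard recursive kinship coefficient phi(i, j)."""
    if i == j:
        if i in PED:
            f, m = PED[i]
            return 0.5 * (1.0 + kinship(f, m))
        return 0.5
    # Recurse on whichever individual is further from the founders,
    # so we never recurse through an ancestor of the other.
    if depth(i) < depth(j):
        i, j = j, i
    if i not in PED:
        return 0.0  # two distinct founders: assumed unrelated
    f, m = PED[i]
    return 0.5 * (kinship(f, j) + kinship(m, j))
```

For this toy pedigree the recursion gives the textbook values: 1/4 for full siblings ("dad" and "uncle") and 1/8 for an avuncular pair ("child" and "uncle"), so distant relatives show up as pairs with small but non-zero kinship.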

This is a methodological project with some very hard problems to solve before we get close to using it for anything. There are obviously ethical issues as to when and where you could do this in practice, but for now we just want to see if it can be done. We will clearly do most of the work with simulated data, but we will need a real dataset to check that what we are finding is not simply an artefact of how we simulated. The main application is aimed at UK Biobank, which has referred to the desirability of being able to infer relatives at a later point in its protocol. However, as they are a long way off genotyping, we will need a back-up dataset to test things on if we get funded and if Biobank takes more than three years to deliver. Other existing cohort studies are probably not large enough to capture any relatives by chance. Obviously there are very close relationships in ALSPAC which are a bit too easy to find but, given that every birth in Avon over a two-year period was included, we thought that there might be siblings or even a few cousins there. Our main intention would be to see if we can recover relationships you already know about - not to find hidden ones - although there is a possibility that the ones you know about are not quite as stated. Obviously all ethical issues will have to be handled as deemed appropriate at the time. In particular, we are not interested in any analysis of the ALSPAC data for relatedness per se, so nothing will be published without clearing everything beforehand. Also, authorship of any paper that uses the ALSPAC data in any way at all, be it for the machine learning, statistical genetics, forensics or medical communities, will include appropriate ALSPAC investigator(s), and this will be discussed with you beforehand.

Paul Burton (Leicester) and George Davey Smith (Bristol) are both coming on board as full investigators, not because of their great desire to learn about constrained optimisation but because we feel we need their input and experience to ensure a) that the focus of the research stays on medical applications and b) that anything to do with real data will be handled correctly. Knowing them both, I am confident that they will advise us well. I am being deliberately vague about what data we require because we do not know exactly what we need yet; it will be a long time before we need it, and much more information than the above will have to be provided then. George mentioned that Illumina quad GWAS data would be available, so once we have the machinery in place to handle dense SNP arrays, access to some of those data would be something to discuss. Right now, for the purposes of the grant application, I would like to be able to say, with your permission, that we are in discussion with ALSPAC collaborators about real data which will be used to test the methods in an appropriate way when we finally get to that stage.

Date proposal received: 
Wednesday, 26 May, 2010
Date proposal approved: 
Wednesday, 26 May, 2010
Keywords: 
Genetics
Primary keyword: