ALSPAC OMICs Data Catalogue
Table of Contents
- 1. Introduction
- 2. Catalogue overview
- 3. Genetic Array Data
- 4. Imputed Data
- 4.1. Genome-wide - HRC imputed - G0 mothers + G1 (gi_hrc_g0m_g1)
- 4.2. Genome-wide - HapMap2 imputed - G1 (gi_hapmap2_g1)
- 4.3. Genome-wide - HapMap2 imputed - G0 mothers (gi_hapmap2_g0m)
- 4.4. Genome-wide - 1000G imputed - G0 partners (gi_1000g_g0p)
- 4.5. Genome-wide - 1000G imputed - G0 mothers + G1 (gi_1000g_g0m_g1)
- 5. Sequence Data
- 6. Epigenetic Data
- 7. Gene Expression Data
- 8. Omics tips
- 8.1. Introduction
- 8.2. Disclaimer
- 8.3. Operating systems
- 8.4. Key Omics software
- 8.5. File types
- 8.6. Variant/SNP ids
- 8.7. Overview of Imputation reference panels
- 8.8. SNP data types from imputation.
- 8.9. SNP Statistics
- 8.10. Best practice
- 8.11. Population stratification
- 8.12. Common tasks
- 8.13. Courses
- 8.14. Further sources of help
1 Introduction
Welcome to the ALSPAC Omics Catalogue, a guide to the omics data offered by ALSPAC. This catalogue features a variety of named ALSPAC datasets, each consisting of collected or produced data that has been organized, named, and curated for ease of use. Every named ALSPAC dataset comes with accompanying metadata that provides information about the dataset as a whole. Each named ALSPAC dataset has at least one release version that includes a curated selection of files detailed in the metadata sections.
Please note that these datasets are not generally accessible. See http://www.bristol.ac.uk/alspac/researchers/access/ for details of how to request access.
The information within this catalogue is made available for browsing to help both internal ALSPAC users and external researchers understand the data and facilitate prospective data requests.
For external ALSPAC collaborators, we offer as standard "freezes" of specific dataset versions of named ALSPAC datasets. These freezes, along with their metadata, are outlined in this catalogue. External collaborators will be granted access to these freezes upon request approval. A freeze represents a carefully selected subset of data files within a version, containing the core data from a dataset with withdrawn consent removed and specific dataset IDs applied. These freezes are subject to periodic updates.
Due to the removal of withdrawn individuals from the freezes, please note that the number of participants within each dataset may change over time and may not match those found in the Methodology fields.
Freeze 1 timing: July 2021 - Dec 2022
Freeze 2 timing: Dec 2022 - Dec 2023
Freeze 3 timing: Jan 2023 - Oct 2024
Freeze 4 timing: Oct 2024 - present
Documentation for the current freeze of each dataset is provided below in the form of a yaml file, which lists the files external collaborators will receive together with their metadata.
The metadata presented in our catalogue adheres to the ALSPAC Data catalogue Schema, which is crafted in LinkML. To explore the full schema documentation, please visit: https://alspac.github.io/alspac-data-catalogue-schema/
This website is equipped with RDFa, enabling the metadata to be machine-readable and allowing for the creation of queries using SPARQL with compatible tools, such as Apache Any23 and Apache Jena.
For more information about this, see the document on FAIR data principles and the document describing the rationale and construction of this catalogue here.
2 Catalogue overview
alspacdcs:alspac data catalogue 001 a dcat:Catalog | |
---|---|
schema:description | This catalogue is for all of the named alspac omics data sets. |
schema:email | alspac-omics@bristol.ac.uk |
schema:name | ALSPAC Omics Data Catalogue |
alspacdcs:named alspac datasets | |
alspacdcs:primary investigator orcids | |
alspacdcs:see also | |
3 Genetic Array Data
3.1 Genome-wide - Illumina 550 quad - G1 (gwa_550_g1)
3.1.1 Description
This dataset contains genome-wide array genotype calls for G1 individuals. Reference genome build: GRCh37
3.1.2 Methodology
ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platforms by 23andme subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%) and insufficient sample replication (IBD < 0.8).
Population stratification was assessed by multidimensional scaling analysis and compared with Hapmap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed.
SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed. Cryptic relatedness was measured as proportion of identity by descent (IBD > 0.1). Related subjects were removed.
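The SNP-level thresholds quoted above can be written as a simple predicate. This is a minimal sketch rather than the pipeline that was actually run: the `SnpStats` record and its field names are hypothetical, the example values are made up, and the per-SNP statistics are assumed to have been computed beforehand by a tool such as PLINK.

```python
from dataclasses import dataclass

# Thresholds quoted in the methodology text.
MIN_MAF = 0.01        # SNPs with minor allele frequency < 1% are removed
MIN_CALL_RATE = 0.95  # SNPs with call rate < 95% are removed
HWE_P_CUTOFF = 5e-7   # SNPs with HWE P value < 5E-7 are removed


@dataclass
class SnpStats:
    """Hypothetical per-SNP summary record (field names are assumptions)."""
    snp_id: str
    maf: float
    call_rate: float
    hwe_p: float


def passes_snp_qc(snp: SnpStats) -> bool:
    """Return True if the SNP survives all three filters."""
    return (
        snp.maf >= MIN_MAF
        and snp.call_rate >= MIN_CALL_RATE
        and snp.hwe_p >= HWE_P_CUTOFF
    )


# Made-up example values, for illustration only.
snps = [
    SnpStats("rs_example_1", maf=0.20, call_rate=0.99, hwe_p=0.50),
    SnpStats("rs_example_2", maf=0.005, call_rate=0.99, hwe_p=0.50),
    SnpStats("rs_example_3", maf=0.20, call_rate=0.90, hwe_p=0.50),
    SnpStats("rs_example_4", maf=0.20, call_rate=0.99, hwe_p=1e-9),
]
kept = [s.snp_id for s in snps if passes_snp_qc(s)]
print(kept)
```

Only the first example record passes; each of the others fails exactly one threshold.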
Associated publication:
- Horikoshi et al 2013 (https://doi.org/10.1038/ng.2477)
3.1.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema
id: alspacdcs:gwa_550_g1_2022-12-05_f4
name: >-
  Genome-wide array data for G1 individuals 2022-12-05 freeze 4
description: >-
  The fourth freeze of the genome-wide array data for G1 based on a
  2022-12-05 release. The data is in plink format.
freeze_size: 997M
linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112
git_tag: https://github.com/alspac/dataset_gwa_550_g1/releases/tag/freeze4
is_current_freeze: true
freeze_number: 4
freeze_date: 2024-06-11
previous_freeze: alspacdcs:gwa_550_g1_2022-12-05_f3
freeze_of_alspac_dataset_version: alspacdcs:gwa_550_g1_2022-12-05
freeze_of_named_alspac_dataset: alspacdcs:gwa_550_g1
has_containers:
  - id: alspacdcs:e8e8dde6-0841-4135-aec5-13dee5aa065a ## uuid
    name: data
    description: A dir/folder containing the two freeze data files
has_parts:
  - id: alspacdcs:cb1d46af-b413-4820-b395-3ab2c07c336e
    name: Biallelic genotype table
    description: >-
      genotype data
    data_distributions:
      - id: alspacdcs:2edc1c1f-bd1c-4d8d-a258-f85a5e2c0b5c
        name: freeze_id.bed
        description: >-
          Plink bed file. Primary representation of genotype calls at
          biallelic variants. Must be accompanied by .bim and .fam files.
        md5sum: 94973786388f80000dcdad0a80514e37
        filesize: 982M
        filetype: .bed
        number_of_participants: 8223
        number_of_variants: 500527
        belongs_to_container: alspacdcs:e8e8dde6-0841-4135-aec5-13dee5aa065a
  - id: alspacdcs:5a798cc1-ffba-4c69-a54a-de5fd6e616cb
    name: Variant Information
    description: >-
      Information about SNPS
    data_distributions:
      - id: alspacdcs:356763e4-11e0-4a22-ab01-14f3c3f58bac
        name: freeze_id.bim
        description: >-
          Extended variant information file accompanying a .bed binary
          genotype table. (--make-just-bim can be used to update just this file.)
          A text file with no header line, and one line per variant with
          the following six fields:
          1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
          2. Variant identifier
          3. Position in morgans or centimorgans (safe to use dummy value of '0')
          4. Base-pair coordinate (1-based; limited to 2^31-2)
          5. Allele 1 (corresponding to clear bits in .bed; usually minor)
          6. Allele 2 (corresponding to set bits in .bed; usually major)
        md5sum: b0789ac6126af474c916c80f77335f6a
        filesize: 14M
        filetype: .bim
        number_of_variants: 500527
        belongs_to_container: alspacdcs:e8e8dde6-0841-4135-aec5-13dee5aa065a
  - id: alspacdcs:5cee7fda-8d37-4667-9909-91b847689c98
    name: sample info
    description: >-
      Sample ids
    data_distributions:
      - id: alspacdcs:a81eb161-3051-4557-88d6-d82068016c67
        name: freeze_id.fam
        description: >-
          A text file with no header line, and one line per sample with
          the following six fields:
          1. Family ID ('FID')
          2. Within-family ID ('IID'; cannot be '0')
          3. Within-family ID of father ('0' if father isn't in dataset)
          4. Within-family ID of mother ('0' if mother isn't in dataset)
          5. Sex code ('1' = male, '2' = female, '0' = unknown)
          6. Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)
        md5sum: 854ea4dcd904ca37f441ca671e445634
        filesize: 256k
        filetype: .fam
        number_of_participants: 8223
        belongs_to_container: alspacdcs:e8e8dde6-0841-4135-aec5-13dee5aa065a
  - id: alspacdcs:9b487074-065a-4924-9ffa-f2864f148ba9
    name: Heterozygous haploid and nonmale Y chromosome call list
    description: >-
      A plink report
    data_distributions:
      - id: alspacdcs:64a1b50e-68b5-4857-b876-b561ed1e9fec
        name: freeze_id.hh
        description: >-
          Produced automatically when the input data contains heterozygous
          calls where they shouldn't be possible (haploid chromosomes,
          male X/Y), or there are nonmissing calls for nonmales on the
          Y chromosome.
          A text file with one line per error (sorted primarily by variant
          ID, secondarily by sample ID) with the following three fields:
          Family ID
          Within-family ID
          Variant ID
        md5sum: 173734a688e9ff15c2911a91636bee56
        filesize: 1.7M
        filetype: .hh
        belongs_to_container: alspacdcs:e8e8dde6-0841-4135-aec5-13dee5aa065a
  - id: alspacdcs:e2fea4ec-7fc8-4e09-8b12-4be1c2ddc1b6
    name: Logs
    description: >-
      plink log
    data_distributions:
      - id: alspacdcs:caa69afd-0c19-4299-b660-fb308988a6ee
        name: freeze_id.log
        description: >-
          plink log file
        md5sum: 0b069047e228212360cc189a5d689d50
        filesize: 512
        filetype: .log
        belongs_to_container: alspacdcs:e8e8dde6-0841-4135-aec5-13dee5aa065a
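Each distribution in a freeze document carries an `md5sum`, so a received file can be checked against it. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify_md5` are hypothetical, and the small demonstration file is created purely so that the example runs on its own.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path: Path, expected: str) -> bool:
    """Compare a file's digest with the md5sum listed in the freeze doc."""
    return md5_of_file(path) == expected.strip().lower()


# Self-contained demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.bin"
    demo.write_bytes(b"hello")
    demo_ok = verify_md5(demo, hashlib.md5(b"hello").hexdigest())
    demo_bad = verify_md5(demo, "0" * 32)
print(demo_ok, demo_bad)
```

With a real freeze, the call would look like `verify_md5(Path("data/freeze_id.bed"), "94973786388f80000dcdad0a80514e37")`, where the directory layout is an assumption and the checksum is the one listed for `freeze_id.bed` above.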
3.2 Genome-wide - Illumina exome core array - G0 partners (gwa_exome_g0p)
3.2.1 Description
This dataset contains genome wide array genotype calls for G0 mothers and partners. Reference genome build: GRCh37
3.2.2 Methodology
3,453 ALSPAC mothers and fathers and 535,478 SNPs were genotyped using the Illumina HumanCoreExome chip genotyping platforms by the ALSPAC lab and called using GenomeStudio. The resulting raw genome-wide data were subjected to standard quality control methods using PLINK (v1.07). Individuals were excluded on the basis of gender mismatches (n = 80); minimal or excessive heterozygosity (n = 64); disproportionate levels of individual missingness (>5%, n = 60) and possible contamination (n = 3).
Population stratification was assessed by multidimensional scaling analysis and principal component analysis, with comparison against 1000 Genomes phase 3 data; all individuals with non-European ancestry were removed (n = 266). Cryptic relatedness was measured as SNP relatedness in GCTA (relatedness > 0.1, n = 69 removed). SNPs with a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 1E-7) and those which failed GenomeStudio quality control measures were removed (n = 21,298). 6,594 duplicate SNPs were also removed. This resulted in genotypes for 2,911 unrelated mothers and fathers at 507,586 SNPs. We then identified 2217 samples where the ALN assigned historically by the lab matched the genetically assigned ALN.
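The counts reported in this methodology are internally consistent, which can be checked with a few lines of arithmetic. The sketch below assumes that the exclusion categories are non-overlapping and that n = 266 is the number of individuals removed for non-European ancestry; the dictionary labels are paraphrases, not field names from any ALSPAC file.

```python
# Individuals: start from the genotyped total and subtract each exclusion.
genotyped_individuals = 3453
individual_exclusions = {
    "gender mismatch": 80,
    "minimal or excessive heterozygosity": 64,
    "individual missingness > 5%": 60,
    "possible contamination": 3,
    "non-European ancestry": 266,
    "cryptic relatedness > 0.1": 69,
}
remaining_individuals = genotyped_individuals - sum(individual_exclusions.values())

# SNPs: the same bookkeeping for variants.
genotyped_snps = 535478
snp_exclusions = {
    "call rate, HWE or GenomeStudio QC failure": 21298,
    "duplicate SNPs": 6594,
}
remaining_snps = genotyped_snps - sum(snp_exclusions.values())

print(remaining_individuals, remaining_snps)  # 2911 507586
```

Both results match the 2,911 individuals and 507,586 SNPs stated in the text.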
1737 putative G0 partner-G1 pairs for whom both G0 partner and G1 have called genotype data available were identified based on ALN. Given the G0 partners were invited by the G0 mother to take part and only enrolled in the study in their own right several years later, it could not be assumed that all G0 partners were biologically related to G1. Called genotype data for the 1720 unique G0 partners and 1737 unique G1s were merged (i.e. there were 17 pairs of siblings/twins among the G1 offspring), using plink v1.90b7.2 64-bit (11 Dec 2023).
After application of the plink filters --geno 0.05, --maf 0.01, --snps-only just-acgt and --autosome, 113288 SNPs remained. The --related command in KING version 2.3.2 was used to perform kinship analysis, which confirmed that all 1737 putative G0 partner-G1 pairs are genetically related in the way expected for biological father-offspring pairs, using the inference criteria described in Table 1 of Manichaikul, Ani, et al. "Robust relationship inference in genome-wide association studies." Bioinformatics 26.22 (2010): 2867-2873.
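To illustrate how a kinship coefficient maps onto a degree of relatedness, here is a minimal sketch of the power-of-two cut-offs commonly quoted for KING. The thresholds are reproduced from memory of the KING documentation and should be checked against Table 1 of the cited paper before use; the function name is hypothetical, and the sketch ignores the IBS0 criterion that separates parent-offspring pairs from full siblings.

```python
def kinship_degree(kinship: float) -> str:
    """Classify a pair by estimated kinship coefficient.

    Cut-offs are the commonly quoted KING boundaries (powers of two);
    treat them as an assumption and verify against the cited Table 1.
    """
    if kinship > 2 ** -1.5:   # about 0.354
        return "duplicate or MZ twin"
    if kinship > 2 ** -2.5:   # about 0.177
        return "first-degree"
    if kinship > 2 ** -3.5:   # about 0.0884
        return "second-degree"
    if kinship > 2 ** -4.5:   # about 0.0442
        return "third-degree"
    return "unrelated"


# The expected kinship coefficient for a parent-offspring pair is 0.25.
print(kinship_degree(0.25))
print(kinship_degree(0.01))
```

A father-offspring pair is expected to fall in the first-degree band, which is the pattern the text reports for all 1737 pairs.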
3.2.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema
id: alspacdcs:gwa_exome_g0p_2016-11-22_f4
name: Freeze 4 version 2016-11-22 Genome-wide - Illumina exome core array - G0 partners
description: >-
  Freeze 4 version 2016-11-22 Genome-wide array data including raw files
  and genotype calls for G0 partners, also including additional G0 mothers
  who were absent from previous genotyping rounds
freeze_size: 289M
linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112
git_tag: https://github.com/alspac/dataset_gwa_exome_g0p/releases/tag/freeze4
is_current_freeze: true
freeze_number: 4
freeze_date: 2024-06-11
previous_freeze: alspacdcs:gwa_exome_g0p_2016-11-22_f3
freeze_of_alspac_dataset_version: alspacdcs:gwa_exome_g0p_2016-11-22
freeze_of_named_alspac_dataset: alspacdcs:gwa_exome_g0p
has_containers:
  - id: alspacdcs:67611038-8d3d-46a6-a780-4f897729568d
    name: data
    description: A dir/folder containing the plink data files
has_parts:
  - id: alspacdcs:09a75379-f9a6-495d-ac9a-aa45c7eda651
    name: freeze_id
    data_distributions:
      - id: alspacdcs:ecb46381-0969-4b8e-8374-a344365f29ed
        name: freeze_id.fam
        description: >-
          A text file with no header line, and one line per sample with
          the following six fields:
          1. Family ID ('FID')
          2. Within-family ID ('IID'; cannot be '0')
          3. Within-family ID of father ('0' if father isn't in dataset)
          4. Within-family ID of mother ('0' if mother isn't in dataset)
          5. Sex code ('1' = male, '2' = female, '0' = unknown)
          6. Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)
          Here We use both the first two fields to have the full id of the
          participant. i.e. not separate family and within family ids.
        md5sum: 5d116792f1d34a5456c4016f86a372cd
        filesize: 128KB
        filetype: .fam
        number_of_participants: 2198
        belongs_to_container: alspacdcs:67611038-8d3d-46a6-a780-4f897729568d
      - id: alspacdcs:e70210f0-b87d-4296-bcf4-b6cb2aecd798
        name: freeze_id.bim
        description: >-
          Extended variant information file accompanying a .bed binary
          genotype table. (in plink you can use --make-just-bim can be used
          to update just this file.)
          A text file with no header line, and one line per variant with
          the following six fields:
          1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
          2. Variant identifier
          3. Position in morgans or centimorgans (safe to use dummy value of '0')
          4. Base-pair coordinate (1-based; limited to 2^31-2)
          5. Allele 1 (corresponding to clear bits in .bed; usually minor)
          6. Allele 2 (corresponding to set bits in .bed; usually major)
        md5sum: 0fe43f888776059fef0a76d3f08d00ad
        filesize: 14MB
        filetype: .bim
        number_of_variants: 507586
        belongs_to_container: alspacdcs:67611038-8d3d-46a6-a780-4f897729568d
      - id: alspacdcs:7bd80533-1300-4ae5-9357-a4fb469dd676
        name: freeze_id.bed
        description: >-
          Primary representation of genotype calls at biallelic variants.
          Must be accompanied by .bim and .fam files.
        md5sum: 304b0d356880c5174806ce08d7beffd3
        filesize: 267M
        filetype: .bed
        number_of_participants: 2198
        number_of_variants: 507586
        belongs_to_container: alspacdcs:67611038-8d3d-46a6-a780-4f897729568d
      - id: alspacdcs:1ba71829-0007-42a4-ba81-bed5de7acbe9
        name: freeze_id.log
        md5sum: 8c3bc05548cfe7a95643e6db81bf30a5
        filesize: 512B
        filetype: .log
        belongs_to_container: alspacdcs:67611038-8d3d-46a6-a780-4f897729568d
      - id: alspacdcs:f1df1ea4-7317-49c6-8a79-a8da3d5c7093
        name: freeze_id.hh
        description: >-
          plink .hh file see https://www.cog-genomics.org/plink/1.9/formats#hh
        md5sum: ceaaced7ab039cf3631df602a96619f7
        filesize: 8M
        filetype: .hh
        belongs_to_container: alspacdcs:67611038-8d3d-46a6-a780-4f897729568d
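The participant and variant counts in a freeze document also determine how large the `.bed` file should be, which gives a quick sanity check after transfer. This sketch assumes the standard variant-major PLINK 1 binary layout (3 header bytes, then 2 bits per genotype with each variant padded to a whole byte); the function name is hypothetical.

```python
import math


def expected_bed_bytes(n_samples: int, n_variants: int) -> int:
    """Expected size of a variant-major PLINK 1 .bed file in bytes.

    3 magic/mode bytes, then ceil(n_samples / 4) bytes per variant
    because four 2-bit genotypes are packed into each byte.
    """
    return 3 + n_variants * math.ceil(n_samples / 4)


# Counts taken from the freeze document above.
size = expected_bed_bytes(n_samples=2198, n_variants=507586)
print(size)                 # 279172303
print(size / 2**20, "MiB")  # compare with the listed filesize of 267M
```

The result is a little over 266 MiB, in line with the `filesize: 267M` listed for `freeze_id.bed`.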
3.3 Genome-wide - Illumina 660 quad - G0 mothers (gwa_660_g0m)
3.3.1 Description
This dataset contains genome-wide array data including raw files and genotype calls for G0 mothers.
3.3.2 Methodology
ALSPAC mothers were genotyped using the Illumina human660W-quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs.
SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally SNPs with a minor allele frequency of less than 1% were removed. Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded.
Cryptic relatedness was assessed using an IBD estimate of more than 0.125, which is expected to correspond to roughly 12.5% of alleles shared IBD, or relatedness at the first cousin level. Related subjects that passed all other quality control thresholds were retained. In total, 9,048 subjects and 526,688 SNPs passed these quality control filters.
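The link between the 0.125 threshold and first cousins follows from the expected proportion of alleles shared IBD being twice the kinship coefficient. The sketch below uses standard textbook kinship values (assuming no inbreeding); the dictionary and function names are hypothetical.

```python
from fractions import Fraction

# Expected kinship coefficients for a few relationships
# (standard values, assuming no inbreeding).
EXPECTED_KINSHIP = {
    "parent-offspring": Fraction(1, 4),
    "full siblings": Fraction(1, 4),
    "half siblings": Fraction(1, 8),
    "first cousins": Fraction(1, 16),
}

IBD_THRESHOLD = Fraction(1, 8)  # 0.125, as quoted in the text


def expected_ibd_proportion(relationship: str) -> Fraction:
    """Expected proportion of alleles shared IBD = 2 x kinship."""
    return 2 * EXPECTED_KINSHIP[relationship]


for name in EXPECTED_KINSHIP:
    share = expected_ibd_proportion(name)
    above = share > IBD_THRESHOLD
    print(f"{name}: {float(share):.4f} above threshold: {above}")
```

First cousins sit exactly at 0.125 in expectation, so whether a given pair is flagged by a "more than 0.125" rule depends on the realised sharing estimated from the data.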
Associated publication:
- Rietveld et al 2013 (https://doi.org/10.1126/science.1235488)
3.3.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema
id: alspacdcs:gwa_660_g0m_2022-12-05_f4
name: Freeze 4 version 2022-12-05 Genome-wide - Illumina 660 quad - G0 mothers
description: >-
  Freeze 4 of genome-wide array data including genotype calls for G0 mothers
freeze_size: 2G
linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112
git_tag: https://github.com/alspac/dataset_gwa_660_g0m/releases/tag/freeze4
is_current_freeze: true
freeze_number: 4
freeze_date: 2024-06-11
freeze_of_alspac_dataset_version: alspacdcs:gwa_660_g0m_2022-12-05
freeze_of_named_alspac_dataset: alspacdcs:gwa_660_g0m
has_containers:
  - id: alspacdcs:8dc6326a-db1a-41c8-ba6c-58e5be88d37f
    name: data
    description: A dir/folder containing the plink data files
  - id: alspacdcs:f0a7bb41-de9d-45c0-99c3-b91785e5b946
    name: legacy1
    description: A dir/folder containing the plink data files. Includes full set of SNPs but is missing ~500 mothers who were excluded in legacy QC due to strict relatedness inclusion thresholds.
    belongs_to_container: alspacdcs:8dc6326a-db1a-41c8-ba6c-58e5be88d37f
  - id: alspacdcs:7f1bd1da-f18e-4e07-bdfd-07ebbeb3049f
    name: legacy2
    description: A dir/folder containing the plink data files Includes full set of individuals but due to legacy QC is restricted to a set of ~480k SNPs that overlap with the Illumina 550k array (which was used for G1).
    belongs_to_container: alspacdcs:8dc6326a-db1a-41c8-ba6c-58e5be88d37f
has_parts:
  - id: alspacdcs:16618c28-c82b-452a-b9a5-f63e86063c15
    name: Biallelic genotype table
    description: >-
      The genetic data
    data_distributions:
      - id: alspacdcs:c0b64191-04a4-45a4-bfbb-921ac9a06755
        name: freeze_id.bed
        description: >-
          Legacy 1 plink bed file. Primary representation of genotype calls
          at biallelic variants.
          Must be accompanied by .bim and .fam files.
          The legacy1 distribution of the plink bed file.
        md5sum: be66d3cc1d3d906c4d396cc161a605b1
        filesize: 1020M
        filetype: .bed
        number_of_participants: 8118
        number_of_variants: 526688
        belongs_to_container: alspacdcs:f0a7bb41-de9d-45c0-99c3-b91785e5b946
      - id: alspacdcs:ab9b5bf0-2029-4489-8e7f-993d4370823b
        name: freeze_id.bed
        description: >-
          Legacy 2 plink bed file. Primary representation of genotype calls
          at biallelic variants. Must be accompanied by .bim and .fam files.
          The legacy2 distribution of the plink bed file.
        md5sum: 7559903a4811210f6289497e1323dfe7
        filesize: 961M
        filetype: .bed
        number_of_variants: 465740
        number_of_participants: 8648
        belongs_to_container: alspacdcs:7f1bd1da-f18e-4e07-bdfd-07ebbeb3049f
  - id: alspacdcs:69d8a90d-486e-45ec-a23c-325266a11ccd
    name: Variant Information
    description: >-
      Information about genetic variants
    data_distributions:
      - id: alspacdcs:68ec2931-a8f3-4d0b-a813-8f720738334c
        name: freeze_id.bim
        description: >-
          Legacy 1 Extended variant information file accompanying a .bed
          binary genotype table. (--make-just-bim can be used to update
          just this file.)
          A text file with no header line, and one line per variant with
          the following six fields:
          1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
          2. Variant identifier
          3. Position in morgans or centimorgans (safe to use dummy value of '0')
          4. Base-pair coordinate (1-based; limited to 2^31-2)
          5. Allele 1 (corresponding to clear bits in .bed; usually minor)
          6. Allele 2 (corresponding to set bits in .bed; usually major)
        md5sum: be66d3cc1d3d906c4d396cc161a605b1
        filesize: 14M
        filetype: .bim
        number_of_variants: 526688
        belongs_to_container: alspacdcs:f0a7bb41-de9d-45c0-99c3-b91785e5b946
      - id: alspacdcs:fc002279-f815-4d8e-af7d-f6094c1f3be6
        name: freeze_id.bim
        description: >-
          Legacy 2 Extended variant information file accompanying a .bed
          binary genotype table. (--make-just-bim can be used to update
          just this file.)
          A text file with no header line, and one line per variant with
          the following six fields:
          1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
          2. Variant identifier
          3. Position in morgans or centimorgans (safe to use dummy value of '0')
          4. Base-pair coordinate (1-based; limited to 2^31-2)
          5. Allele 1 (corresponding to clear bits in .bed; usually minor)
          6. Allele 2 (corresponding to set bits in .bed; usually major)
        md5sum: b4a1adb225de05d92d0af585950fd423
        filesize: 13M
        filetype: .bim
        number_of_variants: 465740
        belongs_to_container: alspacdcs:7f1bd1da-f18e-4e07-bdfd-07ebbeb3049f
  - id: alspacdcs:b82ab933-6a04-4932-9d4d-19df2b4e0391
    name: Sample information
    description: >-
      Information about the samples for the dataset
    data_distributions:
      - id: alspacdcs:6b63f49c-2935-4322-b376-b524802e8649
        name: freeze_id.fam
        description: >-
          legacy 1 A text file with no header line, and one line per sample
          with the following six fields:
          1. Family ID ('FID')
          2. Within-family ID ('IID'; cannot be '0')
          3. Within-family ID of father ('0' if father isn't in dataset)
          4. Within-family ID of mother ('0' if mother isn't in dataset)
          5. Sex code ('1' = male, '2' = female, '0' = unknown)
          6. Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)
        md5sum: 68019c4b1907d320c9ba4e5e3b4343f8
        filesize: 256K
        filetype: .fam
        number_of_participants: 8118
        belongs_to_container: alspacdcs:f0a7bb41-de9d-45c0-99c3-b91785e5b946
      - id: alspacdcs:5edc9078-1a38-4e75-8320-110cfd4195b8
        name: freeze_id.fam
        description: >-
          legacy2 A text file with no header line, and one line per sample
          with the following six fields:
          1. Family ID ('FID')
          2. Within-family ID ('IID'; cannot be '0')
          3. Within-family ID of father ('0' if father isn't in dataset)
          4. Within-family ID of mother ('0' if mother isn't in dataset)
          5. Sex code ('1' = male, '2' = female, '0' = unknown)
          6. Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)
        md5sum: ca78d5b8f96df516a7af3862de6ba8f6
        filesize: 448k
        filetype: .fam
        number_of_participants: 8648
        belongs_to_container: alspacdcs:7f1bd1da-f18e-4e07-bdfd-07ebbeb3049f
  - id: alspacdcs:93b5d2be-8e64-4770-af96-284093f0e508
    name: Log information
    description: >-
      Information about the plink run for making the dataset
    data_distributions:
      - id: alspacdcs:28da8b88-e68a-4d5b-be09-9a545b427c48
        name: freeze_id.log
        description: >-
          legacy 1 plink log file
        md5sum: 5adb293d1f0c0312b90ef3ab79c567b2
        filesize: 512
        filetype: .log
        belongs_to_container: alspacdcs:f0a7bb41-de9d-45c0-99c3-b91785e5b946
      - id: alspacdcs:24fef9cb-60d3-4a0f-93fc-86a2e01e53c5
        name: freeze_id.log
        description: >-
          legacy 2 plink log file
        md5sum: c713ee6c86477fbd29f329689005fc53
        filesize: 512
        filetype: .log
        belongs_to_container: alspacdcs:f0a7bb41-de9d-45c0-99c3-b91785e5b946
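The six-field `.bim` layout described in the freeze document is straightforward to read without PLINK. The following is a minimal sketch; the `BimRecord` type and `parse_bim_line` helper are hypothetical names, and the example line is made up rather than taken from ALSPAC data.

```python
from typing import NamedTuple


class BimRecord(NamedTuple):
    """One line of a PLINK .bim file (six whitespace-separated fields)."""
    chrom: str          # chromosome code or name
    variant_id: str     # variant identifier
    cm_position: float  # position in morgans or centimorgans (often 0)
    bp_position: int    # 1-based base-pair coordinate
    allele1: str        # usually the minor allele
    allele2: str        # usually the major allele


def parse_bim_line(line: str) -> BimRecord:
    """Parse a single .bim line; raises ValueError if it is malformed."""
    chrom, variant_id, cm, bp, allele1, allele2 = line.split()
    return BimRecord(chrom, variant_id, float(cm), int(bp), allele1, allele2)


# Made-up example line, for illustration only.
example = parse_bim_line("1\trs_example\t0\t12345\tA\tG")
print(example)
```

Reading the `.bim` files of both legacy distributions this way would allow the two variant sets (526688 and 465740 variants) to be compared by `variant_id`.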
3.4 Genome-wide - CNV - G1 (cnv_550_g1)
3.4.1 Description
This dataset contains predicted ALSPAC CNVs using PennCNV, generated from 23andMe raw genotype data.
3.4.2 Methodology
LRR and BAF data were missing from the 23andMe raw genotype data, so we had to generate these data ourselves using an in-house algorithm. Once the data were generated, we ran PennCNV using the hh550 libraries.
Filtered PennCNV calls are also provided. Multiple calls were merged using the 'clean_cnv.pl' script, using a merge fraction of 0.5. Individuals with > 30 CNVs, a Log R Ratio SD of > 0.3, a BAF drift of > 0.002, and a waviness factor of > 0.05 were removed. CNVs in which at least 50% of the length of the CNV call overlapped with any telomeric, centromeric or immunoglobulin region were removed using the 'scan_region.pl' script in PennCNV.
In addition, CNVs covering fewer than 5 probes, of a length < 5kb, and with a confidence score below 10 were removed. Density was calculated as the number of probes in a CNV divided by the length of the CNV, and CNVs where the density of probes across the call was < 1 probe per 20kb were removed.
These QC parameters are suggestions only and are provided in filtered.cnv. Analysts can apply their own filter parameters to the raw calls in data.cnv.
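For analysts who want to apply their own filters to the raw calls, the CNV-level thresholds suggested above can be sketched as follows. This is an illustration, not the script used to build the filtered file: the `CnvCall` record and its field names are hypothetical, and the sketch assumes that failing any single criterion is enough to remove a call.

```python
from dataclasses import dataclass

# Suggested CNV-level thresholds from the methodology text.
MIN_PROBES = 5            # calls covering fewer than 5 probes are removed
MIN_LENGTH_BP = 5_000     # calls shorter than 5kb are removed
MIN_CONFIDENCE = 10.0     # calls with confidence below 10 are removed
BP_PER_PROBE_MAX = 20_000  # density below 1 probe per 20kb is removed


@dataclass
class CnvCall:
    """Hypothetical CNV call record (field names are assumptions)."""
    n_probes: int
    length_bp: int
    confidence: float


def passes_cnv_qc(call: CnvCall) -> bool:
    """Return True if the call meets all four suggested criteria."""
    # Density >= 1 probe per 20kb, written with integers to avoid
    # floating-point comparisons: n_probes / length >= 1 / 20000.
    dense_enough = call.n_probes * BP_PER_PROBE_MAX >= call.length_bp
    return (
        call.n_probes >= MIN_PROBES
        and call.length_bp >= MIN_LENGTH_BP
        and call.confidence >= MIN_CONFIDENCE
        and dense_enough
    )


# Made-up calls, for illustration only.
calls = [
    CnvCall(n_probes=10, length_bp=50_000, confidence=25.0),  # kept
    CnvCall(n_probes=4, length_bp=50_000, confidence=25.0),   # too few probes
    CnvCall(n_probes=5, length_bp=200_000, confidence=25.0),  # too sparse
    CnvCall(n_probes=10, length_bp=50_000, confidence=5.0),   # low confidence
]
results = [passes_cnv_qc(c) for c in calls]
print(results)
```

Changing the four constants is enough to explore stricter or looser filters on the unfiltered calls.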
3.4.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema
id: alspacdcs:cnv_550_g1_2015-11-09_f4
name: Genome-wide - CNV - G1 release version 2015-11-09 freeze 4
description: >-
  This is the fourth freeze of the 2015-11-09 version of cnv_550_g1 dataset.
  It contains two csv versions of the cnv called data, the unfilterd and
  filtered versions.
freeze_size: 27m
linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112
git_tag: https://github.com/alspac/dataset_cnv_550_g1/releases/tag/freeze4
is_current_freeze: true
freeze_number: 4
freeze_date: 2024-06-11
previous_freeze: alspacdcs:cnv_550_g1_2015-11-09_f3
freeze_of_alspac_dataset_version: alspacdcs:cnv_550_g1_2015-11-09
freeze_of_named_alspac_dataset: alspacdcs:cnv_550_g1
has_parts:
  - id: alspacdcs:061d8035-d73e-4211-afdd-8dbb67a96d20_cnv_550_g1_2015-11-09_cnvdata_f4
    name: Unfiltered CNV data
    description: >-
      This is the output of Penncnv before filtering.
      columns
      V1 - Position
      V2 - Number of markers in the region
      V3 - CNV length
      V4 - Copy number estimate
      V6 - Start SNP
      V7 - End SNP
      V8 - Confidence score
      qlet - within pregnancy ID
      cnv_550_g1 - Individual ID
    data_distributions:
      - id: alspacdcs:4e23b21840c200f56b1b5ccf227a6e59_new_cnvdata.csv
        name: new_cnvdata.csv
        description: >-
          This is the csv file for the output of Penncnv before filtering.
        md5sum: 4e23b21840c200f56b1b5ccf227a6e59
        filesize: 21M
        filetype: .csv
        number_of_participants: 7449
        #data$id_qlet <- paste(data$cnv_550_g1, data$qlet, sep="_")
        #length(unique(data$id_qlet))
        number_of_cnv_variants: 70029
        # Read file into R as data then:
        # dim(unique(data[1]))
        belongs_to_container: alspacdcs:dcf3e9e0-216b-4c2b-b3e1-ace2690c31bc
  - id: alspacdcs:cnv_550_g1_2015-11-09_filtered_f4
    name: Filtered CNV data
    description: >-
      CNV data that has been filtered.
      columns
      V1 - Position
      V2 - Number of markers in the region
      V3 - CNV length
      V4 - Copy number estimate
      V6 - Start SNP
      V7 - End SNP
      V8 - Confidence score
      qlet - within pregnancy ID
      cnv_550_g1 - Individual ID
    data_distributions:
      - id: alspacdcs:f825f62ec1cd49b8c2a059b4c5f6f13a_new_filtered.csv
        name: new_filtered.csv
        description: >-
          This is the csv file for the output of Penncnv after filtering.
        md5sum: f825f62ec1cd49b8c2a059b4c5f6f13a
        filesize: 5.9M
        filetype: .csv
        number_of_participants: 6792
        # Read into data 2 in r
        # data2$id_qlet <- paste(data2$cnv_550_g1, data2$qlet, sep="_") and length(unique(data2$id_qlet))
        number_of_cnv_variants: 14244
        #Read into data2 in r then
        #length(unique(data2$V1))
        belongs_to_container: alspacdcs:dcf3e9e0-216b-4c2b-b3e1-ace2690c31bc
has_containers:
  - id: alspacdcs:dcf3e9e0-216b-4c2b-b3e1-ace2690c31bc ## uuid
    name: data
    description: A dir/folder containing the two freeze data files
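The R comments in the freeze document describe how the participant and CNV counts were derived. A rough Python equivalent is sketched below, using the column names listed in the freeze document (`V1`, `qlet`, `cnv_550_g1`). The in-memory sample rows and their value formats are invented for illustration; the layout of the real csv files may differ.

```python
import csv
import io

# Made-up rows using the documented column names; real values will differ.
sample_csv = """V1,cnv_550_g1,qlet
chr1:1000-2000,1001,A
chr1:1000-2000,1002,A
chr2:5000-9000,1001,A
chr3:100-900,1001,B
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Participants: unique combinations of individual ID and within-pregnancy ID,
# mirroring paste(data$cnv_550_g1, data$qlet, sep="_") in the R comment.
participants = {f"{row['cnv_550_g1']}_{row['qlet']}" for row in rows}

# CNV variants: unique values of the position column V1,
# mirroring length(unique(data$V1)).
cnv_regions = {row["V1"] for row in rows}

print(len(participants), len(cnv_regions))  # 3 3
```

To run this on a freeze, the `io.StringIO(sample_csv)` source would be replaced with an open handle on `new_cnvdata.csv` or `new_filtered.csv`.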
4 Imputed Data
4.1 Genome-wide - HRC imputed - G0 mothers + G1 (gi_hrc_g0m_g1)
SNP chips are useful for the generation of data on hundreds of thousands of SNPs, but there are millions more polymorphisms that remain untyped with this technology. If suitable numbers of whole genome sequences exist (e.g. 1000 genomes data) then millions of genotypes that are missing from a sample because they have not been typed by SNP chips can be imputed using probabilistic methods. Here the ALSPAC mother and children data were imputed to a new reference panel known as the Haplotype Reference Consortium (HRC) panel. This comprises around 31000 sequenced individuals (mostly European), so the coverage of European haplotypes is much greater than in other panels. As a consequence imputation accuracy is expected to improve, particularly at lower frequencies.
4.1.1 Description
This dataset contains genotype data imputed to HRC for G0 mothers and G1. Reference genome build: GRCh37
4.1.2 Methodology
ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platforms by 23andme subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%) and insufficient sample replication (IBD < 0.8).
Population stratification was assessed by multidimensional scaling analysis and compared with Hapmap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed.
SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed. Cryptic relatedness was measured as proportion of identity by descent (IBD > 0.1).
Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,115 subjects and 500,527 SNPs passed these quality control filters.
ALSPAC mothers were genotyped using the Illumina human660W-quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs. SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally SNPs with a minor allele frequency of less than 1% were removed.
Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded.
Cryptic relatedness was assessed using an IBD estimate of more than 0.125, which is expected to correspond to roughly 12.5% of alleles shared IBD, or relatedness at the first cousin level. Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,048 subjects and 526,688 SNPs passed these quality control filters.
We combined the 477,482 SNP genotypes common to the sample of mothers and the sample of children. We removed SNPs with genotype missingness above 1% due to poor quality (11,396 SNPs removed) and removed a further 321 subjects due to potential ID mismatches. This resulted in a dataset of 17,842 subjects, containing 6,305 duos, and 465,740 SNPs (a further 112 SNPs were removed during liftover and 234 were out of HWE after combination). We estimated haplotypes using SHAPEIT (v2.r644), which utilises relatedness during phasing. The phased haplotypes were then imputed to the Haplotype Reference Consortium (HRCr1.1, 2016) panel of approximately 31,000 phased whole genomes. The HRC panel was phased using SHAPEIT v2.r727, and the imputation was performed using the Michigan Imputation Server.
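The subject and SNP totals reported in this methodology are consistent with simple subtraction of the listed exclusions. The short check below reproduces that arithmetic. It assumes that the combined subject count is the sum of the QC-passed children and mothers minus the 321 removed subjects, which the text implies but does not state outright; the variable names are illustrative.

```python
# Subjects passing array QC, as quoted in the methodology.
children_passing_qc = 9_115
mothers_passing_qc = 9_048
removed_id_mismatch = 321

combined_subjects = children_passing_qc + mothers_passing_qc - removed_id_mismatch

# SNPs in common between the two arrays, and the exclusions listed in the text.
snps_in_common = 477_482
removed_missingness = 11_396
removed_liftover = 112
removed_hwe = 234

combined_snps = snps_in_common - removed_missingness - removed_liftover - removed_hwe

print(combined_subjects, combined_snps)
```

Both results match the figures reported in the text (17,842 subjects and 465,740 SNPs).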
4.1.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema
id: alspacdcs:gi_hrc_g0m_g1_2017-05-04_f4
name: >-
  Genome-wide - HRC imputed - G0 mothers + G1 version 2017-05-04 freeze 4
description: >-
  Freeze 4 of version 2017-05-04 Genome-wide array data imputed to the HRC reference panel for G0 mothers and G1 individuals in bgen and sample file format (version 1.2).
freeze_size: 114G
linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112
git_tag: https://github.com/alspac/dataset_gi_hrc_g0m_g1/releases/tag/freeze4
is_current_freeze: true
freeze_number: 4
freeze_date: 2024-06-11
previous_freeze: alspacdcs:gi_hrc_g0m_g1_2017-05-04_f3
freeze_of_alspac_dataset_version: alspacdcs:gi_hrc_g0m_g1_2017-05-04
freeze_of_named_alspac_dataset: alspacdcs:gi_hrc_g0m_g1
has_containers:
  - id: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7 ## uuid
    name: data
    description: A dir/folder containing the freeze data bgen and .sample files
has_parts:
  - id: alspacdcs:78966822-c2fc-4f49-bc12-bbe40aa2ba75
    name: Omics ID sample
    data_distributions:
      - id: alspacdcs:15631c02-08be-4bfb-add8-e936e6bd9ed3
        name: swapped.sample
        md5sum: 33c8b6168dee47c563cec5abf124a672
        filesize: 1008K
        filetype: .sample
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
      - id: alspacdcs:aea0183c-4283-478e-8a11-141ff0629c89
        name: swapped_23_female.sample
        md5sum: 7606183e5b5195182c1e9ef61d88d1d3
        filesize: 752K
        filetype: .sample
        number_of_participants: 12943
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
      - id: alspacdcs:7e9a6c3a-f821-46e5-860e-fd7a38af98d3
        name: swapped_23_male.sample
        md5sum: 34540d02f1271a8c99785989c8888496
        filesize: 272K
        filetype: .sample
        number_of_participants: 4501
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:b2de2079-e524-461b-ab28-7ea88a2bd885
    name: filtered_01
    data_distributions:
      - id: alspacdcs:f3c493d3-7e7a-4df7-b4bc-e4df38ca5fa8
        name: filtered_01.bgen
        md5sum: 9727306a156ab88f72dedbdcaffc1105
        filesize: 8.6GB
        filetype: .bgen
        number_of_variants: 3069932
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:eb48fc81-4231-43f6-afd8-a763095ef049
    name: filtered_02
    data_distributions:
      - id: alspacdcs:aaff25c6-0480-4214-b5a3-905528db1e89
        name: filtered_02.bgen
        md5sum: a8cb970994e21c02eceea92a513ebef6
        filesize: 8.7GB
        filetype: .bgen
        number_of_variants: 3392238
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:3bb918e9-dab9-47a9-a6e0-2e1a3206b96a
    name: filtered_03
    data_distributions:
      - id: alspacdcs:27acaf2b-3867-44b7-b16e-c1f5114893c2
        name: filtered_04.bgen
        md5sum: 7e1586647816f4607b9e528be4893b5c
        filesize: 7.3GB
        filetype: .bgen
        number_of_variants: 2821895
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:de9a35f5-3660-46fb-80d8-971122992ee6
    name: filtered_04
    data_distributions:
      - id: alspacdcs:8a62a054-ef47-4ef6-87a2-305213007c74
        name: filtered_04.bgen
        md5sum: 9bb513a014c18a3a0a1ea11dcf63cc1b
        filesize: 7.9GB
        filetype: .bgen
        number_of_variants: 2787582
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:f910f928-969d-4abd-8393-3c175edd58e2
    name: filtered_05
    data_distributions:
      - id: alspacdcs:9cdc5b2b-65b5-4b6e-b2b4-2e884e506ced
        name: filtered_05.bgen
        md5sum: 92a2d759a5bcc18d0134dc7802302055
        filesize: 6.7GB
        filetype: .bgen
        number_of_variants: 2588170
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:0f9acc14-28f6-42ed-8dc1-0923ae231574
    name: filtered_06
    data_distributions:
      - id: alspacdcs:95969370-c593-4de2-a05a-68eb17a85293
        name: filtered_06.bgen
        md5sum: 5f68a69cd54a89b8db5577711f2a7934
        filesize: 6.4GB
        filetype: .bgen
        number_of_variants: 2460112
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:2bc64572-7dd5-4c0c-93f9-30fd598cd6c9
    name: filtered_07
    data_distributions:
      - id: alspacdcs:bf564b40-37d7-4654-a996-590212863971
        name: filtered_07.bgen
        md5sum: cd02eefdb350d9859ea7a5975d5ee73a
        filesize: 6.7GB
        filetype: .bgen
        number_of_variants: 2289306
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:b0b446de-842c-43cf-add6-4957e17b16e2
    name: filtered_08
    data_distributions:
      - id: alspacdcs:a08d5fec-f380-4c08-b366-196ad509439b
        name: filtered_08.bgen
        md5sum: 68b4ea416441637c01ebcc1c2e9ac8cf
        filesize: 5.7GB
        filetype: .bgen
        number_of_variants: 2242706
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:7722d481-eb31-4678-ab4b-d32fdefd1ebb
    name: filtered_09
    data_distributions:
      - id: alspacdcs:c3b256b2-6422-4970-86cc-3108c80c7d2a
        name: filtered_09.bgen
        md5sum: a262516e4a9c48fe2b7edfb68a0f0577
        filesize: 4.5GB
        filetype: .bgen
        number_of_variants: 1675899
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:e3ac74ff-38ab-42af-8448-1228a67cd02c
    name: filtered_10
    data_distributions:
      - id: alspacdcs:f6fdbefc-4df6-4c44-8bc2-adb96a699501
        name: filtered_10.bgen
        md5sum: 659c1e9b8c9500aa02b84d8a121e4a23
        filesize: 5.2GB
        filetype: .bgen
        number_of_variants: 1927504
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:9797bc6d-6cb3-4459-8f40-f0d3c32c07db
    name: filtered_11
    data_distributions:
      - id: alspacdcs:8516c34a-8803-40fe-a8e7-d0183d7fcb67
        name: filtered_11.bgen
        md5sum: 94ae65053c6cb28ffa5413a447bea2a7
        filesize: 5.3GB
        filetype: .bgen
        number_of_variants: 1936990
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:38b72245-748f-4eb3-b1d5-9b0548141454
    name: filtered_12
    data_distributions:
      - id: alspacdcs:da40bb2d-2b8f-449f-b98d-c66c720009c7
        name: filtered_12.bgen
        md5sum: 5e488efe1865265b70f0db0ba0e8ceb2
        filesize: 5.1GB
        filetype: .bgen
        number_of_variants: 1848118
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:448ee503-3596-432b-b6b7-71d09895c3db
    name: filtered_13
    data_distributions:
      - id: alspacdcs:87f60018-85ed-41ce-97f4-1cabc4f3b825
        name: filtered_13.bgen
        md5sum: c6d8c39e1714020ef24236ce0e0e65f4
        filesize: 3.7GB
        filetype: .bgen
        number_of_variants: 1385434
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:0469011c-c9fa-47c6-a18e-81d126b36a91
    name: filtered_14
    data_distributions:
      - id: alspacdcs:fbf207d1-5a86-44da-9acf-e78a139e8455
        name: filtered_14.bgen
        md5sum: a7ceaec0d5986e1396214bbc4a8bcfb5
        filesize: 3.6GB
        filetype: .bgen
        number_of_variants: 1266536
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:1e7c4590-fcb9-402a-96ab-f0a08ca31457
    name: filtered_15
    data_distributions:
      - id: alspacdcs:9527cff8-33fe-47f6-a2cc-456b0391c3c4
        name: filtered_15.bgen
        md5sum: 30a19dcda6047a6ac690d650ee5fea8c
        filesize: 3.4GB
        filetype: .bgen
        number_of_variants: 1139215
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:c9ad086a-611c-4119-b93c-25427598c3ad
    name: filtered_16
    data_distributions:
      - id: alspacdcs:94542142-6f87-4e44-90cb-c04936e5114e
        name: filtered_16.bgen
        md5sum: d4ffb3324217ec7ac9e3716ae3de9106
        filesize: 4.1GB
        filetype: .bgen
        number_of_variants: 1281298
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:0dbf46b8-da95-4362-b66d-487bd4e9923d
    name: filtered_17
    data_distributions:
      - id: alspacdcs:efefc34b-18c5-43fc-8b8e-ec9af2d343ab
        name: filtered_17.bgen
        md5sum: a0baaf8155e3e97ee33d440035877a96
        filesize: 3.6GB
        filetype: .bgen
        number_of_variants: 1090072
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:f4b5f154-4e6f-44ba-885b-39cb52a77df5
    name: filtered_18
    data_distributions:
      - id: alspacdcs:9719bf81-5bac-4f53-8fa8-13df06907351
        name: filtered_18.bgen
        md5sum: 1236c268dfab2d46148835e50efcec5d
        filesize: 3.2GB
        filetype: .bgen
        number_of_variants: 1104755
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:f4c9f93b-3b98-4d1b-80db-ae24d35bbf25
    name: filtered_19
    data_distributions:
      - id: alspacdcs:13a48032-774b-4db4-a57e-ffbd9bdfb540
        name: filtered_19.bgen
        md5sum: 1c17198a8d5a7be881d671559048d073
        filesize: 3.5GB
        filetype: .bgen
        number_of_variants: 868554
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:56ba0748-364c-4013-95bd-50e0f9f4d6ca
    name: filtered_20
    data_distributions:
      - id: alspacdcs:b1d608b5-ed9c-4ca7-838c-89c8507e0bf9
        name: filtered_20.bgen
        md5sum: 336791734294796bcc5c725048756155
        filesize: 2.6GB
        filetype: .bgen
        number_of_variants: 884983
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:a610277c-3364-474f-bb15-349555976465
    name: filtered_21
    data_distributions:
      - id: alspacdcs:13d0df10-9e76-4829-9b83-e2ce4a95dded
        name: filtered_21.bgen
        md5sum: d97d780938173eb14c5c1aae66e1005e
        filesize: GB
        filetype: .bgen
        number_of_variants: 531276
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:db30e9f1-4441-4bc7-b867-873b04e87481
    name: filtered_22
    data_distributions:
      - id: alspacdcs:48c0dcf4-a3e1-442f-8d6c-9dbc7c5b5af3
        name: filtered_22.bgen
        md5sum: 343581eebfe7e38242db0c8b019c2264
        filesize: 1.8GB
        filetype: .bgen
        number_of_variants: 524544
        number_of_participants: 17444
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:00b3d2af-7c46-41a3-b50c-f17dfa8e1e61
    name: filtered_23female
    data_distributions:
      - id: alspacdcs:9e15dbd4-27fd-471b-ac73-0d5bae9649e6
        name: filtered_23female.bgen
        md5sum: d4abdc0d84bda1f8a3eec5c9cee8977b
        filesize: 4.2GB
        filetype: .bgen
        number_of_variants: 1228035
        number_of_participants: 12943
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
  - id: alspacdcs:347b682b-9708-454b-9f5f-86deec277d19
    name: swapped_23_male
    data_distributions:
      - id: alspacdcs:d9282f36-6915-46c3-bda9-ff9ab6c8c56c
        name: swapped_23_male.sample
        md5sum: bebe6967a0489a186166d61cd1b07a18
        filesize: 1.3GB
        filetype: .bgen
        number_of_variants: 1228035
        number_of_participants: 4501
        belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
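The `md5sum` values in a freeze description can be used to check that a received file is intact. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the file path in the usage note is an assumption about where a freeze might be stored locally rather than something this catalogue defines.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks.

    Chunked reading keeps memory use low for multi-gigabyte .bgen files.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected_md5):
    """Return True if the file's MD5 digest matches the catalogue value."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `verify_md5("data/filtered_22.bgen", "343581eebfe7e38242db0c8b019c2264")` would compare a local copy of `filtered_22.bgen` (hypothetical path) with the checksum listed for it in this freeze.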
4.2 Genome-wide - HapMap2 imputed - G1 (gi_hapmap2_g1)
4.2.1 Description
This dataset contains genotype data imputed to HapMap 2 for G1 individuals. Reference genome build: GRCh36 (NCBI Build 36, hg18)
4.2.2 Methodology
A total of 9,912 subjects were genotyped using the Illumina HumanHap550 quad genome-wide SNP genotyping platform by 23andMe, which subcontracted the Wellcome Trust Sanger Institute, Cambridge, UK, and the Laboratory Corporation of America, Burlington, NC, USA.
Individuals were excluded from further analysis on the basis of having incorrect gender assignments; minimal or excessive heterozygosity (<0.320 or >0.345 for the Sanger data and <0.310 or >0.330 for the LabCorp data); disproportionate levels of individual missingness (>3%); evidence of cryptic relatedness (>10% IBD); and being of non-European ancestry (as detected by a multidimensional scaling analysis seeded with HapMap 2 individuals). EIGENSTRAT analysis revealed no additional obvious population stratification, and genome-wide analyses with other phenotypes indicate a low lambda. The resulting data set consisted of 8,365 individuals (84% of those genotyped).
SNPs with a minor allele frequency of <1% or a call rate of <95% were removed. Furthermore, only SNPs that passed an exact test of Hardy-Weinberg equilibrium (P > 5E-7) were considered for analysis. Genotypes were subsequently imputed with the MACH 1.0.16 Markov Chain Haplotyping software, using CEPH individuals from phase 2 of the HapMap project as a reference set (release 22).
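The centre-specific heterozygosity bounds and the quoted retention rate can be illustrated with a few lines of Python. This is a sketch only: the dictionary, the function name and the example heterozygosity values are hypothetical, and only the numeric bounds and subject counts come from the text above.

```python
# Acceptable heterozygosity range per genotyping centre, as quoted in the text.
HET_BOUNDS = {
    "Sanger": (0.320, 0.345),
    "LabCorp": (0.310, 0.330),
}


def heterozygosity_ok(centre, het_rate):
    """Return True if a sample's heterozygosity lies within its centre's bounds."""
    lower, upper = HET_BOUNDS[centre]
    return lower <= het_rate <= upper


# The same heterozygosity value can pass at one centre and fail at the other.
print(heterozygosity_ok("Sanger", 0.335), heterozygosity_ok("LabCorp", 0.335))

# Retention rate implied by the reported counts.
genotyped = 9912
retained = 8365
retained_pct = round(100 * retained / genotyped)
print(retained_pct)
```

The rounded retention rate agrees with the 84% stated in the methodology.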
Associated publication:
4.2.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema
id: alspacdcs:gi_hapmap2_g1_2022-12-07_f4
name: Genome-wide - HapMap2 imputed - G1 version 2022-12-07 freeze 4
description: >-
  Freeze 4 of 2022-12-07 version of Genome-wide array data imputed to the HapMap2 reference panel for G1 individuals
freeze_size: 5G
linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112
git_tag: https://github.com/alspac/dataset_gi_hapmap2_g1/releases/tag/freeze4
is_current_freeze: true
freeze_number: 4
freeze_date: 2024-06-11
previous_freeze: alspacdcs:gi_hapmap2_g1_2022-12-07_f3
freeze_of_alspac_dataset_version: alspacdcs:gi_hapmap2_g1_2022-12-07
freeze_of_named_alspac_dataset: alspacdcs:gi_hapmap2_g1
has_containers:
  - id: alspacdcs:5595c2d9-feb4-4eb1-918c-ff3579ce6e21
    name: data
    description: A dir/folder containing the plink freeze data files
has_parts:
  - id: alspacdcs:5d8b20a5-b2d3-4d3b-a02e-fe865810dd92
    name: freeze_id
    data_distributions:5978bf8c-9302-4c31-bb80-27ed307e93b1
      - id: alspacdcs:42cb2bae-f94e-4a75-8025-d954db951d0d
        name: freeze_id.fam
        md5sum: 6b5ddc58729fdb5997fd0004e9ae8055
        filesize: 288KB
        filetype: .fam
        number_of_participants: 8223
        belongs_to_container: alspacdcs:5595c2d9-feb4-4eb1-918c-ff3579ce6e21
  - id: alspacdcs:f7582580-52ce-44d5-9572-ffd8b7fa0391
    name: freeze_id
    data_distributions:
      - id: alspacdcs:dc2941cc-4329-4f28-9d27-a0b23d8dcf53
        name: freeze_id.bim
        md5sum: a1ebaaf6286af5b12f4561b380cd302a
        filesize: 68MB
        filetype: .bim
        number_of_variants: 2543887
        belongs_to_container: alspacdcs:5595c2d9-feb4-4eb1-918c-ff3579ce6e21
  - id: alspacdcs:2a8a7c04-c771-4c80-8e53-32012bcf6cbe
    name: freeze_id
    data_distributions:
      - id: alspacdcs:4c4667a7-2c3b-4c61-a645-8fd398674a47
        name: freeze_id.bed
        md5sum: c1b6c00b67513aef2147d6d507c4d1be
        filesize: 4.9GB
        filetype: .bed
        number_of_variants: 2543887
        number_of_participants: 8223
        belongs_to_container: alspacdcs:5595c2d9-feb4-4eb1-918c-ff3579ce6e21
  - id: alspacdcs:ac4c4c8c-1da4-4c26-9c4d-3bdea467d5b7
    name: freeze_id
    data_distributions:
      - id: alspacdcs:ce5bf2cc-3a1d-479a-97ef-dcce860b9eda
        name: freeze_id.log
        md5sum: 6ebb804e83f17af2bcca0dfb7f143f56
        filesize: 958B
        filetype: .log
        belongs_to_container: alspacdcs:5595c2d9-feb4-4eb1-918c-ff3579ce6e21
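The variant and participant counts listed for `freeze_id.bed` can be cross-checked against its file size. The sketch below assumes the file is a standard variant-major PLINK 1 binary file (3 header bytes, then 2 bits per genotype with each variant padded to a whole byte) and that the catalogue reports sizes in binary units; neither assumption is stated in the freeze description, and the function name is illustrative.

```python
def expected_bed_size_bytes(n_variants, n_samples):
    """Expected size of a variant-major PLINK .bed file.

    Each variant occupies ceil(n_samples / 4) bytes because four genotypes
    are packed into one byte; three magic/mode bytes precede the data.
    """
    bytes_per_variant = (n_samples + 3) // 4
    return 3 + n_variants * bytes_per_variant


size_bytes = expected_bed_size_bytes(n_variants=2_543_887, n_samples=8_223)
size_gib = size_bytes / 2**30
print(size_bytes, round(size_gib, 2))
```

Under these assumptions the expected size is about 4.87 GiB, which is in line with the 4.9GB listed for the file.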
4.3 Genome-wide - HapMap2 imputed - G0 mothers (gi_hapmap2_g0m)
4.3.1 Description
This dataset contains genotype data imputed to HapMap 2 for G0 mothers. Reference genome build: GRCh36 (NCBI Build 36, hg18)
4.3.2 Methodology
A total of 10,015 women (mothers from the ALSPAC cohort) were genotyped using the Illumina 660 quad SNP chip, which contains 557,124 SNP markers. Markers with minor allele frequency < 1%, SNPs with >5% missing genotypes and any markers that failed an exact test of Hardy-Weinberg equilibrium (P < 1E-6) were excluded from further analyses. Genome-wide identity by state sharing was calculated for each pair of individuals in the cohort to identify cryptic relatedness.
In order to identify individuals who might have ancestries other than Western European, we merged data from both cohorts with the 60 western European (CEU) founder, 60 Nigerian (YRI) founder and 90 Japanese (JPT) and Han Chinese (CHB) individuals from the International HapMap Project. Genome-wide IBS distances for each pair of individuals were calculated on markers shared between the HapMap and the Illumina 660K SNP chip, and then the multidimensional scaling option in R was used to generate a two-dimensional plot based upon individuals' scores on the first two principal coordinates from this analysis. Samples that did not cluster with the CEU individuals were excluded from subsequent analyses. In addition, we plotted the proportion of missing data for each individual against their genome-wide heterozygosity. Any individual who did not cluster with the others was removed from further analyses. Samples were also excluded from analyses in the case of excessive missingness (>5%) or unusual genome-wide or X chromosome heterozygosity, and one individual was excluded from each pair of putatively related individuals (genome-wide IBD >10%). After data cleaning, 8,340 individuals and 526,688 SNPs were left in the genome-wide data set.
We then conducted imputation using the MACH Markov Chain Haplotyping software with CEU individuals from phase 2 of the HapMap project as a reference set (release 22). The final imputed data set consisted of 8,340 individuals, each with 2,594,390 imputed markers. Only imputed genotypes with minor allele frequencies ≥1% and R-sqr ≥0.3 were considered for association. Of these 8,340 individuals with genetic data, 2,874 mothers also had phenotype data available.
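The post-imputation filter described above (minor allele frequency ≥1% and R-sqr ≥0.3) could be applied to a per-marker summary table along the following lines. This is a hedged sketch: the tab-separated layout, the column names `SNP`, `MAF` and `Rsq`, and the marker IDs are assumptions made for the example, not a description of the files in this dataset.

```python
import csv
import io


def filter_imputed_markers(rows, min_maf=0.01, min_rsq=0.3):
    """Return IDs of markers meeting both the MAF and R-sqr thresholds."""
    return [
        row["SNP"]
        for row in rows
        if float(row["MAF"]) >= min_maf and float(row["Rsq"]) >= min_rsq
    ]


# Made-up summary table with hypothetical column names and marker IDs.
example_info = io.StringIO(
    "SNP\tMAF\tRsq\n"
    "marker_1\t0.25\t0.95\n"   # passes both thresholds
    "marker_2\t0.004\t0.90\n"  # too rare
    "marker_3\t0.12\t0.21\n"   # poorly imputed
    "marker_4\t0.01\t0.3\n"    # exactly on both thresholds, kept
)
rows = list(csv.DictReader(example_info, delimiter="\t"))
kept_markers = filter_imputed_markers(rows)
print(kept_markers)
```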
Associated publication:
4.3.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:gi_hapmap2_g0m_2022-12-07_f4 name: Genome-wide - HapMap2 imputed - G0 mothers version 2022-12-07 freeze 4 description: >- Version 2022-12-07 freeze 4 of Genome-wide array data imputed to the HapMap2 reference panel for G0 mothers. The number of variants & individuals within each plink file set can be viewed within the log file. freeze_size: 4.9G linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473 all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112 git_tag: https://github.com/alspac/dataset_gi_hapmap2_g0m/releases/tag/freeze4 is_current_freeze: true freeze_number: 4 freeze_date: 2024-06-11 previous_freeze: alspacdcs:gi_hapmap2_g0m_2022-12-07_f3 freeze_of_alspac_dataset_version: alspacdcs:gi_hapmap2_g0m_2022-12-07 freeze_of_named_alspac_dataset: alspacdcs:gi_hapmap2_g0m has_containers: - id: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c ## uuid name: plink description: A dir/folder containing the plink freeze data files. There are 8123 individuals within this dataset. 
has_parts: - id: alspacdcs:19bda7bc-6720-459b-bd0d-dbc0d6f2655f name: freeze_id_chr19 data_distributions: - id: alspacdcs:c412635e-67de-4714-9d8f-429bfa6fcae8 name: freeze_id_chr19.bim md5sum: c6fce7e15e198304f752ccbce66299b9 filesize: 1012.3KB filetype: .bim belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:840a92b2-f571-43b5-ad97-d79e77bd19af name: freeze_id_chr19.fam md5sum: c9fc6b68df21dc6f2b433fdcc052aa14 filesize: 277.5KB filetype: .fam belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:0e19015a-6693-47cb-bfa8-c85d712ec1c0 name: freeze_id_chr19.log md5sum: 84b19267a3dfa1641aba676a0e5eb3e0 filesize: 975.0B filetype: .log belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:4653442b-45a9-403e-9d3b-7199b15bfa3c name: freeze_id_chr19.bed md5sum: 801ccb3bb64dddaabfc2b7a4a1e4c5b0 filesize: 71.7MB filetype: .bed belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:68b5b829-8926-448a-96ba-577aecab4471 name: freeze_id_chr15 data_distributions: - id: alspacdcs:9819abd4-7318-4199-a4f6-e49f52531cf1 name: freeze_id_chr15.bim md5sum: 1e1139db4b031ba577b5ac6ae000ce6f filesize: 1.9MB filetype: .bim belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:c8105272-c607-4a82-82a7-f6dd269edc08 name: freeze_id_chr15.fam md5sum: c9fc6b68df21dc6f2b433fdcc052aa14 filesize: 277.5KB filetype: .fam belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:c2dea4f8-7d19-4d14-982f-95a0d2964495 name: freeze_id_chr15.log md5sum: 0e054fc3cce4a123b109394752e580b0 filesize: 1.0KB filetype: .log belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:a6efbd8c-0f56-4c6a-a71f-2419bafeb024 name: freeze_id_chr15.bed md5sum: 611159bc9c4500de559615d0a7c549f2 filesize: 140.0MB filetype: .bed belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: 
alspacdcs:3ac1b66a-08fe-464e-b6eb-dd0ce078ac89 name: freeze_id_chr1 data_distributions: - id: alspacdcs:c1098693-5f9e-4713-964d-4e614f34cef9 name: freeze_id_chr1.fam md5sum: c9fc6b68df21dc6f2b433fdcc052aa14 filesize: 277.5KB filetype: .fam belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:d484165b-1c01-4e79-a00f-7a8cdf6aeeb2 name: freeze_id_chr1.bed md5sum: 01f7205ea4b6e852c0e8feb72a2cb9cd filesize: 374.7MB filetype: .bed belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:72cfc84a-14ad-4c10-85f7-7b30a6f258d5 name: freeze_id_chr1.log md5sum: 59f4597c9621be95e1fc28a44d855361 filesize: 1.0KB filetype: .log belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:b73aebf8-ab24-4ca1-9972-79c4eedf1f49 name: freeze_id_chr1.bim md5sum: 44795681691b62d1921ad8855fd11a09 filesize: 5.1MB filetype: .bim belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:01757763-aee0-4d25-a8ed-9201a741801a name: freeze_id_chr20 data_distributions: - id: alspacdcs:386aff33-b1fb-43cb-88a4-10c6881dc6fd name: freeze_id_chr20.log md5sum: 8cbb4fc64f55bd5294cd9a206e0f37e8 filesize: 1.0KB filetype: .log belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:8ed2d8b2-6d78-44dc-972e-efb5de58fe09 name: freeze_id_chr20.fam md5sum: c9fc6b68df21dc6f2b433fdcc052aa14 filesize: 277.5KB filetype: .fam belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:fa2dcbf9-7cad-4415-81f3-bdf9d7a22c8d name: freeze_id_chr20.bed md5sum: 2af011bb98d6b8a8b00b7d938700fdac filesize: 122.8MB filetype: .bed belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:d99817d7-0b61-4edb-a5e8-2ce47cbbae88 name: freeze_id_chr20.bim md5sum: 6e0b2d6cd06cc6e36f9cbc3f8df0a169 filesize: 1.7MB filetype: .bim belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:2ca6174b-b52c-47c2-a668-076f84319060 
name: freeze_id_chr6 data_distributions: - id: alspacdcs:be3f3499-dc08-477a-bbd3-aaa4b2d99f1c name: freeze_id_chr6.log md5sum: d91f8058884a1dd82a9c2de687179eca filesize: 971.0B filetype: .log belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:d85e88c5-1458-481b-ae2d-4c84727258b3 name: freeze_id_chr6.bed md5sum: 953f9c82981d59d25dabe44ba5718b29 filesize: 353.1MB filetype: .bed belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:6feafd60-0d96-459e-beb0-2eab94b98eec name: freeze_id_chr6.fam md5sum: c9fc6b68df21dc6f2b433fdcc052aa14 filesize: 277.5KB filetype: .fam belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:da32ca03-fc3d-4617-acb5-a26d5e561f5b name: freeze_id_chr6.bim md5sum: 3fd4e793a35c5e935454efc1105be192 filesize: 4.8MB filetype: .bim belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:fad2ed72-86b0-4056-bd17-93e56ade3ecb name: freeze_id_chr21 data_distributions: - id: alspacdcs:5c0b52b0-a16a-4376-9812-34bc2a5d3381 name: freeze_id_chr21.fam md5sum: c9fc6b68df21dc6f2b433fdcc052aa14 filesize: 277.5KB filetype: .fam belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:eb1aef35-ee59-4348-b909-38e712204b32 name: freeze_id_chr21.bim md5sum: c1f6f2181c49172608ac79e18425e4f4 filesize: 924.7KB filetype: .bim belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:4e4694d8-4e3c-43c7-be72-98b077d64b10 name: freeze_id_chr21.bed md5sum: 13165e1c9a27aa42853429b0246a1ed5 filesize: 65.6MB filetype: .bed belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:901373ae-7601-45bd-91be-d47d804b5213 name: freeze_id_chr21.log md5sum: bff5f387cc08a205f3ceea4912301c4f filesize: 975.0B filetype: .log belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:7d3cd13a-ecd5-4f89-9ceb-b4bb49579325 name: freeze_id_chr17 data_distributions: - 
id: alspacdcs:c1d15a48-abcf-4d1c-b239-8c68e2bfd37e name: freeze_id_chr17.fam md5sum: c9fc6b68df21dc6f2b433fdcc052aa14 filesize: 277.5KB filetype: .fam belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:a1bc18b6-03ea-4aa5-9d17-37b72b58e469 name: freeze_id_chr17.log md5sum: 6537de563fe66f45ab0880c5c36695a5 filesize: 1.0KB filetype: .log belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:9dcdc5f3-f793-4853-909b-70d2527eda35 name: freeze_id_chr17.bim md5sum: 0dc0770759f9edccec7ce305e07b57d4 filesize: 1.6MB filetype: .bim belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:d2113c6b-7986-4939-9994-8853d5490517 name: freeze_id_chr17.bed md5sum: c6d54ed5ac68f2e0bd806b6124463ee4 filesize: 113.2MB filetype: .bed belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:95bd785b-ed60-4d1c-ba40-fef689036123 name: freeze_id_chr11 data_distributions: - id: alspacdcs:a6b1b401-f71d-4c17-a237-e8354a01af7f name: freeze_id_chr11.fam md5sum: c9fc6b68df21dc6f2b433fdcc052aa14 filesize: 277.5KB filetype: .fam belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:302011cd-eed0-49ed-a214-73b1a696204a name: freeze_id_chr11.bim md5sum: 703ecef520ce7363c24e9600b363570f filesize: 3.5MB filetype: .bim belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:5f49f602-6d97-4464-afe8-6a75017069ea name: freeze_id_chr11.bed md5sum: 3c89898ce9fc0445c566ea0c060fb9db filesize: 251.8MB filetype: .bed belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:7fd97f7e-cb07-41ba-b90b-75726822143d name: freeze_id_chr11.log md5sum: 1b63b92463c92cafc756f3e8c330a698 filesize: 977.0B filetype: .log belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:895b29b0-c4c2-41e5-8340-4d3ae277461c name: freeze_id_chr4 data_distributions: - id: 
alspacdcs:3af343a6-e042-42eb-a599-44de06b574d9 name: freeze_id_chr4.bed md5sum: 147fee33c621f644dad5a2d8ee86fc1d filesize: 315.9MB filetype: .bed belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:34856871-ee87-42e4-bb41-9d89a605cf8f name: freeze_id_chr4.log md5sum: a932f5b22c0160602f031f6589dd0e60 filesize: 971.0B filetype: .log belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:82aab8b9-4c3f-4b66-aa77-26540c9cf4be name: freeze_id_chr4.fam md5sum: c9fc6b68df21dc6f2b433fdcc052aa14 filesize: 277.5KB filetype: .fam belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:4a07aea9-9c62-4346-89f0-d819d55aa016 name: freeze_id_chr4.bim md5sum: 54a244447b1345636690b252215bfd2d filesize: 4.3MB filetype: .bim belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:fd2b4071-7a92-427d-8aa5-318e131cad2b name: freeze_id_chr9 data_distributions: - id: alspacdcs:1c48c6ed-2f21-4c69-b661-01920c806dec name: freeze_id_chr9.bed md5sum: 58ff215f0652257867e42f567ff1c2be filesize: 236.4MB filetype: .bed belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:e69923e6-552c-49b2-b643-0e332f60f5b0 name: freeze_id_chr9.log md5sum: d82db9e0fae823f565e88baf698f5d99 filesize: 1.0KB filetype: .log belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:a309e4c9-e320-4788-a6b1-f80d7ea77b48 name: freeze_id_chr9.bim md5sum: 1e828e0f36c2d168ce6c1df5887a764b filesize: 3.2MB filetype: .bim belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:9cc37db9-958b-4f11-9441-acb3d66171c6 name: freeze_id_chr9.fam md5sum: c9fc6b68df21dc6f2b433fdcc052aa14 filesize: 277.5KB filetype: .fam belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:1d5f1a43-9d6d-452c-8437-a1a6590200f8 name: freeze_id_chr7 data_distributions: - id: alspacdcs:dfbd747d-f228-4de6-a5b6-bbd4778abdb4 
name: freeze_id_chr7.fam md5sum: c9fc6b68df21dc6f2b433fdcc052aa14 filesize: 277.5KB filetype: .fam belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:0d5d665c-280d-44c2-b63f-a4d2e1711289 name: freeze_id_chr7.bim md5sum: dae38c5168605323dfc584a73f3ce4a1 filesize: 3.8MB filetype: .bim belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:1a60cd7e-f732-4337-9aa5-7346462d14b5 name: freeze_id_chr7.log md5sum: aa54d34c3425faa9859f2f254db2082f filesize: 971.0B filetype: .log belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:0a3ead16-4d79-45c1-a0ee-5a3ea755f645 name: freeze_id_chr7.bed md5sum: fb9e8aaf4ae7c3fc75233248ec9d03b0 filesize: 277.3MB filetype: .bed belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:759778c9-8038-417b-97bf-caabc09b2f1e name: freeze_id_chr10 data_distributions: - id: alspacdcs:307fc131-302b-453e-85ae-02690e80b688 name: freeze_id_chr10.log md5sum: 98542bb6181aab17d59cdb333bb038ea filesize: 1.0KB filetype: .log belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:f069ae4f-bccf-4211-9110-62068b374cca name: freeze_id_chr10.bim md5sum: 3c259904c7da548d25c86a4a36e96285 filesize: 3.8MB filetype: .bim belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:d9f75d55-f419-4544-bed7-1a3f8e016ef2 name: freeze_id_chr10.bed md5sum: 4606d4a5a008927b6ab051461218094a filesize: 267.9MB filetype: .bed belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:c9174b77-373f-42f8-afea-dbfe0eb291aa name: freeze_id_chr10.fam md5sum: c9fc6b68df21dc6f2b433fdcc052aa14 filesize: 277.5KB filetype: .fam belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:7bc29065-807d-4c21-8918-8ed8f65a2825 name: freeze_id_chr8 data_distributions: - id: alspacdcs:f2db501c-2db4-40e2-a6ea-45abf2db2e8b name: freeze_id_chr8.fam md5sum: 
c9fc6b68df21dc6f2b433fdcc052aa14 filesize: 277.5KB filetype: .fam belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:e254bf18-e94f-49bd-b406-dce1d679ccfd name: freeze_id_chr8.log md5sum: 12f65c73a1612c7f01e7cbb9faa6728d filesize: 971.0B filetype: .log belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:562e8ab2-65ee-4c5e-9fee-2abdbc2ebd4a name: freeze_id_chr8.bim md5sum: 6243ef376ee6cbe643bec69201bec604 filesize: 3.9MB filetype: .bim belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:333ab86d-a675-4d94-954b-e62dcf883019 name: freeze_id_chr8.bed md5sum: de34e8ef57e4c08991e4778401adf861 filesize: 285.5MB filetype: .bed belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:7d328e00-143e-44e4-8b21-d493f748d918 name: freeze_id_chr22 data_distributions: - id: alspacdcs:17890378-9f3b-495a-83cb-629aaaf25b2e name: freeze_id_chr22.bed md5sum: 5abcf552c585152ed0ee11754f3e7833 filesize: 65.5MB filetype: .bed belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:d01e1f6f-ab86-4eca-9ab3-75252b37e44d name: freeze_id_chr22.log md5sum: 22dbdaf004a39f13df827d8fe16eb86d filesize: 975.0B filetype: .log belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:cc879e99-6a12-4d4f-a109-8de9bde833e8 name: freeze_id_chr22.bim md5sum: 86a1da3366ba87e62f561dc09f64f9ac filesize: 920.9KB filetype: .bim belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:2eb50171-543f-42f4-9af3-cef7a0d2bcbd name: freeze_id_chr22.fam md5sum: c9fc6b68df21dc6f2b433fdcc052aa14 filesize: 277.5KB filetype: .fam belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:65c471a1-913a-4bc3-a812-b16e54dd200f name: freeze_id_chr16 data_distributions: - id: alspacdcs:8c869b1d-dbf2-4aec-9ba5-5b661cc67b16 name: freeze_id_chr16.fam md5sum: c9fc6b68df21dc6f2b433fdcc052aa14 
filesize: 277.5KB filetype: .fam belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:336096c8-697b-498c-a4e1-4ca26a024bfb name: freeze_id_chr16.bed md5sum: b04eb2e4e66fef7ee7d48cb666d78c38 filesize: 138.5MB filetype: .bed belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:ad98ee65-76bc-41db-9092-3d2aa5faa0ac name: freeze_id_chr16.bim md5sum: 8bd9cb45256b6b5ca37ce66eec810035 filesize: 1.9MB filetype: .bim belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:2b3ec9b5-e28b-417e-8c0b-1cc8ed5f7f7b name: freeze_id_chr16.log md5sum: 34657e68c9325f21c33207746f9ddd0a filesize: 975.0B filetype: .log belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:7bf77605-1f09-4c1b-a272-1c44bffe2293 name: freeze_id_chr14 data_distributions: - id: alspacdcs:701b9895-05b5-477d-b2b0-fb3e8adfb030 name: freeze_id_chr14.fam md5sum: c9fc6b68df21dc6f2b433fdcc052aa14 filesize: 277.5KB filetype: .fam belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:2369e5f9-0d84-415b-b6b8-67ec1526ea0b name: freeze_id_chr14.bim md5sum: 4a933818aaea48201f455ebd07ea1b78 filesize: 2.3MB filetype: .bim belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:3e4f19f4-e3bd-4b75-829b-cb1b37ca3f7e name: freeze_id_chr14.bed md5sum: a41f9803ec71a0dcdf137806b21ba2e6 filesize: 162.5MB filetype: .bed belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:43abadf5-e1b3-4ca9-abc5-044cb5682693 name: freeze_id_chr14.log md5sum: e1f2c8b876e4ec9e85deae8dd1a9bec7 filesize: 975.0B filetype: .log belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:9b1521fe-96a6-48d1-a3fe-47d1cc39d2b0 name: freeze_id_chr13 data_distributions: - id: alspacdcs:4c7b8334-319c-4c99-91af-186a1a3492b2 name: freeze_id_chr13.fam md5sum: c9fc6b68df21dc6f2b433fdcc052aa14 filesize: 277.5KB filetype: .fam 
belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:2062869c-01cb-4f2d-8a3c-263566f2cb11 name: freeze_id_chr13.bed md5sum: 0e99cf077012880a802dc36ce72142c1 filesize: 201.6MB filetype: .bed belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:fa22332d-494c-4fad-bdb6-ad9da610c5e6 name: freeze_id_chr13.log md5sum: 9eac1d36058e281cd55934aff7d91261 filesize: 977.0B filetype: .log belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:9ed2cb1e-593a-48e4-abb1-42f08bee40f5 name: freeze_id_chr13.bim md5sum: cd1b7c80977fb5a0bbd87bc83dd85aed filesize: 2.8MB filetype: .bim belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:0cf314bb-9a17-4a7d-957d-e0e9fb3e1653 name: freeze_id_chr18 data_distributions: - id: alspacdcs:cb61d296-4080-4d4b-b872-720c214d8322 name: freeze_id_chr18.bim md5sum: 9ffd8f006c82701060dff29bf460e8fe filesize: 2.1MB filetype: .bim belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:af5b5611-d842-47af-b78b-c329f8ef6ddc name: freeze_id_chr18.log md5sum: 5f8a5de0d684936a5e73482143ceaa86 filesize: 975.0B filetype: .log belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:71d13017-4ba4-44c8-bc9a-b568d1da3fb5 name: freeze_id_chr18.bed md5sum: 6b46a8d2993dae303334b9a51b50b92c filesize: 148.7MB filetype: .bed belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:7a1af215-15f2-440a-9f4d-83d6f47df5fc name: freeze_id_chr18.fam md5sum: c9fc6b68df21dc6f2b433fdcc052aa14 filesize: 277.5KB filetype: .fam belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:41642a5d-04a4-4bff-8cf3-191d3daf1f52 name: freeze_id_chr2 data_distributions: - id: alspacdcs:f6ef8665-4a9c-412d-ae56-251edce2ad20 name: freeze_id_chr2.log md5sum: 52044e2a3b44dc32c249292fbe6791bd filesize: 971.0B filetype: .log belongs_to_container: 
alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:d67023a6-a122-4cdc-bcc6-cfdbeba5094c name: freeze_id_chr2.fam md5sum: c9fc6b68df21dc6f2b433fdcc052aa14 filesize: 277.5KB filetype: .fam belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:25f3ab57-0a4d-4231-92c8-34270d61a1c3 name: freeze_id_chr2.bed md5sum: 494713bafedd17c3be4e782f7881dcc0 filesize: 427.5MB filetype: .bed belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:772d2466-ea01-4458-a53b-a47b01e4230f name: freeze_id_chr2.bim md5sum: 275cefa559489b51bebbc65657a91822 filesize: 5.9MB filetype: .bim belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:39e85461-90a4-43bc-aba6-89717ba9867e name: freeze_id_chr12 data_distributions: - id: alspacdcs:491388b8-ad94-4237-a4d7-2c2f95f4aeef name: freeze_id_chr12.fam md5sum: c9fc6b68df21dc6f2b433fdcc052aa14 filesize: 277.5KB filetype: .fam belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:1c7e897a-1d73-4255-964d-47e7afbb8099 name: freeze_id_chr12.bed md5sum: 367f44ccd183c47334cfc7cb8333628a filesize: 241.7MB filetype: .bed belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:6a491a6b-42cf-4c13-85f6-a95c165f5a1c name: freeze_id_chr12.log md5sum: 74355c1596f3f360051c9843a6bcad13 filesize: 977.0B filetype: .log belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:39ea7914-51cb-49f3-a3e6-f2e35f0b340a name: freeze_id_chr12.bim md5sum: 515a46f735c531163377d114549042b5 filesize: 3.4MB filetype: .bim belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:8fc655d9-a011-4da5-bdf9-f534545e5314 name: freeze_id_chr3 data_distributions: - id: alspacdcs:052eacc7-17a6-46f7-872b-a79f39b5d7d2 name: freeze_id_chr3.bed md5sum: 609847ca0489b7a97725ec275f8337d2 filesize: 337.5MB filetype: .bed belongs_to_container: 
alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:8ec247fe-0839-4136-a63d-1551838502b8 name: freeze_id_chr3.log md5sum: 1f737c62165516ad1ebced45dde7449d filesize: 1.0KB filetype: .log belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:67c67c59-bfb6-4010-9e4d-91b1b5f8bbf0 name: freeze_id_chr3.fam md5sum: c9fc6b68df21dc6f2b433fdcc052aa14 filesize: 277.5KB filetype: .fam belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:9648c74d-c1b2-480c-8a79-ed99fba786c9 name: freeze_id_chr3.bim md5sum: 96d147406f1f24697b0cb9af0c7091fc filesize: 4.6MB filetype: .bim belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:4cd1b4c1-7e30-4576-bac0-3983c8f4f56c name: freeze_id_chr5 data_distributions: - id: alspacdcs:0a988283-2b96-4bba-acc8-e8678b741bd2 name: freeze_id_chr5.log md5sum: 5637edc58fd0a953a2283149a1ffff55 filesize: 971.0B filetype: .log belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:415ae163-6976-4a2a-ac69-05d34c119920 name: freeze_id_chr5.bim md5sum: e8f55ef9016bf2f03ee43f08a6c974c3 filesize: 4.4MB filetype: .bim belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:c29b23f3-a035-4a53-a7b5-034e7ad042e1 name: freeze_id_chr5.bed md5sum: a3a47a8ea90e0fa39d5c203436b6d982 filesize: 325.5MB filetype: .bed belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c - id: alspacdcs:02d97e8a-ac68-4097-a329-d69445eac6a8 name: freeze_id_chr5.fam md5sum: c9fc6b68df21dc6f2b433fdcc052aa14 filesize: 277.5KB filetype: .fam belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
4.4 Genome-wide - 1000G imputed - G0 partners (gi_1000g_g0p)
4.4.1 Description
This dataset contains genome-wide array data imputed to the 1000 Genomes reference panel for G0 partners, with some additional G0 mothers and G1 individuals. The data have been cleaned, flipped to the positive strand, aligned to b37 coordinates and imputed to the 1000 Genomes phase 1 version 3 reference panel. Reference genome build: GRCh37
4.4.2 Methodology
3,453 ALSPAC mothers and fathers were genotyped at 535,478 SNPs using the Illumina HumanCoreExome chip genotyping platform by the ALSPAC lab, and genotypes were called using GenomeStudio. The resulting raw genome-wide data were subjected to standard quality control methods using PLINK (v1.07). Individuals were excluded on the basis of gender mismatches (n = 80), minimal or excessive heterozygosity (n = 64), disproportionate levels of individual missingness (>5%, n = 60) and possible contamination (n = 3).
Population stratification was assessed by multidimensional scaling analysis and principal component analysis, with comparison against 1000 Genomes phase 3 data (n = 266); all individuals with non-European ancestry were removed.
Cryptic relatedness was measured as SNP relatedness in GCTA (relatedness > 0.1, n = 69 removed). SNPs with a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 1E-7) and those which failed GenomeStudio quality control measures were removed (n = 21,298). 6,594 duplicate SNPs were also removed.
This resulted in 2,911 unrelated mothers and fathers genotyped at 507,586 SNPs.
We phased the data for the 3074 samples that passed QC, a set that still contained related subjects, using SHAPEIT v2.r837. We then removed 155,336 monomorphic SNPs, 1033 markers not in 1000 Genomes, 11,842 A/T or G/C SNPs and 10 duplicate sites to give 337,732 SNPs on chromosomes 1-23. Of the 329,363 markers on chromosomes 1-22, 298,742 overlapped the reference genome. We imputed to the 1000 Genomes phase 1 version 3 reference panel using the Michigan Imputation Server. We then identified 2217 samples where the ALN assigned historically by the lab matched the genetically assigned ALN. Finally, we removed 12 subjects who had withdrawn consent and 6 subjects genotyped in an earlier work package to give 2201 subjects.
1737 putative G0 partner-G1 pairs for whom both G0 partner and G1 have called genotype data available were identified based on ALN. Given the G0 partners were invited by the G0 mother to take part and only enrolled in the study in their own right several years later, it could not be assumed that all G0 partners were biologically related to G1. Called genotype data for the 1720 unique G0 partners and 1737 unique G1s were merged (i.e. there were 17 pairs of siblings/twins among the G1 offspring), using plink v1.90b7.2 64-bit (11 Dec 2023).
After application of the PLINK filters --geno 0.05, --maf 0.01, --snps-only just-acgt and --autosome, the --related command in KING version 2.3.2 was used to perform kinship analysis, which confirmed that all 1737 putative G0 partner-G1 pairs are genetically related. This would be expected for biological father-offspring pairs, using the inference criteria described in Table 1 of Manichaikul, Ani, et al. "Robust relationship inference in genome-wide association studies." Bioinformatics 26.22 (2010): 2867-2873.
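The kinship-based inference mentioned above can be illustrated with a minimal Python sketch. It assumes the kinship-coefficient bounds from Table 1 of Manichaikul et al. (2010), written as powers of two; the function name `infer_degree` and the label strings are hypothetical and are not part of KING or of the ALSPAC pipeline. The sketch uses the kinship coefficient only, so it does not separate parent-offspring pairs from full siblings (Table 1 uses the proportion of zero IBD sharing for that).

```python
# Kinship-coefficient bounds, assumed from Table 1 of Manichaikul et al. (2010).
DUPLICATE_MIN = 2 ** -1.5       # about 0.354
FIRST_DEGREE_MIN = 2 ** -2.5    # about 0.177
SECOND_DEGREE_MIN = 2 ** -3.5   # about 0.0884
THIRD_DEGREE_MIN = 2 ** -4.5    # about 0.0442


def infer_degree(kinship: float) -> str:
    """Map an estimated kinship coefficient to a relationship degree.

    Hypothetical helper for illustration; labels are not KING output.
    """
    if kinship > DUPLICATE_MIN:
        return "duplicate/MZ twin"
    if kinship > FIRST_DEGREE_MIN:
        return "first-degree"   # parent-offspring or full sibling
    if kinship > SECOND_DEGREE_MIN:
        return "second-degree"
    if kinship > THIRD_DEGREE_MIN:
        return "third-degree"
    return "unrelated"


# A father-offspring pair has an expected kinship coefficient of 0.25.
print(infer_degree(0.25))
```

A biological father-offspring pair would be expected to fall in the first-degree band, which is the check the methodology describes for the 1737 putative pairs.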
4.4.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:gi_1000g_g0p_2016-11-22_f4 name: Genome-wide - 1000G imputed - G0 partners version 2016-11-22 freeze 4 description: >- This dataset is the fourth freeze of the 2016-11-22 version of the Genome-wide array data imputed to the 1000 genomes reference panel for G0 partners, with some additional G0 mothers and G1 individuals. freeze_size: 44G linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473 all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112 git_tag: https://github.com/alspac/dataset_gi_1000g_g0p/releases/tag/freeze4 is_current_freeze: true freeze_number: 4 freeze_date: 2023-09-11 previous_freeze: alspacdcs:gi_1000g_g0p_2016-11-22_f4 next_freeze: freeze_of_alspac_dataset_version: alspacdcs:gi_1000g_g0p_2016-11-22 freeze_of_named_alspac_dataset: alspacdcs:gi_1000g_g0p has_containers: - id: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c name: data description: A dir/folder containing the data bgen and sample files has_parts: - id: alspacdcs:gi_1000g_g0p_2016-11-22_sample_f4 name: Samples description: >- The samples in the data. To be used with the genetic data. data_distributions: - id: alspacdcs:dfa0ee7c627927a47e286aa23b0514e4_swapped.sample name: swapped.sample description: >- A plain text .sample file. See https://doi.org/10.1101/308296 for file format details. md5sum: dfa0ee7c627927a47e286aa23b0514e4 filesize: 165k filetype: .sample number_of_participants: 2198 belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr1_f4 name: Chr1 description: Data for Chr1 data_distributions: - id: alspacdcs:a5eb049e4df5a8b005ae51b47947d830_filtered_data_chr01.bgen name: filtered_data_chr01.bgen description: >- An Oxford Bgen file for Chr1. To be used with file. 
See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: a5eb049e4df5a8b005ae51b47947d830 filesize: 3.4G filetype: .bgen number_of_participants: 2198 number_of_variants: 2159337 belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr2_f4 name: Chr2 description: Data for Chr2 data_distributions: - id: alspacdcs:e297c8d30455053d23ac360bcc886bb0_filtered_data_chr02.bgen name: filtered_data_chr02.bgen description: >- An Oxford Bgen file for Chr2. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: e297c8d30455053d23ac360bcc886bb0 filesize: 3.6G filetype: .bgen number_of_participants: 2198 number_of_variants: 2349883 belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr3_f4 name: Chr3 description: Data for Chr3 data_distributions: - id: alspacdcs:c0b55e9d65c219ffb1b8c58a0ebb7c18_filtered_data_chr03.bgen name: filtered_data_chr03.bgen description: >- An Oxford Bgen file for Chr3. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: c0b55e9d65c219ffb1b8c58a0ebb7c18 filesize: 3.0G filetype: .bgen number_of_participants: 2198 number_of_variants: 1969275 belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr4_f4 name: Chr4 description: Data for Chr4 data_distributions: - id: alspacdcs:514f09f02c74fc3eca83379e9e99c5dc_filtered_data_chr04.bgen name: filtered_data_chr04.bgen description: >- An Oxford Bgen file for Chr4. To be used with file. See https://doi.org/10.1101/308296 for file format details. 
(bgen v1.2) md5sum: 514f09f02c74fc3eca83379e9e99c5dc filesize: 3.1G filetype: .bgen number_of_participants: 2198 number_of_variants: 1969883 - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr5_f4 name: Chr5 description: Data for Chr5 data_distributions: - id: alspacdcs:f4accbf5bdd6a2ccc9598e9e2221915d_filtered_data_chr05.bgen name: filtered_data_chr05.bgen description: >- An Oxford Bgen file for Chr5. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: f4accbf5bdd6a2ccc9598e9e2221915d filesize: 2.8G filetype: .bgen number_of_participants: 2198 number_of_variants: 1809961 belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr6_f4 name: Chr6 description: Data for Chr6 data_distributions: - id: alspacdcs:a9327ad1591fdf7d349b066544e71c3a_filtered_data_chr06.bgen name: filtered_data_chr06.bgen description: >- An Oxford Bgen file for Chr6. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: a9327ad1591fdf7d349b066544e71c3a filesize: 2.6G filetype: .bgen number_of_participants: 2198 number_of_variants: 1758025 belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr7_f4 name: Chr7 description: Data for Chr7 data_distributions: - id: alspacdcs:f832922558eddcf3feed87091c2ec0ae_filtered_data_chr07.bgen name: filtered_data_chr07.bgen description: >- An Oxford Bgen file for Chr7. To be used with file. See https://doi.org/10.1101/308296 for file format details. 
(bgen v1.2) md5sum: f832922558eddcf3feed87091c2ec0ae filesize: 2.7G filetype: .bgen number_of_participants: 2198 number_of_variants: 1601293 belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr8_f4 name: Chr8 description: Data for Chr8 data_distributions: - id: alspacdcs:47d79712e676a0048f90858cbb888179_filtered_data_chr08.bgen name: filtered_data_chr08.bgen description: >- An Oxford Bgen file for Chr8. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 47d79712e676a0048f90858cbb888179 filesize: 2.4G filetype: .bgen number_of_participants: 2198 number_of_variants: 1558902 belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr9_f4 name: Chr9 description: Data for Chr9 data_distributions: - id: alspacdcs:82a480f3e8792db2c1cec3adc50e1357_filtered_data_chr09.bgen name: filtered_data_chr09.bgen description: >- An Oxford Bgen file for Chr9. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 82a480f3e8792db2c1cec3adc50e1357 filesize: 1.9G filetype: .bgen number_of_participants: 2198 number_of_variants: 1189463 belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr10_f4 name: Chr10 description: Data for Chr10 data_distributions: - id: alspacdcs:8f64fe184e4c876a345a728ed5eeddcf_filtered_data_chr10.bgen name: filtered_data_chr10.bgen description: >- An Oxford Bgen file for Chr10. To be used with file. See https://doi.org/10.1101/308296 for file format details. 
(bgen v1.2) md5sum: 8f64fe184e4c876a345a728ed5eeddcf filesize: 2.2G filetype: .bgen number_of_participants: 2198 number_of_variants: 1363104 belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr11_f4 name: Chr11 description: Data for Chr11 data_distributions: - id: alspacdcs:b1b7e3bef0fe72cd90bd0ba456f687aa_filtered_data_chr11.bgen name: filtered_data_chr11.bgen description: >- An Oxford Bgen file for Chr11. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: b1b7e3bef0fe72cd90bd0ba456f687aa filesize: 2.2G filetype: .bgen number_of_participants: 2198 number_of_variants: 1359640 belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr12_f4 name: Chr12 description: Data for Chr12 data_distributions: - id: alspacdcs:509202db22200fe0bd58210ab8e9c757_filtered_data_chr12.bgen name: filtered_data_chr12.bgen description: >- An Oxford Bgen file for Chr12. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 509202db22200fe0bd58210ab8e9c757 filesize: 2.1G filetype: .bgen number_of_participants: 2198 number_of_variants: 1316510 belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr13_f4 name: Chr13 description: Data for Chr13 data_distributions: - id: alspacdcs:176a10d38ab80783a8e392e5791edea7_filtered_data_chr13.bgen name: filtered_data_chr13.bgen description: >- An Oxford Bgen file for Chr13. To be used with file. See https://doi.org/10.1101/308296 for file format details. 
(bgen v1.2) md5sum: 176a10d38ab80783a8e392e5791edea7 filesize: 1.6G filetype: .bgen number_of_participants: 2198 number_of_variants: 988473 - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr14_f4 name: Chr14 description: Data for Chr14 data_distributions: - id: alspacdcs:1ecd96aab2925bafd7d20497d85dd937_filtered_data_chr14.bgen name: filtered_data_chr14.bgen description: >- An Oxford Bgen file for Chr14. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 1ecd96aab2925bafd7d20497d85dd937 filesize: 1.5G filetype: .bgen number_of_participants: 2198 number_of_variants: 903811 belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr15_f4 name: Chr15 description: Data for Chr15 data_distributions: - id: alspacdcs:f8c5b54206189808e9a361cc0da63798_filtered_data_chr15.bgen name: filtered_data_chr15.bgen description: >- An Oxford Bgen file for Chr15. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: f8c5b54206189808e9a361cc0da63798 filesize: 1.4G filetype: .bgen number_of_participants: 2198 number_of_variants: 814028 belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr16_f4 name: Chr16 description: Data for Chr16 data_distributions: - id: alspacdcs:52f065575d3cb2dff34df6763a583766_filtered_data_chr16.bgen name: filtered_data_chr16.bgen description: >- An Oxford Bgen file for Chr16. To be used with file. See https://doi.org/10.1101/308296 for file format details. 
(bgen v1.2) md5sum: 52f065575d3cb2dff34df6763a583766 filesize: 1.6G filetype: .bgen number_of_participants: 2198 number_of_variants: 867901 belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr17_f4 name: Chr17 description: Data for Chr17 data_distributions: - id: alspacdcs:73d85caf67dcedc63b11a43bd5ccb44d_filtered_data_chr17.bgen name: filtered_data_chr17.bgen description: >- An Oxford Bgen file for Chr17. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 73d85caf67dcedc63b11a43bd5ccb44d filesize: 1.4G filetype: .bgen number_of_participants: 2198 number_of_variants: 755467 belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr18_f4 name: Chr18 description: Data for Chr18 data_distributions: - id: alspacdcs:b8e055a6c0955bb67161c9f7a1d8cad7_filtered_data_chr18.bgen name: filtered_data_chr18.bgen description: >- An Oxford Bgen file for Chr18. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: b8e055a6c0955bb67161c9f7a1d8cad7 filesize: 1.4G filetype: .bgen number_of_participants: 2198 number_of_variants: 783661 belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr19_f4 name: Chr19 description: Data for Chr19 data_distributions: - id: alspacdcs:37ea045cd9f4027cba547b7b89c3a1a0_filtered_data_chr19.bgen name: filtered_data_chr19.bgen description: >- An Oxford Bgen file for Chr19. To be used with file. See https://doi.org/10.1101/308296 for file format details. 
(bgen v1.2) md5sum: 37ea045cd9f4027cba547b7b89c3a1a0 filesize: 1.3G filetype: .bgen number_of_participants: 2198 number_of_variants: 606147 belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr20_f4 name: Chr20 description: Data for Chr20 data_distributions: - id: alspacdcs:d241eb21be3188c26c460e1f65f0d8c1_filtered_data_chr20.bgen name: filtered_data_chr20.bgen description: >- An Oxford Bgen file for Chr20. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: d241eb21be3188c26c460e1f65f0d8c1 filesize: 1.1G filetype: .bgen number_of_participants: 2198 number_of_variants: 618749 belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr21_f4 name: Chr21 description: Data for Chr21 data_distributions: - id: alspacdcs:7881bdc24e7f0adbfb800b49d1efd590_filtered_data_chr21.bgen name: filtered_data_chr21.bgen description: >- An Oxford Bgen file for Chr21. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 7881bdc24e7f0adbfb800b49d1efd590 filesize: 672M filetype: .bgen number_of_participants: 2198 number_of_variants: 378064 belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr22_f4 name: Chr22 description: Data for Chr22 data_distributions: - id: alspacdcs:824412e963441699f260c6245f65659d_filtered_data_chr22.bgen name: filtered_data_chr22.bgen description: >- An Oxford Bgen file for Chr22. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 824412e963441699f260c6245f65659d filesize: 722M filetype: .bgen number_of_participants: 2198 number_of_variants: 366590 belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c
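The md5sum values listed in the freeze docs can be used to confirm that a received file is intact. The following is a minimal Python sketch of such a check using only the standard library; the helper names `md5sum` and `verify` are hypothetical and are not part of any ALSPAC tooling.

```python
import hashlib


def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with the value listed in the catalogue."""
    return md5sum(path) == expected_md5.strip().lower()
```

For example, `verify("filtered_data_chr22.bgen", "824412e963441699f260c6245f65659d")` would compare a local copy of the Chr22 file against the digest listed above, assuming the file sits in the current working directory.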
4.5 Genome-wide - 1000G imputed - G0 mothers + G1 (gi_1000g_g0m_g1)
4.5.1 Description
This dataset contains genome-wide 1000G imputed data for G0 mothers + G1. The data have been cleaned, flipped to the positive strand, aligned to b37 coordinates and imputed to the 1000 Genomes phase 1 version 3 reference panel. Reference genome build: GRCh37
4.5.2 Methodology
ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platforms by 23andMe, subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK, and the Laboratory Corporation of America, Burlington, NC, US. The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches, minimal or excessive heterozygosity, disproportionate levels of individual missingness (>3%) and insufficient sample replication (IBD < 0.8).
Population stratification was assessed by multidimensional scaling analysis and compared with HapMap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed.
SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed. Cryptic relatedness was measured as proportion of identity by descent (IBD > 0.1). Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,115 subjects and 500,527 SNPs passed these quality control filters.
ALSPAC mothers were genotyped using the Illumina Human660W-Quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs. SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally, SNPs with a minor allele frequency of less than 1% were removed.
Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded.
Cryptic relatedness was assessed using an IBD estimate of more than 0.125, which is expected to correspond to roughly 12.5% of alleles shared IBD, or relatedness at the first cousin level. Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,048 subjects and 526,688 SNPs passed these quality control filters.
We combined 477,482 SNP genotypes in common between the sample of mothers and the sample of children. We removed SNPs with genotype missingness above 1% due to poor quality (11,396 SNPs removed) and removed a further 321 subjects due to potential ID mismatches. This resulted in a dataset of 17,842 subjects containing 6,305 duos and 465,740 SNPs (112 were removed during liftover and 234 were out of HWE after combination). We estimated haplotypes using SHAPEIT (v2.r644), which utilises relatedness during phasing. We obtained a phased version of the 1000 Genomes reference panel (Phase 1, Version 3) from the IMPUTE2 reference data repository (phased using SHAPEIT v2.r644, haplotype release date Dec 2013). Imputation of the target data was performed using IMPUTE V2.2.2 against the reference panel (all polymorphic SNPs excluding singletons), using all 2186 reference haplotypes (including non-Europeans).
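The subject and SNP counts quoted in the methodology above can be reconciled with a short arithmetic check. This Python sketch uses only figures stated in the text; the variable names are illustrative.

```python
# Subjects passing QC in each array dataset (figures from the text above).
children_passing_qc = 9_115
mothers_passing_qc = 9_048
removed_id_mismatch = 321

subjects_combined = children_passing_qc + mothers_passing_qc - removed_id_mismatch

# SNPs in common between the two arrays, then the three stated removals.
snps_in_common = 477_482
removed_missingness = 11_396
removed_liftover = 112
removed_hwe = 234

snps_final = snps_in_common - removed_missingness - removed_liftover - removed_hwe

print(subjects_combined)  # 17842
print(snps_final)         # 465740
```

Both results agree with the 17,842 subjects and 465,740 SNPs reported for the combined dataset.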
This gave 8,237 eligible children and 8,196 eligible mothers with available genotype data after exclusion of related subjects using the cryptic relatedness measures described previously.
Known issues: There is a known strand issue within this imputation. The Dec 2013 haplotype release of 1000 Genomes phase 1 version 3 has 199 reported SNPs with incorrect strand. For more information and the origins of this list please visit https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated_SHAPEIT2_16-06-14.html. These errors are very unlikely to have systematic effects across the genome and are most probably isolated to the 199 known problematic SNPs. Users are advised to discard these SNPs from their analysis.
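Discarding the listed SNPs amounts to a set-membership filter on variant identifiers. The following Python sketch assumes the exclusion list has been saved as plain text with one identifier per line; that layout, the function names and the identifiers in the example are hypothetical, since the catalogue does not describe the list's format.

```python
from typing import Iterable, List, Set


def load_exclusions(lines: Iterable[str]) -> Set[str]:
    """Collect variant ids from text lines, skipping blanks and '#' comments."""
    excluded = set()
    for line in lines:
        entry = line.strip()
        if entry and not entry.startswith("#"):
            excluded.add(entry)
    return excluded


def drop_excluded(variant_ids: Iterable[str], excluded: Set[str]) -> List[str]:
    """Return the variant ids that are not on the exclusion list, in order."""
    return [variant for variant in variant_ids if variant not in excluded]


# Made-up identifiers, for illustration only.
exclusions = load_exclusions(["# strand issues", "example_snp_2", ""])
print(drop_excluded(["example_snp_1", "example_snp_2", "example_snp_3"], exclusions))
```

In practice the same list could be passed to whichever analysis tool is in use through its own variant-exclusion option.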
The bgen files within the gi_1000g_g0m_g1 dataset have NA in place of the chromosome column. Some tools may allow this, while others are less forgiving, so users may wish to re-format the dataset (using QCTOOL or equivalent) for their work.
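Because each bgen file holds a single chromosome, the missing chromosome value can be recovered from the file name when re-formatting. This Python sketch assumes the `filtered_NN.bgen` naming pattern shown in the freeze docs below; the function name is hypothetical and the re-encoding step itself is not shown.

```python
import re

# Assumed naming pattern, e.g. "filtered_01.bgen" for chromosome 1.
_BGEN_NAME = re.compile(r"filtered_(\d{2})\.bgen$")


def chromosome_from_filename(filename: str) -> str:
    """Return the chromosome label (without leading zero) encoded in a file name."""
    match = _BGEN_NAME.search(filename)
    if match is None:
        raise ValueError(f"unexpected bgen file name: {filename!r}")
    return str(int(match.group(1)))


print(chromosome_from_filename("filtered_01.bgen"))  # 1
print(chromosome_from_filename("filtered_10.bgen"))  # 10
```

The returned label could then be supplied to the re-formatting tool so that each output record carries an explicit chromosome.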
Allele frequency concordance with other cohorts: When contributing to consortia you may find that the allele frequencies in ALSPAC for a few thousand SNPs are discordant from a reference panel used by the consortium. This is to be expected: when allele frequencies are calculated in two different samples for many millions of SNPs, even from the same population, a number of SNPs will appear to be highly discordant.
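The scale of this effect can be illustrated with a rough calculation. The Python sketch below assumes the SNPs are independent and that discordance is flagged with a two-sided test at a fixed threshold; the number of SNPs and the threshold are illustrative values, not ALSPAC or consortium figures.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def expected_by_chance(n_snps: int, alpha: float) -> float:
    """Expected number of SNPs flagged at threshold alpha when no true difference exists."""
    return n_snps * alpha


# Illustrative values: 10 million SNPs, discordance flagged at p < 1e-4.
print(expected_by_chance(10_000_000, 1e-4))
```

Under these assumptions roughly a thousand SNPs would be flagged by chance alone, which is the same order of magnitude as the few thousand discordant SNPs described above.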
4.5.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_f4 name: >- Genome-wide - 1000G imputed - G0 mothers + G1 version 2015-10-30 freeze 4 description: >- This is the fourth freeze of the 2015-10-30 version of the gi_1000g_g0m_g1 dataset. It contains data in the oxford format which is a combination of bgen and sample (version 1.2) files. It is a subset of the data in gi_1000g_g0m_g1_2015-10-30 limited to one format and with participants who have withdrawn their consent removed. The Dec 2013 haplotype release of 1000 genomes phase 1 version 3 has 199 reported SNPs with incorrect strand. The strand issues are present in this imputation version. For more information and the origins of this list please visit: https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated_SHAPEIT2_16-06-14.html It is very unlikely that they have systematic effects across the genome and most probably are just isolated to these 199 known problematic SNPs. The user is advised to discard them from their analysis. freeze_size: 122G linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473 all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112 git_tag: https://github.com/alspac/dataset_gi_1000g_g0m_g1/releases/tag/freeze4 is_current_freeze: true freeze_number: 4 freeze_date: 2024-06-11 previous_freeze: alspacdcs:gi_1000g_g0m_g1_2015-10-30_f3 freeze_of_alspac_dataset_version: alspacdcs:gi_1000g_g0m_g1_2015-10-30 freeze_of_named_alspac_dataset: alspacdcs:gi_1000g_g0m_g1 has_parts: - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_sample_f4 name: Samples description: >- The samples in the data. To be used with the genetic data. 
data_distributions: - id: alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample name: swapped.sample description: >- A plain text .sample file. See https://doi.org/10.1101/308296 for file format details. md5sum: 65bf6fc592b85ce69dec0473aca5b5cd filesize: 1.3M filetype: .sample number_of_participants: 17444 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr1_f4 name: Chr1 description: Data for Chr1 data_distributions: - id: alspacdcs:fad144852b7c9c929ea1a55b8481798c_filtered_01.bgen name: filtered_01.bgen description: >- An Oxford Bgen file for Chr1. To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: fad144852b7c9c929ea1a55b8481798c filesize: 9.1G filetype: .bgen number_of_participants: 17444 number_of_variants: 2155158 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr2_f4 name: Chr2 description: Data for Chr2 data_distributions: - id: alspacdcs:91168a792595ee55375d6c72c881fa6c_filtered_02.bgen name: filtered_02.bgen description: >- An Oxford Bgen file for Chr2. To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 91168a792595ee55375d6c72c881fa6c filesize: 9.1G filetype: .bgen number_of_participants: 17444 number_of_variants: 2346862 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr3_f4 name: Chr3 description: Data for Chr3 data_distributions: - id: alspacdcs:6e898fe7aba1d39e832245267a9ec30e_filtered_03.bgen name: filtered_03.bgen description: >- An Oxford Bgen file for Chr3. To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. 
(bgen v1.2) md5sum: 6e898fe7aba1d39e832245267a9ec30e filesize: 7.7G filetype: .bgen number_of_participants: 17444 number_of_variants: 1966662 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr4_f4 name: Chr4 description: Data for Chr4 data_distributions: - id: alspacdcs:c7ba39fbff7de19ffd98b93ff217108b_filtered_04.bgen name: filtered_04.bgen description: >- An Oxford Bgen file for Chr4. To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: c7ba39fbff7de19ffd98b93ff217108b filesize: 8.4G filetype: .bgen number_of_participants: 17444 number_of_variants: 1968171 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr5_f4 name: Chr5 description: Data for Chr5 data_distributions: - id: alspacdcs:173056913dd6dc1684e9118907af1fd5_filtered_05.bgen name: filtered_05.bgen description: >- An Oxford Bgen file for Chr5. To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 173056913dd6dc1684e9118907af1fd5 filesize: 6.9G filetype: .bgen number_of_participants: 17444 number_of_variants: 1808090 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr6_f4 name: Chr6 description: Data for Chr6 data_distributions: - id: alspacdcs:b8296902cc14e29111b2caefbc52a00b_filtered_06.bgen name: filtered_06.bgen description: >- An Oxford Bgen file for Chr6. To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: b8296902cc14e29111b2caefbc52a00b filesize: 6.8G filetype: .bgen number_of_participants: 17444 number_of_variants: 1755859 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr7_f4 name: Chr7 description: Data for Chr7 data_distributions: - id: alspacdcs:3072cca6a05fdb782b858f70beed6e06_filtered_08.bgen name: filtered_07.bgen description: >- An Oxford Bgen file for Chr7. 
To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 3072cca6a05fdb782b858f70beed6e06 filesize: 7.1G filetype: .bgen number_of_participants: 17444 number_of_variants: 1599387 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr8_f4 name: Chr8 description: Data for Chr8 data_distributions: - id: alspacdcs:c57b0cc8c3b47c8058e6f95ba742a89d_filtered_08.bgen name: filtered_08.bgen description: >- An Oxford Bgen file for Chr8. To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: c57b0cc8c3b47c8058e6f95ba742a89d filesize: 5.9G filetype: .bgen number_of_participants: 17444 number_of_variants: 1557429 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr9_f4 name: Chr9 description: Data for Chr9 data_distributions: - id: alspacdcs:0e0d21cb1dc4d276d0a4353cc7da0564_filtered_09.bgen name: filtered_09.bgen description: >- An Oxford Bgen file for Chr9. To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 0e0d21cb1dc4d276d0a4353cc7da0564 filesize: 5.1G filetype: .bgen number_of_participants: 17444 number_of_variants: 1187731 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr10_f4 name: Chr10 description: Data for Chr10 data_distributions: - id: alspacdcs:e5f8a44f260c009a9fec7bdc105ead76_filtered_10.bgen name: filtered_10.bgen description: >- An Oxford Bgen file for Chr10. To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. 
(bgen v1.2) md5sum: e5f8a44f260c009a9fec7bdc105ead76 filesize: 5.4G filetype: .bgen number_of_participants: 17444 number_of_variants: 1361506 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr11_f4 name: Chr11 description: Data for Chr11 data_distributions: - id: alspacdcs:7c64c009aaf9fdb84c21b31f51e28bfa_filtered_11.bgen name: filtered_11.bgen description: >- An Oxford Bgen file for Chr11. To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 7c64c009aaf9fdb84c21b31f51e28bfa filesize: 5.4G filetype: .bgen number_of_participants: 17444 number_of_variants: 1356882 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr12_f4 name: Chr12 description: Data for Chr12 data_distributions: - id: alspacdcs:8f0d903ca1cf24ca0e45494bd0a1426c_filtered_12.bgen name: filtered_12.bgen description: >- An Oxford Bgen file for Chr12. To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 8f0d903ca1cf24ca0e45494bd0a1426c filesize: 5.4G filetype: .bgen number_of_participants: 17444 number_of_variants: 1314328 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr13_f4 name: Chr13 description: Data for Chr13 data_distributions: - id: alspacdcs:e59348ea876d3f5c3b6331e738daa162_filtered_13.bgen name: filtered_13.bgen description: >- An Oxford Bgen file for Chr13. To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: e59348ea876d3f5c3b6331e738daa162 filesize: 4.0G filetype: .bgen number_of_participants: 17444 number_of_variants: 987740 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr14_f4 name: Chr14 description: Data for Chr14 data_distributions: - id: alspacdcs:3f80471a1e183e478ca3674482ed89e4_filtered_14.bgen name: filtered_14.bgen description: >- An Oxford Bgen file for Chr14. 
To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 3f80471a1e183e478ca3674482ed89e4 filesize: 3.9G filetype: .bgen number_of_participants: 17444 number_of_variants: 904351 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr15_f4 name: Chr15 description: Data for Chr15 data_distributions: - id: alspacdcs:2166a96fc0bbdc990b1bcb513f4372bd_filtered_15.bgen name: filtered_15.bgen description: >- An Oxford Bgen file for Chr15. To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 2166a96fc0bbdc990b1bcb513f4372bd filesize: 3.7G filetype: .bgen number_of_participants: 17444 number_of_variants: 812545 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr16_f4 name: Chr16 description: Data for Chr16 data_distributions: - id: alspacdcs:c44b1d287c79c69b2171c6822339cf4b_filtered_16.bgen name: filtered_16.bgen description: >- An Oxford Bgen file for Chr16. To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: c44b1d287c79c69b2171c6822339cf4b filesize: 4.3G filetype: .bgen number_of_participants: 17444 number_of_variants: 865998 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr17_f4 name: Chr17 description: Data for Chr17 data_distributions: - id: alspacdcs:e4c50e9c54d4baa59d191a756d60b32e_filtered_17.bgen name: filtered_17.bgen description: >- An Oxford Bgen file for Chr17. To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. 
(bgen v1.2) md5sum: e4c50e9c54d4baa59d191a756d60b32e filesize: 3.8G filetype: .bgen number_of_participants: 17444 number_of_variants: 753174 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr18_f4 name: Chr18 description: Data for Chr18 data_distributions: - id: alspacdcs:fa893fede52923d5805f8583dbed51bd_filtered_18.bgen name: filtered_18.bgen description: >- An Oxford Bgen file for Chr18. To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: fa893fede52923d5805f8583dbed51bd filesize: 3.5G filetype: .bgen number_of_participants: 17444 number_of_variants: 783010 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr19_f4 name: Chr19 description: Data for Chr19 data_distributions: - id: alspacdcs:999c860cfb0f3484d1a78ef639c594fa_filtered_19.bgen name: filtered_19.bgen description: >- An Oxford Bgen file for Chr19. To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 999c860cfb0f3484d1a78ef639c594fa filesize: 4.0G filetype: .bgen number_of_participants: 17444 number_of_variants: 603516 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr20_f4 name: Chr20 description: Data for Chr20 data_distributions: - id: alspacdcs:59dd1ebbefb28c2b5818fb2aca9805de_filtered_20.bgen name: filtered_20.bgen description: >- An Oxford Bgen file for Chr20. To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 59dd1ebbefb28c2b5818fb2aca9805de filesize: 2.8G filetype: .bgen number_of_participants: 17444 number_of_variants: 617694 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr21_f4 name: Chr21 description: Data for Chr21 data_distributions: - id: alspacdcs:dce2d85e4d08018ea365afdeac561447_filtered_21.bgen name: filtered_21.bgen description: >- An Oxford Bgen file for Chr21. 
To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: dce2d85e4d08018ea365afdeac561447 filesize: 1.9G filetype: .bgen number_of_participants: 17444 number_of_variants: 377554 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr22_f4 name: Chr22 description: Data for Chr22 data_distributions: - id: alspacdcs:b5ba868e802d8eee4ac76b0f878d427c_filtered_22.bgen name: filtered_22.bgen description: >- An Oxford Bgen file for Chr22. To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: b5ba868e802d8eee4ac76b0f878d427c filesize: 2.1G filetype: .bgen number_of_participants: 17444 number_of_variants: 365644 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr23_f4 name: Chr23 description: Data for Chr23 data_distributions: - id: alspacdcs:512a78f6c379ce43e827da44a91b4c5f_filtered_23.bgen name: filtered_23.bgen description: >- An Oxford Bgen file for Chr23. To be used with alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 512a78f6c379ce43e827da44a91b4c5f filesize: 5.9G filetype: .bgen number_of_participants: 17444 number_of_variants: 1250218
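The freeze description for this dataset advises discarding the 199 SNPs with known strand problems. A minimal Python sketch of that filtering step is given here. It assumes the problematic SNP ids have been saved one per line in a plain text file (the name `strand_problem_snps.txt` is hypothetical) and that per-variant records are held as dictionaries with a hypothetical `rsid` key; neither the file layout nor the key name is defined by this catalogue.

```python
from pathlib import Path


def load_exclusion_ids(path):
    """Read one variant id per line, ignoring blank lines and '#' comments."""
    ids = set()
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            ids.add(line)
    return ids


def drop_excluded(variants, excluded_ids, id_key="rsid"):
    """Return the records whose id is not in the exclusion set."""
    return [v for v in variants if v[id_key] not in excluded_ids]


if __name__ == "__main__":
    # Hypothetical file name and toy records, for illustration only.
    exclusion_file = Path("strand_problem_snps.txt")
    if exclusion_file.exists():
        excluded = load_exclusion_ids(exclusion_file)
        results = [{"rsid": "rs1", "beta": 0.1}, {"rsid": "rs2", "beta": -0.2}]
        kept = drop_excluded(results, excluded)
        print(f"kept {len(kept)} of {len(results)} variants")
```

Holding the ids in a set keeps each lookup cheap, which matters when filtering the millions of variants listed per chromosome. In practice the same exclusion list can usually be passed directly to the analysis software in use.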
5 Sequence Data
5.1 Whole genome sequencing - G1 (wgs_hiseq_g1)
5.1.1 Description
This dataset contains whole genome sequencing data for G1 individuals and is part of the UK10K dataset. Reference genome build: GRCh37
5.1.2 Methodology
The ALSPAC and TwinsUK cohorts were sequenced to an average read depth of 6.7x through the UK10K program (http://www.uk10k.org) using the Illumina HiSeq platform, and reads were aligned to the GRCh37 human reference using BWA. SNVs were called using samtools/bcftools, and VQSR and GATK were then used to re-call them.
Associated publication:
Please ensure you have permission to access this data (http://www.uk10k.org/data_access.html) before using it.
5.1.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:wgs_hiseq_g1_2016-08-18_f4 name: Whole genome sequencing - G1 version 2016-08-18 freeze 4 description: >- This is the freeze 4 of version 2016-08-18 of the Whole genome sequencing for G1 individuals, part of the UK10K dataset. freeze_size: 341G linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473 all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112 git_tag: https://github.com/alspac/dataset_wgs_hiseq_g1/releases/tag/freeze4 is_current_freeze: true freeze_number: 4 freeze_date: 2024-06-11 previous_freeze: alspacdcs:wgs_hiseq_g1_2016-08-18_f3 freeze_of_alspac_dataset_version: alspacdcs:wgs_hiseq_g1_2016-08-18 freeze_of_named_alspac_dataset: alspacdcs:wgs_hiseq_g1 has_containers: - id: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 ## uuid name: data description: A dir/folder containing the freeze data files has_parts: - id: alspacdcs:1319d16a-a9e8-4fb7-b4ee-a02a4345d98d name: 1_freeze data_distributions: - id: alspacdcs:e0c5c3ec-e61b-48b6-b5f7-c7ecfdb9a014 name: 1_freeze.vcf.gz md5sum: a029c1cd1a1a10e830467299fbb335dd filesize: 26.3GB filetype: vcf.gz belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:1f5242ec-bcbf-4dee-8eef-d81c014297cf name: 1_freeze.vcf.gz.csi md5sum: 50de551ba81402a82de9728ea95e0483 filesize: 145.6KB filetype: .csi belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:a3afc031-0157-4a1a-9325-963407437cde name: 2_freeze data_distributions: - id: alspacdcs:86fbed6d-1d05-4654-98cf-90c84a4e060f name: 2_freeze.vcf.gz md5sum: 72babe074fc3e53b1e1315268511f7ec filesize: 28.8GB filetype: vcf.gz belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:02d35ade-5406-4423-b583-3f912fffd6d8 name: 
2_freeze.vcf.gz.csi md5sum: 5aec6c33496c048f740b592898541689 filesize: 156.1KB filetype: .csi belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:7ff95792-90e8-47fe-ba96-0c547a748b4f name: 3_freeze data_distributions: - id: alspacdcs:416c7611-9bde-4012-a92d-b84b69448b56 name: 3_freeze.vcf.gz md5sum: 9672caad30ce5207afc857f15265a56e filesize: 24.2GB filetype: vcf.gz belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:972ad268-ef6d-49f2-8061-d36660476167 name: 3_freeze.vcf.gz.csi md5sum: 48fd68f3460095f32471b74de80ae28a filesize: 127.9KB filetype: .csi belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:b0aea300-50a3-4c75-95bc-996b04ebe1bb name: 4_freeze.vcf data_distributions: - id: alspacdcs:7bfde034-0983-4238-a675-d45ac002f73b name: 4_freeze.vcf.gz md5sum: 6a35500eba8d4af7a67e5af589b3e3f9 filesize: 23.2GB filetype: vcf.gz belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:e6f65c92-3751-4bd9-af2e-767174683085 name: 4_freeze.vcf.gz.csi md5sum: 20f5fb662923c30e6000ba81247e15dc filesize: 122.6KB filetype: .csi belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:42ff212b-70d7-4db8-ac36-12b06dbae07c name: 5_freeze.vcf data_distributions: - id: alspacdcs:88e9082a-d77a-4875-b020-67fa1631d8e4 name: 5_freeze.vcf.gz md5sum: 7df166f6560000a139f551be6f21624e filesize: 21.6GB filetype: vcf.gz belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:e8673bf7-7375-41e1-84af-f0cf2ab8035a name: 5_freeze.vcf.gz.csi md5sum: 76adb2b829d61f9334403f55a7d071e1 filesize: 116.1KB filetype: .csi belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:338e51e2-816b-4019-82c1-4caa35e5cdfe name: 6_freeze.vcf data_distributions: - id: alspacdcs:eb337552-2193-427e-89b0-a719eef53f20 name: 6_freeze.vcf.gz md5sum: c2b56e9bc605b2fc1a54697a176e4a1c filesize: 21.0GB filetype: vcf.gz 
belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:c7dbf6cd-e8be-4678-b0a3-cda9b564689f name: 6_freeze.vcf.gz.csi md5sum: 25f5ec873519eed3a4e278cb47266f9b filesize: 109.9KB filetype: .csi belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:ea928374-1723-4d63-a556-8391affe5cc7 name: 7_freeze.vcf data_distributions: - id: alspacdcs:8e6332a2-c901-4a37-beac-9cc4e71a6475 name: 7_freeze.vcf.gz md5sum: b05886a2a8f89de82864109368d7a69c filesize: 19.0GB filetype: vcf.gz belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:f8b28862-e40b-430c-a080-01a30fc8e7e9 name: 7_freeze.vcf.gz.csi md5sum: 69c8aedc94f876e0edaa3f8493ca2e94 filesize: 101.8KB filetype: .csi belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:147ef50e-d162-4343-a437-130dd03adc4f name: 8_freeze.vcf data_distributions: - id: alspacdcs:10e38f1c-feb9-424c-b8f1-5ac140a141f1 name: 8_freeze.vcf.gz md5sum: 832e3eca8a7672f66dec0a97d33e363f filesize: 18.8GB filetype: vcf.gz belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:d7e4cb16-3680-4d9a-ad8c-e13cabf6d8e6 name: 8_freeze.vcf.gz.csi md5sum: af52ac2aa78f0f7f69c0bbf0cb804b40 filesize: 92.8KB filetype: .csi belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:34df9600-d9f8-410a-8a2d-ce4f8927d76c name: 9_freeze.vcf data_distributions: - id: alspacdcs:5d8715cc-e6fb-43f2-bd9c-ce2aae728c1e name: 9_freeze.vcf.gz md5sum: be05806b3337f1fb6f884f9c10a0dedd filesize: 14.2GB filetype: vcf.gz belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:a66c335a-04bf-4ec9-8aa2-e922eee5b4b2 name: 9_freeze.vcf.gz.csi md5sum: 65ff200207e4b9f067154e7dbbd5b14a filesize: 75.4KB filetype: .csi belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:75dfd99f-4757-4482-95a8-6a0d3d4fc16e name: 10_freeze.vcf data_distributions: - id: 
alspacdcs:c312b734-ed43-4109-a634-8a0bb4ff29b3 name: 10_freeze.vcf.gz md5sum: 8dc40e17fd16a4f7fd46947cd8efba37 filesize: 16.3GB filetype: vcf.gz belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:9d2dd9af-158d-42f4-94ad-1ee35bd17691 name: 10_freeze.vcf.gz.csi md5sum: 344a55d89f42977d545dd73768bee6b1 filesize: 85.5KB filetype: .csi belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:fee168b6-1ec7-4793-a731-0e007b0afb69 name: 11_freeze.vcf data_distributions: - id: alspacdcs:05ada9db-b03c-442f-a84d-cac99eeca001 name: 11_freeze.vcf.gz md5sum: da169eb3d82bb130c3eba955ec1381d9 filesize: 16.4GB filetype: vcf.gz belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:2436b27e-d541-4c76-8a29-edea10abc75c name: 11_freeze.vcf.gz.csi md5sum: 1f8573df3e205babd9a38cbd3a3769c7 filesize: 85.2KB filetype: .csi belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:0ac098c1-38be-4914-b2dd-5ea80af419ec name: 12_freeze.vcf data_distributions: - id: alspacdcs:3dd22dc0-76b8-4373-b4f2-e9bbf4e3a373 name: 12_freeze.vcf.gz md5sum: 47b92a6ede9e9df895c2134b70c0c1bc filesize: 15.7GB filetype: vcf.gz belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:b37fa69b-0018-4521-9c9d-a840a9b9d7a9 name: 12_freeze.vcf.gz.csi md5sum: 76b3de73d4576dc1b3d90b30677d50b8 filesize: 85.5KB filetype: .csi belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:c8b764ef-bbd1-4a32-9c26-d673a03fb23f name: 13_freeze.vcf data_distributions: - id: alspacdcs:64e89d80-a86c-46dd-8f87-4508707425fe name: 13_freeze.vcf.gz md5sum: ccd89b86e9421cd0f1ebfa9a4cf43228 filesize: 11.8GB filetype: vcf.gz belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:4dbdfee7-98f4-440b-80e0-59efec244b0e name: 13_freeze.vcf.gz.csi md5sum: c87bf856f671a839b10b0d69cadd0d02 filesize: 62.1KB filetype: .csi belongs_to_container: 
alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:099c8270-865b-4aba-9c34-085743639bbc name: 14_freeze.vcf data_distributions: - id: alspacdcs:cf599ca2-dd11-460b-a0a7-f85ae126f264 name: 14_freeze.vcf.gz md5sum: 5d9a04231afd3784e205ff939da426ba filesize: 10.7GB filetype: vcf.gz belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:fd8c3b0d-5842-4881-a219-3090898b1570 name: 14_freeze.vcf.gz.csi md5sum: 0a5a77211053a1ed7b2ce33a8e8b612b filesize: 56.6KB filetype: .csi belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:3ecd2d87-7a67-4d75-9a7d-12790013aeae name: 15_freeze.vcf data_distributions: - id: alspacdcs:6ab8edc2-13a6-471d-a85b-c040db7ab3bd name: 15_freeze.vcf.gz md5sum: 8779e214368a81a82a3831a6099a4e94 filesize: 9.7GB filetype: vcf.gz belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:b14b07fc-a1f8-4ded-86e1-ade43351df3f name: 15_freeze.vcf.gz.csi md5sum: d1fdb4fbc9ac84cd545802728ad7fb22 filesize: 51.7KB filetype: .csi belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:d63be5f3-15c4-45fb-97e2-2f3a874455d2 name: 16_freeze.vcf data_distributions: - id: alspacdcs:409936a1-052f-458d-aa6f-394852a1463c name: 16_freeze.vcf.gz md5sum: 74daf54822613ae3fd731e279026ba6a filesize: 10.6GB filetype: vcf.gz belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:9b059dbb-4356-4c64-9a85-3bd254ae5cd9 name: 16_freeze.vcf.gz.csi md5sum: 3ec8fb57ca2b147816e0a67f694b1162 filesize: 50.4KB filetype: .csi belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:4305ac0f-4c77-40e4-9110-888755c835f8 name: 17_freeze.vcf data_distributions: - id: alspacdcs:033eef37-655b-404c-8f8c-544477499023 name: 17_freeze.vcf.gz md5sum: da9c1da2da281f7a1545af31faea13a3 filesize: 9.1GB filetype: vcf.gz belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: 
alspacdcs:951a4ed9-43d1-4a70-9d30-6c0c31282413 name: 17_freeze.vcf.gz.csi md5sum: e26d19b6435cc8cdfabad63888203371 filesize: 49.9KB filetype: .csi belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:4f48380c-b623-4926-aa2e-c68f84f49241 name: 18_freeze.vcf data_distributions: - id: alspacdcs:6f05ec3b-744c-4094-a1cf-4a9b45164872 name: 18_freeze.vcf.gz md5sum: a7ace5116a6ec3056300504f64c406e3 filesize: 9.4GB filetype: vcf.gz belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:244635ac-290b-40c5-bd12-f9f63711eec1 name: 18_freeze.vcf.gz.csi md5sum: b4c0eb6f8bcd5faff6d23f6f11004a61 filesize: 48.5KB filetype: .csi belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:712d627b-e31d-4981-a968-0f6f9f8ee2ac name: 19_freeze.vcf data_distributions: - id: alspacdcs:d69e019e-1e2f-46c9-b2c8-b2cc4ccbbb2c name: 19_freeze.vcf.gz md5sum: 73079fb7f693e5f7ff8c23fc72a0d62b filesize: 7.0GB filetype: vcf.gz belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:c98d8665-a945-4778-a980-2bdf102f6f14 name: 19_freeze.vcf.gz.csi md5sum: c1b4df2d51ac20fb5fe3335b59f844c4 filesize: 35.7KB filetype: .csi belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:7a81d5a6-ee22-4f8a-9920-094592dab855 name: 20_freeze.vcf data_distributions: - id: alspacdcs:fd23d6f9-156e-482d-9d97-bd210a0d3344 name: 20_freeze.vcf.gz md5sum: 46c2a5875f1e31137cd0e7a42a98ee04 filesize: 7.5GB filetype: vcf.gz belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:7dd60e32-c766-46be-8e25-cf7fa89fa381 name: 20_freeze.vcf.gz.csi md5sum: cea8babd39bc5f0e0640c668fc9854d5 filesize: 38.2KB filetype: .csi belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:c32cb1b8-623d-444f-a0db-b9b20c5f7056 name: 21_freeze.vcf data_distributions: - id: alspacdcs:12233fc0-383d-4b97-a45d-de478ef165b8 name: 21_freeze.vcf.gz md5sum: 
68ad67687100082013805e8bcd63b989 filesize: 4.3GB filetype: vcf.gz belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:964ef7af-f7d7-4747-b035-08a2d8069b5b name: 21_freeze.vcf.gz.csi md5sum: 1ef4648fe43cf331b600c610e1daaa4c filesize: 22.1KB filetype: .csi belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:863f95cc-6ec2-406a-a072-6740a77dbcf6 name: 22_freeze.vcf data_distributions: - id: alspacdcs:120d67ae-8c24-48d6-ad74-e9ec1865d3b4 name: 22_freeze.vcf.gz md5sum: 11aac1ce01ecf5fa92b1f0b5c40209c7 filesize: 4.4GB filetype: vcf.gz belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:a45e4ab3-57e4-45ab-a149-87e5fc49e534 name: 22_freeze.vcf.gz.csi md5sum: 73bc5296a886342eb1a10e249f314c49 filesize: 22.1KB filetype: .csi belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:b445d1b2-d59c-4f14-b0a5-f9bf636fecfd name: X_freeze.vcf data_distributions: - id: alspacdcs:eb02a97b-4ea9-4769-a6ca-a5cbdfb65b5f name: X_freeze.vcf.gz md5sum: 1dd617a386e1fdb0273dcfc9e1231d32 filesize: 10.5GB filetype: vcf.gz belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 - id: alspacdcs:463b2334-a8c0-45aa-a436-93e8e9752fd8 name: X_freeze.vcf.gz.csi md5sum: d687ace453fb2a58484fa1db45c0e7cd filesize: 96.0KB filetype: .csi belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
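Every file in a freeze is listed with an `md5sum`, so a transferred copy can be checked against the catalogue. The following is a minimal Python sketch of such a check using only the standard library; the function names are illustrative and not part of any ALSPAC tooling, and the file path in the usage note is an assumption about where the file has been saved.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that multi-gigabyte files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_catalogue(path, expected_md5):
    """Return True if the file's MD5 equals the value listed in the catalogue."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_catalogue("22_freeze.vcf.gz", "11aac1ce01ecf5fa92b1f0b5c40209c7")` would compare a local copy of the chromosome 22 file with the value recorded above. A mismatch usually indicates an incomplete or corrupted transfer.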
5.2 Whole exome sequencing - G0 & G1 (wes_novaseq_g0_g1)
5.2.1 Description
This dataset contains whole exome sequencing for G0 and G1 individuals. It was generated at the Sanger Institute as part of an initiative to sequence multiple birth cohorts: ALSPAC, MCS and BiB. As part of this initiative, the exome sequencing data will also be available via EGA, but researchers will still gain access through ALSPAC's project approval system. Reference genome build: GRCh38
5.2.2 Methodology
Exome sequencing was conducted on DNA for 12,374 participants (8,605 children and 3,389 of their parents) at the Sanger Institute, using Illumina NovaSeq. Reads were aligned to GRCh38 with BWA-MEM. There was an average on-target depth of ~62X for ALSPAC.
QC was conducted on the dataset at the Sanger Institute; details are given in the associated publication (Koko et al., 2024). Sample QC was done before variant calling (base-calls after sequencing, alignment quality, CRAM file quality) and after variant calling (PCA analysis, comparison to array data, relatedness). Integrated variant QC removed potentially false positive variants using a trained random forest model. Genotype QC removed low-quality individual genotype calls.
Single nucleotide variant (SNV) and small insertion/deletion (indel) calling was conducted with GATK HaplotypeCaller, GenomicsDBImport and GenotypeGVCFs (GATK version 4.2.4.0 for ALSPAC), following GATK best practices (Van der Auwera and O'Connor, 2020).
Associated publication:
- doi.org/10.12688/wellcomeopenres.22697.1
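The freeze metadata for this dataset records a `number_of_variants` for each chromosome file, annotated with the command `bcftools query -f '%POS\n' file.vcf.gz | wc -l`. Where bcftools is not to hand, a roughly equivalent count can be sketched in Python by counting the non-header lines of the compressed VCF. This is an illustrative sketch rather than the method used to build the catalogue; the function name is hypothetical, and it relies on Python's `gzip` module being able to read the compressed file.

```python
import gzip


def count_variant_records(vcf_gz_path):
    """Count data lines (non-empty lines that do not start with '#')
    in a gzip-compressed VCF file."""
    count = 0
    with gzip.open(vcf_gz_path, "rt") as handle:
        for line in handle:
            if line.strip() and not line.startswith("#"):
                count += 1
    return count
```

The file is streamed line by line, so memory use stays low, but a full pass over a multi-gigabyte file will be slow; bcftools is likely to be the better choice where it is installed.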
5.2.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_f4 name: >- Whole Exome Sequencing - Novaseq - G0 & G1 version 2024-09-20 freeze 4 description: >- This is first iteration of wes_novaseq_g0_g1, first introduced in freeze 4. It contains data in vcf 4.2 format. It contains the majority of the G1 cohort (n=~8296), accompanied by G0 mothers (n=~1642) and partners (n=~1630) to create trios. Participants who have withdrawn their consent are removed and an omics ID applied according to the freeze. Over time the participants are able to withdraw their consent and will be removed from the dataset, so the number of available individuals can reduce as time progresses. This exome sequencing (ES) data was conducted at the Sanger institute and was part of an effort to ES ALSPAC, MCS and BiB. All ES data was quality controlled at the Sanger institute prior to this ALSPAC release and has been extensively document in the relevant publication (see below). In brief (exert from associated publication, Koko et al., 2024): "Sample QC: * Before variant calling: Samples were removed if they failed one or more filters based on quality of base-calls after sequencing, or quality of the CRAM files of aligned reads. The remainder then underwent variant calling. * After variant calling: We assigned individuals to populations using principal component analysis (PCA), then identified and removed individuals who were outliers on one or more variant-based metrics within each of the populations. We compared the exome data to genotyping array data from the same samples and removed samples that did not match as expected, since these could be sample mix-ups. The samples were also checked for unexpected relatedness; samples showing conflicts between reported and inferred relatedness were removed. 
This sample QC was split in two separate steps, before and after variant and genotype QC, as detailed in the coming sections. Integrated variant and genotype QC: * Variant QC: We removed candidate variants which may not be real, instead being artefacts or mapping errors, using a trained random forest model to distinguish likely true positives from likely false positives. * Genotype QC: We removed low-quality individual genotype calls from the dataset. This was done in conjunction with variant QC, as we will explain below." for extended information such as thresholds please find within the publication. Associated publication: Koko et al., 2024 DOI: doi.org/10.12688/wellcomeopenres.22697.1 freeze_size: 167G linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473 all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112 git_tag: https://github.com/alspac/dataset_wes_novaseq_g0_g1/releases/tag/freeze4 is_current_freeze: true freeze_number: 4 freeze_date: 2024-06-11 previous_freeze: N/A freeze_of_alspac_dataset_version: alspacdcs:wes_novaseq_g0_g1_2024-09-20 freeze_of_named_alspac_dataset: alspacdcs:wes_novaseq_g0_g1 has_parts: - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr1_data_f4 name: chr1_data data_distributions: - id: alspacdcs:chr1_data.vcf.gz name: chr1_data.vcf.gz description: >- vcf file containing all participants for chromsome 1, to be used with chr1_data.vcf.gz.csi md5sum: c61d331e2c58b800516da170853f8220 filesize: 17G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 370645 # bcftools query -f '%POS\n' file.vcf.gz | wc -l - id: alspacdcs:_chr1_data.vcf.gz.csi name: chr1_data.vcf.gz.csi description: >- index for vcf file - chr1_data.vcf.gz, generated using bcftools v1.19. 
md5sum: f413bc9edb1d2a959f38790a3c72656c filesize: 64K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr2_data_f4 name: chr2_data data_distributions: - id: alspacdcs:chr2_data.vcf.gz name: chr2_data.vcf.gz description: >- vcf file containing all participants for chromsome 2, to be used with chr2_data.fvcf.gz.csi md5sum: 6c6d6b76a6792444058ff19c0036381c filesize: 12G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 272150 - id: alspacdcs:_chr2_data.vcf.gz.csi name: chr2_data.vcf.gz.csi description: >- index for vcf file - chr2_data.vcf.gz, generated using bcftools v1.19. md5sum: dc7fe92d532898ecad15efe923c48a12 filesize: 48K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr3_data_f4 name: chr3_data data_distributions: - id: alspacdcs:chr3_data.vcf.gz name: chr3_data.vcf.gz description: >- vcf file containing all participants for chromsome 3, to be used with chr3_data.vcf.gz.csi md5sum: 1bc654effca79e7c67b0e0e9cd180064 filesize: 9.1G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 206875 - id: alspacdcs:_chr3_data.vcf.gz.csi name: chr3_data.vcf.gz.csi description: >- index for vcf file - chr3_data.vcf.gz, generated using bcftools v1.19. md5sum: 5e9918574d70f08d50c52bc755e99a57 filesize: 48K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr4_data_f4 name: chr4_data data_distributions: - id: alspacdcs:chr4_data.vcf.gz name: chr4_data.vcf.gz description: >- vcf file containing all participants for chromsome 4, to be used with chr4_data.vcf.gz.csi md5sum: 409b4664817cbccdb04c64ef50c20260 filesize: 6.2G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 140675 - id: alspacdcs:_chr4_data.vcf.gz.csi name: chr4_data.vcf.gz.csi description: >- index for vcf file - chr4_data.vcf.gz, generated using bcftools v1.19. 
md5sum: 2e97a34c0d5de1f4c20a96013ddd3954 filesize: 32K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr5_data_f4 name: chr5_data data_distributions: - id: alspacdcs:chr5_data.vcf.gz name: chr5_data.vcf.gz description: >- vcf file containing all participants for chromsome 5, to be used with chr5_data.vcf.gz.csi md5sum: ca1aefe6597d304995b6fadf26cc1dc6 filesize: 7.1G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 161010 - id: alspacdcs:_chr5_data.vcf.gz.csi name: chr5_data.vcf.gz.csi description: >- index for vcf file - chr5_data.vcf.gz, generated using bcftools v1.19. md5sum: bb0aa89a3bf4ea37d0766437bf954fde filesize: 32K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr6_data_f4 name: chr6_data data_distributions: - id: alspacdcs:chr6_data.vcf.gz name: chr6_data.vcf.gz description: >- vcf file containing all participants for chromsome 6, to be used with chr6_data.vcf.gz.csi md5sum: e0eaf0d3a06ce9b9b74be440d39702f5 filesize: 8.1G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 181754 - id: alspacdcs:_chr6_data.vcf.gz.csi name: chr6_data.vcf.gz.csi description: >- index for vcf file - chr6_data.vcf.gz, generated using bcftools v1.19. md5sum: d106a089e187fd067841488006d412f3 filesize: 48K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr7_data_f4 name: chr7_data data_distributions: - id: alspacdcs:chr7_data.vcf.gz name: chr7_data.vcf.gz description: >- vcf file containing all participants for chromsome 7, to be used with chr7_data.vcf.gz.csi md5sum: e433cbd47a52fb3a0876a520a4134d31 filesize: 8.1G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 181925 - id: alspacdcs:_chr7_data.vcf.gz.csi name: chr7_data.vcf.gz.csi description: >- index for vcf file - chr7_data.vcf.gz, generated using bcftools v1.19. 
md5sum: 6c2fa683baf0b095cd86d877171e481f filesize: 48K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr8_data_f4 name: chr8_data data_distributions: - id: alspacdcs:chr8_data.vcf.gz name: chr8_data.vcf.gz description: >- vcf file containing all participants for chromosome 8, to be used with chr8_data.vcf.gz.csi md5sum: 1c9a537e557fb5fdd125b1025fbce749 filesize: 5.9G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 133894 - id: alspacdcs:_chr8_data.vcf.gz.csi name: chr8_data.vcf.gz.csi description: >- index for vcf file - chr8_data.vcf.gz, generated using bcftools v1.19. md5sum: 695209b64080ccbe035f51d4d9b92566 filesize: 32K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr9_data_f4 name: chr9_data data_distributions: - id: alspacdcs:ch9_data.vcf.gz name: chr9_data.vcf.gz description: >- vcf file containing all participants for chromosome 9, to be used with chr9_data.vcf.gz.csi md5sum: 2fbd587057be6f3e6e40bfb9d4cdd072 filesize: 7.1G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 161039 - id: alspacdcs:_chr9_data.vcf.gz.csi name: chr9_data.vcf.gz.csi description: >- index for vcf file - chr9_data.vcf.gz, generated using bcftools v1.19. md5sum: 0aa0503dfe1a267fef70c19b8ec5ce5d filesize: 32K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr10_data_f4 name: chr10_data data_distributions: - id: alspacdcs:chr10_data.vcf.gz name: chr10_data.vcf.gz description: >- vcf file containing all participants for chromosome 10, to be used with chr10_data.vcf.gz.csi md5sum: 3fad84065f76243852cb94f191aafc71 filesize: 6.6G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 149730 - id: alspacdcs:_chr10_data.vcf.gz.csi name: chr10_data.vcf.gz.csi description: >- index for vcf file - chr10_data.vcf.gz, generated using bcftools v1.19. 
md5sum: 3f74f64664ce6a5631366d98946546a6 filesize: 32K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr11_data_f4 name: chr11_data data_distributions: - id: alspacdcs:chr11_data.vcf.gz name: chr11_data.vcf.gz description: >- vcf file containing all participants for chromosome 11, to be used with chr11_data.vcf.gz.csi md5sum: 67421ec85241f6162eb9a7ab29e1be6b filesize: 11G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 227858 - id: alspacdcs:_chr11_data.vcf.gz.csi name: chr11_data.vcf.gz.csi description: >- index for vcf file - chr11_data.vcf.gz, generated using bcftools v1.19. md5sum: eec8568f19181000442d8948edcdc65d filesize: 32K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr12_data_f4 name: chr12_data data_distributions: - id: alspacdcs:chr12_data.vcf.gz name: chr12_data.vcf.gz description: >- vcf file containing all participants for chromosome 12, to be used with chr12_data.vcf.gz.csi md5sum: eb497a8adb2372048ed1badaecb92a96 filesize: 8.5G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 193518 - id: alspacdcs:_chr12_data.vcf.gz.csi name: chr12_data.vcf.gz.csi description: >- index for vcf file - chr12_data.vcf.gz, generated using bcftools v1.19. md5sum: 86a00f322b4c7eb4376c5d5f49ebc8d8 filesize: 32K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr13_data_f4 name: chr13_data data_distributions: - id: alspacdcs:chr13_data.vcf.gz name: chr13_data.vcf.gz description: >- vcf file containing all participants for chromosome 13, to be used with chr13_data.vcf.gz.csi md5sum: 0e8074d71e841cdebc9fd86247c46c3f filesize: 2.8G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 63931 - id: alspacdcs:_chr13_data.vcf.gz.csi name: chr13_data.vcf.gz.csi description: >- index for vcf file - chr13_data.vcf.gz, generated using bcftools v1.19. 
md5sum: 9962040c9d4122ed040f924eb8d2174f filesize: 16K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr14_data_f4 name: chr14_data data_distributions: - id: alspacdcs:chr14_data.vcf.gz name: chr14_data.vcf.gz description: >- vcf file containing all participants for chromosome 14, to be used with chr14_data.vcf.gz.csi md5sum: e7b8b73da8ddd0bd666f988d5d9d049e filesize: 5.7G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 128137 - id: alspacdcs:_chr14_data.vcf.gz.csi name: chr14_data.vcf.gz.csi description: >- index for vcf file - chr14_data.vcf.gz, generated using bcftools v1.19. md5sum: 63e6e3769d9bb411288d2b5174c61d9d filesize: 32K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr15_data_f4 name: chr15_data data_distributions: - id: alspacdcs:chr15_data.vcf.gz name: chr15_data.vcf.gz description: >- vcf file containing all participants for chromosome 15, to be used with chr15_data.vcf.gz.csi md5sum: 19ed6a943eb7d379f693b2a9e0f7ff22 filesize: 5.6G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 127646 - id: alspacdcs:_chr15_data.vcf.gz.csi name: chr15_data.vcf.gz.csi description: >- index for vcf file - chr15_data.vcf.gz, generated using bcftools v1.19. md5sum: abc35107b45198cf856fbb943c94c5ba filesize: 32K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr16_data_f4 name: chr16_data data_distributions: - id: alspacdcs:chr16_data.vcf.gz name: chr16_data.vcf.gz description: >- vcf file containing all participants for chromosome 16, to be used with chr16_data.vcf.gz.csi md5sum: b5bae04936506ba275664aafd595d99d filesize: 8.4G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 186300 - id: alspacdcs:_chr16_data.vcf.gz.csi name: chr16_data.vcf.gz.csi description: >- index for vcf file - chr16_data.vcf.gz, generated using bcftools v1.19. 
md5sum: 8e97703c8f865ef4cb90db140903022f filesize: 32K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr17_data_f4 name: chr17_data data_distributions: - id: alspacdcs:chr17_data.vcf.gz name: chr17_data.vcf.gz description: >- vcf file containing all participants for chromosome 17, to be used with chr17_data.vcf.gz.csi md5sum: 471702bad7d86459c024fb468c7a7ee9 filesize: 10G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 224774 - id: alspacdcs:_chr17_data.vcf.gz.csi name: chr17_data.vcf.gz.csi description: >- index for vcf file - chr17_data.vcf.gz, generated using bcftools v1.19. md5sum: 4782b222eda4bfcb871f330fa2a2728a filesize: 32K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr18_data_f4 name: chr18_data data_distributions: - id: alspacdcs:chr18_data.vcf.gz name: chr18_data.vcf.gz description: >- vcf file containing all participants for chromosome 18, to be used with chr18_data.vcf.gz.csi md5sum: 3745cae09c423dd4cd00d772c82243d2 filesize: 2.5G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 57017 - id: alspacdcs:_chr18_data.vcf.gz.csi name: chr18_data.vcf.gz.csi description: >- index for vcf file - chr18_data.vcf.gz, generated using bcftools v1.19. md5sum: 7439d7225754bfb05a5d60544d8ec763 filesize: 16K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr19_data_f4 name: chr19_data data_distributions: - id: alspacdcs:chr19_data.vcf.gz name: chr19_data.vcf.gz description: >- vcf file containing all participants for chromosome 19, to be used with chr19_data.vcf.gz.csi md5sum: e1ca35ee4003146b6d78aa60a11e019c filesize: 13G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 271080 - id: alspacdcs:_chr19_data.vcf.gz.csi name: chr19_data.vcf.gz.csi description: >- index for vcf file - chr19_data.vcf.gz, generated using bcftools v1.19. 
md5sum: 17f12041e5261526ae320439f2736fa4 filesize: 32K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr20_data_f4 name: chr20_data data_distributions: - id: alspacdcs:chr20_data.vcf.gz name: chr20_data.vcf.gz description: >- vcf file containing all participants for chromosome 20, to be used with chr20_data.vcf.gz.csi md5sum: 7c8cc69afb82df0442116e4dbfd99269 filesize: 4.3G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 96655 - id: alspacdcs:_chr20_data.vcf.gz.csi name: chr20_data.vcf.gz.csi description: >- index for vcf file - chr20_data.vcf.gz, generated using bcftools v1.19. md5sum: 2e4526809c85ae04c1a7690a430e3fad filesize: 16K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr21_data_f4 name: chr21_data data_distributions: - id: alspacdcs:chr21_data.vcf.gz name: chr21_data.vcf.gz description: >- vcf file containing all participants for chromosome 21, to be used with chr21_data.vcf.gz.csi md5sum: 46202d0ba651b0ec1c9b9fbc980fdcb7 filesize: 1.9G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 42207 - id: alspacdcs:_chr21_data.vcf.gz.csi name: chr21_data.vcf.gz.csi description: >- index for vcf file - chr21_data.vcf.gz, generated using bcftools v1.19. md5sum: 4e010094593ac87fce4fe5d55cc80bee filesize: 16K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr22_data_f4 name: chr22_data data_distributions: - id: alspacdcs:chr22_data.vcf.gz name: chr22_data.vcf.gz description: >- vcf file containing all participants for chromosome 22, to be used with chr22_data.vcf.gz.csi md5sum: a423d731c368b4ce1f30a896ab0f1c18 filesize: 4.3G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 94446 - id: alspacdcs:_chr22_data.vcf.gz.csi name: chr22_data.vcf.gz.csi description: >- index for vcf file - chr22_data.vcf.gz, generated using bcftools v1.19. 
md5sum: ac6eb2a7076ec0221b86c0ac8300c1af filesize: 16K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chrX_data_f4 name: chrX_data data_distributions: - id: alspacdcs:chrX_data.vcf.gz name: chrX_data.vcf.gz description: >- vcf file containing all participants for chromosome X, to be used with chrX_data.vcf.gz.csi md5sum: 1ca33edf2265f47f61e34ea4462e5afd filesize: 3.8G filetype: vcf.gz number_of_participants: 11500 number_of_variants: 86925 - id: alspacdcs:_chrX_data.vcf.gz.csi name: chrX_data.vcf.gz.csi description: >- index for vcf file - chrX_data.vcf.gz, generated using bcftools v1.19. md5sum: 1f133d314e9acff9c0075184d495792a filesize: 32K filetype: .csi - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chrY_data_f4 name: chrY_data data_distributions: - id: alspacdcs:chrY_data.vcf.gz name: chrY_data.vcf.gz description: >- vcf file containing all participants for chromosome Y, to be used with chrY_data.vcf.gz.csi md5sum: 51a2d25baf60cbaef21b457df2c7530b filesize: 368K filetype: vcf.gz number_of_participants: 11500 number_of_variants: 9 - id: alspacdcs:_chrY_data.vcf.gz.csi name: chrY_data.vcf.gz.csi description: >- index for vcf file - chrY_data.vcf.gz, generated using bcftools v1.19. md5sum: e370622e50f6b9b847ff0925eee02313 filesize: 512 filetype: .csi
5.3 Whole exome sequencing - G1 (wes_novaseq_g1)
5.3.1 Description
This dataset contains whole exome sequencing data for approximately 2,900 G1 individuals, generated at the Broad Institute. Reference genome build: GRCh38.
5.3.2 Methodology
The exomes returned from the Broad Institute did not undergo PCA or relatedness filtering; they are instead provided as raw VCF data. The following thresholds were applied to the samples:
- Chimera rate: Less than 0.05
- Contamination rate: Less than 0.10
- PF aligned rate: More than 0.60
87 individuals believed to be sample mismatches were removed from the dataset. These exomes had a discordance rate above 0.05 when compared with existing array data using bcftools gtcheck.
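The sample-level criteria above can be expressed as a single predicate. This is a sketch only: the `SampleQC` record and its field names are hypothetical, and the real per-sample metrics may be stored under different names and in a different format. The thresholds are the ones stated in this section.

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    """Hypothetical per-sample QC record; field names are illustrative."""
    sample_id: str
    chimera_rate: float
    contamination_rate: float
    pf_aligned_rate: float
    array_discordance: float  # discordance against existing array genotypes


def passes_qc(sample):
    """Apply the thresholds listed in this section to one sample."""
    return (
        sample.chimera_rate < 0.05
        and sample.contamination_rate < 0.10
        and sample.pf_aligned_rate > 0.60
        # Samples with discordance above 0.05 were treated as mismatches.
        and sample.array_discordance <= 0.05
    )


def retained(samples):
    """Return the IDs of samples that pass every threshold."""
    return [s.sample_id for s in samples if passes_qc(s)]
```

For example, `retained([SampleQC("a", 0.01, 0.02, 0.95, 0.01), SampleQC("b", 0.01, 0.02, 0.95, 0.20)])` keeps only `"a"`, because the second sample exceeds the discordance limit.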
Associated publications:
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9980234/ (conducted additional QC beyond dataset)
5.3.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:wes_novaseq_g1_204-04-12_f4 name: >- Whole Exome Sequencing - Novaseq - G1 version 2024-04-09 freeze 4 description: >- This is the first iteration of wes_novaseq_g1, first introduced in freeze 4. It contains data in vcf 4.2 format. It is a subset of the G1 cohort, with participants who have withdrawn their consent removed and omics IDs applied according to the freeze. Samples were selected for whole exome sequencing at the Broad Institute from the G1 cohort (the cohort of index children) and were from subjects who were singletons/unrelated and of European/British ancestry, had blood-derived DNA available, and had been genotyped on a whole genome genotyping array. The QC was performed by the Broad Institute. The following thresholds were applied: Chimera rate < 0.05 Contamination rate < 0.10 PF aligned rate > 0.60 87 individuals believed to be sample mismatches were removed from the dataset. These exomes had a discordance rate above 0.05 when compared with existing array data using bcftools gtcheck. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9980234/ describes this dataset in supplementary materials. 
freeze_size: 28G linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473 all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112 git_tag: https://github.com/alspac/dataset_wes_novaseq_g1/releases/tag/freeze4 is_current_freeze: true freeze_number: 4 freeze_date: 2024-06-11 previous_freeze: N/A freeze_of_alspac_dataset_version: alspacdcs:wes_novaseq_g1_2024-03-26 freeze_of_named_alspac_dataset: alspacdcs:wes_novaseq_g1 has_parts: - id: alspacdcs:wes_novaseq_g1_2024-04-09_all_chr_f4 name: all_chr description: >- All chromosomes and all participants within the dataset contained within a single vcf version 4.2 file, which has been compressed using bcftools 1.19. data_distributions: - id: alspacdcs:3e3bde5e-b410-4135-981b-f923f57a6ce0_all_chr.vcf.gz name: all_chr.vcf.gz description: >- vcf file containing all participants and chromosomes, to be used with all_chr.vcf.gz.csi md5sum: 1f75c2f55107aceaf9d4e7edb19fd364 filesize: 28G filetype: vcf.gz number_of_participants: 2879 #number_of_gene_expression_probe_values: - id: alspacdcs:4da2f634-bdb9-4b21-b051-6fa469ba711c_all_chr.vcf.gz.csi name: all_chr.vcf.gz.csi description: >- index for vcf file - all_chr.vcf.gz, generated using bcftools v1.19. md5sum: ff4baac889f49b1cb1611c3c63627890 filesize: 800K filetype: .csi
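A quick sanity check after receiving the freeze is to compare the number of sample columns in the VCF header with the `number_of_participants` value in the freeze doc (2879 here). The sketch below uses only the Python standard library and assumes the file is gzip- or bgzip-compressed text, as the `.vcf.gz` extension suggests; `vcf_sample_ids` is an illustrative helper, not an ALSPAC function.

```python
import gzip


def vcf_sample_ids(path):
    """Return the sample IDs from the #CHROM header line of a compressed VCF."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first nine columns are fixed VCF fields; sample IDs follow.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                # Reached the variant records without seeing a header line.
                break
    raise ValueError("no #CHROM header line found")
```

With the freeze file in place, `len(vcf_sample_ids("all_chr.vcf.gz"))` would be expected to equal 2879. Only the header is read, so the check returns quickly even for the 28G file.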
6 Epigenetic Data
6.1 DNA methylation - 450k - G0 mothers + G1 (dnam_450_g0m_g1)
6.1.1 Description
This dataset contains Illumina Infinium HumanMethylation450K BeadChip array data on G0 mothers at two timepoints (pregnancy and middle age), G1 participants at 5 timepoints and G0 participants at three timepoints (birth, childhood and adolescence).
This dataset was generated as part of the Accessible Resource for Integrated Epigenomics Studies (http://www.ariesepigenomics.org.uk/). This dataset is superseded by dnam_epic450_g0_g1.
6.1.2 Methodology
Associated publication:
Associated R package:
6.1.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:dnam_450_g0m_g1_2016-05-03_f4 name: >- DNA methylation - 450k - G0 mothers + G1 version 2016-05-03 Freeze 4 description: >- This is the fourth freeze of the 2016-05-03 version of dnam_450_g0m_g1 dataset. freeze_size: 18G linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473 all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112 git_tag: https://github.com/alspac/dataset_dnam_450_g0m_g1/releases/tag/Freeze4 is_current_freeze: true freeze_number: 4 freeze_date: 2024-06-11 previous_freeze: alspacdcs:dnam_450_g0m_g1_2016-05-03_f3 freeze_of_alspac_dataset_version: alspacdcs:dnam_450_g0m_g1_2016-05-03 freeze_of_named_alspac_dataset: alspacdcs:dnam_450_g0m_g1 has_containers: - id: alspacdcs:7d3cb68e-0cbb-4c60-9f6b-77354a951caf name: data description: A dir/folder containing the data files - id: alspacdcs:88e75491-5bab-4fb7-9099-5341e17f3739 name: betas description: A dir/folder containing the beta files belongs_to_container: alspacdcs:7d3cb68e-0cbb-4c60-9f6b-77354a951caf - id: alspacdcs:b5b7a645-484f-490f-92bc-e2d255504a2d name: control_matrix description: A dir/folder containing the control matrix files belongs_to_container: alspacdcs:7d3cb68e-0cbb-4c60-9f6b-77354a951caf - id: alspacdcs:a98c4fb7-6b92-4f27-9a00-079dbb1a50db name: derived description: A dir/folder containing the derived data (e.g. 
Cell count predictions) belongs_to_container: alspacdcs:7d3cb68e-0cbb-4c60-9f6b-77354a951caf - id: alspacdcs:650f1c7b-e8ab-40c9-90b3-67d3c552100a name: cellcounts description: A dir/folder containing the cell count predictions belongs_to_container: alspacdcs:a98c4fb7-6b92-4f27-9a00-079dbb1a50db - id: alspacdcs:3c795f53-8dfc-45fe-b88b-5363a5a3bc77 name: cord description: >- A dir/folder containing the cell count predictions for cord. belongs_to_container: alspacdcs:650f1c7b-e8ab-40c9-90b3-67d3c552100a - id: alspacdcs:06167109-d949-4d24-b33a-a70bc48e49a1 name: andrews-and-bakulski description: >- A dir/folder containing the cell count predictions by andrews-and-bakulski algorithm belongs_to_container: alspacdcs:3c795f53-8dfc-45fe-b88b-5363a5a3bc77 - id: alspacdcs:e9b1e42c-85e7-4a3f-bcf0-f1fa3d20b5b8 name: gervinandlyle description: >- A dir/folder containing the cell count predictions by gervinandlyle algorithm/method. belongs_to_container: alspacdcs:3c795f53-8dfc-45fe-b88b-5363a5a3bc77 - id: alspacdcs:54feaa38-f2de-4f98-babe-13c4c0b4791a name: gse68456 description: >- A dir/folder containing the cell count predictions by the gse68456 method. belongs_to_container: alspacdcs:3c795f53-8dfc-45fe-b88b-5363a5a3bc77 - id: alspacdcs:9d8ee029-67cc-47f2-a663-7bac8d803459 name: houseman description: >- A dir/folder containing the cell count predictions by houseman method. belongs_to_container: alspacdcs:650f1c7b-e8ab-40c9-90b3-67d3c552100a - id: alspacdcs:218a4ebd-ae56-4f5a-aa47-9614cb633a1e name: detection_p_values description: A dir/folder containing the matrix of detection values belongs_to_container: alspacdcs:7d3cb68e-0cbb-4c60-9f6b-77354a951caf - id: alspacdcs:cb1d7257-328f-4f7b-b578-133ed4eda164 name: qc.objects_all description: >- A dir/folder containing the samples extracted from lims and not cleaned. 
belongs_to_container: alspacdcs:7d3cb68e-0cbb-4c60-9f6b-77354a951caf - id: alspacdcs:9b6bd75c-0da7-4ab8-9bb1-e5a9e4a3854d name: qc.objects_clean description: A dir/folder containing the cleaned samples from Lims belongs_to_container: alspacdcs:7d3cb68e-0cbb-4c60-9f6b-77354a951caf - id: alspacdcs:672a863a-458c-477f-93b3-f92454b490fa name: samplesheet description: A dir/folder containing the manifest file from Lims. belongs_to_container: alspacdcs:7d3cb68e-0cbb-4c60-9f6b-77354a951caf has_parts: - id: alspacdcs:eb35b571-f62d-4cd9-91a5-779ad8ae334b name: betas description: >- Normalized betas using functional normalization. We used 10 PCs on the controlmatrix to regress out technical variation. Slide was regressed out as random effect before normalization. CpGs are in rows and samples in columns. data_distributions: - id: alspacdcs:06428ec1-232f-45e0-b17a-40a4b382c6e0 name: data.Robj description: >- R data object for the Normalized beta data. md5sum: 454aac748f353ea4bd73afb1717c2716 filesize: 17G filetype: .Robj belongs_to_container: alspacdcs:88e75491-5bab-4fb7-9099-5341e17f3739 number_of_participants: 4843 number_of_sites: 482855 - id: alspacdcs:06b395ba-9cf9-4985-93f4-35e4011f6d28 name: control matrix description: >- The 850 control probes are summarized in 42 control types. These probes can roughly be divided into negative control probes (613), probes intended for between array normalization (186) and the remainder (49), which are designed for quality control, including assessing the bisulfite conversion rate. None of these probes are designed to measure a biological signal. The summarized control probes can be used as surrogates for unwanted variation and are used for the functional normalization. Samples are rows and 42 control types are in columns. data_distributions: - id: alspacdcs:7b41f832-6201-42f1-bb27-6463151dc2fa name: data.txt description: >- Plain text file of the control matrix. 
md5sum: 471b487a4b0761f00e33088b0065dd94 filesize: 1.8M filetype: .txt belongs_to_container: alspacdcs:b5b7a645-484f-490f-92bc-e2d255504a2d number_of_participants: 4843 - id: alspacdcs:102cbbca-7165-42c0-8b49-1d3ecabd1bb8 name: andrews and bakulski cord cell counts description: >- Cellcounts in cord predicted using cord reference published in Bakulski et al 2016 (PMID: 27019159). This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns. data_distributions: - id: alspacdcs:d9cba595-0f19-40d8-ab2c-538c55f56b28 name: data.txt description: >- Plain text file of cellcounts in cord predicted using Bakulski. md5sum: 79b04868cc502a1a34ade01958f22790 filesize: 118k filetype: .txt belongs_to_container: alspacdcs:06167109-d949-4d24-b33a-a70bc48e49a1 number_of_participants: 912 - id: alspacdcs:29df92c4-c042-4b29-93a2-06d5ae4e8dee name: gervin and lyle cord cell counts description: >- Cellcounts in cord predicted using GervinandLyle cord reference (unpublished). This reference has been implemented in meffil. Samples are in rows and cell types in columns. data_distributions: - id: alspacdcs:15371e80-9b1d-4776-ad5f-400e9bf8f02b name: data.txt description: >- Plain text file of cell counts predicted using GervinandLyle cord reference. md5sum: 0d8535330ac6e12e7f3c5a5f3f30e600 filesize: 100k filetype: .txt belongs_to_container: alspacdcs:e9b1e42c-85e7-4a3f-bcf0-f1fa3d20b5b8 number_of_participants: 912 - id: alspacdcs:8196d769-fa52-4dd3-bd62-d81cccb77fc7 name: gse68456 cord cell counts description: >- Cellcounts in cord predicted using cord reference published in de Goede et al (PMID: 26366232). This reference has been implemented in meffil. Samples are in rows and cell types in columns. data_distributions: - id: alspacdcs:d821314a-6716-4de9-8f27-2d65621d6617 name: data.txt description: >- Plain text file containing cell counts predicted using cord reference. 
md5sum: 837e1e40bf27d8f6bd1a402f016b798e filesize: 120k filetype: .txt belongs_to_container: alspacdcs:54feaa38-f2de-4f98-babe-13c4c0b4791a number_of_participants: 912 - id: alspacdcs:280efa41-1668-456e-9974-9b4a45d13417 name: houseman cell counts description: >- Cell counts extracted using Houseman algorithm implemented in meffil (PMID: 22568884). Samples are in rows and cell types in columns. data_distributions: - id: alspacdcs:ae1eb48d-cf51-4e88-b2d4-643b610f6f27 name: data.txt description: >- Text file of the cell counts calculated using Houseman algorithm. md5sum: 2792f7708e710536c069b05c0192c57d filesize: 569k filetype: .txt belongs_to_container: alspacdcs:9d8ee029-67cc-47f2-a663-7bac8d803459 number_of_participants: 4843 - id: alspacdcs:99af94de-18b9-4caf-a798-fc3b8a8ca554 name: detection p values description: >- This matrix shows the detection pvalues for each sample and each CpG and is extracted from the idat files using the "meffil.load.detection.pvalues" function in meffil. CpGs are in rows and samples in columns. data_distributions: - id: alspacdcs:1dd9411c-e1f1-4cd8-b8dc-f528c893447f name: data.Robj description: >- R object file for the detection p values matrix md5sum: fbbd840f2561e28b443b1c959656f0f4 filesize: 418M filetype: .Robj belongs_to_container: alspacdcs:218a4ebd-ae56-4f5a-aa47-9614cb633a1e number_of_participants: 4843 - id: alspacdcs:83220340-b1e7-4a47-8435-473f9fecbe68 name: qc objects all description: >- This object contains samples extracted from LIMS and has not been cleaned up. This object has been used to do the data cleaning. All data processing has been conducted using Meffil. Meffil uses illuminaio R package to parse Illumina IDAT files into a meffil object called qc.objects. All meffil functions, QC summary, functional normalization and post-normalization QC summary operate on the qc or norm.objects. 
Specifically, the qc.objects contain raw control probe intensities, poor quality probes based on detection Pvalues and number of beads, predicted sex, predicted cellcounts and a samplesheet with batch variables. In addition, copy number variation can be extracted. This object is a list of individuals. data_distributions: - id: alspacdcs:f7fb5bce-dc29-425b-88c9-57559a3b1994 name: data.Robj description: >- R data file of the qc objects. md5sum: 677b3fd580acf8600fc5e31f7597d787 filesize: 497M filetype: .Robj belongs_to_container: alspacdcs:cb1d7257-328f-4f7b-b578-133ed4eda164 number_of_participants: 4843 - id: alspacdcs:5f074661-585b-4613-aa3a-f52960806f3d name: qc objects clean description: >- All data processing has been conducted using Meffil. Meffil uses illuminaio R package to parse Illumina IDAT files into a meffil object called norm.objects. All meffil functions, QC summary, functional normalization and post-normalization QC summary operate on the norm.objects. Specifically, the norm.objects contain raw control probe intensities, quantile distributions of the raw intensities, poor quality probes based on detection Pvalues and number of beads, predicted sex, predicted cellcounts and a samplesheet with batch variables. In addition, copy number variation can be extracted. This object is a list of individuals. data_distributions: - id: alspacdcs:34a39d30-f2b9-4a68-b8be-eb3b8ca3487a name: data.Robj description: >- R object file of qc objects clean. md5sum: 25f961e24da7611bb34b5238175a522a filesize: 659M filetype: .Robj belongs_to_container: alspacdcs:9b6bd75c-0da7-4ab8-9bb1-e5a9e4a3854d number_of_participants: 4843 - id: alspacdcs:01574baf-1473-4e89-8ff9-db04ad000b1d name: samplesheet description: >- Manifest file with columns extracted directly from LIMS and age, sex, aln, timepoint, timecode, sampletype, genotypeQC columns to remove population stratification samples, duplicate.rm column to remove duplicates. Samples in rows, variables in columns. 
data_distributions: - id: alspacdcs:2ff495d8-47db-43aa-ae8e-02c5963f4d6a name: data.Robj description: >- R data object manifest file. md5sum: a9f34d7a00da910d3806089b65ccc547 filesize: 100K filetype: .Robj belongs_to_container: alspacdcs:672a863a-458c-477f-93b3-f92454b490fa number_of_participants: 4843
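The `has_containers` entries describe a folder tree through `belongs_to_container` links, and each data distribution points at the container that holds it. The sketch below shows one way to turn those links into a relative path. It assumes the container names map directly onto directory names, which the freeze doc does not state explicitly; the short ids (`c1` to `c5`) are placeholders standing in for the full `alspacdcs:` identifiers, and `container_path` is an illustrative helper.

```python
def container_path(container_id, containers):
    """Follow belongs_to_container links upward and return a '/'-joined path."""
    parts = []
    seen = set()
    current = container_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        entry = containers[current]
        parts.append(entry["name"])
        # Top-level containers have no parent, so .get() returns None.
        current = entry.get("belongs_to_container")
    return "/".join(reversed(parts))


# Placeholder ids; names and nesting follow the has_containers entries
# of the dnam_450_g0m_g1 freeze doc.
containers = {
    "c1": {"name": "data"},
    "c2": {"name": "derived", "belongs_to_container": "c1"},
    "c3": {"name": "cellcounts", "belongs_to_container": "c2"},
    "c4": {"name": "cord", "belongs_to_container": "c3"},
    "c5": {"name": "andrews-and-bakulski", "belongs_to_container": "c4"},
}

print(container_path("c5", containers))
# prints: data/derived/cellcounts/cord/andrews-and-bakulski
```

Appending a distribution's `name` (for example `data.txt`) to the resolved container path gives the presumed location of that file within the freeze.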
6.2 DNA methylation - EPIC & 450k - G0 + G1 (dnam_epic450_g0_g1)
6.2.1 Description
This dataset contains methylation data collected from both G0 and G1 on two arrays at different timepoints. This dataset supersedes dnam_450_g0m_g1.
It includes data from the Illumina Infinium HumanMethylation450K BeadChip array on G0 mothers at two timepoints (pregnancy and middle age), G1 participants at 5 timepoints and G0 participants at three timepoints (birth, childhood and adolescence). It also contains Infinium MethylationEPIC v1.0 data on 2721 G1 individuals at 2 timepoints.
This dataset was generated as part of the Accessible Resource for Integrated Epigenomics Studies (http://www.ariesepigenomics.org.uk/).
6.2.2 Methodology
Preprocessing and quality control for this dataset were conducted using Meffil.
Associated publications:
Associated R packages:
- aries: https://github.com/MRCIEU/aries is associated with loading and using this dataset.
- meffil: https://github.com/perishky/meffil/ was used for QC and normalisation within this dataset.
6.2.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:dnam_epic450_g0_g1_2022-7-13_f4 name: >- DNA methylation - EPIC & 450k - G0 + G1 version 2022-7-13 Freeze 4 description: >- This is the freeze 4 version of dnam_epic450_g0_g1, which was first introduced in freeze 2 and first released 2022-7-13. freeze_size: 137G linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473 all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112 git_tag: https://github.com/alspac/dataset_dnam_epic450_g0_g1/releases/tag/Freeze4 is_current_freeze: true freeze_number: 4 freeze_date: 2024-06-11 ### Update to align with date of release previous_freeze: 3 freeze_of_alspac_dataset_version: alspacdcs:dnam_epic450_g0_g1_2022-7-13 freeze_of_named_alspac_dataset: alspacdcs:dnam_epic450_g0_g1 has_containers: - id: alspacdcs:a4ae8168-cbdf-44b4-8b10-e5cc7a988826 name: data description: A dir/folder containing the data files - id: alspacdcs:368b116d-4f30-4930-915b-f25a540aabb6 name: betas description: A dir/folder containing the beta files belongs_to_container: alspacdcs:a4ae8168-cbdf-44b4-8b10-e5cc7a988826 - id: alspacdcs:64f83a02-ba1b-455b-a8d5-eebd33f17adf name: control_matrix description: A dir/folder containing the control matrix files belongs_to_container: alspacdcs:a4ae8168-cbdf-44b4-8b10-e5cc7a988826 - id: alspacdcs:087b88a3-bdc8-41df-9574-5f449e78a882 name: derived description: A dir/folder containing the derived data (e.g. 
Cell count predictions and dnamage) belongs_to_container: alspacdcs:a4ae8168-cbdf-44b4-8b10-e5cc7a988826 - id: alspacdcs:ad61ac55-921a-4b7a-9a80-c3c7b8d6e975 name: cellcounts description: A dir/folder containing the cell count predictions belongs_to_container: alspacdcs:087b88a3-bdc8-41df-9574-5f449e78a882 - id: alspacdcs:07768702-d945-449b-ab16-3fa064bf981a name: detection_p_values description: A dir/folder containing the matrix of detection values belongs_to_container: alspacdcs:a4ae8168-cbdf-44b4-8b10-e5cc7a988826 - id: alspacdcs:3ab1bca7-efaf-414f-837d-cf0ad30afb09 name: samplesheet description: A dir/folder containing matrices of the sample identification. belongs_to_container: alspacdcs:a4ae8168-cbdf-44b4-8b10-e5cc7a988826 has_parts: - id: alspacdcs:bc629684-4fa1-42f0-b48c-7e4473d4ed4d name: betas description: >- Normalized betas using functional normalization. We used 10 PCs on the controlmatrix to regress out technical variation. Slide was regressed out as random effect before normalization. CpGs are in rows and samples in columns. data_distributions: - id: alspacdcs:1f940257-3a73-49d1-bd6c-ceeb794c0a4b name: 450.gds description: >- R data object for the Normalized beta data for the 450 array only. md5sum: 02e9b3cdda39d3476bfce111f5935f93 filesize: 22G filetype: .gds belongs_to_container: alspacdcs:368b116d-4f30-4930-915b-f25a540aabb6 number_of_participants: 5927 - id: alspacdcs:4c23fc84-df4d-48c0-969c-c3f8e12dd93f name: common.gds description: >- R data object for the Normalized beta data for both the EPIC and 450 arrays. md5sum: 2d447051e6241bf35dc1bfba4e740848 filesize: 30G filetype: .gds belongs_to_container: alspacdcs:368b116d-4f30-4930-915b-f25a540aabb6 number_of_participants: 8669 - id: alspacdcs:dc5ebcb3-a432-44c1-9f6f-1cbcdf7480ae name: epic.gds description: >- R data object for the Normalized beta data for the EPIC array only. 
md5sum: 0357486c3af3b5ee120c7b05bf077340 filesize: 18G filetype: .gds belongs_to_container: alspacdcs:368b116d-4f30-4930-915b-f25a540aabb6 number_of_participants: 2742 - id: alspacdcs:cde6fb9f-9fa7-4941-aa0f-b3fe3140999b name: control_matrix description: >- The 850 control probes are summarized in 42 control types. These probes can roughly be divided into negative control probes (613), probes intended for between array normalization (186) and the remainder (49), which are designed for quality control, including assessing the bisulfite conversion rate. None of these probes are designed to measure a biological signal. The summarized control probes can be used as surrogates for unwanted variation and are used for the functional normalization. Samples are rows and 42 control types are in columns. data_distributions: - id: alspacdcs:8ca4a216-7dac-47c8-949a-38cc4a26af18 name: 450.txt description: >- Plain text file of the control matrix for the 450 array only. md5sum: 9e6aa62498c5bb7493f7512e274056ba filesize: 2.2M filetype: .txt belongs_to_container: alspacdcs:64f83a02-ba1b-455b-a8d5-eebd33f17adf number_of_participants: 5927 - id: alspacdcs:8ddf1661-41f3-47e8-840e-cce8fed13f04 name: common.txt description: >- Plain text file of the control matrix for both the EPIC and 450 arrays. md5sum: 42d21ff7a2ead483e85b909b279e9912 filesize: 3.2M filetype: .txt belongs_to_container: alspacdcs:64f83a02-ba1b-455b-a8d5-eebd33f17adf number_of_participants: 8669 - id: alspacdcs:09bc4485-93c8-41a5-bfe2-bf44f6e9a345 name: epic.txt description: >- Plain text file of the control matrix for the EPIC array only. md5sum: 7a680d3ccd26a491ec7dde2ce91eeeab filesize: 1.0M filetype: .txt belongs_to_container: alspacdcs:64f83a02-ba1b-455b-a8d5-eebd33f17adf number_of_participants: 2742 - id: alspacdcs:b73ed28a-7219-49b5-94b8-e39d2bbda6f2 name: DNA methylation age description: >- DNA methylation aging estimates from within the dataset. 
Further information on this data and its usage is found within the `dnamage.html` and `dnamage.md` within the docs dir/folder. data_distributions: - id: alspacdcs:2ba6caa3-327a-4615-af93-ec81836bec57 name: dnamage.csv description: >- A csv file containing DNA methylation aging estimates within the dataset. md5sum: bd0c2efef6ee145cd0804d61c7e83151 filesize: 12M filetype: .csv belongs_to_container: alspacdcs:087b88a3-bdc8-41df-9574-5f449e78a882 number_of_participants: 8192 - id: alspacdcs:6a7baf4c-121e-400d-a72f-357c33980ac1 name: cell counts description: >- Files contain cell counts estimated using a variety of cell type references using the Houseman deconvolution algorithm (PMID: 22568884). In each file, samples correspond to rows and cell types to columns. data_distributions: - id: alspacdcs:0fafdf8e-12b0-4cb6-bd85-c0e6bc82c8d1 name: andrews-and-bakulski-cord-blood.txt description: >- Cord blood cell count estimates derived using the Bakulski et al. 2016 reference (PMID 27019159; https://bioconductor.org/packages/release/data/experiment/html/FlowSorted.CordBlood.450k.html). This reference has been implemented in meffil. Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, natural killer cells and nucleated red blood cells. In this text file, samples are in rows and cell types in columns. md5sum: 33c69aa8e50deb28355dcb82d01c7510 filesize: 114K filetype: .txt belongs_to_container: alspacdcs:ad61ac55-921a-4b7a-9a80-c3c7b8d6e975 number_of_participants: 913 - id: alspacdcs:1b7994b3-22db-4aff-99b2-8438d283d12d name: gervin-and-lyle-cord-blood.txt description: >- Cord blood cell count estimates derived using the Gervin et al. 2019 reference (PMID 31455416; GEO accession GSE127824). Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, and natural killer cells. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns. 
md5sum: 099c4cf9bd4ecfee91c19c3c2d2b6f70 filesize: 100K filetype: .txt belongs_to_container: alspacdcs:ad61ac55-921a-4b7a-9a80-c3c7b8d6e975 number_of_participants: 913 - id: alspacdcs:b426de30-1685-45c8-9cf2-5831f65b44d4 name: cord-blood-gse68456.txt description: >- Cord blood cell count estimates derived using the de Goede et al. 2015 reference (PMID 26366232; GEO accession GSE68456). Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, natural killer cells and nucleated red blood cells. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns. md5sum: 941f8a9ce1289ab5baaf10fb29bd8941 filesize: 130K filetype: .txt belongs_to_container: alspacdcs:ad61ac55-921a-4b7a-9a80-c3c7b8d6e975 number_of_participants: 913 - id: alspacdcs:8048cc87-bf93-4d83-8440-a27e6fe9f2ae name: blood-gse35069-complete.txt description: >- Cell counts in peripheral blood predicted using the peripheral blood reference published in Reinius et al. 2012 (PMID: 22848472). Same as 'blood gse35069.txt' but replaces granulocytes with eosinophils and neutrophils. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns. md5sum: 27ab648c56b56e62709a98fcba95a764 filesize: 1.2M filetype: .txt belongs_to_container: alspacdcs:ad61ac55-921a-4b7a-9a80-c3c7b8d6e975 number_of_participants: 8669 - id: alspacdcs:74143e64-6d2d-4e69-b99e-1e12a7df3657 name: blood-gse35069.txt description: >- Blood cell count estimates derived using the Reinius et al. 2012 reference (PMID 25424692; GEO accession GSE35069). Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, and natural killer cells. In this text file, samples are in rows and cell types in columns. 
md5sum: 53fb63b4cef457d90688b3ddb861fa73 filesize: 1021K filetype: .txt belongs_to_container: alspacdcs:ad61ac55-921a-4b7a-9a80-c3c7b8d6e975 number_of_participants: 8669 - id: alspacdcs:3e2301b0-7d09-45bd-8364-700fdc3e873a name: blood-idoloptimized-epic.txt description: >- Cell counts in peripheral blood predicted using the cell type reference from Bioconductor package FlowSorted.Blood.EPIC. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns. md5sum: 7331e83d31e1d200bbff3d041223cde1 filesize: 347K filetype: .txt belongs_to_container: alspacdcs:ad61ac55-921a-4b7a-9a80-c3c7b8d6e975 number_of_participants: 2742 - id: alspacdcs:a803d79f-faf0-4f7a-aae6-2548031834cc name: blood-idoloptimized.txt description: >- Cell counts in peripheral blood predicted using the cell type reference from Bioconductor package FlowSorted.Blood.EPIC but restricted to the IDOLOptimizedCpGs450klegacy CpG sites. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns. md5sum: 2c2bdbf34093960af969ca37ae43c77b filesize: 1.1M filetype: .txt belongs_to_container: alspacdcs:ad61ac55-921a-4b7a-9a80-c3c7b8d6e975 number_of_participants: 8669 - id: alspacdcs:e3a207d4-4dfa-44ad-bdf1-58cae95bb972 name: combined-cord-blood.txt description: >- Cord blood cell count estimates derived using the Bakulski et al, Gervin et al., de Goede et al., and Lin et al. references (https://bioconductor.org/packages/release/data/experiment/html/FlowSorted.CordBloodCombined.450k.html) for CpG sites selected using the IDOL algorithm and optimized for the Illumina Infinium HumanMethylation450 Beadchip. Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, natural killer cells and nucleated red blood cells. In this text file, samples are in rows and cell types in columns. 
md5sum: 7cbcf72ca00012d17d22ff6d21b7575c filesize: 129K filetype: .txt belongs_to_container: alspacdcs:ad61ac55-921a-4b7a-9a80-c3c7b8d6e975 number_of_participants: 913 - id: alspacdcs:7556f038-dc22-47f0-96eb-af81a58eefe6 name: detection p values description: >- This matrix shows the detection pvalues for each sample and each CpG and is extracted from the idat files using the "meffil.load.detection.pvalues" function in meffil. CpGs are in rows and samples in columns. data_distributions: - id: alspacdcs:a1cc883a-a4ce-4660-8926-0bdb67c731fd name: 450.gds description: >- R object file for the detection p values matrix for the 450 array only. md5sum: 1c437226b2aab0c00aed7098e739f49d filesize: 22G filetype: .gds belongs_to_container: alspacdcs:07768702-d945-449b-ab16-3fa064bf981a number_of_participants: 5927 - id: alspacdcs:a2ef6985-97de-4ee3-996f-4d295773fbbc name: common.gds description: >- R object file for the detection p values matrix for both EPIC and 450 arrays. md5sum: c6f4348fa7d92a5f341f69e1784036da filesize: 30G filetype: .gds belongs_to_container: alspacdcs:07768702-d945-449b-ab16-3fa064bf981a number_of_participants: 8669 - id: alspacdcs:d312d4b0-3e87-4a49-8840-b2162c0daa1a name: epic.gds description: >- R object file for the detection p values matrix for the EPIC array only. md5sum: 341d1194d468e10e80be9dc9990c474b filesize: 18G filetype: .gds belongs_to_container: alspacdcs:07768702-d945-449b-ab16-3fa064bf981a number_of_participants: 2742 - id: alspacdcs: description: >- Manifest files with columns extracted directly from LIMS and age, sex, omics ID, timepoint, timecode, sampletype, genotype columns to report sample mismatches, duplicate.rm column to remove duplicates. Samples in rows, variables in columns. data_distributions: - id: alspacdcs:4547e736-b1c4-4ade-adc4-622d44522f7c name: samplesheet-450.csv description: >- R data object manifest file for the 450 array only. 
md5sum: ae8ccd22c2784bb900959362bfdf95e5 filesize: 2.2M filetype: .csv belongs_to_container: alspacdcs:3ab1bca7-efaf-414f-837d-cf0ad30afb09 number_of_participants: 5927 - id: alspacdcs:1c1bf0bc-c254-4c25-96bf-96558f37f059 name: samplesheet-common.csv description: >- R data object manifest file for both the EPIC and 450 arrays. This is a duplicate with samplesheet.csv. md5sum: 1e60ab2f50c9f578c3a6ead251974197 filesize: 3.3M filetype: .csv belongs_to_container: alspacdcs:3ab1bca7-efaf-414f-837d-cf0ad30afb09 number_of_participants: 8669 - id: alspacdcs:fce38b25-3100-4b12-b13d-6b528d8dfffc name: samplesheet-epic.csv description: >- R data object manifest file for the EPIC array only. md5sum: 656ead1968eb4ae0ac07b1a2416907ad filesize: 1.1M filetype: .csv belongs_to_container: alspacdcs:3ab1bca7-efaf-414f-837d-cf0ad30afb09 number_of_participants: 2742 - id: alspacdcs:707a0d83-66fe-4a74-96fc-1b2c5d7f0158 name: samplesheet.csv description: >- R data object manifest file for both the EPIC and 450 arrays. This is a duplicate with samplesheet-common.csv. md5sum: 1e60ab2f50c9f578c3a6ead251974197 # should be the same as samplesheet-common.csv filesize: 3.3M filetype: .csv belongs_to_container: alspacdcs:3ab1bca7-efaf-414f-837d-cf0ad30afb09 number_of_participants: 8669
7 Gene Expression Data
7.1 Gene expression - array - G1 (ge_ht12_g1)
7.1.1 Description
There are two different types of QC'd data available in this version, one performed by David Evans for the Bryois et al 2014 paper, and one performed by Gibran Hemani for the molgenis eQTL mapping meta analysis. A version without QC is available as well. Details on the QC'd versions can be seen below.
This data was generated from LCLs. The majority of samples used in their generation were collected at age 9 years. LCLs are lymphoblastoid cell lines, which were produced by transforming lymphocytes with Epstein-Barr virus and cultured before DNA was extracted. Gene expression patterns in LCLs may not be the same as those of untransformed lymphocytes taken from a 9-year-old.
7.1.2 Methodology
Bryois:
- LCLs from unrelated individuals were grown under identical conditions and cells were frozen in RNAlater. RNA was extracted using an RNeasy extraction kit (Qiagen) and was amplified using the Illumina TotalPrep-96 RNA Amplification kit (Ambion). Expression profiling of the samples, each with two technical replicates, was performed using Illumina Human HT-12 V3 BeadChips (Illumina Inc), which include 48,804 probes; 200 ng of total RNA was processed according to the protocol supplied by Illumina. Raw data was imported into the Illumina Beadstudio software and probes with fewer than three beads present were excluded. Log2-transformed expression signals were then normalized with quantile normalization of the replicates of each individual, followed by quantile normalization across all individuals.
We restricted our analysis to 23,935 probes tagging genes annotated in Ensembl. Principal component analysis was performed on 931 individuals. 62 individuals with principal component 1 or 2 greater than one standard deviation of the population were excluded from further analysis. See http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004461 for full details.
Molgenis:
- Genetic outliers, meaning any individuals that were clear outliers in the first 2 genetic principal components, were removed. Each probe was quantile normalised and then log2 transformed. The data were then adjusted for the first 4 genetic MDS components and for expression principal components (excluding those that had genetic associations), and scaled to have mean 0 and variance 1. See https://github.com/molgenis/systemsgenetics/wiki/eQTL-mapping-analysis-cookbook for full details.
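As an illustration only, the quantile normalisation, log2 transformation and scaling steps described for the Molgenis version can be sketched with NumPy. The function names, the probes-in-rows layout and the choice to normalise across samples are assumptions made for this sketch. The covariate adjustment is omitted, and the cookbook linked above remains the reference for what was actually run.

```python
import numpy as np


def quantile_normalise(x):
    """Give every column (sample) of x the same distribution of values."""
    x = np.asarray(x, dtype=float)
    # Rank of each value within its own column (0 = smallest).
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)
    # Reference distribution: mean of the sorted columns, rank by rank.
    reference = np.sort(x, axis=0).mean(axis=1)
    return reference[ranks]


def normalise_expression(x):
    """Quantile normalise, log2 transform, then scale each probe (row).

    Assumes strictly positive signals and no probe that is constant
    across samples (a constant probe would give a zero standard deviation).
    """
    logged = np.log2(quantile_normalise(x))
    centred = logged - logged.mean(axis=1, keepdims=True)
    return centred / logged.std(axis=1, keepdims=True)


# Toy matrix: 4 probes (rows) by 3 samples (columns), invented values.
toy = np.array([
    [10.0, 45.0, 22.0],
    [20.0, 15.0, 33.0],
    [30.0, 25.0, 44.0],
    [40.0, 35.0, 11.0],
])
scaled = normalise_expression(toy)
```

After this step every row of `scaled` has mean 0 and variance 1, which is the property the description above asks for.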
7.1.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema
id: alspacdcs:ge_ht12_g1_2015-11-02_f4
name: Gene expression - array - G1 release version 2015-11-02 freeze 4
description: >-
  This is the fourth freeze of the 2015-11-02 version of ge_ht12_g1 dataset
  which has .csv distributions of the data rather than .Rdata files in order
  to be easier to use across different data science software and languages.
freeze_size: 2.6G
linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112
git_tag: https://github.com/alspac/dataset_ge_ht12_g1/releases/tag/freeze4
is_current_freeze: true
freeze_number: 4
freeze_date: 2024-06-11
previous_freeze: alspacdcs:ge_ht12_g1_2015-11-02_f3
freeze_of_alspac_dataset_version: alspacdcs:ge_ht12_g1_2015-11-02
freeze_of_named_alspac_dataset: alspacdcs:ge_ht12_g1
has_parts:
  - id: alspacdcs:ge_ht12_g1_2015-11-02_bryosis_f4
    name: Bryosis data
    description: Dataset part for the Bryosis data in ge_ht12_g1 version 2015-11-02 freeze4
    data_distributions:
      - id: alspacdcs:564477290c6962a88697e9a9eae4991a_bryosis.csv
        name: bryosis.csv
        description: >-
          The freeze 4 csv version of the bryosis data. IDs in columns and
          Illumina probe IDs in rows. This is the normalised data used in
          Bryois et al 2014. Probe IDs are mapped to Genes in raw.csv
        md5sum: 564477290c6962a88697e9a9eae4991a
        filesize: 742M
        filetype: .csv
        number_of_participants: 947
        number_of_gene_expression_probe_values: 48630
  - id: alspacdcs:ge_ht12_g1_2015-11-02_molgenis_f4
    name: Molgenis
    description: >-
      Dataset part for the Molgenis data in ge_ht12_g1 version 2015-11-02 freeze 4
    data_distributions:
      - id: alspacdcs:e5dcaa8260bd63189290e403d5ddc9f7_molgenis.csv
        name: molgenis.csv
        description: >-
          The freeze 4 csv version of the molgenis data. IDs in columns and
          Illumina probe IDs in rows. Normalised data following the molgenis
          pipeline, found at
          https://github.com/molgenis/systemsgenetics/wiki/eQTL-mapping-analysis-cookbook.
          Probe IDs are mapped to Genes in raw.csv
        md5sum: e5dcaa8260bd63189290e403d5ddc9f7
        filesize: 752M
        filetype: .csv
        number_of_participants: 879
        number_of_gene_expression_probe_values: 48630
  - id: alspacdcs:ge_ht12_g1_2015-11-02_raw_f4
    name: Raw
    description: Dataset part for the raw data in ge_ht12_g1 version 2015-11-02 freeze 4
    data_distributions:
      - id: alspacdcs:7251c3016a62431b1fc41823ffff2bef_raw.csv
        name: raw.csv
        description: >-
          The freeze 4 csv version of the raw ge data. IDs in columns and
          probes in rows. Two columns per individual, with one column for
          average signal and one column for average number of beads.
          Presumably this is a file generated by the Illumina Genome Studio
          software.
        md5sum: 7251c3016a62431b1fc41823ffff2bef
        filesize: 1.1G
        filetype: .csv
        number_of_participants: 994 ##This is not how wide this dataframe is
        number_of_gene_expression_probe_values: 48630
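The md5sum fields in the freeze metadata can be used to check that a received file is intact. The following is a minimal sketch using only the Python standard library. The function names are hypothetical and the path you pass in depends on where your copy of the freeze lives.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so that multi-gigabyte files fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_catalogue(path, expected_md5):
    """Compare a file's digest with the md5sum listed in the catalogue."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_catalogue("raw.csv", "7251c3016a62431b1fc41823ffff2bef")` would be expected to return True for an intact copy of raw.csv from this freeze.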
8 Omics tips
8.1 Introduction
This section is a guide to using 'Omics datasets. It explains which software to use and describes common file formats. It's a good starting point for beginners and helpful for problem-solving.
8.2 Disclaimer
Some information is copied or reworded from software documentation. Check the original documentation alongside this guide for up-to-date information. Note that some links may no longer work.
8.3 Operating systems
You can use ALSPAC data with any operating system, but Unix-based systems like macOS, Linux, or BSD are more convenient due to the data's size and complexity. We recommend using the command line and programming scripts with languages like Bash, R, Python, or Perl. Many online resources are available to learn these tools. Use free/libre and open-source software where possible.
Links:
- Unix guide: https://www.osc.edu/supercomputing/unix-cmds
- Beginning Python: https://www.python.org/about/gettingstarted/
- Beginning R: https://www.statmethods.net/r-tutorial/index.html
- Free/libre and open-source software: https://www.fsf.org/about/
8.4 Key Omics software
8.4.1 Plink
Plink is a tool for performing quality control and whole genome association analysis of genetic data.
8.4.2 SNPTest
SNPTest is a tool for performing whole genome association analysis of genetic data.
- Link: https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html (Not open source)
8.4.3 BoltLmm
BoltLmm is a tool for performing genome association analysis of genetic data. It is recommended for analyses of more than 5000 samples, and its methods automatically take population substructure into account.
8.4.4 Qctools
A tool for quality control of genetic data. It is also useful for inspecting and modifying .gen, .bgen and .vcf files etc. (see section 8.5 below).
8.4.5 SAMTOOLS
Samtools is a suite of tools which are used for genomic analysis.
- Link: http://www.htslib.org/
8.4.6 VCFTOOLS
A program package that allows you to work with vcf files. Note that it is developed separately from samtools.
8.4.7 BCFTOOLS
This is a part of the samtools project and allows users to manipulate .vcf and .bcf files.
8.5 File types
In a Unix environment the suffix (extension) of a file name does not explicitly mean anything to the operating system, unlike in Windows, which uses the extension to decide the file type. In a Unix system it is just part of the name of the file, and humans use it to distinguish file formats. The following is a non-exhaustive list of file types you may encounter whilst using ALSPAC Omics data.
8.5.1 .gen
This is an 'Oxford' data format for genetic data. The .gen file is a plain text file, which means that standard Unix command line tools such as 'head' or 'less' can be used to inspect the data.
The .gen (genotype) file stores data in a one-line-per-SNP format. The first 5 entries of each line are the SNP ID, the RS ID of the SNP, the base-pair position of the SNP, the allele coded A and the allele coded B. The SNP ID can be used to denote the chromosome number of each SNP. The next three numbers on the line are the probabilities of the three genotypes AA, AB and BB at the SNP for the first individual in the cohort. The next three numbers are the genotype probabilities for the second individual, the next three are for the third individual, and so on. The order of individuals in the genotype file should match the order of the individuals in the sample file (see below). The probabilities need not sum to 1, which allows for the possibility of a NULL genotype call. This format therefore allows for genotype uncertainty. It is the same format as that produced by the genotype calling algorithm CHIAMO. NOTE: We recommend that you arrange SNPs in base-pair order in the genotype files. This is required if you want to use the files with IMPUTE and will make viewing the output of SNPTEST somewhat easier. For example, suppose you want to create a genotype file for 2 individuals at 5 SNPs whose genotypes are:
SNP 1 | AA | AA |
SNP 2 | GG | GT |
SNP 3 | CC | CT |
SNP 4 | CT | CT |
SNP 5 | AG | GG |
The correct genotype file would look like this:
SNP1 rs1 1000 A C 1 0 0 1 0 0
SNP2 rs2 2000 G T 1 0 0 0 1 0
SNP3 rs3 3000 C T 1 0 0 0 1 0
SNP4 rs4 4000 C T 0 1 0 0 1 0
SNP5 rs5 5000 A G 0 1 0 0 0 1
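To make the layout concrete, here is a small sketch that splits one .gen line into its five leading fields and one (AA, AB, BB) probability triple per individual. The function name and the returned dictionary keys are hypothetical and chosen only for this illustration. For real data, a dedicated tool such as qctool is usually the better choice.

```python
def parse_gen_line(line):
    """Split one whitespace-separated .gen line into SNP info and genotype triples."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("a .gen line needs at least 5 leading fields")
    snp_id, rs_id, position, allele_a, allele_b = fields[:5]
    probs = [float(value) for value in fields[5:]]
    if len(probs) % 3 != 0:
        raise ValueError("genotype probabilities must come in groups of three")
    # One (P(AA), P(AB), P(BB)) tuple per individual, in sample-file order.
    triples = [tuple(probs[i:i + 3]) for i in range(0, len(probs), 3)]
    return {
        "snp_id": snp_id,
        "rs_id": rs_id,
        "position": int(position),
        "allele_a": allele_a,
        "allele_b": allele_b,
        "genotype_probs": triples,
    }


record = parse_gen_line("SNP2 rs2 2000 G T 1 0 0 0 1 0")
```

Here the second individual's triple is (0, 1, 0), i.e. the heterozygous genotype GT from the example above.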
8.5.2 .bgen
A binary version of a .gen file. This file cannot be visually inspected on the command line. .bgen files are used because they greatly improve the speed and storage efficiency of software that handles large amounts of Omics data. The full details of the file format are discussed at https://www.well.ox.ac.uk/~gav/bgen_format/ and .bgen files are normally used with tools such as qctool and snptest. There is also a library for reading .bgen files into R: https://bitbucket.org/gavinband/bgen/wiki/rbgen
8.5.3 .sample
The .sample file is paired with either .gen or .bgen files. It contains information on the samples that is not genetic. It is a plain text file that can be inspected with standard Unix command line tools.
Please note that the sample file format changed with the release of SNPTEST v2. Specifically, the way in which covariates and phenotypes are coded on the second line of the file has changed. The sample file has three parts: (a) a header line detailing the names of the columns in the file, (b) a line detailing the types of variables stored in each column, and (c) a line for each individual detailing the information for that individual. Here is an example of the start of a sample file for reference:
ID_1 ID_2 missing cov_1 cov_2 cov_3 cov_4 pheno1 bin1
0 0 0 D D C C P B
1 1 0.007 1 2 0.0019 -0.008 1.233 1
2 2 0.009 1 2 0.0022 -0.001 6.234 0
3 3 0.005 1 2 0.0025 0.0028 6.121 1
4 4 0.007 2 1 0.0017 -0.011 3.234 1
5 5 0.004 3 2 -0.012 0.0236 2.786 0
The header line: This line needs a minimum of three entries. The first three entries should always be ID_1, ID_2 and missing. They denote that the first three columns contain the first ID, second ID and missing data proportion of each individual. Additional entries on this line should be the names of covariates or phenotypes that are included in the file. In the above example, there are 4 covariates named cov_1, cov_2, cov_3, cov_4, a continuous phenotype named pheno1 and a binary phenotype named bin1. NOTE: All phenotypes should appear after the covariates in this file. The second line of the file details the type of variable included in each column. The first three entries of this line should be set to 0. Subsequent entries in this line for covariates and phenotypes should be specified by the following rules:
D | Discrete covariate (coded using positive integers) |
C | Continuous covariates |
P | Continuous Phenotype |
B | Binary Phenotype (0 = Controls, 1 = Cases) |
The remainder of the file should consist of a line for each individual containing the information specified by the entries of the header line (see example above). Use spaces, not TABS, to separate the entries of the sample file, because a space is the expected separator.
Missing values - Specifying missing values for covariates and phenotypes is possible. It was previously recommended that you use -9 for missing values. This was the default value assumed by SNPTEST v1, although the -missing_code option in SNPTEST v1 meant that you could use other numeric values for the missing code. In SNPTEST v2 the behavior of the -missing_code option has changed so that it now takes a comma-separated list of values, each of which is treated as missing when encountered in the sample file(s). Default missing values are now denoted by the two-character string "NA".
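The three-part structure of a sample file (names line, types line, one line per individual) can be read with a few lines of Python. This is a sketch under the assumption of a well-formed, space-separated file. The function name and the toy rows are hypothetical, and values are kept as strings so that missing codes such as "NA" are left for the caller to interpret.

```python
def read_sample_file(lines):
    """Parse the lines of a .sample file into column types and per-individual records."""
    rows = [line.split() for line in lines if line.strip()]
    if len(rows) < 2:
        raise ValueError("a sample file needs a names line and a types line")
    names, types, data = rows[0], rows[1], rows[2:]
    if names[:3] != ["ID_1", "ID_2", "missing"]:
        raise ValueError("the first three columns must be ID_1, ID_2 and missing")
    if len(types) != len(names):
        raise ValueError("the types line must have one entry per column")
    column_types = dict(zip(names, types))  # e.g. {"cov_1": "D", "pheno1": "P"}
    records = [dict(zip(names, row)) for row in data]
    return column_types, records


# Toy content, invented for this illustration.
toy_lines = [
    "ID_1 ID_2 missing cov_1 pheno1 bin1",
    "0 0 0 D P B",
    "1 1 0.007 1 1.233 1",
    "2 2 0.009 1 NA 0",
]
column_types, records = read_sample_file(toy_lines)
```

With a real file you would pass an open file handle instead of `toy_lines`, for example `read_sample_file(open("data.sample"))`.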
8.5.4 .ped
A plink format file that is in plain text and can be viewed with standard tools. It contains genetic variant data. https://www.cog-genomics.org/plink/1.9/formats#ped
8.5.5 .map
A plink format file that is in plain text. It contains information about variants. https://www.cog-genomics.org/plink/1.9/formats#map
8.5.6 .bed
A plink format file that is a binary equivalent of a .ped file. It is smaller and faster to process but is not easily viewable or editable. https://www.cog-genomics.org/plink/1.9/formats#bed
8.5.7 .bim
A plink format file, similar to a .map file but used with binary .bed files. https://www.cog-genomics.org/plink/1.9/formats#bim
8.5.8 .fam
A plain text format that contains sample information for plink binary files. https://www.cog-genomics.org/plink/1.9/formats#fam
8.5.9 .csv
A plain text format where different fields are separated by commas (comma-separated values).
8.5.10 .vcf
VCF files are a flexible file format for storing different types of genetic variants. They are a plain text format that can be inspected on the command line with standard Unix tools. However, they are often very large files, and specific tools such as 'vcftools' are useful for working with this data. SNPs are commonly stored in these files, but other variants such as copy number variations can also be stored. The basic form of a vcf file is described at: https://en.wikipedia.org/wiki/Variant_Call_Format
8.5.11 .bcf
This is a binary version of a vcf file. It cannot be inspected on the command line, but can be used with the genomic tools mentioned in this document.
8.5.12 .tar.gz
This is a standard Unix file format for bundling and compressing a set of files. It is similar to a .zip file. It is made by first bundling a set of files into a .tar file (sometimes called a tarball). This is then compressed using 'gzip' (GNU zip). https://en.wikipedia.org/wiki/Tar_(computing) https://en.wikipedia.org/wiki/Gzip
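As a small illustration, the Python standard library can list and read the contents of a .tar.gz bundle without unpacking it to disk. The function names here are hypothetical, and the sketch assumes the archive members are UTF-8 text.

```python
import tarfile


def list_tar_gz(path):
    """Return the names of the files bundled in a .tar.gz archive."""
    with tarfile.open(path, "r:gz") as archive:
        return archive.getnames()


def read_member_text(path, member_name):
    """Read one bundled file as text without extracting the whole archive."""
    with tarfile.open(path, "r:gz") as archive:
        handle = archive.extractfile(member_name)
        if handle is None:
            raise ValueError(f"{member_name} is not a regular file in the archive")
        return handle.read().decode("utf-8")
```

Listing the members first is a cheap way to check what a bundle contains before committing disk space to a full extraction.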
8.5.13 .enc
This file extension is used as a convention to mean that the file is encrypted. You will need the password that was used to encrypt the data in order to decrypt the files. https://en.wikipedia.org/wiki/OpenSSL
8.6 Variant/SNP ids
There are many types of genetic variation. A common type is a single nucleotide polymorphism (SNP). Others include copy number variations.
Variants can be specified by a Chromosome and location in reference to a specific build of the human genome. They can also be given a reference SNP (rs) cluster identifier.
- Chr:Location
- Rs ids
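The two identifier styles can be told apart programmatically. The sketch below is illustrative only: the regular expressions, the optional "chr" prefix and the example IDs are assumptions, and the exact ID format used in any particular ALSPAC file should be checked against that dataset's documentation.

```python
import re

# Assumed patterns, for illustration only.
RS_ID = re.compile(r"^rs\d+$")
CHR_POS = re.compile(r"^(?:chr)?(\d{1,2}|X|Y|MT):(\d+)$")


def classify_variant_id(variant_id):
    """Label an ID as an rs ID, a chromosome:position pair, or unknown."""
    if RS_ID.match(variant_id):
        return ("rs_id", variant_id)
    match = CHR_POS.match(variant_id)
    if match:
        return ("chr_pos", (match.group(1), int(match.group(2))))
    return ("unknown", variant_id)
```

Remember that a chromosome:position pair is only meaningful together with the genome build it refers to, whereas an rs ID is independent of the build.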
8.7 Overview of Imputation reference panels
SNP array data frequently contain hundreds of thousands of variants. However, due to linkage disequilibrium, it is possible to estimate many more SNP values for an individual. This estimation procedure is called imputation, and it works by combining an individual's SNP array data with a large reference population of sequenced data. In this way it is possible to have accurate estimates of millions of SNP values for an individual without the cost of fully sequencing each person. ALSPAC has prerun the imputation process using three different imputation panels.
8.7.1 Panels
- TOPMed
An upcoming (to ALSPAC) reference panel, which will have the most SNPs.
- HRC
This is the latest reference panel, and our data contains circa 40 million SNPs.
- 1000 Genomes
This is the previous generation reference panel which is still widely used in ALSPAC studies. There are some SNPs that appear in this panel that are not in the HRC panel.
- HapMap
This was the first widely used imputation panel.
8.8 SNP data types from imputation.
SNPs that have been imputed can be stored and analysed in different formats. These can be appropriate for different types of analysis; for example, an analysis could assume an additive effect for the minor allele or it could assume a recessive/dominant effect.
- Best guess. The data will be presented as either 0, 1, or 2 to represent how many copies of the minor allele a person has at that position. The best guess is derived from the genotype probabilities calculated by the imputation process.
- Dosage. Imputation gives the probability that the person has 0, 1 or 2 copies of the minor allele, e.g. 0.1, 0.2, 0.7. These sum to one across the three possibilities (i.e. for each SNP for each individual). A dosage is commonly computed from them as the expected number of copies of the allele, here 0.2 + 2 × 0.7 = 1.6.
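The relationship between the three genotype probabilities, the best guess and a dosage can be shown in a few lines. This sketch assumes the common convention that the dosage is the expected number of copies of the counted allele. The function names are hypothetical, and real pipelines often also apply a certainty threshold before accepting a best guess.

```python
def best_guess(probs):
    """Return the most probable allele count (0, 1 or 2)."""
    return max(range(3), key=lambda count: probs[count])


def dosage(probs):
    """Expected allele count: 0*P(0) + 1*P(1) + 2*P(2)."""
    _p0, p1, p2 = probs
    return p1 + 2.0 * p2


probabilities = (0.1, 0.2, 0.7)  # P(0 copies), P(1 copy), P(2 copies)
guess = best_guess(probabilities)
expected_copies = dosage(probabilities)
```

The best guess discards the uncertainty in the imputation, while the dosage keeps it, which is why dosages are often preferred for additive models.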
8.9 SNP Statistics
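You can generate statistics on your SNP data using the program 'qctool'. The per-SNP statistics include the imputation information scores. For example:
qctool -g example.bgen -s example.sample -snp-stats -osnp snp-stats.txt
Per-sample statistics can be generated in the same way:
qctool -g example.bgen -s example.sample -sample-stats -osample sample-stats.txt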
8.10 Best practice
8.10.1 GWAS
We recommend you follow the steps outlined in the following paper when performing GWAS: Marees, Andries T., et al. "A tutorial on conducting genome‐wide association studies: Quality control and statistical analysis." International journal of methods in psychiatric research 27.2 (2018): e1608. https://doi.org/10.1002/mpr.1608
8.10.2 PheWAS
We recommend you follow the steps outlined in the following paper when performing PheWAS: Millard, L., Davies, N., Timpson, N. et al. MR-PheWAS: hypothesis prioritization among potential causal effects of body mass index on many outcomes, using Mendelian randomization. Sci Rep 5, 16645 (2015). https://doi.org/10.1038/srep16645
8.10.3 Methylation
The following paper describes the methylation data available in ALSPAC: Relton, Caroline L., et al. "Data resource profile: accessible resource for integrated epigenomic studies (ARIES)." International journal of epidemiology 44.4 (2015): 1181-1190.
8.11 Population stratification
Population stratification occurs when an observed genetic association is due to population structure or geography rather than to the variant itself. Not taking this into account can lead to biased estimates of effects. One common method to account for it is to calculate principal components of the genetic data and then to include these as covariates in any models. Principal components can be generated using plink or other tools.
For more information about how to do this in plink see: https://www.cog-genomics.org/plink/1.9/strat
Another common method used to account for population substructure is to use linear mixed models, for example with the BOLT-LMM software tool.
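The principal component approach can be sketched with NumPy on a small in-memory genotype matrix. This is an illustration under simplifying assumptions (individuals in rows, SNPs coded 0/1/2 in columns, no missing data, no LD pruning), with hypothetical function names. It is not a substitute for plink or similar tools on real data.

```python
import numpy as np


def genetic_pcs(genotypes, n_components=2):
    """Principal component scores from an individuals-by-SNPs matrix."""
    g = np.asarray(genotypes, dtype=float)
    g = g - g.mean(axis=0)                  # centre each SNP
    sd = g.std(axis=0)
    g = g[:, sd > 0] / sd[sd > 0]           # drop monomorphic SNPs, scale the rest
    u, s, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def snp_effect_adjusted(phenotype, snp, pcs):
    """Least-squares SNP effect with the PCs included as covariates."""
    design = np.column_stack([np.ones(len(snp)), snp, pcs])
    coefficients, *_ = np.linalg.lstsq(design, phenotype, rcond=None)
    return coefficients[1]                  # column 1 is the SNP
```

A typical use is `pcs = genetic_pcs(genotype_matrix, n_components=10)` followed by `snp_effect_adjusted(phenotype, genotype_matrix[:, j], pcs)` for each SNP j. The number of components to include is a modelling choice that this sketch does not address.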
8.12 Common tasks
Here we provide links to webpages that give instructions, or brief details of any code, for completing common tasks using the software described above (section 8.4):
- Extract some SNPs from a bgen data file and convert to plain text.
https://www.well.ox.ac.uk/~gav/qctool_v2/documentation/examples/filtering_variants.html
- Extract some SNPs from bed data:
http://zzz.bwh.harvard.edu/plink/dataman.shtml
plink --bfile mydata --chr 2 --from-kb 5000 --to-kb 10000
- Reading .bgen and .sample oxford files in plink
Plink supports bgen files but it is fussy about the types of its columns in the data.sample file. You may wish to remove or retype columns to read a data.sample file into plink. For more info see:
https://www.cog-genomics.org/plink/2.0/input
To make a new sample file removing some columns you can use the Unix command: 'cut -f 1,2,3 -d " " data.sample > data2.sample'
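If you prefer to select columns by name rather than by position, the same trimming can be sketched in Python. The function name and the toy content are hypothetical. The sketch assumes a space-separated sample file and keeps both header lines along with the data rows.

```python
def keep_sample_columns(lines, wanted):
    """Return the sample-file lines restricted to the named columns."""
    rows = [line.split() for line in lines if line.strip()]
    header = rows[0]
    # header.index raises ValueError if a requested column is absent.
    indices = [header.index(name) for name in wanted]
    return [" ".join(row[i] for i in indices) for row in rows]


# Toy content, invented for this illustration.
toy_sample = [
    "ID_1 ID_2 missing cov_1 pheno1",
    "0 0 0 D P",
    "1 1 0.007 1 1.233",
]
trimmed = keep_sample_columns(toy_sample, ["ID_1", "ID_2", "missing"])
```

Writing `"\n".join(trimmed)` to a new file gives the same result as the 'cut' command above for this example.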
8.13 Courses
Working with 'Omics data can be complicated, but there are many excellent resources available to help you learn how to do this. There are both paid in-person courses and free online courses.
Details on paid courses offered by Bristol University can be found here: https://www.bristol.ac.uk/medical-school/study/short-courses/ In addition, a number of free online courses are summarised here: https://www.mooc-list.com/tags/bioinformatics
8.14 Further sources of help
8.14.1 Stack exchange
Stack Exchange is an online Q&A community which is divided into different sub-communities. The first and most well-known is Stack Overflow, which is one of the best places to ask questions about programming on the Internet. Other useful exchange sites include bioinformatics https://bioinformatics.stackexchange.com/, maths https://mathoverflow.net/ and statistics https://stats.stackexchange.com/.
8.14.2 Bio-stars
Biostars is a bioinformatics community Q&A website: https://www.biostars.org/
8.14.3 Mailing lists
For individual products/projects there is often a mailing list. For example, to get help using SNPTEST you can ask on the mailing list: https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html#contact
8.14.4 AI tools
AI tools such as ChatGPT can be useful for understanding how to work with omics data.
8.14.5 Ask ALSPAC
If you cannot find the answer to your question, or you think there is something wrong with your data, then please contact the alspac-omics@bristol.ac.uk mailbox and we will do our best to help you.