ALSPAC OMICs Data Catalogue
Table of Contents
- 1. Introduction
- 2. Genetic Array Data
- 3. Imputed Data
- 3.1. Genome-wide - HRC imputed - G0 mothers + G1 (gi_hrc_g0m_g1)
- 3.2. Genome-wide - HapMap2 imputed - G1 (gi_hapmap2_g1)
- 3.3. Genome-wide - HapMap2 imputed - G0 mothers (gi_hapmap2_g0m)
- 3.4. Genome-wide - 1000G imputed - G0 partners (gi_1000g_g0p)
- 3.5. Genome-wide - 1000G imputed - G0 mothers + G1 (gi_1000g_g0m_g1)
- 3.6. Genome-wide - TOPMed round 2 imputed - G0 mothers + G1 (gi_topmed_g0m_g1)
- 4. Sequence Data
- 5. Epigenetic Data
- 6. Gene Expression Data
- 7. Omics tips
- 7.1. Introduction
- 7.2. Disclaimer
- 7.3. Operating systems
- 7.4. Key Omics software
- 7.5. File types
- 7.6. Variant/SNP ids
- 7.7. Overview of Imputation reference panels
- 7.8. SNP data types from imputation.
- 7.9. SNP Statistics
- 7.10. Best practice
- 7.11. Population stratification
- 7.12. Polygenic risk scores (PRS)
- 7.13. Common tasks
- 7.14. Courses
- 7.15. Further sources of help
1 Introduction
Welcome to the ALSPAC Omics Catalogue, a guide to the omics data offered by ALSPAC. This catalogue features a variety of named ALSPAC datasets, each consisting of collected or produced data that has been organized, named, and curated for ease of use. Every named ALSPAC dataset comes with accompanying metadata that provides information about the dataset as a whole. Each named ALSPAC dataset has at least one release version that includes a curated selection of files detailed in the metadata sections.
Please note that these datasets are not generally accessible. See http://www.bristol.ac.uk/alspac/researchers/access/ for details of how to request access.
The information within this catalogue is made available for browsing to help both internal ALSPAC users and external researchers understand the data and facilitate prospective data requests.
For external ALSPAC collaborators, we offer, as standard, "freezes" of specific versions of named ALSPAC datasets. These freezes, along with their metadata, are outlined in this catalogue. External collaborators will be granted access to these freezes once their request is approved. A freeze represents a carefully selected subset of data files within a version, containing the core data from a dataset, with data from participants who have withdrawn consent removed and specific dataset IDs applied. These freezes are subject to periodic updates.
Due to the removal of withdrawn individuals from the freezes, please note that the number of participants within each dataset may change over time and may not match those found in the Methodology fields.
Freeze 1 timing: July 2021 - Dec 2022
Freeze 2 timing: Dec 2022 - Dec 2023
Freeze 3 timing: Jan 2023 - Oct 2024
Freeze 4 timing: Oct 2024 - June 2025
Freeze 5 timing: June 2025 - Current
Documentation for the current freeze of each dataset is presented below in the form of a YAML file, listing the files external collaborators will receive, accompanied by metadata.
2 Genetic Array Data
2.1 Genome-wide - Illumina 550 quad - G1 (gwa_550_g1)
2.1.1 Description
This dataset contains genome-wide array genotype calls for G1 individuals. Reference genome build: GRCh36
2.1.2 Methodology
ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platform by 23andMe, subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%) and insufficient sample replication (IBD < 0.8).
Population stratification was assessed by multidimensional scaling analysis and compared with Hapmap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed.
SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed. Cryptic relatedness was measured as proportion of identity by descent (IBD > 0.1). Related subjects were removed.
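The SNP-level thresholds above can be sketched as a simple predicate. This is a minimal illustration only, not the pipeline that was actually run (the QC was performed with standard tools); the names `SnpQcThresholds` and `passes_snp_qc` are hypothetical, and the per-SNP statistics are assumed to have been computed already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnpQcThresholds:
    """Thresholds quoted in the methodology text (hypothetical container)."""
    min_maf: float = 0.01        # SNPs with MAF < 1% are removed
    min_call_rate: float = 0.95  # SNPs with call rate < 95% are removed
    min_hwe_p: float = 5e-7      # SNPs with HWE P < 5E-7 are removed


def passes_snp_qc(maf, call_rate, hwe_p, thresholds=SnpQcThresholds()):
    """Return True if a SNP survives all three filters."""
    return (
        maf >= thresholds.min_maf
        and call_rate >= thresholds.min_call_rate
        and hwe_p >= thresholds.min_hwe_p
    )


if __name__ == "__main__":
    print(passes_snp_qc(maf=0.20, call_rate=0.99, hwe_p=0.3))   # kept
    print(passes_snp_qc(maf=0.005, call_rate=0.99, hwe_p=0.3))  # removed: rare
```

Other datasets in this catalogue use different cut-offs (for example 5% missingness and P < 1.0e-06 for the G0 mothers), which can be expressed by passing a different `SnpQcThresholds` instance.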
Associated publication:
- Horikoshi et al 2013 (https://doi.org/10.1038/ng.2477)
2.1.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema
id: alspacdcs:gwa_550_g1_2022-12-05_f5
name: >-
  Genome-wide array data for G1 individuals 2022-12-05 freeze 5
description: >-
  The fifth freeze of the genome-wide array data for G1 based on a 2022-12-05 release.
  The data is in plink format.
freeze_size: 997M
linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a
git_tag: https://github.com/alspac/dataset_gwa_550_g1/releases/tag/freeze5
is_current_freeze: true
freeze_number: 5
freeze_date: 2025-02-27
previous_freeze: alspacdcs:gwa_550_g1_2022-12-05_f4
freeze_of_alspac_dataset_version: alspacdcs:gwa_550_g1_2022-12-05
freeze_of_named_alspac_dataset: alspacdcs:gwa_550_g1
has_containers:
  - id: alspacdcs:1600720f-a580-4999-9bd6-4bbcd60554bb ## uuid
    name: data
    description: A dir/folder containing the two freeze data files
has_parts:
  - id: alspacdcs:b84cc4d9-20b0-40d1-93d2-b5a4d221af3b
    name: Biallelic genotype table
    description: >-
      genotype data
    data_distributions:
      - id: alspacdcs:a8552a46-1740-4056-8adf-38d32f6a7472
        name: freeze_id.bed
        description: >-
          Plink bed file. Primary representation of genotype calls at biallelic variants.
          Must be accompanied by .bim and .fam files.
        md5sum: 94973786388f80000dcdad0a80514e37
        filesize: 982M
        filetype: .bed
        number_of_participants: 8223
        number_of_variants: 500527
        belongs_to_container: alspacdcs:1600720f-a580-4999-9bd6-4bbcd60554bb
  - id: alspacdcs:f79e204e-8f5c-4c85-9bd1-07b0e1f1e874
    name: Variant Information
    description: >-
      Information about SNPs
    data_distributions:
      - id: alspacdcs:9b0b34c4-f31c-48e9-8cbd-f87d3257de11
        name: freeze_id.bim
        description: >-
          Extended variant information file accompanying a .bed binary genotype table.
          (--make-just-bim can be used to update just this file.)
          A text file with no header line, and one line per variant with the following six fields:
          1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
          2. Variant identifier
          3. Position in morgans or centimorgans (safe to use dummy value of '0')
          4. Base-pair coordinate (1-based; limited to 2^31-2)
          5. Allele 1 (corresponding to clear bits in .bed; usually minor)
          6. Allele 2 (corresponding to set bits in .bed; usually major)
        md5sum: b0789ac6126af474c916c80f77335f6a
        filesize: 14M
        filetype: .bim
        number_of_variants: 500527
        belongs_to_container: alspacdcs:1600720f-a580-4999-9bd6-4bbcd60554bb
  - id: alspacdcs:6097a628-e5a5-4b03-9281-2efff7ae48f5
    name: sample info
    description: >-
      Sample ids
    data_distributions:
      - id: alspacdcs:76b09971-7168-420f-b4f2-7f6482f5d0ef
        name: freeze_id.fam
        description: >-
          A text file with no header line, and one line per sample with the following six fields:
          1. Family ID ('FID')
          2. Within-family ID ('IID'; cannot be '0')
          3. Within-family ID of father ('0' if father isn't in dataset)
          4. Within-family ID of mother ('0' if mother isn't in dataset)
          5. Sex code ('1' = male, '2' = female, '0' = unknown)
          6. Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)
        md5sum: 2bc551594141e9da29b24488bdd2afe7
        filesize: 256k
        filetype: .fam
        number_of_participants: 8223
        belongs_to_container: alspacdcs:1600720f-a580-4999-9bd6-4bbcd60554bb
  - id: alspacdcs:48aeacef-1450-4519-8eb7-c4c25420a4df
    name: Heterozygous haploid and nonmale Y chromosome call list
    description: >-
      A plink report
    data_distributions:
      - id: alspacdcs:303a3c36-8b63-4c03-94a9-7fb35bf2885e
        name: freeze_id.hh
        description: >-
          Produced automatically when the input data contains heterozygous calls where they
          shouldn't be possible (haploid chromosomes, male X/Y), or there are nonmissing calls
          for nonmales on the Y chromosome.
          A text file with one line per error (sorted primarily by variant ID, secondarily by
          sample ID) with the following three fields:
          Family ID
          Within-family ID
          Variant ID
        md5sum: cce791501bb562953f352b9f54eacecb
        filesize: 1.7M
        filetype: .hh
        belongs_to_container: alspacdcs:1600720f-a580-4999-9bd6-4bbcd60554bb
  - id: alspacdcs:b243c546-07d3-43cb-9649-1926973f7211
    name: Logs
    description: >-
      plink log
    data_distributions:
      - id: alspacdcs:34f0b85a-c298-4e21-9b8a-f9644d582a1e
        name: freeze_id.log
        description: >-
          plink log file
        md5sum: 5a63dd14cc69e894f78758f7ca3d8197
        filesize: 512
        filetype: .log
        belongs_to_container: alspacdcs:1600720f-a580-4999-9bd6-4bbcd60554bb
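Each file in a freeze is listed with an md5sum, so a received copy can be checked against the freeze docs. The following is a minimal sketch using only the Python standard library; the function names and the example path are hypothetical, and the expected checksum should be copied from the relevant freeze entry.

```python
import hashlib


def md5sum(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Return True if the file's MD5 matches the value in the freeze docs."""
    return md5sum(path) == expected.strip().lower()


# Example usage (hypothetical local path; checksum taken from the .bed entry above):
# verify_md5("data/freeze_id.bed", "94973786388f80000dcdad0a80514e37")
```

Reading in chunks keeps memory use flat, which matters for the multi-gigabyte imputed files described later in the catalogue.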
2.2 Genome-wide - Illumina exome core array - G0 partners (gwa_exome_g0p)
2.2.1 Description
This dataset contains genome-wide array genotype calls for G0 mothers and partners. Reference genome build: GRCh37
2.2.2 Methodology
3,453 ALSPAC mothers and fathers were genotyped at 535,478 SNPs using the Illumina HumanCoreExome chip genotyping platform by the ALSPAC lab, and genotypes were called using GenomeStudio. The resulting raw genome-wide data were subjected to standard quality control methods using PLINK (v1.07). Individuals were excluded on the basis of gender mismatches (n = 80); minimal or excessive heterozygosity (n = 64); disproportionate levels of individual missingness (>5%, n = 60) and possible contamination (n = 3).
Population stratification was assessed by multidimensional scaling analysis and principal component analysis, with comparison against 1000 Genomes phase 3 data; all individuals with non-European ancestry were removed (n = 266). Cryptic relatedness was measured as SNP relatedness in GCTA (relatedness > 0.1, n = 69 removed). SNPs with a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 1E-7) and those which failed GenomeStudio quality control measures were removed (n = 21,298). 6,594 duplicate SNPs were also removed. This resulted in genotypes for 2,911 unrelated mothers and fathers at 507,586 SNPs. We then identified 2217 samples for which the ALN assigned historically by the lab matched the genetically assigned ALN.
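The counts quoted above are internally consistent, which can be checked with a few lines of arithmetic. The sketch below assumes that the exclusion groups do not overlap and reads n = 266 as the number of individuals removed for non-European ancestry; both are interpretations of the text rather than statements from the original QC logs.

```python
# Sample-level exclusions reported in the methodology (assumed non-overlapping).
SAMPLE_EXCLUSIONS = {
    "gender mismatch": 80,
    "minimal or excessive heterozygosity": 64,
    "individual missingness > 5%": 60,
    "possible contamination": 3,
    "non-European ancestry": 266,
    "cryptic relatedness > 0.1": 69,
}

# SNP-level exclusions reported in the methodology.
SNP_EXCLUSIONS = {
    "call rate, HWE or GenomeStudio QC failure": 21298,
    "duplicate SNPs": 6594,
}


def remaining(start, exclusions):
    """Subtract every exclusion count from the starting total."""
    return start - sum(exclusions.values())


samples_left = remaining(3453, SAMPLE_EXCLUSIONS)
snps_left = remaining(535478, SNP_EXCLUSIONS)
print(samples_left, snps_left)  # 2911 507586
```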
1737 putative G0 partner-G1 pairs for whom both G0 partner and G1 have called genotype data available were identified based on ALN. Given the G0 partners were invited by the G0 mother to take part and only enrolled in the study in their own right several years later, it could not be assumed that all G0 partners were biologically related to G1. Called genotype data for the 1720 unique G0 partners and 1737 unique G1s were merged (i.e. there were 17 pairs of siblings/twins among the G1 offspring), using plink v1.90b7.2 64-bit (11 Dec 2023).
After application of the plink filters --geno 0.05, --maf 0.01, --snps-only just-acgt and --autosome, 113288 SNPs remained. The --related command in KING version 2.3.2 was used to perform kinship analysis, which confirmed that all 1737 putative G0 partner-G1 pairs are genetically related, as would be expected for biological father-offspring pairs, using the inference criteria described in Table 1 of Manichaikul, Ani, et al. "Robust relationship inference in genome-wide association studies." Bioinformatics 26.22 (2010): 2867-2873.
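The filtering and kinship steps described above might be assembled as command lines like the following. This is a sketch under stated assumptions: the file prefixes are hypothetical, and the use of `--bfile`, `--make-bed` and `--out` (and of KING's `-b` input option) is assumed, since the text only names the filter flags and `--related`. The code builds the argument lists and prints them; it does not run the tools.

```python
import shlex


def plink_filter_command(in_prefix, out_prefix):
    """Build a plink 1.9 command applying the four filters named in the text."""
    return [
        "plink",
        "--bfile", in_prefix,          # assumed: merged G0 partner + G1 fileset
        "--geno", "0.05",              # drop variants with > 5% missing calls
        "--maf", "0.01",               # drop variants with MAF < 1%
        "--snps-only", "just-acgt",    # keep SNPs with A/C/G/T alleles only
        "--autosome",                  # keep autosomal variants only
        "--make-bed",                  # assumed: write a new binary fileset
        "--out", out_prefix,
    ]


def king_related_command(bed_prefix):
    """Build a KING command for relationship inference on a plink fileset."""
    return ["king", "-b", bed_prefix + ".bed", "--related"]


if __name__ == "__main__":
    # Hypothetical prefixes, for illustration only.
    print(shlex.join(plink_filter_command("g0p_g1_merged", "g0p_g1_filtered")))
    print(shlex.join(king_related_command("g0p_g1_filtered")))
```

Building the commands as lists (rather than one shell string) means they can be passed directly to `subprocess.run` without quoting problems.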
2.2.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema
id: alspacdcs:gwa_exome_g0p_2016-11-22_f5
name: Freeze 5 version 2016-11-22 Genome-wide - Illumina exome core array - G0 partners
description: >-
  Freeze 5 version 2016-11-22 Genome-wide array data including raw files and genotype calls
  for G0 partners, also including additional G0 mothers who were absent from previous
  genotyping rounds
freeze_size: 289M
linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a
git_tag: https://github.com/alspac/dataset_gwa_exome_g0p/releases/tag/freeze5
is_current_freeze: true
freeze_number: 5
freeze_date: 2025-02-27
previous_freeze: alspacdcs:gwa_exome_g0p_2016-11-22_f4
freeze_of_alspac_dataset_version: alspacdcs:gwa_exome_g0p_2016-11-22
freeze_of_named_alspac_dataset: alspacdcs:gwa_exome_g0p
has_containers:
  - id: alspacdcs:0e154bae-77b3-40bd-b81c-a5b127cf9184
    name: data
    description: A dir/folder containing the plink data files
has_parts:
  - id: alspacdcs:041a43f13-1e58-4fea-a9d8-722dfe40bb1d
    name: freeze_id
    data_distributions:
      - id: alspacdcs:646c2553-6799-47b3-b84c-4f533ec5ebed
        name: freeze_id.fam
        description: >-
          A text file with no header line, and one line per sample with the following six fields:
          1. Family ID ('FID')
          2. Within-family ID ('IID'; cannot be '0')
          3. Within-family ID of father ('0' if father isn't in dataset)
          4. Within-family ID of mother ('0' if mother isn't in dataset)
          5. Sex code ('1' = male, '2' = female, '0' = unknown)
          6. Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)
          Here we use both of the first two fields to hold the full ID of the participant,
          i.e. not separate family and within-family IDs.
        md5sum: 422fe647fc778a80f6cf39815eb7691f
        filesize: 128KB
        filetype: .fam
        number_of_participants: 2198
        belongs_to_container: alspacdcs:0e154bae-77b3-40bd-b81c-a5b127cf9184
      - id: alspacdcs:2bb32a02-893a-4d6c-972e-14256a5ed3a4
        name: freeze_id.bim
        description: >-
          Extended variant information file accompanying a .bed binary genotype table.
          (In plink, --make-just-bim can be used to update just this file.)
          A text file with no header line, and one line per variant with the following six fields:
          1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
          2. Variant identifier
          3. Position in morgans or centimorgans (safe to use dummy value of '0')
          4. Base-pair coordinate (1-based; limited to 2^31-2)
          5. Allele 1 (corresponding to clear bits in .bed; usually minor)
          6. Allele 2 (corresponding to set bits in .bed; usually major)
        md5sum: 0fe43f888776059fef0a76d3f08d00ad
        filesize: 14MB
        filetype: .bim
        number_of_variants: 507586
        belongs_to_container: alspacdcs:0e154bae-77b3-40bd-b81c-a5b127cf9184
      - id: alspacdcs:8fbc90b4-cadd-41e2-95c4-72495aa273a8
        name: freeze_id.bed
        description: >-
          Primary representation of genotype calls at biallelic variants.
          Must be accompanied by .bim and .fam files.
        md5sum: 304b0d356880c5174806ce08d7beffd3
        filesize: 267M
        filetype: .bed
        number_of_participants: 2198
        number_of_variants: 507586
        belongs_to_container: alspacdcs:0e154bae-77b3-40bd-b81c-a5b127cf9184
      - id: alspacdcs:2628b6e2-01e9-4822-92f2-972af0dbca42
        name: freeze_id.log
        md5sum: c6f073df29726db7df0aab3cefc82a0d
        filesize: 512B
        filetype: .log
        belongs_to_container: alspacdcs:0e154bae-77b3-40bd-b81c-a5b127cf9184
      - id: alspacdcs:2854d79c-ceca-499e-bf31-ba63b47718fa
        name: freeze_id.hh
        description: >-
          plink .hh file see https://www.cog-genomics.org/plink/1.9/formats#hh
        md5sum: 18e7547bb1c75e008caa9538baa57071
        filesize: 8M
        filetype: .hh
        belongs_to_container: alspacdcs:0e154bae-77b3-40bd-b81c-a5b127cf9184
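The six-field .fam layout described in the freeze docs can be read with a few lines of Python. This is a minimal sketch, not an ALSPAC tool: `FamRecord`, `parse_fam_line` and the example ID are hypothetical, and the only assumption about the data is the whitespace-separated six-column format stated above.

```python
from typing import NamedTuple

SEX_LABELS = {"1": "male", "2": "female", "0": "unknown"}


class FamRecord(NamedTuple):
    fid: str        # 1. Family ID
    iid: str        # 2. Within-family ID
    father: str     # 3. Within-family ID of father ('0' if absent)
    mother: str     # 4. Within-family ID of mother ('0' if absent)
    sex: str        # 5. Sex code
    phenotype: str  # 6. Phenotype value


def parse_fam_line(line):
    """Split one .fam line into its six whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, found {len(fields)}")
    return FamRecord(*fields)


if __name__ == "__main__":
    # Made-up ID for illustration; in this dataset FID and IID both carry the full ID.
    record = parse_fam_line("EXAMPLE01 EXAMPLE01 0 0 2 -9")
    print(record.iid, SEX_LABELS.get(record.sex, "unknown"))
```

All fields are kept as strings so that identifiers and missing-value codes such as '-9' are not altered on the way in.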
2.3 Genome-wide - Illumina 660 quad - G0 mothers (gwa_660_g0m)
2.3.1 Description
This dataset contains genome-wide array data including raw files and genotype calls for G0 mothers.
2.3.2 Methodology
ALSPAC mothers were genotyped using the Illumina human660W-quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs.
SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally SNPs with a minor allele frequency of less than 1% were removed. Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded.
Cryptic relatedness was assessed using an IBD estimate of more than 0.125, which is expected to correspond to roughly 12.5% of alleles shared IBD, or relatedness at the first-cousin level. Related subjects that passed all other quality control thresholds were retained. In total, 9,048 subjects and 526,688 SNPs passed these quality control filters.
Associated publication:
- Rietveld et al 2013 (https://doi.org/10.1126/science.1235488)
2.3.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema
id: alspacdcs:gwa_660_g0m_2022-12-05_f5
name: Freeze 5 version 2022-12-05 Genome-wide - Illumina 660 quad - G0 mothers
description: >-
  Freeze 5 of genome-wide array data including genotype calls for G0 mothers
freeze_size: 2G
linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a
git_tag: https://github.com/alspac/dataset_gwa_660_g0m/releases/tag/freeze5
is_current_freeze: true
freeze_number: 5
freeze_date: 2025-02-27
freeze_of_alspac_dataset_version: alspacdcs:gwa_660_g0m_2022-12-05
freeze_of_named_alspac_dataset: alspacdcs:gwa_660_g0m
has_containers:
  - id: alspacdcs:aeeb8633-73ce-4975-9b4b-35f0a6ceaef5
    name: data
    description: A dir/folder containing the plink data files
  - id: alspacdcs:3ec3c4ae-b52c-437f-a68f-ba9e7bca2c00
    name: legacy1
    description: >-
      A dir/folder containing the plink data files.
      Includes full set of SNPs but is missing ~500 mothers who were excluded in legacy QC
      due to strict relatedness inclusion thresholds.
    belongs_to_container: alspacdcs:aeeb8633-73ce-4975-9b4b-35f0a6ceaef5
  - id: alspacdcs:604be37a-50ad-4743-803e-783a5c1d6687
    name: legacy2
    description: >-
      A dir/folder containing the plink data files.
      Includes full set of individuals but due to legacy QC is restricted to a set of ~480k SNPs
      that overlap with the Illumina 550k array (which was used for G1).
    belongs_to_container: alspacdcs:aeeb8633-73ce-4975-9b4b-35f0a6ceaef5
has_parts:
  - id: alspacdcs:4f364f94-01ec-4b13-ac7b-2ba283120c99
    name: Biallelic genotype table
    description: >-
      The genetic data. Primary representation of genotype calls at biallelic variants.
      Must be accompanied by .bim and .fam files.
      The legacy1 & legacy2 distribution of the plink bed file.
    data_distributions:
      - id: alspacdcs:3b4029da-80d8-4030-b9bc-50aca869fd9d
        name: freeze_id.bed
        description: >-
          Legacy 1 plink bed file.
        md5sum: be66d3cc1d3d906c4d396cc161a605b1
        filesize: 1019.6MB
        filetype: .bed
        belongs_to_container: alspacdcs:3ec3c4ae-b52c-437f-a68f-ba9e7bca2c00 # legacy1
      - id: alspacdcs:5ec89906-79de-4d1d-8a78-814acb45b42e
        name: freeze_id.bed
        description: >-
          Legacy 2 plink bed file.
        md5sum: 7559903a4811210f6289497e1323dfe7
        filesize: 960.3MB
        filetype: .bed
        belongs_to_container: alspacdcs:604be37a-50ad-4743-803e-783a5c1d6687 # legacy2
  - id: alspacdcs:6a49d358-1ee0-426e-b16d-30fbabb8cd25
    name: Variant Information
    description: >-
      Information about genetic variants.
    data_distributions:
      - id: alspacdcs:116f2f0f-563f-4906-8901-0bb2e1a5787f
        name: freeze_id.bim
        description: >-
          Legacy 1
          Extended variant information file accompanying a .bed binary genotype table.
          (--make-just-bim can be used to update just this file.)
          A text file with no header line, and one line per variant with the following six fields:
          1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
          2. Variant identifier
          3. Position in morgans or centimorgans (safe to use dummy value of '0')
          4. Base-pair coordinate (1-based; limited to 2^31-2)
          5. Allele 1 (corresponding to clear bits in .bed; usually minor)
          6. Allele 2 (corresponding to set bits in .bed; usually major)
        md5sum: 88b8c2221ef4ddc03118042db70d8575
        filesize: 14.0MB
        filetype: .bim
        number_of_variants: 526688
        belongs_to_container: alspacdcs:3ec3c4ae-b52c-437f-a68f-ba9e7bca2c00 # legacy1
      - id: alspacdcs:9e6408b2-0c8f-4bde-971a-bfad729b2a87
        name: freeze_id.bim
        description: >-
          Legacy 2
          Extended variant information file accompanying a .bed binary genotype table.
          (--make-just-bim can be used to update just this file.)
          A text file with no header line, and one line per variant with the following six fields:
          1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
          2. Variant identifier
          3. Position in morgans or centimorgans (safe to use dummy value of '0')
          4. Base-pair coordinate (1-based; limited to 2^31-2)
          5. Allele 1 (corresponding to clear bits in .bed; usually minor)
          6. Allele 2 (corresponding to set bits in .bed; usually major)
        md5sum: b4a1adb225de05d92d0af585950fd423
        filesize: 12.3MB
        filetype: .bim
        number_of_variants: 465740
        belongs_to_container: alspacdcs:604be37a-50ad-4743-803e-783a5c1d6687 # legacy2
  - id: alspacdcs:971ab861-76a4-4d46-8dcb-090d956c7f15
    name: Sample information
    description: >-
      Information about the samples for the dataset
    data_distributions:
      - id: alspacdcs:6979affd-d593-4849-a7e6-9ac84d08bf97
        name: freeze_id.fam
        description: >-
          legacy 1
          A text file with no header line, and one line per sample with the following six fields:
          1. Family ID ('FID')
          2. Within-family ID ('IID'; cannot be '0')
          3. Within-family ID of father ('0' if father isn't in dataset)
          4. Within-family ID of mother ('0' if mother isn't in dataset)
          5. Sex code ('1' = male, '2' = female, '0' = unknown)
          6. Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)
        md5sum: d54855c6d6e0afaeef6522025707807b
        filesize: 253.7KB
        filetype: .fam
        number_of_participants: 8118
        belongs_to_container: alspacdcs:3ec3c4ae-b52c-437f-a68f-ba9e7bca2c00 # legacy1
      - id: alspacdcs:6fc40545-6c73-4883-bd26-c0621669599e
        name: freeze_id.fam
        description: >-
          legacy 2
          A text file with no header line, and one line per sample with the following six fields:
          1. Family ID ('FID')
          2. Within-family ID ('IID'; cannot be '0')
          3. Within-family ID of father ('0' if father isn't in dataset)
          4. Within-family ID of mother ('0' if mother isn't in dataset)
          5. Sex code ('1' = male, '2' = female, '0' = unknown)
          6. Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)
        md5sum: e23995bb57482d3c6b8eeac3100b5009
        filesize: 447.6KB
        filetype: .fam
        number_of_participants: 8648
        belongs_to_container: alspacdcs:604be37a-50ad-4743-803e-783a5c1d6687 # legacy2
  - id: alspacdcs:3d87fed3-1e50-4928-b05c-0bc9e098dc9c
    name: Log information
    description: >-
      Information about the plink run for making the dataset
    data_distributions:
      - id: alspacdcs:75f0ab84-5003-475d-8894-846cdc1ca073
        name: freeze_id.log
        description: >-
          legacy 1 plink log file
        md5sum: ee1acd97e7e4a69885762798eb121821
        filesize: 995.0B
        filetype: .log
        belongs_to_container: alspacdcs:3ec3c4ae-b52c-437f-a68f-ba9e7bca2c00 # legacy1
      - id: alspacdcs:09c46d87-316c-4713-bac2-f818b9b8f6e9
        name: freeze_id.log
        description: >-
          legacy 2 plink log file
        md5sum: 5206ddfc05c0d5955af430d7758f13bb
        filesize: 995.0B
        filetype: .log
        belongs_to_container: alspacdcs:604be37a-50ad-4743-803e-783a5c1d6687 # legacy2
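A quick sanity check on a received .bed file is to compare its size with what the participant and variant counts imply. The sketch below assumes the default variant-major PLINK 1 binary layout (a 3-byte header, then one block per variant storing 2 bits per sample, padded to whole bytes); the function name is hypothetical.

```python
def expected_bed_size_bytes(n_samples, n_variants):
    """Expected size of a variant-major PLINK 1 .bed file in bytes."""
    bytes_per_variant = (n_samples + 3) // 4  # 4 genotypes per byte, rounded up
    return 3 + n_variants * bytes_per_variant


if __name__ == "__main__":
    # Counts taken from the legacy1 and legacy2 .fam/.bim entries above.
    for label, n_samples, n_variants in [
        ("legacy1", 8118, 526688),
        ("legacy2", 8648, 465740),
    ]:
        size = expected_bed_size_bytes(n_samples, n_variants)
        print(label, size, "bytes,", round(size / 1024 ** 2, 1), "MiB")
```

If the listed file sizes are read as binary megabytes, these figures come out close to the 1019.6MB and 960.3MB given for the two legacy .bed files, which suggests the counts and sizes in the freeze docs agree with each other.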
2.4 Genome-wide - CNV - G1 (cnv_550_g1)
2.4.1 Description
This dataset contains predicted ALSPAC CNVs using PennCNV, generated from 23andMe raw genotype data.
2.4.2 Methodology
LRR and BAF data were missing from the 23andMe raw genotype data, so we generated these data ourselves using an in-house algorithm. Once these data were generated, we ran PennCNV using the hh550 libraries.
Filtered PennCNV calls are also provided. Multiple calls were merged using the 'clean_cnv.pl' script, using a merge fraction of 0.5. Individuals with > 30 CNVs, a Log R Ratio SD of > 0.3, a BAF drift of > 0.002, and a waviness factor of > 0.05 were removed. CNVs in which at least 50% of the length of the CNV call overlapped with any telomeric, centromeric or immunoglobulin region were removed using the 'scan_region.pl' script in PennCNV.
In addition, CNVs covering fewer than 5 probes, of a length < 5kb, and with a confidence score below 10 were removed. Density was calculated as the number of probes in a CNV divided by the length of the CNV, and CNVs where the density of probes across the call was < 1 probe per 20kb were removed.
These QC parameters are suggestions only and are provided in filtered.cnv. Analysts can apply their own filter parameters to the raw calls in data.cnv.
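Analysts who want to apply their own CNV-level thresholds to the raw calls could start from a predicate like the one below. It is a sketch only: the function and parameter names are hypothetical, it covers just the per-CNV filters (probe count, length, confidence, probe density), and it assumes a call is dropped if it fails any one of them, which is one reading of the text above.

```python
def passes_cnv_filters(
    n_probes,
    length_bp,
    confidence,
    min_probes=5,
    min_length_bp=5000,
    min_confidence=10.0,
    max_bp_per_probe=20000,
):
    """Return True if a CNV call survives the suggested per-CNV filters."""
    if n_probes < min_probes:
        return False  # fewer than 5 probes
    if length_bp < min_length_bp:
        return False  # shorter than 5 kb
    if confidence < min_confidence:
        return False  # confidence score below 10
    # Density of at least 1 probe per 20 kb, written without division:
    # n_probes / length_bp >= 1 / max_bp_per_probe
    return n_probes * max_bp_per_probe >= length_bp


if __name__ == "__main__":
    print(passes_cnv_filters(n_probes=10, length_bp=50_000, confidence=20.0))
    print(passes_cnv_filters(n_probes=5, length_bp=200_000, confidence=20.0))
```

Keeping the thresholds as keyword arguments makes it easy to tighten or relax them, in line with the note that the published parameters are suggestions.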
2.4.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema
id: alspacdcs:cnv_550_g1_2015-11-09_f5
name: Genome-wide - CNV - G1 release version 2015-11-09 freeze 5
description: >-
  This is the fifth freeze of the 2015-11-09 version of cnv_550_g1 dataset.
  It contains two csv versions of the cnv called data, the unfiltered and filtered versions.
freeze_size: 27m
linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a
git_tag: https://github.com/alspac/dataset_cnv_550_g1/releases/tag/freeze5
is_current_freeze: true
freeze_number: 5
freeze_date: 2025-02-27
previous_freeze: alspacdcs:cnv_550_g1_2015-11-09_f4
freeze_of_alspac_dataset_version: alspacdcs:cnv_550_g1_2015-11-09
freeze_of_named_alspac_dataset: alspacdcs:cnv_550_g1
has_parts:
  - id: alspacdcs:2443e67d-711e-410a-bbff-62ad6e89fc78_cnv_550_g1_2015-11-09_cnvdata_f5
    name: Unfiltered CNV data
    description: >-
      This is the output of Penncnv before filtering.
      columns
      V1 - Position
      V2 - Number of markers in the region
      V3 - CNV length
      V4 - Copy number estimate
      V6 - Start SNP
      V7 - End SNP
      V8 - Confidence score
      qlet - within pregnancy ID
      cnv_550_g1 - Individual ID
    data_distributions:
      - id: alspacdcs:1ffcdc95-dfd4-4c4f-9b23-7a955cf9c2c9
        name: new_cnvdata.csv
        description: >-
          This is the csv file for the output of Penncnv before filtering.
        md5sum: 25aa47310d8c9e17a168d9bff54961f9
        filesize: 21M
        filetype: .csv
        number_of_participants: 7449
        #data$id_qlet <- paste(data$cnv_550_g1, data$qlet, sep="_")
        #length(unique(data$id_qlet))
        number_of_cnv_variants: 70029
        # Read file into R as data then:
        # dim(unique(data[1]))
        belongs_to_container: alspacdcs:723ce3b3-bae5-4bf5-932c-fad912f5c6e4
  - id: alspacdcs:0c96dd13-08f8-41fb-a389-f716c20f373c
    name: Filtered CNV data
    description: >-
      CNV data that has been filtered.
      columns
      V1 - Position
      V2 - Number of markers in the region
      V3 - CNV length
      V4 - Copy number estimate
      V6 - Start SNP
      V7 - End SNP
      V8 - Confidence score
      qlet - within pregnancy ID
      cnv_550_g1 - Individual ID
    data_distributions:
      - id: alspacdcs:578318bb-0f3a-4a3c-ac2c-cf14c48198c5
        name: new_filtered.csv
        description: >-
          This is the csv file for the output of Penncnv after filtering.
        md5sum: 71c3e6841fcc492045602c20d72806d0
        filesize: 5.9M
        filetype: .csv
        number_of_participants: 6792
        # Read into data 2 in r
        # data2$id_qlet <- paste(data2$cnv_550_g1, data2$qlet, sep="_") and length(unique(data2$id_qlet))
        number_of_cnv_variants: 14244
        #Read into data2 in r then
        #length(unique(data2$V1))
        belongs_to_container: alspacdcs:723ce3b3-bae5-4bf5-932c-fad912f5c6e4
has_containers:
  - id: alspacdcs:723ce3b3-bae5-4bf5-932c-fad912f5c6e4 ## uuid
    name: data
    description: A dir/folder containing the two freeze data files
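Once a freeze document has been parsed into nested dictionaries and lists (for example with a YAML parser; the parser choice is an assumption and is not shown here), the files in a freeze can be listed by walking `has_parts` and then `data_distributions`. The function name below is hypothetical, and the example dictionary is a hand-copied excerpt of the CNV freeze docs above rather than the full file.

```python
def list_distributions(freeze_doc):
    """Return (part name, file name, md5sum, participants) for every file in a freeze."""
    rows = []
    for part in freeze_doc.get("has_parts", []):
        for dist in part.get("data_distributions", []):
            rows.append(
                (
                    part.get("name"),
                    dist.get("name"),
                    dist.get("md5sum"),
                    dist.get("number_of_participants"),
                )
            )
    return rows


# Hand-copied excerpt of the cnv_550_g1 freeze 5 description.
EXAMPLE_DOC = {
    "id": "alspacdcs:cnv_550_g1_2015-11-09_f5",
    "has_parts": [
        {
            "name": "Unfiltered CNV data",
            "data_distributions": [
                {
                    "name": "new_cnvdata.csv",
                    "md5sum": "25aa47310d8c9e17a168d9bff54961f9",
                    "number_of_participants": 7449,
                }
            ],
        },
        {
            "name": "Filtered CNV data",
            "data_distributions": [
                {
                    "name": "new_filtered.csv",
                    "md5sum": "71c3e6841fcc492045602c20d72806d0",
                    "number_of_participants": 6792,
                }
            ],
        },
    ],
}

if __name__ == "__main__":
    for row in list_distributions(EXAMPLE_DOC):
        print(row)
```

Using `.get` with defaults keeps the walk tolerant of entries that omit optional fields, such as distributions with no participant count.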
3 Imputed Data
3.1 Genome-wide - HRC imputed - G0 mothers + G1 (gi_hrc_g0m_g1)
SNP chips are useful for the generation of data on hundreds of thousands of SNPs, but there are millions more polymorphisms that remain untyped with this technology. If suitable numbers of whole genome sequences exist (e.g. 1000 genomes data) then millions of genotypes that are missing from a sample because they have not been typed by SNP chips can be imputed using probabilistic methods. Here the ALSPAC mother and children data were imputed to a new reference panel known as the Haplotype Reference Consortium (HRC) panel. This comprises around 31000 sequenced individuals (mostly European), so the coverage of European haplotypes is much greater than in other panels. As a consequence imputation accuracy is expected to improve, particularly at lower frequencies.
3.1.1 Description
This dataset contains genotype data imputed to HRC for G0 mothers and G1. Reference genome build: GRCh37
3.1.2 Methodology
ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platform by 23andMe, subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%) and insufficient sample replication (IBD < 0.8).
Population stratification was assessed by multidimensional scaling analysis and compared with Hapmap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed.
SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed. Cryptic relatedness was measured as proportion of identity by descent (IBD > 0.1).
Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,115 subjects and 500,527 SNPs passed these quality control filters.
ALSPAC mothers were genotyped using the Illumina human660W-quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs. SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally SNPs with a minor allele frequency of less than 1% were removed.
Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded.
Cryptic relatedness was assessed using an IBD estimate of more than 0.125, which is expected to correspond to roughly 12.5% of alleles shared IBD, or relatedness at the first-cousin level. Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,048 subjects and 526,688 SNPs passed these quality control filters.
We combined 477,482 SNP genotypes in common between the sample of mothers and the sample of children. We removed SNPs with genotype missingness above 1% due to poor quality (11,396 SNPs removed) and removed a further 321 subjects due to potential ID mismatches. This resulted in a dataset of 17,842 subjects containing 6,305 duos and 465,740 SNPs (112 were removed during liftover and 234 were out of HWE after combination). We estimated haplotypes using ShapeIT (v2.r644), which utilises relatedness during phasing. The phased haplotypes were then imputed to the Haplotype Reference Consortium (HRCr1.1, 2016) panel of approximately 31,000 phased whole genomes. The HRC panel was phased using ShapeIT v2.r727, and the imputation was performed using the Michigan imputation server.
3.1.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:gi_hrc_g0m_g1_2017-05-04_f5 name: >- Genome-wide - HRC imputed - G0 mothers + G1 version 2017-05-04 freeze 5 description: >- Freeze 5 of version 2017-05-04 Genome-wide array data imputed to the HRC reference panel for G0 mothers and G1 individuals in bgen and sample file format (version 1.2). freeze_size: 114G linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9 woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473 all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a git_tag: https://github.com/alspac/dataset_gi_hrc_g0m_g1/releases/tag/freeze5 is_current_freeze: true freeze_number: 5 freeze_date: 2025-02-27 previous_freeze: alspacdcs:gi_hrc_g0m_g1_2017-05-04_f4 freeze_of_alspac_dataset_version: alspacdcs:gi_hrc_g0m_g1_2017-05-04 freeze_of_named_alspac_dataset: alspacdcs:gi_hrc_g0m_g1 has_containers: - id: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e ## uuid name: data description: A dir/folder containing the freeze data bgen and .sample files # belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e has_parts: - id: alspacdcs:68b35374-11ca-4f91-b8dc-5c17c50f6992 name: Omics ID sample description: >- The samples in the data. To be used with the genetic data. A plain text .sample file. See https://doi.org/10.1101/308296 for file format details. 
data_distributions: - id: alspacdcs:b03967aa-0991-4e4b-9c97-1faff53ec548 name: swapped.sample md5sum: 3e8e18ce5f6e30ac1c79e92695279bce filesize: 1005.1KB filetype: .sample number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:1ac41b4e-e272-4cc3-900e-34fc623556fc name: swapped_23_female.sample md5sum: 19f80cc93eb8474b7354a04e4fabd050 filesize: 745.8KB filetype: .sample number_of_participants: 12943 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:0574c0e6-b957-4427-adf4-f8d04fe997e5 name: swapped_23_male.sample md5sum: 623083d3d4e7294c1ac86817d40fb435 filesize: 259.4KB filetype: .sample number_of_participants: 4501 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:0d22e4c5-9c75-4fb2-93a7-4c30d9b15a84 name: Bgens description: >- An Oxford Bgen (v1.2) file for all chromosomes. To be used with sample file. See https://doi.org/10.1101/308296 for file format details. 
data_distributions: - id: alspacdcs:79cf0083-7c53-4eab-806d-fda62fe8f8cd name: filtered_01.bgen md5sum: 9727306a156ab88f72dedbdcaffc1105 filesize: 8.6GB filetype: .bgen number_of_variants: 3069932 number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:1342c5e0-b921-40cf-9a66-8917029add62 name: filtered_02.bgen md5sum: a8cb970994e21c02eceea92a513ebef6 filesize: 8.7GB filetype: .bgen number_of_variants: 3392238 number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:3c377fe2-b11f-4724-aef3-54e7562e0bff name: filtered_03.bgen md5sum: 7e1586647816f4607b9e528be4893b5c filesize: 7.3GB filetype: .bgen number_of_variants: 2821895 number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:f735b892-a53b-4bd7-92df-2a968eb5de82 name: filtered_04.bgen md5sum: 9bb513a014c18a3a0a1ea11dcf63cc1b filesize: 7.9GB filetype: .bgen number_of_variants: 2787582 number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:7573e42c-4f61-41e0-98ce-4933517adde1 name: filtered_05.bgen md5sum: 92a2d759a5bcc18d0134dc7802302055 filesize: 6.7GB filetype: .bgen number_of_variants: 2588170 number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:e034f623-dd24-46a6-9f7a-2a503cee39fd name: filtered_06.bgen md5sum: 5f68a69cd54a89b8db5577711f2a7934 filesize: 6.3GB filetype: .bgen number_of_variants: 2460112 number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:29b4c425-be26-4778-b08c-6104bb497269 name: filtered_07.bgen md5sum: cd02eefdb350d9859ea7a5975d5ee73a filesize: 6.6GB filetype: .bgen number_of_variants: 2289306 number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: 
alspacdcs:86a83679-b397-4d2d-8d77-cde33919932e name: filtered_08.bgen md5sum: 68b4ea416441637c01ebcc1c2e9ac8cf filesize: 5.7GB filetype: .bgen number_of_variants: 2242706 number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:7e21bc11-89d2-4560-831f-2e6497c31360 name: filtered_09.bgen md5sum: a262516e4a9c48fe2b7edfb68a0f0577 filesize: 4.5GB filetype: .bgen number_of_variants: 1675899 number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:7ff4c7d7-7199-4a41-a55e-40ff89673f4c name: filtered_10.bgen md5sum: 659c1e9b8c9500aa02b84d8a121e4a23 filesize: 5.1GB filetype: .bgen number_of_variants: 1927504 number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:5357c517-b844-4bcc-9a49-f489369c9233 name: filtered_11.bgen md5sum: 94ae65053c6cb28ffa5413a447bea2a7 filesize: 5.2GB filetype: .bgen number_of_variants: 1936990 number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:c0f774be-af04-41a1-a1de-9b3d230e177d name: filtered_12.bgen md5sum: 5e488efe1865265b70f0db0ba0e8ceb2 filesize: 5.1GB filetype: .bgen number_of_variants: 1848118 number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:f8896772-0ae4-41c1-9f1c-f194b8c9a5b7 name: filtered_13.bgen md5sum: c6d8c39e1714020ef24236ce0e0e65f4 filesize: 3.7GB filetype: .bgen number_of_variants: 1385434 number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:9048547c-291f-4e7e-8e19-4c433b4186f6 name: filtered_14.bgen md5sum: a7ceaec0d5986e1396214bbc4a8bcfb5 filesize: 3.5GB filetype: .bgen number_of_variants: 1266536 number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:dc53f49b-8a3a-4fde-bc74-856c6bb16fe2 name: 
filtered_15.bgen md5sum: 30a19dcda6047a6ac690d650ee5fea8c filesize: 3.4GB filetype: .bgen number_of_variants: 1139215 number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:c36b8dc1-8415-4608-bfaa-b7196ed98ea3 name: filtered_16.bgen md5sum: d4ffb3324217ec7ac9e3716ae3de9106 filesize: 4.1GB filetype: .bgen number_of_variants: 1281298 number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:b52fbba0-a8b3-4c70-acfa-b2721c86cd7c name: filtered_17.bgen md5sum: a0baaf8155e3e97ee33d440035877a96 filesize: 3.6GB filetype: .bgen number_of_variants: 1090072 number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:c07a6c33-e0cc-4b69-8701-85c14a0ca771 name: filtered_18.bgen md5sum: 1236c268dfab2d46148835e50efcec5d filesize: 3.1GB filetype: .bgen number_of_variants: 1104755 number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:577dc456-a811-4b50-ae68-1a0992a65bb9 name: filtered_19.bgen md5sum: 1c17198a8d5a7be881d671559048d073 filesize: 3.4GB filetype: .bgen number_of_variants: 868554 number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:6a6d2735-1aac-48c4-972b-f97ffd4fb396 name: filtered_20.bgen md5sum: 336791734294796bcc5c725048756155 filesize: 2.6GB filetype: .bgen number_of_variants: 884983 number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:5b5cbe53-85fb-403b-adef-cb59b545e86b name: filtered_21.bgen md5sum: d97d780938173eb14c5c1aae66e1005e filesize: 1.7GB filetype: .bgen number_of_variants: 531276 number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:2312c3b2-7352-4f27-91d0-5bc078727a1b name: filtered_22.bgen md5sum: 343581eebfe7e38242db0c8b019c2264 filesize: 
1.8GB filetype: .bgen number_of_variants: 524544 number_of_participants: 17444 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:0df6a041-2c11-4fa0-bae9-2d5570d1554d name: filtered_23female.bgen md5sum: d4abdc0d84bda1f8a3eec5c9cee8977b filesize: 4.2GB filetype: .bgen number_of_variants: 1228035 number_of_participants: 12943 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e - id: alspacdcs:f6e79f2e-a99a-4b3f-9dc7-14e2f2c54aff name: filtered_23male.bgen md5sum: bebe6967a0489a186166d61cd1b07a18 filesize: 1.2GB filetype: .bgen number_of_variants: 1228035 number_of_participants: 4501 belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e
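Each file in the freeze metadata is listed with an `md5sum`, which can be used to confirm that a copy of the file is intact. The following is a minimal sketch of such a check using only the Python standard library; the function names and the `data/` path in the usage comment are hypothetical and chosen for illustration.

```python
import hashlib

def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that multi-gigabyte .bgen files are never held in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_freeze_file(path, expected_md5):
    """Compare a file's digest with the md5sum listed in the freeze docs."""
    return md5_of_file(path) == expected_md5.strip().lower()

# Hypothetical usage (the local path is an assumption):
# check_freeze_file("data/filtered_22.bgen", "343581eebfe7e38242db0c8b019c2264")
```

Reading in fixed-size chunks keeps memory use constant regardless of file size, which matters for the larger chromosome files.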
3.2 Genome-wide - HapMap2 imputed - G1 (gi_hapmap2_g1)
3.2.1 Description
This dataset contains genotype data imputed to HapMap 2 for G1. Reference genome build: GRCh36
3.2.2 Methodology
A total of 9912 subjects were genotyped on the Illumina HumanHap550 quad genome-wide SNP genotyping platform by 23andMe, subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK, and the Laboratory Corporation of America, Burlington, NC, USA.
Individuals were excluded from further analysis on the basis of incorrect gender assignments; minimal or excessive heterozygosity (<0.320 or >0.345 for the Sanger data and <0.310 or >0.330 for the LabCorp data); disproportionate levels of individual missingness (>3%); evidence of cryptic relatedness (>10% IBD); and non-European ancestry, as detected by a multidimensional scaling analysis seeded with HapMap 2 individuals. EIGENSTRAT analysis revealed no additional obvious population stratification, and genome-wide analyses with other phenotypes indicate a low lambda. The resulting data set consisted of 8365 individuals (84% of those genotyped).
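Two of the sample-level criteria above are simple numeric thresholds, with heterozygosity bounds that differ by genotyping laboratory. The sketch below encodes just those two checks; it is illustrative only, does not cover the gender, relatedness or ancestry exclusions, and the function and constant names are assumptions rather than part of the original pipeline.

```python
# Heterozygosity bounds per genotyping lab, taken from the text above.
HET_BOUNDS = {"Sanger": (0.320, 0.345), "LabCorp": (0.310, 0.330)}
MAX_MISSINGNESS = 0.03  # individuals with >3% missing genotypes are excluded

def passes_sample_qc(lab, heterozygosity, missingness):
    """Return True if a sample is within the heterozygosity and missingness limits."""
    low, high = HET_BOUNDS[lab]
    return low <= heterozygosity <= high and missingness <= MAX_MISSINGNESS

# Proportion of genotyped individuals retained after all exclusions.
retained_percent = round(100 * 8365 / 9912)
print(retained_percent)  # 84
```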
SNPs with a minor allele frequency of <1% or a call rate of <95% were removed. Furthermore, only SNPs which passed an exact test of Hardy-Weinberg equilibrium (P > 5 × 10^-7) were considered for analysis. Genotypes were subsequently imputed with MACH 1.0.16 Markov Chain Haplotyping software, using CEPH individuals from phase 2 of the HapMap project as a reference set (release 22).
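The SNP-level filters described above combine three thresholds. As a hedged illustration, assuming the per-SNP statistics (minor allele frequency, call rate and exact-test HWE P value) have already been computed by a tool such as PLINK, the retention rule can be written as a single predicate. The function name is hypothetical.

```python
def keep_snp(maf, call_rate, hwe_p):
    """Return True if a SNP meets the thresholds quoted in the text:
    MAF of at least 1%, call rate of at least 95%, and HWE P > 5e-7."""
    return maf >= 0.01 and call_rate >= 0.95 and hwe_p > 5e-7

# Illustrative values only; these are not real ALSPAC statistics.
examples = [
    (0.25, 0.99, 0.40),    # kept
    (0.005, 0.99, 0.40),   # dropped: rare variant
    (0.25, 0.90, 0.40),    # dropped: low call rate
    (0.25, 0.99, 1e-9),    # dropped: fails HWE
]
kept = [keep_snp(*row) for row in examples]
```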
Associated publication:
3.2.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:gi_hapmap2_g1_2022-12-07_f5 name: Genome-wide - HapMap2 imputed - G1 version 2022-12-07 freeze 5 description: >- Freeze 5 of 2022-12-07 version of Genome-wide array data imputed to the HapMap2 reference panel for G1 individuals freeze_size: 5G linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9 woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473 all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a git_tag: https://github.com/alspac/dataset_gi_hapmap2_g1/releases/tag/freeze5 is_current_freeze: true freeze_number: 5 freeze_date: 2025-02-27 previous_freeze: alspacdcs:gi_hapmap2_g1_2022-12-07_f4 freeze_of_alspac_dataset_version: alspacdcs:gi_hapmap2_g1_2022-12-07 freeze_of_named_alspac_dataset: alspacdcs:gi_hapmap2_g1 has_containers: - id: alspacdcs:0f3ae21c-aa11-4aab-b8a0-78bf13801cfb name: data description: A dir/folder containing the plink freeze data files has_parts: - id: alspacdcs:5d8b20a5-b2d3-4d3b-a02e-fe865810dd92 name: bed file description: >- Plink standard format bed file. See https://www.cog-genomics.org/plink/1.9/formats for further information. data_distributions: - id: alspacdcs:30b0f629-2582-44cd-a13d-27511f2bfc3b name: freeze_id.bed md5sum: c1b6c00b67513aef2147d6d507c4d1be filesize: 4.9GB filetype: .bed belongs_to_container: alspacdcs:0f3ae21c-aa11-4aab-b8a0-78bf13801cfb - id: alspacdcs:5a8e0987-f3e7-4354-b77a-353d47390aa2 name: bim file description: >- Plink standard bim file. Contains variant information. See https://www.cog-genomics.org/plink/1.9/formats for further information. 
data_distributions: - id: alspacdcs:3022bf70-3de1-4fe5-a827-627cb6998a53 name: freeze_id.bim md5sum: a1ebaaf6286af5b12f4561b380cd302a filesize: 67.6MB filetype: .bim number_of_variants: 2543887 belongs_to_container: alspacdcs:0f3ae21c-aa11-4aab-b8a0-78bf13801cfb - id: alspacdcs:2a8a7c04-c771-4c80-8e53-32012bcf6cbe name: fam file description: >- Plink standard format fam file. Contains sample information. See https://www.cog-genomics.org/plink/1.9/formats for further information. data_distributions: - id: alspacdcs:26508ec1-acb7-46ae-a7b4-dffa70cdf574 name: freeze_id.fam md5sum: 58d7bf44f023345bced230d50c8f0736 filesize: 273.0KB filetype: .fam number_of_participants: 8223 belongs_to_container: alspacdcs:0f3ae21c-aa11-4aab-b8a0-78bf13801cfb - id: alspacdcs:ac4c4c8c-1da4-4c26-9c4d-3bdea467d5b7 name: log file description: >- Plink log files. One per chromosome. Contains log information. See https://www.cog-genomics.org/plink/1.9/formats for further information. data_distributions: - id: alspacdcs:d74bea9e-7c43-4b67-9abd-f114daeebd06 name: freeze_id.log md5sum: 09b318336263bb2e8ecc563decb92aed filesize: 941.0B filetype: .log belongs_to_container: alspacdcs:0f3ae21c-aa11-4aab-b8a0-78bf13801cfb
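The participant and variant counts in the metadata can be cross-checked against the size of the .bed file. A PLINK 1 binary .bed file in the default variant-major layout has a 3-byte header followed by one block per variant, with four genotypes packed into each byte. The sketch below applies that layout to the counts listed above. Interpreting the catalogue's "GB" as binary gigabytes (1024^3 bytes) is an assumption.

```python
import math

def expected_bed_bytes(n_samples, n_variants):
    """Size of a variant-major PLINK 1 .bed file: 3 magic bytes plus
    ceil(n_samples / 4) bytes for each variant."""
    return 3 + n_variants * math.ceil(n_samples / 4)

# Counts from the freeze_id.fam and freeze_id.bim entries above.
size = expected_bed_bytes(n_samples=8223, n_variants=2_543_887)
print(size)                        # 5230231675
print(round(size / 1024 ** 3, 1))  # 4.9
```

Under that assumption the result agrees with the 4.9GB listed for freeze_id.bed.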
3.3 Genome-wide - HapMap2 imputed - G0 mothers (gi_hapmap2_g0m)
3.3.1 Description
This dataset contains genotype data imputed to HapMap 2 for G0 mothers. Reference genome build: GRCh36
3.3.2 Methodology
A total of 10 015 women (mothers from the ALSPAC cohort) were genotyped using the Illumina 660 quad SNP chip, which contains 557 124 SNP markers. Markers with minor allele frequency < 1%, SNPs with >5% missing genotypes and any markers that failed an exact test of Hardy-Weinberg equilibrium (P < 1 × 10^-6) were excluded from further analyses. Genome-wide identity by state sharing was calculated for each pair of individuals in the cohort to identify cryptic relatedness.
In order to identify individuals who might have ancestries other than Western European, we merged data from both cohorts with the 60 western European (CEU) founder, 60 Nigerian (YRI) founder and 90 Japanese (JPT) and Han Chinese (CHB) individuals from the International HapMap Project. Genome-wide IBS distances for each pair of individuals were calculated on markers shared between the HapMap and the Illumina 660K SNP chip, and then the multidimensional scaling option in R was used to generate a two-dimensional plot based upon individuals' scores on the first two principal coordinates from this analysis. Samples that did not cluster with the CEU individuals were excluded from subsequent analyses. In addition, we plotted the proportion of missing data for each individual against their genome-wide heterozygosity. Any individual who did not cluster with the others was removed from further analyses. Samples were also excluded in the case of excessive missingness (>5%) or unusual genome-wide or X chromosome heterozygosity, and one individual was excluded from each pair of putatively related individuals (genome-wide IBD >10%). After data cleaning, 8340 individuals and 526 688 SNPs were left in the genome-wide data set.
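The relatedness step above removes one individual from each putatively related pair. The text does not say which member of a pair was dropped, so the sketch below makes an explicit assumption: it drops the second-listed individual, and skips pairs in which one member has already been removed. The identifiers and function name are hypothetical.

```python
def prune_related(pairs, threshold=0.10):
    """Given (id_a, id_b, ibd) tuples, return the set of individuals to remove
    so that no retained pair has IBD above the threshold.

    Assumption: the second-listed individual of each offending pair is dropped.
    """
    removed = set()
    for id_a, id_b, ibd in pairs:
        if ibd > threshold and id_a not in removed and id_b not in removed:
            removed.add(id_b)
    return removed

# Toy data: A-B and B-C are related, C-D is not.
toy_pairs = [("A", "B", 0.50), ("B", "C", 0.25), ("C", "D", 0.02)]
to_remove = prune_related(toy_pairs)
```

In this toy example removing B alone resolves both related pairs, so A, C and D are all retained.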
We then conducted imputation using the MACH Markov Chain Haplotyping software with CEU individuals from phase 2 of the HapMap project as a reference set (release 22). The final imputed data set consisted of 8340 individuals, each with 2 594 390 imputed markers. Only imputed genotypes with minor allele frequencies ≥1% and R-squared ≥0.3 were considered for association. Of the 8340 mothers with genetic data, 2874 also had phenotype data available.
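The post-imputation filter described above can be sketched as a filter over per-marker summary records. This is an illustration only: the record layout (dictionaries with `maf` and `rsq` keys) and the marker names are assumptions, not the format of any ALSPAC or MACH output file.

```python
MIN_MAF = 0.01  # minor allele frequency threshold from the text
MIN_RSQ = 0.3   # imputation R-squared threshold from the text

def association_ready(markers):
    """Keep only markers meeting both the MAF and R-squared thresholds."""
    return [m for m in markers if m["maf"] >= MIN_MAF and m["rsq"] >= MIN_RSQ]

# Hypothetical marker summaries for illustration.
markers = [
    {"id": "snp_a", "maf": 0.20, "rsq": 0.95},
    {"id": "snp_b", "maf": 0.20, "rsq": 0.10},   # poorly imputed
    {"id": "snp_c", "maf": 0.002, "rsq": 0.90},  # too rare
]
kept_ids = [m["id"] for m in association_ready(markers)]
```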
Associated publication:
3.3.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:gi_hapmap2_g0m_2022-12-07_f5 name: Genome-wide - HapMap2 imputed - G0 mothers version 2022-12-07 freeze 5 description: >- Version 2022-12-07 freeze 5 of Genome-wide array data imputed to the HapMap2 reference panel for G0 mothers. The number of variants & individuals within each plink file set can be viewed within the log file. freeze_size: 4.9G linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9 woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473 all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a git_tag: https://github.com/alspac/dataset_gi_hapmap2_g0m/releases/tag/freeze5 is_current_freeze: true freeze_number: 5 freeze_date: 2025-02-27 previous_freeze: alspacdcs:gi_hapmap2_g0m_2022-12-07_f4 freeze_of_alspac_dataset_version: alspacdcs:gi_hapmap2_g0m_2022-12-07 freeze_of_named_alspac_dataset: alspacdcs:gi_hapmap2_g0m has_containers: - id: alspacdcs:50ae5d38-7e0d-4366-ab2d-73e91960eb24 ## uuid name: plink description: A dir/folder containing the plink freeze data files. There are 8118 individuals within this dataset. has_parts: - id: alspacdcs:b5dac573-22e8-4aff-93e3-6988b564df3d name: bed files description: >- Plink standard format bed files. One per chromosome. See https://www.cog-genomics.org/plink/1.9/formats for further information. 
belongs_to_container: alspacdcs:50ae5d38-7e0d-4366-ab2d-73e91960eb24 data_distributions: - id: alspacdcs:7efe0162-1061-4d97-8149-57346a60105d name: freeze_id_chr1.bed md5sum: 01f7205ea4b6e852c0e8feb72a2cb9cd filesize: 374.7MB filetype: .bed - id: alspacdcs:10652e6a-a6f9-4cbb-a46a-57f6202df995 name: freeze_id_chr2.bed md5sum: 494713bafedd17c3be4e782f7881dcc0 filesize: 427.5MB filetype: .bed - id: alspacdcs:1eccd5b3-a5d3-4362-9755-383141333576 name: freeze_id_chr3.bed md5sum: 609847ca0489b7a97725ec275f8337d2 filesize: 337.5MB filetype: .bed - id: alspacdcs:7158d2fe-2c64-4286-902f-2374c8f3d1c9 name: freeze_id_chr4.bed md5sum: 147fee33c621f644dad5a2d8ee86fc1d filesize: 315.9MB filetype: .bed - id: alspacdcs:4ce18ceb-5a2a-49f8-864e-96f188ff4015 name: freeze_id_chr5.bed md5sum: a3a47a8ea90e0fa39d5c203436b6d982 filesize: 325.5MB filetype: .bed - id: alspacdcs:c2b647c2-c543-4b0c-824a-eb46b8e71043 name: freeze_id_chr6.bed md5sum: 953f9c82981d59d25dabe44ba5718b29 filesize: 353.1MB filetype: .bed - id: alspacdcs:92eca9bf-acec-4211-8078-5d727847f51f name: freeze_id_chr7.bed md5sum: fb9e8aaf4ae7c3fc75233248ec9d03b0 filesize: 277.3MB filetype: .bed - id: alspacdcs:e479937f-30e9-42d6-8b7e-98948660f187 name: freeze_id_chr8.bed md5sum: de34e8ef57e4c08991e4778401adf861 filesize: 285.5MB filetype: .bed - id: alspacdcs:3ac4109b-ce90-404e-9c33-14a0e0724e57 name: freeze_id_chr9.bed md5sum: 58ff215f0652257867e42f567ff1c2be filesize: 236.4MB filetype: .bed - id: alspacdcs:71a413bf-2a47-43f9-be4c-a90166c94ad6 name: freeze_id_chr10.bed md5sum: 4606d4a5a008927b6ab051461218094a filesize: 267.9MB filetype: .bed - id: alspacdcs:f437e00b-2de1-4f72-b169-a5a13d844a89 name: freeze_id_chr11.bed md5sum: 3c89898ce9fc0445c566ea0c060fb9db filesize: 251.8MB filetype: .bed - id: alspacdcs:1bcea287-df1c-4d62-8fc5-974658a1b6ed name: freeze_id_chr12.bed md5sum: 367f44ccd183c47334cfc7cb8333628a filesize: 241.7MB filetype: .bed - id: alspacdcs:355962f4-d2c8-4477-b5f1-d523befa9695 name: freeze_id_chr13.bed 
md5sum: 0e99cf077012880a802dc36ce72142c1 filesize: 201.6MB filetype: .bed - id: alspacdcs:1b0a61a1-7b3a-4950-8eea-14663fba419d name: freeze_id_chr14.bed md5sum: a41f9803ec71a0dcdf137806b21ba2e6 filesize: 162.5MB filetype: .bed - id: alspacdcs:f403092d-e133-49eb-b8da-b8dd05069089 name: freeze_id_chr15.bed md5sum: 611159bc9c4500de559615d0a7c549f2 filesize: 140.0MB filetype: .bed - id: alspacdcs:1bb114aa-4804-42c9-b62b-bede2c7d4b0f name: freeze_id_chr16.bed md5sum: b04eb2e4e66fef7ee7d48cb666d78c38 filesize: 138.5MB filetype: .bed - id: alspacdcs:e8d235a0-c488-4294-b39a-302f92646247 name: freeze_id_chr17.bed md5sum: c6d54ed5ac68f2e0bd806b6124463ee4 filesize: 113.2MB filetype: .bed - id: alspacdcs:9871e903-1d2c-401f-aedc-9a20b44e375a name: freeze_id_chr18.bed md5sum: 6b46a8d2993dae303334b9a51b50b92c filesize: 148.7MB filetype: .bed - id: alspacdcs:70569400-5d25-4fac-93a8-181eaecb0d6c name: freeze_id_chr19.bed md5sum: 801ccb3bb64dddaabfc2b7a4a1e4c5b0 filesize: 71.7MB filetype: .bed - id: alspacdcs:bfb9ce39-6ad2-4c04-952e-070511f96cb8 name: freeze_id_chr20.bed md5sum: 2af011bb98d6b8a8b00b7d938700fdac filesize: 122.8MB filetype: .bed - id: alspacdcs:41a74b3f-38f5-45ac-aef4-05764316abd3 name: freeze_id_chr21.bed md5sum: 13165e1c9a27aa42853429b0246a1ed5 filesize: 65.6MB filetype: .bed - id: alspacdcs:a2cdb6ad-d6c0-49d9-a9e8-33b1f61d8226 name: freeze_id_chr22.bed md5sum: 5abcf552c585152ed0ee11754f3e7833 filesize: 65.5MB filetype: .bed - id: alspacdcs:03943a89-75f1-4833-83f2-0fb740aff2df name: bim files description: >- Plink standard bim files. One per chromosome. Contains variant information. See https://www.cog-genomics.org/plink/1.9/formats for further information. 
belongs_to_container: alspacdcs:50ae5d38-7e0d-4366-ab2d-73e91960eb24 data_distributions: - id: alspacdcs:b7f6b64c-c359-4871-955c-e69a328dff6d name: freeze_id_chr1.bim md5sum: 44795681691b62d1921ad8855fd11a09 filesize: 5.1MB filetype: .bim number_of_variants: 193554 - id: alspacdcs:25f5868c-261d-4f4b-a61f-2585c25cc16c name: freeze_id_chr2.bim md5sum: 275cefa559489b51bebbc65657a91822 filesize: 5.9MB filetype: .bim number_of_variants: 220833 - id: alspacdcs:bbc0949b-a6b1-4c49-bfc5-5c14c112bb32 name: freeze_id_chr3.bim md5sum: 96d147406f1f24697b0cb9af0c7091fc filesize: 4.6MB filetype: .bim number_of_variants: 174356 - id: alspacdcs:f5a905fe-a12b-49f5-bf6d-89ae8dbcce69 name: freeze_id_chr4.bim md5sum: 54a244447b1345636690b252215bfd2d filesize: 4.3MB filetype: .bim number_of_variants: 163157 - id: alspacdcs:bd66e1db-4dda-476c-9f29-8a01c22740ec name: freeze_id_chr5.bim md5sum: e8f55ef9016bf2f03ee43f08a6c974c3 filesize: 4.4MB filetype: .bim number_of_variants: 168144 - id: alspacdcs:47363a39-591d-4d1d-a0eb-d1124ea94485 name: freeze_id_chr6.bim md5sum: 3fd4e793a35c5e935454efc1105be192 filesize: 4.8MB filetype: .bim number_of_variants: 182381 - id: alspacdcs:9d7d252a-49d5-4476-9222-bb3e2c2efdf4 name: freeze_id_chr7.bim md5sum: dae38c5168605323dfc584a73f3ce4a1 filesize: 3.8MB filetype: .bim number_of_variants: 143232 - id: alspacdcs:6dfeca17-935d-4664-a428-28118165d701 name: freeze_id_chr8.bim md5sum: 6243ef376ee6cbe643bec69201bec604 filesize: 3.9MB filetype: .bim number_of_variants: 147483 - id: alspacdcs:38dee48b-137e-49cb-a115-aeaded91f3e3 name: freeze_id_chr9.bim md5sum: 1e828e0f36c2d168ce6c1df5887a764b filesize: 3.2MB filetype: .bim number_of_variants: 122112 - id: alspacdcs:602fd303-2bae-414a-a5eb-e8e8e283f39c name: freeze_id_chr10.bim md5sum: 3c259904c7da548d25c86a4a36e96285 filesize: 3.8MB filetype: .bim number_of_variants: 138402 - id: alspacdcs:75d5e65b-5fc3-4f8a-b01b-e69ea1c45628 name: freeze_id_chr11.bim md5sum: 703ecef520ce7363c24e9600b363570f filesize: 3.5MB 
filetype: .bim number_of_variants: 130069 - id: alspacdcs:f5166ec7-30e3-4ec4-9de7-0e91454381a8 name: freeze_id_chr12.bim md5sum: 515a46f735c531163377d114549042b5 filesize: 3.4MB filetype: .bim number_of_variants: 124860 - id: alspacdcs:d7f85dcf-75dd-4736-bd7f-5f458f2081c7 name: freeze_id_chr13.bim md5sum: cd1b7c80977fb5a0bbd87bc83dd85aed filesize: 2.8MB filetype: .bim number_of_variants: 104120 - id: alspacdcs:b6b69cba-f36a-4ddd-bdab-1b80703f7817 name: freeze_id_chr14.bim md5sum: 4a933818aaea48201f455ebd07ea1b78 filesize: 2.3MB filetype: .bim number_of_variants: 83936 - id: alspacdcs:a0f73fdd-1936-4407-ad7a-3743f87fe429 name: freeze_id_chr15.bim md5sum: 1e1139db4b031ba577b5ac6ae000ce6f filesize: 1.9MB filetype: .bim number_of_variants: 72300 - id: alspacdcs:538edd86-5b6b-4d41-8387-1e8d8b2d7b72 name: freeze_id_chr16.bim md5sum: 8bd9cb45256b6b5ca37ce66eec810035 filesize: 1.9MB filetype: .bim number_of_variants: 71550 - id: alspacdcs:64834bdd-5c5d-4883-a2cf-b457e35ae0ea name: freeze_id_chr17.bim md5sum: 0dc0770759f9edccec7ce305e07b57d4 filesize: 1.6MB filetype: .bim number_of_variants: 58455 - id: alspacdcs:38d19e6e-ed06-4724-a339-04eac4625317 name: freeze_id_chr18.bim md5sum: 9ffd8f006c82701060dff29bf460e8fe filesize: 2.1MB filetype: .bim number_of_variants: 76812 - id: alspacdcs:6119148e-dd26-4c3e-a934-bfdbd96d5099 name: freeze_id_chr19.bim md5sum: c6fce7e15e198304f752ccbce66299b9 filesize: 1012.3KB filetype: .bim number_of_variants: 37045 - id: alspacdcs:4a80d624-fb52-48cf-9e39-fdda12c2b6c0 name: freeze_id_chr20.bim md5sum: 6e0b2d6cd06cc6e36f9cbc3f8df0a169 filesize: 1.7MB filetype: .bim number_of_variants: 63408 - id: alspacdcs:fbf18f66-9ddd-4e0f-b3cf-026c43b48826 name: freeze_id_chr21.bim md5sum: c1f6f2181c49172608ac79e18425e4f4 filesize: 924.7KB filetype: .bim number_of_variants: 33863 - id: alspacdcs:47ffbaa6-f74d-470b-b48e-15a469f6e7c8 name: freeze_id_chr22.bim md5sum: 86a1da3366ba87e62f561dc09f64f9ac filesize: 920.9KB filetype: .bim number_of_variants: 33815 - 
id: alspacdcs:ef194b12-6864-44b3-9d89-2090dfe305d0 name: fam files description: >- Plink standard format fam files. One per chromosome. Contains sample information. See https://www.cog-genomics.org/plink/1.9/formats for further information. belongs_to_container: alspacdcs:50ae5d38-7e0d-4366-ab2d-73e91960eb24 data_distributions: - id: alspacdcs:3b986b53-808e-4c4e-8275-d81d00d5ebb0 name: freeze_id_chr1.fam md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8 filesize: 277.5KB filetype: .fam number_of_participants: 8118 - id: alspacdcs:be2d1539-f667-4748-ad08-993aee03319a name: freeze_id_chr2.fam md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8 filesize: 277.5KB filetype: .fam number_of_participants: 8118 - id: alspacdcs:eff796b6-bb95-4d15-babf-ff893a328ae8 name: freeze_id_chr3.fam md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8 filesize: 277.5KB filetype: .fam number_of_participants: 8118 - id: alspacdcs:34325f98-aafa-4d83-92c5-1efc16e60b31 name: freeze_id_chr4.fam md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8 filesize: 277.5KB filetype: .fam number_of_participants: 8118 - id: alspacdcs:13f956cf-a81a-49a0-bcc3-7315425d94f6 name: freeze_id_chr5.fam md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8 filesize: 277.5KB filetype: .fam number_of_participants: 8118 - id: alspacdcs:d43d7d7c-bd21-4035-9cf4-5a946c0b0548 name: freeze_id_chr6.fam md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8 filesize: 277.5KB filetype: .fam number_of_participants: 8118 - id: alspacdcs:534c0964-72af-4418-9f0a-223d0dfbc74f name: freeze_id_chr7.fam md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8 filesize: 277.5KB filetype: .fam number_of_participants: 8118 - id: alspacdcs:a139d84d-286d-4d65-bfa6-e1a329345ac9 name: freeze_id_chr8.fam md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8 filesize: 277.5KB filetype: .fam number_of_participants: 8118 - id: alspacdcs:23ee8db2-4cb5-413d-b714-0dd88334f645 name: freeze_id_chr9.fam md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8 filesize: 277.5KB filetype: .fam number_of_participants: 8118 - id: 
alspacdcs:a2155a79-99ea-425b-bac3-0b305f097246 name: freeze_id_chr10.fam md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8 filesize: 277.5KB filetype: .fam number_of_participants: 8118 - id: alspacdcs:6b064bab-e8cc-4477-8c4d-8bfc17bb685a name: freeze_id_chr11.fam md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8 filesize: 277.5KB filetype: .fam number_of_participants: 8118 - id: alspacdcs:f8826d19-d418-488a-8112-8367b5243ea2 name: freeze_id_chr12.fam md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8 filesize: 277.5KB filetype: .fam number_of_participants: 8118 - id: alspacdcs:45bc4687-9b97-47db-b3cd-0e7660a77abd name: freeze_id_chr13.fam md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8 filesize: 277.5KB filetype: .fam number_of_participants: 8118 - id: alspacdcs:68e4a8f3-df5a-4450-b93a-f7a71689a397 name: freeze_id_chr14.fam md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8 filesize: 277.5KB filetype: .fam number_of_participants: 8118 - id: alspacdcs:c1b56a86-4d10-4547-b8f0-a3820f4c20ed name: freeze_id_chr15.fam md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8 filesize: 277.5KB filetype: .fam number_of_participants: 8118 - id: alspacdcs:9cb04ba9-e116-4ba7-ae9f-2e1d833708fc name: freeze_id_chr16.fam md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8 filesize: 277.5KB filetype: .fam number_of_participants: 8118 - id: alspacdcs:e4d93b96-b25f-4608-b721-f891e6c2d6df name: freeze_id_chr17.fam md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8 filesize: 277.5KB filetype: .fam number_of_participants: 8118 - id: alspacdcs:f6ca5a65-8201-46fe-ac3d-c674fe440d8c name: freeze_id_chr18.fam md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8 filesize: 277.5KB filetype: .fam number_of_participants: 8118 - id: alspacdcs:212d5196-61c5-4d52-b613-5f6616df9fca name: freeze_id_chr19.fam md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8 filesize: 277.5KB filetype: .fam number_of_participants: 8118 - id: alspacdcs:cfaa861f-bdf7-468d-b61a-8b43829bc5ae name: freeze_id_chr20.fam md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8 filesize: 277.5KB filetype: .fam number_of_participants: 8118 - 
id: alspacdcs:8df35b0d-b9ab-4009-94be-4d37bdd31dad name: freeze_id_chr21.fam md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8 filesize: 277.5KB filetype: .fam number_of_participants: 8118 - id: alspacdcs:6055cdf3-c2cb-4fe6-9440-1853beb7ebd3 name: freeze_id_chr22.fam md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8 filesize: 277.5KB filetype: .fam number_of_participants: 8118 - id: alspacdcs:ac6efac2-4681-4147-92da-b3ec2467aa96 name: log files description: >- Plink log files. One per chromosome. Contains log information. See https://www.cog-genomics.org/plink/1.9/formats for further information. belongs_to_container: alspacdcs:50ae5d38-7e0d-4366-ab2d-73e91960eb24 data_distributions: - id: alspacdcs:d929e593-a613-4ca3-9f35-c35b32be2db6 name: freeze_id_chr1.log md5sum: 4f11877b64d9c9a0995ab1c577f56110 filesize: 971.0B filetype: .log - id: alspacdcs:6fa0dbe8-630d-4856-8226-a832e5dc5389 name: freeze_id_chr2.log md5sum: 8a009ab139c9672ba79aaa52f4203d51 filesize: 971.0B filetype: .log - id: alspacdcs:a68c566f-c96e-46cd-b6c6-0c1125539094 name: freeze_id_chr3.log md5sum: 97957d1a5d46b7752133aabd29338719 filesize: 971.0B filetype: .log - id: alspacdcs:4c09ff2b-d741-485d-941c-b2d6a3b66058 name: freeze_id_chr4.log md5sum: b0320867b36bc85b3ad9a598056bba4b filesize: 971.0B filetype: .log - id: alspacdcs:85a71fd7-6eff-4dae-a190-cda29c54b293 name: freeze_id_chr5.log md5sum: ffebb4d4442623feadd2ebb9fea762aa filesize: 971.0B filetype: .log - id: alspacdcs:06fc2045-5551-4e12-942f-bb036da84233 name: freeze_id_chr6.log md5sum: 6a43815212a8ec77fe3a35c9d9c3692e filesize: 971.0B filetype: .log - id: alspacdcs:b686c422-0770-4799-a0f7-497ba9805a6b name: freeze_id_chr7.log md5sum: e0aed5e04f28d83c75d9c6dd77e90995 filesize: 971.0B filetype: .log - id: alspacdcs:a0a1f077-e686-4ca3-8d5d-8ce9613b59c9 name: freeze_id_chr8.log md5sum: 2f044afef682620cd7e97f0a65263245 filesize: 971.0B filetype: .log - id: alspacdcs:d3e4c602-3173-4e3e-9484-f3ab8a9aa003 name: freeze_id_chr9.log md5sum: 
d3be92946a1deb543f6e5c9d46620d0c filesize: 971.0B filetype: .log - id: alspacdcs:2685cbf0-561e-42f8-b520-0db9bc44b792 name: freeze_id_chr10.log md5sum: 4f371c2cd4e72ab6cf1a72b05e564bc3 filesize: 977.0B filetype: .log - id: alspacdcs:595095f1-bcb0-4f70-857e-f541d9a93db7 name: freeze_id_chr11.log md5sum: 2adab11dda24b32f38c8d15b59aca641 filesize: 977.0B filetype: .log - id: alspacdcs:64ffd176-931c-4077-9ad4-2a9fcd30ddb3 name: freeze_id_chr12.log md5sum: 42e1a1b5359f038351b4c4da9dd64832 filesize: 977.0B filetype: .log - id: alspacdcs:357ad1e6-d057-4201-adbd-e54b1659f998 name: freeze_id_chr13.log md5sum: e3bdc7637c2434b93036875597d8d0af filesize: 977.0B filetype: .log - id: alspacdcs:60190e1f-b199-4e98-bdf0-9e9fde558722 name: freeze_id_chr14.log md5sum: adae00f346ebe7c59179bd6921711b11 filesize: 975.0B filetype: .log - id: alspacdcs:b9373ee2-cd15-4098-a754-84b7ee0454cf name: freeze_id_chr15.log md5sum: 1396755abe3983552865e224131367b9 filesize: 975.0B filetype: .log - id: alspacdcs:900c8cb4-489d-4990-bdcf-6a2884abfbae name: freeze_id_chr16.log md5sum: f644d2721c522b723f870e122886244b filesize: 975.0B filetype: .log - id: alspacdcs:73c2f0fc-d654-4563-ab9a-9e927d7f7057 name: freeze_id_chr17.log md5sum: 62c56ae0fa0d64212bee294600a0f78e filesize: 975.0B filetype: .log - id: alspacdcs:97d1ff20-3781-4b93-b82b-06f485178201 name: freeze_id_chr18.log md5sum: b9803f7d4723f6e7d1759115354b88cf filesize: 975.0B filetype: .log - id: alspacdcs:94c2b803-de0d-41c8-9f30-a0b72c6e5e3c name: freeze_id_chr19.log md5sum: 3e462678c1e53cceb57658be036768aa filesize: 975.0B filetype: .log - id: alspacdcs:0a8c20ea-b26c-4794-b9e2-b8971e4da850 name: freeze_id_chr20.log md5sum: b5d7ac4496e2c2bb53bd025d6f1cf948 filesize: 975.0B filetype: .log - id: alspacdcs:cce89426-b396-43c9-86af-8d1d1ccd2fde name: freeze_id_chr21.log md5sum: 57d7138f8b258ae539b289b863c8bab4 filesize: 975.0B filetype: .log - id: alspacdcs:7c8ce228-052f-4490-bf61-ffcb3e517990 name: freeze_id_chr22.log md5sum: 
f547e75de71b74d04a640df1d153d46c filesize: 975.0B filetype: .log
3.4 Genome-wide - 1000G imputed - G0 partners (gi_1000g_g0p)
3.4.1 Description
This dataset contains genome-wide array data imputed to the 1000 Genomes reference panel for G0 partners, with some additional G0 mothers and G1 individuals. The data have been cleaned, flipped to the positive strand, mapped to b37 coordinates and imputed to 1000 Genomes phase 1 version 3. Reference genome build: GRCh37
3.4.2 Methodology
3,453 ALSPAC mothers and fathers were genotyped at 535,478 SNPs using the Illumina HumanCoreExome chip genotyping platform by the ALSPAC lab, and genotypes were called using GenomeStudio. The resulting raw genome-wide data were subjected to standard quality control methods using PLINK (v1.07). Individuals were excluded on the basis of gender mismatches (n = 80); minimal or excessive heterozygosity (n = 64); disproportionate levels of individual missingness (>5%, n = 60) and possible contamination (n = 3).
Population stratification was assessed by multidimensional scaling analysis and principal component analysis, with comparison against 1000 Genomes phase 3 data; all individuals with non-European ancestry were removed (n = 266).
Cryptic relatedness was measured as SNP relatedness in GCTA (relatedness > 0.1, n = 69 removed). SNPs with a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 1E-7) and those which failed GenomeStudio quality control measures were removed (n = 21,298). 6,594 duplicate SNPs were also removed.
This resulted in genotypes for 2,911 unrelated mothers and fathers at 507,586 SNPs. We then identified 2217 samples where the ALN assigned historically by the lab matched the genetically assigned ALN.
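The sample and SNP totals quoted above can be reproduced by subtracting each exclusion in turn. The short Python sketch below does only that arithmetic. The dictionary labels are illustrative, and it assumes the exclusion categories do not overlap, which the reported totals imply.

```python
# Exclusion counts as quoted in the methodology text; the labels are illustrative.
sample_exclusions = {
    "gender_mismatch": 80,
    "heterozygosity": 64,
    "missingness_gt_5pct": 60,
    "possible_contamination": 3,
    "non_european_ancestry": 266,
    "cryptic_relatedness": 69,
}
snp_exclusions = {
    "call_rate_hwe_or_genomestudio_qc": 21_298,
    "duplicates": 6_594,
}


def remaining(start, exclusions):
    """Subtract every exclusion count from the starting total."""
    return start - sum(exclusions.values())


samples_left = remaining(3_453, sample_exclusions)
snps_left = remaining(535_478, snp_exclusions)
print(samples_left, snps_left)
```

Both results agree with the 2,911 individuals and 507,586 SNPs reported in the text.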
We phased the data of 3074 samples that passed QC, a set that still contained related subjects, using SHAPEIT v2.r837. We then removed 155,336 monomorphic SNPs, 1033 markers not in 1000 Genomes, 11,842 A/T or G/C SNPs and 10 duplicate sites to give 337,732 SNPs on chromosomes 1-23. Of the 329,363 markers on chromosomes 1-22, 298,742 overlapped the reference genome. We imputed to 1000 Genomes phase 1 version 3 using the Michigan Imputation Server. We then identified 2217 samples where the ALN assigned historically by the lab matched the genetically assigned ALN. We then removed 12 subjects who have withdrawn consent and 6 subjects genotyped in an earlier work package to give 2201 subjects.
1737 putative G0 partner-G1 pairs for whom both G0 partner and G1 have called genotype data available were identified based on ALN. Given the G0 partners were invited by the G0 mother to take part and only enrolled in the study in their own right several years later, it could not be assumed that all G0 partners were biologically related to G1. Called genotype data for the 1720 unique G0 partners and 1737 unique G1s were merged (i.e. there were 17 pairs of siblings/twins among the G1 offspring), using plink v1.90b7.2 64-bit (11 Dec 2023).
After application of the PLINK filters --geno 0.05, --maf 0.01, --snps-only just-acgt and --autosome, the --related command in KING version 2.3.2 was used to perform kinship analysis. This confirmed that all 1737 putative G0 partner-G1 pairs are genetically related, as would be expected for biological father-offspring pairs under the inference criteria described in Table 1 of Manichaikul, Ani, et al. "Robust relationship inference in genome-wide association studies." Bioinformatics 26.22 (2010): 2867-2873.
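To make the inference step more concrete, the following Python sketch maps a KING kinship coefficient to a relationship category using the power-of-two boundaries commonly quoted from Manichaikul et al. (2010). It is a simplified illustration, not the ALSPAC pipeline: the function name is hypothetical, and the IBS0 criterion that separates parent-offspring pairs from full siblings is deliberately left out.

```python
def infer_degree(kinship):
    """Classify a pair from its kinship coefficient alone.

    Boundaries are 2^-1.5, 2^-2.5, 2^-3.5 and 2^-4.5
    (about 0.354, 0.177, 0.0884 and 0.0442).
    """
    if kinship > 2 ** -1.5:
        return "duplicate/MZ twin"
    if kinship > 2 ** -2.5:
        return "first-degree"      # parent-offspring or full siblings
    if kinship > 2 ** -3.5:
        return "second-degree"
    if kinship > 2 ** -4.5:
        return "third-degree"
    return "unrelated"


# A father-offspring pair is expected to have kinship close to 0.25.
print(infer_degree(0.25))
```

A pair flagged as first-degree by kinship alone would still need the IBS0 proportion to be checked before it is called a parent-offspring pair.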
3.4.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:gi_1000g_g0p_2016-11-22_f5 name: Genome-wide - 1000G imputed - G0 partners version 2016-11-22 freeze 5 description: >- This dataset is the fifth freeze of the 2016-11-22 version of the Genome-wide array data imputed to the 1000 genomes reference panel for G0 partners, with some additional G0 mothers and G1 individuals. freeze_size: 44G linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9 woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473 all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a git_tag: https://github.com/alspac/dataset_gi_1000g_g0p/releases/tag/freeze5 is_current_freeze: true freeze_number: 5 freeze_date: 2025-02-27 previous_freeze: alspacdcs:gi_1000g_g0p_2016-11-22_f4 freeze_of_alspac_dataset_version: alspacdcs:gi_1000g_g0p_2016-11-22 freeze_of_named_alspac_dataset: alspacdcs:gi_1000g_g0p has_containers: - id: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c name: data description: A dir/folder containing the data bgen and sample files has_parts: - id: alspacdcs:gi_1000g_g0p_2016-11-22_sample_f4 name: Samples description: >- The samples in the data. To be used with the genetic data. A plain text .sample file. See https://doi.org/10.1101/308296 for file format details. data_distributions: - id: alspacdcs:593e4010-b671-4f25-b040-295b78e3107b name: swapped.sample md5sum: fc74e422b93dc53025b9664c0a57f320 filesize: 164.9KB filetype: .sample number_of_participants: 2198 belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c - id: alspacdcs:c2d03376-f408-4ae9-b9a5-523ee1173b9a name: bgens description: >- An Oxford Bgen (v1.2) file for all chromosomes. To be used with sample file. See https://doi.org/10.1101/308296 for file format details. 
data_distributions: - id: alspacdcs:b8bd3364-26fe-4bc8-b635-7050788ef646 name: filtered_data_chr01.bgen md5sum: a5eb049e4df5a8b005ae51b47947d830 filesize: 3.3GB filetype: .bgen number_of_variants: 2159337 number_of_participants: 2198 belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c - id: alspacdcs:5557335c-c2d9-4456-a25f-e11f652e9612 name: filtered_data_chr02.bgen md5sum: e297c8d30455053d23ac360bcc886bb0 filesize: 3.5GB filetype: .bgen number_of_variants: 2349883 number_of_participants: 2198 belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c - id: alspacdcs:2e35ed2f-6b37-4d55-b88f-08d9f090f636 name: filtered_data_chr03.bgen md5sum: c0b55e9d65c219ffb1b8c58a0ebb7c18 filesize: 3.0GB filetype: .bgen number_of_variants: 1969275 number_of_participants: 2198 belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c - id: alspacdcs:bed2d41e-0fed-46fb-98d0-aa2cc845c0ca name: filtered_data_chr04.bgen md5sum: 514f09f02c74fc3eca83379e9e99c5dc filesize: 3.1GB filetype: .bgen number_of_variants: 1969883 number_of_participants: 2198 belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c - id: alspacdcs:4f346ccf-883b-436c-9af3-d1c0c76fe03b name: filtered_data_chr05.bgen md5sum: f4accbf5bdd6a2ccc9598e9e2221915d filesize: 2.7GB filetype: .bgen number_of_variants: 1809961 number_of_participants: 2198 belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c - id: alspacdcs:8fbfd61d-e9b9-4f5f-84f1-ff534cd061d4 name: filtered_data_chr06.bgen md5sum: a9327ad1591fdf7d349b066544e71c3a filesize: 2.6GB filetype: .bgen number_of_variants: 1758025 number_of_participants: 2198 belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c - id: alspacdcs:785f2386-a16d-4631-859a-39a1a6c3fb8b name: filtered_data_chr07.bgen md5sum: f832922558eddcf3feed87091c2ec0ae filesize: 2.6GB filetype: .bgen number_of_variants: 1601293 number_of_participants: 2198 belongs_to_container: 
alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c - id: alspacdcs:f63100b6-49f9-4d01-9c6d-e990ee513da1 name: filtered_data_chr08.bgen md5sum: 47d79712e676a0048f90858cbb888179 filesize: 2.3GB filetype: .bgen number_of_variants: 1558902 number_of_participants: 2198 belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c - id: alspacdcs:359141df-26e5-4048-8d25-a6940a9a8893 name: filtered_data_chr09.bgen md5sum: 82a480f3e8792db2c1cec3adc50e1357 filesize: 1.9GB filetype: .bgen number_of_variants: 1189463 number_of_participants: 2198 belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c - id: alspacdcs:8f97f928-e876-4f1c-a9b3-db74887b9fc8 name: filtered_data_chr10.bgen md5sum: 8f64fe184e4c876a345a728ed5eeddcf filesize: 2.1GB filetype: .bgen number_of_variants: 1363104 number_of_participants: 2198 belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c - id: alspacdcs:0adf771c-75f7-4709-afc1-47b9c32107d8 name: filtered_data_chr11.bgen md5sum: b1b7e3bef0fe72cd90bd0ba456f687aa filesize: 2.1GB filetype: .bgen number_of_variants: 1359640 number_of_participants: 2198 belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c - id: alspacdcs:71dcdd1b-18d3-42a0-a52c-f1e8bc6cb933 name: filtered_data_chr12.bgen md5sum: 509202db22200fe0bd58210ab8e9c757 filesize: 2.1GB filetype: .bgen number_of_variants: 1316510 number_of_participants: 2198 belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c - id: alspacdcs:c2e5243e-9c84-4030-a34e-0148cc9c42b2 name: filtered_data_chr13.bgen md5sum: 176a10d38ab80783a8e392e5791edea7 filesize: 1.5GB filetype: .bgen number_of_variants: 988473 number_of_participants: 2198 belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c - id: alspacdcs:971ace3d-1a9c-47f0-b012-a1b943881b70 name: filtered_data_chr14.bgen md5sum: 1ecd96aab2925bafd7d20497d85dd937 filesize: 1.4GB filetype: .bgen number_of_variants: 903811 number_of_participants: 2198 belongs_to_container: 
alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c - id: alspacdcs:5a4bd639-5d03-4eb3-a4eb-ba074e7d27ef name: filtered_data_chr15.bgen md5sum: f8c5b54206189808e9a361cc0da63798 filesize: 1.4GB filetype: .bgen number_of_variants: 814028 number_of_participants: 2198 belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c - id: alspacdcs:bbe21f58-22bc-4ce0-a40b-9f0e11f5bafd name: filtered_data_chr16.bgen md5sum: 52f065575d3cb2dff34df6763a583766 filesize: 1.5GB filetype: .bgen number_of_variants: 867901 number_of_participants: 2198 belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c - id: alspacdcs:7df9f8db-2c34-4f5e-a868-e55620c740c7 name: filtered_data_chr17.bgen md5sum: 73d85caf67dcedc63b11a43bd5ccb44d filesize: 1.4GB filetype: .bgen number_of_variants: 755467 number_of_participants: 2198 belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c - id: alspacdcs:2a8eddd9-716c-460e-bc4a-78b4b5df12b4 name: filtered_data_chr18.bgen md5sum: b8e055a6c0955bb67161c9f7a1d8cad7 filesize: 1.3GB filetype: .bgen number_of_variants: 783661 number_of_participants: 2198 belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c - id: alspacdcs:6891d0c5-e685-46c7-a98f-ef09036db1e9 name: filtered_data_chr19.bgen md5sum: 37ea045cd9f4027cba547b7b89c3a1a0 filesize: 1.2GB filetype: .bgen number_of_variants: 606147 number_of_participants: 2198 belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c - id: alspacdcs:51e60215-105a-403a-8112-7d731de3471e name: filtered_data_chr20.bgen md5sum: d241eb21be3188c26c460e1f65f0d8c1 filesize: 1.1GB filetype: .bgen number_of_variants: 618749 number_of_participants: 2198 belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c - id: alspacdcs:1dde86df-9e0a-4c02-afed-6381299cfa49 name: filtered_data_chr21.bgen md5sum: 7881bdc24e7f0adbfb800b49d1efd590 filesize: 671.1MB filetype: .bgen number_of_variants: 378064 number_of_participants: 2198 belongs_to_container: 
alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c - id: alspacdcs:16e2f696-cb21-4661-bce0-3a712fcd3eae name: filtered_data_chr22.bgen md5sum: 824412e963441699f260c6245f65659d filesize: 721.5MB filetype: .bgen number_of_variants: 366590 number_of_participants: 2198 belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c
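The md5sum values listed in the freeze docs can be used to confirm that a file was transferred intact. A minimal Python sketch of such a check follows. The function names and the example path are hypothetical, and it assumes the listed md5sum is the plain hexadecimal MD5 digest of the file contents.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file digest matches the md5sum given in the freeze docs."""
    return md5sum(path) == expected_md5.strip().lower()


# Example call (hypothetical local path, md5sum taken from the listing above):
# verify("data/filtered_data_chr22.bgen", "824412e963441699f260c6245f65659d")
```

Reading in chunks matters here because individual bgen files in this freeze are several gigabytes in size.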
3.5 Genome-wide - 1000G imputed - G0 mothers + G1 (gi_1000g_g0m_g1)
3.5.1 Description
This dataset contains genome-wide 1000G imputed data for G0 mothers + G1. The data have been cleaned, flipped to the positive strand, aligned to b37 coordinates and imputed to 1000 Genomes phase 1 version 3. Reference genome build: GRCh37
3.5.2 Methodology
ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platforms by 23andMe, subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%) and insufficient sample replication (IBD < 0.8).
Population stratification was assessed by multidimensional scaling analysis and compared with HapMap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed.
SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed. Cryptic relatedness was measured as proportion of identity by descent (IBD > 0.1). Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,115 subjects and 500,527 SNPs passed these quality control filters.
ALSPAC mothers were genotyped using the Illumina human660W-quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs. SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally SNPs with a minor allele frequency of less than 1% were removed.
Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded.
Cryptic relatedness was assessed using an IBD estimate of more than 0.125, which is expected to correspond to roughly 12.5% of alleles shared IBD or relatedness at the first-cousin level. Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,048 subjects and 526,688 SNPs passed these quality control filters.
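The link between the 0.125 threshold and first cousins follows from the standard pedigree calculation of expected IBD sharing: each common ancestor contributes (1/2)^n, where n is the number of meioses on the path joining the two relatives through that ancestor. The Python sketch below illustrates this under the assumption of no inbreeding; the function name is hypothetical.

```python
def expected_ibd_sharing(paths):
    """Expected proportion of alleles shared IBD between two relatives.

    `paths` holds, for each common ancestor, the number of meioses on the
    path linking the two relatives through that ancestor.
    """
    return sum(0.5 ** n for n in paths)


first_cousins = expected_ibd_sharing([4, 4])  # two shared grandparents, 4 meioses each
half_siblings = expected_ibd_sharing([2])     # one shared parent, 2 meioses
full_siblings = expected_ibd_sharing([2, 2])  # two shared parents, 2 meioses each
print(first_cousins, half_siblings, full_siblings)
```

First cousins come out at exactly 0.125, which is why pairs above this value are treated as related at the first-cousin level or closer.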
We combined 477,482 SNP genotypes in common between the sample of mothers and sample of children. We removed SNPs with genotype missingness above 1% due to poor quality (11,396 SNPs removed) and removed a further 321 subjects due to potential ID mismatches. This resulted in a dataset of 17,842 subjects containing 6,305 duos and 465,740 SNPs (112 were removed during liftover and 234 were out of HWE after combination). We estimated haplotypes using ShapeIT (v2.r644), which utilises relatedness during phasing. We obtained a phased version of the 1000 Genomes reference panel (Phase 1, Version 3) from the Impute2 reference data repository (phased using ShapeIT v2.r644, haplotype release date Dec 2013). Imputation of the target data was performed using Impute V2.2.2 against the reference panel (all polymorphic SNPs excluding singletons), using all 2186 reference haplotypes (including non-Europeans).
This gave 8,237 eligible children and 8,196 eligible mothers with available genotype data after exclusion of related subjects using cryptic relatedness measures described previously.
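The combined totals in this paragraph are consistent with simple addition and subtraction of the counts given in the text, as the Python sketch below shows. The variable names are illustrative, and the assumption that the 17,842 subjects are the two QC-passing sets minus the ID mismatches is inferred from the numbers rather than stated in the source.

```python
# Counts quoted in the methodology text.
children_pass_qc = 9_115
mothers_pass_qc = 9_048
id_mismatches = 321

shared_snps = 477_482
snp_removals = {
    "missingness_gt_1pct": 11_396,
    "removed_during_liftover": 112,
    "out_of_hwe_after_combination": 234,
}

subjects = children_pass_qc + mothers_pass_qc - id_mismatches
snps = shared_snps - sum(snp_removals.values())
print(subjects, snps)
```

The results match the 17,842 subjects and 465,740 SNPs reported for the combined dataset.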
Known issues: There is a known strand issue present within this imputation: The Dec 2013 haplotype release of 1000 Genomes phase 1 version 3 has 199 reported SNPs with incorrect strand. For more information and the origins of this list please visit https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated_SHAPEIT2_16-06-14.html. It is very unlikely that they have systematic effects across the genome; the problem is most probably isolated to these 199 known problematic SNPs. Users are advised to discard them from their analysis.
The bgen files within the gi_1000g_g0m_g1 dataset have NA in place of the chromosome column. Some tools may allow this, while others are less forgiving. Users may therefore wish to re-format the dataset (using QCtool or equivalent) for their work.
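Discarding the flagged SNPs amounts to a set-based exclusion on variant identifiers. The Python sketch below shows one way this could be done once the list has been obtained from the page linked above. The function name and the identifiers in the example are placeholders, not real entries from that list.

```python
def drop_flagged(variant_ids, flagged_ids):
    """Return the variant IDs that are not in the flagged list, preserving input order."""
    flagged = set(flagged_ids)
    return [vid for vid in variant_ids if vid not in flagged]


# Placeholder identifiers for illustration only.
analysis_variants = ["rs_example_1", "rs_example_2", "rs_example_3"]
strand_flagged = ["rs_example_2"]
print(drop_flagged(analysis_variants, strand_flagged))
```

The same exclusion can usually be passed to analysis software directly as a list of variant IDs to exclude.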
Allele frequency concordance with other cohorts: When contributing to consortia you may find that the allele frequencies in ALSPAC for a few thousand SNPs are discordant from a reference panel used by the consortium. This is actually to be expected - when calculating allele frequencies, even from the same population, in two different samples for many millions of SNPs there will be a number of SNPs that appear to be highly discordant.
3.5.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_f5 name: >- Genome-wide - 1000G imputed - G0 mothers + G1 version 2015-10-30 freeze 5 description: >- This is the fifth freeze of the 2015-10-30 version of the gi_1000g_g0m_g1 dataset. It contains data in the Oxford format which is a combination of bgen and sample (version 1.2) files. It is a subset of the data in gi_1000g_g0m_g1_2015-10-30 limited to one format and with participants who have withdrawn their consent removed. The Dec 2013 haplotype release of 1000 genomes phase 1 version 3 has 199 reported SNPs with incorrect strand. The strand issues are present in this imputation version. For more information and the origins of this list please visit: https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated_SHAPEIT2_16-06-14.html It is very unlikely that they have systematic effects across the genome and most probably are just isolated to these 199 known problematic SNPs. The user is advised to discard them from their analysis. freeze_size: 122G linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9 woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473 all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a git_tag: https://github.com/alspac/dataset_gi_1000g_g0m_g1/releases/tag/freeze5 is_current_freeze: true freeze_number: 5 freeze_date: 2025-02-27 previous_freeze: alspacdcs:gi_1000g_g0m_g1_2015-10-30_f4 freeze_of_alspac_dataset_version: alspacdcs:gi_1000g_g0m_g1_2015-10-30 freeze_of_named_alspac_dataset: alspacdcs:gi_1000g_g0m_g1 has_containers: - id: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 name: data description: A dir/folder containing the data bgen and sample files has_parts: - id: alspacdcs:fa64c3c2-14ae-4853-bb1a-bec2545d217d name: Samples description: >- The samples in the data. 
To be used with the genetic data. A plain text .sample file. See https://doi.org/10.1101/308296 for file format details. data_distributions: - id: alspacdcs:bf6acc7d-a788-4ea1-b836-691582bef85f name: swapped.sample md5sum: d7dd4fe786b399bb107b332acf27f8bc filesize: 1.2MB filetype: .sample number_of_participants: 17444 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 - id: alspacdcs:87995855-f693-4b4d-8155-8dcb141b85ec name: Bgens description: >- An Oxford Bgen (v1.2) file for all chromosomes. To be used with sample file. See https://doi.org/10.1101/308296 for file format details. data_distributions: - id: alspacdcs:15a76b52-275f-4969-9b11-5bb9b89a6460 name: filtered_01.bgen md5sum: fad144852b7c9c929ea1a55b8481798c filesize: 9.0GB filetype: .bgen number_of_variants: 2155158 number_of_participants: 17444 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 - id: alspacdcs:c34645be-c813-42c9-9032-33b9ef6a4ec0 name: filtered_02.bgen md5sum: 91168a792595ee55375d6c72c881fa6c filesize: 9.1GB filetype: .bgen number_of_variants: 2346862 number_of_participants: 17444 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 - id: alspacdcs:7bacda1b-0ba1-4f81-9f92-6821d2cfd588 name: filtered_03.bgen md5sum: 6e898fe7aba1d39e832245267a9ec30e filesize: 7.6GB filetype: .bgen number_of_variants: 1966662 number_of_participants: 17444 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 - id: alspacdcs:c8518005-9401-48fb-80c8-5841611a1e17 name: filtered_04.bgen md5sum: c7ba39fbff7de19ffd98b93ff217108b filesize: 8.3GB filetype: .bgen number_of_variants: 1968171 number_of_participants: 17444 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 - id: alspacdcs:84e27681-1425-4d3a-b348-e5cacbf110cf name: filtered_05.bgen md5sum: 173056913dd6dc1684e9118907af1fd5 filesize: 6.8GB filetype: .bgen number_of_variants: 1808090 number_of_participants: 17444 belongs_to_container: 
alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 - id: alspacdcs:922442bb-33de-402c-ae5e-7268e642f05e name: filtered_06.bgen md5sum: b8296902cc14e29111b2caefbc52a00b filesize: 6.8GB filetype: .bgen number_of_variants: 1755859 number_of_participants: 17444 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 - id: alspacdcs:8cf64efd-266f-433b-8392-cf8eea0133b7 name: filtered_07.bgen md5sum: 3072cca6a05fdb782b858f70beed6e06 filesize: 7.1GB filetype: .bgen number_of_variants: 1599387 number_of_participants: 17444 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 - id: alspacdcs:12f65474-8d94-4cfe-a014-0ff3fa84bec2 name: filtered_08.bgen md5sum: c57b0cc8c3b47c8058e6f95ba742a89d filesize: 5.9GB filetype: .bgen number_of_variants: 1557429 number_of_participants: 17444 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 - id: alspacdcs:869352b1-9634-41ee-9851-69c3fb0e990a name: filtered_09.bgen md5sum: 0e0d21cb1dc4d276d0a4353cc7da0564 filesize: 5.0GB filetype: .bgen number_of_variants: 1187731 number_of_participants: 17444 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 - id: alspacdcs:e17bd046-20e1-434d-975e-09348fd69ffc name: filtered_10.bgen md5sum: e5f8a44f260c009a9fec7bdc105ead76 filesize: 5.4GB filetype: .bgen number_of_variants: 1361506 number_of_participants: 17444 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 - id: alspacdcs:160c5e26-45f8-47bf-9cec-c627d8912c5f name: filtered_11.bgen md5sum: 7c64c009aaf9fdb84c21b31f51e28bfa filesize: 5.3GB filetype: .bgen number_of_variants: 1356882 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 number_of_participants: 17444 - id: alspacdcs:9fc4fbc3-2b5d-4dd2-98b7-c7928e669bd7 name: filtered_12.bgen md5sum: 8f0d903ca1cf24ca0e45494bd0a1426c filesize: 5.3GB filetype: .bgen number_of_variants: 1314328 number_of_participants: 17444 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 - id: 
alspacdcs:12ea18df-3860-40ac-afbc-dbb8d7bfc61e name: filtered_13.bgen md5sum: e59348ea876d3f5c3b6331e738daa162 filesize: 3.9GB filetype: .bgen number_of_variants: 987740 number_of_participants: 17444 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 - id: alspacdcs:bead6d13-e811-44c3-8f77-8da121506d90 name: filtered_14.bgen md5sum: 3f80471a1e183e478ca3674482ed89e4 filesize: 3.9GB filetype: .bgen number_of_variants: 904351 number_of_participants: 17444 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 - id: alspacdcs:8dc54c1e-659e-4489-9f3a-48768f65a067 name: filtered_15.bgen md5sum: 2166a96fc0bbdc990b1bcb513f4372bd filesize: 3.7GB filetype: .bgen number_of_variants: 812545 number_of_participants: 17444 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 - id: alspacdcs:08129d30-8910-4ae4-bfdd-52f9c41af15d name: filtered_16.bgen md5sum: c44b1d287c79c69b2171c6822339cf4b filesize: 4.3GB filetype: .bgen number_of_variants: 865998 number_of_participants: 17444 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 - id: alspacdcs:e876631c-c38f-4105-a3ce-4e6f00ccba6d name: filtered_17.bgen md5sum: e4c50e9c54d4baa59d191a756d60b32e filesize: 3.8GB filetype: .bgen number_of_variants: 753174 number_of_participants: 17444 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 - id: alspacdcs:2b7f5674-4c73-4118-93d2-0e648d2306b6 name: filtered_18.bgen md5sum: fa893fede52923d5805f8583dbed51bd filesize: 3.4GB filetype: .bgen number_of_variants: 783010 number_of_participants: 17444 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 - id: alspacdcs:d277b81c-0fad-4397-bb58-c2843864e0db name: filtered_19.bgen md5sum: 999c860cfb0f3484d1a78ef639c594fa filesize: 3.9GB filetype: .bgen number_of_variants: 603516 number_of_participants: 17444 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 - id: alspacdcs:42613b8c-17fd-47a7-b78c-cdc08fb01e61 name: filtered_20.bgen 
md5sum: 59dd1ebbefb28c2b5818fb2aca9805de filesize: 2.7GB filetype: .bgen number_of_variants: 617694 number_of_participants: 17444 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 - id: alspacdcs:eb1acd03-d59d-482f-97d5-3c2e9e3f3311 name: filtered_21.bgen md5sum: dce2d85e4d08018ea365afdeac561447 filesize: 1.9GB filetype: .bgen number_of_variants: 377554 number_of_participants: 17444 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 - id: alspacdcs:b4518de2-7c28-4536-9164-d67aa7d97c28 name: filtered_22.bgen md5sum: b5ba868e802d8eee4ac76b0f878d427c filesize: 2.0GB filetype: .bgen number_of_variants: 365644 number_of_participants: 17444 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505 - id: alspacdcs:e1e6958c-0852-4173-9e98-ee7dd50f5ad3 name: filtered_23.bgen md5sum: 512a78f6c379ce43e827da44a91b4c5f filesize: 5.9GB filetype: .bgen number_of_variants: 1250218 number_of_participants: 17444 belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505
3.6 Genome-wide - TOPMed round 2 imputed - G0 mothers + G1 (gi_topmed_g0m_g1)
SNP chips are useful for the generation of data on hundreds of thousands of SNPs, but there are millions more polymorphisms that remain untyped with this technology. If suitable numbers of whole genome sequences exist (e.g. 1000 genomes data) then millions of genotypes that are missing from a sample because they have not been typed by SNP chips can be imputed using probabilistic methods. Here the ALSPAC mother and children data were imputed to a new reference panel known as the Haplotype Reference Consortium (HRC) panel. This comprises around 31000 sequenced individuals (mostly European), so the coverage of European haplotypes is much greater than in other panels. As a consequence imputation accuracy is expected to improve, particularly at lower frequencies.
3.6.1 Description
This dataset contains genotype data imputed to TOPMed round 2 for G0 mothers and G1. Reference genome build: GRCh38
3.6.2 Methodology
ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platforms by 23andMe, subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%) and insufficient sample replication (IBD < 0.8).
Population stratification was assessed by multidimensional scaling analysis and compared with HapMap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed.
SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed. Cryptic relatedness was measured as proportion of identity by descent (IBD > 0.1).
Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,115 subjects and 500,527 SNPs passed these quality control filters.
ALSPAC mothers were genotyped using the Illumina human660W-quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs. SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally SNPs with a minor allele frequency of less than 1% were removed.
Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded.
Cryptic relatedness was assessed using an IBD estimate of more than 0.125, which is expected to correspond to roughly 12.5% of alleles shared IBD or relatedness at the first-cousin level. Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,048 subjects and 526,688 SNPs passed these quality control filters.
We combined 477,482 SNP genotypes in common between the sample of mothers and sample of children. We removed SNPs with genotype missingness above 1% due to poor quality (11,396 SNPs removed) and removed a further 321 subjects due to potential ID mismatches. This resulted in a dataset of 17,842 subjects containing 6,305 duos and 465,740 SNPs (112 were removed during liftOver and 234 were out of HWE after combination).
Individuals in this dataset who have withdrawn from the project were removed before proceeding with imputation-specific quality control. This left 17450 individuals.
The combined mothers and children genotype panel was filtered using PLINK 2.0 to remove SNPs with MAF below 0.01 or missing call rates exceeding 0.01. The joint set of SNPs was checked for palindromic SNPs but none were present. The combined call set was lifted over from GRCh37 to GRCh38 using UCSC liftOver.
The dataset was later filtered to SNPs with a HWE P value above 1e-6, leaving 455150 SNPs. The combined autosomal call set was then converted to VCF files, before being uploaded to the TOPMed imputation server to flag variants requiring a strand fix. Any SNPs flagged with an issue were corrected or filtered out using PLINK 2.0. 454248 SNPs remained within the autosomes.
Phasing and imputation were conducted on the Michigan TOPMed imputation server (v1.7.4) in October 2023. Phasing was done using Eagle (v2.4). Imputation was done with minimac4 (v1.0.2) to TOPMed R2. An R squared filter of 0.3 was applied.
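A palindromic (strand-ambiguous) SNP is one whose two alleles are complements of each other, that is A/T or C/G, so the strand cannot be resolved from the alleles alone. The Python sketch below shows one way such a check could be written; it is an illustration with a hypothetical function name, not the code used for this dataset.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindromic(allele1, allele2):
    """Return True for A/T or C/G SNPs, whose alleles read the same on both strands."""
    a1, a2 = allele1.upper(), allele2.upper()
    # Multi-base or non-ACGT alleles have no entry in COMPLEMENT and return False.
    return COMPLEMENT.get(a1) == a2


snps = [("A", "T"), ("C", "G"), ("A", "G"), ("T", "C")]
print([pair for pair in snps if is_palindromic(*pair)])
```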
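As a rough illustration of what an R squared filter does, the Python sketch below keeps only variants whose imputation R squared meets the threshold. The record layout (`id`, `rsq`) and the example values are hypothetical, and it assumes that variants at or above 0.3 are retained.

```python
def filter_by_rsq(records, threshold=0.3):
    """Keep variant records whose imputation R squared is at or above the threshold."""
    return [rec for rec in records if rec["rsq"] >= threshold]


# Made-up records for illustration only.
variants = [
    {"id": "variant_a", "rsq": 0.95},
    {"id": "variant_b", "rsq": 0.12},
    {"id": "variant_c", "rsq": 0.30},
]
kept = filter_by_rsq(variants)
print([rec["id"] for rec in kept])
```

Analyses that need higher-confidence genotypes can apply a stricter threshold on top of the 0.3 filter already present in the released data.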
3.6.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:gi_topmed_g0m_g1_2024-12-19_f4 name: >- Genome-wide - TOPmed imputed - G0 mothers + G1 version 2024-12-19 freeze 5 description: >- Freeze 5 of version 2024-12-19 Genome-wide array data imputed to the TOPmed round 2 reference panel for G0 mothers and G1 individuals in bgen and sample file format (version 1.2). freeze_size: 161G linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9 woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473 all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a git_tag: https://github.com/alspac/dataset_gi_topmed_g0m_g1/releases/tag/freeze5 is_current_freeze: true freeze_number: 5 freeze_date: 2025-02-27 previous_freeze: freeze_of_alspac_dataset_version: alspacdcs:gi_topmed_g0m_g1_2024-12-19 freeze_of_named_alspac_dataset: alspacdcs:gi_topmed_g0m_g1 has_containers: - id: alspacdcs:7bffd114-d042-4ec3-9c78-3e7fa8c7d8fd ## uuid name: data description: A dir/folder containing the freeze data bgen and .sample files has_parts: - id: alspacdcs:8614a15d-8936-4b4e-ae34-a855ac8d1810 name: Omics ID sample description: >- The samples in the data. To be used with the genetic data. A plain text .sample file. See https://doi.org/10.1101/308296 for file format details. data_distributions: - id: alspacdcs:ab0afa36-1c8d-45f0-9b0f-946ee1c56dae name: freeze.sample md5sum: 6523f64b382a44d4354a3be8bd5205e3 filesize: 954.0KB filetype: .sample number_of_participants: 17444 - id: alspacdcs:1f676e6d-f136-43d6-bb9c-381758a833f3 name: Data bgen files description: >- An Oxford Bgen (v1.2) file for all chromosomes. To be used with sample file. See https://doi.org/10.1101/308296 for file format details. 
data_distributions: - id: alspacdcs:f430e13a-3a06-4b86-8a46-e98648075a9f name: chr1_freeze.bgen md5sum: 21b6a08b8d2e90004b3f54bb06e9443b filesize: 8.0GB filetype: .bgen number_of_variants: 5665189 number_of_participants: 17444 - id: alspacdcs:c8d1d13d-1e73-4dcd-a583-afbd5e9329f9 name: chr2_freeze.bgen md5sum: d6e8ac3bcda8f42f3294e6a80160e25f filesize: 8.3GB filetype: .bgen number_of_variants: 6104056 number_of_participants: 17444 - id: alspacdcs:d6155947-6f27-4cc6-bafc-2cd373fb8703 name: chr3_freeze.bgen md5sum: 593e63c2b01b85d971bba521f1dd6dcc filesize: 7.0GB filetype: .bgen number_of_variants: 5039584 number_of_participants: 17444 - id: alspacdcs:f5e7a8fa-7019-4a83-a94a-a14191f98ba4 name: chr4_freeze.bgen md5sum: 60a2a1fa8d779535d1b471816b5d198a filesize: 7.5GB filetype: .bgen number_of_variants: 4910014 number_of_participants: 17444 - id: alspacdcs:b85035e4-be1f-44e8-90a3-5919be154e42 name: chr5_freeze.bgen md5sum: 4bcd99f640ceb04a235f65d4361c9b2c filesize: 6.4GB filetype: .bgen number_of_variants: 4540467 number_of_participants: 17444 - id: alspacdcs:797604c0-3a6d-47e3-b983-15384995a24e name: chr6_freeze.bgen md5sum: 658372add4922a417d15dd794a7c4cf6 filesize: 6.1GB filetype: .bgen number_of_variants: 4341095 number_of_participants: 17444 - id: alspacdcs:a573e85f-8dbf-4182-917c-c28cd8a7ddc7 name: chr7_freeze.bgen md5sum: 08e4ab75c11d694859262bcdfa7c28a6 filesize: 6.1GB filetype: .bgen number_of_variants: 4083826 number_of_participants: 17444 - id: alspacdcs:e85a6acb-d587-423e-87e4-90f697e7a390 name: chr8_freeze.bgen md5sum: 7da68399716feee6d58670c380ce6136 filesize: 5.4GB filetype: .bgen number_of_variants: 3923042 number_of_participants: 17444 - id: alspacdcs:4485a925-332b-457a-98ae-fb8f83f24cfc name: chr9_freeze.bgen md5sum: 242c4e477c3a38dfe35060deca254c02 filesize: 4.3GB filetype: .bgen number_of_variants: 3121200 number_of_participants: 17444 - id: alspacdcs:63fae86d-44ca-4777-9c03-40f26c8c3489 name: chr10_freeze.bgen md5sum: 
f1b1a42a77534ea92368abb64eb04750 filesize: 5.0GB filetype: .bgen number_of_variants: 3462260 number_of_participants: 17444 - id: alspacdcs:c9c116f6-bb4f-4f3c-bede-6b7b6fa6c4be name: chr11_freeze.bgen md5sum: 142cbe60c9975ef6678884ae3f2f8b3c filesize: 5.0GB filetype: .bgen number_of_variants: 3500176 number_of_participants: 17444 - id: alspacdcs:1ea4e961-cdee-4a7a-bad3-5b493e964c6d name: chr12_freeze.bgen md5sum: f929e2bc815169f2195d573638c0bd79 filesize: 4.8GB filetype: .bgen number_of_variants: 3380589 number_of_participants: 17444 - id: alspacdcs:ff104d34-f95d-4afc-8a4a-a0558fa70f01 name: chr13_freeze.bgen md5sum: b31c183805b7c905f10db4df575b4e09 filesize: 3.7GB filetype: .bgen number_of_variants: 2529048 number_of_participants: 17444 - id: alspacdcs:5232e411-30dd-4386-a5f9-8b1160078387 name: chr14_freeze.bgen md5sum: 65f2e357ed7dfa431f3354c408c0c8a5 filesize: 3.2GB filetype: .bgen number_of_variants: 2255877 number_of_participants: 17444 - id: alspacdcs:74cb2f8b-ab30-4fb3-9035-7e21764aa28b name: chr15_freeze.bgen md5sum: 372f6b19cb1a3ce5c4d8b535740baef1 filesize: 3.0GB filetype: .bgen number_of_variants: 2071294 number_of_participants: 17444 - id: alspacdcs:89998bcd-1e59-4d8f-82e3-53d14bce4de3 name: chr16_freeze.bgen md5sum: c4d8a9a9549afe5bc0c301b593e2c1fd filesize: 3.4GB filetype: .bgen number_of_variants: 2273274 number_of_participants: 17444 - id: alspacdcs:e0a934a1-6949-469b-aedf-370b556a96cd name: chr17_freeze.bgen md5sum: 89d4a99fff839df51bcdac1fadc88041 filesize: 3.2GB filetype: .bgen number_of_variants: 2040685 number_of_participants: 17444 - id: alspacdcs:5df288f2-8113-4967-82b2-ded3eb04c2ca name: chr18_freeze.bgen md5sum: 76e731d7340944547712a9b1eaa88aea filesize: 3.0GB filetype: .bgen number_of_variants: 1994769 number_of_participants: 17444 - id: alspacdcs:61f213e6-c14b-4ad1-b35a-0e2c0cfdc40c name: chr19_freeze.bgen md5sum: 7bdc3bf8ded14faf8e7098a127e9e031 filesize: 2.8GB filetype: .bgen number_of_variants: 1605223 number_of_participants: 17444 - 
id: alspacdcs:b04fd601-4745-4b05-8c27-56c556d08eb4 name: chr20_freeze.bgen md5sum: d38b47cfec00c0c979560884e6566803 filesize: 2.4GB filetype: .bgen number_of_variants: 1615112 number_of_participants: 17444 - id: alspacdcs:1e394ca7-9d52-43a7-806d-e9fd4179cb88 name: chr21_freeze.bgen md5sum: 586cb702158f0549fb20fd7703ba53cc filesize: 1.5GB filetype: .bgen number_of_variants: 935142 number_of_participants: 17444 - id: alspacdcs:571b477b-61d1-49c2-b8d1-c652679b11d8 name: chr22_freeze.bgen md5sum: 578d7032a52f458612ed6b337f724e54 filesize: 1.7GB filetype: .bgen number_of_variants: 1002345 number_of_participants: 17444
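The md5sum values listed in a freeze doc can be used to confirm that files were transferred intact. The sketch below is one way to do this with the Python standard library. The directory name "data" and the helper names are assumptions for illustration; the two checksums are copied from the freeze doc above.

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(expected, directory="."):
    """Compare each file's MD5 with the value listed in the freeze doc.

    `expected` maps file name -> md5sum.
    Returns a dict of file name -> "ok", "mismatch" or "missing".
    """
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5sum(path) == checksum.lower():
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # Two entries copied from the freeze doc; extend with the remaining files.
    expected = {
        "freeze.sample": "6523f64b382a44d4354a3be8bd5205e3",
        "chr22_freeze.bgen": "578d7032a52f458612ed6b337f724e54",
    }
    # "data" is a hypothetical local directory holding the freeze files.
    for name, status in verify_files(expected, "data").items():
        print(f"{name}: {status}")
```

Reading in chunks matters here because individual bgen files are several gigabytes in size.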
4 Sequence Data
4.1 Whole genome sequencing - G1 (wgs_hiseq_g1)
4.1.1 Description
This dataset contains whole genome sequencing for G1 individuals, part of the UK10K dataset. Reference genome build: GRCh37
4.1.2 Methodology
ALSPAC and TwinsUK cohorts were sequenced at an average read depth of 6.7x through the UK10K program (http://www.uk10k.org) using the Illumina HiSeq platform, and aligned to the GRCh37 human reference using BWA. SNV calls were made using samtools/bcftools, and VQSR and GATK were then used to recall them.
Associated publication:
Please ensure you have permission to access this data (http://www.uk10k.org/data_access.html) before using it.
4.1.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:wgs_hiseq_g1_2016-08-18_f5 name: Whole genome sequencing - G1 version 2016-08-18 freeze 5 description: >- This is the freeze 5 of version 2016-08-18 of the Whole genome sequencing for G1 individuals, part of the UK10K dataset. freeze_size: 341G linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9 woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473 all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a git_tag: https://github.com/alspac/dataset_wgs_hiseq_g1/releases/tag/freeze5 is_current_freeze: true freeze_number: 5 freeze_date: 2025-02-27 previous_freeze: alspacdcs:wgs_hiseq_g1_2016-08-18_f4 freeze_of_alspac_dataset_version: alspacdcs:wgs_hiseq_g1_2016-08-18 freeze_of_named_alspac_dataset: alspacdcs:wgs_hiseq_g1 has_containers: - id: alspacdcs:ec72d464-46e6-4059-b7dd-7b0f68739ddb ## uuid name: data description: A dir/folder containing the freeze data files has_parts: - id: alspacdcs:1319d16a-a9e8-4fb7-b4ee-a02a4345d98d name: compressed vcf files description: >- Compressed vcf file containing all participants for each chromosome.. To be used with corresponding index file (in format of chr1_data.vcf.gz.csi). 
data_distributions: - id: alspacdcs:fbbe04e8-c67f-4177-9b3e-d852b20729e5 name: 1_freeze.vcf.gz md5sum: d33e15c386ae29f5c3be6e75427f8b3b filesize: 26.3GB filetype: .gz number_of_variants: 3406915 number_of_participants: 1865 - id: alspacdcs:771837f1-76be-4a4b-b6e6-4d6e00db905b name: 2_freeze.vcf.gz md5sum: 2d4f58a1b75aa9502c8f317d5497219c filesize: 28.8GB filetype: .gz number_of_variants: 3749277 number_of_participants: 1865 - id: alspacdcs:97d1da53-68e4-472b-966f-3825b0b1f3de name: 3_freeze.vcf.gz md5sum: 787c1476921621d3f6d57237c55366df filesize: 24.2GB filetype: .gz number_of_variants: 3147254 number_of_participants: 1865 - id: alspacdcs:6e707b26-d8f6-4353-8a91-556f2fd75644 name: 4_freeze.vcf.gz md5sum: 7834894c689f08444b791562a976ea5e filesize: 23.2GB filetype: .gz number_of_variants: 3019176 number_of_participants: 1865 - id: alspacdcs:c0655148-9acb-4df8-a114-ea1a4a00e8dc name: 5_freeze.vcf.gz md5sum: 3a92aa1fe872807df22a81dfa060983b filesize: 21.6GB filetype: .gz number_of_variants: 2804359 number_of_participants: 1865 - id: alspacdcs:9b8a43b6-1097-451c-965c-0d751bc9cb9e name: 6_freeze.vcf.gz md5sum: b85e5b5182a70bf720b32acf42331723 filesize: 21.0GB filetype: .gz number_of_variants: 2704091 number_of_participants: 1865 - id: alspacdcs:aabd8528-a20e-4735-93ff-2528eb7485d4 name: 7_freeze.vcf.gz md5sum: 9e62ee4c2ef3f1872bbebe5c25c974ac filesize: 19.0GB filetype: .gz number_of_variants: 2445204 number_of_participants: 1865 - id: alspacdcs:219b3731-47b5-4dc3-821f-f87921e14443 name: 8_freeze.vcf.gz md5sum: 15a74e25f9fcf4e3cb0c3ade8e4ea523 filesize: 18.8GB filetype: .gz number_of_variants: 2451009 number_of_participants: 1865 - id: alspacdcs:f10bafdd-56d9-4dd4-9fb1-03d4664d4925 name: 9_freeze.vcf.gz md5sum: b76caf23e115f32f268eccc63f89befc filesize: 14.2GB filetype: .gz number_of_variants: 1845456 number_of_participants: 1865 - id: alspacdcs:9a9c8dd0-4620-4946-a970-34133bce0cff name: 10_freeze.vcf.gz md5sum: e5aec1e24bf2b1708db803093717fe86 filesize: 16.3GB filetype: 
.gz number_of_variants: 2110436 number_of_participants: 1865 - id: alspacdcs:6b62073d-ab80-45f7-b684-d9ee20dc2803 name: 11_freeze.vcf.gz md5sum: 40245fa0ca954109bd3d72b9258a5604 filesize: 16.4GB filetype: .gz number_of_variants: 2125064 number_of_participants: 1865 - id: alspacdcs:6bb40956-ab1e-474c-8d36-6c2b3ad3e11d name: 12_freeze.vcf.gz md5sum: fc35ea6f6c4eac159d355756c1fa1e99 filesize: 15.7GB filetype: .gz number_of_variants: 2047922 number_of_participants: 1865 - id: alspacdcs:36f180c3-b4ad-496e-a210-3d31104f5abb name: 13_freeze.vcf.gz md5sum: fbaa1857b2337a453977604691dda40a filesize: 11.8GB filetype: .gz number_of_variants: 1527053 number_of_participants: 1865 - id: alspacdcs:fd76156d-de27-4096-a40d-f3bdf2191dfb name: 14_freeze.vcf.gz md5sum: b91c4b551aad8ddff969f325837ae391 filesize: 10.7GB filetype: .gz number_of_variants: 1403580 number_of_participants: 1865 - id: alspacdcs:2298d12f-31ac-402e-bcf9-bbda7a0cfdb0 name: 15_freeze.vcf.gz md5sum: cdb06fde51346d76533ab100e9f9d497 filesize: 9.7GB filetype: .gz number_of_variants: 1262404 number_of_participants: 1865 - id: alspacdcs:040624dc-7f50-4fb9-93ec-dca0c644779a name: 16_freeze.vcf.gz md5sum: 7276bd89d5c658e11a4abddd64ce0e50 filesize: 10.6GB filetype: .gz number_of_variants: 1373607 number_of_participants: 1865 - id: alspacdcs:de04c691-4a2e-4336-9c79-4b0202fee2d0 name: 17_freeze.vcf.gz md5sum: fe4b05ae5ef0fc510623bf5e54c1e1b2 filesize: 9.1GB filetype: .gz number_of_variants: 1177884 number_of_participants: 1865 - id: alspacdcs:aee7c3cf-b1e1-4128-9f94-0881d61618d7 name: 18_freeze.vcf.gz md5sum: c1f0e9e06f78f9c1a00531692a3d2cd0 filesize: 9.4GB filetype: .gz number_of_variants: 1220427 number_of_participants: 1865 - id: alspacdcs:3e5c35f5-6ea7-4a53-b7f7-8266d87899c4 name: 19_freeze.vcf.gz md5sum: 8430d6bf3230feb3136069180b250055 filesize: 7.0GB filetype: .gz number_of_variants: 886630 number_of_participants: 1865 - id: alspacdcs:a7d924eb-88db-4e9f-b16f-42bea3f6b821 name: 20_freeze.vcf.gz md5sum: 
081da0fbcfd89c1fbcd403d60a83e400 filesize: 7.5GB filetype: .gz number_of_variants: 970869 number_of_participants: 1865 - id: alspacdcs:7debb6d3-0e95-408a-b587-ce61f2cf2785 name: 21_freeze.vcf.gz md5sum: 1f0c7f8dffd9e7540c1fa695c7940fe8 filesize: 4.3GB filetype: .gz number_of_variants: 563988 number_of_participants: 1865 - id: alspacdcs:ad3618eb-cdf4-4740-8fad-ffda5e9c2fa2 name: 22_freeze.vcf.gz md5sum: 2676aaa6b442dfe0cde83fe15ccfa95b filesize: 4.4GB filetype: .gz number_of_variants: 552675 number_of_participants: 1865 - id: alspacdcs:57430b22-f463-453e-9e31-df7d921c02af name: X_freeze.vcf.gz md5sum: 1695e4907cd419d93933f7703b56850b filesize: 10.5GB filetype: .gz number_of_variants: 1700742 number_of_participants: 1865 - id: alspacdcs:a3afc031-0157-4a1a-9325-963407437cde name: vcf index files description: >- vcf index file allowing for faster use of compressed vcf counterpart. To be used with corresponding vcf file (in format of chr1_data.vcf.gz.csi). data_distributions: - id: alspacdcs:3fbcb888-5de0-456c-86cc-7362065efede name: 1_freeze.vcf.gz.csi md5sum: 6d9e416a4c43c723ba97d72c7405849c filesize: 145.6KB filetype: .csi - id: alspacdcs:2b0b8799-9d40-428e-843c-0044f15c5358 name: 2_freeze.vcf.gz.csi md5sum: b21f248b785fcf0db92f72a3c3c66b2f filesize: 156.1KB filetype: .csi - id: alspacdcs:6bc095b9-aca3-4a22-ad14-4f9d6b490056 name: 3_freeze.vcf.gz.csi md5sum: 3720daf1b4726d6904783c61f5234c6d filesize: 127.9KB filetype: .csi - id: alspacdcs:db608415-e6a1-45e9-97de-8f35129759ae name: 4_freeze.vcf.gz.csi md5sum: a0bb677911ee282e6526a881b2a98916 filesize: 122.6KB filetype: .csi - id: alspacdcs:df8eb7f7-3e05-4b19-b733-fe1edb99de99 name: 5_freeze.vcf.gz.csi md5sum: 8b00b378e1375f701f9d4d310009d49a filesize: 116.1KB filetype: .csi - id: alspacdcs:754e0b2a-b593-4bb9-9b6f-51dc3cd07e2b name: 6_freeze.vcf.gz.csi md5sum: 3266612ae5cc6605f28f72e741e92d57 filesize: 109.8KB filetype: .csi - id: alspacdcs:5c9ab5b9-89bb-4238-a757-3279148252d9 name: 7_freeze.vcf.gz.csi md5sum: 
706ac014ea4d9c76e87faeccb739aea3 filesize: 101.8KB filetype: .csi - id: alspacdcs:1bf17d44-e29f-491c-98c4-ca0fafcc8c25 name: 8_freeze.vcf.gz.csi md5sum: c1289344eec48a51e1096378312eda79 filesize: 92.8KB filetype: .csi - id: alspacdcs:2c1ff40a-9d4b-4c27-af92-600966a9cd95 name: 9_freeze.vcf.gz.csi md5sum: a1704c7204fd3e9656ff7bfae73a9a4a filesize: 75.4KB filetype: .csi - id: alspacdcs:e3a5b7e9-4892-4286-b616-4fe3444d2a2b name: 10_freeze.vcf.gz.csi md5sum: ee5b0a7f2220f00c4e032a5ddf35e510 filesize: 85.5KB filetype: .csi - id: alspacdcs:71068481-76e6-4baa-b8ab-a6b83e32b053 name: 11_freeze.vcf.gz.csi md5sum: 35acd0fd59f3d23cdc20838a7379eb3e filesize: 85.2KB filetype: .csi - id: alspacdcs:87843472-0227-46ae-bd8b-cb271a9770fe name: 12_freeze.vcf.gz.csi md5sum: 5cf94f16cf009e8cfb7501b7324f17bc filesize: 85.4KB filetype: .csi - id: alspacdcs:9406ac0b-f2b0-4bff-9789-67af3e5f4dfb name: 13_freeze.vcf.gz.csi md5sum: c30148a951069b2b0bc4421a74f0bf62 filesize: 62.1KB filetype: .csi - id: alspacdcs:7983ca05-0103-4898-8d96-d1ff0f9b2594 name: 14_freeze.vcf.gz.csi md5sum: e8555b348c8074117a963554ed0b1dc5 filesize: 56.7KB filetype: .csi - id: alspacdcs:6037341b-7f8f-4c69-823c-4db7ce90e747 name: 15_freeze.vcf.gz.csi md5sum: 26a4b2633ea20e1ca13c4f23a40d7583 filesize: 51.6KB filetype: .csi - id: alspacdcs:e8775539-a045-4f69-b842-c6b27be94d58 name: 16_freeze.vcf.gz.csi md5sum: 0becfa273182ab2a2d238bf4130ae991 filesize: 50.4KB filetype: .csi - id: alspacdcs:d75b124b-b063-44a0-a0ac-dc942934c0bd name: 17_freeze.vcf.gz.csi md5sum: f7b852a30bf4fd2a6c215ff2e588ef06 filesize: 49.9KB filetype: .csi - id: alspacdcs:c2065c84-cf80-4f8b-8ffe-29c6262452a1 name: 18_freeze.vcf.gz.csi md5sum: 63fe7327c4d2933e6180bdce4823b7cd filesize: 48.4KB filetype: .csi - id: alspacdcs:4cc73c9d-264d-4b77-9ad6-541f63043f72 name: 19_freeze.vcf.gz.csi md5sum: 727745e1e9bdcc63b5d9f236e8c354e5 filesize: 35.7KB filetype: .csi - id: alspacdcs:9d6dd993-d2f8-4858-8ae9-4e6c5cd8b7a9 name: 20_freeze.vcf.gz.csi md5sum: 
79e681e770aa992abc51b6be6ee98736 filesize: 38.2KB filetype: .csi - id: alspacdcs:8936b2bd-5b44-4048-9350-6a04df757cb7 name: 21_freeze.vcf.gz.csi md5sum: ee5adcbdec0505621ec1bd6ca2390c4a filesize: 22.1KB filetype: .csi - id: alspacdcs:4501c4ec-bf71-456c-85e5-bed669a4f993 name: 22_freeze.vcf.gz.csi md5sum: cd89ef1f49a81c0d7dd27d91f87000fc filesize: 22.1KB filetype: .csi - id: alspacdcs:2ebe499d-4a01-4784-89a0-f3d4709c0d19 name: X_freeze.vcf.gz.csi md5sum: d50db7c315c45db319ad7f7a6176d326 filesize: 96.0KB filetype: .csi
4.2 Whole exome sequencing - G0 & G1 (wes_novaseq_g0_g1)
4.2.1 Description
This dataset contains whole exome sequencing for G0 and G1 individuals. It was generated at the Sanger Institute as part of an initiative sequencing multiple birth cohorts: ALSPAC, MCS and BiB. As part of this initiative, the exome sequencing data will also be available via EGA, but researchers will still gain access through ALSPAC's project approval system. Reference genome build: GRCh38
4.2.2 Methodology
Exome sequencing was conducted on DNA for 12,374 participants (8,605 children and 3,389 of their parents) at the Sanger Institute, using Illumina NovaSeq. Reads were aligned to GRCh38 with BWA-MEM. There was an average on-target depth of ~62X for ALSPAC.
QC was conducted on the dataset at the Sanger Institute; details are given in the associated publication (Koko et al., 2024). Sample QC was done before variant calling (base-calls after sequencing, alignment quality, CRAM file quality) and after variant calling (PCA analysis, comparison to array data, relatedness). Integrated variant QC removed potentially false positive variants using a trained random forest model. Genotype QC removed low quality individual genotype calls.
Single nucleotide variant (SNV) and small insertion/deletion (indel) calling was conducted with GATK HaplotypeCaller, GenomicsDBImport and GenotypeGVCFs (GATK version 4.2.4.0 for ALSPAC) following GATK best practices (Van der Auwera and O'Connor, 2020).
Twelve individuals in the dataset were flagged as sex mismatches based on the X chromosome F statistic. When the Y coverage of these individuals was examined, 3 were clear mismatches on both the X F statistic and Y depth, while 9 were mismatches on the X F statistic only. The 3 individuals with clear mismatches on both statistics were removed from the dataset, while the other 9 were retained.
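The removal rule described above can be sketched as a small decision function. This is an illustration only: the two boolean inputs stand for the outcome of the X F statistic and Y depth checks, whose thresholds are not stated here, and the function name and return labels are hypothetical.

```python
from collections import Counter


def sex_check_decision(x_f_mismatch, y_depth_mismatch):
    """Return the action for one sample given two boolean sex-check results.

    Samples are flagged on the X F statistic; only those that also
    mismatch on Y depth are removed.
    """
    if x_f_mismatch and y_depth_mismatch:
        return "remove"
    if x_f_mismatch:
        return "retain_flagged"
    return "retain"


if __name__ == "__main__":
    # The 12 flagged individuals: 3 mismatch on both statistics, 9 on X F only.
    flagged = [(True, True)] * 3 + [(True, False)] * 9
    print(Counter(sex_check_decision(x, y) for x, y in flagged))
```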
Associated publication:
- doi.org/10.12688/wellcomeopenres.22697.1
4.2.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_f5 name: >- Whole Exome Sequencing - Novaseq - G0 & G1 version 2024-09-20 freeze 5 description: >- This is first iteration of wes_novaseq_g0_g1, first introduced in freeze 5. It contains data in vcf 4.2 format. It contains the majority of the G1 cohort (n=~8296), accompanied by G0 mothers (n=~1642) and partners (n=~1630) to create trios. Participants who have withdrawn their consent are removed and an omics ID applied according to the freeze. Over time the participants are able to withdraw their consent and will be removed from the dataset, so the number of available individuals can reduce as time progresses. This exome sequencing (ES) data was conducted at the Sanger institute and was part of an effort to ES ALSPAC, MCS and BiB. All ES data was quality controlled at the Sanger institute prior to this ALSPAC release and has been extensively document in the relevant publication (see below). In brief (exert from associated publication, Koko et al., 2024): "Sample QC: * Before variant calling: Samples were removed if they failed one or more filters based on quality of base-calls after sequencing, or quality of the CRAM files of aligned reads. The remainder then underwent variant calling. * After variant calling: We assigned individuals to populations using principal component analysis (PCA), then identified and removed individuals who were outliers on one or more variant-based metrics within each of the populations. We compared the exome data to genotyping array data from the same samples and removed samples that did not match as expected, since these could be sample mix-ups. The samples were also checked for unexpected relatedness; samples showing conflicts between reported and inferred relatedness were removed. 
This sample QC was split in two separate steps, before and after variant and genotype QC, as detailed in the coming sections. Integrated variant and genotype QC: * Variant QC: We removed candidate variants which may not be real, instead being artefacts or mapping errors, using a trained random forest model to distinguish likely true positives from likely false positives. * Genotype QC: We removed low-quality individual genotype calls from the dataset. This was done in conjunction with variant QC, as we will explain below." for extended information such as thresholds please find within the publication. Associated publication: Koko et al., 2024 DOI: https://doi.org/10.12688/wellcomeopenres.22697.2 freeze_size: 167G linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9 woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473 all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a git_tag: https://github.com/alspac/dataset_wes_novaseq_g0_g1/releases/tag/freeze5 is_current_freeze: true freeze_number: 5 freeze_date: 2025-02-27 previous_freeze: alspacdcs:wes_novaseq_g0_g1_2024-09-20_f4 freeze_of_alspac_dataset_version: alspacdcs:wes_novaseq_g0_g1_2024-09-20 freeze_of_named_alspac_dataset: alspacdcs:wes_novaseq_g0_g1 has_parts: - id: alspacdcs:ae68d502-5a2d-4280-9c4c-a3598957eb27 name: compressed vcf files description: >- Compressed vcf file containing all participants for each chromosome. Generated using bcftools v1.19. To be used with corresponding index file (in format of chr1_data.vcf.gz.csi). 
data_distributions: - id: alspacdcs:1393a335-9614-462a-8171-cacce83d9228 name: chr1_data.vcf.gz md5sum: 991ae5adc49999b67ad2a5eeb947c415 filesize: 16.3GB filetype: .gz number_of_variants: 370645 number_of_participants: 11500 - id: alspacdcs:4a580346-e4c1-45b6-90eb-01e909368fdf name: chr2_data.vcf.gz md5sum: acee28fa07f40a4d65bdebfb496ebd78 filesize: 11.8GB filetype: .gz number_of_variants: 272150 number_of_participants: 11500 - id: alspacdcs:783a3911-16f3-44b4-a083-e677ca8701d7 name: chr3_data.vcf.gz md5sum: c30a14536c4927e2fe0b6029dc913711 filesize: 9.1GB filetype: .gz number_of_variants: 206875 number_of_participants: 11500 - id: alspacdcs:5d2e9095-2f4c-48f8-9469-4d0a2db6feee name: chr4_data.vcf.gz md5sum: 0e03787179147bfe52995addcc0b4bb3 filesize: 6.1GB filetype: .gz number_of_variants: 140675 number_of_participants: 11500 - id: alspacdcs:c6618b13-3ca1-4b6f-9bd8-3c035f1711b5 name: chr5_data.vcf.gz md5sum: 58c9ccd5d3cb94baa71c6d0948ab94a0 filesize: 7.0GB filetype: .gz number_of_variants: 161010 number_of_participants: 11500 - id: alspacdcs:df0d7ea2-874d-4cf8-a777-4c85d1b9fda6 name: chr6_data.vcf.gz md5sum: 9483605e1c3baa66b46f9e3d0137466d filesize: 8.0GB filetype: .gz number_of_variants: 181754 number_of_participants: 11500 - id: alspacdcs:481c4b01-5a75-49f5-8b14-e3dc91b7df58 name: chr7_data.vcf.gz md5sum: 9823da37fd8d1bcb70710f0f12a85cae filesize: 8.1GB filetype: .gz number_of_variants: 181925 number_of_participants: 11500 - id: alspacdcs:08dc63cd-df57-4cd7-9f9b-da9aa29fd06d name: chr8_data.vcf.gz md5sum: daca7b2c5d14c28267d477516355fe5b filesize: 5.9GB filetype: .gz number_of_variants: 133894 number_of_participants: 11500 - id: alspacdcs:7b60eb90-7159-44e0-a0a2-eb339f9a3e1e name: chr9_data.vcf.gz md5sum: 06854c04f90b3f4e358bb60181042161 filesize: 7.1GB filetype: .gz number_of_variants: 161039 number_of_participants: 11500 - id: alspacdcs:2feaab6d-9ef6-48be-804f-623cd58c7b45 name: chr10_data.vcf.gz md5sum: 2145f8e9ed3bed5831b0921ea65e2e11 filesize: 6.5GB 
filetype: .gz number_of_variants: 149730 number_of_participants: 11505 - id: alspacdcs:36f15a4c-7461-4f5c-845a-540d78969bb5 name: chr11_data.vcf.gz md5sum: 509b385b9195da7fe93355536ae49450 filesize: 10.2GB filetype: .gz number_of_variants: 227858 number_of_participants: 11500 - id: alspacdcs:27d17b38-d51e-49ef-8e98-9f6b1ecee217 name: chr12_data.vcf.gz md5sum: d983ad6ffefc50e7c163edabd6157b4b filesize: 8.5GB filetype: .gz number_of_variants: 193518 number_of_participants: 11500 - id: alspacdcs:6a4ddadd-bfa2-47ec-9a91-9ab5eafc49e5 name: chr13_data.vcf.gz md5sum: b2f5304ba781bfc2604783ecc8dd8a3a filesize: 2.8GB filetype: .gz number_of_variants: 63931 number_of_participants: 11500 - id: alspacdcs:a80774c1-3374-469c-8d4a-727735eb114e name: chr14_data.vcf.gz md5sum: c4ee5f555d11c9f70282340417e78294 filesize: 5.7GB filetype: .gz number_of_variants: 128137 number_of_participants: 11500 - id: alspacdcs:d124d290-33f6-4c7b-905a-badffbd0d824 name: chr15_data.vcf.gz md5sum: 42a6c9f94f26927f428c06498def17d7 filesize: 5.6GB filetype: .gz number_of_variants: 127646 number_of_participants: 11500 - id: alspacdcs:015e0d1d-2bed-4fe5-83e2-f7841be0d591 name: chr16_data.vcf.gz md5sum: 8441010f7cb684dbaeacf0bbd4e42249 filesize: 8.3GB filetype: .gz number_of_variants: 186300 number_of_participants: 11500 - id: alspacdcs:f61fe2eb-ea44-439d-88c8-22a6c91283d6 name: chr17_data.vcf.gz md5sum: 4e235ba3f8c100b1f278269b9abe162b filesize: 10.0GB filetype: .gz number_of_variants: 224774 number_of_participants: 11500 - id: alspacdcs:c668c9f6-4dd4-425c-8e06-dd5a7536a892 name: chr18_data.vcf.gz md5sum: db54dcc6eb17b46f1fcefdddb3fd0955 filesize: 2.5GB filetype: .gz number_of_variants: 57017 number_of_participants: 11500 - id: alspacdcs:2c380e03-7824-4b6f-b42b-9c3ebf9a58db name: chr19_data.vcf.gz md5sum: f79c758c9a54e71b132c180158f47e19 filesize: 12.5GB filetype: .gz number_of_variants: 271080 number_of_participants: 11500 - id: alspacdcs:ee246fa3-d17e-447d-9360-b0997d0d882b name: chr20_data.vcf.gz 
md5sum: 517231ad499d8827b57c8bc3dfc3d320 filesize: 4.3GB filetype: .gz number_of_variants: 96655 number_of_participants: 11500 - id: alspacdcs:42d991a1-d820-4ab0-8b6d-f6a96db7d68e name: chr21_data.vcf.gz md5sum: a37a11719ebb6fac2c0185fd725847b8 filesize: 1.9GB filetype: .gz number_of_variants: 42207 number_of_participants: 11500 - id: alspacdcs:57cc588a-3e86-4294-9c38-173b5fe35da4 name: chr22_data.vcf.gz md5sum: 057239d5968e4ebdb982c7a761fddda2 filesize: 4.2GB filetype: .gz number_of_variants: 94446 number_of_participants: 11500 - id: alspacdcs:1e697a14-6c20-4d98-bf0c-b432b307c9bb name: chrX_data.vcf.gz md5sum: 96466a642fa95a37c6ce18bc081a9313 filesize: 3.8GB filetype: .gz number_of_variants: 86925 number_of_participants: 11500 - id: alspacdcs:63fe419d-41ce-49f4-880d-46751b5d3e7e name: chrY_data.vcf.gz md5sum: 98551aecc6df538eb439face7d20067e filesize: 363.9KB filetype: .gz number_of_variants: 9 number_of_participants: 11500 - id: alspacdcs:af405b0a-6161-4946-8004-d9c7333d9788 name: vcf index files description: >- vcf index file allowing for faster use of compressed vcf counterpart. Generated using bcftools v1.19. To be used with corresponding vcf file (in format of chr1_data.vcf.gz.csi). 
data_distributions: - id: alspacdcs:5b755c68-5434-4b17-90ca-b95a4967d2b0 name: chr1_data.vcf.gz.csi md5sum: 7e91b9c00c2510cb8d7b9219761127db filesize: 59.3KB filetype: .csi - id: alspacdcs:5496ec33-9e40-4f44-b5d5-a905e346ba36 name: chr2_data.vcf.gz.csi md5sum: 9d99d101fdae1b73cb1add0995db1281 filesize: 47.6KB filetype: .csi - id: alspacdcs:c6e523d1-099b-4789-92a1-a17d1ca80890 name: chr3_data.vcf.gz.csi md5sum: c6f3d3c8784876cc34e1da41ff477544 filesize: 37.9KB filetype: .csi - id: alspacdcs:5df3296f-67c2-4634-8e14-b670b8fb70ec name: chr4_data.vcf.gz.csi md5sum: f45a01d452497ef61a6944f8c4874dcf filesize: 29.7KB filetype: .csi - id: alspacdcs:933eeba2-146a-4f5a-accb-e202be199ae7 name: chr5_data.vcf.gz.csi md5sum: 0d399ff3b3812732702706af558b4cf2 filesize: 30.8KB filetype: .csi - id: alspacdcs:7c402631-6da4-4b6e-bce3-cdca11ef5af9 name: chr6_data.vcf.gz.csi md5sum: c84bc37706a8886c04f011290e0fb527 filesize: 32.2KB filetype: .csi - id: alspacdcs:49bc0d6d-ae74-4e49-a192-d99f2bad008a name: chr7_data.vcf.gz.csi md5sum: 742aa1280b71dded05a7eca06671467d filesize: 32.2KB filetype: .csi - id: alspacdcs:876f8247-ec68-49db-8998-a081e3570eea name: chr8_data.vcf.gz.csi md5sum: 969272ccf78c43e2eb0c37ae726fa9cb filesize: 24.6KB filetype: .csi - id: alspacdcs:63b70682-ae5d-41fd-846f-815df58ebd21 name: chr9_data.vcf.gz.csi md5sum: 0a30e5878fecd543b8b27b65c2153ff4 filesize: 25.0KB filetype: .csi - id: alspacdcs:0607a7f5-3c58-46b5-a5b2-09a511118bb7 name: chr10_data.vcf.gz.csi md5sum: 9c79d177d09b4a29fde5e29eb6aa681d filesize: 27.8KB filetype: .csi - id: alspacdcs:bbe8ba0d-a093-4bdf-9eda-23c0142b5079 name: chr11_data.vcf.gz.csi md5sum: c6520a9a3a2ea08b089a69a676493f7a filesize: 31.5KB filetype: .csi - id: alspacdcs:2695c1de-8205-48b2-a22d-e4dba6ae637a name: chr12_data.vcf.gz.csi md5sum: 780fdb6487f2bc2e88b5ac6cb31beab2 filesize: 31.7KB filetype: .csi - id: alspacdcs:90d3ba16-c3ef-4222-a39a-02378d6a8982 name: chr13_data.vcf.gz.csi md5sum: 9c1701ea5de03e58a324373cdf36e35b filesize: 13.4KB 
filetype: .csi - id: alspacdcs:25199639-b5fd-4b13-8d52-aa1cfaaf974d name: chr14_data.vcf.gz.csi md5sum: 3c487ecb0d4d405526a57fddf54c3411 filesize: 19.1KB filetype: .csi - id: alspacdcs:c263f9ae-05e7-4bf8-92af-ef02ae58e5f4 name: chr15_data.vcf.gz.csi md5sum: 5f39316e63aa7bfc6dcad5f6ec29e0f5 filesize: 19.7KB filetype: .csi - id: alspacdcs:017da27c-a816-4652-b4c9-fa358610181b name: chr16_data.vcf.gz.csi md5sum: 07816ce3396eef33d6c8a2128593df4f filesize: 19.9KB filetype: .csi - id: alspacdcs:341e4feb-17ce-4461-9d63-5c841906da3a name: chr17_data.vcf.gz.csi md5sum: fc0dd9fc4480fe26b4e4c7cfdbbe90ae filesize: 26.3KB filetype: .csi - id: alspacdcs:09b71a98-19a6-4860-8fc5-9cd0fd1b4951 name: chr18_data.vcf.gz.csi md5sum: 25fd0f8ba4ece3eb2b82eb809d32b274 filesize: 12.4KB filetype: .csi - id: alspacdcs:9cbd4505-aebc-4674-92de-b7e478ee112e name: chr19_data.vcf.gz.csi md5sum: 9422f9902edfa9815dea5abfeb699b87 filesize: 23.7KB filetype: .csi - id: alspacdcs:6f4bd3d4-b60b-4af2-bcb9-e5ec04cbf034 name: chr20_data.vcf.gz.csi md5sum: 5b6c115377d8cbee42e33a4730512221 filesize: 14.8KB filetype: .csi - id: alspacdcs:6cde5398-d318-485b-90c9-0a4c65f93a66 name: chr21_data.vcf.gz.csi md5sum: 309c8eba9d76d97f53813af20b97948d filesize: 6.3KB filetype: .csi - id: alspacdcs:87c78128-434c-49cc-80d2-07102c87b542 name: chr22_data.vcf.gz.csi md5sum: d58b2f9dd877301a8b00920a8a963a97 filesize: 11.0KB filetype: .csi - id: alspacdcs:3bb30796-6e72-46da-9c89-5f353c17bd24 name: chrX_data.vcf.gz.csi md5sum: a9bcae58debbda0537e9f16f6bf08844 filesize: 22.9KB filetype: .csi - id: alspacdcs:4c98124f-7401-422a-ad2c-e8d53703c9f3 name: chrY_data.vcf.gz.csi md5sum: 08a6efe6cc066092a0332203f72a377c filesize: 129.0B filetype: .csi
4.3 Whole exome sequencing - G1 (wes_novaseq_g1)
4.3.1 Description
This dataset contains whole exome sequencing for G1 individuals. It was generated at the Broad Institute for ~2900 G1 individuals. Reference genome build: GRCh38
4.3.2 Methodology
The exomes returned from the Broad Institute did not undergo PCA or relatedness filtering; they were instead provided as raw VCF data. The following thresholds were applied to the samples:
- Chimera rate: Less than 0.05
- Contamination rate: Less than 0.10
- PF aligned rate: More than 0.60
87 individuals believed to be sample mismatches were removed from the dataset. These exomes had a discordance rate above 0.05 when compared to existing array data using bcftools gtcheck.
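The sample-level thresholds and the discordance cut-off above can be expressed as a simple filter. This is a minimal sketch, not the pipeline that was run: the SampleMetrics class, its field names and the example values are hypothetical, and only the numeric cut-offs come from the text.

```python
from dataclasses import dataclass


@dataclass
class SampleMetrics:
    """Per-sample QC metrics (field names are illustrative)."""
    sample_id: str
    chimera_rate: float
    contamination_rate: float
    pf_aligned_rate: float
    array_discordance: float


def qc_failures(m):
    """Return the list of checks a sample fails (empty list means it passes)."""
    reasons = []
    if not m.chimera_rate < 0.05:
        reasons.append("chimera_rate")
    if not m.contamination_rate < 0.10:
        reasons.append("contamination_rate")
    if not m.pf_aligned_rate > 0.60:
        reasons.append("pf_aligned_rate")
    # Discordance above 0.05 against array data suggests a sample mismatch.
    if m.array_discordance > 0.05:
        reasons.append("array_discordance")
    return reasons


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    samples = [
        SampleMetrics("S1", 0.01, 0.02, 0.95, 0.01),
        SampleMetrics("S2", 0.07, 0.02, 0.95, 0.01),
        SampleMetrics("S3", 0.01, 0.02, 0.95, 0.20),
    ]
    for s in samples:
        failed = qc_failures(s)
        print(s.sample_id, "pass" if not failed else "fail: " + ", ".join(failed))
```

Returning the failed checks rather than a single boolean makes it easier to report why a sample was excluded.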
Associated publications:
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9980234/ (this study conducted additional QC beyond that applied to the dataset)
4.3.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema
id: alspacdcs:wes_novaseq_g1_204-04-12_f5
name: >-
  Whole Exome Sequencing - Novaseq - G1 version 2024-04-09 freeze 5
description: >-
  This is the first iteration of wes_novaseq_g1, first introduced in freeze 4.
  It contains data in vcf 4.2 format.
  It is a subset of the G1 cohort, with participants who have withdrawn their
  consent removed and omics IDs applied according to the freeze.
  Samples were selected for whole exome sequencing at the Broad Institute from
  the G1 cohort (the cohort of index children) and were from subjects who were
  singletons/unrelated and of European/British ancestry, had blood-derived DNA
  available, and had been genotyped on a whole genome genotyping array.
  The QC was performed by the Broad. The following thresholds were applied:
  Chimera rate < 0.05
  Contamination rate < 0.10
  PF aligned rate > 0.60
  87 individuals believed to be sample mismatches were removed from the dataset.
  These exomes had a discordance rate above 0.05 when compared to existing
  array data using bcftools gtcheck.
  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9980234/ describes this dataset
  in supplementary materials.
freeze_size:
linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a
git_tag: https://github.com/alspac/dataset_wes_novaseq_g1/releases/tag/freeze5
is_current_freeze: true
freeze_number: 5
freeze_date: 2025-02-27
previous_freeze: alspacdcs:wes_novaseq_g1_204-04-12_f4
freeze_of_alspac_dataset_version: alspacdcs:wes_novaseq_g1_2024-03-26
freeze_of_named_alspac_dataset: alspacdcs:wes_novaseq_g1
has_parts:
  - id: alspacdcs:wes_novaseq_g1_2024-04-09_all_chr_f5
    name: all_chr
    description: >-
      All chromosomes and all participants within the dataset contained within
      a single vcf version 4.2 file, which has been compressed using bcftools 1.19.
    data_distributions:
      - id: alspacdcs:37f5619e-b3e9-4f12-b58e-69678dac59db
        name: all_chr.vcf.gz
        description: >-
          vcf file containing all participants and chromosomes, to be used with
          all_chr.vcf.gz.csi
        md5sum: 1caa32ff3e54ccc46f9553960f70645f
        filesize: 28G
        filetype: vcf.gz
        number_of_participants: 2879
        #number_of_gene_expression_probe_values:
      - id: alspacdcs:6f65a113-6dfa-45ac-80e8-ad23d4f8c958
        name: all_chr.vcf.gz.csi
        description: >-
          index for vcf file - all_chr.vcf.gz, generated using bcftools v1.19.
        md5sum: cbfa46323e5ae250fabf071df72b5856
        filesize: 800K
        filetype: .csi
5 Epigenetic Data
5.1 DNA methylation - EPIC & 450k - G0 + G1 (dnam_epic450_g0_g1)
5.1.1 Description
This dataset contains methylation data collected from both G0 and G1 on two arrays at different timepoints. This dataset supersedes dnam_450_g0m_g1.
There are data from the Illumina Infinium HumanMethylation450K BeadChip array on G1 mothers at two timepoints (pregnancy and middle age), G1 participants at 5 timepoints and G0 participants at three timepoints (birth, childhood and adolescence). This dataset also contains Infinium MethylationEPIC v1.0 array data on 2721 G1 individuals at 2 timepoints.
This dataset was generated as part of the Accessible Resource for Integrated Epigenomics Studies (http://www.ariesepigenomics.org.uk/).
5.1.2 Methodology
Preprocessing and quality control for this dataset were conducted using Meffil.
Associated publications:
Associated R packages:
- aries: https://github.com/MRCIEU/aries is associated with loading and using this dataset.
- meffil: https://github.com/perishky/meffil/ was used for QC and normalisations within
5.1.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:dnam_epic450_g0_g1_2022-7-13_f5 name: >- DNA methylation - EPIC & 450k - G0 + G1 version 2022-7-13 Freeze 5 description: >- This is the freeze 5 version of dnam_epic450_g0_g1, which was first introduced in freeze 2 and first released 2022-7-13. freeze_size: 137G linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9 woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473 all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a git_tag: https://github.com/alspac/dataset_dnam_epic450_g0_g1/releases/tag/Freeze5 is_current_freeze: true freeze_number: 5 freeze_date: 2025-02-27 ### Update to align with date of release previous_freeze: 4 freeze_of_alspac_dataset_version: alspacdcs:dnam_epic450_g0_g1_2022-7-13 freeze_of_named_alspac_dataset: alspacdcs:dnam_epic450_g0_g1 has_containers: - id: alspacdcs:b56c1d92-e706-4771-8c9f-aa3f8b4d696e name: data description: A dir/folder containing the data files - id: alspacdcs:34396db7-83c1-4a7c-ac91-1b61c36be058 name: betas description: A dir/folder containing the beta files belongs_to_container: alspacdcs:b56c1d92-e706-4771-8c9f-aa3f8b4d696e - id: alspacdcs:6b98295d-0ad1-441f-be61-f5fb01354bf5 name: control_matrix description: A dir/folder containing the control matrix files belongs_to_container: alspacdcs:b56c1d92-e706-4771-8c9f-aa3f8b4d696e - id: alspacdcs:55e81f3d-b724-495e-84dd-2a378a4aa5df name: derived description: A dir/folder containing the derived data (e.g. 
Cell count predictions and dnamage) belongs_to_container: alspacdcs:b56c1d92-e706-4771-8c9f-aa3f8b4d696e - id: alspacdcs:47b7227d-cfdc-4a6c-ac6b-fb885de042d6 name: cellcounts description: A dir/folder containing the cell count predictions belongs_to_container: alspacdcs:55e81f3d-b724-495e-84dd-2a378a4aa5df - id: alspacdcs:41debec0-c442-419a-bc9d-55f3712e64c9 name: detection_p_values description: A dir/folder containing the matrix of detection values belongs_to_container: alspacdcs:b56c1d92-e706-4771-8c9f-aa3f8b4d696e - id: alspacdcs:97cd1d8c-1847-471e-851d-702e74d0b878 name: samplesheet description: A dir/folder containing matrices of the sample identification. belongs_to_container: alspacdcs:b56c1d92-e706-4771-8c9f-aa3f8b4d696e has_parts: - id: alspacdcs:c74b655d-aeb5-472a-838a-53aab0fd43f6 name: betas description: >- Normalized betas using functional normalization. We used 10 PCs on the controlmatrix to regress out technical variation. Slide was regressed out as random effect before normaliziation. CpGs are in rows and samples in columns. data_distributions: - id: alspacdcs:8bb819bb-2593-4418-8aa3-30ccbf42e5f7 name: 450.gds description: >- R data object for the Normalized beta data for the 450 array only. md5sum: 02e9b3cdda39d3476bfce111f5935f93 filesize: 22G filetype: .gds belongs_to_container: alspacdcs:34396db7-83c1-4a7c-ac91-1b61c36be058 number_of_samples: 5927 - id: alspacdcs:c9040c38-0d33-40ca-b1c9-0633519367d2 name: common.gds description: >- R data object for the Normalized beta data for both the EPIC and 450 arrays. md5sum: 2d447051e6241bf35dc1bfba4e740848 filesize: 30G filetype: .gds belongs_to_container: alspacdcs:34396db7-83c1-4a7c-ac91-1b61c36be058 number_of_samples: 8669 - id: alspacdcs:6ea804dc-22cf-4c20-bad9-10dd022ad60e name: epic.gds description: >- R data object for the Normalized beta data for the EPIC array only. 
md5sum: 0357486c3af3b5ee120c7b05bf077340 filesize: 18G filetype: .gds belongs_to_container: alspacdcs:34396db7-83c1-4a7c-ac91-1b61c36be058 number_of_samples: 2742 - id: alspacdcs:9e066b44-dfc4-4499-87ab-ecf1f920e22d name: control_matrix description: >- The 850 control probes are summarized in 42 control types. These probes can roughly be divided into negative control probes (613), probes intended for between array normalization (186) and the remainder (49), which are designed for quality control, including assessing the bisulfite conversion rate. None of these probes are designed to measure a biological signal. The summarized control probes can be used as surrogates for unwanted variation and are used for the functional normalization. Samples are rows and 42 control types are in columns. data_distributions: - id: alspacdcs:11c170b9-8d2f-4f97-8b95-484e7c6eca5a name: 450.txt description: >- Plain text file of the control matrix for the 450 array only. md5sum: 9e6aa62498c5bb7493f7512e274056ba filesize: 2.2M filetype: .txt belongs_to_container: alspacdcs:6b98295d-0ad1-441f-be61-f5fb01354bf5 number_of_samples: 5927 - id: alspacdcs:d506ee43-e7cf-4965-b77f-cbf92a840160 name: common.txt description: >- Plain text file of the control matrix for both the EPIC and 450 arrays. md5sum: 42d21ff7a2ead483e85b909b279e9912 filesize: 3.2M filetype: .txt belongs_to_container: alspacdcs:6b98295d-0ad1-441f-be61-f5fb01354bf5 number_of_samples: 8669 - id: alspacdcs:58d260b0-529b-44dd-af72-e886bd49cbb3 name: epic.txt description: >- Plain text file of the control matrix for the EPIC array only. md5sum: 7a680d3ccd26a491ec7dde2ce91eeeab filesize: 1.0M filetype: .txt belongs_to_container: alspacdcs:6b98295d-0ad1-441f-be61-f5fb01354bf5 number_of_samples: 2742 - id: alspacdcs:bccb21fd-8f7c-4745-b5ec-23934efc158a name: DNA methylation age description: >- DNA methylation aging estimates from within the dataset. 
Further information on this data and its usage is found within the `dnamage.html` and `dnamage.md` within the docs dir/folder. data_distributions: - id: alspacdcs:5c1df92f-44dc-4953-90bd-f33f51b3a704 name: dnamage.csv description: >- A csv file containing DNA methylation aging estimates within the dataset. md5sum: bd0c2efef6ee145cd0804d61c7e83151 filesize: 12M filetype: .csv belongs_to_container: alspacdcs:55e81f3d-b724-495e-84dd-2a378a4aa5df number_of_samples: 8192 - id: alspacdcs:00045e99-d84b-4ef0-b6c4-a2fd4c7db852 name: cell counts description: >- Files contain cell counts estimated using a variety of cell type references using the Houseman deconvolution algorithm (PMID: 22568884). In each file, samples correspond to rows and cell types to columns. data_distributions: - id: alspacdcs:00045e99-d84b-4ef0-b6c4-a2fd4c7db852 name: andrews-and-bakulski-cord-blood.txt description: >- Cord blood cell count estimates derived using the Bakulski et al. 2016 reference (PMID 27019159; https://bioconductor.org/packages/release/data/experiment/html/FlowSorted.CordBlood.450k.html). This reference has been implemented in meffil. Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, natural killer cells and nucleated red blood cells. In this text file, samples are in rows and cell types in columns. md5sum: 33c69aa8e50deb28355dcb82d01c7510 filesize: 114K filetype: .txt belongs_to_container: alspacdcs:47b7227d-cfdc-4a6c-ac6b-fb885de042d6 number_of_participants: 913 - id: alspacdcs:27721a12-e645-474f-b425-2e07a6a00db8 name: gervin-and-lyle-cord-blood.txt description: >- Cord blood cell count estimates derived using the Gervin et al. 2019 reference (PMID 31455416; GEO accession GSE127824). Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, and natural killer cells. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns. 
md5sum: 099c4cf9bd4ecfee91c19c3c2d2b6f70 filesize: 100K filetype: .txt belongs_to_container: alspacdcs:47b7227d-cfdc-4a6c-ac6b-fb885de042d6 number_of_participants: 913 - id: alspacdcs:3ff0d680-2322-4733-905f-a84834980180 name: cord-blood-gse68456.txt description: >- Cord blood cell count estimates derived using the de Goede et al. 2015 reference (PMID 26366232; GEO accession GSE68456). Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, natural killer cells and nucleated red blood cells. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns. md5sum: 941f8a9ce1289ab5baaf10fb29bd8941 filesize: 130K filetype: .txt belongs_to_container: alspacdcs:47b7227d-cfdc-4a6c-ac6b-fb885de042d6 number_of_participants: 913 - id: alspacdcs:06fd9d57-d014-4464-9458-aad9cf2d568b name: blood-gse35069-complete.txt description: >- Cell counts in peripheral blood predicted using the peripheral blood reference published in Reinius et al. 2012 (PMID: 22848472). Same as 'blood gse35069.txt' but replaces granulocytes with eosinophils and neutrophils. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns. md5sum: 27ab648c56b56e62709a98fcba95a764 filesize: 1.2M filetype: .txt belongs_to_container: alspacdcs:47b7227d-cfdc-4a6c-ac6b-fb885de042d6 number_of_samples: 8669 - id: alspacdcs:fa746dfc-458a-4037-b190-6e40bb8cc7a1 name: blood-gse35069.txt description: >- Blood cell count estimates derived using the Reinius et al. 2012 reference (PMID 25424692; GEO accession GSE35069). Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, and natural killer cells. In this text file, samples are in rows and cell types in columns. 
md5sum: 53fb63b4cef457d90688b3ddb861fa73 filesize: 1021K filetype: .txt belongs_to_container: alspacdcs:47b7227d-cfdc-4a6c-ac6b-fb885de042d6 number_of_samples: 8669 - id: alspacdcs:3a1ab6dd-e48a-4b50-9d06-097078acfe54 name: blood-idoloptimized-epic.txt description: >- Cell counts in peripheral blood predicted using the cell type reference from Bioconductor package FlowSorted.Blood.EPIC. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns. md5sum: 7331e83d31e1d200bbff3d041223cde1 filesize: 347K filetype: .txt belongs_to_container: alspacdcs:47b7227d-cfdc-4a6c-ac6b-fb885de042d6 number_of_samples: 2742 - id: alspacdcs:fa7b1498-e630-496f-9c0e-a592361b312a name: blood-idoloptimized.txt description: >- Cell counts in peripheral blood predicted using the cell type reference from Bioconductor package FlowSorted.Blood.EPIC but restricted to the IDOLOptimizedCpGs450klegacy CpG sites. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns. md5sum: 2c2bdbf34093960af969ca37ae43c77b filesize: 1.1M filetype: .txt belongs_to_container: alspacdcs:47b7227d-cfdc-4a6c-ac6b-fb885de042d6 number_of_samples: 8669 - id: alspacdcs:8206505f-6c04-458d-9889-7e7e80411721 name: combined-cord-blood.txt description: >- Cord blood cell count estimates derived using the Bakulski et al, Gervin et al., de Goede et al., and Lin et al. references (https://bioconductor.org/packages/release/data/experiment/html/FlowSorted.CordBloodCombined.450k.html) for CpG sites selected using the IDOL algorithm and optimized for the Illumina Infinium HumanMethylation450 Beadchip. Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, natural killer cells and nucleated red blood cells. In this text file, samples are in rows and cell types in columns. 
md5sum: 7cbcf72ca00012d17d22ff6d21b7575c filesize: 129K filetype: .txt belongs_to_container: alspacdcs:47b7227d-cfdc-4a6c-ac6b-fb885de042d6 number_of_participants: 913 - id: alspacdcs:a011593e-6da2-4136-95e8-acdb402e9fb7 name: detection p values description: >- This matrix shows the detection pvalues for each sample and each CpG and is extracted from the idat files using the "meffil.load.detection.pvalues" function in meffil. CpGs are in rows and samples in columns. data_distributions: - id: alspacdcs:659a87cb-65a6-4330-8d08-d8f5a243e6b1 name: 450.gds description: >- R object file for the detection p values matrix for the 450 array only. md5sum: 1c437226b2aab0c00aed7098e739f49d filesize: 22G filetype: .gds belongs_to_container: alspacdcs:41debec0-c442-419a-bc9d-55f3712e64c9 number_of_samples: 5927 - id: alspacdcs:e5e83be9-63d3-4da0-a4ef-7a9367df6c02 name: common.gds description: >- R object file for the detection p values matrix for both EPIC and 450 arrays. md5sum: c6f4348fa7d92a5f341f69e1784036da filesize: 30G filetype: .gds belongs_to_container: alspacdcs:41debec0-c442-419a-bc9d-55f3712e64c9 number_of_samples: 8669 - id: alspacdcs:0e522100-8538-45ec-a68a-3158da8605e8 name: epic.gds description: >- R object file for the detection p values matrix for the EPIC array only. md5sum: 341d1194d468e10e80be9dc9990c474b filesize: 18G filetype: .gds belongs_to_container: alspacdcs:41debec0-c442-419a-bc9d-55f3712e64c9 number_of_samples: 2742 - id: alspacdcs:1099f8cd-a644-46c9-8722-90a3bc34db30 name: samplesheets description: >- Manifest files with columns extracted directly from LIMS and age, sex, omics ID, timepoint, timecode, sampletype, genotype columns to report sample mismatches, duplicate.rm column to remove duplicates. Samples in rows, variables in columns. data_distributions: - id: alspacdcs:74a1d3bd-310a-429f-af09-b5745740419e9o0 name: samplesheet-450.csv description: >- R data object manifest file for the 450 array only. 
md5sum: a94696265d5418d2240be82ab91c79d1 filesize: 2.2M filetype: .csv belongs_to_container: alspacdcs:97cd1d8c-1847-471e-851d-702e74d0b878 number_of_samples: 5927 - id: alspacdcs:74a1d3bd-310a-429f-af09-b5745740419e name: samplesheet-common.csv description: >- R data object manifest file for both the EPIC and 450 arrays. This is a duplicate with samplesheet.csv. md5sum: 702d0d663d92b636fee1b04ff5f681fa filesize: 3.3M filetype: .csv belongs_to_container: alspacdcs:97cd1d8c-1847-471e-851d-702e74d0b878 number_of_samples: 8669 - id: alspacdcs:708fb297-82ee-4b61-a48f-73cc9642e0d9 name: samplesheet-epic.csv description: >- R data object manifest file for the EPIC array only. md5sum: 42b2dc297d28f4bc992eac9b6a17cb60 filesize: 1.1M filetype: .csv belongs_to_container: alspacdcs:97cd1d8c-1847-471e-851d-702e74d0b878 number_of_samples: 2742 - id: alspacdcs:15bce2eb-ef05-46bd-8cc3-c06e8d6ba2fd name: samplesheet.csv description: >- R data object manifest file for both the EPIC and 450 arrays. This is a duplicate with samplesheet-common.csv. md5sum: 702d0d663d92b636fee1b04ff5f681fa # should be the same as samplesheet-common.csv filesize: 3.3M filetype: .csv belongs_to_container: alspacdcs:97cd1d8c-1847-471e-851d-702e74d0b878 number_of_samples: 8669
6 Gene Expression Data
6.1 Gene expression - array - G1 (ge_ht12_g1)
6.1.1 Description
There are two different types of QC'd data available in this version, one performed by David Evans for the Bryois et al 2014 paper, and one performed by Gibran Hemani for the molgenis eQTL mapping meta analysis. A version without QC is available as well. Details on the QC'd versions can be seen below.
This data was generated from LCLs. The majority of samples used in their generation were collected at age 9 years. LCLs are lymphoblastoid cell lines, which were produced by transforming lymphocytes with Epstein-Barr virus and were cultured before DNA was extracted. Gene expression patterns may not be the same as those from untransformed lymphocytes taken from a 9 year old.
6.1.2 Methodology
Bryois:
- LCLs from unrelated individuals were grown under identical conditions and cells frozen in RNAlater. RNA was extracted using an RNeasy extraction kit (Qiagen) and was amplified using the Illumina TotalPrep-96 RNA Amplification kit (Ambion). Expression profiling of the samples, each with two technical replicates, was performed using the Illumina Human HT-12 V3 BeadChips (Illumina Inc), which include 48,804 probes; 200 ng of total RNA was processed according to the protocol supplied by Illumina. Raw data was imported to the Illumina Beadstudio software and probes with fewer than three beads present were excluded. Log2-transformed expression signals were then normalized with quantile normalization of the replicates of each individual followed by quantile normalization across all individuals.
The analysis was restricted to 23,935 probes tagging genes annotated in Ensembl. Principal component analysis was performed on 931 individuals. 62 individuals with principal component 1 or 2 greater than one standard deviation of the population were excluded from further analysis. See http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004461 for full details.
Molgenis:
- Genetic outliers were removed, that is, any individuals that were clear outliers in the first 2 genetic principal components. Each probe was quantile normalised and then log2 transformed. The data were then adjusted for the first 4 genetic MDS components and for expression principal components (excluding those that had genetic associations), and scaled to have mean 0 and variance 1. See https://github.com/molgenis/systemsgenetics/wiki/eQTL-mapping-analysis-cookbook for full details.
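The normalisation steps named above (log2 transformation, quantile normalisation, and scaling to mean 0 and variance 1) can be illustrated with a small sketch. This is not the Bryois or Molgenis pipeline: the function names are hypothetical, the data are toy values, ties are broken by position for simplicity, and the covariate adjustment step is omitted.

```python
import math
from statistics import mean, pstdev


def log2_transform(columns):
    """Apply log2 to every value; each inner list is one sample."""
    return [[math.log2(value) for value in column] for column in columns]


def quantile_normalise(columns):
    """Give every sample the same distribution of values.

    Each value is replaced by the mean, across samples, of the values
    that share its rank within their own sample.
    """
    n_rows = len(columns[0])
    sorted_columns = [sorted(column) for column in columns]
    rank_means = [
        mean(column[rank] for column in sorted_columns)
        for rank in range(n_rows)
    ]
    normalised = []
    for column in columns:
        # Indices of this sample's values, from smallest to largest.
        order = sorted(range(n_rows), key=lambda index: column[index])
        result = [0.0] * n_rows
        for rank, index in enumerate(order):
            result[index] = rank_means[rank]
        normalised.append(result)
    return normalised


def standardise(values):
    """Scale one probe's values to mean 0 and (population) variance 1."""
    centre, spread = mean(values), pstdev(values)
    return [(value - centre) / spread for value in values]


if __name__ == "__main__":
    # Two toy samples measured on three probes.
    samples = [[2, 4, 8], [16, 1, 4]]
    print(quantile_normalise(samples))
    print(log2_transform(samples))
    print(standardise([1.0, 2.0, 3.0]))
```

In practice a dedicated package would normally be used for these steps; the sketch only shows what each operation does to the numbers.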
6.1.3 Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema
id: alspacdcs:ge_ht12_g1_2015-11-02_f5
name: Gene expression - array - G1 release version 2015-11-02 freeze 5
description: >-
  This is the fifth freeze of the 2015-11-02 version of ge_ht12_g1 dataset
  which has .csv distributions of the data rather than .Rdata files in order
  to be easier to use across different data science software and languages.
freeze_size: 2.6G
linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a
git_tag: https://github.com/alspac/dataset_ge_ht12_g1/releases/tag/freeze5
is_current_freeze: true
freeze_number: 5
freeze_date: 2025-02-27
previous_freeze: alspacdcs:ge_ht12_g1_2015-11-02_f4
freeze_of_alspac_dataset_version: alspacdcs:ge_ht12_g1_2015-11-02
freeze_of_named_alspac_dataset: alspacdcs:ge_ht12_g1
has_parts:
  - id: alspacdcs:ge_ht12_g1_2015-11-02_bryois_f5
    name: bryois data
    description: Dataset part for the bryois data in ge_ht12_g1 version 2015-11-02 freeze 5
    data_distributions:
      - id: alspacdcs:6495a875-5088-4a6a-86ac-9995d9203f72
        name: bryois.csv
        description: >-
          The freeze 5 csv version of the bryois data. IDs in columns and
          Illumina probe IDs in rows. This is the normalised data used in
          Bryois et al 2014. Probe IDs are mapped to genes in raw.csv
        md5sum: 2ef6aa2cd66c0cc31c69479bdc67432f
        filesize: 742M
        filetype: .csv
        number_of_participants: 947
        number_of_gene_expression_probe_values: 48630
  - id: alspacdcs:ge_ht12_g1_2015-11-02_molgenis_f5
    name: Molgenis data
    description: >-
      Dataset part for the Molgenis data in ge_ht12_g1 version 2015-11-02 freeze 5
    data_distributions:
      - id: alspacdcs:282fa0c9-a01b-4dd3-8664-78e0dde10e1f
        name: molgenis.csv
        description: >-
          The freeze 5 csv version of the molgenis data. IDs in columns and
          Illumina probe IDs in rows. Normalised data following the molgenis
          pipeline, found at
          https://github.com/molgenis/systemsgenetics/wiki/eQTL-mapping-analysis-cookbook.
          Probe IDs are mapped to genes in raw.csv
        md5sum: 4a3739d68b3d52d6650003aab2424ab8
        filesize: 752M
        filetype: .csv
        number_of_participants: 879
        number_of_gene_expression_probe_values: 48630
  - id: alspacdcs:ge_ht12_g1_2015-11-02_raw_f5
    name: Raw data
    description: Dataset part for the raw data in ge_ht12_g1 version 2015-11-02 freeze 5
    data_distributions:
      - id: alspacdcs:7025f451-01eb-4a40-bd6c-dec89db0f7ab
        name: raw.csv
        description: >-
          The freeze 5 csv version of the raw ge data. IDs in columns and
          probes in rows. Four columns per individual, with two columns for
          average signal and two columns for average number of beads.
          Presumably this is a file generated by the Illumina Genome Studio
          software.
        md5sum: d1b6b2f1c8231e02666fea06ff1b4f9a
        filesize: 1.1G
        filetype: .csv
        number_of_participants: 994 ##This is not how wide this dataframe is
        number_of_gene_expression_probe_values: 48630
7 Omics tips
7.1 Introduction
This section is a guide to using 'Omics datasets. It explains which software to use and describes common file formats. It's a good starting point for beginners and helpful for problem-solving.
7.2 Disclaimer
Some information is copied or reworded from software documentation. Check the original documentation alongside this guide for up-to-date information. Note that some links may no longer work.
7.3 Operating systems
You can use ALSPAC data with any operating system, but Unix-based systems such as macOS, Linux, or BSD are more convenient because of the data's size and complexity. We recommend using the command line and scripting with languages like Bash, R, Python, or Perl. Many online resources are available to learn these tools. Use free/libre and open-source software where possible.
Links:
- Unix guide: https://www.osc.edu/supercomputing/unix-cmds
- Beginning Python: https://www.python.org/about/gettingstarted/
- Beginning R: https://www.statmethods.net/r-tutorial/index.html
- Free/libre and open-source software: https://www.fsf.org/about/
7.4 Key Omics software
7.4.1 Plink
Plink is a tool for performing quality control and whole genome association analysis of genetic data.
7.4.2 SNPTest
SNPTest is a tool for performing whole genome association analysis of genetic data.
- Link: https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html (Not open source)
7.4.3 BoltLmm
BoltLmm is a tool for performing genome association analysis of genetic data. It is recommended for analysis of more than 5000 samples, and its methods automatically take population substructure into account.
7.4.4 Qctools
Qctool is a tool for quality control of genetic data. It is also useful for inspecting and modifying .gen, .bgen and .vcf files (see section 7.5 below).
7.4.5 SAMTOOLS
Samtools is a suite of tools which are used for genomic analysis.
- Link: http://www.htslib.org/
7.4.6 VCFTOOLS
A program package for working with vcf files. It is developed separately from samtools.
7.4.7 BCFTOOLS
This is part of the samtools project and allows users to manipulate .vcf and .bcf files.
7.5 File types
In a Unix environment the suffix (extension) of a file name does not explicitly mean anything to the operating system, unlike in Windows, which uses the extension to decide how to handle a file. In a Unix system it is just part of the file name, and humans use it to distinguish file formats. The following is a non-exhaustive list of file types you may encounter whilst using ALSPAC Omics data.
7.5.1 .gen
This is an 'Oxford' data format for genetic data. The .gen file is a plain text file, which means that standard Unix command line tools, for example 'head' or 'less', can be used to inspect the data.
The .gen (genotype) file stores data in a one-line-per-SNP format. The first 5 entries of each line are the SNP ID, the RS ID of the SNP, the base-pair position of the SNP, the allele coded A and the allele coded B. The SNP ID can be used to denote the chromosome number of each SNP. The next three numbers on the line are the probabilities of the three genotypes AA, AB and BB at the SNP for the first individual in the cohort. The next three numbers are the genotype probabilities for the second individual, the next three are for the third individual, and so on. The order of individuals in the genotype file should match the order of the individuals in the sample file (see below). The probabilities need not sum to 1, which allows for the possibility of a NULL genotype call. This format allows for genotype uncertainty and is the same as that produced by the genotype calling algorithm CHIAMO.
NOTE: We recommend that you arrange SNPs in base-pair order in the genotype files. This is required if you want to use the files with IMPUTE and will make viewing the output of SNPTEST somewhat easier.
For example, suppose you want to create a genotype file for 2 individuals at 5 SNPs whose genotypes are:
SNP | Individual 1 | Individual 2 |
SNP 1 | AA | AA |
SNP 2 | GG | GT |
SNP 3 | CC | CT |
SNP 4 | CT | CT |
SNP 5 | AG | GG |
The correct genotype file would look like this:
SNP1 rs1 1000 A C 1 0 0 1 0 0
SNP2 rs2 2000 G T 1 0 0 0 1 0
SNP3 rs3 3000 C T 1 0 0 0 1 0
SNP4 rs4 4000 C T 0 1 0 0 1 0
SNP5 rs5 5000 A G 0 1 0 0 0 1
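The layout described above (five metadata fields followed by three probabilities per individual) can be read with a few lines of Python. This is a minimal sketch, not a replacement for qctool or SNPTEST: the function names are hypothetical and the 0.9 calling threshold is an assumption chosen for illustration.

```python
def parse_gen_line(line):
    """Split one .gen line into SNP metadata and per-individual probabilities."""
    fields = line.split()
    snp_id, rs_id, position, allele_a, allele_b = fields[:5]
    probabilities = [float(value) for value in fields[5:]]
    if len(probabilities) % 3 != 0:
        raise ValueError("expected three probabilities per individual")
    # Group the flat list into (P(AA), P(AB), P(BB)) triplets.
    triplets = [
        tuple(probabilities[start:start + 3])
        for start in range(0, len(probabilities), 3)
    ]
    return {
        "snp_id": snp_id,
        "rs_id": rs_id,
        "position": int(position),
        "allele_a": allele_a,
        "allele_b": allele_b,
        "probabilities": triplets,
    }


def call_genotype(triplet, allele_a, allele_b, threshold=0.9):
    """Return the most probable genotype, or None if it is too uncertain.

    The threshold is an illustrative assumption, not a recommended value.
    """
    labels = (allele_a * 2, allele_a + allele_b, allele_b * 2)
    best = max(range(3), key=lambda index: triplet[index])
    return labels[best] if triplet[best] >= threshold else None


if __name__ == "__main__":
    record = parse_gen_line("SNP5 rs5 5000 A G 0 1 0 0 0 1")
    calls = [
        call_genotype(triplet, record["allele_a"], record["allele_b"])
        for triplet in record["probabilities"]
    ]
    print(record["rs_id"], calls)
```

Applied to the last line of the example file, this recovers the genotypes given for SNP 5 in the table of genotypes.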
7.5.2 .bgen
A binary version of a .gen file. This file cannot be visually inspected on the command line. .bgen files are used because they are much faster to process and more storage-efficient than plain text when handling large amounts of Omics data. The full details of the file format are discussed at https://www.well.ox.ac.uk/~gav/bgen_format/ and bgen files are normally used with tools such as qctool and snptest. There is also a library for reading .bgen files into R: https://bitbucket.org/gavinband/bgen/wiki/rbgen
7.5.3 .sample
The .sample file is paired with either .gen or .bgen files. It contains information on the samples that is not genetic. It is a plain text file that can be inspected with standard Unix command line tools.
Please note that the sample file format changed with the release of SNPTEST v2. Specifically, the way in which covariates and phenotypes are coded on the second line of the file has changed. The sample file has three parts: (a) a header line detailing the names of the columns in the file, (b) a line detailing the types of variables stored in each column, and (c) a line for each individual detailing the information for that individual. Here is an example of the start of a sample file for reference:
ID_1 ID_2 missing cov_1 cov_2 cov_3 cov_4 pheno1 bin1
0 0 0 D D C C P B
1 1 0.007 1 2 0.0019 -0.008 1.233 1
2 2 0.009 1 2 0.0022 -0.001 6.234 0
3 3 0.005 1 2 0.0025 0.0028 6.121 1
4 4 0.007 2 1 0.0017 -0.011 3.234 1
5 5 0.004 3 2 -0.012 0.0236 2.786 0
The header line: This line needs a minimum of three entries. The first three entries should always be ID_1, ID_2 and missing. They denote that the first three columns contain the first ID, second ID and missing data proportion of each individual. Additional entries on this line should be the names of covariates or phenotypes that are included in the file. In the above example, there are 4 covariates named cov_1, cov_2, cov_3 and cov_4, a continuous phenotype named pheno1 and a binary phenotype named bin1. NOTE: All phenotypes should appear after the covariates in this file.
The second line: This line details the type of variable included in each column. The first three entries of this line should be set to 0. Subsequent entries for covariates and phenotypes should be specified by the following rules:
D | Discrete covariate (coded using positive integers) |
C | Continuous covariates |
P | Continuous Phenotype |
B | Binary Phenotype (0 = Controls, 1 = Cases) |
The remainder of the file should consist of a line for each individual containing the information specified by the entries of the header line (see example above). Separate the entries of the sample file with spaces, not tabs, because a space is the expected separator.
Missing values - Specifying missing values for covariates and phenotypes is possible. It was previously recommended that you use -9 for missing values. This was the default value assumed by SNPTEST v1, although the -missing_code option in SNPTEST v1 meant that you could use other numeric values for the missing code. In SNPTEST v2 the behavior of the -missing_code option has changed so that it now takes a comma-separated list of values, each of which is treated as missing when encountered in the sample file(s). The default missing value is now denoted by the two-character string "NA".
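The three-part structure described above (names line, types line, one line per individual) can be read as follows. This is a minimal sketch under stated assumptions: `read_sample_file` is a hypothetical helper, only continuous columns (types C and P) are converted to numbers, and "NA" is treated as the only missing code unless others are passed in.

```python
import io


def read_sample_file(handle, missing_codes=("NA",)):
    """Read a .sample file into (names, types, rows).

    Each row is a dict keyed by column name. Values matching a missing
    code become None; C and P columns become floats; others stay as text.
    """
    names = handle.readline().split()
    types = handle.readline().split()
    if names[:3] != ["ID_1", "ID_2", "missing"]:
        raise ValueError("first three columns must be ID_1, ID_2 and missing")
    if len(types) != len(names):
        raise ValueError("the types line must have one entry per column")
    rows = []
    for line in handle:
        values = line.split()
        if not values:
            continue  # skip blank lines
        row = {}
        for name, kind, value in zip(names, types, values):
            if value in missing_codes:
                row[name] = None
            elif kind in ("C", "P"):
                row[name] = float(value)
            else:
                row[name] = value
        rows.append(row)
    return names, types, rows


if __name__ == "__main__":
    example = io.StringIO(
        "ID_1 ID_2 missing cov_1 cov_2 cov_3 cov_4 pheno1 bin1\n"
        "0 0 0 D D C C P B\n"
        "1 1 0.007 1 2 0.0019 -0.008 1.233 1\n"
        "2 2 0.009 1 2 NA -0.001 6.234 0\n"
    )
    names, types, rows = read_sample_file(example)
    print(rows[1]["cov_3"], rows[0]["pheno1"])
```

Note that `str.split()` accepts any whitespace, so this reader is more lenient than the tools themselves, which expect spaces.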
7.5.4 .ped
A plink format file that is in plain text and can be viewed with standard tools. It contains genetic variant data. https://www.cog-genomics.org/plink/1.9/formats#ped
7.5.5 .map
A plink format file that is in plain text. It contains information about variants. https://www.cog-genomics.org/plink/1.9/formats#map
7.5.6 .bed
A plink format file that is a binary equivalent of a .ped file. It is smaller and faster to process but is not easily viewable or editable. https://www.cog-genomics.org/plink/1.9/formats#bed
7.5.7 .bim
A plink format file, similar to a .map file, that is used with binary .bed files. https://www.cog-genomics.org/plink/1.9/formats#bim
7.5.8 .fam
A plain text format that contains sample information for plink binary files. https://www.cog-genomics.org/plink/1.9/formats#fam
7.5.9 .csv
A plain text format where different fields are separated by commas (comma-separated values).
7.5.10 .vcf
VCF files are a flexible file format for storing different types of genetic variants. They are a plain text format that can be inspected on the command line with standard Unix tools. However, they are often very large files, and specific tools such as 'vcftools' are useful for working with this data. SNPs are commonly stored in these files, but other variants such as copy number variations can also be stored. The format is described at: https://en.wikipedia.org/wiki/Variant_Call_Format
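Because a VCF file is tab-separated plain text, a small file can be read directly in Python. The sketch below uses only the standard library and the standard VCF column layout; `open_vcf` and `read_vcf` are hypothetical helper names, the sample names S1 and S2 are invented for the example, and for large files dedicated tools such as bcftools remain the better choice.

```python
import gzip
import io

FIXED_COLUMNS = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")


def open_vcf(path):
    """Open a plain or gzip/bgzip-compressed VCF file as text."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_vcf(handle):
    """Yield (fixed_fields, genotypes_by_sample) for each variant record."""
    samples = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        fixed = dict(zip(FIXED_COLUMNS, fields[:8]))
        format_keys = fields[8].split(":") if len(fields) > 8 else []
        genotypes = {
            sample: dict(zip(format_keys, entry.split(":")))
            for sample, entry in zip(samples, fields[9:])
        }
        yield fixed, genotypes


if __name__ == "__main__":
    example = io.StringIO(
        "##fileformat=VCFv4.2\n"
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
        "1\t1000\trs1\tA\tC\t.\tPASS\t.\tGT\t0/0\t0/1\n"
    )
    for fixed, genotypes in read_vcf(example):
        print(fixed["ID"], genotypes["S2"]["GT"])
```

All values are returned as strings, so positions and qualities need converting before numeric use.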
7.5.11 .bcf
This is a binary version of a vcf file. It cannot be inspected directly on the command line, but can be used with the genomic tools mentioned in this document.
7.5.12 .tar.gz
This is a standard Unix file format for bundling and compressing a set of files. It is similar to a .zip file. It is made by first bundling a set of files into a .tar file (sometimes called a tarball), which is then compressed using 'gzip'. https://en.wikipedia.org/wiki/Tar_(computing) https://en.wikipedia.org/wiki/Gzip
7.5.13 .enc
This file extension is used as a convention to mean that the file is encrypted. You will need the password that was used to encrypt the data in order to decrypt the files. https://en.wikipedia.org/wiki/OpenSSL
7.6 Variant/SNP ids
There are many types of genetic variation. A common type is a single nucleotide polymorphism (SNP). Others include copy number variations.
Variants can be specified by a Chromosome and location in reference to a specific build of the human genome. They can also be given a reference SNP (rs) cluster identifier.
- Chr:Location
- Rs ids
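When files from different sources are combined, it helps to know which of the two identifier styles a variant uses. The following is a small sketch; the regular expressions are assumptions about common spellings (for example an optional "chr" prefix), and the function name is hypothetical. A Chr:Location identifier is only meaningful together with the genome build, which has to be tracked separately.

```python
import re

# "rs" followed by digits, e.g. rs12345
RS_PATTERN = re.compile(r"^rs(\d+)$")
# Optional "chr" prefix, a chromosome name, a colon, and a position.
CHR_POS_PATTERN = re.compile(
    r"^(?:chr)?([0-9]{1,2}|X|Y|MT?):(\d+)$", re.IGNORECASE
)


def parse_variant_id(variant_id):
    """Classify an identifier as an rs id, a chr:pos id, or unknown."""
    match = RS_PATTERN.match(variant_id)
    if match:
        return ("rsid", int(match.group(1)))
    match = CHR_POS_PATTERN.match(variant_id)
    if match:
        return ("chr_pos", match.group(1).upper(), int(match.group(2)))
    return ("unknown", variant_id)


if __name__ == "__main__":
    for identifier in ("rs12345", "chr1:1000", "X:5", "1:1000_A_G"):
        print(identifier, parse_variant_id(identifier))
```

Identifiers that carry extra information, such as alleles appended to the position, fall through to "unknown" here and would need their own pattern.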
7.7 Overview of Imputation reference panels
SNP array data frequently contain hundreds of thousands of variants. However, due to linkage disequilibrium, it is possible to estimate many more SNP values for an individual. This estimation procedure is called imputation, and it works by combining an individual's SNP array data with a large reference population of sequenced data. In this way it is possible to have accurate estimations of millions of SNP values for an individual without the cost of fully sequencing each person. ALSPAC has pre-run the imputation process using four different imputation panels.
7.7.1 Panels
- TOPMed
The reference panel most recently added to ALSPAC, which has the most SNPs.
- HRC
This is a more recent reference panel than 1000 Genomes, and our data contains circa 40 million SNPs.
- 1000 Genomes
This is the previous generation reference panel which is still widely used in ALSPAC studies. There are some SNPs that appear in this panel that are not in the HRC panel.
- HapMap
This was the first widely used imputation panel.
7.8 SNP data types from imputation.
SNPs that have been imputed can be stored and analysed in different formats. These can be appropriate for different types of analysis; for example, an analysis could assume an additive effect for the minor allele or it could assume a recessive/dominant effect.
- Best guess. The data will be presented as 0, 1 or 2 to represent how many copies of the minor allele a person has at that position. The best guess is derived from the genotype probabilities calculated by the imputation process.
- Dosage. This is the probability that the person has 0, 1 or 2 copies of the minor allele, e.g. 0.1, 0.2, 0.7. These will sum to one across the three possibilities (i.e. for each SNP for each individual). Note that many tools use the word dosage for the expected allele count derived from these probabilities (here 0.2 + 2 × 0.7 = 1.6).
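The relationship between the two representations can be sketched in a few lines. This is an illustrative sketch only: the function names and the optional threshold parameter are assumptions, and individual tools may apply their own calling thresholds or count a different allele.

```python
def expected_dosage(probs):
    """Expected number of copies of the counted allele: P(1) + 2 * P(2)."""
    p0, p1, p2 = probs
    if abs((p0 + p1 + p2) - 1.0) > 1e-6:
        raise ValueError("genotype probabilities should sum to one")
    return p1 + 2.0 * p2


def best_guess(probs, threshold=0.0):
    """Most probable genotype (0, 1 or 2), or None if it falls below threshold.

    The threshold is a hypothetical parameter; with the default of 0.0 the
    most probable genotype is always returned.
    """
    top = max(range(3), key=lambda g: probs[g])
    return top if probs[top] >= threshold else None
```

With the probabilities (0.1, 0.2, 0.7), expected_dosage gives approximately 1.6 and best_guess gives 2.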
7.9 SNP Statistics
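You can generate statistics on your SNP data using the program QCTOOL. Per-SNP statistics, which include the imputation information (info) scores, are produced with the -snp-stats option. For example:
qctool -g example.bgen -s example.sample -snp-stats -osnp snp-stats.txt
Per-sample statistics are produced with the -sample-stats option:
qctool -g example.bgen -s example.sample -sample-stats -osample sample-stats.txt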
7.10 Best practice
7.10.1 GWAS
We recommend you follow the steps outlined in the following paper when performing GWAS: Marees, Andries T., et al. "A tutorial on conducting genome‐wide association studies: Quality control and statistical analysis." International journal of methods in psychiatric research 27.2 (2018): e1608. https://doi.org/10.1002/mpr.1608
7.10.2 PheWAS
We recommend you follow the steps outlined in the following paper when performing PheWAS: Millard, L., Davies, N., Timpson, N. et al. MR-PheWAS: hypothesis prioritization among potential causal effects of body mass index on many outcomes, using Mendelian randomization. Sci Rep 5, 16645 (2015). https://doi.org/10.1038/srep16645
7.10.3 Methylation
The following paper describes the methylation data available in ALSPAC: Relton, Caroline L., et al. "Data resource profile: accessible resource for integrated epigenomic studies (ARIES)." International journal of epidemiology 44.4 (2015): 1181-1190.
7.11 Population stratification
Population stratification is when an observed genetic association is due to systematic differences in ancestry (population or geography) between groups rather than to a true effect of the variant. Not taking this into account can lead to biased estimates of effects. One common method to account for this is to calculate principal components (PCs) of the genetic data and then to include these as covariates in any models.
ALSPAC do not provide PCs as part of the standard omics datasets, as these would need to be regenerated and tested alongside each freeze. PCs can be generated using PLINK, Hail or a variety of other tools.
For more information about how to do this in plink see: https://www.cog-genomics.org/plink/1.9/strat
Another common method used to account for population substructure is to use linear mixed models, for example with the BOLT-LMM software tool.
7.12 Polygenic risk scores (PRS)
These are scores which estimate the combined effect of variants in an individual's genome on a given phenotypic trait or disease.
Further explanations can be found online, such as: https://www.genome.gov/Health/Genomics-and-Medicine/Polygenic-risk-scores
Or example tutorials for calculating PRSs: https://www.nature.com/articles/s41596-020-0353-1
Different collaborators often generate PRS for ALSPAC, but these are not shared as part of our standard omics datasets. Collaborators wishing for PRSs will need to generate these themselves.
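For collaborators generating their own scores, the core calculation in its simplest form is a weighted sum: each variant's allele dosage is multiplied by a per-allele effect size (typically taken from an external GWAS) and the products are added up. The sketch below illustrates only that arithmetic. The function name, variant IDs and weights are invented for illustration, and a real pipeline also needs steps not shown here, such as aligning effect alleles and selecting variants, as covered in the tutorial linked above.

```python
def polygenic_score(dosages, weights):
    """Sum of dosage * weight over variants present in both dictionaries.

    dosages: variant ID -> allele dosage (0 to 2) for one individual
    weights: variant ID -> per-allele effect size for the same counted allele
    Returns (score, number_of_variants_used).
    """
    # Only variants that have both a dosage and a weight contribute.
    shared = sorted(set(dosages) & set(weights))
    score = sum(dosages[v] * weights[v] for v in shared)
    return score, len(shared)
```

Returning the number of variants used makes it easy to spot cases where many weighted variants are missing from the genotype data.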
7.13 Common tasks
Here we provide links to webpages that give instructions, brief details or code for completing common tasks using the various software we have described above (section x):
- Extract some SNPs from a bgen data file and convert to plain text.
https://www.well.ox.ac.uk/~gav/qctool_v2/documentation/examples/filtering_variants.html
- Extract some SNPs from bed data:
http://zzz.bwh.harvard.edu/plink/dataman.shtml
plink --bfile mydata --chr 2 --from-kb 5000 --to-kb 10000
- Reading .bgen and .sample oxford files in plink
PLINK supports bgen files, but it is fussy about the column types in the data.sample file. You may need to remove or retype columns before PLINK will read a data.sample file. For more info see:
https://www.cog-genomics.org/plink/2.0/input
To make a new sample file removing some columns you can use the Unix command: 'cut -f 1,2,3 -d " " data.sample > data2.sample'
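Where a Unix shell is not available, the same column trimming can be sketched in Python. This is a minimal equivalent of the cut command above, assuming a space-delimited sample file; the function name is hypothetical and the file names are the same placeholders used above.

```python
def keep_sample_columns(in_path, out_path, n_columns=3):
    """Copy a space-delimited .sample file, keeping only the first n columns."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            # Header lines are trimmed in the same way as the data lines.
            fields = line.rstrip("\n").split(" ")
            dst.write(" ".join(fields[:n_columns]) + "\n")
```

For example, keep_sample_columns("data.sample", "data2.sample") writes a copy containing only the first three columns.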
7.14 Courses
Working with omics data can be complicated, but there are many excellent resources available to help you learn how to do this. There are both paid in-person courses and free online courses.
Details on paid courses offered by Bristol University can be found here: https://www.bristol.ac.uk/medical-school/study/short-courses/ In addition, a number of free online courses are summarised here: https://www.mooc-list.com/tags/bioinformatics
7.15 Further sources of help
7.15.1 Stack exchange
Stack Exchange is an online Q&A community which is divided into different sub-communities. The first and most well-known is Stack Overflow. This is one of the best places to ask questions about programming on the Internet. Other useful exchange sites include bioinformatics https://bioinformatics.stackexchange.com/, maths https://mathoverflow.net/ and statistics https://stats.stackexchange.com/.
7.15.2 Biostars
Biostars is a bioinformatics community Q&A website: https://www.biostars.org/
7.15.3 Mailing lists
For individual products or projects there is often a mailing list. For example, to get help using SNPTEST you can ask on the mailing list: https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html#contact
7.15.4 AI tools
AI tools such as ChatGPT can be useful for understanding how to work with omics data, but please be aware of their limitations and check the documentation or research papers directly.
7.15.5 Ask ALSPAC
If you cannot find the answer to your question, or you think there is something wrong with your data, then please contact the alspac-omics@bristol.ac.uk mailbox and we will do our best to help you.