Introduction

Welcome to the ALSPAC Omics Catalogue, a guide to the omics data offered by ALSPAC. This catalogue features a variety of named ALSPAC datasets, each consisting of collected or produced data that has been organized, named, and curated for ease of use. Every named ALSPAC dataset comes with accompanying metadata that provides information about the dataset as a whole. Each named ALSPAC dataset has at least one release version that includes a curated selection of files detailed in the metadata sections.

Please note that these datasets are not generally accessible. See http://www.bristol.ac.uk/alspac/researchers/access/ for details on how to request access.

The information within this catalogue is made available for browsing to help both internal ALSPAC users and external researchers understand the data and facilitate prospective data requests.

As standard, we offer external collaborators “freezes” of specific named ALSPAC datasets. These freezes, along with their metadata, are outlined in this catalogue. External collaborators will be granted access to these freezes once their request is approved. A freeze represents a carefully selected subset of data files within a version, containing the core data from a dataset, with individuals who have withdrawn consent removed and specific dataset IDs applied. These freezes are subject to periodic updates.

Documentation for the current freeze of each dataset is provided below as a yaml file, which lists the files external collaborators will receive together with their metadata.
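Each file record in the freeze documentation carries an `md5sum`, so a received freeze can be checked against it. The following is a minimal sketch of such a check, not an ALSPAC-provided tool: the function names are hypothetical, and it assumes the per-file records (`name`, `md5sum`, `belongs_to`) have already been loaded from the yaml into plain Python dictionaries.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(entries: list[dict], root: Path) -> list[str]:
    """List relative paths of files whose MD5 differs from the documented value.

    Each entry mirrors one per-file record of the freeze yaml and must
    provide the keys `name`, `md5sum` and `belongs_to`.
    """
    mismatched = []
    for entry in entries:
        relative = Path(entry["belongs_to"]) / entry["name"]
        if md5_of_file(root / relative) != entry["md5sum"]:
            mismatched.append(relative.as_posix())
    return mismatched
```

An empty list returned by `find_mismatches` means every listed file matched its documented checksum.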

Due to the removal of withdrawn individuals from the freezes, please note that the number of participants within each dataset may change over time and may not match those found in the Methodology fields.

Freeze information:

Number	Timing	Updates
1	July 2021 - Dec 2022	Updated to using the freeze release system.
2	Dec 2022 - Dec 2023	Dataset added: `dnam_epic450_g0_g1`. Update freeze 2.1 fixed truncated bgen issue in HRC dataset.
3	Jan 2023 - Oct 2024
4	Oct 2024 - June 2025	Datasets added: `wes_novaseq_g0_g1`, `wes_novaseq_g1`
5	June 2025 - Dec 2025	Dataset removed: `dnam_450_g0m_g1`. Dataset added: `gi_topmed_g0m_g1`.
6	Dec 2025 - June 2026	Dataset `dnam_epic450_g0_g1` has QC reports integrated into the freeze dataset. Dataset `gi_topmed_g0m_g1` now filters out monomorphic SNPs.
7	June 2026 - Current	Datasets removed: `gi_hapmap2_g0m`, `gi_hapmap2_g1`. Dataset added: `gi_topmed_g0p`. Dataset updated to TOPMed R3: `gi_topmed_g0m_g1`.

Genetic Array Data

Genome-wide - Illumina 550 quad - G1 (gwa_550_g1)

Description

This dataset contains genome-wide array genotype calls for G1 individuals.
Reference genome build: GRCh37

Methodology

ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platforms by 23andMe, subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%); and insufficient sample replication (IBD < 0.8).

Population stratification was assessed by multidimensional scaling analysis and compared with HapMap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed.

SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed. Cryptic relatedness was measured as proportion of identity by descent (IBD > 0.1). Related subjects were removed.
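The SNP-level thresholds above can be written as a single predicate. This is an illustrative sketch only: the `SnpStats` container and function name are hypothetical, and it assumes the per-SNP summary statistics have already been computed elsewhere (for example by PLINK).

```python
from dataclasses import dataclass


@dataclass
class SnpStats:
    maf: float        # minor allele frequency
    call_rate: float  # fraction of samples with a non-missing genotype
    hwe_p: float      # Hardy-Weinberg equilibrium test P value


def passes_snp_qc(snp: SnpStats) -> bool:
    """Keep a SNP only if it clears all three thresholds quoted in the text.

    A SNP is removed when MAF < 1%, call rate < 95%, or HWE P < 5E-7.
    """
    return snp.maf >= 0.01 and snp.call_rate >= 0.95 and snp.hwe_p >= 5e-7
```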

Associated publication:
- Horikoshi et al 2013 (https://doi.org/10.1038/ng.2477)

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gwa_550_g1_2022-12-05_f7
name: >-
  Genome-wide array data for G1 individuals 2022-12-05 freeze 7
description: >-
  The 7th freeze of the genome-wide array data for G1 based on the 2022-12-05 release. The data is in plink format.
  
  Contains .hh file, which is produced automatically when the input data contains heterozygous calls where they shouldn't be possible (haploid chromosomes, male X/Y), or there are nonmissing calls for nonmales on the Y chromosome. Consists of a text file with one line per error (sorted primarily by variant ID, secondarily by sample ID) with the following three fields:
    1. Family ID
    2. Within-family ID
    3. Variant ID

freeze_size: 997M
linker_file_md5sum: 2b2e49bc61c1a0efc3f57db6e48656ea
woc_file_md5sum: ee409be51e4e12594cb700dd1be99314
all_individuals_to_exclude_md5sum: 241c00aec78b7178d6797a83478dec02
git_tag: https://github.com/alspac/dataset_gwa_550_g1/releases/tag/freeze7
is_current_freeze: true
freeze_number: 7
freeze_date: 2026-04-14
previous_freeze: alspacdcs:gwa_550_g1_2022-12-05_f6
freeze_of_alspac_dataset_version: alspacdcs:gwa_550_g1_2022-12-05
freeze_of_named_alspac_dataset: alspacdcs:gwa_550_g1

contains:
- data
files: []
data:
  contains:
  - freeze_id.bed
  - freeze_id.bim
  - freeze_id.fam
  - freeze_id.hh
  - freeze_id.log
  files:
  - id: alspacdcs:5cf6901b-340b-4a29-91fc-148630c39490
    name: freeze_id.bed
    md5sum: 8e9e93a2d035960064b73705a5ae61f7
    filesize: 981.4MB
    filetype: .bed
    belongs_to: data
  - id: alspacdcs:36808773-0dde-4b3e-b520-7cdcd5c6060f
    name: freeze_id.bim
    md5sum: 81fe729cd2a42d7f4637e944eeb270e0
    filesize: 13.4MB
    filetype: .bim
    number_of_variants: 500527
    belongs_to: data
  - id: alspacdcs:6aad8b80-5ce0-41a4-b88a-abc77dd9abd9
    name: freeze_id.fam
    md5sum: 32ea8aac6bb2cb4c5d3d20ac7c142d4e
    filesize: 248.9KB
    filetype: .fam
    number_of_participants: 8221
    belongs_to: data
  - id: alspacdcs:0f99c664-537b-447b-bb33-71a6a9f3648c
    name: freeze_id.hh
    md5sum: aec33afc0e40b593f155c468084ca62e
    filesize: 1.6MB
    filetype: .hh
    belongs_to: data
  - id: alspacdcs:43358610-24fb-4cd4-866e-a4940d2ae35b
    name: freeze_id.log
    md5sum: d7d62020ada9cea6cee9e73b27e645e8
    filesize: 1.2KB
    filetype: .log
    belongs_to: data

Genome-wide - Illumina exome core array - G0 partners (gwa_exome_g0p)

Description

This dataset contains genome-wide array genotype calls for G0 mothers and partners.
Reference genome build: GRCh37

Methodology

3,453 ALSPAC mothers and fathers and 535,478 SNPs were genotyped using the Illumina HumanCoreExome chip genotyping platforms by the ALSPAC lab and called using GenomeStudio. The resulting raw genome-wide data were subjected to standard quality control methods using PLINK (v1.07). Individuals were excluded on the basis of gender mismatches (n = 80); minimal or excessive heterozygosity (n = 64); disproportionate levels of individual missingness (>5%, n = 60); and possible contamination (n = 3).

Population stratification was assessed by multidimensional scaling analysis and compared with 1000 Genomes phase 3 data and principal component analysis (n = 266); all individuals with non-European ancestry were removed. Cryptic relatedness was measured as SNP relatedness in GCTA (relatedness > 0.1, n = 69 removed). SNPs with a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 1E-7) and those which failed GenomeStudio quality control measures were removed (n = 21,298). 6,594 duplicate SNPs were also removed. This resulted in genotypes for 2,911 unrelated mothers and fathers at 507,586 SNPs. We then identified 2217 samples where the ALN assigned historically by the lab matched the genetically assigned ALN.
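The exclusion counts quoted above can be reconciled with the final totals by simple subtraction. The sketch below assumes that the exclusion categories do not overlap and that n = 266 is the number of individuals removed for non-European ancestry; neither assumption is stated explicitly in the text.

```python
# Sample-level exclusions, as quoted in the methodology text.
samples_start = 3453
sample_exclusions = {
    "gender mismatch": 80,
    "minimal or excessive heterozygosity": 64,
    "individual missingness > 5%": 60,
    "possible contamination": 3,
    "non-European ancestry (assumed meaning of n = 266)": 266,
    "cryptic relatedness": 69,
}
samples_remaining = samples_start - sum(sample_exclusions.values())

# SNP-level exclusions, as quoted in the methodology text.
snps_start = 535478
snp_exclusions = {
    "call rate, HWE or GenomeStudio QC failure": 21298,
    "duplicate SNPs": 6594,
}
snps_remaining = snps_start - sum(snp_exclusions.values())

print(samples_remaining, snps_remaining)  # 2911 507586
```

Under these assumptions the arithmetic reproduces the 2,911 individuals and 507,586 SNPs reported above.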

1737 putative G0 partner-G1 pairs for whom both G0 partner and G1 have called genotype data available were identified based on ALN. Given the G0 partners were invited by the G0 mother to take part and only enrolled in the study in their own right several years later, it could not be assumed that all G0 partners were biologically related to G1. Called genotype data for the 1720 unique G0 partners and 1737 unique G1s were merged (i.e. there were 17 pairs of siblings/twins among the G1 offspring), using plink v1.90b7.2 64-bit (11 Dec 2023).

After application of the plink filters --geno 0.05, --maf 0.01, --snps-only just-acgt and --autosome, 113288 SNPs remained. The --related command in KING version 2.3.2 was used to perform kinship analysis, which confirmed that all 1737 putative G0 partner-G1 pairs are genetically related. This is what would be expected for biological father-offspring pairs under the inference criteria described in Table 1 of Manichaikul, Ani, et al. “Robust relationship inference in genome-wide association studies.” Bioinformatics 26.22 (2010): 2867-2873.
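As a rough illustration of how a kinship coefficient maps to a degree of relatedness, the sketch below uses the kinship bounds commonly quoted from the cited table (powers of 2 between 2^-4.5 and 2^-1.5). The function name is hypothetical and the bounds are reproduced from memory as an assumption, so they should be checked against the paper before use. Separating parent-offspring pairs from full siblings also requires the proportion of zero IBS sharing, which is not modelled here.

```python
def relationship_degree(kinship: float) -> str:
    """Classify a pair by estimated kinship coefficient.

    Bounds are assumed from the KING inference criteria:
    > 2^-1.5 duplicate/MZ twin, > 2^-2.5 first degree,
    > 2^-3.5 second degree, > 2^-4.5 third degree, else unrelated.
    """
    if kinship > 2 ** -1.5:
        return "duplicate/MZ twin"
    if kinship > 2 ** -2.5:
        return "1st-degree"
    if kinship > 2 ** -3.5:
        return "2nd-degree"
    if kinship > 2 ** -4.5:
        return "3rd-degree"
    return "unrelated"


# A father-offspring pair has an expected kinship coefficient of 0.25.
print(relationship_degree(0.25))
```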

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gwa_exome_g0p_2016-11-22_f7
name: Freeze 7 version 2016-11-22 Genome-wide - Illumina exome core array - G0 partners
description: >-
  Freeze 7 version 2016-11-22 Genome-wide array data including genotype calls for G0 partners, including additional G0 mothers who were absent from previous genotyping rounds. 

  Data in plink format, including .hh file, which is produced automatically when the input data contains heterozygous calls where they shouldn't be possible (haploid chromosomes, male X/Y), or there are nonmissing calls for nonmales on the Y chromosome. Consists of a text file with one line per error (sorted primarily by variant ID, secondarily by sample ID) with the following three fields:
    1. Family ID
    2. Within-family ID
    3. Variant ID

freeze_size: 281M
linker_file_md5sum: 2b2e49bc61c1a0efc3f57db6e48656ea
woc_file_md5sum: ee409be51e4e12594cb700dd1be99314
all_individuals_to_exclude_md5sum: 241c00aec78b7178d6797a83478dec02
git_tag: https://github.com/alspac/dataset_gwa_exome_g0p/releases/tag/freeze7
is_current_freeze: true
freeze_number: 7
freeze_date: 2026-04-14
previous_freeze: alspacdcs:gwa_exome_g0p_2016-11-22_f6
freeze_of_alspac_dataset_version: alspacdcs:gwa_exome_g0p_2016-11-22
freeze_of_named_alspac_dataset: alspacdcs:gwa_exome_g0p

contains:
- data
files: []
data:
  contains:
  - freeze_id.bed
  - freeze_id.bim
  - freeze_id.fam
  - freeze_id.hh
  - freeze_id.log
  files:
  - id: alspacdcs:51780ba8-1288-404e-a5b5-6c4ee486b38f
    name: freeze_id.bed
    md5sum: 304b0d356880c5174806ce08d7beffd3
    filesize: 266.2MB
    filetype: .bed
    belongs_to: data
  - id: alspacdcs:1a7450fa-a2e6-4798-986f-21bd2397c8eb
    name: freeze_id.bim
    md5sum: 0fe43f888776059fef0a76d3f08d00ad
    filesize: 13.9MB
    filetype: .bim
    number_of_variants: 507586
    belongs_to: data
  - id: alspacdcs:f621de26-53b1-4cb6-8e3c-648eb62b000d
    name: freeze_id.fam
    md5sum: db01a4cc170d48733f5e921ae1551c2e
    filesize: 122.3KB
    filetype: .fam
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:82c5a47b-6e69-4d26-a598-4ad6ea35c236
    name: freeze_id.hh
    md5sum: 6b370ecf18b2b7551567cfacc5ff736a
    filesize: 115.3KB
    filetype: .hh
    belongs_to: data
  - id: alspacdcs:f7c63be4-bc0b-4ce0-afc6-16dcc0791601
    name: freeze_id.log
    md5sum: 9779b1f24dd0bd9e6ab40cba235a8027
    filesize: 1.2KB
    filetype: .log
    belongs_to: data

Genome-wide - Illumina 660 quad - G0 mothers (gwa_660_g0m)

Description

This dataset contains genome-wide array data including raw files and genotype calls for G0 mothers.
Legacy 1 reference genome: GRCh36
Legacy 2 reference genome: GRCh37

Methodology

SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally, SNPs with a minor allele frequency of less than 1% were removed. Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded.

Cryptic relatedness was assessed using an IBD estimate of more than 0.125, which is expected to correspond to roughly 12.5% of alleles shared IBD, or relatedness at the first-cousin level. Related subjects that passed all other quality control thresholds were retained. In total, 9,048 subjects and 526,688 SNPs passed these quality control filters.
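The link between an IBD proportion of 0.125 and first cousins follows from counting meioses along each path through a common ancestor. The sketch below uses the standard coefficient of relationship for non-inbred pedigrees; the function name is hypothetical and this is a worked illustration, not part of the QC pipeline.

```python
def expected_ibd_proportion(meioses_per_path: int, common_ancestors: int) -> float:
    """Expected proportion of alleles shared IBD for a simple, non-inbred relationship.

    Each common ancestor contributes (1/2) ** meioses_per_path, where
    meioses_per_path counts the meioses linking the two relatives through
    that ancestor.
    """
    return common_ancestors * 0.5 ** meioses_per_path


# First cousins share two grandparents, each linked by a path of four meioses.
first_cousins = expected_ibd_proportion(meioses_per_path=4, common_ancestors=2)
# Half siblings share one parent, linked by a path of two meioses.
half_siblings = expected_ibd_proportion(meioses_per_path=2, common_ancestors=1)
print(first_cousins, half_siblings)  # 0.125 0.25
```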

Associated publication:
- Rietveld et al 2013 (https://doi.org/10.1126/science.1235488)

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gwa_660_g0m_2022-12-05_f7
name: Freeze 7 version 2022-12-05 Genome-wide - Illumina 660 quad - G0 mothers
description: >-
  Freeze 7 of genome-wide array data including genotype calls for G0 mothers.

  Contains 2 sets of data, legacy1 and legacy2. 
  legacy1: A dir/folder containing the plink data files. 
      Includes full set of SNPs, but missing ~500 mothers who 
      were excluded in legacy QC due to strict relatedness inclusion thresholds.
  legacy2: A dir/folder containing the plink data files
      Includes full set of individuals but due to legacy QC is restricted
      to a set of ~480k SNPs that overlap with the Illumina 550k array 
      (which was used for G1 in gwa_550_g1). This QC was performed alongside liftOver to Hg37.

freeze_size: 2G
linker_file_md5sum: 2b2e49bc61c1a0efc3f57db6e48656ea
woc_file_md5sum: ee409be51e4e12594cb700dd1be99314
all_individuals_to_exclude_md5sum: 241c00aec78b7178d6797a83478dec02
git_tag: https://github.com/alspac/dataset_gwa_660_g0m/releases/tag/freeze7
is_current_freeze: true
freeze_number: 7
freeze_date: 2026-04-14
freeze_of_alspac_dataset_version: alspacdcs:gwa_660_g0m_2022-12-05
freeze_of_named_alspac_dataset: alspacdcs:gwa_660_g0m

contains:
- data
files: []
data:
  contains:
  - legacy2
  - legacy1
  files: []
  legacy2:
    contains:
    - freeze_id.bed
    - freeze_id.bim
    - freeze_id.fam
    - freeze_id.log
    files:
    - id: alspacdcs:d818bfb5-c99c-4db3-88e2-7037c819f6e2
      name: freeze_id.bed
      md5sum: d9be87ec7d2a87429a347adc6efc9db4
      filesize: 959.8MB
      filetype: .bed
      belongs_to: data/legacy2
    - id: alspacdcs:6d72d6e2-f0bb-460a-b949-05f499d4fefb
      name: freeze_id.bim
      md5sum: a733e7ae6ac46a855cc400925a2138fa
      filesize: 12.3MB
      filetype: .bim
      number_of_variants: 465740
      belongs_to: data/legacy2
    - id: alspacdcs:64934a22-64fa-4a53-9d03-a7f896604144
      name: freeze_id.fam
      md5sum: 6b74cf1f2cebd886a1d52eeac76bac43
      filesize: 447.3KB
      filetype: .fam
      number_of_participants: 8643
      belongs_to: data/legacy2
    - id: alspacdcs:e0df3154-5ee3-4397-bdb0-4847a3bf85db
      name: freeze_id.log
      md5sum: 083c818ca06a68f73c1cb79bccc0feda
      filesize: 1.1KB
      filetype: .log
      belongs_to: data/legacy2
  legacy1:
    contains:
    - freeze_id.bed
    - freeze_id.bim
    - freeze_id.fam
    - freeze_id.log
    files:
    - id: alspacdcs:00fb4b88-5661-4a44-a106-f718cb98ab58
      name: freeze_id.bed
      md5sum: 43f6862b721f637e557cae589414c4b8
      filesize: 1019.1MB
      filetype: .bed
      belongs_to: data/legacy1
    - id: alspacdcs:86d0b324-61e9-4a24-9ce3-d958015a5b94
      name: freeze_id.bim
      md5sum: 05b87fd6e4e7ef0c66ccc566bf5786bc
      filesize: 14.0MB
      filetype: .bim
      number_of_variants: 526688
      belongs_to: data/legacy1
    - id: alspacdcs:0624e984-b4a9-4c9e-b8ed-0e703c8e6c70
      name: freeze_id.fam
      md5sum: 4e0e14fa075f6b1c418c595e6bd805fb
      filesize: 253.5KB
      filetype: .fam
      number_of_participants: 8113
      belongs_to: data/legacy1
    - id: alspacdcs:4b64032e-6c27-4962-8bce-91942f138565
      name: freeze_id.log
      md5sum: bdcc0e05a9ab89468779ffd131e247f7
      filesize: 1.1KB
      filetype: .log
      belongs_to: data/legacy1

Genome-wide - CNV - G1 (cnv_550_g1)

Description

This dataset contains predicted ALSPAC CNVs using PennCNV, generated from 23andMe raw genotype data.

Methodology

original-cnv:

LRR and BAF data was missing from the 23andMe raw genotype data, so we had to generate this data ourselves using an in-house algorithm. Once this data was generated, we ran PennCNV using the hh550 libraries.

These are filtered PennCNV calls. Multiple calls were merged using the clean_cnv.pl script, using a merge fraction of 0.5. Individuals with > 30 CNVs, a Log R Ratio SD of > 0.3, a BAF drift of > 0.002, and a waviness factor of > 0.05 were removed. CNVs in which at least 50% of the length of the CNV call overlapped with any of the telomeric, centromeric, or immunoglobulin regions were removed using the ‘scan_region.pl’ script in PennCNV.

In addition, CNVs covering fewer than 5 probes, of a length < 5kb, and with a confidence score of below 10 were removed. Density was calculated as the number of probes in a CNV divided by the length of the CNV, and CNVs where the density of probes across the call was < 1 probe per 20kb were removed.

These QC parameters are suggestions only and provided in filtered.cnv. Analysts can apply their own filter parameters to the raw calls in data.cnv.
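For analysts applying their own parameters, the call-level filters described above could be sketched as follows. The `CnvCall` container, field names and function name are hypothetical. The sketch assumes that a call is removed if it fails any one of the probe-count, length or confidence criteria; the text could also be read as requiring all three.

```python
from dataclasses import dataclass


@dataclass
class CnvCall:
    n_probes: int      # number of probes covered by the call
    length_bp: int     # call length in base pairs
    confidence: float  # PennCNV confidence score


def passes_call_filters(
    call: CnvCall,
    min_probes: int = 5,
    min_length_bp: int = 5_000,
    min_confidence: float = 10.0,
    min_density: float = 1 / 20_000,  # 1 probe per 20kb
) -> bool:
    """Apply the suggested call-level thresholds; defaults mirror the text."""
    if call.length_bp <= 0:
        return False
    density = call.n_probes / call.length_bp  # probes per base pair
    return (
        call.n_probes >= min_probes
        and call.length_bp >= min_length_bp
        and call.confidence >= min_confidence
        and density >= min_density
    )
```

Keeping the thresholds as keyword arguments makes it straightforward to re-run the filter with stricter or looser values.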

nd-cnv:

Full Neurodevelopmental CNV data available within dataset or on request: docs/nd-cnv/ALSPAC_CNV_protocol.md.

Raw .tab files were generated per individual in the dataset. LRR and BAF were missing from the 23andMe dataset so were generated using in-house code at ALSPAC. These files were formatted for input to PennCNV. Gcmodel and pfb files were generated for use in PennCNV using the inbuilt cal_gc_snp.pl and compile_pfb.pl scripts from PennCNV.

CNVs were called using PennCNV: CNVs were detected using detect_cnv.pl, then multiple calls were merged using clean_cnv.pl and filtered using filter_cnv.pl with defaults (LRR SD 0.3, BAF drift 0.01, WF 0.05), all chromosomes included, and the minimum number of SNPs set to 3.
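The sample-level thresholds differ between the two CNV versions (BAF drift 0.002 with a cap of 30 CNVs for original-cnv, BAF drift 0.01 for nd-cnv). The sketch below places both sets side by side. The class and function names are hypothetical, and comparing the absolute value of the waviness factor against the threshold is an assumption rather than something the text states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SampleQcThresholds:
    max_lrr_sd: float
    max_baf_drift: float
    max_waviness: float
    max_cnv_count: Optional[int] = None  # None means no cap on CNVs per sample


# Values as quoted in the methodology text for each version.
ORIGINAL_CNV = SampleQcThresholds(0.3, 0.002, 0.05, max_cnv_count=30)
ND_CNV = SampleQcThresholds(0.3, 0.01, 0.05)


def sample_passes(lrr_sd: float, baf_drift: float, waviness: float,
                  n_cnvs: int, thresholds: SampleQcThresholds) -> bool:
    """Return True if a sample is within every threshold of the chosen set."""
    if thresholds.max_cnv_count is not None and n_cnvs > thresholds.max_cnv_count:
        return False
    return (
        lrr_sd <= thresholds.max_lrr_sd
        and baf_drift <= thresholds.max_baf_drift
        and abs(waviness) <= thresholds.max_waviness  # assumption: sign ignored
    )
```

A sample with a BAF drift of 0.005, for example, passes under the nd-cnv thresholds but not under the original-cnv ones.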

A final list of ND CNV carriers was derived using custom R code.

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:cnv_550_g1_2025-05-08_f6
name: Genome-wide - CNV - G1 release version 2025-05-08 freeze 7
description: >-
  The seventh freeze of the cnv_550_g1 dataset.
  
  This is the first freeze containing an updated release of the CNV dataset.
  
  original-cnv:
    The original version (data/original-cnv) contains two csv versions of the originally called CNV data: an unfiltered version and a filtered version.
    LRR and BAF data was missing from the 23andMe raw genotype data, so was generated for this data using an in-house algorithm. Once this data was generated, we ran PennCNV using the hh550 libraries.
    These are filtered PennCNV calls. Multiple calls were merged, using a merge fraction of 0.5. Individuals with > 30 CNVs, a Log R Ratio SD of > 0.3, a BAF drift of > 0.002, and a waviness factor of > 0.05 were removed. CNVs in which at least 50% of the length of the CNV call overlapped with any of the telomeric, centromeric, or immunoglobulin regions were removed in PennCNV.

    In addition, CNVs covering fewer than 5 probes, of a length < 5kb, and with a confidence score of below 10 were removed. Density was calculated as the number of probes in a CNV divided by the length of the CNV, and CNVs where the density of probes across the call was < 1 probe per 20kb were removed.

    These QC parameters are suggestions only and provided in filtered.cnv. Analysts can apply their own filter parameters to the raw calls in data.cnv

  nd-cnv:
    Updated version of CNV data, running the Cardiff MRC Pathfinder pipeline which was originally designed for identifying neurodevelopmental CNVs. This was generated using .tab files from the original CNV dataset, alongside LRR and BAF which had been derived.

    A full protocol description is provided in docs/nd-cnv/ALSPAC_CNV_protocol.md.

    The files with omics IDs within contain a 'duplicate_sample' column. This originates from some individuals having multiple samples sent. This is marked to True for only 1 of the duplicates, if any are present. This allows matching the duplicate from the data to the metadata, such as indicating those which pass QC. 

    Suggestions:
      1. For researchers looking to do entirely their own QC - i.e. no decisions regarding QC have been made to create these files
        - freeze.rawcnvs, freeze.qcsum, ALSPAC.pfb

      2. For researchers looking for a cleaner dataset - i.e. some decisions around QC have been made to create these files. ALSPAC.qcsum still needed to apply own filters
        - freeze.goodcnv, freeze.qcpass, freeze.qcsum

    Associated publication: 
    - DOI: 10.1016/j.biopsych.2025.03.004

freeze_size: 72m
linker_file_md5sum: 2b2e49bc61c1a0efc3f57db6e48656ea
woc_file_md5sum: ee409be51e4e12594cb700dd1be99314
all_individuals_to_exclude_md5sum: 241c00aec78b7178d6797a83478dec02
git_tag: https://github.com/alspac/dataset_cnv_550_g1/releases/tag/freeze7
is_current_freeze: true
freeze_number: 7
freeze_date: 2026-04-14
previous_freeze: alspacdcs:cnv_550_g1_2015-11-09_f6
freeze_of_alspac_dataset_version: alspacdcs:cnv_550_g1_2025-05-08
freeze_of_named_alspac_dataset: alspacdcs:cnv_550_g1

contains:
- data
- docs
files: []
data:
  contains:
  - nd-cnv
  - original-cnv
  files: []
  nd-cnv:
    contains:
    - metadata
    - cnvdata
    files: []
    metadata:
      contains:
      - exon.NCBI.NRXN1
      - exon.PAFAH1B1
      - exon.YWHAE
      - ALSPAC.gcmodel
      - ALSPAC.pfb
      - freeze.qcpass
      - freeze.qcsum
      - CNVS.KK.2019.Sorted.txt
      - freeze.ids_fail_qc.txt
      files:
      - id: alspacdcs:d99dbd7d-128d-421f-aec9-72cec240416c
        name: exon.NCBI.NRXN1
        md5sum: 59a58150c3cabc327a1095396cf2c78d
        filesize: 9.3KB
        filetype: .NRXN1
        belongs_to: data/nd-cnv/metadata
      - id: alspacdcs:7f29c4a9-eca9-45f6-ad33-5322b3b66a50
        name: exon.PAFAH1B1
        md5sum: 3f946e6a9a9f7ef9f6edf93d93d14e84
        filesize: 744.0B
        filetype: .PAFAH1B1
        belongs_to: data/nd-cnv/metadata
      - id: alspacdcs:05116db4-9aee-4eb2-b466-572a8b32a708
        name: exon.YWHAE
        md5sum: 3f946e6a9a9f7ef9f6edf93d93d14e84
        filesize: 744.0B
        filetype: .YWHAE
        belongs_to: data/nd-cnv/metadata
      - id: alspacdcs:56bb7a9f-4ebf-48d4-8656-4250cb38966a
        name: ALSPAC.gcmodel
        md5sum: 171c8202b70c9ac40a65a472d92b6e4f
        filesize: 13.0MB
        filetype: .gcmodel
        belongs_to: data/nd-cnv/metadata
      - id: alspacdcs:7d1da910-c626-4998-8989-c4d6e442520c
        name: ALSPAC.pfb
        md5sum: e5205d1b881332b81bb68c336d2c1cf8
        filesize: 12.5MB
        filetype: .pfb
        belongs_to: data/nd-cnv/metadata
      - id: alspacdcs:cc18e5c1-cb13-48b9-800f-131f3d22a885
        name: freeze.qcpass
        md5sum: ab4d7b1449a8690f171834951d73a2cf
        filesize: 239.4KB
        filetype: .qcpass
        belongs_to: data/nd-cnv/metadata
      - id: alspacdcs:00fafbba-0399-46eb-a503-b0a40cce8f00
        name: freeze.qcsum
        md5sum: 626b12d5a805345066882cbea2307f9d
        filesize: 718.5KB
        filetype: .qcsum
        belongs_to: data/nd-cnv/metadata
      - id: alspacdcs:f8c7f34d-0d99-4797-91c4-49b03a1520d9
        name: CNVS.KK.2019.Sorted.txt
        md5sum: 60ca28d8d9c768ce7809b6adae817e23
        filesize: 2.5KB
        filetype: .txt
        belongs_to: data/nd-cnv/metadata
      - id: alspacdcs:05116fc6-656b-45da-9e2d-613f0ec731f1
        name: freeze.ids_fail_qc.txt
        md5sum: 75644a5645b1bb4c3dc1a782ed41236e
        filesize: 268.8KB
        filetype: .txt
        belongs_to: data/nd-cnv/metadata
    cnvdata:
      contains:
      - freeze.goodcnv
      - freeze.rawcnvs
      files:
      - id: alspacdcs:d3638a00-2c5c-418d-9759-be0b345d59d9
        name: freeze.goodcnv
        md5sum: ad2cfa64ae53556d6d17ca6761b522a7
        filesize: 4.3MB
        filetype: .goodcnv
        belongs_to: data/nd-cnv/cnvdata
      - id: alspacdcs:0406571b-565b-47b9-a476-7000f247ea18
        name: freeze.rawcnvs
        md5sum: 1ced3622411808452f7a95ba922a3267
        filesize: 14.2MB
        filetype: .rawcnvs
        belongs_to: data/nd-cnv/cnvdata
  original-cnv:
    contains:
    - timepoints.csv
    - filtered.csv
    - cnvdata.csv
    files:
    - id: alspacdcs:92aa7eb0-dc2f-46ab-a940-28366c5e47d5
      name: timepoints.csv
      md5sum: 90640b955d6aaf63866b3cc0281c9bee
      filesize: 334.5KB
      filetype: .csv
      belongs_to: data/original-cnv
    - id: alspacdcs:c1b84856-19b8-4705-a3d0-06586fd73148
      name: filtered.csv
      md5sum: ce186871a415e12bedd90278fc24d75a
      filesize: 5.8MB
      filetype: .csv
      belongs_to: data/original-cnv
    - id: alspacdcs:7a92ad5b-499d-4b18-9f8c-c40096cf3a75
      name: cnvdata.csv
      md5sum: a9c4f47453f184c86d1058be20c386cf
      filesize: 20.8MB
      filetype: .csv
      belongs_to: data/original-cnv
docs:
  contains:
  - nd-cnv
  nd-cnv:
    contains:
    - CNV_QC_ALSPAC.Rmd
    - ALSPAC_CNV_protocol.md
    files:
    - id: alspacdcs:73990a85-038f-443e-b07c-e271614779e1
      name: CNV_QC_ALSPAC.Rmd
      md5sum: 14305de35583750ffa0d8c1cc73ee966
      filesize: 35.3KB
      filetype: .Rmd
      belongs_to: docs/nd-cnv
    - id: alspacdcs:8f93f938-2374-4b9e-a31b-2a563ad0822c
      name: ALSPAC_CNV_protocol.md
      md5sum: 892716e4980da9b80896e53cb0928432
      filesize: 10.0KB
      filetype: .md
      belongs_to: docs/nd-cnv

Imputed Data

Genome-wide - HRC imputed - G0 mothers + G1 (gi_hrc_g0m_g1)

Description

This dataset contains genotype data imputed to HRC for G0 mothers and G1.
Reference genome build: GRCh37

Methodology

For the ALSPAC children (G1), SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed.

Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,115 subjects and 500,527 SNPs passed these quality control filters.

ALSPAC mothers were genotyped using the Illumina human660W-quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs. SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally SNPs with a minor allele frequency of less than 1% were removed.

Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded.

9,048 subjects and 526,688 SNPs passed these quality control filters.

We combined 477,482 SNP genotypes in common between the sample of mothers and sample of children. We removed SNPs with genotype missingness above 1% due to poor quality (11,396 SNPs removed) and removed a further 321 subjects due to potential ID mismatches. This resulted in a dataset of 17,842 subjects containing 6,305 duos and 465,740 SNPs (112 were removed during liftover and 234 were out of HWE after combination). We estimated haplotypes using SHAPEIT (v2.r644), which utilises relatedness during phasing. The phased haplotypes were then imputed to the Haplotype Reference Consortium (HRCr1.1, 2016) panel of approximately 31,000 phased whole genomes. The HRC panel was phased using SHAPEIT v2.r727, and the imputation was performed using the Michigan Imputation Server.
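The subject and SNP counts in this section can be reconciled with a short calculation. The sketch assumes that the 9,115 subjects quoted earlier are the QC-passing children, that the 9,048 subjects are the QC-passing mothers, and that the combined sample is their sum before the ID-mismatch removals; the text does not state this directly.

```python
# Subjects: QC-passing children plus QC-passing mothers, minus ID mismatches.
children_passing_qc = 9115   # assumed to be the G1 sample
mothers_passing_qc = 9048
removed_id_mismatch = 321
combined_subjects = children_passing_qc + mothers_passing_qc - removed_id_mismatch

# SNPs: shared SNPs minus each removal step quoted in the text.
shared_snps = 477482
removed_missingness = 11396
removed_liftover = 112
removed_hwe = 234
combined_snps = shared_snps - removed_missingness - removed_liftover - removed_hwe

print(combined_subjects, combined_snps)  # 17842 465740
```

Under these assumptions both totals agree with the 17,842 subjects and 465,740 SNPs reported above.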

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gi_hrc_g0m_g1_2017-05-04_f7
name: >-
  Genome-wide - HRC imputed - G0 mothers + G1 version 2017-05-04 freeze 7
description: >-
  Freeze 7 of version 2017-05-04 Genome-wide array data imputed to the HRC reference panel for G0 mothers and G1 individuals in bgen and sample file format (version 1.2). 
freeze_size: 114G
linker_file_md5sum: 2b2e49bc61c1a0efc3f57db6e48656ea
woc_file_md5sum: ee409be51e4e12594cb700dd1be99314
all_individuals_to_exclude_md5sum: 241c00aec78b7178d6797a83478dec02
git_tag: https://github.com/alspac/dataset_gi_hrc_g0m_g1/releases/tag/freeze7
is_current_freeze: true
freeze_number: 7
freeze_date: 2026-04-14
previous_freeze: alspacdcs:gi_hrc_g0m_g1_2017-05-04_f6
freeze_of_alspac_dataset_version: alspacdcs:gi_hrc_g0m_g1_2017-05-04
freeze_of_named_alspac_dataset: alspacdcs:gi_hrc_g0m_g1

contains:
- data
files: []
data:
  contains:
  - filtered_23female.bgen
  - filtered_17.bgen
  - filtered_16.bgen
  - filtered_11.bgen
  - filtered_12.bgen
  - filtered_23male.bgen
  - filtered_10.bgen
  - filtered_19.bgen
  - filtered_08.bgen
  - filtered_15.bgen
  - filtered_04.bgen
  - filtered_20.bgen
  - filtered_18.bgen
  - filtered_05.bgen
  - filtered_09.bgen
  - filtered_21.bgen
  - filtered_07.bgen
  - filtered_06.bgen
  - filtered_22.bgen
  - filtered_14.bgen
  - filtered_13.bgen
  - filtered_03.bgen
  - filtered_01.bgen
  - filtered_02.bgen
  - swapped_23_female.sample
  - swapped.sample
  - swapped_23_male.sample
  files:
  - id: alspacdcs:b8fffc07-d4a4-468f-8ebc-fe9c1aa33c8f
    name: filtered_23female.bgen
    md5sum: 9a2fa1521500ffb506fbeb05b0f6e434
    filesize: 4.2GB
    filetype: .bgen
    number_of_variants: 1228035
    number_of_participants: 12937
    belongs_to: data
  - id: alspacdcs:8254a9d1-641b-4d08-a53d-006c74e707f6
    name: filtered_17.bgen
    md5sum: 2c76ac68f7a873f24a9fea3b99ea0563
    filesize: 3.6GB
    filetype: .bgen
    number_of_variants: 1090072
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:80710916-8157-4340-ad93-952d0bbd7bc0
    name: filtered_16.bgen
    md5sum: a67661d1cb9206d67f33af81de2efab1
    filesize: 4.1GB
    filetype: .bgen
    number_of_variants: 1281298
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:5dd81333-bc27-44b1-81e4-17460f6406a0
    name: filtered_11.bgen
    md5sum: 40ede8af9ca8bf97e9cc4fc561ef83f3
    filesize: 5.2GB
    filetype: .bgen
    number_of_variants: 1936990
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:ca176bf9-36d2-4b7f-91f8-7c828d509688
    name: filtered_12.bgen
    md5sum: f85a9e5d0ced40db2a7dd3dcf0d7c457
    filesize: 5.1GB
    filetype: .bgen
    number_of_variants: 1848118
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:4de54b22-7ac4-4824-8b17-109b46235ca2
    name: filtered_23male.bgen
    md5sum: 0a865f7f362a08741f62980790ea00d1
    filesize: 1.2GB
    filetype: .bgen
    number_of_variants: 1228035
    number_of_participants: 4500
    belongs_to: data
  - id: alspacdcs:ea229c07-5322-4206-8276-a4e64a0cb2a4
    name: filtered_10.bgen
    md5sum: 38ec07bb152ec1a4bd9c9919461c0c9b
    filesize: 5.1GB
    filetype: .bgen
    number_of_variants: 1927504
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:a1e50f8b-587d-443e-9a1b-f7d32d10f172
    name: filtered_19.bgen
    md5sum: b165787a2ad0e9875d78d30f6036a725
    filesize: 3.4GB
    filetype: .bgen
    number_of_variants: 868554
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:0b5a705b-3177-46ba-a10a-95b06dc05f62
    name: filtered_08.bgen
    md5sum: 6c0542e2ee7998815a372ccce589bfe5
    filesize: 5.7GB
    filetype: .bgen
    number_of_variants: 2242706
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:72e9e8a7-1477-4186-8879-82a3b54e1021
    name: filtered_15.bgen
    md5sum: 3b50e97833f49bfac0fb3929c27c44a8
    filesize: 3.4GB
    filetype: .bgen
    number_of_variants: 1139215
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:33f65d0f-644a-43ad-bd2a-f651d96f2fcc
    name: filtered_04.bgen
    md5sum: 0335b240ea488e6e4b59e16ddf43eeb3
    filesize: 7.9GB
    filetype: .bgen
    number_of_variants: 2787582
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:3937fe02-7c4e-4dd0-b737-324449095b10
    name: filtered_20.bgen
    md5sum: 44af95fd98a1e105571ae52a6474e27e
    filesize: 2.6GB
    filetype: .bgen
    number_of_variants: 884983
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:0fceaa8a-c9bb-4d26-8bdc-2e5d6025f4f8
    name: filtered_18.bgen
    md5sum: 7bdae7c0a7b5f6bb20dfb46bca8a69da
    filesize: 3.1GB
    filetype: .bgen
    number_of_variants: 1104755
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:10d1f3e0-58f5-495a-b9ac-f491e2ede6bd
    name: filtered_05.bgen
    md5sum: c2957a725c7dd9353209591d538fb815
    filesize: 6.7GB
    filetype: .bgen
    number_of_variants: 2588170
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:59b22428-8dc9-4faa-889b-8f154d06b443
    name: filtered_09.bgen
    md5sum: 191a058144998119f8baeaa06d49d7ab
    filesize: 4.5GB
    filetype: .bgen
    number_of_variants: 1675899
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:8beb38ea-4b73-4e8b-8a8e-15991464cbb7
    name: filtered_21.bgen
    md5sum: 1036d105dc6e500b6447d2f0217468f5
    filesize: 1.7GB
    filetype: .bgen
    number_of_variants: 531276
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:45c7f914-a23e-40ad-83d3-9971b0484fbd
    name: filtered_07.bgen
    md5sum: 4ac90ecb1101025b355412c87b57fca7
    filesize: 6.6GB
    filetype: .bgen
    number_of_variants: 2289306
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:17d6bd43-566b-401d-a940-c2cb98e1b630
    name: filtered_06.bgen
    md5sum: 988cccee782acae516c29c75e123bcd8
    filesize: 6.3GB
    filetype: .bgen
    number_of_variants: 2460112
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:cde36994-cd52-4b1e-875a-351516b9b094
    name: filtered_22.bgen
    md5sum: 6f68665d403b3ed9d6c86f18c236d317
    filesize: 1.8GB
    filetype: .bgen
    number_of_variants: 524544
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:a8a85802-cd29-4b72-b1fc-121d564a4f5a
    name: filtered_14.bgen
    md5sum: b18ada6a30feb86a7dfad8d8fcb98b70
    filesize: 3.5GB
    filetype: .bgen
    number_of_variants: 1266536
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:830926f7-2378-40f4-b007-bb8179c389f0
    name: filtered_13.bgen
    md5sum: 25a5626dfbbfef9cd285e44341556e5c
    filesize: 3.7GB
    filetype: .bgen
    number_of_variants: 1385434
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:e1e04105-be30-4c1a-9947-aa161db9709c
    name: filtered_03.bgen
    md5sum: 316967e4c846cf11ec34141623762fde
    filesize: 7.3GB
    filetype: .bgen
    number_of_variants: 2821895
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:2ca4a8d8-433c-4564-8571-b87c95379214
    name: filtered_01.bgen
    md5sum: e098ec39b76a0372f15bc774bc6b33ba
    filesize: 8.6GB
    filetype: .bgen
    number_of_variants: 3069932
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:e0c7856f-4da4-41db-acc2-b667dc6de8b1
    name: filtered_02.bgen
    md5sum: 57c3e552bc0e37639b3846323a6acdbf
    filesize: 8.7GB
    filetype: .bgen
    number_of_variants: 3392238
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:37c201d1-6402-4e53-9158-0c12d6f393d6
    name: swapped_23_female.sample
    md5sum: 8ec5811791525c1f6c3886ba79b35475
    filesize: 745.5KB
    filetype: .sample
    number_of_participants: 12937
    belongs_to: data
  - id: alspacdcs:36430b10-42b4-4413-98f1-89a2fb70f0dd
    name: swapped.sample
    md5sum: 5c6aab2318c697366886eb8e29bc7e2a
    filesize: 1004.7KB
    filetype: .sample
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:b94f9615-68dc-4041-841c-aae1b073ee24
    name: swapped_23_male.sample
    md5sum: b514f227781e40a54a85edec7519d531
    filesize: 259.3KB
    filetype: .sample
    number_of_participants: 4500
    belongs_to: data

Genome-wide - 1000G imputed - G0 partners (gi_1000g_g0p)

Description

This dataset contains genome-wide array data imputed to the 1000 genomes reference panel for G0 partners, with some additional G0 mothers and G1 individuals. The data have been cleaned, flipped to the positive strand, placed in b37 coordinates and imputed to the 1000 genomes phase 1 version 3 reference panel.
Reference genome build: GRCh37

Methodology

Cryptic relatedness was measured as SNP relatedness in GCTA (relatedness > 0.1, n = 69 removed). SNPs with a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 1E-7) and those which failed GenomeStudio quality control measures were removed (n = 21,298). 6,594 duplicate SNPs were also removed.

This resulted in genotypes for 2,911 unrelated mothers and fathers at 507,586 SNPs. We then identified 2217 samples where the aln assigned historically by the lab matched the genetically assigned aln.

We phased the data of 3074 samples that passed QC, a set that still contained related subjects, using SHAPEIT v2.r837. We then removed 155,336 monomorphic SNPs, 1033 markers not in 1000 genomes, 11,842 A/T or G/C SNPs and 10 duplicate sites to give 337,732 SNPs on chromosomes 1-23. Of the 329,363 markers on chromosomes 1-22, 298,742 overlapped the reference genome. We imputed to the 1000 genomes phase 1 version 3 using the Michigan Imputation Server. We then identified 2217 samples where the aln assigned historically by the lab matched the genetically assigned aln. We then removed 12 subjects who have withdrawn consent and 6 subjects genotyped in an earlier work package to give 2201 subjects.

After application of the PLINK filters --geno 0.05, --maf 0.01, --snps-only just-acgt and --autosome, the --related command in KING version 2.3.2 was used to perform kinship analysis, which confirmed that all 1737 putative G0 partner-G1 pairs are genetically related. This would be expected for biological father-offspring pairs, using the inference criteria described in Table 1 of Manichaikul, Ani, et al. "Robust relationship inference in genome-wide association studies." Bioinformatics 26.22 (2010): 2867-2873.
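
The inference criteria cited above can be illustrated with a minimal sketch. It assumes the kinship-coefficient cut-offs of Table 1 in Manichaikul et al. (2010), which are powers of 1/2. The function name `classify_kinship` and the example values are hypothetical and are not part of the ALSPAC pipeline or of KING itself. Father-offspring pairs would be expected to fall in the first-degree range.

```python
# Kinship-coefficient cut-offs, as powers of 1/2 (Manichaikul et al. 2010, Table 1).
# Assumption: classification uses the kinship coefficient alone. KING also
# reports IBS0, which is needed to tell parent-offspring from full siblings.
DUPLICATE_MIN = 2 ** -1.5      # about 0.354
FIRST_DEGREE_MIN = 2 ** -2.5   # about 0.177
SECOND_DEGREE_MIN = 2 ** -3.5  # about 0.0884
THIRD_DEGREE_MIN = 2 ** -4.5   # about 0.0442


def classify_kinship(kinship: float) -> str:
    """Map an estimated kinship coefficient to a relationship category."""
    if kinship > DUPLICATE_MIN:
        return "duplicate/MZ twin"
    if kinship > FIRST_DEGREE_MIN:
        return "first-degree"
    if kinship > SECOND_DEGREE_MIN:
        return "second-degree"
    if kinship > THIRD_DEGREE_MIN:
        return "third-degree"
    return "unrelated"


# Hypothetical estimates for putative G0 partner-G1 pairs (not real ALSPAC values).
example_pairs = {"pair_a": 0.249, "pair_b": 0.021}
for pair, kinship in example_pairs.items():
    print(pair, classify_kinship(kinship))
```

A pair classified as "unrelated" in such a check would be the signal that a putative partner is not the biological father.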

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gi_1000g_g0p_2016-11-22_f7
name: Genome-wide - 1000G imputed - G0 partners version 2016-11-22 freeze 7
description: >-
  This dataset is the seventh freeze of the 2016-11-22 version of the genome-wide array data imputed to the 1000 genomes reference panel
  for G0 partners, with some additional G0 mothers and G1 individuals.

freeze_size: 44G
linker_file_md5sum: 2b2e49bc61c1a0efc3f57db6e48656ea
woc_file_md5sum: ee409be51e4e12594cb700dd1be99314
all_individuals_to_exclude_md5sum: 241c00aec78b7178d6797a83478dec02
git_tag: https://github.com/alspac/dataset_gi_1000g_g0p/releases/tag/freeze7
is_current_freeze: true
freeze_number: 7
freeze_date: 2026-04-14
previous_freeze: alspacdcs:gi_1000g_g0p_2016-11-22_f6
freeze_of_alspac_dataset_version: alspacdcs:gi_1000g_g0p_2016-11-22
freeze_of_named_alspac_dataset: alspacdcs:gi_1000g_g0p

contains:
- data
files: []
data:
  contains:
  - filtered_data_chr21.bgen
  - filtered_data_chr20.bgen
  - filtered_data_chr17.bgen
  - filtered_data_chr14.bgen
  - filtered_data_chr15.bgen
  - filtered_data_chr22.bgen
  - filtered_data_chr12.bgen
  - filtered_data_chr07.bgen
  - filtered_data_chr06.bgen
  - filtered_data_chr19.bgen
  - filtered_data_chr01.bgen
  - filtered_data_chr10.bgen
  - filtered_data_chr13.bgen
  - filtered_data_chr09.bgen
  - filtered_data_chr03.bgen
  - filtered_data_chr11.bgen
  - filtered_data_chr16.bgen
  - filtered_data_chr18.bgen
  - filtered_data_chr05.bgen
  - filtered_data_chr08.bgen
  - filtered_data_chr04.bgen
  - filtered_data_chr02.bgen
  - swapped.sample
  files:
  - id: alspacdcs:2144fe8c-8b2c-4ad2-8c81-670958d3eb8c
    name: filtered_data_chr21.bgen
    md5sum: 7881bdc24e7f0adbfb800b49d1efd590
    filesize: 671.1MB
    filetype: .bgen
    number_of_variants: 378064
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:291259ba-4380-4827-8655-0af8362bf156
    name: filtered_data_chr20.bgen
    md5sum: d241eb21be3188c26c460e1f65f0d8c1
    filesize: 1.1GB
    filetype: .bgen
    number_of_variants: 618749
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:d7364f68-1744-496d-b25c-c719754863cd
    name: filtered_data_chr17.bgen
    md5sum: 73d85caf67dcedc63b11a43bd5ccb44d
    filesize: 1.4GB
    filetype: .bgen
    number_of_variants: 755467
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:a2d06c1a-16cb-46fd-9a92-f27a3244e1b4
    name: filtered_data_chr14.bgen
    md5sum: 1ecd96aab2925bafd7d20497d85dd937
    filesize: 1.4GB
    filetype: .bgen
    number_of_variants: 903811
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:8e6b1ea8-08b8-42dc-82fc-113cbc5c53b5
    name: filtered_data_chr15.bgen
    md5sum: f8c5b54206189808e9a361cc0da63798
    filesize: 1.4GB
    filetype: .bgen
    number_of_variants: 814028
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:72cef321-3bf2-47c3-b809-4a8c058df1ae
    name: filtered_data_chr22.bgen
    md5sum: 824412e963441699f260c6245f65659d
    filesize: 721.5MB
    filetype: .bgen
    number_of_variants: 366590
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:5a210914-47cd-46c4-b964-586d301b21e9
    name: filtered_data_chr12.bgen
    md5sum: 509202db22200fe0bd58210ab8e9c757
    filesize: 2.1GB
    filetype: .bgen
    number_of_variants: 1316510
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:84297ea2-6ea4-4e03-85a2-6a3bd5db43a5
    name: filtered_data_chr07.bgen
    md5sum: f832922558eddcf3feed87091c2ec0ae
    filesize: 2.6GB
    filetype: .bgen
    number_of_variants: 1601293
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:c30d2a1c-97d2-4e61-bf39-25a191cc208e
    name: filtered_data_chr06.bgen
    md5sum: a9327ad1591fdf7d349b066544e71c3a
    filesize: 2.6GB
    filetype: .bgen
    number_of_variants: 1758025
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:c9848334-7253-436a-963e-54b017c063cf
    name: filtered_data_chr19.bgen
    md5sum: 37ea045cd9f4027cba547b7b89c3a1a0
    filesize: 1.2GB
    filetype: .bgen
    number_of_variants: 606147
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:08201a50-2a51-4e5f-ab96-fb5cbdc86000
    name: filtered_data_chr01.bgen
    md5sum: a5eb049e4df5a8b005ae51b47947d830
    filesize: 3.3GB
    filetype: .bgen
    number_of_variants: 2159337
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:dc7772ed-490d-4a35-8414-ef23c6dff4a5
    name: filtered_data_chr10.bgen
    md5sum: 8f64fe184e4c876a345a728ed5eeddcf
    filesize: 2.1GB
    filetype: .bgen
    number_of_variants: 1363104
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:f215a1f9-f7df-4347-ad26-8d4ad74e2a40
    name: filtered_data_chr13.bgen
    md5sum: 176a10d38ab80783a8e392e5791edea7
    filesize: 1.5GB
    filetype: .bgen
    number_of_variants: 988473
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:eed0d72d-0f84-4205-a894-26179d79c8f3
    name: filtered_data_chr09.bgen
    md5sum: 82a480f3e8792db2c1cec3adc50e1357
    filesize: 1.9GB
    filetype: .bgen
    number_of_variants: 1189463
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:dda907d1-6227-40a4-936b-8c19d2b97f51
    name: filtered_data_chr03.bgen
    md5sum: c0b55e9d65c219ffb1b8c58a0ebb7c18
    filesize: 3.0GB
    filetype: .bgen
    number_of_variants: 1969275
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:5282f711-76da-432a-8d8d-54f46b048052
    name: filtered_data_chr11.bgen
    md5sum: b1b7e3bef0fe72cd90bd0ba456f687aa
    filesize: 2.1GB
    filetype: .bgen
    number_of_variants: 1359640
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:b5f80dc3-9613-47d9-9c5f-359b9f440c8a
    name: filtered_data_chr16.bgen
    md5sum: 52f065575d3cb2dff34df6763a583766
    filesize: 1.5GB
    filetype: .bgen
    number_of_variants: 867901
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:8aff8808-8468-4623-8165-3b99fc6ea6b7
    name: filtered_data_chr18.bgen
    md5sum: b8e055a6c0955bb67161c9f7a1d8cad7
    filesize: 1.3GB
    filetype: .bgen
    number_of_variants: 783661
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:a571d744-f223-4a52-87cf-64add826bf3c
    name: filtered_data_chr05.bgen
    md5sum: f4accbf5bdd6a2ccc9598e9e2221915d
    filesize: 2.7GB
    filetype: .bgen
    number_of_variants: 1809961
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:0c37b6f6-5223-4e62-b688-ff6808374346
    name: filtered_data_chr08.bgen
    md5sum: 47d79712e676a0048f90858cbb888179
    filesize: 2.3GB
    filetype: .bgen
    number_of_variants: 1558902
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:7609d01d-efc9-48d9-aee7-dd24aff42885
    name: filtered_data_chr04.bgen
    md5sum: 514f09f02c74fc3eca83379e9e99c5dc
    filesize: 3.1GB
    filetype: .bgen
    number_of_variants: 1969883
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:835f458c-36f7-452d-b5a0-6d41f162e90b
    name: filtered_data_chr02.bgen
    md5sum: e297c8d30455053d23ac360bcc886bb0
    filesize: 3.5GB
    filetype: .bgen
    number_of_variants: 2349883
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:5366a1e0-61fe-4288-80d1-5efb9e4bdc64
    name: swapped.sample
    md5sum: bd7568dd7d222ea368957996e7b76a1b
    filesize: 164.9KB
    filetype: .sample
    number_of_participants: 2198
    belongs_to: data
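
The `md5sum` fields in a freeze document can be used to confirm that received files arrived intact. The following is a minimal sketch under stated assumptions: the two checksums are copied from the document above, while the directory name `freeze_dir` and the helper names `md5_of_file` and `verify_files` are hypothetical and not part of any ALSPAC tooling.

```python
import hashlib
from pathlib import Path

# md5sum values copied from the freeze document above (two entries shown).
EXPECTED_MD5 = {
    "filtered_data_chr21.bgen": "7881bdc24e7f0adbfb800b49d1efd590",
    "swapped.sample": "bd7568dd7d222ea368957996e7b76a1b",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(expected, directory):
    """Return {name: True, False or None}; None means the file was not found."""
    results = {}
    for name, md5sum in expected.items():
        path = Path(directory) / name
        results[name] = md5_of_file(path) == md5sum if path.is_file() else None
    return results


if __name__ == "__main__":
    # "freeze_dir" is a hypothetical location for the received files.
    for name, ok in verify_files(EXPECTED_MD5, "freeze_dir").items():
        print(name, {True: "ok", False: "MISMATCH", None: "missing"}[ok])
```

Reading in chunks keeps memory use flat, which matters for multi-gigabyte bgen files.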

Genome-wide - 1000G imputed - G0 mothers + G1 (gi_1000g_g0m_g1)

Description

This dataset contains genome-wide 1000G imputed data for G0 mothers + G1. The data have been cleaned, flipped to the positive strand, placed in b37 coordinates and imputed to the 1000 genomes phase 1 version 3 reference panel.
Reference genome build: GRCh37

Methodology

SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed. 9,115 subjects and 500,527 SNPs passed these quality control filters.

9,048 subjects and 526,688 SNPs passed these quality control filters.

We combined 477,482 SNP genotypes in common between the sample of mothers and the sample of children. We removed SNPs with genotype missingness above 1% due to poor quality (11,396 SNPs removed) and removed a further 321 subjects due to potential ID mismatches. This resulted in a dataset of 17,842 subjects containing 6,305 duos and 465,740 SNPs (112 were removed during liftover and 234 were out of HWE after combination). We estimated haplotypes using SHAPEIT (v2.r644), which utilises relatedness during phasing. We obtained a phased version of the 1000 genomes reference panel (Phase 1, Version 3) from the IMPUTE2 reference data repository (phased using SHAPEIT v2.r644, haplotype release date Dec 2013). Imputation of the target data was performed using IMPUTE v2.2.2 against the reference panel (all polymorphic SNPs excluding singletons), using all 2186 reference haplotypes (including non-Europeans).
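
The SNP counts in the paragraph above can be reconciled with a short calculation. This sketch assumes the three removals (missingness, liftover and HWE) are disjoint and applied to the 477,482 shared SNPs; the variable names are illustrative only.

```python
# Counts taken from the methodology text above.
common_snps = 477_482          # SNPs shared by the mothers and children panels
removed_missingness = 11_396   # genotype missingness above 1%
removed_liftover = 112         # lost during liftover
removed_hwe = 234              # out of HWE after combination

remaining = common_snps - removed_missingness - removed_liftover - removed_hwe
print(remaining)  # 465740, matching the 465,740 SNPs reported
```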

This gave 8,237 eligible children and 8,196 eligible mothers with available genotype data after exclusion of related subjects using cryptic relatedness measures described previously.

Known issues: There is a known strand issue within this imputation: the Dec 2013 haplotype release of 1000 genomes phase 1 version 3 has 199 reported SNPs with incorrect strand. For more information and the origins of this list please visit https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated_SHAPEIT2_16-06-14.html. The issue is very unlikely to have systematic effects across the genome and is most probably isolated to these 199 known problematic SNPs. Users are advised to discard these SNPs from their analyses.

The bgen files within the gi_1000g_g0m_g1 dataset have NA in the chromosome column. Some tools accept this, while others are less forgiving, so users may wish to re-format the dataset (using QCtool or equivalent) for their work.
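
When re-formatting, the chromosome value has to come from somewhere other than the bgen file itself. The sketch below assumes that the two-digit suffix in names such as `filtered_17.bgen` is the chromosome number. This is inferred from the file list rather than stated in the source, and the helper name `chromosome_from_filename` is hypothetical.

```python
import re

# Assumption: the two-digit suffix in names such as "filtered_17.bgen" is the
# chromosome number. This is inferred from the file list, not stated in the source.
_BGEN_NAME = re.compile(r"^filtered_(\d{2})\.bgen$")


def chromosome_from_filename(name):
    """Return the chromosome number encoded in a bgen file name, or None."""
    match = _BGEN_NAME.match(name)
    return int(match.group(1)) if match else None


for name in ["filtered_01.bgen", "filtered_17.bgen", "swapped.sample"]:
    print(name, chromosome_from_filename(name))
```

The returned value could then be passed to whichever re-formatting tool is used to fill in the chromosome column.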

Allele frequency concordance with other cohorts: When contributing to consortia you may find that the allele frequencies in ALSPAC for a few thousand SNPs are discordant from a reference panel used by the consortium. This is actually to be expected - when calculating allele frequencies, even from the same population, in two different samples for many millions of SNPs there will be a number of SNPs that appear to be highly discordant.

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_f7
name: >-
  Genome-wide - 1000G imputed - G0 mothers + G1 version 2015-10-30
  freeze 7
description: >-
  This is the seventh freeze of the 2015-10-30 version of the
  gi_1000g_g0m_g1 dataset. It contains data in the Oxford format
  which is a combination of bgen and sample (version 1.2) files. It is a subset of
  the data in gi_1000g_g0m_g1_2015-10-30 limited to one format and
  with participants who have withdrawn their consent removed.

  The Dec 2013 haplotype release of 1000 genomes phase 1 version 3 has 199 reported SNPs
  with incorrect strand. The strand issues are present in this imputation version. For more 
  information and the origins of this list please visit:
  https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated_SHAPEIT2_16-06-14.html

  It is very unlikely that they have systematic effects across the genome and most 
  probably are just isolated to these 199 known problematic SNPs.

  The user is advised to discard them from their analysis.

freeze_size: 123G
linker_file_md5sum: 2b2e49bc61c1a0efc3f57db6e48656ea
woc_file_md5sum: ee409be51e4e12594cb700dd1be99314
all_individuals_to_exclude_md5sum: 241c00aec78b7178d6797a83478dec02
git_tag: https://github.com/alspac/dataset_gi_1000g_g0m_g1/releases/tag/freeze7
is_current_freeze: true
freeze_number: 7
freeze_date: 2026-04-14
previous_freeze: alspacdcs:gi_1000g_g0m_g1_2015-10-30_f6
freeze_of_alspac_dataset_version: alspacdcs:gi_1000g_g0m_g1_2015-10-30
freeze_of_named_alspac_dataset: alspacdcs:gi_1000g_g0m_g1

contains:
- data
- docs
files: []
data:
  contains:
  - filtered_17.bgen
  - filtered_16.bgen
  - filtered_11.bgen
  - filtered_10.bgen
  - filtered_19.bgen
  - filtered_08.bgen
  - filtered_12.bgen
  - filtered_15.bgen
  - filtered_04.bgen
  - filtered_05.bgen
  - filtered_20.bgen
  - filtered_06.bgen
  - filtered_21.bgen
  - filtered_18.bgen
  - filtered_07.bgen
  - filtered_09.bgen
  - filtered_13.bgen
  - filtered_22.bgen
  - filtered_14.bgen
  - filtered_23.bgen
  - filtered_03.bgen
  - filtered_01.bgen
  - filtered_02.bgen
  - swapped.sample
  files:
  - id: alspacdcs:7916941a-73dc-452f-9ef8-981d82ca6c58
    name: filtered_17.bgen
    md5sum: 30d636fabb041a62727b14eabe41d03d
    filesize: 3.8GB
    filetype: .bgen
    number_of_variants: 753174
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:9218ce5d-118c-497a-9aa8-65d09b5f2cda
    name: filtered_16.bgen
    md5sum: 9649ceef3120933c2a1f6dd0e3e5d4e2
    filesize: 4.3GB
    filetype: .bgen
    number_of_variants: 865998
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:bb61b717-18f3-427e-8535-528bf08234d6
    name: filtered_11.bgen
    md5sum: dfcfe37f4296073bbaff697406eae3af
    filesize: 5.3GB
    filetype: .bgen
    number_of_variants: 1356882
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:f47db626-ac59-42a4-8c5c-aa9b152a50b7
    name: filtered_10.bgen
    md5sum: 1d68a97256836e961da464ce42181b09
    filesize: 5.4GB
    filetype: .bgen
    number_of_variants: 1361506
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:065b1617-9ee0-45a6-b47b-2be1fe87d2d4
    name: filtered_19.bgen
    md5sum: 59d0405126edc09edd390072aa042117
    filesize: 3.9GB
    filetype: .bgen
    number_of_variants: 603516
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:7d609a37-363a-42f2-add6-2e12cb444687
    name: filtered_08.bgen
    md5sum: 41c19089856b673f93ce504fd7121681
    filesize: 5.9GB
    filetype: .bgen
    number_of_variants: 1557429
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:ff9be78c-f16b-4e87-9a31-ec2e6aaa0496
    name: filtered_12.bgen
    md5sum: 3838c6d902dcf3885c885892107b288b
    filesize: 5.3GB
    filetype: .bgen
    number_of_variants: 1314328
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:70caa3da-51a8-43e5-b1f1-3fde9c8a9d98
    name: filtered_15.bgen
    md5sum: a7e828dc8cba53f4c71eb530ba480128
    filesize: 3.7GB
    filetype: .bgen
    number_of_variants: 812545
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:4980879a-1b67-4de4-bf3b-7223bb447c97
    name: filtered_04.bgen
    md5sum: cf067f366f2bca3024d5be75a1120650
    filesize: 8.3GB
    filetype: .bgen
    number_of_variants: 1968171
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:eeeb0265-4aa4-447e-ac66-34fe7c4dc544
    name: filtered_05.bgen
    md5sum: f923353b8c81aad0365c17af25700b2d
    filesize: 6.8GB
    filetype: .bgen
    number_of_variants: 1808090
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:f1e32dff-b2e0-4fdf-93ee-96621ea9dbde
    name: filtered_20.bgen
    md5sum: 878cc210c992c56e9302fc0a147e893c
    filesize: 2.7GB
    filetype: .bgen
    number_of_variants: 617694
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:d3427219-a13e-4b0d-9ba7-8ff3612b461e
    name: filtered_06.bgen
    md5sum: 812369046c5b1962507ddbb42e76b42b
    filesize: 6.8GB
    filetype: .bgen
    number_of_variants: 1755859
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:f6097a34-5c78-4f72-9eef-b346c5a3d0d4
    name: filtered_21.bgen
    md5sum: 157259126ff9ae0d328d06cae0327028
    filesize: 1.9GB
    filetype: .bgen
    number_of_variants: 377554
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:ce3f9aea-ca68-44c7-8596-11c5e2f55081
    name: filtered_18.bgen
    md5sum: 8e1c796ff59aa554f2f476a5b74a10a0
    filesize: 3.4GB
    filetype: .bgen
    number_of_variants: 783010
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:5be56ef3-3ed2-40cf-8fad-f79ee43ee4ca
    name: filtered_07.bgen
    md5sum: 232c7f64601144ba93d860e0b1cbe2e5
    filesize: 7.1GB
    filetype: .bgen
    number_of_variants: 1599387
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:d3d23f2d-235f-4ed8-9f16-2d3bed5c76a4
    name: filtered_09.bgen
    md5sum: fc378f5588b03df3c6454bae95a44a0e
    filesize: 5.0GB
    filetype: .bgen
    number_of_variants: 1187731
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:d6c92c42-0436-460c-a265-57f3a5c4ac7e
    name: filtered_13.bgen
    md5sum: 392bee0dd01f1fa4f10c7c666afeca4d
    filesize: 3.9GB
    filetype: .bgen
    number_of_variants: 987740
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:dbe9ab61-b682-4837-a704-88fa3c7fe08c
    name: filtered_22.bgen
    md5sum: 26c4ddfb909ac44c7fe55317490297d3
    filesize: 2.0GB
    filetype: .bgen
    number_of_variants: 365644
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:920ec9dd-ff2e-4a0f-8850-3246f4791aa5
    name: filtered_14.bgen
    md5sum: 57102c571a902a923a75d6e60e93ad92
    filesize: 3.9GB
    filetype: .bgen
    number_of_variants: 904351
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:c9f94a93-a18a-4db9-99ef-6343c519ddb2
    name: filtered_23.bgen
    md5sum: b4dea54d4567092719371627dab34d82
    filesize: 5.9GB
    filetype: .bgen
    number_of_variants: 1250218
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:9b10ebfa-af18-471d-a8d3-47c4a0f0f340
    name: filtered_03.bgen
    md5sum: 016f6deb0f93cfc9ded14a12f58af235
    filesize: 7.6GB
    filetype: .bgen
    number_of_variants: 1966662
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:6794c2b5-ff4a-406c-a9c4-dd1363586a75
    name: filtered_01.bgen
    md5sum: 00ac9620c6a7738b8beba5aea63b6602
    filesize: 9.0GB
    filetype: .bgen
    number_of_variants: 2155158
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:0d60d569-be8b-472a-81e9-9706372f76ab
    name: filtered_02.bgen
    md5sum: 69f9465e174e2abb8afa704bcecd79ad
    filesize: 9.1GB
    filetype: .bgen
    number_of_variants: 2346862
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:c0dec0f1-e2bf-415e-a44f-d6377afdbc17
    name: swapped.sample
    md5sum: c50c8bc1123e8ea37aaa1bae9a12672a
    filesize: 1.2MB
    filetype: .sample
    number_of_participants: 17437
    belongs_to: data
docs:
  contains:
  - gi_1000g_g0m_g1_2015-10-30_f6.yaml
  files:
  - id: alspacdcs:057332f8-3983-4780-a0a2-7ac07b00a9bb
    name: gi_1000g_g0m_g1_2015-10-30_f6.yaml
    md5sum: b89aa14bf4d9f1c6ae09438169d73b1f
    filesize: 8.2KB
    filetype: .yaml
    belongs_to: docs
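
Each section of a freeze document names its files twice, once under `contains` and once under `files`. The sketch below checks that the two lists agree. It assumes the YAML has already been parsed into plain Python dicts and lists (the parser is not shown), and the function name `check_section` is hypothetical.

```python
def check_section(section):
    """Compare the names under 'contains' with the entries under 'files'."""
    listed = set(section.get("contains", []))
    described = {entry["name"] for entry in section.get("files", [])}
    return {
        "missing_metadata": sorted(listed - described),  # listed, no file entry
        "unlisted": sorted(described - listed),          # file entry, not listed
    }


# The docs section from the freeze document above, written as a Python dict.
docs = {
    "contains": ["gi_1000g_g0m_g1_2015-10-30_f6.yaml"],
    "files": [
        {
            "name": "gi_1000g_g0m_g1_2015-10-30_f6.yaml",
            "md5sum": "b89aa14bf4d9f1c6ae09438169d73b1f",
            "filetype": ".yaml",
        }
    ],
}

print(check_section(docs))
```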

Genome-wide - TOPMed round 3 imputed - G0 mothers + G1 (gi_topmed_g0m_g1)

Description

This dataset contains genotype data imputed to TOPMed round 3 for G0 mothers and G1.
Reference genome build: GRCh38

Methodology

Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,115 subjects and 500,527 SNPs passed these quality control filters.

9,048 subjects and 526,688 SNPs passed these quality control filters.

Individuals within this dataset who have withdrawn from the project were removed before proceeding with imputation-specific quality control. This left 17450 individuals.

The combined mothers and children genotype panel was filtered using PLINK 2.0 to remove SNPs with a MAF below 0.01 or a missing call rate exceeding 0.01. The joint set of SNPs was checked for palindromic SNPs but none were present. The combined call set was converted from GRCh37 to GRCh38 using UCSC liftOver.

The dataset was then filtered to SNPs with a HWE P value above 1e-6, leaving 455150 SNPs. The combined autosomal call set was converted to VCF files and uploaded to the TOPMed imputation server to flag variants requiring a strand fix. Any SNPs flagged with an issue were corrected or filtered out using PLINK 2. 454248 SNPs remained within the autosomes.

Phasing and imputation were conducted on the Michigan TOPMed imputation server in December 2025. Phasing was done using Eagle. Imputation to TOPMed R3 was done using minimac4. An R squared filter of 0.3 was applied.
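
The variant-level thresholds described in this methodology can be summarised as two simple predicates. This is a minimal sketch, assuming per-variant summary statistics have already been computed elsewhere. The `VariantStats` fields and both function names are hypothetical and do not correspond to the PLINK or imputation server interfaces, and the handling of values exactly at a threshold is an assumption.

```python
from dataclasses import dataclass

# Thresholds taken from the methodology text above.
MIN_MAF = 0.01
MAX_MISSING_RATE = 0.01
MIN_HWE_P = 1e-6
MIN_R_SQUARED = 0.3


@dataclass
class VariantStats:
    maf: float           # minor allele frequency
    missing_rate: float  # proportion of missing genotype calls
    hwe_p: float         # Hardy-Weinberg equilibrium test P value


def passes_pre_imputation_qc(variant: VariantStats) -> bool:
    """Apply the MAF, missingness and HWE filters to a genotyped variant."""
    return (
        variant.maf >= MIN_MAF
        and variant.missing_rate <= MAX_MISSING_RATE
        and variant.hwe_p > MIN_HWE_P
    )


def passes_post_imputation_filter(r_squared: float) -> bool:
    """Keep imputed variants whose imputation R squared meets the threshold."""
    return r_squared >= MIN_R_SQUARED


# Hypothetical variants, for illustration only.
print(passes_pre_imputation_qc(VariantStats(maf=0.12, missing_rate=0.002, hwe_p=0.4)))
print(passes_pre_imputation_qc(VariantStats(maf=0.004, missing_rate=0.002, hwe_p=0.4)))
print(passes_post_imputation_filter(0.85))
```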

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gi_topmed_g0m_g1_2026-03-09_f7
name: >-
  Genome-wide - TOPmed imputed - G0 mothers + G1 version 2026-03-09
  freeze 7
description: >-
  Freeze 7 of version 2026-03-09: genome-wide array data imputed to the TOPMed round 3 reference panel for G0 mothers and G1 individuals in bgen and sample file format (version 1.2).
  The 2025-07-25 version of the dataset was updated to filter out all monomorphic variants, reducing overall size.
  The 2026-03-09 version of the dataset was updated to use TOPMed round 3 instead of round 2.
freeze_size: 124G
linker_file_md5sum: 2b2e49bc61c1a0efc3f57db6e48656ea
woc_file_md5sum: ee409be51e4e12594cb700dd1be99314
all_individuals_to_exclude_md5sum: 241c00aec78b7178d6797a83478dec02
git_tag: https://github.com/alspac/dataset_gi_topmed_g0m_g1/releases/tag/freeze7
is_current_freeze: true
freeze_number: 7
freeze_date: 2026-04-14
previous_freeze: alspacdcs:gi_topmed_g0m_g1_2024-12-19_f6
freeze_of_alspac_dataset_version: alspacdcs:gi_topmed_g0m_g1_2026-03-09
freeze_of_named_alspac_dataset: alspacdcs:gi_topmed_g0m_g1

contains:
- data

files: []
data:
  contains:
  - chr19_freeze.bgen
  - chr15_freeze.bgen
  - chr13_freeze.bgen
  - chr21_freeze.bgen
  - chr11_freeze.bgen
  - chr10_freeze.bgen
  - chr8_freeze.bgen
  - chr6_freeze.bgen
  - chr5_freeze.bgen
  - chr17_freeze.bgen
  - chr22_freeze.bgen
  - chr14_freeze.bgen
  - chr18_freeze.bgen
  - chr12_freeze.bgen
  - chr4_freeze.bgen
  - chr16_freeze.bgen
  - chr20_freeze.bgen
  - chr9_freeze.bgen
  - chr3_freeze.bgen
  - chr7_freeze.bgen
  - chr2_freeze.bgen
  - chr1_freeze.bgen
  - freeze.sample
  files:
  - id: alspacdcs:8356a0c2-8e95-4ad4-af28-ab726bc647dd
    name: chr19_freeze.bgen
    md5sum: a2d8302221651ccce325147e973b4a6a
    filesize: 3.2GB
    filetype: .bgen
    number_of_variants: 1706270
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:b0feabea-b2c9-45cf-a8af-9a8171b025e3
    name: chr15_freeze.bgen
    md5sum: 6a4bb4e44bd5447e18de7af8a7369882
    filesize: 3.5GB
    filetype: .bgen
    number_of_variants: 2201299
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:4b518038-0220-4a26-9673-bb12ed0ca727
    name: chr13_freeze.bgen
    md5sum: ea103b58ced47e0e86720dd1bc4491df
    filesize: 4.3GB
    filetype: .bgen
    number_of_variants: 2681386
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:2ea639c6-38a9-4413-8c03-bb960065bb5e
    name: chr21_freeze.bgen
    md5sum: efe990fbf743c152d871f87736f3ab4f
    filesize: 1.7GB
    filetype: .bgen
    number_of_variants: 990088
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:b5d574e3-6484-45a4-84cf-7277e043946a
    name: chr11_freeze.bgen
    md5sum: 3e78299f45e8cae74013656272f1b79b
    filesize: 5.9GB
    filetype: .bgen
    number_of_variants: 3683845
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:d5310539-eda5-4667-8e56-46319ac3eaf8
    name: chr10_freeze.bgen
    md5sum: c8daefc714028e76aeb9506378c156d0
    filesize: 5.8GB
    filetype: .bgen
    number_of_variants: 3680602
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:a40a6565-6579-4018-843a-54d886aa589d
    name: chr8_freeze.bgen
    md5sum: bda1f7e06d794648d3deea989a3c864a
    filesize: 6.4GB
    filetype: .bgen
    number_of_variants: 4158986
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:817b4743-ca04-45e3-9413-bb8cf0cb39ab
    name: chr6_freeze.bgen
    md5sum: 71a785288567f1968e278b10f5cdeae0
    filesize: 7.3GB
    filetype: .bgen
    number_of_variants: 4624023
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:30648924-12e0-484d-b19d-a780aadfc3a6
    name: chr5_freeze.bgen
    md5sum: 6cb97d79a6ad73fc56a7e00743c2e749
    filesize: 7.6GB
    filetype: .bgen
    number_of_variants: 4822152
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:6f6975ec-94ee-43af-b419-a1011e84d73c
    name: chr17_freeze.bgen
    md5sum: e7ddd9ea66f7101cd034c2777921f8a4
    filesize: 3.6GB
    filetype: .bgen
    number_of_variants: 2173526
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:865693ad-6056-438e-809e-320e78181dd4
    name: chr22_freeze.bgen
    md5sum: b9ff0c71b249a6596e060be3a7f7c63c
    filesize: 1.7GB
    filetype: .bgen
    number_of_variants: 1044068
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:c05f19ab-3cae-41eb-9c1c-56b5981902b3
    name: chr14_freeze.bgen
    md5sum: cc07fd3e8091e8edb4e4743fe963c645
    filesize: 3.7GB
    filetype: .bgen
    number_of_variants: 2410840
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:c6463e3b-f9d8-46e8-895d-b989aafa9831
    name: chr18_freeze.bgen
    md5sum: 512a3d1abfefab1eb42f36ec2f1b3eca
    filesize: 3.5GB
    filetype: .bgen
    number_of_variants: 2114657
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:b30205b7-9773-49b3-8e2a-fd62d38a5507
    name: chr12_freeze.bgen
    md5sum: 6754a2cf62923068ef2a10b5b93b8304
    filesize: 5.7GB
    filetype: .bgen
    number_of_variants: 3565533
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:0acef870-0b55-4fd1-b570-9acb558735a2
    name: chr4_freeze.bgen
    md5sum: acff37c9e412c696f064b41da605ecae
    filesize: 9.0GB
    filetype: .bgen
    number_of_variants: 5210100
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:15c1eae2-6fdc-4771-a0fe-9aecda18c8f2
    name: chr16_freeze.bgen
    md5sum: e7c14743c0ba6ca440d7f8ac2f864640
    filesize: 3.9GB
    filetype: .bgen
    number_of_variants: 2402320
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:9f46d73d-a08f-4f16-8512-109c07a6df24
    name: chr20_freeze.bgen
    md5sum: 4e12629c98636806da1642030e3027f0
    filesize: 2.8GB
    filetype: .bgen
    number_of_variants: 1727107
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:cb656d7c-e315-4527-9488-3e8857cd7cf9
    name: chr9_freeze.bgen
    md5sum: b7eb80079febb33aa23d34aef62a29d7
    filesize: 5.0GB
    filetype: .bgen
    number_of_variants: 3284259
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:530b1c49-0901-4f4e-bff7-1640527ad763
    name: chr3_freeze.bgen
    md5sum: da67b15d2ac6543151dedb556aabf62d
    filesize: 8.5GB
    filetype: .bgen
    number_of_variants: 5352509
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:19b08d4a-613d-4f8e-958a-90647539aa2b
    name: chr7_freeze.bgen
    md5sum: c61ebce2c0f62ec078c2832bfeff1e41
    filesize: 7.1GB
    filetype: .bgen
    number_of_variants: 4313189
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:6f5e920a-4985-4429-a430-f955ff84a37a
    name: chr2_freeze.bgen
    md5sum: 671c544946fa9cdb600443a2e2ec2968
    filesize: 9.8GB
    filetype: .bgen
    number_of_variants: 6477549
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:090e86a8-d67c-4e94-a2bb-b4af0127d6fd
    name: chr1_freeze.bgen
    md5sum: 1e337c882e0feeb962e1dccd7d1cc71b
    filesize: 9.4GB
    filetype: .bgen
    number_of_variants: 6003546
    number_of_participants: 17437
    belongs_to: data
  - id: alspacdcs:117c8056-28e7-4952-99d9-cc827a9dc496
    name: freeze.sample
    md5sum: 9d8ebe9bc5f65e251df39805799c63c1
    filesize: 953.6KB
    filetype: .sample
    number_of_participants: 17437
    belongs_to: data
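
The md5sum values in a freeze listing can be used to confirm that files arrived intact. The following is a minimal sketch using only the Python standard library. The two dictionary entries are copied from the listing above, while the function names and the idea of collecting the checksums into a dictionary are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_checksum_mismatches(directory, expected):
    """Return names of files that are missing or whose MD5 differs from `expected`."""
    problems = []
    for name, md5sum in expected.items():
        path = Path(directory) / name
        if not path.is_file() or md5_of_file(path) != md5sum:
            problems.append(name)
    return problems


# Two entries copied from the listing above; extend with the remaining files.
EXPECTED = {
    "chr21_freeze.bgen": "efe990fbf743c152d871f87736f3ab4f",
    "freeze.sample": "9d8ebe9bc5f65e251df39805799c63c1",
}
```

Calling find_checksum_mismatches with the directory holding the received freeze and EXPECTED returns an empty list when every listed file is present and matches.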

Genome-wide - TOPMed round 3 imputed - G0 partners (gi_topmed_g0p)

Description

This dataset contains genotype data imputed to TOPMed round 3 for G0 partners and a few G0 mothers.
Reference genome build: GRCh38

Methodology

After application of the plink filters --geno 0.05, --maf 0.01, --snps-only just-acgt and --autosome, 113288 SNPs remained. The --related command in KING version 2.3.2 was used to perform kinship analysis, which confirmed that all 1737 putative G0 partner-G1 pairs are genetically related, as would be expected for biological father-offspring pairs. Relationships were inferred using the criteria described in Table 1 of Manichaikul, Ani, et al. “Robust relationship inference in genome-wide association studies.” Bioinformatics 26.22 (2010): 2867-2873.
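
As an illustration of how a kinship coefficient maps to a relationship degree, the sketch below uses the cut-offs given in the KING documentation (greater than 0.354 for duplicate/MZ twin, then 0.177, 0.0884 and 0.0442 as the lower bounds for first, second and third degree). The function name is hypothetical, and this is not the code used to produce the dataset.

```python
# Kinship cut-offs from the KING documentation, written as powers of two:
# 2**-1.5 ~ 0.354, 2**-2.5 ~ 0.177, 2**-3.5 ~ 0.0884, 2**-4.5 ~ 0.0442.
CUTOFFS = [
    (2 ** -1.5, "duplicate/MZ twin"),
    (2 ** -2.5, "first degree"),
    (2 ** -3.5, "second degree"),
    (2 ** -4.5, "third degree"),
]


def classify_kinship(phi):
    """Return the relationship degree implied by an estimated kinship coefficient."""
    for lower_bound, label in CUTOFFS:
        if phi > lower_bound:
            return label
    return "unrelated"
```

A father-offspring pair has an expected kinship coefficient of 0.25, which falls in the first-degree band. KING additionally uses the proportion of zero identical-by-state sharing to separate parent-offspring pairs from full siblings, which this sketch omits.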

Of the 507586 SNPs which passed the original QC, 256693 remained after filtering on --maf and --geno of 0.01.

Individuals in this dataset who have withdrawn from the project were removed before proceeding with imputation-specific quality control. This left 2198 individuals.

The genotype panel was filtered using Plink 2.0 to remove SNPs with MAF below 0.01 or missing call rates exceeding 0.01. Palindromic SNPs were identified and filtered out. The genotype data was lifted over from GRCh37 to GRCh38 using UCSC liftOver.
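
A palindromic SNP is one whose two alleles are complementary (A/T or C/G), so its strand cannot be determined from the alleles alone. The sketch below shows one way such SNPs could be identified. The function names and the tuple layout are assumptions for illustration and do not reproduce the pipeline actually used.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindromic(allele1, allele2):
    """True for A/T and C/G SNPs, whose alleles are each other's complement."""
    return COMPLEMENT.get(allele1.upper()) == allele2.upper()


def drop_palindromic(variants):
    """Keep (variant_id, allele1, allele2) tuples that are not palindromic."""
    return [v for v in variants if not is_palindromic(v[1], v[2])]
```

For example, given the hypothetical variants ("snp1", "A", "T") and ("snp2", "A", "C"), only the second would be kept.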

The dataset was later filtered to SNPs with a Hardy-Weinberg equilibrium (HWE) p-value above 1e-6, leaving 246380 SNPs. The combined autosomal call set was then converted to VCF files and uploaded to the TOPMed imputation server to flag variants requiring a strand fix. Any flagged SNPs were corrected or filtered out using Plink2. 246380 SNPs remained within the autosomes.

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gi_topmed_g0p_2026-03-05_f7
name: >-
  Genome-wide - TOPmed imputed - G0 partners version 2026-03-05
  freeze 7
description: >-
  Freeze 7 of version 2026-03-05: genome-wide array data imputed to the TOPMed round 3 reference panel for G0 partners and a few G0 mothers, in bgen and sample file format (version 1.2).
freeze_size: 22G
linker_file_md5sum: 2b2e49bc61c1a0efc3f57db6e48656ea
woc_file_md5sum: ee409be51e4e12594cb700dd1be99314
all_individuals_to_exclude_md5sum: 241c00aec78b7178d6797a83478dec02
git_tag: https://github.com/alspac/dataset_gi_topmed_g0p/releases/tag/freeze7
is_current_freeze: true
freeze_number: 7
freeze_date: 2026-04-14
previous_freeze: NA
freeze_of_alspac_dataset_version: alspacdcs:gi_topmed_g0p_2026-03-05
freeze_of_named_alspac_dataset: alspacdcs:gi_topmed_g0p

contains:
- data
files: []
data:
  contains:
  - chr15_freeze.bgen
  - chr19_freeze.bgen
  - chr13_freeze.bgen
  - chr21_freeze.bgen
  - chr10_freeze.bgen
  - chr11_freeze.bgen
  - chr5_freeze.bgen
  - chr8_freeze.bgen
  - chr6_freeze.bgen
  - chr17_freeze.bgen
  - chr22_freeze.bgen
  - chr12_freeze.bgen
  - chr14_freeze.bgen
  - chr18_freeze.bgen
  - chr16_freeze.bgen
  - chr20_freeze.bgen
  - chr9_freeze.bgen
  - chr4_freeze.bgen
  - chr3_freeze.bgen
  - chr7_freeze.bgen
  - chr2_freeze.bgen
  - chr1_freeze.bgen
  - freeze.sample
  files:
  - id: alspacdcs:896a5868-62b3-4977-afc2-7c48e4e83fe3
    name: chr15_freeze.bgen
    md5sum: 9606120b7d711ff2b2fc1b270a5a32c9
    filesize: 665.6MB
    filetype: .bgen
    number_of_variants: 946259
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:ca8f89f0-6881-4f9e-9721-fb8150317e5f
    name: chr19_freeze.bgen
    md5sum: f05ca7e90fbd43946da44fe412cd3fe7
    filesize: 558.4MB
    filetype: .bgen
    number_of_variants: 738614
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:6a60a87f-5740-4aaa-a966-f7ef23420512
    name: chr13_freeze.bgen
    md5sum: 22f3943a6d523918c8ff85babdefdd1a
    filesize: 815.3MB
    filetype: .bgen
    number_of_variants: 1189351
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:7b58037e-5ddb-483b-af5e-32fc53a393d6
    name: chr21_freeze.bgen
    md5sum: e1c94daa160cacd8376b2ce13d4634ad
    filesize: 328.4MB
    filetype: .bgen
    number_of_variants: 430038
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:12617f44-9084-40f4-81fd-2c4a207ea9b6
    name: chr10_freeze.bgen
    md5sum: 17ea66dec66bf9f792367f8643db009e
    filesize: 1.1GB
    filetype: .bgen
    number_of_variants: 1640786
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:764c2691-3271-41a6-a225-d8a5eb118444
    name: chr11_freeze.bgen
    md5sum: f53035ee0dc61b1fb0dd7eace77b316d
    filesize: 1.1GB
    filetype: .bgen
    number_of_variants: 1633833
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:13fffb28-dacc-4a52-bf64-9b715d7eede7
    name: chr5_freeze.bgen
    md5sum: 4e65772a4270c7867de4d360b3fbb90e
    filesize: 1.4GB
    filetype: .bgen
    number_of_variants: 2154914
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:3c41c9bb-7162-44fa-97b9-33ed530d61df
    name: chr8_freeze.bgen
    md5sum: 22dc547b58371f66dba68e5fa290d9b0
    filesize: 1.2GB
    filetype: .bgen
    number_of_variants: 1840903
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:81576647-1b56-42d1-8733-c7b01bffd43c
    name: chr6_freeze.bgen
    md5sum: 51f802c3216c4038bb9e601cfe689ce9
    filesize: 1.3GB
    filetype: .bgen
    number_of_variants: 2094485
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:7eddd9ca-343a-40eb-95f9-9c8ba67b3f22
    name: chr17_freeze.bgen
    md5sum: 4dde203a3840e3e5656e9368645c7e79
    filesize: 680.9MB
    filetype: .bgen
    number_of_variants: 928784
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:f4394790-6bdf-495a-a7d9-f1f5fd326b0c
    name: chr22_freeze.bgen
    md5sum: 1856df2ac51183f4fa3628e0ac8e10b8
    filesize: 329.3MB
    filetype: .bgen
    number_of_variants: 445037
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:bb68d019-c68b-4c78-beb1-3544a1f57ac7
    name: chr12_freeze.bgen
    md5sum: d6554162765d6d1be60b8b1686e8d777
    filesize: 1.0GB
    filetype: .bgen
    number_of_variants: 1571845
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:1f090c2c-b25d-4f02-88a2-b22544997602
    name: chr14_freeze.bgen
    md5sum: 90e11d2e620ef759d2687316bc6d97c4
    filesize: 717.6MB
    filetype: .bgen
    number_of_variants: 1063271
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:f9a1d0d7-3d03-4266-a824-08a3cf99b60c
    name: chr18_freeze.bgen
    md5sum: f6785ac96db505e6524c62bb97113d13
    filesize: 671.2MB
    filetype: .bgen
    number_of_variants: 941993
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:904b0a56-83b6-4b78-bbd2-c37a4a9a6c78
    name: chr16_freeze.bgen
    md5sum: 0fd6bb9434447235cee417188b9c2a52
    filesize: 732.1MB
    filetype: .bgen
    number_of_variants: 1041239
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:0c1bb5af-bf43-40b6-95ff-12e56324bd4e
    name: chr20_freeze.bgen
    md5sum: f40f8071f1041eff977bbbfdacfbe7bc
    filesize: 537.7MB
    filetype: .bgen
    number_of_variants: 746712
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:8c5bc46b-9375-4b2a-ab25-299e42b56f08
    name: chr9_freeze.bgen
    md5sum: 7f11eaaf2a1994944570eba0ccb1b759
    filesize: 964.2MB
    filetype: .bgen
    number_of_variants: 1446728
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:12b3d880-abb9-4a8e-b9bd-e9bc03de996e
    name: chr4_freeze.bgen
    md5sum: 2f42ff7aff9d3554313f062a72cccddd
    filesize: 1.6GB
    filetype: .bgen
    number_of_variants: 2348703
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:f2d71a03-8785-4f47-9a55-3f6e86885bb8
    name: chr3_freeze.bgen
    md5sum: 127f3c7043067ddb1966cab142ed541a
    filesize: 1.6GB
    filetype: .bgen
    number_of_variants: 2375486
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:f3a7c283-1d4d-47c5-8519-8b35494b459a
    name: chr7_freeze.bgen
    md5sum: f7a00a3fe2411e1b4dd706c173f11cb1
    filesize: 1.3GB
    filetype: .bgen
    number_of_variants: 1924941
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:dfbae54d-05a3-48e7-9cb1-9364be264e34
    name: chr2_freeze.bgen
    md5sum: b6d11260933002b6b93914e358bded1e
    filesize: 1.8GB
    filetype: .bgen
    number_of_variants: 2850810
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:abc2cba5-16ee-41eb-8324-238b934937ee
    name: chr1_freeze.bgen
    md5sum: 3c84eeb511d11b0aa2b09d840efe122f
    filesize: 1.7GB
    filetype: .bgen
    number_of_variants: 2636883
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:8025063c-1ab2-4d9b-b537-838795b9ef32
    name: freeze.sample
    md5sum: bc4d80c65b3a76fe6a4c0e7ca061d8d0
    filesize: 120.2KB
    filetype: .sample
    number_of_participants: 2198
    belongs_to: data
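
The per-file `filesize` values can be tallied and compared with the `freeze_size` field as a rough completeness check. This is a minimal sketch with hypothetical function names. The catalogue does not state whether its units are 1000-based or 1024-based, so the base is a parameter and the default of 1024 is an assumption.

```python
# Exponent applied to the base for each unit suffix used in the listing.
UNIT_EXPONENTS = {"KB": 1, "MB": 2, "GB": 3, "TB": 4}


def size_to_bytes(text, base=1024):
    """Convert a string such as '665.6MB' to a byte count."""
    number, unit = text[:-2], text[-2:].upper()
    return float(number) * base ** UNIT_EXPONENTS[unit]


def total_size_gb(sizes, base=1024):
    """Sum a list of size strings and express the total in GB."""
    return sum(size_to_bytes(size, base) for size in sizes) / base ** 3
```

Because the listed sizes are rounded to one decimal place, the total should be expected to agree with `freeze_size` only approximately.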

Sequence Data

Whole genome sequencing - G1 (wgs_hiseq_g1)

Description

This dataset contains whole genome sequencing for G1 individuals, part of the UK10K dataset.
Reference genome build: GRCh37

Methodology

ALSPAC and TwinsUK cohorts were sequenced at an average read depth of 6.7x through the UK10K program (http://www.UK10K.org) using the Illumina HiSeq platform, and aligned to the GRCh37 human reference using BWA. SNV calls were made using samtools/bcftools, and VQSR and GATK were then used to recall these calls.

Sites which passed filters were brought forward to genotype refinement. Samples were dropped after calling and before genotype refinement for reasons described below.

Sample exclusion: 48 samples were removed:
- 1 excessive het rate
- 36 discordance > 3%
- 11 coverage < 4x

Genotype refinement:
The missing and low confidence genotypes were refined with BEAGLE 4, rev909 in chunks of 3,000 sites plus 1,000 sites in buffer regions. Multiallelic sites were included in the imputation.

Further Sample exclusion after genotype refinement:
- 1 sample removed
- 1 contamination
- 1 NRD > 5%

For downstream analysis a further 177 samples may be excluded due to non-European ancestry, or relatedness.

Annotations: The calls were annotated using vcf-annotate and include dbSNP 137 rsIDs. Sites where a different alternate allele of the same type (SNP vs indel) was found in the UK10K data have the dbSNPmismatch flag in the INFO column. The 1000 Genomes frequencies are taken from the final 1000 Genomes Phase 1 integrated (v3) callset available here:
- ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/phase1/analysis_results/integrated_call_sets

The files used for these annotations can be found on the above sftp site:
- /uk10k/ref/annots-rsIDs-dbSNPv137.2012-09-13.tab.gz
- /uk10k/ref/annots-rsIDs-AFs.2012-07-19.tab.gz

Functional Annotations:
Variant consequence annotations were called using the Ensembl Variant Effect Predictor (http://www.ensembl.org/info/docs/variation/vep/index.html), v2.6, against Ensembl 68. This provides coding consequence predictions and SIFT, PolyPhen and Condel annotations. Ensembl also provides GERP conservation scores. Grantham matrix values come from a simple lookup.

The consequences are in the format:
CSQ=ENSTid:Genename:consequence_string[:CDS_coord:Peptide_coord:AA>AA:Functional_annot,value]+ENSTid2:Genename:consequence_string2:…[+gerpScore]

i.e. consequences are separated by ‘+’, with each consequence containing ‘:’ separated fields. Functional annotations (PolyPhen, etc) are given as name,value.
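
The format above can be split programmatically. The sketch below is one possible reading of it: items are separated by '+', the first three ':' separated fields are taken as transcript, gene name and consequence, any later field containing a comma is treated as a name,value functional annotation, and an item with fewer than three fields is assumed to be the trailing GERP score. The function name and the example string (transcript IDs, gene name and values) are hypothetical.

```python
def parse_csq(value):
    """Parse a CSQ string into (list of consequence dicts, GERP score or None)."""
    body = value[len("CSQ="):] if value.startswith("CSQ=") else value
    consequences = []
    gerp_score = None
    for item in body.split("+"):
        fields = item.split(":")
        if len(fields) < 3:
            # Assumption: an item without the three leading fields is the GERP score.
            try:
                gerp_score = float(item)
            except ValueError:
                pass  # leave unrecognised items alone rather than guess
            continue
        record = {
            "transcript": fields[0],
            "gene": fields[1],
            "consequence": fields[2],
            "other_fields": [],
            "functional": {},
        }
        for field in fields[3:]:
            if "," in field:
                name, score = field.split(",", 1)
                record["functional"][name] = score
            else:
                record["other_fields"].append(field)
        consequences.append(record)
    return consequences, gerp_score


# Hypothetical example string, constructed only to exercise the parser.
EXAMPLE = (
    "CSQ=ENST00000000001:GENE1:NON_SYNONYMOUS_CODING:123:41:A>T:PolyPhen,0.95"
    "+ENST00000000002:GENE1:INTRONIC+2.31"
)
```

Annotation values are kept as strings, since the format description does not say whether every functional annotation is numeric.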

Based on non-European ancestry and relatedness 177 of the samples in this release should be excluded from certain downstream analysis.

Associated publication:
- http://www.ncbi.nlm.nih.gov/pubmed/26367797

Please ensure you have permission to access this data (http://www.uk10k.org/data_access.html) before using it.

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:wgs_hiseq_g1_2016-08-18_f7
name: Whole genome sequencing - G1 version 2016-08-18 freeze 7
description: >-
  This is freeze 7 of version 2016-08-18 of the whole genome sequencing for G1 individuals, part of the UK10K dataset.
freeze_size: 341G
linker_file_md5sum: 2b2e49bc61c1a0efc3f57db6e48656ea
woc_file_md5sum: ee409be51e4e12594cb700dd1be99314
all_individuals_to_exclude_md5sum: 241c00aec78b7178d6797a83478dec02
git_tag: https://github.com/alspac/dataset_wgs_hiseq_g1/releases/tag/freeze7
is_current_freeze: true
freeze_number: 7
freeze_date: 2026-04-14
previous_freeze: alspacdcs:wgs_hiseq_g1_2016-08-18_f6
freeze_of_alspac_dataset_version: alspacdcs:wgs_hiseq_g1_2016-08-18
freeze_of_named_alspac_dataset: alspacdcs:wgs_hiseq_g1

contains:
- data
files: []
data:
  contains:
  - 17_freeze.vcf.gz.csi
  - 9_freeze.vcf.gz.csi
  - 8_freeze.vcf.gz.csi
  - 6_freeze.vcf.gz.csi
  - 4_freeze.vcf.gz.csi
  - 16_freeze.vcf.gz.csi
  - 7_freeze.vcf.gz.csi
  - 15_freeze.vcf.gz.csi
  - 20_freeze.vcf.gz.csi
  - 3_freeze.vcf.gz.csi
  - 19_freeze.vcf.gz.csi
  - 12_freeze.vcf.gz.csi
  - 10_freeze.vcf.gz.csi
  - 21_freeze.vcf.gz.csi
  - 1_freeze.vcf.gz.csi
  - X_freeze.vcf.gz.csi
  - 11_freeze.vcf.gz.csi
  - 13_freeze.vcf.gz.csi
  - 5_freeze.vcf.gz.csi
  - 18_freeze.vcf.gz.csi
  - 2_freeze.vcf.gz.csi
  - 22_freeze.vcf.gz.csi
  - 14_freeze.vcf.gz.csi
  - 22_freeze.vcf.gz
  - 20_freeze.vcf.gz
  - 18_freeze.vcf.gz
  - 15_freeze.vcf.gz
  - 16_freeze.vcf.gz
  - X_freeze.vcf.gz
  - 11_freeze.vcf.gz
  - 8_freeze.vcf.gz
  - 19_freeze.vcf.gz
  - 12_freeze.vcf.gz
  - 5_freeze.vcf.gz
  - 10_freeze.vcf.gz
  - 6_freeze.vcf.gz
  - 7_freeze.vcf.gz
  - 4_freeze.vcf.gz
  - 21_freeze.vcf.gz
  - 9_freeze.vcf.gz
  - 13_freeze.vcf.gz
  - 14_freeze.vcf.gz
  - 17_freeze.vcf.gz
  - 3_freeze.vcf.gz
  - 2_freeze.vcf.gz
  - 1_freeze.vcf.gz
  files:
  - id: alspacdcs:42fdb11e-489a-48d9-bfea-205641989a87
    name: 17_freeze.vcf.gz.csi
    md5sum: cb91b25b11663a5294b52cdb148f2111
    filesize: 49.9KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:21c5fe39-742f-40b5-85c9-69e6c58cba99
    name: 9_freeze.vcf.gz.csi
    md5sum: 08c718c03529b1de9e09121080e243c7
    filesize: 75.4KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:caa57080-c78b-4fac-9f30-a9f1248e4c58
    name: 8_freeze.vcf.gz.csi
    md5sum: 77418891e834bcc661ef28366d6a2870
    filesize: 92.8KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:77d196af-f9ae-4b08-a7d4-3f050fd89e09
    name: 6_freeze.vcf.gz.csi
    md5sum: 47e67a62888eaa188030340ad35123c7
    filesize: 109.8KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:d00a65c9-a2e6-4fec-add8-fa6f149dadd8
    name: 4_freeze.vcf.gz.csi
    md5sum: ced3865e2a31897e5e303045a5ec5fb6
    filesize: 122.6KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:fb7bb513-a688-47c9-82d5-b770878f445e
    name: 16_freeze.vcf.gz.csi
    md5sum: a1ca66733837987f48d5cfdf81cedd7a
    filesize: 50.4KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:5365449b-a34f-46f9-b791-cc8721ca4898
    name: 7_freeze.vcf.gz.csi
    md5sum: 63320d1b9169a9cf0527fb2adcd53cdc
    filesize: 101.8KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:1f4508aa-28a5-45d1-90a0-7c3194957ea2
    name: 15_freeze.vcf.gz.csi
    md5sum: d50a2f8e9f485acd1b328e7deee56fc8
    filesize: 51.7KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:1527cc5a-295e-41e5-936d-f4769c0de4d8
    name: 20_freeze.vcf.gz.csi
    md5sum: 424757f9b03c71eac9467bf0565c7342
    filesize: 38.2KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:0f871826-9a4c-431f-b82b-1d557463e292
    name: 3_freeze.vcf.gz.csi
    md5sum: a304a2029d4e4007a2a55cc60b5351af
    filesize: 127.9KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:a323bd35-2699-4dc8-a6e3-5373e2cb1700
    name: 19_freeze.vcf.gz.csi
    md5sum: 1b8651ddb8b72717598014335d81b380
    filesize: 35.7KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:df4d8a66-db06-4eac-91b1-e668231133b9
    name: 12_freeze.vcf.gz.csi
    md5sum: 019796cdd326aeb8f1218ec35d5851bd
    filesize: 85.4KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:e6b3afc0-066a-459b-98b4-709acd7ec2f7
    name: 10_freeze.vcf.gz.csi
    md5sum: 74125f1fc2d9d7a16bd040cb219bd425
    filesize: 85.5KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:3f033bd8-0fe3-4118-8139-0981b16c740a
    name: 21_freeze.vcf.gz.csi
    md5sum: 3ff7547f2c3af46d6a733adf0849f935
    filesize: 22.1KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:890964bd-eb89-43d7-8e71-fd10441cd79a
    name: 1_freeze.vcf.gz.csi
    md5sum: 9c00c0c1e0130551b18ff03ed44c30a2
    filesize: 145.6KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:6811bfed-9dfc-4289-94c4-26868bce2010
    name: X_freeze.vcf.gz.csi
    md5sum: 9ce2da7016463ab856b06f5fa0ec9be3
    filesize: 96.0KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:7231880e-ed5c-44c2-8cbc-a9b0f015be62
    name: 11_freeze.vcf.gz.csi
    md5sum: 35acd0fd59f3d23cdc20838a7379eb3e
    filesize: 85.2KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:a767611d-98c7-42d6-8345-ce1b3d2b0677
    name: 13_freeze.vcf.gz.csi
    md5sum: 9a370853a88d632ec83b163de8b06cdb
    filesize: 62.1KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:6f76db9d-daa9-4ba0-8719-1986ebf80fd5
    name: 5_freeze.vcf.gz.csi
    md5sum: bcfa026b5a0e7e42c4226cd04246f502
    filesize: 116.1KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:c9bac3f4-42cc-4787-be4f-21c91bfb2309
    name: 18_freeze.vcf.gz.csi
    md5sum: 2ee78986acc3243a7effe129fcfca4b2
    filesize: 48.4KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:02e936ab-8803-401f-a996-7f8f2594e3e8
    name: 2_freeze.vcf.gz.csi
    md5sum: b21f248b785fcf0db92f72a3c3c66b2f
    filesize: 156.1KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:6ee4bbea-6596-4f85-bbae-fb68232c82e8
    name: 22_freeze.vcf.gz.csi
    md5sum: 452be3c66312943409d7c08ed6455639
    filesize: 22.1KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:0fda6e99-0f33-4608-b5f9-1837f512b084
    name: 14_freeze.vcf.gz.csi
    md5sum: d87ef5118d4bb52d359965ae130a1c02
    filesize: 56.6KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:58d01ac5-3349-474b-b1ef-fc03f4df4613
    name: 22_freeze.vcf.gz
    md5sum: 2d2afea28e69432571cfa69982b76018
    filesize: 4.4GB
    filetype: .gz
    number_of_variants: 552675
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:b5ea89d9-cc58-4bd2-9052-313d940ebe8a
    name: 20_freeze.vcf.gz
    md5sum: f36f060e329ee42a94ea74fab5c7cbf2
    filesize: 7.5GB
    filetype: .gz
    number_of_variants: 970869
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:5b3664fd-8e7f-4cc8-8de4-866aab57b02a
    name: 18_freeze.vcf.gz
    md5sum: f6519f3803638277bc7326609f3b8db1
    filesize: 9.4GB
    filetype: .gz
    number_of_variants: 1220427
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:fe960a81-694c-49d0-bd89-6baf542d1a56
    name: 15_freeze.vcf.gz
    md5sum: 25c273b0a99d94e163d224b904b34f01
    filesize: 9.7GB
    filetype: .gz
    number_of_variants: 1262404
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:830d7a0e-b211-40b6-b0ec-37b5af7e79cf
    name: 16_freeze.vcf.gz
    md5sum: 89e2a8ddc4353cf0516605bdf4b7aae8
    filesize: 10.6GB
    filetype: .gz
    number_of_variants: 1373607
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:7c9341e3-5296-45e9-a60c-691750b84203
    name: X_freeze.vcf.gz
    md5sum: fabcf5ff25b5e630ad8664130d033a62
    filesize: 10.5GB
    filetype: .gz
    number_of_variants: 1700742
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:b54eb344-977d-4743-904a-bcf5e4d8e078
    name: 11_freeze.vcf.gz
    md5sum: 90b673f50487abf917ca779689735c3c
    filesize: 16.4GB
    filetype: .gz
    number_of_variants: 2125064
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:500261fd-e6e5-416c-b19f-6b27bb55be48
    name: 8_freeze.vcf.gz
    md5sum: a79f949b6c8ca5a5f9d7e05dc92ba5ea
    filesize: 18.8GB
    filetype: .gz
    number_of_variants: 2451009
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:cd7fe7b0-39d9-4f5a-93db-936fef961344
    name: 19_freeze.vcf.gz
    md5sum: 9d935e33ecacb23b03923e15f6f320e1
    filesize: 7.0GB
    filetype: .gz
    number_of_variants: 886630
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:1e7f8108-c598-4d63-b360-076d1aa90f92
    name: 12_freeze.vcf.gz
    md5sum: 8fd020f20e7301cec7be60750db5ecc8
    filesize: 15.7GB
    filetype: .gz
    number_of_variants: 2047922
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:2fc39798-aabc-418b-b46e-ff8671f0a15c
    name: 5_freeze.vcf.gz
    md5sum: 36b370305ee3b6e7c65c0f43f981751b
    filesize: 21.6GB
    filetype: .gz
    number_of_variants: 2804359
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:0d9b1f68-b04b-456c-9b55-c72b3748d517
    name: 10_freeze.vcf.gz
    md5sum: 698b546cf2da9c8b7b7082f36972384d
    filesize: 16.3GB
    filetype: .gz
    number_of_variants: 2110436
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:24fcf40d-4b1d-4044-b291-5ed50820097b
    name: 6_freeze.vcf.gz
    md5sum: 8e5f91c3ea17e4b048b6866a629d51d3
    filesize: 21.0GB
    filetype: .gz
    number_of_variants: 2704091
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:78e30260-2ea2-435c-a770-3e7a3f2188f5
    name: 7_freeze.vcf.gz
    md5sum: 48bf00442127c34f1c254b274a3c1011
    filesize: 19.0GB
    filetype: .gz
    number_of_variants: 2445204
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:8b379f1d-c4f9-4e54-90b5-e99a73805117
    name: 4_freeze.vcf.gz
    md5sum: 6572535464c71243562c9beb0679b682
    filesize: 23.2GB
    filetype: .gz
    number_of_variants: 3019176
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:975fe55e-1ffc-4d68-a08d-6f25c1ba9833
    name: 21_freeze.vcf.gz
    md5sum: 89a7fc75e89fb03598cc88931769e90f
    filesize: 4.3GB
    filetype: .gz
    number_of_variants: 563988
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:d41ff792-d7aa-4088-8e24-c0c071d158e8
    name: 9_freeze.vcf.gz
    md5sum: 3737f24f7dc4cce49cc6d3cd9f7c59fc
    filesize: 14.2GB
    filetype: .gz
    number_of_variants: 1845456
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:b66ce097-f374-474a-a35a-eac64b23b6d5
    name: 13_freeze.vcf.gz
    md5sum: d64a161223a91e0382e4e132f03a3d69
    filesize: 11.8GB
    filetype: .gz
    number_of_variants: 1527053
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:98cb679a-0563-4801-967a-3313c39f1097
    name: 14_freeze.vcf.gz
    md5sum: a10c0758fde7dc9469276b9122ff03e2
    filesize: 10.7GB
    filetype: .gz
    number_of_variants: 1403580
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:9224d2eb-39e3-4443-81bb-eff4f31befc5
    name: 17_freeze.vcf.gz
    md5sum: d10ae9424b7c8b16385f85644aca578f
    filesize: 9.1GB
    filetype: .gz
    number_of_variants: 1177884
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:7d7b9047-48fb-4f01-bfbb-724ff8f341fa
    name: 3_freeze.vcf.gz
    md5sum: 2c54dcb4cd5acbb4ca93f2998d5f72e3
    filesize: 24.2GB
    filetype: .gz
    number_of_variants: 3147254
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:20038021-c6cb-4ef3-87e7-d9445867cc65
    name: 2_freeze.vcf.gz
    md5sum: ed4e605d0c0362981598e3e63b551a23
    filesize: 28.8GB
    filetype: .gz
    number_of_variants: 3749277
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:eafd2581-0dab-4ff7-a0c5-cede2d346a3f
    name: 1_freeze.vcf.gz
    md5sum: 1730c495dca8b0392d02912c27a5d02c
    filesize: 26.3GB
    filetype: .gz
    number_of_variants: 3406915
    number_of_participants: 1865
    belongs_to: data
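
Each .vcf.gz file in the listing above is accompanied by a .csi index with the same name plus the index suffix. The sketch below checks a list of received file names for VCFs lacking an index. The function name is hypothetical and the check relies only on the naming pattern visible in the listing.

```python
def missing_indexes(names, index_suffix=".csi"):
    """Return the .vcf.gz names in `names` that have no matching index entry."""
    present = set(names)
    vcfs = [name for name in names if name.endswith(".vcf.gz")]
    return sorted(vcf for vcf in vcfs if vcf + index_suffix not in present)
```

An empty result means every VCF in the list has its index alongside it.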

Whole exome sequencing - G0 & G1 (wes_novaseq_g0_g1)

Description

This dataset contains whole exome sequencing for G0 and G1 individuals. It was generated at the Sanger Institute as part of an initiative sequencing multiple birth cohorts: ALSPAC, MCS and BiB. As part of this initiative, the exome sequencing data will also be available via EGA, but researchers will still gain access through ALSPAC's project approval system.
Reference genome build: GRCh38

Methodology

Exome sequencing was conducted on DNA for 12,374 participants (8,605 children and 3,389 of their parents) at the Sanger Institute, using Illumina NovaSeq. Reads were aligned to GRCh38 with BWA-MEM. There was an average on-target depth of ~62X for ALSPAC.

QC was conducted on the dataset at the Sanger Institute; details are given in the associated publication (Koko et al., 2024). Sample QC was done before variant calling (base-calls after sequencing, alignment quality, CRAM file quality) and after variant calling (PCA analysis, comparison to array data, relatedness). Integrated variant QC removed potentially false positive variants using a trained random forest model. Genotype QC removed low quality individual genotype calls.

Single nucleotide variant (SNV) and small insertion/deletion (indel) calling was conducted with GATK HaplotypeCaller, GenomicsDBImport and GenotypeGVCFs (GATK version 4.2.4.0 for ALSPAC) following GATK best practices (Van der Auwera and O’Connor, 2020).

There were 12 individuals identified as having sex mismatches within the dataset, flagged on the basis of the X chromosome F statistic. On inspecting the Y coverage of these individuals, 3 were clear mismatches on both the X F statistic and Y depth, while 9 were mismatches on the X F statistic only. The 3 individuals with clear mismatches on both statistics were removed from the dataset, while the other 9 were retained.
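
The decision rule described above can be written down compactly. This sketch takes the two mismatch calls as already-computed booleans, because the thresholds used for the X F statistic and Y depth are not given here. The function name and return labels are hypothetical.

```python
def sex_qc_decision(x_f_stat_mismatch, y_depth_mismatch):
    """Apply the removal rule to one individual.

    Individuals are flagged on the X F statistic; only those that also
    mismatch on Y depth are removed.
    """
    if x_f_stat_mismatch and y_depth_mismatch:
        return "remove"
    if x_f_stat_mismatch:
        return "retain"
    return "not flagged"
```

Applied to the 12 flagged individuals, the rule yields 3 removals and 9 retained samples, matching the counts stated above.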

Associated publication:
- doi.org/10.12688/wellcomeopenres.22697.1

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_f7
name: >- 
  Whole Exome Sequencing - Novaseq - G0 & G1 version 2024-09-20 freeze 7
description: >-
  This is the first iteration of wes_novaseq_g0_g1, first introduced in freeze 4. It contains data in vcf 4.2 format. It contains the majority of the G1 cohort (n=~8296), accompanied by G0 mothers (n=~1642) and partners (n=~1630) to create trios. Over time participants may withdraw their consent and will subsequently be removed from the dataset, so the number of available individuals from each cohort may differ from that stated above.
  
  This exome sequencing (ES) was conducted at the Sanger Institute as part of an effort to exome sequence ALSPAC, MCS and BiB. All ES data was quality controlled at the Sanger Institute prior to this ALSPAC release, and this has been extensively documented in the relevant publication (see below).

  In brief (excerpt from associated publication, Koko et al., 2024):

    "Sample QC: 
      * Before variant calling: Samples were removed if they failed one or more filters based on quality of base-calls after sequencing, or quality of the CRAM files of aligned reads. The remainder then underwent variant calling.
      * After variant calling: We assigned individuals to populations using principal component analysis (PCA), then identified and removed individuals who were outliers on one or more variant-based metrics within each of the populations. We compared the exome data to genotyping array data from the same samples and removed samples that did not match as expected, since these could be sample mix-ups. The samples were also checked for unexpected relatedness; samples showing conflicts between reported and inferred relatedness were removed. This sample QC was split in two separate steps, before and after variant and genotype QC, as detailed in the coming sections. 
    Integrated variant and genotype QC:
      * Variant QC: We removed candidate variants which may not be real, instead being artefacts or mapping errors, using a trained random forest model to distinguish likely true positives from likely false positives. 
      * Genotype QC: We removed low-quality individual genotype calls from the dataset. This was done in conjunction with variant QC, as we will explain below."

  For extended information such as thresholds, please see the publication.

  Associated publication:
    Koko et al., 2024
    DOI: https://doi.org/10.12688/wellcomeopenres.22697.2


freeze_size: 167G
linker_file_md5sum: 2b2e49bc61c1a0efc3f57db6e48656ea
woc_file_md5sum: ee409be51e4e12594cb700dd1be99314
all_individuals_to_exclude_md5sum: 241c00aec78b7178d6797a83478dec02
git_tag: https://github.com/alspac/dataset_wes_novaseq_g0_g1/releases/tag/freeze7
is_current_freeze: true
freeze_number: 7
freeze_date: 2026-04-14
previous_freeze: alspacdcs:wes_novaseq_g0_g1_2024-09-20_f6
freeze_of_alspac_dataset_version: alspacdcs:wes_novaseq_g0_g1_2024-09-20
freeze_of_named_alspac_dataset: alspacdcs:wes_novaseq_g0_g1

contains:
- data
files: []
data:
  contains:
  - chr22_data.vcf.gz.csi
  - chr13_data.vcf.gz.csi
  - chr1_data.vcf.gz.csi
  - chr7_data.vcf.gz.csi
  - chr6_data.vcf.gz.csi
  - chr12_data.vcf.gz.csi
  - chr16_data.vcf.gz.csi
  - chr10_data.vcf.gz.csi
  - chr19_data.vcf.gz.csi
  - chr21_data.vcf.gz.csi
  - chr3_data.vcf.gz.csi
  - chr18_data.vcf.gz.csi
  - chr9_data.vcf.gz.csi
  - chr15_data.vcf.gz.csi
  - chr17_data.vcf.gz.csi
  - chr14_data.vcf.gz.csi
  - chr11_data.vcf.gz.csi
  - chr20_data.vcf.gz.csi
  - chrY_data.vcf.gz.csi
  - chr8_data.vcf.gz.csi
  - chr2_data.vcf.gz.csi
  - chrX_data.vcf.gz.csi
  - chr5_data.vcf.gz.csi
  - chr4_data.vcf.gz.csi
  - chrY_data.vcf.gz
  - chr13_data.vcf.gz
  - chr22_data.vcf.gz
  - chr10_data.vcf.gz
  - chr8_data.vcf.gz
  - chr6_data.vcf.gz
  - chr4_data.vcf.gz
  - chr15_data.vcf.gz
  - chr17_data.vcf.gz
  - chr16_data.vcf.gz
  - chr21_data.vcf.gz
  - chr18_data.vcf.gz
  - chr12_data.vcf.gz
  - chr20_data.vcf.gz
  - chr3_data.vcf.gz
  - chr9_data.vcf.gz
  - chrX_data.vcf.gz
  - chr11_data.vcf.gz
  - chr14_data.vcf.gz
  - chr19_data.vcf.gz
  - chr5_data.vcf.gz
  - chr7_data.vcf.gz
  - chr2_data.vcf.gz
  - chr1_data.vcf.gz
  files:
  - id: alspacdcs:8a3b26be-1782-45ce-a0a9-6f429cec1122
    name: chr22_data.vcf.gz.csi
    md5sum: 4d40cd99bf09e598fbef5f527af8e767
    filesize: 11.0KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:57c0839c-74c5-4480-8736-8c9934e40a84
    name: chr13_data.vcf.gz.csi
    md5sum: f94b86d972456503385ce9e9b881f775
    filesize: 13.4KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:965672f3-1aa1-47ec-a2e4-0cc25fa4a9f6
    name: chr1_data.vcf.gz.csi
    md5sum: 729ae8bd04a34b1ee79522566daa7f68
    filesize: 59.5KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:e396ed7a-194b-402a-8e7d-dbda43f58a93
    name: chr7_data.vcf.gz.csi
    md5sum: a67f796828a7badc593576b1b3195ca8
    filesize: 32.3KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:30e6067a-fe4d-44e4-9360-42344902aa25
    name: chr6_data.vcf.gz.csi
    md5sum: 06258ad5e10d7f86ff78295c9c8e9beb
    filesize: 32.2KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:a623bf09-a68e-4afb-803a-bd9981b95b8f
    name: chr12_data.vcf.gz.csi
    md5sum: 65f18ea429660898a97ba7cedefd6d3f
    filesize: 31.7KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:f0d0d193-0808-426d-8079-3a89cfb81851
    name: chr16_data.vcf.gz.csi
    md5sum: e49e162d6b2e6d0e92a112dc1b6d4f1c
    filesize: 19.9KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:23d69501-de42-48f6-83e6-26c290fdd425
    name: chr10_data.vcf.gz.csi
    md5sum: d5269b0c9d649a3803c870b1ec3b1009
    filesize: 27.9KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:4caa0a2c-7596-43b5-a553-3a5d77ce33b2
    name: chr19_data.vcf.gz.csi
    md5sum: 6b862d1a7d1e7189d34621f8f2705cc2
    filesize: 23.7KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:c4794de5-749c-43a3-a71b-0a6e89658e27
    name: chr21_data.vcf.gz.csi
    md5sum: ce7aef5c2527e2f382b622e501326afd
    filesize: 6.3KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:734f2248-ee8d-4d5c-9035-48d49d6ed6b3
    name: chr3_data.vcf.gz.csi
    md5sum: 0fdfb39df4fa72202300cb0cc9ef9f7b
    filesize: 37.9KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:4c552e09-adc4-46ce-ac91-34b7dfe4a16e
    name: chr18_data.vcf.gz.csi
    md5sum: 770a397c54e1e6de4c9f191ba462cada
    filesize: 12.4KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:79906959-b3e0-4aa5-8761-02256020b129
    name: chr9_data.vcf.gz.csi
    md5sum: 48131c8e15307a9eeae874fa43c248f9
    filesize: 24.9KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:43215052-f122-4c35-8d9b-93750f83994b
    name: chr15_data.vcf.gz.csi
    md5sum: a35370a4bf0b8254b9628723e8831ad1
    filesize: 19.6KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:c0c30f14-f15f-408a-ba67-f96e49b1addd
    name: chr17_data.vcf.gz.csi
    md5sum: f32ca2f1e68c0ab14f9d3224d14395fc
    filesize: 26.1KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:49cdf8c0-b8df-4dcf-9df8-a4e0a8950e13
    name: chr14_data.vcf.gz.csi
    md5sum: 8faa86b078b4c12e4303c372585ff55f
    filesize: 19.1KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:5d85a999-7491-404e-a073-0551f52c8199
    name: chr11_data.vcf.gz.csi
    md5sum: 1e34bf00d9fa5f2db76199804c9f35d0
    filesize: 31.5KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:1f60c8f0-f828-4934-aef0-97a0ac35f369
    name: chr20_data.vcf.gz.csi
    md5sum: cb3e42a96498b4b29c16b75ef71a4572
    filesize: 14.7KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:21af75a1-4342-4991-9602-5bbcd3ad0801
    name: chrY_data.vcf.gz.csi
    md5sum: fbf15f3a73050773d1f44d791e2f2db4
    filesize: 129.0B
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:6ee51796-0a87-4536-8e6d-383786e45063
    name: chr8_data.vcf.gz.csi
    md5sum: d2f203ca424f57e6f1a9164b1adb4d90
    filesize: 24.6KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:94ce8232-1371-41ad-8727-ca91f8c9bcd0
    name: chr2_data.vcf.gz.csi
    md5sum: 652a93e0d647dfbaf71553814a6123ea
    filesize: 47.7KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:2df0696c-f698-48e7-9bec-03c19cfdc5ef
    name: chrX_data.vcf.gz.csi
    md5sum: 9bbf7c95272bb3fc934accf2395f3688
    filesize: 22.9KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:c6b2a432-725d-4260-87e4-598748436530
    name: chr5_data.vcf.gz.csi
    md5sum: c1e2a36daa66e681d51df75d8d2f10f9
    filesize: 30.7KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:8ebf7f7e-f32e-490b-ad1e-8b8f0e261ed5
    name: chr4_data.vcf.gz.csi
    md5sum: 886420754ecac5e9221723a0d5678b51
    filesize: 29.7KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:ba6b39a7-4598-4141-9c54-2f500dd1b21f
    name: chrY_data.vcf.gz
    md5sum: 3de3f6d835d02ed5b157fd0d46fa6aef
    filesize: 363.7KB
    filetype: .gz
    number_of_variants: 9
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:de75c4cf-3639-4b4b-83c1-8f54f4323efc
    name: chr13_data.vcf.gz
    md5sum: 33a9a7046ab9709d276e7f77b9264d15
    filesize: 2.8GB
    filetype: .gz
    number_of_variants: 63931
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:61610b3e-5137-432b-a644-1426c4095791
    name: chr22_data.vcf.gz
    md5sum: a0314caeba0f87c2efdc7db1f188f14d
    filesize: 4.2GB
    filetype: .gz
    number_of_variants: 94446
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:c9fd9aa4-844c-423c-85ea-8df30205d644
    name: chr10_data.vcf.gz
    md5sum: 3d2bcfb24defbccca9890e12840bb22a
    filesize: 6.5GB
    filetype: .gz
    number_of_variants: 149730
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:ee531b1c-fd5b-4ebf-a604-43a76198cb7a
    name: chr8_data.vcf.gz
    md5sum: efb58d5788ddd8a316f95077e66c1382
    filesize: 5.9GB
    filetype: .gz
    number_of_variants: 133894
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:d4bf6ddc-ff5a-47f3-8f39-ad045f950ca9
    name: chr6_data.vcf.gz
    md5sum: fb5b97331f2c406734e13af4053a8265
    filesize: 8.0GB
    filetype: .gz
    number_of_variants: 181754
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:9113bde3-88cd-4b1c-a9e7-1e4918583373
    name: chr4_data.vcf.gz
    md5sum: fa456519600291c786463707f71f86b5
    filesize: 6.1GB
    filetype: .gz
    number_of_variants: 140675
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:1b5f0c33-d553-44b3-9cc3-5498ab84be94
    name: chr15_data.vcf.gz
    md5sum: 1472911f32fc32e4f2f6dc2489874e92
    filesize: 5.6GB
    filetype: .gz
    number_of_variants: 127646
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:213622c8-0e8c-44c4-9125-a3ff320a66d3
    name: chr17_data.vcf.gz
    md5sum: 28961948c8829902c6c1b09f915bb668
    filesize: 10.0GB
    filetype: .gz
    number_of_variants: 224774
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:fcefc4e6-59b5-4424-9bf4-ae34bbc6571d
    name: chr16_data.vcf.gz
    md5sum: c74cf1faf268754a867922dd53f6ee00
    filesize: 8.3GB
    filetype: .gz
    number_of_variants: 186300
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:4b03ec88-9730-43d2-bb6c-a1ebb20b8938
    name: chr21_data.vcf.gz
    md5sum: 8dee081247086ab96cd74d27bec3a6b2
    filesize: 1.9GB
    filetype: .gz
    number_of_variants: 42207
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:1d387e0d-0ff4-4a90-9029-00dd88d82aca
    name: chr18_data.vcf.gz
    md5sum: e2a0cc57078b58ddb924db7b84cd0472
    filesize: 2.5GB
    filetype: .gz
    number_of_variants: 57017
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:524c98c9-218c-407c-9776-c1093e0f1fa8
    name: chr12_data.vcf.gz
    md5sum: b581896080fe85eb64d621cdbfda70d9
    filesize: 8.5GB
    filetype: .gz
    number_of_variants: 193518
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:f9c9e7dc-6a61-41ab-8533-f6e79707c555
    name: chr20_data.vcf.gz
    md5sum: b4256c2e079b5e6ebc5ac330fe1af91e
    filesize: 4.3GB
    filetype: .gz
    number_of_variants: 96655
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:dcb906e3-006e-4955-bde2-a377c702b214
    name: chr3_data.vcf.gz
    md5sum: 0616978a476604a986be94bce09675d6
    filesize: 9.1GB
    filetype: .gz
    number_of_variants: 206875
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:6578f805-09bc-4790-88e5-ae69ad2f7fa7
    name: chr9_data.vcf.gz
    md5sum: 30bd61b388a42f540935847e333ad642
    filesize: 7.1GB
    filetype: .gz
    number_of_variants: 161039
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:81363938-e441-4418-a7bc-9800cf1aec45
    name: chrX_data.vcf.gz
    md5sum: 93fd761ac84723f9fab2855848d4b4a9
    filesize: 3.8GB
    filetype: .gz
    number_of_variants: 86925
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:969a8e18-ce7c-4807-aad5-f3941925931d
    name: chr11_data.vcf.gz
    md5sum: 7181d48cc29afc3a44011eb83d98b947
    filesize: 10.2GB
    filetype: .gz
    number_of_variants: 227858
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:b4dcef8e-d8ac-49a6-8b0a-bc6ff4943412
    name: chr14_data.vcf.gz
    md5sum: 6269ae164370c737cf450b2ba9460e07
    filesize: 5.7GB
    filetype: .gz
    number_of_variants: 128137
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:0cc3e92a-ca2d-4da8-9a52-a933d5bd9160
    name: chr19_data.vcf.gz
    md5sum: ccba7c4c489754c9ffe363f1c84d0bb0
    filesize: 12.5GB
    filetype: .gz
    number_of_variants: 271080
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:fc94d850-2b68-49f4-ac54-a12e43635fc7
    name: chr5_data.vcf.gz
    md5sum: 60dc8fc1ec0795b70bf283341236b499
    filesize: 7.0GB
    filetype: .gz
    number_of_variants: 161010
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:7219e0a7-5f72-4784-8ade-1d8ccd27592c
    name: chr7_data.vcf.gz
    md5sum: 3799089aff8bd3e4af30af72b071b622
    filesize: 8.1GB
    filetype: .gz
    number_of_variants: 181925
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:4206626c-c630-4d43-8dee-07438842e54b
    name: chr2_data.vcf.gz
    md5sum: 78a3167c4ff6e6c91ac0be0e5c079eb4
    filesize: 11.8GB
    filetype: .gz
    number_of_variants: 272150
    number_of_participants: 11497
    belongs_to: data
  - id: alspacdcs:f89da18d-5182-46d9-88a4-00315ba0cc2d
    name: chr1_data.vcf.gz
    md5sum: b1bcc953b2bf4afd1bee926935f18812
    filesize: 16.3GB
    filetype: .gz
    number_of_variants: 370645
    number_of_participants: 11497
    belongs_to: data
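
Each file entry in the freeze documentation above carries an md5sum, which can be used to confirm that a received copy is complete. The following is a minimal sketch of such a check, not an official ALSPAC tool: the function names and the `EXPECTED` mapping are illustrative, and only two of the listed files are shown.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected, directory):
    """Return the file names that are missing or whose MD5 differs from the expected value."""
    bad = []
    for name, md5sum in expected.items():
        path = Path(directory) / name
        if not path.is_file() or md5_of_file(path) != md5sum:
            bad.append(name)
    return bad


# Expected values copied from the freeze documentation above (subset only).
EXPECTED = {
    "chr22_data.vcf.gz.csi": "4d40cd99bf09e598fbef5f527af8e767",
    "chr22_data.vcf.gz": "a0314caeba0f87c2efdc7db1f188f14d",
}
```

Calling `find_mismatches(EXPECTED, "data")`, where `data` is an assumed local directory holding the received files, returns an empty list when every listed file is present and matches.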

Whole exome sequencing - G1 (wes_novaseq_g1)

Description

This dataset contains whole exome sequencing for G1 individuals. It was generated at the Broad Institute for ~2900 G1 individuals.
Reference genome build: GRCh38

Methodology

The exomes returned from the Broad Institute did not undergo PCA or relatedness filtering; instead, they were provided as raw VCF data. The following thresholds were applied to the samples:

Chimera rate: Less than 0.05
Contamination rate: Less than 0.10
PF aligned rate: More than 0.60

87 individuals believed to be sample mismatches were removed from the dataset. These exomes had a discordance rate above 0.05 when compared to existing array data using bcftools gtcheck.
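
The sample-level criteria above can be expressed as a single predicate. This is a sketch under stated assumptions rather than the pipeline that was actually run: the `SampleMetrics` class and its field names are hypothetical, and the array discordance check is folded in alongside the three sequencing thresholds purely for illustration.

```python
from dataclasses import dataclass

# Thresholds as stated in the Methodology text above.
MAX_CHIMERA_RATE = 0.05
MAX_CONTAMINATION_RATE = 0.10
MIN_PF_ALIGNED_RATE = 0.60
MAX_ARRAY_DISCORDANCE = 0.05


@dataclass
class SampleMetrics:
    """Hypothetical container for per-sample QC metrics."""
    sample_id: str
    chimera_rate: float
    contamination_rate: float
    pf_aligned_rate: float
    array_discordance: float


def passes_qc(m):
    """True if the sample meets all three thresholds and is not a suspected mismatch."""
    return (
        m.chimera_rate < MAX_CHIMERA_RATE
        and m.contamination_rate < MAX_CONTAMINATION_RATE
        and m.pf_aligned_rate > MIN_PF_ALIGNED_RATE
        # Samples with discordance above 0.05 were treated as mismatches.
        and m.array_discordance <= MAX_ARRAY_DISCORDANCE
    )
```

For example, `passes_qc(SampleMetrics("s1", 0.01, 0.02, 0.95, 0.001))` evaluates to `True`, while raising the chimera rate to 0.06 makes the same sample fail.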

Associated publications:
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9980234/ (conducted additional QC beyond dataset)

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:wes_novaseq_g1_204-04-12_f7
name: >- 
  Whole Exome Sequencing - Novaseq - G1 version 2024-04-09 freeze 7
description: >-
  This contains whole exome sequencing done at the Broad Institute, first introduced in freeze 4. It contains data in vcf 4.2 format and an index file in csi format. It is a subset of the G1 cohort, with participants who have withdrawn their consent removed and omics IDs applied according to the freeze.
  
  Samples were selected for whole exome sequencing at the Broad Institute from the G1 cohort (the cohort of index children) and were from subjects who were singletons/unrelated and of European/British ancestry, had blood-derived DNA available, and had been genotyped on a whole genome genotyping array.

  QC was performed at the Broad Institute. The following thresholds were applied:
  Chimera rate < 0.05
  Contamination rate < 0.10
  PF aligned rate > 0.60

  87 individuals believed to be sample mismatches were removed from the dataset. These exomes had a discordance rate above 0.05 when compared to existing array data using bcftools gtcheck.

  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9980234/ describes this dataset in supplementary materials. 

freeze_size: 28G
linker_file_md5sum: 2b2e49bc61c1a0efc3f57db6e48656ea
woc_file_md5sum: ee409be51e4e12594cb700dd1be99314
all_individuals_to_exclude_md5sum: 241c00aec78b7178d6797a83478dec02
git_tag: https://github.com/alspac/dataset_wes_novaseq_g1/releases/tag/freeze7
is_current_freeze: true
freeze_number: 7
freeze_date: 2026-04-14
previous_freeze: alspacdcs:wes_novaseq_g1_204-04-12_f6

freeze_of_alspac_dataset_version: alspacdcs:wes_novaseq_g1_2024-03-26
freeze_of_named_alspac_dataset: alspacdcs:wes_novaseq_g1

contains:
- data

files: []
data:
  contains:
  - all_chr.vcf.gz.csi
  - all_chr.vcf.gz
  files:
  - id: alspacdcs:1356f38d-878f-48f4-a5c6-b7fe298771a0
    name: all_chr.vcf.gz.csi
    md5sum: d1c0a5185ac37acc797c79849532a306
    filesize: 785.7KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:8ba6adde-0084-49ce-be84-9351f3c577d8
    name: all_chr.vcf.gz
    md5sum: 25a4b814e5d51d6c5cbb0c1894b6eb25
    filesize: 27.1GB
    filetype: .gz
    number_of_variants: 2965032
    number_of_participants: 2879
    belongs_to: data
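
The `number_of_participants` field above can be compared against the sample columns of a received VCF. The sketch below reads only the header and uses the Python standard library; the function name is illustrative, and it assumes the file is gzip or bgzip compressed, as the `.vcf.gz` extension suggests.

```python
import gzip


def vcf_sample_ids(path):
    """Return the sample IDs from the #CHROM header line of a compressed VCF."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first nine columns are fixed VCF fields; sample IDs follow.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                # Reached the data records without seeing a header line.
                break
    raise ValueError("no #CHROM header line found")
```

`len(vcf_sample_ids("all_chr.vcf.gz"))` could then be compared with the documented participant count, bearing in mind that the count may change between freezes as consent is withdrawn.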

Epigenetic Data

DNA methylation - EPIC & 450k - G0 + G1 (dnam_epic450_g0_g1)

Description

This dataset contains methylation data collected from both G0 and G1 on two arrays at different timepoints. This dataset supersedes dnam_450_g0m_g1.

It includes data from the Illumina Infinium HumanMethylation450K BeadChip array on G0 mothers at two timepoints (pregnancy and middle age), G1 participants at 5 timepoints (across birth, childhood and adolescence) and G0 participants at one timepoint. It also contains Infinium MethylationEPIC v1.0 data on 2721 G1 individuals at 2 timepoints.

This dataset was generated as part of the Accessible Resource for Integrated Epigenomics Studies (http://www.ariesepigenomics.org.uk/).

Methodology

Preprocessing and quality control for this dataset was conducted using Meffil.

Associated publications:
- https://doi.org/10.1093/ije/dyv072
- https://doi.org/10.1093/bioinformatics/bty476

Associated R packages:
- aries: https://github.com/MRCIEU/aries is associated with loading and using this dataset
- meffil: https://github.com/perishky/meffil/ was used for QC and normalisation of this dataset

Sample types

The sample type used for each sample can be identified using the samplesheets provided within the standard dataset. The column samcode contains a 3-digit number which indicates the origin of the sample.

SAMCODE	COHORT	TIME_POINT	SAMPLE_TYPE	ADDITIVE
100	mother	antenatal	whole blood	EDTA
109	mother	antenatal	white blood cells	heparin
211	child	Birth	white blood cells	heparin
212	child	Birth	blood spot	heparin
342	child	CIF clinic@43m	white blood cells	EDTA
343	child	CIF clinic@43m	whole blood	EDTA
357	child	CIF clinic@61m	white blood cells	EDTA
400	child	F@7	whole blood	EDTA
402	child	F@7	white blood cells	EDTA
427	child	F@9	PBL	CPDA
428	child	F@9	whole blood	CPDA
430	child	F@9	accuspin remains	CPDA
475	carer	2004-2008	PBL	CPDA
476	carer	2004-2008	whole blood	CPDA
492	child	TF3	whole blood	CPDA
497	child	TF3	white blood cells
511	mother	FOM1	white blood cells	EDTA
526	child	F17	white blood cells	EDTA
942	child	YP24+	white blood cells	EDTA
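
The table above can be turned into a small lookup for annotating samplesheet rows. This is an illustrative sketch only: the dictionary is transcribed from the table, the names `SAMCODES`, `FIELDS` and `describe_samcode` are not part of any ALSPAC package, and `None` marks the one additive that the table leaves blank.

```python
# (cohort, time point, sample type, additive), keyed by samcode.
SAMCODES = {
    100: ("mother", "antenatal", "whole blood", "EDTA"),
    109: ("mother", "antenatal", "white blood cells", "heparin"),
    211: ("child", "Birth", "white blood cells", "heparin"),
    212: ("child", "Birth", "blood spot", "heparin"),
    342: ("child", "CIF clinic@43m", "white blood cells", "EDTA"),
    343: ("child", "CIF clinic@43m", "whole blood", "EDTA"),
    357: ("child", "CIF clinic@61m", "white blood cells", "EDTA"),
    400: ("child", "F@7", "whole blood", "EDTA"),
    402: ("child", "F@7", "white blood cells", "EDTA"),
    427: ("child", "F@9", "PBL", "CPDA"),
    428: ("child", "F@9", "whole blood", "CPDA"),
    430: ("child", "F@9", "accuspin remains", "CPDA"),
    475: ("carer", "2004-2008", "PBL", "CPDA"),
    476: ("carer", "2004-2008", "whole blood", "CPDA"),
    492: ("child", "TF3", "whole blood", "CPDA"),
    497: ("child", "TF3", "white blood cells", None),
    511: ("mother", "FOM1", "white blood cells", "EDTA"),
    526: ("child", "F17", "white blood cells", "EDTA"),
    942: ("child", "YP24+", "white blood cells", "EDTA"),
}

FIELDS = ("cohort", "time_point", "sample_type", "additive")


def describe_samcode(code):
    """Return a dict describing a samcode (int or numeric string), or None if unknown."""
    entry = SAMCODES.get(int(code))
    if entry is None:
        return None
    return dict(zip(FIELDS, entry))
```

Accepting both integers and numeric strings is a deliberate choice, since the samcode column may be read as either type depending on how the samplesheet CSV is loaded.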

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:dnam_epic450_g0_g1_2022-7-13_f7
name: >-
  DNA methylation - EPIC & 450k - G0 + G1 version 2022-7-13 Freeze 7
description: >-
  This is the freeze 7 version of dnam_epic450_g0_g1.

  This dataset consists of multiple sections, each described below:
  Betas: 
      Normalized betas using functional normalization. We used 10 PCs on the control matrix to regress out technical variation. Slide was regressed out as a random effect before normalization. CpGs are in rows and samples in columns. These are in .gds format.
  control_matrix: 
      The 850 control probes are summarized in 42 control types. These probes can roughly be divided into negative control probes (613), probes intended for between array normalization (186) and the remainder (49), which are designed for quality control, including assessing the bisulfite conversion rate. None of these probes are designed to measure a biological signal. The summarized control probes can be used as surrogates for unwanted variation and are used for the functional normalization. Samples are rows and 42 control types are in columns. These are in .txt format. 
  derived:
      dnamage: 
          DNA methylation aging estimates from within the dataset. Further information on this data and its usage is found within the `dnamage.html` and `dnamage.md` within the docs dir/folder. 
          dnamage data file is a csv file containing DNA methylation aging estimates within the dataset.
      cellcounts:
          Files contain cell counts estimated using a variety of cell type references using the Houseman deconvolution algorithm (PMID: 22568884). In each file, samples correspond to rows and cell types to columns.
      reports:
          Collection of QC and normalization reports generated by the R meffil package upon freeze creation. This was first introduced in freeze 6. These are in html format. 
  detection_p_values:
      This matrix shows the detection pvalues for each sample and each CpG and is extracted from the idat files using the "meffil.load.detection.pvalues" function in meffil. CpGs are in rows and samples in columns. These are .gds files.
  samplesheet:
      Manifest files with columns extracted directly from LIMS, plus age, sex, omics ID, timepoint, timecode, sampletype, genotype columns to report sample mismatches, and a duplicate.rm column to remove duplicates. Samples in rows, variables in columns. These are csv files, and samplesheet.csv is the same as samplesheet-common.csv.
  
  cell count files specific details:
      andrews-and-bakulski-cord-blood.txt
          Cord blood cell count estimates derived using the Bakulski et al. 2016 reference (PMID 27019159; https://bioconductor.org/packages/release/data/experiment/html/FlowSorted.CordBlood.450k.html). This reference has been implemented in meffil. Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, natural killer cells and nucleated red blood cells. In this text file, samples are in rows and cell types in columns.
      gervin-and-lyle-cord-blood.txt
          Cord blood cell count estimates derived using the Gervin et al. 2019 reference (PMID 31455416; GEO accession GSE127824). Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, and natural killer cells. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns.
      cord-blood-gse68456.txt
          Cord blood cell count estimates derived using the de Goede et al. 2015 reference (PMID 26366232; GEO accession GSE68456). Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, natural killer cells and nucleated red blood cells. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns.
      blood-gse35069-complete.txt
          Cell counts in peripheral blood predicted using the peripheral blood reference published in Reinius et al. 2012 (PMID: 22848472). Same as 'blood-gse35069.txt' but replaces granulocytes with eosinophils and neutrophils. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns.
      blood-gse35069.txt
          Blood cell count estimates derived using the Reinius et al. 2012 reference (PMID 25424692; GEO accession GSE35069). Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, and natural killer cells. In this text file, samples are in rows and cell types in columns.
      blood-idoloptimized-epic.txt
          Cell counts in peripheral blood predicted using the cell type reference from Bioconductor package FlowSorted.Blood.EPIC. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns.
      blood-idoloptimized.txt
          Cell counts in peripheral blood predicted using the cell type reference from Bioconductor package FlowSorted.Blood.EPIC but restricted to the IDOLOptimizedCpGs450klegacy CpG sites. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns.
      combined-cord-blood.txt
          Cord blood cell count estimates derived using the Bakulski et al., Gervin et al., de Goede et al., and Lin et al. references (https://bioconductor.org/packages/release/data/experiment/html/FlowSorted.CordBloodCombined.450k.html) for CpG sites selected using the IDOL algorithm and optimized for the Illumina Infinium HumanMethylation450 Beadchip. Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, natural killer cells and nucleated red blood cells. In this text file, samples are in rows and cell types in columns.

freeze_size: 137G
linker_file_md5sum: 2b2e49bc61c1a0efc3f57db6e48656ea
woc_file_md5sum: ee409be51e4e12594cb700dd1be99314
all_individuals_to_exclude_md5sum: 241c00aec78b7178d6797a83478dec02
git_tag: https://github.com/alspac/dataset_dnam_epic450_g0_g1/releases/tag/Freeze7
is_current_freeze: true
freeze_number: 7
freeze_date: 2026-04-14
previous_freeze: 6
freeze_of_alspac_dataset_version: alspacdcs:dnam_epic450_g0_g1_2022-7-13
freeze_of_named_alspac_dataset: alspacdcs:dnam_epic450_g0_g1

contains:
- data
- docs
files: []
data:
  contains:
  - samplesheet
  - derived
  - betas
  - detection_p_values
  - control_matrix
  files: []
  samplesheet:
    contains:
    - samplesheet-epic.csv
    - samplesheet-450.csv
    - samplesheet.csv
    - samplesheet-common.csv
    files:
    - id: alspacdcs:d726689a-c026-4e9b-a07e-b3583571456d
      name: samplesheet-epic.csv
      md5sum: 3c6cf2548270401c9b765072b61ddbc1
      filesize: 1.0MB
      filetype: .csv
      belongs_to: data/samplesheet
    - id: alspacdcs:56b20a16-8f16-465c-bbef-7321f3ef3c85
      name: samplesheet-450.csv
      md5sum: ad79e189101bab7892a842ef3f81baf8
      filesize: 2.1MB
      filetype: .csv
      belongs_to: data/samplesheet
    - id: alspacdcs:b6944d26-ea12-4905-92e6-d12c2b63458d
      name: samplesheet.csv
      md5sum: 93ff2150a923bb2d43d6dfc5eab9305a
      filesize: 3.2MB
      filetype: .csv
      belongs_to: data/samplesheet
    - id: alspacdcs:ba3f2981-a6e1-4074-9be7-bfbbff245efd
      name: samplesheet-common.csv
      md5sum: 93ff2150a923bb2d43d6dfc5eab9305a
      filesize: 3.2MB
      filetype: .csv
      belongs_to: data/samplesheet
  derived:
    contains:
    - dnamage.csv
    - reports
    - cellcounts
    files:
    - id: alspacdcs:3fbefa8e-8514-41f6-9a08-1b5962c35d54
      name: dnamage.csv
      md5sum: f023bfb85363895da4a141489c0bd013
      filesize: 11.2MB
      filetype: .csv
      belongs_to: data/derived
    reports:
      contains:
      - qc
      - normalization
      files: []
      qc:
        contains:
        - qc-report-450.html
        - qc-report-common.html
        - qc-report-epic.html
        - qc-report-450.md
        - qc-report-common.md
        - qc-report-epic.md
        - figure
        files:
        - id: alspacdcs:65ff1eb7-5df7-4e69-95ef-42dfc3604160
          name: qc-report-450.html
          md5sum: c042af99690905c8de8d8c3962f0465d
          filesize: 1.7MB
          filetype: .html
          belongs_to: data/derived/reports/qc
        - id: alspacdcs:ed6e1dea-38da-4f77-920d-1bf99aa3a7d6
          name: qc-report-common.html
          md5sum: b70ef111c47d87836c0d7c826be61154
          filesize: 1.6MB
          filetype: .html
          belongs_to: data/derived/reports/qc
        - id: alspacdcs:7486aa1b-52a5-42e0-a5f1-d26377930a93
          name: qc-report-epic.html
          md5sum: fec299de03b0b24609a06e52e625c5f5
          filesize: 1.7MB
          filetype: .html
          belongs_to: data/derived/reports/qc
        - id: alspacdcs:9a7d44c2-3535-469e-9d1b-58ea7e0eacf6
          name: qc-report-450.md
          md5sum: a301a5cc8240309b04520250adb9ec7e
          filesize: 21.1KB
          filetype: .md
          belongs_to: data/derived/reports/qc
        - id: alspacdcs:c750e8e1-b6b2-40cc-bd3c-5ac071017cb7
          name: qc-report-common.md
          md5sum: 0b0f6629e0acb73460aca0419d1de125
          filesize: 21.1KB
          filetype: .md
          belongs_to: data/derived/reports/qc
        - id: alspacdcs:223be358-3cb9-4142-953d-d0d06112f8f8
          name: qc-report-epic.md
          md5sum: f72ff7ee42326aa5377d289d65c2735d
          filesize: 19.9KB
          filetype: .md
          belongs_to: data/derived/reports/qc
        figure:
          contains:
          - unnamed-chunk-16-1.png
          - unnamed-chunk-36-1.png
          - unnamed-chunk-5-1.png
          - unnamed-chunk-7-1.png
          - unnamed-chunk-13-1.png
          - unnamed-chunk-11-1.png
          - unnamed-chunk-35-1.png
          - unnamed-chunk-3-1.png
          - unnamed-chunk-9-1.png
          - unnamed-chunk-12-1.png
          files:
          - id: alspacdcs:08d815dc-054a-431d-a0c8-f10f3c35e34d
            name: unnamed-chunk-16-1.png
            md5sum: 60eb456b6848a4574507634f6110aff5
            filesize: 107.4KB
            filetype: .png
            belongs_to: data/derived/reports/qc/figure
          - id: alspacdcs:0b5dc0d8-db1a-4f06-8683-80bc74d33f7d
            name: unnamed-chunk-36-1.png
            md5sum: 1c1cbfadbc51a707f2d596c7b524cc3f
            filesize: 29.9KB
            filetype: .png
            belongs_to: data/derived/reports/qc/figure
          - id: alspacdcs:5a108918-20d8-4d39-9008-29d847f91407
            name: unnamed-chunk-5-1.png
            md5sum: 5d111e4bbc9bf43c14e0218f43b38ae8
            filesize: 126.9KB
            filetype: .png
            belongs_to: data/derived/reports/qc/figure
          - id: alspacdcs:6d737ff7-5c3c-4f1a-a27c-4a4ae5bacdf1
            name: unnamed-chunk-7-1.png
            md5sum: a736e2022533b8774eb71eaea5205503
            filesize: 452.8KB
            filetype: .png
            belongs_to: data/derived/reports/qc/figure
          - id: alspacdcs:b8b657df-b6fa-4655-80bc-8c7bc15f85f3
            name: unnamed-chunk-13-1.png
            md5sum: ce7fb6fee9571240aca9917e43548dbc
            filesize: 79.5KB
            filetype: .png
            belongs_to: data/derived/reports/qc/figure
          - id: alspacdcs:68cd4494-484b-4f6d-bea7-0f7ae9a7a431
            name: unnamed-chunk-11-1.png
            md5sum: fb762339b8382426fc163086764cd8f7
            filesize: 33.4KB
            filetype: .png
            belongs_to: data/derived/reports/qc/figure
          - id: alspacdcs:8a248080-8784-4683-800d-919f9e6de147
            name: unnamed-chunk-35-1.png
            md5sum: d387226410b191145b3a3da2ba725288
            filesize: 26.1KB
            filetype: .png
            belongs_to: data/derived/reports/qc/figure
          - id: alspacdcs:91a3e021-4f04-4830-8e0a-b80a82b85f9a
            name: unnamed-chunk-3-1.png
            md5sum: c7b04efa67e7fbac1b61e2b400820e82
            filesize: 105.7KB
            filetype: .png
            belongs_to: data/derived/reports/qc/figure
          - id: alspacdcs:07a5ce01-d7f3-4b5e-ba18-c4455ed196a1
            name: unnamed-chunk-9-1.png
            md5sum: c4a04fe47757519977dd3ca2f6f05908
            filesize: 56.8KB
            filetype: .png
            belongs_to: data/derived/reports/qc/figure
          - id: alspacdcs:cea91373-3f73-43a2-ac30-856c92f5ad8c
            name: unnamed-chunk-12-1.png
            md5sum: 13ba50bfb39465f8f3af6532367c18af
            filesize: 268.5KB
            filetype: .png
            belongs_to: data/derived/reports/qc/figure
      normalization:
        contains:
        - norm-report-450.html
        - norm-report-common.html
        - norm-report-epic.html
        - norm-report-epic.md
        - norm-report-450.md
        - norm-report-common.md
        - figure
        files:
        - id: alspacdcs:2d94e47f-e238-4041-80ff-15eb6378fc11
          name: norm-report-450.html
          md5sum: 5abad07bfadd8edc07c02153c0e877a8
          filesize: 1.8MB
          filetype: .html
          belongs_to: data/derived/reports/normalization
        - id: alspacdcs:b0ac79b5-468b-4c2b-acad-5a3108214c70
          name: norm-report-common.html
          md5sum: cef926dff897cf944a539f537366d44e
          filesize: 1.7MB
          filetype: .html
          belongs_to: data/derived/reports/normalization
        - id: alspacdcs:2e496497-d068-462c-9caa-733418b8659d
          name: norm-report-epic.html
          md5sum: 6dca6ba2f70cf45d583dd185a33afbb0
          filesize: 1.7MB
          filetype: .html
          belongs_to: data/derived/reports/normalization
        - id: alspacdcs:47b90928-e08f-48cf-947d-0770ec451321
          name: norm-report-epic.md
          md5sum: 0ac3e8694c8407fe921b46dc0a5cb309
          filesize: 11.6KB
          filetype: .md
          belongs_to: data/derived/reports/normalization
        - id: alspacdcs:1d2194c5-6651-4479-8a49-827fe7e81b40
          name: norm-report-450.md
          md5sum: 2ca71f7c4ccccc245b56c2d305bd399b
          filesize: 20.0KB
          filetype: .md
          belongs_to: data/derived/reports/normalization
        - id: alspacdcs:75ed2d6f-8df8-4dec-b40c-f6b5e9b4ea8f
          name: norm-report-common.md
          md5sum: 7d6d6b2351fceea3ecab6a9088cd2ba4
          filesize: 24.4KB
          filetype: .md
          belongs_to: data/derived/reports/normalization
        figure:
          contains:
          - unnamed-chunk-43-1.png
          - unnamed-chunk-46-1.png
          - unnamed-chunk-45-1.png
          - unnamed-chunk-32-1.png
          - unnamed-chunk-47-1.png
          - unnamed-chunk-27-1.png
          - unnamed-chunk-48-1.png
          - unnamed-chunk-78-1.png
          - unnamed-chunk-21-1.png
          - unnamed-chunk-76-1.png
          - unnamed-chunk-44-1.png
          - unnamed-chunk-61-1.png
          - unnamed-chunk-74-1.png
          - unnamed-chunk-38-1.png
          - unnamed-chunk-77-1.png
          - unnamed-chunk-64-1.png
          - unnamed-chunk-50-1.png
          - unnamed-chunk-35-1.png
          - unnamed-chunk-72-1.png
          - unnamed-chunk-67-1.png
          - unnamed-chunk-75-1.png
          - unnamed-chunk-23-1.png
          - unnamed-chunk-42-1.png
          - unnamed-chunk-56-1.png
          - unnamed-chunk-71-1.png
          - unnamed-chunk-51-1.png
          - unnamed-chunk-79-1.png
          - unnamed-chunk-22-1.png
          - unnamed-chunk-73-1.png
          - unnamed-chunk-49-1.png
          - unnamed-chunk-80-1.png
          files:
          - id: alspacdcs:960f5106-bb26-44f4-8802-0d9b9cf82651
            name: unnamed-chunk-43-1.png
            md5sum: db015f1d599967aacff3e918ca210ed7
            filesize: 45.0KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:73ad0bb8-eb09-46a3-a329-ff09de979e57
            name: unnamed-chunk-46-1.png
            md5sum: 2a361ec26c3d5d729fc3ecc385f3dc37
            filesize: 41.6KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:0bd91d4a-1814-4ea5-9081-60d225f4b893
            name: unnamed-chunk-45-1.png
            md5sum: 397ceb400ba6cc54f8957a04a7223b80
            filesize: 40.5KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:60c50ea2-004e-4759-a50b-84f0ef70f241
            name: unnamed-chunk-32-1.png
            md5sum: e2fd34ad90f83276ac8531d441b6efea
            filesize: 12.9KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:b094de03-f35d-4126-a869-90fa86f35d30
            name: unnamed-chunk-47-1.png
            md5sum: e8a702b741c582874e86c524178b3224
            filesize: 43.6KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:00e24a22-6316-4c65-a761-f9947caae764
            name: unnamed-chunk-27-1.png
            md5sum: d741adec199194e22300578b399b2adf
            filesize: 212.6KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:7c2963b5-3523-434d-8c27-c74209a348a9
            name: unnamed-chunk-48-1.png
            md5sum: 974eb0b4ab2a678d374df2ae5fa4cb36
            filesize: 41.4KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:be6cae82-cfcb-4186-b960-07f754632e8a
            name: unnamed-chunk-78-1.png
            md5sum: 1376d286deab61aff5b82ca728e79193
            filesize: 40.2KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:77632d8d-7982-439c-9369-70b7eeaa8a33
            name: unnamed-chunk-21-1.png
            md5sum: 843c2535db5036f216cde2995c7d6e4d
            filesize: 7.9KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:045a99c0-7456-4518-99af-6a7f81719aec
            name: unnamed-chunk-76-1.png
            md5sum: b3a9faa0ba9216acccb586e28e6cc7e7
            filesize: 35.3KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:458391fa-9f9b-4b28-bf47-b3ffa5b006af
            name: unnamed-chunk-44-1.png
            md5sum: c16673a6bb0109a0c533055bcccfd9b0
            filesize: 40.8KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:6589a215-0c3f-4bcb-8e47-471382da8dd2
            name: unnamed-chunk-61-1.png
            md5sum: 8909fde9844719d9c0dc6d4f27d6bee7
            filesize: 14.1KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:701f6f93-9e34-46fb-ae89-f7a65ee89c6f
            name: unnamed-chunk-74-1.png
            md5sum: dc69a1f6993cea3c5c289edbebe1395f
            filesize: 34.6KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:d875c710-0523-4857-b6fe-9dcbee85c802
            name: unnamed-chunk-38-1.png
            md5sum: 523af90e3cb6d55e1927bdcc7ba581c1
            filesize: 14.9KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:759683a0-170a-4763-a6cb-0cce34a3b87b
            name: unnamed-chunk-77-1.png
            md5sum: 3cd5906468110fc3efa0e7059e50f011
            filesize: 33.6KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:19e21e14-1e53-417c-b640-83e5229b64a6
            name: unnamed-chunk-64-1.png
            md5sum: 7feb3f164bc369cae2872b7f4e67b6a9
            filesize: 15.7KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:9e51f7ef-bbaa-470d-ba2b-20a671361bb4
            name: unnamed-chunk-50-1.png
            md5sum: 63d4be9ba5dd131d42c5a1755a32ab3c
            filesize: 37.7KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:73475f51-b515-43d2-9cf5-62550ec6545d
            name: unnamed-chunk-35-1.png
            md5sum: 62397c374b15bafae8d4c77b5ab66903
            filesize: 13.8KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:fdfa8b3e-a385-4efb-9d8c-745f2721c369
            name: unnamed-chunk-72-1.png
            md5sum: 62659ae0e97f0d410803cdb84f86ad06
            filesize: 43.0KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:21bb12fb-9141-4dad-a9ea-14c1bd960abc
            name: unnamed-chunk-67-1.png
            md5sum: 8310a8c9e27a59e61f93314fc83d4f0f
            filesize: 13.3KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:6999294d-3fb1-41bb-a75c-9cd2421e0c0e
            name: unnamed-chunk-75-1.png
            md5sum: 86dbcf633fe7791f3fcd2730bbaca1fe
            filesize: 33.7KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:55527d15-3d87-4d61-b6a0-ea958041b328
            name: unnamed-chunk-23-1.png
            md5sum: 7a6f89888fd84a613415240965293fe2
            filesize: 8.3KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:89d54d3d-46c2-4947-b078-fe71eaf9ab39
            name: unnamed-chunk-42-1.png
            md5sum: cd0df689b84096cbb12bca6965b465dc
            filesize: 43.7KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:8be8d6d5-008b-4d69-8b62-d94f9afa249a
            name: unnamed-chunk-56-1.png
            md5sum: a305a033260dd8391902463465d4539b
            filesize: 195.0KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:63cfc994-cdf0-4540-8622-4935871ff547
            name: unnamed-chunk-71-1.png
            md5sum: 540e6c04a4bf0de24c33072f251de261
            filesize: 35.3KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:087f8811-8b20-47f5-9a27-81da74ca9ec6
            name: unnamed-chunk-51-1.png
            md5sum: ff43a8fecff1d2aab034fc95dc58353d
            filesize: 36.1KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:79f557d3-2da5-43a8-aaef-d0b99c5ec6e6
            name: unnamed-chunk-79-1.png
            md5sum: b348bd0dfdea753fbfa4502b5e4e85ef
            filesize: 37.8KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:b45288ec-6407-4f32-a418-70e494dda7cb
            name: unnamed-chunk-22-1.png
            md5sum: a197d74fb49c30f66707e60a8bdebd33
            filesize: 7.9KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:bc57bb1e-19b2-455c-a57e-12ab17565560
            name: unnamed-chunk-73-1.png
            md5sum: 6640674615e33436ee7f004385ccae38
            filesize: 44.7KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:dae49cca-4afa-4213-8c2d-821e6a1edc96
            name: unnamed-chunk-49-1.png
            md5sum: f6b29e0a8dab04649277a20958a9bdd3
            filesize: 40.9KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:fb1f01e6-1ec0-42ab-8b6b-bc16b7bb8747
            name: unnamed-chunk-80-1.png
            md5sum: a0024a6beb1ae9e8f1049cc6494b41b0
            filesize: 37.9KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
    cellcounts:
      contains:
      - andrews-and-bakulski-cord-blood.txt
      - combined-cord-blood.txt
      - blood-gse35069.txt
      - gervin-and-lyle-cord-blood.txt
      - blood-idoloptimized-epic.txt
      - blood-idoloptimized.txt
      - cord-blood-gse68456.txt
      - blood-gse35069-complete.txt
      files:
      - id: alspacdcs:e58a70ec-c085-441e-bf73-7c6db110ced5
        name: andrews-and-bakulski-cord-blood.txt
        md5sum: 33c69aa8e50deb28355dcb82d01c7510
        filesize: 113.7KB
        filetype: .txt
        belongs_to: data/derived/cellcounts
      - id: alspacdcs:e1e2db52-5cee-41e1-b66c-f2c40979675e
        name: combined-cord-blood.txt
        md5sum: 7cbcf72ca00012d17d22ff6d21b7575c
        filesize: 128.2KB
        filetype: .txt
        belongs_to: data/derived/cellcounts
      - id: alspacdcs:8ee56e0a-9b05-403b-849b-e7a57e00802b
        name: blood-gse35069.txt
        md5sum: 8d0b1bbf40d51fd2e041de67dfdc89a3
        filesize: 1020.1KB
        filetype: .txt
        belongs_to: data/derived/cellcounts
      - id: alspacdcs:2d779034-34cd-4898-8c8e-3f38e6986fe1
        name: gervin-and-lyle-cord-blood.txt
        md5sum: 099c4cf9bd4ecfee91c19c3c2d2b6f70
        filesize: 99.5KB
        filetype: .txt
        belongs_to: data/derived/cellcounts
      - id: alspacdcs:6f3ec872-5d7d-4dd1-a966-5a63648410fa
        name: blood-idoloptimized-epic.txt
        md5sum: 7331e83d31e1d200bbff3d041223cde1
        filesize: 345.9KB
        filetype: .txt
        belongs_to: data/derived/cellcounts
      - id: alspacdcs:21fd1612-5847-40dc-8280-fa0a6451476d
        name: blood-idoloptimized.txt
        md5sum: fc9c1872c14656b6a76403dde932619f
        filesize: 1.1MB
        filetype: .txt
        belongs_to: data/derived/cellcounts
      - id: alspacdcs:61e2c087-b25e-4f1f-971c-0e0caa09d1a2
        name: cord-blood-gse68456.txt
        md5sum: 941f8a9ce1289ab5baaf10fb29bd8941
        filesize: 129.8KB
        filetype: .txt
        belongs_to: data/derived/cellcounts
      - id: alspacdcs:511c8004-6daa-40f0-a9c0-b1884578b1ef
        name: blood-gse35069-complete.txt
        md5sum: 848a27e06b57ec5378aac229e95c80ce
        filesize: 1.1MB
        filetype: .txt
        belongs_to: data/derived/cellcounts
  betas:
    contains:
    - epic.gds
    - 450.gds
    - common.gds
    files:
    - id: alspacdcs:2764d7cd-3921-432e-a335-8305352b7477
      name: epic.gds
      md5sum: 0357486c3af3b5ee120c7b05bf077340
      filesize: 17.5GB
      filetype: .gds
      belongs_to: data/betas
    - id: alspacdcs:2a8b8ab8-a101-49ba-b80c-b7b608c3a486
      name: 450.gds
      md5sum: 1c82dfd9ffdc6a62b88a94acfd13eb22
      filesize: 21.3GB
      filetype: .gds
      belongs_to: data/betas
    - id: alspacdcs:d6cc5f0b-3c63-4c57-9d17-7fb8ec74a382
      name: common.gds
      md5sum: e2db524c139d4000cc7bd26aabcbf54a
      filesize: 29.1GB
      filetype: .gds
      belongs_to: data/betas
  detection_p_values:
    contains:
    - epic.gds
    - 450.gds
    - common.gds
    files:
    - id: alspacdcs:ea197bc8-a4f8-4292-b873-0c90bed30164
      name: epic.gds
      md5sum: 341d1194d468e10e80be9dc9990c474b
      filesize: 17.7GB
      filetype: .gds
      belongs_to: data/detection_p_values
    - id: alspacdcs:337a1153-7267-4eff-9217-b3b31dd4ca9e
      name: 450.gds
      md5sum: 52fa5327ea666a0446f3530420b7831d
      filesize: 21.5GB
      filetype: .gds
      belongs_to: data/detection_p_values
    - id: alspacdcs:f1be32c1-8935-43d9-b598-3ba35b925293
      name: common.gds
      md5sum: aa2f58ed1ed73a4eb11a6e9df1dfc268
      filesize: 29.3GB
      filetype: .gds
      belongs_to: data/detection_p_values
  control_matrix:
    contains:
    - epic.txt
    - common.txt
    - 450.txt
    files:
    - id: alspacdcs:3a92b254-b373-44dd-a433-6a0abdf88705
      name: epic.txt
      md5sum: 7a680d3ccd26a491ec7dde2ce91eeeab
      filesize: 1008.8KB
      filetype: .txt
      belongs_to: data/control_matrix
    - id: alspacdcs:9b7ee2ff-fdfc-4d0d-9ce1-c4700ec699d3
      name: common.txt
      md5sum: 8d36d1af3be913e0164e40af7d6a9bb9
      filesize: 3.1MB
      filetype: .txt
      belongs_to: data/control_matrix
    - id: alspacdcs:cc566b11-cf78-4dbd-b322-fc3643d0465a
      name: 450.txt
      md5sum: f7e6be6140266239e2c8aea92e772123
      filesize: 2.1MB
      filetype: .txt
      belongs_to: data/control_matrix
docs:
  contains:
  - dnamage.html
  - dnamage.md
  files:
  - id: alspacdcs:760bbdb8-3bd7-4087-9558-2c328aba1180
    name: dnamage.html
    md5sum: b2f45bdec85fbd8149299ddda511c87d
    filesize: 22.0KB
    filetype: .html
    belongs_to: docs
  - id: alspacdcs:276fa231-08e2-4c65-bb31-2937d515d584
    name: dnamage.md
    md5sum: 77e08c6cc266970d16bd4d2341266011
    filesize: 6.0KB
    filetype: .md
    belongs_to: docs

Gene Expression Data

Gene expression - array - G1 (ge_ht12_g1)

Description

Two types of QC’d data are available in this version: one performed by David Evans for the Bryois et al 2014 paper, and one performed by Gibran Hemani for the molgenis eQTL mapping meta-analysis. A version without QC is also available. Details of the QC’d versions are given below.

This data was generated from LCLs. The majority of samples used in their generation were collected at age 9 years. LCLs are lymphoblastoid cell lines, produced by transforming lymphocytes with Epstein-Barr virus and cultured before DNA was extracted. Gene expression patterns may not be the same as those from untransformed lymphocytes taken from a 9 year old.

Methodology

Bryois:
- LCLs from unrelated individuals were grown under identical conditions and cells frozen in RNAlater. RNA was extracted using an RNeasy extraction kit (Qiagen) and was amplified using the Illumina TotalPrep-96 RNA Amplification kit (Ambion). Expression profiling of the samples, each with two technical replicates, was performed using Illumina Human HT-12 V3 BeadChips (Illumina Inc) including 48,804 probes, where 200 ng of total RNA was processed according to the protocol supplied by Illumina. Raw data was imported into the Illumina Beadstudio software and probes with fewer than three beads present were excluded. Log2-transformed expression signals were then normalized with quantile normalization of the replicates of each individual followed by quantile normalization across all individuals.

We restricted our analysis to 23,935 probes tagging genes annotated in Ensembl. Principal component analysis was performed on 931 individuals. 62 individuals with principal component 1 or 2 greater than one standard deviation of the population were excluded from further analysis. See http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004461 for full details.

Molgenis:
- Genetic outliers, defined as any individuals that were clear outliers in the first 2 genetic principal components, were removed. Each probe was quantile normalised and then log2 transformed. The data were then adjusted for the first 4 genetic MDS components and for expression principal components (excluding those that had genetic associations), and scaled to have mean 0 and variance 1. See https://github.com/molgenis/systemsgenetics/wiki/eQTL-mapping-analysis-cookbook for full details.
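
As a rough illustration of the normalisation steps named above (quantile normalisation, log2 transformation, and scaling to mean 0 and variance 1), the following Python sketch may help. It is not the Bryois or molgenis pipeline: the function names are hypothetical, ties are broken by position rather than averaged, and the adjustment for genetic and expression components is not shown.

```python
import math
from statistics import mean, pstdev


def quantile_normalise(columns):
    """Quantile-normalise a list of equal-length sample columns.

    Each value is replaced by the mean, across samples, of the values
    that share its rank. Ties are broken by position (a simplification).
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [mean(col[k] for col in sorted_cols) for k in range(n)]
    result = []
    for col in columns:
        # Indices of this sample's values in ascending order of value.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        result.append(out)
    return result


def standardise(values):
    """Scale values to mean 0 and variance 1 (population variance).

    Raises ZeroDivisionError if all values are identical.
    """
    m = mean(values)
    sd = pstdev(values)
    return [(v - m) / sd for v in values]


# Toy data: two samples (columns) measured at three probes.
samples = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalised = quantile_normalise(samples)
logged = [[math.log2(v) for v in col] for col in normalised]
# Standardise the first probe across the two samples.
print(standardise([col[0] for col in logged]))
```

For real data a dedicated package would normally be used; the sketch is only meant to make the order of operations concrete.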

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:ge_ht12_g1_2015-11-02_f7
name: Gene expression - array - G1 release version 2015-11-02 freeze 7
description: >-
  This is the seventh freeze of the 2015-11-02 version of ge_ht12_g1 dataset. This uses .csv distributions of the data rather than .Rdata files in order to be easier to use across different data science software and languages.

freeze_size: 2.6G
linker_file_md5sum: 2b2e49bc61c1a0efc3f57db6e48656ea
woc_file_md5sum: ee409be51e4e12594cb700dd1be99314
all_individuals_to_exclude_md5sum: 241c00aec78b7178d6797a83478dec02
git_tag: https://github.com/alspac/dataset_ge_ht12_g1/releases/tag/freeze7
is_current_freeze: true
freeze_number: 7
freeze_date: 2026-04-14
previous_freeze: alspacdcs:ge_ht12_g1_2015-11-02_f6
freeze_of_alspac_dataset_version: alspacdcs:ge_ht12_g1_2015-11-02
freeze_of_named_alspac_dataset: alspacdcs:ge_ht12_g1

contains:
- bryois.csv
- molgenis.csv
- raw.csv
files:
- id: alspacdcs:dddf4c22-c1c6-467b-93bf-b64643069f5a
  name: bryois.csv
  md5sum: 926a017d049021721b59b07e9cfd7da1
  description: >-
    The freeze csv version of the bryois data.
    IDs in columns and Illumina probe IDs in rows.
    This is the normalised data used in Bryois et al 2014.
    Probe IDs are mapped to genes in raw.csv
  number_of_participants: 947
  number_of_gene_expression_probe_values: 48630
  filesize: 741.2MB
  filetype: .csv
  belongs_to: data
- id: alspacdcs:31f736ad-63b2-470f-8262-a16321225b78
  name: molgenis.csv
  md5sum: 34292d6d810cfed68f84ed0e39457578
  description: >-
    The freeze csv version of the molgenis data.
    IDs in columns and Illumina probe IDs in rows.
    Normalised data following the molgenis pipeline,
    found at
    https://github.com/molgenis/systemsgenetics/wiki/eQTL-mapping-analysis-cookbook.
    Probe IDs are mapped to genes in raw.csv
  number_of_participants: 879
  number_of_gene_expression_probe_values: 48630
  filesize: 751.2MB
  filetype: .csv
  belongs_to: data
- id: alspacdcs:64377f43-4e3c-43ca-9830-d6d5b992b2af
  name: raw.csv
  md5sum: 1b42892f49d840336934fdc24299203f
  description: >-
    The freeze csv version of the raw ge data.
    IDs in columns and probes in rows. Four columns per
    individual, with two columns for average signal and two columns
    for average number of beads.
    Presumably this is a file generated by the Illumina Genome
    Studio software.
  number_of_participants: 994
  number_of_gene_expression_probe_values: 48630
  filesize: 1.1GB
  filetype: .csv
  belongs_to: data
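
The md5sum values listed in a freeze document can be used to check that received files are complete and uncorrupted. The following Python sketch shows one way this might be done with the standard library only. The function names and the local directory name `data` are assumptions for illustration; the checksums are copied from the freeze documentation above.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(expected, directory="."):
    """Compare each file's MD5 with the expected value.

    `expected` maps file name -> md5sum string, as in the `files:` list.
    Returns a dict of file name -> "ok", "mismatch" or "missing".
    """
    results = {}
    for name, md5sum in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == md5sum.lower():
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


# md5sum values copied from the freeze documentation above.
EXPECTED = {
    "bryois.csv": "926a017d049021721b59b07e9cfd7da1",
    "molgenis.csv": "34292d6d810cfed68f84ed0e39457578",
    "raw.csv": "1b42892f49d840336934fdc24299203f",
}

if __name__ == "__main__":
    # "data" is a hypothetical local directory holding the received files.
    for name, status in verify_files(EXPECTED, "data").items():
        print(name, status)
```

Reading in chunks keeps memory use low, which matters for the multi-gigabyte files in some freezes.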

Other omics datasets

Both metabolomics and proteomics data have been generated for ALSPAC. Because of the data type, namely flat text formats, these are incorporated into the standard phenotypic variables. If you wish to access these data, you should look at our data dictionary. You can also search for specific variables using the variable search tool, to see whether metabolites of interest are available in the ALSPAC dataset.

Metabolomics

Metabolomics data has been generated at multiple timepoints using multiple technologies. Data has been generated either at Nightingale Health using nuclear magnetic resonance (NMR) or at Metabolon using mass spectrometry (MS).

Derived NMR data has been released previously, but has all recently been updated according to updated marker labelling by Nightingale.

G0

Data source: Mother_samples_6c & Child_bloods_7a

Timepoint	FOM1	FOM2	FOM3	FOM4
Fasting status	fasting	fasting	fasting	fasting
Nightingale	4368	2793		1681
Olink (Target 96 inflammatory panel)	2968

G1

Timepoint	F7 (7yrs)	BBS (~8yrs)	TF3 (15.5yrs)	TF4 (~17yrs)	F24 (24yrs)	F30 (30yrs)	data source
Fasting status	Non-fasting	fasting	fasting	fasting	fasting	fasting	Child_bloods_7a
Longitudinal: Nightingale	5525	640	3366	3167	3270	2894	Child_bloods_7a
Longitudinal: Metabolon	226		281	226	281	226	Child_bloods_7a
Metabolon (B4132)						520	G1_MS_metabolon_B4132

G2

Data source

Type	Count
Nightingale	823

Substudy data is available, but may be non-standard and require additional access requests in the application.
- B2714: https://pubmed.ncbi.nlm.nih.gov/32494907/
- B3194: https://pubmed.ncbi.nlm.nih.gov/35598895/
- B4132: https://wellcomeopenresearch.org/articles/10-632

Metadata:
- B4132: https://proposals.epi.bristol.ac.uk/G1_MS_metabolon_B4132-feature-metadata.xlsx

Associated publications:
- https://wellcomeopenresearch.org/articles/10-632 (samples at age 30 and an updated version of earlier data)

Associated packages:
- https://github.com/MRCIEU/metaboprep (https://academic.oup.com/bioinformatics/article/38/7/1980/6522114)

Proteomics

Proteomics data has been collected within ALSPAC across multiple time points:

Olink was used to analyse 9000 samples across multiple timepoints and generations. Detailed information is available in the associated publication.

Cohort	Timepoint	N.
G0 Mothers	FOM1	2968
G1 individuals	F9 (~9yrs)	3005
G1 individuals	F24 (~24yrs)	3027

Associated publications:
- https://pubmed.ncbi.nlm.nih.gov/39268475/

Associated packages:
- https://github.com/MRCIEU/metaboprep (https://academic.oup.com/bioinformatics/article/38/7/1980/6522114)

Omics tips

Introduction

This section is a guide to using ’Omics datasets. It explains which software to use and describes common file formats. It’s a good starting point for beginners and helpful for problem-solving.

Disclaimer

Some information is copied or reworded from software documentation. Check the original documentation alongside this guide for up-to-date information. Note that some links may no longer work.

Operating systems

You can use ALSPAC data with any operating system, but Unix-based systems like Macintosh, Linux, or BSD are more convenient due to the data’s size and complexity. We recommend using the command line and programming scripts with languages like Bash, R, Python, or Perl. Many online resources are available to learn these tools. Use free/libre and open-source software where possible.

Links:

Unix guide: https://www.osc.edu/supercomputing/unix-cmds
Beginning Python: https://www.python.org/about/gettingstarted/
Beginning R: https://www.statmethods.net/r-tutorial/index.html
Free/libre and open-source software: https://www.fsf.org/about/

Key Omics software

Plink

Plink is a tool for performing quality control and whole genome association analysis of genetic data.
- Link: http://zzz.bwh.harvard.edu/plink/

SNPTest

SNPTest is a tool for performing whole genome association analysis of genetic data.
- Link: https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html (Not open source)

BoltLmm

BoltLmm is a tool for performing genome association analysis of genetic data. It is recommended for analyses of more than 5000 samples; its methods automatically take population substructure into account.
- Link: https://data.broadinstitute.org/alkesgroup/BOLT-LMM/

Qctools

A tool for quality control of genetic data. It is also useful for inspecting and modifying .gen, .bgen and vcf files etc. (see File types below).
- Link: https://www.well.ox.ac.uk/~gav/qctool_v2/

SAMTOOLS

Samtools is a suite of tools used for genomic analysis.
- Link: http://www.htslib.org/

VCFTOOLS

A program package for working with vcf files.
- Link: https://vcftools.github.io/index.html

BCFTOOLS

This is part of the samtools project and allows users to manipulate .bcf files.
- Link: http://samtools.github.io/bcftools/bcftools.html

File types

In a Unix environment the suffix of a file name (the file extension) does not explicitly mean anything to the operating system, unlike in Windows, which uses it to determine the file type. In a Unix system it is just part of the name of the file, and humans use it to distinguish file formats. The following is a non-exhaustive list of file types you may encounter whilst using ALSPAC Omics data.

.gen

This is an ‘Oxford’ data format for genetic data. The .gen file is a plain text file, which means that standard Unix command line tools, for example ‘head’ or ‘less’, can be used to inspect the data.

The .gen (genotype) file stores data in a one-line-per-SNP format. The first 5 entries of each line are the SNP ID, the RS ID of the SNP, the base-pair position of the SNP, the allele coded A and the allele coded B. The SNP ID can be used to denote the chromosome number of each SNP. The next three numbers on the line are the probabilities of the three genotypes AA, AB and BB at the SNP for the first individual in the cohort. The next three numbers are the genotype probabilities for the second individual, the next three are for the third individual, and so on. The order of individuals in the genotype file should match the order of the individuals in the sample file (see below). The probabilities need not sum to 1, which allows for the possibility of a NULL genotype call. This format allows for genotype uncertainty, and is the same as that produced by the genotype calling algorithm CHIAMO.

NOTE: We recommend that you arrange SNPs in base-pair order in the genotype files. This is required if you want to use the files with IMPUTE and will make viewing the output of SNPTEST somewhat easier.

For example, suppose you want to create a genotype file for 2 individuals at 5 SNPs whose genotypes are

SNP 1	AA	AA
SNP 2	GG	GT
SNP 3	CC	CT
SNP 4	CT	CT
SNP 5	AG	GG

The correct genotype file would look like this:

SNP1 rs1 1000 A C 1 0 0 1 0 0
SNP2 rs2 2000 G T 1 0 0 0 1 0
SNP3 rs3 3000 C T 1 0 0 0 1 0
SNP4 rs4 4000 C T 0 1 0 0 1 0
SNP5 rs5 5000 A G 0 1 0 0 0 1
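
To make the layout concrete, the following Python sketch splits one .gen line into its five leading fields and one (AA, AB, BB) probability triple per individual. The function name `parse_gen_line` is hypothetical, and the input line is the one implied by the description above for SNP 2 with genotypes GG and GT.

```python
def parse_gen_line(line):
    """Split one .gen line into its SNP fields and per-individual probabilities."""
    fields = line.split()
    snp_id, rs_id, position, allele_a, allele_b = fields[:5]
    values = [float(x) for x in fields[5:]]
    if len(values) % 3 != 0:
        raise ValueError("expected three probabilities per individual")
    # Group into (P(AA), P(AB), P(BB)) triples, one per individual.
    probabilities = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return {
        "snp_id": snp_id,
        "rs_id": rs_id,
        "position": int(position),
        "allele_a": allele_a,
        "allele_b": allele_b,
        "probabilities": probabilities,
    }


# SNP 2: first individual GG (AA-coded), second individual GT (AB-coded).
record = parse_gen_line("SNP2 rs2 2000 G T 1 0 0 0 1 0")
print(record["probabilities"])  # [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
```

Real .gen files hold one such line per SNP and three values per individual, so they are usually read line by line rather than loaded whole.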

.bgen

A binary version of a .gen file. This file cannot be visually inspected on the command line. .bgen files are used because they greatly increase speed and storage efficiency when handling large amounts of Omics data. The full details of the file format are discussed at: https://www.well.ox.ac.uk/~gav/bgen_format/

.bgen files are normally used with tools such as qctools and snptest. There is also a library for reading .bgen files into R: https://bitbucket.org/gavinband/bgen/wiki/rbgen

.sample

The .sample file is paired with either .gen or .bgen files. It contains information on the samples that is not genetic. It is a plain text file that can be inspected with standard Unix command line tools.

Please note that the sample file format changed with the release of SNPTEST v2. Specifically, the way in which covariates and phenotypes are coded on the second line of the file has changed. The sample file has three parts: (a) a header line detailing the names of the columns in the file, (b) a line detailing the types of variables stored in each column, and (c) a line for each individual detailing the information for that individual. Here is an example of the start of a sample file for reference:

ID_1 ID_2 missing cov_1 cov_2 cov_3 cov_4 pheno1 bin1
0 0 0 D D C C P B
1 1 0.007 1 2 0.0019 -0.008 1.233 1
2 2 0.009 1 2 0.0022 -0.001 6.234 0
3 3 0.005 1 2 0.0025 0.0028 6.121 1
4 4 0.007 2 1 0.0017 -0.011 3.234 1
5 5 0.004 3 2 -0.012 0.0236 2.786 0

The header line: This line needs a minimum of three entries. The first three entries should always be ID_1, ID_2 and missing. They denote that the first three columns contain the first ID, second ID and missing data proportion of each individual. Additional entries on this line should be the names of covariates or phenotypes that are included in the file. In the above example, there are 4 covariates named cov_1, cov_2, cov_3, cov_4, a continuous phenotype named pheno1 and a binary phenotype named bin1. NOTE: All phenotypes should appear after the covariates in this file.

The second line: This line details the type of variable included in each column. The first three entries of this line should be set to 0. Subsequent entries in this line for covariates and phenotypes should be specified by the following rules:

D	Discrete covariate (coded using positive integers)
C	Continuous covariates
P	Continuous Phenotype
B	Binary Phenotype (0 = Controls, 1 = Cases)

The remainder of the file should consist of a line for each individual containing the information specified by the entries of the header line (see example above). Use spaces, not TABS, to separate the entries of the sample file, because a space is the expected separator.

Missing values: It is possible to specify missing values for covariates and phenotypes. It was previously recommended that you use -9 for missing values. This was the default value assumed by SNPTEST v1, although the -missing_code option in SNPTEST v1 meant that you could use other numeric values for the missing code. In SNPTEST v2 the behaviour of the -missing_code option has changed so that it now takes a comma-separated list of values, each of which is treated as missing when encountered in the sample file(s). The default missing value is now the two-character string “NA”.
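
The three-part layout described above (header line, type line, then one line per individual) can be read with a few lines of Python. This is a minimal sketch rather than a reference parser: `parse_sample` is a hypothetical helper, the example data is made up, and `str.split()` also accepts tabs even though the format expects spaces.

```python
def parse_sample(text, missing_codes=("NA",)):
    """Parse a SNPTEST v2 style sample file given as a string.

    Returns (columns, types, rows). `types` maps column name to its type
    code; each row is a dict keyed by column name, with any value found
    in `missing_codes` replaced by None.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    columns = lines[0].split()
    type_codes = lines[1].split()
    if columns[:3] != ["ID_1", "ID_2", "missing"]:
        raise ValueError("header must start with: ID_1 ID_2 missing")
    if len(type_codes) != len(columns) or type_codes[:3] != ["0", "0", "0"]:
        raise ValueError("type line must start with 0 0 0 and match the header")
    rows = []
    for ln in lines[2:]:
        values = ln.split()
        if len(values) != len(columns):
            raise ValueError("row has a different number of entries than the header")
        rows.append({c: (None if v in missing_codes else v)
                     for c, v in zip(columns, values)})
    return columns, dict(zip(columns, type_codes)), rows


# Made-up example with one discrete covariate and one binary phenotype.
EXAMPLE = """ID_1 ID_2 missing cov_1 bin1
0 0 0 D B
1 1 0.007 1 1
2 2 0.009 NA 0
"""

columns, types, rows = parse_sample(EXAMPLE)
print(types["cov_1"], rows[1]["cov_1"])  # D None
```

Values are left as strings so that the caller can convert them according to the type code (D, C, P or B) of each column.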

.ped

A plink format file that is in plain text and can be viewed with standard tools. It contains genetic variant data. https://www.cog-genomics.org/plink/1.9/formats#ped

.map

A plink format file that is in plain text. It contains information about variants. https://www.cog-genomics.org/plink/1.9/formats#map

.bed

A plink format file that is a binary equivalent of a .ped file. It is smaller and faster to process but is not easily viewable or editable. https://www.cog-genomics.org/plink/1.9/formats#bed

.bim

A plink format file, similar to a .map file but used with binary .bed files. https://www.cog-genomics.org/plink/1.9/formats#bim

.fam

A plain text format that contains sample information for plink binary files. https://www.cog-genomics.org/plink/1.9/formats#fam

.csv

A plain text format in which fields are separated by commas (comma-separated values).

.vcf

VCF files are a flexible file format for storing different types of genetic variants. They are a plain text format that can be inspected on the command line with standard Unix tools. However, they are often very large files, and specific tools such as ‘vcftools’ are useful for working with this data. SNPs are commonly stored in these files, but other variants such as copy number variations can also be stored. The basic format of a vcf file is described at: https://en.wikipedia.org/wiki/Variant_Call_Format

.bcf

This is a binary version of a vcf file. It cannot be inspected on the command line, but can be used with the genomic tools mentioned in this document.

.tar.gz

This is a standard Unix file format for bundling and compressing a set of files. It is similar to a .zip file. It is made by first bundling a set of files into a .tar file (sometimes called a tarball), which is then compressed using ‘gzip’. https://en.wikipedia.org/wiki/Tar_(computing) https://en.wikipedia.org/wiki/Gzip
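
As an illustration of the bundle-then-compress idea, the following Python sketch uses the standard library `tarfile` module to create a small .tar.gz archive and list its contents. The helper names and the example file names are made up for the demonstration.

```python
import tarfile
import tempfile
from pathlib import Path


def make_tar_gz(archive_path, files):
    """Bundle `files` into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in files:
            # Store each file under its base name only.
            archive.add(path, arcname=Path(path).name)


def list_tar_gz(archive_path):
    """Return the names of the members of a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()


# Demonstration in a temporary directory with two made-up files.
with tempfile.TemporaryDirectory() as workdir:
    first = Path(workdir) / "example_a.txt"
    second = Path(workdir) / "example_b.txt"
    first.write_text("a\n")
    second.write_text("b\n")
    archive = Path(workdir) / "bundle.tar.gz"
    make_tar_gz(archive, [first, second])
    print(list_tar_gz(archive))  # ['example_a.txt', 'example_b.txt']
```

Listing the members before extracting is a cheap way to check what an archive contains.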

.enc

This file extension is used by convention to mean that the file is encrypted. You will need the password that was used to encrypt the data in order to decrypt the files. https://en.wikipedia.org/wiki/OpenSSL

Variant/SNP ids

There are many types of genetic variation. A common type is a single nucleotide polymorphism (SNP). Others include copy number variations.

Variants can be specified by a chromosome and location relative to a specific build of the human genome. They can also be given a reference SNP (rs) cluster identifier.

chr:Location
rs_ids
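The two identifier styles above can be told apart programmatically. The Python sketch below is one possible way to do so; the patterns, the function name `classify_variant_id` and the example identifiers are assumptions for illustration, and real files may use other conventions (for example with alleles appended). Remember that a chr:Location identifier is only meaningful together with the genome build it refers to.

```python
import re

# "rs" followed by digits, e.g. rs123.
RS_PATTERN = re.compile(r"^rs(\d+)$")
# Optional "chr" prefix, a chromosome name, a colon, then a base-pair position.
POSITION_PATTERN = re.compile(r"^(?:chr)?([0-9]{1,2}|X|Y|MT?):(\d+)$", re.IGNORECASE)

def classify_variant_id(variant_id):
    """Return a (kind, value) pair describing the identifier style."""
    if RS_PATTERN.match(variant_id):
        return ("rs_id", variant_id)
    match = POSITION_PATTERN.match(variant_id)
    if match:
        chromosome = match.group(1).upper()
        return ("chr:location", (chromosome, int(match.group(2))))
    return ("unknown", variant_id)

for example in ["rs123", "chr2:5000000", "7:140", "not_a_variant"]:
    print(example, classify_variant_id(example))
```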

Overview of Imputation reference panels

SNP array data frequently contain hundreds of thousands of variants. However, due to linkage disequilibrium, it is possible to estimate many more SNP values for an individual. This estimation procedure is called imputation, and it works by combining an individual's SNP array data with a large reference population of sequenced data. In this way it is possible to obtain accurate estimates of millions of SNP values for an individual without the cost of fully sequencing each person. ALSPAC has pre-run the imputation process using three different imputation panels.

Panels

TOPMed: The reference panel most recently added to ALSPAC, which has the most SNPs.
HRC: This is a more recent reference panel than 1000 Genomes, and our data contain circa 40 million SNPs.
1000 Genomes: This is the previous generation reference panel which is still widely used in ALSPAC studies. There are some SNPs that appear in this panel that are not in the HRC panel.
HapMap: This was the first widely used imputation panel.

SNP data types from imputation.

SNPs that have been imputed can be stored and analysed in different formats. These can be appropriate for different types of analysis; for example, an analysis could assume an additive effect for the minor allele, or it could assume a recessive/dominant effect.

Best guess. The data will be presented as 0, 1 or 2 to represent how many copies of the minor allele a person has at that position. The best guess is derived from the probabilities calculated for the variant during the imputation process.
Dosage. This is the probability that the person has 0, 1 or 2 copies of the minor allele, e.g. 0.1, 0.2, 0.7. These will sum to one across the three possibilities (i.e. for each SNP for each individual).
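The relationship between the two data types can be sketched in a few lines of Python. This is an illustration under simple assumptions rather than the procedure used to produce the ALSPAC files: the best guess is taken as the most probable genotype with no confidence threshold, and the single expected allele count shown here is a summary that some tools also report for imputed data. The function names are made up for this example.

```python
def best_guess(probabilities):
    """Most probable number of minor alleles (0, 1 or 2)."""
    return max(range(3), key=lambda copies: probabilities[copies])

def expected_allele_count(probabilities):
    """Probability-weighted number of minor alleles, between 0 and 2."""
    return probabilities[1] + 2 * probabilities[2]

# Probabilities of carrying 0, 1 or 2 copies, as in the example above.
probs = (0.1, 0.2, 0.7)
assert abs(sum(probs) - 1.0) < 1e-9  # the three values should sum to one

print(best_guess(probs))                        # 2
print(round(expected_allele_count(probs), 3))   # 1.6
```

An additive model can use the expected allele count directly, whereas a recessive or dominant model is more naturally coded from the best guess or from the individual probabilities.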

SNP Statistics

You can generate statistics on your SNP data using the program ‘QCTOOL’. Its per-variant summary statistics include the imputation information scores. For example:

qctool -g example.bgen -s example.sample -snp-stats -osnp snp-stats.txt

Best practice

GWAS

We recommend you follow the steps outlined in the following paper when performing GWAS: Marees, Andries T., et al. “A tutorial on conducting genome‐wide association studies: Quality control and statistical analysis.” International journal of methods in psychiatric research 27.2 (2018): e1608. https://doi.org/10.1002/mpr.1608

PheWAS

We recommend you follow the steps outlined in the following paper when performing PheWAS: Millard, L., Davies, N., Timpson, N. et al. MR-PheWAS: hypothesis prioritization among potential causal effects of body mass index on many outcomes, using Mendelian randomization. Sci Rep 5, 16645 (2015). https://doi.org/10.1038/srep16645

Methylation

The following paper describes the methylation data available in ALSPAC: Relton, Caroline L., et al. “Data resource profile: accessible resource for integrated epigenomic studies (ARIES).” International journal of epidemiology 44.4 (2015): 1181-1190.

Population stratification

Population stratification is when an observed genetic association is driven by differences in ancestry or geography between population subgroups rather than by the variant itself. Not taking this into account can lead to biased estimates of effects. One common way to account for it is to calculate principal components (PCs) of the genetic data and then to include these as covariates in any models.

ALSPAC do not provide PCs as part of the standard omics datasets, as these would need to be regenerated and tested alongside each freeze. PCs can be generated using plink, hail or a variety of other tools.

For more information about how to do this in plink see: https://www.cog-genomics.org/plink/1.9/strat

Another common method used to account for population substructure is to use linear mixed models, for example with the BOLT-LMM software tool.

https://data.broadinstitute.org/alkesgroup/BOLT-LMM/

Polygenic risk scores (PRS)

These are scores which estimate the effect of variants in an individual genome on a given phenotypic trait or disease.

Further explanations can be found online, such as: https://www.genome.gov/Health/Genomics-and-Medicine/Polygenic-risk-scores

Or example tutorials for calculating PRSs: https://www.nature.com/articles/s41596-020-0353-1

Different collaborators often generate PRSs for ALSPAC, but these are not shared as part of our standard omics datasets. Collaborators who need PRSs will have to generate these themselves.
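At its simplest, a PRS is a weighted sum: each variant's allele count (or dosage) for a person is multiplied by an effect size taken from a published GWAS, and the products are added up. The Python sketch below shows only that core calculation. The variant IDs, weights and dosages are invented, the function name `polygenic_score` is hypothetical, and important real-world steps such as matching effect alleles, handling linkage disequilibrium and choosing which variants to include are left to the tutorials linked above.

```python
def polygenic_score(dosages, weights):
    """Sum of dosage x effect size over variants present in both inputs."""
    shared = sorted(set(dosages) & set(weights))
    return sum(dosages[variant] * weights[variant] for variant in shared)

# Hypothetical effect sizes, e.g. from GWAS summary statistics.
weights = {"rs0000001": 0.5, "rs0000002": -0.25, "rs0000003": 0.125}

# Hypothetical dosages of the effect allele for one person.
person = {"rs0000001": 2.0, "rs0000002": 1.0, "rs0000004": 0.0}

score = polygenic_score(person, weights)
print(score)  # 0.75
```

Variants that appear in only one of the two inputs are skipped here, which is one possible design choice; depending on the analysis you may prefer to report or impute them instead.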

Common tasks

Here we provide links to webpages with instructions, or brief details of any code, for completing common tasks using the various software we have described above (section x):

Extract some SNPs from a bgen data file and convert to plain text.

https://www.well.ox.ac.uk/~gav/qctool_v2/documentation/examples/filtering_variants.html

Extract some SNPs from bed data:

http://zzz.bwh.harvard.edu/plink/dataman.shtml

plink --bfile mydata --chr 2 --from-kb 5000 --to-kb 10000

Reading .bgen and .sample oxford files in plink

Plink supports bgen files but it is fussy about the types of its columns in the data.sample file. You may wish to remove or retype columns to read a data.sample file into plink. For more info see:

https://www.cog-genomics.org/plink/2.0/input

To make a new sample file with some columns removed, you can use the Unix command:

cut -f 1,2,3 -d " " data.sample > data2.sample

Courses

Working with ’Omics data can be complicated, but there are many excellent resources available to help you learn how to do this. There are both paid in-person courses and free online courses.

Details on paid courses offered by Bristol University can be found here: https://www.bristol.ac.uk/medical-school/study/short-courses/ In addition, a number of free online courses are summarised here: https://www.mooc-list.com/tags/bioinformatics

Further sources of help

Stack exchange

Stack Exchange is an online Q&A community which is divided into different sub-communities. The first and most well-known is Stack Overflow. This is one of the best places to ask questions about programming on the Internet. Other useful exchange sites include bioinformatics https://bioinformatics.stackexchange.com/, maths https://mathoverflow.net/ and statistics https://stats.stackexchange.com/.

Biostars

Biostars is a bioinformatics community Q&A website: https://www.biostars.org/

Mailing lists

Individual products or projects often have a mailing list. For example, to get help using SNPTEST you can ask on the mailing list https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html#contact

AI tools

AI tools such as ChatGPT can be useful for understanding how to work with omics data, but please be aware of their limitations and check documentation or research papers directly.

Ask ALSPAC

If you cannot find the answer to your question, or you think there is something wrong with your data, then please contact the alspac-omics@bristol.ac.uk mailbox and we will do our best to help you.