ALSPAC OMICs Data Catalogue

ALSPAC Omics team

2025-12-18 14:30:49


Introduction

Welcome to the ALSPAC Omics Catalogue, a guide to the omics data offered by ALSPAC. This catalogue features a variety of named ALSPAC datasets, each consisting of collected or produced data that has been organized, named, and curated for ease of use. Every named ALSPAC dataset comes with accompanying metadata that provides information about the dataset as a whole. Each named ALSPAC dataset has at least one release version that includes a curated selection of files detailed in the metadata sections.

Please note that these datasets are not generally accessible. Please see http://www.bristol.ac.uk/alspac/researchers/access/ for details of how to apply for access.

The information within this catalogue is made available for browsing to help both internal ALSPAC users and external researchers understand the data and facilitate prospective data requests.

As standard, we offer external collaborators “freezes” of specific named ALSPAC datasets. These freezes, along with their metadata, are outlined in this catalogue. External collaborators will be granted access to these freezes once their request is approved. A freeze is a carefully selected subset of the data files within a version: it contains the core data from a dataset, with data from participants who have withdrawn consent removed and specific dataset IDs applied. These freezes are subject to periodic updates.

Documentation for the current freeze of each dataset is given below as a YAML file, which lists the files external collaborators will receive together with their metadata.

Due to the removal of withdrawn individuals from the freezes, please note that the number of participants within each dataset may change over time and may not match those found in the Methodology fields.

Freeze 1 timing: July 2021 - Dec 2022
Freeze 2 timing: Dec 2022 - Dec 2023
Freeze 3 timing: Jan 2023 - Oct 2024
Freeze 4 timing: Oct 2024 - June 2025
Freeze 5 timing: June 2025 - Dec 2025
Freeze 6 timing: Dec 2025 - Current
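
Each freeze document below records an md5sum for every file. As a minimal sketch (not an official ALSPAC tool), the following Python shows how a recipient might check delivered files against those checksums. The function names and the hand-copied mapping of file name to md5sum are assumptions made for illustration, and delivered file names may differ from the freeze_id placeholders used in the YAML.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_freeze_files(directory, expected):
    """Compare files in `directory` with a {file name: md5sum} mapping.

    The mapping is assumed to be copied from the `files` entries of a
    freeze document. Returns {file name: "ok" | "mismatch" | "missing"}.
    """
    results = {}
    for name, md5sum in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == md5sum:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

For example, check_freeze_files("data", {"freeze_id.bim": "0be48a05ee0e98d0de8180ae658768b2"}) would report whether a file of that name is present in a local "data" directory and matches the checksum listed for gwa_550_g1.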

Genetic Array Data

Genome-wide - Illumina 550 quad - G1 (gwa_550_g1)

Description

This dataset contains genome-wide array genotype calls for G1 individuals.
Reference genome build: GRCh37

Methodology

ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platform by 23andMe, subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%); and insufficient sample replication (IBD < 0.8).

Population stratification was assessed by multidimensional scaling analysis and compared with HapMap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed.

SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed.
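
As an illustration only, the three SNP-level thresholds above can be written as a single predicate. This is a sketch under the assumption that each SNP already has a minor allele frequency, call rate and Hardy-Weinberg P value computed elsewhere; the function and parameter names are hypothetical and this is not the pipeline that produced the dataset.

```python
def snp_passes_qc(maf, call_rate, hwe_p,
                  min_maf=0.01, min_call_rate=0.95, min_hwe_p=5e-7):
    """Return True if a SNP survives the three filters described above.

    A SNP is removed when its minor allele frequency is below 1%, its
    call rate is below 95%, or its Hardy-Weinberg P value is below 5E-7.
    """
    return (maf >= min_maf
            and call_rate >= min_call_rate
            and hwe_p >= min_hwe_p)
```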

Associated publication:
- Horikoshi et al 2013 (https://doi.org/10.1038/ng.2477)

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gwa_550_g1_2022-12-05_f6
name: >-
  Genome-wide array data for G1 individuals 2022-12-05 freeze 6
description: >-
  The sixth freeze of the genome-wide array data for G1 based on the 2022-12-05 release. The data is in plink format.
  
  Contains .hh file, which is produced automatically when the input data contains heterozygous calls where they shouldn't be possible (haploid chromosomes, male X/Y), or there are nonmissing calls for nonmales on the Y chromosome. Consists of a text file with one line per error (sorted primarily by variant ID, secondarily by sample ID) with the following three fields:
    1. Family ID
    2. Within-family ID
    3. Variant ID
freeze_size: 997M
linker_file_md5sum: 45415c7d4fae355b4fb2d6ccd042620d
woc_file_md5sum: 6c887db8c7dd10cc695630ca73b41405
all_individuals_to_exclude_md5sum: e4efce63f9f671548d08c8bb2f9cc4f7
git_tag: https://github.com/alspac/dataset_gwa_550_g1/releases/tag/freeze6
is_current_freeze: true
freeze_number: 6
freeze_date: 2025-09-30
previous_freeze: alspacdcs:gwa_550_g1_2022-12-05_f5
freeze_of_alspac_dataset_version: alspacdcs:gwa_550_g1_2022-12-05
freeze_of_named_alspac_dataset: alspacdcs:gwa_550_g1

contains:
- data

files: []
data:
  contains:
  - freeze_id.bed
  - freeze_id.bim
  - freeze_id.fam
  - freeze_id.hh
  - freeze_id.log
  files:
  - id: alspacdcs:19718249-e6b9-437a-89e6-f8023285ba85
    name: freeze_id.bed
    md5sum: c708b16229b4a9af9ddd2f98e34b2d39
    filesize: 981.4MB
    filetype: .bed
    belongs_to: data
  - id: alspacdcs:bd518596-7d72-4662-a351-2aabc4b2c816
    name: freeze_id.bim
    md5sum: 0be48a05ee0e98d0de8180ae658768b2
    filesize: 13.4MB
    filetype: .bim
    number_of_variants: 500527
    belongs_to: data
  - id: alspacdcs:d25e6ef8-f6c0-4a29-b7f9-1cf9cf4139bb
    name: freeze_id.fam
    md5sum: 09847a7ba78db2da9fd6495a5d771c4f
    filesize: 248.9KB
    filetype: .fam
    number_of_participants: 8222
    belongs_to: data
  - id: alspacdcs:34ad861a-4492-41d5-a527-17e673aa8196
    name: freeze_id.hh
    md5sum: 609e4e8b8fd7f660b853b3f99013c0a4
    filesize: 1.6MB
    filetype: .hh
    belongs_to: data
  - id: alspacdcs:bad942d2-bd9e-4862-817c-d7e000bad2e0
    name: freeze_id.log
    md5sum: af9ddd5af43a34e2acf38ecb99d8fe4b
    filesize: 1.1KB
    filetype: .log
    belongs_to: data
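
The .hh file described in the freeze document above has three fields per line (Family ID, Within-family ID, Variant ID). A minimal sketch of reading it in Python follows; it assumes the fields are separated by whitespace and contain no embedded spaces, and the function names are hypothetical.

```python
from collections import Counter


def read_hh(lines):
    """Yield (family_id, within_family_id, variant_id) from .hh lines."""
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 fields, got {len(fields)}")
        yield tuple(fields)


def count_by_variant(records):
    """Count how many heterozygous-haploid errors each variant has."""
    return Counter(variant_id for _, _, variant_id in records)
```

With an open file handle, count_by_variant(read_hh(handle)).most_common(10) would list the ten variants with the most flagged calls.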

Genome-wide - Illumina exome core array - G0 partners (gwa_exome_g0p)

Description

This dataset contains genome wide array genotype calls for G0 mothers and partners.
Reference genome build: GRCh37

Methodology

A total of 3,453 ALSPAC mothers and fathers were genotyped at 535,478 SNPs using the Illumina HumanCoreExome chip genotyping platform by the ALSPAC lab, and genotypes were called using GenomeStudio. The resulting raw genome-wide data were subjected to standard quality control methods using PLINK (v1.07). Individuals were excluded on the basis of gender mismatches (n = 80); minimal or excessive heterozygosity (n = 64); disproportionate levels of individual missingness (>5%, n = 60); and possible contamination (n = 3).

Population stratification was assessed by multidimensional scaling analysis and compared with 1000 Genomes phase 3 data and principal component analysis (n = 266); all individuals with non-European ancestry were removed. Cryptic relatedness was measured as SNP relatedness in GCTA (relatedness > 0.1, n = 69 removed). SNPs with a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 1E-7) and those which failed GenomeStudio quality control measures were removed (n = 21,298). 6,594 duplicate SNPs were also removed. This resulted in genotypes for 2,911 unrelated mothers and fathers at 507,586 SNPs. We then identified 2217 samples for which the ALN assigned historically by the lab matched the genetically assigned ALN.
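
The counts quoted in the two paragraphs above can be reconciled with simple arithmetic. The sketch below assumes that the exclusion categories do not overlap and that n = 266 is the number of individuals removed at the ancestry step; neither is stated explicitly in the text.

```python
# Sample counts quoted in the Methodology text above.
genotyped_samples = 3453
sample_exclusions = {
    "gender mismatch": 80,
    "heterozygosity": 64,
    "missingness > 5%": 60,
    "possible contamination": 3,
    "ancestry (MDS/PCA)": 266,  # assumed to be the number removed
    "cryptic relatedness": 69,
}

# SNP counts quoted in the Methodology text above.
genotyped_snps = 535478
snp_exclusions = {
    "call rate, HWE or GenomeStudio QC": 21298,
    "duplicates": 6594,
}

remaining_samples = genotyped_samples - sum(sample_exclusions.values())
remaining_snps = genotyped_snps - sum(snp_exclusions.values())

print(remaining_samples)  # 2911
print(remaining_snps)     # 507586
```

Both results match the 2,911 individuals and 507,586 SNPs reported in the text, which is consistent with the two assumptions.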

1737 putative G0 partner-G1 pairs for whom both G0 partner and G1 have called genotype data available were identified based on ALN. Given the G0 partners were invited by the G0 mother to take part and only enrolled in the study in their own right several years later, it could not be assumed that all G0 partners were biologically related to G1. Called genotype data for the 1720 unique G0 partners and 1737 unique G1s were merged (i.e. there were 17 pairs of siblings/twins among the G1 offspring), using plink v1.90b7.2 64-bit (11 Dec 2023).

After application of the plink filters --geno 0.05, --maf 0.01, --snps-only just-acgt and --autosome, 113288 SNPs remained. The --related command in KING version 2.3.2 was used to perform kinship analysis, which confirmed that all 1737 putative G0 partner-G1 pairs are genetically related, as would be expected for biological father-offspring pairs, using the inference criteria described in Table 1 of Manichaikul, Ani, et al. “Robust relationship inference in genome-wide association studies.” Bioinformatics 26.22 (2010): 2867-2873.
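
To make the inference criteria concrete, the sketch below classifies an estimated kinship coefficient using the powers-of-two cut-offs commonly quoted from that table (approximately 0.354, 0.177, 0.0884 and 0.0442). The function name and labels are hypothetical, and the handling of values exactly on a boundary is an arbitrary choice here. Kinship alone does not separate parent-offspring pairs from full siblings; that distinction also relies on the proportion of zero IBS or IBD sharing, which is not shown.

```python
def infer_degree(kinship):
    """Classify a kinship coefficient into a degree of relatedness."""
    if kinship > 2 ** -1.5:   # about 0.354
        return "duplicate or MZ twin"
    if kinship > 2 ** -2.5:   # about 0.177
        return "first degree"
    if kinship > 2 ** -3.5:   # about 0.0884
        return "second degree"
    if kinship > 2 ** -4.5:   # about 0.0442
        return "third degree"
    return "unrelated"
```

A biological father-offspring pair has an expected kinship coefficient of 0.25, which falls in the first-degree range.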

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gwa_exome_g0p_2016-11-22_f6
name: Freeze 6 version 2016-11-22 Genome-wide - Illumina exome core array - G0 partners
description: >-
  Freeze 6 version 2016-11-22 Genome-wide array data including genotype calls for G0 partners, including additional G0 mothers who were absent from previous genotyping rounds. 

  Data in plink format, including .hh file, which is produced automatically when the input data contains heterozygous calls where they shouldn't be possible (haploid chromosomes, male X/Y), or there are nonmissing calls for nonmales on the Y chromosome. Consists of a text file with one line per error (sorted primarily by variant ID, secondarily by sample ID) with the following three fields:
    1. Family ID
    2. Within-family ID
    3. Variant ID
freeze_size: 281M
linker_file_md5sum: 45415c7d4fae355b4fb2d6ccd042620d
woc_file_md5sum: 6c887db8c7dd10cc695630ca73b41405
all_individuals_to_exclude_md5sum: e4efce63f9f671548d08c8bb2f9cc4f7
git_tag: https://github.com/alspac/dataset_gwa_exome_g0p/releases/tag/freeze6
is_current_freeze: true
freeze_number: 6
freeze_date: 2025-09-30
previous_freeze: alspacdcs:gwa_exome_g0p_2016-11-22_f5
freeze_of_alspac_dataset_version: alspacdcs:gwa_exome_g0p_2016-11-22
freeze_of_named_alspac_dataset: alspacdcs:gwa_exome_g0p

contains:
- data

files: []
data:
  contains:
  - freeze_id.bed
  - freeze_id.bim
  - freeze_id.fam
  - freeze_id.hh
  - freeze_id.log
  files:
  - id: alspacdcs:ec23e114-dc37-440f-a7cc-fca07375ccad
    name: freeze_id.bed
    md5sum: 304b0d356880c5174806ce08d7beffd3
    filesize: 266.2MB
    filetype: .bed
    belongs_to: data
  - id: alspacdcs:76ea8622-c436-4e75-a8d2-2c5b2bfd0d2c
    name: freeze_id.bim
    md5sum: 0fe43f888776059fef0a76d3f08d00ad
    filesize: 13.9MB
    filetype: .bim
    number_of_variants: 507586
    belongs_to: data
  - id: alspacdcs:0e23f056-c8a1-4ccb-8f41-ffae41613be2
    name: freeze_id.fam
    md5sum: 5145c717970e73ceaa7268ce00a7ea15
    filesize: 122.3KB
    filetype: .fam
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:e5d619e3-7202-4d90-92f4-6a37e47bfe39
    name: freeze_id.hh
    md5sum: cfa6a113c8f8e54c4e5d4b69e8a31fa9
    filesize: 115.3KB
    filetype: .hh
    belongs_to: data
  - id: alspacdcs:773da93d-79bf-464c-a4d6-e3d35d9398f1
    name: freeze_id.log
    md5sum: a1ab2605df887f103555e24b11bb2545
    filesize: 1.1KB
    filetype: .log
    belongs_to: data

Genome-wide - Illumina 660 quad - G0 mothers (gwa_660_g0m)

Description

This dataset contains genome-wide array data including raw files and genotype calls for G0 mothers.
Legacy 1 reference genome: GRCh36
Legacy 2 reference genome: GRCh37

Methodology

ALSPAC mothers were genotyped using the Illumina Human660W-Quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs.

SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally, SNPs with a minor allele frequency of less than 1% were removed. Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded.

Cryptic relatedness was assessed using an IBD estimate of more than 0.125, which is expected to correspond to roughly 12.5% of alleles shared IBD, or relatedness at the first cousin level. Related subjects that passed all other quality control thresholds were retained. In total, 9,048 subjects and 526,688 SNPs passed these quality control filters.
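
The link between the 0.125 threshold and first cousins can be shown with a short calculation. The proportion of alleles shared IBD is half the probability of sharing one allele plus the probability of sharing two. The sketch below uses the standard expected sharing probabilities for a few relationships; the names are hypothetical and the values are theoretical expectations, not estimates from ALSPAC data.

```python
# Expected probabilities (z0, z1, z2) of sharing 0, 1 or 2 alleles IBD.
EXPECTED_IBD = {
    "parent-offspring": (0.0, 1.0, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
    "half siblings": (0.5, 0.5, 0.0),
    "first cousins": (0.75, 0.25, 0.0),
}


def proportion_ibd(z0, z1, z2):
    """Proportion of alleles shared IBD: half of z1 plus all of z2."""
    return 0.5 * z1 + z2


for relationship, sharing in EXPECTED_IBD.items():
    print(relationship, proportion_ibd(*sharing))
```

First cousins give 0.5 × 0.25 = 0.125, so a threshold of more than 0.125 flags pairs estimated to be more closely related than first cousins.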

Associated publication:
- Rietveld et al 2013 (https://doi.org/10.1126/science.1235488)

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gwa_660_g0m_2022-12-05_f6
name: Freeze 6 version 2022-12-05 Genome-wide - Illumina 660 quad - G0 mothers
description: >-
  Freeze 6 of genome-wide array data including genotype calls for G0 mothers.

  Contains two sets of data, legacy1 and legacy2.
  legacy1: A directory containing the plink data files.
      Includes the full set of SNPs (aligned to genome build 36), but is missing ~500 mothers who
      were excluded in legacy QC due to strict relatedness inclusion thresholds.
  legacy2: A directory containing the plink data files.
      Includes the full set of individuals but, due to legacy QC, is restricted
      to a set of ~480k SNPs that overlap with the Illumina 550k array
      (which was used for G1 in gwa_550_g1). This QC was performed alongside liftOver to genome build 37.

freeze_size: 2G
linker_file_md5sum: 45415c7d4fae355b4fb2d6ccd042620d
woc_file_md5sum: 6c887db8c7dd10cc695630ca73b41405
all_individuals_to_exclude_md5sum: e4efce63f9f671548d08c8bb2f9cc4f7
git_tag: https://github.com/alspac/dataset_gwa_660_g0m/releases/tag/freeze6
is_current_freeze: true
freeze_number: 6
freeze_date: 2025-09-30
freeze_of_alspac_dataset_version: alspacdcs:gwa_660_g0m_2022-12-05
freeze_of_named_alspac_dataset: alspacdcs:gwa_660_g0m

contains:
- data
files: []
data:
  contains:
  - legacy2
  - legacy1
  files: []
  legacy2:
    contains:
    - freeze_id.bed
    - freeze_id.bim
    - freeze_id.fam
    - freeze_id.log
    files:
    - id: alspacdcs:321df196-491b-4cdf-99a1-7ff40882c242
      name: freeze_id.bed
      md5sum: 7559903a4811210f6289497e1323dfe7
      filesize: 960.3MB
      filetype: .bed
      belongs_to: data/legacy2
    - id: alspacdcs:0843af40-8fea-4265-9870-cd492fab06cd
      name: freeze_id.bim
      md5sum: b4a1adb225de05d92d0af585950fd423
      filesize: 12.3MB
      filetype: .bim
      number_of_variants: 465740
      belongs_to: data/legacy2
    - id: alspacdcs:f7bb4abb-2e0f-4809-9449-53cb2e35659c
      name: freeze_id.fam
      md5sum: 4f3c4043ebed461f5b1272b5ab8579ea
      filesize: 447.6KB
      filetype: .fam
      number_of_participants: 8648
      belongs_to: data/legacy2
    - id: alspacdcs:d28b0471-5e01-49ea-9107-37c92f06a29c
      name: freeze_id.log
      md5sum: 59864532d578c1ba0fdf3aa95022510b
      filesize: 981.0B
      filetype: .log
      belongs_to: data/legacy2
  legacy1:
    contains:
    - freeze_id.bed
    - freeze_id.bim
    - freeze_id.fam
    - freeze_id.log
    files:
    - id: alspacdcs:375a4bd0-df16-45e1-878d-aef46db50e8b
      name: freeze_id.bed
      md5sum: be66d3cc1d3d906c4d396cc161a605b1
      filesize: 1019.6MB
      filetype: .bed
      belongs_to: data/legacy1
    - id: alspacdcs:daf916d2-8f0b-4113-afa9-d1d420b9d894
      name: freeze_id.bim
      md5sum: 88b8c2221ef4ddc03118042db70d8575
      filesize: 14.0MB
      filetype: .bim
      number_of_variants: 526688
      belongs_to: data/legacy1
    - id: alspacdcs:e2152aba-48db-4ca8-96f3-86a6ec257448
      name: freeze_id.fam
      md5sum: c97e448b8ae0bf29c1ca609a4719d05b
      filesize: 253.7KB
      filetype: .fam
      number_of_participants: 8118
      belongs_to: data/legacy1
    - id: alspacdcs:b1cad595-044c-4b34-922b-3ed8095ae628
      name: freeze_id.log
      md5sum: 69b9dd28928c7adfb0459e7c3f07a0f0
      filesize: 981.0B
      filetype: .log
      belongs_to: data/legacy1

Genome-wide - CNV - G1 (cnv_550_g1)

Description

This dataset contains predicted ALSPAC CNVs using PennCNV, generated from 23andMe raw genotype data.

Methodology

Log R Ratio (LRR) and B Allele Frequency (BAF) data were missing from the 23andMe raw genotype data, so we generated these data ourselves using an in-house algorithm. We then ran PennCNV using the hh550 libraries.

The filtered PennCNV calls were produced as follows. Multiple calls were merged using the ‘clean_cnv.pl’ script, using a merge fraction of 0.5. Individuals with > 30 CNVs, a Log R Ratio SD of > 0.3, a BAF drift of > 0.002, and a waviness factor of > 0.05 were removed. CNVs in which at least 50% of the length of the CNV call overlapped with any telomeric, centromeric or immunoglobulin region were removed using the ‘scan_region.pl’ script in PennCNV.

In addition, CNVs covering fewer than 5 probes, of a length < 5kb, and with a confidence score below 10 were removed. Density was calculated as the number of probes in a CNV divided by the length of the CNV, and CNVs where the density of probes across the call was < 1 probe per 20kb were removed.

These QC parameters are suggestions only, and the resulting calls are provided in filtered.cnv. Analysts can apply their own filter parameters to the raw calls in data.cnv.
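
For analysts who want to apply their own parameters, the CNV-level thresholds described above can be sketched as a single function. This is an illustration under one stated assumption: a CNV is dropped if it fails any one of the four criteria (the text could also be read otherwise). The function and parameter names are hypothetical, and this is not the script used to produce the filtered file.

```python
def cnv_passes_filters(n_probes, length_bp, confidence,
                       min_probes=5, min_length_bp=5000,
                       min_confidence=10, max_bp_per_probe=20000):
    """Return True if a CNV call survives the suggested CNV-level filters.

    Assumption: failing any single criterion removes the call.
    """
    if n_probes < min_probes:
        return False
    if length_bp < min_length_bp:
        return False
    if confidence < min_confidence:
        return False
    # Density below 1 probe per 20 kb means more than 20,000 bp per probe.
    if n_probes * max_bp_per_probe < length_bp:
        return False
    return True
```

For example, a call covering 5 probes over 200 kb has a density of 1 probe per 40 kb and is therefore removed, even though it meets the probe-count and length thresholds.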

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:cnv_550_g1_2015-11-09_f6
name: Genome-wide - CNV - G1 release version 2015-11-09 freeze 6
description: >-
  This is the sixth freeze of the 2015-11-09 version of the
  cnv_550_g1 dataset.
  It contains two csv versions of the called CNV data, the unfiltered
  and filtered versions.
freeze_size: 27M
linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a
git_tag: https://github.com/alspac/dataset_cnv_550_g1/releases/tag/freeze6
is_current_freeze: true
freeze_number: 6
freeze_date: 2025-09-30
previous_freeze: alspacdcs:cnv_550_g1_2015-11-09_f5
freeze_of_alspac_dataset_version: alspacdcs:cnv_550_g1_2015-11-09
freeze_of_named_alspac_dataset: alspacdcs:cnv_550_g1

contains:
- data
files: []

data:
  contains:
  - new_filtered.csv
  - new_cnvdata.csv
  files:
  - id: alspacdcs:50895b7b-f20f-4d6d-97ae-a1bc18f1d393
    name: new_filtered.csv
    md5sum: aeb36ef5266f890bfecff3448325da8c
    description: >-
      CNV data that has been filtered.
      columns
        V1 - Position
        V2 - Number of markers in the region
        V3 - CNV length
        V4 - Copy number estimate
        V6 - Start SNP
        V7 - End SNP
        V8 - Confidence score
        qlet - within pregnancy ID
        cnv_550_g1 - pregnancy ID
    number_of_participants: 6791  #data$id_qlet <- paste(data$cnv_550_g1, data$qlet, sep="_")
        #length(unique(data$id_qlet))
    number_of_cnv_variants: 14242 # Read file into R as data then:
        # dim(unique(data[1]))
    filesize: 5.9MB
    filetype: .csv
    belongs_to: data
  - id: alspacdcs:45549969-94fc-4e76-91d1-3a12e750d380
    name: new_cnvdata.csv
    md5sum: 3bf366db747ed456613b100566bbd9a8
    description: >-
      This is the output of PennCNV before filtering.
      columns
        V1 - Position
        V2 - Number of markers in the region
        V3 - CNV length
        V4 - Copy number estimate
        V6 - Start SNP
        V7 - End SNP
        V8 - Confidence score
        qlet - within pregnancy ID
        cnv_550_g1 - pregnancy ID
    number_of_participants: 7448  #data$id_qlet <- paste(data$cnv_550_g1, data$qlet, sep="_")
        #length(unique(data$id_qlet))
    number_of_cnv_variants: 70025 # Read file into R as data then:
        # dim(unique(data[1]))
    filesize: 20.8MB
    filetype: .csv
    belongs_to: data
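
The R comments in the freeze document above describe how the participant and CNV counts were obtained. A rough Python equivalent is sketched below. It assumes each csv file has a header row containing the column names V1, qlet and cnv_550_g1 listed in the descriptions; the function names are hypothetical.

```python
import csv


def summarise_cnv_rows(rows):
    """Return (number of participants, number of distinct V1 values).

    A participant is identified by pregnancy ID plus within-pregnancy ID,
    mirroring paste(data$cnv_550_g1, data$qlet, sep="_") in the R comments.
    """
    participants = set()
    regions = set()
    for row in rows:
        participants.add(f"{row['cnv_550_g1']}_{row['qlet']}")
        regions.add(row["V1"])
    return len(participants), len(regions)


def summarise_cnv_csv(path):
    """Apply summarise_cnv_rows to a csv file with a header row."""
    with open(path, newline="") as handle:
        return summarise_cnv_rows(csv.DictReader(handle))
```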

Imputed Data

Genome-wide - HRC imputed - G0 mothers + G1 (gi_hrc_g0m_g1)

Description

This dataset contains genotype data imputed to HRC for G0 mothers and G1.
Reference genome build: GRCh37

Methodology

ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platform by 23andMe, subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%); and insufficient sample replication (IBD < 0.8).

Population stratification was assessed by multidimensional scaling analysis and compared with HapMap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed.

SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed.

Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,115 subjects and 500,527 SNPs passed these quality control filters.

ALSPAC mothers were genotyped using the Illumina Human660W-Quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs. SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally, SNPs with a minor allele frequency of less than 1% were removed.

Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded.

9,048 subjects and 526,688 SNPs passed these quality control filters.

We combined 477,482 SNP genotypes in common between the sample of mothers and sample of children. We removed SNPs with genotype missingness above 1% due to poor quality (11,396 SNPs removed) and removed a further 321 subjects due to potential ID mismatches. This resulted in a dataset of 17,842 subjects containing 6,305 duos and 465,740 SNPs (112 were removed during liftover and 234 were out of HWE after combination). We estimated haplotypes using SHAPEIT (v2.r644), which utilises relatedness during phasing. The phased haplotypes were then imputed to the Haplotype Reference Consortium (HRCr1.1, 2016) panel of approximately 31,000 phased whole genomes. The HRC panel was phased using SHAPEIT v2.r727, and the imputation was performed using the Michigan imputation server.
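
The subject and SNP totals for the combined dataset follow from the figures quoted in this Methodology section. The short check below assumes the 321 subjects were removed from the sum of the two QC-passing samples (9,115 children and 9,048 mothers) and that the three SNP removals were applied in sequence to the 477,482 shared SNPs.

```python
# Subjects: children and mothers passing QC, minus potential ID mismatches.
children_passing_qc = 9115
mothers_passing_qc = 9048
removed_id_mismatch = 321
combined_subjects = (children_passing_qc + mothers_passing_qc
                     - removed_id_mismatch)

# SNPs: shared SNPs minus the three removal steps described above.
shared_snps = 477482
removed_missingness = 11396
removed_liftover = 112
removed_hwe = 234
combined_snps = (shared_snps - removed_missingness
                 - removed_liftover - removed_hwe)

print(combined_subjects)  # 17842
print(combined_snps)      # 465740
```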

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gi_hrc_g0m_g1_2017-05-04_f6
name: >-
  Genome-wide - HRC imputed - G0 mothers + G1 version 2017-05-04 freeze 6
description: >-
  Freeze 6 of version 2017-05-04 Genome-wide array data imputed to the HRC reference panel for G0 mothers and G1 individuals in bgen and sample file format (version 1.2). 
freeze_size: 114G
linker_file_md5sum: 45415c7d4fae355b4fb2d6ccd042620d
woc_file_md5sum: 6c887db8c7dd10cc695630ca73b41405
all_individuals_to_exclude_md5sum: e4efce63f9f671548d08c8bb2f9cc4f7
git_tag: https://github.com/alspac/dataset_gi_hrc_g0m_g1/releases/tag/freeze6
is_current_freeze: true
freeze_number: 6
freeze_date: 2025-09-30
previous_freeze: alspacdcs:gi_hrc_g0m_g1_2017-05-04_f5
freeze_of_alspac_dataset_version: alspacdcs:gi_hrc_g0m_g1_2017-05-04
freeze_of_named_alspac_dataset: alspacdcs:gi_hrc_g0m_g1

contains:
- data

files: []
data:
  contains:
  - filtered_21.bgen
  - filtered_22.bgen
  - filtered_23male.bgen
  - filtered_20.bgen
  - filtered_19.bgen
  - filtered_18.bgen
  - filtered_15.bgen
  - filtered_17.bgen
  - filtered_14.bgen
  - filtered_23female.bgen
  - filtered_16.bgen
  - filtered_13.bgen
  - filtered_09.bgen
  - filtered_12.bgen
  - filtered_10.bgen
  - filtered_11.bgen
  - filtered_08.bgen
  - filtered_07.bgen
  - filtered_06.bgen
  - filtered_05.bgen
  - filtered_03.bgen
  - filtered_04.bgen
  - filtered_01.bgen
  - filtered_02.bgen
  - swapped_23_male.sample
  - swapped_23_female.sample
  - swapped.sample
  files:
  - id: alspacdcs:c7184af4-0945-445e-b8f1-9343d2797971
    name: filtered_21.bgen
    md5sum: d944952cd7ae62525a0b2902306b0371
    filesize: 1.7GB
    filetype: .bgen
    number_of_variants: 531276
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:8991c75e-ddca-4ed8-acd9-b1cc8e52b465
    name: filtered_22.bgen
    md5sum: e6e35fc7bb4af26579e86117182a867b
    filesize: 1.8GB
    filetype: .bgen
    number_of_variants: 524544
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:505ca2f7-01dc-4780-807d-32238508e3d1
    name: filtered_23male.bgen
    md5sum: 0a865f7f362a08741f62980790ea00d1
    filesize: 1.2GB
    filetype: .bgen
    number_of_variants: 1228035
    number_of_participants: 4500
    belongs_to: data
  - id: alspacdcs:56d0823d-e769-4547-be13-4c48d2b69897
    name: filtered_20.bgen
    md5sum: 91cd660fb3febebc4acd427f69ce9b76
    filesize: 2.6GB
    filetype: .bgen
    number_of_variants: 884983
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:24ba7793-a9e5-447c-81b3-37051ac9e9e8
    name: filtered_19.bgen
    md5sum: 395c180a38fc6b3982ec31a2b870e520
    filesize: 3.4GB
    filetype: .bgen
    number_of_variants: 868554
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:bec2e8f8-1f53-42a1-ae88-92866d0e6288
    name: filtered_18.bgen
    md5sum: 8b07da40891cd7d6d140402e11ccc450
    filesize: 3.1GB
    filetype: .bgen
    number_of_variants: 1104755
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:d855abcd-9b51-49ca-a47d-d6462f41315b
    name: filtered_15.bgen
    md5sum: 949e7d7fd1db2ae89ef659957967e03e
    filesize: 3.4GB
    filetype: .bgen
    number_of_variants: 1139215
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:b7170945-432a-4a9a-9e3b-17f629463fa3
    name: filtered_17.bgen
    md5sum: b405b7608ef1189ce700fd8fe1df096e
    filesize: 3.6GB
    filetype: .bgen
    number_of_variants: 1090072
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:ad4b422c-8473-4593-b683-fdc862642761
    name: filtered_14.bgen
    md5sum: 0c20c36da72c89c59204a344b66d3758
    filesize: 3.5GB
    filetype: .bgen
    number_of_variants: 1266536
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:82972f35-103d-45f8-ad02-f3c4a5194c86
    name: filtered_23female.bgen
    md5sum: d4abdc0d84bda1f8a3eec5c9cee8977b
    filesize: 4.2GB
    filetype: .bgen
    number_of_variants: 1228035
    number_of_participants: 12943
    belongs_to: data
  - id: alspacdcs:a0b692db-9a88-4ef2-b73b-dbce7161257d
    name: filtered_16.bgen
    md5sum: d941572511b3377843eb6a8aefe79a34
    filesize: 4.1GB
    filetype: .bgen
    number_of_variants: 1281298
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:3043d7b8-e1e4-4c57-a130-d65a654e4062
    name: filtered_13.bgen
    md5sum: 6c7efae02d5581e86db400aff3213b8f
    filesize: 3.7GB
    filetype: .bgen
    number_of_variants: 1385434
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:55c6a805-bdbe-4436-9cd5-f047ce051900
    name: filtered_09.bgen
    md5sum: 0cc5973a41ede08fa4afe3958c4972c6
    filesize: 4.5GB
    filetype: .bgen
    number_of_variants: 1675899
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:754f61a5-e376-4755-b4d4-66c5a6ba2188
    name: filtered_12.bgen
    md5sum: 016275752dfe39b2b921f82e476420bf
    filesize: 5.1GB
    filetype: .bgen
    number_of_variants: 1848118
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:89eb6efe-438d-42ce-992a-644c8edcdc44
    name: filtered_10.bgen
    md5sum: a44c8a763298e2fb94efccb35d21cf5c
    filesize: 5.1GB
    filetype: .bgen
    number_of_variants: 1927504
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:f42cc063-672e-4299-9af5-0c98ec7382eb
    name: filtered_11.bgen
    md5sum: b1fee7aa390f3c52f4884a2fb5f7196a
    filesize: 5.2GB
    filetype: .bgen
    number_of_variants: 1936990
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:afdcb88c-111d-4b2c-9485-cb66eed0f392
    name: filtered_08.bgen
    md5sum: 3403e27325bbf785589498e42f4536dd
    filesize: 5.7GB
    filetype: .bgen
    number_of_variants: 2242706
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:1ce1ee18-6715-4ea5-8721-2c07334e001c
    name: filtered_07.bgen
    md5sum: 68fa282857de492d9a36ad0e5d045ed9
    filesize: 6.6GB
    filetype: .bgen
    number_of_variants: 2289306
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:185f7830-7af7-4808-87ee-43dd1ae05b2c
    name: filtered_06.bgen
    md5sum: 74b1b0d1e46662f1b0216a9e67c42f54
    filesize: 6.3GB
    filetype: .bgen
    number_of_variants: 2460112
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:919fe0bb-9bb6-4d1e-8dbc-bbf6ea76d994
    name: filtered_05.bgen
    md5sum: 90c9d5589d86b9611cc6b16da239fd36
    filesize: 6.7GB
    filetype: .bgen
    number_of_variants: 2588170
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:31cb65a9-320e-4db7-a243-af3547f269f9
    name: filtered_03.bgen
    md5sum: de1fe1cd4acd6b25430310c9e849caaa
    filesize: 7.3GB
    filetype: .bgen
    number_of_variants: 2821895
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:f1b8b2f7-d0a5-46e0-8f31-a54bfd84066d
    name: filtered_04.bgen
    md5sum: 7e0625019c52c820202127cf0edd4e63
    filesize: 7.9GB
    filetype: .bgen
    number_of_variants: 2787582
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:b03079de-2c94-45e6-a7b3-d5b18267bac1
    name: filtered_01.bgen
    md5sum: 99bcd042d88989e303d6425e0a82f27d
    filesize: 8.6GB
    filetype: .bgen
    number_of_variants: 3069932
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:08a494f3-84fc-455a-a2a0-4bdd0efe8b3b
    name: filtered_02.bgen
    md5sum: f50b3709b381b89a571468133a954f38
    filesize: 8.7GB
    filetype: .bgen
    number_of_variants: 3392238
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:18f8763d-b876-4648-8946-e03d564d5a6e
    name: swapped_23_male.sample
    md5sum: 48c1a0e6ab8f3c7a22662957b69646dd
    filesize: 259.3KB
    filetype: .sample
    number_of_participants: 4500
    belongs_to: data
  - id: alspacdcs:f9603ac1-c20b-4824-9608-15acaac5769d
    name: swapped_23_female.sample
    md5sum: 77adf7126efae7f70c74c32abac67679
    filesize: 745.8KB
    filetype: .sample
    number_of_participants: 12943
    belongs_to: data
  - id: alspacdcs:015eadb8-9b0c-4f4d-8206-bb715f0a3c03
    name: swapped.sample
    md5sum: 83772169b3ae48a868b30615c69804a6
    filesize: 1005.1KB
    filetype: .sample
    number_of_participants: 17443
    belongs_to: data
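
When working with this freeze, each .bgen file needs a matching .sample file. The sketch below builds the list of pairs from the file names in the freeze document above. The pairing itself is an assumption inferred from the matching number_of_participants values (17443 for the autosomes and swapped.sample, 4500 for the male chromosome 23 files, 12943 for the female ones); the function name is hypothetical.

```python
def bgen_sample_pairs():
    """Return (bgen file, sample file) pairs for the files listed above."""
    # Chromosomes 1 to 22 are assumed to share the full sample file.
    pairs = [(f"filtered_{chrom:02d}.bgen", "swapped.sample")
             for chrom in range(1, 23)]
    # Chromosome 23 is split by sex, each half with its own sample file.
    pairs.append(("filtered_23male.bgen", "swapped_23_male.sample"))
    pairs.append(("filtered_23female.bgen", "swapped_23_female.sample"))
    return pairs


for bgen_name, sample_name in bgen_sample_pairs():
    print(bgen_name, sample_name)
```

The male and female chromosome 23 participant counts sum to the autosomal count (4500 + 12943 = 17443), which is consistent with this pairing.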

Genome-wide - HapMap2 imputed - G1 (gi_hapmap2_g1)

Description

This dataset contains genotype data imputed to HapMap 2 for G1.
Reference genome build: GRCh36

Methodology

A total of 9912 subjects were genotyped using the Illumina HumanHap550 quad genome-wide SNP genotyping platform by 23andMe, subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, USA.

Individuals were excluded from further analysis on the basis of having incorrect gender assignments; minimal or excessive heterozygosity (<0.320 and >0.345 for the Sanger data and <0.310 and >0.330 for the LabCorp data); disproportionate levels of individual missingness (>3%); evidence of cryptic relatedness (>10% IBD); and being of non-European ancestry (as detected by a multidimensional scaling analysis seeded with HapMap 2 individuals). EIGENSTRAT analysis revealed no additional obvious population stratification, and genome-wide analyses with other phenotypes indicate a low lambda. The resulting data set consisted of 8365 individuals (84% of those genotyped).

SNPs with a minor allele frequency of <1% or a call rate of <95% were removed. Furthermore, only SNPs which passed an exact test of Hardy-Weinberg equilibrium (P > 5 × 10^-7) were considered for analysis. Genotypes were subsequently imputed with MACH 1.0.16 Markov Chain Haplotyping software, using CEPH individuals from phase 2 of the HapMap project as a reference set (release 22).
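
As an illustration of the SNP-level filters, the sketch below derives minor allele frequency and call rate from genotype counts and applies the thresholds quoted above. The function names are hypothetical, and the Hardy-Weinberg P value is assumed to have been computed elsewhere (for example by the QC software) rather than in this code.

```python
def snp_summary(n_aa, n_ab, n_bb, n_missing):
    """Return (minor allele frequency, call rate) from genotype counts at one SNP."""
    called = n_aa + n_ab + n_bb
    total = called + n_missing
    call_rate = called / total
    freq_a = (2 * n_aa + n_ab) / (2 * called)
    maf = min(freq_a, 1 - freq_a)
    return maf, call_rate


def passes_snp_qc(maf, call_rate, hwe_p):
    """Keep SNPs with MAF >= 1%, call rate >= 95% and HWE exact-test P > 5e-7."""
    return maf >= 0.01 and call_rate >= 0.95 and hwe_p > 5e-7


maf, call_rate = snp_summary(n_aa=90, n_ab=10, n_bb=0, n_missing=0)
print(passes_snp_qc(maf, call_rate, hwe_p=0.5))
```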

Associated publication:
- https://doi.org/10.1093/hmg/ddr309

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gi_hapmap2_g1_2022-12-07_f6
name: Genome-wide - HapMap2 imputed - G1 version 2022-12-07 freeze 6
description: >-
  Freeze 6 of the 2022-12-07 version of Genome-wide array data imputed to the HapMap2 reference panel for G1 individuals.
  In PLINK standard format; see https://www.cog-genomics.org/plink/1.9/formats for further information.
freeze_size: 5G
linker_file_md5sum: 45415c7d4fae355b4fb2d6ccd042620d
woc_file_md5sum: 6c887db8c7dd10cc695630ca73b41405
all_individuals_to_exclude_md5sum: e4efce63f9f671548d08c8bb2f9cc4f7
git_tag: https://github.com/alspac/dataset_gi_hapmap2_g1/releases/tag/freeze6
is_current_freeze: true
freeze_number: 6
freeze_date: 2025-09-30
previous_freeze: alspacdcs:gi_hapmap2_g1_2022-12-07_f5
freeze_of_alspac_dataset_version: alspacdcs:gi_hapmap2_g1_2022-12-07
freeze_of_named_alspac_dataset: alspacdcs:gi_hapmap2_g1

contains:
- data
files: []
data:
  contains:
  - freeze_id.bed
  - freeze_id.bim
  - freeze_id.fam
  - freeze_id.log
  files:
  - id: alspacdcs:51e0b3de-dcce-456f-a99b-76752e42cbc7
    name: freeze_id.bed
    md5sum: 4362bfc2985fe02e84950530668c379d
    filesize: 4.9GB
    filetype: .bed
    belongs_to: data
  - id: alspacdcs:fa76edfd-d194-49ef-b9b7-8a9f922582e2
    name: freeze_id.bim
    md5sum: da64bd173633ec7198b8c9b7f61fabca
    filesize: 67.6MB
    filetype: .bim
    number_of_variants: 2543887
    belongs_to: data
  - id: alspacdcs:a7e7e68b-7ff1-4287-a4dc-2437ddee77e4
    name: freeze_id.fam
    md5sum: fd2ed9d93ab7c69f6bfef1927ae0feaa
    filesize: 273.0KB
    filetype: .fam
    number_of_participants: 8222
    belongs_to: data
  - id: alspacdcs:476a079c-a8c6-446c-a84a-af79610ee216
    name: freeze_id.log
    md5sum: b291cf31d5cfd1f44eb6217819c9fa20
    filesize: 941.0B
    filetype: .log
    belongs_to: data
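
Because a PLINK .bim file holds one line per variant and a .fam file one line per sample, the counts listed above can be compared with a delivered file by counting lines. A minimal sketch follows; the helper names and local file paths are hypothetical.

```python
def count_records(path):
    """Count non-empty lines: one per variant in a .bim file, one per sample in a .fam file."""
    with open(path, "r", encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def check_against_catalogue(path, expected):
    """Return (observed count, whether it equals the catalogued count)."""
    observed = count_records(path)
    return observed, observed == expected


if __name__ == "__main__":
    import os

    # Hypothetical local copies; expected values are taken from the freeze document above.
    for name, expected in [("freeze_id.bim", 2543887), ("freeze_id.fam", 8222)]:
        if os.path.exists(name):
            print(name, check_against_catalogue(name, expected))
```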


Genome-wide - HapMap2 imputed - G0 mothers (gi_hapmap2_g0m)

Description

This dataset contains genotype data imputed to HapMap 2 for G0 mothers.
Reference genome build: GRCh36

Methodology

A total of 10 015 women (mothers from the ALSPAC cohort) were genotyped using the Illumina 660 quad SNP chip which contains 557 124 SNP markers. Markers with minor allele frequency < 1%, SNPs with >5% missing genotypes and any markers that failed an exact test of Hardy-Weinberg equilibrium (P < 1 × 10^-6) were excluded from further analyses. Genome-wide identity by state sharing was calculated for each pair of individuals in the cohort to identify cryptic relatedness.

In order to identify individuals who might have ancestries other than Western European, we merged data from both cohorts with the 60 western European (CEU) founders, 60 Nigerian (YRI) founders and 90 Japanese (JPT) and Han Chinese (CHB) individuals from the International HapMap Project. Genome-wide IBS distances for each pair of individuals were calculated on markers shared between the HapMap and the Illumina 660K SNP chip, and then the multidimensional scaling option in R was used to generate a two-dimensional plot based upon individuals’ scores on the first two principal coordinates from this analysis. Samples that did not cluster with the CEU individuals were excluded from subsequent analyses. In addition, we plotted the proportion of missing data for each individual against their genome-wide heterozygosity. Any individual who did not cluster with the others was removed from further analyses. Samples were also excluded from analyses in the case of excessive missingness (>5%) or unusual genome-wide or X chromosome heterozygosity, and one individual was excluded from each pair of putatively related individuals (genome-wide IBD >10%). After data cleaning, 8340 individuals and 526688 SNPs were left in the genome-wide data set.
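
To make the pairwise IBS distance concrete, the sketch below computes it for two individuals whose genotypes are coded as counts of one allele (0, 1 or 2), with `None` for a missing call. The coding and the function name are assumptions made for illustration; the source does not state how the distances were computed, and the multidimensional scaling step that follows is not reproduced here.

```python
def ibs_distance(genotypes_a, genotypes_b):
    """Return 1 minus the mean proportion of alleles shared identical-by-state.

    Only markers typed in both individuals contribute to the average.
    """
    shared = 0
    compared = 0
    for a, b in zip(genotypes_a, genotypes_b):
        if a is None or b is None:
            continue  # skip markers missing in either individual
        shared += 2 - abs(a - b)  # 2, 1 or 0 alleles shared at this marker
        compared += 1
    if compared == 0:
        raise ValueError("no markers typed in both individuals")
    return 1.0 - shared / (2 * compared)


print(ibs_distance([0, 1, 2, None], [0, 1, 1, 2]))
```

A full matrix of such distances over all pairs is what a classical multidimensional scaling routine would take as input.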

We then conducted imputation using the MACH Markov Chain Haplotyping software with CEU individuals from phase 2 of the HapMap project as a reference set (release 22). The final imputed data set consisted of 8340 individuals, each with 2 594 390 imputed markers. Only imputed genotypes with minor allele frequencies ≥1% and R-sqr ≥0.3 were considered for association. Of these 8340 with genetic data, 2874 mothers also had phenotype data available.

Associated publication:
- https://doi.org/10.1093/hmg/ddt239

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gi_hapmap2_g0m_2022-12-07_f6
name: Genome-wide - HapMap2 imputed - G0 mothers version 2022-12-07 freeze 6
description: >-
  Version 2022-12-07 freeze 6 of Genome-wide array data imputed to the HapMap2 reference panel for G0 mothers.
  The number of variants & individuals within each plink file set can be viewed within the log file.
freeze_size: 5G
linker_file_md5sum: 45415c7d4fae355b4fb2d6ccd042620d
woc_file_md5sum: 6c887db8c7dd10cc695630ca73b41405
all_individuals_to_exclude_md5sum: e4efce63f9f671548d08c8bb2f9cc4f7
git_tag: https://github.com/alspac/dataset_gi_hapmap2_g0m/releases/tag/freeze6
is_current_freeze: true
freeze_number: 6
freeze_date: 2025-09-30
previous_freeze: alspacdcs:gi_hapmap2_g0m_2022-12-07_f5
freeze_of_alspac_dataset_version: alspacdcs:gi_hapmap2_g0m_2022-12-07
freeze_of_named_alspac_dataset: alspacdcs:gi_hapmap2_g0m

contains:
- plink

files: []
plink:
  contains:
  - freeze_id_chr21.bed
  - freeze_id_chr22.bed
  - freeze_id_chr20.bed
  - freeze_id_chr18.bed
  - freeze_id_chr19.bed
  - freeze_id_chr14.bed
  - freeze_id_chr17.bed
  - freeze_id_chr16.bed
  - freeze_id_chr12.bed
  - freeze_id_chr15.bed
  - freeze_id_chr13.bed
  - freeze_id_chr4.bed
  - freeze_id_chr9.bed
  - freeze_id_chr6.bed
  - freeze_id_chr11.bed
  - freeze_id_chr3.bed
  - freeze_id_chr7.bed
  - freeze_id_chr10.bed
  - freeze_id_chr1.bed
  - freeze_id_chr8.bed
  - freeze_id_chr2.bed
  - freeze_id_chr5.bed
  - freeze_id_chr21.bim
  - freeze_id_chr14.bim
  - freeze_id_chr19.bim
  - freeze_id_chr5.bim
  - freeze_id_chr18.bim
  - freeze_id_chr15.bim
  - freeze_id_chr7.bim
  - freeze_id_chr16.bim
  - freeze_id_chr20.bim
  - freeze_id_chr11.bim
  - freeze_id_chr4.bim
  - freeze_id_chr10.bim
  - freeze_id_chr17.bim
  - freeze_id_chr2.bim
  - freeze_id_chr1.bim
  - freeze_id_chr12.bim
  - freeze_id_chr6.bim
  - freeze_id_chr9.bim
  - freeze_id_chr22.bim
  - freeze_id_chr3.bim
  - freeze_id_chr8.bim
  - freeze_id_chr13.bim
  - freeze_id_chr20.fam
  - freeze_id_chr16.fam
  - freeze_id_chr17.fam
  - freeze_id_chr6.fam
  - freeze_id_chr12.fam
  - freeze_id_chr4.fam
  - freeze_id_chr14.fam
  - freeze_id_chr15.fam
  - freeze_id_chr7.fam
  - freeze_id_chr2.fam
  - freeze_id_chr11.fam
  - freeze_id_chr13.fam
  - freeze_id_chr21.fam
  - freeze_id_chr1.fam
  - freeze_id_chr8.fam
  - freeze_id_chr18.fam
  - freeze_id_chr3.fam
  - freeze_id_chr9.fam
  - freeze_id_chr22.fam
  - freeze_id_chr10.fam
  - freeze_id_chr5.fam
  - freeze_id_chr19.fam
  - freeze_id_chr20.log
  - freeze_id_chr5.log
  - freeze_id_chr6.log
  - freeze_id_chr4.log
  - freeze_id_chr2.log
  - freeze_id_chr18.log
  - freeze_id_chr10.log
  - freeze_id_chr12.log
  - freeze_id_chr15.log
  - freeze_id_chr17.log
  - freeze_id_chr3.log
  - freeze_id_chr9.log
  - freeze_id_chr1.log
  - freeze_id_chr14.log
  - freeze_id_chr7.log
  - freeze_id_chr22.log
  - freeze_id_chr16.log
  - freeze_id_chr8.log
  - freeze_id_chr19.log
  - freeze_id_chr13.log
  - freeze_id_chr21.log
  - freeze_id_chr11.log
  files:
  - id: alspacdcs:4d0de059-4893-4c69-9fa5-43f78961fe14
    name: freeze_id_chr21.bed
    md5sum: 13165e1c9a27aa42853429b0246a1ed5
    filesize: 65.6MB
    filetype: .bed
    belongs_to: plink
  - id: alspacdcs:ad0f93f0-f7b3-4500-a73c-a26fd706680f
    name: freeze_id_chr22.bed
    md5sum: 5abcf552c585152ed0ee11754f3e7833
    filesize: 65.5MB
    filetype: .bed
    belongs_to: plink
  - id: alspacdcs:7c2cd7f8-5513-4c49-88be-06d7c251f381
    name: freeze_id_chr20.bed
    md5sum: 2af011bb98d6b8a8b00b7d938700fdac
    filesize: 122.8MB
    filetype: .bed
    belongs_to: plink
  - id: alspacdcs:1dd436e8-5f33-4c9a-832c-74c3cb343f3f
    name: freeze_id_chr18.bed
    md5sum: 6b46a8d2993dae303334b9a51b50b92c
    filesize: 148.7MB
    filetype: .bed
    belongs_to: plink
  - id: alspacdcs:c76b53ac-facb-4a38-b114-a15a2bbad2a5
    name: freeze_id_chr19.bed
    md5sum: 801ccb3bb64dddaabfc2b7a4a1e4c5b0
    filesize: 71.7MB
    filetype: .bed
    belongs_to: plink
  - id: alspacdcs:27a29455-1a51-42da-869a-b84c9d1f5575
    name: freeze_id_chr14.bed
    md5sum: a41f9803ec71a0dcdf137806b21ba2e6
    filesize: 162.5MB
    filetype: .bed
    belongs_to: plink
  - id: alspacdcs:0676b576-8d37-4879-873f-82c926d3db10
    name: freeze_id_chr17.bed
    md5sum: c6d54ed5ac68f2e0bd806b6124463ee4
    filesize: 113.2MB
    filetype: .bed
    belongs_to: plink
  - id: alspacdcs:94c3ca16-2032-4cb3-b723-bea46b76e195
    name: freeze_id_chr16.bed
    md5sum: b04eb2e4e66fef7ee7d48cb666d78c38
    filesize: 138.5MB
    filetype: .bed
    belongs_to: plink
  - id: alspacdcs:79c2a62c-2b42-4cc7-81f2-1dbaa7ebcf2b
    name: freeze_id_chr12.bed
    md5sum: 367f44ccd183c47334cfc7cb8333628a
    filesize: 241.7MB
    filetype: .bed
    belongs_to: plink
  - id: alspacdcs:6e50d58c-be8e-4371-89fe-6ba9e9c764c3
    name: freeze_id_chr15.bed
    md5sum: 611159bc9c4500de559615d0a7c549f2
    filesize: 140.0MB
    filetype: .bed
    belongs_to: plink
  - id: alspacdcs:fe6ad8e2-6509-4194-a22f-bb85550bcbce
    name: freeze_id_chr13.bed
    md5sum: 0e99cf077012880a802dc36ce72142c1
    filesize: 201.6MB
    filetype: .bed
    belongs_to: plink
  - id: alspacdcs:ed39eb11-b13b-4498-9c64-a85f01d6e6e9
    name: freeze_id_chr4.bed
    md5sum: 147fee33c621f644dad5a2d8ee86fc1d
    filesize: 315.9MB
    filetype: .bed
    belongs_to: plink
  - id: alspacdcs:c4f65aef-2670-4c92-bd14-e116317599e0
    name: freeze_id_chr9.bed
    md5sum: 58ff215f0652257867e42f567ff1c2be
    filesize: 236.4MB
    filetype: .bed
    belongs_to: plink
  - id: alspacdcs:0f814664-6480-4e8d-9714-c5f6d14d9e99
    name: freeze_id_chr6.bed
    md5sum: 953f9c82981d59d25dabe44ba5718b29
    filesize: 353.1MB
    filetype: .bed
    belongs_to: plink
  - id: alspacdcs:d887041f-c254-41ca-bec4-e1077f661182
    name: freeze_id_chr11.bed
    md5sum: 3c89898ce9fc0445c566ea0c060fb9db
    filesize: 251.8MB
    filetype: .bed
    belongs_to: plink
  - id: alspacdcs:34f61935-0af5-45bc-84ae-9d6d47230e4b
    name: freeze_id_chr3.bed
    md5sum: 609847ca0489b7a97725ec275f8337d2
    filesize: 337.5MB
    filetype: .bed
    belongs_to: plink
  - id: alspacdcs:6461c8cb-58e4-4117-a6c5-217f4a05731a
    name: freeze_id_chr7.bed
    md5sum: fb9e8aaf4ae7c3fc75233248ec9d03b0
    filesize: 277.3MB
    filetype: .bed
    belongs_to: plink
  - id: alspacdcs:ecf053b3-2e66-4f55-ba41-c14dade20283
    name: freeze_id_chr10.bed
    md5sum: 4606d4a5a008927b6ab051461218094a
    filesize: 267.9MB
    filetype: .bed
    belongs_to: plink
  - id: alspacdcs:138530d5-810e-4c0f-bf28-8e4527e75994
    name: freeze_id_chr1.bed
    md5sum: 01f7205ea4b6e852c0e8feb72a2cb9cd
    filesize: 374.7MB
    filetype: .bed
    belongs_to: plink
  - id: alspacdcs:3f74efb1-d0fa-4d63-be04-cc952f0f8fe7
    name: freeze_id_chr8.bed
    md5sum: de34e8ef57e4c08991e4778401adf861
    filesize: 285.5MB
    filetype: .bed
    belongs_to: plink
  - id: alspacdcs:cfd357f9-898a-4c61-bfb1-8622eee6b3c4
    name: freeze_id_chr2.bed
    md5sum: 494713bafedd17c3be4e782f7881dcc0
    filesize: 427.5MB
    filetype: .bed
    belongs_to: plink
  - id: alspacdcs:bba780dc-a454-428e-adcd-4d573c91dac3
    name: freeze_id_chr5.bed
    md5sum: a3a47a8ea90e0fa39d5c203436b6d982
    filesize: 325.5MB
    filetype: .bed
    belongs_to: plink
  - id: alspacdcs:0d071024-e6e0-42f6-bb30-c46b8e10878e
    name: freeze_id_chr21.bim
    md5sum: c1f6f2181c49172608ac79e18425e4f4
    filesize: 924.7KB
    filetype: .bim
    number_of_variants: 33863
    belongs_to: plink
  - id: alspacdcs:99819ab1-c952-473f-b867-4ad5ea1602bc
    name: freeze_id_chr14.bim
    md5sum: 4a933818aaea48201f455ebd07ea1b78
    filesize: 2.3MB
    filetype: .bim
    number_of_variants: 83936
    belongs_to: plink
  - id: alspacdcs:524d05bb-8de8-404a-96c8-51ec20e24d4e
    name: freeze_id_chr19.bim
    md5sum: c6fce7e15e198304f752ccbce66299b9
    filesize: 1012.3KB
    filetype: .bim
    number_of_variants: 37045
    belongs_to: plink
  - id: alspacdcs:223d49ae-f1ef-40c5-b75f-d2507f94523d
    name: freeze_id_chr5.bim
    md5sum: e8f55ef9016bf2f03ee43f08a6c974c3
    filesize: 4.4MB
    filetype: .bim
    number_of_variants: 168144
    belongs_to: plink
  - id: alspacdcs:4c02c893-aec2-46cb-a4c6-0eaeb202dae2
    name: freeze_id_chr18.bim
    md5sum: 9ffd8f006c82701060dff29bf460e8fe
    filesize: 2.1MB
    filetype: .bim
    number_of_variants: 76812
    belongs_to: plink
  - id: alspacdcs:839b79d3-999f-4747-b6af-8046a9a44a30
    name: freeze_id_chr15.bim
    md5sum: 1e1139db4b031ba577b5ac6ae000ce6f
    filesize: 1.9MB
    filetype: .bim
    number_of_variants: 72300
    belongs_to: plink
  - id: alspacdcs:7b7c44cd-7f15-47f0-81b1-84be66ad88d1
    name: freeze_id_chr7.bim
    md5sum: dae38c5168605323dfc584a73f3ce4a1
    filesize: 3.8MB
    filetype: .bim
    number_of_variants: 143232
    belongs_to: plink
  - id: alspacdcs:3ba18f76-ecc8-41d1-ae4b-544df2519703
    name: freeze_id_chr16.bim
    md5sum: 8bd9cb45256b6b5ca37ce66eec810035
    filesize: 1.9MB
    filetype: .bim
    number_of_variants: 71550
    belongs_to: plink
  - id: alspacdcs:fa231004-46b8-413c-96b8-91dd0b5644c0
    name: freeze_id_chr20.bim
    md5sum: 6e0b2d6cd06cc6e36f9cbc3f8df0a169
    filesize: 1.7MB
    filetype: .bim
    number_of_variants: 63408
    belongs_to: plink
  - id: alspacdcs:8babfa2c-b0f7-442d-99f2-f25928c8def7
    name: freeze_id_chr11.bim
    md5sum: 703ecef520ce7363c24e9600b363570f
    filesize: 3.5MB
    filetype: .bim
    number_of_variants: 130069
    belongs_to: plink
  - id: alspacdcs:eb0a5025-f35c-460a-9384-0228c559483e
    name: freeze_id_chr4.bim
    md5sum: 54a244447b1345636690b252215bfd2d
    filesize: 4.3MB
    filetype: .bim
    number_of_variants: 163157
    belongs_to: plink
  - id: alspacdcs:5e0dae3e-bcc0-46ad-b09e-f04e324fb888
    name: freeze_id_chr10.bim
    md5sum: 3c259904c7da548d25c86a4a36e96285
    filesize: 3.8MB
    filetype: .bim
    number_of_variants: 138402
    belongs_to: plink
  - id: alspacdcs:90635b2b-b142-48af-9e6d-18d72cfc0634
    name: freeze_id_chr17.bim
    md5sum: 0dc0770759f9edccec7ce305e07b57d4
    filesize: 1.6MB
    filetype: .bim
    number_of_variants: 58455
    belongs_to: plink
  - id: alspacdcs:205cf2e2-d9a3-43c9-bdac-ccd1d055e456
    name: freeze_id_chr2.bim
    md5sum: 275cefa559489b51bebbc65657a91822
    filesize: 5.9MB
    filetype: .bim
    number_of_variants: 220833
    belongs_to: plink
  - id: alspacdcs:2824e3a5-b354-46f9-8d93-617e1e13b935
    name: freeze_id_chr1.bim
    md5sum: 44795681691b62d1921ad8855fd11a09
    filesize: 5.1MB
    filetype: .bim
    number_of_variants: 193554
    belongs_to: plink
  - id: alspacdcs:22b98431-8ff4-4727-990b-2af616a6073a
    name: freeze_id_chr12.bim
    md5sum: 515a46f735c531163377d114549042b5
    filesize: 3.4MB
    filetype: .bim
    number_of_variants: 124860
    belongs_to: plink
  - id: alspacdcs:f088a831-e7d8-4924-a07c-25ee38c02dab
    name: freeze_id_chr6.bim
    md5sum: 3fd4e793a35c5e935454efc1105be192
    filesize: 4.8MB
    filetype: .bim
    number_of_variants: 182381
    belongs_to: plink
  - id: alspacdcs:30e15b56-e8e9-40ff-aaf2-7e13e0b6e966
    name: freeze_id_chr9.bim
    md5sum: 1e828e0f36c2d168ce6c1df5887a764b
    filesize: 3.2MB
    filetype: .bim
    number_of_variants: 122112
    belongs_to: plink
  - id: alspacdcs:6495275d-0674-48af-8b89-96cd2a4cac31
    name: freeze_id_chr22.bim
    md5sum: 86a1da3366ba87e62f561dc09f64f9ac
    filesize: 920.9KB
    filetype: .bim
    number_of_variants: 33815
    belongs_to: plink
  - id: alspacdcs:add2c16b-6453-404c-a3b2-2f12165323a7
    name: freeze_id_chr3.bim
    md5sum: 96d147406f1f24697b0cb9af0c7091fc
    filesize: 4.6MB
    filetype: .bim
    number_of_variants: 174356
    belongs_to: plink
  - id: alspacdcs:cb453287-93df-430f-82b4-1274c53b8f1b
    name: freeze_id_chr8.bim
    md5sum: 6243ef376ee6cbe643bec69201bec604
    filesize: 3.9MB
    filetype: .bim
    number_of_variants: 147483
    belongs_to: plink
  - id: alspacdcs:e6547c62-7904-4ef8-915b-794fc791728a
    name: freeze_id_chr13.bim
    md5sum: cd1b7c80977fb5a0bbd87bc83dd85aed
    filesize: 2.8MB
    filetype: .bim
    number_of_variants: 104120
    belongs_to: plink
  - id: alspacdcs:082730f3-2524-4bd6-a513-9519176ff930
    name: freeze_id_chr20.fam
    md5sum: 02a0b436dddcc4646d6fd0fb2ac3f591
    filesize: 277.5KB
    filetype: .fam
    number_of_participants: 8118
    belongs_to: plink
  - id: alspacdcs:487e2beb-1c64-45a5-8c59-3ca980045a18
    name: freeze_id_chr16.fam
    md5sum: 02a0b436dddcc4646d6fd0fb2ac3f591
    filesize: 277.5KB
    filetype: .fam
    number_of_participants: 8118
    belongs_to: plink
  - id: alspacdcs:eb1a7726-3011-4dec-83cc-609c55b87c70
    name: freeze_id_chr17.fam
    md5sum: 02a0b436dddcc4646d6fd0fb2ac3f591
    filesize: 277.5KB
    filetype: .fam
    number_of_participants: 8118
    belongs_to: plink
  - id: alspacdcs:af7bcd06-b0c6-4b22-bfec-99d963ab0ad9
    name: freeze_id_chr6.fam
    md5sum: 02a0b436dddcc4646d6fd0fb2ac3f591
    filesize: 277.5KB
    filetype: .fam
    number_of_participants: 8118
    belongs_to: plink
  - id: alspacdcs:86767f39-0f41-4260-a643-9ba9ddf3347b
    name: freeze_id_chr12.fam
    md5sum: 02a0b436dddcc4646d6fd0fb2ac3f591
    filesize: 277.5KB
    filetype: .fam
    number_of_participants: 8118
    belongs_to: plink
  - id: alspacdcs:a58901cc-5262-47e7-ab84-09044aad73df
    name: freeze_id_chr4.fam
    md5sum: 02a0b436dddcc4646d6fd0fb2ac3f591
    filesize: 277.5KB
    filetype: .fam
    number_of_participants: 8118
    belongs_to: plink
  - id: alspacdcs:ed5bc02f-d28f-4cf0-8d10-55571afd99c2
    name: freeze_id_chr14.fam
    md5sum: 02a0b436dddcc4646d6fd0fb2ac3f591
    filesize: 277.5KB
    filetype: .fam
    number_of_participants: 8118
    belongs_to: plink
  - id: alspacdcs:e2da1046-3a96-43d2-902b-0a2f0ee3ed48
    name: freeze_id_chr15.fam
    md5sum: 02a0b436dddcc4646d6fd0fb2ac3f591
    filesize: 277.5KB
    filetype: .fam
    number_of_participants: 8118
    belongs_to: plink
  - id: alspacdcs:48ceeaba-3116-4183-a42b-5e9a9bdc08fa
    name: freeze_id_chr7.fam
    md5sum: 02a0b436dddcc4646d6fd0fb2ac3f591
    filesize: 277.5KB
    filetype: .fam
    number_of_participants: 8118
    belongs_to: plink
  - id: alspacdcs:4a18d6e0-e531-47de-96ee-40724b3c8a09
    name: freeze_id_chr2.fam
    md5sum: 02a0b436dddcc4646d6fd0fb2ac3f591
    filesize: 277.5KB
    filetype: .fam
    number_of_participants: 8118
    belongs_to: plink
  - id: alspacdcs:cc02ab10-5505-4dd8-b754-4c5b126a3125
    name: freeze_id_chr11.fam
    md5sum: 02a0b436dddcc4646d6fd0fb2ac3f591
    filesize: 277.5KB
    filetype: .fam
    number_of_participants: 8118
    belongs_to: plink
  - id: alspacdcs:4fb39ed6-6bc7-4f8f-8e44-750f84f89269
    name: freeze_id_chr13.fam
    md5sum: 02a0b436dddcc4646d6fd0fb2ac3f591
    filesize: 277.5KB
    filetype: .fam
    number_of_participants: 8118
    belongs_to: plink
  - id: alspacdcs:feb5710a-76e2-452d-955a-bda1d5c5cf26
    name: freeze_id_chr21.fam
    md5sum: 02a0b436dddcc4646d6fd0fb2ac3f591
    filesize: 277.5KB
    filetype: .fam
    number_of_participants: 8118
    belongs_to: plink
  - id: alspacdcs:a9dd0bc9-3310-4ed5-afd4-948cbf300d97
    name: freeze_id_chr1.fam
    md5sum: 02a0b436dddcc4646d6fd0fb2ac3f591
    filesize: 277.5KB
    filetype: .fam
    number_of_participants: 8118
    belongs_to: plink
  - id: alspacdcs:de7be791-f12e-4513-8f52-0adb6d82b44c
    name: freeze_id_chr8.fam
    md5sum: 02a0b436dddcc4646d6fd0fb2ac3f591
    filesize: 277.5KB
    filetype: .fam
    number_of_participants: 8118
    belongs_to: plink
  - id: alspacdcs:5490fdc5-b116-476c-a5f9-0fb8e288293d
    name: freeze_id_chr18.fam
    md5sum: 02a0b436dddcc4646d6fd0fb2ac3f591
    filesize: 277.5KB
    filetype: .fam
    number_of_participants: 8118
    belongs_to: plink
  - id: alspacdcs:873a1134-d308-471f-ae4d-629ffbf27d01
    name: freeze_id_chr3.fam
    md5sum: 02a0b436dddcc4646d6fd0fb2ac3f591
    filesize: 277.5KB
    filetype: .fam
    number_of_participants: 8118
    belongs_to: plink
  - id: alspacdcs:619e79bc-51be-4a8e-b576-6f1090995b7e
    name: freeze_id_chr9.fam
    md5sum: 02a0b436dddcc4646d6fd0fb2ac3f591
    filesize: 277.5KB
    filetype: .fam
    number_of_participants: 8118
    belongs_to: plink
  - id: alspacdcs:d9db5a85-5691-4f7c-9ffe-ff20318e2d6c
    name: freeze_id_chr22.fam
    md5sum: 02a0b436dddcc4646d6fd0fb2ac3f591
    filesize: 277.5KB
    filetype: .fam
    number_of_participants: 8118
    belongs_to: plink
  - id: alspacdcs:5b8b7ca5-6be1-4cf1-b2c1-f3ba32d5c3c0
    name: freeze_id_chr10.fam
    md5sum: 02a0b436dddcc4646d6fd0fb2ac3f591
    filesize: 277.5KB
    filetype: .fam
    number_of_participants: 8118
    belongs_to: plink
  - id: alspacdcs:748e1050-e5ad-49cc-b2c9-e3d615c463c6
    name: freeze_id_chr5.fam
    md5sum: 02a0b436dddcc4646d6fd0fb2ac3f591
    filesize: 277.5KB
    filetype: .fam
    number_of_participants: 8118
    belongs_to: plink
  - id: alspacdcs:4adc17ab-eec6-4270-ad85-d31dc80fb3f7
    name: freeze_id_chr19.fam
    md5sum: 02a0b436dddcc4646d6fd0fb2ac3f591
    filesize: 277.5KB
    filetype: .fam
    number_of_participants: 8118
    belongs_to: plink
  - id: alspacdcs:acc95b26-02b3-47a0-9445-36f175e792b0
    name: freeze_id_chr20.log
    md5sum: 39173b45309913c6ccc1cd639081f198
    filesize: 975.0B
    filetype: .log
    belongs_to: plink
  - id: alspacdcs:4cc54de9-97be-4dcb-bde5-48205f1f6cff
    name: freeze_id_chr5.log
    md5sum: c676952aec770492e37516fb583043a1
    filesize: 971.0B
    filetype: .log
    belongs_to: plink
  - id: alspacdcs:39e7375f-d35c-4b93-80f6-fc9cc0c43ec5
    name: freeze_id_chr6.log
    md5sum: f1a481559a61558066b4a5f82a54b261
    filesize: 971.0B
    filetype: .log
    belongs_to: plink
  - id: alspacdcs:d63190d2-6602-4a17-836e-b787de1a3ad3
    name: freeze_id_chr4.log
    md5sum: b944fbfdcb2d6e4578398d6a75cea4eb
    filesize: 971.0B
    filetype: .log
    belongs_to: plink
  - id: alspacdcs:4eb9745a-7faa-4370-95a7-3627edefd059
    name: freeze_id_chr2.log
    md5sum: 58c0ae51d3d950091908b26d9fcbf662
    filesize: 971.0B
    filetype: .log
    belongs_to: plink
  - id: alspacdcs:6a2c28f5-7c46-4738-b348-a14c8aa09088
    name: freeze_id_chr18.log
    md5sum: c85719b99d729238125cef5af686af40
    filesize: 975.0B
    filetype: .log
    belongs_to: plink
  - id: alspacdcs:fab8e72c-5d9e-4391-85a5-0705a5ba1931
    name: freeze_id_chr10.log
    md5sum: 9944731b8939bb29ec0f058fb85fc8a6
    filesize: 977.0B
    filetype: .log
    belongs_to: plink
  - id: alspacdcs:37d48538-54f2-4006-9010-b55dabdffa0c
    name: freeze_id_chr12.log
    md5sum: 6a92b280ef1cb18e0cca60b34b77cf8c
    filesize: 977.0B
    filetype: .log
    belongs_to: plink
  - id: alspacdcs:891848f8-5c96-4433-bff8-0690191ccf78
    name: freeze_id_chr15.log
    md5sum: f111c4fde2aef6c4d6549b509b07a8eb
    filesize: 975.0B
    filetype: .log
    belongs_to: plink
  - id: alspacdcs:f59eb5fe-0712-4324-9dba-55fcfdd8995d
    name: freeze_id_chr17.log
    md5sum: 2477b8d24bc302ba49a95c89b5894560
    filesize: 975.0B
    filetype: .log
    belongs_to: plink
  - id: alspacdcs:05923457-4385-4adf-bc5b-699393117763
    name: freeze_id_chr3.log
    md5sum: d46fc3e1a1fbbcf6ee7e5aca5b3913e6
    filesize: 971.0B
    filetype: .log
    belongs_to: plink
  - id: alspacdcs:1e37f74e-932b-4cdd-a6a8-2097c7c9d343
    name: freeze_id_chr9.log
    md5sum: c334d5c2e55192550c082508c43f0b8c
    filesize: 971.0B
    filetype: .log
    belongs_to: plink
  - id: alspacdcs:d4c5a687-74e9-4714-aae4-40d7354ac513
    name: freeze_id_chr1.log
    md5sum: a31745db1e5b091d2b083d188e078b51
    filesize: 971.0B
    filetype: .log
    belongs_to: plink
  - id: alspacdcs:cfd03b67-3c6e-4d45-a46e-799b0c76e795
    name: freeze_id_chr14.log
    md5sum: 1da37f38292b25228306b0c12d0233b7
    filesize: 975.0B
    filetype: .log
    belongs_to: plink
  - id: alspacdcs:53d1a685-75fc-4759-a16e-55e37f4fcee3
    name: freeze_id_chr7.log
    md5sum: ca37be820a48ec6005f80e8298b0d24f
    filesize: 971.0B
    filetype: .log
    belongs_to: plink
  - id: alspacdcs:1cd105ee-dd11-41a4-b044-a87663ed7f9a
    name: freeze_id_chr22.log
    md5sum: a400793a0f02086dd3c2f32a40a42ea5
    filesize: 975.0B
    filetype: .log
    belongs_to: plink
  - id: alspacdcs:4a5875a7-ad8d-4727-a34d-f97236d87e6b
    name: freeze_id_chr16.log
    md5sum: 088cc8fbf1bb1741cf4406b9cca934c5
    filesize: 975.0B
    filetype: .log
    belongs_to: plink
  - id: alspacdcs:d99b8342-76da-40a6-877f-9cc23f6c0cc5
    name: freeze_id_chr8.log
    md5sum: b612c22e9f96a11ffc220257fb1c40a2
    filesize: 971.0B
    filetype: .log
    belongs_to: plink
  - id: alspacdcs:6d04e6c1-cb79-466d-b225-d56cc0633e7c
    name: freeze_id_chr19.log
    md5sum: f1e8728cc5a2ec80a0b31845c5797403
    filesize: 975.0B
    filetype: .log
    belongs_to: plink
  - id: alspacdcs:149473be-439e-4b4e-95ad-6ba89828c0b2
    name: freeze_id_chr13.log
    md5sum: 86280b924bb8789c35d7313eff1d4b83
    filesize: 977.0B
    filetype: .log
    belongs_to: plink
  - id: alspacdcs:eaa04a24-6af6-437a-bfe4-8fefe5ecc7b5
    name: freeze_id_chr21.log
    md5sum: a0e7a40ef5874dd577150a472100520a
    filesize: 975.0B
    filetype: .log
    belongs_to: plink
  - id: alspacdcs:8546ec4d-3642-41a0-bb82-74284d9fefa2
    name: freeze_id_chr11.log
    md5sum: 308e63e5f26651ad4f144d4485b05176
    filesize: 977.0B
    filetype: .log
    belongs_to: plink
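
With one PLINK file set per chromosome, it can be useful to total the variant counts and to confirm that every .fam file describes the same participants. The sketch below works on entries shaped like the file records above; only a two-chromosome excerpt is included, and the `summarise` helper is hypothetical.

```python
# Excerpt of the file records above (chromosomes 21 and 22 only), as Python dictionaries.
files = [
    {"name": "freeze_id_chr21.bim", "filetype": ".bim", "number_of_variants": 33863},
    {"name": "freeze_id_chr22.bim", "filetype": ".bim", "number_of_variants": 33815},
    {"name": "freeze_id_chr21.fam", "filetype": ".fam",
     "md5sum": "02a0b436dddcc4646d6fd0fb2ac3f591", "number_of_participants": 8118},
    {"name": "freeze_id_chr22.fam", "filetype": ".fam",
     "md5sum": "02a0b436dddcc4646d6fd0fb2ac3f591", "number_of_participants": 8118},
]


def summarise(entries):
    """Return the total variant count over .bim entries and the set of distinct .fam md5sums."""
    total_variants = sum(e["number_of_variants"] for e in entries if e["filetype"] == ".bim")
    fam_md5s = {e["md5sum"] for e in entries if e["filetype"] == ".fam"}
    return total_variants, fam_md5s


total, fam_md5s = summarise(files)
print(total, len(fam_md5s) == 1)  # 67678 True
```

A single distinct md5sum across the .fam entries indicates that the sample files are byte-identical between chromosomes.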

Genome-wide - 1000G imputed - G0 partners (gi_1000g_g0p)

Description

This dataset contains genome-wide array data imputed to the 1000 genomes reference panel for G0 partners, with some additional G0 mothers and G1 individuals. These data have been cleaned, flipped to the positive strand, placed in b37 coordinates and imputed to the 1000 genomes phase 1 version 3 reference panel.
Reference genome build: GRCh37

Methodology

3,453 ALSPAC mothers and fathers were genotyped at 535,478 SNPs using the Illumina HumanCoreExome chip genotyping platform by the ALSPAC lab, and genotypes were called using GenomeStudio. The resulting raw genome-wide data were subjected to standard quality control methods using PLINK (v1.07). Individuals were excluded on the basis of gender mismatches (n = 80); minimal or excessive heterozygosity (n = 64); disproportionate levels of individual missingness (>5%, n = 60) and possible contamination (n = 3).

Population stratification was assessed by multidimensional scaling analysis and principal component analysis, with 1000 Genomes phase 3 data used for comparison; all individuals with non-European ancestry were removed (n = 266).

Cryptic relatedness was measured as SNP relatedness in GCTA (relatedness > 0.1, n = 69 removed). SNPs with a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 1E-7) and those which failed GenomeStudio quality control measures were removed (n = 21,298). 6,594 duplicate SNPs were also removed.

This resulted in 2,911 unrelated mothers and fathers genotyped at 507,586 SNPs. We then identified 2217 samples where the ALN assigned historically by the lab matched the genetically assigned ALN.
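
The sample and SNP totals above can be reproduced from the exclusion counts given in the preceding paragraphs. The tally below assumes the exclusion categories do not overlap and that n = 266 refers to individuals removed for non-European ancestry; the category labels are shorthand for this sketch.

```python
genotyped_samples = 3453
sample_exclusions = {
    "gender mismatch": 80,
    "heterozygosity": 64,
    "missingness >5%": 60,
    "possible contamination": 3,
    "non-European ancestry": 266,  # assumed meaning of n = 266
    "cryptic relatedness": 69,
}
remaining_samples = genotyped_samples - sum(sample_exclusions.values())

genotyped_snps = 535478
snp_exclusions = {
    "call rate, HWE or GenomeStudio QC failure": 21298,
    "duplicate SNPs": 6594,
}
remaining_snps = genotyped_snps - sum(snp_exclusions.values())

print(remaining_samples, remaining_snps)  # 2911 507586
```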

We phased the data of 3074 samples that passed QC (a set that still contained related subjects) using SHAPEIT v2.r837. We then removed 155,336 monomorphic SNPs, 1033 markers not in 1000 genomes, 11,842 A/T or G/C SNPs and 10 duplicate sites to give 337,732 SNPs on chromosomes 1-23. Of the 329,363 markers on chromosomes 1-22, 298,742 overlapped the reference genome. We imputed to the 1000 genomes phase 1 version 3 using the Michigan Imputation Server. We then identified 2217 samples where the ALN assigned historically by the lab matched the genetically assigned ALN. We then removed 12 subjects who have withdrawn consent and 6 subjects genotyped in an earlier work package to give 2201 subjects.

1737 putative G0 partner-G1 pairs for whom both G0 partner and G1 have called genotype data available were identified based on ALN. Given the G0 partners were invited by the G0 mother to take part and only enrolled in the study in their own right several years later, it could not be assumed that all G0 partners were biologically related to G1. Called genotype data for the 1720 unique G0 partners and 1737 unique G1s were merged (i.e. there were 17 pairs of siblings/twins among the G1 offspring), using plink v1.90b7.2 64-bit (11 Dec 2023).

After application of the PLINK filters --geno 0.05, --maf 0.01, --snps-only just-acgt and --autosome, the --related command in KING version 2.3.2 was used to perform kinship analysis, which confirmed that all 1737 putative G0 partner-G1 pairs are genetically related. This would be expected for biological father-offspring pairs, using the inference criteria described in Table 1 of Manichaikul, Ani, et al. "Robust relationship inference in genome-wide association studies." Bioinformatics 26.22 (2010): 2867-2873.
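
For orientation, kinship-based relationship inference of this kind bins an estimated kinship coefficient into degrees of relatedness. The sketch below uses cut-offs of the form 2^-(k + 1/2), which is how the KING criteria are commonly summarised; the exact boundaries should be checked against Table 1 of the cited paper, and the function name is hypothetical. It does not separate parent-offspring pairs from full siblings, which additionally requires the proportion of markers sharing zero alleles.

```python
def infer_relationship(kinship):
    """Bin an estimated kinship coefficient into a degree of relatedness.

    Cut-offs are assumed to be 2^-1.5, 2^-2.5, 2^-3.5 and 2^-4.5
    (about 0.354, 0.177, 0.0884 and 0.0442).
    """
    if kinship > 2 ** -1.5:
        return "duplicate or MZ twin"
    if kinship > 2 ** -2.5:
        return "first-degree"
    if kinship > 2 ** -3.5:
        return "second-degree"
    if kinship > 2 ** -4.5:
        return "third-degree"
    return "unrelated"


# A parent-offspring pair has an expected kinship coefficient of 0.25.
print(infer_relationship(0.25))  # first-degree
```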

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gi_1000g_g0p_2016-11-22_f6
name: Genome-wide - 1000G imputed - G0 partners version 2016-11-22 freeze 6
description: >-
  This dataset is the sixth freeze of the 2016-11-22 version of the Genome-wide array data imputed to the 1000 genomes reference panel
  for G0 partners, with some additional G0 mothers and G1 individuals.

freeze_size: 44G
linker_file_md5sum: 45415c7d4fae355b4fb2d6ccd042620d
woc_file_md5sum: 6c887db8c7dd10cc695630ca73b41405
all_individuals_to_exclude_md5sum: e4efce63f9f671548d08c8bb2f9cc4f7
git_tag: https://github.com/alspac/dataset_gi_1000g_g0p/releases/tag/freeze6
is_current_freeze: true
freeze_number: 6
freeze_date: 2025-09-30
previous_freeze: alspacdcs:gi_1000g_g0p_2016-11-22_f5
freeze_of_alspac_dataset_version: alspacdcs:gi_1000g_g0p_2016-11-22
freeze_of_named_alspac_dataset: alspacdcs:gi_1000g_g0p

contains:
- data

files: []
data:
  contains:
  - filtered_data_chr22.bgen
  - filtered_data_chr21.bgen
  - filtered_data_chr20.bgen
  - filtered_data_chr19.bgen
  - filtered_data_chr16.bgen
  - filtered_data_chr18.bgen
  - filtered_data_chr14.bgen
  - filtered_data_chr17.bgen
  - filtered_data_chr15.bgen
  - filtered_data_chr13.bgen
  - filtered_data_chr09.bgen
  - filtered_data_chr12.bgen
  - filtered_data_chr11.bgen
  - filtered_data_chr10.bgen
  - filtered_data_chr07.bgen
  - filtered_data_chr08.bgen
  - filtered_data_chr06.bgen
  - filtered_data_chr04.bgen
  - filtered_data_chr05.bgen
  - filtered_data_chr02.bgen
  - filtered_data_chr03.bgen
  - filtered_data_chr01.bgen
  - swapped.sample
  files:
  - id: alspacdcs:88d6c1fb-637e-45e6-aed7-b3a314f01b43
    name: filtered_data_chr22.bgen
    md5sum: 824412e963441699f260c6245f65659d
    filesize: 721.5MB
    filetype: .bgen
    number_of_variants: 366590
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:4df43797-f090-4a69-b8f0-eb27dde2b726
    name: filtered_data_chr21.bgen
    md5sum: 7881bdc24e7f0adbfb800b49d1efd590
    filesize: 671.1MB
    filetype: .bgen
    number_of_variants: 378064
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:3a69092e-67f8-40e1-8f48-bd8a0ef43806
    name: filtered_data_chr20.bgen
    md5sum: d241eb21be3188c26c460e1f65f0d8c1
    filesize: 1.1GB
    filetype: .bgen
    number_of_variants: 618749
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:2c1484b3-3a79-47c4-957b-eb7dc0a0a343
    name: filtered_data_chr19.bgen
    md5sum: 37ea045cd9f4027cba547b7b89c3a1a0
    filesize: 1.2GB
    filetype: .bgen
    number_of_variants: 606147
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:127c1e60-f254-4f6f-af8b-6a5fc78f1c12
    name: filtered_data_chr16.bgen
    md5sum: 52f065575d3cb2dff34df6763a583766
    filesize: 1.5GB
    filetype: .bgen
    number_of_variants: 867901
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:a4d6a3eb-9f36-4c56-b79a-b48b9bb04772
    name: filtered_data_chr18.bgen
    md5sum: b8e055a6c0955bb67161c9f7a1d8cad7
    filesize: 1.3GB
    filetype: .bgen
    number_of_variants: 783661
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:588abf2e-4186-4fbb-aca3-9e9e8a0d7b33
    name: filtered_data_chr14.bgen
    md5sum: 1ecd96aab2925bafd7d20497d85dd937
    filesize: 1.4GB
    filetype: .bgen
    number_of_variants: 903811
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:a2c4b90b-c8d5-4781-86fc-30abb3dda4ca
    name: filtered_data_chr17.bgen
    md5sum: 73d85caf67dcedc63b11a43bd5ccb44d
    filesize: 1.4GB
    filetype: .bgen
    number_of_variants: 755467
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:1494da4f-c326-4be9-bd9a-1042fb06339d
    name: filtered_data_chr15.bgen
    md5sum: f8c5b54206189808e9a361cc0da63798
    filesize: 1.4GB
    filetype: .bgen
    number_of_variants: 814028
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:f31abcf8-9570-4e8e-a3cb-3f6064a6a362
    name: filtered_data_chr13.bgen
    md5sum: 176a10d38ab80783a8e392e5791edea7
    filesize: 1.5GB
    filetype: .bgen
    number_of_variants: 988473
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:f6fd76fe-aec2-460e-9e16-b131e0a91776
    name: filtered_data_chr09.bgen
    md5sum: 82a480f3e8792db2c1cec3adc50e1357
    filesize: 1.9GB
    filetype: .bgen
    number_of_variants: 1189463
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:b9b4f93b-3ee7-404f-9b24-bd3b6d3b4736
    name: filtered_data_chr12.bgen
    md5sum: 509202db22200fe0bd58210ab8e9c757
    filesize: 2.1GB
    filetype: .bgen
    number_of_variants: 1316510
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:7c8064bb-916e-4e4f-a36b-225234db223b
    name: filtered_data_chr11.bgen
    md5sum: b1b7e3bef0fe72cd90bd0ba456f687aa
    filesize: 2.1GB
    filetype: .bgen
    number_of_variants: 1359640
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:6c6e37e4-78cd-46c1-b784-db9aa41e00ff
    name: filtered_data_chr10.bgen
    md5sum: 8f64fe184e4c876a345a728ed5eeddcf
    filesize: 2.1GB
    filetype: .bgen
    number_of_variants: 1363104
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:1f2be053-4a68-440a-b328-f806d7ab6790
    name: filtered_data_chr07.bgen
    md5sum: f832922558eddcf3feed87091c2ec0ae
    filesize: 2.6GB
    filetype: .bgen
    number_of_variants: 1601293
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:4ab48954-47ad-4138-bc9a-2e2b65df9ec5
    name: filtered_data_chr08.bgen
    md5sum: 47d79712e676a0048f90858cbb888179
    filesize: 2.3GB
    filetype: .bgen
    number_of_variants: 1558902
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:021c15de-7ffd-4ee2-b834-8f9e8411dd04
    name: filtered_data_chr06.bgen
    md5sum: a9327ad1591fdf7d349b066544e71c3a
    filesize: 2.6GB
    filetype: .bgen
    number_of_variants: 1758025
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:12db13fe-2594-45b1-bccd-f6fe916cb6b7
    name: filtered_data_chr04.bgen
    md5sum: 514f09f02c74fc3eca83379e9e99c5dc
    filesize: 3.1GB
    filetype: .bgen
    number_of_variants: 1969883
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:2f9c9b1f-553a-48ab-973e-cad46335845f
    name: filtered_data_chr05.bgen
    md5sum: f4accbf5bdd6a2ccc9598e9e2221915d
    filesize: 2.7GB
    filetype: .bgen
    number_of_variants: 1809961
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:250aba37-8329-430a-9065-37e3fa65494e
    name: filtered_data_chr02.bgen
    md5sum: e297c8d30455053d23ac360bcc886bb0
    filesize: 3.5GB
    filetype: .bgen
    number_of_variants: 2349883
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:9c547458-6640-4bb3-972e-408f607047f7
    name: filtered_data_chr03.bgen
    md5sum: c0b55e9d65c219ffb1b8c58a0ebb7c18
    filesize: 3.0GB
    filetype: .bgen
    number_of_variants: 1969275
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:4acc911f-cc1b-419e-8f16-f178f079229a
    name: filtered_data_chr01.bgen
    md5sum: a5eb049e4df5a8b005ae51b47947d830
    filesize: 3.3GB
    filetype: .bgen
    number_of_variants: 2159337
    number_of_participants: 2198
    belongs_to: data
  - id: alspacdcs:859a623f-b408-4cb5-81b7-4f48da58e7b6
    name: swapped.sample
    md5sum: 1bf22d5d9118fc1479199f108af11138
    filesize: 164.9KB
    filetype: .sample
    number_of_participants: 2198
    belongs_to: data

Genome-wide - 1000G imputed - G0 mothers + G1 (gi_1000g_g0m_g1)

Description

This dataset contains genome-wide 1000G imputed data for G0 mothers + G1. The data have been cleaned, flipped to the positive strand, placed in b37 coordinates and imputed to the 1000 genomes phase 1 version 3 reference panel.
Reference genome build: GRCh37

Methodology

ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platforms by 23andme subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%) and insufficient sample replication (IBD < 0.8).

Population stratification was assessed by multidimensional scaling analysis and compared with Hapmap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed.

SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed. 9,115 subjects and 500,527 SNPs passed these quality control filters.
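
The SNP-level thresholds quoted above can be written as a simple predicate. The sketch below is illustrative only: the function name, argument names and example values are hypothetical, and it is not the code that was used for the original quality control.

```python
def snp_passes_child_qc(maf: float, call_rate: float, hwe_p: float) -> bool:
    """Return True if a SNP survives the thresholds quoted for the children's array data.

    Per the text, a SNP is removed when MAF < 1%, call rate < 95%
    or the Hardy-Weinberg P value is < 5E-7.
    """
    if maf < 0.01:
        return False
    if call_rate < 0.95:
        return False
    if hwe_p < 5e-7:
        return False
    return True


# Hypothetical per-SNP summary statistics: (MAF, call rate, HWE P value).
examples = {
    "common_well_called": (0.25, 0.99, 0.40),
    "rare": (0.005, 0.99, 0.40),
    "poorly_called": (0.25, 0.90, 0.40),
    "hwe_violation": (0.25, 0.99, 1e-9),
}
kept = [name for name, stats in examples.items() if snp_passes_child_qc(*stats)]
print(kept)
```

Values that sit exactly on a threshold are retained here, because the text describes removal with strict "less than" comparisons.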

ALSPAC mothers were genotyped using the Illumina human660W-quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs. SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally SNPs with a minor allele frequency of less than 1% were removed.

Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded.

9,048 subjects and 526,688 SNPs passed these quality control filters.

We combined 477,482 SNP genotypes in common between the sample of mothers and sample of children. We removed SNPs with genotype missingness above 1% due to poor quality (11,396 SNPs removed) and removed a further 321 subjects due to potential ID mismatches. This resulted in a dataset of 17,842 subjects containing 6,305 duos and 465,740 SNPs (112 were removed during liftover and 234 were out of HWE after combination). We estimated haplotypes using ShapeIT (v2.r644), which utilises relatedness during phasing. We obtained a phased version of the 1000 genomes reference panel (Phase 1, Version 3) from the Impute2 reference data repository (phased using ShapeIT v2.r644, haplotype release date Dec 2013). Imputation of the target data was performed using Impute V2.2.2 against the reference panel (all polymorphic SNPs excluding singletons), using all 2186 reference haplotypes (including non-Europeans).
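
The counts in this paragraph and the two QC summaries before it can be reconciled with simple arithmetic. The check below uses only figures quoted in the text. The variable names are illustrative, and the assumption that the 17,842 subjects are the children and mothers passing QC minus the 321 removals is a reading of the text that the numbers happen to support.

```python
# Figures quoted in the Methodology text.
children_after_qc = 9_115
mothers_after_qc = 9_048
subjects_removed_id_mismatch = 321

snps_in_common = 477_482
snps_removed_missingness = 11_396
snps_removed_liftover = 112
snps_removed_hwe = 234

subjects_remaining = children_after_qc + mothers_after_qc - subjects_removed_id_mismatch
snps_remaining = (
    snps_in_common
    - snps_removed_missingness
    - snps_removed_liftover
    - snps_removed_hwe
)

print(subjects_remaining)  # 17842
print(snps_remaining)      # 465740
```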

This gave 8,237 eligible children and 8,196 eligible mothers with available genotype data after exclusion of related subjects using cryptic relatedness measures described previously.

Known issues: There is a known strand issue present within this imputation: the Dec 2013 haplotype release of 1000 genomes phase 1 version 3 has 199 reported SNPs with incorrect strand. For more information and the origins of this list please visit https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated_SHAPEIT2_16-06-14.html. It is very unlikely that they have systematic effects across the genome; the problem is most probably isolated to these 199 known problematic SNPs. Users are advised to discard them from their analysis.
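
One way to discard the flagged SNPs is to filter variant identifiers against the published list before analysis. The sketch below assumes the list has already been downloaded and read into a set. The function name and the identifiers are placeholders, not real entries from that list.

```python
from typing import Iterable, List, Set


def drop_flagged_variants(variant_ids: Iterable[str], flagged: Set[str]) -> List[str]:
    """Keep variant IDs, in their original order, that are not in the flagged set."""
    return [vid for vid in variant_ids if vid not in flagged]


# Placeholder identifiers; the real list is published at the URL cited above.
flagged_ids = {"rsFLAGGED_A", "rsFLAGGED_B"}
analysis_ids = ["rsKEEP_1", "rsFLAGGED_A", "rsKEEP_2"]

print(drop_flagged_variants(analysis_ids, flagged_ids))
```

Matching on identifier alone assumes the IDs in the published list and in the data follow the same naming scheme; if they do not, matching on position and alleles may be needed instead.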

The bgen files within the gi_1000g_g0m_g1 dataset have NA in place of the chromosome column. Some tools accept this, while others do not, so users may wish to re-format the dataset (using QCtool or equivalent) before their work.
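
When re-formatting, the chromosome usually has to be supplied per file. The file names in the freeze follow the pattern filtered_NN.bgen, so a minimal sketch, assuming that naming holds, can recover a chromosome label from the name. The function name is hypothetical, and how the label is passed to QCtool or another tool is left to that tool's documentation.

```python
import re
from typing import Optional

# Matches names such as "filtered_01.bgen" or "filtered_22.bgen".
_BGEN_NAME = re.compile(r"^filtered_(\d{2})\.bgen$")


def chromosome_from_filename(filename: str) -> Optional[str]:
    """Return the chromosome label encoded in a freeze file name, or None if absent."""
    match = _BGEN_NAME.match(filename)
    if match is None:
        return None
    # Strip the zero padding: "01" becomes "1".
    return str(int(match.group(1)))


for name in ["filtered_01.bgen", "filtered_22.bgen", "swapped.sample"]:
    print(name, chromosome_from_filename(name))
```

The label is returned exactly as numbered in the file name, so filtered_23.bgen yields "23"; mapping that to another chromosome naming convention is left to the user.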

Allele frequency concordance with other cohorts: When contributing to consortia you may find that the allele frequencies in ALSPAC for a few thousand SNPs are discordant from a reference panel used by the consortium. This is to be expected: when allele frequencies are calculated in two different samples, even from the same population, for many millions of SNPs, a number of SNPs will appear to be highly discordant.
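
The scale of this effect can be illustrated with a back-of-the-envelope calculation. The sketch below uses a normal approximation for the difference in allele frequency between two samples and counts how many SNPs would be flagged by chance at a given P value threshold. The function names, sample sizes, frequencies, SNP count and threshold are all hypothetical values chosen for illustration.

```python
import math


def allele_freq_diff_p_value(p1: float, n1: int, p2: float, n2: int) -> float:
    """Two-sided P value for an allele frequency difference between two samples.

    Uses a normal approximation with 2*n alleles per sample of n diploid
    individuals and a pooled frequency under the null of no true difference.
    Assumes the pooled frequency is strictly between 0 and 1.
    """
    pooled = (p1 * 2 * n1 + p2 * 2 * n2) / (2 * n1 + 2 * n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / (2 * n1) + 1 / (2 * n2)))
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))


def expected_false_flags(n_snps: int, p_threshold: float) -> float:
    """Expected number of SNPs flagged by chance alone at a P value threshold."""
    return n_snps * p_threshold


# Hypothetical comparison: one SNP in two samples of different sizes.
print(allele_freq_diff_p_value(0.30, 17443, 0.32, 5000))

# Hypothetical genome-wide scale: 30 million SNPs tested at P < 1e-4.
print(expected_false_flags(30_000_000, 1e-4))
```

Under these assumed numbers, roughly three thousand SNPs would cross the threshold even if both samples were drawn from the same population.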

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_f6
name: >-
  Genome-wide - 1000G imputed - G0 mothers + G1 version 2015-10-30
  freeze 6
description: >-
  This is the sixth freeze of the 2015-10-30 version of the
  gi_1000g_g0m_g1 dataset. It contains data in the Oxford format
  which is a combination of bgen and sample (version 1.2) files. It is a subset of
  the data in gi_1000g_g0m_g1_2015-10-30 limited to one format and
  with participants who have withdrawn their consent removed.

  The Dec 2013 haplotype release of 1000 genomes phase 1 version 3 has 199 reported SNPs
  with incorrect strand. The strand issues are present in this imputation version. For more 
  information and the origins of this list please visit:
  https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated_SHAPEIT2_16-06-14.html

  It is very unlikely that they have systematic effects across the genome and most 
  probably are just isolated to these 199 known problematic SNPs.

  The user is advised to discard them from their analysis.
freeze_size: 123G
linker_file_md5sum: 45415c7d4fae355b4fb2d6ccd042620d
woc_file_md5sum: 6c887db8c7dd10cc695630ca73b41405
all_individuals_to_exclude_md5sum: e4efce63f9f671548d08c8bb2f9cc4f7
git_tag: https://github.com/alspac/dataset_gi_1000g_g0m_g1/releases/tag/freeze6
is_current_freeze: true
freeze_number: 6
freeze_date: 2025-09-30
previous_freeze: alspacdcs:gi_1000g_g0m_g1_2015-10-30_f5
freeze_of_alspac_dataset_version: alspacdcs:gi_1000g_g0m_g1_2015-10-30
freeze_of_named_alspac_dataset: alspacdcs:gi_1000g_g0m_g1

contains:
- data

files: []
data:
  contains:
  - filtered_22.bgen
  - filtered_21.bgen
  - filtered_20.bgen
  - filtered_19.bgen
  - filtered_17.bgen
  - filtered_18.bgen
  - filtered_15.bgen
  - filtered_16.bgen
  - filtered_14.bgen
  - filtered_13.bgen
  - filtered_09.bgen
  - filtered_23.bgen
  - filtered_12.bgen
  - filtered_11.bgen
  - filtered_10.bgen
  - filtered_08.bgen
  - filtered_07.bgen
  - filtered_06.bgen
  - filtered_05.bgen
  - filtered_03.bgen
  - filtered_04.bgen
  - filtered_01.bgen
  - filtered_02.bgen
  - swapped.sample
  files:
  - id: alspacdcs:a4c65023-c1e8-483e-af6a-a5eda202135b
    name: filtered_22.bgen
    md5sum: fe115a073819d3ddca57180a314edf96
    filesize: 2.0GB
    filetype: .bgen
    number_of_variants: 365644
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:e189d2bf-679c-4d71-ab1b-5911d0680689
    name: filtered_21.bgen
    md5sum: 7d481004542668f9bfec0cc9a6f23205
    filesize: 1.9GB
    filetype: .bgen
    number_of_variants: 377554
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:8e5ff0c0-87a7-49d0-9602-aae35bc1f4d3
    name: filtered_20.bgen
    md5sum: 32d561ccc75a8cff9cfb7d0ff2f6beb5
    filesize: 2.7GB
    filetype: .bgen
    number_of_variants: 617694
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:f4852506-1b9f-4dd7-aa43-703931a0beae
    name: filtered_19.bgen
    md5sum: 9136b29ea7e9ccbdcb4ac7889fe8aef7
    filesize: 3.9GB
    filetype: .bgen
    number_of_variants: 603516
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:e1455c23-9df9-4073-93bc-fd99abe9837b
    name: filtered_17.bgen
    md5sum: aa761d8764e878d227a4af63c9748b63
    filesize: 3.8GB
    filetype: .bgen
    number_of_variants: 753174
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:2ff1ff0c-3531-495d-8efb-1333d09a586e
    name: filtered_18.bgen
    md5sum: e60407d3601e26584b9e8cbdefc1d62c
    filesize: 3.4GB
    filetype: .bgen
    number_of_variants: 783010
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:05db4f4c-9edf-46dd-a873-d411cb05bc99
    name: filtered_15.bgen
    md5sum: 618133a6ef0e5be6cbb9b20214d689d9
    filesize: 3.7GB
    filetype: .bgen
    number_of_variants: 812545
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:7cb98407-45d4-4484-8b9f-3c37402169c0
    name: filtered_16.bgen
    md5sum: 485eaa35595bd2d5b09ac112661a7e00
    filesize: 4.3GB
    filetype: .bgen
    number_of_variants: 865998
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:544d9814-32ad-4375-b7cd-11b0cd4b6191
    name: filtered_14.bgen
    md5sum: 36a40f49a0b30786fba809efe8fb515f
    filesize: 3.9GB
    filetype: .bgen
    number_of_variants: 904351
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:807ca4b7-b7fc-4a08-9fed-7505100e8e3a
    name: filtered_13.bgen
    md5sum: 0cd06c79431689b0abf3b611b4353054
    filesize: 3.9GB
    filetype: .bgen
    number_of_variants: 987740
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:3826dfea-edc3-4b5f-97f2-27f0b8984a1b
    name: filtered_09.bgen
    md5sum: aa484f17e3432cf848f8284842cf12d5
    filesize: 5.0GB
    filetype: .bgen
    number_of_variants: 1187731
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:94361599-9a8e-4d65-a64b-8dd2f205ecb3
    name: filtered_23.bgen
    md5sum: 3c60c10ed23c2d8e66999e6f736646da
    filesize: 5.9GB
    filetype: .bgen
    number_of_variants: 1250218
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:eb39d1c6-76c0-4c65-893d-7441d408874c
    name: filtered_12.bgen
    md5sum: ebfb0facd3f3e9329a1cec9d2edf035b
    filesize: 5.3GB
    filetype: .bgen
    number_of_variants: 1314328
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:a1a16d1a-f4fc-4fe8-8058-8bd9b94e9a02
    name: filtered_11.bgen
    md5sum: 34c90038607804acde536fbdcefb5f12
    filesize: 5.3GB
    filetype: .bgen
    number_of_variants: 1356882
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:02184bad-3ba9-4e29-b390-31192645fd5b
    name: filtered_10.bgen
    md5sum: 57e035cd8f5b67b99e7292482712f007
    filesize: 5.4GB
    filetype: .bgen
    number_of_variants: 1361506
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:d13bffcf-8879-4246-af07-8ccd87bb30da
    name: filtered_08.bgen
    md5sum: d79768b17b72de5f27ff7a65bc2f4f22
    filesize: 5.9GB
    filetype: .bgen
    number_of_variants: 1557429
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:e21e81ad-c2cc-457e-bfd9-370f40ced126
    name: filtered_07.bgen
    md5sum: d31560a8a8a2ae087ea92d81d85c337e
    filesize: 7.1GB
    filetype: .bgen
    number_of_variants: 1599387
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:5cd3c842-678f-4bac-921a-ca7e810df276
    name: filtered_06.bgen
    md5sum: 76bc20f38bb1c155375c38e597c501ab
    filesize: 6.8GB
    filetype: .bgen
    number_of_variants: 1755859
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:c835c01a-4641-4d85-8ae6-e2ebc5675cea
    name: filtered_05.bgen
    md5sum: 1c79aeefda8460272e4f964182f10afd
    filesize: 6.8GB
    filetype: .bgen
    number_of_variants: 1808090
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:c220949d-7fd4-4714-ab41-3638c77a45b7
    name: filtered_03.bgen
    md5sum: ab8430120ce8f09840e194b1a4649ea9
    filesize: 7.6GB
    filetype: .bgen
    number_of_variants: 1966662
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:76f8330c-390d-4f9a-8e0d-ba437fa5154a
    name: filtered_04.bgen
    md5sum: 064e18391df4c15af5c8a99dacccceae
    filesize: 8.3GB
    filetype: .bgen
    number_of_variants: 1968171
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:8ee96eba-2b3c-4fcb-b550-9e81a934938a
    name: filtered_01.bgen
    md5sum: 2d645050a449c6c9210f8c9948790555
    filesize: 9.0GB
    filetype: .bgen
    number_of_variants: 2155158
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:9e07e8aa-956e-4e5d-a505-412fe106c9a3
    name: filtered_02.bgen
    md5sum: 621d9fb9e88ee50f898372f0a17439d8
    filesize: 9.1GB
    filetype: .bgen
    number_of_variants: 2346862
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:f450474c-3688-44db-8fce-c11c1a229950
    name: swapped.sample
    md5sum: 8e3d90b5108bc3da7ede33e39718f57d
    filesize: 1.2MB
    filetype: .sample
    number_of_participants: 17443
    belongs_to: data
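
The md5sum values listed in the freeze docs can be used to confirm that received files are complete. The following is a minimal sketch, assuming the name and md5sum pairs have already been read from the yaml into a dictionary. The function names are hypothetical and the directory is whatever location the files were delivered to.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_md5_mismatches(directory: Path, expected: Dict[str, str]) -> List[str]:
    """Return names of files that are missing or whose MD5 differs from the expected value."""
    problems = []
    for name, md5sum in expected.items():
        path = directory / name
        if not path.is_file() or md5_of_file(path) != md5sum:
            problems.append(name)
    return problems
```

For example, calling find_md5_mismatches with the delivery directory and {"swapped.sample": "8e3d90b5108bc3da7ede33e39718f57d"} returns an empty list when that file is present and intact.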

Genome-wide - TOPMed round 2 imputed - G0 mothers + G1 (gi_topmed_g0m_g1)

Description

This dataset contains genotype data imputed to TOPMed round 2 for G0 mothers and G1.
Reference genome build: GRCh38

Methodology

ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platforms by 23andme subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%) and insufficient sample replication (IBD < 0.8).

Population stratification was assessed by multidimensional scaling analysis and compared with Hapmap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed.

SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed. Cryptic relatedness was measured as proportion of identity by descent (IBD > 0.1).

Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,115 subjects and 500,527 SNPs passed these quality control filters.

ALSPAC mothers were genotyped using the Illumina human660W-quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs. SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally SNPs with a minor allele frequency of less than 1% were removed.

Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded.

9,048 subjects and 526,688 SNPs passed these quality control filters.

We combined 477,482 SNP genotypes in common between the sample of mothers and sample of children. We removed SNPs with genotype missingness above 1% due to poor quality (11,396 SNPs removed) and removed a further 321 subjects due to potential ID mismatches. This resulted in a dataset of 17,842 subjects containing 6,305 duos and 465,740 SNPs (112 were removed during liftOver and 234 were out of HWE after combination).

Individuals in this dataset who had withdrawn from the project were removed before proceeding with imputation-specific quality control. This left 17,450 individuals.

The combined mothers and children genotype panel was filtered using Plink 2.0 to remove SNPs with MAF below 0.01 or missing call rates exceeding 0.01. The joint set of SNPs was checked for palindromic SNPs but none were present. The combined call set was lifted over from GRCh37 to GRCh38 using UCSC liftOver.

The dataset was later filtered to retain SNPs with a HWE P value above 1e-6, leaving 455,150 SNPs. The combined autosomal call set was then converted to VCF files, before being uploaded to the TOPMed imputation server to flag variants requiring a strand fix. Any SNPs flagged with an issue were corrected, or filtered out using Plink2. 454,248 SNPs remained within the autosomes.

Phasing and imputation were conducted on the Michigan TOPMed imputation server (v1.7.4) in October 2023. Phasing was done using Eagle (v2.4). Imputation was done with minimac4 (v1.0.2) to TOPMed R2. An R squared filter of 0.3 was applied.

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gi_topmed_g0m_g1_2025-07-25_f6
name: >-
  Genome-wide - TOPmed imputed - G0 mothers + G1 version 2025-07-25
  freeze 6
description: >-
  Freeze 6 of version 2025-07-25 Genome-wide array data imputed to the TOPmed round 2 reference panel for G0 mothers and G1 individuals in bgen and sample file format (version 1.2). 
  The 2025-07-25 version of the dataset has had all monomorphic variants filtered out of the dataset to reduce overall size.  
freeze_size: 102G
linker_file_md5sum: 45415c7d4fae355b4fb2d6ccd042620d
woc_file_md5sum: 6c887db8c7dd10cc695630ca73b41405
all_individuals_to_exclude_md5sum: e4efce63f9f671548d08c8bb2f9cc4f7
git_tag: https://github.com/alspac/dataset_gi_topmed_g0m_g1/releases/tag/freeze6
is_current_freeze: true
freeze_number: 6
freeze_date: 2025-09-30
previous_freeze: alspacdcs:gi_topmed_g0m_g1_2024-12-19_f5
freeze_of_alspac_dataset_version: alspacdcs:gi_topmed_g0m_g1_2025-07-25
freeze_of_named_alspac_dataset: alspacdcs:gi_topmed_g0m_g1

contains:
- data

files: []
data:
  contains:
  - chr22_freeze.bgen
  - chr21_freeze.bgen
  - chr20_freeze.bgen
  - chr19_freeze.bgen
  - chr15_freeze.bgen
  - chr16_freeze.bgen
  - chr17_freeze.bgen
  - chr18_freeze.bgen
  - chr14_freeze.bgen
  - chr6_freeze.bgen
  - chr11_freeze.bgen
  - chr9_freeze.bgen
  - chr7_freeze.bgen
  - chr13_freeze.bgen
  - chr12_freeze.bgen
  - chr8_freeze.bgen
  - chr10_freeze.bgen
  - chr3_freeze.bgen
  - chr5_freeze.bgen
  - chr1_freeze.bgen
  - chr4_freeze.bgen
  - chr2_freeze.bgen
  - freeze.sample
  files:
  - id: alspacdcs:bc269aaf-b168-4486-8fb9-9765872ace28
    name: chr22_freeze.bgen
    md5sum: dfa50660994f12fb75c6555ff5a8aecb
    filesize: 1.6GB
    filetype: .bgen
    number_of_variants: 962561
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:66e69d02-4806-43fa-ba36-61cb82c32f4f
    name: chr21_freeze.bgen
    md5sum: f5e9aff73c2f8e53827dd2436a2617ed
    filesize: 1.5GB
    filetype: .bgen
    number_of_variants: 900622
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:18df2c84-cf46-4826-96fd-a1eb764104a7
    name: chr20_freeze.bgen
    md5sum: d20992c1426cd71ba5347438e4796b04
    filesize: 2.4GB
    filetype: .bgen
    number_of_variants: 1553082
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:f2bf3956-efe5-4e8c-b661-57224158a638
    name: chr19_freeze.bgen
    md5sum: 2b411c0382a710f441d1a4facff85377
    filesize: 2.8GB
    filetype: .bgen
    number_of_variants: 1545576
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:bf62d97c-7a23-423b-b058-4cc1399a22aa
    name: chr15_freeze.bgen
    md5sum: 2ba7b10c3acdbc05068324b5e6c49e64
    filesize: 3.0GB
    filetype: .bgen
    number_of_variants: 1991728
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:507db985-2ff1-47f8-83cb-3a0e9c23e73c
    name: chr16_freeze.bgen
    md5sum: fb6c49d8517a13f2fe100ec945cae487
    filesize: 3.4GB
    filetype: .bgen
    number_of_variants: 2182386
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:a10bbf67-2c3b-4500-9c9a-cf412bf8e5bf
    name: chr17_freeze.bgen
    md5sum: 7e7b431d7a9a56854437bc0e508224c9
    filesize: 3.1GB
    filetype: .bgen
    number_of_variants: 1960949
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:73ff5630-9dae-48fc-a1c7-393389a1d5b9
    name: chr18_freeze.bgen
    md5sum: 46b5223f3aab805ff1472c8170f16246
    filesize: 3.0GB
    filetype: .bgen
    number_of_variants: 1917077
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:198a0ce9-b36e-4283-882e-2c340083a7c8
    name: chr14_freeze.bgen
    md5sum: 31e3d2a5e3025ff295732b81fb8eb67e
    filesize: 3.2GB
    filetype: .bgen
    number_of_variants: 2168141
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:a77b0371-6a4c-4cb8-9efd-6554fcc65687
    name: chr6_freeze.bgen
    md5sum: 7ad4e779bec8ff015b6f456fbd2168a1
    filesize: 6.0GB
    filetype: .bgen
    number_of_variants: 4170487
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:5c029c98-e44f-45dd-aa3d-b55fabc4825c
    name: chr11_freeze.bgen
    md5sum: 4c0bb18bd8b37d03167dba5b5832c73d
    filesize: 5.0GB
    filetype: .bgen
    number_of_variants: 3361214
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:55d0c079-6544-42d4-a777-128384faccba
    name: chr9_freeze.bgen
    md5sum: 7de2977c6dc18dbfea7f13a1fc935a2a
    filesize: 4.3GB
    filetype: .bgen
    number_of_variants: 2996234
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:080fa42b-02ec-4ae7-ade0-ae1a7c3c3d60
    name: chr7_freeze.bgen
    md5sum: f24b5dca4f7eb1f3f0352f4921d81439
    filesize: 6.1GB
    filetype: .bgen
    number_of_variants: 3924564
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:5f06e0fd-682c-4c2f-970e-0787dbc6402b
    name: chr13_freeze.bgen
    md5sum: bd34a1d9ee2d3793753d40c941065ebe
    filesize: 3.6GB
    filetype: .bgen
    number_of_variants: 2429492
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:78d407d7-0af8-4829-b814-5b8daf36c820
    name: chr12_freeze.bgen
    md5sum: af25a87a5c448baa1afabc0e863570d4
    filesize: 4.8GB
    filetype: .bgen
    number_of_variants: 3247986
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:35de9a9c-6ce6-4fdc-9595-0b53664187c0
    name: chr8_freeze.bgen
    md5sum: 5ff96ffda3f0f0eb369488cba3daf90c
    filesize: 5.3GB
    filetype: .bgen
    number_of_variants: 3767813
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:75358122-201b-40c6-8301-f6b52f9dafad
    name: chr10_freeze.bgen
    md5sum: 49c9fec775b0a100122c55fd4dbcb1e6
    filesize: 5.0GB
    filetype: .bgen
    number_of_variants: 3328581
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:2d2df7b7-21d0-4594-98c2-b13b88d8bb6e
    name: chr3_freeze.bgen
    md5sum: 968e0450a13df1db361bb7f051eb15d6
    filesize: 7.0GB
    filetype: .bgen
    number_of_variants: 4839527
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:789c55ab-434b-4a57-8cf2-a06032d26655
    name: chr5_freeze.bgen
    md5sum: efb585b055f4fc6152539451536fdce7
    filesize: 6.3GB
    filetype: .bgen
    number_of_variants: 4361228
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:ab21d436-44ac-4644-9423-ea97d8d30621
    name: chr1_freeze.bgen
    md5sum: 71d10371666edf730b3a9bbbba4d9656
    filesize: 7.9GB
    filetype: .bgen
    number_of_variants: 5442779
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:e2d62c6c-1091-4167-80a5-3b9a1d7bdc76
    name: chr4_freeze.bgen
    md5sum: 061a70600f6dc402186c4e6d5a466e36
    filesize: 7.5GB
    filetype: .bgen
    number_of_variants: 4721728
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:ceb776ea-f450-4941-a824-5c1f2f5b5170
    name: chr2_freeze.bgen
    md5sum: 9f34455ed28304c165f40516d7ba6a28
    filesize: 8.2GB
    filetype: .bgen
    number_of_variants: 5860289
    number_of_participants: 17443
    belongs_to: data
  - id: alspacdcs:f8a3e43c-2a74-43e5-9b56-10b38cf07d23
    name: freeze.sample
    md5sum: 4dd920b481fbd03cb9cde07d05fd0e40
    filesize: 953.9KB
    filetype: .sample
    number_of_participants: 17443
    belongs_to: data
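
The filesize entries can be summed as a rough cross-check against freeze_size. The sketch below parses the size strings used in the freeze docs and totals them. It assumes decimal units (1 GB = 10^9 bytes), which the catalogue does not state, so the total should be treated as approximate. The function names are hypothetical.

```python
import re
from typing import Iterable

# Assumption: decimal units. The catalogue does not say whether sizes are decimal or binary.
_UNIT_BYTES = {"KB": 1e3, "MB": 1e6, "GB": 1e9}


def parse_filesize(text: str) -> float:
    """Convert a size string such as '953.9KB' or '1.6GB' into bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(KB|MB|GB)", text)
    if match is None:
        raise ValueError(f"unrecognised file size: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_BYTES[unit]


def total_gigabytes(sizes: Iterable[str]) -> float:
    """Sum a collection of size strings and return the total in gigabytes."""
    return sum(parse_filesize(size) for size in sizes) / 1e9


# A few of the sizes listed above, for illustration.
print(total_gigabytes(["1.6GB", "1.5GB", "953.9KB"]))
```

Because the listed sizes are rounded to one decimal place, a small gap between the summed total and freeze_size is expected.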

Sequence Data

Whole genome sequencing - G1 (wgs_hiseq_g1)

Description

This dataset contains whole genome sequencing for G1 individuals, part of the UK10K dataset.
Reference genome build: GRCh37

Methodology

ALSPAC and TwinsUK cohorts were sequenced at an average read depth of 6.7x through the UK10K program (http://www.UK10K.org) using the Illumina HiSeq platform, and aligned to the GRCh37 human reference using BWA. SNV calls were made using samtools/bcftools, and VQSR and GATK were used to recall them.

Associated publication:
- http://www.ncbi.nlm.nih.gov/pubmed/26367797

Please ensure you have permission to access this data (http://www.uk10k.org/data_access.html) before using it.

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:wgs_hiseq_g1_2016-08-18_f6
name: Whole genome sequencing - G1 version 2016-08-18 freeze 6
description: >-
  This is freeze 6 of version 2016-08-18 of the whole genome sequencing for G1 individuals, part of the UK10K dataset.
freeze_size: 341G
linker_file_md5sum: 45415c7d4fae355b4fb2d6ccd042620d
woc_file_md5sum: 6c887db8c7dd10cc695630ca73b41405
all_individuals_to_exclude_md5sum: e4efce63f9f671548d08c8bb2f9cc4f7
git_tag: https://github.com/alspac/dataset_wgs_hiseq_g1/releases/tag/freeze6
is_current_freeze: true
freeze_number: 6
freeze_date: 2025-09-30
previous_freeze: alspacdcs:wgs_hiseq_g1_2016-08-18_f5
freeze_of_alspac_dataset_version: alspacdcs:wgs_hiseq_g1_2016-08-18
freeze_of_named_alspac_dataset: alspacdcs:wgs_hiseq_g1

contains:
- data
files: []
data:
  contains:
  - 4_freeze.vcf.gz.csi
  - 3_freeze.vcf.gz.csi
  - 7_freeze.vcf.gz.csi
  - 17_freeze.vcf.gz.csi
  - 20_freeze.vcf.gz.csi
  - 15_freeze.vcf.gz.csi
  - 9_freeze.vcf.gz.csi
  - 16_freeze.vcf.gz.csi
  - 12_freeze.vcf.gz.csi
  - 10_freeze.vcf.gz.csi
  - 6_freeze.vcf.gz.csi
  - X_freeze.vcf.gz.csi
  - 8_freeze.vcf.gz.csi
  - 21_freeze.vcf.gz.csi
  - 19_freeze.vcf.gz.csi
  - 1_freeze.vcf.gz.csi
  - 11_freeze.vcf.gz.csi
  - 13_freeze.vcf.gz.csi
  - 18_freeze.vcf.gz.csi
  - 5_freeze.vcf.gz.csi
  - 2_freeze.vcf.gz.csi
  - 22_freeze.vcf.gz.csi
  - 14_freeze.vcf.gz.csi
  - 21_freeze.vcf.gz
  - 22_freeze.vcf.gz
  - 19_freeze.vcf.gz
  - 20_freeze.vcf.gz
  - 15_freeze.vcf.gz
  - 17_freeze.vcf.gz
  - 14_freeze.vcf.gz
  - 18_freeze.vcf.gz
  - 16_freeze.vcf.gz
  - X_freeze.vcf.gz
  - 13_freeze.vcf.gz
  - 9_freeze.vcf.gz
  - 10_freeze.vcf.gz
  - 8_freeze.vcf.gz
  - 12_freeze.vcf.gz
  - 11_freeze.vcf.gz
  - 6_freeze.vcf.gz
  - 5_freeze.vcf.gz
  - 7_freeze.vcf.gz
  - 4_freeze.vcf.gz
  - 3_freeze.vcf.gz
  - 1_freeze.vcf.gz
  - 2_freeze.vcf.gz
  files:
  - id: alspacdcs:bc517e41-e006-4bb4-98d6-561234d4c927
    name: 4_freeze.vcf.gz.csi
    md5sum: 1e96e09dda062a07d0e6dbed3d620609
    filesize: 122.6KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:657dbf85-17d4-48d2-9ef6-7eff00bf0e62
    name: 3_freeze.vcf.gz.csi
    md5sum: 710268ee23b70a2f4a7692c016d23954
    filesize: 127.9KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:1d4e1d33-6af9-458f-9b80-6387ec4718ae
    name: 7_freeze.vcf.gz.csi
    md5sum: 79858dec4f5980281600065acc55645a
    filesize: 101.8KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:49f0b8f5-d6de-43c2-b0d9-aebddb7e95a1
    name: 17_freeze.vcf.gz.csi
    md5sum: e55c23ef41e971f4cbba256ab90f6c0e
    filesize: 49.9KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:38301176-dce1-4712-997a-38e165d0e86b
    name: 20_freeze.vcf.gz.csi
    md5sum: 01abfd1020dfdd35ae4b8fafd887bb75
    filesize: 38.2KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:d353fef6-de4b-4b8f-a6b9-cd1e8cb2d8b3
    name: 15_freeze.vcf.gz.csi
    md5sum: 59ed2877301a832f461cf72965a7456c
    filesize: 51.7KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:6a990423-f1d8-4a78-9182-34fc1342067e
    name: 9_freeze.vcf.gz.csi
    md5sum: 8f582cdaf97496a225c064103b4966df
    filesize: 75.4KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:81bca53a-7958-421e-969e-767f56f494d9
    name: 16_freeze.vcf.gz.csi
    md5sum: ac3a87b8284a80237a5612ce1c10763b
    filesize: 50.4KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:85c02321-3454-4bf6-b5fc-b56262cf1be8
    name: 12_freeze.vcf.gz.csi
    md5sum: eadd942f5d3d41bbd6747d4ed1445fdf
    filesize: 85.5KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:10019143-029a-4dd7-90b6-8f762d3e351a
    name: 10_freeze.vcf.gz.csi
    md5sum: c6a35b0f8ab981baba9f5f76cadb807f
    filesize: 85.5KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:f3e4c4bb-b99e-4c9d-81ec-f6fed8fd4ccb
    name: 6_freeze.vcf.gz.csi
    md5sum: 66afc781a4738e294b7c1c71ee7d0bf0
    filesize: 109.9KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:84116aa8-a76f-4af1-b138-2cdeb4c28c35
    name: X_freeze.vcf.gz.csi
    md5sum: feb208ab0f31fe27ab5b4b8a688ad67c
    filesize: 96.0KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:138a1186-3078-4eae-b8d2-832f78aabea5
    name: 8_freeze.vcf.gz.csi
    md5sum: d00479aa24a74e7d00e90fd4c259b63d
    filesize: 92.8KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:76a34242-a3dc-4c52-a6ba-d0a3339ac7c4
    name: 21_freeze.vcf.gz.csi
    md5sum: 45b9ef5f1036c573549400642b1817fd
    filesize: 22.1KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:e3c1eeb7-9b1a-4ad4-af45-9d3a22a828b9
    name: 19_freeze.vcf.gz.csi
    md5sum: c0d1ba8b4f99a46bf2690484b4e8a08f
    filesize: 35.8KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:7ee211cc-e460-4339-897e-8c58f68bc8d6
    name: 1_freeze.vcf.gz.csi
    md5sum: 689c7e022a0a6b95ee0d2355b03e7bad
    filesize: 145.6KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:12b528bb-2278-43b5-aa04-56d33c6e8bd3
    name: 11_freeze.vcf.gz.csi
    md5sum: 361b60ef1aa9e2c8a5edb2b524c3176e
    filesize: 85.2KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:53e14d75-9344-4d8a-a050-bd10b17adbb0
    name: 13_freeze.vcf.gz.csi
    md5sum: 8ec1fa0d623cb977bb19d811741b1a39
    filesize: 62.1KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:27546e70-05f1-48c8-b908-9f499bf1d0c1
    name: 18_freeze.vcf.gz.csi
    md5sum: 03c5e29b9df938ecc249d83d5a29f37e
    filesize: 48.5KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:8e68f9d4-5814-4d86-9180-7b8489bbf2ac
    name: 5_freeze.vcf.gz.csi
    md5sum: b6dbbdbd8640267057a6b1db5311add8
    filesize: 116.1KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:ae159f9a-33ef-4b1a-8347-9ce73d4a5c5c
    name: 2_freeze.vcf.gz.csi
    md5sum: 6cec8fff890738c60c71bf42ca6ba952
    filesize: 156.1KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:6849eb0f-6cba-497e-b608-4f0277b548f9
    name: 22_freeze.vcf.gz.csi
    md5sum: 80b954d0b63ae61f6661a512b040023d
    filesize: 22.1KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:04fa0db1-24dd-4c3b-a8c9-64b607850b61
    name: 14_freeze.vcf.gz.csi
    md5sum: 01148117d507206aeeec75d4cb909b9e
    filesize: 56.6KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:a40bee63-e57d-4fe2-b0f8-79bb97fef1a3
    name: 21_freeze.vcf.gz
    md5sum: 407fc245f5af69ebee43b7f6900c7d3a
    filesize: 4.3GB
    filetype: .gz
    number_of_variants: 563988
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:5d0ee7ad-cd85-4c36-b0c4-e18184da88f0
    name: 22_freeze.vcf.gz
    md5sum: 41eb74a2deb305d78562e4a0545ad429
    filesize: 4.4GB
    filetype: .gz
    number_of_variants: 552675
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:61e3b02f-9177-4274-bba7-97427a5cc05f
    name: 19_freeze.vcf.gz
    md5sum: 2aabff2631af303fb7510912d483387a
    filesize: 7.0GB
    filetype: .gz
    number_of_variants: 886630
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:251aea18-1247-4bd4-bb73-032149511ff7
    name: 20_freeze.vcf.gz
    md5sum: 9f7d7d3408b37dbf1d3b910486d6b8de
    filesize: 7.5GB
    filetype: .gz
    number_of_variants: 970869
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:aaf7deab-9969-45ee-a464-88fa1bba3c37
    name: 15_freeze.vcf.gz
    md5sum: 70828f306ee2698ef5bf3f68d6482214
    filesize: 9.7GB
    filetype: .gz
    number_of_variants: 1262404
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:1d328f6b-1c5f-44d4-b179-f301064f87be
    name: 17_freeze.vcf.gz
    md5sum: 8170bc093573b70bbc86f40fa820b555
    filesize: 9.1GB
    filetype: .gz
    number_of_variants: 1177884
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:0234d628-9ebe-4eae-891f-0c25105caa73
    name: 14_freeze.vcf.gz
    md5sum: 13c6732a437d7c5998ad4aa23327217f
    filesize: 10.7GB
    filetype: .gz
    number_of_variants: 1403580
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:349521d9-a02e-4e93-9d0c-7bc4a923760e
    name: 18_freeze.vcf.gz
    md5sum: a8c08c119c443ae7de5c5679cc8cf84c
    filesize: 9.4GB
    filetype: .gz
    number_of_variants: 1220427
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:228136ed-df23-4d67-acf4-f32775e6262f
    name: 16_freeze.vcf.gz
    md5sum: 660eb1a7f9bfa9453104b6f6a36b3792
    filesize: 10.6GB
    filetype: .gz
    number_of_variants: 1373607
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:99891c2e-9103-4405-b91a-0a0f6ea0ad38
    name: X_freeze.vcf.gz
    md5sum: 83fd00533ca8fa79ef2cb90cf24a4447
    filesize: 10.5GB
    filetype: .gz
    number_of_variants: 1700742
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:1eba6a4a-f4b7-43c7-83fb-062ff890cae9
    name: 13_freeze.vcf.gz
    md5sum: 884aee3994ad39536cbe22216b8013be
    filesize: 11.8GB
    filetype: .gz
    number_of_variants: 1527053
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:e5c1a4b3-1d71-4361-a285-08034f6abd29
    name: 9_freeze.vcf.gz
    md5sum: 342ecb473fad630b7a5a2da3084e2870
    filesize: 14.2GB
    filetype: .gz
    number_of_variants: 1845456
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:b1bb1feb-f3f5-46b6-9371-2e8df16ed159
    name: 10_freeze.vcf.gz
    md5sum: 6f6447119f502325bead2828c6e8eeda
    filesize: 16.3GB
    filetype: .gz
    number_of_variants: 2110436
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:86e912ca-7535-4e06-959f-9c67d4b7b09e
    name: 8_freeze.vcf.gz
    md5sum: d7f427e88eaa35de8c0f85d406a1e17d
    filesize: 18.8GB
    filetype: .gz
    number_of_variants: 2451009
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:6ce14e08-2a85-46be-8ea6-e4eb81d444b4
    name: 12_freeze.vcf.gz
    md5sum: 0128a134ac2d2be58424a0dfe4fadb63
    filesize: 15.7GB
    filetype: .gz
    number_of_variants: 2047922
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:1976a545-7c16-4f99-9a7d-9c5d2d77860d
    name: 11_freeze.vcf.gz
    md5sum: 0160ce2e72cdf53ff2b7770c84415225
    filesize: 16.4GB
    filetype: .gz
    number_of_variants: 2125064
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:22088f45-f8f4-4386-b1a3-939ddecb1641
    name: 6_freeze.vcf.gz
    md5sum: ed5f30f81e5145e74c2b60266b0baab1
    filesize: 21.0GB
    filetype: .gz
    number_of_variants: 2704091
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:6dd03d42-28da-4947-b785-b843e90a48f9
    name: 5_freeze.vcf.gz
    md5sum: ceb9bc294a5fae86481d6a17d4807b62
    filesize: 21.6GB
    filetype: .gz
    number_of_variants: 2804359
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:ab486b1a-2849-4297-94af-6a7e4437f18d
    name: 7_freeze.vcf.gz
    md5sum: 586f5de80643fd974397c7fafaf3227a
    filesize: 19.0GB
    filetype: .gz
    number_of_variants: 2445204
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:4709a4af-7190-4f19-8ffb-869f546a9d16
    name: 4_freeze.vcf.gz
    md5sum: c224d57595609b177ec126f7a356c12f
    filesize: 23.2GB
    filetype: .gz
    number_of_variants: 3019176
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:6533b2ac-5b9a-4f86-9f95-df2a61013ae2
    name: 3_freeze.vcf.gz
    md5sum: 2ff06564c5dd09220d04cba5eff9601e
    filesize: 24.2GB
    filetype: .gz
    number_of_variants: 3147254
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:da3293aa-2c59-4b9b-9f21-ed3912c050c2
    name: 1_freeze.vcf.gz
    md5sum: 0f78af877e87d5f733e51f2f3d3885f6
    filesize: 26.3GB
    filetype: .gz
    number_of_variants: 3406915
    number_of_participants: 1865
    belongs_to: data
  - id: alspacdcs:794abf20-8d47-423b-b7a2-343ddb24255a
    name: 2_freeze.vcf.gz
    md5sum: 57503332a8bd17997da7916c8057cad4
    filesize: 28.8GB
    filetype: .gz
    number_of_variants: 3749277
    number_of_participants: 1865
    belongs_to: data
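
Each file entry above carries an md5sum, which can be used to confirm that a received file matches the freeze. The following is a minimal Python sketch, assuming the files have been copied into a local directory. The function names `md5_of_file` and `verify_files`, and the `expected` mapping (file name to md5sum, copied by hand from the listing), are illustrative and not part of any ALSPAC tooling.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(directory, expected):
    """Compare files in `directory` against expected md5sums.

    `expected` maps file name -> md5sum as listed in the freeze docs.
    Returns a dict of file name -> "ok", "mismatch" or "missing".
    """
    results = {}
    for name, md5sum in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == md5sum:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

For example, `verify_files("/path/to/freeze/data", {"21_freeze.vcf.gz.csi": "45b9ef5f1036c573549400642b1817fd"})` would report the status of one index file, where the directory path is a placeholder.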

Whole exome sequencing - G0 & G1 (wes_novaseq_g0_g1)

Description

This dataset contains whole exome sequencing for G0 and G1 individuals. It was generated at the Sanger Institute as part of an initiative sequencing multiple birth cohorts: ALSPAC, MCS and BiB. As part of this initiative, the exome sequencing data will also be available via EGA, but researchers will still gain access through ALSPAC's project approval system.
Reference genome build: GRCh38

Methodology

Exome sequencing was conducted on DNA for 12,374 participants (8,605 children and 3,389 of their parents) at the Sanger Institute, using Illumina NovaSeq. Reads were aligned to GRCh38 with BWA-MEM. There was an average on-target depth of ~62X for ALSPAC.

QC was conducted on the dataset at the Sanger Institute; details are given in the associated publication (Koko et al., 2024). Sample QC was done before variant calling (base-calls after sequencing, alignment quality, CRAM file quality) and after variant calling (PCA, comparison to array data, relatedness). Integrated variant QC removed potentially false positive variants using a trained random forest model. Genotype QC removed low-quality individual genotype calls.

Single nucleotide variant (SNV) and small insertion/deletion (indel) calling was conducted with GATK HaplotypeCaller, GenomicsDBImport and GenotypeGVCFs (GATK version 4.2.4.0 for ALSPAC) following GATK best practices (Van der Auwera and O’Connor, 2020).

Twelve individuals were identified as having sex mismatches within the dataset, flagged on the basis of the X chromosome F statistic. When the Y coverage of these individuals was examined, 3 were clear mismatches on both the X F statistic and Y depth, while 9 were mismatches on the X F statistic only. The 3 individuals with clear mismatches on both statistics were removed from the dataset, while the other 9 were retained.
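
The retention rule described above can be sketched in Python as follows. The thresholds used to flag a mismatch on the X F statistic or on Y depth are not stated here, so the sketch takes the two flags as already derived. The names `SexCheck` and `decide`, and the sample IDs, are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class SexCheck:
    sample_id: str
    x_f_mismatch: bool      # flagged as a mismatch on the X chromosome F statistic
    y_depth_mismatch: bool  # flagged as a mismatch on Y chromosome depth


def decide(check):
    """Remove a sample only when both statistics indicate a mismatch."""
    if check.x_f_mismatch and check.y_depth_mismatch:
        return "remove"
    if check.x_f_mismatch:
        return "retain (flagged on X F statistic only)"
    return "retain"


# Hypothetical samples for illustration only
checks = [
    SexCheck("sample_a", True, True),
    SexCheck("sample_b", True, False),
    SexCheck("sample_c", False, False),
]
decisions = {c.sample_id: decide(c) for c in checks}
```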

Associated publication:
- https://doi.org/10.12688/wellcomeopenres.22697.1

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_f6
name: >- 
  Whole Exome Sequencing - Novaseq - G0 & G1 version 2024-09-20 freeze 6
description: >-
  This is the first iteration of wes_novaseq_g0_g1, first introduced in freeze 4. It contains data in vcf 4.2 format. It contains the majority of the G1 cohort (n=~8296), accompanied by G0 mothers (n=~1642) and partners (n=~1630) to create trios. Over time participants may withdraw their consent and will subsequently be removed from the dataset, so the number of available individuals from each cohort may differ from that stated above.
  
  This exome sequencing (ES) data was generated at the Sanger Institute as part of an effort to exome sequence ALSPAC, MCS and BiB. All ES data was quality controlled at the Sanger Institute prior to this ALSPAC release, and the QC has been extensively documented in the relevant publication (see below).

  In brief (excerpt from associated publication, Koko et al., 2024):

    "Sample QC: 
      * Before variant calling: Samples were removed if they failed one or more filters based on quality of base-calls after sequencing, or quality of the CRAM files of aligned reads. The remainder then underwent variant calling.
      * After variant calling: We assigned individuals to populations using principal component analysis (PCA), then identified and removed individuals who were outliers on one or more variant-based metrics within each of the populations. We compared the exome data to genotyping array data from the same samples and removed samples that did not match as expected, since these could be sample mix-ups. The samples were also checked for unexpected relatedness; samples showing conflicts between reported and inferred relatedness were removed. This sample QC was split in two separate steps, before and after variant and genotype QC, as detailed in the coming sections. 
    Integrated variant and genotype QC:
      * Variant QC: We removed candidate variants which may not be real, instead being artefacts or mapping errors, using a trained random forest model to distinguish likely true positives from likely false positives. 
      * Genotype QC: We removed low-quality individual genotype calls from the dataset. This was done in conjunction with variant QC, as we will explain below."

  For extended information, such as thresholds, please see the publication.

  Associated publication:
    Koko et al., 2024
    DOI: https://doi.org/10.12688/wellcomeopenres.22697.2


freeze_size: 167G
linker_file_md5sum: 45415c7d4fae355b4fb2d6ccd042620d
woc_file_md5sum: 6c887db8c7dd10cc695630ca73b41405
all_individuals_to_exclude_md5sum: e4efce63f9f671548d08c8bb2f9cc4f7
git_tag: https://github.com/alspac/dataset_wes_novaseq_g0_g1/releases/tag/freeze6
is_current_freeze: true
freeze_number: 6
freeze_date: 2025-09-30
previous_freeze: alspacdcs:wes_novaseq_g0_g1_2024-09-20_f5
freeze_of_alspac_dataset_version: alspacdcs:wes_novaseq_g0_g1_2024-09-20
freeze_of_named_alspac_dataset: alspacdcs:wes_novaseq_g0_g1

contains:
- data

files: []
data:
  contains:
  - chr13_data.vcf.gz.csi
  - chr3_data.vcf.gz.csi
  - chr1_data.vcf.gz.csi
  - chr7_data.vcf.gz.csi
  - chr6_data.vcf.gz.csi
  - chr10_data.vcf.gz.csi
  - chr9_data.vcf.gz.csi
  - chr14_data.vcf.gz.csi
  - chrY_data.vcf.gz.csi
  - chr21_data.vcf.gz.csi
  - chr18_data.vcf.gz.csi
  - chr12_data.vcf.gz.csi
  - chr19_data.vcf.gz.csi
  - chr16_data.vcf.gz.csi
  - chr20_data.vcf.gz.csi
  - chr17_data.vcf.gz.csi
  - chr8_data.vcf.gz.csi
  - chr11_data.vcf.gz.csi
  - chr2_data.vcf.gz.csi
  - chr15_data.vcf.gz.csi
  - chr5_data.vcf.gz.csi
  - chr22_data.vcf.gz.csi
  - chr4_data.vcf.gz.csi
  - chrX_data.vcf.gz.csi
  - chrY_data.vcf.gz
  - chr21_data.vcf.gz
  - chr13_data.vcf.gz
  - chr18_data.vcf.gz
  - chr20_data.vcf.gz
  - chrX_data.vcf.gz
  - chr22_data.vcf.gz
  - chr15_data.vcf.gz
  - chr9_data.vcf.gz
  - chr10_data.vcf.gz
  - chr7_data.vcf.gz
  - chr8_data.vcf.gz
  - chr14_data.vcf.gz
  - chr4_data.vcf.gz
  - chr5_data.vcf.gz
  - chr12_data.vcf.gz
  - chr16_data.vcf.gz
  - chr3_data.vcf.gz
  - chr6_data.vcf.gz
  - chr17_data.vcf.gz
  - chr11_data.vcf.gz
  - chr19_data.vcf.gz
  - chr2_data.vcf.gz
  - chr1_data.vcf.gz
  files:
  - id: alspacdcs:dacbe298-6538-4cc4-8294-2b82588032b5
    name: chr13_data.vcf.gz.csi
    md5sum: 4b1088f3717f8dfa6e9d1435125ebcb7
    filesize: 13.4KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:f61a7fa1-107a-4d3c-8a44-8784c9d5dccb
    name: chr3_data.vcf.gz.csi
    md5sum: c4ba2ca77fe3ef0a54ffd132b67e501a
    filesize: 37.9KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:9c248b37-fb0f-4108-bec7-6bfccd87838e
    name: chr1_data.vcf.gz.csi
    md5sum: 0cbe45073d54a442cee4d7adb3cc1d43
    filesize: 59.4KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:c4e40fb2-5707-4ec8-8d14-97193f34f5e1
    name: chr7_data.vcf.gz.csi
    md5sum: aeb5dbe67f4ea80055674dd7f975bb07
    filesize: 32.2KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:8a9b904b-000a-44f4-ba14-726c70d25d28
    name: chr6_data.vcf.gz.csi
    md5sum: 7e749d71a869e5e13a30ffbfd91fa741
    filesize: 32.1KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:09dfab5a-2e81-4893-bb03-9693b44e9422
    name: chr10_data.vcf.gz.csi
    md5sum: 2e7d3061a8c3321352176fa1a6c75613
    filesize: 27.8KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:a4a08f7c-d6ac-4014-a6ef-c982e53b6239
    name: chr9_data.vcf.gz.csi
    md5sum: dc1bf4d82d9b65fe01a0be1ded03928c
    filesize: 25.0KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:d8fa72a5-9103-402a-baa2-9d0dc1e0e8e5
    name: chr14_data.vcf.gz.csi
    md5sum: 18984469d87f958ceb6b6978b27c98d5
    filesize: 19.1KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:f83490ab-222d-4eb0-af78-534022ba18e7
    name: chrY_data.vcf.gz.csi
    md5sum: ee2bdf73b5f72154dc520c4da1a3b3f6
    filesize: 129.0B
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:6c66a88c-bc26-407f-a116-12a68eb98e62
    name: chr21_data.vcf.gz.csi
    md5sum: a4ad759539e5f3c2e586b528e8640b80
    filesize: 6.3KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:0e5acc30-23ef-48e2-8bee-addc4e8b4561
    name: chr18_data.vcf.gz.csi
    md5sum: 2222655064ce404092d784f6b6fd1ac9
    filesize: 12.4KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:4cab45e5-a2dd-4c13-8110-bd0a50727082
    name: chr12_data.vcf.gz.csi
    md5sum: 1c215daae12e958e9e7adc65d9e036bd
    filesize: 31.7KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:0f1f7933-322e-4490-96f6-96b6962aff05
    name: chr19_data.vcf.gz.csi
    md5sum: 3b2b9f9973f5d97cce113b3ab878e60a
    filesize: 23.7KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:ce4ecf99-46a8-4115-a0fa-63b7e0e9c49c
    name: chr16_data.vcf.gz.csi
    md5sum: a1d59dc67083f2596e9702af8e734ccb
    filesize: 19.9KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:e9455650-d11c-4338-84c4-4ef06288e6df
    name: chr20_data.vcf.gz.csi
    md5sum: 755cdaf303b2f09a19c2f6d11cf7401f
    filesize: 14.7KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:96909c3e-2255-4019-ac36-e67bec620851
    name: chr17_data.vcf.gz.csi
    md5sum: 947c56aa0082ffff41fabd3196e0abc5
    filesize: 26.2KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:bb2f79c7-7877-4571-8b61-0d3316753943
    name: chr8_data.vcf.gz.csi
    md5sum: 0a95f0920e76f0331993186b80fc8e65
    filesize: 24.6KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:ccc44a15-b42c-423c-a185-341cf3cdc0ee
    name: chr11_data.vcf.gz.csi
    md5sum: 7348297f255b9025f3e485be1b62b2f6
    filesize: 31.5KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:60776f11-9a96-4752-95de-343118a30e61
    name: chr2_data.vcf.gz.csi
    md5sum: 5d1135a086b68de5e46d4152cabfd04a
    filesize: 47.6KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:d3895d74-5cd3-4c2f-8a08-fe87081c05c2
    name: chr15_data.vcf.gz.csi
    md5sum: 5a62f6f281f6dbd005afb149ee52daf1
    filesize: 19.6KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:fbf90a1b-cc06-41b4-be90-69a1b43ab0d9
    name: chr5_data.vcf.gz.csi
    md5sum: 948665995542d984eedc0b52141d18a1
    filesize: 30.8KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:9893fdc5-0368-4864-a5bf-2e4cd8519682
    name: chr22_data.vcf.gz.csi
    md5sum: 44dd6c78e5648b2ebdec6fce8d0feb1c
    filesize: 11.0KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:9b01dd43-9a1f-4e89-a13f-893b5e3f35eb
    name: chr4_data.vcf.gz.csi
    md5sum: 014f977ed3351d8052197e58f24db7db
    filesize: 29.7KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:8ac119cb-c6e6-48e2-8b95-b414de9d7923
    name: chrX_data.vcf.gz.csi
    md5sum: 561bc722911156aadade8d025721f0af
    filesize: 22.9KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:cdd72e0b-0fb0-4855-b8eb-212032b2fea2
    name: chrY_data.vcf.gz
    md5sum: e4ea71e21eb7e842a8a6a63dcff96f5c
    filesize: 363.9KB
    filetype: .gz
    number_of_variants: 9
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:11e3f53f-e9a8-490b-add4-8a04c879f10b
    name: chr21_data.vcf.gz
    md5sum: b7918e78a4f18b43cc1958f4552cbfc6
    filesize: 1.9GB
    filetype: .gz
    number_of_variants: 42207
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:f599664c-be2b-4672-a0cb-55649de2ce0b
    name: chr13_data.vcf.gz
    md5sum: a2614d41e8ffdabf8f1ed8f2cbcd1479
    filesize: 2.8GB
    filetype: .gz
    number_of_variants: 63931
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:171aa6f5-e09d-4094-b660-fe0269d43567
    name: chr18_data.vcf.gz
    md5sum: 0f1a4c920c7121630a57b721fc876c04
    filesize: 2.5GB
    filetype: .gz
    number_of_variants: 57017
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:c6379949-8645-4d5e-837f-7112e0d984ab
    name: chr20_data.vcf.gz
    md5sum: 6595e284dbfa1a5787edf228356b97e4
    filesize: 4.3GB
    filetype: .gz
    number_of_variants: 96655
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:eb560ff9-0a18-492c-8720-dfe6fef3145d
    name: chrX_data.vcf.gz
    md5sum: 64c17e9b179a795bbb7e759000e711f5
    filesize: 3.8GB
    filetype: .gz
    number_of_variants: 86925
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:d6361e5f-b844-4b42-83fe-4906a7295f03
    name: chr22_data.vcf.gz
    md5sum: 30138d4b019ded3aaed8427dd8a06f87
    filesize: 4.2GB
    filetype: .gz
    number_of_variants: 94446
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:7093f81c-6f4e-415b-a465-363d20ffa553
    name: chr15_data.vcf.gz
    md5sum: 74aa176cbdc57357581703748d44aea0
    filesize: 5.6GB
    filetype: .gz
    number_of_variants: 127646
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:231e21a8-e379-43cd-a25b-7a7fa12f3c43
    name: chr9_data.vcf.gz
    md5sum: 0afb58ce21087d9e50bfa2d86793a8d9
    filesize: 7.1GB
    filetype: .gz
    number_of_variants: 161039
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:6e8945b4-3f3c-4d3a-afea-08d838f056ff
    name: chr10_data.vcf.gz
    md5sum: 3adcd480d18be470dacfae3e5f96d426
    filesize: 6.5GB
    filetype: .gz
    number_of_variants: 149730
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:d92a64f4-a4e6-43d9-bbeb-930d2210b84e
    name: chr7_data.vcf.gz
    md5sum: 6db789bff9c81208c328843e5781e7f6
    filesize: 8.1GB
    filetype: .gz
    number_of_variants: 181925
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:4ca85d29-56d8-44e1-812b-496d8fc11e40
    name: chr8_data.vcf.gz
    md5sum: 4bcd43a72b346de5b5787a403ef74e05
    filesize: 5.9GB
    filetype: .gz
    number_of_variants: 133894
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:d0bfe7af-b25c-445c-84b1-98bac22f963a
    name: chr14_data.vcf.gz
    md5sum: fb66388a3b5110af66d81e26253e188b
    filesize: 5.7GB
    filetype: .gz
    number_of_variants: 128137
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:ff11a24a-a18e-4ca4-9fa7-d815190390bb
    name: chr4_data.vcf.gz
    md5sum: e3af61e808ba72769d41139f14da6a37
    filesize: 6.1GB
    filetype: .gz
    number_of_variants: 140675
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:b3eaa07e-cefb-4a7b-ac6b-49f476078a80
    name: chr5_data.vcf.gz
    md5sum: 18000c1da5ab74f7f1dd75d9d2cc016b
    filesize: 7.0GB
    filetype: .gz
    number_of_variants: 161010
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:9998c158-0e4a-48d8-8da5-0bb5c659a15c
    name: chr12_data.vcf.gz
    md5sum: 96671ee7f5203edc897c091eeec95afa
    filesize: 8.5GB
    filetype: .gz
    number_of_variants: 193518
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:69cdce0a-8164-421e-a8bb-419732d8c5cc
    name: chr16_data.vcf.gz
    md5sum: d715e6d923f9cab2145d21988b8ebc4e
    filesize: 8.3GB
    filetype: .gz
    number_of_variants: 186300
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:6a1bf086-48e0-44c7-bc77-2dd9a9f55ac7
    name: chr3_data.vcf.gz
    md5sum: 20775cf2ee65817d8aeab72cc1f2c217
    filesize: 9.1GB
    filetype: .gz
    number_of_variants: 206875
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:1c1275fd-bc2f-410b-bc62-ff8a13a46dce
    name: chr6_data.vcf.gz
    md5sum: e158c0b18e4f47b23bdc9f022a125411
    filesize: 8.0GB
    filetype: .gz
    number_of_variants: 181754
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:f0fa15b4-2b5d-4c9d-befa-a1ab3cca9de4
    name: chr17_data.vcf.gz
    md5sum: 09644c190b7c6e7ef48be198e256452a
    filesize: 10.0GB
    filetype: .gz
    number_of_variants: 224774
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:a9db4b9d-ea30-4b0d-a1af-9ed5a3bcf9ce
    name: chr11_data.vcf.gz
    md5sum: ed62fd5d53cf7e2c412a5b6107b33aa2
    filesize: 10.2GB
    filetype: .gz
    number_of_variants: 227858
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:3c7e8732-e9a3-4dba-a784-fb75517d8d88
    name: chr19_data.vcf.gz
    md5sum: 292b7e544d1e4480489168c9fd0889a0
    filesize: 12.5GB
    filetype: .gz
    number_of_variants: 271080
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:41e14418-d698-44d4-8a57-7dc70749d6a8
    name: chr2_data.vcf.gz
    md5sum: e5c46d64ec1d086e8a6af1ae985112c2
    filesize: 11.8GB
    filetype: .gz
    number_of_variants: 272150
    number_of_participants: 11499
    belongs_to: data
  - id: alspacdcs:4dad866c-05dc-4c26-8ce6-8485c907df79
    name: chr1_data.vcf.gz
    md5sum: 4d4cdab191e20d80e68cd5ca1a8ae997
    filesize: 16.3GB
    filetype: .gz
    number_of_variants: 370645
    number_of_participants: 11499
    belongs_to: data
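
The `contains` list above pairs each compressed VCF with a `.csi` index. The following short Python sketch checks that every VCF in a received file list has its index, and that every index has its VCF. The function name `unpaired_files` is an illustrative assumption, and the example list is deliberately incomplete to show a reported gap.

```python
def unpaired_files(names):
    """Return (VCFs lacking a .csi index, .csi indexes lacking a VCF)."""
    names = set(names)
    vcfs = {n for n in names if n.endswith(".vcf.gz")}
    indexes = {n for n in names if n.endswith(".vcf.gz.csi")}
    missing_index = sorted(v for v in vcfs if v + ".csi" not in indexes)
    orphan_index = sorted(i for i in indexes if i[: -len(".csi")] not in vcfs)
    return missing_index, orphan_index


# Deliberately incomplete list: the chrY index is left out on purpose
example = ["chr21_data.vcf.gz", "chr21_data.vcf.gz.csi", "chrY_data.vcf.gz"]
missing_index, orphan_index = unpaired_files(example)
```

With the full file list from the freeze docs, both returned lists would be expected to be empty.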

Whole exome sequencing - G1 (wes_novaseq_g1)

Description

This dataset contains whole exome sequencing for G1 individuals. It was generated at the Broad Institute for ~2900 G1 individuals.
Reference genome build: GRCh38

Methodology

The exomes returned from the Broad Institute did not undergo PCA or relatedness filtering; instead, they were provided as raw VCF data. The following thresholds were applied to the samples:
- Chimera rate < 0.05
- Contamination rate < 0.10
- PF aligned rate < 0.60

87 individuals believed to be sample mismatches were removed from the dataset. These exomes had a discordance rate above 0.05 when compared to existing array data using bcftools gtcheck.
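
The mismatch filter above amounts to a threshold on a per-sample discordance rate. The following is a minimal Python sketch, assuming the rate is the number of discordant genotypes divided by the number of sites compared between the exome and array data. The function names and the example counts are hypothetical, and parsing of bcftools gtcheck output is not shown.

```python
DISCORDANCE_THRESHOLD = 0.05  # from the text: samples above this were removed


def discordance_rate(n_discordant, n_compared):
    """Fraction of compared sites where exome and array genotypes disagree."""
    if n_compared <= 0:
        raise ValueError("no sites compared")
    return n_discordant / n_compared


def split_by_discordance(counts, threshold=DISCORDANCE_THRESHOLD):
    """Split samples into (kept, removed) lists by discordance rate.

    `counts` maps sample ID -> (n_discordant, n_compared).
    """
    kept, removed = [], []
    for sample_id, (n_discordant, n_compared) in counts.items():
        rate = discordance_rate(n_discordant, n_compared)
        (removed if rate > threshold else kept).append(sample_id)
    return kept, removed


# Hypothetical counts for illustration only
kept, removed = split_by_discordance({"s1": (10, 1000), "s2": (80, 1000)})
```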

Associated publications:
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9980234/ (conducted additional QC beyond dataset)

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:wes_novaseq_g1_204-04-12_f6
name: >- 
  Whole Exome Sequencing - Novaseq - G1 version 2024-04-09 freeze 6
description: >-
  This contains whole exome sequencing done at the Broad Institute, first introduced in freeze 4. It contains data in vcf 4.2 format and an index file in csi format. It is a subset of the G1 cohort, with participants who have withdrawn their consent removed and omics IDs applied according to the freeze.
  
  Samples were selected for whole exome sequencing at the Broad Institute from the G1 cohort (the cohort of index children) and were from subjects who were singletons/unrelated and of European/British ancestry, had blood-derived DNA available, and had been genotyped on a whole genome genotyping array.

  QC was performed at the Broad Institute. The following thresholds were applied:
  Chimera rate < 0.05
  Contamination rate < 0.10
  PF aligned rate < 0.60

  87 individuals believed to be sample mismatches were removed from the dataset. These exomes had a discordance rate above 0.05 when compared to existing array data using bcftools gtcheck.

  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9980234/ describes this dataset in supplementary materials. 

freeze_size: 28G
linker_file_md5sum: 45415c7d4fae355b4fb2d6ccd042620d
woc_file_md5sum: 6c887db8c7dd10cc695630ca73b41405
all_individuals_to_exclude_md5sum: e4efce63f9f671548d08c8bb2f9cc4f7
git_tag: https://github.com/alspac/dataset_wes_novaseq_g1/releases/tag/freeze6
is_current_freeze: true
freeze_number: 6
freeze_date: 2025-09-30
previous_freeze: alspacdcs:wes_novaseq_g1_204-04-12_f5

freeze_of_alspac_dataset_version: alspacdcs:wes_novaseq_g1_2024-03-26
freeze_of_named_alspac_dataset: alspacdcs:wes_novaseq_g1

contains:
- data
files: []
data:
  contains:
  - all_chr.vcf.gz.csi
  - all_chr.vcf.gz
  files:
  - id: alspacdcs:614f9332-d95a-446a-a3fc-031649e1d6b3
    name: all_chr.vcf.gz.csi
    md5sum: afe84f33398dea988e73e6f66a781977
    filesize: 785.7KB
    filetype: .csi
    belongs_to: data
  - id: alspacdcs:60a87d01-0070-41a1-8026-fb726623bc40
    name: all_chr.vcf.gz
    md5sum: 3c6c1622289d2df4a5c871275c3bdb9a
    filesize: 27.1GB
    filetype: .gz
    number_of_variants: 2965032
    number_of_participants: 2879
    belongs_to: data

Epigenetic Data

DNA methylation - EPIC & 450k - G0 + G1 (dnam_epic450_g0_g1)

Description

This dataset contains methylation data collected from both G0 and G1 on two arrays at different timepoints. This dataset supersedes dnam_450_g0m_g1.

There is data from the Illumina Infinium HumanMethylation450K BeadChip array on G0 mothers at two timepoints (pregnancy and middle age), G1 participants at 5 timepoints (across birth, childhood and adolescence) and G0 participants at one timepoint. This dataset also contains data from the Infinium MethylationEPIC v1.0 array on 2721 G1 individuals at 2 timepoints.

This dataset was generated as part of the Accessible Resource for Integrated Epigenomics Studies (http://www.ariesepigenomics.org.uk/).

Methodology

Preprocessing and quality control for this dataset was conducted using Meffil.

Associated publications:
- https://doi.org/10.1093/ije/dyv072
- https://doi.org/10.1093/bioinformatics/bty476

Associated R packages:
- aries: https://github.com/MRCIEU/aries is associated with loading and using this dataset.
- meffil: https://github.com/perishky/meffil/ was used for QC and normalisation.

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:dnam_epic450_g0_g1_2022-7-13_f6
name: >-
  DNA methylation - EPIC & 450k - G0 + G1 version 2022-7-13 Freeze 6
description: >-
  This is the freeze 6 version of dnam_epic450_g0_g1, which was first introduced
  in freeze 2 and first released 2022-7-13.

  This dataset consists of multiple sections, each described below:
  Betas: 
      Normalized betas using functional normalization. We used 10 PCs on the control matrix to regress out technical variation. Slide was regressed out as a random effect before normalization. CpGs are in rows and samples in columns. These are in .gds format.
  control_matrix: 
      The 848 control probes are summarized in 42 control types. These probes can roughly be divided into negative control probes (613), probes intended for between-array normalization (186) and the remainder (49), which are designed for quality control, including assessing the bisulfite conversion rate. None of these probes are designed to measure a biological signal. The summarized control probes can be used as surrogates for unwanted variation and are used for the functional normalization. Samples are in rows and the 42 control types are in columns. These are in .txt format.
  derived:
      dnamage: 
          DNA methylation aging estimates for samples within the dataset, provided as a csv file (dnamage.csv). Further information on these data and their usage is found in `dnamage.html` and `dnamage.md` within the docs dir/folder.
      cellcounts:
          Files contain cell counts estimated using a variety of cell type references using the Houseman deconvolution algorithm (PMID: 22568884). In each file, samples correspond to rows and cell types to columns.
      reports:
          Collection of QC and normalization reports generated by the R meffil package upon freeze creation. This was first introduced in freeze 6. These are in html and md format, with accompanying figures in png format.
  detection_p_values:
      This matrix shows the detection p-values for each sample and each CpG and is extracted from the idat files using the "meffil.load.detection.pvalues" function in meffil. CpGs are in rows and samples in columns. These are .gds files.
  samplesheet:
      Manifest files with columns extracted directly from LIMS, together with age, sex, omics ID, timepoint, timecode, sampletype, genotype columns to report sample mismatches, and a duplicate.rm column to remove duplicates. Samples are in rows and variables in columns. These are csv files; samplesheet.csv is the same as samplesheet-common.csv.
  
  cell count files specific details:
      andrews-and-bakulski-cord-blood.txt
          Cord blood cell count estimates derived using the Bakulski et al. 2016 reference (PMID 27019159; https://bioconductor.org/packages/release/data/experiment/html/FlowSorted.CordBlood.450k.html). This reference has been implemented in meffil. Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, natural killer cells and nucleated red blood cells. In this text file, samples are in rows and cell types in columns.
      gervin-and-lyle-cord-blood.txt
          Cord blood cell count estimates derived using the Gervin et al. 2019 reference (PMID 31455416; GEO accession GSE127824). Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, and natural killer cells. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns.
      cord-blood-gse68456.txt
          Cord blood cell count estimates derived using the de Goede et al. 2015 reference (PMID 26366232; GEO accession GSE68456). Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, natural killer cells and nucleated red blood cells. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns.
      blood-gse35069-complete.txt
          Cell counts in peripheral blood predicted using the peripheral blood reference published in Reinius et al. 2012 (PMID: 22848472). Same as 'blood-gse35069.txt' but replaces granulocytes with eosinophils and neutrophils. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns.
      blood-gse35069.txt
          Blood cell count estimates derived using the Reinius et al. 2012 reference (PMID 25424692; GEO accession GSE35069). Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, and natural killer cells. In this text file, samples are in rows and cell types in columns.
      blood-idoloptimized-epic.txt
          Cell counts in peripheral blood predicted using the cell type reference from Bioconductor package FlowSorted.Blood.EPIC. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns.
      blood-idoloptimized.txt
          Cell counts in peripheral blood predicted using the cell type reference from Bioconductor package FlowSorted.Blood.EPIC but restricted to the IDOLOptimizedCpGs450klegacy CpG sites. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns.
      combined-cord-blood.txt
          Cord blood cell count estimates derived using the Bakulski et al., Gervin et al., de Goede et al., and Lin et al. references (https://bioconductor.org/packages/release/data/experiment/html/FlowSorted.CordBloodCombined.450k.html) for CpG sites selected using the IDOL algorithm and optimized for the Illumina Infinium HumanMethylation450 BeadChip. Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, natural killer cells and nucleated red blood cells. In this text file, samples are in rows and cell types in columns.

freeze_size: 
linker_file_md5sum: 45415c7d4fae355b4fb2d6ccd042620d
woc_file_md5sum: 6c887db8c7dd10cc695630ca73b41405
all_individuals_to_exclude_md5sum: e4efce63f9f671548d08c8bb2f9cc4f7
git_tag: https://github.com/alspac/dataset_dnam_epic450_g0_g1/releases/tag/Freeze6
is_current_freeze: true
freeze_number: 6
freeze_date: 2025-09-30 ### Update to align with date of release
previous_freeze: 5
freeze_of_alspac_dataset_version: alspacdcs:dnam_epic450_g0_g1_2022-7-13
freeze_of_named_alspac_dataset: alspacdcs:dnam_epic450_g0_g1

contains:
- data

files: []
data:
  contains:
  - samplesheet
  - derived
  - betas
  - detection_p_values
  - control_matrix
  files: []
  samplesheet:
    contains:
    - samplesheet-epic.csv
    - samplesheet-450.csv
    - samplesheet-common.csv
    - samplesheet.csv
    files:
    - id: alspacdcs:70eb011a-4680-4cb3-aa12-8dfae3ef55ca
      name: samplesheet-epic.csv
      md5sum: b186ad4f758ca51dfeb0e9cab45f2c3f
      filesize: 1.0MB
      filetype: .csv
      belongs_to: data/samplesheet
    - id: alspacdcs:ff6084b6-26f2-41f8-b4ec-1f59f0846ee6
      name: samplesheet-450.csv
      md5sum: 4ade52ccb70cea58acf588fd06e2952f
      filesize: 2.1MB
      filetype: .csv
      belongs_to: data/samplesheet
    - id: alspacdcs:bbdb5acd-979d-4078-8b4a-db7e57af77d8
      name: samplesheet-common.csv
      md5sum: 45032bf2732e0493f93f20afb3f588c4
      filesize: 3.2MB
      filetype: .csv
      belongs_to: data/samplesheet
    - id: alspacdcs:5e6a3daa-4078-40b4-8e0c-57f66d3c8511
      name: samplesheet.csv
      md5sum: 45032bf2732e0493f93f20afb3f588c4
      filesize: 3.2MB
      filetype: .csv
      belongs_to: data/samplesheet
  derived:
    contains:
    - dnamage.csv
    - reports
    - cellcounts
    files:
    - id: alspacdcs:9fc2876f-8cdf-43cd-a70f-e9e50160d284
      name: dnamage.csv
      md5sum: bd0c2efef6ee145cd0804d61c7e83151
      filesize: 11.2MB
      filetype: .csv
      belongs_to: data/derived
    reports:
      contains:
      - qc
      - normalization
      files: []
      qc:
        contains:
        - qc-report-450.html
        - qc-report-epic.html
        - qc-report-common.html
        - qc-report-450.md
        - qc-report-common.md
        - qc-report-epic.md
        - figure
        files:
        - id: alspacdcs:a7070157-9eac-4a11-bf92-4b0d21f70088
          name: qc-report-450.html
          md5sum: 2b3e191892ea9537adbf0559261961be
          filesize: 1.7MB
          filetype: .html
          belongs_to: data/derived/reports/qc
        - id: alspacdcs:36710710-e348-4971-8ccb-03af82681f42
          name: qc-report-epic.html
          md5sum: b2847e855c05d0b6bd3ccc70a5499fe3
          filesize: 1.7MB
          filetype: .html
          belongs_to: data/derived/reports/qc
        - id: alspacdcs:56b25c8b-8a2e-4cf1-82a0-e8e73131590d
          name: qc-report-common.html
          md5sum: d132611210a761836c41c0d2b466fc54
          filesize: 1.6MB
          filetype: .html
          belongs_to: data/derived/reports/qc
        - id: alspacdcs:3f35fece-6d92-47c9-8eee-b86f7ffc10ba
          name: qc-report-450.md
          md5sum: 74cf93e9000b1448661b8b8c9f83a085
          filesize: 21.1KB
          filetype: .md
          belongs_to: data/derived/reports/qc
        - id: alspacdcs:82b6d23d-02f5-4b06-8c98-deae14de0e53
          name: qc-report-common.md
          md5sum: ebb115f7ccc1f9b170da20e86046beb0
          filesize: 21.1KB
          filetype: .md
          belongs_to: data/derived/reports/qc
        - id: alspacdcs:03e7aa54-1f47-4f68-b51f-3f39c8b4893b
          name: qc-report-epic.md
          md5sum: b072dd1b731f524f5509d5e91a738e62
          filesize: 19.9KB
          filetype: .md
          belongs_to: data/derived/reports/qc
        figure:
          contains:
          - unnamed-chunk-16-1.png
          - unnamed-chunk-36-1.png
          - unnamed-chunk-5-1.png
          - unnamed-chunk-7-1.png
          - unnamed-chunk-35-1.png
          - unnamed-chunk-13-1.png
          - unnamed-chunk-9-1.png
          - unnamed-chunk-11-1.png
          - unnamed-chunk-12-1.png
          - unnamed-chunk-3-1.png
          files:
          - id: alspacdcs:496ef4c5-c5de-4e89-bcee-f36a700d8399
            name: unnamed-chunk-16-1.png
            md5sum: 60eb456b6848a4574507634f6110aff5
            filesize: 107.4KB
            filetype: .png
            belongs_to: data/derived/reports/qc/figure
          - id: alspacdcs:618340c6-f9c8-4058-9255-b3b71bbcb7ee
            name: unnamed-chunk-36-1.png
            md5sum: 1c1cbfadbc51a707f2d596c7b524cc3f
            filesize: 29.9KB
            filetype: .png
            belongs_to: data/derived/reports/qc/figure
          - id: alspacdcs:a9d24441-669b-4715-aa66-3fc3e6ea4cb3
            name: unnamed-chunk-5-1.png
            md5sum: 5d111e4bbc9bf43c14e0218f43b38ae8
            filesize: 126.9KB
            filetype: .png
            belongs_to: data/derived/reports/qc/figure
          - id: alspacdcs:6963248d-017c-402c-9540-be746811cc29
            name: unnamed-chunk-7-1.png
            md5sum: a736e2022533b8774eb71eaea5205503
            filesize: 452.8KB
            filetype: .png
            belongs_to: data/derived/reports/qc/figure
          - id: alspacdcs:ff23ca91-d373-4dd8-87e9-aff48e28dbdc
            name: unnamed-chunk-35-1.png
            md5sum: d387226410b191145b3a3da2ba725288
            filesize: 26.1KB
            filetype: .png
            belongs_to: data/derived/reports/qc/figure
          - id: alspacdcs:80dc37e4-07aa-4a46-906a-6124d2f66ea7
            name: unnamed-chunk-13-1.png
            md5sum: ce7fb6fee9571240aca9917e43548dbc
            filesize: 79.5KB
            filetype: .png
            belongs_to: data/derived/reports/qc/figure
          - id: alspacdcs:97770508-bbd3-42f3-8812-529f8eb0c079
            name: unnamed-chunk-9-1.png
            md5sum: c4a04fe47757519977dd3ca2f6f05908
            filesize: 56.8KB
            filetype: .png
            belongs_to: data/derived/reports/qc/figure
          - id: alspacdcs:743c3f26-3694-4a95-9e96-a67d3ea3f80e
            name: unnamed-chunk-11-1.png
            md5sum: fb762339b8382426fc163086764cd8f7
            filesize: 33.4KB
            filetype: .png
            belongs_to: data/derived/reports/qc/figure
          - id: alspacdcs:a8cc7846-d470-4cf7-b23a-eb2386e06fe9
            name: unnamed-chunk-12-1.png
            md5sum: 13ba50bfb39465f8f3af6532367c18af
            filesize: 268.5KB
            filetype: .png
            belongs_to: data/derived/reports/qc/figure
          - id: alspacdcs:a4138686-1912-4ea7-97fe-006172270faf
            name: unnamed-chunk-3-1.png
            md5sum: 770268d74e2f4e73cc2fcf4f7198fb9b
            filesize: 106.4KB
            filetype: .png
            belongs_to: data/derived/reports/qc/figure
      normalization:
        contains:
        - norm-report-450.html
        - norm-report-epic.html
        - norm-report-common.html
        - norm-report-epic.md
        - norm-report-450.md
        - norm-report-common.md
        - figure
        files:
        - id: alspacdcs:27351be8-a689-436a-b30e-358a3441ef68
          name: norm-report-450.html
          md5sum: aba86d0952ddcd684fbd494c11c9feb7
          filesize: 1.8MB
          filetype: .html
          belongs_to: data/derived/reports/normalization
        - id: alspacdcs:8bd2e716-8d6c-4f37-88ee-700398bf0dcc
          name: norm-report-epic.html
          md5sum: f07cbf1e5f46b12770520dfc909e419d
          filesize: 1.7MB
          filetype: .html
          belongs_to: data/derived/reports/normalization
        - id: alspacdcs:d7dadf30-de9c-4666-b053-d5340a0d3579
          name: norm-report-common.html
          md5sum: b7c8a003959b2b426252fabd307dc397
          filesize: 1.7MB
          filetype: .html
          belongs_to: data/derived/reports/normalization
        - id: alspacdcs:93f6590a-8ba6-4dec-9a92-22956531e803
          name: norm-report-epic.md
          md5sum: c0a90deb92e022cd440996b114990c38
          filesize: 11.6KB
          filetype: .md
          belongs_to: data/derived/reports/normalization
        - id: alspacdcs:007eb8c6-facc-40ae-855b-63b41df638bf
          name: norm-report-450.md
          md5sum: d36090a1fdac51bbd314f01bca31a22f
          filesize: 20.1KB
          filetype: .md
          belongs_to: data/derived/reports/normalization
        - id: alspacdcs:4afe1adf-5dcc-48c9-a9f0-ed8c975b4566
          name: norm-report-common.md
          md5sum: bd71c88eaafaba4abb5644ac3a43f185
          filesize: 24.5KB
          filetype: .md
          belongs_to: data/derived/reports/normalization
        figure:
          contains:
          - unnamed-chunk-47-1.png
          - unnamed-chunk-45-1.png
          - unnamed-chunk-46-1.png
          - unnamed-chunk-32-1.png
          - unnamed-chunk-27-1.png
          - unnamed-chunk-43-1.png
          - unnamed-chunk-78-1.png
          - unnamed-chunk-76-1.png
          - unnamed-chunk-48-1.png
          - unnamed-chunk-44-1.png
          - unnamed-chunk-38-1.png
          - unnamed-chunk-61-1.png
          - unnamed-chunk-64-1.png
          - unnamed-chunk-50-1.png
          - unnamed-chunk-21-1.png
          - unnamed-chunk-77-1.png
          - unnamed-chunk-35-1.png
          - unnamed-chunk-74-1.png
          - unnamed-chunk-67-1.png
          - unnamed-chunk-72-1.png
          - unnamed-chunk-42-1.png
          - unnamed-chunk-56-1.png
          - unnamed-chunk-75-1.png
          - unnamed-chunk-23-1.png
          - unnamed-chunk-49-1.png
          - unnamed-chunk-79-1.png
          - unnamed-chunk-71-1.png
          - unnamed-chunk-22-1.png
          - unnamed-chunk-51-1.png
          - unnamed-chunk-80-1.png
          - unnamed-chunk-73-1.png
          files:
          - id: alspacdcs:8504be18-9cf2-4f25-b89c-dbb7bc32a1bc
            name: unnamed-chunk-47-1.png
            md5sum: e8a702b741c582874e86c524178b3224
            filesize: 43.6KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:0f88ece8-0b4f-4fb9-b792-1001e22f4f1d
            name: unnamed-chunk-45-1.png
            md5sum: 397ceb400ba6cc54f8957a04a7223b80
            filesize: 40.5KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:1b7de838-299d-41e7-b71f-25abe6655548
            name: unnamed-chunk-46-1.png
            md5sum: 2a361ec26c3d5d729fc3ecc385f3dc37
            filesize: 41.6KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:61e18d09-99de-4071-a146-45375913c274
            name: unnamed-chunk-32-1.png
            md5sum: e2fd34ad90f83276ac8531d441b6efea
            filesize: 12.9KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:4bec956f-9904-4fcd-9b86-1df9d759fae3
            name: unnamed-chunk-27-1.png
            md5sum: d741adec199194e22300578b399b2adf
            filesize: 212.6KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:39bd9815-c0ae-46d6-985a-e23a1b6fcad0
            name: unnamed-chunk-43-1.png
            md5sum: db015f1d599967aacff3e918ca210ed7
            filesize: 45.0KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:f5c9a61b-da3c-4dc7-96fb-4e8646fda8e7
            name: unnamed-chunk-78-1.png
            md5sum: 1376d286deab61aff5b82ca728e79193
            filesize: 40.2KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:6222599c-5d8c-4c6c-ac4a-53a339726aaa
            name: unnamed-chunk-76-1.png
            md5sum: b3a9faa0ba9216acccb586e28e6cc7e7
            filesize: 35.3KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:5e6da925-c4de-4468-9c6c-8e8d100b2595
            name: unnamed-chunk-48-1.png
            md5sum: 974eb0b4ab2a678d374df2ae5fa4cb36
            filesize: 41.4KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:4522e502-c9ce-4586-9c05-e49abd83ae25
            name: unnamed-chunk-44-1.png
            md5sum: c16673a6bb0109a0c533055bcccfd9b0
            filesize: 40.8KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:fc0ff28a-e49e-4a1d-8ce7-3ab6b874c4bb
            name: unnamed-chunk-38-1.png
            md5sum: 523af90e3cb6d55e1927bdcc7ba581c1
            filesize: 14.9KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:43d661fe-a4ae-44f8-85f9-7abc89e0fb16
            name: unnamed-chunk-61-1.png
            md5sum: 8909fde9844719d9c0dc6d4f27d6bee7
            filesize: 14.1KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:df11f3a1-616b-4c48-8d7e-11963516aa07
            name: unnamed-chunk-64-1.png
            md5sum: 7feb3f164bc369cae2872b7f4e67b6a9
            filesize: 15.7KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:fe380c9b-5cec-4b25-b006-cfb999e5e441
            name: unnamed-chunk-50-1.png
            md5sum: 63d4be9ba5dd131d42c5a1755a32ab3c
            filesize: 37.7KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:4ef9c734-79a5-43ca-83f7-8341034378fd
            name: unnamed-chunk-21-1.png
            md5sum: 843c2535db5036f216cde2995c7d6e4d
            filesize: 7.9KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:34f39e92-0032-4f12-8a97-f0e34ca5c2eb
            name: unnamed-chunk-77-1.png
            md5sum: 3cd5906468110fc3efa0e7059e50f011
            filesize: 33.6KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:556bd24f-dc52-43ae-a7fc-0d36710441bc
            name: unnamed-chunk-35-1.png
            md5sum: 62397c374b15bafae8d4c77b5ab66903
            filesize: 13.8KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:a776fb34-97c7-4ff4-8805-c08108b232af
            name: unnamed-chunk-74-1.png
            md5sum: dc69a1f6993cea3c5c289edbebe1395f
            filesize: 34.6KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:a5ed73d6-5ceb-4e2d-8eff-355f7a66af27
            name: unnamed-chunk-67-1.png
            md5sum: 8310a8c9e27a59e61f93314fc83d4f0f
            filesize: 13.3KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:f1cb7439-8fd0-4821-b6fb-fd8f71bec3a5
            name: unnamed-chunk-72-1.png
            md5sum: 62659ae0e97f0d410803cdb84f86ad06
            filesize: 43.0KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:bebe6e86-f64f-4624-8ea0-46b0018c6993
            name: unnamed-chunk-42-1.png
            md5sum: cd0df689b84096cbb12bca6965b465dc
            filesize: 43.7KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:72654aa4-46cb-4df7-9383-fc5e79f9e2f2
            name: unnamed-chunk-56-1.png
            md5sum: a305a033260dd8391902463465d4539b
            filesize: 195.0KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:855b211f-c5a9-4cc9-9c2a-7c13e4816ddc
            name: unnamed-chunk-75-1.png
            md5sum: 86dbcf633fe7791f3fcd2730bbaca1fe
            filesize: 33.7KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:d7b53a91-ddea-4975-8962-cac7adafd94c
            name: unnamed-chunk-23-1.png
            md5sum: 7a6f89888fd84a613415240965293fe2
            filesize: 8.3KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:477db1b7-d261-42ec-822b-77f70a645a52
            name: unnamed-chunk-49-1.png
            md5sum: f6b29e0a8dab04649277a20958a9bdd3
            filesize: 40.9KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:de610732-67e7-4d03-84ad-8bd0c3a6b3ec
            name: unnamed-chunk-79-1.png
            md5sum: b348bd0dfdea753fbfa4502b5e4e85ef
            filesize: 37.8KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:bb91723a-f756-45be-bcaf-24dbc9e8db46
            name: unnamed-chunk-71-1.png
            md5sum: 540e6c04a4bf0de24c33072f251de261
            filesize: 35.3KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:efa4497e-b13d-40d5-b86c-ca7a86edf94f
            name: unnamed-chunk-22-1.png
            md5sum: a197d74fb49c30f66707e60a8bdebd33
            filesize: 7.9KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:b5bc7cc3-a98e-49df-8fd4-955a2a612837
            name: unnamed-chunk-51-1.png
            md5sum: ff43a8fecff1d2aab034fc95dc58353d
            filesize: 36.1KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:f7f8c4cc-37c2-4c70-9a6b-646777b8b277
            name: unnamed-chunk-80-1.png
            md5sum: a0024a6beb1ae9e8f1049cc6494b41b0
            filesize: 37.9KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
          - id: alspacdcs:0a9c7485-73bc-4992-b58b-e338a65d253b
            name: unnamed-chunk-73-1.png
            md5sum: 6640674615e33436ee7f004385ccae38
            filesize: 44.7KB
            filetype: .png
            belongs_to: data/derived/reports/normalization/figure
    cellcounts:
      contains:
      - blood-idoloptimized-epic.txt
      - combined-cord-blood.txt
      - andrews-and-bakulski-cord-blood.txt
      - gervin-and-lyle-cord-blood.txt
      - blood-gse35069.txt
      - blood-gse35069-complete.txt
      - blood-idoloptimized.txt
      - cord-blood-gse68456.txt
      files:
      - id: alspacdcs:bde83af1-b866-418a-9778-91fb56b4da3f
        name: blood-idoloptimized-epic.txt
        md5sum: 7331e83d31e1d200bbff3d041223cde1
        filesize: 345.9KB
        filetype: .txt
        belongs_to: data/derived/cellcounts
      - id: alspacdcs:f17a7996-c9af-408c-8ae2-93ed3ebc1524
        name: combined-cord-blood.txt
        md5sum: 7cbcf72ca00012d17d22ff6d21b7575c
        filesize: 128.2KB
        filetype: .txt
        belongs_to: data/derived/cellcounts
      - id: alspacdcs:e5e3dfbb-1eb2-4c37-a480-55e1e39cfdbf
        name: andrews-and-bakulski-cord-blood.txt
        md5sum: 33c69aa8e50deb28355dcb82d01c7510
        filesize: 113.7KB
        filetype: .txt
        belongs_to: data/derived/cellcounts
      - id: alspacdcs:e553d486-dd1e-424d-80c6-327f6f749740
        name: gervin-and-lyle-cord-blood.txt
        md5sum: 099c4cf9bd4ecfee91c19c3c2d2b6f70
        filesize: 99.5KB
        filetype: .txt
        belongs_to: data/derived/cellcounts
      - id: alspacdcs:8c7e54c5-a525-44f4-a553-003b09436936
        name: blood-gse35069.txt
        md5sum: 53fb63b4cef457d90688b3ddb861fa73
        filesize: 1020.6KB
        filetype: .txt
        belongs_to: data/derived/cellcounts
      - id: alspacdcs:4cddb029-b2ec-4794-8290-6f01f18b0d0f
        name: blood-gse35069-complete.txt
        md5sum: 27ab648c56b56e62709a98fcba95a764
        filesize: 1.1MB
        filetype: .txt
        belongs_to: data/derived/cellcounts
      - id: alspacdcs:5956609a-dee6-4552-a59f-d6db466fc45c
        name: blood-idoloptimized.txt
        md5sum: 2c2bdbf34093960af969ca37ae43c77b
        filesize: 1.1MB
        filetype: .txt
        belongs_to: data/derived/cellcounts
      - id: alspacdcs:051b757c-971b-45e5-a010-f269e09fc9fb
        name: cord-blood-gse68456.txt
        md5sum: 941f8a9ce1289ab5baaf10fb29bd8941
        filesize: 129.8KB
        filetype: .txt
        belongs_to: data/derived/cellcounts
  betas:
    contains:
    - epic.gds
    - 450.gds
    - common.gds
    files:
    - id: alspacdcs:c0cb59ba-fb67-4a49-87ee-12a86c9d82fc
      name: epic.gds
      md5sum: 0357486c3af3b5ee120c7b05bf077340
      filesize: 17.5GB
      filetype: .gds
      belongs_to: data/betas
    - id: alspacdcs:d4ad6861-58d0-4500-9807-eefdd933dd68
      name: 450.gds
      md5sum: 02e9b3cdda39d3476bfce111f5935f93
      filesize: 21.3GB
      filetype: .gds
      belongs_to: data/betas
    - id: alspacdcs:c27c47a4-6f52-474f-93f6-3ffa9360fef4
      name: common.gds
      md5sum: 2d447051e6241bf35dc1bfba4e740848
      filesize: 29.1GB
      filetype: .gds
      belongs_to: data/betas
  detection_p_values:
    contains:
    - epic.gds
    - 450.gds
    - common.gds
    files:
    - id: alspacdcs:fe210696-2b93-4d85-8211-fdae1ff5f647
      name: epic.gds
      md5sum: 341d1194d468e10e80be9dc9990c474b
      filesize: 17.7GB
      filetype: .gds
      belongs_to: data/detection_p_values
    - id: alspacdcs:01a334ae-04d5-4d4a-9fff-b365c8893c54
      name: 450.gds
      md5sum: 1c437226b2aab0c00aed7098e739f49d
      filesize: 21.5GB
      filetype: .gds
      belongs_to: data/detection_p_values
    - id: alspacdcs:aeaf85d2-4522-431b-975e-2ae0a85e4fbe
      name: common.gds
      md5sum: c6f4348fa7d92a5f341f69e1784036da
      filesize: 29.3GB
      filetype: .gds
      belongs_to: data/detection_p_values
  control_matrix:
    contains:
    - epic.txt
    - common.txt
    - 450.txt
    files:
    - id: alspacdcs:9042df18-14aa-4597-8d56-6d2e991a0c0a
      name: epic.txt
      md5sum: 7a680d3ccd26a491ec7dde2ce91eeeab
      filesize: 1008.8KB
      filetype: .txt
      belongs_to: data/control_matrix
    - id: alspacdcs:440e5699-4274-40f1-a49b-dcb6d47af57a
      name: common.txt
      md5sum: 42d21ff7a2ead483e85b909b279e9912
      filesize: 3.1MB
      filetype: .txt
      belongs_to: data/control_matrix
    - id: alspacdcs:b8218208-65c7-4174-8c5e-e259fbb0da6b
      name: 450.txt
      md5sum: 9e6aa62498c5bb7493f7512e274056ba
      filesize: 2.1MB
      filetype: .txt
      belongs_to: data/control_matrix
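
The manifest above is nested: each section names its sub-sections under `contains` and lists its own files under `files`. File names are not unique across sections (for example, `epic.gds` appears under both `betas` and `detection_p_values`), so a file is identified by `belongs_to` together with `name`. The Python sketch below walks a small excerpt of the manifest, transcribed by hand into a dictionary for illustration; in practice the document would be parsed with a YAML library, which is not shown. The helper names `iter_files` and `full_path` are hypothetical.

```python
from collections import defaultdict

# Trimmed excerpt of the manifest above, transcribed by hand for illustration.
manifest = {
    "contains": ["samplesheet", "betas", "detection_p_values"],
    "files": [],
    "samplesheet": {
        "contains": ["samplesheet-common.csv", "samplesheet.csv"],
        "files": [
            {"name": "samplesheet-common.csv",
             "md5sum": "45032bf2732e0493f93f20afb3f588c4",
             "belongs_to": "data/samplesheet"},
            {"name": "samplesheet.csv",
             "md5sum": "45032bf2732e0493f93f20afb3f588c4",
             "belongs_to": "data/samplesheet"},
        ],
    },
    "betas": {
        "contains": ["epic.gds"],
        "files": [
            {"name": "epic.gds",
             "md5sum": "0357486c3af3b5ee120c7b05bf077340",
             "belongs_to": "data/betas"},
        ],
    },
    "detection_p_values": {
        "contains": ["epic.gds"],
        "files": [
            {"name": "epic.gds",
             "md5sum": "341d1194d468e10e80be9dc9990c474b",
             "belongs_to": "data/detection_p_values"},
        ],
    },
}


def iter_files(section):
    """Yield every file entry in a section and, recursively, its sub-sections."""
    yield from section.get("files", [])
    for value in section.values():
        if isinstance(value, dict):
            yield from iter_files(value)


def full_path(entry):
    """Build a path from belongs_to and name, which together identify a file."""
    return f"{entry['belongs_to']}/{entry['name']}"


paths = sorted(full_path(entry) for entry in iter_files(manifest))

# Group paths by checksum to spot files with identical content.
by_md5 = defaultdict(list)
for entry in iter_files(manifest):
    by_md5[entry["md5sum"]].append(full_path(entry))
identical = {md5: sorted(found) for md5, found in by_md5.items() if len(found) > 1}
```

In this excerpt, `identical` groups `samplesheet.csv` with `samplesheet-common.csv`, which agrees with the note in the description that the two files are the same.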

Gene Expression Data

Gene expression - array - G1 (ge_ht12_g1)

Description

There are two different types of QC’d data available in this version, one performed by David Evans for the Bryois et al. 2014 paper, and one performed by Gibran Hemani for the molgenis eQTL mapping meta-analysis. A version without QC is available as well. Details on the QC’d versions can be seen below.

This data was generated from LCLs. The majority of samples used in their generation were collected at age 9 years. LCLs are lymphoblastoid cell lines, which were produced by transforming lymphocytes with Epstein-Barr virus and cultured before DNA was extracted. Gene expression patterns may not be the same as those of untransformed lymphocytes taken from a 9-year-old.

Methodology

Bryois:
- LCLs from unrelated individuals were grown under identical conditions and cells frozen in RNAlater. RNA was extracted using an RNeasy extraction kit (Qiagen) and was amplified using the Illumina TotalPrep-96 RNA Amplification kit (Ambion). Expression profiling of the samples, each with two technical replicates, was performed using Illumina Human HT-12 V3 BeadChips (Illumina Inc) including 48,804 probes, where 200 ng of total RNA was processed according to the protocol supplied by Illumina. Raw data were imported into the Illumina Beadstudio software and probes with fewer than three beads present were excluded. Log2-transformed expression signals were then normalized with quantile normalization of the replicates of each individual followed by quantile normalization across all individuals.

We restricted our analysis to 23,935 probes tagging genes annotated in Ensembl. Principal component analysis was performed on 931 individuals. 62 individuals with principal component 1 or 2 greater than one standard deviation of the population were excluded from further analysis. See http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004461 for full details.

Molgenis:
- Genetic outliers, meaning any individuals that were clear outliers in the first 2 genetic principal components, were removed. Each probe was quantile normalised and then log2 transformed. The data were then adjusted for the first 4 genetic MDS components and for expression principal components (excluding those that had genetic associations), and scaled to have mean 0 and variance 1. See https://github.com/molgenis/systemsgenetics/wiki/eQTL-mapping-analysis-cookbook for full details.
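
The order of the Molgenis transformation steps can be illustrated with a small Python sketch. It is one plausible reading of the description and not the pipeline's actual code. The following are assumptions: quantile normalisation is applied across samples (columns), ties are broken by row order, scaling uses the population standard deviation, and the adjustment for genetic MDS components and expression principal components is omitted. The function names and the toy matrix are hypothetical.

```python
import math
from statistics import mean, pstdev


def quantile_normalise(matrix):
    """Give every sample (column) of a probes x samples matrix the same distribution."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the k-th smallest value across all samples.
    rank_means = [mean(col[k] for col in sorted_cols) for k in range(n_rows)]
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        # Row indices ordered by value; ties keep their original row order.
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            out[i][j] = rank_means[rank]
    return out


def log2_transform(matrix):
    """Apply log2 to every value (values must be positive)."""
    return [[math.log2(value) for value in row] for row in matrix]


def standardise_rows(matrix):
    """Scale each probe (row) to mean 0 and variance 1 (population SD assumed)."""
    scaled = []
    for row in matrix:
        mu, sd = mean(row), pstdev(row)
        scaled.append([(value - mu) / sd if sd > 0 else 0.0 for value in row])
    return scaled


# Toy probes x samples matrix with made-up values.
toy = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
]
normalised = quantile_normalise(toy)
scaled = standardise_rows(log2_transform(normalised))
```

After `quantile_normalise`, every column of the toy matrix contains the same set of values, and after `standardise_rows` each probe has mean 0 and unit variance across samples.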

Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:ge_ht12_g1_2015-11-02_f6
name: Gene expression - array - G1 release version 2015-11-02 freeze 6
description: >-
  This is the sixth freeze of the 2015-11-02 version of the
  ge_ht12_g1 dataset, which has .csv distributions of the data rather than
  .Rdata files in order to be easier to use across different data
  science software and languages.

freeze_size: 2.6G
linker_file_md5sum: 45415c7d4fae355b4fb2d6ccd042620d
woc_file_md5sum: 6c887db8c7dd10cc695630ca73b41405
all_individuals_to_exclude_md5sum: e4efce63f9f671548d08c8bb2f9cc4f7
git_tag: https://github.com/alspac/dataset_ge_ht12_g1/releases/tag/freeze6
is_current_freeze: true
freeze_number: 6
freeze_date: 2025-09-30
previous_freeze: alspacdcs:ge_ht12_g1_2015-11-02_f5
freeze_of_alspac_dataset_version: alspacdcs:ge_ht12_g1_2015-11-02
freeze_of_named_alspac_dataset: alspacdcs:ge_ht12_g1

contains:
- data
- docs
files: []
data:
  contains:
  - bryois.csv
  - molgenis.csv
  - raw.csv
  files:
  - id: alspacdcs:ad99beeb-c614-4f26-ba39-1823a17c9fb2
    name: bryois.csv
    md5sum: 47f1e98d0b16a448362c299f86d80bbb
    description: >-
      Csv version of the bryois data.
      IDs in columns and Illumina probe IDs in rows.
      This is the normalised data used in Bryois et al 2014.
      Probe IDs are mapped to genes in raw.csv
    filesize: 741.2MB
    filetype: .csv
    belongs_to: data
    number_of_participants: 947
    number_of_gene_expression_probe_values: 48630
  - id: alspacdcs:aa466bac-6e5a-427a-aecc-89a7114dc811
    name: molgenis.csv
    md5sum: 6fe74a566bad2d2357554adf53a41960
    description: >-
      The freeze 6 csv version of the molgenis data.
      IDs in columns and Illumina probe IDs in rows.
      Normalised data following the molgenis pipeline,
      found at
      https://github.com/molgenis/systemsgenetics/wiki/eQTL-mapping-analysis-cookbook.
      Probe IDs are mapped to genes in raw.csv
    filesize: 751.2MB
    filetype: .csv
    belongs_to: data
    number_of_participants: 879
    number_of_gene_expression_probe_values: 48630
  - id: alspacdcs:d83e0b3c-5a0e-4116-be15-9f89d58b3675
    name: raw.csv
    md5sum: 3f6ac964549b12dea0cd245f7f8b9dcd
    description: >-
      The freeze 6 csv version of the raw ge data.
      IDs in columns and probes in rows. Four columns per
      individual, with two columns for average signal and two columns
      for average number of beads.
      Presumably this is a file generated by the Illumina Genome
      Studio software.
    filesize: 1.1GB
    filetype: .csv
    belongs_to: data
    number_of_participants: 994
    number_of_gene_expression_probe_values: 48630
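
The md5sum fields listed above can be used to check that a delivered file arrived intact. The following is a minimal sketch in Python, assuming the file has been placed in the working directory. The helper names md5_of_file and matches_catalogue are hypothetical and are not part of any ALSPAC tooling.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_catalogue(path, expected_md5):
    """Compare a file's digest with the md5sum listed in the freeze yaml."""
    return md5_of_file(path) == expected_md5.strip().lower()

# Example call (assumes bryois.csv is in the current directory):
# matches_catalogue("bryois.csv", "47f1e98d0b16a448362c299f86d80bbb")
```

Reading in chunks matters here because the data files are hundreds of megabytes or more.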

Omics tips

Introduction

This section is a guide to using ’Omics datasets. It explains which software to use and describes common file formats. It’s a good starting point for beginners and helpful for problem-solving.

Disclaimer

Some information is copied or reworded from software documentation. Check the original documentation alongside this guide for up-to-date information. Note that some links may no longer work.

Operating systems

You can use ALSPAC data with any operating system, but Unix-based systems like Macintosh, Linux, or BSD are more convenient due to the data’s size and complexity. We recommend using the command line and programming scripts with languages like Bash, R, Python, or Perl. Many online resources are available to learn these tools. Use free/libre and open-source software where possible.

Links:

Key Omics software

Plink

Plink is a tool for performing quality control and whole genome association analysis of genetic data.

Link: http://zzz.bwh.harvard.edu/plink/

SNPTest

SNPTest is a tool for performing whole genome association analysis of genetic data. It is not open source.

Link: https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html

BoltLmm

BoltLmm is a tool for performing genome association analysis of genetic data. It is recommended for analyses of more than 5000 samples, and its methods automatically take population substructure into account.

Link: https://data.broadinstitute.org/alkesgroup/BOLT-LMM/

Qctool

A tool for quality control of genetic data. It is also useful for inspecting and modifying .gen, .bgen and .vcf files (see the File types section below).

Link: https://www.well.ox.ac.uk/~gav/qctool_v2/

SAMTOOLS

Samtools is a suite of tools used for genomic analysis.

Link: http://www.htslib.org/

VCFTOOLS

A set of tools for working with vcf files.

Link: https://vcftools.github.io/index.html

BCFTOOLS

Part of the samtools project; it allows users to manipulate .vcf and .bcf files.

Link: http://samtools.github.io/bcftools/bcftools.html

File types

In a Unix environment the suffix (extension) of a file name does not explicitly mean anything to the operating system, unlike in Windows, which uses the extension to decide how a file should be opened. In a Unix system the extension is just part of the file name, and humans use it to distinguish file formats. The following is a non-exhaustive list of file types you may encounter whilst using ALSPAC Omics data.

.gen

This is an ‘oxford’ data format for genetic data. The .gen file is a plain text file, which means that standard Unix command line tools such as ‘head’ or ‘less’ can be used to inspect the data.

The .gen (genotype) file stores data in a one-line-per-SNP format. The first 5 entries of each line are the SNP ID, the RS ID of the SNP, the base-pair position of the SNP, the allele coded A and the allele coded B. The SNP ID can be used to denote the chromosome number of each SNP. The next three numbers on the line are the probabilities of the three genotypes AA, AB and BB at the SNP for the first individual in the cohort. The next three numbers are the genotype probabilities for the second individual, the next three are for the third individual, and so on. The order of individuals in the genotype file should match the order of the individuals in the sample file (see below). The probabilities need not sum to 1, which allows for the possibility of a NULL genotype call. This format allows for genotype uncertainty and is the same as that produced by the genotype calling algorithm CHIAMO.

NOTE: We recommend that you arrange SNPs in base-pair order in the genotype files. This is required if you want to use the files with IMPUTE and will make viewing the output of SNPTEST somewhat easier.

For example, suppose you want to create a genotype file for 2 individuals at 5 SNPs whose genotypes are:

SNP 1 AA AA
SNP 2 GG GT
SNP 3 CC CT
SNP 4 CT CT
SNP 5 AG GG

The correct genotype file would look like this:

SNP1 rs1 1000 A C 1 0 0 1 0 0
SNP2 rs2 2000 G T 1 0 0 0 1 0
SNP3 rs3 3000 C T 1 0 0 0 1 0
SNP4 rs4 4000 C T 0 1 0 0 1 0
SNP5 rs5 5000 A G 0 1 0 0 0 1
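
The layout described above can be read with a few lines of Python. This is a minimal sketch rather than a replacement for tools such as qctool. The function names and the 0.9 calling threshold are assumptions chosen for illustration.

```python
def parse_gen_line(line):
    """Split one .gen line into SNP metadata and per-individual probabilities."""
    fields = line.split()
    snp_id, rs_id, position, allele_a, allele_b = fields[:5]
    values = [float(x) for x in fields[5:]]
    if len(values) % 3 != 0:
        raise ValueError("expected three probabilities per individual")
    # Group the flat list into (P(AA), P(AB), P(BB)) triples.
    triples = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return {
        "snp_id": snp_id,
        "rs_id": rs_id,
        "position": int(position),
        "allele_a": allele_a,
        "allele_b": allele_b,
        "probabilities": triples,
    }

def best_guess(record, threshold=0.9):
    """Return the most likely genotype per individual, or None if uncertain."""
    a, b = record["allele_a"], record["allele_b"]
    labels = (a + a, a + b, b + b)
    calls = []
    for triple in record["probabilities"]:
        top = max(range(3), key=lambda k: triple[k])
        # The threshold is an illustrative choice, not a recommended value.
        calls.append(labels[top] if triple[top] >= threshold else None)
    return calls

record = parse_gen_line("SNP5 rs5 5000 A G 0 1 0 0 0 1")
print(best_guess(record))  # ['AG', 'GG']
```

An individual whose three probabilities are all zero (a NULL call) is returned as None.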

.bgen

A binary version of a .gen file. This file cannot be visually inspected on the command line. .bgen files are used because they are much smaller and faster to process than the plain text equivalent when storing large amounts of Omics data. The full details of the file format are discussed at: https://www.well.ox.ac.uk/~gav/bgen_format/

.bgen files are normally used with tools such as qctool and snptest. There is also a library for reading .bgen files into R: https://bitbucket.org/gavinband/bgen/wiki/rbgen

.sample

The .sample file is paired with either .gen or .bgen files. It contains information on the samples that is not genetic. It is a plain text file that can be inspected with standard Unix command line tools.

Please note that the sample file format changed with the release of SNPTEST v2. Specifically, the way in which covariates and phenotypes are coded on the second line of the file has changed. The sample file has three parts: (a) a header line detailing the names of the columns in the file, (b) a line detailing the types of variables stored in each column, and (c) a line for each individual detailing the information for that individual. Here is an example of the start of a sample file for reference:

ID_1 ID_2 missing cov_1 cov_2 cov_3 cov_4 pheno1 bin1
0 0 0 D D C C P B
1 1 0.007 1 2 0.0019 -0.008 1.233 1
2 2 0.009 1 2 0.0022 -0.001 6.234 0
3 3 0.005 1 2 0.0025 0.0028 6.121 1
4 4 0.007 2 1 0.0017 -0.011 3.234 1
5 5 0.004 3 2 -0.012 0.0236 2.786 0

The header line: This line needs a minimum of three entries. The first three entries should always be ID_1, ID_2 and missing. They denote that the first three columns contain the first ID, second ID and missing data proportion of each individual. Additional entries on this line should be the names of covariates or phenotypes that are included in the file. In the above example, there are 4 covariates named cov_1, cov_2, cov_3, cov_4, a continuous phenotype named pheno1 and a binary phenotype named bin1. NOTE: All phenotypes should appear after the covariates in this file.

The second line: This line details the type of variable included in each column. The first three entries of this line should be set to 0. Subsequent entries in this line for covariates and phenotypes should be specified by the following rules:

D Discrete covariate (coded using positive integers)
C Continuous covariates
P Continuous Phenotype
B Binary Phenotype (0 = Controls, 1 = Cases)

The remainder of the file should consist of a line for each individual containing the information specified by the entries of the header line (see example above). Separate the entries of the sample file with spaces, not tabs, because a space is the delimiter the software expects.

Missing values: Missing values can be specified for covariates and phenotypes. SNPTEST v1 assumed -9 as the default missing value, and this was the recommended code, although its -missing_code option allowed other numeric values to be used. In SNPTEST v2 the behavior of the -missing_code option has changed so that it now takes a comma-separated list of values, each of which is treated as missing when encountered in the sample file(s). The default missing value is now the two-character string “NA”.
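
The three-part structure of the sample file can be read as follows. This is a minimal sketch in Python that assumes the SNPTEST v2 layout described above with “NA” as the missing code. The function name read_sample_file is hypothetical, and the shortened example data are made up.

```python
import io

def read_sample_file(handle, missing_codes=("NA",)):
    """Parse a SNPTEST v2 style .sample file into column types and row dicts."""
    names = handle.readline().split()
    types = handle.readline().split()
    if names[:3] != ["ID_1", "ID_2", "missing"]:
        raise ValueError("first three columns must be ID_1, ID_2 and missing")
    if len(types) != len(names):
        raise ValueError("type line and header line differ in length")
    rows = []
    for line in handle:
        values = line.split()
        if not values:
            continue  # skip blank lines
        if len(values) != len(names):
            raise ValueError("row has the wrong number of entries: " + line.strip())
        # Values listed in missing_codes become None; everything else stays text.
        rows.append({n: (None if v in missing_codes else v)
                     for n, v in zip(names, values)})
    return dict(zip(names, types)), rows

example = io.StringIO(
    "ID_1 ID_2 missing cov_1 pheno1 bin1\n"
    "0 0 0 D P B\n"
    "1 1 0.007 1 1.233 1\n"
    "2 2 0.009 2 NA 0\n"
)
column_types, individuals = read_sample_file(example)
print(column_types["pheno1"], individuals[1]["pheno1"])  # P None
```

Checking the number of entries per row catches files in which a value has been split or merged by a stray space.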

.ped

A plink format file that is in plain text and can be viewed with standard tools. It contains genetic variant data. https://www.cog-genomics.org/plink/1.9/formats#ped

.map

A plink format file that is in plain text. It contains information about variants. https://www.cog-genomics.org/plink/1.9/formats#map

.bed

A plink format file that is a binary equivalent of a .ped file. It is smaller and faster to process but is not easily viewable or editable. https://www.cog-genomics.org/plink/1.9/formats#bed

.bim

A plink format file, similar to a .map file, that is used with binary .bed files. https://www.cog-genomics.org/plink/1.9/formats#bim

.fam

A plain text format that contains sample information for plink binary files. https://www.cog-genomics.org/plink/1.9/formats#fam

.csv

A plain text format where different fields are separated by commas (comma-separated values).
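
Large .csv files do not need to be loaded in full to look up a single row. The sketch below uses Python's standard csv module and assumes a layout with one row per probe, the probe ID in the first column and the remaining column names in the header row. The function name read_probe and the toy IDs are hypothetical.

```python
import csv
import io

def read_probe(handle, probe_id):
    """Return {column_name: value} for one probe from a probes-in-rows csv."""
    reader = csv.reader(handle)
    header = next(reader)
    # Stream the file row by row and stop at the first matching probe.
    for row in reader:
        if row and row[0] == probe_id:
            return dict(zip(header[1:], (float(v) for v in row[1:])))
    raise KeyError(probe_id)

# Toy data with made-up IDs; real files are far larger.
toy = io.StringIO("probe,id_a,id_b\nILMN_1,7.1,7.4\nILMN_2,9.0,8.6\n")
print(read_probe(toy, "ILMN_2"))  # {'id_a': 9.0, 'id_b': 8.6}
```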

.vcf

VCF files are a flexible file format for storing different types of genetic variants. They are a plain text format that can be inspected on the command line with standard Unix tools. However, they are often very large files, and specific tools such as ‘vcftools’ are useful for working with this data. SNPs are commonly stored in these files, but other variants such as copy number variations can also be stored. The basic form of a vcf file is described at: https://en.wikipedia.org/wiki/Variant_Call_Format
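
Because a vcf file is plain text, its data lines can be read directly. The following is a minimal sketch in Python that assumes an uncompressed, tab-separated file whose FORMAT column includes a GT entry. The function name iter_vcf_genotypes and the toy records are hypothetical, and dedicated tools remain the better choice for real files.

```python
def iter_vcf_genotypes(lines):
    """Yield (chrom, pos, variant_id, ref, alt, genotypes) for each data line."""
    samples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank or meta-information line
        fields = line.split("\t")
        if line.startswith("#"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos, variant_id, ref, alt = fields[:5]
        genotypes = {}
        if len(fields) > 9:
            # Find where GT sits among the colon-separated FORMAT keys.
            gt_index = fields[8].split(":").index("GT")
            for sample, entry in zip(samples, fields[9:]):
                genotypes[sample] = entry.split(":")[gt_index]
        yield chrom, int(pos), variant_id, ref, alt, genotypes

toy_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\n",
    "1\t1000\trs1\tA\tC\t.\tPASS\t.\tGT\t0/0\t0/1\n",
]
for record in iter_vcf_genotypes(toy_vcf):
    print(record)
```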

.bcf

This is a binary version of a vcf file. It cannot be inspected on the command line, but can be used with the genomic tools mentioned in this document.

.tar.gz

This is a standard Unix file format for bundling and compressing a set of files. It is similar to a .zip file. It is made by first bundling a set of files into a .tar file (sometimes called a tarball), which is then compressed using ‘gzip’. https://en.wikipedia.org/wiki/Tar_(computing) https://en.wikipedia.org/wiki/Gzip
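
The bundle-then-compress idea can be seen with Python's standard tarfile module, which performs both steps in one call. This is a minimal sketch; the function names bundle and list_members are hypothetical.

```python
import os
import tarfile

def bundle(paths, archive_path):
    """Bundle files into a tar archive and gzip-compress it in one step."""
    # Mode "w:gz" writes a tar bundle through gzip compression.
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in paths:
            archive.add(path, arcname=os.path.basename(path))

def list_members(archive_path):
    """Return the names of the files stored in a .tar.gz archive."""
    # Mode "r:gz" reads a tar bundle through gzip decompression.
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()
```

Listing the members before unpacking is a quick way to check what an archive contains.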

.enc

This file extension is used as a convention to mean that the file is encrypted. You will need the password that was used to encrypt the data in order to decrypt the files. https://en.wikipedia.org/wiki/OpenSSL

Variant/SNP ids

There are many types of genetic variation. A common type is a single nucleotide polymorphism (SNP). Others include copy number variations.

Variants can be specified by a Chromosome and location in reference to a specific build of the human genome. They can also be given a reference SNP (rs) cluster identifier.

Overview of Imputation reference panels

SNP array data frequently contain hundreds of thousands of variants. However, due to linkage disequilibrium it is possible to estimate many more SNP values for an individual. This estimation procedure is called imputation, and it works by combining an individual's SNP array data with a large reference population of sequenced data. In this way it is possible to have accurate estimations of millions of SNP values for an individual without the cost of fully sequencing each person. ALSPAC has pre-run the imputation process using three different imputation panels.

Panels

SNP data types from imputation.

SNPs that have been imputed can be stored and analysed in different formats. These can be appropriate for different types of analysis; for example, an analysis could assume an additive effect for the minor allele or it could assume a recessive/dominant effect.
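
To make the difference between these codings concrete, the sketch below converts one individual's genotype probabilities (AA, AB, BB) into additive, dominant and recessive codings. It is a minimal illustration in Python with hypothetical function names. It counts the allele coded B, which is an assumption, since B is not necessarily the minor allele in a given file.

```python
def dosage(p_aa, p_ab, p_bb):
    """Expected number of B alleles (additive coding), between 0 and 2."""
    return p_ab + 2.0 * p_bb

def dominant(p_aa, p_ab, p_bb):
    """Probability of carrying at least one B allele (B treated as dominant)."""
    return p_ab + p_bb

def recessive(p_aa, p_ab, p_bb):
    """Probability of carrying two B alleles (B treated as recessive)."""
    return p_bb

# Made-up probabilities for one individual at one SNP.
probs = (0.1, 0.7, 0.2)
print(dosage(*probs), dominant(*probs), recessive(*probs))
```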

SNP Statistics

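You can generate statistics on your data using the program QCTOOL. Per-variant statistics, which include the imputation information scores, are requested with the -snp-stats option (with output written using -osnp); per-sample statistics are requested with -sample-stats. For example, for per-sample statistics: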

qctool -g example.bgen -s example.sample -sample-stats -osample sample-stats.txt
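
To give a feel for what an imputation information score measures, the sketch below computes an IMPUTE-style info value for one SNP from genotype probabilities. This is an assumption about the form of the statistic, written in Python for illustration only; the values reported by QCTOOL may be defined or computed differently, and individuals with NULL calls are not handled. The function name impute_info is hypothetical.

```python
def impute_info(triples):
    """IMPUTE-style info for one SNP from (P(AA), P(AB), P(BB)) triples."""
    n = len(triples)
    # Expected B-allele count and its second moment for each individual.
    expected = [p_ab + 2.0 * p_bb for _, p_ab, p_bb in triples]
    second_moment = [p_ab + 4.0 * p_bb for _, p_ab, p_bb in triples]
    theta = sum(expected) / (2.0 * n)  # estimated frequency of allele B
    if theta <= 0.0 or theta >= 1.0:
        return 1.0  # monomorphic SNP: treated as fully informative here
    # Total remaining uncertainty in the imputed allele counts.
    variance = sum(f - e * e for e, f in zip(expected, second_moment))
    return 1.0 - variance / (2.0 * n * theta * (1.0 - theta))

certain = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
uninformative = [(0.25, 0.5, 0.25)] * 4
print(impute_info(certain), impute_info(uninformative))  # 1.0 0.0
```

Values near 1 indicate genotypes imputed with high certainty, and values near 0 indicate that the imputation added little beyond the allele frequency.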

Best practice

GWAS

We recommend you follow the steps outlined in the following paper when performing GWAS: Marees, Andries T., et al. “A tutorial on conducting genome‐wide association studies: Quality control and statistical analysis.” International journal of methods in psychiatric research 27.2 (2018): e1608. https://doi.org/10.1002/mpr.1608

Phewas

We recommend you follow the steps outlined in the following paper when performing Phewas: Millard, L., Davies, N., Timpson, N. et al. MR-PheWAS: hypothesis prioritization among potential causal effects of body mass index on many outcomes, using Mendelian randomization. Sci Rep 5, 16645 (2015). https://doi.org/10.1038/srep16645

Methylation

The following paper describes the methylation data available in ALSPAC: Relton, Caroline L., et al. “Data resource profile: accessible resource for integrated epigenomic studies (ARIES).” International journal of epidemiology 44.4 (2015): 1181-1190.

Population stratification

Population stratification is when an observed genetic association is due to differences in ancestry or geography between groups in the sample rather than to the variant itself. Not taking this into account can lead to biased estimates of effects. One common method to account for this is to calculate principal components (PCs) of the genetic data and then to include these as covariates in any models.

ALSPAC do not provide PCs as part of the standard omics datasets, as these would require being re-generated and tested alongside each freeze. PCs can be generated using plink, hail or a variety of other tools.

For more information about how to do this in plink see: https://www.cog-genomics.org/plink/1.9/strat

Another common method used to account for population substructure is to use linear mixed models, for example with the BOLT-LMM software tool.

https://data.broadinstitute.org/alkesgroup/BOLT-LMM/

Polygenic risk scores (PRS)

These are scores which estimate the combined effect of the variants in an individual's genome on a given phenotypic trait or disease.

Further explanations can be found online, such as: https://www.genome.gov/Health/Genomics-and-Medicine/Polygenic-risk-scores

An example tutorial for calculating PRSs can be found here: https://www.nature.com/articles/s41596-020-0353-1

Different collaborators often generate PRSs for ALSPAC, but these are not shared as part of our standard omics datasets. Collaborators who need PRSs will have to generate these themselves.
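
At its simplest, a PRS is a weighted sum of allele dosages. The sketch below shows that calculation for one individual in Python. The variant IDs and weights are made up, the function name polygenic_score is hypothetical, and the code assumes that the dosages already count the same effect allele as the weights; real analyses need the quality control and allele alignment steps covered in the tutorial linked above.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages over the variants in `weights`."""
    # Only variants present in both inputs contribute to the score.
    shared = [v for v in weights if v in dosages]
    if not shared:
        raise ValueError("no variants in common between dosages and weights")
    return sum(weights[v] * dosages[v] for v in shared), len(shared)

# Hypothetical per-variant effect sizes and one individual's dosages.
weights = {"rs1": 0.5, "rs2": -0.25, "rs3": 1.0}
dosages = {"rs1": 2.0, "rs2": 1.0}
score, n_used = polygenic_score(dosages, weights)
print(score, n_used)  # 0.75 2
```

Returning the number of variants used makes it easy to spot scores built from only a small part of the intended variant list.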

Common tasks

Here we provide links to webpages with instructions, and brief code examples, for completing common tasks using the software described above (see Key Omics software):

https://www.well.ox.ac.uk/~gav/qctool_v2/documentation/examples/filtering_variants.html

http://zzz.bwh.harvard.edu/plink/dataman.shtml

plink --bfile mydata --chr 2 --from-kb 5000 --to-kb 10000

Plink supports bgen files but it is fussy about the column types in the data.sample file. You may wish to remove or retype columns before reading a data.sample file into plink. For more info see:

https://www.cog-genomics.org/plink/2.0/input

To make a new sample file with some columns removed you can use the Unix command:

cut -f 1,2,3 -d " " data.sample > data2.sample

Courses

Working with ’Omics data can be complicated but there are many excellent resources available to help you learn how to do this. There are both paid in-person courses and free online courses.

Details on paid courses offered by Bristol University can be found here: https://www.bristol.ac.uk/medical-school/study/short-courses/ In addition, a number of free online courses are summarised here: https://www.mooc-list.com/tags/bioinformatics

Further sources of help

Stack exchange

Stack Exchange is an online Q&A community which is divided into different sub-communities. The first and most well-known is Stack Overflow. This is one of the best places to ask questions about programming on the Internet. Other useful exchange sites include bioinformatics https://bioinformatics.stackexchange.com/, maths https://mathoverflow.net/ and statistics https://stats.stackexchange.com/.

Bio-stars

Biostars is a bioinformatics community Q&A website: https://www.biostars.org/

Mailing lists

Individual products or projects often have a mailing list. For example, to get help using SNPTEST you can ask on the mailing list: https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html#contact

AI tools

AI tools such as ChatGPT can be useful for understanding how to work with omics data, but please be aware of their limitations and check the documentation or research papers directly.

Ask ALSPAC

If you cannot find the answer to your question or you think there is something wrong with your data then please contact the alspac-omics@bristol.ac.uk mailbox and we will do our best to help you.