ALSPAC OMICs Data Catalogue

1 Introduction

Welcome to the ALSPAC Omics Catalogue, a guide to the omics data offered by ALSPAC. This catalogue features a variety of named ALSPAC datasets, each consisting of collected or produced data that has been organized, named, and curated for ease of use. Every named ALSPAC dataset comes with accompanying metadata that provides information about the dataset as a whole. Each named ALSPAC dataset has at least one release version that includes a curated selection of files detailed in the metadata sections.

Please note that these datasets are not generally accessible. See http://www.bristol.ac.uk/alspac/researchers/access/ for details of how to request access.

The information within this catalogue is made available for browsing to help both internal ALSPAC users and external researchers understand the data and facilitate prospective data requests.

For external ALSPAC collaborators, we offer as standard "freezes" of specific versions of named ALSPAC datasets. These freezes, along with their metadata, are outlined in this catalogue. External collaborators will be granted access to these freezes once their request has been approved. A freeze is a carefully selected subset of the data files within a version: it contains the core data from a dataset, with individuals who have withdrawn consent removed and specific dataset IDs applied. These freezes are subject to periodic updates.

Due to the removal of withdrawn individuals from the freezes, please note that the number of participants within each dataset may change over time and may not match those found in the Methodology fields.

Freeze 1 timing: July 2021 - Dec 2022
Freeze 2 timing: Dec 2022 - Dec 2023
Freeze 3 timing: Jan 2023 - Oct 2024
Freeze 4 timing: Oct 2024 - June 2025
Freeze 5 timing: June 2025 - Current

Documentation for the current freeze of each dataset is provided below as a YAML file, which lists the files external collaborators will receive together with their metadata.

NamedALSPACDataset DatasetVersion Freeze
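
The freeze identifiers used in the YAML files below appear to encode all three levels of this hierarchy (named dataset, dataset version, freeze). The following is a minimal sketch of splitting such an identifier into its parts. The pattern is an assumption inferred from the identifiers shown in this catalogue (for example alspacdcs:gwa_550_g1_2022-12-05_f5), not something taken from the schema, and the names FreezeId and parse_freeze_id are hypothetical.

```python
import re
from typing import NamedTuple


class FreezeId(NamedTuple):
    """The three levels encoded in a freeze identifier (hypothetical helper)."""
    dataset: str        # named ALSPAC dataset, e.g. "gwa_550_g1"
    version: str        # dataset version (release date), e.g. "2022-12-05"
    freeze_number: int  # freeze number, e.g. 5


# Assumed layout: alspacdcs:<dataset>_<YYYY-MM-DD>_f<number>
_FREEZE_ID = re.compile(
    r"^alspacdcs:(?P<dataset>[a-z0-9_]+)"
    r"_(?P<version>\d{4}-\d{2}-\d{2})"
    r"_f(?P<freeze>\d+)$"
)


def parse_freeze_id(freeze_id: str) -> FreezeId:
    """Split a freeze identifier into dataset, version and freeze number."""
    match = _FREEZE_ID.match(freeze_id)
    if match is None:
        raise ValueError(f"not a recognised freeze id: {freeze_id!r}")
    return FreezeId(match["dataset"], match["version"], int(match["freeze"]))


if __name__ == "__main__":
    print(parse_freeze_id("alspacdcs:gwa_550_g1_2022-12-05_f5"))
```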

2 Genetic Array Data

2.1 Genome-wide - Illumina 550 quad - G1 (gwa_550_g1)

2.1.1 Description

This dataset contains genome-wide array genotype calls for G1 individuals. Reference genome build: NCBI36 (hg18)

2.1.2 Methodology

ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platform by 23andMe subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%) and insufficient sample replication (IBD < 0.8).

Population stratification was assessed by multidimensional scaling analysis and compared with HapMap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed.

SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed. Cryptic relatedness was measured as proportion of identity by descent (IBD > 0.1). Related subjects were removed.
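
The SNP-level thresholds above can be expressed as a simple predicate. This is an illustrative sketch only, not the pipeline that was actually run: the SnpStats record and the passes_snp_qc function are hypothetical names, and per-SNP summary statistics are assumed to have been computed already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnpStats:
    """Per-SNP summary statistics (hypothetical record for illustration)."""
    snp_id: str
    maf: float        # minor allele frequency, 0..0.5
    call_rate: float  # fraction of samples with a non-missing call, 0..1
    hwe_p: float      # Hardy-Weinberg equilibrium test P value


def passes_snp_qc(
    snp: SnpStats,
    min_maf: float = 0.01,        # SNPs with MAF < 1% are removed
    min_call_rate: float = 0.95,  # SNPs with call rate < 95% are removed
    min_hwe_p: float = 5e-7,      # SNPs with HWE P < 5E-7 are removed
) -> bool:
    """Return True if the SNP survives all three thresholds from the text."""
    return (
        snp.maf >= min_maf
        and snp.call_rate >= min_call_rate
        and snp.hwe_p >= min_hwe_p
    )


if __name__ == "__main__":
    example = SnpStats("rs_example", maf=0.12, call_rate=0.99, hwe_p=0.4)
    print(example.snp_id, passes_snp_qc(example))
```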

Associated publication:

2.1.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gwa_550_g1_2022-12-05_f5
name: >-
  Genome-wide array data for G1 individuals 2022-12-05 freeze 5
description: >-
  The fifth freeze of the genome-wide array data for G1 based on a
  2022-12-05 release. The data is in plink format.
freeze_size: 997M
linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a
git_tag: https://github.com/alspac/dataset_gwa_550_g1/releases/tag/freeze5
is_current_freeze: true
freeze_number: 5
freeze_date: 2025-02-27
previous_freeze: alspacdcs:gwa_550_g1_2022-12-05_f4
freeze_of_alspac_dataset_version: alspacdcs:gwa_550_g1_2022-12-05
freeze_of_named_alspac_dataset: alspacdcs:gwa_550_g1

has_containers:
  - id: alspacdcs:1600720f-a580-4999-9bd6-4bbcd60554bb ## uuid
    name: data
    description: A dir/folder containing the two freeze data files


has_parts:
  - id: alspacdcs:b84cc4d9-20b0-40d1-93d2-b5a4d221af3b
    name: Biallelic genotype table
    description: >-
      genotype data
    data_distributions:
      - id: alspacdcs:a8552a46-1740-4056-8adf-38d32f6a7472
	name: freeze_id.bed
	description: >- 
	  Plink bed file.
	  Primary representation of genotype calls at biallelic
	  variants. Must be accompanied by .bim and .fam files.
	md5sum: 94973786388f80000dcdad0a80514e37
	filesize: 982M
	filetype: .bed
	number_of_participants: 8223
	number_of_variants: 500527
	belongs_to_container: alspacdcs:1600720f-a580-4999-9bd6-4bbcd60554bb
  - id: alspacdcs:f79e204e-8f5c-4c85-9bd1-07b0e1f1e874
    name: Variant Information
    description: >-
      Information about SNPS
    data_distributions:
      - id: alspacdcs:9b0b34c4-f31c-48e9-8cbd-f87d3257de11
	name: freeze_id.bim
	description: >-
	   Extended variant information file accompanying a .bed binary
	   genotype table. (--make-just-bim can be used to update just
	   this file.) A text file with no header line, and one line per
	   variant with the following six fields:

	   1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT';
	      '0' indicates unknown) or name
	   2. Variant identifier
	   3. Position in morgans or centimorgans (safe to use dummy value of '0')
	   4. Base-pair coordinate (1-based; limited to 2^31-2)
	   5. Allele 1 (corresponding to clear bits in .bed; usually minor)
	   6. Allele 2 (corresponding to set bits in .bed; usually major)
	md5sum: b0789ac6126af474c916c80f77335f6a
	filesize: 14M
	filetype: .bim
	number_of_variants: 500527
	belongs_to_container: alspacdcs:1600720f-a580-4999-9bd6-4bbcd60554bb
  - id: alspacdcs:6097a628-e5a5-4b03-9281-2efff7ae48f5
    name: sample info
    description: >-
      Sample ids
    data_distributions:
      - id: alspacdcs:76b09971-7168-420f-b4f2-7f6482f5d0ef
	name: freeze_id.fam
	description: >-
	  A text file with no header line, and one line per sample
	  with the following six fields:
	    1. Family ID ('FID')
	    2. Within-family ID ('IID'; cannot be '0')
	    3. Within-family ID of father ('0' if father isn't in dataset)
	    4. Within-family ID of mother ('0' if mother isn't in dataset)
	    5. Sex code ('1' = male, '2' = female, '0' = unknown)
	    6. Phenotype value ('1' = control, '2' = case,
	    '-9'/'0'/non-numeric =
	    missing data if case/control)
	md5sum: 2bc551594141e9da29b24488bdd2afe7
	filesize: 256k
	filetype: .fam
	number_of_participants: 8223
	belongs_to_container: alspacdcs:1600720f-a580-4999-9bd6-4bbcd60554bb
  - id: alspacdcs:48aeacef-1450-4519-8eb7-c4c25420a4df
    name: Heterozygous haploid and nonmale Y chromosome call list
    description: >-
      A plink report
    data_distributions:
      - id: alspacdcs:303a3c36-8b63-4c03-94a9-7fb35bf2885e
	name: freeze_id.hh
	description: >-
	  Produced automatically when the input data contains
	  heterozygous calls where they shouldn't be possible (haploid
	  chromosomes, male X/Y), or there are nonmissing calls for
	  nonmales on the Y chromosome.

	  A text file with one line per error (sorted primarily by
	  variant ID, secondarily by sample ID) with the following three fields:

	  Family ID
	  Within-family ID
	  Variant ID
	md5sum: cce791501bb562953f352b9f54eacecb
	filesize: 1.7M
	filetype: .hh
	belongs_to_container: alspacdcs:1600720f-a580-4999-9bd6-4bbcd60554bb
  - id: alspacdcs:b243c546-07d3-43cb-9649-1926973f7211
    name: Logs
    description: >-
      plink log
    data_distributions:
      - id: alspacdcs:34f0b85a-c298-4e21-9b8a-f9644d582a1e
	name: freeze_id.log
	description: >-
	  plink log file
	md5sum: 5a63dd14cc69e894f78758f7ca3d8197
	filesize: 512
	filetype: .log
	belongs_to_container: alspacdcs:1600720f-a580-4999-9bd6-4bbcd60554bb
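
Each data distribution in a freeze description carries an md5sum, so a received freeze can be checked against its documentation. The sketch below uses only the Python standard library. The function names and the mapping of file name to expected checksum are hypothetical, and extracting those values from the YAML file is left to the reader.

```python
import hashlib
from pathlib import Path
from typing import List, Mapping


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(folder: Path, expected: Mapping[str, str]) -> List[str]:
    """Return the names of files whose checksum differs from the expected one.

    `expected` maps a file name (the `name` field of a data distribution)
    to its documented `md5sum`. A missing file raises FileNotFoundError.
    """
    return [
        name
        for name, md5sum in expected.items()
        if md5_of_file(Path(folder) / name) != md5sum.lower()
    ]
```

An empty list from find_mismatches means every listed file matched its documented checksum.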

2.2 Genome-wide - Illumina exome core array - G0 partners (gwa_exome_g0p)

2.2.1 Description

This dataset contains genome wide array genotype calls for G0 mothers and partners. Reference genome build: GRCh37

2.2.2 Methodology

3,453 ALSPAC mothers and fathers and 535,478 SNPs were genotyped using the Illumina HumanCoreExome chip genotyping platform by the ALSPAC lab and called using GenomeStudio. The resulting raw genome-wide data were subjected to standard quality control methods using PLINK (v1.07). Individuals were excluded on the basis of gender mismatches (n = 80); minimal or excessive heterozygosity (n = 64); disproportionate levels of individual missingness (>5%, n = 60) and possible contamination (n = 3).

Population stratification was assessed by multidimensional scaling analysis and compared with 1000 Genomes phase 3 data and principal component analysis (n = 266); all individuals with non-European ancestry were removed. Cryptic relatedness was measured as SNP relatedness in GCTA (relatedness > 0.1, n = 69 removed). SNPs with a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 1E-7) and those which failed GenomeStudio quality control measures were removed (n = 21,298). 6,594 duplicate SNPs were also removed. This resulted in 2,911 unrelated mothers and fathers genotyped at 507,586 SNPs. We then identified 2217 samples where the ALN assigned historically by the lab matched the genetically assigned ALN.
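
The counts quoted above are internally consistent, which can be checked with a few lines of arithmetic. The sketch assumes that each n in the text is the number of individuals (or SNPs) removed at that step and that the exclusions do not overlap. That reading is an interpretation, although it reproduces the stated totals.

```python
# Individuals: start from the 3,453 genotyped mothers and fathers.
genotyped_individuals = 3453
individual_exclusions = {
    "gender mismatch": 80,
    "minimal or excessive heterozygosity": 64,
    "individual missingness > 5%": 60,
    "possible contamination": 3,
    "non-European ancestry (assumed meaning of n = 266)": 266,
    "cryptic relatedness > 0.1": 69,
}
remaining_individuals = genotyped_individuals - sum(individual_exclusions.values())

# SNPs: start from the 535,478 genotyped SNPs.
genotyped_snps = 535478
snp_exclusions = {
    "call rate, HWE or GenomeStudio QC failure": 21298,
    "duplicate SNPs": 6594,
}
remaining_snps = genotyped_snps - sum(snp_exclusions.values())

if __name__ == "__main__":
    print(remaining_individuals)  # matches the 2,911 stated in the text
    print(remaining_snps)         # matches the 507,586 stated in the text
```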

1737 putative G0 partner-G1 pairs for whom both G0 partner and G1 have called genotype data available were identified based on ALN. Given the G0 partners were invited by the G0 mother to take part and only enrolled in the study in their own right several years later, it could not be assumed that all G0 partners were biologically related to G1. Called genotype data for the 1720 unique G0 partners and 1737 unique G1s were merged (i.e. there were 17 pairs of siblings/twins among the G1 offspring), using plink v1.90b7.2 64-bit (11 Dec 2023).

After application of the plink filters --geno 0.05, --maf 0.01, --snps-only just-acgt and --autosome, 113288 SNPs remained. The --related command in KING version 2.3.2 was used to perform kinship analysis, which confirmed that all 1737 putative G0 partner-G1 pairs are genetically related, as would be expected for biological father-offspring pairs under the inference criteria described in Table 1 of Manichaikul, Ani, et al. "Robust relationship inference in genome-wide association studies." Bioinformatics 26.22 (2010): 2867-2873.
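
As an illustration of how kinship-based inference of this kind works, the sketch below classifies a pair by its estimated kinship coefficient. The cut-offs (powers of two centred on the expected kinship for each degree) are the ones commonly quoted from the cited paper and should be checked against its Table 1 before being relied upon. The function name is hypothetical, and this is not a description of KING's internal implementation. Kinship alone does not separate parent-offspring pairs from full siblings; that distinction additionally uses the proportion of markers sharing zero alleles identical by state or by descent.

```python
def infer_relationship_degree(kinship: float) -> str:
    """Classify a pair by estimated kinship coefficient.

    Assumed cut-offs (to be verified against Table 1 of Manichaikul et al.):
      > 2^-1.5 (about 0.354)            duplicate or monozygotic twin
      2^-2.5 .. 2^-1.5 (about 0.177)    1st-degree (parent-offspring or full sibling)
      2^-3.5 .. 2^-2.5 (about 0.0884)   2nd-degree
      2^-4.5 .. 2^-3.5 (about 0.0442)   3rd-degree
      otherwise                         unrelated
    """
    if kinship > 2 ** -1.5:
        return "duplicate/MZ twin"
    if kinship > 2 ** -2.5:
        return "1st-degree"
    if kinship > 2 ** -3.5:
        return "2nd-degree"
    if kinship > 2 ** -4.5:
        return "3rd-degree"
    return "unrelated"


if __name__ == "__main__":
    # A biological father-offspring pair has an expected kinship of 0.25.
    print(infer_relationship_degree(0.25))
```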

2.2.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gwa_exome_g0p_2016-11-22_f5
name: Freeze 5 version 2016-11-22 Genome-wide - Illumina exome core array - G0 partners
description: >-
  Freeze 5 version 2016-11-22 Genome-wide array data including raw files and genotype calls for G0 partners, also including additional G0 mothers who were absent from previous genotyping rounds
freeze_size: 289M
linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a
git_tag: https://github.com/alspac/dataset_gwa_exome_g0p/releases/tag/freeze5
is_current_freeze: true
freeze_number: 5
freeze_date: 2025-02-27
previous_freeze: alspacdcs:gwa_exome_g0p_2016-11-22_f4
freeze_of_alspac_dataset_version: alspacdcs:gwa_exome_g0p_2016-11-22
freeze_of_named_alspac_dataset: alspacdcs:gwa_exome_g0p

has_containers:
  - id: alspacdcs:0e154bae-77b3-40bd-b81c-a5b127cf9184
    name: data
    description: A dir/folder containing the plink data files

has_parts:
- id: alspacdcs:041a43f13-1e58-4fea-a9d8-722dfe40bb1d
  name: freeze_id
  data_distributions:
  - id: alspacdcs:646c2553-6799-47b3-b84c-4f533ec5ebed
    name: freeze_id.fam
    description: >-
	A text file with no header line, and one line per sample with the following six fields:

	1. Family ID ('FID')
	2. Within-family ID ('IID'; cannot be '0')
	3. Within-family ID of father ('0' if father isn't in dataset)
	4. Within-family ID of mother ('0' if mother isn't in dataset)
	5. Sex code ('1' = male, '2' = female, '0' = unknown)
	6. Phenotype value ('1' = control, '2' = case,
	'-9'/'0'/non-numeric =
	missing data if case/control)

	Here both of the first two fields hold the full ID of the
	participant, i.e. they are not separate family and within-family IDs.
    md5sum: 422fe647fc778a80f6cf39815eb7691f
    filesize: 128KB
    filetype: .fam
    number_of_participants: 2198
    belongs_to_container: alspacdcs:0e154bae-77b3-40bd-b81c-a5b127cf9184

  - id: alspacdcs:2bb32a02-893a-4d6c-972e-14256a5ed3a4
    name: freeze_id.bim
    description: >-
      Extended variant information file accompanying a .bed binary
	genotype table. (In plink, --make-just-bim can be used to update just
	this file.) A text file with no header line, and one line per
	variant with the following six fields:

	  1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT';
	  '0' indicates unknown)
	  or name
	  2. Variant identifier
	  3. Position in morgans or centimorgans (safe to use dummy value of '0')
	  4. Base-pair coordinate (1-based; limited to 2^31-2)
	  5. Allele 1 (corresponding to clear bits in .bed; usually minor)
	  6. Allele 2 (corresponding to set bits in .bed; usually major)

    md5sum: 0fe43f888776059fef0a76d3f08d00ad
    filesize: 14MB
    filetype: .bim
    number_of_variants: 507586
    belongs_to_container: alspacdcs:0e154bae-77b3-40bd-b81c-a5b127cf9184

  - id: alspacdcs:8fbc90b4-cadd-41e2-95c4-72495aa273a8
    name: freeze_id.bed
    description: >-
      Primary representation of genotype calls at biallelic
      variants. Must be accompanied by .bim and .fam files.

    md5sum: 304b0d356880c5174806ce08d7beffd3
    filesize: 267M
    filetype: .bed
    number_of_participants: 2198
    number_of_variants: 507586
    belongs_to_container: alspacdcs:0e154bae-77b3-40bd-b81c-a5b127cf9184

  - id: alspacdcs:2628b6e2-01e9-4822-92f2-972af0dbca42
    name: freeze_id.log
    md5sum: c6f073df29726db7df0aab3cefc82a0d
    filesize: 512B
    filetype: .log
    belongs_to_container: alspacdcs:0e154bae-77b3-40bd-b81c-a5b127cf9184

  - id: alspacdcs:2854d79c-ceca-499e-bf31-ba63b47718fa
    name: freeze_id.hh
    description: >-
      plink .hh file see
      https://www.cog-genomics.org/plink/1.9/formats#hh 
    md5sum: 18e7547bb1c75e008caa9538baa57071
    filesize: 8M
    filetype: .hh
    belongs_to_container: alspacdcs:0e154bae-77b3-40bd-b81c-a5b127cf9184
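
The .fam and .bim layouts described in the freeze documentation are plain whitespace-delimited text, so they can be read without plink. The sketch below parses one line of each into a named record, following the six fields listed above. The record and function names are hypothetical, and the IDs in the usage example are made up.

```python
from typing import NamedTuple


class FamRecord(NamedTuple):
    fid: str         # 1. Family ID
    iid: str         # 2. Within-family ID
    father_iid: str  # 3. Father's within-family ID ('0' if not in dataset)
    mother_iid: str  # 4. Mother's within-family ID ('0' if not in dataset)
    sex: str         # 5. '1' = male, '2' = female, '0' = unknown
    phenotype: str   # 6. Phenotype value, kept as text


class BimRecord(NamedTuple):
    chrom: str       # 1. Chromosome code or name
    variant_id: str  # 2. Variant identifier
    cm: float        # 3. Position in morgans or centimorgans
    bp: int          # 4. Base-pair coordinate (1-based)
    allele1: str     # 5. Allele 1 (usually minor)
    allele2: str     # 6. Allele 2 (usually major)


def _six_fields(line: str) -> list:
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    return fields


def parse_fam_line(line: str) -> FamRecord:
    """Parse one line of a .fam file."""
    return FamRecord(*_six_fields(line))


def parse_bim_line(line: str) -> BimRecord:
    """Parse one line of a .bim file, converting the numeric columns."""
    chrom, variant_id, cm, bp, allele1, allele2 = _six_fields(line)
    return BimRecord(chrom, variant_id, float(cm), int(bp), allele1, allele2)


if __name__ == "__main__":
    # Made-up example lines, not real ALSPAC identifiers or variants.
    print(parse_fam_line("ID0001 ID0001 0 0 2 -9"))
    print(parse_bim_line("1 rs_example 0 752566 G A"))
```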

2.3 Genome-wide - Illumina 660 quad - G0 mothers (gwa_660_g0m)

2.3.1 Description

This dataset contains genome-wide array data including raw files and genotype calls for G0 mothers.

2.3.2 Methodology

ALSPAC mothers were genotyped using the Illumina human660W-quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs.

SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally SNPs with a minor allele frequency of less than 1% were removed. Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded.

Cryptic relatedness was assessed using an IBD estimate of more than 0.125, which is expected to correspond to roughly 12.5% of alleles shared IBD, or relatedness at the first cousin level. Related subjects that passed all other quality control thresholds were retained. In total, 9,048 subjects and 526,688 SNPs passed these quality control filters.
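
The 0.125 threshold follows from the usual halving of expected IBD sharing with each degree of relatedness: first cousins are third-degree relatives. A minimal sketch of that calculation (the function name is hypothetical):

```python
def expected_ibd_proportion(degree: int) -> float:
    """Expected proportion of alleles shared IBD for relatives of a given degree.

    degree 1: parent-offspring or full siblings (0.5)
    degree 2: e.g. half siblings, grandparent-grandchild (0.25)
    degree 3: e.g. first cousins (0.125)
    """
    if degree < 1:
        raise ValueError("degree must be 1 or greater")
    return 0.5 ** degree


if __name__ == "__main__":
    for degree in (1, 2, 3):
        print(degree, expected_ibd_proportion(degree))
```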

Associated publication:

2.3.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gwa_660_g0m_2022-12-05_f5
name: Freeze 5 version 2022-12-05 Genome-wide - Illumina 660 quad - G0 mothers
description: >-
  Freeze 5 of genome-wide array data including genotype calls for G0 mothers
freeze_size: 2G
linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a
git_tag: https://github.com/alspac/dataset_gwa_660_g0m/releases/tag/freeze5
is_current_freeze: true
freeze_number: 5
freeze_date: 2025-02-27
freeze_of_alspac_dataset_version: alspacdcs:gwa_660_g0m_2022-12-05
freeze_of_named_alspac_dataset: alspacdcs:gwa_660_g0m


has_containers:
  - id: alspacdcs:aeeb8633-73ce-4975-9b4b-35f0a6ceaef5
    name: data
    description: A dir/folder containing the plink data files

  - id: alspacdcs:3ec3c4ae-b52c-437f-a68f-ba9e7bca2c00
    name: legacy1
    description: >-
      A dir/folder containing the plink data files. 
      Includes full set of SNPs but is missing ~500 mothers who 
      were excluded in legacy QC due to strict relatedness inclusion thresholds.
    belongs_to_container: alspacdcs:aeeb8633-73ce-4975-9b4b-35f0a6ceaef5

  - id: alspacdcs:604be37a-50ad-4743-803e-783a5c1d6687
    name: legacy2
    description: >-
      A dir/folder containing the plink data files.
      Includes full set of individuals but due to legacy QC is restricted
      to a set of ~480k SNPs that overlap with the Illumina 550k array 
      (which was used for G1).
    belongs_to_container: alspacdcs:aeeb8633-73ce-4975-9b4b-35f0a6ceaef5


has_parts:
  - id: alspacdcs:4f364f94-01ec-4b13-ac7b-2ba283120c99
    name: Biallelic genotype table
    description: >-
      The genetic data. Primary representation of genotype calls at biallelic
	variants. Must be accompanied by .bim and .fam files.
	The legacy1 & legacy2 distribution of the plink bed file.
    data_distributions:
      - id: alspacdcs:3b4029da-80d8-4030-b9bc-50aca869fd9d
	name: freeze_id.bed
	description: >-
	  Legacy 1 plink bed file.
	md5sum: be66d3cc1d3d906c4d396cc161a605b1
	filesize: 1019.6MB
	filetype: .bed
	belongs_to_container: alspacdcs:3ec3c4ae-b52c-437f-a68f-ba9e7bca2c00 # legacy1

      - id: alspacdcs:5ec89906-79de-4d1d-8a78-814acb45b42e
	name: freeze_id.bed
	description: >-
	  Legacy 2 plink bed file.
	md5sum: 7559903a4811210f6289497e1323dfe7
	filesize: 960.3MB
	filetype: .bed
	belongs_to_container: alspacdcs:604be37a-50ad-4743-803e-783a5c1d6687 # legacy2

  - id: alspacdcs:6a49d358-1ee0-426e-b16d-30fbabb8cd25
    name: Variant Information 
    description: >-
      Information about genetic variants.
    data_distributions:
      - id: alspacdcs:116f2f0f-563f-4906-8901-0bb2e1a5787f
	name: freeze_id.bim
	description: >-
	  Legacy 1
	  Extended variant information file accompanying a .bed binary
	  genotype table. (--make-just-bim can be used to update just
	  this file.) A text file with no header line, and one line per
	  variant with the following six fields:

	    1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT';
	    '0' indicates unknown)
	    or name
	    2. Variant identifier
	    3. Position in morgans or centimorgans (safe to use dummy value of '0')
	    4. Base-pair coordinate (1-based; limited to 2^31-2)
	    5. Allele 1 (corresponding to clear bits in .bed; usually minor)
	    6. Allele 2 (corresponding to set bits in .bed; usually major)
	md5sum: 88b8c2221ef4ddc03118042db70d8575
	filesize: 14.0MB
	filetype: .bim
	number_of_variants: 526688
	belongs_to_container: alspacdcs:3ec3c4ae-b52c-437f-a68f-ba9e7bca2c00 # legacy1

      - id: alspacdcs:9e6408b2-0c8f-4bde-971a-bfad729b2a87
	name: freeze_id.bim
	description: >-
	  Legacy 2 
	  Extended variant information file accompanying a .bed binary
	  genotype table. (--make-just-bim can be used to update just
	  this file.) A text file with no header line, and one line per
	  variant with the following six fields:

	    1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT';
	    '0' indicates unknown)
	    or name
	    2. Variant identifier
	    3. Position in morgans or centimorgans (safe to use dummy value of '0')
	    4. Base-pair coordinate (1-based; limited to 2^31-2)
	    5. Allele 1 (corresponding to clear bits in .bed; usually minor)
	    6. Allele 2 (corresponding to set bits in .bed; usually major)
	md5sum: b4a1adb225de05d92d0af585950fd423
	filesize: 12.3MB
	filetype: .bim
	number_of_variants: 465740
	belongs_to_container: alspacdcs:604be37a-50ad-4743-803e-783a5c1d6687 # legacy2

  - id: alspacdcs:971ab861-76a4-4d46-8dcb-090d956c7f15
    name:  Sample information
    description: >-
      Information about the samples for the dataset
    data_distributions:
      - id: alspacdcs:6979affd-d593-4849-a7e6-9ac84d08bf97
	name: freeze_id.fam
	description: >-
	  legacy 1

	  A text file with no header line, and one line per sample with the following six fields:

	  1. Family ID ('FID')
	  2. Within-family ID ('IID'; cannot be '0')
	  3. Within-family ID of father ('0' if father isn't in dataset)
	  4. Within-family ID of mother ('0' if mother isn't in dataset)
	  5. Sex code ('1' = male, '2' = female, '0' = unknown)
	  6. Phenotype value ('1' = control, '2' = case,
	  '-9'/'0'/non-numeric = missing data if case/control)
	md5sum: d54855c6d6e0afaeef6522025707807b
	filesize: 253.7KB
	filetype: .fam
	number_of_participants: 8118
	belongs_to_container: alspacdcs:3ec3c4ae-b52c-437f-a68f-ba9e7bca2c00 # legacy1

      - id: alspacdcs:6fc40545-6c73-4883-bd26-c0621669599e
	name: freeze_id.fam
	description: >-
	  legacy 2

	  A text file with no header line, and one line per sample with the following six fields:

	  1. Family ID ('FID')
	  2. Within-family ID ('IID'; cannot be '0')
	  3. Within-family ID of father ('0' if father isn't in dataset)
	  4. Within-family ID of mother ('0' if mother isn't in dataset)
	  5. Sex code ('1' = male, '2' = female, '0' = unknown)
	  6. Phenotype value ('1' = control, '2' = case,
	  '-9'/'0'/non-numeric = missing data if case/control)

	md5sum: e23995bb57482d3c6b8eeac3100b5009
	filesize: 447.6KB
	filetype: .fam
	number_of_participants: 8648
	belongs_to_container: alspacdcs:604be37a-50ad-4743-803e-783a5c1d6687 # legacy2     

  - id: alspacdcs:3d87fed3-1e50-4928-b05c-0bc9e098dc9c
    name:  Log information
    description: >-
      Information about the plink run for making the dataset
    data_distributions:
      - id: alspacdcs:75f0ab84-5003-475d-8894-846cdc1ca073
	name: freeze_id.log
	description: >-
	  legacy 1 plink log file
	md5sum: ee1acd97e7e4a69885762798eb121821
	filesize: 995.0B
	filetype: .log
	belongs_to_container: alspacdcs:3ec3c4ae-b52c-437f-a68f-ba9e7bca2c00 # legacy1

      - id: alspacdcs:09c46d87-316c-4713-bac2-f818b9b8f6e9
	name: freeze_id.log
	description: >-
	  legacy 2 plink log file
	md5sum: 5206ddfc05c0d5955af430d7758f13bb
	filesize: 995.0B
	filetype: .log
	belongs_to_container: alspacdcs:604be37a-50ad-4743-803e-783a5c1d6687 # legacy2

2.4 Genome-wide - CNV - G1 (cnv_550_g1)

2.4.1 Description

This dataset contains predicted ALSPAC CNVs using PennCNV, generated from 23andMe raw genotype data.

2.4.2 Methodology

Log R Ratio (LRR) and B Allele Frequency (BAF) data were missing from the 23andMe raw genotype data, so we generated them ourselves using an in-house algorithm. Once these data were generated, we ran PennCNV using the hh550 libraries.

Filtered PennCNV calls are provided. Multiple calls were merged using the 'clean_cnv.pl' script, using a merge fraction of 0.5. Individuals with > 30 CNVs, a Log R Ratio SD of > 0.3, a BAF drift of > 0.002, or a waviness factor of > 0.05 were removed. CNVs in which at least 50% of the length of the CNV call overlapped with any telomeric, centromeric or immunoglobulin region were removed using the 'scan_region.pl' script in PennCNV.

In addition, CNVs covering fewer than 5 probes, of a length < 5kb, or with a confidence score below 10 were removed. Density was calculated as the number of probes in a CNV divided by the length of the CNV, and CNVs where the density of probes across the call was < 1 probe per 20kb were removed.

These QC parameters are suggestions only; the calls filtered with them are provided in filtered.cnv. Analysts can apply their own filter parameters to the raw calls in data.cnv.
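
For analysts who want to apply their own thresholds to the raw calls, the CNV-level filters described above can be sketched as follows. This is an illustration under stated assumptions, not the script that produced the filtered file: the CnvCall record and passes_cnv_qc function are hypothetical, a call is assumed to be removed if it fails any one of the criteria, and the individual-level and region-overlap filters are not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CnvCall:
    """One CNV call (hypothetical record for illustration)."""
    num_probes: int    # number of markers in the region
    length_bp: int     # CNV length in base pairs
    confidence: float  # PennCNV confidence score


MIN_PROBES = 5            # calls covering fewer than 5 probes are removed
MIN_LENGTH_BP = 5_000     # calls shorter than 5kb are removed
MIN_CONFIDENCE = 10.0     # calls with confidence below 10 are removed
MIN_DENSITY = 1 / 20_000  # calls with < 1 probe per 20kb are removed


def passes_cnv_qc(call: CnvCall) -> bool:
    """Return True if the call survives all four CNV-level thresholds."""
    if call.length_bp <= 0:
        return False
    density = call.num_probes / call.length_bp  # probes per base pair
    return (
        call.num_probes >= MIN_PROBES
        and call.length_bp >= MIN_LENGTH_BP
        and call.confidence >= MIN_CONFIDENCE
        and density >= MIN_DENSITY
    )


if __name__ == "__main__":
    print(passes_cnv_qc(CnvCall(num_probes=10, length_bp=50_000, confidence=20.0)))
```

Keeping the thresholds as named constants makes it straightforward to substitute stricter or looser values.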

2.4.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:cnv_550_g1_2015-11-09_f5
name: Genome-wide - CNV - G1 release version 2015-11-09 freeze 5
description: >-
  This is the fifth freeze of the 2015-11-09 version of the
  cnv_550_g1 dataset.
  It contains two csv versions of the CNV call data, the unfiltered
  and filtered versions.
freeze_size: 27m
linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a
git_tag: https://github.com/alspac/dataset_cnv_550_g1/releases/tag/freeze5
is_current_freeze: true
freeze_number: 5
freeze_date: 2025-02-27
previous_freeze: alspacdcs:cnv_550_g1_2015-11-09_f4
freeze_of_alspac_dataset_version: alspacdcs:cnv_550_g1_2015-11-09
freeze_of_named_alspac_dataset: alspacdcs:cnv_550_g1
has_parts:
  - id: alspacdcs:2443e67d-711e-410a-bbff-62ad6e89fc78_cnv_550_g1_2015-11-09_cnvdata_f5
    name: Unfiltered CNV data
    description: >- 
      This is the output of PennCNV before filtering.
      columns
	V1 - Position
	V2 - Number of markers in the region
	V3 - CNV length
	V4 - Copy number estimate
	V6 - Start SNP
	V7 - End SNP
	V8 - Confidence score
	qlet - within pregnancy ID
	cnv_550_g1 - Individual ID
    data_distributions:
      - id: alspacdcs:1ffcdc95-dfd4-4c4f-9b23-7a955cf9c2c9
	name: new_cnvdata.csv
	description: >- 
	  This is the csv file for the output of PennCNV before filtering.
	md5sum: 25aa47310d8c9e17a168d9bff54961f9
	filesize: 21M
	filetype: .csv
	number_of_participants: 7449  #data$id_qlet <- paste(data$cnv_550_g1, data$qlet, sep="_")
	#length(unique(data$id_qlet))
	number_of_cnv_variants: 70029 # Read file into R as data then:
	# dim(unique(data[1]))
	belongs_to_container: alspacdcs:723ce3b3-bae5-4bf5-932c-fad912f5c6e4

  - id: alspacdcs:0c96dd13-08f8-41fb-a389-f716c20f373c
    name: Filtered CNV data
    description: >-
      CNV data that has been filtered.
      columns
	V1 - Position
	V2 - Number of markers in the region
	V3 - CNV length
	V4 - Copy number estimate
	V6 - Start SNP
	V7 - End SNP
	V8 - Confidence score
	qlet - within pregnancy ID
	cnv_550_g1 - Individual ID
    data_distributions:
      - id: alspacdcs:578318bb-0f3a-4a3c-ac2c-cf14c48198c5
	name: new_filtered.csv
	description: >-
	  This is the csv file for the output of PennCNV after filtering.
	md5sum: 71c3e6841fcc492045602c20d72806d0
	filesize: 5.9M
	filetype: .csv
	number_of_participants: 6792 # Read into data 2 in r
	# data2$id_qlet <- paste(data2$cnv_550_g1, data2$qlet, sep="_") and length(unique(data2$id_qlet))
	number_of_cnv_variants: 14244 #Read into data2 in r then
	#length(unique(data2$V1))
	belongs_to_container: alspacdcs:723ce3b3-bae5-4bf5-932c-fad912f5c6e4

has_containers:
  - id: alspacdcs:723ce3b3-bae5-4bf5-932c-fad912f5c6e4 ## uuid
    name: data
    description: A dir/folder containing the two freeze data files

3 Imputed Data

3.1 Genome-wide - HRC imputed - G0 mothers + G1 (gi_hrc_g0m_g1)

SNP chips are useful for the generation of data on hundreds of thousands of SNPs, but there are millions more polymorphisms that remain untyped with this technology. If suitable numbers of whole genome sequences exist (e.g. 1000 Genomes data) then millions of genotypes that are missing from a sample because they have not been typed by SNP chips can be imputed using probabilistic methods. Here the ALSPAC mothers' and children's data were imputed to the Haplotype Reference Consortium (HRC) reference panel. This panel comprises around 31,000 sequenced individuals (mostly European), so its coverage of European haplotypes is much greater than that of other panels. As a consequence, imputation accuracy is expected to improve, particularly at lower allele frequencies.

3.1.1 Description

This dataset contains genotype data imputed to HRC for G0 mothers and G1. Reference genome build: GRCh37

3.1.2 Methodology

ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platform by 23andMe subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%) and insufficient sample replication (IBD < 0.8).

Population stratification was assessed by multidimensional scaling analysis and compared with HapMap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed.

SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed. Cryptic relatedness was measured as proportion of identity by descent (IBD > 0.1).

Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,115 subjects and 500,527 SNPs passed these quality control filters.

ALSPAC mothers were genotyped using the Illumina human660W-quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs. SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally SNPs with a minor allele frequency of less than 1% were removed.

Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded.

Cryptic relatedness was assessed using an IBD estimate of more than 0.125, which is expected to correspond to roughly 12.5% of alleles shared IBD, or relatedness at the first cousin level. Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,048 subjects and 526,688 SNPs passed these quality control filters.

We combined the 477,482 SNP genotypes in common between the sample of mothers and the sample of children. We removed SNPs with genotype missingness above 1% due to poor quality (11,396 SNPs removed) and removed a further 321 subjects due to potential ID mismatches. This resulted in a dataset of 17,842 subjects containing 6,305 duos and 465,740 SNPs (112 were removed during liftover and 234 were out of HWE after combination). We estimated haplotypes using ShapeIT (v2.r644), which utilises relatedness during phasing. The phased haplotypes were then imputed to the Haplotype Reference Consortium (HRCr1.1, 2016) panel of approximately 31,000 phased whole genomes. The HRC panel was phased using ShapeIT v2.r727, and the imputation was performed using the Michigan imputation server.
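
The subject and SNP totals in this paragraph can be reproduced from the figures given earlier in the methodology. The sketch assumes that the combined sample starts from the 9,115 children and 9,048 mothers who passed quality control, and that the three SNP removals are disjoint. Both are interpretations of the text, although they give the stated totals.

```python
# Subjects: QC-passing children plus QC-passing mothers, minus ID mismatches.
children_passing_qc = 9115
mothers_passing_qc = 9048
removed_for_id_mismatch = 321
combined_subjects = children_passing_qc + mothers_passing_qc - removed_for_id_mismatch

# SNPs: overlap between the two arrays, minus the three removal steps.
snps_in_common = 477482
removed_for_missingness = 11396
removed_during_liftover = 112
removed_for_hwe = 234
combined_snps = (
    snps_in_common
    - removed_for_missingness
    - removed_during_liftover
    - removed_for_hwe
)

if __name__ == "__main__":
    print(combined_subjects)  # matches the 17,842 stated in the text
    print(combined_snps)      # matches the 465,740 stated in the text
```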

3.1.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gi_hrc_g0m_g1_2017-05-04_f5
name: >-
  Genome-wide - HRC imputed - G0 mothers + G1 version 2017-05-04
  freeze 5
description: >-
  Freeze 5 of version 2017-05-04 Genome-wide array data imputed to the HRC reference panel for G0 mothers and G1 individuals in bgen and sample file format (version 1.2). 
freeze_size: 114G
linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a
git_tag: https://github.com/alspac/dataset_gi_hrc_g0m_g1/releases/tag/freeze5
is_current_freeze: true
freeze_number: 5
freeze_date: 2025-02-27
previous_freeze: alspacdcs:gi_hrc_g0m_g1_2017-05-04_f4
freeze_of_alspac_dataset_version: alspacdcs:gi_hrc_g0m_g1_2017-05-04
freeze_of_named_alspac_dataset: alspacdcs:gi_hrc_g0m_g1

has_containers:
  - id: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e ## uuid
    name: data
    description: A dir/folder containing the freeze data .bgen and .sample files

# belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e
has_parts:
  - id: alspacdcs:68b35374-11ca-4f91-b8dc-5c17c50f6992
    name: Omics ID sample
    description: >-
      The samples in the data. To be used with the genetic data.
      A plain text .sample file.
      See https://doi.org/10.1101/308296 for file format details.
    data_distributions:
      - id: alspacdcs:b03967aa-0991-4e4b-9c97-1faff53ec548
	name: swapped.sample
	md5sum: 3e8e18ce5f6e30ac1c79e92695279bce
	filesize: 1005.1KB
	filetype: .sample
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:1ac41b4e-e272-4cc3-900e-34fc623556fc
	name: swapped_23_female.sample
	md5sum: 19f80cc93eb8474b7354a04e4fabd050
	filesize: 745.8KB
	filetype: .sample
	number_of_participants: 12943
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:0574c0e6-b957-4427-adf4-f8d04fe997e5
	name: swapped_23_male.sample
	md5sum: 623083d3d4e7294c1ac86817d40fb435
	filesize: 259.4KB
	filetype: .sample
	number_of_participants: 4501
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

  - id: alspacdcs:0d22e4c5-9c75-4fb2-93a7-4c30d9b15a84
    name: Bgens
    description: >-
      Oxford Bgen (v1.2) files, one per chromosome. To be used with the matching sample file.
      See https://doi.org/10.1101/308296 for file format details.
    data_distributions:
      - id: alspacdcs:79cf0083-7c53-4eab-806d-fda62fe8f8cd
	name: filtered_01.bgen
	md5sum: 9727306a156ab88f72dedbdcaffc1105
	filesize: 8.6GB
	filetype: .bgen
	number_of_variants: 3069932
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:1342c5e0-b921-40cf-9a66-8917029add62
	name: filtered_02.bgen
	md5sum: a8cb970994e21c02eceea92a513ebef6
	filesize: 8.7GB
	filetype: .bgen
	number_of_variants: 3392238
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:3c377fe2-b11f-4724-aef3-54e7562e0bff
	name: filtered_03.bgen
	md5sum: 7e1586647816f4607b9e528be4893b5c
	filesize: 7.3GB
	filetype: .bgen
	number_of_variants: 2821895
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:f735b892-a53b-4bd7-92df-2a968eb5de82
	name: filtered_04.bgen
	md5sum: 9bb513a014c18a3a0a1ea11dcf63cc1b
	filesize: 7.9GB
	filetype: .bgen
	number_of_variants: 2787582
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:7573e42c-4f61-41e0-98ce-4933517adde1
	name: filtered_05.bgen
	md5sum: 92a2d759a5bcc18d0134dc7802302055
	filesize: 6.7GB
	filetype: .bgen
	number_of_variants: 2588170
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:e034f623-dd24-46a6-9f7a-2a503cee39fd
	name: filtered_06.bgen
	md5sum: 5f68a69cd54a89b8db5577711f2a7934
	filesize: 6.3GB
	filetype: .bgen
	number_of_variants: 2460112
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:29b4c425-be26-4778-b08c-6104bb497269
	name: filtered_07.bgen
	md5sum: cd02eefdb350d9859ea7a5975d5ee73a
	filesize: 6.6GB
	filetype: .bgen
	number_of_variants: 2289306
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:86a83679-b397-4d2d-8d77-cde33919932e
	name: filtered_08.bgen
	md5sum: 68b4ea416441637c01ebcc1c2e9ac8cf
	filesize: 5.7GB
	filetype: .bgen
	number_of_variants: 2242706
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:7e21bc11-89d2-4560-831f-2e6497c31360
	name: filtered_09.bgen
	md5sum: a262516e4a9c48fe2b7edfb68a0f0577
	filesize: 4.5GB
	filetype: .bgen
	number_of_variants: 1675899
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:7ff4c7d7-7199-4a41-a55e-40ff89673f4c
	name: filtered_10.bgen
	md5sum: 659c1e9b8c9500aa02b84d8a121e4a23
	filesize: 5.1GB
	filetype: .bgen
	number_of_variants: 1927504
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:5357c517-b844-4bcc-9a49-f489369c9233
	name: filtered_11.bgen
	md5sum: 94ae65053c6cb28ffa5413a447bea2a7
	filesize: 5.2GB
	filetype: .bgen
	number_of_variants: 1936990
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:c0f774be-af04-41a1-a1de-9b3d230e177d
	name: filtered_12.bgen
	md5sum: 5e488efe1865265b70f0db0ba0e8ceb2
	filesize: 5.1GB
	filetype: .bgen
	number_of_variants: 1848118
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:f8896772-0ae4-41c1-9f1c-f194b8c9a5b7
	name: filtered_13.bgen
	md5sum: c6d8c39e1714020ef24236ce0e0e65f4
	filesize: 3.7GB
	filetype: .bgen
	number_of_variants: 1385434
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:9048547c-291f-4e7e-8e19-4c433b4186f6
	name: filtered_14.bgen
	md5sum: a7ceaec0d5986e1396214bbc4a8bcfb5
	filesize: 3.5GB
	filetype: .bgen
	number_of_variants: 1266536
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:dc53f49b-8a3a-4fde-bc74-856c6bb16fe2
	name: filtered_15.bgen
	md5sum: 30a19dcda6047a6ac690d650ee5fea8c
	filesize: 3.4GB
	filetype: .bgen
	number_of_variants: 1139215
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:c36b8dc1-8415-4608-bfaa-b7196ed98ea3
	name: filtered_16.bgen
	md5sum: d4ffb3324217ec7ac9e3716ae3de9106
	filesize: 4.1GB
	filetype: .bgen
	number_of_variants: 1281298
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:b52fbba0-a8b3-4c70-acfa-b2721c86cd7c
	name: filtered_17.bgen
	md5sum: a0baaf8155e3e97ee33d440035877a96
	filesize: 3.6GB
	filetype: .bgen
	number_of_variants: 1090072
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:c07a6c33-e0cc-4b69-8701-85c14a0ca771
	name: filtered_18.bgen
	md5sum: 1236c268dfab2d46148835e50efcec5d
	filesize: 3.1GB
	filetype: .bgen
	number_of_variants: 1104755
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:577dc456-a811-4b50-ae68-1a0992a65bb9
	name: filtered_19.bgen
	md5sum: 1c17198a8d5a7be881d671559048d073
	filesize: 3.4GB
	filetype: .bgen
	number_of_variants: 868554
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:6a6d2735-1aac-48c4-972b-f97ffd4fb396
	name: filtered_20.bgen
	md5sum: 336791734294796bcc5c725048756155
	filesize: 2.6GB
	filetype: .bgen
	number_of_variants: 884983
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:5b5cbe53-85fb-403b-adef-cb59b545e86b
	name: filtered_21.bgen
	md5sum: d97d780938173eb14c5c1aae66e1005e
	filesize: 1.7GB
	filetype: .bgen
	number_of_variants: 531276
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:2312c3b2-7352-4f27-91d0-5bc078727a1b
	name: filtered_22.bgen
	md5sum: 343581eebfe7e38242db0c8b019c2264
	filesize: 1.8GB
	filetype: .bgen
	number_of_variants: 524544
	number_of_participants: 17444
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:0df6a041-2c11-4fa0-bae9-2d5570d1554d
	name: filtered_23female.bgen
	md5sum: d4abdc0d84bda1f8a3eec5c9cee8977b
	filesize: 4.2GB
	filetype: .bgen
	number_of_variants: 1228035
	number_of_participants: 12943
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e

      - id: alspacdcs:f6e79f2e-a99a-4b3f-9dc7-14e2f2c54aff
	name: filtered_23male.bgen
	md5sum: bebe6967a0489a186166d61cd1b07a18
	filesize: 1.2GB
	filetype: .bgen
	number_of_variants: 1228035
	number_of_participants: 4501
	belongs_to_container: alspacdcs:6247701a-3826-469c-8725-ad8d79cc1b1e
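
The md5sum fields in the freeze document can be used to confirm that received files are intact. The following is a minimal Python sketch under stated assumptions: the helper names and the local directory "data" are hypothetical, and only two of the listed checksums are copied in for brevity.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_files(expected, data_dir):
    """Map each expected file name to 'ok', 'mismatch' or 'missing'."""
    results = {}
    for name, md5sum in expected.items():
        path = Path(data_dir) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == md5sum:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


# Two entries copied from the freeze document above.
EXPECTED_MD5 = {
    "swapped.sample": "3e8e18ce5f6e30ac1c79e92695279bce",
    "filtered_22.bgen": "343581eebfe7e38242db0c8b019c2264",
}

if __name__ == "__main__":
    # "data" is a hypothetical local path to the received container.
    for name, status in check_files(EXPECTED_MD5, "data").items():
        print(name, status)
```

As a further sanity check on the metadata itself, the participant counts of the two chromosome 23 sample files (12943 and 4501) sum to the 17444 listed for the full sample file.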

3.2 Genome-wide - HapMap2 imputed - G1 (gi_hapmap2_g1)

3.2.1 Description

This dataset contains genotype data imputed to HapMap 2 for G1. Reference genome build: NCBI36 (hg18)

3.2.2 Methodology

A total of 9912 subjects were genotyped using the Illumina HumanHap550 quad genome-wide SNP genotyping platform by 23andMe, subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK, and the Laboratory Corporation of America, Burlington, NC, USA.

Individuals were excluded from further analysis on the basis of incorrect gender assignments; minimal or excessive heterozygosity (<0.320 or >0.345 for the Sanger data and <0.310 or >0.330 for the LabCorp data); disproportionate levels of individual missingness (>3%); evidence of cryptic relatedness (>10% IBD); or non-European ancestry (as detected by a multidimensional scaling analysis seeded with HapMap 2 individuals). EIGENSTRAT analysis revealed no additional obvious population stratification, and genome-wide analyses with other phenotypes indicate a low lambda. The resulting data set consisted of 8365 individuals (84% of those genotyped).
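
The numeric sample-level criteria differ by genotyping centre, which the following Python sketch makes explicit. It is an illustration under stated assumptions: the function and constant names are hypothetical, and the gender and ancestry checks are not modelled because they have no single numeric threshold in the text.

```python
# Acceptable heterozygosity range per genotyping centre (from the text).
HET_BOUNDS = {"Sanger": (0.320, 0.345), "LabCorp": (0.310, 0.330)}
MAX_MISSINGNESS = 0.03  # individual missingness above 3% is excluded
MAX_IBD = 0.10          # IBD above 10% is treated as cryptic relatedness


def sample_exclusion_reasons(centre, heterozygosity, missingness, max_ibd):
    """Return the list of numeric criteria a sample fails (empty if none)."""
    reasons = []
    low, high = HET_BOUNDS[centre]
    if heterozygosity < low or heterozygosity > high:
        reasons.append("heterozygosity")
    if missingness > MAX_MISSINGNESS:
        reasons.append("missingness")
    if max_ibd > MAX_IBD:
        reasons.append("relatedness")
    return reasons


# The same heterozygosity value is acceptable for one centre but not the other.
print(sample_exclusion_reasons("Sanger", 0.335, 0.01, 0.02))
print(sample_exclusion_reasons("LabCorp", 0.335, 0.01, 0.02))

# Retention rate quoted in the text: 8365 of 9912 genotyped individuals.
retained_percent = round(100 * 8365 / 9912)
print(retained_percent)
```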

SNPs with a minor allele frequency of <1% or a call rate of <95% were removed. Furthermore, only SNPs which passed an exact test of Hardy-Weinberg equilibrium (P > 5 x 10^-7) were considered for analysis. Genotypes were subsequently imputed with MACH 1.0.16 Markov Chain Haplotyping software, using CEPH individuals from phase 2 of the HapMap project as a reference set (release 22).

Associated publication:

3.2.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gi_hapmap2_g1_2022-12-07_f5
name: Genome-wide - HapMap2 imputed - G1 version 2022-12-07 freeze 5
description: >-
  Freeze 5 of 2022-12-07 version of Genome-wide array data imputed to the HapMap2 reference panel for G1 individuals

freeze_size: 5G
linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a
git_tag: https://github.com/alspac/dataset_gi_hapmap2_g1/releases/tag/freeze5
is_current_freeze: true
freeze_number: 5
freeze_date: 2025-02-27
previous_freeze: alspacdcs:gi_hapmap2_g1_2022-12-07_f4
freeze_of_alspac_dataset_version: alspacdcs:gi_hapmap2_g1_2022-12-07
freeze_of_named_alspac_dataset: alspacdcs:gi_hapmap2_g1


has_containers:
  - id: alspacdcs:0f3ae21c-aa11-4aab-b8a0-78bf13801cfb
    name: data
    description: A dir/folder containing the plink freeze data files


has_parts:
  - id: alspacdcs:5d8b20a5-b2d3-4d3b-a02e-fe865810dd92
    name: bed file
    description: >-
      Plink standard format bed file. See https://www.cog-genomics.org/plink/1.9/formats for further information.
    data_distributions:
      - id: alspacdcs:30b0f629-2582-44cd-a13d-27511f2bfc3b
        name: freeze_id.bed
        md5sum: c1b6c00b67513aef2147d6d507c4d1be
        filesize: 4.9GB
        filetype: .bed
        belongs_to_container: alspacdcs:0f3ae21c-aa11-4aab-b8a0-78bf13801cfb

  - id: alspacdcs:5a8e0987-f3e7-4354-b77a-353d47390aa2
    name: bim file
    description: >-
      Plink standard bim file. Contains variant information. See https://www.cog-genomics.org/plink/1.9/formats for further information.
    data_distributions:
      - id: alspacdcs:3022bf70-3de1-4fe5-a827-627cb6998a53
        name: freeze_id.bim
        md5sum: a1ebaaf6286af5b12f4561b380cd302a
        filesize: 67.6MB
        filetype: .bim
        number_of_variants: 2543887
        belongs_to_container: alspacdcs:0f3ae21c-aa11-4aab-b8a0-78bf13801cfb

  - id: alspacdcs:2a8a7c04-c771-4c80-8e53-32012bcf6cbe
    name: fam file
    description: >-
      Plink standard format fam file. Contains sample information. See https://www.cog-genomics.org/plink/1.9/formats for further information.
    data_distributions:
      - id: alspacdcs:26508ec1-acb7-46ae-a7b4-dffa70cdf574
        name: freeze_id.fam
        md5sum: 58d7bf44f023345bced230d50c8f0736
        filesize: 273.0KB
        filetype: .fam
        number_of_participants: 8223
        belongs_to_container: alspacdcs:0f3ae21c-aa11-4aab-b8a0-78bf13801cfb

  - id: alspacdcs:ac4c4c8c-1da4-4c26-9c4d-3bdea467d5b7
    name: log file
    description: >-
      Plink log file. Contains log information. See https://www.cog-genomics.org/plink/1.9/formats for further information.
    data_distributions:
      - id: alspacdcs:d74bea9e-7c43-4b67-9abd-f114daeebd06
        name: freeze_id.log
        md5sum: 09b318336263bb2e8ecc563decb92aed
        filesize: 941.0B
        filetype: .log
        belongs_to_container: alspacdcs:0f3ae21c-aa11-4aab-b8a0-78bf13801cfb
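
The variant and participant counts in this freeze document can be cross-checked against the listed .bed filesize. The sketch below assumes the standard variant-major PLINK 1 binary layout (three leading bytes, then two bits per genotype with each variant padded to a whole number of bytes) and assumes the catalogue reports sizes in binary units; the function name is hypothetical.

```python
def expected_bed_size(n_variants, n_samples):
    """Expected size in bytes of a variant-major PLINK 1 .bed file."""
    bytes_per_variant = (n_samples + 3) // 4    # 4 genotypes per byte, rounded up
    return 3 + n_variants * bytes_per_variant   # plus 3 leading header bytes


# Counts taken from the .bim and .fam entries above.
size_bytes = expected_bed_size(n_variants=2543887, n_samples=8223)
size_gib = round(size_bytes / 2**30, 1)
print(size_bytes, size_gib)
```

Under these assumptions the expected size is about 4.9 GiB, in line with the filesize listed for freeze_id.bed.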

3.3 Genome-wide - HapMap2 imputed - G0 mothers (gi_hapmap2_g0m)

3.3.1 Description

This dataset contains genotype data imputed to HapMap 2 for G0 mothers. Reference genome build: NCBI36 (hg18)

3.3.2 Methodology

A total of 10 015 women (mothers from the ALSPAC cohort) were genotyped using the Illumina 660 quad SNP chip which contains 557 124 SNP markers. Markers with minor allele frequency < 1%, SNPs with >5% missing genotypes and any markers that failed an exact test of Hardy-Weinberg equilibrium (P < 1 x 10^-6) were excluded from further analyses. Genome-wide identity by state sharing was calculated for each pair of individuals in the cohort to identify cryptic relatedness.

In order to identify individuals who might have ancestries other than Western European, we merged data from both cohorts with the 60 western European (CEU) founders, 60 Nigerian (YRI) founders and 90 Japanese (JPT) and Han Chinese (CHB) individuals from the International HapMap Project. Genome-wide IBS distances for each pair of individuals were calculated on markers shared between the HapMap and the Illumina 660K SNP chip, and then the multidimensional scaling option in R was used to generate a two-dimensional plot based upon individuals' scores on the first two principal coordinates from this analysis. Samples that did not cluster with the CEU individuals were excluded from subsequent analyses. In addition, we plotted the proportion of missing data for each individual against their genome-wide heterozygosity. Any individual who did not cluster with others was removed from further analyses. Samples were also excluded from analyses in the case of excessive missingness (>5%) or unusual genome-wide or X chromosome heterozygosity, and one individual was excluded from each pair of putatively related individuals (genome-wide IBD >10%). After data cleaning, 8340 individuals and 526688 SNPs were left in the genome-wide data set.
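
The pairwise IBS distance that feeds the multidimensional scaling step can be illustrated with a small example. This is a sketch of the general quantity, not the software actually used: it assumes genotypes coded as 0, 1 or 2 copies of a reference allele with None for missing calls, and the function name is hypothetical.

```python
def ibs_distance(genotypes_a, genotypes_b):
    """1 minus the mean proportion of alleles shared identical by state."""
    shared = 0
    compared = 0
    for a, b in zip(genotypes_a, genotypes_b):
        if a is None or b is None:
            continue                  # skip SNPs missing in either individual
        shared += 2 - abs(a - b)      # 2, 1 or 0 alleles shared at this SNP
        compared += 1
    if compared == 0:
        raise ValueError("no SNPs genotyped in both individuals")
    return 1.0 - shared / (2 * compared)


# Hypothetical genotypes at five SNPs for two individuals.
person_1 = [0, 1, 2, 2, None]
person_2 = [0, 2, 0, 2, 1]
print(ibs_distance(person_1, person_2))
```

A matrix of such distances over all pairs is the input that a classical multidimensional scaling routine reduces to the first two principal coordinates.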

We then conducted imputation using the MACH Markov Chain Haplotyping software with CEU individuals from phase 2 of the HapMap project as a reference set (release 22). The final imputed data set consisted of 8340 individuals, each with 2 594 390 imputed markers. Only imputed genotypes with minor allele frequencies ≥1% and R-sqr ≥0.3 were considered for association. Of these 8340 with genetic data, 2874 mothers also had phenotype data available.
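
The post-imputation filter described above is a conjunction of two thresholds. A minimal Python sketch, with a hypothetical function name and made-up marker values, is:

```python
def marker_considered(maf, r_sqr, min_maf=0.01, min_r_sqr=0.3):
    """True if an imputed marker meets both the MAF and R-sqr thresholds."""
    return maf >= min_maf and r_sqr >= min_r_sqr


# Hypothetical imputed markers: (minor allele frequency, imputation R-sqr)
markers = {
    "marker_a": (0.12, 0.95),   # kept
    "marker_b": (0.12, 0.25),   # poorly imputed
    "marker_c": (0.005, 0.95),  # too rare
}
considered = [name for name, (maf, r_sqr) in markers.items()
              if marker_considered(maf, r_sqr)]
print(considered)
```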

Associated publication:

3.3.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gi_hapmap2_g0m_2022-12-07_f5
name: Genome-wide - HapMap2 imputed - G0 mothers version 2022-12-07 freeze 5
description: >-
  Version 2022-12-07 freeze 5 of Genome-wide array data imputed to the HapMap2 reference panel for G0 mothers.
  The number of variants & individuals within each plink file set can be viewed within the log file.
freeze_size: 4.9G
linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a
git_tag: https://github.com/alspac/dataset_gi_hapmap2_g0m/releases/tag/freeze5
is_current_freeze: true
freeze_number: 5
freeze_date: 2025-02-27
previous_freeze: alspacdcs:gi_hapmap2_g0m_2022-12-07_f4
freeze_of_alspac_dataset_version: alspacdcs:gi_hapmap2_g0m_2022-12-07
freeze_of_named_alspac_dataset: alspacdcs:gi_hapmap2_g0m


has_containers:
  - id: alspacdcs:50ae5d38-7e0d-4366-ab2d-73e91960eb24 ## uuid
    name: plink
    description: A dir/folder containing the plink freeze data files. There are 8118 individuals within this dataset. 


has_parts:
  - id: alspacdcs:b5dac573-22e8-4aff-93e3-6988b564df3d
    name: bed files
    description: >-
      Plink standard format bed files. One per chromosome. See https://www.cog-genomics.org/plink/1.9/formats for further information.
    belongs_to_container: alspacdcs:50ae5d38-7e0d-4366-ab2d-73e91960eb24
    data_distributions:
      - id: alspacdcs:7efe0162-1061-4d97-8149-57346a60105d
	name: freeze_id_chr1.bed
	md5sum: 01f7205ea4b6e852c0e8feb72a2cb9cd
	filesize: 374.7MB
	filetype: .bed
      - id: alspacdcs:10652e6a-a6f9-4cbb-a46a-57f6202df995
	name: freeze_id_chr2.bed
	md5sum: 494713bafedd17c3be4e782f7881dcc0
	filesize: 427.5MB
	filetype: .bed
      - id: alspacdcs:1eccd5b3-a5d3-4362-9755-383141333576
	name: freeze_id_chr3.bed
	md5sum: 609847ca0489b7a97725ec275f8337d2
	filesize: 337.5MB
	filetype: .bed
      - id: alspacdcs:7158d2fe-2c64-4286-902f-2374c8f3d1c9
	name: freeze_id_chr4.bed
	md5sum: 147fee33c621f644dad5a2d8ee86fc1d
	filesize: 315.9MB
	filetype: .bed
      - id: alspacdcs:4ce18ceb-5a2a-49f8-864e-96f188ff4015
	name: freeze_id_chr5.bed
	md5sum: a3a47a8ea90e0fa39d5c203436b6d982
	filesize: 325.5MB
	filetype: .bed
      - id: alspacdcs:c2b647c2-c543-4b0c-824a-eb46b8e71043
	name: freeze_id_chr6.bed
	md5sum: 953f9c82981d59d25dabe44ba5718b29
	filesize: 353.1MB
	filetype: .bed
      - id: alspacdcs:92eca9bf-acec-4211-8078-5d727847f51f
	name: freeze_id_chr7.bed
	md5sum: fb9e8aaf4ae7c3fc75233248ec9d03b0
	filesize: 277.3MB
	filetype: .bed
      - id: alspacdcs:e479937f-30e9-42d6-8b7e-98948660f187
	name: freeze_id_chr8.bed
	md5sum: de34e8ef57e4c08991e4778401adf861
	filesize: 285.5MB
	filetype: .bed
      - id: alspacdcs:3ac4109b-ce90-404e-9c33-14a0e0724e57
	name: freeze_id_chr9.bed
	md5sum: 58ff215f0652257867e42f567ff1c2be
	filesize: 236.4MB
	filetype: .bed
      - id: alspacdcs:71a413bf-2a47-43f9-be4c-a90166c94ad6
	name: freeze_id_chr10.bed
	md5sum: 4606d4a5a008927b6ab051461218094a
	filesize: 267.9MB
	filetype: .bed
      - id: alspacdcs:f437e00b-2de1-4f72-b169-a5a13d844a89
	name: freeze_id_chr11.bed
	md5sum: 3c89898ce9fc0445c566ea0c060fb9db
	filesize: 251.8MB
	filetype: .bed
      - id: alspacdcs:1bcea287-df1c-4d62-8fc5-974658a1b6ed
	name: freeze_id_chr12.bed
	md5sum: 367f44ccd183c47334cfc7cb8333628a
	filesize: 241.7MB
	filetype: .bed
      - id: alspacdcs:355962f4-d2c8-4477-b5f1-d523befa9695
	name: freeze_id_chr13.bed
	md5sum: 0e99cf077012880a802dc36ce72142c1
	filesize: 201.6MB
	filetype: .bed
      - id: alspacdcs:1b0a61a1-7b3a-4950-8eea-14663fba419d
	name: freeze_id_chr14.bed
	md5sum: a41f9803ec71a0dcdf137806b21ba2e6
	filesize: 162.5MB
	filetype: .bed
      - id: alspacdcs:f403092d-e133-49eb-b8da-b8dd05069089
	name: freeze_id_chr15.bed
	md5sum: 611159bc9c4500de559615d0a7c549f2
	filesize: 140.0MB
	filetype: .bed
      - id: alspacdcs:1bb114aa-4804-42c9-b62b-bede2c7d4b0f
	name: freeze_id_chr16.bed
	md5sum: b04eb2e4e66fef7ee7d48cb666d78c38
	filesize: 138.5MB
	filetype: .bed
      - id: alspacdcs:e8d235a0-c488-4294-b39a-302f92646247
	name: freeze_id_chr17.bed
	md5sum: c6d54ed5ac68f2e0bd806b6124463ee4
	filesize: 113.2MB
	filetype: .bed
      - id: alspacdcs:9871e903-1d2c-401f-aedc-9a20b44e375a
	name: freeze_id_chr18.bed
	md5sum: 6b46a8d2993dae303334b9a51b50b92c
	filesize: 148.7MB
	filetype: .bed
      - id: alspacdcs:70569400-5d25-4fac-93a8-181eaecb0d6c
	name: freeze_id_chr19.bed
	md5sum: 801ccb3bb64dddaabfc2b7a4a1e4c5b0
	filesize: 71.7MB
	filetype: .bed
      - id: alspacdcs:bfb9ce39-6ad2-4c04-952e-070511f96cb8
	name: freeze_id_chr20.bed
	md5sum: 2af011bb98d6b8a8b00b7d938700fdac
	filesize: 122.8MB
	filetype: .bed
      - id: alspacdcs:41a74b3f-38f5-45ac-aef4-05764316abd3
	name: freeze_id_chr21.bed
	md5sum: 13165e1c9a27aa42853429b0246a1ed5
	filesize: 65.6MB
	filetype: .bed
      - id: alspacdcs:a2cdb6ad-d6c0-49d9-a9e8-33b1f61d8226
	name: freeze_id_chr22.bed
	md5sum: 5abcf552c585152ed0ee11754f3e7833
	filesize: 65.5MB
	filetype: .bed

  - id: alspacdcs:03943a89-75f1-4833-83f2-0fb740aff2df
    name: bim files
    description: >-
      Plink standard bim files. One per chromosome. Contains variant information. See https://www.cog-genomics.org/plink/1.9/formats for further information.
    belongs_to_container: alspacdcs:50ae5d38-7e0d-4366-ab2d-73e91960eb24
    data_distributions:
      - id: alspacdcs:b7f6b64c-c359-4871-955c-e69a328dff6d
	name: freeze_id_chr1.bim
	md5sum: 44795681691b62d1921ad8855fd11a09
	filesize: 5.1MB
	filetype: .bim
	number_of_variants: 193554
      - id: alspacdcs:25f5868c-261d-4f4b-a61f-2585c25cc16c
	name: freeze_id_chr2.bim
	md5sum: 275cefa559489b51bebbc65657a91822
	filesize: 5.9MB
	filetype: .bim
	number_of_variants: 220833
      - id: alspacdcs:bbc0949b-a6b1-4c49-bfc5-5c14c112bb32
	name: freeze_id_chr3.bim
	md5sum: 96d147406f1f24697b0cb9af0c7091fc
	filesize: 4.6MB
	filetype: .bim
	number_of_variants: 174356
      - id: alspacdcs:f5a905fe-a12b-49f5-bf6d-89ae8dbcce69
	name: freeze_id_chr4.bim
	md5sum: 54a244447b1345636690b252215bfd2d
	filesize: 4.3MB
	filetype: .bim
	number_of_variants: 163157
      - id: alspacdcs:bd66e1db-4dda-476c-9f29-8a01c22740ec
	name: freeze_id_chr5.bim
	md5sum: e8f55ef9016bf2f03ee43f08a6c974c3
	filesize: 4.4MB
	filetype: .bim
	number_of_variants: 168144
      - id: alspacdcs:47363a39-591d-4d1d-a0eb-d1124ea94485
	name: freeze_id_chr6.bim
	md5sum: 3fd4e793a35c5e935454efc1105be192
	filesize: 4.8MB
	filetype: .bim
	number_of_variants: 182381
      - id: alspacdcs:9d7d252a-49d5-4476-9222-bb3e2c2efdf4
	name: freeze_id_chr7.bim
	md5sum: dae38c5168605323dfc584a73f3ce4a1
	filesize: 3.8MB
	filetype: .bim
	number_of_variants: 143232
      - id: alspacdcs:6dfeca17-935d-4664-a428-28118165d701
	name: freeze_id_chr8.bim
	md5sum: 6243ef376ee6cbe643bec69201bec604
	filesize: 3.9MB
	filetype: .bim
	number_of_variants: 147483
      - id: alspacdcs:38dee48b-137e-49cb-a115-aeaded91f3e3
	name: freeze_id_chr9.bim
	md5sum: 1e828e0f36c2d168ce6c1df5887a764b
	filesize: 3.2MB
	filetype: .bim
	number_of_variants: 122112
      - id: alspacdcs:602fd303-2bae-414a-a5eb-e8e8e283f39c
	name: freeze_id_chr10.bim
	md5sum: 3c259904c7da548d25c86a4a36e96285
	filesize: 3.8MB
	filetype: .bim
	number_of_variants: 138402
      - id: alspacdcs:75d5e65b-5fc3-4f8a-b01b-e69ea1c45628
	name: freeze_id_chr11.bim
	md5sum: 703ecef520ce7363c24e9600b363570f
	filesize: 3.5MB
	filetype: .bim
	number_of_variants: 130069
      - id: alspacdcs:f5166ec7-30e3-4ec4-9de7-0e91454381a8
	name: freeze_id_chr12.bim
	md5sum: 515a46f735c531163377d114549042b5
	filesize: 3.4MB
	filetype: .bim
	number_of_variants: 124860
      - id: alspacdcs:d7f85dcf-75dd-4736-bd7f-5f458f2081c7
	name: freeze_id_chr13.bim
	md5sum: cd1b7c80977fb5a0bbd87bc83dd85aed
	filesize: 2.8MB
	filetype: .bim
	number_of_variants: 104120
      - id: alspacdcs:b6b69cba-f36a-4ddd-bdab-1b80703f7817
	name: freeze_id_chr14.bim
	md5sum: 4a933818aaea48201f455ebd07ea1b78
	filesize: 2.3MB
	filetype: .bim
	number_of_variants: 83936
      - id: alspacdcs:a0f73fdd-1936-4407-ad7a-3743f87fe429
	name: freeze_id_chr15.bim
	md5sum: 1e1139db4b031ba577b5ac6ae000ce6f
	filesize: 1.9MB
	filetype: .bim
	number_of_variants: 72300
      - id: alspacdcs:538edd86-5b6b-4d41-8387-1e8d8b2d7b72
	name: freeze_id_chr16.bim
	md5sum: 8bd9cb45256b6b5ca37ce66eec810035
	filesize: 1.9MB
	filetype: .bim
	number_of_variants: 71550
      - id: alspacdcs:64834bdd-5c5d-4883-a2cf-b457e35ae0ea
	name: freeze_id_chr17.bim
	md5sum: 0dc0770759f9edccec7ce305e07b57d4
	filesize: 1.6MB
	filetype: .bim
	number_of_variants: 58455
      - id: alspacdcs:38d19e6e-ed06-4724-a339-04eac4625317
	name: freeze_id_chr18.bim
	md5sum: 9ffd8f006c82701060dff29bf460e8fe
	filesize: 2.1MB
	filetype: .bim
	number_of_variants: 76812
      - id: alspacdcs:6119148e-dd26-4c3e-a934-bfdbd96d5099
	name: freeze_id_chr19.bim
	md5sum: c6fce7e15e198304f752ccbce66299b9
	filesize: 1012.3KB
	filetype: .bim
	number_of_variants: 37045
      - id: alspacdcs:4a80d624-fb52-48cf-9e39-fdda12c2b6c0
	name: freeze_id_chr20.bim
	md5sum: 6e0b2d6cd06cc6e36f9cbc3f8df0a169
	filesize: 1.7MB
	filetype: .bim
	number_of_variants: 63408
      - id: alspacdcs:fbf18f66-9ddd-4e0f-b3cf-026c43b48826
	name: freeze_id_chr21.bim
	md5sum: c1f6f2181c49172608ac79e18425e4f4
	filesize: 924.7KB
	filetype: .bim
	number_of_variants: 33863
      - id: alspacdcs:47ffbaa6-f74d-470b-b48e-15a469f6e7c8
	name: freeze_id_chr22.bim
	md5sum: 86a1da3366ba87e62f561dc09f64f9ac
	filesize: 920.9KB
	filetype: .bim
	number_of_variants: 33815

  - id: alspacdcs:ef194b12-6864-44b3-9d89-2090dfe305d0
    name: fam files
    description: >-
      Plink standard format fam files. One per chromosome. Contains sample information. See https://www.cog-genomics.org/plink/1.9/formats for further information.
    belongs_to_container: alspacdcs:50ae5d38-7e0d-4366-ab2d-73e91960eb24
    data_distributions:
      - id: alspacdcs:3b986b53-808e-4c4e-8275-d81d00d5ebb0
	name: freeze_id_chr1.fam
	md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8
	filesize: 277.5KB
	filetype: .fam
	number_of_participants: 8118
      - id: alspacdcs:be2d1539-f667-4748-ad08-993aee03319a
	name: freeze_id_chr2.fam
	md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8
	filesize: 277.5KB
	filetype: .fam
	number_of_participants: 8118
      - id: alspacdcs:eff796b6-bb95-4d15-babf-ff893a328ae8
	name: freeze_id_chr3.fam
	md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8
	filesize: 277.5KB
	filetype: .fam
	number_of_participants: 8118
      - id: alspacdcs:34325f98-aafa-4d83-92c5-1efc16e60b31
	name: freeze_id_chr4.fam
	md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8
	filesize: 277.5KB
	filetype: .fam
	number_of_participants: 8118
      - id: alspacdcs:13f956cf-a81a-49a0-bcc3-7315425d94f6
	name: freeze_id_chr5.fam
	md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8
	filesize: 277.5KB
	filetype: .fam
	number_of_participants: 8118
      - id: alspacdcs:d43d7d7c-bd21-4035-9cf4-5a946c0b0548
	name: freeze_id_chr6.fam
	md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8
	filesize: 277.5KB
	filetype: .fam
	number_of_participants: 8118
      - id: alspacdcs:534c0964-72af-4418-9f0a-223d0dfbc74f
	name: freeze_id_chr7.fam
	md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8
	filesize: 277.5KB
	filetype: .fam
	number_of_participants: 8118
      - id: alspacdcs:a139d84d-286d-4d65-bfa6-e1a329345ac9
	name: freeze_id_chr8.fam
	md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8
	filesize: 277.5KB
	filetype: .fam
	number_of_participants: 8118
      - id: alspacdcs:23ee8db2-4cb5-413d-b714-0dd88334f645
	name: freeze_id_chr9.fam
	md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8
	filesize: 277.5KB
	filetype: .fam
	number_of_participants: 8118
      - id: alspacdcs:a2155a79-99ea-425b-bac3-0b305f097246
	name: freeze_id_chr10.fam
	md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8
	filesize: 277.5KB
	filetype: .fam
	number_of_participants: 8118
      - id: alspacdcs:6b064bab-e8cc-4477-8c4d-8bfc17bb685a
	name: freeze_id_chr11.fam
	md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8
	filesize: 277.5KB
	filetype: .fam
	number_of_participants: 8118
      - id: alspacdcs:f8826d19-d418-488a-8112-8367b5243ea2
	name: freeze_id_chr12.fam
	md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8
	filesize: 277.5KB
	filetype: .fam
	number_of_participants: 8118
      - id: alspacdcs:45bc4687-9b97-47db-b3cd-0e7660a77abd
	name: freeze_id_chr13.fam
	md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8
	filesize: 277.5KB
	filetype: .fam
	number_of_participants: 8118
      - id: alspacdcs:68e4a8f3-df5a-4450-b93a-f7a71689a397
	name: freeze_id_chr14.fam
	md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8
	filesize: 277.5KB
	filetype: .fam
	number_of_participants: 8118
      - id: alspacdcs:c1b56a86-4d10-4547-b8f0-a3820f4c20ed
	name: freeze_id_chr15.fam
	md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8
	filesize: 277.5KB
	filetype: .fam
	number_of_participants: 8118
      - id: alspacdcs:9cb04ba9-e116-4ba7-ae9f-2e1d833708fc
	name: freeze_id_chr16.fam
	md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8
	filesize: 277.5KB
	filetype: .fam
	number_of_participants: 8118
      - id: alspacdcs:e4d93b96-b25f-4608-b721-f891e6c2d6df
	name: freeze_id_chr17.fam
	md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8
	filesize: 277.5KB
	filetype: .fam
	number_of_participants: 8118
      - id: alspacdcs:f6ca5a65-8201-46fe-ac3d-c674fe440d8c
	name: freeze_id_chr18.fam
	md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8
	filesize: 277.5KB
	filetype: .fam
	number_of_participants: 8118
      - id: alspacdcs:212d5196-61c5-4d52-b613-5f6616df9fca
	name: freeze_id_chr19.fam
	md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8
	filesize: 277.5KB
	filetype: .fam
	number_of_participants: 8118
      - id: alspacdcs:cfaa861f-bdf7-468d-b61a-8b43829bc5ae
	name: freeze_id_chr20.fam
	md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8
	filesize: 277.5KB
	filetype: .fam
	number_of_participants: 8118
      - id: alspacdcs:8df35b0d-b9ab-4009-94be-4d37bdd31dad
	name: freeze_id_chr21.fam
	md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8
	filesize: 277.5KB
	filetype: .fam
	number_of_participants: 8118
      - id: alspacdcs:6055cdf3-c2cb-4fe6-9440-1853beb7ebd3
	name: freeze_id_chr22.fam
	md5sum: 7d0f1e0b6f52f143324cb926a75f6cb8
	filesize: 277.5KB
	filetype: .fam
	number_of_participants: 8118

  - id: alspacdcs:ac6efac2-4681-4147-92da-b3ec2467aa96
    name: log files
    description: >-
      Plink log files. One per chromosome. Contains log information. See https://www.cog-genomics.org/plink/1.9/formats for further information.
    belongs_to_container: alspacdcs:50ae5d38-7e0d-4366-ab2d-73e91960eb24
    data_distributions:
      - id: alspacdcs:d929e593-a613-4ca3-9f35-c35b32be2db6
	name: freeze_id_chr1.log
	md5sum: 4f11877b64d9c9a0995ab1c577f56110
	filesize: 971.0B
	filetype: .log
      - id: alspacdcs:6fa0dbe8-630d-4856-8226-a832e5dc5389
	name: freeze_id_chr2.log
	md5sum: 8a009ab139c9672ba79aaa52f4203d51
	filesize: 971.0B
	filetype: .log
      - id: alspacdcs:a68c566f-c96e-46cd-b6c6-0c1125539094
	name: freeze_id_chr3.log
	md5sum: 97957d1a5d46b7752133aabd29338719
	filesize: 971.0B
	filetype: .log
      - id: alspacdcs:4c09ff2b-d741-485d-941c-b2d6a3b66058
	name: freeze_id_chr4.log
	md5sum: b0320867b36bc85b3ad9a598056bba4b
	filesize: 971.0B
	filetype: .log
      - id: alspacdcs:85a71fd7-6eff-4dae-a190-cda29c54b293
	name: freeze_id_chr5.log
	md5sum: ffebb4d4442623feadd2ebb9fea762aa
	filesize: 971.0B
	filetype: .log
      - id: alspacdcs:06fc2045-5551-4e12-942f-bb036da84233
	name: freeze_id_chr6.log
	md5sum: 6a43815212a8ec77fe3a35c9d9c3692e
	filesize: 971.0B
	filetype: .log
      - id: alspacdcs:b686c422-0770-4799-a0f7-497ba9805a6b
	name: freeze_id_chr7.log
	md5sum: e0aed5e04f28d83c75d9c6dd77e90995
	filesize: 971.0B
	filetype: .log
      - id: alspacdcs:a0a1f077-e686-4ca3-8d5d-8ce9613b59c9
	name: freeze_id_chr8.log
	md5sum: 2f044afef682620cd7e97f0a65263245
	filesize: 971.0B
	filetype: .log
      - id: alspacdcs:d3e4c602-3173-4e3e-9484-f3ab8a9aa003
	name: freeze_id_chr9.log
	md5sum: d3be92946a1deb543f6e5c9d46620d0c
	filesize: 971.0B
	filetype: .log
      - id: alspacdcs:2685cbf0-561e-42f8-b520-0db9bc44b792
	name: freeze_id_chr10.log
	md5sum: 4f371c2cd4e72ab6cf1a72b05e564bc3
	filesize: 977.0B
	filetype: .log
      - id: alspacdcs:595095f1-bcb0-4f70-857e-f541d9a93db7
	name: freeze_id_chr11.log
	md5sum: 2adab11dda24b32f38c8d15b59aca641
	filesize: 977.0B
	filetype: .log
      - id: alspacdcs:64ffd176-931c-4077-9ad4-2a9fcd30ddb3
	name: freeze_id_chr12.log
	md5sum: 42e1a1b5359f038351b4c4da9dd64832
	filesize: 977.0B
	filetype: .log
      - id: alspacdcs:357ad1e6-d057-4201-adbd-e54b1659f998
	name: freeze_id_chr13.log
	md5sum: e3bdc7637c2434b93036875597d8d0af
	filesize: 977.0B
	filetype: .log
      - id: alspacdcs:60190e1f-b199-4e98-bdf0-9e9fde558722
	name: freeze_id_chr14.log
	md5sum: adae00f346ebe7c59179bd6921711b11
	filesize: 975.0B
	filetype: .log
      - id: alspacdcs:b9373ee2-cd15-4098-a754-84b7ee0454cf
	name: freeze_id_chr15.log
	md5sum: 1396755abe3983552865e224131367b9
	filesize: 975.0B
	filetype: .log
      - id: alspacdcs:900c8cb4-489d-4990-bdcf-6a2884abfbae
	name: freeze_id_chr16.log
	md5sum: f644d2721c522b723f870e122886244b
	filesize: 975.0B
	filetype: .log
      - id: alspacdcs:73c2f0fc-d654-4563-ab9a-9e927d7f7057
	name: freeze_id_chr17.log
	md5sum: 62c56ae0fa0d64212bee294600a0f78e
	filesize: 975.0B
	filetype: .log
      - id: alspacdcs:97d1ff20-3781-4b93-b82b-06f485178201
	name: freeze_id_chr18.log
	md5sum: b9803f7d4723f6e7d1759115354b88cf
	filesize: 975.0B
	filetype: .log
      - id: alspacdcs:94c2b803-de0d-41c8-9f30-a0b72c6e5e3c
	name: freeze_id_chr19.log
	md5sum: 3e462678c1e53cceb57658be036768aa
	filesize: 975.0B
	filetype: .log
      - id: alspacdcs:0a8c20ea-b26c-4794-b9e2-b8971e4da850
	name: freeze_id_chr20.log
	md5sum: b5d7ac4496e2c2bb53bd025d6f1cf948
	filesize: 975.0B
	filetype: .log
      - id: alspacdcs:cce89426-b396-43c9-86af-8d1d1ccd2fde
	name: freeze_id_chr21.log
	md5sum: 57d7138f8b258ae539b289b863c8bab4
	filesize: 975.0B
	filetype: .log
      - id: alspacdcs:7c8ce228-052f-4490-bf61-ffcb3e517990
	name: freeze_id_chr22.log
	md5sum: f547e75de71b74d04a640df1d153d46c
	filesize: 975.0B
	filetype: .log

3.4 Genome-wide - 1000G imputed - G0 partners (gi_1000g_g0p)

3.4.1 Description

This dataset contains genome-wide array data imputed to the 1000 genomes reference panel for G0 partners, with some additional G0 mothers and G1 individuals. The data have been cleaned, flipped to the positive strand and imputed to 1000 genomes phase 1 version 3; coordinates are in b37. Reference genome build: GRCh37

3.4.2 Methodology

3,453 ALSPAC mothers and fathers were genotyped at 535,478 SNPs using the Illumina HumanCoreExome chip genotyping platforms by the ALSPAC lab and called using GenomeStudio. The resulting raw genome-wide data were subjected to standard quality control methods using PLINK (v1.07). Individuals were excluded on the basis of gender mismatches (n = 80); minimal or excessive heterozygosity (n = 64); disproportionate levels of individual missingness (>5%, n = 60) and possible contamination (n = 3).

Population stratification was assessed by multidimensional scaling analysis and principal component analysis, with comparison against 1000 Genomes phase 3 data; all individuals with non-European ancestry were removed (n = 266).

Cryptic relatedness was measured as SNP relatedness in GCTA (relatedness > 0.1, n = 69 removed). SNPs with a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 1E-7) and those which failed GenomeStudio quality control measures were removed (n = 21,298). 6,594 duplicate SNPs were also removed.

This resulted in genotypes for 2,911 unrelated mothers and fathers at 507,586 SNPs. We then identified 2217 samples where the ALN assigned historically by the lab matched the genetically assigned ALN.
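
The counts reported above can be reconciled with a short arithmetic check. The sketch below is illustrative only: it assumes the exclusion categories do not overlap and that n = 266 is the number of individuals removed for non-European ancestry. The dictionary labels are shorthand introduced here, not catalogue terms.

```python
# Sample-level exclusions as reported in the methodology.
# Assumption: the categories are disjoint, so the counts can simply be summed.
SAMPLE_EXCLUSIONS = {
    "gender mismatch": 80,
    "minimal or excessive heterozygosity": 64,
    "individual missingness > 5%": 60,
    "possible contamination": 3,
    "non-European ancestry": 266,
    "cryptic relatedness > 0.1": 69,
}

# SNP-level exclusions as reported in the methodology.
SNP_EXCLUSIONS = {
    "call rate, HWE or GenomeStudio QC failure": 21_298,
    "duplicate SNPs": 6_594,
}


def remaining(start, exclusions):
    """Subtract every exclusion count from the starting total."""
    return start - sum(exclusions.values())


samples_left = remaining(3_453, SAMPLE_EXCLUSIONS)
snps_left = remaining(535_478, SNP_EXCLUSIONS)
print(samples_left, snps_left)  # 2911 507586
```

Under these assumptions the totals agree with the 2,911 individuals and 507,586 SNPs stated in the text.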

We phased the data for 3074 samples that passed QC, including related subjects, using SHAPEIT v2.r837. We then removed 155,336 monomorphic SNPs, 1033 markers not in 1000 genomes, 11,842 A/T or G/C SNPs and 10 duplicate sites to give 337,732 SNPs on chromosomes 1-23. Of the 329,363 markers on chromosomes 1-22, 298,742 overlapped the reference genome. We imputed to the 1000 genomes phase 1 version 3 using the Michigan Imputation Server. We then identified 2217 samples where the ALN assigned historically by the lab matched the genetically assigned ALN. We then removed 12 subjects who have withdrawn consent and 6 subjects genotyped in an earlier work package to give 2201 subjects.

1737 putative G0 partner-G1 pairs for whom both G0 partner and G1 have called genotype data available were identified based on ALN. Given the G0 partners were invited by the G0 mother to take part and only enrolled in the study in their own right several years later, it could not be assumed that all G0 partners were biologically related to G1. Called genotype data for the 1720 unique G0 partners and 1737 unique G1s were merged (i.e. there were 17 pairs of siblings/twins among the G1 offspring), using plink v1.90b7.2 64-bit (11 Dec 2023).

After application of the plink filters --geno 0.05, --maf 0.01, --snps-only just-acgt and --autosome, the --related command in KING version 2.3.2 was used to perform kinship analysis, which confirmed that all 1737 putative G0 partner-G1 pairs are genetically related. This would be expected for biological father-offspring pairs, using the inference criteria described in Table 1 of Manichaikul, Ani, et al. "Robust relationship inference in genome-wide association studies." Bioinformatics 26.22 (2010): 2867-2873.
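
As a rough illustration of the kinship-based inference referred to above, the sketch below maps an estimated kinship coefficient to a relationship degree using the powers-of-two cut-offs from Manichaikul et al. (2010). It is not the KING implementation: the function name is hypothetical, and KING additionally uses the proportion of zero IBS sharing to separate parent-offspring pairs from full siblings, which is not modelled here.

```python
def infer_degree(kinship):
    """Map an estimated kinship coefficient to a relationship category.

    Cut-offs are 2**-1.5, 2**-2.5, 2**-3.5 and 2**-4.5
    (approximately 0.354, 0.177, 0.0884 and 0.0442).
    """
    if kinship > 2 ** -1.5:
        return "duplicate/MZ twin"
    if kinship > 2 ** -2.5:
        return "1st-degree"
    if kinship > 2 ** -3.5:
        return "2nd-degree"
    if kinship > 2 ** -4.5:
        return "3rd-degree"
    return "unrelated"


# Illustrative kinship values, not taken from the ALSPAC data.
for phi in (0.25, 0.12, 0.01):
    print(phi, infer_degree(phi))
```

A father-offspring pair has an expected kinship coefficient of 0.25, which falls in the 1st-degree range.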

3.4.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gi_1000g_g0p_2016-11-22_f5
name: Genome-wide - 1000G imputed - G0 partners version 2016-11-22 freeze 5
description: >-
  This dataset is the fifth freeze of the 2016-11-22 version of the genome-wide array data imputed to the 1000 genomes reference panel
  for G0 partners, with some additional G0 mothers and G1 individuals.

freeze_size: 44G
linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a
git_tag: https://github.com/alspac/dataset_gi_1000g_g0p/releases/tag/freeze5
is_current_freeze: true
freeze_number: 5
freeze_date: 2025-02-27
previous_freeze: alspacdcs:gi_1000g_g0p_2016-11-22_f4
freeze_of_alspac_dataset_version: alspacdcs:gi_1000g_g0p_2016-11-22
freeze_of_named_alspac_dataset: alspacdcs:gi_1000g_g0p


has_containers:
  - id: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c
    name: data
    description: A dir/folder containing the data bgen and sample files

has_parts:
  - id: alspacdcs:gi_1000g_g0p_2016-11-22_sample_f4
    name: Samples
    description: >-
      The samples in the data. To be used with the genetic data.
      A plain text .sample file.
      See https://doi.org/10.1101/308296 for file format details.
    data_distributions:

      - id: alspacdcs:593e4010-b671-4f25-b040-295b78e3107b
	name: swapped.sample
	md5sum: fc74e422b93dc53025b9664c0a57f320
	filesize: 164.9KB
	filetype: .sample
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c

  - id: alspacdcs:c2d03376-f408-4ae9-b9a5-523ee1173b9a
    name: bgens
    description: >-
      An Oxford Bgen (v1.2) file for all chromosomes. To be used with sample file.
      See https://doi.org/10.1101/308296 for file format details.
    data_distributions:

      - id: alspacdcs:b8bd3364-26fe-4bc8-b635-7050788ef646
	name: filtered_data_chr01.bgen
	md5sum: a5eb049e4df5a8b005ae51b47947d830
	filesize: 3.3GB
	filetype: .bgen
	number_of_variants: 2159337
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c

      - id: alspacdcs:5557335c-c2d9-4456-a25f-e11f652e9612
	name: filtered_data_chr02.bgen
	md5sum: e297c8d30455053d23ac360bcc886bb0
	filesize: 3.5GB
	filetype: .bgen
	number_of_variants: 2349883
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c

      - id: alspacdcs:2e35ed2f-6b37-4d55-b88f-08d9f090f636
	name: filtered_data_chr03.bgen
	md5sum: c0b55e9d65c219ffb1b8c58a0ebb7c18
	filesize: 3.0GB
	filetype: .bgen
	number_of_variants: 1969275
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c

      - id: alspacdcs:bed2d41e-0fed-46fb-98d0-aa2cc845c0ca
	name: filtered_data_chr04.bgen
	md5sum: 514f09f02c74fc3eca83379e9e99c5dc
	filesize: 3.1GB
	filetype: .bgen
	number_of_variants: 1969883
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c

      - id: alspacdcs:4f346ccf-883b-436c-9af3-d1c0c76fe03b
	name: filtered_data_chr05.bgen
	md5sum: f4accbf5bdd6a2ccc9598e9e2221915d
	filesize: 2.7GB
	filetype: .bgen
	number_of_variants: 1809961
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c

      - id: alspacdcs:8fbfd61d-e9b9-4f5f-84f1-ff534cd061d4
	name: filtered_data_chr06.bgen
	md5sum: a9327ad1591fdf7d349b066544e71c3a
	filesize: 2.6GB
	filetype: .bgen
	number_of_variants: 1758025
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c

      - id: alspacdcs:785f2386-a16d-4631-859a-39a1a6c3fb8b
	name: filtered_data_chr07.bgen
	md5sum: f832922558eddcf3feed87091c2ec0ae
	filesize: 2.6GB
	filetype: .bgen
	number_of_variants: 1601293
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c

      - id: alspacdcs:f63100b6-49f9-4d01-9c6d-e990ee513da1
	name: filtered_data_chr08.bgen
	md5sum: 47d79712e676a0048f90858cbb888179
	filesize: 2.3GB
	filetype: .bgen
	number_of_variants: 1558902
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c

      - id: alspacdcs:359141df-26e5-4048-8d25-a6940a9a8893
	name: filtered_data_chr09.bgen
	md5sum: 82a480f3e8792db2c1cec3adc50e1357
	filesize: 1.9GB
	filetype: .bgen
	number_of_variants: 1189463
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c

      - id: alspacdcs:8f97f928-e876-4f1c-a9b3-db74887b9fc8
	name: filtered_data_chr10.bgen
	md5sum: 8f64fe184e4c876a345a728ed5eeddcf
	filesize: 2.1GB
	filetype: .bgen
	number_of_variants: 1363104
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c

      - id: alspacdcs:0adf771c-75f7-4709-afc1-47b9c32107d8
	name: filtered_data_chr11.bgen
	md5sum: b1b7e3bef0fe72cd90bd0ba456f687aa
	filesize: 2.1GB
	filetype: .bgen
	number_of_variants: 1359640
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c

      - id: alspacdcs:71dcdd1b-18d3-42a0-a52c-f1e8bc6cb933
	name: filtered_data_chr12.bgen
	md5sum: 509202db22200fe0bd58210ab8e9c757
	filesize: 2.1GB
	filetype: .bgen
	number_of_variants: 1316510
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c

      - id: alspacdcs:c2e5243e-9c84-4030-a34e-0148cc9c42b2
	name: filtered_data_chr13.bgen
	md5sum: 176a10d38ab80783a8e392e5791edea7
	filesize: 1.5GB
	filetype: .bgen
	number_of_variants: 988473
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c

      - id: alspacdcs:971ace3d-1a9c-47f0-b012-a1b943881b70
	name: filtered_data_chr14.bgen
	md5sum: 1ecd96aab2925bafd7d20497d85dd937
	filesize: 1.4GB
	filetype: .bgen
	number_of_variants: 903811
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c

      - id: alspacdcs:5a4bd639-5d03-4eb3-a4eb-ba074e7d27ef
	name: filtered_data_chr15.bgen
	md5sum: f8c5b54206189808e9a361cc0da63798
	filesize: 1.4GB
	filetype: .bgen
	number_of_variants: 814028
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c

      - id: alspacdcs:bbe21f58-22bc-4ce0-a40b-9f0e11f5bafd
	name: filtered_data_chr16.bgen
	md5sum: 52f065575d3cb2dff34df6763a583766
	filesize: 1.5GB
	filetype: .bgen
	number_of_variants: 867901
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c

      - id: alspacdcs:7df9f8db-2c34-4f5e-a868-e55620c740c7
	name: filtered_data_chr17.bgen
	md5sum: 73d85caf67dcedc63b11a43bd5ccb44d
	filesize: 1.4GB
	filetype: .bgen
	number_of_variants: 755467
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c

      - id: alspacdcs:2a8eddd9-716c-460e-bc4a-78b4b5df12b4
	name: filtered_data_chr18.bgen
	md5sum: b8e055a6c0955bb67161c9f7a1d8cad7
	filesize: 1.3GB
	filetype: .bgen
	number_of_variants: 783661
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c

      - id: alspacdcs:6891d0c5-e685-46c7-a98f-ef09036db1e9
	name: filtered_data_chr19.bgen
	md5sum: 37ea045cd9f4027cba547b7b89c3a1a0
	filesize: 1.2GB
	filetype: .bgen
	number_of_variants: 606147
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c

      - id: alspacdcs:51e60215-105a-403a-8112-7d731de3471e
	name: filtered_data_chr20.bgen
	md5sum: d241eb21be3188c26c460e1f65f0d8c1
	filesize: 1.1GB
	filetype: .bgen
	number_of_variants: 618749
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c

      - id: alspacdcs:1dde86df-9e0a-4c02-afed-6381299cfa49
	name: filtered_data_chr21.bgen
	md5sum: 7881bdc24e7f0adbfb800b49d1efd590
	filesize: 671.1MB
	filetype: .bgen
	number_of_variants: 378064
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c

      - id: alspacdcs:16e2f696-cb21-4661-bce0-3a712fcd3eae
	name: filtered_data_chr22.bgen
	md5sum: 824412e963441699f260c6245f65659d
	filesize: 721.5MB
	filetype: .bgen
	number_of_variants: 366590
	number_of_participants: 2198
	belongs_to_container: alspacdcs:26cc220b-a052-463e-90ab-73280c3f4b0c
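
The md5sum values listed in a freeze document can be used to confirm that a file was transferred intact. The following is a minimal sketch using only the Python standard library; the function names are introduced here for illustration and are not part of any ALSPAC tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True when the file's digest matches the catalogue md5sum."""
    return md5_of_file(path) == expected_md5.lower()
```

For example, verify("data/swapped.sample", "fc74e422b93dc53025b9664c0a57f320") would check the sample file listed above, assuming it has been saved under a local data directory (the path is hypothetical).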

3.5 Genome-wide - 1000G imputed - G0 mothers + G1 (gi_1000g_g0m_g1)

3.5.1 Description

This dataset contains genome-wide 1000G imputed data for G0 mothers + G1. The data have been cleaned, flipped to the positive strand and imputed to 1000 genomes phase 1 version 3; coordinates are in b37. Reference genome build: GRCh37

3.5.2 Methodology

ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platforms by 23andme subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%) and insufficient sample replication (IBD < 0.8).

Population stratification was assessed by multidimensional scaling analysis and compared with Hapmap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed.

SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed. Cryptic relatedness was measured as proportion of identity by descent (IBD > 0.1). Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,115 subjects and 500,527 SNPs passed these quality control filters.

ALSPAC mothers were genotyped using the Illumina human660W-quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs. SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally SNPs with a minor allele frequency of less than 1% were removed.

Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded.

Cryptic relatedness was assessed using an IBD estimate of more than 0.125, which is expected to correspond to roughly 12.5% alleles shared IBD or a relatedness at the first cousin level. Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,048 subjects and 526,688 SNPs passed these quality control filters.

We combined 477,482 SNP genotypes in common between the sample of mothers and sample of children. We removed SNPs with genotype missingness above 1% due to poor quality (11,396 SNPs removed) and removed a further 321 subjects due to potential ID mismatches. This resulted in a dataset of 17,842 subjects containing 6,305 duos and 465,740 SNPs (112 were removed during liftover and 234 were out of HWE after combination). We estimated haplotypes using ShapeIT (v2.r644), which utilises relatedness during phasing. We obtained a phased version of the 1000 genomes reference panel (Phase 1, Version 3) from the Impute2 reference data repository (phased using ShapeIT v2.r644, haplotype release date Dec 2013). Imputation of the target data was performed using Impute V2.2.2 against the reference panel (all polymorphic SNPs excluding singletons), using all 2186 reference haplotypes (including non-Europeans).
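
The subject and SNP counts for the merged mothers and children dataset follow from the figures quoted in the preceding paragraphs. The check below is a sketch that assumes the 321 ID-mismatch removals come out of the combined total of the two QC-passing samples, and that the liftover and HWE removals apply to the SNPs left after the missingness filter.

```python
# Subjects passing QC in each sample, as reported above.
children, mothers = 9_115, 9_048

# Assumption: ID-mismatch removals are taken from the combined total.
subjects = children + mothers - 321

# SNPs in common, minus missingness, liftover and HWE removals.
snps = 477_482 - 11_396 - 112 - 234

print(subjects, snps)  # 17842 465740
```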

This gave 8,237 eligible children and 8,196 eligible mothers with available genotype data after exclusion of related subjects using cryptic relatedness measures described previously.

Known issues: There is a known strand issue present within this imputation: the Dec 2013 haplotype release of 1000 genomes phase 1 version 3 has 199 reported SNPs with incorrect strand. For more information and the origins of this list please visit https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated_SHAPEIT2_16-06-14.html. It is very unlikely that they have systematic effects across the genome; the problem is most probably isolated to these 199 known problematic SNPs. Users are advised to discard them from their analysis.

The bgen files within the gi_1000g_g0m_g1 dataset have NA in place of the chromosome column. Some tools may allow this, while others are less forgiving. Users may therefore wish to re-format the dataset (using QCtool or equivalent) for their work.

Allele frequency concordance with other cohorts: When contributing to consortia you may find that the allele frequencies in ALSPAC for a few thousand SNPs are discordant from a reference panel used by the consortium. This is actually to be expected - when calculating allele frequencies, even from the same population, in two different samples for many millions of SNPs there will be a number of SNPs that appear to be highly discordant.

3.5.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_f5
name: >-
  Genome-wide - 1000G imputed - G0 mothers + G1 version 2015-10-30
  freeze 5
description: >-
  This is the fifth freeze of the 2015-10-30 version of the
  gi_1000g_g0m_g1 dataset. It contains data in the oxford format
  which is a combination of bgen and sample (version 1.2) files. It is a subset of
  the data in gi_1000g_g0m_g1_2015-10-30 limited to one format and
  with participants who have withdrawn their consent removed.

  The Dec 2013 haplotype release of 1000 genomes phase 1 version 3 has 199 reported SNPs
  with incorrect strand. The strand issues are present in this imputation version. For more
  information and the origins of this list please visit:
  https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated_SHAPEIT2_16-06-14.html

  It is very unlikely that they have systematic effects across the genome and most 
  probably are just isolated to these 199 known problematic SNPs.

  The user is advised to discard them from their analysis.
freeze_size: 122G
linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a
git_tag: https://github.com/alspac/dataset_gi_1000g_g0m_g1/releases/tag/freeze5
is_current_freeze: true
freeze_number: 5
freeze_date: 2025-02-27
previous_freeze: alspacdcs:gi_1000g_g0m_g1_2015-10-30_f4
freeze_of_alspac_dataset_version: alspacdcs:gi_1000g_g0m_g1_2015-10-30
freeze_of_named_alspac_dataset: alspacdcs:gi_1000g_g0m_g1

has_containers:
  - id: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505
    name: data
    description: A dir/folder containing the data bgen and sample files

has_parts:
  - id: alspacdcs:fa64c3c2-14ae-4853-bb1a-bec2545d217d
    name: Samples
    description: >-
      The samples in the data. To be used with the genetic data.
      A plain text .sample file.
      See https://doi.org/10.1101/308296 for file format details.
    data_distributions:
      - id: alspacdcs:bf6acc7d-a788-4ea1-b836-691582bef85f
	name: swapped.sample
	md5sum: d7dd4fe786b399bb107b332acf27f8bc
	filesize: 1.2MB
	filetype: .sample
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505

  - id: alspacdcs:87995855-f693-4b4d-8155-8dcb141b85ec
    name: Bgens
    description: >-
      An Oxford Bgen (v1.2) file for all chromosomes. To be used with sample file.
      See https://doi.org/10.1101/308296 for file format details.
    data_distributions:
      - id: alspacdcs:15a76b52-275f-4969-9b11-5bb9b89a6460
	name: filtered_01.bgen
	md5sum: fad144852b7c9c929ea1a55b8481798c
	filesize: 9.0GB
	filetype: .bgen
	number_of_variants: 2155158
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505

      - id: alspacdcs:c34645be-c813-42c9-9032-33b9ef6a4ec0
	name: filtered_02.bgen
	md5sum: 91168a792595ee55375d6c72c881fa6c
	filesize: 9.1GB
	filetype: .bgen
	number_of_variants: 2346862
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505

      - id: alspacdcs:7bacda1b-0ba1-4f81-9f92-6821d2cfd588
	name: filtered_03.bgen
	md5sum: 6e898fe7aba1d39e832245267a9ec30e
	filesize: 7.6GB
	filetype: .bgen
	number_of_variants: 1966662
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505

      - id: alspacdcs:c8518005-9401-48fb-80c8-5841611a1e17
	name: filtered_04.bgen
	md5sum: c7ba39fbff7de19ffd98b93ff217108b
	filesize: 8.3GB
	filetype: .bgen
	number_of_variants: 1968171
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505

      - id: alspacdcs:84e27681-1425-4d3a-b348-e5cacbf110cf
	name: filtered_05.bgen
	md5sum: 173056913dd6dc1684e9118907af1fd5
	filesize: 6.8GB
	filetype: .bgen
	number_of_variants: 1808090
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505

      - id: alspacdcs:922442bb-33de-402c-ae5e-7268e642f05e
	name: filtered_06.bgen
	md5sum: b8296902cc14e29111b2caefbc52a00b
	filesize: 6.8GB
	filetype: .bgen
	number_of_variants: 1755859
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505

      - id: alspacdcs:8cf64efd-266f-433b-8392-cf8eea0133b7
	name: filtered_07.bgen
	md5sum: 3072cca6a05fdb782b858f70beed6e06
	filesize: 7.1GB
	filetype: .bgen
	number_of_variants: 1599387
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505

      - id: alspacdcs:12f65474-8d94-4cfe-a014-0ff3fa84bec2
	name: filtered_08.bgen
	md5sum: c57b0cc8c3b47c8058e6f95ba742a89d
	filesize: 5.9GB
	filetype: .bgen
	number_of_variants: 1557429
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505

      - id: alspacdcs:869352b1-9634-41ee-9851-69c3fb0e990a
	name: filtered_09.bgen
	md5sum: 0e0d21cb1dc4d276d0a4353cc7da0564
	filesize: 5.0GB
	filetype: .bgen
	number_of_variants: 1187731
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505

      - id: alspacdcs:e17bd046-20e1-434d-975e-09348fd69ffc
	name: filtered_10.bgen
	md5sum: e5f8a44f260c009a9fec7bdc105ead76
	filesize: 5.4GB
	filetype: .bgen
	number_of_variants: 1361506
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505

      - id: alspacdcs:160c5e26-45f8-47bf-9cec-c627d8912c5f
	name: filtered_11.bgen
	md5sum: 7c64c009aaf9fdb84c21b31f51e28bfa
	filesize: 5.3GB
	filetype: .bgen
	number_of_variants: 1356882
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505
	number_of_participants: 17444

      - id: alspacdcs:9fc4fbc3-2b5d-4dd2-98b7-c7928e669bd7
	name: filtered_12.bgen
	md5sum: 8f0d903ca1cf24ca0e45494bd0a1426c
	filesize: 5.3GB
	filetype: .bgen
	number_of_variants: 1314328
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505

      - id: alspacdcs:12ea18df-3860-40ac-afbc-dbb8d7bfc61e
	name: filtered_13.bgen
	md5sum: e59348ea876d3f5c3b6331e738daa162
	filesize: 3.9GB
	filetype: .bgen
	number_of_variants: 987740
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505

      - id: alspacdcs:bead6d13-e811-44c3-8f77-8da121506d90
	name: filtered_14.bgen
	md5sum: 3f80471a1e183e478ca3674482ed89e4
	filesize: 3.9GB
	filetype: .bgen
	number_of_variants: 904351
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505

      - id: alspacdcs:8dc54c1e-659e-4489-9f3a-48768f65a067
	name: filtered_15.bgen
	md5sum: 2166a96fc0bbdc990b1bcb513f4372bd
	filesize: 3.7GB
	filetype: .bgen
	number_of_variants: 812545
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505

      - id: alspacdcs:08129d30-8910-4ae4-bfdd-52f9c41af15d
	name: filtered_16.bgen
	md5sum: c44b1d287c79c69b2171c6822339cf4b
	filesize: 4.3GB
	filetype: .bgen
	number_of_variants: 865998
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505

      - id: alspacdcs:e876631c-c38f-4105-a3ce-4e6f00ccba6d
	name: filtered_17.bgen
	md5sum: e4c50e9c54d4baa59d191a756d60b32e
	filesize: 3.8GB
	filetype: .bgen
	number_of_variants: 753174
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505

      - id: alspacdcs:2b7f5674-4c73-4118-93d2-0e648d2306b6
	name: filtered_18.bgen
	md5sum: fa893fede52923d5805f8583dbed51bd
	filesize: 3.4GB
	filetype: .bgen
	number_of_variants: 783010
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505

      - id: alspacdcs:d277b81c-0fad-4397-bb58-c2843864e0db
	name: filtered_19.bgen
	md5sum: 999c860cfb0f3484d1a78ef639c594fa
	filesize: 3.9GB
	filetype: .bgen
	number_of_variants: 603516
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505

      - id: alspacdcs:42613b8c-17fd-47a7-b78c-cdc08fb01e61
	name: filtered_20.bgen
	md5sum: 59dd1ebbefb28c2b5818fb2aca9805de
	filesize: 2.7GB
	filetype: .bgen
	number_of_variants: 617694
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505

      - id: alspacdcs:eb1acd03-d59d-482f-97d5-3c2e9e3f3311
	name: filtered_21.bgen
	md5sum: dce2d85e4d08018ea365afdeac561447
	filesize: 1.9GB
	filetype: .bgen
	number_of_variants: 377554
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505

      - id: alspacdcs:b4518de2-7c28-4536-9164-d67aa7d97c28
	name: filtered_22.bgen
	md5sum: b5ba868e802d8eee4ac76b0f878d427c
	filesize: 2.0GB
	filetype: .bgen
	number_of_variants: 365644
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505

      - id: alspacdcs:e1e6958c-0852-4173-9e98-ee7dd50f5ad3
	name: filtered_23.bgen
	md5sum: 512a78f6c379ce43e827da44a91b4c5f
	filesize: 5.9GB
	filetype: .bgen
	number_of_variants: 1250218
	number_of_participants: 17444
	belongs_to_container: alspacdcs:0bda16d0-d4c1-47f0-b2fa-1bb213ed6505
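
The filesize strings in the freeze documents can be converted to bytes and summed, for instance to estimate the storage needed for a subset of chromosomes. The sketch below assumes binary multiples (1 KB = 1024 B); the catalogue does not state whether sizes are binary or decimal, so totals are approximate.

```python
# Assumption: units are binary multiples; the catalogue does not specify this.
UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_filesize(text):
    """Convert a catalogue size string such as '671.1MB' to a number of bytes."""
    # Check the two-letter suffixes first so that 'KB' is not read as 'B'.
    for suffix in ("KB", "MB", "GB", "B"):
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * UNITS[suffix]
    raise ValueError(f"unrecognised size: {text!r}")


# Sizes of filtered_21.bgen, filtered_22.bgen and filtered_23.bgen listed above.
sizes = ["1.9GB", "2.0GB", "5.9GB"]
total_gb = sum(parse_filesize(size) for size in sizes) / UNITS["GB"]
print(round(total_gb, 1))
```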

3.6 Genome-wide - TOPMed round 2 imputed - G0 mothers + G1 (gi_topmed_g0m_g1)

SNP chips are useful for the generation of data on hundreds of thousands of SNPs, but there are millions more polymorphisms that remain untyped with this technology. If suitable numbers of whole genome sequences exist (e.g. 1000 genomes data) then millions of genotypes that are missing from a sample because they have not been typed by SNP chips can be imputed using probabilistic methods. Here the ALSPAC mother and children data were imputed to a new reference panel known as the Haplotype Reference Consortium (HRC) panel. This comprises around 31000 sequenced individuals (mostly European), so the coverage of European haplotypes is much greater than in other panels. As a consequence imputation accuracy is expected to improve, particularly at lower frequencies.

3.6.1 Description

This dataset contains genotype data imputed to TOPMed round 2 for G0 mothers and G1. Reference genome build: GRCh38

3.6.2 Methodology

ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platforms by 23andme subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%) and insufficient sample replication (IBD < 0.8).

Population stratification was assessed by multidimensional scaling analysis and compared with Hapmap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed.

SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed. Cryptic relatedness was measured as proportion of identity by descent (IBD > 0.1).

Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,115 subjects and 500,527 SNPs passed these quality control filters.

ALSPAC mothers were genotyped using the Illumina human660W-quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs. SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally SNPs with a minor allele frequency of less than 1% were removed.

Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded.

Cryptic relatedness was assessed using an IBD estimate of more than 0.125, which is expected to correspond to roughly 12.5% of alleles shared IBD or relatedness at the first cousin level. Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,048 subjects and 526,688 SNPs passed these quality control filters.
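
The link between the 0.125 threshold and first cousins follows from the expected IBD-sharing probabilities. The sketch below uses the standard theoretical probabilities of sharing 0, 1 or 2 alleles IBD (Z0, Z1, Z2) for outbred pedigrees; the dictionary and function names are illustrative, and real estimates will scatter around these expectations.

```python
# Expected probabilities of sharing 0, 1 or 2 alleles IBD (Z0, Z1, Z2).
EXPECTED_IBD = {
    "duplicate or monozygotic twin": (0.0, 0.0, 1.0),
    "parent-offspring": (0.0, 1.0, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
    "half siblings": (0.5, 0.5, 0.0),
    "first cousins": (0.75, 0.25, 0.0),
    "unrelated": (1.0, 0.0, 0.0),
}


def proportion_ibd(z1, z2):
    """Proportion of alleles shared IBD: P(IBD=2) + 0.5 * P(IBD=1)."""
    return z2 + 0.5 * z1


def is_flagged_related(pi_hat, threshold=0.125):
    """Flag a pair whose estimated IBD proportion exceeds the threshold."""
    return pi_hat > threshold


for relationship, (z0, z1, z2) in EXPECTED_IBD.items():
    pi_hat = proportion_ibd(z1, z2)
    print(f"{relationship}: {pi_hat:.3f} flagged={is_flagged_related(pi_hat)}")
```

With a strict "more than 0.125" rule, a pair sitting exactly at the first-cousin expectation is not flagged, whereas anything closer (half siblings and above) is.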

We combined 477,482 SNP genotypes in common between the sample of mothers and sample of children. We removed SNPs with genotype missingness above 1% due to poor quality (11,396 SNPs removed) and removed a further 321 subjects due to potential ID mismatches. This resulted in a dataset of 17,842 subjects containing 6,305 duos and 465,740 SNPs (112 were removed during liftOver and 234 were out of HWE after combination).
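
The counts quoted above can be reconciled with simple arithmetic. The sketch assumes that the 17,842 subjects are the QC-passing children plus the QC-passing mothers minus the 321 potential ID mismatches, and that the three SNP removals are all subtracted from the 477,482 shared SNPs; the variable names are illustrative.

```python
# Subjects passing QC in each panel, as reported in the text.
children_subjects = 9_115
mothers_subjects = 9_048
id_mismatch_subjects = 321
merged_subjects = children_subjects + mothers_subjects - id_mismatch_subjects

# SNPs shared between the two panels and the reported removals.
shared_snps = 477_482
removed_missingness = 11_396
removed_liftover = 112
removed_hwe = 234
merged_snps = shared_snps - removed_missingness - removed_liftover - removed_hwe

print(merged_subjects, merged_snps)  # 17842 465740
```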

Individuals in this dataset who have withdrawn from the project were removed before proceeding with imputation-specific quality control. This left 17,450 individuals.

The combined mothers and children genotype panel was filtered using Plink 2.0 to remove SNPs with a MAF below 0.01 or a missing call rate exceeding 0.01. The joint set of SNPs was checked for palindromic SNPs, but none were present. The combined call set was lifted over from GRCh37 to GRCh38 using UCSC liftOver.
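
A palindromic (strand-ambiguous) SNP is one whose two alleles are complements of each other, so A/T and C/G SNPs look the same on either strand. A minimal sketch of such a check, with a hypothetical function name, could look like this:

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindromic(allele1, allele2):
    """Return True when the two alleles are Watson-Crick complements."""
    a1, a2 = allele1.upper(), allele2.upper()
    return COMPLEMENT.get(a1) == a2


for alleles in [("A", "T"), ("C", "G"), ("A", "G"), ("T", "C")]:
    print(alleles, is_palindromic(*alleles))
```

Alleles outside A, C, G and T (for example indel codes) are treated as non-palindromic here, which is a design choice of the sketch.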

The dataset was later filtered to retain SNPs with a HWE P value above 1e-6, leaving 455,150 SNPs. The combined autosomal call set was then converted to VCF files and uploaded to the TOPMed imputation server to flag variants requiring a strand fix. Any SNPs flagged with an issue were corrected or filtered out using Plink2. 454,248 SNPs remained within the autosomes.

Phasing and imputation were conducted on the Michigan TOPMed imputation server (v1.7.4) in October 2023. Phasing was done using Eagle (v2.4). Imputation was done with minimac4 (v1.0.2) to TOPMed R2. An R squared filter of 0.3 was applied.
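
An imputation quality filter of this kind can be sketched as a check on the `R2` key of a VCF INFO field, which is where minimac4 reports its estimated imputation R squared. The sketch assumes that variants with R2 of at least 0.3 are kept and that a variant with no R2 value is dropped; both are assumptions, and the function names and example INFO strings are hypothetical.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-style keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def passes_r2(info, min_r2=0.3):
    """Return True if the INFO field carries an R2 value of at least min_r2."""
    r2 = parse_info(info).get("R2")
    if r2 is None or r2 is True:
        return False  # assumption: variants without an R2 value are dropped
    return float(r2) >= min_r2


print(passes_r2("AF=0.12;MAF=0.12;R2=0.85;IMPUTED"))
print(passes_r2("AF=0.001;MAF=0.001;R2=0.12;IMPUTED"))
```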

3.6.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gi_topmed_g0m_g1_2024-12-19_f4
name: >-
  Genome-wide - TOPmed imputed - G0 mothers + G1 version 2024-12-19
  freeze 5
description: >-
  Freeze 5 of version 2024-12-19 Genome-wide array data imputed to the TOPmed round 2 reference panel for G0 mothers and G1 individuals in bgen and sample file format (version 1.2). 
freeze_size: 161G
linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a
git_tag: https://github.com/alspac/dataset_gi_topmed_g0m_g1/releases/tag/freeze5
is_current_freeze: true
freeze_number: 5
freeze_date: 2025-02-27
previous_freeze: 
freeze_of_alspac_dataset_version: alspacdcs:gi_topmed_g0m_g1_2024-12-19
freeze_of_named_alspac_dataset: alspacdcs:gi_topmed_g0m_g1

has_containers:
  - id: alspacdcs:7bffd114-d042-4ec3-9c78-3e7fa8c7d8fd ## uuid
    name: data
    description: A dir/folder containing the freeze data bgen and .sample files

has_parts:
  - id: alspacdcs:8614a15d-8936-4b4e-ae34-a855ac8d1810
    name: Omics ID sample
    description: >-
      The samples in the data. To be used with the genetic data.
      A plain text .sample file.
      See https://doi.org/10.1101/308296 for file format details.
    data_distributions:
    - id: alspacdcs:ab0afa36-1c8d-45f0-9b0f-946ee1c56dae
      name: freeze.sample
      md5sum: 6523f64b382a44d4354a3be8bd5205e3
      filesize: 954.0KB
      filetype: .sample
      number_of_participants: 17444

  - id: alspacdcs:1f676e6d-f136-43d6-bb9c-381758a833f3
    name: Data bgen files
    description: >-
      An Oxford Bgen (v1.2) file for all chromosomes. To be used with sample file.
      See https://doi.org/10.1101/308296 for file format details.
    data_distributions:
    - id: alspacdcs:f430e13a-3a06-4b86-8a46-e98648075a9f
      name: chr1_freeze.bgen
      md5sum: 21b6a08b8d2e90004b3f54bb06e9443b
      filesize: 8.0GB
      filetype: .bgen
      number_of_variants: 5665189
      number_of_participants: 17444
    - id: alspacdcs:c8d1d13d-1e73-4dcd-a583-afbd5e9329f9
      name: chr2_freeze.bgen
      md5sum: d6e8ac3bcda8f42f3294e6a80160e25f
      filesize: 8.3GB
      filetype: .bgen
      number_of_variants: 6104056
      number_of_participants: 17444
    - id: alspacdcs:d6155947-6f27-4cc6-bafc-2cd373fb8703
      name: chr3_freeze.bgen
      md5sum: 593e63c2b01b85d971bba521f1dd6dcc
      filesize: 7.0GB
      filetype: .bgen
      number_of_variants: 5039584
      number_of_participants: 17444
    - id: alspacdcs:f5e7a8fa-7019-4a83-a94a-a14191f98ba4
      name: chr4_freeze.bgen
      md5sum: 60a2a1fa8d779535d1b471816b5d198a
      filesize: 7.5GB
      filetype: .bgen
      number_of_variants: 4910014
      number_of_participants: 17444
    - id: alspacdcs:b85035e4-be1f-44e8-90a3-5919be154e42
      name: chr5_freeze.bgen
      md5sum: 4bcd99f640ceb04a235f65d4361c9b2c
      filesize: 6.4GB
      filetype: .bgen
      number_of_variants: 4540467
      number_of_participants: 17444
    - id: alspacdcs:797604c0-3a6d-47e3-b983-15384995a24e
      name: chr6_freeze.bgen
      md5sum: 658372add4922a417d15dd794a7c4cf6
      filesize: 6.1GB
      filetype: .bgen
      number_of_variants: 4341095
      number_of_participants: 17444
    - id: alspacdcs:a573e85f-8dbf-4182-917c-c28cd8a7ddc7
      name: chr7_freeze.bgen
      md5sum: 08e4ab75c11d694859262bcdfa7c28a6
      filesize: 6.1GB
      filetype: .bgen
      number_of_variants: 4083826
      number_of_participants: 17444
    - id: alspacdcs:e85a6acb-d587-423e-87e4-90f697e7a390
      name: chr8_freeze.bgen
      md5sum: 7da68399716feee6d58670c380ce6136
      filesize: 5.4GB
      filetype: .bgen
      number_of_variants: 3923042
      number_of_participants: 17444
    - id: alspacdcs:4485a925-332b-457a-98ae-fb8f83f24cfc
      name: chr9_freeze.bgen
      md5sum: 242c4e477c3a38dfe35060deca254c02
      filesize: 4.3GB
      filetype: .bgen
      number_of_variants: 3121200
      number_of_participants: 17444
    - id: alspacdcs:63fae86d-44ca-4777-9c03-40f26c8c3489
      name: chr10_freeze.bgen
      md5sum: f1b1a42a77534ea92368abb64eb04750
      filesize: 5.0GB
      filetype: .bgen
      number_of_variants: 3462260
      number_of_participants: 17444
    - id: alspacdcs:c9c116f6-bb4f-4f3c-bede-6b7b6fa6c4be
      name: chr11_freeze.bgen
      md5sum: 142cbe60c9975ef6678884ae3f2f8b3c
      filesize: 5.0GB
      filetype: .bgen
      number_of_variants: 3500176
      number_of_participants: 17444
    - id: alspacdcs:1ea4e961-cdee-4a7a-bad3-5b493e964c6d
      name: chr12_freeze.bgen
      md5sum: f929e2bc815169f2195d573638c0bd79
      filesize: 4.8GB
      filetype: .bgen
      number_of_variants: 3380589
      number_of_participants: 17444
    - id: alspacdcs:ff104d34-f95d-4afc-8a4a-a0558fa70f01
      name: chr13_freeze.bgen
      md5sum: b31c183805b7c905f10db4df575b4e09
      filesize: 3.7GB
      filetype: .bgen
      number_of_variants: 2529048
      number_of_participants: 17444
    - id: alspacdcs:5232e411-30dd-4386-a5f9-8b1160078387
      name: chr14_freeze.bgen
      md5sum: 65f2e357ed7dfa431f3354c408c0c8a5
      filesize: 3.2GB
      filetype: .bgen
      number_of_variants: 2255877
      number_of_participants: 17444
    - id: alspacdcs:74cb2f8b-ab30-4fb3-9035-7e21764aa28b
      name: chr15_freeze.bgen
      md5sum: 372f6b19cb1a3ce5c4d8b535740baef1
      filesize: 3.0GB
      filetype: .bgen
      number_of_variants: 2071294
      number_of_participants: 17444
    - id: alspacdcs:89998bcd-1e59-4d8f-82e3-53d14bce4de3
      name: chr16_freeze.bgen
      md5sum: c4d8a9a9549afe5bc0c301b593e2c1fd
      filesize: 3.4GB
      filetype: .bgen
      number_of_variants: 2273274
      number_of_participants: 17444
    - id: alspacdcs:e0a934a1-6949-469b-aedf-370b556a96cd
      name: chr17_freeze.bgen
      md5sum: 89d4a99fff839df51bcdac1fadc88041
      filesize: 3.2GB
      filetype: .bgen
      number_of_variants: 2040685
      number_of_participants: 17444
    - id: alspacdcs:5df288f2-8113-4967-82b2-ded3eb04c2ca
      name: chr18_freeze.bgen
      md5sum: 76e731d7340944547712a9b1eaa88aea
      filesize: 3.0GB
      filetype: .bgen
      number_of_variants: 1994769
      number_of_participants: 17444
    - id: alspacdcs:61f213e6-c14b-4ad1-b35a-0e2c0cfdc40c
      name: chr19_freeze.bgen
      md5sum: 7bdc3bf8ded14faf8e7098a127e9e031
      filesize: 2.8GB
      filetype: .bgen
      number_of_variants: 1605223
      number_of_participants: 17444
    - id: alspacdcs:b04fd601-4745-4b05-8c27-56c556d08eb4
      name: chr20_freeze.bgen
      md5sum: d38b47cfec00c0c979560884e6566803
      filesize: 2.4GB
      filetype: .bgen
      number_of_variants: 1615112
      number_of_participants: 17444
    - id: alspacdcs:1e394ca7-9d52-43a7-806d-e9fd4179cb88
      name: chr21_freeze.bgen
      md5sum: 586cb702158f0549fb20fd7703ba53cc
      filesize: 1.5GB
      filetype: .bgen
      number_of_variants: 935142
      number_of_participants: 17444
    - id: alspacdcs:571b477b-61d1-49c2-b8d1-c652679b11d8
      name: chr22_freeze.bgen
      md5sum: 578d7032a52f458612ed6b337f724e54
      filesize: 1.7GB
      filetype: .bgen
      number_of_variants: 1002345
      number_of_participants: 17444
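
The `md5sum` values listed in a freeze document can be used to check that received files are intact. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the `data` directory in the usage example is an assumption based on the container name in the freeze document.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(expected, data_dir):
    """Return {filename: 'ok' | 'mismatch' | 'missing'} for each expected file."""
    results = {}
    for name, expected_md5 in expected.items():
        path = Path(data_dir) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == expected_md5:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # Two entries copied from the freeze document; extend with the rest as needed.
    expected = {
        "freeze.sample": "6523f64b382a44d4354a3be8bd5205e3",
        "chr22_freeze.bgen": "578d7032a52f458612ed6b337f724e54",
    }
    for name, status in verify_files(expected, "data").items():
        print(name, status)
```

Reading in chunks keeps memory use flat even for the multi-gigabyte bgen files.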

4 Sequence Data

4.1 Whole genome sequencing - G1 (wgs_hiseq_g1)

4.1.1 Description

This dataset contains whole genome sequencing for G1 individuals, part of the UK10K dataset. Reference genome build: GRCh37

4.1.2 Methodology

ALSPAC and TwinsUK cohorts were sequenced at an average read depth of 6.7x through the UK10K program (http://www.uk10k.org) using the Illumina HiSeq platform, and aligned to the GRCh37 human reference using BWA. SNV calls were completed using samtools/bcftools and VQSR and GATK were used to recall these calls.

Associated publication:

Please ensure you have permission to access this data (http://www.uk10k.org/data_access.html) before using it.

4.1.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:wgs_hiseq_g1_2016-08-18_f5
name: Whole genome sequencing - G1 version 2016-08-18 freeze 5
description: >-
  This is freeze 5 of version 2016-08-18 of the Whole genome sequencing for G1 individuals, part of the UK10K dataset.
freeze_size: 341G
linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a
git_tag: https://github.com/alspac/dataset_wgs_hiseq_g1/releases/tag/freeze5
is_current_freeze: true
freeze_number: 5
freeze_date: 2025-02-27
previous_freeze: alspacdcs:wgs_hiseq_g1_2016-08-18_f4
freeze_of_alspac_dataset_version: alspacdcs:wgs_hiseq_g1_2016-08-18
freeze_of_named_alspac_dataset: alspacdcs:wgs_hiseq_g1

has_containers:
  - id: alspacdcs:ec72d464-46e6-4059-b7dd-7b0f68739ddb ## uuid
    name: data
    description: A dir/folder containing the freeze data files


has_parts:
- id: alspacdcs:1319d16a-a9e8-4fb7-b4ee-a02a4345d98d
  name: compressed vcf files
  description: >- 
    Compressed vcf file containing all participants for each chromosome. To be used with corresponding index file (in format of 1_freeze.vcf.gz.csi).

  data_distributions:
    - id: alspacdcs:fbbe04e8-c67f-4177-9b3e-d852b20729e5
      name: 1_freeze.vcf.gz
      md5sum: d33e15c386ae29f5c3be6e75427f8b3b
      filesize: 26.3GB
      filetype: .gz
      number_of_variants: 3406915
      number_of_participants: 1865
    - id: alspacdcs:771837f1-76be-4a4b-b6e6-4d6e00db905b
      name: 2_freeze.vcf.gz
      md5sum: 2d4f58a1b75aa9502c8f317d5497219c
      filesize: 28.8GB
      filetype: .gz
      number_of_variants: 3749277
      number_of_participants: 1865
    - id: alspacdcs:97d1da53-68e4-472b-966f-3825b0b1f3de
      name: 3_freeze.vcf.gz
      md5sum: 787c1476921621d3f6d57237c55366df
      filesize: 24.2GB
      filetype: .gz
      number_of_variants: 3147254
      number_of_participants: 1865
    - id: alspacdcs:6e707b26-d8f6-4353-8a91-556f2fd75644
      name: 4_freeze.vcf.gz
      md5sum: 7834894c689f08444b791562a976ea5e
      filesize: 23.2GB
      filetype: .gz
      number_of_variants: 3019176
      number_of_participants: 1865
    - id: alspacdcs:c0655148-9acb-4df8-a114-ea1a4a00e8dc
      name: 5_freeze.vcf.gz
      md5sum: 3a92aa1fe872807df22a81dfa060983b
      filesize: 21.6GB
      filetype: .gz
      number_of_variants: 2804359
      number_of_participants: 1865
    - id: alspacdcs:9b8a43b6-1097-451c-965c-0d751bc9cb9e
      name: 6_freeze.vcf.gz
      md5sum: b85e5b5182a70bf720b32acf42331723
      filesize: 21.0GB
      filetype: .gz
      number_of_variants: 2704091
      number_of_participants: 1865
    - id: alspacdcs:aabd8528-a20e-4735-93ff-2528eb7485d4
      name: 7_freeze.vcf.gz
      md5sum: 9e62ee4c2ef3f1872bbebe5c25c974ac
      filesize: 19.0GB
      filetype: .gz
      number_of_variants: 2445204
      number_of_participants: 1865
    - id: alspacdcs:219b3731-47b5-4dc3-821f-f87921e14443
      name: 8_freeze.vcf.gz
      md5sum: 15a74e25f9fcf4e3cb0c3ade8e4ea523
      filesize: 18.8GB
      filetype: .gz
      number_of_variants: 2451009
      number_of_participants: 1865
    - id: alspacdcs:f10bafdd-56d9-4dd4-9fb1-03d4664d4925
      name: 9_freeze.vcf.gz
      md5sum: b76caf23e115f32f268eccc63f89befc
      filesize: 14.2GB
      filetype: .gz
      number_of_variants: 1845456
      number_of_participants: 1865
    - id: alspacdcs:9a9c8dd0-4620-4946-a970-34133bce0cff
      name: 10_freeze.vcf.gz
      md5sum: e5aec1e24bf2b1708db803093717fe86
      filesize: 16.3GB
      filetype: .gz
      number_of_variants: 2110436
      number_of_participants: 1865
    - id: alspacdcs:6b62073d-ab80-45f7-b684-d9ee20dc2803
      name: 11_freeze.vcf.gz
      md5sum: 40245fa0ca954109bd3d72b9258a5604
      filesize: 16.4GB
      filetype: .gz
      number_of_variants: 2125064
      number_of_participants: 1865
    - id: alspacdcs:6bb40956-ab1e-474c-8d36-6c2b3ad3e11d
      name: 12_freeze.vcf.gz
      md5sum: fc35ea6f6c4eac159d355756c1fa1e99
      filesize: 15.7GB
      filetype: .gz
      number_of_variants: 2047922
      number_of_participants: 1865
    - id: alspacdcs:36f180c3-b4ad-496e-a210-3d31104f5abb
      name: 13_freeze.vcf.gz
      md5sum: fbaa1857b2337a453977604691dda40a
      filesize: 11.8GB
      filetype: .gz
      number_of_variants: 1527053
      number_of_participants: 1865
    - id: alspacdcs:fd76156d-de27-4096-a40d-f3bdf2191dfb
      name: 14_freeze.vcf.gz
      md5sum: b91c4b551aad8ddff969f325837ae391
      filesize: 10.7GB
      filetype: .gz
      number_of_variants: 1403580
      number_of_participants: 1865
    - id: alspacdcs:2298d12f-31ac-402e-bcf9-bbda7a0cfdb0
      name: 15_freeze.vcf.gz
      md5sum: cdb06fde51346d76533ab100e9f9d497
      filesize: 9.7GB
      filetype: .gz
      number_of_variants: 1262404
      number_of_participants: 1865
    - id: alspacdcs:040624dc-7f50-4fb9-93ec-dca0c644779a
      name: 16_freeze.vcf.gz
      md5sum: 7276bd89d5c658e11a4abddd64ce0e50
      filesize: 10.6GB
      filetype: .gz
      number_of_variants: 1373607
      number_of_participants: 1865
    - id: alspacdcs:de04c691-4a2e-4336-9c79-4b0202fee2d0
      name: 17_freeze.vcf.gz
      md5sum: fe4b05ae5ef0fc510623bf5e54c1e1b2
      filesize: 9.1GB
      filetype: .gz
      number_of_variants: 1177884
      number_of_participants: 1865
    - id: alspacdcs:aee7c3cf-b1e1-4128-9f94-0881d61618d7
      name: 18_freeze.vcf.gz
      md5sum: c1f0e9e06f78f9c1a00531692a3d2cd0
      filesize: 9.4GB
      filetype: .gz
      number_of_variants: 1220427
      number_of_participants: 1865
    - id: alspacdcs:3e5c35f5-6ea7-4a53-b7f7-8266d87899c4
      name: 19_freeze.vcf.gz
      md5sum: 8430d6bf3230feb3136069180b250055
      filesize: 7.0GB
      filetype: .gz
      number_of_variants: 886630
      number_of_participants: 1865
    - id: alspacdcs:a7d924eb-88db-4e9f-b16f-42bea3f6b821
      name: 20_freeze.vcf.gz
      md5sum: 081da0fbcfd89c1fbcd403d60a83e400
      filesize: 7.5GB
      filetype: .gz
      number_of_variants: 970869
      number_of_participants: 1865
    - id: alspacdcs:7debb6d3-0e95-408a-b587-ce61f2cf2785
      name: 21_freeze.vcf.gz
      md5sum: 1f0c7f8dffd9e7540c1fa695c7940fe8
      filesize: 4.3GB
      filetype: .gz
      number_of_variants: 563988
      number_of_participants: 1865
    - id: alspacdcs:ad3618eb-cdf4-4740-8fad-ffda5e9c2fa2
      name: 22_freeze.vcf.gz
      md5sum: 2676aaa6b442dfe0cde83fe15ccfa95b
      filesize: 4.4GB
      filetype: .gz
      number_of_variants: 552675
      number_of_participants: 1865
    - id: alspacdcs:57430b22-f463-453e-9e31-df7d921c02af
      name: X_freeze.vcf.gz
      md5sum: 1695e4907cd419d93933f7703b56850b
      filesize: 10.5GB
      filetype: .gz
      number_of_variants: 1700742
      number_of_participants: 1865

- id: alspacdcs:a3afc031-0157-4a1a-9325-963407437cde
  name: vcf index files
  description: >- 
    vcf index file allowing for faster use of compressed vcf counterpart. To be used with corresponding vcf file (in format of 1_freeze.vcf.gz).
  data_distributions:
    - id: alspacdcs:3fbcb888-5de0-456c-86cc-7362065efede
      name: 1_freeze.vcf.gz.csi
      md5sum: 6d9e416a4c43c723ba97d72c7405849c
      filesize: 145.6KB
      filetype: .csi
    - id: alspacdcs:2b0b8799-9d40-428e-843c-0044f15c5358
      name: 2_freeze.vcf.gz.csi
      md5sum: b21f248b785fcf0db92f72a3c3c66b2f
      filesize: 156.1KB
      filetype: .csi
    - id: alspacdcs:6bc095b9-aca3-4a22-ad14-4f9d6b490056
      name: 3_freeze.vcf.gz.csi
      md5sum: 3720daf1b4726d6904783c61f5234c6d
      filesize: 127.9KB
      filetype: .csi
    - id: alspacdcs:db608415-e6a1-45e9-97de-8f35129759ae
      name: 4_freeze.vcf.gz.csi
      md5sum: a0bb677911ee282e6526a881b2a98916
      filesize: 122.6KB
      filetype: .csi
    - id: alspacdcs:df8eb7f7-3e05-4b19-b733-fe1edb99de99
      name: 5_freeze.vcf.gz.csi
      md5sum: 8b00b378e1375f701f9d4d310009d49a
      filesize: 116.1KB
      filetype: .csi
    - id: alspacdcs:754e0b2a-b593-4bb9-9b6f-51dc3cd07e2b
      name: 6_freeze.vcf.gz.csi
      md5sum: 3266612ae5cc6605f28f72e741e92d57
      filesize: 109.8KB
      filetype: .csi
    - id: alspacdcs:5c9ab5b9-89bb-4238-a757-3279148252d9
      name: 7_freeze.vcf.gz.csi
      md5sum: 706ac014ea4d9c76e87faeccb739aea3
      filesize: 101.8KB
      filetype: .csi
    - id: alspacdcs:1bf17d44-e29f-491c-98c4-ca0fafcc8c25
      name: 8_freeze.vcf.gz.csi
      md5sum: c1289344eec48a51e1096378312eda79
      filesize: 92.8KB
      filetype: .csi
    - id: alspacdcs:2c1ff40a-9d4b-4c27-af92-600966a9cd95
      name: 9_freeze.vcf.gz.csi
      md5sum: a1704c7204fd3e9656ff7bfae73a9a4a
      filesize: 75.4KB
      filetype: .csi
    - id: alspacdcs:e3a5b7e9-4892-4286-b616-4fe3444d2a2b
      name: 10_freeze.vcf.gz.csi
      md5sum: ee5b0a7f2220f00c4e032a5ddf35e510
      filesize: 85.5KB
      filetype: .csi
    - id: alspacdcs:71068481-76e6-4baa-b8ab-a6b83e32b053
      name: 11_freeze.vcf.gz.csi
      md5sum: 35acd0fd59f3d23cdc20838a7379eb3e
      filesize: 85.2KB
      filetype: .csi
    - id: alspacdcs:87843472-0227-46ae-bd8b-cb271a9770fe
      name: 12_freeze.vcf.gz.csi
      md5sum: 5cf94f16cf009e8cfb7501b7324f17bc
      filesize: 85.4KB
      filetype: .csi
    - id: alspacdcs:9406ac0b-f2b0-4bff-9789-67af3e5f4dfb
      name: 13_freeze.vcf.gz.csi
      md5sum: c30148a951069b2b0bc4421a74f0bf62
      filesize: 62.1KB
      filetype: .csi
    - id: alspacdcs:7983ca05-0103-4898-8d96-d1ff0f9b2594
      name: 14_freeze.vcf.gz.csi
      md5sum: e8555b348c8074117a963554ed0b1dc5
      filesize: 56.7KB
      filetype: .csi
    - id: alspacdcs:6037341b-7f8f-4c69-823c-4db7ce90e747
      name: 15_freeze.vcf.gz.csi
      md5sum: 26a4b2633ea20e1ca13c4f23a40d7583
      filesize: 51.6KB
      filetype: .csi
    - id: alspacdcs:e8775539-a045-4f69-b842-c6b27be94d58
      name: 16_freeze.vcf.gz.csi
      md5sum: 0becfa273182ab2a2d238bf4130ae991
      filesize: 50.4KB
      filetype: .csi
    - id: alspacdcs:d75b124b-b063-44a0-a0ac-dc942934c0bd
      name: 17_freeze.vcf.gz.csi
      md5sum: f7b852a30bf4fd2a6c215ff2e588ef06
      filesize: 49.9KB
      filetype: .csi
    - id: alspacdcs:c2065c84-cf80-4f8b-8ffe-29c6262452a1
      name: 18_freeze.vcf.gz.csi
      md5sum: 63fe7327c4d2933e6180bdce4823b7cd
      filesize: 48.4KB
      filetype: .csi
    - id: alspacdcs:4cc73c9d-264d-4b77-9ad6-541f63043f72
      name: 19_freeze.vcf.gz.csi
      md5sum: 727745e1e9bdcc63b5d9f236e8c354e5
      filesize: 35.7KB
      filetype: .csi
    - id: alspacdcs:9d6dd993-d2f8-4858-8ae9-4e6c5cd8b7a9
      name: 20_freeze.vcf.gz.csi
      md5sum: 79e681e770aa992abc51b6be6ee98736
      filesize: 38.2KB
      filetype: .csi
    - id: alspacdcs:8936b2bd-5b44-4048-9350-6a04df757cb7
      name: 21_freeze.vcf.gz.csi
      md5sum: ee5adcbdec0505621ec1bd6ca2390c4a
      filesize: 22.1KB
      filetype: .csi
    - id: alspacdcs:4501c4ec-bf71-456c-85e5-bed669a4f993
      name: 22_freeze.vcf.gz.csi
      md5sum: cd89ef1f49a81c0d7dd27d91f87000fc
      filesize: 22.1KB
      filetype: .csi
    - id: alspacdcs:2ebe499d-4a01-4784-89a0-f3d4709c0d19
      name: X_freeze.vcf.gz.csi
      md5sum: d50db7c315c45db319ad7f7a6176d326
      filesize: 96.0KB
      filetype: .csi
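
Because each compressed vcf file is meant to be used with its `.csi` index, it can be useful to confirm that every vcf in a delivery has its index alongside it. The sketch below works on a plain list of file names; the function name is hypothetical, and the naming pattern (`<chromosome>_freeze.vcf.gz` plus `.csi`) is taken from the file names listed in this freeze document.

```python
def missing_indexes(filenames, vcf_suffix=".vcf.gz", index_suffix=".csi"):
    """Return the vcf files in `filenames` that have no matching index file."""
    names = set(filenames)
    vcfs = sorted(name for name in names if name.endswith(vcf_suffix))
    return [vcf for vcf in vcfs if vcf + index_suffix not in names]


# Build the names listed in the freeze document: chromosomes 1-22 and X.
chromosomes = [str(i) for i in range(1, 23)] + ["X"]
listed = []
for chrom in chromosomes:
    listed.append(f"{chrom}_freeze.vcf.gz")
    listed.append(f"{chrom}_freeze.vcf.gz.csi")

print(missing_indexes(listed))                 # every vcf has an index
print(missing_indexes(["1_freeze.vcf.gz"]))    # index absent
```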

4.2 Whole exome sequencing - G0 & G1 (wes_novaseq_g0_g1)

4.2.1 Description

This dataset contains whole exome sequencing for G0 and G1 individuals. It was generated at the Sanger Institute as part of an initiative sequencing multiple birth cohorts: ALSPAC, MCS and BiB. As part of this initiative, the exome sequencing data will also be available via EGA, but researchers will still gain access through ALSPAC's project approval system. Reference genome build: GRCh38

4.2.2 Methodology

Exome sequencing was conducted on DNA for 12,374 participants (8,605 children and 3,389 of their parents) at the Sanger Institute, using Illumina NovaSeq. Reads were aligned to GRCh38 with BWA-MEM. There was an average on-target depth of ~62X for ALSPAC.

QC was conducted on the dataset at the Sanger Institute; details are in the associated publication (Koko et al., 2024). Sample QC was done before variant calling (base-calls after sequencing, alignment quality, CRAM file quality) and after variant calling (PCA analysis, comparison to array data, relatedness). Integrated variant QC removed potentially false positive variants using a trained random forest model. Genotype QC removed low quality individual genotype calls.

Single nucleotide variant (SNV) and small insertion/deletion (indel) calling was conducted with GATK HaplotypeCaller, GenomicsDBImport and GenotypeGVCFs (GATK version 4.2.4.0 for ALSPAC) following GATK best practices (Van der Auwera and O'Connor, 2020).

Twelve individuals were flagged as sex mismatches within the dataset based on the X chromosome F statistic. When looking at the Y coverage of these individuals, 3 were clear mismatches on both the X F statistic and Y depth, while 9 were mismatches on the X F statistic only. The 3 individuals with clear mismatches on both statistics were removed from the dataset, while the other 9 were retained.
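
The decision rule described above (flag on the X F statistic, remove only when Y depth agrees) can be sketched as follows. The actual cut-offs are not given in this catalogue, so every threshold here is an assumption: the X F cut-offs of 0.2 and 0.8 are commonly used defaults, and the normalised Y depth cut-off of 0.1 is purely illustrative. The function names are hypothetical.

```python
def sex_from_x_f(x_f, female_max=0.2, male_min=0.8):
    """Infer sex from the X chromosome F statistic; None if ambiguous."""
    if x_f < female_max:
        return "F"
    if x_f > male_min:
        return "M"
    return None


def sex_from_y_depth(y_norm_depth, male_min=0.1):
    """Infer sex from Y depth normalised to autosomal depth (assumed scale)."""
    return "M" if y_norm_depth >= male_min else "F"


def classify_sample(reported_sex, x_f, y_norm_depth):
    """Return 'ok', 'retain_flagged' (X F only) or 'remove' (X F and Y depth)."""
    x_sex = sex_from_x_f(x_f)
    x_mismatch = x_sex is not None and x_sex != reported_sex
    if not x_mismatch:
        return "ok"
    y_mismatch = sex_from_y_depth(y_norm_depth) != reported_sex
    return "remove" if y_mismatch else "retain_flagged"


print(classify_sample("F", x_f=0.95, y_norm_depth=0.5))  # both statistics disagree
print(classify_sample("F", x_f=0.95, y_norm_depth=0.0))  # only X F disagrees
print(classify_sample("M", x_f=0.95, y_norm_depth=0.5))  # consistent
```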

Associated publication:

  • doi.org/10.12688/wellcomeopenres.22697.1

4.2.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_f5
name: >- 
  Whole Exome Sequencing - Novaseq - G0 & G1 version 2024-09-20 freeze 5
description: >-
  This is the first iteration of wes_novaseq_g0_g1, first introduced in freeze 5. It contains data in vcf 4.2 format. It contains the majority of the G1 cohort (n=~8296), accompanied by G0 mothers (n=~1642) and partners (n=~1630) to create trios. Participants who have withdrawn their consent are removed and an omics ID is applied according to the freeze. Participants can withdraw their consent over time and will then be removed from the dataset, so the number of available individuals can decrease as time progresses.

  This exome sequencing (ES) was conducted at the Sanger Institute as part of an effort to exome sequence ALSPAC, MCS and BiB. All ES data was quality controlled at the Sanger Institute prior to this ALSPAC release, and this has been extensively documented in the relevant publication (see below).

  In brief (excerpt from associated publication, Koko et al., 2024):

    "Sample QC: 
      * Before variant calling: Samples were removed if they failed one or more filters based on quality of base-calls after sequencing, or quality of the CRAM files of aligned reads. The remainder then underwent variant calling.
      * After variant calling: We assigned individuals to populations using principal component analysis (PCA), then identified and removed individuals who were outliers on one or more variant-based metrics within each of the populations. We compared the exome data to genotyping array data from the same samples and removed samples that did not match as expected, since these could be sample mix-ups. The samples were also checked for unexpected relatedness; samples showing conflicts between reported and inferred relatedness were removed. This sample QC was split in two separate steps, before and after variant and genotype QC, as detailed in the coming sections. 
    Integrated variant and genotype QC:
      * Variant QC: We removed candidate variants which may not be real, instead being artefacts or mapping errors, using a trained random forest model to distinguish likely true positives from likely false positives. 
      * Genotype QC: We removed low-quality individual genotype calls from the dataset. This was done in conjunction with variant QC, as we will explain below."

  For extended information such as thresholds, please see the publication.

  Associated publication:
    Koko et al., 2024
    DOI: https://doi.org/10.12688/wellcomeopenres.22697.2


freeze_size: 167G
linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a
git_tag: https://github.com/alspac/dataset_wes_novaseq_g0_g1/releases/tag/freeze5
is_current_freeze: true
freeze_number: 5
freeze_date: 2025-02-27
previous_freeze: alspacdcs:wes_novaseq_g0_g1_2024-09-20_f4
freeze_of_alspac_dataset_version: alspacdcs:wes_novaseq_g0_g1_2024-09-20
freeze_of_named_alspac_dataset: alspacdcs:wes_novaseq_g0_g1

has_parts:
  - id: alspacdcs:ae68d502-5a2d-4280-9c4c-a3598957eb27
    name: compressed vcf files
    description: >- 
      Compressed vcf file containing all participants for each chromosome. Generated using bcftools v1.19. To be used with corresponding index file (in format of chr1_data.vcf.gz.csi).

    data_distributions:
      - id: alspacdcs:1393a335-9614-462a-8171-cacce83d9228
        name: chr1_data.vcf.gz
        md5sum: 991ae5adc49999b67ad2a5eeb947c415
        filesize: 16.3GB
        filetype: .gz
        number_of_variants: 370645
        number_of_participants: 11500
      - id: alspacdcs:4a580346-e4c1-45b6-90eb-01e909368fdf
        name: chr2_data.vcf.gz
        md5sum: acee28fa07f40a4d65bdebfb496ebd78
        filesize: 11.8GB
        filetype: .gz
        number_of_variants: 272150
        number_of_participants: 11500
      - id: alspacdcs:783a3911-16f3-44b4-a083-e677ca8701d7
        name: chr3_data.vcf.gz
        md5sum: c30a14536c4927e2fe0b6029dc913711
        filesize: 9.1GB
        filetype: .gz
        number_of_variants: 206875
        number_of_participants: 11500
      - id: alspacdcs:5d2e9095-2f4c-48f8-9469-4d0a2db6feee
        name: chr4_data.vcf.gz
        md5sum: 0e03787179147bfe52995addcc0b4bb3
        filesize: 6.1GB
        filetype: .gz
        number_of_variants: 140675
        number_of_participants: 11500
      - id: alspacdcs:c6618b13-3ca1-4b6f-9bd8-3c035f1711b5
        name: chr5_data.vcf.gz
        md5sum: 58c9ccd5d3cb94baa71c6d0948ab94a0
        filesize: 7.0GB
        filetype: .gz
        number_of_variants: 161010
        number_of_participants: 11500
      - id: alspacdcs:df0d7ea2-874d-4cf8-a777-4c85d1b9fda6
        name: chr6_data.vcf.gz
        md5sum: 9483605e1c3baa66b46f9e3d0137466d
        filesize: 8.0GB
        filetype: .gz
        number_of_variants: 181754
        number_of_participants: 11500
      - id: alspacdcs:481c4b01-5a75-49f5-8b14-e3dc91b7df58
        name: chr7_data.vcf.gz
        md5sum: 9823da37fd8d1bcb70710f0f12a85cae
        filesize: 8.1GB
        filetype: .gz
        number_of_variants: 181925
        number_of_participants: 11500
      - id: alspacdcs:08dc63cd-df57-4cd7-9f9b-da9aa29fd06d
        name: chr8_data.vcf.gz
        md5sum: daca7b2c5d14c28267d477516355fe5b
        filesize: 5.9GB
        filetype: .gz
        number_of_variants: 133894
        number_of_participants: 11500
      - id: alspacdcs:7b60eb90-7159-44e0-a0a2-eb339f9a3e1e
        name: chr9_data.vcf.gz
        md5sum: 06854c04f90b3f4e358bb60181042161
        filesize: 7.1GB
        filetype: .gz
        number_of_variants: 161039
        number_of_participants: 11500
      - id: alspacdcs:2feaab6d-9ef6-48be-804f-623cd58c7b45
        name: chr10_data.vcf.gz
        md5sum: 2145f8e9ed3bed5831b0921ea65e2e11
        filesize: 6.5GB
        filetype: .gz
        number_of_variants: 149730
        number_of_participants: 11505
      - id: alspacdcs:36f15a4c-7461-4f5c-845a-540d78969bb5
        name: chr11_data.vcf.gz
        md5sum: 509b385b9195da7fe93355536ae49450
        filesize: 10.2GB
        filetype: .gz
        number_of_variants: 227858
        number_of_participants: 11500
      - id: alspacdcs:27d17b38-d51e-49ef-8e98-9f6b1ecee217
        name: chr12_data.vcf.gz
        md5sum: d983ad6ffefc50e7c163edabd6157b4b
        filesize: 8.5GB
        filetype: .gz
        number_of_variants: 193518
        number_of_participants: 11500
      - id: alspacdcs:6a4ddadd-bfa2-47ec-9a91-9ab5eafc49e5
        name: chr13_data.vcf.gz
        md5sum: b2f5304ba781bfc2604783ecc8dd8a3a
        filesize: 2.8GB
        filetype: .gz
        number_of_variants: 63931
        number_of_participants: 11500
      - id: alspacdcs:a80774c1-3374-469c-8d4a-727735eb114e
        name: chr14_data.vcf.gz
        md5sum: c4ee5f555d11c9f70282340417e78294
        filesize: 5.7GB
        filetype: .gz
        number_of_variants: 128137
        number_of_participants: 11500
      - id: alspacdcs:d124d290-33f6-4c7b-905a-badffbd0d824
        name: chr15_data.vcf.gz
        md5sum: 42a6c9f94f26927f428c06498def17d7
        filesize: 5.6GB
        filetype: .gz
        number_of_variants: 127646
        number_of_participants: 11500
      - id: alspacdcs:015e0d1d-2bed-4fe5-83e2-f7841be0d591
        name: chr16_data.vcf.gz
        md5sum: 8441010f7cb684dbaeacf0bbd4e42249
        filesize: 8.3GB
        filetype: .gz
        number_of_variants: 186300
        number_of_participants: 11500
      - id: alspacdcs:f61fe2eb-ea44-439d-88c8-22a6c91283d6
        name: chr17_data.vcf.gz
        md5sum: 4e235ba3f8c100b1f278269b9abe162b
        filesize: 10.0GB
        filetype: .gz
        number_of_variants: 224774
        number_of_participants: 11500
      - id: alspacdcs:c668c9f6-4dd4-425c-8e06-dd5a7536a892
        name: chr18_data.vcf.gz
        md5sum: db54dcc6eb17b46f1fcefdddb3fd0955
        filesize: 2.5GB
        filetype: .gz
        number_of_variants: 57017
        number_of_participants: 11500
      - id: alspacdcs:2c380e03-7824-4b6f-b42b-9c3ebf9a58db
        name: chr19_data.vcf.gz
        md5sum: f79c758c9a54e71b132c180158f47e19
        filesize: 12.5GB
        filetype: .gz
        number_of_variants: 271080
        number_of_participants: 11500
      - id: alspacdcs:ee246fa3-d17e-447d-9360-b0997d0d882b
        name: chr20_data.vcf.gz
        md5sum: 517231ad499d8827b57c8bc3dfc3d320
        filesize: 4.3GB
        filetype: .gz
        number_of_variants: 96655
        number_of_participants: 11500
      - id: alspacdcs:42d991a1-d820-4ab0-8b6d-f6a96db7d68e
        name: chr21_data.vcf.gz
        md5sum: a37a11719ebb6fac2c0185fd725847b8
        filesize: 1.9GB
        filetype: .gz
        number_of_variants: 42207
        number_of_participants: 11500
      - id: alspacdcs:57cc588a-3e86-4294-9c38-173b5fe35da4
        name: chr22_data.vcf.gz
        md5sum: 057239d5968e4ebdb982c7a761fddda2
        filesize: 4.2GB
        filetype: .gz
        number_of_variants: 94446
        number_of_participants: 11500
      - id: alspacdcs:1e697a14-6c20-4d98-bf0c-b432b307c9bb
        name: chrX_data.vcf.gz
        md5sum: 96466a642fa95a37c6ce18bc081a9313
        filesize: 3.8GB
        filetype: .gz
        number_of_variants: 86925
        number_of_participants: 11500
      - id: alspacdcs:63fe419d-41ce-49f4-880d-46751b5d3e7e
        name: chrY_data.vcf.gz
        md5sum: 98551aecc6df538eb439face7d20067e
        filesize: 363.9KB
        filetype: .gz
        number_of_variants: 9
        number_of_participants: 11500


  - id: alspacdcs:af405b0a-6161-4946-8004-d9c7333d9788
    name: vcf index files
    description: >- 
	  Index files that allow faster access to the corresponding compressed vcf files. Generated using bcftools v1.19. Each index is to be used with its matching vcf file and is named in the format chr1_data.vcf.gz.csi.
    data_distributions:
      - id: alspacdcs:5b755c68-5434-4b17-90ca-b95a4967d2b0
	name: chr1_data.vcf.gz.csi
	md5sum: 7e91b9c00c2510cb8d7b9219761127db
	filesize: 59.3KB
	filetype: .csi
      - id: alspacdcs:5496ec33-9e40-4f44-b5d5-a905e346ba36
	name: chr2_data.vcf.gz.csi
	md5sum: 9d99d101fdae1b73cb1add0995db1281
	filesize: 47.6KB
	filetype: .csi
      - id: alspacdcs:c6e523d1-099b-4789-92a1-a17d1ca80890
	name: chr3_data.vcf.gz.csi
	md5sum: c6f3d3c8784876cc34e1da41ff477544
	filesize: 37.9KB
	filetype: .csi
      - id: alspacdcs:5df3296f-67c2-4634-8e14-b670b8fb70ec
	name: chr4_data.vcf.gz.csi
	md5sum: f45a01d452497ef61a6944f8c4874dcf
	filesize: 29.7KB
	filetype: .csi
      - id: alspacdcs:933eeba2-146a-4f5a-accb-e202be199ae7
	name: chr5_data.vcf.gz.csi
	md5sum: 0d399ff3b3812732702706af558b4cf2
	filesize: 30.8KB
	filetype: .csi
      - id: alspacdcs:7c402631-6da4-4b6e-bce3-cdca11ef5af9
	name: chr6_data.vcf.gz.csi
	md5sum: c84bc37706a8886c04f011290e0fb527
	filesize: 32.2KB
	filetype: .csi
      - id: alspacdcs:49bc0d6d-ae74-4e49-a192-d99f2bad008a
	name: chr7_data.vcf.gz.csi
	md5sum: 742aa1280b71dded05a7eca06671467d
	filesize: 32.2KB
	filetype: .csi
      - id: alspacdcs:876f8247-ec68-49db-8998-a081e3570eea
	name: chr8_data.vcf.gz.csi
	md5sum: 969272ccf78c43e2eb0c37ae726fa9cb
	filesize: 24.6KB
	filetype: .csi
      - id: alspacdcs:63b70682-ae5d-41fd-846f-815df58ebd21
	name: chr9_data.vcf.gz.csi
	md5sum: 0a30e5878fecd543b8b27b65c2153ff4
	filesize: 25.0KB
	filetype: .csi
      - id: alspacdcs:0607a7f5-3c58-46b5-a5b2-09a511118bb7
	name: chr10_data.vcf.gz.csi
	md5sum: 9c79d177d09b4a29fde5e29eb6aa681d
	filesize: 27.8KB
	filetype: .csi
      - id: alspacdcs:bbe8ba0d-a093-4bdf-9eda-23c0142b5079
	name: chr11_data.vcf.gz.csi
	md5sum: c6520a9a3a2ea08b089a69a676493f7a
	filesize: 31.5KB
	filetype: .csi
      - id: alspacdcs:2695c1de-8205-48b2-a22d-e4dba6ae637a
	name: chr12_data.vcf.gz.csi
	md5sum: 780fdb6487f2bc2e88b5ac6cb31beab2
	filesize: 31.7KB
	filetype: .csi
      - id: alspacdcs:90d3ba16-c3ef-4222-a39a-02378d6a8982
	name: chr13_data.vcf.gz.csi
	md5sum: 9c1701ea5de03e58a324373cdf36e35b
	filesize: 13.4KB
	filetype: .csi
      - id: alspacdcs:25199639-b5fd-4b13-8d52-aa1cfaaf974d
	name: chr14_data.vcf.gz.csi
	md5sum: 3c487ecb0d4d405526a57fddf54c3411
	filesize: 19.1KB
	filetype: .csi
      - id: alspacdcs:c263f9ae-05e7-4bf8-92af-ef02ae58e5f4
	name: chr15_data.vcf.gz.csi
	md5sum: 5f39316e63aa7bfc6dcad5f6ec29e0f5
	filesize: 19.7KB
	filetype: .csi
      - id: alspacdcs:017da27c-a816-4652-b4c9-fa358610181b
	name: chr16_data.vcf.gz.csi
	md5sum: 07816ce3396eef33d6c8a2128593df4f
	filesize: 19.9KB
	filetype: .csi
      - id: alspacdcs:341e4feb-17ce-4461-9d63-5c841906da3a
	name: chr17_data.vcf.gz.csi
	md5sum: fc0dd9fc4480fe26b4e4c7cfdbbe90ae
	filesize: 26.3KB
	filetype: .csi
      - id: alspacdcs:09b71a98-19a6-4860-8fc5-9cd0fd1b4951
	name: chr18_data.vcf.gz.csi
	md5sum: 25fd0f8ba4ece3eb2b82eb809d32b274
	filesize: 12.4KB
	filetype: .csi
      - id: alspacdcs:9cbd4505-aebc-4674-92de-b7e478ee112e
	name: chr19_data.vcf.gz.csi
	md5sum: 9422f9902edfa9815dea5abfeb699b87
	filesize: 23.7KB
	filetype: .csi
      - id: alspacdcs:6f4bd3d4-b60b-4af2-bcb9-e5ec04cbf034
	name: chr20_data.vcf.gz.csi
	md5sum: 5b6c115377d8cbee42e33a4730512221
	filesize: 14.8KB
	filetype: .csi
      - id: alspacdcs:6cde5398-d318-485b-90c9-0a4c65f93a66
	name: chr21_data.vcf.gz.csi
	md5sum: 309c8eba9d76d97f53813af20b97948d
	filesize: 6.3KB
	filetype: .csi
      - id: alspacdcs:87c78128-434c-49cc-80d2-07102c87b542
	name: chr22_data.vcf.gz.csi
	md5sum: d58b2f9dd877301a8b00920a8a963a97
	filesize: 11.0KB
	filetype: .csi
      - id: alspacdcs:3bb30796-6e72-46da-9c89-5f353c17bd24
	name: chrX_data.vcf.gz.csi
	md5sum: a9bcae58debbda0537e9f16f6bf08844
	filesize: 22.9KB
	filetype: .csi
      - id: alspacdcs:4c98124f-7401-422a-ad2c-e8d53703c9f3
	name: chrY_data.vcf.gz.csi
	md5sum: 08a6efe6cc066092a0332203f72a377c
	filesize: 129.0B
	filetype: .csi
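
Each file in a freeze document is listed with an md5sum, which can be used to confirm that a received file is complete and unaltered. The following is a minimal sketch of such a check, assuming Python is available; the function names `md5_of_file` and `matches_catalogue` are illustrative and not part of any ALSPAC tooling, and the demonstration uses a throwaway temporary file rather than real data.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use low for multi-gigabyte files.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_catalogue(path, expected_md5):
    """True if the file's MD5 equals the md5sum listed in the freeze document."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Demonstration on a temporary file (hypothetical content, not ALSPAC data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello\n")
    demo_path = tmp.name

demo_md5 = md5_of_file(demo_path)
os.remove(demo_path)
```

In practice, `matches_catalogue` would be called with the path of a received file and the md5sum value copied from the matching entry in the freeze document.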

4.3 Whole exome sequencing - G1 (wes_novaseq_g1)

4.3.1 Description

This dataset contains whole exome sequencing data for ~2900 G1 individuals, generated at the Broad Institute. Reference genome build: GRCh38.

4.3.2 Methodology

The exomes returned from the Broad Institute did not undergo PCA or relatedness filtering; they were provided as raw VCF data. The following thresholds were applied to the samples:

  • Chimera rate: Less than 0.05
  • Contamination rate: Less than 0.10
  • PF aligned rate: More than 0.60

87 individuals believed to be sample mismatches were removed from the dataset. These exomes had a discordance rate above 0.05 when compared to existing array data using bcftools gtcheck.
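
The sample-level rules above can be expressed as a simple filter. This is only an illustrative sketch: the class `SampleMetrics`, its field names and the example values are hypothetical, and the actual QC was carried out upstream rather than with this code.

```python
from dataclasses import dataclass


@dataclass
class SampleMetrics:
    """Per-sample QC metrics (field names are assumptions for illustration)."""
    sample_id: str
    chimera_rate: float
    contamination_rate: float
    pf_aligned_rate: float


def passes_qc(m):
    """Apply the three thresholds listed in the methodology."""
    return (
        m.chimera_rate < 0.05
        and m.contamination_rate < 0.10
        and m.pf_aligned_rate > 0.60
    )


def is_suspected_mismatch(discordance_rate):
    """Flag exomes whose discordance with array data is above 0.05."""
    return discordance_rate > 0.05


# Made-up example values, one sample failing each rule in turn.
samples = [
    SampleMetrics("S1", 0.01, 0.02, 0.95),
    SampleMetrics("S2", 0.07, 0.02, 0.95),  # chimera rate too high
    SampleMetrics("S3", 0.01, 0.10, 0.95),  # contamination not below 0.10
    SampleMetrics("S4", 0.01, 0.02, 0.60),  # PF aligned rate not above 0.60
]
kept = [s.sample_id for s in samples if passes_qc(s)]
```

Note that the comparisons are strict, so a value exactly equal to a threshold does not pass; whether the original QC treated boundary values this way is an assumption.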

Associated publications:

4.3.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:wes_novaseq_g1_204-04-12_f5
name: >- 
  Whole Exome Sequencing - Novaseq - G1 version 2024-04-09 freeze 5
description: >-
  This is the first iteration of wes_novaseq_g1, first introduced in freeze 4. It contains data in vcf 4.2 format. It is a subset of the G1 cohort, with participants who have withdrawn their consent removed and omics IDs applied according to the freeze. Samples were selected for whole exome sequencing at the Broad Institute from the G1 cohort (the cohort of index children) and were from subjects who were singletons/unrelated and of European/British ancestry, had blood-derived DNA available, and had been genotyped on a whole genome genotyping array.

  QC was performed by the Broad Institute. The following thresholds were applied:
  Chimera rate < 0.05
  Contamination rate < 0.10
  PF aligned rate > 0.60

  87 individuals believed to be sample mismatches were removed from the dataset. These exomes had a discordance rate above 0.05 when compared to existing array data using bcftools gtcheck.

  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9980234/ describes this dataset in supplementary materials. 

freeze_size: 
linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a
git_tag: https://github.com/alspac/dataset_wes_novaseq_g1/releases/tag/freeze5
is_current_freeze: true
freeze_number: 5
freeze_date: 2025-02-27
previous_freeze: alspacdcs:wes_novaseq_g1_204-04-12_f4

freeze_of_alspac_dataset_version: alspacdcs:wes_novaseq_g1_2024-03-26
freeze_of_named_alspac_dataset: alspacdcs:wes_novaseq_g1

has_parts:
  - id: alspacdcs:wes_novaseq_g1_2024-04-09_all_chr_f5
    name: all_chr
    description: >-
      All chromosomes and all participants within the dataset contained within a single vcf version 4.2 file, which has been compressed using bcftools 1.19.
    data_distributions:
      - id: alspacdcs:37f5619e-b3e9-4f12-b58e-69678dac59db
	name: all_chr.vcf.gz
	description: >- 
	  vcf file containing all participants and chromosomes, to be used with all_chr.vcf.gz.csi
	md5sum: 1caa32ff3e54ccc46f9553960f70645f
	filesize: 28G
	filetype:  vcf.gz
	number_of_participants: 2879
	#number_of_gene_expression_probe_values: 

      - id: alspacdcs:6f65a113-6dfa-45ac-80e8-ad23d4f8c958
	name: all_chr.vcf.gz.csi
	description: >-
	  index for vcf file - all_chr.vcf.gz, generated using bcftools v1.19.
	md5sum: cbfa46323e5ae250fabf071df72b5856
	filesize: 800K
	filetype: .csi
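
The number_of_participants value can be compared against the number of sample columns in the vcf header. A minimal sketch follows, assuming the file is a standard VCF with genotype data (nine fixed columns before the samples); the function names are illustrative, and the demonstration uses a tiny made-up VCF rather than all_chr.vcf.gz.

```python
import gzip
import io


def count_vcf_samples(stream):
    """Count sample columns using the #CHROM header line of a VCF text stream."""
    for line in stream:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO and FORMAT are
            # fixed; every column after those nine is one sample.
            return max(len(columns) - 9, 0)
        if not line.startswith("#"):
            break  # reached the data records without seeing a header line
    raise ValueError("no #CHROM header line found")


def count_samples_in_vcf_gz(path):
    """Open a compressed VCF in text mode and count its samples."""
    with gzip.open(path, "rt") as handle:
        return count_vcf_samples(handle)


# Made-up two-sample VCF used only to demonstrate the function.
VCF_TEXT = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "1\t1000\trs1\tA\tC\t.\tPASS\t.\tGT\t0/0\t0/1\n"
)
demo_count = count_vcf_samples(io.StringIO(VCF_TEXT))
```

Only the header is read, so the check is quick even for a large file.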

5 Epigenetic Data

5.1 DNA methylation - EPIC & 450k - G0 + G1 (dnam_epic450_g0_g1)

5.1.1 Description

This dataset contains methylation data collected from both G0 and G1 on two arrays at different timepoints. This dataset supersedes dnam_450_g0m_g1.

The dataset includes data from the Illumina Infinium HumanMethylation450K BeadChip array on G1 mothers at two timepoints (pregnancy and middle age), G1 participants at 5 timepoints and G0 participants at three timepoints (birth, childhood and adolescence). It also contains Infinium MethylationEPIC v1.0 data on 2721 G1 individuals at 2 timepoints.

This dataset was generated as part of the Accessible Resource for Integrated Epigenomics Studies (http://www.ariesepigenomics.org.uk/).

5.1.2 Methodology

Preprocessing and quality control for this dataset were conducted using Meffil.

Associated publications:

Associated R packages:

5.1.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:dnam_epic450_g0_g1_2022-7-13_f5
name: >-
  DNA methylation - EPIC & 450k - G0 + G1 version 2022-7-13 Freeze 5
description: >-
  This is the freeze 5 version of dnam_epic450_g0_g1, which was first introduced
  in freeze 2 and first released 2022-7-13.

freeze_size: 137G
linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a
git_tag: https://github.com/alspac/dataset_dnam_epic450_g0_g1/releases/tag/Freeze5
is_current_freeze: true
freeze_number: 5
freeze_date: 2025-02-27 ### Update to align with date of release
previous_freeze: 4
freeze_of_alspac_dataset_version: alspacdcs:dnam_epic450_g0_g1_2022-7-13
freeze_of_named_alspac_dataset: alspacdcs:dnam_epic450_g0_g1


has_containers:
  - id: alspacdcs:b56c1d92-e706-4771-8c9f-aa3f8b4d696e
    name: data
    description: A dir/folder containing the data files
  - id: alspacdcs:34396db7-83c1-4a7c-ac91-1b61c36be058
    name: betas
    description: A dir/folder containing the beta files
    belongs_to_container: alspacdcs:b56c1d92-e706-4771-8c9f-aa3f8b4d696e
  - id: alspacdcs:6b98295d-0ad1-441f-be61-f5fb01354bf5
    name: control_matrix
    description: A dir/folder containing the control matrix files 
    belongs_to_container: alspacdcs:b56c1d92-e706-4771-8c9f-aa3f8b4d696e
  - id: alspacdcs:55e81f3d-b724-495e-84dd-2a378a4aa5df
    name: derived
    description: A dir/folder containing the derived data (e.g. Cell count predictions and dnamage) 
    belongs_to_container: alspacdcs:b56c1d92-e706-4771-8c9f-aa3f8b4d696e
  - id: alspacdcs:47b7227d-cfdc-4a6c-ac6b-fb885de042d6
    name: cellcounts
    description: A dir/folder containing the cell count predictions
    belongs_to_container: alspacdcs:55e81f3d-b724-495e-84dd-2a378a4aa5df
  - id: alspacdcs:41debec0-c442-419a-bc9d-55f3712e64c9
    name: detection_p_values
    description: A dir/folder containing the matrix of detection values
    belongs_to_container: alspacdcs:b56c1d92-e706-4771-8c9f-aa3f8b4d696e
  - id: alspacdcs:97cd1d8c-1847-471e-851d-702e74d0b878
    name: samplesheet
    description: A dir/folder containing matrices of the sample identification.
    belongs_to_container: alspacdcs:b56c1d92-e706-4771-8c9f-aa3f8b4d696e 


has_parts:
  - id: alspacdcs:c74b655d-aeb5-472a-838a-53aab0fd43f6
    name: betas
    description: >-
      Betas normalized using functional normalization.
      We used 10 PCs on the control matrix to regress out technical
      variation. Slide was regressed out as a random effect before
      normalization. CpGs are in rows and samples in columns.
    data_distributions:
      - id: alspacdcs:8bb819bb-2593-4418-8aa3-30ccbf42e5f7
	name: 450.gds
	description: >-
	  R data object for the Normalized beta data for the 450 array only.
	md5sum: 02e9b3cdda39d3476bfce111f5935f93
	filesize: 22G
	filetype: .gds
	belongs_to_container: alspacdcs:34396db7-83c1-4a7c-ac91-1b61c36be058
	number_of_samples: 5927
      - id: alspacdcs:c9040c38-0d33-40ca-b1c9-0633519367d2
	name: common.gds
	description: >-
	  R data object for the Normalized beta data for both the EPIC and 450 arrays.
	md5sum: 2d447051e6241bf35dc1bfba4e740848
	filesize: 30G
	filetype: .gds
	belongs_to_container: alspacdcs:34396db7-83c1-4a7c-ac91-1b61c36be058
	number_of_samples: 8669
      - id: alspacdcs:6ea804dc-22cf-4c20-bad9-10dd022ad60e
	name: epic.gds
	description: >-
	  R data object for the Normalized beta data for  the EPIC array only.
	md5sum: 0357486c3af3b5ee120c7b05bf077340
	filesize: 18G
	filetype: .gds
	belongs_to_container: alspacdcs:34396db7-83c1-4a7c-ac91-1b61c36be058
	number_of_samples: 2742

  - id: alspacdcs:9e066b44-dfc4-4499-87ab-ecf1f920e22d
    name: control_matrix
    description: >-
      The 850 control probes are summarized in 42 control types.
      These probes can roughly be divided into negative control probes
      (613), probes intended for between array normalization (186)
      and the remainder (49), which are designed for quality
      control, including assessing the
      bisulfite conversion rate. None of these probes are designed
      to measure a biological signal.
      The summarized control probes can be used as surrogates for
      unwanted variation and are used for the functional
      normalization.
      Samples are rows and 42 control types are in columns.
    data_distributions:
      - id: alspacdcs:11c170b9-8d2f-4f97-8b95-484e7c6eca5a
	name: 450.txt
	description: >-
	  Plain text file of the control matrix for the 450 array only.
	md5sum: 9e6aa62498c5bb7493f7512e274056ba
	filesize: 2.2M
	filetype: .txt
	belongs_to_container: alspacdcs:6b98295d-0ad1-441f-be61-f5fb01354bf5
	number_of_samples: 5927
      - id: alspacdcs:d506ee43-e7cf-4965-b77f-cbf92a840160
	name: common.txt
	description: >-
	  Plain text file of the control matrix for both the EPIC and 450 arrays.
	md5sum: 42d21ff7a2ead483e85b909b279e9912
	filesize: 3.2M
	filetype: .txt
	belongs_to_container: alspacdcs:6b98295d-0ad1-441f-be61-f5fb01354bf5
	number_of_samples:  8669
      - id: alspacdcs:58d260b0-529b-44dd-af72-e886bd49cbb3
	name: epic.txt
	description: >-
	  Plain text file of the control matrix for the EPIC array only.
	md5sum: 7a680d3ccd26a491ec7dde2ce91eeeab
	filesize: 1.0M
	filetype: .txt
	belongs_to_container: alspacdcs:6b98295d-0ad1-441f-be61-f5fb01354bf5
	number_of_samples:  2742

  - id: alspacdcs:bccb21fd-8f7c-4745-b5ec-23934efc158a
    name: DNA methylation age
    description: >-
      DNA methylation aging estimates from within the dataset. 
      Further information on this data and its usage is found
      within the `dnamage.html` and `dnamage.md` within the docs
      dir/folder.
    data_distributions:
      - id: alspacdcs:5c1df92f-44dc-4953-90bd-f33f51b3a704
	name: dnamage.csv
	description: >-
	  A csv file containing DNA methylation aging estimates within the dataset. 
	md5sum: bd0c2efef6ee145cd0804d61c7e83151
	filesize: 12M
	filetype: .csv
	belongs_to_container: alspacdcs:55e81f3d-b724-495e-84dd-2a378a4aa5df
	number_of_samples:  8192

  - id: alspacdcs:00045e99-d84b-4ef0-b6c4-a2fd4c7db852
    name: cell counts
    description: >-
      Files contain cell counts estimated using a variety of cell type 
      references using the Houseman deconvolution algorithm (PMID: 22568884).
      In each file, samples correspond to rows and cell types to columns.
    data_distributions:
      - id: alspacdcs:00045e99-d84b-4ef0-b6c4-a2fd4c7db852
	name: andrews-and-bakulski-cord-blood.txt
	description: >-
	  Cord blood cell count estimates derived using the Bakulski et al. 2016 reference 
	  (PMID 27019159; https://bioconductor.org/packages/release/data/experiment/html/FlowSorted.CordBlood.450k.html).
	  This reference has been implemented in meffil. Cell counts estimated for b-cells, 
	  cd4+ t cells, cd8+ t cells, granulocytes, monocytes, natural killer cells and 
	  nucleated red blood cells. In this text file, samples are in rows and cell types in columns.
	md5sum: 33c69aa8e50deb28355dcb82d01c7510
	filesize: 114K
	filetype: .txt
	belongs_to_container: alspacdcs:47b7227d-cfdc-4a6c-ac6b-fb885de042d6
	number_of_participants: 913  
      - id: alspacdcs:27721a12-e645-474f-b425-2e07a6a00db8
	name: gervin-and-lyle-cord-blood.txt
	description: >-
	  Cord blood cell count estimates derived using the Gervin et al. 2019
	  reference (PMID 31455416; GEO accession GSE127824). Cell counts 
	  estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes,
	  and natural killer cells. This reference has been implemented in meffil. 
	  In this text file, samples are in rows and cell types in columns.
	md5sum: 099c4cf9bd4ecfee91c19c3c2d2b6f70
	filesize: 100K
	filetype: .txt
	belongs_to_container: alspacdcs:47b7227d-cfdc-4a6c-ac6b-fb885de042d6
	number_of_participants: 913
      - id: alspacdcs:3ff0d680-2322-4733-905f-a84834980180
	name: cord-blood-gse68456.txt
	description: >-
	  Cord blood cell count estimates derived using the de Goede et al. 2015 reference
	  (PMID 26366232; GEO accession GSE68456).  Cell counts estimated for b-cells, cd4+ t cells,
	  cd8+ t cells, granulocytes, monocytes, natural killer cells and nucleated red blood cells.
	  This reference has been implemented in meffil. In this text file, samples are in rows and
	  cell types in columns.
	md5sum: 941f8a9ce1289ab5baaf10fb29bd8941
	filesize: 130K
	filetype: .txt
	belongs_to_container: alspacdcs:47b7227d-cfdc-4a6c-ac6b-fb885de042d6
	number_of_participants: 913
      - id: alspacdcs:06fd9d57-d014-4464-9458-aad9cf2d568b
	name: blood-gse35069-complete.txt
	description: >-
	  Cell counts in peripheral blood predicted using the peripheral blood reference published in 
	  Reinius et al. 2012 (PMID: 22848472). Same as 'blood-gse35069.txt' but replaces granulocytes
	  with eosinophils and neutrophils. This reference has been implemented in meffil. 
	  In this text file, samples are in rows and cell types in columns.  
	md5sum: 27ab648c56b56e62709a98fcba95a764
	filesize: 1.2M
	filetype: .txt
	belongs_to_container: alspacdcs:47b7227d-cfdc-4a6c-ac6b-fb885de042d6
	number_of_samples: 8669         
      - id: alspacdcs:fa746dfc-458a-4037-b190-6e40bb8cc7a1
	name: blood-gse35069.txt
	description: >-
	  Blood cell count estimates derived using the Reinius et al. 2012 reference 
	  (PMID 25424692; GEO accession GSE35069).  Cell counts estimated for b-cells,
	  cd4+ t cells, cd8+ t cells, granulocytes, monocytes, and natural killer cells.
	  In this text file, samples are in rows and cell types in columns.
	md5sum: 53fb63b4cef457d90688b3ddb861fa73
	filesize: 1021K
	filetype: .txt
	belongs_to_container: alspacdcs:47b7227d-cfdc-4a6c-ac6b-fb885de042d6
	number_of_samples:  8669
      - id: alspacdcs:3a1ab6dd-e48a-4b50-9d06-097078acfe54
	name: blood-idoloptimized-epic.txt
	description: >-
	  Cell counts in peripheral blood predicted using the cell type reference from Bioconductor 
	  package FlowSorted.Blood.EPIC. This reference has been implemented in meffil. In this text file,
	  samples are in rows and cell types in columns.
	md5sum: 7331e83d31e1d200bbff3d041223cde1
	filesize: 347K
	filetype: .txt
	belongs_to_container: alspacdcs:47b7227d-cfdc-4a6c-ac6b-fb885de042d6
	number_of_samples: 2742
      - id: alspacdcs:fa7b1498-e630-496f-9c0e-a592361b312a
	name: blood-idoloptimized.txt
	description: >-
	  Cell counts in peripheral blood predicted using the cell type reference from Bioconductor 
	  package FlowSorted.Blood.EPIC but restricted to the IDOLOptimizedCpGs450klegacy CpG sites. 
	  This reference has been implemented in meffil. In this text file, samples are in rows and 
	  cell types in columns.
	md5sum: 2c2bdbf34093960af969ca37ae43c77b
	filesize: 1.1M
	filetype: .txt
	belongs_to_container: alspacdcs:47b7227d-cfdc-4a6c-ac6b-fb885de042d6
	number_of_samples: 8669
      - id: alspacdcs:8206505f-6c04-458d-9889-7e7e80411721
	name: combined-cord-blood.txt
	description: >-
	  Cord blood cell count estimates derived using the Bakulski et al, Gervin et al., de Goede et al.,
	  and Lin et al. references (https://bioconductor.org/packages/release/data/experiment/html/FlowSorted.CordBloodCombined.450k.html)
	  for CpG sites selected using the IDOL algorithm and optimized for the Illumina Infinium 
	  HumanMethylation450 Beadchip. Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells,
	  granulocytes, monocytes, natural killer cells and nucleated red blood cells.
	  In this text file, samples are in rows and cell types in columns.
	md5sum: 7cbcf72ca00012d17d22ff6d21b7575c
	filesize: 129K
	filetype: .txt
	belongs_to_container: alspacdcs:47b7227d-cfdc-4a6c-ac6b-fb885de042d6
	number_of_participants: 913

  - id: alspacdcs:a011593e-6da2-4136-95e8-acdb402e9fb7
    name: detection p values
    description: >-
      This matrix shows the detection pvalues for each sample and
      each CpG and is extracted from the idat files using the "meffil.load.detection.pvalues"
      function in meffil. CpGs are in rows and samples in columns.
    data_distributions:
      - id: alspacdcs:659a87cb-65a6-4330-8d08-d8f5a243e6b1
	name: 450.gds
	description: >-
	  R object file for the detection p values matrix for the 450 array only.
	md5sum: 1c437226b2aab0c00aed7098e739f49d
	filesize: 22G
	filetype: .gds
	belongs_to_container: alspacdcs:41debec0-c442-419a-bc9d-55f3712e64c9
	number_of_samples: 5927
      - id: alspacdcs:e5e83be9-63d3-4da0-a4ef-7a9367df6c02
	name: common.gds
	description: >-
	  R object file for the detection p values matrix for both EPIC and 450 arrays.
	md5sum: c6f4348fa7d92a5f341f69e1784036da
	filesize: 30G
	filetype: .gds
	belongs_to_container: alspacdcs:41debec0-c442-419a-bc9d-55f3712e64c9
	number_of_samples: 8669
      - id: alspacdcs:0e522100-8538-45ec-a68a-3158da8605e8
	name: epic.gds
	description: >-
	  R object file for the detection p values matrix for the EPIC array only.
	md5sum: 341d1194d468e10e80be9dc9990c474b
	filesize: 18G
	filetype: .gds
	belongs_to_container: alspacdcs:41debec0-c442-419a-bc9d-55f3712e64c9
	number_of_samples: 2742

  - id: alspacdcs:1099f8cd-a644-46c9-8722-90a3bc34db30
    name: samplesheets
    description: >-
      Manifest files with columns extracted directly from LIMS and age,
      sex, omics ID, timepoint, timecode, sampletype, genotype columns to report
      sample mismatches, duplicate.rm column to remove duplicates.
      Samples in rows, variables in columns.
    data_distributions:
      - id: alspacdcs:74a1d3bd-310a-429f-af09-b5745740419e9o0
	name: samplesheet-450.csv
	description: >-
	  R data object manifest file for the 450 array only.
	md5sum: a94696265d5418d2240be82ab91c79d1
	filesize: 2.2M
	filetype: .csv
	belongs_to_container: alspacdcs:97cd1d8c-1847-471e-851d-702e74d0b878
	number_of_samples: 5927              
      - id: alspacdcs:74a1d3bd-310a-429f-af09-b5745740419e
	name: samplesheet-common.csv
	description: >-
	  R data object manifest file for both the EPIC and 450 arrays. This is a duplicate of samplesheet.csv.
	md5sum: 702d0d663d92b636fee1b04ff5f681fa
	filesize: 3.3M
	filetype: .csv
	belongs_to_container: alspacdcs:97cd1d8c-1847-471e-851d-702e74d0b878
	number_of_samples: 8669
      - id: alspacdcs:708fb297-82ee-4b61-a48f-73cc9642e0d9
	name: samplesheet-epic.csv
	description: >-
	  R data object manifest file for the EPIC array only.
	md5sum: 42b2dc297d28f4bc992eac9b6a17cb60
	filesize: 1.1M
	filetype: .csv
	belongs_to_container: alspacdcs:97cd1d8c-1847-471e-851d-702e74d0b878
	number_of_samples: 2742 
      - id: alspacdcs:15bce2eb-ef05-46bd-8cc3-c06e8d6ba2fd
	name: samplesheet.csv
	description: >-
	  R data object manifest file for both the EPIC and 450 arrays. This is a duplicate of samplesheet-common.csv.
	md5sum: 702d0d663d92b636fee1b04ff5f681fa # should be the same as samplesheet-common.csv
	filesize: 3.3M
	filetype: .csv
	belongs_to_container: alspacdcs:97cd1d8c-1847-471e-851d-702e74d0b878
	number_of_samples: 8669
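
The has_containers entries in the freeze document above form a tree through their belongs_to_container fields. The sketch below shows one way to turn a container id into a folder path. The ids and names are copied from the document; the function `container_path` is hypothetical, and reading the result as the on-disk layout is an assumption.

```python
# Three of the containers listed in the freeze document above.
containers = [
    {"id": "alspacdcs:b56c1d92-e706-4771-8c9f-aa3f8b4d696e", "name": "data"},
    {
        "id": "alspacdcs:55e81f3d-b724-495e-84dd-2a378a4aa5df",
        "name": "derived",
        "belongs_to_container": "alspacdcs:b56c1d92-e706-4771-8c9f-aa3f8b4d696e",
    },
    {
        "id": "alspacdcs:47b7227d-cfdc-4a6c-ac6b-fb885de042d6",
        "name": "cellcounts",
        "belongs_to_container": "alspacdcs:55e81f3d-b724-495e-84dd-2a378a4aa5df",
    },
]

by_id = {c["id"]: c for c in containers}


def container_path(container_id, lookup):
    """Walk belongs_to_container links upwards and join the names into a path."""
    parts = []
    seen = set()
    current = container_id
    while current is not None:
        if current in seen:
            raise ValueError("cycle in belongs_to_container links")
        seen.add(current)
        entry = lookup[current]
        parts.append(entry["name"])
        current = entry.get("belongs_to_container")
    return "/".join(reversed(parts))


cellcounts_path = container_path(
    "alspacdcs:47b7227d-cfdc-4a6c-ac6b-fb885de042d6", by_id
)
```

With these entries the cell count files resolve to data/derived/cellcounts.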

6 Gene Expression Data

6.1 Gene expression - array - G1 (ge_ht12_g1)

6.1.1 Description

There are two different types of QC'd data available in this version, one performed by David Evans for the Bryois et al 2014 paper, and one performed by Gibran Hemani for the molgenis eQTL mapping meta analysis. A version without QC is available as well. Details on the QC'd versions can be seen below.

This data was generated from LCLs. The majority of samples used in their generation were collected at age 9 years. LCLs are lymphoblastoid cell lines, which were produced by transforming lymphocytes with Epstein-Barr virus and cultured before DNA was extracted. Gene expression patterns may not be the same as those from untransformed lymphocytes taken from a 9-year-old.

6.1.2 Methodology

Bryois:

  • LCLs from unrelated individuals were grown under identical conditions and cells frozen in RNAlater. RNA was extracted using an RNeasy extraction kit (Qiagen) and was amplified using the Illumina TotalPrep-96 RNA Amplification kit (Ambion). Expression profiling of the samples, each with two technical replicates, was performed using the Illumina Human HT-12 V3 BeadChips (Illumina Inc), which include 48,804 probes; 200 ng of total RNA was processed according to the protocol supplied by Illumina. Raw data was imported into the Illumina Beadstudio software and probes with fewer than three beads present were excluded. Log2-transformed expression signals were then normalized with quantile normalization of the replicates of each individual followed by quantile normalization across all individuals.

We restricted our analysis to 23,935 probes tagging genes annotated in Ensembl. Principal component analysis was performed on 931 individuals. 62 individuals with principal component 1 or 2 greater than one standard deviation of the population were excluded from further analysis. See http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004461 for full details.

Molgenis:

  • Genetic outliers, defined as individuals that were clear outliers in the first 2 genetic principal components, were removed. Each probe was quantile normalised and then log2 transformed. The data were then adjusted for the first 4 genetic MDS components and for expression principal components (excluding those that had genetic associations), and scaled to have mean 0 and variance 1. See https://github.com/molgenis/systemsgenetics/wiki/eQTL-mapping-analysis-cookbook for full details.
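
Two of the steps named above, the log2 transform and the final scaling to mean 0 and variance 1, can be illustrated for a single probe as follows. This is a sketch only and not the Molgenis pipeline itself: quantile normalisation and covariate adjustment are omitted, the function names are hypothetical, and the use of the population (rather than sample) standard deviation is an assumption.

```python
import math
import statistics


def log2_transform(values):
    """Log2 transform a list of positive expression values."""
    return [math.log2(v) for v in values]


def scale_to_unit_variance(values):
    """Centre values on 0 and scale so that the population variance is 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("cannot scale a probe with zero variance")
    return [(v - mean) / sd for v in values]


# Made-up expression values for one probe across four individuals.
probe_values = [1.0, 2.0, 4.0, 8.0]
logged = log2_transform(probe_values)
scaled = scale_to_unit_variance(logged)
```

After scaling, values for different probes are on a common scale, which is what makes effect sizes comparable across probes.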

6.1.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:ge_ht12_g1_2015-11-02_f5
name: Gene expression - array - G1 release version 2015-11-02 freeze 5
description: >-
  This is the fifth freeze of the 2015-11-02 version of the
  ge_ht12_g1 dataset, which has .csv distributions of the data rather than
  .Rdata files in order to be easier to use across different data
  science software and languages.

freeze_size: 2.6G
linker_file_md5sum: a8b3ed028e1a22a41e428612a62bc7c9
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: e551ddec737da29e25fc8d3119989a6a
git_tag: https://github.com/alspac/dataset_ge_ht12_g1/releases/tag/freeze5
is_current_freeze: true
freeze_number: 5
freeze_date: 2025-02-27
previous_freeze: alspacdcs:ge_ht12_g1_2015-11-02_f4
freeze_of_alspac_dataset_version: alspacdcs:ge_ht12_g1_2015-11-02
freeze_of_named_alspac_dataset: alspacdcs:ge_ht12_g1
has_parts:
  - id: alspacdcs:ge_ht12_g1_2015-11-02_bryois_f5
    name: bryois data
    description: Dataset part for the bryois data in ge_ht12_g1 version 2015-11-02 freeze 5
    data_distributions:
      - id: alspacdcs:6495a875-5088-4a6a-86ac-9995d9203f72
	name: bryois.csv
	description: >-
	  The freeze 5 csv version of the bryois data.
	  IDs in columns and Illumina probe IDs in rows.
	  This is the normalised data used in Bryois et al 2014.
	  Probe IDs are mapped to genes in raw.csv
	md5sum: 2ef6aa2cd66c0cc31c69479bdc67432f
	filesize: 742M
	filetype: .csv
	number_of_participants: 947
	number_of_gene_expression_probe_values: 48630
  - id: alspacdcs:ge_ht12_g1_2015-11-02_molgenis_f5
    name: Molgenis data
    description: >-
      Dataset part for the Molgenis data in ge_ht12_g1 version 2015-11-02 freeze 5
    data_distributions:
      - id: alspacdcs:282fa0c9-a01b-4dd3-8664-78e0dde10e1f
	name: molgenis.csv
	description: >-
	  The freeze 5 csv version of the molgenis data.
	  IDs in columns and Illumina probe IDs in rows.
	  Normalised data following the molgenis pipeline,
	  found at
	  https://github.com/molgenis/systemsgenetics/wiki/eQTL-mapping-analysis-cookbook.
	  Probe IDs are mapped to genes in raw.csv
	md5sum: 4a3739d68b3d52d6650003aab2424ab8
	filesize: 752M
	filetype: .csv
	number_of_participants: 879
	number_of_gene_expression_probe_values: 48630
  - id: alspacdcs:ge_ht12_g1_2015-11-02_raw_f5
    name: Raw data
    description: Dataset part for the raw data in ge_ht12_g1 version 2015-11-02 freeze 5
    data_distributions: 
      - id: alspacdcs:7025f451-01eb-4a40-bd6c-dec89db0f7ab
	name: raw.csv
	description: >-
	  The freeze 5 csv version of the raw ge data.
	  IDs in columns and probes in rows. Four columns per
	  individual, with two columns for average signal and two columns
	  for average number of beads.
	  Presumably this is a file generated by the Illumina Genome
	  Studio software.
	md5sum: d1b6b2f1c8231e02666fea06ff1b4f9a
	filesize: 1.1G
	filetype: .csv
	number_of_participants: 994 ##This is not how wide this dataframe is
	number_of_gene_expression_probe_values: 48630

7 Omics tips

7.1 Introduction

This section is a guide to using 'Omics datasets. It explains which software to use and describes common file formats. It's a good starting point for beginners and helpful for problem-solving.

7.2 Disclaimer

Some information is copied or reworded from software documentation. Check the original documentation alongside this guide for up-to-date information. Note that some links may no longer work.

7.3 Operating systems

You can use ALSPAC data with any operating system, but Unix-like systems such as macOS, Linux, or BSD are more convenient due to the data's size and complexity. We recommend using the command line and programming scripts with languages like Bash, R, Python, or Perl. Many online resources are available to learn these tools. Use free/libre and open-source software where possible.

Links:

7.4 Key Omics software

7.4.1 Plink

Plink is a tool for performing quality control and whole genome association analysis of genetic data.

7.4.2 SNPTest

SNPTest is a tool for performing whole genome association analysis of genetic data.

7.4.3 BoltLmm

BoltLmm is a tool for performing genome association analysis of genetic data. It is recommended for analyses of more than 5000 samples, and its methods automatically take population substructure into account.

7.4.4 Qctools

Qctools is a tool for quality control of genetic data. It is also useful for inspecting and modifying .gen, .bgen and .vcf files (see section 7.5 below).

7.4.5 SAMTOOLS

Samtools is a suite of tools which are used for genomic analysis.

7.4.6 VCFTOOLS

VCFtools is a program package for working with vcf files. It is a separate project from Samtools.

7.4.7 BCFTOOLS

BCFtools is part of the Samtools project and allows users to manipulate .vcf and .bcf files.

7.5 File types

In a Unix environment the suffix (extension) of a file name has no special meaning to the operating system, unlike in Windows, which uses it to determine the file type. In a Unix system it is just part of the file name, and humans use it to distinguish file formats. The following is a non-exhaustive list of file types you may encounter whilst using ALSPAC Omics data.
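
Because the file name alone does not guarantee the format, it can be useful to check the first bytes of a file. As an illustration, gzip-compressed files (including .vcf.gz files) start with the two bytes 0x1f 0x8b. The sketch below tests for that signature; the function names are hypothetical.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(first_bytes):
    """True if the given leading bytes carry the gzip signature."""
    return first_bytes[:2] == GZIP_MAGIC


def file_looks_gzipped(path):
    """Read the first two bytes of a file and test them for the gzip signature."""
    with open(path, "rb") as handle:
        return looks_gzipped(handle.read(2))


# Demonstration on in-memory data rather than a real file.
compressed = gzip.compress(b"##fileformat=VCFv4.2\n")
plain = b"##fileformat=VCFv4.2\n"
demo_result = (looks_gzipped(compressed), looks_gzipped(plain))
```

A check like this tells you only that a file is gzip-compressed, not what it contains once decompressed.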

7.5.1 .gen

This is an 'Oxford' data format for genetic data. The .gen file is a plain text file, which means that standard Unix command line tools such as 'head' or 'less' can be used to inspect the data.

The .gen (genotype) file stores data in a one-line-per-SNP format. The first 5 entries of each line are the SNP ID, the RS ID of the SNP, the base-pair position of the SNP, the allele coded A and the allele coded B. The SNP ID can be used to denote the chromosome number of each SNP. The next three numbers on the line are the probabilities of the three genotypes AA, AB and BB at the SNP for the first individual in the cohort. The next three numbers are the genotype probabilities for the second individual, the next three are for the third individual, and so on. The order of individuals in the genotype file should match the order of the individuals in the sample file (see below).

The probabilities need not sum to 1, which allows for the possibility of a NULL genotype call. This format allows for genotype uncertainty. It is the same as that produced by the genotype calling algorithm CHIAMO.

NOTE: We recommend that you arrange SNPs in base-pair order in the genotype files. This is required if you want to use the files with IMPUTE and will make viewing the output of SNPTEST somewhat easier.

For example, suppose you want to create a genotype file for 2 individuals at 5 SNPs whose genotypes are:

SNP 1 AA AA
SNP 2 GG GT
SNP 3 CC CT
SNP 4 CT CT
SNP 5 AG GG

The correct genotype file would look like this:

SNP1 rs1 1000 A C 1 0 0 1 0 0
SNP2 rs2 2000 G T 1 0 0 0 1 0
SNP3 rs3 3000 C T 1 0 0 0 1 0
SNP4 rs4 4000 C T 0 1 0 0 1 0
SNP5 rs5 5000 A G 0 1 0 0 0 1
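
As an illustration, the line layout described above can be read with a few lines of Python. This is a minimal sketch rather than part of any ALSPAC tooling: the names GenRecord and parse_gen_line are hypothetical, and it assumes an uncompressed, whitespace-separated .gen file.

```python
from typing import List, NamedTuple, Tuple


class GenRecord(NamedTuple):
    snp_id: str
    rs_id: str
    position: int
    allele_a: str
    allele_b: str
    # One (P(AA), P(AB), P(BB)) triple per individual, in .sample file order.
    probabilities: List[Tuple[float, ...]]


def parse_gen_line(line: str) -> GenRecord:
    """Split one .gen line into its five leading fields and per-individual triples."""
    fields = line.split()
    if len(fields) < 5 or (len(fields) - 5) % 3 != 0:
        raise ValueError("expected 5 leading fields followed by triples of probabilities")
    values = [float(x) for x in fields[5:]]
    # Group the flat list of numbers into consecutive triples.
    triples = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return GenRecord(fields[0], fields[1], int(fields[2]), fields[3], fields[4], triples)


# Second line of the example genotype file: individual 1 is GG, individual 2 is GT.
record = parse_gen_line("SNP2 rs2 2000 G T 1 0 0 0 1 0")
print(record.rs_id, record.position, record.probabilities)
```

For real datasets, dedicated tools such as qctool are likely to be far faster; the sketch is only meant to make the column layout concrete.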

7.5.2 .bgen

A binary version of a .gen file. This file cannot be visually inspected on the command line. .bgen files are used because they greatly improve the speed and storage efficiency of software that handles large amounts of Omics data. The full details of the file format are discussed at: https://www.well.ox.ac.uk/~gav/bgen_format/

.bgen files are normally used with tools such as qctool and SNPTEST. There is also a library for reading .bgen files into R: https://bitbucket.org/gavinband/bgen/wiki/rbgen

7.5.3 .sample

The .sample file is paired with either .gen or .bgen files. It contains information on the samples that is not genetic. It is a plain text file that can be inspected with standard Unix command line tools.

Please note that the sample file format changed with the release of SNPTEST v2. Specifically, the way in which covariates and phenotypes are coded on the second line of the file has changed. The sample file has three parts: (a) a header line detailing the names of the columns in the file, (b) a line detailing the types of variables stored in each column, and (c) a line for each individual detailing the information for that individual. Here is an example of the start of a sample file for reference:

ID_1 ID_2 missing cov_1 cov_2 cov_3 cov_4 pheno1 bin1
0 0 0 D D C C P B
1 1 0.007 1 2 0.0019 -0.008 1.233 1
2 2 0.009 1 2 0.0022 -0.001 6.234 0
3 3 0.005 1 2 0.0025 0.0028 6.121 1
4 4 0.007 2 1 0.0017 -0.011 3.234 1
5 5 0.004 3 2 -0.012 0.0236 2.786 0

The header line: This line needs a minimum of three entries. The first three entries should always be ID_1, ID_2 and missing. They denote that the first three columns contain the first ID, second ID and missing data proportion of each individual. Additional entries on this line should be the names of covariates or phenotypes that are included in the file. In the above example, there are 4 covariates named cov_1, cov_2, cov_3, cov_4, a continuous phenotype named pheno1 and a binary phenotype named bin1. NOTE: All phenotypes should appear after the covariates in this file.

The second line: This line details the type of variable included in each column. The first three entries of this line should be set to 0. Subsequent entries for covariates and phenotypes should be specified by the following rules:

D Discrete covariate (coded using positive integers)
C Continuous covariates
P Continuous Phenotype
B Binary Phenotype (0 = Controls, 1 = Cases)

The remainder of the file should consist of a line for each individual containing the information specified by the entries of the header line (see example above). Separate the entries of the sample file with spaces, not tabs, because a space is the expected separator.

Missing values: Missing values can be specified for covariates and phenotypes. Previously the recommendation was to use -9 for missing values. This was the default value assumed by SNPTEST v1, although the -missing_code option in SNPTEST v1 meant that you could use other numeric values for the missing code. In SNPTEST v2 the behavior of the -missing_code option has changed so that it now takes a comma-separated list of values, each of which is treated as missing when encountered in the sample file(s). The default missing value is now denoted by the two-character string "NA".
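
The three-part layout (names line, types line, one line per individual) can be sketched in Python as follows. This is an illustrative reader under stated assumptions, not an official parser: read_sample and CONVERTERS are hypothetical names, only the four type codes listed above are converted, and the two data rows are made up (the "NA" was inserted to show missing-value handling).

```python
import io

# Converters for the type codes on the second line of the file.
CONVERTERS = {"D": int, "B": int, "C": float, "P": float}


def read_sample(handle, missing_codes=("NA",)):
    """Return ({column name: type code}, list of row dicts) from a .sample file."""
    names = handle.readline().split()
    types = handle.readline().split()
    if names[:3] != ["ID_1", "ID_2", "missing"]:
        raise ValueError("first three columns must be ID_1, ID_2 and missing")
    if len(names) != len(types):
        raise ValueError("names line and types line differ in length")
    rows = []
    for line in handle:
        values = line.split()
        if not values:
            continue  # skip blank lines
        if len(values) != len(names):
            raise ValueError("row has the wrong number of entries")
        row = {}
        for name, kind, value in zip(names, types, values):
            if kind == "0":
                row[name] = value  # IDs and missing proportion are kept as text
            elif value in missing_codes:
                row[name] = None
            else:
                row[name] = CONVERTERS.get(kind, str)(value)
        rows.append(row)
    return dict(zip(names, types)), rows


example = io.StringIO(
    "ID_1 ID_2 missing cov_1 cov_2 cov_3 cov_4 pheno1 bin1\n"
    "0 0 0 D D C C P B\n"
    "1 1 0.007 1 2 0.0019 -0.008 1.233 1\n"
    "2 2 0.009 1 2 0.0022 NA 6.234 0\n"
)
column_types, samples = read_sample(example)
print(column_types["pheno1"], samples[0]["pheno1"], samples[1]["cov_4"])
```

Passing a different missing_codes tuple, for example ("NA", "-9"), mirrors the idea of a list of values that are all treated as missing.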

7.5.4 .ped

A plink format file that is in plain text and can be viewed with standard tools. It contains genetic variant data. https://www.cog-genomics.org/plink/1.9/formats#ped

7.5.5 .map

A plink format file that is in plain text. It contains information about variants. https://www.cog-genomics.org/plink/1.9/formats#map

7.5.6 .bed

A plink format file that is a binary equivalent of a .ped file. It is smaller and faster to process but is not easily viewable or editable. https://www.cog-genomics.org/plink/1.9/formats#bed

7.5.7 .bim

A plink format file, similar to a .map file, that is used with binary .bed files. https://www.cog-genomics.org/plink/1.9/formats#bim

7.5.8 .fam

A plain text format that contains sample information for plink binary files. https://www.cog-genomics.org/plink/1.9/formats#fam
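
Because .bim and .fam files are plain text with six whitespace-separated columns each, they can be read with very little code. The sketch below is only an illustration: read_table and the column labels are hypothetical names chosen here (the column order follows the plink format pages linked above), and the rows are made up.

```python
from typing import Dict, List

# Column order as described in the plink 1.9 format documentation.
FAM_COLUMNS = ["family_id", "individual_id", "father_id", "mother_id", "sex", "phenotype"]
BIM_COLUMNS = ["chromosome", "variant_id", "cm_position", "bp_position", "allele_1", "allele_2"]


def read_table(text: str, columns: List[str]) -> List[Dict[str, str]]:
    """Turn whitespace-separated lines into dicts keyed by the given column names."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(columns):
            raise ValueError("expected %d columns, found %d" % (len(columns), len(fields)))
        rows.append(dict(zip(columns, fields)))
    return rows


fam_rows = read_table("fam1 id1 0 0 1 -9\nfam2 id2 0 0 2 -9\n", FAM_COLUMNS)
bim_rows = read_table("2 rs1 0 5000 A C\n", BIM_COLUMNS)
print(fam_rows[1]["individual_id"], bim_rows[0]["bp_position"])
```

The row order of the .fam file gives the sample order of the paired .bed file, and the row order of the .bim file gives its variant order.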

7.5.9 .csv

A plain text format where different fields are separated by commas (comma-separated values).

7.5.10 .vcf

VCF files are a flexible file format for storing different types of genetic variants. They are a plain text format that can be inspected on the command line with standard Unix tools. However, they are often very large files, and specific tools such as 'vcftools' are useful for working with this data. SNPs are most commonly stored in these files, but other variants such as copy number variations can also be stored. The basic form of a vcf file is described at: https://en.wikipedia.org/wiki/Variant_Call_Format
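
To make the layout concrete, here is a small Python sketch that reads uncompressed VCF text: lines starting with "##" are meta-information, the "#CHROM" line names the columns, and each remaining tab-separated line is one variant. The function name parse_vcf_text and the example content are hypothetical; for large files a dedicated tool is a better choice.

```python
VCF_FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_text(text):
    """Yield one dict per variant line from uncompressed VCF text."""
    sample_names = []
    for line in text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # meta-information or blank line
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            sample_names = fields[9:]  # columns after FORMAT are the samples
            continue
        record = dict(zip(VCF_FIXED_COLUMNS, fields[:8]))
        record["POS"] = int(record["POS"])
        if len(fields) > 9:
            # FORMAT lists the keys (e.g. GT) used in every sample column.
            keys = fields[8].split(":")
            record["samples"] = {
                name: dict(zip(keys, value.split(":")))
                for name, value in zip(sample_names, fields[9:])
            }
        yield record


example_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample_1\tsample_2",
    "1\t1000\trs1\tA\tC\t.\tPASS\t.\tGT\t0/0\t0/1",
])
variants = list(parse_vcf_text(example_vcf))
print(variants[0]["ID"], variants[0]["POS"], variants[0]["samples"]["sample_2"]["GT"])
```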

7.5.11 .bcf

This is a binary version of a vcf file. It cannot be inspected on the command line, but can be used with the genomic tools mentioned in this document.

7.5.12 .tar.gz

This is a standard Unix file format for bundling and compressing a set of files. It is similar to a .zip file. It is made by first bundling a set of files into a .tar file (sometimes called a tarball), which is then compressed using 'gzip'. https://en.wikipedia.org/wiki/Tar_(computing) https://en.wikipedia.org/wiki/Gzip
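
The bundle-then-compress idea can be tried out with Python's standard tarfile module. The sketch below creates a small archive in a temporary directory and reads it back; the file names example.txt and bundle.tar.gz are placeholders, not files you will receive.

```python
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    source = work / "example.txt"  # placeholder file to bundle
    source.write_text("hello\n")

    archive = work / "bundle.tar.gz"
    # Mode "w:gz" bundles with tar and compresses with gzip in one step.
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=source.name)

    # Mode "r:gz" opens the compressed archive for reading.
    with tarfile.open(archive, "r:gz") as tar:
        member_names = tar.getnames()
        extracted_text = tar.extractfile("example.txt").read().decode()

print(member_names, repr(extracted_text))
```

To unpack a whole archive to disk rather than reading a single member, tarfile also offers extractall; only unpack archives from sources you trust.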

7.5.13 .enc

This file extension is used as a convention to mean that the file is encrypted. You will need the password that was used to encrypt the data in order to decrypt the files. https://en.wikipedia.org/wiki/OpenSSL

7.6 Variant/SNP ids

There are many types of genetic variation. A common type is a single nucleotide polymorphism (SNP). Others include copy number variations.

Variants can be specified by a Chromosome and location in reference to a specific build of the human genome. They can also be given a reference SNP (rs) cluster identifier.

  • Chr:Location
  • Rs ids
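
When combining files it can help to check which of the two identifier styles a variant ID uses. The following is a rough sketch only: classify_variant_id is a hypothetical helper, and the positional pattern (an optional "chr" prefix followed by chromosome:position) is an assumption. Real files often append alleles or use other separators, so adjust the pattern to the data in hand.

```python
import re

RS_PATTERN = re.compile(r"^rs(\d+)$")
# Assumed positional layout: optional "chr" prefix, then chromosome:position.
POSITION_PATTERN = re.compile(r"^(?:chr)?([0-9]{1,2}|X|Y|MT?):(\d+)$", re.IGNORECASE)


def classify_variant_id(variant_id):
    """Label an ID as an rs identifier, a chromosome:position pair, or unknown."""
    rs_match = RS_PATTERN.match(variant_id)
    if rs_match:
        return {"kind": "rs", "rs_number": int(rs_match.group(1))}
    pos_match = POSITION_PATTERN.match(variant_id)
    if pos_match:
        return {
            "kind": "position",
            "chromosome": pos_match.group(1).upper(),
            "position": int(pos_match.group(2)),
        }
    return {"kind": "unknown"}


for example_id in ["rs12345", "chr2:5000", "X:100", "something_else"]:
    print(example_id, classify_variant_id(example_id))
```

A chromosome:position ID is only meaningful together with the genome build it refers to, so record the build alongside any positions you store.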

7.7 Overview of Imputation reference panels

SNP array data frequently contain hundreds of thousands of variants. However, due to linkage disequilibrium it is possible to estimate many more SNP values for an individual. This estimation procedure is called imputation, and it works by combining an individual's SNP array data with a large reference population of sequenced data. In this way it is possible to have accurate estimations of millions of SNP values for an individual without the cost of fully sequencing each person. ALSPAC has prerun the imputation process using the imputation panels listed below.

7.7.1 Panels

  1. TOPMed

    The latest reference panel (to ALSPAC), which has the most SNPs.

  2. HRC

    A recent reference panel; our HRC-imputed data contains circa 40 million SNPs.

  3. 1000 Genomes

    This is the previous generation reference panel which is still widely used in ALSPAC studies. There are some SNPs that appear in this panel that are not in the HRC panel.

  4. Hapmap

    This was the first widely used imputation panel.

7.8 SNP data types from imputation

SNPs that have been imputed can be stored and analysed in different formats. These can be appropriate for different types of analysis; for example, an analysis could assume an additive effect for the minor allele, or it could assume a recessive/dominant effect.

  • Best guess. The data will be presented as either 0, 1 or 2 to represent how many copies of the minor allele a person has at that position. The best guess is derived from the genotype probabilities calculated by the imputation process.
  • Dosage. This is based on the probabilities that the person has 0, 1 or 2 copies of the minor allele, e.g. 0.1, 0.2, 0.7. These sum to one across the three possibilities (i.e. for each SNP for each individual). A single dosage value is commonly the expected number of copies, here 0.2 + 2 × 0.7 = 1.6.
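
Both representations can be derived from the three probabilities for 0, 1 or 2 copies of the counted allele. The Python sketch below is illustrative only: the function names are hypothetical, the expected allele count is one common definition of a single-number dosage, and the optional threshold for refusing an uncertain call is an assumption rather than an ALSPAC setting.

```python
from typing import Optional, Sequence


def best_guess(probabilities: Sequence[float], threshold: float = 0.0) -> Optional[int]:
    """Most probable allele count (0, 1 or 2), or None if its probability is below threshold."""
    count = max(range(3), key=lambda k: probabilities[k])
    return count if probabilities[count] >= threshold else None


def expected_allele_count(probabilities: Sequence[float]) -> float:
    """Expected number of copies: 0 * P(0) + 1 * P(1) + 2 * P(2)."""
    return probabilities[1] + 2.0 * probabilities[2]


example_probabilities = (0.1, 0.2, 0.7)
print(best_guess(example_probabilities))
print(best_guess(example_probabilities, threshold=0.9))
print(round(expected_allele_count(example_probabilities), 3))
```

With a threshold of 0.9 the example yields no call, because the most probable genotype has probability 0.7; whether to apply such a cut-off depends on the analysis.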

7.9 SNP Statistics

You can generate statistics on your SNP data using the program 'qctool'. Its per-variant summary statistics include the imputation information scores, and it can also summarise each sample. For example, to compute per-sample statistics:

qctool -g example.bgen -s example.sample -sample-stats -osample sample-stats.txt

7.10 Best practice

7.10.1 GWAS

We recommend you follow the steps outlined in the following paper when performing GWAS: Marees, Andries T., et al. "A tutorial on conducting genome‐wide association studies: Quality control and statistical analysis." International journal of methods in psychiatric research 27.2 (2018): e1608. https://doi.org/10.1002/mpr.1608

7.10.2 PheWAS

We recommend you follow the steps outlined in the following paper when performing PheWAS: Millard, L., Davies, N., Timpson, N. et al. MR-PheWAS: hypothesis prioritization among potential causal effects of body mass index on many outcomes, using Mendelian randomization. Sci Rep 5, 16645 (2015). https://doi.org/10.1038/srep16645

7.10.3 Methylation

The following paper describes the methylation data available in ALSPAC: Relton, Caroline L., et al. "Data resource profile: accessible resource for integrated epigenomic studies (ARIES)." International journal of epidemiology 44.4 (2015): 1181-1190.

7.11 Population stratification

Population stratification occurs when an observed genetic association is driven by differences in ancestry or geography between groups of participants rather than by the variant itself. Not taking this into account can lead to biased effect estimates. One common method to account for it is to calculate principal components (PCs) of the genetic data and include these as covariates in any models.

ALSPAC do not provide PCs as part of the standard omics datasets, as these would need to be regenerated and tested alongside each freeze. PCs can be generated using plink, Hail or a variety of other tools.

For more information about how to do this in plink see: https://www.cog-genomics.org/plink/1.9/strat

Another common method used to account for population substructure is to use linear mixed models, for example with the BOLT-LMM software tool.

https://data.broadinstitute.org/alkesgroup/BOLT-LMM/

7.12 Polygenic risk scores (PRS)

These are scores which estimate the combined effect of variants in an individual's genome on a given phenotypic trait or disease.

Further explanations can be found online, such as: https://www.genome.gov/Health/Genomics-and-Medicine/Polygenic-risk-scores

There are also tutorials for calculating PRSs, such as: https://www.nature.com/articles/s41596-020-0353-1

Different collaborators often generate PRSs for ALSPAC, but these are not shared as part of our standard omics datasets. Collaborators who need PRSs will need to generate these themselves.
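
In its simplest form a PRS is a weighted sum: for each variant, the individual's count (or dosage) of the effect allele is multiplied by that variant's effect size, and the products are added up. The Python sketch below shows only this arithmetic; polygenic_score is a hypothetical name, the weights and dosages are made-up numbers, and it assumes the dosages already count the same effect allele that the weights refer to.

```python
from typing import Mapping


def polygenic_score(dosages: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Sum of weight * dosage over the variants present in both inputs."""
    shared = sorted(set(dosages) & set(weights))
    return sum(weights[variant] * dosages[variant] for variant in shared)


# Made-up effect sizes per variant (e.g. taken from a published GWAS).
example_weights = {"rs1": 0.5, "rs2": -0.25, "rs3": 1.0}
# Made-up effect-allele dosages for one individual; rs3 is not available.
example_dosages = {"rs1": 2.0, "rs2": 1.0}

score = polygenic_score(example_dosages, example_weights)
print(score)  # 0.5 * 2.0 + (-0.25) * 1.0 = 0.75
```

Variants missing from either input are silently skipped here; a real pipeline would usually report how many weights could not be matched, and the tutorial linked above covers steps such as variant selection and allele alignment that this sketch omits.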

7.13 Common tasks

Here we provide links to webpages with instructions, or brief example code, for completing common tasks using the software described above (section x):

  • Extract some SNPs from a bgen data file and convert to plain text.

https://www.well.ox.ac.uk/~gav/qctool_v2/documentation/examples/filtering_variants.html

  • Extract some SNPs from bed data:

http://zzz.bwh.harvard.edu/plink/dataman.shtml

plink --bfile mydata --chr 2 --from-kb 5000 --to-kb 10000

  • Reading .bgen and .sample oxford files in plink

Plink supports bgen files but it is fussy about the types of its columns in the data.sample file. You may wish to remove or retype columns to read a data.sample file into plink. For more info see:

https://www.cog-genomics.org/plink/2.0/input

To make a new sample file removing some columns you can use the Unix command: 'cut -f 1,2,3 -d " " data.sample > data2.sample'

7.14 Courses

Working with Omics data can be complicated, but there are many excellent resources available to help you learn how to do this. There are both paid in-person courses and free online courses.

Details on paid courses offered by Bristol University can be found here: https://www.bristol.ac.uk/medical-school/study/short-courses/ In addition, a number of free online courses are summarised here: https://www.mooc-list.com/tags/bioinformatics

7.15 Further sources of help

7.15.1 Stack Exchange

Stack Exchange is an online Q&A community which is divided into different sub-communities. The first and most well-known is Stack Overflow. This is one of the best places to ask questions about programming on the Internet. Other useful sites include bioinformatics https://bioinformatics.stackexchange.com/, maths https://mathoverflow.net/ and statistics https://stats.stackexchange.com/.

7.15.2 Biostars

Biostars is a bioinformatics community Q&A website: https://www.biostars.org/

7.15.3 Mailing lists

For individual products or projects there is often a mailing list. For example, to get help using SNPTEST you can ask on the mailing list: https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html#contact

7.15.4 AI tools

AI tools such as ChatGPT can be useful for understanding how to work with omics data, but please be aware of their limitations and check their answers against documentation or research papers directly.

7.15.5 Ask ALSPAC

If you cannot find the answer to your question, or you think there is something wrong with your data, then please contact the alspac-omics@bristol.ac.uk mailbox and we will do our best to help you.

Author: ALSPAC Omics team

Created: 2025-06-06 Fri 11:32
