ALSPAC OMICs Data Catalogue

Table of Contents

1 Introduction

Welcome to the ALSPAC Omics Catalogue, a guide to the omics data offered by ALSPAC. This catalogue features a variety of named ALSPAC datasets, each consisting of collected or produced data that has been organized, named, and curated for ease of use. Every named ALSPAC dataset comes with accompanying metadata that provides information about the dataset as a whole. Each named ALSPAC dataset has at least one release version that includes a curated selection of files detailed in the metadata sections.

Please note that these datasets are not generally accessible. See http://www.bristol.ac.uk/alspac/researchers/access/ for details on how to apply for access.

The information within this catalogue is made available for browsing to help both internal ALSPAC users and external researchers understand the data and facilitate prospective data requests.

For external ALSPAC collaborators, we offer "freezes" of specific versions of named ALSPAC datasets as standard. These freezes, along with their metadata, are outlined in this catalogue. External collaborators are granted access to these freezes once their request is approved. A freeze is a carefully selected subset of the data files within a version: it contains the core data from a dataset, with individuals who have withdrawn consent removed and specific dataset IDs applied. Freezes are updated periodically.

Due to the removal of withdrawn individuals from the freezes, please note that the number of participants within each dataset may change over time and may not match those found in the Methodology fields.
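
The effect of removing withdrawn individuals can be pictured with a minimal sketch. The IDs and the function name `apply_exclusions` are hypothetical and used only for illustration; this is not the ALSPAC freeze pipeline itself.

```python
def apply_exclusions(participant_ids, withdrawn_ids):
    """Return participant IDs with withdrawn individuals removed, order preserved."""
    withdrawn = set(withdrawn_ids)
    return [pid for pid in participant_ids if pid not in withdrawn]


# Hypothetical IDs: a dataset version with four participants, one of whom
# has since withdrawn consent.
version_ids = ["A001", "A002", "A003", "A004"]
withdrawn = ["A003"]

freeze_ids = apply_exclusions(version_ids, withdrawn)
print(len(version_ids), len(freeze_ids))
```

Because the exclusion list grows over time, the same dataset version can yield a smaller participant count in a later freeze than the figure quoted in its Methodology field.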

Freeze 1 timing: July 2021 - Dec 2022
Freeze 2 timing: Dec 2022 - Dec 2023
Freeze 3 timing: Jan 2023 - Oct 2024
Freeze 4 timing: Oct 2024 - present

Documentation for the current freeze is given below in the form of a YAML file, which lists the files external collaborators will receive together with their metadata.

NamedALSPACDataset DatasetVersion Freeze

The metadata presented in our catalogue adheres to the ALSPAC Data catalogue Schema, which is crafted in LinkML. To explore the full schema documentation, please visit: https://alspac.github.io/alspac-data-catalogue-schema/

This website is equipped with RDFa, enabling the metadata to be machine-readable and allowing for the creation of queries using SPARQL with compatible tools, such as Apache Any23 and Apache Jena.

For more information about this, see the document on FAIR data principles and the document describing the rationale and construction of this catalogue here.

2 Catalogue overview

3 Genetic Array Data

3.1 Genome-wide - Illumina 550 quad - G1 (gwa_550_g1)

3.1.1 Description

This dataset contains genome-wide array genotype calls for G1 individuals. Reference genome build: GRCh37

3.1.2 Methodology

ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platforms by 23andMe, subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%); and insufficient sample replication (IBD < 0.8).

Population stratification was assessed by multidimensional scaling analysis and compared with Hapmap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed.

SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed. Cryptic relatedness was measured as proportion of identity by descent (IBD > 0.1). Related subjects were removed.
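
The SNP-level thresholds above can be written as a simple predicate. This is a minimal sketch, assuming per-SNP summary statistics are already available; the `SnpStats` class and its field names are hypothetical and not part of any ALSPAC or PLINK interface.

```python
from dataclasses import dataclass


@dataclass
class SnpStats:
    """Hypothetical per-SNP summary statistics."""
    maf: float        # minor allele frequency
    call_rate: float  # proportion of samples with a non-missing call
    hwe_p: float      # Hardy-Weinberg equilibrium test P value


# Thresholds quoted in the methodology text.
MAF_MIN = 0.01
CALL_RATE_MIN = 0.95
HWE_P_MIN = 5e-7


def snp_passes_qc(stats):
    """A SNP is kept only if it fails none of the three removal criteria."""
    return (
        stats.maf >= MAF_MIN
        and stats.call_rate >= CALL_RATE_MIN
        and stats.hwe_p >= HWE_P_MIN
    )


print(snp_passes_qc(SnpStats(maf=0.20, call_rate=0.99, hwe_p=0.30)))
print(snp_passes_qc(SnpStats(maf=0.005, call_rate=0.99, hwe_p=0.30)))
```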

Associated publication:

3.1.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gwa_550_g1_2022-12-05_f4
name: >-
  Genome-wide array data for G1 individuals 2022-12-05 freeze 4
description: >-
  The fourth freeze of the genome-wide array data for G1 based on a
  2022-12-05 release. The data is in plink format.
freeze_size: 997M
linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112
git_tag: https://github.com/alspac/dataset_gwa_550_g1/releases/tag/freeze4
is_current_freeze: true
freeze_number: 4
freeze_date: 2024-06-11
previous_freeze: alspacdcs:gwa_550_g1_2022-12-05_f3
freeze_of_alspac_dataset_version: alspacdcs:gwa_550_g1_2022-12-05
freeze_of_named_alspac_dataset: alspacdcs:gwa_550_g1

has_containers:
  - id: alspacdcs:e8e8dde6-0841-4135-aec5-13dee5aa065a ## uuid
    name: data
    description: A dir/folder containing the two freeze data files


has_parts:
  - id: alspacdcs:cb1d46af-b413-4820-b395-3ab2c07c336e
    name: Biallelic genotype table
    description: >-
      genotype data
    data_distributions:
      - id: alspacdcs:2edc1c1f-bd1c-4d8d-a258-f85a5e2c0b5c
	name: freeze_id.bed
	description: >- 
	  Plink bed file.
	  Primary representation of genotype calls at biallelic
	  variants. Must be accompanied by .bim and .fam files.
	md5sum: 94973786388f80000dcdad0a80514e37
	filesize: 982M
	filetype: .bed
	number_of_participants: 8223
	number_of_variants: 500527
	belongs_to_container: alspacdcs:e8e8dde6-0841-4135-aec5-13dee5aa065a
  - id: alspacdcs:5a798cc1-ffba-4c69-a54a-de5fd6e616cb
    name: Variant Information
    description: >-
      Information about SNPs
    data_distributions:
      - id: alspacdcs:356763e4-11e0-4a22-ab01-14f3c3f58bac
	name: freeze_id.bim
	description: >-
	   Extended variant information file accompanying a .bed binary
	   genotype table. (--make-just-bim can be used to update just
	   this file.) A text file with no header line, and one line per
	   variant with the following six fields:

	   1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT';
	      '0' indicates unknown) or name
	   2. Variant identifier
	   3. Position in morgans or centimorgans (safe to use dummy value of '0')
	   4. Base-pair coordinate (1-based; limited to 2^31-2)
	   5. Allele 1 (corresponding to clear bits in .bed; usually minor)
	   6. Allele 2 (corresponding to set bits in .bed; usually major)
	md5sum: b0789ac6126af474c916c80f77335f6a
	filesize: 14M
	filetype: .bim
	number_of_variants: 500527
	belongs_to_container: alspacdcs:e8e8dde6-0841-4135-aec5-13dee5aa065a
  - id: alspacdcs:5cee7fda-8d37-4667-9909-91b847689c98
    name: sample info
    description: >-
      Sample ids
    data_distributions:
      - id: alspacdcs:a81eb161-3051-4557-88d6-d82068016c67
	name: freeze_id.fam
	description: >-
	  A text file with no header line, and one line per sample
	  with the following six fields:
	    1. Family ID ('FID')
	    2. Within-family ID ('IID'; cannot be '0')
	    3. Within-family ID of father ('0' if father isn't in dataset)
	    4. Within-family ID of mother ('0' if mother isn't in dataset)
	    5. Sex code ('1' = male, '2' = female, '0' = unknown)
	    6. Phenotype value ('1' = control, '2' = case,
	    '-9'/'0'/non-numeric =
	    missing data if case/control)
	md5sum: 854ea4dcd904ca37f441ca671e445634
	filesize: 256k
	filetype: .fam
	number_of_participants: 8223
	belongs_to_container: alspacdcs:e8e8dde6-0841-4135-aec5-13dee5aa065a
  - id: alspacdcs:9b487074-065a-4924-9ffa-f2864f148ba9
    name: Heterozygous haploid and nonmale Y chromosome call list
    description: >-
      A plink report
    data_distributions:
      - id: alspacdcs:64a1b50e-68b5-4857-b876-b561ed1e9fec
	name: freeze_id.hh
	description: >-
	  Produced automatically when the input data contains
	  heterozygous calls where they shouldn't be possible (haploid
	  chromosomes, male X/Y), or there are nonmissing calls for
	  nonmales on the Y chromosome.

	  A text file with one line per error (sorted primarily by
	  variant ID, secondarily by sample ID) with the following three fields:

	  Family ID
	  Within-family ID
	  Variant ID
	md5sum: 173734a688e9ff15c2911a91636bee56
	filesize: 1.7M
	filetype: .hh
	belongs_to_container: alspacdcs:e8e8dde6-0841-4135-aec5-13dee5aa065a
  - id: alspacdcs:e2fea4ec-7fc8-4e09-8b12-4be1c2ddc1b6
    name: Logs
    description: >-
      plink log
    data_distributions:
      - id: alspacdcs:caa69afd-0c19-4299-b660-fb308988a6ee
	name: freeze_id.log
	description: >-
	  plink log file
	md5sum: 0b069047e228212360cc189a5d689d50
	filesize: 512
	filetype: .log
	belongs_to_container: alspacdcs:e8e8dde6-0841-4135-aec5-13dee5aa065a
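
Each data distribution in a freeze document carries an `md5sum` field, which a recipient can use to check that a file arrived intact. The following is a minimal sketch using only the Python standard library; the function names and the local path in the usage note are assumptions, not part of the catalogue schema.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_freeze_doc(path, expected_md5):
    """Compare a file's MD5 digest with the md5sum listed in the freeze document."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_freeze_doc("data/freeze_id.bed", "94973786388f80000dcdad0a80514e37")` would check the .bed file listed above, assuming it has been saved under the hypothetical path `data/freeze_id.bed`.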

3.2 Genome-wide - Illumina exome core array - G0 partners (gwa_exome_g0p)

3.2.1 Description

This dataset contains genome wide array genotype calls for G0 mothers and partners. Reference genome build: GRCh37

3.2.2 Methodology

3,453 ALSPAC mothers and fathers and 535,478 SNPs were genotyped using the Illumina HumanCoreExome chip genotyping platforms by the ALSPAC lab and called using GenomeStudio. The resulting raw genome-wide data were subjected to standard quality control methods using PLINK (v1.07). Individuals were excluded on the basis of gender mismatches (n = 80); minimal or excessive heterozygosity (n = 64); disproportionate levels of individual missingness (>5%, n = 60) and possible contamination (n = 3).

Population stratification was assessed by multidimensional scaling analysis and compared with 1000 Genomes phase 3 data and principal component analysis (n = 266); all individuals with non-European ancestry were removed. Cryptic relatedness was measured as SNP relatedness in GCTA (relatedness > 0.1, n = 69 removed). SNPs with a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 1E-7) and those which failed GenomeStudio quality control measures were removed (n = 21,298). 6,594 duplicate SNPs were also removed. This resulted in genotypes for 2,911 unrelated mothers and fathers at 507,586 SNPs. We then identified 2217 samples where the ALN assigned historically by the lab matched the genetically assigned ALN.
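
The counts quoted in the two paragraphs above can be reconciled with a short arithmetic check. This sketch assumes that each excluded individual is counted in exactly one category and that n = 266 is the number removed for non-European ancestry; the dictionary labels are ours.

```python
# Sample-level exclusions, using the n values quoted in the text.
genotyped_samples = 3453
sample_exclusions = {
    "gender mismatch": 80,
    "heterozygosity": 64,
    "missingness > 5%": 60,
    "possible contamination": 3,
    "non-European ancestry": 266,
    "cryptic relatedness": 69,
}
remaining_samples = genotyped_samples - sum(sample_exclusions.values())

# SNP-level exclusions.
genotyped_snps = 535478
snp_exclusions = {
    "call rate / HWE / GenomeStudio QC": 21298,
    "duplicate SNPs": 6594,
}
remaining_snps = genotyped_snps - sum(snp_exclusions.values())

print(remaining_samples, remaining_snps)  # 2911 507586
```

Under those assumptions the totals agree with the 2,911 individuals and 507,586 SNPs reported.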

1737 putative G0 partner-G1 pairs for whom both G0 partner and G1 have called genotype data available were identified based on ALN. Given the G0 partners were invited by the G0 mother to take part and only enrolled in the study in their own right several years later, it could not be assumed that all G0 partners were biologically related to G1. Called genotype data for the 1720 unique G0 partners and 1737 unique G1s were merged (i.e. there were 17 pairs of siblings/twins among the G1 offspring), using plink v1.90b7.2 64-bit (11 Dec 2023).

After application of the plink filters --geno 0.05, --maf 0.01, --snps-only just-acgt and --autosome, 113288 SNPs remained. The --related command in KING version 2.3.2 was used to perform kinship analysis, which confirmed that all 1737 putative G0 partner-G1 pairs are genetically related. This would be expected for biological father-offspring pairs, using the inference criteria described in Table 1 of Manichaikul, Ani, et al. "Robust relationship inference in genome-wide association studies." Bioinformatics 26.22 (2010): 2867-2873.
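
As an illustration of kinship-based inference, the sketch below classifies a pair by its estimated kinship coefficient using the cut-offs commonly quoted from Manichaikul et al. (2010). It is a simplification, not the KING implementation: the function name is hypothetical, and separating parent-offspring pairs from full siblings additionally relies on the proportion of zero IBS sharing, which is not modelled here.

```python
def infer_degree(kinship):
    """Classify a pair by estimated kinship coefficient.

    Cut-offs are the commonly quoted KING inference ranges:
    > 0.354 duplicate/MZ twin, 0.177-0.354 first degree,
    0.0884-0.177 second degree, 0.0442-0.0884 third degree.
    """
    if kinship > 0.354:
        return "duplicate/MZ twin"
    if kinship > 0.177:
        return "1st-degree"
    if kinship > 0.0884:
        return "2nd-degree"
    if kinship > 0.0442:
        return "3rd-degree"
    return "unrelated"


# The expected kinship coefficient for a parent-offspring pair is 0.25.
print(infer_degree(0.25))
```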

3.2.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gwa_exome_g0p_2016-11-22_f4
name: Freeze 4 version 2016-11-22 Genome-wide - Illumina exome core array - G0 partners
description: >-
  Freeze 4 version 2016-11-22 Genome-wide array data including raw files and genotype calls for G0 partners, also including additional G0 mothers  who were absent from previous genotyping rounds
freeze_size: 289M
linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112
git_tag: https://github.com/alspac/dataset_gwa_exome_g0p/releases/tag/freeze4
is_current_freeze: true
freeze_number: 4
freeze_date: 2024-06-11
previous_freeze: alspacdcs:gwa_exome_g0p_2016-11-22_f3
freeze_of_alspac_dataset_version: alspacdcs:gwa_exome_g0p_2016-11-22
freeze_of_named_alspac_dataset: alspacdcs:gwa_exome_g0p

has_containers:
  - id: alspacdcs:67611038-8d3d-46a6-a780-4f897729568d
    name: data
    description: A dir/folder containing the plink data files

has_parts:
- id: alspacdcs:09a75379-f9a6-495d-ac9a-aa45c7eda651
  name: freeze_id
  data_distributions:
  - id: alspacdcs:ecb46381-0969-4b8e-8374-a344365f29ed
    name: freeze_id.fam
    description: >-
	A text file with no header line, and one line per sample with the following six fields:

	1. Family ID ('FID')
	2. Within-family ID ('IID'; cannot be '0')
	3. Within-family ID of father ('0' if father isn't in dataset)
	4. Within-family ID of mother ('0' if mother isn't in dataset)
	5. Sex code ('1' = male, '2' = female, '0' = unknown)
	6. Phenotype value ('1' = control, '2' = case,
	'-9'/'0'/non-numeric =
	missing data if case/control)

	Here both of the first two fields hold the full ID of the
	participant, i.e. they are not separate family and within-family IDs.
    md5sum: 5d116792f1d34a5456c4016f86a372cd
    filesize: 128KB
    filetype: .fam
    number_of_participants: 2198
    belongs_to_container: alspacdcs:67611038-8d3d-46a6-a780-4f897729568d

  - id: alspacdcs:e70210f0-b87d-4296-bcf4-b6cb2aecd798
    name: freeze_id.bim
    description: >-
      Extended variant information file accompanying a .bed binary
	genotype table. (in plink, --make-just-bim can be used to update just
	this file.) A text file with no header line, and one line per
	variant with the following six fields:

	  1.Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT';
	  '0' indicates unknown)
	  or name
	  2. Variant identifier
	  3. Position in morgans or centimorgans (safe to use dummy value of '0')
	  4. Base-pair coordinate (1-based; limited to 2^31-2)
	  5. Allele 1 (corresponding to clear bits in .bed; usually minor)
	  6. Allele 2 (corresponding to set bits in .bed; usually major)

    md5sum: 0fe43f888776059fef0a76d3f08d00ad
    filesize: 14MB
    filetype: .bim
    number_of_variants: 507586
    belongs_to_container: alspacdcs:67611038-8d3d-46a6-a780-4f897729568d

  - id: alspacdcs:7bd80533-1300-4ae5-9357-a4fb469dd676
    name: freeze_id.bed
    description: >-
      Primary representation of genotype calls at biallelic
      variants. Must be accompanied by .bim and .fam files.

    md5sum: 304b0d356880c5174806ce08d7beffd3
    filesize: 267M
    filetype: .bed
    number_of_participants: 2198
    number_of_variants: 507586
    belongs_to_container: alspacdcs:67611038-8d3d-46a6-a780-4f897729568d

  - id: alspacdcs:1ba71829-0007-42a4-ba81-bed5de7acbe9
    name: freeze_id.log
    md5sum: 8c3bc05548cfe7a95643e6db81bf30a5
    filesize: 512B
    filetype: .log
    belongs_to_container: alspacdcs:67611038-8d3d-46a6-a780-4f897729568d

  - id: alspacdcs:f1df1ea4-7317-49c6-8a79-a8da3d5c7093
    name: freeze_id.hh
    description: >-
      plink .hh file see
      https://www.cog-genomics.org/plink/1.9/formats#hh 
    md5sum: ceaaced7ab039cf3631df602a96619f7
    filesize: 8M
    filetype: .hh
    belongs_to_container: alspacdcs:67611038-8d3d-46a6-a780-4f897729568d
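
The six-field .fam and .bim layouts described in the freeze document can be read with a few lines of Python. This is a minimal sketch for whitespace-delimited lines; the record classes, function names and the example lines are hypothetical.

```python
from typing import NamedTuple


class FamRecord(NamedTuple):
    fid: str        # 1. Family ID
    iid: str        # 2. Within-family ID
    father: str     # 3. Within-family ID of father ('0' if absent)
    mother: str     # 4. Within-family ID of mother ('0' if absent)
    sex: str        # 5. Sex code ('1' male, '2' female, '0' unknown)
    phenotype: str  # 6. Phenotype value


class BimRecord(NamedTuple):
    chrom: str       # 1. Chromosome code or name
    variant_id: str  # 2. Variant identifier
    cm: float        # 3. Position in morgans or centimorgans
    bp: int          # 4. Base-pair coordinate (1-based)
    allele1: str     # 5. Allele 1 (usually minor)
    allele2: str     # 6. Allele 2 (usually major)


def parse_fam_line(line):
    """Parse one .fam line into a FamRecord; fields are kept as strings."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    return FamRecord(*fields)


def parse_bim_line(line):
    """Parse one .bim line into a BimRecord with numeric positions."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    chrom, variant_id, cm, bp, allele1, allele2 = fields
    return BimRecord(chrom, variant_id, float(cm), int(bp), allele1, allele2)


# Hypothetical example lines, not taken from any ALSPAC file.
print(parse_fam_line("X123 X123 0 0 1 -9"))
print(parse_bim_line("1 rs0000001 0 12345 A G"))
```

In this dataset the first two .fam fields both carry the full participant ID, so `fid` and `iid` would be expected to be identical.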

3.3 Genome-wide - Illumina 660 quad - G0 mothers (gwa_660_g0m)

3.3.1 Description

This dataset contains genome-wide array data including raw files and genotype calls for G0 mothers.

3.3.2 Methodology

ALSPAC mothers were genotyped using the Illumina human660W-quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs.

SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally SNPs with a minor allele frequency of less than 1% were removed. Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded.

Cryptic relatedness was assessed using an IBD estimate of more than 0.125, which is expected to correspond to roughly 12.5% alleles shared IBD or a relatedness at the first cousin level. Related subjects that passed all other quality control thresholds were retained. In total, 9,048 subjects and 526,688 SNPs passed these quality control filters.
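
The link between the 0.125 threshold and first cousins follows from the expected proportion of alleles shared IBD, which is 0.5 x P(IBD=1) + P(IBD=2). A minimal sketch using the standard expected IBD-sharing probabilities for outbred pedigrees (the function name is ours):

```python
def expected_pi_hat(p_ibd1, p_ibd2):
    """Expected proportion of alleles shared IBD: 0.5 * P(IBD=1) + P(IBD=2)."""
    return 0.5 * p_ibd1 + p_ibd2


# (P(IBD=1), P(IBD=2)) for common relationships, assuming no inbreeding.
relationships = {
    "parent-offspring": (1.0, 0.0),
    "full siblings": (0.5, 0.25),
    "half siblings": (0.5, 0.0),
    "first cousins": (0.25, 0.0),
}

for name, (p1, p2) in relationships.items():
    print(name, expected_pi_hat(p1, p2))
```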

Associated publication:

3.3.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gwa_660_g0m_2022-12-05_f4
name: Freeze 4 version 2022-12-05 Genome-wide - Illumina 660 quad - G0 mothers
description: >-
  Freeze 4 of genome-wide array data including genotype calls for G0 mothers
freeze_size: 2G
linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112
git_tag: https://github.com/alspac/dataset_gwa_660_g0m/releases/tag/freeze4
is_current_freeze: true
freeze_number: 4
freeze_date: 2024-06-11
freeze_of_alspac_dataset_version: alspacdcs:gwa_660_g0m_2022-12-05
freeze_of_named_alspac_dataset: alspacdcs:gwa_660_g0m


has_containers:
  - id: alspacdcs:8dc6326a-db1a-41c8-ba6c-58e5be88d37f
    name: data
    description: A dir/folder containing the plink data files

  - id: alspacdcs:f0a7bb41-de9d-45c0-99c3-b91785e5b946
    name: legacy1
    description: >-
      A dir/folder containing the plink data files.
      Includes full set of SNPs but is missing ~500 mothers who
      were excluded in legacy QC due to strict relatedness inclusion thresholds.
    belongs_to_container: alspacdcs:8dc6326a-db1a-41c8-ba6c-58e5be88d37f

  - id: alspacdcs:7f1bd1da-f18e-4e07-bdfd-07ebbeb3049f
    name: legacy2
    description: >-
      A dir/folder containing the plink data files.
      Includes full set of individuals but due to legacy QC is restricted
      to a set of ~480k SNPs that overlap with the Illumina 550k array
      (which was used for G1).
    belongs_to_container: alspacdcs:8dc6326a-db1a-41c8-ba6c-58e5be88d37f

has_parts:
  - id: alspacdcs:16618c28-c82b-452a-b9a5-f63e86063c15
    name: Biallelic genotype table
    description: >-
      The genetic data
    data_distributions:
    - id: alspacdcs:c0b64191-04a4-45a4-bfbb-921ac9a06755
      name: freeze_id.bed
      description: >-
	Legacy 1 plink bed file.
	Primary representation of genotype calls at biallelic
	variants. Must be accompanied by .bim and .fam files.
	The legacy1 distribution of the plink bed file.
      md5sum: be66d3cc1d3d906c4d396cc161a605b1
      filesize: 1020M
      filetype: .bed
      number_of_participants: 8118
      number_of_variants: 526688
      belongs_to_container: alspacdcs:f0a7bb41-de9d-45c0-99c3-b91785e5b946
    - id: alspacdcs:ab9b5bf0-2029-4489-8e7f-993d4370823b
      name: freeze_id.bed
      description: >-
	Legacy 2 plink bed file.
	Primary representation of genotype calls at biallelic
	variants. Must be accompanied by .bim and .fam files.
	The legacy2 distribution of the plink bed file.
      md5sum: 7559903a4811210f6289497e1323dfe7
      filesize: 961M
      filetype: .bed
      number_of_variants: 465740
      number_of_participants: 8648
      belongs_to_container: alspacdcs:7f1bd1da-f18e-4e07-bdfd-07ebbeb3049f
  - id: alspacdcs:69d8a90d-486e-45ec-a23c-325266a11ccd
    name: Variant Information 
    description: >-
      Information about genetic variants
    data_distributions:
    - id: alspacdcs:68ec2931-a8f3-4d0b-a813-8f720738334c
      name: freeze_id.bim
      description: >-
	Legacy 1
	Extended variant information file accompanying a .bed binary
	genotype table. (--make-just-bim can be used to update just
	this file.) A text file with no header line, and one line per
	variant with the following six fields:

	  1.Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT';
	  '0' indicates unknown)
	  or name
	  2. Variant identifier
	  3. Position in morgans or centimorgans (safe to use dummy value of '0')
	  4. Base-pair coordinate (1-based; limited to 2^31-2)
	  5. Allele 1 (corresponding to clear bits in .bed; usually minor)
	  6. Allele 2 (corresponding to set bits in .bed; usually major)

      md5sum: be66d3cc1d3d906c4d396cc161a605b1
      filesize: 14M
      filetype: .bim
      number_of_variants: 526688
      belongs_to_container: alspacdcs:f0a7bb41-de9d-45c0-99c3-b91785e5b946
    - id: alspacdcs:fc002279-f815-4d8e-af7d-f6094c1f3be6
      name: freeze_id.bim
      description: >-
	Legacy 2 
	Extended variant information file accompanying a .bed binary
	genotype table. (--make-just-bim can be used to update just
	this file.) A text file with no header line, and one line per
	variant with the following six fields:

	  1.Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT';
	  '0' indicates unknown)
	  or name
	  2. Variant identifier
	  3. Position in morgans or centimorgans (safe to use dummy value of '0')
	  4. Base-pair coordinate (1-based; limited to 2^31-2)
	  5. Allele 1 (corresponding to clear bits in .bed; usually minor)
	  6. Allele 2 (corresponding to set bits in .bed; usually major)

      md5sum: b4a1adb225de05d92d0af585950fd423
      filesize: 13M
      filetype: .bim
      number_of_variants: 465740
      belongs_to_container: alspacdcs:7f1bd1da-f18e-4e07-bdfd-07ebbeb3049f
  - id: alspacdcs:b82ab933-6a04-4932-9d4d-19df2b4e0391
    name:  Sample information
    description: >-
      Information about the samples for the dataset
    data_distributions:
    - id: alspacdcs:6b63f49c-2935-4322-b376-b524802e8649
      name: freeze_id.fam
      description: >-
	legacy 1

	A text file with no header line, and one line per sample with the following six fields:

	1. Family ID ('FID')
	2. Within-family ID ('IID'; cannot be '0')
	3. Within-family ID of father ('0' if father isn't in dataset)
	4. Within-family ID of mother ('0' if mother isn't in dataset)
	5. Sex code ('1' = male, '2' = female, '0' = unknown)
	6. Phenotype value ('1' = control, '2' = case,
	'-9'/'0'/non-numeric = missing data if case/control)

      md5sum: 68019c4b1907d320c9ba4e5e3b4343f8
      filesize: 256K
      filetype: .fam
      number_of_participants: 8118
      belongs_to_container: alspacdcs:f0a7bb41-de9d-45c0-99c3-b91785e5b946
    - id: alspacdcs:5edc9078-1a38-4e75-8320-110cfd4195b8
      name: freeze_id.fam
      description: >-
	legacy2

	A text file with no header line, and one line per sample with the following six fields:

	1. Family ID ('FID')
	2. Within-family ID ('IID'; cannot be '0')
	3. Within-family ID of father ('0' if father isn't in dataset)
	4. Within-family ID of mother ('0' if mother isn't in dataset)
	5. Sex code ('1' = male, '2' = female, '0' = unknown)
	6. Phenotype value ('1' = control, '2' = case,
	'-9'/'0'/non-numeric = missing data if case/control)

      md5sum: ca78d5b8f96df516a7af3862de6ba8f6
      filesize: 448k
      filetype: .fam
      number_of_participants: 8648
      belongs_to_container: alspacdcs:7f1bd1da-f18e-4e07-bdfd-07ebbeb3049f 
  - id: alspacdcs:93b5d2be-8e64-4770-af96-284093f0e508
    name:  Log information
    description: >-
      Information about the plink run for making the dataset
    data_distributions:
    - id: alspacdcs:28da8b88-e68a-4d5b-be09-9a545b427c48
      name: freeze_id.log
      description: >-
	legacy 1 plink log file
      md5sum: 5adb293d1f0c0312b90ef3ab79c567b2
      filesize: 512
      filetype: .log
      belongs_to_container: alspacdcs:f0a7bb41-de9d-45c0-99c3-b91785e5b946
    - id: alspacdcs:24fef9cb-60d3-4a0f-93fc-86a2e01e53c5
      name: freeze_id.log
      description: >-
	legacy 2 plink log file
      md5sum: c713ee6c86477fbd29f329689005fc53
      filesize: 512
      filetype: .log
      belongs_to_container: alspacdcs:f0a7bb41-de9d-45c0-99c3-b91785e5b946

3.4 Genome-wide - CNV - G1 (cnv_550_g1)

3.4.1 Description

This dataset contains predicted ALSPAC CNVs using PennCNV, generated from 23andMe raw genotype data.

3.4.2 Methodology

LRR and BAF data were missing from the 23andMe raw genotype data, so we had to generate these data ourselves using an in-house algorithm. Once these data were generated, we ran PennCNV using the hh550 libraries.

These are filtered PennCNV calls. Multiple calls were merged using the 'clean_cnv.pl' script, using a merge fraction of 0.5. Individuals with > 30 CNVs, a Log R Ratio SD of > 0.3, a BAF drift of > 0.002, and a waviness factor of > 0.05 were removed. CNVs in which at least 50% of the length of the CNV call overlapped with any telomeric, centromeric, or immunoglobulin regions were removed using the 'scan_region.pl' script in PennCNV.

In addition, CNVs covering fewer than 5 probes, of a length < 5kb, and with a confidence score of below 10 were removed. Density was calculated as the number of probes in a CNV divided by the length of the CNV, and CNVs where the density of probes across the call was < 1 probe per 20kb were removed.

These QC parameters are suggestions only and the resulting calls are provided in filtered.cnv. Analysts can apply their own filter parameters to the raw calls in data.cnv.
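
Analysts who want to vary the per-CNV thresholds could start from a sketch like the one below. It assumes the probe count, length and confidence score have already been parsed into numbers, and it treats each criterion as an independent reason for removal, which is our reading of the text. The function and parameter names are hypothetical.

```python
def cnv_passes_filters(n_probes, length_bp, confidence,
                       min_probes=5, min_length_bp=5000,
                       min_confidence=10, max_bp_per_probe=20000):
    """Apply the suggested per-CNV filters; all thresholds are adjustable."""
    if n_probes < min_probes:
        return False
    if length_bp < min_length_bp:
        return False
    if confidence < min_confidence:
        return False
    # Density below 1 probe per 20 kb means the call spans more than
    # 20,000 bp for each probe it contains (integer form avoids rounding).
    if n_probes * max_bp_per_probe < length_bp:
        return False
    return True


print(cnv_passes_filters(n_probes=10, length_bp=50000, confidence=20))
print(cnv_passes_filters(n_probes=5, length_bp=200000, confidence=20))
```

The individual-level filters (number of CNVs, Log R Ratio SD, BAF drift, waviness factor) are not covered by this sketch.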

3.4.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:cnv_550_g1_2015-11-09_f4
name: Genome-wide - CNV - G1 release version 2015-11-09 freeze 4
description: >-
  This is the fourth freeze of the 2015-11-09 version of
  cnv_550_g1 dataset.
  It contains two csv versions of the cnv called data, the unfiltered
  and filtered versions.
freeze_size: 27m
linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112
git_tag: https://github.com/alspac/dataset_cnv_550_g1/releases/tag/freeze4
is_current_freeze: true
freeze_number: 4
freeze_date: 2024-06-11
previous_freeze: alspacdcs:cnv_550_g1_2015-11-09_f3
freeze_of_alspac_dataset_version: alspacdcs:cnv_550_g1_2015-11-09
freeze_of_named_alspac_dataset: alspacdcs:cnv_550_g1
has_parts:
  - id: alspacdcs:061d8035-d73e-4211-afdd-8dbb67a96d20_cnv_550_g1_2015-11-09_cnvdata_f4
    name: Unfiltered CNV data
    description: >- 
      This is the output of Penncnv before filtering.
      columns
	V1 - Position
	V2 - Number of markers in the region
	V3 - CNV length
	V4 - Copy number estimate
	V6 - Start SNP
	V7 - End SNP
	V8 - Confidence score
	qlet - within pregnancy ID
	cnv_550_g1 - Individual ID
    data_distributions:
      - id: alspacdcs:4e23b21840c200f56b1b5ccf227a6e59_new_cnvdata.csv
	name: new_cnvdata.csv
	description: >- 
	  This is the csv file for the output of Penncnv before filtering.
	md5sum: 4e23b21840c200f56b1b5ccf227a6e59
	filesize: 21M
	filetype: .csv
	number_of_participants: 7449  #data$id_qlet <- paste(data$cnv_550_g1, data$qlet, sep="_")
	#length(unique(data$id_qlet))
	number_of_cnv_variants: 70029 # Read file into R as data then:
	# dim(unique(data[1]))
	belongs_to_container: alspacdcs:dcf3e9e0-216b-4c2b-b3e1-ace2690c31bc

  - id: alspacdcs:cnv_550_g1_2015-11-09_filtered_f4
    name: Filtered CNV data
    description: >-
      CNV data that has been filtered.
      columns
	V1 - Position
	V2 - Number of markers in the region
	V3 - CNV length
	V4 - Copy number estimate
	V6 - Start SNP
	V7 - End SNP
	V8 - Confidence score
	qlet - within pregnancy ID
	cnv_550_g1 - Individual ID
    data_distributions:
      - id: alspacdcs:f825f62ec1cd49b8c2a059b4c5f6f13a_new_filtered.csv
	name: new_filtered.csv
	description: >-
	  This is the csv file for the output of Penncnv after filtering.
	md5sum: f825f62ec1cd49b8c2a059b4c5f6f13a
	filesize: 5.9M
	filetype: .csv
	number_of_participants: 6792 # Read into data 2 in r
	# data2$id_qlet <- paste(data2$cnv_550_g1, data2$qlet, sep="_") and length(unique(data2$id_qlet))
	number_of_cnv_variants: 14244 #Read into data2 in r then
	#length(unique(data2$V1))
	belongs_to_container: alspacdcs:dcf3e9e0-216b-4c2b-b3e1-ace2690c31bc

has_containers:
  - id: alspacdcs:dcf3e9e0-216b-4c2b-b3e1-ace2690c31bc ## uuid
    name: data
    description: A dir/folder containing the two freeze data files
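
The R snippets in the comments above count unique participants (individual ID combined with the within-pregnancy ID) and unique CNV positions. An equivalent Python sketch is given below; it assumes the CSV has a header row with the column names `V1`, `cnv_550_g1` and `qlet` listed in the freeze document, and the example rows are invented.

```python
import csv
import io


def count_participants_and_cnvs(handle):
    """Count unique (individual ID, qlet) pairs and unique V1 positions."""
    participants = set()
    positions = set()
    for row in csv.DictReader(handle):
        participants.add(f"{row['cnv_550_g1']}_{row['qlet']}")
        positions.add(row["V1"])
    return len(participants), len(positions)


# Invented example rows standing in for new_cnvdata.csv.
example = io.StringIO(
    "V1,cnv_550_g1,qlet\n"
    "chr1:100-200,1001,A\n"
    "chr1:100-200,1002,A\n"
    "chr2:300-400,1001,A\n"
    "chr3:1-50,1001,B\n"
)
print(count_participants_and_cnvs(example))  # (3, 3)
```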

4 Imputed Data

4.1 Genome-wide - HRC imputed - G0 mothers + G1 (gi_hrc_g0m_g1)

SNP chips are useful for the generation of data on hundreds of thousands of SNPs, but there are millions more polymorphisms that remain untyped with this technology. If suitable numbers of whole genome sequences exist (e.g. 1000 genomes data) then millions of genotypes that are missing from a sample because they have not been typed by SNP chips can be imputed using probabilistic methods. Here the ALSPAC mother and child data were imputed to a new reference panel known as the Haplotype Reference Consortium (HRC) panel. This comprises around 31,000 sequenced individuals (mostly European), so the coverage of European haplotypes is much greater than in other panels. As a consequence, imputation accuracy is expected to improve, particularly at lower frequencies.

4.1.1 Description

This dataset contains genotype data imputed to HRC for G0 mothers and G1. Reference genome build: GRCh37

4.1.2 Methodology

ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platforms by 23andMe, subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%); and insufficient sample replication (IBD < 0.8).

Population stratification was assessed by multidimensional scaling analysis and compared with Hapmap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed.

SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed. Cryptic relatedness was measured as proportion of identity by descent (IBD > 0.1).

Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,115 subjects and 500,527 SNPs passed these quality control filters.

ALSPAC mothers were genotyped using the Illumina human660W-quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs. SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally SNPs with a minor allele frequency of less than 1% were removed.

Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded.

Cryptic relatedness was assessed using an IBD estimate of more than 0.125, which is expected to correspond to roughly 12.5% alleles shared IBD or a relatedness at the first cousin level. Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,048 subjects and 526,688 SNPs passed these quality control filters.

We combined 477,482 SNP genotypes in common between the sample of mothers and the sample of children. We removed SNPs with genotype missingness above 1% due to poor quality (11,396 SNPs removed) and removed a further 321 subjects due to potential ID mismatches. This resulted in a dataset of 17,842 subjects containing 6,305 duos and 465,740 SNPs (112 were removed during liftover and 234 were out of HWE after combination). We estimated haplotypes using SHAPEIT (v2.r644), which utilises relatedness during phasing. The phased haplotypes were then imputed to the Haplotype Reference Consortium (HRCr1.1, 2016) panel of approximately 31,000 phased whole genomes. The HRC panel was phased using SHAPEIT v2.r727, and the imputation was performed using the Michigan imputation server.
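
The counts in this paragraph can be reconciled with a short tally. The sketch assumes that the liftover and HWE removals were applied in addition to the missingness filter, and that the starting subject count is the sum of the children and mothers who passed their respective quality control; the text does not state this explicitly, but the figures agree under that reading.

```python
# SNP tally, using the figures quoted in the text.
common_snps = 477_482          # SNPs shared by the mothers' and children's arrays
removed_missingness = 11_396   # genotype missingness above 1%
removed_liftover = 112         # lost during liftover
removed_hwe = 234              # out of HWE after combination

final_snps = common_snps - removed_missingness - removed_liftover - removed_hwe

# Subject tally (assumption: start from both QC-passed samples combined).
children_passing_qc = 9_115
mothers_passing_qc = 9_048
removed_id_mismatch = 321

final_subjects = children_passing_qc + mothers_passing_qc - removed_id_mismatch

print(final_snps, final_subjects)
```

Both totals match the 465,740 SNPs and 17,842 subjects reported for the combined dataset.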

4.1.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gi_hrc_g0m_g1_2017-05-04_f4
name: >-
  Genome-wide - HRC imputed - G0 mothers + G1 version 2017-05-04
  freeze 4
description: >-
  Freeze 4 of version 2017-05-04 Genome-wide array data imputed to the HRC reference panel for G0 mothers and G1 individuals in bgen and sample file format (version 1.2). 
freeze_size: 114G
linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112
git_tag: https://github.com/alspac/dataset_gi_hrc_g0m_g1/releases/tag/freeze4
is_current_freeze: true
freeze_number: 4
freeze_date: 2024-06-11
previous_freeze: alspacdcs:gi_hrc_g0m_g1_2017-05-04_f3
freeze_of_alspac_dataset_version: alspacdcs:gi_hrc_g0m_g1_2017-05-04
freeze_of_named_alspac_dataset: alspacdcs:gi_hrc_g0m_g1

has_containers:
  - id: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7 ## uuid
    name: data
    description: A dir/folder containing the freeze data bgen and .sample files

has_parts:
  - id: alspacdcs:78966822-c2fc-4f49-bc12-bbe40aa2ba75
    name: Omics ID sample
    data_distributions:
    - id: alspacdcs:15631c02-08be-4bfb-add8-e936e6bd9ed3
      name: swapped.sample
      md5sum: 33c8b6168dee47c563cec5abf124a672
      filesize: 1008K
      filetype: .sample
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

    - id: alspacdcs:aea0183c-4283-478e-8a11-141ff0629c89
      name: swapped_23_female.sample
      md5sum: 7606183e5b5195182c1e9ef61d88d1d3
      filesize: 752K
      filetype: .sample
      number_of_participants: 12943
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

    - id: alspacdcs:7e9a6c3a-f821-46e5-860e-fd7a38af98d3
      name: swapped_23_male.sample
      md5sum: 34540d02f1271a8c99785989c8888496
      filesize: 272K
      filetype: .sample
      number_of_participants: 4501
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:b2de2079-e524-461b-ab28-7ea88a2bd885
    name: filtered_01
    data_distributions:
    - id: alspacdcs:f3c493d3-7e7a-4df7-b4bc-e4df38ca5fa8
      name: filtered_01.bgen
      md5sum: 9727306a156ab88f72dedbdcaffc1105
      filesize: 8.6GB
      filetype: .bgen
      number_of_variants: 3069932
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:eb48fc81-4231-43f6-afd8-a763095ef049
    name: filtered_02
    data_distributions:
    - id: alspacdcs:aaff25c6-0480-4214-b5a3-905528db1e89
      name: filtered_02.bgen
      md5sum: a8cb970994e21c02eceea92a513ebef6
      filesize: 8.7GB
      filetype: .bgen
      number_of_variants: 3392238
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:3bb918e9-dab9-47a9-a6e0-2e1a3206b96a
    name: filtered_03
    data_distributions:
    - id: alspacdcs:27acaf2b-3867-44b7-b16e-c1f5114893c2
      name: filtered_03.bgen
      md5sum: 7e1586647816f4607b9e528be4893b5c
      filesize: 7.3GB
      filetype: .bgen
      number_of_variants: 2821895
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:de9a35f5-3660-46fb-80d8-971122992ee6
    name: filtered_04
    data_distributions:
    - id: alspacdcs:8a62a054-ef47-4ef6-87a2-305213007c74
      name: filtered_04.bgen
      md5sum: 9bb513a014c18a3a0a1ea11dcf63cc1b
      filesize: 7.9GB
      filetype: .bgen
      number_of_variants: 2787582
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:f910f928-969d-4abd-8393-3c175edd58e2
    name: filtered_05
    data_distributions:
    - id: alspacdcs:9cdc5b2b-65b5-4b6e-b2b4-2e884e506ced
      name: filtered_05.bgen
      md5sum: 92a2d759a5bcc18d0134dc7802302055
      filesize: 6.7GB
      filetype: .bgen
      number_of_variants: 2588170
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:0f9acc14-28f6-42ed-8dc1-0923ae231574
    name: filtered_06
    data_distributions:
    - id: alspacdcs:95969370-c593-4de2-a05a-68eb17a85293
      name: filtered_06.bgen
      md5sum: 5f68a69cd54a89b8db5577711f2a7934
      filesize: 6.4GB
      filetype: .bgen
      number_of_variants: 2460112
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:2bc64572-7dd5-4c0c-93f9-30fd598cd6c9
    name: filtered_07
    data_distributions:
    - id: alspacdcs:bf564b40-37d7-4654-a996-590212863971
      name: filtered_07.bgen
      md5sum: cd02eefdb350d9859ea7a5975d5ee73a
      filesize: 6.7GB
      filetype: .bgen
      number_of_variants: 2289306
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:b0b446de-842c-43cf-add6-4957e17b16e2
    name: filtered_08
    data_distributions:
    - id: alspacdcs:a08d5fec-f380-4c08-b366-196ad509439b
      name: filtered_08.bgen
      md5sum: 68b4ea416441637c01ebcc1c2e9ac8cf
      filesize: 5.7GB
      filetype: .bgen
      number_of_variants: 2242706
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:7722d481-eb31-4678-ab4b-d32fdefd1ebb
    name: filtered_09
    data_distributions:
    - id: alspacdcs:c3b256b2-6422-4970-86cc-3108c80c7d2a
      name: filtered_09.bgen
      md5sum: a262516e4a9c48fe2b7edfb68a0f0577
      filesize: 4.5GB
      filetype: .bgen
      number_of_variants: 1675899
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:e3ac74ff-38ab-42af-8448-1228a67cd02c
    name: filtered_10
    data_distributions:
    - id: alspacdcs:f6fdbefc-4df6-4c44-8bc2-adb96a699501
      name: filtered_10.bgen
      md5sum: 659c1e9b8c9500aa02b84d8a121e4a23
      filesize: 5.2GB
      filetype: .bgen
      number_of_variants: 1927504
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:9797bc6d-6cb3-4459-8f40-f0d3c32c07db
    name: filtered_11
    data_distributions:
    - id: alspacdcs:8516c34a-8803-40fe-a8e7-d0183d7fcb67
      name: filtered_11.bgen
      md5sum: 94ae65053c6cb28ffa5413a447bea2a7
      filesize: 5.3GB
      filetype: .bgen
      number_of_variants: 1936990
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:38b72245-748f-4eb3-b1d5-9b0548141454
    name: filtered_12
    data_distributions:
    - id: alspacdcs:da40bb2d-2b8f-449f-b98d-c66c720009c7
      name: filtered_12.bgen
      md5sum: 5e488efe1865265b70f0db0ba0e8ceb2
      filesize: 5.1GB
      filetype: .bgen
      number_of_variants: 1848118
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:448ee503-3596-432b-b6b7-71d09895c3db
    name: filtered_13
    data_distributions:
    - id: alspacdcs:87f60018-85ed-41ce-97f4-1cabc4f3b825
      name: filtered_13.bgen
      md5sum: c6d8c39e1714020ef24236ce0e0e65f4
      filesize: 3.7GB
      filetype: .bgen
      number_of_variants: 1385434
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:0469011c-c9fa-47c6-a18e-81d126b36a91
    name: filtered_14
    data_distributions:
    - id: alspacdcs:fbf207d1-5a86-44da-9acf-e78a139e8455
      name: filtered_14.bgen
      md5sum: a7ceaec0d5986e1396214bbc4a8bcfb5
      filesize: 3.6GB
      filetype: .bgen
      number_of_variants: 1266536
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:1e7c4590-fcb9-402a-96ab-f0a08ca31457
    name: filtered_15
    data_distributions:
    - id: alspacdcs:9527cff8-33fe-47f6-a2cc-456b0391c3c4
      name: filtered_15.bgen
      md5sum: 30a19dcda6047a6ac690d650ee5fea8c
      filesize: 3.4GB
      filetype: .bgen
      number_of_variants: 1139215
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:c9ad086a-611c-4119-b93c-25427598c3ad
    name: filtered_16
    data_distributions:
    - id: alspacdcs:94542142-6f87-4e44-90cb-c04936e5114e
      name: filtered_16.bgen
      md5sum: d4ffb3324217ec7ac9e3716ae3de9106
      filesize: 4.1GB
      filetype: .bgen
      number_of_variants: 1281298
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:0dbf46b8-da95-4362-b66d-487bd4e9923d
    name: filtered_17
    data_distributions:
    - id: alspacdcs:efefc34b-18c5-43fc-8b8e-ec9af2d343ab
      name: filtered_17.bgen
      md5sum: a0baaf8155e3e97ee33d440035877a96
      filesize: 3.6GB
      filetype: .bgen
      number_of_variants: 1090072
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:f4b5f154-4e6f-44ba-885b-39cb52a77df5
    name: filtered_18
    data_distributions:
    - id: alspacdcs:9719bf81-5bac-4f53-8fa8-13df06907351
      name: filtered_18.bgen
      md5sum: 1236c268dfab2d46148835e50efcec5d
      filesize: 3.2GB
      filetype: .bgen
      number_of_variants: 1104755
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:f4c9f93b-3b98-4d1b-80db-ae24d35bbf25
    name: filtered_19
    data_distributions:
    - id: alspacdcs:13a48032-774b-4db4-a57e-ffbd9bdfb540
      name: filtered_19.bgen
      md5sum: 1c17198a8d5a7be881d671559048d073
      filesize: 3.5GB
      filetype: .bgen
      number_of_variants: 868554
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:56ba0748-364c-4013-95bd-50e0f9f4d6ca
    name: filtered_20
    data_distributions:
    - id: alspacdcs:b1d608b5-ed9c-4ca7-838c-89c8507e0bf9
      name: filtered_20.bgen
      md5sum: 336791734294796bcc5c725048756155
      filesize: 2.6GB
      filetype: .bgen
      number_of_variants: 884983
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:a610277c-3364-474f-bb15-349555976465
    name: filtered_21
    data_distributions:
    - id: alspacdcs:13d0df10-9e76-4829-9b83-e2ce4a95dded
      name: filtered_21.bgen
      md5sum: d97d780938173eb14c5c1aae66e1005e
      filesize: GB
      filetype: .bgen
      number_of_variants: 531276
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:db30e9f1-4441-4bc7-b867-873b04e87481
    name: filtered_22
    data_distributions:
    - id: alspacdcs:48c0dcf4-a3e1-442f-8d6c-9dbc7c5b5af3
      name: filtered_22.bgen
      md5sum: 343581eebfe7e38242db0c8b019c2264
      filesize: 1.8GB
      filetype: .bgen
      number_of_variants: 524544
      number_of_participants: 17444
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:00b3d2af-7c46-41a3-b50c-f17dfa8e1e61
    name: filtered_23female
    data_distributions:
    - id: alspacdcs:9e15dbd4-27fd-471b-ac73-0d5bae9649e6
      name: filtered_23female.bgen
      md5sum: d4abdc0d84bda1f8a3eec5c9cee8977b
      filesize: 4.2GB
      filetype: .bgen
      number_of_variants: 1228035
      number_of_participants: 12943
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7

  - id: alspacdcs:347b682b-9708-454b-9f5f-86deec277d19
    name: filtered_23male
    data_distributions:
    - id: alspacdcs:d9282f36-6915-46c3-bda9-ff9ab6c8c56c
      name: filtered_23male.bgen
      md5sum: bebe6967a0489a186166d61cd1b07a18
      filesize: 1.3GB
      filetype: .bgen
      number_of_variants: 1228035
      number_of_participants: 4501
      belongs_to_container: alspacdcs:28e3078c-ab02-4fb1-99b2-504e143c8fa7
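
The `md5sum` values in a freeze document allow recipients to confirm that files arrived intact. The following is a minimal sketch of such a check, assuming the expected checksums have already been read from the freeze YAML into a plain dictionary; the helper names `md5sum` and `verify_freeze` and the demo file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_freeze(directory: Path, expected: dict) -> dict:
    """Map each file name to True if it exists and its checksum matches."""
    results = {}
    for name, checksum in expected.items():
        path = directory / name
        results[name] = path.is_file() and md5sum(path) == checksum
    return results


# Demonstration on a throwaway file rather than real freeze data.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    payload = b"example contents"
    (data_dir / "demo.sample").write_bytes(payload)
    expected = {
        "demo.sample": hashlib.md5(payload).hexdigest(),
        "missing.bgen": "0" * 32,  # file deliberately absent
    }
    results = verify_freeze(data_dir, expected)

print(results)
```

Chunked reading keeps memory use constant, which matters for multi-gigabyte .bgen files.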

4.2 Genome-wide - HapMap2 imputed - G1 (gi_hapmap2_g1)

4.2.1 Description

This dataset contains genotype data imputed to HapMap 2 for G1. Reference genome build: GRCh36

4.2.2 Methodology

A total of 9912 subjects were genotyped using the Illumina HumanHap550 quad genome-wide SNP genotyping platform by 23andMe, subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, USA.

Individuals were excluded from further analysis on the basis of having incorrect gender assignments; minimal or excessive heterozygosity (<0.320 or >0.345 for the Sanger data and <0.310 or >0.330 for the LabCorp data); disproportionate levels of individual missingness (>3%); evidence of cryptic relatedness (>10% IBD); and being of non-European ancestry (as detected by a multidimensional scaling analysis seeded with HapMap 2 individuals). EIGENSTRAT analysis revealed no additional obvious population stratification, and genome-wide analyses with other phenotypes indicate a low lambda. The resulting data set consisted of 8365 individuals (84% of those genotyped).

SNPs with a minor allele frequency of <1% or a call rate of <95% were removed. Furthermore, only SNPs which passed an exact test of Hardy-Weinberg equilibrium (P > 5 × 10⁻⁷) were considered for analysis. Genotypes were subsequently imputed with MACH 1.0.16 Markov Chain Haplotyping software, using CEPH individuals from phase 2 of the HapMap project as a reference set (release 22).
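
The sample-level exclusions above use different heterozygosity bounds for each genotyping centre. A minimal sketch of that rule follows, covering only the heterozygosity and missingness criteria. The function name and the example values are hypothetical; the bounds, the 3% missingness limit and the subject counts are taken from the text.

```python
# Accepted heterozygosity range per genotyping centre, from the text.
HET_BOUNDS = {
    "Sanger": (0.320, 0.345),
    "LabCorp": (0.310, 0.330),
}
MAX_MISSINGNESS = 0.03  # individuals with more than 3% missing genotypes are excluded


def passes_sample_qc(centre: str, heterozygosity: float, missingness: float) -> bool:
    """Apply the centre-specific heterozygosity bounds and the missingness limit."""
    low, high = HET_BOUNDS[centre]
    return low <= heterozygosity <= high and missingness <= MAX_MISSINGNESS


# The same heterozygosity can pass at one centre and fail at the other.
print(passes_sample_qc("Sanger", 0.335, 0.01))   # inside the Sanger range
print(passes_sample_qc("LabCorp", 0.335, 0.01))  # above the LabCorp upper bound

# Retention reported in the text: 8365 of 9912 genotyped subjects.
retained_pct = round(100 * 8365 / 9912)
print(retained_pct)
```

The retention figure rounds to the 84% quoted in the text.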

Associated publication:

4.2.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gi_hapmap2_g1_2022-12-07_f4
name: Genome-wide - HapMap2 imputed - G1 version 2022-12-07 freeze 4
description: >-
  Freeze 4 of 2022-12-07 version of Genome-wide array data imputed to the HapMap2 reference panel for G1 individuals

freeze_size: 5G
linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112
git_tag: https://github.com/alspac/dataset_gi_hapmap2_g1/releases/tag/freeze4
is_current_freeze: true
freeze_number: 4
freeze_date: 2024-06-11
previous_freeze: alspacdcs:gi_hapmap2_g1_2022-12-07_f3
freeze_of_alspac_dataset_version: alspacdcs:gi_hapmap2_g1_2022-12-07
freeze_of_named_alspac_dataset: alspacdcs:gi_hapmap2_g1


has_containers:

  - id: alspacdcs:5595c2d9-feb4-4eb1-918c-ff3579ce6e21
    name: data
    description: A dir/folder containing the plink freeze data files


has_parts:
  - id: alspacdcs:5d8b20a5-b2d3-4d3b-a02e-fe865810dd92
    name: freeze_id
    data_distributions:
    - id: alspacdcs:42cb2bae-f94e-4a75-8025-d954db951d0d
      name: freeze_id.fam
      md5sum: 6b5ddc58729fdb5997fd0004e9ae8055
      filesize: 288KB
      filetype: .fam
      number_of_participants: 8223
      belongs_to_container: alspacdcs:5595c2d9-feb4-4eb1-918c-ff3579ce6e21
  - id: alspacdcs:f7582580-52ce-44d5-9572-ffd8b7fa0391
    name: freeze_id
    data_distributions:
    - id: alspacdcs:dc2941cc-4329-4f28-9d27-a0b23d8dcf53
      name: freeze_id.bim
      md5sum: a1ebaaf6286af5b12f4561b380cd302a
      filesize: 68MB
      filetype: .bim
      number_of_variants: 2543887
      belongs_to_container: alspacdcs:5595c2d9-feb4-4eb1-918c-ff3579ce6e21
  - id: alspacdcs:2a8a7c04-c771-4c80-8e53-32012bcf6cbe
    name: freeze_id
    data_distributions:
    - id: alspacdcs:4c4667a7-2c3b-4c61-a645-8fd398674a47
      name: freeze_id.bed
      md5sum: c1b6c00b67513aef2147d6d507c4d1be
      filesize: 4.9GB
      filetype: .bed
      number_of_variants: 2543887
      number_of_participants: 8223
      belongs_to_container: alspacdcs:5595c2d9-feb4-4eb1-918c-ff3579ce6e21
  - id: alspacdcs:ac4c4c8c-1da4-4c26-9c4d-3bdea467d5b7
    name: freeze_id
    data_distributions:
    - id: alspacdcs:ce5bf2cc-3a1d-479a-97ef-dcce860b9eda
      name: freeze_id.log
      md5sum: 6ebb804e83f17af2bcca0dfb7f143f56
      filesize: 958B
      filetype: .log
      belongs_to_container: alspacdcs:5595c2d9-feb4-4eb1-918c-ff3579ce6e21
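
As a consistency check, the size of a PLINK .bed file can be predicted from the participant and variant counts listed above. The sketch assumes the standard variant-major PLINK 1 binary layout (3 header bytes, then 2 bits per genotype with each variant padded to a whole byte) and assumes the catalogue reports sizes in binary units; neither assumption is stated in the freeze document.

```python
import math


def expected_bed_bytes(n_samples: int, n_variants: int) -> int:
    """Size of a variant-major PLINK 1 .bed file in bytes."""
    bytes_per_variant = math.ceil(n_samples / 4)  # 4 genotypes per byte
    return 3 + n_variants * bytes_per_variant     # 3 magic/header bytes


# Counts listed for freeze_id.bed in the freeze document above.
size_bytes = expected_bed_bytes(n_samples=8223, n_variants=2543887)
size_gib = size_bytes / 2**30

print(size_bytes, round(size_gib, 2))
```

The result is about 4.87 GiB, in line with the 4.9GB listed for `freeze_id.bed`.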

4.3 Genome-wide - HapMap2 imputed - G0 mothers (gi_hapmap2_g0m)

4.3.1 Description

This dataset contains genotype data imputed to HapMap 2 for G0 mothers. Reference genome build: GRCh36

4.3.2 Methodology

A total of 10 015 women (mothers from the ALSPAC cohort) were genotyped using the Illumina 660 quad SNP chip which contains 557 124 SNP markers. Markers with minor allele frequency < 1%, SNPs with >5% missing genotypes and any markers that failed an exact test of Hardy-Weinberg equilibrium (P < 1 × 10⁻⁶) were excluded from further analyses. Genome-wide identity by state sharing was calculated for each pair of individuals in the cohort to identify cryptic relatedness.

In order to identify individuals who might have ancestries other than Western European, we merged data from both cohorts with the 60 western European (CEU) founder, 60 Nigerian (YRI) founder and 90 Japanese (JPT) and Han Chinese (CHB) individuals from the International HapMap Project. Genome-wide IBS distances for each pair of individuals were calculated on markers shared between the HapMap and the Illumina 660K SNP chip, and then the multidimensional scaling option in R was used to generate a two-dimensional plot based upon individuals' scores on the first two principal coordinates from this analysis. Samples that did not cluster with the CEU individuals were excluded from subsequent analyses. In addition, we plotted the proportion of missing data for each individual against their genome-wide heterozygosity. Any individual who did not cluster with the others was removed from further analyses. Samples were also excluded from analyses in the case of excessive missingness (>5%) or unusual genome-wide or X chromosome heterozygosity, and one individual from each pair of putatively related individuals (genome-wide IBD >10%) was also excluded. After data cleaning, 8340 individuals and 526 688 SNPs were left in the genome-wide data set.

We then conducted imputation using the MACH Markov Chain Haplotyping software with CEU individuals from phase 2 of the HapMap project as a reference set (release 22). The final imputed data set consisted of 8340 individuals, each with 2 594 390 imputed markers. Only imputed genotypes with minor allele frequencies ≥1% and R-sqr ≥0.3 were considered for association. Of these 8340 with genetic data, 2874 mothers also had phenotype data available.

Associated publication:

4.3.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gi_hapmap2_g0m_2022-12-07_f4
name: Genome-wide - HapMap2 imputed - G0 mothers version 2022-12-07 freeze 4
description: >-
  Version 2022-12-07 freeze 4 of Genome-wide array data imputed to the HapMap2 reference panel for G0 mothers.
  The number of variants & individuals within each plink file set can be viewed within the log file.
freeze_size: 4.9G
linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112
git_tag: https://github.com/alspac/dataset_gi_hapmap2_g0m/releases/tag/freeze4
is_current_freeze: true
freeze_number: 4
freeze_date: 2024-06-11
previous_freeze: alspacdcs:gi_hapmap2_g0m_2022-12-07_f3
freeze_of_alspac_dataset_version: alspacdcs:gi_hapmap2_g0m_2022-12-07
freeze_of_named_alspac_dataset: alspacdcs:gi_hapmap2_g0m


has_containers:
  - id: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c ## uuid
    name: plink
    description: A dir/folder containing the plink freeze data files. There are 8123 individuals within this dataset. 


has_parts:
  - id: alspacdcs:19bda7bc-6720-459b-bd0d-dbc0d6f2655f
    name: freeze_id_chr19
    data_distributions:
    - id: alspacdcs:c412635e-67de-4714-9d8f-429bfa6fcae8
      name: freeze_id_chr19.bim
      md5sum: c6fce7e15e198304f752ccbce66299b9
      filesize: 1012.3KB
      filetype: .bim
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:840a92b2-f571-43b5-ad97-d79e77bd19af
      name: freeze_id_chr19.fam
      md5sum: c9fc6b68df21dc6f2b433fdcc052aa14
      filesize: 277.5KB
      filetype: .fam
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:0e19015a-6693-47cb-bfa8-c85d712ec1c0
      name: freeze_id_chr19.log
      md5sum: 84b19267a3dfa1641aba676a0e5eb3e0
      filesize: 975.0B
      filetype: .log
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:4653442b-45a9-403e-9d3b-7199b15bfa3c
      name: freeze_id_chr19.bed
      md5sum: 801ccb3bb64dddaabfc2b7a4a1e4c5b0
      filesize: 71.7MB
      filetype: .bed
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
  - id: alspacdcs:68b5b829-8926-448a-96ba-577aecab4471
    name: freeze_id_chr15
    data_distributions:
    - id: alspacdcs:9819abd4-7318-4199-a4f6-e49f52531cf1
      name: freeze_id_chr15.bim
      md5sum: 1e1139db4b031ba577b5ac6ae000ce6f
      filesize: 1.9MB
      filetype: .bim
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:c8105272-c607-4a82-82a7-f6dd269edc08
      name: freeze_id_chr15.fam
      md5sum: c9fc6b68df21dc6f2b433fdcc052aa14
      filesize: 277.5KB
      filetype: .fam
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:c2dea4f8-7d19-4d14-982f-95a0d2964495
      name: freeze_id_chr15.log
      md5sum: 0e054fc3cce4a123b109394752e580b0
      filesize: 1.0KB
      filetype: .log
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:a6efbd8c-0f56-4c6a-a71f-2419bafeb024
      name: freeze_id_chr15.bed
      md5sum: 611159bc9c4500de559615d0a7c549f2
      filesize: 140.0MB
      filetype: .bed
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
  - id: alspacdcs:3ac1b66a-08fe-464e-b6eb-dd0ce078ac89
    name: freeze_id_chr1
    data_distributions:
    - id: alspacdcs:c1098693-5f9e-4713-964d-4e614f34cef9
      name: freeze_id_chr1.fam
      md5sum: c9fc6b68df21dc6f2b433fdcc052aa14
      filesize: 277.5KB
      filetype: .fam
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:d484165b-1c01-4e79-a00f-7a8cdf6aeeb2
      name: freeze_id_chr1.bed
      md5sum: 01f7205ea4b6e852c0e8feb72a2cb9cd
      filesize: 374.7MB
      filetype: .bed
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:72cfc84a-14ad-4c10-85f7-7b30a6f258d5
      name: freeze_id_chr1.log
      md5sum: 59f4597c9621be95e1fc28a44d855361
      filesize: 1.0KB
      filetype: .log
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:b73aebf8-ab24-4ca1-9972-79c4eedf1f49
      name: freeze_id_chr1.bim
      md5sum: 44795681691b62d1921ad8855fd11a09
      filesize: 5.1MB
      filetype: .bim
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
  - id: alspacdcs:01757763-aee0-4d25-a8ed-9201a741801a
    name: freeze_id_chr20
    data_distributions:
    - id: alspacdcs:386aff33-b1fb-43cb-88a4-10c6881dc6fd
      name: freeze_id_chr20.log
      md5sum: 8cbb4fc64f55bd5294cd9a206e0f37e8
      filesize: 1.0KB
      filetype: .log
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:8ed2d8b2-6d78-44dc-972e-efb5de58fe09
      name: freeze_id_chr20.fam
      md5sum: c9fc6b68df21dc6f2b433fdcc052aa14
      filesize: 277.5KB
      filetype: .fam
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:fa2dcbf9-7cad-4415-81f3-bdf9d7a22c8d
      name: freeze_id_chr20.bed
      md5sum: 2af011bb98d6b8a8b00b7d938700fdac
      filesize: 122.8MB
      filetype: .bed
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:d99817d7-0b61-4edb-a5e8-2ce47cbbae88
      name: freeze_id_chr20.bim
      md5sum: 6e0b2d6cd06cc6e36f9cbc3f8df0a169
      filesize: 1.7MB
      filetype: .bim
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
  - id: alspacdcs:2ca6174b-b52c-47c2-a668-076f84319060
    name: freeze_id_chr6
    data_distributions:
    - id: alspacdcs:be3f3499-dc08-477a-bbd3-aaa4b2d99f1c
      name: freeze_id_chr6.log
      md5sum: d91f8058884a1dd82a9c2de687179eca
      filesize: 971.0B
      filetype: .log
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:d85e88c5-1458-481b-ae2d-4c84727258b3
      name: freeze_id_chr6.bed
      md5sum: 953f9c82981d59d25dabe44ba5718b29
      filesize: 353.1MB
      filetype: .bed
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:6feafd60-0d96-459e-beb0-2eab94b98eec
      name: freeze_id_chr6.fam
      md5sum: c9fc6b68df21dc6f2b433fdcc052aa14
      filesize: 277.5KB
      filetype: .fam
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:da32ca03-fc3d-4617-acb5-a26d5e561f5b
      name: freeze_id_chr6.bim
      md5sum: 3fd4e793a35c5e935454efc1105be192
      filesize: 4.8MB
      filetype: .bim
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
  - id: alspacdcs:fad2ed72-86b0-4056-bd17-93e56ade3ecb
    name: freeze_id_chr21
    data_distributions:
    - id: alspacdcs:5c0b52b0-a16a-4376-9812-34bc2a5d3381
      name: freeze_id_chr21.fam
      md5sum: c9fc6b68df21dc6f2b433fdcc052aa14
      filesize: 277.5KB
      filetype: .fam
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:eb1aef35-ee59-4348-b909-38e712204b32
      name: freeze_id_chr21.bim
      md5sum: c1f6f2181c49172608ac79e18425e4f4
      filesize: 924.7KB
      filetype: .bim
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:4e4694d8-4e3c-43c7-be72-98b077d64b10
      name: freeze_id_chr21.bed
      md5sum: 13165e1c9a27aa42853429b0246a1ed5
      filesize: 65.6MB
      filetype: .bed
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:901373ae-7601-45bd-91be-d47d804b5213
      name: freeze_id_chr21.log
      md5sum: bff5f387cc08a205f3ceea4912301c4f
      filesize: 975.0B
      filetype: .log
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
  - id: alspacdcs:7d3cd13a-ecd5-4f89-9ceb-b4bb49579325
    name: freeze_id_chr17
    data_distributions:
    - id: alspacdcs:c1d15a48-abcf-4d1c-b239-8c68e2bfd37e
      name: freeze_id_chr17.fam
      md5sum: c9fc6b68df21dc6f2b433fdcc052aa14
      filesize: 277.5KB
      filetype: .fam
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:a1bc18b6-03ea-4aa5-9d17-37b72b58e469
      name: freeze_id_chr17.log
      md5sum: 6537de563fe66f45ab0880c5c36695a5
      filesize: 1.0KB
      filetype: .log
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:9dcdc5f3-f793-4853-909b-70d2527eda35
      name: freeze_id_chr17.bim
      md5sum: 0dc0770759f9edccec7ce305e07b57d4
      filesize: 1.6MB
      filetype: .bim
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:d2113c6b-7986-4939-9994-8853d5490517
      name: freeze_id_chr17.bed
      md5sum: c6d54ed5ac68f2e0bd806b6124463ee4
      filesize: 113.2MB
      filetype: .bed
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
  - id: alspacdcs:95bd785b-ed60-4d1c-ba40-fef689036123
    name: freeze_id_chr11
    data_distributions:
    - id: alspacdcs:a6b1b401-f71d-4c17-a237-e8354a01af7f
      name: freeze_id_chr11.fam
      md5sum: c9fc6b68df21dc6f2b433fdcc052aa14
      filesize: 277.5KB
      filetype: .fam
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:302011cd-eed0-49ed-a214-73b1a696204a
      name: freeze_id_chr11.bim
      md5sum: 703ecef520ce7363c24e9600b363570f
      filesize: 3.5MB
      filetype: .bim
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:5f49f602-6d97-4464-afe8-6a75017069ea
      name: freeze_id_chr11.bed
      md5sum: 3c89898ce9fc0445c566ea0c060fb9db
      filesize: 251.8MB
      filetype: .bed
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:7fd97f7e-cb07-41ba-b90b-75726822143d
      name: freeze_id_chr11.log
      md5sum: 1b63b92463c92cafc756f3e8c330a698
      filesize: 977.0B
      filetype: .log
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
  - id: alspacdcs:895b29b0-c4c2-41e5-8340-4d3ae277461c
    name: freeze_id_chr4
    data_distributions:
    - id: alspacdcs:3af343a6-e042-42eb-a599-44de06b574d9
      name: freeze_id_chr4.bed
      md5sum: 147fee33c621f644dad5a2d8ee86fc1d
      filesize: 315.9MB
      filetype: .bed
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:34856871-ee87-42e4-bb41-9d89a605cf8f
      name: freeze_id_chr4.log
      md5sum: a932f5b22c0160602f031f6589dd0e60
      filesize: 971.0B
      filetype: .log
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:82aab8b9-4c3f-4b66-aa77-26540c9cf4be
      name: freeze_id_chr4.fam
      md5sum: c9fc6b68df21dc6f2b433fdcc052aa14
      filesize: 277.5KB
      filetype: .fam
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:4a07aea9-9c62-4346-89f0-d819d55aa016
      name: freeze_id_chr4.bim
      md5sum: 54a244447b1345636690b252215bfd2d
      filesize: 4.3MB
      filetype: .bim
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
  - id: alspacdcs:fd2b4071-7a92-427d-8aa5-318e131cad2b
    name: freeze_id_chr9
    data_distributions:
    - id: alspacdcs:1c48c6ed-2f21-4c69-b661-01920c806dec
      name: freeze_id_chr9.bed
      md5sum: 58ff215f0652257867e42f567ff1c2be
      filesize: 236.4MB
      filetype: .bed
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:e69923e6-552c-49b2-b643-0e332f60f5b0
      name: freeze_id_chr9.log
      md5sum: d82db9e0fae823f565e88baf698f5d99
      filesize: 1.0KB
      filetype: .log
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:a309e4c9-e320-4788-a6b1-f80d7ea77b48
      name: freeze_id_chr9.bim
      md5sum: 1e828e0f36c2d168ce6c1df5887a764b
      filesize: 3.2MB
      filetype: .bim
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:9cc37db9-958b-4f11-9441-acb3d66171c6
      name: freeze_id_chr9.fam
      md5sum: c9fc6b68df21dc6f2b433fdcc052aa14
      filesize: 277.5KB
      filetype: .fam
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
  - id: alspacdcs:1d5f1a43-9d6d-452c-8437-a1a6590200f8
    name: freeze_id_chr7
    data_distributions:
    - id: alspacdcs:dfbd747d-f228-4de6-a5b6-bbd4778abdb4
      name: freeze_id_chr7.fam
      md5sum: c9fc6b68df21dc6f2b433fdcc052aa14
      filesize: 277.5KB
      filetype: .fam
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:0d5d665c-280d-44c2-b63f-a4d2e1711289
      name: freeze_id_chr7.bim
      md5sum: dae38c5168605323dfc584a73f3ce4a1
      filesize: 3.8MB
      filetype: .bim
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:1a60cd7e-f732-4337-9aa5-7346462d14b5
      name: freeze_id_chr7.log
      md5sum: aa54d34c3425faa9859f2f254db2082f
      filesize: 971.0B
      filetype: .log
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:0a3ead16-4d79-45c1-a0ee-5a3ea755f645
      name: freeze_id_chr7.bed
      md5sum: fb9e8aaf4ae7c3fc75233248ec9d03b0
      filesize: 277.3MB
      filetype: .bed
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
  - id: alspacdcs:759778c9-8038-417b-97bf-caabc09b2f1e
    name: freeze_id_chr10
    data_distributions:
    - id: alspacdcs:307fc131-302b-453e-85ae-02690e80b688
      name: freeze_id_chr10.log
      md5sum: 98542bb6181aab17d59cdb333bb038ea
      filesize: 1.0KB
      filetype: .log
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:f069ae4f-bccf-4211-9110-62068b374cca
      name: freeze_id_chr10.bim
      md5sum: 3c259904c7da548d25c86a4a36e96285
      filesize: 3.8MB
      filetype: .bim
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:d9f75d55-f419-4544-bed7-1a3f8e016ef2
      name: freeze_id_chr10.bed
      md5sum: 4606d4a5a008927b6ab051461218094a
      filesize: 267.9MB
      filetype: .bed
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:c9174b77-373f-42f8-afea-dbfe0eb291aa
      name: freeze_id_chr10.fam
      md5sum: c9fc6b68df21dc6f2b433fdcc052aa14
      filesize: 277.5KB
      filetype: .fam
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
  - id: alspacdcs:7bc29065-807d-4c21-8918-8ed8f65a2825
    name: freeze_id_chr8
    data_distributions:
    - id: alspacdcs:f2db501c-2db4-40e2-a6ea-45abf2db2e8b
      name: freeze_id_chr8.fam
      md5sum: c9fc6b68df21dc6f2b433fdcc052aa14
      filesize: 277.5KB
      filetype: .fam
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:e254bf18-e94f-49bd-b406-dce1d679ccfd
      name: freeze_id_chr8.log
      md5sum: 12f65c73a1612c7f01e7cbb9faa6728d
      filesize: 971.0B
      filetype: .log
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:562e8ab2-65ee-4c5e-9fee-2abdbc2ebd4a
      name: freeze_id_chr8.bim
      md5sum: 6243ef376ee6cbe643bec69201bec604
      filesize: 3.9MB
      filetype: .bim
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:333ab86d-a675-4d94-954b-e62dcf883019
      name: freeze_id_chr8.bed
      md5sum: de34e8ef57e4c08991e4778401adf861
      filesize: 285.5MB
      filetype: .bed
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
  - id: alspacdcs:7d328e00-143e-44e4-8b21-d493f748d918
    name: freeze_id_chr22
    data_distributions:
    - id: alspacdcs:17890378-9f3b-495a-83cb-629aaaf25b2e
      name: freeze_id_chr22.bed
      md5sum: 5abcf552c585152ed0ee11754f3e7833
      filesize: 65.5MB
      filetype: .bed
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:d01e1f6f-ab86-4eca-9ab3-75252b37e44d
      name: freeze_id_chr22.log
      md5sum: 22dbdaf004a39f13df827d8fe16eb86d
      filesize: 975.0B
      filetype: .log
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:cc879e99-6a12-4d4f-a109-8de9bde833e8
      name: freeze_id_chr22.bim
      md5sum: 86a1da3366ba87e62f561dc09f64f9ac
      filesize: 920.9KB
      filetype: .bim
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:2eb50171-543f-42f4-9af3-cef7a0d2bcbd
      name: freeze_id_chr22.fam
      md5sum: c9fc6b68df21dc6f2b433fdcc052aa14
      filesize: 277.5KB
      filetype: .fam
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
  - id: alspacdcs:65c471a1-913a-4bc3-a812-b16e54dd200f
    name: freeze_id_chr16
    data_distributions:
    - id: alspacdcs:8c869b1d-dbf2-4aec-9ba5-5b661cc67b16
      name: freeze_id_chr16.fam
      md5sum: c9fc6b68df21dc6f2b433fdcc052aa14
      filesize: 277.5KB
      filetype: .fam
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:336096c8-697b-498c-a4e1-4ca26a024bfb
      name: freeze_id_chr16.bed
      md5sum: b04eb2e4e66fef7ee7d48cb666d78c38
      filesize: 138.5MB
      filetype: .bed
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:ad98ee65-76bc-41db-9092-3d2aa5faa0ac
      name: freeze_id_chr16.bim
      md5sum: 8bd9cb45256b6b5ca37ce66eec810035
      filesize: 1.9MB
      filetype: .bim
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:2b3ec9b5-e28b-417e-8c0b-1cc8ed5f7f7b
      name: freeze_id_chr16.log
      md5sum: 34657e68c9325f21c33207746f9ddd0a
      filesize: 975.0B
      filetype: .log
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
  - id: alspacdcs:7bf77605-1f09-4c1b-a272-1c44bffe2293
    name: freeze_id_chr14
    data_distributions:
    - id: alspacdcs:701b9895-05b5-477d-b2b0-fb3e8adfb030
      name: freeze_id_chr14.fam
      md5sum: c9fc6b68df21dc6f2b433fdcc052aa14
      filesize: 277.5KB
      filetype: .fam
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:2369e5f9-0d84-415b-b6b8-67ec1526ea0b
      name: freeze_id_chr14.bim
      md5sum: 4a933818aaea48201f455ebd07ea1b78
      filesize: 2.3MB
      filetype: .bim
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:3e4f19f4-e3bd-4b75-829b-cb1b37ca3f7e
      name: freeze_id_chr14.bed
      md5sum: a41f9803ec71a0dcdf137806b21ba2e6
      filesize: 162.5MB
      filetype: .bed
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:43abadf5-e1b3-4ca9-abc5-044cb5682693
      name: freeze_id_chr14.log
      md5sum: e1f2c8b876e4ec9e85deae8dd1a9bec7
      filesize: 975.0B
      filetype: .log
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
  - id: alspacdcs:9b1521fe-96a6-48d1-a3fe-47d1cc39d2b0
    name: freeze_id_chr13
    data_distributions:
    - id: alspacdcs:4c7b8334-319c-4c99-91af-186a1a3492b2
      name: freeze_id_chr13.fam
      md5sum: c9fc6b68df21dc6f2b433fdcc052aa14
      filesize: 277.5KB
      filetype: .fam
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:2062869c-01cb-4f2d-8a3c-263566f2cb11
      name: freeze_id_chr13.bed
      md5sum: 0e99cf077012880a802dc36ce72142c1
      filesize: 201.6MB
      filetype: .bed
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:fa22332d-494c-4fad-bdb6-ad9da610c5e6
      name: freeze_id_chr13.log
      md5sum: 9eac1d36058e281cd55934aff7d91261
      filesize: 977.0B
      filetype: .log
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:9ed2cb1e-593a-48e4-abb1-42f08bee40f5
      name: freeze_id_chr13.bim
      md5sum: cd1b7c80977fb5a0bbd87bc83dd85aed
      filesize: 2.8MB
      filetype: .bim
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
  - id: alspacdcs:0cf314bb-9a17-4a7d-957d-e0e9fb3e1653
    name: freeze_id_chr18
    data_distributions:
    - id: alspacdcs:cb61d296-4080-4d4b-b872-720c214d8322
      name: freeze_id_chr18.bim
      md5sum: 9ffd8f006c82701060dff29bf460e8fe
      filesize: 2.1MB
      filetype: .bim
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:af5b5611-d842-47af-b78b-c329f8ef6ddc
      name: freeze_id_chr18.log
      md5sum: 5f8a5de0d684936a5e73482143ceaa86
      filesize: 975.0B
      filetype: .log
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:71d13017-4ba4-44c8-bc9a-b568d1da3fb5
      name: freeze_id_chr18.bed
      md5sum: 6b46a8d2993dae303334b9a51b50b92c
      filesize: 148.7MB
      filetype: .bed
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:7a1af215-15f2-440a-9f4d-83d6f47df5fc
      name: freeze_id_chr18.fam
      md5sum: c9fc6b68df21dc6f2b433fdcc052aa14
      filesize: 277.5KB
      filetype: .fam
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
  - id: alspacdcs:41642a5d-04a4-4bff-8cf3-191d3daf1f52
    name: freeze_id_chr2
    data_distributions:
    - id: alspacdcs:f6ef8665-4a9c-412d-ae56-251edce2ad20
      name: freeze_id_chr2.log
      md5sum: 52044e2a3b44dc32c249292fbe6791bd
      filesize: 971.0B
      filetype: .log
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:d67023a6-a122-4cdc-bcc6-cfdbeba5094c
      name: freeze_id_chr2.fam
      md5sum: c9fc6b68df21dc6f2b433fdcc052aa14
      filesize: 277.5KB
      filetype: .fam
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:25f3ab57-0a4d-4231-92c8-34270d61a1c3
      name: freeze_id_chr2.bed
      md5sum: 494713bafedd17c3be4e782f7881dcc0
      filesize: 427.5MB
      filetype: .bed
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:772d2466-ea01-4458-a53b-a47b01e4230f
      name: freeze_id_chr2.bim
      md5sum: 275cefa559489b51bebbc65657a91822
      filesize: 5.9MB
      filetype: .bim
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
  - id: alspacdcs:39e85461-90a4-43bc-aba6-89717ba9867e
    name: freeze_id_chr12
    data_distributions:
    - id: alspacdcs:491388b8-ad94-4237-a4d7-2c2f95f4aeef
      name: freeze_id_chr12.fam
      md5sum: c9fc6b68df21dc6f2b433fdcc052aa14
      filesize: 277.5KB
      filetype: .fam
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:1c7e897a-1d73-4255-964d-47e7afbb8099
      name: freeze_id_chr12.bed
      md5sum: 367f44ccd183c47334cfc7cb8333628a
      filesize: 241.7MB
      filetype: .bed
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:6a491a6b-42cf-4c13-85f6-a95c165f5a1c
      name: freeze_id_chr12.log
      md5sum: 74355c1596f3f360051c9843a6bcad13
      filesize: 977.0B
      filetype: .log
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:39ea7914-51cb-49f3-a3e6-f2e35f0b340a
      name: freeze_id_chr12.bim
      md5sum: 515a46f735c531163377d114549042b5
      filesize: 3.4MB
      filetype: .bim
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
  - id: alspacdcs:8fc655d9-a011-4da5-bdf9-f534545e5314
    name: freeze_id_chr3
    data_distributions:
    - id: alspacdcs:052eacc7-17a6-46f7-872b-a79f39b5d7d2
      name: freeze_id_chr3.bed
      md5sum: 609847ca0489b7a97725ec275f8337d2
      filesize: 337.5MB
      filetype: .bed
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:8ec247fe-0839-4136-a63d-1551838502b8
      name: freeze_id_chr3.log
      md5sum: 1f737c62165516ad1ebced45dde7449d
      filesize: 1.0KB
      filetype: .log
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:67c67c59-bfb6-4010-9e4d-91b1b5f8bbf0
      name: freeze_id_chr3.fam
      md5sum: c9fc6b68df21dc6f2b433fdcc052aa14
      filesize: 277.5KB
      filetype: .fam
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:9648c74d-c1b2-480c-8a79-ed99fba786c9
      name: freeze_id_chr3.bim
      md5sum: 96d147406f1f24697b0cb9af0c7091fc
      filesize: 4.6MB
      filetype: .bim
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
  - id: alspacdcs:4cd1b4c1-7e30-4576-bac0-3983c8f4f56c
    name: freeze_id_chr5
    data_distributions:
    - id: alspacdcs:0a988283-2b96-4bba-acc8-e8678b741bd2
      name: freeze_id_chr5.log
      md5sum: 5637edc58fd0a953a2283149a1ffff55
      filesize: 971.0B
      filetype: .log
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:415ae163-6976-4a2a-ac69-05d34c119920
      name: freeze_id_chr5.bim
      md5sum: e8f55ef9016bf2f03ee43f08a6c974c3
      filesize: 4.4MB
      filetype: .bim
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:c29b23f3-a035-4a53-a7b5-034e7ad042e1
      name: freeze_id_chr5.bed
      md5sum: a3a47a8ea90e0fa39d5c203436b6d982
      filesize: 325.5MB
      filetype: .bed
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
    - id: alspacdcs:02d97e8a-ac68-4097-a329-d69445eac6a8
      name: freeze_id_chr5.fam
      md5sum: c9fc6b68df21dc6f2b433fdcc052aa14
      filesize: 277.5KB
      filetype: .fam
      belongs_to_container: alspacdcs:8a3c06a5-54ec-4ec2-91d8-ec1d74dba69c
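
Each distribution listed above carries an md5sum, so a delivered file can be checked against the catalogue entry. The following is a minimal sketch in Python, assuming the files have been copied to a local directory. The function names, the chunk size and the directory layout are illustrative assumptions, not part of any ALSPAC tooling.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(expected, directory="."):
    """Compare files in `directory` against a {filename: md5sum} mapping.

    Returns a dict mapping each filename to "ok", "mismatch" or "missing".
    """
    results = {}
    for name, md5sum in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == md5sum.lower():
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

For example, `verify_files({"freeze_id_chr4.bim": "54a244447b1345636690b252215bfd2d"}, "freeze_dir")` would report the status of that one file, where `freeze_dir` is a hypothetical local path.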

4.4 Genome-wide - 1000G imputed - G0 partners (gi_1000g_g0p)

4.4.1 Description

This dataset contains genome-wide array data imputed to the 1000 Genomes reference panel for G0 partners, with some additional G0 mothers and G1 individuals. The data have been cleaned, flipped to the positive strand, placed in b37 coordinates and imputed to the 1000 Genomes phase 1 version 3 reference panel. Reference genome build: GRCh37

4.4.2 Methodology

A total of 3,453 ALSPAC mothers and fathers were genotyped at 535,478 SNPs using the Illumina HumanCoreExome chip genotyping platform by the ALSPAC lab, and genotypes were called using GenomeStudio. The resulting raw genome-wide data were subjected to standard quality control methods using PLINK (v1.07). Individuals were excluded on the basis of gender mismatches (n = 80); minimal or excessive heterozygosity (n = 64); disproportionate levels of individual missingness (>5%, n = 60) and possible contamination (n = 3).

Population stratification was assessed by multidimensional scaling analysis and principal component analysis, with comparison against 1000 Genomes phase 3 data; all individuals with non-European ancestry were removed (n = 266).

Cryptic relatedness was measured as SNP relatedness in GCTA (relatedness > 0.1, n = 69 removed). SNPs with a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 1E-7) and those which failed GenomeStudio quality control measures were removed (n = 21,298). 6,594 duplicate SNPs were also removed.

This resulted in 2,911 unrelated mothers and fathers with genotypes at 507,586 SNPs. We then identified 2217 samples where the ALN assigned historically by the lab matched the genetically assigned ALN.
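
The exclusion counts reported above can be reconciled with the stated totals. The sketch below simply replays that arithmetic. It assumes the listed exclusions do not overlap and were applied in sequence, and it reads n = 266 as the number of individuals removed at the population stratification step and n = 21,298 as covering all three SNP-level criteria; the text implies these readings but does not state them outright.

```python
# Sample-level exclusions reported in the methodology (assumed non-overlapping).
genotyped_samples = 3453
sample_exclusions = {
    "gender mismatch": 80,
    "minimal or excessive heterozygosity": 64,
    "individual missingness > 5%": 60,
    "possible contamination": 3,
    "non-European ancestry (assumed meaning of n = 266)": 266,
    "cryptic relatedness > 0.1": 69,
}
remaining_samples = genotyped_samples - sum(sample_exclusions.values())

# SNP-level exclusions reported in the methodology.
genotyped_snps = 535478
snp_exclusions = {
    "call rate, HWE or GenomeStudio QC failure": 21298,
    "duplicate SNPs": 6594,
}
remaining_snps = genotyped_snps - sum(snp_exclusions.values())

print(remaining_samples)  # 2911
print(remaining_snps)     # 507586
```

Both results agree with the 2,911 individuals and 507,586 SNPs quoted in the text.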

We phased the data for the 3074 samples that passed QC (a set that still contained related subjects) using SHAPEIT v2.r837. We then removed 155,336 monomorphic SNPs, 1033 markers not in 1000 Genomes, 11,842 A/T or G/C SNPs and 10 duplicate sites to give 337,732 SNPs on chromosomes 1-23. Of the 329,363 markers on chromosomes 1-22, 298,742 overlapped the reference genome. We imputed to the 1000 Genomes phase 1 version 3 reference panel using the Michigan Imputation Server. We then identified 2217 samples where the ALN assigned historically by the lab matched the genetically assigned ALN. Finally, we removed 12 subjects who had withdrawn consent and 6 subjects genotyped in an earlier work package to give 2201 subjects.

1737 putative G0 partner-G1 pairs for whom both the G0 partner and the G1 have called genotype data available were identified based on ALN. Given that the G0 partners were invited by the G0 mother to take part and only enrolled in the study in their own right several years later, it could not be assumed that all G0 partners were biologically related to G1. Called genotype data for the 1720 unique G0 partners and 1737 unique G1s (i.e. there were 17 pairs of siblings/twins among the G1 offspring) were merged using plink v1.90b7.2 64-bit (11 Dec 2023).

After application of the plink filters --geno 0.05, --maf 0.01, --snps-only just-acgt and --autosome, the --related command in KING version 2.3.2 was used to perform kinship analysis, which confirmed that all 1737 putative G0 partner-G1 pairs are genetically related, as would be expected for biological father-offspring pairs, using the inference criteria described in Table 1 of Manichaikul, Ani, et al. "Robust relationship inference in genome-wide association studies." Bioinformatics 26.22 (2010): 2867-2873.
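
To illustrate how a kinship coefficient maps onto a relationship category, here is a minimal sketch in Python. It is not the ALSPAC pipeline. The cut-offs are the powers of two commonly quoted from the KING paper cited above and should be checked against its Table 1 before use; the function name is hypothetical. Separating parent-offspring from full-sibling pairs within the first-degree band additionally relies on the proportion of zero IBS sharing, which this sketch does not model.

```python
def infer_relationship(kinship):
    """Map an estimated kinship coefficient to a relationship degree.

    Cut-offs are assumed from the KING inference criteria:
    > 2^-1.5 duplicate/MZ twin, > 2^-2.5 first degree,
    > 2^-3.5 second degree, > 2^-4.5 third degree, otherwise unrelated.
    """
    if kinship > 2 ** -1.5:      # about 0.354
        return "duplicate/MZ twin"
    if kinship > 2 ** -2.5:      # about 0.177
        return "1st-degree"
    if kinship > 2 ** -3.5:      # about 0.0884
        return "2nd-degree"
    if kinship > 2 ** -4.5:      # about 0.0442
        return "3rd-degree"
    return "unrelated"


# A father-offspring pair has an expected kinship coefficient of 0.25.
print(infer_relationship(0.25))  # 1st-degree
```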

4.4.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gi_1000g_g0p_2016-11-22_f4
name: Genome-wide - 1000G imputed - G0 partners version 2016-11-22 freeze 4
description: >-
  This dataset is the fourth freeze of the 2016-11-22 version of the genome-wide array data imputed to the 1000 Genomes reference panel
  for G0 partners, with some additional G0 mothers and G1 individuals.

freeze_size: 44G
linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112
git_tag: https://github.com/alspac/dataset_gi_1000g_g0p/releases/tag/freeze4
is_current_freeze: true
freeze_number: 4
freeze_date: 2023-09-11
previous_freeze: alspacdcs:gi_1000g_g0p_2016-11-22_f4
next_freeze:
freeze_of_alspac_dataset_version: alspacdcs:gi_1000g_g0p_2016-11-22
freeze_of_named_alspac_dataset: alspacdcs:gi_1000g_g0p


has_containers:
  - id: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c
    name: data
    description: A directory containing the bgen data files and the sample file

has_parts:
  - id: alspacdcs:gi_1000g_g0p_2016-11-22_sample_f4
    name: Samples
    description: >-
      The samples in the data. To be used with the genetic data.
    data_distributions:
      - id: alspacdcs:dfa0ee7c627927a47e286aa23b0514e4_swapped.sample
        name: swapped.sample
        description: >-
          A plain text .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: dfa0ee7c627927a47e286aa23b0514e4
        filesize: 165k
        filetype: .sample
        number_of_participants: 2198
        belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c



  - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr1_f4
    name: Chr1
    description: Data for Chr1
    data_distributions:
      - id: alspacdcs:a5eb049e4df5a8b005ae51b47947d830_filtered_data_chr01.bgen
        name: filtered_data_chr01.bgen
        description: >-
          An Oxford Bgen file (bgen v1.2) for Chr1. To be used with the
          accompanying .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: a5eb049e4df5a8b005ae51b47947d830
        filesize: 3.4G
        filetype: .bgen
        number_of_participants: 2198
        number_of_variants: 2159337
        belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c

  - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr2_f4
    name: Chr2
    description: Data for Chr2
    data_distributions:
      - id: alspacdcs:e297c8d30455053d23ac360bcc886bb0_filtered_data_chr02.bgen
        name: filtered_data_chr02.bgen
        description: >-
          An Oxford Bgen file (bgen v1.2) for Chr2. To be used with the
          accompanying .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: e297c8d30455053d23ac360bcc886bb0
        filesize: 3.6G
        filetype: .bgen
        number_of_participants: 2198
        number_of_variants: 2349883
        belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c

  - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr3_f4
    name: Chr3
    description: Data for Chr3
    data_distributions:
      - id: alspacdcs:c0b55e9d65c219ffb1b8c58a0ebb7c18_filtered_data_chr03.bgen
        name: filtered_data_chr03.bgen
        description: >-
          An Oxford Bgen file (bgen v1.2) for Chr3. To be used with the
          accompanying .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: c0b55e9d65c219ffb1b8c58a0ebb7c18
        filesize: 3.0G
        filetype: .bgen
        number_of_participants: 2198
        number_of_variants: 1969275
        belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c

  - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr4_f4
    name: Chr4
    description: Data for Chr4
    data_distributions:
      - id: alspacdcs:514f09f02c74fc3eca83379e9e99c5dc_filtered_data_chr04.bgen
        name: filtered_data_chr04.bgen
        description: >-
          An Oxford Bgen file (bgen v1.2) for Chr4. To be used with the
          accompanying .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: 514f09f02c74fc3eca83379e9e99c5dc
        filesize: 3.1G
        filetype: .bgen
        number_of_participants: 2198
        number_of_variants: 1969883

  - id:  alspacdcs:gi_1000g_g0p_2016-11-22_chr5_f4
    name: Chr5
    description: Data for Chr5
    data_distributions:
      - id: alspacdcs:f4accbf5bdd6a2ccc9598e9e2221915d_filtered_data_chr05.bgen
        name: filtered_data_chr05.bgen
        description: >-
          An Oxford Bgen file (bgen v1.2) for Chr5. To be used with the
          accompanying .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: f4accbf5bdd6a2ccc9598e9e2221915d
        filesize: 2.8G
        filetype: .bgen
        number_of_participants: 2198
        number_of_variants: 1809961
        belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c

  - id:  alspacdcs:gi_1000g_g0p_2016-11-22_chr6_f4
    name: Chr6
    description: Data for Chr6
    data_distributions:
      - id: alspacdcs:a9327ad1591fdf7d349b066544e71c3a_filtered_data_chr06.bgen
        name: filtered_data_chr06.bgen
        description: >-
          An Oxford Bgen file (bgen v1.2) for Chr6. To be used with the
          accompanying .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: a9327ad1591fdf7d349b066544e71c3a
        filesize: 2.6G
        filetype: .bgen
        number_of_participants: 2198
        number_of_variants: 1758025
        belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c

  - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr7_f4
    name: Chr7
    description: Data for Chr7
    data_distributions:
      - id: alspacdcs:f832922558eddcf3feed87091c2ec0ae_filtered_data_chr07.bgen
        name: filtered_data_chr07.bgen
        description: >-
          An Oxford Bgen file (bgen v1.2) for Chr7. To be used with the
          accompanying .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: f832922558eddcf3feed87091c2ec0ae
        filesize: 2.7G
        filetype: .bgen
        number_of_participants: 2198
        number_of_variants: 1601293
        belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c

  - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr8_f4
    name: Chr8
    description: Data for Chr8
    data_distributions:
      - id: alspacdcs:47d79712e676a0048f90858cbb888179_filtered_data_chr08.bgen
        name: filtered_data_chr08.bgen
        description: >-
          An Oxford Bgen file (bgen v1.2) for Chr8. To be used with the
          accompanying .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: 47d79712e676a0048f90858cbb888179
        filesize: 2.4G
        filetype: .bgen
        number_of_participants: 2198
        number_of_variants: 1558902
        belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c

  - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr9_f4
    name: Chr9
    description: Data for Chr9
    data_distributions:
      - id: alspacdcs:82a480f3e8792db2c1cec3adc50e1357_filtered_data_chr09.bgen
        name: filtered_data_chr09.bgen
        description: >-
          An Oxford Bgen file (bgen v1.2) for Chr9. To be used with the
          accompanying .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: 82a480f3e8792db2c1cec3adc50e1357
        filesize: 1.9G
        filetype: .bgen
        number_of_participants: 2198
        number_of_variants: 1189463
        belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c

  - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr10_f4
    name: Chr10
    description: Data for Chr10
    data_distributions:
      - id: alspacdcs:8f64fe184e4c876a345a728ed5eeddcf_filtered_data_chr10.bgen
        name: filtered_data_chr10.bgen
        description: >-
          An Oxford Bgen file (bgen v1.2) for Chr10. To be used with the
          accompanying .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: 8f64fe184e4c876a345a728ed5eeddcf
        filesize: 2.2G
        filetype: .bgen
        number_of_participants: 2198
        number_of_variants: 1363104
        belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c

  - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr11_f4 
    name: Chr11
    description: Data for Chr11
    data_distributions:
      - id: alspacdcs:b1b7e3bef0fe72cd90bd0ba456f687aa_filtered_data_chr11.bgen
        name: filtered_data_chr11.bgen
        description: >-
          An Oxford Bgen file (bgen v1.2) for Chr11. To be used with the
          accompanying .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: b1b7e3bef0fe72cd90bd0ba456f687aa
        filesize: 2.2G
        filetype: .bgen
        number_of_participants: 2198
        number_of_variants: 1359640
        belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c

  - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr12_f4
    name: Chr12
    description: Data for Chr12
    data_distributions:
      - id: alspacdcs:509202db22200fe0bd58210ab8e9c757_filtered_data_chr12.bgen
        name: filtered_data_chr12.bgen
        description: >-
          An Oxford Bgen file (bgen v1.2) for Chr12. To be used with the
          accompanying .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: 509202db22200fe0bd58210ab8e9c757
        filesize: 2.1G
        filetype: .bgen
        number_of_participants: 2198
        number_of_variants: 1316510
        belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c

  - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr13_f4
    name: Chr13
    description: Data for Chr13
    data_distributions:
      - id: alspacdcs:176a10d38ab80783a8e392e5791edea7_filtered_data_chr13.bgen
        name: filtered_data_chr13.bgen
        description: >-
          An Oxford Bgen file (bgen v1.2) for Chr13. To be used with the
          accompanying .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: 176a10d38ab80783a8e392e5791edea7
        filesize: 1.6G
        filetype: .bgen
        number_of_participants: 2198
        number_of_variants: 988473

  - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr14_f4
    name: Chr14
    description: Data for Chr14
    data_distributions:
      - id: alspacdcs:1ecd96aab2925bafd7d20497d85dd937_filtered_data_chr14.bgen
        name: filtered_data_chr14.bgen
        description: >-
          An Oxford Bgen file (bgen v1.2) for Chr14. To be used with the
          accompanying .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: 1ecd96aab2925bafd7d20497d85dd937
        filesize: 1.5G
        filetype: .bgen
        number_of_participants: 2198
        number_of_variants: 903811
        belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c

  - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr15_f4 
    name: Chr15
    description: Data for Chr15
    data_distributions:
      - id: alspacdcs:f8c5b54206189808e9a361cc0da63798_filtered_data_chr15.bgen
        name: filtered_data_chr15.bgen
        description: >-
          An Oxford Bgen file (bgen v1.2) for Chr15. To be used with the
          accompanying .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: f8c5b54206189808e9a361cc0da63798
        filesize: 1.4G
        filetype: .bgen
        number_of_participants: 2198
        number_of_variants: 814028
        belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c

  - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr16_f4 
    name: Chr16
    description: Data for Chr16
    data_distributions:
      - id: alspacdcs:52f065575d3cb2dff34df6763a583766_filtered_data_chr16.bgen
        name: filtered_data_chr16.bgen
        description: >-
          An Oxford Bgen file (bgen v1.2) for Chr16. To be used with the
          accompanying .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: 52f065575d3cb2dff34df6763a583766
        filesize: 1.6G
        filetype: .bgen
        number_of_participants: 2198
        number_of_variants: 867901
        belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c

  - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr17_f4  
    name: Chr17
    description: Data for Chr17
    data_distributions:
      - id: alspacdcs:73d85caf67dcedc63b11a43bd5ccb44d_filtered_data_chr17.bgen
        name: filtered_data_chr17.bgen
        description: >-
          An Oxford Bgen file (bgen v1.2) for Chr17. To be used with the
          accompanying .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: 73d85caf67dcedc63b11a43bd5ccb44d
        filesize: 1.4G
        filetype: .bgen
        number_of_participants: 2198
        number_of_variants: 755467
        belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c

  - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr18_f4  
    name: Chr18
    description: Data for Chr18
    data_distributions:
      - id: alspacdcs:b8e055a6c0955bb67161c9f7a1d8cad7_filtered_data_chr18.bgen
        name: filtered_data_chr18.bgen
        description: >-
          An Oxford Bgen file (bgen v1.2) for Chr18. To be used with the
          accompanying .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: b8e055a6c0955bb67161c9f7a1d8cad7
        filesize: 1.4G
        filetype: .bgen
        number_of_participants: 2198
        number_of_variants: 783661
        belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c

  - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr19_f4  
    name: Chr19
    description: Data for Chr19
    data_distributions:
      - id: alspacdcs:37ea045cd9f4027cba547b7b89c3a1a0_filtered_data_chr19.bgen
        name: filtered_data_chr19.bgen
        description: >-
          An Oxford Bgen file (bgen v1.2) for Chr19. To be used with the
          accompanying .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: 37ea045cd9f4027cba547b7b89c3a1a0
        filesize: 1.3G
        filetype: .bgen
        number_of_participants: 2198
        number_of_variants: 606147
        belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c

  - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr20_f4
    name: Chr20
    description: Data for Chr20
    data_distributions:
      - id: alspacdcs:d241eb21be3188c26c460e1f65f0d8c1_filtered_data_chr20.bgen
        name: filtered_data_chr20.bgen
        description: >-
          An Oxford Bgen file (bgen v1.2) for Chr20. To be used with the
          accompanying .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: d241eb21be3188c26c460e1f65f0d8c1
        filesize: 1.1G
        filetype: .bgen
        number_of_participants: 2198
        number_of_variants: 618749
        belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c

  - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr21_f4
    name: Chr21
    description: Data for Chr21
    data_distributions:
      - id: alspacdcs:7881bdc24e7f0adbfb800b49d1efd590_filtered_data_chr21.bgen
        name: filtered_data_chr21.bgen
        description: >-
          An Oxford Bgen file (bgen v1.2) for Chr21. To be used with the
          accompanying .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: 7881bdc24e7f0adbfb800b49d1efd590
        filesize: 672M
        filetype: .bgen
        number_of_participants: 2198
        number_of_variants: 378064
        belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c

  - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr22_f4
    name: Chr22
    description: Data for Chr22
    data_distributions:
      - id: alspacdcs:824412e963441699f260c6245f65659d_filtered_data_chr22.bgen
        name: filtered_data_chr22.bgen
        description: >-
          An Oxford Bgen file (bgen v1.2) for Chr22. To be used with the
          accompanying .sample file.
          See https://doi.org/10.1101/308296 for file format details.
        md5sum: 824412e963441699f260c6245f65659d
        filesize: 722M
        filetype: .bgen
        number_of_participants: 2198
        number_of_variants: 366590
        belongs_to_container: alspacdcs:e3604b28-bf00-4f55-8252-80ef3df26e9c

4.5 Genome-wide - 1000G imputed - G0 mothers + G1 (gi_1000g_g0m_g1)

4.5.1 Description

This dataset contains genome-wide 1000G imputed data for G0 mothers + G1. The data have been cleaned, flipped to the positive strand, placed in b37 coordinates and imputed to the 1000 genomes phase 1 version 3 reference panel. Reference genome build: GRCh37

4.5.2 Methodology

ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platforms by 23andMe, subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%) and insufficient sample replication (IBD < 0.8).

Population stratification was assessed by multidimensional scaling analysis and compared with HapMap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed.

SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed. Cryptic relatedness was measured as proportion of identity by descent (IBD > 0.1). Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,115 subjects and 500,527 SNPs passed these quality control filters.
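
The SNP-level thresholds in the preceding paragraph can be written as a simple predicate. This is a minimal sketch, not the pipeline that was actually run: the `SnpStats` record and its field names are hypothetical, and only the three threshold values are taken from the text.

```python
from dataclasses import dataclass

# Thresholds quoted in the text for the children's genotype data.
MIN_MAF = 0.01        # SNPs with minor allele frequency < 1% are removed
MIN_CALL_RATE = 0.95  # SNPs with call rate < 95% are removed
MIN_HWE_P = 5e-7      # SNPs with Hardy-Weinberg P < 5E-7 are removed


@dataclass
class SnpStats:
    """Hypothetical per-SNP summary; the field names are illustrative only."""
    snp_id: str
    maf: float
    call_rate: float
    hwe_p: float


def passes_snp_qc(snp: SnpStats) -> bool:
    """Return True if the SNP survives all three filters."""
    return (
        snp.maf >= MIN_MAF
        and snp.call_rate >= MIN_CALL_RATE
        and snp.hwe_p >= MIN_HWE_P
    )


# Made-up SNPs, one passing and one failing each filter in turn.
example = [
    SnpStats("rs_a", maf=0.20, call_rate=0.99, hwe_p=0.5),
    SnpStats("rs_b", maf=0.005, call_rate=0.99, hwe_p=0.5),  # fails MAF
    SnpStats("rs_c", maf=0.20, call_rate=0.90, hwe_p=0.5),   # fails call rate
    SnpStats("rs_d", maf=0.20, call_rate=0.99, hwe_p=1e-9),  # fails HWE
]
kept = [s.snp_id for s in example if passes_snp_qc(s)]
```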

ALSPAC mothers were genotyped using the Illumina human660W-quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs. SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally SNPs with a minor allele frequency of less than 1% were removed.

Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded.

Cryptic relatedness was assessed using an IBD estimate of more than 0.125, which is expected to correspond to roughly 12.5% alleles shared IBD or a relatedness at the first cousin level. Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,048 subjects and 526,688 SNPs passed these quality control filters.

We combined 477,482 SNP genotypes in common between the sample of mothers and sample of children. We removed SNPs with genotype missingness above 1% due to poor quality (11,396 SNPs removed) and removed a further 321 subjects due to potential ID mismatches. This resulted in a dataset of 17,842 subjects containing 6,305 duos and 465,740 SNPs (112 were removed during liftover and 234 were out of HWE after combination). We estimated haplotypes using ShapeIT (v2.r644), which utilises relatedness during phasing. We obtained a phased version of the 1000 genomes reference panel (Phase 1, Version 3) from the Impute2 reference data repository (phased using ShapeIT v2.r644, haplotype release date Dec 2013). Imputation of the target data was performed using Impute V2.2.2 against the reference panel (all polymorphic SNPs excluding singletons), using all 2186 reference haplotypes (including non-Europeans).
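
The subject and SNP counts quoted in this section are mutually consistent, which can be checked with a short calculation. The sketch below assumes the 321 removed subjects are subtracted from the sum of the two post-QC samples; the text does not state this explicitly, but the totals agree. All variable names are illustrative.

```python
# Figures quoted in the methodology text.
children_passing_qc = 9_115
mothers_passing_qc = 9_048
subjects_removed_id_mismatch = 321

snps_in_common = 477_482
snps_removed_missingness = 11_396
snps_removed_liftover = 112
snps_removed_hwe = 234

# Assumed bookkeeping: sum both samples, then drop potential ID mismatches.
combined_subjects = (
    children_passing_qc + mothers_passing_qc - subjects_removed_id_mismatch
)
combined_snps = (
    snps_in_common
    - snps_removed_missingness
    - snps_removed_liftover
    - snps_removed_hwe
)

print(combined_subjects)  # 17842
print(combined_snps)      # 465740
```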

This gave 8,237 eligible children and 8,196 eligible mothers with available genotype data after exclusion of related subjects using cryptic relatedness measures described previously.

Known issues: There is a known strand issue within this imputation. The Dec 2013 haplotype release of 1000 genomes phase 1 version 3 has 199 reported SNPs with incorrect strand. For more information and the origins of this list please visit https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated_SHAPEIT2_16-06-14.html. These are very unlikely to have systematic effects across the genome and are most probably isolated to the 199 known problematic SNPs. Users are advised to discard them from their analysis.
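
Discarding the problematic SNPs amounts to filtering variant identifiers against an exclusion list. The following is a minimal sketch only: it assumes the list has been saved as plain text with one identifier per line (the layout of the published list is not described here), and the identifiers shown are made up.

```python
from typing import Iterable, List, Set


def load_exclusions(lines: Iterable[str]) -> Set[str]:
    """Collect identifiers, skipping blank lines and '#' comments (assumed layout)."""
    excluded = set()
    for line in lines:
        entry = line.strip()
        if entry and not entry.startswith("#"):
            excluded.add(entry)
    return excluded


def drop_excluded(variant_ids: Iterable[str], excluded: Set[str]) -> List[str]:
    """Keep only the variant identifiers that are not on the exclusion list."""
    return [v for v in variant_ids if v not in excluded]


# Made-up identifiers, for illustration only.
excluded = load_exclusions(["# strand-problem SNPs", "rs_bad1", "", "rs_bad2"])
kept = drop_excluded(["rs_ok1", "rs_bad1", "rs_ok2"], excluded)
```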

The bgen files within the gi_1000g_g0m_g1 dataset have NA in place of the chromosome column. Some tools accept this, while others are less forgiving, so users may wish to re-format the dataset (using QCtool or equivalent) for their work.
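
Because each bgen file in this freeze holds a single chromosome, the missing chromosome label can be recovered from the file name when re-formatting. The helper below is a hypothetical sketch that assumes the `filtered_NN.bgen` naming used in the freeze documentation; it only derives the label and does not rewrite the bgen file itself.

```python
import re
from typing import Optional

# Assumed naming pattern: filtered_01.bgen ... filtered_23.bgen
_BGEN_NAME = re.compile(r"^filtered_(\d{2})\.bgen$")


def chromosome_from_filename(name: str) -> Optional[str]:
    """Return the chromosome number encoded in a freeze file name, or None."""
    match = _BGEN_NAME.match(name)
    if match is None:
        return None
    # Strip the zero padding: "01" becomes "1", "23" stays "23".
    return str(int(match.group(1)))
```

The returned label can then be passed to whichever re-formatting tool is used, following that tool's own documentation.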

Allele frequency concordance with other cohorts: When contributing to consortia you may find that the allele frequencies in ALSPAC for a few thousand SNPs are discordant from a reference panel used by the consortium. This is to be expected: when allele frequencies are calculated in two different samples, even from the same population, for many millions of SNPs, a number of SNPs will appear to be highly discordant.
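
The scale of this effect can be sketched with a normal approximation. Assume, hypothetically, that a SNP is flagged as discordant when a standardised frequency difference |z| exceeds some threshold, and that ALSPAC and the reference panel share the same true frequencies. The expected number of flagged SNPs is then the SNP count multiplied by the two-sided tail probability. The SNP count and threshold below are illustrative and are not values used by ALSPAC or by any consortium.

```python
import math


def expected_discordant(n_snps: int, z_threshold: float) -> float:
    """Expected number of SNPs with |z| above the threshold under the null."""
    # Two-sided standard normal tail: P(|Z| > k) = erfc(k / sqrt(2)).
    two_sided_tail = math.erfc(z_threshold / math.sqrt(2.0))
    return n_snps * two_sided_tail


# Hypothetical inputs: 10 million SNPs, flagging |z| > 3.5.
expected = expected_discordant(10_000_000, 3.5)
print(round(expected))
```

With these illustrative inputs the expectation is on the order of a few thousand SNPs, in line with the scale described above. Because expectations add, correlation between neighbouring SNPs changes the variance of this count but not its expected value.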

4.5.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_f4
name: >-
  Genome-wide - 1000G imputed - G0 mothers + G1 version 2015-10-30
  freeze 4
description: >-
  This is the fourth freeze of the 2015-10-30 version of the
  gi_1000g_g0m_g1 dataset. It contains data in the Oxford format,
  which is a combination of bgen (version 1.2) and sample files. It is a subset of
  the data in gi_1000g_g0m_g1_2015-10-30 limited to one format and
  with participants who have withdrawn their consent removed.

  The Dec 2013 haplotype release of 1000 genomes phase 1 version 3 has 199 reported SNPs
  with incorrect strand. The strand issues are present in this imputation version. For more 
  information and the origins of this list please visit:
  https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated_SHAPEIT2_16-06-14.html

  It is very unlikely that they have systematic effects across the genome and most 
  probably are just isolated to these 199 known problematic SNPs.

  The user is advised to discard them from their analysis.
freeze_size: 122G
linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112
git_tag: https://github.com/alspac/dataset_gi_1000g_g0m_g1/releases/tag/freeze4
is_current_freeze: true
freeze_number: 4
freeze_date: 2024-06-11
previous_freeze: alspacdcs:gi_1000g_g0m_g1_2015-10-30_f3
freeze_of_alspac_dataset_version: alspacdcs:gi_1000g_g0m_g1_2015-10-30
freeze_of_named_alspac_dataset: alspacdcs:gi_1000g_g0m_g1

has_parts:
  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_sample_f4
    name: Samples
    description: >-
      The samples in the data. To be used with the genetic data.
    data_distributions:
      - id: alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	name: swapped.sample
	description: >-
	  A plain text .sample file.
	  See https://doi.org/10.1101/308296 for file format details.
	md5sum: 65bf6fc592b85ce69dec0473aca5b5cd
	filesize: 1.3M
	filetype: .sample
	number_of_participants: 17444

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr1_f4
    name: Chr1
    description: Data for Chr1
    data_distributions:
      - id: alspacdcs:fad144852b7c9c929ea1a55b8481798c_filtered_01.bgen
	name: filtered_01.bgen
	description: >- 
	  An Oxford Bgen file for Chr1. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)       
	md5sum: fad144852b7c9c929ea1a55b8481798c
	filesize: 9.1G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 2155158

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr2_f4
    name: Chr2
    description: Data for Chr2
    data_distributions:
      - id: alspacdcs:91168a792595ee55375d6c72c881fa6c_filtered_02.bgen
	name: filtered_02.bgen
	description: >- 
	  An Oxford Bgen file for Chr2. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)         
	md5sum: 91168a792595ee55375d6c72c881fa6c
	filesize: 9.1G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 2346862

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr3_f4
    name: Chr3
    description: Data for Chr3
    data_distributions:
      - id: alspacdcs:6e898fe7aba1d39e832245267a9ec30e_filtered_03.bgen
	name: filtered_03.bgen
	description: >- 
	  An Oxford Bgen file for Chr3. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)         
	md5sum: 6e898fe7aba1d39e832245267a9ec30e
	filesize: 7.7G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 1966662

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr4_f4
    name: Chr4
    description: Data for Chr4
    data_distributions:
      - id: alspacdcs:c7ba39fbff7de19ffd98b93ff217108b_filtered_04.bgen
	name: filtered_04.bgen
	description: >- 
	  An Oxford Bgen file for Chr4. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)          
	md5sum: c7ba39fbff7de19ffd98b93ff217108b
	filesize: 8.4G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 1968171

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr5_f4
    name: Chr5
    description: Data for Chr5
    data_distributions:
      - id: alspacdcs:173056913dd6dc1684e9118907af1fd5_filtered_05.bgen
	name: filtered_05.bgen
	description: >- 
	  An Oxford Bgen file for Chr5. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)          
	md5sum: 173056913dd6dc1684e9118907af1fd5
	filesize: 6.9G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 1808090

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr6_f4
    name: Chr6
    description: Data for Chr6
    data_distributions:
      - id: alspacdcs:b8296902cc14e29111b2caefbc52a00b_filtered_06.bgen
	name: filtered_06.bgen
	description: >- 
	  An Oxford Bgen file for Chr6. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)
	md5sum: b8296902cc14e29111b2caefbc52a00b
	filesize: 6.8G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 1755859

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr7_f4
    name: Chr7
    description: Data for Chr7
    data_distributions:
      - id: alspacdcs:3072cca6a05fdb782b858f70beed6e06_filtered_08.bgen
	name: filtered_07.bgen
	description: >- 
	  An Oxford Bgen file for Chr7. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)
	md5sum: 3072cca6a05fdb782b858f70beed6e06
	filesize: 7.1G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 1599387

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr8_f4
    name: Chr8
    description: Data for Chr8
    data_distributions:
      - id: alspacdcs:c57b0cc8c3b47c8058e6f95ba742a89d_filtered_08.bgen
	name: filtered_08.bgen
	description: >- 
	  An Oxford Bgen file for Chr8. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)          
	md5sum: c57b0cc8c3b47c8058e6f95ba742a89d
	filesize: 5.9G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 1557429

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr9_f4
    name: Chr9
    description: Data for Chr9
    data_distributions:
      - id: alspacdcs:0e0d21cb1dc4d276d0a4353cc7da0564_filtered_09.bgen
	name: filtered_09.bgen
	description: >- 
	  An Oxford Bgen file for Chr9. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)         
	md5sum: 0e0d21cb1dc4d276d0a4353cc7da0564
	filesize: 5.1G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 1187731

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr10_f4
    name: Chr10
    description: Data for Chr10
    data_distributions:
      - id: alspacdcs:e5f8a44f260c009a9fec7bdc105ead76_filtered_10.bgen
	name: filtered_10.bgen
	description: >- 
	  An Oxford Bgen file for Chr10. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)          
	md5sum: e5f8a44f260c009a9fec7bdc105ead76
	filesize: 5.4G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 1361506

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr11_f4
    name: Chr11
    description: Data for Chr11
    data_distributions:
      - id: alspacdcs:7c64c009aaf9fdb84c21b31f51e28bfa_filtered_11.bgen
	name: filtered_11.bgen
	description: >- 
	  An Oxford Bgen file for Chr11. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)          
	md5sum: 7c64c009aaf9fdb84c21b31f51e28bfa
	filesize: 5.4G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 1356882

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr12_f4
    name: Chr12
    description: Data for Chr12
    data_distributions:
      - id: alspacdcs:8f0d903ca1cf24ca0e45494bd0a1426c_filtered_12.bgen
	name: filtered_12.bgen
	description: >- 
	  An Oxford Bgen file for Chr12. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)          
	md5sum: 8f0d903ca1cf24ca0e45494bd0a1426c
	filesize: 5.4G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 1314328

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr13_f4
    name: Chr13
    description: Data for Chr13
    data_distributions:
      - id: alspacdcs:e59348ea876d3f5c3b6331e738daa162_filtered_13.bgen
	name: filtered_13.bgen
	description: >- 
	  An Oxford Bgen file for Chr13. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)          
	md5sum: e59348ea876d3f5c3b6331e738daa162
	filesize: 4.0G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 987740

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr14_f4
    name: Chr14
    description: Data for Chr14
    data_distributions:
      - id: alspacdcs:3f80471a1e183e478ca3674482ed89e4_filtered_14.bgen
	name: filtered_14.bgen
	description: >- 
	  An Oxford Bgen file for Chr14. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)         
	md5sum: 3f80471a1e183e478ca3674482ed89e4
	filesize: 3.9G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 904351

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr15_f4
    name: Chr15
    description: Data for Chr15
    data_distributions:
      - id: alspacdcs:2166a96fc0bbdc990b1bcb513f4372bd_filtered_15.bgen
	name: filtered_15.bgen
	description: >- 
	  An Oxford Bgen file for Chr15. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)          
	md5sum: 2166a96fc0bbdc990b1bcb513f4372bd
	filesize: 3.7G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 812545

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr16_f4
    name: Chr16
    description: Data for Chr16
    data_distributions:
      - id: alspacdcs:c44b1d287c79c69b2171c6822339cf4b_filtered_16.bgen
	name: filtered_16.bgen
	description: >- 
	  An Oxford Bgen file for Chr16. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)         
	md5sum: c44b1d287c79c69b2171c6822339cf4b
	filesize: 4.3G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 865998

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr17_f4
    name: Chr17
    description: Data for Chr17
    data_distributions:
      - id: alspacdcs:e4c50e9c54d4baa59d191a756d60b32e_filtered_17.bgen
	name: filtered_17.bgen
	description: >- 
	  An Oxford Bgen file for Chr17. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)          
	md5sum: e4c50e9c54d4baa59d191a756d60b32e
	filesize: 3.8G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 753174

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr18_f4
    name: Chr18
    description: Data for Chr18
    data_distributions:
      - id: alspacdcs:fa893fede52923d5805f8583dbed51bd_filtered_18.bgen
	name: filtered_18.bgen
	description: >- 
	  An Oxford Bgen file for Chr18. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)          
	md5sum: fa893fede52923d5805f8583dbed51bd
	filesize: 3.5G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 783010

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr19_f4
    name: Chr19
    description: Data for Chr19
    data_distributions:
      - id: alspacdcs:999c860cfb0f3484d1a78ef639c594fa_filtered_19.bgen
	name: filtered_19.bgen
	description: >- 
	  An Oxford Bgen file for Chr19. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)          
	md5sum: 999c860cfb0f3484d1a78ef639c594fa
	filesize: 4.0G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 603516

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr20_f4
    name: Chr20
    description: Data for Chr20
    data_distributions:
      - id: alspacdcs:59dd1ebbefb28c2b5818fb2aca9805de_filtered_20.bgen
	name: filtered_20.bgen
	description: >- 
	  An Oxford Bgen file for Chr20. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)          
	md5sum: 59dd1ebbefb28c2b5818fb2aca9805de
	filesize: 2.8G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 617694

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr21_f4
    name: Chr21
    description: Data for Chr21
    data_distributions:
      - id: alspacdcs:dce2d85e4d08018ea365afdeac561447_filtered_21.bgen
	name: filtered_21.bgen
	description: >- 
	  An Oxford Bgen file for Chr21. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)          
	md5sum: dce2d85e4d08018ea365afdeac561447
	filesize: 1.9G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 377554

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr22_f4
    name: Chr22
    description: Data for Chr22
    data_distributions:
      - id: alspacdcs:b5ba868e802d8eee4ac76b0f878d427c_filtered_22.bgen
	name: filtered_22.bgen
	description: >- 
	  An Oxford Bgen file for Chr22. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)
	md5sum: b5ba868e802d8eee4ac76b0f878d427c
	filesize: 2.1G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 365644

  - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr23_f4
    name: Chr23
    description: Data for Chr23
    data_distributions:
      - id: alspacdcs:512a78f6c379ce43e827da44a91b4c5f_filtered_23.bgen
	name: filtered_23.bgen
	description: >- 
	  An Oxford Bgen file for Chr23. To be used with
	  alspacdcs:65bf6fc592b85ce69dec0473aca5b5cd_swapped.sample
	  file.
	  See https://doi.org/10.1101/308296 for file format details.
	  (bgen v1.2)
	md5sum: 512a78f6c379ce43e827da44a91b4c5f
	filesize: 5.9G
	filetype: .bgen
	number_of_participants: 17444
	number_of_variants: 1250218
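
Each data distribution listed in the freeze documentation carries an md5sum, so a delivered file can be checked against the catalogue before use. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the demonstration runs on a throwaway file rather than on freeze data.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_catalogue(path: Path, expected_md5: str) -> bool:
    """Compare a file's digest with the md5sum recorded in the freeze docs."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Demonstration on a temporary file containing the bytes b"hello".
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.bin"
    demo.write_bytes(b"hello")
    demo_ok = matches_catalogue(demo, "5d41402abc4b2a76b9719d911017c592")
```

For a real freeze file the call would look like `matches_catalogue(Path("filtered_22.bgen"), "b5ba868e802d8eee4ac76b0f878d427c")`, with the path adjusted to wherever the file was delivered.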

5 Sequence Data

5.1 Whole genome sequencing - G1 (wgs_hiseq_g1)

5.1.1 Description

This dataset contains whole genome sequencing for G1 individuals, part of the UK10K dataset. Reference genome build: GRCh37

5.1.2 Methodology

ALSPAC and TwinsUK cohorts were sequenced at an average read depth of 6.7x through the UK10K program (http://www.uk10k.org) using the Illumina HiSeq platform, and aligned to the GRCh37 human reference using BWA. SNV calls were made using samtools/bcftools, and VQSR and GATK were used to recall them.

Associated publication:

Please ensure you have permission to access this data (http://www.uk10k.org/data_access.html) before using it.

5.1.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:wgs_hiseq_g1_2016-08-18_f4
name: Whole genome sequencing - G1 version 2016-08-18 freeze 4
description: >-
  This is freeze 4 of version 2016-08-18 of the whole genome sequencing dataset for G1 individuals, part of the UK10K dataset.
freeze_size: 341G
linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112
git_tag: https://github.com/alspac/dataset_wgs_hiseq_g1/releases/tag/freeze4
is_current_freeze: true
freeze_number: 4
freeze_date: 2024-06-11
previous_freeze: alspacdcs:wgs_hiseq_g1_2016-08-18_f3
freeze_of_alspac_dataset_version: alspacdcs:wgs_hiseq_g1_2016-08-18
freeze_of_named_alspac_dataset: alspacdcs:wgs_hiseq_g1

has_containers:
  - id: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43 ## uuid
    name: data
    description: A dir/folder containing the freeze data files


has_parts:
- id: alspacdcs:1319d16a-a9e8-4fb7-b4ee-a02a4345d98d
  name: 1_freeze
  data_distributions:
  - id: alspacdcs:e0c5c3ec-e61b-48b6-b5f7-c7ecfdb9a014
    name: 1_freeze.vcf.gz
    md5sum: a029c1cd1a1a10e830467299fbb335dd
    filesize: 26.3GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:1f5242ec-bcbf-4dee-8eef-d81c014297cf
    name: 1_freeze.vcf.gz.csi
    md5sum: 50de551ba81402a82de9728ea95e0483
    filesize: 145.6KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43

- id: alspacdcs:a3afc031-0157-4a1a-9325-963407437cde
  name: 2_freeze
  data_distributions:
  - id: alspacdcs:86fbed6d-1d05-4654-98cf-90c84a4e060f
    name: 2_freeze.vcf.gz
    md5sum: 72babe074fc3e53b1e1315268511f7ec
    filesize: 28.8GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:02d35ade-5406-4423-b583-3f912fffd6d8
    name: 2_freeze.vcf.gz.csi
    md5sum: 5aec6c33496c048f740b592898541689
    filesize: 156.1KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43

- id: alspacdcs:7ff95792-90e8-47fe-ba96-0c547a748b4f
  name: 3_freeze
  data_distributions:
  - id: alspacdcs:416c7611-9bde-4012-a92d-b84b69448b56
    name: 3_freeze.vcf.gz
    md5sum: 9672caad30ce5207afc857f15265a56e
    filesize: 24.2GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:972ad268-ef6d-49f2-8061-d36660476167
    name: 3_freeze.vcf.gz.csi
    md5sum: 48fd68f3460095f32471b74de80ae28a
    filesize: 127.9KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43

- id: alspacdcs:b0aea300-50a3-4c75-95bc-996b04ebe1bb
  name: 4_freeze.vcf
  data_distributions:
  - id: alspacdcs:7bfde034-0983-4238-a675-d45ac002f73b
    name: 4_freeze.vcf.gz
    md5sum: 6a35500eba8d4af7a67e5af589b3e3f9
    filesize: 23.2GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:e6f65c92-3751-4bd9-af2e-767174683085
    name: 4_freeze.vcf.gz.csi
    md5sum: 20f5fb662923c30e6000ba81247e15dc
    filesize: 122.6KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43

- id: alspacdcs:42ff212b-70d7-4db8-ac36-12b06dbae07c
  name: 5_freeze.vcf
  data_distributions:
  - id: alspacdcs:88e9082a-d77a-4875-b020-67fa1631d8e4
    name: 5_freeze.vcf.gz
    md5sum: 7df166f6560000a139f551be6f21624e
    filesize: 21.6GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:e8673bf7-7375-41e1-84af-f0cf2ab8035a
    name: 5_freeze.vcf.gz.csi
    md5sum: 76adb2b829d61f9334403f55a7d071e1
    filesize: 116.1KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43

- id: alspacdcs:338e51e2-816b-4019-82c1-4caa35e5cdfe
  name: 6_freeze.vcf
  data_distributions:
  - id: alspacdcs:eb337552-2193-427e-89b0-a719eef53f20
    name: 6_freeze.vcf.gz
    md5sum: c2b56e9bc605b2fc1a54697a176e4a1c
    filesize: 21.0GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:c7dbf6cd-e8be-4678-b0a3-cda9b564689f
    name: 6_freeze.vcf.gz.csi
    md5sum: 25f5ec873519eed3a4e278cb47266f9b
    filesize: 109.9KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43

- id: alspacdcs:ea928374-1723-4d63-a556-8391affe5cc7
  name: 7_freeze.vcf
  data_distributions:
  - id: alspacdcs:8e6332a2-c901-4a37-beac-9cc4e71a6475
    name: 7_freeze.vcf.gz
    md5sum: b05886a2a8f89de82864109368d7a69c
    filesize: 19.0GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:f8b28862-e40b-430c-a080-01a30fc8e7e9
    name: 7_freeze.vcf.gz.csi
    md5sum: 69c8aedc94f876e0edaa3f8493ca2e94
    filesize: 101.8KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43

- id: alspacdcs:147ef50e-d162-4343-a437-130dd03adc4f
  name: 8_freeze.vcf
  data_distributions:
  - id: alspacdcs:10e38f1c-feb9-424c-b8f1-5ac140a141f1
    name: 8_freeze.vcf.gz
    md5sum: 832e3eca8a7672f66dec0a97d33e363f
    filesize: 18.8GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:d7e4cb16-3680-4d9a-ad8c-e13cabf6d8e6
    name: 8_freeze.vcf.gz.csi
    md5sum: af52ac2aa78f0f7f69c0bbf0cb804b40
    filesize: 92.8KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43

- id: alspacdcs:34df9600-d9f8-410a-8a2d-ce4f8927d76c
  name: 9_freeze.vcf
  data_distributions:
  - id: alspacdcs:5d8715cc-e6fb-43f2-bd9c-ce2aae728c1e
    name: 9_freeze.vcf.gz
    md5sum: be05806b3337f1fb6f884f9c10a0dedd
    filesize: 14.2GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:a66c335a-04bf-4ec9-8aa2-e922eee5b4b2
    name: 9_freeze.vcf.gz.csi
    md5sum: 65ff200207e4b9f067154e7dbbd5b14a
    filesize: 75.4KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43

- id: alspacdcs:75dfd99f-4757-4482-95a8-6a0d3d4fc16e
  name: 10_freeze.vcf
  data_distributions:
  - id: alspacdcs:c312b734-ed43-4109-a634-8a0bb4ff29b3
    name: 10_freeze.vcf.gz
    md5sum: 8dc40e17fd16a4f7fd46947cd8efba37
    filesize: 16.3GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:9d2dd9af-158d-42f4-94ad-1ee35bd17691
    name: 10_freeze.vcf.gz.csi
    md5sum: 344a55d89f42977d545dd73768bee6b1
    filesize: 85.5KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43

- id: alspacdcs:fee168b6-1ec7-4793-a731-0e007b0afb69
  name: 11_freeze.vcf
  data_distributions:
  - id: alspacdcs:05ada9db-b03c-442f-a84d-cac99eeca001
    name: 11_freeze.vcf.gz
    md5sum: da169eb3d82bb130c3eba955ec1381d9
    filesize: 16.4GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:2436b27e-d541-4c76-8a29-edea10abc75c
    name: 11_freeze.vcf.gz.csi
    md5sum: 1f8573df3e205babd9a38cbd3a3769c7
    filesize: 85.2KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43

- id: alspacdcs:0ac098c1-38be-4914-b2dd-5ea80af419ec
  name: 12_freeze.vcf
  data_distributions:
  - id: alspacdcs:3dd22dc0-76b8-4373-b4f2-e9bbf4e3a373
    name: 12_freeze.vcf.gz
    md5sum: 47b92a6ede9e9df895c2134b70c0c1bc
    filesize: 15.7GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:b37fa69b-0018-4521-9c9d-a840a9b9d7a9
    name: 12_freeze.vcf.gz.csi
    md5sum: 76b3de73d4576dc1b3d90b30677d50b8
    filesize: 85.5KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43

- id: alspacdcs:c8b764ef-bbd1-4a32-9c26-d673a03fb23f
  name: 13_freeze.vcf
  data_distributions:
  - id: alspacdcs:64e89d80-a86c-46dd-8f87-4508707425fe
    name: 13_freeze.vcf.gz
    md5sum: ccd89b86e9421cd0f1ebfa9a4cf43228
    filesize: 11.8GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:4dbdfee7-98f4-440b-80e0-59efec244b0e
    name: 13_freeze.vcf.gz.csi
    md5sum: c87bf856f671a839b10b0d69cadd0d02
    filesize: 62.1KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43

- id: alspacdcs:099c8270-865b-4aba-9c34-085743639bbc
  name: 14_freeze.vcf
  data_distributions:
  - id: alspacdcs:cf599ca2-dd11-460b-a0a7-f85ae126f264
    name: 14_freeze.vcf.gz
    md5sum: 5d9a04231afd3784e205ff939da426ba
    filesize: 10.7GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:fd8c3b0d-5842-4881-a219-3090898b1570
    name: 14_freeze.vcf.gz.csi
    md5sum: 0a5a77211053a1ed7b2ce33a8e8b612b
    filesize: 56.6KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43

- id: alspacdcs:3ecd2d87-7a67-4d75-9a7d-12790013aeae
  name: 15_freeze.vcf
  data_distributions:
  - id: alspacdcs:6ab8edc2-13a6-471d-a85b-c040db7ab3bd
    name: 15_freeze.vcf.gz
    md5sum: 8779e214368a81a82a3831a6099a4e94
    filesize: 9.7GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:b14b07fc-a1f8-4ded-86e1-ade43351df3f
    name: 15_freeze.vcf.gz.csi
    md5sum: d1fdb4fbc9ac84cd545802728ad7fb22
    filesize: 51.7KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43

- id: alspacdcs:d63be5f3-15c4-45fb-97e2-2f3a874455d2
  name: 16_freeze.vcf
  data_distributions:
  - id: alspacdcs:409936a1-052f-458d-aa6f-394852a1463c
    name: 16_freeze.vcf.gz
    md5sum: 74daf54822613ae3fd731e279026ba6a
    filesize: 10.6GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:9b059dbb-4356-4c64-9a85-3bd254ae5cd9
    name: 16_freeze.vcf.gz.csi
    md5sum: 3ec8fb57ca2b147816e0a67f694b1162
    filesize: 50.4KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43

- id: alspacdcs:4305ac0f-4c77-40e4-9110-888755c835f8
  name: 17_freeze.vcf
  data_distributions:
  - id: alspacdcs:033eef37-655b-404c-8f8c-544477499023
    name: 17_freeze.vcf.gz
    md5sum: da9c1da2da281f7a1545af31faea13a3
    filesize: 9.1GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:951a4ed9-43d1-4a70-9d30-6c0c31282413
    name: 17_freeze.vcf.gz.csi
    md5sum: e26d19b6435cc8cdfabad63888203371
    filesize: 49.9KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43

- id: alspacdcs:4f48380c-b623-4926-aa2e-c68f84f49241
  name: 18_freeze.vcf
  data_distributions:
  - id: alspacdcs:6f05ec3b-744c-4094-a1cf-4a9b45164872
    name: 18_freeze.vcf.gz
    md5sum: a7ace5116a6ec3056300504f64c406e3
    filesize: 9.4GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:244635ac-290b-40c5-bd12-f9f63711eec1
    name: 18_freeze.vcf.gz.csi
    md5sum: b4c0eb6f8bcd5faff6d23f6f11004a61
    filesize: 48.5KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43

- id: alspacdcs:712d627b-e31d-4981-a968-0f6f9f8ee2ac
  name: 19_freeze.vcf
  data_distributions:
  - id: alspacdcs:d69e019e-1e2f-46c9-b2c8-b2cc4ccbbb2c
    name: 19_freeze.vcf.gz
    md5sum: 73079fb7f693e5f7ff8c23fc72a0d62b
    filesize: 7.0GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:c98d8665-a945-4778-a980-2bdf102f6f14
    name: 19_freeze.vcf.gz.csi
    md5sum: c1b4df2d51ac20fb5fe3335b59f844c4
    filesize: 35.7KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43

- id: alspacdcs:7a81d5a6-ee22-4f8a-9920-094592dab855
  name: 20_freeze.vcf
  data_distributions:
  - id: alspacdcs:fd23d6f9-156e-482d-9d97-bd210a0d3344
    name: 20_freeze.vcf.gz
    md5sum: 46c2a5875f1e31137cd0e7a42a98ee04
    filesize: 7.5GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:7dd60e32-c766-46be-8e25-cf7fa89fa381
    name: 20_freeze.vcf.gz.csi
    md5sum: cea8babd39bc5f0e0640c668fc9854d5
    filesize: 38.2KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43

- id: alspacdcs:c32cb1b8-623d-444f-a0db-b9b20c5f7056
  name: 21_freeze.vcf
  data_distributions:
  - id: alspacdcs:12233fc0-383d-4b97-a45d-de478ef165b8
    name: 21_freeze.vcf.gz
    md5sum: 68ad67687100082013805e8bcd63b989
    filesize: 4.3GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:964ef7af-f7d7-4747-b035-08a2d8069b5b
    name: 21_freeze.vcf.gz.csi
    md5sum: 1ef4648fe43cf331b600c610e1daaa4c
    filesize: 22.1KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43

- id: alspacdcs:863f95cc-6ec2-406a-a072-6740a77dbcf6
  name: 22_freeze.vcf
  data_distributions:
  - id: alspacdcs:120d67ae-8c24-48d6-ad74-e9ec1865d3b4
    name: 22_freeze.vcf.gz
    md5sum: 11aac1ce01ecf5fa92b1f0b5c40209c7
    filesize: 4.4GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:a45e4ab3-57e4-45ab-a149-87e5fc49e534
    name: 22_freeze.vcf.gz.csi
    md5sum: 73bc5296a886342eb1a10e249f314c49
    filesize: 22.1KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43

- id: alspacdcs:b445d1b2-d59c-4f14-b0a5-f9bf636fecfd
  name: X_freeze.vcf
  data_distributions:
  - id: alspacdcs:eb02a97b-4ea9-4769-a6ca-a5cbdfb65b5f
    name: X_freeze.vcf.gz
    md5sum: 1dd617a386e1fdb0273dcfc9e1231d32
    filesize: 10.5GB
    filetype: vcf.gz
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
  - id: alspacdcs:463b2334-a8c0-45aa-a436-93e8e9752fd8
    name: X_freeze.vcf.gz.csi
    md5sum: d687ace453fb2a58484fa1db45c0e7cd
    filesize: 96.0KB
    filetype: .csi
    belongs_to_container: alspacdcs:fb4196e5-a658-4390-be51-217c07ad2a43
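
Each data distribution in a freeze document lists an md5sum, which can be used to confirm that a received file is intact. The following is a minimal Python sketch of such a check, not an official ALSPAC tool. The function names `md5_of_file` and `verify_files` are hypothetical, and the sketch assumes the files sit in a single local directory.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-gigabyte VCFs are never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(expected, directory="."):
    """Compare files in `directory` with the md5sums from a freeze document.

    `expected` maps file name -> md5sum (hypothetical input format).
    Returns a dict of file name -> "ok", "mismatch" or "missing".
    """
    results = {}
    for name, md5sum in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == md5sum.lower():
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

For example, `verify_files({"14_freeze.vcf.gz.csi": "0a5a77211053a1ed7b2ce33a8e8b612b"}, "freeze_dir")` would report whether that index file matches the catalogue entry, where `freeze_dir` is a placeholder for wherever the files were stored.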

5.2 Whole exome sequencing - G0 & G1 (wes_novaseq_g0_g1)

5.2.1 Description

This dataset contains whole exome sequencing data for G0 and G1 individuals. It was generated at the Sanger Institute as part of an initiative to sequence multiple birth cohorts: ALSPAC, MCS and BiB. As part of this initiative, the exome sequencing data will also be available via EGA, but researchers will still gain access through ALSPAC's project approval system. Reference genome build: GRCh38

5.2.2 Methodology

Exome sequencing was conducted on DNA for 12,374 participants (8,605 children and 3,389 of their parents) at the Sanger Institute, using Illumina NovaSeq. Reads were aligned to GRCh38 with BWA-MEM. There was an average on-target depth of ~62X for ALSPAC.

QC was conducted on the dataset at the Sanger Institute; details are given in the associated publication (Koko et al., 2024). Sample QC was done before variant calling (base-calls after sequencing, alignment quality, CRAM file quality) and after variant calling (PCA, comparison to array data, relatedness). Integrated variant QC removed potentially false positive variants using a trained random forest model. Genotype QC removed low-quality individual genotype calls.

Single nucleotide variant (SNV) and small insertion/deletion (indel) calling was conducted with GATK HaplotypeCaller, GenomicsDBImport and GenotypeGVCFs (GATK version 4.2.4.0 for ALSPAC) following GATK best practices (Van der Auwera and O'Connor, 2020).

Associated publication:

  • doi.org/10.12688/wellcomeopenres.22697.1

5.2.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_f4
name: >- 
  Whole Exome Sequencing - Novaseq - G0 & G1 version 2024-09-20 freeze 4
description: >-
  This is the first iteration of wes_novaseq_g0_g1, introduced in freeze 4. It contains data in VCF 4.2 format. It contains the majority of the G1 cohort (n=~8296), accompanied by G0 mothers (n=~1642) and partners (n=~1630) to create trios. Participants who have withdrawn their consent are removed and an omics ID is applied according to the freeze. Participants can withdraw their consent at any time and are then removed from the dataset, so the number of available individuals may decrease over time.

  The exome sequencing (ES) was conducted at the Sanger Institute as part of an effort to exome sequence ALSPAC, MCS and BiB. All ES data was quality controlled at the Sanger Institute prior to this ALSPAC release, as extensively documented in the associated publication (see below).

  In brief (excerpt from the associated publication, Koko et al., 2024):

    "Sample QC: 
      * Before variant calling: Samples were removed if they failed one or more filters based on quality of base-calls after sequencing, or quality of the CRAM files of aligned reads. The remainder then underwent variant calling.
      * After variant calling: We assigned individuals to populations using principal component analysis (PCA), then identified and removed individuals who were outliers on one or more variant-based metrics within each of the populations. We compared the exome data to genotyping array data from the same samples and removed samples that did not match as expected, since these could be sample mix-ups. The samples were also checked for unexpected relatedness; samples showing conflicts between reported and inferred relatedness were removed. This sample QC was split in two separate steps, before and after variant and genotype QC, as detailed in the coming sections. 
    Integrated variant and genotype QC:
      * Variant QC: We removed candidate variants which may not be real, instead being artefacts or mapping errors, using a trained random forest model to distinguish likely true positives from likely false positives. 
      * Genotype QC: We removed low-quality individual genotype calls from the dataset. This was done in conjunction with variant QC, as we will explain below."

  For extended information such as thresholds, please see the publication.

  Associated publication:
    Koko et al., 2024
    DOI: doi.org/10.12688/wellcomeopenres.22697.1


freeze_size: 167G
linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112
git_tag: https://github.com/alspac/dataset_wes_novaseq_g0_g1/releases/tag/freeze4
is_current_freeze: true
freeze_number: 4
freeze_date: 2024-06-11
previous_freeze: N/A
freeze_of_alspac_dataset_version: alspacdcs:wes_novaseq_g0_g1_2024-09-20
freeze_of_named_alspac_dataset: alspacdcs:wes_novaseq_g0_g1

has_parts:
  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr1_data_f4
    name: chr1_data
    data_distributions:
      - id: alspacdcs:chr1_data.vcf.gz
	name: chr1_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome 1, to be used with chr1_data.vcf.gz.csi
	md5sum: c61d331e2c58b800516da170853f8220
	filesize: 17G
	filetype:  vcf.gz
	number_of_participants: 11500
	number_of_variants: 370645 # bcftools query -f '%POS\n' file.vcf.gz | wc -l
      - id: alspacdcs:_chr1_data.vcf.gz.csi
	name: chr1_data.vcf.gz.csi
	description: >-
	  index for vcf file - chr1_data.vcf.gz, generated using bcftools v1.19.
	md5sum: f413bc9edb1d2a959f38790a3c72656c
	filesize: 64K
	filetype: .csi

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr2_data_f4
    name: chr2_data
    data_distributions:
      - id: alspacdcs:chr2_data.vcf.gz
	name: chr2_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome 2, to be used with chr2_data.vcf.gz.csi
	md5sum: 6c6d6b76a6792444058ff19c0036381c
	filesize: 12G
	filetype:  vcf.gz
	number_of_participants: 11500
	number_of_variants: 272150
      - id: alspacdcs:_chr2_data.vcf.gz.csi
	name: chr2_data.vcf.gz.csi
	description: >-
	  index for vcf file - chr2_data.vcf.gz, generated using bcftools v1.19.
	md5sum: dc7fe92d532898ecad15efe923c48a12
	filesize: 48K
	filetype: .csi 

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr3_data_f4
    name: chr3_data
    data_distributions:
      - id: alspacdcs:chr3_data.vcf.gz
	name: chr3_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome 3, to be used with chr3_data.vcf.gz.csi
	md5sum: 1bc654effca79e7c67b0e0e9cd180064
	filesize: 9.1G
	filetype:  vcf.gz
	number_of_participants: 11500 
	number_of_variants: 206875
      - id: alspacdcs:_chr3_data.vcf.gz.csi
	name: chr3_data.vcf.gz.csi
	description: >-
	  index for vcf file - chr3_data.vcf.gz, generated using bcftools v1.19.
	md5sum: 5e9918574d70f08d50c52bc755e99a57
	filesize: 48K
	filetype: .csi 

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr4_data_f4
    name: chr4_data
    data_distributions:
      - id: alspacdcs:chr4_data.vcf.gz
	name: chr4_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome 4, to be used with chr4_data.vcf.gz.csi
	md5sum: 409b4664817cbccdb04c64ef50c20260
	filesize: 6.2G
	filetype:  vcf.gz
	number_of_participants: 11500 
	number_of_variants: 140675
      - id: alspacdcs:_chr4_data.vcf.gz.csi
	name: chr4_data.vcf.gz.csi
	description: >-
	  index for vcf file - chr4_data.vcf.gz, generated using bcftools v1.19.
	md5sum: 2e97a34c0d5de1f4c20a96013ddd3954
	filesize: 32K
	filetype: .csi 

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr5_data_f4
    name: chr5_data
    data_distributions:
      - id: alspacdcs:chr5_data.vcf.gz
	name: chr5_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome 5, to be used with chr5_data.vcf.gz.csi
	md5sum: ca1aefe6597d304995b6fadf26cc1dc6
	filesize: 7.1G
	filetype:  vcf.gz
	number_of_participants: 11500 
	number_of_variants: 161010
      - id: alspacdcs:_chr5_data.vcf.gz.csi
	name: chr5_data.vcf.gz.csi
	description: >-
	  index for vcf file - chr5_data.vcf.gz, generated using bcftools v1.19.
	md5sum: bb0aa89a3bf4ea37d0766437bf954fde
	filesize: 32K
	filetype: .csi 

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr6_data_f4
    name: chr6_data
    data_distributions:
      - id: alspacdcs:chr6_data.vcf.gz
	name: chr6_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome 6, to be used with chr6_data.vcf.gz.csi
	md5sum: e0eaf0d3a06ce9b9b74be440d39702f5
	filesize: 8.1G
	filetype:  vcf.gz
	number_of_participants: 11500 
	number_of_variants: 181754
      - id: alspacdcs:_chr6_data.vcf.gz.csi
	name: chr6_data.vcf.gz.csi
	description: >-
	  index for vcf file - chr6_data.vcf.gz, generated using bcftools v1.19.
	md5sum: d106a089e187fd067841488006d412f3
	filesize: 48K
	filetype: .csi 

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr7_data_f4
    name: chr7_data
    data_distributions:
      - id: alspacdcs:chr7_data.vcf.gz
	name: chr7_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome 7, to be used with chr7_data.vcf.gz.csi
	md5sum: e433cbd47a52fb3a0876a520a4134d31
	filesize: 8.1G
	filetype:  vcf.gz
	number_of_participants: 11500 
	number_of_variants: 181925
      - id: alspacdcs:_chr7_data.vcf.gz.csi
	name: chr7_data.vcf.gz.csi
	description: >-
	  index for vcf file - chr7_data.vcf.gz, generated using bcftools v1.19.
	md5sum: 6c2fa683baf0b095cd86d877171e481f
	filesize: 48K
	filetype: .csi 

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr8_data_f4
    name: chr8_data
    data_distributions:
      - id: alspacdcs:chr8_data.vcf.gz
	name: chr8_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome 8, to be used with chr8_data.vcf.gz.csi
	md5sum: 1c9a537e557fb5fdd125b1025fbce749
	filesize: 5.9G
	filetype:  vcf.gz
	number_of_participants: 11500 
	number_of_variants: 133894
      - id: alspacdcs:_chr8_data.vcf.gz.csi
	name: chr8_data.vcf.gz.csi
	description: >-
	  index for vcf file - chr8_data.vcf.gz, generated using bcftools v1.19.
	md5sum: 695209b64080ccbe035f51d4d9b92566
	filesize: 32K
	filetype: .csi 

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr9_data_f4
    name: chr9_data
    data_distributions:
      - id: alspacdcs:ch9_data.vcf.gz
	name: chr9_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome 9, to be used with chr9_data.vcf.gz.csi
	md5sum: 2fbd587057be6f3e6e40bfb9d4cdd072
	filesize: 7.1G
	filetype:  vcf.gz
	number_of_participants: 11500 
	number_of_variants: 161039
      - id: alspacdcs:_chr9_data.vcf.gz.csi
	name: chr9_data.vcf.gz.csi
	description: >-
	  index for vcf file - chr9_data.vcf.gz, generated using bcftools v1.19.
	md5sum: 0aa0503dfe1a267fef70c19b8ec5ce5d
	filesize: 32K
	filetype: .csi 

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr10_data_f4
    name: chr10_data
    data_distributions:
      - id: alspacdcs:chr10_data.vcf.gz
	name: chr10_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome 10, to be used with chr10_data.vcf.gz.csi
	md5sum: 3fad84065f76243852cb94f191aafc71
	filesize: 6.6G
	filetype:  vcf.gz
	number_of_participants: 11500 
	number_of_variants: 149730
      - id: alspacdcs:_chr10_data.vcf.gz.csi
	name: chr10_data.vcf.gz.csi
	description: >-
	  index for vcf file - chr10_data.vcf.gz, generated using bcftools v1.19.
	md5sum: 3f74f64664ce6a5631366d98946546a6
	filesize: 32K
	filetype: .csi 

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr11_data_f4
    name: chr11_data
    data_distributions:
      - id: alspacdcs:chr11_data.vcf.gz
	name: chr11_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome 11, to be used with chr11_data.vcf.gz.csi
	md5sum: 67421ec85241f6162eb9a7ab29e1be6b
	filesize: 11G 
	filetype:  vcf.gz
	number_of_participants: 11500 
	number_of_variants: 227858
      - id: alspacdcs:_chr11_data.vcf.gz.csi
	name: chr11_data.vcf.gz.csi
	description: >-
	  index for vcf file - chr11_data.vcf.gz, generated using bcftools v1.19.
	md5sum: eec8568f19181000442d8948edcdc65d
	filesize: 32K
	filetype: .csi 

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr12_data_f4
    name: chr12_data
    data_distributions:
      - id: alspacdcs:chr12_data.vcf.gz
	name: chr12_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome 12, to be used with chr12_data.vcf.gz.csi
	md5sum: eb497a8adb2372048ed1badaecb92a96
	filesize: 8.5G
	filetype:  vcf.gz
	number_of_participants: 11500 
	number_of_variants: 193518
      - id: alspacdcs:_chr12_data.vcf.gz.csi
	name: chr12_data.vcf.gz.csi
	description: >-
	  index for vcf file - chr12_data.vcf.gz, generated using bcftools v1.19.
	md5sum: 86a00f322b4c7eb4376c5d5f49ebc8d8
	filesize: 32K
	filetype: .csi 

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr13_data_f4
    name: chr13_data
    data_distributions:
      - id: alspacdcs:chr13_data.vcf.gz
	name: chr13_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome 13, to be used with chr13_data.vcf.gz.csi
	md5sum: 0e8074d71e841cdebc9fd86247c46c3f
	filesize: 2.8G
	filetype:  vcf.gz
	number_of_participants: 11500 
	number_of_variants: 63931
      - id: alspacdcs:_chr13_data.vcf.gz.csi
	name: chr13_data.vcf.gz.csi
	description: >-
	  index for vcf file - chr13_data.vcf.gz, generated using bcftools v1.19.
	md5sum: 9962040c9d4122ed040f924eb8d2174f
	filesize: 16K
	filetype: .csi 

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr14_data_f4
    name: chr14_data
    data_distributions:
      - id: alspacdcs:chr14_data.vcf.gz
	name: chr14_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome 14, to be used with chr14_data.vcf.gz.csi
	md5sum: e7b8b73da8ddd0bd666f988d5d9d049e
	filesize: 5.7G
	filetype:  vcf.gz
	number_of_participants: 11500 
	number_of_variants: 128137
      - id: alspacdcs:_chr14_data.vcf.gz.csi
	name: chr14_data.vcf.gz.csi
	description: >-
	  index for vcf file - chr14_data.vcf.gz, generated using bcftools v1.19.
	md5sum: 63e6e3769d9bb411288d2b5174c61d9d
	filesize: 32K
	filetype: .csi 

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr15_data_f4
    name: chr15_data
    data_distributions:
      - id: alspacdcs:chr15_data.vcf.gz
	name: chr15_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome 15, to be used with chr15_data.vcf.gz.csi
	md5sum: 19ed6a943eb7d379f693b2a9e0f7ff22
	filesize: 5.6G
	filetype:  vcf.gz
	number_of_participants: 11500 
	number_of_variants: 127646
      - id: alspacdcs:_chr15_data.vcf.gz.csi
	name: chr15_data.vcf.gz.csi
	description: >-
	  index for vcf file - chr15_data.vcf.gz, generated using bcftools v1.19.
	md5sum: abc35107b45198cf856fbb943c94c5ba
	filesize: 32K
	filetype: .csi 

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr16_data_f4
    name: chr16_data
    data_distributions:
      - id: alspacdcs:chr16_data.vcf.gz
	name: chr16_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome 16, to be used with chr16_data.vcf.gz.csi
	md5sum: b5bae04936506ba275664aafd595d99d
	filesize: 8.4G
	filetype:  vcf.gz
	number_of_participants: 11500 
	number_of_variants: 186300
      - id: alspacdcs:_chr16_data.vcf.gz.csi
	name: chr16_data.vcf.gz.csi
	description: >-
	  index for vcf file - chr16_data.vcf.gz, generated using bcftools v1.19.
	md5sum: 8e97703c8f865ef4cb90db140903022f
	filesize: 32K 
	filetype: .csi 

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr17_data_f4
    name: chr17_data
    data_distributions:
      - id: alspacdcs:chr17_data.vcf.gz
	name: chr17_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome 17, to be used with chr17_data.vcf.gz.csi
	md5sum: 471702bad7d86459c024fb468c7a7ee9
	filesize: 10G
	filetype:  vcf.gz
	number_of_participants: 11500 
	number_of_variants: 224774
      - id: alspacdcs:_chr17_data.vcf.gz.csi
	name: chr17_data.vcf.gz.csi
	description: >-
	  index for vcf file - chr17_data.vcf.gz, generated using bcftools v1.19.
	md5sum: 4782b222eda4bfcb871f330fa2a2728a
	filesize: 32K
	filetype: .csi 

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr18_data_f4
    name: chr18_data
    data_distributions:
      - id: alspacdcs:chr18_data.vcf.gz
	name: chr18_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome 18, to be used with chr18_data.vcf.gz.csi
	md5sum: 3745cae09c423dd4cd00d772c82243d2
	filesize: 2.5G
	filetype:  vcf.gz
	number_of_participants: 11500 
	number_of_variants: 57017
      - id: alspacdcs:_chr18_data.vcf.gz.csi
	name: chr18_data.vcf.gz.csi
	description: >-
	  index for vcf file - chr18_data.vcf.gz, generated using bcftools v1.19.
	md5sum: 7439d7225754bfb05a5d60544d8ec763
	filesize: 16K 
	filetype: .csi 

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr19_data_f4
    name: chr19_data
    data_distributions:
      - id: alspacdcs:chr19_data.vcf.gz
	name: chr19_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome 19, to be used with chr19_data.vcf.gz.csi
	md5sum: e1ca35ee4003146b6d78aa60a11e019c
	filesize: 13G
	filetype:  vcf.gz
	number_of_participants: 11500 
	number_of_variants: 271080
      - id: alspacdcs:_chr19_data.vcf.gz.csi
	name: chr19_data.vcf.gz.csi
	description: >-
	  index for vcf file - chr19_data.vcf.gz, generated using bcftools v1.19.
	md5sum: 17f12041e5261526ae320439f2736fa4
	filesize: 32K
	filetype: .csi 

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr20_data_f4
    name: chr20_data
    data_distributions:
      - id: alspacdcs:chr20_data.vcf.gz
	name: chr20_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome 20, to be used with chr20_data.vcf.gz.csi
	md5sum: 7c8cc69afb82df0442116e4dbfd99269
	filesize: 4.3G
	filetype:  vcf.gz
	number_of_participants: 11500 
	number_of_variants: 96655
      - id: alspacdcs:_chr20_data.vcf.gz.csi
	name: chr20_data.vcf.gz.csi
	description: >-
	  index for vcf file - chr20_data.vcf.gz, generated using bcftools v1.19.
	md5sum: 2e4526809c85ae04c1a7690a430e3fad
	filesize: 16K
	filetype: .csi 

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr21_data_f4
    name: chr21_data
    data_distributions:
      - id: alspacdcs:chr21_data.vcf.gz
	name: chr21_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome 21, to be used with chr21_data.vcf.gz.csi
	md5sum: 46202d0ba651b0ec1c9b9fbc980fdcb7
	filesize: 1.9G
	filetype:  vcf.gz
	number_of_participants: 11500 
	number_of_variants: 42207
      - id: alspacdcs:_chr21_data.vcf.gz.csi
	name: chr21_data.vcf.gz.csi
	description: >-
	  index for vcf file - chr21_data.vcf.gz, generated using bcftools v1.19.
	md5sum: 4e010094593ac87fce4fe5d55cc80bee
	filesize: 16K
	filetype: .csi 

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chr22_data_f4
    name: chr22_data
    data_distributions:
      - id: alspacdcs:chr22_data.vcf.gz
	name: chr22_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome 22, to be used with chr22_data.vcf.gz.csi
	md5sum: a423d731c368b4ce1f30a896ab0f1c18
	filesize: 4.3G
	filetype:  vcf.gz
	number_of_participants: 11500 
	number_of_variants: 94446
      - id: alspacdcs:_chr22_data.vcf.gz.csi
	name: chr22_data.vcf.gz.csi
	description: >-
	  index for vcf file - chr22_data.vcf.gz, generated using bcftools v1.19.
	md5sum: ac6eb2a7076ec0221b86c0ac8300c1af
	filesize: 16K
	filetype: .csi 

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chrX_data_f4
    name: chrX_data
    data_distributions:
      - id: alspacdcs:chrX_data.vcf.gz
	name: chrX_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome X, to be used with chrX_data.vcf.gz.csi
	md5sum: 1ca33edf2265f47f61e34ea4462e5afd
	filesize: 3.8G
	filetype:  vcf.gz
	number_of_participants: 11500 
	number_of_variants: 86925
      - id: alspacdcs:_chrX_data.vcf.gz.csi
	name: chrX_data.vcf.gz.csi
	description: >-
	  index for vcf file - chrX_data.vcf.gz, generated using bcftools v1.19.
	md5sum: 1f133d314e9acff9c0075184d495792a
	filesize: 32K
	filetype: .csi 

  - id: alspacdcs:wes_novaseq_g0_g1_2024-09-20_chrY_data_f4
    name: chrY_data
    data_distributions:
      - id: alspacdcs:chrY_data.vcf.gz
	name: chrY_data.vcf.gz
	description: >- 
	  vcf file containing all participants for chromosome Y, to be used with chrY_data.vcf.gz.csi
	md5sum: 51a2d25baf60cbaef21b457df2c7530b
	filesize: 368K
	filetype:  vcf.gz
	number_of_participants: 11500 
	number_of_variants: 9
      - id: alspacdcs:_chrY_data.vcf.gz.csi
	name: chrY_data.vcf.gz.csi
	description: >-
	  index for vcf file - chrY_data.vcf.gz, generated using bcftools v1.19.
	md5sum: e370622e50f6b9b847ff0925eee02313
	filesize: 512
	filetype: .csi
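
The comment beside number_of_variants for chr1_data.vcf.gz indicates that the counts were obtained with `bcftools query -f '%POS\n' file.vcf.gz | wc -l`, which prints one line per VCF record. Where bcftools is not available, a rough equivalent can be sketched in Python by counting non-header lines. This is an illustrative sketch only: the function name `count_variant_records` is hypothetical, and it relies on bgzip-compressed files being readable by the standard `gzip` module.

```python
import gzip


def count_variant_records(vcf_gz_path):
    """Count data records (non-header lines) in a gzip- or bgzip-compressed VCF.

    Header lines start with "#"; every other non-empty line is one record,
    so a multi-allelic site stored on one line is counted once.
    """
    count = 0
    with gzip.open(vcf_gz_path, "rt") as handle:
        for line in handle:
            if line.strip() and not line.startswith("#"):
                count += 1
    return count
```

Calling `count_variant_records("chr1_data.vcf.gz")` on a received file gives a number that can be compared with the number_of_variants value listed for that file. Reading a 17G file this way is slow, so bcftools remains the more practical choice where it is installed.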

5.3 Whole exome sequencing - G1 (wes_novaseq_g1)

5.3.1 Description

This dataset contains whole exome sequencing for G1 individuals. It was generated at the Broad Institute for ~2900 G1 individuals. Reference genome build: GRCh38

5.3.2 Methodology

The exomes returned from the Broad Institute did not undergo PCA or relatedness filtering; they were instead provided as raw VCF data. The following thresholds were applied to the samples:

  • Chimera rate: Less than 0.05
  • Contamination rate: Less than 0.10
  • PF aligned rate: More than 0.60

87 individuals believed to be sample mismatches were removed from the dataset. These exomes had a discordance rate above 0.05 when compared to existing array data using bcftools gtcheck.
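
The sample-level thresholds and the array discordance cut-off described above can be expressed as a simple filter. The Python sketch below is illustrative only and is not the pipeline used by the Broad Institute or ALSPAC. The `SampleMetrics` fields and the function names are hypothetical, and the metrics are assumed to be available as fractions between 0 and 1.

```python
from dataclasses import dataclass

# Thresholds as stated in the Methodology text.
MAX_CHIMERA_RATE = 0.05        # keep samples with chimera rate below this
MAX_CONTAMINATION_RATE = 0.10  # keep samples with contamination rate below this
MIN_PF_ALIGNED_RATE = 0.60     # keep samples with PF aligned rate above this
MAX_ARRAY_DISCORDANCE = 0.05   # samples above this were treated as mismatches


@dataclass
class SampleMetrics:
    """Per-sample QC metrics (hypothetical field names)."""
    sample_id: str
    chimera_rate: float
    contamination_rate: float
    pf_aligned_rate: float
    array_discordance: float


def passes_qc(m):
    """Return True if a sample meets all four criteria."""
    return (
        m.chimera_rate < MAX_CHIMERA_RATE
        and m.contamination_rate < MAX_CONTAMINATION_RATE
        and m.pf_aligned_rate > MIN_PF_ALIGNED_RATE
        and m.array_discordance <= MAX_ARRAY_DISCORDANCE
    )


def filter_samples(samples):
    """Return only the samples that pass QC, preserving input order."""
    return [m for m in samples if passes_qc(m)]
```

The first three comparisons are strict because the text states "less than" and "more than"; the discordance check keeps a sample at exactly 0.05 because only rates above 0.05 are described as removed.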

Associated publications:

5.3.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:wes_novaseq_g1_204-04-12_f4
name: >- 
  Whole Exome Sequencing - Novaseq - G1 version 2024-04-09 freeze 4
description: >-
  This is the first iteration of wes_novaseq_g1, introduced in freeze 4. It contains data in VCF 4.2 format. It is a subset of the G1 cohort, with participants who have withdrawn their consent removed and omics IDs applied according to the freeze. Samples were selected for whole exome sequencing at the Broad Institute from the G1 cohort (the cohort of index children) and were from subjects who were singletons/unrelated and of European/British ancestry, had blood-derived DNA available, and had been genotyped on a whole genome genotyping array.

  QC was performed by the Broad Institute. The following thresholds were applied:
  Chimera rate < 0.05
  Contamination rate < 0.10
  PF aligned rate > 0.60

  87 individuals believed to be sample mismatches were removed from the dataset. These exomes had a discordance rate above 0.05 when compared to existing array data using bcftools gtcheck.

  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9980234/ describes this dataset in supplementary materials. 

freeze_size: 28G
linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112
git_tag: https://github.com/alspac/dataset_wes_novaseq_g1/releases/tag/freeze4
is_current_freeze: true
freeze_number: 4
freeze_date: 2024-06-11
previous_freeze: N/A

freeze_of_alspac_dataset_version: alspacdcs:wes_novaseq_g1_2024-03-26
freeze_of_named_alspac_dataset: alspacdcs:wes_novaseq_g1

has_parts:
  - id: alspacdcs:wes_novaseq_g1_2024-04-09_all_chr_f4
    name: all_chr
    description: >-
      All chromosomes and all participants within the dataset contained within a single vcf version 4.2 file, which has been compressed using bcftools 1.19.
    data_distributions:
      - id: alspacdcs:3e3bde5e-b410-4135-981b-f923f57a6ce0_all_chr.vcf.gz
	name: all_chr.vcf.gz
	description: >- 
	  vcf file containing all participants and chromosomes, to be used with all_chr.vcf.gz.csi
	md5sum: 1f75c2f55107aceaf9d4e7edb19fd364
	filesize: 28G
	filetype:  vcf.gz
	number_of_participants: 2879
	#number_of_gene_expression_probe_values: 

      - id: alspacdcs:4da2f634-bdb9-4b21-b051-6fa469ba711c_all_chr.vcf.gz.csi
	name: all_chr.vcf.gz.csi
	description: >-
	  index for vcf file - all_chr.vcf.gz, generated using bcftools v1.19.
	md5sum: ff4baac889f49b1cb1611c3c63627890
	filesize: 800K
	filetype: .csi
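
The freeze documents give sizes as short strings such as 28G, 800K, 62.1KB or a bare byte count such as 512. To compare the sum of the listed file sizes against freeze_size, these strings need converting to bytes. The Python sketch below shows one way to do this. It assumes binary units (1K = 1024 bytes), as produced by tools such as `du -h`; the catalogue does not state which convention was used, and the function names are hypothetical.

```python
import re

# Assumption: binary multiples, as in `du -h` / `ls -lh` output.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

_SIZE_PATTERN = re.compile(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?)B?\s*", re.IGNORECASE)


def parse_size(text):
    """Convert a size string such as '28G', '62.1KB' or '512' to bytes."""
    match = _SIZE_PATTERN.fullmatch(str(text))
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit.upper()])


def total_size(sizes):
    """Sum a collection of size strings, returning bytes."""
    return sum(parse_size(size) for size in sizes)
```

Because the listed sizes are rounded, a total computed this way will only approximately match freeze_size; for instance, `total_size(["28G", "800K"])` for the two files above is slightly more than `parse_size("28G")`.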

6 Epigenetic Data

6.1 DNA methylation - 450k - G0 mothers + G1 (dnam_450_g0m_g1)

6.1.1 Description

This dataset contains Illumina Infinium HumanMethylation450K BeadChip array data on G0 mothers at two timepoints (pregnancy and middle age) and G1 participants at three timepoints (birth, childhood and adolescence), giving five timepoints in total.

This dataset was generated as part of the Accessible Resource for Integrated Epigenomics Studies (http://www.ariesepigenomics.org.uk/). This dataset is superseded by dnam_epic450_g0_g1.

6.1.2 Methodology

Associated publication:

Associated R package:

6.1.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:dnam_450_g0m_g1_2016-05-03_f4
name: >-
  DNA methylation - 450k - G0 mothers + G1 version 2016-05-03 Freeze 4
description: >-
  This is the fourth freeze of the 2016-05-03 version of
  dnam_450_g0m_g1 dataset.

freeze_size: 18G
linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112
git_tag: https://github.com/alspac/dataset_dnam_450_g0m_g1/releases/tag/Freeze4
is_current_freeze: true
freeze_number: 4
freeze_date: 2024-06-11
previous_freeze: alspacdcs:dnam_450_g0m_g1_2016-05-03_f3
freeze_of_alspac_dataset_version: alspacdcs:dnam_450_g0m_g1_2016-05-03
freeze_of_named_alspac_dataset: alspacdcs:dnam_450_g0m_g1


has_containers:
  - id: alspacdcs:7d3cb68e-0cbb-4c60-9f6b-77354a951caf
    name: data
    description: A dir/folder containing the data files
  - id: alspacdcs:88e75491-5bab-4fb7-9099-5341e17f3739
    name: betas
    description: A dir/folder containing the beta files
    belongs_to_container: alspacdcs:7d3cb68e-0cbb-4c60-9f6b-77354a951caf
  - id: alspacdcs:b5b7a645-484f-490f-92bc-e2d255504a2d
    name: control_matrix
    description: A dir/folder containing the control matrix files 
    belongs_to_container: alspacdcs:7d3cb68e-0cbb-4c60-9f6b-77354a951caf
  - id: alspacdcs:a98c4fb7-6b92-4f27-9a00-079dbb1a50db
    name: derived
    description: A dir/folder containing the derived data (e.g. Cell count predictions)
    belongs_to_container: alspacdcs:7d3cb68e-0cbb-4c60-9f6b-77354a951caf
  - id: alspacdcs:650f1c7b-e8ab-40c9-90b3-67d3c552100a
    name: cellcounts
    description: A dir/folder containing the cell count predictions
    belongs_to_container: alspacdcs:a98c4fb7-6b92-4f27-9a00-079dbb1a50db
  - id: alspacdcs:3c795f53-8dfc-45fe-b88b-5363a5a3bc77
    name: cord
    description: >-
      A dir/folder containing the cell count predictions
      for cord.
    belongs_to_container: alspacdcs:650f1c7b-e8ab-40c9-90b3-67d3c552100a
  - id: alspacdcs:06167109-d949-4d24-b33a-a70bc48e49a1
    name: andrews-and-bakulski
    description: >-
      A dir/folder containing the cell count predictions by
      andrews-and-bakulski algorithm
    belongs_to_container: alspacdcs:3c795f53-8dfc-45fe-b88b-5363a5a3bc77

  - id: alspacdcs:e9b1e42c-85e7-4a3f-bcf0-f1fa3d20b5b8
    name: gervinandlyle
    description: >-
      A dir/folder containing the cell count predictions by
      gervinandlyle algorithm/method.
    belongs_to_container: alspacdcs:3c795f53-8dfc-45fe-b88b-5363a5a3bc77

  - id: alspacdcs:54feaa38-f2de-4f98-babe-13c4c0b4791a
    name: gse68456
    description: >-
      A dir/folder containing the cell count predictions by
      the gse68456 method.
    belongs_to_container: alspacdcs:3c795f53-8dfc-45fe-b88b-5363a5a3bc77
  - id: alspacdcs:9d8ee029-67cc-47f2-a663-7bac8d803459
    name: houseman
    description: >-
      A dir/folder containing the cell count predictions by
      houseman method. 
    belongs_to_container: alspacdcs:650f1c7b-e8ab-40c9-90b3-67d3c552100a
  - id: alspacdcs:218a4ebd-ae56-4f5a-aa47-9614cb633a1e
    name: detection_p_values
    description: A dir/folder containing the matrix of detection values
    belongs_to_container: alspacdcs:7d3cb68e-0cbb-4c60-9f6b-77354a951caf
  - id: alspacdcs:cb1d7257-328f-4f7b-b578-133ed4eda164
    name: qc.objects_all
    description: >-
      A dir/folder containing the samples extracted from
      lims and not cleaned. 
    belongs_to_container: alspacdcs:7d3cb68e-0cbb-4c60-9f6b-77354a951caf
  - id: alspacdcs:9b6bd75c-0da7-4ab8-9bb1-e5a9e4a3854d
    name: qc.objects_clean
    description: A dir/folder containing the cleaned samples from Lims 
    belongs_to_container: alspacdcs:7d3cb68e-0cbb-4c60-9f6b-77354a951caf
  - id: alspacdcs:672a863a-458c-477f-93b3-f92454b490fa
    name: samplesheet
    description: A dir/folder containing the manifest file from Lims.
    belongs_to_container: alspacdcs:7d3cb68e-0cbb-4c60-9f6b-77354a951caf

has_parts:
  - id: alspacdcs:eb35b571-f62d-4cd9-91a5-779ad8ae334b
    name: betas
    description: >-
      Normalized betas using functional normalization.
      We used 10 PCs on the controlmatrix to regress out technical
      variation. Slide was regressed out as a random effect before
      normalization.
      CpGs are in rows and samples in columns.
    data_distributions:
      - id: alspacdcs:06428ec1-232f-45e0-b17a-40a4b382c6e0
	name: data.Robj
	description: >-
	  R data object for the Normalized beta data.
	md5sum: 454aac748f353ea4bd73afb1717c2716
	filesize: 17G
	filetype: .Robj
	belongs_to_container: alspacdcs:88e75491-5bab-4fb7-9099-5341e17f3739
	number_of_participants: 4843
	number_of_sites: 482855

  - id: alspacdcs:06b395ba-9cf9-4985-93f4-35e4011f6d28
    name: control matrix
    description: >-
      The 850 control probes are summarized in 42 control types.
      These probes can roughly be divided into negative control probes
      (613), probes intended for between array normalization (186)
      and the remainder (49), which are designed for quality
      control, including assessing the
      bisulfite conversion rate. None of these probes are designed
      to measure a biological signal.
      The summarized control probes can be used as surrogates for
      unwanted variation and are used for the functional
      normalization.
      Samples are rows and 42 control types are in columns.
    data_distributions:
      - id: alspacdcs:7b41f832-6201-42f1-bb27-6463151dc2fa
	name: data.txt
	description: >-
	  Plain text file of the control matrix.

	md5sum: 471b487a4b0761f00e33088b0065dd94
	filesize: 1.8M
	filetype: .txt
	belongs_to_container: alspacdcs:b5b7a645-484f-490f-92bc-e2d255504a2d
	number_of_participants: 4843

  - id: alspacdcs:102cbbca-7165-42c0-8b49-1d3ecabd1bb8
    name: andrews and bakulski cord cell counts
    description: >-
      Cellcounts in cord predicted using cord reference published in
      Bakulski et al 2016 (PMID: 27019159). This reference has been
      implemented in meffil. In this text file, samples are in rows and cell types in columns.
    data_distributions:
      - id: alspacdcs:d9cba595-0f19-40d8-ab2c-538c55f56b28
	name: data.txt
	description: >-
	  Plain text file of cellcounts in cord predicted using Bakulski.

	md5sum: 79b04868cc502a1a34ade01958f22790
	filesize: 118k
	filetype: .txt
	belongs_to_container: alspacdcs:06167109-d949-4d24-b33a-a70bc48e49a1
	number_of_participants: 912     

  - id: alspacdcs:29df92c4-c042-4b29-93a2-06d5ae4e8dee
    name: gervin and lyle cord cell counts
    description: >-
      Cellcounts in cord predicted using GervinandLyle cord reference
      (unpublished). This reference has been implemented in meffil.
      Samples are in rows and cell types in columns.
    data_distributions:
      - id: alspacdcs:15371e80-9b1d-4776-ad5f-400e9bf8f02b
	name: data.txt
	description: >-
	  Plain text file of cell counts predicted using GervinandLyle
	  cord reference.


	md5sum: 0d8535330ac6e12e7f3c5a5f3f30e600
	filesize: 100k
	filetype: .txt
	belongs_to_container: alspacdcs:e9b1e42c-85e7-4a3f-bcf0-f1fa3d20b5b8
	number_of_participants: 912       

  - id: alspacdcs:8196d769-fa52-4dd3-bd62-d81cccb77fc7
    name: gse68456 cord cell counts
    description: >-
      Cellcounts in cord predicted using cord reference published in
      de Goede et al (PMID: 26366232). This reference has been implemented in meffil.
      Samples are in rows and cell types in columns.
    data_distributions:
      - id: alspacdcs:d821314a-6716-4de9-8f27-2d65621d6617
	name: data.txt
	description: >-
	  Plain text file containing cell counts predicted using cord reference.


	md5sum: 837e1e40bf27d8f6bd1a402f016b798e
	filesize: 120k
	filetype: .txt
	belongs_to_container: alspacdcs:54feaa38-f2de-4f98-babe-13c4c0b4791a
	number_of_participants: 912

  - id: alspacdcs:280efa41-1668-456e-9974-9b4a45d13417
    name: houseman cell counts
    description: >-
      Cell counts extracted using Houseman algorithm implemented in
      meffil (PMID: 22568884). Samples are in rows and cell types in columns.
    data_distributions:
      - id: alspacdcs:ae1eb48d-cf51-4e88-b2d4-643b610f6f27
	name: data.txt
	description: >-
	  Text file of the cell counts calculated using Houseman algorithm.

	md5sum: 2792f7708e710536c069b05c0192c57d
	filesize: 569k
	filetype: .txt
	belongs_to_container: alspacdcs:9d8ee029-67cc-47f2-a663-7bac8d803459
	number_of_participants: 4843           

  - id: alspacdcs:99af94de-18b9-4caf-a798-fc3b8a8ca554
    name: detection p values
    description: >-
      This matrix shows the detection pvalues for each sample and
      each CpG and is extracted from the idat files using the "meffil.load.detection.pvalues"
      function in meffil. CpGs are in rows and samples in columns.
    data_distributions:
      - id: alspacdcs:1dd9411c-e1f1-4cd8-b8dc-f528c893447f
	name: data.Robj
	description: >-
	  R object file for the detection p values matrix

	md5sum: fbbd840f2561e28b443b1c959656f0f4
	filesize: 418M
	filetype: .Robj
	belongs_to_container: alspacdcs:218a4ebd-ae56-4f5a-aa47-9614cb633a1e
	number_of_participants: 4843

  - id: alspacdcs:83220340-b1e7-4a47-8435-473f9fecbe68
    name: qc objects all
    description: >-
      This object contains samples extracted from LIMS and has not been
      cleaned up. This object has been used to do the data cleaning.
      All data processing has been conducted using Meffil.
      Meffil uses illuminaio R package to parse Illumina IDAT files
      into a meffil object called qc.objects. All meffil functions,
      QC summary, functional normalization and post-normalization QC summary
      operate on the qc or norm.objects. Specifically, the qc.objects contain
      raw control probe intensities, poor quality probes based on
      detection Pvalues and number of beads, predicted sex,  predicted
      cellcounts and a samplesheet with batch variables.
      In addition, copy number variation can be extracted. This object is a list of individuals.
    data_distributions:
      - id: alspacdcs:f7fb5bce-dc29-425b-88c9-57559a3b1994
	name: data.Robj
	description: >-
	  R data file of the qc objects.

	md5sum: 677b3fd580acf8600fc5e31f7597d787
	filesize: 497M
	filetype: .Robj
	belongs_to_container: alspacdcs:cb1d7257-328f-4f7b-b578-133ed4eda164
	number_of_participants: 4843   

  - id: alspacdcs:5f074661-585b-4613-aa3a-f52960806f3d
    name: qc objects clean
    description: >-
      All data processing has been conducted using Meffil. Meffil uses
      illuminaio R package to parse Illumina IDAT files into a meffil
      object called norm.objects. All meffil functions, QC summary,
      functional normalization and post-normalization QC summary operate on the norm.objects.
      Specifically, the norm.objects contain raw control probe
      intensities, quantile distributions of the raw intensities, poor
      quality probes based on detection Pvalues and number of beads,
      predicted sex, predicted cellcounts and a samplesheet with batch
      variables. In addition, copy number variation can be extracted. This object is a list of individuals.
    data_distributions:
      - id: alspacdcs:34a39d30-f2b9-4a68-b8be-eb3b8ca3487a
	name: data.Robj
	description: >-
	  R object file of qc objects clean.

	md5sum: 25f961e24da7611bb34b5238175a522a
	filesize: 659M
	filetype: .Robj
	belongs_to_container: alspacdcs:9b6bd75c-0da7-4ab8-9bb1-e5a9e4a3854d
	number_of_participants: 4843        

  - id: alspacdcs:01574baf-1473-4e89-8ff9-db04ad000b1d
    name: samplesheet
    description: >-
      Manifest file with columns extracted directly from LIMS and age,
      sex, aln, timepoint, timecode, sampletype, genotypeQC columns to
      remove population stratification samples, duplicate.rm column to
      remove duplicates.
      Samples in rows, variables in columns.
    data_distributions:
      - id: alspacdcs:2ff495d8-47db-43aa-ae8e-02c5963f4d6a
	name: data.Robj
	description: >-
	  R data object manifest file.

	md5sum: a9f34d7a00da910d3806089b65ccc547
	filesize: 100K
	filetype: .Robj
	belongs_to_container: alspacdcs:672a863a-458c-477f-93b3-f92454b490fa
	number_of_participants: 4843               
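
Each data distribution in the freeze document above records an md5sum. As a minimal sketch, assuming the freeze has been copied to local disk, the checksum of a received file could be compared against the catalogue value as follows. The function names `md5sum` and `verify` are illustrative and are not part of any ALSPAC tooling.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file's MD5 digest matches the value from the catalogue."""
    return md5sum(path) == expected_md5.strip().lower()
```

For instance, `verify("data/samplesheet/data.Robj", "a9f34d7a00da910d3806089b65ccc547")` would check the samplesheet distribution listed above; the relative path is only a guess at how the containers map onto folders.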

6.2 DNA methylation - EPIC & 450k - G0 + G1 (dnam_epic450_g0_g1)

6.2.1 Description

This dataset contains methylation data collected from both G0 and G1 on two arrays at different timepoints. This dataset supersedes dnam_450_g0m_g1.

There is data from the Illumina Infinium HumanMethylation450K BeadChip array on G1 mothers at two timepoints (pregnancy and middle age), G1 participants at five timepoints and G0 participants at three timepoints (birth, childhood and adolescence). This dataset also contains Infinium MethylationEPIC v1.0 data on 2721 G1 individuals at two timepoints.

This dataset was generated as part of the Accessible Resource for Integrated Epigenomics Studies (http://www.ariesepigenomics.org.uk/).

6.2.2 Methodology

Preprocessing and quality control for this dataset was conducted using Meffil.

Associated publications:

Associated R packages:

6.2.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:dnam_epic450_g0_g1_2022-7-13_f4
name: >-
  DNA methylation - EPIC & 450k - G0 + G1 version 2022-7-13 Freeze 4
description: >-
  This is the freeze 4 version of dnam_epic450_g0_g1, which was first introduced
  in freeze 2 and first released 2022-7-13.

freeze_size: 137G
linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112
git_tag: https://github.com/alspac/dataset_dnam_epic450_g0_g1/releases/tag/Freeze4
is_current_freeze: true
freeze_number: 4
freeze_date: 2024-06-11 ### Update to align with date of release
previous_freeze: 3
freeze_of_alspac_dataset_version: alspacdcs:dnam_epic450_g0_g1_2022-7-13
freeze_of_named_alspac_dataset: alspacdcs:dnam_epic450_g0_g1

has_containers:
  - id: alspacdcs:a4ae8168-cbdf-44b4-8b10-e5cc7a988826
    name: data
    description: A dir/folder containing the data files
  - id: alspacdcs:368b116d-4f30-4930-915b-f25a540aabb6
    name: betas
    description: A dir/folder containing the beta files
    belongs_to_container: alspacdcs:a4ae8168-cbdf-44b4-8b10-e5cc7a988826
  - id: alspacdcs:64f83a02-ba1b-455b-a8d5-eebd33f17adf
    name: control_matrix
    description: A dir/folder containing the control matrix files 
    belongs_to_container: alspacdcs:a4ae8168-cbdf-44b4-8b10-e5cc7a988826
  - id: alspacdcs:087b88a3-bdc8-41df-9574-5f449e78a882
    name: derived
    description: A dir/folder containing the derived data (e.g. Cell count predictions and dnamage) 
    belongs_to_container: alspacdcs:a4ae8168-cbdf-44b4-8b10-e5cc7a988826
  - id: alspacdcs:ad61ac55-921a-4b7a-9a80-c3c7b8d6e975
    name: cellcounts
    description: A dir/folder containing the cell count predictions
    belongs_to_container: alspacdcs:087b88a3-bdc8-41df-9574-5f449e78a882
  - id: alspacdcs:07768702-d945-449b-ab16-3fa064bf981a
    name: detection_p_values
    description: A dir/folder containing the matrix of detection values
    belongs_to_container: alspacdcs:a4ae8168-cbdf-44b4-8b10-e5cc7a988826
  - id: alspacdcs:3ab1bca7-efaf-414f-837d-cf0ad30afb09
    name: samplesheet
    description: A dir/folder containing matrices of the sample identification.
    belongs_to_container: alspacdcs:a4ae8168-cbdf-44b4-8b10-e5cc7a988826 


has_parts:
  - id: alspacdcs:bc629684-4fa1-42f0-b48c-7e4473d4ed4d
    name: betas
    description: >-
      Normalized betas using functional normalization.
      We used 10 PCs on the controlmatrix to regress out technical
      variation. Slide was regressed out as a random effect before
      normalization. CpGs are in rows and samples in columns.
    data_distributions:
      - id: alspacdcs:1f940257-3a73-49d1-bd6c-ceeb794c0a4b
	name: 450.gds
	description: >-
	  R data object for the Normalized beta data for the 450 array only.
	md5sum: 02e9b3cdda39d3476bfce111f5935f93
	filesize: 22G
	filetype: .gds
	belongs_to_container: alspacdcs:368b116d-4f30-4930-915b-f25a540aabb6
	number_of_participants: 5927
      - id: alspacdcs:4c23fc84-df4d-48c0-969c-c3f8e12dd93f
	name: common.gds
	description: >-
	  R data object for the Normalized beta data for both the EPIC and 450 arrays.
	md5sum: 2d447051e6241bf35dc1bfba4e740848
	filesize: 30G
	filetype: .gds
	belongs_to_container: alspacdcs:368b116d-4f30-4930-915b-f25a540aabb6
	number_of_participants: 8669
      - id: alspacdcs:dc5ebcb3-a432-44c1-9f6f-1cbcdf7480ae
	name: epic.gds
	description: >-
	  R data object for the Normalized beta data for  the EPIC array only.
	md5sum: 0357486c3af3b5ee120c7b05bf077340
	filesize: 18G
	filetype: .gds
	belongs_to_container: alspacdcs:368b116d-4f30-4930-915b-f25a540aabb6
	number_of_participants: 2742

  - id: alspacdcs:cde6fb9f-9fa7-4941-aa0f-b3fe3140999b
    name: control_matrix
    description: >-
      The 850 control probes are summarized in 42 control types.
      These probes can roughly be divided into negative control probes
      (613), probes intended for between array normalization (186)
      and the remainder (49), which are designed for quality
      control, including assessing the
      bisulfite conversion rate. None of these probes are designed
      to measure a biological signal.
      The summarized control probes can be used as surrogates for
      unwanted variation and are used for the functional
      normalization.
      Samples are rows and 42 control types are in columns.
    data_distributions:
      - id: alspacdcs:8ca4a216-7dac-47c8-949a-38cc4a26af18
	name: 450.txt
	description: >-
	  Plain text file of the control matrix for the 450 array only.
	md5sum: 9e6aa62498c5bb7493f7512e274056ba
	filesize: 2.2M
	filetype: .txt
	belongs_to_container: alspacdcs:64f83a02-ba1b-455b-a8d5-eebd33f17adf
	number_of_participants: 5927
      - id: alspacdcs:8ddf1661-41f3-47e8-840e-cce8fed13f04
	name: common.txt
	description: >-
	  Plain text file of the control matrix for both the EPIC and 450 arrays.
	md5sum: 42d21ff7a2ead483e85b909b279e9912
	filesize: 3.2M
	filetype: .txt
	belongs_to_container: alspacdcs:64f83a02-ba1b-455b-a8d5-eebd33f17adf
	number_of_participants:  8669
      - id: alspacdcs:09bc4485-93c8-41a5-bfe2-bf44f6e9a345
	name: epic.txt
	description: >-
	  Plain text file of the control matrix for the EPIC array only.
	md5sum: 7a680d3ccd26a491ec7dde2ce91eeeab
	filesize: 1.0M
	filetype: .txt
	belongs_to_container: alspacdcs:64f83a02-ba1b-455b-a8d5-eebd33f17adf
	number_of_participants:  2742

  - id: alspacdcs:b73ed28a-7219-49b5-94b8-e39d2bbda6f2
    name: DNA methylation age
    description: >-
      DNA methylation aging estimates from within the dataset. 
      Further information on this data and its usage is found
      within the `dnamage.html` and `dnamage.md` within the docs
      dir/folder.
    data_distributions:
      - id: alspacdcs:2ba6caa3-327a-4615-af93-ec81836bec57
	name: dnamage.csv
	description: >-
	  A csv file containing DNA methylation aging estimates within the dataset. 
	md5sum: bd0c2efef6ee145cd0804d61c7e83151
	filesize: 12M
	filetype: .csv
	belongs_to_container: alspacdcs:087b88a3-bdc8-41df-9574-5f449e78a882
	number_of_participants:  8192

  - id: alspacdcs:6a7baf4c-121e-400d-a72f-357c33980ac1
    name: cell counts
    description: >-
      Files contain cell counts estimated using a variety of cell type 
      references using the Houseman deconvolution algorithm (PMID: 22568884).
      In each file, samples correspond to rows and cell types to columns.
    data_distributions:
      - id: alspacdcs:0fafdf8e-12b0-4cb6-bd85-c0e6bc82c8d1
	name: andrews-and-bakulski-cord-blood.txt
	description: >-
	  Cord blood cell count estimates derived using the Bakulski et al. 2016 reference 
	  (PMID 27019159; https://bioconductor.org/packages/release/data/experiment/html/FlowSorted.CordBlood.450k.html).
	  This reference has been implemented in meffil. Cell counts estimated for b-cells, 
	  cd4+ t cells, cd8+ t cells, granulocytes, monocytes, natural killer cells and 
	  nucleated red blood cells. In this text file, samples are in rows and cell types in columns.
	md5sum: 33c69aa8e50deb28355dcb82d01c7510
	filesize: 114K
	filetype: .txt
	belongs_to_container: alspacdcs:ad61ac55-921a-4b7a-9a80-c3c7b8d6e975
	number_of_participants: 913  
      - id: alspacdcs:1b7994b3-22db-4aff-99b2-8438d283d12d
	name: gervin-and-lyle-cord-blood.txt
	description: >-
	  Cord blood cell count estimates derived using the Gervin et al. 2019
	  reference (PMID 31455416; GEO accession GSE127824). Cell counts 
	  estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes,
	  and natural killer cells. This reference has been implemented in meffil. 
	  In this text file, samples are in rows and cell types in columns.
	md5sum: 099c4cf9bd4ecfee91c19c3c2d2b6f70
	filesize: 100K
	filetype: .txt
	belongs_to_container: alspacdcs:ad61ac55-921a-4b7a-9a80-c3c7b8d6e975
	number_of_participants: 913
      - id: alspacdcs:b426de30-1685-45c8-9cf2-5831f65b44d4
	name: cord-blood-gse68456.txt
	description: >-
	  Cord blood cell count estimates derived using the de Goede et al. 2015 reference
	  (PMID 26366232; GEO accession GSE68456).  Cell counts estimated for b-cells, cd4+ t cells,
	  cd8+ t cells, granulocytes, monocytes, natural killer cells and nucleated red blood cells.
	  This reference has been implemented in meffil. In this text file, samples are in rows and
	  cell types in columns.
	md5sum: 941f8a9ce1289ab5baaf10fb29bd8941
	filesize: 130K
	filetype: .txt
	belongs_to_container: alspacdcs:ad61ac55-921a-4b7a-9a80-c3c7b8d6e975
	number_of_participants: 913
      - id: alspacdcs:8048cc87-bf93-4d83-8440-a27e6fe9f2ae
	name: blood-gse35069-complete.txt
	description: >-
	  Cell counts in peripheral blood predicted using the peripheral blood reference published in 
	  Reinius et al. 2012 (PMID: 22848472). Same as 'blood-gse35069.txt' but replaces granulocytes
	  with eosinophils and neutrophils. This reference has been implemented in meffil. 
	  In this text file, samples are in rows and cell types in columns.  
	md5sum: 27ab648c56b56e62709a98fcba95a764
	filesize: 1.2M
	filetype: .txt
	belongs_to_container: alspacdcs:ad61ac55-921a-4b7a-9a80-c3c7b8d6e975
	number_of_participants: 8669         
      - id: alspacdcs:74143e64-6d2d-4e69-b99e-1e12a7df3657
	name: blood-gse35069.txt
	description: >-
	  Blood cell count estimates derived using the Reinius et al. 2012 reference 
	  (PMID 22848472; GEO accession GSE35069). Cell counts estimated for b-cells,
	  cd4+ t cells, cd8+ t cells, granulocytes, monocytes, and natural killer cells.
	  In this text file, samples are in rows and cell types in columns.
	md5sum: 53fb63b4cef457d90688b3ddb861fa73
	filesize: 1021K
	filetype: .txt
	belongs_to_container: alspacdcs:ad61ac55-921a-4b7a-9a80-c3c7b8d6e975
	number_of_participants:  8669
      - id: alspacdcs:3e2301b0-7d09-45bd-8364-700fdc3e873a
	name: blood-idoloptimized-epic.txt
	description: >-
	  Cell counts in peripheral blood predicted using the cell type reference from Bioconductor 
	  package FlowSorted.Blood.EPIC. This reference has been implemented in meffil. In this text file,
	  samples are in rows and cell types in columns.
	md5sum: 7331e83d31e1d200bbff3d041223cde1
	filesize: 347K
	filetype: .txt
	belongs_to_container: alspacdcs:ad61ac55-921a-4b7a-9a80-c3c7b8d6e975
	number_of_participants: 2742
      - id: alspacdcs:a803d79f-faf0-4f7a-aae6-2548031834cc
	name: blood-idoloptimized.txt
	description: >-
	  Cell counts in peripheral blood predicted using the cell type reference from Bioconductor 
	  package FlowSorted.Blood.EPIC but restricted to the IDOLOptimizedCpGs450klegacy CpG sites. 
	  This reference has been implemented in meffil. In this text file, samples are in rows and 
	  cell types in columns.
	md5sum: 2c2bdbf34093960af969ca37ae43c77b
	filesize: 1.1M
	filetype: .txt
	belongs_to_container: alspacdcs:ad61ac55-921a-4b7a-9a80-c3c7b8d6e975
	number_of_participants: 8669
      - id: alspacdcs:e3a207d4-4dfa-44ad-bdf1-58cae95bb972
	name: combined-cord-blood.txt
	description: >-
	  Cord blood cell count estimates derived using the Bakulski et al., Gervin et al., de Goede et al.,
	  and Lin et al. references (https://bioconductor.org/packages/release/data/experiment/html/FlowSorted.CordBloodCombined.450k.html)
	  for CpG sites selected using the IDOL algorithm and optimized for the Illumina Infinium 
	  HumanMethylation450 Beadchip. Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells,
	  granulocytes, monocytes, natural killer cells and nucleated red blood cells.
	  In this text file, samples are in rows and cell types in columns.
	md5sum: 7cbcf72ca00012d17d22ff6d21b7575c
	filesize: 129K
	filetype: .txt
	belongs_to_container: alspacdcs:ad61ac55-921a-4b7a-9a80-c3c7b8d6e975
	number_of_participants: 913

  - id: alspacdcs:7556f038-dc22-47f0-96eb-af81a58eefe6
    name: detection p values
    description: >-
      This matrix shows the detection pvalues for each sample and
      each CpG and is extracted from the idat files using the "meffil.load.detection.pvalues"
      function in meffil. CpGs are in rows and samples in columns.
    data_distributions:
      - id: alspacdcs:a1cc883a-a4ce-4660-8926-0bdb67c731fd
	name: 450.gds
	description: >-
	  R object file for the detection p values matrix for the 450 array only.
	md5sum: 1c437226b2aab0c00aed7098e739f49d
	filesize: 22G
	filetype: .gds
	belongs_to_container: alspacdcs:07768702-d945-449b-ab16-3fa064bf981a
	number_of_participants: 5927
      - id: alspacdcs:a2ef6985-97de-4ee3-996f-4d295773fbbc
	name: common.gds
	description: >-
	  R object file for the detection p values matrix for both EPIC and 450 arrays.
	md5sum: c6f4348fa7d92a5f341f69e1784036da
	filesize: 30G
	filetype: .gds
	belongs_to_container: alspacdcs:07768702-d945-449b-ab16-3fa064bf981a
	number_of_participants: 8669
      - id: alspacdcs:d312d4b0-3e87-4a49-8840-b2162c0daa1a
	name: epic.gds
	description: >-
	  R object file for the detection p values matrix for the EPIC array only.
	md5sum: 341d1194d468e10e80be9dc9990c474b
	filesize: 18G
	filetype: .gds
	belongs_to_container: alspacdcs:07768702-d945-449b-ab16-3fa064bf981a
	number_of_participants: 2742

  - id: alspacdcs:
    description: >-
      Manifest files with columns extracted directly from LIMS and age,
      sex, omics ID, timepoint, timecode, sampletype, genotype columns to report
      sample mismatches, duplicate.rm column to remove duplicates.
      Samples in rows, variables in columns.
    data_distributions:
      - id: alspacdcs:4547e736-b1c4-4ade-adc4-622d44522f7c
	name: samplesheet-450.csv
	description: >-
	  CSV manifest file for the 450 array only.
	md5sum: ae8ccd22c2784bb900959362bfdf95e5
	filesize: 2.2M
	filetype: .csv
	belongs_to_container: alspacdcs:3ab1bca7-efaf-414f-837d-cf0ad30afb09
	number_of_participants: 5927              
      - id: alspacdcs:1c1bf0bc-c254-4c25-96bf-96558f37f059
	name: samplesheet-common.csv
	description: >-
	  CSV manifest file for both the EPIC and 450 arrays. This is a duplicate of samplesheet.csv.
	md5sum: 1e60ab2f50c9f578c3a6ead251974197
	filesize: 3.3M
	filetype: .csv
	belongs_to_container: alspacdcs:3ab1bca7-efaf-414f-837d-cf0ad30afb09
	number_of_participants: 8669
      - id: alspacdcs:fce38b25-3100-4b12-b13d-6b528d8dfffc
	name: samplesheet-epic.csv
	description: >-
	  CSV manifest file for the EPIC array only.
	md5sum: 656ead1968eb4ae0ac07b1a2416907ad
	filesize: 1.1M
	filetype: .csv
	belongs_to_container: alspacdcs:3ab1bca7-efaf-414f-837d-cf0ad30afb09
	number_of_participants: 2742 
      - id: alspacdcs:707a0d83-66fe-4a74-96fc-1b2c5d7f0158
	name: samplesheet.csv
	description: >-
	  CSV manifest file for both the EPIC and 450 arrays. This is a duplicate of samplesheet-common.csv.
	md5sum: 1e60ab2f50c9f578c3a6ead251974197 # should be the same as samplesheet-common.csv
	filesize: 3.3M
	filetype: .csv
	belongs_to_container: alspacdcs:3ab1bca7-efaf-414f-837d-cf0ad30afb09
	number_of_participants: 8669
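
The has_containers entries in the freeze document above form a tree through their belongs_to_container links. The following sketch, which assumes that the containers correspond to nested folders on disk, resolves a container id to a folder path. The function name `container_path` is hypothetical; the ids are copied from the freeze document above.

```python
def container_path(container_id, containers):
    """Walk belongs_to_container links up to the root and join the folder names."""
    by_id = {c["id"]: c for c in containers}
    parts = []
    seen = set()
    current = container_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in container links at {current}")
        seen.add(current)
        entry = by_id[current]
        parts.append(entry["name"])
        # Root containers have no belongs_to_container key, which ends the walk.
        current = entry.get("belongs_to_container")
    return "/".join(reversed(parts))


# A subset of the has_containers entries for dnam_epic450_g0_g1 freeze 4.
containers = [
    {"id": "alspacdcs:a4ae8168-cbdf-44b4-8b10-e5cc7a988826", "name": "data"},
    {"id": "alspacdcs:087b88a3-bdc8-41df-9574-5f449e78a882", "name": "derived",
     "belongs_to_container": "alspacdcs:a4ae8168-cbdf-44b4-8b10-e5cc7a988826"},
    {"id": "alspacdcs:ad61ac55-921a-4b7a-9a80-c3c7b8d6e975", "name": "cellcounts",
     "belongs_to_container": "alspacdcs:087b88a3-bdc8-41df-9574-5f449e78a882"},
]

print(container_path("alspacdcs:ad61ac55-921a-4b7a-9a80-c3c7b8d6e975", containers))
# prints: data/derived/cellcounts
```

Combined with the belongs_to_container value of a data distribution, this gives a candidate location for each listed file, for example the cell count text files under data/derived/cellcounts.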

7 Gene Expression Data

7.1 Gene expression - array - G1 (ge_ht12_g1)

7.1.1 Description

There are two different types of QC'd data available in this version, one performed by David Evans for the Bryois et al 2014 paper, and one performed by Gibran Hemani for the molgenis eQTL mapping meta analysis. A version without QC is available as well. Details on the QC'd versions can be seen below.

This data was generated from LCLs. The majority of samples used in their generation were collected at age 9 years. LCLs are lymphoblastoid cell lines, which were produced by transforming lymphocytes with Epstein-Barr virus and culturing them before DNA was extracted. Gene expression patterns may not be the same as those from untransformed lymphocytes taken from a 9 year old.

7.1.2 Methodology

Bryois:

  • LCLs from unrelated individuals were grown under identical conditions and cells were frozen in RNAlater. RNA was extracted using an RNeasy extraction kit (Qiagen) and was amplified using the Illumina TotalPrep-96 RNA Amplification kit (Ambion). Expression profiling of the samples, each with two technical replicates, was performed using Illumina Human HT-12 V3 BeadChips (Illumina Inc), which include 48,804 probes; 200 ng of total RNA was processed according to the protocol supplied by Illumina. Raw data were imported into the Illumina Beadstudio software and probes with fewer than three beads present were excluded. Log2-transformed expression signals were then normalized with quantile normalization of the replicates of each individual followed by quantile normalization across all individuals.

We restricted our analysis to 23,935 probes tagging genes annotated in Ensembl. Principal component analysis was performed on 931 individuals. 62 individuals with principal component 1 or 2 greater than one standard deviation of the population were excluded from further analysis. See http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004461 for full details.
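
The log2 transform and quantile normalization described above can be illustrated with a small pure-Python sketch. This is not the code used for the Bryois et al analysis: it shows a single normalization step across samples only (not the per-individual replicate step), the function names are hypothetical, and tied values are ranked by position rather than averaged to keep the example short.

```python
import math


def log2_transform(columns):
    """Apply log2 to every value; each inner list holds one sample's probe signals."""
    return [[math.log2(v) for v in col] for col in columns]


def quantile_normalise(columns):
    """Quantile-normalise equal-length columns, one list of values per sample.

    Each value is replaced by the mean, across samples, of the values that
    share its rank within their own sample.
    """
    n_rows = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [
        sum(col[i] for col in sorted_cols) / len(columns) for i in range(n_rows)
    ]
    normalised = []
    for col in columns:
        # Indices of this sample's values in ascending order of value.
        order = sorted(range(n_rows), key=lambda i: col[i])
        out = [0.0] * n_rows
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalised.append(out)
    return normalised


# Made-up signals for two samples measured on three probes.
signals = [[2, 4, 8], [16, 1, 4]]
print(quantile_normalise(log2_transform(signals)))
# prints: [[0.5, 2.0, 3.5], [3.5, 0.5, 2.0]]
```

After normalization every sample has the same set of values, so the distributions of the samples are identical and only the ranking of probes within each sample differs.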

Molgenis:

  • Genetic outliers, meaning any individuals that were clear outliers in the first 2 genetic principal components, were removed. Each probe was quantile normalised and then log2 transformed. The data were then adjusted for the first 4 genetic MDS components and for expression principal components (excluding those that had genetic associations), and scaled to have mean 0 and variance 1. See https://github.com/molgenis/systemsgenetics/wiki/eQTL-mapping-analysis-cookbook for full details.
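
The final scaling step above (mean 0 and variance 1) can be sketched for a single probe as follows. This is only an illustration and not the Molgenis pipeline code; the function name `standardise` is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption.

```python
import statistics


def standardise(values):
    """Scale one probe's values to mean 0 and variance 1 across individuals."""
    mean = statistics.mean(values)
    # Sample standard deviation (n - 1 denominator); this choice is an assumption.
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("cannot standardise a probe with constant values")
    return [(v - mean) / sd for v in values]


print(standardise([1.0, 2.0, 3.0]))
# prints: [-1.0, 0.0, 1.0]
```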

7.1.3 Freeze Docs

# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema

id: alspacdcs:ge_ht12_g1_2015-11-02_f4
name: Gene expression - array - G1 release version 2015-11-02 freeze 4
description: >-
  This is the fourth freeze of the 2015-11-02 version of
  ge_ht12_g1 dataset which has .csv distributions of the data rather than
  .Rdata files in order to be easier to use across different data
  science software and languages.

freeze_size: 2.6G
linker_file_md5sum: fafe49f2e5ce4d5bd018fba250503eff
woc_file_md5sum: 163b7668b82ec7e5e6b7e35aecbbb473
all_individuals_to_exclude_md5sum: 92b14080f8a933c32fcb064767eb1112
git_tag: https://github.com/alspac/dataset_ge_ht12_g1/releases/tag/freeze4
is_current_freeze: true
freeze_number: 4
freeze_date: 2024-06-11
previous_freeze: alspacdcs:ge_ht12_g1_2015-11-02_f3
freeze_of_alspac_dataset_version: alspacdcs:ge_ht12_g1_2015-11-02
freeze_of_named_alspac_dataset: alspacdcs:ge_ht12_g1
has_parts:
  - id: alspacdcs:ge_ht12_g1_2015-11-02_bryosis_f4
    name: Bryosis data
    description: Dataset part for the Bryois data in ge_ht12_g1 version 2015-11-02 freeze 4
    data_distributions:
      - id: alspacdcs:564477290c6962a88697e9a9eae4991a_bryosis.csv
	name: bryosis.csv
	description: >-
	  The freeze 4 csv version of the Bryois data.
	  IDs in columns and Illumina probe IDs in rows.
	  This is the normalised data used in Bryois et al 2014.
	  Probe IDs are mapped to Genes
	  in raw.csv
	md5sum: 564477290c6962a88697e9a9eae4991a
	filesize: 742M
	filetype: .csv
	number_of_participants: 947
	number_of_gene_expression_probe_values: 48630
  - id: alspacdcs:ge_ht12_g1_2015-11-02_molgenis_f4
    name: Molgenis
    description: >-
      Dataset part for the Molgenis data in ge_ht12_g1 version 2015-11-02 freeze 4
    data_distributions:
      - id: alspacdcs:e5dcaa8260bd63189290e403d5ddc9f7_molgenis.csv
	name: molgenis.csv
	description: >-
	  The freeze 4 csv version of the molgenis data.
	  IDs in columns and Illumina probe IDs in rows.
	  Normalised data following the molgenis pipeline,
	  found at
	  https://github.com/molgenis/systemsgenetics/wiki/eQTL-mapping-analysis-cookbook.
	  Probe IDs are mapped to Genes
	  in raw.csv

	md5sum: e5dcaa8260bd63189290e403d5ddc9f7
	filesize: 752M
	filetype: .csv
	number_of_participants: 879
	number_of_gene_expression_probe_values: 48630
  - id: alspacdcs:ge_ht12_g1_2015-11-02_raw_f4
    name: Raw
    description: Dataset part for the raw data in ge_ht12_g1 version 2015-11-02 freeze 4
    data_distributions: 
      - id: alspacdcs:7251c3016a62431b1fc41823ffff2bef_raw.csv
	name: raw.csv
	description: >-
	  The freeze 4 csv version of the raw ge data.
	  IDs in columns and probes in rows. Two columns per
	  individual, with one column for average signal and one column
	  for average number of beads.
	  Presumably this is a file generated by the Illumina Genome
	  Studio software.
	md5sum: 7251c3016a62431b1fc41823ffff2bef
	filesize: 1.1G
	filetype: .csv
	number_of_participants: 994 ##This is not how wide this dataframe is
	number_of_gene_expression_probe_values: 48630

8 Omics tips

8.1 Introduction

This section is a guide to using 'Omics datasets. It explains which software to use and describes common file formats. It's a good starting point for beginners and helpful for problem-solving.

8.2 Disclaimer

Some information is copied or reworded from software documentation. Check the original documentation alongside this guide for up-to-date information. Note that some links may no longer work.

8.3 Operating systems

You can use ALSPAC data with any operating system, but Unix-based systems like macOS, Linux, or BSD are more convenient due to the data's size and complexity. We recommend using the command line and programming scripts with languages like Bash, R, Python, or Perl. Many online resources are available to learn these tools. Use free/libre and open-source software where possible.

Links:

8.4 Key Omics software

8.4.1 Plink

Plink is a tool for performing quality control and whole genome association analysis of genetic data.

8.4.2 SNPTest

SNPTest is a tool for performing whole genome association analysis of genetic data.

8.4.3 BoltLmm

BoltLmm is a tool for performing genome association analysis of genetic data. It is recommended for analyses of more than 5000 samples, and its methods automatically take population substructure into account.

8.4.4 Qctools

QCTOOL is a tool for quality control of genetic data. It is also useful for inspecting and modifying .gen, .bgen and .vcf files (see section 8.5 below).

8.4.5 SAMTOOLS

Samtools is a suite of tools which are used for genomic analysis.

8.4.6 VCFTOOLS

VCFtools is a set of tools for working with vcf files. It is a separate project from samtools.

8.4.7 BCFTOOLS

BCFtools is part of the samtools project and allows users to manipulate .vcf and .bcf files.

8.5 File types

In a Unix environment the suffix (extension) of a file name has no special meaning to the operating system, unlike in Windows, which uses the extension to determine the file type. In a Unix system it is just part of the file name, and people use it to distinguish file formats. The following is a non-exhaustive list of file types you may encounter whilst using ALSPAC Omics data.

8.5.1 .gen

This is an 'Oxford' data format for genetic data. The .gen file is a plain text file, which means that standard Unix command line tools, for example 'head' or 'less', can be used to inspect the data.

The .gen (genotype) file stores data in a one-line-per-SNP format. The first 5 entries of each line are the SNP ID, the RS ID of the SNP, the base-pair position of the SNP, the allele coded A and the allele coded B. The SNP ID can be used to denote the chromosome number of each SNP. The next three numbers on the line are the probabilities of the three genotypes AA, AB and BB at the SNP for the first individual in the cohort. The next three numbers are the genotype probabilities for the second individual, the next three are for the third individual, and so on. The order of individuals in the genotype file should match the order of the individuals in the sample file (see below). The probabilities need not sum to 1, which allows for the possibility of a NULL genotype call. This format therefore allows for genotype uncertainty. It is the same format as that produced by the genotype calling algorithm CHIAMO.

NOTE: We recommend that you arrange SNPs in base-pair order in the genotype files. This is required if you want to use the files with IMPUTE and will make viewing the output of SNPTEST somewhat easier.

For example, suppose you want to create a genotype file for 2 individuals at 5 SNPs whose genotypes are:

SNP 1 AA AA
SNP 2 GG GT
SNP 3 CC CT
SNP 4 CT CT
SNP 5 AG GG

The correct genotype file would look like this:

SNP1 rs1 1000 A C 1 0 0 1 0 0
SNP2 rs2 2000 G T 1 0 0 0 1 0
SNP3 rs3 3000 C T 1 0 0 0 1 0
SNP4 rs4 4000 C T 0 1 0 0 1 0
SNP5 rs5 5000 A G 0 1 0 0 0 1
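
As an illustration of the layout described above, the following is a minimal Python sketch that splits one .gen line into its five leading fields and one probability triple per individual, then reports the most probable genotype. The function names parse_gen_line and genotype_label are hypothetical helpers written for this example and are not part of any ALSPAC or Oxford tool.

```python
def parse_gen_line(line):
    """Split one .gen line into its leading fields and probability triples."""
    fields = line.split()
    snp_id, rs_id, position, allele_a, allele_b = fields[:5]
    values = [float(value) for value in fields[5:]]
    if len(values) % 3 != 0:
        raise ValueError("expected three probabilities per individual")
    # One (P(AA), P(AB), P(BB)) triple per individual, in sample-file order.
    triples = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return {
        "snp_id": snp_id,
        "rs_id": rs_id,
        "position": int(position),
        "allele_a": allele_a,
        "allele_b": allele_b,
        "probs": triples,
    }


def genotype_label(triple, allele_a, allele_b):
    """Return the most probable genotype, or None for a NULL call (all zeros)."""
    if sum(triple) == 0:
        return None
    labels = [allele_a * 2, allele_a + allele_b, allele_b * 2]
    best_index = max(range(3), key=lambda i: triple[i])
    return labels[best_index]


# Last line of the example genotype file above.
record = parse_gen_line("SNP5 rs5 5000 A G 0 1 0 0 0 1")
calls = [
    genotype_label(triple, record["allele_a"], record["allele_b"])
    for triple in record["probs"]
]
print(calls)
```

For real data files, which can be very large, a dedicated tool such as qctool is likely to be a better choice than line-by-line parsing.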

8.5.2 .bgen

A binary version of a .gen file. This file cannot be visually inspected on the command line. .bgen files are used because they greatly improve the speed and storage efficiency of software that handles large amounts of Omics data. The full details of the file format are discussed at: https://www.well.ox.ac.uk/~gav/bgen_format/

.bgen files are normally used with tools such as qctool and snptest. There is also a library for reading .bgen files into R: https://bitbucket.org/gavinband/bgen/wiki/rbgen

8.5.3 .sample

The .sample file is paired with either .gen or .bgen files. It contains information on the samples that is not genetic. It is a plain text file that can be inspected with standard Unix command line tools.

Please note that the sample file format changed with the release of SNPTEST v2. Specifically, the way in which covariates and phenotypes are coded on the second line of the file has changed. The sample file has three parts: (a) a header line detailing the names of the columns in the file, (b) a line detailing the types of variables stored in each column, and (c) a line for each individual detailing the information for that individual. Here is an example of the start of a sample file for reference:

ID_1 ID_2 missing cov_1 cov_2 cov_3 cov_4 pheno1 bin1
0 0 0 D D C C P B
1 1 0.007 1 2 0.0019 -0.008 1.233 1
2 2 0.009 1 2 0.0022 -0.001 6.234 0
3 3 0.005 1 2 0.0025 0.0028 6.121 1
4 4 0.007 2 1 0.0017 -0.011 3.234 1
5 5 0.004 3 2 -0.012 0.0236 2.786 0

The header line: This line needs a minimum of three entries. The first three entries should always be ID_1, ID_2 and missing. They denote that the first three columns contain the first ID, second ID and missing data proportion of each individual. Additional entries on this line should be the names of covariates or phenotypes that are included in the file. In the above example, there are 4 covariates named cov_1, cov_2, cov_3, cov_4, a continuous phenotype named pheno1 and a binary phenotype named bin1. NOTE: All phenotypes should appear after the covariates in this file.

The second line: This line details the type of variable included in each column. The first three entries of this line should be set to 0. Subsequent entries for covariates and phenotypes should be specified by the following rules:

D Discrete covariate (coded using positive integers)
C Continuous covariates
P Continuous Phenotype
B Binary Phenotype (0 = Controls, 1 = Cases)

The remainder of the file should consist of a line for each individual containing the information specified by the entries of the header line (see example above). Separate the entries of the sample file with spaces, not tabs, because a space is the expected delimiter.

Missing values - It is possible to specify missing values for covariates and phenotypes. For SNPTEST v1 the recommendation was to use -9 for missing values. This was the default value assumed by SNPTEST v1, although its -missing_code option meant that you could use other numeric values for the missing code. In SNPTEST v2 the behavior of the -missing_code option has changed so that it now takes a comma-separated list of values, each of which is treated as missing when encountered in the sample file(s). Default missing values are now denoted by the two-character string "NA".
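
The rules above can be checked programmatically. The following is a minimal Python sketch, assuming a space-separated sample file in the SNPTEST v2 layout; the function name read_sample and the toy file contents are made up for this example. It verifies the first three header entries, the type codes on the second line, that phenotypes follow covariates, and it converts the missing code "NA" to None.

```python
# Toy sample file (hypothetical values), space separated as required.
SAMPLE_TEXT = """\
ID_1 ID_2 missing cov_1 cov_2 pheno1 bin1
0 0 0 D C P B
1 1 0.007 1 0.0019 1.233 1
2 2 0.009 1 NA 6.234 0
"""

COVARIATE_TYPES = {"D", "C"}
PHENOTYPE_TYPES = {"P", "B"}


def read_sample(text, missing_codes=("NA",)):
    """Validate the two header lines and return (header, types, records)."""
    lines = [line.split(" ") for line in text.strip().split("\n")]
    header, types, rows = lines[0], lines[1], lines[2:]
    if header[:3] != ["ID_1", "ID_2", "missing"]:
        raise ValueError("first three columns must be ID_1, ID_2 and missing")
    if len(types) != len(header) or types[:3] != ["0", "0", "0"]:
        raise ValueError("second line must start with 0 0 0 and match the header")
    seen_phenotype = False
    for code in types[3:]:
        if code in PHENOTYPE_TYPES:
            seen_phenotype = True
        elif code in COVARIATE_TYPES:
            if seen_phenotype:
                raise ValueError("phenotypes must appear after covariates")
        else:
            raise ValueError("unknown column type: " + code)
    records = []
    for row in rows:
        if len(row) != len(header):
            raise ValueError("row length does not match the header")
        # Replace any missing code with None so it is easy to filter later.
        records.append(
            {name: (None if value in missing_codes else value)
             for name, value in zip(header, row)}
        )
    return header, types, records


header, types, records = read_sample(SAMPLE_TEXT)
print(len(records), records[1]["cov_2"])
```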

8.5.4 .ped

A plink format file that is in plain text and can be viewed with standard tools. It contains genetic variant data. https://www.cog-genomics.org/plink/1.9/formats#ped

8.5.5 .map

A plink format file that is in plain text. It contains information about variants. https://www.cog-genomics.org/plink/1.9/formats#map

8.5.6 .bed

A plink format file that is a binary equivalent of a .ped file. It is smaller and faster to process but is not easily viewable or editable. https://www.cog-genomics.org/plink/1.9/formats#bed

8.5.7 .bim

A plink format file, similar to a .map file, that is used with binary .bed files. https://www.cog-genomics.org/plink/1.9/formats#bim

8.5.8 .fam

A plain text format that contains sample information for plink binary files. https://www.cog-genomics.org/plink/1.9/formats#fam
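
As a small illustration, a .fam file has no header and six whitespace-separated columns per line: family ID, within-family ID, father's ID, mother's ID, sex code and phenotype value. The Python sketch below reads such lines into named tuples; the function name read_fam and the toy IDs are made up for this example.

```python
from collections import namedtuple

# The six columns of a plink .fam file, in order.
FamRow = namedtuple("FamRow", "fid iid father mother sex phenotype")


def read_fam(text):
    """Parse the text of a .fam file into a list of FamRow tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        parts = line.split()
        if len(parts) != 6:
            raise ValueError("a .fam line should have exactly six fields")
        rows.append(FamRow(*parts))
    return rows


# Toy content with hypothetical IDs.
fam_rows = read_fam("FAM1 IND1 0 0 1 2\nFAM1 IND2 0 0 2 1\n")
print(fam_rows[0].iid, fam_rows[1].sex)
```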

8.5.9 .csv

A plain text format in which fields are separated by commas (comma-separated values).

8.5.10 .vcf

VCF files are a flexible file format for storing different types of genetic variants. They are a plain text format that can be inspected on the command line with standard Unix tools. However, they are often very large files, and specific tools such as 'vcftools' are useful for working with this data. SNPs are commonly stored in these files, but other variants such as copy number variations can also be stored. The basic format of a vcf file is described at: https://en.wikipedia.org/wiki/Variant_Call_Format
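
To show the basic shape of the format, the Python sketch below reads the eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) from tab-separated data lines and skips header lines that start with '#'. The function name iter_vcf_records and the toy records are made up for this example, and per-sample genotype columns are ignored. For large files a dedicated tool is likely to be more appropriate.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def iter_vcf_records(lines):
    """Yield one dict per data line, using only the eight fixed columns."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # meta-information and column header lines
        fields = line.split("\t")
        record = dict(zip(VCF_COLUMNS, fields[:8]))
        record["POS"] = int(record["POS"])
        # Multiple alternate alleles are comma separated.
        record["ALT"] = record["ALT"].split(",")
        yield record


# Toy VCF content (hypothetical variants).
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\trs1\tA\tC\t.\tPASS\t.",
    "1\t2000\trs2\tG\tT,A\t.\tPASS\t.",
]
vcf_records = list(iter_vcf_records(toy_vcf))
print(len(vcf_records))
```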

8.5.11 .bcf

This is a binary version of a vcf file. It cannot be inspected directly with standard text tools on the command line, but it can be used with the genomic tools mentioned in this document, for example BCFtools.

8.5.12 .tar.gz

This is a standard Unix file format for bundling and compressing a set of files. It is similar to a .zip file. It is made by first bundling a set of files into a .tar file (sometimes called a tarball), which is then compressed using 'gzip'. https://en.wikipedia.org/wiki/Tar_(computing) https://en.wikipedia.org/wiki/Gzip
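
The bundling and compression steps can be tried out with Python's standard tarfile module. The sketch below is only an illustration: it works in a temporary directory, and the file names example.txt and bundle.tar.gz are made up.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Create a small file to bundle.
    source_path = os.path.join(workdir, "example.txt")
    with open(source_path, "w") as handle:
        handle.write("toy content\n")

    # Bundle and gzip-compress in one step ("w:gz").
    archive_path = os.path.join(workdir, "bundle.tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(source_path, arcname="example.txt")

    # List the archive contents and read one member back.
    with tarfile.open(archive_path, "r:gz") as archive:
        member_names = archive.getnames()
        extracted_text = archive.extractfile("example.txt").read().decode()

print(member_names)
```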

8.5.13 .enc

This file extension is used by convention to mean that the file is encrypted. You will need the password that was used to encrypt the data in order to decrypt the files. https://en.wikipedia.org/wiki/OpenSSL

8.6 Variant/SNP ids

There are many types of genetic variation. A common type is a single nucleotide polymorphism (SNP). Others include copy number variations.

Variants can be specified by a Chromosome and location in reference to a specific build of the human genome. They can also be given a reference SNP (rs) cluster identifier.

  • Chr:Location
  • Rs ids
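
When working with mixed lists of identifiers it can help to tell the two styles apart. The following Python sketch is one possible approach; the function name classify_variant_id and the patterns are assumptions made for this example, and real datasets may use other conventions (for example, IDs that also include alleles).

```python
import re

# rs identifier, e.g. "rs123".
RS_PATTERN = re.compile(r"^rs(\d+)$", re.IGNORECASE)
# Chromosome and position, e.g. "2:5000" or "chr2:5000".
CHR_POS_PATTERN = re.compile(r"^(?:chr)?([0-9]{1,2}|X|Y|MT?):(\d+)$", re.IGNORECASE)


def classify_variant_id(variant_id):
    """Return a tuple describing which ID style the string follows."""
    match = RS_PATTERN.match(variant_id)
    if match:
        return ("rsid", int(match.group(1)))
    match = CHR_POS_PATTERN.match(variant_id)
    if match:
        return ("chr_pos", match.group(1).upper(), int(match.group(2)))
    return ("unknown", variant_id)


for example_id in ["rs123", "chr2:5000", "X:100", "not_an_id"]:
    print(example_id, classify_variant_id(example_id))
```

Note that a Chr:Location identifier is only meaningful together with the genome build it refers to, which this sketch does not capture.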

8.7 Overview of Imputation reference panels

SNP array data frequently contain hundreds of thousands of variants. However, because of linkage disequilibrium, it is possible to estimate many more SNP values for an individual. This estimation procedure is called imputation, and it works by combining an individual's SNP array data with a large reference population of sequenced data. In this way it is possible to obtain accurate estimates of millions of SNP values for an individual without the cost of fully sequencing each person. ALSPAC has already run the imputation process using three different imputation panels.

8.7.1 Panels

  1. TOPMed

    A reference panel that is upcoming for ALSPAC and will provide the most SNPs.

  2. HRC

    This is the latest reference panel, and our data contain circa 40 million SNPs.

  3. 1000 Genomes

    This is the previous generation reference panel which is still widely used in ALSPAC studies. There are some SNPs that appear in this panel that are not in the HRC panel.

  4. HapMap

    This was the first widely used imputation panel.

8.8 SNP data types from imputation.

SNPs that have been imputed can be stored and analysed in different formats. These can be appropriate for different types of analysis; for example, an analysis could assume an additive effect for the minor allele or it could assume a recessive/dominant effect.

  • Best guess. The data will be presented as either 0, 1 or 2 to represent how many copies of the minor allele a person has at that position. The best guess is derived from the genotype probabilities calculated by the imputation process.
  • Dosage. These are the probabilities that the person has 0, 1 or 2 copies of the minor allele, e.g. 0.1, 0.2, 0.7. They sum to one across the three possibilities (i.e. for each SNP for each individual).
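
The relationship between the two representations can be sketched in a few lines of Python. The function names below are made up for this example. The second function computes the expected number of minor alleles, which is a commonly used single-number summary of the three probabilities and is an addition for illustration rather than a description of how ALSPAC files are encoded.

```python
def best_guess(probabilities):
    """Return the allele count (0, 1 or 2) with the highest probability."""
    return max(range(3), key=lambda copies: probabilities[copies])


def expected_allele_count(probabilities):
    """Return 0*P(0) + 1*P(1) + 2*P(2), the expected number of minor alleles."""
    return sum(copies * p for copies, p in enumerate(probabilities))


# Probabilities of carrying 0, 1 or 2 copies, as in the example above.
probs = (0.1, 0.2, 0.7)
print(best_guess(probs), expected_allele_count(probs))
```

The best guess discards the uncertainty in the imputation, whereas the probabilities (or their expected value) retain it.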

8.9 SNP Statistics

You can generate statistics on your SNP data using the program 'QCTOOL'. Its per-variant statistics (the -snp-stats option) include imputation information scores, and per-sample statistics can be generated with -sample-stats. For example:

qctool -g example.bgen -s example.sample -sample-stats -osample sample-stats.txt

8.10 Best practice

8.10.1 GWAS

We recommend you follow the steps outlined in the following paper when performing GWAS: Marees, Andries T., et al. "A tutorial on conducting genome‐wide association studies: Quality control and statistical analysis." International journal of methods in psychiatric research 27.2 (2018): e1608. https://doi.org/10.1002/mpr.1608

8.10.2 PheWAS

We recommend you follow the steps outlined in the following paper when performing PheWAS: Millard, L., Davies, N., Timpson, N. et al. MR-PheWAS: hypothesis prioritization among potential causal effects of body mass index on many outcomes, using Mendelian randomization. Sci Rep 5, 16645 (2015). https://doi.org/10.1038/srep16645

8.10.3 Methylation

The following paper describes the methylation data available in ALSPAC: Relton, Caroline L., et al. "Data resource profile: accessible resource for integrated epigenomic studies (ARIES)." International journal of epidemiology 44.4 (2015): 1181-1190.

8.11 Population stratification

Population stratification occurs when an observed genetic association is due to differences in population structure or geography rather than a genuine effect of the variant. Not taking this into account can lead to biased effect estimates. One common way to account for it is to calculate principal components of the genetic data and include them as covariates in any models. Principal components can be generated using plink or other tools.

For more information about how to do this in plink see: https://www.cog-genomics.org/plink/1.9/strat

Another common method of accounting for population substructure is to use linear mixed models, for example with the BOLT-LMM software tool.

https://data.broadinstitute.org/alkesgroup/BOLT-LMM/
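
To give a feel for the principal component approach mentioned above, here is a toy Python sketch that finds the first principal component of a very small genotype matrix using power iteration. It is only an illustration under simplifying assumptions: the data are invented, the columns are centred but not scaled, no variant pruning is done, and the function names are hypothetical. Real analyses should use plink or a similar tool.

```python
def center_columns(matrix):
    """Subtract each column (SNP) mean from the entries of that column."""
    n_rows = len(matrix)
    means = [sum(column) / n_rows for column in zip(*matrix)]
    return [[value - mean for value, mean in zip(row, means)] for row in matrix]


def first_principal_component(genotypes, iterations=100):
    """Return one score per individual on the first principal component."""
    centered = center_columns(genotypes)
    # Individual-by-individual similarity matrix.
    gram = [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in centered]
            for row_i in centered]
    vector = [1.0] + [0.0] * (len(gram) - 1)
    for _ in range(iterations):
        product = [sum(g * v for g, v in zip(row, vector)) for row in gram]
        norm = sum(p * p for p in product) ** 0.5
        vector = [p / norm for p in product]
    return vector


# Toy data: rows are individuals, columns are SNPs coded 0, 1 or 2.
# The first three and last three individuals come from two invented groups.
toy_genotypes = [
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [1, 2, 0, 1],
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [0, 0, 1, 2],
]
pc1_scores = first_principal_component(toy_genotypes)
print([round(score, 2) for score in pc1_scores])
```

In this toy example the two groups receive scores of opposite sign, which is the kind of structure that including principal components as covariates is intended to adjust for.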

8.12 Common tasks

Here we provide links to webpages with instructions, or brief details of code, for completing common tasks using the software described above (section 8.4):

  • Extract some SNPs from a bgen data file and convert to plain text.

https://www.well.ox.ac.uk/~gav/qctool_v2/documentation/examples/filtering_variants.html

  • Extract some SNPs from bed data:

http://zzz.bwh.harvard.edu/plink/dataman.shtml

plink --bfile mydata --chr 2 --from-kb 5000 --to-kb 10000

  • Reading .bgen and .sample oxford files in plink

Plink supports bgen files but it is fussy about the types of the columns in the data.sample file. You may wish to remove or retype columns in order to read a data.sample file into plink. For more info see:

https://www.cog-genomics.org/plink/2.0/input

To make a new sample file removing some columns you can use the Unix command: 'cut -f 1,2,3 -d " " data.sample > data2.sample'
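
If you prefer to select columns by name rather than by position, the same task can be sketched in Python. The function name keep_sample_columns and the toy file contents below are made up for this example, and the sketch assumes a space-separated sample file whose first line holds the column names.

```python
def keep_sample_columns(text, wanted):
    """Return sample-file text containing only the named columns, in order."""
    rows = [line.split(" ") for line in text.strip().split("\n")]
    header = rows[0]
    indices = [header.index(name) for name in wanted]
    kept = [" ".join(row[i] for i in indices) for row in rows]
    return "\n".join(kept) + "\n"


# Toy sample file (hypothetical values).
toy_sample = (
    "ID_1 ID_2 missing cov_1 pheno1\n"
    "0 0 0 D P\n"
    "1 1 0.007 1 1.233\n"
    "2 2 0.009 2 6.234\n"
)
trimmed = keep_sample_columns(toy_sample, ["ID_1", "ID_2", "missing"])
print(trimmed)
```

The type line (the second line of the file) is filtered with the same column indices, so the output remains a valid sample file.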

8.13 Courses

Working with 'Omics data can be complicated, but there are many excellent resources available to help you learn how to do this. There are both paid in-person courses and free online courses.

Details on paid courses offered by Bristol University can be found here: https://www.bristol.ac.uk/medical-school/study/short-courses/ In addition, a number of free online courses are summarised here: https://www.mooc-list.com/tags/bioinformatics

8.14 Further sources of help

8.14.1 Stack exchange

Stack Exchange is an online Q&A community which is divided into different sub-communities. The first and most well-known is Stack Overflow. This is one of the best places to ask questions about programming on the Internet. Other useful exchange sites include bioinformatics https://bioinformatics.stackexchange.com/, maths https://mathoverflow.net/ and statistics https://stats.stackexchange.com/.

8.14.2 Bio-stars

Biostars is a bioinformatics community Q&A website: https://www.biostars.org/

8.14.3 Mailing lists

Individual products or projects often have a mailing list. For example, to get help using SNPTEST you can ask on the mailing list: https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html#contact

8.14.4 AI tools

AI tools such as ChatGPT can be useful for understanding how to work with omics data.

8.14.5 Ask ALSPAC

If you cannot find the answer to your question, or you think there is something wrong with your data, please contact the alspac-omics@bristol.ac.uk mailbox and we will do our best to help you.

Author: ALSPAC Omics team

Created: 2024-12-16 Mon 15:42
