Home Page
About Us
The Consortium of Long Read Sequencing (CoLoRS) is a coalition of international researchers focused on cataloging and providing frequency information for all classes of variation found within the human genome, using long-read whole-genome sequencing. This effort complements existing databases such as gnomAD, which was largely built on short-read sequencing data, since long reads enable much greater sensitivity and precision for complex variants, especially structural variants (SVs) and tandem repeats. Cataloging and classifying such variants is a challenging task, and we aim to collaborate on expected challenges such as elucidating strategies for merging structural variants and calling mosaicism in repeat expansions.
The genomic data used to populate the database were drawn from pre-existing and ongoing research projects conducted by members of the consortium. Sequencing reads were processed at the individual sites, using standardized software pipelines for each variant type, to produce summary-level data.
The goal of this initiative is to provide variant frequency data for public use, as a resource to the global scientific and clinical research community, and as a resource for existing NHGRI-funded projects such as GREGoR and IGVF, as well as other NIH projects such as TOPMed and All of Us.
Sample Description
The nearly 1,400 long-read genomes aggregated by CoLoRS come from a diverse group of institutions with ongoing HiFi sequencing research projects, and as such differ in characteristics such as read depth, disease focus, trio availability, and ancestry, with each contributing project having its own demographics. For example, the largest source of genome data in this dataset is the Children’s Mercy Research Institute (CMRI, Children’s Mercy Kansas City, Kansas City, MO, USA). The CMRI cohort comes from a rare disease study that enrolls probands (with parents when available), sequenced at >25x depth, with 85% European ancestry. The CMRI project is led by CoLoRS member Tomi Pastinen, MD, PhD.
Other population control cohorts include genomes from the HPRC (127 genomes @30-40x, drawn from the 1000 Genomes Project) and HGSVC (37 genomes @30-40x). Other disease-focused cohorts include genomes from the SolveRD study, focused on the role of structural variants in rare disease (100 genomes @8-10x, European ancestry), and the Hudson Alpha Institute (50 probands and 30 parents @20x, a mix of European and African American ancestry).
The table reflects samples as of 5/21/24. We expect the sample count to grow as the projects above expand and new collaborators join.
Project Methods
Data availability. The CoLoRSDb dataset is publicly available and can be found HERE. A README and data format description is included.
Pipeline optimization. We developed optimized WDL-based workflows for variant calling, variant unification, and genotyping. Workflow optimization focused on open-access datasets (HPRC, HGSVC) so that results can be shared easily, helping us identify the best approach, using a mix of known and trusted callers, for this heterogeneous data set.
Long-read alignment will use pbmm2 and/or minimap2; SV callers to be evaluated may include pbsv, Sniffles2, SVision, and PAV; variant merging may use SURVIVOR and Jasmine. We also plan to benchmark short-read genotypers, including PanGenie, Giraffe, and Paragraph, by comparing short-read genotypes to long-read discovery in samples with both short- and long-read sequencing.
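To make the merging problem concrete, the sketch below clusters same-type SV calls by reciprocal overlap, which is similar in spirit to one of the strategies implemented by tools like SURVIVOR and Jasmine. The 50% threshold, greedy clustering, and record fields are illustrative assumptions, not the consortium's chosen parameters.

```python
def reciprocal_overlap(a, b):
    """Fraction of reciprocal overlap between two intervals (start, end)."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))

def merge_svs(calls, min_ro=0.5):
    """Greedily cluster SV calls of the same type and chromosome whose
    reciprocal overlap with a cluster representative is >= min_ro.

    calls: list of dicts with 'chrom', 'start', 'end', 'svtype' (assumed schema).
    Returns a list of clusters (each a list of the original call dicts).
    """
    clusters = []
    for call in sorted(calls, key=lambda c: (c["chrom"], c["start"])):
        for cluster in clusters:
            rep = cluster[0]  # first call seen acts as the representative
            if (rep["chrom"] == call["chrom"]
                    and rep["svtype"] == call["svtype"]
                    and reciprocal_overlap(
                        (rep["start"], rep["end"]),
                        (call["start"], call["end"])) >= min_ro):
                cluster.append(call)
                break
        else:
            clusters.append([call])
    return clusters
```

Real mergers must also handle breakpoint uncertainty, insertion sequences, and inter-chromosomal events; this sketch only captures the interval-overlap core of the idea.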
Tandem Repeat Genotyping Tool (TRGT) will be used to generate a tandem repeat call set, and Google DeepVariant will be used to call small variants. In addition, other specialized variant callers will be adopted as they are developed by this group or by the scientific community at large.
Large-scale long-read alignment and variant calling. The alignments and variant calling will occur within AnVIL using a unified WDL, but we will use separate workspaces for each cohort to maintain privacy of the samples. Data sets that cannot be uploaded to AnVIL will be processed at the location where they are housed with a non-cloud version of the pipeline. Summary statistics of the variant calls will then be exported from the individual cohorts into a unified workspace to aggregate and analyze the results.
Variant annotation and analysis. We will quality-control the variant calls based on variant size distribution, Hardy-Weinberg equilibrium, Mendelian concordance within trios, and comparison to gold-standard structural variant calls from Genome in a Bottle and HGSVC.
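Two of the QC checks named above can be sketched in a few lines: a chi-square statistic for deviation from Hardy-Weinberg equilibrium, and a per-trio Mendelian consistency test. The genotype encodings and any downstream significance thresholds are illustrative assumptions.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic for deviation from Hardy-Weinberg equilibrium,
    given observed genotype counts (hom-ref, het, hom-alt)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # reference allele frequency
    q = 1 - p
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    observed = [n_aa, n_ab, n_bb]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

def mendelian_consistent(child, mother, father):
    """True if the child's diploid genotype (a tuple of two alleles) can be
    formed from one maternal and one paternal allele."""
    c1, c2 = child
    return ((c1 in mother and c2 in father)
            or (c2 in mother and c1 in father))
```

A cohort at equilibrium (e.g. genotype counts 25/50/25 at allele frequency 0.5) yields a statistic near zero, while a total absence of heterozygotes yields a large one, flagging the site for review.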
After the database of variants has been constructed from long reads, we hope to genotype the variants in other short-read datasets. While not as accurate as discovering variants directly from long reads, pan-genome tools such as PanGenie, Giraffe, or Paragraph allow the assessment of most types of variants. This should improve allele frequency estimates and allow us to identify which variants are only accessible via long-read sequencing. We also plan to perform SV-eQTL analysis of the variants within datasets that have RNA-seq data available, especially GTEx.
Finalize analysis and disseminate results. The variant catalog will be distributed as a public workspace in AnVIL, and the workflows will be deposited in Dockstore for use in other projects. We also aim to write a manuscript describing the results of the analysis, highlighting how the CoLoRS variant database can be used to enable screening of potentially pathogenic variants in clinical research samples.