The Consortium of Long Read Sequencing (CoLoRS) is an open coalition of international researchers focused on cataloging, and providing frequency information for, all classes of variation found within the human genome using long-read whole-genome sequencing. This effort complements existing databases such as gnomAD, as long reads enable much greater sensitivity and precision for complex variants, especially structural variants (SVs) and tandem repeats. Cataloging and classifying such variants is challenging, and we aim to collaborate on expected difficulties such as devising strategies for merging structural variants and calling mosaicism in repeat expansions.
The genomic data that populate the database will be drawn from pre-existing and ongoing research projects conducted by members of the consortium. Sequencing reads will be processed into summary-level data via standardized software pipelines for each variant type, either at the individual sites or within the AnVIL cloud platform. We plan to use summary statistics from large population control studies, such as the All of Us Research Program, to pinpoint pathogenic variants of interest.
The goal of this initiative is to provide variant frequency data for public use: as a resource for the global scientific and clinical research community, for existing NHGRI-funded projects such as GREGoR and IGVF, and for other NIH projects such as TOPMed and All of Us.
CoLoRS will aggregate data from nearly 2,000 long-read genomes drawn from different ongoing research projects, which therefore differ in characteristics such as read depth, disease focus, trio availability, and ancestry. For example, the largest sources of genome data are Children’s Mercy Research Institute (CMRI) and a cohort from Dr. Shinichi Morishita at the University of Tokyo, each with different demographics. The CMRI cohort comes from a rare disease study enrolling probands (with parents when available), sequenced at >25x depth, with 85% European ancestry. The CMRI project is led by CoLoRS member Tomi Pastinen, MD, PhD, and the data are currently being ingested into AnVIL.
Other population control cohorts include genomes from the HPRC (127 genomes at 30-40x, drawn from the 1000 Genomes Project) and HGSVC (37 genomes at 30-40x). Other disease-focused cohorts include genomes from the Solve-RD study on the role of structural variants in rare disease (100 genomes at 8-10x, European ancestry) and the HudsonAlpha Institute (50 probands and 30 parents at 20x, a mix of European and African American ancestry).
The table reflects samples as of 7/18/23. We expect these numbers to grow with expansion of the projects above and through the addition of new collaborators.
Pipeline optimization. We will develop optimized WDL-based workflows for variant calling, variant unification, and genotyping. Workflow optimization will focus on open-access datasets (HPRC, HGSVC) so that we can easily share results. We will identify the best approach by evaluating a mix of known and trusted callers against this heterogeneous data set.
Long-read alignment will use pbmm2 and/or minimap2; SV callers to be evaluated may include pbsv, Sniffles2, SVision, and PAV; variant merging may use SURVIVOR and Jasmine. We also plan to benchmark short-read genotypers, including PanGenie, Giraffe, and Paragraph, by comparing short-read genotypes to long-read discovery in samples with both short- and long-read sequencing.
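Merging strategies differ between tools, but the core idea can be illustrated with a minimal sketch: greedy clustering that merges calls whose breakpoints lie within a distance threshold and whose sizes are reciprocally similar, roughly in the spirit of SURVIVOR's distance and size parameters. The function name, record layout, and thresholds below are hypothetical, not any tool's actual defaults.

```python
# Illustrative greedy merge of SV calls from multiple callers.
# Two calls are merged when their start positions are within max_dist
# and their sizes are reciprocally similar (size ratio >= min_size_sim).
# Real tools (SURVIVOR, Jasmine) also consider SV type, strand, and
# more sophisticated graph-based clustering; this is a sketch only.

def merge_svs(calls, max_dist=1000, min_size_sim=0.7):
    """calls: list of (chrom, start, size) tuples; returns merged clusters."""
    clusters = []
    for call in sorted(calls):
        chrom, start, size = call
        for cluster in clusters:
            c_chrom, c_start, c_size = cluster[0]
            ratio = min(size, c_size) / max(size, c_size)
            if (chrom == c_chrom
                    and abs(start - c_start) <= max_dist
                    and ratio >= min_size_sim):
                cluster.append(call)
                break
        else:
            clusters.append([call])
    return clusters

calls = [
    ("chr1", 10_000, 500),    # caller A
    ("chr1", 10_120, 480),    # caller B, likely the same event
    ("chr1", 50_000, 2_000),  # a distinct event
]
print(len(merge_svs(calls)))  # 2 clusters
```

Tuning max_dist and min_size_sim trades sensitivity for specificity when unifying call sets of differing depth, which is exactly the evaluation this heterogeneous collection requires.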
Tandem Repeat Genotyping Tool (TRGT) will be used to generate a tandem repeat call set and Google DeepVariant will be used to call small variants. In addition, other specialized variant callers will be utilized as they are developed by this group or the scientific community at large.
Large-scale long-read alignment and variant calling. Alignment and variant calling will occur within AnVIL using a unified WDL, but we will use separate workspaces for each cohort to maintain privacy of the samples. Data sets that cannot be uploaded to AnVIL will be processed where they are housed, using a non-cloud version of the pipeline. Summary statistics of the variant calls will then be exported from the individual cohorts into a unified workspace to aggregate and analyze the results.
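As a minimal sketch of the aggregation step, assuming each cohort exports per-variant allele counts (AC) and allele numbers (AN) keyed by a variant identifier (the keys and record layout here are hypothetical, chosen only for illustration):

```python
# Hypothetical per-cohort summary records: {variant_key: (AC, AN)},
# exported from each cohort's private workspace.
cohort_a = {"chr1-10000-DEL-500": (12, 200), "chr2-5000-INS-300": (7, 200)}
cohort_b = {"chr1-10000-DEL-500": (3, 150)}

def aggregate(*cohorts):
    """Pool allele counts across cohorts and compute allele frequencies."""
    totals = {}
    for cohort in cohorts:
        for variant, (ac, an) in cohort.items():
            t_ac, t_an = totals.get(variant, (0, 0))
            totals[variant] = (t_ac + ac, t_an + an)
    # Return pooled AC, AN, and allele frequency per variant.
    return {v: (ac, an, ac / an) for v, (ac, an) in totals.items()}

pooled = aggregate(cohort_a, cohort_b)
print(pooled["chr1-10000-DEL-500"])  # AC=15, AN=350, AF~0.043
```

Because only counts cross workspace boundaries, no individual-level genotypes need to leave a cohort's private workspace.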
Variant annotation and analysis. We will quality control the variant calls based on variant size distribution, Hardy-Weinberg equilibrium, Mendelian concordance within trios, and comparison to gold-standard structural variant calls from Genome in a Bottle and HGSVC.
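As an illustration of one of these filters, a Hardy-Weinberg check for a biallelic site can be computed as a chi-square statistic over observed genotype counts (a textbook sketch; the function name and any threshold are our own, not part of a named pipeline):

```python
# Textbook Hardy-Weinberg chi-square for a biallelic site from observed
# genotype counts: hom-ref (RR), het (RA), hom-alt (AA).
def hwe_chi_square(n_rr, n_ra, n_aa):
    n = n_rr + n_ra + n_aa
    p = (2 * n_rr + n_ra) / (2 * n)  # reference allele frequency
    q = 1 - p                        # alternate allele frequency
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_rr, n_ra, n_aa)
    # Sum of (O - E)^2 / E over the three genotype classes (1 d.f.).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Counts matching HWE expectations (p = 0.9) give a statistic near zero;
# a large statistic (e.g. > 3.84 at alpha = 0.05) would flag the call.
print(round(hwe_chi_square(81, 18, 1), 6))  # prints 0.0
```

In practice a site failing HWE in population controls often indicates a genotyping artifact (for example, a collapsed segmental duplication) rather than true biology, so such sites would be flagged for manual review.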
After the database of variants has been constructed from long reads, we hope to genotype the variants in short-read datasets. While not as performant as identifying variants directly with long reads, pan-genome tools such as PanGenie, Giraffe, and Paragraph allow the assessment of most types of variants. This should improve allele frequency estimates and allow us to identify which variants are accessible only via long-read sequencing. We also plan to perform SV-eQTL analysis of the variants within datasets that have RNA-seq data available, especially GTEx.
Finalize analysis and disseminate results. The variant catalog will be distributed as a public workspace in AnVIL, and the workflows will be deposited in Dockstore for use in other projects. We also aim to write a manuscript describing the results of the analysis, highlighting how the CoLoRS variant database can be used to enable screening of potentially pathogenic variants in clinical research samples.