The edges are constructed by connecting minimizer-anchored segments as a bidirected graph. One can consider this an extension of the string graph52 in which the overlaps are the minimizers at both ends. However, in the pangenome graph, each vertex includes a set of sequence segments from multiple genomes rather than a single sequence. We can also use the vertices in the MAP-graph to conduct a principal component analysis of the MHC class II regions. We collected all vertices in the MAP-graph to form the basis of the vectors.
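
The construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it works on the forward strand only (so it yields a plain directed graph rather than a bidirected one), uses lexicographic order in place of a hash ordering, and the function names `minimizers`, `build_map_graph` and `presence_vectors`, as well as the defaults `k=4` and `w=3`, are hypothetical choices made for illustration. Each vertex is keyed by the pair of minimizers at its two ends and collects the segments from every genome that share that pair; consecutive vertices overlap on one minimizer.

```python
from collections import defaultdict


def minimizers(seq, k=4, w=3):
    """Return sorted (position, k-mer) pairs chosen as window minimizers.

    Lexicographic order stands in for the hash ordering a real tool would use.
    """
    kmers = [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        kmer, pos = min(kmers[start:start + w])
        picked.add((pos, kmer))
    return sorted(picked)


def build_map_graph(genomes, k=4, w=3):
    """Forward-strand-only sketch of a minimizer-anchored graph (assumption).

    vertices: (left minimizer, right minimizer) -> [(genome, start, end), ...]
    edges: set of (vertex, next vertex) pairs seen consecutively in a genome
    """
    vertices = defaultdict(list)
    edges = set()
    for name, seq in genomes.items():
        anchors = minimizers(seq, k, w)
        previous = None
        for (p1, m1), (p2, m2) in zip(anchors, anchors[1:]):
            vertex = (m1, m2)
            # The segment spans from the left anchor to the end of the right one.
            vertices[vertex].append((name, p1, p2 + k))
            if previous is not None:
                edges.add((previous, vertex))
            previous = vertex
    return dict(vertices), edges


def presence_vectors(vertices, genome_names):
    """One 0/1 vector per genome over the sorted vertex list (the basis)."""
    basis = sorted(vertices)
    vectors = {}
    for name in genome_names:
        vectors[name] = [
            1 if any(g == name for g, _, _ in vertices[v]) else 0
            for v in basis
        ]
    return basis, vectors
```

The rows returned by `presence_vectors` could then be passed to any standard principal component analysis routine; that step is left out here to keep the sketch free of external dependencies.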

For comparison, we generate the AMY1A MAP-graphs at two different scales (Fig. 2) from the HPRC year one assemblies. These can be generated with PGR-TK in less than 3 minutes from indexed sequence data. In addition to the MAP-graph, we provide tools that analyze a MAP-graph to ‘relinearize’ it into a set of ‘principal bundles’. We design the algorithm to generate principal bundles representing the consensus paths that most likely correspond to repeat units in the pangenomes.

  • A smaller choice of ‘r’ generates a MAP-graph with more vertices, each of which represents a smaller portion of the pangenome.
  • Meanwhile, a limited representation of genomes in a population may miss significant structural variants, additional copies of genes and context for important variants for diseases not observed in a smaller dataset.
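
The effect of ‘r’ on graph granularity can be illustrated with a small sketch. The passage above does not define ‘r’, so the code assumes it acts as a reduction factor that thins an ordered list of anchors: an anchor is kept only if it has the smallest value in some window of r consecutive anchors. The function name `subsample_anchors` and the toy values are hypothetical. Under that assumption, a smaller r keeps more anchors and therefore yields more, shorter vertices.

```python
def subsample_anchors(values, r):
    """Return indices of anchors kept for a given reduction factor r.

    Assumption: an anchor survives when its value is the minimum of at
    least one window of r consecutive anchors (ties go to the leftmost).
    """
    if r <= 1:
        return list(range(len(values)))
    kept = set()
    for start in range(len(values) - r + 1):
        window = range(start, start + r)
        kept.add(min(window, key=lambda i: (values[i], i)))
    return sorted(kept)


# Toy anchor values (for example, hashed minimizers along one sequence).
anchor_values = [5, 1, 4, 2, 8, 3, 7, 6]
for r in (1, 2, 4):
    kept = subsample_anchors(anchor_values, r)
    # Adjacent kept anchors delimit one segment, so segments = anchors - 1.
    print(r, len(kept), max(len(kept) - 1, 0))
```
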
The successful deployment of minimizer- or minhash-based approaches in sequence comparison39,40,49 indicates that sequence segments with the same minimizer labels are also likely to be highly homologous. The homology between sequences can be further confirmed by explicit sequence alignment of the segments inside a MAP-graph vertex. However, the computationally intensive base-to-base alignment is not required for building the MAP-graph. In concurrent multiscale modeling, the quantities needed in the macroscale model are computed on-the-fly from the microscale models as the computation proceeds. In this setup, the macro- and micro-scale models are used concurrently. Take again the example of molecular dynamics.
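
As an illustration of the optional confirmation step for segments that share a MAP-graph vertex, the following sketch computes a pairwise identity from the edit distance between segments. This is a stand-in for a real aligner, not the method used by the authors; the function names `edit_distance` and `pairwise_identity` and the identity formula (one minus edit distance over the longer length) are assumptions made for illustration. The quadratic cost of the dynamic programming table also shows why skipping base-to-base alignment during graph construction saves time.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance via a row-by-row dynamic programming table."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # deletion from a
                current[j - 1] + 1,            # insertion into a
                previous[j - 1] + (ca != cb),  # match or substitution
            ))
        previous = current
    return previous[-1]


def pairwise_identity(segments):
    """Map each pair of segment names to an approximate identity in [0, 1].

    segments: dict of name -> sequence, e.g. the segments in one vertex.
    """
    result = {}
    for (n1, s1), (n2, s2) in combinations(sorted(segments.items()), 2):
        longest = max(len(s1), len(s2)) or 1
        result[(n1, n2)] = 1.0 - edit_distance(s1, s2) / longest
    return result
```

For example, `pairwise_identity({"hapA": "ACGT", "hapB": "AGGT"})` compares the two hypothetical segments and reports one substitution over four bases.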

Furthermore, we can generate a local pangenome graph (MAP-graph) for comparing the sequences in the pangenome dataset at various scales by adjusting parameters to fit different analysis tasks. Another important gene family, DAZ1/DAZ2/DAZ3/DAZ4, lies in a set of nested palindromic repeats. It has been reported that partial deletions in this region may cause male infertility57. It would be useful to understand the natural distribution of non-pathogenic structural variants across this ampliconic gene cluster. DAZ1 and DAZ2 are roughly 1.5 Mbp from DAZ3 and DAZ4, and HG002 has a 1 to 2 Mbp inversion relative to GRCh38 with breakpoints in the segmental duplications that contain the DAZ genes (Fig. 4b). In addition to the large inversion, the DAZ genes contain structural variants, including a roughly 10 kb deletion in DAZ2, two deletions in DAZ4 and two insertions in DAZ3 of sequences that are found only in DAZ1 and DAZ4 in GRCh38.