4 Datasets provided for download

4.1 Details of the four datasets

We provide downloads of variant datasets, including single nucleotide polymorphism (SNP), insertion and deletion (InDel, ≤50 bp), and large structural variant (SV, ≥51 bp) datasets (Figure 4.1). The SNPs and small InDels are identified among all 2,839 rice hybrids and the 486 parental lines of the hybrids. Variants are called using the “HaplotypeCaller”, “GenomicsDBImport” and “GenotypeGVCFs” functions in GATK (the Genome Analysis Toolkit, v4.1.4.1) with default parameters (McKenna et al., 2010). Variant filtration is conducted using the “VariantFiltration” function in GATK with the parameters “--cluster-size 3 --cluster-window-size 10 QD<10.00 FS>15.000 AC<3 DP>200||DP<5” for SNPs and “QD<10.00 FS>30.000 DP>200||DP<5” for InDels. The SVs are identified among 964 rice hybrids using a graph-based genome (Qin et al., 2021) and an SV genotyping pipeline integrated in the Variation graph toolkit (Garrison et al., 2018).

Figure 4.1: SNP, InDel and SV variant datasets for download
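
The filtering thresholds quoted above for SNPs and InDels can be illustrated with a short Python sketch. It is not the GATK implementation: the `SiteAnnotations` container and the `fails_filter` function are hypothetical names, the conditions are assumed to be combined with OR (a site is flagged if any one of them holds), and the SNP cluster filter (cluster size 3, window size 10) is not modeled.

```python
from dataclasses import dataclass


@dataclass
class SiteAnnotations:
    """Site-level annotations for one variant (hypothetical container)."""
    qd: float  # QD: variant quality normalized by depth
    fs: float  # FS: Fisher strand-bias score
    dp: int    # DP: total read depth at the site
    ac: int    # AC: alternate allele count


def fails_filter(site: SiteAnnotations, is_snp: bool) -> bool:
    """Return True if the site would be flagged under the quoted thresholds.

    Assumption: the quoted conditions are OR-ed, so any single one flags the site.
    """
    fs_max = 15.0 if is_snp else 30.0  # FS>15 for SNPs, FS>30 for InDels
    flagged = (
        site.qd < 10.0
        or site.fs > fs_max
        or site.dp > 200
        or site.dp < 5
    )
    if is_snp:
        # AC<3 appears only in the SNP parameter string
        flagged = flagged or site.ac < 3
    return flagged


good = SiteAnnotations(qd=25.0, fs=2.0, dp=60, ac=10)
strand_biased = SiteAnnotations(qd=25.0, fs=20.0, dp=60, ac=10)

print(fails_filter(good, is_snp=True))            # False
print(fails_filter(strand_biased, is_snp=True))   # True: FS exceeds 15
print(fails_filter(strand_biased, is_snp=False))  # False: FS does not exceed 30
```

The same site can therefore pass as an InDel but fail as a SNP, because the FS threshold is stricter for SNPs.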

In addition, a differentiated indica-japonica variant dataset is available (Figure 4.2). This dataset comprises 830,245 SNPs, identified according to the following criteria:

At an indica-japonica differentiated SNP site:

  1. ≥17 indica accessions share the same genotype;

  2. ≥21 japonica accessions share the same genotype;

  3. the indica and japonica accessions possess different genotypes.
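
The three criteria can be sketched in Python as a per-site check. This is an illustrative reading, not the pipeline used to build the dataset: the function names, the genotype strings (`"0/0"`, `"1/1"`), and the handling of missing calls (`"./."` is ignored) are assumptions, and criterion 3 is interpreted as the majority genotypes of the two groups being different.

```python
from collections import Counter

MIN_INDICA = 17    # criterion 1: at least 17 indica accessions agree
MIN_JAPONICA = 21  # criterion 2: at least 21 japonica accessions agree
MISSING = "./."    # assumed encoding of a missing genotype call


def majority(genotypes):
    """Return (most common called genotype, its count), ignoring missing calls."""
    called = [g for g in genotypes if g != MISSING]
    if not called:
        return None, 0
    return Counter(called).most_common(1)[0]


def is_differentiated(indica_gts, japonica_gts):
    """Apply the three criteria to the genotypes observed at one SNP site."""
    ind_gt, ind_n = majority(indica_gts)
    jap_gt, jap_n = majority(japonica_gts)
    return (
        ind_n >= MIN_INDICA        # criterion 1
        and jap_n >= MIN_JAPONICA  # criterion 2
        and ind_gt != jap_gt       # criterion 3
    )


# 18 indica accessions carry 0/0, 22 japonica accessions carry 1/1
site_a = (["0/0"] * 18 + ["1/1"], ["1/1"] * 22 + ["0/0"])
# Only 16 indica accessions agree, so criterion 1 fails
site_b = (["0/0"] * 16 + ["1/1"] * 3, ["1/1"] * 23)

print(is_differentiated(*site_a))  # True
print(is_differentiated(*site_b))  # False
```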

Figure 4.2: Differentiated indica-japonica variant dataset for download

In total, nineteen indica and twenty-three temperate japonica rice accessions are used for the analysis (Figure 4.3). Among them, 41 accessions are reported by Zhao et al. (2018), and one additional O. sativa temperate japonica accession (named V1) is newly sequenced.

Figure 4.3: Neighbor-joining tree of the 42 rice accessions used for differentiated SNP identification

4.2 References

McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20, 1297-1303 (2010).

Qin, P. et al. Pan-genome analysis of 33 genetically diverse rice accessions reveals hidden genomic variations. Cell 184, 3542-3558 (2021).

Garrison, E. et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol 36, 875-879 (2018).

Zhao, Q. et al. Pan-genome analysis highlights the extent of genomic variation in cultivated and wild rice. Nat Genet 50, 278-284 (2018).