An InDel is the insertion or deletion of a small sequence fragment in the genome, typically 1-50 bp long. This upper limit reflects the read length of Illumina sequencing, which is about 100 bp for both single-end (100 bp) and paired-end (2 x 100 bp) runs. As a result, the indels that can be reliably detected during short-read variant calling are shorter than 100 bp, and usually no longer than about 50 bp. InDels are generally less frequent than SNPs, and like SNPs they reflect differences between the sample and the reference genome. An InDel in a coding region can cause a frameshift mutation when its length is not a multiple of three, leading to changes in gene function.

Commonly used methods for detecting indels and structural variation include:

  • High-throughput sequencing-based detection: for example, reduced-representation genome sequencing, whole genome resequencing (WGS), PacBio CLR (continuous long read) detection, and CCS HiFi variant detection.
  • Array-based detection: including array comparative genomic hybridization (array CGH).
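
To make the length range and the frameshift rule described above concrete, here is a minimal Python sketch that classifies a VCF-style REF/ALT allele pair. The names `Indel` and `classify_indel` and the `max_len` default of 50 bp are illustrative assumptions, not part of any particular tool. The frameshift flag is only meaningful for variants that fall within a coding region, which this sketch does not check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Indel:
    kind: str         # "insertion" or "deletion"
    length: int       # net number of inserted or deleted bases
    frameshift: bool  # True when length is not a multiple of 3


def classify_indel(ref: str, alt: str, max_len: int = 50) -> Optional[Indel]:
    """Classify a REF/ALT pair as a small indel, or return None.

    Hypothetical helper: it uses only the net length difference between
    the two alleles and assumes max_len (50 bp) as the small-indel cutoff.
    """
    diff = len(alt) - len(ref)
    if diff == 0:
        return None  # equal lengths: a substitution, not an indel
    length = abs(diff)
    if length > max_len:
        return None  # longer than the assumed small-indel range
    kind = "insertion" if diff > 0 else "deletion"
    return Indel(kind=kind, length=length, frameshift=(length % 3 != 0))


if __name__ == "__main__":
    print(classify_indel("A", "ATG"))   # 2 bp insertion, shifts the frame
    print(classify_indel("ACGT", "A"))  # 3 bp deletion, keeps the frame
```

A 2 bp insertion is reported with `frameshift=True`, while a 3 bp deletion removes a whole codon's worth of bases and is reported with `frameshift=False`.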

Fig 1. Variant calling (including SNVs, indels, deletions, and insertions) and phasing with CCS reads. (Wenger A M, et al. 2019)

Data Analysis Technical Route

Fig 2. Flow chart showing indel analysis.

An Example of Indel Analysis Process

Different types of raw data call for different analysis schemes. The following is a general indel analysis process for RNA-seq data, using GATK to detect indels:

1. Use STAR to align the reads to the reference genome (mapping to the reference).

2. Use Picard's MarkDuplicates tool to mark duplicate reads as part of data cleanup.

3. Use GATK's SplitNCigarReads tool to split reads that contain N in their CIGAR string (reads spanning splice junctions).

4. Perform base quality score recalibration (BQSR), which uses machine learning to adjust the original base quality scores.

5. Perform variant calling, filtering and annotation.
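
The steps above can be sketched as a sequence of command lines. The Python function below only assembles the commands and does not run them. Several things here are assumptions for illustration: the function name `build_indel_pipeline`, all file names, paired-end uncompressed FASTQ input, GATK4-style invocation (with MarkDuplicates run through the `gatk` wrapper), and the filter thresholds. Annotation is left out because the text does not name an annotation tool.

```python
import shlex
from pathlib import Path


def build_indel_pipeline(sample, fastq1, fastq2, star_index, ref_fasta,
                         known_sites, outdir="out", threads=8):
    """Return the pipeline as a list of argument lists (nothing is executed)."""
    prefix = str(Path(outdir) / sample) + "."
    star_bam = prefix + "Aligned.sortedByCoord.out.bam"  # suffix STAR appends
    dedup_bam = prefix + "dedup.bam"
    split_bam = prefix + "split.bam"
    recal_table = prefix + "recal.table"
    recal_bam = prefix + "recal.bam"
    raw_vcf = prefix + "raw.vcf.gz"
    indel_vcf = prefix + "indels.vcf.gz"
    filtered_vcf = prefix + "indels.filtered.vcf.gz"

    commands = [
        # 1. Align reads with STAR; a read group is attached because GATK needs one.
        ["STAR", "--runThreadN", threads, "--genomeDir", star_index,
         "--readFilesIn", fastq1, fastq2, "--twopassMode", "Basic",
         "--outSAMtype", "BAM", "SortedByCoordinate",
         "--outSAMattrRGline", f"ID:{sample}", f"SM:{sample}", "PL:ILLUMINA",
         "--outFileNamePrefix", prefix],
        # 2. Mark duplicate reads (Picard tool bundled with GATK4).
        ["gatk", "MarkDuplicates", "-I", star_bam, "-O", dedup_bam,
         "-M", prefix + "dup_metrics.txt", "--CREATE_INDEX", "true"],
        # 3. Split reads whose CIGAR contains N (spliced alignments).
        ["gatk", "SplitNCigarReads", "-R", ref_fasta,
         "-I", dedup_bam, "-O", split_bam],
        # 4. Base quality score recalibration: build the model, then apply it.
        ["gatk", "BaseRecalibrator", "-R", ref_fasta, "-I", split_bam,
         "--known-sites", known_sites, "-O", recal_table],
        ["gatk", "ApplyBQSR", "-R", ref_fasta, "-I", split_bam,
         "--bqsr-recal-file", recal_table, "-O", recal_bam],
        # 5. Call variants, keep indels only, then hard-filter.
        ["gatk", "HaplotypeCaller", "-R", ref_fasta, "-I", recal_bam,
         "-O", raw_vcf, "--dont-use-soft-clipped-bases",
         "--standard-min-confidence-threshold-for-calling", "20"],
        ["gatk", "SelectVariants", "-R", ref_fasta, "-V", raw_vcf,
         "--select-type-to-include", "INDEL", "-O", indel_vcf],
        # Thresholds below are assumed example values, not tuned settings.
        ["gatk", "VariantFiltration", "-R", ref_fasta, "-V", indel_vcf,
         "--filter-name", "FS", "--filter-expression", "FS > 30.0",
         "--filter-name", "QD", "--filter-expression", "QD < 2.0",
         "-O", filtered_vcf],
    ]
    return [[str(arg) for arg in cmd] for cmd in commands]


if __name__ == "__main__":
    # Placeholder inputs; replace with real paths before running anything.
    for cmd in build_indel_pipeline("sample1", "R1.fastq", "R2.fastq",
                                    "star_index", "ref.fa", "known.vcf.gz"):
        print(shlex.join(cmd))
```

Returning argument lists instead of shell strings keeps each command easy to inspect and lets it be passed directly to `subprocess.run` once the paths, reference indexes, and thresholds have been checked for the dataset at hand.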

Data Ready

Before data analysis, the first step is to get your data ready. For indel analysis, the raw input data can be microarray data or different types of high-throughput sequencing data, and the data can be obtained from the following channels:

Figure: Channels of raw input data for indel analysis.

To process data more efficiently, we prefer to receive data files in raw format, but we can also accept pre-normalized files. In addition, many indel-related databases are currently available, and we can provide services for obtaining and mining data from them. If you do not currently have input data, CD Genomics can also provide a variety of sequencing services, drawing on its extensive sequencing experience. If you have any questions about the data analysis cycle, analysis content, or price, please contact us through the online inquiry.

