Capture QC

Summary

Target size: 1257257 bases

File	Number reads	% target bases with coverage >= 1.0x	Coverage saturation (slope at the end of the curve)	% reads on target	Duplicated reads on/off target	Coverage distribution (mean coverage)	Coverage per position	Standard deviation of coverage within regions
/home/user/ngsCAT/example2.bam	237535	51.1%	4.6e-05	67.3%	ON: 25.7%; OFF: 8.1%	31.8x	8207 consecutive bases with coverage <= 6	nan
Overall status

Note

Please note that the criteria for deciding whether a particular experiment was successful depend on the capture design, sequencing platform and data analysis pipeline. The default warning thresholds are based on general whole exome enrichment experiments. If these default thresholds are not appropriate for your experiment, they can be modified by editing the configuration file config.py. As a general guideline, sensitivity parameters are more relevant than specificity parameters, and the latter more than uniformity ones, according to their impact on the performance of target enrichment experiments.

Sensitivity

A set of graphs and data tables for assessing how well the bases in the target regions were sequenced.

Percentage of target bases covered

Description
Percentage of target bases covered as a function of a coverage threshold. Y axis represents the % of target bases covered. X axis represents different coverage thresholds. Each bar represents the percentage of target bases covered at the given coverage threshold.

A typical target-enrichment NGS experiment results in ~90% of target bases covered at coverage >=1x. This value tends to decrease as the coverage threshold increases. How fast this percentage decreases as the threshold increases depends on the specific experimental design and results. A warning is issued if the percentage of bases with coverage >= 1.0x is less than 90% for any of the samples.
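
The calculation behind each bar can be sketched as follows. This is a minimal illustration, not ngsCAT's own code: the function name `percent_covered` and the toy coverage values are hypothetical, and only the 90% threshold comes from the text above.

```python
def percent_covered(coverages, thresholds):
    """Return {threshold: % of target bases with coverage >= threshold}."""
    total = len(coverages)
    if total == 0:
        raise ValueError("no target bases supplied")
    return {
        t: 100.0 * sum(1 for c in coverages if c >= t) / total
        for t in thresholds
    }


# Toy per-base coverage for a 10-base target (illustrative values only).
coverage = [0, 0, 1, 3, 5, 12, 12, 20, 31, 40]
pct = percent_covered(coverage, [1, 5, 10, 20, 30])

# Default warning rule quoted above: fewer than 90% of bases at >= 1x.
warn = pct[1] < 90.0
```

With these toy values 8 of the 10 bases reach 1x, so the warning would be raised.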

Coverage saturation

Description
Percentage of target bases covered at coverage >=10x as a function of the number of mapped reads. Sequencing depth simulations were carried out by randomly selecting 0.01x10⁶, 0.02x10⁶, 0.025x10⁶, 0.05x10⁶, 0.075x10⁶, 0.1x10⁶, 0.2x10⁶, 0.3x10⁶, 0.4x10⁶ and 0.5x10⁶ reads from the bam file. For each of these sets of reads, the percentage of target bases covered was calculated. This graph indicates how much the percentage of target bases covered could be improved by resequencing.

A flat curve on the right-hand side indicates that resequencing will not increase the number of target bases covered at 10x. A warning is issued if the curve does not tend to saturation on the right side (slope between the last two points > 1e-05). If the maximum depth provided as input is greater than the number of reads in the bam file, the last x-value corresponds to the number of reads in the bam file.
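
The saturation check can be sketched as a slope between the last two points of the curve. This is an illustration under assumptions: it takes x as the number of reads and y as the percentage of target bases covered at >=10x, which may not match the units ngsCAT uses internally, and the function name and curve values are hypothetical.

```python
def final_slope(points):
    """Slope between the last two (number_of_reads, pct_covered) points."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    (x0, y0), (x1, y1) = points[-2], points[-1]
    if x1 == x0:
        raise ValueError("last two points share the same x value")
    return (y1 - y0) / (x1 - x0)


# Hypothetical saturation curve: (reads sampled, % target bases at >= 10x).
curve = [(100_000, 30.0), (200_000, 42.0), (300_000, 47.0)]
slope = final_slope(curve)

# Default rule quoted above: warn when the curve is still rising.
warn = slope > 1e-05
```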

Specificity

A set of graphs and data tables for assessing how much of the sequencing effort is being wasted on off-target regions.

Number of reads on target
Stats
Overall percentage of reads on target:
- /home/user/ngsCAT/example2.bam: 67.3%
Overall enrichment:
- /home/user/ngsCAT/example2.bam: 5069.3
Description
Bars represent the percentage of reads on-target per chromosome. The percentage for each bar was calculated relative to the total number of reads mapped to the corresponding chromosome. Enrichment was calculated as: (on-target reads per Kb)/(off-target reads per Kb).

In a typical experiment one may expect ~80% of reads mapping on-target. A warning is issued if the % of reads on-target is lower than 80% for any of the samples.
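
The enrichment formula above can be sketched as below. This is a minimal example with toy numbers, not the values from this report: the helper names are hypothetical, and the size of the off-target space (here `offtarget_bases`) is an assumed input, since the report does not say how it is derived.

```python
def reads_per_kb(n_reads, n_bases):
    """Read density expressed as reads per kilobase."""
    if n_bases <= 0:
        raise ValueError("region size must be positive")
    return n_reads / (n_bases / 1000.0)


def enrichment(on_reads, off_reads, target_bases, offtarget_bases):
    """(on-target reads per Kb) / (off-target reads per Kb)."""
    return reads_per_kb(on_reads, target_bases) / reads_per_kb(
        off_reads, offtarget_bases
    )


# Toy numbers (illustrative only).
on_reads, off_reads = 800, 200
target_bases, offtarget_bases = 1_000, 99_000

pct_on_target = 100.0 * on_reads / (on_reads + off_reads)
enr = enrichment(on_reads, off_reads, target_bases, offtarget_bases)

# Default rule quoted above: warn below 80% of reads on-target.
warn = pct_on_target < 80.0
```

Because the target is usually a tiny fraction of the genome, enrichment values in the thousands are compatible with an on-target percentage well below 100%.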

Duplicated reads on/off target

Description
Percentages of duplicated on/off-target reads. Reads mapping at exactly the same start and end positions were considered duplicates. The X axis indicates the number of times the reads are duplicated (1 indicates unique reads). Green and red bars indicate the percentage of on- and off-target reads with respect to the total number of on- and off-target reads, respectively.

One may expect a greater proportion of duplicated reads on-target due to the enrichment process. Duplicated off-target reads should be due to other experimental artifacts (e.g. PCR). Thus, a warning is issued if the percentage of duplicated on-target reads is lower than the percentage of duplicated off-target reads for any of the samples.
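
The duplicate definition above (same start and end position) can be sketched as follows. This is an assumption-laden illustration: reads are represented as hypothetical `(chrom, start, end)` tuples, and every read at a position shared by more than one read is counted as duplicated, which may differ from how ngsCAT bins its bars.

```python
from collections import Counter


def percent_duplicated(reads):
    """% of reads sharing their (chrom, start, end) with another read."""
    if not reads:
        raise ValueError("no reads supplied")
    copies = Counter(reads)
    duplicated = sum(n for n in copies.values() if n > 1)
    return 100.0 * duplicated / len(reads)


# Toy alignments (illustrative only).
on_target = [("chr1", 10, 60)] * 3 + [("chr1", 100, 150)]
off_target = [("chr2", 5, 55), ("chr2", 70, 120)]

on_dup = percent_duplicated(on_target)
off_dup = percent_duplicated(off_target)

# Rule quoted above: warn when on-target duplication is the lower of the two.
warn = on_dup < off_dup
```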

Uniformity

A set of graphs and data tables for assessing whether coverage is uniformly distributed among target regions.

Coverage distribution

Description
Distribution of coverage per target base (only bases with coverage >= 1x are shown on the left graph). The star symbol in the boxplot graph indicates the mean coverage.

Low-medium coverage experiments may present a mean coverage of ~40x. A warning is issued if mean coverage is below 40x for any of the samples.

Coverage per position

Description
Coverage found at each target base. One graph is provided for each chromosome (contig) in the target bed. X axis represents target positions. Only target bases are represented in the X axis: target regions appear concatenated. Y axis represents coverage. The .txt file lists all those target intervals with 0 coverage.

Wide gaps or peaks may indicate capture biases. A warning is issued if more than 100 consecutive bases have coverage below 6x for any of the samples.
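
Finding the longest stretch of poorly covered bases amounts to a run-length scan, sketched below. The function name and toy data are hypothetical, and the strict comparison (`< 6`) is an assumption: the summary table words the same rule as "coverage <= 6", so the exact boundary used by the tool should be checked.

```python
def longest_low_run(coverages, limit):
    """Length of the longest run of consecutive bases with coverage < limit."""
    longest = current = 0
    for c in coverages:
        current = current + 1 if c < limit else 0
        longest = max(longest, current)
    return longest


# Toy per-base coverage along concatenated target regions (illustrative).
coverage = [10] * 5 + [2] * 120 + [8] * 5 + [0] * 30
run = longest_low_run(coverage, 6)

# Default rule quoted above: warn on more than 100 consecutive low bases.
warn = run > 100
```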

Standard deviation of the coverage within regions

Description
Distribution of the standard deviation of the coverage within target regions. In other words, for each target region the standard deviation of the coverage per base is calculated. All of these "standard deviations" are sampled to draw the histogram/boxplot above (y axis of the boxplot appears in log-scale).

Within a target region, coverage profiles such as the ones shown below are commonly observed:

Bases near the 5'/3' ends of target regions tend to be less well covered than bases located in the middle of target regions. Graphs in this section show the coverage variation within target regions and are mainly useful for comparing different target-enrichment NGS experiments. The lower the mean of this distribution, the more uniform the coverage within target regions.

A warning is issued if normalized std is greater than 0.3 for any of the samples.
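
The per-region statistic can be sketched as below. The report does not define "normalized std", so this example assumes it means the standard deviation divided by the mean coverage of the region; the function name and toy regions are hypothetical, and which summary of these values is compared against 0.3 is not stated here.

```python
from statistics import mean, pstdev


def normalized_std(region_coverage):
    """Std of per-base coverage divided by its mean (assumed definition)."""
    m = mean(region_coverage)
    if m == 0:
        return float("nan")  # undefined for regions with no coverage
    return pstdev(region_coverage) / m


# Toy per-base coverage for two target regions (illustrative only).
regions = [
    [10, 10, 10, 10],  # perfectly uniform
    [5, 10, 15, 10],   # uneven coverage
]
values = [normalized_std(r) for r in regions]

# Regions whose value exceeds the 0.3 threshold quoted above.
flagged = [i for i, v in enumerate(values) if v > 0.3]
```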

GC bias

Description
For each target region, the mean coverage of its bases is calculated, as well as the percentage of Gs and Cs it contains. For each target region, a point (GC content, mean coverage) is plotted in the graph. This plot makes it possible to observe sequencing biases that depend on the GC content of target regions. For example, lower coverage in regions with high GC or high AT content has long been observed. GC bias in sequencing studies is in large part due to early PCR steps during library generation, where high and low GC content cause reduced amplification and therefore lower sequencing coverage.
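
The points of the scatter plot can be computed as sketched below. This is only an illustration: the function name, the sequences and the coverage values are made up, and ambiguous bases such as N are simply counted as non-GC, which is an assumption.

```python
def gc_percent(sequence):
    """Percentage of G and C bases in a sequence (case-insensitive)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


# Toy target regions: (reference sequence, per-base coverage).
regions = [
    ("ATGCGC", [10, 12, 14, 12, 10, 14]),
    ("ATATAT", [3, 4, 5, 4, 3, 5]),
]

# One (GC content, mean coverage) point per target region.
points = [(gc_percent(seq), sum(cov) / len(cov)) for seq, cov in regions]
```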
