Bioinformatics/Quality control of sequencing data with fastp

Features of fastp

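  • Comprehensive, automatic quality control of the data, with an easy-to-read report;
  • Filtering of bad reads (low quality, too short, too many N, ...);
  • For the head or tail of each read, computes the mean quality in a sliding window and cuts off the low-quality subsequence (similar to Trimmomatic, but much faster);
  • Global trimming (at the head/tail, which does not interfere with deduplication); for Illumina data the last one or two cycles often need this treatment;
  • Adapter removal. You do not need to supply the adapter sequences, because the algorithm detects them automatically and trims them;
  • For paired-end (PE) data, automatically finds the overlapping region of each read pair and corrects mismatched bases in that region according to their quality values;
  • Trimming of polyG tails. Illumina NextSeq/NovaSeq use two-color chemistry, so polyG tails are common, and this feature is enabled by default for data from those two platforms;
  • Preprocessing of data with unique molecular identifiers (UMIs), whether the UMI is located in the insert or in the index;
  • Splitting of the output, in two modes: by the number of output files, or by the number of lines per file.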

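fastp supports gzip input and output and handles both single-end (SE) and paired-end (PE) data. Besides short reads from platforms such as Illumina, it also supports PacBio/Nanopore long reads to some extent.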

fastp generates a report in HTML format. The report contains no static images; all charts are drawn dynamically with JavaScript, so they are interactive. A sample report is available at: http://opengene.org/fastp/fastp.html

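With automated analysis in mind, fastp also writes a machine-readable JSON result in addition to the human-readable HTML report. The JSON report contains all of the information in the HTML report, and its layout is specially formatted so that it is both easy for programs to parse and easy to read in any text editor. A sample JSON result is available at: http://opengene.org/fastp/fastp.json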

Installing fastp


fastp download site

Installing from Bioconda

# not necessarily the latest version
conda install -c bioconda fastp

Installing the prebuilt binary

wget http://opengene.org/fastp/fastp
chmod a+x ./fastp

Building from source (Mac and Linux)

git clone https://github.com/OpenGene/fastp.git

# build
cd fastp
make

# install
sudo make install

Building from source (Windows)

git clone -b master --depth=1 https://github.com/OpenGene/fastp.git
cd fastp
make

fastp parameters and options

usage: fastp -i <in1> -o <out1> [-I <in2> -O <out2>] [options...]
options:
  # I/O options   (input and output files)
  -i, --in1                          read1 input file name (string)
  -o, --out1                         read1 output file name (string [=])
  -I, --in2                          read2 input file name (string [=])
  -O, --out2                         read2 output file name (string [=])
  -6, --phred64                      indicates the input is using phred64 scoring (it'll be converted to phred33, so the output will still be phred33)
  -z, --compression                  compression level for gzip output (1 ~ 9). 1 is fastest, 9 is smallest, default is 2. (int [=2])
    --reads_to_process               specify how many reads/pairs to be processed. Default 0 means process all reads. (int [=0])
  
  # adapter trimming options
  -A, --disable_adapter_trimming     adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled
  -a, --adapter_sequence               the adapter for read1. For SE data, if not specified, the adapter will be auto-detected. For PE data, this is used if R1/R2 are found not overlapped. (string [=auto])
      --adapter_sequence_r2            the adapter for read2 (PE data only). This is used if R1/R2 are found not overlapped. If not specified, it will be the same as <adapter_sequence> (string [=])
    
  # global trimming options   (number of bases to cut from the start and end of each read)
  -f, --trim_front1                  trimming how many bases in front for read1, default is 0 (int [=0])
  -t, --trim_tail1                   trimming how many bases in tail for read1, default is 0 (int [=0])
  -F, --trim_front2                  trimming how many bases in front for read2. If it's not specified, it will follow read1's settings (int [=0])
  -T, --trim_tail2                   trimming how many bases in tail for read2. If it's not specified, it will follow read1's settings (int [=0])

  # polyG tail trimming, useful for NextSeq/NovaSeq data
  -g, --trim_poly_g                  force polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
      --poly_g_min_len                 the minimum length to detect polyG in the read tail. 10 by default. (int [=10])
  -G, --disable_trim_poly_g          disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data

  # polyX tail trimming
  -x, --trim_poly_x                    enable polyX trimming in 3' ends.
      --poly_x_min_len                 the minimum length to detect polyX in the read tail. 10 by default. (int [=10])
  
  # per read cutting by quality options   (sliding-window trimming)
  -5, --cut_by_quality5              enable per read cutting by quality in front (5'), default is disabled (WARNING: this will interfere deduplication for both PE/SE data)
  -3, --cut_by_quality3              enable per read cutting by quality in tail (3'), default is disabled (WARNING: this will interfere deduplication for SE data)
  -W, --cut_window_size              the size of the sliding window for sliding window trimming, default is 4 (int [=4])
  -M, --cut_mean_quality             the bases in the sliding window with mean quality below cutting_quality will be cut, default is Q20 (int [=20])
  
  # quality filtering options   (filter reads by base quality)
  -Q, --disable_quality_filtering    quality filtering is enabled by default. If this option is specified, quality filtering is disabled
  -q, --qualified_quality_phred      the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified. (int [=15])
  -u, --unqualified_percent_limit    how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40% (int [=40])
  -n, --n_base_limit                 if one read's number of N base is >n_base_limit, then this read/pair is discarded. Default is 5 (int [=5])
  
  # length filtering options   (filter reads by length)
  -L, --disable_length_filtering     length filtering is enabled by default. If this option is specified, length filtering is disabled
  -l, --length_required              reads shorter than length_required will be discarded, default is 15. (int [=15])

  # low complexity filtering
  -y, --low_complexity_filter          enable low complexity filter. The complexity is defined as the percentage of base that is different from its next base (base[i] != base[i+1]).
  -Y, --complexity_threshold           the threshold for low complexity filter (0~100). Default is 30, which means 30% complexity is required. (int [=30])

  # filter reads with unwanted indexes (to remove possible contamination)
      --filter_by_index1               specify a file contains a list of barcodes of index1 to be filtered out, one barcode per line (string [=])
      --filter_by_index2               specify a file contains a list of barcodes of index2 to be filtered out, one barcode per line (string [=])
      --filter_by_index_threshold      the allowed difference of index barcode for index filtering, default 0 means completely identical. (int [=0])

  # base correction by overlap analysis options
  -c, --correction                   enable base correction in overlapped regions (only for PE data), default is disabled
  
  # UMI processing
  -U, --umi                          enable unique molecular identifier (UMI) preprocessing
      --umi_loc                      specify the location of UMI, can be (index1/index2/read1/read2/per_index/per_read), default is none (string [=])
      --umi_len                      if the UMI is in read1/read2, its length should be provided (int [=0])
      --umi_prefix                   if specified, an underline will be used to connect prefix and UMI (i.e. prefix=UMI, UMI=AATTCG, final=UMI_AATTCG). No prefix by default (string [=])
      --umi_skip                       if the UMI is in read1/read2, fastp can skip several bases following UMI, default is 0 (int [=0])

  # overrepresented sequence analysis
  -p, --overrepresentation_analysis    enable overrepresented sequence analysis.
  -P, --overrepresentation_sampling    One in (--overrepresentation_sampling) reads will be computed for overrepresentation analysis (1~10000), smaller is slower, default is 20. (int [=20])

  # reporting options
  -j, --json                         the json format report file name (string [=fastp.json])
  -h, --html                         the html format report file name (string [=fastp.html])
  -R, --report_title                 should be quoted with ' or ", default is "fastp report" (string [=fastp report])
  
  # threading options   (number of threads)
  -w, --thread                       worker thread number, default is 3 (int [=3])
  
  # output splitting options
  -s, --split                        split output by limiting total split file number with this option (2~999), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (int [=0])
  -S, --split_by_lines               split output by limiting lines of each file with this option(>=1000), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (long [=0])
  -d, --split_prefix_digits          the digits for the sequential number padding (1~10), default is 4, so the filename will be padded as 0001.xxx, 0 to disable padding (int [=4])
  
  # help
  -?, --help                         print this message

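Although there are many parameters, the commonly used ones fall into the following groups:

  • Input and output files
  • Adapter trimming
  • Global trimming (cutting a fixed number of bases from the start and end of every read)
  • Sliding-window quality trimming (similar to Trimmomatic)
  • Filtering of reads that are too short
  • Base correction (for paired-end data)
  • Quality filtering

1. Adapter trimming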

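Adapter trimming is enabled by default and can be turned off with the -A option. fastp can find adapter sequences automatically and trim them, so you do not have to supply any adapter sequence. For SE data you can still pass your adapter with -a. For PE data this is usually unnecessary: fastp finds adapters through overlap analysis of each read pair, which is more accurate, removes adapters more cleanly, and copes better with adapters that themselves contain mismatched bases. fastp includes a summary of adapter removal in its report.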
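The three adapter modes described above can be sketched as a small dry-run helper. This is only an illustration: the file names are hypothetical placeholders, the `adapter_cmd` function is not part of fastp, and the adapter shown is the commonly published Illumina TruSeq read 1 adapter, used here as an example value. The helper prints the commands rather than running them.

```shell
#!/bin/sh
# Hypothetical single-end input and output file names.
IN="se.fastq.gz"
OUT="se.trimmed.fastq.gz"
# Example adapter value (Illumina TruSeq read 1 adapter); replace with your own.
ADAPTER_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"

# adapter_cmd MODE -> prints the fastp command line for that adapter mode.
adapter_cmd() {
    case "$1" in
        auto)   echo "fastp -i $IN -o $OUT" ;;                  # default: auto-detect the adapter
        manual) echo "fastp -i $IN -o $OUT -a $ADAPTER_R1" ;;   # supply the adapter with -a
        off)    echo "fastp -i $IN -o $OUT -A" ;;               # disable adapter trimming
        *)      echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Dry run: list the command for each mode.
for mode in auto manual off; do
    adapter_cmd "$mode"
done
```

To run one of the printed commands for real, copy it into a shell where fastp is installed and the input file exists.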

2. Global trimming

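fastp can trim a fixed number of bases from the head and tail of every read. This is useful for removing cycles with poor sequencing quality; for example, in 2×151 PE sequencing the last cycle usually has very low quality and should be trimmed. Use -f and -t to set the trimming at the head and tail of read1, and -F and -T for the head and tail of read2.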
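To make the effect of fixed-length trimming concrete, here is a minimal sketch that applies the same idea to a single sequence string. The `trim_fixed` helper and the sequence are made up for illustration and only mimic what -f and -t do; they are not how fastp is implemented.

```shell
#!/bin/sh
# trim_fixed SEQ FRONT TAIL -> SEQ without FRONT leading and TAIL trailing bases.
trim_fixed() {
    printf '%s\n' "$1" | awk -v f="$2" -v t="$3" '{
        keep = length($0) - f - t          # number of bases that survive
        if (keep > 0)
            print substr($0, f + 1, keep)
        else
            print ""                       # everything was trimmed away
    }'
}

# Drop 1 base at the head and 2 at the tail of a made-up read.
trim_fixed "NACGTACGTAG" 1 2   # prints ACGTACGT

# A corresponding fastp call for a paired-end sample (placeholder file names):
#   fastp -i R1.fastq.gz -I R2.fastq.gz -o out.R1.fastq.gz -O out.R2.fastq.gz -f 1 -t 2 -F 1 -T 2
```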

3. Sliding-window quality trimming

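Low-quality bases are usually concentrated at the end of a read, and less often at its start. Like Trimmomatic, fastp can compute the mean quality of the bases in a sliding window and cut off windows that fall below the requirement. Use -5 to enable cutting at the 5' end (the start of the read) and -3 to enable cutting at the 3' end (the end of the read). Use -W to set the window size (default 4) and -M to set the required mean quality (default 20, i.e. Q20).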
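The sliding-window idea can be illustrated with a simplified model of 3' cutting. The sketch below is an assumption-laden toy, not fastp's actual implementation: it takes Phred scores as plain numbers, moves a window from the tail toward the head one base at a time, and stops at the first window whose mean reaches the threshold. The `cut_tail_len` helper and the scores are invented for the example.

```shell
#!/bin/sh
# cut_tail_len "Q1 Q2 ... Qn" WINDOW MIN_MEAN -> number of bases kept from the 5-prime end.
cut_tail_len() {
    printf '%s\n' "$1" | awk -v w="$2" -v m="$3" '{
        end = NF                               # candidate position of the last kept base
        while (end >= w) {
            sum = 0
            for (i = end - w + 1; i <= end; i++) sum += $i
            if (sum / w >= m) break            # window mean is good enough: stop cutting
            end--                              # otherwise slide one base toward the head
        }
        if (end < w) end = 0                   # no window passed: the whole read is cut
        print end
    }'
}

# Twelve made-up scores with a poor tail, window size 4, required mean Q20.
cut_tail_len "35 35 35 35 35 35 35 35 8 5 3 2" 4 20   # prints 10

# The real tool is invoked with, for example (placeholder file names):
#   fastp -i in.fastq.gz -o out.fastq.gz -3 -W 4 -M 20
```

Note that in this toy model the window ending at base 10 has a mean of 20.75, so two low-quality bases survive; a mean-based window tolerates a few poor bases as long as their neighbours are good.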

4. Filtering of reads that are too short

Length filtering is enabled by default with a minimum length of 15. Use -L (--disable_length_filtering) to disable it, or -l (--length_required) to set a different minimum length.
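Before choosing a value for -l, it can help to see how many reads a given cutoff would remove. The following sketch counts them with awk; it assumes an uncompressed FASTQ with exactly four lines per record on standard input, and the `length_report` helper and the demo records are invented for illustration.

```shell
#!/bin/sh
# length_report MINLEN -> reads FASTQ on stdin, prints total reads and reads shorter than MINLEN.
length_report() {
    awk -v min="$1" '
        NR % 4 == 2 {                      # line 2 of each 4-line record is the sequence
            total++
            if (length($0) < min) tooshort++
        }
        END { printf "total=%d too_short=%d\n", total, tooshort }
    '
}

# Two made-up records: one of 4 bases, one of 20 bases.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n' | length_report 15
# prints total=2 too_short=1
```

For a gzip-compressed file, the same function can be fed with `gzip -dc reads.fastq.gz | length_report 50`, where `reads.fastq.gz` is a placeholder name.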

5. Base correction (for paired-end data)

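For PE data, fastp can analyze each read pair to find its overlapping region. Where the two reads disagree at a base in the overlap, and one base has very high quality while the other has very low quality, the low-quality base is replaced with the high-quality one. This option is disabled by default and can be enabled with -c (--correction).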
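The correction rule can be sketched for a single position in the overlap. This is a toy model under stated assumptions: the two base calls are taken to be already aligned and in the same orientation, and the thresholds for "very high" (Q30) and "very low" (Q15) are hypothetical defaults chosen for the example, not values documented here for fastp. The `correct_pair` helper is invented for illustration.

```shell
#!/bin/sh
# correct_pair BASE1 QUAL1 BASE2 QUAL2 [HIGH] [LOW] -> prints the two bases after correction.
correct_pair() {
    b1=$1 q1=$2 b2=$3 q2=$4
    high=${5:-30}      # hypothetical threshold for a trustworthy call
    low=${6:-15}       # hypothetical threshold for an untrustworthy call
    if [ "$b1" != "$b2" ]; then
        if [ "$q1" -ge "$high" ] && [ "$q2" -le "$low" ]; then
            b2=$b1     # read1 is clearly better: overwrite the read2 call
        elif [ "$q2" -ge "$high" ] && [ "$q1" -le "$low" ]; then
            b1=$b2     # read2 is clearly better: overwrite the read1 call
        fi
    fi
    echo "$b1 $b2"
}

correct_pair A 38 G 6     # prints "A A"
correct_pair A 25 G 22    # prints "A G" (neither call is clearly better, so nothing changes)
```

With fastp itself no such logic has to be written; adding -c to a paired-end command line enables the built-in correction.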

6. Quality filtering

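fastp can filter out low-quality reads and reads with too many N bases. This is enabled by default and can be turned off with -Q. Use -q to set the Phred quality a base needs to count as qualified; for example, -q 15 means bases with quality ≥ Q15 are qualified. Use -u to set the maximum percentage of unqualified bases allowed. For example, -q 15 -u 40 means a read may have at most 40% of its bases below Q15, otherwise it is discarded. Use -n to limit how many N bases a read may contain.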
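The -q/-u/-n rule can be written down as a small check on one read. The sketch below is an illustration of the rule as described above, not fastp's code: it assumes Phred+33 quality encoding, and the `read_passes` helper and the demo reads are invented for the example. Its defaults mirror the documented defaults of 15, 40 and 5.

```shell
#!/bin/sh
# read_passes SEQ QUAL [Q] [U] [N] -> prints "pass" or the reason the read would be discarded.
read_passes() {
    printf '%s\t%s\n' "$1" "$2" | awk -F '\t' -v q="${3:-15}" -v u="${4:-40}" -v n="${5:-5}" '
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }   # ASCII lookup table
    {
        len = length($2); low = 0
        for (i = 1; i <= len; i++)
            if (ord[substr($2, i, 1)] - 33 < q) low++    # base below the qualified quality
        nbase = gsub(/N/, "N", $1)                       # count N bases in the sequence
        if (low * 100 > u * len) print "fail: too many low-quality bases"
        else if (nbase > n)      print "fail: too many N"
        else                     print "pass"
    }'
}

# "I" encodes Q40 and "#" encodes Q2 in Phred+33.
read_passes ACGTACGTAC 'IIIII#####'   # 50% of bases below Q15 -> fail: too many low-quality bases
read_passes ACGTACGTAC 'IIIIIII###'   # 30% of bases below Q15 -> pass
```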

Example usage of fastp

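#!/bin/bash

# Run fastp on ten paired-end samples, one background job per sample.
for i in 74 75 76 82 83 84 85 86 87 88; do
    {
    # -j/-h give each sample its own report; without them every job writes
    # the default fastp.json/fastp.html and the reports overwrite each other.
    # Note: -Q turns off quality filtering, which includes the N-base limit,
    # so --n_base_limit=6 only takes effect if -Q is removed.
    fastp -i ~/RNAseq/cleandata/SRR17343${i}_1.fastq.gz -o SRR17343${i}_1.fastq.gz \
        -I ~/RNAseq/cleandata/SRR17343${i}_2.fastq.gz -O SRR17343${i}_2.fastq.gz \
        -j SRR17343${i}.json -h SRR17343${i}.html \
        -Q --thread=5 --length_required=50 --n_base_limit=6 --compression=6
    }&
done
# Wait for all background jobs to finish.
wait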