Aligning transcriptome data with STAR

STAR: ultrafast universal RNA-seq aligner. Bioinformatics, Volume 29, Issue 1, 1 January 2013, Pages 15–21, https://doi.org/10.1093/bioinformatics/bts635

Reference: https://academic.oup.com/bioinformatics/article/29/1/15/272537

STAR official site: https://github.com/alexdobin/STAR

STAR manual: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf

Downloading and installing STAR

Installing the gcc compiler

# Ubuntu.
sudo apt-get update
sudo apt-get install g++
sudo apt-get install make
# Red Hat, CentOS, Fedora.
sudo yum update
sudo yum install make
sudo yum install gcc-c++
sudo yum install glibc-static
# SUSE.
sudo zypper update
sudo zypper in gcc gcc-c++

Installing STAR

Download the latest STAR source code from https://github.com/alexdobin/STAR/releases:

wget https://github.com/alexdobin/STAR/archive/2.6.1d.tar.gz
tar -xzf 2.6.1d.tar.gz
cd STAR-2.6.1d

# Alternatively, get STAR source using git
git clone https://github.com/alexdobin/STAR.git

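# Compile under Linux (for the tarball, the source directory is STAR-2.6.1d/source)
cd STAR/source
make STAR
# Compile under macOS (instead of make STAR)
make STARforMac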

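Using STAR

The basic STAR workflow consists of 2 steps:

1. Generating genome index files

In this step, the user supplies the reference genome sequences (FASTA files) and annotations (GTF file), from which STAR generates the genome indexes that are used in the second (mapping) step. The genome indexes are saved to disk and need to be generated only once for each genome/annotation combination. A limited collection of STAR genomes is available from http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/; however, users are strongly encouraged to generate their own genome indexes with the most up-to-date assemblies and annotations.

2. Mapping reads to the genome

In this step, the user supplies the genome files generated in the first step, as well as the RNA-seq reads (sequences) in the form of FASTA or FASTQ files. STAR maps the reads to the genome and writes several output files, such as alignments (SAM/BAM), mapping summary statistics, splice junctions, unmapped reads, and signal (wiggle) tracks. Mapping is controlled by a variety of input parameters (options).

All options are described in the parameter reference later in this document.

The STAR command line has the following format:

STAR --option1-name option1-value(s) --option2-name option2-value(s) ...

If an option can accept multiple values, they are separated by spaces and, in a few cases, by commas.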

Building the genome index with STAR

Example run:

# Index the genome; create the directory /path/to/GenomeDir first.
# There are two ways. Without annotations:
/pathToStarDir/STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDir --genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 --runThreadN <n>
# With annotation guidance (GFF3 or GTF):
/pathToStarDir/STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDir --genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 --runThreadN <n> --sjdbGTFfile <FileName> --sjdbOverhang <N>…
# For GFF3, also add --sjdbGTFtagExonParentTranscript Parent
# --sjdbOverhang <N> is the length of the "overhang" on each side of a splice junction; ideally MateLength - 1 for the RNA-seq reads

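Basic parameters:

  • --runMode genomeGenerate: directs STAR to run the genome index generation job.
  • --runThreadN: defines the number of threads used for genome generation; it has to be set to the number of available cores on the server node. Value: an integer, depending on the system hardware.
  • --genomeDir: specifies the path to the directory (henceforth called the "genome directory") where the genome indexes are stored. This directory has to be created (with mkdir) before the STAR run and needs to have write permissions. The file system needs at least 100GB of free disk space for a typical mammalian genome. It is recommended to remove all files from the genome directory before running the genome generation step. This directory path has to be supplied at the mapping step to identify the reference genome. Value: /path/to/genomeDir
  • --genomeFastaFiles: specifies one or more FASTA files with the genome reference sequences. Multiple reference sequences (henceforth called "chromosomes") are allowed in each FASTA file. You can rename the chromosomes in chrName.txt, keeping the order of the chromosomes in the file: the names from this file will be used in all output alignment files (such as .sam). Tabs are not allowed in chromosome names, and spaces are not recommended. Value: /path/to/genome/fasta1 /path/to/genome/fasta2 ...
  • --sjdbGTFfile: specifies the path to the file with annotated transcripts in the standard GTF format. STAR extracts splice junctions from this file and uses them to greatly improve the accuracy of the mapping. While this is optional and STAR can be run without annotations, using annotations is highly recommended whenever they are available. Starting from 2.4.1a, the annotations can also be included on the fly at the mapping step. Value: /path/to/annotations.gtf
  • --sjdbOverhang: specifies the length of the genomic sequence around the annotated junctions to be used in constructing the splice junction database. Ideally, this length should be equal to ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina 2x100b paired-end reads the ideal value is 100-1=99. For reads of varying length, the ideal value is max(ReadLength)-1. In most cases, the default value of 100 works as well as the ideal value. Value: ReadLength-1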

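The genome files comprise the binary genome sequence, suffix arrays, text chromosome names/lengths, splice junction coordinates, and transcript/gene information. Most of these files use an internal STAR format and are not intended to be used by the end user. Changing any of these files is strongly discouraged, with one exception: you can rename the chromosomes in chrName.txt, keeping the order of the chromosomes in the file; the chromosome names from this file will be used in all output files (e.g. SAM/BAM).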
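The index-building step can be sketched as a small shell script. This is a minimal sketch rather than an official recipe: the paths, read length, and thread count are hypothetical placeholders, and the script only prints the STAR command (a dry run) instead of executing it.

```shell
#!/bin/sh
# Hypothetical inputs; replace with your own files and settings.
GENOME_DIR=./star_index
FASTA=./genome.fa
GTF=./annotation.gtf
READ_LENGTH=100
THREADS=8

# STAR expects the genome directory to exist before the run.
mkdir -p "$GENOME_DIR"

# The ideal --sjdbOverhang is ReadLength-1.
OVERHANG=$((READ_LENGTH - 1))

CMD="STAR --runMode genomeGenerate --genomeDir $GENOME_DIR --genomeFastaFiles $FASTA --sjdbGTFfile $GTF --sjdbOverhang $OVERHANG --runThreadN $THREADS"

# Dry run: print the command. To execute it, run: eval "$CMD"
echo "$CMD"
```

Printing the command first makes it easy to review the options before committing to a long index build.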

Advanced options

Which chromosomes/scaffolds/patches to include

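It is strongly recommended to include the major chromosomes (e.g., for human chr1-22, chrX, chrY, chrM) as well as unplaced and unlocalized scaffolds. Typically, unplaced/unlocalized scaffolds add only a few megabases to the genome length; however, a substantial number of reads may map to ribosomal RNA (rRNA) repeats on these scaffolds. If the scaffolds are not included in the genome, these reads will be reported as unmapped or, even worse, may be aligned to the wrong loci on the chromosomes.

Generally, patches and alternative haplotypes should not be included in the genome.

Examples of acceptable genome sequence files:

It is strongly recommended to use the most comprehensive annotations for your species. Very importantly, chromosome names in the annotation GTF file have to match chromosome names in the FASTA genome sequence files. For example, one can use ENSEMBL FASTA files with ENSEMBL GTF files, and UCSC FASTA files with UCSC GTF files. However, since UCSC uses the chr1, chr2, ... naming convention and ENSEMBL uses 1, 2, ..., ENSEMBL and UCSC FASTA and GTF files cannot be mixed together unless the chromosomes are renamed to match between the FASTA and GTF files.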
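A quick way to catch a naming mismatch before building the index is to compare the chromosome names in the two files. The sketch below is illustrative only: the function name and demo files are hypothetical, and it assumes that the chromosome name is the first whitespace-delimited word of each FASTA header line.

```shell
#!/bin/sh
# Print chromosome names that occur in the GTF but not in the FASTA.
gtf_chr_missing_in_fasta() {
    fasta=$1
    gtf=$2
    tmp_fa=$(mktemp)
    tmp_gtf=$(mktemp)
    # FASTA names: first word of each header line, without the leading '>'.
    grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | sort -u > "$tmp_fa"
    # GTF names: column 1 of all non-comment lines.
    grep -v '^#' "$gtf" | cut -f1 | sort -u > "$tmp_gtf"
    # comm -13 keeps the lines that are unique to the second file.
    comm -13 "$tmp_fa" "$tmp_gtf"
    rm -f "$tmp_fa" "$tmp_gtf"
}

# Tiny demo: a UCSC-style FASTA combined with a GTF that has one ENSEMBL-style name.
demo=$(mktemp -d)
printf '>chr1 demo\nACGT\n>chr2\nACGT\n' > "$demo/genome.fa"
printf '1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n' > "$demo/ann.gtf"
printf 'chr2\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g2"; transcript_id "t2";\n' >> "$demo/ann.gtf"

gtf_chr_missing_in_fasta "$demo/genome.fa" "$demo/ann.gtf"
```

An empty result means every chromosome named in the GTF also exists in the FASTA; any name printed would need to be renamed in one of the files.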

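Annotations in GFF format

In addition to the options above, for GFF3-formatted annotations you need to use --sjdbGTFtagExonParentTranscript Parent. In general, for --sjdbGTFfile files STAR only processes lines that have --sjdbGTFfeatureExon (= exon by default) in the 3rd field (column). The exons are assigned to transcripts using the parent-child relationship defined by the --sjdbGTFtagExonParentTranscript (= transcript_id by default) GTF/GFF attribute.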

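Using a list of annotated junctions

STAR can also use annotations formatted as a list of splice junction coordinates in a text file:

--sjdbFileChrStartEnd /path/to/sjdbFile.txt. This file should contain 4 columns separated by tabs:

Chr <tab> Start <tab> End <tab> Strand=+/-/.

Here Start and End are the first and last bases of the intron (1-based chromosome coordinates). This file can be used in addition to the --sjdbGTFfile file, in which case STAR will extract the junctions from both files.

Note that the --sjdbFileChrStartEnd file can contain duplicate (identical) junctions; STAR will collapse (remove) the duplicates.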
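As an illustration of the four-column layout, the sketch below writes a small junction file and runs a sanity check on it. The coordinates are made up, and the check (four fields, Start not greater than End, strand one of +, -, .) is my own assumption about what is worth validating, not something STAR provides.

```shell
#!/bin/sh
SJ_FILE=$(mktemp)

# Columns: Chr, Start, End, Strand.
# Start and End are the first and last bases of the intron (1-based).
printf 'chr1\t1001\t2000\t+\n' >  "$SJ_FILE"
printf 'chr1\t5001\t7500\t-\n' >> "$SJ_FILE"
printf 'chr2\t301\t900\t.\n'   >> "$SJ_FILE"

# Count malformed lines: wrong field count, Start > End, or unknown strand.
BAD=$(awk -F '\t' 'NF != 4 || $2 > $3 || $4 !~ /^[+.-]$/ { bad++ } END { print bad + 0 }' "$SJ_FILE")
echo "malformed lines: $BAD"
```

The resulting file could then be passed as --sjdbFileChrStartEnd "$SJ_FILE".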

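Small genomes

For small genomes, the parameter --genomeSAindexNbases must be scaled down, with a typical value of min(14, log2(GenomeLength)/2 - 1). For example, for a 1 megabase genome this is equal to 9, and for a 100 kilobase genome it is equal to 7.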
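The formula can be evaluated with a short helper. The function name is hypothetical, and rounding to the nearest integer is an assumption on my part, chosen because it reproduces the two worked values given in the text (9 and 7).

```shell
#!/bin/sh
# Suggest --genomeSAindexNbases as min(14, log2(GenomeLength)/2 - 1).
sa_index_nbases() {
    awk -v len="$1" 'BEGIN {
        v = log(len) / log(2) / 2 - 1
        r = int(v + 0.5)          # round to nearest integer (assumption)
        if (r > 14) r = 14
        print r
    }'
}

sa_index_nbases 1000000      # 1 megabase genome
sa_index_nbases 100000       # 100 kilobase genome
sa_index_nbases 3000000000   # mammalian-size genome, capped at the default 14
```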

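Genomes with a large number of references

If you are using a genome with a large number (>5,000) of references (chromosomes/scaffolds), you may need to reduce --genomeChrBinNbits to reduce RAM consumption. The following scaling is recommended: --genomeChrBinNbits = min(18, log2[max(GenomeLength/NumberOfReferences, ReadLength)]). For example, for a 3 gigabase genome with 100,000 chromosomes/scaffolds, this is equal to 15.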
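The same scaling can be computed with a small helper. As before, the function name is hypothetical and the rounding to the nearest integer is an assumption that happens to reproduce the worked value of 15.

```shell
#!/bin/sh
# Suggest --genomeChrBinNbits as
# min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))).
chr_bin_nbits() {
    awk -v glen="$1" -v nref="$2" -v rlen="$3" 'BEGIN {
        m = glen / nref
        if (rlen > m) m = rlen
        r = int(log(m) / log(2) + 0.5)   # round to nearest integer (assumption)
        if (r > 18) r = 18
        print r
    }'
}

# Arguments: genome length, number of references, read length.
chr_bin_nbits 3000000000 100000 100   # 3 gigabases in 100,000 scaffolds
chr_bin_nbits 3000000000 25 100       # few long chromosomes: stays at the default 18
```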

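Mapping reads with STAR

Mapping example

/pathToStarDir/STAR --genomeDir /path/to/GenomeDir --readFilesIn /path/to/read1.gz [/path/to/read2.gz] --readFilesCommand zcat --runThreadN <n> --<inputParameterName> <inputparameter value(s)>
# Shared memory:
--genomeLoad <value>
# During mapping, this parameter controls whether the genome is loaded into RAM as shared memory. If it is shared, other STAR jobs on the same node that use the same genome as reference can share it, which saves computing resources. Read the manual before using it.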

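Basic options

--genomeDir specifies the path to the genome directory where the genome indexes were generated.

--readFilesIn gives the name(s) (with path) of the files containing the sequences to be mapped (e.g. RNA-seq FASTQ files). For Illumina paired-end reads, both the read1 and read2 files have to be supplied. STAR can process both FASTA and FASTQ files. Multi-line (i.e. sequence split over multiple lines) FASTA files are supported, but multi-line FASTQ files are not.

If the read files are compressed, use the --readFilesCommand UncompressionCommand option, where UncompressionCommand is the decompression command that takes the file name as its input argument and sends the uncompressed output to stdout. For example, for gzipped files (*.gz) use --readFilesCommand zcat or --readFilesCommand gunzip -c. For bzip2-compressed files, use --readFilesCommand bunzip2 -c.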
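Before starting a long run, it can help to confirm that the chosen decompression command really writes plain FASTQ to stdout. The sketch below does this on a tiny, made-up FASTQ file; the file names are hypothetical, and gunzip -c is used here because zcat does not behave identically on every platform.

```shell
#!/bin/sh
demo=$(mktemp -d)

# Two toy reads in 4-line FASTQ format.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$demo/reads.fq"
gzip "$demo/reads.fq"    # produces reads.fq.gz

# This pipe imitates what STAR would read with --readFilesCommand gunzip -c:
# count the header lines (every 4th line starting from the first).
N_READS=$(gunzip -c "$demo/reads.fq.gz" | awk 'NR % 4 == 1' | wc -l | tr -d ' ')
echo "reads seen through the decompression command: $N_READS"
```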

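Mapping multiple files in one run

Multiple samples can be mapped in one run with a single output. This is equivalent to concatenating the read files before mapping, except that different read groups can be used in the --outSAMattrRGline option to keep track of reads from different files. For single-end reads, use a comma-separated list (no spaces around the commas), e.g.:

--readFilesIn sample1.fq,sample2.fq,sample3.fq

For paired-end reads, use a comma-separated list for read1, followed by a space, followed by a comma-separated list for read2, e.g.: --readFilesIn s1read1.fq,s2read1.fq,s3read1.fq s1read2.fq,s2read2.fq,s3read2.fq

For multiple read files, the corresponding read groups can be supplied in --outSAMattrRGline as a space/comma/space-separated list, e.g. --outSAMattrRGline ID:sample1 , ID:sample2 , ID:sample3

Note that this list is separated by commas surrounded by spaces (unlike the --readFilesIn list).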
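Because the two options use different separators, building them in a loop avoids typing mistakes. This is a minimal sketch: the sample names and the _R1.fq/_R2.fq file naming are hypothetical.

```shell
#!/bin/sh
SAMPLES="sample1 sample2 sample3"

READ1=""
READ2=""
RG=""
for s in $SAMPLES; do
    # --readFilesIn: commas with no surrounding spaces.
    READ1="${READ1:+$READ1,}${s}_R1.fq"
    READ2="${READ2:+$READ2,}${s}_R2.fq"
    # --outSAMattrRGline: commas surrounded by spaces.
    RG="${RG:+$RG , }ID:$s"
done

echo "--readFilesIn $READ1 $READ2"
echo "--outSAMattrRGline $RG"
```

The order of the read groups follows the order of the files, so both lists are built in the same loop.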

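Another option for mapping multiple read files, especially convenient for a very large number of files, is to create a file manifest and supply it in --readFilesManifest /path/to/manifest.tsv.

The manifest file should contain 3 tab-separated columns. For paired-end reads:

read1-file-name <tab> read2-file-name <tab> read-group-line

For single-end reads, the second column should contain the dash -:

read1-file-name <tab> - <tab> read-group-line

Spaces, but not tabs, are allowed in file names. If read-group-line does not start with ID:, it can only contain one ID field, and ID: will be added to it. If read-group-line starts with ID:, it can contain several fields separated by tabs, and all fields will be copied verbatim into the SAM @RG header line.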
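A manifest in this layout can be generated with printf. The sketch below is illustrative: the sample and file names are hypothetical, and the SM: field is just an example of an extra read-group field after ID:.

```shell
#!/bin/sh
MANIFEST=$(mktemp)

# Paired-end samples: read1 <tab> read2 <tab> read-group line.
# The read-group line starts with ID:, so it may carry extra tab-separated fields.
for s in sampleA sampleB; do
    printf '%s_R1.fq.gz\t%s_R2.fq.gz\tID:%s\tSM:%s\n' "$s" "$s" "$s" "$s"
done > "$MANIFEST"

# Single-end sample: the second column is a dash; the bare read-group
# value does not start with ID:, so STAR adds ID: to it.
printf '%s\t-\t%s\n' "sampleC.fq.gz" "sampleC" >> "$MANIFEST"

N_LINES=$(wc -l < "$MANIFEST" | tr -d ' ')
echo "manifest entries: $N_LINES"
```

The file would then be supplied as --readFilesManifest "$MANIFEST".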

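Advanced options

Using annotations at the mapping stage

Starting from 2.4.1a, annotations can be included on the fly at the mapping step, without including them at the genome generation step. You can specify --sjdbGTFfile /path/to/ann.gtf and/or --sjdbFileChrStartEnd /path/to/sj.tab, as well as --sjdbOverhang and any other --sjdb* options. The genome indexes can have been generated with or without another set of annotations/junctions.

If the index already contains junctions, the new junctions are added to the old ones. STAR inserts the junctions into the genome indexes on the fly before mapping, which takes 1-2 minutes. The on-the-fly genome indexes can be saved (for reuse) with --sjdbInsertSave All, into the STARgenome directory inside the current run directory.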
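Put together, a mapping run that inserts annotations on the fly and saves the resulting index might look like the sketch below. The paths and the overhang value are hypothetical, and the script only prints the command (a dry run).

```shell
#!/bin/sh
# Hypothetical inputs; replace with your own.
GENOME_DIR=./star_index
ANNOTATION=./ann.gtf
READS=./reads.fq
OVERHANG=99

# --sjdbInsertSave All keeps the on-the-fly index for reuse.
CMD="STAR --genomeDir $GENOME_DIR --readFilesIn $READS --sjdbGTFfile $ANNOTATION --sjdbOverhang $OVERHANG --sjdbInsertSave All"

# Dry run: print the command. To execute it, run: eval "$CMD"
echo "$CMD"
```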

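ENCODE options

An example of the ENCODE standard options for the long RNA-seq pipeline is given below:

  • --outFilterType BySJout reduces the number of "spurious" junctions
  • --outFilterMultimapNmax 20 maximum number of multiple alignments allowed for a read: if exceeded, the read is considered unmapped
  • --alignSJoverhangMin 8 minimum overhang for unannotated junctions
  • --alignSJDBoverhangMin 1 minimum overhang for annotated junctions
  • --outFilterMismatchNmax 999 maximum number of mismatches per pair; a large number switches this filter off
  • --outFilterMismatchNoverReadLmax 0.04 maximum number of mismatches per pair relative to the read length: for 2x100b, the maximum number of mismatches for the paired read is 0.04*200=8
  • --alignIntronMin 20 minimum intron length
  • --alignIntronMax 1000000 maximum intron length
  • --alignMatesGapMax 1000000 maximum genomic distance between mates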
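The options listed above can be collected into a single variable and appended to a mapping command. This is a sketch only: the genome directory and read file names are hypothetical, and the command is printed rather than run. The last lines repeat the 0.04*200 arithmetic for 2x100b reads.

```shell
#!/bin/sh
# ENCODE long RNA-seq options, as listed above.
ENCODE_OPTS="--outFilterType BySJout \
--outFilterMultimapNmax 20 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverReadLmax 0.04 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000"

# Hypothetical inputs.
CMD="STAR --genomeDir ./star_index --readFilesIn r1.fq r2.fq $ENCODE_OPTS"

# Dry run: print the command. To execute it, run: eval "$CMD"
echo "$CMD"

# Mismatch budget for a 2x100b pair: 0.04 * (100 + 100).
MAX_MISMATCH=$(awk 'BEGIN { print 0.04 * (100 + 100) }')
echo "max mismatches per pair: $MAX_MISMATCH"
```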

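Using shared memory

The --genomeLoad option controls how the genome is loaded into memory. By default, --genomeLoad NoSharedMemory, shared memory is not used.

With --genomeLoad LoadAndKeep, STAR loads the genome as a standard Linux shared memory piece. The genomes are identified by their unique directory paths. Before loading the genome, STAR checks whether the genome has already been loaded into shared memory. If it has not, STAR loads it and keeps it in memory even after the STAR job finishes. The genome is shared with all other STAR jobs. You can remove the genome from shared memory by running STAR with --genomeLoad Remove. The shared memory piece is physically removed only after all STAR jobs attached to it complete. With --genomeLoad LoadAndRemove, STAR loads the genome into shared memory and marks it for removal, so that the genome is removed from shared memory once all STAR jobs using it exit. With --genomeLoad LoadAndExit, STAR loads the genome into shared memory and exits immediately, keeping the genome in shared memory for future runs.

If you need to check or remove shared memory pieces manually, use the standard Linux commands ipcs and ipcrm. If a genome residing in shared memory is not used for a long time, it may get paged out of RAM, which will slow down STAR runs considerably. It is strongly recommended to regularly re-load (i.e. remove and load again) the shared memory genomes.

Many standard Linux distributions do not allow large enough shared memory blocks. You can fix this issue if you have root privileges, or ask your system administrator to do it. To enable shared memory, modify or add the following lines to /etc/sysctl.conf:

kernel.shmmax = Nmax

kernel.shmall = Nall

The numbers Nmax and Nall should be chosen as follows:

Nmax > GenomeIndexSize = Genome + SA + SAindex (31000000000 for the human genome)

Nall > GenomeIndexSize/PageSize

where PageSize is typically 4096 (it can be checked with getconf PAGE_SIZE). Then run:

/sbin/sysctl -p

This will increase the allowed shared memory blocks to 31GB, which is enough for the human or mouse genome.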
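The two values can be derived from the index size and the page size of the machine. This is a minimal sketch: the index size is the human-genome figure quoted above, and adding one page (for Nmax) or one (for Nall) is simply my way of satisfying the strict inequalities. The script only prints the lines; it does not modify /etc/sysctl.conf.

```shell
#!/bin/sh
# Genome + SA + SAindex in bytes (figure quoted for the human genome).
GENOME_INDEX_SIZE=31000000000
PAGE_SIZE=$(getconf PAGE_SIZE)

# Nmax must exceed the index size; one extra page is an arbitrary margin.
NMAX=$((GENOME_INDEX_SIZE + PAGE_SIZE))
# Nall must exceed GenomeIndexSize/PageSize; integer division plus one does that.
NALL=$((GENOME_INDEX_SIZE / PAGE_SIZE + 1))

echo "kernel.shmmax = $NMAX"
echo "kernel.shmall = $NALL"
```

For a different genome, GENOME_INDEX_SIZE could be replaced by the combined size of the Genome, SA, and SAindex files in the genome directory.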

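All STAR parameters

Parameters are grouped by function:

Special attention has to be paid to parameters that start with --out*, as they control the STAR output.

In particular, the --outFilter* parameters control the filtering of output alignments, which you might want to tweak to fit your needs.

Output of "chimeric" alignments is controlled by the --chim* parameters.

Genome generation is controlled by the --genome* parameters.

Annotations (the splice junction database) are controlled by the --sjdb* options at the genome generation step.

Tweaking the --score*, --align*, --seed*, and --win* parameters requires an understanding of the STAR alignment algorithm and is recommended only for advanced users.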

Parameter files

--parametersFiles

default: -

string: name of a user-defined parameters file, "-": none. Can only be defined on the command line.

System

--sysShell

default: -

string: path to the shell binary, preferably bash, e.g. /bin/bash.

- the default shell is executed, typically /bin/sh. This was reported to fail on some Ubuntu systems - then you need to specify path to bash.

Run parameters

  • --runMode default: alignReads string: type of the run.
  • --runThreadN default: 1 int: number of threads to run STAR
  • --runDirPerm default: User_RWX string: permissions for the directories created at the run-time. User_RWX: user-read/write/execute. All_RWX: all-read/write/execute (same as chmod 777)
  • --runRNGseed default: 777 int: random number generator seed.

Genome parameters

  • --genomeDir default: ./GenomeDir/ string: path to the directory where genome files are stored (for --runMode alignReads) or will be generated (for --runMode genomeGenerate)
  • --genomeLoad default: NoSharedMemory string: mode of shared memory usage for the genome files. Only used with --runMode alignReads.
    • LoadAndKeep load genome into shared memory and keep it in memory after run
    • LoadAndRemove load genome into shared memory but remove it after run
    • LoadAndExit load genome into shared memory and exit, keeping the genome in memory for future runs
    • Remove do not map anything, just remove loaded genome from memory
    • NoSharedMemory do not use shared memory, each job will have its own private copy of the genome
  • --genomeFastaFiles default: - string(s): path(s) to the fasta files with the genome sequences, separated by spaces. These files should be plain text FASTA files, they *cannot* be zipped. Required for the genome generation (--runMode genomeGenerate). Can also be used in the mapping (--runMode alignReads) to add extra (new) sequences to the genome (e.g. spike-ins).
  • --genomeChainFiles default: - string: chain files for genomic liftover. Only used with --runMode liftOver.
  • --genomeFileSizes default: 0 uint(s)>0: genome files exact sizes in bytes. Typically, this should not be defined by the user.
  • --genomeTransformOutput default: None string(s) which output to transform back to original genome
    • SAM SAM/BAM alignments
    • SJ splice junctions (SJ.out.tab)
    • None no transformation of the output

Genome indexing parameters (only used with --runMode genomeGenerate)

  • --genomeChrBinNbits default: 18 int: =log2(chrBin), where chrBin is the size of the bins for genome storage: each chromosome will occupy an integer number of bins. For a genome with large number of contigs, it is recommended to scale this parameter as min(18, log2[max(GenomeLength/NumberOfReferences,ReadLength)]).
  • --genomeSAindexNbases default: 14 int: length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, the parameter --genomeSAindexNbases must be scaled down to min(14, log2(GenomeLength)/2 - 1).
  • --genomeSAsparseD default: 1 int>0: suffix array sparsity, i.e. distance between indices: use bigger numbers to decrease needed RAM at the cost of mapping speed reduction
  • --genomeSuffixLengthMax default: -1 int: maximum length of the suffixes, has to be longer than read length. -1 = infinite.
  • --genomeTransformType default: None string: type of genome transformation
    • None no transformation
    • Haploid replace reference alleles with alternative alleles from VCF file (e.g. consensus allele)
    • Diploid create two haplotypes for each chromosome listed in VCF file, for genotypes 1|2, assumes perfect phasing (e.g. personal genome)
  • --genomeTransformVCF default: - string: path to VCF file for genome transformation

Splice junction database

  • --sjdbFileChrStartEnd default: - string(s): path to the files with genomic coordinates (chr <tab> start <tab> end <tab> strand) for the splice junction introns. Multiple files can be supplied and will be concatenated.
  • --sjdbGTFfile default: - string: path to the GTF file with annotations
  • --sjdbGTFchrPrefix default: - string: prefix for chromosome names in a GTF file (e.g. 'chr' for using ENSEMBL annotations with UCSC genomes)
  • --sjdbGTFfeatureExon default: exon string: feature type in GTF file to be used as exons for building transcripts
  • --sjdbGTFtagExonParentTranscript default: transcript_id string: GTF attribute name for parent transcript ID (default "transcript_id" works for GTF files)
  • --sjdbGTFtagExonParentGene default: gene_id string: GTF attribute name for parent gene ID (default "gene_id" works for GTF files)
  • --sjdbGTFtagExonParentGeneName default: gene_name string(s): GTF attribute name for parent gene name
  • --sjdbGTFtagExonParentGeneType default: gene_type gene_biotype string(s): GTF attribute name for parent gene type
  • --sjdbOverhang default: 100 int>0: length of the donor/acceptor sequence on each side of the junctions, ideally = (mate length - 1)
  • --sjdbScore default: 2 int: extra alignment score for alignments that cross database junctions
  • --sjdbInsertSave default: Basic string: which files to save when sjdb junctions are inserted on the fly at the mapping step
    • Basic only small junction / transcript files
    • All all files including big Genome, SA and SAindex - this will create a complete genome directory

Variation parameters

--varVCFfile default: - string: path to the VCF file that contains variation data. The 10th column should contain the genotype information, e.g. 0/1

Input files

--inputBAMfile default: - string: path to BAM input file, to be used with --runMode inputAlignmentsFromBAM

Read parameters

  • --readFilesType default: Fastx string: format of input read files
    • Fastx FASTA or FASTQ
    • SAM SE SAM or BAM single-end reads; for BAM use --readFilesCommand samtools view
    • SAM PE SAM or BAM paired-end reads; for BAM use --readFilesCommand samtools view
  • --readFilesSAMattrKeep default: All string(s): for --readFilesType SAM SE/PE, which SAM tags to keep in the output BAM, e.g.: --readFilesSAMattrKeep RG PL
    • All keep all tags
    • None do not keep any tags
  • --readFilesIn default: Read1 Read2 string(s): paths to files that contain input read1 (and, if needed, read2)
  • --readFilesManifest default: - string: path to the "manifest" file with the names of read files. The manifest file should contain 3 tab-separated columns: paired-end reads: read1 file name <tab> read2 file name <tab> read group line. single-end reads: read1 file name <tab> - <tab> read group line. Spaces, but not tabs are allowed in file names. If read group line does not start with ID:, it can only contain one ID field, and ID: will be added to it. If read group line starts with ID:, it can contain several fields separated by tab, and all fields will be copied verbatim into SAM @RG header line.
  • --readFilesPrefix default: - string: prefix for the read files names, i.e. it will be added in front of the strings in --readFilesIn
  • --readFilesCommand default: - string(s): command line to execute for each of the input file. This command should generate FASTA or FASTQ text and send it to stdout. For example: zcat - to uncompress .gz files, bzcat - to uncompress .bz2 files, etc.
  • --readMapNumber default: -1 int: number of reads to map from the beginning of the file -1: map all reads
  • --readMatesLengthsIn default: NotEqual string: Equal/NotEqual - lengths of names,sequences,qualities for both mates are the same / not the same. NotEqual is safe in all situations.
  • --readNameSeparator default: / string(s): character(s) separating the part of the read names that will be trimmed in output (read name after space is always trimmed)
  • --readQualityScoreBase default: 33 int>=0: number to be subtracted from the ASCII code to get Phred quality score

Read Clipping

  • --clipAdapterType default: Hamming string: adapter clipping type
    • Hamming adapter clipping based on Hamming distance, with the number of mismatches controlled by --clip5pAdapterMMp
    • CellRanger4 5p and 3p adapter clipping similar to CellRanger4. Utilizes Opal package by Martin Šošić: https://github.com/Martinsos/opal
    • None no adapter clipping, all other clip* parameters are disregarded
  • --clip3pNbases default: 0 int(s): number(s) of bases to clip from 3p of each mate. If one value is given, it will be assumed the same for both mates.
  • --clip3pAdapterSeq default: - string(s): adapter sequences to clip from 3p of each mate. If one value is given, it will be assumed the same for both mates.
    • polyA polyA sequence with the length equal to read length
  • --clip3pAdapterMMp default: 0.1 double(s): max proportion of mismatches for 3p adapter clipping for each mate. If one value is given, it will be assumed the same for both mates.
  • --clip3pAfterAdapterNbases default: 0 int(s): number of bases to clip from 3p of each mate after the adapter clipping. If one value is given, it will be assumed the same for both mates.
  • --clip5pNbases default: 0 int(s): number(s) of bases to clip from 5p of each mate. If one value is given, it will be assumed the same for both mates.

Limits

  • --limitGenomeGenerateRAM default: 31000000000 int>0: maximum available RAM (bytes) for genome generation
  • --limitIObufferSize default: 30000000 50000000 int>0: max available buffers size (bytes) for input/output, per thread
  • --limitOutSAMoneReadBytes default: 100000 int>0: max size of the SAM record (bytes) for one read. Recommended value: > 2*(LengthMate1+LengthMate2+100)*outFilterMultimapNmax
  • --limitOutSJoneRead default: 1000 int>0: max number of junctions for one read (including all multi-mappers)
  • --limitOutSJcollapsed default: 1000000 int>0: max number of collapsed junctions
  • --limitBAMsortRAM default: 0 int>=0: maximum available RAM (bytes) for sorting BAM. If =0, it will be set to the genome index size. 0 value can only be used with --genomeLoad NoSharedMemory option.
  • --limitSjdbInsertNsj default: 1000000 int>=0: maximum number of junction to be inserted to the genome on the fly at the mapping stage, including those from annotations and those detected in the 1st step of the 2-pass run
  • --limitNreadsSoft default: -1 int: soft limit on the number of reads

Output: general

  • --outFileNamePrefix default: ./ string: output files name prefix (including full or relative path). Can only be defined on the command line.
  • --outTmpDir default: - string: path to a directory that will be used as temporary by STAR. All contents of this directory will be removed! With the default value -, the temp directory will default to outFileNamePrefix_STARtmp
  • --outTmpKeep default: None string: whether to keep the temporary files after the STAR run is finished
    • None remove all temporary files
    • All keep all files
  • --outStd default: Log string: which output will be directed to stdout (standard out)
    • Log log messages
    • SAM alignments in SAM format (which normally are output to Aligned.out.sam file), normal standard output will go into Log.std.out
    • BAM_Unsorted alignments in BAM format, unsorted. Requires --outSAMtype BAM Unsorted
    • BAM_SortedByCoordinate alignments in BAM format, sorted by coordinate. Requires --outSAMtype BAM SortedByCoordinate
    • BAM_Quant alignments to transcriptome in BAM format, unsorted. Requires --quantMode TranscriptomeSAM
  • --outReadsUnmapped default: None string: output of unmapped and partially mapped (i.e. mapped only one mate of a paired end read) reads in separate file(s).
    • None no output
    • Fastx output in separate fasta/fastq files, Unmapped.out.mate1/2
  • --outQSconversionAdd default: 0 int: add this number to the quality score (e.g. to convert from Illumina to Sanger, use -31)
  • --outMultimapperOrder default: Old_2.4 string: order of multimapping alignments in the output files
    • Old_2.4 quasi-random order used before 2.5.0
    • Random random order of alignments for each multi-mapper. Read mates (pairs) are always adjacent, all alignment for each read stay together. This option will become default in the future releases.

Output: SAM and BAM

  • --outSAMtype default: SAM strings: type of SAM/BAM output
    • 1st word:
      • BAM output BAM without sorting
      • SAM output SAM without sorting
      • None no SAM/BAM output
    • 2nd, 3rd:
      • Unsorted standard unsorted
      • SortedByCoordinate sorted by coordinate. This option will allocate extra memory for sorting which can be specified by --limitBAMsortRAM
  • --outSAMmode default: Full string: mode of SAM output
    • None no SAM output
    • Full full SAM output
    • NoQS full SAM but without quality scores
  • --outSAMstrandField default: None string: Cufflinks-like strand field flag
    • None not used
    • intronMotif strand derived from the intron motif. This option changes the output alignments: reads with inconsistent and/or non-canonical introns are filtered out.
  • --outSAMattributes default: Standard string: a string of desired SAM attributes, in the order desired for the output SAM. Tags can be listed in any combination/order.
    • ***Presets:
      • None no attributes
      • Standard NH HI AS nM
      • All NH HI AS nM NM MD jM jI MC ch
    • ***Alignment:
      • NH number of loci the reads maps to: =1 for unique mappers, >1 for multimappers. Standard SAM tag.
      • HI multiple alignment index, starts with --outSAMattrIHstart (=1 by default). Standard SAM tag.
      • AS local alignment score, +1/-1 for matches/mismatches, score* penalties for indels and gaps. For PE reads, total score for two mates. Standard SAM tag.
      • nM number of mismatches. For PE reads, sum over two mates.
      • NM edit distance to the reference (number of mismatched + inserted + deleted bases) for each mate. Standard SAM tag.
      • MD string encoding mismatched and deleted reference bases (see standard SAM specifications). Standard SAM tag.
      • jM intron motifs for all junctions (i.e. N in CIGAR): 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT. If splice junctions database is used, and a junction is annotated, 20 is added to its motif value.
      • jI start and end of introns for all junctions (1-based).
      • XS alignment strand according to --outSAMstrandField.
      • MC mate's CIGAR string. Standard SAM tag.
      • ch marks all segments of all chimeric alignments for --chimOutType WithinBAM output.
      • cN number of bases clipped from the read ends: 5' and 3'
    • ***Variation:
      • vA variant allele
      • vG genomic coordinate of the variant overlapped by the read.
      • vW 1 - alignment passes WASP filtering; 2,3,4,5,6,7 - alignment does not pass WASP filtering. Requires --waspOutputMode SAMtag.
    • ***STARsolo:
      • CR CY UR UY sequences and quality scores of cell barcodes and UMIs for the solo* demultiplexing.
      • GX GN gene ID and gene name.
      • CB UB error-corrected cell barcodes and UMIs for solo* demultiplexing. Requires --outSAMtype BAM SortedByCoordinate.
      • sM assessment of CB and UMI.
      • sS sequence of the entire barcode (CB,UMI,adapter).
      • sQ quality of the entire barcode.
    • ***Unsupported/undocumented:
      • ha haplotype (1/2) when mapping to the diploid genome. Requires genome generated with --genomeTransformType Diploid.
      • rB alignment block read/genomic coordinates.
      • vR read coordinate of the variant.
  • --outSAMattrIHstart default: 1 int>=0: start value for the IH attribute. 0 may be required by some downstream software, such as Cufflinks or StringTie.
  • --outSAMunmapped default: None string(s): output of unmapped reads in the SAM format
    • 1st word:
      • None no output
      • Within output unmapped reads within the main SAM file (i.e. Aligned.out.sam)
    • 2nd word:
      • KeepPairs record unmapped mate for each alignment, and, in case of unsorted output, keep it adjacent to its mapped mate. Only affects multi-mapping reads.
  • --outSAMorder default: Paired string: type of sorting for the SAM output
    • Paired one mate after the other for all paired alignments
    • PairedKeepInputOrder one mate after the other for all paired alignments, the order is kept the same as in the input FASTQ files
  • --outSAMprimaryFlag default: OneBestScore string: which alignments are considered primary - all others will be marked with 0x100 bit in the FLAG
    • OneBestScore only one alignment with the best score is primary
    • AllBestScore all alignments with the best score are primary
  • --outSAMreadID default: Standard string: read ID record type
    • Standard first word (until space) from the FASTx read ID line, removing /1,/2 from the end
    • Number read number (index) in the FASTx file
  • --outSAMmapqUnique default: 255 int: 0 to 255: the MAPQ value for unique mappers
  • --outSAMflagOR default: 0 int: 0 to 65535: sam FLAG will be bitwise OR'd with this value, i.e. FLAG=FLAG | outSAMflagOR. This is applied after all flags have been set by STAR, and after outSAMflagAND. Can be used to set specific bits that are not set otherwise.
  • --outSAMflagAND default: 65535 int: 0 to 65535: sam FLAG will be bitwise AND'd with this value, i.e. FLAG=FLAG & outSAMflagAND. This is applied after all flags have been set by STAR, but before outSAMflagOR. Can be used to unset specific bits.
  • --outSAMattrRGline default: - string(s): SAM/BAM read group line. The first word contains the read group identifier and must start with "ID:", e.g. --outSAMattrRGline ID:xxx CN:yy "DS:z z z". xxx will be added as RG tag to each output alignment. Any spaces in the tag values have to be double quoted. Comma separated RG lines correspond to different (comma separated) input files in --readFilesIn. Commas have to be surrounded by spaces, e.g. --outSAMattrRGline ID:xxx , ID:zzz "DS:z z" , ID:yyy DS:yyyy
  • --outSAMheaderHD default: - strings: @HD (header) line of the SAM header
  • --outSAMheaderPG default: - strings: extra @PG (software) line of the SAM header (in addition to STAR)
  • --outSAMheaderCommentFile default: - string: path to the file with @CO (comment) lines of the SAM header
  • --outSAMfilter default: None string(s): filter the output into main SAM/BAM files
    • KeepOnlyAddedReferences only keep the reads for which all alignments are to the extra reference sequences added with --genomeFastaFiles at the mapping stage.
    • KeepAllAddedReferences keep all alignments to the extra reference sequences added with --genomeFastaFiles at the mapping stage.
  • --outSAMmultNmax default: -1 int: max number of multiple alignments for a read that will be output to the SAM/BAM files. Note that if this value is not equal to -1, the top scoring alignment will be output first
    • -1 all alignments (up to --outFilterMultimapNmax) will be output
  • --outSAMtlen default: 1 int: calculation method for the TLEN field in the SAM/BAM files
    • 1 leftmost base of the (+)strand mate to rightmost base of the (-)mate. (+)sign for the (+)strand mate
    • 2 leftmost base of any mate to rightmost base of any mate. (+)sign for the mate with the leftmost base. This is different from 1 for overlapping mates with protruding ends
  • --outBAMcompression default: 1 int: -1 to 10 BAM compression level, -1=default compression (6?), 0=no compression, 10=maximum compression
  • --outBAMsortingThreadN default: 0 int: >=0: number of threads for BAM sorting. 0 will default to min(6,--runThreadN).
  • --outBAMsortingBinsN default: 50 int: >0: number of genome bins for coordinate-sorting

BAM processing

  • --bamRemoveDuplicatesType default: - string: mark duplicates in the BAM file, for now only works with (i) sorted BAM fed with --inputBAMfile, and (ii) for paired-end alignments only
    • - no duplicate removal/marking
    • UniqueIdentical mark all multimappers, and duplicate unique mappers. The coordinates, FLAG, CIGAR must be identical
    • UniqueIdenticalNotMulti mark duplicate unique mappers but not multimappers.
  • --bamRemoveDuplicatesMate2basesN default: 0 int>0: number of bases from the 5' of mate 2 to use in collapsing (e.g. for RAMPAGE)

Output: wiggle

  • --outWigType default: None string(s): type of signal output, e.g. "bedGraph" OR "bedGraph read1_5p". Requires sorted BAM: --outSAMtype BAM SortedByCoordinate.
    • 1st word:
      • None no signal output
      • bedGraph bedGraph format
      • wiggle wiggle format
    • 2nd word:
      • read1_5p signal from only 5' of the 1st read, useful for CAGE/RAMPAGE etc
      • read2 signal from only 2nd read
  • --outWigStrand default: Stranded string: strandedness of wiggle/bedGraph output
    • Stranded separate strands, str1 and str2
    • Unstranded collapsed strands
  • --outWigReferencesPrefix default: - string: prefix matching reference names to include in the output wiggle file, e.g. "chr", default "-" - include all references
  • --outWigNorm default: RPM string: type of normalization for the signal
    • RPM reads per million of mapped reads
    • None no normalization, "raw" counts

Output filtering

  • --outFilterType default: Normal string: type of filtering
    • Normal standard filtering using only current alignment
    • BySJout keep only those reads that contain junctions that passed filtering into SJ.out.tab
  • --outFilterMultimapScoreRange default: 1 int: the score range below the maximum score for multimapping alignments
  • --outFilterMultimapNmax default: 10 int: maximum number of loci the read is allowed to map to. Alignments (all of them) will be output only if the read maps to no more loci than this value. Otherwise no alignments will be output, and the read will be counted as "mapped to too many loci" in the Log.final.out.
  • --outFilterMismatchNmax default: 10 int: alignment will be output only if it has no more mismatches than this value.
  • --outFilterMismatchNoverLmax default: 0.3 real: alignment will be output only if its ratio of mismatches to *mapped* length is less than or equal to this value.
  • --outFilterMismatchNoverReadLmax default: 1.0 real: alignment will be output only if its ratio of mismatches to *read* length is less than or equal to this value.
  • --outFilterScoreMin default: 0 int: alignment will be output only if its score is higher than or equal to this value.
  • --outFilterScoreMinOverLread default: 0.66 real: same as outFilterScoreMin, but normalized to read length (sum of mates' lengths for paired-end reads)
  • --outFilterMatchNmin default: 0 int: alignment will be output only if the number of matched bases is higher than or equal to this value.
  • --outFilterMatchNminOverLread default: 0.66 real: same as outFilterMatchNmin, but normalized to the read length (sum of mates' lengths for paired-end reads).
  • --outFilterIntronMotifs default: None string: filter alignment using their motifs
    • None no filtering
    • RemoveNoncanonical filter out alignments that contain non-canonical junctions
    • RemoveNoncanonicalUnannotated filter out alignments that contain non-canonical unannotated junctions when using annotated splice junctions database. The annotated non-canonical junctions will be kept.
  • --outFilterIntronStrands default: RemoveInconsistentStrands string: filter alignments
    • RemoveInconsistentStrands remove alignments that have junctions with inconsistent strands
    • None no filtering

Output: splice junctions

  • --outSJtype default: Standard string: type of splice junction output
    • Standard standard SJ.out.tab output
    • None no splice junction output

Output filtering: splice junctions

  • --outSJfilterReads default: All string: which reads to consider for collapsed splice junctions output
    • All all reads, unique- and multi-mappers
    • Unique uniquely mapping reads only
  • --outSJfilterOverhangMin default: 30 12 12 12; 4 integers: minimum overhang length for splice junctions on both sides for: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. -1 means no output for that motif. Does not apply to annotated junctions.
  • --outSJfilterCountUniqueMin default: 3 1 1 1; 4 integers: minimum uniquely mapping read count per junction for: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. -1 means no output for that motif. Junctions are output if one of the outSJfilterCountUniqueMin OR outSJfilterCountTotalMin conditions is satisfied. Does not apply to annotated junctions.
  • --outSJfilterCountTotalMin default: 3 1 1 1; 4 integers: minimum total (multi-mapping+unique) read count per junction for: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. -1 means no output for that motif. Junctions are output if one of the outSJfilterCountUniqueMin OR outSJfilterCountTotalMin conditions is satisfied. Does not apply to annotated junctions.
  • --outSJfilterDistToOtherSJmin default: 10 0 5 10; 4 integers>=0: minimum allowed distance to other junctions' donor/acceptor. Does not apply to annotated junctions.
  • --outSJfilterIntronMaxVsReadN default: 50000 100000 200000; N integers>=0: maximum gap allowed for junctions supported by 1, 2, 3, ..., N reads, i.e. by default junctions supported by 1 read can have gaps <=50000b, by 2 reads: <=100000b, by 3 reads: <=200000b, by >=4 reads: any gap <=alignIntronMax. Does not apply to annotated junctions.
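
The same kind of filtering can also be applied after mapping, directly on SJ.out.tab. The sketch below assumes the nine-column layout described in the STAR manual (chromosome, intron start, intron end, strand, motif, annotated flag, uniquely mapping reads, multi-mapping reads, maximum overhang). The function names and the example rows are made up for illustration, and the rule is a simplified stand-in for STAR's own per-motif logic.

```bash
#!/usr/bin/env bash
# Hypothetical post hoc filter for SJ.out.tab, read from standard input.
# Annotated junctions (column 6 == 1) are always kept, mirroring the
# "does not apply to annotated junctions" rule; unannotated junctions are
# kept only if column 7 (uniquely mapping reads) reaches the threshold.
filter_sj() {
  awk -F'\t' -v min_unique="$1" '$6 == 1 || $7 >= min_unique'
}

# Made-up example rows in the SJ.out.tab column order.
sample_sj() {
  printf 'chr1\t14830\t14969\t2\t2\t1\t50\t3\t38\n'
  printf 'chr1\t20000\t21000\t1\t1\t0\t1\t0\t12\n'
  printf 'chr1\t30000\t31000\t0\t0\t0\t5\t2\t30\n'
}

# Keeps the annotated junction and the unannotated one with 5 unique reads.
sample_sj | filter_sj 3
```

On real data the input would be the SJ.out.tab file of a run, for example `filter_sj 3 < sample_starSJ.out.tab`, where the file name depends on `--outFileNamePrefix`.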

Scoring

  • --scoreGap default: 0 int: splice junction penalty (independent of intron motif)
  • --scoreGapNoncan default: -8 int: non-canonical junction penalty (in addition to scoreGap)
  • --scoreGapGCAG default: -4 GC/AG and CT/GC junction penalty (in addition to scoreGap)
  • --scoreGapATAC default: -8 AT/AC and GT/AT junction penalty (in addition to scoreGap)
  • --scoreGenomicLengthLog2scale default: -0.25 extra score logarithmically scaled with genomic length of the alignment: scoreGenomicLengthLog2scale*log2(genomicLength)
  • --scoreDelOpen default: -2 deletion open penalty
  • --scoreDelBase default: -2 deletion extension penalty per base (in addition to scoreDelOpen)
  • --scoreInsOpen default: -2 insertion open penalty
  • --scoreInsBase default: -2 insertion extension penalty per base (in addition to scoreInsOpen)
  • --scoreStitchSJshift default: 1 maximum score reduction while searching for SJ boundaries in the stitching step
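
To make the scoring terms concrete, the sketch below evaluates the genomic-length expression exactly as quoted above and adds up the gap penalties for one junction. The function names are hypothetical, and STAR's internal rounding or normalization of the length term is not reproduced here.

```bash
#!/usr/bin/env bash
# Evaluates scoreGenomicLengthLog2scale * log2(genomicLength) as quoted above.
# Arguments: scale genomic_length
genomic_length_score() {
  awk -v scale="$1" -v glen="$2" 'BEGIN { printf "%.2f\n", scale * log(glen) / log(2) }'
}

# Total gap penalty for one junction: scoreGap plus the motif-specific penalty.
# Arguments: scoreGap motif_penalty
junction_penalty() {
  echo $(( $1 + $2 ))
}

genomic_length_score -0.25 1024   # default scale, alignment spanning 1024 bases
junction_penalty 0 -8             # non-canonical junction with default values
```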

Alignment and seeding

  • --seedSearchStartLmax default: 50 int>0: defines the search start point through the read - the read is split into pieces no longer than this value
  • --seedSearchStartLmaxOverLread default: 1.0 real: seedSearchStartLmax normalized to read length (sum of mates' lengths for paired-end reads)
  • --seedSearchLmax default: 0 int>=0: defines the maximum length of the seeds, if =0 seed length is not limited
  • --seedMultimapNmax default: 10000 int>0: only pieces that map fewer than this value are utilized in the stitching procedure
  • --seedPerReadNmax default: 1000 int>0: max number of seeds per read
  • --seedPerWindowNmax default: 50 int>0: max number of seeds per window
  • --seedNoneLociPerWindow default: 10 int>0: max number of one seed loci per window
  • --seedSplitMin default: 12 int>0: min length of the seed sequences split by Ns or mate gap
  • --seedMapMin default: 5 int>0: min length of seeds to be mapped
  • --alignIntronMin default: 21 minimum intron size: genomic gap is considered intron if its length>=alignIntronMin, otherwise it is considered Deletion
  • --alignIntronMax default: 0 maximum intron size, if 0, max intron size will be determined by (2^winBinNbits)*winAnchorDistNbins
  • --alignMatesGapMax default: 0 maximum gap between two mates, if 0, max intron gap will be determined by (2^winBinNbits)*winAnchorDistNbins
  • --alignSJoverhangMin default: 5 int>0: minimum overhang (i.e. block size) for spliced alignments
  • --alignSJstitchMismatchNmax default: 0 -1 0 0 4*int>=0: maximum number of mismatches for stitching of the splice junctions (-1: no limit). (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif.
  • --alignSJDBoverhangMin default: 3 int>0: minimum overhang (i.e. block size) for annotated (sjdb) spliced alignments
  • --alignSplicedMateMapLmin default: 0 int>0: minimum mapped length for a read mate that is spliced
  • --alignSplicedMateMapLminOverLmate default: 0.66 real>0: alignSplicedMateMapLmin normalized to mate length
  • --alignWindowsPerReadNmax default: 10000 int>0: max number of windows per read
  • --alignTranscriptsPerWindowNmax default: 100 int>0: max number of transcripts per window
  • --alignTranscriptsPerReadNmax default: 10000 int>0: max number of different alignments per read to consider
  • --alignEndsType default: Local string: type of read ends alignment
    • Local standard local alignment with soft-clipping allowed
    • EndToEnd force end-to-end read alignment, do not soft-clip
    • Extend5pOfRead1 fully extend only the 5p of the read1, all other ends: local alignment
    • Extend5pOfReads12 fully extend only the 5p of both read1 and read2, all other ends: local alignment
  • --alignEndsProtrude default: 0 ConcordantPair int, string: allow protrusion of alignment ends, i.e. start (end) of the +strand mate downstream of the start (end) of the -strand mate
    • 1st word: int: maximum number of protrusion bases allowed
    • 2nd word: string:
      • ConcordantPair report alignments with non-zero protrusion as concordant pairs
      • DiscordantPair report alignments with non-zero protrusion as discordant pairs
  • --alignSoftClipAtReferenceEnds default: Yes string: allow the soft-clipping of the alignments past the end of the chromosomes
    • Yes allow
    • No prohibit, useful for compatibility with Cufflinks
  • --alignInsertionFlush default: None string: how to flush ambiguous insertion positions
    • None insertions are not flushed
    • Right insertions are flushed to the right
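
The STAR manual lists a set of options used by the ENCODE long RNA-seq pipeline, which combine several of the filtering and alignment parameters described above. The sketch below only assembles and prints such a command (a dry run, so it works without STAR installed); the index directory, FASTQ file names and thread count are placeholders.

```bash
#!/usr/bin/env bash
# ENCODE-style options as given in the STAR manual.
encode_args=(
  --outFilterType BySJout
  --outFilterMultimapNmax 20
  --alignSJoverhangMin 8
  --alignSJDBoverhangMin 1
  --outFilterMismatchNmax 999
  --outFilterMismatchNoverReadLmax 0.04
  --alignIntronMin 20
  --alignIntronMax 1000000
  --alignMatesGapMax 1000000
)

# Placeholder inputs (assumptions, adjust to your own data).
star_cmd=(
  STAR --runThreadN 8 --genomeDir STAR_index/
  --readFilesIn sample_1.fq.gz sample_2.fq.gz --readFilesCommand zcat
  "${encode_args[@]}"
)

# Dry run: print the command instead of executing it.
printf '%s ' "${star_cmd[@]}"; echo
# To execute it for real, run: "${star_cmd[@]}"
```

Keeping the options in an array avoids quoting problems and makes it easy to reuse the same set across samples.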

Paired-end reads

  • --peOverlapNbasesMin default: 0 int>=0: minimum number of overlap bases to trigger mates merging and realignment
  • --peOverlapMMp default: 0.01 real, >=0 & <1: maximum proportion of mismatched bases in the overlap area

Windows, Anchors, Binning

  • --winAnchorMultimapNmax default: 50 int>0: max number of loci anchors are allowed to map to
  • --winBinNbits default: 16 int>0: =log2(winBin), where winBin is the size of the bin for the windows/clustering; each window will occupy an integer number of bins.
  • --winAnchorDistNbins default: 9 int>0: max number of bins between two anchors that allows aggregation of anchors into one window
  • --winFlankNbins default: 4 int>0: log2(winFlank), where winFlank is the size of the left and right flanking regions for each window
  • --winReadCoverageRelativeMin default: 0.5 real>=0: minimum relative coverage of the read sequence by the seeds in a window, for STARlong algorithm only.
  • --winReadCoverageBasesMin default: 0 int>0: minimum number of bases covered by the seeds in a window, for STARlong algorithm only.
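
When --alignIntronMax and --alignMatesGapMax are left at 0, the effective limit is (2^winBinNbits)*winAnchorDistNbins. A minimal sketch of that arithmetic with the default values (the shell variable names simply mirror the parameter names):

```bash
#!/usr/bin/env bash
winBinNbits=16          # default: each bin is 2^16 = 65536 bases
winAnchorDistNbins=9    # default: up to 9 bins between two anchors

bin_size=$(( 1 << winBinNbits ))
max_gap=$(( bin_size * winAnchorDistNbins ))

echo "bin size: $bin_size"
echo "default maximum intron / mate gap: $max_gap"
```

With the defaults this gives 589824 bases, so genomes with much shorter introns may benefit from setting --alignIntronMax explicitly.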

Chimeric alignments

  • --chimOutType default: Junctions string(s): type of chimeric output
    • Junctions Chimeric.out.junction
    • SeparateSAMold output old SAM into separate Chimeric.out.sam file
    • WithinBAM output into main aligned BAM files (Aligned.*.bam)
    • WithinBAM HardClip (default) hard-clipping in the CIGAR for supplemental chimeric alignments (default if no 2nd word is present)
    • WithinBAM SoftClip soft-clipping in the CIGAR for supplemental chimeric alignments
  • --chimSegmentMin default: 0 int>=0: minimum length of chimeric segment, if ==0, no chimeric output
  • --chimScoreMin default: 0 int>=0: minimum total (summed) score of the chimeric segments
  • --chimScoreDropMax default: 20 int>=0: max drop (difference) of chimeric score (the sum of scores of all chimeric segments) from the read length
  • --chimScoreSeparation default: 10 int>=0: minimum difference (separation) between the best chimeric score and the next one
  • --chimScoreJunctionNonGTAG default: -1 int: penalty for a non-GT/AG chimeric junction
  • --chimJunctionOverhangMin default: 20 int>=0: minimum overhang for a chimeric junction
  • --chimSegmentReadGapMax default: 0 int>=0: maximum gap in the read sequence between chimeric segments
  • --chimFilter default: banGenomicN string(s): different filters for chimeric alignments
    • None no filtering
    • banGenomicN Ns are not allowed in the genome sequence around the chimeric junction
  • --chimMainSegmentMultNmax default: 10 int>=1: maximum number of multi-alignments for the main chimeric segment. =1 will prohibit multimapping main segments.
  • --chimMultimapNmax default: 0 int>=0: maximum number of chimeric multi-alignments
    • 0 use the old scheme for chimeric detection which only considered unique alignments
  • --chimMultimapScoreRange default: 1 int>=0: the score range for multi-mapping chimeras below the best chimeric score. Only works with --chimMultimapNmax > 1
  • --chimNonchimScoreDropMin default: 20 int>=0: to trigger chimeric detection, the drop in the best non-chimeric alignment score with respect to the read length has to be greater than this value
  • --chimOutJunctionFormat default: 0 int: formatting type for the Chimeric.out.junction file
    • 0 no comment lines/headers
    • 1 comment lines at the end of the file: command line and Nreads: total, unique/multi-mapping
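
Because --chimSegmentMin defaults to 0, chimeric detection is off unless it is set to a positive value. The sketch below prints (but does not run) a command that switches it on; the threshold values of 12, the paths and the file names are illustrative assumptions, not recommendations from the source.

```bash
#!/usr/bin/env bash
# Illustrative values only; tune them for your read length and library.
chim_args=(
  --chimSegmentMin 12            # > 0 enables chimeric output
  --chimJunctionOverhangMin 12
  --chimOutType Junctions        # write Chimeric.out.junction
  --chimOutJunctionFormat 1      # append comment lines with run statistics
)

# Placeholder inputs (assumptions).
chim_cmd=(
  STAR --runThreadN 8 --genomeDir STAR_index/
  --readFilesIn sample_1.fq.gz sample_2.fq.gz --readFilesCommand zcat
  "${chim_args[@]}"
)

# Dry run: print the command instead of executing it.
printf '%s ' "${chim_cmd[@]}"; echo
```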

Quantification of annotations

  • --quantMode default: - string(s): types of quantification requested
    • - none
    • TranscriptomeSAM output SAM/BAM alignments to transcriptome into a separate file
    • GeneCounts count reads per gene
  • --quantTranscriptomeBAMcompression default: 1 1 int: -2 to 10 transcriptome BAM compression level
    • -2 no BAM output
    • -1 default compression (6?)
    • 0 no compression
    • 10 maximum compression
  • --quantTranscriptomeBan default: IndelSoftclipSingleend string: prohibit various alignment type
    • IndelSoftclipSingleend prohibit indels, soft clipping and single-end alignments - compatible with RSEM
    • Singleend prohibit single-end alignments
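
With --quantMode GeneCounts, STAR writes a ReadsPerGene.out.tab file. According to the STAR manual it has four columns (gene ID, counts for unstranded data, counts for read 1 on the RNA strand, counts for read 2 on the RNA strand) and starts with the summary rows N_unmapped, N_multimapping, N_noFeature and N_ambiguous. The sketch below picks one count column and drops the summary rows; the function names and the sample rows are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print gene ID and one count column from
# ReadsPerGene.out.tab (standard input).
# Argument: 2 = unstranded, 3 = read 1 on the RNA strand, 4 = read 2 on the RNA strand
gene_counts() {
  awk -F'\t' -v col="$1" '$1 !~ /^N_/ { print $1 "\t" $col }'
}

# Made-up example content in the ReadsPerGene.out.tab layout.
sample_counts() {
  printf 'N_unmapped\t100\t100\t100\n'
  printf 'N_multimapping\t50\t50\t50\n'
  printf 'N_noFeature\t30\t400\t35\n'
  printf 'N_ambiguous\t20\t5\t18\n'
  printf 'geneA\t120\t3\t117\n'
  printf 'geneB\t40\t1\t39\n'
}

# Column 4 would suit a reverse-stranded library such as this made-up one.
sample_counts | gene_counts 4
```

Comparing the N_noFeature values across columns 3 and 4 is a common way to guess the strandedness of a library: the column with far fewer unassigned reads usually matches the protocol.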

2-pass mapping

  • --twopassMode default: None string: 2-pass mapping mode.
    • None 1-pass mapping
    • Basic basic 2-pass mapping, with all 1st pass junctions inserted into the genome indices on the fly
  • --twopass1readsN default: -1 int: number of reads to process for the 1st step. Use very large number (or default -1) to map all reads in the first step.

WASP parameters

  • --waspOutputMode default: None string: WASP allele-specific output type. This is a re-implementation of the original WASP mappability filtering by Bryce van de Geijn, Graham McVicker, Yoav Gilad & Jonathan K Pritchard. Please cite the original WASP paper: Nature Methods 12, 1061–1063 (2015), https://www.nature.com/articles/nmeth.3582 .
    • SAMtag add WASP tags to the alignments that pass WASP filtering

STARsolo (single-cell RNA-seq) parameters

  • --soloType default: None string(s): type of single-cell RNA-seq
    • CB_UMI_Simple (a.k.a. Droplet) one UMI and one Cell Barcode of fixed length in read2, e.g. Drop-seq and 10X Chromium.
    • CB_UMI_Complex one UMI of fixed length, but multiple Cell Barcodes of varying length, as well as adapter sequences, are allowed in read2 only, e.g. inDrop.
    • CB_samTagOut output Cell Barcode as CR and/or CB SAM tag. No UMI counting. --readFilesIn cDNA_read1 [cDNA_read2 if paired-end] CellBarcode_read . Requires --outSAMtype BAM Unsorted [and/or SortedByCoordinate]
    • SmartSeq Smart-seq: each cell in a separate FASTQ (paired- or single-end), barcodes are corresponding read-groups, no UMI sequences, alignments deduplicated according to alignment start and end (after extending soft-clipped bases)
  • --soloCBwhitelist default: - string(s): file(s) with whitelist(s) of cell barcodes. Only --soloType CB_UMI_Complex allows more than one whitelist file.
    • None no whitelist: all cell barcodes are allowed
  • --soloCBstart default: 1 int>0: cell barcode start base
  • --soloCBlen default: 16 int>0: cell barcode length
  • --soloUMIstart default: 17 int>0: UMI start base
  • --soloUMIlen default: 10 int>0: UMI length
  • --soloBarcodeReadLength default: 1 int: length of the barcode read
    • 1 equal to sum of soloCBlen+soloUMIlen
    • 0 not defined, do not check
  • --soloBarcodeMate default: 0 int: identifies which read mate contains the barcode (CB+UMI) sequence
    • 0 barcode sequence is on a separate read, which should always be the last file listed in --readFilesIn
    • 1 barcode sequence is a part of mate 1
    • 2 barcode sequence is a part of mate 2
  • --soloCBposition default: - string(s): position of Cell Barcode(s) on the barcode read. Presently only works with --soloType CB_UMI_Complex, and barcodes are assumed to be on Read2. Format for each barcode: startAnchor_startPosition_endAnchor_endPosition. start(end)Anchor defines the Anchor Base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end. start(end)Position is the 0-based position of the CB start(end) with respect to the Anchor Base. Strings for different barcodes are separated by spaces. Example: inDrop (Zilionis et al, Nat. Protocols, 2017): --soloCBposition 0_0_2_-1 3_1_3_8
  • --soloUMIposition default: - string: position of the UMI on the barcode read, same format as soloCBposition. Example: inDrop (Zilionis et al, Nat. Protocols, 2017): --soloUMIposition 3_9_3_14
  • --soloAdapterSequence default: - string: adapter sequence to anchor barcodes.
  • --soloAdapterMismatchesNmax default: 1 int>0: maximum number of mismatches allowed in adapter sequence.
  • --soloCBmatchWLtype default: 1MM_multi string: matching the Cell Barcodes to the WhiteList
    • Exact only exact matches allowed
    • 1MM only one match in whitelist with 1 mismatched base allowed. Allowed CBs have to have at least one read with exact match.
    • 1MM_multi multiple matches in whitelist with 1 mismatched base allowed; posterior probability calculation is used to choose one of the matches. Allowed CBs have to have at least one read with exact match. This option matches best with CellRanger 2.2.0
    • 1MM_multi_pseudocounts same as 1MM_multi, but pseudocounts of 1 are added to all whitelist barcodes.
    • 1MM_multi_Nbase_pseudocounts same as 1MM_multi_pseudocounts, multimatching to WL is allowed for CBs with N-bases. This option matches best with CellRanger >= 3.0.0
  • --soloInputSAMattrBarcodeSeq default: - string(s): when inputting reads from a SAM file (--readFilesType SAM SE/PE), these SAM attributes mark the barcode sequence (in proper order). For instance, for 10X CellRanger or STARsolo BAMs, use --soloInputSAMattrBarcodeSeq CR UR . This parameter is required when running STARsolo with input from SAM.
  • --soloInputSAMattrBarcodeQual default: - string(s): when inputting reads from a SAM file (--readFilesType SAM SE/PE), these SAM attributes mark the barcode qualities (in proper order). For instance, for 10X CellRanger or STARsolo BAMs, use --soloInputSAMattrBarcodeQual CY UY . If this parameter is '-' (default), the quality 'H' will be assigned to all bases.
  • --soloStrand default: Forward string: strandedness of the solo libraries
    • Unstranded no strand information
    • Forward read strand same as the original RNA molecule
    • Reverse read strand opposite to the original RNA molecule
  • --soloFeatures default: Gene string(s): genomic features for which the UMI counts per Cell Barcode are collected
    • Gene genes: reads match the gene transcript
    • SJ splice junctions: reported in SJ.out.tab
    • GeneFull full genes: count all reads overlapping genes' exons and introns
  • --soloMultiMappers default: Unique string(s): counting method for reads mapping to multiple genes
    • Unique count only reads that map to unique genes
    • Uniform uniformly distribute multi-genic UMIs to all genes
    • Rescue distribute UMIs proportionally to unique+uniform counts (first iteration of EM)
    • PropUnique distribute UMIs proportionally to unique mappers, if present, and uniformly if not.
  • --soloUMIdedup default: 1MM_All string(s): type of UMI deduplication (collapsing) algorithm
    • 1MM_All all UMIs with 1 mismatch distance to each other are collapsed (i.e. counted once).
    • 1MM_Directional_UMItools follows the "directional" method from the UMI-tools by Smith, Heger and Sudbery (Genome Research 2017).
    • 1MM_Directional same as 1MM_Directional_UMItools, but with more stringent criteria for duplicate UMIs
    • Exact only exactly matching UMIs are collapsed.
    • NoDedup no deduplication of UMIs, count all reads.
    • 1MM_CR CellRanger2-4 algorithm for 1MM UMI collapsing.
  • --soloUMIfiltering default: - string(s): type of UMI filtering (for reads uniquely mapping to genes)
    • - basic filtering: remove UMIs with N and homopolymers (similar to CellRanger 2.2.0).
    • MultiGeneUMI basic + remove lower-count UMIs that map to more than one gene.
    • MultiGeneUMI_All basic + remove all UMIs that map to more than one gene.
    • MultiGeneUMI_CR basic + remove lower-count UMIs that map to more than one gene, matching CellRanger > 3.0.0. Only works with --soloUMIdedup 1MM_CR
  • --soloOutFileNames default: Solo.out/ features.tsv barcodes.tsv matrix.mtx string(s): file names for STARsolo output: file_name_prefix gene_names barcode_sequences cell_feature_count_matrix
  • --soloCellFilter default: CellRanger2.2 3000 0.99 10 string(s): cell filtering type and parameters
    • None do not output filtered cells
    • TopCells only report top cells by UMI count, followed by the exact number of cells
    • CellRanger2.2 simple filtering of CellRanger 2.2. Can be followed by numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count. The hardcoded values are from CellRanger: nExpectedCells=3000; maxPercentile=0.99; maxMinRatio=10
    • EmptyDrops_CR EmptyDrops filtering in CellRanger flavor. Please cite the original EmptyDrops paper: A.T.L Lun et al, Genome Biology, 20, 63 (2019): https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1662-y . Can be followed by 10 numeric parameters: nExpectedCells maxPercentile maxMinRatio indMin indMax umiMin umiMinFracMedian candMaxN FDR simN. The hardcoded values are from CellRanger: 3000 0.99 10 45000 90000 500 0.01 20000 0.01 10000
  • --soloOutFormatFeaturesGeneField3 default: "Gene Expression" string(s): field 3 in the Gene features.tsv file. If "-", then no 3rd field is output.
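
To show how the barcode geometry parameters fit together, the sketch below cuts a cell barcode and UMI out of a barcode read using the default 1-based start positions and lengths, then prints (without running) a matching STARsolo command. The barcode read, whitelist name, FASTQ names and index path are made-up placeholders.

```bash
#!/usr/bin/env bash
# Default geometry from the list above (1-based start positions).
soloCBstart=1;  soloCBlen=16
soloUMIstart=17; soloUMIlen=10

barcode_read="AAACCCAAGAAACACTGGGTTTACGT"   # made-up 26-base barcode read

# Bash substrings are 0-based, so subtract 1 from the start positions.
cell_barcode=${barcode_read:$((soloCBstart - 1)):$soloCBlen}
umi=${barcode_read:$((soloUMIstart - 1)):$soloUMIlen}
echo "CB=$cell_barcode UMI=$umi"

# --soloBarcodeReadLength 1 expects the barcode read to be soloCBlen + soloUMIlen long.
[ "${#barcode_read}" -eq $((soloCBlen + soloUMIlen)) ] && echo "barcode read length OK"

# Dry run of a droplet-style command; the barcode read is listed last in --readFilesIn.
solo_cmd=(
  STAR --genomeDir STAR_index/
  --readFilesIn cDNA_reads.fq.gz barcode_reads.fq.gz --readFilesCommand zcat
  --soloType CB_UMI_Simple --soloCBwhitelist whitelist.txt
  --soloCBstart "$soloCBstart" --soloCBlen "$soloCBlen"
  --soloUMIstart "$soloUMIstart" --soloUMIlen "$soloUMIlen"
)
printf '%s ' "${solo_cmd[@]}"; echo
```

For chemistries with a different UMI length, only `soloUMIlen` (and the corresponding whitelist) would need to change in this sketch.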

STAR run examples

#######STAR MAPPING #######
####index Generate######
# --sjdbOverhang 149 corresponds to read length - 1 for 150 bp reads
STAR --runMode genomeGenerate --genomeDir STAR_index/ --genomeFastaFiles STAR_index/genome.fasta --sjdbGTFfile genome.gtf --sjdbOverhang 149

##### mapping ##########
STAR --runThreadN 8 --genomeDir STAR_index/ --outSAMtype BAM Unsorted SortedByCoordinate --quantMode GeneCounts --readFilesCommand zcat --readFilesIn sample_1.clean.fq.gz sample_2.clean.fq.gz --outFileNamePrefix sample_star