Bioinformatics/STAR
Aligning transcriptome data with STAR
STAR: ultrafast universal RNA-seq aligner. Bioinformatics, Volume 29, Issue 1, 1 January 2013, Pages 15–21, https://doi.org/10.1093/bioinformatics/bts635
Reference: https://academic.oup.com/bioinformatics/article/29/1/15/272537
STAR official site: https://github.com/alexdobin/STAR
STAR manual: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf
Downloading and installing STAR
Installing the gcc compiler
# Ubuntu.
sudo apt-get update
sudo apt-get install g++
sudo apt-get install make
# Red Hat, CentOS, Fedora.
sudo yum update
sudo yum install make
sudo yum install gcc-c++
sudo yum install glibc-static
# SUSE.
sudo zypper update
sudo zypper in gcc gcc-c++
Installing STAR
Download the STAR source code from the releases page https://github.com/alexdobin/STAR/releases (the example below uses version 2.6.1d):
wget https://github.com/alexdobin/STAR/archive/2.6.1d.tar.gz
tar -xzf 2.6.1d.tar.gz
cd STAR-2.6.1d
# Alternatively, get STAR source using git
git clone https://github.com/alexdobin/STAR.git
# Compile
cd STAR/source
make STAR
# Compile on macOS
make STARforMac
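Using STAR
The basic STAR workflow consists of 2 steps:
1. Generating genome index files
In this step the user supplies the reference genome sequences (FASTA files) and annotations (GTF file), from which STAR generates the genome indexes used in the second (mapping) step. The genome indexes are saved to disk and need to be generated only once for each genome/annotation combination. A limited collection of STAR genomes is available from http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/, but users are strongly encouraged to generate their own genome indexes with the most up-to-date assemblies and annotations.
2. Mapping reads to the genome
In this step the user supplies the genome files generated in the first step, as well as the RNA-seq reads (sequences) in the form of FASTA or FASTQ files. STAR maps the reads to the genome and writes several output files, such as alignments (SAM/BAM), mapping summary statistics, splice junctions, unmapped reads, signal (wiggle) tracks etc. Mapping is controlled by a variety of input parameters (options).
All options are described later in this document.
The STAR command line has the following format:
STAR --option1-name option1-value(s) --option2-name option2-value(s) ...
If an option can accept multiple values, they are separated by spaces, and in a few cases by commas.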
Generating genome indexes with STAR
Example run:
# Index the genome; create the directory /path/to/GenomeDir first
# Two ways; without annotations:
/pathToStarDir/STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDir --genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 --runThreadN <n> …
# Guided by annotations (GFF3 or GTF):
/pathToStarDir/STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDir --genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 --runThreadN <n> --sjdbGTFfile <FileName> --sjdbOverhang <N>…
# For GFF3, also add --sjdbGTFtagExonParentTranscript Parent
--sjdbOverhang <N> is the length of the "overhang" on the left or right side of a splice junction; ideally it is set to MateLength - 1 of the RNA-seq reads.
Basic parameters:
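Parameter | Description | Value |
---|---|---|
--runThreadN | Defines the number of threads to be used for genome generation; it has to be set to the number of available cores on the server node. | Integer, depends on the system hardware |
--runMode genomeGenerate | Directs STAR to run the genome index generation job. | genomeGenerate: generate genome indexes |
--genomeDir | Specifies the path to the directory (henceforth called "genome directory") where the genome indexes are stored. This directory has to be created (with mkdir) before the STAR run and needs to have write permissions. The file system needs at least 100GB of free disk space for a typical mammalian genome. It is recommended to remove all files from the genome directory before running the genome generation step. This directory path has to be supplied at the mapping step to identify the reference genome. | /path/to/genomeDir |
--genomeFastaFiles | Specifies one or more FASTA files with the genome reference sequences. Multiple reference sequences (henceforth called "chromosomes") are allowed in each FASTA file. You can rename the chromosomes in chrName.txt, keeping the order of the chromosomes in the file: the names from this file will be used in all output alignment files (such as .sam). Tabs are not allowed in chromosome names, and spaces are not recommended. | /path/to/genome/fasta1 /path/to/genome/fasta2 ... |
--sjdbGTFfile | Specifies the path to the file with annotated transcripts in the standard GTF format. STAR extracts splice junctions from this file and uses them to greatly improve the accuracy of the mapping. While this is optional, and STAR can be run without annotations, using annotations is strongly recommended whenever they are available. Starting from 2.4.1a, annotations can also be included on the fly at the mapping step. | /path/to/annotations.gtf |
--sjdbOverhang | Specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junction database. Ideally, this length should be equal to ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina 2x100b paired-end reads, the ideal value is 100-1=99. For reads of varying length, the ideal value is max(ReadLength)-1. In most cases, the default value of 100 works as well as the ideal value. | ReadLength-1 |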
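The --sjdbOverhang rule from the table, max(ReadLength)-1, can be sketched as a small shell function. The name sjdb_overhang is hypothetical and not part of STAR; the read lengths are assumed to be known already (for example from the sequencing run description).

```shell
# Hypothetical helper (not part of STAR): suggest --sjdbOverhang as max(ReadLength)-1.
# Arguments: one or more read lengths (integers).
sjdb_overhang() (
  max=0
  for len in "$@"; do
    if [ "$len" -gt "$max" ]; then
      max=$len
    fi
  done
  echo $(( max - 1 ))
)

sjdb_overhang 100          # Illumina 2x100b reads
sjdb_overhang 76 100 151   # reads of varying length
```

The printed number can then be passed as --sjdbOverhang <N> in the genome generation command.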
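The genome files comprise the binary genome sequence, suffix arrays, text chromosome names/lengths, splice junction coordinates, and transcript/gene information. Most of these files use an internal STAR format and are not intended to be used by the end user. It is strongly recommended not to change any of these files, with one exception: you can rename the chromosomes in chrName.txt, keeping the order of the chromosomes in the file. The names from this file will be used in all output files (e.g. SAM/BAM).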
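Advanced options
Which chromosomes/scaffolds/patches to include
It is strongly recommended to include the major chromosomes (e.g., for human chr1-22, chrX, chrY, chrM) as well as unplaced and unlocalized scaffolds. Typically, unplaced/unlocalized scaffolds add just a few megabases to the genome length; however, a substantial number of reads may map to ribosomal RNA (rRNA) repeats on these scaffolds. If the scaffolds are not included in the genome, these reads will be reported as unmapped or, even worse, may be aligned to wrong loci on the chromosomes.
Generally, patches and alternative haplotypes should not be included in the genome.
Examples of acceptable genome sequence files:
- ENSEMBL: files marked with .dna.primary_assembly, such as ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
- GENCODE: files marked with PRI (primary). Strongly recommended for mouse and human.
It is strongly recommended to use the most comprehensive annotations for your species. Very importantly, chromosome names in the annotation GTF file have to match chromosome names in the FASTA genome sequence files. For example, one can use ENSEMBL FASTA files with ENSEMBL GTF files, and UCSC FASTA files with UCSC GTF files. However, since UCSC uses the chr1, chr2, ... naming convention and ENSEMBL uses 1, 2, ... naming, ENSEMBL and UCSC FASTA and GTF files cannot be mixed together, unless the chromosomes are renamed to match between the FASTA and GTF files.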
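One way to check the name-matching requirement before building the index is to list the chromosome names that appear in the GTF but not in the FASTA. This is only a sketch: the function name is hypothetical, both files are assumed to be uncompressed, and the FASTA name is assumed to be the header text up to the first whitespace.

```shell
# Hypothetical helper (not part of STAR): print chromosome names used in the
# GTF (column 1) that have no matching sequence header in the FASTA file.
# Usage: gtf_chroms_missing_in_fasta genome.fa annotations.gtf
gtf_chroms_missing_in_fasta() {
  awk -F'\t' '
    FNR == NR {                        # first file: the FASTA
      if (substr($0, 1, 1) == ">") {
        name = substr($0, 2)
        sub(/[ \t].*$/, "", name)      # assumption: name ends at first whitespace
        in_fasta[name] = 1
      }
      next
    }
    /^#/ || NF == 0 { next }           # second file: skip GTF comments and blanks
    !($1 in in_fasta) && !reported[$1]++ { print $1 }
  ' "$1" "$2"
}

# Small demonstration with a UCSC-style FASTA and a mixed-style GTF.
demo_dir=$(mktemp -d)
printf '>chr1 demo sequence\nACGT\n>chr2\nACGT\n' > "$demo_dir/genome.fa"
printf '1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n' > "$demo_dir/ann.gtf"
printf 'chr2\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g2"; transcript_id "t2";\n' >> "$demo_dir/ann.gtf"
gtf_chroms_missing_in_fasta "$demo_dir/genome.fa" "$demo_dir/ann.gtf"
```

An empty result means every GTF chromosome name has a FASTA counterpart; any printed name points at a naming mismatch such as 1 versus chr1.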
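Annotations in GFF format
In addition to the aforementioned options, for GFF3 formatted annotations you need to use --sjdbGTFtagExonParentTranscript Parent. In general, for --sjdbGTFfile files STAR only processes lines which have --sjdbGTFfeatureExon (=exon by default) in the 3rd field (column). The exons are assigned to transcripts using the parent-child relationship defined by the --sjdbGTFtagExonParentTranscript (=transcript_id by default) GTF/GFF attribute.
Using a list of annotated junctions
STAR can also use annotations formatted as a list of splice junction coordinates in a text file:
--sjdbFileChrStartEnd /path/to/sjdbFile.txt. This file should contain 4 columns separated by tabs:
Chr <tab> Start <tab> End <tab> Strand=+/-/.
Here Start and End are the first and last bases of the introns (1-based chromosome coordinates). This file can be used in addition to the --sjdbGTFfile, in which case STAR will extract junctions from both files.
Note that the --sjdbFileChrStartEnd file can contain duplicate (identical) junctions; STAR will collapse (remove) the duplicates.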
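The 4-column layout described above can be checked before handing the file to STAR. The following is a minimal sketch with a hypothetical function name; STAR collapses duplicates by itself, so the de-duplication here only keeps the file tidy.

```shell
# Hypothetical helper (not part of STAR): validate a junction list in the
# Chr <tab> Start <tab> End <tab> Strand layout and drop repeated lines.
# Valid lines go to stdout, malformed lines are reported on stderr.
clean_sjdb_file() {
  awk -F'\t' '
    NF != 4 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ ||
    ($2 + 0) > ($3 + 0) || $4 !~ /^[-+.]$/ {
      print "skipping line " NR ": " $0 > "/dev/stderr"
      next
    }
    !seen[$0]++        # print each distinct junction once
  ' "$1"
}

# Demonstration: one duplicate and one line with Start > End.
sj_demo=$(mktemp)
printf 'chr1\t100\t200\t+\nchr1\t100\t200\t+\nchr1\t300\t250\t-\nchr2\t5\t50\t.\n' > "$sj_demo"
clean_sjdb_file "$sj_demo"
```

The cleaned output can be redirected to a new file and supplied with --sjdbFileChrStartEnd.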
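Small genomes
For small genomes, the parameter --genomeSAindexNbases must be scaled down, with a typical value of min(14, log2(GenomeLength)/2 - 1). For example, for a 1 megabase genome this is equal to 9, and for a 100 kilobase genome this is equal to 7.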
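The formula can be evaluated with awk. The function name below is hypothetical, and rounding to the nearest integer is an assumption chosen because it reproduces the two examples in the text (plain truncation would give 8 for the 1 megabase case).

```shell
# Hypothetical helper (not part of STAR): suggest --genomeSAindexNbases as
# min(14, log2(GenomeLength)/2 - 1), rounded to the nearest integer (assumption).
# Argument: genome length in bases.
sa_index_nbases() {
  awk -v glen="$1" 'BEGIN {
    v = log(glen) / log(2) / 2 - 1   # awk log() is the natural logarithm
    v = int(v + 0.5)
    if (v > 14) v = 14
    print v
  }'
}

sa_index_nbases 1000000      # 1 megabase genome
sa_index_nbases 100000       # 100 kilobase genome
sa_index_nbases 3000000000   # mammalian-size genome, capped at 14
```

The genome length itself can be obtained by summing the sequence lengths in the FASTA files.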
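Genomes with a large number of references
If you are using a genome with a large number (>5,000) of references (chromosomes/scaffolds), you may need to reduce --genomeChrBinNbits to reduce RAM consumption. The following scaling is recommended: --genomeChrBinNbits =
min(18, log2[max(GenomeLength/NumberOfReferences, ReadLength)]). For example, for a 3 gigabase genome with 100,000 chromosomes/scaffolds, this is equal to 15.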
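A sketch of the same scaling in shell follows. The function name is hypothetical, and rounding to the nearest integer is again an assumption that matches the worked example (log2(30000) is about 14.9).

```shell
# Hypothetical helper (not part of STAR): suggest --genomeChrBinNbits as
# min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))),
# rounded to the nearest integer (assumption).
# Arguments: genome length, number of references, read length.
chr_bin_nbits() {
  awk -v glen="$1" -v nref="$2" -v rlen="$3" 'BEGIN {
    x = glen / nref
    if (rlen + 0 > x) x = rlen + 0
    v = int(log(x) / log(2) + 0.5)
    if (v > 18) v = 18
    print v
  }'
}

chr_bin_nbits 3000000000 100000 100   # 3 gigabase genome, 100,000 scaffolds
```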
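Mapping reads with STAR
Example mapping run
/pathToStarDir/STAR --genomeDir /path/to/GenomeDir --readFilesIn /path/to/read1.gz [/path/to/read2.gz] --readFilesCommand zcat --runThreadN <n> --<inputParameterName> <inputparameter value(s)> …
# Shared memory:
--genomeLoad <value>
# During mapping, this parameter controls whether the genome loaded into RAM is shared. If it is, other STAR jobs running on the same node with the same genome as reference can share it, which saves computing resources. Read the manual before using it.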
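Basic options
--genomeDir specifies the path to the genome directory where the genome indexes were generated.
--readFilesIn gives the name(s) (with path) of the files containing the sequences to be mapped (e.g. RNA-seq FASTQ files). If using Illumina paired-end reads, both the read1 and read2 files have to be supplied. STAR can process both FASTA and FASTQ files. Multi-line (i.e. sequence split over multiple lines) FASTA (but not FASTQ) files are supported.
If the read files are compressed, use the --readFilesCommand UncompressionCommand option, where UncompressionCommand is the un-compression command that takes the file name as input parameter and sends the uncompressed output to stdout. For example, for gzipped files (*.gz) use --readFilesCommand zcat or --readFilesCommand gunzip -c. For bzip2-compressed files, use --readFilesCommand bunzip2 -c.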
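Mapping multiple files in one run
Multiple samples can be mapped in one job with a single output. This is equivalent to concatenating the read files before mapping, except that different read groups can be used in --outSAMattrRGline to keep track of the reads from different files. For single-end reads use a comma-separated list (no spaces around commas), e.g.:
--readFilesIn sample1.fq,sample2.fq,sample3.fq
For paired-end reads, use a comma-separated list for read1, followed by a space, followed by a comma-separated list for read2, e.g.: --readFilesIn s1read1.fq,s2read1.fq,s3read1.fq s1read2.fq,s2read2.fq,s3read2.fq
For multiple read files, the corresponding read groups can be supplied as a space/comma/space-separated list in --outSAMattrRGline, e.g. --outSAMattrRGline ID:sample1 , ID:sample2 , ID:sample3
Note that this list is separated by commas surrounded by spaces (unlike the --readFilesIn list).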
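The two separator conventions are easy to mix up, so a small sketch may help: file names are joined with a bare comma, read groups with a comma surrounded by spaces. Both function names are hypothetical.

```shell
# Hypothetical helpers (not part of STAR).
# join_reads: join file names with "," for --readFilesIn.
join_reads() ( IFS=,; printf '%s\n' "$*" )

# join_read_groups: join sample names as "ID:a , ID:b" for --outSAMattrRGline.
join_read_groups() (
  out=""
  for id in "$@"; do
    out="${out:+$out , }ID:$id"
  done
  printf '%s\n' "$out"
)

join_reads sample1.fq sample2.fq sample3.fq
join_read_groups sample1 sample2 sample3
```

The first line of output is meant for --readFilesIn and the second for --outSAMattrRGline; the order of samples has to be the same in both.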
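Another option for mapping multiple read files, especially convenient for a very large number of files, is to create a file manifest and supply it in --readFilesManifest /path/to/manifest.tsv.
The manifest file should contain 3 tab-separated columns. For paired-end reads:
read1-file-name <tab> read2-file-name <tab> read-group-line
For single-end reads, the 2nd column should contain the dash -:
read1-file-name <tab> - <tab> read-group-line
Spaces, but not tabs, are allowed in file names. If read-group-line does not start with ID:, it can only contain one ID field, and ID: will be added to it. If read-group-line starts with ID:, it can contain several fields separated by tabs, and all fields will be copied verbatim into the SAM @RG header line.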
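Manifest lines in this layout can be produced with printf. The helper name and the file names below are hypothetical; an empty second argument is turned into the dash required for single-end reads.

```shell
# Hypothetical helper (not part of STAR): print one manifest line with
# 3 tab-separated columns: read1 file, read2 file (or "-"), read group line.
manifest_line() {
  printf '%s\t%s\t%s\n' "$1" "${2:--}" "$3"
}

manifest_line s1_R1.fq.gz s1_R2.fq.gz sample1   # paired-end sample
manifest_line s2_R1.fq.gz ""          sample2   # single-end: 2nd column is "-"
```

Redirecting the output of such calls into a file gives something that can be passed to --readFilesManifest.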
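Advanced options
Using annotations at the mapping stage
Starting from 2.4.1a, annotations can be included on the fly at the mapping step, without including them at the genome generation step. You can specify --sjdbGTFfile /path/to/ann.gtf and/or --sjdbFileChrStartEnd /path/to/sj.tab, as well as --sjdbOverhang and any other --sjdb* options. The genome indexes can be generated with or without another set of annotations/junctions.
If the genome indexes already contain annotations/junctions, the new junctions will be added to the old ones. STAR inserts the junctions into the genome indexes on the fly before mapping, which takes 1-2 minutes. The on-the-fly genome indexes can be saved (for reuse) with --sjdbInsertSave All, into the STARgenome directory inside the current run directory.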
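ENCODE options
An example of ENCODE standard options for the long RNA-seq pipeline is given below:
- --outFilterType BySJout reduces the number of "spurious" junctions
- --outFilterMultimapNmax 20 maximum number of multiple alignments allowed for a read: if exceeded, the read is considered unmapped
- --alignSJoverhangMin 8 minimum overhang for unannotated junctions
- --alignSJDBoverhangMin 1 minimum overhang for annotated junctions
- --outFilterMismatchNmax 999 maximum number of mismatches per pair; a large number switches off this filter
- --outFilterMismatchNoverReadLmax 0.04 maximum number of mismatches per pair relative to read length: for 2x100b, the maximum number of mismatches for the paired read is 0.04*200=8
- --alignIntronMin 20 minimum intron length
- --alignIntronMax 1000000 maximum intron length
- --alignMatesGapMax 1000000 maximum genomic distance between mates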
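For convenience, the options in this list can be kept in one shell variable and appended to a mapping command. The sketch below only prints the command (a dry run); the variable name is hypothetical and the paths are the same placeholders used elsewhere on this page.

```shell
# The ENCODE options listed above, collected in one variable (hypothetical name).
ENCODE_OPTS="--outFilterType BySJout \
--outFilterMultimapNmax 20 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverReadLmax 0.04 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000"

# Dry run: print the full command instead of executing it.
# $ENCODE_OPTS is left unquoted on purpose so that it splits into separate words.
echo STAR --genomeDir /path/to/GenomeDir \
  --readFilesIn /path/to/read1.gz /path/to/read2.gz \
  --readFilesCommand zcat $ENCODE_OPTS
```

Removing the echo turns the dry run into a real invocation once the placeholder paths are replaced.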
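Using shared memory
The --genomeLoad option controls how the genome is loaded into memory. By default, --genomeLoad NoSharedMemory, shared memory is not used.
With --genomeLoad LoadAndKeep, STAR loads the genome as a standard Linux shared memory piece. The genomes are identified by their unique directory paths. Before loading the genome, STAR checks whether the genome has already been loaded into shared memory. If it has not, STAR loads it and keeps it in memory even after the STAR job finishes. The genome is shared with all other STAR jobs. You can remove the genome from shared memory by running STAR with --genomeLoad Remove. The shared memory piece is physically removed only after all STAR jobs attached to it complete. With --genomeLoad LoadAndRemove, STAR loads the genome into shared memory and marks it for removal, so that the genome is removed from shared memory once all STAR jobs using it exit. With --genomeLoad LoadAndExit, STAR loads the genome into shared memory and immediately exits, keeping the genome loaded in shared memory for future runs.
If you need to check or remove shared memory pieces manually, use the standard Linux commands ipcs and ipcrm. If a genome residing in shared memory is not used for a long time, it may get paged out of RAM, which slows down STAR runs considerably. It is strongly recommended to re-load (i.e. remove and load again) shared memory genomes regularly.
Many standard Linux distributions do not allow large enough shared memory blocks. You can fix this issue if you have root privileges, or ask your system administrator to do it. To enable shared memory, modify or add the following lines to /etc/sysctl.conf:
kernel.shmmax = Nmax
kernel.shmall = Nall
The Nmax and Nall numbers should be chosen as follows:
Nmax > GenomeIndexSize = Genome + SA + SAindex (31000000000 for the human genome)
Nall > GenomeIndexSize/PageSize
where PageSize is typically 4096 (check with getconf PAGE_SIZE). Then run:
/sbin/sysctl -p
This increases the allowed shared memory blocks to 31GB, enough for the human or mouse genome.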
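The two inequalities can be turned into concrete numbers with shell arithmetic. This is a sketch under stated assumptions: the function name is hypothetical, the index size in bytes is supplied by the user, and the smallest integers that satisfy the strict inequalities are printed (in practice a rounder, larger value works just as well).

```shell
# Hypothetical helper (not part of STAR): print candidate sysctl lines.
# Arguments: genome index size in bytes, page size in bytes (default 4096).
shm_settings() (
  size=$1
  page=${2:-4096}
  echo "kernel.shmmax = $(( size + 1 ))"          # Nmax > GenomeIndexSize
  echo "kernel.shmall = $(( size / page + 1 ))"   # Nall > GenomeIndexSize/PageSize
)

shm_settings 31000000000 4096
```

On a real system the second argument could be taken from getconf PAGE_SIZE, and the printed lines would go into /etc/sysctl.conf.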
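All STAR parameters
Grouped by function:
Special attention has to be paid to parameters that start with --out*, as they control the STAR output.
In particular, the --outFilter* parameters control the filtering of output alignments, which you might want to tweak to fit your needs.
Output of "chimeric" alignments is controlled by the --chim* parameters.
Genome generation is controlled by the --genome* parameters.
Annotations (splice junction database) are controlled by the --sjdb* options at the genome generation step.
Tweaking the --score*, --align*, --seed*, --win* parameters requires an understanding of the STAR alignment algorithm and is only recommended for advanced users.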
Parameter files
--parametersFiles
default: -
string: name of a user-defined parameters file, "-": none. Can only be defined on the command line.
System
--sysShell
default: -
string: path to the shell binary, preferably bash, e.g. /bin/bash.
- "-": the default shell is executed, typically /bin/sh. This was reported to fail on some Ubuntu systems, in which case you need to specify the path to bash.
Run parameters
- --runMode default: alignReads string: type of the run.
- --runThreadN default: 1 int: number of threads to run STAR
- --runDirPerm default: User_RWX string: permissions for the directories created at the run-time.
- User_RWX user-read/write/execute
- All_RWX all-read/write/execute (same as chmod 777)
- --runRNGseed default: 777 int: random number generator seed.
Genome parameters
- --genomeDir default: ./GenomeDir/ string: path to the directory where genome files are stored (for --runMode alignReads) or will be generated (for --runMode genomeGenerate)
- --genomeLoad default: NoSharedMemory string: mode of shared memory usage for the genome files. Only used with --runMode alignReads.
- LoadAndKeep load genome into shared memory and keep it in memory after run
- LoadAndRemove load genome into shared memory but remove it after run
- LoadAndExit load genome into shared memory and exit, keeping the genome in memory for future runs
- Remove do not map anything, just remove loaded genome from memory
- NoSharedMemory do not use shared memory, each job will have its own private copy of the genome
- --genomeFastaFiles default: - string(s): path(s) to the fasta files with the genome sequences, separated by spaces. These files should be plain text FASTA files, they *cannot* be zipped. Required for the genome generation (--runMode genomeGenerate). Can also be used in the mapping (--runMode alignReads) to add extra (new) sequences to the genome (e.g. spike-ins).
- --genomeChainFiles default: - string: chain files for genomic liftover. Only used with --runMode liftOver.
- --genomeFileSizes default: 0 uint(s)>0: genome files exact sizes in bytes. Typically, this should not be defined by the user.
- --genomeTransformOutput default: None string(s): which output to transform back to the original genome
- SAM SAM/BAM alignments
- SJ splice junctions (SJ.out.tab)
- None no transformation of the output
Genome indexing parameters (only used with --runMode genomeGenerate)
- --genomeChrBinNbits default: 18 int: =log2(chrBin), where chrBin is the size of the bins for genome storage: each chromosome will occupy an integer number of bins. For a genome with large number of contigs, it is recommended to scale this parameter as min(18, log2[max(GenomeLength/NumberOfReferences,ReadLength)]).
- --genomeSAindexNbases default: 14 int: length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, the parameter --genomeSAindexNbases must be scaled down to min(14, log2(GenomeLength)/2 - 1).
- --genomeSAsparseD default: 1 int>0: suffix array sparsity, i.e. distance between indices: use bigger numbers to decrease needed RAM at the cost of mapping speed reduction
- --genomeSuffixLengthMax default: -1 int: maximum length of the suffixes, has to be longer than read length. -1 = infinite.
- --genomeTransformType default: None string: type of genome transformation
- None no transformation
- Haploid replace reference alleles with alternative alleles from VCF file (e.g. consensus allele)
- Diploid create two haplotypes for each chromosome listed in VCF file, for genotypes 1|2, assumes perfect phasing (e.g. personal genome)
- --genomeTransformVCF default: - string: path to VCF file for genome transformation
Splice junction database
- --sjdbFileChrStartEnd default: - string(s): path to the files with genomic coordinates (chr <tab> start <tab> end <tab> strand) for the splice junction introns. Multiple files can be supplied and will be concatenated.
- --sjdbGTFfile default: - string: path to the GTF file with annotations
- --sjdbGTFchrPrefix default: - string: prefix for chromosome names in a GTF file (e.g. 'chr' for using ENSEMBL annotations with UCSC genomes)
- --sjdbGTFfeatureExon default: exon string: feature type in GTF file to be used as exons for building transcripts
- --sjdbGTFtagExonParentTranscript default: transcript_id string: GTF attribute name for parent transcript ID (default "transcript_id" works for GTF files)
- --sjdbGTFtagExonParentGene default: gene_id string: GTF attribute name for parent gene ID (default "gene_id" works for GTF files)
- --sjdbGTFtagExonParentGeneName default: gene_name string(s): GTF attribute name for parent gene name
- --sjdbGTFtagExonParentGeneType default: gene_type gene_biotype string(s): GTF attribute name for parent gene type
- --sjdbOverhang default: 100 int>0: length of the donor/acceptor sequence on each side of the junctions, ideally = (mate length - 1)
- --sjdbScore default: 2 int: extra alignment score for alignments that cross database junctions
- --sjdbInsertSave default: Basic string: which files to save when sjdb junctions are inserted on the fly at the mapping step
- Basic only small junction / transcript files
- All all files including big Genome, SA and SAindex - this will create a complete genome directory
Variation parameters
--varVCFfile default: - string: path to the VCF file that contains variation data. The 10th column should contain the genotype information, e.g. 0/1
Input files
--inputBAMfile default: - string: path to BAM input file, to be used with --runMode inputAlignmentsFromBAM
Read parameters
- --readFilesType default: Fastx string: format of input read files
- Fastx FASTA or FASTQ
- SAM SE SAM or BAM single-end reads; for BAM use --readFilesCommand samtools view
- SAM PE SAM or BAM paired-end reads; for BAM use --readFilesCommand samtools view
- --readFilesSAMattrKeep default: All string(s): for --readFilesType SAM SE/PE, which SAM tags to keep in the output BAM, e.g.: --readFilesSAMattrKeep RG PL
- All keep all tags
- None do not keep any tags
- --readFilesIn default: Read1 Read2 string(s): paths to files that contain input read1 (and, if needed, read2)
- --readFilesManifest default: - string: path to the "manifest" file with the names of read files. The manifest file should contain 3 tab-separated columns: paired-end reads: read1-file-name <tab> read2-file-name <tab> read-group-line; single-end reads: read1-file-name <tab> - <tab> read-group-line. Spaces, but not tabs, are allowed in file names. If read-group-line does not start with ID:, it can only contain one ID field, and ID: will be added to it. If read-group-line starts with ID:, it can contain several fields separated by tabs, and all fields will be copied verbatim into the SAM @RG header line.
- --readFilesPrefix default: - string: prefix for the read files names, i.e. it will be added in front of the strings in --readFilesIn
- --readFilesCommand default: - string(s): command line to execute for each of the input files. This command should generate FASTA or FASTQ text and send it to stdout. For example: zcat to uncompress .gz files, bzcat to uncompress .bz2 files, etc.
- --readMapNumber default: -1 int: number of reads to map from the beginning of the file. -1: map all reads
- --readMatesLengthsIn default: NotEqual string: Equal/NotEqual - lengths of names,sequences,qualities for both mates are the same / not the same. NotEqual is safe in all situations.
- --readNameSeparator default: / string(s): character(s) separating the part of the read names that will be trimmed in output (read name after space is always trimmed)
- --readQualityScoreBase default: 33 int>=0: number to be subtracted from the ASCII code to get Phred quality score
Read Clipping
- --clipAdapterType default: Hamming string: adapter clipping type
- Hamming adapter clipping based on Hamming distance, with the number of mismatches controlled by --clip5pAdapterMMp
- CellRanger4 5p and 3p adapter clipping similar to CellRanger4. Utilizes Opal package by Martin Šošić: https://github.com/Martinsos/opal
- None no adapter clipping, all other clip* parameters are disregarded
- --clip3pNbases default: 0 int(s): number(s) of bases to clip from 3p of each mate. If one value is given, it will be assumed the same for both mates.
- --clip3pAdapterSeq default: - string(s): adapter sequences to clip from 3p of each mate. If one value is given, it will be assumed the same for both mates.
- polyA polyA sequence with the length equal to read length
- --clip3pAdapterMMp default: 0.1 double(s): max proportion of mismatches for 3p adapter clipping for each mate. If one value is given, it will be assumed the same for both mates.
- --clip3pAfterAdapterNbases default: 0 int(s): number of bases to clip from 3p of each mate after the adapter clipping. If one value is given, it will be assumed the same for both mates.
- --clip5pNbases default: 0 int(s): number(s) of bases to clip from 5p of each mate. If one value is given, it will be assumed the same for both mates.
Limits
- --limitGenomeGenerateRAM default: 31000000000 int>0: maximum available RAM (bytes) for genome generation
- --limitIObufferSize default: 30000000 50000000 int>0: max available buffers size (bytes) for input/output, per thread
- --limitOutSAMoneReadBytes default: 100000 int>0: max size of the SAM record (bytes) for one read. Recommended value: >(2*(LengthMate1+LengthMate2+100)*outFilterMultimapNmax)
- --limitOutSJoneRead default: 1000 int>0: max number of junctions for one read (including all multi-mappers)
- --limitOutSJcollapsed default: 1000000 int>0: max number of collapsed junctions
- --limitBAMsortRAM default: 0 int>=0: maximum available RAM (bytes) for sorting BAM. If =0, it will be set to the genome index size. 0 value can only be used with --genomeLoad NoSharedMemory option.
- --limitSjdbInsertNsj default: 1000000 int>=0: maximum number of junction to be inserted to the genome on the fly at the mapping stage, including those from annotations and those detected in the 1st step of the 2-pass run
- --limitNreadsSoft default: -1 int: soft limit on the number of reads
Output: general
- --outFileNamePrefix default: ./ string: output files name prefix (including full or relative path). Can only be defined on the command line.
- --outTmpDir default: - string: path to a directory that will be used as temporary by STAR. All contents of this directory will be removed! "-": the temp directory will default to outFileNamePrefix_STARtmp
- --outTmpKeep default: None string: whether to keep the temporary files after the STAR run is finished
- None remove all temporary files
- All keep all files
- --outStd default: Log string: which output will be directed to stdout (standard out)
- Log log messages
- SAM alignments in SAM format (which normally are output to Aligned.out.sam file), normal standard output will go into Log.std.out
- BAM_Unsorted alignments in BAM format, unsorted. Requires --outSAMtype BAM Unsorted
- BAM_SortedByCoordinate alignments in BAM format, sorted by coordinate. Requires --outSAMtype BAM SortedByCoordinate
- BAM_Quant alignments to transcriptome in BAM format, unsorted. Requires --quantMode TranscriptomeSAM
- --outReadsUnmapped default: None string: output of unmapped and partially mapped (i.e. mapped only one mate of a paired end read) reads in separate file(s).
- None no output
- Fastx output in separate fasta/fastq files, Unmapped.out.mate1/2
- --outQSconversionAdd default: 0 int: add this number to the quality score (e.g. to convert from Illumina to Sanger, use -31)
- --outMultimapperOrder default: Old_2.4 string: order of multimapping alignments in the output files
- Old_2.4 quasi-random order used before 2.5.0
- Random random order of alignments for each multi-mapper. Read mates (pairs) are always adjacent, all alignments for each read stay together. This option will become default in the future releases.
Output: SAM and BAM
- --outSAMtype default: SAM strings: type of SAM/BAM output
- 1st word:
- BAM output BAM without sorting
- SAM output SAM without sorting
- None no SAM/BAM output
- 2nd, 3rd:
- Unsorted standard unsorted
- SortedByCoordinate sorted by coordinate. This option will allocate extra memory for sorting which can be specified by --limitBAMsortRAM
- --outSAMmode default: Full string: mode of SAM output
- None no SAM output
- Full full SAM output
- NoQS full SAM but without quality scores
- --outSAMstrandField default: None string: Cufflinks-like strand field flag
- None not used
- intronMotif strand derived from the intron motif. This option changes the output alignments: reads with inconsistent and/or non-canonical introns are filtered out.
- --outSAMattributes default: Standard string: a string of desired SAM attributes, in the order desired for the output SAM. Tags can be listed in any combination/order.
- ***Presets:
- None no attributes
- Standard NH HI AS nM
- All NH HI AS nM NM MD jM jI MC ch
- ***Alignment:
- NH number of loci the reads maps to: =1 for unique mappers, >1 for multimappers. Standard SAM tag.
- HI multiple alignment index, starts with --outSAMattrIHstart (=1 by default). Standard SAM tag.
- AS local alignment score, +1/-1 for matches/mismatches, score* penalties for indels and gaps. For PE reads, total score for two mates. Standard SAM tag.
- nM number of mismatches. For PE reads, sum over two mates.
- NM edit distance to the reference (number of mismatched + inserted + deleted bases) for each mate. Standard SAM tag.
- MD string encoding mismatched and deleted reference bases (see standard SAM specifications). Standard SAM tag.
- jM intron motifs for all junctions (i.e. N in CIGAR): 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT. If splice junctions database is used, and a junction is annotated, 20 is added to its motif value.
- jI start and end of introns for all junctions (1-based).
- XS alignment strand according to --outSAMstrandField.
- MC mate's CIGAR string. Standard SAM tag.
- ch marks all segments of all chimeric alignments for --chimOutType WithinBAM output.
- cN number of bases clipped from the read ends: 5’ and 3’
- ***Variation:
- vA variant allele
- vG genomic coordinate of the variant overlapped by the read.
- vW 1 - alignment passes WASP filtering; 2,3,4,5,6,7 - alignment does not pass WASP filtering. Requires --waspOutputMode SAMtag.
- ***STARsolo:
- CR CY UR UY sequences and quality scores of cell barcodes and UMIs for the solo* demultiplexing.
- GX GN gene ID and gene name.
- CB UB error-corrected cell barcodes and UMIs for solo* demultiplexing. Requires --outSAMtype BAM SortedByCoordinate.
- sM assessment of CB and UMI.
- sS sequence of the entire barcode (CB,UMI,adapter).
- sQ quality of the entire barcode.
- ***Unsupported/undocumented:
- ha haplotype (1/2) when mapping to the diploid genome. Requires genome generated with --genomeTransformType Diploid.
- rB alignment block read/genomic coordinates.
- vR read coordinate of the variant.
- --outSAMattrIHstart default: 1 int>=0: start value for the IH attribute. 0 may be required by some downstream software, such as Cufflinks or StringTie.
- --outSAMunmapped default: None string(s): output of unmapped reads in the SAM format
- 1st word:
- None no output
- Within output unmapped reads within the main SAM file (i.e. Aligned.out.sam)
- 2nd word:
- KeepPairs record unmapped mate for each alignment, and, in case of unsorted output, keep it adjacent to its mapped mate. Only affects multi-mapping reads.
- --outSAMorder default: Paired string: type of sorting for the SAM output
- Paired one mate after the other for all paired alignments
- PairedKeepInputOrder one mate after the other for all paired alignments, the order is kept the same as in the input FASTQ files
- --outSAMprimaryFlag default: OneBestScore string: which alignments are considered primary - all others will be marked with 0x100 bit in the FLAG
- OneBestScore only one alignment with the best score is primary
- AllBestScore all alignments with the best score are primary
- --outSAMreadID default: Standard string: read ID record type
- Standard first word (until space) from the FASTx read ID line, removing /1,/2 from the end
- Number read number (index) in the FASTx file
- --outSAMmapqUnique default: 255 int: 0 to 255: the MAPQ value for unique mappers
- --outSAMflagOR default: 0 int: 0 to 65535: SAM FLAG will be bitwise OR'd with this value, i.e. FLAG=FLAG | outSAMflagOR. This is applied after all flags have been set by STAR, and after outSAMflagAND. Can be used to set specific bits that are not set otherwise.
- --outSAMflagAND default: 65535 int: 0 to 65535: SAM FLAG will be bitwise AND'd with this value, i.e. FLAG=FLAG & outSAMflagAND. This is applied after all flags have been set by STAR, but before outSAMflagOR. Can be used to unset specific bits.
- --outSAMattrRGline default: - string(s): SAM/BAM read group line. The first word contains the read group identifier and must start with "ID:", e.g. --outSAMattrRGline ID:xxx CN:yy "DS:z z z". xxx will be added as RG tag to each output alignment. Any spaces in the tag values have to be double quoted. Comma-separated RG lines correspond to different (comma-separated) input files in --readFilesIn. Commas have to be surrounded by spaces, e.g. --outSAMattrRGline ID:xxx , ID:zzz "DS:z z" , ID:yyy DS:yyyy
- --outSAMheaderHD default: - strings: @HD (header) line of the SAM header
- --outSAMheaderPG default: - strings: extra @PG (software) line of the SAM header (in addition to STAR)
- --outSAMheaderCommentFile default: - string: path to the file with @CO (comment) lines of the SAM header
- --outSAMfilter default: None string(s): filter the output into main SAM/BAM files
- KeepOnlyAddedReferences only keep the reads for which all alignments are to the extra reference sequences added with --genomeFastaFiles at the mapping stage.
- KeepAllAddedReferences keep all alignments to the extra reference sequences added with --genomeFastaFiles at the mapping stage.
- --outSAMmultNmax default: -1 int: max number of multiple alignments for a read that will be output to the SAM/BAM files. Note that if this value is not equal to -1, the top scoring alignment will be output first
- -1 all alignments (up to --outFilterMultimapNmax) will be output
- --outSAMtlen default: 1 int: calculation method for the TLEN field in the SAM/BAM files
- 1 leftmost base of the (+)strand mate to rightmost base of the (-)mate. (+)sign for the (+)strand mate
- 2 leftmost base of any mate to rightmost base of any mate. (+)sign for the mate with the leftmost base. This is different from 1 for overlapping mates with protruding ends
- --outBAMcompression default: 1 int: -1 to 10 BAM compression level, -1=default compression (6?), 0=no compression, 10=maximum compression
- --outBAMsortingThreadN default: 0 int: >=0: number of threads for BAM sorting. 0 will default to min(6,--runThreadN).
- --outBAMsortingBinsN default: 50 int: >0: number of genome bins for coordinate-sorting
BAM processing
- --bamRemoveDuplicatesType default: - string: mark duplicates in the BAM file, for now only works with (i) sorted BAM fed with --inputBAMfile, and (ii) for paired-end alignments only
- - no duplicate removal/marking
- UniqueIdentical mark all multimappers, and duplicate unique mappers. The coordinates, FLAG, CIGAR must be identical
- UniqueIdenticalNotMulti mark duplicate unique mappers but not multimappers.
- --bamRemoveDuplicatesMate2basesN default: 0 int>0: number of bases from the 5’ of mate 2 to use in collapsing (e.g. for RAMPAGE)
Output: wiggle
- --outWigType default: None string(s): type of signal output, e.g. "bedGraph" OR "bedGraph read1_5p". Requires sorted BAM: --outSAMtype BAM SortedByCoordinate.
- 1st word:
- None no signal output
- bedGraph bedGraph format
- wiggle wiggle format
- 2nd word:
- read1_5p signal from only 5' of the 1st read, useful for CAGE/RAMPAGE etc
- read2 signal from only 2nd read
- --outWigStrand default: Stranded string: strandedness of wiggle/bedGraph output
- Stranded separate strands, str1 and str2
- Unstranded collapsed strands
- --outWigReferencesPrefix default: - string: prefix matching reference names to include in the output wiggle file, e.g. "chr", default "-" - include all references
- --outWigNorm default: RPM string: type of normalization for the signal
- RPM reads per million of mapped reads
- None no normalization, "raw" counts
Output filtering
- --outFilterType default: Normal string: type of filtering
- Normal standard filtering using only current alignment
- BySJout keep only those reads that contain junctions that passed filtering into SJ.out.tab
- --outFilterMultimapScoreRange default: 1 int: the score range below the maximum score for multimapping alignments
- --outFilterMultimapNmax default: 10 int: maximum number of loci the read is allowed to map to. Alignments (all of them) will be output only if the read maps to no more loci than this value. Otherwise no alignments will be output, and the read will be counted as "mapped to too many loci" in the Log.final.out.
- --outFilterMismatchNmax default: 10 int: alignment will be output only if it has no more mismatches than this value.
- --outFilterMismatchNoverLmax default: 0.3 real: alignment will be output only if its ratio of mismatches to *mapped* length is less than or equal to this value.
- --outFilterMismatchNoverReadLmax default: 1.0 real: alignment will be output only if its ratio of mismatches to *read* length is less than or equal to this value.
- --outFilterScoreMin default: 0 int: alignment will be output only if its score is higher than or equal to this value.
- --outFilterScoreMinOverLread default: 0.66 real: same as outFilterScoreMin, but normalized to read length (sum of mates’ lengths for paired-end reads)
- --outFilterMatchNmin default: 0 int: alignment will be output only if the number of matched bases is higher than or equal to this value.
- --outFilterMatchNminOverLread default: 0.66 real: same as outFilterMatchNmin, but normalized to the read length (sum of mates' lengths for paired-end reads).
- --outFilterIntronMotifs default: None string: filter alignment using their motifs
- None no filtering
- RemoveNoncanonical filter out alignments that contain non-canonical junctions
- RemoveNoncanonicalUnannotated filter out alignments that contain non-canonical unannotated junctions when using annotated splice junctions database. The annotated non-canonical junctions will be kept.
- --outFilterIntronStrands default: RemoveInconsistentStrands string: filter alignments
- RemoveInconsistentStrands remove alignments that have junctions with inconsistent strands
- None no filtering
Output: splice junctions
- --outSJtype default: Standard string: type of splice junction output
- Standard standard SJ.out.tab output
- None no splice junction output
Output filtering: splice junctions
- --outSJfilterReads default: All string: which reads to consider for collapsed splice junctions output
- All all reads, unique- and multi-mappers
- Unique uniquely mapping reads only
- --outSJfilterOverhangMin default: 30 12 12 12 4 integers: minimum overhang length for splice junctions on both sides for: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. -1 means no output for that motif. Does not apply to annotated junctions.
- --outSJfilterCountUniqueMin default: 3 1 1 1 4 integers: minimum uniquely mapping read count per junction for: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. -1 means no output for that motif. Junctions are output if one of the outSJfilterCountUniqueMin OR outSJfilterCountTotalMin conditions is satisfied. Does not apply to annotated junctions.
- --outSJfilterCountTotalMin default: 3 1 1 1 4 integers: minimum total (multi-mapping+unique) read count per junction for: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. -1 means no output for that motif. Junctions are output if one of the outSJfilterCountUniqueMin OR outSJfilterCountTotalMin conditions is satisfied. Does not apply to annotated junctions.
- --outSJfilterDistToOtherSJmin default: 10 0 5 10 4 integers>=0: minimum allowed distance to other junctions' donor/acceptor. Does not apply to annotated junctions.
- --outSJfilterIntronMaxVsReadN default: 50000 100000 200000 N integers>=0: maximum gap allowed for junctions supported by 1,2,3,...,N reads, i.e. by default junctions supported by 1 read can have gaps <=50000b, by 2 reads: <=100000b, by 3 reads: <=200000b, by >=4 reads: any gap <=alignIntronMax. Does not apply to annotated junctions.
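The read-count-dependent gap limit of --outSJfilterIntronMaxVsReadN can be pictured as a small lookup. The sketch below hard-codes the default values listed above; the function name max_gap_for_reads is hypothetical and the code only mirrors the documented rule, it is not taken from STAR's source.

```shell
# Hypothetical helper: maximum junction gap allowed for a junction supported
# by N reads, using the default --outSJfilterIntronMaxVsReadN values
# (50000 100000 200000). Assumes N >= 1.
max_gap_for_reads() {
    case "$1" in
        1) echo 50000 ;;
        2) echo 100000 ;;
        3) echo 200000 ;;
        # With 4 or more supporting reads only alignIntronMax limits the gap.
        *) echo "alignIntronMax" ;;
    esac
}

max_gap_for_reads 1   # prints 50000
max_gap_for_reads 3   # prints 200000
max_gap_for_reads 7   # prints alignIntronMax
```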
Scoring
- --scoreGap default: 0 int: splice junction penalty (independent of intron motif)
- --scoreGapNoncan default: -8 int: non-canonical junction penalty (in addition to scoreGap)
- --scoreGapGCAG default: -4 GC/AG and CT/GC junction penalty (in addition to scoreGap)
- --scoreGapATAC default: -8 AT/AC and GT/AT junction penalty (in addition to scoreGap)
- --scoreGenomicLengthLog2scale default: -0.25 extra score logarithmically scaled with genomic length of the alignment: scoreGenomicLengthLog2scale*log2(genomicLength)
- --scoreDelOpen default: -2 deletion open penalty
- --scoreDelBase default: -2 deletion extension penalty per base (in addition to scoreDelOpen)
- --scoreInsOpen default: -2 insertion open penalty
- --scoreInsBase default: -2 insertion extension penalty per base (in addition to scoreInsOpen)
- --scoreStitchSJshift default: 1 maximum score reduction while searching for SJ boundaries in the stitching step
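To make the additive penalties concrete, the sketch below combines the default values listed above: the junction penalty is scoreGap plus the motif-specific term, and a deletion costs scoreDelOpen plus scoreDelBase per deleted base. The function names are hypothetical, and this is a simplified reading of the parameter descriptions rather than STAR's full scoring code.

```shell
# Hypothetical helpers using the default scoring parameters listed above.
junction_penalty() {
    score_gap=0                       # --scoreGap
    case "$1" in
        GT/AG|CT/AC) extra=0 ;;       # canonical motif: scoreGap only
        GC/AG|CT/GC) extra=-4 ;;      # --scoreGapGCAG
        AT/AC|GT/AT) extra=-8 ;;      # --scoreGapATAC
        *)           extra=-8 ;;      # --scoreGapNoncan
    esac
    echo $((score_gap + extra))
}

deletion_penalty() {
    n=$1
    del_open=-2                       # --scoreDelOpen
    del_base=-2                       # --scoreDelBase
    echo $((del_open + n * del_base))
}

junction_penalty GT/AG   # prints 0
junction_penalty GC/AG   # prints -4
deletion_penalty 3       # prints -8
```

Insertions follow the same pattern with --scoreInsOpen and --scoreInsBase.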
Alignments and seeding
- --seedSearchStartLmax default: 50 int>0: defines the search start point through the read - the read is split into pieces no longer than this value
- --seedSearchStartLmaxOverLread default: 1.0 real: seedSearchStartLmax normalized to read length (sum of mates’ lengths for paired-end reads)
- --seedSearchLmax default: 0 int>=0: defines the maximum length of the seeds, if =0 seed length is not limited
- --seedMultimapNmax default: 10000 int>0: only pieces that map fewer than this value are utilized in the stitching procedure
- --seedPerReadNmax default: 1000 int>0: max number of seeds per read
- --seedPerWindowNmax default: 50 int>0: max number of seeds per window
- --seedNoneLociPerWindow default: 10 int>0: max number of one seed loci per window
- --seedSplitMin default: 12 int>0: min length of the seed sequences split by Ns or mate gap
- --seedMapMin default: 5 int>0: min length of seeds to be mapped
- --alignIntronMin default: 21 minimum intron size: genomic gap is considered intron if its length>=alignIntronMin, otherwise it is considered Deletion
- --alignIntronMax default: 0 maximum intron size, if 0, max intron size will be determined by (2^winBinNbits)*winAnchorDistNbins
- --alignMatesGapMax default: 0 maximum gap between two mates, if 0, max intron gap will be determined by (2^winBinNbits)*winAnchorDistNbins
- --alignSJoverhangMin default: 5 int>0: minimum overhang (i.e. block size) for spliced alignments
- --alignSJstitchMismatchNmax default: 0 -1 0 0 4*int>=0: maximum number of mismatches for stitching of the splice junctions (-1: no limit). (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif.
- --alignSJDBoverhangMin default: 3 int>0: minimum overhang (i.e. block size) for annotated (sjdb) spliced alignments
- --alignSplicedMateMapLmin default: 0 int>0: minimum mapped length for a read mate that is spliced
- --alignSplicedMateMapLminOverLmate default: 0.66 real>0: alignSplicedMateMapLmin normalized to mate length
- --alignWindowsPerReadNmax default: 10000 int>0: max number of windows per read
- --alignTranscriptsPerWindowNmax default: 100 int>0: max number of transcripts per window
- --alignTranscriptsPerReadNmax default: 10000 int>0: max number of different alignments per read to consider
- --alignEndsType default: Local string: type of read ends alignment
- Local standard local alignment with soft-clipping allowed
- EndToEnd force end-to-end read alignment, do not soft-clip
- Extend5pOfRead1 fully extend only the 5p of the read1, all other ends: local alignment
- Extend5pOfReads12 fully extend only the 5p of both read1 and read2, all other ends: local alignment
- --alignEndsProtrude default: 0 ConcordantPair int, string: allow protrusion of alignment ends, i.e. start (end) of the +strand mate downstream of the start (end) of the -strand mate
- 1st word: int: maximum number of protrusion bases allowed
- 2nd word: string:
- ConcordantPair report alignments with non-zero protrusion as concordant pairs
- DiscordantPair report alignments with non-zero protrusion as discordant pairs
- --alignSoftClipAtReferenceEnds default: Yes string: allow the soft-clipping of the alignments past the end of the chromosomes
- Yes allow
- No prohibit, useful for compatibility with Cufflinks
- --alignInsertionFlush default: None string: how to flush ambiguous insertion positions
- None insertions are not flushed
- Right insertions are flushed to the right
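The --alignIntronMin rule above (a genomic gap counts as an intron only if it is at least alignIntronMin bases long, otherwise as a deletion) can be sketched as a simple comparison. The function name classify_gap is hypothetical and the default of 21 is the value listed above.

```shell
# Hypothetical helper: classify a genomic gap the way --alignIntronMin
# describes it. Arguments: gap length, optional alignIntronMin (default 21).
classify_gap() {
    gap_len=$1
    align_intron_min=${2:-21}
    if [ "$gap_len" -ge "$align_intron_min" ]; then
        echo intron
    else
        echo deletion
    fi
}

classify_gap 20      # prints deletion
classify_gap 21      # prints intron
classify_gap 30 50   # prints deletion (with alignIntronMin raised to 50)
```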
Paired-end reads
- --peOverlapNbasesMin default: 0 int>=0: minimum number of overlap bases to trigger mates merging and realignment
- --peOverlapMMp default: 0.01 real, >=0 & <1: maximum proportion of mismatched bases in the overlap area
Windows, Anchors, Binning
- --winAnchorMultimapNmax default: 50 int>0: max number of loci anchors are allowed to map to
- --winBinNbits default: 16 int>0: =log2(winBin), where winBin is the size of the bin for the windows/clustering, each window will occupy an integer number of bins.
- --winAnchorDistNbins default: 9 int>0: max number of bins between two anchors that allows aggregation of anchors into one window
- --winFlankNbins default: 4 int>0: log2(winFlank), where winFlank is the size of the left and right flanking regions for each window
- --winReadCoverageRelativeMin default: 0.5 real>=0: minimum relative coverage of the read sequence by the seeds in a window, for STARlong algorithm only.
- --winReadCoverageBasesMin default: 0 int>0: minimum number of bases covered by the seeds in a window, for STARlong algorithm only.
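When --alignIntronMax or --alignMatesGapMax is left at 0, the limit is derived from the window parameters as (2^winBinNbits)*winAnchorDistNbins. The sketch below evaluates that expression with the defaults listed above; the function name default_max_intron is hypothetical.

```shell
# Hypothetical helper: evaluate (2^winBinNbits) * winAnchorDistNbins.
# Arguments: winBinNbits (default 16), winAnchorDistNbins (default 9).
default_max_intron() {
    win_bin_nbits=${1:-16}
    win_anchor_dist_nbins=${2:-9}
    win_bin=$((1 << win_bin_nbits))          # bin size in bases: 2^winBinNbits
    echo $((win_bin * win_anchor_dist_nbins))
}

default_max_intron         # prints 589824 (65536 * 9)
default_max_intron 14 9    # prints 147456 (16384 * 9)
```

Lowering winBinNbits therefore shrinks the implied maximum intron size unless --alignIntronMax is set explicitly.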
Chimeric alignments
- --chimOutType default: Junctions string(s): type of chimeric output
- Junctions Chimeric.out.junction
- SeparateSAMold output old SAM into separate Chimeric.out.sam file
- WithinBAM output into main aligned BAM files (Aligned.*.bam)
- WithinBAM HardClip (default) hard-clipping in the CIGAR for supplemental chimeric alignments (default if no 2nd word is present)
- WithinBAM SoftClip soft-clipping in the CIGAR for supplemental chimeric alignments
- --chimSegmentMin default: 0 int>=0: minimum length of a chimeric segment; if ==0, no chimeric output
- --chimScoreMin default: 0 int>=0: minimum total (summed) score of the chimeric segments
- --chimScoreDropMax default: 20 int>=0: max drop (difference) of chimeric score (the sum of scores of all chimeric segments) from the read length
- --chimScoreSeparation default: 10 int>=0: minimum difference (separation) between the best chimeric score and the next one
- --chimScoreJunctionNonGTAG default: -1 int: penalty for a non-GT/AG chimeric junction
- --chimJunctionOverhangMin default: 20 int>=0: minimum overhang for a chimeric junction
- --chimSegmentReadGapMax default: 0 int>=0: maximum gap in the read sequence between chimeric segments
- --chimFilter default: banGenomicN string(s): different filters for chimeric alignments
- None no filtering
- banGenomicN Ns are not allowed in the genome sequence around the chimeric junction
- --chimMainSegmentMultNmax default: 10 int>=1: maximum number of multi-alignments for the main chimeric segment. =1 will prohibit multimapping main segments.
- --chimMultimapNmax default: 0 int>=0: maximum number of chimeric multi-alignments
- 0 use the old scheme for chimeric detection which only considered unique alignments
- --chimMultimapScoreRange default: 1 int>=0: the score range for multi-mapping chimeras below the best chimeric score. Only works with --chimMultimapNmax > 1
- --chimNonchimScoreDropMin default: 20 int>=0: to trigger chimeric detection, the drop in the best non-chimeric alignment score with respect to the read length has to be greater than this value
- --chimOutJunctionFormat default: 0 int: formatting type for the Chimeric.out.junction file
- 0 no comment lines/headers
- 1 comment lines at the end of the file: command line and Nreads: total, unique/multi-mapping
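The --chimNonchimScoreDropMin condition above can be read as a threshold on how far the best non-chimeric score falls below the read length. The sketch below encodes just that comparison with the default of 20; the function name is hypothetical, and the other chimeric parameters (for example a non-zero --chimSegmentMin) still have to be satisfied in a real run.

```shell
# Hypothetical helper: would the score drop alone trigger chimeric detection?
# Arguments: read length, best non-chimeric alignment score,
# optional chimNonchimScoreDropMin (default 20).
chimeric_search_triggered() {
    read_len=$1
    best_score=$2
    min_drop=${3:-20}
    drop=$((read_len - best_score))
    # The drop has to be strictly greater than the threshold.
    if [ "$drop" -gt "$min_drop" ]; then
        echo yes
    else
        echo no
    fi
}

chimeric_search_triggered 100 98   # drop of 2: prints no
chimeric_search_triggered 100 60   # drop of 40: prints yes
```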
Quantification of annotations
- --quantMode default: - string(s): types of quantification requested
- - none
- TranscriptomeSAM output SAM/BAM alignments to transcriptome into a separate file
- GeneCounts count reads per gene
- --quantTranscriptomeBAMcompression default: 1 1 int: -2 to 10 transcriptome BAM compression level
- -2 no BAM output
- -1 default compression (6?)
- 0 no compression
- 10 maximum compression
- --quantTranscriptomeBan default: IndelSoftclipSingleend string: prohibit various alignment type
- IndelSoftclipSingleend prohibit indels, soft clipping and single-end alignments - compatible with RSEM
- Singleend prohibit single-end alignments
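With --quantMode GeneCounts, STAR writes per-gene read counts to a ReadsPerGene.out.tab file. According to the STAR manual this file is tab-separated, with the gene ID in column 1, unstranded counts in column 2 and the two stranded counts in columns 3 and 4, preceded by summary rows whose names start with N_. The sketch below pulls one count column out of such a file; the function name gene_counts is hypothetical and the toy input uses made-up gene IDs and counts purely to show the layout.

```shell
# Hypothetical helper: print "gene<TAB>count" from a ReadsPerGene.out.tab file.
# Arguments: file path, count column (2, 3 or 4; default 2 = unstranded).
gene_counts() {
    awk -F '\t' -v col="${2:-2}" '$1 !~ /^N_/ { print $1 "\t" $col }' "$1"
}

# Toy input with made-up values, written to a temporary file.
toy=$(mktemp)
printf 'N_unmapped\t10\t10\t10\n'     >  "$toy"
printf 'N_multimapping\t5\t5\t5\n'    >> "$toy"
printf 'N_noFeature\t3\t20\t4\n'      >> "$toy"
printf 'N_ambiguous\t1\t0\t1\n'       >> "$toy"
printf 'geneA\t7\t0\t7\n'             >> "$toy"
printf 'geneB\t2\t1\t1\n'             >> "$toy"

gene_counts "$toy" 4   # prints geneA<TAB>7 and geneB<TAB>1
```

Which column is appropriate depends on the strandedness of the library protocol.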
2-pass mapping
- --twopassMode default: None string: 2-pass mapping mode.
- None 1-pass mapping
- Basic basic 2-pass mapping, with all 1st pass junctions inserted into the genome indices on the fly
- --twopass1readsN default: -1 int: number of reads to process for the 1st step. Use very large number (or default -1) to map all reads in the first step.
WASP parameters
- --waspOutputMode default: None string: WASP allele-specific output type. This is a re-implementation of the original WASP mappability filtering by Bryce van de Geijn, Graham McVicker, Yoav Gilad & Jonathan K Pritchard. Please cite the original WASP paper: Nature Methods 12, 1061–1063 (2015), https://www.nature.com/articles/nmeth.3582 .
- SAMtag add WASP tags to the alignments that pass WASP filtering
STARsolo (single cell RNA-seq) parameters
- --soloType default: None string(s): type of single-cell RNA-seq
- CB_UMI_Simple (a.k.a. Droplet) one UMI and one Cell Barcode of fixed length in read2, e.g. Drop-seq and 10X Chromium.
- CB_UMI_Complex one UMI of fixed length, but multiple Cell Barcodes of varying length, as well as adapter sequences are allowed in read2 only, e.g. inDrop.
- CB_samTagOut output Cell Barcode as CR and/or CB SAM tag. No UMI counting. --readFilesIn cDNA_read1 [cDNA_read2 if paired-end] CellBarcode_read . Requires --outSAMtype BAM Unsorted [and/or SortedByCoordinate]
- SmartSeq Smart-seq: each cell in a separate FASTQ (paired- or single-end), barcodes are corresponding read-groups, no UMI sequences, alignments deduplicated according to alignment start and end (after extending soft-clipped bases)
- --soloCBwhitelist default: - string(s): file(s) with whitelist(s) of cell barcodes. Only --soloType CB_UMI_Complex allows more than one whitelist file.
- None no whitelist: all cell barcodes are allowed
- --soloCBstart default: 1 int>0: cell barcode start base
- --soloCBlen default: 16 int>0: cell barcode length
- --soloUMIstart default: 17 int>0: UMI start base
- --soloUMIlen default: 10 int>0: UMI length
- --soloBarcodeReadLength default: 1 int: length of the barcode read
- 1 equal to sum of soloCBlen+soloUMIlen
- 0 not defined, do not check
- --soloBarcodeMate default: 0 int: identifies which read mate contains the barcode (CB+UMI) sequence
- 0 barcode sequence is on a separate read, which should always be the last file listed in --readFilesIn
- 1 barcode sequence is a part of mate 1
- 2 barcode sequence is a part of mate 2
- --soloCBposition default: - string(s): position of Cell Barcode(s) on the barcode read. Presently only works with --soloType CB_UMI_Complex, and barcodes are assumed to be on Read2. Format for each barcode: startAnchor_startPosition_endAnchor_endPosition. start(end)Anchor defines the Anchor Base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end. start(end)Position is the 0-based position of the CB start(end) with respect to the Anchor Base. Strings for different barcodes are separated by spaces. Example: inDrop (Zilionis et al, Nat. Protocols, 2017): --soloCBposition 0_0_2_-1 3_1_3_8
- --soloUMIposition default: - string: position of the UMI on the barcode read, same format as soloCBposition. Example: inDrop (Zilionis et al, Nat. Protocols, 2017): --soloUMIposition 3_9_3_14
- --soloAdapterSequence default: - string: adapter sequence to anchor barcodes.
- --soloAdapterMismatchesNmax default: 1 int>0: maximum number of mismatches allowed in adapter sequence.
- --soloCBmatchWLtype default: 1MM_multi string: matching the Cell Barcodes to the WhiteList
- Exact only exact matches allowed
- 1MM only one match in whitelist with 1 mismatched base allowed. Allowed CBs have to have at least one read with exact match.
- 1MM_multi multiple matches in whitelist with 1 mismatched base allowed, posterior probability calculation is used to choose one of the matches. Allowed CBs have to have at least one read with exact match. This option matches best with CellRanger 2.2.0
- 1MM_multi_pseudocounts same as 1MM_multi, but pseudocounts of 1 are added to all whitelist barcodes.
- 1MM_multi_Nbase_pseudocounts same as 1MM_multi_pseudocounts, multimatching to WL is allowed for CBs with N-bases. This option matches best with CellRanger >= 3.0.0
- --soloInputSAMattrBarcodeSeq default: - string(s): when inputting reads from a SAM file (--readFilesType SAM SE/PE), these SAM attributes mark the barcode sequence (in proper order). For instance, for 10X CellRanger or STARsolo BAMs, use --soloInputSAMattrBarcodeSeq CR UR . This parameter is required when running STARsolo with input from SAM.
- --soloInputSAMattrBarcodeQual default: - string(s): when inputting reads from a SAM file (--readFilesType SAM SE/PE), these SAM attributes mark the barcode qualities (in proper order). For instance, for 10X CellRanger or STARsolo BAMs, use --soloInputSAMattrBarcodeQual CY UY . If this parameter is '-' (default), the quality 'H' will be assigned to all bases.
- --soloStrand default: Forward string: strandedness of the solo libraries
- Unstranded no strand information
- Forward read strand same as the original RNA molecule
- Reverse read strand opposite to the original RNA molecule
- --soloFeatures default: Gene string(s): genomic features for which the UMI counts per Cell Barcode are collected
- Gene genes: reads match the gene transcript
- SJ splice junctions: reported in SJ.out.tab
- GeneFull full genes: count all reads overlapping genes’ exons and introns
- --soloMultiMappers default: Unique string(s): counting method for reads mapping to multiple genes
- Unique count only reads that map to unique genes
- Uniform uniformly distribute multi-genic UMIs to all genes
- Rescue distribute UMIs proportionally to unique+uniform counts (first iteration of EM)
- PropUnique distribute UMIs proportionally to unique mappers, if present, and uniformly if not.
- --soloUMIdedup default: 1MM_All string(s): type of UMI deduplication (collapsing) algorithm
- 1MM_All all UMIs with 1 mismatch distance to each other are collapsed (i.e. counted once).
- 1MM_Directional_UMItools follows the "directional" method from the UMI-tools by Smith, Heger and Sudbery (Genome Research 2017).
- 1MM_Directional same as 1MM_Directional_UMItools, but with more stringent criteria for duplicate UMIs
- Exact only exactly matching UMIs are collapsed.
- NoDedup no deduplication of UMIs, count all reads.
- 1MM_CR CellRanger2-4 algorithm for 1MM UMI collapsing.
- --soloUMIfiltering default: - string(s): type of UMI filtering (for reads uniquely mapping to genes)
- - basic filtering: remove UMIs with N and homopolymers (similar to CellRanger 2.2.0).
- MultiGeneUMI basic + remove lower-count UMIs that map to more than one gene.
- MultiGeneUMI_All basic + remove all UMIs that map to more than one gene.
- MultiGeneUMI_CR basic + remove lower-count UMIs that map to more than one gene, matching CellRanger > 3.0.0. Only works with --soloUMIdedup 1MM_CR
- --soloOutFileNames default: Solo.out/ features.tsv barcodes.tsv matrix.mtx string(s): file names for STARsolo output, in this order: file_name_prefix gene_names barcode_sequences cell_feature_count_matrix
- --soloCellFilter default: CellRanger2.2 3000 0.99 10 string(s): cell filtering type and parameters
- None do not output filtered cells
- TopCells only report top cells by UMI count, followed by the exact number of cells
- CellRanger2.2 simple filtering of CellRanger 2.2. Can be followed by numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count. The hardcoded values are from CellRanger: nExpectedCells=3000; maxPercentile=0.99; maxMinRatio=10
- EmptyDrops_CR EmptyDrops filtering in CellRanger flavor. Please cite the original EmptyDrops paper: A.T.L Lun et al, Genome Biology, 20, 63 (2019): https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1662-y . Can be followed by 10 numeric parameters: nExpectedCells maxPercentile maxMinRatio indMin indMax umiMin umiMinFracMedian candMaxN FDR simN. The hardcoded values are from CellRanger: 3000 0.99 10 45000 90000 500 0.01 20000 0.01 10000
- --soloOutFormatFeaturesGeneField3 default: "Gene Expression" string(s): field 3 in the Gene features.tsv file. If "-", then no 3rd field is output.
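For the simple droplet layout, --soloCBstart, --soloCBlen, --soloUMIstart and --soloUMIlen describe 1-based slices of the barcode read. The sketch below shows that slicing with the defaults listed above (CB at bases 1-16, UMI at bases 17-26) and also checks the read length the way --soloBarcodeReadLength 1 describes. The function name and the barcode sequence are made up for illustration; this is not how STARsolo itself is implemented.

```shell
# Hypothetical helper: split a barcode read into cell barcode and UMI.
# Arguments: sequence, then optional CB start, CB length, UMI start, UMI length
# (defaults 1, 16, 17, 10 as in the parameter list above).
split_barcode_read() {
    seq=$1
    cb_start=${2:-1}
    cb_len=${3:-16}
    umi_start=${4:-17}
    umi_len=${5:-10}
    # Mirror --soloBarcodeReadLength 1: read length must equal CB + UMI length.
    if [ "${#seq}" -ne $((cb_len + umi_len)) ]; then
        echo "barcode read length ${#seq} != $((cb_len + umi_len))" >&2
        return 1
    fi
    cb=$(printf '%s\n' "$seq" | cut -c "${cb_start}-$((cb_start + cb_len - 1))")
    umi=$(printf '%s\n' "$seq" | cut -c "${umi_start}-$((umi_start + umi_len - 1))")
    printf '%s %s\n' "$cb" "$umi"
}

# Made-up 26-base barcode read.
split_barcode_read AAACCCAAGAAACACTGGGTTTACGT
# prints: AAACCCAAGAAACACT GGGTTTACGT
```

A chemistry with a 12-base UMI would be described by passing 1 16 17 12 together with a 28-base read.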
STAR run example
####### STAR mapping #######
#### generate the index ####
STAR --runMode genomeGenerate --genomeDir STAR_index/ --genomeFastaFiles STAR_index/genome.fasta --sjdbGTFfile genome.gtf --sjdbOverhang 149
#### mapping ####
STAR --runThreadN 8 --genomeDir STAR_index/ --outSAMtype BAM Unsorted SortedByCoordinate --quantMode GeneCounts --readFilesCommand zcat --readFilesIn sample_1.clean.fq.gz sample_2.clean.fq.gz --outFileNamePrefix sample_star
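When several samples follow the same naming scheme, the mapping command can be generated per sample instead of being edited by hand. The sketch below only prints the commands (a dry run) so they can be reviewed first. The function name star_map_cmd and the sample names are hypothetical, and the file naming (<sample>_1.clean.fq.gz, <sample>_2.clean.fq.gz) and index location are assumed to match the example above.

```shell
# Hypothetical helper: print the STAR mapping command for one paired-end sample.
# Arguments: sample name, optional thread count (default 8).
star_map_cmd() {
    sample=$1
    threads=${2:-8}
    echo "STAR --runThreadN $threads --genomeDir STAR_index/" \
         "--outSAMtype BAM Unsorted SortedByCoordinate" \
         "--quantMode GeneCounts --readFilesCommand zcat" \
         "--readFilesIn ${sample}_1.clean.fq.gz ${sample}_2.clean.fq.gz" \
         "--outFileNamePrefix ${sample}_star"
}

# Dry run: print one command per (made-up) sample name.
for s in sampleA sampleB; do
    star_map_cmd "$s"
done
```

Once the printed commands look right, they can be executed by piping the output to a shell, for example star_map_cmd sampleA | sh.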