I'm a PhD candidate where the motto seems to be "we're molecular biologists, we can do that!" and often on the cheap. I'll be discussing techniques, grad school and science. And maybe finding some humor in the mess.
Remember to set the Genotype Likelihoods Model (-glm) to BOTH when running Unified Genotyper.
I knew that filtering and Variant Quality Score Recalibration use different procedures for SNPs and INDELs, so I had left the INDEL step for last (bad idea). I kept finding 0 INDELs in my data set, even though I had a huge number of SNP calls. Turns out I had forgotten to set -glm to BOTH way back at the Unified Genotyper step (the default calls SNPs only), whoops. I’ll probably go back and run on INDELs only later, but for now, mystery solved.
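A quick sanity check right after genotyping would have caught this much earlier. Here is a rough sketch that tallies SNP-like and INDEL-like records in a VCF by comparing REF and ALT lengths. The function name `count_variant_types` is made up for illustration, and the classification is deliberately crude (multi-nucleotide substitutions count as SNPs, symbolic alleles count as INDELs).

```shell
# Hypothetical helper: tally SNP-like vs INDEL-like records in a VCF.
# Assumes a plain-text, tab-delimited VCF with REF in column 4 and ALT in column 5.
count_variant_types() {
  awk -F'\t' '
    /^#/ { next }                          # skip header lines
    {
      n = split($5, alts, ",")             # ALT may list several alleles
      is_indel = 0
      for (i = 1; i <= n; i++)
        if (length(alts[i]) != length($4)) is_indel = 1
      if (is_indel) indel++; else snp++
    }
    END { printf "SNP=%d INDEL=%d\n", snp, indel }
  ' "$1"
}
```

Running `count_variant_types /path/to/calls.vcf` and seeing `INDEL=0` next to a large SNP count is a hint that -glm was left at its default.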
VariantEval produces some great metrics using the CompOverlap module. I had a lot of trouble figuring out how to compare more than two VCF files (I have four conditions). CompOverlap allows several eval files to be compared against a single comparison track (I used a VCF that should represent the background genotype). The module outputs metrics such as the number of concordant loci, which I found very helpful for determining how SNPs might accumulate during the transfer process.
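For the record, here is roughly what such a call could look like. This is a sketch, not my exact command: it assumes the GATK 2.x/3.x command line (GenomeAnalysisTK.jar with `-T VariantEval`), and the helper name `varianteval_cmd`, the `background` track label, and the `compoverlap.grp` output name are all made up. The function only prints the command so you can inspect it before running it, and it does not handle paths containing spaces.

```shell
# Hypothetical helper: print (not run) a VariantEval command that compares
# several eval VCFs against one comp track using only the CompOverlap module.
# Usage: varianteval_cmd <reference.fa> <comp.vcf> <eval1.vcf> [eval2.vcf ...]
varianteval_cmd() {
  ref=$1
  comp=$2
  shift 2
  # -noEV switches off the standard modules; -EV CompOverlap turns this one on
  cmd="java -jar GenomeAnalysisTK.jar -T VariantEval -R $ref -noEV -EV CompOverlap --comp:background $comp"
  for vcf in "$@"; do
    name=$(basename "$vcf" .vcf)           # label each eval track by file name
    cmd="$cmd --eval:$name $vcf"
  done
  printf '%s -o compoverlap.grp\n' "$cmd"
}
```

With four conditions this would be called as `varianteval_cmd Reference.fa background.vcf cond1.vcf cond2.vcf cond3.vcf cond4.vcf`, giving one report with a row per eval track.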
I’m taking advantage of the new release of the Neurospora crassa genome and the new version of Cuffdiff to run my RNA-Seq data through its paces on the command line version of the Tuxedo suite. I’m going to post examples of the arguments I used in these programs here occasionally as there aren’t many examples and I may forget.
tophat -G /path/to/GTF/file -o /path/to/output/directory/ -p 2 /path/to/index/directory/index_Base_Name /path/to/fastq/file
-p is the argument for multi-threading, which vastly speeds up mapping. It can be set to any number less than or equal to the number of cores.
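If you are not sure how many cores a machine has, something like the following can cap the thread count for you. It is a minimal sketch: `pick_threads` is a made-up helper name, and it assumes `getconf _NPROCESSORS_ONLN` is available (it is on most Linux and macOS systems), falling back to 1 core if not.

```shell
# Hypothetical helper: cap a requested thread count at the number of online cores.
pick_threads() {
  requested=$1
  cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
  case "$cores" in
    ''|*[!0-9]*) cores=1 ;;                # fall back if the answer is not a number
  esac
  if [ "$requested" -gt "$cores" ]; then
    echo "$cores"
  else
    echo "$requested"
  fi
}
```

The result can then be passed straight to the -p argument, e.g. `-p "$(pick_threads 8)"`.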
cufflinks -o /path/to/output/directory -p 2 -b /path/to/Reference.fa -u -N -G /path/to/GTF/file /path/to/accepted_hits.bam
Here -b is the argument for bias correction, -u is multi-read correction, and -N is upper-quartile normalization.
In cuffmerge or cuffcompare, remember to set the -s option to specify a reference genome to get all outputs.
Funnier… when it gets left on what they chose and I’m trying to figure out who I work with really loves recent Top 40.
So you’ve decided to run snpEff with a custom reference genome, and you finally figured out how to use vi commands and build the database. You may then see something like the image below, where 0 protein-coding genes are identified.
This tells you that you need to follow the snpEff FAQ for this condition, which means that you have to treat every transcript as coding (the default). However, GATK recommends using the -onlyCoding option, which only looks at coding transcripts. If you use this option with a database that does not have protein-coding info, you do not get any SNPs annotated to exons, as below.
Removing the -onlyCoding true argument results in SNPs being detected in exons as shown below
You’ll notice that my data does not have any SNPs detected in the introns/splice sites. This is expected, as this is RNA-Seq data and I specifically limited the SNP calling to intervals in the transcript annotation file.
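One way to build such an interval list is to pull the exon records out of the GTF and write them as BED, which GATK accepts through its -L argument. This is only a sketch of the idea: the helper name `gtf_exons_to_bed` is made up, and it assumes a standard tab-delimited GTF with the feature type in column 3. The one thing to watch is the coordinate shift, since GTF starts are 1-based and BED starts are 0-based.

```shell
# Hypothetical helper: convert exon records in a GTF to BED intervals.
# GTF is 1-based inclusive; BED is 0-based half-open, so only the start shifts.
gtf_exons_to_bed() {
  awk -F'\t' -v OFS='\t' '
    $3 == "exon" { print $1, $4 - 1, $5 }
  ' "$1"
}
```

Usage would be along the lines of `gtf_exons_to_bed /path/to/GTF/file > exons.bed`, followed by `-L exons.bed` in the genotyping step.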
What’s important to note is this can be a problem with any reference/transcript file combo. I used the previous Neurospora build from Broad, and even though Broad is also responsible for GATK, the genome did not play well with a tool that is part of the GATK pipeline.
A final note about snpEff output: the genes.txt file reports values for all the genes in the transcript file, even if the value is 0 (à la Cuffdiff). So you cannot go scanning this file with your favorite gene in mind without looking carefully at the output.
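To avoid getting fooled by rows of zeros, a small filter helps. The sketch below prints only the genes with a non-zero value in a column you choose. The name `genes_with_hits` is made up, and the column number is an assumption you have to check against the header of your own file, because the genes.txt layout differs between snpEff versions.

```shell
# Hypothetical helper: list genes with a non-zero value in a given column.
# Usage: genes_with_hits <genes.txt> <column_number>
# Assumes a tab-delimited file whose header lines start with '#'
# and whose first column holds the gene name or ID.
genes_with_hits() {
  awk -F'\t' -v col="$2" '
    /^#/ { next }                          # skip header lines
    ($col + 0) > 0 { print $1 "\t" $col }  # keep rows with a real count
  ' "$1"
}
```

Piping the result through grep for your favorite gene then tells you whether it actually has hits, rather than just whether it appears in the file.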
Yep, Tophat, which I am running on the command line for the first time, can go to hell.
But on the other hand I’ve finally got a handle on GATK.
Students I taught in Intro Bio are now old enough to be sending me LinkedIn requests. Sigh.