Analysis of data from high-throughput molecular biology experiments (Ioannis)

Τετάρτη 18 Ιανουαρίου 2017

Reflecting upon the final results

In this project, we have devised a RNA-Seq differential expression pipeline to investigate whether mapping to a reference genome could be circumvented while still mainting accuracy. To make the results comparable, we deviated from Sleuth analysis and used DESeq instead for fold change calculation.
We have shown that ,generally, the different methods provided very similar outcome (significant overlap of differential expressed genes, correlated estimated abundances), especially between the mapping-free methods, mainly because of the underlying algorithmic similarity. Although all methods captured the basic biology of the experiment (overexpression of immunorelated genes), discrepancies at lower fold changes were observed but their importance could not be verified due to the absence of a benchmark dataset- although inspection of some of them revealed different isoform transcripts that could have accounted for the observed differences.
Pertaining to the choice of best method for differential expression analysis, one should consider the characteristics of the experiment -for example presence of different isoform abundances should point towards the mapping free methods. Overall, our project showed that mapping-free methods could be a robust and fast alternative to the normal pipeline, although the limited sample size and the absence of a benchmark should be considered as main limitations of our study to generalize the noticed trend.

Κυριακή 15 Ιανουαρίου 2017

Enrichment Analysis

Enrichment analysis of the top 300 differentially expressed genes from the EnrichR tool (http://amp.pharm.mssm.edu/Enrichr/) recapitulate the underlying biology of the experiment.

Sleuth code

Previously, we have derived transcript abundances from Kallisto for the samples (1, 3, 6, 13, 15, 18).
The conditions to compare are:

Medium (1,13) vs 0.2MOI (6,18)

Medium (1,13) vs polyl25 (3,15)

## Sleuth

# Specify the base directory of the results
base_dir <- "/pica/v3/g2016025/G2_proj/data"

# Retrieve the list of samples
sample_id <- dir(file.path(base_dir,"kallisto.results"))

# Construct appropriate paths
kal_dirs <- sapply(sample_id, function(id) file.path(base_dir, "kallisto.results", id))

# Create the experimental design matrix (here for the first comparison)
s2c <- read.table(file.path(base_dir, "kallisto.results/kallisto.table.txt"), header = TRUE, stringsAsFactors=FALSE)

# Add the paths as an extra column
s2c <- dplyr::mutate(s2c, path = kal_dirs)

# Load the kallisto processed data into the object
so <- sleuth_prep(s2c, ~ condition)

# Estimate parameters for the general linear model (full)
so <- sleuth_fit(so)

# Estimate parameters for the sleuth reduced model (intersection only)
so <- sleuth_fit(so, ~1, 'reduced')

# Estimate the differential expression via the Wald test (in order to extract beta values, comparable
# to fold change)
so <- sleuth_wt(so,"MOI")

# Obtain results
results_table <- sleuth_results(so, 'MOI', test_type = 'wt')

## To acquire gene-centric results

# Construct mart object
mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
dataset = "hsapiens_gene_ensembl",
host = 'ensembl.org')

t2g <- biomaRt::getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id",
"external_gene_name"), mart = mart)

t2g <- dplyr::rename(t2g, target_id = ensembl_transcript_id,
ens_gene = ensembl_gene_id, ext_gene = external_gene_name)

# Generate gene-centric sleuth model
so <- sleuth_prep(s2c, ~condition, target_mapping = t2g,aggregation_column = 'ens_gene')

#...Subsequent steps are the same as above

Sleuth explained

Whilst the derived transcript abundances from Kallisto can be aggregated to gene level (via tximport R package) and follow DESeq2 analysis, the "natural" downstream processing pipeline of Kallisto results involves Sleuth.
The reason is that Sleuth exploits Kallisto bootstraps to estimate technical noise, when this (technical noise) deviates from the Poisson distribution as is the case of ambiguous origin of the transcripts. Thus, it can efficiently perform differential expression at the transcript level, highlighting biologically important details that are obscured at the gene level (i.e splice variants).
Sleuth lies at the algorithmic framework of general linear models with shrinkage - the concept of gathering information across all transcripts to stabilize variance estimates). Briefly, the key point is that the transcript abundance (observed) is decomposed to the true abundance (unobserved) plus a parameter that accounts for the technical noise, for which bootstraps act as as accurate proxy. The separate modelling of technical and biological variance provides more robust estimates of trancript differential expression.
Programmatically, Sleuth is implemented as an R package along with a Shiny web application allowing user-friendly visualization of the fittted models, MA plots, summaries of the data as well as scatterplots-boxplots to perform diagnostics.

Kallisto code

Here is Kallisto code in Linux operating system for the Sample 1 of the project:

#!/bin/bash -l
#SBATCH -A g2016025
#SBATCH -t 2:00:00
#SBATCH -p node -n 8
#SBATCH -o my_job_out

## Step:1 Indexing transcriptome file -coding and non -coding DNA- for pseudoalignment.
#This line generates the Transcriptome-De Bruijn graph using hashing of k-mers (default value:31 #base length).

kallisto index -i transcripts.idx Homo_sapiens.GRCh38.cdna.all.fa.gz Homo_sapiens.GRCh38.ncrna.fa.gz transcripts_all.idx

## Step 2: Run main kallisto command for paired-end RNA seq data on sample 1 with default #parameters (except number of bootstraps = 100 bootstraps -to be used at a later stage for quantifying #technical noise)

# Sample 1
kallisto quant -i sequences/transcript_reference/transcripts_all.idx -o kallisto.outputS1 -b 100 S1/S01_FR_line1.fq.gz S1/S01_RR_line1.fq.gz S1/S01_FR_line2.fq.gz S1/S01_RR_line2.fq.gz S1/S01_FR_line3.fq.gz S1/S01_RR_line3.fq.gz S1/S01_FR_line4.fq.gz S1/S01_RR_line4.fq.gz S1/S01_FR_line5.fq.gz S1/S01_RR_line5.fq.gz

## The output consists of three different files :
#abundances.tsv (text file with abundance estimates)
#abundances.h5 (hdf5 file with abundance estimates along with bootstrap estimates, run #info,transcript length information length)
#run_info.json(json file with details abou the run)

Τρίτη 10 Ιανουαρίου 2017

Kallisto explained

Kallisto belongs to a new generation of RNA-seq quantification programs (along with Salmon) that can circumvent the time-consuming step of genome alignment. The main premise is the reversal of the quantification question; Instead of asking where each read could map to, we seek to find which reads could have originated from the transcript. This is done by introducing the term "pseudoalignment" that stands for mapping a read to a set of transcripts of unspecified coordinates, implemented under k-mer hashing of the transcriptome de Bruijin graph. The latter is a de Brujin graph of k-mers stemming from the transcriptome.The key point is that by spanning a path of the graph according to the read sequence, we highlight a specific transcript this read could have originated from.This concept allows skipping nodes of the graph and eliminating error-prone reads, thus leading to faster and more accurate quantification, respectively.

.

The basic algorithmic steps of Kallisto. (a) A read (black) and three overalpping transcripts (colours). (b) Transcriptome de Bruijin graph; each node represents a k-mer. Relative to the path they belong to, every k-mer acquires a k-compatibility class. (c) "Pseudoaligned" k-mers of a read are hashed to find the k-compatibility class of the read. (d) Nodes belonging to the same k-compatibility class can be skipped. (e) The key point is that the intersection of k-compatibility classes of a read corresponds to the transcript origin.

[Taken and adapted from: Nicolas L Bray, Harold Pimentel, Páll Melsted and Lior Pachter, Near-optimal probabilistic RNA-seq quantification, Nature Biotechnology 34, 525–527 (2016), doi:10.1038/nbt.3519]

Κυριακή 8 Ιανουαρίου 2017

Project Template

Our project relates to processing RNA-seq data in order to highlight differentially expressed genes (DEGs).The aim is to investigate whether new bioinformatics methods that do not require the time-consuming step of mapping to a reference genome could be a trustworthy alternative to the current RNA-seq data analysis pipeline.
The biological model used is the dendritic cell in normal medium or exposed to either H1N1 virus or poly I:C (double stranded RNA). The latter two are considered immunostimulants in that they generate an inflammatory response. The medical angle of the experiment is to consider the gene expression perturbation upon H1N1 exposure as a proxy for over-inflammatory cell response, separate the DEGs' signature from other immunostimulants (such as poly I:C) and decipher drugs to suppress it.
The raw data consist of paired end 2 x 125bp Illumina HiSeq reads, with approximately 30M reads per sample. Initially, we will use the standard tools for finding DEGs -STAR, HTSeq count and DESeq2- and subsequently the mapping-independent algorithms, Kallisto and Sailfish (Salmon).
To compare and contrast our results, we will report the number of overlapping genes across the different methods as well as the presence or not of well-known inflammatory genes that are exprected to change in the experiment.