Academic Positions

  • 2014 - Present

    University Research Associate

    Simon Fraser University, School of Computing Science

Education & Training

  • Ph.D. 2013

    Ph.D. in Computing Science

    School of Computing Science, Simon Fraser University

  • M.Sc. 2007

    Master of Science

    School of Computing Science, Simon Fraser University

  • B.Sc. 2004

    Bachelor of Science

    Computer Engineering Department, Sharif University of Technology

Honors, Awards and Grants

  • Jun 2014
    Governor General's Gold Medal, Simon Fraser University
    The Governor General’s Gold Medals are awarded to the two SFU graduate students who achieve the highest academic standing upon graduation from a master’s or doctoral degree program.
  • Jan 2014
    Ebco Eppich Research Visitor Award, Simon Fraser University
  • Jul 2012
    Ian Lawson Van Toch Memorial Award, ISMB 2012
    The Ian Lawson Van Toch Memorial Award is given to the outstanding student paper at the Annual International Conference on Intelligent Systems for Molecular Biology (ISMB).
  • Jul 2012
    Travel and Minor Research Award, Simon Fraser University
  • Jul 2011
    Best Paper Award, ISMB-HiTSeq 2011
  • May 2011
    Simon Fraser Graduate Fellowship, Simon Fraser University
  • May 2010
    Simon Fraser Graduate Fellowship, Simon Fraser University
  • May 2006
    Simon Fraser Graduate Fellowship, Simon Fraser University
  • May 2004
    Simon Fraser Entrance Scholarship, Simon Fraser University

Research Projects

  • Sequence Mapping

    Sequence Mapping is the problem of mapping very short reads to a reference genome.

    High-throughput sequencing (HTS) has become an invaluable technology for many applications, e.g., the detection of single-nucleotide polymorphisms and structural variations. In most of these applications, mapping sequenced "reads" to their potential genomic origin is the first fundamental step for subsequent analyses. Many tools have been developed to address this problem, and because of the sheer volume of available HTS data, much emphasis has been placed on speed and memory.

    We introduced two novel methods, mrsFAST and drFAST, to map HTS short reads to the reference genome. These methods are cache oblivious and guarantee perfect sensitivity. Both are specifically designed to address the bottleneck of multi-mapping for the purpose of structural variation detection.
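
    For intuition, the sketch below shows the hash-index, seed-and-extend strategy that such full-sensitivity mappers build on; it is not the actual mrsFAST/drFAST implementation, and all names and parameters are illustrative.

```python
# Toy seed-and-extend mapper in the spirit of full-sensitivity tools
# like mrsFAST (NOT the actual implementation; names and parameters
# here are illustrative).
from collections import defaultdict

def build_index(ref, k):
    """Hash every k-mer of the reference to its positions."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def map_read(read, ref, index, k, max_mm):
    """Report ALL loci where the read aligns with <= max_mm mismatches.

    Pigeonhole argument: if the read is split into max_mm + 1 seeds,
    any alignment with <= max_mm mismatches leaves at least one seed
    error-free, so extending every exact seed hit loses nothing.
    """
    hits = set()
    for s in range(0, len(read) - k + 1, k):        # non-overlapping seeds
        for pos in index.get(read[s:s + k], []):
            start = pos - s
            if 0 <= start <= len(ref) - len(read) and \
               hamming(read, ref[start:start + len(read)]) <= max_mm:
                hits.add(start)
    return sorted(hits)

ref = "ACGTACGTTAGCACGTACGTTAGC"
idx = build_index(ref, k=4)
print(map_read("ACGTTAGC", ref, idx, k=4, max_mm=1))   # multi-maps: [4, 16]
```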

  • Sequence Compression

    Sequence Compression is about compressing biological data in raw and processed form.

    As HTS data grow in size, data management and storage are becoming major logistical obstacles to adopting HTS platforms. The requirement for ever-increasing monetary investment almost signalled the end of the Sequence Read Archive hosted at the National Center for Biotechnology Information, which holds most of the sequence data generated worldwide. One way to address the storage requirements of HTS data is compression. Currently, most HTS data are compressed through general-purpose algorithms such as gzip, which are not specifically designed for data generated by HTS platforms. Recently, a number of fast and efficient compression algorithms have been designed specifically for HTS data to address some of these issues in data management, storage and communication.

    We address the storage and communication problems in HTS data by introducing SCALCE, a "boosting" scheme based on the Locally Consistent Parsing technique. SCALCE reorders the data to increase locality of reference and thereby improve the performance of well-known compression methods in terms of both speed and space.
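
    The reordering mechanism can be pictured with the crude stand-in below; SCALCE's actual core selection via Locally Consistent Parsing is substantially more involved.

```python
# Mechanism sketch: bucket reads by a "core" substring so that reads
# sharing long substrings become neighbors in the output stream.
# SCALCE derives cores via Locally Consistent Parsing; the
# lexicographically smallest window below is only a crude stand-in.
from collections import defaultdict

def core(read, w=8):
    return min(read[i:i + w] for i in range(len(read) - w + 1))

def reorder(reads):
    buckets = defaultdict(list)
    for r in reads:
        buckets[core(r)].append(r)
    return [r for c in sorted(buckets) for r in sorted(buckets[c])]

reads = ["GGACGTTAGCAGGATT", "CCCCTTTTGGGGAAAA", "TTACGTTAGCAGGAAC"]
for r in reorder(reads):
    print(core(r), r)   # the two reads sharing ACGTTAGC end up adjacent
```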

  • Transcriptomic Structural Variation Discovery

    Discovery of structural variation using RNA-Seq.

    Computational identification of genomic structural variants via high-throughput sequencing is an important problem for which a number of highly sophisticated solutions have been recently developed. With the advent of high-throughput transcriptome sequencing (RNA-Seq), the problem of identifying structural alterations in the transcriptome is now attracting significant attention.

    We introduce two novel algorithmic formulations for identifying transcriptomic structural variants through aligning transcripts to the reference genome under the consideration of such variation. The first formulation is based on a nucleotide-level alignment model; a second, potentially faster formulation is based on chaining fragments shared between each transcript and the reference genome. Based on these formulations, we introduce a novel transcriptome-to-genome alignment tool, Dissect (DIScovery of Structural Alteration Event Containing Transcripts), which can identify and characterize transcriptomic events such as duplications, inversions, rearrangements and fusions. Dissect is suitable for whole transcriptome structural variation discovery problems involving sufficiently long reads or accurately assembled contigs.
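
    The fragment-based formulation has the flavor of the classic collinear-chaining dynamic program sketched below; Dissect's real model additionally scores structural-alteration events such as inversions and duplications, which this toy ignores.

```python
# Toy collinear chaining: given exact fragments shared between a
# transcript and the genome (transcript_start, genome_start, length),
# pick the highest-scoring chain increasing in both coordinates.
frags = [(0, 100, 20), (25, 130, 30), (20, 500, 15), (60, 170, 25)]

def chain(frags):
    frags = sorted(frags)                      # by transcript coordinate
    best = [f[2] for f in frags]               # best chain score ending at i
    prev = [-1] * len(frags)
    for i, (ti, gi, li) in enumerate(frags):
        for j, (tj, gj, lj) in enumerate(frags[:i]):
            # j must end before i starts on BOTH sequences (collinearity)
            if tj + lj <= ti and gj + lj <= gi and best[j] + li > best[i]:
                best[i], prev[i] = best[j] + li, j
    i = max(range(len(frags)), key=best.__getitem__)
    out = []
    while i != -1:
        out.append(frags[i])
        i = prev[i]
    return out[::-1]

print(chain(frags))   # drops the off-diagonal fragment (20, 500, 15)
```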

Publications

SiNVICT: ultra-sensitive detection of single nucleotide variants and indels in circulating tumour DNA.

Can Kockan, Faraz Hach, Iman Sarrafi, Robert H. Bell, Brian McConeghy, Kevin Beja, Anne Haegert, Alexander W. Wyatt, Stanislav V. Volik, Kim N. Chi, Colin C. Collins, S. Cenk Sahinalp
Journal Papers Bioinformatics. 2017 Jan 1;33(1):26-34

Abstract

Motivation: Successful development and application of precision oncology approaches require robust elucidation of the genomic landscape of a patient’s cancer and, ideally, the ability to monitor therapy-induced genomic changes in the tumour in an inexpensive and minimally invasive manner. Thanks to recent advances in sequencing technologies, ‘liquid biopsy’, the sampling of a patient’s bodily fluids such as blood and urine, is considered one of the most promising approaches to achieve this goal. In many cancer patients, and especially those with advanced metastatic disease, deep sequencing of circulating cell free DNA (cfDNA) obtained from the patient’s blood yields a mixture of reads originating from the normal DNA and from multiple tumour subclones, called circulating tumour DNA or ctDNA. The ctDNA/cfDNA ratio, as well as the proportion of ctDNA originating from specific tumour subclones, depends on multiple factors, making comprehensive detection of mutations difficult, especially at early stages of cancer. Furthermore, sensitive and accurate detection of single nucleotide variants (SNVs) and indels from cfDNA is constrained by several factors, such as sequencing errors, PCR artifacts and mapping errors related to repeat regions within the genome. In this article, we introduce SiNVICT, a computational method that increases the sensitivity and specificity of SNV and indel detection at very low variant allele frequencies. SiNVICT has the capability to handle multiple sequencing platforms with different error properties; it minimizes false positives resulting from mapping errors and other technology-specific artifacts, including strand bias and low base quality at read ends. SiNVICT also has the capability to perform time-series analysis, where samples from a patient sequenced at multiple time points are jointly examined to report locations of interest where certain clones may have been wiped out by a treatment while other subclones gained a selective advantage.
Results: We tested SiNVICT on simulated data as well as prostate cancer cell lines and cfDNA obtained from castration-resistant prostate cancer patients. On both simulated and biological data, SiNVICT was able to detect SNVs and indels with variant allele percentages as low as 0.5%. The lowest amounts of total DNA for which SNVs and indels could be detected with very high sensitivity in the biological data were 2.5 ng on the Ion Torrent platform and 10 ng on Illumina. With increased sequencing and mapping accuracy, SiNVICT might be utilized in clinical settings, making it possible to track the progress of point mutations and indels associated with resistance to cancer therapies and provide patients with personalized treatment. We also compared SiNVICT with other popular SNV callers such as MuTect, VarScan2 and FreeBayes. Our results show that SiNVICT performs better than these tools in most cases and allows further data exploration, such as time-series analysis on cfDNA sequencing data.
Availability and Implementation: SiNVICT is available at: https://sfu-compbio.github.io/sinvict
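
To see why deep cfDNA coverage makes a 0.5% variant allele fraction callable at all, consider the toy per-site significance test below; SiNVICT's actual statistical model and filters are considerably richer, and the error rate and threshold here are invented for illustration.

```python
# Toy per-site test: is the alt-allele count implausible under the
# sequencing error rate alone?  (SiNVICT's real model couples a Poisson
# model with strand-bias/read-end filters and time-series analysis.)
from math import exp, lgamma, log

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed from the tail in log space."""
    logpmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
              + k * log(p) + (n - k) * log(1 - p))
    pmf, total, i = exp(logpmf), 0.0, k
    while i <= n and (total == 0.0 or pmf > total * 1e-16):
        total += pmf
        i += 1
        pmf *= (n - i + 1) / i * p / (1 - p)
    return total

def call_site(alt, depth, error_rate=1e-3, alpha=1e-6):
    return binom_sf(alt, depth, error_rate) < alpha

# 0.5% VAF at 10,000x: ~50 alt reads vs ~10 expected from errors alone
print(call_site(alt=50, depth=10_000))   # True: far above error noise
print(call_site(alt=12, depth=10_000))   # False: within error noise
```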

Comparison of high-throughput sequencing data compression tools

Ibrahim Numanagic, James K Bonfield, Faraz Hach, Jan Voges, Jorn Ostermann, Claudio Alberti, Marco Mattavelli, S. Cenk Sahinalp
Journal Papers Nat Methods. 2016 Dec;13(12):1005-1008

Abstract

High-throughput sequencing (HTS) data are commonly stored as raw sequencing reads in FASTQ format or as reads mapped to a reference, in SAM format, both with large memory footprints. Worldwide growth of HTS data has prompted the development of compression methods that aim to significantly reduce HTS data size. Here we report on a benchmarking study of available compression methods on a comprehensive set of HTS data using an automated framework.
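
A minimal sketch of what such a benchmarking harness measures, compressed size and wall-clock time per codec, is shown below; the published framework drives specialized HTS compressors on real FASTQ/SAM corpora, whereas this toy exercises only Python's stdlib codecs on synthetic reads.

```python
# Harness shape only: measure compressed size and wall-clock time for
# several codecs on one payload.
import bz2, gzip, lzma, random, time

random.seed(1)
payload = "\n".join("".join(random.choice("ACGT") for _ in range(100))
                    for _ in range(10_000)).encode()

for name, compress in (("gzip", gzip.compress),
                       ("bzip2", bz2.compress),
                       ("xz", lzma.compress)):
    t0 = time.perf_counter()
    out = compress(payload)
    print(f"{name:5s}  ratio={len(out) / len(payload):.3f}  "
          f"time={time.perf_counter() - t0:.2f}s")
```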

CoLoRMap: Correcting Long Reads by Mapping short reads.

Ehsan Haghshenas, Faraz Hach, S. Cenk Sahinalp, Cedric Chauve
Journal Papers Bioinformatics. 2016 Sep 1;32(17):i545-i551
Presentation ECCB’16 (15th European Conference on Computational Biology), The Hague, Netherlands, September 3-7 2016

Abstract

Motivation: Second generation sequencing technologies paved the way to an exceptional increase in the number of sequenced genomes, both prokaryotic and eukaryotic. However, short reads are difficult to assemble and often lead to highly fragmented assemblies. Recent developments in long-read sequencing methods offer a promising way to address this issue. However, so far long reads are characterized by a high error rate, and assembling from long reads requires a high depth of coverage. This motivates the development of hybrid approaches that leverage the high quality of short reads to correct errors in long reads.
Results: We introduce CoLoRMap, a hybrid method for correcting noisy long reads, such as the ones produced by PacBio sequencing technology, using high-quality Illumina paired-end reads mapped onto the long reads. Our algorithm is based on two novel ideas: using a classical shortest path algorithm to find a sequence of overlapping short reads that minimizes the edit score to a long read and extending corrected regions by local assembly of unmapped mates of mapped short reads. Our results on bacterial, fungal and insect data sets show that CoLoRMap compares well with existing hybrid correction methods.
Availability and Implementation: The source code of CoLoRMap is freely available for non-commercial use at https://github.com/sfu-compbio/colormap
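
The shortest-path idea can be rendered as the toy dynamic program below; CoLoRMap's actual graph construction, scoring and local-assembly step are not reproduced, and all sequences are invented.

```python
# Toy shortest-path correction: among short reads mapped onto a long
# read, pick an overlapping chain whose total edit score against the
# long read is minimal.
def edit(a, b):
    """Plain Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def best_cover(long_read, mapped):
    """mapped: (start_on_long_read, short_read_sequence) pairs."""
    mapped = sorted(mapped)
    cost = [edit(r, long_read[s:s + len(r)]) for s, r in mapped]
    best = [c if s == 0 else float("inf") for (s, r), c in zip(mapped, cost)]
    for i, (si, ri) in enumerate(mapped):
        for j, (sj, rj) in enumerate(mapped[:i]):
            if sj < si <= sj + len(rj):            # j overlaps i: graph edge
                best[i] = min(best[i], best[j] + cost[i])
    return min(best[i] for i, (s, r) in enumerate(mapped)
               if s + len(r) >= len(long_read))    # chain must reach the end

long_read = "ACGTACGTTAGCAGGATTACA"
reads = [(0, "ACGTACGT"), (5, "CGTTAGCA"), (12, "AGGATTACA"), (4, "AXGTTAGC")]
print(best_cover(long_read, reads))   # 0: an error-free chain exists
```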

Robustness of Massively Parallel Sequencing Platforms

Pınar Kavak, Bayram Yuksel, Soner Aksu, M. Oguzhan Kulekci, Tunga Gungor, Faraz Hach, S. Cenk Sahinalp, Turkish Human Genome Project, Can Alkan, Mahmut Şamil Sagıroglu
Journal Papers PLoS One. 2015 Sep 18;10(9):e0138259

Abstract

The improvements in high throughput sequencing technologies (HTS) made clinical sequencing projects such as ClinSeq and Genomics England feasible. Although there are significant improvements in the accuracy and reproducibility of HTS-based analyses, the usability of these types of data for diagnostic and prognostic applications necessitates near-perfect data generation. To assess the usability of a widely used HTS platform for accurate and reproducible clinical applications in terms of robustness, we generated whole genome shotgun (WGS) sequence data from the genomes of two human individuals in two different genome sequencing centers. After analyzing the data to characterize SNPs and indels using the same tools (BWA, SAMtools, and GATK), we observed a significant number of discrepancies in the call sets. As expected, most of the disagreements between the call sets were found within genomic regions containing common repeats and segmental duplications, albeit only a small fraction of the discordant variants were within exons and other functionally relevant regions such as promoters. We conclude that although HTS platforms are sufficiently powerful to provide data for first-pass clinical tests, the variant predictions still need to be confirmed using orthogonal methods before use in clinical applications.
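
The discrepancy analysis itself reduces to set arithmetic over normalized variant records, as in the sketch below (coordinates and alleles are made up; real comparisons first normalize VCFs, e.g. with bcftools).

```python
# Toy call-set concordance between two sequencing centers.
calls_a = {("chr1", 1204567, "A", "G"), ("chr2", 88001, "T", "C"),
           ("chr7", 5500123, "G", "GA")}
calls_b = {("chr1", 1204567, "A", "G"), ("chr7", 5500123, "G", "GA"),
           ("chrX", 101, "C", "T")}

shared = calls_a & calls_b
print("shared:", len(shared),
      "only A:", len(calls_a - calls_b),
      "only B:", len(calls_b - calls_a),
      "Jaccard concordance: %.2f" % (len(shared) / len(calls_a | calls_b)))
```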

Spatial genomic heterogeneity within localized, multifocal prostate cancer

Paul C Boutros, Michael Fraser, Nicholas J Harding, Richard de Borja, Dominique Trudel, Emilie Lalonde, Alice Meng, Pablo H Hennings-Yeomans, Andrew McPherson, Veronica Y Sabelnykova, Amin Zia, Natalie S Fox, Julie Livingstone, Yu-Jia Shiah, Jianxin Wang, Timothy A Beck, Cherry L Have, Taryne Chong, Michelle Sam, Jeremy Johns, Lee Timms, Nicholas Buchner, Ada Wong, John D Watson, Trent T Simmons, Christine P'ng, Gaetano Zafarana, Francis Nguyen, Xuemei Luo, Kenneth C Chu, Stephenie D Prokopec, Jenna Sykes, Alan Dal Pra, Alejandro Berlin, Andrew Brown, Michelle A Chan-Seng-Yue, Fouad Yousif, Robert E Denroche, Lauren C Chong, Gregory M Chen, Esther Jung, Clement Fung, Maud H W Starmans, Hanbo Chen, Shaylan K Govind, James Hawley, Alister D'Costa, Melania Pintilie, Daryl Waggott, Faraz Hach, Philippe Lambin, Lakshmi B Muthuswamy, Colin Cooper, Rosalind Eeles, David Neal, Bernard Tetu, Cenk Sahinalp, Lincoln D Stein, Neil Fleshner, Sohrab P Shah, Colin C Collins, Thomas J Hudson, John D McPherson, Theodorus van der Kwast and Robert G Bristow
Journal Papers Nat Genet. 2015 Jul;47(7):736-45

Abstract

Herein we provide a detailed molecular analysis of the spatial heterogeneity of clinically localized, multifocal prostate cancer to delineate new oncogenes or tumor suppressors. We initially determined the copy number aberration (CNA) profiles of 74 patients with index tumors of Gleason score 7. Of these, 5 patients were subjected to whole-genome sequencing using DNA quantities achievable in diagnostic biopsies, with detailed spatial sampling of 23 distinct tumor regions to assess intraprostatic heterogeneity in focal genomics. Multifocal tumors are highly heterogeneous for single-nucleotide variants (SNVs), CNAs and genomic rearrangements. We identified and validated a new recurrent amplification of MYCL, which is associated with TP53 deletion and unique profiles of DNA damage and transcriptional dysregulation. Moreover, we demonstrate divergent tumor evolution in multifocal cancer and, in some cases, tumors of independent clonal origin. These data represent the first systematic relation of intraprostatic genomic heterogeneity to predicted clinical outcome and inform the development of novel biomarkers that reflect individual prognosis.

Fast and accurate mapping of Complete Genomics reads

Donghyuk Lee, Farhad Hormozdiari, Hongyi Xin, Faraz Hach, Onur Mutlu and Can Alkan
Journal Papers Methods. 2015 Jun;79-80:3-10

Abstract

Many recent advances in genomics and the expectations of personalized medicine are made possible thanks to the power of high throughput sequencing (HTS) in sequencing large collections of human genomes. There are tens of different sequencing technologies currently available, and each HTS platform has different strengths and biases. This diversity makes it possible to use different technologies to correct for each other's shortcomings, but it also requires the development of different algorithms for each platform due to the differences in data types and error models. The first problem to tackle in analyzing HTS data for resequencing applications is the read mapping stage; many tools have been developed for the most popular HTS methods, but publicly available and open source aligners are still lacking for the Complete Genomics (CG) platform. Unfortunately, Burrows-Wheeler based methods are not practical for CG data due to the gapped nature of the reads generated by this method. Here we provide a sensitive read mapper (sirFAST) for the CG technology based on the seed-and-extend paradigm that can quickly map CG reads to a reference genome. We evaluate the performance and accuracy of sirFAST using both simulated and publicly available real data sets, showing high precision and recall rates.

DeeZ: reference-based compression by local assembly

Faraz Hach, Ibrahim Numanagic and S Cenk Sahinalp
Journal Papers Nature Methods. 2014 Oct 30;11(11):1082-4
Presentation RECOMB-Seq’15 Highlight Talk (5th Annual RECOMB Satellite Workshop on Massively Parallel Sequencing), Warsaw, Poland, April 10-11 2015

mrsFAST-Ultra: a compact, SNP-aware mapper for high performance sequencing applications

Faraz Hach, Iman Sarrafi, Farhad Hormozdiari, Can Alkan, Evan E. Eichler and S. Cenk Sahinalp.
Journal Papers Nucl. Acids Res. (1 July 2014) 42 (W1): W494-W500.

Abstract

High throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for processing and downstream analysis. While tools that report the ‘best’ mapping location of each read provide a fast way to process HTS data, they are not suitable for many types of downstream analysis such as structural variation detection, where it is important to report multiple mapping loci for each read. For this purpose, we introduce mrsFAST-Ultra, a fast, cache oblivious, SNP-aware aligner that can handle the multi-mapping of HTS reads very efficiently. mrsFAST-Ultra improves mrsFAST, our first cache oblivious read aligner capable of handling multi-mapping reads, through new and compact index structures that reduce not only the overall memory usage but also the number of CPU operations per alignment. In fact, the size of the index generated by mrsFAST-Ultra is 10 times smaller than that of mrsFAST. As importantly, mrsFAST-Ultra introduces new features such as being able to (i) obtain the best mapping loci for each read, and (ii) return all reads that have at most n mapping loci (within an error threshold), together with these loci, for any user-specified n. Furthermore, mrsFAST-Ultra is SNP-aware, i.e. it can map reads to the reference genome while discounting the mismatches that occur at common SNP locations provided by dbSNP; this significantly increases the number of reads that can be mapped to the reference genome. Notice that all of the above features are implemented within the index structure rather than as post-processing steps, and are thus performed highly efficiently. Finally, mrsFAST-Ultra utilizes multiple available cores and processors and can be tuned for various memory settings. Our results show that mrsFAST-Ultra is roughly five times faster than its predecessor mrsFAST. In comparison to newly enhanced popular tools such as Bowtie2, it is more sensitive (it can report 10 times or more mappings per read) and much faster (six times or more) in the multi-mapping mode. Furthermore, mrsFAST-Ultra has an index size of 2GB for the entire human reference genome, which is roughly half of that of Bowtie2. mrsFAST-Ultra is open source and it can be accessed at http://mrsfast.sourceforge.net.
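
The SNP-aware counting can be pictured as the plain post-check below; mrsFAST-Ultra implements it inside the index rather than as a post-processing step, and the positions here are invented.

```python
# SNP-aware mismatch counting, shown as a plain post-check: mismatches
# on known common SNP sites are not charged.
def mismatches(read, ref, start, snp_sites):
    """snp_sites: reference positions carrying known common SNPs."""
    return sum(1 for i, base in enumerate(read)
               if base != ref[start + i] and (start + i) not in snp_sites)

ref = "ACGTACGTTAGC"
print(mismatches("ACGTGCGT", ref, 0, snp_sites={4}))   # 0: site 4 forgiven
print(mismatches("ACGTGCGT", ref, 0, snp_sites=set())) # 1 without SNP awareness
```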

Optimal resolution of ambiguous RNA-Seq multimappings in the presence of novel isoforms

Phuong Dao, Ibrahim Numanagic, Yen-Yi Lin, Faraz Hach, Emre Karakoc, Nilgun Donmez, Colin Collins, Evan E. Eichler and S. Cenk Sahinalp
Journal Papers Bioinformatics (2014) 30 (5): 644-651.

Abstract

Motivation: RNA-Seq technology promises to uncover many novel alternative splicing events, gene fusions and other variations in RNA transcripts. For accurate detection and quantification of transcripts, it is important to resolve the mapping ambiguity for those RNA-Seq reads that can be mapped to multiple loci: >17% of the reads from mouse RNA-Seq data and 50% of the reads from some plant RNA-Seq data have multiple mapping loci.

In this study, we show how to resolve the mapping ambiguity in the presence of novel transcriptomic events such as exon skipping and novel indels towards accurate downstream analysis. We introduce ORMAN (Optimal Resolution of Multimapping Ambiguity of RNA-Seq Reads), which aims to compute the minimum number of potential transcript products for each gene and to assign each multimapping read to one of these transcripts based on the estimated distribution of the region covering the read. ORMAN achieves this objective through a combinatorial optimization formulation, which is solved through well-known approximation algorithms, integer linear programs and heuristics.

Results: On a simulated RNA-Seq dataset including a random subset of transcripts from the UCSC database, the performance of several state-of-the-art methods for identifying and quantifying novel transcripts, such as Cufflinks, IsoLasso and CLIIQ, is significantly improved through the use of ORMAN. Furthermore, in an experiment using real RNA-Seq reads, we show that ORMAN is able to resolve multimapping to produce coverage values that are similar to the original distribution, even in genes with highly non-uniform coverage.

Availability: ORMAN is available at http://orman.sourceforge.net
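
For intuition about the parsimony objective, the sketch below runs the textbook greedy set cover; ORMAN's real formulation is an ILP with approximation algorithms and heuristics, its final assignment also uses local coverage distributions, and all names here are invented.

```python
# Greedy set cover as intuition: pick a small transcript set explaining
# every multimapped read, then assign each read to one chosen
# transcript.  Ties are broken arbitrarily.
read_maps = {"r1": {"T1", "T2"}, "r2": {"T1"}, "r3": {"T2", "T3"},
             "r4": {"T1", "T3"}, "r5": {"T3"}}

transcripts = set().union(*read_maps.values())
uncovered, chosen = set(read_maps), []
while uncovered:
    t = max(transcripts, key=lambda t: sum(t in read_maps[r] for r in uncovered))
    chosen.append(t)
    uncovered = {r for r in uncovered if t not in read_maps[r]}

assignment = {r: next(t for t in chosen if t in ts) for r, ts in read_maps.items()}
print(chosen, assignment)
```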

Scalable mapping and compression of high throughput genome sequencing data

Faraz Hach
Thesis PhD Thesis, 2013.

Abstract

High throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for processing, downstream analysis and computational infrastructure. HTS has become an invaluable technology for many applications, e.g., the detection of single-nucleotide polymorphisms and structural variations. In most of these applications, mapping sequenced "reads" to their potential genomic origin is the first fundamental step for subsequent analyses. Many tools have been developed to address this problem, and because of the sheer volume of available HTS data, much emphasis has been placed on speed and memory. Moreover, as HTS data grow in size, data management and storage are becoming major logistical obstacles to adopting HTS platforms. The requirement for ever-increasing monetary investment almost signalled the end of the Sequence Read Archive hosted at the National Center for Biotechnology Information, which holds most of the sequence data generated worldwide. One way to address the storage requirements of HTS data is compression. Currently, most HTS data are compressed through general-purpose algorithms such as gzip, which are not specifically designed for compressing data generated by HTS platforms. Recently, a number of fast and efficient compression algorithms have been designed specifically for HTS data to address some of the issues in data management, storage and communication. In this thesis, we study both of these computational problems, i.e., sequence mapping and sequence compression, extensively. We introduce two novel methods, mrsFAST and drFAST, to map HTS short reads to the reference genome. These methods are cache oblivious and guarantee perfect sensitivity. Both are specifically designed to address the bottleneck of multi-mapping for the purpose of structural variation detection. In addition, we present Dissect for mapping the whole transcriptome to the genome while considering structural alterations in the transcriptome. Dissect is designed specifically to map HTS long reads as well as assembled contigs. Finally, we address the storage and communication problems in HTS data by introducing SCALCE, a "boosting" scheme based on the Locally Consistent Parsing technique. SCALCE reorders the data to increase locality of reference and subsequently improve the performance of well-known compression methods in terms of speed and space.

SCALCE: boosting sequence compression algorithms using locally consistent encoding

Faraz Hach, Ibrahim Numanagic, Can Alkan and S Cenk Sahinalp
Journal Papers Bioinformatics (2012) 28 (23): 3051-3057.
Presentation RECOMB-Seq’12 (2nd Annual RECOMB Satellite Workshop on Massively Parallel Sequencing), Barcelona, Spain, April 19-20 2012

Abstract

Motivation: The high throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for the computational infrastructure. Data management, storage and analysis have become major logistical obstacles for those adopting the new platforms. The requirement for large investment for this purpose almost signalled the end of the Sequence Read Archive hosted at the National Center for Biotechnology Information (NCBI), which holds most of the sequence data generated worldwide. Currently, most HTS data are compressed through general purpose algorithms such as gzip. These algorithms are not designed for compressing data generated by the HTS platforms; for example, they do not take advantage of the specific nature of genomic sequence data, that is, limited alphabet size and high similarity among reads. Fast and efficient compression algorithms designed specifically for HTS data should be able to address some of the issues in data management, storage and communication. Such algorithms would also help with analysis provided they offer additional capabilities such as random access to any read and indexing for efficient sequence similarity search. Here we present SCALCE, a ‘boosting’ scheme based on the Locally Consistent Parsing technique, which reorganizes the reads in a way that results in a higher compression speed and compression rate, independent of the compression algorithm in use and without using a reference genome.

Results: Our tests indicate that SCALCE can improve the compression rate achieved through gzip by a factor of 4.19 when the goal is to compress the reads alone. In fact, on SCALCE reordered reads, gzip running time can improve by a factor of 15.06 on a standard PC with a single core and 6 GB memory. Interestingly, even the running time of SCALCE + gzip improves that of gzip alone by a factor of 2.09. When compared with the recently published BEETL, which aims to sort the (inverted) reads in lexicographic order for improving bzip2, SCALCE + gzip provides up to 2.01 times better compression while improving the running time by a factor of 5.17. SCALCE also provides the option to compress the quality scores as well as the read names, in addition to the reads themselves. This is achieved by compressing the quality scores through order-3 Arithmetic Coding (AC) and the read names through gzip, using the reordering SCALCE provides on the reads. This way, in comparison with gzip compression of the unordered FASTQ files (including reads, read names and quality scores), SCALCE (together with gzip and arithmetic encoding) can provide up to 3.34 times improvement in the compression rate and 1.26 times improvement in running time.

Availability: Our algorithm, SCALCE (Sequence Compression Algorithm using Locally Consistent Encoding), is implemented in C++ with both gzip and bzip2 compression options. It also supports multithreading when the gzip option is selected and the pigz binary is available. It is available at http://scalce.sourceforge.net.
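
The boosting effect is easy to reproduce in miniature, as below: once the input exceeds the compressor's match window, placing similar reads together measurably shrinks gzip output. A plain lexicographic sort stands in for SCALCE's reordering; the published 4.19x and 15.06x figures come from the real pipeline, not from this toy.

```python
# Miniature boosting demo: gzip on arrival-order vs reordered reads.
import gzip, random

random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(20_000))
reads = [genome[p:p + 100] for p in random.choices(range(19_900), k=5_000)]

def gz(rs):
    return len(gzip.compress("\n".join(rs).encode()))

print("arrival order:", gz(reads))
print("reordered    :", gz(sorted(reads)))   # crude locality boost
```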

CLIIQ: Accurate Comparative Detection and Quantification of Expressed Isoforms in a Population

Yen-Yi Lin, Phuong Dao, Faraz Hach, Marzieh Bakhshi, Fan Mo, Anna Lapuk, Colin Collins and S. Cenk Sahinalp
Conference Papers Proc. of WABI 2012: Lecture Notes in Computer Science Volume 7534, 2012, pp 178-189
Presentation WABI’12 (12th Workshop on Algorithms in Bioinformatics), September 10-12, Ljubljana, Slovenia

Abstract

The recently developed RNA-Seq technology provides a high-throughput and reasonably accurate way to analyze the transcriptomic landscape of a tissue. Unfortunately, from a computational perspective, identification and quantification of a gene’s isoforms from RNA-Seq data remains a non-trivial problem. We propose CLIIQ, a novel computational method for identification and quantification of expressed isoforms from multiple samples in a population. Motivated by ideas from the compressed sensing literature, CLIIQ is based on an integer linear programming formulation for identifying and quantifying ‘the most parsimonious’ set of isoforms. We show through simulations that, on a single sample, CLIIQ provides better results in isoform identification and quantification than alternative popular tools. More importantly, CLIIQ has an option to jointly analyze multiple samples, which significantly outperforms other tools in both isoform identification and quantification.
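
The ‘most parsimonious’ objective can be illustrated by the exhaustive toy search below for the smallest isoform set explaining all observed junctions; CLIIQ's real ILP additionally quantifies abundances and couples multiple samples, and all names here are invented.

```python
# Exhaustive toy parsimony: smallest candidate subset whose junctions
# explain everything observed.
from itertools import combinations

candidates = {"I1": {"e1-e2", "e2-e3"},
              "I2": {"e1-e3"},
              "I3": {"e1-e2", "e2-e4"}}
observed = {"e1-e2", "e2-e3", "e2-e4"}

def most_parsimonious(candidates, observed):
    for k in range(1, len(candidates) + 1):
        for combo in combinations(candidates, k):
            if set().union(*(candidates[i] for i in combo)) >= observed:
                return set(combo)

print(most_parsimonious(candidates, observed))   # {'I1', 'I3'}
```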

Dissect: detection and characterization of novel structural alterations in transcribed sequences

Deniz Yorukoglu, Faraz Hach, Lucas Swanson, Colin C. Collins, Inanc Birol and S. Cenk Sahinalp
Journal Papers Bioinformatics (2012) 28 (12): i179-i187
Presentation ISMB’12 (20th Annual International Conference on Intelligent Systems for Molecular Biology), Long Beach, USA, July 19-23 2012
Award Won Ian Lawson Van Toch Memorial Award for Outstanding Student Paper

Abstract

Motivation: Computational identification of genomic structural variants via high-throughput sequencing is an important problem for which a number of highly sophisticated solutions have been recently developed. With the advent of high-throughput transcriptome sequencing (RNA-Seq), the problem of identifying structural alterations in the transcriptome is now attracting significant attention.

In this article, we introduce two novel algorithmic formulations for identifying transcriptomic structural variants through aligning transcripts to the reference genome under the consideration of such variation. The first formulation is based on a nucleotide-level alignment model; a second, potentially faster formulation is based on chaining fragments shared between each transcript and the reference genome. Based on these formulations, we introduce a novel transcriptome-to-genome alignment tool, Dissect (DIScovery of Structural Alteration Event Containing Transcripts), which can identify and characterize transcriptomic events such as duplications, inversions, rearrangements and fusions. Dissect is suitable for whole transcriptome structural variation discovery problems involving sufficiently long reads or accurately assembled contigs.

Results: We tested Dissect on simulated transcripts altered via structural events, as well as assembled RNA-Seq contigs from human prostate cancer cell line C4-2. Our results indicate that Dissect has high sensitivity and specificity in identifying structural alteration events in simulated transcripts as well as uncovering novel structural alterations in cancer transcriptomes.

Availability: Dissect is available for public use at: http://dissect-trans.sourceforge.net.

From sequence to molecular pathology, and a mechanism driving the neuroendocrine phenotype in prostate cancer

Anna V Lapuk, Chunxiao Wu, Alexander W Wyatt, Andrew McPherson, Brian J McConeghy, Sonal Brahmbhatt, Fan Mo, Amina Zoubeidi, Shawn Anderson, Robert H Bell, Anne Haegert, Robert Shukin, Yuzhuo Wang, Ladan Fazli, Antonio Hurtado-Coll, Edward C Jones, Faraz Hach, Fereydoun Hormozdiari, Iman Hajirasouliha, Paul C Boutros, Robert G Bristow, Yongjun Zhao, Marco A Marra, Andrea Fanjul, Christopher A Maher, Arul M Chinnaiyan, Mark A Rubin, Himisha Beltran, S Cenk Sahinalp, Martin E Gleave, Stanislav V Volik and Colin C Collins
Journal Papers J Pathol. 2012 Jul;227(3):286-97

Abstract

The current paradigm of cancer care relies on predictive nomograms which integrate detailed histopathology with clinical data. However, when predictions fail, the consequences for patients are often catastrophic, especially in prostate cancer where nomograms influence the decision to therapeutically intervene. We hypothesized that the high dimensional data afforded by massively parallel sequencing (MPS) is not only capable of providing biological insights, but may aid molecular pathology of prostate tumours. We assembled a cohort of six patients with high-risk disease, and performed deep RNA and shallow DNA sequencing in primary tumours and matched metastases where available. Our analysis identified copy number abnormalities, accurately profiled gene expression levels, and detected both differential splicing and expressed fusion genes. We revealed occult and potentially dormant metastases, unambiguously supporting the patients' clinical history, and implicated the REST transcriptional complex in the development of neuroendocrine prostate cancer, validating this finding in a large independent cohort. We massively expand on the number of novel fusion genes described in prostate cancer; provide fresh evidence for the growing link between fusion gene aetiology and gene expression profiles; and show the utility of fusion genes for molecular pathology. Finally, we identified chromothripsis in a patient with chronic prostatitis. Our results provide a strong foundation for further development of MPS-based molecular pathology.

Integrated genome and transcriptome sequencing identifies a novel form of hybrid and aggressive prostate cancer

Chunxiao Wu, Alexander W Wyatt, Anna V Lapuk, Andrew McPherson, Brian J McConeghy, Robert H Bell, Shawn Anderson, Anne Haegert, Sonal Brahmbhatt, Robert Shukin, Fan Mo, Estelle Li, Ladan Fazli, Antonio Hurtado-Coll, Edward C Jones, Yaron S Butterfield, Faraz Hach, Fereydoun Hormozdiari, Iman Hajirasouliha, Paul C Boutros, Robert G Bristow, Steven JM Jones, Martin Hirst, Marco A Marra, Christopher A Maher, Arul M Chinnaiyan, S Cenk Sahinalp, Martin E Gleave, Stanislav V Volik and Colin C Collins
Journal Papers J Pathol. 2012 May;227(1):53-61

Abstract

Next-generation sequencing is making sequence-based molecular pathology and personalized oncology viable. We selected an individual initially diagnosed with conventional but aggressive prostate adenocarcinoma and sequenced the genome and transcriptome from primary and metastatic tissues collected prior to hormone therapy. The histology-pathology and copy number profiles were remarkably homogeneous, yet it was possible to propose the quadrant of the prostate tumour that likely seeded the metastatic diaspora. Despite a homogeneous cell type, our transcriptome analysis revealed signatures of both luminal and neuroendocrine cell types. Remarkably, the repertoire of expressed but apparently private gene fusions, including C15orf21:MYC, recapitulated this biology. We hypothesize that the amplification and over-expression of the stem cell gene MSI2 may have contributed to the stable hybrid cellular identity. This hybrid luminal-neuroendocrine tumour appears to represent a novel and highly aggressive case of prostate cancer with unique biological features and, conceivably, a propensity for rapid progression to castrate-resistance. Overall, this work highlights the importance of integrated analyses of genome, exome and transcriptome sequences for basic tumour biology, sequence-based molecular pathology and personalized oncology.

Sensitive and fast mapping of di-base encoded reads

Farhad Hormozdiari, Faraz Hach, S. Cenk Sahinalp, Evan E. Eichler and Can Alkan
Journal Papers Bioinformatics (2011) 27 (14): 1915-1921

Abstract

Motivation: Discovering variation among high-throughput sequenced genomes relies on efficient and effective mapping of sequence reads. The speed, sensitivity and accuracy of read mapping are crucial to determining the full spectrum of single nucleotide variants (SNVs) as well as structural variants (SVs) in the donor genomes analyzed.

Results: We present drFAST, a read mapper designed for di-base encoded ‘color-space’ sequences generated with the AB SOLiD platform. drFAST is specially designed for better delineation of structural variants, including segmental duplications, and is able to return all possible map locations and underlying sequence variation of short reads within a user-specified distance threshold. We show that drFAST is more sensitive in comparison to all commonly used aligners such as Bowtie, BFAST and SHRiMP. drFAST is also faster than both BFAST and SHRiMP and achieves a mapping speed comparable to Bowtie.

Availability: The source code for drFAST is available at http://drfast.sourceforge.net
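
Di-base encoding itself is compact enough to show in full; the sketch below is a standard rendering of the SOLiD color code, not drFAST code.

```python
# Di-base ("color-space") encoding: each SOLiD color is the XOR of the
# 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3).  Decoding
# needs the primer base, and a single color error corrupts every
# downstream base, hence mapping directly in color space.
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}
DEC = "ACGT"

def to_colors(seq):
    return [ENC[a] ^ ENC[b] for a, b in zip(seq, seq[1:])]

def to_bases(primer, colors):
    out = [primer]
    for c in colors:
        out.append(DEC[ENC[out[-1]] ^ c])
    return "".join(out)

colors = to_colors("TACGGTC")
print(colors)                     # [3, 1, 3, 0, 1, 2]
print(to_bases("T", colors))      # round-trips to TACGGTC
```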

Comrad: detection of expressed rearrangements by integrated analysis of RNA-Seq and low coverage genome sequence data

Andrew McPherson, Chunxiao Wu, Iman Hajirasouliha, Fereydoun Hormozdiari, Faraz Hach, Anna Lapuk, Stanislav Volik, Sohrab Shah, Colin Collins and S. Cenk Sahinalp
Journal Papers Bioinformatics (2011) 27 (11): 1481-1488.
Presentation ISMB-HiTSeq’11 (High Throughput Sequencing Analysis and Algorithms, Special Interest Group of ISMB’11), Vienna, Austria, July 15-16 2011
Award Won Best Paper Award

Abstract

Motivation: Comrad is a novel algorithmic framework for the integrated analysis of RNA-Seq and whole genome shotgun sequencing (WGSS) data for the purposes of discovering genomic rearrangements and aberrant transcripts. The Comrad framework leverages the advantages of both RNA-Seq and WGSS data, providing accurate classification of rearrangements as expressed or not expressed and accurate classification of the genomic or non-genomic origin of aberrant transcripts. A major benefit of Comrad is its ability to accurately identify aberrant transcripts and associated rearrangements using low coverage genome data. As a result, a Comrad analysis can be performed at a cost comparable to that of two RNA-Seq experiments, significantly lower than an analysis requiring high coverage genome data.

Results: We have applied Comrad to the discovery of gene fusions and read-throughs in prostate cancer cell line C4-2, a derivative of the LNCaP cell line with androgen-independent characteristics. As a proof of concept, we have rediscovered in the C4-2 data 4 of the 6 fusions previously identified in LNCaP. We also identified six novel fusion transcripts and associated genomic breakpoints, and verified their existence in LNCaP, suggesting that Comrad may be more sensitive than previous methods that have been applied to fusion discovery in LNCaP. We show that many of the gene fusions discovered using Comrad would be difficult to identify using currently available techniques.

Availability: A C++ and Perl implementation of the method demonstrated in this article is available at http://compbio.cs.sfu.ca/.

Alu repeat discovery and characterization within human genomes

Fereydoun Hormozdiari, Can Alkan, Mario Ventura, Iman Hajirasouliha, Maika Malig, Faraz Hach, Deniz Yorukoglu, Phuong Dao, Marzieh Bakhshi, S. Cenk Sahinalp and Evan E. Eichler
Journal Papers Genome Res. 2011 Jun;21(6):840-9

Abstract

Human genomes are now being rapidly sequenced, but not all forms of genetic variation are routinely characterized. In this study, we focus on Alu retrotransposition events and seek to characterize differences in the pattern of mobile insertion between individuals based on the analysis of eight human genomes sequenced using next-generation sequencing. Applying a rapid read-pair analysis algorithm, we discover 4342 Alu insertions not found in the human reference genome and show that 98% of a selected subset (63/64) experimentally validate. Of these new insertions, 89% correspond to AluY elements, suggesting that they arose by retrotransposition. Eighty percent of the Alu insertions have not been previously reported and more novel events were detected in Africans when compared with non-African samples (76% vs. 69%). Using these data, we develop an experimental and computational screen to identify ancestry informative Alu retrotransposition events among different human populations.

mrsFAST: a cache-oblivious algorithm for short-read mapping

Faraz Hach, Fereydoun Hormozdiari, Can Alkan, Farhad Hormozdiari, Inanc Birol, Evan E Eichler and S Cenk Sahinalp
Journal Papers Nature Methods. 2010 Aug;7(8):576-7

Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery

Fereydoun Hormozdiari, Iman Hajirasouliha, Phuong Dao, Faraz Hach, Deniz Yorukoglu, Can Alkan, Evan E. Eichler and S. Cenk Sahinalp
Journal Papers Bioinformatics (2010) 26 (12): i350-i357
Presentation ISMB’10 (18th Annual International Conference on Intelligent Systems for Molecular Biology), Boston, USA, July 11-13 2010

Abstract

Recent years have witnessed an increase in research activity for the detection of structural variants (SVs) and their association to human disease. The advent of next-generation sequencing technologies makes it possible to extend the scope of structural variation studies to a point previously unimaginable, as exemplified by the 1000 Genomes Project. Although various computational methods have been described for the detection of SVs, no such algorithm is yet fully capable of discovering transposon insertions, a class of SVs very important to the study of human evolution and disease. In this article, we provide a complete and novel formulation to discover both loci and classes of transposons inserted into genomes sequenced with high-throughput sequencing technologies. In addition, we also present ‘conflict resolution’ improvements to our earlier combinatorial SV detection algorithm (VariationHunter) by taking the diploid nature of the human genome into consideration. We test our algorithms with simulated data from the Venter genome (HuRef) and are able to discover >85% of transposon insertion events with precision of >90%. We also demonstrate that our conflict resolution algorithm (denoted VariationHunter-CR) outperforms state-of-the-art algorithms (such as the original VariationHunter, BreakDancer and MoDIL) when tested on the genome of the Yoruba African individual (NA18507).

Availability: The implementation of the algorithm is available at http://compbio.cs.sfu.ca/strvar.htm.
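
The underlying read-pair signal can be sketched as the toy clustering below; VariationHunter-CR's maximum-parsimony conflict resolution and diploidy constraint are not modeled, and the positions, window and support threshold are invented.

```python
# Toy insertion-discovery signal: pairs whose anchor mate maps uniquely
# while the other mate matches a transposon consensus (e.g. AluY) pile
# up near the insertion breakpoint.  Fixed-width bins can split true
# clusters; this is illustration only.
from collections import defaultdict

pairs = [("chr1", 10050, True), ("chr1", 10120, True), ("chr1", 10081, True),
         ("chr1", 500000, True), ("chr2", 77000, False)]

def insertion_clusters(pairs, window=300, min_support=2):
    bins = defaultdict(list)
    for chrom, pos, mate_hits_alu in pairs:
        if mate_hits_alu:
            bins[(chrom, pos // window)].append(pos)
    return [(chrom, min(ps), max(ps), len(ps))
            for (chrom, _), ps in bins.items() if len(ps) >= min_support]

print(insertion_clusters(pairs))   # one supported cluster near chr1:10k
```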

Faster Phylogenetic Inference with MXG

David G. Mitchell, Faraz Hach, Raheleh Mohebali
Conference Papers Proc. of LPAR 2007: Lecture Notes in Computer Science Volume 4790, 2007, pp 423-437

Abstract

We apply the logic-based declarative programming approach of Model Expansion (MX) to a phylogenetic inference task. We axiomatize the task in multi-sorted first-order logic with cardinality constraints. Using the model expansion solver MXG and SAT+cardinality solver MXC, we compare the performance of several MX axiomatizations on a challenging set of test instances. Our methods perform orders of magnitude faster than previously reported declarative solutions. Our best solution involves polynomial-time pre-processing, redundant axioms, and symmetry-breaking axioms. We also discuss our method of test instance generation, and the role of pre-processing in declarative programming.

MXG: A Model Expansion Grounder and Solver

Raheleh Mohebali, Faraz Hach and David Mitchell
Technical Reports Accepted to LPAR 2007, 2007.

Abstract

We describe MXG, a solver for NP search problems expressed as model expansion (MX). Problems are specified in an extension of first-order logic, and solved by grounding. That is, MXG combines a high-level specification with an instance and produces a propositional formula encoding the solutions. It calls a SAT (or extended SAT) solver to find solutions. MXG is distinguished from other grounding software in its use of a grounding algorithm based on a generalization of the relational algebra.
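
Grounding can be illustrated end to end on one fixed specification, as below: a proper-2-coloring constraint is expanded into propositional clauses and printed as DIMACS CNF for an off-the-shelf SAT solver. MXG, by contrast, grounds arbitrary first-order specifications via its relational-algebra-based algorithm; this hand-rolled toy fixes a single specification.

```python
# Toy "grounding": expand "adjacent vertices get different colors"
# (2-coloring) into clauses.  Variable v is true iff vertex v is red.
edges = [(1, 2), (2, 3), (3, 4), (4, 1)]   # a 4-cycle: satisfiable
num_vars = 4

clauses = []
for u, w in edges:
    clauses.append([u, w])      # not both blue
    clauses.append([-u, -w])    # not both red

print(f"p cnf {num_vars} {len(clauses)}")   # DIMACS header
for c in clauses:
    print(" ".join(map(str, c)), 0)
```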

Solving NP search problems with model expansion

Faraz Hach
Thesis M.Sc. Thesis, 2007.

Abstract

We explore the application of MXG, a declarative programming solver for NP search problems based on Model Expansion (MX) for first order logic with inductive definitions. We present specifications for several common NP-complete benchmark problems in the language of MXG, and describe some modeling techniques we found useful in obtaining good solver performance. We present an experimental comparison of the performance of MXG with Answer Set Programming (ASP) solvers on these problems, showing that MXG is competitive and often better. As an extended example, we consider an NP-complete phylogenetic inference problem. We present several specifications for this problem, employing a variety of techniques for obtaining good performance. Our best solution, which combines instance pre-processing, redundant axioms, and symmetry breaking axioms, performs orders of magnitude faster than previously reported declarative programming solutions using ASP solvers.

A framework for modelling and solving search problems.

David Mitchell, Eugenia Ternovska, Faraz Hach and Raheleh Mohebali
Technical Reports TR 2006-24, School of Computing Science, Simon Fraser University, Burnaby, BC, Canada

Abstract

We propose a framework for modelling and solving search problems using logic, and describe a project whose goal is to produce practically effective, general purpose tools for representing and solving search problems based on this framework. The mathematical foundation lies in the areas of finite model theory and descriptive complexity, which provide us with many classical results, as well as powerful techniques not available to many other approaches with similar goals. We describe the mathematical foundations; explain an extension to classical logic with inductive definitions that we consider central; give a summary of complexity and expressiveness properties; describe an approach to implementing solvers based on grounding; present grounding algorithms based on an extension of the relational algebra; describe an implementation of our framework which includes use of inductive definitions, sorts and order; and give experimental results comparing the performance of our implementation with ASP solvers and another solver based on the same framework.

Designing a Protocol to Coordinate Multiagent Environments

Faraz Hach
Thesis B.Sc. Thesis, 2004.

Abstract

As the agent-oriented software engineering paradigm evolves, much research has been done on exploiting this technology efficiently. The main objective of the RoboCup competitions is to test it, from both software and hardware aspects. By removing physical constraints, the Rescue Simulation branch of RoboCup is a suitable testbed for proposed algorithms and protocols while keeping costs as low as possible. Every year, new rules make the simulated model of the world more realistic, and inter-agent communication is no exception. The objective of communication protocols is to upgrade the agents' local views to a global view, so that they can gain the most from the least. Besides introducing multiagent environments, this thesis examines the communication protocol of the Arian simulation team in detail and carries out a performance comparison with other teams.

Current Teaching

  • Fall 2014

    Introduction to Computational Biology

    This is an introductory course on algorithmic aspects of bioinformatics, with emphasis on genome and protein sequence analysis and alignment, via both combinatorial techniques and probabilistic models (in particular hidden Markov models); basics of phylogeny reconstruction; structural bioinformatics, focusing on RNA secondary structure and RNA-RNA interactions; and biomolecular interactions and network biology. The course aims to develop the solid mathematical and computational foundations crucial for understanding advanced topics in computational molecular biology, and targets both graduate and advanced undergraduate students in computing science, molecular biology, biochemistry, biophysics, mathematics and biostatistics, with minimal or no background in bioinformatics problems and techniques.
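
    As a taste of the probabilistic-models portion, below is a self-contained Viterbi decoder for a two-state HMM, in the style of the classic CpG-island segmentation exercise; all parameters are made up for illustration.

```python
# Viterbi decoding for a toy two-state HMM (log-space).
import math

log = math.log
states = ("background", "island")
start = {"background": log(0.5), "island": log(0.5)}
trans = {"background": {"background": log(0.9), "island": log(0.1)},
         "island":     {"background": log(0.2), "island": log(0.8)}}
emit = {"background": {"A": log(0.3), "C": log(0.2), "G": log(0.2), "T": log(0.3)},
        "island":     {"A": log(0.1), "C": log(0.4), "G": log(0.4), "T": log(0.1)}}

def viterbi(seq):
    v = [{s: start[s] + emit[s][seq[0]] for s in states}]
    back = []
    for x in seq[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[-1][p] + trans[p][s])
            col[s] = v[-1][prev] + trans[prev][s] + emit[s][x]
            ptr[s] = prev
        v.append(col)
        back.append(ptr)
    path = [max(states, key=lambda s: v[-1][s])]   # traceback from the end
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

calls = viterbi("ATATACGCGCGCGTATAT")
print("".join("I" if s == "island" else "." for s in calls))
```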

At My Office

You can find my office in the TASC-1 building, room 9415.

I am usually in my office or in the Lab for Computational Biology. To make sure you will find me there, drop me an email and set up a time.

Lab

I often visit the Lab for Computational Biology to have meetings and to work with students.


Vancouver Prostate Centre

I also collaborate with a few colleagues at the Vancouver Prostate Centre (VPC) and go there once or twice a week. If you need to meet me downtown, the VPC is the closest I get to downtown.