Research Interests

The research in the Zhang lab spans the fields of Computational & Molecular Biology/Genetics/Genomics and Artificial Intelligence (particularly, machine learning, data mining and heuristic search)

Biology, Genetics and Genomics

The work in the Zhang lab is mainly computational and falls into the category of Data Sciences (or Big Data). We aim to find and understand biological “stories” through building models and identifying patterns from large quantities of genome-scale biological data. In the process, we integrate the bench work of molecular biology and develop novel methods and efficient software tools along the way. In this broad area, we pursue two overarching themes, network systems biology and gene regulation via noncoding RNA genes. We are interested in basic biological questions (e.g., the biogenesis of small noncoding RNAs, transcriptional and post-transcriptional gene regulation) and their applications to complex diseases (e.g., Alzheimer’s disease (AD) and psoriasis) and plant stress response (e.g., rice blast and rice bacterial blight).

Network-based genome-wide association study (GWAS): GWAS and single-gene approaches have been popular in identifying risk factors of Mendelian diseases, such as familial Alzheimer's disease (AD). Despite a tremendous amount of effort, however, these methods fail to decipher the missing inheritability of complex diseases, such as late-onset AD (or simply AD), diabetes, and psoriasis. A key factor that impairs conventional GWAS on polygenic diseases is the lack of effective ways to identify multiple genomic loci or genes whose interactions or associations underlie the manifestation of disease phenotypes. For example, a brute-force search of possible associations among only three markers within one million single nucleotide polymorphisms (SNPs) requires examining a prohibitive number of 1.7×1017 trios. In the Zhang lab, we took a network perspective toward this problem and focused on developing novel enabling techniques for network GWAS. One network GWAS method recently developed in the lab is called BlocBuster (Climer, Templeton and Zhang, PLOS Comp. Biol., 10(9):e1003766, 2014), which exploits the fact that there exist genetic architectures of markers underlying a complex trait. Such architectures are best represented as a network where correlated markers can be regarded as forming network modules or communities. We also used a new similarity measure to accommodate genetic heterogeneity in the population in order to make the network GWAS analysis allele-specific, robust, and rigorous. One application of the network GWAS method to the HapMap population SNPs data revealed a giant 284-SNPs divergent yin-yang haplotype pattern encompassing gephyrin, a gene with remarkably diverse functions related to translation initiation, regulation of synaptic protein synthesis, and mTOR signaling, to name a few (Climer, Templeton and Zhang, Nature Comm., 6:6534, 2015; see the first figure below), suggest a strong positive selection on the region over gephyrin, rapid evolution of world populations, and possible distinct molecular mechanisms involving gephyrin. The discovery of this large yin-yang pattern, which was missed by all existing methods, demonstrates the power of the network GWAS approach. The second application of the network method to GWAS data of psoriasis, an autoimmune skin disorder, helped reveal a module of 17 tightly correlated SNPs within the Major Histocompatibility Complex (MHC) on human chromosome 6 (Climer, Templeton and Zhang, PLOS Comp. Biol., 10(9):e1003766, 2014; see also the second figure below). While some of the 17 SNPs have been indicated in previous studies, our result showed that their specific alleles as a whole provide a stronger discriminative power to distinguish psoriasis subjects from normal controls. In short, the Zhang lab is the first to systematically pursue the notion of network GWAS to push the envelop of conventional GWAS to a new dimension and to develop an effective enabling technique to approach complex disease.


Network-based transcriptome analysis: Manifestation of complex diseases or traits is often the result of aberrant gene expression variations in disease states. Analysis of genome-wide gene expression variations, or transcriptome analysis, can thus provide insights into disease mechanisms. In the same vein as network GWAS, the Zhang lab has taken a co-expression network approach to transcriptome analysis of complex diseases by exploiting the rationale of “guilty by association”. One realization of this idea is the HQCut method for constructing and analyzing co-expression networks using gene expression data (Ruan, Dean and Zhang, BMC Syst. Biol., 4:136, 2010; Ruan and Zhang, Physics Review E, 77:016104, 2008). The central piece of the method is a heuristic that combines a spectral clustering technique that maps gene expression data to a latent, lower dimensional space, by which the noise of low information content in the data can be effectively removed. As a result, clustering can be more accurately and efficiently performed. This network transcriptome method has been used in our own research and by many other labs. In particular, using this method, we were able to unravel intricate relationships among co-expressed genes in AD, diabetes, and cardiovascular diseases (Ray, Ruan and Zhang, Genome Biol., 9(10):R148, 2008), showing that many genes that are known to be involved in these metabolic diseases were also aberrantly expressed in the AD brain and more interestingly, appeared in one module of co-expressed genes (see the module and genes in yellow in the figure below). This result support the notion that AD is a "diabetes" of the brain. In addition, using this method we identified brain-region specific patterns of gene expression variation in the AD brain (Ray and Zhang, BMC Syst. Biol., 4:136, 2010) and revealed the genetics of gene expression (or eQTL) in AD (Webster, et al., Am. J. Human Genetics, 84(4):445-58, 2009). In short, our work on network biology has clearly demonstrated the power and rigor that computational biology can bring to complex biology problems.

Biogenesis of small noncoding RNAs: Gene expression variations across disease states and the normal condition are often due to aberrant gene regulation where small noncoding RNAs (sncRNAs), particularly microRNAs (miRNAs) and small interfering RNAs (siRNAs), play essential roles. Identification of sncRNAs and analysis of their biogenesis and regulatory functions are key steps toward elucidation of aberrant gene regulation. In my lab, we have developed effective methods for identification of miRNAs and siRNAs as well as methods for analyzing their biogenesis and potential functions in gene regulation and diseases. In particular, we developed an innovative method to show that most miRNAs in animals (e.g., human and mouse) and plants (e.g., rice and Arabidopsis) are transcribed by RNA polymerase II (Zhou, Ruan, Wang and Zhang, PLOS Comp. Biol., 3(3):e37, 2007). The analysis was done by first finding cis-regulatory elements (motifs) in the promoter regions of miRNAs using our WordSpy genome-wide motif finding method (Wang and Zhang, Genome Biol., 7(6):R49, 2006), constructing a model of miRNA transcription using the motifs, and then applying the model to distinguish miRNAs from protein-coding genes. By combining large quantities of sequencing data with bench experiments, we also showed that a miRNA precursor might generate multiple miRNAs (Zhang, et al., Genome Biol., 11(8):R81, 2010; see an example of miR822 in Arabidopsis in the first figure below) or a miRNA plus a siRNA that regulates DNA methylation, suggesting a model of dual biogenesis of miRNA and siRNA (Chellappan, et al., Nucleic Acids Res., 38(20):6883-94, 2010; see the second figure below)


Moreover, microRNAs may also arise from the same gemomic loci as other noncoding RNAs, e.g., tRNAs and snoRNAs. Combining computational and bench molecular biology work, we identified and confirmed that miRNAs can originate from the loci where tRNAs were encoded in mouse and murine herpesvirus (Reese, et al., J. Virology, 84(19):1034-53, 2010; see the first figure below for some examples in mouse). In addition, our extensive computational analysis of large quantities of profiling data revealed that there exist a huge number of isoforms of microRNAs in many eukaryotic organisms, including human (Xia and Zhang, Nucleic Acids Res., 42(3):1427-41, 2014; also see the second figure below for an example), some of which may also be functional in complex diseases such as psoriasis (Xia, Joyce, Bowcock and Zhang, Human Mol. Genetics, 22(4):737-48, 2013). In summary, our work has broadened our view on the biogenesis, diversity and function of small noncoding RNAs.


Regulatory functions of small noncoding RNAs: We expanded our work to examine the potential roles of sncRNAs in complex disease and plant stress response. For example, using the miRNA transcriptome modeling as a tool, we were able to identify cold- and UV-B-light responsive miRNAs in model plant Arabidopsis thaliana, most of which were also experimentally validated (Zhou, Wang and Zhang, Mol. Syst. Biol., 3:13, 2007; Zhou, et al., Biochim. Biophys. Acta, 1779(11):780-8, 2008). This is the first computational miRNA gene function annotation method by modeling the possible transcriptional regulators of microRNAs and their potential targets (see the first figure below for an example of putative Auxin signaling pathways mediated by microRNAs in response to UV-B light stress). Furthermore, combining deep-sequencing profiling, computational analysis, and experimental validation, we characterized the functions of microRNAs in psoriatic and normal human skin (Joyce and et al., Human Mol. Genetics, 20(20):4025-40, 2011; see some microRNA aberrant expression patterns in the second figure below and microR-135b expressed in psoriatic skin in the third figures below), in viral infection (Reese, et al., J. Virology, 84(19):1034-53, 2010), and plant stress responses, for example, in responding to rice bacterial blight infection (Peng, et al., Scientfic Reports, 5:12165, 2015). Our work has broadened our view of the biogenesis, diversity and functions of sncRNAs in diseases and plants. In summary, we have done some pioneering work in developing and using computational methods to study basic biological problems related to small noncoding RNA gene regulation that were traditionally approached through bench work.


Artificial Intelligence - Machine learning/datamining and heuristic search

In recent years, in order to support the research activities in computational biology and genomics, we expanded our research from heuristic search and planning to machine learning and data mining.

Identification of network modules in complex networks: The network GWAS and co-expression network approaches described above rely heavily on accurately and efficiently finding network modules. Nevertheless, finding structures in large networks is a challenging problem in machine learning and datamining. First, network structural properties can be characterized in several different ways, by node connectivity, by properties on links (such as relationships among connected individuals), or by semantics of nodes and links (such as the social roles that individuals have in a society network). In our research, we considered these different types of information for network module finding and developed effective novel methods for finding modules of nodes (Jin, Chen, He and Zhang, Proc. of AAAI-15), modules of links (He, et al., J. Statistical, and modules that are defined by network structural information and node semantics [34]. We adopted a few different machine learning techniques in developing our methods, which include non-negative matrix factorization [35], stochastic modeling [36], and deep learning [37].

Asymptotic optimality of linear-space search algorithms: The best-first search (BFS) algorithm is time optimal, but requires space that is exponential in the search depth, which is impractical for large combinatorial optimization problems, such as the Traveling Salesman Problem (TSP). In contrast, a linear-space search (LSS) algorithm runs linearly in search depth with a penalty of exploring more nodes in the search space. Due to their limited space requirement, LSS algorithms, particularly depth-first branch-and-bound (DFBnB) and iterative deepening A* (IDA*), are the most popular techniques for large and difficult search problems in practice. Understanding the expected complexity of LSS algorithms was an important and challenging problem. My research established the asymptotic optimality of LSS methods, showing that their expected complexity is a constant factor of that of BFS [38,39]. Moreover, LSS algorithms may run faster than BFS on large problems, making them the algorithms for real applications.

Phase transitions and structures of combinatorial search: My study also revealed that the expected complexity of BFS and LLS algorithms exhibits a sharp and abrupt transition from polynomial to exponential time as some critical problem features shift [38,39,40]. This is similar to phase transitions of spin glasses studied in physics. A simple example of a phase transition is water changing from liquid to solid ice when the temperature drops below the freezing point. I identified that the accuracy of heuristic information is the controlling parameter of phase transitions. The computational complexity is also related to the size of the backbone of a problem, which is a set of variables that have the same values among all solutions [40,41]. The study of phase transitions and problem structures such as backbones has changed the way we characterize the complexity of combinatorial problems, beyond the notion of worst-case complexity, and the way we solve difficult problems [38,40,41].

Problem solving by exploiting phase transitions: Phase transitions can not only be used to characterize problems, but also be exploited in problem solving. I have done a pioneering work on developing new efficient algorithms for finding optimal and approximate solutions by exploiting phase transitions [41,42,43]. One of the ideas is to transform a state space that is difficult to search into one that is easy to explore. Another idea is to estimate and utilize backbone structures to guide a search. These techniques are characterized as backbone guided search [41,42]. These methods have been shown to be effective on many combinatorial problems, including the Traveling Salesman Problem (TSP) [44] and maximum Boolean satisfiability (SAT) [41,42].

The Zhang Algorithm for the ATSP: The asymmetric TSP (ATSP) is an important NP-hard problem with many real applications in the areas such as scheduling. The research largely focuses on efficient approximation algorithms. The first approximation method is the Kanallakis-Papadimitiou local search algorithm, named after its inventors, two eminent scientists in computer science and operations research. The second approach is represented by my truncated DFBnB algorithm [45,38], which is called the Zhang algorithm [46].

Long distance mutual exclusion and propositional planning: A dominating technique in planning is GraphPlan with mutual exclusions (mutex) to represent constraints among actions and events at the same planning horizon (depth). In our work, we introduced long distance mutual exclusion (londex) to capture constraints across multiple horizons, developed an efficient algorithm to efficiently compute londex, and designed and implemented our new MaxPlan planner using londex [47,48]. The overall work provided a new paradigm for planning. Our MaxPlan system won the First Place Award of the Fifth International Planning Competition in 2006 [47].

Transition based encoding for planning: The technique of planning as Boolean satisfiability usually uses encodings in logic statements. We introduced a novel SAT encoding scheme (SASE) based on the SAS+ formalism [49,50]. The new scheme exploits the structural information so that the encoding is both compact and efficient for planning. We proved the correctness of the new encoding and analyzed the rationale behind its improvement in planning performance. Our extensive experimental results demonstrated significant improvements of SASE over the state-of-the-art methods in time and space. This work received the Outstanding Paper Award of the 24-th AAAI Conference on Artificial Intelligence Conference in 2010 (AAA-10), chosen from a double blind review process of 894 submissions. Note that AAAI is one of the two leading international AI conferences.