The work in the Zhang lab in this broad field is mainly computational and falls into the category of Data Sciences (or Big Data). We aim to find and understand biological “stories” through building models and identifying patterns from large quantities of genome-scale biological data. In the process, we integrate the bench work of molecular biology and develop novel methods and efficient software tools along the way. In this broad area, we pursue two overarching themes, network systems biology and gene regulation via noncoding RNA genes. We are interested in basic biological questions (e.g., the biogenesis of small noncoding RNAs, transcriptional and post-transcriptional gene regulation) and their applications to complex diseases (e.g., Alzheimer’s disease (AD) and psoriasis) and plant stress response (e.g., rice blast and rice bacterial blight).
Network-based, allele-specific genome-wide association study (GWAS): GWAS and single-gene approaches have been popular in identifying risk factors of Mendelian diseases, such as familial Alzheimer's disease (AD). Despite a tremendous amount of effort, however, these methods fail to decipher the missing inheritability of complex diseases, such as late-onset AD (or simply AD), diabetes, and psoriasis. A key factor that impairs conventional GWAS on polygenic diseases is the lack of effective ways to identify multiple genomic loci or genes whose interactions or associations underlie the manifestation of disease phenotypes. For example, a brute-force search of possible associations among only three markers within one million single nucleotide polymorphisms (SNPs) requires examining a prohibitive number of 1.7×1017 trios. In the Zhang lab, we took a network perspective toward this problem and focused on developing novel enabling techniques for network GWAS. One network GWAS method recently developed in the lab is called BlocBuster (Climer, Templeton and Zhang, PLOS Comp. Biol., 10(9):e1003766, 2014), which exploits the fact that there exist genetic architectures of markers underlying a complex trait. Such architectures are best represented as a network where correlated markers can be regarded as forming network modules or communities. We also used a new similarity measure to accommodate genetic heterogeneity in the population in order to make the network GWAS analysis allele-specific, robust, and rigorous. One application of the network GWAS method to the HapMap population SNPs data revealed a giant 284-SNPs divergent yin-yang haplotype pattern encompassing gephyrin, a gene with remarkably diverse functions related to translation initiation, regulation of synaptic protein synthesis, and mTOR signaling, to name a few (Climer, Templeton and Zhang, Nature Comm., 6:6534, 2015; see the first figure below), suggest a strong positive selection on the region over gephyrin, rapid evolution of world populations, and possible distinct molecular mechanisms involving gephyrin. The discovery of this large yin-yang pattern, which was missed by all existing methods, demonstrates the power of the network GWAS approach. The second application of the network method to GWAS data of psoriasis, an autoimmune skin disorder, helped reveal a module of 17 tightly correlated SNPs within the Major Histocompatibility Complex (MHC) on human chromosome 6 (Climer, Templeton and Zhang, PLOS Comp. Biol., 10(9):e1003766, 2014; see also the second figure below). While some of the 17 SNPs have been indicated in previous studies, our result showed that their specific alleles as a whole provide a stronger discriminative power to distinguish psoriasis subjects from normal controls. In short, the Zhang lab is the first to systematically pursue the notion of network GWAS to push the envelopment of conventional GWAS to a new dimension and to develop an effective enabling technique to approach complex disease.
Network-based transcriptome analysis: Manifestation of complex diseases or traits is often the result of aberrant gene expression variations in disease states. Analysis of genome-wide gene expression variations, or transcriptome analysis, can thus provide insights into disease mechanisms. In the same vein as network GWAS, the Zhang lab has taken a co-expression network approach to transcriptome analysis of complex diseases by exploiting the rationale of “guilty by association”. One realization of this idea is the HQCut method for constructing and analyzing co-expression networks using gene expression data (Ruan, Dean and Zhang, BMC Syst. Biol., 4:136, 2010; Ruan and Zhang, Physics Review E, 77:016104, 2008). The central piece of the method is a heuristic that combines a spectral clustering technique that maps gene expression data to a latent, lower dimensional space, by which the noise of low information content in the data can be effectively removed. As a result, clustering can be more accurately and efficiently performed. This network transcriptome method has been used in our own research and by many other labs. In particular, using this method, we were able to unravel intricate relationships among co-expressed genes in AD, diabetes, and cardiovascular diseases (Ray, Ruan and Zhang, Genome Biol., 9(10):R148, 2008), showing that many genes that are known to be involved in these metabolic diseases were also aberrantly expressed in the AD brain and more interestingly, appeared in one module of co-expressed genes (see the module and genes in yellow in the figure below). This result support the notion that AD is a "diabetes" of the brain. In addition, using this method we identified brain-region specific patterns of gene expression variation in the AD brain (Ray and Zhang, BMC Syst. Biol., 4:136, 2010) and revealed the genetics of gene expression (or eQTL) in AD (Webster, et al., Am. J. Human Genetics, 84(4):445-58, 2009). In short, our work on network biology has clearly demonstrated the power and rigor that computational biology can bring to complex biology problems.
Biogenesis of small noncoding RNAs: Gene expression variations across disease states and the normal condition are often due to aberrant gene regulation where small noncoding RNAs (sncRNAs), particularly microRNAs (miRNAs) and small interfering RNAs (siRNAs), play essential roles. Identification of sncRNAs and analysis of their biogenesis and regulatory functions are key steps toward elucidation of aberrant gene regulation. In my lab, we have developed effective methods for identification of miRNAs and siRNAs as well as methods for analyzing their biogenesis and potential functions in gene regulation and diseases. In particular, we developed an innovative method to show that most miRNAs in animals (e.g., human and mouse) and plants (e.g., rice and Arabidopsis) are transcribed by RNA polymerase II (Zhou, Ruan, Wang and Zhang, PLOS Comp. Biol., 3(3):e37, 2007). The analysis was done by first finding cis-regulatory elements (motifs) in the promoter regions of miRNAs using our WordSpy genome-wide motif finding method (Wang and Zhang, Genome Biol., 7(6):R49, 2006), constructing a model of miRNA transcription using the motifs, and then applying the model to distinguish miRNAs from protein-coding genes. By combining large quantities of sequencing data with bench experiments, we also showed that a miRNA precursor might generate multiple miRNAs (Zhang, et al., Genome Biol., 11(8):R81, 2010; see an example of miR822 in Arabidopsis in the first figure below) or a miRNA plus a siRNA that regulates DNA methylation, suggesting a model of dual biogenesis of miRNA and siRNA (Chellappan, et al., Nucleic Acids Res., 38(20):6883-94, 2010; see the second figure below)
Moreover, microRNAs may also arise from the same gemomic loci as other noncoding RNAs, e.g., tRNAs and snoRNAs. Combining computational and bench molecular biology work, we identified and confirmed that miRNAs can originate from the loci where tRNAs were encoded in mouse and murine herpesvirus (Reese, et al., J. Virology, 84(19):1034-53, 2010; see the first figure below for some examples in mouse). In addition, our extensive computational analysis of large quantities of profiling data revealed that there exist a huge number of isoforms of microRNAs in many eukaryotic organisms, including human (Xia and Zhang, Nucleic Acids Res., 42(3):1427-41, 2014; also see the second figure below for an example), some of which may also be functional in complex diseases such as psoriasis (Xia, Joyce, Bowcock and Zhang, Human Mol. Genetics, 22(4):737-48, 2013). In summary, our work has broadened our view on the biogenesis, diversity and function of small noncoding RNAs.
Regulatory functions of small noncoding RNAs: We expanded our work to examine the potential roles of sncRNAs in complex disease and plant stress response. For example, using the miRNA transcriptome modeling as a tool, we were able to identify cold- and UV-B-light responsive miRNAs in model plant Arabidopsis thaliana, most of which were also experimentally validated (Zhou, Wang and Zhang, Mol. Syst. Biol., 3:13, 2007; Zhou, et al., Biochim. Biophys. Acta, 1779(11):780-8, 2008). This is the first computational miRNA gene function annotation method by modeling the possible transcriptional regulators of microRNAs and their potential targets (see the first figure below for an example of putative Auxin signaling pathways mediated by microRNAs in response to UV-B light stress). Furthermore, combining deep-sequencing profiling, computational analysis, and experimental validation, we characterized the functions of microRNAs in psoriatic and normal human skin (Joyce and et al., Human Mol. Genetics, 20(20):4025-40, 2011; see some microRNA aberrant expression patterns in the second figure below and microR-135b expressed in psoriatic skin in the third figure below), in viral infection (Reese, et al., J. Virology, 84(19):1034-53, 2010), and plant stress responses, for example, in Aribidopsis (Zhang, et al., Human Mol. Genet., 20:4025-40, 2011) and cassava (Zeng, et al., Nucleic Acids Res., 38(3):981-95, 2010. Our work has broadened our view of the biogenesis, diversity and functions of sncRNAs in diseases and plants. In summary, we have done some pioneering work in developing and using computational methods to study basic biological problems related to small noncoding RNA gene regulation that were traditionally approached through bench work.
Research - Machine learning/datamining
In recent years, in order to support the research activities in computational biology and genomics, we have been focusing on developing effective computational methods for finding and analyzing modular structures in large (biological) networks.
Identification of network modules in complex networks: The network GWAS and co-expression network approaches described above rely heavily on accurately and efficiently finding network modules. Nevertheless, finding structures in large networks is a challenging problem in machine learning and datamining. First, network structural properties can be characterized in several different ways, by node connectivity, by properties on links (such as relationships among connected individuals), or by semantics of nodes and links (such as the social roles that individuals play in a society network). In our research, we considered these different types of information for network module finding and developed effective novel methods for finding modules of nodes (Jin, Chen, He and Zhang, Proc. AAAI-15), modules of links (He, Liu, Jin and Zhang, Proc. AAAI-15), and modules that are defined by network structural information and node semantics (He, et al., Proc AAAI-17). We adopted a few different machine learning techniques in developing our methods, including non-negative matrix factorization (Wang, et al., Proc. AAAI-16), stochastic modeling (He, Liu, Jin and Zhang, Proc. AAAI-15), deep learning (Yang, et al., Proc. IJCAI-16), and Markov Random Fields (He, et al., Proc. AAAI-18).
of network module analysis: One of the applications of this line
of research is the network-based GWAS, as discussed in (Climer,
Templeton and Zhang, PLOS Comp. Biol., 10(9):e1003766, 2014) and (Climer,
Templeton and Zhang, Nature Comm., 6:6534, 2015). Another
area of application is network-based transcriptome analysis for elucidation of
gene regulation, which has been studied in the Zhang lab, as discussed in (Ray, Ruan
and Zhang, Genome Biol., 9(10):R148, 2008), (Ruan, Dean and Zhang, BMC Syst. Biol., 4:136, 2010)
and Zhang, Physics Review E, 77:016104, 2008).
The Zhang has a long history of research on heuristic search, planning and optimization. The work has made significant contributions to the understanding of the nature and complexity of many search techniques, such as best-first search and depth-first branch-and-bound, and combinatorial optimization problems, such as the Traveling Salesman Problem and Constraint Satisfaction.
Asymptotic optimality of linear-space search algorithms: Heuristic search is one of the cornerstone of AI. The best-first search (BFS) algorithm, with the well-known A* algorithm as a special case, is time optimal. However, it requires space that is exponential in the search depth of the underlying search tree, which is impractical for large combinatorial optimization problems, such as the Traveling Salesman Problem (TSP). In contrast, a linear-space search (LSS) algorithm, with depth-first search and depth-first branch-and-bound (DFBnB) as special cases, runs linearly in search depth with a penalty of exploring more nodes in the search space. Due to their limited space requirement, LSS algorithms, including DFBnB, iterative deepening A* (IDA*), and recursive best-first search (RBFS), are the most popular techniques for large and difficult search problems in practice. Understanding the expected complexity of LSS algorithms was an important and challenging problem.Based on a random search tree model proposed by Judea Pearl and Richard Karp, our research established the asymptotic optimality of LSS methods, showing that their expected complexity is a constant factor of that of BFS (Zhand and Korf, Artificial Intelligence, 79(2):241-92, 1995; Zhang, State-Space Search, Springer, 1999). Moreover, LSS algorithms may run faster than BFS on large problems, making them the algorithms for real applications.
A directly and early application of our complexity results is the explanation to anomaly of the fixed horizon search, which was observed initially by Richard Korf on sliding-tile puzzles using DFBnB, as illustrated below. The anomaly states that for a fixed search depth, DFBnB can search the trees with larger branching factors faster. In fact, the anomaly persists regardless which search method, e.g., ID (or IDA*) or RBFS, shown below.
The insight gained by the complexity analysis on this search anomaly helped predict the existence of a phase transition in heuristic search problems, discussed below.
Phase transitions and structures of combinatorial search: Our study also revealed that the expected complexity of BFS and LLS algorithms exhibits a sharp and abrupt transition from polynomial to exponential time as some critical problem features (e.g., the precision of heuristic function) shift (Zhand and Korf, Artificial Intelligence, 79(2):241-92, 1995; Zhang and Korf, Artificial Intelligence, 81(1-2):223-39, 1996; Zhang, State-Space Search, Springer, 1999). This is similar to phase transitions of spin glasses studied in physics. A simple example of a phase transition is water changing from liquid to solid ice when the temperature drops below the freezing point. We identified that the accuracy of heuristic information is the controlling parameter of phase transitions. On the Judea&Karp incremental random search tree model, the phase transition can be summarized by the following figure.
The control parameter of this phase transition is the product of the mean branching factor of the search tree and the probability if an edge cost has cost 0.
This analysis was extended to reveal phase transitions in combinatorial optimization problems, such as the asymmetric TSP (ATSP) (Zhang and Korf, Artificial Intelligence, 81(1-2):223-39, 1996) and Boolean Satisfiability (Zhang, Artificial Intelligence, 158:1-26, 2004). For the ATSP, the control parameter is the precision of the distances between cities, which is equivalent to the number of digits used to represent distances. Intuitively, when the number of digits used is small, many distances may have a high chance to have the same value, and consequently many complete tours (visiting each and every city exactly once) may have a high chance to have the same cost as well. This also means that a large number of optimal TSP tours exist; and finding one optimal TSP tour should be relatively easy. In contrast, when no two distances between cites are the same, the number of optimal TSP tour is small, so that finding an optimal solution is difficult. This phase transition is illustrated in the figure below, where the Assignment Problem (AP) is the main computation step (heuristic) for solving the ATSP.
Interestingly, there also exist phase transitions in maximum Boolean satisfiability (max SAT), which is the optimization version of the SAT where the objective is to find a variable assignment that minimize the number of violated clauses (Zhang, Artificial Intelligence, 158:1-26, 2004).
Note that the phase transitions that we studied in optimization problems, such as finding the minimal TSP tour and the variable assignment to minimize the number of unsatisfied constraints, are fundamentally different from the phase transitions for decision problems, such as the decision version of the TSP and the SAT. For optimization problems, the phase transitions appear as an easy-to-difficult transition, whereas for decision problems, the phase transitions exhibit an easy-to-difficult-then-to-easy pattern.
The computational complexity and phase transition are also related to the backbone of a problem, which is a set of variables that have the same values among all solutions (Zhang, Proc. Intern. Conf. on Principles and Practice of Constraint programming, CP-2001; Zhang, JAIR, 20:471-97, 2004; Zhang, Artificial Intelligence, 158:1-26, 2004).
Problem solving by exploiting phase transitions and backbones: The study of phase transitions and problem structures such as backbones has changed the way we characterize the complexity of combinatorial problems, beyond the notion of worst-case complexity, and the way we solve difficult problems. In particular, phase transitions can not only be used to characterize problems, but also be exploited in problem solving. We have done a pioneering work on developing new efficient algorithms for finding optimal and approximate solutions by exploiting phase transitions (Pemberton and Zhang, Artificial Intelligence, 81:297-325, 1996; Zhang, Artificial Intelligence, 158:1-26, 2004). One of the ideas is to transform a state space that is difficult to search into one that is easy to explore. Another idea is to estimate and utilize backbone structures to guide a search. These techniques are characterized as backbone guided search (Zhang, Artificial Intelligence, 158:1-26, 2004; Zhang and Looks, Proc. IJCAI-2005).
The Zhang Algorithm for the ATSP: The asymmetric TSP (ATSP) is an important NP-hard problem with many real applications in the areas such as scheduling. The research largely focuses on efficient approximation algorithms. The first approximation method is the Kanallakis-Papadimitiou local search algorithm, named after its inventors, two eminent scientists in computer science and operations research. The second approach is represented by my truncated DFBnB algorithm (Zhang, Proc. AAAI 1993 Spring Symp. on AI and NP-Hard Problems; Zhang, State-Space Search, Springer, 1999), which has been referred to as the Zhang algorithm (Johnson, et al.in The Traveling Salesman Problem and its Variations, pp.445-88, 2002).
Long distance mutual exclusion and propositional planning: A dominating technique in planning is GraphPlan with mutual exclusions (mutex) to represent constraints among actions and events at the same planning horizon (depth). In our work, we introduced long distance mutual exclusion (londex) to capture constraints across multiple horizons, developed an efficient algorithm to efficiently compute londex, and designed and implemented our new MaxPlan planner using londex (Xing, Chen and Zhang, Proc. ICAPS-06; Chen, Huang, Xing and Zhang, Artificial Intelligence, 173:365-91, 2009). The overall work provided a new paradigm for planning. Our MaxPlan system won the First Place Award of the Fifth International Planning Competition in 2006 (Xing, Chen and Zhang, Proc. ICAPS-06).
Transition based encoding for planning: The technique of planning as Boolean satisfiability usually uses encodings in logic statements. We introduced a novel SAT encoding scheme (SASE) based on the SAS+ formalism (Huang, Chen and Zhang, Proc. AAAI-10; Huang, Chen and Zhang, JAIR, 43:293-328, 2012). The new scheme exploits the structural information so that the encoding is both compact and efficient for planning. We proved the correctness of the new encoding and analyzed the rationale behind its improvement in planning performance. Our extensive experimental results demonstrated significant improvements of SASE over the state-of-the-art methods in time and space. This work received the Outstanding Paper Award of the 24-th AAAI Conference on Artificial Intelligence Conference in 2010 (AAA-10) (Huang, Chen and Zhang, Proc. AAAI-10), chosen from a double blind review process of 894 submissions. Note that AAAI is one of the two leading international AI conferences.