

Research  Biology, Genetics and Genomics
The work in the Zhang lab in
this broad field is mainly computational and falls into the category of Data
Sciences (or Big Data). We aim to find and understand biological “stories”
through building models and identifying patterns from large quantities of
genomescale biological data. In the process, we integrate the bench work of
molecular biology and develop novel methods and efficient software tools along
the way. In this broad area, we pursue two overarching themes, network
systems biology and gene regulation via noncoding RNA genes. We are
interested in basic biological questions (e.g., the biogenesis of small
noncoding RNAs, transcriptional and posttranscriptional gene regulation) and
their applications to complex diseases (e.g., Alzheimer’s disease (AD) and
psoriasis) and plant stress response (e.g., rice blast and rice bacterial
blight).
Networkbased,
allelespecific genomewide association study (GWAS): GWAS and singlegene approaches have
been popular in identifying risk factors of Mendelian diseases, such as
familial Alzheimer's disease (AD). Despite a tremendous amount of effort,
however, these methods fail to decipher the missing inheritability of complex
diseases, such as lateonset AD (or simply AD), diabetes, and psoriasis. A key
factor that impairs conventional GWAS on polygenic diseases is the lack of
effective ways to identify multiple genomic loci or genes whose interactions or
associations underlie the manifestation of disease phenotypes. For example, a
bruteforce search of possible associations among only three markers within one
million single nucleotide polymorphisms (SNPs) requires examining a prohibitive
number of 1.7×10^{17} trios. In the Zhang lab, we took a network
perspective toward this problem and focused on developing novel enabling
techniques for network GWAS. One network GWAS method recently developed in the
lab is called BlocBuster (Climer, Templeton and
Zhang, PLOS Comp. Biol., 10(9):e1003766, 2014), which
exploits the fact that there exist genetic architectures of markers underlying
a complex trait. Such architectures are best represented as a network where
correlated markers can be regarded as forming network modules or communities.
We also used a new similarity measure to accommodate genetic heterogeneity in
the population in order to make the network GWAS analysis allelespecific,
robust, and rigorous. One application of the network GWAS method to the HapMap population SNPs data revealed a giant 284SNPs
divergent yinyang haplotype pattern encompassing gephyrin,
a gene with remarkably diverse functions related to translation initiation,
regulation of synaptic protein synthesis, and mTOR
signaling, to name a few (Climer, Templeton and
Zhang, Nature Comm., 6:6534, 2015; see the first figure
below), suggest a strong positive selection on the region over gephyrin, rapid evolution of world populations, and
possible distinct molecular mechanisms involving gephyrin.
The discovery of this large yinyang pattern, which was missed by all existing
methods, demonstrates the power of the network GWAS approach. The second application
of the network method to GWAS data of psoriasis, an autoimmune skin disorder,
helped reveal a module of 17 tightly correlated SNPs within the Major
Histocompatibility Complex (MHC) on human chromosome 6 (Climer, Templeton and
Zhang, PLOS Comp. Biol., 10(9):e1003766, 2014; see also the
second figure below). While some of the 17 SNPs have been indicated in previous
studies, our result showed that their specific alleles as a whole provide a
stronger discriminative power to distinguish psoriasis subjects from normal
controls. In short, the Zhang lab is the first to systematically pursue the
notion of network GWAS to push the envelopment of conventional GWAS to a new dimension and to
develop an effective enabling technique to approach complex disease.
Networkbased
transcriptome analysis: Manifestation
of complex diseases or traits is often the result of aberrant gene expression
variations in disease states. Analysis of genomewide gene expression
variations, or transcriptome analysis, can thus provide insights into disease
mechanisms. In the same vein as network GWAS, the Zhang lab has taken a
coexpression network approach to transcriptome analysis of complex diseases by
exploiting the rationale of “guilty by association”. One realization of this
idea is the HQCut method for constructing and
analyzing coexpression networks using gene expression data (Ruan,
Dean and Zhang, BMC Syst. Biol., 4:136, 2010; Ruan
and Zhang, Physics Review E, 77:016104, 2008). The central piece of
the method is a heuristic that combines a spectral clustering technique that
maps gene expression data to a latent, lower dimensional space, by which the
noise of low information content in the data can be effectively removed. As a
result, clustering can be more accurately and efficiently performed. This
network transcriptome method has been used in our own research and by many
other labs. In particular, using this method, we were able to unravel intricate
relationships among coexpressed genes in AD, diabetes, and cardiovascular
diseases (Ray, Ruan and Zhang, Genome Biol., 9(10):R148, 2008),
showing that many genes that are known to be involved in these metabolic
diseases were also aberrantly expressed in the AD brain and more interestingly,
appeared in one module of coexpressed genes (see the module and genes in
yellow in the figure below). This result support the
notion that AD is a "diabetes" of the brain. In addition, using this
method we identified brainregion specific patterns of gene expression
variation in the AD brain (Ray and Zhang, BMC Syst.
Biol., 4:136, 2010) and revealed the genetics of gene expression (or eQTL) in AD (Webster, et al., Am. J.
Human Genetics, 84(4):44558, 2009). In short, our work on network
biology has clearly demonstrated the power and rigor that computational biology
can bring to complex biology problems.
Biogenesis
of small noncoding RNAs: Gene
expression variations across disease states and the normal condition are often
due to aberrant gene regulation where small noncoding RNAs (sncRNAs),
particularly microRNAs (miRNAs) and small interfering RNAs (siRNAs), play
essential roles. Identification of sncRNAs and
analysis of their biogenesis and regulatory functions are key steps toward
elucidation of aberrant gene regulation. In my lab, we have developed effective
methods for identification of miRNAs and siRNAs as well as methods for
analyzing their biogenesis and potential functions in gene regulation and
diseases. In particular, we developed an innovative method to show that most
miRNAs in animals (e.g., human and mouse) and plants (e.g., rice and
Arabidopsis) are transcribed by RNA polymerase II (Zhou,
Ruan, Wang and Zhang, PLOS Comp. Biol.,
3(3):e37, 2007). The analysis was done by first finding cisregulatory
elements (motifs) in the promoter regions of miRNAs using our WordSpy genomewide motif finding method (Wang
and Zhang, Genome Biol., 7(6):R49, 2006), constructing a model of
miRNA transcription using the motifs, and then applying the model to
distinguish miRNAs from proteincoding genes. By combining large quantities of
sequencing data with bench experiments, we also showed that a miRNA precursor
might generate multiple miRNAs (Zhang,
et al., Genome Biol., 11(8):R81, 2010; see an example of miR822 in
Arabidopsis in the first figure below) or a miRNA plus a siRNA that regulates
DNA methylation, suggesting a model of dual biogenesis of miRNA and siRNA (Chellappan,
et al., Nucleic Acids Res., 38(20):688394, 2010; see the
second figure below)
Moreover, microRNAs may also
arise from the same gemomic loci as other noncoding
RNAs, e.g., tRNAs and snoRNAs.
Combining computational and bench molecular biology work, we identified and
confirmed that miRNAs can originate from the loci where tRNAs
were encoded in mouse and murine herpesvirus (Reese, et al., J. Virology,
84(19):103453, 2010; see the first figure below for some examples in
mouse). In addition, our extensive computational analysis of large quantities
of profiling data revealed that there exist a huge number of isoforms of
microRNAs in many eukaryotic organisms, including human (Xia and Zhang, Nucleic
Acids Res., 42(3):142741, 2014; also see the second figure below for
an example), some of which may also be functional in complex diseases such as
psoriasis (Xia, Joyce,
Bowcock and Zhang, Human Mol. Genetics, 22(4):73748, 2013). In
summary, our work has broadened our view on the biogenesis, diversity and
function of small noncoding RNAs.
Regulatory
functions of small noncoding RNAs: We
expanded our work to examine the potential roles of sncRNAs
in complex disease and plant stress response. For example, using the miRNA
transcriptome modeling as a tool, we were able to identify cold and UVBlight
responsive miRNAs in model plant Arabidopsis thaliana, most of which were also
experimentally validated (Zhou,
Wang and Zhang, Mol. Syst. Biol., 3:13, 2007; Zhou, et al., Biochim. Biophys. Acta, 1779(11):7808, 2008). This is the first
computational miRNA gene function annotation method by modeling the possible
transcriptional regulators of microRNAs and their potential targets (see the
first figure below for an example of putative Auxin signaling pathways mediated
by microRNAs in response to UVB light stress). Furthermore, combining
deepsequencing profiling, computational analysis, and experimental validation,
we characterized the functions of microRNAs in psoriatic and normal human skin
(Joyce and et al., Human
Mol. Genetics, 20(20):402540, 2011; see some microRNA aberrant
expression patterns in the second figure below and microR135b expressed in
psoriatic skin in the third figure below), in viral infection (Reese, et al., J.
Virology, 84(19):103453, 2010), and plant stress responses, for
example, in Aribidopsis (Zhang, et al., Human
Mol. Genet., 20:402540, 2011) and cassava (Zeng, et al., Nucleic
Acids Res., 38(3):98195, 2010. Our work has broadened our view of the
biogenesis, diversity and functions of sncRNAs in
diseases and plants. In summary, we have done some pioneering work in
developing and using computational methods to study basic biological problems
related to small noncoding RNA gene regulation that were traditionally
approached through bench work.
Research  Machine
learning/datamining
In recent years, in order to
support the research activities in computational biology and genomics, we have
been focusing on developing effective computational methods for finding and
analyzing modular structures in large (biological) networks.
Identification
of network modules in complex networks: The
network GWAS and coexpression network approaches described above rely heavily
on accurately and efficiently finding network modules. Nevertheless, finding
structures in large networks is a challenging problem in machine learning and
datamining. First, network structural properties can be characterized in
several different ways, by node connectivity, by properties on links (such as
relationships among connected individuals), or by semantics of nodes and links
(such as the social roles that individuals play in a society network). In our
research, we considered these different types of information for network module
finding and developed effective novel methods for finding modules of nodes (Jin,
Chen, He and Zhang, Proc. AAAI15), modules of links (He,
Liu, Jin and Zhang, Proc. AAAI15), and modules that are defined
by network structural information and node semantics (He, et al., Proc AAAI17). We adopted a few different
machine learning techniques in developing our methods, including nonnegative
matrix factorization (Wang, et al., Proc.
AAAI16), stochastic modeling (He,
Liu, Jin and Zhang, Proc. AAAI15), deep learning (Yang, et
al., Proc. IJCAI16), and Markov Random Fields (He, et
al., Proc. AAAI18).
Applications
of network module analysis: One of the applications of this line
of research is the networkbased GWAS, as discussed in (Climer,
Templeton and Zhang, PLOS Comp. Biol., 10(9):e1003766, 2014) and (Climer,
Templeton and Zhang, Nature Comm., 6:6534, 2015). Another
area of application is networkbased transcriptome analysis for elucidation of
gene regulation, which has been studied in the Zhang lab, as discussed in (Ray, Ruan
and Zhang, Genome Biol., 9(10):R148, 2008), (Ruan, Dean and Zhang, BMC Syst. Biol., 4:136, 2010)
and (Ruan
and Zhang, Physics Review E, 77:016104, 2008).
Research
interests  Heuristic search, planning and optimization
The Zhang has a long history of
research on heuristic search, planning and optimization. The work has made
significant contributions to the understanding of the nature and complexity of
many search techniques, such as bestfirst search and depthfirst
branchandbound, and combinatorial optimization problems, such as the
Traveling Salesman Problem and Constraint Satisfaction.
Asymptotic
optimality of linearspace search algorithms: Heuristic search is one of the
cornerstone of AI. The bestfirst search (BFS) algorithm, with the wellknown
A* algorithm as a special case, is time optimal. However, it requires space
that is exponential in the search depth of the underlying search tree, which is
impractical for large combinatorial optimization problems, such as the
Traveling Salesman Problem (TSP). In contrast, a linearspace search (LSS)
algorithm, with depthfirst search and depthfirst branchandbound (DFBnB) as special cases, runs linearly in search depth with
a penalty of exploring more nodes in the search space. Due to their limited
space requirement, LSS algorithms, including DFBnB,
iterative deepening A* (IDA*), and recursive bestfirst search (RBFS), are the
most popular techniques for large and difficult search problems in practice.
Understanding the expected complexity of LSS algorithms was an important and
challenging problem.Based on a random search tree
model proposed by Judea Pearl and Richard Karp, our research established the
asymptotic optimality of LSS methods, showing that their
expected complexity is a constant factor of that of BFS (Zhand and Korf,
Artificial Intelligence, 79(2):24192, 1995; Zhang, StateSpace
Search, Springer, 1999). Moreover, LSS algorithms may run faster than
BFS on large problems, making them the algorithms for real applications.
A directly and early application
of our complexity results is the explanation to anomaly of the fixed horizon
search, which was observed initially by Richard Korf
on slidingtile puzzles using DFBnB, as illustrated
below. The anomaly states that for a fixed search depth, DFBnB
can search the trees with larger branching factors faster. In fact, the anomaly
persists regardless which search method, e.g., ID (or IDA*) or RBFS, shown
below.
The insight gained by the
complexity analysis on this search anomaly helped predict the existence of a
phase transition in heuristic search problems, discussed below.
Phase
transitions and structures of combinatorial search: Our study also revealed that the
expected complexity of BFS and LLS algorithms exhibits a sharp and abrupt
transition from polynomial to exponential time as some critical problem features
(e.g., the precision of heuristic function) shift (Zhand and Korf,
Artificial Intelligence, 79(2):24192, 1995; Zhang and Korf, Artificial Intelligence, 81(12):22339, 1996;
Zhang, StateSpace
Search, Springer, 1999). This is similar to phase transitions of spin
glasses studied in physics. A simple example of a phase transition is water
changing from liquid to solid ice when the temperature drops below the freezing
point. We identified that the accuracy of heuristic information is the
controlling parameter of phase transitions. On the Judea&Karp
incremental random search tree model, the phase transition can be summarized by
the following figure.
The control parameter of this phase transition is the product of the mean branching factor of the search tree and the probability if an edge cost has cost 0.
This analysis was extended to reveal phase transitions in combinatorial optimization problems, such as the asymmetric TSP (ATSP) (Zhang and Korf, Artificial Intelligence, 81(12):22339, 1996) and Boolean Satisfiability (Zhang, Artificial Intelligence, 158:126, 2004). For the ATSP, the control parameter is the precision of the distances between cities, which is equivalent to the number of digits used to represent distances. Intuitively, when the number of digits used is small, many distances may have a high chance to have the same value, and consequently many complete tours (visiting each and every city exactly once) may have a high chance to have the same cost as well. This also means that a large number of optimal TSP tours exist; and finding one optimal TSP tour should be relatively easy. In contrast, when no two distances between cites are the same, the number of optimal TSP tour is small, so that finding an optimal solution is difficult. This phase transition is illustrated in the figure below, where the Assignment Problem (AP) is the main computation step (heuristic) for solving the ATSP.
Interestingly, there also exist
phase transitions in maximum Boolean satisfiability (max SAT), which is the
optimization version of the SAT where the objective is to find a variable
assignment that minimize the number of violated clauses (Zhang,
Artificial Intelligence, 158:126, 2004).
Note that the phase
transitions that we studied in optimization problems, such as finding the
minimal TSP tour and the variable assignment to minimize the number of
unsatisfied constraints, are fundamentally different from the phase transitions
for decision problems, such as the decision version of the TSP and the SAT. For
optimization problems, the phase transitions appear as an easytodifficult
transition, whereas for decision problems, the phase transitions exhibit an
easytodifficultthentoeasy pattern.
The computational complexity
and phase transition are also related to the backbone of a problem, which is a
set of variables that have the same values among all solutions (Zhang, Proc. Intern. Conf.
on Principles and Practice of Constraint programming, CP2001; Zhang, JAIR, 20:47197, 2004;
Zhang,
Artificial Intelligence, 158:126, 2004).
Problem
solving by exploiting phase transitions and backbones: The study of phase transitions and problem
structures such as backbones has changed the way we characterize the complexity
of combinatorial problems, beyond the notion of worstcase complexity, and the
way we solve difficult problems. In particular, phase transitions can not
only be used to characterize problems, but also be exploited in problem
solving. We have done a pioneering work on developing new efficient
algorithms for finding optimal and approximate solutions by exploiting phase
transitions (Pemberton
and Zhang, Artificial Intelligence, 81:297325, 1996; Zhang,
Artificial Intelligence, 158:126, 2004). One of the ideas is to
transform a state space that is difficult to search into one that is easy to
explore. Another idea is to estimate and utilize backbone structures to guide a
search. These techniques are characterized as backbone guided search (Zhang,
Artificial Intelligence, 158:126, 2004; Zhang and
Looks, Proc. IJCAI2005).
The
Zhang Algorithm for the ATSP: The
asymmetric TSP (ATSP) is an important NPhard problem with many real
applications in the areas such as scheduling. The research largely focuses on
efficient approximation algorithms. The first approximation method is the KanallakisPapadimitiou local search algorithm, named after
its inventors, two eminent scientists in computer science and operations
research. The second approach is represented by my truncated DFBnB algorithm (Zhang,
Proc. AAAI 1993 Spring Symp. on AI and NPHard
Problems; Zhang,
StateSpace Search, Springer, 1999), which has been referred to as the
Zhang algorithm (Johnson,
et al.in The Traveling Salesman Problem and its Variations, pp.44588,
2002).
Long
distance mutual exclusion and propositional planning: A dominating technique in planning is GraphPlan with mutual exclusions (mutex)
to represent constraints among actions and events at the same planning horizon
(depth). In our work, we introduced long distance mutual exclusion (londex) to capture constraints across multiple horizons,
developed an efficient algorithm to efficiently compute londex,
and designed and implemented our new MaxPlan planner
using londex (Xing,
Chen and Zhang, Proc. ICAPS06; Chen,
Huang, Xing and Zhang, Artificial Intelligence, 173:36591, 2009).
The overall work provided a new paradigm for planning. Our MaxPlan
system won the First Place Award of the Fifth International Planning
Competition in 2006 (Xing,
Chen and Zhang, Proc. ICAPS06).
Transition
based encoding for planning: The
technique of planning as Boolean satisfiability usually uses encodings in logic
statements. We introduced a novel SAT encoding scheme (SASE) based on the SAS+
formalism (Huang,
Chen and Zhang, Proc. AAAI10; Huang, Chen and
Zhang, JAIR, 43:293328, 2012). The new scheme exploits the
structural information so that the encoding is both compact and efficient for
planning. We proved the correctness of the new encoding and analyzed the
rationale behind its improvement in planning performance. Our extensive
experimental results demonstrated significant improvements of SASE over the stateoftheart
methods in time and space. This work received the Outstanding Paper Award
of the 24th AAAI Conference on Artificial Intelligence Conference in 2010
(AAA10) (Huang,
Chen and Zhang, Proc. AAAI10), chosen from a double blind review
process of 894 submissions. Note that AAAI is one of the two leading
international AI conferences.