Non-homogeneous dynamic Bayesian networks with edge-wise sequentially coupled parameters

Abstract Motivation Non-homogeneous dynamic Bayesian networks (NH-DBNs) are a popular tool for learning networks with time-varying interaction parameters. A multiple changepoint process is used to divide the data into disjoint segments and the network interaction parameters are assumed to be segment-specific. The objective is to infer the network structure along with the segmentation and the segment-specific parameters from the data. The conventional (uncoupled) NH-DBNs do not allow for information exchange among segments, and the interaction parameters have to be learned separately for each segment. More advanced coupled NH-DBN models allow the interaction parameters to vary but enforce them to stay similar over time. As the enforced similarity of the network parameters can have counter-productive effects, we propose a new consensus NH-DBN model that combines features of the uncoupled and the coupled NH-DBN. The new model infers for each individual edge whether its interaction parameter stays similar over time (and should be coupled) or if it changes from segment to segment (and should stay uncoupled). Results Our new model yields higher network reconstruction accuracies than state-of-the-art models for synthetic and yeast network data. For gene expression data from A.thaliana our new model infers a plausible network topology and yields hypotheses about the light-dependencies of the gene interactions. Availability and implementation Data are available from earlier publications. Matlab code is available at Bioinformatics online. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
One of the key objectives of computational systems biology is to learn the structure of protein activation pathways and gene regulatory networks. With the work of Friedman et al. (2000), dynamic Bayesian networks (DBNs) have become a popular tool for learning networks from data. However, DBNs are homogeneous linear models that in some applications cannot satisfactorily approximate the complexity of real gene regulatory interaction relationships. Hence, there can be gains from more flexible network reconstruction models. For example, in cellular networks the strengths of the regulatory interactions can depend on unobserved cellular conditions that are not constant in time, so that the application of a homogeneous model (DBN) would be suboptimal. For modelling time-varying regulatory networks many non-homogeneous DBNs (NH-DBNs) have been proposed in the literature. Those NH-DBN models can be divided into two conceptual groups: (i) NH-DBNs that only allow the network parameters to vary in time (see references below) and (ii) NH-DBNs that allow even the network structure to be timedependent (see, e.g. Husmeier et al., 2010;Lèbre et al., 2010;Robinson and Hartemink, 2010). The latter group (ii) offers great model flexibility, but faces a practical and a conceptual problem. The practical problem is potential model over-flexibility. Time series in systems biology are typically rather short and NH-DBNs divide them into even shorter segments. Learning different network structures for short segments that contain a few data points only is a challenging task and likely to lead to inflated inference uncertainty. The conceptual problem is related to the very premise of a flexible network structure. This assumption is surely reasonable for some scenarios, like morphogenesis. See, for example, the application to morphogenesis and muscle growth in D.melanogaster in Robinson and Hartemink (2010), where the gene expression time series cover the embryonic, larval, pupal and adult life phase of the fruit fly. Obviously, a gene regulatory network in an embryonic fruit fly can change during growth to maturity and eventually have another structure with different gene interactions in an adult fruit fly. However, for cellular processes on a short time scale, it is questionable whether the network structure can vary over time. By convention, an edge from gene Z i to gene Z j in a gene regulatory network indicates that gene Z i codes for a transcription factor that can bind to the promoter of gene Z j , so as to initiate its transcription. This biological ability to bind is unlikely to change within a short time period. In a short period of time, only the extent of binding is likely to be influenced by changing external factors (e.g. cellular conditions), so that only the strength of the regulatory effect can vary over time.
We therefore focus on NH-DBNs of group (i), where the network structure is assumed to be time-invariant. In particular, this assumption is more realistic for our two real-world applications to S.cerevisiae (yeast) and to A.thaliana (plant) gene expression data. In the metabolism-related gene regulatory network in yeast (Section 5.2) the strengths of the regulatory interactions depend on the medium, in which yeast is cultured (galactose and glucose). In the circadian clock network in Arabidopsis (Section 5.3) the strengths of the gene regulatory interactions depend on the artificially generated dark: light cycles, to which the plants were earlier exposed.
The NH-DBN models infer the data segmentation, the joint network structure and the segment-specific interaction parameters altogether from the data. As already pointed out above, in typical applications, those NH-DBNs divide a short time series into even shorter segments. Learning the network parameters for each segment separately can then also lead to inflated inference uncertainty. Therefore models that allow for gradual adaptions of the network interaction parameters have been proposed. The TESLA method (Ahmed and Xing, 2009;Kolar et al., 2010) makes use of L1-regularized regression models ('LASSO') for the network parameter inference, and it employs a second L1 regularization term to penalize dissimilarities between the network parameters of neighbouring segments. Inference is based on a penalized maximum likelihood approach, and the regularization parameters can be optimized by the Bayesian information criterion (BIC) or cross-validation. TESLA even allows the network structure to be time-dependent. But as changing network structures yield large L1 penalties, the network structure is encouraged to stay similar.
The NH-DBN model from Grzegorczyk and Husmeier (2012) uses Bayesian hierarchical regression to sequentially couple the parameters. The resulting coupled NH-DBN can be seen as a Bayesian counterpart of TESLA. In the simulation study by Aderhold et al. (2014) the Bayesian NH-DBN yielded better network predictions than TESLA.
It has also been shown that parameter coupling leads to significantly improved network predictions when the segment-specific parameters are similar (Grzegorczyk and Husmeier, 2012). However, our empirical results in Section 5.1 show that coupling can be counter-productive when the segment-specific parameters are dissimilar.
The disadvantage of all proposed coupling schemes is that they have been designed such that they can only couple all interaction parameters simultaneously. If a node Z k is regulated by two nodes, Z i ! Z k Z j , then the parameters for both edges are coupled with the same coupling strength. But the effect of Z i on Z k could stay similar, while the regulatory effect of Z j on Z k could be subject to significant temporal changes.
Given the complexity of the interactions in gene regulatory networks, it might thus be useful to add more flexibility to the models. In this paper we therefore propose a new consensus model with an edge-wise coupling (EWC) scheme. Unlike the coupled NH-DBN, the new EWC NH-DBN does not enforce coupling. Instead it follows the Bayesian paradigm: 'Let the data speak.' and infers for each individual edge (edge-wise) if the corresponding interaction parameter should be coupled or not.
The EWC NH-DBN has the uncoupled and the coupled NH-DBN as limiting cases and it can infer an appropriate trade-off between them. In addition, the EWC NH-DBN can also shed more light onto the robustness of the individual regulatory interactions. Instead of enforcing a priori that either all edges are coupled or that all edges are uncoupled, it infers for each individual edge whether it should be coupled or better stay uncoupled. From a biological perspective, one can conclude that an uncoupled edge is sensitive to external factors, as the interaction parameter (i.e. the strength of the regulatory effect) varies over time. On the other hand, the interaction parameter of a coupled edge stays (rather) stable, so that the strength of the regulatory effect is not (or only minimally) influenced by external factors. For the circadian clock network in Arabidopsis this feature of the EWC NH-DBN can lead to important new insights. One of the objectives of computational plant biology is to derive a faithful description of the circadian clock network in terms of coupled differential equations (DEs); see, e.g. the work by Pokhilko et al. (2013). The diurnal rhythm of the circadian clock network is caused by the actual (or entrained) daily dark:light cycles, as some of the gene interactions are intensified or alleviated by the presence (or expectation) of light. The DE models therefore typically contain an additional light variable that has an effect on some of the regulatory interactions. For an overview of different network hypotheses from the plant biology literature, we refer to Figure 12 in Aderhold et al. (2014). In this overview figure a 'sun symbol' is used to indicate the effects of light within the different circadian clock network hypotheses. Because of the computational costs, the space of all possible network structures cannot be systematically searched with DEs. In typical studies, based on prior knowledge a few novel network structure hypotheses are proposed and then compared with earlier published network hypotheses. As the computational costs allow only a few new hypotheses to be included, the new network structures must be carefully selected and it must also be carefully decided which gene interactions are supposed to be affected by the presence of light (see, e.g. Pokhilko et al., 2013).
Unlike DEs, NH-DBNs can be used to learn the complete network structure from scratch, and thus help generating new hypotheses about it. Unlike all earlier proposed NH-DBNs, the new EWC NH-DBN employs an edge-wise coupling concept and can distinguish between regulatory effects that are stable (coupled) and regulatory effects that are unstable (uncoupled). In the circadian clock, the instability of an edge suggests that the corresponding gene interaction is likely to be light-dependent. This knowledge about the (in-)stability of the regulatory interactions is therefore useful information for subsequent DE modelling approaches. It can be used as prior knowledge when deciding about the light dependency of the edges of a newly proposed DE-based network hypothesis.
In our recent work (Shafiee Kamalabad et al., 2019), we have proposed a partially non-homogeneous DBN for learning networks from a collection of datasets that have been measured under different experimental conditions. The model assumes the data segmentation to be known (one segment per condition), and then treats the segments as interchangeable units. The EWC NH-DBN focuses on network time series with unknown segmentations. Unlike the earlier model, the EWC NH-DBN infers the segmentation from the data, and then uses the temporal order of the segments. Given the order, coupling can be applied sequentially, so that every segment receives information from the preceding one. This allows for gradual/smooth temporal adaptions of the parameters. Another conceptual difference is that the earlier model is partially non-homogeneous, while the EWC NH-DBN is strictly non-homogeneous with an edgewise sequential information-coupling scheme for the interaction parameters.
We now briefly return to the work by Friedman et al. (2000), in which DBNs were proposed for learning gene networks. Since then DBNs have become a popular tool for network learning, although they are based on two simplifying assumptions, namely that the interactions are homogeneous and linear. For gene regulatory interactions, both assumptions can be too restrictive. Above we have discussed model extensions that relax the homogeneity assumption, but none of those methods makes an attempt to relax the linearity assumption. In a complementary line of research, authors have proposed methods that keep the homogeneity assumption but relax the linearity assumption, so that homogeneous non-linear gene interactions can be inferred. For example, Oates and Mukherjee (2012) have added quadratic and interaction terms to the design matrices of linear models. Other non-linear methods make use of Gaussian process regression (Ä ijö and Lä hdesmä ki, 2009), non-parametric additive models (Henderson and Michailidis, 2014) or faithful descriptions of the gene interaction kinetics in form of differential equations (Aderhold et al., 2017;Oates et al., 2014). We briefly describe these methods in Section 2.7 and we also compare the performance of the EWC NH-DBN with them. We illustrate the conceptual difference between non-homogeneous linear and homogeneous non-linear models in Supplementary Material Part I. To the best of our knowledge, no non-homogeneous non-linear method has been proposed yet.

The new edge-wise coupling (EWC) scheme
Consider a Bayesian piece-wise linear regression model with Y being the response and p ¼ fX 1 ; . . . ; X k g being a set of covariates. We assume that the data points have a temporal order and can be divided into disjoint segments h 2 f1; . . . ; Hg, where each segment h has specific regression coefficients, b h ¼ ðb h;0 ; . . . ; b h;k Þ T . Let y h be the vector of the response values and X h be the design matrix for segment h, where each X h includes a first column of 1's for the intercept. For each segment h we use a Gaussian likelihood: where I denotes the identity matrix, and r 2 is the noise variance parameter, which is shared among segments. We impose an inverse Gamma prior on r 2 ; r À2 $ GAMða r ; b r Þ, and a Gaussian prior on b 1 : where 0 :¼ ð0; . . . ; 0Þ T . Onto the 'signal-to-noise ratio parameter for uncoupled regression coefficients', k u , we also impose an inverse Gamma distribution, k À1 u $ GAMða u ; b u Þ. Re-employing r 2 in Equation (2) yields a fully conjugate prior in both b 1 and r 2 , so that the marginal likelihood pðy 1 jk u Þ can be computed (see, e.g. Gelman et al., 2004). The posterior distribution of b 1 is: where C 1 ¼ ð½k u I À1 þ X T 1 X 1 Þ À1 , andb 1 ¼ C 1 X T 1 y 1 is the posterior expectation of b 1 . If we use the same Gaussian prior for all segments b h jðr 2 ; k u Þ $ N ð0; r 2 k u IÞ ðh ¼ 1; . . . ; HÞ we obtain an uncoupled model. The only information exchange among segments is then be w.r.t. the shared parameters r 2 and k u . If we use the posterior expectationb h as prior expectation for b hþ1 : we obtain a sequentially coupled model. The parameter k c is then a 'coupling strength parameter', for which we could assume an inverse Gamma distribution, k À1 c $ GAMða c ; b c Þ. 'Coupling' here means that b hþ1 is coupled to the posterior expectationb h of b h . Low values k c yield peaked priors in Equation (5), so that the vectors b h and b hþ1 are enforced to be similar (¼coupled). Dissimilar regression coefficients can only be obtained for large values of k c , i.e. for vague priors in Equation (5). The shortcoming is that there is no distinction between the individual regression coefficients: they are all coupled with the same coupling strength (via the parameter k c ).
In this paper, we propose a new model that infers a consensus between Equations (4 and 5). The new NH-DBN infers from the data which regression coefficients stay similar over time (and should be coupled) and which regression coefficients change significantly over time (and should stay uncoupled). In each segment the uncoupled regression coefficients will be re-initialized non-informatively with a prior expectation of 0 and the corresponding prior variance will depend on the signal-to-noise ratio parameter k u from Equation (4) rather than on the coupling strength parameter k c from Equation (5). We refer to the new model as the edge-wise coupled (EWC) NH-DBN.
To distinguish between coupled and uncoupled regression coefficients, we introduce a vector of indicator variables d ¼ ðd 0 ; . . . ; d k Þ whose elements are binary variables d i 2 f0; 1g: d 0 corresponds to the intercept, and d i (i ! 1) refers to the ith covariate X i . d i ¼ 1 indicates that the ith regression coefficients b 1;i ; . . . ; b H;i are coupled, while d i ¼ 0 indicates that they are uncoupled. We introduce the new Gaussian prior: (6) where is the Hadamard product ('elementwise multiplication'), diagfxg denotes a diagonal matrix whose diagonal elements are the elements of the vector x, and 1 :¼ ð1; . . . ; 1Þ T . As the covariance matrix in Equation (6) whereb h;i is the ith element of the posterior expectationb h . The new prior yields a consensus between an uncoupled and a coupled model: . . . ; d k Þ from the data, so as to find an appropriate trade-off between the two limiting models.
A priori we assume d 0 ; . . . ; d k to be independently Bernoulli distributed: d i $ BERðpÞ ði ¼ 0; . . . ; kÞ. The parameter p could also be assumed to have a Beta hyperprior, p $ BETAða; bÞ. For our applications the extension, p $ BETAð1; 1Þ, did not lead to improvements over p ¼ 0.5. Figure 1 shows a graphical model of the EWC NH-DBN, indicating the relationships within and between segments. For the posterior distribution we have:  (3). For h > 1 we set: so that the priors take the form: We obtain: The noise variance parameter, r 2 , can be re-sampled via a collapsed (C) Gibbs sampling step, where the regression coefficients, b 1 ; . . . ; b H , have been integrated out: where T is the total number of data points from all response vectors, and where l h and R h were defined in Equation (9). In Equation (9) we For k 2 u and k 2 c we obtain the full conditional distributions: so that k u and k c are the numbers of uncoupled and coupled regres- For the marginal likelihood, with b 1 ; . . . ; b H and r 2 integrated out, we get (Bishop, 2006): where D 2 and R h (h ¼ 1; . . . ; H) were defined above. For the elements of the vector d ¼ ðd 0 ; . . . ; d k Þ we get: where h i ¼

Learning the covariate set and the data segmentation
In typical applications, the covariate set and the data segmentation are unknown and have to be inferred from the data. Let D denote a time series of equidistant data points, indexed t ¼ 1; . . . ; T. Each data point D t contains a response observation y t and the observations x t;1 ; . . . ; x t;n of n potential covariates. We assume all covariate sets p & fX 1 ; . . . ; X n g to be equally likely a priori, subject to the 'fan-in constraint': jpj 3. As prior on the number of segments H we take a Poisson distribution with parameter 1, H $ Poið1Þ. We then identify H segments with H-1 changepoints, s ¼ fs 1 ; . . . ; s HÀ1 g, on the set S :¼ f2; . . . ; T À 1g. Data point D t is assigned to the hth segment if and only if s hÀ1 < t s h , where s 0 :¼ 1 and s H :¼ T. We follow Green (1995) and assume the changepoints to be distributed like the even-numbered order statistics of L :¼ 2ðH À 1Þ þ 1 points, being uniformly distributed on S: Given p and s, the model from Section 2.1 can be applied. The changepoint set s yields a segmentation of the data into H segments with the response vector set y s :¼ fy 1 ; . . . ; y H g. The corresponding design matrices X 1 ; . . . ; X H are built using the values of the covariates in p. The parameters r 2 ; b 1 ; . . . ; b H , k u , k c and the elements of d can then be re-sampled from their FCDs; see Section 2.2. Given instantiations of k u , k c and d, Metropolis-Hastings moves on p and s can be designed. For each combination of covariate set p and changepoint set s we can employ Equation (11) to compute the marginal likelihood. We get: pðp; s; k u ; k c ; djDÞ / pðy s jk u ; k c ; d; p; sÞ Á pðpÞ Á pðsjHÞ Á pðHÞ Á pðdÞ Á pðk u Þ Á pðk c Þ For inference we implement a Reversible Jump Markov Chain Monte Carlo (RJMCMC) sampling scheme (Green, 1995). We use changepoint birth, death and re-allocation moves for sampling the changepoint set, s, and we use covariate addition, deletion and exchange moves for sampling the covariate set, p. We refer to Supplementary Material Part B for the mathematical details and pseudo-code of the RJMCMC algorithm. We then use RJMCMC simulations to generate a sample from the posterior distribution pðp; s; k u ; k c ; djDÞ. In each iteration we first re-sample the parameters r 2 ; b 1 ; . . . ; b H , k u , k c and d from their full conditional distributions (Gibbs sampling), before we perform Metropolis-Hastings moves on the covariate set p and on the changepoint set s. This way, a sample ðp ðwÞ ; s ðwÞ ; k ðwÞ u ; k ðwÞ c ; d ðwÞ Þ w¼1;...;W from the posterior distribution can be generated.

Learning dynamic networks via regression models
Consider a N-by-ðT þ 1Þ data matrix D whose rows correspond to N network variables Z 1 ; . . . ; Z N and whose columns correspond to equidistant time points t ¼ 1; . . . ; T þ 1. Let D i;t denote the value of Z i at t. The variables can then be identified with the nodes of a network, and we can learn how the variables interact with each other. Temporal data are conventionally modelled with dynamic Bayesian networks (DBNs), where all dependencies are subject to a time lag. An edge Z i ! Z j indicates that D j;tþ1 (Z j at t þ 1) depends on D i;t (Z i at t).
Because of this time lag, there is no acyclicity constraint in DBNs, so that (piece-wise) linear regression can be applied N times independently. In the jth linear regression model Y :¼ Z j is the response, and there are n :¼ N À 1 potential covariates: fX 1 ; . . . ; X n g :¼ fZ 1 ; . . . ; Z jÀ1 ; Z jþ1 ; . . . ; Z N g. Each data point D t (t ¼ 1; . . . ; T) contains a response value D j;tþ1 and the shifted values D 1;t ; . . . ; D jÀ1;t ; D jþ1;t ; . . . ; D N;t of the covariates.
The N individual covariate sets p 1 ; . . . ; p N for the responses Z 1 ; . . . ; Z N then describe a network: G :¼ fp 1 ; . . . ; p N g. There is an edge from Z i to Z j if and only if Z i 2 p j .

Network reconstruction
where j , and I i!j ðG ðwÞ Þ ¼ 0, otherwise. If the true network is known, we evaluate the network reconstruction accuracy with precision-recall curves. For each w 2 ½0; 1 we extract the nðwÞ edges whose scoresê i;j exceed w, and we count the number of true positives TðwÞ among them. Plotting the precisions PðwÞ :¼ TðwÞ=nðwÞ against the recalls RðwÞ :¼ TðwÞ=M, where M is the number of edges in the true network, gives the precisionrecall curve. We refer to the area under the curve as AUC value.
The RJMCMC convergence can be monitored in terms of potential scale reduction factors (PSRFs); see, e.g. Brooks and Gelman (1998). On each dataset we perform H ¼ 10 independent RJMCMC simulations we monitor the fraction of edges that fulfilled PSRF < 1.01. For some convergence diagnostics we refer to Supplementary Material Part C.

Related sequentially coupled NH-DBN models
We outline six alternative regression models, with which NH-DBNs can be built. Like the EWC model, the models can be applied to each variable separately to infer a network. The last two models M5-M6 have not been proposed in the literature yet. We propose them here as competitors. For a graphical overview, on how the models are related (see Fig. 2).

• M1: (HOMOGENEOUS) DBN
The conventional homogenous DBN, as discussed in many textbooks, has no changepoints, H ¼ 1. The regression coefficient vector b 1 applies to all data points. The EWC NH-DBN reduces to the coupled model when setting d ¼ 1. The priors of the regression coefficients are: b 1 jðr 2 ; k u Þ $ Nð0; r 2 k u IÞ and b h jðr 2 ; k c ;b hÀ1 Þ $ Nðb hÀ1 ; r 2 k c IÞ for h ! 2.
The underlying informationcoupling scheme is thus not edge-wise but segment-wise.

Alternative network reconstruction models
We also include some alternative network reconstruction methods. Like the NH-DBN models M1-M6, the models A1-A7 can also be applied to each variable separately to infer a network.
• A1: DBN 1 TRAFO Like the DBN (M1), but we include covariate transformations. Given the covariates p ¼ fX 1 ; . . . ; X k g, we add all quadratic X 2 i and all interaction X i X j (i 6 ¼ j) terms to the design matrix. We note that the idea is adopted from Oates and Mukherjee (2012). • A2: NH-DBN 1 TRAFO Like the uncoupled NH-DBN (M2), but we add quadratic and interaction terms to the segment-specific design matrices; see A1. • A3: TESLA TESLA is based on segment-specific L1-regularized linear regression and uses a second L1-regularization term to penalize dissimilarities between the regression coefficients of neighbouring segments. It can be seen as the frequentist counterpart of the coupled NH-DBN (M3). Inference is based on a penalized maximum likelihood approach, and the two regularization parameters have to be optimized. We apply 10-fold cross-validation (CV) with fine grids for the penalty parameters (0; 0:01; . . . ; 1). TESLA is the only method in our comparison that allows the network structure to change over time. For our simulations we use the Matlab software from Kolar et al. (2010). • A4: HMM NH-DBN This model from Grzegorczyk (2016) uses the priors of the uncoupled NH-DBN (M2), b h $ Nð0; r 2 k u IÞ, but unlike the M2 Fig. 2. Overview of the NH-DBNs from Section 2.6. For each model there is a plate that covers the plates of the models that are nested (as special cases) within it model it employs a more flexible hidden Markov model (HMM) to allocate the individual data points to the H components. For our simulations we use the Matlab software from Grzegorczyk (2016).
The following methods A5-A7 use the concept of gradient matching. For each gene, the gradients (temporal derivatives) are estimated (e.g. via finite differences) and then used as response values within nonlinear models.
• A5: CHEMA The CHEMA model from Oates et al. (2014) is a Bayesian model that employs differential equations, representing Michaelis-Menten kinetics, to explain the estimated gradients. For each response, the marginal likelihoods of all possible covariate sets are approximated, and the edge scores are obtained by marginalization over all covariate sets. We apply CHEMA in its improved variant (Aderhold et al., 2017) and use thermodynamic integration with 25 inverse temperatures for approximating the marginal likelihoods. For our simulations we use the Matlab software from Aderhold et al. (2017).

Implementation
For the inverse Gamma distributed parameters r 2 , k u and k c we select the shape and rate parameters: a r ¼ b r ¼ 0:005; a c ¼ a u ¼ 2 and b c ¼ b u ¼ 0:2, as in earlier works (Grzegorczyk and Husmeier, 2012;Lèbre et al., 2010). Pre-simulations with different settings showed robustness w.r.t. those hyperparameters. To ensure a fair comparison we use the same hyperparameters for the competing NH-DBNs. For the NH-DBNs we ruled out (autoregressive) self-loops, such as Z i ! Z i , so as to be consistent with earlier studies (Grzegorczyk and Husmeier, 2012;Grzegorczyk, 2016;Lèbre et al., 2010). Another reason is that self-loops can have negative effects on the network reconstruction accuracy, as empirically shown in Supplementary Material Part D.
For generating posterior samples, we run the RJMCMC algorithm for 100 000 (100k) iterations. We set the burn-in phase to 50k and we sampled every 100th graph during the sampling phase. We used potential scale reduction factors (PSRFs) to monitor convergence. For all datasets all PSRF's were below 1.01 after 100k iterations; see Supplementary Figure S1 in Supplementary Material Part C for two examples. The computational costs for 100k RJMCMC iterations are relatively low when a modern computer cluster can be used. The task to infer a network with N nodes can be subdivided into N independent regression tasks (cf. Section 2.4), so that N simulations can run in parallel. With our Matlab implementation for each regression model 100k iterations took a few minutes.
A detailed analysis of the computational costs is provided in Supplementary Material Part E. The main finding is that also networks with N ¼ 100 nodes can be inferred with satisfactory convergence rate when 6-12 h of computational time are invested. On a computer cluster the network inference task can be separated into N ¼ 100 regression tasks, each taking 3.6-7.2 min of computational time. It is impossible to give a concrete upper bound on the maximal network size that can be inferred with the EWC NH-DBN. Proper Bayesian model inference requires that the RJMCMC sampling algorithm converges, and the convergence rate strongly depends on the posterior landscape. For landscapes with many local optimal regions, convergence can be slowed down, so that even small network inference might become challenging. On the other hand, even for large networks the RJMCMC algorithm might converge rather quickly (e.g. when the posterior landscape is unimodal).

Synthetic network data
The RAF pathway, as reported in Sachs et al. (2005), has N ¼ 11 nodes and M ¼ 20 edges. We generate data consisting of H ¼ 4 segments with m data points each. For every node Z i (i ¼ 1; . . . ; 11) the parent nodes build the covariate set p i of the piece-wise linear regression model: where z k;t denotes the value of node Z k at time point t. We sample the noise values e i;t and the initial values z i;1 from independent Nð0; 0:05 2 Þ distributions. The regression coefficients are subject to temporal changes, and change after m data points, so that FðtÞ ¼ 1 þ bðt À 1Þ=mc. For each node Z i there are jp i j þ 1 regression coefficients with H ¼ 4 segment-specific values. For each segment h we summarize the jp i j þ 1 coefficients for response Z i in a vector b i;h ¼ ðb i;h;0 ; . . . ; b i;h;jpij Þ T . We sample the elements of b i;h (h ¼ 1; . . . ; 4) from standard N(0, 1) Gaussian distributions and then normalize the vectors to Euclidean norm one, i.e. for h ¼ 1; . . . ; 4: b i;h b i;h =jb i;h j. We distinguish four regression coefficient types. The four regression coefficients b i;1;j ; . . . ; b i;4;j for the edge Z j ! Z i can be: 1. 'coupled': We keep the regression coefficient fixed among segments. To this end, we replace: b i;h;j b i;1;j (h ¼ 2, 3, 4). 2. 'similar': We enforce the four segment-specific regression coefficients to have the same sign, i.e. we replace b i;h;j signðb i;1;j Þ Á jb i;h;j j. 3. 'independent': We leave the four independent segment-specific regression coefficients b i;1;j ; . . . ; b i;4;j unchanged. 4. 'dissimilar': We enforce the segment-specific regression coefficients to change the sign, i.e. we set b i;h;j signðÀb i;hÀ1;j Þ Á jb i;h;j j.
The RAF network has P 11 i¼1 ðjp i j þ 1Þ ¼ 31 regression coefficients. We assume that K randomly selected regression coefficients are 'coupled', while all the remaining ones are either 'similar', or 'independent' or 'dissimilar'. This yields three different scenarios (mixtures of T1&T2; T1&T3 and T1&T4). For K 2 f0; 3; . . . ; 27; 31g we obtain different percentages of coupled edges. For each scenario and every K we generate 100 independent datasets with different regression coefficients and m ¼ 5 data points per segment (3300 datasets in total). To each dataset we add observational noise using a signal-to-noise ratio of 3. For each node Z i we compute the standard deviation s i of its values z i;1 ; . . . ; z i;4mþ1 , and we then add to each z i;j the realization of a Nð0; ðs i =3Þ 2 Þ distribution. Reaction (RT-PCR), Cantone et al. then measured in vivo gene expression data: first under galactose-and then under glucosemetabolism. For both carbon sources the network structure is identical (Cantone et al., 2009), but the strengths of the regulatory processes change with the carbon source (Cantone et al., 2009);16 (19) measurements were taken in galactose (glucose). We provide more details in Supplementary Material Part C.

Arabidopsis gene expression data
The circadian clock in A.thaliana synchronizes the plant metabolism with the 24-h photo period. The circadian clock can anticipate the photo period and optimize the regulatory processes. The network structure does not change, but the gene interaction strengths depend on the external (or entrained) photo periods. See, e.g. Figure 12 in Aderhold et al. (2014) for an overview of time-invariant network structure hypotheses from different authors. In each network structure hypothesis the effect of light is indicated by a 'sun' symbol. In four experiments Arabidopsis plants were entrained in different dark: light cycles, before data were collected every 2 or 4 h under constant light. We focus on the core clock genes, and we merge the data into one time series; for details see Supplementary Material Part C.
The gradient-based models (A6-A8) are not suitable here, as the data generation (Section 4.2) does not yield meaningful functional relationships in the individual temporal profiles. To avoid that the NH-DBNs reduce to a DBN (without changepoints) when the percentages of coupled edges approach 100%, we assume the changepoints to be known, so that the changepoints do not have to be inferred from the data. Figure 3 shows the fractions of coupled edges that the EWC NH-DBN inferred. The trends are in agreement with the true data generation processes. The fractions increase with the true percentages, and the fractions are scenario-dependent. When the non-coupled edges are 'similar' (T2) or 'dissimilar' (T4), the inferred fractions are higher or lower, respectively. Overall, the inferred fractions of coupled edges tend to be higher than the true fractions. That is, there is a certain bias towards coupling too many edges. In particular, this applies to scenario T1&T2, where even the non-coupled edges have similar regression coefficients. Here the inferred fractions are consistently too high. For the other two scenarios the bias gets weaker as the true percentage increases. Figure 4 shows the relative AUC differences in favour of the EWC NH-DBN. For the average AUC values we refer to Supplementary Material Part F. From Figure 4 the following trends (i-iv) can be observed: i. EWC NH-DBN versus DBN (M1) and versus DBN1TRAFO (A1) The 1st and 4th row show that the quadratic/interaction terms do not lead to improvements. This is consistent with the results in Oates and Mukherjee (2012). The superiority of the EWC NH-DBN over the homogeneous DBNs diminishes as the percentage of coupled edges increases. Although the superiority is significant, the AUCs are only slightly increased for large percentages (>50%) of coupled edges. Except for scenario 'T1&T2', where even the non-coupled regression coefficients stay similar, the AUC improvement is substantial (> 0.18 and >0.20) when the percentage of coupled edges is low ( 50%).

ii. EWC NH-DBN versus coupled NH-DBN (M3)
The trend in the 2nd row is similar to case (i). For scenario 'T1&T2' both models perform approximately equally well, but for high percentages of coupled edges (!70%) the EWC NH-DBN is slightly inferior. Like for the DBNs, for the other two scenarios the superiority of the EWC NH-DBN diminishes as the percentage of coupled edges increases. The AUC improvements for small percentages ( 50%) are moderate (0:08 À 0:10).

iii. EWC NH-DBN versus uncoupled NH-DBN (M2)
The 2nd row shows an opposite trend: The uncoupled NH-DBN is consistently outperformed for scenario 'T1&T2', though the AUC improvement is only moderate (%0:08). For the other two scenarios the superiority of the EWC NH-DBN model rises as the percentage of coupled edges increases. The AUC improvements for large percentages (!50%) are again moderate (0:04 À 0:09). Here the EWC NH-DBN is slightly inferior when the percentage of coupled edges is very low ( 10%) iv. EWC NH-DBN versus TESLA (A3) The 5th row shows that TESLA is consistently inferior to the EWC NH-DBN. This result is consistent with the result of the cross-method comparison of Aderhold et al. (2014), where TESLA was also found to perform below average. Diagnostics (not shown) revealed that TESLA sometimes inferred different network structure for the segments. We note that TESLA is the only method in the comparison that allows for time-varying network structures; a feature that is not required here.

Results on yeast gene expression data
As the yeast network is known, we can cross-compare the network reconstruction accuracies on real in vivo gene expression data. For each of the NH-DBNs from Section 2.6 we run H ¼ 10 independent RJMCMC simulations. Each simulation yields an edge scoreê i;j for each potential edge. We arrange the simulation-specific scores in vectors v m;h , where m indicates the NH-DBN model and h the simulation. In addition we build the true vector v Ã whose entries are 1 if the corresponding edge is present, or 0 otherwise. We propose to use a principal component analysis (PCA) and a cluster analysis to visualize (dis-)similarities between the NH-DBNs from Section 2.6. To this end, we zscore-standardize all vectors, and project them onto the first two principal components (PCs). Figure 5 shows the resulting PCA plot and a dendrogram of the model-specific average score vectors.  . Relative AUC differences in favour of the EWC NH-DBN (RAF pathway data). We consider three scenarios (mixtures of T1&T2; T1&T3 and T1&T4) with varying percentages of coupled edges (T1). The columns refer to the three scenarios and each rows refers to a competing model. In the panels the AUC differences (averaged across 100 datasets) have been plotted against the percentage of coupled edges. The error bars on the curve correspond to 0.95 confidence intervals of paired two-sample t-tests For the dendrogram we clustered the model-specific average score vectors based on their Euclidean distances. The first two PCs explain 78 and 10% (together %90%) of the variance, so that the 2-dimensional PCA plot conserves most of the information. Taking into account that the 1st PC (k 1 ¼ 1:94) has more weight than the 2nd PC (k 2 ¼ 0:24), the following trends can be seen: (i) The model-specific simulations are always closely grouped together, i.e. independent simulations yield similar edge scores, what is a good indicator for convergence. (ii) Nearest to the true network is the new EWC NH-DBN, while the DBN (M1) has the furthest distance. The partially segmentwise coupled model (M6) is 2nd nearest to the true network. (iii) The coupled model (M3) and its generalization with segment-specific coupling strengths (M4) yield similar edge scores, so that this improvement has a minor effect here. (iv) The points of the switch (M5) and the partially coupled NH-DBN (M6) are near to the uncoupled NH-DBN (M2). We conclude that M5 and M6 infer the majority of genes/segments to be uncoupled. The dendrogram shows that there are two model clusters. In the first cluster, the coupled NH-DBNs, which enforce coupling, group with the DBN. In the second cluster, the more flexible NH-DBNs, which have mechanisms to uncouple, group with the uncoupled NH-DBN. Figure 6 shows the network reconstruction accuracies of the EWC NH-DBN and the related NH-DBNs from Section 2.6 in terms of average AUC values. The EWC NH-DBN, which has the minimal distance to the true network in the PCA plot, yields the highest AUC value. More generally, the AUC values consistently decrease with the distance to the true network in the PCA plot, so that the AUC values and the PCA plot are in agreement. We performed two-sided unpaired t-tests and found that the average AUC of the EWC NH-DBN is significantly higher than the AUC of any other method (all six P-values: < 0.05). The right histogram in Figure 6 compares the AUC of the EWC NH-DBN with the AUCs of the models from Section 2.7. Again the EWC NH-DBN reaches the highest AUC score and two-sided unpaired t-tests indicate that the improvement is significant (all seven P-values: <0.05). In Supplementary Material Part G we provide more results, including the pairwise AUC differences.

Results on Arabidopsis gene expression data
The absence of a gold standard for the circadian clock network renders an objective evaluation of the network reconstruction accuracy impossible. We therefore focus on the EWC NH-DBN and illustrate that this model yields more insight into the robustness of the individual gene interactions. We run H ¼ 10 RJMCMC simulations and average the edge scores. For the posterior probabilities of the changepoint location we refer to Supplementary Material Part H. Onto the scores we impose a threshold w such that the 20 edges with the highest scores are extracted; the corresponding threshold is around w ¼ 2=3. Recalling thatê i;j refers to an edge from X i to Y ¼ Z j , we consider the corresponding sampled d i indicator variables and estimate the posterior probabilities that the edge was mainly 'coupled' (or 'uncoupled'). If the posterior probabilitypðd i ¼ 1jDÞ of the state 'coupled' was double as likely as the probabilitypðd i ¼ 0jDÞ of the state 'uncoupled', we call the edge a 'coupled' edge. Correspondingly, we call the edge 'uncoupled' if pðd i ¼ 0jDÞ > 2 Á pðd i ¼ 1jDÞ, and we call the edge 'mix edge' if none of the conditions is satisfied. Figure 7 shows the predicted network with different symbols for the edge types. In this application to the circadian clock in Arabidopsis, the edge label (coupled versus uncoupled) can be interpreted as an indicator whether the corresponding gene interaction is likely to be lightdependent (¼uncoupled) or not (¼coupled).
In the biological literature, we could find evidence for some features of our network. The important key feature of the circadian clock network is the feedback loop between LHY and TOC1. This feedback is already known since Locke et al. (2006) and plays a central role in circadian regulation [see also more recent works, e.g. Pokhilko et al. (2013)]. The EWC NH-DBN does not only infer this feedback loop but also suggests that the effect of LHY on TOC1 is not light-dependent, while the regulatory effect of TOC1 on LHY appears to depend on light. Focusing on those two genes, we further found the following: The regulatory effect of ELF3 on TOC1, e.g. reported in Miwa et al. (2006), is also light-dependent, while the edge from GI to TOC1, also reported in Miwa et al. (2006), is not. The edges from ELF3 to LHY and from LHY to ELF4 have been reported in Kikis et al. (2005). The model finds both edges and provides evidence that the effect of ELF3 on LHY depends on the presence of light. For the effects of TOC1 on the PRR3 and PRR9 Fig. 6. Network reconstruction accuracy for yeast. The two histograms compare the AUCs of the proposed EWC NH-DBN with the AUCs of the NH-DBN models from Section 2.6 (left histogram) and the AUCs of the competing methods from Section 2.7 (right histogram). For models that are inferred by MCMC techniques, the error bars indicate standard deviations Fig. 7. Predicted Arabidopsis network. Morning (evening) genes are represented as white (grey) nodes. We extracted the 20 edges with the highest scores. Different edge types indicate whether the parameters are coupled, uncoupled, or a mixture thereof. A coupled (uncoupled) edge indicates that the gene interaction is likely to be influenced (not affected) by light