EFTA00611781.pdf

Source: DOJ_DS9 • Size: 802.6 KB • OCR Confidence: 85.0%

PDF Source (No Download)

Extracted Text (OCR)

OPEN ACCESS Freely available online PLOS COMPUTATIONAL BIOLOGY The Time Scale of Evolutionary Innovation Krishnendu Chatterjee'`, Andreas Pavlogiannis', Ben Adlam2, Martin A. Nowak2 11ST Austria, Klosterneuburg, AtIWIld, 2 Program for Evolutionary Dynamics, Department of Organismic and Evolutionary Biology. Department of Mathematics, Harvard University• Cambridge. Massachusetts. United States of America Abstract A fundamental question in biology is the following: what is the time scale that is needed for evolutionary innovations? There are many results that characterize single steps in terms of the fixation time of new mutants arising in populations of certain size and structure. But here we ask a different question, which is concerned with the much longer time scale of evolutionary trajectories: how long does it take for a population exploring a fitness landscape to find target sequences that encode new biological functions? Our key variable is the length, L. of the genetic sequence that undergoes adaptation. In computer science there is a crucial distinction between problems that require algorithms which take polynomial or exponential time. The latter are considered to be intractable. Here we develop a theoretical approach that allows us to estimate the time of evolution as function of L. We show that adaptation on many fitness landscapes takes time that is exponential in L. even if there are broad selection gradients and many targets uniformly distributed in sequence space. These negative results lead us to search for specific mechanisms that allow evolution to work on polynomial time scales. We study a regeneration process and show that it enables evolution to work in polynomial time. Citation: Chatterjee K, Pavlogiannis A. Adlam B. Nowak MA (2014) The Time Scale of Evolutionary Innovation. PLoS Comput BId 10(9): e1003818. doL10.13711 joumauxbiloossis Editor: Niko Beerenwinkel, ETH Zurich• Switzerland Received December 13, 2013; Accepted July 21. 2014: Published September II, 2014 Copyright: 00 2014 Chatterjee et al. ThIs Is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: Austrian Science Fund JAW) Grant No P234994123. FWF NFN Grant No S114074423 (RISE), ERC Stan grant 1279307: Graph Games), and Microsoft Faculty Fellows award. Support from the John Templeton foundation is gratefully acknowledged. The finder had no role In study design data collection and analysis. decision to publish, or preparation of the manuscript. Competing Interests: The authors have declared that no competing Interests exist. • Email: Krishnendu.Chattedeealinac.at Introduction Our planet came into existence 4.6 billion years ago. There is clear chemical evidence for life on earth 3.5 billion years ago [1,2]. The evolutionary process generated procaria, eucaria and complex multi-cellular organisms. Throughout the history of life, evolution had to discover sequences of biological polytners that perform specific, complicated functions. The average length of bacterial genes is about 1000 nucleotides, that of human genes about 3000 nucleotides. The longest known bacterial gene contains more than lOs nucleotides, the longest human gene more than 106. A basic question is what is the time scale required by evolution to discover the sequences that perform desired functions. While many results exist for the fixation time of individual mutants [3-15], here we ask how the time scale of evolution depends on the length L of the sequence that needs to be adapted. We consider the crucial distinction of polynomial versus exponential time [16-18]. A time scale that grows exponentially in L is infeasible for long sequences. Evolutionary dynamics operates in sequence space, which can be imagined as a discrete multi-dimensional lattice that arises when all sequences of a given length are arranged such that nearest neighbors differ by one point mutation [19]. For constant selection, each point in sequence space is associated with a non- negative fitness value (reproductive rate). The resulting fitness landscape is a high dimensional mountain range. Populations explore fitness landscapes searching for elevated regions, ridges, and peaks (20 27]. A question that has been extensively studied is how long does it take for existing biological functions to improve under natural selection. This problem leads to the study of adaptive walks on fitness landscapes [15,20,21,28,29]. In this paper we ask a different question: how long does it take for evolution to discover a new function? More specifically, our aim is to estimate the expected discover• time of new biological functions: how long does it take for a population of reproducing organisms to discover a biological function that is not present at the beginning of the search. We will discuss two approximations for rugged fitness landscapes. We also discuss the significance of clustered peaks. We consider an alphabet of size four, as is the case for DNA and RNA, and a nucleotide sequence of length L. We consider a population of size N, which reproduces asexually. The mutation rate, u, is small: individual mutations are introduced and evaluated by natural selection and random drift one at a time. The probability that the evolutionary process moves from a sequence i to a sequence f, which is at Hamming distance one from i, is given by /(3L)]N, where p, is the fixation probability of sequence./ in a population consisting of sequence i. In the special case of a flat fitness landscape, we have Ad =UN, and Pro (u/(3L)I. Thus we have an evolutionary random walk, where each step is a jump to a neighboring sequence of Hamming distance one. Results Consider a high-dimensional sequence space. A particular biological function can be instantiated by some of the sequences. Each sequence f has a fitness value f, which measures the ability of the sequence i to encode the desired function. Biological fitness landscapes are typically expected to have many peaks [29-31]. CrossNlark PLOS Computational Biology I www.pbscompbiol.org September 2014 I Volume 10 I Issue 9 I elCO3818 EFTA00611781 The Time Scale of Evolutionary Innovation Author Summary Evolutionary adaptation can be described as a biased, stochastic walk of a population of sequences in a high dimensional sequence space. The population explores a fitness landscape. The mutation-selection process biases the population towards regions of higher fitness. In this paper we estimate the time sale that is needed for evolutionary innovation. Our key parameter is the length of the genetic sequence that needs to be adapted. We show that a variety of evolutionary processes take exponential time in sequence length. We propose a specific process, which we all 'regeneration processes', and show that it allows evolution to work on polynomial time sales. In this view, evolution an solve a problem efficiently if it has solved a similar problem already. They can be highly rugged due to epistatic effects of mutations [32-34]. They can also contain large regions or networks of neutrality [20,21]. Empirical studies of short RNA sequences have revealed that the underlying fitness landscape has low peak density [35]: around IS peaks in 424 sequences. For the purpose of estimating the expected discovery time we can approximate the fitness landscape with a binary step function over the sequence space. We discuss two different approximations (Figure For the first approximation, we consider the scenario where fitness values below some threshold, fi n, have negligible contribution; those sequences do not instantiate the desired function (either not at all or only below the minimum level that could be detected by natural selection). We approximate the nigged fitness landscape as follows: if f; <Ann, then A -0; if ,finin then I. The set of sequences with t/ tioin constitutes the target set, and the remaining fitness landscape is neutral. The second approximation works as follows. Consider the evolutionary proem exploring a rugged fitness landscape where the goal is to attain a fitness level f*. Local maxima below)" slow down the evolutionary process to attain 7, because the evolutionary walk might get stuck in those local maxima. In order to derive lower bounds for the expected discovery time, the rugged fitness landscape can be approximated as follows. Let I be the fitness value of the highest local maximum below 7. Then for every sequence in a mountain range with a local maximum below 7 we assign the fitness value f. The mountain ranges with local maxima above f• are the target sequences. Note that the target set includes sequences that start at the upslope of mountain ranges with peaks above 7. Thus, again we obtain a fitness landscape with clustered targets and neutral region, where the neutral region consists of all sequences whose fitness values have been assigned to 1. The two approximations are illustrated in Figure I. For p-f„„,, the second approximation generates larger target areas than the first approximation and is therefore more lenient. Our key results for estimating the discovery time call now be formulated for binary fitness landscapes, but they apply to any type of rugged landscape using one of the two approximations. We note that our methods can also be applied for certain non-binary fitness landscapes, and an example of a fitness landscape with a large gradient arising from multiplicative fitness effects is discussed in Sections 6 and 7 of Text SI. We now present our main results in the following order. We first estimate the discovery time of a single search aiming to find a single broad peak. 'Chen we study multiple simultaneous searches for a single broad peak. Finally, we consider multiple broad peaks that are uniformly randomly distributed in sequence space. We first study a broad peak of target sequences described as follows: consider a specific sequence; any sequence within a certain Hamming distance of that sequence belongs to the target set. Specifically, we consider that the evolutionary process has succeeded, if the population discovers a sequence that differs from the specific sequence in no more than a fraction c of positions. We refer to the specific sequence as the target center and c as the width (or radius) of the peak. For example, if L=100 and c-0.1, then the target center is surrounded by a cloud of approximately 1018 sequences. For a single broad peak with width c, the target set contains at least 2a/(3L) sequences, which is an exponential function of L. The fitness landscape outside the broad peak is flat. We refer this binary fitness landscape as a broad peak landscape. The population needs to discover any one of the target sequences in the broad peak, starting from some sequence that is not in the broad peak. We establish the following result. Theorem 1. Consider a single search exploring a broad peak landscape with width c and mutation rate a. The following assertions hold: • if c <3/4, then there exists LOA such that for all sequence spaces of sequence length 1, Lo, the expected discovery time is „ L , 6 at least expR3 — tICy 16 lOg 4c-I-3, • if c2 3/4, then for all sequence spares of sequence length L, the expected discovery time is at most O(L3 Its). Our result can be interpreted as follows (see Theorem S2 and Corollary S2 in Text SI): (i) If c < 3/4, then the expected discovery time is exponential in L; and (ii) if c 3,41., then the expected discovery time is polynomial in L. Thus, we have derived a strong dichotomy result which shows a sharp transition from polynomial to exponential time depending on whether a specific condition on c does or does not hold. For the four letter alphabet most random sequences have Hamming distance 3L/4 from the target center. If the population is further away than this Hamming distance, then random drift will bring it closer. If the population is closer than this Hamming distance, then random drift will push it further away. This argument constitutes the intuitive reason that c 3/4 is the critical threshold. If the peak has a width of less than c= 3/4, then we prove that the expected discovery time by random drift is exponential in the sequence length L (see Figure 2). This result holds for any population size, N, as long as 41- > > N, which is certainly the case for realistic values of L and N. In the Text S I we also present a more general result, where along with a single broad peak, instead of a flat landscape outside the peak we consider a multiplicative fitness landscape and establish a sharp dichotomy result that generalizes Theorem I (see Corollary S2 in Text SI). Remark 1. We highlight two important aspects of our results. I. First, when we establish exponential lower bounds for the expected discovery time, then these lower bounds hold even if the starting sequence is only a few steps away front the target set. 2. Second, we present strong dichotomy results, and derive mathematically the most precise and strongest form of the boundary condition. Let us now give a numerical example to demonstrate that exponential time is intractable. Bacterial life on earth has been around for at least 3.5 billion years, which correspond to 3 x 1013 hours. Assuming fast bacterial cell division of 20-30 minutes on average we have at most ION generations. 'Hie expected discovery PLOS Computational Biology I www.pbscompbiolorg 2 September 2014 I Volume 10 I Issue 9 I e1003818 EFTA00611782 Projection of sequence space B C Projection of sequence space The Time Scale of Evolutionary Innovation Projection of sequence space Projection of sequence space Figure 1. Approximations of a highly rugged fitness landscape by broad peaks and neutral regions. The figures depict examples of highly rugged fitness landscapes where the sequence space has been projected in one dimension. (A) Sequences with fitness below some level b„,,, are functionally very different to the desired function, and selection cannot act upon them. All other sequences are considered as targets. The fitness landscape is approximated by a step function: if A </„„„, then J, =0, otherwise b =I. (B) Local maxima below the desired fitness threshold p are known to slow down the evolutionary random walk towards sequences that attain fitness at least /*. We approximate the fitness landscape by broad peaks and neutral regions by increasing the fitness of every sequence that belongs in a mountain range with fitness below f to the maximal local maxima]. below f'. Note that the target set starts from the upslope of a mountain range whose peak exceeds f doi:10.1371/joumal.pcbi.100313113.9001 time for a sequence of length Lm 1000 with a very large broad peak of em 1/2 is approximately 1065 generations; see Table 1. If individual evolutionary processes cannot find targets in polynomial time, then perhaps the success of evolution is based on the fact that many populations are searching independently and in parallel for a particular adaptation. We prove that multiple, independent parallel searches are not the solution of the problem, if the starting sequence is far away from the target center. Formally we show the following result. A CD 0. CD N 0 '0 ZS 2 rn — ID I- C02 • LL • exp(L) 0 el. 31, 4 Hamming Distance L O CO N 0 O 73 C CU 03 ft c• C • LE 0 ci. Theorem 2. In all cases where the busier bound on the expected dimwits), time k exponential, for dl polynomials pi(), p2(-) and p3e, for any starting sequence with Hamming distance at least 3L/4 from the target center, the probability for any one out of p3(I) independent multiple searches to reach the target set within pi(L) steps is at most I /p2(L). If an evolutionary process takes exponential time, then polynomially many independent searches do not find the target in polynomial time with reasonable probability (for details see poly(L) L 4 Hamming Distance td I0 50 100 ISO 200 210 300 Sequence length, 350 Figure 2. Broad peak with different fitness landscapes. For the broad peak there is a specific sequence, and all sequences that are within Hamming distance cL are part of the target set. The fitness landscape is flat outside the broad peak. (A) If the width of the broad peak is r <3/4, then the expected discovery time is exponential in sequence length, L. (B) If the width of the broad peak is c Z 3/4, then the expected discovery time is polynomial in sequence length, L. (C) Numerical cakulations for broad peak fitness landscapes. We observe exponential expected discovery time for c=1/3 and c= 1/2, whereas polynomial expected discovery time for c=3/4. doi:10.1371/joumal.pcbi.10031318.9002 PLOS Computational Biology I www.ploscompbiol.org 3 September 2014 I Volume 10 I Issue 9 I e1003818 EFTA00611783 The Time Scale of Evolutionary Innovation Table 1. Numerical data for discovery time in flat fitness landscapes. r.I L=102 1.021018 7.36107 183 L=10' 5.89 1.28.100 2666 Numerical data for the discovery time of broad peaks with width c= 1/3.1/2. and 3/4 embedded in flat fitness landscapes. First the discovery time Is computed for small values of Las shown In Figure 2(C). Then the exponential growth Is extrapolated to L= 100 and L= 1000, respectively. We show the discovery times for r= 1/2, and 1/3. For c=3/4 the values are polynomial in L. dol:10.1371/Joutnal.pcb1.1003818.E001 Theorem 55 in the Text SI). We also show all informal and pproximate calculation of the success probability for M independent searches, as follows: if the expected discovery time exponential (say, en, then the probability that all M independent searches fail upto b steps is at least exp(—(M/)1d) (i.e., the success probability within b steps of any of the searches is at most 1— exp(—(M1))/d)), when the starting sequence is far away from the target center. In such a case, one could quickly exhaust the physical resources of an entire planet. The estimated number of bacterial cells [36] on earth is about 103°. To give a specific example let us assume that there are IP independent searches, each with population size N 106. The probability that at least one of those independent searches succeeds within 1014 genera- tions for sequence length L- 1000 and broad peak of cm 1/2 is less than 10-26. In our basic model, individual mutants are evaluated one at a time. The situation of many mutant lineages evolving in parallel is similar to the multiple searches described above. As we show that whenever a single search takes exponential time, multiple independent searches do not lead to polynomial time solutions, our results imply intractability for this case as well. We now explore the case of multiple broad peaks that are uniformly and randomly distributed. Consider that there are in target centers. Around each target center there is a selection gradient extending up to a distance rt. Formally we can consider any fitness function f that assigns zero fitness to a sequence whose Hamming distance exceeds cL from all the target centers, which in particular is subsumed by considering the multiple broad peaks where around each center we consider a broad peak of target set with peak width c. We establish the following result: Theorem 3. Consider a single search under the multiple broad peak fitness landscape of inc <'I'- target centers chosen uniformly at random, with peak width at most c for each center and c < 3/4. Then with high probability, the expected discovery time of the target set is at least (11m)exp[2L(314—c)2]. Whether or not the function (1,4n)exp[21.43/4 —02] is exponential in L depends on how In changes with L. But even if we assume exponentially many broad peak centers, m, with peak width cL where c< 3/4, we need not obtain polynomial time (Figure 3 and Teorem S6 in Text SI). It is known that recombination may accelerate evolution on certain fitness landscapes [28,37 39], and recombination may also slow down evolution on other fitness landscapes (40]. Recombi- nation, however, reduces the discovery time only by at most a linear factor in sequence length [28,37,38,41,42]. A linear or even polynomial factor improvement over an exponential flinction does not convert the exponential function into a polynomial one. Hence, recombination can make a significant difference only if the underlying evolutionary process without recombination already operates in polynomial time. What are then adaptive problems that can be solved by evolution in polynomial time? We propose a "regeneration process". 'lite basic idea is that evolution call solve a new problem efficiently, if it is has solved a similar problem already. Suppose gene duplication or genome rearrangement can give rise to starting sequences that are at most k point mutations away from the target set, where k is a number that is independent of L. It is important that starting sequences can be regenerated again and again. We prove that Lk +I many searches arc sufficient in order to find the target ill polynomial time with high probability (see Figure 4 and Section 10 in Text SI). 'flue upper bound, Lk 1, holds even for neutral drift (without selection). Note that in this case, the expected discovery time for any single search is still exponential. 'flterefore, most of the Lk + 1 searches do not succeed in polynomial time; however, with high probability one of the searches succeeds in polynomial time. There are two key aspects to the "regeneration process": (a) the starting sequence is only a small number of steps away from the target; and (b) the starting sequence can he generated repeatedly. This process enables evolution to overcome the exponential barrier. The upper bound, Lk may possibly be further reduced, if selection and/or recombination arc included. Discussion The regeneration process formalizes the role of several existing ideas. First, it ties in with the proposal that gene duplications and genome rearrangements arc major events leading to the emer- gence of new genes [43]. Second, evolution call be seen as a tinkerer playing around with small modifications of existing sequences rather than creating entirely new ones [44]. 'fbird, the process is related to Gillespie's suggestion [29] that the starting sequence for an evolutionary search must have high fitness. In our theory, proximity in fitness value is replaced by proximity in sequence space. However, our results show that proximity alone is insufficient to break the exponential barrier, and only when combined with the process of regeneration it yields polynomial discovery time with high probability. Our process can also explain the emergence of orphan genes arising front non-coding regions [45]. Section 12 of the Text SI discusses the connection of our approach to existing results. There is one other scenario that must be mentioned. It is possible that certain biological functions are hyper-abundant in sequence space [2 l] and that a process generating a large number of random sequences will fmd the function with high probability. For example, Bartel & Szostak [46] isolated a new ribozytne from a pool of about lois random sequences of length L-220. While such a process is conceivable for small effective sequence length, it cannot represent a general solution for large L. Our theory has clear empirical implications. The regeneration process can be tested in systems of in vitro evolution [47]. A PLOS Computational Biology I www.pbscompbiolorg 4 September 2014 I Volume 10 I Issue 9 I e1003818 EFTA00611784 The Time Scale of Evolutionary Innovation A Start of search D to C 0 a % 08 .0 CO 06 0. 4 04 On 02 00 0.0 OS 1O 1.5 lime limit 20 I 7 02 0.0 1 15 20 25 30 35 40 45 50 Sequence length, I. Bt 10^ 10' to 8) le 10' 102 io' 10' E 10 5 Ie 110' & 10' io' io' to' a le I 15 20 25 30 35 40 45 50 Sequence length, 1. 12 15 18 21 24 Sequence length. I. IT Figure 3. The search for randomly, uniformly distributed targets in sequence space. (A) The target set consists of m random sequences; each one of them is surrounded by a broad peak of width up to it. The figure shows a pictorial illustration where the &dimensional sequence space is projected onto two dimensions. From a randomly chosen starting sequence outside the target set, the expected discovery time is at least (1/nOexp(2L(3/4 —rill, which can be exponential in L. (8) Computer simulations showing the average discovery time of m=100, 150, and 200 targets, with r. =1/3. We observe exponential dependency on L. The discovery time is averaged over 200 runs. (C) Success probability estimated as the fraction of the 200 searches that succeed in finding one of the target sequences within 101 generations. The success probability drops exponentially with L. (D) Success probability as a function of time for L=42. 45. and 48. (E) Discovery time for a large number of randomly generated target sequences. Either in =2t or in = 4t sequences were generated. For b= 0 and h= 3 the target set consists of balls of Hamming distance 0 and 3 (respectively) around each sequence. The figure shows the average discovery time of 100 runs. As expected we observe that the discovery time grows exponentially with sequence length, L. doi:10.1371/joumal.pcbi.10031318.9003 Process re-generating starting sequence at Hamming distance k kern target Target Hamming distance Figure 4. Regeneration process. Gene duplication (or possibly some other process) generates a steady stream of starting sequences that are a constant number k of mutations away from the target Many searches drift away from the target, but some will succeed in polynomially many steps. We prove that Lk+ I searches ensure that with high probability some search succeed in polynomially many steps. doi:10.1371/joumal.pcbi.1003818.9004 starting sequence can be generated by introducing k point mutations in a known protein encoding sequence of length L. If these point mutations destroy the function of the protein, then the expected discovery time of any one attempt to find the original sequence should be exponential in L. But only polynomially many searches in L are required to find the target with high probability in polynomially many steps. 'Fite same setup can be used to explore whether the biological function can be found elsewhere in sequence space: the evolutionary trajectory beginning with the starting sequence could discover new solutions. Our theory also highlights how important it is to explore the distribution of biological functions in sequence space both for RNA [20,21,35,46] and in the protein universe 1481. In summary, we have developed a theory that allows us to estimate time scales of evolutionary trajectories. We have shown that various natural processes of evolution take exponential time as function of the sequence length, L. In some cases we have established strong dichotomy results for precise boundary condi- tions. We have proposed a mechanism that allows evolution in polynomial time scales. Some interesting directions of future work are as follows: (I) Consider various forms of rugged fitness landscapes and study more refined approximations as compared to PLOS Computational Biology I www.pbscompbiolorg 5 September 2014 I Volume 10 I Issue 9 l e1003818 EFTA00611785 The Time Scale of Evolutionary Innovation the ones we consider; and then estimate the expected discovery time for the refined approximations. (2) While in this paper we characterize the difference between exponential and polynomial for the expected discovery time, more refined analysis (such as efficiency for polynomial time, like cubic vs quadratic time) for specific fitness landscapes using mechanisms like recombination is another interesting problem. Materials and Methods Our results are based on a mathematical analysis of the underlying stochastic processes. For Markov chains on the one- dimensional grid, we describe recurrence relations for the expected hitting time and present lower and upper hounds on the expected hitting time using combinatorial analysis (see Text SI for details). We now present the bas- e arguments of the main results. Markov chain on the one-dimensional grid For a single broad peak, due to symmetni we can interpret the evolutionary random walk as a Markov chain on the one- dimensional grid. A sequence of type i is i steps away from the target, where i is the Hamming distance between this sequence and the target. The probability that a type i sequence mutates to a type i— I sequence is given by tri/(3L). The stochastic process of the evolutionary random walk is a Is Iarkov chain on the one-dimensional grid The basic recurrence relation Consider a Markov chain on the one-dimensional grid, and let H( ) denote the expected hitting time from i to j. The general recurrence relation for the expected hitting time is as follows: H(j,f)in I +Pr.e+tH(j,i+1)+Pu -af(j,i- I)+ &NILO: (I) for j<f <L, with boundary condition HUI). 0. The interpreta- tion is as follows. Given the current state i, if i eV, at least one transition will be made to a neighboring state F, with probability Pa, from which the hitting time is H(//). Intuition behind Theorem 1 Theorem I is derived by obi: . g precise bounds for the recurrence relation of the hitting time (Equation I). Consider that PLg: _i >0 for all j<k Si (i.e., progress towards state j is always possible), as otherwise/ is never reached from i. We show (see Lemma 2 in the Text SI) that we can write H(j,i) as a sum, H(/,1).Eitit,b„, where b„ is the sequence (Wilted as: (Obe 1 PL.L-I 1 +PL-a-n+lbn-I PL-net-n- I forn>0. (2) The basic intuition obtained from Equation 2 is as follows: (i) If Pt-net-n+1 • ez, for some constant t> I, then the sequence b,, i grow; at least as fast as a geometric series with factor A_ (ii) On the Pt —net —n+1 other hand, if SI and PL—n et.—N—t x for sonic Pt-net-n-1 constant x> 0, then the sequence b„ grows at most as fast as an arithmetic series with difference lict. From the above case analysis 3 ' the rest& If 4 for Theorem I is obtained as follows: c < - then for all 3+4c PL-n.1.-n+1 . cL<n< —. we have for Seine t> I, and 8 ' rL-n.L-n- 1 hence the sequence b„, grows geometrically for a linear length in L. Then, H(cL,i) ,...)1kL for all states i>cL (i.e., for all sequences outside of the target set). This corresponds to case 1 of Theorem I. PL-n.L-ntl On the other hand, if —3 thenit is I, and case 2 of 4' Theorem I is derived (for details see Corollary 2 in Text SI). Intuition behind Theorem 2 The basic intuition for the result is as follows: consider a single search for which the expected hitting time is exponential. Then for the single search the probability to succeed in polynomially many steps is negligible (as otherwise the expectation would not have been exponential). In case of independent searches, the indepen- dence ensures that the probability that all searches fail is the product of the probabilities that every single search fails. Using the above arguments we establish Theorem 2 (for details see Section 8 in Text SI). Intuition behind Theorem 3 For this result, it is first convenient to view the evolutionary walk taking place in the sequence space of all sequences of length L, under no selection. Each sequence has 31- neighbors, and considering that a point mutation happens, the transition probability to each of them is 3L . The underlying Markov chain due to symmetry has fast mixing time, i.e., the number of steps to converge to the stationary distribution (the mixing time) is O(L logL). Again by symmetry 3 ' the stationary distribution is the tinifomi distribution. If r < - 4 then from Theorem I we obtain that the expected time to reach a single broad peak is exponential. By union bound, if in< <4L, the probability to reach any of the in broad peaks within O(L log L) steps is negligible. Since after the fast O(LlogL) steps the Markov chain converges to the stationary distribution, then each step of the process can be interpreted as selection of sequences uniformly at random among all sequences. Using Hoeffding's inequality, we show that with exp(2-(3/4-0-L) high probability, in expectation such steps are required before a sequence is found that belong: to the target set. Thus we obtain the result of Theorem 3 (for details see Section 9 in Text SI). Remark about techniques An important aspect of our work is that we establish our results using elementary techniques for analysis of Markov chains. 'Ile use of more advanced mathematical machinery, such as martingales [49] or drift analysis [50,51], can possibly be used to derive more refined results. While in this work our goal is to distinguish between exponential and polynomial time, whether the techniques from [49 -51] can lead to a more refined characterization within polynomial time is an interesting direc- tion for future work. Supporting Information Text SI Detailed proofs for The Time Scale of Evolutionary Innovation." (PDF) PLOS Computational Biology I www.pbscompbiol.org 6 September 2014 I Volume 10 I Issue 9 I e1003818 EFTA00611786 The Time Scale of Evolutionary Innovation Acknowledgments We thank Nick Barton and Daniel Weissman for helpful ditemaiosu and pointing us to °levant literature. References I. :Mimics' AC. Grotainger JP, Knoll Ali. Burch 1W. Anderson MS, et al. 1009) Controls on detelopmem and diversity of early archran simmatolies. Pmc Nail Arad Sri USA 106: 95419 9555. 2. SchopfJW (August 2006) The lint hilliein years: When did life emerge? Elements 1: 229-233. 3. Kimura NI (1968, Evolutionary rate at the mokcidar keel. Nature 117:624 626. 4. Ewens 19 (1967) lbr probability al sonied of a new mutant in a twthating nutriment. Heredity 21: 438 443. .5. Barton NH (1993) linkage and the limits to natural selection. Genetics 140: R21 41. 6. Campos PR ((2004) Fixation of beneficial mutations in the presence of rpistatic interactions. Bull Math BIM 66: 473 486. 7. Antal T. Scheming I (2006) Fixation of strategies few an evolutionary game in finite ,titillation. Bull Math Biol 68: 1923-1944. 8. Whitimi MC (2003) Fixation probability and tine in suhditicled populations. Genetics 164: 767-779. 9. Ahmck PM, Trani...en A (2009) Fixation times in evolutionary games under weak selection. New.) Phys 11. clorl0.1088/1367-2630/11/1 /01 301 2 10. Kimura NI. Olua T (1969) Average number al generations until fixation nra mutant gene in a finite population. Genetic 61: 763-771. II. Johnson T, Gerrish P (2002) llte pmhahility nra beneficial allele in a pomilation dividing by binary fusion. Genetica 115 283 287. 12. On [IA .:2000)1he rate of adaptation in asexual,. Genetics 155: 961-968. 13. Wilke CO (1004) Tlw speed al adaptatinn in large asexual populations. Genetics 167: 2045-2053. 14. 1/enti MM. Fisher Ik5, Murray AW 1007; speed of evolution and maintenance of variation in asexual populations. Curt Biol 17: 385 394. 15. Ohia T (1972) POVUlali011 size and rate ol evolution. J NIal Paul 1: 305-314. 16. Papadintitriou C (1991) Computational congilexity. Addison•Wesley. 17. Carmen T, Leiserson C. Rives' R. Stein C:1009) Introduction to Algarithiro. MU Press. 18. Valiant I.G (2009) Evolvability. J ACM 56: 3:1 3:21. 19. Maynard Smithi (1970) Natural selection and the concept rola protein %pace. Nature 215: 563-564. 20. Fontana W, Schuster P (1987) A computer model of rwiltitionaty optimization. Biophys Clem 26: 123-147. 21. Fontana W. Schuster P (1998; Continuity in evolution: On the nature of transitions. Science 280: 1451 1455. 22. Eigeii M. Mccaskilli. Schuster P (19118) Molecular quasi.species..) Phys Clem 92: 6881 6891. 23. Eigen NI. Schumer P (1978) The hypercycle. Natuncissenschalien 65: 7 41. 24. Park SC. Simon I). Krug J (20lth The speed of evolution in large asexual pcmulations..) Stat Phys 138: 381-410. 25. 1/errida B. Prliti I. (1991) Evolution in a at fitness landscape. Bull Math Biel 53: 355381. 26. Stadler Pi (2002) Fitness landscapes. Appl Math & C:amput l7: 187-207. 27. Warden RI' (1995) A speed limit Inc evolution. J 'Diem Biel 176: 137-152. 28. Crow Jr, Kimura M (1965) Evolution in sexual and asexual populations. Am Nat 99: pp. 439-450. Author Contributions comeiwil and designed Lite experithents: KC AP BA MAN. Analyzed the data: KC AP BA MAN. Wage the paper: KC AP BA MAN. 29. JH (1984) Molecular evolution over the mutational landscape. Evolution 38: 1116-1119. 30 E.aulfman S. Levin S (1987) Towards a general therry of adaptive walks on rugged landscapes. Journal allbeoretiral Biology 128: II- 45. 31. OFT [IA 1000) A minimum on the mean number of steps taken in adaptive walks. Journal of Theoretical Slalom. 220: 241 247. 31 Wentreich DM, Watson R.A. IThaci I. (2005) Penpective:sign epistasit and genetic constraint on evolutionary trajectories. Evolution 59: 1165 1174. 33. Poelwijk EJ. Kirin UJ. Wrinreich UM. Tans (2007) Empirical fitness Landscapes reveal accessible evolutionary paths. Nature 445: 383-386. 34 Woods ltj. Barrack JE. Cooper Ti. Shrestha Eatith SIR, et al. (2011) SeceinePorder selection for rwilvability in a large escherichia cnli population. Science 331: 1433 1436. 35 Jimenezjl. Xulti-Bonet R. Campbell GW, Ttuk•Malend K. Chen IA (2013) Comprehensive experimental limes., landscape and MOIllliettaly network Ihr small ma. nor Nati Arad Sri USA. 1103704984 9. 36. Whknun WB, Coleman DC. Wive W.1 1998) Pokaryotes, The unseen maim*. Proc Nail Arad Sti USA 95: 6578 6583. 37. Smith J51 (1974) Recombinatinn and the raw of evolution. Genetics 78: 199 38. Cmw Jr. Kimura M (1970) An introduction in population genetics theory. Burgess Publishing Company. 39. Park SC. Krug.; (2013) Rate of adaptation in sexual. and asexual,: A solvable model of the fishermulkr efiet. (Irwin 193: 941 955. 40. dr Vinci JAGNI. Park S. King J (2009, Expiating the alert of sex on empihcal fitness landscapes. '1'1w American Naturalist 174: S15 530. 41. Neher RA. Shraiman RI. Fisher OS (2010) Rate of adaptation in Large sexual populations. Genetics 184: 467 481. 42. Weissman OB. liallauthek 0 (2014)'11e into of adaptation in large sexual populations with linear chromatomes. Genetics 196: 1167-1183. 43. Ohs? S 0970) Evolution by gene duplication. Springer•Verlag. 44 Jacob F (1977) Evolution and tinkering. Science 196: 1161-1166. 45. 'lama I). Ihnazet•LatiT1011)1be evolutionary origin of orphan genes. Nat Rev Genet 11:692 701. 46. Band I). Smoak J (1993) Isolation of new rihozymes front a large pool or random snowmen. Science 261: 1411 1418. 47. Leconte A.M. Ihrkinson BC. Yang 99. C:hen IA. Alkt B. et at (2013) A population-based experimental model for pmiein evolution: Effects cif mutation rate and selection stringency on evolutionary outcomes, Biewhentimy 51: 1490 1499. 411. Powileaskaya IS. Kondrashov FA (2010) Sequence space and the ongoing expansion of the protein universe. Nature 465: 912 926. 49. William. I) (1991) Probability with Maningales. Cambridge mathematical tombs-sob. Cambridge L:niversity Press. 50. Flajek B (1982) Iiiiiing.time and occupation-time hounds implied by drib analysis with applications. Advances in Applied l'robability 14: pp. 302 525. 51. Litre PK. Wit C (2013) General drift analysis with tail hounds. CORK abr./ 1307.1559. PLOS Computational Biology I envw.ploscompbiolorg 7 September 2014 I Volume 10 I Issue 9 I e1003818 EFTA00611787

Document Preview

PDF source document
This document was extracted from a PDF. No image preview is available. The OCR text is shown on the left.

Extracted Information

Dates

December 13, 2013 December 13, 2013

Phone Numbers

1003818.9004 1103704984

Document Details

Filename	EFTA00611781.pdf
File Size	802.6 KB
OCR Confidence	85.0%
Has Readable Text	Yes
Text Length	40,104 characters
Indexed	2026-02-11T23:04:36.594490