MERS-CoV-like fragment in bat in Saudi Arabia


A paper published on Aug 22nd reports a short fragment (182 nucleotides in length) of coronavirus sequence recovered from a sample from an individual Taphozous perforatus, or Egyptian tomb bat, that was collected a short distance from the home and work location of the first reported case of MERS-CoV infection (Bisha in western Saudi Arabia). This sequence is reported to be identical across its 182 nucleotides to the corresponding region of the MERS-CoV genome sequenced from this patient (referred to as EMC-2012).

Memish ZA, Mishra N, Olival KJ, Fagbo SF, Kapoor V, Epstein JH, et al. (2013) Middle East respiratory syndrome coronavirus in bats, Saudi Arabia. Emerg Infect Dis.

To see what this finding means for our understanding of the role of this bat and its relatives in the outbreak of human cases, we need to consider some of the details. As we will see, the most interesting thing is not that the fragment is identical to EMC-2012 but that it is not identical to the other human MERS-CoV viruses.

Bat virus sequence fragment and its relationship to human MERS-CoV

Although the sequence is identical to EMC-2012, most of the other sequenced MERS-CoV viruses differ from it by one nucleotide in this region: at position 129 (in the fragment) the bat virus and EMC-2012 have a 'C' and all the others have a 'T' (in technical terminology, this is a silent transition, the most common type of mutation). The other old virus sequence, Jordan-N3-2012, also has a 'T' at this site, and in addition has a 'T' at position 162, where all the other viruses (including EMC-2012 and the new bat virus) have a 'C' (also a silent transition).

Virus                        Position 129   Position 162
T. perforatus bat MERS-CoV   C              C
EMC-2012                     C              C
Jordan-N3-2012               T              T
England1-2012                T              C
England2-2013                T              C
Munich/Abu_Dhabi-2013        T              C
Al-Hasa-1-2013               T              C
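
As a small illustration, the site patterns in the table above can be written down and compared in a few lines of Python. This is only a sketch: the dictionary and function names are my own, and only the two variable positions are encoded (the other 180 positions of the fragment are identical among these viruses).

```python
# States at fragment positions 129 and 162, copied from the table above.
# The names SITES, differences and group_by_pattern are illustrative only.
SITES = {
    "T. perforatus bat": ("C", "C"),
    "EMC-2012": ("C", "C"),
    "Jordan-N3-2012": ("T", "T"),
    "England1-2012": ("T", "C"),
    "England2-2013": ("T", "C"),
    "Munich/Abu_Dhabi-2013": ("T", "C"),
    "Al-Hasa-1-2013": ("T", "C"),
}


def differences(a, b):
    """Count how many of the two variable positions differ between a and b."""
    return sum(x != y for x, y in zip(SITES[a], SITES[b]))


def group_by_pattern():
    """Group virus names by their (position 129, position 162) states."""
    groups = {}
    for name, pattern in SITES.items():
        groups.setdefault(pattern, []).append(name)
    return groups


if __name__ == "__main__":
    for pattern, names in group_by_pattern().items():
        print(pattern, names)
    print(differences("EMC-2012", "Jordan-N3-2012"))
```

Grouping this way shows three patterns: the bat virus with EMC-2012, Jordan-N3-2012 on its own, and all the remaining human viruses together.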

It is interesting to note that the previous closest match, from a South African Neoromicia zuluensis bat, has a total of 12 differences from EMC-2012 in this short region. It also has a 'T' at position 129 (and a 'C' at position 162), as do the other more distantly related bat coronaviruses.

Although there is only a small fragment of the bat virus, there are full genome sequences for the human viruses. If we take the tree of the full genomes, we can consider how the observed changes in the fragment can be mapped onto it. How we interpret this information depends on how the tree is rooted (see this page for some introduction to trees). There are three plausible roots for the tree:




Each of these trees requires only a single C to T or T to C change to explain the site at position 129 in the fragment. Note that the position of the T. perforatus bat virus is assumed, based on its shared C residue at this site. Its branch length is a guess, approximated as if a complete genome were available, given that the bat virus was sampled in October and EMC-2012 was collected in June.

The third of these rooting positions is the only one that is compatible with a single jump of the virus from bats to humans or from bats to another animal reservoir host (which in turn is infecting humans). Furthermore, because Jordan-N3-2012 dates from April 2012, the common ancestor of the EMC-2012 virus and the bat virus must have existed prior to April. 

The first two rooting positions imply that a bat virus is responsible for the EMC-2012 case (either by direct transmission or through an intermediary host). They also suggest that the other human cases are the result of other sources (presumably also bats, again either directly or indirectly). These two rooting positions further suggest that a 'T' (at position 129 in the fragment) was the state of the ancestor of all of the viruses at the root of the tree. This would fit with the more divergent bat coronaviruses also having a 'T' at this position, and is perhaps a reason to prefer these root positions.

How close are the human and bat viruses?

So if we have two 182 nucleotide sequences that are identical, can we say how similar the whole virus genomes are likely to be? Obviously they could be identical across the whole genome, and this would be indicative of a very close link (the Al-Hasa-1 to 4 sequences are identical or nearly so because they are the result of direct transmission between patients). However, given the rate of evolution of the virus, it is plausible that even relatively divergent sequences could be identical in this small region.

If we guess that the virus is evolving at 1 x 10^-3 substitutions per site per year, how many years of divergence could these viruses have had and still plausibly show no mutations in a 182 nucleotide region? The genome is approximately 30,000 nucleotides long, so in a year we expect 30 mutations to accumulate over the whole genome, and it seems quite plausible for none of them to fall in our short region. What about the 60 mutations expected in 2 years, and so on? There are a number of ways we could model this but I am going to use the simplest. If we assume that each of our 182 nucleotide sites independently has a probability, p = rt, of mutating (the rate per site per year times the time), then the probability of none of the sites mutating is (1 - rt)^182, where r is the rate and t is the amount of divergence time (twice the time to the common ancestor). We can then plot this probability against how far back in time the viruses shared a common ancestor (assumed to be half the divergence time).

We can see that the probability only drops to 0.5 in a little under 2 years and to 0.1 in a little over 6. In fact they could have last shared a common ancestor 8.2 years ago and still have a 5% chance of being identical in this small region. So although this fragment means a very close relative of the human MERS-CoV is found in a bat geographically close to the first case, the fact it is identical in this short region doesn't mean that these bats are the direct source of the human case. 
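
For anyone who wants to reproduce these numbers, here is a minimal Python sketch of the calculation. It uses the same simple model, (1 - rt)^182 with t equal to twice the time to the common ancestor, and the same guessed rate of 1 x 10^-3 substitutions per site per year. The function and constant names are my own, chosen for illustration.

```python
RATE = 1e-3             # guessed substitutions per site per year (as in the text)
FRAGMENT_LENGTH = 182   # nucleotides in the bat virus fragment
GENOME_LENGTH = 30_000  # approximate genome length in nucleotides


def prob_identical(tmrca_years, rate=RATE, n_sites=FRAGMENT_LENGTH):
    """Probability that no site in the fragment has mutated: (1 - r*t)^n.

    t is the divergence time, taken as twice the time to the common ancestor.
    """
    divergence = 2.0 * tmrca_years
    return (1.0 - rate * divergence) ** n_sites


def tmrca_for_probability(p, rate=RATE, n_sites=FRAGMENT_LENGTH):
    """Invert the formula: years back to the common ancestor at which
    the probability of an identical fragment equals p."""
    divergence = (1.0 - p ** (1.0 / n_sites)) / rate
    return divergence / 2.0


def expected_genome_mutations(years, rate=RATE, genome_length=GENOME_LENGTH):
    """Expected number of mutations over the whole genome after `years`."""
    return rate * genome_length * years


if __name__ == "__main__":
    for p in (0.5, 0.1, 0.05):
        years = tmrca_for_probability(p)
        print(f"P = {p}: common ancestor about {years:.1f} years back")
```

Run as a script, this gives about 1.9, 6.3 and 8.2 years for probabilities of 0.5, 0.1 and 0.05, matching the values read off the plot. Changing RATE shows how sensitive these times are to the guessed evolutionary rate.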

On the other hand it is the right virus in the right class of mammals and in the right location. It seems unlikely that it has no significant role in the origin of the human outbreaks. 

A few assumptions and approximations are being made here. I am assuming that the fragment of RdRp is evolving at the average rate for the whole genome (which it isn't: it is more conserved, making it more likely to be identical in this region). The average evolutionary rate is still not known with much certainty, so I have used the right order of magnitude. When looking at the individual mutations, above, I am assuming that the same mutation is unlikely to happen twice (or the reverse mutation to occur); with rates such as these, random mutations are very unlikely to hit the same site twice. Finally, saying the divergence time is twice the time to the most recent common ancestor ignores the fact that the two viruses were sampled at different times.