MERS-Coronavirus Molecular Epidemiology and Genetic Analysis
We now have 9 complete genome sequences:
4 sequences from the Al-Hasa hospital outbreak have now been isolated and deposited in GenBank by Cotten,M., Watson,S.J., Palser,A.L., Gall,A., Kellam,P., Zumla,A., Memish,Z.A. and the Kingdom of Saudi Arabia, Ministry of Health, Riyadh 11176, Kingdom of Saudi Arabia. The GenBank accession numbers are: KF186564-KF186567.
As these are tightly epidemiologically linked, I have taken the most recent of them (Al-Hasa 1, collected on 2013-05-09) to add to the genetic analysis.
When did these strains share a common ancestor?
With sequences sampled from different times, we can attempt to estimate the rate of evolution. To do this we estimated a maximum likelihood tree under the GTR + gamma model of substitution using PhyML. This is the unrooted maximum likelihood topology with estimated branch lengths:
A maximum likelihood tree estimated using PhyML and the GTR + G model. Branch lengths are in substitutions per site. The tree is arbitrarily rooted midway between the most distant sequences. Numbers below-left of the nodes are bootstrap percentages of 1000 replicates.
A rate of evolution for these sequences can be estimated by root-to-tip regression with our software Path-O-Gen, which plots genetic distance from the root of the tree against the time of isolation of each virus. Here only one Al-Hasa sequence is used, as these are strongly linked epidemiologically and cannot be considered independent points:
The root-to-tip regression of genetic distances against time of isolation using the maximum likelihood tree above. The position of the root of the tree was found to maximize the correlation of this plot.
The estimate of the rate of evolution is given by the slope of the line and the time of the most recent common ancestor by the x-intercept:
This would place the common ancestor of all sampled viruses in the middle of 2011. This genomic rate of evolution is roughly half what we might expect for epidemic influenza A and similar to at least one estimate of the rate of SARS-CoV evolution in humans (2 × 10^-3 substitutions/site/year for SARS-CoV: Zhao et al. (2004) Moderate mutation rate in the SARS coronavirus genome and its implications. BMC Evol Biol. 4:21). The residuals from the line give some indication of the stochasticity of the molecular evolutionary process.
However, this estimate is from very few sequences, with the majority sampled over a relatively short period of time. It is best seen as an upper bound on the rate of evolution and therefore a minimum age for the most recent common ancestor (the true ancestor may be older). In particular, a slightly different rooting position between the two earliest sequences (KSA/EMC_2012 and Jordan_N3_2012) can give a slower rate with a similar correlation in the above plot.
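As a rough illustration of the regression described above, the following Python sketch fits a least-squares line to root-to-tip distances against sampling dates and reads off the rate (slope) and the time of the most recent common ancestor (x-intercept). It is not Path-O-Gen's implementation: the function name `root_to_tip_regression` is hypothetical, the root position is assumed to be already fixed (Path-O-Gen additionally searches for the root that maximizes the correlation), and the dates and distances are synthetic values chosen to lie on a line, not the MERS-CoV data.

```python
def root_to_tip_regression(dates, distances):
    """Ordinary least-squares fit of root-to-tip distance against sampling date.

    dates: sampling times in decimal years
    distances: root-to-tip genetic distances (substitutions/site)
    Returns (rate, tmrca): the slope of the line and its x-intercept.
    """
    n = len(dates)
    mean_t = sum(dates) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, distances))
    rate = sxy / sxx                      # substitutions/site/year
    intercept = mean_d - rate * mean_t    # distance at year 0
    tmrca = -intercept / rate             # date at which distance is zero
    return rate, tmrca


# Synthetic example only (NOT the MERS-CoV estimates): three tips lying
# exactly on a line with rate 5e-4 and a root dated 2010.0.
example_dates = [2011.0, 2012.0, 2013.0]
example_distances = [0.0005, 0.0010, 0.0015]
rate, tmrca = root_to_tip_regression(example_dates, example_distances)
print(f"rate = {rate:.2e} substitutions/site/year, TMRCA = {tmrca:.2f}")
```

With real data the points scatter around the line, and with few, closely spaced sampling dates small changes in the distances or in the root position can shift both the slope and the intercept considerably.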
A slightly older version of this analysis (from before the most recent KSA sequences were available) appears in Drosten et al, Lancet Infectious Diseases, 17 June 2013.
What is the nearest non-human host relative?
These notes are a speculative judgement of what the above results suggest and some consideration of alternative scenarios - these should not be treated as definitive.
Based on the above results and the restricted geographical range of the known cases, it seems unlikely that this virus has been circulating entirely in humans since these sequences shared a common ancestor (estimated to be mid 2011 or earlier). Although it is certain that the virus can spread from human to human (familial cases are noted and the large hospital-associated cluster at Al-Hasa, KSA), a single introduction into humans and subsequent epidemic would be unlikely to have remained restricted to the Arabian Peninsula (the UK case from January was a transitory visitor to Saudi Arabia).
A more likely interpretation of the data would be multiple zoonotic transmissions from an animal reservoir. If the reservoir has a high contact rate with humans (e.g., a domesticated or farmed animal) then multiple small chains of human transmission could be hypothesized allowing for contact with the cases that have been described so far. The key question, therefore, is how many jumps to humans does this represent?
Three alternative scenarios of cross-species transmission from a reservoir host to humans. Each circle represents a jump to humans and lineages to the right of these represent human-to-human transmission up to the case where the virus was isolated. The first involves a single jump at least 2 years ago with sustained human-to-human transmission leading up to the cases occurring in recent months. The second is a scenario where the Jordan cluster from 2012 and the patient from KSA (the first to be isolated) are the result of separate jumps but there has been sustained transmission since mid 2012 giving rise to all the recent cases. The third implies many independent jumps into humans with varying amounts of human-to-human transmission. Some may represent direct exposure to a reservoir and a single human case, others a cluster of secondary cases (such as the Al-Hasa outbreak).
From the above tree it would seem that the more recently isolated viruses (Qatar/England1, Munich/AbuDhabi, England2 and the Al-Hasa sequences) form a developing lineage. It is more plausible that this represents an emerging cluster of human-circulating cases with a common ancestor in the second half of 2012, meaning the tree above might represent 3 independent jumps (KSA/EMC-2012, Jordan-N3-2012 and the new lineage - see the figure above). Without genetic data it is not clear how all the other cases fit into this picture.
A recent paper in the Lancet has suggested the virus has an incubation period of up to 12 days. In the time between 2012.6 (the TMRCA of the 'new' clade) and 2013.35 (the time of the most recent sample) there is time for a minimum of 22 incubations. At any significant growth rate (R0>>1) this would result in a large number of infections. In this scenario, the recorded cases represent a small fraction of the total number of cases with a bias towards severe cases and traced contacts. However, if this were the case then the virus would be likely to have spread more widely globally, with occasional severe cases cropping up to indicate this. With an R0 < 1 it is possible to get chains of transmission without reaching high total numbers of cases, but these will generally die out. How long (in terms of the number of infections) these chains will be before they stochastically die out will depend on the value of R0. Thus, if the recent clade does represent a single jump followed by human-to-human circulation, this is most likely compatible with an R0 that is not much greater than 1 and a relatively long incubation period.
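The arithmetic behind this argument can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not an epidemiological model: following the text, one transmission generation is taken to last one 12-day incubation period; the function names are hypothetical; `expected_cumulative_cases` assumes every case infects exactly R0 others with no depletion of susceptibles; and `expected_chain_size` uses the standard mean total size, 1/(1 - R0), of a subcritical branching process started by one case.

```python
import math


def generations_between(start_year, end_year, generation_days, days_per_year=365.25):
    """Whole transmission generations that fit between two decimal-year dates."""
    return math.floor((end_year - start_year) * days_per_year / generation_days)


def expected_cumulative_cases(r0, generations):
    """Cumulative cases after a number of generations of deterministic growth,
    starting from one case (generation 0) and ignoring susceptible depletion."""
    return sum(r0 ** k for k in range(generations + 1))


def expected_chain_size(r0):
    """Mean total size of a transmission chain from a single introduction
    when R0 < 1 (subcritical branching process): 1 / (1 - R0)."""
    if not 0 <= r0 < 1:
        raise ValueError("a finite expected chain size requires 0 <= R0 < 1")
    return 1.0 / (1.0 - r0)


# Dates and the 12-day period are taken from the text; R0 values are illustrative.
n_gen = generations_between(2012.6, 2013.35, 12)
print(n_gen)                                # 22
print(expected_cumulative_cases(2, n_gen))  # 8388607
print(expected_chain_size(0.5))             # 2.0
print(expected_chain_size(0.9))             # about 10
```

Even a modest R0 of 2 sustained over 22 generations implies millions of infections, whereas chains with R0 below 1 stay small on average, and their expected size grows quickly as R0 approaches 1.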
See this article by the MRC Centre for Outbreak Analysis and Modelling at Imperial College for an excellent outline of these scenarios and the epidemiology behind them (Cauchemez et al., 2013, Eurosurveillance, Volume 18, Issue 24, 13 June 2013).
Once again, sequence data from more individual cases and potentially some consideration of the spatial pattern of these may be able to tell us more about the likely number of zoonotic events and degree of human circulation.
Deeper virus origins
Whilst the (relatively) close phylogenetic relationship of the human virus to bat coronaviruses may indicate bats as an ultimate source of this virus, it seems unlikely that bats are the immediate contact for the human cases, as human-bat contacts are relatively infrequent. Speculatively, it would be plausible that the virus crossed from bats to a domesticated or agricultural animal, in which it then spread widely within the last few years in the Arabian Peninsula. Further surveillance of both bats and other potential reservoirs will undoubtedly be ongoing and the epidemiology of this virus will become clearer.