Novel Human Coronavirus Molecular Epidemiology and Genetic Analysis - Origin and Evolution


This is an update of an older analysis based on only 2 sequences.

We now have 3 complete genome sequences:

Name       Accession    Source      Date of collection
hCoV_EMC   JX869059.2   Patient 3   2012-06-13
England_1  KC667074.1   Patient 4   2012-09-12
England_2  HPA Website  Patient 10  between 2013-01-24 and 2013-02-11

When did these strains share a common ancestor?

With 3 sequences, sampled at different times, we can now estimate the rate of evolution. To do this we estimated a maximum likelihood tree under the GTR + gamma model of substitution using PhyML. With three sequences there is only one possible (unrooted) tree topology, but we can estimate the branch lengths:

A maximum likelihood tree estimated using PhyML and the GTR + G model. Branch lengths are in substitutions per site.

The information about the rate of evolution comes from the rooting of the two England strains. The branch to England 2 is 8.74 × 10^-4 substitutions/site, whereas the branch to England 1 is 7.05 × 10^-4 substitutions/site. The difference, 1.69 × 10^-4, is approximately how much evolution occurred in the 0.383 years between the collection dates of the samples. Dividing 1.69 × 10^-4 by 0.383 gives an estimated rate of evolution of 4.41 × 10^-4 substitutions/site/year. This compares very closely with the estimates for SARS, which in humans has been estimated to be evolving at 4.0 × 10^-4 (2.0 × 10^-4 to 6 × 10^-4) substitutions per site per year for the 1a polyprotein (Salemi et al., 2004, J Virol 78:1602), and with those of Vijgen et al. (2005, J Virol 79:1595-1604), who estimate rates of evolution for a group 2 CoV in cattle. This is probably one of our best estimates of the rate of evolution of a zoonotic CoV. Vijgen et al. use a number of methods to estimate rates, but all give a similar estimate on the order of 4 × 10^-4 substitutions per site per year.
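The arithmetic behind this estimate can be sketched in a few lines of Python. The branch lengths and the 0.383-year interval are the values quoted above; the variable names are illustrative only.

```python
# Branch lengths (substitutions/site) as quoted in the text.
branch_england_2 = 8.74e-4
branch_england_1 = 7.05e-4

# Time between the two collection dates, in years (value used in the text).
years_between_samples = 0.383

# Extra divergence accumulated by the later-sampled strain.
extra_divergence = branch_england_2 - branch_england_1

# Rate = extra divergence / elapsed time.
rate = extra_divergence / years_between_samples

print(f"extra divergence: {extra_divergence:.2e} subst/site")
print(f"estimated rate:   {rate:.2e} subst/site/year")
```

This reproduces the point estimate only; it says nothing about the uncertainty around it.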

The dates of collection for the two England strains are based on the range given in the table and some guess-work. If I can get more precise dates then I will update the analysis but the results will not change substantially.
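As a rough check on how much the imprecise date matters, the sketch below recomputes the rate for the two ends of the England_2 collection window given in the table. It assumes that the England_1 date (2012-09-12) is exact and that a year is 365.25 days; the function name `rate_for` is hypothetical.

```python
from datetime import date

# Difference between the two England branch lengths (subst/site).
extra_divergence = 8.74e-4 - 7.05e-4

# Assumption: the England_1 collection date from the table is exact.
england_1_date = date(2012, 9, 12)


def rate_for(england_2_date: date) -> float:
    """Rate (subst/site/year) implied by a candidate England_2 date."""
    years = (england_2_date - england_1_date).days / 365.25
    return extra_divergence / years


# The two ends of the collection window given in the table.
rate_earliest = rate_for(date(2013, 1, 24))
rate_latest = rate_for(date(2013, 2, 11))

print(f"{rate_latest:.2e} to {rate_earliest:.2e} subst/site/year")
```

Under these assumptions the implied rate stays between roughly 4.1 × 10^-4 and 4.6 × 10^-4 subst/site/year across the whole window.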

We can use this rate to estimate a timescale for the tree:

The same tree as above but with a timescale added given by the estimated rate of evolution. 

This would place the common ancestor of all three viruses in the first half of 2009 and the divergence of the two UK cases in the early part of 2011. This estimate does not accommodate the uncertainty in the stochastic process that gives rise to the observed number of mutations in the time available at the given rate.
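The dating of the node joining the two UK cases can be sketched as follows. This assumes that the England 1 branch length quoted earlier (7.05 × 10^-4 subst/site) is measured from the common ancestor of the two England strains, and that the rate of 4.41 × 10^-4 subst/site/year applies along that branch; `decimal_year` is a helper written for this illustration.

```python
from datetime import date


def decimal_year(d: date) -> float:
    """Convert a calendar date to a decimal year (e.g. 2012.70)."""
    start = date(d.year, 1, 1)
    end = date(d.year + 1, 1, 1)
    return d.year + (d - start).days / (end - start).days


RATE = 4.41e-4              # subst/site/year, estimated above
BRANCH_ENGLAND_1 = 7.05e-4  # subst/site, assumed measured from the UK node

# Collection date of England_1, from the table.
tip_date = decimal_year(date(2012, 9, 12))

# Node age = branch length / rate; node date = tip date - node age.
node_date = tip_date - BRANCH_ENGLAND_1 / RATE

print(f"UK cases diverged around {node_date:.1f}")
```

Under these assumptions the node falls in early 2011, matching the timescale described above. The same subtraction applied to the root of the tree would give the 2009 date, but the root branch lengths are not quoted in the text, so that step is not reproduced here.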

What is the nearest non-human host relative?

The closest non-human sequence to the human cases is a short fragment from a CoV isolated from a pipistrelle bat in the Netherlands, collected in 2008. The fragment consists of 332 nucleotides of pp1b, located at nucleotide 15033 in the human CoV genomes. There are 41 differences between the human cases and the bat sequence, giving a divergence of 0.123 subst/site. Because this divergence accumulated along both lineages since their common ancestor, at the same approximate rate as above (4 × 10^-4 subst/site/year) it corresponds to an MRCA existing about 150 years ago. So this fragment can tell us little about the possible location and species of the reservoir host for the human cases.
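A minimal sketch of this calculation is given below. It assumes that the divergence is the uncorrected proportion of differing sites, that both lineages evolve at the same rate, and that the bat and human samples are close enough in time to be treated as contemporaneous, so the time to the MRCA is the divergence divided by twice the rate.

```python
# Values quoted in the text.
differences = 41        # nucleotide differences, human vs bat fragment
fragment_length = 332   # nucleotides of pp1b compared
rate = 4e-4             # subst/site/year, approximate rate from above

# Uncorrected pairwise divergence (proportion of sites that differ).
divergence = differences / fragment_length

# Assumption: divergence accumulates along BOTH lineages since the MRCA,
# so the time back to the MRCA is divergence / (2 * rate).
years_to_mrca = divergence / (2 * rate)

print(f"divergence:    {divergence:.3f} subst/site")
print(f"years to MRCA: {years_to_mrca:.0f}")
```

An uncorrected distance of this size will underestimate the true number of substitutions because of multiple hits at the same site, so the figure of about 150 years is best read as an order-of-magnitude estimate.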


Based on the above results and the restricted geographical range of the known cases, it seems unlikely that this virus has been circulating entirely in humans since 2009. Although it is certain that the virus can spread from human to human (two familial clusters are noted), a common ancestor in humans in 2009 would represent a very large number of infections and thus would be unlikely to have remained restricted to the Arabian Peninsula (the UK case from January was a transitory visitor to Saudi Arabia).

A more likely interpretation of the data would be multiple zoonotic transmissions from an animal reservoir. If the reservoir has a high contact rate with humans (e.g., a domesticated or farmed animal) then multiple small chains of human transmission could be hypothesized allowing for contact with the cases that have been described so far. 

Whilst the (relatively) close phylogenetic relationship of the human virus to bat coronaviruses may indicate bats as an ultimate source of this virus, it seems unlikely that bats are the immediate contact for the human cases, as human-bat contacts are relatively infrequent. Speculatively, it would be plausible that the virus crossed from bats to a domesticated or agricultural animal and then spread widely within the last few years in the Arabian Peninsula. Further surveillance of both bats and other potential reservoirs will undoubtedly be ongoing and the epidemiology of this virus will become clearer.

Further sequencing of the currently reported human cases, where samples exist, would certainly help resolve the timescale of this virus.

Update 2013-02-25:

This paper reports anecdotally that one patient had indirect contact with ill goats:

Quoting from the paper:

While our patient denied contact to bats, he remembered ill goats among the animals on his farm. Albarrak et al. reported that the first Saudi case was exposed to farm animals, but the first Qatari patient and the second Saudi patient were not [15]. Although our patient reported no direct contact with his animals, one animal caretaker working for him was ill with cough and might have been an intermediate link in the chain of infection.