May 4, 2009
Journal of Heredity
Genome 10K: A Proposition to Obtain Whole Genome Sequence
For 10,000 Vertebrate Species
Genome 10K Community of Scientists
Table of Contents
Appendix 1: Policy issues
Appendix 2: Detailed table of classes, orders, families, genera and species
Appendix 3: Technological issues
Table 1: Vertebrate taxa
Table 2: List of collections and participating institutions (Fill this in)
Figure legends (need these) (Fill in references)
Figure 1: A phylogenetic view of the vertebrate subclass
Figure 2: The mammalian radiations
Figure 3: The birds of the world
Figure 4: The reptiles of today
Figure 5: The amphibians’ phylogeny
The human genome project was recently complemented by whole genome sequencing assessment of 32 mammals and 24 non-mammal vertebrate species suitable for comparative genomic analyses. Here, we anticipate the precipitous drop in sequencing costs and the development of greater annotation technology by proposing to create a collection of tissue and DNA specimens for 10,000 vertebrate species specifically designated for whole genome sequencing in the very near future. For this purpose we, the G10K Community of Scientists, assemble and allocate a collection of some 16,000 representative vertebrate species spanning evolutionary diversity across living mammals, birds, reptiles, amphibians and fishes (circa 60,000 living species).
With this resource, the scientific community may now pursue the opportunity to unfold a new era of investigation in the biological sciences; embark upon the most comprehensive study of biological evolution ever attempted; and integrate genomic inference into every aspect of vertebrate biological enquiry.
The bold insight behind the success of the human genome project was that, while vast, the roughly three billion bits of digital information specifying the total genetic heritage of an individual are finite and might, with dedicated resolve, be brought within the reach of our technology. The number of living species is similarly vast, estimated to be between 10^6 and 10^8 for all phyla, and approximately 5 x 10^4 for the subphylum Vertebrata, the evolutionarily closest relatives to humankind. With that same unity of purpose, we can now contemplate reading the genetic heritage of all species, beginning with the vertebrates. The feasibility of a “Genome 10K” project to catalog the diversity of vertebrate genomes requires but one more order of magnitude reduction in the cost of DNA sequencing, following the four orders of magnitude reduction we have seen in the last 10 years. It is time to prepare for this undertaking.
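As a rough illustration of the arithmetic behind this feasibility argument, the following Python sketch converts the stated four orders of magnitude over 10 years into an average yearly fold reduction, and then estimates how long one further order of magnitude would take. The estimate assumes that the past average rate simply continues, which is an extrapolation made here for illustration and not a claim of the proposal; the function names are likewise illustrative.

```python
# Figures stated in the text.
PAST_ORDERS_OF_MAGNITUDE = 4   # cost reduction already observed
PAST_YEARS = 10                # period over which that reduction occurred
REMAINING_ORDERS = 1           # further reduction the project requires


def annual_fold_reduction(orders: float, years: float) -> float:
    """Average yearly factor by which cost fell, given `orders` of
    magnitude of reduction spread evenly (geometrically) over `years`."""
    return 10 ** (orders / years)


def years_needed(remaining_orders: float, orders: float, years: float) -> float:
    """Years required to gain `remaining_orders` more orders of magnitude.

    ASSUMPTION (not from the text): the past average rate of
    `orders` per `years` continues unchanged.
    """
    return remaining_orders * years / orders


if __name__ == "__main__":
    rate = annual_fold_reduction(PAST_ORDERS_OF_MAGNITUDE, PAST_YEARS)
    wait = years_needed(REMAINING_ORDERS, PAST_ORDERS_OF_MAGNITUDE, PAST_YEARS)
    print(f"Average cost reduction: about {rate:.2f}-fold per year")
    print(f"Years to one more order of magnitude at that rate: {wait:.1f}")
```

Under that constant-rate assumption, the remaining order of magnitude corresponds to a few years of further progress, which is consistent with the call to begin assembling specimens now.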
Living vertebrate species derive from a common ancestor that lived roughly 500 million years ago, near the time of the Cambrian explosion of animal life. Because a core repertoire of about 10,000 genes in a genome of about a billion bases is seen in multiple, deeply-branching vertebrates and close deuterostome out-groups, we may surmise that the haploid genome of the common vertebrate ancestor was already highly sophisticated, consisting of 10^8 - 10^9 bases specifying a body plan that includes segmented muscles derived from somites, a notochord and dorsal hollow neural tube differentiating into primitive forebrain, midbrain, hindbrain and spinal cord structures, basic endocrine functions encoded in distant precursors to the thyroid, pancreas and other vertebrate organs, and a highly sophisticated innate immune system [4-8]. In the descent of the living vertebrates, the roughly 10^8 bases in the functional DNA segments that specify these sophisticated features, along with more fundamental biological processes, encountered on average more than 30 substitutions per nucleotide site, the outcome of trillions of natural evolutionary experiments. These and other genetic changes, including rearrangements, duplications and losses, spawned the diversity of vertebrate forms that inhabit the planet today. A Genome 10K project explicitly detailing these genetic changes will provide an essential reference resource for an emerging new synthesis of molecular, organismic, developmental and evolutionary biology to explore the forms of life, just as the human genome project has provided an essential reference resource for a new biomedicine.
Beyond elaborations of ancient biochemical and developmental pathways, vertebrate evolution is characterized by stunning innovations, including adaptive immunity, multi-chambered hearts, cartilage, bones and teeth, and specialized endocrine organs such as the pancreas, thyroid, thymus, pituitary, adrenal and pineal glands. At the cellular level, the neural crest, sometimes referred to as a fourth germ layer, is unique to vertebrates and gives rise to a great variety of structures, including some skeletal elements; tendons and smooth muscle; neurons and glia of the autonomic nervous system and aspects of the sensory systems; melanocytes in the skin; dentin in the teeth; parts of endocrine system organs; and connective tissue in the heart [12, 13]. Sophisticated vertebrate sensory, neuroanatomical and behavioral elaborations coupled with often dramatic anatomical changes allowed exploitation of oceanic, terrestrial, and atmospheric ecological niches. Anticipated details of expansions and losses of specific gene families revealed by the Genome 10K project will provide profound insights into the molecular mechanisms behind these extraordinary innovations.
Adaptive changes in non-coding regulatory DNA also play a fundamental role in vertebrate evolution, and understanding these represents an even greater challenge. Almost no part of the non-coding vertebrate gene regulatory apparatus bears any discernible resemblance at the DNA level to analogous systems in our distant deuterostome cousins, yet the apparatus occupies the majority of the bases we find to be under selection in vertebrate genomes, and is hypothesized to be the major source of evolutionary innovation within vertebrate subclades. The origins and evolutionary trajectory of the subset of non-coding functional elements under the strongest stabilizing selection can be traced deep into the vertebrate tree, in many cases to its very root, while other non-coding functional elements have uniquely arisen at the base of a particular class, order or family of vertebrate species. Within vertebrate clades that evolved from a common ancestor in the last 108 million years, such as mammals (5,000 species), modern birds (10,000 species), Neobatrachia (5,000 frog and toad species), Acanthomorpha (16,000 fish species), snakes (3,000 species), geckos (1,000 species) and skinks (1,300 species), evolutionary coalescence to a common ancestral DNA segment can be reliably determined even for segments of neutrally evolving DNA. This enables detailed studies of base-by-base evolutionary changes throughout the genome, in both coding and non-coding DNA. With an estimated neutral branch length of about 20 substitutions per site within such clades, the Genome 10K project provides enormous depth of genome-wide sequence variation to address critical hypotheses concerning the origin and evolution of functional non-coding DNA segments and their role in molding physiological and developmental definitions of living animal species.