Listening to the Genome: Music or Noise?

One of the great triumphs of twentieth-century biology was the discovery of how genes make proteins. Genes are encoded in DNA. To turn the sequence of a gene into a protein, a number of molecules gather around it. Reading its sequence, they produce a single-stranded version of it made of RNA, called a transcript. The transcript gets shipped to a cluster of other molecules, the ribosome, which picks out building blocks to construct a protein that corresponds to the gene. The protein floats off to do its job, whether that job is to catch light, digest food, or help generate a thought.

We have about 20,000 protein-coding genes. If you tally up the amount of DNA they constitute, you get less than 3 percent of the human genome. Which naturally raises the question of what’s in the other 97 percent.

This question is hardly new, and the answers have earned scientists scads of Nobel prizes over the years.

Some of them found pieces of non-protein-coding DNA that are essential to our survival. Over fifty years ago, Francois Jacob and his colleagues realized that the non-protein-coding DNA contains stretches, called regulatory elements, that act like switches for genes. When proteins and RNA molecules grab onto those switches, the genes become active.

Scientists have also known for decades that sometimes when a cell makes an RNA transcript, it can use that transcript for an important job without bothering to translate it into a protein. The ribosome, for example, is an assembly of proteins and RNA molecules. George Palade published this discovery in 1955.

Since then, scientists have plunged ever deeper into the workings of regulatory elements and functional RNAs. Many of our genes are controlled not just by a single switch, but by a veritable combination lock of different regulatory elements. RNAs can carry out many more jobs than just working in the ribosome. They can silence other genes, for example, by locking onto their transcripts.

Understanding the other 97 percent of the genome is a challenge at once profound and medically practical. Earlier this year, for example, scientists identified a gene for a long piece of RNA called PTCSC3 that suppresses cancer in the thyroid.

But for just as long, scientists have known that some of the genome does not carry out such vital functions. Barbara McClintock discovered in the 1940s that parts of the genome can make copies of themselves which can then insert themselves elsewhere in our DNA. It turns out our genomes are a veritable zoo of these so-called “mobile elements,” including ancient viruses. In some cases, evolution harnesses these mobile elements for useful purposes. But a lot of them have mutated to the point that they do nothing at all. About eight percent of our genome is made up of the littered remains of dead viruses, for example.

While the basics of the human genome have been clear for decades, the particulars have remained murky. Scientists today are using better tools to explore the genome. They can now gain some clues about any particular chunk of DNA by looking at its sequence. It’s possible to recognize a protein-coding gene, for example–and it’s also possible to see if it has mutations that have rendered it functionally dead (a pseudogene).

But there’s no getting around the hard work of old-fashioned biology–of peering into cells to see what’s going on. And when scientists look in there, things get contentious.

In 2008, I wrote in the New York Times about a then-new project called ENCODE, in which a small army of scientists would create an encyclopedia of information about the entire genome, not just the protein-coding bits. Last year, the ENCODE team unveiled their analysis of this encyclopedia. It traveled through the high-profile-paper-becomes-a-press-release-and-inspires-breathless-articles-with-misleading-headlines sausage machine and ended up giving the impression that until now, scientists thought everything in the genome besides protein-coding genes was “junk,” and that the ENCODE project proved–without a doubt–that about eighty percent of the genome has a function.

What the scientists actually demonstrated was that cells produce RNA transcripts from a huge portion of the genome, not just from the protein-coding parts. They also observed that proteins were able to grab onto those regions–a suggestion that they were switching on genes for RNA. They concluded that this kind of evidence demonstrated that eighty percent of the genome has “biochemical function.” (John Timmer wrote a good analysis of the ENCODE saga at Ars Technica.)

The ENCODE team drew a remarkably high tide of criticism from other researchers. One long-running complaint is that the mere existence of an RNA transcript does not mean it serves any function at all. Cells can be sloppy, shooting off RNA transcripts from useless DNA. Those accidentally transcribed pieces of RNA promptly get destroyed.

To get a feel for the intensity of that criticism, check out this piece that Dan Graur and colleagues published this February in the journal Genome Biology and Evolution:

Here, we detail the many logical and methodological transgressions involved in assigning functionality to almost every nucleotide in the human genome. The ENCODE results were predicted by one of its authors to necessitate the rewriting of textbooks. We agree, many textbooks dealing with marketing, mass-media hype, and public relations may well have to be rewritten.

I’ve been very curious about how scientists would move on from here, and how the debate over the genome would evolve. Now a team of scientists at the University of California at San Francisco has published an interesting paper on the issue in the journal PLOS Genetics.

The UCSF researchers come to a conclusion much like that of ENCODE. They analyzed newly compiled databases of the RNA produced in cells from different tissues. They then pinpointed which segments of DNA encoded that RNA. They found about 85% of the genome produced at least one copy of RNA in one of the databases. The UCSF researchers argued that these results support ENCODE’s work.

The scientists then probed those transcripts to see whether they were just sloppy mistakes or served a function. They focused on one class of transcripts, known as long intergenic noncoding RNAs, or lincRNAs for short. A number of scientists have been cataloging lincRNAs for a few years now, but they’ve only identified a few thousand that appear to have a function. The UCSF team searched their new databases for more lincRNAs. They first identified long transcripts, and then they winnowed down their list to get rid of false positives. They filtered out sequences that might be fragments of protein-coding genes that managed to slip into the database, for example. They also combined segments of DNA that overlapped in a way that suggested they both came from a single lincRNA gene.
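
The winnowing steps described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the UCSF team's actual pipeline: the function names, the made-up coordinates, and the 200-nucleotide length cutoff are all hypothetical choices for the example.

```python
from typing import List, Tuple

# A transcript or gene is a (start, end) span on one chromosome, end exclusive.
Interval = Tuple[int, int]

def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two spans share at least one position."""
    return a[0] < b[1] and b[0] < a[1]

def drop_overlapping(candidates: List[Interval], coding: List[Interval]) -> List[Interval]:
    """Discard candidates that touch any known protein-coding gene."""
    return [c for c in candidates if not any(overlaps(c, g) for g in coding)]

def merge_overlapping(intervals: List[Interval]) -> List[Interval]:
    """Combine overlapping spans, treating each cluster as one candidate gene."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def screen(candidates: List[Interval], coding: List[Interval],
           min_length: int = 200) -> List[Interval]:
    """Keep long transcripts, remove likely coding fragments, then merge."""
    # min_length is an assumed cutoff for "long", chosen for illustration.
    long_enough = [c for c in candidates if c[1] - c[0] >= min_length]
    intergenic = drop_overlapping(long_enough, coding)
    return merge_overlapping(intergenic)

# Hypothetical coordinates, invented purely for this example.
candidates = [(100, 150), (1000, 1500), (1400, 2000), (5000, 5600), (9000, 9400)]
coding_genes = [(5500, 7000)]

print(screen(candidates, coding_genes))  # [(1000, 2000), (9000, 9400)]
```

In this toy run, the short transcript is dropped, the one that brushes against a protein-coding gene is filtered out, and the two overlapping transcripts collapse into a single candidate.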

Counting previously discovered lincRNAs, the researchers ended up with a total of 55,000 potential non-protein-coding genes. The scientists then examined each of the candidate genes for clues as to whether they serve a function. One clue was that the transcripts tend to show up just in one kind of tissue. That’s the rule for many proteins–hemoglobin is very useful in your blood but not very helpful in your eye.

The scientists also found that these stretches bear some hallmarks of being switched on and off. DNA is wound around spools called histones, and the candidate lincRNA genes had proteins latched onto them that can unwind DNA so that it can be transcribed.

Another clue came from comparing the candidate lincRNA genes in humans to other species. If a piece of DNA serves no function, it will be prone to picking up mutations. Since the DNA encodes nothing of importance, mutations to it can do no harm.

Mutations that strike functional pieces of DNA, on the other hand, can be devastating. In these cases, natural selection should eradicate them over millions of years. The lincRNA gene candidates that the UCSF scientists found are fairly similar to versions in other mammals. That suggests that evolution is conserving them–and that they serve a function.

If these 55,000 candidates do turn out to be true genes for lincRNAs, then they will outnumber our roughly 20,000 traditional protein-coding genes by more than two to one. The scientists don’t claim that they’ve definitively proved that these are genes, however; they look at their catalog as a collection of candidates that deserve to be tested with experiments. “The time is ripe for this dark matter of the human genome to step further into the spotlight,” they write.

I asked a few of ENCODE’s outspoken critics about the new paper, to see whether it changes their view on the genome’s other 97 percent.

Sean Eddy, a biologist at Janelia Farm Research Campus, is very skeptical of all such large-scale catalogs. When he’s looked closely at such catalogs, he’s found plenty of false positives. Rather than just compile a list of possible genes, he thinks scientists should do some careful quality control. They should be like inspectors at a factory, and pick out a random set of candidates to test. Only if careful experiments show that they really do behave the way a functional gene behaves can they have confidence in their catalog.

While he was filling up his coffee this morning, Eddy thought up an analogy for this kind of research–one, he wrote to me, “that might be clarifying rather than dumb.”

If you took a big chunk of English text and screened it for novel “dark matter” (the birth of new words!) by eliminating all words that appear in the dictionary, you would indeed find a lot of “novel” words in your “high throughput screen”, and maybe get excited. But the moment you actually looked at a sample of what you’d found, you’d see it was almost all stuff that was obvious in retrospect. You’d say, “Oh yeah, numbers. Oh yeah, abbreviations. Oh yeah, foreign words. Oh yeah, proper names. Oh yeah, misspellings.” And you’d have five new null hypotheses, alternative explanations for your “novel words”; then you’d go back and revise your screen to eliminate those. To my mind, a lot of the lincRNA papers fail to do the part where they look carefully (manually) at what their screen produced, so they fail to develop their intuition for the various failure modes of the high-throughput computational screen.
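
Eddy's thought experiment is easy to try in miniature. The Python sketch below is a toy illustration and not anything Eddy wrote: the function names, the tiny dictionary, and the sample sentence are all invented for the example. It runs the naive screen first and then applies three of his null hypotheses to the hits.

```python
import re
from typing import List, Set

def naive_screen(text: str, dictionary: Set[str]) -> List[str]:
    """Return every token whose lowercase form is missing from the dictionary."""
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    return [t for t in tokens if t.lower() not in dictionary]

def explain(token: str) -> str:
    """Apply simple null hypotheses before calling a hit a 'novel word'."""
    if token.isdigit():
        return "number"
    if token.isupper() and len(token) > 1:
        return "abbreviation"
    if token[0].isupper():
        return "proper name?"
    # Foreign words and misspellings would need further checks; left unexplained here.
    return "unexplained"

# A deliberately tiny, hypothetical dictionary and sentence.
dictionary = {"in", "the", "debate", "reached", "and", "lab"}
text = "In 2013 the ENCODE debate reached Toronto and teh lab"

hits = naive_screen(text, dictionary)
for hit in hits:
    print(hit, "->", explain(hit))
```

Of the four "novel words" the screen turns up here, three have an obvious mundane explanation, and the last is a typo. Only by looking at the hits does that become apparent.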

Larry Moran, a biochemist at the University of Toronto and fierce critic of ENCODE, had a similar response. “Let’s assume that these 55,000 RNAs have a function of some sort,” he wrote to me. “If true, that would require rewriting the textbooks because none of the thousands of labs studying gene expression over the past five decades has seen any hint of such a massive amount of control and regulation by RNAs.”

Moran also pointed out that in many cases, the supposed genes produced just one lincRNA per cell. It strains his imagination to picture a way for a single lincRNA to have any important role in a cell’s existence. Far more likely, it’s just a segment of DNA that the cell accidentally transcribed. “If there have to be more than 10 transcripts per cell then the number of transcripts is reduced to 4,000,” Moran wrote. “If you need more than 30 transcripts per cell then that leaves only 950 putative functional RNAs.”

Moran and Eddy both point out that even if the UCSF researchers are right and all 55,000 DNA segments are real genes for important lincRNAs, that discovery would not, in fact, clear up all that much about the genome as a whole. Here’s how Eddy put it:

Even supposing that all 55,000 were truly functional and important RNAs; 55,000 * 2000nt average lincRNA transcript length = 110MB, less than 4% of the human genome. So I think the questions about these transcripts have to be separated from the concept of junk DNA – if someone did show that an additional 4% of the genome was functional, that would be super cool, but it wouldn’t bear on the questions around junk DNA, which have to do with the majority of the genome.
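
Eddy's back-of-the-envelope calculation can be checked directly. The short Python sketch below uses his figures of 55,000 candidates and an average transcript length of 2,000 nucleotides. The genome size of roughly 3.1 billion nucleotides is my own assumption for the example, and the function name is made up.

```python
def lincrna_genome_fraction(n_candidates: int, avg_length_nt: int, genome_size_nt: int) -> float:
    """Fraction of the genome covered if every candidate were a distinct, functional gene."""
    return (n_candidates * avg_length_nt) / genome_size_nt

N_CANDIDATES = 55_000            # from the UCSF catalog
AVG_LENGTH_NT = 2_000            # Eddy's average lincRNA transcript length
GENOME_SIZE_NT = 3_100_000_000   # assumption: approximate haploid human genome size

total_nt = N_CANDIDATES * AVG_LENGTH_NT
fraction = lincrna_genome_fraction(N_CANDIDATES, AVG_LENGTH_NT, GENOME_SIZE_NT)

print(f"{total_nt:,} nt")   # 110,000,000 nt
print(f"{fraction:.1%}")    # 3.5%
```

Under those assumptions the candidates cover about 110 million nucleotides, which is under 4 percent of the genome, in line with Eddy's estimate.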

I contacted two of the UCSF co-authors to respond to these critiques but haven’t yet heard back from them. As soon as I do, I will add their response and post a notice on the blog that I’ve updated this piece. I’d also love to hear from both sides of this debate in the comments below.

Update 6/22: Here’s what Michael McManus, a co-author of the new paper on lincRNAs, said in reply to my queries. The emphases and links are mine…

CZ: Even if all 55,000 transcripts you identified were functional, they would come to a few percent of the human genome. That wouldn’t address the larger question of whether transcriptionally active “junk DNA” isn’t junk.

MM: We agree. There are far more transcripts expressed than those which we have catalogued. Upon observing that nearly the entire genome is transcribed, we chose as a first step to focus only on lincRNAs, but there are many other transcripts we did not focus on in this paper. In fact a study that we reference in our paper looked extremely deeply at very narrow intergenic regions thought to be transcriptionally inactive, but found a dizzying array of complex, regulated transcripts at very low levels in these regions. This clearly shows that what we’ve found is the tip of the iceberg. However, we cannot distinguish between functional and nonfunctional transcripts without performing functional experiments and this is the obvious next frontier for determining how much of this transcriptionally active “junk DNA” is or is not junk.

CZ: Could there be alternative explanations for these sequences? Could these supposed lincRNA genes actually be pieces of ordinary protein-coding genes, or a false positive from how the experiment is designed?

MM: These are both potential sources of artifactual expression signal and we did work to mitigate both scenarios. We noticed that many currently annotated genes can actually extend outside of their annotated regions when using RNA-seq data to analyze expression levels, so we removed any putative lincRNAs that overlap any of these empirically extended gene structures. After this filtering, we found that the majority of lincRNAs in our catalog are relatively distant (>30 kilobases) to the nearest protein coding gene. We assert that our catalog is by no means perfect but does represent a more refined dataset for investigators to further evaluate. (Emphasis CZ.)

Regarding the second source of artifact, we did take multiple steps to minimize the potential for genomic DNA contamination, and this is described in the Methods section of the paper. Again, it is fair to say that in some rare cases, some of the putative lincRNAs we discovered may be artifacts and additional data is needed (longer reads, deeper sequencing) to achieve even higher confidence in all lincRNAs. For this reason we define the lincRNAs as “putative” in the paper, because they must be manually experimentally validated with great care. This mantra is true even for the large number of existing protein coding genes that have been reported but haven’t been validated.

CZ: Could you test out twenty candidates for lincRNAs to see if they aren’t just noise?

MM: In concept this is true. Manual verification of lincRNAs is an important future direction, and an important pre-requisite for follow-up functional studies. That said, RNA-seq data is becoming a widely accepted approach for studying lowly expressed transcripts as evidenced by large numbers of publications using the technology.

CZ: The commenters thought it unlikely that lincRNAs that are present at just a few copies per cell would be able to have a function. If you use a cutoff of 30 copies per cell, only 950 lincRNAs remain.

MM: We make no broad-sweeping assertions about the functionality of the lincRNAs described in the study. In fact, it is entirely possible that almost none of the lincRNAs we reported are functional. (Emphasis CZ) Moreover, we are hesitant to make strong claims that relate low expression level to functionality, given the reports that low level lincRNA transcripts are functional (examples are HOTTIP, CCND1 ncRNA and others). Therefore an important first step toward identifying which functional lincRNAs is to generate a catalog of all putative lincRNAs for follow-up function based studies.

11 thoughts on “Listening to the Genome: Music or Noise?”

  1. A bit more diffidence by scientists on both sides, though not essential for progress to continue, would reflect more positively on the scientific endeavor. Smug self-assurance in the face of inconclusive data is unbecoming of scientists. Or people in general for that matter. The reality of this situation is that the data accumulated up to now is insufficient to provide any meaningful claim of victory by either side. In reading a number of the ENCODE papers, I couldn’t help but be overwhelmed by the immensity of data thus generated. However, it was also clear that for most of the data, their full import is yet to be realized. Rather, what we have is a field where enough interesting data is accumulated to suggest fruitful future directions where the eventual findings may provide conclusive evidence towards one or the other interpretations. However, another possibility is that further investigation may result in both sides having to admit defeat when data elevates a not-yet-conceived model to the fore. One idea I’ve considered, though it’s pure speculation on my part because I am fairly ignorant of the field as a whole, is the possibility that these cataloged but ill-characterized regions of the genome and transcriptome may be essential for determining the 3-D structure of the genome within the nucleus. It does appear that the structure is important, but the whys and hows of the establishment and maintenance of this structure remain mysteries, especially since methodologies such as Hi-C that allow for a greater understanding of this phenomenon are fairly new and in only limited use. However, I’m eager to see just where the research takes us.

  2. The ENCODE work is very provocative. As scientists it is important to remain true skeptics. Keep an open mind to all possibilities. Let the current and future research define what is possible. Heresy has become fact more than once in scientific history. And we are just beginning to understand the world around us. We have A LOT more to learn and discover.

  3. You probably realize this, Carl, but different rates of transcription in different tissues is, to say the least, weak evidence for functional importance, because the transcriptional environments – the populations of transcription factors floating around – differ in different tissues. This difference presumably affects transcriptional noise as well as signal.

    Also, as far as I know, there’s little evidence for purifying selection on most of the human genome. I haven’t been paying attention recently – this kind of thing used to be my business, but I’ve moved on – but that’s among the messages of papers like Keightley et al., 2005 (“Evidence for widespread degradation of gene control regions in hominid genomes,” PLoS Biol. 3:282-288).

    Undoubtedly, there’s as-yet-unidentified functional stuff in our intergenic regions; frankly, it would be astonishing if there weren’t. But I know of no good reason to think it’s more than a small fraction of the DNA.

  4. Dr. Zimmer writes: “If a piece of DNA serves no function, it will be prone to picking up mutations. Since the DNA encodes nothing of importance, mutations to it can do no harm.” The intracellular “RNA antibody” hypothesis has been around for over a decade. It predicted pervasive transcription of non-genic DNA in stressed cells. Varying these transcripts by mutation makes it difficult for an intracellular pathogen to “predict” the antibody RNAs it will encounter in its next host. So to paraphrase Dr. Zimmer: “If a piece of DNA serves the antibody RNA function, it will be prone to picking up mutations. Since the DNA encodes something of importance for cell defence, mutations to it can be beneficial.” There is now much scientific literature, including the ENCODE results, that is consistent with the RNA antibody hypothesis.

  5. I stumbled on this sentence: “Those accidentally transcribed pieces of RNA promptly get destroyed.”

    Is that really true? How is this known?

  6. Bjorn: for example, by mutating proteins responsible for that surveillance, and seeing the RNAs that accumulate when the mRNA quality control pathway has been removed. See for example excellent papers from Alain Jacquier’s group, such as Wyers et al, “Cryptic PolII transcripts are degraded by a nuclear quality control pathway involving a new poly(A) polymerase”, Cell 121:725 (2005).

  7. Can’t we get those scientists to discuss these things openly on the internet with each other? Why does it need a journalist to initiate this? I think that would be more productive than the current paper publishing business.

  8. It is exciting to see the science being worked out and disputed, right here. Kudos to Zimmer for collecting some relevant responses.

    Could it be that having almost-functional pseudo-genes lying around makes it more likely that a mutation in one of them will get transcribed into a stretch of RNA that manages to do something or other that is helpful? Like the Origin of Species, there has to be some Origin of Proteins, truly novel proteins. Of course such de novo emergence would be an extremely rare event.

    A reservoir of that kind would not be conserved in any usual sense, and yet might still retain the potential to do something good, some day.

    I am tempted to call the retention of some-day-maybe-useful fragments “meta-evolution,” in the sense that additions to the noise-genome might confer advantages onto a later evolutionary process, rather than producing some immediately useful change in a phenotype.

  9. We are trying to learn a new language that will eventually help us in communicating with life and its evolution.

  10. @Sanchez: A paper came out this week (by one of the authors indirectly referred to by Carl) showing exactly this (

    In my opinion, one of the biggest disconnects between pervasive transcription of the genome and its functional relevance is the evolutionary conservation of these sequences. At the moment, about 9% of (known) mammalian genome sequences are conserved throughout evolution. If >80% of the genome is transcribed into RNA and is required for normal development and health, how can only ~10% of it display evidence of purifying evolutionary selection?

    The key to this resides in how RNA functions. The higher-order structure of RNA is important for its function (e.g. ribosomes, pre-miRNAs, tRNAs, etc). Long non-coding RNAs will undoubtedly form some localised structural pairings given the laws of thermodynamics. It is thus important to consider the structural nature of RNA (compared to protein-coding DNA) when investigating evolutionary conservation. We have done this and report the findings in a Nucleic Acids Research paper to be available online shortly. We reveal that a substantial fraction of mammalian genomes function through RNA structure. Stay tuned!
