Saturday, December 8, 2007

GBrowse SNPedia

It is now easy to overlay SNPedia on the GBrowse databases. This means you can generate images like

(click for full)

which shows a cluster of snps around TCF7L2. The upper track identifies SNPedia snps. The lower track identifies all snps present on the Illumina 550 microarray. All are hyperlinked to their SNPedia pages. If you click on the image above, you'll see the full image with the LD plot at the bottom. This allows you to determine how much two neighboring snps can say about each other.

To generate your own, visit the hapmap of Chr10:114724078..114764077

At the very bottom of the page add these two 'remote annotation urls'
The snpedia file is a modest 300kbytes, but the single illumina chip is 58Mbytes!
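If you want to sanity check a region string like the one above before pasting it into the browser, here is a minimal sketch. The helper names parse_region and region_width are mine, not part of GBrowse, and I'm assuming the coordinates are inclusive at both ends.

```python
import re

def parse_region(region):
    """Split a 'ChrN:start..end' string into (chromosome, start, end)."""
    match = re.fullmatch(r"(\w+):(\d+)\.\.(\d+)", region)
    if match is None:
        raise ValueError(f"unrecognized region: {region!r}")
    chrom, start, end = match.groups()
    return chrom, int(start), int(end)

def region_width(region):
    """Number of bases covered, assuming both endpoints are included."""
    _, start, end = parse_region(region)
    return end - start + 1

tcf7l2_region = "Chr10:114724078..114764077"
print(parse_region(tcf7l2_region))
print(region_width(tcf7l2_region), "bases")
```

Under that assumption the TCF7L2 window comes out to 40,000 bases, which lines up with a 40kbp zoom setting.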

Every snpedia rs# page now has a direct link to the hapmap site. Loading will take a while. Scenic spots along the genome include:

*click on hapmap
*set Scroll/Zoom=40kbp (upper right)

*click on hapmap
*set Scroll/Zoom=100kb (upper right)

I'm aware of a problem with rs3135506, but unable to explain it. Other times, snps listed in the snpedia files aren't rendering onto the images. Other bugs probably remain. Your bug reports are always helpful.

Annotations will be updated approximately weekly, along with the fasta.


Saturday, November 3, 2007

Fasta format available

The fasta formatted version of all the SNPedia snps is now available at
The above link will be updated regularly.

SNPedia was mentioned in the current issues of Nature [1] and Science [2]

Tuesday, September 4, 2007

Venter’s Genome: No Genome Is An Island

With the release of the research paper describing the personal genome of J. Craig Venter, ostensibly the first individual genome to ever be made public, we’ve got an incredible opportunity to muse about what we know so far about human genomes. And the answer is: not much today, but wait until tomorrow.

I don’t know about you, but finding out that Venter has the ear-wax characteristics of most Caucasians and has European ancestors isn’t that surprising to me. His risk for heart disease and Alzheimer’s was already known and could have been found out a lot quicker and cheaper using microarrays. And although we believe that in these early days of personal genomics, you haven’t really released all your sequence unless you’ve released the primary sequence reads with their quality score files, the release by Venter of his assembled sequences (even without the primary data) does at least allow some geno-dumpster diving. We’ll probably post a few nuggets of information about Venter that aren’t mentioned in the PLoS article in future blogs, but suffice it to say for the moment, Venter is mostly pretty normal for a white guy.

So where’s the beef? Think about any nascent network. Who did the very first people with telephones get to call? The first people establishing a network get – initially – very little benefit, aside from the well-deserved credits. The value to scientific research? Priceless. The value to the individual? Hardly.

Venter reports a lot of novel genetic variants, including copy number variations, deletions, insertions, duplications, you name it … and of course, since they are by definition novel, we don’t know what they mean other than at least they aren’t fatal or associated with obvious overt disease(s), given that Venter is alive and reasonably healthy.

For all the hoopla over the Human Genome Project, the reality is that we still know very little about the complex interplay between millions of genetic variations and environmental factors and lifestyle choices. It will take many, many individual genomes, and even more daunting, detailed comprehensive medical records, before we are able to make many of the correlations that will ultimately matter to us on a daily basis. Venter estimates a minimum of 10,000 individual genomes, and maybe, a million or more.

Let me put it another way. What’s it worth to you to get your own genome sequenced, compared to just using microarrays to determine your genotypes? Here’s my first rule of thumb: it’s worth one penny times the number of already released complete personal genomes that also have medical histories. So with Venter being #1 (OK, he actually hasn’t released any medical info really, but let’s cut him some slack), in just another 9,999 genomes I’d say your personal genome sequence will be worth about $100 more to you than whatever you could learn from genotyping yourself with microarrays (and using SNPedia, naturally).

One corollary to this is that it will be essential to get as many genomes sequenced as possible, and whether it’s from government support or celebrity genome sequencing shouldn’t matter. What matters is that the medical histories of the individuals are also made available. As is the case with genome association studies, both case-control (groups of patients with a disease; why not one of the communities?) and ‘genome cohort’ personal genome sequencing studies should be funded by whoever’s got the bucks. Comparative personal genomics, here we come!

I recall a joke that probably plenty of folks have told; I heard it from Francis Collins, the head of NIH’s Genome Project.

A previously-married woman heads to bed for the first time with her new beau, and to his surprise, she admits to being a virgin. When he wonders why, she says, “Well, I was married to a genome biologist, and every night, he just sat in bed and talked about how great our sex life would be someday.”

Someday personal genome sequences will be great for each of us to have, and it’s fantastic to have the first personal genomes coming out this year. As of today, it’s 1 (personal genome) down, at least 9,999 to go …

Sunday, July 29, 2007

SNPedia man page

At present SNPedia contains 1475 snps. The bulk of these are recent arrivals mined from OMIM and GeneRIF. But it all started with about 200 hand curated from pubmed searches, journals and news articles. These older snp pages tend to have accumulated the most information. Newer ones sorely need that sort of attention.

I foresee 2 audiences who can be served by SNPedia. The first audience is researchers who are actively trying to determine the effects of genomic variations. Often they've come here after googling for an rs#. For these people SNPedia may be a useful wiki portal to primary sources, a collective lab notebook, and a chat room. Linking to your own papers is welcome.

The second audience is people who know aspects of their own genome. A very public example is Jim Watson, who recently released his genome. Craig Venter's genome should be public 'any day now', and Esther Dyson seems to want to be next. Given the numerous testing options, it is safe to assume that there are others who already know aspects of their genome. More public and private genomes will surely follow.

But what does it mean to know some or all of your genome? Jim Watson was given a 'fasta' formatted text file.
It looked sort of like this:

>WATSON chromosome 1

Except it has an extra 6 billion As, Ts, Cs, and Gs. It's too big to fit on a DVD, and it doesn't come with a manual. How can you begin to make sense of all that? Lincoln Stein at CSHL put together this viewer. That's a good start, but it's specific to Watson, and designed for asking particular questions.

SNPedia = SNP + wikipedia

Technically, a SNP is a Single Nucleotide Polymorphism. It means that a position in the dna was changed. I (ab)use the term 'snp' to mean a 'Small' Nucleotide Polymorphism. This encompasses changing a few neighboring letters, or inserting or deleting a few extras. If a snp changed the A at position 7 into a T the previous example would now look like this:

>WATSON chromosome 1 with an A>T snp at position 7

>WATSON chromosome 1 with an AA insertion at position 7
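To make the two kinds of change concrete, here is a minimal sketch. The 14-letter sequence is a made-up stand-in (it is not Watson's data), the function names are mine, and I'm assuming positions count from 1 and that an insertion 'at position 7' means the new letters start at position 7.

```python
# Hypothetical stand-in sequence; the real file has billions of letters.
reference = "GATTACAGATTACA"

def apply_snp(seq, position, ref_base, alt_base):
    """Swap the base at a 1-based position, checking the expected original base."""
    index = position - 1
    if seq[index] != ref_base:
        raise ValueError(f"expected {ref_base} at position {position}, found {seq[index]}")
    return seq[:index] + alt_base + seq[index + 1:]

def apply_insertion(seq, position, inserted):
    """Insert extra bases so that they begin at the given 1-based position."""
    index = position - 1
    return seq[:index] + inserted + seq[index:]

snp_version = apply_snp(reference, 7, "A", "T")
insertion_version = apply_insertion(reference, 7, "AA")

print(">example with an A>T snp at position 7")
print(snp_version)
print(">example with an AA insertion at position 7")
print(insertion_version)
```

The snp keeps the sequence the same length, while the insertion shifts every later letter two places to the right.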

The NCBI has been cataloging all of these snps in dbSNP. It's a great resource, and most importantly it is assigning a unique, stable and consistent name to each snp. These names look like rs7903146. That is the letters 'rs' followed by some digits. The snp is defined by having a certain pattern of letters 'upstream' of the change, a small variant, and a fixed 'downstream'.

Let's make this a bit more concrete. While the site always feels a bit slow for me, if you start at the ensembl page for rs7903146 you can see the actual pattern of DNA for this snp.

In the field 'Flanking sequence' you will see a large chunk of dna, with a single red letter Y in the middle. A bit above that we are told

C/T (ambiguity code: Y)
Ancestral allele: T

We get two copies of dna -- one from mom, one from dad. All distant ancestors had this pattern of DNA with the T in the middle of both copies. A child was born with a mutation which changed it from a T to a C. This child lived to pass it on and it has continued to be passed on for many generations.

Depending on which you got from each parent, at rs7903146 your dna is one of these 3 genotypes: (C;C), (C;T) or (T;T).
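Here is a minimal sketch of where the Y and the 3 genotypes come from. The table holds the standard two-base ambiguity codes; the function name and the (X;Y) output formatting are just my way of matching the notation used on this page.

```python
from itertools import combinations_with_replacement

# Standard ambiguity codes for a position that can be one of two bases.
AMBIGUITY_CODES = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}

def possible_genotypes(alleles):
    """List every unordered pair of alleles, one copy from each parent."""
    pairs = combinations_with_replacement(sorted(alleles), 2)
    return [f"({first};{second})" for first, second in pairs]

alleles = "CT"  # the C/T variation reported for rs7903146
print(AMBIGUITY_CODES[frozenset(alleles)])
print(possible_genotypes(alleles))
```

Order doesn't matter here, so (C;T) and (T;C) count as the same genotype, which is why two alleles give three genotypes rather than four.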

Over at the NCBI page, you can see some primary data about the frequency of each of these in different populations.
Among Chinese and Japanese, 95% of the people were (C;C). But among a Utah population of mixed European ancestry, 33% were (C;T) and 8.3% were (T;T).

James Watson turns out to be a (C;T) as well.
I'm currently in a bidding war with Science vs Nature for my paper proving that James Watson is not Asian.
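Going back to the Utah percentages for a moment, you can turn genotype shares into allele shares with a little arithmetic. This is a rough sketch: the (C;C) figure is inferred by subtraction (my assumption that the three genotypes account for everyone in the sample), and the function name is made up.

```python
# Genotype percentages quoted above for the Utah sample.
utah_percent = {"(C;T)": 33.0, "(T;T)": 8.3}

# Assumption: whatever is left over must be (C;C).
utah_percent["(C;C)"] = 100.0 - sum(utah_percent.values())

def allele_frequency(percent_by_genotype, allele):
    """Fraction of all chromosome copies that carry the given allele."""
    total = 0.0
    for genotype, percent in percent_by_genotype.items():
        copies = genotype.strip("()").split(";").count(allele)
        # Each person carries two copies, so divide by 2 * 100.
        total += percent * copies / 200.0
    return total

t_frequency = allele_frequency(utah_percent, "T")
c_frequency = allele_frequency(utah_percent, "C")
print(f"(C;C): {utah_percent['(C;C)']:.1f}%")
print(f"T allele: {t_frequency:.3f}, C allele: {c_frequency:.3f}")
```

So under these assumptions roughly a quarter of the chromosome copies in that sample carry the T.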

Surely google knows all.
As I write this the top hit is a recent paper reporting "TCF7L2 rs7903146 variant does not associate with smallness for gestational age in the French population". Is that really the most important thing about this snp?

Google's second hit is on SNPedia

All of the links we've visited are already there in the box on the right hand side. A simple explanation of the effect of this snp is in the main window, with hyperlinked citations to back it all up. There is even a box explaining the consequences for each genotype.

James Watson is at a moderately increased risk of Type-2 Diabetes due to his rs7903146(C;T) genotype.

Interestingly, the SNPedia page and the research papers present the (C;C) genotype as normal risk, while the (T;T) is greatly increased risk. However we've already seen that T is ancestral. So it seems more natural to view (T;T) and a high risk of diabetes as the historic and natural state. Jim has one copy of a more recent mutation which is reducing the risk.

So maybe it is more accurate to say

James Watson is at a moderately decreased risk of Type-2 Diabetes due to his rs7903146(C;T) genotype.

It all depends on whether you compare him to rs7903146(C;C) or rs7903146(T;T). And there are other snps which further increase and decrease his risk. There is still a lot to learn.

If you're curious, try a few random pages

but before you leave the current page, click on TCF7L2

It's got one of those boxes at the right. Some of it is kind of useful, especially the 'mentioned by' link. This can find related snps automatically. But the TCF7L2 page is still pretty much a mess.

The pages about genes are bad.

The pages about diseases are worse.

This is not diseasapedia (although I do enjoy saying that)

This is SNPedia!

Thursday, July 12, 2007

User:Watson data is real.

The genotypes for ProjectJim are now in the wiki.

Most of the information is still 2 clicks away (genotype->rs#).

So here are Jim's genotypes

Friday, June 1, 2007