shadestark11 t1_ivfhd3i wrote on November 7, 2022 at 4:21 PM

Not an expert, but they used multiple volunteers to build a consensus sequence, which basically means taking the most common (prevalent) fragment. It’s also misleading when someone says two human genomes differ by only 0.1%, since that is 0.1% of around 3 billion base pairs, or roughly 3 million bp, which is a huge number by itself and can help explain a lot of differences.
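As a quick sanity check on that arithmetic, here is a minimal Python sketch. The 3 billion bp genome size and the 0.1% figure are the rough values quoted above, and the function name `differing_positions` is made up for illustration.

```python
from fractions import Fraction

# Approximate human genome size quoted above (an assumption, not a precise value).
GENOME_SIZE_BP = 3_000_000_000


def differing_positions(genome_size_bp: int, percent: str) -> int:
    """Return how many base pairs correspond to `percent` of the genome.

    `percent` is passed as a string (e.g. "0.1") so that Fraction can
    represent it exactly and avoid floating-point rounding.
    """
    return int(genome_size_bp * Fraction(percent) / 100)


print(differing_positions(GENOME_SIZE_BP, "0.1"))  # 3000000
```

So "only 0.1%" still means on the order of 3 million positions that can differ.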

I would also like to add that since the HGP (which ended in 2003, and whose sequence still contained gaps), we have sequenced many more individual genomes, and the variance is now accepted to be around 0.3%-0.4%. If you’re interested, you could look into the recent publication of the gapless human genome.

https://www.nih.gov/news-events/news-releases/researchers-generate-first-complete-gapless-sequence-human-genome

davedeoreo t1_ivggkbu wrote on November 7, 2022 at 8:07 PM

This is the best answer here so far, as it mentions the consensus sequence. That is, using multiple people, we have determined which nucleotide is the most common at every position, and that most common one is included in the reference sequence. Every human will stray from that sequence at different positions along their genome, on average about 3 million times. And as they said, this has been fine-tuned over the years with more people and faster, more accurate sequencing technology.
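To make the "most common nucleotide at every position" idea concrete, here is a toy Python sketch of a per-position majority vote. It only illustrates the idea as described in this comment and is not a claim about how the actual reference assembly was built. The function names and the three short sequences are hypothetical, and the sequences are assumed to be already aligned and of equal length.

```python
from collections import Counter


def majority_consensus(aligned_sequences):
    """Pick the most frequent base in each column of equal-length sequences."""
    if not aligned_sequences:
        raise ValueError("need at least one sequence")
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("sequences must be aligned to the same length")
    consensus = []
    for column in zip(*aligned_sequences):
        # most_common(1) returns [(base, count)] for the top base in this column.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


def count_differences(sequence, reference):
    """Count positions where a sequence strays from the reference."""
    return sum(a != b for a, b in zip(sequence, reference))


# Made-up example data, three "individuals" with six positions each.
samples = ["ACGTAC", "ACGTTC", "ACCTAC"]
reference = majority_consensus(samples)
print(reference)  # ACGTAC
print([count_differences(s, reference) for s in samples])  # [0, 1, 1]
```

Each sample differs from the voted sequence at only a few positions, which is the small-scale version of "every human strays from the reference at different positions".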

heresacorrection t1_ivgriyw wrote on November 7, 2022 at 9:18 PM

This is factually untrue. The reference genome is constructed in a way that does not necessarily include the most common variant at a given position. The telomere-to-telomere (T2T) assembly is a single female individual (excluding the Y-chromosome).

davedeoreo t1_ivgtzqx wrote on November 7, 2022 at 9:35 PM

Username checks out I guess, lol. Could you please shed some light on this then? It's my understanding that the reference genome is created by building contigs from overlapping reads. Does this not mean it's a consensus sequence of the most common nucleotide at each position? Or is it more that long stretches, which are generally similar enough to overlap, help determine location along the genome?
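For anyone unfamiliar with the term, here is a toy Python sketch of what "building a contig from overlapping reads" means: two reads are joined when the end of one matches the start of the other. This is only an illustration under simplifying assumptions (error-free reads, exact matches, a hypothetical `min_overlap` threshold); it is not the method used by any real assembler, and the reads are made up.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    shortest_allowed = max(min_overlap, 1)
    for k in range(min(len(left), len(right)), shortest_allowed - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_reads(left, right, min_overlap=3):
    """Join two reads into one contig if they overlap, else return None."""
    k = overlap_length(left, right, min_overlap)
    if k == 0:
        return None
    # Keep `left` whole and append only the non-overlapping tail of `right`.
    return left + right[k:]


# Made-up reads: the last four bases of the first match the first four of the second.
print(merge_reads("ACGTTGCA", "TGCAGGTC"))  # ACGTTGCAGGTC
```

Note that the overlap here is used to work out where reads sit relative to each other, which is a separate step from voting on the most common base at a position.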

Also, T2T is more recent, right? I was mainly referring to the 2003 method in my first comment.