Genome Assembly
Genome assembly is the process of combining your reads (short reads, long reads, or both) into long contiguous sequences (contigs). Many different approaches and tools exist for assembling genomes. In recent years, a growing number of assemblers and assembly pipelines designed specifically for long reads have been released, e.g., Canu, Flye, Shasta, and miniasm.
In this tutorial we will use two different assemblers/assembly pipelines, Flye and minimap2-miniasm, and compare the results.
Two common questions about genome assembly and the answers to them:
1. What is the best assembler to use?
The answer to this question is something like "It depends." Different assemblers perform differently on different genomes. Factors such as genome size, repetitiveness, and GC content can all influence an assembler's performance. Best practice is to run multiple assemblers, compare the results, and then decide for yourself which one to use.
2. When is my assembly done?
Currently, the answer to this question is "Never." As an example: the human genome is the best-studied genome in the world, with thousands of individuals sequenced and millions (billions?) of dollars spent. Still, until recently roughly 8% of the human genome remained unassembled, and only recent long-read sequencing technologies have enabled us to slowly close these last gaps. This does not mean you can never finish your project. It means your assembly is done when it can answer the questions you want to ask!
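Comparing the results of several assemblers usually starts with a few simple contiguity statistics. The following is a minimal sketch in Python, not part of the tutorial's own tooling: the function names (contig_lengths, n50, summarize) and the toy contigs are hypothetical and only illustrate the idea. It reports the number of contigs, the total assembly length, the longest contig, and the N50, i.e. the length of the contig at which the contigs, sorted from longest to shortest, first cover at least half of the total assembly length.

```python
from typing import Dict, Iterable, List


def contig_lengths(fasta_lines: Iterable[str]) -> List[int]:
    """Return the length of every record in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # length of the record being read; None before the first header
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header closes the previous record and opens a new one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: List[int]) -> int:
    """Contig length at which the longest contigs first cover half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def summarize(lengths: List[int]) -> Dict[str, int]:
    """Collect basic contiguity statistics for one assembly."""
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    # Toy FASTA content; for a real assembly, pass an open file handle instead.
    toy_assembly = [">contig_1", "ACGTACGT", ">contig_2", "AC", ">contig_3", "ACGTAC"]
    print(summarize(contig_lengths(toy_assembly)))
```

To compare two real assemblies, one would call summarize(contig_lengths(handle)) once per assembler output, with handle being the opened FASTA file. Fewer contigs and a higher N50 suggest a more contiguous assembly, but these numbers say nothing about correctness, so they are only a first step in deciding which assembly to keep.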