Moment captured by Marius Mariș |
www.the-workaholi.com | www.komiti.media

Lecturer: Prof. Dr. Jean-François Flot

Course start date: 08.03.2023

The course aims to decode the genetic information contained in an organism’s DNA by sequencing and genomic assembly.

Course duration: 08.03.2023 – 10.03.2023

Evaluation date: 10.03.2023

Date of granting of certificates: 28.03.2023

Course fee: 0 lei

1. Principles of genomic sequencing and assembly

Genomic sequencing and assembly are two important steps in the process of decoding the genetic information contained in an organism’s DNA. The principles of genome sequencing and assembly can be summarised as follows:
1.1. Sample preparation: The first step in sequencing a genome is to obtain a sample of the organism’s DNA. This can be done by extracting DNA from a tissue sample, blood sample or any other DNA source.

1.2. Library preparation: Once the DNA has been extracted, it is fragmented into smaller pieces and prepared as a sequencing library. The fragments are then sequenced using a high-throughput sequencing technology such as Illumina or PacBio.
1.3. Sequence alignment: The next step is to align the reads either to a reference genome or to one another. This is done using specialised software that compares sequences and identifies regions of overlap.
1.4. Assembly: After alignment, sequences are assembled into longer contiguous sequences, called contigs. This process involves overlapping sequences to form a consensus sequence that represents the original DNA.
1.5. Quality control: Quality control is a critical step in the genomic sequencing and assembly process. This involves checking the accuracy and completeness of the assembled genome, as well as identifying any errors or gaps in the sequence.
In general, the principles of genomic sequencing and assembly involve a series of technical steps that require specialised equipment, software and expertise. By following these steps, researchers can generate high-quality genome sequences that provide valuable information on the genetic structure of organisms. 
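
The overlap-and-merge idea behind steps 1.3 and 1.4 can be illustrated with a toy example. The sketch below is a minimal greedy overlap merger, assuming error-free reads from a single strand; the function names `overlap` and `greedy_assemble` and the example reads are hypothetical, and real assemblers use far more elaborate graph-based methods.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            # No remaining pair overlaps: what is left are the contigs.
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


# Three hypothetical reads that tile a 13-base sequence.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(reads)
print(contigs)
```

With these three reads the merger produces a single contig. Reads that share no sufficiently long overlap would stay as separate contigs, which is how gaps in coverage lead to a fragmented assembly.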

2. Practical genome assembly exercises

3. How to evaluate the quality of an assembled genome 

Evaluating the quality of an assembled genome is an essential step in genome sequencing projects to ensure that the assembly is as accurate and complete as possible. The following methods are commonly used to assess the quality of an assembled genome:
3.1. Assembly statistics: A basic method to assess the quality of an assembled genome is to analyze assembly statistics such as total number of contigs or scaffolds, N50, L50 and genome size. These metrics give a general idea of the completeness and continuity of the assembled genome.
3.2. Genome completeness assessment: Genome completeness can be assessed using tools such as BUSCO and CheckM. These tools compare the assembled genome with a set of conserved genes expected to be present in genomes of the relevant lineage, thus providing an estimate of the completeness of the assembly. QUAST complements them by reporting assembly statistics and, when a reference genome is available, reference-based metrics.
3.3. Consistency with sequencing data: the assembled genome can be validated by mapping raw sequencing reads to the assembled genome and checking for consistency in read coverage, depth and distribution. The presence of significant gaps or mismatches between the reads and the assembly may indicate potential errors in the assembly.
3.4. Synteny analysis: synteny analysis compares the assembled genome with a reference genome or with the genomes of related species to identify structural rearrangements or differences in gene order that may indicate errors in the assembly.
3.5. Hi-C mapping: Hi-C mapping can be used to validate the assembled genome by checking the observed contacts between DNA fragments against their placement in the assembly, which can help detect misassemblies, correct misjoins and identify misoriented contigs.
3.6. Repeat analysis: repeat analysis is crucial for identifying and masking repetitive sequences in the genome, which can be a source of errors and fragmentation in the assembly. Tools such as RepeatMasker and RepeatModeler can be used for this purpose.
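
As an illustration of the assembly statistics in 3.1, the sketch below computes N50 and L50 from a list of contig lengths. N50 is the length of the contig at which the cumulative length, counted from the longest contig downwards, first reaches half of the total assembly size; L50 is the number of contigs needed to get there. The function name `n50_l50` and the example lengths are hypothetical.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no contig lengths given")


# Hypothetical assembly of five contigs, 300 bases in total.
contig_lengths = [100, 80, 60, 40, 20]
n50, l50 = n50_l50(contig_lengths)
print(n50, l50)  # prints: 80 2
```

A higher N50 and a lower L50 indicate a more contiguous assembly, but neither metric says anything about correctness, which is why the other checks in this section are still needed.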

In summary, a combination of these methods can be used to assess the quality of an assembled genome and to ensure that the genome is as accurate and complete as possible.
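
The read-consistency check in 3.3 can be sketched in the same spirit. The example below assumes that per-base read depth along a contig is already available as a list of integers (for instance, exported from a read-mapping tool) and flags windows whose mean depth is far from the contig median. The function name `flag_depth_outliers`, the window size and the thresholds are illustrative assumptions, not recommended values.

```python
from statistics import median


def flag_depth_outliers(depth, window=5, low=0.5, high=2.0):
    """Return (start, end, mean_depth) for suspicious windows.

    A window is flagged when its mean depth is below low * median
    or above high * median of the whole contig.
    """
    med = median(depth)
    flagged = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        mean = sum(chunk) / len(chunk)
        if mean < low * med or mean > high * med:
            flagged.append((start, start + len(chunk), mean))
    return flagged


# Hypothetical depth profile: a drop (possible misjoin or gap)
# and a spike (possible collapsed repeat).
depth = [30] * 10 + [2] * 5 + [31] * 10 + [90] * 5
print(flag_depth_outliers(depth))
```

On this profile the two abnormal windows are reported with their coordinates, which is where manual inspection of the assembly would start.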

4. Genome assembly optimization and advanced analysis

Improving a genome assembly is a complex and iterative process that can draw on the following strategies:
4.1. Increasing sequencing depth: a higher sequencing depth can help reduce errors and increase the accuracy of the assembly. This can be achieved by generating more reads from the same sample or by combining data from multiple sequencing platforms.
4.2. Using long-read sequencing technologies: long-read sequencing technologies, such as PacBio or Oxford Nanopore, can produce reads that span large genomic regions, reducing the need for assembly algorithms to link smaller fragments. This can improve the continuity and completeness of the assembly.
4.3. Using linked-read sequencing: linked-read sequencing technologies, such as 10x Genomics, attach a shared barcode to reads derived from the same long DNA molecule, allowing identification of reads that come from the same genomic region. This can help phase the assembly and improve the accuracy of haplotype reconstruction.
4.4. Using a hybrid approach: combining data from multiple sequencing technologies, such as Illumina and PacBio, can leverage the strengths of each platform to produce a more accurate assembly.
4.5. Using a reference genome: If a closely related species has a well-assembled genome, it can be used as a reference to scaffold the assembly, guiding contig placement and improving the accuracy of the assembly.
4.6. Manual curation: manual curation can help to identify and correct errors such as misassemblies or chimeric contigs. This can be done using tools such as visualization software or by comparing the assembly with other genomic data sources such as transcriptome data.
4.7. Quality control: it is important to perform quality control checks on the assembled genome to identify potential errors or gaps in the sequence. Quality control of the assembled genome includes measures such as assessing the completeness of the assembly, identifying potential contaminants and comparing the assembled genome with other genomic data.
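
To make the notion of sequencing depth in 4.1 concrete, the sketch below applies the standard relation depth = (number of reads × read length) / genome size, together with the Lander-Waterman estimate that a fraction of about e^(-depth) of the genome receives no reads. It assumes reads are sampled uniformly at random, which real data only approximate; the function names and the example figures are hypothetical.

```python
import math


def expected_depth(num_reads, read_length, genome_size):
    """Average number of reads covering each base."""
    return num_reads * read_length / genome_size


def reads_needed(target_depth, read_length, genome_size):
    """Number of reads required to reach a target average depth."""
    return math.ceil(target_depth * genome_size / read_length)


def uncovered_fraction(depth):
    """Expected fraction of bases with zero coverage (Lander-Waterman)."""
    return math.exp(-depth)


# Hypothetical run: one million 150-base reads on a 5 Mb genome.
print(expected_depth(1_000_000, 150, 5_000_000))  # prints: 30.0
print(reads_needed(50, 150, 5_000_000))           # prints: 1666667
print(uncovered_fraction(5))
```

Because the uncovered fraction falls off exponentially, moderate increases in depth close most coverage gaps, while regions that remain unassembled at high depth usually point to repeats or sequencing bias rather than to insufficient data.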