The Differences of Prokaryotic Pan-genome Analysis on Complete Genomes and Simulated Metagenome-Assembled Genomes

Tang Li Author

04/03/2021 Added

Description

In this presentation, we will show how the fragmentation and incompleteness of MAGs influence their core genome size by performing pan-genome analysis on MAGs simulated from complete genomes of prokaryotic species. Additionally, we will discuss potential underestimation in core genome functional analyses and misprediction in phylogenetic trees.

Searchable Transcript

[00:00:01.300]Hi everyone, my name is Tang Li,
[00:00:02.262]a second-year master's student
[00:00:04.860]in the Department of Food Science and Technology.
[00:00:07.469]My presentation today will introduce my master's project
[00:00:11.030]on the Differences of Prokaryotic Pan-genome Analysis on
[00:00:14.036]Complete Genomes and Simulated Metagenome-Assembled Genomes.
[00:00:19.130]First of all, what is a pan-genome?
[00:00:21.690]A pan-genome represents the entire gene set
[00:00:24.160]of all strains in a species.
[00:00:26.970]From figure one,
[00:00:28.235]we can see that the genes in pan-genome
[00:00:30.840]can be classified into three categories:
[00:00:33.730]core genes, dispensable genes, and unique genes.
[00:00:37.560]The core genes are shared by all strains within the species,
[00:00:41.220]while dispensable genes are present in some of the strains.
[00:00:45.030]The unique genes are only found in specific strains.
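
The three categories can be illustrated with a small sketch. It assumes a hypothetical presence/absence mapping (gene family to the set of strains carrying it) and treats a unique gene as one found in exactly one strain; the function and variable names are illustrative and not taken from the talk.

```python
from typing import Dict, Set

def classify_pan_genome(presence: Dict[str, Set[str]], n_strains: int) -> Dict[str, str]:
    """Label each gene family as core, dispensable, or unique."""
    labels = {}
    for family, strains in presence.items():
        if len(strains) == n_strains:
            labels[family] = "core"          # present in every strain
        elif len(strains) == 1:
            labels[family] = "unique"        # present in a single strain
        else:
            labels[family] = "dispensable"   # present in some strains
    return labels

# Toy data: three strains (s1, s2, s3) and three gene families.
presence = {
    "famA": {"s1", "s2", "s3"},
    "famB": {"s1", "s3"},
    "famC": {"s2"},
}
labels = classify_pan_genome(presence, n_strains=3)
print(labels)
```
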
[00:00:49.264]Pan-genome analysis can be used in various research areas,
[00:00:53.670]including genomic diversity characterization,
[00:00:57.430]disease outbreak analysis,
[00:00:58.940]and the study of virulence-associated genes.
[00:01:03.060]With the development of sequencing technology,
[00:01:05.750]a new research area called metagenomics is
[00:01:09.190]emerging to study genetic material recovered directly
[00:01:12.192]from environmental samples.
[00:01:15.223]Metagenome-assembled genomes, or MAGs, are produced
[00:01:17.604]from the raw reads of the metagenomes through
[00:01:21.470]filtering, assembly, and binning steps,
[00:01:24.086]and finally taxonomic assignment.
[00:01:28.040]These MAGs can be seen as a close representation
[00:01:31.380]of an actual individual genome.
[00:01:34.380]Since these MAGs are generated by computational assembly and binning
[00:01:38.140]tools,
[00:01:40.020]they are fragmented, incomplete, and contaminated
[00:01:43.540]when compared with the complete genomes
[00:01:45.450]from isolates. In recent years,
[00:01:48.545]pan-genome analysis of MAGs has been increasingly used
[00:01:52.870]in studying human microbiomes.
[00:01:55.580]It is an open question how much accuracy
[00:01:58.930]the pan-genome analysis results lose due to the nature
[00:02:02.390]of MAGs: fragmentation, incompleteness, and contamination.
[00:02:07.798]So the purpose of this study is to identify how the nature
[00:02:12.230]of MAGs influences the pan-genome results.
[00:02:15.450]We especially want to evaluate the effects
[00:02:18.450]on the number of core genes and the downstream
[00:02:20.760]functional analysis.
[00:02:23.360]We have two hypotheses here.
[00:02:25.610]The first one is that the nature of MAGs
[00:02:28.010]will lead to the loss of core genes.
[00:02:30.430]The second one is that there may be some bias
[00:02:33.400]or errors in the downstream analysis. To test our hypotheses,
[00:02:38.680]we designed the workflow shown in figure 3.
[00:02:44.276]First, we download the complete
[00:02:46.484]genomes from the NCBI RefSeq database.
[00:02:49.365]Then we use the pipeline shown in figure 4
[00:02:51.952]to simulate the MAGs.
[00:02:54.697]For each of the complete genomes,
[00:02:57.224]we randomly cut the sequence
[00:02:59.180]into fragments by selecting cut positions.
[00:03:02.108]Then a part of each fragmented sequence is randomly
[00:03:06.260]removed to create gaps among these fragments.
[00:03:10.086]The length of the removed sequence
[00:03:12.440]is determined by the incompleteness we want.
[00:03:16.329]After that, we add contamination to the genome
[00:03:20.640]by randomly inserting sequence
[00:03:22.583]fragments from closely related species.
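
The three simulation steps can be sketched roughly as follows. This is a minimal illustration and not the pipeline from figure 4: the function name `simulate_mag`, its parameters, the choice to trim each fragment proportionally at its ends, and the use of a single contaminating piece are all assumptions made for the example.

```python
import random
from typing import List

def simulate_mag(genome: str, n_fragments: int, incompleteness: float,
                 contaminant: str, contamination: float,
                 seed: int = 0) -> List[str]:
    """Return contigs imitating a fragmented, incomplete, contaminated MAG."""
    rng = random.Random(seed)

    # 1. Fragmentation: pick distinct random cut positions and split there.
    cuts = sorted(rng.sample(range(1, len(genome)), n_fragments - 1))
    bounds = [0] + cuts + [len(genome)]
    fragments = [genome[a:b] for a, b in zip(bounds, bounds[1:])]

    # 2. Incompleteness: trim each fragment at its ends so that gaps open
    #    between neighbours (assumption: loss is proportional per fragment).
    contigs = []
    for frag in fragments:
        drop = int(len(frag) * incompleteness)
        left = rng.randint(0, drop)
        contigs.append(frag[left:len(frag) - (drop - left)])

    # 3. Contamination: insert one piece of a related genome as an extra
    #    contig at a random position in the contig list.
    piece_len = min(int(len(genome) * contamination), len(contaminant))
    if piece_len > 0:
        start = rng.randint(0, len(contaminant) - piece_len)
        contigs.insert(rng.randrange(len(contigs) + 1),
                       contaminant[start:start + piece_len])
    return contigs

genome = "ACGT" * 250      # toy 1,000 bp "complete genome"
related = "TTGCA" * 100    # toy 500 bp genome of a related species
mag = simulate_mag(genome, n_fragments=10, incompleteness=0.05,
                   contaminant=related, contamination=0.02)
print(len(mag), sum(len(c) for c in mag))
```

Fixing the seed keeps a simulated dataset reproducible, which helps when the same genomes are compared across several fragmentation or incompleteness levels.
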
[00:03:26.788]Once the MAG simulation is finished,
[00:03:29.770]the genomes are annotated for gene detection.
[00:03:32.354]In the pan-genome analysis,
[00:03:35.170]the homologous genes will be clustered based
[00:03:37.730]on their presence and absence.
[00:03:40.960]Finally, the pan-genome results are used
[00:03:43.560]for the functional and phylogenetic analyses.
[00:03:47.861]To help understand the results,
[00:03:50.326]we use Escherichia coli as an example species
[00:03:54.673]and a dataset containing 100 complete Escherichia coli
[00:03:58.680]genomes as an example dataset.
[00:04:02.008]Figure 5 shows how the number
[00:04:06.213]of core gene families changed
[00:04:08.100]with the average number of fragments in the genomes.
[00:04:12.623]A stringent 100% core gene threshold was used here,
[00:04:16.571]which means that core genes
[00:04:21.097]must be found in all the genomes.
[00:04:24.771]We can see that there were more than 2,600 core gene families
[00:04:28.970]in the complete genomes.
[00:04:33.350]However,
[00:04:35.800]the number of core gene families dropped dramatically
[00:04:38.050]to fewer than 500 when the simulated
[00:04:41.020]genomes had an average of 400 fragments.
[00:04:45.376]We observed similar and even worse results
[00:04:49.800]in the incompleteness simulation.
[00:04:52.290]When genomes had an average of 1% loss in completeness,
[00:04:56.390]about 60% of core genes were missed.
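
A back-of-the-envelope calculation shows why a small per-genome loss can have such a large effect under a 100% threshold. The sketch below assumes that each gene is lost independently in each genome with a probability equal to the incompleteness; this is a simplification for illustration and not necessarily how the simulation in the talk behaves. The function name is hypothetical.

```python
def expected_core_retention(loss_fraction: float, n_genomes: int) -> float:
    """Chance that a true core gene is still detected in every genome,
    assuming independent loss with the same probability in each genome."""
    return (1.0 - loss_fraction) ** n_genomes

# 100 genomes, each missing about 1% of its genes.
retained = expected_core_retention(0.01, 100)
print(f"retained: {retained:.3f}, lost: {1 - retained:.3f}")
```

Under that assumption only about 37% of core genes survive in all 100 genomes, which is the same order of magnitude as the roughly 60% loss reported above.
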
[00:05:01.225]In the contamination simulation,
[00:05:04.070]no significant core gene loss was observed.
[00:05:07.868]After that, we performed a Clusters
[00:05:12.250]of Orthologous Genes analysis, also called COG analysis.
[00:05:16.750]We used this to find the enriched COG categories
[00:05:20.160]in the core genes of E. coli.
[00:05:22.425]Figure 6 shows that the number
[00:05:25.793]of core gene families assigned
[00:05:28.685]to each COG category decreased significantly
[00:05:31.370]with increasing fragmentation,
[00:05:33.221]from the original dataset to an average of 400 fragments.
[00:05:39.060]For example, in the original dataset shown
[00:05:41.907]in this green bar, there were more
[00:05:46.490]than 300 genes assigned to category E,
[00:05:51.500]which represents amino acid
[00:05:54.460]metabolism and transport.
[00:05:56.069]However, we observed fewer than 100
[00:05:59.907]genes assigned to this group
[00:06:02.630]when there was an average of 300 fragments
[00:06:05.860]in the dataset.
[00:06:07.424]That means fragmentation in MAGs leads
[00:06:10.700]to core gene loss, which in turn causes underestimation
[00:06:16.171]in the functional analysis.
[00:06:19.421]In the phylogenetic comparison, we used the normalized
[00:06:23.681]Robinson-Foulds (nRF) distance
[00:06:27.598]to measure the difference
[00:06:29.124]between two phylogenetic trees.
[00:06:31.515]The phylogenetic trees were constructed based
[00:06:32.660]on gene presence and absence.
[00:06:34.950]The larger the nRF value is, the more
[00:06:38.156]different the trees are.
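
The nRF distance can be sketched in a few lines. The example below assumes trees are given as nested tuples of leaf names and are compared as unrooted trees; it normalizes the Robinson-Foulds distance by the total number of non-trivial splits in both trees, which equals 2(n - 3) for fully resolved trees with n leaves. The helper names are illustrative and do not come from the talk or from any particular library.

```python
def splits(tree):
    """Collect the non-trivial bipartitions of a tree given as nested tuples."""
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    all_leaves = leaves(tree)
    ref = min(all_leaves)   # reference leaf used to name each split uniquely
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # Always store the side of the split that does not contain `ref`.
        side = all_leaves - clade if ref in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return clade

    visit(tree)
    return found

def normalized_rf(tree_a, tree_b):
    """Robinson-Foulds distance divided by the total number of splits."""
    a, b = splits(tree_a), splits(tree_b)
    total = len(a) + len(b)
    return len(a ^ b) / total if total else 0.0

t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(normalized_rf(t1, t1))  # 0.0, identical trees
print(normalized_rf(t1, t2))  # 0.5, one of the two splits differs in each tree
```
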
[00:06:42.020]We observed that the nRF values
[00:06:44.540]increased significantly with more incompleteness.
[00:06:50.800]This means that we may have more bias
[00:06:53.160]or errors in the phylogenetic tree construction
[00:06:56.400]when we use more incomplete MAGs.
[00:06:58.724]We performed the same analysis on different prokaryotic species,
[00:07:02.930]and also on multiple datasets to see
[00:07:05.480]the variation among random groups.
[00:07:07.980]The results were similar to those shown here.
[00:07:10.869]All in all, genome fragmentation and incompleteness
[00:07:15.170]are the major reasons for the decreased
[00:07:17.030]core genome size in MAGs.
[00:07:19.700]There is also potential misprediction
[00:07:22.133]or underestimation in downstream functional
[00:07:24.964]and phylogenetic analyses.
[00:07:27.499]In the future, we hope that better quality control
[00:07:31.440]of MAGs will be developed to improve
[00:07:34.030]the pan-genome analysis of MAGs.
[00:07:37.077]To close the presentation,
[00:07:38.443]I would like to acknowledge the support
[00:07:40.410]and guidance from my advisor, Dr. Yin,
[00:07:43.530]and all the suggestions from my supervisory committee,
[00:07:46.673]Dr. Moriyama and Dr. Chaves.
[00:07:49.940]I also appreciate the help from everyone
[00:07:52.469]in the Food Science and Technology Department
[00:07:55.420]and the Nebraska Food for Health Center.
[00:07:58.870]Thank you for listening.
[00:07:59.937]Please feel free to leave comments
[00:08:02.460]if you have any questions.
