The Differences of Prokaryotic Pan-genome Analysis on Complete Genomes and Simulated Metagenome-Assembled Genomes
In this presentation, we will show how the fragmentation and incompleteness of MAGs influence their core genome size by performing pan-genome analysis on MAGs simulated from complete genomes of prokaryotic species. Additionally, we will discuss potential underestimation in core genome functional analyses and misprediction in phylogenetic trees.
[00:00:01.300]Hi everyone, my name is Tang Li,
[00:00:02.262]a second year master's student
[00:00:04.860]in the Department of Food Science and Technology.
[00:00:07.469]My presentation today will introduce my master's project
[00:00:11.030]on the Differences of Prokaryotic Pan-genome Analysis on
[00:00:14.036]Complete Genomes and Simulated Metagenome-Assembled Genomes.
[00:00:19.130]First of all, what is a pan-genome?
[00:00:21.690]A pan-genome represents the entire gene set
[00:00:24.160]of all strains in a species.
[00:00:26.970]From figure one,
[00:00:28.235]we can see that the genes in pan-genome
[00:00:30.840]can be classified into three categories:
[00:00:33.730]core genes, dispensable genes, and unique genes.
[00:00:37.560]The core genes are shared by all strains within the species,
[00:00:41.220]while dispensable genes are present in some of the strains.
[00:00:45.030]The unique genes are only found in specific strains.
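The three categories described above can be illustrated with a minimal sketch. The function name `classify_gene_families`, the toy gene names, and the strain labels are hypothetical and not taken from the talk; the sketch assumes gene presence is already known for every strain.

```python
def classify_gene_families(presence, strains):
    """Label each gene family as core, dispensable, or unique.

    presence: dict mapping gene family name -> set of strains carrying it
    strains:  set of all strains in the species dataset
    """
    categories = {}
    for family, carriers in presence.items():
        if carriers == strains:
            categories[family] = "core"         # shared by all strains
        elif len(carriers) == 1:
            categories[family] = "unique"       # found in a single strain
        else:
            categories[family] = "dispensable"  # found in some strains
    return categories


# Toy data (hypothetical): three strains and three gene families.
strains = {"s1", "s2", "s3"}
toy_presence = {
    "geneA": {"s1", "s2", "s3"},
    "geneB": {"s1", "s2"},
    "geneC": {"s3"},
}
toy_categories = classify_gene_families(toy_presence, strains)
```

Here `geneA` is labeled core, `geneB` dispensable, and `geneC` unique.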
[00:00:49.264]Pan-genome analysis can be used in various research areas,
[00:00:53.670]including genomic diversity characterization,
[00:00:57.430]disease outbreak analysis,
[00:00:58.940]and the study of virulence-associated genes.
[00:01:03.060]With the development of sequencing technology,
[00:01:05.750]a new research area called metagenomics is
[00:01:09.190]emerging to study genetic materials recovered directly
[00:01:12.192]from environmental samples.
[00:01:15.223]Metagenome-assembled genomes are produced
[00:01:17.604]from the raw reads of the metagenomes through
[00:01:21.470]filtering, assembly, and binning steps,
[00:01:24.086]and finally taxonomic assignment.
[00:01:28.040]These MAGs can be seen as a close representation
[00:01:31.380]of an actual individual genome.
[00:01:34.380]Since these MAGs are generated by assembly and binning,
[00:01:40.020]they are fragmented, incomplete, and contaminated
[00:01:43.540]when compared with the complete genomes
[00:01:45.450]from isolates. In recent years,
[00:01:48.545]pan-genome analysis of MAGs has been increasingly used
[00:01:52.870]in studying human microbiomes.
[00:01:55.580]It's an open question how much accuracy
[00:01:58.930]the pan-genome analysis results will lose due to the nature
[00:02:02.390]of MAGs: fragmentation, incompleteness, and contamination.
[00:02:07.798]So the purpose of this study is to identify how the nature
[00:02:12.230]of MAGs influences the pan-genome results.
[00:02:15.450]We especially want to evaluate the effects
[00:02:18.450]on the number of core genes and the downstream analysis.
[00:02:23.360]We have two hypotheses here.
[00:02:25.610]The first one is that the nature of MAGs
[00:02:28.010]will lead to the loss of core genes.
[00:02:30.430]The second one is that there may be some bias
[00:02:33.400]or errors in the downstream analysis. To test our hypotheses,
[00:02:38.680]the workflow shown in figure 3 was designed.
[00:02:44.276]We first download the complete
[00:02:46.484]genomes from the NCBI RefSeq database.
[00:02:49.365]Then we use the pipeline shown in figure 4
[00:02:51.952]to do the MAG simulation.
[00:02:54.697]For each of the complete genomes,
[00:02:57.224]we randomly cut the sequence
[00:02:59.180]into fragments by selecting cut positions.
[00:03:02.108]Then a part of each fragmented sequence will be randomly
[00:03:06.260]removed to create gaps among these fragments.
[00:03:10.086]The length of the removed sequence
[00:03:12.440]is determined by the incompleteness we want.
[00:03:16.329]After that, we add contamination to the genome
[00:03:20.640]by randomly inserting sequence
[00:03:22.583]fragments from closely related species.
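The three simulation steps described above (fragmentation, incompleteness, contamination) can be sketched roughly as follows. This is not the speaker's pipeline: the function `simulate_mag`, its parameters, and the choices to trim each fragment from a randomly chosen end and to add the contaminant as one extra contig are assumptions made only for illustration.

```python
import random


def simulate_mag(genome, n_fragments, incompleteness,
                 contaminant="", contamination=0.0, seed=0):
    """Return a list of contigs simulating a MAG from one complete genome.

    All parameter names are hypothetical. `incompleteness` and
    `contamination` are fractions between 0 and 1.
    """
    rng = random.Random(seed)

    # Step 1, fragmentation: pick random cut positions and split the genome.
    cuts = sorted(rng.sample(range(1, len(genome)), n_fragments - 1))
    bounds = [0] + cuts + [len(genome)]
    fragments = [genome[a:b] for a, b in zip(bounds, bounds[1:])]

    # Step 2, incompleteness: trim part of each fragment from a random end
    # (an assumption), which leaves gaps between neighboring fragments.
    contigs = []
    for frag in fragments:
        n_remove = round(len(frag) * incompleteness)
        if rng.random() < 0.5:
            contigs.append(frag[n_remove:])
        else:
            contigs.append(frag[:len(frag) - n_remove])

    # Step 3, contamination: insert one stretch of a related genome
    # as an extra contig at a random position in the contig list.
    n_contam = min(round(sum(map(len, contigs)) * contamination),
                   len(contaminant))
    if n_contam > 0:
        start = rng.randint(0, len(contaminant) - n_contam)
        piece = contaminant[start:start + n_contam]
        contigs.insert(rng.randint(0, len(contigs)), piece)
    return contigs


# Toy usage with random sequences standing in for real genomes.
_rng = random.Random(1)
toy_genome = "".join(_rng.choice("ACGT") for _ in range(10_000))
toy_related = "".join(_rng.choice("ACGT") for _ in range(5_000))
clean_mag = simulate_mag(toy_genome, n_fragments=10, incompleteness=0.05)
dirty_mag = simulate_mag(toy_genome, 10, 0.05, toy_related, contamination=0.1)
```

In this toy run, `clean_mag` holds 10 contigs covering about 95% of the original length, and `dirty_mag` holds one additional contig drawn from the related sequence.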
[00:03:26.788]When we finish the MAG simulation,
[00:03:29.770]genomes will be annotated for gene detection.
[00:03:32.354]In the pan-genome analysis,
[00:03:35.170]the homologous genes will be clustered based
[00:03:37.730]on their presence and absence.
[00:03:40.960]Finally, the pan-genome results will be used
[00:03:43.560]to do the functional and phylogenetic analysis.
[00:03:47.861]To help understand the results,
[00:03:50.326]we use Escherichia coli as an example species
[00:03:54.673]and a dataset containing 100 complete Escherichia coli
[00:03:58.680]genomes as an example dataset.
[00:04:02.008]Figure 5 here shows how the number
[00:04:06.213]of core gene families changed
[00:04:08.100]with the average number of fragments in the genomes.
[00:04:12.623]The stringent 100% core gene threshold was used here.
[00:04:16.571]It means that core genes
[00:04:21.097]should be found in all the genomes.
[00:04:24.771]There were more than 2,600 core gene families
[00:04:28.970]in the complete genomes. We can see that
[00:04:35.800]the number of core gene families dramatically
[00:04:38.050]decreased to fewer than 500 when the simulated
[00:04:41.020]genomes had an average of 400 fragments.
[00:04:45.376]We observed a similar, and even worse, result
[00:04:49.800]in the incompleteness simulation.
[00:04:52.290]When genomes had an average of 1% loss in their completeness,
[00:04:56.390]about 60% of core genes were missed.
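A back-of-the-envelope calculation suggests why such a small loss per genome has such a large effect under the 100% threshold. The sketch below assumes, as a simplification not stated in the talk, that each core gene goes missing from each genome independently with probability equal to the incompleteness; the function name `expected_core_retained` is hypothetical.

```python
def expected_core_retained(per_genome_loss, n_genomes):
    """Expected fraction of core genes still found in every genome,
    assuming independent loss (a simplifying assumption)."""
    return (1.0 - per_genome_loss) ** n_genomes


# 1% incompleteness across the 100-genome example dataset.
retained = expected_core_retained(0.01, 100)
lost = 1.0 - retained
print(f"expected core genes lost: {lost:.1%}")
```

Under this assumption, roughly 63% of core genes are expected to drop out, which is in line with the figure of about 60% reported for the simulation.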
[00:05:01.225]In the contamination simulation,
[00:05:04.070]no significant core gene loss was observed.
[00:05:07.868]After that, we performed a Clusters
[00:05:12.250]of Orthologous Genes analysis, also called COG analysis.
[00:05:16.750]We use this to find the enriched COG categories
[00:05:20.160]in the core genes of E. coli.
[00:05:22.425]Figure 6 here shows that the number
[00:05:25.793]of core gene families assigned
[00:05:28.685]to each COG category significantly decreased
[00:05:31.370]with increasing fragmentation,
[00:05:33.221]from the original genomes to 400 fragments.
[00:05:39.060]For example, in the original dataset shown
[00:05:41.907]in this green bar, there were more
[00:05:46.490]than 300 genes assigned to category E,
[00:05:51.500]which represents amino acid
[00:05:54.460]metabolism and transport.
[00:05:56.069]However, we observed fewer than 100
[00:05:59.907]genes assigned to this group
[00:06:02.630]if we have an average of 300 fragments
[00:06:05.860]in the dataset.
[00:06:07.424]That means fragmentation in MAGs leads
[00:06:10.700]to core gene loss, which causes underestimation
[00:06:16.171]in the functional analysis.
[00:06:19.421]In the phylogenetic comparison, we used the normalized
[00:06:23.681]Robinson-Foulds (nRF) distances
[00:06:27.598]to compare the differences
[00:06:29.124]between two phylogenetic trees.
[00:06:31.515]The phylogenetic trees were constructed based
[00:06:32.660]on gene presence and absence.
[00:06:34.950]The larger the nRF value is, the more
[00:06:38.156]different the trees.
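The nRF comparison described above can be illustrated with a small sketch. It counts the bipartitions (splits) found in only one of the two trees and divides by the total number of splits in both. Trees are written as nested tuples of leaf names; this representation, the function names, and the toy trees are assumptions made for illustration and do not reflect the tool used in the study. Both trees are assumed to have the same leaves, with at least four of them.

```python
def leaves(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions induced by internal branches."""
    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - clade
        # Keep only splits with at least two leaves on each side.
        if len(clade) > 1 and len(other) > 1:
            found.add(frozenset([clade, other]))
        return clade

    visit(tree)
    return found


def nrf(tree_a, tree_b):
    """Normalized Robinson-Foulds distance between two trees (0 to 1)."""
    splits_a, splits_b = splits(tree_a), splits(tree_b)
    # Splits present in exactly one tree, over all splits in both trees.
    return len(splits_a ^ splits_b) / (len(splits_a) + len(splits_b))


# Toy trees over five hypothetical strains.
tree_ref = (("A", "B"), ("C", ("D", "E")))
tree_half = (("A", "C"), ("B", ("D", "E")))
tree_far = (("A", "D"), ("B", ("C", "E")))
d_same = nrf(tree_ref, tree_ref)
d_half = nrf(tree_ref, tree_half)
d_far = nrf(tree_ref, tree_far)
```

Identical trees give 0, `tree_half` shares one of its two splits with `tree_ref` and gives 0.5, and `tree_far` shares none and gives 1.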
[00:06:42.020]We observed that the nRF values
[00:06:44.540]increased significantly with more incompleteness.
[00:06:50.800]It means that we may have more bias
[00:06:53.160]or errors in the phylogenetic tree construction
[00:06:56.400]when we use more incomplete MAGs.
[00:06:58.724]We performed the same analysis on different prokaryotic species,
[00:07:02.930]and also on multiple datasets to see
[00:07:05.480]the variation among random groups.
[00:07:07.980]The results were similar to those shown here.
[00:07:10.869]All in all, the genome fragmentation and incompleteness
[00:07:15.170]are the major reasons for decreasing
[00:07:17.030]core genome size in MAGs.
[00:07:19.700]There is also some potential misprediction
[00:07:22.133]or underestimation in downstream functional
[00:07:24.964]and phylogenetic analyses.
[00:07:27.499]In the future, we hope that better quality control
[00:07:31.440]of MAGs will be developed to improve
[00:07:34.030]the pan-genome analysis of MAGs.
[00:07:37.077]At the end of the presentation,
[00:07:38.443]I would like to acknowledge the support
[00:07:40.410]and guidance from my advisor, Dr. Yin,
[00:07:43.530]and all the suggestions from my supervisory committee,
[00:07:46.673]Dr. Moriyama and Dr. Chaves.
[00:07:49.940]I also appreciate the help from everyone
[00:07:52.469]in the Food Science and Technology Department
[00:07:55.420]and the Nebraska Food for Health Center.
[00:07:58.870]Thank you for listening.
[00:07:59.937]Please feel free to leave a comment
[00:08:02.460]if you have any questions.