Advances in Quantitative Trait Modeling
Authors: TIANJING ZHAO, Assistant Professor of Theoretical Quantitative Genetics, Department of Animal Science, University of Nebraska–Lincoln. BRIAN RICE, Assistant Professor of Quantitative Genetics, Department of Agronomy and Horticulture, University of Ne
Added: 05/28/2024
Description
Quantitative genetics is crucial for understanding complex traits, enhancing crop and livestock breeding, and tackling challenges such as food security and environmental sustainability. However, the breadth of topics in quantitative trait modeling is vast. In this seminar, three experts will discuss how cutting-edge research, mentoring students, and securing funding for innovative projects, can contribute to addressing global food security. Co-sponsored with the Department of Animal Science and the Department of Statistics.
Searchable Transcript
- [00:00:00.750]The following presentation
- [00:00:02.220]is part of the Agronomy and Horticulture Seminar Series
- [00:00:05.790]at the University of Nebraska Lincoln.
- [00:00:08.250]Well, good morning everyone.
- [00:00:09.300]Thank you for joining our last
- [00:00:11.316]Spring Agronomy and Horticulture Seminar.
- [00:00:15.630]I would like to thank you for being here.
- [00:00:18.330]This is a co-sponsored seminar
- [00:00:21.450]among the Departments of Agronomy,
- [00:00:23.100]Statistics, and Animal Science.
- [00:00:25.050]So thank you for all the coordination efforts.
- [00:00:29.250]The goal is to present this cluster hiring
- [00:00:32.730]in quantitative genetics.
- [00:00:33.840]So there was an idea last year
- [00:00:35.610]that we can bring it all together.
- [00:00:37.590]They recently joined in the last month, all of them to UNL.
- [00:00:41.550]So the purpose of the seminar is to, you know,
- [00:00:44.700]so they can present their programs and interact
- [00:00:46.830]and get to know each other.
- [00:00:48.884]So I hope we have a great seminar today.
- [00:00:51.660]This is our last seminar. In front of you,
- [00:00:53.790]you have a little survey.
- [00:00:56.070]So for those who have been attending the seminar,
- [00:00:58.380]please feel free to fill out that survey.
- [00:01:01.500]We really value your feedback.
- [00:01:03.720]Just a little bit of the evolution of the seminar.
- [00:01:07.290]We started in spring 2023 with 250 attendees,
- [00:01:10.827]and today we are breaking 600 attendees in spring 2024.
- [00:01:15.600]So we are happy that we are increasing participation
- [00:01:18.859]and interaction among faculty, students,
- [00:01:21.360]and staff and the community.
- [00:01:23.280]So thank you also for the online audience.
- [00:01:25.560]It's been growing also.
- [00:01:27.630]Without further introduction,
- [00:01:29.310]I will move to the topic of today,
- [00:01:31.740]Advancing Quantitative Trait Modeling.
- [00:01:35.220]Dr. Brian Rice from the Department of Agronomy,
- [00:01:38.010]Tianjing Zhao from the Department of Animal Science,
- [00:01:42.570]and Indranil Mukhopadhyay from the Statistics Department.
- [00:01:45.690]They will be presenting today their program vision
- [00:01:49.580]and we hope to have a great interaction.
- [00:01:52.230]And remember, we have some time for questions at the end.
- [00:01:54.990]For the online audience, you can place them in the chat,
- [00:01:57.840]or otherwise please raise your hand
- [00:01:59.520]and we'll bring the microphone for that.
- [00:02:01.950]So with that, Brian, the floor is yours.
- [00:02:07.620]Well, thanks for having us. Thanks for hiring us.
- [00:02:14.262]So I'll just start by making sure we're in the framework
- [00:02:19.350]of why we're doing this research,
- [00:02:22.290]why a lot of us may be in this room
- [00:02:25.680]specifically as well:
- [00:02:26.880]there are challenges in agriculture.
- [00:02:29.430]We have seen a great increase in yields in the fields.
- [00:02:35.494]So this is just showing
- [00:02:38.070]production of cereal crops in the US
- [00:02:40.980]over the last 50 years or so, 50, 60 years.
- [00:02:45.480]And that's really linear with the population growth,
- [00:02:48.120]it's kept up with the population growth.
- [00:02:50.670]It's been in response to that.
- [00:02:51.960]But at the same time, the land use has remained the same.
- [00:02:56.130]And there's debates about what this increase is from,
- [00:03:00.810]but we know that's probably due to
- [00:03:03.210]a mixture of management and genetics, right?
- [00:03:07.650]But as we move forward, the inputs
- [00:03:12.690]such as land, fertilizer, water,
- [00:03:15.990]these things are becoming more expensive, more scarce,
- [00:03:20.610]they may even be fought over at some point
- [00:03:22.560]in the future.
- [00:03:24.000]As well, we have a changing climate,
- [00:03:26.250]and biotic stresses are changing as well.
- [00:03:28.950]So there's really an important need for breeding
- [00:03:31.500]to look at and address
- [00:03:35.398]these threats or these challenges
- [00:03:39.840]to this growth that we've seen and to really meet it.
- [00:03:45.750]So what is quantitative genetics
- [00:03:47.790]and what can it do for you?
- [00:03:51.427]So it's important to put in context why you hired
- [00:03:53.790]three quantitative geneticists.
- [00:03:56.070]We hope to answer today why that is,
- [00:04:01.140]and what the breadth of work is that is involved
- [00:04:04.380]in quantitative genetics.
- [00:04:05.397]And so at its simplest... show up there, no, that'd be cool.
- [00:04:09.960]At its simplest,
- [00:04:11.160]you know, we have a genome of plants, animals, or humans.
- [00:04:18.090]And the goal is to predict
- [00:04:21.520]what the individual would be,
- [00:04:24.750]whether that's animals or plants or humans,
- [00:04:26.760]whether that's disease or some trait in animals or plants.
- [00:04:31.650]And many of these traits that we're interested in
- [00:04:34.650]vary in their genetic architecture.
- [00:04:37.740]So this is just the phenotype or trait distribution
- [00:04:42.000]with a theoretical number of underlying genes or QTLs,
- [00:04:47.910]loci underlying the trait.
- [00:04:49.613]And so, if you have only one gene,
- [00:04:52.440]you might have some individuals here
- [00:04:54.420]and some individuals here, if you have three,
- [00:04:57.441]sorry, two genes, you might
- [00:05:00.120]start to see distributions with three peaks.
- [00:05:02.790]And then once you keep adding and adding in genes,
- [00:05:05.370]you get more and more complex of a trait and your phenotype.
- [00:05:08.130]Now if you look at a whole population of individuals,
- [00:05:11.097]this looks very quantitative, a normal bell distribution
- [00:05:14.220]or maybe some skewed distribution or something.
- [00:05:17.790]But it's really just pointing out
- [00:05:19.170]that there are a lot of traits that follow this.
- [00:05:22.247]There are a lot of traits that follow this
- [00:05:23.340]and there's a lot of traits that are in between.
- [00:05:28.830]So what does quantitative genetics do
- [00:05:30.450]to kind of piece that apart?
- [00:05:32.670]One of the tools in quantitative genetics
- [00:05:34.290]is association studies,
- [00:05:35.400]and this really helps us with trait architecture dissection.
- [00:05:38.610]And so this is just, each point on this plot
- [00:05:41.490]is just a point in the sequence
- [00:05:43.980]and the height of the point,
- [00:05:46.710]the y-axis of the point is just its level
- [00:05:49.800]of association with our trait.
- [00:05:53.088]For the statisticians here,
- [00:05:56.790]it's just a linear regression between our trait values
- [00:06:02.268]and the allele states of the genes that we're looking at
- [00:06:04.650]or the sequence that we're looking at.
- [00:06:06.510]And so you can run, you can look at associations
- [00:06:09.810]across the genome and you can start to say which genes
- [00:06:12.265]are underlying your trait.
- [00:06:14.100]You know, you can answer questions
- [00:06:15.330]like how many genes are underlying your traits?
- [00:06:17.479]How many large genes, how many have a large effect
- [00:06:21.030]on your traits?
- [00:06:21.863]How many have small effects on your traits?
- [00:06:23.221]And this is done pretty routinely now
- [00:06:26.316]in animals, plants, and humans to find genes
- [00:06:30.570]that underlie important traits.
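The association scan described above, a linear regression of trait values on the allele state at each position, can be sketched roughly as follows. This is a minimal illustration on simulated data, not the speaker's pipeline: the function name `single_marker_scan`, the 0/1 allele coding, the sample sizes, and the single planted causal marker are all assumptions made for the example, and real analyses typically also correct for population structure and relatedness.

```python
import numpy as np

def single_marker_scan(genotypes, phenotype):
    """Regress the phenotype on each marker separately.

    Returns the per-marker slope and t-statistic; the t-statistic plays the
    role of the height of each point on the association plot.
    """
    n, m = genotypes.shape
    y = phenotype - phenotype.mean()
    slopes = np.zeros(m)
    t_stats = np.zeros(m)
    for j in range(m):
        x = genotypes[:, j] - genotypes[:, j].mean()
        sxx = x @ x
        if sxx == 0:  # monomorphic marker: no variation to regress on
            continue
        b = (x @ y) / sxx
        resid = y - b * x
        se = np.sqrt((resid @ resid) / (n - 2) / sxx)
        slopes[j] = b
        t_stats[j] = b / se
    return slopes, t_stats

# Simulated data (all values here are assumptions for illustration).
rng = np.random.default_rng(0)
n, m = 500, 200
genotypes = rng.integers(0, 2, size=(n, m)).astype(float)  # 0/1 allele states
causal = 17  # hypothetical causal marker
phenotype = 2.0 * genotypes[:, causal] + rng.normal(size=n)

slopes, t_stats = single_marker_scan(genotypes, phenotype)
top_marker = int(np.argmax(np.abs(t_stats)))  # strongest association
```

Counting how many markers exceed a significance threshold, and how large their slopes are, is one simple way to start asking how many loci of large or small effect a trait has.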
- [00:06:33.300]So the other thing quantitative genetics can do for you
- [00:06:38.520]is provide you with a prediction, right?
- [00:06:42.810]So since we are measuring a relationship
- [00:06:45.660]between some variable and some outcome,
- [00:06:49.470]we can use that to create a prediction
- [00:06:51.300]for unknown scenarios.
- [00:06:53.370]And this is just showing that, you know,
- [00:06:55.440]we might have in these individuals here,
- [00:06:58.747]we might have this purple and yellow marker,
- [00:07:02.130]we can sequence it really easily
- [00:07:04.050]and most of the time the purple marker
- [00:07:06.810]is tagged with the actual red variant.
- [00:07:10.710]This is what's causing the change in phenotype.
- [00:07:13.020]And we can use this to predict, the outcome, right?
- [00:07:16.941]If we see this blue, then you probably have this,
- [00:07:20.610]most of the time you, sorry, the purple,
- [00:07:22.230]most of the time you have the red and the yellow,
- [00:07:24.630]sorry, black and the yellow.
- [00:07:25.740]And most of the time you have the red.
- [00:07:27.060]And so we can use that to predict,
- [00:07:28.650]a very simple Mendelian prediction, a single locus.
- [00:07:35.490]What we know, though, is that a lot of traits
- [00:07:37.860]are much more complex than a single gene.
- [00:07:40.200]And so we can take these relationships
- [00:07:42.540]between two different loci.
- [00:07:46.920]So zero being the state of the gene in some individuals
- [00:07:50.460]and one being the state of the gene in other individuals.
- [00:07:52.950]And it has an effect.
- [00:07:54.270]And we can take all of those zeros and ones
- [00:07:57.690]across the genome and regress it
- [00:07:59.640]into some more simple information.
- [00:08:02.250]And this is really what genomic prediction is good at,
- [00:08:06.780]is taking a bunch of information,
- [00:08:09.432]which is a challenge in statistics,
- [00:08:11.340]is modeling with a bunch of information
- [00:08:13.140]and really regressing it down to a simple model
- [00:08:18.856]that doesn't violate too many assumptions of modeling.
- [00:08:23.400]And so this is just a classical model,
- [00:08:25.738]what's considered the genomic BLUP model,
- [00:08:29.421]in which you're involving some effects of your individuals
- [00:08:34.020]and out you get an estimate of your phenotype
- [00:08:37.020]for your trait.
- [00:08:37.853]And so this is what quantitative genetics can do
- [00:08:42.180]with lots of information.
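As a rough sketch of what regressing all of the zeros and ones across the genome into one model can look like, the example below fits every marker effect jointly with ridge regression, which is closely related to the genomic BLUP model mentioned above. The data are simulated, and the function name `ridge_marker_effects`, the sizes, and the shrinkage value `lam` (taken here as the simulated ratio of residual variance to marker-effect variance) are assumptions for illustration only.

```python
import numpy as np

def ridge_marker_effects(Z, y, lam):
    """Estimate all marker effects jointly by solving (Z'Z + lam*I) a = Z'y.

    The shrinkage term lam*I keeps the system solvable even when there are
    more markers than individuals.
    """
    Zc = Z - Z.mean(axis=0)
    yc = y - y.mean()
    m = Z.shape[1]
    return np.linalg.solve(Zc.T @ Zc + lam * np.eye(m), Zc.T @ yc)

# Simulated population (all numbers are assumptions for illustration).
rng = np.random.default_rng(1)
n_train, n_test, m = 400, 200, 600  # more markers than training individuals
Z = rng.integers(0, 2, size=(n_train + n_test, m)).astype(float)
true_effects = rng.normal(scale=0.1, size=m)  # many small effects
y = Z @ true_effects + rng.normal(size=n_train + n_test)

Z_train, y_train = Z[:n_train], y[:n_train]
Z_test, y_test = Z[n_train:], y[n_train:]

# lam = residual variance / marker-effect variance in this simulation.
effects = ridge_marker_effects(Z_train, y_train, lam=100.0)
predicted = (Z_test - Z_train.mean(axis=0)) @ effects + y_train.mean()

# Prediction accuracy: correlation between predicted and observed values.
accuracy = np.corrcoef(predicted, y_test)[0, 1]
```

Here the training and test individuals come from the same simulated population, which is the favorable case for this kind of model.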
- [00:08:46.170]This was just with sequence information.
- [00:08:48.000]But we know from the central dogma of genetics that
- [00:08:52.318]there are plenty of influential factors
- [00:08:56.040]other than just the sequence that influence your trait.
- [00:08:59.030]So you have the sequence,
- [00:09:00.300]but then you have the transcriptomics.
- [00:09:02.340]And this is, you know, when your genes
- [00:09:05.070]are being expressed at what developmental time,
- [00:09:07.500]under what selection pressures,
- [00:09:09.929]it varies between individuals, what your expression is.
- [00:09:12.875]Then your proteome, so what your proteins look like.
- [00:09:17.520]They get chopped up variably after they're translated.
- [00:09:22.710]And then from there, lots of proteins
- [00:09:26.430]then go on to make metabolites form complexes
- [00:09:30.000]and those vary between individuals as well.
- [00:09:33.158]So there's information at each step of this
- [00:09:35.040]that is heritable, that helps us predict our individuals.
- [00:09:44.730]So what I'm showing here is the problem
- [00:09:48.030]that quantitative genetics has not overcome yet.
- [00:09:53.190]What it's showing here, on the x-axis,
- [00:09:57.960]is observed values for individuals,
- [00:10:02.730]and then this is the predicted value for individuals.
- [00:10:05.880]And the scenario that was set up here
- [00:10:08.940]was in simulation, just to illustrate a point,
- [00:10:11.640]but the scenario that's set up here
- [00:10:12.990]is that the model is trained on one population
- [00:10:15.690]of related individuals
- [00:10:17.370]and then it's used to predict a different population
- [00:10:19.890]of unrelated individuals.
- [00:10:21.630]So you could think of maize that's grown in Mexico
- [00:10:24.780]and that's adapted to Mexico,
- [00:10:26.280]the model being trained there
- [00:10:27.930]and then used to predict maize that's grown in Nebraska.
- [00:10:31.560]These are unrelated populations.
- [00:10:33.480]They've been bred to be adapted to Mexico
- [00:10:36.000]and be adapted to Nebraska.
- [00:10:38.160]They're different. Training the model in Mexico
- [00:10:42.150]and predicting in Nebraska is a problem; it does not work.
- [00:10:46.740]We got a prediction accuracy when we do that
- [00:10:49.230]of 0.24, just in the simulation.
- [00:10:52.500]But this is, it's probably often worse
- [00:10:54.780]in real data trying to do cross prediction.
- [00:10:58.380]And it really is because
- [00:11:00.780]we're using purely just statistical relationships.
- [00:11:04.050]We're using just relatedness between individuals.
- [00:11:06.240]Everybody in Mexico is more related to people in Mexico
- [00:11:08.940]than they are to people in Nebraska.
- [00:11:10.410]And so these statistical relationships break down
- [00:11:13.315]when you try to predict in Nebraska.
- [00:11:16.377]Where quantitative genetics and modeling is going
- [00:11:20.220]is saying that, alright, we've shown we can predict
- [00:11:23.730]related individuals with sequence data;
- [00:11:26.596]how can we predict unrelated individuals?
- [00:11:30.720]And really it's about getting down
- [00:11:33.600]to what are the actual causal variants.
- [00:11:35.580]So getting away from just using purely statistics
- [00:11:39.360]and getting more into incorporating
- [00:11:41.070]all this level of omics data that we have now
- [00:11:44.850]and arguing with my colleagues,
- [00:11:46.620]that's kind of where a lot of
- [00:11:49.320]our research programs are going
- [00:11:50.820]is how do we take this breadth of knowledge
- [00:11:53.280]that we have now in the omics space
- [00:11:56.070]and incorporate it into our predictions.
- [00:12:00.630]And how do we do it properly
- [00:12:02.100]with this complex amount of data.
- [00:12:05.910]So that's the introduction to quantitative genetics
- [00:12:09.001]and what quantitative genetics can do for you.
- [00:12:10.440]So now the three of us are gonna give
- [00:12:12.570]a brief vision or some previous work from our labs
- [00:12:17.520]that will kind of show you what our labs will be doing.
- [00:12:20.640]So this is me, my wife, we just moved from Colorado
- [00:12:24.014]in January with our two pit bulls.
- [00:12:27.540]I'm originally from Saskatchewan,
- [00:12:30.060]land of a lot of wheat and not a lot of people
- [00:12:32.520]as well as some canola.
- [00:12:35.557]So I'm trained in quantitative genetics
- [00:12:37.440]from the University of Illinois under Alexander Lipka.
- [00:12:41.603]And now I'm here.
- [00:12:44.880]So this is kind of work that I did as a PhD,
- [00:12:49.080]but that is now kind of still at the basis
- [00:12:51.990]of how I think about modeling.
- [00:12:53.910]And so in cereal crops as well as other crops as well.
- [00:13:00.000]But when we think about planting densely,
- [00:13:03.000]we think about how the plants are positioned
- [00:13:05.460]related to each other.
- [00:13:06.510]And so you can see this is showing basically
- [00:13:10.380]light that's being absorbed at the different levels
- [00:13:13.560]in a field based on the canopy.
- [00:13:15.840]And when we have a more upright canopy,
- [00:13:18.330]you can have deeper light penetration.
- [00:13:20.970]And so this tells us that, you know,
- [00:13:23.340]having this agronomic trait of leaf angle
- [00:13:26.094]is important as well.
- [00:13:27.761]The tassel in maize
- [00:13:32.332]also blocks light.
- [00:13:34.290]There are studies that show that after pollination,
- [00:13:38.160]if you cut the tassels off,
- [00:13:39.420]you can actually get increased yield.
- [00:13:40.770]So it's blocking light from penetrating.
- [00:13:43.560]So as we think about pushing planting densities,
- [00:13:46.260]that's a very important trait to think about.
- [00:13:48.270]And this is just kind of showing
- [00:13:51.925]that at a developmental level,
- [00:13:53.670]there's an inflorescence meristem
- [00:13:55.980]that develops both into the tassel
- [00:13:58.620]and into the leaf actually.
- [00:14:00.840]So the same cells are differentiating
- [00:14:04.470]between different tissues.
- [00:14:06.270]So it's also a very interesting pleiotropic problem
- [00:14:08.790]as well too.
- [00:14:10.920]I worked with folks from the Danforth Center.
- [00:14:14.100]They took a bunch of these mutants in maize,
- [00:14:17.100]classical mutants in maize that have different tassel
- [00:14:20.130]and leaf angle phenotypes.
- [00:14:22.110]So genes that when you knock 'em out,
- [00:14:24.450]you see drastic differences in the tassel and the leaf
- [00:14:29.310]to different degrees.
- [00:14:30.840]He grew them up, Eduardo grew 'em up,
- [00:14:34.140]and then he took from this developing meristem tissue,
- [00:14:37.580]he took the transcript data.
- [00:14:41.520]And so the SAM is just the shoot apical meristem
- [00:14:46.350]and tassel primordia is just developing tassel.
- [00:14:49.547]So he used this data then to develop a network.
- [00:14:52.680]So expression is happening at a network level;
- [00:14:57.150]genes are interacting with each other.
- [00:14:59.893]They're being regulated by some of the same regulators.
- [00:15:02.910]So this first bit is just co-expression.
- [00:15:05.940]So these are genes that are being co-expressed together,
- [00:15:08.550]in these tissues.
- [00:15:09.383]So if they're being expressed together,
- [00:15:11.250]then we make the assumption
- [00:15:13.320]that they might be having some related function.
- [00:15:17.580]And then as well,
- [00:15:19.770]because a lot of these are transcription factors.
- [00:15:22.050]Transcription factors are targeting places in the genome.
- [00:15:24.840]We can have some directionality
- [00:15:26.490]and we can make a network, we can say that these genes,
- [00:15:28.920]we can develop a network and say
- [00:15:30.240]these genes are interacting with these points
- [00:15:33.030]in this direction and vice versa.
- [00:15:35.460]So he developed this information
- [00:15:37.710]from this set of lines and expression data that he had.
- [00:15:43.410]The issue being that this is great,
- [00:15:45.570]we can understand what's happening
- [00:15:47.100]at a developmental transcriptomic level,
- [00:15:49.500]but it's very limited information for breeding.
- [00:15:52.770]It's only on a subset of individuals.
- [00:15:55.410]There was no clear way of how you then
- [00:15:57.270]take this and predict a large population,
- [00:15:59.700]which is what breeding wants to do.
- [00:16:02.430]So what we did is we measured,
- [00:16:05.968]this is my work, we measured tassel and leaf angle
- [00:16:09.690]in just over a thousand different lines in the field.
- [00:16:13.738]We then said, "Okay," I said,
- [00:16:16.327]"Eduardo, which genes in the network
- [00:16:17.677]"do you think are important?"
- [00:16:19.350]We then prioritized our search space
- [00:16:22.080]to just the genes that he said were important.
- [00:16:25.020]And it was based on some different criteria,
- [00:16:27.150]but mostly the most connected genes
- [00:16:29.400]in the network were the ones
- [00:16:30.330]that we wanted to look at, really
- [00:16:31.740]the ones in the network that are core regulators,
- [00:16:36.463]modules in the network.
- [00:16:39.510]And we ran our GWAS and I said,
- [00:16:41.977]"Okay, these are the genes that are showing up
- [00:16:44.947]"as associated."
- [00:16:46.770]He went and then looked for mutants of these genes
- [00:16:49.530]and confirmed that indeed they do have a phenotypic effect.
- [00:16:54.943]So very quickly we were able to
- [00:16:58.980]show that the information from his network holds true
- [00:17:03.390]in natural populations as well.
- [00:17:06.360]We then tried to ask,
- [00:17:09.540]does this help with this cross population prediction?
- [00:17:12.210]Can we use these core regulators
- [00:17:15.060]to have more biologically informed models?
- [00:17:18.870]And so again, the prediction problem
- [00:17:20.460]is training in a set of related individuals
- [00:17:23.490]and then predicting in a set of unrelated individuals.
- [00:17:26.940]And so when we use the whole genome,
- [00:17:30.750]this is where we get this red line.
- [00:17:34.320]So going back to that simulation I showed
- [00:17:37.380]right at the beginning,
- [00:17:38.610]this would be just the whole genome
- [00:17:41.100]this is our prediction accuracy with the red line,
- [00:17:43.980]and then if we only use the top hundred regulators,
- [00:17:48.660]we can beat out the model.
- [00:17:50.940]There were a few other subsets of genes that also beat out
- [00:17:53.910]just using the whole genome.
- [00:17:55.050]So this is really telling us that in the whole genome
- [00:17:58.350]there's redundant information,
- [00:17:59.730]there's incorrect information
- [00:18:01.830]that's being used when we're predicting
- [00:18:03.390]from one population to another.
- [00:18:04.890]And we can have a smaller
- [00:18:07.450]and more correct set of information
- [00:18:09.990]when we subset using this omics data,
- [00:18:12.540]this biologically informed data.
- [00:18:15.990]That's really at the basis
- [00:18:18.570]of how I think about trait modeling
- [00:18:20.709]is figuring out ways to really incorporate
- [00:18:23.499]this omics data into our data.
- [00:18:26.670]And so this is just a lot of people
- [00:18:28.800]that were involved
- [00:18:29.820]in that very specific project,
- [00:18:31.511]but obviously more that have gotten me
- [00:18:33.930]to where I'm at today.
- [00:18:34.763]So with that, I'm gonna pass it on to Tianjing
- [00:18:38.835]and I think we'll probably handle questions
- [00:18:41.220]at the end if there are some.
- [00:19:00.450]Hi everyone, I am Tianjing.
- [00:19:05.250]Let me briefly introduce my background.
- [00:19:08.160]I finished my bachelor's undergraduate study in China,
- [00:19:13.350]then I came to UC Davis for my master's study in statistics.
- [00:19:18.210]Then I joined Professor Hao Cheng's lab at UC Davis
- [00:19:21.870]to study statistical and quantitative genetics.
- [00:19:27.983]After that, I continued my research
- [00:19:31.200]at UNL as a theoretical quantitative geneticist
- [00:19:35.160]in the Animal Science Department.
- [00:19:38.340]Quantitative genetics actually has a very long history
- [00:19:42.540]in animal science.
- [00:19:45.780]Here is a pedigree of animal breeders
- [00:19:49.260]and you can find my
- [00:19:52.500]name here
- [00:19:55.057]and also many
- [00:19:59.774]very good quantitative geneticists in animal breeding.
- [00:20:06.270]And today I'm going to briefly talk about,
- [00:20:12.540]let's say, my published research,
- [00:20:14.310]mainly work with my PhD advisor, Hao Cheng.
- [00:20:18.210]And that work is a kind of foundation
- [00:20:23.250]of my current and future work at UNL.
- [00:20:29.178]So I split my research into three topics,
- [00:20:32.670]new data, large data
- [00:20:37.592]and sharing data.
- [00:20:40.980]So first, in terms of the new data,
- [00:20:47.670]we propose a new model to extend the mixed models
- [00:20:52.380]to multi-layer neural networks
- [00:20:54.900]for genomic prediction,
- [00:20:56.820]including the intermediate omics data.
- [00:21:02.790]So as Brian just mentioned,
- [00:21:05.760]now we start to have a lot of intermediate omics data.
- [00:21:10.080]For example, on the animal science side,
- [00:21:13.080]we have the FAANG and FarmGTEx consortia.
- [00:21:17.430]They have generated a lot of gene expression data,
- [00:21:20.820]and those gene expressions,
- [00:21:24.120]they are the intermediate features
- [00:21:26.160]between the upstream genotype and downstream phenotypes.
- [00:21:31.440]And they formed a multi-layer regulation network.
- [00:21:35.610]And inspired by this biological structure,
- [00:21:40.890]we propose a new model to represent this structure
- [00:21:45.270]called NNMM,
- [00:21:46.710]where NN stands for neural network
- [00:21:48.810]and MM stands for mixed models.
- [00:21:51.780]So in NNMM we are able to...
- [00:21:56.490]So besides the genotype and phenotype data,
- [00:21:58.427]we are able to add intermediate omics data
- [00:22:01.680]in the middle layer.
- [00:22:03.180]For example, let's say for cattle,
- [00:22:08.310]for one animal
- [00:22:10.950]we have its genotype data,
- [00:22:12.750]we have its milk yield,
- [00:22:15.180]and we have the gene expression, for the first gene 0.9,
- [00:22:19.260]for the second gene, for the third gene.
- [00:22:21.450]And we use mixed models to model unknowns
- [00:22:24.482]between input and middle layer
- [00:22:27.210]and also between middle layer and output layer.
- [00:22:29.910]And we allow activation functions
- [00:22:32.400]from a neural network here to approximate
- [00:22:37.170]the complex non-linear relationships.
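The layered structure described above (genotypes in the input layer, omics features in the middle layer, the phenotype in the output layer, with an activation function in between) can be sketched as a simple forward pass. This is only an illustration of the architecture and not the NNMM software: the weights are random rather than estimated, `tanh` is an arbitrary choice of activation, and filling unobserved omics values with their genotype-based predictions is a simplification assumed for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 6, 50, 3  # animals, markers, intermediate omics features (assumed sizes)

X = rng.integers(0, 3, size=(n, p)).astype(float)  # genotypes: input layer
W1 = rng.normal(scale=0.1, size=(p, k))            # marker effects on each omics feature
w2 = rng.normal(size=k)                            # omics effects on the phenotype

# Input layer -> middle layer: each omics feature is modeled from the genotypes.
omics_predicted = X @ W1

# Observed omics data enter the model at the middle layer.
# NaN marks animals without omics measurements (here, the last three).
omics_observed = np.full((n, k), np.nan)
omics_observed[:3] = omics_predicted[:3] + rng.normal(scale=0.1, size=(3, k))

# Use the observed value where available, the genotype-based value otherwise.
middle = np.where(np.isnan(omics_observed), omics_predicted, omics_observed)

# Middle layer -> output layer: nonlinear activation, then a linear combination.
phenotype_predicted = np.tanh(middle) @ w2
```

The point of the sketch is only that omics data sit between genotype and phenotype rather than alongside the genotypes in a single input layer.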
- [00:22:43.180]So I will talk about some advantages of our model.
- [00:22:48.330]So we allow different input data across multiple layers.
- [00:22:53.160]For example, here we have genotypes in the first layer,
- [00:22:56.310]and we have omics data, such as gene expression,
- [00:22:58.620]in the second layer.
- [00:23:00.390]But for other, let's say general-purpose,
- [00:23:04.410]let's say deep learning software like PyTorch,
- [00:23:07.440]they only allow users to put all the data,
- [00:23:10.380]let's say genotype and omics data, in the input layer.
- [00:23:13.950]And everything in the middle
- [00:23:16.353]is like a black box.
- [00:23:20.892]Our method is based on mixed models.
- [00:23:24.120]And mixed models have a long, long history
- [00:23:29.640]in animal and plant breeding.
- [00:23:31.140]And we hope the success of mixed models can continue here.
- [00:23:35.580]And also our model is based on the Bayesian framework.
- [00:23:40.650]So we can provide statistical uncertainty
- [00:23:43.564]or significance measures for association studies.
- [00:23:49.170]But for those general-purpose,
- [00:23:52.650]let's say machine learning methods,
- [00:23:54.660]they only provide you the point estimate.
- [00:24:00.780]And it needs little tuning.
- [00:24:05.012]And we have published some methodology
- [00:24:08.100]papers about NNMM.
- [00:24:14.277]So here NNMM is one example
- [00:24:18.390]of what the mixed models,
- [00:24:20.760]the conventional mixed models, cannot handle:
- [00:24:25.838]adding an intermediate layer for omics data
- [00:24:32.208]and forming a connected network.
- [00:24:36.360]So here actually is another example
- [00:24:38.850]that you will see the conventional mixed model
- [00:24:41.340]is not able to handle.
- [00:24:44.490]So this is another project,
- [00:24:46.170]learning the functional conservation
- [00:24:47.970]between human and pig.
- [00:24:49.920]So the pig is very important for human studies.
- [00:24:55.530]And before, let's say we have a DNA segment from human,
- [00:25:02.210]and this is a DNA segment from pig,
- [00:25:05.460]we can know the sequence conservation
- [00:25:15.200]between those sequences.
- [00:25:16.200]And nowadays we have some new data,
- [00:25:19.535]some functional data of the genome. For human,
- [00:25:23.040]we have the ENCODE project and the Roadmap Epigenomics project.
- [00:25:27.600]They have generated and released a great deal of functional data.
- [00:25:32.040]And for pig, we have a lot of functional data
- [00:25:35.760]from the farm projects.
- [00:25:37.650]So with the new data, now we have a new challenge:
- [00:25:40.830]how to use those new data
- [00:25:43.410]to measure the conservation
- [00:25:49.710]of their function.
- [00:25:50.700]So let's say, to what extent these sequences
- [00:25:55.710]are similar in terms of their function.
- [00:25:59.610]So we propose this neural network model,
- [00:26:03.630]let's say we have one pair of segments,
- [00:26:07.650]one from human, one from pig
- [00:26:11.396]and we can extract the functional features
- [00:26:16.950]from these new data.
- [00:26:20.580]Then here we have a human-specific subnetwork
- [00:26:25.620]and here's a pig-specific subnetwork.
- [00:26:31.862]And so in this deep learning framework,
- [00:26:38.723]we have let's say about 400 functional features
- [00:26:42.060]for human and pig respectively.
- [00:26:45.660]And we have Y, that's the conservation score;
- [00:26:50.820]Y is binary data, zero and one.
- [00:26:58.920]One means they're likely to have similar function
- [00:27:02.040]and zero means they're unlikely
- [00:27:03.840]to have similar function.
- [00:27:07.269]So after we train this neural network,
- [00:27:12.300]we will have genome-wide functional conservation scores.
- [00:27:17.460]Let's say we translate this zero one binary data
- [00:27:21.170]into a score between zero and one.
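A rough sketch of the two-subnetwork design described above: each species' functional features pass through their own subnetwork, the two hidden representations are joined, and a sigmoid output turns the zero/one training label into a score between zero and one. This is not the published model. The hidden size, the ReLU activation, and the random (untrained) weights are assumptions for illustration; only the roughly 400 features per species and the 0-to-1 output come from the talk.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
n_pairs, n_features, hidden = 8, 400, 16  # aligned segment pairs, features per species, hidden units

# Functional features for each aligned pair (random stand-ins for real annotations).
human = rng.normal(size=(n_pairs, n_features))
pig = rng.normal(size=(n_pairs, n_features))

# Species-specific subnetworks: separate weights for human and pig inputs.
W_human = rng.normal(scale=0.05, size=(n_features, hidden))
W_pig = rng.normal(scale=0.05, size=(n_features, hidden))
w_out = rng.normal(scale=0.5, size=2 * hidden)

h_human = relu(human @ W_human)
h_pig = relu(pig @ W_pig)

# Join the two representations and map to a score between zero and one.
joint = np.concatenate([h_human, h_pig], axis=1)
scores = sigmoid(joint @ w_out)
```

In a trained network the weights would be fitted against the binary labels, so that pairs labeled one receive scores near one; with the random weights here the scores carry no biological meaning.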
- [00:27:24.900]So we used this many alignments,
- [00:27:28.440]covering 45% of human genome,
- [00:27:31.707]and we got the whole genome-wide
- [00:27:35.130]functional conservation score.
- [00:27:36.930]So most of the scores are low,
- [00:27:42.054]which means they do not have a similar function,
- [00:27:45.990]but we do find some regions with high scores,
- [00:27:50.400]which indicates they are conserved
- [00:27:54.781]in their function.
- [00:27:57.570]And we have publications
- [00:28:02.278]about these projects.
- [00:28:05.400]And there are a lot of analyses
- [00:28:07.980]to validate these functional conservation scores.
- [00:28:13.490]So why do I say this is something that cannot be done
- [00:28:18.660]by the conventional mixed model?
- [00:28:21.420]Because when you look at the models,
- [00:28:23.910]actually we do care about the order of the inputs.
- [00:28:27.120]So let's say this is the functional features of,
- [00:28:30.810]let's say, of tissue one in human
- [00:28:34.200]tissue two in human, tissue three in human,
- [00:28:36.780]this is the functional features
- [00:28:39.690]for tissue one in pig, tissue two in pig.
- [00:28:44.401]So the order matters,
- [00:28:45.840]but for the conventional mixed models,
- [00:28:48.330]actually, let's say, y = x1 + x2,
- [00:28:53.220]actually the order of x1 and x2 doesn't matter.
- [00:28:57.930]So that's why there's a need
- [00:29:00.630]for a new methodology development for this new challenge
- [00:29:06.636]brought by the new data.
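The remark that the order of x1 and x2 does not matter in an additive model can be checked with a small sketch: fitting a linear model by least squares gives the same fitted values however the input columns are ordered. The data below are random and the helper name `fitted_values` is made up for this example; it illustrates only the additive-model side of the argument.

```python
import numpy as np

def fitted_values(X, y):
    """Least-squares fit of y on the columns of X; returns the fitted values."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 3))  # three input features
y = rng.normal(size=20)

# Reorder the input columns; the additive model gives the same fit.
permuted = X[:, [2, 0, 1]]
same_fit = np.allclose(fitted_values(X, y), fitted_values(permuted, y))
```

An additive model of this kind therefore has no built-in way to tie the first block of inputs to human tissues and the second block to pig tissues, which is the structure the two-subnetwork design encodes.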
- [00:29:10.470]And the second part is large data.
- [00:29:12.900]So for large data we propose a fast parallelized sampling
- [00:29:17.640]for the Bayesian linear mixed model.
- [00:29:23.976]So this is a typical way to solve the
- [00:29:26.730]Bayesian linear mixed model.
- [00:29:28.500]We use MCMC sampling;
- [00:29:30.372]we sample, let's say, marker effects alpha one and alpha two
- [00:29:36.270]from their full conditional posterior distributions.
- [00:29:41.867]So here is actually the equation to sample them.
- [00:29:46.020]This is just the mixed model equation.
- [00:29:48.930]We have the left hand side, we have the right hand side,
- [00:29:52.020]then we can solve the alpha j hat.
- [00:29:56.057]So the one problem here is those alpha j hats,
- [00:29:59.219]they are dependent.
- [00:30:00.120]You can see from this equation we need to use,
- [00:30:02.700]when we sample alpha j, we need all the other alphas.
- [00:30:09.317]So this dependence makes...
- [00:30:13.670]Because they are not independent,
- [00:30:15.330]because they are dependent,
- [00:30:16.710]we cannot sample them simultaneously at the same time.
- [00:30:21.120]So we propose a new method, we add new data here,
- [00:30:29.423]such that for the new,
- [00:30:32.972]for the new incidence matrix, they are orthogonal.
- [00:30:38.400]Those columns are orthogonal to each other,
- [00:30:41.610]which makes this part become zero.
- [00:30:44.040]So in our new algorithms,
- [00:30:47.610]those marker effects at each iteration,
- [00:30:50.520]they can be sampled at the same time.
- [00:30:52.740]So we can use parallel computing
- [00:30:55.369]to parallelize the sampling of marker effects
- [00:30:59.610]at each iteration.
- [00:31:01.410]And we have the methodology paper published
- [00:31:06.090]a few years ago.
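The effect of orthogonal columns on the sampler can be sketched as follows. This is an illustration under stated assumptions, not the published algorithm: the data are simulated, `lam` is a hypothetical variance ratio, the extra rows `X_tilde` are built with a Cholesky factor so that the stacked matrix `W` has orthogonal columns, and the augmented response is set to zeros as a placeholder (a full sampler would have to treat it as unobserved and sample it).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 5
X = rng.integers(0, 3, size=(n, p)).astype(float)   # simulated 0/1/2 genotypes
y = rng.normal(size=n)                               # simulated phenotypes
lam = 1.0                                            # hypothetical variance ratio

def conditional_mean(W, resp, alpha, j, lam):
    """Mean of the full conditional of alpha_j given all other alphas."""
    others = np.delete(np.arange(W.shape[1]), j)
    adjusted = resp - W[:, others] @ alpha[others]   # response corrected for other markers
    return W[:, j] @ adjusted / (W[:, j] @ W[:, j] + lam)

# Augment X with extra rows so that W'W = d * I (orthogonal columns).
XtX = X.T @ X
d = np.linalg.eigvalsh(XtX).max() + 1.0
X_tilde = np.linalg.cholesky(d * np.eye(p) - XtX).T  # X_tilde' X_tilde = d*I - X'X
W = np.vstack([X, X_tilde])
y_aug = np.concatenate([y, np.zeros(p)])             # placeholder for the unobserved part

# Change the other alphas and see whether the conditional mean of alpha_0 moves.
alpha_a = rng.normal(size=p)
alpha_b = alpha_a.copy()
alpha_b[1:] += 1.0
dep_orig = conditional_mean(X, y, alpha_a, 0, lam) - conditional_mean(X, y, alpha_b, 0, lam)
dep_aug = conditional_mean(W, y_aug, alpha_a, 0, lam) - conditional_mean(W, y_aug, alpha_b, 0, lam)

print(dep_orig, dep_aug)
```

With the original matrix the conditional mean of one marker effect shifts when the others change; with the augmented matrix the cross terms vanish, which is what would allow all marker effects to be drawn in parallel within an iteration.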
- [00:31:09.780]So in terms of the third part, sharing data,
- [00:31:13.140]we proposed HEGP, homomorphic encryption
- [00:31:17.550]for genotypes and phenotypes
- [00:31:19.620]for shared genome to phenome analysis.
- [00:31:23.190]So data sharing is very important in academia.
- [00:31:27.360]So we have policies from both journals and funding agencies,
- [00:31:32.160]and it's also very important in industry.
- [00:31:40.658]So for us as researchers,
- [00:31:45.431]we want to follow the FAIR principles for data sharing.
- [00:31:48.960]It's findable, accessible, interoperable and reusable.
- [00:31:56.177]So here we propose the HEGP,
- [00:31:58.677]the homomorphic encryption
- [00:32:00.507]for genotypes and phenotypes.
- [00:32:03.870]So here we want to encrypt the raw genotype
- [00:32:07.230]and phenotype data.
- [00:32:08.280]So this is the raw data for genotypes.
- [00:32:11.130]We have, these are three values, 0, 1, 2,
- [00:32:13.710]this is the phenotype data.
- [00:32:16.410]So the HEGP is done by multiplying the raw data
- [00:32:21.240]by a randomly generated orthogonal,
- [00:32:24.660]a high-dimensional orthogonal matrix.
- [00:32:27.120]So after the encryption,
- [00:32:29.970]this is the encrypted genotypes, PX,
- [00:32:33.150]this is the encrypted phenotypes
- [00:32:35.220]and only the encrypted data will be shared.
- [00:32:38.130]And the key, P, will not be shared.
- [00:32:47.954]So for HEGP before and after encryption,
- [00:32:49.920]the relationship between SNPs is conserved.
- [00:32:53.490]They're identical,
- [00:32:56.160]but the relationship between individuals
- [00:32:59.220]is destroyed by HEGP.
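The two properties just described can be checked on a toy example. The sketch below assumes simulated genotypes and phenotypes and takes the encryption to be a left-multiplication by a random orthogonal key P, as described above; drawing P from a QR decomposition is an assumption made here for illustration, and the least-squares fit is a simple stand-in for a genome-to-phenome analysis.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 8, 3                                          # individuals, SNPs
X = rng.integers(0, 3, size=(n, p)).astype(float)    # raw genotypes coded 0/1/2
y = X @ np.array([0.5, -1.0, 0.25]) + rng.normal(scale=0.1, size=n)

# Random orthogonal key (kept private), obtained here from a QR decomposition.
P, _ = np.linalg.qr(rng.normal(size=(n, n)))
X_enc = P @ X                                        # encrypted genotypes
y_enc = P @ y                                        # encrypted phenotypes

# Relationships between SNPs (X'X) are unchanged by the encryption.
snp_preserved = np.allclose(X_enc.T @ X_enc, X.T @ X)
# Relationships between individuals (XX') are not.
individuals_changed = not np.allclose(X_enc @ X_enc.T, X @ X.T)

# SNP effect estimates from raw and encrypted data agree.
beta_raw, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_enc, *_ = np.linalg.lstsq(X_enc, y_enc, rcond=None)
same_estimates = np.allclose(beta_raw, beta_enc)

print(snp_preserved, individuals_changed, same_estimates)
```

Because P'P = I, any analysis that depends on the data only through X'X and X'y gives the same answer on the encrypted data, while the rows no longer correspond to identifiable individuals.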
- [00:33:03.150]And also our method works for, let's say, joint analysis.
- [00:33:07.830]Let's say in industry, this is company one.
- [00:33:11.160]They can encrypt their own data.
- [00:33:13.560]For company two they can encrypt their data
- [00:33:17.070]and they can share the encrypted data for joint analysis.
- [00:33:21.090]And we have recently published a paper
- [00:33:26.908]for this methodology.
- [00:33:31.050]And let's say, let me summarize my research.
- [00:33:34.260]So in my research, we proposed the neural network
- [00:33:39.150]with mixed models to incorporate intermediate omics
- [00:33:45.232]using conventional mixed models.
- [00:33:48.900]And let's say to solve the challenge
- [00:33:52.874]of functional data, we developed the deep learning framework
- [00:33:58.590]for cross-species analysis
- [00:34:02.280]and also to solve the computation, the computation burden,
- [00:34:06.210]we proposed a fast parallelized sampling.
- [00:34:10.704]And also for data sharing,
- [00:34:13.230]we proposed a homomorphic encryption
- [00:34:15.750]and all the methods are implemented in our software.
- [00:34:20.190]This is open source and free, called JWAS,
- [00:34:23.400]Julia for Whole-genome Analysis Software.
- [00:34:28.786]So this software was originally built by my advisor,
- [00:34:32.520]Hao Cheng, and now I'm one of the
- [00:34:37.830]co-authors who continue to build
- [00:34:41.770]and maintain the software.
- [00:34:44.220]So in JWAS we can do single-trait analysis,
- [00:34:48.390]multi-trait analysis, and also many other types of analysis.
- [00:34:52.680]For example, repeated measures, single-step analysis,
- [00:34:56.100]and we can analyze single categorical traits,
- [00:34:58.860]censored traits or the joint analysis
- [00:35:01.710]of any number of continuous, censored and categorical traits.
- [00:35:05.790]And NN-MM is also implemented in JWAS.
- [00:35:09.660]We have many different features.
- [00:35:16.590]For example, the multi-class Bayesian analysis in JWAS,
- [00:35:20.460]and also, let's say, incorporating
- [00:35:22.380]causal networks in JWAS.
- [00:35:27.669]And yeah, many different features.
- [00:35:28.860]And once we develop some new methods,
- [00:35:31.830]we will implement the new methods in our software.
- [00:35:38.602]So that's all my
- [00:35:42.930]previous research.
- [00:35:45.891]And in the future at UNL,
- [00:35:48.330]I'm still going to build methods
- [00:35:52.380]and collaborate with other quantitative geneticists
- [00:35:55.770]and other faculty here
- [00:36:00.566]to continue my research in quantitative genetics.
- [00:36:05.430]And I want to thank my collaborators,
- [00:36:08.730]my advisor Hao Cheng and other collaborators
- [00:36:11.790]from other universities.
- [00:36:13.470]Yeah, thank you.
- [00:36:35.730]Yeah, can you hear me? Good.
- [00:36:43.999]Okay, well first, very big thank you to all of you,
- [00:36:47.700]especially Barron and Jennifer,
- [00:36:49.920]because Barron actually hired me in the statistics department.
- [00:36:54.270]Jennifer actually gave me encouragement to come here
- [00:36:58.500]in the sense that I am basically...
- [00:37:00.990]So let me go through directly to a slide.
- [00:37:04.230]So today's talk is very informal,
- [00:37:06.600]kind of a smart dimension meet and greet type.
- [00:37:09.240]So I will tell you what I can do and what I cannot do.
- [00:37:12.900]And so my research experience is more or less,
- [00:37:17.220]more than 32 years.
- [00:37:19.320]Interestingly, I did PhD in pure statistics,
- [00:37:21.720]pure mathematical statistics, which is time series analysis.
- [00:37:24.630]Then I got an opportunity
- [00:37:25.920]to do post doctorate in statistical genetics.
- [00:37:28.230]When I arrived in United States University of Pittsburgh,
- [00:37:30.810]I have no idea about genetics, okay?
- [00:37:33.090]If you ask me what are A-T-G-C- I could not immediately
- [00:37:37.170]answer at that time.
- [00:37:38.700]But I did something during my postdoc statistical genetic,
- [00:37:42.870]in statistical genetics.
- [00:37:44.520]However, during my entire life till now,
- [00:37:47.100]I worked in many areas, mathematical statistics,
- [00:37:50.850]machine learning or cluster analysis,
- [00:37:52.890]time series of course.
- [00:37:54.420]And there are many areas in statistical genetics altogether.
- [00:37:57.300]But today I will mainly focus on
- [00:38:00.120]these two, genomic data integration and single cell.
- [00:38:03.167]I will briefly outline what I have done in these two areas.
- [00:38:08.367]And my teaching experience, which is my favorite thing
- [00:38:12.390]more than 29 years.
- [00:38:13.590]And I have experience teaching
- [00:38:15.240]at undergraduate, master's level and PhD programs.
- [00:38:18.090]More or less, I taught
- [00:38:22.157]almost all topics, or not all of course,
- [00:38:24.630]but majority of the topics in statistics,
- [00:38:27.030]mathematical statistics, as well as applied statistics
- [00:38:29.850]and engineer models in Japan.
- [00:38:32.160]Mathematics also, a little bit: measure theory, probability,
- [00:38:35.490]linear algebra, all sorts of things.
- [00:38:38.010]I also taught programming,
- [00:38:39.870]which is very interesting at some point
- [00:38:41.250]of time in C language and R language.
- [00:38:43.980]And also I organized
- [00:38:47.264]and taught in many workshops
- [00:38:50.370]that this is the point I actually put it here
- [00:38:53.490]because during this workshop I felt that
- [00:38:56.760]the participants, they might not have,
- [00:38:58.440]not all participants have very strong background
- [00:39:00.720]in mathematics and statistics.
- [00:39:03.090]So when they think of statistics,
- [00:39:04.950]they sometimes get scared, a little bit at least.
- [00:39:08.700]So based on that experience
- [00:39:10.230]of my teaching only in workshops,
- [00:39:12.030]so we decided to write a book on statistics basically,
- [00:39:16.380]although, oh, where is that picture?
- [00:39:19.980]Actually I have put a picture.
- [00:39:21.570]So "Statistical Methods in Human Genetics,"
- [00:39:24.425]which is a kind of very informally written book,
- [00:39:27.810]like how we teach in the class.
- [00:39:29.670]I put that thing in the book
- [00:39:30.763]and it's a very, very basic level statistics.
- [00:39:32.940]It starts from the very basic statistics level anyway.
- [00:39:39.010]And currently I'm working
- [00:39:42.735]in two major areas.
- [00:39:44.340]One is single cell data analytics,
- [00:39:46.890]another is genomic data integration.
- [00:39:50.310]So I will touch upon each of this very briefly.
- [00:39:55.950]So as we all know that...
- [00:39:58.920]Okay, so when we talk about the gene expression data,
- [00:40:01.920]you probably know much better than me
- [00:40:03.450]because I'm not a biologist
- [00:40:07.557]that in the context of human,
- [00:40:09.000]if we take a drop of blood, it contains millions of cells.
- [00:40:12.180]And from that millions of cells
- [00:40:13.430]we collect the entire amount of mRNA
- [00:40:16.290]and we calculate the expression value.
- [00:40:18.930]Now this mRNA, although we say
- [00:40:22.050]that this is the gene expression value,
- [00:40:24.120]but this is a kind of average gene expression value
- [00:40:26.940]averaging over all cells
- [00:40:29.070]which in turn means that we are
- [00:40:31.380]actually losing the heterogeneity
- [00:40:33.600]that might be present in the cells.
- [00:40:35.550]So technological advancement gives us the opportunity now
- [00:40:40.320]to collect mRNA for thousands of genes
- [00:40:43.320]from each single cell.
- [00:40:45.000]If we have 10,000 cells,
- [00:40:46.650]I can collect, say, 20,000 gene expression values
- [00:40:50.640]from each of these 10,000 cells.
- [00:40:52.830]So it's a kind of 20,000 by 10,000 matrix.
- [00:40:55.590]Now as soon as we get this kind of data,
- [00:40:57.810]this actually poses tremendous analytical challenges,
- [00:41:02.010]especially if we talk about the (indistinct).
- [00:41:05.250]If we just look at the nature of the data,
- [00:41:06.870]this kind of data was not there in the statistics world,
- [00:41:11.280]even 20 or 10 years back,
- [00:41:13.230]because it is tremendously sparse data.
- [00:41:15.600]Most of the cells are zeros.
- [00:41:17.010]But that is very important.
- [00:41:18.510]We have some values.
- [00:41:20.520]Another thing is that,
- [00:41:21.960]so the data structure looks like this.
- [00:41:24.270]So we have, say, n cells, maybe a thousand cells
- [00:41:28.380]and maybe 10,000 genes.
- [00:41:29.940]And these are the expression values in the (indistinct).
- [00:41:33.570]But there is another issue.
- [00:41:35.460]So when we don't have any idea,
- [00:41:37.590]and probably in India, no one is working
- [00:41:40.410]in the single cell data analytics area actually.
- [00:41:44.196]So when I'm thinking to work in this area,
- [00:41:46.410]then we realize that there is a major issue
- [00:41:48.390]that when they collect this 10,000 cells,
- [00:41:51.877]they are not exactly of the same age
- [00:41:54.270]or in the same stage of their life.
- [00:41:56.580]One cell might be just born,
- [00:41:58.470]another cell is probably going to die,
- [00:42:00.390]but we collect the data.
- [00:42:01.620]So they are at different stages of life.
- [00:42:04.410]That means in a timescale
- [00:42:05.850]they're not exactly the same timescale of their life.
- [00:42:08.850]So that gives us another problem
- [00:42:11.760]that a newborn cell might give us
- [00:42:15.300]a certain amount of expression,
- [00:42:17.400]but an old cell might not give the same amount
- [00:42:20.310]of gene expression, even if they're fine.
- [00:42:23.400]So that gives us an idea and there is a literature
- [00:42:26.310]that can we place the cells according to the time scale,
- [00:42:30.750]like young to old or old to young, whatever
- [00:42:34.110]somehow arrange with respect to time.
- [00:42:36.840]But we call it pseudotime
- [00:42:37.980]because we don't know the exact time.
- [00:42:40.530]So in that perspective we worked on that
- [00:42:43.830]and that is what is called pseudotime trajectory
- [00:42:47.070]inference or estimation from single cell RNA-seq data
- [00:42:51.213]only RNA-seq data, this is very important.
- [00:42:54.450]And remember when you work with RNA-seq data,
- [00:42:56.940]all the challenges are there, they're very, very sparse
- [00:42:59.220]and they are of course not normally distributed.
- [00:43:01.950]There are lots of other issues there.
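The idea of arranging cells along a pseudotime axis can be illustrated with a deliberately simple stand-in. The sketch below is not the speaker's method: it assumes simulated, noise-corrupted expression values driven by a hidden time variable (Gaussian rather than sparse counts) and orders cells by their score on the first principal component.

```python
import numpy as np

rng = np.random.default_rng(3)
n_cells, n_genes = 200, 20
true_time = rng.uniform(0.0, 1.0, size=n_cells)      # hidden stage of each cell
slopes = rng.uniform(-2.0, 2.0, size=n_genes)        # how each gene changes over time
expr = np.outer(true_time, slopes) + rng.normal(scale=0.3, size=(n_cells, n_genes))

# First principal component of the centered cell-by-gene matrix.
centered = expr - expr.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
pseudotime = U[:, 0] * S[0]                           # one score per cell
order = np.argsort(pseudotime)                        # cells arranged along the axis

# The direction of a principal component is arbitrary, so compare in absolute value.
agreement = abs(np.corrcoef(pseudotime, true_time)[0, 1])
print(round(agreement, 3))
```

The sign ambiguity mirrors the remark above that the ordering may run young to old or old to young. Real single-cell counts are sparse and non-normal, which is why purpose-built methods are needed.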
- [00:43:06.739]So that paper something actually there is a
- [00:43:10.320]problem in transferring this PPT,
- [00:43:12.360]so it gets anyway truncated.
- [00:43:16.230]So yeah, this is my student, a PhD student.
- [00:43:19.320]He is now doing a postdoc at Michigan State University.
- [00:43:21.960]So this is published in legal 2021 I believe.
- [00:43:26.490]And since then, and actually this is part
- [00:43:29.220]of his thesis also,
- [00:43:30.510]we also worked on the statistical modeling
- [00:43:33.270]of scRNA-seq data, single-cell RNA-seq data.
- [00:43:35.430]That means, actually we know if you give data
- [00:43:38.564]to a statistician, he or she first wants
- [00:43:41.550]to fit a kind of distribution,
- [00:43:43.710]wants to check whether it is, say, in a simplified manner,
- [00:43:47.280]whether it follows the normal distribution or not,
- [00:43:49.819]or whether it follows the Poisson distribution
- [00:43:51.300]if it is count data or something like that.
- [00:43:53.100]But unfortunately the scRNA-seq data
- [00:43:54.677]is not that easy to model
- [00:43:56.430]because of the sparsity present in the data.
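Why a plain Poisson fit struggles with this kind of sparsity can be seen on simulated counts. The sketch below is an assumption-laden toy, not real scRNA-seq data and not the model from the thesis: counts are set to zero with a hypothetical dropout probability of 0.8 and otherwise drawn from a Poisson distribution with mean 5.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
dropout = rng.random(n) < 0.8                         # hypothetical excess-zero mechanism
counts = np.where(dropout, 0, rng.poisson(5.0, size=n))

# Fraction of zeros actually observed in the data.
observed_zero_fraction = np.mean(counts == 0)
# Fraction of zeros a Poisson with the same mean would predict: exp(-mean).
poisson_zero_fraction = np.exp(-counts.mean())

print(round(float(observed_zero_fraction), 3), round(float(poisson_zero_fraction), 3))
```

A Poisson distribution matched to the sample mean predicts far fewer zeros than are observed, so a model for such data has to account for the excess zeros explicitly.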
- [00:43:58.260]So we developed a model for statistical modeling
- [00:44:01.260]of such kind of data in his thesis also.
- [00:44:04.830]And then of course if we can fit a distribution,
- [00:44:08.940]then lots of inferential problems
- [00:44:12.012]will be generated.
- [00:44:13.650]Like we can compare these two models between two groups.
- [00:44:18.420]Two groups means here you can think of patients
- [00:44:21.390]as well as controls or we can think of
- [00:44:24.930]between two different types of cell or cell types.
- [00:44:27.930]So that's a very major issue in the context
- [00:44:31.500]of single scRNA-seq.
- [00:44:32.910]So we have also developed some methods,
- [00:44:35.901]especially with respect to the two sample problem
- [00:44:38.782]that's comparing two different groups.
- [00:44:42.630]There are few other, many other problems
- [00:44:46.290]or research projects with which I'm working at this moment,
- [00:44:48.990]like which of the genes.
- [00:44:50.760]So when we get the pseudotime,
- [00:44:51.990]which of the genes are temporally...
- [00:44:54.720]there are changes in the temporal expression,
- [00:44:56.850]so temporally expressed.
- [00:44:58.500]And there is also another area
- [00:45:00.840]we are working on, the spatial expression,
- [00:45:03.930]because nowadays if we collect cells from the brain,
- [00:45:07.980]we have the coordinates of the cell
- [00:45:09.870]in two dimensions and three dimensions.
- [00:45:11.730]So we can have the idea
- [00:45:13.020]of how they're spatially spread or expressed
- [00:45:16.140]how differentially expressed they are.
- [00:45:17.670]So there are lots of other problems we are working.
- [00:45:21.690]This is more or less one area where I am working,
- [00:45:24.420]I'm not going into too much technical detail at this moment.
- [00:45:27.690]And next comes another, my favorite area,
- [00:45:31.410]which is started long back before single cell analytics
- [00:45:34.230]and is the genomic data integration.
- [00:45:37.620]Now genomic data integration is very simple to understand,
- [00:45:41.760]but very difficult to work with as I understand that.
- [00:45:45.780]So what is genomic data integration?
- [00:45:48.180]Again, there is an issue.
- [00:45:49.890]But I need this picture.
- [00:45:54.690]I don't know, as I told you to put the PDFs.
- [00:46:00.111]Yeah, yeah. This picture is very difficult to...
- [00:46:03.330]Oh yeah here it comes, it comes, it's come.
- [00:46:04.929]Okay. Okay. Okay. Thank God.
- [00:46:07.570]Okay, anyway, good.
- [00:46:11.537]So your presence is a magic,
- [00:46:12.780]so you touched it and they say Midas touch.
- [00:46:15.090]So here so many times we see that when we collect,
- [00:46:21.881]so you know, in genetics or genomics or whatever
- [00:46:26.100]we encounter, as Brian mentioned
- [00:46:28.710]and Tianjing mentioned that
- [00:46:30.210]we encounter different types of data,
- [00:46:31.920]like one is genotype data,
- [00:46:33.540]another is gene expression data.
- [00:46:34.980]There is methylation data,
- [00:46:36.420]and nowadays in the context of human, I don't know
- [00:46:39.180]about animal genetics,
- [00:46:40.953]there are Hi-C data, ChIP-seq data,
- [00:46:43.380]lots and lots of data, there are microRNA data,
- [00:46:45.840]lots of different types of data there.
- [00:46:48.060]But the problem is when you work,
- [00:46:50.130]we usually focus on only one data
- [00:46:51.533]and try to find the significant or associated biomarker.
- [00:46:55.890]So what we are thinking that
- [00:46:59.045]suppose we collect a few individuals, maybe cases,
- [00:47:03.000]a group of case individuals,
- [00:47:04.110]maybe a group of control individuals.
- [00:47:06.900]We usually, we know that which fellow is a patient
- [00:47:10.590]and which is not.
- [00:47:11.490]That means a phenotype is known for all of them,
- [00:47:13.950]but and also hopefully the genotypes
- [00:47:16.410]are also known for all of them.
- [00:47:17.790]But due to cost or other limitations or resources
- [00:47:20.451]or whatever, there might be other reasons,
- [00:47:22.831]we may not get the gene expression value
- [00:47:25.440]for all these units.
- [00:47:27.750]That means there is a concept of missing data
- [00:47:30.660]in this scenario.
- [00:47:31.800]So we try to frame this entire scenario
- [00:47:35.520]under a single statistical framework.
- [00:47:38.430]And this is a simple version of the model.
- [00:47:41.425]Don't look at all these things,
- [00:47:43.290]but except that here I = 1 to 200,
- [00:47:46.260]and here I = 1 to 1200.
- [00:47:47.910]That means gene expression values are available
- [00:47:50.487]for 200 individuals, whereas genotype and phenotype values
- [00:47:53.253]are available for all 1200 individuals,
- [00:47:55.710]just as an example.
- [00:47:58.262]So there are lots of other mathematics inside it.
- [00:48:01.530]I don't want to go into the detail,
- [00:48:03.390]but I write this thing only to give you confidence
- [00:48:06.000]that I know a little bit of genetics and statistics.
- [00:48:08.010]That's the only purpose of writing this.
- [00:48:10.380]There is not an issue, just joking.
- [00:48:12.570]And one of the features that it is capturing.
- [00:48:17.280]So here we are considering the phenotype data,
- [00:48:20.190]which is the full data, entire observation,
- [00:48:22.230]all observations for all individuals are available
- [00:48:24.960]phenotype data we consider as a qualitative data
- [00:48:26.970]cases and controls.
- [00:48:28.500]Genotype data, full data, gene expression
- [00:48:30.630]that is partially missing.
- [00:48:31.860]And when I say that missing data,
- [00:48:34.090]there are different types of missing data
- [00:48:36.300]depending on the mechanism of the missingness.
- [00:48:39.727]One is missing completely at random,
- [00:48:41.970]another is missing at random, and another is missing, not at random.
- [00:48:45.840]So this model is developed for partially missing data,
- [00:48:51.030]but there we are assuming, because of some reasons,
- [00:48:54.780]missing, not at random.
- [00:48:56.340]This happens in practice.
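The three missingness mechanisms can be made concrete with a small simulation. The sketch below rests on assumptions: the genotypes, expression values, effect size and observation probabilities are all made up for illustration and are not taken from the model in the talk.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000
genotype = rng.binomial(2, 0.5, size=n)               # fully observed, coded 0/1/2
expression = 0.5 * genotype + rng.normal(size=n)      # partially observed in practice

# MCAR: every individual has the same chance of being measured.
obs_mcar = rng.random(n) < 0.5
# MAR: the chance depends only on the observed genotype.
obs_mar = rng.random(n) < np.array([0.2, 0.5, 0.8])[genotype]
# MNAR: the chance depends on the (possibly unobserved) expression value itself.
obs_mnar = rng.random(n) < np.where(expression > 0.5, 0.8, 0.2)

true_mean = expression.mean()
bias_mcar = expression[obs_mcar].mean() - true_mean
bias_mnar = expression[obs_mnar].mean() - true_mean

# Under MAR, conditioning on the observed genotype removes the distortion.
het = genotype == 1
bias_mar_within_het = expression[het & obs_mar].mean() - expression[het].mean()

print(round(bias_mcar, 3), round(bias_mar_within_het, 3), round(bias_mnar, 3))
```

The observed values are representative under MCAR, representative within genotype classes under MAR, and systematically shifted under MNAR, which is why each mechanism calls for its own derivation.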
- [00:48:58.800]And of course, so now it boils down to
- [00:49:02.220]if we want to find out which gene or which SNP
- [00:49:05.370]is associated with the disease,
- [00:49:07.625]that means it boils down to an inferential problem.
- [00:49:10.110]It's a testing of hypothesis problem.
- [00:49:12.060]And when I come to the testing of hypothesis problem,
- [00:49:16.387]I prefer to develop the asymptotic distribution
- [00:49:18.930]of the test statistic because that will give us the p value
- [00:49:21.260]or the validation measure very fast.
- [00:49:23.700]That's the advantage, instead of time consuming
- [00:49:25.966](indistinct) or bootstrap techniques.
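The speed argument for an asymptotic distribution can be illustrated with a generic two-sample comparison. The sketch below is a stand-in, not the test statistic from the talk: it assumes simulated case and control values, a large-sample normal approximation for the difference in means, and 2000 permutations for the resampling alternative.

```python
import math
import numpy as np

rng = np.random.default_rng(6)
cases = rng.normal(loc=0.3, size=100)                 # simulated values for cases
controls = rng.normal(loc=0.0, size=100)              # simulated values for controls

def mean_diff(a, b):
    return a.mean() - b.mean()

observed = mean_diff(cases, controls)

# Asymptotic route: one closed-form evaluation of the normal tail.
se = math.sqrt(cases.var(ddof=1) / cases.size + controls.var(ddof=1) / controls.size)
z = observed / se
p_asymptotic = math.erfc(abs(z) / math.sqrt(2.0))     # two-sided p-value

# Permutation route: recompute the statistic for every reshuffle of the labels.
pooled = np.concatenate([cases, controls])
n_perm = 2000
hits = 0
for _ in range(n_perm):
    shuffled = rng.permutation(pooled)
    if abs(mean_diff(shuffled[:100], shuffled[100:])) >= abs(observed):
        hits += 1
p_permutation = (hits + 1) / (n_perm + 1)

print(round(p_asymptotic, 4), round(p_permutation, 4))
```

The two p-values should be close, but the asymptotic one costs a single formula while the permutation one costs thousands of recomputations per test, which adds up quickly across many genes or SNPs.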
- [00:49:28.020]So we developed everything and simulation
- [00:49:31.890]and real data analysis we have done.
- [00:49:33.957]And this was also published long back in genomics.
- [00:49:38.422]She actually also did her PhD with this thing.
- [00:49:42.930]And now she is a research scientist
- [00:49:45.510]at the University of Cincinnati, I believe.
- [00:49:49.650]Exactly. Now if we look at this area,
- [00:49:52.740]now, there are lots of extensions
- [00:49:54.540]we can think of immediately.
- [00:49:56.670]Okay, so just one small extension
- [00:49:59.760]is that here instead of allowing gene expression,
- [00:50:04.260]we also consider the methylation data.
- [00:50:06.570]So now here we are actually
- [00:50:09.270]combining three types of genetic data,
- [00:50:11.400]genotype data, gene expression data and methylation data.
- [00:50:15.000]So one step ahead.
- [00:50:16.860]Again here phenotype and genotype are available
- [00:50:19.980]for all the units.
- [00:50:21.690]Gene expression available for few of them,
- [00:50:24.240]methylation available for few of them.
- [00:50:26.211]And for these fellows, both methylation
- [00:50:28.647]and gene expressions are available.
- [00:50:30.270]But for these fellows only methylation available,
- [00:50:33.150]but not gene expression.
- [00:50:34.440]Genotype, phenotype are available for all of them.
- [00:50:36.570]So with this model, here also we are considering
- [00:50:41.400]qualitative, that means cases and control,
- [00:50:44.070]genotype, phenotype, full data,
- [00:50:45.540]gene expression is partially missing.
- [00:50:47.130]And here we are assuming missing at random.
- [00:50:49.560]So previously we have assumed missing, not at random,
- [00:50:53.010]now we are developing the model for missing at random.
- [00:50:57.000]And of course you can understand if we just drop this thing
- [00:50:59.670]methylation, it's same as the previous one,
- [00:51:02.550]but with missing at random.
- [00:51:04.920]Now I'm emphasizing these two words, MAR and MNAR
- [00:51:08.610]because if we change the mechanism
- [00:51:10.350]of missingness, the entire analysis,
- [00:51:13.230]entire mathematical derivation, entire procedure will change.
- [00:51:17.580]It is not that we can just plug in this to this.
- [00:51:20.940]So that's very important to me,
- [00:51:23.040]which I'll explain in a minute.
- [00:51:25.080]And as usual, we develop it's a testing problem,
- [00:51:28.650]asymptotic distribution, and we analyze it
- [00:51:30.630]using simulated data as well as real data.
- [00:51:33.810]So, and we have also worked on,
- [00:51:38.580]you can naturally ask that what happened
- [00:51:40.500]if the phenotype is quantitative,
- [00:51:42.360]quantitative is the kind of continuous values.
- [00:51:45.540]Yes, that work is almost done, but we are yet to submit it.
- [00:51:50.430]So that part is also done.
- [00:51:53.460]The objective of these two slides is basically
- [00:51:59.211]the reason is that, so we have developed
- [00:52:04.260]a few statistical methods, a single model kind of thing,
- [00:52:09.000]which can integrate different types of data.
- [00:52:12.780]Now remember different types means phenotype data
- [00:52:14.910]is a categorical data.
- [00:52:16.590]Gene expression data, continuous data,
- [00:52:18.780]methylation data is continuous but not normal,
- [00:52:21.600]it's not normally distributed.
- [00:52:23.040]So there are lots of issues out there.
- [00:52:25.817]So our objective is to take care of
- [00:52:29.850]each of these factors
- [00:52:31.050]and then come up with an appropriate model.
- [00:52:34.080]Now here, if we go detail into this,
- [00:52:38.550]we'll see that these models, I'm not proposing any models,
- [00:52:41.880]this model evolves automatically from the problem itself,
- [00:52:45.540]from the data that has been generated.
- [00:52:48.090]From there, the model actually has been evolved.
- [00:52:52.995]So that is actually my philosophy of research.
- [00:52:58.696]And so I really feel excited to be here.
- [00:53:02.700]And when I heard that there is a big,
- [00:53:05.250]there are two departments which are animal genetics
- [00:53:07.620]and plant genetics, and statistics is in that school,
- [00:53:10.590]I really feel excited because I think,
- [00:53:12.540]and now I know that there are a lot of data
- [00:53:15.060]probably available with you.
- [00:53:16.680]We don't have data, but, so my major objective
- [00:53:20.790]and major dream is that always
- [00:53:23.313]during my entire research career,
- [00:53:25.682]I try to develop new powerful, new methods.
- [00:53:30.420]Not necessarily new but powerful methods and robust method.
- [00:53:33.960]Robust means if there is some changes in somewhere
- [00:53:38.340]it might work, I'm not sure it might work, it might not.
- [00:53:43.050]But to analyze genetic data with respect
- [00:53:46.080]to this is very important, specific scientific questions.
- [00:53:49.590]So I never encourage anyone that, well,
- [00:53:53.080]you have the data, I have this method, apply it.
- [00:53:55.500]No, we try to look at your data,
- [00:53:58.620]try to look at the research question
- [00:54:01.170]because of which you have generated this type of data,
- [00:54:04.590]and then we can try to analyze the data
- [00:54:07.220]in the best possible way.
- [00:54:09.210]And for that, if we have to develop new methods,
- [00:54:12.360]we have to have no other option.
- [00:54:14.190]But I believe that developing the statistical methods
- [00:54:18.330]is essential in order to address any scientific question,
- [00:54:22.860]any research question
- [00:54:24.750]whose conclusion would be based on the generated data,
- [00:54:27.600]that's what I wanted to.
- [00:54:31.380]Yes, that is more or less that I wanted to share with you.
- [00:54:37.015]So in a nutshell,
- [00:54:41.010]right now, I'm focusing on major two areas,
- [00:54:44.340]but again, I have worked on other areas also
- [00:54:48.270]other not genetics area I should say,
- [00:54:52.478]or a little bit of machine learning is there also,
- [00:54:55.260]I did not put it there.
- [00:54:57.495]So I believe I'm done. What is the time?
- [00:55:03.240]Yeah, it's okay.
- [00:55:05.760]Okay. So if you have, so...
- [00:55:07.260]I give it a very informal manner
- [00:55:09.990]because it's not very technical of course,
- [00:55:12.090]but there is some glimpses which you can think
- [00:55:14.280]that there are some technical things,
- [00:55:16.110]especially if to speak to the development
- [00:55:17.903]of powerful statistical methods.
- [00:55:20.310]Okay that is the thing I would like to emphasize.
- [00:55:24.120]If you have any question, anything,
- [00:55:25.860]and I really feel excited, I will feel excited
- [00:55:29.280]if there are some collaborations
- [00:55:31.230]with you or with any of you.
- [00:55:33.450]Yeah, that's it. Yeah.
- [00:55:36.656]So this one.
- [00:55:41.910]So I just wanted to put this up here just to bring us back
- [00:55:46.470]to answer the question of like the breadth
- [00:55:49.620]of work that is going on in quantitative genetics.
- [00:55:51.960]And I think each of us today
- [00:55:55.470]has kind of touched on different aspects
- [00:55:57.572]of this genotype to phenotype type of map.
- [00:56:01.740]Going from sequence to expression to the network of genes
- [00:56:05.670]that are temporal and stress related to proteins,
- [00:56:08.400]to animals and looking across the genomes of these
- [00:56:11.010]and learning from across the genomes
- [00:56:12.870]all down to what statistical genetics,
- [00:56:15.630]what quantitative genetics is good at
- [00:56:17.730]is putting that all into equation
- [00:56:19.590]and giving you out a prediction.
- [00:56:21.120]But this, you can see that there's just such a breadth
- [00:56:27.374]that the amount of areas that you can be working on here
- [00:56:30.570]in this is massive.
- [00:56:34.201]And so hopefully you guys got that picture
- [00:56:36.900]and that we'll be, hopefully the three of us
- [00:56:39.900]will be figuring out how to bridge across
- [00:56:42.960]what we're each individually doing
- [00:56:44.880]to kind of combine them together as well.
- [00:56:51.330]Yeah, thank you.
- [00:56:52.163]Three of you, I mean we do have some
- [00:56:54.780]short time for questions, so if you wanna ask a question.
- [00:56:58.860]Yeah.
- [00:57:02.640]Brian, in your presentation early on
- [00:57:04.740]you talked about causal
- [00:57:07.560]or trying to consider causal implications.
- [00:57:12.420]How much of that is weather?
- [00:57:15.240]Maybe you know or could guess.
- [00:57:19.110]And what I think about is in the case of maize,
- [00:57:24.030]we have suckers show up in the spring of the year,
- [00:57:27.030]some years, some years not. Why?
- [00:57:31.560]I mean, I can't speak specifically
- [00:57:33.630]to your suckers question,
- [00:57:34.620]but I mean for sure the environment
- [00:57:37.332]is having a huge role and just from a numerical perspective,
- [00:57:41.721]we can model and think about new environments
- [00:57:44.220]just like a new trait.
- [00:57:46.652]So if we're measuring in those environments
- [00:57:50.070]and we're measuring the response
- [00:57:52.710]of different genes to a biotic stress or abiotic stress,
- [00:57:57.346]that can be modeled
- [00:57:58.800]and hopefully can be modeled at an actual
- [00:58:02.019]biological inference as well.
- [00:58:10.920]Any other question? From the audience? Yeah?
- [00:58:18.030]I have a question for the young lady,
- [00:58:20.250]I don't remember her name.
- [00:58:21.900]You developed the different methods
- [00:58:25.710]using some data from animal and human.
- [00:58:30.540]My question is, are those methods
- [00:58:33.330]also applicable for plants?
- [00:58:35.280]Because I know that usually the people working with animals,
- [00:58:38.880]they always, whenever they develop any method, they
- [00:58:42.570]require some data from sex chromosomes,
- [00:58:47.013]something that we don't have in the plants.
- [00:58:53.169]So first, actually I do not have much
- [00:58:55.860]real data analysis experience, even in animal,
- [00:59:00.000]for animal data.
- [00:59:05.473]I believe those methods can also apply in plants
- [00:59:10.530]because let's say those are, whether in animal, plants,
- [00:59:14.850]in human, they are complex traits,
- [00:59:17.790]they are controlled by many genes
- [00:59:19.440]and we have genotype phenotype data
- [00:59:21.210]and maybe some other omics or functional data.
- [00:59:25.438]I believe they can also be
- [00:59:29.850]applied to plant data, for example, the maize data.
- [00:59:38.606]And as far as I know, our software is also used
- [00:59:45.780]by some plant breeders to analyze their data.
- [00:59:52.867]So I believe in terms of methodology,
- [01:00:01.267]there are some differences,
- [01:00:04.068]but the difference is not that huge
- [01:00:06.210]for G2P analysis. Yeah.
- [01:00:10.470]Thank you.
- [01:00:15.540]Any other question?
- [01:00:19.080]Well, I'm wondering since, this is a new blog
- [01:00:24.390]and also we have some existing strengths across departments
- [01:00:29.220]and thinking how in the future, how can we work together
- [01:00:33.180]as a group to regress our quantitative genetics strengths.
- [01:00:39.540]So yours is like a workshop or something so that, yeah,
- [01:00:44.730]I'd like to invite you to think a little bit about that.
- [01:00:50.610]Probably you need time to establish,
- [01:00:53.430]but anyway, so this is something that maybe
- [01:00:57.639]some existing or established professor to think about.
- [01:01:04.769]Do you guys have any ideas?
- [01:01:09.640]Yeah, I mean I'm with you Shining.
- [01:01:14.670]I wonder like if is the need
- [01:01:18.750]for like other labs to understand more in depth
- [01:01:23.160]about like what collaborations we can offer
- [01:01:26.730]or is it more of the established,
- [01:01:29.370]some of the established quantitative geneticists that are here?
- [01:01:34.166]Like how we can tap into what you're already doing.
- [01:01:38.190]Or is it about training the students?
- [01:01:41.580]Is there not the right class offering?
- [01:01:44.070]And the workshop would play that role.
- [01:01:47.610]So just answered your question with more questions,
- [01:01:51.294]but I think there is a lot of opportunity
- [01:01:54.705]and some of it will probably naturally happen,
- [01:01:57.390]but others might have to be more deliberate
- [01:02:00.060]and aimed at a goal.
- [01:02:06.960]Have the last one, Martha?
- [01:02:10.073]Just a follow up to this question.
- [01:02:12.600]Each of the academic units really advocated
- [01:02:15.300]for each of the positions.
- [01:02:19.555]So in your maybe conversation,
- [01:02:21.150]what are your thoughts
- [01:02:22.140]in terms of you collaborating together,
- [01:02:25.350]both in the research mission
- [01:02:27.120]but also in the curriculum continuum?
- [01:02:30.780]So we are not redundant and replicating,
- [01:02:34.410]but actually the power of the three of you
- [01:02:37.260]would enhance both research and teaching.
- [01:02:41.910]So you have some thoughts?
- [01:02:51.073]Yes. So, okay.
- [01:02:55.170]So what I believe is that this,
- [01:02:57.360]yeah workshop, should we, okay, fine,
- [01:02:59.970]but in my experience, what I've seen that
- [01:03:05.640]after the...
- [01:03:06.473]Please don't take it otherwise
- [01:03:07.530]because I'm not from this country.
- [01:03:09.180]So based on my experience there,
- [01:03:11.160]so very rarely after workshop that collaboration continues,
- [01:03:15.180]but if it continues, it will survive.
- [01:03:18.510]So maybe out of 20 parties 20, no.
- [01:03:22.050]Maybe out of three workshops,
- [01:03:23.490]if we have say a hundred participants,
- [01:03:25.230]maybe two collaborations come out.
- [01:03:27.750]I think that's a good number.
- [01:03:30.000]So collaborations are possible of course
- [01:03:32.370]from workshops, that's definitely,
- [01:03:37.439]but again, if we organize workshops,
- [01:03:39.570]I want to be very specific about two things.
- [01:03:42.990]One is who is going to participate in the workshops
- [01:03:46.036]and what is the focus of the workshop?
- [01:03:47.880]These two should be very, very specific
- [01:03:49.950]and they're very crucial
- [01:03:52.200]because if some out of the 20 participants,
- [01:03:54.960]say two, come from a different background,
- [01:03:57.510]then, there is some problems, there are some issues.
- [01:04:01.020]So yeah, workshop is definitely possible,
- [01:04:03.420]but with a focused theme I believe,
- [01:04:07.200]and regarding collaborations,
- [01:04:09.330]I already told that I think all of us actually mentioned
- [01:04:12.617]that we really, we are feeling excited
- [01:04:15.720]to launch any collaboration with any of you.
- [01:04:19.500]Most welcome.
- [01:04:21.420]But personally I prefer collaborations
- [01:04:23.880]on a kind of long, like,
- [01:04:28.050]again, I have had many experiences where people called me,
- [01:04:32.100]can you do my data analysis?
- [01:04:34.470]Okay, tell me.
- [01:04:36.360]Yeah, so how to do this multiple regression to this data.
- [01:04:39.420]I don't know, what are the variables?
- [01:04:40.770]How many observations are there, I don't know.
- [01:04:42.940]I was just asking, I said, "Okay, send the data."
- [01:04:45.937]"Okay, I'm sending the data,
- [01:04:46.957]"but you have to give it tomorrow
- [01:04:48.277]"because I'm submitting my manuscript tomorrow."
- [01:04:51.750]And I had to refuse; I had no other option.
- [01:04:55.080]So I think collaboration should be real collaboration,
- [01:04:58.200]right from the problem specification.
- [01:05:01.620]The research question designing is,
- [01:05:04.110]I think I'm from human genetics,
- [01:05:06.911]but my idea based on my statistics knowledge
- [01:05:10.110]is that designing is extremely important
- [01:05:12.300]in plant and animal genetics, especially.
- [01:05:15.780]Not that much in human, there are also some design issues,
- [01:05:18.690]but plant and animal genetic design is very important.
- [01:05:21.780]So from that, starting from that point,
- [01:05:25.200]I think we should start our collaboration
- [01:05:27.900]and right from the data generation,
- [01:05:29.940]data cleaning and I believe so that should.
- [01:05:33.107]Brian, you want to say something?
- [01:05:41.580]Yeah, as far as strategies for the three of us
- [01:05:45.960]working together,
- [01:05:48.270]there are some, we each have different
- [01:05:51.870]like there's a breadth of modeling that can be done
- [01:05:53.880]and we each seem to have different modeling expertise
- [01:05:58.140]and to different levels as well too
- [01:06:00.510]and different genetic expertise
- [01:06:02.430]to different levels as well too.
- [01:06:04.140]So I do think there's like if I am not knocking
- [01:06:08.190]on their door when I want to know the right model
- [01:06:12.510]and understand and I know the genetics or I know the system
- [01:06:15.510]and they know the modeling,
- [01:06:18.030]like those conversations should be happening at our level
- [01:06:22.140]to both as an example and just to strengthen it.
- [01:06:26.610]And just to that point, Tianjing has already roped me
- [01:06:30.000]in to give a mixed linear model
- [01:06:32.100]at the Tri-Societies already; she's the chair
- [01:06:34.530]and now I'm suddenly the co-chair
- [01:06:35.880]and she'll drop off as soon as I jump on.
- [01:06:39.307]But we're already having those conversations, so.
- [01:06:48.361]Okay. Well thank you.
- [01:06:49.890]We'll take one more. Yeah.
- [01:06:55.170]This is more to spark conversation I hope
- [01:06:58.350]than anything else.
- [01:07:01.020]I'm gonna make a statement
- [01:07:02.220]that will dramatically oversimplify
- [01:07:05.070]all what the three of you have been saying
- [01:07:08.130]as a point that you might want to discuss at some time.
- [01:07:12.390]I'm staring at that model in the bottom of the screen
- [01:07:16.620]and here's the oversimplification.
- [01:07:19.350]It seems that in one way or another,
- [01:07:21.630]all three of you are messing with that first term, X beta.
- [01:07:26.520]Tianjing is replacing X beta
- [01:07:29.580]with G of X beta where G is a neural net.
- [01:07:37.170]The two of you are replacing X beta
- [01:07:42.469]with x one, x two, x three, beta one, beta two, beta three
- [01:07:47.610]because you want to consider three different types of data.
- [01:07:51.210]And so my natural question for you,
- [01:07:53.700]and allow for missing this.
- [01:07:56.310]So it seems to me each of you are thinking about
- [01:08:00.840]that first term and what it means and how to do inferences
- [01:08:04.980]and use it for prediction or whatever else.
- [01:08:09.480]How fair is that as a summary
- [01:08:16.308]of the key focus
- [01:08:18.960]you have in common?
- [01:08:21.090]Yeah, I think-
- [01:08:23.730]As a way to spark discussion.
- [01:08:25.277]So you can agree, disagree, doesn't matter.
- [01:08:31.470]It's eloquently put
- [01:08:32.460]and I do think it probably points to maybe,
- [01:08:37.230]a restriction we're putting on ourselves too
- [01:08:39.373]is only thinking in terms
- [01:08:41.089]of just how to tweak with this single model.
- [01:08:43.650]I mean it's what most quantitative geneticists get trained on.
- [01:08:46.530]It's what we know how to tweak with.
- [01:08:48.180]And maybe it's a strength, maybe it's a weakness,
- [01:08:50.130]but for sure that's an eloquent way to put it
- [01:08:52.890]and I think it's got me thinking.
- [01:08:55.904](indistinct)
- [01:09:06.600]Yeah, he's not expecting an answer right now, so.
- [01:09:12.233]Well, sorry that we ran a little bit longer today,
- [01:09:15.390]but thank you again for attending the seminar.
- [01:09:17.730]Thank you to our speakers, another round of applause.
- [01:09:22.928]See you in the fall of 2024.
- [01:09:24.750]Feel free to fill out the survey so we get some feedback.
- [01:09:27.900]We'll send it online also, so thank you. Have a good day.
- [01:09:31.508](audience applauds)