Integrating Testing Optimization Frameworks into Soybean Breeding at UNL

Arthur Bernardeli, Soybean Breeding Research, PhD Candidate in Plant Breeding and Genetics, Dept. of Agronomy and Horticulture, University of Nebraska-Lincoln (Author)

09/08/2025 Added

Description

This presentation highlights innovations to enhance soybean breeding efficiency at UNL. Topics include optimizing multi-environment trials by implementing genomic sparse testing designs and defining target populations of environments (TPEs) and testing hubs, advancing drone-based phenotyping for maturity, and developing software to streamline analytical pipelines. Together, these approaches strengthen decision-making, improve resource allocation, and accelerate genetic gain.

Searchable Transcript

[00:00:00.720]The following presentation is
[00:00:02.090]part of the Agronomy and Horticulture
[00:00:04.650]Seminar Series
[00:00:05.760]at the University of Nebraska-Lincoln.
[00:00:08.940]Welcome to the first seminar of
[00:00:11.250]the fall 2025
[00:00:13.040]seminar series for Agronomy and
[00:00:15.040]Horticulture. Thank you guys
[00:00:17.080]for joining us. This semester
[00:00:19.600]the committee has put together
[00:00:21.330]an excellent lineup of speakers.
[00:00:23.280]We really tried to focus on
[00:00:25.840]speakers that we thought would
[00:00:27.350]fit the theme of science as a
[00:00:28.670]public value, understanding
[00:00:30.190]that, you know, we are at a
[00:00:32.000]public university with a prerogative
[00:00:34.490]to serve the public and trying
[00:00:36.450]to pick speakers that we
[00:00:38.050]thought embodied this and
[00:00:39.760]hopefully will reflect on this
[00:00:41.780]theme and attitude through the
[00:00:43.810]semester.
[00:00:44.960]So the first one is a graduate
[00:00:47.280]student from George Graef's
[00:00:49.600]breeding program, Arthur Bernardeli.
[00:00:52.920]And Arthur was previously from
[00:00:56.080]Brazil.
[00:00:57.220]He's from Brazil, not
[00:00:58.180]previously from Brazil.
[00:00:59.400]He's from Brazil, where he did
[00:01:02.180]his bachelor's and master's at
[00:01:04.960]the Federal University of
[00:01:07.530]Viçosa.
[00:01:08.620]And he joined here in 2022,
[00:01:11.910]first as a visiting scholar and
[00:01:14.660]then stayed on to do a PhD with
[00:01:17.100]Dr. Graef's program, where he's
[00:01:20.350]been spending time really
[00:01:22.680]optimizing the program for
[00:01:25.120]getting it set up for data
[00:01:27.360]optimization.
[00:01:29.140]And so today he's going to
[00:01:31.030]present on that work. And
[00:01:33.180]please help me in welcoming
[00:01:35.200]Arthur.
[00:01:36.120]Thanks.
[00:01:38.620]Thanks Dr. Rice for introducing
[00:01:41.160]me and thanks for inviting me
[00:01:43.260]to present part of my PhD
[00:01:45.000]research and part of what I do
[00:01:47.080]in our program here in the
[00:01:48.940]first seminar of the semester.
[00:01:51.620]And thanks to the audience here,
[00:01:53.600]I see a pretty diverse group in
[00:01:55.380]terms of scientific background.
[00:01:57.620]So I really hope that you find
[00:01:59.360]this seminar interesting and
[00:02:01.190]that you have a lot of
[00:02:02.420]questions to ask at the end of
[00:02:04.160]the presentation.
[00:02:05.620]So, I'm Arthur Bernardeli, PhD
[00:02:07.650]candidate and research analyst
[00:02:09.660]in the soybean breeding program
[00:02:11.670]led by Dr. George Graef.
[00:02:13.500]And today I'm going to present
[00:02:15.300]a seminar entitled Integrating
[00:02:17.260]Testing Optimization Frameworks
[00:02:19.360]into Soybean Breeding at UNL.
[00:02:21.460]So, I'm going to briefly cover
[00:02:23.580]the outline of our soybean
[00:02:25.400]breeding program and then go
[00:02:27.420]through the specifics of the
[00:02:29.360]enviromics, genomics, and phenomics
[00:02:32.150]in our program.
[00:02:34.020]Enviromics is the use of high
[00:02:35.660]throughput environmental data
[00:02:37.650]in the context of plant
[00:02:39.040]breeding. Genomics also
[00:02:40.730]deploying genome-wide markers
[00:02:42.690]in the context of plant
[00:02:44.100]breeding as well. And phenomics
[00:02:46.220]is the use of high throughput
[00:02:47.960]phenotyping data from drone
[00:02:49.720]photos also to be used in our
[00:02:51.430]soybean breeding program.
[00:02:54.580]And my objective here is that
[00:02:56.310]at the end of the presentation,
[00:02:58.210]you can picture a frame of how
[00:03:00.080]those modern tools can be
[00:03:01.840]integrated into not only a
[00:03:03.660]soybean breeding pipeline, but
[00:03:05.990]any crop species breeding
[00:03:07.850]pipeline.
[00:03:08.860]So talking a little bit about
[00:03:10.660]soybean breeding, but breeding
[00:03:12.780]in general, by definition,
[00:03:14.670]breeding is like an endless
[00:03:16.410]cycle of creating variability
[00:03:18.380]and selecting variability.
[00:03:20.460]And eventually, what is
[00:03:22.460]selected is released as a
[00:03:24.500]cultivar or returned to the
[00:03:26.690]crossing block to begin a new
[00:03:29.440]cycle of breeding, of recombination.
[00:03:32.500]And plant breeding is a pretty
[00:03:35.150]diverse area in terms of
[00:03:37.270]discipline.
[00:03:38.720]It integrates statistics, agronomy,
[00:03:41.150]plant pathology, genetics, and
[00:03:43.190]all sorts of other areas of
[00:03:44.760]science to help us to achieve
[00:03:46.460]our breeding objectives.
[00:03:49.320]In talking about breeding
[00:03:50.860]objectives, there are common
[00:03:52.690]breeding objectives across many
[00:03:54.660]different crop species.
[00:03:56.360]Let's take, for
[00:03:57.830]example, wheat and soybeans.
[00:03:59.480]For both of those crops, from the
[00:04:01.180]breeding perspective, the
[00:04:02.700]objective is to improve yield,
[00:04:04.440]seed composition traits, and
[00:04:06.120]resilience to stresses.
[00:04:07.700]Here at the soybean breeding
[00:04:09.620]program at UNL, our main
[00:04:11.300]objective is to improve and to
[00:04:13.300]develop high-yielding varieties
[00:04:15.620]for farmers and soybean
[00:04:17.220]industry,
[00:04:18.140]and also enhance our germplasm
[00:04:20.400]for compositional quality for
[00:04:22.660]high protein, high oil, and
[00:04:24.560]carbohydrates,
[00:04:25.520]and the balance between those
[00:04:26.950]three, and protect our improved
[00:04:29.120]germplasm and enhanced quality
[00:04:31.260]germplasm
[00:04:32.040]against soybean cyst nematode,
[00:04:34.180]Phytophthora, frogeye leaf
[00:04:36.100]spot, soybean mosaic virus,
[00:04:38.200]gall midge, iron deficiency
[00:04:40.020]chlorosis, and drought stress.
[00:04:44.200]So just a quick outline of our
[00:04:46.240]breeding program.
[00:04:47.820]The first half of our breeding
[00:04:49.880]program is what we call the
[00:04:51.690]nursery stages,
[00:04:52.840]where we create variability and
[00:04:54.050]where we sample variability.
[00:04:55.300]Those are real numbers from
[00:04:57.030]2024, where we created 262 new
[00:05:00.040]populations.
[00:05:01.180]We advanced 230 populations
[00:05:03.610]from F3 through F4.
[00:05:05.300]And we planted more than 26,000
[00:05:08.610]progeny rows across Nebraska
[00:05:10.940]and South America.
[00:05:13.380]And those progeny rows, they're
[00:05:15.820]selected and they go to those
[00:05:17.910]yield trials where we plant
[00:05:19.940]every year and select every
[00:05:22.280]year more than 30,000 plots. We
[00:05:25.420]plant and we harvest more than
[00:05:26.980]30,000 plots in a multi-location
[00:05:29.410]and multi-year testing
[00:05:30.790]framework.
[00:05:31.540]And we must have that multi-location
[00:05:34.160]testing framework because we
[00:05:36.300]want to replicate the growing
[00:05:38.430]conditions where we want to
[00:05:40.530]target our recommendations and
[00:05:42.680]our selections.
[00:05:44.380]And that second part of our
[00:05:46.080]pipeline, the second half of
[00:05:47.980]our pipeline, is where most or
[00:05:49.920]all of my work here is focused
[00:05:51.730]on optimizing those tests from
[00:05:53.710]preliminary trials that are
[00:05:55.530]selected to the three years of
[00:05:57.490]advanced trials and also to the
[00:05:59.400]regional trials within the last
[00:06:01.700]two years of the advanced
[00:06:03.120]trials.
[00:06:05.620]So back to talking about the
[00:06:07.690]nurseries, the winter nurseries,
[00:06:10.480]we have nurseries here in the
[00:06:12.640]Caribbean and in South America.
[00:06:15.400]And having those in the Caribbean
[00:06:17.100]and South America allows us to
[00:06:18.790]do year-round breeding where
[00:06:20.480]during summer we have crosses F1,
[00:06:22.560]F3, and F4 stages here in
[00:06:24.390]Lincoln.
[00:06:25.160]And during winter, we have
[00:06:27.150]crossing blocks, crossing block
[00:06:29.670]F1, generation F2, and F3 in
[00:06:32.320]Puerto Rico, and progeny row
[00:06:34.580]stages in Chile.
[00:06:36.140]So we can advance up to five
[00:06:37.690]different breeding stages in
[00:06:39.470]those winter nurseries.
[00:06:41.180]And that really speeds up our
[00:06:43.160]breeding process by a lot,
[00:06:45.140]actually.
[00:06:46.160]So talking about the second
[00:06:48.110]half, the multi-location field
[00:06:50.460]testing.
[00:06:52.520]We run our field tests across
[00:06:54.590]the whole maturity group 2 and
[00:06:56.830]maturity group 3 area because we
[00:06:59.290]really want to target our
[00:07:01.510]recommendations for the whole
[00:07:03.330]maturity group 2 and 3 zones for
[00:07:05.830]soybeans.
[00:07:06.960]And to do so, we have an
[00:07:08.400]expanded trial network inside
[00:07:10.400]Nebraska and also outside
[00:07:12.100]Nebraska.
[00:07:13.020]Inside Nebraska, we carry out
[00:07:14.770]planting, harvesting, and
[00:07:16.450]evaluating all those plots.
[00:07:18.420]Outside Nebraska, we hire
[00:07:20.560]private testing services and we
[00:07:23.090]are also part of a coordinated
[00:07:25.410]cross-institutional trial
[00:07:27.650]network between universities
[00:07:30.100]and USDA. And since we have
[00:07:32.370]like 20 environments that we've
[00:07:35.050]been sampling since 2020, it's
[00:07:38.060]more than likely that we have
[00:07:40.110]lots of different climatic and
[00:07:42.340]weather variations across the
[00:07:44.600]whole maturity group two and maturity
[00:07:48.420]group three region. So there's
[00:07:49.940]a lot of difference in
[00:07:50.920]temperature that correlates
[00:07:52.300]with the maturity
[00:07:53.220]group zones and also with the
[00:07:55.160]latitude. But more interesting
[00:07:57.510]than that, we have lots of
[00:07:59.310]variations
[00:08:00.220]in rainfall in the whole
[00:08:01.420]maturity group two and three
[00:08:02.860]area. And we have more
[00:08:04.060]differences in
[00:08:04.920]precipitation inside Nebraska
[00:08:06.160]than across the whole maturity
[00:08:07.350]group two and three area.
[00:08:08.460]And another interesting thing
[00:08:10.750]that I like to mention is that
[00:08:12.840]between Lincoln and Scottsbluff,
[00:08:15.560]the difference in elevation is
[00:08:17.260]more remarkable than between
[00:08:18.860]New York and Lincoln.
[00:08:20.200]So that also contributes and
[00:08:22.140]plays a big role in those
[00:08:23.800]differences in weather
[00:08:25.390]conditions
[00:08:26.280]that we're exposed to when
[00:08:27.980]testing our genotypes.
[00:08:29.740]So investigating those
[00:08:31.730]locations and characterizing
[00:08:34.300]those locations in terms of
[00:08:36.590]weather
[00:08:37.340]and seeing if we can cluster them
[00:08:39.040]is relevant to see how our
[00:08:40.510]testing pipeline can leverage
[00:08:42.360]that type of information.
[00:08:45.560]And to do so, I'm going to
[00:08:47.420]answer three specific questions
[00:08:49.900]here, in what marks the
[00:08:52.040]first part of my PhD research.
[00:08:54.560]They are: how can locations be
[00:08:56.290]grouped? Which sites can serve
[00:08:58.190]as testing hubs? And how does
[00:08:59.930]that impact selection?
[00:09:01.560]And I'm going to answer that
[00:09:02.710]mostly with the enviromics, but
[00:09:04.110]not only with the enviromics.
[00:09:05.560]I'm also going to integrate
[00:09:07.670]that with phenotypes, with genotype
[00:09:10.610]by environment interaction
[00:09:12.800]information, to make very
[00:09:14.920]strong outcomes that can
[00:09:16.850]contribute to the longevity of
[00:09:19.230]our program.
[00:09:20.560]So the first question: how can
[00:09:22.700]locations be grouped?
[00:09:24.560]We propose a hybrid
[00:09:25.710]methodology. When I say hybrid,
[00:09:27.610]it's because we merged two
[00:09:29.010]different things into just one
[00:09:30.660]framework.
[00:09:32.560]So location clustering has been
[00:09:35.320]a topic of research for several
[00:09:37.980]years, but researchers, they
[00:09:40.700]either cluster those locations
[00:09:42.620]based on phenotypes only or
[00:09:44.230]based on environmental data
[00:09:45.930]only.
[00:09:46.520]So we merged those two methodologies
[00:09:48.770]because we think that there
[00:09:50.440]might still be strong genotype
[00:09:52.310]by environment interaction,
[00:09:54.210]even in environments that are
[00:09:56.030]correlated.
[00:09:57.140]So we really think that's
[00:09:58.450]relevant to merge those two
[00:09:59.910]methodologies.
[00:10:01.260]And to do so, we're using the
[00:10:02.850]high-yield elite germplasm from
[00:10:04.720]the soybean breeding program of
[00:10:06.530]the University of Nebraska that's
[00:10:08.530]been tested since 2020 until
[00:10:10.340]2024, until last year.
[00:10:12.100]In all those 20 locations, and
[00:10:14.880]that totals more than 22,000
[00:10:18.130]total plots that were tested
[00:10:20.350]for more than 3,000 genotypes
[00:10:23.510]across maturity groups 2 and 3.
[00:10:27.500]In 37 unique environments; when
[00:10:29.460]I say environments, it's the
[00:10:30.980]combinations of locations and
[00:10:32.570]years. And that totals more
[00:10:34.360]than 15,000 unique genotype by
[00:10:37.050]environment interaction values.
[00:10:39.380]And that's the phenotype part.
[00:10:41.410]And the environmental part, we
[00:10:43.420]retrieve the data from NASA,
[00:10:45.180]from weather variables, and
[00:10:46.990]those variables, they are geotagged
[00:10:49.300]and they represent the whole
[00:10:51.060]growing season on a daily basis.
[00:10:55.440]So we're using that platform
[00:10:57.530]called EnvRtype to retrieve data
[00:10:59.860]from NASA by providing correct
[00:11:02.040]coordinates, longitude,
[00:11:03.530]latitude,
[00:11:04.200]planting and harvest days and
[00:11:06.220]location names. And that
[00:11:08.080]platform can retrieve a pretty
[00:11:10.190]big data set of 29
[00:11:11.640]environmental variables for
[00:11:13.690]each combination of location
[00:11:15.780]and day and year. So it's a
[00:11:17.560]pretty big
[00:11:19.180]data set. But since we have 24
[00:11:22.200]variables and those 24
[00:11:24.530]variables, they come from
[00:11:26.940]primarily four unique variables
[00:11:30.260]that relate to those weather
[00:11:33.090]conditions: precipitation,
[00:11:36.220]radiation, wind, and
[00:11:37.630]temperature. There's strong
[00:11:39.330]collinearity and correlation
[00:11:41.050]between traits,
[00:11:42.060]between variables that we have
[00:11:44.330]to handle. If we use the data
[00:11:46.500]as it is, like the raw data,
[00:11:48.700]it might lead us to wrong
[00:11:49.810]assumptions. So we have to
[00:11:51.140]investigate collinearity and
[00:11:52.580]try to circumvent
[00:11:53.500]that. For example, here we have
[00:11:56.270]some variables among
[00:11:58.540]those climate features that are
[00:12:01.180]not strongly correlated. Some
[00:12:02.880]of those are very strongly
[00:12:04.220]correlated. For example, here
[00:12:06.010]we have
[00:12:06.300]almost a perfect correlation
[00:12:07.840]between total precipitation and
[00:12:09.560]precipitation minus evapotranspiration
[00:12:11.820]because total precipitation is
[00:12:13.560]part of the formula that
[00:12:14.860]estimates precipitation minus evapotranspiration.
[00:12:17.900]And that's something that we
[00:12:19.870]have to circumvent.
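
Illustrative sketch (not part of the talk): one simple way to screen for the collinearity described here is to compute pairwise correlations between covariates and flag the pairs above a cutoff. The simulated daily values, the variable names, and the 0.95 threshold are assumptions made up for this example; they are not the program's data or settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 120  # hypothetical growing-season length

# Simulated daily covariates (illustrative values only)
precip = rng.gamma(shape=2.0, scale=3.0, size=n_days)   # mm/day
evapo = rng.normal(loc=1.0, scale=0.2, size=n_days)     # mm/day
wind = rng.normal(loc=3.0, scale=1.0, size=n_days)      # m/s

covariates = {
    "precip": precip,
    "precip_minus_et": precip - evapo,  # derived from precip, so nearly collinear
    "wind": wind,
}

names = list(covariates)
X = np.column_stack([covariates[k] for k in names])
corr = np.corrcoef(X, rowvar=False)  # variables are in columns


def collinear_pairs(corr, names, threshold=0.95):
    """Return variable pairs whose absolute correlation exceeds the threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], round(float(corr[i, j]), 3)))
    return pairs


flagged = collinear_pairs(corr, names)
print(flagged)
```

In this toy case only the derived variable and its source are flagged, which mirrors the precipitation example the speaker gives.
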
[00:12:21.480]And what I did to handle collinearity,
[00:12:24.440]I plotted all the variables in
[00:12:27.450]a PCA biplot
[00:12:28.680]where I was able to group them
[00:12:30.250]in five different groups.
[00:12:31.880]And those groups are water
[00:12:33.290]availability, climatic extremes,
[00:12:35.420]thermal and radiation stress,
[00:12:36.720]radiation and air dryness,
[00:12:37.900]and physiological radiation.
[00:12:39.340]And after that, I ranked those
[00:12:40.750]variables
[00:12:41.240]in terms of their importance
[00:12:43.340]relative to the variable,
[00:12:45.420]to our response variable, which
[00:12:46.870]is grain yield.
[00:12:47.900]And the ones that are more
[00:12:49.370]important to grain yield is
[00:12:50.980]precipitation and evapotranspiration.
[00:12:53.570]Sorry, are those highlighted in
[00:12:55.530]red. And they are long wave
[00:12:57.230]radiation, photosynthetically
[00:12:59.240]active radiation, normal irradiance,
[00:13:01.820]temperature-weighted photosynthetically
[00:13:04.350]active radiation, minimum air
[00:13:06.640]temperature, maximum air
[00:13:07.780]temperature, number of clear
[00:13:09.670]days, number of frost days, and
[00:13:11.660]precipitation minus potential
[00:13:13.520]evapotranspiration.
[00:13:15.140]And I ranked those based on a
[00:13:16.630]partial least squares
[00:13:17.910]regression, where yield was our
[00:13:19.710]response variable. So we came
[00:13:21.910]from a pretty big raw data set
[00:13:24.380]to a data set that was averaged
[00:13:26.980]by location of 20 rows, each
[00:13:29.930]row representing one location
[00:13:32.650]and 10 most relevant weather
[00:13:35.900]variables.
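
Illustrative sketch (not part of the talk): the ranking step can be pictured with the first weight vector of a one-component PLS regression, which is proportional to the covariance between each standardized covariate and yield. The talk does not say how many components or which importance measure were used, so the single component, the simulated data, and the function name `pls1_first_weights` are assumptions for illustration only.

```python
import numpy as np


def pls1_first_weights(X, y):
    """First-component PLS1 weights: normalized covariance of each column with y."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc
    return w / np.linalg.norm(w)


rng = np.random.default_rng(1)
n_env, n_var = 37, 6  # 37 environments, 6 hypothetical covariates
X = rng.normal(size=(n_env, n_var))

# Hypothetical truth: yield driven mostly by covariate 0, weakly by covariate 3
y = 3.0 * X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=n_env)

w = pls1_first_weights(X, y)
ranking = np.argsort(-np.abs(w))  # covariate indices, most to least important
print(ranking, np.round(w, 2))
```

Keeping the top-ranked covariates and averaging them by location would give a small location-by-covariate matrix of the kind described above.
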
[00:13:38.840]And the second part, how to
[00:13:40.050]integrate genotype by
[00:13:41.090]environment signals to that
[00:13:42.320]matrix.
[00:13:42.880]Genotype by environment signals,
[00:13:44.640]they are of a totally
[00:13:45.770]different nature compared
[00:13:47.300]to those environmental
[00:13:48.500]variables.
[00:13:49.300]The genotype by environment
[00:13:51.130]signals, we can't just average
[00:13:53.190]them by location as we did with
[00:13:55.110]environmental variables.
[00:13:57.100]Because if we do that, we're
[00:13:58.800]actually neglecting the
[00:14:00.250]assumption of nonlinear
[00:14:01.730]relationship between genotypes
[00:14:03.700]and environments.
[00:14:06.380]And that assumption has been
[00:14:08.010]valid for more than 100 years
[00:14:10.050]and accepted by the breeding
[00:14:11.680]community for more than 100
[00:14:13.660]years.
[00:14:14.400]So what I did, I integrated a
[00:14:16.140]factor analytic effect in our
[00:14:18.000]adjustment model where I was
[00:14:19.740]able to extract genotype loadings
[00:14:21.850]and genotype scores and
[00:14:23.390]location loadings.
[00:14:24.860]Those loadings, they represent
[00:14:26.410]how much of the total genotype
[00:14:27.780]by environment interaction
[00:14:29.330]is accounted for by each location.
[00:14:30.980]So those loadings, I also call
[00:14:32.550]them genotype by environment
[00:14:34.070]location signals.
[00:14:35.920]And they form just a vector,
[00:14:39.340]a vector of order 20 by one.
[00:14:40.920]20 is the number of locations,
[00:14:45.560]and one is the column of genotype
[00:14:48.100]by environment signals.
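
Illustrative sketch (not part of the talk): the speaker obtains location loadings from a factor analytic term inside a mixed model. As a much simpler stand-in, the code below double-centers a balanced genotype-by-location table and takes the first singular vector as a 20 x 1 column of location loadings, then merges it with a 20 x 10 covariate matrix. The balanced table, the SVD shortcut, and all simulated values are assumptions; real trial data are unbalanced and would need the mixed-model approach.

```python
import numpy as np

rng = np.random.default_rng(2)
n_gen, n_loc, n_cov = 50, 20, 10

# Hypothetical balanced genotype-by-location yield table
yield_table = rng.normal(loc=60.0, scale=5.0, size=(n_gen, n_loc))

# Remove genotype and location main effects; what remains is the interaction
ge = (yield_table
      - yield_table.mean(axis=1, keepdims=True)
      - yield_table.mean(axis=0, keepdims=True)
      + yield_table.mean())

# Rank-1 approximation: one loading per location (vector of order 20 x 1)
_, s, vt = np.linalg.svd(ge, full_matrices=False)
loc_loadings = (vt[0] * s[0]).reshape(n_loc, 1)

# Hypothetical location-averaged weather covariates (20 x 10)
env = rng.normal(size=(n_loc, n_cov))


def standardize(a):
    """Center and scale each column so features are comparable."""
    return (a - a.mean(axis=0)) / a.std(axis=0)


merged = np.hstack([standardize(env), standardize(loc_loadings)])
print(merged.shape)  # (20, 11)
```

Standardizing before merging keeps the single interaction column on the same scale as the weather covariates.
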
[00:14:50.340]So now I can merge those two
[00:14:52.920]matrices,
[00:14:54.160]and the data is ready for clustering,
[00:14:56.020]for understanding better how
[00:14:57.020]those environments,
[00:14:57.800]they group and explain that in
[00:15:00.230]the context
[00:15:01.140]of our testing network.
[00:15:03.340]So before doing the formal clustering,
[00:15:05.700]I ran this exploratory analysis
[00:15:08.870]just to see how many clusters I
[00:15:11.790]would expect per maturity group
[00:15:15.080]by minimizing the sum of
[00:15:16.080]squares of the Euclidean
[00:15:17.130]distances between them.
[00:15:18.340]So for maturity group 2, I was
[00:15:20.310]supposed to expect three
[00:15:21.850]clusters, and for maturity
[00:15:23.610]group 3, four clusters.
[00:15:25.580]And then I grouped them, and
[00:15:27.330]after the hierarchical clustering,
[00:15:29.680]I could see that for maturity
[00:15:31.440]group 2 locations,
[00:15:33.460]I had three clusters, and for
[00:15:34.880]maturity group three locations,
[00:15:36.550]I had four clusters.
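
Illustrative sketch (not part of the talk): the exploratory step (within-cluster sum of squares against the number of clusters) and the hierarchical clustering can be shown together with a naive Ward agglomeration. The talk does not name the linkage method, so Ward linkage is an assumption, as are the simulated two-feature "locations" and the function name `ward_clusters`.

```python
import numpy as np


def ward_clusters(X):
    """Naive Ward agglomeration; returns {k: labels} and {k: within-cluster SS}."""
    clusters = [[i] for i in range(len(X))]
    labels_by_k, wss_by_k = {}, {}
    while True:
        k = len(clusters)
        labels = np.empty(len(X), dtype=int)
        wss = 0.0
        for c, members in enumerate(clusters):
            labels[members] = c
            pts = X[members]
            wss += float(((pts - pts.mean(axis=0)) ** 2).sum())
        labels_by_k[k], wss_by_k[k] = labels, wss
        if k == 1:
            break
        # Merge the pair giving the smallest increase in within-cluster SS
        best = None
        for a in range(k):
            for b in range(a + 1, k):
                na, nb = len(clusters[a]), len(clusters[b])
                d = X[clusters[a]].mean(axis=0) - X[clusters[b]].mean(axis=0)
                cost = na * nb / (na + nb) * float(d @ d)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return labels_by_k, wss_by_k


rng = np.random.default_rng(3)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
# 15 hypothetical locations in three well-separated groups of five
X = np.vstack([c + rng.normal(scale=0.3, size=(5, 2)) for c in centers])

labels_by_k, wss_by_k = ward_clusters(X)
for k in range(1, 7):
    print(k, round(wss_by_k[k], 1))  # look for the "elbow" in this curve
```

The number of clusters is read from where the sum of squares stops dropping sharply; the labels for that k then define the candidate target populations of environments.
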
[00:15:37.700]And from now on, the results,
[00:15:39.430]they started to make a lot of
[00:15:40.960]sense.
[00:15:41.580]So this is the map of the
[00:15:42.990]target population of
[00:15:44.270]environments.
[00:15:45.440]For maturity group two
[00:15:46.850]locations, I have this target
[00:15:48.710]population of environment one,
[00:15:50.820]target population of
[00:15:51.770]environment two, and target
[00:15:52.960]population of environment three.
[00:15:54.580]For maturity group three, I had
[00:15:56.500]four target population of
[00:15:58.070]environments.
[00:15:59.200]So the assumption that we can
[00:16:00.640]make from that is that within
[00:16:02.080]each cluster,
[00:16:02.880]the weather variability is
[00:16:04.290]minimized, and also the genotype
[00:16:06.060]by environment interaction is
[00:16:07.660]minimized, but across clusters
[00:16:09.500]quite the opposite. The genotype
[00:16:11.300]by environment interaction and
[00:16:12.920]also the correlation between
[00:16:14.420]the environmental features is
[00:16:15.980]maximized. And
[00:16:16.820]the next question is:
[00:16:19.660]So now that we understand
[00:16:20.810]better those 20 locations,
[00:16:22.350]should we test in those 20
[00:16:23.740]locations, or do we have more
[00:16:25.120]informative locations within
[00:16:26.560]each cluster that represent
[00:16:28.300]those
[00:16:28.640]20 locations and represent each
[00:16:30.390]of those target populations of
[00:16:31.850]environments. And that's
[00:16:33.200]answering the second question:
[00:16:34.780]which sites can serve as
[00:16:36.250]testing hubs?
[00:16:37.360]So some researchers from CIMMYT
[00:16:39.610]working with a wheat data set,
[00:16:41.560]they clustered environments
[00:16:43.650]based on environmental data and
[00:16:46.270]at the end
[00:16:46.840]they suggested that as a
[00:16:47.980]potential future research,
[00:16:49.360]they should indicate
[00:16:51.860]which sites would serve as
[00:16:53.310]pivot locations. I call those
[00:16:55.100]as testing hubs. And to answer
[00:16:56.900]that question,
[00:16:57.940]I kind of adapted a
[00:17:00.240]framework from
[00:17:02.940]genomics. So in
[00:17:04.160]genomics we construct genomic
[00:17:06.170]relationship matrices. So right
[00:17:08.290]here I
[00:17:08.740]constructed an environmental or
[00:17:10.600]weather relationship matrix, or
[00:17:12.450]weather kinship
[00:17:13.540]matrix, based on the
[00:17:14.990]environmental covariates and
[00:17:17.360]genotype by environment
[00:17:19.320]signals. And this is an example
[00:17:22.000]for maturity group 3, where
[00:17:24.900]I had 15
[00:17:26.060]locations. So I had a
[00:17:27.470]relationship matrix of 15 by 15,
[00:17:30.180]15 representing the number of
[00:17:32.340]locations.
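
Illustrative sketch (not part of the talk): by analogy with a genomic relationship matrix, a weather relationship matrix can be built as the cross-product of standardized location features divided by the number of features. The exact scaling used in the study is not stated, so this form, the function name `env_relationship`, and the random feature values are assumptions.

```python
import numpy as np


def env_relationship(W):
    """K = Z Z' / p, with Z the column-standardized feature matrix."""
    Z = (W - W.mean(axis=0)) / W.std(axis=0)
    return Z @ Z.T / Z.shape[1]


rng = np.random.default_rng(4)
# 15 locations x 11 features (10 covariates + 1 G-by-E signal), hypothetical values
W = rng.normal(size=(15, 11))

K = env_relationship(W)
print(K.shape)  # (15, 15)
```

With this scaling the diagonal of K averages to one, so off-diagonal entries read like similarities between pairs of locations.
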
[00:17:33.340]And to choose which locations
[00:17:35.100]would better represent those
[00:17:36.940]locations,
[00:17:37.700]I defined the criterion that I
[00:17:40.840]really wanted to have an
[00:17:43.510]average of one hub per target
[00:17:46.690]population
[00:17:48.000]of environments. So that says
[00:17:50.180]that I really wanted to
[00:17:51.720]identify four hubs in total for
[00:17:53.820]maturity group
[00:17:55.000]three. And I ran all possible kinship
[00:17:57.640]matrices of non-hub environments.
[00:18:00.040]And when the prediction
[00:18:01.680]error variance was minimized,
[00:18:03.660]that meant that the four
[00:18:05.280]locations left out of
[00:18:07.380]those 11 non-hub locations were the
[00:18:09.310]ones chosen to represent
[00:18:11.090]the hub environments. And after
[00:18:13.820]I ran that, I was able to
[00:18:15.230]identify those hub locations in
[00:18:17.120]different target populations of
[00:18:19.160]environments for maturity
[00:18:20.420]group two. So I was able to
[00:18:21.650]identify one here in this
[00:18:22.810]target population
[00:18:23.740]of environment, another one
[00:18:25.160]here, and another one here, and
[00:18:26.750]for maturity group 3, I was able
[00:18:28.440]to identify four hub locations.
[00:18:31.940]One in this target population
[00:18:33.730]of environment, two here, none
[00:18:35.760]here, and one here at Phillips
[00:18:37.700]in Nebraska, which is on the
[00:18:39.400]borderline between groups 2 and
[00:18:41.430]3 testing areas.
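
Illustrative sketch (not part of the talk): the hub search can be pictured as an exhaustive loop over all candidate hub subsets, scoring each by how much variance remains at the non-hub sites once the hubs are observed. The score used here (average conditional variance of non-hubs given hubs, computed from the relationship matrix) is an assumed stand-in for the prediction error variance criterion mentioned by the speaker; the block-structured toy matrix, the ridge value, and the function names are also hypothetical.

```python
import numpy as np
from itertools import combinations
from math import comb


def mean_pev(K, hubs, ridge=1e-6):
    """Average conditional variance of non-hub sites given the hub sites."""
    hubs = list(hubs)
    rest = [i for i in range(K.shape[0]) if i not in hubs]
    K_hh = K[np.ix_(hubs, hubs)] + ridge * np.eye(len(hubs))
    K_rh = K[np.ix_(rest, hubs)]
    K_rr = K[np.ix_(rest, rest)]
    cond = K_rr - K_rh @ np.linalg.solve(K_hh, K_rh.T)
    return float(np.trace(cond)) / len(rest)


def select_hubs(K, n_hubs):
    """Exhaustive search over all hub subsets (feasible for ~15 locations)."""
    return min(combinations(range(K.shape[0]), n_hubs),
               key=lambda h: mean_pev(K, h))


# Toy relationship matrix: 3 groups of 4 sites, similarity 0.9 within a group
K = np.kron(np.eye(3), np.full((4, 4), 0.9)) + 0.1 * np.eye(12)

hubs = select_hubs(K, n_hubs=3)
print(hubs)
print(comb(15, 4))  # 1365 subsets for 4 hubs among 15 locations
```

In the toy matrix the search places one hub in each group, which is the behaviour one would hope for when hubs are meant to represent distinct target populations of environments.
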
[00:18:42.780]So those hub locations, they're
[00:18:45.600]actually anchor sites.
[00:18:47.880]They have low redundancy in
[00:18:50.970]between them.
[00:18:52.880]They represent each target
[00:18:53.880]population of environment,
[00:18:55.020]actually the whole testing
[00:18:57.000]network.
[00:18:57.880]The genotype by environment
[00:18:59.580]interaction and weather
[00:19:01.050]information between those hubs
[00:19:02.950]is actually,
[00:19:03.880]with that assumption of maximizing
[00:19:05.820]or minimizing the relationship
[00:19:07.570]between them, it's actually
[00:19:09.190]maximum.
[00:19:09.880]The third question: How does
[00:19:12.560]defining those hubs impact our
[00:19:15.470]selection?
[00:19:16.880]We can see that from evaluating gains
[00:19:18.770]and stability. And that's what
[00:19:20.810]I did. I evaluated gains and
[00:19:22.720]stability using the same data
[00:19:24.970]set that we use to generate
[00:19:26.840]those previous results for
[00:19:29.000]stratifying those environments.
[00:19:31.020]And we could see that, as
[00:19:32.360]expected, evaluating all the
[00:19:34.140]environments, the genotypes in
[00:19:35.960]all the environments provided
[00:19:37.650]the best response to selection
[00:19:39.340]or
[00:19:39.560]gains, and also the best
[00:19:41.400]estimates for stability. The
[00:19:43.600]recommendations were more
[00:19:45.400]stable, and
[00:19:46.340]it gave us more selection gains.
[00:19:50.580]However, and although testing
[00:19:52.930]hubs,
[00:19:53.480]they had fewer locations than
[00:19:55.340]Nebraska locations,
[00:19:56.800]the testing hubs provided
[00:19:59.070]better response to selection
[00:20:01.680]or gains and stability when we
[00:20:03.680]compared them
[00:20:04.660]to just evaluating our genotypes
[00:20:06.190]in Nebraska locations.
[00:20:07.460]So, and to validate our
[00:20:09.110]methodology,
[00:20:10.240]we used an external data set
[00:20:11.880]from the Northern Uniform Soybean
[00:20:13.690]Trials
[00:20:14.100]that was not used to derive
[00:20:15.470]those results,
[00:20:16.320]where we could see the same
[00:20:18.590]pattern.
[00:20:19.660]Hub locations were more
[00:20:21.500]effective for gains and
[00:20:23.330]stability
[00:20:24.200]than just evaluating in
[00:20:25.150]Nebraska locations,
[00:20:26.040]although hub locations,
[00:20:27.020]there were fewer than Nebraska
[00:20:28.590]locations.
[00:20:29.360]So the takeaway of this first
[00:20:31.220]part of the presentation
[00:20:32.920]is that fewer hub locations,
[00:20:36.140]they resulted in greater gains
[00:20:37.550]and more stable selections
[00:20:38.660]than Nebraska locations.
[00:20:39.820]And those hub locations are
[00:20:41.440]an alternative for an optimized
[00:20:43.600]and informative testing network.
[00:20:46.640]So that's the end of the first
[00:20:49.760]part of my research.
[00:20:52.260]And the second part of my
[00:20:53.470]research
[00:20:54.000]is going to talk about genomic
[00:20:56.320]sparse testing.
[00:20:57.880]So we talked a lot about
[00:20:58.690]environments,
[00:20:59.360]about genotype by environment
[00:21:00.580]interaction.
[00:21:01.280]So right here in the genomic sparse
[00:21:02.730]test,
[00:21:03.060]we can actually play with genotypes,
[00:21:05.950]environments,
[00:21:07.280]and genotype by environment
[00:21:08.690]interaction
[00:21:09.280]in the context of genomics.
[00:21:13.600]So before going straight to sparse
[00:21:15.480]testing, to genomic sparse
[00:21:17.030]testing, I would like to define
[00:21:18.770]what genomic selection is.
[00:21:20.440]So genomic selection leverages
[00:21:23.450]genome-wide information at a
[00:21:26.160]nucleotide level and creates
[00:21:29.430]relationship matrices that are
[00:21:31.880]leveraged in prediction models
[00:21:34.850]that are designed to predict
[00:21:37.820]untested genotypes.
[00:21:40.720]So the first concept of genomic
[00:21:42.860]selection, and I consider those
[00:21:45.170]the two seminal papers for genomic
[00:21:47.580]selection, it started back in
[00:21:49.800]1994 when Rex Bernardo, a
[00:21:52.060]quantitative geneticist at the
[00:21:54.220]University of Minnesota, he
[00:21:56.090]said that if someday there
[00:21:58.130]were covariates that could
[00:21:59.730]describe the whole genome,
[00:22:01.600]conventional prediction models
[00:22:03.790]could be adapted into genomic
[00:22:05.730]prediction models, which we
[00:22:07.730]call today GBLUP.
[00:22:10.720]In 2001, seven years later,
[00:22:14.020]when genome-wide markers were
[00:22:16.680]already becoming a reality,
[00:22:19.160]Meuwissen simulated some
[00:22:20.410]prediction models, and he
[00:22:21.930]proved that predicting untested
[00:22:23.640]genotypes
[00:22:24.300]could be done with a certain
[00:22:26.770]level of reliability.
[00:22:29.120]That reliability, we call that
[00:22:31.200]the prediction accuracy or
[00:22:32.960]predictive ability.
[00:22:34.540]And today, genomic selection is
[00:22:36.850]a reality.
[00:22:38.060]It's being deployed at a large
[00:22:39.800]scale in the industry, like in
[00:22:41.660]seed companies and in plant and
[00:22:43.430]animal breeding, and not on a
[00:22:45.350]very large scale in public
[00:22:47.580]breeding programs.
[00:22:49.300]But as I said, it's a reality
[00:22:51.120]and it's being deployed.
[00:22:52.960]And the results are remarkable.
[00:22:57.660]And it's being more recommended
[00:22:59.910]for quantitative traits such as
[00:23:02.120]grain yield, protein and oil
[00:23:04.180]concentration, etc.
[00:23:06.020]Just to summarize what it is,
[00:23:08.270]we use phenotypes in our
[00:23:10.060]prediction models. We integrate
[00:23:12.740]those phenotypes with genomic
[00:23:15.100]information to predict untested
[00:23:17.600]phenotypes, untested genotypes.
[00:23:20.520]So just for the sake of
[00:23:21.720]comparison, here's the
[00:23:23.130]conventional testing. Let's say
[00:23:25.130]that in our breeding pipeline,
[00:23:26.940]we have 1,000 genotypes to be
[00:23:29.500]tested in four environments,
[00:23:30.870]which totals 4,000 plots.
[00:23:33.700]We're going to obtain 4,000
[00:23:35.070]yield values at the end of the
[00:23:36.130]season.
[00:23:36.540]With genomic selection, we have
[00:23:38.240]those same 1,000 genotypes.
[00:23:40.380]All of them are going to have
[00:23:42.710]their marker information,
[00:23:45.200]their DNA marker information.
[00:23:46.560]But just 75% of them are going
[00:23:51.050]to be planted
[00:23:52.880]or are going to have their phenotypic
[00:23:54.290]information available.
[00:23:55.400]Those that have just the DNA
[00:23:57.300]marker information available,
[00:23:59.660]we call that the testing or
[00:24:01.250]cross-validation set,
[00:24:04.880]and the ones that have both
[00:24:06.480]data sets available, we call
[00:24:08.300]them training data sets. And so
[00:24:10.630]the training data set, which is
[00:24:13.410]75% of the total
[00:24:15.670]material that we have, is going
[00:24:18.320]to predict the other 25%, leveraged
[00:24:21.450]by the genomic information here.
[00:24:24.540]Our lab in 2022 already proved
[00:24:27.960]how effective
[00:24:31.860]genomic selection can be
[00:24:33.430]relative to conventional
[00:24:35.090]selection. So it is at least 20%
[00:24:37.470]more effective in terms of
[00:24:39.060]gains when compared to the
[00:24:40.850]conventional phenotypic
[00:24:42.340]selection.
[00:24:43.320]This genomic selection right
[00:24:44.900]here, when we compare it to the sparse
[00:24:46.650]design, I usually say that this
[00:24:48.270]is the conventional genomic
[00:24:49.790]selection, where we're just
[00:24:51.240]predicting 25% of the lines.
[00:24:54.960]So we have not yet finished the
[00:24:56.460]investigation of the sparse
[00:24:57.970]designs in the context of the
[00:24:59.340]genomic selection in our
[00:25:00.600]program.
[00:25:01.240]But just to define what
[00:25:03.670]sparse designs integrated with
[00:25:06.220]genomic selection are: sparse
[00:25:08.390]designs are actually partial
[00:25:10.230]replicate designs where a
[00:25:11.610]subset of lines is replicated
[00:25:13.450]across environments.
[00:25:14.940]But not every line is
[00:25:16.030]replicated across every single
[00:25:17.780]environment.
[00:25:18.720]It's very sparse.
[00:25:21.080]And the training set here is
[00:25:23.510]relatively small compared to
[00:25:26.040]the training set of the
[00:25:28.000]conventional genomic selection.
[00:25:31.320]It's actually much smaller than
[00:25:33.130]that other training set.
[00:25:34.740]So how do we plan to investigate
[00:25:36.540]that? We're going to do that in
[00:25:38.480]terms of the training set
[00:25:40.040]composition.
[00:25:41.200]If we're able to answer which
[00:25:43.140]genotypes we want to include in
[00:25:45.240]the training set,
[00:25:46.740]how many genotypes do we want
[00:25:47.730]to include in the training set,
[00:25:48.940]and which genotypes do we want
[00:25:50.350]to overlap or replicate across
[00:25:51.740]environments,
[00:25:52.620]we're pretty much
[00:25:54.180]checking this first criterion
[00:25:55.680]here of training set
[00:25:56.700]composition.
[00:25:57.580]We're also integrating in our
[00:26:00.130]training set data from
[00:26:02.080]historical yield trials,
[00:26:04.500]which has mostly information
[00:26:06.370]from advanced pipelines.
[00:26:08.220]And that is going to check
[00:26:10.200]those two other criteria of
[00:26:12.370]late-stage trials in the
[00:26:14.350]training set
[00:26:15.460]and of increased test
[00:26:16.560]environments. And since we're
[00:26:18.100]working with genomic
[00:26:19.090]relationship matrices,
[00:26:20.420]we're also checking this
[00:26:21.540]criterion of integrating marker
[00:26:23.000]by environment interaction. And
[00:26:24.680]we're going
[00:26:25.140]to go beyond that because we're
[00:26:26.840]also using environmental covariates
[00:26:28.880]for that. So we're
[00:26:30.020]actually doing marker by
[00:26:31.240]environmental covariate
[00:26:32.470]interaction, which is a
[00:26:33.840]little bit more complex,
[00:26:35.220]but I'll show you it has some
[00:26:36.750]positive results. So the main
[00:26:38.580]two questions that I'm going to
[00:26:40.330]answer
[00:26:40.900]here today for the context of
[00:26:42.740]genomic sparse testing are: what
[00:26:44.660]is an appropriate training set
[00:26:46.820]size and overlapping rate in
[00:26:48.740]that sparse design, and does the
[00:26:50.860]inclusion of the environmental
[00:26:52.990]matrix
[00:26:53.540]benefit the prediction accuracy?
[00:26:56.390]And to do so, I'm going to use
[00:26:58.600]the same data set that I used
[00:27:01.040]to
[00:27:01.220]stratify those environments in
[00:27:02.720]the first part of my
[00:27:03.650]presentation, but not all of
[00:27:04.860]it. Here I'm just
[00:27:05.780]going to include maturity
[00:27:07.060]group 2; otherwise it's going
[00:27:08.400]to be a lot of information for
[00:27:10.660]today. So it's almost 2,000 genotypes
[00:27:14.370]and more than 1,400 of them are
[00:27:17.230]in the
[00:27:18.900]preliminary
[00:27:19.260]set and almost 400 of them are in
[00:27:21.370]the advanced sets in our
[00:27:22.860]testing pipeline. They were
[00:27:24.710]tested in 19
[00:27:25.820]environments in all those
[00:27:27.610]locations here for maturity
[00:27:29.560]group 2, and they were genotyped
[00:27:31.780]using
[00:27:32.240]the molecular inversion probes 1K
[00:27:34.670]SNP set that was developed here
[00:27:36.870]by Dr. David Hyten's lab. Here's
[00:27:40.000]the publication. It's pretty
[00:27:41.220]interesting. If you're
[00:27:42.480]interested in learning more
[00:27:44.130]about that genotyping
[00:27:45.040]methodology, it's available in
[00:27:46.520]that paper, and there's another
[00:27:48.050]interesting paper about that.
[00:27:49.520]So that's the data set that we
[00:27:51.220]used to deploy our genomic
[00:27:52.890]sparse testing.
[00:27:53.840]So before going straight to
[00:27:56.850]analyzing the data, we saw
[00:27:59.680]that for most of our data
[00:28:02.830]set
[00:28:02.960]there was no clear structure,
[00:28:05.060]which allowed us to move
[00:28:06.800]forward and integrate the phenotypes,
[00:28:09.840]the genomic relationship
[00:28:11.050]matrices, and the environmental
[00:28:12.640]data into a very simple
[00:28:13.760]prediction
[00:28:14.240]model, which is: phenotypes are
[00:28:15.680]equal to genotypes plus
[00:28:16.880]environments plus genotype by
[00:28:18.440]environment
[00:28:19.120]interaction. Before showing the
[00:28:23.170]results of the sparse testing
[00:28:26.550]framework, I ran an exploratory
[00:28:30.400]analysis just to guide us when
[00:28:32.900]we actually run these sparse
[00:28:35.320]designs. So this is a four-fold
[00:28:38.240]phenotypic and genomic
[00:28:39.720]selection strategy where I ran
[00:28:41.670]20 independent runs or replications
[00:28:44.300]of the
[00:28:44.800]same analysis considering 75%
[00:28:47.410]of the lines in the training
[00:28:49.400]set and 25% of them in the
[00:28:51.470]testing set.
[00:28:52.720]And when I did that without the
[00:28:54.800]genomic information, we saw
[00:28:56.870]that the correlations
[00:28:58.560]were negative, around minus 0.30.
[00:29:01.340]So if the breeder selects untested
[00:29:03.840]genotypes just relying on phenotypes,
[00:29:06.880]the selections are going to be
[00:29:08.690]wrong and it's going to mess things up
[00:29:10.480]downstream in the next years
[00:29:12.480]of the selection pipeline. But
[00:29:14.540]that usually doesn't happen
[00:29:16.310]because
[00:29:16.880]programs that rely mostly on
[00:29:18.850]phenotypic selection don't
[00:29:21.290]usually
[00:29:24.560]predict untested genotypes
[00:29:26.810]based just on phenotypes. But
[00:29:29.120]when we included the
[00:29:30.640]genomic relationship matrix, we
[00:29:33.130]saw that the average prediction
[00:29:35.460]accuracy was 0.55.
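The evaluation scheme described above (four folds, 20 independent runs, accuracy taken as the correlation between predicted and observed values in the held-out 25%) can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated, and a one-variable least-squares fit on a hypothetical marker-derived score stands in for the genomic prediction model, which the talk does not specify in detail.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def fit_line(x, y):
    """Least-squares slope and intercept (a stand-in for a genomic model)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

def repeated_kfold_accuracy(score, pheno, k=4, runs=20, seed=1):
    """Mean test-set correlation over `runs` random k-fold partitions."""
    rng = random.Random(seed)
    ids = list(range(len(pheno)))
    accuracies = []
    for _ in range(runs):
        rng.shuffle(ids)
        folds = [ids[f::k] for f in range(k)]  # each fold holds about 1/k of the lines
        for test in folds:
            held_out = set(test)
            train = [i for i in ids if i not in held_out]  # the remaining ~75%
            slope, intercept = fit_line([score[i] for i in train],
                                        [pheno[i] for i in train])
            predicted = [intercept + slope * score[i] for i in test]
            accuracies.append(pearson(predicted, [pheno[i] for i in test]))
    return sum(accuracies) / len(accuracies)

# Simulated data (hypothetical): phenotype = marker-derived score + noise.
sim = random.Random(42)
marker_score = [sim.gauss(0, 1) for _ in range(400)]
phenotype = [s + sim.gauss(0, 1) for s in marker_score]
mean_accuracy = repeated_kfold_accuracy(marker_score, phenotype)
print(round(mean_accuracy, 2))
```

The simulated accuracy says nothing about the soybean data; the point is only the bookkeeping of folds, runs, and the held-out correlation.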
[00:29:37.760]And why did I do that? It's
[00:29:39.570]because this 75% training and
[00:29:42.120]25% testing split is like a
[00:29:44.320]benchmark for
[00:29:45.520]sparse testing, considering that
[00:29:47.350]sparse testing has a much
[00:29:48.710]smaller training set size. So
[00:29:50.520]if our
[00:29:50.880]prediction accuracy is
[00:29:52.580]around 0.55 for sparse designs,
[00:29:54.730]it shows us that we're going
[00:29:56.380]in the
[00:29:56.800]right direction in
[00:29:58.460]investigating our sparse
[00:30:00.470]testing. So those are the sparse
[00:30:03.100]testing results
[00:30:04.480]that I have to present here so
[00:30:06.390]far, where I plotted the
[00:30:08.080]predictive ability or the
[00:30:09.780]prediction
[00:30:10.640]accuracy or reliability in
[00:30:12.300]predicting untested genotypes
[00:30:14.240]against training set size
[00:30:15.830]according to
[00:30:17.360]increasing overlapping rates.
[00:30:18.770]When I say overlapping, it's
[00:30:19.840]the replication of lines
[00:30:20.880]across environments.
[00:30:22.730]And I did that
[00:30:24.470]using just the genomic
[00:30:25.810]relationship matrix,
[00:30:27.200]and using both the genomic
[00:30:28.060]relationship matrix and the
[00:30:29.380]environmental relationship
[00:30:30.800]matrix.
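Written out, the model described earlier as "phenotypes equal genotypes plus environments plus genotype by environment interaction" could take the form below. This is a sketch of one common reaction-norm style specification, not necessarily the exact model used in the talk. The symbols are assumptions introduced here: G is the genomic relationship matrix, W is an environmental relationship matrix built from environmental covariates, the Z matrices map observations to genotypes and environments, and the circle denotes the element-wise (Hadamard) product.

```latex
\mathbf{y} = \mathbf{1}\mu + \mathbf{Z}_g\mathbf{g} + \mathbf{Z}_e\mathbf{e} + \mathbf{u} + \boldsymbol{\varepsilon},
\qquad
\begin{aligned}
\mathbf{g} &\sim N(\mathbf{0},\, \sigma^2_g\,\mathbf{G}), &
\mathbf{e} &\sim N(\mathbf{0},\, \sigma^2_e\,\mathbf{W}), \\
\mathbf{u} &\sim N\!\left(\mathbf{0},\, \sigma^2_{ge}\,
  (\mathbf{Z}_g\mathbf{G}\mathbf{Z}_g^{\top}) \circ (\mathbf{Z}_e\mathbf{W}\mathbf{Z}_e^{\top})\right), &
\boldsymbol{\varepsilon} &\sim N(\mathbf{0},\, \sigma^2_{\varepsilon}\,\mathbf{I}).
\end{aligned}
```

Under this reading, the scenario that uses only the genomic relationship matrix would replace W with an identity matrix, so environments are treated as unrelated to each other.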
[00:30:31.280]And we could see that
[00:30:32.360]regardless of the scenario, the
[00:30:34.140]prediction accuracy increased
[00:30:35.870]as the training
[00:30:36.800]set size increased and as the
[00:30:38.330]overlapping rates increased.
[00:30:40.130]Right here, I actually divided
[00:30:42.320]those two plots into three
[00:30:44.020]sections: the first section
[00:30:45.980]from an accuracy of 0.35
[00:30:48.500]to 0.45,
[00:30:49.680]the second one from 0.45 to 0.50,
[00:30:52.320]and the third one from 0.50
[00:30:54.120]and above. And you can see that
[00:30:55.680]using
[00:30:56.000]environmental information here
[00:30:58.430]provided us better prediction
[00:31:00.540]accuracies overall. We had more
[00:31:03.040]prediction accuracy points that
[00:31:05.640]fell between 0.45 and 0.50 of
[00:31:09.360]prediction accuracy, which is
[00:31:11.840]beneficial. But if you ask
[00:31:13.780]me just to pick one point to
[00:31:15.440]recommend for a breeder to run
[00:31:17.290]his tests
[00:31:18.020]based on a number of
[00:31:19.430]individuals in the training set
[00:31:21.770]size and overlapping, I would
[00:31:24.040]say that it
[00:31:24.920]really depends. It depends on
[00:31:26.330]the context. It depends if the
[00:31:27.690]program has any testing
[00:31:28.620]bottlenecks
[00:31:29.260]or any backlogs. So let's say
[00:31:31.240]if I had to test huge
[00:31:32.440]populations and a great number
[00:31:34.420]of populations
[00:31:35.520]and I really needed that type
[00:31:36.760]of information,
[00:31:37.580]I would go with a
[00:31:40.010]training set size of 45%
[00:31:41.480]in a sparse design to test all
[00:31:43.050]those lines,
[00:31:43.760]with 50% overlap.
[00:31:46.640]So in other words,
[00:31:47.540]if I had to test 100 genotypes
[00:31:49.620]in two locations
[00:31:50.620]in the traditional framework,
[00:31:51.800]I would have to collect 200
[00:31:53.580]observations.
[00:31:54.480]With the sparse design
[00:31:55.700]providing the same type of
[00:31:56.810]information,
[00:31:57.560]we're actually testing 46 genotypes
[00:32:00.120]in just one location
[00:32:01.160]and 22 genotypes in two
[00:32:02.570]locations,
[00:32:03.440]a total of 90 observations, or 45% of
[00:32:06.800]the total number of
[00:32:08.440]observations.
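The arithmetic behind that example can be checked with a few lines. The counts come from the talk; the variable names, and the reading of "overlap" as the share of plots that come from genotypes planted in both locations, are assumptions made for this sketch.

```python
n_genotypes, n_locations = 100, 2
full_plots = n_genotypes * n_locations          # conventional design: 200 observations

single_location = 46   # genotypes planted in one location only
both_locations = 22    # genotypes planted in both locations (the overlap)
sparse_plots = single_location * 1 + both_locations * 2   # 46 + 44 = 90

plot_fraction = sparse_plots / full_plots                 # share of the full design
overlap_share = both_locations * 2 / sparse_plots         # plots from overlapping genotypes
field_tested = single_location + both_locations           # genotypes with any phenotype
prediction_only = n_genotypes - field_tested              # genotypes with markers only

print(sparse_plots, plot_fraction, round(overlap_share, 2), prediction_only)
# prints: 90 0.45 0.49 32
```

If the 100 genotypes are the full candidate set, the numbers imply that 32 of them would be evaluated from genomic predictions alone.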
[00:32:12.080]So it's all about finding like
[00:32:14.110]a sweet spot between
[00:32:15.540]how much to test and where to
[00:32:17.890]test.
[00:32:18.660]It's really context dependent.
[00:32:20.960]And we saw that integrating
[00:32:22.740]weather information here
[00:32:24.520]is beneficial, but it's a trade-off:
[00:32:26.900]there are many more parameters to
[00:32:27.850]estimate,
[00:32:28.260]and there is
[00:32:30.520]more computational demand to do
[00:32:31.930]so.
[00:32:32.160]So there's some steps to go
[00:32:35.460]towards this
[00:32:37.160]in terms of data analysis
[00:32:39.160]pipeline optimization as well.
[00:32:42.020]So for the third part of the
[00:32:45.630]presentation,
[00:32:47.960]I'm going to talk about phenomics,
[00:32:51.000]about drone-based phenotyping
[00:32:52.410]for plant maturity.
[00:32:53.400]So those first two sections,
[00:32:55.580]I talked mostly about grain
[00:32:57.540]yield,
[00:32:58.180]genomic selection and stratification
[00:32:59.600]of environments
[00:33:00.220]for grain yield, which is our
[00:33:02.360]main trait.
[00:33:03.460]It's what drives most of our
[00:33:05.560]breeding objectives.
[00:33:07.540]But plant maturity is also a
[00:33:09.240]key trait
[00:33:09.860]because yield is actually
[00:33:12.660]meaningless
[00:33:14.100]if it does not come with
[00:33:16.040]information about maturity.
[00:33:18.880]Because with maturity notes,
[00:33:20.500]we can make proper selections
[00:33:21.770]and recommendations,
[00:33:22.840]and we can create yield tests
[00:33:25.720]for the next years of testing.
[00:33:27.080]We have to compare things that
[00:33:28.370]mature relatively
[00:33:30.180]close together like pretty much
[00:33:31.640]on the same dates.
[00:33:32.600]We cannot compare things that
[00:33:35.370]mature,
[00:33:36.200]let's say one week or 10 days
[00:33:38.340]apart.
[00:33:39.000]And how do we do that in our
[00:33:40.450]program so far?
[00:33:41.560]We do that with visual scores.
[00:33:43.540]We go to the field when the
[00:33:45.850]plants are around R6 stage
[00:33:48.540]until full maturity, we go
[00:33:50.790]there around twice a week.
[00:33:53.180]And we record the visual scores.
[00:33:58.680]And we do that in between 15,000
[00:34:01.550]to 20,000 plots per year in our
[00:34:04.340]program.
[00:34:05.280]But in recent years,
[00:34:07.350]drone phenotyping has become a
[00:34:09.690]pretty hot topic for research
[00:34:11.940]in many breeding programs and
[00:34:14.270]also in other areas of
[00:34:15.810]scientific research,
[00:34:17.760]which uses a specialized camera
[00:34:20.580]attached to a drone and
[00:34:22.560]collects thousands of
[00:34:24.550]photos.
[00:34:27.240]And those photos are translated
[00:34:29.300]into relevant information for
[00:34:31.280]whatever projects, in our case,
[00:34:33.480]for plant breeding projects,
[00:34:35.460]more specifically plant
[00:34:37.060]maturity.
[00:34:38.000]A lot of papers were already
[00:34:40.470]published with open code
[00:34:42.720]pipelines for various crops,
[00:34:45.530]including soybeans.
[00:34:47.780]And we can leverage that type
[00:34:49.220]of information to create our
[00:34:50.660]own drone phenotyping pipeline.
[00:34:53.840]So, what is the reason for
[00:34:55.400]adopting high throughput phenotyping
[00:34:57.770]in our program?
[00:34:58.920]I would point out three reasons.
[00:35:00.820]The first one is because it's
[00:35:02.880]fast.
[00:35:03.520]The second one is because it's
[00:35:04.680]accurate.
[00:35:05.220]And the third one is because it's
[00:35:06.630]safe.
[00:35:06.960]So, it's pretty fast.
[00:35:08.660]We can fly over 3,500 plots in
[00:35:11.010]10 minutes.
[00:35:12.220]So, if I go to the field to
[00:35:13.590]phenotype those plots for plant
[00:35:15.300]maturity, it would take me at
[00:35:16.920]least two days to go over those
[00:35:18.530]3,500 plots.
[00:35:22.000]And that project started here
[00:35:23.860]at UNL in our breeding program
[00:35:25.720]in 2023. We've been flying our
[00:35:28.560]drones over five of our testing
[00:35:30.990]locations in Nebraska. We have
[00:35:33.130]been flying our drones over 60,000
[00:35:36.520]plots over those two years.
[00:35:38.720]And we have to start flying the
[00:35:40.480]drone from late August through
[00:35:42.310]early October
[00:35:43.240]with the same consistency that
[00:35:45.260]we go to the field to take our
[00:35:47.060]visual notes.
[00:35:48.180]And we still have to take our
[00:35:51.040]visual notes
[00:35:52.440]because we have to validate our
[00:35:54.340]drone pipeline
[00:35:55.340]with the real manual or visual
[00:35:57.590]note-taking.
[00:35:58.820]And here's just a summary of
[00:36:01.070]the pipeline.
[00:36:02.480]We collect thousands of photos
[00:36:04.180]from the drone.
[00:36:05.040]We stitch them into just one big
[00:36:06.960]photo per field, which we call an
[00:36:09.080]orthomosaic or raster file.
[00:36:11.200]And then we lay over each orthomosaic
[00:36:13.990]very small rectangles,
[00:36:16.240]which we call shapefiles,
[00:36:18.490]that identify individual plots.
[00:36:21.260]And then we create a time
[00:36:23.440]series data set where we
[00:36:25.630]extract the vegetation index.
[00:36:29.000]Here we're using the normalized
[00:36:30.740]green-red difference index,
[00:36:32.410]which has been giving us the
[00:36:33.890]best accuracy in estimating
[00:36:35.440]maturity from the drone.
[00:36:37.060]And this
[00:36:39.130]index serves as an input for
[00:36:41.200]our logistic regression with
[00:36:43.450]four parameters to estimate our
[00:36:46.050]maturities.
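As a rough illustration of those two ingredients, the sketch below computes the normalized green-red difference index, (G - R) / (G + R), from plot-level band means and defines a declining four-parameter logistic curve. The band values, flight days, and parameter names are hypothetical; the talk does not give the exact parameterization used in the pipeline.

```python
import math

def ngrdi(green, red):
    """Normalized green-red difference index from mean green and red values."""
    return (green - red) / (green + red)

def four_param_logistic(day, upper, lower, midpoint, rate):
    """Declining logistic curve.

    upper:    plateau while the canopy is still green
    lower:    plateau once the plot is fully mature
    midpoint: day at which the curve is halfway between the plateaus
    rate:     steepness of the decline (positive)
    """
    return lower + (upper - lower) / (1.0 + math.exp(rate * (day - midpoint)))

# Hypothetical plot-level band means (green, red) for three flights, keyed by day of year.
flights = {240: (0.30, 0.20), 255: (0.26, 0.24), 270: (0.20, 0.25)}
series = {day: ngrdi(g, r) for day, (g, r) in flights.items()}

# Curve value at the midpoint sits exactly halfway between the two plateaus.
halfway = four_param_logistic(255, upper=0.2, lower=-0.1, midpoint=255, rate=0.3)
print(series, halfway)
```

In practice the four parameters would be estimated per plot from the full time series with a nonlinear least-squares routine such as `scipy.optimize.curve_fit`.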
[00:36:47.160]So here's the time series raster
[00:36:50.980]or orthomosaics.
[00:36:53.400]We have the photo when most of
[00:36:56.170]the plants were before R6. Some
[00:36:59.290]of them were already in R6.
[00:37:00.460]And this is the very last photo.
[00:37:02.540]In between those two, we have
[00:37:04.400]probably eight to ten other
[00:37:06.110]photos,
[00:37:06.780]but here I just included two.
[00:37:09.460]Right here is one field with
[00:37:11.840]the shapefile laid over the
[00:37:14.210]orthomosaic.
[00:37:15.800]As you can see, we can
[00:37:17.350]individualize every single
[00:37:19.470]unique two-row plot and four-row
[00:37:21.900]plot here
[00:37:23.400]in those shapefiles, and we
[00:37:24.880]extract information at a plot
[00:37:26.570]level from each of those. So as
[00:37:29.280]I said, we are able to detect
[00:37:32.390]differences at a plot level.
[00:37:34.030]Those two genotypes, they are
[00:37:35.770]neighboring each other in the
[00:37:37.450]field. And let's say for the
[00:37:39.120]flight number six, this genotype
[00:37:41.050]was pretty much mature while
[00:37:42.660]this one was still kind of
[00:37:44.160]yellowish. It was probably one
[00:37:45.940]week behind in getting mature.
[00:37:47.810]So we're able to detect those
[00:37:49.450]differences.
[00:37:52.680]And with the vegetation index
[00:37:54.770]run through a logistic
[00:37:56.360]regression, we can estimate the
[00:37:58.700]parameters that we want.
[00:38:00.740]So with the parameters that we
[00:38:02.670]want, we can identify this
[00:38:04.440]upper plateau.
[00:38:05.700]This means that the plants are
[00:38:07.640]still green.
[00:38:08.720]There's a decay here.
[00:38:10.180]That means that the plants are
[00:38:11.550]getting yellowish.
[00:38:12.540]They're getting mature.
[00:38:13.300]And when it reaches that third
[00:38:15.220]parameter or lower plateau,
[00:38:17.140]that means that the plants
[00:38:18.830]reached full maturity.
[00:38:20.880]And that's when the maturity
[00:38:22.150]notes are recorded from the
[00:38:23.370]drone, from the high-throughput
[00:38:24.920]phenotyping.
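One way to turn the fitted curve into a maturity date is sketched below. A logistic curve only approaches its lower plateau, so this sketch calls a plot mature on the first day the curve is within a small tolerance of that plateau. The 5% tolerance and the fitted values are assumptions for illustration, not numbers from the talk.

```python
import math

def four_param_logistic(day, upper, lower, midpoint, rate):
    """Declining logistic curve between an upper (green) and lower (mature) plateau."""
    return lower + (upper - lower) / (1.0 + math.exp(rate * (day - midpoint)))

def maturity_day(midpoint, rate, tolerance=0.05):
    """Day at which the curve has covered all but `tolerance` of the drop.

    Solves 1 / (1 + exp(rate * (day - midpoint))) = tolerance for day.
    """
    return midpoint + math.log(1.0 / tolerance - 1.0) / rate

# Hypothetical fitted parameters for three neighboring plots.
fitted = {
    "plot_A": {"upper": 0.20, "lower": -0.10, "midpoint": 252.0, "rate": 0.35},
    "plot_B": {"upper": 0.22, "lower": -0.08, "midpoint": 259.0, "rate": 0.30},
    "plot_C": {"upper": 0.18, "lower": -0.12, "midpoint": 266.0, "rate": 0.40},
}
maturity = {name: maturity_day(p["midpoint"], p["rate"]) for name, p in fitted.items()}
for name, day in maturity.items():
    print(name, round(day, 1))
```

The drone-derived dates would then be compared against the visual notes from the same plots to compute the agreement the speaker reports.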
[00:38:25.820]And this is the data from 2024
[00:38:28.380]from just three genotypes.
[00:38:30.420]But we have thousands of genotypes.
[00:38:32.800]But I just ran those for three
[00:38:34.300]genotypes so you could
[00:38:35.480]visualize that.
[00:38:36.540]And just for 2024 overall, we
[00:38:38.970]have an accuracy between 65%
[00:38:40.970]and 89% in getting those notes
[00:38:43.070]from the high-throughput phenotyping,
[00:38:45.620]which is not bad.
[00:38:47.240]And we still have to validate
[00:38:49.780]that for 2025.
[00:38:52.160]So here's what we estimated
[00:38:54.120]from the drone.
[00:38:55.420]Here's what we estimated from
[00:38:57.230]manual note taking.
[00:38:58.660]You see that there is a
[00:39:00.090]coincidence here,
[00:39:01.460]but there is some deviation
[00:39:03.020]that's expected,
[00:39:04.180]but overall the accuracies are
[00:39:06.530]good, 75 to 89%.
[00:39:08.760]That's a good accuracy.
[00:39:11.620]So this is the very last slide
[00:39:14.190]of the presentation.
[00:39:16.400]And I would like to talk a
[00:39:18.230]little bit about what is the
[00:39:20.330]impact of everything that I
[00:39:22.440]said on the University of
[00:39:24.360]Nebraska soybean breeding
[00:39:26.460]program.
[00:39:27.500]And I'm going to answer that in
[00:39:30.270]terms of response to selection
[00:39:33.140]or gains.
[00:39:35.180]So gains are a function of
[00:39:36.980]prediction accuracy or accuracy,
[00:39:39.570]selection intensity, genetic
[00:39:41.950]variance, and generation
[00:39:44.170]interval, which is the time
[00:39:45.870]from when those genotypes are
[00:39:49.510]tested in
[00:39:51.500]the field to when they are back in
[00:39:53.280]the crossing block to serve as
[00:39:55.060]parents.
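The relationship described here is usually written as the breeder's equation for gain per year. The notation below is an assumption, since the talk names the components but shows no symbols: i is the selection intensity, r is the prediction (selection) accuracy, sigma_g is the genetic standard deviation (the square root of the genetic variance mentioned above), and L is the generation interval in years.

```latex
\Delta G_{\text{year}} = \frac{i \; r \; \sigma_g}{L}
```

With the other terms held fixed, a larger candidate pool from which the same number of lines is selected raises i, and a shorter cycle lowers L; both increase the gain per year.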
[00:39:57.300]I don't want to include
[00:39:58.950]accuracy and selection
[00:40:00.710]intensity here in the
[00:40:02.280]discussion because those are
[00:40:04.550]very population dependent and
[00:40:06.660]that's going to complicate
[00:40:08.580]things. And I'm also not going
[00:40:10.650]to talk about the impacts on
[00:40:12.480]gains from stratification of
[00:40:14.710]environments and from drone
[00:40:16.420]phenotyping. I could talk
[00:40:18.160]about that for the whole day.
[00:40:19.390]But here I'm going to focus on
[00:40:20.520]sparse testing. So just
[00:40:21.710]thinking about sparse
[00:40:22.770]testing.
[00:40:24.520]And those are real numbers from
[00:40:26.650]our program.
[00:40:27.760]Those are the numbers from the
[00:40:29.260]lines that I have genotyped
[00:40:30.640]from my project and that are
[00:40:32.020]part of our cultivar
[00:40:32.830]development pipeline.
[00:40:33.960]So let's suppose those 3,000
[00:40:36.480]lines are tested in a regular
[00:40:38.310]testing framework in two
[00:40:39.920]locations.
[00:40:40.920]We have 6,000 plots to be
[00:40:43.350]tested.
[00:40:44.060]And selecting 20% of those 3,000
[00:40:46.740]genotypes, we end up with 600
[00:40:48.950]genotypes being selected for the
[00:40:50.630]next cycle.
[00:40:52.440]Considering that the cycle
[00:40:54.320]takes four years to
[00:40:56.320]complete, that is, to go back
[00:40:58.410]to the crossing block to begin
[00:41:00.350]another cycle, it gives us an
[00:41:02.210]absolute gain of
[00:41:04.940]0.35 per year.
[00:41:06.940]If we deploy sparse testing, we
[00:41:08.640]can save resources, we can
[00:41:10.050]screen those same 3,000 genotypes,
[00:41:12.570]but instead of planting 6,000
[00:41:14.700]plots, we're planting 3,000
[00:41:16.850]plots.
[00:41:17.560]And the absolute gains per year
[00:41:20.490]are still 0.35.
[00:41:23.000]But if we reinvest what we
[00:41:24.700]saved from sparse testing into
[00:41:26.830]screening more lines, still
[00:41:28.810]keeping the total number of
[00:41:30.620]plots at 6,000, we can screen 4,500
[00:41:34.580]lines, selecting the same
[00:41:36.340]number of lines, which means we're
[00:41:38.740]not hampering the effective
[00:41:40.670]population size of our breeding
[00:41:42.830]program.
[00:41:43.960]And that's also important
[00:41:45.540]thinking about the future
[00:41:47.120]diversity of our breeding
[00:41:48.700]program.
[00:41:49.520]And this gives us an absolute
[00:41:52.100]gain of 0.4 per year.
[00:41:54.860]But if we reinvest that into
[00:41:57.160]screening more genotypes,
[00:41:59.440]and, instead of testing our genotypes
[00:42:01.330]in two environments, test them in four
[00:42:02.850]environments,
[00:42:04.780]while still keeping a limit of 6,000
[00:42:07.550]plots for 3,750 lines,
[00:42:09.780]we have absolute gains of 0.51
[00:42:17.040]per year.
[00:42:19.320]This is an increase of 47% in
[00:42:22.670]gains relative
[00:42:24.320]to the conventional test.
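The four scenarios can be lined up side by side as in the sketch below. Line counts, plot counts, and gains per year are the figures quoted in the talk; the scenario labels are made up here, and applying the same 600 selected lines to the last scenario is an assumption (the talk states it explicitly only for the 4,500-line case).

```python
selected = 600  # lines advanced to the next cycle

scenarios = {
    "conventional":               {"lines": 3000, "plots": 6000, "gain": 0.35},
    "sparse_same_lines":          {"lines": 3000, "plots": 3000, "gain": 0.35},
    "sparse_more_lines":          {"lines": 4500, "plots": 6000, "gain": 0.40},
    "sparse_more_lines_and_envs": {"lines": 3750, "plots": 6000, "gain": 0.51},
}

baseline = scenarios["conventional"]["gain"]
for name, s in scenarios.items():
    s["selected_fraction"] = selected / s["lines"]   # smaller fraction = higher intensity
    s["plots_per_line"] = s["plots"] / s["lines"]    # average field plots per candidate
    s["relative_gain"] = s["gain"] / baseline - 1.0  # change versus conventional testing
    print(name, round(s["selected_fraction"], 3),
          round(s["plots_per_line"], 2), round(s["relative_gain"], 3))
```

With the rounded gains quoted in the talk, the last scenario comes out at roughly a 46% increase; the 47% mentioned by the speaker presumably reflects unrounded values.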
[00:42:26.820]And that's a lot and that's the
[00:42:28.920]impact
[00:42:29.480]of what just the sparse testing
[00:42:31.230]design can promote
[00:42:33.680]in our cultivar development
[00:42:34.960]pipeline.
[00:42:35.580]So that's the end of my
[00:42:37.580]presentation.
[00:42:39.320]I would like to thank everyone
[00:42:42.330]that's in our lab,
[00:42:44.320]Dr. George Graff, Luis Posadas,
[00:42:47.000]Dr. David Hyten.
[00:42:48.800]He's also involved in a lot of
[00:42:51.480]the research
[00:42:52.780]that we carry out in our
[00:42:54.190]program.
[00:42:54.920]Grad students, technologists,
[00:42:56.330]and technicians,
[00:42:57.160]and also the committee of my
[00:42:59.590]PhD, Dr. David Hyten,
[00:43:03.320]Dr. Rick Howard, Dr. Jin Yang
[00:43:05.770]Yang, and Caio Diaz from Brazil.
[00:43:08.320]And I would like also to thank
[00:43:10.900]the support
[00:43:12.040]from the Nebraska Soybean Board.
[00:43:15.080]They are big supporters of most
[00:43:17.510]of what we do here
[00:43:18.820]in our program.
[00:43:19.920]Thanks.
[00:43:20.760]- Thanks, Isaac.
[00:43:28.200]Yeah, great presentation.
[00:43:29.580]So now we have some time for
[00:43:31.120]some questions.
[00:43:32.280]I'll have you start. So, uh,
[00:43:34.140]when you come up
[00:43:37.370]with results that show
[00:43:40.360]clear benefits from
[00:43:42.000]optimization like this, in your
[00:43:43.750]opinion, what is
[00:43:45.680]the
[00:43:45.960]response from a breeding
[00:43:48.280]program to results like this?
[00:43:50.370]Not just picking
[00:43:52.740]on George,
[00:43:53.660]but are
[00:43:55.270]public breeding
[00:43:56.810]programs quick to respond to
[00:43:58.490]information
[00:43:59.280]like this? Are they quick to
[00:44:01.190]change? Because this is
[00:44:03.180]relying on changing
[00:44:05.270]where you've
[00:44:06.300]been testing and how much you've
[00:44:07.850]been testing. Yeah, for sure.
[00:44:09.540]Those changes have to be
[00:44:11.110]carried
[00:44:11.660]out gradually. It's hard to
[00:44:14.120]implement that right away
[00:44:16.480]because it can create a
[00:44:18.920]big shift in the breeding
[00:44:20.000]program if that gets
[00:44:20.880]implemented right away. So I
[00:44:22.070]would say it's a
[00:44:22.820]gradual change. And thinking
[00:44:25.380]about public breeding programs,
[00:44:28.230]turnaround time doesn't
[00:44:31.010]always
[00:44:31.720]help us, let's say,
[00:44:33.230]with genotyping. We have to genotype
[00:44:35.130]things fast to make that
[00:44:36.140]work efficiently. That doesn't
[00:44:38.690]always happen, so we have to
[00:44:41.230]really examine how to do that
[00:44:43.760]in a good way to make that
[00:44:46.010]succeed in the long term, I
[00:44:48.580]would say.
[00:44:52.100]Questions?
[00:44:52.600]In the slide, you showed a big
[00:44:58.410]field with lots and lots of
[00:45:03.410]plots in it.
[00:45:06.240]Yes.
[00:45:06.520]How do you deal with spatial
[00:45:07.930]variability in the soil?
[00:45:09.320]Okay.
[00:45:10.040]Yeah, there is spatial
[00:45:11.100]variability in the soil, and we
[00:45:12.690]handle spatial variability with
[00:45:14.700]our experimental designs.
[00:45:16.560]And our experimental designs,
[00:45:18.000]they have been doing a great
[00:45:19.220]job with controlling the
[00:45:20.380]spatial variability in the soil.
[00:45:22.100]So for advanced tests, we're
[00:45:24.000]using complete block designs or
[00:45:26.050]randomized complete block
[00:45:27.810]designs with two reps in that
[00:45:29.620]case.
[00:45:30.280]And in the preliminary, just
[00:45:31.890]one rep.
[00:45:32.500]And that can control the
[00:45:33.710]spatial variation in those.
[00:45:35.360]And I've compared that with
[00:45:37.240]spatial designs, which kind of
[00:45:39.440]eliminate the spatial
[00:45:41.100]variation in our trials.
[00:45:43.040]And it's comparable.
[00:45:44.280]The results are comparable.
[00:45:45.260]So that's taken into account to
[00:45:46.960]evaluate our trials.
[00:45:49.240]We plant those and we evaluate
[00:45:51.170]those in our statistical models
[00:45:53.320]following statistical and
[00:45:54.700]experimental designs.
[00:45:56.060]So the spatial variation, it
[00:45:58.310]exists, but we handle those.
[00:46:00.800]Have you tried incorporating
[00:46:04.740]satellite data as opposed to
[00:46:08.510]drone data?
[00:46:10.440]There are some. I've heard that
[00:46:15.400]there are some startups trying
[00:46:16.730]to do that with those big seed
[00:46:17.940]companies:
[00:46:19.240]satellite data, taking photos
[00:46:21.080]of the fields and providing
[00:46:22.840]those photos for their
[00:46:24.130]breeding programs. But the
[00:46:25.840]thing here is resolution. I'm
[00:46:27.740]not sure how good the
[00:46:28.810]resolution would be. Because
[00:46:30.650]here for the drone, we have a
[00:46:32.260]pretty high resolution of one
[00:46:33.960]centimeter per pixel. And that's
[00:46:35.990]a lot. Because we need high
[00:46:37.610]quality photos to identify
[00:46:39.170]those patterns in maturity. But
[00:46:41.150]yeah, maybe in the future, that's
[00:46:43.160]going to become a reality, if
[00:46:44.920]it's not already a reality that
[00:46:48.480]I do not know about yet. Have you
[00:46:50.370]compared any? No, I haven't.
[00:46:53.730]Because I think you could pay
[00:46:56.510]for high resolution satellite
[00:46:59.550]photos on a particular day and
[00:47:02.460]just compare those.
[00:47:04.820]And to get one centimeter per
[00:47:07.380]pixel, we fly our drones at 37
[00:47:10.460]to 40 meters above ground.
[00:47:14.180]So
[00:47:14.900]the satellite is I don't know
[00:47:17.220]where. So yeah, my concern is
[00:47:19.830]about resolution.
[00:47:21.700]Plus then Ryan wouldn't have a
[00:47:23.510]job. So.
[00:47:31.300]Thank you for the great
[00:47:32.560]presentation. I'm curious in
[00:47:34.360]terms of the testing hubs that
[00:47:36.030]you have selected,
[00:47:37.060]because all your data is based
[00:47:38.930]on four years, right? 2020 to
[00:47:41.540]2024? Yes. So let's suppose we
[00:47:44.420]expand that range, right? From
[00:47:46.170]the 1990s all the way up to here,
[00:47:48.100]or maybe into the future. Would
[00:47:49.780]you
[00:47:50.100]expect the testing hub locations
[00:47:51.940]to change, or will they stay
[00:47:53.700]consistent throughout the years?
[00:47:55.860]Yes, because it's dependent on
[00:47:57.040]the genotypes that we're
[00:47:57.960]evaluating. Since I integrated
[00:47:59.250]genotype by
[00:47:59.780]environment interaction
[00:48:01.410]signals in those stratifying
[00:48:03.090]models, if we go really far
[00:48:04.630]from
[00:48:04.980]what we have now in our program,
[00:48:06.390]it's another variability,
[00:48:07.540]another germplasm. I mean, it's
[00:48:08.940]the
[00:48:09.080]same germplasm, but with
[00:48:10.280]different allele frequencies.
[00:48:11.880]So I would expect that it would
[00:48:13.320]change. It actually changes if
[00:48:14.580]I do not include the genotype
[00:48:15.650]by environment interaction
[00:48:16.740]signals
[00:48:17.160]in that environmental matrix.
[00:48:19.010]So the results are different.
[00:48:20.800]And it makes more sense after I
[00:48:22.520]included those genotype by
[00:48:23.710]environment interaction signals. So I
[00:48:25.170]guess from that point of view,
[00:48:26.560]right,
[00:48:27.260]on a practicality note, after how
[00:48:28.890]many years do you recommend
[00:48:30.460]that we re-evaluate what the
[00:48:32.210]optimal locations are to test our
[00:48:33.850]genotypes?
[00:48:34.800]I would say anything from four
[00:48:36.500]to seven.
[00:48:37.180]Here, I evaluated four or five
[00:48:38.610]years from 2020.
[00:48:39.960]We still have data from 2025
[00:48:41.890]that we can use because I can
[00:48:43.300]still put that and include that
[00:48:44.910]in my study.
[00:48:45.780]Thinking about the initial
[00:48:47.270]cross from the code for
[00:48:48.550]development, when that line is
[00:48:50.250]no longer in the testing
[00:48:51.480]pipeline, that's seven years.
[00:48:53.920]And four years is the cycle
[00:48:55.460]when that line comes back to
[00:48:57.070]the crossing block.
[00:48:58.420]So four to seven years would be
[00:49:00.120]a reasonable time window to
[00:49:01.660]include those into stratified
[00:49:03.420]environments.
[00:49:04.600]Okay, thank you.
[00:49:05.720]If you let me ask one more.
[00:49:14.500]When you do your drone work,
[00:49:16.770]the values that you get for
[00:49:18.840]each plot, is that a mosaic of
[00:49:21.210]10 or 15 different images that
[00:49:24.220]have been blended together to
[00:49:26.490]avoid the hotspot?
[00:49:28.220]Yes.
[00:49:28.800]Okay.
[00:49:29.140]Yes.
[00:49:29.600]Thank you.
[00:49:30.220]Yeah, it's a time series.
[00:49:31.340]Yeah, a lot of photos.
[00:49:32.740]I have another question, Arthur.
[00:49:35.320]Yes.
[00:49:35.680]With the drone data, are
[00:49:37.160]there other phenotypes that you
[00:49:38.900]can get from the same images
[00:49:40.380]other than maturity?
[00:49:41.780]Yeah.
[00:49:44.000]Height, lodging, and pivot
[00:49:47.510]tracks. But it's tricky to do
[00:49:51.160]that because, yeah, the
[00:49:54.320]correlation is not
[00:49:56.480]very good for lodging. For
[00:49:58.330]pivot tracks, I tried to do
[00:50:00.090]that and it's not been that
[00:50:01.780]successful. I
[00:50:02.960]just communicated with our
[00:50:05.390]group that we need to take
[00:50:07.590]maturity notes pivot tracks
[00:50:10.110]this year because
[00:50:11.760]I don't think that the drone is
[00:50:13.460]going to capture good
[00:50:14.660]information on that. And for
[00:50:16.420]heights,
[00:50:17.200]it has like an, I would say,
[00:50:19.970]intermediate prediction
[00:50:22.500]accuracy. And the difference is
[00:50:25.920]that you have to fly the drone
[00:50:27.390]before planting your genotypes
[00:50:29.050]where you are going to plant
[00:50:30.460]them.
[00:50:30.800]Because it's just a difference
[00:50:32.850]between the images that you
[00:50:35.480]capture when the plants are
[00:50:38.470]mature or
[00:50:39.680]fully grown from when the
[00:50:42.760]digital soil model is extracted
[00:50:46.720]from those images. Any
[00:50:49.740]questions online?
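The height calculation described above (elevation at full growth minus the bare-soil elevation captured before planting) can be sketched as follows. This is a minimal illustration, not the speaker's pipeline: the function name, the flat per-plot pixel lists, and the choice of the 95th percentile to summarize a plot are all assumptions.

```python
import math

def plot_canopy_height(elev_mature, elev_bare_soil, percentile=95):
    """Estimate one plot's height (m) from two co-registered elevation rasters.

    elev_mature    : per-pixel elevations from the flight over grown plants
    elev_bare_soil : per-pixel elevations from the pre-planting flight
    percentile     : summary statistic for the plot (hypothetical choice)
    """
    if len(elev_mature) != len(elev_bare_soil):
        raise ValueError("both rasters must cover the same pixels")
    if not elev_mature:
        raise ValueError("plot contains no pixels")

    # Height per pixel is the difference between the two surfaces;
    # negative values are treated as noise and clipped to zero.
    heights = sorted(
        max(top - ground, 0.0)
        for top, ground in zip(elev_mature, elev_bare_soil)
    )

    # Nearest-rank percentile, so no external libraries are needed.
    rank = max(1, math.ceil(percentile / 100 * len(heights)))
    return heights[rank - 1]


# Hypothetical elevations (m above sea level) for a four-pixel plot.
mature = [351.9, 352.0, 352.1, 351.2]
bare = [351.1, 351.1, 351.2, 351.2]
height = plot_canopy_height(mature, bare)
```

Without the pre-planting flight there is no ground reference to subtract, which is why the bare-soil flight has to happen first.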
[00:50:51.600]All right, this one from Jim
[00:50:53.850]Specht online. Relative to the
[00:50:56.350]first part of your seminar,
[00:50:58.720]you mentioned whether
[00:50:59.830]irrigation versus non-irrigation
[00:51:01.560]locations played a role in
[00:51:02.880]grouping locations. So did
[00:51:04.140]whether the location
[00:51:05.620]was irrigated play a role in
[00:51:06.820]whether...
[00:51:09.120]No, I did not stratify that in
[00:51:10.530]terms of irrigation and non-irrigation.
[00:51:12.760]So I cannot conclude anything
[00:51:15.880]about that because I did not
[00:51:17.640]include that type of
[00:51:18.410]information in my stratification.
[00:51:19.920]This one from Sean Jenkins
[00:51:21.850]online.
[00:51:22.700]How does the yield potential of
[00:51:24.400]a location influence the
[00:51:25.750]location's likelihood to be
[00:51:27.320]selected as a hub?
[00:51:28.520]He goes on to say, are the hubs
[00:51:30.180]typically the highest yielding
[00:51:32.000]locations in a cluster?
[00:51:33.560]Yes.
[00:51:34.440]Yes, we saw that some of the
[00:51:36.280]highest-yielding locations that
[00:51:38.620]we've seen just by evaluating
[00:51:40.630]the data from those last five
[00:51:42.550]years that I'm here and, George,
[00:51:44.880]for the last 30 years, we could
[00:51:47.290]see that there are some high-yielding
[00:51:50.080]locations that are really
[00:51:51.850]considered hubs.
[00:51:53.360]I would say Phillips and Urbana,
[00:51:55.500]but I've not evaluated what's
[00:51:57.440]the actual contribution of
[00:51:59.240]those high-yielding locations.
[00:52:02.380]But high-yielding locations,
[00:52:04.300]they're likely going to provide
[00:52:06.230]better accuracy because of
[00:52:07.870]better heritability.
[00:52:09.480]So, yeah, that might influence
[00:52:11.750]the stratification.
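The link between heritability and accuracy mentioned in this answer can be illustrated with the standard entry-mean heritability formula. This is a sketch under textbook assumptions, not an analysis from the talk: the variance components and replicate counts are hypothetical, and accuracy is taken as the square root of heritability, which holds for selection on phenotypic means.

```python
import math

def entry_mean_heritability(var_g, var_e, n_reps):
    """Heritability on an entry-mean basis: Vg / (Vg + Ve / r)."""
    if var_g < 0 or var_e < 0 or n_reps <= 0:
        raise ValueError("variances must be non-negative and n_reps positive")
    return var_g / (var_g + var_e / n_reps)

def phenotypic_accuracy(h2):
    """Correlation between entry means and true genotypic values."""
    return math.sqrt(h2)


# Hypothetical locations with the same genetic variance but different
# residual (error) variance, each with 3 replicates.
noisy_h2 = entry_mean_heritability(var_g=4.0, var_e=12.0, n_reps=3)
clean_h2 = entry_mean_heritability(var_g=4.0, var_e=3.0, n_reps=3)

noisy_acc = phenotypic_accuracy(noisy_h2)
clean_acc = phenotypic_accuracy(clean_h2)
```

In this toy setup the location with lower error variance has the higher heritability and therefore the higher accuracy. Whether high-yielding locations actually behave this way is the speaker's expectation, not something the sketch demonstrates.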
[00:52:13.760]So you think you'd still have
[00:52:15.130]high-yielding environment?
[00:52:16.760]Even if you're selecting in
[00:52:17.940]those environments, they'd
[00:52:19.230]still be stable?
[00:52:20.120]The goal was to identify stable
[00:52:21.540]ones over that environment,
[00:52:22.920]yeah.
[00:52:23.380]And then one last question from
[00:52:25.010]the business issue.
[00:52:26.900]Can you clarify if the increase
[00:52:28.800]in response from 0.35 to 0.51
[00:52:31.720]with the sparse testing design
[00:52:33.610]was from early stage or late
[00:52:35.660]stage trials?
[00:52:36.460]If I was using observed data to
[00:52:39.220]do that, I could do that with
[00:52:41.850]advanced or late.
[00:52:44.540]But just thinking about the breeder's
[00:52:47.880]equation, if you keep the
[00:52:50.180]genetic variance and accuracy
[00:52:53.530]constant, you can check the
[00:52:55.400]selection intensity from a
[00:52:57.800]standardized table.
[00:53:00.040]So that's just playing with
[00:53:01.810]numbers, playing with a
[00:53:03.420]percentage of selected
[00:53:04.970]individuals.
[00:53:06.160]So that's not actually with
[00:53:08.340]observed data.
[00:53:09.780]That's actually with the
[00:53:11.040]standardized table of selection
[00:53:12.790]intensity,
[00:53:14.540]playing with the percentage of
[00:53:16.080]selected individuals,
[00:53:17.520]thinking that we have to
[00:53:19.320]preserve the same number
[00:53:21.220]of selected individuals,
[00:53:22.400]regardless of the scenario.
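The reasoning in this answer follows the breeder's equation in the form R = i × r × σg (selection intensity × accuracy × genetic standard deviation). A minimal sketch of the "standardized table" lookup is below, assuming truncation selection on a normally distributed trait. The candidate counts, accuracy, and genetic standard deviation are hypothetical and are not the values behind the 0.35 and 0.51 figures from the talk.

```python
from statistics import NormalDist

def selection_intensity(proportion_selected):
    """Standardized selection intensity i for truncation selection.

    i = pdf(z) / p, where z is the truncation point that leaves a
    fraction p of a standard normal distribution above it.
    """
    if not 0 < proportion_selected < 1:
        raise ValueError("proportion_selected must be between 0 and 1")
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - proportion_selected)
    return std_normal.pdf(z) / proportion_selected

def expected_response(proportion_selected, accuracy, sigma_g):
    """Breeder's equation: R = i * r * sigma_g."""
    return selection_intensity(proportion_selected) * accuracy * sigma_g


# Keep the number of selected lines fixed while testing more candidates
# (hypothetical numbers), with accuracy and genetic SD held constant.
n_selected = 50
baseline = expected_response(n_selected / 500, accuracy=0.6, sigma_g=1.0)
sparse = expected_response(n_selected / 1000, accuracy=0.6, sigma_g=1.0)
```

Because accuracy and genetic variance are held constant, any difference between `baseline` and `sparse` comes only from the change in selection intensity. This matches the speaker's point that the comparison comes from the table rather than from observed data.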
[00:53:23.780]Excellent.
[00:53:27.780]So...
