USDA Project
Sanyam Agrawal & Aniruddh Saxena
Added: 04/06/2021
Description
We are working on a USDA project to extract soil data from different soil articles and journals using machine learning techniques.
Transcript
- [00:00:00.816]Hey everyone, we are presenting our USDA project,
- [00:00:05.610]and I am Sanyam Agrawal, this is my partner Aniruddh Saxena.
- [00:00:09.990]Hey!
- [00:00:11.340]And so, the objective
- [00:00:14.650]of our project is to get soil properties,
- [00:00:21.230]climate factors, and management data into our
- [00:00:25.100]DSP database, which will then be used
- [00:00:28.200]to evaluate the impact of land use
- [00:00:30.660]and management practices on soils.
- [00:00:34.670]So, what we're doing in this project
- [00:00:37.700]is we are taking different soil articles
- [00:00:40.330]and journals and extracting information using
- [00:00:44.185]knowledge base construction, which is implemented
- [00:00:50.869]in a new Python-based library called Fonduer,
- [00:00:54.380]and then we're going to extract
- [00:00:58.790]and store
- [00:00:59.623]the extracted data in our database.
- [00:01:02.920]So for the overview of this presentation:
- [00:01:06.380]first we will be discussing what Fonduer is
- [00:01:09.810]and how Fonduer works,
- [00:01:13.210]then we're gonna go in depth with the dataset
- [00:01:16.200]and the parsing and machine learning part of the project.
- [00:01:20.780]After that, we're gonna discuss results
- [00:01:23.840]and some future work on the project.
- [00:01:27.500]So let's get started
- [00:01:28.740]with Fonduer.
- [00:01:32.000]So Fonduer, it's a Python package
- [00:01:36.340]and framework designed at Stanford.
- [00:01:38.090]It works on two techniques:
- [00:01:43.320]the first one is knowledge base construction.
- [00:01:45.930]So knowledge base construction is a machine learning
- [00:01:48.850]technique in which we extract information
- [00:01:53.550]from text, tables, images,
- [00:02:00.720]or video, and then populate that data into a database.
- [00:02:07.210]The other technique that Fonduer uses is
- [00:02:10.340]a recurrent neural network, which is
- [00:02:13.760]used to generate the results.
- [00:02:18.010]So to get a better understanding of
- [00:02:19.810]Fonduer, let's review this figure.
- [00:02:23.010]So first we are going to take
- [00:02:24.630]a bunch of documents as data
- [00:02:28.020]and then run them through parsing.
- [00:02:32.170]Parsing is going to create tokens
- [00:02:35.420]which will have words, sentences, and everything
- [00:02:38.180]from the document, which will then be
- [00:02:40.960]fed to our machine learning models,
- [00:02:43.445]which are going to write
- [00:02:48.620]the results into our database,
- [00:02:53.000]which will be the extracted data we want.
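As a rough illustration of the pipeline just described, here is a minimal sketch of the document-parsing stage following Fonduer's documented workflow; the database connection string, folder path, and document count are placeholder assumptions, not our actual project configuration.

# Minimal sketch of the parsing stage, following Fonduer's documented workflow.
# The connection string, folder path, and max_docs value are placeholders.
from fonduer import Meta
from fonduer.parser import Parser
from fonduer.parser.preprocessors import HTMLDocPreprocessor

# Connect Fonduer to a PostgreSQL database that will hold the parsed corpus.
session = Meta.init("postgresql://localhost:5432/soil_kbc").Session()

# Read a folder of HTML journal articles.
doc_preprocessor = HTMLDocPreprocessor("data/html/", max_docs=50)

# Split every document into sentences and tokens and store them in the database.
corpus_parser = Parser(session, structural=True, lingual=True)
corpus_parser.apply(doc_preprocessor)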
- [00:02:57.680]Next is the dataset.
- [00:03:01.090]Okay, so now let's
- [00:03:02.200]get started with the dataset.
- [00:03:05.308]So for our current dataset, we have around
- [00:03:07.110]40,000 different soil science journal
- [00:03:08.990]articles given to us by the USDA.
- [00:03:12.700]All of these journal articles
- [00:03:14.690]were in the form of PDFs, but Fonduer requires them to be
- [00:03:18.010]either in the form of docs or HTML files.
- [00:03:21.330]So we did a bunch of different conversions
- [00:03:23.500]and finally decided to go with HTML,
- [00:03:26.730]but for most of our experimentation and testing
- [00:03:30.374]we used a smaller dataset of around 50 articles,
- [00:03:34.730]because these were pre-labeled for us
- [00:03:36.530]so we knew what to work with. Here is an example
- [00:03:39.540]of a document from that dataset.
- [00:03:42.560]So as you can see, all the things that we
- [00:03:47.690]need to extract have been pre-highlighted for us,
- [00:03:49.640]so we know what to work with.
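Since the exact conversion route is not shown here, the following is only a hypothetical sketch of one way the PDF-to-HTML step could be scripted; the tool (poppler's pdftohtml) and folder names are assumptions.

# Hypothetical sketch of batch PDF-to-HTML conversion using poppler's
# pdftohtml; the tool choice and folder names are assumptions.
import subprocess
from pathlib import Path

pdf_dir = Path("data/pdf")    # placeholder: folder of journal article PDFs
html_dir = Path("data/html")  # placeholder: output folder Fonduer will read
html_dir.mkdir(parents=True, exist_ok=True)

for pdf in pdf_dir.glob("*.pdf"):
    out = html_dir / pdf.stem
    # -s: write a single HTML file, -c: try to preserve complex layout
    subprocess.run(["pdftohtml", "-s", "-c", str(pdf), str(out)], check=True)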
- [00:03:53.188]Now, let's look at parsing and machine learning.
- [00:03:56.950]So parsing, parsing is just a term
- [00:04:00.240]for reading, or making the computer
- [00:04:01.600]read the document for us. Fonduer uses spaCy
- [00:04:04.850]for this, and spaCy is going to go through the document
- [00:04:08.564]and tokenize all of the different sentences
- [00:04:11.540]in that document and extract them
- [00:04:13.260]separately as separate entities.
- [00:04:15.790]Here's an example of that: after we send this part
- [00:04:19.990]of the document to spaCy,
- [00:04:22.230]it's going to spit out all of these different sentences
- [00:04:25.550]about density, and all the names and such are
- [00:04:28.790]going to come out of it as separate entities.
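For reference, here is a minimal sketch of the sentence tokenization that spaCy performs under the hood; the model name and sample text are placeholders for illustration, not taken from our dataset.

# Minimal sketch of spaCy sentence segmentation; the model name and the
# sample text are placeholders for illustration.
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("Bulk density was measured at the 0-5 cm depth. "
        "Samples were collected from tilled and no-till plots.")

doc = nlp(text)
for sentence in doc.sents:
    # Each sentence becomes a separate unit that the later steps work with.
    print(sentence.text)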
- [00:04:31.610]After we're done with parsing
- [00:04:33.000]and we have a bunch of different, random sentences,
- [00:04:35.470]we're gonna define some classes
- [00:04:37.670]and we're gonna assign all of those sentences
- [00:04:39.190]into separate classes.
- [00:04:42.640]One such class could be an author class,
- [00:04:44.590]which would have all of the names
- [00:04:45.910]of the authors of a specific journal or something like that.
- [00:04:50.552]After that,
- [00:04:53.670]once we're done with the classes,
- [00:04:54.690]we're gonna make some user-defined functions
- [00:04:56.390]to assign all of that stuff into a specific class.
- [00:05:00.090]These functions could be as simple as just taking
- [00:05:03.560]all of the dates that occur after 2011
- [00:05:06.070]or something like that, or they could be
- [00:05:08.060]super complex. Here's an example of that.
- [00:05:10.540]So, after we run our user-defined functions
- [00:05:15.200]and create a class author, we're going to just extract all
- [00:05:17.940]of the author names and put them in this class.
- [00:05:23.850]Here are a couple of different examples of those:
- [00:05:26.720]a date class, and a table class for the tabular data.
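To give a feel for what such a class and user-defined function can look like, here is a minimal sketch following Fonduer's documented mention/matcher API; the class name, the matching rule, and the variables session and docs are illustrative assumptions rather than our actual project code.

# Illustrative sketch of an "Author" class plus a user-defined matching rule,
# following Fonduer's documented mention/matcher API. The rule is deliberately
# simple; `session` and `docs` are assumed to exist from the parsing step.
from fonduer.candidates import MentionExtractor, MentionNgrams
from fonduer.candidates.matchers import LambdaFunctionMatcher
from fonduer.candidates.models import mention_subclass

# Define a class that will hold author-name mentions.
Author = mention_subclass("Author")

# Candidate spans: up to three consecutive tokens.
author_ngrams = MentionNgrams(n_max=3)

def looks_like_author(mention):
    # A simple user-defined rule: multi-word spans where each word is capitalized.
    words = mention.get_span().split()
    return len(words) >= 2 and all(w[0].isupper() for w in words)

author_matcher = LambdaFunctionMatcher(func=looks_like_author)

# Wire the class, the span generator, and the rule together, then apply them
# to the parsed documents.
mention_extractor = MentionExtractor(
    session, [Author], [author_ngrams], [author_matcher]
)
mention_extractor.apply(docs)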
- [00:05:32.060]So next we're going to discuss the results.
- [00:05:36.020]So this is a sample output
- [00:05:39.750]that we expect
- [00:05:42.620]from Fonduer and are getting right now.
- [00:05:45.400]So the first column will be the
- [00:05:48.390]name of the document,
- [00:05:50.470]and the second column will be the keyword:
- [00:05:53.770]suppose we want to have bulk density as a keyword
- [00:05:56.240]and we want to extract all the sentences that contain it.
- [00:05:59.520]So the second column will be the keyword,
- [00:06:02.140]and then the third column will have all the sentences
- [00:06:06.420]which are relevant to our extracted data,
- [00:06:10.260]and then the fourth column
- [00:06:13.750]will be the table data, which will have
- [00:06:16.780]all the table data we have in the document,
- [00:06:23.070]and the last column will have
- [00:06:26.410]all the information regarding the document,
- [00:06:30.740]which will be what journal it is from
- [00:06:36.240]and basically all of the document information.
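To make the five columns just described concrete, here is a hypothetical sketch of what one output row could look like when written to a CSV file; the field names and example values are illustrative assumptions, not our actual output schema.

# Hypothetical sketch of the five-column output described above; field names
# and example values are placeholders for illustration only.
import csv

fieldnames = ["document", "keyword", "sentences", "table_data", "document_info"]

rows = [
    {
        "document": "example_article.html",
        "keyword": "bulk density",
        "sentences": "Bulk density was measured at the 0-5 cm depth.",
        "table_data": "Table 2: bulk density by treatment",
        "document_info": "journal name and other document metadata",
    }
]

with open("extracted_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)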
- [00:06:45.210]So to give you guys
- [00:06:47.950]more of an idea of what our results are,
- [00:06:52.860]here's an example at the textual level
- [00:06:58.324]of how things are working. So all
- [00:07:01.660]the highlighted text in here is the data
- [00:07:05.280]we are able to extract and need to extract.
- [00:07:08.640]So as I was saying before, take bulk density:
- [00:07:12.290]we have highlighted the bulk density for you
- [00:07:15.780]so that it will be easy for you
- [00:07:17.180]to see the sentences we want to extract,
- [00:07:20.350]and our current Fonduer model is able to
- [00:07:23.760]extract all the bulk density sentences
- [00:07:25.830]from a document or a bunch of documents.
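The gist of that keyword-based extraction can be illustrated with a standalone filter over parsed sentences; in the actual project this happens inside Fonduer, and the sample sentences below are made up for illustration.

# Minimal sketch of the idea behind the bulk density extraction: keep only the
# sentences that mention the keyword. Sample sentences are placeholders.
import re

sentences = [
    "Bulk density was measured with the core method.",
    "Plots were managed under two tillage treatments.",
    "Mean bulk density increased under conventional tillage.",
]

keyword = re.compile(r"\bbulk density\b", re.IGNORECASE)
matches = [s for s in sentences if keyword.search(s)]
print(matches)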
- [00:07:30.520]For our future work,
- [00:07:35.410]we are still trying and we want to keep trying
- [00:07:38.870]to extract more relevant keywords and table data,
- [00:07:42.930]and the USDA has provided us an ER diagram.
- [00:07:48.390]Our current ER diagram doesn't exactly
- [00:07:51.340]match the USDA diagram, because of
- [00:07:54.850]some limitations we have with the data
- [00:07:57.010]and the way Fonduer works, as we explained.
- [00:08:01.250]So in future work we would like to
- [00:08:04.340]map our ER diagram to the USDA diagram,
- [00:08:08.250]and as our final result, we want to have a database.
- [00:08:14.570]Right now we have a database,
- [00:08:16.670]but we would like to combine everything
- [00:08:19.060]and get a clearer picture of the database,
- [00:08:26.720]trying to get a more informative database out of Fonduer.
- [00:08:32.000]At last we'd like to acknowledge
- [00:08:35.660]Professor Steven Scott, who gave us
- [00:08:38.210]the opportunity to work on this project,
- [00:08:40.470]mentored us over the course of the project,
- [00:08:42.530]and helped us get all the resources
- [00:08:45.140]that were required for a successful completion.
- [00:08:48.490]We would also like to thank the USDA
- [00:08:50.560]for providing all the necessary details, resources,
- [00:08:54.495]and feedback that were necessary
- [00:08:56.353]for the project. Thank you!
- [00:09:00.440]Thank you.