USDA Project
Sanyam Agrawal & Aniruddh Saxena
Added: 04/06/2021
Description
We are working on a USDA project to extract soil data from different soil articles and journals using machine learning techniques.
Transcript
- [00:00:00.816]Hey everyone, we are presenting our USDA project,
- [00:00:05.610]and I am Sanyam Agrawal, this is my partner Aniruddh Saxena.
- [00:00:09.990]Hey!
- [00:00:11.340]And so, the objective
- [00:00:14.650]of our project is to get soil properties,
- [00:00:21.230]climate factors, and management data into our
- [00:00:25.100]DSP database, which will then be used
- [00:00:28.200]to evaluate the impact of land use
- [00:00:30.660]and management practices on soils.
- [00:00:34.670]So, what we're doing in this project
- [00:00:37.700]is we are taking different soil articles
- [00:00:40.330]and journals and extracting information using
- [00:00:44.185]knowledge base construction, which is implemented
- [00:00:50.869]in a new Python-based library called Fonduer,
- [00:00:54.380]and then we're going to extract
- [00:00:58.790]and store
- [00:00:59.623]the extracted data in our database.
- [00:01:02.920]So for the overview of this presentation:
- [00:01:06.380]first we will be discussing what Fonduer is
- [00:01:09.810]and how Fonduer works,
- [00:01:13.210]then we're gonna go in depth with the dataset
- [00:01:16.200]and the parsing and machine learning part of the project.
- [00:01:20.780]After that, we're gonna discuss results
- [00:01:23.840]and some future work on the project.
- [00:01:27.500]So let's get started
- [00:01:28.740]with Fonduer.
- [00:01:32.000]So Fonduer, it's a Python package
- [00:01:36.340]and framework designed at Stanford.
- [00:01:38.090]It works on two techniques:
- [00:01:43.320]the first one is knowledge base construction.
- [00:01:45.930]So knowledge base construction is a machine learning
- [00:01:48.850]technique in which we extract information
- [00:01:53.550]from text, tables, images,
- [00:02:00.720]or video, and then populate that data into a database.
- [00:02:07.210]The other technique that Fonduer uses is
- [00:02:10.340]a recurrent neural network, which is
- [00:02:13.760]used to generate the results.
- [00:02:18.010]So to get a better understanding of
- [00:02:19.810]Fonduer, let's review this figure.
- [00:02:23.010]So first we are going to take
- [00:02:24.630]a bunch of documents as data
- [00:02:28.020]and then run them through parsing.
- [00:02:32.170]Parsing is going to create tokens
- [00:02:35.420]which will have words, sentences, and everything
- [00:02:38.180]from the document, which will then be
- [00:02:40.960]fed to our machine learning models,
- [00:02:43.445]which are going to write
- [00:02:48.620]the results into our database,
- [00:02:53.000]which will be the extracted data we want.
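As a rough illustration of the pipeline just described, here is a minimal sketch of the document-parsing stage following Fonduer's documented workflow; the database connection string, folder path, and document count are placeholder assumptions, not our actual project configuration.

# Minimal sketch of the parsing stage, following Fonduer's documented workflow.
# The connection string, folder path, and max_docs value are placeholders.
from fonduer import Meta
from fonduer.parser import Parser
from fonduer.parser.preprocessors import HTMLDocPreprocessor

# Connect Fonduer to a PostgreSQL database that will hold the parsed corpus.
session = Meta.init("postgresql://localhost:5432/soil_kbc").Session()

# Read a folder of HTML journal articles.
doc_preprocessor = HTMLDocPreprocessor("data/html/", max_docs=50)

# Split every document into sentences and tokens and store them in the database.
corpus_parser = Parser(session, structural=True, lingual=True)
corpus_parser.apply(doc_preprocessor)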
- [00:02:57.680]Next is the dataset.
- [00:03:01.090]Okay, so now let's
- [00:03:02.200]get started with the dataset.
- [00:03:05.308]So for our current dataset, we have around
- [00:03:07.110]40,000 different soil science journal
- [00:03:08.990]articles given to us by the USDA.
- [00:03:12.700]All of these journal articles
- [00:03:14.690]were in the form of PDFs, but Fonduer requires them to be
- [00:03:18.010]either in the form of docs or HTML files.
- [00:03:21.330]So we did a bunch of different conversions
- [00:03:23.500]and finally decided to go with HTML,
- [00:03:26.730]but for most of our experimentation and testing
- [00:03:30.374]we used a smaller dataset of around 50 articles,
- [00:03:34.730]because these were pre-labeled for us
- [00:03:36.530]so we knew what to work with. Here is an example
- [00:03:39.540]of a document from that dataset.
- [00:03:42.560]So as you can see, all the things that we
- [00:03:47.690]need to extract have been pre-highlighted for us,
- [00:03:49.640]so we know what to work with.
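Since the exact conversion route is not shown here, the following is only a hypothetical sketch of one way the PDF-to-HTML step could be scripted; the tool (poppler's pdftohtml) and folder names are assumptions.

# Hypothetical sketch of batch PDF-to-HTML conversion using poppler's
# pdftohtml; the tool choice and folder names are assumptions.
import subprocess
from pathlib import Path

pdf_dir = Path("data/pdf")    # placeholder: folder of journal article PDFs
html_dir = Path("data/html")  # placeholder: output folder Fonduer will read
html_dir.mkdir(parents=True, exist_ok=True)

for pdf in pdf_dir.glob("*.pdf"):
    out = html_dir / pdf.stem
    # -s: write a single HTML file, -c: try to preserve complex layout
    subprocess.run(["pdftohtml", "-s", "-c", str(pdf), str(out)], check=True)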
- [00:03:53.188]Now, let's look at parsing and machine learning.
- [00:03:56.950]So parsing, parsing is just a term
- [00:04:00.240]for reading, or making the computer
- [00:04:01.600]read the document for us. Fonduer uses spaCy
- [00:04:04.850]for this, and spaCy is going to go through the document
- [00:04:08.564]and tokenize all of the different sentences
- [00:04:11.540]in that document and extract them
- [00:04:13.260]separately as separate entities.
- [00:04:15.790]Here's an example of that: after we send this part
- [00:04:19.990]of the document to spaCy,
- [00:04:22.230]it's going to spit out all of these different sentences
- [00:04:25.550]about density, and all the names and such are
- [00:04:28.790]going to come out of it as separate entities.
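For reference, here is a minimal sketch of the sentence tokenization that spaCy performs under the hood; the model name and sample text are placeholders for illustration, not taken from our dataset.

# Minimal sketch of spaCy sentence segmentation; the model name and the
# sample text are placeholders for illustration.
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("Bulk density was measured at the 0-5 cm depth. "
        "Samples were collected from tilled and no-till plots.")

doc = nlp(text)
for sentence in doc.sents:
    # Each sentence becomes a separate unit that the later steps work with.
    print(sentence.text)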
- [00:04:31.610]After we're done with parsing
- [00:04:33.000]and we have a bunch of different, random sentences,
- [00:04:35.470]we're gonna define some classes
- [00:04:37.670]and we're gonna assign all of those sentences
- [00:04:39.190]into separate classes.
- [00:04:42.640]One such class could be an author class,
- [00:04:44.590]which would have all of the names
- [00:04:45.910]of the authors of a specific journal or something like that.
- [00:04:50.552]After that,
- [00:04:53.670]once we're done with the classes,
- [00:04:54.690]we're gonna make some user-defined functions
- [00:04:56.390]to assign all of that stuff into a specific class.
- [00:05:00.090]These functions could be as simple as just taking
- [00:05:03.560]all of the dates that occur after 2011
- [00:05:06.070]or something like that, or they could be
- [00:05:08.060]super complex. Here's an example of that.
- [00:05:10.540]So, after we run our user-defined functions
- [00:05:15.200]and create a class author, we're going to just extract all
- [00:05:17.940]of the author names and put them in this class.
- [00:05:23.850]Here are a couple of different examples of those:
- [00:05:26.720]a date class, and a table class for the tabular data.
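To give a feel for what such a class and user-defined function can look like, here is a minimal sketch following Fonduer's documented mention/matcher API; the class name, the matching rule, and the variables session and docs are illustrative assumptions rather than our actual project code.

# Illustrative sketch of an "Author" class plus a user-defined matching rule,
# following Fonduer's documented mention/matcher API. The rule is deliberately
# simple; `session` and `docs` are assumed to exist from the parsing step.
from fonduer.candidates import MentionExtractor, MentionNgrams
from fonduer.candidates.matchers import LambdaFunctionMatcher
from fonduer.candidates.models import mention_subclass

# Define a class that will hold author-name mentions.
Author = mention_subclass("Author")

# Candidate spans: up to three consecutive tokens.
author_ngrams = MentionNgrams(n_max=3)

def looks_like_author(mention):
    # A simple user-defined rule: multi-word spans where each word is capitalized.
    words = mention.get_span().split()
    return len(words) >= 2 and all(w[0].isupper() for w in words)

author_matcher = LambdaFunctionMatcher(func=looks_like_author)

# Wire the class, the span generator, and the rule together, then apply them
# to the parsed documents.
mention_extractor = MentionExtractor(
    session, [Author], [author_ngrams], [author_matcher]
)
mention_extractor.apply(docs)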
- [00:05:32.060]So next we're going to discuss the results.
- [00:05:36.020]So this is a sample output
- [00:05:39.750]that we expect
- [00:05:42.620]from Fonduer and are getting right now.
- [00:05:45.400]So the first column will be the
- [00:05:48.390]name of the document,
- [00:05:50.470]and the second column will be the keyword:
- [00:05:53.770]suppose we want to have bulk density as a keyword
- [00:05:56.240]and we want to extract all the sentences that contain it.
- [00:05:59.520]So the second column will be the keyword,
- [00:06:02.140]and then the third column will have all the sentences
- [00:06:06.420]which are relevant to our extracted data,
- [00:06:10.260]and then the fourth column
- [00:06:13.750]will be the table data, which will have
- [00:06:16.780]all the table data we have in the document,
- [00:06:23.070]and the last column will have
- [00:06:26.410]all the information regarding the document,
- [00:06:30.740]which will be what journal it is from
- [00:06:36.240]and basically all of the document information.
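To make the five columns just described concrete, here is a hypothetical sketch of what one output row could look like when written to a CSV file; the field names and example values are illustrative assumptions, not our actual output schema.

# Hypothetical sketch of the five-column output described above; field names
# and example values are placeholders for illustration only.
import csv

fieldnames = ["document", "keyword", "sentences", "table_data", "document_info"]

rows = [
    {
        "document": "example_article.html",
        "keyword": "bulk density",
        "sentences": "Bulk density was measured at the 0-5 cm depth.",
        "table_data": "Table 2: bulk density by treatment",
        "document_info": "journal name and other document metadata",
    }
]

with open("extracted_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)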
- [00:06:45.210]So to give you guys
- [00:06:47.950]more of an idea of what our results are,
- [00:06:52.860]here's an example at the textual level
- [00:06:58.324]of how things are working. So all
- [00:07:01.660]the highlighted text in here is the data
- [00:07:05.280]we are able to extract and need to extract.
- [00:07:08.640]So as I was saying before, take bulk density:
- [00:07:12.290]we have highlighted the bulk density for you
- [00:07:15.780]so that it will be easy for you
- [00:07:17.180]to see the sentences we want to extract,
- [00:07:20.350]and our current Fonduer model is able to
- [00:07:23.760]extract all the bulk density sentences
- [00:07:25.830]from a document or a bunch of documents.
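The gist of that keyword-based extraction can be illustrated with a standalone filter over parsed sentences; in the actual project this happens inside Fonduer, and the sample sentences below are made up for illustration.

# Minimal sketch of the idea behind the bulk density extraction: keep only the
# sentences that mention the keyword. Sample sentences are placeholders.
import re

sentences = [
    "Bulk density was measured with the core method.",
    "Plots were managed under two tillage treatments.",
    "Mean bulk density increased under conventional tillage.",
]

keyword = re.compile(r"\bbulk density\b", re.IGNORECASE)
matches = [s for s in sentences if keyword.search(s)]
print(matches)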
- [00:07:30.520]For our future work,
- [00:07:35.410]we are still trying and we want to keep trying
- [00:07:38.870]to extract more relevant keywords and table data,
- [00:07:42.930]and the USDA has provided us an ER diagram.
- [00:07:48.390]Our current ER diagram doesn't exactly
- [00:07:51.340]match the USDA diagram, because of
- [00:07:54.850]some limitations we have with the data
- [00:07:57.010]and the way Fonduer works, as we explained.
- [00:08:01.250]So in future work we would like to
- [00:08:04.340]map our ER diagram to the USDA diagram,
- [00:08:08.250]and as our final result, we want to have a database.
- [00:08:14.570]Right now we have a database,
- [00:08:16.670]but we would like to combine everything
- [00:08:19.060]and get a clearer picture of the database,
- [00:08:26.720]trying to get a more informative database out of Fonduer.
- [00:08:32.000]At last we'd like to acknowledge
- [00:08:35.660]Professor Steven Scott, who gave us
- [00:08:38.210]the opportunity to work on this project,
- [00:08:40.470]mentored us over the course of the project,
- [00:08:42.530]and helped us get all the resources
- [00:08:45.140]that were required for a successful completion.
- [00:08:48.490]We would also like to thank the USDA
- [00:08:50.560]for providing all the necessary details, resources,
- [00:08:54.495]and feedback that were necessary
- [00:08:56.353]for the project. Thank you!
- [00:09:00.440]Thank you.