Escaping the Time Pit: Pitfalls and Guidelines for Using Time-based Git Data
Many software engineering research papers rely on time-based data (e.g., commit timestamps, issue report creation/update/close dates, release dates). Like most real-world data, however, time-based data is often dirty. To date, there are no studies that quantify how frequently such data is used by the software engineering research community, or that investigate the sources of dirty data and quantify how often it occurs. Depending on the research task and method used, including such dirty data could affect the research results. This paper presents the first survey of papers that utilize time-based data, published in the Mining Software Repositories (MSR) conference series. Out of the 690 technical track and data papers published in MSR 2004 to 2020, we found that at least 35% utilized time-based data. We then used the Boa and Software Heritage infrastructures to help identify and quantify several sources of dirty commit timestamp data. Finally, we provide guidelines and best practices for researchers utilizing time-based data from Git repositories.
[00:00:00.940]Hello, my name is Samuel Flint,
[00:00:02.470]and today I'm going to be discussing
[00:00:03.950]the use of time-based data
[00:00:05.120]in the Mining Software Repositories field,
[00:00:07.280]and investigating time-based data errors in Git,
[00:00:09.570]a software version control system.
[00:00:11.540]This work was done with my advisor, Dr. Robert Dyer
[00:00:13.810]and a fellow student, Jigyasa Chauhan.
[00:00:15.800]It will be presented at the 2021
[00:00:17.420]Mining Software Repositories Conference.
[00:00:20.520]In Mining Software Repositories,
[00:00:22.030]lots of data contains a time-based component.
[00:00:24.880]This includes things like code changes, issue reports,
[00:00:28.080]log files and more.
[00:00:29.360]However, as with most data,
[00:00:31.010]time-based data is dirty.
[00:00:33.560]Prior studies discussed problems in mining
[00:00:35.350]different kinds of repositories,
[00:00:36.920]as well as presented guidelines to aid MSR researchers.
[00:00:40.270]However, rather few have discussed handling time-based data.
[00:00:44.620]With this in mind, we asked the following questions.
[00:00:47.480]How many MSR papers rely on time-based data?
[00:00:50.290]What kinds of software data include time?
[00:00:52.580]What filtering or cleaning techniques
[00:00:54.780]are used for time-based data?
[00:00:56.650]And then given the results from these questions,
[00:00:59.750]we ask if bad time-based data is common
[00:01:01.560]in open source Git repositories.
[00:01:04.160]We began by digging into the technical
[00:01:05.720]and data showcase papers
[00:01:06.720]of the Mining Software Repositories Conference,
[00:01:08.660]from the years 2004 to 2020.
[00:01:11.280]There were a total of 690 papers to analyze.
[00:01:13.920]We then selected papers using a keyword search
[00:01:15.900]on time-related keywords,
[00:01:16.970]such as minute, hour and second.
[00:01:20.340]Our keyword search matched 282 papers initially.
[00:01:23.120]After manual inspection, we removed 43,
[00:01:25.620]because while keywords matched,
[00:01:27.380]they were not utilizing time-based data.
[00:01:29.540]In the end, we retained 239 papers
[00:01:32.190]or 35% of the original corpus.
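The keyword selection step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual tooling: the talk names only "minute", "hour" and "second" as example keywords, and the `candidate_papers` function and the sample texts are hypothetical.

```python
import re

# Only the three keywords mentioned in the talk; the full list is not given.
KEYWORDS = ("minute", "hour", "second")
PATTERN = re.compile(r"\b(?:" + "|".join(KEYWORDS) + r")s?\b", re.IGNORECASE)


def candidate_papers(papers):
    """Return the titles whose text mentions at least one time keyword.

    `papers` maps a title to the paper's full text (hypothetical input shape).
    """
    return [title for title, text in papers.items() if PATTERN.search(text)]


sample = {
    "A": "We group commits made within one hour of each other.",
    "B": "A second approach to clone detection.",
    "C": "Static analysis of API usage.",
}
matches = candidate_papers(sample)
```

Paper "B" matches only because "second" is used as an ordinal, which is the kind of false positive that the manual inspection step is there to remove.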
[00:01:34.780]We read each paper,
[00:01:35.800]noting the kinds of time-based data used,
[00:01:37.760]sources of the data and any methods used for filtering
[00:01:40.780]and cleaning the data.
[00:01:44.410]Looking at the kinds of time-based data used,
[00:01:46.970]we found that 64% of papers studied
[00:01:51.600]used version control system data
[00:01:53.350]like commit author, commit message,
[00:01:54.810]commit date and so forth.
[00:01:57.710]31% used issues and bug reports,
[00:02:01.090]17% used forge metadata like stars, forks,
[00:02:04.220]repository creation date and the like,
[00:02:08.160]and 7% used pull requests,
[00:02:10.360]a communication from one developer to another
[00:02:12.200]to suggest a change in code.
[00:02:15.050]Studies used a number of data sources,
[00:02:16.790]but the 10 most common are shown here.
[00:02:18.210]In particular, GitHub was the most common,
[00:02:20.510]with GHTorrent, a curated dataset based on GitHub,
[00:02:24.570]also used frequently.
[00:02:26.240]Other common data sources were well-known
[00:02:28.010]open source projects,
[00:02:28.970]as well as a number of proprietary data sources.
[00:02:32.930]Most studies did not describe using any sort of filtering
[00:02:35.810]or cleaning technique.
[00:02:38.870]Other studies selected data
[00:02:40.130]from within a particular time window
[00:02:41.470]and described this explicitly,
[00:02:43.030]while others described only a cutoff date.
[00:02:45.320]Seven studies used a custom condition,
[00:02:47.250]doing things like selecting commits
[00:02:48.610]related to a particular issue
[00:02:50.220]or ensuring that commits are correctly ordered.
[00:02:52.520]A handful of studies coalesced changesets,
[00:02:54.840]that is, they took all changes made by a given developer
[00:02:57.690]in a given three-minute window
[00:02:59.100]and considered them as a single changeset.
[00:03:01.140]This was most commonly used
[00:03:02.550]in older revision control systems like CVS.
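One way to picture changeset coalescing is the sketch below. It assumes each change is an `(author, timestamp, path)` tuple and that the three-minute window is anchored at the first change of each changeset; the surveyed studies may define the window differently (for example as a sliding gap between consecutive changes), and the `coalesce` name is hypothetical.

```python
from datetime import datetime, timedelta
from itertools import groupby

WINDOW = timedelta(minutes=3)


def coalesce(changes, window=WINDOW):
    """Group (author, timestamp, path) tuples into per-author changesets.

    Assumption: a changeset holds every change by one author that falls
    within `window` of the first change in that changeset.
    """
    changesets = []
    ordered = sorted(changes, key=lambda c: (c[0], c[1]))
    for _author, group in groupby(ordered, key=lambda c: c[0]):
        current = []
        for change in group:
            # Start a new changeset once the window since its first change ends.
            if current and change[1] - current[0][1] > window:
                changesets.append(current)
                current = []
            current.append(change)
        if current:
            changesets.append(current)
    return changesets


t0 = datetime(2003, 5, 1, 12, 0, 0)
changes = [
    ("alice", t0, "a.c"),
    ("alice", t0 + timedelta(minutes=2), "b.c"),
    ("alice", t0 + timedelta(minutes=10), "c.c"),
    ("bob", t0 + timedelta(minutes=1), "d.c"),
]
result = coalesce(changes)
```

Here the first two changes by "alice" end up in one changeset, while her later change and the change by "bob" each form their own.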
[00:03:06.060]A few studies also corrected date formats
[00:03:08.050]ensuring that all dates under study
[00:03:09.610]were in the same time zone and format,
[00:03:11.280]so that analysis may be more easily carried out.
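As a small illustration of this kind of date correction, the sketch below converts timestamps to UTC. It assumes the input is an ISO 8601 string that carries an explicit UTC offset; real datasets use many other formats, and the `to_utc` helper is hypothetical.

```python
from datetime import datetime, timezone


def to_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with an offset and convert it to UTC."""
    parsed = datetime.fromisoformat(stamp)
    if parsed.tzinfo is None:
        # Without an offset the instant is ambiguous, so refuse to guess.
        raise ValueError(f"timestamp has no UTC offset: {stamp!r}")
    return parsed.astimezone(timezone.utc)


berlin = to_utc("2020-05-01T12:00:00+02:00")
new_york = to_utc("2020-05-01T06:00:00-04:00")
```

Both example strings denote the same instant, so after conversion they compare equal and can be ordered against other commits directly.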
[00:03:16.120]The lack of a described method is a cause for concern,
[00:03:20.900]as it reduces the replicability of a study.
[00:03:24.180]Because version control data is the most studied
[00:03:26.030]and GitHub is the most common data source,
[00:03:28.120]we investigated what some time-based problems
[00:03:30.530]would look like and how frequently those occur.
[00:03:33.530]We considered three kinds of time problems in particular,
[00:03:37.150]commits with dates that were very old,
[00:03:38.900]in particular before the 19th of November, 1990,
[00:03:41.970]the initial release of CVS,
[00:03:45.050]commits that were in the future,
[00:03:46.690]after the collection date of the data set we studied,
[00:03:50.560]and commits that have dates older than
[00:03:53.520]at least one of their parents.
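The three checks can be sketched as below. This is an illustration under assumptions, not the Boa query used in the study: commits are modeled as a dictionary from commit id to `(timestamp, parent ids)`, the talk does not say whether author or committer dates were compared, and the `classify` function and the collection date in the example are hypothetical.

```python
from datetime import datetime, timezone

# Initial release of CVS, the cutoff named in the talk.
CVS_RELEASE = datetime(1990, 11, 19, tzinfo=timezone.utc)


def classify(commits, collected_at):
    """Flag suspicious commits.

    `commits` maps a commit id to (timestamp, list of parent ids).
    `collected_at` is the date the data set was collected.
    """
    problems = {"too_old": [], "future": [], "older_than_parent": []}
    for commit_id, (when, parents) in commits.items():
        if when < CVS_RELEASE:
            problems["too_old"].append(commit_id)
        if when > collected_at:
            problems["future"].append(commit_id)
        # A commit should not be dated before any parent it builds on.
        if any(p in commits and when < commits[p][0] for p in parents):
            problems["older_than_parent"].append(commit_id)
    return problems


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


history = {
    "root": (utc(2015, 1, 1), []),
    "child": (utc(2014, 6, 1), ["root"]),
    "ancient": (utc(1970, 1, 1), []),
    "later": (utc(2030, 1, 1), ["child"]),
}
# Hypothetical collection date, chosen only for the example.
report = classify(history, collected_at=utc(2019, 10, 1))
```

A commit can fall into more than one category, which is why the checks are independent rather than chained with `elif`.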
[00:03:56.370]To do this, we used Boa's 2019 GitHub large data set
[00:03:59.370]to look for the bad data.
[00:04:01.000]In particular, we found over 26,000 commits
[00:04:04.710]with younger parents from over 4,000 projects.
[00:04:07.460]We also found over 4,000 suspiciously old commits
[00:04:10.190]in 82 projects and 11 commits in the future from three projects.
[00:04:16.390]Because of the problems observed,
[00:04:17.760]we proposed several methods to mitigate problems
[00:04:20.320]with bad time data.
[00:04:22.430]The first is to filter by project.
[00:04:24.080]If the research community curates a list of projects,
[00:04:26.160]which have significant problems with time data,
[00:04:28.470]those on the lists could simply be avoided.
[00:04:31.130]The second, filtering by date
[00:04:32.520]would involve leaving out history older than 2014.
[00:04:36.270]This keeps seven years of history to mine,
[00:04:38.080]but removes a very large number of faulty commits.
[00:04:40.310]Finally, we suggest simply removing projects
[00:04:42.360]modified by the Git SVN tool,
[00:04:44.170]as they tend to have bad dates.
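The three mitigations could be combined along the lines of the sketch below. Several details are assumptions rather than the authors' procedure: the cutoff is taken to be 1 January 2014 UTC, git-svn involvement is guessed from the `git-svn-id:` line that git-svn adds to commit messages by default, and the `usable_commits` function and its input shape are hypothetical.

```python
from datetime import datetime, timezone

# Assumed reading of "history older than 2014".
CUTOFF = datetime(2014, 1, 1, tzinfo=timezone.utc)


def looks_like_git_svn(messages):
    """Heuristic: git-svn appends a 'git-svn-id:' line by default."""
    return any("git-svn-id:" in message for message in messages)


def usable_commits(projects, blocklist=frozenset()):
    """Apply the project, git-svn, and date filters.

    `projects` maps a project name to a list of (timestamp, message) pairs.
    `blocklist` holds names of projects known to have bad time data.
    """
    kept = {}
    for name, commits in projects.items():
        if name in blocklist:
            continue
        if looks_like_git_svn(message for _, message in commits):
            continue
        recent = [(when, msg) for when, msg in commits if when >= CUTOFF]
        if recent:
            kept[name] = recent
    return kept


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


projects = {
    "clean": [(utc(2012, 3, 1), "old work"), (utc(2016, 3, 1), "new work")],
    "migrated": [(utc(2016, 1, 1), "fix\n\ngit-svn-id: (placeholder)")],
    "listed": [(utc(2018, 1, 1), "update")],
}
kept = usable_commits(projects, blocklist={"listed"})
```

Only the 2016 commit of "clean" survives here. The heuristic misses repositories imported with git-svn metadata turned off, so it is a lower bound on git-svn involvement.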
[00:04:48.223]We found that at least 35% of MSR papers
[00:04:50.640]utilized time-based data.
[00:04:52.520]This is most often version control data
[00:04:54.400]and originating from GitHub.
[00:04:55.830]However, very few studies explicitly describe
[00:04:57.960]filtering and cleaning of time-based data.
[00:05:01.730]We further quantified some time-based data problems
[00:05:03.927]in the most used data source, GitHub,
[00:05:06.610]finding over 26,000 out of order commits
[00:05:08.780]from over 4,000 projects.
[00:05:10.740]We noted that for too old commits,
[00:05:12.037]Git SVN seems frequently connected,
[00:05:14.120]and causes for all errors
[00:05:15.330]are likely due to the tools used, misconfiguration, or user error.
[00:05:18.690]Our replication package is available from Zenodo.