Finding Keystone Citations for Constructing Validity Chains among Research Papers

New discoveries in science are often built upon previous knowledge. Ideally, such dependency information should be made explicit in a scientific knowledge graph. The Keystone Framework was proposed for tracking the validity dependency among papers. A keystone citation indicates that the validity of a given paper depends on a previously published paper it cites. In this paper, we propose and evaluate a strategy that repurposes rhetorical category classifiers for the novel application of extracting keystone citations that relate to research methods. Five binary rhetorical category classifiers were constructed to identify Background, Objective, Methods, Results, and Conclusions sentences in biomedical papers. The resulting classifiers were used to test the strategy against two datasets. The initial strategy assumed that only citations contained in Methods sentences were methods keystone citations, but our analysis revealed that citations contained in sentences classified as either Methods or Results had a high likelihood of being methods keystone citations. Future work will focus on fine-tuning the rhetorical category classifiers, experimenting with multiclass classifiers, evaluating the revised strategy with more data, and constructing a larger gold-standard citation context sentence dataset for model training.


INTRODUCTION
New discoveries in science are often built upon previous knowledge. For example, Watson and Crick's discovery of the double helix structure of DNA depends, fundamentally, on Erwin Chargaff's discovery of the A-T and C-G pairings and Rosalind Franklin and Maurice Wilkins' X-ray crystallography work [14]. Ideally, such dependency information should be made explicit in a scientific knowledge graph. Graphs that incorporate dependency information have the potential to reveal the flow of information among researchers and fields; to generate data that can support better research impact assessment; and to track what else in the knowledge graph is affected when a paper loses its validity. This work is motivated by the last case.
Our previous work proposed a framework for tracking validity dependencies among research papers, named the Keystone Framework [6]. A keystone citation indicates that the validity of a given paper depends on a previously published paper it cites. The name is inspired by masonry, where damage to the keystone can threaten the arch it supports. One challenge is that, in general, finding keystone citations requires a global understanding of a scientific paper, which may limit automated approaches. However, a subset of keystone citations is more feasible to detect automatically: keystone citations that support research methods and materials, as their keystone status can be determined using only the citation context (i.e., the text surrounding a citation). Thus, for the remainder of the paper, we focus on how to use supervised machine learning to detect this subset of keystone citations.

RELATED WORK

Representing Scientific Evidence
The Keystone Framework is a part of a broader research effort to formalize the knowledge representation of a scientific publication so that its validity can be examined and re-assessed by human and machine readers. The Keystone Framework guides users through a process to find citations that are a "keystone" to the citing paper's arguments. In the first step, a paper's claims and supporting arguments are modeled into graph-like argument diagrams. In the second step, users try to match citations to components in the diagram using the citation contexts. Through a checklist provided in [6], users can determine whether a citation is a keystone citation, and if it is, what type of keystone citation it is.
A few existing semantic models can be used in the first step of document modeling: the Micropublication Ontology [4], the Scientific Evidence and Provenance Information Ontology (SEPIO) [3], and the Reasoning and Discourse Ontology (RDO) [2].
The Micropublication Ontology was proposed to transform text-bound and linear-format scientific publications into web-friendly and machine-tractable digital objects [4]. In its minimal form, a micropublication has a statement and its attribution. In a more expanded form, a micropublication can be supported by a support graph, which encompasses many elements critical to the creation of scientific arguments, such as data, methods, materials, and references, allowing more detailed examination.
SEPIO was initially designed to aid data integration across various model organism and clinical genetics databases, but it is also a domain-independent conceptual model capable of representing diverse evidence and provenance information [3]. It consists of four core informational entities (assertions, propositions, supporting data items, and evidence lines) and two provenance-related entities (assertion process and data generation process). In particular, the data generation process entity is further supported by entities such as technique (i.e., methods), resources (i.e., materials), date-time, and agents.
RDO is a part of the Scientific EvidencE (SEE) approach, which aims to represent arguments as they are presented in the source [2]. RDO has five core entity classes: assertion, proposition, text, report, and agent. One key property, "is inferred from," relates one assertion to another and can be infinitely chained, thus creating an evidence trail for a specific claim.
The contribution of the Keystone Framework is that it focuses on citation relationships and the transmission of validity. Moreover, despite the different constructs of the three semantic models, one commonality is that they all consider research methods and materials an indispensable part of the model, either as explicit entity classes, as in the Micropublication Ontology and SEPIO, or as assertions, as in RDO. Therefore, under any of these three models, citations that support methods and materials will always be keystone citations. This backs our assumption that citations that support research methods and materials (referred to as "methods keystone citations" hereafter) can be extracted as keystone citations without a global understanding of a paper.

Classifying Citation Context Sentences into Methods/non-methods
Citation context sentences can be used to classify citations into "Incidental" and "Important" citations [8,11,17]. "Important" citations are cited works that are used or extended by the citing papers, which has some overlap with our classification task. The difference is that methods keystone citations provide justifications for the use of methods or materials, which is broader than simply "being used." Citation context sentences can also differentiate method and non-method papers. Here, a "method paper" refers to a paper whose main contribution to science is the development of a method. Method papers are cited with less hedging [16], and they enjoy more citations than non-method papers [15], since the latter are more likely to receive a decreasing number of citations due to a phenomenon called "obliteration by incorporation" [9,10]: when a paper's discovery becomes established knowledge, authors no longer feel the need to cite the source paper. Utility words, such as "use", "used", "using", and "based", in the citation context were found to be strong indicators of method papers [15]. However, as we will show later, method papers may not be directly "used" in the papers citing them, and non-method papers, such as reviews, can also be used to support methods [6].

STRATEGY
The proposed strategy to extract keystone citations is depicted in Figure 1. First, we repurposed rhetorical category (RC) classifiers, which are used to assign IMRAD labels (i.e., Introduction, Methods, Results, and Discussion) to sentences in unstructured biomedical abstracts [7,12,19]. In particular, a Methods sentence describes "the way of doing research" [7]. One advantage of using RC classifiers is that training data are relatively easy to obtain: they can be constructed using biomedical abstracts with IMRAD labels. Moreover, we were able to obtain a "cleaner" training dataset that was manually labeled at the sentence level with one of the following categories: Background, Objective, Methods, Results, and Conclusions. This dataset allows us to "cold start" the project without labeling our own dataset. One limitation of this dataset is that it is drawn from abstracts, whose language style may differ from that of the full text of a research paper [5].
As depicted in Figure 1, according to this strategy, a citation context sentence (CCS) is passed through the RC classifiers. If the CCS is classified as Methods, the citation is a methods keystone citation. Otherwise, it is not. One underlying assumption is that authors include a citation in a Methods sentence in order to support the research method or material used. In the example sentence shown in Figure 1, a method, the use of antibody X to confirm the expression of protein Y, is followed by a citation "[42]". Unless incorrectly cited, the paper [42] should provide some justification for the method, such as a prior usage of the method or evidence that antibody X can recognize protein Y.
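The decision rule of this initial strategy can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the `classify` callable, the `toy_classifier` keyword rule, and the second example sentence are hypothetical stand-ins for the trained RC classifiers and real citation context sentences.

```python
from typing import Callable, Set

# Hypothetical interface: given a sentence, return the set of rhetorical
# categories whose binary classifier produced a positive label.
RCClassifier = Callable[[str], Set[str]]


def is_methods_keystone(ccs: str, classify: RCClassifier) -> bool:
    """Initial strategy: flag the citation only if its CCS is labeled Methods."""
    return "Methods" in classify(ccs)


def toy_classifier(sentence: str) -> Set[str]:
    """Stand-in for the trained classifiers: a keyword rule, for illustration only."""
    labels: Set[str] = set()
    if "was used to" in sentence.lower():
        labels.add("Methods")
    return labels


example = "Antibody X was used to confirm the expression of protein Y [42]."
print(is_methods_keystone(example, toy_classifier))
```

Keeping the classifier behind a callable means the same rule can be reused unchanged once real classifiers are available.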

Datasets
An unpublished dataset (5,517 sentences) was used to train the models. Each sentence was manually labeled with one of the following categories: Background, Objective, Methods, Results, or Conclusions. To construct this dataset, 500 abstracts were randomly selected from PubMed without sub-field specifications to maximize the generalizability of the dataset. All sentences in the 500 abstracts were included, except 34 sentences that were not part of the narrative, such as publication information or funding information. Three experts in biomedical informatics annotated the dataset. They first annotated 10 abstracts to develop guidelines; then all three annotated 50 more abstracts. The inter-annotator agreement was found to be high (Fleiss' kappa = 0.92) for the 50 abstracts, so the remaining 440 abstracts were split among the three.
Two more datasets were used to test our strategy. The first is a gold-standard keystone citation context dataset: the JCDL dataset contains nine keystone citation context sentences collected by the authors YF and JS for [6], all supporting methods and materials (Table 2). The second dataset was chosen as a larger testbed: the Willoughby-Hoye dataset is a collection of 99 citation context sentences citing the Willoughby-Hoye protocol [18], downloaded from scite.ai on Dec 30, 2020. This paper was chosen because it was found to contain a code glitch [1] and was a subject of our previous study [6].
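For readers unfamiliar with the agreement statistic reported for the training data, the following is a minimal sketch of how Fleiss' kappa is computed from a table of rating counts. It is not the authors' code, and the small tables used below are made-up illustrations rather than the study's annotations.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa, where counts[i][j] is the number of raters who assigned
    item i to category j. Every item must have the same number of ratings.
    The statistic is undefined when all ratings fall into a single category."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of ratings")
    n_categories = len(counts[0])

    # Per-item agreement: the share of rater pairs that agree on the item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement, from the overall proportion of each category.
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    return (p_bar - p_e) / (1 - p_e)


# Illustrative only: three raters, two items, two categories, full agreement.
print(fleiss_kappa([[3, 0], [0, 3]]))
```

A value near 1 indicates agreement well above chance, which is the situation the reported kappa of 0.92 describes.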

Building classifiers
Five binary classifiers were built, one for each rhetorical category. We used the standard "bag-of-words" representation, which is known to work well for text in general [20,21] and in previous studies of rhetorical category classifiers [7,12,19]. Preprocessing included lowercasing and stop-word removal, and features were selected based on information gain [21].
The Support Vector Machines (SVM) algorithm (Scikit-learn version 0.24.0 [13]) was chosen based on a pilot study in which this model performed better than the Naïve Bayes and Decision Tree classifiers. The configuration used was C-support vector classification with an RBF kernel, using all default settings of sklearn.svm.SVC without fine-tuning of the parameters. A comparison of the three classification algorithms (i.e., SVM, Naïve Bayes, and Decision Tree) can be found in Doc 1 in a GitHub repository (https://github.com/yuanxiesa/Sci-k2021). The number of features was varied from 100 to 1000 in increments of 100. The best model for each rhetorical category was identified by the average F1 score obtained through 10-fold cross-validation.
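The preprocessing and feature-selection steps described above can be illustrated with a small, self-contained sketch. It is an assumption-laden illustration rather than the study's pipeline: the stop-word list, the tokenizer, the toy sentences, and the use of term presence (instead of term counts) are all choices made here for brevity.

```python
import math
import re
from typing import List, Sequence, Set

# Tiny illustrative stop-word list (an assumption, not the list used in the study).
STOP_WORDS = {"a", "an", "the", "of", "to", "was", "were", "is", "in"}


def preprocess(sentence: str) -> Set[str]:
    """Lowercase, tokenize on alphanumerics, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return {tok for tok in tokens if tok not in STOP_WORDS}


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for value in set(labels):
        p = sum(1 for y in labels if y == value) / total
        result -= p * math.log2(p)
    return result


def information_gain(term: str, docs: Sequence[Set[str]], labels: Sequence[int]) -> float:
    """IG = H(class) - H(class | term present or absent)."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term)
    conditional += (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def top_terms(docs: Sequence[Set[str]], labels: Sequence[int], k: int) -> List[str]:
    """Return the k terms with the highest information gain."""
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(vocabulary, key=lambda t: information_gain(t, docs, labels), reverse=True)
    return ranked[:k]


# Toy data: 1 = Methods sentence, 0 = other (made up for illustration).
sentences = [
    "Cells were stained using antibody X.",
    "Expression was measured using PCR.",
    "Cancer is a leading cause of death.",
    "Obesity is common in adults.",
]
labels = [1, 1, 0, 0]
docs = [preprocess(s) for s in sentences]
print(top_terms(docs, labels, 3))
```

In this toy example, a term that appears in every Methods sentence and in no other sentence receives the maximum gain of one bit, which mirrors the observation in the related work that utility words such as "using" are informative.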
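The model-selection procedure can be sketched as a simple search over feature counts. The sketch below is hedged: `evaluate_folds` is a hypothetical callback that would train and score a classifier on each of the ten folds for a given number of features, and `fake_scores` is a made-up scoring function used only so the example runs.

```python
from statistics import mean
from typing import Callable, Iterable, List, Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (label 1) of a binary classifier."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_feature_count(
    evaluate_folds: Callable[[int], List[float]],
    candidates: Iterable[int] = range(100, 1001, 100),
) -> Tuple[int, float]:
    """Return the feature count with the highest mean F1 across folds.

    `evaluate_folds(k)` is a hypothetical callback that returns one F1 score
    per cross-validation fold for a classifier restricted to k features.
    """
    mean_f1 = {k: mean(evaluate_folds(k)) for k in candidates}
    best_k = max(mean_f1, key=mean_f1.get)
    return best_k, mean_f1[best_k]


def fake_scores(k: int) -> List[float]:
    """Made-up fold scores that peak at 400 features, for illustration only."""
    return [1 - abs(k - 400) / 1000] * 10


print(select_feature_count(fake_scores))
```

Running this search once per rhetorical category would yield one selected feature count per binary classifier, which matches the "best model for each rhetorical category" described above.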

RESULTS
Performance metrics for the five best classifiers are listed in Table 1. Accuracy scores for all rhetorical classes were above 0.8. The pattern of scores suggests that predictive performance was likely limited by the training set size, because the two classes with the most instances, Methods and Results, achieved better F1 scores than the other three classes.
Results on the JCDL dataset are shown in Table 2. Four sentences were captured by the Methods classifier. Sentence 3, on the other hand, was captured by the Results classifier. Close examination shows that it is a hybrid: it describes both a method, the use of a monoclonal antibody to confirm the expression of tau protein, and a result, the confirmation of the strong expression of tau protein. Among the four sentences that were missed, sentences 1, 2, and 8 describe "ways of doing research" but were not captured, a failure of the Methods classifier. Sentence 4 is special because it provides a justification for a method (i.e., the use of a synaptic marker to measure neuron damage [6]), and the relation between sentence 4 and the methods used in the paper is not explicit in sentence 4.
When the rhetorical category classifiers were applied to the Willoughby-Hoye dataset, 43 of the 99 instances received a positive classification. One of the authors, YF, examined those 43 sentences and determined whether the Willoughby-Hoye protocol is a methods keystone citation in each case, drawing on experience from the previous analysis [6]. The citation context sentences, their rhetorical category classifications, and the keystone citation annotations can be found in Doc 2 of the GitHub repository (link provided under Building classifiers). The results are summarized in Table 3, including the number of instances where the Willoughby-Hoye protocol is a methods keystone citation, the total number of instances identified by each classifier, and the ratio between the two.
Table 3 shows that our premise that only citations contained in Methods sentences are methods keystone citations (Figure 1) needs revision. Citations contained in Methods and Results sentences both have a high likelihood of being methods keystone citations (95% and 100%, respectively). Although we did not expect the Results classifier to capture keystone citations, two factors altered this view. The first is the existence of Results-Methods hybrids. Second, some sentences that describe "the way of doing research" contain phrases that give a sense of closure, such as "were calculated" or "were carried out," which leads the classifier to label them as Results.
Sentences classified as Background or Conclusions have a low likelihood of containing methods keystone citations. Background sentences situate the Willoughby-Hoye protocol in a research landscape. While we expected no methods keystone citations to appear in sentences classified as Background, we found one: a sentence that described a method in a non-characteristic way ("The entire process begins with DFT prediction. ..."). Likewise, in the two Conclusion sentences, the protocol played an auxiliary role (i.e., reinforcing or contrasting the findings), and neither was a methods keystone citation. Since no Objective sentences were captured, whether citations contained in Objective sentences can be methods keystone citations remains unknown.

[Table 2, rows 6 to 8, as extracted]
Material Methods
(6) The evaluation of Boltzmann-averaged 13C and 1H magnetic shielding tensors and isotropic chemical shifts from density functional theory (DFT) followed Hoye's protocol 25 adapted as follows.
Methods Methods
(7) Therefore, we turned to a protocol that relies on density functional theory-based computations of 1H and 13C NMR chemical shifts and the use of statistical tools to assign the experimental data to the correct isomer of a compound 28 .
Methods Methods
(8) The applied procedure is in principle analogous to the one described by Willoughby 43 , with slight modifications and different software packages used.
Methods Methods

[Table 3, totals row and footnote, as extracted]
31 99
a One instance was classified as both Methods and Results, and therefore the total number is 99, not 100.
This exploratory study resulted in revising our strategy for detecting keystone citations. Our revised strategy, depicted in Figure 2, considers citations contained in Methods or Results sentences to have a high likelihood of being methods keystone citations. Ultimately, a sizable gold-standard keystone citation context dataset is needed, and the rhetorical category classifiers may serve as a useful screening tool for constructing such a dataset. Methods and Results sentences can be quickly scanned to verify that they contain keystone citations; Background and Conclusions sentences can be quickly scanned to ensure that they do not contain keystone citations. Most attention can then be focused on sentences that do not receive a classification.
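The screening use of the revised strategy can be sketched as a triage function. The queue names and the precedence rule (a Methods or Results label wins when a sentence receives several labels) are assumptions made for this illustration; Objective sentences are routed to full review here because no Objective sentences were observed in the tests.

```python
from typing import Set

LIKELY_KEYSTONE = {"Methods", "Results"}
LIKELY_NOT_KEYSTONE = {"Background", "Conclusions"}


def triage(labels: Set[str]) -> str:
    """Route a citation context sentence to an annotation queue based on the
    rhetorical categories assigned to it (hypothetical queue names)."""
    if labels & LIKELY_KEYSTONE:
        # Quick scan to verify that the citation is a methods keystone citation.
        return "verify-keystone"
    if labels & LIKELY_NOT_KEYSTONE:
        # Quick scan to confirm that the citation is not a keystone citation.
        return "verify-non-keystone"
    # Unclassified (or Objective-only) sentences need the most attention.
    return "full-review"


for labels in ({"Methods"}, {"Background"}, set()):
    print(sorted(labels), triage(labels))
```

Under this routing, annotator effort concentrates on the "full-review" queue, which is where the classifiers provide no guidance.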

CONCLUSIONS AND FUTURE WORK
In this paper, we proposed and evaluated a strategy that repurposed rhetorical category classifiers for the novel application of extracting keystone citations that relate to research methods. Five binary rhetorical category classifiers were constructed to identify Background, Objective, Methods, Results, and Conclusions sentences in biomedical papers. The resulting classifiers were evaluated using two datasets. The initial strategy assumed that only citations contained in Methods sentences were methods keystone citations, but our analysis revealed that citations contained in sentences classified as either Methods or Results had a high likelihood of being methods keystone citations. Future work will focus on fine-tuning the rhetorical category classifiers, experimenting with multiclass classifiers, evaluating the revised strategy with more data, and constructing a larger gold-standard citation context sentence dataset for model training.

Figure 1: The concept of the strategy

Figure 2: A revised strategy based on two tests

Table 1: Best classifiers by F1 scores obtained from 10-fold cross-validation on the training dataset

Table 2: Classification results of the JCDL dataset
We took advantage of a mouse line in which expression of a tet transactivator transgene is under control of the neuropsin gene promoter (Yasuda and Mayford, 2006).
In AD, early hallmarks include the loss of synapses, and comparison of AD patients to age-matched control individuals showed that the density of synapses correlated strongly with cognitive impairment, suggesting that loss of connections is associated with the progression of the disease (DeKosky and Scheff, 1990; Scheff and Price, 2006; Terry et al., 1991).

Table 3: Classification results of the Willoughby-Hoye dataset and keystone citation annotation