Package org.apache.tika.sax
Class StandardsText
java.lang.Object
org.apache.tika.sax.StandardsText
StandardsText relies on regular expressions to extract standard references
from text.
This class helps to find the standard references from text by performing the following steps:
- searches for headers;
- searches for patterns that are supposed to be standard references (basically, every string mostly composed of uppercase letters followed by alphanumeric characters);
- each potential standard reference starts with a score of 0.25;
- increases by 0.25 the score of references which include the name of a known standard organization (StandardOrganizations);
- increases by 0.25 the score of references which include the word Publication or Standard;
- increases by 0.25 the score of references which have been found within "Applicable Documents" and equivalent sections;
- returns the standard references along with scores.
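The scoring steps above can be sketched in plain Java. This is a minimal illustration, not the Tika implementation: the class StandardScoreSketch, the regular expression, the short organization list and the rule that everything after an "Applicable Documents" header belongs to that section are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; names and pattern are hypothetical, not Tika's.
final class StandardScoreSketch {

    // Hypothetical stand-in for the names known to StandardOrganizations.
    static final List<String> KNOWN_ORGANIZATIONS = List.of("ISO", "IEEE", "ANSI");

    // Assumed candidate shape: an uppercase acronym, an optional
    // "Publication"/"Standard" word, then a numeric identifier such as 802.11.
    static final Pattern CANDIDATE = Pattern.compile(
            "\\b([A-Z]{2,})\\s+((?:Publication|Standard)\\s+)?([0-9]+(?:[.:-][0-9A-Za-z]+)*)");

    static double score(String reference, boolean inApplicableDocuments) {
        double score = 0.25; // every candidate starts at 0.25
        for (String organization : KNOWN_ORGANIZATIONS) {
            if (reference.contains(organization)) {
                score += 0.25; // names a known standard organization
                break;
            }
        }
        if (reference.contains("Publication") || reference.contains("Standard")) {
            score += 0.25; // includes the word Publication or Standard
        }
        if (inApplicableDocuments) {
            score += 0.25; // found within the "Applicable Documents" section
        }
        return score;
    }

    // Scores every candidate found in the text, in order of appearance.
    static Map<String, Double> scoreAll(String text) {
        // Simplification: the section runs from its header to the end of the text.
        int sectionStart = text.indexOf("Applicable Documents");
        Map<String, Double> scored = new LinkedHashMap<>();
        Matcher matcher = CANDIDATE.matcher(text);
        while (matcher.find()) {
            boolean inSection = sectionStart >= 0 && matcher.start() > sectionStart;
            scored.put(matcher.group(), score(matcher.group(), inSection));
        }
        return scored;
    }

    public static void main(String[] args) {
        String text = "Intro mentions ISO 9001.\nApplicable Documents\nIEEE Standard 802.11\nXYZ 123";
        System.out.println(scoreAll(text));
    }
}
```

Because every increment is 0.25, a candidate can only score 0.25, 0.5, 0.75 or 1.0, so a threshold has just four meaningful settings.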
Constructor Summary
Method Summary
Modifier and Type: static ArrayList<StandardReference>
Method: extractStandardReferences(String text, double threshold)
Description: Extracts the standard references found within the given text.
Constructor Details
StandardsText
public StandardsText()
Method Details
extractStandardReferences
Extracts the standard references found within the given text.
Parameters:
text - the text from which the standard references are extracted.
threshold - the lower bound used to select only the standard references with score greater than or equal to the threshold. For instance, using a threshold of 0.75 means that only the patterns with score greater than or equal to 0.75 will be returned.
Returns:
the list of standard references extracted from the given text.
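The threshold is an inclusive lower bound, which the following self-contained sketch illustrates. It does not call Tika: ThresholdSketch, its select method and the sample scores are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; the names below are hypothetical and not part of Tika.
final class ThresholdSketch {

    // Returns the references whose score is greater than or equal to the threshold.
    static List<String> select(Map<String, Double> scoredReferences, double threshold) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Double> entry : scoredReferences.entrySet()) {
            if (entry.getValue() >= threshold) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Made-up scores for illustration.
        Map<String, Double> scored = new LinkedHashMap<>();
        scored.put("XYZ 123", 0.5);
        scored.put("ISO Standard 9001", 0.75);
        scored.put("IEEE Standard 802.11", 1.0);
        // prints [ISO Standard 9001, IEEE Standard 802.11]
        System.out.println(select(scored, 0.75));
    }
}
```

A reference scoring exactly 0.75 is kept at a threshold of 0.75, matching the "greater than or equal to" wording of the parameter description.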