Class StandardsText
java.lang.Object
org.apache.tika.sax.StandardsText
StandardsText relies on regular expressions to extract standard references
from text.
This class finds standard references in text by performing the following steps:
- searches for headers;
- searches for patterns that are likely to be standard references (basically, every string mostly composed of uppercase letters followed by alphanumeric characters);
- assigns each potential standard reference an initial score of 0.25;
- adds 0.25 to the score of references that include the name of a known standard organization (StandardOrganizations);
- adds 0.25 to the score of references that include the word Publication or Standard;
- adds 0.25 to the score of references found within "Applicable Documents" and equivalent sections;
- returns the standard references along with scores.
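The scoring steps above can be sketched as follows. This is a minimal illustration, not Tika's actual implementation: the class name `StandardScoreSketch`, the `score` method, and the three-entry organization set are hypothetical, and a plain substring check stands in for the regular expressions the real class uses.

```java
import java.util.Set;

public class StandardScoreSketch {
    // Hypothetical subset of organizations; the real list is held by StandardOrganizations.
    private static final Set<String> KNOWN_ORGANIZATIONS = Set.of("ISO", "IEEE", "ANSI");

    // Scores one candidate reference according to the rules listed above.
    static double score(String reference, boolean inApplicableDocuments) {
        double score = 0.25; // every candidate starts at 0.25

        // +0.25 if the candidate names a known standard organization (simplified substring check)
        for (String organization : KNOWN_ORGANIZATIONS) {
            if (reference.contains(organization)) {
                score += 0.25;
                break;
            }
        }

        // +0.25 if the candidate contains the word Publication or Standard
        if (reference.contains("Publication") || reference.contains("Standard")) {
            score += 0.25;
        }

        // +0.25 if the candidate was found in an "Applicable Documents" (or equivalent) section
        if (inApplicableDocuments) {
            score += 0.25;
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score("ISO Standard 9001", true)); // prints 1.0
        System.out.println(score("ABC 123", false));          // prints 0.25
    }
}
```

Because each rule contributes exactly 0.25, a score can only take the values 0.25, 0.5, 0.75, or 1.0.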
Constructor Summary
Constructors
StandardsText()
Method Summary
Modifier and Type: static ArrayList<StandardReference>
Method: extractStandardReferences(String text, double threshold)
Description: Extracts the standard references found within the given text.
Constructor Details
StandardsText
public StandardsText()
Method Details
extractStandardReferences
public static ArrayList<StandardReference> extractStandardReferences(String text, double threshold)
Extracts the standard references found within the given text.
Parameters:
text - the text from which the standard references are extracted.
threshold - the minimum score a standard reference must have to be returned. For instance, with a threshold of 0.75, only the references with a score greater than or equal to 0.75 are returned.
Returns:
the list of standard references extracted from the given text.
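The threshold is inclusive, which the following sketch illustrates. It is a self-contained stand-in rather than a call into Tika: the class `ThresholdSketch`, the `select` method, and the sample scores are all hypothetical, and a map from reference text to score takes the place of the StandardReference objects the real method returns.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ThresholdSketch {
    // Keeps only the references whose score is greater than or equal to the threshold.
    static List<String> select(Map<String, Double> scoredReferences, double threshold) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Double> entry : scoredReferences.entrySet()) {
            if (entry.getValue() >= threshold) { // inclusive lower bound
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical candidates with already-computed scores, in insertion order.
        Map<String, Double> scored = new LinkedHashMap<>();
        scored.put("ISO Standard 9001", 1.0);
        scored.put("IEEE Standard 754", 0.75);
        scored.put("ABC 123", 0.25);

        System.out.println(select(scored, 0.75)); // prints [ISO Standard 9001, IEEE Standard 754]
    }
}
```

A reference scoring exactly 0.75 is kept at a threshold of 0.75, so lowering the threshold only ever adds references to the result.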