java.lang.Object
  gr.demokritos.iit.tacTools.ACQUAINT2DocumentSet
public class ACQUAINT2DocumentSet
This class implements the IDocumentSet interface and adds a number of
methods that make it easier to retrieve TAC document sets (also called
DOCSTREAMs).
Field Summary | |
---|---|
static java.lang.String | DATELINE_TAG: The tag name of the Dateline tag. |
static int | FROM_TEST_SET: Constant indicating selection from the test set. |
static int | FROM_TRAINING_SET: Constant indicating selection from the training set. |
static int | FROM_WHOLE_SET: Constant indicating selection from the whole (training plus test) set. |
static java.lang.String | TEXT_TAG: The tag name of the Text tag. |
Constructor Summary | |
---|---|
ACQUAINT2DocumentSet(java.lang.String sTACXMLFile) | Creates a new instance of ACQUAINT2DocumentSet, given a corresponding TAC08-formatted file. |
Method Summary | |
---|---|
void | createSets() |
java.util.List | getCategories() |
java.util.Date | getDocDate(java.lang.String sDocID) |
java.util.ArrayList | getFilesFromCategory(java.lang.String sCategoryName) |
java.util.ArrayList | getTestSet(): Returns whole set. |
java.util.ArrayList | getTrainingSet(): Returns whole set. |
java.lang.String | loadDocumentDatelineToString(java.lang.String sDocID): Returns the dateline portion of a given document as a String, if the dateline exists. |
java.lang.String | loadDocumentElement(java.lang.String sDocID, java.lang.String sElement): Returns a given element of a given document as a String, if the element exists. |
java.lang.String | loadDocumentTextToString(java.lang.String sDocID): Returns the text portion of a given document as a String. |
java.lang.String | loadFile(java.lang.String sID): Loads the file and represents it using type |
java.lang.String | loadFullDocumentTextToString(java.lang.String sDocID): Returns the full text of a given document, including dateline and other elements, as a String. |
static void | main(java.lang.String[] sArgs): Testing main function. |
java.util.Set<java.lang.String> | toFilenameSet(int iSubset): Gets the set of all file names in the set or in its training / test subsets. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final int FROM_TRAINING_SET
public static final int FROM_TEST_SET
public static final int FROM_WHOLE_SET
public static final java.lang.String DATELINE_TAG
public static final java.lang.String TEXT_TAG
Constructor Detail |
---|
public ACQUAINT2DocumentSet(java.lang.String sTACXMLFile)
Parameters:
sTACXMLFile - The XML file containing the DOCSTREAM.
Method Detail |
---|
public java.util.List getCategories()
Specified by: getCategories in interface IDocumentSet
public void createSets()
Specified by: createSets in interface IDocumentSet
public java.util.ArrayList getFilesFromCategory(java.lang.String sCategoryName)
Specified by: getFilesFromCategory in interface IDocumentSet
public java.util.ArrayList getTrainingSet()
Specified by: getTrainingSet in interface IDocumentSet
public java.util.ArrayList getTestSet()
Specified by: getTestSet in interface IDocumentSet
public java.lang.String loadFile(java.lang.String sID)
Description copied from interface: IFileLoader
Specified by: loadFile in interface IFileLoader<java.lang.String>
public java.lang.String loadDocumentTextToString(java.lang.String sDocID)
Parameters:
sDocID - The document ID.
public java.lang.String loadFullDocumentTextToString(java.lang.String sDocID)
Parameters:
sDocID - The document ID.
public java.lang.String loadDocumentDatelineToString(java.lang.String sDocID)
Parameters:
sDocID - The document ID.
public final java.lang.String loadDocumentElement(java.lang.String sDocID, java.lang.String sElement)
Parameters:
sDocID - The document ID.
sElement - The element name (e.g. TEXT or DATELINE).
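The element lookup described for loadDocumentElement can be illustrated with a small, self-contained sketch. It is not the class's actual implementation. It assumes that the DOCSTREAM is XML in which each document is a DOC element carrying an id attribute, with child elements such as TEXT and DATELINE. The class name DocstreamElementSketch and the method loadElement are hypothetical, and the sketch takes the XML as a string rather than reading a file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper for illustration; not part of ACQUAINT2DocumentSet. */
final class DocstreamElementSketch {
    private DocstreamElementSketch() {}

    /**
     * Returns the trimmed text content of the first sElement found inside the
     * DOC whose id attribute equals sDocID, or null if the document or the
     * element does not exist. The DOC/id layout is an assumption.
     */
    static String loadElement(String sDocstreamXml, String sDocID, String sElement) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(new InputSource(new StringReader(sDocstreamXml)));

            // Scan every DOC element until the requested ID is found.
            NodeList docs = dom.getElementsByTagName("DOC");
            for (int i = 0; i < docs.getLength(); i++) {
                Element doc = (Element) docs.item(i);
                if (!sDocID.equals(doc.getAttribute("id"))) {
                    continue;
                }
                NodeList matches = doc.getElementsByTagName(sElement);
                if (matches.getLength() == 0) {
                    return null; // document exists, element does not
                }
                // getTextContent() concatenates the text of nested nodes, e.g. <P> children.
                return matches.item(0).getTextContent().trim();
            }
            return null; // no document with that ID
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse DOCSTREAM XML", e);
        }
    }
}
```

Under these assumptions, a call such as DocstreamElementSketch.loadElement(xml, "D1", "DATELINE") mirrors what loadDocumentDatelineToString is documented to return, and passing "TEXT" mirrors loadDocumentTextToString.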
public java.util.Date getDocDate(java.lang.String sDocID)
public static void main(java.lang.String[] sArgs)
public java.util.Set<java.lang.String> toFilenameSet(int iSubset)
Parameters:
iSubset - One of FROM_TRAINING_SET, FROM_TEST_SET, or FROM_WHOLE_SET, indicating the subset used to extract filenames.
Returns:
A Set of strings that are the filenames of the files in the set.
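The way a subset constant selects filenames in toFilenameSet can be sketched as follows. This is a minimal illustration, not the class's code. The numeric values of the three constants are not given in this documentation, so the values below are placeholders. The class name SubsetSelectionSketch and the extra list parameters are hypothetical stand-ins for the object's internal training and test sets.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not part of ACQUAINT2DocumentSet. */
final class SubsetSelectionSketch {
    // Placeholder values: the real constants' values are not documented here.
    static final int FROM_TRAINING_SET = 1;
    static final int FROM_TEST_SET = 2;
    static final int FROM_WHOLE_SET = 3;

    private SubsetSelectionSketch() {}

    /**
     * Collects the filenames of the requested subset into a Set, so that a
     * name present in both lists appears only once in the whole set.
     */
    static Set<String> toFilenameSet(List<String> training, List<String> test, int iSubset) {
        Set<String> names = new LinkedHashSet<>(); // keeps insertion order
        switch (iSubset) {
            case FROM_TRAINING_SET:
                names.addAll(training);
                break;
            case FROM_TEST_SET:
                names.addAll(test);
                break;
            case FROM_WHOLE_SET:
                names.addAll(training);
                names.addAll(test);
                break;
            default:
                throw new IllegalArgumentException("Unknown subset: " + iSubset);
        }
        return names;
    }
}
```

Rejecting unknown values with an exception is a design choice of this sketch. The documentation does not say how the real method treats an argument outside the three constants.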