gr.demokritos.iit.tacTools
Class ACQUAINT2DocumentSet

java.lang.Object
  extended by gr.demokritos.iit.tacTools.ACQUAINT2DocumentSet
All Implemented Interfaces:
IFileLoader<java.lang.String>, IDocumentSet

public class ACQUAINT2DocumentSet
extends java.lang.Object
implements IDocumentSet, IFileLoader<java.lang.String>

This class implements the IDocumentSet interface and adds a number of methods that make it easier to retrieve TAC document sets (also called DOCSTREAMs).


Field Summary
static java.lang.String DATELINE_TAG
          The tag name of the Dateline tag.
static int FROM_TEST_SET
          Constant indicating selection from the test set.
static int FROM_TRAINING_SET
          Constant indicating selection from the training set.
static int FROM_WHOLE_SET
          Constant indicating selection from the whole (training plus test) set.
static java.lang.String TEXT_TAG
          The tag name of the Text tag.
 
Constructor Summary
ACQUAINT2DocumentSet(java.lang.String sTACXMLFile)
          Creates a new instance of ACQUAINT2DocumentSet, given a corresponding TAC08-formatted file.
 
Method Summary
 void createSets()
           
 java.util.List getCategories()
           
 java.util.Date getDocDate(java.lang.String sDocID)
           
 java.util.ArrayList getFilesFromCategory(java.lang.String sCategoryName)
           
 java.util.ArrayList getTestSet()
          Currently returns the whole set.
 java.util.ArrayList getTrainingSet()
          Currently returns the whole set.
 java.lang.String loadDocumentDatelineToString(java.lang.String sDocID)
          Returns the dateline portion of a given document as a String, if the dateline exists.
 java.lang.String loadDocumentElement(java.lang.String sDocID, java.lang.String sElement)
          Returns a given element of a given document as a String, if the element exists.
 java.lang.String loadDocumentTextToString(java.lang.String sDocID)
          Returns the text portion of a given document as a String.
 java.lang.String loadFile(java.lang.String sID)
          Loads the file and represents it using the loader's type parameter (java.lang.String for this class).
 java.lang.String loadFullDocumentTextToString(java.lang.String sDocID)
          Returns the full text of a given document, including dateline and other elements, as a String.
static void main(java.lang.String[] sArgs)
          Testing main function.
 java.util.Set<java.lang.String> toFilenameSet(int iSubset)
          Returns the set of all file names in the document set or in its training / test subsets.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

FROM_TRAINING_SET

public static final int FROM_TRAINING_SET
Constant indicating selection from the training set.

See Also:
Constant Field Values

FROM_TEST_SET

public static final int FROM_TEST_SET
Constant indicating selection from the test set.

See Also:
Constant Field Values

FROM_WHOLE_SET

public static final int FROM_WHOLE_SET
Constant indicating selection from the whole (training plus test) set.

See Also:
Constant Field Values

DATELINE_TAG

public static final java.lang.String DATELINE_TAG
The tag name of the Dateline tag.

See Also:
Constant Field Values

TEXT_TAG

public static final java.lang.String TEXT_TAG
The tag name of the Text tag.

See Also:
Constant Field Values
Constructor Detail

ACQUAINT2DocumentSet

public ACQUAINT2DocumentSet(java.lang.String sTACXMLFile)
Creates a new instance of ACQUAINT2DocumentSet, given a corresponding TAC08-formatted file.

Parameters:
sTACXMLFile - The XML file containing the DOCSTREAM.
Method Detail

getCategories

public java.util.List getCategories()
Specified by:
getCategories in interface IDocumentSet

createSets

public void createSets()
Specified by:
createSets in interface IDocumentSet

getFilesFromCategory

public java.util.ArrayList getFilesFromCategory(java.lang.String sCategoryName)
Specified by:
getFilesFromCategory in interface IDocumentSet

getTrainingSet

public java.util.ArrayList getTrainingSet()
Currently returns the whole set. TODO: Implement properly.

Specified by:
getTrainingSet in interface IDocumentSet

getTestSet

public java.util.ArrayList getTestSet()
Currently returns the whole set. TODO: Implement properly.

Specified by:
getTestSet in interface IDocumentSet

loadFile

public java.lang.String loadFile(java.lang.String sID)
Description copied from interface: IFileLoader
Loads the file and represents it using the loader's type parameter (java.lang.String for this class).

Specified by:
loadFile in interface IFileLoader<java.lang.String>
Returns:
The representation of the file.

loadDocumentTextToString

public java.lang.String loadDocumentTextToString(java.lang.String sDocID)
Returns the text portion of a given document as a String.

Parameters:
sDocID - The document ID.
Returns:
Null if the document is not found, otherwise its text portion (all text found within TEXT tags).

loadFullDocumentTextToString

public java.lang.String loadFullDocumentTextToString(java.lang.String sDocID)
Returns the full text of a given document, including dateline and other elements, as a String. Tags are removed.

Parameters:
sDocID - The document ID.
Returns:
Null if document is not found, otherwise its full text.
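
The "tags are removed" behaviour can be illustrated with a minimal sketch. This is not the class's actual implementation: the class name FullTextSketch, the helper stripTags and the sample markup are hypothetical, and the sketch assumes that tag removal amounts to deleting anything between angle brackets and collapsing the leftover whitespace.

```java
/** Hypothetical illustration only; not part of gr.demokritos.iit.tacTools. */
class FullTextSketch {

    /** Removes anything that looks like a tag and collapses whitespace (assumed behaviour). */
    static String stripTags(String markup) {
        if (markup == null) {
            return null; // mirrors the documented "null if document is not found"
        }
        // Replace each tag with a space so that adjacent elements do not run together.
        String noTags = markup.replaceAll("<[^>]+>", " ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Made-up markup in the spirit of a DOCSTREAM entry.
        String doc = "<DOC><DATELINE>ATHENS</DATELINE><TEXT>Hello world.</TEXT></DOC>";
        System.out.println(stripTags(doc));
    }
}
```

Replacing each tag with a space rather than with nothing is a design choice of the sketch; it keeps the dateline and the body text from being glued together.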

loadDocumentDatelineToString

public java.lang.String loadDocumentDatelineToString(java.lang.String sDocID)
Returns the dateline portion of a given document as a String, if the dateline exists.

Parameters:
sDocID - The document ID.
Returns:
Null if the document is not found, a zero-length String if no dateline was found, otherwise the document's dateline field (all text found within DATELINE tags).

loadDocumentElement

public final java.lang.String loadDocumentElement(java.lang.String sDocID,
                                                  java.lang.String sElement)
Returns a given element of a given document as a String, if the element exists.

Parameters:
sDocID - The document ID.
sElement - The element name (e.g. TEXT or DATELINE).
Returns:
Null if document is not found, a zero length String if the specified element was not found, otherwise the document's element text.
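
The three-way return contract described above (null, zero-length String, element text) is easy to misread, so the following sketch mirrors it with a stand-in class. Everything here is an assumption made for illustration: ElementLookupSketch, its in-memory map, the put method and the regex-based extraction are hypothetical and do not reflect how ACQUAINT2DocumentSet parses its input.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in that mirrors the documented return contract. */
class ElementLookupSketch {
    // Assumed storage: document ID -> raw markup of that document.
    private final Map<String, String> docs = new HashMap<>();

    void put(String docId, String markup) {
        docs.put(docId, markup);
    }

    /** Returns null (unknown document), "" (element absent) or the element text. */
    String loadDocumentElement(String sDocID, String sElement) {
        String markup = docs.get(sDocID);
        if (markup == null) {
            return null; // document not found
        }
        String tag = Pattern.quote(sElement);
        Pattern p = Pattern.compile("<" + tag + ">(.*?)</" + tag + ">", Pattern.DOTALL);
        Matcher m = p.matcher(markup);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append('\n'); // separate repeated elements with a newline
            }
            sb.append(m.group(1).trim());
        }
        return sb.toString(); // zero-length when the element never occurs
    }

    public static void main(String[] args) {
        ElementLookupSketch set = new ElementLookupSketch();
        set.put("D1", "<DOC><DATELINE>ATHENS</DATELINE><TEXT>Hello world.</TEXT></DOC>");

        String text = set.loadDocumentElement("D1", "TEXT");
        if (text == null) {
            System.out.println("No such document");
        } else if (text.isEmpty()) {
            System.out.println("Document has no TEXT element");
        } else {
            System.out.println(text);
        }
    }
}
```

Callers should check for null before checking for emptiness, as main does, since the two cases mean different things: an unknown document versus a known document without the requested element.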

getDocDate

public java.util.Date getDocDate(java.lang.String sDocID)

main

public static void main(java.lang.String[] sArgs)
Testing main function.


toFilenameSet

public java.util.Set<java.lang.String> toFilenameSet(int iSubset)
Returns the set of all file names in the document set or in its training / test subsets.

Parameters:
iSubset - One of FROM_TRAINING_SET, FROM_TEST_SET or FROM_WHOLE_SET, indicating the subset from which filenames are extracted.
Returns:
A Set of strings containing the filenames of the files in the selected subset.
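
As a rough illustration of how the iSubset flag might select filenames, here is a self-contained sketch. It is not the library code: the class SubsetSelectionSketch, its two lists, the numeric constant values and the exception thrown for an unknown flag are all assumptions (the real values appear under Constant Field Values).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in; constant values and fields are assumptions. */
class SubsetSelectionSketch {
    // Assumed values, chosen only so that the sketch compiles.
    static final int FROM_TRAINING_SET = 1;
    static final int FROM_TEST_SET = 2;
    static final int FROM_WHOLE_SET = 3;

    private final List<String> training;
    private final List<String> test;

    SubsetSelectionSketch(List<String> training, List<String> test) {
        this.training = training;
        this.test = test;
    }

    /** Collects the filenames of the requested subset into a sorted Set. */
    Set<String> toFilenameSet(int iSubset) {
        Set<String> names = new TreeSet<>();
        switch (iSubset) {
            case FROM_TRAINING_SET:
                names.addAll(training);
                break;
            case FROM_TEST_SET:
                names.addAll(test);
                break;
            case FROM_WHOLE_SET:
                names.addAll(training);
                names.addAll(test);
                break;
            default:
                // Assumed handling; the documented method does not say what happens here.
                throw new IllegalArgumentException("Unknown subset: " + iSubset);
        }
        return names;
    }

    public static void main(String[] args) {
        SubsetSelectionSketch set = new SubsetSelectionSketch(
                Arrays.asList("a.xml", "b.xml"), Arrays.asList("c.xml"));
        System.out.println(set.toFilenameSet(FROM_WHOLE_SET));
    }
}
```

Note that getTrainingSet and getTestSet are documented as returning the whole set for now, so the real method may return the same filenames for all three flags until that TODO is resolved.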