gr.demokritos.iit.tacTools
Class TAC2008TopicFileSet

java.lang.Object
  extended by gr.demokritos.iit.tacTools.TAC2008TopicFileSet
All Implemented Interfaces:
IFileLoader<java.lang.String>, IDocumentSet

public class TAC2008TopicFileSet
extends java.lang.Object
implements IDocumentSet, IFileLoader<java.lang.String>

Uses a TAC 2008 XML topic definition file to create the set.


Field Summary
protected  java.util.HashSet<java.lang.String> Categories
          The set of category (topic ID) names.
protected  java.lang.String CorpusDir
          The top directory of the TAC2008 topic structure.
protected static java.lang.String DOCSET_A_TAG
          Docset A tag string in XML file.
protected static java.lang.String DOCSET_B_TAG
          Docset B tag string in XML file.
protected static java.lang.String DOCUMENT_TAG
          Document tag string in XML file.
static int FROM_TEST_SET
          Constant indicating selection from the test set.
static int FROM_TRAINING_SET
          Constant indicating selection from the training set.
static int FROM_WHOLE_SET
          Constant indicating selection from the whole (training plus test) set.
protected static java.lang.String NARRATIVE_TAG
          Narrative tag string in XML file.
protected  java.util.ArrayList<CategorizedFileEntry> TestFiles
          The list of test files (actually group B files).
protected static java.lang.String TITLE_TAG
          Title tag string in XML file.
protected static java.lang.String TOPIC_TAG
          Topic tag string in XML file.
protected  java.util.ArrayList<CategorizedFileEntry> TrainingFiles
          The list of training files (actually group A files).
 
Constructor Summary
TAC2008TopicFileSet(java.lang.String sTopicXMLFile, java.lang.String sCorpusRootDir)
          Initializes the document set, given a TAC2008 topic XML file.
 
Method Summary
 void createSets()
           
 java.util.List getCategories()
          Actually returns the list of topics from the file.
protected  java.lang.String getDocumentText(java.lang.String sDocID, boolean bIncludeTitle)
          Returns a given element of a given document as a String, if the element exists.
 java.util.List<java.lang.String> getFilenamesFromCategory(java.lang.String sCategoryName, int iFromWhatPart)
          Returns all files belonging to a specified category, and belonging to a specified subset of the set.
 java.util.ArrayList getFilesFromCategory(java.lang.String sCategoryName)
           
protected  java.util.List getFilesFromTopic(java.lang.String sTopicID, int iFromWhichSet)
          Returns a list of filenames that meet certain criteria: a given topic ID and a docset.
 java.util.ArrayList getTestSet()
          Returns group B files, described as test set.
 java.lang.String getTopicDefinition(java.lang.String sTopicID)
          Returns the narrative question of a given topic.
protected  java.lang.String getTopicNarrative(java.lang.String sTopicID)
          Returns the text of the narrative field, given a topic.
protected  org.w3c.dom.Node getTopicNode(java.lang.String sTopicID)
          Returns the node of a given topic in the XML document.
protected  java.lang.String getTopicTitle(java.lang.String sTopicID)
          Returns the text of the topic title field, given a topic.
 java.util.ArrayList getTrainingSet()
          Returns group A files, described as training set.
 java.lang.String loadFile(java.lang.String sID)
          Loads the text of a given file, given its filename.
static void main(java.lang.String[] sArgs)
          Used for testing purposes only.
 java.util.Set<java.lang.String> toFilenameSet(int iSubset)
          Returns the set of all file names in the set or in its training / test subsets.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

TOPIC_TAG

protected static java.lang.String TOPIC_TAG
Topic tag string in XML file.


TITLE_TAG

protected static java.lang.String TITLE_TAG
Title tag string in XML file.


DOCSET_A_TAG

protected static java.lang.String DOCSET_A_TAG
Docset A tag string in XML file.


DOCSET_B_TAG

protected static java.lang.String DOCSET_B_TAG
Docset B tag string in XML file.


NARRATIVE_TAG

protected static java.lang.String NARRATIVE_TAG
Narrative tag string in XML file.


DOCUMENT_TAG

protected static java.lang.String DOCUMENT_TAG
Document tag string in XML file.


FROM_TRAINING_SET

public static final int FROM_TRAINING_SET
Constant indicating selection from the training set.

See Also:
Constant Field Values

FROM_TEST_SET

public static final int FROM_TEST_SET
Constant indicating selection from the test set.

See Also:
Constant Field Values

FROM_WHOLE_SET

public static final int FROM_WHOLE_SET
Constant indicating selection from the whole (training plus test) set.

See Also:
Constant Field Values

CorpusDir

protected java.lang.String CorpusDir
The top directory of the TAC2008 topic structure.


Categories

protected java.util.HashSet<java.lang.String> Categories
The set of category (topic ID) names.


TrainingFiles

protected java.util.ArrayList<CategorizedFileEntry> TrainingFiles
The list of training files (actually group A files).


TestFiles

protected java.util.ArrayList<CategorizedFileEntry> TestFiles
The list of test files (actually group B files).

Constructor Detail

TAC2008TopicFileSet

public TAC2008TopicFileSet(java.lang.String sTopicXMLFile,
                           java.lang.String sCorpusRootDir)
                    throws javax.xml.parsers.ParserConfigurationException,
                           org.xml.sax.SAXException,
                           java.io.IOException
Initializes the document set, given a TAC2008 topic XML file.

Parameters:
sTopicXMLFile - The filename of the topic file.
sCorpusRootDir - The base directory of the TAC2008 test corpus directory structure.
Throws:
javax.xml.parsers.ParserConfigurationException
org.xml.sax.SAXException
java.io.IOException
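
The three declared exceptions match those thrown by the standard JAXP DOM parser, which suggests the constructor parses the topic file into a DOM document. The following is a minimal sketch of that kind of parsing, not this class's actual implementation. The tag names, the "id" attribute, and the sample XML are assumptions made for illustration; the real values of TOPIC_TAG, DOCSET_A_TAG, DOCSET_B_TAG and DOCUMENT_TAG are not stated on this page.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

final class TopicXmlSketch {
    // Hypothetical tag names; the real constant values are not documented here.
    static final String TOPIC_TAG = "topic";
    static final String DOCSET_A_TAG = "docsetA";
    static final String DOCSET_B_TAG = "docsetB";
    static final String DOCUMENT_TAG = "doc";

    // Hypothetical topic file content, kept in memory so the sketch is self-contained.
    static final String SAMPLE =
          "<topics>"
        + "<topic id=\"T1\">"
        + "<docsetA><doc id=\"doc-a1\"/><doc id=\"doc-a2\"/></docsetA>"
        + "<docsetB><doc id=\"doc-b1\"/></docsetB>"
        + "</topic>"
        + "<topic id=\"T2\"><docsetA><doc id=\"doc-a3\"/></docsetA><docsetB/></topic>"
        + "</topics>";

    /** Parses XML text into a DOM document; throws the same exceptions the constructor declares. */
    static Document parse(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    /** Collects the document IDs listed under one docset of one topic. */
    static List<String> docIds(Document doc, String topicId, String docsetTag) {
        List<String> ids = new ArrayList<>();
        NodeList topics = doc.getElementsByTagName(TOPIC_TAG);
        for (int i = 0; i < topics.getLength(); i++) {
            Element topic = (Element) topics.item(i);
            if (!topicId.equals(topic.getAttribute("id"))) {
                continue; // not the topic we are looking for
            }
            NodeList docsets = topic.getElementsByTagName(docsetTag);
            for (int j = 0; j < docsets.getLength(); j++) {
                NodeList docs = ((Element) docsets.item(j)).getElementsByTagName(DOCUMENT_TAG);
                for (int k = 0; k < docs.getLength(); k++) {
                    ids.add(((Element) docs.item(k)).getAttribute("id"));
                }
            }
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println(docIds(doc, "T1", DOCSET_A_TAG));
    }
}
```

In this sketch, docset A entries would play the role of TrainingFiles and docset B entries the role of TestFiles.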
Method Detail

createSets

public void createSets()
Specified by:
createSets in interface IDocumentSet

getFilesFromCategory

public final java.util.ArrayList getFilesFromCategory(java.lang.String sCategoryName)
Specified by:
getFilesFromCategory in interface IDocumentSet

getFilenamesFromCategory

public final java.util.List<java.lang.String> getFilenamesFromCategory(java.lang.String sCategoryName,
                                                                       int iFromWhatPart)
Returns all files belonging to a specified category, and belonging to a specified subset of the set.

Parameters:
sCategoryName - The name of the category the files should belong to.
iFromWhatPart - One of the FROM_TRAINING_SET, FROM_TEST_SET, FROM_WHOLE_SET values to indicate which subset should be used.
Returns:
A list of filenames corresponding to the selected files.
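
The subset parameter can be illustrated with a small stand-alone sketch. It is not the class's actual code: the constant values, the Entry class (a stand-in for CategorizedFileEntry) and the sample file names are all hypothetical. The real constant values are listed under "Constant Field Values".

```java
import java.util.ArrayList;
import java.util.List;

final class SubsetSelectionSketch {
    // Hypothetical values, chosen only for this illustration.
    static final int FROM_TRAINING_SET = 1;
    static final int FROM_TEST_SET = 2;
    static final int FROM_WHOLE_SET = 3;

    /** Minimal stand-in for CategorizedFileEntry: a file name plus its category. */
    static final class Entry {
        final String fileName;
        final String category;

        Entry(String fileName, String category) {
            this.fileName = fileName;
            this.category = category;
        }
    }

    final List<Entry> trainingFiles = new ArrayList<>(); // group A files
    final List<Entry> testFiles = new ArrayList<>();     // group B files

    /** Returns the file names of one category, drawn from the requested subset. */
    List<String> getFilenamesFromCategory(String category, int fromWhatPart) {
        List<Entry> pool = new ArrayList<>();
        if (fromWhatPart == FROM_TRAINING_SET || fromWhatPart == FROM_WHOLE_SET) {
            pool.addAll(trainingFiles);
        }
        if (fromWhatPart == FROM_TEST_SET || fromWhatPart == FROM_WHOLE_SET) {
            pool.addAll(testFiles);
        }
        List<String> names = new ArrayList<>();
        for (Entry entry : pool) {
            if (entry.category.equals(category)) {
                names.add(entry.fileName);
            }
        }
        return names;
    }

    /** Builds a small hypothetical set with two topics. */
    static SubsetSelectionSketch sample() {
        SubsetSelectionSketch set = new SubsetSelectionSketch();
        set.trainingFiles.add(new Entry("a1.txt", "T1"));
        set.trainingFiles.add(new Entry("a2.txt", "T2"));
        set.testFiles.add(new Entry("b1.txt", "T1"));
        return set;
    }

    public static void main(String[] args) {
        System.out.println(sample().getFilenamesFromCategory("T1", FROM_WHOLE_SET));
    }
}
```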

getTrainingSet

public final java.util.ArrayList getTrainingSet()
Returns group A files, described as training set.

Specified by:
getTrainingSet in interface IDocumentSet

getTestSet

public final java.util.ArrayList getTestSet()
Returns group B files, described as test set.

Specified by:
getTestSet in interface IDocumentSet

loadFile

public java.lang.String loadFile(java.lang.String sID)
Loads the text of a given file, given its filename.

Specified by:
loadFile in interface IFileLoader<java.lang.String>
Parameters:
sID - The filename of the file to load.
Returns:
A String representing the title and text of the given file.

getDocumentText

protected final java.lang.String getDocumentText(java.lang.String sDocID,
                                                 boolean bIncludeTitle)
Returns a given element of a given document as a String, if the element exists.

Parameters:
sDocID - The document ID.
bIncludeTitle - If true, the title precedes the text in the returned String. Otherwise, it is omitted.
Returns:
Null if document is not found, a zero length String if the specified element was not found, otherwise the document's element text.
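
Because the return value distinguishes three cases, callers (for example, subclasses) may want to check for null before checking for an empty String. A minimal sketch of that check follows; the class name, method name and messages are hypothetical.

```java
final class DocumentTextContractSketch {
    /** Maps the documented return values of getDocumentText to a short status message. */
    static String describe(String text) {
        if (text == null) {
            return "document not found";
        }
        if (text.isEmpty()) {
            return "element not found";
        }
        return "text available";
    }

    public static void main(String[] args) {
        System.out.println(describe(null));
        System.out.println(describe(""));
        System.out.println(describe("Some document text."));
    }
}
```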

toFilenameSet

public java.util.Set<java.lang.String> toFilenameSet(int iSubset)
Returns the set of all file names in the set or in its training / test subsets.

Parameters:
iSubset - One of FROM_TRAINING_SET, FROM_TEST_SET or FROM_WHOLE_SET, indicating the subset used to extract filenames.
Returns:
A Set of strings, that are the filenames of the files in the set.

getCategories

public java.util.List getCategories()
Actually returns the list of topics from the file.

Specified by:
getCategories in interface IDocumentSet

getTopicNode

protected final org.w3c.dom.Node getTopicNode(java.lang.String sTopicID)
Returns the node of a given topic in the XML document.

Parameters:
sTopicID - The topic ID of interest.
Returns:
Null if the topic was not found, else the corresponding node.

getTopicTitle

protected java.lang.String getTopicTitle(java.lang.String sTopicID)
Returns the text of the topic title field, given a topic.

Parameters:
sTopicID - The topic of interest.
Returns:
The topic title field text or null if topic was not found.

getTopicNarrative

protected java.lang.String getTopicNarrative(java.lang.String sTopicID)
Returns the text of the narrative field, given a topic.

Parameters:
sTopicID - The topic of interest.
Returns:
The narrative field text or null if topic was not found.
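
The lookup pattern shared by getTopicNode, getTopicTitle and getTopicNarrative (find the topic node, return null if it is absent, otherwise read a child field) might look roughly as follows. This is a sketch under assumptions, not the class's code: the tag names, the "id" attribute and the sample XML are hypothetical, and returning an empty String when the topic has no narrative element is a choice made here, since that case is not documented.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class TopicLookupSketch {
    // Hypothetical tag names; the real values of TOPIC_TAG and NARRATIVE_TAG are not documented here.
    static final String TOPIC_TAG = "topic";
    static final String NARRATIVE_TAG = "narrative";

    // Hypothetical topic file content.
    static final String SAMPLE =
          "<topics>"
        + "<topic id=\"T1\"><title>Title one</title>"
        + "<narrative> What happened? </narrative></topic>"
        + "<topic id=\"T2\"><title>Title two</title></topic>"
        + "</topics>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    /** Returns the topic element with the given ID, or null if there is none. */
    static Node getTopicNode(Document doc, String topicId) {
        NodeList topics = doc.getElementsByTagName(TOPIC_TAG);
        for (int i = 0; i < topics.getLength(); i++) {
            Element topic = (Element) topics.item(i);
            if (topicId.equals(topic.getAttribute("id"))) {
                return topic;
            }
        }
        return null;
    }

    /** Returns the trimmed narrative text, or null if the topic was not found. */
    static String getTopicNarrative(Document doc, String topicId) {
        Node topic = getTopicNode(doc, topicId);
        if (topic == null) {
            return null;
        }
        NodeList narratives = ((Element) topic).getElementsByTagName(NARRATIVE_TAG);
        // Assumption: a topic without a narrative element yields an empty String.
        return narratives.getLength() == 0 ? "" : narratives.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(getTopicNarrative(parse(SAMPLE), "T1"));
    }
}
```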

getFilesFromTopic

protected java.util.List getFilesFromTopic(java.lang.String sTopicID,
                                           int iFromWhichSet)
Returns a list of filenames that meet certain criteria: a given topic ID and a docset.

Parameters:
sTopicID - The topic of interest.
iFromWhichSet - One of the class constants FROM_TRAINING_SET, FROM_TEST_SET or FROM_WHOLE_SET, indicating docset A, docset B or both, respectively.
Returns:
A String List of the files meeting the criteria.

main

public static void main(java.lang.String[] sArgs)
Used for testing purposes only.

Parameters:
sArgs - Unused.

getTopicDefinition

public java.lang.String getTopicDefinition(java.lang.String sTopicID)
Returns the narrative question of a given topic.

Parameters:
sTopicID - The ID of the topic of interest.
Returns:
Null if the topic is not found; otherwise a String containing the narrative of the given topic.