gr.demokritos.iit.tacTools
Class TAC2008UpdateSummarizationFileSet

java.lang.Object
  extended by gr.demokritos.iit.tacTools.TAC2008UpdateSummarizationFileSet
All Implemented Interfaces:
IFileLoader<java.lang.String>, IDocumentSet

public class TAC2008UpdateSummarizationFileSet
extends java.lang.Object
implements IDocumentSet, IFileLoader<java.lang.String>

A class that takes a TAC2008 topic structure directory and can return group A or group B documents, given an XML topic file and a topic ID. The directory structure contains one top-level directory for every topic ID. Each topic ID directory in turn contains two directories, one for each group of documents.
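
The two-level layout described above can be illustrated with a minimal, self-contained sketch that walks such a directory tree. This is not the class's actual implementation: the class name TopicLayoutSketch, its helper methods, and the directory names topic1, topic2, groupA and groupB are hypothetical placeholders chosen for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of gr.demokritos.iit.tacTools.
class TopicLayoutSketch {

    /** Returns the names of the immediate sub-directories of dir, sorted alphabetically. */
    static List<String> subDirNames(Path dir) {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (Files.isDirectory(p)) {
                    names.add(p.getFileName().toString());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(names);
        return names;
    }

    /** Maps every topic ID (top-level directory name) to the names of its group directories. */
    static Map<String, List<String>> topicGroups(Path corpusDir) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String topic : subDirNames(corpusDir)) {
            groups.put(topic, subDirNames(corpusDir.resolve(topic)));
        }
        return groups;
    }

    /** Builds a small throw-away corpus; topic and group names are assumptions. */
    static Path createDemoCorpus() {
        try {
            Path root = Files.createTempDirectory("tac-demo");
            for (String topic : new String[] {"topic1", "topic2"}) {
                Files.createDirectories(root.resolve(topic).resolve("groupA"));
                Files.createDirectories(root.resolve(topic).resolve("groupB"));
            }
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Prints each topic ID together with its two group directories.
        System.out.println(topicGroups(createDemoCorpus()));
    }
}
```

In this sketch the topic IDs play the role of the category names, and the two sub-directories of each topic correspond to the training (group A) and test (group B) subsets.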


Field Summary
protected  java.util.HashSet<java.lang.String> Categories
          The set of category (topic ID) names.
protected  java.lang.String CorpusDir
          The top directory of the TAC2008 topic structure.
static int FROM_TEST_SET
          Constant indicating selection from the test set.
static int FROM_TRAINING_SET
          Constant indicating selection from the training set.
static int FROM_WHOLE_SET
          Constant indicating selection from the whole (training plus test) set.
protected  java.util.ArrayList<CategorizedFileEntry> TestFiles
          The list of test files (actually group B files).
protected  java.util.ArrayList<CategorizedFileEntry> TrainingFiles
          The list of training files (actually group A files).
 
Constructor Summary
TAC2008UpdateSummarizationFileSet(java.lang.String sCorpusDir)
          Initializes the set, using the given directory as the corpus directory.
 
Method Summary
 void createSets()
           
 java.util.List getCategories()
          Returns a list of the topic IDs, as category names.
protected  java.lang.String getDocumentText(java.lang.String sDocID, boolean bIncludeTitle)
          Returns a given element of a given document as a String, if the element exists.
 java.util.List<java.lang.String> getFilenamesFromCategory(java.lang.String sCategoryName, int iFromWhatPart)
          Returns all files belonging to a specified category, and belonging to a specified subset of the set.
 java.util.ArrayList getFilesFromCategory(java.lang.String sCategoryName)
           
 java.util.ArrayList getTestSet()
          Returns group B files, described as test set.
 java.util.ArrayList getTrainingSet()
          Returns group A files, described as training set.
 java.lang.String loadFile(java.lang.String sID)
          Loads the text of a given file, given its filename.
static void main(java.lang.String[] sArgs)
          Used for testing purposes only.
 java.util.Set<java.lang.String> toFilenameSet(int iSubset)
          Returns the set of all filenames in the whole set or in its training / test subsets.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

FROM_TRAINING_SET

public static final int FROM_TRAINING_SET
Constant indicating selection from the training set.

See Also:
Constant Field Values

FROM_TEST_SET

public static final int FROM_TEST_SET
Constant indicating selection from the test set.

See Also:
Constant Field Values

FROM_WHOLE_SET

public static final int FROM_WHOLE_SET
Constant indicating selection from the whole (training plus test) set.

See Also:
Constant Field Values

CorpusDir

protected java.lang.String CorpusDir
The top directory of the TAC2008 topic structure.


Categories

protected java.util.HashSet<java.lang.String> Categories
The set of category (topic ID) names.


TrainingFiles

protected java.util.ArrayList<CategorizedFileEntry> TrainingFiles
The list of training files (actually group A files).


TestFiles

protected java.util.ArrayList<CategorizedFileEntry> TestFiles
The list of test files (actually group B files).

Constructor Detail

TAC2008UpdateSummarizationFileSet

public TAC2008UpdateSummarizationFileSet(java.lang.String sCorpusDir)
Initializes the set, using the given directory as the corpus directory.

Parameters:
sCorpusDir - The top directory of the corpus structure.
Method Detail

getCategories

public java.util.List getCategories()
Returns a list of the topic IDs, as category names.

Specified by:
getCategories in interface IDocumentSet
Returns:
A String List containing category names, which actually correspond to topic IDs.

createSets

public void createSets()
Specified by:
createSets in interface IDocumentSet

getFilesFromCategory

public final java.util.ArrayList getFilesFromCategory(java.lang.String sCategoryName)
Specified by:
getFilesFromCategory in interface IDocumentSet

getFilenamesFromCategory

public final java.util.List<java.lang.String> getFilenamesFromCategory(java.lang.String sCategoryName,
                                                                       int iFromWhatPart)
Returns all files belonging to a specified category, and belonging to a specified subset of the set.

Parameters:
sCategoryName - The name of the category the files should belong to.
iFromWhatPart - One of the FROM_TRAINING_SET, FROM_TEST_SET, FROM_WHOLE_SET values to indicate which subset should be used.
Returns:
A list of filenames corresponding to the selected files.
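
The way the iFromWhatPart argument narrows the result can be sketched as follows. This is an illustrative model only, not the class's code: the numeric constant values, the Entry stand-in for CategorizedFileEntry, and the filenamesFromCategory helper are all assumptions (the real constant values are listed under "Constant Field Values").

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the subset selection; not the library implementation.
class SubsetSelectionSketch {
    // Assumed values, for illustration only.
    static final int FROM_TRAINING_SET = 1;
    static final int FROM_TEST_SET = 2;
    static final int FROM_WHOLE_SET = 3;

    /** Minimal stand-in for CategorizedFileEntry: a filename plus its category. */
    static final class Entry {
        final String fileName;
        final String category;

        Entry(String fileName, String category) {
            this.fileName = fileName;
            this.category = category;
        }
    }

    /** Collects the filenames of one category from the requested subset. */
    static List<String> filenamesFromCategory(List<Entry> training, List<Entry> test,
                                              String category, int fromWhatPart) {
        List<Entry> pool = new ArrayList<>();
        if (fromWhatPart == FROM_TRAINING_SET || fromWhatPart == FROM_WHOLE_SET) {
            pool.addAll(training); // group A files
        }
        if (fromWhatPart == FROM_TEST_SET || fromWhatPart == FROM_WHOLE_SET) {
            pool.addAll(test); // group B files
        }
        List<String> result = new ArrayList<>();
        for (Entry e : pool) {
            if (e.category.equals(category)) {
                result.add(e.fileName);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Entry> training = Arrays.asList(new Entry("a1.txt", "topic1"), new Entry("a2.txt", "topic2"));
        List<Entry> test = Arrays.asList(new Entry("b1.txt", "topic1"));
        System.out.println(filenamesFromCategory(training, test, "topic1", FROM_WHOLE_SET));
    }
}
```

With the sample data in main, selecting the whole set for topic1 yields the group A file followed by the group B file, while the two narrower constants yield one file each.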

getTrainingSet

public final java.util.ArrayList getTrainingSet()
Returns group A files, described as training set.

Specified by:
getTrainingSet in interface IDocumentSet

getTestSet

public final java.util.ArrayList getTestSet()
Returns group B files, described as test set.

Specified by:
getTestSet in interface IDocumentSet

loadFile

public java.lang.String loadFile(java.lang.String sID)
Loads the text of a given file, given its filename.

Specified by:
loadFile in interface IFileLoader<java.lang.String>
Parameters:
sID - The filename of the file to load.
Returns:
A String representing the title and text of the given file.

main

public static void main(java.lang.String[] sArgs)
Used for testing purposes only.

Parameters:
sArgs - Unused.

getDocumentText

protected final java.lang.String getDocumentText(java.lang.String sDocID,
                                                 boolean bIncludeTitle)
Returns a given element of a given document as a String, if the element exists.

Parameters:
sDocID - The document ID.
bIncludeTitle - If true, the title precedes the returned text. Otherwise, it is omitted.
Returns:
null if the document is not found, a zero-length String if the specified element is not found, otherwise the document's element text.
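
Because the return value distinguishes three cases, a caller in a subclass would typically check for null before checking for emptiness. The sketch below shows one possible way to do that; the class DocumentTextContractSketch, the Outcome enum and the classify helper are hypothetical names introduced here, not part of the library.

```java
// Hypothetical caller-side helper illustrating the documented return contract.
class DocumentTextContractSketch {

    enum Outcome { DOCUMENT_MISSING, ELEMENT_MISSING, TEXT_FOUND }

    /** Interprets a value returned under the contract described above. */
    static Outcome classify(String text) {
        if (text == null) {
            return Outcome.DOCUMENT_MISSING; // document was not found
        }
        if (text.isEmpty()) {
            return Outcome.ELEMENT_MISSING; // document found, element absent
        }
        return Outcome.TEXT_FOUND; // usable element text
    }

    public static void main(String[] args) {
        System.out.println(classify(null));
        System.out.println(classify(""));
        System.out.println(classify("Some title. Some body text."));
    }
}
```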

toFilenameSet

public java.util.Set<java.lang.String> toFilenameSet(int iSubset)
Returns the set of all filenames in the whole set or in its training / test subsets.

Parameters:
iSubset - One of the FROM_TRAINING_SET, FROM_TEST_SET, FROM_WHOLE_SET values, indicating the subset from which filenames are extracted.
Returns:
A Set of strings, that are the filenames of the files in the set.