java.lang.Object
  gr.demokritos.iit.tacTools.TAC2008UpdateSummarizationFileSet
public class TAC2008UpdateSummarizationFileSet
A class that takes a TAC2008 topic structure directory and can return the group A or group B documents, given an XML topic file and a topic ID. The directory structure contains a top-level directory for every topic ID. Each topic ID directory in turn contains two directories, one for each group of documents.
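The layout described above can be sketched with plain `java.nio.file` calls. This is only an illustration of the directory structure, not the class's actual implementation: the class name `TopicLayoutSketch`, the topic IDs `D0801` and `D0802`, the group directory names `A` and `B`, and the document file names are all assumptions made for the example, and the real TAC2008 names may differ.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch of the layout: corpusDir/topicID/group/document. */
public class TopicLayoutSketch {

    /** Builds a tiny throwaway corpus; every name here is an illustrative assumption. */
    public static Path buildDemoCorpus() {
        try {
            Path root = Files.createTempDirectory("tac2008-demo");
            for (String topic : Arrays.asList("D0801", "D0802")) {
                for (String group : Arrays.asList("A", "B")) {
                    Path dir = Files.createDirectories(root.resolve(topic).resolve(group));
                    Files.write(dir.resolve(topic + "-" + group + "-doc1"),
                            "sample text".getBytes(StandardCharsets.UTF_8));
                }
            }
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the sorted names of the entries directly under the given directory. */
    public static List<String> childNames(Path dir) {
        try (Stream<Path> children = Files.list(dir)) {
            return children.map(p -> p.getFileName().toString())
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = buildDemoCorpus();
        // Each top-level directory corresponds to one topic ID (a "category").
        for (String topic : childNames(root)) {
            // Each topic directory holds one directory per document group.
            for (String group : childNames(root.resolve(topic))) {
                System.out.println(topic + "/" + group + " -> "
                        + childNames(root.resolve(topic).resolve(group)));
            }
        }
    }
}
```

Running `main` lists each topic, its two group directories, and the documents in each group, which mirrors the category and group split that the class exposes.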
Field Summary | |
---|---|
protected java.util.HashSet<java.lang.String> |
Categories
The set of category (topic ID) names. |
protected java.lang.String |
CorpusDir
The top directory of the TAC2008 topic structure |
static int |
FROM_TEST_SET
Constant indicating selection from the test set. |
static int |
FROM_TRAINING_SET
Constant indicating selection from the training set. |
static int |
FROM_WHOLE_SET
Constant indicating selection from the whole (training plus test) set. |
protected java.util.ArrayList<CategorizedFileEntry> |
TestFiles
The list of test files (actually group B files). |
protected java.util.ArrayList<CategorizedFileEntry> |
TrainingFiles
The list of training files (actually group A files). |
Constructor Summary | |
---|---|
TAC2008UpdateSummarizationFileSet(java.lang.String sCorpusDir)
Initialize the set using as corpus dir a given dir. |
Method Summary | |
---|---|
void |
createSets()
|
java.util.List |
getCategories()
Returns a list of the topic IDs, as category names. |
protected java.lang.String |
getDocumentText(java.lang.String sDocID,
boolean bIncludeTitle)
Returns a given element of a given document as a String, if the element exists. |
java.util.List<java.lang.String> |
getFilenamesFromCategory(java.lang.String sCategoryName,
int iFromWhatPart)
Returns all files belonging to a specified category, and belonging to a specified subset of the set. |
java.util.ArrayList |
getFilesFromCategory(java.lang.String sCategoryName)
|
java.util.ArrayList |
getTestSet()
Returns group B files, described as the test set. |
java.util.ArrayList |
getTrainingSet()
Returns group A files, described as the training set. |
java.lang.String |
loadFile(java.lang.String sID)
Loads the text of a given file, given its filename. |
static void |
main(java.lang.String[] sArgs)
Used for testing purposes only. |
java.util.Set<java.lang.String> |
toFilenameSet(int iSubset)
Returns the set of all file names in the set or in its training / test subsets. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final int FROM_TRAINING_SET
public static final int FROM_TEST_SET
public static final int FROM_WHOLE_SET
protected java.lang.String CorpusDir
protected java.util.HashSet<java.lang.String> Categories
protected java.util.ArrayList<CategorizedFileEntry> TrainingFiles
protected java.util.ArrayList<CategorizedFileEntry> TestFiles
Constructor Detail |
---|
public TAC2008UpdateSummarizationFileSet(java.lang.String sCorpusDir)
Parameters:
sCorpusDir - The top directory of the corpus structure.
Method Detail |
---|
public java.util.List getCategories()
Specified by:
getCategories in interface IDocumentSet
Returns:
A String List containing category names, which actually correspond to topic IDs.
public void createSets()
Specified by:
createSets in interface IDocumentSet
public final java.util.ArrayList getFilesFromCategory(java.lang.String sCategoryName)
Specified by:
getFilesFromCategory in interface IDocumentSet
public final java.util.List<java.lang.String> getFilenamesFromCategory(java.lang.String sCategoryName, int iFromWhatPart)
Parameters:
sCategoryName - The name of the category the files should belong to.
iFromWhatPart - One of the FROM_TRAINING_SET, FROM_TEST_SET, FROM_WHOLE_SET values to indicate which subset should be used.
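The subset flag can be pictured with a small stand-alone sketch. It is not the library's code: the class `SubsetSelectionSketch`, its `Entry` type, the `addTraining` / `addTest` helpers, and the numeric constant values are all assumptions made for illustration, since this page does not state the real constant values or the internals of `CategorizedFileEntry`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of selecting files by category and subset flag. */
public class SubsetSelectionSketch {
    // Assumed values; the documentation does not state the real constants.
    public static final int FROM_TRAINING_SET = 1;
    public static final int FROM_TEST_SET = 2;
    public static final int FROM_WHOLE_SET = 3;

    /** A file name paired with its category (topic ID). */
    private static final class Entry {
        final String fileName;
        final String category;

        Entry(String fileName, String category) {
            this.fileName = fileName;
            this.category = category;
        }
    }

    private final List<Entry> trainingFiles = new ArrayList<>(); // group A files
    private final List<Entry> testFiles = new ArrayList<>();     // group B files

    public void addTraining(String fileName, String category) {
        trainingFiles.add(new Entry(fileName, category));
    }

    public void addTest(String fileName, String category) {
        testFiles.add(new Entry(fileName, category));
    }

    /** Returns the file names in the chosen subset that belong to the category. */
    public List<String> filenamesFromCategory(String category, int fromWhatPart) {
        List<Entry> pool = new ArrayList<>();
        switch (fromWhatPart) {
            case FROM_TRAINING_SET:
                pool.addAll(trainingFiles);
                break;
            case FROM_TEST_SET:
                pool.addAll(testFiles);
                break;
            case FROM_WHOLE_SET:
                pool.addAll(trainingFiles);
                pool.addAll(testFiles);
                break;
            default:
                throw new IllegalArgumentException("Unknown subset: " + fromWhatPart);
        }
        List<String> names = new ArrayList<>();
        for (Entry e : pool) {
            if (e.category.equals(category)) {
                names.add(e.fileName);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        SubsetSelectionSketch set = new SubsetSelectionSketch();
        set.addTraining("docA1", "D0801");
        set.addTest("docB1", "D0801");
        set.addTraining("docA2", "D0802");
        System.out.println(set.filenamesFromCategory("D0801", FROM_WHOLE_SET));
    }
}
```

Rejecting unknown flag values with an exception is a design choice of this sketch; how the real method treats an unexpected `iFromWhatPart` value is not documented here.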
public final java.util.ArrayList getTrainingSet()
Specified by:
getTrainingSet in interface IDocumentSet
public final java.util.ArrayList getTestSet()
Specified by:
getTestSet in interface IDocumentSet
public java.lang.String loadFile(java.lang.String sID)
Specified by:
loadFile in interface IFileLoader<java.lang.String>
Parameters:
sID - The filename of the file to load.
Returns:
A String representing the title and text of the given file.
public static void main(java.lang.String[] sArgs)
Parameters:
sArgs - Unused.
protected final java.lang.String getDocumentText(java.lang.String sDocID, boolean bIncludeTitle)
Parameters:
sDocID - The document ID.
bIncludeTitle - If true, the title precedes the returned text. Otherwise, it is omitted.
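The effect of the bIncludeTitle flag can be shown with a tiny stand-alone sketch. The class `DocumentTextSketch`, the method `documentText`, and the newline separator between title and text are assumptions for illustration only; this page does not say how the real method joins the two parts or how it reads them from the corpus.

```java
/** Hypothetical sketch of optional title inclusion; not the library's code. */
public class DocumentTextSketch {

    /**
     * Returns the body text, preceded by the title when includeTitle is true.
     * The newline separator is an assumption made for this example.
     */
    public static String documentText(String title, String body, boolean includeTitle) {
        return includeTitle ? title + "\n" + body : body;
    }

    public static void main(String[] args) {
        // With the flag set, the title comes first; without it, only the body is returned.
        System.out.println(documentText("Sample title", "Sample body.", true));
        System.out.println(documentText("Sample title", "Sample body.", false));
    }
}
```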
public java.util.Set<java.lang.String> toFilenameSet(int iSubset)
Parameters:
iSubset - One of FROM_TRAINING_SET, FROM_TEST_SET, or FROM_WHOLE_SET, indicating the subset used to extract filenames.
Returns:
A Set of strings that are the filenames of the files in the set.