java.lang.Object
  gr.demokritos.iit.tacTools.TAC2008UpdateSummarizationFileSet
public class TAC2008UpdateSummarizationFileSet
A class that takes a TAC2008 topic structure directory and can return the group A or group B documents, given an XML topic file and a topic ID. The directory structure contains one top-level directory for every topic ID. Each topic ID directory in turn contains two directories, one for each group of documents.
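The directory layout described above can be illustrated with a small, self-contained sketch. It does not use the documented class; `TopicLayoutSketch`, its methods, and the topic and group directory names (`T1`, `T1-A`, and so on) are all hypothetical and chosen only for illustration. It assumes that every top-level directory is a topic ID and that each one holds its two group directories directly.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical stand-in for illustration; not part of the documented library.
class TopicLayoutSketch {

    // Lists the names of the immediate subdirectories of dir, sorted by name.
    static List<String> subdirectoryNames(Path dir) {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                if (Files.isDirectory(child)) {
                    names.add(child.getFileName().toString());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(names);
        return names;
    }

    // Maps each topic ID (a top-level directory name) to its group directory names.
    static SortedMap<String, List<String>> scan(Path corpusDir) {
        SortedMap<String, List<String>> groupsByTopic = new TreeMap<>();
        for (String topicId : subdirectoryNames(corpusDir)) {
            groupsByTopic.put(topicId, subdirectoryNames(corpusDir.resolve(topicId)));
        }
        return groupsByTopic;
    }

    // Builds a throwaway corpus with made-up topic and group directory names.
    static Path buildSampleCorpus() {
        try {
            Path corpus = Files.createTempDirectory("tac-sketch");
            for (String topicId : new String[] {"T1", "T2"}) {
                Files.createDirectories(corpus.resolve(topicId).resolve(topicId + "-A"));
                Files.createDirectories(corpus.resolve(topicId).resolve(topicId + "-B"));
            }
            return corpus;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Prints the topic-to-groups map for the sample corpus.
        System.out.println(scan(buildSampleCorpus()));
    }
}
```

The real group directory names in a TAC2008 corpus may differ; the sketch only shows the two-level nesting of topics and groups.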
| Field Summary | |
|---|---|
| protected java.util.HashSet<java.lang.String> | Categories - The set of category (topic ID) names. |
| protected java.lang.String | CorpusDir - The top directory of the TAC2008 topic structure. |
| static int | FROM_TEST_SET - Constant indicating that files are drawn from the test set. |
| static int | FROM_TRAINING_SET - Constant indicating that files are drawn from the training set. |
| static int | FROM_WHOLE_SET - Constant indicating that files are drawn from the whole (training plus test) set. |
| protected java.util.ArrayList<CategorizedFileEntry> | TestFiles - The list of test files (actually group B files). |
| protected java.util.ArrayList<CategorizedFileEntry> | TrainingFiles - The list of training files (actually group A files). |
| Constructor Summary | |
|---|---|
| TAC2008UpdateSummarizationFileSet(java.lang.String sCorpusDir) | Initializes the set, using the given directory as the corpus directory. |
| Method Summary | |
|---|---|
| void | createSets() |
| java.util.List | getCategories() - Returns a list of the topic IDs, as category names. |
| protected java.lang.String | getDocumentText(java.lang.String sDocID, boolean bIncludeTitle) - Returns a given element of a given document as a String, if the element exists. |
| java.util.List<java.lang.String> | getFilenamesFromCategory(java.lang.String sCategoryName, int iFromWhatPart) - Returns all files that belong to a specified category and to a specified subset of the set. |
| java.util.ArrayList | getFilesFromCategory(java.lang.String sCategoryName) |
| java.util.ArrayList | getTestSet() - Returns group B files, described as the test set. |
| java.util.ArrayList | getTrainingSet() - Returns group A files, described as the training set. |
| java.lang.String | loadFile(java.lang.String sID) - Loads the text of a given file, given its filename. |
| static void | main(java.lang.String[] sArgs) - Used for testing purposes only. |
| java.util.Set<java.lang.String> | toFilenameSet(int iSubset) - Gets the set of all file names in the whole set or in its training / test subsets. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public static final int FROM_TRAINING_SET
public static final int FROM_TEST_SET
public static final int FROM_WHOLE_SET
protected java.lang.String CorpusDir
protected java.util.HashSet<java.lang.String> Categories
protected java.util.ArrayList<CategorizedFileEntry> TrainingFiles
protected java.util.ArrayList<CategorizedFileEntry> TestFiles
| Constructor Detail |
|---|
public TAC2008UpdateSummarizationFileSet(java.lang.String sCorpusDir)
Parameters:
sCorpusDir - The top directory of the corpus structure.
| Method Detail |
|---|
public java.util.List getCategories()
Specified by:
getCategories in interface IDocumentSet
Returns:
String List containing category names, which actually correspond to topic IDs.
public void createSets()
Specified by:
createSets in interface IDocumentSet
public final java.util.ArrayList getFilesFromCategory(java.lang.String sCategoryName)
Specified by:
getFilesFromCategory in interface IDocumentSet
public final java.util.List<java.lang.String> getFilenamesFromCategory(java.lang.String sCategoryName,
int iFromWhatPart)
Parameters:
sCategoryName - The name of the category the files should belong to.
iFromWhatPart - One of the FROM_TRAINING_SET, FROM_TEST_SET, FROM_WHOLE_SET values to indicate which subset should be used.
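How a subset constant might drive the selection can be sketched as follows. This is a minimal, self-contained illustration and not the library's implementation: the numeric values of the constants, the `Entry` type (a stand-in for `CategorizedFileEntry`, whose accessors are not documented here), and the filtering logic are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; not the documented class.
class SubsetSelectionSketch {
    // Assumed values; the real constants' values are not given in this page.
    static final int FROM_TRAINING_SET = 1;
    static final int FROM_TEST_SET = 2;
    static final int FROM_WHOLE_SET = 3;

    // Assumed minimal file entry: a file name tagged with its category (topic ID).
    static final class Entry {
        final String fileName;
        final String category;

        Entry(String fileName, String category) {
            this.fileName = fileName;
            this.category = category;
        }
    }

    final List<Entry> trainingFiles = new ArrayList<>(); // group A files
    final List<Entry> testFiles = new ArrayList<>();     // group B files

    // Returns the file names in the given category, drawn from the requested subset.
    List<String> getFilenamesFromCategory(String categoryName, int fromWhatPart) {
        List<Entry> pool = new ArrayList<>();
        if (fromWhatPart == FROM_TRAINING_SET || fromWhatPart == FROM_WHOLE_SET) {
            pool.addAll(trainingFiles);
        }
        if (fromWhatPart == FROM_TEST_SET || fromWhatPart == FROM_WHOLE_SET) {
            pool.addAll(testFiles);
        }
        List<String> names = new ArrayList<>();
        for (Entry entry : pool) {
            if (entry.category.equals(categoryName)) {
                names.add(entry.fileName);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        SubsetSelectionSketch set = new SubsetSelectionSketch();
        set.trainingFiles.add(new Entry("a1.txt", "T1"));
        set.testFiles.add(new Entry("b1.txt", "T1"));
        set.testFiles.add(new Entry("b2.txt", "T2"));
        // Lists the T1 files drawn from both subsets.
        System.out.println(set.getFilenamesFromCategory("T1", FROM_WHOLE_SET));
    }
}
```

In this sketch an unrecognized subset value yields an empty list; how the real method treats such a value is not stated in this page.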
public final java.util.ArrayList getTrainingSet()
Specified by:
getTrainingSet in interface IDocumentSet
public final java.util.ArrayList getTestSet()
Specified by:
getTestSet in interface IDocumentSet
public java.lang.String loadFile(java.lang.String sID)
Specified by:
loadFile in interface IFileLoader<java.lang.String>
Parameters:
sID - The filename of the file to load.
Returns:
String representing the title and text of the given file.
public static void main(java.lang.String[] sArgs)
Parameters:
sArgs - Unused.
protected final java.lang.String getDocumentText(java.lang.String sDocID,
boolean bIncludeTitle)
Parameters:
sDocID - The document ID.
bIncludeTitle - If true, the title precedes the returned text. Otherwise, it is omitted.
public java.util.Set<java.lang.String> toFilenameSet(int iSubset)
Parameters:
iSubset - One of FROM_TRAINING_SET, FROM_TEST_SET, or FROM_WHOLE_SET, indicating the subset from which filenames are extracted.
Returns:
Set of strings containing the filenames of the files in the selected subset.