java.lang.Object
gr.demokritos.iit.tacTools.TAC2008TopicFileSet
public class TAC2008TopicFileSet
Uses a TAC 2008 XML topic definition file to create the set.
Field Summary | |
---|---|
protected java.util.HashSet<java.lang.String> |
Categories
The set of category (topic ID) names. |
protected java.lang.String |
CorpusDir
The top directory of the TAC2008 topic structure |
protected static java.lang.String |
DOCSET_A_TAG
Docset A tag string in XML file. |
protected static java.lang.String |
DOCSET_B_TAG
Docset B tag string in XML file. |
protected static java.lang.String |
DOCUMENT_TAG
Document tag string in XML file. |
static int |
FROM_TEST_SET
Constant indicating selection from the test set. |
static int |
FROM_TRAINING_SET
Constant indicating selection from the training set. |
static int |
FROM_WHOLE_SET
Constant indicating selection from the whole (training plus test) set. |
protected static java.lang.String |
NARRATIVE_TAG
Narrative tag string in XML file. |
protected java.util.ArrayList<CategorizedFileEntry> |
TestFiles
The list of test files (actually group B files). |
protected static java.lang.String |
TITLE_TAG
Title tag string in XML file. |
protected static java.lang.String |
TOPIC_TAG
Topic tag string in XML file. |
protected java.util.ArrayList<CategorizedFileEntry> |
TrainingFiles
The list of training files (actually group A files). |
Constructor Summary | |
---|---|
TAC2008TopicFileSet(java.lang.String sTopicXMLFile,
java.lang.String sCorpusRootDir)
Initializes the document set, given a TAC2008 topic XML file. |
Method Summary | |
---|---|
void |
createSets()
|
java.util.List |
getCategories()
Actually returns the list of topics from the file. |
protected java.lang.String |
getDocumentText(java.lang.String sDocID,
boolean bIncludeTitle)
Returns a given element of a given document as a String, if the element exists. |
java.util.List<java.lang.String> |
getFilenamesFromCategory(java.lang.String sCategoryName,
int iFromWhatPart)
Returns all files belonging to a specified category, and belonging to a specified subset of the set. |
java.util.ArrayList |
getFilesFromCategory(java.lang.String sCategoryName)
|
protected java.util.List |
getFilesFromTopic(java.lang.String sTopicID,
int iFromWhichSet)
Returns a list of filenames that meet certain criteria: a given topic ID, and a docset. |
java.util.ArrayList |
getTestSet()
Returns group B files, described as test set. |
java.lang.String |
getTopicDefinition(java.lang.String sTopicID)
Returns the narrative question of a given topic. |
protected java.lang.String |
getTopicNarrative(java.lang.String sTopicID)
Returns the text of the narrative field, given a topic. |
protected org.w3c.dom.Node |
getTopicNode(java.lang.String sTopicID)
Return the node of a given topic in the XML document. |
protected java.lang.String |
getTopicTitle(java.lang.String sTopicID)
Returns the text of the topic title field, given a topic. |
java.util.ArrayList |
getTrainingSet()
Returns group A files, described as training set. |
java.lang.String |
loadFile(java.lang.String sID)
Loads the text of a given file, given its filename. |
static void |
main(java.lang.String[] sArgs)
Used for testing purposes only. |
java.util.Set<java.lang.String> |
toFilenameSet(int iSubset)
Gets a set of strings containing all file names in the set or its training / test subsets. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected static java.lang.String TOPIC_TAG
protected static java.lang.String TITLE_TAG
protected static java.lang.String DOCSET_A_TAG
protected static java.lang.String DOCSET_B_TAG
protected static java.lang.String NARRATIVE_TAG
protected static java.lang.String DOCUMENT_TAG
public static final int FROM_TRAINING_SET
public static final int FROM_TEST_SET
public static final int FROM_WHOLE_SET
protected java.lang.String CorpusDir
protected java.util.HashSet<java.lang.String> Categories
protected java.util.ArrayList<CategorizedFileEntry> TrainingFiles
protected java.util.ArrayList<CategorizedFileEntry> TestFiles
Constructor Detail |
---|
public TAC2008TopicFileSet(java.lang.String sTopicXMLFile, java.lang.String sCorpusRootDir) throws javax.xml.parsers.ParserConfigurationException, org.xml.sax.SAXException, java.io.IOException
sTopicXMLFile
- The filename of the topic file.
sCorpusRootDir
- The base directory of the TAC2008 test corpus directory structure.
javax.xml.parsers.ParserConfigurationException
org.xml.sax.SAXException
java.io.IOException
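The constructor's checked exceptions suggest the topic file is read with the standard JAXP DOM parser. The following is a minimal, self-contained sketch of how such a file could be split into training (docset A) and test (docset B) entries. It does not use this class. The tag names (`topic`, `docsetA`, `docsetB`, `doc`), the `id` attribute, the sample XML, and the class `TopicFileSketch` are all assumptions made for illustration, since the actual values of `TOPIC_TAG` and the related constants are not listed here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical sketch class; not part of gr.demokritos.iit.tacTools.
class TopicFileSketch {
    // Assumed tag names; the real constant values are not documented here.
    static final String TOPIC_TAG = "topic";
    static final String DOCSET_A_TAG = "docsetA";
    static final String DOCSET_B_TAG = "docsetB";
    static final String DOCUMENT_TAG = "doc";

    // Made-up sample data in an assumed topic-file layout.
    static final String SAMPLE =
        "<topics>"
        + "<topic id=\"T1\"><title>First topic</title>"
        + "<docsetA id=\"T1-A\"><doc id=\"DOC_A1\"/><doc id=\"DOC_A2\"/></docsetA>"
        + "<docsetB id=\"T1-B\"><doc id=\"DOC_B1\"/></docsetB></topic>"
        + "<topic id=\"T2\"><title>Second topic</title>"
        + "<docsetA id=\"T2-A\"><doc id=\"DOC_A3\"/></docsetA>"
        + "<docsetB id=\"T2-B\"><doc id=\"DOC_B2\"/></docsetB></topic>"
        + "</topics>";

    /** Parses an XML string into a DOM document, wrapping checked exceptions. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Cannot parse topic XML", e);
        }
    }

    /** Collects "topicId/docId" strings for every document under the given docset tag. */
    static List<String> collect(Document doc, String docsetTag) {
        List<String> entries = new ArrayList<>();
        NodeList topics = doc.getElementsByTagName(TOPIC_TAG);
        for (int i = 0; i < topics.getLength(); i++) {
            Element topic = (Element) topics.item(i);
            NodeList docsets = topic.getElementsByTagName(docsetTag);
            for (int j = 0; j < docsets.getLength(); j++) {
                NodeList docs = ((Element) docsets.item(j)).getElementsByTagName(DOCUMENT_TAG);
                for (int k = 0; k < docs.getLength(); k++) {
                    String docId = ((Element) docs.item(k)).getAttribute("id");
                    entries.add(topic.getAttribute("id") + "/" + docId);
                }
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println("Training (docset A): " + collect(doc, DOCSET_A_TAG));
        System.out.println("Test (docset B): " + collect(doc, DOCSET_B_TAG));
    }
}
```

In this sketch the topic ID doubles as the category name, which matches the description of `Categories` as the set of category (topic ID) names.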
Method Detail |
---|
public void createSets()
createSets
in interface IDocumentSet
public final java.util.ArrayList getFilesFromCategory(java.lang.String sCategoryName)
getFilesFromCategory
in interface IDocumentSet
public final java.util.List<java.lang.String> getFilenamesFromCategory(java.lang.String sCategoryName, int iFromWhatPart)
sCategoryName
- The name of the category the files should belong to.
iFromWhatPart
- One of the FROM_TRAINING_SET, FROM_TEST_SET, FROM_WHOLE_SET values to indicate which subset should be used.
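To illustrate how a subset selector such as `iFromWhatPart` can drive the lookup, here is a minimal stand-alone sketch. It is not the library's implementation. The constant values, the `Entry` stand-in for `CategorizedFileEntry`, the sample file names, and the class `SubsetSelectionSketch` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch class; not part of gr.demokritos.iit.tacTools.
class SubsetSelectionSketch {
    // Assumed values; the real constants' values are not listed in this page.
    static final int FROM_TRAINING_SET = 1;
    static final int FROM_TEST_SET = 2;
    static final int FROM_WHOLE_SET = 3;

    /** Minimal stand-in for CategorizedFileEntry: a file name plus its category. */
    static final class Entry {
        final String fileName;
        final String category;

        Entry(String fileName, String category) {
            this.fileName = fileName;
            this.category = category;
        }
    }

    final List<Entry> trainingFiles = new ArrayList<>(); // group A files
    final List<Entry> testFiles = new ArrayList<>();     // group B files

    /** Returns the names of files in the given category, drawn from the chosen subset. */
    List<String> getFilenamesFromCategory(String category, int fromWhatPart) {
        List<String> result = new ArrayList<>();
        if (fromWhatPart == FROM_TRAINING_SET || fromWhatPart == FROM_WHOLE_SET) {
            addMatching(trainingFiles, category, result);
        }
        if (fromWhatPart == FROM_TEST_SET || fromWhatPart == FROM_WHOLE_SET) {
            addMatching(testFiles, category, result);
        }
        return result;
    }

    private static void addMatching(List<Entry> source, String category, List<String> out) {
        for (Entry e : source) {
            if (e.category.equals(category)) {
                out.add(e.fileName);
            }
        }
    }

    /** Builds a small made-up file set for demonstration. */
    static SubsetSelectionSketch sample() {
        SubsetSelectionSketch s = new SubsetSelectionSketch();
        s.trainingFiles.add(new Entry("a1.txt", "T1"));
        s.trainingFiles.add(new Entry("a2.txt", "T2"));
        s.testFiles.add(new Entry("b1.txt", "T1"));
        return s;
    }

    public static void main(String[] args) {
        SubsetSelectionSketch s = sample();
        System.out.println(s.getFilenamesFromCategory("T1", FROM_TRAINING_SET));
        System.out.println(s.getFilenamesFromCategory("T1", FROM_WHOLE_SET));
    }
}
```

In this sketch the whole set is the training entries followed by the test entries; whether the real class orders or deduplicates results differently is not stated here.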
public final java.util.ArrayList getTrainingSet()
getTrainingSet
in interface IDocumentSet
public final java.util.ArrayList getTestSet()
getTestSet
in interface IDocumentSet
public java.lang.String loadFile(java.lang.String sID)
loadFile
in interface IFileLoader<java.lang.String>
sID
- The filename of the file to load.
String representing the title and text of the given file.
protected final java.lang.String getDocumentText(java.lang.String sDocID, boolean bIncludeTitle)
sDocID
- The document ID.
bIncludeTitle
- If true, title is leading the return text. Otherwise, it is omitted.
public java.util.Set<java.lang.String> toFilenameSet(int iSubset)
iSubset
- A value of either FROM_TRAINING_SET, FROM_TEST_SET, FROM_WHOLE_SET indicating the subset used to extract filenames.
Set of strings, that are the filenames of the files in the set.
public java.util.List getCategories()
getCategories
in interface IDocumentSet
protected final org.w3c.dom.Node getTopicNode(java.lang.String sTopicID)
sTopicID
- The topic ID of interest.
protected java.lang.String getTopicTitle(java.lang.String sTopicID)
sTopicID
- The topic of interest.
protected java.lang.String getTopicNarrative(java.lang.String sTopicID)
sTopicID
- The topic of interest.
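The topic accessors above (`getTopicNode`, `getTopicTitle`, `getTopicNarrative`) can be pictured as a DOM lookup by topic ID followed by reading a child element's text. The sketch below is one possible way to do that with the standard `org.w3c.dom` API. It is not the class's actual code. The tag names, the `id` attribute, the sample XML, and the class `TopicLookupSketch` are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical sketch class; not part of gr.demokritos.iit.tacTools.
class TopicLookupSketch {
    // Made-up sample data in an assumed topic-file layout.
    static final String SAMPLE =
        "<topics>"
        + "<topic id=\"T1\"><title> First topic </title>"
        + "<narrative> What happened? </narrative></topic>"
        + "<topic id=\"T2\"><title> Second topic </title>"
        + "<narrative> Who was involved? </narrative></topic>"
        + "</topics>";

    /** Parses an XML string into a DOM document, wrapping checked exceptions. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Cannot parse topic XML", e);
        }
    }

    /** Returns the topic element with the given ID, or null if there is none. */
    static Element getTopicNode(Document doc, String topicId) {
        NodeList topics = doc.getElementsByTagName("topic"); // assumed tag name
        for (int i = 0; i < topics.getLength(); i++) {
            Element topic = (Element) topics.item(i);
            if (topicId.equals(topic.getAttribute("id"))) {
                return topic;
            }
        }
        return null;
    }

    /** Returns the trimmed text of the first child element with the given tag, or null. */
    static String childText(Document doc, String topicId, String tag) {
        Element topic = getTopicNode(doc, topicId);
        if (topic == null) {
            return null;
        }
        NodeList children = topic.getElementsByTagName(tag);
        return children.getLength() == 0 ? null : children.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println(childText(doc, "T1", "title"));
        System.out.println(childText(doc, "T2", "narrative"));
    }
}
```

Returning `null` for an unknown topic is a choice made for this sketch only; the behavior of the real methods in that case is not documented here.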
protected java.util.List getFilesFromTopic(java.lang.String sTopicID, int iFromWhichSet)
sTopicID
- The topic of interest.
iFromWhichSet
- An integer value, being one of the following class statics: FROM_TRAINING_SET, FROM_TEST_SET, FROM_WHOLE_SET, indicating docset A, docset B or both correspondingly.
String List of the files meeting the criteria.
public static void main(java.lang.String[] sArgs)
sArgs
- Unused.
public java.lang.String getTopicDefinition(java.lang.String sTopicID)
sTopicID
- The ID of the topic of interest.