java.lang.Object
  gr.demokritos.iit.jinsect.structs.DocumentSet
public class DocumentSet
A set of documents that can be split into training and test sets.
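The idea of splitting a document list into a training part and a test part can be illustrated with a minimal, self-contained sketch. This is not the JInsect implementation: the class `SplitSketch`, its `split` method, the rounding rule, and the choice to take the first items in order are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of JInsect. */
final class SplitSketch {

    /**
     * Returns a two-element list: index 0 is the training part, index 1 the test part.
     * Assumption: the first round(size * trainingFraction) items go to training.
     */
    static <T> List<List<T>> split(List<T> docs, double trainingFraction) {
        if (trainingFraction < 0.0 || trainingFraction > 1.0) {
            throw new IllegalArgumentException("trainingFraction must be in [0, 1]");
        }
        int cut = (int) Math.round(docs.size() * trainingFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(docs.subList(0, cut)));           // training part
        parts.add(new ArrayList<>(docs.subList(cut, docs.size()))); // test part
        return parts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a.txt", "b.txt", "c.txt", "d.txt", "e.txt");
        List<List<String>> parts = split(docs, 0.8);
        System.out.println("training=" + parts.get(0) + " test=" + parts.get(1));
    }
}
```

Copying the sublists into fresh `ArrayList` instances keeps the two parts independent of the input list, so either part can later be shuffled without affecting the other.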
Field Summary | |
---|---|
protected java.lang.String | BaseDir |
protected java.util.ArrayList | Categories |
java.io.FileFilter | FileEvaluator - An evaluator of files to add to this document set. |
static int | FROM_TEST_SET - Constant indicating that files are drawn from the test set. |
static int | FROM_TRAINING_SET - Constant indicating that files are drawn from the training set. |
static int | FROM_WHOLE_SET - Constant indicating that files are drawn from the whole (training plus test) set. |
protected java.util.ArrayList | TestFiles |
protected java.util.ArrayList | TrainingFiles |
protected double | TrainingPercent |
Constructor Summary | |
---|---|
DocumentSet(java.lang.String sBaseDir, double dTrainingPercent) | Creates a new instance of DocumentSet with a training set portion. |
Method Summary | |
---|---|
void | createSets() - Initializes the document sets with all files of the base directory subtree. |
void | createSets(boolean bNoCategories) - Initializes the document sets with all files of the base directory subtree. |
void | createSets(boolean bEvenly, double dPartOfTheCorpus) - Initializes the document sets using a portion of the files of the base directory subtree, in either a stratified or a non-stratified manner. |
void | createSets(boolean bEvenly, double dPartOfTheCorpus, boolean bNoCategories) - Initializes the document sets using a portion of the files of the base directory subtree, in either a stratified or a non-stratified manner. |
java.util.List | getCategories() - Returns a list of the categories appearing in the document set. |
java.util.ArrayList | getFilesFromCategory(java.lang.String sCategoryName) - Returns the training and test set files that belong to a given category. |
java.util.ArrayList | getFilesFromCategory(java.lang.String sCategoryName, int iFromWhichSet) - Returns the files that belong to a given category and are contained in the training set, the test set, or both. |
java.util.ArrayList | getTestSet() - Returns the test set of this document set. |
java.util.ArrayList | getTrainingSet() - Returns the training set of this document set. |
protected void | shuffleTestAndTrainingSetTogether() - Creates a list containing shuffled test and training instances and then recreates the training and test sets based on this list. |
void | shuffleTestSet() - Shuffles (randomizes the order of) the files appearing in the test set. |
void | shuffleTrainingSet() - Shuffles (randomizes the order of) the files appearing in the training set. |
java.util.Set<java.lang.String> | toFilenameSet(int iSubset) - Returns the set of file names of all files in the whole set or in its training or test subset. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected double TrainingPercent
protected java.lang.String BaseDir
protected java.util.ArrayList TrainingFiles
protected java.util.ArrayList TestFiles
protected java.util.ArrayList Categories
public static final int FROM_TRAINING_SET
public static final int FROM_TEST_SET
public static final int FROM_WHOLE_SET
public java.io.FileFilter FileEvaluator
Constructor Detail |
---|
public DocumentSet(java.lang.String sBaseDir, double dTrainingPercent)
Parameters:
sBaseDir - The root of the corpus directory. Each document is expected to be contained in a subdirectory of this directory, named after the document's category.
dTrainingPercent - The portion of the whole document set to use as the training set.
Method Detail |
---|
public java.util.List getCategories()
Specified by: getCategories in interface IDocumentSet
public void createSets()
Specified by: createSets in interface IDocumentSet
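The constructor documentation describes a corpus layout in which each document sits in a subdirectory named after its category, and the class exposes a `java.io.FileFilter` for deciding which files to add. The sketch below shows one way such a layout could be scanned. It is an illustration under stated assumptions, not the JInsect code: `CategoryScanSketch`, `scan`, and `collect` are hypothetical names, and recursing into nested directories is an assumption based on the phrase "base directory subtree".

```java
import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of JInsect. */
final class CategoryScanSketch {

    /** Maps each immediate subdirectory name (the category) to the accepted files below it. */
    static Map<String, List<File>> scan(File baseDir, FileFilter accept) {
        Map<String, List<File>> byCategory = new TreeMap<>();
        FileFilter isDirectory = File::isDirectory;
        File[] categoryDirs = baseDir.listFiles(isDirectory);
        if (categoryDirs == null) {
            return byCategory; // baseDir is not a directory, or an I/O error occurred
        }
        for (File dir : categoryDirs) {
            List<File> files = new ArrayList<>();
            collect(dir, accept, files);
            byCategory.put(dir.getName(), files);
        }
        return byCategory;
    }

    /** Recursively gathers accepted regular files (assumption: nested directories count). */
    private static void collect(File dir, FileFilter accept, List<File> out) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return;
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                collect(entry, accept, out);
            } else if (accept.accept(entry)) {
                out.add(entry);
            }
        }
    }

    public static void main(String[] args) {
        File base = new File(args.length > 0 ? args[0] : ".");
        scan(base, f -> true).forEach(
                (category, files) -> System.out.println(category + ": " + files.size() + " file(s)"));
    }
}
```

For a flat directory without category subdirectories (the `bNoCategories` case), a scan of this shape would return an empty map, so that case would need separate handling.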
public void createSets(boolean bNoCategories)
Parameters:
bNoCategories - Indicates whether there are no subcategories to take into account. If so, a flat directory full of files is expected.
public void createSets(boolean bEvenly, double dPartOfTheCorpus)
Parameters:
bEvenly - Whether to attempt stratification of instances.
dPartOfTheCorpus - The fraction of the corpus to use. Values should be between 0.0 and 1.0.
public void createSets(boolean bEvenly, double dPartOfTheCorpus, boolean bNoCategories)
Parameters:
bEvenly - Whether to attempt stratification of instances.
dPartOfTheCorpus - The fraction of the corpus to use. Values should be between 0.0 and 1.0.
bNoCategories - Indicates whether there are no subcategories to take into account. If so, a flat directory full of files is expected.
public void shuffleTrainingSet()
public void shuffleTestSet()
protected void shuffleTestAndTrainingSetTogether()
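The summary for `shuffleTestAndTrainingSetTogether()` describes pooling the test and training instances, shuffling them, and recreating both sets from the shuffled list. A minimal sketch of that idea follows. It is not the JInsect implementation: `ReshuffleSketch` and `reshuffle` are hypothetical names, and keeping the original set sizes is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper, not part of JInsect. */
final class ReshuffleSketch {

    /**
     * Pools both lists, shuffles the pool, and refills the lists in place.
     * Assumption: each list keeps the size it had before the call.
     * Both lists must be modifiable.
     */
    static <T> void reshuffle(List<T> training, List<T> test, Random rng) {
        int trainingSize = training.size();
        List<T> pool = new ArrayList<>(training);
        pool.addAll(test);
        Collections.shuffle(pool, rng);
        training.clear();
        test.clear();
        training.addAll(pool.subList(0, trainingSize));
        test.addAll(pool.subList(trainingSize, pool.size()));
    }

    public static void main(String[] args) {
        List<String> training = new ArrayList<>(Arrays.asList("a.txt", "b.txt", "c.txt"));
        List<String> test = new ArrayList<>(Arrays.asList("d.txt"));
        reshuffle(training, test, new Random());
        System.out.println("training=" + training + " test=" + test);
    }
}
```

Passing the `Random` in explicitly makes a run reproducible when a fixed seed is supplied, which is useful when comparing experiments across re-splits.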
public java.util.ArrayList getTrainingSet()
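Specified by: getTrainingSet in interface IDocumentSet
Returns: The training set, as an ArrayList.
public java.util.ArrayList getTestSet()
Specified by: getTestSet in interface IDocumentSet
Returns: The test set, as an ArrayList.
public java.util.ArrayList getFilesFromCategory(java.lang.String sCategoryName)
Specified by: getFilesFromCategory in interface IDocumentSet
Returns: An ArrayList of the training and test set files that belong to the given category.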
public java.util.ArrayList getFilesFromCategory(java.lang.String sCategoryName, int iFromWhichSet)
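Parameters:
sCategoryName - The name of the category of interest.
iFromWhichSet - A value indicating from which subset the files should be drawn: FROM_TRAINING_SET, FROM_TEST_SET, or FROM_WHOLE_SET.
Returns: An ArrayList of the files in the selected subset that belong to the given category.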
public java.util.Set<java.lang.String> toFilenameSet(int iSubset)
Parameters:
iSubset - One of FROM_TRAINING_SET, FROM_TEST_SET, or FROM_WHOLE_SET, indicating the subset from which to extract filenames.
Returns: A Set of strings that are the filenames of the files in the selected subset.
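The `createSets(boolean bEvenly, double dPartOfTheCorpus)` overloads mention using only a fraction of the corpus, optionally with stratification. As a rough illustration of what stratified selection means, the sketch below keeps the same fraction of files from every category. It is an assumption-laden example, not the JInsect algorithm: `StratifiedSampleSketch` and `sample` are hypothetical names, and the per-category rounding rule is chosen only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/** Hypothetical helper, not part of JInsect. */
final class StratifiedSampleSketch {

    /**
     * Keeps round(size * fraction) randomly chosen items of every category,
     * so each category contributes roughly the same share of its files.
     */
    static <T> Map<String, List<T>> sample(Map<String, List<T>> byCategory,
                                           double fraction, Random rng) {
        if (fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in [0, 1]");
        }
        Map<String, List<T>> sampled = new TreeMap<>();
        for (Map.Entry<String, List<T>> entry : byCategory.entrySet()) {
            List<T> shuffled = new ArrayList<>(entry.getValue()); // copy; input stays untouched
            Collections.shuffle(shuffled, rng);
            int keep = (int) Math.round(shuffled.size() * fraction);
            sampled.put(entry.getKey(), new ArrayList<>(shuffled.subList(0, keep)));
        }
        return sampled;
    }

    public static void main(String[] args) {
        Map<String, List<String>> corpus = new TreeMap<>();
        corpus.put("sports", Arrays.asList("s1.txt", "s2.txt", "s3.txt", "s4.txt"));
        corpus.put("politics", Arrays.asList("p1.txt", "p2.txt"));
        System.out.println(sample(corpus, 0.5, new Random()));
    }
}
```

A non-stratified alternative would shuffle all files in one pool and keep the first fraction of that pool, which can leave small categories under-represented or empty.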