gr.demokritos.iit.jinsect.structs
Class DocumentSet

java.lang.Object
  extended by gr.demokritos.iit.jinsect.structs.DocumentSet
All Implemented Interfaces:
IDocumentSet

public class DocumentSet
extends java.lang.Object
implements IDocumentSet

A set of documents that can be split into training and test sets.


Field Summary
protected  java.lang.String BaseDir
           
protected  java.util.ArrayList Categories
           
 java.io.FileFilter FileEvaluator
          An evaluator of files to add to this document set.
static int FROM_TEST_SET
          Constant indicating that files are drawn from the test set.
static int FROM_TRAINING_SET
          Constant indicating that files are drawn from the training set.
static int FROM_WHOLE_SET
          Constant indicating that files are drawn from the whole (training plus test) set.
protected  java.util.ArrayList TestFiles
           
protected  java.util.ArrayList TrainingFiles
           
protected  double TrainingPercent
           
 
Constructor Summary
DocumentSet(java.lang.String sBaseDir, double dTrainingPercent)
          Creates a new instance of DocumentSet with a training set portion.
 
Method Summary
 void createSets()
          Initializes the document sets using all files in the base directory subtree.
 void createSets(boolean bNoCategories)
          Initializes the document sets using all files in the base directory subtree.
 void createSets(boolean bEvenly, double dPartOfTheCorpus)
          Initializes the document sets using a portion of the files in the base directory subtree, in either a stratified or a non-stratified manner.
 void createSets(boolean bEvenly, double dPartOfTheCorpus, boolean bNoCategories)
          Initializes the document sets using a portion of the files in the base directory subtree, in either a stratified or a non-stratified manner.
 java.util.List getCategories()
          Returns a list of the categories appearing in the document set.
 java.util.ArrayList getFilesFromCategory(java.lang.String sCategoryName)
          Returns the training and test set files that belong to a given category.
 java.util.ArrayList getFilesFromCategory(java.lang.String sCategoryName, int iFromWhichSet)
          Returns the files of a given category that are contained in the training set, the test set, or both.
 java.util.ArrayList getTestSet()
          Returns the test set of this document set.
 java.util.ArrayList getTrainingSet()
          Returns the training set of this document set.
protected  void shuffleTestAndTrainingSetTogether()
          Creates a list containing shuffled test and training instances and then recreates the training and test sets based on this list.
 void shuffleTestSet()
          Shuffles (randomizes the order of) the files appearing in the test set.
 void shuffleTrainingSet()
          Shuffles (randomizes the order of) the files appearing in the training set.
 java.util.Set<java.lang.String> toFilenameSet(int iSubset)
          Returns a set of the names of all files in the whole set or in its training or test subset.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

TrainingPercent

protected double TrainingPercent

BaseDir

protected java.lang.String BaseDir

TrainingFiles

protected java.util.ArrayList TrainingFiles

TestFiles

protected java.util.ArrayList TestFiles

Categories

protected java.util.ArrayList Categories

FROM_TRAINING_SET

public static final int FROM_TRAINING_SET
Constant indicating that files are drawn from the training set.

See Also:
Constant Field Values

FROM_TEST_SET

public static final int FROM_TEST_SET
Constant indicating that files are drawn from the test set.

See Also:
Constant Field Values

FROM_WHOLE_SET

public static final int FROM_WHOLE_SET
Constant indicating that files are drawn from the whole (training plus test) set.

See Also:
Constant Field Values

FileEvaluator

public java.io.FileFilter FileEvaluator
An evaluator of files to add to this document set. If null, no criteria are applied.
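
The "null means no criteria" rule can be illustrated with a minimal sketch. It is not taken from the JInsect sources: the class FileEvaluatorSketch and its methods keep and select are hypothetical names, and the library may apply the filter differently.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of JInsect.
class FileEvaluatorSketch {
    // A null evaluator accepts every file; otherwise the evaluator decides.
    static boolean keep(FileFilter evaluator, File candidate) {
        return evaluator == null || evaluator.accept(candidate);
    }

    // Returns the accepted candidates in their original order.
    static List<File> select(FileFilter evaluator, List<File> candidates) {
        List<File> kept = new ArrayList<>();
        for (File f : candidates) {
            if (keep(evaluator, f)) {
                kept.add(f);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        FileFilter txtOnly = f -> f.getName().endsWith(".txt");
        List<File> files = Arrays.asList(new File("a.txt"), new File("b.bin"));
        System.out.println(select(txtOnly, files)); // [a.txt]
        System.out.println(select(null, files));    // [a.txt, b.bin]
    }
}
```

With the real class, a filter of this kind would presumably be assigned to the public FileEvaluator field before calling createSets.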

Constructor Detail

DocumentSet

public DocumentSet(java.lang.String sBaseDir,
                   double dTrainingPercent)
Creates a new instance of DocumentSet with a training set portion.

Parameters:
sBaseDir - The root of the corpus directory. Each document is expected to reside in a subdirectory of this directory, named after the document's category.
dTrainingPercent - Percent of the training set as part of the whole document set.
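
How a training portion might divide a file list can be sketched as follows. This is an illustration under assumptions, not the library's implementation: the class TrainingSplitSketch is hypothetical, and the portion is assumed to be a fraction between 0.0 and 1.0 (the scale of dTrainingPercent is not stated here).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not JInsect code.
class TrainingSplitSketch {
    // Number of files assigned to the training set; the fraction is
    // assumed to lie in [0.0, 1.0] and the result is clamped to [0, total].
    static int trainingCount(int total, double trainingFraction) {
        int n = (int) Math.round(total * trainingFraction);
        return Math.max(0, Math.min(total, n));
    }

    // Splits the files into [training, test], keeping the original order.
    static List<List<String>> split(List<String> files, double trainingFraction) {
        int cut = trainingCount(files.size(), trainingFraction);
        List<String> training = new ArrayList<>(files.subList(0, cut));
        List<String> test = new ArrayList<>(files.subList(cut, files.size()));
        return Arrays.asList(training, test);
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("d1", "d2", "d3", "d4");
        System.out.println(split(files, 0.5)); // [[d1, d2], [d3, d4]]
    }
}
```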
Method Detail

getCategories

public java.util.List getCategories()
Returns a list of the categories appearing in the document set.

Specified by:
getCategories in interface IDocumentSet
Returns:
The list of categories.

createSets

public void createSets()
Initializes the document sets using all files in the base directory subtree.

Specified by:
createSets in interface IDocumentSet

createSets

public void createSets(boolean bNoCategories)
Initializes the document sets using all files in the base directory subtree.

Parameters:
bNoCategories - If true, no category subdirectories are taken into account and a flat directory of files is expected.

createSets

public void createSets(boolean bEvenly,
                       double dPartOfTheCorpus)
Initializes the document sets using a portion of the files in the base directory subtree, in either a stratified or a non-stratified manner. Assumes a non-flat (one subdirectory per category) structure.

Parameters:
bEvenly - If true, attempts stratification of instances.
dPartOfTheCorpus - Fraction of the corpus to use. Values should be between 0.0 and 1.0.

createSets

public void createSets(boolean bEvenly,
                       double dPartOfTheCorpus,
                       boolean bNoCategories)
Initializes the document sets using a portion of the files in the base directory subtree, in either a stratified or a non-stratified manner.

Parameters:
bEvenly - If true, attempts stratification of instances.
dPartOfTheCorpus - Fraction of the corpus to use. Values should be between 0.0 and 1.0.
bNoCategories - If true, no category subdirectories are taken into account and a flat directory of files is expected.
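
The difference between stratified and non-stratified selection of a corpus portion can be sketched as below. This is one plausible reading, not the documented behaviour of createSets: the class CorpusPortionSketch, its methods, and the rounding rule are assumptions. In the stratified case the same fraction is taken from every category; otherwise the fraction is taken from the pooled corpus.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration, not JInsect code.
class CorpusPortionSketch {
    // Picks a fraction of the corpus, per category (evenly) or from the pool.
    static List<String> sample(Map<String, List<String>> byCategory,
                               double portion, boolean evenly, Random rnd) {
        List<String> picked = new ArrayList<>();
        if (evenly) {
            // Stratified: take the same fraction from each category.
            for (List<String> files : byCategory.values()) {
                picked.addAll(take(files, portion, rnd));
            }
        } else {
            // Not stratified: pool all categories, then take the fraction.
            List<String> pool = new ArrayList<>();
            for (List<String> files : byCategory.values()) {
                pool.addAll(files);
            }
            picked.addAll(take(pool, portion, rnd));
        }
        return picked;
    }

    // Shuffles a copy and returns round(size * portion) elements of it.
    static List<String> take(List<String> files, double portion, Random rnd) {
        List<String> copy = new ArrayList<>(files);
        Collections.shuffle(copy, rnd);
        int n = Math.max(0, Math.min(copy.size(),
                (int) Math.round(copy.size() * portion)));
        return new ArrayList<>(copy.subList(0, n));
    }

    // Counts picked names that start with the given category prefix.
    static int countWithPrefix(List<String> names, String prefix) {
        int count = 0;
        for (String name : names) {
            if (name.startsWith(prefix)) {
                count++;
            }
        }
        return count;
    }

    // Small example corpus: category "a" has 4 files, "b" has 2.
    static Map<String, List<String>> demoCorpus() {
        Map<String, List<String>> corpus = new LinkedHashMap<>();
        corpus.put("a", List.of("a/0", "a/1", "a/2", "a/3"));
        corpus.put("b", List.of("b/0", "b/1"));
        return corpus;
    }
}
```

In the example corpus, a stratified 0.5 portion always yields two files from "a" and one from "b", whereas the pooled variant yields three files in total with no guarantee about their categories.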

shuffleTrainingSet

public void shuffleTrainingSet()
Shuffles (randomizes the order of) the files appearing in the training set.


shuffleTestSet

public void shuffleTestSet()
Shuffles (randomizes the order of) the files appearing in the test set.


shuffleTestAndTrainingSetTogether

protected void shuffleTestAndTrainingSetTogether()
Creates a list containing shuffled test and training instances and then recreates the training and test sets based on this list.
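
The pool, shuffle, and re-split idea described above can be sketched as follows. The class ReshuffleSketch is hypothetical, and preserving the original set sizes is an assumption; the actual method may recompute the split differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration, not JInsect code.
class ReshuffleSketch {
    // Pools both sets, shuffles the pool, and refills the two lists in
    // place. Assumption: each set keeps the size it had before the call.
    static void reshuffle(List<String> training, List<String> test, Random rnd) {
        int trainingSize = training.size();
        List<String> pool = new ArrayList<>(training);
        pool.addAll(test);
        Collections.shuffle(pool, rnd);

        training.clear();
        training.addAll(pool.subList(0, trainingSize));
        test.clear();
        test.addAll(pool.subList(trainingSize, pool.size()));
    }
}
```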


getTrainingSet

public java.util.ArrayList getTrainingSet()
Returns the training set of this document set.

Specified by:
getTrainingSet in interface IDocumentSet
Returns:
The training set as an ArrayList.

getTestSet

public java.util.ArrayList getTestSet()
Returns the test set of this document set.

Specified by:
getTestSet in interface IDocumentSet
Returns:
The test set as an ArrayList.

getFilesFromCategory

public java.util.ArrayList getFilesFromCategory(java.lang.String sCategoryName)
Returns the training and test set files that belong to a given category.

Specified by:
getFilesFromCategory in interface IDocumentSet
Parameters:
sCategoryName - The name of the category of interest.
Returns:
The set of files.
See Also:
ArrayList

getFilesFromCategory

public java.util.ArrayList getFilesFromCategory(java.lang.String sCategoryName,
                                                int iFromWhichSet)
Returns the files of a given category that are contained in the training set, the test set, or both.

Parameters:
sCategoryName - The name of the category of interest.
iFromWhichSet - A value indicating from which subset the files should be drawn:
  1. FROM_TRAINING_SET indicates the training set as the source.
  2. FROM_TEST_SET indicates the test set as the source.
  3. FROM_WHOLE_SET indicates the whole document set as the source.
Returns:
The set of files that match the conditions set by the parameters.
See Also:
ArrayList
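
The selection logic implied by the three constants can be sketched as below. It is a stand-in, not the library's code: the class SubsetSelectionSketch, its Entry type, and the numeric constant values are hypothetical (the real values are listed under Constant Field Values).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not JInsect code.
class SubsetSelectionSketch {
    // Placeholder values; the real constants may differ.
    static final int FROM_TRAINING_SET = 1;
    static final int FROM_TEST_SET = 2;
    static final int FROM_WHOLE_SET = 3;

    // Minimal stand-in for a file tagged with its category.
    static final class Entry {
        final String category;
        final String fileName;

        Entry(String category, String fileName) {
            this.category = category;
            this.fileName = fileName;
        }
    }

    static Entry entry(String category, String fileName) {
        return new Entry(category, fileName);
    }

    // Returns the file names of one category from the chosen subset.
    static List<String> filesFromCategory(List<Entry> training, List<Entry> test,
                                          String category, int fromWhichSet) {
        List<Entry> source = new ArrayList<>();
        switch (fromWhichSet) {
            case FROM_TRAINING_SET:
                source.addAll(training);
                break;
            case FROM_TEST_SET:
                source.addAll(test);
                break;
            case FROM_WHOLE_SET:
                source.addAll(training);
                source.addAll(test);
                break;
            default:
                throw new IllegalArgumentException("Unknown subset: " + fromWhichSet);
        }
        List<String> result = new ArrayList<>();
        for (Entry e : source) {
            if (e.category.equals(category)) {
                result.add(e.fileName);
            }
        }
        return result;
    }
}
```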

toFilenameSet

public java.util.Set<java.lang.String> toFilenameSet(int iSubset)
Returns a set of the names of all files in the whole set or in its training or test subset.

Parameters:
iSubset - One of FROM_TRAINING_SET, FROM_TEST_SET, or FROM_WHOLE_SET, indicating the subset from which filenames are extracted.
Returns:
A Set of strings containing the filenames of the files in the selected subset.