gr.demokritos.iit.jinsect.console
Class grammaticalityEstimator

java.lang.Object
  extended by gr.demokritos.iit.jinsect.console.grammaticalityEstimator
All Implemented Interfaces:
java.io.Serializable

public class grammaticalityEstimator
extends java.lang.Object
implements java.io.Serializable

The grammaticality estimator learns, from a text corpus, the probability of finding a given token (character) after a given n-gram (string), and uses these probabilities to estimate the normality of other (new) strings.
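
The idea of "probability of a token after an n-gram" can be illustrated with a minimal, self-contained sketch. The class NGramFollowerSketch and its methods are hypothetical and are not part of JInsect; the sketch also counts only the character immediately following each n-gram, whereas the estimator documented here works with a configurable neighbourhood window.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not the JInsect implementation. */
class NGramFollowerSketch {
    // For each n-gram, how often each character was seen directly after it.
    private final Map<String, Map<Character, Integer>> followers = new HashMap<>();
    private final int n;

    NGramFollowerSketch(int n) {
        this.n = n;
    }

    /** Counts, for every n-gram in the corpus, the character that follows it. */
    void train(String corpus) {
        for (int i = 0; i + n < corpus.length(); i++) {
            String gram = corpus.substring(i, i + n);
            char next = corpus.charAt(i + n);
            followers.computeIfAbsent(gram, k -> new HashMap<>())
                     .merge(next, 1, Integer::sum);
        }
    }

    /** Relative frequency of 'next' after 'gram'; 0.0 for unseen n-grams. */
    double probability(String gram, char next) {
        Map<Character, Integer> counts = followers.get(gram);
        if (counts == null) {
            return 0.0;
        }
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        return counts.getOrDefault(next, 0) / (double) total;
    }
}
```

A string whose n-grams are mostly followed by characters that were frequent in the corpus would, under this simplified view, be considered more "normal" than one whose continuations were rarely or never seen.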

See Also:
DistributionDocument, Serialized Form

Field Summary
protected  java.util.TreeMap<java.lang.Integer,DistributionDocument> DistroDocs
          Map between level and distribution documents.
protected  java.util.TreeMap<java.lang.Integer,DistributionWordDocument> DistroWordDocs
          Map between level and word distribution documents.
protected  java.lang.String FullTextDataString
          The concatenation of all corpus texts.
protected  int iCharDist
          The character n-gram neighbourhood size.
protected  int iMaxCharNGram
          The maximum character n-gram size to take into account.
protected  int iMaxWordNGram
          The maximum word n-gram size to take into account.
protected  int iMinCharNGram
          The minimum character n-gram size to take into account.
protected  int iMinWordNGram
          The minimum word n-gram size to take into account.
protected  int iWordDist
          The word n-gram neighbourhood size.
 
Constructor Summary
grammaticalityEstimator(java.util.Set FileNames, int iMinChar, int iMaxChar, int iMinWord, int iMaxWord, int iNeighbourhoodWindow)
          Creates a new instance of grammaticalityEstimator, using a given set of documents for training.
grammaticalityEstimator(java.util.Set FileNames, int iMinChar, int iMaxChar, int iCharWindow, int iMinWord, int iMaxWord, int iWordWindow)
          Creates a new instance of grammaticalityEstimator, using a given set of documents for training.
grammaticalityEstimator(java.lang.String sCorpusDir, int iMinChar, int iMaxChar, int iMinWord, int iMaxWord, int iNeighbourhoodWindow, boolean bFlatDir)
          Creates a new instance of grammaticalityEstimator.
 
Method Summary
 double getCharNormality(java.lang.String sStr)
          Calculates a degree of normality, indicating whether a given string appears in a form similar to text in the training corpus.
 java.util.TreeMap<java.lang.Integer,DistributionDocument> getDistroDocs()
           
 double getNormality(java.lang.String sStr)
          Calculates a degree of normality, indicating whether a given string appears in a form similar to text in the training corpus.
 double getWordNormality(java.lang.String sStr)
          Calculates a degree of normality, indicating whether a given string appears in a form similar to text in the training corpus.
static grammaticalityEstimator loadFromStream(java.io.InputStream is)
           
static void main(java.lang.String[] args)
          A utility main method that performs grammaticality estimation, given a corpus, a peer document set and a model document set.
static void printSyntax()
          Provides command-line syntax information for the execution of the class's main function.
 boolean saveToStream(java.io.OutputStream os)
           
 void train()
          Performs the training of the distribution model.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DistroDocs

protected java.util.TreeMap<java.lang.Integer,DistributionDocument> DistroDocs
Map between level and distribution documents.


DistroWordDocs

protected java.util.TreeMap<java.lang.Integer,DistributionWordDocument> DistroWordDocs
Map between level and word distribution documents.


iMinCharNGram

protected int iMinCharNGram
The minimum character n-gram size to take into account.


iMaxCharNGram

protected int iMaxCharNGram
The maximum character n-gram size to take into account.


iMinWordNGram

protected int iMinWordNGram
The minimum word n-gram size to take into account.


iMaxWordNGram

protected int iMaxWordNGram
The maximum word n-gram size to take into account.


iWordDist

protected int iWordDist
The word n-gram neighbourhood size.


iCharDist

protected int iCharDist
The character n-gram neighbourhood size.


FullTextDataString

protected java.lang.String FullTextDataString
The concatenation of all corpus texts.

Constructor Detail

grammaticalityEstimator

public grammaticalityEstimator(java.util.Set FileNames,
                               int iMinChar,
                               int iMaxChar,
                               int iCharWindow,
                               int iMinWord,
                               int iMaxWord,
                               int iWordWindow)
Creates a new instance of grammaticalityEstimator, using a given set of documents for training.

Parameters:
FileNames - A set of filenames to be used as input training set.
iMinChar - The minimum character n-gram size to take into account.
iMaxChar - The maximum character n-gram size to take into account.
iCharWindow - The neighbourhood window to use for the calculation of n-gram - token neighbourhood of characters.
iMinWord - The minimum word n-gram size to take into account.
iMaxWord - The maximum word n-gram size to take into account.
iWordWindow - The neighbourhood window to use for the calculation of n-gram - token neighbourhood of words.
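
To make the role of the minimum and maximum n-gram sizes concrete, the following sketch extracts character n-grams for every size in a given range and stores them per size ("level") in a TreeMap, loosely mirroring the level-keyed maps in the field summary. NGramLevelsSketch and charNGrams are hypothetical names used for illustration; how JInsect itself builds its DistributionDocument instances is not shown in this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical illustration only; not the JInsect implementation. */
class NGramLevelsSketch {
    /**
     * Returns, for each size n in [minSize, maxSize], the list of all
     * character n-grams of that size, in order of appearance in the text.
     */
    static TreeMap<Integer, List<String>> charNGrams(String text, int minSize, int maxSize) {
        TreeMap<Integer, List<String>> levels = new TreeMap<>();
        for (int n = minSize; n <= maxSize; n++) {
            List<String> grams = new ArrayList<>();
            for (int i = 0; i + n <= text.length(); i++) {
                grams.add(text.substring(i, i + n));
            }
            levels.put(n, grams);
        }
        return levels;
    }
}
```

Under this sketch, a call such as charNGrams(text, iMinChar, iMaxChar) would yield one entry per character n-gram size; the word-level range (iMinWord to iMaxWord) would be handled analogously over word tokens.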

grammaticalityEstimator

public grammaticalityEstimator(java.util.Set FileNames,
                               int iMinChar,
                               int iMaxChar,
                               int iMinWord,
                               int iMaxWord,
                               int iNeighbourhoodWindow)
Creates a new instance of grammaticalityEstimator, using a given set of documents for training.

Parameters:
FileNames - A set of filenames to be used as input training set.
iMinChar - The minimum character n-gram size to take into account.
iMaxChar - The maximum character n-gram size to take into account.
iMinWord - The minimum word n-gram size to take into account.
iMaxWord - The maximum word n-gram size to take into account.
iNeighbourhoodWindow - The neighbourhood window to use for the calculation of n-gram - token neighbourhood.

grammaticalityEstimator

public grammaticalityEstimator(java.lang.String sCorpusDir,
                               int iMinChar,
                               int iMaxChar,
                               int iMinWord,
                               int iMaxWord,
                               int iNeighbourhoodWindow,
                               boolean bFlatDir)
Creates a new instance of grammaticalityEstimator.

Parameters:
sCorpusDir - The path to the directory containing the training corpus.
iMinChar - The minimum character n-gram size to take into account.
iMaxChar - The maximum character n-gram size to take into account.
iMinWord - The minimum word n-gram size to take into account.
iMaxWord - The maximum word n-gram size to take into account.
iNeighbourhoodWindow - The neighbourhood window to use for the calculation of n-gram - token neighbourhood.
bFlatDir - If true, then the corpus is supposed to be a set of texts in

Method Detail

train

public void train()
Performs the training of the distribution model.


getNormality

public double getNormality(java.lang.String sStr)
Calculates a degree of normality, indicating whether a given string appears in a form similar to text in the training corpus. The normality is the mean value of a distribution of normalities for all n-gram sizes.

Parameters:
sStr - The string to test.
Returns:
A measure of normality as a double.
See Also:
DistributionDocument
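
The averaging step described above can be sketched as follows. The per-size normality values are assumed to be already computed and keyed by n-gram size; MeanNormalitySketch and its mean method are hypothetical and only illustrate taking the mean over all sizes, not how each per-size value is obtained.

```java
import java.util.Map;

/** Hypothetical illustration only; not the JInsect implementation. */
class MeanNormalitySketch {
    /**
     * Averages per-size normality values. The map key is the n-gram size,
     * the value is the normality computed for that size.
     * Returns 0.0 when no sizes are available (an assumption of this sketch).
     */
    static double mean(Map<Integer, Double> normalityPerSize) {
        if (normalityPerSize.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (double value : normalityPerSize.values()) {
            sum += value;
        }
        return sum / normalityPerSize.size();
    }
}
```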

getCharNormality

public double getCharNormality(java.lang.String sStr)
Calculates a degree of normality, indicating whether a given string appears in a form similar to text in the training corpus. The normality is the mean value of a distribution of normalities for all character n-gram sizes.

Parameters:
sStr - The string to test.
Returns:
A measure of character normality as a double.
See Also:
DistributionDocument

getWordNormality

public double getWordNormality(java.lang.String sStr)
Calculates a degree of normality, indicating whether a given string appears in a form similar to text in the training corpus. The normality is the mean value of a distribution of normalities for all word n-gram sizes.

Parameters:
sStr - The string to test.
Returns:
A measure of word normality as a double.
See Also:
DistributionDocument

saveToStream

public boolean saveToStream(java.io.OutputStream os)

loadFromStream

public static grammaticalityEstimator loadFromStream(java.io.InputStream is)
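
Neither saveToStream nor loadFromStream is documented here. Since the class implements java.io.Serializable, one plausible reading is that they wrap standard Java object serialization. The sketch below shows that pattern on a stand-in Model class; StreamRoundTripSketch, Model, and the error handling (false or null on failure) are assumptions, not the confirmed behaviour of grammaticalityEstimator.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.util.TreeMap;

/** Hypothetical illustration only; not the JInsect implementation. */
class StreamRoundTripSketch {
    /** Stand-in for a trained model holding level-keyed data. */
    static class Model implements Serializable {
        private static final long serialVersionUID = 1L;
        final TreeMap<Integer, String> levels = new TreeMap<>();
    }

    /** Writes the model to the stream; returns false if writing fails. */
    static boolean saveToStream(Model model, OutputStream os) {
        try {
            ObjectOutputStream oos = new ObjectOutputStream(os);
            oos.writeObject(model);
            oos.flush();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** Reads a model from the stream; returns null if reading fails. */
    static Model loadFromStream(InputStream is) {
        try {
            ObjectInputStream ois = new ObjectInputStream(is);
            return (Model) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            return null;
        }
    }
}
```

If the real methods follow this pattern, a model trained once could be written to a file stream and reloaded later without retraining on the corpus.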

printSyntax

public static void printSyntax()
Provides command-line syntax information for the execution of the class's main function.


main

public static void main(java.lang.String[] args)
A utility main method that performs grammaticality estimation, given a corpus, a peer document set and a model document set.


getDistroDocs

public java.util.TreeMap<java.lang.Integer,DistributionDocument> getDistroDocs()