gr.demokritos.iit.summarization.analysis
Class EntropyChunker

java.lang.Object
  extended by gr.demokritos.iit.summarization.analysis.EntropyChunker

public class EntropyChunker
extends java.lang.Object

Separates a token sequence into chunks, based on the entropy of the symbol that follows each candidate delimiter.
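
The "entropy of the following symbol" can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: the class NextSymbolEntropySketch and the method nextCharEntropy are hypothetical names, and the sketch assumes single characters as symbols and Shannon entropy in bits. A character followed by many different symbols gets a high value, which makes it a plausible delimiter candidate.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of gr.demokritos.iit.summarization.analysis.
final class NextSymbolEntropySketch {

    /**
     * Shannon entropy (in bits) of the character that follows each
     * occurrence of c in text. Returns 0.0 if c never has a successor.
     */
    static double nextCharEntropy(String text, char c) {
        Map<Character, Integer> counts = new HashMap<>();
        int total = 0;
        // Count which characters appear immediately after c.
        for (int i = 0; i + 1 < text.length(); i++) {
            if (text.charAt(i) == c) {
                counts.merge(text.charAt(i + 1), 1, Integer::sum);
                total++;
            }
        }
        // H = -sum(p * log2(p)) over the observed successor distribution.
        double entropy = 0.0;
        for (int n : counts.values()) {
            double p = (double) n / total;
            entropy -= p * (Math.log(p) / Math.log(2));
        }
        return entropy;
    }

    public static void main(String[] args) {
        String text = "the cat sat on the mat";
        // A space is followed by five different characters; 'h' is always followed by 'e'.
        System.out.println("after ' ': " + nextCharEntropy(text, ' '));
        System.out.println("after 'h': " + nextCharEntropy(text, 'h'));
    }
}
```

In this toy text the space scores higher than 'h', which matches the intuition that word boundaries are less predictable than word interiors.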


Constructor Summary
EntropyChunker()
          Creates a new instance of EntropyChunker.
 
Method Summary
 java.util.List chunkString(java.lang.String sToChunk)
          Returns a list of string chunks, derived from a given string.
protected  int determineImportantDelimiters(java.util.SortedMap smMap)
           
 java.util.SortedMap getDelimiters()
          Returns a sorted map of delimiters, based on their next-character entropy measure.
static void main(java.lang.String[] sArgs)
          Utility method.
protected  java.lang.Integer[] splitPointsByDelimiterList(java.lang.String sStr, java.util.SortedMap lDelimiters)
           
protected static java.lang.String[] splitStringByDelimiterPoints(java.lang.String sStr, java.lang.Integer[] iRes)
          Returns the substrings defined by a string and a set of split points.
 void train(java.util.Set<java.lang.String> sFileNames)
          Train the statistics of the chunker from a given file set.
 void train(java.lang.String sTrainingText)
          Train the statistics of the chunker from a given text.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

EntropyChunker

public EntropyChunker()
Creates a new instance of EntropyChunker.

Method Detail

train

public void train(java.util.Set<java.lang.String> sFileNames)
Train the statistics of the chunker from a given file set.

Parameters:
sFileNames - The set of file names to use for training.

train

public void train(java.lang.String sTrainingText)
Train the statistics of the chunker from a given text.

Parameters:
sTrainingText - The text that defines the statistics used by the chunker.

getDelimiters

public java.util.SortedMap getDelimiters()
Returns a sorted map of delimiters, based on their next-character entropy measure.

Returns:
The SortedMap of Delimiters, where each delimiter is matched to its entropy measure.

chunkString

public java.util.List chunkString(java.lang.String sToChunk)
Returns a list of string chunks, derived from a given string.

Parameters:
sToChunk - The string to chunk.
Returns:
A List of strings that are the chunks of the given string.

splitPointsByDelimiterList

protected java.lang.Integer[] splitPointsByDelimiterList(java.lang.String sStr,
                                                         java.util.SortedMap lDelimiters)

splitStringByDelimiterPoints

protected static java.lang.String[] splitStringByDelimiterPoints(java.lang.String sStr,
                                                                 java.lang.Integer[] iRes)
Returns the substrings defined by a string and a set of split points.

Parameters:
sStr - The string to split.
iRes - An array of integers, indicating the points at which the string is to be split.
Returns:
An array of sub-strings of the given string.
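
Splitting a string at a set of integer points can be sketched as follows. This is a stand-alone illustration, not the library's code: the class SplitPointsSketch and the method splitAt are hypothetical, and the sketch assumes each split point is the index at which a new substring begins. The documentation above does not say whether the real method treats points this way, or how it handles out-of-range values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; the real method's split semantics may differ.
final class SplitPointsSketch {

    /**
     * Splits s so that every valid point starts a new substring.
     * Assumption: points outside (0, s.length()) and duplicates are ignored.
     */
    static String[] splitAt(String s, Integer[] points) {
        Integer[] sorted = points.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int p : sorted) {
            if (p <= start || p >= s.length()) {
                continue;                    // skip out-of-range or repeated points
            }
            parts.add(s.substring(start, p));
            start = p;
        }
        parts.add(s.substring(start));       // remainder after the last point
        return parts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] parts = splitAt("hello world", new Integer[] {5, 6});
        System.out.println(Arrays.toString(parts));
    }
}
```

Concatenating the returned substrings in order reproduces the input string, so no characters are dropped under this interpretation.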

determineImportantDelimiters

protected int determineImportantDelimiters(java.util.SortedMap smMap)

main

public static void main(java.lang.String[] sArgs)
Utility method. Used for testing purposes.