Abstracts of George Paliouras' Publications
Paper abstracts
  1. D. Pierrakos, G. Paliouras, "Personalizing Web Directories with the aid of Web Usage Data," IEEE Transactions on Knowledge and Data Engineering, to appear.

    This paper presents a knowledge discovery framework for the construction of Community Web Directories, a concept that we introduced in our recent work, applying personalization to Web directories. In this context, the Web directory is viewed as a thematic hierarchy and personalization is realized by constructing user community models on the basis of usage data. In contrast to most of the work on Web usage mining, the usage data that are analyzed here correspond to user navigation throughout the Web, rather than a particular Web site, exhibiting as a result a high degree of thematic diversity. For modeling the user communities, we introduce a novel methodology that combines the users’ browsing behavior with thematic information from the Web directories. Following this methodology we enhance the clustering and probabilistic approaches presented in previous work and we also present a new algorithm that combines these two approaches. The resulting community models take the form of Community Web Directories. The proposed personalization methodology is evaluated both on a specialized artificial and a general-purpose Web directory, indicating its potential value to the Web user. The experiments also assess the effectiveness of the different machine learning techniques on the task.

  2. A. Artikis and G. Paliouras, "Behaviour Recognition using the Event Calculus," In Proceedings of the 5th IFIP Conference on Artificial Intelligence Applications & Innovations (AIAI), Thessaloniki, Greece, April, Springer Verlag, 2009.

    We present a system for recognising human behaviour given a symbolic representation of surveillance videos. The input of our system is a set of timestamped short-term behaviours — walking, running, standing still, etc — that is, behaviours taking place in a short period of time, detected on video frames. The output of our system is a set of recognised long-term behaviours — fighting, meeting, leaving an object, collapsing, walking, etc — which are pre-defined temporal combinations of short-term behaviours. The definition of a long-term behaviour, including the temporal constraints on the short-term behaviours that, if satisfied, lead to the recognition of the long-term behaviour, is expressed in the Event Calculus. We present experimental results concerning videos with several humans and objects, temporally overlapping and repetitive behaviours.
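
    The idea of deriving long-term behaviours from timestamped short-term ones can be sketched in the spirit of the Event Calculus, where a fluent holds at a time point if it was initiated earlier and not terminated in between. This is only an illustration under stated assumptions, not the definitions used in the paper: the `moving` rule, the event data and the function names `holds_for` and `holds_at` are all hypothetical.

```python
# Hypothetical rule set (not from the paper):
# fluent name -> (initiating short-term behaviours, terminating short-term behaviours)
RULES = {
    "moving": ({"walking", "running"}, {"standing_still"}),
}

# Hypothetical input: (timestamp, person id, short-term behaviour)
EVENTS = [
    (1, "id1", "walking"),
    (2, "id1", "walking"),
    (3, "id1", "running"),
    (5, "id1", "standing_still"),
    (6, "id1", "standing_still"),
    (8, "id1", "walking"),
    (2, "id2", "standing_still"),
]


def holds_for(events, fluent, person, rules=RULES):
    """Return the maximal intervals (start, end) in which `fluent` holds for `person`.

    `start` is the time of the initiating event and `end` the time of the
    terminating event; an interval still open after the last event has end=None.
    """
    initiating, terminating = rules[fluent]
    intervals, start = [], None
    for time, who, behaviour in sorted(events):
        if who != person:
            continue
        if behaviour in initiating and start is None:
            start = time  # fluent initiated
        elif behaviour in terminating and start is not None:
            intervals.append((start, time))  # fluent terminated
            start = None
    if start is not None:
        intervals.append((start, None))
    return intervals


def holds_at(events, fluent, person, t, rules=RULES):
    """True if `fluent` was initiated strictly before t and not terminated before t."""
    return any(
        start < t and (end is None or t <= end)
        for start, end in holds_for(events, fluent, person, rules)
    )
```

    With the sample events, `holds_for(EVENTS, "moving", "id1")` yields two intervals, one closed by the first `standing_still` and one still open after the last `walking` event.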

  3. G. Petasis, A. Krithara, V. Karkaletsis, G. Paliouras and C.D. Spyropoulos, "Semi-automated ontology learning: the BOEMIE approach," In Proceedings of the Workshop on Inductive Reasoning and Machine Learning on the Semantic Web (IRMLeS) at the European Semantic Web Conference (ESWC), Heraklion, Greece, June 2009.

    In this paper we describe a semi-automated approach for ontology learning. Exploiting an ontology-based multimodal information extraction system, the ontology learning subsystem accumulates documents that are insufficiently analysed and through clustering proposes new concepts, relations and interpretation rules to be added to the ontology.

  4. C.E. Tsourakakis and G. Paliouras, "VeWRA: An Algorithm for Wrapper Verification," Machine Learning Technical Report CMU-ML-09-100, School of Computer Science, Carnegie Mellon University, 2009.

    Web wrappers play an important role in extracting information from distributed web sources and subsequently in the integration of heterogeneous data. Changes in the layout of web sources typically break the wrapper, leading to erroneous extraction of information. Monitoring and repairing broken wrappers is an important hurdle for data integration, since it is an expensive and painful procedure. In this paper we present VEWRA, a new approach to wrapper verification, which improves on the successful family of trainable content-based methods. Compared to its predecessors, the new method aims to capture not only the syntactic patterns but also the correlations that exist among them due to the underlying semantics of the extracted information. Experiments show that our method achieves excellent performance, always performing at least as well as DATAPROG, the state-of-the-art related work.

  5. G. Korfiatis and G. Paliouras, "Modeling Web Navigation using Grammatical Inference," Applied Artificial Intelligence, v. 22, n. 1 & 2, pp. 116-138, 2008.

    In this article, a method that models user navigation on the web, as opposed to a single website, is presented, aiming to assist the user by recommending pages. User modeling is done through data mining of web usage logs, resulting in aggregate, rather than personal models. The proposed approach extends grammatical inference methods by introducing an extra merging criterion, which examines the semantic similarity of automaton states. The experimental results showed that the method does indeed facilitate the modeling of web navigation, which was not possible with the existing web usage mining methods. However, a content-based recommendation model is shown to still outperform the proposed method, which suggests that the knowledge of the navigation sequence does not contribute to the recommendation process. This is due to the thematic cohesion of navigation sessions, in comparison to the large thematic diversity of web usage data. Among three variants of the proposed method, the one based on Blue Fringe, which examines a larger space of possible merges, performs best.

  6. G. Paliouras, A. Mouzakidis, V. Moustakas and C. Skourlas, "PNS: A personalized news aggregator on the Web," In Intelligent Interactive Systems in Knowledge-based Environments, M. Virvou and L. Jain (eds), Studies in Computational Intelligence, n. 104, pp. 175-197, Springer-Verlag, 2008.

    This paper presents a system that aggregates news from various electronic news publishers and distributors. The system collects news from HTML and RSS Web documents by using source-specific information extraction programs (wrappers) and parsers, organizes them according to pre-defined news categories and constructs personalized views via a Web-based interface. Adaptive personalization is performed, based on the individual user interaction, user similarities and statistical analysis of aggregate usage data by machine learning algorithms. In addition to the presentation of the basic system, we present here the results of a user study, indicating the merits of the system, as well as ways to improve it further.

  7. E. Zavitsanos, S. Petridis, G. Paliouras and G. Vouros, "Learning Ontologies of Appropriate Size," In Proceedings of the Panhellenic Conference in Artificial Intelligence (SETN), Syros, Greece, October, Lecture Notes in Artificial Intelligence, n. 5138, pp. 327-338, Springer Verlag, 2008.

    Determining the size of an ontology that is automatically learned from text corpora is an open issue. In this paper, we study the similarity between ontology concepts at different levels of a taxonomy, quantifying in a natural manner the quality of the ontology attained. Our approach is integrated in a recently proposed method for language-neutral learning of ontologies of thematic topics from text corpora. Evaluation results over the Genia and the Lonely Planet corpora demonstrate the significance of our approach.

  8. G. Papadakis and G. Paliouras, "MyCites: An Intelligent Information System for Maintaining Citations," In Proceedings of the Panhellenic Conference in Artificial Intelligence (SETN), Syros, Greece, October, Lecture Notes in Artificial Intelligence, Springer Verlag, n. 5138, pp. 371-376, 2008.

    The evaluation of their research work and its effect has always been one of scholars' greatest concerns. The use of citations for that purpose, as proposed by Eugene Garfield, is nowadays widely accepted as the most reliable method. However, gathering a scholar's citations constitutes a particularly laborious task, even in the current Internet era, as one needs to correctly combine information from miscellaneous sources. There is therefore a need to automate this process. Numerous academic search engines try to cover this need, but none of them successfully addresses all related problems. In this paper we present an approach that greatly facilitates citation analysis by taking advantage of new algorithms to deal with these problems.

  9. G. Petasis, V. Karkaletsis, G. Paliouras and C.D. Spyropoulos, "Learning context-free grammars to extract relations from text," In Proceedings of the European Conference on Artificial Intelligence (ECAI), IOS Press, pp. 303-307, Patras, Greece, July 2008.

    In this paper we propose a novel relation extraction method, based on grammatical inference. Following a semi-supervised learning approach, the text that connects named entities in an annotated corpus is used to infer a context-free grammar. The grammar learning algorithm is able to infer grammars from positive examples only, controlling overgeneralisation through minimum description length. Evaluation results show that the proposed approach performs comparably to the state of the art, while exhibiting a bias towards precision, which is a sign of conservative generalisation.

  10. E. Zavitsanos, G. Paliouras, G. Vouros and S. Petridis, "Determining Automatically the Size of Learned Ontologies," In Proceedings of the European Conference on Artificial Intelligence (ECAI), IOS Press, pp. 775-776, Patras, Greece, July 2008.

    Determining the size of an ontology that is automatically learned from texts is an open issue. In this paper, we study the similarity between ontology concepts at different levels of a taxonomy, quantifying in a natural manner the quality of the ontology attained. Our approach is integrated in a method for language-neutral learning of ontologies from texts, which relies on conditional independence tests over thematic topics that are discovered using LDA.

  11. I. Partalas, G. Paliouras and I. Vlahavas, "Reinforcement Learning with Classifier Selection for Focused Crawling," In Proceedings of the European Conference on Artificial Intelligence (ECAI), IOS Press, pp. 759-760, Patras, Greece, July 2008.

    Focused crawlers are programs that wander in the Web, using its graph structure, and gather pages that belong to a specific topic. The most critical task in Focused Crawling is the scoring of the URLs, as it designates the path that the crawler will follow, and thus its effectiveness. In this paper we propose a novel scheme for assigning scores to the URLs, based on the Reinforcement Learning (RL) framework. The proposed approach learns to select the best classifier for ordering the URLs. This formulation reduces the size of the search space for the RL method and makes the problem tractable. We evaluate the proposed approach on-line on a number of topics, which offers a realistic view of its performance, comparing it also with an RL method and a simple but effective classifier-based crawler. The results demonstrate the strength of the proposed approach.

  12. E. Zavitsanos, G. Paliouras and G. Vouros, "A Distributional Approach to Evaluating Ontology Learning Methods Using a Gold Standard," In Proceedings of the 3rd Workshop on Ontology Learning and Population (OLP3) at the European Conference on Artificial Intelligence (ECAI), Patras, Greece, July 2008.

    This paper presents a method for the evaluation of learned ontologies against gold standards. The proposed method transforms the ontology concepts to a vector space representation to avoid the common string matching of concepts at the lexical layer. We propose a set of evaluation measures that exploit the concepts' representations and calculate the similarity of the two hierarchies. Experiments show that these measures scale gradually in the closed interval of [0,1] as learned ontologies deviate increasingly from the gold standard. The proposed method is tested using the Genia and the Lonely Planet gold standard ontologies.

  13. S. Konstantopoulos, G. Paliouras, J. Schon, D. Schneider, T. Winkler, J. Pottebaum and R. Koch, "Ontology-based Rescue Operation Management," In Proceedings of the International Symposium on Mobile Information Technology for Emergency Response (MobRes), Bonn, Germany, May, Lecture Notes in Computer Science, Springer Verlag, 2008. (to appear)

    The focus of this paper is ontology-based knowledge management in the framework of a mobile communication and information system for rescue operation management. We present a novel ontology data service, combining prior domain knowledge about large-scale rescue operations with dynamic information about a developing operation. We also discuss the integration of such a data service into a service-oriented application framework to reach high performance and accessibility, and offer examples of SHARE applications to demonstrate the practical benefits of the approach chosen.

  14. E. Zavitsanos, G. Paliouras, G. Vouros and S. Petridis, "Discovering Subsumption Hierarchies of Ontology Concepts from Text Corpora," In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI), pp. 402-408, IEEE Press, 2007.

    This paper proposes a method for learning ontologies given a corpus of text documents. The method identifies concepts in documents and organizes them into a subsumption hierarchy, without presupposing the existence of a seed ontology. The method uncovers latent topics in terms of which document text is being generated. These topics form the concepts of the new ontology. This is done in a language neutral way, using probabilistic space reduction techniques over the original term space of the corpus. Given multiple sets of concepts (latent topics) being discovered, the proposed method constructs a subsumption hierarchy by performing conditional independence tests among pairs of latent topics, given a third one. The paper provides experimental results over the GENIA corpus from the domain of biomedicine.

  15. C. Christophi, D. Zeinalipour-Yazti, M. Dikaiakos and G. Paliouras, "Automatically Annotating the ODP Web Taxonomy," In Proceedings of the 11th Panhellenic Conference on Informatics (PCI), Current Trends in Informatics, v.1, pp. 397-408, New Technologies Publications, 2007.

    In this paper we present the ideas and algorithms developed around our KeyGen Web Taxonomy Annotation engine. KeyGen annotates the Open Directory Project, also known as Dmoz, with meaningful and previously unknown keywords by utilizing domain knowledge extracted from the WWW. We present two algorithms: i) The PageParse Algorithm, which efficiently extracts keywords from Web Taxonomies using a combination of local and global scores, and ii) the Support Algorithm, an I/O optimized algorithm for coalescing hierarchies of keywords. We then present the results: i) from constructing a richly annotated ODP Web taxonomy and ii) from evaluating the correctness of this structure by performing an automated classification of Web-pages.

  16. J. Pottebaum, S. Konstantopoulos, R. Koch, and G. Paliouras, "SaR Resource Management based on Description Logics," In Proceedings of the International Workshop on Mobile Information Technology for Emergency Response (MobRes), Lecture Notes in Computer Science, n. 4458, pp. 61-70, Springer Verlag, 2007.

    The management of resources is a great challenge for commanders in Search and Rescue operations and has a strong impact on all areas of operation control, as command-and-communication structure, geo-referenced information, and operational tasks are inter-connected with complex relations. During an operation these are subject to dynamic changes. For an efficient operation control commanders need access to up-to-date information in their mobile working environment. This paper presents a new approach to manage resources and their relations in an operation. It is based on ontologies to build a model of an operation and Description Logic reasoning to provide enhanced decision support.

  17. D. Kosmopoulos, S. Petridis, I. Pratikakis, V. Gatos, S. Perantonis, V. Karkaletsis and G. Paliouras, "Knowledge Acquisition from Multimedia Content using an Evolution Framework," In Proceedings of the IFIP Conference on Artificial Intelligence Applications and Innovations (AIAI), IFIP International Federation for Information Processing Series, n. 204, pp. 557-565, Springer Boston, 2006.

    We propose an approach to knowledge acquisition, which uses multimedia ontologies for fused extraction of semantics from multiple modalities, and feeds back the extracted information, aiming to evolve knowledge representation. This paper presents the basic components of the proposed approach and discusses the open research issues focusing on the fused information extraction that will enable the development of scalable and precise knowledge acquisition technology.

  18. N. Trogkanis, G. Paliouras, "TPN2: Using positive-only learning to deal with the heterogeneity of labeled and unlabeled data", In Proceedings of the Discovery Challenge at the Joint European Conference on Machine Learning and on Principles and Practices of Knowledge Discovery in Databases (ECML/PKDD), Berlin, Germany, September, 2006.

    This paper introduces TPN2, the runner-up method in both tasks of the ECML-PKDD Discovery Challenge 2006 on personalized spam filtering. TPN2 is a classifier training method that bootstraps positive-only learning with fully-supervised learning, in order to make the most of labeled and unlabeled data, under the assumption that the two are drawn from significantly different distributions. Furthermore, the unlabeled data themselves are separated into subsets that are assumed to be drawn from multiple distributions. For that reason, TPN2 trains a different classifier for each subset, making use of all unlabeled data each time.

  19. V. Metsis, I. Androutsopoulos, G. Paliouras, "Spam Filtering with Naive Bayes - Which Naive Bayes?" In Proceedings of the second Conference on Email and Anti-Spam (CEAS), Mountain View, CA, USA, 2006.

    Naive Bayes is very popular in commercial and open-source anti-spam e-mail filters. There are, however, several forms of Naive Bayes, something the anti-spam literature does not always acknowledge. We discuss five different versions of Naive Bayes, and compare them on six new, non-encoded datasets, that contain ham messages of particular Enron users and fresh spam messages. The new datasets, which we make publicly available, are more realistic than previous comparable benchmarks, because they maintain the temporal order of the messages in the two categories, and they emulate the varying proportion of spam and ham messages that users receive over time. We adopt an experimental procedure that emulates the incremental training of personalized spam filters, and we plot ROC curves that allow us to compare the different versions of NB over the entire tradeoff between true positives and true negatives.

  20. E. Dellis, G. Paliouras, "Management of Large Spatial Ontology Bases," In Proceedings of the Workshop on Ontologies-based techniques for DataBases and Information Systems (ODBIS) at the 32nd International Conference on Very Large Data Bases (VLDB), Seoul, Korea, September, 2006.

    In this paper we propose a method for efficient management of large spatial ontologies. Current spatial ontologies are usually represented using an ontology language, such as OWL, and stored as OWL files. However, we have observed some shortcomings of this approach, especially in the efficiency of spatial query processing. This fact motivated the development of a hybrid approach that uses an R-tree as a spatial index structure. In this way we are able to support efficient query processing over large spatial ontologies, maintaining the benefits of ontological reasoning. We present a case study for emergency teams during Search and Rescue (SaR) operations showing how an Ontology Data Service (SHARE-ODS) can benefit from a spatial index. Performance evaluation shows the superiority of our proposed technique compared to the original approach. To the best of our knowledge, this is the first attempt to address the problem of efficient management of large spatial ontology bases.

  21. G. Paliouras, A. Mouzakidis, C. Ntoutsis, A. Alexopoulos, C. Skourlas, "PNS: Personalized Multi-Source News Delivery," In Proceedings of the 10th International Conference on Knowledge-Based & Intelligent Information & Engineering Systems (KES), Bournemouth, UK, October 2006. 

    This paper presents a system that integrates news from multiple sources on the Web and delivers it in a personalized fashion to the reader. The presented service integrates automatic information extraction from various news sources and presentation of information according to the user’s interests. The system consists of source-specific information extraction programs (wrappers) that extract highlights of news items from the various sources, organize them according to pre-defined news categories and present them to the user through a personal Web-based interface. Dynamic personalization is used based on the user’s reading history, as well as the preferences of other similar users. User models are maintained by statistical analysis and machine learning algorithms. Results of an initial user study have confirmed the value of the service and indicated ways in which it should be improved.

  22. S. Konstantopoulos, G. Paliouras, S. Chatzinotas, "SHARE-ODS: An Ontology Data Service for Search and Rescue Operations," In Proceedings of the Panhellenic Conference in Artificial Intelligence (SETN), Lecture Notes in Artificial Intelligence, n. 3955, pp. 525-528, Springer Verlag, 2006.

    This paper describes an ontology data service (ODS) for supporting Search and Rescue (SaR) operations. The ontological model represents various aspects of the command, communication, and organisational structure of the SaR forces and the deployment and progress of a SaR operation. Furthermore, the ontology supports the semantic indexing of multimedia documents in the context of SaR processes and activities. This ODS supports a semantically-enhanced information and communication system for SaR forces. Modelling the spatio-temporal aspects of an operation in alignment with possibly-unreliable information automatically extracted from multimedia objects, introduces a number of challenges for the field of knowledge representation and reasoning.

  23. G. Sigletos, G. Paliouras, C.D. Spyropoulos and M. Hatzopoulos, "Combining Information Extraction Systems Using Voting and Stacked Generalization," Journal of Machine Learning Research, London, UK, November 2005.

    This article investigates the effectiveness of voting and stacked generalization (also known as stacking) in the context of information extraction (IE). A new stacking framework is proposed that accommodates well-known approaches for IE. The key idea is to perform cross-validation on the base-level data set, which consists of text documents annotated with relevant information, in order to create a meta-level data set that consists of feature vectors. A classifier is then trained using the new vectors. Therefore, base-level IE systems are combined with a common classifier at the meta-level. Various voting schemes are presented for comparing against stacking in various IE domains. Well-known IE systems are employed at the base-level, together with a variety of classifiers at the meta-level. Results show that both voting and stacking work better when relying on probabilistic estimates by the base-level systems. Voting proved to be effective in most domains in the experiments. Stacking, on the other hand, proved to be consistently effective over all domains, doing comparably or better than voting and always better than the best base-level systems. Particular emphasis is also given to explaining the results obtained by voting and stacking at the meta-level, with respect to the varying degree of similarity in the output of the base-level systems.
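
    The cross-validation step that turns base-level outputs into a meta-level data set can be sketched as follows. This is a minimal illustration of generic stacked generalization, not the framework from the article: toy threshold predictors stand in for the base-level IE systems, a lookup table stands in for the meta-level classifier, and the data and all names (`train_threshold`, `build_meta_dataset`, `train_meta`, `stack`) are assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_threshold(data, feature):
    """Toy base-level learner (hypothetical stand-in for an IE system):
    thresholds one feature at the midpoint between the two class means."""
    pos = [x[feature] for x, y in data if y == 1]
    neg = [x[feature] for x, y in data if y == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    positive_above = mean_pos >= mean_neg
    return lambda x: int((x[feature] > threshold) == positive_above)


def build_meta_dataset(data, learners, k):
    """k-fold cross-validation: each example is described by the outputs of
    base models that were trained without seeing that example."""
    meta = [None] * len(data)
    for fold in range(k):
        train = [ex for i, ex in enumerate(data) if i % k != fold]
        models = [learn(train) for learn in learners]
        for i, (x, y) in enumerate(data):
            if i % k == fold:
                meta[i] = (tuple(m(x) for m in models), y)
    return meta


def train_meta(meta):
    """Toy meta-level classifier: most frequent label per meta-level vector,
    falling back to the overall majority label for unseen vectors."""
    votes = defaultdict(Counter)
    for vector, y in meta:
        votes[vector][y] += 1
    default = Counter(y for _, y in meta).most_common(1)[0][0]
    table = {vector: c.most_common(1)[0][0] for vector, c in votes.items()}
    return lambda vector: table.get(vector, default)


def stack(data, learners, k):
    """Train the meta-level classifier on cross-validated outputs, then
    retrain the base-level models on all data for use at prediction time."""
    meta_classifier = train_meta(build_meta_dataset(data, learners, k))
    models = [learn(data) for learn in learners]
    return lambda x: meta_classifier(tuple(m(x) for m in models))


# Hypothetical toy data: the label is 1 only when both features are high.
DATA = [
    ((1, 1), 0), ((9, 9), 1), ((1, 9), 0), ((9, 1), 0),
    ((2, 2), 0), ((8, 8), 1), ((2, 8), 0), ((8, 2), 0),
    ((1, 2), 0), ((9, 8), 1), ((2, 9), 0), ((8, 1), 0),
]
LEARNERS = [lambda d: train_threshold(d, 0), lambda d: train_threshold(d, 1)]
predict = stack(DATA, LEARNERS, k=3)
```

    Neither toy base model can represent the conjunction of the two features on its own, while the meta-level classifier trained on their combined outputs can, which is the kind of added value stacking is meant to capture.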

  24. C.D. Spyropoulos, G. Paliouras, V. Karkaletsis, D. Kosmopoulos, I. Pratikakis, S. Perantonis and B. Gatos, "BOEMIE: Bootstrapping Ontology Evolution with Multimedia Information Extraction," In Proceedings of the 2nd European Workshop on Integration of Knowledge Semantic and Digital Media Technologies, v.6, pp. 1751-1782, 2005.

    The BOEMIE project proposes a bootstrapping approach to knowledge acquisition, which uses multimedia ontologies for fused extraction of semantics from multiple modalities, and feeds back the extracted information, aiming to automate the ontology evolution process.

  25. L. Vande Velde, S. Chatzinotas, M. Larson, J. Löffler, G. Paliouras, "Interactive 2D - 3D digital maps for the support of emergency teams during rescue operations", In Proceedings of the 12th World Congress on Intelligent Transport Systems, San Francisco, November, 2005.

    SHARE, an EU-funded 6th Framework Program project, addresses the need of emergency teams for multimodal communication and for decision support with a prototype advanced mobile service based on Push-to-Share technology. The SHARE system provides emergency workers with on-site, on-line details of operational history and current operational status as well as access to pertinent supporting information, in particular information concerning the environment of the incident. The SHARE system will incorporate an enhanced Tele Atlas 2D-3D digital map, including details on buildings and roads above and beyond those represented in basic digital road maps. The SHARE system will log communications and other multimedia data generated during the operation and store it in an ontology-based Knowledge Base, which makes possible the integration of the spatial information of digital maps with multimedia and operational information from external databases. In the final phase of the SHARE project, the system will implement a 2D-3D digital map enhanced with voice, image, text and video information. The map will be fully interactive, permitting emergency workers with mobile end devices such as PDAs and tablet PCs to query the system using a multimodal interface and retrieve information as well as to enter new information as the operation unfolds.

  26. V. Karkaletsis, G. Paliouras, C. D. Spyropoulos, "A Bootstrapping Approach to Knowledge Acquisition from Multimedia Content with Ontology Evolution," In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR), pp. 98-105, Helsinki University of Technology, Finland, June 2005.

    We propose a bootstrapping approach to knowledge acquisition, which uses multimedia ontologies for fused extraction of semantics from multiple modalities, and feeds back the extracted information, aiming to automate the ontology evolution process. This paper presents the basic components of the proposed approach and discusses the open research issues focusing on the synergy of extraction and evolution that will enable the development of scalable and precise knowledge acquisition technology.

  27. G. Paliouras, "On the Need to Bootstrap Ontology Learning with Extraction Grammar Learning," In Proceedings of the International Conference on Conceptual Structures (ICCS), Kassel, Germany, July, Lecture Notes in Artificial Intelligence, n. 3596, pp. 119-135, Springer Verlag, 2005.

    The main claim of this paper is that machine learning can help integrate the construction of ontologies and extraction grammars and lead us closer to the Semantic Web vision. The proposed approach is a bootstrapping process that combines ontology and grammar learning, in order to semi-automate the knowledge acquisition process. After providing a survey of the most relevant work towards this goal, recent research of the Software and Knowledge Engineering Laboratory (SKEL) of NCSR "Demokritos" in the areas of Web information integration, information extraction, grammar induction and ontology enrichment is presented. The paper concludes with a number of interesting issues that need to be addressed in order to realize the advocated bootstrapping process.

  28. D. Pierrakos, G. Paliouras, "Exploiting Probabilistic Latent Information for the Construction of Community Web Directories," In Proceedings of the International User Modelling Conference (UM), Edinburgh, UK, July, Lecture Notes in Artificial Intelligence, n. 3538, pp. 89-98, Springer Verlag, 2005.

    This paper improves a recently-presented approach to Web Personalization, named Community Web Directories, which applies personalization techniques to Web Directories. The Web directory is viewed as a concept hierarchy and personalization is realized by constructing user community models on the basis of usage data collected by the proxy servers of an Internet Service Provider. The user communities are modeled using Probabilistic Latent Semantic Analysis (PLSA), which provides a number of advantages such as overlapping communities, as well as a good rationale for the associations that exist in the data. The data that are analyzed present challenging peculiarities such as their large volume and semantic diversity. Initial results presented in this paper illustrate the effectiveness of the new method.

  29. D. Pierrakos, G. Paliouras, C. Papatheodorou, V. Karkaletsis, M. Dikaiakos, "Web Community Directories: A New Approach to Web Personalization," In Berendt et al. (Eds.), "Web Mining: From Web to Semantic Web", Lecture Notes in Computer Science, n. 3209, pp. 113 - 129, Springer Verlag, 2004.

    This paper introduces a new approach to Web Personalization, named Web Community Directories, that aims to tackle the problem of information overload on the WWW. This is realized by applying personalization techniques to the well-known concept of Web Directories. The Web directory is viewed as a concept hierarchy which is generated by a content-based document clustering method. Personalization is realized by constructing community models on the basis of usage data collected by the proxy servers of an Internet Service Provider. For the construction of the community models, a new data mining algorithm, called Community Directory Miner, is used. This is a simple cluster mining algorithm which has been extended to ascend a concept hierarchy, and specialize it to the needs of user communities. The data that are mined present a number of peculiarities such as their large volume and semantic diversity. Initial results presented in this paper illustrate the use of the methodology and provide an indication of the behavior of the new mining method.

  30. G. Sigletos, G. Paliouras, C. D. Spyropoulos, P. Stamatopoulos, "Stacked generalization for information extraction," In Proceedings of the European Conference in Artificial Intelligence (ECAI), pp. 549 - 553, Valencia, Spain, IOS Press, 2004.

    This paper defines a new stacked generalization framework in the context of information extraction (IE) from online sources. The proposed setting removes the constraint of applying classifiers at the base-level. A set of IE systems are trained instead to identify relevant fragments within text documents, which differs significantly from the task of classifying candidate text fragments as relevant or not. The templates filled by the base-level IE systems are stacked, forming a set of feature vectors for training a meta-level classifier. Thus, base-level IE systems are combined with a common classifier at meta-level. The proposed framework was evaluated on three Web domains, using well known IE approaches at base-level and a variety of classifiers at meta-level. Results demonstrate the added value obtained by combining the base-level IE systems in the new framework.

  31. A. Valarakos, G. Paliouras, V. Karkaletsis, G. Vouros, "Enhancing the Ontological Knowledge through Ontology Population and Enrichment," In Proceedings of the International Conference on Knowledge Engineering and Knowledge Management (EKAW), Lecture Notes in Artificial Intelligence, n. 3257, pp. 144-156, Springer Verlag, 2004.

    Ontologies are widely used for capturing and organizing knowledge of a particular domain of interest. This knowledge is usually evolvable and therefore an ontology maintenance process is required to keep the ontological knowledge up-to-date. We propose an incremental ontology maintenance methodology which exploits ontology population and enrichment methods to enhance the knowledge captured by the instances of the ontology and their various lexicalizations. Furthermore, we employ ontology learning techniques to minimize, as far as possible, human intervention in the proposed methodology. We conducted experiments using the CROSSMARC ontology as a case study, evaluating the methodology and its partial methods. The methodology performed well, enhancing the ontological knowledge to 96.5% from only 50%.

  32. E. Michelakis, I. Androutsopoulos, G. Paliouras, G. Sakkis, P. Stamatopoulos, "Filtron: A Learning-Based Anti-Spam Filter," In Proceedings of the first Conference on Email and Anti-Spam (CEAS), Mountain View, CA, USA, 2004.

    We present Filtron, a prototype anti-spam filter that integrates the main empirical conclusions of our comprehensive analysis on using machine learning to construct effective personalized anti-spam filters. Filtron is based on the experimental results over several design parameters on four publicly available benchmark corpora. After describing Filtron's architecture, we assess its behavior in real use over a period of seven months. The results are deemed satisfactory, though they can be improved with more elaborate preprocessing and regular re-training.

  33. N. Karampatziakis, G. Paliouras, D. Pierrakos, P. Stamatopoulos, "Navigation pattern discovery using grammatical inference," In Proceedings of the 7th International Colloquium on Grammatical Inference (ICGI), Lecture Notes in Artificial Intelligence, n. 3264, pp. 187 - 198, Springer Verlag, 2004.

    We present a method for modeling user navigation on a web site using grammatical inference of stochastic regular grammars. With this method we achieve better models than the previously used first order Markov chains, in terms of predictive accuracy and utility of recommendations. In order to obtain comparable results, we apply the same grammatical inference algorithms on Markov chains, modeled as probabilistic automata. The automata induced in this way perform better than the original Markov chains, as models for user navigation, but they are considerably inferior to the automata induced by the traditional grammatical inference methods. The evaluation of our method was based on two web usage data sets from two very dissimilar web sites. It consisted in producing, for each user, a personalized list of recommendations and then measuring its recall and expected utility.

  34. G. Petasis, G. Paliouras, C. D. Spyropoulos, C. Halatsis, "eg-GRIDS: Context-Free Grammatical Inference from Positive Examples using Genetic Search," In Proceedings of the 7th International Colloquium on Grammatical Inference (ICGI), Lecture Notes in Artificial Intelligence, n. 3264, pp. 223 - 234, Springer Verlag, 2004.

    In this paper we present eg-GRIDS, an algorithm for inducing context-free grammars that is able to learn from positive sample sentences. The presented algorithm, similar to its GRIDS predecessors, uses simplicity as a criterion for directing inference, and a set of operators for exploring the search space. In addition to the basic beam search strategy of GRIDS, eg-GRIDS incorporates an evolutionary grammar selection process, aiming to explore a larger part of the search space. Evaluation results are presented on artificially generated data, comparing the performance of beam search and genetic search. These results show that genetic search performs better than beam search while being significantly more efficient computationally.

  35. G. Petasis, G. Paliouras, V. Karkaletsis, C. Halatsis, and C.D. Spyropoulos, "e-GRIDS: Computationally Efficient Grammatical Inference from Positive Examples," GRAMMARS, 2004.

    In this paper we present a new computationally efficient algorithm for inducing context-free grammars that is able to learn from positive sample sentences. This new algorithm uses simplicity as a criterion for directing inference, and the search process of the new algorithm has been optimised by utilising the results of a theoretical analysis regarding the behaviour and complexity of the search operators. Evaluation results are presented on artificially generated data, while the scalability of the algorithm is tested on a large textual corpus. These results show that the new algorithm performs well and can infer grammars from large data sets in a reasonable amount of time.

  36. A. Valarakos, G. Paliouras, V. Karkaletsis, G. Vouros, "A Name-Matching Algorithm for Supporting Ontology Enrichment," In Proceedings of the Panhellenic Conference in Artificial Intelligence (SETN), Lecture Notes in Artificial Intelligence, n. 3025, pp. 381-389, Springer Verlag, 2004.

    Ontologies are widely used for capturing and organizing knowledge of a particular domain of interest. This knowledge is usually evolvable and therefore an ontology maintenance process is required. In the context of ontology maintenance we tackle the problem that arises when an instance/individual is written differently (grammatically, orthographically, lexicographically), while representing the same entity/concept. This type of knowledge is captured into a semantic relationship and constitutes valuable information for many intelligent methods and systems. We enrich a domain ontology with instances that participate in this type of relationship, using a novel name matching method based on machine learning. We also show how the proposed method can support the discovery of new entities/concepts to be added to the ontology. Finally, we present experimental results for the enrichment of an ontology used in the multi-lingual information integration project CROSSMARC.

  37. A. Grigoriadis, G. Paliouras, "Focused Crawling using Temporal Difference-Learning," In Proceedings of the Panhellenic Conference in Artificial Intelligence (SETN), Lecture Notes in Artificial Intelligence, n. 3025, pp. 142-153, Springer Verlag, 2004.

    This paper deals with the problem of constructing an intelligent Focused Crawler, i.e. a system that is able to retrieve documents of a specific topic from the Web. The crawler must contain a component which assigns visiting priorities to the links, by estimating the probability of leading to a relevant page in the future. Reinforcement Learning was chosen as a method that fits this task nicely, as it provides a method for rewarding intermediate states to the goal. Initial results show that a crawler trained with Reinforcement Learning is able to retrieve relevant documents after a small number of steps.

  38. I. Androutsopoulos, G. Paliouras and E. Michelakis, "Learning to Filter Unsolicited Commercial E-Mail". Technical Report, No. 2004/2, NCSR "Demokritos", 2004 (updated 2006).

    We present a thorough investigation on using machine learning to construct effective personalized anti-spam filters. The investigation includes four learning algorithms, Naive Bayes, Flexible Bayes, LogitBoost, and Support Vector Machines, and four datasets, constructed from the mailboxes of different users. We discuss the model and search biases of the learning algorithms, along with worst-case computational complexity figures, and observe how the latter relate to experimental measurements. We study how classification accuracy is affected when using attributes that represent sequences of tokens, as opposed to single tokens, and explore the effect of the size of the attribute and training set, all within a cost-sensitive framework. Furthermore, we describe the architecture of a fully implemented learning-based anti-spam filter, and present an analysis of its behavior in real use over a period of seven months. Information is also provided on other available learning-based anti-spam filters, and alternative filtering approaches.

  39. D. Pierrakos, G. Paliouras, C. Papatheodorou and C.D. Spyropoulos, "Web Usage Mining as a tool for personalization: a survey". User Modeling and User-Adapted Interaction, v. 13, n. 4, pp. 311-372, 2003.

    This paper is a survey of recent work in the field of web usage mining for the benefit of research on the personalization of Web-based information services. The essence of personalization is the adaptability of information systems to the needs of their users. This issue is becoming increasingly important on the Web, as non-expert users are overwhelmed by the quantity of information available online, while commercial Web sites strive to add value to their services in order to create loyal relationships with their visitors-customers. This article views Web personalization through the prism of personalization policies adopted by Web sites and implementing a variety of functions. In this context, the area of Web usage mining is a valuable source of ideas and methods for the implementation of personalization functionality. We therefore present a survey of the most recent work in the field of Web usage mining, focusing on the problems that have been identified and the solutions that have been proposed.

  40. G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos and P. Stamatopoulos, "A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists". Information Retrieval, v. 6, n. 1, pp. 49-73, 2003.

    This paper presents an extensive empirical evaluation of memory-based learning in the context of anti-spam filtering, a novel cost-sensitive application of text categorization that attempts to identify automatically unsolicited commercial messages that flood mailboxes. Focusing on anti-spam filtering for mailing lists, a thorough investigation of the effectiveness of a memory-based anti-spam filter is performed using a publicly available corpus. The investigation includes different attribute and distance-weighting schemes, and studies on the effect of the neighborhood size, the size of the attribute set, and the size of the training corpus. Three different cost scenarios are identified, and suitable cost-sensitive evaluation functions are employed. We conclude that memory-based anti-spam filtering for mailing lists is practically feasible, especially when combined with additional safety nets. Compared to a previously tested Naive Bayes filter, the memory-based filter performs on average better, particularly when the misclassification cost for non-spam messages is high.

  41. G. Sigletos, G. Paliouras, C. D. Spyropoulos, M. Hatzopoulos. "Mining Web sites using wrapper induction, named entities and post-processing", Proceedings of the 1st European Web Mining Forum Workshop, Joint European Conference on Machine Learning and on Principles and Practices of Knowledge Discovery in Databases (ECML/PKDD), Cavtat-Dubrovnik, Croatia, 2003.

    This paper presents a novel method for extracting information from collections of Web pages across different sites. Our method uses a standard wrapper induction algorithm and exploits named entity information. We introduce the idea of post-processing the extraction results to resolve ambiguous facts and improve the overall extraction performance. Post-processing involves the exploitation of two additional sources of information: fact transition probabilities, based on a trained bigram model, and confidence probabilities, estimated for each fact by the wrapper induction system. A multiplicative model that is based on the product of those two probabilities is also considered for post-processing. Experiments were conducted on pages describing laptop products, collected from many different sites and in four different languages. The results highlight the effectiveness of our approach.

  42. G. Sigletos, G. Paliouras, C. D. Spyropoulos, P. Stamatopoulos. "Meta-learning beyond classification: A framework for information extraction from the Web", Proceedings of the Adaptive Text Extraction and Mining Workshop, Joint European Conference on Machine Learning and on Principles and Practices of Knowledge Discovery in Databases (ECML/PKDD), Cavtat-Dubrovnik, Croatia, 2003.

    This paper proposes a meta-learning framework in the context of information extraction from the Web. The proposed framework relies on learning a meta-level classifier, based on the output of base-level information extraction systems. Such systems are typically trained to recognize relevant information within documents, i.e., streams of lexical units, which differs significantly from the task of classifying feature vectors that is commonly assumed for meta-learning. The proposed framework was evaluated experimentally on the challenging task of training an information extraction system for multiple Web sites. Three well-known methods for training extraction systems were employed at the base level. A variety of classifiers were comparatively evaluated at the meta level. The extraction accuracy that was obtained demonstrated the effectiveness of the proposed framework of collaboration between base-level extraction systems and common classifiers at meta-level.

  43. D. Pierrakos, G. Paliouras, C. Papatheodorou, V. Karkaletsis, M. Dikaiakos. "Construction of Web Community Directories using Document Clustering and Web Usage Mining", Proceedings of the 1st European Web Mining Forum Workshop, Joint European Conference on Machine Learning and on Principles and Practices of Knowledge Discovery in Databases (ECML/PKDD), Cavtat-Dubrovnik, Croatia, 2003.

    This paper presents the concept of Web Community Directories, as a means of personalizing services on the Web, together with a novel methodology for the construction of these directories by document clustering and usage mining methods. The community models are extracted with the use of the Community Directory Miner, a simple cluster mining algorithm which has been extended to ascend a concept hierarchy, and specialize it to the needs of user communities. The initial concept hierarchy is generated by a content-based document clustering method. Communities are constructed on the basis of usage data collected by the proxy servers of an Internet Service Provider. These data present a number of peculiarities such as their large volume and semantic diversity. Initial results presented in the paper illustrate the use of the methodology and provide an indication of the behavior of the new mining method.

  44. A. Valarakos, G. Sigletos, V. Karkaletsis, G. Paliouras. "A Methodology for Semantically Annotating a Corpus Using a Domain Ontology and Machine Learning", Proceedings of the Recent Advances in Natural Language Processing International Conference (RANLP), Borovets, Bulgaria, 2003.

    In this paper we present a methodology for the semantic annotation of domain-specific corpora. This method relies on a domain ontology used initially for identifying and annotating domain-specific instances within the corpus. A machine learning-based information extraction system is then trained on the annotated corpus. The final result of this process is a model which is used to annotate new corpora in the specific domain. We applied the proposed methodology to a Web corpus, examining different ontology sizes using hidden Markov models. The paper presents the proposed methodology together with some first experimental results.

  45. A. Valarakos, G. Sigletos, V. Karkaletsis, G. Paliouras, G. Vouros. "A Methodology for Enriching a Multi-Lingual Domain Ontology using Machine Learning", Proceedings of the Workshop on Text Processing for Modern Greek: from Symbolic to Statistical Approaches, 6th International Conference in Greek Linguistics, Rethymno, Crete, 2003.

    Ontologies accumulate and organize knowledge in a machine-processable and human-readable way providing a common understanding basis. Enriching a multi-lingual ontology is crucial for the success of many knowledge-based systems. We present an iterative ontology-driven methodology that enriches a multi-lingual domain ontology with new instances, exploiting machine learning techniques. The methodology is user-centered and aims to ease the task of ontology maintenance. Our first experiments show the strong dependency between the size of the initial ontology and the performance of the machine learning-based method.

  46. K. Stamatakis, V. Karkaletsis, G. Paliouras, J. Horlock, C. Grover, J. R. Curran, S. Dingare. "Domain-Specific Web Site Identification: The CROSSMARC Focused Web Crawler", Proceedings of the Second International Workshop on Web Document Analysis (WDA), Edinburgh, UK, 2003.

    This paper presents techniques for identifying domain-specific Web sites that have been implemented as part of the EC-funded R&D project, CROSSMARC. The project aims to develop technology for extracting interesting information from domain-specific Web pages. It is therefore important for CROSSMARC to identify Web sites in which interesting domain-specific pages reside (focused Web crawling). This is the role of the CROSSMARC Web crawler.

  47. G. Sigletos, D. Farmakiotou, K. Stamatakis, G. Paliouras, V. Karkaletsis. "Annotating Web pages for the needs of Web Information Extraction applications", Poster in the proceedings of the 12th International World Wide Web Conference (WWW), Budapest, Hungary, 2003.

    This paper outlines our approach to the creation of annotated corpora for the purposes of Web Information Extraction, and presents the Web Annotation tool. This tool enables the annotation of Web pages from different domains and for different information extraction tasks providing a user-friendly interface to human annotators. Annotated information is stored in a representation format that can easily be exploited.

  48. D. Pierrakos, G. Paliouras, C. Papatheodorou, V. Karkaletsis, M. Dikaiakos. "Construction of Web Community Directories by Mining Usage Data", Proceedings of the Hellenic Data Management Symposium (HDMS), Athens, Greece, 2003.

    This paper introduces the concept of Web Community Directories, as a means of personalizing services on the Web, and presents a novel methodology for the construction of these directories by usage mining methods. The community models are extracted with the use of the Community Directory Miner, a simple cluster mining algorithm which has been extended to ascend a concept hierarchy, such as a Web directory, and specialize it to the needs of user communities. The construction of the communities is based on usage data collected by the proxy servers of an Internet Service Provider, which is also a task that has not been addressed in the literature. The examined data present a number of peculiarities such as their large volume and their semantic diversity. Initial results presented in the paper illustrate the use of the methodology and provide an indication of the behavior of the new usage mining method.

  49. G. Petasis, V. Karkaletsis, G. Paliouras and C. D. Spyropoulos. "Using the Ellogon Natural Language Engineering Infrastructure", Proceedings of the Workshop on Balkan Language Resources and Tools at the 1st Balkan Conference on Informatics (BCI), Thessaloniki, Greece, 2003.

    Ellogon is a multi-lingual, cross-operating system, general-purpose natural language engineering infrastructure. Ellogon has been used extensively in various NLP applications. It is currently provided free of charge for research use to research and academic organisations. In this paper, we outline its architecture and data model, present Ellogon features as used by different types of users and compare its functionality with that of other infrastructures for language engineering.

  50. G. Paliouras, C. Papatheodorou, V. Karkaletsis and C.D. Spyropoulos, "Discovering User Communities on the Internet Using Unsupervised Machine Learning Techniques". Interacting with Computers, v. 14, n. 6, pp. 761-791, 2002.

    Interest in the analysis of user behaviour on the Internet has been increasing rapidly, especially since the advent of electronic commerce. In this context, we argue here for the usefulness of constructing communities of users with common behaviour, making use of machine learning techniques. In particular, we assume that the users of any service on the Internet constitute a large community and we aim to construct smaller communities of users with common characteristics. The paper presents the results of three case studies for three different types of Internet service: a digital library, an information broker and a Web site. Particular attention is paid to the different types of information access involved in the three case studies: query-based information retrieval, profile-based information filtering and Web-site navigation. Each type of access imposes different constraints on the representation of the learning task. Two different unsupervised learning methods are evaluated: conceptual clustering and cluster mining. One of our main concerns is the construction of meaningful communities that can be used for improving information access on the Internet. Analysis of the results in the three case studies brings to the surface some of the important properties of the task, suggesting the feasibility of a common methodology for the three different types of information access on the Internet.

  51. G. Petasis, V. Karkaletsis, G. Paliouras, I. Androutsopoulos and C. D. Spyropoulos, "Ellogon: A New Text Engineering Platform". Proceedings of the International Conference on Language Resources and Evaluation (LREC), vol. I, pp. 72-78, Las Palmas, Spain, May, 2002. 

    This paper presents Ellogon, a multi-lingual, cross-platform, general-purpose text engineering environment. Ellogon was designed to aid both researchers in natural language processing and companies that produce language engineering systems for the end-user. Ellogon provides a powerful TIPSTER-based infrastructure for managing, storing and exchanging textual data, embedding and managing text processing components as well as visualising textual data and their associated linguistic information. Among its key features are full Unicode support, an extensive multi-lingual graphical user interface, its modular architecture and the reduced hardware requirements.

  52. G. Sigletos, G. Paliouras, V. Karkaletsis, "Role Identification From Free Text Using Hidden Markov Models". Proceedings of the Panhellenic Conference in Artificial Intelligence (SETN), Lecture Notes in Artificial Intelligence, n. 2308, Springer Verlag, pp. 167-178, 2002. 

    In this paper we explore the use of hidden Markov models on the task of role identification from free text. Role identification is an important stage of the information extraction process, assigning roles to particular types of entities with respect to a particular event. Hidden Markov models (HMMs) have been shown to achieve good performance when applied to information extraction tasks in both semistructured and free text. The main contribution of this work is the analysis of whether and how linguistic processing of textual data can improve the extraction performance of HMMs. The emphasis is on the minimal use of computationally expensive linguistic analysis. The overall conclusion is that the performance of HMMs is still worse than an equivalent manually constructed system. However, clear paths for improvement of the method are shown, aiming at a method that is easily adaptable to new domains.

  53. G. Petasis, S. Petridis, G. Paliouras, V. Karkaletsis, S. Perantonis, and C.D. Spyropoulos, "Symbolic and Neural Learning of Named-Entity Recognition and Classification Systems in Two Languages". In Advances in Computational Intelligence and Learning: Methods and Applications, H-J. Zimmermann, G. Tselentis, M. van Someren and G. Dounias (eds), Kluwer Academic Publishers, 2001. 

    This paper compares two alternative approaches to the problem of acquiring named-entity recognition and classification systems from training corpora, in two different languages. The process of named-entity recognition and classification is an important subtask in most language engineering applications, in particular information extraction, where different types of named entity are associated with specific roles in events. The manual construction of rules for the recognition of named entities is a tedious and time-consuming task. For this reason, effective methods to acquire such systems automatically from data are very desirable. In this paper we compare two popular learning methods on this task: a decision-tree induction method and a multi-layered feed-forward neural network. Particular emphasis is placed on the selection of the appropriate data representation for each method and the extraction of training examples from unstructured textual data. We compare the performance of the two methods on large corpora of English and Greek texts and present the results. In addition to the good performance of both methods, one very interesting result is the fact that a simple representation of the data, which ignores the order of the words within a named entity, leads to improved results over a more complex approach that preserves word order.

  54. H. Jessen and G. Paliouras, "Data Mining in Economics, Marketing and Finance". In Machine Learning and Applications, G. Paliouras, V. Karkaletsis and C.D. Spyropoulos (eds), Lecture Notes in Computer Science, n. 2049, pp. 303-307, Springer-Verlag, 2001. 

    [No abstract available.]

  55. K. Koutroumbas, G. Paliouras, V. Karkaletsis and C.D. Spyropoulos, "Comparison of Computational Learning Methods on a Diagnostic Cytological Application". Proceedings of the European Symposium on Intelligent Technologies, Hybrid Systems and their implementation on Smart Adaptive Systems (EUNITE), pp. 500-508, Tenerife, Spain, 2001. 

    In this paper we perform a comparative evaluation of four different computational learning methods on a problem of diagnostic cytology and more specifically on the classification of gastric cells. The methods considered are: Decision Tree Induction, Boosted Decision Trees, Naive Bayesian Classifier, and Radial Basis Function Neural Networks. The performance of each method was assessed on unseen data. Our aim was not to evaluate the quality of the algorithms as such, but to examine which of them are suitable for the specific medical diagnosis task, in order to provide a reliable diagnostic tool to the doctors involved in the area. We compare the performance of the four methods and discuss the results taking into account the characteristics of the methods and the task examined. The dataset that was used in this paper is publicly available, facilitating reproducibility of the results and providing a basis of comparison for future work.

  56. A. Grigoriadis, G. Paliouras, V. Karkaletsis and C.D. Spyropoulos, "Identifying Word Senses in Greek Text: A comparison of machine learning methods". Proceedings of the European Workshop on Intelligent Forecasting, Diagnosis and Control (IFDICON), pp. 107-113, Santorini, Greece, 2001. 

    In this paper we perform a comparative evaluation of machine learning methods on the task of identifying the correct sense of a word, based on the context in which it appears. This task is known as word sense disambiguation (WSD) and is one of the hardest and most interesting issues in language engineering. Research on the use of machine learning techniques for WSD has so far focused almost exclusively on English words, due to the scarcity of the required linguistic resources for other languages. The work presented here is the first attempt to apply machine learning methods to Greek words. We have constructed a semantically tagged corpus for two Greek words: a noun with clearly distinguishable senses and a verb with overlapping senses. This corpus is used to evaluate four different machine learning methods and three different representations of the context of the ambiguous word. Our results show that the simple naive Bayesian classifier and a method using Support Vector Machines outperform decision tree induction, even with the use of boosting. Furthermore, the use of a distance-based weighting function for the context of the ambiguous word does not seem to have a substantial effect on the performance of the methods.

  57. D. Pierrakos, G. Paliouras, C. Papatheodorou and C.D. Spyropoulos, "KOINOTITES: A Web Usage Mining Tool for Personalization". Proceedings of the Panhellenic Conference on Human Computer Interaction (PC-HCI), pp. 231-236, Patras, 2001. 

    This paper presents the Web Usage Mining system KOINOTITES, which uses data mining techniques for the construction of user communities on the Web. User communities model groups of visitors in a Web site, who have similar interests and navigational behaviour. We present the architecture of the system and the results that we obtained in a real Web site.

  58. G. Petasis, Frantz Vichot, Francis Wolinski, G. Paliouras, V. Karkaletsis, and C.D. Spyropoulos, "Using Machine Learning to Maintain Rule-based Named-Entity Recognition and Classification Systems". Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 426-433, Toulouse, 2001. 

    This paper presents a method that assists in maintaining a rule-based named-entity recognition and classification system. The underlying idea is to use a separate system, constructed with the use of machine learning, to monitor the performance of the rule-based system. The training data for the second system is generated with the use of the rule-based system, thus avoiding the need for manual tagging. The disagreement of the two systems acts as a signal for updating the rule-based system. The generality of the approach is illustrated by applying it to large corpora in two different languages: Greek and French. The results are very encouraging, showing that this alternative use of machine learning can assist significantly in the maintenance of rule-based systems.

  59. G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos and P. Stamatopoulos, "Stacking classifiers for anti-spam filtering of e-mail". Proceedings of the International Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 44-50, Carnegie Mellon University, 2001. 

    We evaluate empirically a scheme for combining classifiers, known as stacked generalization, in the context of anti-spam filtering, a novel cost-sensitive application of text categorization. Unsolicited commercial e-mail, or "spam", floods mailboxes, causing frustration, wasting bandwidth, and exposing minors to unsuitable content. Using a public corpus, we show that stacking can improve the efficiency of automatically induced anti-spam filters, and that such filters can be used in real-life applications.

  60. G. Petasis, A. Cucchiarelli, P. Velardi, G. Paliouras, V. Karkaletsis, and C.D. Spyropoulos, "Automatic adaptation of proper noun dictionaries through co-operation of machine learning and probabilistic methods". Proceedings of the 23rd ACM SIGIR Conference on R&D in IR (SIGIR), pp. 128-135, Athens, Greece, 2000. 

    The recognition of Proper Nouns (PNs) is considered an important task in the area of Information Retrieval and Extraction. However, the high performance of most existing PN classifiers heavily depends upon the availability of large dictionaries of domain-specific Proper Nouns, and a certain amount of manual work for rule writing or manual tagging. Though it is not a heavy requirement to rely on some existing PN dictionary (often these resources are available on the web), its coverage of a domain corpus may be rather low, in the absence of manual updating. In this paper we propose a technique for the automatic updating of a PN Dictionary through the cooperation of an inductive and a probabilistic classifier. In our experiments we show that, whenever an existing PN Dictionary allows the identification of 50% of the proper nouns within a corpus, our technique allows, without additional manual effort, the successful recognition of about 90% of the remaining 50%.

  61. G. Paliouras, C. Papatheodorou, V. Karkaletsis and C.D. Spyropoulos, "Clustering the Users of Large Web Sites into Communities," Proceedings of the International Conference on Machine Learning (ICML), pp. 719-726, Stanford, California, 2000.

    In this paper we analyze the performance of clustering methods on the task of constructing community models for the users of large Web sites. Community models represent patterns of usage of the Web site, which can be associated with different types of user. Knowledge of this type is clearly valuable for commercial sites, where each user is a potential customer. We argue that it is equally valuable for non-commercial sites, because it can assist greatly in the improvement of the site. We evaluate three clustering methods on usage data from a large site that covers on-line resources in Chemistry. The size of the site and its high hit rate impose a serious constraint on the scalability of the methods. We also examine two ways of encoding usage data, which give complementary information about the behavior of the users. Finally, the emphasis is on the construction of meaningful community models, by identifying the descriptive characteristics of communities, at a post-processing stage.

  62. K.V. Chandrinos, I. Androutsopoulos, G. Paliouras and C.D. Spyropoulos, "Automatic Web Rating: Filtering Obscene Content on the Web". Proceedings of the 4th European Conference on Research and Advanced Technology for Digital Libraries (ECDL), Lisbon, Portugal, Lecture Notes in Computer Science, n. 1923, pp. 403-406, Springer-Verlag, 2000.

    [No abstract available.]

  63. G. Paliouras, V. Karkaletsis, and C.D. Spyropoulos, "Learning Decision Trees for Named-Entity Recognition and Classification", Proceedings of the Workshop "Machine Learning for Information Extraction", European Conference on Artificial Intelligence, Berlin, Germany, 2000.

    We propose the use of decision tree induction as a solution to the problem of customising a named-entity recognition and classification (NERC) system to a specific domain. A NERC system assigns semantic tags to phrases that correspond to named entities, e.g. persons, locations and organisations. Typically, such a system makes use of two language resources: a recognition grammar and a lexicon of known names, classified by the corresponding named-entity types. NERC systems have been shown to achieve good results when the domain of application is very specific. However, the construction of the grammar and the lexicon for a new domain is a hard and time-consuming process. We propose the use of decision trees as NERC "grammars" and the construction of these trees using machine learning. In order to validate our approach, we tested C4.5 on the identification of person and organisation names involved in management succession events, using data from the sixth Message Understanding Conference. The results of the evaluation are very encouraging, showing that the induced tree can outperform a grammar that was constructed manually.

  64. G. Petasis, S. Petridis, G. Paliouras, V. Karkaletsis, S. Perantonis, and C.D. Spyropoulos, "Symbolic and Neural Learning for Named-Entity Recognition". Proceedings of the Symposium on Computational Intelligence and Learning (COIL), pp. 58-66, Chios, Greece, 2000.

    Named-entity recognition involves the identification and classification of named entities in text. This is an important subtask in most language engineering applications, in particular information extraction, where different types of named entity are associated with specific roles in events. The manual construction of rules for the recognition of named entities is a tedious and time-consuming task. For this reason, we present in this paper two approaches to learning named-entity recognition rules from text. The first approach is a decision-tree induction method and the second a multi-layered feed-forward neural network. Particular emphasis is placed on the selection of the appropriate feature set for each method and the extraction of training examples from unstructured textual data. We compare the performance of the two methods on a large corpus of English text and present the results.

  65. I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C.D. Spyropoulos and P. Stamatopoulos. "Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach". Proceedings of the Workshop "Machine Learning and Textual Information Access", European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), pp. 1-13, Lyon, France, 2000.

    We investigate the performance of two machine learning algorithms in the context of anti-spam filtering. The increasing volume of unsolicited bulk e-mail (spam) has generated a need for reliable anti-spam filters. Filters of this type have so far been based mostly on keyword patterns that are constructed by hand and perform poorly. The Naive Bayesian classifier has recently been suggested as an effective method for automatically constructing anti-spam filters with superior performance. We investigate thoroughly the performance of the Naive Bayesian filter on a publicly available corpus, contributing towards standard benchmarks. At the same time, we compare the performance of the Naive Bayesian filter to an alternative memory-based learning approach, after introducing suitable cost-sensitive evaluation measures.

  66. G. Paliouras, V. Karkaletsis, I. Androutsopoulos, and C.D. Spyropoulos, "Learning Rules for Large-Vocabulary Word Sense Disambiguation: A Comparison of Various Classifiers". Proceedings of the 2nd International Conference on Natural Language Processing (NLP), Patra, Greece. Lecture Notes in Artificial Intelligence, 1835, pp. 383-394, Springer, 2000.

    In this article we compare the performance of various machine learning algorithms on the task of constructing word-sense disambiguation rules from data. The distinguishing characteristic of our work from most of the related work in the field is that we aim at the disambiguation of all content words in the text, rather than focussing on a small number of words. In an earlier study we have shown that a decision tree induction algorithm performs well on this task. This study compares decision tree induction with other popular learning methods and discusses their advantages and disadvantages. Our results confirm the good performance of decision tree induction, which outperforms the other algorithms, due to its ability to order the features used for disambiguation, according to their contribution in assigning the correct sense.

  67. I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, G. Paliouras, and C.D. Spyropoulos, "An Evaluation of Naive Bayesian Anti-Spam Filtering". Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning (ECML), Barcelona, Spain, pp. 9-17, 2000.

    It has recently been argued that a Naive Bayesian classifier can be used to filter unsolicited bulk e-mail ("spam"). We conduct a thorough evaluation of this proposal on a corpus that we make publicly available, contributing towards standard benchmarks. At the same time we investigate the effect of attribute-set size, training-corpus size, lemmatization, and stop-lists on the filter's performance, issues that had not been previously explored. After introducing appropriate cost-sensitive evaluation measures, we reach the conclusion that additional safety nets are needed for the Naive Bayesian anti-spam filter to be viable in practice.

  68. G. Paliouras, C. Papatheodorou, V. Karkaletsis, P. Tzitziras and C.D. Spyropoulos, "Large-Scale Mining of Usage Data on Web Sites," AAAI Spring Symposium on Adaptive User Interfaces, Stanford, California, 2000.

    In this paper we present an approach to the discovery of trends in the usage of large Web-based information systems. This approach is based on the empirical analysis of the users' interaction with the system and the construction of user groups with common interests (user communities). The empirical analysis is achieved with the use of cluster mining, a technique that processes data collected from the users' interaction with the Web site. Our main concern is the construction of meaningful communities, which can be used for improving the structure of the site as well as for making suggestions to the users at a personal level. Our case study on a site providing information for researchers in Chemistry shows that the proposed method provides effective mining of large usage databases.

  69. S.M. Rudolfer, G. Paliouras and I. Peers, "A Comparison of Logistic Regression to Decision Tree Induction in the Diagnosis of Carpal Tunnel Syndrome," Computers and Biomedical Research, v. 32, pp. 391-414, 1999.

    This paper aims to compare and contrast two types of model (logistic regression and decision tree induction) for the diagnosis of carpal tunnel syndrome using four ordered classification categories. Initially, we present the classification performance results based on more than two covariates (multivariate case). Our results suggest that there is no significant difference between the two methods. Further to this investigation, we present a detailed comparison of the structure of bivariate versions of the models. The first surprising result of this analysis is that the classification accuracy of the bivariate models is slightly higher than that of the multivariate ones. In addition, the bivariate models lend themselves to graphical analysis, where the corresponding decision regions can easily be represented in the two-dimensional covariate space. This analysis reveals important structural differences between the two models.

  70. G. Paliouras and H.C. Jessen, "Statistical and Learning Approaches to Nonlinear Modeling of Labour Force Participation," Neural Network World, v. 9, n. 4, pp. 341-363, 1999.

    The decision of whether or not to join the labour market is complex and often involves nonlinearities. However, most econometric decision models are linear and therefore may not be able to capture all aspects of the decision problem. In recent years several interesting Machine Learning methods have emerged for estimating nonlinear models in a relatively straightforward manner. It is shown here that some of these methods achieve significantly better classification performance than the standard linear model. Furthermore, a graphical approach is taken for interpreting the nonlinear models for the examined problem.

  71. V. Karkaletsis, G. Paliouras, G. Petasis, N. Manousopoulou and C.D. Spyropoulos, "Named-Entity Recognition from Greek and English Texts". Journal of Intelligent and Robotic Systems, v. 26, n.2, pp. 123-135, 1999.

    Named-entity recognition (NER) involves the identification and classification of named entities in text. This is an important subtask in most language engineering applications, in particular information extraction, where different types of named entity are associated with specific roles in events. In this paper, we present a prototype NER system for Greek texts that we developed based on a NER system for English. Both systems are evaluated on corpora of the same domain and of similar size. The time-consuming process for the construction and update of domain-specific resources in both systems led us to examine a machine learning method for the automatic construction of such resources for a particular application in a specific language.

  72. G. Paliouras, C. Papatheodorou, V. Karkaletsis, C.D. Spyropoulos and P. Tzitziras, "From Web Usage Statistics to Web Usage Analysis," Proceedings of the IEEE International Conference on Systems Man and Cybernetics, v. II, pp. 159-164, 1999.

    The World Wide Web has become a major source of information that can be turned into valuable knowledge for individuals and organisations. In the work presented here, we are concerned with the extraction of meta-knowledge from the Web, in particular knowledge about Web usage, which is invaluable to the construction of Web sites that meet their purposes and prevent disorientation. Towards this goal, we propose the organisation of the users of a Web site into groups with common navigational behaviour (user communities). We view the task of building user communities as a data mining task, searching for interesting patterns within a database. The database that we use in our experiments consists of access logs collected from the Web site of the Advanced Course on Artificial Intelligence 1999. The unsupervised machine learning algorithm COBWEB is used to organise the users of the site, who follow similar paths, into a small set of communities. Particular attention is paid to the interpretation of the communities that are generated through this process. For this purpose, we use a simple metric to identify the representative navigational behaviour for each community. This information can then be used by the administrators of the site to re-organise it in a way that is tailored to the needs of each community. The proposed Web usage analysis is much more insightful than the common approach of examining simple usage statistics of the Web site.

  73. G. Paliouras, V. Karkaletsis and C.D. Spyropoulos, "Learning Rules for Large Vocabulary Word Sense Disambiguation," Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI '99), v. 2, pp. 674-679, 1999.

    Word Sense Disambiguation (WSD) is the process of distinguishing between different senses of a word. In general, the disambiguation rules differ for different words. For this reason, the automatic construction of disambiguation rules is highly desirable. One way to achieve this aim is by applying machine learning techniques to training data containing the various senses of the ambiguous words. In the work presented here, the decision tree learning algorithm C4.5 is applied on a corpus of financial news articles. Instead of concentrating on a small set of ambiguous words, as done in most of the related previous work, all content words of the examined corpus are disambiguated. Furthermore, the effectiveness of word sense disambiguation for different parts of speech (nouns and verbs) is examined empirically.

  74. G. Paliouras, V. Karkaletsis, C. Papatheodorou and C.D. Spyropoulos, "Exploiting Learning Techniques for the Acquisition of User Stereotypes and Communities," Proceedings of the International Conference on User Modeling (UM), CISM Courses and Lectures, n. 407, pp. 169-178, Springer-Verlag, 1999.

    In this paper we propose a methodology for acquiring user stereotypes and communities automatically from users' data. Stereotypes are built using supervised learning techniques (C4.5 and AQ15) on personal data extracted from a set of questionnaires answered by the users of a news filtering system. Particular emphasis is given to the characteristic features of the task of learning stereotypes and, in this context, the new notion of community stereotype is introduced. On the other hand, the communities are built using unsupervised learning (COBWEB) on data containing users' interests in the news categories covered by the news filtering system. Our main concern is whether meaningful communities can be constructed and for this purpose we specify a metric to decide on the representative news categories for each community. The encouraging results presented in this paper suggest that established machine learning methods can be particularly useful for the acquisition of stereotypes and communities.

  75. G. Petasis, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos and I. Androutsopoulos, "Using Machine Learning Techniques for Part-of-Speech Tagging in the Greek Language", Proceedings of the 7th Hellenic Conference on Informatics, Ioannina, Greece, 1999.

    This article investigates the use of Transformation-Based Error-Driven learning for resolving part-of-speech ambiguity in the Greek language. The aim is not only to study the performance, but also to examine its dependence on different thematic domains. Results are presented here for two different test cases: a corpus on "management succession events" and a general-theme corpus. The two experiments show that the performance of this method does not depend on the thematic domain of the corpus, and its accuracy for the Greek language is around 95%.

  76. G. Paliouras, C. Papatheodorou, V. Karkaletsis, C.D. Spyropoulos and V. Malaveta, "Learning User Communities for Improving the Services of Information Providers," Proceedings of the European Conference on Research and Advanced Technology for Digital Libraries (ECDL), Lecture Notes in Computer Science, n. 1513, pp. 367-384, Springer-Verlag, 1998.

    In this paper we propose a methodology for organising the users of an information providing system into groups with common interests (communities). The communities are built using unsupervised learning techniques on data collected from the users (user models). We examine a system that filters news on the Internet, according to the interests of the registered users. Each user model contains the user's interests on the news categories covered by the information providing system. Two learning algorithms are evaluated: COBWEB and ITERATE. Our main concern is whether meaningful communities can be constructed. We specify a metric to decide which news categories are representative for each community. The construction of meaningful communities can be used for improving the structure of the information providing system as well as for suggesting extensions to individual user models. Encouraging results on a large data-set lead us to consider this work as a first step towards a method that can easily be integrated in a variety of information systems.

  77. G. Paliouras, V. Karkaletsis and C.D. Spyropoulos, "Machine Learning for Domain-Adaptive Word Sense Disambiguation," In Proceedings of the Workshop on Adapting Lexical and Corpus Resources to Sublanguages and Applications, International Conference on Language Resources and Evaluation, Granada, Spain, May 26, 1998.

    This paper investigates the use of machine learning techniques for word sense disambiguation. The aim is to improve on the performance of general-purpose methods, by making the disambiguation method adaptable to new domains. Results are presented here for two different test cases: financial news from the Wall Street Journal, extracted from the SEMCOR corpus, and general-theme news from the same corpus. The two experiments show that the adaptive disambiguation method can achieve high recall and precision; more so in the restricted domain of financial news than in the general-theme case.

  78. G. Paliouras and D.S. Bree, "Adaptive Event Recognition with the use of Limited Training Data," In Recent Advances in Information Science and Technology, N.E. Mastorakis (ed.), pp. 225-232, World Scientific, 1998.

    This paper presents a novel event recognition system, which is capable of adapting itself to improve its performance on a small set of training data. The event recognition system is represented by a network of events, related to each other by temporal constraints. This symbolic representation is particularly suitable to the treatment of overlapping events, which have been overlooked in most of the work on event recognition. Additionally, a method for refining the temporal parameters of the recognition system is presented here. The method uses a small set of preclassified training examples to improve the performance of the system. The principle of minimal model change is used to overcome the sparseness of the training data. Particular emphasis is given to the issue of multiple positive examples, which is prevalent when allowing overlapping events. The new system has been applied to the thematic analysis of humpback whale songs with encouraging results.

  79. S.M. Rudolfer, G. Paliouras and I. Peers, "Diagnostic Strategies for Carpal Tunnel Syndrome," Conference of the European Society for Medical Decision Making, Turin, Italy, 1996.

    Carpal Tunnel Syndrome or CTS (entrapment of the median nerve at the wrist) is the most commonly occurring neurological condition referred to hospital electromyography clinics for investigation. Its diagnosis requires specialised equipment to carry out so-called nerve conduction studies. These are combined with the patient's history and clinical examination to enable the clinician to reach a diagnosis. For the purposes of this study, four diagnostic classes were used: No Abnormality Detected, mild CTS, moderate CTS and severe CTS.
    The aims here were to use a data set, supplied by the late Dr. John L. James, Consultant Physician, St. Luke's Hospital, Huddersfield, to: (1) compare the diagnostic performances of decision tree (DT) induction and logistic regression (LR), (2) investigate the relative importance of patients' history, clinical examination and nerve conduction studies for the diagnostic performances of DT and LR. One important aspect of nerve conduction studies is non-response to electrical stimulus in some of the measurements. Such values were coded as either 99.9 or 0, according to the type of variable (latency or amplitude, respectively). LR was not able to handle non-responses directly, whereas DT was. For this reason, nerve conduction variables were coded into at most fourteen ordered values, using the quartiles of the four diagnostic classes, with non-response as an extra value at the appropriate end of the range.
    The data set, consisting of 1710 hands, was randomly split into a design set (850 hands) and a test set (860 hands). Attention was restricted to two models: M1, involving age, sex and handedness, together with nine nerve conduction variables, and M2, including in addition 24 history variables and 5 clinical sign variables.
    For both M1 and M2, the DT performed better than the corresponding LR models, and used four nerve conduction variables only. Its performances for coded and uncoded nerve conduction values were virtually the same. For M1 and M2, the DT achieved a correct classification rate of 78.5% (hard threshold) and 79.2% (soft threshold). For model M1, the LR with backward elimination used four variables (all NCS), three of which coincided with those used by the DT, and had a correct classification rate of 71.4%; the LR without backward elimination had a correct classification rate of 70.8%; the LR using the four variables selected by the DT had a correct classification rate of 71.9%. For model M2, the LR with backward elimination used 8 variables (4 NCS, 3 history and one clinical sign), and had a correct classification rate of 46.2%; the LR's correct classification rate without backward elimination was 44.9%. Possible reasons for this poor performance, and alternative strategies, will be discussed. The diagnostic performances of all the models were not improved by including the history and clinical signs.

  80. M. Brown and G. Paliouras, Review of: Inside Case Based Explanation, by R. Schank et al., Minds and Machines, v. 7, n. (1 or 2), 1997.

    [No abstract available.]

  81. H.C. Jessen and G. Paliouras, "Predicting Labour Force Participation of Women with the use of Statistical and Learning Classification Techniques," European Conference in Non-Linear Econometrics (EC2), Aarhus, Denmark, 1995.

    Traditionally, econometric models have been based on regression methods. One limitation of these methods is their restricted ability to extract complex relations between the independent variables of the model. In particular, in classification tasks, the methods that are typically used can only model linear discrimination between the examined classes. In this paper we use the task of predicting labour force participation of women to illustrate these problems. This is achieved by comparing the classification performance of logistic regression with two newly developed methods originating from the field of Machine Learning (Neural Networks and Decision Trees). The latter are able to construct non-linear discrimination surfaces and achieve a high out-of-sample classification performance. Encouraged by these results, we attempt to achieve a similar increase in the performance of the logit, by introducing non-linear terms in the model. We then go on to examine the similarities and differences between the three types of non-linear model, in terms of the discrimination and probability surfaces. Finally, we use the latter to express our concerns about the interpretation of the probabilities, especially with respect to the elasticity of labour force participation to wages.

  82. G. Paliouras and D.S. Bree, "The Effect of Numeric Features on the Scalability of Inductive Learning," Proceedings of the European Conference on Machine Learning (ECML), Lecture Notes in Artificial Intelligence, n. 912, pp. 218-231, Springer-Verlag, 1995.

    The behaviour of a learning program as the quantity of data increases affects to a large extent its applicability on real-world problems. This paper presents the results of a theoretical and experimental investigation of the scalability of four well-known empirical concept learning programs. In particular it examines the effect of using numeric features in the training set. The theoretical part of the work involved a detailed worst-case computational complexity analysis of the algorithms. The results of the analysis deviate substantially from previously reported estimates, which have mainly examined discrete and finite feature spaces. In order to test these results, a set of experiments was carried out, involving one artificial and two real data sets. The artificial data set introduces a near-worst-case situation for the examined algorithms, while the real data sets provide an indication of their average-case behaviour.