In this paper we study the identifiability of users across social networks, with a trainable combination of different similarity metrics. This application is becoming particularly interesting as the number and variety of social networks increase and the presence of individuals in multiple networks is becoming commonplace. Motivated by the need to verify information that appears in social networks, as addressed by the research project REVEAL, the presence of individuals in different networks provides an interesting opportunity: we can use information from one network to verify information that appears in another. In order to achieve this, we need to identify users across networks. We approach this problem by a combination of similarity measures that take into account the users’ affiliation, location, professional interests and past experience, as stated in the different networks. We experimented with a variety of combination approaches, ranging from simple averaging to trained hybrid models. Our experiments show that, under certain conditions, identification is possible with sufficiently high accuracy to support the goal of verification.
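To make the trained combination concrete, the following minimal sketch (with hypothetical profile fields and toy data, not the paper's actual features or model) computes per-field similarity scores for a pair of profiles and lets a logistic regression learn how to weigh them, next to the simple-averaging baseline mentioned above.

```python
# Illustrative sketch (not the paper's exact method): combine per-field
# similarity scores between two profiles with a trained classifier.
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression
import numpy as np

def field_similarity(a, b):
    """Simple string similarity in [0, 1] for a single profile field."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

FIELDS = ["name", "affiliation", "location", "interests"]  # hypothetical fields

def pair_features(profile_a, profile_b):
    """One similarity score per field; this is the feature vector of a profile pair."""
    return [field_similarity(profile_a.get(f, ""), profile_b.get(f, "")) for f in FIELDS]

# Toy training data: pairs labelled 1 (same user) or 0 (different users).
pairs = [
    ({"name": "J. Doe", "affiliation": "NCSR Demokritos", "location": "Athens", "interests": "NLP"},
     {"name": "Jane Doe", "affiliation": "Demokritos", "location": "Athens, GR", "interests": "nlp, ml"}, 1),
    ({"name": "J. Doe", "affiliation": "NCSR Demokritos", "location": "Athens", "interests": "NLP"},
     {"name": "John Smith", "affiliation": "ACME", "location": "London", "interests": "finance"}, 0),
]
X = np.array([pair_features(a, b) for a, b, _ in pairs])
y = np.array([label for _, _, label in pairs])

# The "trained combination": a classifier learns how much each similarity matters.
model = LogisticRegression().fit(X, y)

# Simple averaging baseline vs. the trained combination for a candidate pair.
new_pair = pair_features(pairs[0][0], pairs[0][1])
print("average similarity:", np.mean(new_pair))
print("trained match probability:", model.predict_proba([new_pair])[0, 1])
```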
This paper describes the CLEF QA Track 2015. Following the scenario stated last year for the CLEF QA Track, the starting point for accessing information is always a Natural Language question. However, answering some questions may need to query Linked Data (especially if aggregations or logical inferences are required), some questions may need textual inferences and querying free-text, and finally, answering some queries may require both sources of information. In this edition, the Track was divided into four tasks: (i) QALD: focused on translating natural language questions into SPARQL; (ii) Entrance Exams: focused on answering questions to assess machine reading capabilities; (iii) BioASQ1: focused on large-scale semantic indexing; and (iv) BioASQ2: focused on Question Answering in the biomedical domain.
Learning programs in the Event Calculus with Inductive Logic Programming is a challenging task that requires proper handling of negation and unobserved predicates. Learners that are able to handle such issues typically utilize abduction to account for unobserved supervision, and learn by generalizing all examples simultaneously to ensure soundness, at the cost of an often intractable search space. In this work, we propose an alternative approach, where a semi-supervised framework is used to obtain the unobserved supervision, and then a hypothesis is constructed by a divide-and-conquer search. We evaluate our approach on a real-life activity recognition application.
We studied the predictability of community evolution in online social networks as a supervised learning task with sequential and non-sequential classifiers. Communities that are formed in online social networks as a result of user interaction evolve over time. Structural, content and contextual features, as well as the previous states of a community, are used as features for the community evolution prediction task. The evolution phenomena we try to predict are the continuation, shrinking, growth and dissolution of communities. The evolution labels stem from a community tracker that provided the ground truth. We have obtained interesting results on a dataset from Twitter.
In the past years social media services have received content contributions from millions of users, making them a fruitful source for data analysis. In this paper we present a novel approach for mining Twitter data in order to extract factual information concerning trending events. Our approach is based on relation extraction between named entities, such as people, organizations and locations. The experiments and the obtained results suggest that relation extraction can help in extracting events in social media, when combined with pre- and post-processing steps.
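As a rough illustration of relation extraction between named entities (the pipeline, entity types and extraction pattern below are illustrative assumptions, not the system described in the paper), one could pair entities that co-occur in a sentence and keep the verb between them:

```python
# Illustrative sketch: extract simple (entity, predicate, entity) triples
# from tweet-like text with spaCy NER.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_relations(text):
    """Return (subject_entity, verb, object_entity) triples for entity pairs
    that appear in the same sentence with a verb between them."""
    doc = nlp(text)
    triples = []
    for sent in doc.sents:
        ents = [e for e in sent.ents if e.label_ in ("PERSON", "ORG", "GPE")]
        for i in range(len(ents) - 1):
            e1, e2 = ents[i], ents[i + 1]
            verbs = [t.lemma_ for t in doc[e1.end:e2.start] if t.pos_ == "VERB"]
            if verbs:
                triples.append((e1.text, verbs[0], e2.text))
    return triples

print(extract_relations("Apple opened a new store in Berlin. Tim Cook visited Germany yesterday."))
```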
The emergence of social media and the enormous growth of social networks have initiated a great amount of research in social influence analysis. In this regard, many approaches take into account only structural information while a few have also incorporated content. In this study we propose a new method to rank users according to their topic-sensitive influence which utilizes a priori information by employing supervised random walks. We explore the use of supervision in a PageRank-like random walk while also exploiting textual information from the available content. We perform a set of experiments on Twitter datasets and evaluate our findings.
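A minimal sketch of the topic-sensitive, PageRank-like ranking idea is shown below; it assumes fixed edge weights and given per-user topic-relevance scores, whereas the paper learns such weights with supervised random walks.

```python
# Minimal sketch of topic-sensitive, PageRank-like influence ranking.
# The supervised part (learning edge weights) is omitted here; we assume
# edge weights and per-user topic relevance scores are given.
import networkx as nx

# Follower graph: an edge u -> v means u's walk can move to v (v influences u).
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("alice", "bob", 1.0), ("carol", "bob", 0.5),
    ("bob", "dave", 0.8), ("carol", "dave", 1.0), ("dave", "alice", 0.3),
])

# Topic relevance of each user's content (e.g. from their tweets), used as
# the teleportation distribution: the walk restarts at topically relevant users.
topic_relevance = {"alice": 0.1, "bob": 0.7, "carol": 0.1, "dave": 0.9}

scores = nx.pagerank(G, alpha=0.85, personalization=topic_relevance, weight="weight")
for user, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(user, round(score, 3))
```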
Complex Event Recognition (CER) applications exhibit various types of uncertainty, ranging from incomplete and erroneous data streams to imperfect complex event patterns. We review CER techniques that handle, to some extent, uncertainty. We examine both automata-based techniques, which are the most commonly used, and logic-based ones, which are less frequently used. A number of limitations are identified with respect to the employed languages, their probabilistic models and their performance, as compared to the purely deterministic cases.
Event recognition systems rely on properly engineered knowledge bases of event definitions to infer occurrences of events in time. The manual development of such knowledge is a tedious and error-prone task, thus event-based applications may benefit from automated knowledge construction techniques, such as Inductive Logic Programming (ILP), which combines machine learning with the declarative and formal semantics of First-Order Logic. However, learning temporal logical formalisms, which are typically utilized by logic-based Event Recognition systems, is a challenging task that most ILP systems cannot fully undertake. In addition, event-based data is usually massive and collected at different times and under various circumstances. Ideally, systems that learn from temporal data should be able to operate in an incremental mode, that is, revise prior constructed knowledge in the face of new evidence. Most ILP systems are batch learners, in the sense that in order to account for new evidence they have no alternative but to forget past knowledge and learn from scratch. Given the increased inherent complexity of ILP and the volumes of real-life temporal data, this results in algorithms that scale poorly. In this work we present an incremental method for learning and revising event-based knowledge, in the form of Event Calculus programs. The proposed algorithm relies on abductive-inductive learning and comprises a scalable clause refinement methodology, based on a compressive summarization of clause coverage in a stream of examples. We present an empirical evaluation of our approach on real and synthetic data from activity recognition and city transport applications.
Systems for symbolic event recognition accept as input a stream of time-stamped events from sensors and other computational devices, and seek to identify high-level composite events, collections of events that satisfy some pattern. RTEC is an Event Calculus dialect with novel implementation and `windowing' techniques that allow for efficient event recognition, scalable to large data streams. RTEC supports the expression of rather complex events, such as `two people are fighting', using simple primitives. It can operate in the absence of filtering modules, as it is only slightly affected by data that are irrelevant to the events we want to recognise. Furthermore, RTEC can deal with applications where event data arrive with a (variable) delay from, and are revised by, the underlying sources. RTEC can update already recognised events and recognise new events when data arrive with a delay or following data revision. We evaluate RTEC both theoretically, presenting a complexity analysis, and experimentally, using two real-world applications. The evaluation shows that RTEC can support real-time event recognition and is capable of meeting the performance requirements identified in a survey of event processing use cases.
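For background, a simplified Event Calculus formulation of the law of inertia is sketched below in standard logic-programming style; this is only an illustration, as RTEC's actual dialect computes maximal intervals of fluents rather than point-wise queries.

```latex
% Simplified Event Calculus axioms expressing the law of inertia
% (illustrative; RTEC's dialect computes maximal intervals of fluents).
\begin{align*}
\mathit{holdsAt}(F{=}V,\;T) \leftarrow\;& \mathit{initiatedAt}(F{=}V,\;T_s),\; T_s < T,\; \neg\,\mathit{broken}(F{=}V,\;T_s,\;T).\\
\mathit{broken}(F{=}V,\;T_s,\;T) \leftarrow\;& \mathit{terminatedAt}(F{=}V,\;T_f),\; T_s < T_f < T.\\
\mathit{broken}(F{=}V,\;T_s,\;T) \leftarrow\;& \mathit{initiatedAt}(F{=}V',\;T_f),\; V' \neq V,\; T_s < T_f < T.
\end{align*}
```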
Background: This article provides an overview of the first BIOASQ challenge, a competition on large-scale biomedical semantic indexing and question answering (QA), which took place between March and September 2013. BIOASQ assesses the ability of systems to semantically index very large numbers of biomedical scientific articles, and to return concise and user-understandable answers to given natural language questions by combining information from biomedical articles and ontologies. Results: The 2013 BIOASQ competition comprised two tasks, Task 1a and Task 1b. In Task 1a participants were asked to automatically annotate new PUBMED documents with MESH headings. Twelve teams participated in Task 1a, with a total of 46 system runs submitted, and one of the teams performing consistently better than the MTI indexer used by NLM to suggest MESH headings to curators. Task 1b used benchmark datasets containing 29 development and 282 test English questions, along with gold standard (reference) answers, prepared by a team of biomedical experts from around Europe, and participants had to automatically produce answers. Three teams participated in Task 1b, with 11 system runs. The BIOASQ infrastructure, including benchmark datasets, evaluation mechanisms, and the results of the participants and baseline methods, is publicly available. Conclusions: A publicly available evaluation infrastructure for biomedical semantic indexing and QA has been developed, which includes benchmark datasets, and can be used to evaluate systems that: assign MESH headings to published articles or to English questions; retrieve relevant RDF triples from ontologies, relevant articles and snippets from PUBMED Central; produce “exact” and paragraph-sized “ideal” answers (summaries). The results of the systems that participated in the 2013 BIOASQ competition are promising. In Task 1a one of the systems performed consistently better than the NLM’s MTI indexer. In Task 1b the systems received high scores in the manual evaluation of the “ideal” answers; hence, they produced high quality summaries as answers. Overall, BIOASQ helped obtain a unified view of how techniques from text classification, semantic indexing, document and passage retrieval, question answering, and text summarization can be combined to allow biomedical experts to obtain concise, user-understandable answers to questions reflecting their real information needs.
Symbolic event recognition systems have been successfully applied to a variety of application domains, extracting useful information in the form of events, allowing experts or other systems to monitor and respond when significant events are recognised. In a typical event recognition application, however, these systems often have to deal with a significant amount of uncertainty. In this article, we address the issue of uncertainty in logic-based event recognition by extending the Event Calculus with probabilistic reasoning. Markov logic networks are a natural candidate for our logic-based formalism. However, the temporal semantics of the Event Calculus introduce a number of challenges for the proposed model. We show how and under what assumptions we can overcome these problems. Additionally, we study how probabilistic modelling changes the behaviour of the formalism, affecting its key property—the inertia of fluents. Furthermore, we demonstrate the advantages of the probabilistic Event Calculus through examples and experiments in the domain of activity recognition, using a publicly available dataset for video surveillance.
We present a system for recognising human activity given a symbolic representation of video content. The input of our system is a set of time-stamped short-term activities (STA) detected on video frames. The output is a set of recognised long-term activities (LTA), which are pre-defined temporal combinations of STA. The constraints on the STA that, if satisfied, lead to the recognition of an LTA, have been expressed using a dialect of the Event Calculus. In order to handle the uncertainty that naturally occurs in human activity recognition, we adapted this dialect to a state-of-the-art probabilistic logic programming framework. We present a detailed evaluation and comparison of the crisp and probabilistic approaches through experimentation on a benchmark dataset of human surveillance videos.
The goal of this task is to push the research frontier towards hybrid information systems. We aim to promote systems and approaches that are able to deal with the whole diversity of the Web, especially for, but not restricted to, the context of bio-medicine. This goal is pursued by the organization of challenges. The second challenge consisted of two tasks: semantic indexing and question answering. In total, 61 systems from 18 different teams participated in the semantic indexing task, of which between 25 and 45 participated in each batch. The semantic indexing task was tackled by 22 systems, which were developed by 8 different organizations. Between 15 and 19 of these systems addressed each batch. The question answering task was tackled by 18 different systems, developed by 7 different organizations. Between 9 and 15 of these systems submitted results in each batch. Overall, the best systems were able to outperform the strong baselines provided by the organizers.
Many classification problems are related to a hierarchy of classes that can be exploited in order to perform hierarchical classification of test objects. The most basic approach to hierarchical classification is cascade classification, which greedily traverses the hierarchy from the root to the predicted leaf. In order to perform cascade classification, a classifier must be trained for each node of the hierarchy. In large scale problems, the number of features can be prohibitively large for the classifiers in the upper levels of the hierarchy. It is therefore desirable to reduce the dimensionality of the feature space at these levels. In this paper we examine the computational feasibility of the most common dimensionality reduction method (Principal Component Analysis) for this problem, as well as the computational benefits that it provides for cascade classification and its effect on classification accuracy. Our experiments on two benchmark datasets with a large hierarchy show that it is possible to perform a certain version of PCA efficiently in such large hierarchies, with a slight decrease in the accuracy of the classifiers. Furthermore, we show that PCA can be used selectively at the top levels of the hierarchy in order to decrease the loss in accuracy. Finally, the reduced feature space, provided by the PCA, facilitates the use of more costly and possibly more accurate classifiers, such as non-linear SVMs.
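The cascade-with-dimensionality-reduction idea can be sketched as follows on a toy two-level hierarchy; TruncatedSVD stands in for a PCA-like reduction on sparse text features, and the data and hierarchy are invented for illustration.

```python
# Minimal sketch of cascade hierarchical classification with PCA-style
# dimensionality reduction at the top level (toy two-level hierarchy).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD   # PCA variant suitable for sparse text
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Toy hierarchy: root -> {science, sports}; science -> {physics, biology}, sports -> {football, tennis}.
docs = ["quantum particles and energy", "cells proteins and genes",
        "goal scored in the match", "serve and volley on grass",
        "gravity and relativity theory", "dna sequencing experiment",
        "league cup final result", "tennis grand slam champion"]
top = ["science", "science", "sports", "sports", "science", "science", "sports", "sports"]
leaf = ["physics", "biology", "football", "tennis", "physics", "biology", "football", "tennis"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Root classifier: reduce dimensionality first, then classify into top-level classes.
root_clf = make_pipeline(TruncatedSVD(n_components=4), LinearSVC()).fit(X, top)

# One classifier per internal node, trained only on that node's documents (full feature space).
node_clf = {}
for node in ("science", "sports"):
    idx = [i for i, t in enumerate(top) if t == node]
    node_clf[node] = LinearSVC().fit(X[idx], [leaf[i] for i in idx])

def cascade_predict(text):
    x = vec.transform([text])
    node = root_clf.predict(x)[0]          # greedy step at the root
    return node_clf[node].predict(x)[0]    # then descend to the predicted leaf

print(cascade_predict("the final match was decided by a late goal"))
```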
Most common methods for analysing genomic sequence composition are based on the bag-of-words approach and thus largely ignore the original sequence structure or the relative positioning of its constituent oligonucleotides. We here present a novel methodology that takes into account both word representation and relative positioning at various length scales in the form of n-gram graphs (NGG). We applied the NGG approach to short vertebrate and invertebrate constrained genomic sequences of various origins and predicted functionalities and were able to efficiently distinguish DNA sequences belonging to the same species (intra-species classification). As an alternative method, we also applied the Genomic Signatures (GS) approach to the same sequences. To our knowledge, this is the first time that GS are applied to short sequences, rather than whole genomes. Together, the presented results suggest that NGG is an efficient method for classifying sequences, originating from a given genome, according to their function.
This paper introduces a method that deals with unwanted mail messages by combining active learning with incremental clustering. The proposed approach is motivated by the fact that the user cannot provide the correct category for all received messages. The email messages are divided into chronological batches (e.g. one per day). The user is asked to give the correct categories (labels) for the messages of the first batch and from then on the proposed algorithm decides when to ask for a new label, based on a clustering of the messages that is incrementally updated. We test different variants of the algorithm on a number of different datasets and show that it achieves very good results with only 2% of all email messages labelled by the user.
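A minimal sketch of the underlying idea, with an arbitrary distance threshold and simplified centroid updates rather than the paper's exact algorithm, is the following:

```python
# Minimal sketch: process email in chronological batches, keep an incrementally
# updated set of labelled cluster centroids, and ask the user for a label only
# when a message is far from every existing cluster (threshold is arbitrary).
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=2**10, alternate_sign=False)
THRESHOLD = 0.7        # ask for a label when cosine distance to the nearest cluster exceeds this

clusters = []          # each cluster: {"centroid": vec, "count": int, "label": str}

def cosine_distance(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 1.0 if na == 0 or nb == 0 else 1.0 - float(a @ b) / (na * nb)

def process(message, ask_user):
    x = vectorizer.transform([message]).toarray()[0]
    if clusters:
        dists = [cosine_distance(x, c["centroid"]) for c in clusters]
        i = int(np.argmin(dists))
        if dists[i] <= THRESHOLD:                      # confident: reuse the cluster label
            c = clusters[i]
            c["centroid"] = (c["centroid"] * c["count"] + x) / (c["count"] + 1)
            c["count"] += 1
            return c["label"]
    label = ask_user(message)                          # uncertain: active-learning query
    clusters.append({"centroid": x, "count": 1, "label": label})
    return label

# The first batch is fully labelled by the user; afterwards labels are requested on demand.
batch1 = [("win a free prize now", "spam"), ("meeting agenda for monday", "work")]
for msg, lbl in batch1:
    process(msg, ask_user=lambda m, l=lbl: l)
print(process("free prize waiting for you", ask_user=lambda m: "spam"))
```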
Developing intelligent systems towards automated clinical monitoring and assistance for the elderly is attracting growing attention. USEFIL is an FP7 project aiming to provide health-care assistance in a smart-home setting. We present the data fusion component of USEFIL which is based on a complex event recognition methodology. In particular, we present our knowledge-driven approach to the detection of Activities of Daily Living (ADL) and functional ability, based on a probabilistic version of the Event Calculus. To investigate the feasibility of our approach, we present an empirical evaluation on synthetic data.
In this work, we consider a transfer learning approach based on K-means for splice site recognition. We use different representations for the sequences, based on n-gram graphs. In addition, a novel representation based on the secondary structure of the sequences is proposed. We evaluate our approach on genomic sequence data from model organisms of varying evolutionary distance. The initial results indicate that the proposed representations are promising for the problem of splice site recognition.
This paper provides an overview of the workshop Web-Scale Classification: Web Classification in the Big Data Era, which was held in New York City on February 28th as a workshop of the seventh International Conference on Web Search and Data Mining. The goal of the workshop was to discuss and assess recent research focusing on classification and mining in Web-scale category systems. The workshop brought together members of several communities, such as web mining, machine learning, text classification and social media mining.
This paper proposes a methodology for proactive event-driven decision making. Proper decisions are made by forecasting events prior to their occurrence. Motivation for proactive decision making stems from social and economic factors, and is based on the fact that prevention is often more effective than the cure. The decisions are made in real time and require swift and immediate processing of Big Data, that is, extremely large amounts of noisy data flooding in from various locations, as well as historical data. The methodology will recognize and forecast opportunities and threats, making the decision to capitalize on the opportunities and mitigate the threats. This will be complemented by user interaction and the decisions of human operators, in order to ultimately facilitate proactive decision making.
Modern assistive environments have the ability to collect data from various distributed sources and need to react swiftly to changes. As information flows, in the form of simple, source events, it becomes more and more difficult to quickly analyze the collected data in an automated way and transform them into operational knowledge. Event recognition (ER) addresses this problem. Several tools exist for defining ER rules, but only a few of them offer graphical design environments. Each such tool supports a single ER language, either query-based or rule-based. Also, many of these systems do not support the addition of user-defined operators, thus limiting the flexibility in rule design. This paper presents the Event Recognition Designer Toolkit (ERDT), a graphical authoring tool with which a domain expert can design event recognition rules and produce standalone Event Recognizers. The goal was to develop a user-friendly graphical tool with a basic set of operators, so that a user could easily produce recognizers for different domains and, when needed, easily extend the tool in order to satisfy domain-specific requirements. The ERDT uses an extendable pool of ER language libraries (at the moment SQL and Event Calculus are supported) and transforms the designed rules into Event Recognizers that use the preferred ER language. The same rule can be expressed in different languages without any changes to the design. Furthermore, the authoring tool is cross platform, free, and open source, so that it can be shared with the community, maximizing its potential impact and possible extension.
A new transfer learning method is presented in this paper, addressing a particularly hard transfer learning problem: the case where the target domain shares only a subset of its classes with the source domain and only unlabeled data are provided for the target domain. This is a situation that occurs frequently in real-world applications, such as the multiclass document classification problems that motivated our work. The proposed approach is a transfer learning variant of the Probabilistic Latent Semantic Analysis (PLSA) model that we name TL-PLSA. Unlike most approaches in the literature, TL-PLSA captures both the difference of the domains and the commonalities of the class sets, given no labelled data from the target domain. We perform experiments over three different datasets and show the difficulty of the task, as well as the promising results that we obtained with the new method.
We present the application of a recently proposed probabilistic logical formalism on the task of sensor data fusion in the USEFIL project. USEFIL seeks to extract valuable knowledge concerning the well-being of elderly people by combining information coming from low-cost, unobtrusive monitoring devices. The approach we adopt to devise its data fusion component is based on the Event Calculus and the stochastic logic programming language ProbLog, and aims towards constructing a semantic representation of the received data, usable by a Decision Support System that will assist elderly people in their everyday activities and will provide doctors, relatives and carers with insights into the user’s behaviour and health.
Artificial Intelligence-based event recognition systems carry high potential for organisations to utilise their structured and unstructured data. The application of these systems as a backbone of decision support systems allows for effective and efficient information management. To sufficiently evaluate such integrated systems, which recognise events for the benefit of decision makers, a holistic methodology is necessary. We propose a new methodology which complements existing approaches for technology-oriented verification and validation with user-oriented evaluation (user experience analysis). We illustrate the proposed methodology by evaluating EP-IRM, an event processing system for intelligent resource management. This case study shows that our methodology offers invaluable information about the performance and acceptance of an event-based decision support system.
Community Web Directories constitute a form of personalization performed on Web directories, such as the Open Directory Project (ODP). They correspond to “segments” of the directory hierarchy, representing the interests and preferences of user communities and thus provide a personalized view of the Web. In this paper, we present OurDMOZ, a system that builds and maintains community Web directories by employing a Web usage mining framework. OurDMOZ, the prototype presented here, exploits Web directories to extend personalization to a larger part of the Web, outside the scope of a single Web site. OurDMOZ offers a variety of personalization functionalities including adaptive interfaces and Web page recommendations. An initial user evaluation of the system indicates the potential value of the enhanced personalized Web experience provided by OurDMOZ.
Today's organizations require techniques for automated transformation of their large data volumes into operational knowledge. This requirement may be addressed by using event recognition systems that detect events/activities of special significance within an organization, given streams of ‘low-level’ information that is very difficult for humans to utilize. Consider, for example, the recognition of attacks on nodes of a computer network given the Transmission Control Protocol/Internet Protocol messages, the recognition of suspicious trader behaviour given the transactions in a financial market and the recognition of whale songs given a symbolic representation of whale sounds. Various event recognition systems have been proposed in the literature. Recognition systems with a logic-based representation of event structures, in particular, have been attracting considerable attention because, among other reasons, they exhibit a formal, declarative semantics, they have proven to be efficient and scalable and they are supported by machine learning tools automating the construction and refinement of event structures. In this paper, we review representative approaches of logic-based event recognition and discuss open research issues of this field. We illustrate the reviewed approaches with the use of a real-world case study: event recognition for city transport management.
One of the major innovations in personalization in the last 20 years was the injection of social knowledge into the model of the user. The user is not considered an isolated individual any more, but a member of one or more communities. User communities have been facilitated by the striking advancements of electronic communications and in particular the penetration of the Web into people's everyday routine. Communities arise in a number of different ways. Social networking tools typically allow users to proactively connect to each other. Alternatively, data mining tools discover communities of connected Web sites or communities of Web users. In this paper, we focus on the latter type of community, which is commonly mined from logs of users' activity on the Web. We recall how this process has been used to model the users' interests and personalize Web applications. Collaborative filtering and recommendation are the most widely used forms of community-driven personalization. However, we examine a range of other interesting alternatives that are worth investigating further. This effort leads us naturally to the recent developments on the Web and particularly the advent of the social Web. We explain how this development draws together the different viewpoints on Web communities and introduces new opportunities for community-based personalization. In particular, we propose the concept of active user community and show how this relates to recent efforts on mining social networks and social media.
Personalisation in the fashion industry is a new trend that tries to produce garments respecting the idiosyncrasy of every customer and doing so cost effectively, whilst at the same time adding value to the services provided. Typically, a personalised fashion service recognises its users, collects information about their interests, their needs, as well as their personal physical characteristics (such as body type), and subsequently recommends products based on this information.
The recommender system should be able to create and maintain user information efficiently, and this is typically performed by means of user models. There are two types of information sources that are exploited for the creation of user models in the fashion domain. The first type is in the form of generic style advice rules that are defined by fashion experts. These rules provide some guidance about the appropriate style and fit of garments for different occasions, body types, facial features, etc. The second type of information source is in the form of customer data, collected by fashion-oriented web sites or social networking sites, which contain users’ preferences or garment purchases. This information can be exploited to discover important patterns that denote general user tendencies.
We have been developing a system for recognising human activities given a symbolic representation of video content. The input of our system is a stream of time-stamped short-term activities detected on video frames. The output of our system is a set of recognised long-term activities, which are pre-defined spatio-temporal combinations of short-term activities. The constraints on the short-term activities that, if satisfied, lead to the recognition of a long-term activity, are expressed using a dialect of the Event Calculus. We illustrate the expressiveness of the dialect by showing the representation of several typical complex activities. Furthermore, we present a detailed evaluation of the system through experimentation on a benchmark dataset of surveillance videos.
Artificial Intelligence-based event recognition systems carry high potential for organisations to utilise their structured and unstructured data. The application of these systems as a backbone of decision support systems allows for effective and efficient information management. To sufficiently evaluate such integrated systems, which recognise events for the benefit of decision makers, a holistic methodology is necessary. We propose a new methodology which complements existing approaches for technology-oriented verification and validation with user-oriented evaluation (user experience analysis). We illustrate the proposed methodology by evaluating EP-IRM, an event processing system for intelligent resource management. This case study shows that our methodology offers invaluable information about the performance and acceptance of an event-based decision support system.
This article provides an overview of BIOASQ, a new competition on biomedical semantic indexing and question answering (QA). BIOASQ aims to push towards systems that will allow biomedical workers to express their information needs in natural language and that will return concise and user-understandable answers by combining information from multiple sources of different kinds, including biomedical articles, databases, and ontologies. BIOASQ encourages participants to adopt semantic indexing as a means to combine multiple information sources and to facilitate the matching of questions to answers. It also adopts a broad semantic indexing and QA architecture that subsumes current relevant approaches, even though no current system instantiates all of its components. Hence, the architecture can also be seen as our view of how relevant work from fields such as information retrieval, hierarchical classification, question answering, ontologies, and linked data can be combined, extended, and applied to biomedical question answering. BIOASQ will develop publicly available benchmarks and it will adopt and possibly refine existing evaluation measures. The evaluation infrastructure of the competition will remain publicly available beyond the end of BIOASQ.
Community Web Directories constitute a form of personalization performed on Web directories, such as the Open Directory Project (ODP). They correspond to “segments” of the directory hierarchy, representing the interests and preferences of user communities and thus provide a personalized view of the Web. In this paper, we present OurDMOZ, a system that builds and maintains community Web directories by employing a Web usage mining framework. OurDMOZ, the prototype presented here, exploits Web directories to extend personalization to a larger part of the Web, outside the scope of a single Web site. OurDMOZ offers a variety of personalization functionalities including adaptive interfaces and Web page recommendations. An initial user evaluation of the system indicates the potential value of the enhanced personalized Web experience provided by OurDMOZ.
The need for intelligent resource management (IRM) spans a multitude of applications. To address this requirement, we present EP-IRM, an event processing system recognising composite events given multiple sources of information in order to support IRM. EP-IRM has been deployed in two real-world applications. Moreover, with small effort it may be used in a wide range of other applications requiring IRM. We present an evaluation of the system, and discuss the lessons learnt during its development and deployment.
Events are particularly important pieces of knowledge, as they represent activities of special significance within an organisation: the automated recognition of events is of utmost importance. We present RTEC, an Event Calculus dialect for run-time event recognition and its Prolog implementation. RTEC includes a number of novel techniques allowing for efficient run-time recognition, scalable to large data streams. It can be used in applications where data might arrive with a delay from, or might be revised by, the underlying event sources. We evaluate RTEC using a real-world application.
Text classification constitutes a popular task in Web research with various applications that range from spam filtering to sentiment analysis. To address it, patterns of co-occurring words or characters are typically extracted from the textual content of Web documents. However, not all documents are of the same quality; for example, the curated content of news articles usually entails lower levels of noise than the user-generated content of blog posts and other social media.
In this paper, we provide some insight and a preliminary study on a tripartite categorization of Web documents, based on inherent document characteristics. We claim and support that each category calls for different classification settings with respect to the representation model. We verify this claim experimentally, by showing that topic classification on these different document types offers very different results per type. In addition, we consider a novel approach that improves the performance of topic classification across all types of Web documents: namely the n-gram graphs. This model goes beyond the established bag-of-words one, representing each document as a graph. Individual graphs can be combined into a class graph, and graph similarities are then employed to position documents in the vector space and classify them. Accuracy is increased due to the contextual information that is encapsulated in the edges of the n-gram graphs; efficiency, on the other hand, is boosted by reducing the feature space to a limited set of dimensions that depend on the number of classes, rather than the size of the vocabulary. Our experimental study over three large-scale, real-world data sets validates the higher performance of n-gram graphs in all three domains of Web documents.
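The following sketch illustrates the general n-gram graph idea on toy data: document graphs are merged into class graphs, and the similarity of a document's graph to each class graph yields a feature vector whose dimensionality equals the number of classes. The graph construction and similarity function below are simplified stand-ins for the actual NGG operators.

```python
# Illustrative sketch of character n-gram graphs for classification (simplified:
# edge weights count co-occurrences of n-grams within a sliding window; the
# similarity of a document graph to each class graph becomes a feature).
from collections import Counter

def ngram_graph(text, n=3, window=3):
    """Return {(ngram_i, ngram_j): weight} for n-grams co-occurring within the window."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, g in enumerate(grams):
        for j in range(i + 1, min(i + 1 + window, len(grams))):
            edges[tuple(sorted((g, grams[j])))] += 1
    return edges

def merge(graphs):
    """Class graph: the (unnormalised) union of the documents' graphs."""
    merged = Counter()
    for g in graphs:
        merged.update(g)
    return merged

def value_similarity(g1, g2):
    """A simple overlap score over shared edges (a stand-in for the NGG value similarity)."""
    common = set(g1) & set(g2)
    if not common:
        return 0.0
    return sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common) / max(len(g1), len(g2))

train = {"sports": ["the match ended with a late goal", "the champion won the tennis final"],
         "science": ["the experiment measured particle energy", "gene expression in the cell"]}
class_graphs = {c: merge([ngram_graph(d) for d in docs]) for c, docs in train.items()}

doc = "a goal in the final match"
features = {c: value_similarity(ngram_graph(doc), g) for c, g in class_graphs.items()}
print(features, "->", max(features, key=features.get))   # one feature per class
```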
This paper presents hHDP, a hierarchical algorithm for representing a document collection as a hierarchy of latent topics, based on Dirichlet process priors. The hierarchical nature of the algorithm refers to the Bayesian hierarchy that it comprises, as well as to the hierarchy of the latent topics. hHDP relies on nonparametric Bayesian priors and is able to infer a hierarchy of topics without making any assumption about the depth of the learned hierarchy or the branching factor at each level. We evaluate the proposed method on real-world datasets in document modeling, as well as in ontology learning, and provide qualitative and quantitative evaluation results, showing that the model is robust, accurately models the training dataset and is able to generalize to held-out data.
This paper presents a method along with a set of measures for evaluating learned ontologies against gold ontologies. The proposed method transforms the ontology concepts and their properties into a vector space representation to avoid the common string matching of concepts and properties at the lexical layer. The proposed evaluation measures exploit the vector space representation and calculate the similarity of the two ontologies (learned and gold) at the lexical and relational levels. Extensive evaluation experiments are provided, which show that these measures capture accurately the deviations from the gold ontology. The proposed method is tested using the Genia and the Lonely Planet gold ontologies, as well as the ontologies in the benchmark series of the Ontology Alignment Evaluation Initiative.
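A minimal sketch of the vector-space matching idea is shown below, with toy concepts and TF-IDF vectors standing in for the paper's concept representations.

```python
# Minimal sketch of vector-space concept matching between a learned and a gold
# ontology: each concept is represented by the TF-IDF vector of its name and
# property labels, and matched to the most similar gold concept by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

gold = {"vehicle": "vehicle wheels engine transport",
        "car": "car automobile doors engine",
        "bicycle": "bicycle pedals wheels"}
learned = {"automobile": "automobile car engine doors",
           "bike": "bike bicycle pedals"}

vec = TfidfVectorizer().fit(list(gold.values()) + list(learned.values()))
G = vec.transform(list(gold.values()))
L = vec.transform(list(learned.values()))
sim = cosine_similarity(L, G)

# Lexical-level score: average similarity of each learned concept to its best gold match.
for i, name in enumerate(learned):
    j = sim[i].argmax()
    print(name, "->", list(gold)[j], round(float(sim[i, j]), 2))
print("lexical score (avg best-match similarity):", round(float(sim.max(axis=1).mean()), 2))
```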
This chapter summarises the approach and main achievements of the research project BOEMIE (Bootstrapping Ontology Evolution with Multimedia Information Extraction). BOEMIE introduced a new approach towards the automation of knowledge acquisition from multimedia content. In particular, it developed and demonstrated the notion of evolving multimedia ontologies, which is used for the extraction, fusion and interpretation of information from content of various media types (audio, video, images and text). BOEMIE adopted a synergistic approach that combines multimedia extraction and ontology evolution in a bootstrapping process. This process involves, on the one hand, the continuous extraction of semantic information from multimedia content in order to populate and enrich the ontologies and, on the other hand, the deployment of these ontologies to enhance the robustness of the extraction system. Thus, in addition to annotating multimedia content with semantics, the extracted knowledge is used to expand our understanding of the domain and extract even more useful knowledge. The methods and technologies developed in BOEMIE were tested in the domain of athletics, using large sets of annotated content and evaluation by domain experts. The evaluation has proved the value of the technology, which is applicable in a wide spectrum of domains that are based on multimedia content.
Ontology learning is the process of acquiring (constructing or integrating) an ontology (semi-) automatically. Being a knowledge acquisition task, it is a complex activity, which becomes even more complex in the context of the BOEMIE project, due to the management of multimedia resources and the multi-modal semantic interpretation that they require. The purpose of this chapter is to present a survey of the most relevant methods, techniques and tools used for the task of ontology learning. Adopting a practical perspective, an overview of the main activities involved in ontology learning is presented. This breakdown of the learning process is used as a basis for the comparative analysis of existing tools and approaches. The comparison is done along dimensions that emphasize the particular interests of the BOEMIE project. In this context, ontology learning in BOEMIE is treated and compared to the state of the art, explaining how BOEMIE addresses problems observed in existing systems and contributes to issues that are not frequently considered by existing approaches.
This paper proposes a probabilistic method for classifying folksonomy users into specific domains and for identifying their specific interests in these domains. The proposed method uses a hierarchical probabilistic topic modeling approach that exploits tags to induce hierarchies of latent topics. These hierarchies represent conceptualizations of specific domains that are either collective or user-specific. We propose two alternative methods that exploit the induced hierarchies for classifying users and identifying their interests in specific domains, and provide preliminary evaluation results.
This paper presents a probabilistic method for classifying folksonomy users to folksonomy sub-domains and identifying their particular interests. In particular, we propose a method for mining topic hierarchies that may reveal either the collective or the user-specific conceptualization of those domains, as these are reflected by users' tags. We then propose two alternatives for identifying users' interests in the domains: the first exploits users' tags directly, and the second exploits users' specific conceptualizations of each domain. Both approaches use the collective domain conceptualizations as a “reference”, to which users' tags and conceptualizations are compared. The proposed statistical method is parameter-free and does not require any prior knowledge or external resources. We apply the proposed method to the Del.icio.us online bookmarking system and we provide experimental results.
Ultraconserved sequences (UCS) for H. sapiens and C. elegans were obtained and analyzed through the N-gram Graph approach. The N-gram graphs (NGG) represent how symbols (e.g., nucleotides) co-occur within a given neighborhood (e.g., within an oligonucleotide). The neighborhood is defined based on a distance function (e.g., a neighborhood of 5 consecutive characters within a text). Under this framework we trained graphs with the UCS and compared them with genomic and random surrogate sequences of similar DNA composition, in order to define specific “rules” in the use of nucleotides within UCS.
We introduce ELS, a new method for entity-level sentiment classification using sequence modeling by Conditional Random Fields (CRF). The CRF is trained to identify the sentiment of each word in a document, which is then used to determine the sentiment for the entity, based on where it appears in the text. Due to its sequential nature, the CRF classifier performs better than the common bag-of-words approaches, especially when we target the local sentiment in small parts of a larger document. Identifying the sentiment about a specific entity, mentioned in a blog post or a larger product review, is a special case of such local sentiment classification. Furthermore, the proposed approach performs well even in short pieces of text, where bag-of-words approaches usually fail, due to the sparseness of the resulting feature vector. We have implemented and tested the proposed method on a publicly available benchmark corpus of short product reviews in English. The results that we present in this paper improve significantly upon published results on the same data, thus confirming our intuition about the approach.
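An illustrative sketch of word-level CRF tagging with entity-level aggregation follows; the features, tag set and aggregation window are simplified placeholders rather than the paper's ELS configuration, and sklearn-crfsuite is used only as a convenient CRF implementation.

```python
# Illustrative sketch of word-level sentiment tagging with a linear-chain CRF
# and entity-level aggregation over the words around the entity mention.
import sklearn_crfsuite

def word_features(sent, i):
    w = sent[i]
    return {"lower": w.lower(), "is_title": w.istitle(),
            "prev": sent[i - 1].lower() if i > 0 else "<s>",
            "next": sent[i + 1].lower() if i < len(sent) - 1 else "</s>"}

def featurize(sent):
    return [word_features(sent, i) for i in range(len(sent))]

# Toy training data: one sentiment label per token.
train_sents = [["the", "battery", "is", "excellent"], ["the", "screen", "is", "awful"]]
train_tags = [["O", "O", "O", "POS"], ["O", "O", "O", "NEG"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit([featurize(s) for s in train_sents], train_tags)

def entity_sentiment(sent, entity):
    """Assign to the entity the majority non-neutral tag of the words around its mention."""
    tags = crf.predict([featurize(sent)])[0]
    idx = sent.index(entity)
    window = tags[max(0, idx - 3): idx + 4]
    pos, neg = window.count("POS"), window.count("NEG")
    return "positive" if pos > neg else "negative" if neg > pos else "neutral"

print(entity_sentiment(["the", "battery", "is", "excellent"], "battery"))
```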
Currently there exist several tools for Complex Event Recognition, varying from design platforms for business process modeling (BPM) to advanced Complex Event Processing (CEP) engines. Several efforts have been reported in the literature aiming to support domain experts in the process of defining event recognition (ER) rules. However, few of them offer graphical design environments for the definition of such rules, limiting the broad adoption of ER systems. In this paper, we present a graphical Event Definition Authoring Tool, referred to as the Event Recognition Designer Toolkit (ERDT), with which a domain expert can easily design event recognition rules on temporal data and produce standalone Event Recognizers.
The application of event processing methods and systems carries high potential for the domain of crisis management and emergency response, for different use cases and architectural aspects. This hypothesis is based on the generally event-based characteristics of the domain, as well as on former research approaches. Resource management represents a complex task for decision makers; therefore it is taken as the basic use case for this work. The work builds on the foundations of resource management (use case and demand side) and event processing (technology and supply side). Methods and results are presented for the identification, definition and validation of events that happen in reality and of the corresponding event objects that are processed by information systems.
In this paper, we address the issue of uncertainty in event recognition by extending the Event Calculus with probabilistic reasoning. Markov Logic Networks are a natural candidate for our logic-based formalism. However, the temporal semantics of Event Calculus introduce a number of challenges for the proposed model. We show how and under what assumptions we can overcome these problems. Additionally, we demonstrate the advantages of the probabilistic Event Calculus through examples and experiments in the domain of activity recognition, using a publicly available dataset of video surveillance.
We present a system for recognising human behaviour given a symbolic representation of surveillance videos. The input of our system is a set of time-stamped short-term behaviours, that is, behaviours taking place in a short period of time — walking, running, standing still, etc — detected on video frames. The output of our system is a set of recognised long-term behaviours — fighting, meeting, leaving an object, collapsing, walking, etc — which are pre-defined temporal combinations of short-term behaviours. The definition of a long-term behaviour, including the temporal constraints on the short-term behaviours that, if satisfied, lead to the recognition of the long-term behaviour, is expressed in the Event Calculus. We present experimental results concerning videos with several humans and objects, temporally overlapping and repetitive behaviours. Moreover, we present how machine learning techniques may be employed in order to automatically develop long-term behaviour definitions.
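As an illustration of how such long-term behaviour definitions look in the Event Calculus (a simplified example, not the exact rules used in the paper):

```latex
% Illustrative (simplified) definition of a long-term behaviour, here "moving together";
% the rules used in the paper are more involved.
\begin{align*}
\mathit{initiatedAt}(\mathit{moving}(P_1,P_2),\;T) \leftarrow\;&
   \mathit{happensAt}(\mathit{walking}(P_1),\;T),\;
   \mathit{happensAt}(\mathit{walking}(P_2),\;T),\;
   \mathit{holdsAt}(\mathit{close}(P_1,P_2),\;T).\\
\mathit{terminatedAt}(\mathit{moving}(P_1,P_2),\;T) \leftarrow\;&
   \mathit{happensAt}(\mathit{inactive}(P_1),\;T).\\
\mathit{terminatedAt}(\mathit{moving}(P_1,P_2),\;T) \leftarrow\;&
   \neg\,\mathit{holdsAt}(\mathit{close}(P_1,P_2),\;T).
\end{align*}
```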
This paper presents a knowledge discovery framework for the construction of Community Web Directories, a concept that we introduced in our recent work, applying personalization to Web directories. In this context, the Web directory is viewed as a thematic hierarchy and personalization is realized by constructing user community models on the basis of usage data. In contrast to most of the work on Web usage mining, the usage data that are analyzed here correspond to user navigation throughout the Web, rather than a particular Web site, exhibiting as a result a high degree of thematic diversity. For modeling the user communities, we introduce a novel methodology that combines the users’ browsing behavior with thematic information from the Web directories. Following this methodology we enhance the clustering and probabilistic approaches presented in previous work and we also present a new algorithm that combines these two approaches. The resulting community models take the form of Community Web Directories. The proposed personalization methodology is evaluated both on a specialized artificial and a general-purpose Web directory, indicating its potential value to the Web user. The experiments also assess the effectiveness of the different machine learning techniques on the task.
This paper proposes a method for learning ontologies given a corpus of text documents. The method identifies concepts in documents and organizes them into a subsumption hierarchy, without presupposing the existence of a seed ontology. The method uncovers latent topics for generating document text. The discovered topics form the concepts of the new ontology. Concept discovery is done in a language neutral way, using probabilistic space reduction techniques over the original term space of the corpus. Furthermore, the proposed method constructs a subsumption hierarchy of the concepts by performing conditional independence tests among pairs of latent topics, given a third one. The paper provides experimental results on the Genia and the Lonely Planet corpora from the domains of molecular biology and tourism respectively.
The paper is motivated by the need to handle robustly the uncertainty of temporal intervals, e.g. as it occurs in automated event detection in video streams. The paper introduces a two-dimensional mapping of Allen's relations, based on orthogonal characteristics of interval relations, namely relative position and relative size. The hourglass-shaped mapping also represents limit cases that correspond to durationless intervals. Based on this mapping, we define two sets of primitive interval relations in terms of the relative positioning and relative size of intervals. These primitives are then used to derive a probabilistic set of Allen's relations. A number of example cases are presented to illustrate how the proposed approach can improve the robustness of interval relations.
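One simple way to obtain a probability distribution over Allen's relations under endpoint uncertainty is sketched below via Monte Carlo sampling of noisy endpoints; note that this is only an illustration and not the paper's hourglass mapping based on relative position and relative size.

```python
# Illustrative sketch: a distribution over Allen's interval relations when the
# interval endpoints are uncertain (Monte Carlo over Gaussian endpoint noise).
import random
from collections import Counter

def allen_relation(a, b):
    (s1, e1), (s2, e2) = a, b
    if e1 < s2: return "before"
    if e2 < s1: return "after"
    if e1 == s2: return "meets"
    if e2 == s1: return "met-by"
    if s1 == s2 and e1 == e2: return "equal"
    if s2 <= s1 and e1 <= e2: return "during"      # a inside b (starts/finishes folded in, simplified)
    if s1 <= s2 and e2 <= e1: return "contains"
    return "overlaps"

def probabilistic_relation(a, b, sigma=0.5, samples=5000):
    """Relative frequency of each relation when each endpoint has Gaussian noise of std sigma."""
    counts = Counter()
    for _ in range(samples):
        na = tuple(sorted(random.gauss(t, sigma) for t in a))
        nb = tuple(sorted(random.gauss(t, sigma) for t in b))
        counts[allen_relation(na, nb)] += 1
    return {r: c / samples for r, c in counts.most_common()}

print(probabilistic_relation((0.0, 5.0), (4.8, 9.0)))   # near "meets": mass spread over overlaps/before
```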
Today's organisations require techniques for automated transformation of the large data volumes they collect during their operations into operational knowledge. This requirement may be addressed by employing event recognition systems that detect activities/events of special significance within an organisation, given streams of 'low-level' information that is very difficult for humans to utilise. Numerous event recognition systems have been proposed in the literature. Recognition systems with a logic-based representation of event structures, in particular, have been attracting considerable attention because, among other reasons, they exhibit a formal, declarative semantics, they have proven to be efficient and scalable, and they are supported by machine learning tools automating the construction and refinement of event structures. In this paper we review representative approaches of logic-based event recognition, and discuss open research issues of this field.
This paper presents the approach used to extract information from multimedia in the context of the Computer-Aided Semantic Annotation of Multimedia (CASAM) system. In particular, we first describe, from a system's perspective, the relevant component of the system, named the Knowledge Driven Multimedia Analysis (KDMA) component. We then focus on a particular methodology that improves the detection of information found in the audio stream of a document, using information found in related text data, provided either as auxiliary sources, speech or user annotations. The methodology is based on separately analysing each medium and then learning a mapping among concepts found in audio and text. This mapping is later used to propose priors for the audio classes at the document level, which are then used to adapt the posteriors of the audio classes. The evaluation results of the described analysis methods on a corpus of multimedia news items demonstrate the usefulness of the approach.
We have been developing a system for recognising human activity given a symbolic representation of video content. The input of our system is a set of time-stamped short-term activities detected on video frames. The output of our system is a set of recognised long-term activities, which are pre-defined temporal combinations of short-term activities. The constraints on the short-term activities that, if satisfied, lead to the recognition of a long-term activity, are expressed using a dialect of the Event Calculus. We illustrate the expressiveness of the dialect by showing the representation of several typical complex activities. Furthermore, we present a detailed evaluation of the system through experimentation on a benchmark dataset of surveillance videos.
Two influential strands in Recommender Systems (RS) are collaborative filtering and content-based filtering, which suggest interesting items to the active user by taking into account user communities or interaction history. However, these approaches do not work well when confronted with new users with few interactions, or with the addition of new items. In such cases, the guidance of an expert could help the active user. In this paper we provide a definition of expert users that can be decomposed into two components: expertise and contribution. The former is related to the content of the items evaluated by an expert and the latter refers to the influence of the expert on the users of a RS. In particular, contribution is learnt with the aid of a perceptron. Expert users are defined with respect to values of the item features. Furthermore, we have studied the temporal evolution of the experts, as new users, new items, or new item evaluations are added into the system. Moreover, we have compared the proposed expert-based method with a stereotype-based method, since for both methods a minimal interaction of the active user with the RS suffices. The data originate from the MovieLens dataset, enhanced with information from IMDB.
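A minimal sketch of learning the contribution component with a perceptron and combining it with expertise is given below; the interaction features and the combination rule are hypothetical illustrations, not the paper's definitions.

```python
# Minimal sketch: contribution of a candidate expert learnt with a perceptron
# from simple interaction features (feature names here are hypothetical).
import numpy as np
from sklearn.linear_model import Perceptron

# Features per user: [number of ratings, avg agreement of later users with their ratings,
#                     number of items rated first by this user]
X = np.array([[120, 0.85, 30],
              [ 15, 0.40,  1],
              [ 90, 0.75, 22],
              [ 10, 0.30,  0]])
y = np.array([1, 0, 1, 0])          # 1 = influential/contributing user, 0 = not

contribution_model = Perceptron(max_iter=1000).fit(X, y)

def expert_score(expertise, features, alpha=0.5):
    """Combine content-based expertise with the learnt contribution (decision value)."""
    contribution = contribution_model.decision_function([features])[0]
    return alpha * expertise + (1 - alpha) * contribution

print(expert_score(expertise=0.9, features=[100, 0.8, 25]))
```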
In this paper we demonstrate a system that automatically annotates text documents with a given domain ontology’s concepts. The annotation process utilizes lexical and Web resources to analyze the semantic similarity of text components with any of the ontology concepts, and outputs a list with the proposed annotations, accompanied with appropriate confidence values. The demonstrated system is available online and free to use, and it constitutes one of the main components of the KDTA (Knowledge-Driven Text Analysis) module of the CASAM European research project.
In this paper we are dealing with the task of adding domain-specific semantic tags to a document, based solely on the domain ontology and generic lexical and Web resources. In this manner, we avoid the need for trained domain-specific lexical resources, which hinder the scalability of semantic annotation. More specifically, the proposed method maps the content of the document to concepts of the ontology, using the WordNet lexicon and Wikipedia. The method comprises a novel combination of measures of semantic relatedness and word sense disambiguation techniques to identify the most related ontology concepts for the document. We test the method on two case studies: (a) a set of summaries, accompanying environmental news videos, (b) a set of medical abstracts. The results in both cases show that the proposed method achieves reasonable performance, thus pointing to a promising path for scalable semantic annotation of documents.
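The following sketch illustrates the relatedness-based mapping of document terms to ontology concepts using WordNet path similarity; the concept list and threshold are hypothetical, and the paper's method additionally combines Wikipedia-based relatedness and word sense disambiguation.

```python
# Illustrative sketch of mapping document terms to ontology concepts with a
# WordNet-based relatedness measure (path similarity over noun senses).
# Requires: pip install nltk && python -c "import nltk; nltk.download('wordnet')"
from nltk.corpus import wordnet as wn

ontology_concepts = ["pollution", "forest", "vehicle"]       # toy domain ontology

def relatedness(w1, w2):
    """Max path similarity over the noun senses of the two words."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(w1, pos=wn.NOUN)
              for s2 in wn.synsets(w2, pos=wn.NOUN)]
    return max(scores, default=0.0)

def annotate(document_terms, threshold=0.3):
    annotations = []
    for term in document_terms:
        best = max(ontology_concepts, key=lambda c: relatedness(term, c))
        score = relatedness(term, best)
        if score >= threshold:
            annotations.append((term, best, round(score, 2)))   # (term, concept, confidence)
    return annotations

print(annotate(["smog", "woodland", "car"]))
```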
This paper reports on the Large-Scale Hierarchical Classification workshop, held in conjunction with the European Conference on Information Retrieval (ECIR) 2010. The workshop was associated with the PASCAL 2 Large-Scale Hierarchical Text Classification Challenge, which took place in 2009. We first provide information about the challenge, presenting the data used, the tasks and the evaluation measures, and then we provide an overview of the approaches proposed by the participants of the workshop, together with a summary of the results of the challenge.
We present a system for recognising human behaviour given a symbolic representation of surveillance videos. The input of our system is a set of timestamped short-term behaviours — walking, running, standing still, etc — that is, behaviours taking place in a short period of time, detected on video frames. The output of our system is a set of recognised long-term behaviours — fighting, meeting, leaving an object, collapsing, walking, etc — which are pre-defined temporal combinations of short-term behaviours. The definition of a long-term behaviour, including the temporal constraints on the short-term behaviours that, if satisfied, lead to the recognition of the long-term behaviour, is expressed in the Event Calculus. We present experimental results concerning videos with several humans and objects, temporally overlapping and repetitive behaviours.
In this paper we describe a semi-automated approach for ontology learning. Exploiting an ontology-based multimodal information extraction system, the ontology learning subsystem accumulates documents that are insufficiently analysed and through clustering proposes new concepts, relations and interpretation rules to be added to the ontology.
Web wrappers play an important role in extracting information from distributed web sources and subsequently in the integration of heterogeneous data. Changes in the layout of web sources typically break the wrapper, leading to erroneous extraction of information. Monitoring and repairing broken wrappers is an important hurdle for data integration, since it is an expensive and painful procedure. In this paper we present VEWRA, a new approach to wrapper verification, which improves upon the successful family of trainable content-based methods. Compared to its predecessors, the new method aims to capture not only the syntactic patterns but also the correlations that exist among them due to the underlying semantics of the extracted information. Experiments show that our method achieves excellent performance, always performing at least as well as DATAPROG, the state-of-the-art related work.
In this article, a method that models user navigation on the web, as opposed to a single website, is presented, aiming to assist the user by recommending pages. User modeling is done through data mining of web usage logs, resulting in aggregate, rather than personal, models. The proposed approach extends grammatical inference methods by introducing an extra merging criterion, which examines the semantic similarity of automaton states. The experimental results showed that the method does indeed facilitate the modeling of web navigation, which was not possible with the existing web usage mining methods. However, a content-based recommendation model is shown to still outperform the proposed method, which suggests that knowledge of the navigation sequence does not contribute to the recommendation process. This is due to the thematic cohesion of navigation sessions, in comparison to the large thematic diversity of web usage data. Among the three variants of the proposed method, the one based on Blue Fringe, which examines a larger space of possible merges, performs best.
This paper presents a system that aggregates news from various electronic news publishers and distributors. The system collects news from HTML and RSS Web documents by using source-specific information extraction programs (wrappers) and parsers, organizes them according to pre-defined news categories and constructs personalized views via a Web-based interface. Adaptive personalization is performed, based on the individual user interaction, user similarities and statistical analysis of aggregate usage data by machine learning algorithms. In addition to the presentation of the basic system, we present here the results of a user study, indicating the merits of the system, as well as ways to improve it further.
Determining the size of an ontology that is automatically learned from text corpora is an open issue. In this paper, we study the similarity between ontology concepts at different levels of a taxonomy, quantifying in a natural manner the quality of the ontology attained. Our approach is integrated in a recently proposed method for language-neutral learning of ontologies of thematic topics from text corpora. Evaluation results over the Genia and the Lonely Planet corpora demonstrate the significance of our approach.
The evaluation of research work and its impact has always been one of scholars' greatest concerns. The use of citations for that purpose, as proposed by Eugene Garfield, is nowadays widely accepted as the most reliable method. However, gathering a scholar's citations constitutes a particularly laborious task, even in the current Internet era, as one needs to correctly combine information from miscellaneous sources. There is therefore a need to automate this process. Numerous academic search engines try to cover this need, but none of them addresses all the related problems successfully. In this paper we present an approach that greatly facilitates citation analysis, by taking advantage of new algorithms that deal with these problems.
In this paper we propose a novel relation extraction method, based on grammatical inference. Following a semi-supervised learning approach, the text that connects named entities in an annotated corpus is used to infer a context-free grammar. The grammar learning algorithm is able to infer grammars from positive examples only, controlling overgeneralisation through minimum description length. Evaluation results show that the proposed approach performs comparably to the state of the art, while exhibiting a bias towards precision, which is a sign of conservative generalisation.
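The minimum-description-length criterion can be illustrated with a toy calculation: a grammar is preferred when its own size plus the cost of encoding the examples under it is smallest. The bit-cost model and the derivation-choice counts below are hypothetical simplifications, not the paper's actual encoding.

```python
# Toy MDL comparison (hypothetical bit-cost model, not the paper's encoding).
import math

def grammar_bits(rules, alphabet_size):
    """Cost of the grammar: each right-hand-side symbol costs log2(alphabet_size) bits."""
    return sum(len(rhs) for _, rhs in rules) * math.log2(alphabet_size)

def data_bits(derivation_choices):
    """Cost of the data given the grammar: bits needed to pick each example's derivation."""
    return sum(math.log2(max(c, 1)) for c in derivation_choices.values())

def mdl(rules, derivation_choices, alphabet_size):
    return grammar_bits(rules, alphabet_size) + data_bits(derivation_choices)

# Two candidate grammars covering the positive examples "ab", "aab", "aaab".
specific = [("S", ["a", "b"]), ("S", ["a", "a", "b"]), ("S", ["a", "a", "a", "b"])]
general = [("S", ["A", "b"]), ("A", ["a"]), ("A", ["a", "A"])]
# Hypothetical derivation-choice counts per example for each grammar.
print("specific:", mdl(specific, {"ab": 3, "aab": 3, "aaab": 3}, alphabet_size=4))
print("general: ", mdl(general, {"ab": 2, "aab": 4, "aaab": 8}, alphabet_size=4))
```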
Determining the size of an ontology that is automatically learned from texts is an open issue. In this paper, we study the similarity between ontology concepts at different levels of a taxonomy, quantifying in a natural manner the quality of the ontology attained. Our approach is integrated in a method for language-neutral learning of ontologies from texts, which relies on conditional independence tests over thematic topics that are discovered using LDA.
Focused crawlers are programs that traverse the Web, using its graph structure, and gather pages that belong to a specific topic. The most critical task in focused crawling is the scoring of the URLs, as it designates the path that the crawler will follow, and thus its effectiveness. In this paper we propose a novel scheme for assigning scores to the URLs, based on the Reinforcement Learning (RL) framework. The proposed approach learns to select the best classifier for ordering the URLs. This formulation reduces the size of the search space for the RL method and makes the problem tractable. We evaluate the proposed approach on-line on a number of topics, which offers a realistic view of its performance, comparing it also with an RL method and a simple but effective classifier-based crawler. The results demonstrate the strength of the proposed approach.
This paper presents a method for the evaluation of learned ontologies against gold standards. The proposed method transforms the ontology concepts to a vector space representation to avoid the common string matching of concepts at the lexical layer. We propose a set of evaluation measures that exploit the concepts' representations and calculate the similarity of the two hierarchies. Experiments show that these measures scale gradually in the closed interval of [0,1] as learned ontologies deviate increasingly from the gold standard. The proposed method is tested using the Genia and the Lonely Planet gold standard ontologies.
The focus of this paper is ontology-based knowledge management in the framework of a mobile communication and information system for rescue operation management. We present a novel ontology data service, combining prior domain knowledge about large-scale rescue operations with dynamic information about a developing operation. We also discuss the integration of such a data service into a service-oriented application framework to reach high performance and accessibility, and offer examples of SHARE applications to demonstrate the practical benefits of the approach chosen.
This paper proposes a method for learning ontologies given a corpus of text documents. The method identifies concepts in documents and organizes them into a subsumption hierarchy, without presupposing the existence of a seed ontology. The method uncovers latent topics in terms of which document text is being generated. These topics form the concepts of the new ontology. This is done in a language neutral way, using probabilistic space reduction techniques over the original term space of the corpus. Given multiple sets of concepts (latent topics) being discovered, the proposed method constructs a subsumption hierarchy by performing conditional independence tests among pairs of latent topics, given a third one. The paper provides experimental results over the GENIA corpus from the domain of biomedicine.
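A hedged sketch of the two stages, topic discovery followed by conditional independence testing, using scikit-learn's LDA and a simple partial-correlation check; the corpus is invented and the statistical test is only one possible choice, not necessarily the one used in the paper.

```python
# Hedged sketch: discover latent topics with LDA, then test conditional independence
# of two topics given a third via partial correlation. The documents are invented.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["genes regulate protein expression",
        "protein binding in the cell nucleus",
        "hiking trails and mountain views",
        "budget hotels near the beach"]

X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)
theta = lda.transform(X)  # document-topic proportions, one row per document

def partial_corr(a, b, c):
    """Correlation of a and b after removing the linear effect of c."""
    ra = a - np.polyval(np.polyfit(c, a, 1), c)
    rb = b - np.polyval(np.polyfit(c, b, 1), c)
    return np.corrcoef(ra, rb)[0, 1]

# Topics 0 and 1 stay linked under topic 2 in the hierarchy only if they remain
# correlated (i.e. are NOT conditionally independent) given topic 2.
print(partial_corr(theta[:, 0], theta[:, 1], theta[:, 2]))
```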
In this paper we present the ideas and algorithms developed around our KeyGen Web Taxonomy Annotation engine. KeyGen annotates the Open Directory Project, also known as Dmoz, with meaningful and previously unknown keywords by utilizing domain knowledge extracted from the WWW. We present two algorithms: i) The PageParse Algorithm, which efficiently extracts keywords from Web Taxonomies using a combination of local and global scores, and ii) the Support Algorithm, an I/O optimized algorithm for coalescing hierarchies of keywords. We then present the results: i) from constructing a richly annotated ODP Web taxonomy and ii) from evaluating the correctness of this structure by performing an automated classification of Web-pages.
The management of resources is a great challenge for commanders in Search and Rescue operations and has a strong impact on all areas of operation control, as the command-and-communication structure, geo-referenced information, and operational tasks are interconnected through complex relations. During an operation these are subject to dynamic change. For efficient operation control, commanders need access to up-to-date information in their mobile working environment. This paper presents a new approach to managing resources and their relations in an operation. It is based on ontologies to build a model of an operation and on Description Logic reasoning to provide enhanced decision support.
We propose an approach to knowledge acquisition, which uses multimedia ontologies for fused extraction of semantics from multiple modalities, and feeds back the extracted information, aiming to evolve knowledge representation. This paper presents the basic components of the proposed approach and discusses the open research issues focusing on the fused information extraction that will enable the development of scalable and precise knowledge acquisition technology.
This paper introduces TPN2, the runner up method in both tasks of the ECML-PKDD Discovery Challenge 2006 on personalized spam filtering. TPN2 is a classifier training method that bootstraps positive-only learning with fully-supervised learning, in order to make the most of labeled and unlabeled data, under the assumption that the two are drawn from significantly different distributions. Furthermore, the unlabeled data themselves are separated into subsets that are assumed to be drawn from multiple distributions. For that reason, TPN2 trains a different classifier for each subset, making use of all unlabeled data each time.
Naive Bayes is very popular in commercial and open-source anti-spam e-mail filters. There are, however, several forms of Naive Bayes, something the anti-spam literature does not always acknowledge. We discuss five different versions of Naive Bayes, and compare them on six new, non-encoded datasets that contain ham messages of particular Enron users and fresh spam messages. The new datasets, which we make publicly available, are more realistic than previous comparable benchmarks, because they maintain the temporal order of the messages in the two categories, and they emulate the varying proportion of spam and ham messages that users receive over time. We adopt an experimental procedure that emulates the incremental training of personalized spam filters, and we plot ROC curves that allow us to compare the different versions of NB over the entire tradeoff between true positives and true negatives.
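For illustration, two of the Naive Bayes variants commonly discussed in this context, the multinomial and the Bernoulli forms, can be compared in a few lines with scikit-learn; the messages below are synthetic, not the Enron-based datasets described above.

```python
# Synthetic illustration of two Naive Bayes variants in scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

ham = ["meeting agenda attached", "lunch tomorrow?", "quarterly report draft"]
spam = ["win money now", "cheap pills online", "win a free prize now"]
y = [0, 0, 0, 1, 1, 1]

vec = CountVectorizer()
X = vec.fit_transform(ham + spam)
X_test = vec.transform(["free money prize", "agenda for tomorrow"])

for model in (MultinomialNB(), BernoulliNB()):
    model.fit(X, y)
    print(type(model).__name__, model.predict(X_test))
```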
In this paper we propose a method for efficient management of large spatial ontologies. Current spatial ontologies are usually represented using an ontology language, such as OWL, and stored as OWL files. However, we have observed some shortcomings of this approach, especially in the efficiency of spatial query processing. This fact motivated the development of a hybrid approach that uses an R-tree as a spatial index structure. In this way we are able to support efficient query processing over large spatial ontologies, while maintaining the benefits of ontological reasoning. We present a case study for emergency teams during Search and Rescue (SaR) operations, showing how an Ontology Data Service (SHARE-ODS) can benefit from a spatial index. Performance evaluation shows the superiority of the proposed technique compared to the original approach. To the best of our knowledge, this is the first attempt to address the problem of efficient management of large spatial ontology bases.
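The indexing idea can be sketched with the Python `rtree` package: spatial individuals of the ontology are inserted into an R-tree so that range queries return a small candidate set instead of scanning the whole ontology. The individuals and coordinates below are hypothetical, and this is not the SHARE-ODS implementation.

```python
# Sketch of spatial indexing for ontology individuals with the `rtree` package.
from rtree import index

idx = index.Index()
spatial_individuals = {
    1: (23.70, 37.97, 23.72, 37.99),  # e.g. an incident site (minx, miny, maxx, maxy)
    2: (23.80, 38.00, 23.82, 38.02),  # e.g. a hospital
    3: (22.00, 39.00, 22.10, 39.10),  # e.g. a depot far away
}
for oid, bbox in spatial_individuals.items():
    idx.insert(oid, bbox)

# Range query: only the returned candidates need to be examined by the reasoner,
# instead of scanning every spatial individual in the ontology.
query_window = (23.69, 37.96, 23.85, 38.05)
print(list(idx.intersection(query_window)))  # [1, 2] (order may vary)
```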
This paper presents a system that integrates news from multiple sources on the Web and delivers it to the reader in a personalized fashion. The presented service integrates automatic information extraction from various news sources and presentation of information according to the user’s interests. The system consists of source-specific information extraction programs (wrappers) that extract highlights of news items from the various sources, organize them according to pre-defined news categories and present them to the user through a personal Web-based interface. Dynamic personalization is used, based on the user’s reading history, as well as the preferences of other similar users. User models are maintained by statistical analysis and machine learning algorithms. Results of an initial user study have confirmed the value of the service and indicated ways in which it should be improved.
This paper describes an ontology data service (ODS) for supporting Search and Rescue (SaR) operations. The ontological model represents various aspects of the command, communication, and organisational structure of the SaR forces and the deployment and progress of a SaR operation. Furthermore, the ontology supports the semantic indexing of multimedia documents in the context of SaR processes and activities. This ODS supports a semantically-enhanced information and communication system for SaR forces. Modelling the spatio-temporal aspects of an operation in alignment with possibly-unreliable information automatically extracted from multimedia objects, introduces a number of challenges for the field of knowledge representation and reasoning.
This report describes an ontology data service (ODS) for supporting Search and Rescue (SaR) operations. The ontological model represents various aspects of the command, communication, and organisational structure of the SaR forces and the deployment and progress of a SaR operation. Furthermore, the ontology supports the semantic indexing of multimedia documents in the context of SaR processes and activities. This ODS supports a semantically-enhanced information and communication system for SaR forces. Modelling the spatio-temporal aspects of an operation in alignment with possibly-unreliable information automatically extracted from multimedia objects, introduces a number of challenges for the field of knowledge representation and reasoning.
The BOEMIE project proposes a bootstrapping approach to knowledge acquisition, which uses multimedia ontologies for fused extraction of semantics from multiple modalities, and feeds back the extracted information, aiming to automate the ontology evolution process.
This article investigates the effectiveness of voting and stacked generalization (also known as stacking) in the context of information extraction (IE). A new stacking framework is proposed that accommodates well-known approaches for IE. The key idea is to perform cross-validation on the base-level data set, which consists of text documents annotated with relevant information, in order to create a meta-level data set that consists of feature vectors. A classifier is then trained using the new vectors. Therefore, base-level IE systems are combined with a common classifier at the meta-level. Several voting schemes are presented and compared against stacking in various IE domains. Well-known IE systems are employed at the base level, together with a variety of classifiers at the meta-level. Results show that both voting and stacking work better when relying on probabilistic estimates by the base-level systems. Voting proved to be effective in most domains in the experiments. Stacking, on the other hand, proved to be consistently effective over all domains, performing comparably to or better than voting and always better than the best base-level systems. Particular emphasis is also given to explaining the results obtained by voting and stacking at the meta-level, with respect to the varying degree of similarity in the output of the base-level systems.
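A minimal sketch of the cross-validation stacking idea, with generic scikit-learn classifiers standing in for the base-level IE systems (the actual base learners in the article are IE systems, not vector classifiers) and synthetic data.

```python
# Minimal cross-validation stacking sketch with generic classifiers and synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
base_learners = [GaussianNB(), DecisionTreeClassifier(random_state=0)]

# Meta-level features: each base learner's probabilistic estimate, produced out-of-fold,
# so the meta-level classifier never sees predictions made on a learner's own training folds.
meta_features = np.column_stack([
    cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    for clf in base_learners
])
meta_clf = LogisticRegression().fit(meta_features, y)
print(meta_clf.score(meta_features, y))
```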
C.D. Spyropoulos, G. Paliouras, V. Karkaletsis, D. Kosmopoulos, I. Pratikakis, S. Perantonis and B. Gatos, "BOEMIE: Bootstrapping Ontology Evolution with Multimedia Information Extraction," In Proceedings of the 2nd European Workshop on the Integration of Knowledge, Semantics and Digital Media Technology, v.6, pp. 1751-1782, 2005.
The BOEMIE project proposes a bootstrapping approach to knowledge acquisition, which uses multimedia ontologies for fused extraction of semantics from multiple modalities, and feeds back the extracted information, aiming to automate the ontology evolution process.
L. Vande Velde, S. Chatzinotas, M. Larson, J. Löffler, G. Paliouras, "Interactive 2D - 3D digital maps for the support of emergency teams during rescue operations", In Proceedings of the 12th World Congress on Intelligent Transport Systems, San Francisco, November, 2005.
SHARE, an EU-funded 6th Framework Programme project, addresses the needs of emergency teams for multimodal communication and for decision support with a prototype advanced mobile service based on Push-to-Share technology. The SHARE system provides emergency workers with on-site, on-line details of operational history and current operational status, as well as access to pertinent supporting information, in particular information concerning the environment of the incident. The SHARE system will incorporate an enhanced Tele Atlas 2D-3D digital map, including details on buildings and roads above and beyond those represented in basic digital road maps. The SHARE system will log communications and other multimedia data generated during the operation and store them in an ontology-based Knowledge Base, which makes possible the integration of the spatial information of digital maps with multimedia and operational information from external databases. In the final phase of the SHARE project, the system will implement a 2D-3D digital map enhanced with voice, image, text and video information. The map will be fully interactive, permitting emergency workers with mobile end devices such as PDAs and tablet PCs to query the system using a multimodal interface and retrieve information, as well as to enter new information as the operation unfolds.
V. Karkaletsis, G. Paliouras, C. D. Spyropoulos, "A Bootstrapping Approach to Knowledge Acquisition from Multimedia Content with Ontology Evolution," In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR), pp. 98-105, Helsinki University of Technology, Finland, June 2005.
We propose a bootstrapping approach to knowledge acquisition, which uses multimedia ontologies for fused extraction of semantics from multiple modalities, and feeds back the extracted information, aiming to automate the ontology evolution process. This paper presents the basic components of the proposed approach and discusses the open research issues focusing on the synergy of extraction and evolution that will enable the development of scalable and precise knowledge acquisition technology.
G. Paliouras, "On the Need to Bootstrap Ontology Learning with Extraction Grammar Learning," In Proceedings of the International Conference on Conceptual Structures (ICCS), Kassel, Germany, July, Lecture Notes in Artificial Intelligence, n. 3596, pp. 119-135, Springer Verlag, 2005.
The main claim of this paper is that machine learning can help integrate the construction of ontologies and extraction grammars and lead us closer to the Semantic Web vision. The proposed approach is a bootstrapping process that combines ontology and grammar learning, in order to semi-automate the knowledge acquisition process. After providing a survey of the most relevant work towards this goal, recent research of the Software and Knowledge Engineering Laboratory (SKEL) of NCSR "Demokritos" in the areas of Web information integration, information extraction, grammar induction and ontology enrichment is presented. The paper concludes with a number of interesting issues that need to be addressed in order to realize the advocated bootstrapping process.
D. Pierrakos, G. Paliouras, "Exploiting Probabilistic Latent Information for the Construction of Community Web Directories," In Proceedings of the International User Modelling Conference (UM), Edinburgh, UK, July, Lecture Notes in Artificial Intelligence, n. 3538, pp. 89-98, Springer Verlag, 2005.
This paper improves a recently-presented approach to Web Personalization, named Community Web Directories, which applies personalization techniques to Web Directories. The Web directory is viewed as a concept hierarchy and personalization is realized by constructing user community models on the basis of usage data collected by the proxy servers of an Internet Service Provider. The user communities are modeled using Probabilistic Latent Semantic Analysis (PLSA), which provides a number of advantages such as overlapping communities, as well as a good rationale for the associations that exist in the data. The data that are analyzed present challenging peculiarities such as their large volume and semantic diversity. Initial results presented in this paper illustrate the effectiveness of the new method.
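For readers unfamiliar with the aspect model, the following is a compact EM sketch of PLSA on a hypothetical user-by-page count matrix; it only illustrates how overlapping communities arise from the latent aspects and is not the system described above.

```python
# Compact EM for PLSA on a hypothetical user-by-page count matrix (toy sketch).
import numpy as np

def plsa(N, k, iters=100, seed=0):
    """N: user-by-page count matrix; k: number of communities (latent aspects).
    Returns P(z), P(user|z), P(page|z)."""
    rng = np.random.default_rng(seed)
    n_u, n_p = N.shape
    Pz = np.full(k, 1.0 / k)
    Pu_z = rng.random((k, n_u))
    Pu_z /= Pu_z.sum(axis=1, keepdims=True)
    Pp_z = rng.random((k, n_p))
    Pp_z /= Pp_z.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: responsibilities P(z | user, page)
        joint = Pz[:, None, None] * Pu_z[:, :, None] * Pp_z[:, None, :]
        post = joint / joint.sum(axis=0, keepdims=True)
        # M-step: re-estimate the aspect model from the weighted counts
        weighted = N[None, :, :] * post
        Pz = weighted.sum(axis=(1, 2))
        Pz /= Pz.sum()
        Pu_z = weighted.sum(axis=2)
        Pu_z /= Pu_z.sum(axis=1, keepdims=True)
        Pp_z = weighted.sum(axis=1)
        Pp_z /= Pp_z.sum(axis=1, keepdims=True)
    return Pz, Pu_z, Pp_z

# Hypothetical usage counts: rows = users, columns = pages of a Web directory.
N = np.array([[5, 4, 0, 0],
              [4, 5, 1, 0],
              [0, 0, 6, 5],
              [0, 1, 5, 6]], dtype=float)
Pz, Pu_z, Pp_z = plsa(N, k=2)
print(np.round(Pu_z, 2))  # each community's distribution over users (users may overlap)
```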
D. Pierrakos, G. Paliouras, C. Papatheodorou, V. Karkaletsis, M. Dikaiakos, "Web Community Directories: A New Approach to Web Personalization," In Berendt et al. (Eds.), "Web Mining: From Web to Semantic Web", Lecture Notes in Computer Science, n. 3209, pp. 113 - 129, Springer Verlag, 2004.
This paper introduces a new approach to Web Personalization, named Web Community Directories, that aims to tackle the problem of information overload on the WWW. This is realized by applying personalization techniques to the well-known concept of Web Directories. The Web directory is viewed as a concept hierarchy which is generated by a content-based document clustering method. Personalization is realized by constructing community models on the basis of usage data collected by the proxy servers of an Internet Service Provider. For the construction of the community models, a new data mining algorithm, called Community Directory Miner, is used. This is a simple cluster mining algorithm which has been extended to ascend a concept hierarchy, and specialize it to the needs of user communities. The data that are mined present a number of peculiarities such as their large volume and semantic diversity. Initial results presented in this paper illustrate the use of the methodology and provide an indication of the behavior of the new mining method.
G. Sigletos, G. Paliouras, C. D. Spyropoulos, P. Stamatopoulos, "Stacked generalization for information extraction," In Proceedings of the European Conference in Artificial Intelligence (ECAI), pp. 549 - 553, Valencia, Spain, IOS Press, 2004.
This paper defines a new stacked generalization framework in the context of information extraction (IE) from online sources. The proposed setting removes the constraint of applying classifiers at the base-level. A set of IE systems are trained instead to identify relevant fragments within text documents, which differs significantly from the task of classifying candidate text fragments as relevant or not. The templates filled by the base-level IE systems are stacked, forming a set of feature vectors for training a meta-level classifier. Thus, base-level IE systems are combined with a common classifier at meta-level. The proposed framework was evaluated on three Web domains, using well known IE approaches at base-level and a variety of classifiers at meta-level. Results demonstrate the added value obtained by combining the base-level IE systems in the new framework.
A. Valarakos, G. Paliouras, V. Karkaletsis, G. Vouros, "Enhancing the Ontological Knowledge through Ontology Population and Enrichment," In Proceedings of the International Conference on Knowledge Engineering and Knowledge Management (EKAW), Lecture Notes in Artificial Intelligence, n. 3257, pp. 144-156, Springer Verlag, 2004.
Ontologies are widely used for capturing and organizing knowledge of a particular domain of interest. This knowledge usually evolves and therefore an ontology maintenance process is required to keep the ontological knowledge up-to-date. We propose an incremental ontology maintenance methodology, which exploits ontology population and enrichment methods to enhance the knowledge captured by the instances of the ontology and their various lexicalizations. Furthermore, we employ ontology learning techniques to reduce as much as possible the need for human intervention in the proposed methodology. We conducted experiments using the CROSSMARC ontology as a case study, evaluating the methodology and its constituent methods. The methodology performed well, enhancing the ontological knowledge from only 50% to 96.5%.
E. Michelakis, I. Androutsopoulos, G. Paliouras, G. Sakkis, P. Stamatopoulos, "Filtron: A Learning-Based Anti-Spam Filter," In Proceedings of the First Conference on Email and Anti-Spam (CEAS), Mountain View, CA, USA, 2004.
We present Filtron, a prototype anti-spam filter that integrates the main empirical conclusions of our comprehensive analysis on using machine learning to construct effective personalized anti-spam filters. Filtron is based on experimental results over several design parameters, obtained on four publicly available benchmark corpora. After describing Filtron's architecture, we assess its behavior in real use over a period of seven months. The results are deemed satisfactory, though they could be improved with more elaborate preprocessing and regular re-training.
N. Karampatziakis, G. Paliouras, D. Pierrakos, P. Stamatopoulos, "Navigation pattern discovery using grammatical inference," In Proceedings of the 7th International Colloquium on Grammatical Inference (ICGI), Lecture Notes in Artificial Intelligence, n. 3264, pp. 187 - 198, Springer Verlag, 2004.
We present a method for modeling user navigation on a web site using grammatical inference of stochastic regular grammars. With this method we achieve better models than the previously used first-order Markov chains, in terms of predictive accuracy and utility of recommendations. In order to obtain comparable results, we apply the same grammatical inference algorithms on Markov chains, modeled as probabilistic automata. The automata induced in this way perform better than the original Markov chains as models of user navigation, but they are considerably inferior to the automata induced by the traditional grammatical inference methods. The evaluation of our method was based on two web usage data sets from two very dissimilar web sites. It consisted of producing, for each user, a personalized list of recommendations and then measuring its recall and expected utility.
G. Petasis, G. Paliouras, C. D. Spyropoulos, C. Halatsis, "eg-GRIDS: Context-Free Grammatical Inference from Positive Examples using Genetic Search," In Proceedings of the 7th International Colloquium on Grammatical Inference (ICGI), Lecture Notes in Artificial Intelligence, n. 3264, pp. 223 - 234, Springer Verlag, 2004.
In this paper we present eg-GRIDS, an algorithm for inducing context-free grammars that is able to learn from positive sample sentences. The presented algorithm, similar to its GRIDS predecessors, uses simplicity as a criterion for directing inference, and a set of operators for exploring the search space. In addition to the basic beam search strategy of GRIDS, eg-GRIDS incorporates an evolutionary grammar selection process, aiming to explore a larger part of the search space. Evaluation results are presented on artificially generated data, comparing the performance of beam search and genetic search. These results show that genetic search performs better than beam search while being significantly more efficient computationally.
G. Petasis, G. Paliouras, V. Karkaletsis, C. Halatsis, and C.D. Spyropoulos, "e-GRIDS: Computationally Efficient Grammatical Inference from Positive Examples," GRAMMARS, 2004.
In this paper we present a new computationally efficient algorithm for inducing context-free grammars that is able to learn from positive sample sentences. This new algorithm uses simplicity as a criterion for directing inference, and the search process of the new algorithm has been optimised by utilising the results of a theoretical analysis regarding the behaviour and complexity of the search operators. Evaluation results are presented on artificially generated data, while the scalability of the algorithm is tested on a large textual corpus. These results show that the new algorithm performs well and can infer grammars from large data sets in a reasonable amount of time.
A. Valarakos, G. Paliouras, V. Karkaletsis, G. Vouros, "A Name-Matching Algorithm for Supporting Ontology Enrichment," In Proceedings of the Panhellenic Conference in Artificial Intelligence (SETN), Lecture Notes in Artificial Intelligence, n. 3025, pp. 381-389, Springer Verlag, 2004.
Ontologies are widely used for capturing and organizing knowledge of a particular domain of interest. This knowledge is usually evolvable and therefore an ontology maintenance process is required. In the context of ontology maintenance we tackle the problem that arises when an instance/individual is written differently (grammatically, orthographically, lexicographically), while representing the same entity/concept. This type of knowledge is captured into a semantic relationship and constitutes valuable information for many intelligent methods and systems. We enrich a domain ontology with instances that participate in this type of relationship, using a novel name matching method based on machine learning. We also show how the proposed method can support the discovery of new entities/concepts to be added to the ontology. Finally, we present experimental results for the enrichment of an ontology used in the multi-lingual information integration project CROSSMARC.
A. Grigoriadis, G. Paliouras, "Focused Crawling using Temporal Difference-Learning," In Proceedings of the Panhellenic Conference in Artificial Intelligence (SETN), Lecture Notes in Artificial Intelligence, n. 3025, pp. 142-153, Springer Verlag, 2004.
This paper deals with the problem of constructing an intelligent Focused Crawler, i.e. a system that is able to retrieve documents of a specific topic from the Web. The crawler must contain a component which assigns visiting priorities to the links, by estimating the probability of leading to a relevant page in the future. Reinforcement Learning was chosen as a method that fits this task nicely, as it provides a method for rewarding intermediate states to the goal. Initial results show that a crawler trained with Reinforcement Learning is able to retrieve relevant documents after a small number of steps.
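The idea of rewarding intermediate steps on a path towards relevant pages can be illustrated with a toy TD(0) value update; the crawl graph, reward and parameters below are hypothetical and much simpler than the crawler described in the paper.

```python
# Toy TD(0) sketch of crediting intermediate pages on a path to a relevant page.
import random

graph = {"seed": ["a", "b"], "a": ["target"], "b": ["c"], "c": [], "target": []}
relevant = {"target"}

V = {page: 0.0 for page in graph}  # value estimate: how promising is crawling from this page?
alpha, gamma = 0.5, 0.9

for _ in range(200):  # simulated crawling episodes
    page = "seed"
    while graph[page]:
        nxt = random.choice(graph[page])
        reward = 1.0 if nxt in relevant else 0.0
        V[page] += alpha * (reward + gamma * V[nxt] - V[page])  # TD(0) update
        page = nxt

# Links towards the relevant region end up with higher priority scores.
print(sorted(V.items(), key=lambda kv: -kv[1]))
```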
I. Androutsopoulos, G. Paliouras and E. Michelakis, "Learning to Filter Unsolicited Commercial E-Mail," Technical Report, No. 2004/2, NCSR "Demokritos", 2004 (updated 2006).
We present a thorough investigation on using machine learning to construct effective personalized anti-spam filters. The investigation includes four learning algorithms, Naive Bayes, Flexible Bayes, LogitBoost, and Support Vector Machines, and four datasets, constructed from the mailboxes of different users. We discuss the model and search biases of the learning algorithms, along with worst-case computational complexity figures, and observe how the latter relate to experimental measurements. We study how classification accuracy is affected when using attributes that represent sequences of tokens, as opposed to single tokens, and explore the effect of the size of the attribute set and the training set, all within a cost-sensitive framework. Furthermore, we describe the architecture of a fully implemented learning-based anti-spam filter, and present an analysis of its behavior in real use over a period of seven months. Information is also provided on other available learning-based anti-spam filters, and alternative filtering approaches.
D. Pierrakos, G. Paliouras, C. Papatheodorou and C.D. Spyropoulos, "Web Usage Mining as a tool for personalization: a survey". User Modeling and User-Adapted Interaction, v. 13, n. 4, pp. 311-372, 2003.
This paper is a survey of recent work in the field of web usage mining for the benefit of research on the personalization of Web-based information services. The essence of personalization is the adaptability of information systems to the needs of their users. This issue is becoming increasingly important on the Web, as non-expert users are overwhelmed by the quantity of information available online, while commercial Web sites strive to add value to their services in order to create loyal relationships with their visitors-customers. This article views Web personalization through the prism of personalization policies adopted by Web sites and implementing a variety of functions. In this context, the area of Web usage mining is a valuable source of ideas and methods for the implementation of personalization functionality. We therefore present a survey of the most recent work in the field of Web usage mining, focusing on the problems that have been identified and the solutions that have been proposed.
G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos and P. Stamatopoulos, "A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists". Information Retrieval, v. 6, n. 1, pp. 49-73, 2003.
This paper presents an extensive empirical evaluation of memory-based learning in the context of anti-spam filtering, a novel cost-sensitive application of text categorization that attempts to identify automatically unsolicited commercial messages that flood mailboxes. Focusing on anti-spam filtering for mailing lists, a thorough investigation of the effectiveness of a memory-based anti-spam filter is performed using a publicly available corpus. The investigation includes different attribute and distance-weighting schemes, and studies on the effect of the neighborhood size, the size of the attribute set, and the size of the training corpus. Three different cost scenarios are identified, and suitable cost-sensitive evaluation functions are employed. We conclude that memory-based anti-spam filtering for mailing lists is practically feasible, especially when combined with additional safety nets. Compared to a previously tested Naive Bayes filter, the memory-based filter performs on average better, particularly when the misclassification cost for non-spam messages is high.
G. Sigletos, G. Paliouras, C. D. Spyropoulos, M. Hatzopoulos. "Mining Web sites using wrapper induction, named entities and post-processing", Proceedings of the 1st European Web Mining Forum Workshop, Joint European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), Cavtat-Dubrovnik, Croatia, 2003.
This paper presents a novel method for extracting information from collections of Web pages across different sites. Our method uses a standard wrapper induction algorithm and exploits named entity information. We introduce the idea of post-processing the extraction results for resolving ambiguous facts and improving the overall extraction performance. Post-processing involves the exploitation of two additional sources of information: fact transition probabilities, based on a trained bigram model, and confidence probabilities, estimated for each fact by the wrapper induction system. A multiplicative model that is based on the product of those two probabilities is also considered for post-processing. Experiments were conducted on pages describing laptop products, collected from many different sites and in four different languages. The results highlight the effectiveness of our approach.
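The multiplicative post-processing step can be sketched as follows: for an ambiguous extracted fragment, the candidate fact label maximising the product of the trained bigram transition probability and the wrapper's confidence is chosen. All fact names and probabilities below are hypothetical.

```python
# Sketch of the multiplicative disambiguation model; transition and confidence
# probabilities are hypothetical placeholders.
def resolve(prev_fact, candidates, transition, confidence):
    """Pick the fact label maximising P(transition from prev_fact) * P(confidence)."""
    return max(candidates,
               key=lambda f: transition.get((prev_fact, f), 1e-6) * confidence.get(f, 1e-6))

transition = {("model_name", "processor_speed"): 0.6, ("model_name", "price"): 0.1}
confidence = {"processor_speed": 0.55, "price": 0.45}
print(resolve("model_name", ["processor_speed", "price"], transition, confidence))
# -> "processor_speed"
```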
G. Sigletos, G. Paliouras, C. D. Spyropoulos, P. Stamatopoulos. "Meta-learning beyond classification: A framework for information extraction from the Web", Proceedings of the Adaptive Text Extraction and Mining Workshop, Joint European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), Cavtat-Dubrovnik, Croatia, 2003.
This paper proposes a meta-learning framework in the context of information extraction from the Web. The proposed framework relies on learning a meta-level classifier, based on the output of base-level information extraction systems. Such systems are typically trained to recognize relevant information within documents, i.e., streams of lexical units, which differs significantly from the task of classifying feature vectors that is commonly assumed for meta-learning. The proposed framework was evaluated experimentally on the challenging task of training an information extraction system for multiple Web sites. Three well-known methods for training extraction systems were employed at the base level. A variety of classifiers were comparatively evaluated at the meta level. The extraction accuracy that was obtained demonstrated the effectiveness of the proposed framework of collaboration between base-level extraction systems and common classifiers at meta-level.
D. Pierrakos, G. Paliouras, C. Papatheodorou, V. Karkaletsis, M. Dikaiakos. "Construction of Web Community Directories using Document Clustering and Web Usage Mining", Proceedings of the 1st European Web Mining Forum Workshop, Joint European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), Cavtat-Dubrovnik, Croatia, 2003.
This paper presents the concept of Web Community Directories, as a means of personalizing services on the Web, together with a novel methodology for the construction of these directories by document clustering and usage mining methods. The community models are extracted with the use of the Community Directory Miner, a simple cluster mining algorithm which has been extended to ascend a concept hierarchy, and specialize it to the needs of user communities. The initial concept hierarchy is generated by a content-based document clustering method. Communities are constructed on the basis of usage data collected by the proxy servers of an Internet Service Provider. These data present a number of peculiarities such as their large volume and semantic diversity. Initial results presented in the paper illustrate the use of the methodology and provide an indication of the behavior of the new mining method.
A. Valarakos, G. Sigletos, V. Karkaletsis, G. Paliouras. "A Methodology for Semantically Annotating a Corpus Using a Domain Ontology and Machine Learning", Proceedings of the Recent Advances in Natural Language Processing International Conference (RANLP), Borovets, Bulgaria, 2003.
In this paper we present a methodology for the semantic annotation of domain-specific corpora. This method relies on a domain ontology, used initially for identifying and annotating domain-specific instances within the corpus. A machine learning-based information extraction system is then trained on the annotated corpus. The final result of this process is a model which is used to annotate new corpora in the specific domain. We applied the proposed methodology to a Web corpus, examining different ontology sizes and using hidden Markov models. The paper presents the proposed methodology together with some first experimental results.
A. Valarakos, G. Sigletos, V. Karkaletsis, G. Paliouras, G. Vouros. "A Methodology for Enriching a Multi-Lingual Domain Ontology using Machine Learning", Proceedings of the Workshop on Text Processing for Modern Greek: from Symbolic to Statistical Approaches, 6th International Conference in Greek Linguistics , Rethymno, Crete, 2003.
Ontologies accumulate and organize knowledge in a machine-processable and human-readable way providing a common understanding basis. Enriching a multi-lingual ontology is crucial for the success of many knowledge-based systems. We present an iterative ontology-driven methodology that enriches a multi-lingual domain ontology with new instances, exploiting machine learning techniques. The methodology is user-centered and aims to ease the task of ontology maintenance. Our first experiments show the strong dependency between the size of the initial ontology and the performance of the machine learning-based method.
K. Stamatakis, V. Karkaletsis, G. Paliouras, J. Horlock, C. Grover, J. R. Curran, S. Dingare. "Domain-Specific Web Site Identification: The CROSSMARC Focused Web Crawler", Proceedings of the Second International Workshop on Web Document Analysis (WDA), Edinburgh, UK, 2003.
This paper presents techniques for identifying domain specific Web sites that have been implemented as part of the EC-funded R&D project, CROSSMARC. The project aims to develop technology for extracting interesting information from domain-specific Web pages. It is therefore important for CROSSMARC to identify Web sites in which interesting domain specific pages reside (focused Web crawling). This is the role of the CROSSMARC Web crawler.
G. Sigletos, D. Farmakiotou, K. Stamatakis, G. Paliouras, V. Karkaletsis. "Annotating Web pages for the needs of Web Information Extraction applications", Poster in the proceedings of the 12th International World Wide Web Conference (WWW), Budapest, Hungary, 2003.
This paper outlines our approach to the creation of annotated corpora for the purposes of Web Information Extraction, and presents the Web Annotation tool. This tool enables the annotation of Web pages from different domains and for different information extraction tasks providing a user-friendly interface to human annotators. Annotated information is stored in a representation format that can easily be exploited.
D. Pierrakos, G. Paliouras, C. Papatheodorou, V. Karkaletsis, M. Dikaiakos. "Construction of Web Community Directories by Mining Usage Data", Proceedings of the Hellenic Data Management Symposium (HDMS), Athens, Greece, 2003.
This paper introduces the concept of Web Community Directories, as a means of personalizing services on the Web, and presents a novel methodology for the construction of these directories by usage mining methods. The community models are extracted with the use of the Community Directory Miner, a simple cluster mining algorithm which has been extended to ascend a concept hierarchy, such as a Web directory, and specialize it to the needs of user communities. The construction of the communities is based on usage data collected by the proxy servers of an Internet Service Provider, which is also a task that has not been addressed in the literature. The examined data present a number of peculiarities such as their large volume and their semantic diversity. Initial results presented in the paper illustrate the use of the methodology and provide an indication of the behavior of the new usage mining method.
G. Petasis, V. Karkaletsis, G. Paliouras and C. D. Spyropoulos. "Using the Ellogon Natural Language Engineering Infrastructure", Proceedings of the Workshop on Balkan Language Resources and Tools at the 1st Balkan Conference on Informatics (BCI), Thessaloniki, Greece, 2003.
Ellogon is a multi-lingual, cross-operating-system, general-purpose natural language engineering infrastructure. Ellogon has been used extensively in various NLP applications. It is currently provided free of charge to research and academic organisations for research use. In this paper, we outline its architecture and data model, present Ellogon's features as used by different types of users, and discuss its functionality in comparison with other infrastructures for language engineering.
G. Paliouras, C. Papatheodorou, V. Karkaletsis and C.D. Spyropoulos, "Discovering User Communities on the Internet Using Unsupervised Machine Learning Techniques". Interacting with Computers, v. 14, n. 6, pp. 761-791, 2002.
Interest in the analysis of user behaviour on the Internet has been increasing rapidly, especially since the advent of electronic commerce. In this context, we argue here for the usefulness of constructing communities of users with common behaviour, making use of machine learning techniques. In particular, we assume that the users of any service on the Internet constitute a large community and we aim to construct smaller communities of users with common characteristics. The paper presents the results of three case studies for three different types of Internet service: a digital library, an information broker and a Web site. Particular attention is paid to the different types of information access involved in the three case studies: query-based information retrieval, profile-based information filtering and Web-site navigation. Each type of access imposes different constraints on the representation of the learning task. Two different unsupervised learning methods are evaluated: conceptual clustering and cluster mining. One of our main concerns is the construction of meaningful communities that can be used for improving information access on the Internet. Analysis of the results in the three case studies brings to the surface some of the important properties of the task, suggesting the feasibility of a common methodology for the three different types of information access on the Internet.
G. Petasis, V. Karkaletsis, G. Paliouras, I. Androutsopoulos and C. D. Spyropoulos, "Ellogon: A New Text Engineering Platform". Proceedings of the International Conference on Language Resources and Evaluation (LREC), vol. I, pp. 72-78, Las Palmas, Spain, May, 2002.
This paper presents Ellogon, a multi-lingual, cross-platform, general-purpose text engineering environment. Ellogon was designed in order to aid both researchers in natural language processing, as well as companies that produce language engineering systems for the end-user. Ellogon provides a powerful TIPSTER-based infrastructure for managing, storing and exchanging textual data, embedding and managing text processing components as well as visualising textual data and their associated linguistic information. Among its key features are full Unicode support, an extensive multi-lingual graphical user interface, its modular architecture and the reduced hardware requirements.
G. Sigletos, G. Paliouras, V. Karkaletsis, "Role Identification From Free Text Using Hidden Markov Models". Proceedings of the Panhellenic Conference in Artificial Intelligence (SETN), Lecture Notes in Artificial Intelligence, n. 2308, Springer Verlag, pp. 167-178, 2002.
In this paper we explore the use of hidden Markov models on the task of role identification from free text. Role identification is an important stage of the information extraction process, assigning roles to particular types of entities with respect to a particular event. Hidden Markov models (HMMs) have been shown to achieve good performance when applied to information extraction tasks in both semi-structured and free text. The main contribution of this work is the analysis of whether and how linguistic processing of textual data can improve the extraction performance of HMMs. The emphasis is on the minimal use of computationally expensive linguistic analysis. The overall conclusion is that the performance of HMMs is still worse than that of an equivalent manually constructed system. However, clear paths for improvement of the method are shown, aiming at a method that is easily adaptable to new domains.
G. Petasis, S. Petridis, G. Paliouras, V. Karkaletsis, S. Perantonis, and C.D. Spyropoulos, "Symbolic and Neural Learning of Named-Entity Recognition and Classification Systems in Two Languages". In Advances in Computational Intelligence and Learning: Methods and Applications, H-J. Zimmermann, G. Tselentis, M. van Someren and G. Dounias (eds), Kluwer Academic Publishers, 2001.
This paper compares two alternative approaches to the problem of acquiring named-entity recognition and classification systems from training corpora, in two different languages. The process of named-entity recognition and classification is an important subtask in most language engineering applications, in particular information extraction, where different types of named entity are associated with specific roles in events. The manual construction of rules for the recognition of named entities is a tedious and time-consuming task. For this reason, effective methods to acquire such systems automatically from data are very desirable. In this paper we compare two popular learning methods on this task: a decision-tree induction method and a multi-layered feed-forward neural network. Particular emphasis is paid on the selection of the appropriate data representation for each method and the extraction of training examples from unstructured textual data. We compare the performance of the two methods on large corpora of English and Greek texts and present the results. In addition to the good performance of both methods, one very interesting result is the fact that a simple representation of the data, which ignores the order of the words within a named entity, leads to improved results over a more complex approach that preserves word order.
H. Jessen and G. Paliouras, "Data Mining in Economics, Marketing and Finance". In Machine Learning and Applications, G. Paliouras, V. Karkaletsis and C.D. Spyropoulos (eds), Lecture Notes in Computer Science, n. 2049, pp. 303-307, Springer-Verlag, 2001.
[No abstract available.]
K. Koutroumbas, G. Paliouras, V. Karkaletsis and C.D. Spyropoulos, "Comparison of Computational Learning Methods on a Diagnostic Cytological Application". Proceedings of the European Symposium on Intelligent Technologies, Hybrid Systems and their implementation on Smart Adaptive Systems (EUNITE), pp. 500-508, Tenerife, Spain, 2001.
In this paper we perform a comparative evaluation of four different computational learning methods on a problem of diagnostic cytology and more specifically on the classification of gastric cells. The methods considered are: Decision Tree Induction, Boosted Decision Trees, Naive Bayesian Classifier, and Radial Basis Function Neural Networks. The performance of each method was assessed on unseen data. Our aim was not to evaluate the quality of the algorithms as such, but to examine which of them are suitable for the specific medical diagnosis task, in order to provide a reliable diagnostic tool to the doctors involved in the area. We compare the performance of the four methods and discuss the results taking into account the characteristics of the methods and the task examined. The dataset that was used in this paper is publicly available, facilitating reproducibility of the results and providing a basis of comparison for future work.
A. Grigoriadis, G. Paliouras, V. Karkaletsis and C.D. Spyropoulos, "Identifying Word Senses in Greek Text: A comparison of machine learning methods". Proceedings of the European Workshop on Intelligent Forecasting, Diagnosis and Control (IFDICON), pp. 107-113, Santorini, Greece, 2001.
In this paper we perform a comparative evaluation of machine learning methods on the task of identifying the correct sense of a word, based on the context in which it appears. This task is known as word sense disambiguation (WSD) and is one of the hardest and most interesting issues in language engineering. Research on the use of machine learning techniques for WSD has so far focused almost exclusively on English words, due to the scarcity of the required linguistic resources for other languages. The work presented here is the first attempt to apply machine learning methods to Greek words. We have constructed a semantically tagged corpus for two Greek words: a noun with clearly distinguishable senses and a verb with overlapping senses. This corpus is used to evaluate four different machine learning methods and three different representations of the context of the ambiguous word. Our results show that the simple naive Bayesian classifier and a method using Support Vector Machines outperform decision tree induction, even with the use of boosting. Furthermore, the use of a distance-based weighting function for the context of the ambiguous word does not seem to have a substantial effect on the performance of the methods.
D. Pierrakos, G. Paliouras, C. Papatheodorou and C.D. Spyropoulos, "KOINOTITES: A Web Usage Mining Tool for Personalization". Proceedings of the Panhellenic Conference on Human Computer Interaction (PC-HCI), pp. 231-236, Patras, 2001.
This paper presents the Web Usage Mining system KOINOTITES, which uses data mining techniques for the construction of user communities on the Web. User communities model groups of visitors in a Web site, who have similar interests and navigational behaviour. We present the architecture of the system and the results that we obtained in a real Web site.
G. Petasis, Frantz Vichot, Francis Wolinski, G. Paliouras, V. Karkaletsis, and C.D. Spyropoulos, "Using Machine Learning to Maintain Rule-based Named-Entity Recognition and Classification Systems". Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 426-433, Toulouse, 2001.
This paper presents a method that assists in maintaining a rule-based named-entity recognition and classification system. The underlying idea is to use a separate system, constructed with the use of machine learning, to monitor the performance of the rule-based system. The training data for the second system is generated with the use of the rule-based system, thus avoiding the need for manual tagging. The disagreement of the two systems acts as a signal for updating the rule-based system. The generality of the approach is illustrated by applying it to large corpora in two different languages: Greek and French. The results are very encouraging, showing that this alternative use of machine learning can assist significantly in the maintenance of rule-based systems.
G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos and P. Stamatopoulos, "Stacking classifiers for anti-spam filtering of e-mail". Proceedings of the International Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 44-50, Carnegie Mellon University, 2001.
We evaluate empirically a scheme for combining classifiers, known as stacked generalization, in the context of anti-spam filtering, a novel cost-sensitive application of text categorization. Unsolicited commercial e-mail, or "spam", floods mailboxes, causing frustration, wasting bandwidth, and exposing minors to unsuitable content. Using a public corpus, we show that stacking can improve the efficiency of automatically induced anti-spam filters, and that such filters can be used in real-life applications.
G. Petasis, A. Cucchiarelli, P. Velardi, G. Paliouras, V. Karkaletsis, and C.D. Spyropoulos, "Automatic adaptation of proper noun dictionaries through co-operation of machine learning and probabilistic methods". Proceedings of the 23rd ACM SIGIR Conference on R&D in IR (SIGIR), pp. 128-135, Athens, Greece, 2000.
The recognition of Proper Nouns (PNs) is considered an important task in the area of Information Retrieval and Extraction. However, the high performance of most existing PN classifiers heavily depends upon the availability of large dictionaries of domain-specific Proper Nouns, and a certain amount of manual work for rule writing or manual tagging. Though it is not a heavy requirement to rely on some existing PN dictionary (often these resources are available on the web), its coverage of a domain corpus may be rather low in the absence of manual updating. In this paper we propose a technique for the automatic updating of a PN Dictionary through the cooperation of an inductive and a probabilistic classifier. In our experiments we show that, whenever an existing PN Dictionary allows the identification of 50% of the proper nouns within a corpus, our technique allows, without additional manual effort, the successful recognition of about 90% of the remaining 50%.
G. Paliouras, C. Papatheodorou, V. Karkaletsis and C.D. Spyropoulos, "Clustering the Users of Large Web Sites into Communities," Proceedings of the International Conference on Machine Learning (ICML), pp. 719-726, Stanford, California, 2000.
In this paper we analyze the performance of clustering methods on the task of constructing community models for the users of large Web sites. Community models represent patterns of usage of the Web site, which can be associated with different types of user. Knowledge of this type is clearly valuable for commercial sites, where each user is a potential customer. We argue that it is equally valuable for non-commercial sites, because it can assist greatly in the improvement of the site. We evaluate three clustering methods on usage data from a large site that covers on-line resources in Chemistry. The size of the site and its high hit rate impose a serious constraint on the scalability of the methods. We also examine two ways of encoding usage data, which give complementary information about the behavior of the users. Finally, the emphasis is on the construction of meaningful community models, by identifying the descriptive characteristics of communities, at a post-processing stage.
K.V. Chandrinos, I. Androutsopoulos, G. Paliouras and C.D. Spyropoulos, "Automatic Web Rating: Filtering Obscene Content on the Web". Proceedings of the 4th European Conference on Research and Advanced Technology for Digital Libraries (ECDL), Lisbon, Portugal, Lecture Notes in Computer Science, n. 1923, pp. 403-406, Springer-Verlag, 2000.
[No abstract available.]
G. Paliouras, V. Karkaletsis, and C.D. Spyropoulos, "Learning Decision Trees for Named-Entity Recognition and Classification", Proceedings of the Workshop "Machine Learning for Information Extraction", European Conference in Artificial Intelligence, Berlin, Germany, 2000.
We propose the use of decision tree induction as a solution to the problem of customising a named-entity recognition and classification (NERC) system to a specific domain. A NERC system assigns semantic tags to phrases that correspond to named entities, e.g. persons, locations and organisations. Typically, such a system makes use of two language resources: a recognition grammar and a lexicon of known names, classified by the corresponding named-entity types. NERC systems have been shown to achieve good results when the domain of application is very specific. However, the construction of the grammar and the lexicon for a new domain is a hard and time-consuming process. We propose the use of decision trees as NERC "grammars" and the construction of these trees using machine learning. In order to validate our approach, we tested C4.5 on the identification of person and organisation names involved in management succession events, using data from the sixth Message Understanding Conference. The results of the evaluation are very encouraging, showing that the induced tree can outperform a grammar that was constructed manually.
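A minimal sketch of the idea, assuming scikit-learn: each token is described by a few hand-picked contextual features, and a CART-style decision tree (standing in for C4.5) learns which feature combinations mark a name. The features, lexicon and example sentence are illustrative assumptions.

```python
# Sketch: a decision tree acting as a NERC "grammar" over simple token features.
# Each token is described by features of itself and its neighbours
# (capitalisation, lexicon membership, preceding title word); the tree learns
# which combinations signal a person or organisation name.
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer

def token_features(tokens, i, lexicon):
    tok = tokens[i]
    prev_tok = tokens[i - 1] if i > 0 else ""
    return {
        "capitalised": tok[:1].isupper(),
        "in_lexicon": tok in lexicon,
        "prev_is_title": prev_tok in {"Mr.", "Dr.", "President"},
        "all_caps": tok.isupper(),
    }

lexicon = {"Smith", "Jones"}                      # toy name lexicon
sentence = ["President", "Smith", "visited", "Acme", "Corp", "yesterday"]
labels = ["O", "PERSON", "O", "ORG", "ORG", "O"]

feats = [token_features(sentence, i, lexicon) for i in range(len(sentence))]
vec = DictVectorizer()
X = vec.fit_transform(feats)
clf = DecisionTreeClassifier().fit(X, labels)
print(clf.predict(vec.transform([token_features(["Dr.", "Jones"], 1, lexicon)])))
```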
G. Petasis, S. Petridis, G. Paliouras, V. Karkaletsis, S. Perantonis, and C.D. Spyropoulos, "Symbolic and Neural Learning for Named-Entity Recognition". Proceedings of the Symposium on Computational Intelligence and Learning (COIL), pp. 58-66, Chios, Greece, 2000.
Named-entity recognition involves the identification and classification of named entities in text. This is an important subtask in most language engineering applications, in particular information extraction, where different types of named entity are associated with specific roles in events. The manual construction of rules for the recognition of named entities is a tedious and time-consuming task. For this reason, we present in this paper two approaches to learning named-entity recognition rules from text. The first approach is a decision-tree induction method and the second a multi-layered feed-forward neural network. Particular emphasis is placed on the selection of the appropriate feature set for each method and on the extraction of training examples from unstructured textual data. We compare the performance of the two methods on a large corpus of English text and present the results.
I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C.D. Spyropoulos and P. Stamatopoulos. "Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach". Proceedings of the Workshop "Machine Learning and Textual Information Access", European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), pp. 1-13, Lyon, France, 2000.
We investigate the performance of two machine learning algorithms in the context of anti-spam filtering. The increasing volume of unsolicited bulk e-mail (spam) has generated a need for reliable anti-spam filters. Filters of this type have so far been based mostly on keyword patterns that are constructed by hand and perform poorly. The Naive Bayesian classifier has recently been suggested as an effective method for automatically constructing anti-spam filters with superior performance. We thoroughly investigate the performance of the Naive Bayesian filter on a publicly available corpus, contributing towards standard benchmarks. At the same time, we compare the performance of the Naive Bayesian filter to an alternative memory-based learning approach, after introducing suitable cost-sensitive evaluation measures.
G. Paliouras, V. Karkaletsis, I. Androutsopoulos, and C.D. Spyropoulos, "Learning Rules for Large-Vocabulary Word Sense Disambiguation: A Comparison of Various Classifiers". Proceedings of the 2nd International Conference on Natural Language Processing (NLP), Patra, Greece. Lecture Notes in Artificial Intelligence, 1835, pp. 383-394, Springer, 2000.
In this article we compare the performance of various machine learning algorithms on the task of constructing word-sense disambiguation rules from data. The distinguishing characteristic of our work from most of the related work in the field is that we aim at the disambiguation of all content words in the text, rather than focussing on a small number of words. In an earlier study we have shown that a decision tree induction algorithm performs well on this task. This study compares decision tree induction with other popular learning methods and discusses their advantages and disadvantages. Our results confirm the good performance of decision tree induction, which outperforms the other algorithms, due to its ability to order the features used for disambiguation, according to their contribution in assigning the correct sense.
I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, G. Paliouras, and C.D. Spyropoulos, "An Evaluation of Naive Bayesian Anti-Spam Filtering". Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning (ECML), Barcelona, Spain, pp. 9-17, 2000.
It has recently been argued that a Naive Bayesian classifier can be used to filter unsolicited bulk e-mail ("spam"). We conduct a thorough evaluation of this proposal on a corpus that we make publicly available, contributing towards standard benchmarks. At the same time we investigate the effect of attribute-set size, training-corpus size, lemmatization, and stop-lists on the filter's performance, issues that had not been previously explored. After introducing appropriate cost-sensitive evaluation measures, we reach the conclusion that additional safety nets are needed for the Naive Bayesian anti-spam filter to be viable in practice.
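As a rough illustration of the cost-sensitive setting, consider the sketch below, assuming scikit-learn and a toy corpus: if wrongly blocking a legitimate message is treated as lambda times as costly as letting a spam through, the standard decision-theoretic rule is to mark a message as spam only when P(spam | x) > lambda / (1 + lambda). The corpus and the value lambda = 9 are illustrative assumptions, not the paper's configuration.

```python
# Sketch: a Naive Bayes spam filter with a cost-sensitive decision threshold.
# With cost ratio lambda (one lost legitimate mail ~ lambda spams let through),
# a message is marked spam only when P(spam | x) > lambda / (1 + lambda).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "free lottery winner claim prize",
    "project deadline extended to friday",
    "cheap loans approved instantly",
    "agenda for monday's group meeting",
]
train_labels = [1, 0, 1, 0]                      # 1 = spam, 0 = legitimate

vec = CountVectorizer()
X = vec.fit_transform(train_texts)
clf = MultinomialNB().fit(X, train_labels)

lam = 9.0                                        # illustrative cost ratio
threshold = lam / (1.0 + lam)                    # = 0.9
spam_col = list(clf.classes_).index(1)
p_spam = clf.predict_proba(vec.transform(["claim your free prize"]))[:, spam_col]
print(p_spam > threshold)                        # True only if the filter is confident enough
```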
G. Paliouras, C. Papatheodorou, V. Karkaletsis, P.Tzitziras and C.D. Spyropoulos, "Large-Scale Mining of Usage Data on Web Sites," AAAI Spring Symposium on Adaptive User Interfaces, Stanford, California, 2000.
In this paper we present an approach to the discovery of trends in the usage of large Web-based information systems. This approach is based on the empirical analysis of the users' interaction with the system and the construction of user groups with common interests (user communities). The empirical analysis is achieved with the use of cluster mining, a technique that processes data collected from the users' interaction with the Web site. Our main concern is the construction of meaningful communities, which can be used for improving the structure of the site as well as for making suggestions to the users at a personal level. Our case study on a site providing information for researchers in Chemistry shows that the proposed method provides effective mining of large usage databases.
S.M. Rudolfer, G. Paliouras and I. Peers, "A Comparison of Logistic Regression to Decision Tree Induction in the Diagnosis of Carpal Tunnel Syndrome," Computers and Biomedical Research, v. 32, pp. 391-414, 1999.
This paper aims to compare and contrast two types of model (logistic regression and decision tree induction) for the diagnosis of carpal tunnel syndrome using four ordered classification categories. Initially, we present the classification performance results based on more than two covariates (multivariate case). Our results suggest that there is no significant difference between the two methods. Further to this investigation, we present a detailed comparison of the structure of bivariate versions of the models. The first surprising result of this analysis is that the classification accuracy of the bivariate models is slightly higher than that of the multivariate ones. In addition, the bivariate models lend themselves to graphical analysis, where the corresponding decision regions can easily be represented in the two-dimensional covariate space. This analysis reveals important structural differences between the two models.
G. Paliouras and H.C. Jessen, "Statistical and Learning Approaches to Nonlinear Modeling of Labour Force Participation," Neural Network World, v. 9, n.4, pp. 341-363, 1999.
The decision of whether or not to join the labour market is complex and often involves nonlinearities. However, most econometric decision models are linear and therefore may not be able to capture all aspects of the decision problem. In recent years several interesting Machine Learning methods have emerged for estimating nonlinear models in a relatively straightforward manner. It is shown here that some of these methods achieve significantly better classification performance than the standard linear model. Furthermore, a graphical approach is taken for interpreting the nonlinear models for the examined problem.
V. Karkaletsis, G. Paliouras, G. Petasis, N. Manousopoulou and C.D. Spyropoulos, "Named-Entity Recognition from Greek and English Texts". Journal of Intelligent and Robotic Systems, v. 26, n.2, pp. 123-135, 1999.
Named-entity recognition (NER) involves the identification and classification of named entities in text. This is an important subtask in most language engineering applications, in particular information extraction, where different types of named entity are associated with specific roles in events. In this paper, we present a prototype NER system for Greek texts that we developed based on a NER system for English. Both systems are evaluated on corpora of the same domain and of similar size. The time-consuming process for the construction and update of domain-specific resources in both systems led us to examine a machine learning method for the automatic construction of such resources for a particular application in a specific language.
G. Paliouras, C. Papatheodorou, V. Karkaletsis, C.D. Spyropoulos and P.Tzitziras, "From Web Usage Statistics to Web Usage Analysis," Proceedings of the IEEE International Conference on Systems Man and Cybernetics, v. II, pp. 159-164, 1999.
The World Wide Web has become a major source of information that can be turned into valuable knowledge for individuals and organisations. In the work presented here, we are concerned with the extraction of meta-knowledge from the Web. In particular, we focus on knowledge about Web usage, which is invaluable for the construction of Web sites that meet their purpose and prevent user disorientation. Towards this goal, we propose the organisation of the users of a Web site into groups with common navigational behaviour (user communities). We view the task of building user communities as a data mining task, searching for interesting patterns within a database. The database that we use in our experiments consists of access logs collected from the Web site of the Advanced Course on Artificial Intelligence 1999. The unsupervised machine learning algorithm COBWEB is used to organise the users of the site who follow similar paths into a small set of communities. Particular attention is paid to the interpretation of the communities that are generated through this process. For this purpose, we use a simple metric to identify the representative navigational behaviour for each community. This information can then be used by the administrators of the site to re-organise it in a way that is tailored to the needs of each community. The proposed Web usage analysis is much more insightful than the common approach of examining simple usage statistics of the Web site.
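A small sketch of this pipeline under stated assumptions: users are encoded as binary page-visit vectors extracted from the access logs, grouped into communities, and each community is then summarised by its "representative" pages. COBWEB has no standard scikit-learn implementation, so agglomerative clustering stands in for it here, and the 0.6 representativeness cut-off is an illustrative assumption, not the paper's metric.

```python
# Sketch: grouping site users into communities from access-log data and
# picking representative pages for each community.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

pages = ["home", "search", "jobs", "courses", "software"]
# rows = users, columns = pages visited (1) or not (0), derived from parsed logs
visits = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [1, 1, 1, 0, 1],
])
labels = AgglomerativeClustering(n_clusters=2).fit_predict(visits)
for c in np.unique(labels):
    members = visits[labels == c]
    # a page is "representative" if most community members visited it
    representative = [p for p, f in zip(pages, members.mean(axis=0)) if f >= 0.6]
    print(f"community {c}: {representative}")
```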
G. Paliouras, V. Karkaletsis and C.D. Spyropoulos, "Learning Rules for Large Vocabulary Word Sense Disambiguation," Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI '99), v. 2, pp. 674-679, 1999.
Word Sense Disambiguation (WSD) is the process of distinguishing between different senses of a word. In general, the disambiguation rules differ for different words. For this reason, the automatic construction of disambiguation rules is highly desirable. One way to achieve this aim is by applying machine learning techniques to training data containing the various senses of the ambiguous words. In the work presented here, the decision tree learning algorithm C4.5 is applied on a corpus of financial news articles. Instead of concentrating on a small set of ambiguous words, as done in most of the related previous work, all content words of the examined corpus are disambiguated. Furthermore, the effectiveness of word sense disambiguation for different parts of speech (nouns and verbs) is examined empirically.
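A minimal sketch of disambiguation as classification over context words, under illustrative assumptions: each occurrence of an ambiguous word is represented by the words around it, and a decision tree (scikit-learn's CART standing in for C4.5) picks the sense. The example word, contexts and senses below are toy stand-ins, not the financial-news corpus of the paper.

```python
# Sketch: word-sense disambiguation as classification over context words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

contexts = [
    "deposited the cheque at the bank branch downtown",
    "interest rates set by the central bank rose",
    "fished from the river bank all afternoon",
    "the bank of the stream was muddy",
]
senses = ["FINANCE", "FINANCE", "RIVER", "RIVER"]

vec = CountVectorizer()
X = vec.fit_transform(contexts)                 # bag of context words
tree = DecisionTreeClassifier().fit(X, senses)
print(tree.predict(vec.transform(["opened an account at the bank"])))
```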
G. Paliouras, V. Karkaletsis, C. Papatheodorou and C.D. Spyropoulos, "Exploiting Learning Techniques for the Acquisition of User Stereotypes and Communities," Proceedings of the International Conference on User Modeling (UM), CISM Courses and Lectures, n. 407, pp. 169-178, Springer-Verlag, 1999.
In this paper we propose a methodology for acquiring user stereotypes and communities automatically from users' data. Stereotypes are built using supervised learning techniques (C4.5 and AQ15) on personal data extracted from a set of questionnaires answered by the users of a news filtering system. Particular emphasis is given to the characteristic features of the task of learning stereotypes and, in this context, the new notion of community stereotype is introduced. On the other hand, the communities are built using unsupervised learning (COBWEB) on data containing users' interests on the news categories covered by the news filtering system. Our main concern is whether meaningful communities can be constructed, and for this purpose we specify a metric to decide on the representative news categories for each community. The encouraging results presented in this paper suggest that established machine learning methods can be particularly useful for the acquisition of stereotypes and communities.
G. Petasis, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos and I. Androutsopoulos, "Using Machine Learning Techniques for Part-of-Speech Tagging in the Greek Language", Proceedings of the 7th Hellenic Conference on Informatics, Ioannina, Greece, 1999.
This article investigates the use of Transformation-Based Error-Driven learning for resolving part-of-speech ambiguity in the Greek language. The aim is not only to study the performance, but also to examine its dependence on different thematic domains. Results are presented here for two different test cases: a corpus on "management succession events" and a general-theme corpus. The two experiments show that the performance of this method does not depend on the thematic domain of the corpus, and its accuracy for the Greek language is around 95%.
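A minimal sketch of transformation-based error-driven learning using NLTK's Brill tagger machinery: a unigram tagger supplies initial tags, and transformation rules are then learned to correct its context-dependent mistakes. A tiny hand-tagged English toy corpus stands in for the Greek corpora used in the paper; the tagset, templates and rule limits are illustrative assumptions.

```python
# Sketch of Transformation-Based Error-Driven (Brill) learning with NLTK.
# The initial unigram tagger mis-tags the noun/verb-ambiguous word "chairs";
# a learned transformation rule can correct it from the surrounding context.
from nltk.tag import DefaultTagger, UnigramTagger
from nltk.tag.brill import fntbl37
from nltk.tag.brill_trainer import BrillTaggerTrainer

train_sents = [
    [("the", "DT"), ("chair", "NN"), ("broke", "VBD")],
    [("she", "PRP"), ("chairs", "VBZ"), ("the", "DT"), ("board", "NN")],
    [("the", "DT"), ("chairs", "NNS"), ("broke", "VBD")],
]
initial = UnigramTagger(train_sents, backoff=DefaultTagger("NN"))
trainer = BrillTaggerTrainer(initial, fntbl37(), trace=0)
tagger = trainer.train(train_sents, max_rules=5, min_score=1)
print(tagger.tag(["she", "chairs", "the", "meeting"]))
```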
G. Paliouras, C. Papatheodorou, V. Karkaletsis, C.D. Spyropoulos and V. Malaveta, "Learning User Communities for Improving the Services of Information Providers," Proceedings of the European Conference on Research and Advanced Technology for Digital Libraries (ECDL), Lecture Notes in Computer Science, n. 1513, pp. 367-384, Springer-Verlag, 1998.
In this paper we propose a methodology for organising the users of an information providing system into groups with common interests (communities). The communities are built using unsupervised learning techniques on data collected from the users (user models). We examine a system that filters news on the Internet, according to the interests of the registered users. Each user model contains the user's interests on the news categories covered by the information providing system. Two learning algorithms are evaluated: COBWEB and ITERATE. Our main concern is whether meaningful communities can be constructed. We specify a metric to decide which news categories are representative for each community. The construction of meaningful communities can be used for improving the structure of the information providing system as well as for suggesting extensions to individual user models. Encouraging results on a large data-set lead us to consider this work as a first step towards a method that can easily be integrated in a variety of information systems.
G. Paliouras, V. Karkaletsis and C.D. Spyropoulos, "Machine Learning for Domain-Adaptive Word Sense Disambiguation," In Proceedings of the Workshop on Adapting Lexical and Corpus Resources to Sublanguages and Applications, International Conference on Language Resources and Evaluation, Granada, Spain, May 26, 1998.
This paper investigates the use of machine learning techniques for word sense disambiguation. The aim is to improve on the performance of general-purpose methods, by making the disambiguation method adaptable to new domains. Results are presented here for two different test cases: financial news from the Wall Street Journal, extracted from the SEMCOR corpus, and general-theme news from the same corpus. The two experiments show that the adaptive disambiguation method can achieve high recall and precision; more so in the restricted domain of financial news than in the general-theme case.
G. Paliouras and D.S. Bree, "Adaptive Event Recognition with the use of Limited Training Data," In Recent Advances in Information Science and Technology, N.E. Mastorakis (ed.), pp. 225-232, World Scientific, 1998.
This paper presents a novel event recognition system, which is capable of adapting itself to improve its performance on a small set of training data. The event recognition system is represented by a network of events, related to each other by temporal constraints. This symbolic representation is particularly suitable to the treatment of overlapping events, which have been overlooked in most of the work on event recognition. Additionally, a method for refining the temporal parameters of the recognition system is presented here. The method uses a small set of preclassified training examples to improve the performance of the system. The principle of minimal model change is used to overcome the sparseness of the training data. Particular emphasis is given to the issue of multiple positive examples, which is prevalent when allowing overlapping events. The new system has been applied to the thematic analysis of humpback whale songs with encouraging results.
S.M. Rudolfer, G. Paliouras and I. Peers, "Diagnostic Strategies for Carpal Tunnel Syndrome," Conference of the European Society for Medical Decision Making, Turin, Italy, 1996.
Carpal Tunnel Syndrome or CTS (entrapment of the median nerve at the wrist) is the most commonly occurring neurological condition referred to hospital electromyography clinics for investigation. Its diagnosis requires specialised equipment to carry out so-called nerve conduction studies. These are combined with the patient's history and clinical examination to enable the clinician to reach a diagnosis. For the purposes of this study, four diagnostic classes were used: No Abnormality Detected, mild CTS, moderate CTS and severe CTS. The aims here were to use a data set, supplied by the late Dr. John L. James, Consultant Physician, St. Luke's Hospital, Huddersfield, to: (1) compare the diagnostic performances of decision tree (DT) induction and logistic regression (LR), (2) investigate the relative importance of patients' history, clinical examination and nerve conduction studies for the diagnostic performances of DT and LR. One important aspect of nerve conduction studies is non-response to electrical stimulus in some of the measurements. Such values were coded as either 99.9 or 0, according to the type of variable (latency or amplitude, respectively). LR was not able to handle non-responses directly, whereas DT was. For this reason, nerve conduction variables were coded into at most fourteen ordered values, using the quartiles of the four diagnostic classes, with non-response as an extra value at the appropriate end of the range. The data set, consisting of 1710 hands, was randomly split into a design set (850 hands) and a test set (860 hands). Attention was restricted to two models: M1, involving age, sex and handedness, together with nine nerve conduction variables, and M2, including in addition 24 history variables and 5 clinical sign variables. For both M1 and M2, the DT performed better than the corresponding LR models, and used four nerve conduction variables only. Its performances for coded and uncoded nerve conduction values were virtually the same. For M1 and M2, the DT achieved a correct classification rate of 78.5% (hard threshold) and 79.2% (soft threshold). For model M1, the LR with backward elimination used four variables (all NCS), three of which coincided with those used by the DT, and had a correct classification rate of 71.4%; the LR without backward elimination had a correct classification rate of 70.8%; the LR using the four variables selected by the DT had a correct classification rate of 71.9%. For model M2, the LR with backward elimination used 8 variables (4 NCS, 3 history and one clinical sign), and had a correct classification rate of 46.2%; the LR's correct classification rate without backward elimination was 44.9%. Possible reasons for this poor performance, and alternative strategies, will be discussed. The diagnostic performances of all the models were not improved by including the history and clinical signs.
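For readers who want to reproduce this kind of comparison, a sketch assuming scikit-learn: a decision tree and a multinomial logistic regression are fitted on a design set and scored on a held-out test set. The data below are synthetic stand-ins; only the overall set sizes echo the study, and everything else is an illustrative assumption.

```python
# Sketch: comparing decision-tree induction with logistic regression on a
# four-class diagnostic task with a design/test split.  Synthetic data stand
# in for the nerve conduction, history and clinical-sign variables.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1710, 12))           # e.g. age, sex, handedness + 9 NCS variables
y = rng.integers(0, 4, size=1710)         # 0 = NAD, 1 = mild, 2 = moderate, 3 = severe
X_design, X_test, y_design, y_test = train_test_split(X, y, test_size=860, random_state=0)

for name, model in [("decision tree", DecisionTreeClassifier(max_depth=4)),
                    ("logistic regression", LogisticRegression(max_iter=1000))]:
    model.fit(X_design, y_design)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```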
M. Brown and G. Paliouras, Review of "Inside Case-Based Explanation" by R. Schank et al., Minds and Machines, v. 7, n. (1 or 2), 1997.
[No abstract available.]
H.C. Jessen and G. Paliouras, "Predicting Labour Force Participation of Women with the use of Statistical and Learning Classification Techniques," European Conference in Non-Linear Econometrics (EC2), Aarhus, Denmark, 1995.
Traditionally, econometric models have been based on regression methods. One limitation of these methods is their restricted ability to extract complex relations between the independent variables of the model. In particular, in classification tasks, the methods that are typically used can only model linear discrimination between the examined classes. In this paper we use the task of predicting labour force participation of women to illustrate these problems. This is achieved by comparing the classification performance of logistic regression with two newly developed methods originating from the field of Machine Learning (Neural Networks and Decision Trees). The latter are able to construct non-linear discrimination surfaces and achieve a high out-of-sample classification performance. Encouraged by these results, we attempt to achieve a similar increase in the performance of the logit, by introducing non-linear terms in the model. We then go on to examine the similarities and differences between the three types of non-linear model, in terms of the discrimination and probability surfaces. Finally, we use the latter to express our concerns about the interpretation of the probabilities, especially with respect to the elasticity of labour force participation to wages.
G. Paliouras and D.S. Bree, "The Effect of Numeric Features on the Scalability of Inductive Learning," Proceedings of the European Conference in Machine Learning (ECML), Lecture Notes for Artificial Intelligence, n. 912, pp. 218-231, Springer-Verlag, 1995.
The behaviour of a learning program as the quantity of data increases affects to a large extent its applicability on real-world problems. This paper presents the results of a theoretical and experimental investigation of the scalability of four well-known empirical concept learning programs. In particular it examines the effect of using numeric features in the training set. The theoretical part of the work involved a detailed worst-case computational complexity analysis of the algorithms. The results of the analysis deviate substantially from previously reported estimates, which have mainly examined discrete and finite feature spaces. In order to test these results, a set of experiments was carried out, involving one artificial and two real data sets. The artificial data set introduces a near-worst-case situation for the examined algorithms, while the real data sets provide an indication of their average-case behaviour.
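A quick empirical check in the spirit of these scalability experiments can be run as follows; scikit-learn's CART-style decision-tree learner stands in for the four programs analysed in the paper, and the data sizes and target concept are illustrative assumptions.

```python
# Sketch: how training time grows with the number of examples when the
# features are numeric.
import time
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
for n in (1_000, 10_000, 100_000):
    X = rng.normal(size=(n, 10))                  # ten numeric features
    y = (X[:, 0] + X[:, 1] > 0).astype(int)       # simple target concept
    start = time.perf_counter()
    DecisionTreeClassifier().fit(X, y)
    print(f"{n:>7} examples: {time.perf_counter() - start:.3f} s")
```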