Events and Group Seminars
Here is collected the list of seminars of the Data and Information Management group, part of the IDI department. The list starts from July 2009.
09 Mar 2011 | Dr. George Tsatsaronis A Maximum-Entropy Approach for Accurate Document Annotation in the Biomedical Domain +
The increasing amount of scientific literature on the Internet and the
absence of efficient tools for classifying and searching these
documents are two of the most important factors that limit both the
speed of search and the quality of the results. Previous studies have
shown that the use of ontologies makes it possible to process document
and query information at the semantic level, which greatly improves
the search for relevant information and takes us one step closer to
the Semantic Web. A fundamental step in these approaches is the
annotation of documents with ontology concepts, which can also be seen
as a classification task. In this work we address this issue for the
biomedical domain and present a new automated and robust method, based
on a Maximum Entropy approach, for annotating biomedical literature
documents with MeSH concepts, achieving very high F-measure.
|
01 Mar 2011 | Dr. Kim Jin-Dong The Activities of DBCLS +
The Database Center for Life Science (DBCLS) is a government-funded center in Japan whose mission is the integration of life science databases. In this talk, I will introduce the activities of DBCLS, which range from DB hosting, integration, the Semantic Web, and NLP to licensing issues, while seeking possible collaboration with NTNU.
Kim Jin-Dong received a Ph.D. in computer science, specializing in NLP, from Korea University in 2000. He was a Project Researcher and Lecturer at the University of Tokyo from 2001 to 2010 and has been a Project Associate Professor at DBCLS since 2010. He is a co-author of the GENIA corpus and a co-organizer of the BioNLP shared task.
|
11 Feb 2011 | Muhammad Ali Norozi Relevancy in Schema-Agnostic Environment +
Relevance is an important component of free-text search and often
distinguishes one implementation from another. Relevance is used to
score matching documents and rank them according to the user's intent.
One of the reasons for Google's popularity is its good relevance
ranking, originally based on the PageRank algorithm. The emergence of
semi-structured data as a standard for data representation opened up
new areas of interest to both the database and information retrieval
communities. Although the information retrieval and database
viewpoints were, until quite recently, irreconcilable, semi-structured
retrieval has helped to bridge the gap. This work explores relevance
in semi-structured retrieval, both in isolation and as a bridge
between the database and information retrieval communities.
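The PageRank idea mentioned above can be sketched as a simple power iteration over a toy link graph. The graph, damping factor, and iteration count below are illustrative assumptions, not taken from the talk:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of outlinks."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Teleportation mass, shared uniformly.
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:  # dangling page: spread its rank uniformly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Page "c" is linked by both "b" and "d", so it accumulates the highest score.
graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
```

The scores form a probability distribution over pages, so they sum to one; a page with more (and better-ranked) in-links ends up with a larger share.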
|
08 Dec 2010 | Tanja Mercun Presentation of data and navigation in bibliographic information systems +
Inefficient, difficult-to-use, and outdated user interfaces of library
catalogues and other bibliographic information systems have been
criticized continuously over the years, and with a growing selection
of information sources and providers on the web, an increasing number
of users have started to bypass library systems when searching for
information. In exploring new ways to extract more value from library
data and improve library catalogues, we have chosen the implementation
of the FRBR model as our central approach. The FRBR model has great
potential not only for more effective cataloguing, but especially for
end-user oriented organization and display of records and search
results, navigation, and presentation of relationships.
In the presentation we will discuss our current work on possible uses
of FRBR in user interfaces of library catalogues that could improve
the findability of resources as well as support exploration and
discovery.
|
03 Dec 2010 | Naimdjon Takhirov An XML-based representational document format for FRBR systems +
Metadata related to cultural items such as movies, books and music is
a valuable resource that currently is exploited in many applications
and services based on mashup and linked data. Unfortunately, existing
metadata formats do not have the semantics needed for versatile
integration and reuse of such information across domains and
applications. The conceptual model in the Functional Requirements for
Bibliographic Records is a major contribution towards a solution, but
the existing large body of legacy data makes a transition to this
model difficult. In this paper we present a format for exchange of
MARC-based information that makes the entities and relationships of
the FRBR model explicit. The main purpose of this format is to enable
the exchange of FRBR enriched MARC records while still maintaining
compatibility with MARC-based systems.
|
26 Nov 2010 | Massimiliano Ruocco Event Clusters Detection on Flickr Images using a Suffix-Tree Structure +
Image clustering is a problem that has been treated extensively in both
Content-Based (CBIR) and Text-Based (TBIR) Image Retrieval Systems. In
this paper, we propose a new image clustering approach that takes
annotation, time, and geographical position into account. Our goal is to
develop a clustering method that allows an image to be part of an event
cluster. We extend a well-known clustering algorithm called Suffix Tree
Clustering (STC), which was originally developed to cluster text
documents using a document snippet. To be able to use this algorithm, we
consider an image with annotation as a document. Then, we extend it to
also include time and geographical position. This appears to be
particularly useful on the images gathered from online photo-sharing
applications such as Flickr. Here image tags are often subjective and
incomplete. For this reason, clustering based on textual annotations
alone is not enough to capture all context information related to an
image. Our approach has been suggested to address this challenge. In
addition, we propose a novel algorithm to extract event clusters. The
algorithm is evaluated using an annotated dataset from Flickr, and a
comparison between different granularity of time and space is provided.
|
25 Nov 2010 | Krisztian Balog Entity Search +
We have come to depend on technological resources to create order and
find meaning in the ever-growing amount of online data. A large fraction
of (web) search queries concern named entities: persons, organizations,
locations, etc. These information needs are better answered by returning
specific objects instead of just any type of documents that merely
mention them.
In this talk I will briefly review my work on entity-oriented retrieval.
Starting with the task of finding people in organizational environments,
I will gradually expand the scope of the search both in terms of type
(from people to other types of entities) and scale (from intranet to
internet). To address these tasks I propose a probabilistic retrieval
framework based on statistical language modeling techniques. On top of
the basic layer of these solid text-based models, I will discuss how
top-down semantic information can be incorporated. Using standard data
sets from international evaluation campaigns, I will demonstrate that
the proposed approaches achieve state-of-the-art performance in terms of
effectiveness, while maintaining high efficiency.
|
19 Nov 2010 | Marek Ciglan Fast Detection of Size-Constrained Communities in Large Networks +
Community detection in networks is a prominent task in graph data
mining, owing to the rapid emergence of graph data, e.g., information
networks and social networks. In this paper, we propose a new
algorithm for detecting communities in networks. Our approach differs
from others in its ability to constrain the size of the communities
being generated, a property important for a class of applications. In
addition, the algorithm is greedy in nature and belongs to a small
family of community detection algorithms with pseudo-linear time
complexity, making it applicable also to large networks. The algorithm
is able to detect small clusters independently of the network size. It
can be viewed as a complementary approach to methods optimizing
modularity, which tend to produce larger communities as the network
size grows. Extensive evaluation of the algorithm on synthetic
benchmark graphs for community detection showed that the proposed
approach is very competitive with state-of-the-art methods,
outperforming other approaches in some settings.
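As a rough illustration of the size-constraint idea (not the paper's actual algorithm), a greedy label-propagation sketch can simply refuse moves that would push a community past a size cap:

```python
from collections import Counter

def constrained_label_propagation(adj, max_size, rounds=10):
    """Greedy label propagation with a hard cap on community size.

    adj: dict node -> set of neighbors. Each node starts in its own
    community and adopts the most common neighbor label whose community
    still has room. A toy sketch under illustrative assumptions.
    """
    label = {v: v for v in adj}
    size = Counter(label.values())
    for _ in range(rounds):
        changed = False
        for v in adj:
            counts = Counter(label[u] for u in adj[v])
            for cand, _ in counts.most_common():
                if cand == label[v]:
                    break  # already in the (locally) best community
                if size[cand] < max_size:
                    size[label[v]] -= 1
                    size[cand] += 1
                    label[v] = cand
                    changed = True
                    break
        if not changed:
            break
    return label

# Two triangles joined by one edge; with max_size=3 each triangle
# settles into its own community instead of merging.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
labels = constrained_label_propagation(adj, max_size=3)
```

The cap is what keeps the result independent of network size: without it, plain label propagation would happily grow one giant community across the bridge edge.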
|
12 Nov 2010 | Simon Jonassen A Combined Semi-Pipelined Query Processing Architecture for Distributed Full-Text Retrieval +
Term-partitioning is an efficient way to distribute a large inverted
index. Two fundamentally different query processing approaches are
pipelined and non-pipelined. While the pipelined approach provides
higher query throughput, the non-pipelined approach provides shorter
query latency. In this work we propose a third alternative, combining
non-pipelined inverted index access, heuristic decision between
pipelined and non-pipelined query execution and an improved query
routing strategy. Our results show that the method combines the
advantages of both approaches, providing high throughput and short
query latency. Our method increases throughput by up to 26% compared
to the non-pipelined approach and reduces latency by up to 32%
compared to the pipelined approach.
|
15 Oct 2010 | Joao da Rocha Junior On the Selectivity of Multidimensional Routing Indices +
Recently, the problem of efficiently supporting advanced query
operators, such as nearest neighbor or range queries, over
multidimensional data in widely distributed environments has attracted
much attention. In unstructured peer-to-peer (P2P) networks, peers store
data in an autonomous manner, thus multidimensional routing indices
(MRI) are required, in order to route user queries efficiently to only
those peers that may contribute to the query result set. Focusing on a
hybrid unstructured P2P network, in this paper, we analyze the
parameters for building MRI of high selectivity. In the case where
similar data are located at different parts of the network, MRI exhibit
extremely poor performance, which renders them ineffective. We present
algorithms that boost the query routing performance by detecting similar
peers and reassigning these peers to other parts of the hybrid network
in a distributed and scalable way. The resulting MRI are able to eagerly
discard routing paths during query processing. We demonstrate the
advantages of our approach experimentally and show that our framework
enhances a state-of-the-art approach for similarity search in terms of
reduced network traffic and number of contacted peers.
|
03 Oct 2010 | Nattiya Kanhabua Determining Time of Queries for Re-ranking Search Results +
Recent work on analyzing query logs shows that a significant
fraction of queries are temporal, i.e., relevancy is dependent on time,
and temporal queries play an important role in many domains, e.g.,
digital libraries and document archives. Temporal queries can be divided
into two types: 1) those with temporal criteria explicitly provided by
users, and 2) those with no temporal criteria provided. In this paper,
we deal with the latter type of queries, i.e., queries that comprise
only keywords and whose relevant documents are associated with
particular time periods not given by the queries. We propose a number of methods to
determine the time of queries using temporal language models. After
that, we show how to increase the retrieval effectiveness by using the
determined time of queries to re-rank the search results. Through
extensive experiments we show that our proposed approaches improve
retrieval effectiveness.
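The core idea of determining the time of a keyword query with temporal language models can be sketched as follows. The corpus, Dirichlet smoothing parameter, and scoring details below are illustrative assumptions, not the paper's exact models:

```python
import math
from collections import Counter

def time_of_query(query, dated_docs, mu=100.0):
    """Guess the most likely time period for a keyword query.

    dated_docs: dict period -> list of documents (each a list of words).
    Each period gets a Dirichlet-smoothed unigram language model; the
    query is assigned to the period with the highest log-likelihood.
    """
    # Background model over all periods, used for smoothing.
    background = Counter()
    for docs in dated_docs.values():
        for doc in docs:
            background.update(doc)
    total_bg = sum(background.values())

    best_period, best_score = None, float("-inf")
    for period, docs in dated_docs.items():
        model = Counter()
        for doc in docs:
            model.update(doc)
        total = sum(model.values())
        score = 0.0
        for w in query:
            p_bg = background[w] / total_bg if total_bg else 0.0
            score += math.log((model[w] + mu * p_bg + 1e-12) / (total + mu))
        if score > best_score:
            best_period, best_score = period, score
    return best_period

corpus = {
    "2004": [["olympics", "athens", "games"], ["athens", "ceremony"]],
    "2008": [["olympics", "beijing", "games"], ["beijing", "stadium"]],
}
guess = time_of_query(["beijing", "olympics"], corpus)
```

Once a period is determined this way, documents from that period can be boosted when re-ranking the result list, which is the second half of the paper's pipeline.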
|
27 Aug 2010 | Nils Grimsmo Fast Optimal Twig Joins +
In XML search systems twig queries specify predicates on node values and
on the structural relationships between nodes, and a key operation is to
join individual query node matches into full twig matches. Linear time
twig join algorithms exist, but many non-optimal algorithms with better
average-case performance have been introduced recently. These use
somewhat simpler data
structures that are faster in practice, but have exponential worst-case
time complexity. In this paper we explore and extend the solution space
spanned by previous approaches. We introduce new data structures and
improved strategies for filtering out useless data nodes, yielding
combinations that are both worst-case optimal and faster in practice. An
experimental study shows that
our best algorithm outperforms previous approaches by an average factor
of three on common benchmarks. On queries with at least one unselective
leaf node, our algorithm can be an order of magnitude faster, and it is
never more than 20% slower on any tested benchmark query.
|
13 Aug 2010 | Georg Russ Spatial Data Mining in Precision Agriculture +
The talk will first briefly introduce the area of precision agriculture and high-resolution geodata, before covering two important tasks that arise in this area nowadays. One of those tasks is yield prediction, for which some of the issues with spatial data must be taken into account; for this purpose, a simple spatial cross-validation technique has been developed. The second task falls in the area of management zone delineation, i.e., the subdivision of an agricultural site into zones that should be managed differently with respect to fertilizer or pesticides, for example. A spatial clustering-based approach to this non-trivial task will be presented.
|
15 Jun 2010 | Nattiya Kanhabua Exploiting Time-based Synonyms in Searching Document Archives +
Query expansion of named entities can be employed to increase retrieval effectiveness. A peculiarity of named entities, compared to other vocabulary terms, is that they are very dynamic in appearance, and synonym relationships between terms change with time. In this paper, we present an approach to extracting synonyms of named entities over time from the whole history of Wikipedia. In addition, we use their temporal patterns as a feature in ranking and classifying them into two types: time-independent and time-dependent. Time-independent synonyms are invariant to time, while time-dependent synonyms are relevant to a particular time period, i.e., the synonym relationships change over time. Further, we describe how to make use of both types of synonyms to increase retrieval effectiveness: query expansion with time-independent synonyms for ordinary search, and query expansion with time-dependent synonyms for search with respect to temporal criteria. Finally, through an evaluation based on TREC collections, we demonstrate how the retrieval performance of queries consisting of named entities can be improved using our approach.
|
21 May 2010 | Christos Doulkeridis Reverse Top-k Queries: Current State and Research Challenges +
Top-k queries are widely applied for retrieving a ranked set of the k most interesting objects based on the individual user preferences. As an example, in online marketplaces, customers (users) typically seek a ranked set of products (objects) that satisfy their needs. Reversing top-k queries leads to a query type that instead returns the set of customers that find a product appealing (it belongs to the top-k result set of their preferences). In this talk, we provide an introduction to reverse top-k queries and a brief overview of query processing algorithms and techniques.
In addition, we propose efficient algorithms for processing meaningful variations of reverse top-k queries, such as identifying the most influential products to customers, where influence is defined as the cardinality of the reverse top-k result set. Finally, a roadmap of open problems and research challenges that rely on reverse top-k queries will be presented.
|
23 Apr 2010 | Xiangliang Zhang Hi-AP and StrAP: Algorithms and Applications ---Clustering Large-scale and Streaming Data +
The clustering of large-scale streaming data is a key issue in many application domains. In this talk, we present two algorithms: Hi-AP for clustering large-scale data and StrAP for clustering streaming data. Our Hi-AP algorithm has the merits of 1) only quasi-linear complexity; 2) better clustering performance; and 3) not requiring the number of clusters to be specified. Our StrAP algorithm summarizes data streams with an incrementally updated model. It is designed for the data streaming setting and has the merits of (1) seamlessly updating the clustering model; (2) adapting to changes in the data distribution; and (3) an intelligible compressed data model. Based on Hi-AP and StrAP, we developed a multi-scale online grid monitoring system in the fashion of autonomic computing. We will show the performance of the monitoring system running on a 5-million-job trace from the European EGEE grid and how the system helps to discover device problems (e.g., clogging of LogMonitor).
|
25 Feb 2010 | Akrivi Vlachou Reverse Top-k Queries +
Rank-aware query processing has become essential for many applications that return to the user only the top-k objects based on the individual user’s preferences. Top-k queries have been mainly studied from the perspective of the user, focusing primarily on efficient query processing. In this work, for the first time, we study top-k queries from the perspective of the product manufacturer. Given a potential product, which are the user preferences for which this product is in the top-k
query result set? We identify a novel query type, namely reverse top-k query, that is essential for manufacturers to assess the potential market and impact of their products based on the competition. We formally define reverse top-k queries and introduce two versions of the query, namely monochromatic and bichromatic and present efficient algorithms. Our experimental evaluation
demonstrates the efficiency of our techniques, which reduce the required number of top-k computations by 1 to 3 orders of magnitude.
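A minimal brute-force sketch of the bichromatic reverse top-k definition follows. The products, user preference vectors, and linear scoring function are illustrative assumptions; the talk's contribution is precisely avoiding this naive per-user top-k computation:

```python
def top_k(weights, products, k):
    """Indices of the k products with the highest linear score for one user."""
    scored = sorted(range(len(products)),
                    key=lambda i: sum(w * a for w, a in zip(weights, products[i])),
                    reverse=True)
    return set(scored[:k])

def reverse_top_k(query, products, users, k):
    """Which users rank the query product among their top-k?"""
    candidates = products + [query]
    q_idx = len(products)  # index of the query product
    return [u for u, w in enumerate(users) if q_idx in top_k(w, candidates, k)]

# Product attributes (higher is better) and per-user preference weights.
products = [(5.0, 1.0), (1.0, 5.0), (3.0, 3.0)]
users = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
result = reverse_top_k((4.0, 4.0), products, users, k=1)
```

Here only the balanced user (index 2) scores the query product (4.0, 4.0) above every existing product, so the reverse top-1 set contains just that user; the two extreme users already prefer one of the specialized products.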
|
19 Feb 2010 | Muhammad Ali Norozi Ranking the Web using Linear Algebra +
The talk is about link analysis ranking algorithms and their state-of-the-art mathematical interpretations. I will present a notable improvement in the convergence behavior of query-dependent algorithms such as HITS, SALSA, and their descendants (e.g., Exponentiated and Randomized HITS) using extrapolation techniques, which accelerate the algorithms by reducing the number of iterations needed and thus yield much faster convergence. In the experiments I obtained even better results than theoretically predicted: a speedup of 3 to 19 times.
|
05 Feb 2010 | Mihaela A. Bornea Serializability with Snapshot Isolation under the Hood +
This presentation proposes a new multi-version concurrency control algorithm, called serializable generalized snapshot isolation (SGSI), targeting middleware-replicated database systems. Under this algorithm, each replica runs snapshot isolation locally and the replication middleware guarantees global serializability by performing enhanced certification for update transactions. We prove the correctness of the proposed algorithm and employ novel techniques both to extract transaction readsets and to perform enhanced certification that prevents read-write and write-write conflicts, without changing the underlying database replicas. We build a prototype replicated database system, which uses snapshot-isolated database engines while maintaining serializable execution. We assess the algorithm experimentally using the TPC-W benchmark, show that it is practical, and demonstrate that it has low overhead for small degrees of replication.
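The certification step can be sketched roughly as follows, in a simplified model with explicit read/write sets and commit timestamps; the actual middleware extracts readsets automatically and is considerably more involved:

```python
def certify(txn, committed, snapshot):
    """Certification test in the spirit of SGSI.

    Abort a transaction if any concurrently committed transaction
    (commit timestamp after the transaction's snapshot) wrote an item
    this transaction read or wrote.

    txn: dict with 'readset' and 'writeset' (sets of item keys).
    committed: list of (commit_ts, writeset) pairs.
    snapshot: start timestamp of txn's snapshot.
    """
    for commit_ts, writeset in committed:
        if commit_ts <= snapshot:
            continue  # already visible in the snapshot: no conflict
        if writeset & txn["writeset"]:
            return False  # write-write conflict (plain SI also aborts this)
        if writeset & txn["readset"]:
            return False  # read-write conflict: would break serializability
    return True

committed = [(5, {"x"}), (12, {"y"})]
# Reads "y", which was overwritten after its snapshot -> must abort.
t_read_y = {"readset": {"y"}, "writeset": {"z"}}
# Reads "x", whose write is already in the snapshot -> may commit.
t_read_x = {"readset": {"x"}, "writeset": {"w"}}
ok1 = certify(t_read_y, committed, snapshot=10)
ok2 = certify(t_read_x, committed, snapshot=10)
```

The read-write check is what distinguishes this from the first-committer-wins rule of plain snapshot isolation, which only examines writeset overlaps.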
|
22 Jan 2010 | Orestis Gorgas Software structure and code reproduction through sequence diagram analysis +
One of the major problems with modern software systems is that, due to their size, their complexity, and the constant upgrades they are subject to, it has become very hard to afford the time, money, and effort to analyze and maintain them. Additionally, the tasks that a system performs are frequently quite different from those the system is intended to complete. The model-driven approach deals with software systems through an abstract view using a variety of models, in order to make the design, implementation, and maintenance of software systems easier.
This presentation is a walk-through of the design and development of a transformation that can be applied to the sequence diagrams of a software system to produce an abstract framework of the code. During the phase of design and development of a software system, this framework can guide the insertion of extra code that can be transformed into a runnable system. During the phase of maintenance, the generated framework can be compared with the actual code to trace inconsistencies between the code and the sequence diagrams. Thus it will be easier, after an update of the sequence diagrams, to spot the areas where the code has to be updated, and vice versa. The transformation of the sequence diagrams and the comparison of the code framework with the actual code are demonstrated through an example system to which the techniques proposed in this work are applied.
|
08 Jan 2010 | Nils Grimsmo Towards Unifying Advances in Twig Join Algorithms +
Twig joins are key building blocks in current XML indexing systems, and numerous algorithms and useful data structures have been introduced. We give a structured, qualitative analysis of recent advances, which leads to the identification of a number of opportunities for further improvements. Cases where combining competing or orthogonal techniques would be advantageous are highlighted, such as algorithms avoiding redundant computations and schemes for cheaper intermediate result management. We propose some direct improvements over existing solutions, such as reduced memory usage and stronger filters for bottom-up algorithms. In addition we identify cases where previous work has been overlooked or not used to its full potential, such as for virtual streams, or the benefits of previous techniques have been underestimated, such as for skipping joins. Using the identified opportunities as a guide for future work, we are hopefully one step closer to unification of many advances in twig join algorithms.
|
26 Nov 2009 | Katja Hose Maintenance Strategies for Routing Indexes +
Processing queries efficiently in large-scale unstructured P2P networks is a crucial part of operating such systems. The straightforward solution of querying all the peers in the network (flooding) leads to complete query answers but does not scale well with the number of peers. Thus, in order to avoid the expensive flooding of the network for query processing, routing indexes are used. Each peer maintains such an index for its neighbors. It provides a compact representation (data summary) of data accessible via each neighboring peer. Based on this information and a given query, a peer can decide whether it is worthwhile to forward the query to a particular neighbor or not. As P2P networks are dynamic systems and peers might change their local data over time, an important problem in this context is to keep these data summaries up-to-date without paying high maintenance costs.
This talk discusses the problem of updating routing indexes in P2P-based environments in the absence of global knowledge and central instances. Using the QTree, a combination of R-trees and histograms, as an example base structure for routing indexes, this talk presents a classification of maintenance strategies and discusses several approaches to keep maintenance costs at a reasonable level.
|
23 Oct 2009 | Truls A. Bjorklund A Confluence of Column Stores and Search Engines: Opportunities and Challenges +
IR and DB integration has been a long-standing research challenge. Most of the work trying to integrate the two fields is motivated by specific application scenarios. In this paper we approach the problem from another perspective: instead of focusing on IR and DB as whole fields, we restrict the focus to search engines and column stores. We present observations of similarities between the two technologies, and aggregate information on parallel developments in the two fields. We argue that these developments point towards a confluence of column stores and search engines; one may in fact argue that this confluence has already started. We evaluate the potential for developing an engine capable of handling the workloads traditionally supported by the different systems, namely decision support and search workloads, by identifying potential opportunities and challenges. The opportunities include potential areas for technology transfer and more efficient support for features. The identified challenges outline areas for future work whose success will help decide whether a confluence of column stores and search engines is feasible.
|
07 Oct 2009 | Iraklis Varlamis Monitoring the evolution of interests in the blogosphere +
This presentation describes blogTrust, an innovative modular and extensible prototype application for monitoring changes in the interests of blogosphere participants. A new approach for the analysis of weblog contents is introduced, which can yield new insights on the analysis of the blogosphere by monitoring the convergence or dispersion of blogosphere interests.
BlogTrust uses established, robust data mining techniques to support every step of the process. The motivation for the work is a hypothesized strong connection between important (global or "local") events and the rapid reduction in the divergence of (global or "local") weblog topic coverage.
Experimental results on real data provide support for our hypothesis, indicate the most critical points in the proposed process, and point to interesting directions for further research.
|
14 Aug 2009 | Marek Ciglan Mining Interesting Relations from the Wikipedia Link Graph +
Recently, Wikipedia has gained a lot of popularity amongst researchers as a source of data, mainly in the areas of natural language processing and information retrieval and extraction. In this preliminary-work presentation, we describe our ideas for exploiting Wikipedia in a new manner: mining non-trivial semantic relations between sets of topics.
We will present some experiments with the use of a spreading activation algorithm on the link structure of Wikipedia to achieve this goal. In this talk, we discuss the challenges of our approach, describe the proposed solutions, and give a short demonstration of our research prototype.
|
17 Jul 2009 | George Tsatsaronis Text Relatedness based on a Word Thesaurus +
Measuring the relatedness between two text segments in an automated manner is a tedious task. Text conveys semantics that are hard for a computer program to capture. Without doubt, a measure of relatedness between text segments must take into account both the lexical and the semantic relatedness between words. Such a measure that captures well both aspects of text relatedness may help in many tasks, such as text retrieval, classification and clustering. We present a new approach for measuring the semantic relatedness between words based on their implicit semantic links. The approach does not require any type of training, since it exploits a word thesaurus in order to devise implicit semantic links between words. Based on this approach, we introduce a new measure of semantic relatedness between texts, which capitalizes on the semantic relatedness between individual words, and extends it to measure the relatedness between sets of words. We gradually validate our method: we first evaluate the performance of the semantic relatedness measure between individual words in three tasks and then proceed with evaluating the performance of our method in measuring text-to-text semantic relatedness in sentence-to-sentence similarity, paraphrase recognition and text classification. Experimental evaluation shows that the proposed method outperforms every lexicon-based method of word semantic relatedness in the selected tasks and the tested data sets, and competes well against corpus-based approaches that require training. Finally, we show that the proposed measure can be successfully applied to more complex linguistic tasks (e.g. paraphrasing) and that it is able to capture the human notion of relatedness better than traditional lexical matching techniques.
|
26 Jun 2009 | Joao da Rocha-Junior AGiDS: A Grid-based Strategy for Distributed Skyline Query Processing +
Skyline queries help users make intelligent decisions over complex
data, where different and often conflicting criteria are considered. A
challenging problem is to support skyline queries in distributed
environments, where data is scattered over independent sources. The
query response time of skyline processing over distributed data
depends on the amount of transferred data and the query processing
cost at each server. In this paper, we propose AGiDS, a framework for
efficient skyline processing over distributed data. Our approach
reduces significantly the amount of transferred data, by using a
grid-based data summary that captures the data distribution on each
server. AGiDS consists of two phases to compute the result: in the
first phase the querying server gathers the grid-based summary,
whereas in the second phase a skyline request is sent only to the
servers that may contribute to the skyline result set asking only for
the points of non-dominated regions. We provide an experimental
evaluation showing that our approach performs efficiently and
outperforms existing techniques.
The same paper will be presented between August 31 and September 4 at
the Second International Conference on Data Management in Grid and P2P
Systems (Globe 2009).
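For readers unfamiliar with skyline queries, the dominance test at their core can be sketched in a few lines. This is a naive centralized version under the illustrative assumption that smaller values are better; the point of AGiDS is to avoid exactly this all-pairs comparison across distributed servers:

```python
def dominates(p, q):
    """p dominates q if p is at least as good in every dimension and
    strictly better in at least one (here: smaller is better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Naive skyline: keep the points not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# E.g., hotels as (price, distance-to-beach): cheaper and closer is better.
hotels = [(50, 8), (80, 2), (60, 5), (90, 9), (55, 6)]
result = skyline(hotels)
```

Only (90, 9) is dominated here (by (50, 8), which is both cheaper and closer); the remaining four hotels all represent different, incomparable trade-offs between the two criteria.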
|
12 Jun 2009 | Akrivi Vlachou Angle-based Space Partitioning for Efficient Parallel Skyline Computation +
Recently, skyline queries have attracted much attention in the database
research community. Space partitioning techniques, such as recursive division
of the data space, have been used for skyline query processing in centralized,
parallel and distributed settings. Unfortunately, such grid-based partitioning
is not suitable in the case of a parallel skyline query, where all partitions
are examined at the same time, since many data partitions do not contribute to
the overall skyline set, resulting in a lot of redundant processing. In this
talk, we present a novel angle-based space partitioning scheme using the
hyperspherical coordinates of the data points. We demonstrate both formally as
well as through an exhaustive set of experiments that this new scheme is very
suitable for skyline query processing in a parallel shared-nothing
architecture. The intuition of our partitioning technique is that the skyline
points are spread evenly across all partitions. Our novel partitioning scheme
alleviates most of the problems of traditional grid partitioning techniques,
thus managing to reduce the response time and share the computational workload
more fairly.
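In two dimensions, the angle-based assignment can be sketched as follows. The points and partition count are illustrative; the talk works with hyperspherical coordinates in arbitrary dimensions:

```python
import math

def angle_partition(points, num_partitions):
    """Assign 2-d points (non-negative coordinates) to partitions by the
    angle of their polar coordinates, so that each angular slice sees
    points from the whole range of magnitudes, unlike a grid that cuts
    along the axes."""
    parts = [[] for _ in range(num_partitions)]
    for x, y in points:
        theta = math.atan2(y, x)  # in [0, pi/2] for non-negative data
        idx = min(int(theta / (math.pi / 2) * num_partitions),
                  num_partitions - 1)
        parts[idx].append((x, y))
    return parts

points = [(1, 9), (9, 1), (5, 5), (2, 8), (8, 2)]
parts = angle_partition(points, 2)
```

With two partitions the data space splits along the 45-degree line: points favoring the x-axis land in one slice, points favoring the y-axis in the other, so skyline points (which tend to hug the dominance frontier across all angles) are shared between both workers instead of piling into one grid cell.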
|
20 Feb 2009 | Jon Olav Hauglid PROQID and DYFRAM +
1) PROQID: Partial Restarts of Queries in Distributed
Databases
In a number of application areas, distributed database systems can
be used to provide persistent storage of data while offering efficient
access to both local and remote data. With an increasing
number of sites (computers) involved in a query, the probability
of failure at query time increases. Recovery has previously only
focused on database updates while query failures have been handled
by complete restart of the query. This technique is not always
applicable in the context of large queries and queries with deadlines.
In this paper we present an approach for partial restart of
queries that incurs minimal extra network traffic during query recovery.
Based on results from experiments on an implementation
of the partial restart technique in a distributed database system, we
demonstrate its applicability and significant reduction of query cost
in the presence of failures.
2) DYFRAM: Dynamic Fragmentation and Replica
Management in Distributed Database Systems
In distributed database systems, tables are frequently fragmented
and replicated over a number of sites in order to reduce network
communication costs. How to fragment, when to replicate and
how to allocate the fragments to the sites are challenging problems
that have previously been solved either by static fragmentation,
replication, and allocation, or based on a priori query analysis.
Many emerging applications of distributed database systems
generate very dynamic workloads with frequent changes in access
patterns from different sites. In those contexts, continuous refragmentation
and reallocation can significantly improve performance.
In this paper we present DYFRAM, a decentralized approach for
dynamic table fragmentation and allocation in distributed database
systems based on observation of the access patterns of sites to tables.
The approach performs fragmentation, replication, and reallocation
based on recent access history, aiming at maximizing the
number of local accesses compared to accesses from remote sites.
Through simulations, we show that the approach significantly reduces
communication costs for typical access patterns, thus demonstrating
the feasibility of our approach.
|