Tutorials

November 2nd: 9:00am - 12:30pm

Room 202, Tutorial 1
Resources and Methods for the Acquisition of Open-Domain Concepts and Conceptual Hierarchies from Text
Marius Pasca
Google Inc.

Room 203, Tutorial 2
Introduction to Computational Advertising
Andrei Broder, Vanja Josifovski, Evgeniy Gabrilovich
Yahoo! Research

November 2nd: 2:00pm - 5:30pm

Room 202, Tutorial 3
Parallel Algorithms for Mining Large-scale Datasets
Edward Y. Chang, Kaihua Zhu and Hongjie Bai
Google Research

Room 203, Tutorial 4
Statistical Models for Web Search Clicks Log Analysis
Fan Guo (Carnegie Mellon University) and Chao Liu (Microsoft Research)

Tutorial 1
Resources and Methods for the Acquisition of Open-Domain Concepts and Conceptual Hierarchies from Text
Marius Pasca, (Google Inc.)

Abstract: Despite differences in the types of targeted information, as well as in the underlying algorithms and tools, a common theme shared across recent approaches to information extraction is an aggressive push towards large-scale extraction. Documents spanning various genres are readily available on the Web, providing significant amounts of textual content for the acquisition of instances, concepts and conceptual hierarchies, as a step towards the far-reaching goal of automatically constructing knowledge bases from unstructured text. This tutorial provides an overview of extraction methods developed in the area of Web-based open-domain information extraction, with the purpose of acquiring sets of instances within unlabeled or labeled open-domain concepts. The concepts are organized either as a flat set or hierarchically. The extraction methods operate over unstructured or semi-structured text available within collections of Web documents, or over relatively more intriguing streams of anonymized search queries. They take advantage of weak supervision provided in the form of seed examples or small amounts of annotated data, or draw upon knowledge already encoded within resources created strictly by experts or collaboratively by users. The more ambitious methods, aiming at acquiring millions of instances from text, need to be designed to scale to Web collections – a restriction with significant consequences on overall complexity and choice of underlying tools – in order to ultimately aid information retrieval in general and Web search in particular, by producing open-domain concepts, along with facts or relations among instances or among concepts.

Tutorial 2
Introduction to Computational Advertising
Andrei Broder, Vanja Josifovski, Evgeniy Gabrilovich (Yahoo! Research)

Abstract: Online advertising affects virtually every Web user, and over recent years has grown into a $20 billion industry. As with the Web corpus, the structure of online ads is substantially different from that of any previously studied text corpus. The queries used for selecting online ads can also differ substantially from the commonly explored short textual queries, for example when selecting advertisements for a given web page or for a specific user context. These differences require reexamination of many conclusions of traditional IR, such as document analysis, query expansion, scoring and length normalization, and performance evaluation. In this tutorial we will give an overview of the Ad Retrieval field of Computational Advertising. Computational advertising is a new scientific discipline that studies the process of advertising on the Internet and combines methods from IR, machine learning, statistics, optimization and economics to select the optimal ads for a given user in a given context on the Web. We will demonstrate how to employ a relevance feedback assumption and use Web search results retrieved by the query. This step allows one to use the Web as a repository of relevant query-specific knowledge. We will also describe techniques that go beyond conventional bag-of-words indexing, and construct additional features using a large external taxonomy and a lexicon of named entities obtained by analyzing the entire Web as a corpus.

Tutorial 3
Parallel Algorithms for Mining Large-scale Datasets
Edward Y. Chang, Kaihua Zhu and Hongjie Bai (Google Research)

Abstract: The explosive growth of data such as text, photos, videos, and biological data requires scalable computational solutions. For instance, YouTube attracts more than 10 hours of uploaded video per minute. Photo sites such as Flickr and PicasaWeb receive millions of uploads per week. And the coming of personal genome data can be exceedingly demanding in storage and computation. To organize, index, analyze, and retrieve these large-scale data, a system must employ scalable algorithms. Therefore, at the forefront, the research community ought to consider solving real, large-scale problems, rather than dealing with small toy datasets, whose success does not translate to real-world, large datasets. In this tutorial, we will present key models and parallel algorithms for dealing with data at the giga-scale. We will also provide participants with a large annotated dataset for conducting research.

Tutorial 4
Statistical Models for Web Search Clicks Log Analysis
Fan Guo (Carnegie Mellon University) and Chao Liu (Microsoft Research)

Abstract: Every day, billions of queries and clicks submitted to search engines are automatically logged and aggregated. Such click data have become one of the most important and extensive feedback signals from the World Wide Web audience. They are valuable resources both for information retrieval researchers, to better understand human interaction with retrieval results and to calibrate their hypotheses or models, and for web search practitioners, to measure, monitor and learn to improve search engine performance. However, the interpretation of user clicks is a non-trivial task because many elements come into play in the decision process. For example, previous eye-tracking studies indicated that clicks, taken as a form of absolute relevance judgment, are generally biased, and that the decision to click on a web document depends on both its position (rank) and its context (the other documents presented).

Click models usually incorporate a statistical depiction of user interaction with web search results in a query session, by specifying probabilities of examination and clicks at different positions and how they depend on each other. They provide principled, scalable solutions for inferring the user-perceived relevance of web documents, and model outputs can be further leveraged in various search-related applications including search engine quality evaluation and sponsored search auctions. In the past year, quite a few click models have been presented in leading data mining, web search, and information retrieval conferences such as KDD, WWW, SIGIR and WSDM. They have been well received by audiences with both academic and industrial backgrounds, and have stimulated much in-depth discussion and investigation. The growing popularity and impact of this topic are reflected in the fact that both the WWW'09 and the SIGIR'09 conference programs have an individual session devoted to click models. We believe that a well-organized tutorial on this emerging theme, accessible to researchers and developers from the database, information retrieval, and knowledge management communities, is timely and in high demand.
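As a concrete illustration of the kind of model described above, the sketch below simulates the well-known cascade click model (where a user scans results top-down and stops at the first satisfying click) and recovers per-position relevance from the logged clicks by maximum likelihood. The relevance values and session count are illustrative, not drawn from the tutorial.

```python
import random

def simulate_session(relevance):
    """Cascade model: the user examines results top-down and
    clicks the first satisfying document, then stops."""
    for pos, r in enumerate(relevance):
        if random.random() < r:
            return pos  # position of the (single) click
    return None  # abandoned session, no click

def estimate_relevance(sessions, n_pos):
    """MLE under the cascade model: relevance at position i is
    (clicks at i) / (sessions in which position i was examined)."""
    clicks = [0] * n_pos
    examined = [0] * n_pos
    for click_pos in sessions:
        # A position is examined iff no click occurred above it;
        # the clicked position itself is also examined.
        last = click_pos if click_pos is not None else n_pos - 1
        for i in range(last + 1):
            examined[i] += 1
        if click_pos is not None:
            clicks[click_pos] += 1
    return [c / e if e else 0.0 for c, e in zip(clicks, examined)]

random.seed(0)
true_relevance = [0.6, 0.3, 0.1]  # hypothetical per-position relevance
sessions = [simulate_session(true_relevance) for _ in range(20000)]
estimated = estimate_relevance(sessions, len(true_relevance))
```

Note how the estimator corrects for position bias: raw click-through rates at lower positions are deflated because many sessions never examine them, while dividing by the examination count recovers the underlying relevance.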

In this tutorial, we will present a comprehensive overview of these most recent developments, examine and compare state-of-the-art models, explore several application scenarios, and lay out challenges as well as future directions of this area.
