Kyu-Young Whang (KAIST, Korea)
As the need grows to integrate the DBMS (for structured data) with information retrieval (IR) features (for unstructured data), DB-IR integration is becoming one of the major challenges in the database area. Extensible architectures provided by commercial object-relational DBMS (ORDBMS) vendors can be used for DB-IR integration. Here, extensions are implemented using a high-level (typically, SQL-level) interface. We call this architecture loose-coupling. The advantage of loose-coupling is ease of implementation. However, loose-coupling is not preferable for implementing new data types and operations in large databases when high performance is required. In this talk, we present a new DBMS architecture applicable to DB-IR integration, which we call tight-coupling. In tight-coupling, new data types and operations are integrated into the core of the DBMS engine in the extensible type layer. Thus, they are incorporated as “first-class citizens” within the DBMS architecture and are supported in a consistent manner with high performance. This tight-coupling architecture is being used to incorporate IR features and spatial database features into the Odysseus ORDBMS, which has been under development at KAIST/AITrc for over 19 years. In this talk, we introduce Odysseus and explain its tightly-coupled IR features (U.S. patented in 2002). Then, we demonstrate the performance advantage of tight-coupling by showing benchmark results. We have built a web search engine using Odysseus that is capable of managing 100 million web pages per node in a non-parallel configuration. This engine has been successfully tested in many commercial environments. This work won the Best Demonstration Award at the IEEE ICDE conference held in Tokyo, Japan, in April 2005. Last, we present a design of a massively-parallel search engine using Odysseus. Recently, parallel search engines have been implemented based on scalable distributed file systems (e.g., GFS). Nevertheless, building a massively-parallel search engine using a DBMS can be an attractive alternative since it supports a higher-level (i.e., SQL-level) interface than that of a distributed file system while providing scalability. The parallel search engine we designed is capable of indexing 30 billion web pages with performance comparable to or better than that of state-of-the-art search engines.
Biography:
Kyu-Young Whang is a KAIST Distinguished Professor and Professor of
Computer Science at KAIST. Previously, he was with the IBM T.J. Watson
Research Center from 1983 to 1990. Since joining KAIST in 1990,
he has been leading the Odysseus DBMS/Search Engine project featuring
tight-coupling of DBMS with information retrieval (IR) and spatial
functions. An earlier version of this technology played a vital
role in 1997-2000 in starting up NaverCom Co. (currently, NHN Co.),
which is the number one portal in Korea. Dr. Whang is one of the
pioneers of probabilistic counting, which is now widely used
in approximate query answering, sampling, and data streaming. One
of the algorithms he co-developed at IBM Almaden (then San Jose)
Research Lab in 1981 has been made part of DB2. Dr. Whang is the
author of the first main-memory relational query optimization model
developed in 1985 and reported in 1990 in ACM TODS in the context of
Office-by-Example (OBE). This model influenced subsequent
optimization models of commercial main-memory DBMSs. His research
has covered a wide range of database issues including physical database
design, query optimization, DBMS engine technologies, and more
recently, IR, spatial databases, data mining, and XML. Dr. Whang
was the Coordinating Editor-in-Chief of the prestigious VLDB Journal,
having served the journal for 19 years from its inception as a founding
editorial board member. He is a Trustee Emeritus of the VLDB
Endowment and served the international academic community as the
General Chair of VLDB2006, DASFAA2004, and PAKDD2003, as a PC Co-Chair
of VLDB2000, CoopIS1998, and ICDE2006, and as an editorial board member
of journals such as IEEE TKDE, The WWW Journal, and IEEE Data
Engineering Bulletin. He served as the Chair of the Steering Committee
of the DASFAA International Conference and as a co-founder of the
Korea-Japan Database Workshop (KJDB) annually held alternately in Korea
and Japan. He is a member of the ACM SIGMOD Dissertation Award
Committee and served as a member of many 10-year Best or Influential
Paper Award committees of VLDB and IEEE ICDE. He served as an
IEEE Distinguished Visitor from 1989 to 1990 and was invited to ACM
SIGMOD Distinguished Profile in Databases in 2007. He
earned his Ph.D. from Stanford University in 1984. Dr. Whang is
an IEEE Fellow and a member of the ACM and IFIP WG 2.6.
Edward Chang (Google Research China)
Confucius was a great teacher in ancient China. His theories and principles were
effectively spread throughout China by his disciples. Confucius
is the product code name of Google’s Knowledge Search product, which is
built at Google’s Beijing lab by my team. In this talk, I present
Knowledge Search’s key disciples, which are data management subroutines
that generate labels for questions, that match existing answers to a
question, that evaluate quality of answers, that rank users based on
their contributions, that distill high-quality answers for search
engines to index, etc. This talk presents scalable algorithms
that we have developed to make these disciples effective in dealing
with huge datasets. Efforts to make these algorithms run even faster
on thousands of machines, along with some open research problems, will
also be presented.
Clement Yu (University of Illinois at Chicago)
A metasearch engine is a system that is connected to multiple search engines. In response to a user query, it invokes the search engines suitable for the query, merges the information they return, and outputs the merged result. There are two types of metasearch engines: one for unstructured data (mostly text) and the other for structured data. In comparison to a single text search engine, a metasearch engine can have a higher coverage of the Web and can have more timely information. A metasearch engine for structured data facilitates comparison shopping and services and is convenient to use. In this talk, we discuss the problems involved in building metasearch engines and their potential solutions. In addition, challenges and unsolved problems are sketched.
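The query flow described above (select suitable engines, invoke them, merge the results) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the talk: the toy engines, the keyword-based `select_engines` helper, and the use of reciprocal rank fusion as the merging rule are all hypothetical stand-ins.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set

# Hypothetical component engine: takes a query, returns a ranked list of URLs.
Engine = Callable[[str], List[str]]


def select_engines(query: str, engines: Dict[str, Engine],
                   topics: Dict[str, Set[str]]) -> List[str]:
    """Pick engines whose topic keywords overlap the query terms.

    Assumption: each engine is described by a small keyword set.
    If no engine matches, fall back to querying all of them.
    """
    terms = set(query.lower().split())
    chosen = [name for name in engines if terms & topics.get(name, set())]
    return chosen or list(engines)


def merge_results(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists with reciprocal rank fusion (an assumed choice):
    score(d) = sum over lists of 1 / (k + rank of d in that list)."""
    scores: Dict[str, float] = defaultdict(float)
    for results in ranked_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def metasearch(query: str, engines: Dict[str, Engine],
               topics: Dict[str, Set[str]]) -> List[str]:
    """Select engines, invoke each one, and merge what they return."""
    chosen = select_engines(query, engines, topics)
    return merge_results([engines[name](query) for name in chosen])


# Toy engines returning fixed results (illustrative data only).
ENGINES: Dict[str, Engine] = {
    "news": lambda q: ["a.example", "b.example", "c.example"],
    "shop": lambda q: ["b.example", "d.example"],
    "med": lambda q: ["e.example"],
}
TOPICS: Dict[str, Set[str]] = {
    "news": {"election", "laptop"},
    "shop": {"laptop", "price"},
    "med": {"flu"},
}

merged = metasearch("laptop price", ENGINES, TOPICS)
print(merged)
```

In this toy run only the "news" and "shop" engines are selected, and a document returned by both rises to the top of the merged list. Real metasearch engines face harder versions of each step, such as engines that report incomparable scores or no scores at all.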
Biography:
Clement Yu is a professor in the Department of Computer Science at the
University of Illinois at Chicago. His areas of research are
information retrieval, database management, and applications to health
care. He served as chair of ACM SIGIR, program committee
chair of the ACM SIGIR conference, general chair of the ACM SIGMOD conference,
and as an advisory committee member of the National Science Foundation.
He has published more than 200 papers in various journals such as JACM,
TODS, TOIS, TKDE, and TSE and in various conferences such as SIGIR,
CIKM, SIGMOD, VLDB, WWW, and ICDE. He has served as an associate
editor or editorial board member of several journals such as TKDE.