IBM's WebFountain

A powerful new architecture for text analysis of unstructured information

Who

Daniel Gruhl
IBM
Walter Bender
Head, Electronic Publishing Group
MIT Media Lab

Where

MIT Media Lab, Cambridge, MA; and IBM’s Almaden Research Center, San Jose, CA

Why

IBM wanted to create a data management system powerful enough to handle complex, large-scale, unstructured information, and to deliver meaningful text analysis.

How

Built on “ZWrap,” an architecture first developed by Gruhl while he was a student at the Media Lab, and subsequently expanded by Gruhl and other IBM researchers as the foundation for WebFountain.

Details

The challenge of unstructured text analysis consumed Gruhl while he pursued his Ph.D. in electrical engineering at MIT. As a member of the Media Lab’s Electronic Publishing Group, he developed a unique architecture, called ZWrap, to help automate news gathering for newspaper editors by identifying and organizing relevant information from roughly a million pages of data. “IBM said, if you can do a million pages, how about 10 billion?” Gruhl said. One of several Media Lab students named IBM Fellows, Gruhl was recruited by the company as a student intern and subsequently hired as a research staff member. While hundreds of researchers have contributed to WebFountain, Gruhl’s architecture defined the fundamental approach.

WebFountain goes beyond today’s search engines to identify trends, patterns, and context rather than just Web links. It collects and analyzes not only structured Web content, but also blogs, bulletin boards, newsgroups, and non-Web sources such as newspapers, patent databases, and journals. It assigns searchable XML tags to all this text based on different characteristics, and those tags can combine and build on one another to deliver meaningful information. WebFountain is a promising technology for information-hungry enterprises: corporations tracking market trends or brand awareness, banks uncovering money-laundering schemes, or political analysts taking the nation’s pulse. To power it, IBM’s Almaden Research Center has built a half-football-field-sized complex of Intel-based Linux servers and blades that can run 9,000 mining programs simultaneously and store 4 billion pages of content. WebFountain technology contributes to IBM’s enterprise search offering, WebSphere Information Integrator OmniFind Edition, and is a core tool for IBM consultants in the On Demand Innovative Services group, who translate leading-edge technologies into solutions for clients.
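
The idea of tags that "combine and build" can be illustrated with a toy sketch. The Python below is not WebFountain or ZWrap code; the brand list, the word lists, the XML layout, and every function name are assumptions made up for illustration. It shows one annotator adding brand tags to a document, a second annotator that runs only on documents the first has tagged, and a lookup that searches by tag instead of by raw text.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical vocabularies, purely for illustration.
BRANDS = {"Acme", "Globex"}
POSITIVE = {"love", "great", "reliable"}
NEGATIVE = {"hate", "broken", "slow"}


def new_document(doc_id: str, text: str) -> ET.Element:
    """Wrap raw text in an XML element that annotators can add tags to."""
    doc = ET.Element("document", id=doc_id)
    ET.SubElement(doc, "text").text = text
    return doc


def tag_brands(doc: ET.Element) -> None:
    """First-pass annotator: tag each known brand mentioned in the text."""
    words = set(re.findall(r"[A-Za-z]+", doc.findtext("text", "")))
    for brand in sorted(BRANDS & words):
        ET.SubElement(doc, "tag", type="brand", value=brand)


def tag_sentiment(doc: ET.Element) -> None:
    """Second-pass annotator: builds on brand tags and skips untagged documents."""
    if doc.find("tag[@type='brand']") is None:
        return
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", doc.findtext("text", ""))]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    ET.SubElement(doc, "tag", type="sentiment", value=label)


def find_by_tag(docs, tag_type: str, value: str) -> list:
    """Return the ids of documents carrying a tag with the given type and value."""
    query = f"tag[@type='{tag_type}'][@value='{value}']"
    return [d.get("id") for d in docs if d.find(query) is not None]


docs = [
    new_document("d1", "I love my Acme phone, it is reliable."),
    new_document("d2", "The weather is great today."),
    new_document("d3", "Globex support is slow and the app is broken."),
]
for d in docs:
    tag_brands(d)
    tag_sentiment(d)

print(find_by_tag(docs, "sentiment", "negative"))
```

In this sketch the sentiment annotator depends on the brand annotator's output, so a document that mentions no brand never receives a sentiment tag. A production system would use far richer analysis than word lists; the point here is only the layering of annotations and the tag-based query.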

Image: Regents of the University of California