welcome to bangalore

The goal of the Search Sciences group at Yahoo! Labs Bangalore is to make it easy for users to find the information that they are looking for. To this end, we are developing several technologies that will improve the user search experience through better ranking and presentation of search results. The primary areas of focus for the group include information extraction at Web scale, image search relevance, and Web page classification exploiting structure and relationships among pages.


Information extraction from Web pages is critical for improving the quality and presentation of search results, and for integrating information from diverse sites to enable applications such as comparison shopping. The Vertex system being developed at Yahoo! Labs Bangalore seeks to extract structured records from semi-structured pages belonging to a broad spectrum of Web sites. For head sites like www.amazon.com, Vertex uses wrapper induction to extract data at Web scale. Human editors annotate the attribute values to be extracted on a few sample pages from each Web site, and the annotations are then used to learn extraction rules for that site.
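The idea of learning extraction rules from a few annotated pages can be illustrated with a minimal sketch. It assumes a very simple left/right delimiter wrapper, which is only one classic form of wrapper induction and is not necessarily what Vertex uses. The function names, sample pages, and annotations below are all hypothetical.

```python
import os

def common_prefix(strings):
    # Longest string that every input starts with (character-wise).
    return os.path.commonprefix(strings)

def common_suffix(strings):
    # Longest string that every input ends with.
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def learn_rule(examples):
    """Learn a (left, right) delimiter pair from annotated pages.

    examples: list of (page_html, annotated_value) pairs, standing in
    for the editor annotations (hypothetical input format).
    """
    lefts, rights = [], []
    for page, value in examples:
        start = page.index(value)  # first occurrence of the annotation
        lefts.append(page[:start])
        rights.append(page[start + len(value):])
    left, right = common_suffix(lefts), common_prefix(rights)
    if not left or not right:
        raise ValueError("annotated pages share no usable delimiters")
    return left, right

def extract(page, rule):
    """Apply a learned rule to an unseen page; return None if it fails."""
    left, right = rule
    start = page.find(left)
    if start < 0:
        return None
    start += len(left)
    end = page.find(right, start)
    return page[start:end] if end >= 0 else None

# Hypothetical sample pages from one site, with annotated prices.
PAGE_1 = '<html><h1>Camera A</h1><span class="price">$199</span></html>'
PAGE_2 = '<html><h1>Phone B</h1><span class="price">$349</span></html>'
PAGE_3 = '<html><h1>Tablet C</h1><span class="price">$520</span></html>'

rule = learn_rule([(PAGE_1, "$199"), (PAGE_2, "$349")])
price = extract(PAGE_3, rule)  # applies the site-specific rule to a new page
```

A rule learned this way is specific to one site's template, which is why annotations are collected per site.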

For torso and tail sites, Vertex relies on unsupervised techniques to label attribute values in Web pages. These range from building attribute models based on machine learning techniques like Conditional Random Fields to leveraging dictionaries and previously extracted attribute values. Vertex also exploits site-level structural similarities among Web pages, as well as attribute uniqueness and proximity constraints within each individual page, to resolve ambiguities and improve extraction accuracy.
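A rough sketch of how dictionary matching might combine with uniqueness and proximity constraints follows. It assumes a page has already been reduced to an ordered list of text nodes; the dictionary, the price pattern, and the function names are illustrative assumptions rather than details of Vertex.

```python
import re

# Hypothetical dictionary of previously extracted brand values.
BRANDS = {"canon", "nikon", "sony"}
# Hypothetical pattern for a second attribute used as a proximity anchor.
PRICE_PATTERN = re.compile(r"^\$\d+(\.\d{2})?$")

def dictionary_candidates(nodes, dictionary):
    """Indices of text nodes whose normalized text is in the dictionary."""
    return [i for i, text in enumerate(nodes)
            if text.strip().lower() in dictionary]

def find_anchor(nodes, pattern):
    """Index of the first node matching the pattern, or None."""
    for i, text in enumerate(nodes):
        if pattern.match(text.strip()):
            return i
    return None

def resolve(candidates, anchor):
    """Uniqueness: keep exactly one candidate per page.
    Proximity: prefer the candidate closest to the anchor attribute."""
    if not candidates:
        return None
    if anchor is None:
        return candidates[0]
    return min(candidates, key=lambda i: abs(i - anchor))

# Text nodes of a hypothetical product page, in document order.
# "Canon" appears twice: once in the navigation, once in the record.
NODES = ["Home", "Canon", "Cameras", "PowerShot S95",
         "Brand:", "Canon", "$399"]

candidates = dictionary_candidates(NODES, BRANDS)
brand_index = resolve(candidates, find_anchor(NODES, PRICE_PATTERN))
```

Here the navigation link and the record field both match the dictionary, and the proximity constraint picks the occurrence next to the price.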

The key challenges in multimedia search are relevance, quality, and presentation of search results. Relevance focuses on building models to find matching images for queries. Quality involves detecting and filtering out adult, low-quality, and near-duplicate images from search results. Presentation aims to satisfy users’ information needs by presenting images based on their topical and visual similarity.

Our solutions combine techniques from machine learning, statistical pattern recognition, text/tag analytics, information retrieval, image processing, and computer vision to build high-quality, scalable search engines for images and video.

One specific system developed by the group is for adult image detection. A key aspect of our approach is the detection of body parts in images. Several other features, such as color, texture, and metadata, are also used. The results for individual images are aggregated and processed at the site level to increase detection coverage. The system has progressively reduced adult leakage in Yahoo! Image Search, and the current leakage is lower than that of competing search engines.
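One way site-level aggregation could raise coverage is sketched below. It assumes each image already has a per-image classifier score in [0, 1], and that images on a site where most content is flagged are judged against a more lenient threshold. The thresholds, URLs, and function names are hypothetical and do not describe the production system.

```python
from collections import defaultdict
from urllib.parse import urlparse

def site_of(url):
    # Treat the host name as the site (a simplifying assumption).
    return urlparse(url).netloc.lower()

def flag_images(scores, image_threshold=0.8,
                site_fraction=0.5, relaxed_threshold=0.5):
    """Return the set of image URLs to filter.

    scores: dict mapping image URL -> per-image classifier score.
    A site is 'suspicious' when at least site_fraction of its images
    exceed image_threshold; its images are then judged against the
    lower relaxed_threshold, which increases coverage.
    """
    by_site = defaultdict(list)
    for url, score in scores.items():
        by_site[site_of(url)].append(score)

    suspicious = set()
    for site, values in by_site.items():
        flagged = sum(1 for v in values if v >= image_threshold)
        if flagged / len(values) >= site_fraction:
            suspicious.add(site)

    result = set()
    for url, score in scores.items():
        suspect = site_of(url) in suspicious
        threshold = relaxed_threshold if suspect else image_threshold
        if score >= threshold:
            result.add(url)
    return result

# Hypothetical scores: the borderline image on a.example is caught only
# because of site-level evidence; the one on b.example is not.
SCORES = {
    "http://a.example/1.jpg": 0.90,
    "http://a.example/2.jpg": 0.85,
    "http://a.example/3.jpg": 0.60,
    "http://b.example/1.jpg": 0.60,
    "http://b.example/2.jpg": 0.10,
}
filtered = flag_images(SCORES)
```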

Classification of Web pages is important for improving the quality of search results and for extracting information relevant to specific classes. Consequently, identifying the discriminative features is critical. One of the problems we are working on at Yahoo! Labs Bangalore is large taxonomy classification (for example, the Yahoo! Directory).

We address this using a Support Vector Machine (SVM) for structured output. Solving the underlying optimization problem in its dual form yields an extremely fast training algorithm.
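To give a feel for why working in the dual can be fast, the sketch below applies dual coordinate descent to a plain binary linear SVM with hinge loss. This is a deliberately simpler stand-in for the structured-output formulation, whose details are not given here, and the toy data and function names are assumptions. Each step updates one dual variable in closed form and keeps the weight vector in sync, so a pass over the data costs time proportional to the number of feature values.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm_dual(X, y, C=1.0, epochs=50, seed=0):
    """Dual coordinate descent for a binary linear SVM (hinge loss).

    Minimizes 0.5 * a^T Q a - sum(a) subject to 0 <= a_i <= C,
    where Q_ij = y_i * y_j * <x_i, x_j>, while maintaining
    w = sum_i a_i * y_i * x_i.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    alpha = [0.0] * n
    w = [0.0] * d
    q_diag = [dot(x, x) for x in X]  # diagonal of Q (y_i^2 = 1)
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            if q_diag[i] == 0.0:
                continue  # an all-zero example cannot change w
            grad = y[i] * dot(w, X[i]) - 1.0  # dual gradient for a_i
            # Closed-form one-variable update, clipped to [0, C].
            new_alpha = min(max(alpha[i] - grad / q_diag[i], 0.0), C)
            delta = new_alpha - alpha[i]
            if delta != 0.0:
                for j in range(d):
                    w[j] += delta * y[i] * X[i][j]
                alpha[i] = new_alpha
    return w

def predict(w, x):
    return 1 if dot(w, x) >= 0 else -1

# Toy, linearly separable data; the trailing 1.0 is a bias feature.
X_TRAIN = [[2.0, 1.0, 1.0], [3.0, 2.0, 1.0],
           [-2.0, -1.0, 1.0], [-3.0, -2.0, 1.0]]
Y_TRAIN = [1, 1, -1, -1]

weights = train_linear_svm_dual(X_TRAIN, Y_TRAIN)
predictions = [predict(weights, x) for x in X_TRAIN]
```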

We are also exploring the use of relational information such as links, page structure similarity, and co-citations, along with conventional features, to improve classifier accuracy.
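As a simple illustration of how link information might complement conventional features, the sketch below blends each page's own classifier score with the average score of the pages it links to. This is one basic form of relational smoothing, chosen only for illustration. The blending weight, page names, and scores are hypothetical.

```python
def smooth_with_neighbors(scores, links, weight=0.5):
    """Blend each page's score with the mean score of its neighbors.

    scores: dict page -> score from a content-only classifier.
    links:  dict page -> list of linked pages (hypothetical link graph).
    weight: how much to trust the neighbors (assumed value).
    """
    smoothed = {}
    for page, own in scores.items():
        neighbor_scores = [scores[n] for n in links.get(page, [])
                           if n in scores]
        if neighbor_scores:
            average = sum(neighbor_scores) / len(neighbor_scores)
            smoothed[page] = (1.0 - weight) * own + weight * average
        else:
            smoothed[page] = own  # no relational evidence available
    return smoothed

# Page "c" looks ambiguous on its own but links to two confident pages.
PAGE_SCORES = {"a": 0.9, "b": 0.8, "c": 0.4}
PAGE_LINKS = {"c": ["a", "b"]}
smoothed_scores = smooth_with_neighbors(PAGE_SCORES, PAGE_LINKS)
```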