

500k User Queries Sampled Over 3 Months

This collection consists of ~20M web queries collected from ~500k users over three months. The data is sorted by anonymized user ID.

The data set includes {UserID, Query, QueryTime, ClickedRank, DestinationDomainUrl}.
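
As an illustration of working with these five fields, here is a minimal Python sketch that parses one record. The page does not describe the file layout, so the tab-separated format, the timestamp pattern, and the names QueryRecord and parse_record are all assumptions made for this example; adjust them to the files you actually receive.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class QueryRecord:
    """One log row; field names mirror the five fields listed above."""
    user_id: int
    query: str
    query_time: datetime
    clicked_rank: Optional[int]            # None when the row has no click
    destination_domain_url: Optional[str]  # None when the row has no click


def parse_record(line: str) -> QueryRecord:
    """Parse one line, ASSUMING tab-separated fields in the listed order."""
    fields = line.rstrip("\n").split("\t")
    # Pad short rows so a query without a click still yields five fields.
    fields += [""] * (5 - len(fields))
    user_id, query, query_time, rank, url = fields[:5]
    return QueryRecord(
        user_id=int(user_id),
        query=query,
        # ASSUMED timestamp format: "YYYY-MM-DD HH:MM:SS".
        query_time=datetime.strptime(query_time, "%Y-%m-%d %H:%M:%S"),
        clicked_rank=int(rank) if rank else None,
        destination_domain_url=url or None,
    )
```

Because the data is sorted by user ID, records parsed this way can be grouped into per-user sessions with a single sequential pass.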

The goal of this collection is to provide a real query log based on users. It could be used for personalization, query reformulation or other types of search research.

The graph below shows that not all users are equal in terms of usage.

Basic Collection Statistics

Dates:

  01 March, 2006 - 31 May, 2006

Normalized queries:

  19,076,613 queries total
  10,865,119 unique (normalized) queries
     658,086 unique user IDs

Data View

Below we rank domains by the probability of click-through and the ratio of unique queries. Pick a domain to see some of the top queries that led users to it.
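
The page does not define these two measures precisely. The following Python sketch shows one plausible reading, under stated assumptions: "probability of click-through" is taken to be a domain's share of all clicks, and "ratio of unique queries" is taken to be the number of distinct queries leading to a domain divided by its click count. The function name rank_domains and its input shape are hypothetical.

```python
from collections import defaultdict


def rank_domains(clicks):
    """Rank domains from an iterable of (query, domain) click events.

    Returns (domain, click_share, unique_query_ratio) tuples, highest
    click share first. Both measures are ASSUMED definitions, not the
    ones used to build the original table.
    """
    click_counts = defaultdict(int)
    queries_by_domain = defaultdict(set)
    total_clicks = 0
    for query, domain in clicks:
        click_counts[domain] += 1
        queries_by_domain[domain].add(query)
        total_clicks += 1

    rows = [
        (domain, n / total_clicks, len(queries_by_domain[domain]) / n)
        for domain, n in click_counts.items()
    ]
    # Sort by descending click share; break ties alphabetically by domain.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows
```

A unique-query ratio near 1 would suggest that a domain is reached through many different queries, while a low ratio would suggest a few dominant (often navigational) queries.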

If you have other views or insights from the data, add them to our U500k community.

Row  Domain  (table rows not preserved in this copy)

We have split the data into 10 randomly assigned groups of users. This facilitates experimentation on smaller sets of data, as well as consistent training/testing splits across experiments. For example, in our own experiments we have used 8 groups of users' data for training and 1 group for testing. We repeat our experiments 10 times for cross-validation with a "leave one out" approach. I suggest that people who are not interested in cross-validation train on 6 groups and test on 3, again leaving one out (e.g. train on groups 1-6, test on groups 8-10). The assignment of groups is truly random, so any similar arrangement is valid. However, if we all use the same splits, we can all compare results easily.
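
The two schemes described above can be sketched in Python as follows. The fixed split (train on groups 1-6, test on 8-10, group 7 left out) comes directly from the text. For cross-validation, the text does not say which group is left out in each of the 10 runs, so choosing the group after the test group is purely an assumption of this sketch, as are the names cross_validation_folds and FIXED_SPLIT.

```python
GROUPS = list(range(1, 11))  # the 10 randomly assigned user groups


def cross_validation_folds(groups=GROUPS):
    """Build one fold per group: 1 test group, 1 left out, the rest train."""
    folds = []
    n = len(groups)
    for i, test_group in enumerate(groups):
        # ASSUMPTION: leave out the group following the test group
        # (wrapping around); the source does not specify this choice.
        held_out = groups[(i + 1) % n]
        train = [g for g in groups if g not in (test_group, held_out)]
        folds.append(
            {"train": train, "test": [test_group], "held_out": [held_out]}
        )
    return folds


# Fixed split suggested in the text for work without cross-validation.
FIXED_SPLIT = {
    "train": [1, 2, 3, 4, 5, 6],
    "held_out": [7],
    "test": [8, 9, 10],
}
```

Splitting by user group rather than by individual query keeps all of a user's queries on one side of the train/test boundary, which matters for personalization experiments.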

Please reference the following publication when using this collection:

G. Pass, A. Chowdhury, C. Torgeson, "A Picture of Search", The First International Conference on Scalable Information Systems, Hong Kong, June, 2006.


This collection is distributed for non-commercial research use only. Any application of this collection for commercial purposes is STRICTLY PROHIBITED.

CAVEAT EMPTOR -- SEXUALLY EXPLICIT DATA! Please be aware that these queries are not filtered to remove any content. Pornography is prevalent on the Web and unfiltered search engine logs contain queries by users who are looking for pornographic material. There are queries in this collection that use SEXUALLY EXPLICIT LANGUAGE. This collection of data is intended for use by mature adults who are not easily offended by the use of pornographic search terms. If you are offended by sexually explicit language you should not read through this data. Also be aware that in some states it may be illegal to expose a minor to this data. Please understand that the data represents REAL WORLD USERS, un-edited and randomly sampled, and that AOL is not the author of this data.


Please comment on this collection, add references to works using it or suggest improvements that will help other researchers. Tell us about your experiences on this collection at U500k or post shorter comments here.

04 August 2006

05:03 by Julia Luxenburger: First, I'd like to tell you how happy you made me with making such data available to the research community. I'm a Ph.D. student working on enhancing the PageRank algorithm by means of implicit feedback from query logs. However, it is quite hard to create good data sets for the testing and evaluation of new ideas. For me one major shortcoming of the U500k collection, however, is that it cuts the URLs of clicked search results at the domain level. Would it be possible to make available clickthrough data at a finer granularity, perhaps restricted to a certain domain, e.g. Wikipedia, or .gov?
