500k User Queries Sampled Over 3 Months
This collection consists of ~20M web queries collected from ~500k users over three months. The data is sorted by anonymized user ID.
The data set includes {UserID, Query, QueryTime, ClickedRank, DestinationDomainUrl}.
The goal of this collection is to provide a real query log based on users. It could be used for personalization, query reformulation, or other types of search research.
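To make the record layout above concrete, here is a minimal sketch of reading such records in Python. It assumes the files are tab-separated text with an optional header row naming the five fields, and that ClickedRank and DestinationDomainUrl are empty when the user did not click; none of this is stated on this page, so check it against the actual files. The names `QueryRecord` and `read_records` are hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, Optional, TextIO


@dataclass
class QueryRecord:
    user_id: str
    query: str
    query_time: str                       # kept as text; timestamp format not assumed
    clicked_rank: Optional[int]           # None when no result was clicked (assumption)
    destination_domain_url: Optional[str]


def read_records(handle: TextIO) -> Iterator[QueryRecord]:
    """Yield one QueryRecord per line of an (assumed) tab-separated log."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0] == "UserID":  # skip blank lines and an assumed header
            continue
        row = row + [""] * (5 - len(row))  # pad rows that omit trailing click fields
        user_id, query, query_time, rank, url = row[:5]
        yield QueryRecord(
            user_id=user_id,
            query=query,
            query_time=query_time,
            clicked_rank=int(rank) if rank else None,
            destination_domain_url=url or None,
        )


# Made-up sample lines, for illustration only (not taken from the collection).
sample = (
    "UserID\tQuery\tQueryTime\tClickedRank\tDestinationDomainUrl\n"
    "42\texample query\t2006-03-01 00:00:00\t1\thttp://www.example.com\n"
    "42\tanother query\t2006-03-02 10:00:00\t\t\n"
)
records = list(read_records(io.StringIO(sample)))
```

Because the data is sorted by user ID, a reader like this can be combined with `itertools.groupby` on `user_id` to process one user's history at a time without loading the whole log.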
The graph below shows that not all users are equal in terms of usage.

Basic Collection Statistics
Dates:
01 March, 2006 - 31 May, 2006
Normalized queries:
19,076,613 queries total
10,865,119 unique (normalized) queries
658,086 unique user IDs
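Counts like the ones above can be recomputed from (UserID, Query) pairs. The sketch below is illustrative only: the normalization actually used for this collection is not described here, so `normalize` (lowercasing and collapsing whitespace) is a stand-in assumption, and `collection_stats` is a hypothetical helper name.

```python
from typing import Dict, Iterable, Tuple


def normalize(query: str) -> str:
    # Hypothetical normalization: lowercase and collapse runs of whitespace.
    # The collection's real normalization may differ.
    return " ".join(query.lower().split())


def collection_stats(pairs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count total queries, unique normalized queries, and unique user IDs."""
    total = 0
    unique_queries = set()
    unique_users = set()
    for user_id, query in pairs:
        total += 1
        unique_queries.add(normalize(query))
        unique_users.add(user_id)
    return {
        "queries_total": total,
        "unique_queries": len(unique_queries),
        "unique_users": len(unique_users),
    }


# Made-up pairs for illustration; two of the three queries normalize to the same string.
demo = [("1", "Example  Query"), ("1", "example query"), ("2", "other")]
stats = collection_stats(demo)
```

For the full ~20M-query log, holding every unique query in a set is memory-hungry but usually workable on a modern machine; a streaming or on-disk approach may be preferable otherwise.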
Data View
Below we rank domains by the probability of click-through and the ratio of unique queries. Pick a domain to see some of the top queries that led users to it.
If you have other views or insights from the data, add them to our U500k community.
We have split the data into 10 randomly assigned groups of users. This facilitates experimentation on smaller sets of data, as well as consistent training/testing splits across experiments. For example, in our own experiments we have used 8 groups of users' data for training and 1 group for testing, and we repeat the experiments 10 times for cross-validation with a "leave one out" approach. I suggest that those who are not interested in cross-validation train on 6 groups and test on 3, again leaving one out (e.g. train on groups 1-6, test on groups 8-10). The assignment of groups is truly random, so any similar arrangement is valid. However, if we all use the same splits, we can all compare results easily.
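The two splitting schemes described above can be sketched as follows. The groups are taken to be numbered 1-10, as the example in the text implies. How the single test group is chosen in each cross-validation round is not specified, so the rotation used here (test on the group after the left-out one) is an assumption, and `cross_validation_folds` and `FIXED_SPLIT` are hypothetical names.

```python
from typing import Dict, List

GROUPS: List[int] = list(range(1, 11))  # the ten user groups, numbered 1-10


def cross_validation_folds(groups: List[int] = GROUPS) -> List[Dict[str, object]]:
    """Build one fold per left-out group: 8 training groups, 1 test group."""
    folds = []
    n = len(groups)
    for i, left_out in enumerate(groups):
        # Assumption: test on the group that follows the left-out group, wrapping around.
        test = groups[(i + 1) % n]
        train = [g for g in groups if g not in (left_out, test)]
        folds.append({"left_out": left_out, "test": test, "train": train})
    return folds


# The suggested split without cross-validation: train on 1-6, leave out 7, test on 8-10.
FIXED_SPLIT: Dict[str, List[int]] = {
    "train": [1, 2, 3, 4, 5, 6],
    "left_out": [7],
    "test": [8, 9, 10],
}

folds = cross_validation_folds()
```

Any other rotation works equally well since the group assignment is random; what matters for comparability is that everyone reports which groups were used for training and testing.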
Please reference the following publication when using this collection:
This collection is distributed for non-commercial research use only. Any application of this collection for commercial purposes is STRICTLY PROHIBITED.
CAVEAT EMPTOR -- SEXUALLY EXPLICIT DATA! Please be aware that these queries are not filtered to remove any content. Pornography is prevalent on the Web and unfiltered search engine logs contain queries by users who are looking for pornographic material. There are queries in this collection that use SEXUALLY EXPLICIT LANGUAGE. This collection of data is intended for use by mature adults who are not easily offended by the use of pornographic search terms. If you are offended by sexually explicit language you should not read through this data. Also be aware that in some states it may be illegal to expose a minor to this data. Please understand that the data represents REAL WORLD USERS, un-edited and randomly sampled, and that AOL is not the author of this data.
- 500k User Test Collection (tar gzipped)
Please comment on this collection, add references to works using it or suggest improvements that will help other researchers. Tell us about your experiences on this collection at U500k or post shorter comments here.

