Learning to rank (software, datasets)

Datasets for ranking (LETOR datasets)

  • MSLR-WEB10k and MSLR-WEB30k. You’ll need a lot of patience to download them, since Microsoft’s server delivers them at 1 Mbit/s or even slower.
    The only difference between these two datasets is the number of queries (10000 and 30000 respectively). They contain 136 feature columns, mostly term frequencies and similar statistics (the text of the queries and documents is not available).

  • Apart from these datasets, LETOR 3.0 and LETOR 4.0 are available, published in 2008 and 2009. These datasets are smaller. From LETOR 4.0, MQ-2007 and MQ-2008 are interesting (46 features each). MQ stands for Million Query.

  • Yahoo! LETOR dataset, from the challenge organized in 2010. There are currently two versions: 1.0 (400 Mb) and 2.0 (600 Mb). The data consists of two sets.

  • There is also the Yandex imat’2009 (Internet Mathematics 2009) dataset, which is rather small (~100000 query-document pairs in test and the same in train, 245 features).
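To my knowledge, the MSLR and LETOR 4.0 files are plain text with one query-document pair per line: a relevance label, then `qid:<id>`, then `index:value` feature pairs. Below is a minimal sketch of reading such lines and grouping them by query, assuming that layout. The function names and the sample lines are made up for illustration and are not taken from any of the datasets.

```python
from collections import defaultdict


def parse_letor_line(line):
    """Parse '<label> qid:<id> <idx>:<value> ...'; return None for empty lines."""
    # Some files append a trailing '# ...' comment; drop it before splitting.
    tokens = line.split("#", 1)[0].split()
    if not tokens:
        return None
    label = int(tokens[0])
    qid = tokens[1].split(":", 1)[1]
    features = {}
    for token in tokens[2:]:
        idx, value = token.split(":", 1)
        features[int(idx)] = float(value)
    return label, qid, features


def group_by_query(lines):
    """Collect (label, features) pairs per query id, keeping file order."""
    groups = defaultdict(list)
    for line in lines:
        parsed = parse_letor_line(line)
        if parsed is not None:
            label, qid, features = parsed
            groups[qid].append((label, features))
    return dict(groups)


# Hypothetical sample with 3 features instead of 136.
sample = [
    "2 qid:1 1:3 2:0 3:2.5",
    "0 qid:1 1:1 2:0 3:0.1",
    "1 qid:2 1:0 2:4 3:1.0 # illustrative comment",
]
groups = group_by_query(sample)
```

Grouping by `qid` matters because ranking metrics and pairwise or listwise losses are computed per query, not over the whole file.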

Algorithms

There are plenty of algorithms listed on Wikipedia, along with their modifications created specifically for LETOR (with papers).

Implementations

Many algorithms have been developed, but checking most of them is a real problem, because there is no available implementation one can try. Yet new algorithms appear constantly, and their developers claim that the new algorithm gives the best results on all (or almost all) datasets.

This is, of course, hard to believe, especially given that most researchers don’t publish the code of their algorithms. Ideally, one should publish not only the code of the algorithm, but the whole code of the experiment.

However, there are some algorithms that are available (apart from regression, of course).

    1. LEMUR.RankLib project incorporates many algorithms in Java: http://sourceforge.net/projects/lemur/. It is the best option unless you need an implementation of something specific. Currently contains
      MART (=GBRT), RankNet, RankBoost, AdaRank, Coordinate Ascent, LambdaMART and ListNet
    2. LEROT: an online learning to rank framework written in Python. Its page also has a less detailed, but longer list of datasets: https://bitbucket.org/ilps/lerot#rst-header-data
    3. IPython demo on learning to rank
    4. Implementation of LambdaRank (in Python, written for a Kaggle ranking competition)
    5. xapian-letor is part of the Xapian project; this library was developed at GSoC 2014. Though I haven’t found anything on ranking in the documentation, some implementations can be found in the C++ code: https://github.com/xapian/xapian/tree/master/xapian-letor https://github.com/v-hasu/xapian/tree/master/xapian-letor

Comparison from http://www.ke.tu-darmstadt.de/events/PL-12/papers/07-busa-fekete.pdf,
though the paper itself was about comparing nDCG implementations.
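Since nDCG implementations differ in details, here is a minimal sketch of one common variant, assuming the exponential gain 2^rel - 1, a log2 position discount, and a score of 0 for queries with no relevant documents. These choices are assumptions made for illustration; they are not necessarily the ones used in the paper or in any particular library.

```python
import math


def dcg_at_k(labels, k):
    """DCG over the top-k relevance labels, given in ranked order."""
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(labels[:k]))


def ndcg_at_k(labels, k):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    if ideal == 0:
        # Convention chosen here; other implementations return 1.0 or skip the query.
        return 0.0
    return dcg_at_k(labels, k) / ideal
```

Here `labels` holds the true relevance labels of one query's documents, sorted by the model's predicted score. Differences in the gain function, the truncation level `k`, and the handling of queries without relevant documents are enough to make scores from different tools incomparable.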
