Datasets
Links
WordSimilarity-353 Test Collection
http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.html
Contains 353 English word pairs along with human-assigned similarity judgements.
Web->KB dataset
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/
Web pages partitioned into classes, with hyperlink data. The dataset has been used for text categorization and learning to extract symbolic knowledge from the World Wide Web.
UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
A repository of databases, domain theories and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.
TREC Data
http://trec.nist.gov/data.html
Text datasets used in information retrieval and learning in text domains.
Time Series Data Library
http://www-personal.buseco.monash.edu.au/~hyndman/TSDL/
A collection of over 500 time series, maintained by Rob Hyndman. Time series are organized by subject.
The StatLib Datasets Archive
http://lib.stat.cmu.edu/datasets/
A repository of datasets used in statistics and machine learning.
The RCSB Protein Data Bank (PDB)
An archive of experimentally determined 3-D structures of biological macromolecules, from the Brookhaven National Laboratory.
TechTC - Technion Repository of Text Categorization Datasets
http://techtc.cs.technion.ac.il/
Provides a large number of diverse test collections for use in text categorization research.
RISE: Repository of Information Sources used in information Extraction tasks
http://www.isi.edu/info-agents/RISE/
Repository of online information sources: test domains for information extraction and wrapper generation tools that learn extraction rules (extraction patterns).
Reuters-21578 Text Categorization Corpus
http://www.daviddlewis.com/resources/testcollections/reuters21578/
A classic benchmark for text categorization algorithms.
Penn Treebank Project
http://www.cis.upenn.edu/~treebank/
A corpus of parsed sentences. Used by many researchers for training data-driven parsing algorithms.
NIST Special Database 4
http://www.nist.gov/srd/nistsd4.htm
This NIST database of fingerprint images contains 2000 8-bit grayscale fingerprint image pairs.
National Space Science Data Center
Provides access to a wide variety of astrophysics, space physics, solar physics, lunar and planetary data from NASA space flight missions, in addition to selected other data and some models and software.
HS3D - Homo Sapiens Splice Sites Dataset
http://www.sci.unisannio.it/docenti/rampone/
HS3D is a database of Homo sapiens exon, intron, and splice regions extracted from GenBank primate sequences, Rel. 123. The data set aims to provide standardized material for training and for assessing the prediction accuracy of computational approaches to gene identification and characterization.
Face recognition dataset
http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html
A dataset of face images for face recognition algorithms.
DELVE - Data for Evaluating Learning in Valid Experiments
http://www.cs.utoronto.ca/~delve/
A standardized environment designed to evaluate the performance of methods that learn relationships based primarily on empirical data. DELVE makes it possible for users to compare their learning methods with other methods on many datasets.
Dataset generator
Datgen, formerly SCDS, is a computer program that generates data to systematically test programs that consume data. These synthetic datasets can be used to validate learning algorithms.
Bilkent University Function Approximation Repository
http://funapp.cs.bilkent.edu.tr/DataSets/
Datasets used for the experimental analysis of function approximation techniques and for training and demonstration by the machine learning and statistics communities.