Main test sets (used on this site)
English Wikipedia as html files
This set consists of all articles in the English Wikipedia marked as either featured or good articles, converted into standalone html files.
Files: | 5 410 |
Size: | 298 MB |
View, Download |
Enron files
With a total of 43 426 files and a good mix of typical enterprise file types such as .pdf, .doc, .xls and images, the Enron file set is a good resource for simulating a file server.
This data set contains files sent as email attachments by about 150 users, mostly senior management of Enron. The data was originally made public, and posted to the web, by the US Federal Energy Regulatory Commission during its Enron investigation. The data set was created by extracting all email attachments from the original EDRM set.
Files: | 43 401 |
Size: | 7.9 GB |
Top file types: | doc: 21 102, xls: 9 589, pdf: 1 919, ppt: 1 823, jpg: 1 626, gif: 555, htm: 493, exe: 231, dat: 35 |
View, Download |
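A tally like the "Top file types" row above can be reproduced on a local copy of the set. The following is only a minimal sketch, assuming the set has been downloaded and unpacked into a directory; the function names `count_extensions` and `format_top_types` and the example path are hypothetical and not part of this site's tooling.

```python
from collections import Counter
from pathlib import Path


def count_extensions(root):
    """Return a Counter mapping lowercase file extension to file count."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            ext = path.suffix.lower().lstrip(".") or "(none)"
            counts[ext] += 1
    return counts


def format_top_types(counts, limit=9):
    """Format the most common extensions, e.g. 'doc: 21 102, xls: 9 589'."""
    parts = []
    for ext, n in counts.most_common(limit):
        # Use a space as the thousands separator, as in the tables on this page.
        parts.append(f"{ext}: {n:,}".replace(",", " "))
    return ", ".join(parts)


if __name__ == "__main__":
    # Hypothetical location of the unpacked Enron file set.
    print(format_top_types(count_extensions("enron_files")))
```

Extensions are lowercased before counting, so `.DOC` and `.doc` end up in the same bucket. Whether the numbers on this page were produced that way is an assumption.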
Additional test sets
Full English Wikipedia as html files
This set consists of all articles in English Wikipedia converted into standalone html files.
Files: | 7 926 727 |
Size: | 83 GB |
View, Download |
Enron email
This data set contains 852 088 emails from about 150 users, mostly senior management of Enron. The data was originally made public, and posted to the web, by the US Federal Energy Regulatory Commission during its Enron investigation. This data set was originally released by EDRM.
Files: | 852 088 |
Size: | 59 GB |
View, Download |
File samples
I am trying to collect as many different file formats as possible to use for testing text extraction. It is currently a work in progress. If you have files to contribute, please contact me.
Files: | < 20 |
Size: | <100 MB |
View |
Lipsum
Example text in several different languages for testing language compatibility. Thanks to lorem-ipsum.info for the text.
Files: | < 20 |
Size: | <1 MB |
View |