The proposed datasets consist of a series of documents from two different collections prepared in the European project READ: the Alvermann Konzilsprotokolle and the Botany in British India collections. The former, in a good state of preservation, belongs to the University Archives Greifswald and comprises around 18k pages. This collection contains fair copies of the minutes written during the formal meetings held by the central administration between 1794 and 1797. The documents were digitized and provided by the University Library in Greifswald, and the transcripts were provided by the University Archives (Dirk Alvermann). The Botany in British India collection, on the other hand, comes from the India Office Records and is provided by the British Library. It covers the following topics: botanical gardens, botanical collecting, and useful plants (economic and medicinal). In this case, we start by providing one of its documents, entitled Hemp cultivation in India, consisting of 10 pages.
For each collection, several training set partitions will be released sequentially in order to assess the performance of each competing method under different amounts of training data (see the Evaluation section for additional details). For each partition, the set of page images and two XML files, containing the word-level and line-level transcriptions and segmentations, are given.
It is important to remark that, among all the provided word-segmented training data, only 3 pages per dataset of the initial partition were manually verified. These pages, selected and segmented with the Query-by-Example track in mind, are: b0001, b0002 and b0009 from the Botany collection; and k0007, k0008 and k0009 from the Konzilsprotokolle collection.
The remaining word-level bounding boxes (7 pages) were obtained automatically by means of forced alignment. The line-level segmentation was done manually in all cases.
In addition to the page images, we also provide the segmented word-level images. These were obtained from the bounding boxes available in the training ground-truth files and are given to make the validation process simpler.
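If you prefer to crop the word images yourself from the page images and the ground-truth bounding boxes, the operation is a straightforward rectangle crop. A minimal sketch using Pillow (the function name and paths are illustrative, not part of the released toolkit):

```python
from PIL import Image

def crop_word(page_image_path, x, y, w, h):
    """Crop one word region from a page image, given a ground-truth
    bounding box expressed as top-left corner (x, y) plus width/height."""
    page = Image.open(page_image_path)
    # Pillow crop boxes are (left, upper, right, lower) coordinates
    return page.crop((x, y, x + w, y + h))
```

The same call works for line-level boxes, since both ground-truth files use the identical x/y/w/h convention.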
Ground-truth data is specified by XML files at two different alignment levels: one at line level, suffixed _LL.xml, and another at word level, suffixed _WL.xml. Additionally, a third file, suffixed _WL_CASE_INSENSITIVE.xml, contains case-insensitive word-level alignments. All XML ground-truth files follow the format below:
<?xml version="1.0" encoding="utf-8" ?>
<wordLocations dataset="collection">
  <spot word="keyword1" image="pageImage1" x="123" y="55" w="123" h="50" />
  <spot word="keyword2" image="pageImage1" x="553" y="97" w="100" h="59" />
  <spot word="keyword3" image="pageImage2" x="94" y="1197" w="244" h="62" />
  <!-- The rest of the words in the dataset -->
</wordLocations>
The only difference between the word-level and line-level files is that, in the latter, the "word" attribute contains the transcript of the whole line, and the coordinates represent the bounding box of the entire line.
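Since both file types share the same structure, a single loader suffices for either level. A minimal sketch using Python's standard-library XML parser (the function name is illustrative):

```python
import xml.etree.ElementTree as ET

def load_spots(xml_path):
    """Parse a ground-truth XML file (word- or line-level) into a list
    of dicts, one per <spot>: transcript, page image identifier, and
    bounding box (x, y, w, h)."""
    root = ET.parse(xml_path).getroot()
    spots = []
    for spot in root.iter("spot"):
        spots.append({
            "word": spot.get("word"),
            "image": spot.get("image"),
            "x": int(spot.get("x")),
            "y": int(spot.get("y")),
            "w": int(spot.get("w")),
            "h": int(spot.get("h")),
        })
    return spots
```

For line-level files the "word" field simply holds the whole-line transcript instead of a single keyword.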
Examples of queries
A small set of queries is provided for participants to tune their systems before the official test set is released. Queries are presented as individual image files (to be used in the Query-by-Example track) and as a list of keywords (Query-by-String track), for both collections. Please note that queries are case-insensitive: you have to match every instance of a query keyword, regardless of its casing in the document images.
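In practice, case-insensitive matching means transcripts should be compared against queries after case folding. A tiny illustrative helper (the function name is hypothetical, not part of the toolkit):

```python
def matches_query(query, transcript):
    """Return True when a candidate transcript matches the query
    keyword, ignoring case (queries are case-insensitive in both
    tracks). casefold() handles caseless comparison more robustly
    than lower() for non-ASCII characters."""
    return transcript.casefold() == query.casefold()
```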
Finally, you can use the following evaluation ground-truth files to directly assess your performance with the evaluation toolkit. For each dataset, there is an XML file for each of the tracks (QbE vs. QbS) and assignments (Seg.Free vs. Seg.Based). This information can be derived from the training ground-truth files; we provide these files merely for convenience.
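The exact metrics are defined in the Evaluation section, but as an illustration only: keyword-spotting systems are commonly scored by average precision over each query's ranked retrieval list, which can be sketched as follows (this is a generic formulation, not the official toolkit implementation):

```python
def average_precision(ranked_relevance):
    """Average precision for one query, given its ranked retrieval list
    as 0/1 relevance labels (1 = the retrieved item is a true match).
    Mean average precision (MAP) is this value averaged over queries."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0
```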
Follow the links below to obtain the test data used for the evaluation.
For each dataset, we provide the page images and the segmented word images, to be used in the Seg.Free and Seg.Based assignments, respectively. The queries are given as images for the QbE track and as keyword strings for the QbS track.