The tsetse archive is a large collection of documents which have been digitised into images. I (Roger Mateer) have split it into a dedicated filesystem repository, separate from the one hosting the SACEMA website filesystem content, both to make the latter more lightweight and to prepare for the vision I outline below. For now it looks as it did before, but it is now in a better position to migrate to the proposed setup.
Alex Welte expressed the desire (thinking it to be very blue sky) that this collection of documents could be converted into a queryable database which could (amongst other things) be used as a unique dataset for studying the impact of climate change on the distribution of disease vectors.
There is a potential automated way of attempting to do this. It is admittedly not perfect and is subject to errors being introduced at several levels, but I think it would be worth attempting because of the potential value of this dataset.
Because the automated attempts to extract the desired information may introduce errors, it seems sensible to maintain a database of the various types of metadata produced by each step of the outlined process. The process can then be tweaked and retried whenever we think of a way to improve some part of it, without undue expenditure of human time on manual error correction in postprocessing. Maintaining such a database also ensures that no information (either from the original source or produced during some automated data extraction step) is ever lost.
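As a rough illustration of the kind of metadata database described above, here is a minimal sketch using Python's built-in sqlite3 module. Everything in it is an assumption made for illustration only: the table names, the column names, the step name "text_extraction", the example file path and the helper functions are all hypothetical and are not part of any existing setup. The point it tries to show is that each retry of a step is stored as a new versioned row next to the earlier ones, so nothing is overwritten.

```python
import sqlite3

# Hypothetical schema: one row per source image, and one row per
# (image, step, version) output, so that retries never overwrite earlier results.
SCHEMA = """
CREATE TABLE IF NOT EXISTS source_image (
    id   INTEGER PRIMARY KEY,
    path TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS step_output (
    id           INTEGER PRIMARY KEY,
    image_id     INTEGER NOT NULL REFERENCES source_image(id),
    step         TEXT NOT NULL,
    step_version INTEGER NOT NULL,
    payload      TEXT NOT NULL,
    created_at   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (image_id, step, step_version)
);
"""


def open_db(path=":memory:"):
    """Open (or create) the metadata database and ensure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_image(conn, path):
    """Register a source image (if not already known) and return its id."""
    conn.execute("INSERT OR IGNORE INTO source_image (path) VALUES (?)", (path,))
    row = conn.execute(
        "SELECT id FROM source_image WHERE path = ?", (path,)
    ).fetchone()
    return row[0]


def record_output(conn, image_id, step, step_version, payload):
    """Store the output of one run of one step; earlier versions are kept."""
    conn.execute(
        "INSERT INTO step_output (image_id, step, step_version, payload) "
        "VALUES (?, ?, ?, ?)",
        (image_id, step, step_version, payload),
    )


def latest_output(conn, image_id, step):
    """Return (step_version, payload) for the newest run of a step, or None."""
    return conn.execute(
        "SELECT step_version, payload FROM step_output "
        "WHERE image_id = ? AND step = ? "
        "ORDER BY step_version DESC LIMIT 1",
        (image_id, step),
    ).fetchone()


def history(conn, image_id, step):
    """Return every stored (step_version, payload) for a step, oldest first."""
    return conn.execute(
        "SELECT step_version, payload FROM step_output "
        "WHERE image_id = ? AND step = ? ORDER BY step_version",
        (image_id, step),
    ).fetchall()


# Example usage with made-up values.
conn = open_db()
image_id = add_image(conn, "scans/example_page.png")  # hypothetical path
record_output(conn, image_id, "text_extraction", 1, "first attempt")
record_output(conn, image_id, "text_extraction", 2, "improved attempt")
```

With a layout like this, improving one step only means recording a new version of its output; the original image reference and all earlier attempts stay available for comparison.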
The essential idea is this:
ADDENDUM: We would like several examples of requests for specific, targeted collections of data that would be useful to some end data consumer for some specific purpose. We can use these examples to test how well the data extraction method can be made to work and, if it succeeds, to see how useful its responses to such requests actually are.
Please send me comments or suggestions about this proposal.