Artificial Intelligence (AI) and Machine Learning (ML) are being put at the service of historians, through algorithms that locate documents of great interest to the history of Spain. Behind this effort is the Carabela Project, developed over the last two years by researchers from the Polytechnic University of Valencia (UPV) and the Center for Underwater Archeology of the Andalusian Institute of Historical Heritage.
Within this framework they have developed and applied new AI/ML techniques that give access to the contents of more than 130,000 images from the General Archive of the Indies and the Provincial Historical Archive of Cádiz. The project has received support from the BBVA Foundation's grants program for Scientific Research Teams in the area of Digital Humanities.
“With these techniques we can search any document image as quickly as a web search engine, identifying specific words, combinations of words, phrases and so on. All this is thanks to statistical models that we have trained from examples and that are now great allies for the study of these collections on the history of Spain. The same methods can also be applied to many other historical documents,” says Enrique Vidal, researcher at the UPV's Pattern Recognition and Human Language Technologies (PRHLT) center.
General Archive of the Indies
The holdings of the General Archive of the Indies are of exceptional interest for the study of the history of Spain in America (from the southern United States to Tierra del Fuego) and the Philippines from the 15th to the 19th centuries.
These are manuscripts related to Spanish naval travel and trade. They cannot be analyzed with traditional OCR transcription techniques, which are designed for printed text, nor with existing techniques for handwritten materials, whose results on these historical texts are too imprecise.
"Carabela has allowed us to go further, with machine learning techniques that make it possible to index images of handwritten text in large collections of historical documents whose state of conservation and convoluted writing styles make them almost impossible for humans to read," says Joan Andreu Sánchez, also a researcher at the PRHLT-UPV.
These techniques can identify and distinguish the different scripts used in each of the periods from which the documents date, and can even analyze images of very low quality.
The key is in the capacity of its algorithms to obtain models that are 'learned' automatically from examples.
“Such models require a relatively small amount of training data to obtain very satisfactory results. These methods allow us to respond well to the challenges posed by the documents themselves, such as differences in spelling, smudges, or image quality,” adds Vidal.
In this case, the models were trained on about 500 pages from the General Archive of the Indies, which were selected and transcribed by Carlos Alonso and his team of specialists from the Center for Underwater Archeology.
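The kind of search the researchers describe, looking up words and word combinations across indexed page images, can be illustrated with a toy probabilistic index. The sketch below is not the Carabela implementation: the entries, the confidence values, the function names (`build_index`, `search`) and the 0.5 threshold are all invented for illustration. It only assumes that a trained model has already assigned each candidate word on a page a confidence between 0 and 1.

```python
from collections import defaultdict

# Hypothetical index entries: (page_id, word, confidence in [0, 1]).
# These values are invented for illustration, not taken from Carabela data.
ENTRIES = [
    ("page_001", "tierra", 0.92),
    ("page_001", "austral", 0.71),
    ("page_001", "navio", 0.15),
    ("page_002", "tierra", 0.40),
    ("page_002", "galeon", 0.88),
    ("page_003", "austral", 0.05),
]


def build_index(entries):
    """Map each word to {page_id: highest confidence seen on that page}."""
    index = defaultdict(dict)
    for page_id, word, prob in entries:
        key = word.lower()
        index[key][page_id] = max(prob, index[key].get(page_id, 0.0))
    return index


def search(index, words, threshold=0.5):
    """Return (page_id, score) pairs where every query word is likely present.

    A page's score is the product of the per-word confidences, which treats
    the words as independent (a simplifying assumption of this sketch).
    """
    if not words:
        return []
    postings = [index.get(w.lower(), {}) for w in words]
    # Only pages that contain a hypothesis for every query word can match.
    pages = set(postings[0])
    for posting in postings[1:]:
        pages &= set(posting)
    results = []
    for page in pages:
        if all(posting[page] >= threshold for posting in postings):
            score = 1.0
            for posting in postings:
                score *= posting[page]
            results.append((page, score))
    # Most confident pages first.
    return sorted(results, key=lambda item: item[1], reverse=True)


index = build_index(ENTRIES)
hits = search(index, ["tierra", "austral"])
```

With these made-up entries, only `page_001` matches both words. Lowering `threshold` returns more pages at the cost of more false matches, which is the usual trade-off when searching text that has been recognized with uncertainty rather than transcribed exactly.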
Wrecks and Australia
Carabela has brought to light manuscript information about wrecks that constitute an archaeological heritage of the first magnitude, owing to the great historical and cultural richness of their contents. "Carabela thus also helps to prevent the plundering of submerged heritage," explains Joan Andreu Sánchez.
But one of the most surprising finds in these holdings undoubtedly came when a search for terms related to Australia, such as 'Tierra Austral Incognita', turned up a letter from the early 18th century addressed to King Philip V.
“In this letter, written by the Jesuit Andrés Serrano, we have discovered very precise references to the southern continent dating back to 1705, long before Captain James Cook reached its shores in 1770. These are little-known data on the history of Australia that we are now uncovering by applying the indexing and probabilistic search techniques developed at our center,” explains Enrique Vidal.
READ, the Golden Age and Transkribus
In this same line of work, the PRHLT team has participated in the European READ project, which has studied and analyzed documents from the Golden Age of Spanish literature, among them Lope de Vega manuscripts from the collection of the National Library, and correspondence from the Brothers Grimm from the Marburg State Archives.
The team has also worked with documents from the National Archives of Finland, of which about 150,000 pages have been indexed; future projects aim to index around 1 million pages.
Within the framework of the READ project, Transkribus has also been developed, a software platform for annotating images of old documents of great historiographic value.
Transkribus is used primarily as a tool for generating training data, since handwritten text recognition techniques need data to learn from automatically. In the near future it will incorporate other features, such as automatic model training for other languages.
READ has also concluded with the creation of a European cooperative of which the UPV is a founding member and which makes the Transkribus software available to all registered users.
Currently, the Transkribus platform has more than 30,000 users from around the world, which makes it an international reference tool for historians.