The Marie Skłodowska-Curie Post-Doctoral call encourages researchers to explore novel research paths, investigate new topics, and develop original methodologies. However, entering new fields brings unexpected challenges, which make risk management and problem-solving skills indispensable assets in any scientific career. Elena Fernandez, post-doc fellow at the MSCA COFUND project Eurotech, talks us through the challenges she has been facing during the implementation of her project PRESSTECH. In particular, she discusses some problems related to optical character recognition.
Opening new research lines: Data science across domains
Eurotech is a post-doctoral programme cofunded by the European Commission under the Horizon 2020 programme (Grant Agreement number 754462). Similar to the Marie Skłodowska-Curie Individual Fellowships, Eurotech post-doctoral researchers have the opportunity of designing a bottom-up project, where they often open novel and unexplored research lines while acquiring new and highly valuable skills. Attractive as this opportunity may seem to be, it does come with many challenges along the way.
Holding a background in the humanities (PhD in Hispanic languages and literature, UC Berkeley 2019), I am a newcomer to one of the most promising emerging academic and professional fields of the twenty-first century: data science. The increasing availability of big data across domains is unlocking new research opportunities in disciplines as diverse as the humanities and social sciences, environmental studies, finance, and biomedical sciences, to name just a few. However, to execute big data research projects successfully, several matters must be considered.
Following the ideas of Gaston Sanchez (2020), the data analysis circle consists of several well-differentiated stages: data collection (acquisition), data cleaning, data tidying, exploratory data analysis, confirmatory data analysis, data visualization, model building, simulations, and communication. This article will center its attention on the first and often most difficult part in the data analysis circle: data acquisition. The first thing that most researchers in data science across domains need to address when designing a new research project is to locate the data that they intend to use. However, once this data has been located, there is one unexpected challenge that newcomers to the field may not be familiar with: Optical Character Recognition (OCR).
What is OCR, you may wonder? OCR is a computational procedure that transforms images of text characters into machine-encoded text (see Chaudhuri et al. 2017). Let’s provide an example: when you scan a page in a book, what you get is a picture of that page. Without OCR processing, you will not be able to access the text on that page for text-extraction operations (for example, a simple copy-paste). How do you know if a scanned document has gone through an OCR process? Very simple: just left-click on the scanned document of your choice, and if you can select the words individually, then, success! Your document has undergone OCR processing. However, even if your selected corpus of data has been through OCR and is, theoretically speaking, ready for quantitative research purposes, you may encounter yet another unexpected problem: low-quality OCR.
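To get a first impression of whether a digitized text suffers from noisy OCR, a rough heuristic can help. The following is only a minimal sketch, not the method used in PRESSTECH: the function name `suspicious_token_ratio` and the rule for what counts as a "clean" token are my own assumptions for illustration. It flags tokens that mix letters with digits or stray symbols, which is a typical footprint of OCR noise.

```python
import re

# Assumption for illustration: a token is "clean" if, once surrounding
# punctuation is stripped, it is either a plain number or a run of letters
# (optionally joined by hyphens or apostrophes).
CLEAN_TOKEN = re.compile(r"^(?:\d+|[^\W\d_]+(?:[-'’][^\W\d_]+)*)$")
SURROUNDING_PUNCTUATION = ".,;:!?()[]\"'“”‘’"


def suspicious_token_ratio(text: str) -> float:
    """Return the share of whitespace-separated tokens that look like OCR noise."""
    tokens = text.split()
    if not tokens:
        return 0.0
    noisy = 0
    for token in tokens:
        core = token.strip(SURROUNDING_PUNCTUATION)
        # Empty cores (pure punctuation) and mixed letter/digit/symbol
        # tokens are counted as suspicious.
        if not core or not CLEAN_TOKEN.match(core):
            noisy += 1
    return noisy / len(tokens)
```

For a sentence such as "Tbe qu1ck br0wn f*x jumps ov3r tlie la~y dog." the function flags five of the nine tokens. Note that it misses "Tbe" and "tlie", which are made of letters only: a heuristic like this underestimates the noise, so a low score is not a guarantee of clean text.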
In the following pages, I will discuss several risk management and contingency plans for researchers who may find themselves in the middle of their post-doctoral project encountering unexpected noisy OCR issues. Even though science-related challenges are quite common in the life of any post-doctoral researcher, risk management procedures in digital humanities projects have not received as much critical attention as they should. By focusing on solutions rather than on problems, this article will provide useful resources for researchers who may find themselves in this situation across different scientific disciplines.
My research project is entitled ‘Time, Technology and the Press. A Study of Accelerations of Time Perceptions during the Industrial and Digital Revolutions’ (PRESSTECH). It aims to analyze information behavior historically by measuring information compactness using different computational methodologies such as quantitative narrative analysis and network analysis.1 Using newspapers as an object of study (The New York Times, Le Figaro, Boletín Oficial del Estado), over a time-scope of thirty years (1988-2018), PRESSTECH has so far successfully produced research results without major technical inconveniences.
Newspapers as an object of study are of high value for researchers in several disciplines, such as digital humanities, computational social science, digital journalism, digital history, media and communication studies, and sociology, to name just a few. Their relative temporal and geographical stability (some of them have been published continuously since the seventeenth century), as well as their recent digitization, provide an ideal ground for theoretical and quantitative research.
There are indeed several open-access digitized collections around the world that host a variety of multilingual historic newspapers. To begin with, Chronicling America, hosted by The Library of Congress, contains an archive of United States newspapers ranging from 1777 to 1963 in a variety of languages.2 In addition, under the umbrella of the Europeana Project, a European Union-funded project for the digitization of cultural heritage, several historical newspapers from Germany, Austria, Latvia, Finland, Serbia, Poland, and Luxembourg are available. Moreover, national libraries around the world have recently made their newspaper collections available following open access policies. Some examples include Portugal,3 Spain,4 France,5 and Australia.6 Although most of these collections are freely available to the public, the way to access them varies. Sometimes it is possible to do bulk downloads of data, sometimes it is necessary to use their respective APIs, and sometimes it is necessary to manually download newspapers one by one (or even page by page).
Newspapers’ availability in their digitized form is, therefore, a promising vehicle for quantitative analysis across domains. Encouraged by the positive development of PRESSTECH with contemporary newspapers that have no OCR problems, and having located several collections of historical newspapers, I was ready to continue with my project. But suddenly, I found myself at a very unexpected research crossroads.
Using quantitative narrative analysis, PRESSTECH has as a baseline the extraction of subject-verb-object (SVO) triplets. This kind of methodology requires exceptionally clean digitized text. Ground-truth OCR can be defined as digitized text whose OCR quality is almost perfect. Even though the average digitized historical newspaper may be of good enough quality for some text data mining operations, it may be insufficient for other research methodologies.
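One common way to put a number on how far an OCR output is from its ground truth is the character error rate (CER): the edit distance between the two texts divided by the length of the ground truth. The sketch below is only an illustration under my own assumptions (the function names are hypothetical, and it is not part of the PRESSTECH pipeline). It presupposes that a manually verified transcription of the same page is available.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    """Edit distance between OCR output and ground truth, normalised by
    the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)
```

For instance, comparing the OCR output "Tbe press" with the ground truth "The press" gives one substitution over nine characters. What counts as an acceptable CER depends on the methodology: approaches that rely on exact word forms, such as SVO extraction, tolerate far less noise than approaches that only look at general trends.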
Having finished the more contemporary data analysis of PRESSTECH, I planned to apply quantitative narrative analysis to historic newspapers to inspect fluctuations of information density over time, aiming to analyze the social impact of technology from a historic perspective. However, what I unexpectedly encountered was low-quality OCR in my selected corpus. What can researchers do who, like me, suddenly find themselves in that kind of situation, right in the middle of their research project?
Keep calm and carry on. There are several risk management and contingency plans that can be followed if you find yourself in this situation. I will focus my attention on four of them: finding an alternative corpus, curating the one that you have, changing your research methodology, or publishing preliminary results without including any data analysis.
Research infrastructure, as a cornerstone of digital humanities, computational social science, and information science as fields of research, is, fortunately, a well-established reality. Indeed, it relies on highly competent scientific personnel all around the world who will be ready and happy to help you. Specialized librarians are indispensable for any digital humanities project and can be found in several kinds of organizations, including cultural heritage digitization projects, national libraries, research groups, and digital humanities centers. Thanks to their invaluable help, I was able to locate several datasets containing ground-truth OCR. For example, the KB Lab hosts a collection of individual newspaper pages that have undergone a ground-truth OCR process.7
Moreover, the state of the art in OCR post-correction for historical newspapers is rapidly evolving (for an excellent overview of existing software and approaches, check out Ströbel et al. 2020). If you are located in a research center that hosts researchers with specific knowledge of OCR post-correction tools, it will be possible to curate your dataset. However, bear in mind that this will be a very slow and time-consuming process.
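To give a flavor of what post-correction involves, here is a deliberately simple sketch based on a word list. It is not one of the tools surveyed by Ströbel et al. (2020), and real post-correction systems are far more sophisticated. The function names, the tiny lexicon, and the similarity cutoff of 0.75 are all assumptions made for illustration. It uses `difflib.get_close_matches` from the Python standard library to replace an unknown token with the most similar word in the lexicon.

```python
import difflib


def correct_token(token: str, lexicon: list[str], cutoff: float = 0.75) -> str:
    """Return the token unchanged if it is in the lexicon; otherwise return
    the closest lexicon entry above the similarity cutoff, if there is one."""
    lowered = token.lower()
    if lowered in lexicon:
        return token
    matches = difflib.get_close_matches(lowered, lexicon, n=1, cutoff=cutoff)
    # Note: a corrected token comes back in the lexicon's (lowercase) form.
    return matches[0] if matches else token


def correct_text(text: str, lexicon: list[str], cutoff: float = 0.75) -> str:
    """Apply correct_token to every whitespace-separated token."""
    return " ".join(correct_token(tok, lexicon, cutoff) for tok in text.split())


# Hypothetical toy lexicon; a real one would contain a full historical
# vocabulary for the language and period of the corpus.
TOY_LEXICON = ["newspaper", "press", "technology", "time"]
```

With this toy lexicon, `correct_text("the newspapcr and techno1ogy", TOY_LEXICON)` leaves "the" and "and" alone and repairs the two damaged words. The sketch also shows why curation is slow: every choice (the lexicon, the cutoff, how to handle punctuation and capitalization) has to be tuned and checked against your own corpus, and an overly permissive cutoff will "correct" words that were right in the first place.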
A third possibility could be to change your research methodology. There are several cutting-edge research projects both in digital humanities and computational social science that use noisy OCR digitized corpora with a variety of computational methodologies, as their main goal is to observe general trends in information behavior using big data across domains.
Finally, publishing preliminary ideas in journals such as this one (the MCAA Newsletter has an ISSN and therefore counts as a scientific publication) could be a very good solution for this kind of scenario. The practice of publishing position papers or preliminary papers is well-established in some fields, such as computer science, but not so common in others. However, with the growing presence of data science across domains, some journals in digital humanities are starting to accept articles without data analysis but with well-elaborated research ideas.
Any of those four contingency plans will have as a side effect some inevitable modifications in your original research question. However, as a postdoctoral researcher, you should consider this challenge as highly valuable scientific training that will strengthen your research skills and help you to gain scientific maturity and research independence.
Being a postdoctoral researcher grants highly valuable opportunities for professional development, knowledge discovery, and scientific exploration of new fields. Inevitably, it also comes with many challenges. Nevertheless, one way or another, there are solutions. Effective risk management skills and the elaboration of creative contingency plans are crucial assets for developing scientific careers across domains. Consequently, research challenges should be addressed as highly valuable learning opportunities and as training to improve your problem-solving skills.
Acknowledgements: I want to express my gratitude to all the librarians around the world who have helped me from the beginning of PRESSTECH with highly important (and lifesaving!) information. Thanks very much to: Lotte Wilms from the KB Lab, Jean-Philippe Moreux and Emmanuelle Bermès from BNF, Amber Paranik and Nathan Yarasavage from The Library of Congress, Reinhard Niederer from TUM, and Clemens Neudecker from Europeana. I also want to say thanks very much to all the librarians and scientific personnel who have kindly taken the time to reply to my many emails from: Trove-National Library of Australia, Biblioteca Nacional de España (Hemeroteca Digital), Bavarian State Library, Servicio de atención al ciudadano-BOE, ProQuest, and LexisNexis. The EuroTech Postdoc Programme sponsors this article, and it is co-funded by the European Commission under its framework programme Horizon 2020, Grant Agreement number 754462.
1. Information about the scientific aspects of the project can be found at the following link.
2. Such as Armenian, Russian, Chinese, Italian and Dutch, as well as more than twenty other languages.
3. The Diario de Noticias de Madeira is digitally available from 1876 to 2000: https://abm.madeira.gov.pt/pt/13159-2/
4. The Boletín Oficial del Estado-La Gaceta de Madrid is available from 1661 up to present times: https://www.boe.es/buscar/ayudas/gazeta_ayuda.php.
5. The Bibliothèque nationale de France (BNF) has a collection of more than twenty historical newspapers under the Gallica section: https://api.bnf.fr/fr/texte-des-documents-de-presse-du-projet-europeana-newspapers-xixexxe-siecles.
6. The ‘Trove’ database is a collaboration between the National Library of Australia and hundreds of Partner organisations around Australia, and offers a vast collection of digitized newspapers: https://trove.nla.gov.au
7. You can have a look at it here: https://lab.kb.nl/dataset/historical-newspapers-ocr-ground-truth
Technical University of Munich, Germany
Sanchez, G. (2020). Introduction to Computing With Data. https://www.gastonsanchez.com/intro2cwd/
Chaudhuri, A., Mandaviya, K., Badelia, P., & Ghosh, S. K. (2017). Optical Character Recognition Systems for Different Languages with Soft Computing. Cham: Springer.
Ströbel, P. B., Clematide, S., & Volk, M. (2020). How Much Data Do You Need? About the Creation of a Ground Truth for Black Letter and the Effectiveness of Neural OCR. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, 1 May 2020 - 2 May 2020 (pp. 3551-3559). Marseille: The European Language Resources Association.