This project focuses on data from the Amsterdam Address book of 1907. Names of persons and companies, together with their professions and addresses, were extracted using regular expressions. The addresses were aligned with Adamlink streets and HisGis location points. During alignment, streets that were not yet present in the Adamlink street gazetteer (e.g. Galerij, Plantage Lijnbaansgracht, Spaarndammerplein) were added to it. Professions were aligned with the HISCO database, which made it possible to plot ‘sculptors’ and ‘painters’ using English labels, and also to plot super categories like ‘creative artists’.
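As an illustration of what regular-expression extraction of this kind can look like, here is a minimal Python sketch. It is not the project's actual code: the entry layout ("name, profession, street house number"), the pattern ENTRY_RE, the function parse_entry and the sample lines are all assumptions made for the example, and real Address book entries are likely to be more varied.

```python
import re

# Hypothetical entry layout, assumed for illustration only:
#   "<name>, <profession>, <street> <house number>"
ENTRY_RE = re.compile(
    r"^(?P<name>[^,]+),\s*"                 # everything up to the first comma
    r"(?P<profession>[^,]+),\s*"            # everything up to the second comma
    r"(?P<street>.+?)\s+"                   # street name, may contain spaces
    r"(?P<house_number>\d+\s?[a-zA-Z]?)$"   # number with optional letter suffix
)


def parse_entry(line):
    """Return a dict of fields for one entry line, or None if it does not match."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}


# Invented sample lines, not taken from the 1907 Address book.
samples = [
    "Jansen (J.), schilder, Prinsengracht 33",
    "Pieters (A.), beeldhouwer, Plantage Lijnbaansgracht 12a",
    "garbled l1ne",
]
for sample in samples:
    print(parse_entry(sample))
```

Lines that do not fit the assumed layout return None, so they can be set aside for later inspection instead of being silently dropped.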
As part of the project, an interface was created that visualizes the current state of the data on a map of the city. The interface allows users to download the data for 1907, filter them in various ways (by profession, by street name, or by HISCO category), and export the filtered results. Clicking on a dot on the map brings up a legend that indicates, through a color coding system, which part of the Address book the entry was found in. Clicking on the entry itself opens the scan of the original document at the City Archives site. The interface is now accessible via https://addressbooks.amsterdamtimemachine.nl/
A major component of the project is the cleaning and alignment of the extracted data, which is still ongoing. The OCR text has already been cleaned with algorithms that compare multiple entries for the same person or company. Such comparisons are possible because each Address book lists persons by address, by profession, and by name at the same time. Furthermore, entries from the 1905, 1909 and 1910 Address books were extracted as well, which broadens the possibilities for comparing entries. As an example, if 7 entries were found with the same name, profession, and street name, and in 6 of those the house number was 33, the single instance in which the extracted house number was 83 could safely be assumed to be an OCR error. Errors in names, professions, and street names were cleaned in the same manner.
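The house-number example can be sketched as a simple majority vote. The following Python snippet is an illustration only, not the project's own algorithm: the function correct_house_numbers, the field names, and the thresholds min_count and min_share are assumptions chosen for the example, and the sample person is invented.

```python
from collections import Counter, defaultdict


def correct_house_numbers(entries, min_count=3, min_share=0.8):
    """Replace outlier house numbers with the majority value per group.

    Entries are grouped by (name, profession, street). A group is only
    corrected when it has at least `min_count` entries and the most common
    house number accounts for at least `min_share` of them. Both thresholds
    are assumptions for this sketch.
    """
    groups = defaultdict(list)
    for entry in entries:
        key = (entry["name"], entry["profession"], entry["street"])
        groups[key].append(entry)

    corrected = []
    for group in groups.values():
        counts = Counter(entry["house_number"] for entry in group)
        winner, freq = counts.most_common(1)[0]
        trusted = len(group) >= min_count and freq / len(group) >= min_share
        for entry in group:
            fixed = dict(entry)  # copy, so the input is left untouched
            if trusted and entry["house_number"] != winner:
                fixed["corrected_from"] = entry["house_number"]
                fixed["house_number"] = winner
            corrected.append(fixed)
    return corrected


# Invented sample: 7 entries for the same person, 6 with "33" and 1 with "83".
base = {"name": "Jansen (J.)", "profession": "schilder", "street": "Prinsengracht"}
sample = [dict(base, house_number="33") for _ in range(6)]
sample.append(dict(base, house_number="83"))

result = correct_house_numbers(sample)
```

Keeping the original value in a separate corrected_from field means each automatic correction can still be reviewed against the scan. The same grouping idea carries over to names, professions, and street names by voting on a different field, although for free-text fields a similarity measure would probably be needed instead of exact equality.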
Further options for cleaning the data include comparing entries with other datasets, such as the Bevolkingsregisters from the Amsterdam City Archives. Via the interface, it is possible to access a table listing entries that could not yet be pinpointed on the map, mostly because of issues with the OCR text. Since the table includes URLs of the original scans of the pages from which the data were extracted, the interface can be used to correct the data and add the locations manually.
More information on the data, as well as the data themselves, can be found on the CREATE GitLab.
The project to clean the Amsterdam Address book 1907 data was initiated by CREATE, building on findings from a paper on the automatic extraction of data to map the Amsterdam nightlife, and carried out by Menno den Engelse (Islands of Meaning).