Automated Coding of Historical Danish Cause of Death Data Using String Similarity

Research output: Chapter in Book/Report/Conference proceeding; Article in proceedings; Research; peer-review

Standard

Automated Coding of Historical Danish Cause of Death Data Using String Similarity. / Perner, Louise Villefrance; Perner, Mads Linnet; Pedersen, Bjørn-Richard; Cañadas, Rafael Nozal; Sildnes, Anders; Shvetsov, Nikita; Andersen, Trygve; Bongo, Lars Ailo; Sommerseth, Hilde Leikny.

Digital Humanities in the Nordic and Baltic Countries Publications. ed. / Annika Rockenberger; Sofie Gilbert; Juliane Tiemann. 2023. p. 203-221 (Digital Humanities in the Nordic and Baltic Countries Publications; No. 1, Vol. 5).

Harvard

Perner, LV, Perner, ML, Pedersen, B-R, Cañadas, RN, Sildnes, A, Shvetsov, N, Andersen, T, Bongo, LA & Sommerseth, HL 2023, Automated Coding of Historical Danish Cause of Death Data Using String Similarity. in A Rockenberger, S Gilbert & J Tiemann (eds), Digital Humanities in the Nordic and Baltic Countries Publications. Digital Humanities in the Nordic and Baltic Countries Publications, no. 1, vol. 5, pp. 203-221. https://doi.org/10.5617/dhnbpub.10662

APA

Perner, L. V., Perner, M. L., Pedersen, B.-R., Cañadas, R. N., Sildnes, A., Shvetsov, N., Andersen, T., Bongo, L. A., & Sommerseth, H. L. (2023). Automated Coding of Historical Danish Cause of Death Data Using String Similarity. In A. Rockenberger, S. Gilbert, & J. Tiemann (Eds.), Digital Humanities in the Nordic and Baltic Countries Publications (pp. 203-221). Digital Humanities in the Nordic and Baltic Countries Publications Vol. 5 No. 1. https://doi.org/10.5617/dhnbpub.10662

Vancouver

Perner LV, Perner ML, Pedersen B-R, Cañadas RN, Sildnes A, Shvetsov N et al. Automated Coding of Historical Danish Cause of Death Data Using String Similarity. In Rockenberger A, Gilbert S, Tiemann J, editors, Digital Humanities in the Nordic and Baltic Countries Publications. 2023. p. 203-221. (Digital Humanities in the Nordic and Baltic Countries Publications; No. 1, Vol. 5). https://doi.org/10.5617/dhnbpub.10662

Author

Perner, Louise Villefrance ; Perner, Mads Linnet ; Pedersen, Bjørn-Richard ; Cañadas, Rafael Nozal ; Sildnes, Anders ; Shvetsov, Nikita ; Andersen, Trygve ; Bongo, Lars Ailo ; Sommerseth, Hilde Leikny. / Automated Coding of Historical Danish Cause of Death Data Using String Similarity. Digital Humanities in the Nordic and Baltic Countries Publications. editor / Annika Rockenberger ; Sofie Gilbert ; Juliane Tiemann. 2023. pp. 203-221 (Digital Humanities in the Nordic and Baltic Countries Publications; No. 1, Vol. 5).

Bibtex

@inproceedings{ad47b04407e843ea9df754fa3e8b8b02,
title = "Automated Coding of Historical Danish Cause of Death Data Using String Similarity",
abstract = "The study of causes of death has been central to some of the most influential studies of the modern mortality decline in the nineteenth and twentieth centuries. The digitization of individual-level cause-of-death data has been game-changing; however, the data presents a major challenge: how do we code the thousands of unique strings for analysis in an efficient way? This paper aims to see how far we can get with automated coding based on string similarity. We do this by applying a Jaro-Winkler string similarity algorithm in Python (pyjarowinkler) that codes our cause-of-death data from the Copenhagen Burial Register 1861-1911 to DK1875, a contemporary coding and classification system from nineteenth-century Denmark. We then compare the performance of the algorithm to that of a manual (historian) coder in three different ways: at the level of each unique cause-of-death string, at the level of each cause-of-death group, and for the overall cause-of-death pattern for all burials in Copenhagen 1861-1911. Our results show that a minimum-effort algorithm coded approximately half of the causes of death correctly compared to the manually coded dataset. This means that the method applied here is not accurate enough to use for actual data analysis of mortality patterns, as it is not possible to examine individual causes within larger causal groups. However, the results are promising for different uses of the method as a help for the manual coder. A way forward could be to use cut-off points of the Jaro-Winkler scores, coding only those causes where the string similarity match is relatively certain, or to use the automated method to catch most of the initial cases of a certain disease with a very set phrasing, such as cancer. In both cases, the remainder of the unique cause-of-death strings could then be coded by a manual coder.",
author = "Perner, {Louise Villefrance} and Perner, {Mads Linnet} and Pedersen, {Bj{\o}rn-Richard} and Ca{\~n}adas, {Rafael Nozal} and Sildnes, Anders and Shvetsov, Nikita and Andersen, Trygve and Bongo, {Lars Ailo} and Sommerseth, {Hilde Leikny}",
year = "2023",
doi = "10.5617/dhnbpub.10662",
language = "English",
series = "Digital Humanities in the Nordic and Baltic Countries Publications",
number = "1",
pages = "203--221",
editor = "Rockenberger, {Annika} and Gilbert, {Sofie} and Tiemann, {Juliane}",
booktitle = "Digital Humanities in the Nordic and Baltic Countries Publications",

}

RIS

TY - GEN

T1 - Automated Coding of Historical Danish Cause of Death Data Using String Similarity

AU - Perner, Louise Villefrance

AU - Perner, Mads Linnet

AU - Pedersen, Bjørn-Richard

AU - Cañadas, Rafael Nozal

AU - Sildnes, Anders

AU - Shvetsov, Nikita

AU - Andersen, Trygve

AU - Bongo, Lars Ailo

AU - Sommerseth, Hilde Leikny

PY - 2023

Y1 - 2023

AB - The study of causes of death has been central to some of the most influential studies of the modern mortality decline in the nineteenth and twentieth centuries. The digitization of individual-level cause-of-death data has been game-changing; however, the data presents a major challenge: how do we code the thousands of unique strings for analysis in an efficient way? This paper aims to see how far we can get with automated coding based on string similarity. We do this by applying a Jaro-Winkler string similarity algorithm in Python (pyjarowinkler) that codes our cause-of-death data from the Copenhagen Burial Register 1861-1911 to DK1875, a contemporary coding and classification system from nineteenth-century Denmark. We then compare the performance of the algorithm to that of a manual (historian) coder in three different ways: at the level of each unique cause-of-death string, at the level of each cause-of-death group, and for the overall cause-of-death pattern for all burials in Copenhagen 1861-1911. Our results show that a minimum-effort algorithm coded approximately half of the causes of death correctly compared to the manually coded dataset. This means that the method applied here is not accurate enough to use for actual data analysis of mortality patterns, as it is not possible to examine individual causes within larger causal groups. However, the results are promising for different uses of the method as a help for the manual coder. A way forward could be to use cut-off points of the Jaro-Winkler scores, coding only those causes where the string similarity match is relatively certain, or to use the automated method to catch most of the initial cases of a certain disease with a very set phrasing, such as cancer. In both cases, the remainder of the unique cause-of-death strings could then be coded by a manual coder.

U2 - 10.5617/dhnbpub.10662

DO - 10.5617/dhnbpub.10662

M3 - Article in proceedings

T3 - Digital Humanities in the Nordic and Baltic Countries Publications

SP - 203

EP - 221

BT - Digital Humanities in the Nordic and Baltic Countries Publications

A2 - Rockenberger, Annika

A2 - Gilbert, Sofie

A2 - Tiemann, Juliane

ER -

ID: 370481857
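
Illustrative sketch

The abstract describes coding cause-of-death strings by Jaro-Winkler similarity and suggests a cut-off so that only relatively certain matches are coded automatically. The following is a minimal sketch of that idea, not the authors' implementation: it uses a small pure-Python Jaro-Winkler function instead of the pyjarowinkler package, and the codebook labels, the placeholder codes (X01 to X03), the cut-off of 0.9, and the function names are all assumptions made for illustration. They are not DK1875 codes or values from the paper.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings (0.0 to 1.0)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Jaro-Winkler similarity: Jaro plus a bonus for a shared prefix (max 4)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


# Hypothetical codebook: labels are example Danish terms, codes are placeholders.
CODEBOOK = {
    "difteri": "X01",
    "tuberkulose": "X02",
    "kighoste": "X03",
}


def code_cause(raw: str, codebook: dict, cutoff: float = 0.9):
    """Return (code, score) for the best match, or (None, score) below cutoff."""
    cleaned = raw.strip().lower()
    best_code, best_score = None, 0.0
    for label, code in codebook.items():
        score = jaro_winkler(cleaned, label.lower())
        if score > best_score:
            best_code, best_score = code, score
    if best_score < cutoff:
        # Too uncertain: leave this string for the manual coder.
        return None, best_score
    return best_code, best_score


if __name__ == "__main__":
    for text in ["Tuberculose", "alderdom"]:
        print(text, code_cause(text, CODEBOOK))
```

Strings that fall below the cut-off come back with no code, which corresponds to the abstract's suggestion that the remainder be handed to a manual coder. The cut-off itself would need to be tuned against manually coded data.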