A reproducibility protocol and dataset on biomedical sentence similarity
Alicia Lara Clares, Juan J. Lastra-Díaz, Ana Garcia-Serrano
Abstract
This protocol introduces a set of reproducibility resources that allow the exact replication of the experiments presented in our main paper [1], which introduces the largest, and for the first time reproducible, experimental survey on biomedical sentence similarity. HESML V2R1 [2] is the sixth release of our Half-Edge Semantic Measures Library (HESML), a linearly scalable and efficient Java software library of ontology-based semantic similarity measures and Information Content (IC) models for ontologies such as WordNet, SNOMED-CT, MeSH and GO.
This protocol provides a self-contained reproducibility platform which contains the Java source code and binaries of our main benchmark program, as well as a Docker image which allows the exact replication of our experiments on any software platform supported by Docker, such as all Linux-based operating systems, Windows or MacOS. All the resources needed to execute the experiments are published in the permanent repository [3].
Our benchmark program is distributed with the UMLS SNOMED-CT and MeSH ontologies by courtesy of the US National Library of Medicine (NLM), together with all required software components, with the aim of making the setup process easier. Our Docker image provides an exact virtual replica of the machine on which we ran our experiments, thus removing the need to carry out any tedious setup process, such as the setup of the Named Entity Recognizer (NER) tools and other software components. (2022-02-20)
[1] Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. A reproducible experimental survey on biomedical sentence similarity: a string-based method sets the state of the art. Submitted to PLoS One. 2022.
[2] Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. HESML V2R1 Java software library of semantic similarity measures for the biomedical domain. e-cienciaDatos; 2022. doi:10.21950/DOI
[3] Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. Reproducible experiments on word and sentence similarity measures for the biomedical domain. e-cienciaDatos; 2021. V2. https://doi.org/10.21950/EPNXTR
Before start
Our benchmarks can be reproduced on any Docker-compliant platform, such as Windows, MacOS or any Linux-based system, by following a setup similar to that introduced herein.
In order to obtain the decryption password for downloading the required files, you should sign and obtain a license from the US National Library of Medicine (NLM) to use the UMLS Metathesaurus databases, as well as the SNOMED-CT and MeSH ontologies included in this Docker image. For this purpose, you should go to the NLM license page, https://uts.nlm.nih.gov//license.html. After that, you can write to eciencia@consorciomadrono.es to obtain the password to decrypt the file. Likewise, you should obtain and sign a Data User Agreement from the Mayo Clinic to use the MedSTS dataset by sending the authors the Data User Agreement form, https://n2c2.dbmi.hms.harvard.edu/data-use-agreement.
Steps
Installing Docker on Ubuntu
If Docker is not installed on your machine, the instructions below install the latest version of Docker CE. For further details, we refer the reader to the official Docker setup page, https://docs.docker.com/install/linux/docker-ce/ubuntu/
First, we update the system:
sudo apt-get update
We install the dependencies:
sudo apt-get install ca-certificates curl gnupg lsb-release && curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
We add the Docker repository to the APT sources:
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
We install the Docker engine:
sudo apt-get update && sudo apt-get install docker-ce docker-ce-cli containerd.io
<Note title="Note" type="warning" ><span>If the installation detailed above fails, you can install Docker from the Ubuntu repositories instead:</span><span> </span>```sudo apt install docker.io```<span></span></Note>
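Once installed, a quick sanity check (a sketch, not part of the original protocol) confirms that the Docker CLI is available before continuing:

```shell
# Sketch: confirm the Docker CLI is installed before proceeding.
check_docker() {
  if command -v docker >/dev/null 2>&1; then
    docker --version        # prints the installed client version
  else
    echo "docker CLI not found: re-run the installation steps above"
    return 1
  fi
}

check_docker || true
```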
Downloading resources from the repository
Now, we download and decrypt the external resources such as pre-trained models and dependencies.
First, we create a data directory which will contain all the datasets, pre-trained models and dependencies for executing the experiments
cd /home/[user]/Desktop && mkdir HESML_DATA && cd HESML_DATA
Now, we download and extract the BERT pre-trained models compressed file (20.2 GB) into the HESML_DATA directory:
wget https://doi.org/10.21950/BERTExperiments.tar.gz && tar xvf BERTExperiments.tar.gz
wget https://doi.org/10.21950/CharacterAndSentenceEmbeddings.tar.gz && tar xvf CharacterAndSentenceEmbeddings.tar.gz
We download and extract the pre-trained word embedding models (40 GB) in the same directory:
wget https://doi.org/10.21950/WordEmbeddings.tar.gz && tar xvf WordEmbeddings.tar.gz
We install the ccrypt tool, then download and decrypt the encrypted dependencies file:
sudo apt install -y ccrypt && wget https://doi.org/10.21950/Dependencies.tar.gz.cpt && ccrypt -d Dependencies.tar.gz.cpt
<Note title="Safety information" type="error" ><span>In order to obtain the decryption password for the Dependencies.tar.gz file, you should sign and obtain a license from the US National Library of Medicine (NLM) to use the UMLS Metathesaurus databases, as well as the SNOMED-CT and MeSH ontologies included in this Docker image. For this purpose, you should go to the NLM license page, https://uts.nlm.nih.gov//license.html. After that, you can write to eciencia@consorciomadrono.es to obtain the password to decrypt the file. Likewise, you should obtain and sign a Data User Agreement from the Mayo Clinic to use the MedSTS dataset by sending the authors the Data User Agreement form, https://n2c2.dbmi.hms.harvard.edu/data-use-agreement</span></Note>
tar xvf Dependencies.tar.gz
rm -r *.tar.gz
<Note title="Expected result" type="success" ><span>At the end of this section, you should have a directory named HESML_DATA on your local machine with this file structure:</span><span></span><span>.</span><span>./ImportedLibs</span><span>./WordEmbeddings</span><span>./UMLS</span><span>./SentenceEmbeddings</span><span>./ReproducibleResults</span><span>./SentenceSimDatasets</span><span>./FlairEmbeddings</span><span>./public_mm_lite</span><span>./apache-ctakes-4.0.0.1-src</span><span>./BERTExperiments</span><span>./dist</span><span>./public_mm</span></Note>
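The layout above can be verified with a minimal shell check before moving on; this is a sketch using the directory names from the note above, with the base path as a parameter since the actual location depends on where you created HESML_DATA:

```shell
# Sketch: verify the expected HESML_DATA layout.
# Directory names are taken from the listing above; the base path is a parameter.
check_hesml_data() {
  local base="$1" missing=0
  for d in ImportedLibs WordEmbeddings UMLS SentenceEmbeddings ReproducibleResults \
           SentenceSimDatasets FlairEmbeddings public_mm_lite \
           apache-ctakes-4.0.0.1-src BERTExperiments dist public_mm; do
    if [ ! -d "$base/$d" ]; then
      echo "missing: $d"
      missing=1
    fi
  done
  return "$missing"
}

# Usage: check_hesml_data /home/[user]/Desktop/HESML_DATA
```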
Create and run a Docker container with HESML and dependencies
In this step, we create and run a Docker container which has all the necessary software for executing the experiments pre-installed.
# We get the Docker image from Docker Hub
docker pull alicialara/hesml_v2r1:latest
Alternatively, the image can be downloaded from the permanent repository [3] and loaded manually:
wget https://doi.org/10.21950/hesml_STS_dockerRelease.tar.gz && tar xvf hesml_STS_dockerRelease.tar.gz && docker load --input hesml_STS_dockerRelease.tar.gz
Now, we create, run and attach to the Docker container named "HESMLV2R1" which will share a volume with the HESML_DATA directory.
docker run --name=HESMLV2R1 -it -v [PATH_TO_HESML_DATA_DIRECTORY]/HESML_DATA/:/home/user/HESML_DATA alicialara/hesml_v2r1:latest /bin/bash
In the following steps, we will be working inside the Docker container, which was attached in the previous step.
Now, we clone the HESML repository from GitHub:
cd /home/user && git clone --branch HESML-STS_master_dev https://github.com/jjlastra/HESML.git
cd /home/user/HESML_DATA/ && cp -r dist/lib /home/user/HESML/HESML_Library/HESMLSTSclient/dist && cd /home/user/HESML/HESML_Library && cp HESML/dist/HESML-V2R1.0.1.jar HESMLSTSclient/dist/lib
<Note title="Expected result" type="success" ><span>At the end of this section, you should have the following directories in the /home/user directory of the Docker container:</span><span></span><span>.</span><span>./HESML</span><span>./HESML_DATA</span><span></span><span>The HESML directory contains the sources from GitHub with all the necessary dependencies and libraries for executing the experiments.</span><span>The HESML_DATA directory contains the pre-trained models, Python virtual environments and the NER tools for executing the experiments.</span></Note>
Launch the Metamap and cTAKES services
The experiments evaluated herein use the Metamap [4], MetamapLite [5] and cTAKES [6] external NER tools to annotate CUI codes in the sentences. Thus, we have to launch the NER tool services by following the steps below.
First, we enter the Metamap directory:
cd /home/user/HESML_DATA/public_mm
We start the Metamap dependency services:
./bin/skrmedpostctl start && ./bin/wsdserverctl start
<Note title="Note" type="warning" ><b>Before executing the next step, wait until the following message appears (2-3 minutes): "WSD Server databases and disambiguation methods have been initialized.", and then press the "Enter" key.</b> </Note>
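The manual wait above can be automated with a small helper that polls a log file for the initialization message. This is a sketch, and it assumes you redirect the server output to a log file (the actual log location and message wording may differ per installation):

```shell
# Sketch: poll a log file until a given message appears, instead of watching
# the console. Assumes the server output is redirected to the given log file.
wait_for_message() {
  local logfile="$1" pattern="$2" timeout="${3:-300}" elapsed=0
  until grep -q "$pattern" "$logfile" 2>/dev/null; do
    sleep 1
    elapsed=$((elapsed + 1))
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out waiting for: $pattern"
      return 1
    fi
  done
}

# e.g. wait_for_message wsdserver.log "disambiguation methods have been initialized"
```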
Now, we start the Metamap service
./bin/mmserver &
Finally, we set the UMLS API key required by the cTAKES service:
export ctakes_umls_apikey=[ENTER YOUR UMLS API KEY]
<Note title="Safety information" type="error" ><span>In order to obtain a UMLS API key, you should sign and obtain a license from the US National Library of Medicine (NLM) to use the UMLS Metathesaurus databases, as well as the SNOMED-CT and MeSH ontologies included in this Docker image. For this purpose, you should go to the NLM license page, https://uts.nlm.nih.gov//license.html.</span></Note>
<Note title="Expected result" type="success" ><span>At the end of this section, you should have initialized the NER tool services, and you can execute all the experiments evaluated in our primary paper:</span><Note title="Citation" type="info" >Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. A reproducible experimental survey on biomedical sentence similarity: a string-based method sets the state of the art. Submitted to PLoS One. 2022.</Note><span></span></Note>
Ubuntu-based instructions to run our benchmarks on a Docker container
The final step is the execution of the experiments evaluated in our primary paper.
To run the experiments, first step into the HESMLSTSclient directory
cd /home/user/HESML/HESML_Library/HESMLSTSclient/
Before running the experiments, we remove previous results and temporary files:
rm -r ../ReproducibleExperiments/BioSentenceSimilarity_paper/BioSentenceSimFinalRawOutputFiles/* && rm -r ../ReproducibleExperiments/BioSentenceSimilarity_paper/BioSentenceSimFinalProcessedOutputFiles/* && rm Execution_times_* && rm -r tmp* && rm -r /tmp/tmp*
Now, execute the HESMLSTSclient with the default options
java -jar -Xms30g dist/HESMLSTSclient.jar
<Note title="Note" type="warning" ><span>Note that this experiment takes more than 24 hours of execution time on a desktop computer with an AMD Ryzen 7 5800x CPU (16 cores), 64 GB RAM and a 2 TB SSD disk.</span></Note>
<Note title="Expected result" type="success" ><span>At the end of this section, you should find all the raw output files in your HESML_DATA directory:</span><span></span><span></span><span>[PATH_TO_HESML_DATA_DIRECTORY]/HESML_DATA/ReproducibleResults/BioSentenceSimilarity_paper/BioSentenceSimFinalRawOutputFiles</span><span>.</span><span>├── raw_similarity_BIOSSES_BESTCOMBS.csv</span><span>├── raw_similarity_BIOSSES_COMBestWorst.csv</span><span>├── raw_similarity_BIOSSES_LiBlockNER.csv</span><span>├── raw_similarity_BIOSSES_NERexperiment.csv</span><span>├── raw_similarity_CTR_BESTCOMBS.csv</span><span>├── raw_similarity_CTR_COMBestWorst.csv</span><span>├── raw_similarity_CTR_LiBlockNER.csv</span><span>├── raw_similarity_CTR_NERexperiment.csv</span><span>├── raw_similarity_MedSTSFull_BESTCOMBS.csv</span><span>├── raw_similarity_MedSTSFull_COMBestWorst.csv</span><span>├── raw_similarity_MedSTSFull_LiBlockNER.csv</span><span>└── raw_similarity_MedSTSFull_NERexperiment.csv</span><span></span><span></span><span>These raw output files will be used in the post-processing stage to create the tables 8, 10-17, figure 5 and appendix A detailed in our primary paper [1].</span></Note>
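The 12 expected files follow a regular naming pattern (dataset x experiment), so a short check can confirm the run completed. This is a sketch based on the file names listed above, with the output directory as a parameter:

```shell
# Sketch: check that the 12 expected raw result files exist.
# File names follow the dataset x experiment pattern listed above.
check_raw_outputs() {
  local dir="$1" missing=0
  for ds in BIOSSES CTR MedSTSFull; do
    for exp in BESTCOMBS COMBestWorst LiBlockNER NERexperiment; do
      if [ ! -f "$dir/raw_similarity_${ds}_${exp}.csv" ]; then
        echo "missing: raw_similarity_${ds}_${exp}.csv"
        missing=1
      fi
    done
  done
  return "$missing"
}
```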
[OPTIONAL] Running the pre-processing experiments
In our primary paper [1], we also evaluate the pre-processing configurations of each method, which are detailed in tables 7 and 9, as well as in appendix B of the same paper. These pre-processing experiments are evaluated using the HESMLSTSImpactEvaluationclient program included in the HESML V2R1 software release [2].
To execute the pre-processing experiments, run the following commands
cd /home/user/HESML_DATA/ && cp -r dist/lib /home/user/HESML/HESML_Library/HESMLSTSImpactEvaluationclient/dist && cd /home/user/HESML/HESML_Library && cp HESML/dist/HESML-V2R1.0.1.jar HESMLSTSImpactEvaluationclient/dist/lib
cd /home/user/HESML/HESML_Library/HESMLSTSImpactEvaluationclient/ && java -jar -Xms30g dist/HESMLSTSImpactEvaluationclient.jar
Post-processing the experiments
The post-processing stage uses the RStudio software installed on the local machine to create the final LaTeX tables and CSV files.
In our experiments, we use the latest release of RStudio (version 1.4) with R version 4.1.2 (2021-11-01). We also install the following packages for executing the post-processing scripts:
- collections
- kableExtra
- knitr
- readr
- stringr
- xtable
- dplyr
- ggpubr
- ggqqplot
- ggplot2
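The packages above can be installed in one call; the snippet below is a sketch that only prints the `install.packages()` call so you can paste it into an R console. It collapses the duplicate ggpubr entry, and omits ggqqplot on the assumption that it is a function of the ggpubr package rather than a standalone package:

```shell
# Sketch: build the install.packages() call for the listed dependencies.
# ggqqplot is omitted (we believe it is a ggpubr function, not a package),
# and the duplicate ggpubr entry is collapsed.
pkgs='collections kableExtra knitr readr stringr xtable dplyr ggpubr ggplot2'
quoted=$(printf '"%s", ' $pkgs)
quoted=${quoted%, }
echo "install.packages(c($quoted))"
```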
After executing the experiments, the raw output files, as well as the R post-processing scripts, are automatically copied into the HESML_DATA directory, in a new directory named "ReproducibleResults". Before executing the post-processing scripts, it is necessary to modify the file permissions as follows:
cd [PATH_TO_HESML_DATA_DIRECTORY]/HESML_DATA && sudo chmod -R 777 ReproducibleResults/
Tables 8 and 10-17, figure 5, and appendices A and B are created by executing the R scripts described below:
[PATH_TO_HESML_DATA_DIRECTORY]/HESML_DATA/ReproducibleResults/Post-scripts
.
├── bio_sentence_sim_tables.R
├── bio_analytics_biosses.R
├── bio_analytics_ctr.R
├── bio_analytics_medsts.R
├── bio_sentence_sim_allExperiments_analyzingtablesPreprocessing.R
├── bio_sentence_sim_pvaluesLiBlock.R
├── bio_sentence_sim_pvaluesNER.R
├── bio_sentence_sim_pvalues.R
├── bio_sentence_sim_scripts
│ ├── readBERT.R
│ ├── readBESTCOMBS.R
│ ├── readFlair.R
│ ├── readLiBlockNERexperiment.R
│ ├── readNERexperiment.R
│ ├── readOurWE.R
│ ├── readSent2Vec.R
│ ├── readString.R
│ ├── readSWEM.R
│ ├── readTest.R
│ ├── readUBSM.R
│ ├── readUSE.R
│ └── readWBSM.R
- bio_sentence_sim_tables.R : Creates tables 8, 10, 11 and 12 of our primary paper [1], as well as all the tables of appendix B. It is also used to extract the best and worst pre-processing configurations shown in table 9 of the same paper.
- bio_sentence_sim_pvalues.R : Creates the tables of appendix A of our primary paper [1].
- bio_sentence_sim_allExperiments_analyzingtablesPreprocessing.R : Creates the tables with all the p-values of the pre-processing experiments executed with the HESMLSTSImpactEvaluationclient, which are used in table 9 of our main paper [1].
- bio_sentence_sim_pvaluesLiBlock.R : Creates a table with the LiBlock NER experiments, which is used to detail the p-values in table 12 of the main paper [1].
- bio_sentence_sim_pvaluesNER.R : Creates a table with the NER experiments, which is used to detail the p-values in table 11 of the main paper [1].
- bio_analytics_biosses.R, bio_analytics_medsts.R and bio_analytics_ctr.R : Create figure 5 and are used to create tables 13-17 of our primary paper [1].
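As an alternative to running the scripts one by one in RStudio, the whole set can be executed non-interactively with `Rscript`. This is a sketch, not part of the original protocol; it assumes `Rscript` is on the PATH and takes the Post-scripts directory as a parameter:

```shell
# Sketch: run the post-processing scripts non-interactively with Rscript.
# Script names are taken from the listing above; failures are reported, not fatal.
run_post_scripts() {
  local dir="$1"
  for s in bio_sentence_sim_tables.R bio_sentence_sim_pvalues.R \
           bio_sentence_sim_allExperiments_analyzingtablesPreprocessing.R \
           bio_sentence_sim_pvaluesLiBlock.R bio_sentence_sim_pvaluesNER.R \
           bio_analytics_biosses.R bio_analytics_medsts.R bio_analytics_ctr.R; do
    (cd "$dir" && Rscript "$s") || echo "failed: $s"
  done
}

# Usage: run_post_scripts [PATH_TO_HESML_DATA_DIRECTORY]/HESML_DATA/ReproducibleResults/Post-scripts
```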