A reproducibility protocol and dataset on biomedical sentence similarity
Alicia Lara Clares, Juan J. Lastra-Díaz, Ana Garcia-Serrano
Abstract
This protocol introduces a set of reproducibility resources that allow the exact replication of the experiments presented in our main paper [1], which introduces the largest, and for the first time reproducible, experimental survey on biomedical sentence similarity. HESML V2R1 [2] is the sixth release of our Half-Edge Semantic Measures Library (HESML), a linearly scalable and efficient Java software library of ontology-based semantic similarity measures and Information Content (IC) models for ontologies such as WordNet, SNOMED-CT, MeSH and GO.
This protocol provides a self-contained reproducibility platform which contains the Java source code and binaries of our main benchmark program, as well as a Docker image which allows the exact replication of our experiments on any software platform supported by Docker, such as all Linux-based operating systems, Windows or MacOS. All the resources needed to execute the experiments are published in the permanent repository [3].
Our benchmark program is distributed with the UMLS SNOMED-CT and MeSH ontologies by courtesy of the US National Library of Medicine (NLM), together with all required software components, with the aim of making the setup process easier. Our Docker image provides an exact virtual replica of the machine on which we ran our experiments, thus removing the need to carry out any tedious setup process, such as the setup of the Named Entity Recognizer (NER) tools and other software components. (2022-02-20)
[1] Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. A reproducible experimental survey on biomedical sentence similarity: a string-based method sets the state of the art. Submitted to PLoS One. 2022.
[2] Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. HESML V2R1 Java software library of semantic similarity measures for the biomedical domain. e-cienciaDatos; 2022. doi:10.21950/DOI
[3] Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. Reproducible experiments on word and sentence similarity measures for the biomedical domain. e-cienciaDatos; 2021. V2. https://doi.org/10.21950/EPNXTR
Before start
Our benchmarks can be reproduced on any Docker-compliant platform, such as Windows, MacOS or any Linux-based system, by following a setup similar to that introduced herein.
In order to obtain the decryption password for downloading the required files, you should sign and obtain a license from the US National Library of Medicine (NLM) to use the UMLS Metathesaurus databases, as well as the SNOMED-CT and MeSH ontologies included in this Docker image. For this purpose, you should go to the NLM license page, https://uts.nlm.nih.gov//license.html. After that, you can write to eciencia@consorciomadrono.es to obtain the password to decrypt the file. Likewise, you should obtain and sign a Data User Agreement from the Mayo Clinic to use the MedSTS dataset by sending the authors the Data User Agreement form, https://n2c2.dbmi.hms.harvard.edu/data-use-agreement.
Steps
Installing Docker on Ubuntu
If Docker is not installed on your machine, the instructions below install the latest version of Docker CE. For further details, we refer the reader to the official Docker setup page, https://docs.docker.com/install/linux/docker-ce/ubuntu/
First, we update the system:
sudo apt-get update
We install the dependencies:
sudo apt-get install ca-certificates curl gnupg lsb-release && curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
We add the Docker repository to the APT sources:
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
We install the Docker engine:
sudo apt-get update && sudo apt-get install docker-ce docker-ce-cli containerd.io
<Note title="Note" type="warning" ><span>If the installation detailed above fails, you can install Docker from the Ubuntu repositories instead:</span><span> </span>```sudo apt install docker.io```<span></span></Note>
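Once installed, a quick sanity check (a sketch, not part of the original protocol) confirms that the Docker CLI is available before continuing:

```shell
# Sketch: confirm the Docker CLI is installed before proceeding.
check_docker() {
  if command -v docker >/dev/null 2>&1; then
    docker --version        # prints the installed client version
  else
    echo "docker CLI not found: re-run the installation steps above"
    return 1
  fi
}

check_docker || true
```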
Downloading resources from the repository
Now, we download and decrypt the external resources such as pre-trained models and dependencies.
First, we create a data directory which will contain all the datasets, pre-trained models and dependencies for executing the experiments
cd /home/[user]/Desktop && mkdir HESML_DATA && cd HESML_DATA
Now, we download and extract the BERT pre-trained models compressed file (20.2 GB) into the HESML_DATA directory:
wget https://doi.org/10.21950/BERTExperiments.tar.gz && tar xvf BERTExperiments.tar.gz
wget https://doi.org/10.21950/CharacterAndSentenceEmbeddings.tar.gz && tar xvf CharacterAndSentenceEmbeddings.tar.gz
We download and extract the pre-trained word embedding models (40 GB) in the same directory:
wget https://doi.org/10.21950/WordEmbeddings.tar.gz && tar xvf WordEmbeddings.tar.gz
We install the ccrypt tool, then download and decrypt the encrypted dependencies file:
sudo apt install -y ccrypt && wget https://doi.org/10.21950/Dependencies.tar.gz.cpt && ccrypt -d Dependencies.tar.gz.cpt
<Note title="Safety information" type="error" ><span>In order to obtain the decryption password for the Dependencies.tar.gz file, you should sign and obtain a license from the US National Library of Medicine (NLM) to use the UMLS Metathesaurus databases, as well as the SNOMED-CT and MeSH ontologies included in this Docker image. For this purpose, you should go to the NLM license page, https://uts.nlm.nih.gov//license.html. After that, you can write to eciencia@consorciomadrono.es to obtain the password to decrypt the file. Likewise, you should obtain and sign a Data User Agreement from the Mayo Clinic to use the MedSTS dataset by sending the authors the Data User Agreement form, https://n2c2.dbmi.hms.harvard.edu/data-use-agreement</span></Note>
tar xvf Dependencies.tar.gz
rm -r *.tar.gz
<Note title="Expected result" type="success" ><span>At the end of this section, you should have a directory named HESML_DATA on your local machine with this file structure:</span><span></span><span>.</span><span>./ImportedLibs</span><span>./WordEmbeddings</span><span>./UMLS</span><span>./SentenceEmbeddings</span><span>./ReproducibleResults</span><span>./SentenceSimDatasets</span><span>./FlairEmbeddings</span><span>./public_mm_lite</span><span>./apache-ctakes-4.0.0.1-src</span><span>./BERTExperiments</span><span>./dist</span><span>./public_mm</span></Note>
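The layout above can be verified with a minimal shell check before moving on; this is a sketch using the directory names from the note above, with the base path as a parameter since the actual location depends on where you created HESML_DATA:

```shell
# Sketch: verify the expected HESML_DATA layout.
# Directory names are taken from the listing above; the base path is a parameter.
check_hesml_data() {
  local base="$1" missing=0
  for d in ImportedLibs WordEmbeddings UMLS SentenceEmbeddings ReproducibleResults \
           SentenceSimDatasets FlairEmbeddings public_mm_lite \
           apache-ctakes-4.0.0.1-src BERTExperiments dist public_mm; do
    if [ ! -d "$base/$d" ]; then
      echo "missing: $d"
      missing=1
    fi
  done
  return "$missing"
}

# Usage: check_hesml_data /home/[user]/Desktop/HESML_DATA
```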
Create and run a Docker container with HESML and dependencies
In this step, we create and run a Docker container which has all the necessary software for executing the experiments pre-installed.
# We get the Docker image from Docker Hub
docker pull alicialara/hesml_v2r1:latest
Alternatively, the image can be downloaded from the permanent repository [3] and loaded manually:
wget https://doi.org/10.21950/hesml_STS_dockerRelease.tar.gz && tar xvf hesml_STS_dockerRelease.tar.gz && docker load --input hesml_STS_dockerRelease.tar.gz
Now, we create, run and attach to the Docker container named "HESMLV2R1" which will share a volume with the HESML_DATA directory.
docker run --name=HESMLV2R1 -it -v [PATH_TO_HESML_DATA_DIRECTORY]/HESML_DATA/:/home/user/HESML_DATA alicialara/hesml_v2r1:latest /bin/bash
In the following steps, we will be working inside the Docker container, which was attached in the previous step.
Now, we clone the HESML repository from GitHub:
cd /home/user && git clone --branch HESML-STS_master_dev https://github.com/jjlastra/HESML.git
cd /home/user/HESML_DATA/ && cp -r dist/lib /home/user/HESML/HESML_Library/HESMLSTSclient/dist && cd /home/user/HESML/HESML_Library && cp HESML/dist/HESML-V2R1.0.1.jar HESMLSTSclient/dist/lib
<Note title="Expected result" type="success" ><span>At the end of this section, you should have the following directories in the /home/user directory of the Docker container:</span><span></span><span>.</span><span>./HESML</span><span>./HESML_DATA</span><span></span><span>The HESML directory contains the sources from GitHub with all the necessary dependencies and libraries for executing the experiments.</span><span>The HESML_DATA directory contains the pre-trained models, Python virtual environments and the NER tools for executing the experiments.</span></Note>
Launch the Metamap and cTAKES services
The experiments evaluated herein use the Metamap [4], MetamapLite [5] and cTAKES [6] external NER tools to annotate CUI codes in the sentences. Thus, we have to launch the NER tool services by following the steps below.
First, we enter the Metamap directory:
cd /home/user/HESML_DATA/public_mm
We start the Metamap dependency services:
./bin/skrmedpostctl start && ./bin/wsdserverctl start
<Note title="Note" type="warning" ><b>Before executing the next step, wait until the following message appears (2-3 minutes): "WSD Server databases and disambiguation methods have been initialized.", and then press the "Enter" key.</b> </Note>
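The manual wait above can be automated with a small helper that polls a log file for the initialization message. This is a sketch, and it assumes you redirect the server output to a log file (the actual log location and message wording may differ per installation):

```shell
# Sketch: poll a log file until a given message appears, instead of watching
# the console. Assumes the server output is redirected to the given log file.
wait_for_message() {
  local logfile="$1" pattern="$2" timeout="${3:-300}" elapsed=0
  until grep -q "$pattern" "$logfile" 2>/dev/null; do
    sleep 1
    elapsed=$((elapsed + 1))
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out waiting for: $pattern"
      return 1
    fi
  done
}

# e.g. wait_for_message wsdserver.log "disambiguation methods have been initialized"
```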
Now, we start the Metamap service
./bin/mmserver &
Finally, we set the UMLS API key required by the cTAKES service:
export ctakes_umls_apikey=[ENTER YOUR UMLS API KEY]
<Note title="Safety information" type="error" ><span>In order to obtain a UMLS API key, you should sign and obtain a license from the US National Library of Medicine (NLM) to use the UMLS Metathesaurus databases, as well as the SNOMED-CT and MeSH ontologies included in this Docker image. For this purpose, you should go to the NLM license page, https://uts.nlm.nih.gov//license.html.</span></Note>
<Note title="Expected result" type="success" ><span>At the end of this section, you should have initialized the NER tool services, and you can execute all the experiments evaluated in our primary paper:</span><Note title="Citation" type="info" >Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. A reproducible experimental survey on biomedical sentence similarity: a string-based method sets the state of the art. Submitted to PLoS One. 2022.</Note><span></span></Note>
Ubuntu-based instructions to run our benchmarks on a Docker container
The final step is the execution of the experiments evaluated in our primary paper.
To run the experiments, first step into the HESMLSTSclient directory
cd /home/user/HESML/HESML_Library/HESMLSTSclient/
Before running the experiments, we remove previous results and temporary files:
rm -r ../ReproducibleExperiments/BioSentenceSimilarity_paper/BioSentenceSimFinalRawOutputFiles/* && rm -r ../ReproducibleExperiments/BioSentenceSimilarity_paper/BioSentenceSimFinalProcessedOutputFiles/* && rm Execution_times_* && rm -r tmp* && rm -r /tmp/tmp*
Now, execute the HESMLSTSclient with the default options
java -jar -Xms30g dist/HESMLSTSclient.jar
<Note title="Note" type="warning" ><span>Note that this experiment takes more than 24 hours of execution time on a desktop computer with an AMD Ryzen 7 5800x CPU (16 cores), 64 GB RAM and a 2 TB SSD disk.</span></Note>
<Note title="Expected result" type="success" ><span>At the end of this section, you should find all the raw output files in your HESML_DATA directory:</span><span></span><span></span><span>[PATH_TO_HESML_DATA_DIRECTORY]/HESML_DATA/ReproducibleResults/BioSentenceSimilarity_paper/BioSentenceSimFinalRawOutputFiles</span><span>.</span><span>├── raw_similarity_BIOSSES_BESTCOMBS.csv</span><span>├── raw_similarity_BIOSSES_COMBestWorst.csv</span><span>├── raw_similarity_BIOSSES_LiBlockNER.csv</span><span>├── raw_similarity_BIOSSES_NERexperiment.csv</span><span>├── raw_similarity_CTR_BESTCOMBS.csv</span><span>├── raw_similarity_CTR_COMBestWorst.csv</span><span>├── raw_similarity_CTR_LiBlockNER.csv</span><span>├── raw_similarity_CTR_NERexperiment.csv</span><span>├── raw_similarity_MedSTSFull_BESTCOMBS.csv</span><span>├── raw_similarity_MedSTSFull_COMBestWorst.csv</span><span>├── raw_similarity_MedSTSFull_LiBlockNER.csv</span><span>└── raw_similarity_MedSTSFull_NERexperiment.csv</span><span></span><span></span><span>These raw output files will be used in the post-processing stage to create the tables 8, 10-17, figure 5 and appendix A detailed in our primary paper [1].</span></Note>
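The 12 expected files follow a regular naming pattern (dataset x experiment), so a short check can confirm the run completed. This is a sketch based on the file names listed above, with the output directory as a parameter:

```shell
# Sketch: check that the 12 expected raw result files exist.
# File names follow the dataset x experiment pattern listed above.
check_raw_outputs() {
  local dir="$1" missing=0
  for ds in BIOSSES CTR MedSTSFull; do
    for exp in BESTCOMBS COMBestWorst LiBlockNER NERexperiment; do
      if [ ! -f "$dir/raw_similarity_${ds}_${exp}.csv" ]; then
        echo "missing: raw_similarity_${ds}_${exp}.csv"
        missing=1
      fi
    done
  done
  return "$missing"
}
```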
[OPTIONAL] Running the pre-processing experiments
In our primary paper [1], we also evaluate the pre-processing configurations of each method, which are detailed in tables 7 and 9, as well as in appendix B of the same paper. These pre-processing experiments are evaluated using the HESMLSTSImpactEvaluationclient program included in the HESML V2R1 software release [2].
To execute the pre-processing experiments, run the following commands
cd /home/user/HESML_DATA/ && cp -r dist/lib /home/user/HESML/HESML_Library/HESMLSTSImpactEvaluationclient/dist && cd /home/user/HESML/HESML_Library && cp HESML/dist/HESML-V2R1.0.1.jar HESMLSTSImpactEvaluationclient/dist/lib
cd /home/user/HESML/HESML_Library/HESMLSTSImpactEvaluationclient/ && java -jar -Xms30g dist/HESMLSTSImpactEvaluationclient.jar
Post-processing the experiments
The post-processing stage uses the RStudio software installed on the local machine to create the final LaTeX tables and CSV files.
In our experiments, we use the latest release of RStudio (version 1.4) with R version 4.1.2 (2021-11-01). We also install the following packages for executing the post-processing scripts:
- collections
- kableExtra
- knitr
- readr
- stringr
- xtable
- dplyr
- ggpubr
- ggqqplot
- ggplot2
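The packages above can be installed in one call; the snippet below is a sketch that only prints the `install.packages()` call so you can paste it into an R console. It collapses the duplicate ggpubr entry, and omits ggqqplot on the assumption that it is a function of the ggpubr package rather than a standalone package:

```shell
# Sketch: build the install.packages() call for the listed dependencies.
# ggqqplot is omitted (we believe it is a ggpubr function, not a package),
# and the duplicate ggpubr entry is collapsed.
pkgs='collections kableExtra knitr readr stringr xtable dplyr ggpubr ggplot2'
quoted=$(printf '"%s", ' $pkgs)
quoted=${quoted%, }
echo "install.packages(c($quoted))"
```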
After executing the experiments, the raw output files, as well as the R post-processing scripts, are automatically copied into the HESML_DATA directory, in a new directory named "ReproducibleResults". Before executing the post-processing scripts, it is necessary to modify the file permissions as follows:
cd [PATH_TO_HESML_DATA_DIRECTORY]/HESML_DATA && sudo chmod -R 777 ReproducibleResults/
Tables 8 and 10-17, figure 5, and appendices A and B are created by executing the R scripts described below:
[PATH_TO_HESML_DATA_DIRECTORY]/HESML_DATA/ReproducibleResults/Post-scripts
.
├── bio_sentence_sim_tables.R
├── bio_analytics_biosses.R
├── bio_analytics_ctr.R
├── bio_analytics_medsts.R
├── bio_sentence_sim_allExperiments_analyzingtablesPreprocessing.R
├── bio_sentence_sim_pvaluesLiBlock.R
├── bio_sentence_sim_pvaluesNER.R
├── bio_sentence_sim_pvalues.R
├── bio_sentence_sim_scripts
│ ├── readBERT.R
│ ├── readBESTCOMBS.R
│ ├── readFlair.R
│ ├── readLiBlockNERexperiment.R
│ ├── readNERexperiment.R
│ ├── readOurWE.R
│ ├── readSent2Vec.R
│ ├── readString.R
│ ├── readSWEM.R
│ ├── readTest.R
│ ├── readUBSM.R
│ ├── readUSE.R
│ └── readWBSM.R
- bio_sentence_sim_tables.R : Creates tables 8, 10, 11 and 12 of our primary paper [1], as well as all the tables of appendix B. It is also used to extract the best and worst pre-processing configurations shown in table 9 of the same paper.
- bio_sentence_sim_pvalues.R : Creates the tables of appendix A of our primary paper [1].
- bio_sentence_sim_allExperiments_analyzingtablesPreprocessing.R : Creates the tables with all the p-values of the pre-processing experiments executed with the HESMLSTSImpactEvaluationclient, which are used in table 9 of our main paper [1].
- bio_sentence_sim_pvaluesLiBlock.R : Creates a table with the LiBlock NER experiments, which is used to detail the p-values in table 12 of the main paper [1].
- bio_sentence_sim_pvaluesNER.R : Creates a table with the NER experiments, which is used to detail the p-values in table 11 of the main paper [1].
- bio_analytics_biosses.R, bio_analytics_medsts.R and bio_analytics_ctr.R : Create figure 5 and are used to create tables 13-17 of our primary paper [1].
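As an alternative to running the scripts one by one in RStudio, the whole set can be executed non-interactively with `Rscript`. This is a sketch, not part of the original protocol; it assumes `Rscript` is on the PATH and takes the Post-scripts directory as a parameter:

```shell
# Sketch: run the post-processing scripts non-interactively with Rscript.
# Script names are taken from the listing above; failures are reported, not fatal.
run_post_scripts() {
  local dir="$1"
  for s in bio_sentence_sim_tables.R bio_sentence_sim_pvalues.R \
           bio_sentence_sim_allExperiments_analyzingtablesPreprocessing.R \
           bio_sentence_sim_pvaluesLiBlock.R bio_sentence_sim_pvaluesNER.R \
           bio_analytics_biosses.R bio_analytics_medsts.R bio_analytics_ctr.R; do
    (cd "$dir" && Rscript "$s") || echo "failed: $s"
  done
}

# Usage: run_post_scripts [PATH_TO_HESML_DATA_DIRECTORY]/HESML_DATA/ReproducibleResults/Post-scripts
```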