IRIS Software Protocol
Leonardo Zilli, Erica Andreose, Salvatore Di Marzo
Abstract
This project investigates the discoverability of the University of Bologna's scholarly output within OpenCitations Meta, a platform that stores and delivers bibliographic metadata for all the publications involved in the OpenCitations Index, which records citations and lets the user see the citation links between one document and another. Specifically, the project analyzes the coverage in OpenCitations Meta of the University of Bologna's publications, particularly those deposited in the IRIS institutional repository. Publication types are compared to identify patterns in representation. Using OpenCitations data, the citation impact of these IRIS publications is quantified, covering both the citations they receive and those they make. The study also examines the nature of these citations, distinguishing those that link publications within IRIS from those that involve external sources, which clarifies the interplay between internal and external citations and gives a deeper understanding of the institution's scholarly ecosystem. The methodology complies with the core values of Open Science: the data and the software used will be made available so that users can replicate, reproduce and validate the conclusions drawn from the project's results. These results are relevant to researchers and academic institutions, supporting informed decision-making and a better understanding of scholarly communication dynamics.
Before you start
Before starting, make sure Python 3.x is installed on your computer. To run the provided scripts correctly, you must also install the required libraries:
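The snippets in this protocol import pandas, polars and tqdm, so these are presumably among the required libraries. As a minimal sketch (the package list is inferred from the code shown in this protocol rather than taken from an official requirements file, and `missing_packages` is a hypothetical helper), the following check reports which of them cannot be imported yet:

```python
from importlib.util import find_spec

# Assumption: this list is inferred from the imports used in the protocol's
# snippets; it is not an official requirements file.
ASSUMED_PACKAGES = ["pandas", "polars", "tqdm"]


def missing_packages(names):
    """Return the package names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(ASSUMED_PACKAGES)
    if missing:
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All assumed packages are available.")
```

Running the check before the first step avoids discovering a missing dependency halfway through the conversion of the dump.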
Steps
IRIS Dataset Preparation
- "ODS_L1_IR_ITEM_CON_PERSON.csv": information about the people involved in the publications (authors, editors, etc.)
- "ODS_L1_IR_ITEM_DESCRIPTION.csv": the string containing the name of the authors and other related metadata of publications
- "ODS_L1_IR_ITEM_IDENTIFIER.csv": the identifiers (including DOIs) of publications
- "ODS_L1_IR_ITEM_LANGUAGE.csv": the language in which the publication has been written (when applicable)
- "ODS_L1_IR_ITEM_MASTER_ALL.csv": basic metadata information of publications (title and date of publication)
- "ODS_L1_IR_ITEM_PUBLISHER.csv": the publishers of publications
- "ODS_L1_IR_ITEM_RELATION.csv": additional metadata related to the context of publications (publication venue, editors, etc.)
These files are connected to one another through a unique item ID tied to each entry. After gathering the files, we converted them into DataFrames with the pandas library.
Two of the CSV files are loaded into DataFrames:
import pandas as pd

df_iris_master = pd.read_csv('./data/iris-data-2024-03-14/ODS_L1_IR_ITEM_MASTER_ALL.csv')
df_iris_identifier = pd.read_csv('./data/iris-data-2024-03-14/ODS_L1_IR_ITEM_IDENTIFIER.csv')
The identifier DataFrame is filtered to remove the rows that have neither a DOI nor an ISBN:
df_iris_identifier_filtered = df_iris_identifier[(df_iris_identifier['IDE_DOI'].notna()) | (df_iris_identifier['IDE_ISBN'].notna())][['ITEM_ID', 'IDE_DOI', 'IDE_ISBN']]
The filtered DataFrame (df_iris_identifier_filtered) is then joined with the master DataFrame on ITEM_ID, which adds the title and the publication date of each publication:
df = df_iris_identifier_filtered.merge(df_iris_master, on='ITEM_ID')
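To make the effect of the filter and the join concrete, here is a small sketch on toy data. The values are invented for illustration, and the `TITLE` column name is an assumption standing in for whatever metadata columns the master table actually contains.

```python
import pandas as pd

# Toy stand-ins for the two IRIS tables (all values are invented).
identifier = pd.DataFrame({
    "ITEM_ID": [1, 2, 3],
    "IDE_DOI": ["10.1000/abc", None, None],
    "IDE_ISBN": [None, "978-0-306-40615-7", None],
})
master = pd.DataFrame({
    "ITEM_ID": [1, 2, 3],
    "TITLE": ["Paper A", "Book B", "Report C"],  # hypothetical column name
})

# Keep the rows that carry at least one of the two identifiers.
has_id = identifier["IDE_DOI"].notna() | identifier["IDE_ISBN"].notna()
filtered = identifier.loc[has_id, ["ITEM_ID", "IDE_DOI", "IDE_ISBN"]]

# Inner join on the shared key: item 3 has no identifier, so it drops out.
merged = filtered.merge(master, on="ITEM_ID")
print(merged[["ITEM_ID", "TITLE"]])
```

Because `merge` defaults to an inner join, items that were filtered out for lacking an identifier do not reappear in the result.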
OpenCitations Meta Dump preparation
OpenCitations Meta is queried through its April 2024 dump. Each CSV file in the archive is converted to Parquet, keeping only the columns needed for the analysis:
import tempfile
from pathlib import Path
from zipfile import ZipFile

import polars as pl
from tqdm import tqdm

openalex_zip = ZipFile('./data/csv_openalex-2024-04-06.zip')
# Create the output folder once, before converting the files
Path("./data/openalex_parquet").mkdir(parents=True, exist_ok=True)
for file in tqdm(openalex_zip.namelist()):
    if file.endswith('.csv'):
        with openalex_zip.open(file) as csv_file:
            # Extract each CSV to a temporary file so that polars can scan it lazily
            with tempfile.NamedTemporaryFile() as tf:
                tf.write(csv_file.read())
                tf.seek(0)
                (
                    pl.scan_csv(tf.name)
                    .select(['id', 'title', 'author', 'type'])
                    .sink_parquet('./data/openalex_parquet/{}.parquet'.format(file.split('/')[-1].replace(".csv", "")))
                )
The Parquet files are then scanned lazily with polars, so that the data is only read when a query is collected:
import glob

parquet_files = glob.glob('./data/openalex_parquet/*.parquet')
meta_lf = (
    pl.scan_parquet(parquet_files)
)
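ID sanitization
The DOIs and ISBNs in the IRIS dataset are inconsistently formatted (extra text, URL prefixes, separators) and need to be cleaned and normalized before they can be matched against OpenCitations Meta.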
import re

dois = df['IDE_DOI'].dropna().unique().tolist()
# filter and normalize the DOIs
doi_rule = re.compile(r'10\.\d{4,}\/[^,\s;]*')
not_doi = []
filtered_dois = []
for doi in dois:
    match = doi_rule.search(doi)
    if match:
        filtered_dois.append('doi:' + match.group())
    else:
        not_doi.append(doi)
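The behaviour of the DOI pattern is easier to judge on a few sample strings. The sketch below wraps the same regular expression in a hypothetical helper, `normalize_doi`; the sample values are invented, and the lowercasing step is an addition that assumes the target dataset stores DOIs in lowercase (DOIs are case-insensitive), so it should be dropped if that assumption does not hold.

```python
import re

# Same pattern as in the protocol: prefix "10.", at least four digits, a slash,
# then everything up to the first comma, semicolon or whitespace.
DOI_RULE = re.compile(r'10\.\d{4,}\/[^,\s;]*')


def normalize_doi(raw):
    """Return 'doi:<value>' for the first DOI-like substring, or None."""
    match = DOI_RULE.search(raw)
    if match is None:
        return None
    # Assumption: lowercasing is not part of the original snippet.
    return "doi:" + match.group().lower()


samples = [
    "https://doi.org/10.1234/ABC.5678",  # URL prefix is discarded
    "10.1000/xyz123; 10.1000/other",     # only the first DOI is kept
    "not available",                     # no DOI at all
]
for sample in samples:
    print(repr(sample), "->", normalize_doi(sample))
```

Note that a field holding several DOIs yields only the first one, which mirrors the use of `search` in the loop above.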
isbns = df['IDE_ISBN'].dropna().unique().tolist()
# filter and normalize the ISBNs
isbn_rule = re.compile(r'(ISBN[-]*(1[03])*[ ]*(: ){0,1})*(([0-9Xx][- ]*){13}|([0-9Xx][- ]*){10})')  # TODO: results still to be checked
not_isbn = []
filtered_isbns = []
for isbn in isbns:
    match = isbn_rule.search(isbn)
    if match:
        # keep only the matched number (group 4), not the whole field,
        # so that prefixes such as "ISBN" or trailing notes are discarded
        filtered_isbns.append('isbn:' + match.group(4).replace('-', '').replace(' ', ''))
    else:
        not_isbn.append(isbn)
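Since the ISBN pattern only checks the shape of the string, one possible way to verify its results is to validate the check digit of each candidate. This is a sketch of that extra check, not a step of the original protocol; `is_valid_isbn` is a hypothetical helper implementing the standard ISBN-10 and ISBN-13 checksum rules.

```python
def is_valid_isbn(candidate):
    """Check the ISBN-10 or ISBN-13 check digit of a plain or hyphenated string."""
    chars = candidate.replace("-", "").replace(" ", "").upper()
    if not chars.isascii():
        return False
    if len(chars) == 10 and chars[:9].isdigit() and (chars[9].isdigit() or chars[9] == "X"):
        # ISBN-10: weights run from 10 down to 1, 'X' stands for 10,
        # and the weighted sum must be divisible by 11.
        digits = [10 if c == "X" else int(c) for c in chars]
        return sum(w * d for w, d in zip(range(10, 0, -1), digits)) % 11 == 0
    if len(chars) == 13 and chars.isdigit():
        # ISBN-13: weights alternate 1, 3, 1, 3, ... and the weighted sum
        # must be divisible by 10.
        return sum((3 if i % 2 else 1) * int(c) for i, c in enumerate(chars)) % 10 == 0
    return False


for value in ["0-306-40615-2", "978-0-306-40615-7", "978-0-306-40615-8"]:
    print(value, is_valid_isbn(value))
```

Candidates that match the pattern but fail the checksum could be moved to `not_isbn` for manual inspection.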
The two lists of normalized identifiers are then concatenated into a single list:
dois_isbns = filtered_dois + filtered_isbns
RQ 1. What is the coverage of the publications available in IRIS (strictly concerning research conducted within the University of Bologna) in OpenCitations Meta?
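rq1_query = (
    pl.scan_parquet(parquet_files, low_memory=True)
    .select(['id', 'type'])
    .with_columns(
        # The 'id' field of Meta holds several space-separated identifiers:
        # keep the OMID in its own column and the first DOI/ISBN as 'id'
        pl.col('id').str.extract(r"(omid:[^\s]+)").alias('omid'),
        pl.col('id').str.extract(r"((?:doi|isbn):[^\s]+)"),
    )
    .select(['omid', 'id', 'type'])
    .drop_nulls('id')
    .filter(
        pl.col("id").is_in(dois_isbns)
    )
    .select(pl.len()).collect()
)
print(rq1_query.item())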
OpenCitations Index Dataset Querying
These are the queries that produced the files:
# 1 Sparql query
# 2 Sparql query
# 3 Sparql query
We then use these CSV files to create the pandas DataFrames on which the analyses are performed.
index_df = pd.read_csv('index.csv')
Research Question answering
What is the coverage of the publications available in IRIS (strictly concerning research conducted within the University of Bologna) in OpenCitations Meta?
Filter the IRIS DataFrame for publications produced at UNIBO.
Check the intersection of the DOIs from iris_df with those in meta_df.
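The intersection step amounts to a set operation on normalized identifiers. As a minimal sketch with invented identifiers (`coverage` is a hypothetical helper, and the inputs are assumed to be already normalized to the `doi:` and `isbn:` prefixed form):

```python
def coverage(iris_ids, meta_ids):
    """Return (number of IRIS ids found in Meta, share of IRIS ids covered)."""
    iris = set(iris_ids)
    if not iris:
        return 0, 0.0
    matched = iris & set(meta_ids)
    return len(matched), len(matched) / len(iris)


# Invented identifiers, for illustration only.
iris_ids = ["doi:10.1000/a", "doi:10.1000/b", "doi:10.1000/c", "isbn:9780306406157"]
meta_ids = ["doi:10.1000/a", "isbn:9780306406157", "doi:10.9999/z"]

found, share = coverage(iris_ids, meta_ids)
print(f"{found} of {len(set(iris_ids))} IRIS identifiers found ({share:.0%})")
```

Reporting the share alongside the raw count makes the result comparable across dumps of different sizes.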
Which are the types of publications that are better covered in OpenCitations Meta?
Check the type of the records obtained for the first question.
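Counting the matched records by type can be sketched as follows. The records and type labels are invented, and grouping empty types under "unspecified" is an assumption about how missing values might be handled.

```python
from collections import Counter

# Invented (identifier, type) pairs standing in for the matched Meta records.
matched_records = [
    ("doi:10.1000/a", "journal article"),
    ("doi:10.1000/b", "journal article"),
    ("isbn:9780306406157", "book"),
    ("doi:10.1000/c", ""),  # Meta records may lack a type
]

# Assumption: records without a type are grouped under one label.
type_counts = Counter(rtype or "unspecified" for _, rtype in matched_records)

for rtype, count in type_counts.most_common():
    print(f"{rtype}: {count}")
```

These are absolute counts; judging which types are "better covered" would also require the number of publications of each type in IRIS as a denominator.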
What is the amount of citations (according to OpenCitations Index) coming from the IRIS publications that are involved in OpenCitations Meta (as citing entity and as cited entity)?
Count the entities from IRIS stored in OpenCitations Meta that cite or are cited in the OpenCitations Index.
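One way to picture this count is to treat each citation as a (citing, cited) pair and test each side against the set of IRIS publications found in Meta. The identifiers below are invented placeholders and `count_citations` is a hypothetical helper; the real data would come from the OpenCitations Index files.

```python
def count_citations(citations, iris_in_meta):
    """Count citations in which an IRIS publication is the citing or the cited entity."""
    as_citing = sum(1 for citing, _ in citations if citing in iris_in_meta)
    as_cited = sum(1 for _, cited in citations if cited in iris_in_meta)
    return as_citing, as_cited


# Invented placeholders: A and B are in IRIS, X and Y are not.
iris_in_meta = {"A", "B"}
citations = [("A", "X"), ("Y", "A"), ("A", "B"), ("X", "Y")]

as_citing, as_cited = count_citations(citations, iris_in_meta)
print("IRIS as citing entity:", as_citing)
print("IRIS as cited entity:", as_cited)
```

A citation between two IRIS publications, such as A to B here, contributes to both totals, so the two numbers should not simply be added together.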
How many of these citations come from and go to publications that are not included in IRIS?
Check the results of the third question to find which citations go to entries not available in IRIS, and which come from entries not available in IRIS.
How many of these citations involve publications in IRIS as both citing and cited entities?
Check the results of the third question to find how many citations in the OpenCitations Index have IRIS publications as both the citing and the cited entity.
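The last two questions partition the citations found for the third question into three groups. The sketch below assumes citations are available as (citing, cited) pairs; the identifiers are invented placeholders, and `classify_citations` and its category names are hypothetical.

```python
def classify_citations(citations, iris_ids):
    """Split citations involving IRIS into internal, outgoing and incoming."""
    counts = {"internal": 0, "outgoing": 0, "incoming": 0}
    for citing, cited in citations:
        if citing in iris_ids and cited in iris_ids:
            counts["internal"] += 1   # both ends are in IRIS
        elif citing in iris_ids:
            counts["outgoing"] += 1   # IRIS cites an external publication
        elif cited in iris_ids:
            counts["incoming"] += 1   # an external publication cites IRIS
        # citations with neither end in IRIS are ignored
    return counts


# Invented placeholders: A and B are in IRIS, X and Y are not.
iris_ids = {"A", "B"}
citations = [("A", "X"), ("Y", "A"), ("A", "B"), ("X", "Y")]

print(classify_citations(citations, iris_ids))
```

Because the three categories are mutually exclusive, their sum gives the number of distinct citations that involve at least one IRIS publication.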
Data Visualization
The visualizations of the results obtained are computed with the _ _ library.