Bioinformatic workflow for NGS data control

Khalid El Moussaoui

Published: 2022-05-12 DOI: 10.17504/protocols.io.8epv59bnjg1b/v1

Disclaimer

DISCLAIMER – FOR INFORMATIONAL PURPOSES ONLY; USE AT YOUR OWN RISK

The protocol content here is for informational purposes only and does not constitute legal, medical, clinical, or safety advice, or otherwise; content added to protocols.io is not peer reviewed and may not have undergone a formal approval of any kind. Information presented in this protocol should not substitute for independent professional judgment, advice, diagnosis, or treatment. Any action you take or refrain from taking using or relying upon the information presented here is strictly at your own risk. You agree that neither the Company nor any of the authors, contributors, administrators, or anyone else associated with protocols.io, can be held responsible for your use of the information contained in or linked to this protocol or any of our Sites/Apps and Services.

Abstract

Workflow for checking the integrity and quality of high-throughput sequencing data generated on an Illumina NovaSeq 6000. The analyses are performed on macOS Monterey 12.3.1 running on an ARM-based Apple Silicon processor. This workflow assumes that the user directory (~/) is structured as described in the "work environment configuration" protocol. To avoid error messages, please follow that protocol and set up your computer before starting.

Steps

Activation of the environment

Open a terminal window.

Software

Label	Value
NAME	Terminal
OS_NAME	macOS Monterey
OS_VERSION	12.3.1
DEVELOPER	Apple Inc.
VERSION	2.12.5

Activate the previously created QC_env environment by typing the following command in the terminal:

conda activate QC_env

Data integrity check

This step assumes that the .gz archive downloaded from the GIGA servers has been unzipped under ~/fastq_files, that the original_md5.txt file is stored under ~/md5, and that the previously created Python and R scripts are stored under ~/KE_utilities. Type the following command in the terminal to recompute the MD5 hash of each file and store the results in a new file under ~/md5:

md5 ~/fastq_files/* > ~/md5/recomputed_md5.txt
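
For readers who prefer to recompute the hashes from Python, the following is a minimal sketch using the standard hashlib module. It is not part of the original protocol: the function names md5_of_file and md5_lines are hypothetical, and the output lines imitate the "MD5 (path) = hash" layout printed by the macOS md5 command.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use low for large FASTQ files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_lines(directory):
    """Build one 'MD5 (path) = hash' line per regular file in directory."""
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())
    return [f"MD5 ({p}) = {md5_of_file(p)}" for p in files]
```

Writing "\n".join(md5_lines(...)) to ~/md5/recomputed_md5.txt should give a file comparable to the one produced by the md5 command, assuming the directory contains only the sequencing files.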

After generating the ~/md5/recomputed_md5.txt file, type the following command in the terminal to launch the Python script that performs the data integrity check:

python3 ~/KE_utilities/data_integrity_checker.py

This script can be downloaded from GitHub: https://github.com/elmoussaoui-k/drylab_workflow/blob/b41c0a03d7a3fdb023a94b331cab5460d706b6c8/data_integrity_checker.py

When prompted, enter the path to the original_md5.txt file and then the path to the recomputed_md5.txt file:

*************** DATA INTEGRITY CHECKER ***************

Please enter the path to original_md5.txt : /users/khalid/md5/original_md5.txt
Please enter the path to recomputed_md5.txt : /users/khalid/md5/recomputed_md5.txt

------------------------------------------------------
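
To illustrate the kind of comparison such a check involves, here is a minimal sketch. It is not the author's data_integrity_checker.py, and it rests on assumptions: the layout of original_md5.txt is not described in this protocol, so the parser accepts both the "MD5 (path) = hash" layout of the macOS md5 command and the "hash  filename" layout of md5sum, and files are matched by base name. The names parse_md5_lines and compare_md5 are hypothetical.

```python
import os
import re

# Layout printed by the macOS md5 command: MD5 (path) = hash
BSD_LINE = re.compile(r"^MD5 \((?P<name>.+)\) = (?P<md5>[0-9a-fA-F]{32})$")
# Layout printed by md5sum (assumed possible for original_md5.txt): hash  filename
GNU_LINE = re.compile(r"^(?P<md5>[0-9a-fA-F]{32})\s+\*?(?P<name>.+)$")


def parse_md5_lines(lines):
    """Map base file name -> lowercase MD5 hash for every recognised line."""
    hashes = {}
    for line in lines:
        line = line.strip()
        match = BSD_LINE.match(line) or GNU_LINE.match(line)
        if match:  # unrecognised lines (blank lines, comments) are skipped
            hashes[os.path.basename(match["name"])] = match["md5"].lower()
    return hashes


def compare_md5(original, recomputed):
    """Return (matching, mismatched, missing) lists of file names."""
    matching, mismatched, missing = [], [], []
    for name, expected in sorted(original.items()):
        if name not in recomputed:
            missing.append(name)
        elif recomputed[name] == expected:
            matching.append(name)
        else:
            mismatched.append(name)
    return matching, mismatched, missing
```

Each file can be read with open(path).read().splitlines() before parsing. Any entry in the mismatched or missing lists would indicate a file that should be downloaded again.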

Run fastQC

Start the fastQC analysis on all files in the ~/fastq_files directory. The shell expands "*" to every file directly inside that directory; it does not descend into subdirectories. The --outdir option specifies the output directory for the reports generated by fastQC, and this directory must already exist because fastQC does not create it. An individual .html report is generated for each file.

fastqc ~/fastq_files/* --outdir ~/fastqc_reports/

A generated report can be opened by typing the following command in the terminal, replacing KE0xx with the sample name:

open ~/fastqc_reports/KE0xx_R1_fastqc.html
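
If some FASTQ files sit in subfolders, a plain "*" will not reach them. The sketch below, which is an illustration and not part of the original protocol, lists the files explicitly (optionally recursively) and builds the same fastqc command as an argument list. The function names and the list of accepted file suffixes are assumptions.

```python
from pathlib import Path

# Assumed set of FASTQ file endings; adjust to the naming used on your system.
FASTQ_SUFFIXES = (".fastq", ".fastq.gz", ".fq", ".fq.gz")


def find_fastq_files(directory, recursive=False):
    """List FASTQ files in directory; with recursive=True, also in subfolders."""
    root = Path(directory)
    candidates = root.rglob("*") if recursive else root.iterdir()
    return sorted(
        p for p in candidates
        if p.is_file() and p.name.endswith(FASTQ_SUFFIXES)
    )


def build_fastqc_command(fastq_files, outdir):
    """Return the fastqc argument list for the given files and report folder."""
    return ["fastqc", *map(str, fastq_files), "--outdir", str(outdir)]
```

To run it, create the report folder first (for example with Path.mkdir(parents=True, exist_ok=True)) and pass the returned list to subprocess.run with check=True.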

Run multiQC

To summarize the reports generated with fastQC into a single report, run multiQC by typing the following command in the terminal:

multiqc ~/fastqc_reports --outdir ~/multiqc_report

The generated report can be opened by typing the following command in the terminal:

open ~/multiqc_report/multiqc_report.html

Filter reads with fastp

The reads can be filtered automatically with fastp. Launch the program, specify the two .fastq.gz files (R1 and R2) as input with -i and -I, and specify the name and location of the two processed files with -o and -O. The -h option sets the path and file name of the HTML report. The option -j "" cancels the creation of the JSON report. The -R option sets the title of the generated HTML report.

fastp -i ~/fastq_files/KE0xx_R1.fastq.gz \
-I ~/fastq_files/KE0xx_R2.fastq.gz \
-o ~/fastp/cleaned_fastq_files/KE0xx_R1_clean.fastq.gz \
-O ~/fastp/cleaned_fastq_files/KE0xx_R2_clean.fastq.gz \
-h ~/fastp/fastp_reports/KE0xx_fastp_report.html \
-j "" \
-R "Fastp report : KE0xx"
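
When several samples have to be filtered, the command can be generated once per R1/R2 pair. The following is a minimal sketch under stated assumptions, not part of the original protocol: it assumes that files follow the SAMPLE_R1.fastq.gz / SAMPLE_R2.fastq.gz naming shown in this protocol, and the function names pair_reads and build_fastp_command are hypothetical. The fastp options are the ones used in the command in this step.

```python
from pathlib import Path

R1_SUFFIX = "_R1.fastq.gz"
R2_SUFFIX = "_R2.fastq.gz"


def pair_reads(fastq_dir):
    """Map sample name -> (R1 path, R2 path) for every complete pair."""
    fastq_dir = Path(fastq_dir)
    pairs = {}
    for r1 in sorted(fastq_dir.glob("*" + R1_SUFFIX)):
        sample = r1.name[: -len(R1_SUFFIX)]
        r2 = fastq_dir / (sample + R2_SUFFIX)
        if r2.is_file():  # samples without an R2 file are skipped
            pairs[sample] = (r1, r2)
    return pairs


def build_fastp_command(sample, r1, r2, clean_dir, report_dir):
    """Return the fastp argument list for one sample."""
    clean_dir, report_dir = Path(clean_dir), Path(report_dir)
    return [
        "fastp",
        "-i", str(r1),
        "-I", str(r2),
        "-o", str(clean_dir / f"{sample}_R1_clean.fastq.gz"),
        "-O", str(clean_dir / f"{sample}_R2_clean.fastq.gz"),
        "-h", str(report_dir / f"{sample}_fastp_report.html"),
        "-j", "",  # same empty JSON file name as in the protocol's command
        "-R", f"Fastp report : {sample}",
    ]
```

Each returned list can be passed to subprocess.run with check=True, provided that fastp is available in the active environment and that the output folders already exist.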