ONT Basecalling, Demultiplexing, and Analysis for Fungal Barcodes

Stephen Douglas Russell

Published: 2022-07-18 DOI: 10.17504/protocols.io.dm6gpbm88lzp/v1

Abstract

This protocol assumes that your MinION run has been completed and the data from the run has been saved. It should take you from raw data to useable FASTA files for each of your fungal barcodes.

Steps

This protocol assumes the experiment name is "FirstRun."

Create a new working folder on the desktop (e.g., FirstRun). Within it, create three folders: "fast5," "Programs," and "NGSpeciesID."
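The same folder layout can be created from a terminal. This is a minimal sketch, assuming the working folder lives at ~/Desktop/FirstRun as in the rest of this protocol; the function name make_run_dirs is only illustrative.

```shell
#!/usr/bin/env bash
# make_run_dirs is a hypothetical helper name, not part of any tool used here.
# It creates the working folder and the three subfolders the protocol expects.
make_run_dirs() {
    local base="$1"
    # -p creates missing parent folders and does not fail if they already exist
    mkdir -p "$base/fast5" "$base/Programs" "$base/NGSpeciesID"
}

# Example: make_run_dirs ~/Desktop/FirstRun
```

Because of `-p`, running it a second time leaves existing folders and their contents untouched.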

Start by copying all of the fast5 files from

/var/lib/minknow/data/./FirstRun/CellName/long_unique_name/fast5

to the newly created fast5 folder (~/Desktop/FirstRun/fast5).

Create an index file from your extraction template papers. This will allow you to link all of your reads with the individual specimens. A template for 7 plates (672 specimens) can be found here:

NANOPORE TEMPLATE THIRD RUN.xlsx

This .xlsx is formatted to use the Lab Code and iNaturalist # columns as the only inputs. It combines these with all of the other columns into a single cell, concatenating them into the final file name. I typically enter the Lab Codes in the iNaturalist "Voucher Number(s)" observation field and then export them all at once into a .csv from iNat. This allows me to copy and paste many iNat numbers at once, without needing to enter any of them manually.

After editing, save as a tab-delimited text file in the NGSpeciesID folder. You will need to remove most of the final columns from the template. The final output should be saved like this:

Index.txt
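Before demultiplexing, it can help to confirm that the saved file really is a consistent tab-delimited table. The sketch below only checks two generic properties: every line has the same number of tab-separated fields as the first line, and no name in the first column is repeated. It does not assume any particular column count or header. The helper name check_index is hypothetical.

```shell
#!/usr/bin/env bash
# check_index is a hypothetical helper, not part of minibar.
# It prints a message for each line whose field count differs from line 1
# and for each repeated first-column value; it returns non-zero if any are found.
check_index() {
    awk -F'\t' '
        NR == 1    { want = NF }
        NF != want { printf "line %d: %d fields, expected %d\n", NR, NF, want; bad = 1 }
        seen[$1]++ { printf "line %d: duplicate name %s\n", NR, $1; bad = 1 }
        END        { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example: check_index ~/Desktop/FirstRun/NGSpeciesID/Index.txt && echo "Index.txt looks consistent"
```

A blank line in the file is reported as a field-count mismatch, which is usually what you want to catch after exporting from a spreadsheet.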

Copy these files (two Python scripts and the primer file) into the Programs folder you just created.

minibar.py

summarize.py

primers.txt

#Run Guppy Basecalling (Pop!_OS 22.04)
guppy_basecaller -x "cuda:all" -i ~/Desktop/FirstRun/fast5 -s ~/Desktop/FirstRun/basecalling --flowcell FLO-FLG001 --kit SQK-LSK110 --records_per_fastq 0 --trim_adapters --trim_strategy dna

For a Flongle cell with 1.15Gb of bases and 1.18M reads, this command takes about 37 minutes to run. Example output:

Init time: 681 ms
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 2216974 ms, Samples called: 13852213920, samples/s: 6.24825e+06
Finishing up any open output files.
Basecalling completed successfully.

Sometimes after the sequencing run I need to restart the computer before this command runs successfully.

#Combine all FASTQ files into a single file (Pop!_OS 22.04)
cat ~/Desktop/FirstRun/basecalling/pass/*runid*.fastq > ~/Desktop/FirstRun/basecalling/pass/basecall.fastq

#Validate the number of reads in your file (Pop!_OS 22.04)
cat ~/Desktop/FirstRun/basecalling/pass/basecall.fastq | wc -l | awk '{print $1/4}'

Example output:

850800
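Dividing the line count by 4 works because each FASTQ record occupies four lines (header, sequence, "+", and quality string), assuming the sequences are not wrapped across lines. A minimal sketch of the same count with a guard for truncated files follows; the helper name count_fastq_reads is hypothetical.

```shell
#!/usr/bin/env bash
# count_fastq_reads is a hypothetical helper.
# It prints the number of reads in a FASTQ file, assuming four lines per record,
# and returns non-zero if the line count is not a multiple of 4.
count_fastq_reads() {
    local lines
    lines=$(wc -l < "$1")
    if [ $((lines % 4)) -ne 0 ]; then
        echo "warning: $1 has $lines lines, not a multiple of 4" >&2
        return 1
    fi
    echo $((lines / 4))
}

# Example: count_fastq_reads ~/Desktop/FirstRun/basecalling/pass/basecall.fastq
```

A warning here usually points to an incomplete copy or an interrupted write, which is worth resolving before the uncombined files are removed in the next step.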

#Remove the uncombined FASTQ files (Pop!_OS 22.04)
rm ~/Desktop/FirstRun/basecalling/pass/*runid*.fastq

#Copy your FASTQ file, demultiplexer, and supporting files to a second location (Pop!_OS 22.04)
cp ~/Desktop/FirstRun/basecalling/pass/basecall.fastq ~/Desktop/FirstRun/NGSpeciesID/basecall.fastq
cp ~/Desktop/FirstRun/Programs/minibar.py ~/Desktop/FirstRun/NGSpeciesID/minibar.py
cp ~/Desktop/FirstRun/Programs/summarize.py ~/Desktop/FirstRun/NGSpeciesID/summarize.py
cp ~/Desktop/FirstRun/Programs/primers.txt ~/Desktop/FirstRun/NGSpeciesID/primers.txt

#MinIONQC.R (Pop!_OS 22.04)
cd ~/Desktop/FirstRun/Programs
Rscript MinIONQC.R -i ~/Desktop/FirstRun/basecalling/sequencing_summary.txt -o ~/Desktop/FirstRun/basecalling/pass/summary/

Example output:

INFO [2022-07-03 17:16:12] Loading input file: /home/user/Desktop/FirstRun/basecalling/sequencing_summary.txt
INFO [2022-07-03 17:16:12] MinION flowcell detected
INFO [2022-07-03 17:16:19] basecalling: creating output directory:/home/user/Desktop/FirstRun/basecalling/pass/summary//basecalling
INFO [2022-07-03 17:16:19] basecalling: summarising input file for flowcell
INFO [2022-07-03 17:16:19] basecalling: plotting length histogram
INFO [2022-07-03 17:16:21] basecalling: plotting mean Q score histogram
INFO [2022-07-03 17:16:22] basecalling: plotting flowcell overview
INFO [2022-07-03 17:16:42] basecalling: plotting flowcell yield over time
INFO [2022-07-03 17:16:50] basecalling: plotting flowcell yield by read length
INFO [2022-07-03 17:16:56] basecalling: plotting sequence length over time
INFO [2022-07-03 17:17:10] basecalling: plotting Q score over time
INFO [2022-07-03 17:17:23] basecalling: plotting reads per hour
INFO [2022-07-03 17:17:25] basecalling: plotting read length vs. q score scatterplot
INFO [2022-07-03 17:17:35] basecalling: plotting flowcell channels summary histograms
INFO [2022-07-03 17:17:35] basecalling: plotting physical overview of output per channel

Review the images that are generated. Ensure the quality scores of your run are in an appropriate range. For a 9.4.1 Flongle with Q20+ (V12) K12 chemistry, I typically get a peak in the 12-13 range.

Mean Q scores for all of the reads in the run. You want to see the peak well above 10. The lower the Q score, the more errors your results will have.
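To make the link between Q score and errors concrete: a Phred-scaled quality Q corresponds to an expected per-base error probability of 10^(-Q/10). The sketch below does that conversion with awk; the helper name q_to_error is hypothetical.

```shell
#!/usr/bin/env bash
# q_to_error is a hypothetical helper.
# It converts a Phred quality score Q to the expected per-base error
# probability 10^(-Q/10), printed with four decimal places.
q_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

# Example: q_to_error 10   prints 0.1000 (about 1 error in 10 bases)
#          q_to_error 12   prints 0.0631 (about 1 error in 16 bases)
```

This is why a peak below 10 is a concern: at Q10 roughly one base in ten is expected to be wrong before consensus building and polishing.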

Example of all outputs from this command: MinIONQC.zip

10.

#Go to your NGSpeciesID folder and use minibar for demultiplexing (Pop!_OS 22.04)
cd ~/Desktop/FirstRun/NGSpeciesID
./minibar.py -F Index.txt basecall.fastq

This should take about 3-4 minutes to run.



Example output:

850800 seqs: H 734497 HH 581546 Hh 72882 hh 53453 IDs 707881 Mult_IDs 105306 (193.4800s)
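After demultiplexing, a quick per-file read count shows which specimens received enough reads to be worth clustering. This is a minimal sketch, assuming the demultiplexed FASTQ files sit in the NGSpeciesID folder and use unwrapped four-line records; the helper name reads_per_file is hypothetical.

```shell
#!/usr/bin/env bash
# reads_per_file is a hypothetical helper.
# It prints "<reads><TAB><file name>" for every .fastq file in a folder,
# sorted with the largest read count first.
reads_per_file() {
    local dir="$1" f
    for f in "$dir"/*.fastq; do
        [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
        printf '%d\t%s\n' "$(( $(wc -l < "$f") / 4 ))" "$(basename "$f")"
    done | sort -k1,1nr
}

# Example: reads_per_file ~/Desktop/FirstRun/NGSpeciesID | head -n 20
```

Piping the output through `tail` instead of `head` lists the specimens with the fewest reads, which are the most likely to fail at the consensus step.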

11.

#Run NGSpeciesID for read filtering, clustering, consensus generation and polishing (Pop!_OS 22.04)
conda activate NGSpeciesID
#Process each FASTQ file in the current folder; results go to a folder named after the file
for file in *.fastq; do
    bn=$(basename "$file" .fastq)
    NGSpeciesID --ont --consensus --sample_size 500 --m 800 --s 400 --medaka --primer_file primers.txt --fastq "$file" --outfolder "$bn"
done

This program will take about 3-6 hours to complete.
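Note that the `*.fastq` pattern matches every FASTQ file in the folder, including the combined basecall.fastq copied there earlier. The sketch below is an optional variation, not part of the original protocol: it lists only the files you might want to pass to NGSpeciesID, leaving out basecall.fastq and any file below a minimum read count. The helper name list_inputs and the threshold value are hypothetical.

```shell
#!/usr/bin/env bash
# list_inputs is a hypothetical helper.
# It prints the paths of .fastq files in a folder that hold at least
# min_reads reads (four lines per record), skipping basecall.fastq.
list_inputs() {
    local dir="$1" min_reads="$2" f n
    for f in "$dir"/*.fastq; do
        [ -e "$f" ] || continue
        if [ "$(basename "$f")" = "basecall.fastq" ]; then
            continue
        fi
        n=$(( $(wc -l < "$f") / 4 ))
        if [ "$n" -ge "$min_reads" ]; then
            printf '%s\n' "$f"
        fi
    done
    return 0
}

# Example (the threshold of 10 reads is an arbitrary illustration):
#   list_inputs ~/Desktop/FirstRun/NGSpeciesID 10
```

Reviewing this list before starting a run that takes several hours is a cheap way to confirm that only the intended files will be processed.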

12.

#Create a summary of your NGSpeciesID consensus data (Pop!_OS 22.04)
python summarize.py ~/Desktop/FirstRun/NGSpeciesID
