diff --git a/.gitattributes b/.gitattributes index c96b83b44b0c9fbdd6063ce550d5a57f67a0f3e7..939da154525baf51e372ed5659fe82d504895b57 100644 --- a/.gitattributes +++ b/.gitattributes @@ -1,2 +1 @@ -*.config linguist-language=groovy -*.nf linguist-language=groovy \ No newline at end of file +*.nf gitlab-language=groovy diff --git a/README.md b/README.md index 5d1d04e79987abe93720ed873e70ce787db17cfe..d677ad284d6bee334f44d3f746db5a6b5519d045 100644 --- a/README.md +++ b/README.md @@ -1,45 +1,45 @@ -# metagWGS +# metagWGS: Documentation ## Introduction -**metagWGS** is a [Nextflow](https://www.nextflow.io/docs/latest/index.html#) bioinformatics analysis pipeline used for **metag**enomic **W**hole **G**enome **S**hotgun sequencing data (Illumina HiSeq3000 or NovaSeq, paired, 2\*150bp). +**metagWGS** is a [Nextflow](https://www.nextflow.io/docs/latest/index.html#) bioinformatics analysis pipeline used for **metag**enomic **W**hole **G**enome **S**hotgun sequencing data (Illumina HiSeq3000 or NovaSeq, paired, 2\*150bp ; PacBio HiFi reads, single-end). ### Pipeline graphical representation -The workflow processes raw data from `.fastq` or `.fastq.gz` inputs and do the modules represented into this figure: - +The workflow processes raw data from `.fastq/.fastq.gz` input and/or assemblies (contigs) `.fa/.fasta` and uses the modules represented in this figure: + ### metagWGS steps -metagWGS is splitted into different steps that correspond to different parts of the bioinformatics analysis: +metagWGS is split into different steps that correspond to different parts of the bioinformatics analysis: -* `01_clean_qc` (can ke skipped) +* `S01_CLEAN_QC` (can be stopped at with `--stop_at_clean` ; can be skipped with `--skip_clean`) * trims adapters sequences and deletes low quality reads ([Cutadapt](https://cutadapt.readthedocs.io/en/stable/#), [Sickle](https://github.com/najoshi/sickle)) * suppresses host contaminants ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/) + [Bedtools](https://bedtools.readthedocs.io/en/latest/)) * controls the quality of raw and cleaned data ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)) - * makes a taxonomic classification of cleaned reads ([Kaiju MEM](https://github.com/bioinformatics-centre/kaiju) + [kronaTools](https://github.com/marbl/Krona/wiki/KronaTools) + [Generate_barplot_kaiju.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/Generate_barplot_kaiju.py) + [merge_kaiju_results.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/merge_kaiju_results.py)) -* `02_assembly` - * assembles cleaned reads (combined with `01_clean_qc` step) or raw reads (combined with `--skip_01_clean_qc` parameter) ([metaSPAdes](https://github.com/ablab/spades) or [Megahit](https://github.com/voutcn/megahit)) + * makes a taxonomic classification of cleaned reads ([Kaiju MEM](https://github.com/bioinformatics-centre/kaiju) + [kronaTools](https://github.com/marbl/Krona/wiki/KronaTools) + [Generate_barplot_kaiju.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Generate_barplot_kaiju.py) + [merge_kaiju_results.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/merge_kaiju_results.py)) +* `S02_ASSEMBLY` (can be stopped at with `--stop_at_assembly`) + * assembles cleaned reads (combined with `S01_CLEAN_QC` step) or raw reads (combined with `--skip_clean` parameter) ([metaSPAdes](https://github.com/ablab/spades) or [Megahit](https://github.com/voutcn/megahit)) * 
assesses the quality of assembly ([metaQUAST](http://quast.sourceforge.net/metaquast)) - * deduplicates cleaned reads (combined with `01_clean_qc` step) or raw reads (combined with `--skip_01_clean_qc` parameter) ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/) + [Bedtools](https://bedtools.readthedocs.io/en/latest/)) -* `03_filtering` (can be skipped) - * filters contigs with low CPM value ([Filter_contig_per_cpm.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/Filter_contig_per_cpm.py) + [metaQUAST](http://quast.sourceforge.net/metaquast)) -* `04_structural_annot` - * makes a structural annotation of genes ([Prokka](https://github.com/tseemann/prokka) + [Rename_contigs_and_genes.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/Rename_contigs_and_genes.py)) -* `05_alignment` + * deduplicates cleaned reads (combined with `S01_CLEAN_QC` step) or raw reads (combined with `--skip_clean` parameter) ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/) + [Bedtools](https://bedtools.readthedocs.io/en/latest/)) +* `S03_FILTERING` (can be stopped at with `--stop_at_filtering` ; can be skipped with `--skip_assembly`) + * filters contigs with low CPM value ([Filter_contig_per_cpm.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Filter_contig_per_cpm.py) + [metaQUAST](http://quast.sourceforge.net/metaquast)) +* `S04_STRUCTURAL_ANNOT` (can be stopped at with `--stop_at_structural_annot`) + * makes a structural annotation of genes ([Prokka](https://github.com/tseemann/prokka) + [Rename_contigs_and_genes.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Rename_contigs_and_genes.py)) +* `S05_ALIGNMENT` * aligns reads to the contigs ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/)) * aligns the protein sequence of genes against a protein database ([DIAMOND](https://github.com/bbuchfink/diamond)) -* `06_func_annot` - * makes a sample and global clustering of genes ([cd-hit-est](http://weizhongli-lab.org/cd-hit/) + [cd_hit_produce_table_clstr.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/cd_hit_produce_table_clstr.py)) - * quantifies reads that align with the genes ([featureCounts](http://subread.sourceforge.net/) + [Quantification_clusters.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/Quantification_clusters.py)) - * makes a functional annotation of genes and a quantification of reads by function ([eggNOG-mapper](http://eggnog-mapper.embl.de/) + [best_bitscore_diamond.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/best_bitscore_diamond.py) + [merge_abundance_and_functional_annotations.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/merge_abundance_and_functional_annotations.py) + [quantification_by_functional_annotation.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/quantification_by_functional_annotation.py)) -* `07_taxo_affi` - * taxonomically affiliates the genes ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/aln2taxaffi.py)) - * taxonomically affiliates the contigs ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/aln2taxaffi.py)) - * counts the number of reads and contigs, for each taxonomic affiliation, per taxonomic level ([Samtools](http://www.htslib.org/) + 
[merge_contig_quantif_perlineage.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/merge_contig_quantif_perlineage.py) + [quantification_by_contig_lineage.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/quantification_by_contig_lineage.py)) -* `08_binning` from [nf-core/mag 1.0.0](https://github.com/nf-core/mag/releases/tag/1.0.0) - * makes binning of contigs ([MetaBAT2](https://bitbucket.org/berkeleylab/metabat/src/master/)) - * assesses bins ([BUSCO](https://busco.ezlab.org/) + [metaQUAST](http://quast.sourceforge.net/metaquast) + [summary_busco.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/summary_busco.py) and [combine_tables.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/combine_tables.py) from [nf-core/mag](https://github.com/nf-core/mag)) - * taxonomically affiliates the bins ([BAT](https://github.com/dutilh/CAT)) +* `S06_FUNC_ANNOT` (can be skipped with `--skip_func_annot`) + * makes a sample and global clustering of genes ([cd-hit-est](http://weizhongli-lab.org/cd-hit/) + [cd_hit_produce_table_clstr.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/cd_hit_produce_table_clstr.py)) + * quantifies reads that align with the genes ([featureCounts](http://subread.sourceforge.net/) + [Quantification_clusters.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Quantification_clusters.py)) + * makes a functional annotation of genes and a quantification of reads by function ([eggNOG-mapper](http://eggnog-mapper.embl.de/) + [best_bitscore_diamond.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/best_bitscore_diamond.py) + [merge_abundance_and_functional_annotations.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/merge_abundance_and_functional_annotations.py) + [quantification_by_functional_annotation.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/quantification_by_functional_annotation.py)) +* `S07_TAXO_AFFI` (can be skipped with `--skip_taxo_affi`) + * taxonomically affiliates the genes ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/aln2taxaffi.py)) + * taxonomically affiliates the contigs ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/aln2taxaffi.py)) + * counts the number of reads and contigs, for each taxonomic affiliation, per taxonomic level ([Samtools](http://www.htslib.org/) + [merge_contig_quantif_perlineage.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/merge_contig_quantif_perlineage.py) + [quantification_by_contig_lineage.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/quantification_by_contig_lineage.py)) +* `S08_BINNING` (not yet implemented) + * binning strategies for assemblies and co-assemblies + +All steps are launched one after another by default. Use `--stop_at_[STEP]` and `--skip_[STEP]` parameters to tailor execution to your needs. A report html file is generated at the end of the workflow with [MultiQC](https://multiqc.info/). @@ -49,28 +49,15 @@ Two [Singularity](https://sylabs.io/docs/) containers are available making insta ## Documentation -metagWGS documentation is available [here](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/tree/dev/docs). - -## License -metagWGS is distributed under the GNU General Public License v3. 
- -## Copyright -2021 INRAE - -## Funded by -Anti-Selfish (Labex ECOFECT – N° 00002455-CT15000562) - -France Génomique National Infrastructure (funded as part of Investissement d’avenir program managed by Agence Nationale de la Recherche, contract ANR-10-INBS-09) - -With participation of SeqOccIn members financed by FEDER-FSE MIDI-PYRENEES ET GARONNE 2014-2020. - -## Citation -metagWGS has been presented at JOBIM 2020: - -Poster "Whole metagenome analysis with metagWGS", J. Fourquet, C. Noirot, C. Klopp, P. Pinton, S. Combes, C. Hoede, G. Pascal. - -https://www.sfbi.fr/sites/sfbi.fr/files/jobim/jobim2020/posters/compressed/jobim2020_poster_9.pdf - -metagWGS has been presented at JOBIM 2019 and at Genotoul Biostat Bioinfo day: - -Poster "Whole metagenome analysis with metagWGS", J. Fourquet, A. Chaubet, H. Chiapello, C. Gaspin, M. Haenni, C. Klopp, A. Lupo, J. Mainguy, C. Noirot, T. Rochegue, M. Zytnicki, T. Ferry, C. Hoede. +The metagWGS documentation can be found in the following pages: + + * [Installation](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/installation.md) + * The pipeline installation procedure. + * [Usage](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md) + * An overview of how the pipeline works, how to run it and a description of all of the different command-line flags. + * [Output](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/output.md) + * An overview of the different output files and directories produced by the pipeline. + * [Use case](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/use_case.md) + * A tutorial to learn how to launch the pipeline on a test dataset on [genologin cluster](http://bioinfo.genotoul.fr/). + * [Functional tests](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/functional_tests/README.md) + * (for developers) A tool to launch a new version of the pipeline on curated input data and compare its results with known output. diff --git a/assets/hifi_multiqc_config.yaml b/assets/hifi_multiqc_config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..aaa4fc27a8fcb12e50b53eccb830378c4dfbb83e --- /dev/null +++ b/assets/hifi_multiqc_config.yaml @@ -0,0 +1,37 @@ +report_comment: > + This report has been generated by the <a href="https://forgemia.inra.fr/genotoul-bioinfo/metagwgs" target="_blank">genotoul-bioinfo/metagwgs</a> + analysis pipeline. For information about how to interpret these results, please see the + <a href="https://forgemia.inra.fr/genotoul-bioinfo/metagwgs" target="_blank">documentation</a>. 
+ +extra_fn_clean_trim: + - "hifi_" + - '.count_reads_on_contigs' + - '_scaffolds' + - '.txt' + - '.contigs' + - '.sort' + +module_order: + - fastqc: + name: 'FastQC' + path_filters: + - '*hifi_*.zip' + - quast: + name: 'Quast primary assembly' + info: 'This section of the report shows quast results after assembly' + path_filters: + - '*quast_hifi/*/report.tsv' + - prokka + - featureCounts + +prokka_fn_snames: True +prokka_table: True + +featurecounts: + fn: '*.summary' + shared: true + +table_columns_visible: + FastQC: + percent_duplicates: False + percent_gc: False diff --git a/assets/multiqc_config.yaml b/assets/sr_multiqc_config.yaml similarity index 89% rename from assets/multiqc_config.yaml rename to assets/sr_multiqc_config.yaml index e8f8b6deca82a1315ee69dd774f2a20085fdb361..802395df6837fba7f65d42a3ca251e44c905d0af 100644 --- a/assets/multiqc_config.yaml +++ b/assets/sr_multiqc_config.yaml @@ -34,11 +34,6 @@ module_order: info: 'This section reports of the reads alignement against host genome with bwa.' path_filters: - '*.no_filter.flagstat' - - samtools: - name : 'Reads aln on host genome' - info: 'This section of the cleaned reads alignement against host genome with bwa.' - path_filters: - - '*host_filter/*' - samtools: name : 'Reads after host reads filter' info: 'This section reports of the cleaned reads alignement against host genome with bwa.' @@ -54,12 +49,12 @@ module_order: name: 'Quast primary assembly' info: 'This section of the report shows quast results after assembly' path_filters: - - '*_all_contigs_QC/*' + - '*quast_primary/*/report.tsv' - quast: name: 'Quast filtered assembly' info: 'This section of the report shows quast results after filtering of assembly' path_filters: - - '*_select_contigs_QC/*' + - '*quast_filtered/*/report.tsv' - samtools: name : 'Reads after deduplication' info: 'This section reports of deduplicated reads alignement against contigs with bwa.' diff --git a/bin/Rename_contigs_and_genes.py b/bin/Rename_contigs_and_genes.py index 1bee2f35b7859c436fbe780839c13b4e704b4476..7d0e94d4a9de9e5af57d1ebcd1b6ea0922f57acd 100755 --- a/bin/Rename_contigs_and_genes.py +++ b/bin/Rename_contigs_and_genes.py @@ -79,7 +79,7 @@ to_write = [] #contig_renames [ald_name]=newname #reecriture du fasta -with open(args.fnaFile, "rU") as fnaFile,\ +with open(args.fnaFile, "r") as fnaFile,\ open(args.outFNAFile, "w") as outFNA_handle: for record in SeqIO.parse(fnaFile, "fasta"): try : @@ -112,7 +112,13 @@ with open(args.file) as gffFile,\ #Generate correspondance old_prot_name = feature.qualifiers['ID'][0].replace("_gene","") prot_number = old_prot_name.split("_")[-1] - new_prot_name = new_ctg_name + "." 
+ prot_prefix + prot_number + + subfeat_types = {subfeat.type for subfeat in feature.sub_features} + assert len(subfeat_types) == 1, f'Subfeature have different types {subfeat_types}' + subfeat_type = subfeat_types.pop() + + + new_prot_name = f"{new_ctg_name}.{subfeat_type}_{prot_number}" prot_names[old_prot_name] = new_prot_name fh_prot_table.write(old_prot_name + "\t" + new_prot_name + "\n") @@ -134,7 +140,7 @@ with open(args.file) as gffFile,\ GFF.write(to_write, out_handle) -with open(args.fastaFile, "rU") as handle,\ +with open(args.fastaFile, "r") as handle,\ open(args.outFAAFile, "w") as outFasta_handle: for record in SeqIO.parse(handle, "fasta"): try : @@ -147,7 +153,7 @@ with open(args.fastaFile, "rU") as handle,\ pass -with open(args.ffnFile, "rU") as handle,\ +with open(args.ffnFile, "r") as handle,\ open(args.outFFNFile, "w") as outFFN_handle: for record in SeqIO.parse(handle, "fasta"): try : diff --git a/bin/aln2taxaffi.py b/bin/aln2taxaffi.py index 0676365a06728bf65cdf117d916a44b4a83fabab..7cbfaa679a868311a24ab94e8302352dbcc7ba56 100755 --- a/bin/aln2taxaffi.py +++ b/bin/aln2taxaffi.py @@ -43,7 +43,7 @@ except ImportError as error: # Variables # These are identities normalized with query coverage: -MIN_IDENTITY_TAXA = (0.40,0.50,0.60,0.70,0.80,0.90,0.95) +MIN_IDENTITY_TAXA = (0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95) # Fraction of weights needed to assign at a specific level, # a measure of concensus at that level. @@ -60,12 +60,14 @@ d_taxonomy = {} # Define taxonomy main levels global main_level main_level = \ -["superkingdom", "phylum", "class", "order", "family", "genus", "species"] + ["superkingdom", "phylum", "class", "order", "family", "genus", "species"] -##### SAME AS Script_renameVcn.py -prot_prefix = "Prot_" +# SAME AS Script_renameVcn.py +prot_prefix = "CDS_" # Definition of the class Node + + class Node: def __init__(self): self.tax_id = 0 # Number of the tax id. @@ -73,11 +75,12 @@ class Node: self.children = [] # List of the children of this node self.tip = 0 # Tip=1 if it's a terminal node, 0 if not. self.name = "" # Name of the node: taxa if it's a terminal node, - # numero if not. + # numero if not. self.level = "None" + def genealogy(self): # Trace genealogy from root to leaf ancestors = [] # Initialize the list of all nodes - # from root to leaf. + # from root to leaf. tax_id = self.tax_id # Define leaf while 1: if tax_id in d_taxonomy: @@ -91,9 +94,10 @@ class Node: ancestors.append(tax_id) break return ancestors # Return the list - def fullnamelineage(self): # Trace genealogy from root to leaf + + def fullnamelineage(self): # Trace genealogy from root to leaf ancestors = [] # Initialise the list of all nodes - # from root to leaf. + # from root to leaf. tax_id = self.tax_id # Define leaf while 1: if tax_id in d_taxonomy: @@ -104,15 +108,16 @@ class Node: if tax_id == "1": break ancestors.reverse() - return "; ".join(ancestors) # Return the list + return "; ".join(ancestors) # Return the list + def genealogy_main_level(self): ancestors = ["None"] * 7 # Initialise the list of all nodes - # from root to leaf. + # from root to leaf. tax_id = self.tax_id while 1: if tax_id in d_taxonomy: cur_level = d_taxonomy[tax_id].level - if cur_level in main_level : + if cur_level in main_level: ancestors[main_level.index(cur_level)] = tax_id tax_id = d_taxonomy[tax_id].parent else: @@ -120,27 +125,30 @@ class Node: if tax_id == "1": # If it is the root, we reached the end. 
break - return ancestors # Return the list + return ancestors # Return the list + def lineage_main_level(self): ancestors = ["None"] * 7 # Initialise the list of all nodes - # from root to leaf. + # from root to leaf. ancestors_tax_id = ["None"] * 7 # Initialise the list of all nodes tax_id = self.tax_id while 1: if tax_id in d_taxonomy: cur_level = d_taxonomy[tax_id].level - if cur_level in main_level : + if cur_level in main_level: ancestors[main_level.index(cur_level)] = d_taxonomy[tax_id].name - ancestors_tax_id [main_level.index(cur_level)] = str(tax_id) + ancestors_tax_id[main_level.index(cur_level)] = str(tax_id) tax_id = d_taxonomy[tax_id].parent else: break if tax_id == "1": # If it is the root, we reached the end. break - return ("; ".join(ancestors), "; ".join(ancestors_tax_id))# Return the two lists + return ("; ".join(ancestors), "; ".join(ancestors_tax_id)) # Return the two lists # Function to find common ancestor between two nodes or more + + def common_ancestor(node_list): global d_taxonomy # Define the whole genealogy of the first node @@ -150,15 +158,16 @@ def common_ancestor(node_list): list2 = d_taxonomy[node].genealogy() ancestral_list = [] for i in list1: - if i in list2: # Identify common nodes between the two genealogy + if i in list2: # Identify common nodes between the two genealogy ancestral_list.append(i) - list1 = ancestral_list # Reassing ancestral_list to list 1. + list1 = ancestral_list # Reassing ancestral_list to list 1. # Finally, the first node of the ancestra_list is the common ancestor # of all nodes. common_ancestor = ancestral_list[0] # Return a node return common_ancestor + def load_taxonomy(directory): # Load taxonomy global d_taxonomy @@ -177,9 +186,8 @@ def load_taxonomy(directory): d_name_by_tax_id[tax_id] = name # ... and load them d_name_by_tax_id_reverse[name] = tax_id # ... into dictionaries - # Load taxonomy NCBI file ("nodes.dmp") - with open(os.path.join(directory, "nodes.dmp"), "r") as taxonomy_file : + with open(os.path.join(directory, "nodes.dmp"), "r") as taxonomy_file: for line in taxonomy_file: line = line.rstrip().replace("\t", "") tab = line.split("|") @@ -194,10 +202,10 @@ def load_taxonomy(directory): if tax_id not in d_taxonomy: d_taxonomy[tax_id] = Node() d_taxonomy[tax_id].tax_id = tax_id # Assign tax_id - d_taxonomy[tax_id].parent = tax_id_parent # Assign tax_id parent + d_taxonomy[tax_id].parent = tax_id_parent # Assign tax_id parent d_taxonomy[tax_id].name = name # Assign name - d_taxonomy[tax_id].level = str(tab[2].strip()) # Assign level - if tax_id_parent in d_taxonomy: + d_taxonomy[tax_id].level = str(tab[2].strip()) # Assign level + if tax_id_parent in d_taxonomy: children = d_taxonomy[tax_id].children # If parent is already in the object children.append(tax_id) # ... we found its children d_taxonomy[tax_id].children = children # ... so add them to the parent. @@ -205,6 +213,7 @@ def load_taxonomy(directory): # END Functions for taxonomy taxdump.tar.gz ################################################ + def read_query_length_file(query_length_file): lengths = {} for line in open(query_length_file): @@ -212,16 +221,17 @@ def read_query_length_file(query_length_file): lengths[queryid] = float(length) return lengths -def read_blast_input(blastinputfile,lengths,min_identity,max_matches,min_coverage): -#c1.Prot_00001 EFK63346.1 100.0 85 0 0 1 85 62 146 1.6e-36 158.3 85 \ -# 146 EFK63346.1 LOW QUALITY PROTEIN: hypothetical protein HMPREF9008_04720, partial [Parabacteroides sp. 
20_3] + +def read_blast_input(blastinputfile, lengths, min_identity, max_matches, min_coverage): + # c1.Prot_00001 EFK63346.1 100.0 85 0 0 1 85 62 146 1.6e-36 158.3 85 \ + # 146 EFK63346.1 LOW QUALITY PROTEIN: hypothetical protein HMPREF9008_04720, partial [Parabacteroides sp. 20_3] #queryId, subjectId, percIdentity, alnLength, mismatchCount, gapOpenCount, #queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore, queryLength, subjectLength, subjectTitle matches = defaultdict(list) accs = Counter() - nmatches = Counter(); + nmatches = Counter() with open(blastinputfile) as blast_handler: reader = csv.DictReader(blast_handler, delimiter='\t') @@ -231,24 +241,24 @@ def read_blast_input(blastinputfile,lengths,min_identity,max_matches,min_coverag # queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore, queryLength, subjectLength, subjectTitle) \ # = line.rstrip().split("\t") - if aln['sseqid'].startswith("gi|") : + if aln['sseqid'].startswith("gi|"): m = re.search(r"gi\|.*?\|.*\|(.*)\|", aln['sseqid']) acc = m.group(1) - else : + else: acc = aln['sseqid'] qLength = lengths[aln['qseqid']] alnLength_in_query = abs(int(aln['qend']) - int(aln['qstart'])) + 1 fHit = float(alnLength_in_query) / qLength coverage = fHit * 100 fHit *= float(aln['pident']) / 100.0 - fHit = min(1.0,fHit) + fHit = min(1.0, fHit) #hits[queryId] = hits[queryId] + 1 if float(aln['pident']) > min_identity and nmatches[aln['qseqid']] < max_matches and float(coverage) > min_coverage: matches[aln['qseqid']].append((acc, fHit)) nmatches[aln['qseqid']] += 1 - accs[acc] +=1 + accs[acc] += 1 - return (OrderedDict(sorted(matches.items(), key = lambda t: t[0])), list(accs.keys())) + return (OrderedDict(sorted(matches.items(), key=lambda t: t[0])), list(accs.keys())) def map_accessions(accs, mapping_file): @@ -263,53 +273,55 @@ def map_accessions(accs, mapping_file): return mappings -def get_consensus (collate_table): - #From collapse_hit retrieve consensus tax_id and lineage - #Compute best lineage consensus +def get_consensus(collate_table): + # From collapse_hit retrieve consensus tax_id and lineage + # Compute best lineage consensus for depth in range(6, -1, -1): collate = collate_table[depth] dWeight = sum(collate.values()) - sortCollate = sorted(list(collate.items()), key = operator.itemgetter(1), reverse = True) + sortCollate = sorted(list(collate.items()), key=operator.itemgetter(1), reverse=True) nL = len(collate) if nL > 0: dP = 0.0 if dWeight > 0.0: dP = float(sortCollate[0][1]) / dWeight if dP > MIN_FRACTION: - (fullnamelineage_text, fullnamelineage_ids) = d_taxonomy[str(sortCollate[0][0])].lineage_main_level() + (fullnamelineage_text, fullnamelineage_ids) = d_taxonomy[str( + sortCollate[0][0])].lineage_main_level() tax_id_keep = str(sortCollate[0][0]) return (tax_id_keep, fullnamelineage_text, fullnamelineage_ids) - return (1,"Unable to found taxonomy consensus",1) + return (1, "Unable to found taxonomy consensus", 1) + def main(argv): parser = argparse.ArgumentParser() - parser.add_argument("aln_input_file", \ - help = "file with blast/diamond matches expected format m8 \ + parser.add_argument("aln_input_file", + help="file with blast/diamond matches expected format m8 \ \nqueryId, subjectId, percIdentity, alnLength, mismatchCount, gapOpenCount,\ queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore") - parser.add_argument("query_length_file", \ - help = "tab delimited file of query lengths") - parser.add_argument('-a','--acc_taxaid_mapping_file', \ - help = "mapping from accession to taxaid 
gzipped") - parser.add_argument('-t','--taxonomy', \ - help = "path of taxdump.tar.gz extracted directory") - parser.add_argument('-o','--output_file', type = str, \ - default = "taxonomyassignation", help = ("string specifying output file")) - parser.add_argument('-i','--identity', default = 60, \ - help = "percentage of identity") - parser.add_argument('-m','--max_matches', default = 10, \ - help = "max number of matches to analyze") - parser.add_argument('-c','--min_coverage', default = 70, \ - help = "percentage of coverage") + parser.add_argument("query_length_file", + help="tab delimited file of query lengths") + parser.add_argument('-a', '--acc_taxaid_mapping_file', + help="mapping from accession to taxaid gzipped") + parser.add_argument('-t', '--taxonomy', + help="path of taxdump.tar.gz extracted directory") + parser.add_argument('-o', '--output_file', type=str, + default="taxonomyassignation", help=("string specifying output file")) + parser.add_argument('-i', '--identity', default=60, + help="percentage of identity") + parser.add_argument('-m', '--max_matches', default=10, + help="max number of matches to analyze") + parser.add_argument('-c', '--min_coverage', default=70, + help="percentage of coverage") args = parser.parse_args() lengths = read_query_length_file(args.query_length_file) print("Finished reading lengths file") nb_total_prot = len(lengths) - (matches,accs) = read_blast_input(args.aln_input_file, lengths, \ - args.identity,args.max_matches,args.min_coverage) + (matches, accs) = read_blast_input(args.aln_input_file, lengths, + args.identity, args.max_matches, args.min_coverage) print("Finished reading in blast results file") nb_prot_annotated = len(matches) @@ -320,10 +332,10 @@ def main(argv): print("Finished loading taxa directory " + args.taxonomy) re_contig = re.compile('(.*)\.' 
+ prot_prefix) - with open (args.output_file + ".pergene.tsv", "w") as out, \ - open (args.output_file + ".percontig.tsv", "w") as outpercontig, \ - open (args.output_file + ".warn.tsv", "w") as outdisc : - #Write header + with open(args.output_file + ".pergene.tsv", "w") as out, \ + open(args.output_file + ".percontig.tsv", "w") as outpercontig, \ + open(args.output_file + ".warn.tsv", "w") as outdisc: + # Write header out.write("#prot_id\tconsensus_tax_id\tconsensus_lineage\ttax_id_by_level\n") outpercontig.write("#contig\tconsensus_tax_id\tconsensus_lineage\ttax_id_by_level\n") outdisc.write("#prot_id\tlist nr hit not found in taxo\n") @@ -340,45 +352,49 @@ def main(argv): contig_id = None for prot, matchs in list(matches.items()): - hit_sorted = sorted(matchs, key = lambda x: x[1], reverse = True) + hit_sorted = sorted(matchs, key=lambda x: x[1], reverse=True) #### # handle contig consensus match = re_contig.match(prot) - if match : + if match: contig_id = match.group(1) - if prev_contig == None : + if prev_contig == None: prev_contig = contig_id - if prev_contig != contig_id : + if prev_contig != contig_id: ### - (tax_id_keep, fullnamelineage_text, fullnamelineage_ids) = get_consensus(collate_hits_per_contig) + (tax_id_keep, fullnamelineage_text, fullnamelineage_ids) = get_consensus( + collate_hits_per_contig) count_genealogy_contig[d_taxonomy[str(tax_id_keep)].level] += 1 - outpercontig.write(prev_contig + "\t" + str(tax_id_keep) + "\t" + fullnamelineage_text + "\t" + str(fullnamelineage_ids) + "\n") + outpercontig.write(prev_contig + "\t" + str(tax_id_keep) + "\t" + + fullnamelineage_text + "\t" + str(fullnamelineage_ids) + "\n") collate_hits_per_contig = list() for depth in range(7): collate_hits_per_contig.append(Counter()) prev_contig = contig_id - best_hit = hit_sorted [0][0] - fHit = hit_sorted [0][1] + best_hit = hit_sorted[0][0] + fHit = hit_sorted[0][1] if mapping[best_hit] > -1: - #Retrieve hit taxa id + # Retrieve hit taxa id tax_id = mapping[best_hit] - if str(tax_id) in d_taxonomy : # taxid in taxonomy ? - #Retreive lineage on main level only (no no rank) + if str(tax_id) in d_taxonomy: # taxid in taxonomy ? + # Retreive lineage on main level only (no no rank) hits = d_taxonomy[str(tax_id)].genealogy_main_level() for depth in range(7): if hits[depth] != "None": - weight = (fHit - MIN_IDENTITY_TAXA[depth]) / (1.0 - MIN_IDENTITY_TAXA[depth]) - weight = max(weight,0.0) + weight = (fHit - MIN_IDENTITY_TAXA[depth] + ) / (1.0 - MIN_IDENTITY_TAXA[depth]) + weight = max(weight, 0.0) if weight > 0.0: - collate_hits_per_contig[depth][hits[depth]] += weight #could put a transform in here + # could put a transform in here + collate_hits_per_contig[depth][hits[depth]] += weight # end handle contig consensus #### #### - #Handle a protein, init variable + # Handle a protein, init variable added_matches = [] collate_hits = list() protTaxaNotFound = list() @@ -387,50 +403,56 @@ def main(argv): for depth in range(7): collate_hits.append(Counter()) - #For each hit, retrieve taxon id and compute weight in lineage + # For each hit, retrieve taxon id and compute weight in lineage for (match, fHit) in hit_sorted: - #taxid id found in acc_taxaid_mapping_file + # taxid id found in acc_taxaid_mapping_file if mapping[match] > -1: - #Retrieve hit taxa id + # Retrieve hit taxa id tax_id = mapping[match] if tax_id not in added_matches: # Only add the best hit per species added_matches.append(tax_id) - if str(tax_id) in d_taxonomy : # taxid in taxonomy ? 
- #Retreive lineage on main level only (no no rank) + if str(tax_id) in d_taxonomy: # taxid in taxonomy ? + # Retreive lineage on main level only (no no rank) hits = d_taxonomy[str(tax_id)].genealogy_main_level() for depth in range(7): if hits[depth] != "None": - weight = (fHit - MIN_IDENTITY_TAXA[depth]) / (1.0 - MIN_IDENTITY_TAXA[depth]) - weight = max(weight,0.0) + weight = ( + fHit - MIN_IDENTITY_TAXA[depth]) / (1.0 - MIN_IDENTITY_TAXA[depth]) + weight = max(weight, 0.0) if weight > 0.0: - collate_hits[depth][hits[depth]] += weight #could put a transform in here - else : + # could put a transform in here + collate_hits[depth][hits[depth]] += weight + else: if tax_id not in protTaxaNotFound: - protTaxaNotFound.append(tax_id) - else : + protTaxaNotFound.append(tax_id) + else: if match not in protMatchNotFound: protMatchNotFound.append(match) - if len (added_matches) > 0 : + if len(added_matches) > 0: (tax_id_keep, fullnamelineage_text, fullnamelineage_ids) = get_consensus(collate_hits) count_genealogy[d_taxonomy[str(tax_id_keep)].level] += 1 - #write simple output - out.write(prot + "\t" + str(tax_id_keep) + "\t" + fullnamelineage_text + "\t" + str(fullnamelineage_ids) + "\n") + # write simple output + out.write(prot + "\t" + str(tax_id_keep) + "\t" + + fullnamelineage_text + "\t" + str(fullnamelineage_ids) + "\n") nb_prot_assigned += 1 - #write discarded values. - if len(protTaxaNotFound) > 0 : - outdisc.write(prot + "\t" + "No taxid in taxdump\t" + ",".join(map(str, protTaxaNotFound)) + "\n") - if len(protMatchNotFound) > 0 : - outdisc.write(prot + "\t" + "No protid correspondance file\t" + ",".join(map(str, protMatchNotFound)) + "\n") + # write discarded values. + if len(protTaxaNotFound) > 0: + outdisc.write(prot + "\t" + "No taxid in taxdump\t" + + ",".join(map(str, protTaxaNotFound)) + "\n") + if len(protMatchNotFound) > 0: + outdisc.write(prot + "\t" + "No protid correspondance file\t" + + ",".join(map(str, protMatchNotFound)) + "\n") if(os.path.getsize(args.aln_input_file) != 0): - #handle last record of contigs consensus. + # handle last record of contigs consensus. 
(tax_id_keep, fullnamelineage_text, fullnamelineage_ids) = get_consensus(collate_hits_per_contig) count_genealogy_contig[d_taxonomy[str(tax_id_keep)].level] += 1 - outpercontig.write(prev_contig + "\t" + str(tax_id_keep) + "\t" + fullnamelineage_text + "\t" + str(fullnamelineage_ids) + "\n") + outpercontig.write(str(prev_contig) + "\t" + str(tax_id_keep) + "\t" + + str(fullnamelineage_text) + "\t" + str(fullnamelineage_ids) + "\n") - #graphs + # graphs try: os.makedirs("graphs") except OSError: @@ -439,8 +461,8 @@ def main(argv): # Sort dictionaries count_genealogy_ord = OrderedDict(sorted(count_genealogy.items(), key=lambda t: t[0])) - count_genealogy_contig_ord = OrderedDict(sorted(count_genealogy_contig.items(), key=lambda t: t[0])) - + count_genealogy_contig_ord = OrderedDict( + sorted(count_genealogy_contig.items(), key=lambda t: t[0])) # Figures pyplot.bar(range(len(count_genealogy_ord.values())), count_genealogy_ord.values()) @@ -448,22 +470,28 @@ def main(argv): pyplot.xlabel("Taxonomy level") pyplot.ylabel("Number of proteins") pyplot.title(args.aln_input_file + " number of proteins at different taxonomy levels") - pyplot.savefig("graphs/" + args.aln_input_file.split(sep="/")[-1] + "_prot_taxonomy_level.pdf") + pyplot.savefig("graphs/" + args.aln_input_file.split(sep="/") + [-1] + "_prot_taxonomy_level.pdf") pyplot.close() - pyplot.bar(range(len(count_genealogy_contig_ord.values())), count_genealogy_contig_ord.values()) - pyplot.xticks(range(len(count_genealogy_contig_ord.values())), count_genealogy_contig_ord.keys()) + pyplot.bar(range(len(count_genealogy_contig_ord.values())), + count_genealogy_contig_ord.values()) + pyplot.xticks(range(len(count_genealogy_contig_ord.values())), + count_genealogy_contig_ord.keys()) pyplot.xlabel("Taxonomy level") pyplot.ylabel("Number of contigs") pyplot.title(args.aln_input_file + " number of contigs at different taxonomy levels") - pyplot.savefig("graphs/" + args.aln_input_file.split(sep="/")[-1] + "_contig_taxonomy_level.pdf") + pyplot.savefig("graphs/" + args.aln_input_file.split(sep="/") + [-1] + "_contig_taxonomy_level.pdf") pyplot.close() list_graphs = [nb_total_prot, nb_prot_annotated, nb_prot_assigned] pyplot.bar(range(len(list_graphs)), list_graphs) - pyplot.xticks(range(len(list_graphs)), ["Total","Annotated","Assigned"]) + pyplot.xticks(range(len(list_graphs)), ["Total", "Annotated", "Assigned"]) pyplot.ylabel("Number of proteins") pyplot.title(args.aln_input_file + " number of annotated and assigned proteins") - pyplot.savefig("graphs/" + args.aln_input_file.split(sep="/")[-1] + "_nb_prot_annotated_and_assigned.pdf") + pyplot.savefig("graphs/" + args.aln_input_file.split(sep="/") + [-1] + "_nb_prot_annotated_and_assigned.pdf") pyplot.close() + if __name__ == "__main__": main(sys.argv[1:]) diff --git a/bin/best_bitscore_diamond.py b/bin/best_bitscore_diamond.py deleted file mode 100755 index d28cce3e8fa4c500ed2873740bff8cf2601e2b14..0000000000000000000000000000000000000000 --- a/bin/best_bitscore_diamond.py +++ /dev/null @@ -1,84 +0,0 @@ -#!/usr/bin/env python - -"""---------------------------------------------------------------------------- - Script Name: best_hit_diamond.py - Description: Have best diamond hits for each gene/protein (best bitscore) - Input files: Diamond output file (.m8) - Created By: Joanna Fourquet - Date: 2021-01-13 -------------------------------------------------------------------------------- -""" - -# Metadata -__author__ = 'Joanna Fourquet \ -- GenPhySE - NED' -__copyright__ = 'Copyright (C) 2021 INRAE' 
-__license__ = 'GNU General Public License' -__version__ = '0.1' -__email__ = 'support.bioinfo.genotoul@inra.fr' -__status__ = 'dev' - -# Status: dev - -# Modules importation -try: - import argparse - import pandas as p - import re - import sys - import os - import operator - from collections import defaultdict - from collections import OrderedDict - from collections import Counter - from matplotlib import pyplot -except ImportError as error: - print(error) - exit(1) - -def read_blast_input(blastinputfile): - #c1.Prot_00001 EFK63346.1 100.0 85 0 0 1 85 62 146 1.6e-36 158.3 85 \ - # 146 EFK63346.1 LOW QUALITY PROTEIN: hypothetical protein HMPREF9008_04720, partial [Parabacteroides sp. 20_3] - - #queryId, subjectId, percIdentity, alnLength, mismatchCount, gapOpenCount, - #queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore, queryLength, subjectLength, subjectTitle - score = defaultdict(float) - best_lines = defaultdict(list) - nmatches = defaultdict(int); - for line in open(blastinputfile): - (queryId, subjectId, percIdentity, alnLength, mismatchCount, gapOpenCount, \ - queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore, queryLength, subjectLength, subjectTitle) \ - = line.rstrip().split("\t") - if (nmatches[queryId] == 0): - score[queryId] = float(bitScore) - nmatches[queryId] += 1 - best_lines[queryId] = [line] - else : - if (nmatches[queryId] > 0 and float(bitScore) > score[queryId]): - score[queryId] = float(bitScore) - best_lines[queryId] = [line] - else : - if (nmatches[queryId] > 0 and float(bitScore) == score[queryId]): - best_lines[queryId].append(line) - return(best_lines) - -def main(argv): - - parser = argparse.ArgumentParser() - parser.add_argument("aln_input_file", \ - help = "file with blast/diamond matches expected format m8 \ - \nqueryId, subjectId, percIdentity, alnLength, mismatchCount, gapOpenCount,\ - queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore") - parser.add_argument('-o','--output_file', type = str, \ - default = "best_hit.tsv", help = ("string specifying output file")) - args = parser.parse_args() - - out_lines = read_blast_input(args.aln_input_file) - with open (args.output_file, "w") as out : - for id in out_lines.keys(): - for line in out_lines[id]: - out.write(line) - print("Finished") - -if __name__ == "__main__": - main(sys.argv[1:]) diff --git a/bin/combine_tables.py b/bin/combine_tables.py deleted file mode 100755 index 705eeaa7d568c832757412a70c4ca8449b4a7435..0000000000000000000000000000000000000000 --- a/bin/combine_tables.py +++ /dev/null @@ -1,17 +0,0 @@ -#!/usr/bin/env python - -#USAGE: ./combine_tables.py <BUSCO_table> <QUAST_table> - -import pandas as pd -from sys import stdout -from sys import argv - -# Read files -file1 = pd.read_csv(argv[1], sep="\t") -file2 = pd.read_csv(argv[2], sep="\t") - -# Merge files -result = pd.merge(file1, file2, left_on="GenomeBin", right_on="Assembly", how='outer') - -# Print to stdout -result.to_csv(stdout, sep='\t') diff --git a/bin/merge_contig_quantif_perlineage.py b/bin/merge_contig_quantif_perlineage.py index 3ecef5d29e41699eaf51944dd7b7d8a7e4bb7793..d7b921f73bd852677acdcc861c29617be5f8c371 100755 --- a/bin/merge_contig_quantif_perlineage.py +++ b/bin/merge_contig_quantif_perlineage.py @@ -3,119 +3,130 @@ """-------------------------------------------------------------------- Script Name: merge_contig_quantif_perlineage.py Description: merge quantifications and lineage into one matrice for one sample. 
- Input files: idxstats file, depth from mosdepth (bed.gz) and lineage percontig.tsv file. + Input files: depth from samtools coverage and lineage percontig.tsv file. Created By: Joanna Fourquet Date: 2021-01-19 ------------------------------------------------------------------------ +------------------------------------ ----------------------------------- """ # Metadata. -__author__ = 'Joanna Fourquet \ -- GenPhySE - NED' +__author__ = 'Joanna Fourquet, Jean Mainguy' __copyright__ = 'Copyright (C) 2021 INRAE' __license__ = 'GNU General Public License' __version__ = '0.1' __email__ = 'support.bioinfo.genotoul@inra.fr' __status__ = 'dev' -# Status: dev. - -# Modules importation. -try: - import argparse - import re - import sys - import pandas as pd - import numpy as np - from datetime import datetime -except ImportError as error: - print(error) - exit(1) - -# Print time. -print(str(datetime.now())) - -# Manage parameters. -parser = argparse.ArgumentParser(description = 'Script which \ -merge quantifications and lineage into one matrice for one sample.') - -parser.add_argument('-i', '--idxstats_file', required = True, \ -help = 'idxstats file.') - -parser.add_argument('-m', '--mosdepth_file', required = True, \ -help = 'depth per contigs from mosdepth (regions.bed.gz).') - -parser.add_argument('-c', '--percontig_file', required = True, \ -help = '.percontig.tsv file.') - -parser.add_argument('-o', '--output_name', required = True, \ -help = 'Name of output file containing counts of contigs and reads \ -for each lineage.') - -parser.add_argument('-v', '--version', action = 'version', \ -version = __version__) - -args = parser.parse_args() - -# Recovery of idxstats file. -idxstats = pd.read_csv(args.idxstats_file, delimiter='\t', header=None) -idxstats.columns = ["contig","len","mapped","unmapped"] -# Recovery of mosdepth file; remove start/end columns -mosdepth = pd.read_csv(args.mosdepth_file, delimiter='\t', header=None,compression='gzip') -mosdepth.columns = ["contig","start","end","depth"] -mosdepth.drop(["start","end"], inplace=True,axis=1) - -# Recovery of .percontig.tsv file. -percontig = pd.read_csv(args.percontig_file, delimiter='\t', dtype=str) - -# Merge idxstats and .percontig.tsv files. -merge = pd.merge(idxstats,percontig,left_on='contig',right_on='#contig', how='outer') - -# Add depth -merge = pd.merge(merge,mosdepth,left_on='contig',right_on='contig', how='outer') - -# Fill NaN values to keep unmapped contigs. -merge['consensus_lineage'] = merge['consensus_lineage'].fillna('Unknown') -merge['tax_id_by_level'] = merge['tax_id_by_level'].fillna(1) -merge['consensus_tax_id'] = merge['consensus_tax_id'].fillna(1) - -# Group by lineage and sum number of reads and contigs. -res = merge.groupby(['consensus_lineage','consensus_tax_id', 'tax_id_by_level']).agg({'contig' : [';'.join, 'count'], 'mapped': 'sum', 'depth': 'mean'}).reset_index() -res.columns=['lineage_by_level', 'consensus_tax_id', 'tax_id_by_level', 'name_contigs', 'nb_contigs', 'nb_reads', 'depth'] - -# Fill NaN values with 0. 
-res.fillna(0, inplace=True) - -# Split by taxonomic level -res_split_tax_id = res.join(res['tax_id_by_level'].str.split(pat=";",expand=True)) -res_split_tax_id.columns=['consensus_lineage', 'consensus_taxid', 'tax_id_by_level', 'name_contigs', 'nb_contigs', 'depth', 'nb_reads', "superkingdom_tax_id", "phylum_tax_id", "order_tax_id", "class_tax_id", "family_tax_id", "genus_tax_id", "species_tax_id"] -res_split_tax_id.fillna(value='no_affi', inplace = True) -print(res_split_tax_id.head()) -res_split = res_split_tax_id.join(res_split_tax_id['consensus_lineage'].str.split(pat=";",expand=True)) -res_split.columns=['consensus_lineage', 'consensus_taxid', 'tax_id_by_level', 'name_contigs', 'nb_contigs', 'nb_reads', 'depth', "superkingdom_tax_id", "phylum_tax_id", "order_tax_id", "class_tax_id", "family_tax_id", "genus_tax_id", "species_tax_id", "superkingdom_lineage", "phylum_lineage", "order_lineage", "class_lineage", "family_lineage", "genus_lineage", "species_lineage"] -res_split.fillna(value='no_affi', inplace = True) -levels_columns=['tax_id_by_level','lineage_by_level','name_contigs','nb_contigs', 'nb_reads', 'depth'] -level_superkingdom = res_split.groupby(['superkingdom_tax_id','superkingdom_lineage']).agg({'name_contigs' : [';'.join], 'nb_contigs' : 'sum', 'nb_reads' : 'sum', 'depth': 'mean'}).reset_index() -level_superkingdom.columns=levels_columns -level_phylum = res_split.groupby(['phylum_tax_id','phylum_lineage']).agg({'name_contigs' : [';'.join], 'nb_contigs' : 'sum', 'nb_reads' : 'sum', 'depth': 'mean'}).reset_index() -level_phylum.columns=levels_columns -level_order = res_split.groupby(['order_tax_id','order_lineage']).agg({'name_contigs' : [';'.join], 'nb_contigs' : 'sum', 'nb_reads' : 'sum', 'depth': 'mean'}).reset_index() -level_order.columns=levels_columns -level_class = res_split.groupby(['class_tax_id','class_lineage']).agg({'name_contigs' : [';'.join], 'nb_contigs' : 'sum', 'nb_reads' : 'sum', 'depth': 'mean'}).reset_index() -level_class.columns=levels_columns -level_family = res_split.groupby(['family_tax_id','family_lineage']).agg({'name_contigs' : [';'.join], 'nb_contigs' : 'sum', 'nb_reads' : 'sum', 'depth': 'mean'}).reset_index() -level_family.columns=levels_columns -level_genus = res_split.groupby(['genus_tax_id','genus_lineage']).agg({'name_contigs' : [';'.join], 'nb_contigs' : 'sum', 'nb_reads' : 'sum', 'depth': 'mean'}).reset_index() -level_genus.columns=levels_columns -level_species = res_split.groupby(['species_tax_id','species_lineage']).agg({'name_contigs' : [';'.join], 'nb_contigs' : 'sum', 'nb_reads' : 'sum', 'depth': 'mean'}).reset_index() -level_species.columns=levels_columns - -# Write merge data frame in output files. -res.to_csv(args.output_name + ".tsv", sep="\t", index=False) -level_superkingdom.to_csv(args.output_name + "_by_superkingdom.tsv", sep="\t", index=False) -level_phylum.to_csv(args.output_name + "_by_phylum.tsv", sep="\t", index=False) -level_order.to_csv(args.output_name + "_by_order.tsv", sep="\t", index=False) -level_class.to_csv(args.output_name + "_by_class.tsv", sep="\t", index=False) -level_family.to_csv(args.output_name + "_by_family.tsv", sep="\t", index=False) -level_genus.to_csv(args.output_name + "_by_genus.tsv", sep="\t", index=False) -level_species.to_csv(args.output_name + "_by_species.tsv", sep="\t", index=False) + +from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter +import pandas as pd +import logging + +def parse_arguments(): + # Manage parameters. 
+ parser = ArgumentParser(description = 'Script which \ + merge quantifications and lineage into one matrice for one sample.', + formatter_class=ArgumentDefaultsHelpFormatter) + + parser.add_argument('-s', '--sam_coverage', required = True, \ + help = 'depth per contigs from samtools coverage tool.') + + parser.add_argument('-c', '--contig_tax_affi', required = True, \ + help = '.percontig.tsv file.') + + parser.add_argument('-o', '--output_name', required = True, \ + help = 'Name of output file containing counts of contigs and reads \ + for each lineage.') + + parser.add_argument('-v', '--version', action = 'version', \ + version = __version__) + + parser.add_argument("--verbose", help="increase output verbosity", + action="store_true") + + args = parser.parse_args() + return args + + +def main(): + + args = parse_arguments() + + if args.verbose: + logging.basicConfig(format="%(levelname)s: %(message)s", level=logging.DEBUG) + logging.info('Mode verbose ON') + + else: + logging.basicConfig(format="%(levelname)s: %(message)s") + + sam_coverage_file = args.sam_coverage + contig_taxaffi_file = args.contig_tax_affi + output_name = args.output_name + + ranks = ["superkingdom", "phylum", "order", "class", + "family", "genus", "species"] + + logging.info("Read and merge tables") + cov_df = pd.read_csv(sam_coverage_file, delimiter='\t') + + contig_taxaffi_df = pd.read_csv(contig_taxaffi_file, delimiter='\t', dtype=str) + print(cov_df) + print("#####################") + print(contig_taxaffi_df) + + depth_tax_contig_df = pd.merge(cov_df,contig_taxaffi_df,left_on='#rname',right_on='#contig', how='outer') + + # Fill NaN values to keep unmapped contigs. + depth_tax_contig_df['consensus_lineage'] = depth_tax_contig_df['consensus_lineage'].fillna('Unknown') + depth_tax_contig_df['tax_id_by_level'] = depth_tax_contig_df['tax_id_by_level'].fillna(1) + depth_tax_contig_df['consensus_tax_id'] = depth_tax_contig_df['consensus_tax_id'].fillna(1) + + + logging.info("group by lineage") + groupby_cols = ['consensus_lineage','consensus_tax_id', 'tax_id_by_level'] + depth_lineage_df = depth_tax_contig_df.groupby(groupby_cols).agg({ + '#rname' : [';'.join, 'count'], + 'numreads': 'sum', + 'meandepth': 'mean'}).reset_index() + + depth_lineage_df.columns=['lineage_by_level', 'consensus_tax_id', 'tax_id_by_level', + 'name_contigs', 'nb_contigs', 'nb_reads', 'depth'] + + logging.info(f"Write out {output_name}.tsv") + depth_lineage_df.to_csv(f"{output_name}.tsv", sep="\t", index=False) + + + # split lineage + ranks_taxid = [f"{r}_taxid" for r in ranks] + ranks_lineage = [f"{r}_lineage" for r in ranks] + + try: + depth_lineage_df[ranks_taxid] = depth_lineage_df['tax_id_by_level'].str.split(pat=";",expand=True) + depth_lineage_df[ranks_lineage] = depth_lineage_df["lineage_by_level"].str.split(pat=";",expand=True) + + except ValueError: + # Manage case when lineage_by_level is only equal to "Unable to found taxonomy consensus" or "Unknown" + df_noaffi = pd.DataFrame("no_affi", index=range(len(depth_lineage_df)), columns=ranks_taxid+ranks_lineage) + depth_lineage_df = pd.concat([depth_lineage_df, df_noaffi], axis=1) + depth_lineage_df = depth_lineage_df.fillna(value='no_affi') + + # groupby each rank and write the resulting table + levels_columns=['tax_id_by_level','lineage_by_level','name_contigs','nb_contigs', 'nb_reads', 'depth'] + + logging.info("group by rank") + for rank in ranks: + depth_rank_lineage_df = depth_lineage_df.groupby([f'{rank}_taxid',f'{rank}_lineage']).agg({ + 'name_contigs' : [';'.join], + 'nb_contigs' 
: 'sum', + 'nb_reads' : 'sum', + 'depth': 'mean'}).reset_index() + + depth_rank_lineage_df.columns=levels_columns + depth_rank_lineage_df['rank'] = rank + logging.info(f"Write out {output_name}_by_{rank}.tsv") + depth_rank_lineage_df.to_csv(f"{output_name}_by_{rank}.tsv", sep="\t", index=False) + + + +if __name__ == '__main__': + main() diff --git a/bin/scrape_software_versions.py b/bin/scrape_software_versions.py index c66b465175e76206db508b1044c54bcd8c93dda5..86f979f03a143b8af3e5a106f11c60f64ffe8d97 100755 --- a/bin/scrape_software_versions.py +++ b/bin/scrape_software_versions.py @@ -22,7 +22,8 @@ regexes = { 'Prokka': ['v_prokka.txt', r"prokka (\S+)"], 'Kaiju': ['v_kaiju.txt', r"Kaiju (\S+)"], 'Samtools': ['v_samtools.txt', r"samtools (\S+)"], - 'Bedtools': ['v_bedtools.txt', r"bedtools v(\S+)"] + 'Bedtools': ['v_bedtools.txt', r"bedtools v(\S+)"], + 'Eggnog-Mapper': ['v_eggnogmapper.txt', r"emapper-(\S+)"] } results = OrderedDict() results['metagWGS'] = '<span style="color:#999999;\">N/A</span>' @@ -44,6 +45,7 @@ results['Prokka'] = '<span style="color:#999999;\">N/A</span>' results['Kaiju'] = '<span style="color:#999999;\">N/A</span>' results['Samtools'] = '<span style="color:#999999;\">N/A</span>' results['Bedtools'] = '<span style="color:#999999;\">N/A</span>' +results['Eggnog-Mapper'] = '<span style="color:#999999;\">N/A</span>' # Search each file using its regex for k, v in regexes.items(): diff --git a/bin/summary_busco.py b/bin/summary_busco.py deleted file mode 100755 index f3dd2eba23245a07351d561ee4e5b1cd95caf724..0000000000000000000000000000000000000000 --- a/bin/summary_busco.py +++ /dev/null @@ -1,26 +0,0 @@ -#!/usr/bin/env python - -# USAGE: ./summary.busco.py *.txt - -import re -from sys import argv - -# "# Summarized benchmarking in BUSCO notation for file MEGAHIT-testset1.contigs.fa" -# " C:0.0%[S:0.0%,D:0.0%],F:0.0%,M:100.0%,n:148" - -regexes = [r"# Summarized benchmarking in BUSCO notation for file (\S+)", r" C:(\S+)%\[S:", - r"%\[S:(\S+)%,D:", r"%,D:(\S+)%\],F:", r"%\],F:(\S+)%,M:", r"%,M:(\S+)%,n:", r"%,n:(\S+)"] -columns = ["GenomeBin", "%Complete", "%Complete and single-copy", - "%Complete and duplicated", "%Fragmented", "%Missing", "Total number"] - -# Search each file using its regex -print("\t".join(columns)) -for FILE in argv[1:]: - with open(FILE) as x: - results = [] - TEXT = x.read() - for REGEX in regexes: - match = re.search(REGEX, TEXT) - if match: - results.append(match.group(1)) - print("\t".join(results)) diff --git a/conf/base.config b/conf/base.config index c51fb3bfbdbeb16cb079c96f32b605ba49f1cd22..c9f52e98ccd8cf95cfc93a1b561283fc56fdf908 100644 --- a/conf/base.config +++ b/conf/base.config @@ -17,99 +17,96 @@ process { maxRetries = 1 maxErrors = '-1' container = 'file://metagwgs/env/metagwgs.sif' - withName: cutadapt { - cpus = 8 - memory = { 8.GB * task.attempt } + withName: CUTADAPT { + cpus = 8 + memory = { 8.GB * task.attempt } } - withName: sickle { - memory = { 8.GB * task.attempt } + withName: SICKLE { + memory = { 8.GB * task.attempt } } - withLabel: fastqc { - cpus = 8 - memory = { 8.GB * task.attempt } + withLabel: FASTQC { + cpus = 8 + memory = { 8.GB * task.attempt } } - withName: multiqc { - memory = { 8.GB * task.attempt } + withName: MULTIQC { + memory = { 8.GB * task.attempt } } - withName: host_filter { + withName: HOST_FILTER { memory = { 10.GB * task.attempt } time = '48h' cpus = 8 } - withName: index_db_kaiju { + withName: INDEX_KAIJU { memory = { 200.GB * task.attempt } cpus = 6 } - withName: kaiju { + withName: KAIJU { memory 
= { 50.GB * task.attempt } cpus = 25 } - withName: assembly { - memory = { 250.GB * task.attempt } + withName: ASSEMBLY { + memory = { 440.GB * task.attempt } cpus = 20 } - withLabel: quast { + withName: QUAST { cpus = 4 memory = { 32.GB * task.attempt } } - withName: reads_deduplication { + withName: READS_DEDUPLICATION { memory = { 32.GB * task.attempt } } - withLabel: assembly_filter { + withName: ASSEMBLY_FILTER { memory = { 8.GB * task.attempt } cpus = 4 } - withName: prokka { + withName: PROKKA { memory = { 45.GB * task.attempt } cpus = 8 } - withName: rename_contigs_genes{ + withName: RENAME_CONTIGS_AND_GENES { memory = { 20.GB * task.attempt } } - withLabel: cd_hit { + withLabel: CD_HIT { memory = { 50.GB * task.attempt } cpus = 16 } - withName: quantification { + withName: QUANTIFICATION { memory = { 50.GB * task.attempt } } - withName: quantification_table { + withName: QUANTIFICATION_TABLE { memory = { 100.GB * task.attempt } } - withName: diamond { + withName: DIAMOND { cpus = 8 memory = { 32.GB * task.attempt } } - withName: get_software_versions { + withName: GET_SOFTWARE_VERSIONS { memory = { 1.GB * task.attempt } } - withLabel: binning { + withLabel: BINNING { memory = { 50.GB * task.attempt } } - withName: cat { + withName: CAT { cpus = 8 memory = { 16.GB * task.attempt } } - withName: eggnog_mapper_db { + withName: EGGNOG_MAPPER_DB { cpus = 2 memory = { 2.GB * task.attempt } } - withName: eggnog_mapper { + withName: EGGNOG_MAPPER { cpus = 4 memory = { 20.GB * task.attempt } } - withName: merge_quantif_and_functional_annot { + withName: MERGE_QUANT_ANNOT_BEST { cpus = 1 memory = { 50.GB * task.attempt } } - withName: make_functional_annotation_tables { + withName: FUNCTIONAL_ANNOT_TABLE { cpus = 1 memory = { 50.GB * task.attempt } } - withLabel: eggnog { + withLabel: EGGNOG { container = 'file://metagwgs/env/eggnog_mapper.sif' - } - withLabel: mosdepth { - container = 'file://metagwgs/env/mosdepth.sif' } } diff --git a/conf/singularity.config b/conf/singularity.config index f1403db182096e688b8b4d55f14676d5f7ec3e62..3ca0ab5ed5c787d332d288c9127d57807890e059 100644 --- a/conf/singularity.config +++ b/conf/singularity.config @@ -1,2 +1,9 @@ singularity.enabled = true singularity.autoMounts = true + +process { + container = '<PATH>/metagwgs.sif' + withLabel: eggnog { + container = '<PATH>/eggnog_mapper.sif' + } +} \ No newline at end of file diff --git a/conf/test_genotoul_workq.config b/conf/test_genotoul_workq.config index 4099dd45a423bb640654794af0f6075d67dc513e..b6bfe94dd04fc9f170a96de546f070d0a89a0df8 100644 --- a/conf/test_genotoul_workq.config +++ b/conf/test_genotoul_workq.config @@ -8,80 +8,73 @@ process { // Process-specific resource requirements cpus = { 1 * task.attempt } - memory = { 2.GB * task.attempt } + memory = { 20.GB * task.attempt } - errorStrategy = { task.exitStatus in [1,143,137,104,134,139] ? 'retry' : 'finish' } + errorStrategy = { task.exitStatus in [1,143,137,104,134,139] ? 
'finish' : 'finish' } maxRetries = 3 maxErrors = '-1' - withName: cutadapt { + withName: CUTADAPT { cpus = 3 memory = { 1.GB * task.attempt } } - withName: sickle { + withName: SICKLE { memory = { 1.GB * task.attempt } } - withLabel: fastqc { + withLabel: FASTQC { cpus = 6 - memory = { 1.GB * task.attempt } + memory = { 2.GB * task.attempt } } - withName: multiqc { + withName: MULTIQC { memory = { 2.GB * task.attempt } } - withName: host_filter { + withName: HOST_FILTER { memory = { 20.GB * task.attempt } time = '48h' cpus = 6 } - withName: index_db_kaiju { + withName: INDEX_KAIJU { memory = { 50.GB * task.attempt } cpus = 6 } - withName: kaiju { - memory = { 50.GB * task.attempt } + withName: KAIJU { + memory = { 60.GB * task.attempt } cpus = 4 } - withName: assembly { + withName: ASSEMBLY { memory = { 10.GB * task.attempt } cpus = 8 } - withName: quast { + withLabel: QUAST { cpus = 2 memory = { 2.GB * task.attempt } } - withName: reads_deduplication { + withName: READS_DEDUPLICATION { memory = { 1.GB * task.attempt } } - withLabel: assembly_filter { + withLabel: ASSEMBLY_FILTER { memory = { 1.GB * task.attempt } cpus = 2 } - withName: prokka { + withName: PROKKA { memory = { 1.GB * task.attempt } cpus = 1 } - withName: rename_contigs_genes{ + withName: RENAME_CONTIGS_AND_GENES{ memory = { 1.GB * task.attempt } } - withLabel: cd_hit { + withLabel: CD_HIT { memory = { 16.GB * task.attempt } cpus = 2 } - withName: quantification { + withName: QUANTIFICATION { memory = { 1.GB * task.attempt } } - withName: diamond { + withName: DIAMOND { cpus = 8 memory = { 10.GB * task.attempt } } - withName: get_software_versions { + withName: GET_SOFTWARE_VERSIONS { memory = { 1.GB * task.attempt } } - withLabel: binning { - memory = { 1.GB * task.attempt } - } - withName: cat { - cpus = 1 - memory = { 2.GB * task.attempt } - } } diff --git a/conf/test_local.config b/conf/test_local.config index 4805ffc679bbc207877f15e1fa2bc5a5a6f95c43..64912cdee11b425ce0fd2e03cedbab4dcf1966b0 100644 --- a/conf/test_local.config +++ b/conf/test_local.config @@ -1,4 +1,5 @@ includeConfig 'singularity.config' +singularity.runOptions = "-B /work/bank/ -B /bank -B /work2 -B /work -B /save -B /home -B /work/project -B /usr/local/bioinfo" process { @@ -10,76 +11,73 @@ process { maxRetries = 3 maxErrors = '-1' - withName: cutadapt { + withName: CUTADAPT { cpus = 1 memory = { 1.GB * task.attempt } } - withName: sickle { + withName: SICKLE { memory = { 1.GB * task.attempt } } - withLabel: fastqc { + withLabel: FASTQC { cpus = 2 memory = { 1.GB * task.attempt } } - withName: multiqc { + withName: MULTIQC { memory = { 2.GB * task.attempt } } - withName: host_filter { + withName: HOST_FILTER { memory = { 1.GB * task.attempt } time = '48h' cpus = 2 } - withName: index_db_kaiju { + withName: INDEX_KAIJU { memory = { 10.GB * task.attempt } cpus = 2 } - withName: kaiju { + withName: KAIJU { memory = { 10.GB * task.attempt } cpus = 2 } - withName: assembly { + withName: ASSEMBLY { memory = { 2.GB * task.attempt } cpus = 3 } - withName: quast { + withName: QUAST { cpus = 2 memory = { 2.GB * task.attempt } } - withName: reads_deduplication { + withName: READS_DEDUPLICATION { memory = { 1.GB * task.attempt } } - withName: assembly_filter { + withName: ASSEMBLY_FILTER { memory = { 1.GB * task.attempt } cpus = 2 } - withName: prokka { + withName: PROKKA { memory = { 1.GB * task.attempt } cpus = 1 } - withName: rename_contigs_genes{ + withName: RENAME_CONTIGS_AND_GENES{ memory = { 1.GB * task.attempt } } - withLabel: cd_hit { + withLabel: 
CD_HIT { memory = { 2.GB * task.attempt } cpus = 2 } - withName: quantification { + withName: QUANTIFICATION { memory = { 1.GB * task.attempt } } - withName: quantification_table { - memory = { 2.GB} - } - withName: diamond { + withName: DIAMOND { cpus = 2 memory = { 2.GB * task.attempt } } - withName: get_software_versions { + withName: GET_SOFTWARE_VERSIONS { memory = { 1.GB * task.attempt } } - withLabel: binning { + withLabel: BINNING { memory = { 1.GB * task.attempt } } - withName: cat { + withName: CAT { cpus = 1 memory = { 4.GB * task.attempt } } diff --git a/docs/README.md b/docs/README.md index e222d51ccde8184b460f132fbd6b8fef5f8554ae..d677ad284d6bee334f44d3f746db5a6b5519d045 100644 --- a/docs/README.md +++ b/docs/README.md @@ -2,54 +2,54 @@ ## Introduction -**metagWGS** is a [Nextflow](https://www.nextflow.io/docs/latest/index.html#) bioinformatics analysis pipeline used for **metag**enomic **W**hole **G**enome **S**hotgun sequencing data (Illumina HiSeq3000 or NovaSeq, paired, 2\*150bp). +**metagWGS** is a [Nextflow](https://www.nextflow.io/docs/latest/index.html#) bioinformatics analysis pipeline used for **metag**enomic **W**hole **G**enome **S**hotgun sequencing data (Illumina HiSeq3000 or NovaSeq, paired, 2\*150bp ; PacBio HiFi reads, single-end). ### Pipeline graphical representation -The workflow processes raw data from `.fastq` or `.fastq.gz` inputs and do the modules represented into this figure: +The workflow processes raw data from `.fastq/.fastq.gz` input and/or assemblies (contigs) `.fa/.fasta` and uses the modules represented in this figure:  ### metagWGS steps metagWGS is split into different steps that correspond to different parts of the bioinformatics analysis: -* `01_clean_qc` (can ke skipped) +* `S01_CLEAN_QC` (can be stopped at with `--stop_at_clean` ; can ke skipped with `--skip_clean`) * trims adapters sequences and deletes low quality reads ([Cutadapt](https://cutadapt.readthedocs.io/en/stable/#), [Sickle](https://github.com/najoshi/sickle)) * suppresses host contaminants ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/) + [Bedtools](https://bedtools.readthedocs.io/en/latest/)) * controls the quality of raw and cleaned data ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)) * makes a taxonomic classification of cleaned reads ([Kaiju MEM](https://github.com/bioinformatics-centre/kaiju) + [kronaTools](https://github.com/marbl/Krona/wiki/KronaTools) + [Generate_barplot_kaiju.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Generate_barplot_kaiju.py) + [merge_kaiju_results.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/merge_kaiju_results.py)) -* `02_assembly` - * assembles cleaned reads (combined with `01_clean_qc` step) or raw reads (combined with `--skip_01_clean_qc` parameter) ([metaSPAdes](https://github.com/ablab/spades) or [Megahit](https://github.com/voutcn/megahit)) +* `S02_ASSEMBLY` (can be stopped at with `--stop_at_assembly`) + * assembles cleaned reads (combined with `S01_CLEAN_QC` step) or raw reads (combined with `--skip_clean` parameter) ([metaSPAdes](https://github.com/ablab/spades) or [Megahit](https://github.com/voutcn/megahit)) * assesses the quality of assembly ([metaQUAST](http://quast.sourceforge.net/metaquast)) - * deduplicates cleaned reads (combined with `01_clean_qc` step) or raw reads (combined with `--skip_01_clean_qc` parameter) ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/) + 
[Bedtools](https://bedtools.readthedocs.io/en/latest/)) -* `03_filtering` (can be skipped) + * deduplicates cleaned reads (combined with `S01_CLEAN_QC` step) or raw reads (combined with `--skip_clean` parameter) ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/) + [Bedtools](https://bedtools.readthedocs.io/en/latest/)) +* `S03_FILTERING` (can be stopped at with `--stop_at_filtering` ; can be skipped with `--skip_assembly`) * filters contigs with low CPM value ([Filter_contig_per_cpm.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Filter_contig_per_cpm.py) + [metaQUAST](http://quast.sourceforge.net/metaquast)) -* `04_structural_annot` +* `S04_STRUCTURAL_ANNOT` (can be stopped at with `--stop_at_structural_annot`) * makes a structural annotation of genes ([Prokka](https://github.com/tseemann/prokka) + [Rename_contigs_and_genes.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Rename_contigs_and_genes.py)) -* `05_alignment` +* `S05_ALIGNMENT` * aligns reads to the contigs ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/)) * aligns the protein sequence of genes against a protein database ([DIAMOND](https://github.com/bbuchfink/diamond)) -* `06_func_annot` +* `S06_FUNC_ANNOT` (can ke skipped with `--skip_func_annot`) * makes a sample and global clustering of genes ([cd-hit-est](http://weizhongli-lab.org/cd-hit/) + [cd_hit_produce_table_clstr.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/cd_hit_produce_table_clstr.py)) * quantifies reads that align with the genes ([featureCounts](http://subread.sourceforge.net/) + [Quantification_clusters.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Quantification_clusters.py)) * makes a functional annotation of genes and a quantification of reads by function ([eggNOG-mapper](http://eggnog-mapper.embl.de/) + [best_bitscore_diamond.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/best_bitscore_diamond.py) + [merge_abundance_and_functional_annotations.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/merge_abundance_and_functional_annotations.py) + [quantification_by_functional_annotation.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/quantification_by_functional_annotation.py)) -* `07_taxo_affi` +* `S07_TAXO_AFFI` (can ke skipped with `--skip_taxo_affi`) * taxonomically affiliates the genes ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/aln2taxaffi.py)) * taxonomically affiliates the contigs ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/aln2taxaffi.py)) * counts the number of reads and contigs, for each taxonomic affiliation, per taxonomic level ([Samtools](http://www.htslib.org/) + [merge_contig_quantif_perlineage.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/merge_contig_quantif_perlineage.py) + [quantification_by_contig_lineage.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/quantification_by_contig_lineage.py)) -* `08_binning` from [nf-core/mag 1.0.0](https://github.com/nf-core/mag/releases/tag/1.0.0) - * makes binning of contigs ([MetaBAT2](https://bitbucket.org/berkeleylab/metabat/src/master/)) - * assesses bins ([BUSCO](https://busco.ezlab.org/) + [metaQUAST](http://quast.sourceforge.net/metaquast) + 
[summary_busco.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/summary_busco.py) and [combine_tables.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/combine_tables.py) from [nf-core/mag](https://github.com/nf-core/mag)) - * taxonomically affiliates the bins ([BAT](https://github.com/dutilh/CAT)) +* `S08_BINNING` (not yet implemented) + * binning strategies for assemblies and co-assemblies + +All steps are launched one after another by default. Use `--stop_at_[STEP]` and `--skip_[STEP]` parameters to tweak execution to your will. A report html file is generated at the end of the workflow with [MultiQC](https://multiqc.info/). The pipeline is built using [Nextflow,](https://www.nextflow.io/docs/latest/index.html#) a bioinformatics workflow tool to run tasks across multiple compute infrastructures in a very portable manner. -Three [Singularity](https://sylabs.io/docs/) containers are available making installation trivial and results highly reproducible. +Two [Singularity](https://sylabs.io/docs/) containers are available making installation trivial and results highly reproducible. ## Documentation -The metagWGS documentation is splitted into the following pages: +The metagWGS documentation can be found in the following pages: * [Installation](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/installation.md) * The pipeline installation procedure. @@ -59,3 +59,5 @@ The metagWGS documentation is splitted into the following pages: * An overview of the different output files and directories produced by the pipeline. * [Use case](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/use_case.md) * A tutorial to learn how to launch the pipeline on a test dataset on [genologin cluster](http://bioinfo.genotoul.fr/). + * [Functional tests](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/functional_tests/README.md) + * (for developers) A tool to launch a new version of the pipeline on curated input data and compare its results with known output. diff --git a/docs/installation.md b/docs/installation.md index b85a2af7a7de1d4c1cf976588164847ac77540e3..a41a08779789e6cfa31b745e2333d0d31169d397 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -32,7 +32,7 @@ A directory called `metagwgs` containing all source files of the pipeline have b ## III. Install Singularity -metagWGS needs three [Singularity](https://sylabs.io/docs/) containers to run: Singularity version 3 or above must be installed. +metagWGS needs two [Singularity](https://sylabs.io/docs/) containers to run: Singularity version 3 or above must be installed. See [here](https://sylabs.io/guides/3.7/user-guide/quick_start.html#quick-installation-steps) how to install Singularity >=v3. @@ -43,9 +43,9 @@ See [here](https://sylabs.io/guides/3.7/user-guide/quick_start.html#quick-instal ## IV. Download or build Singularity containers -You can directly download the three Singularity containers (`Solution 1`, recommended) or build them (`Solution 2`). +You can directly download the two Singularity containers (`Solution 1`, recommended) or build them (`Solution 2`). 
-### Solution 1 (recommended): download the three containers +### Solution 1 (recommended): download the two containers **In the directory you want tu run the workflow**, where you have the directory `metagwgs` with metagWGS source files, run these command lines: @@ -53,14 +53,13 @@ You can directly download the three Singularity containers (`Solution 1`, recomm cd metagwgs/env/ singularity pull eggnog_mapper.sif oras://registry.forgemia.inra.fr/genotoul-bioinfo/metagwgs/eggnog_mapper:latest singularity pull metagwgs.sif oras://registry.forgemia.inra.fr/genotoul-bioinfo/metagwgs/metagwgs:latest -singularity pull mosdepth.sif oras://registry.forgemia.inra.fr/genotoul-bioinfo/metagwgs/mosdepth:latest ``` -Three files (`metagwgs.sif`, `eggnog_mapper.sif` and `mosdepth.sif`) must have been downloaded. +two files (`metagwgs.sif` and `eggnog_mapper.sif`) must have been downloaded. -### Solution 2: build the three containers. +### Solution 2: build the two containers. -**In the directory you want tu run the workflow**, where you have downloaded metagWGS source files, go to `metagwgs/env/` directory, and follow [these explanations](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/wikis/Singularity%20container) to build the three containers. You need three files by container to build them. These files are into the `metagwgs/env/` folder and you can read them here: +**In the directory you want tu run the workflow**, where you have downloaded metagWGS source files, go to `metagwgs/env/` directory, and follow [these explanations](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/wikis/Singularity%20container) to build the two containers. You need two files by container to build them. These files are into the `metagwgs/env/` folder and you can read them here: * metagwgs.sif container * [metagWGS recipe file](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/env/Singularity_recipe_metagWGS) @@ -68,11 +67,19 @@ Three files (`metagwgs.sif`, `eggnog_mapper.sif` and `mosdepth.sif`) must have b * eggnog_mapper.sif container * [eggnog_mapper recipe file](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/env/Singularity_recipe_eggnog_mapper) * [eggnog_mapper.yml](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/env/eggnog_mapper.yml) -* mosdepth.sif container - * [mosdepth recipe file](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/env/Singularity_recipe_mosdepth) - * [mosdepth.yml](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/env/mosdepth.yml) -At the end of the build, three files (`metagwgs.sif`, `eggnog_mapper.sif` and `mosdepth.sif`) must have been generated. +At the end of the build, two files (`metagwgs.sif` and `eggnog_mapper.sif`) must have been generated. + +**WARNING:** to ensure Nextflow can find the _.sif_ files, we encourage you to change the _nextflow.config_ file in metagWGS to contain these lines: +``` +process { + container = '<PATH>/metagwgs.sif' + withLabel: EGGNOG { + container = '<PATH>/eggnog_mapper.sif' + } +} +``` +Where \<PATH\> leads to the directory where the singularity images are built/downloaded. 
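For example, here is a minimal, hypothetical sketch of how you might fill in this placeholder and check the result. It assumes the two images were pulled or built into `metagwgs/env/` (adapt `IMG_DIR` otherwise) and applies the same substitution to `conf/singularity.config`, which ships with the same `<PATH>` placeholder:
```bash
# Hypothetical helper: point the container paths at the local .sif images (GNU sed syntax).
IMG_DIR="$(pwd)/metagwgs/env"                           # directory holding metagwgs.sif and eggnog_mapper.sif
sed -i "s|<PATH>|${IMG_DIR}|g" metagwgs/conf/singularity.config
grep -n "container" metagwgs/conf/singularity.config    # quick check of the resulting paths
```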
**WARNING:** to ensure Nextflow can find the _.sif_ files, we encourage you to change the _nextflow.config_ file in metagWGS at these lines: ``` diff --git a/docs/output.md b/docs/output.md index 1eea0a584a8b937ab68dd6aa20fd3fd9654d8150..5fe579f601017fbc6dc6f902c26770831e69048d 100644 --- a/docs/output.md +++ b/docs/output.md @@ -21,8 +21,8 @@ The `results/` directory contains a sub-directory for each step launched: | File or directory/ | Description | | ----------------------- | --------------------------------------- | -| `cleaned_SAMPLE_NAME_R{1,2}.fastq.gz` | Cleaned reads after `01_clean_qc` step. There are one R1 and one R2 file for each sample. | -| `logs/` | Contains cutadapt (`SAMPLE_NAME_cutadapt.log`) and sickle (`SAMPLE_NAME_sickle.log`) log files for each sample. Only if you remove host reads, in `SAMPLE_NAME_cleaned_R{1,2}.nb_bases` you have the number of nucleotides into each cleaned R1 and R2 files of each sample. Only if you remove host reads, you also have a samtools flagstat file for each sample before removing host reads (`SAMPLE_NAME.no_filter.flagstat`) and into the directory `host_filter_flagstat/` there are the samtools flagstat files (`SAMPLE_NAME.host_filter.flagstat`) after removing of host reads. | +| `cleaned_SAMPLE_NAME_R{1,2}.fastq.gz` | There are one R1 and one R2 file for each sample. | +| `logs/` | Contains cutadapt (`SAMPLE_NAME_cutadapt.log`) and sickle (`SAMPLE_NAME_sickle.log`) log files for each sample. Only if you remove host reads, in `SAMPLE_NAME_cleaned_R{1,2}.nb_bases` you have the number of nucleotides into each cleaned R1 and R2 files of each sample. Only if you remove host reads, you also have a samtools flagstat file for each sample before removing host reads (`SAMPLE_NAME.no_filter.flagstat`) and into the directory `host_filter_flagstat/` there are the samtools flagstat files (`SAMPLE_NAME.host_filter.flagstat`) after removing host reads. | #### **01_clean_qc/01_2_qc/** @@ -52,7 +52,7 @@ The `results/` directory contains a sub-directory for each step launched: | `megahit/SAMPLE_NAME.contigs.fa` | megahit assembly: nucleotide sequence of contigs. Only if `--assembly "megahit"` is used.| | `SAMPLE_NAME_all_contigs_QC/` | Contains metaQUAST quality control files of contigs. | | `SAMPLE_NAME_R{1,2}_dedup.fastq.gz` | Deduplicated reads (R1 and R2 files) for SAMPLE_NAME sample. | -| `logs/` | Contains `SAMPLE_NAME.count_reads_on_contigs.flagstat`, `SAMPLE_NAME.count_reads_on_contigs.idxstats` and `SAMPLE_NAME_dedup_R{1,2}.nb_bases` files for each sample, generated after deduplication of reads. `SAMPLE_NAME.count_reads_on_contigs.flagstat` and `SAMPLE_NAME.count_reads_on_contigs.idxstats` are respectively the results of samtools flagstat (see informations [here](http://www.htslib.org/doc/samtools-flagstat.html)) and samtools idxstats (see description [here](http://www.htslib.org/doc/samtools-idxstats.html)), `SAMPLE_NAME_R{1,2}.nb_bases` corresponds to the number of nucleotides into deduplicated R1 and R2 files. | +| `logs/` | Contains `SAMPLE_NAME.count_reads_on_contigs.flagstat`, `SAMPLE_NAME.count_reads_on_contigs.idxstats` and `SAMPLE_NAME_dedup_R{1,2}.nb_bases` files for each sample, generated after deduplication of reads. 
`SAMPLE_NAME.count_reads_on_contigs.flagstat` and `SAMPLE_NAME.count_reads_on_contigs.idxstats` are respectively the results of samtools flagstat (see informations [here](http://www.htslib.org/doc/samtools-flagstat.html)) and samtools idxstats (see description [here](http://www.htslib.org/doc/samtools-idxstats.html)), `SAMPLE_NAME_R{1,2}.nb_bases` corresponds to the number of nucleotides in the deduplicated R1 and R2 files. | #### **03_filtering/** @@ -68,10 +68,11 @@ The `results/` directory contains a sub-directory for each step launched: | ----------------------- | --------------------------------------- | | `SAMPLE_NAME.annotated.faa` | Protein sequence of structural annotated genes. | | `SAMPLE_NAME.annotated.ffn` | Nucleotide sequence of structural annotated genes. | -| `SAMPLE_NAME.annotated.gff` | Coordinates of structural annotated genes into contigs. | | `SAMPLE_NAME.annotated.fna` | Nucleotide sequence of contigs used by Prokka for the annotation of genes. In the used version of Prokka, it removes short contigs (<200bp). **WARNING:** these contigs are used in the following analysis. | +| `SAMPLE_NAME.annotated.gff` | Coordinates of structural annotated genes into contigs. | +| `SAMPLE_NAME_prot.len` | Length (in bp) of each gene annotated with Prokka | -**WARNING: from this step, the gene names follow this nomenclature: SAMPLE_NAME_cCONTIG_ID.Prot_PROT_ID. Contig names follow the same nomenclature: SAMPLE_NAME_cCONTIG_ID.** +**WARNING: starting from this step, the gene names follow this nomenclature: SAMPLE_NAME_CONTIG_ID.Prot_PROT_ID. Contig names follow the same nomenclature: SAMPLE_NAME_CONTIG_ID.** #### **05_alignment/05_1_reads_alignment_on_contigs/** @@ -82,13 +83,13 @@ The `results/` directory contains a sub-directory for each step launched: | `SAMPLE_NAME.sort.bam` | Alignment of reads on contigs (.bam file). | | `SAMPLE_NAME.sort.bam.bai` | Index of .bam file. | | `SAMPLE_NAME.sort.bam.idxstats` | Samtools idxstats file. See description [here](http://www.htslib.org/doc/samtools-idxstats.html). | +| `SAMPLE_NAME_coverage.tsv` | Samtools coverage file. See description [here](http://www.htslib.org/doc/samtools-coverage.html). | #### **05_alignment/05_2_database_alignment/** | File or directory/ | Description | | ----------------------- | --------------------------------------- | -| `head.m8` | Header of all .m8 files in all the next folders. | -| `SAMPLE_NAME/SAMPLE_NAME_aln_diamond.m8` | Diamond results file. See the head.m8 file to see each column title. | +| `SAMPLE_NAME/SAMPLE_NAME_aln_diamond.m8` | Diamond results file. | #### **06_func_annot/06_1_clustering/** @@ -115,10 +116,10 @@ The `results/` directory contains a sub-directory for each step launched: | File | Description | | ----------------------- | --------------------------------------- | -| `SAMPLE_NAME_diamond_one2one.emapper.seed_orthologs` | eggNOG-mapper intermediate file containing seed match into eggNOG database. | -| `SAMPLE_NAME_diamond_one2one.emapper.annotations` | eggNOG-mapper final file containing functional annotations for genes with a match into eggNOG database. | -| `SAMPLE_NAME.best_hit` | Diamond best hits results for each gene. For a gene, best hits are diamond hits with the maximum bitScore for this gene. | -| `Quantifications_and_functional_annotations.tsv` | Table where a row corresponds to an inter-sample cluster. 
Columns corresponds to quantification of the sum of aligned reads on all gens of each inter-sample cluster (columns `*featureCounts.tsv`), sum of abundance in all samples (column `sum`), eggNOG-mapper results (from `seed_eggNOG_ortholog` to `PFAMs` column) and diamond best hits results (last two columns `diamond_db_id`and `diamond_db_description`). | +| `SAMPLE_NAME_diamond_one2one.emapper.seed_orthologs` | eggNOG-mapper intermediate file containing seed matches in the eggNOG database. | +| `SAMPLE_NAME_diamond_one2one.emapper.annotations` | eggNOG-mapper final file containing functional annotations for genes with a match in the eggNOG database. | +| `SAMPLE_NAME.best_hit` | Diamond best hits results for each gene. Best hits are diamond hits with the maximum bitScore for this gene. | +| `Quantifications_and_functional_annotations.tsv` | Table where a row corresponds to an inter-sample cluster. Columns correspond to the quantification of the sum of aligned reads on all genes of each inter-sample cluster (columns `*featureCounts.tsv`), the sum of abundance in all samples (column `sum`), eggNOG-mapper results (from `seed_eggNOG_ortholog` to `PFAMs` column) and diamond best hits results (the last two columns, `sseqid` and `stitle`, correspond to `diamond_db_id` and `diamond_db_description`). | | `GOs_abundance.tsv` | Quantification table storing for each GO term (rows) the sum of aligned reads into all genes having this functional annotation for each sample (columns). | | `KEGG_ko_abundance.tsv` | Quantification table storing for each KEGG_ko (rows) the sum of aligned reads into all genes having this functional annotation for each sample (columns). | | `KEGG_Pathway_abundance.tsv` | Quantification table storing for each KEGG_Pathway (rows) the sum of aligned reads into all genes having this functional annotation for each sample (columns). | @@ -140,35 +141,6 @@ The `results/` directory contains a sub-directory for each step launched: | `quantification_by_contig_lineage_all.tsv` | Quantification table of reads aligned on contigs affiliated to each lineage. One line = one taxonomic affiliation with its lineage (1st column, `lineage_by_level`), the taxon id at each level of this lineage (2nd column, `tax_id_by_level`), and then all next 3-columns blocks correspond to one sample. Each 3-column block corresponds to the name of contigs affiliated to this lineage (1st column, `name_contigs_SAMPLE_NAME_quantif_percontig`), the number of contigs affiliated to this lineage (2nd column, `nb_contigs_SAMPLE_NAME_quantif_percontig`), the sum of the number of reads aligned to these contigs (3rd column, `nb_reads_SAMPLE_NAME_quantif_percontig`) and the mean depth of these contigs (4th column, `depth_SAMPLE_NAME_quantif_percontig`). | | `quantification_by_contig_lineage_[taxonomic_level].tsv` | One file by taxonomic level (superkingdom, phylum, order, class, family, genus, species). Quantification table of reads aligned on contigs affiliated to each lineage of the corresponding [taxonomic level]. One line = one taxonomic affiliation at this [taxonomic level] with its taxon id (1st column, `tax_id_by_level`), its lineage (2nd column, `lineage_by_level`), and then all next 3-columns blocks correspond to one sample.
Each 3-column block corresponds to the name of contigs affiliated to this lineage (1st column, `name_contigs_SAMPLE_NAME_quantif_percontig_by_[taxonomic_level]`), the number of contigs affiliated to this lineage (2nd column, `nb_contigs_SAMPLE_NAME_quantif_percontig_by_[taxonomic_level]`) and the sum of the number of reads aligned to these contigs (3rd column, `nb_reads_SAMPLE_NAME_quantif_percontig_by_[taxonomic_level]`) and the mean depth of these contigs (4th column, `depth_SAMPLE_NAME_quantif_percontig_by_[taxonomic_level]`). | -#### **08_binning/08_1_binning/** - -| File | Description | -| ----------------------- | --------------------------------------- | -| `MetaBAT2/SAMPLE_NAME.id_bin.fa` | Nucleotide sequence of contigs group together into the bin `id_bin` in the sample `SAMPLE_NAME`. | - -#### **08_binning/08_2_QC/** - -| File or directory/ | Description | -| ----------------------- | --------------------------------------- | -| `QUAST/` | All metaQUAST QC about bins. | -| `BUSCO/` | All BUSCO QC about bins. | -| `busco_summary.txt` | Summary of BUSCO results. | -| `quast_summary.tsv` | Summary of metaQUAST results. | -| `quast_and_busco_summary.tsv` | Summary of metaQUAST and BUSCO results. | -| `SAMPLE_NAME-busco_figure.png` | BUSCO quality assessment figure. | - -#### **08_binning/08_3_taxonomy/** - -| File | Description | -| ----------------------- | --------------------------------------- | -| `SAMPLE_NAME.bin2classification.names.txt` | BAT taxonomic classification of bins. One line = one bin (1st column, `bin`). See [this file](https://github.com/dutilh/CAT#interpreting-the-output-files) for explanations. | -| `SAMPLE_NAME.ORF2LCA.names.txt` | BAT taxonomic classification of ORF. One line = one ORF (1st column, `ORF`), its corresponding bin (2nd column, `bin`), "the bit-score the top-hit bit-score that is assigned to the ORF for voting" (3rd column, `lineage bit-score`) and the lineage name (4th column, `full lineage names`). See [this file](https://github.com/dutilh/CAT#interpreting-the-output-files) for explanations. | -| `raw/SAMPLE_NAME.log` | BAT log file. | -| `raw/SAMPLE_NAME.bin2classification.txt` | Same as `08_3_taxonomy/SAMPLE_NAME.bin2classification.names.txt` without last column. | -| `raw/SAMPLE_NAME.ORF2LCA.txt` | Same as `08_3_taxonomy/SAMPLE_NAME.ORF2LCA.names.txt` without last column. | -| `raw/SAMPLE_NAME.concatenated.predicted_proteins.gff` | Annotation of proteins into each contig of each bin by BAT. Each line corresponds to `SAMPLE_NAME.id_bin.fa_SAMPLE_NAME_cCONTIG_id`. | -| `raw/SAMPLE_NAME.predicted_proteins.faa` | Protein sequence of proteins predicted into each contig of each bin by BAT. Each line corresponds to `SAMPLE_NAME.id_bin.fa_SAMPLE_NAME_cCONTIG_id`. | - #### **MultiQC/** | File | Description | diff --git a/docs/usage.md b/docs/usage.md index c47b48890d2e470c18ee1d58878aca7e776a7df2..2644fc85d1d6009e48b5a025b89627b45a7a6b6a 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -2,38 +2,51 @@ ## I. Basic usage -1. See [Installation page](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/installation.md) to install metagWGS. +1. See [Installation page](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/installation.md) to install metagWGS. 2. See [Functional tests](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/functional_tests/README.md) to download the test dataset(s). -3. Run a basic script: +3. Setup the samplesheet (example in `metagwgs-test-datasets/small/input/samplesheet.csv`). 
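> The samplesheet is a plain CSV with a `sample,fastq_1,fastq_2` header and one line per sample, as the example below shows. As a convenience, here is a minimal, hypothetical sketch of how such a file could be generated; it assumes paired files named `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` sitting in a single directory (adapt names and paths to your data):

> ```bash
> # Hypothetical helper: build samplesheet.csv from paired FASTQ files.
> DATASET=/path/to/your/fastq_directory
> echo "sample,fastq_1,fastq_2" > samplesheet.csv
> for r1 in "$DATASET"/*_R1.fastq.gz; do
>     sample=$(basename "$r1" _R1.fastq.gz)
>     echo "${sample},${r1},${r1%_R1.fastq.gz}_R2.fastq.gz" >> samplesheet.csv
> done
> ```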
- > The next script is a script working on **genologin slurm cluster**. Il allows to run the default [step](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/README.md#metagwgs-steps) `01_clean_qc` of the pipeline (without host reads deletion and taxonomic affiliation of reads). + > ``` + > sample,fastq_1,fastq_2 + > a1,$DATASET/a1_R1.fastq.gz,$DATASET/a1_R2.fastq.gz + > a2,$DATASET/a2_R1.fastq.gz,$DATASET/a2_R2.fastq.gz + > c,$DATASET/c_R1.fastq.gz,$DATASET/c_R2.fastq.gz + > ``` + +4. Run a basic script: + + > The following script works on the **genologin slurm cluster**. It allows you to run the [step](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/README.md#metagwgs-steps) `S01_CLEAN_QC` of the pipeline (without host reads deletion and taxonomic affiliation of reads). **WARNING:** You must adapt it if you want to run it on your cluster. You must install/load Nextflow and Singularity, and define a specific configuration for your cluster. - * Write in a file `Script.sh`: + * Create a file named `Script.sh` with: > ```bash > #!/bin/bash > #SBATCH -p workq > #SBATCH --mem=6G - > module purge + > > module load bioinfo/Nextflow-v20.01.0 > module load system/singularity-3.5.3 - > nextflow run -profile test_genotoul_workq metagwgs/main.nf --reads "<PATH_TO_DATASET_INPUT>/*_{R1,R2}.fastq.gz" --skip_removal_host --skip_kaiju + > nextflow run -profile test_genotoul_workq metagwgs/main.nf \ > --type 'SR' \ > --input 'metagwgs-test-datasets/small/input/samplesheet.csv' \ > --skip_host_filter --skip_kaiju > ``` > **NOTE:** you can change Nextflow and Singularity versions with other versions available on the cluster (see all versions with `search_module ToolName`). Nextflow version must be >= v20 and Singularity version must be >= v3. * Run `Script.sh` with this command line: + > ```bash > sbatch Script.sh > ``` See the description of output files in [this part](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/output.md) of the documentation and [there](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/functional_tests/README.md#iii-output) - `Script.sh` is a basic script that requires only a small test data input and no other files. To analyze real data, in addition to your metagenomic whole genome shotgun `.fastq` files, you need to download different files which are described in the next chapter. + `Script.sh` is a basic script that requires only a small test data input and no other files. To analyze real data, in addition to your metagenomic whole genome shotgun `.fastq/.fastq.gz` and/or assembly `.fa/.fasta` files, you need to download different files which are described in the next chapter. > **WARNING:** if you run metagWGS to **analyze real metagenomics data on genologin cluster**, you have to use the `unlimitq` queue to run your Nextflow script. To do this, instead of writing in the second line of your script `#SBATCH -p workq` you need to write `#SBATCH -p unlimitq`. @@ -42,35 +55,34 @@ ### 1. General mandatory files Launching metagWGS involves the use of mandatory files: -* The **metagenomic whole genome shotgun data** you want to analyze: `.fastq` or `.fastq.gz` R1 and R2 files (Illumina HiSeq3000 or NovaSeq sequencing, 2*150bp). For a cleaner MultiQC html report at the end of the pipeline, raw data with extensions `_R1` and `_R2` are preferred to those with extensions `_1` and `_2`. -* The **metagWGS.sif**, **eggnog_mapper.sif** and **mosdepth.sif** Singularity images (in `metagwgs/env` folder).
+* The **metagenomic whole genome shotgun data** you want to analyze: `.fastq/.fastq.gz` R1 and R2 files (Illumina HiSeq3000 or NovaSeq sequencing, 2\*150bp). For a cleaner MultiQC html report at the end of the pipeline, raw data with extensions `_R1` and `_R2` are preferred to those with extensions `_1` and `_2`. +* Or the **assemblies** you want to analyse: `.fa/.fasta` (contigs assembled from PacBio HiFi single-end 'long-reads'). +* The **metagWGS.sif** and **eggnog_mapper.sif** Singularity images (in `metagwgs/env` folder). ### 2. Mandatory files for certain steps In addition to the general mandatory files, if you wish to launch certain steps of the pipeline, you will need previously generated or downloaded files: -* Step `01_clean_qc`, **only if you want to remove host reads**: you need a fasta file of the genome. - -* Step `05_alignment` **(against a protein database)**: download the protein database you want to use. For example you can use NR database. +* Step `S01_CLEAN_QC`, **only if you want to filter host reads**: you need a fasta file of the host genome. -* Step `08_binning`, **taxonomic affiliation of bins**: you need to download CAT/BAT database with `wget tbb.bio.uu.nl/bastiaan/CAT_prepare/CAT_prepare_20210107.tar.gz` +* Step `S05_ALIGNMENT` **(against a protein database)**: download the protein database you want to use. For example you can use NR database (in .dmnd format). **WARNINGS:** -- if you use step `02_assembly` or `03_filtering` or `04_structural_annot` or `05_alignment` or `06_func_annot` or `07_taxo_affi` or `08_binning` without skipping `01_clean_qc` or host reads removal, you need to use the mandatory files of step `01_clean_qc`. +- if you use steps `S02_ASSEMBLY` or `S03_FILTERING` or `S04_STRUCTURAL_ANNOT` or `S05_ALIGNMENT` or `S06_FUNC_ANNOT` or `S07_TAXO_AFFI` without skipping `S01_CLEAN_QC` or host reads filtering, you need to use the mandatory files of step `S01_CLEAN_QC`. **AND** -- if you use step `06_func_annot` or `07_taxo_affi` or `08_binning`, you need to use the mandatory files of step `05_alignment`. +- if you use steps `S06_FUNC_ANNOT` or `S07_TAXO_AFFI`, you need to use the mandatory files of step `S05_ALIGNMENT`. ### 3. Others files for certain steps -In addition to the `general mandatory files` and `mandatory files for certain steps`, if you wish to launch certain steps of the pipeline, you can download files before running metagWGS. It is not mandatory but it avoids unnecessary downloads. +In addition to the `general mandatory files` and `mandatory files for certain steps`, if you wish to launch certain steps of the pipeline, you can download files/databases before running metagWGS. It is not mandatory but it avoids unnecessary downloads. -* Step `01_clean_qc`: +* Step `S01_CLEAN_QC`: - * **Only if you want to remove host reads**: if you also have the BWA index (.amb, .ann, .bwt, .pac and .sa files) of the host genome fasta file, you can specify it as a metagWGS parameter: `--host_bwa_index`. + * **Only if you want to filter host reads**: if you also have the BWA index (.amb, .ann, .bwt, .pac and .sa files) of the host genome fasta file, you can specify it as a metagWGS parameter: `--host_index`. - * **Only if you want to have the taxonomic affiliation of reads**: you can previously download kaiju database index [here](http://kaiju.binf.ku.dk/server) (blue insert on the left side, click right of the desired database -> copy the link address). 
For example, download `refseq 2020-05-25 (17GB)` with `wget http://kaiju.binf.ku.dk/database/kaiju_db_refseq_2020-05-25.tgz` and unpack it with `tar -zxvf kaiju_db_refseq_2020-05-25.tgz`. This file is not mandatory, a metagWGS parameter allows to download automatically the wanted database among all available in the [kaiju website](http://kaiju.binf.ku.dk/server). **WARNING:** you are not authorized to use kaiju database built with `kaiju-makedb` command line. + * **Only if you want to have the taxonomic affiliation of reads**: you can previously download kaiju database index [here](http://kaiju.binf.ku.dk/server) (blue insert on the left side, click right of the desired database -> copy the link address). For example, download `refseq 2020-05-25 (17GB)` with `wget http://kaiju.binf.ku.dk/database/kaiju_db_refseq_2020-05-25.tgz` and unpack it with `tar -zxvf kaiju_db_refseq_2020-05-25.tgz`. These files are not mandatory, a metagWGS parameter allows to download automatically the wanted database among all available in the [kaiju website](http://kaiju.binf.ku.dk/server). **WARNING:** you are not authorized to use kaiju database built with `kaiju-makedb` command line. Analyzing your metagenomic data with metagWGS allows you to use all **`nextflow run` options** in your `nextflow run` command line and different **metagWGS specific parameters**. Some of these specific parameters are useful to indicate the `<PATH>` to these input files. The next chapters will explain these options and parameters. @@ -91,12 +103,12 @@ It allows you to choose the configuration profile among: process.executor = 'slurm' ``` * **NOTE 3:** on [genologin cluster](http://bioinfo.genotoul.fr/) Miniconda is already installed. You can search Miniconda module with `search_module Miniconda` and load it with `module load chosen_miniconda_module`. - * `genotoul` to analyze **your files** with metagWGS **on genologin cluster** with Singularity images `metagWGS.sif`, `eggnog_mapper.sif` and `mosdepth.sif`. - * `test_genotoul_workq` to analyze **small test data files** (used in [I. Basic Usage](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#i-basic-usage)) with metagWGS **on genologin cluster** on the **`workq`** queue with Singularity images `metagWGS.sif`, `eggnog_mapper.sif` and `mosdepth.sif`. - * `test_genotoul_testq` to analyze **small test data files** (used in [I. Basic Usage](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#i-basic-usage)) with metagWGS **on genologin cluster** on the **`testq`** queue with Singularity images `metagWGS.sif`, `eggnog_mapper.sif` and `mosdepth.sif`. - * `big_test_genotoul` to analyze **big test data files** (used in [Use case documentation page](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/use_case.md)) with metagWGS **on genologin cluster** (on the **`workq`** queue) with Singularity images `metagWGS.sif`, `eggnog_mapper.sif` and `mosdepth.sif`. - * `test_local` to analyze **small test data files** (used in [I. Basic Usage](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#i-basic-usage)) with metagWGS **on your computer** with Singularity images `metagWGS.sif`, `eggnog_mapper.sif` and `mosdepth.sif`. - * `debug` to **debug** metagWGS pipeline: in `pipeline_trace.txt` a new column `script` will show the command used for each task. + * `genotoul` to analyze **your files** with metagWGS **on genologin cluster** with Singularity images `metagWGS.sif` and `eggnog_mapper.sif`. 
+ * `test_genotoul_workq` to analyze **small test data files** (used in [I. Basic Usage](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#i-basic-usage)) with metagWGS **on genologin cluster** on the **`workq`** queue with Singularity images `metagWGS.sif` and `eggnog_mapper.sif`. + * `test_genotoul_testq` to analyze **small test data files** (used in [I. Basic Usage](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#i-basic-usage)) with metagWGS **on genologin cluster** on the **`testq`** queue with Singularity images `metagWGS.sif` and `eggnog_mapper.sif`. + * `big_test_genotoul` to analyze **big test data files** (used in [Use case documentation page](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/use_case.md)) with metagWGS **on genologin cluster** (on the **`workq`** queue) with Singularity images `metagWGS.sif` and `eggnog_mapper.sif`. + * `test_local` to analyze **small test data files** (used in [I. Basic Usage](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#i-basic-usage)) with metagWGS **on your computer** with Singularity images `metagWGS.sif` and `eggnog_mapper.sif`. + * `debug` to **debug** metagWGS pipeline. These profiles are associated to different configuration files developped [in this directory](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/tree/master/conf). The `base.config` file available in this directory is the base configuration load in first which is crushed by indications of the profile you use. See [here](https://genotoul-bioinfo.pages.mia.inra.fr/use-nextflow-nfcore-course/nfcore/profiles.html) for more explanations. @@ -132,50 +144,41 @@ The next parameters can be used when you run metagWGS. **NOTE:** the specific parameters of the pipeline are indicated by `--` in the command line. -### 1. Mandatory parameter: `--reads` - -`--reads "<PATH>/*_{R1,R2}.fastq.gz"`: indicate location of `.fastq` or `.fastq.gz` input files. For example, `--reads "<PATH>/*_{R1,R2}.fastq.gz"` run the pipeline with all the `R1.fastq.gz` and `R2.fastq.gz` files available in the indicated `<PATH>`. For a cleaner MultiQC html report at the end of the pipeline, raw data with extensions `_R1` and `_R2` are preferred to those with extensions `_1` and `_2`. +### 1. Mandatory parameter: `--input` -### 2. `--step` +`--input "<PATH>/samplesheet.csv"`: indicate location of the samplesheet containing paths to the input reads `.fastq/.fastq.gz` and/or assembly `.fa/.fasta` files. For example, `--input "<PATH>/samplesheet.csv"` runs the pipeline with all the `R1.fastq.gz` and `R2.fastq.gz` files available in the indicated `<PATH>`. For a cleaner MultiQC html report at the end of the pipeline, raw data with extensions `_R1` and `_R2` are preferred to those with extensions `_1` and `_2`. -`--step "your_step"`: indicate the step of the pipeline you want to run. The steps available are described in the [`README`](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/tree/master#metagwgs-steps) (`01_clean_qc`, `02_assembly`, `03_filtering`, `04_structural_annot`, `05_alignment`, `06_func_annot`, `07_taxo_affi` and `08_binning`). +### 2. `--stop_at_[STEP]` and `--skip_[STEP]` parameters: -**NOTES:** +By default, all steps of metagWGS will be launched on the input data. From S01_CLEAN_QC to S07_TAXO_AFFI. -**i. You can directly indicate the final step that is important to you. 
For example, if you are interested in binning (and the taxonomic affiliation of bins), just use `--step "08_binning"`. It runs the mandatory previous steps automatically (01_clean_qc to 05_alignment ; except `03_filtering`, see ii).** +#### Stop at: -**ii. `03_filtering` is automatically skipped for the next steps `04_structural_annot`, `05_alignment`, `06_func_annot`, `07_taxo_affi` and `08_binning`. If you want to filter your assembly before doing one of these steps, you must use `--step "03_filtering,the_step"`, for example `--step "03_filtering,04_structural_annot"`.** +You can use a `stop_at_[STEP]` parameter to launch only the steps leading to and including the one you specified. -**iii. When you run one of the three steps `06_func_annot`, `07_taxo_affi` or `08_binning` during a first analysis and then another of these steps interests you and you run metagWGS again to get the result of this other step, you have to indicate `--step "the_first_step,the_second_step"`. This will allow you to have a final MultiQC html report that will take into account the metrics of both analyses performed. If the third of these steps interests you and you run again metagWGS for this step, you also have to indicate `--step "the_first_step,the_second_step,the_third,step"` for the same reasons.** +`--stop_at_[STEP]`: indicate the step of the pipeline you want to stop at. The steps available are described in the [`README`](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/tree/master#metagwgs-steps) (`S01_CLEAN_QC`, `S02_ASSEMBLY`, `S03_FILTERING`, `S04_STRUCTURAL_ANNOT`). -When you want to run a particular step, you just need to specify its name: +**NOTE: `S05_ALIGNMENT`, `S06_FUNC_ANNOT` and `S07_TAXO_AFFI` being the 3 last steps, there is no `--stop_at_[STEP]`; see 'Skip' subsection for more information.** - * `01_clean_qc` step: `--step "01_clean_qc"`. This step is automatically done in all others steps. - If you want to skip this step into the other steps, add parameter `--skip_01_clean_qc` in your command line. Usefull when you have already checked and cleaned your `.fastq` files: you can put in input data (`--reads` parameter) your cleaned `.fastq` files and run directly the `02_assembly` step or other steps. +For each [STEP](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/README.md#metagwgs-steps), specific parameters are available. You can add them to the command line and run the pipeline. They are described in the section [other parameters step by step](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#other-parameters-step-by-step). - * `02_assembly` step: `--step "02_assembly"`. This step is automatically done in all others steps. Assembly is done on reads cleaned with `01_clean_qc` step. +#### Skip: - * `03_filtering` step: `--step "03_filtering"`. **WARNING:** By default, if you want to run one of the next steps (`04_structural_annot`, `05_alignment`, `06_func_annot`, `07_taxo_affi` and `08_binning`), `03_filtering` is **not done**. If you want to do the `03_filtering` step when you run these steps, you must indicate `--step "03_filtering,the_step"` with `the_step` a step in `04_structural_annot`, `05_alignment`, `06_func_annot`, `07_taxo_affi` and `08_binning`. +In parallel, you can use any combination of `--skip_[STEP]` parameters you want to skip certain steps: - * `04_structural_annot` step: `--step "04_structural_annot"`. 
This step is automatically done in all others steps +`--skip_clean`: skip the `S01_CLEAN_QC` step entirely, beginning the pipeline at step `S02_ASSEMBLY` - * `05_alignment` step: `--step "05_alignment"`. This step is automatically done in all others steps +`--skip_filtering`: skip the `S03_FILTERING` step entirely, continuing the pipeline at step `S04_STRUCTURAL_ANNOT` - * `06_func_annot` step: `--step "06_func_annot"`. +`--skip_func_annot`: skip the `S06_FUNC_ANNOT` step entirely, ending the pipeline at step `S05_ALIGNMENT` or `S07_TAXO_AFFI` - * `07_taxo_affi` step: `--step "07_taxo_affi"`. +`--skip_taxo_affi`: skip the `S07_TAXO_AFFI` step entirely, ending the pipeline at step `S05_ALIGNMENT` or `S06_FUNC_ANNOT` - * `08_binning` step: `--step "08_binning"`. - -Default: `01_clean_qc`. - -For each [step](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/README.md#metagwgs-steps), specific parameters are available. You can add them to the command line and run the pipeline. They are described in the next section: [other parameters step by step](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#other-parameters-step-by-step). +**NOTE: Using both `--skip_func_annot` and `--skip_taxo_affi` parameters will stop the pipeline at step `S05_ALIGNMENT`** ### 3. Other parameters step by step -#### **`01_clean_qc` step:** - -**NOTE:** this step can be skipped with `--skip_01_clean_qc` parameter. See **WARNING 1**. +#### **`S01_CLEAN_QC` step:** There are 5 substeps in this step each of them with specific parameters: @@ -187,27 +190,28 @@ There are 5 substeps in this step each of them with specific parameters: **2. Remove low quality reads with sickle** -* `--skip_sickle`: allows to skip sickle substep. Use: `--skip_sickle`. +* `--skip_sickle`: allows to skip sickle substep. * `--quality_type "solexa" or "illumina" or "sanger"`: sickle -t quality type parameter. Default: `"sanger"`. **3. Remove host reads with bwa, samtools and bedtoools** -* `--skip_removal_host` allows to skip the deletion of host reads. Use: `--skip_removal_host`. See **WARNING 1**. +* `--skip_host_filter` allows to skip the deletion of host reads. -* `--host_fasta "<PATH>/name_genome.fasta"`: indicate the nucleotide sequence of the host genome. Default: `""`. See **WARNING 1**. Depending on the size of your file, you may need to tweak the memory and cpus settings of the Nextflow process to filter the host reads. If this is the case, create a `nextflow.config` file in our working directory and modify these parameters, such as : +* `--host_fasta "<PATH>/genome_name.fasta"`: indicate the nucleotide sequence of the host genome. Default: `""`. Depending on the size of your file, you may need to tweak the memory and cpus settings of the Nextflow process to filter the host reads. If this is the case, create a `nextflow.config` file in our working directory and modify these parameters, such as : ```bash -withName: host_filter { -memory = { 200.GB * task.attempt } -time = '48h' -cpus = 8 +withName: HOST_FILTER { + memory = { 200.GB * task.attempt } + time = '48h' + cpus = 8 } ``` -* `--host_bwa_index "<PATH>/name_genome.{amb,ann,bwt,pac,sa}"`: indicate the bwa index files if they are already built. Default: `""` corresponding to the building of bwa index files by metagWGS. See **WARNING 1**. -**WARNING 1:** you need to use `--skip_removal_host` or `--host_fasta` or `--skip_01_clean_qc`. If it is not the case, an error message will occur. 
+* `--host_index "<PATH>/genome_name.{amb,ann,bwt,pac,sa}"`: indicate the bwa index files if they are already built/downloaded. Default: `""` corresponding to the building of bwa index files by metagWGS. + +**WARNING 1:** you need to use either `--skip_host_filter` or `--host_fasta` or `--skip_clean`. If it is not the case, an error message will occur. -**4. Quality control of raw data and cleaned data with fastQC** +**4. Quality control of raw data and cleaned data with FastQC** No parameter available for this substep. @@ -217,84 +221,80 @@ No parameter available for this substep. * `--kaiju_db "http://kaiju.binf.ku.dk/database/CHOOSEN_DATABASE.tgz"`: allows metagWGS to download the kaiju database of your choice. The list of kaiju databases is available in [kaiju website](http://kaiju.binf.ku.dk/server), in the blue insert on the left side. Default: `--kaiju_db https://kaiju.binf.ku.dk/database/kaiju_db_refseq_2021-02-26.tgz`. See **WARNING 2**. -* `--kaiju_verbose`: allows production of kaiju_MEM_verbose.out files (which can be huge) containing information about matches lengths and graphics. - * `--skip_kaiju`: allows to skip taxonomic affiliation of reads with kaiju. Krona files will not be generated. Use: `--skip_kaiju`. See **WARNING 2**. -**WARNING 2:** you need to use `--kaiju_db_dir` or `--kaiju_db` or `--skip_kaiju`. If it is not the case, an error message will occur. +**WARNING 2:** you need to use either `--kaiju_db_dir` or `--kaiju_db` or `--skip_kaiju`. If it is not the case, an error message will occur. -#### **`02_assembly` step:** +#### **`S02_ASSEMBLY` step:** -**WARNING 3:** `02_assembly` step depends on `01_clean_qc` step. You need to use the mandatory files of these two steps to run `02_assembly`. See [II. Input files](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#ii-input-files) and WARNINGS 1 and 2. +**WARNING 3:** `S02_ASSEMBLY` step depends on `S01_CLEAN_QC` step. You need to use the mandatory files of these two steps to run `S02_ASSEMBLY`. See [II. Input files](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#ii-input-files) and **WARNINGS 1 and 2**. * `--assembly ["metaspades" or "megahit"]`: allows to indicate the assembly tool. Default: `metaspades`. -**WARNING 4:** the user has choice between `metaspades` or `megahit` for `--assembly` parameter. The choice can be based on CPUs and memory availability: `metaspades` needs more CPUs and memory than `megahit` but our tests showed that assembly metrics are better for `metaspades` than `megahit`. +**WARNING 4:** the user can choose between `metaspades` or `megahit` for `--assembly` parameter. The choice can be based on CPUs and memory availability: `metaspades` needs more CPUs and memory than `megahit` but our tests showed that assembly metrics are better for `metaspades` than `megahit`. * `--metaspades_mem [memory_value]`: memory (in Gb) used by `metaspades` process. Default: `440`. -#### **`03_filtering` step:** +#### **`S03_FILTERING` step:** -**WARNING 5:** `03_filtering` step depends on `01_clean_qc` and `02_assembly` steps. You need to the use mandatory files of these three steps to run `03_filtering`. See [II. Input files](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#ii-input-files) and WARNINGS 1, 2, 3 and 4. +**WARNING 5:** `S03_FILTERING` step depends on `S01_CLEAN_QC` and `S02_ASSEMBLY` steps. You need to the use mandatory files of these three steps to run `S03_FILTERING`. See [II. 
Input files](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#ii-input-files) and **WARNINGS 1, 2, 3 and 4**. -**WARNING 6:** this step is not done by default when you launch next steps. +**WARNING 6:** this step is not done by default when you launch the next steps. * `--min_contigs_cpm [cutoff_value]`: CPM (Count Per Million) cutoff to filter contigs with low number of reads. [cutoff_value] can be a decimal number (example: `0.5`). Default: `1`. -#### **`04_structural_annot` step:** +#### **`S04_STRUCTURAL_ANNOT` step:** No parameters. -**WARNING 7:** `04_structural_annot` step depends on `01_clean_qc`, `02_assembly` and `03_filtering` (if you use it) steps. You need to use the mandatory files of these four steps to run `04_structural_annot`. See [II. Input files](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#ii-input-files) and WARNINGS from 1 to 6. +**WARNING 7:** `S04_STRUCTURAL_ANNOT` step depends on `S01_CLEAN_QC`, `S02_ASSEMBLY` and `S03_FILTERING` steps (if you use it). You need to use the mandatory files of these four steps to run `S04_STRUCTURAL_ANNOT`. See [II. Input files](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#ii-input-files) and **WARNINGS 1 to 6**. -**WARNING 8:** if you haven't associated this step with `03_filtering`, calculation time of `04_structural_annot` can be important. Some cluster queues have defined calculation time, you need to adapt the queue you use to your data. -> For example, if you are on [genologin cluster](http://bioinfo.genotoul.fr/) and you haven't done `03_filtering` step, you can create into your working directory a file `nextflow.config` containing: +**WARNING 8:** if you haven't previously done `S03_FILTERING`, calculation time of `S04_STRUCTURAL_ANNOT` can be important. Some cluster queues have defined calculation time, you need to adapt the queue you use to your data. +> For example, if you are on [genologin cluster](http://bioinfo.genotoul.fr/) and you haven't done the `S03_FILTERING` step, you can write a `nextflow.config` file in your working directory containing these lines: > ```bash > withName: prokka { > queue = 'unlimitq' > } > ``` -> This will launch the `Prokka` command line of step `04_structural_annot` on a calculation queue (`unlimitq`) where the job can last more than 4 days (which is not the case for the usual `workq` queue). +> This will launch the `Prokka` command line of step `04_STRUCTURAL_ANNOT` on a calculation queue (`unlimitq`) where the job can last more than 4 days (which is not the case for the usual `workq` queue). -#### **`05_alignment` step:** +#### **`S05_ALIGNMENT` step:** -**WARNING 9:** `05_alignment` step depends on `01_clean_qc`, `02_assembly`, `03_filtering` (if you use it) and `04_structural_annot` steps. You need to use the mandatory files of these five steps to run `05_alignment`. See [II. Input files](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#ii-input-files) and WARNINGS from 1 to 8. +**WARNING 9:** `S05_ALIGNMENT` step depends on `S01_CLEAN_QC`, `S02_ASSEMBLY`, `S03_FILTERING` (if you use it) and `S04_STRUCTURAL_ANNOT` steps. You need to use the mandatory files of these five steps to run `S05_ALIGNMENT`. See [II. Input files](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#ii-input-files) and **WARNINGS 1 to 8**. * `--diamond_bank "<PATH>/bank.dmnd"`: path to diamond bank used to align protein sequence of genes. 
-**WARNING 10:** You need to use a NCBI reference to have functional links in the output file _Quantifications_and_functional_annotations.tsv_ of `06_func_annot` step +**WARNING 10:** You need to use an NCBI reference to have functional links in the output file _Quantifications_and_functional_annotations.tsv_ of `S06_FUNC_ANNOT` step -#### **`06_func_annot` step:** +#### **`S06_FUNC_ANNOT` step:** -**WARNING 11:** `06_func_annot` step depends on `01_clean_qc`, `02_assembly`, `03_filtering` (if you use it), `04_structural_annot` and `05_alignment` steps. You need to use the mandatory files of these six steps to run `06_func_annot`. See [II. Input files](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#ii-input-files) and WARNINGS from 1 to 9. +**WARNING 11:** `S06_FUNC_ANNOT` step depends on `S01_CLEAN_QC`, `S02_ASSEMBLY`, `S03_FILTERING` (if you use it), `S04_STRUCTURAL_ANNOT` and `S05_ALIGNMENT` steps. You need to use the mandatory files of these six steps to run `S06_FUNC_ANNOT`. See [II. Input files](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#ii-input-files) and **WARNINGS 1 to 9**. * `--percentage_identity [number]`: corresponds to cd-hit-est -c option to indicate sequence percentage identity for clustering genes. Default: `0.95` corresponding to 95% of sequence identity. Use: `number` must be between 0 and 1, and use `.` when you want to use a float. -* `--eggnogmapper_db`: downloads eggNOG-mapper database. If you don't use this parameter, metagWGS doesn't download this database and you must use `--eggnog_mapper_db_dir`. Use: `--eggnogmapper_db`. See **WARNING 6**. +* `--eggnogmapper_db_download`: downloads eggNOG-mapper database. If you don't use this parameter, metagWGS doesn't download this database and you must use `--eggnog_mapper_db_dir`. Use: `--eggnogmapper_db_download`. See **WARNING 6**. -* `--eggnog_mapper_db_dir "<PATH>/database_directory/"`: indicates path to eggNOG-mapper database if you have already dowloaded it. If you run the `06_func_annot` step in different metagenomics projects, downloading the eggNOG-mapper database only once before running metagWGS avoids you to multiply the storage of this database and thus keep free disk space. See **WARNING 6**. +* `--eggnog_mapper_db_dir "<PATH>/database_directory/"`: indicates the path to the eggNOG-mapper database if you have already downloaded it. If you run the `S06_FUNC_ANNOT` step in different metagenomics projects, downloading the eggNOG-mapper database only once before running metagWGS avoids storing several copies of this database and saves disk space. See **WARNING 6**. -**WARNING 12:** you need to use `--eggnogmapper_db` or `--eggnog_mapper_db_dir`. If it is not the case, an error message will occur. +**WARNING 12:** you need to use either `--eggnogmapper_db_download` or `--eggnog_mapper_db_dir`. If it is not the case, an error message will occur. -#### **`07_taxo_affi` step:** +#### **`S07_TAXO_AFFI` step:** -**WARNING 13:** `07_taxo_affi` step depends on `01_clean_qc`, `02_assembly`, `03_filtering` (if you use it), `04_structural_annot` and `05_alignment` steps. You need to use the mandatory files of these six steps to run `07_taxo_affi`. See [II. Input files](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#ii-input-files) and WARNINGS from 1 to 9.
+**WARNING 13:** `S07_TAXO_AFFI` step depends on `S01_CLEAN_QC`, `S02_ASSEMBLY`, `S03_FILTERING` (if you use it), `S04_STRUCTURAL_ANNOT` and `S05_ALIGNMENT` steps. You need to use the mandatory files of these six steps to run `S07_TAXO_AFFI`. See [II. Input files](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#ii-input-files) and **WARNINGS 1 to 9**. * `--accession2taxid "FTP_PATH_TO_prot.accession2taxid.gz"`: indicates the FTP adress of the NCBI file `prot.accession2taxid.gz`. Default: `"ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz"`. * `--taxdump "FTP_PATH_TO_taxdump.tar.gz"`: indicates the FTP adress of the NCBI file `taxdump.tar.gz`. Default `"ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz"`. -* `--taxonomy_dir "<PATH>/directory": if you have already downloaded the accession2taxid and taxdump databases, indicate their parent directory. Default: `--taxonomy_dir false`.` - -#### **`08_binning` step:** +* `--taxonomy_dir "<PATH>/directory"`: if you have already downloaded the accession2taxid and taxdump databases, indicate their parent directory (see the example below). Default: `""`. -**WARNING 14:** `08_binning` step depends on `01_clean_qc`, `02_assembly`, `03_filtering` (if you use it), `04_structural_annot` and `05_alignment` steps. You need to use the mandatory files of these six steps to run `08_binning`. See [II. Input files](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#ii-input-files) and WARNINGS from 1 to 9. +**WARNING 14:** you need to use either `--accession2taxid` and `--taxdump`, or `--taxonomy_dir`. If it is not the case, an error message will occur. -* `--min_contig_size [cutoff_length]`: contig length cutoff to filter contigs before binning. Must be greater than `1500`. Default: `1500`. +**WARNING 15:** to get the taxonomic affiliation of contigs and genes, the protein database used in the `S05_ALIGNMENT` step has to come from NCBI, and your `taxdump` and `prot.accession2taxid` files must be consistent with it, i.e. downloaded at the same time as that protein database.
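As an illustration, here is a minimal sketch of pre-downloading the taxonomy files once and reusing them through `--taxonomy_dir` (the target directory name is a placeholder; depending on your setup, `taxdump.tar.gz` may also need to be extracted in that directory):

```bash
# Download the NCBI taxonomy files into one directory,
# then pass that directory to --taxonomy_dir.
mkdir -p taxonomy
cd taxonomy
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
```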
-* `--busco_reference "<PATH>/file_db"`: path to BUSCO database. Default: `"https://busco-archive.ezlab.org/v3/datasets/bacteria_odb9.tar.gz"`. **WARNING 15:** We use BUSCO v3 from the `metagWGS.sif` Singularity container. Be careful not to use the BUSCO reference of other BUSCO versions. +#### **`S08_binning` step:** -* `--cat_db "<PATH>/CAT_prepare_20190108.tar.gz"`: path to CAT/BAT database. Default: `false`. **WARNING 16:** you need to download this database before running metagWGS `08_binning` step. Download it with: `wget tbb.bio.uu.nl/bastiaan/CAT_prepare/CAT_prepare_20210107.tar.gz`. +**WARNING 16:** `S08_binning` step is not yet implemented in the DSL2 version of metagWGS. #### Others parameters @@ -302,7 +302,9 @@ No parameters. * `--outdir "dir_name"`: change the name of output directory. Default `"results"`. -* `--help`: print metagWGS help. Default: `false`. Use: `--help`. +* `--databases "dir_name"`: change the location where databases will be downloaded if not provided by the user. Default `"databases"` in working directory. + +* `--help`: print metagWGS help. Default: `false`. ## V. Description of output files @@ -310,6 +312,6 @@ See the description of output files in [this part](https://forgemia.inra.fr/geno ## VI. Analyze big test dataset with metagWGS in genologin cluster -> If you have an account in [genologin cluster](http://bioinfo.genotoul.fr/) and you would like to familiarize yourself with metagWGS, see the tutorial available in the [use case documentation page](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/use_case.md). It allows the analysis of big test datasets with metagWGS. +> If you have an account on the [genologin cluster](http://bioinfo.genotoul.fr/) and you would like to familiarize yourself with metagWGS, see the tutorial available in the [use case documentation page](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/use_case.md). It allows the analysis of big test datasets with metagWGS. -**WARNING 17:** the test dataset in `small_metagwgs-test-datasets` used in [I. Basic Usage](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#i-basic-usage) is a small test dataset which does not allow to test all steps (`08_binning` doesn't work with this dataset). +**WARNING 18:** the test dataset in `metagwgs-test-datasets/small` used in [I. Basic Usage](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md#i-basic-usage) is a small test dataset which allows all steps to be tested, but with few CPUs and little memory. diff --git a/env/Singularity_recipe_mosdepth b/env/Singularity_recipe_mosdepth deleted file mode 100644 index c1e86355d158e01ff502ade4e81fc1f807efaad6..0000000000000000000000000000000000000000 --- a/env/Singularity_recipe_mosdepth +++ /dev/null @@ -1,16 +0,0 @@ -Bootstrap: docker -From: continuumio/miniconda3 -IncludeCmd: yes - -%files -env/mosdepth.yml / - -%post -apt-get update && apt-get install -y procps && apt-get clean -y -/opt/conda/bin/conda env create -f /mosdepth.yml && /opt/conda/bin/conda clean -a - -%environment -export PATH=/opt/conda/envs/mosdepth/bin:$PATH - -%runscript - "$@" diff --git a/env/mosdepth.yml b/env/mosdepth.yml deleted file mode 100644 index 76fc534dd18e82c45049654dc67c2d6cf215e6c1..0000000000000000000000000000000000000000 --- a/env/mosdepth.yml +++ /dev/null @@ -1,7 +0,0 @@ -name: mosdepth -channels: - - bioconda - - conda-forge - - defaults -dependencies: - - mosdepth=0.3.1 \ No newline at end of file diff --git a/functional_tests/README.md b/functional_tests/README.md index a06b0673d01654796d4eedea1caabe2a9c298bdd..310cf4238639afba1af735f8c921c4ae4408797e 100644 --- a/functional_tests/README.md +++ b/functional_tests/README.md @@ -3,16 +3,15 @@ ## I. Pre-requisites 1. Install metagwgs as described here: [installation doc](../docs/installation.md) -2. Get datasets: two datasets are currently available for these functional tests at `https://forgemia.inra.fr/genotoul-bioinfo/metagwgs-test-datasets.git`
+2. Get datasets: three datasets are currently available for these functional tests at `https://forgemia.inra.fr/genotoul-bioinfo/metagwgs-test-datasets.git`: small, mag and hifi ([descriptions here](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs-test-datasets/-/blob/master/README.md)) - Replace "\<dataset\>" with either "small" or "mag": - ``` - git clone --branch <dataset> git@forgemia.inra.fr:genotoul-bioinfo/metagwgs-test-datasets.git +``` + git clone git@forgemia.inra.fr:genotoul-bioinfo/metagwgs-test-datasets.git or - wget https://forgemia.inra.fr/genotoul-bioinfo/metagwgs-test-datasets/-/archive/<dataset>/metagwgs-test-datasets-<dataset>.tar.gz - ``` + wget https://forgemia.inra.fr/genotoul-bioinfo/metagwgs-test-datasets/-/archive/metagwgs-test-datasets.tar.gz +``` ## II. Run functional tests @@ -22,18 +21,16 @@ To launch functional tests, you need to be located at the root of the folder whe - by providing the results folder of a pipeline already exectuted ``` cd test_folder -python <metagwgs-src>/functional_tests/main.py -step 07_taxo_affi -exp_dir metagwgs-test-datasets-small/output -obs_dir ./results +python <metagwgs-src>/functional_tests/main.py -step 07_taxo_affi -exp_dir metagwgs-test-datasets/small/output -obs_dir ./results ``` -- by providing a script which will launch the nextflow pipeline [see example](./launch_example.sh) (this example is designed for the "small" dataset with --min_contigs_cpm = 1000, using slurm) +- by providing a script which will launch the nextflow pipeline [see example](./launch_example.sh) ``` mkdir test_folder cd test_folder cp <metagwgs-src>/functional_tests/launch_example.sh ./ -python <metagwgs-src>/functional_tests/main.py -step 07_taxo_affi -exp_dir metagwgs-test-datasets-small/output -obs_dir ./results --script launch_example.sh +python <metagwgs-src>/functional_tests/main.py -step 07_taxo_affi -exp_dir metagwgs-test-datasets/small/output -obs_dir ./results --script launch_example.sh ``` -For mag dataset, use --min_contigs_cpm = 10 in nextflow.config or in launch_example.sh - ## III. Output A ft_\[step\].log file is created for each step of metagwgs. It contains information about each test performed on given files. @@ -84,8 +81,8 @@ Sometimes, files are not tested because present in exp_dir but not in obs_dir. T 5 simple test methods are used: -diff: simple bash difference between two files -`diff exp_path obs_path` +sort_diff: simple bash difference between two files +`diff <(sort exp_path) <(sort obs_path)` zdiff: simple bash difference between two gzipped files `zdiff exp_path obs_path` @@ -98,3 +95,65 @@ cut_diff: exception for cutadapt.log file not_empty: in python, check if file is empty `test = path.getsize(obs_path) > 0` + + +# Test skips and check processes + +The script `test_parameters_and_processes.py` checks that a run launched with the parameters specified in `expected_processes.tsv` executes the expected processes. + +To use it: +1. retrieve the databanks and datasets (small) as described in the functional tests above.
+1. set the required paths + - load the needed modules: + ``` + module load bioinfo/Nextflow-v21.04.1 + module load system/singularity-3.7.3 + ``` + - set environment variables: + ``` + export OUTDIR="/path/to/out" + export METAG_PATH="/path/to/sources" + export DATABANK="/path/to/FT_banks_2021-10-19" + export DATASET="/path/to/metagwgs-test-datasets" + export EGGNOG_DB="/bank/eggnog-mapper/eggnog-mapper-2.0.4-rf1/data" + ``` + - create command file: + ``` + cut -f 1 $METAG_PATH/functional_tests/expected_processes_sr.tsv | tail -n +2 > $OUTDIR/cmd_sr.sh + ``` + > the commands use profile `test_genotoul_workq` + - replace the path in the samplesheet: + ``` + sed -i -e "s,\$DATASET,$DATASET,g" $DATASET/small/input/samplesheet.csv + ``` +2. launch the commands on the cluster: + ``` + cd $OUTDIR + sarray cmd_sr.sh + ``` +3. launch `test_parameters_and_processes.py`: + ``` + $METAG_PATH/functional_tests/test_parameters_and_processes.py --file $METAG_PATH/functional_tests/expected_processes_sr.tsv + ``` + +## Example on HiFi on genotoul: +``` +module load bioinfo/Nextflow-v21.04.1 +module load system/singularity-3.7.3 + +export OUTDIR="$HOME/work/metagenomic/test_processes/" +export METAG_PATH="$HOME/work/metagenomic/metagwgs/" +export DATABANK="/home/pmartin2/work/FT_banks_2021-10-19" +export DATASET="$HOME/work/metagenomic/metagwgs-test-datasets" +export EGGNOG_DB="/bank/eggnog-mapper/eggnog-mapper-2.0.4-rf1/data" +``` + +File $DATASET/hifi/input/samplesHiFi.csv: + +``` +cut -f 1 $METAG_PATH/functional_tests/expected_processes_hifi.tsv | tail -n +2 > $OUTDIR/cmd_hifi.sh +sed -i -e "s,\$DATASET,$DATASET,g" $DATASET/hifi/input/samplesheet.csv +sarray $OUTDIR/cmd_hifi.sh + +``` + diff --git a/functional_tests/expected_processes_HiFi.tsv b/functional_tests/expected_processes_HiFi.tsv new file mode 100644 index 0000000000000000000000000000000000000000..4555d12bd4652bf3e39e0dab21192dfe294e153f --- /dev/null +++ b/functional_tests/expected_processes_HiFi.tsv @@ -0,0 +1,6 @@ +cmd outputdir DATABASES:INDEX_KAIJU DATABASES:DOWNLOAD_TAXONOMY_DB DATABASES:EGGNOG_MAPPER_DB SH:S04_FILTERED_QUAST SH:S04_STRUCTURAL_ANNOT:PROKKA SH:S04_STRUCTURAL_ANNOT:RENAME_CONTIGS_AND_GENES SH:S05_ALIGNMENT:DIAMOND SH:S05_ALIGNMENT:MINIMAP2 SH:S06_FUNC_ANNOT:BEST_HITS SH:S06_FUNC_ANNOT:CD_HIT:GLOBAL_CD_HIT SH:S06_FUNC_ANNOT:CD_HIT:INDIVIDUAL_CD_HIT SH:S06_FUNC_ANNOT:EGGNOG_MAPPER SH:S06_FUNC_ANNOT:FUNCTIONAL_ANNOT_TABLE SH:S06_FUNC_ANNOT:MERGE_QUANT_ANNOT_BEST SH:S06_FUNC_ANNOT:QUANTIFICATION:FEATURE_COUNTS SH:S06_FUNC_ANNOT:QUANTIFICATION:QUANTIFICATION_TABLE SH:S07_TAXO_AFFI:ASSIGN_TAXONOMY SH:S07_TAXO_AFFI:QUANTIF_AND_TAXONOMIC_TABLE_CONTIGS +mkdir $OUTDIR/hifi_all ; cd $OUTDIR/hifi_all ;nextflow run -profile test_genotoul_workq $METAG_PATH/main.nf --type HIFI --input `echo $DATASET`/hifi/input/samplesheet.csv --diamond_bank `echo $DATABANK`/refseq_bacteria_2021-05-20/refseq_bacteria.dmnd --eggnog_mapper_db_dir `echo $EGGNOG_DB` --taxonomy_dir `echo $DATABANK`/taxonomy_2021-08-23 $OUTDIR/hifi_all 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 +mkdir $OUTDIR/hifi_stop_at_structural_annot;cd $OUTDIR/hifi_stop_at_structural_annot; nextflow run -profile test_genotoul_workq $METAG_PATH/main.nf --type HIFI --input `echo $DATASET`/hifi/input/samplesheet.csv --diamond_bank `echo $DATABANK`/refseq_bacteria_2021-05-20/refseq_bacteria.dmnd --stop_at_structural_annot $OUTDIR/hifi_stop_at_structural_annot 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 +mkdir $OUTDIR/skip_func_annot-skip_taxo_affi; cd $OUTDIR/skip_func_annot-skip_taxo_affi;cp ../nextflow.config .; nextflow run 
-profile test_genotoul_workq $METAG_PATH/main.nf --type HIFI --input `echo $DATASET`/hifi/input/samplesheet.csv --diamond_bank `echo $DATABANK`/refseq_bacteria_2021-05-20/refseq_bacteria.dmnd --skip_func_annot --skip_taxo_affi $OUTDIR/skip_func_annot-skip_taxo_affi 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 +mkdir $OUTDIR/skip_func_annot ; cd $OUTDIR/skip_func_annot;cp ../nextflow.config .; nextflow run -profile test_genotoul_workq $METAG_PATH/main.nf --type HIFI --input `echo $DATASET`/hifi/input/samplesheet.csv --eggnog_mapper_db_dir `echo $EGGNOG_DB` --diamond_bank `echo $DATABANK`/refseq_bacteria_2021-05-20/refseq_bacteria.dmnd --taxonomy_dir `echo $DATABANK`/taxonomy_2021-08-23 –skip_func_annot $OUTDIR/skip_func_annot 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 +mkdir $OUTDIR/skip_taxo_affi; cd $OUTDIR/skip_taxo_affi;nextflow run -profile test_genotoul_workq $METAG_PATH/main.nf --type HIFI --input `echo $DATASET`/hifi/input/samplesheet.csv --diamond_bank `echo $DATABANK`/refseq_bacteria_2021-05-20/refseq_bacteria.dmnd --eggnog_mapper_db_dir `echo $EGGNOG_DB` --skip_taxo_affi $OUTDIR/skip_taxo_affi 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 diff --git a/functional_tests/expected_processes_sr.tsv b/functional_tests/expected_processes_sr.tsv new file mode 100644 index 0000000000000000000000000000000000000000..1f0ab4caddec4b51a99cbd0f78fe64aabec1d305 --- /dev/null +++ b/functional_tests/expected_processes_sr.tsv @@ -0,0 +1,22 @@ +cmd outputdir SR:S01_CLEAN_QC:FASTQC_RAW SR:S01_CLEAN_QC:CUTADAPT SR:S01_CLEAN_QC:SICKLE SR:S01_CLEAN_QC:HOST_FILTER SR:S01_CLEAN_QC:FASTQC_CLEANED SR:S01_CLEAN_QC:KAIJU_AND_MERGE:KAIJU SR:S01_CLEAN_QC:KAIJU_AND_MERGE:MERGE_KAIJU SR:S02_ASSEMBLY:ASSEMBLY SR:S02_ASSEMBLY:ASSEMBLY_QUAST SR:S02_ASSEMBLY:READS_DEDUPLICATION SR:S03_FILTERING:CHUNK_ASSEMBLY_FILTER SR:S03_FILTERING:MERGE_ASSEMBLY_FILTER SH:S04_STRUCTURAL_ANNOT:PROKKA SH:S04_FILTERED_QUAST SH:S04_STRUCTURAL_ANNOT:RENAME_CONTIGS_AND_GENES SH:S05_ALIGNMENT:DIAMOND SH:S05_ALIGNMENT:BWA_MEM SH:S06_FUNC_ANNOT:CD_HIT:INDIVIDUAL_CD_HIT SH:S06_FUNC_ANNOT:EGGNOG_MAPPER SH:S06_FUNC_ANNOT:BEST_HITS SH:S06_FUNC_ANNOT:QUANTIFICATION:FEATURE_COUNTS SH:S06_FUNC_ANNOT:CD_HIT:GLOBAL_CD_HIT SH:S06_FUNC_ANNOT:QUANTIFICATION:QUANTIFICATION_TABLE SH:S06_FUNC_ANNOT:MERGE_QUANT_ANNOT_BEST SH:S06_FUNC_ANNOT:FUNCTIONAL_ANNOT_TABLE SH:S07_TAXO_AFFI:ASSIGN_TAXONOMY SH:S07_TAXO_AFFI:QUANTIF_AND_TAXONOMIC_TABLE_CONTIGS +mkdir $OUTDIR/stop_at_clean ; cd $OUTDIR/stop_at_clean ; nextflow run -profile test_genotoul_workq $METAG_PATH/main.nf --type 'SR' --input `echo $DATASET`/small/input/samplesheet.csv --host_fasta `echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa --host_index "`echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa.{amb,ann,bwt,pac,sa}" --kaiju_db_dir `echo $DATABANK`/kaijudb_refseq_2020-05-25 --diamond_bank `echo $DATABANK`/refseq_bacteria_2021-05-20/refseq_bacteria_100000.dmnd --eggnog_mapper_db_dir `echo $EGGNOG_DB` --taxonomy_dir `echo $DATABANK`/taxonomy_2021-08-23 -with-report -with-timeline -with-trace --stop_at_clean $OUTDIR/stop_at_clean 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +mkdir $OUTDIR/skip_sickle-skip_host_filter-skip_kaiju ; cd $OUTDIR/skip_sickle-skip_host_filter-skip_kaiju ; nextflow run -profile test_genotoul_workq $METAG_PATH/main.nf --type 'SR' --input `echo $DATASET`/small/input/samplesheet.csv --host_fasta `echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa --host_index "`echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa.{amb,ann,bwt,pac,sa}" --kaiju_db_dir `echo 
$DATABANK`/kaijudb_refseq_2020-05-25 --diamond_bank `echo $DATABANK`/refseq_bacteria_2021-05-20/refseq_bacteria_100000.dmnd --eggnog_mapper_db_dir `echo $EGGNOG_DB` --taxonomy_dir `echo $DATABANK`/taxonomy_2021-08-23 -with-report -with-timeline -with-trace --stop_at_clean --skip_sickle --skip_host_filter --skip_kaiju $OUTDIR/skip_sickle-skip_host_filter-skip_kaiju 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +mkdir $OUTDIR/skip_sickle-skip_host_filter ; cd $OUTDIR/skip_sickle-skip_host_filter ; nextflow run -profile test_genotoul_workq $METAG_PATH/main.nf --type 'SR' --input `echo $DATASET`/small/input/samplesheet.csv --host_fasta `echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa --host_index "`echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa.{amb,ann,bwt,pac,sa}" --kaiju_db_dir `echo $DATABANK`/kaijudb_refseq_2020-05-25 --diamond_bank `echo $DATABANK`/refseq_bacteria_2021-05-20/refseq_bacteria_100000.dmnd --eggnog_mapper_db_dir `echo $EGGNOG_DB` --taxonomy_dir `echo $DATABANK`/taxonomy_2021-08-23 -with-report -with-timeline -with-trace --stop_at_clean --skip_sickle --skip_host_filter $OUTDIR/skip_sickle-skip_host_filter 1 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +mkdir $OUTDIR/skip_sickle-skip_kaiju ; cd $OUTDIR/skip_sickle-skip_kaiju ; nextflow run -profile test_genotoul_workq $METAG_PATH/main.nf --type 'SR' --input `echo $DATASET`/small/input/samplesheet.csv --host_fasta `echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa --host_index "`echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa.{amb,ann,bwt,pac,sa}" --kaiju_db_dir `echo $DATABANK`/kaijudb_refseq_2020-05-25 --diamond_bank `echo $DATABANK`/refseq_bacteria_2021-05-20/refseq_bacteria_100000.dmnd --eggnog_mapper_db_dir `echo $EGGNOG_DB` --taxonomy_dir `echo $DATABANK`/taxonomy_2021-08-23 -with-report -with-timeline -with-trace --stop_at_clean --skip_sickle --skip_kaiju $OUTDIR/skip_sickle-skip_kaiju 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +mkdir $OUTDIR/skip_sickle ; cd $OUTDIR/skip_sickle ; nextflow run -profile test_genotoul_workq $METAG_PATH/main.nf --type 'SR' --input `echo $DATASET`/small/input/samplesheet.csv --host_fasta `echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa --host_index "`echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa.{amb,ann,bwt,pac,sa}" --kaiju_db_dir `echo $DATABANK`/kaijudb_refseq_2020-05-25 --diamond_bank `echo $DATABANK`/refseq_bacteria_2021-05-20/refseq_bacteria_100000.dmnd --eggnog_mapper_db_dir `echo $EGGNOG_DB` --taxonomy_dir `echo $DATABANK`/taxonomy_2021-08-23 -with-report -with-timeline -with-trace --stop_at_clean --skip_sickle $OUTDIR/skip_sickle 1 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +mkdir $OUTDIR/skip_host_filter-skip_kaiju ; cd $OUTDIR/skip_host_filter-skip_kaiju ; nextflow run -profile test_genotoul_workq $METAG_PATH/main.nf --type 'SR' --input `echo $DATASET`/small/input/samplesheet.csv --host_fasta `echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa --host_index "`echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa.{amb,ann,bwt,pac,sa}" --kaiju_db_dir `echo $DATABANK`/kaijudb_refseq_2020-05-25 --diamond_bank `echo $DATABANK`/refseq_bacteria_2021-05-20/refseq_bacteria_100000.dmnd --eggnog_mapper_db_dir `echo $EGGNOG_DB` --taxonomy_dir `echo $DATABANK`/taxonomy_2021-08-23 -with-report -with-timeline -with-trace --stop_at_clean --skip_host_filter --skip_kaiju $OUTDIR/skip_host_filter-skip_kaiju 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +mkdir $OUTDIR/skip_host_filter ; cd 
$OUTDIR/skip_host_filter ; nextflow run -profile test_genotoul_workq $METAG_PATH/main.nf --type 'SR' --input `echo $DATASET`/small/input/samplesheet.csv --host_fasta `echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa --host_index "`echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa.{amb,ann,bwt,pac,sa}" --kaiju_db_dir `echo $DATABANK`/kaijudb_refseq_2020-05-25 --diamond_bank `echo $DATABANK`/refseq_bacteria_2021-05-20/refseq_bacteria_100000.dmnd --eggnog_mapper_db_dir `echo $EGGNOG_DB` --taxonomy_dir `echo $DATABANK`/taxonomy_2021-08-23 -with-report -with-timeline -with-trace --stop_at_clean --skip_host_filter $OUTDIR/skip_host_filter 1 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +mkdir $OUTDIR/skip_kaiju ; cd $OUTDIR/skip_kaiju ; nextflow run -profile test_genotoul_workq $METAG_PATH/main.nf --type 'SR' --input `echo $DATASET`/small/input/samplesheet.csv --host_fasta `echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa --host_index "`echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa.{amb,ann,bwt,pac,sa}" --kaiju_db_dir `echo $DATABANK`/kaijudb_refseq_2020-05-25 --diamond_bank `echo $DATABANK`/refseq_bacteria_2021-05-20/refseq_bacteria_100000.dmnd --eggnog_mapper_db_dir `echo $EGGNOG_DB` --taxonomy_dir `echo $DATABANK`/taxonomy_2021-08-23 -with-report -with-timeline -with-trace --stop_at_clean --skip_kaiju $OUTDIR/skip_kaiju 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + +mkdir $OUTDIR/stop_at_assembly ; cd $OUTDIR/stop_at_assembly ; nextflow run -profile test_genotoul_workq $METAG_PATH/main.nf --type 'SR' --input `echo $DATASET`/small/input/samplesheet.csv --host_fasta `echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa --host_index "`echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa.{amb,ann,bwt,pac,sa}" --kaiju_db_dir `echo $DATABANK`/kaijudb_refseq_2020-05-25 --diamond_bank `echo $DATABANK`/refseq_bacteria_2021-05-20/refseq_bacteria_100000.dmnd --eggnog_mapper_db_dir `echo $EGGNOG_DB` --taxonomy_dir `echo $DATABANK`/taxonomy_2021-08-23 -with-report -with-timeline -with-trace --stop_at_assembly $OUTDIR/stop_at_assembly 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +mkdir $OUTDIR/skip_clean ; cd $OUTDIR/skip_clean ; nextflow run -profile test_genotoul_workq $METAG_PATH/main.nf --type 'SR' --input `echo $DATASET`/small/input/samplesheet.csv --host_fasta `echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa --host_index "`echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa.{amb,ann,bwt,pac,sa}" --kaiju_db_dir `echo $DATABANK`/kaijudb_refseq_2020-05-25 --diamond_bank `echo $DATABANK`/refseq_bacteria_2021-05-20/refseq_bacteria_100000.dmnd --eggnog_mapper_db_dir `echo $EGGNOG_DB` --taxonomy_dir `echo $DATABANK`/taxonomy_2021-08-23 -with-report -with-timeline -with-trace --stop_at_assembly --skip_clean $OUTDIR/skip_clean 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + +mkdir $OUTDIR/stop_at_filtering ; cd $OUTDIR/stop_at_filtering ; nextflow run -profile test_genotoul_workq $METAG_PATH/main.nf --type 'SR' --input `echo $DATASET`/small/input/samplesheet.csv --host_fasta `echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa --host_index "`echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa.{amb,ann,bwt,pac,sa}" --kaiju_db_dir `echo $DATABANK`/kaijudb_refseq_2020-05-25 --diamond_bank `echo $DATABANK`/refseq_bacteria_2021-05-20/refseq_bacteria_100000.dmnd --eggnog_mapper_db_dir `echo $EGGNOG_DB` --taxonomy_dir `echo $DATABANK`/taxonomy_2021-08-23 -with-report -with-timeline -with-trace --stop_at_filtering $OUTDIR/stop_at_filtering 
1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + +mkdir $OUTDIR/stop_at_structural_annot ; cd $OUTDIR/stop_at_structural_annot ; nextflow run -profile test_genotoul_workq $METAG_PATH/main.nf --type 'SR' --input `echo $DATASET`/small/input/samplesheet.csv --host_fasta `echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa --host_index "`echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa.{amb,ann,bwt,pac,sa}" --kaiju_db_dir `echo $DATABANK`/kaijudb_refseq_2020-05-25 --diamond_bank `echo $DATABANK`/refseq_bacteria_2021-05-20/refseq_bacteria_100000.dmnd --eggnog_mapper_db_dir `echo $EGGNOG_DB` --taxonomy_dir `echo $DATABANK`/taxonomy_2021-08-23 -with-report -with-timeline -with-trace --stop_at_structural_annot $OUTDIR/stop_at_structural_annot 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 +mkdir $OUTDIR/skip_filtering ; cd $OUTDIR/skip_filtering ; nextflow run -profile test_genotoul_workq $METAG_PATH/main.nf --type 'SR' --input `echo $DATASET`/small/input/samplesheet.csv --host_fasta `echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa --host_index "`echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa.{amb,ann,bwt,pac,sa}" --kaiju_db_dir `echo $DATABANK`/kaijudb_refseq_2020-05-25 --diamond_bank `echo $DATABANK`/refseq_bacteria_2021-05-20/refseq_bacteria_100000.dmnd --eggnog_mapper_db_dir `echo $EGGNOG_DB` --taxonomy_dir `echo $DATABANK`/taxonomy_2021-08-23 -with-report -with-timeline -with-trace --stop_at_structural_annot --skip_filtering $OUTDIR/skip_filtering 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 + +mkdir $OUTDIR/all ; cd $OUTDIR/all ; nextflow run -profile test_genotoul_workq $METAG_PATH/main.nf --type 'SR' --input `echo $DATASET`/small/input/samplesheet.csv --host_fasta `echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa --host_index "`echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa.{amb,ann,bwt,pac,sa}" --kaiju_db_dir `echo $DATABANK`/kaijudb_refseq_2020-05-25 --diamond_bank `echo $DATABANK`/refseq_bacteria_2021-05-20/refseq_bacteria_100000.dmnd --eggnog_mapper_db_dir `echo $EGGNOG_DB` --taxonomy_dir `echo $DATABANK`/taxonomy_2021-08-23 -with-report -with-timeline -with-trace $OUTDIR/all 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 +mkdir $OUTDIR/skip_func_annot-skip_taxo_affi ; cd $OUTDIR/skip_func_annot-skip_taxo_affi ; nextflow run -profile test_genotoul_workq $METAG_PATH/main.nf --type 'SR' --input `echo $DATASET`/small/input/samplesheet.csv --host_fasta `echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa --host_index "`echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa.{amb,ann,bwt,pac,sa}" --kaiju_db_dir `echo $DATABANK`/kaijudb_refseq_2020-05-25 --diamond_bank `echo $DATABANK`/refseq_bacteria_2021-05-20/refseq_bacteria_100000.dmnd --eggnog_mapper_db_dir `echo $EGGNOG_DB` --taxonomy_dir `echo $DATABANK`/taxonomy_2021-08-23 -with-report -with-timeline -with-trace --skip_func_annot --skip_taxo_affi $OUTDIR/skip_func_annot-skip_taxo_affi 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 +mkdir $OUTDIR/skip_func_annot ; cd $OUTDIR/skip_func_annot ; nextflow run -profile test_genotoul_workq $METAG_PATH/main.nf --type 'SR' --input `echo $DATASET`/small/input/samplesheet.csv --host_fasta `echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa --host_index "`echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa.{amb,ann,bwt,pac,sa}" --kaiju_db_dir `echo $DATABANK`/kaijudb_refseq_2020-05-25 --diamond_bank `echo $DATABANK`/refseq_bacteria_2021-05-20/refseq_bacteria_100000.dmnd --eggnog_mapper_db_dir `echo $EGGNOG_DB` 
--taxonomy_dir `echo $DATABANK`/taxonomy_2021-08-23 -with-report -with-timeline -with-trace --skip_func_annot $OUTDIR/skip_func_annot 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 +mkdir $OUTDIR/skip_taxo_affi ; cd $OUTDIR/skip_taxo_affi ; nextflow run -profile test_genotoul_workq $METAG_PATH/main.nf --type 'SR' --input `echo $DATASET`/small/input/samplesheet.csv --host_fasta `echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa --host_index "`echo $DATASET`/input/host/Homo_sapiens.GRCh38_chr21.fa.{amb,ann,bwt,pac,sa}" --kaiju_db_dir `echo $DATABANK`/kaijudb_refseq_2020-05-25 --diamond_bank `echo $DATABANK`/refseq_bacteria_2021-05-20/refseq_bacteria_100000.dmnd --eggnog_mapper_db_dir `echo $EGGNOG_DB` --taxonomy_dir `echo $DATABANK`/taxonomy_2021-08-23 -with-report -with-timeline -with-trace --skip_taxo_affi $OUTDIR/skip_taxo_affi 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 diff --git a/functional_tests/functions.py b/functional_tests/functions.py index eba52ac6786838c9038b029d1973db0ca82b296f..fd8d90c69f1f28b0b1de09712a0914f01a09ceec 100644 --- a/functional_tests/functions.py +++ b/functional_tests/functions.py @@ -111,6 +111,10 @@ def check_files(exp_dir, obs_dir, step, methods, verbose): # Metadata on file to find them and know which test to perform expected_path = path.join(expected_prefix, file_path) observed_path = path.join(observed_prefix, file_path) + + print("exp:\t",expected_path) + print("obs:\t",observed_path) + file_name = path.basename(file_path) file_extension = path.splitext(file_name)[1] @@ -224,10 +228,7 @@ def test_file(exp_path, obs_path, method): if re.search('diff', method): - if method == 'diff': - command = 'diff {} {}'.format(exp_path, obs_path) - - elif method == 'cut_diff': + if method == 'cut_diff': command = 'diff <(tail -n+6 {}) <(tail -n+6 {})'.format(exp_path, obs_path) elif method == 'sort_diff': @@ -245,7 +246,7 @@ def test_file(exp_path, obs_path, method): if not error: if diff_out.decode('ascii') != '': test = False - out = 'Test result: Failed\nDifferences:\n{}\n'.format(diff_out) + out = 'Test result: Failed\nDifferences:\n{}\n'.format(diff_out.decode('ascii')) false_cnt += 1 elif diff_out.decode('ascii') == '': diff --git a/functional_tests/launch_example.sh b/functional_tests/launch_example.sh index c8455f2954f655bc13ec78e57bc44cc014e417ae..3a95935f3184234231e5310413e6600698f04cb2 100644 --- a/functional_tests/launch_example.sh +++ b/functional_tests/launch_example.sh @@ -1,4 +1,4 @@ #!/bin/bash sbatch -W -p workq -J functional_test --mem=6G \ - --wrap="module load bioinfo/Nextflow-v21.04.1 ; module load system/singularity-3.7.3 ; nextflow run -profile test_genotoul_workq <metagwgs-src>/main.nf --step '01_clean_qc,02_assembly,03_filtering,04_structural_annot,05_alignment,06_func_annot,07_taxo_affi' --reads 'metagwgs-test-datasets-small/input/*_{R1,R2}.fastq.gz' --host_fasta 'metagwgs-test-datasets-small/input/host/Homo_sapiens.GRCh38_chr21.fa' --host_bwa_index 'metagwgs-test-datasets-small/input/host/Homo_sapiens.GRCh38_chr21.fa.{amb,ann,bwt,pac,sa}' --min_contigs_cpm 1000 --kaiju_db_dir '<bank>/kaijudb_refseq_2020-05-25' --taxonomy_dir '<bank>/taxonomy' --eggnog_mapper_db_dir '<bank>/eggnog-mapper-2.0.4-rf1/data' --diamond_bank '<bank>/refseq_bacteria_2021-05-20/refseq_bacteria.dmnd' -with-report -with-timeline -with-trace -with-dag" \ No newline at end of file + --wrap="module load bioinfo/Nextflow-v21.04.1 ; module load system/singularity-3.7.3 ; nextflow run -profile test_genotoul_workq main.nf --type 'SR' --input 
'metagwgs-test-datasets/small/input/samplesheet.csv' --host_fasta 'metagwgs-test-datasets/small/input/host/Homo_sapiens.GRCh38_chr21.fa' --host_index 'metagwgs-test-datasets/small/input/host/Homo_sapiens.GRCh38_chr21.fa.{amb,ann,bwt,pac,sa}' --kaiju_db_dir 'FT_banks_2021-10-19/kaijudb_refseq_2020-05-25' --min_contigs_cpm 1000 --diamond_bank 'FT_banks_2021-10-19/refseq_bacteria_2021-05-20/refseq_bacteria_100000.dmnd' --eggnog_mapper_db_dir 'FT_banks_2021-10-19/eggnog-mapper-2.0.4-rf1/data' --taxonomy_dir 'FT_banks_2021-10-19/taxonomy_2021-08-23' --stop_at_clean -with-report -with-timeline -with-trace -with-dag" \ No newline at end of file diff --git a/functional_tests/main.py b/functional_tests/main.py index 90ba354b7a144e3f2660d1ccadf90bde3c5ce4c3..2d63453de1d3bad20be07911990d7b52f5149633 100755 --- a/functional_tests/main.py +++ b/functional_tests/main.py @@ -38,11 +38,10 @@ steps_list = OrderedDict([ global methods methods = OrderedDict([ ("cut_diff", r".*_cutadapt\.log"), - ("diff", [".flagstat",".idxstats",".fasta",".fa",".faa",".ffn",".fna",".gff",".len",".bed",".m8",".clstr",".txt",".summary",".best_hit", ".log", ".bam", ".tsv"]), - ("sort_diff", [".out"]), + ("sort_diff", [".flagstat",".idxstats",".bed",".m8",".clstr",".txt",".summary",".best_hit",".log",".tsv",".out",".fasta",".fa",".faa",".ffn",".fna",".gff",".len"]), ("no_header_diff", [".annotations",".seed_orthologs"]), ("zdiff", [".gz"]), - ("not_empty", [".zip",".html",".pdf", ".bai"]) + ("not_empty", [".zip",".html",".pdf",".bam",".bai", ".amb",".ann",".bwt",".pac",".sa",".tex",".stdout"]) ]) # __main__ diff --git a/functional_tests/test_parameters_and_processes.py b/functional_tests/test_parameters_and_processes.py new file mode 100644 index 0000000000000000000000000000000000000000..2c784f282de4b58ca2675a13669a18f180d16e03 --- /dev/null +++ b/functional_tests/test_parameters_and_processes.py @@ -0,0 +1,66 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- + +# Céline Noirot & Pierre MARTIN +# MIAT - INRAe (Toulouse) +# 2021 + + +try: + import csv + import subprocess + import re + import sys + import argparse + +except ImportError as error: + print(error) + exit(1) + + +# __main__ +def main(): + + # Manage parameters + parser = argparse.ArgumentParser(description = \ + 'Script which check proccesses launch thanks to the provided options.') + parser.add_argument('-f', '--file', required = True, default="hifi_steps_command.tsv", help = \ + 'Expected processes.') + args = parser.parse_args() + + read_tsv = csv.reader(open(args.file), delimiter='\t') + header = read_tsv.__next__() + for cmds in read_tsv: + execution_dir=cmds[1] + # parse expected process + processes_expected = set() + for i,val in enumerate(cmds): + if val=="1" : # this process is expected + processes_expected.add(header[i]) + + # get executed process + # print("cd "+execution_dir+"; nextflow log $(nextflow log -q | tail -1) -f name,status") + process = subprocess.Popen("cd "+execution_dir+"; nextflow log $(nextflow log -q | tail -1) -f name,status", stdout = subprocess.PIPE, shell = True, executable = '/bin/bash') + res, error = process.communicate() + processes_executed = set() + for line in res.decode('ascii').split("\n"): + if line != "": + process_sample,status = re.split("\t",line) + cut = re.split(" ",process_sample) + if len(cut) == 2: + process = cut[0] + sample = cut[1] + else: + process = process_sample + if status == "CACHED" or status == "COMPLETED": + processes_executed.add(process) + + diff = 
processes_expected.symmetric_difference(processes_executed) + if len(diff) != 0 : + print ("#### Project:"+ execution_dir + ". Error in following processes: \n-" + "\n-".join(sorted(diff))) + if len(diff) == 0 : + print ("#### Project:"+ execution_dir + ". All processes exectuted correctly") + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/main.nf b/main.nf index 5fc1cb91d1cfaf4e6ba9f78c7d2458b239255161..856df572c892b3b392cbf676896a9dae5ddf225b 100644 --- a/main.nf +++ b/main.nf @@ -1,14 +1,13 @@ #!/usr/bin/env nextflow -/* -======================================================================================== - metagWGS -======================================================================================== - metagWGS Analysis Pipeline. - #### Homepage / Documentation - https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/ ----------------------------------------------------------------------------------------- -*/ +nextflow.enable.dsl = 2 +include { SHARED as SH } from './subworkflows/shared' +include { SHORT_READS as SR } from './subworkflows/short_reads' +include { DATABASES } from './subworkflows/00_databases' +include { FASTQC_HIFI as S04_HIFI_FASTQC } from './modules/fastqc' +include { HIFI_QUAST as S04_HIFI_QUAST } from './modules/metaquast' +include { GET_SOFTWARE_VERSIONS } from './modules/get_software_versions' +include { MULTIQC } from './modules/multiqc' /* * Define helpMessage @@ -20,70 +19,60 @@ Usage: The typical command for running the pipeline is as follows: - nextflow run -profile standard main.nf --reads '*_{R1,R2}.fastq.gz' --skip_removal_host --skip_kaiju + nextflow run -profile standard main.nf --input 'samplesheet.csv' --skip_host_filter --skip_kaiju Mandatory arguments: - --reads [path] Path to input data (must be surrounded with quotes). + --input [path] Sample sheet: csv file with samples: sample,fastq_1,fastq_2,fasta[for HIFI] Options: - --step Choose step(s) into: "01_clean_qc", "02_assembly", "03_filtering", - "04_structural_annot", "05_alignment", "06_func_annot", "07_taxo_affi", "08_binning". - i. You can directly indicate the final step that is important to you. For example, - if you are interested in binning (and the taxonomic affiliation of bins), just use --step "08_binning". - It runs the previous steps automatically (except "03_filtering", see ii). - ii. "03_filtering" is automatically skipped for the next steps - "04_structural_annot", "05_alignment", "06_func_annot", "07_taxo_affi" and "08_binning". - If you want to filter your assembly before doing one of these steps, you must use --step "03_filtering,the_step", - for example --step "03_filtering,04_structural_annot". - iii. When you run one of the three steps "06_func_annot", "07_taxo_affi" or "08_binning" during a first analysis - and then another of these steps interests you and you run metagWGS again to get the result of this other step, - you have to indicate --step "the_first_step,the_second_step". This will allow you to have a final MultiQC html report - that will take into account the metrics of both analyses performed. If the third of these steps interests you and you run again - metagWGS for this step, you also have to indicate --step "the_first_step,the_second_step,the_third,step" for the same reasons. - 01_clean_qc options: - --skip_01_clean_qc Skip 01_clean_qc step. + S01_CLEAN_QC options: + --stop_at_clean Stop the pipeline at this step + --skip_clean Skip this step. --adapter1 Sequence of adapter 1. Default: Illumina TruSeq adapter. 
--adapter2 Sequence of adapter 2. Default: Illumina TruSeq adapter. --skip_sickle Skip sickle process. --quality_type Type of quality values for sickle (solexa (CASAVA < 1.3), illumina (CASAVA 1.3 to 1.7), sanger (which is CASAVA >= 1.8)). Default: 'sanger'. - --skip_removal_host Skip filter host reads process. + --skip_host_filter Skip filter host reads process. --host_fasta Full path to fasta of host genome ("PATH/name_genome.fasta"). --host_bwa_index Full path to directory containing BWA index including base name i.e ("PATH/name_genome.{amb,ann,bwt,pac,sa}"). - You need to use --skip_removal_host or --host_fasta or --skip_01_clean_qc. If it is not the case, an error message will occur. + You need to use --skip_host_filter or --host_fasta or --skip_clean. If it is not the case, an error message will occur. --skip_kaiju Skip taxonomic affiliation of reads with kaiju. --kaiju_verbose Allow the generation of kaiju verbose output (file can be large) --kaiju_db_dir Directory with kaiju database already built ("PATH/directory"). - --kaiju_db Indicate kaiju database you want to build. Default: "https://kaiju.binf.ku.dk/database/kaiju_db_refseq_2021-02-26.tgz". - You need to use --kaiju_db_dir or --skip_kaiju. If it is not the case, an error message will occur. + --kaiju_db_url Indicate kaiju database you want to build. Default: "https://kaiju.binf.ku.dk/database/kaiju_db_refseq_2021-02-26.tgz". + You need to use --kaiju_db_url or --kaiju_db_dir or --skip_kaiju. If it is not the case, an error message will occur. - 02_assembly options: - --assembly Indicate the assembly tool ["metaspades" or "megahit"]. Default: "metaspades". + S02_ASSEMBLY options: + --stop_at_assembly Stop the pipeline at this step + --assembly Indicate the assembly tool ["metaspades" or "megahit"]. Default: "metaspades". --metaspades_mem [mem_value] Memory (in G) used by metaspades process. Default: 440. - - 03_filtering options: + + S03_FILTERING options: + --stop_at_filtering Stop the pipeline at this step + --skip_filtering Skip this step --min_contigs_cpm [cutoff] CPM cutoff (Count Per Million) to filter contigs with low number of reads. Default: 10. - - 05_alignment options: + + S04_STRUCTURAL_ANNOT options: + --stop_at_structural_annot Stop the pipeline at this step + + S05_ALIGNMENT options: --diamond_bank Path to diamond bank used to align protein sequence of genes: "PATH/bank.dmnd". This bank must be previously built with diamond makedb. - 06_func_annot options: + S06_FUNC_ANNOT options: + --skip_func_annot Skip this step --percentage_identity [nb] Sequence identity threshold. Default: 0.95 corresponding to 95%. Use a number between 0 and 1. - --eggnogmapper_db Use: --eggnogmapper_db if you want that metagwgs build the database. Default false: metagWGS didn't build this database. + --eggnog_mapper_db_download Flag --eggnog_mapper_db_download to build the database. Default false: metagWGS doesn't build this database. --eggnog_mapper_db_dir Path to eggnog-mapper database "PATH/database_directory/" if it is already built. Default: false. - You need to use --eggnogmapper_db or --eggnog_mapper_db_dir. If it is not the case, an error message will occur. + You need to use --eggnog_mapper_db_download or --eggnog_mapper_db_dir. If it is not the case, an error message will occur. - 07_taxo_affi options: + S07_TAXO_AFFI options: + --skip_taxo_affi Skip this step --accession2taxid FTP adress of file prot.accession2taxid.gz. Default: "ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz". 
--taxdump FTP adress of file taxdump.tar.gz. Default: "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz". --taxonomy_dir Directory if taxdump and accession2taxid already downloaded ("PATH/directory"). - 08_binning options: - --min_contig_size [cutoff] Contig length cutoff to filter contigs before binning. Must be greater than 1500. Default: 1500. - --busco_reference BUSCO v3 database. Default: "https://busco-archive.ezlab.org/v3/datasets/bacteria_odb9.tar.gz". - --cat_db CAT/BAT database "PATH/CAT_prepare_20190108.tar.gz". Default: false. Must be previously downloaded. - Other options: --outdir The output directory where the results will be saved: "dir_name". Default "results". --help Show this message and exit. @@ -95,10 +84,6 @@ """.stripIndent() } - /* - * SET UP CONFIGURATION VARIABLES. - */ - // Show help message. if (params.help){ @@ -106,1468 +91,223 @@ if (params.help){ exit 0 } -// Define list of available steps. - -def defineStepList() { - return [ - '01_clean_qc', - '02_assembly', - '03_filtering', - '04_structural_annot', - '05_alignment', - '06_func_annot', - '07_taxo_affi', - '08_binning' - ] -} - -// Check step existence. - -def checkParameterExistence(list_it, list) { - nb_false_step = 0 - for(it in list_it) { - if (!list.contains(it)) { - log.warn "Unknown parameter: ${it}" - nb_false_step = nb_false_step + 1 - } - } - if(nb_false_step > 0) {return false} - else {return true} -} - -// Check number of steps. - -// Set up parameters. - -step = params.step.split(",") -stepList = defineStepList() -if (!checkParameterExistence(step, stepList)) exit 1, "Unknown step(s) upon ${step}, see --help for more information" - -if (!['metaspades','megahit'].contains(params.assembly)){ - exit 1, "Invalid aligner option: ${params.assembly}. Valid options: 'metaspades', 'megahit'" -} - -if (!["solexa","illumina","sanger"].contains(params.quality_type)){ - exit 1, "Invalid quality_type option: ${params.quality_type}. Valid options:'solexa','illumina','sanger'" -} - -/* - * Create channels for adapters, qualityType, alignment genome, mode value. - */ - -adapter1_ch = Channel.value(params.adapter1) -adapter2_ch = Channel.value(params.adapter2) - -if(!params.skip_busco) { - Channel - .fromPath( "${params.busco_reference}", checkIfExists: true ) - .set { file_busco_db } -} -else { - file_busco_db = Channel.from() -} - -if(params.cat_db) { - Channel - .fromPath( "${params.cat_db}", checkIfExists: true ) - .set { file_cat_db } -} -else { - file_cat_db = Channel.from() -} - -if(params.host_fasta) { - bwa_fasta_ch = Channel.value(file(params.host_fasta)) -} -else { - bwa_fasta_ch = Channel.from() -} - -if('05_alignment' in step) { - assert file(params.diamond_bank).exists(): 'Error: give --diamond_bank a valid .dmnd bank' -} - -diamond_bank_ch = Channel.value(params.diamond_bank) -accession2taxid_ch = Channel.value(params.accession2taxid) -taxdump_ch = Channel.value(params.taxdump) -percentage_identity_ch = Channel.value(params.percentage_identity) -min_contigs_cpm_ch = Channel.value(params.min_contigs_cpm) -metaspades_mem_ch = Channel.value(params.metaspades_mem) - -multiqc_config_ch = file(params.multiqc_config, checkIfExists: true) -/* - * Create channels for input read files. - */ -Channel - .fromFilePairs( params.reads, size: params.single_end ? 
1 : 2, flat: true ) - .ifEmpty { exit 1, "Cannot find any reads matching: ${params.reads}\nNB: Path needs to be enclosed in quotes!\nNB: Path requires at least one * wildcard!\nIf this is single-end data, please specify --singleEnd on the command line." } - .into { raw_reads_fastqc; raw_reads_cutadapt; raw_reads_assembly_ch; raw_reads_dedup_ch} - -taxon_levels = "phylum class order family genus species" -taxons_affi_taxo_contigs = "all superkingdom phylum class order family genus species" - -// index host if needed -if (params.host_fasta && !(params.host_bwa_index) && !(params.skip_removal_host)) { - lastPath = params.host_fasta.lastIndexOf(File.separator) - bwa_base = params.host_fasta.substring(lastPath+1) - fasta_ch = file(params.host_fasta, checkIfExists: true) - - process BWAIndex { - publishDir "${params.outdir}/01_clean_qc/01_1_cleaned_reads/bwa_index" , mode: 'copy' - - input: - file fasta from fasta_ch - - output: - file("${fasta}.*") into bwa_built - - script: - """ - bwa index -a bwtsw $fasta - """ - } -} - -if (!(params.skip_removal_host) && !(params.host_fasta) && !(params.skip_01_clean_qc)) { - exit 1, "You must specify --host_fasta or skip cleaning host step with option --skip_removal_host or skip all clean and qc modules with --skip_01_clean_qc" -} - -if (!(params.skip_removal_host) && !(params.skip_01_clean_qc)) { - bwa_index_ch = params.host_bwa_index ? Channel.value(file(params.host_bwa_index)) : bwa_built -} - -/* - * CLEANING. - */ - -// Cutadapt. -process cutadapt { - tag "$sampleId" - - publishDir "${params.outdir}/01_clean_qc/01_1_cleaned_reads/", mode: 'copy', pattern: 'cleaned_*.fastq.gz' - publishDir "${params.outdir}/01_clean_qc/01_1_cleaned_reads/logs", mode: 'copy', pattern: '*_cutadapt.log' - input: - set sampleId, file(read1), file(read2) from raw_reads_cutadapt - val adapter1_ch - val adapter2_ch - - output: - set sampleId, file("*${sampleId}*_R1.fastq.gz"), file("*${sampleId}*_R2.fastq.gz") into (cutadapt_reads_ch, cutadapt2_reads_ch, cutadapt3_reads_ch) - file("${sampleId}_cutadapt.log") into cutadapt_log_ch_for_multiqc - - when: ('01_clean_qc' in step || '02_assembly' in step || '03_filtering' in step || '04_structural_annot' in step || '05_alignment' in step || '06_func_annot' in step || '07_taxo_affi' in step || '08_binning' in step) && (!params.skip_01_clean_qc) - - script: - if(params.skip_sickle & params.skip_removal_host) { - // output are final cleaned files - output_files = "-o cleaned_${sampleId}_R1.fastq.gz -p cleaned_${sampleId}_R2.fastq.gz" - } - else { - //tempory files not saved in publish dir - output_files = "-o ${sampleId}_cutadapt_R1.fastq.gz -p ${sampleId}_cutadapt_R2.fastq.gz" - } - """ - cutadapt -a $adapter1_ch -A $adapter2_ch $output_files -m 36 --trim-n -q 20,20 --max-n 0 \ - --cores=${task.cpus} ${read1} ${read2} > ${sampleId}_cutadapt.log - """ -} - -// Sickle. 
-process sickle { - tag "$sampleId" - publishDir "${params.outdir}/01_clean_qc/01_1_cleaned_reads/", mode: 'copy', pattern: 'cleaned_*.fastq.gz' - publishDir "${params.outdir}/01_clean_qc/01_1_cleaned_reads/logs", mode: 'copy', pattern: '*_sickle.log' - - when: - (!params.skip_sickle) && ('01_clean_qc' in step || '02_assembly' in step || '03_filtering' in step || '04_structural_annot' in step || '05_alignment' in step || '06_func_annot' in step || '07_taxo_affi' in step || '08_binning' in step) && (!params.skip_01_clean_qc) - - input: - set sampleId, file(cutadapt_reads_R1), file(cutadapt_reads_R2) from cutadapt_reads_ch - - output: - set sampleId, file("*${sampleId}*_R1.fastq.gz"), file("*${sampleId}*_R2.fastq.gz") into sickle_reads_ch, sickle2_reads_ch - file("${sampleId}_single_sickle.fastq.gz") into sickle_single_ch - file("${sampleId}_sickle.log") into sickle_log_ch_for_multiqc - - script: - mode = params.single_end ? 'se' : 'pe' - - if(params.skip_removal_host) { - // output are final cleaned files - options = "-o cleaned_${sampleId}_R1.fastq.gz -p cleaned_${sampleId}_R2.fastq.gz" - } - else { - //tempory files not saved in publish dir - options = "-o ${sampleId}_sickle_R1.fastq.gz -p ${sampleId}_sickle_R2.fastq.gz" - } - options += " -t " + params.quality_type - """ - sickle ${mode} -f ${cutadapt_reads_R1} -r ${cutadapt_reads_R2} $options \ - -s ${sampleId}_single_sickle.fastq.gz -g > ${sampleId}_sickle.log - """ -} - -if (!params.skip_sickle) { - sickle_reads_ch.set{intermediate_cleaned_ch} -} -else { - cutadapt2_reads_ch.set{intermediate_cleaned_ch} -} - -// WARNING: use bioinfo_bwa_samtools module. -if (!params.skip_removal_host && ('01_clean_qc' in step || '02_assembly' in step || '03_filtering' in step || '04_structural_annot' in step || '05_alignment' in step || '06_func_annot' in step || '07_taxo_affi' in step || '08_binning' in step) && (!params.skip_01_clean_qc)) { - process host_filter { - tag "${sampleId}" - publishDir "${params.outdir}/01_clean_qc/01_1_cleaned_reads/", mode: 'copy', pattern: 'cleaned_*.fastq.gz' - publishDir "${params.outdir}/01_clean_qc/01_1_cleaned_reads/", mode: 'copy', pattern: '*.bam' - publishDir "${params.outdir}/01_clean_qc/01_1_cleaned_reads/logs", mode: 'copy', - saveAs: {filename -> - if (filename.indexOf(".flagstat") > 0 ) "$filename" - else null} - - input: - set sampleId, file(trimmed_reads_R1), file(trimmed_reads_R2) from intermediate_cleaned_ch - file index from bwa_index_ch - file fasta from bwa_fasta_ch - - output: - set sampleId, file("cleaned_${sampleId}_R1.fastq.gz"), file("cleaned_${sampleId}_R2.fastq.gz") into filter_reads_ch - file("host_filter_flagstat/${sampleId}.host_filter.flagstat") into flagstat_after_host_filter_for_multiqc_ch - file("${sampleId}.no_filter.flagstat") into flagstat_before_filter_for_multiqc_ch - - """ - bwa mem -t ${task.cpus} ${fasta} ${trimmed_reads_R1} ${trimmed_reads_R2} > ${sampleId}.bam - samtools view -bhS -f 12 ${sampleId}.bam > ${sampleId}.without_host.bam - mkdir host_filter_flagstat - samtools flagstat ${sampleId}.bam > ${sampleId}.no_filter.flagstat - samtools flagstat ${sampleId}.without_host.bam >> host_filter_flagstat/${sampleId}.host_filter.flagstat - bamToFastq -i ${sampleId}.without_host.bam -fq cleaned_${sampleId}_R1.fastq -fq2 cleaned_${sampleId}_R2.fastq - gzip cleaned_${sampleId}_R1.fastq - gzip cleaned_${sampleId}_R2.fastq - rm ${sampleId}.bam - rm ${sampleId}.without_host.bam - """ - } - filter_reads_ch.set{preprocessed_reads_ch} -} -else { - 
intermediate_cleaned_ch.set{preprocessed_reads_ch} - Channel.empty().set{flagstat_after_host_filter_for_multiqc_ch} - Channel.empty().set{flagstat_before_filter_for_multiqc_ch} -} - -preprocessed_reads_ch.into{ - clean_reads_for_fastqc_ch - clean_reads_for_kaiju_ch - clean_reads_for_assembly_ch - clean_reads_for_dedup_ch -} - - -// FastQC on raw data -process fastqc_raw { - tag "${sampleId}" - publishDir "${params.outdir}/01_clean_qc/01_2_qc/fastqc_raw/", mode: 'copy' - - input: - set sampleId, file(read1), file(read2) from raw_reads_fastqc - - output: - file("${sampleId}/*.zip") into fastqc_raw_for_multiqc_ch - file("${sampleId}/*.html") into fastqc_raw_ch - - when: ('01_clean_qc' in step || '02_assembly' in step || '03_filtering' in step || '04_structural_annot' in step || '05_alignment' in step || '06_func_annot' in step || '07_taxo_affi' in step || '08_binning' in step) && (!params.skip_01_clean_qc) - - script: - """ - mkdir ${sampleId} ; fastqc --nogroup --quiet -o ${sampleId} --threads ${task.cpus} ${read1} ${read2} - """ -} - -// FastQC on cleaned data -process fastqc_cleaned { - tag "${sampleId}" - publishDir "${params.outdir}/01_clean_qc/01_2_qc/fastqc_cleaned", mode: 'copy' - - input: - set sampleId, file(read1), file(read2) from clean_reads_for_fastqc_ch - output: - file("cleaned_${sampleId}/*.zip") into fastqc_cleaned_for_multiqc_ch - file("cleaned_${sampleId}/*.html") into fastqc_cleaned_ch - - when: ('01_clean_qc' in step || '02_assembly' in step || '03_filtering' in step || 'structural' in step || '05_alignment' in step || '06_func_annot' in step || '07_taxo_affi' in step || '08_binning' in step) && (!params.skip_01_clean_qc) - - script: - """ - mkdir cleaned_${sampleId}; fastqc --nogroup --quiet -o cleaned_${sampleId} --threads ${task.cpus} ${read1} ${read2} - """ -} - -/* - * TAXONOMIC CLASSIFICATION. - */ - -if (!params.skip_kaiju && ('01_clean_qc' in step || '02_assembly' in step || '03_filtering' in step || '04_structural_annot' in step || '05_alignment' in step || '06_func_annot' in step || '07_taxo_affi' in step || '08_binning' in step) && (!params.skip_01_clean_qc)) { - if(!params.kaiju_db_dir) { - // Built kaiju database index. - process index_db_kaiju { - publishDir "${params.outdir}/01_clean_qc/01_3_taxonomic_affiliation_reads/index", mode: 'copy' - - input: - val database from params.kaiju_db - - output: - file("nodes.dmp") into index_kaiju_nodes_ch - file("*.fmi") into index_kaiju_db_ch - file("names.dmp") into index_kaiju_names_ch - - script: - """ - wget ${database} - file='${database}' - fileNameDatabase=\${file##*/} - echo \$fileNameDatabase - tar -zxvf \$fileNameDatabase - """ +def getAndCheckHeader() { + File file = new File(params.input) + assert file.exists() : "${params.input} file not found" + def line=""; + file.withReader { reader -> + line = reader.readLine() } - } - else if(params.kaiju_db_dir) { - if(file(params.kaiju_db_dir + "/kaiju_db*.fmi").size == 1) { - index_kaiju_nodes_ch = Channel - .value(params.kaiju_db_dir + "/nodes.dmp") - index_kaiju_db_ch = Channel - .value(params.kaiju_db_dir + "/kaiju_db*.fmi") - index_kaiju_names_ch = Channel - .value(params.kaiju_db_dir + "/names.dmp") + def tab = line.split(/,/) + if (! 
((tab[0] == "sample") && (tab[1] == "fastq_1") )) { + exit 1, 'Error 1 while check samplesheet format please enter sample,fastq_1[,fastq_2][,assembly] with header line' } - else { - exit 1, "There are more than one file ending with .fmi in ${params.kaiju_db_dir}" + if (tab.size() == 3 ){ + if (!((tab[2] == "fastq_2") || (tab[2] == "assembly"))) { + exit 1, 'Error 2 while check samplesheet format please enter sample,fastq_1[,fastq_2][,assembly] with header line' + } } - } - else { - exit 1, "You must specify --kaiju_db or --kaiju_db_dir or --skip_kaiju" - } - - // Kaiju. - process kaiju { - tag "${sampleId}" - publishDir "${params.outdir}/01_clean_qc/01_3_taxonomic_affiliation_reads", mode: 'copy', pattern: '*.krona.html' - if (params.kaiju_verbose) { - publishDir "${params.outdir}/01_clean_qc/01_3_taxonomic_affiliation_reads", mode: 'copy', pattern: '*_kaiju_MEM_verbose.out' + if (tab.size() == 4) { + if ( ! ((tab[2] == "fastq_2") && (tab[3] == "assembly"))) { + exit 1, 'Error 3 while check samplesheet format please enter sample,fastq_1[,fastq_2][,assembly] with header line' + } } - - input: - set sampleId, file(preprocessed_reads_R1), file(preprocessed_reads_R2) from clean_reads_for_kaiju_ch - val (index_kaiju_nodes) from index_kaiju_nodes_ch - val (index_kaiju_db) from index_kaiju_db_ch - val (index_kaiju_names) from index_kaiju_names_ch - - output: - file("${sampleId}_kaiju_MEM_verbose.out") into kaiju_MEM_verbose_ch - file("${sampleId}.krona.html") into kaiju_krona_ch - file("*.summary_*") into kaiju_summary_for_multiqc_ch - file("*.summary_species") into kaiju_summary_species_ch - file("*.summary_genus") into kaiju_summary_genus_ch - file("*.summary_family") into kaiju_summary_family_ch - file("*.summary_class") into kaiju_summary_class_ch - file("*.summary_order") into kaiju_summary_order_ch - file("*.summary_phylum") into kaiju_summary_phylum_ch - file("*_normalized.pdf") into normalized_pdf_ch - file("*_counts.pdf") into counts_pdf_ch - - script: - """ - kaiju -z ${task.cpus} -t ${index_kaiju_nodes} -f ${index_kaiju_db} -i ${preprocessed_reads_R1} -j ${preprocessed_reads_R2} -o ${sampleId}_kaiju_MEM_verbose.out -a mem -v - kaiju2krona -t ${index_kaiju_nodes} -n ${index_kaiju_names} -i ${sampleId}_kaiju_MEM_verbose.out -o ${sampleId}_kaiju_MEM_without_unassigned.out.krona -u - ktImportText -o ${sampleId}.krona.html ${sampleId}_kaiju_MEM_without_unassigned.out.krona - for i in ${taxon_levels} ; - do - kaiju2table -t ${index_kaiju_nodes} -n ${index_kaiju_names} -r \$i -o ${sampleId}_kaiju_MEM.out.summary_\$i ${sampleId}_kaiju_MEM_verbose.out - done - Generate_barplot_kaiju.py -i ${sampleId}_kaiju_MEM_verbose.out - """ - } - // Merge kaiju results by taxonomic ranks - process kaiju_merge { - tag "${sampleId}" - publishDir "${params.outdir}/01_clean_qc/01_3_taxonomic_affiliation_reads", mode: 'copy' - - input: - file(kaiju_species) from kaiju_summary_species_ch.collect() - file(kaiju_genus) from kaiju_summary_genus_ch.collect() - file(kaiju_family) from kaiju_summary_family_ch.collect() - file(kaiju_class) from kaiju_summary_class_ch.collect() - file(kaiju_order) from kaiju_summary_order_ch.collect() - file(kaiju_phylum) from kaiju_summary_phylum_ch.collect() - file(kaiju_pdf) from normalized_pdf_ch.collect() - - output: - file("taxo_affi_reads_*.tsv") into merge_files_kaiju_ch - file("taxo_barplots.pdf") into merged_pdf_ch - - script: - """ - echo "${kaiju_phylum}" > phylum.txt - echo "${kaiju_order}" > order.txt - echo "${kaiju_class}" > class.txt - echo "${kaiju_family}" > family.txt 
- echo "${kaiju_genus}" > genus.txt - echo "${kaiju_species}" > species.txt - for i in ${taxon_levels} ; - do - merge_kaiju_results.py -f \$i".txt" -o taxo_affi_reads_\$i".tsv" - done - pdfunite ${kaiju_pdf} taxo_barplots.pdf - """ - } -} - -else { - Channel.empty().set{kaiju_summary_for_multiqc_ch} -} - -if(params.skip_01_clean_qc) { - raw_reads_assembly_ch.set{ - input_reads_for_assembly_ch} - raw_reads_dedup_ch.set{ - input_reads_for_dedup_ch} -} -else { - clean_reads_for_assembly_ch.set{ - input_reads_for_assembly_ch} - clean_reads_for_dedup_ch.set{ - input_reads_for_dedup_ch} -} - -/* - * ASSEMBLY. - */ - -// Assembly (metaspades or megahit). -process assembly { - tag "${sampleId}" - publishDir "${params.outdir}/02_assembly", mode: 'copy' - label 'assembly' - - input: - set sampleId, file(preprocessed_reads_R1), file(preprocessed_reads_R2) from input_reads_for_assembly_ch - val spades_mem from metaspades_mem_ch - - output: - set sampleId, file("${params.assembly}/${sampleId}.contigs.fa") into assembly_for_quast_ch, assembly_for_dedup_ch, assembly_for_filter_ch, assembly_no_filter_ch - set sampleId, file("${params.assembly}/${sampleId}.log"), file("${params.assembly}/${sampleId}.params.txt") into logs_assembly_ch - - when: ('02_assembly' in step || '03_filtering' in step || '04_structural_annot' in step || '05_alignment' in step || '06_func_annot' in step || '07_taxo_affi' in step || '08_binning' in step) - - script: - if(params.assembly=='metaspades') - """ - metaspades.py -t ${task.cpus} -m ${spades_mem} -1 ${preprocessed_reads_R1} -2 ${preprocessed_reads_R2} -o ${params.assembly} - mv ${params.assembly}/scaffolds.fasta ${params.assembly}/${sampleId}.contigs.fa - mv ${params.assembly}/spades.log ${params.assembly}/${sampleId}.log - mv ${params.assembly}/params.txt ${params.assembly}/${sampleId}.params.txt - """ - else if(params.assembly=='megahit') - """ - megahit -t ${task.cpus} -1 ${preprocessed_reads_R1} -2 ${preprocessed_reads_R2} -o ${params.assembly} --out-prefix "${sampleId}" - mv ${params.assembly}/options.json ${params.assembly}/${sampleId}.params.txt - """ - else - error "Invalid parameter: ${params.assembly}" -} - -// Assembly metrics. 
-process quast { - label 'quast' - tag "${sampleId}" - publishDir "${params.outdir}/02_assembly", mode: 'copy' - - input: - set sampleId, file(assembly_file) from assembly_for_quast_ch - - output: - file("${sampleId}_all_contigs_QC/*") into quast_assembly_ch - file("${sampleId}_all_contigs_QC/report.tsv") into quast_assembly_for_multiqc_ch - - when: ('02_assembly' in step || '03_filtering' in step || '04_structural_annot' in step || '05_alignment' in step || '06_func_annot' in step || '07_taxo_affi' in step || '08_binning' in step) - - script: - """ - metaquast.py --threads "${task.cpus}" --rna-finding --max-ref-number 0 --min-contig 0 "${assembly_file}" -o "${sampleId}_all_contigs_QC" - """ -} - -// Reads deduplication - -assembly_and_reads_ch = assembly_for_dedup_ch.join(input_reads_for_dedup_ch, remainder: true) - -process reads_deduplication { - tag "${sampleId}" - publishDir "${params.outdir}/02_assembly", mode: 'copy', pattern: '*.fastq.gz' - publishDir "${params.outdir}/02_assembly/logs", mode: 'copy', pattern: '*.idxstats' - publishDir "${params.outdir}/02_assembly/logs", mode: 'copy', pattern: '*.flagstat' - - input: - set sampleId, file(assembly_file), file(preprocessed_reads_R1), file(preprocessed_reads_R2) from assembly_and_reads_ch - - output: - set sampleId, file("${sampleId}_R1_dedup.fastq.gz"), file("${sampleId}_R2_dedup.fastq.gz") into deduplicated_reads_ch, deduplicated_reads_copy_ch - set sampleId, file("${sampleId}.count_reads_on_contigs.idxstats") into (idxstats_filter_logs_ch, idxstats_filter_logs_for_multiqc_ch) - set sampleId, file("${sampleId}.count_reads_on_contigs.flagstat") into flagstat_filter_logs_ch - file("${sampleId}.count_reads_on_contigs.flagstat") into flagstat_after_dedup_reads_for_multiqc_ch - - when: ('02_assembly' in step || '03_filtering' in step || '04_structural_annot' in step || '05_alignment' in step || '06_func_annot' in step || '07_taxo_affi' in step || '08_binning' in step) - - script: - """ - mkdir logs - bwa index ${assembly_file} -p ${assembly_file} - bwa mem ${assembly_file} ${preprocessed_reads_R1} ${preprocessed_reads_R2} | samtools view -bS - | samtools sort -n -o ${sampleId}.sort.bam - - samtools fixmate -m ${sampleId}.sort.bam ${sampleId}.fixmate.bam - samtools sort -o ${sampleId}.fixmate.positionsort.bam ${sampleId}.fixmate.bam - samtools markdup -r -S -s -f ${sampleId}.stats ${sampleId}.fixmate.positionsort.bam ${sampleId}.filtered.bam - samtools index ${sampleId}.filtered.bam - samtools idxstats ${sampleId}.filtered.bam > ${sampleId}.count_reads_on_contigs.idxstats - samtools flagstat ${sampleId}.filtered.bam > ${sampleId}.count_reads_on_contigs.flagstat - samtools sort -n -o ${sampleId}.filtered.sort.bam ${sampleId}.filtered.bam - bedtools bamtofastq -i ${sampleId}.filtered.sort.bam -fq ${sampleId}_R1_dedup.fastq -fq2 ${sampleId}_R2_dedup.fastq - gzip ${sampleId}_R1_dedup.fastq ; gzip ${sampleId}_R2_dedup.fastq - rm ${sampleId}.sort.bam - rm ${sampleId}.fixmate.bam - rm ${sampleId}.fixmate.positionsort.bam - rm ${sampleId}.filtered.bam - rm ${sampleId}.filtered.sort.bam - """ -} - -assembly_for_filter_ch - .splitFasta(by: 100000, file: true) - .set{chunk_assembly_for_filter_ch} - -chunk_assembly_for_filter_ch - .combine(idxstats_filter_logs_ch, by:0) - .set{assembly_and_logs_ch} - -process chunk_assembly_filter { - label 'assembly_filter' - - input: - set sampleId, file(assembly_file), file(idxstats) from assembly_and_logs_ch - val min_cpm from min_contigs_cpm_ch - - output: - set sampleId, 
file("${chunk_name}_select_cpm${min_cpm}.fasta") into chunk_select_assembly_ch - set sampleId, file("${chunk_name}_discard_cpm${min_cpm}.fasta") into chunk_discard_assembly_ch - - when: ('03_filtering' in step) - - script: - chunk_name = assembly_file.baseName - """ - Filter_contig_per_cpm.py -i ${idxstats} -f ${assembly_file} -c ${min_cpm} -s ${chunk_name}_select_cpm${min_cpm}.fasta -d ${chunk_name}_discard_cpm${min_cpm}.fasta - """ -} - -chunk_select_assembly_ch - .groupTuple() - .set{grouped_select_assembly_ch} - -chunk_discard_assembly_ch - .groupTuple() - .set{grouped_discard_assembly_ch} - -process merge_assembly_filter { - tag "${sampleId}" - publishDir "${params.outdir}/03_filtering/", mode: 'copy' - label 'assembly_filter' - - input: - set sampleId, file(select_fasta) from grouped_select_assembly_ch - set sampleId, file(discard_fasta) from grouped_discard_assembly_ch - val min_cpm from min_contigs_cpm_ch - - output: - set sampleId, file("${sampleId}_select_contigs_cpm${min_cpm}.fasta") into select_assembly_ch, select_assembly_for_quast_ch - set sampleId, file("${sampleId}_discard_contigs_cpm${min_cpm}.fasta") into discard_assembly_ch - - when: ('03_filtering' in step) - - script: - """ - cat ${select_fasta} > ${sampleId}_select_contigs_cpm${min_cpm}.fasta - cat ${discard_fasta} > ${sampleId}_discard_contigs_cpm${min_cpm}.fasta - """ -} - -process quast_filtered { - label 'quast' - tag "${sampleId}" - publishDir "${params.outdir}/03_filtering/", mode: 'copy' - - input: - set sampleId, file(fasta) from select_assembly_for_quast_ch - - output: - set sampleId, file("${sampleId}_select_contigs_QC/report.tsv") into quast_select_contigs_for_multiqc_ch - file("${sampleId}_select_contigs_QC/*") into quast_select_contigs_ch - - when: ('03_filtering' in step) - - script: - """ - metaquast.py --threads ${task.cpus} --rna-finding --max-ref-number 0 --min-contig 0 ${fasta} -o "${sampleId}_select_contigs_QC" - """ -} - -if(!('03_filtering' in step)) { - assembly_no_filter_ch.set{select_assembly_ch } -} - -/* - * ANNOTATION with Prokka. - */ - -process prokka { - tag "${sampleId}" - - input: - set sampleId, file(assembly_file) from select_assembly_ch - - output: - set sampleId, file("*") into prokka_ch - set sampleId, file("PROKKA_${sampleId}/${sampleId}.txt") into prokka_for_multiqc_ch - - when: ('04_structural_annot' in step || '05_alignment' in step || '06_func_annot' in step || '07_taxo_affi' in step || '08_binning' in step) - - script: - """ - prokka --metagenome --noanno --rawproduct --outdir PROKKA_${sampleId} --prefix ${sampleId} ${assembly_file} --centre X --compliant --cpus ${task.cpus} - rm *.gbk - """ -} - -/* - * RENAME contigs and genes. 
- */ - -process rename_contigs_genes { - tag "${sampleId}" - publishDir "${params.outdir}/04_structural_annot", mode: 'copy' - label 'python' - - input: - set sampleId, file(assembly_file) from prokka_ch - - output: - set sampleId, file("${sampleId}.annotated.fna") into prokka_renamed_fna_ch, prokka_renamed_fna_ch2, prokka_renamed_fna_for_metabat2_ch - set sampleId, file("${sampleId}.annotated.ffn") into prokka_renamed_ffn_ch - set sampleId, file("${sampleId}.annotated.faa") into prokka_renamed_faa_ch - set sampleId, file("${sampleId}.annotated.gff") into prokka_renamed_gff_ch, prokka_renamed_gff_ch2 - set sampleId, file("${sampleId}_prot.len") into prot_length_ch - - when: ('04_structural_annot' in step || '05_alignment' in step || '06_func_annot' in step || '07_taxo_affi' in step || '08_binning' in step) - - script: - """ - grep "^gnl" ${assembly_file}/${sampleId}.gff > ${sampleId}_only_gnl.gff - Rename_contigs_and_genes.py -f ${sampleId}_only_gnl.gff -faa ${assembly_file}/${sampleId}.faa -ffn ${assembly_file}/${sampleId}.ffn -fna ${assembly_file}/${sampleId}.fna -p ${sampleId} -oGFF ${sampleId}.annotated.gff -oFAA ${sampleId}.annotated.faa -oFFN ${sampleId}.annotated.ffn -oFNA ${sampleId}.annotated.fna - samtools faidx ${sampleId}.annotated.faa; cut -f 1,2 ${sampleId}.annotated.faa.fai > ${sampleId}_prot.len - """ -} -prokka_renamed_faa_ch.into{prokka_renamed_faa_ch2; prokka_renamed_faa_ch4} - -// ALIGNMENT OF READS AGAINST CONTIGS. -prokka_reads_ch = prokka_renamed_gff_ch.join(prokka_renamed_fna_ch, remainder: true).join(deduplicated_reads_ch, remainder: true) - -process reads_alignment_on_contigs { - tag "${sampleId}" - publishDir "${params.outdir}/05_alignment/05_1_reads_alignment_on_contigs/$sampleId", mode: 'copy' - - input: - set sampleId, file(gff_prokka), file(fna_prokka), file(deduplicated_reads_R1), file(deduplicated_reads_R2) from prokka_reads_ch - - output: - set val(sampleId), file("${sampleId}.sort.bam"), file("${sampleId}.sort.bam.bai") into reads_assembly_ch, reads_assembly_ch_for_metabat2, reads_assembly_ch_for_depth - set val(sampleId), file("${sampleId}.sort.bam.idxstats") into idxstats_ch - set val(sampleId), file("${sampleId}_contig.bed") into contigs_bed_ch - - when: ('05_alignment' in step || '06_func_annot' in step || '07_taxo_affi' in step || '08_binning' in step) - - script: - """ - bwa index ${fna_prokka} -p ${fna_prokka} - bwa mem ${fna_prokka} ${deduplicated_reads_R1} ${deduplicated_reads_R2} | samtools view -bS - | samtools sort - -o ${sampleId}.sort.bam - samtools index ${sampleId}.sort.bam - samtools idxstats ${sampleId}.sort.bam > ${sampleId}.sort.bam.idxstats - awk 'BEGIN {FS="\t"}; {print \$1 FS "0" FS \$2}' ${sampleId}.sort.bam.idxstats > ${sampleId}_contig.bed - """ + return tab } -depth_on_contigs_ch = contigs_bed_ch.join(reads_assembly_ch_for_depth) - -process depth_on_contigs { - tag "${sampleId}" - publishDir "${params.outdir}/05_alignment/05_1_reads_alignment_on_contigs/$sampleId", mode: 'copy' - label 'mosdepth' - - input: - set val(sampleId), file(bed) , file(bam), file(index) from depth_on_contigs_ch - output: - set val(sampleId), file("${sampleId}.regions.bed.gz") into contig_depth_ch - - when: ('05_alignment' in step || '06_func_annot' in step || '07_taxo_affi' in step || '08_binning' in step) - - script: - """ - mosdepth -b ${bed} -n -x ${sampleId} ${bam} - """ -} - -// ALIGNMENT AGAINST PROTEIN DATABASE: DIAMOND. 
- -process diamond { - publishDir "${params.outdir}/05_alignment/05_2_database_alignment/$sampleId", mode: 'copy' - tag "${sampleId}" - - when: ('05_alignment' in step || '06_func_annot' in step || '07_taxo_affi' in step || '08_binning' in step) - - input: - set sampleId, file(renamed_prokka_faa) from prokka_renamed_faa_ch2 - val diamond_bank from diamond_bank_ch - - output: - set sampleId, file("${sampleId}_aln_diamond.m8") into diamond_result_ch, diamond_result_for_annot_ch - - script: - spc_fmt="qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen stitle" - tab_fmt=spc_fmt.replaceAll(" ","\t") - """ - echo "$tab_fmt" > head.m8 - diamond blastp -p ${task.cpus} -d ${diamond_bank} -q ${renamed_prokka_faa} -o ${sampleId}_aln_diamond.nohead.m8 -f 6 $spc_fmt - cat head.m8 ${sampleId}_aln_diamond.nohead.m8 > ${sampleId}_aln_diamond.m8 - rm ${sampleId}_aln_diamond.nohead.m8 - rm head.m8 - """ -} - -/* - * CLUSTERING. - */ - -// Sample clustering with CD-HIT. -process individual_cd_hit { - tag "${sampleId}" - publishDir "${params.outdir}/06_func_annot/06_1_clustering", mode: 'copy' - label 'cd_hit' - - input: - set sampleId, file(assembly_ffn_file) from prokka_renamed_ffn_ch - val percentage_identity_cdhit from percentage_identity_ch - - output: - file("${sampleId}.cd-hit-est.${percentage_identity_cdhit}.fasta") into individual_cd_hit_ch - file("${sampleId}.cd-hit-est.${percentage_identity_cdhit}.fasta.clstr") into individual_cd_hit_cluster_ch - file("${sampleId}.cd-hit-est.${percentage_identity_cdhit}.table_cluster_contigs.txt") into table_clstr_contigs_ch - - when: ('06_func_annot' in step) - - script: - """ - cd-hit-est -c ${percentage_identity_cdhit} -i ${assembly_ffn_file} -o ${sampleId}.cd-hit-est.${percentage_identity_cdhit}.fasta -T ${task.cpus} -M ${task.mem} -d 150 - cat ${sampleId}.cd-hit-est.${percentage_identity_cdhit}.fasta.clstr | cd_hit_produce_table_clstr.py > ${sampleId}.cd-hit-est.${percentage_identity_cdhit}.table_cluster_contigs.txt - """ +def returnFile(it) { + if (it == null) { + return null + } else { + if (!file(it).exists()) exit 1, "Missing file in CSV file: ${it}, see --help for more information" + } + return file(it) } -// Global clustering with CD-HIT. -process global_cd_hit { - publishDir "${params.outdir}/06_func_annot/06_1_clustering", mode: 'copy' - label 'cd_hit' - - input: - file "*" from individual_cd_hit_ch.collect() - val percentage_identity_cdhit from percentage_identity_ch - - output: - file("All-cd-hit-est.${percentage_identity_cdhit}.fasta") into concatenation_individual_cd_hit_ch - file("All-cd-hit-est.${percentage_identity_cdhit}.fasta.clstr") into global_cd_hit_ch - file("table_clstr.txt") into table_global_clstr_clstr_ch - - when: ('06_func_annot' in step) - - script: - """ - cat * > All-cd-hit-est.${percentage_identity_cdhit} - cd-hit-est -c ${percentage_identity_cdhit} -i All-cd-hit-est.${percentage_identity_cdhit} -o All-cd-hit-est.${percentage_identity_cdhit}.fasta -T ${task.cpus} -M {task.mem} -d 150 - cat All-cd-hit-est.${percentage_identity_cdhit}.fasta.clstr | cd_hit_produce_table_clstr.py > table_clstr.txt - """ +def hasExtension(it, extension) { + it.toString().toLowerCase().endsWith(extension.toLowerCase()) } -/* - * Create channel with sample id, renamed annotation files and bam and index bam files. 
-*/ -prokka_bam_ch = prokka_renamed_gff_ch2.join(prokka_renamed_fna_ch2, remainder: true).join(reads_assembly_ch, remainder: true) +workflow { -/* - * QUANTIFICATION - */ -// Quantification of reads on each gene in each sample. -process quantification { - tag "${sampleId}" - publishDir "${params.outdir}/06_func_annot/06_2_quantification", mode: 'copy' - - input: - set sampleId, file(gff_prokka), file(fna_prokka), file(bam), file(bam_index) from prokka_bam_ch - - output: - file("${sampleId}.featureCounts.tsv") into counts_ch - file("${sampleId}.featureCounts.tsv.summary") into featureCounts_out_ch_for_multiqc - file("${sampleId}.featureCounts.stdout") into featureCounts_error_ch - - when: ('06_func_annot' in step) - - script: - """ - featureCounts -T ${task.cpus} -p -O -t gene -g ID -a ${gff_prokka} -o ${sampleId}.featureCounts.tsv ${bam} &> ${sampleId}.featureCounts.stdout - """ -} + // Check mandatory parameters -// Create table with sum of reads for each global cluster of genes in each sample. -process quantification_table { - publishDir "${params.outdir}/06_func_annot/06_2_quantification", mode: 'copy' - label 'python' + //////////// + // Start check samplesheet + //////////// + if (params.input) { ch_input = file(params.input) } else { exit 1, 'Input samplesheet not specified!' } - input: - file(clusters_contigs) from table_clstr_contigs_ch.collect() - file(global_clusters_clusters) from table_global_clstr_clstr_ch - file(counts_files) from counts_ch.collect() + skip_clean = params.skip_clean - output: - file("Correspondence_global_clstr_genes.txt") into table_global_clstr_genes_ch - file("Clusters_Count_table_all_samples.txt") into quantification_table_ch - - when: ('06_func_annot' in step) - script: - """ - ls ${clusters_contigs} | cat > List_of_contigs_files.txt - ls ${counts_files} | cat > List_of_count_files.txt - Quantification_clusters.py -t ${global_clusters_clusters} -l List_of_contigs_files.txt -c List_of_count_files.txt -oc Clusters_Count_table_all_samples.txt -oid Correspondence_global_clstr_genes.txt - """ -} - -/* - * FUNCTIONAL ANNOTATION OF GENES. - */ - -if(params.eggnogmapper_db) { - // Built eggNOG-mapper database. - process eggnog_mapper_db { - - label 'eggnog' - - output: - file "db_eggnog_mapper" into functional_annot_db_ch - - when: ('06_func_annot' in step) - - script: - """ - mkdir db_eggnog_mapper - /eggnog-mapper-2.0.4-rf1/download_eggnog_data.py -P -f -y --data_dir db_eggnog_mapper - """ - } -} -else { - if(params.eggnog_mapper_db_dir) { - functional_annot_db_ch = Channel.fromPath(params.eggnog_mapper_db_dir).first() - } - else { - if (!(params.eggnogmapper_db) & !(params.eggnog_mapper_db_dir) & "06_func_annot" in step){ - exit 1, "You must specify --eggnogmapper_db or --eggnog_mapper_db_dir" + if (params.type == 'SR') { + if (!['metaspades','megahit'].contains(params.assembly)){ + exit 1, "Invalid short read assembler option: ${params.assembly}. Valid options: 'metaspades', 'megahit'" + } + if (!["solexa","illumina","sanger"].contains(params.quality_type)){ + exit 1, "Invalid quality_type option: ${params.quality_type}.
Valid options:'solexa','illumina','sanger'" + } + if (!(params.skip_host_filter) && !(params.host_fasta) && !(params.skip_clean)) { + exit 1, "You must specify --host_fasta or skip cleaning host step with option --skip_host_filter or skip all clean and qc modules with --skip_clean" } - } -} - -// Run eggNOG-mapper. -process eggnog_mapper { - publishDir "${params.outdir}/06_func_annot/06_3_functional_annotation", mode: 'copy' - label 'eggnog' - - input: - set sampleId, file(renamed_prokka_faa) from prokka_renamed_faa_ch4 - file(db) from functional_annot_db_ch - - output: - file "${sampleId}_diamond_one2one.emapper.seed_orthologs" into functional_annot_seed_ch - file "${sampleId}_diamond_one2one.emapper.annotations" into functional_annot_ch - - when: ('06_func_annot' in step) - - script: - """ - /eggnog-mapper-2.0.4-rf1/emapper.py -i ${renamed_prokka_faa} --output ${sampleId}_diamond_one2one -m diamond --cpu ${task.cpus} --data_dir ${db} --target_orthologs one2one - """ -} - -// Best hits (best bitscore) of diamond results. -process best_hits_diamond { - publishDir "${params.outdir}/06_func_annot/06_3_functional_annotation", mode: 'copy' - - input: - set sampleId, file(diamond_file) from diamond_result_for_annot_ch - - output: - file "${sampleId}.best_hit" into diamond_best_hits_ch - - when: ('06_func_annot' in step) - - script: - """ - filter_diamond_hits.py -o ${sampleId}.best_hit ${diamond_file} - """ -} - -// Merge eggNOG-mapper output files and quantification table. -process merge_quantif_and_functional_annot { - publishDir "${params.outdir}/06_func_annot/06_3_functional_annotation", mode: 'copy' - - input: - file(functionnal_annotations_files) from functional_annot_ch.collect() - file(quantification_table) from quantification_table_ch - file(diamond_files) from diamond_best_hits_ch.collect() - - output: - file "Quantifications_and_functional_annotations.tsv" into quantification_and_functional_annotation_ch - - when: ('06_func_annot' in step) - - script: - """ - awk '{ - if(NR == 1) { - print \$0 "\t" "sum"} - else { - for (i=1; i<=NF; i++) - { - if (i == 1) - { - sum = O; - } - else { - sum = sum + \$i; - } - } - print \$0 "\t" sum - } - }' ${quantification_table} > ${quantification_table}.sum - ls ${functionnal_annotations_files} | cat > List_of_functionnal_annotations_files.txt - ls ${diamond_files} | cat > List_of_diamond_files.txt - merge_abundance_and_functional_annotations.py -t ${quantification_table}.sum -f List_of_functionnal_annotations_files.txt -d List_of_diamond_files.txt -o Quantifications_and_functional_annotations.tsv - """ -} - -// Merge eggNOG-mapper output files and quantification table. -process make_functional_annotation_tables { - publishDir "${params.outdir}/06_func_annot/06_3_functional_annotation", mode: 'copy' - - input: - file(quantif_and_functionnal_annotations) from quantification_and_functional_annotation_ch - - output: - file "*" into quantif_by_function_ch - - when: ('06_func_annot' in step) - - script: - """ - quantification_by_functional_annotation.py -i ${quantif_and_functionnal_annotations} - """ -} - - // TAXONOMIC AFFILIATION OF CONTIGS. - - // Download taxonomy if taxonomy_dir is not given. 
-if(!params.taxonomy_dir) { - process download_taxonomy_db { - - when: - ('07_taxo_affi' in step) - - input: - val accession2taxid_ch - val taxdump_ch - - output: - set file("*taxid*"), file ("*taxdump*") into taxonomy_ch - - script: - """ - wget ${accession2taxid_ch} - file='${accession2taxid_ch}' - fileName=\${file##*/} - echo \$fileName - gunzip \$fileName - wget ${taxdump_ch} - file_taxdump='${taxdump_ch}' - fileName_taxdump=\${file_taxdump##*/} - echo \$fileName_taxdump - mkdir taxdump; mv \$fileName_taxdump taxdump; cd taxdump ; tar xzvf \$fileName_taxdump - """ - } -} -else if(params.taxonomy_dir) { - assert file(params.taxonomy_dir + '/prot.accession2taxid').exists() - assert file(params.taxonomy_dir + '/taxdump').exists() - accession2taxid_ch = Channel - .fromPath(params.taxonomy_dir + '/prot.accession2taxid') - taxdump_ch = Channel - .fromPath(params.taxonomy_dir + '/taxdump') - taxonomy_ch = accession2taxid_ch.combine(taxdump_ch) -} -else { - exit 1, "You must specify [--accession2taxid and --taxdump] or --taxonomy_dir" -} - -/* - * Create channel with sample id, diamond files and desman length files, idxstats and depth files. -*/ - -diamond_parser_input_ch = diamond_result_ch.join(prot_length_ch, remainder: true).join(idxstats_ch, remainder: true).join(contig_depth_ch, remainder: true) - -// Python parser. -process diamond_parser { - tag "$sampleId" - publishDir "${params.outdir}/07_taxo_affi/$sampleId", mode: 'copy' - label 'python' - - when: ('07_taxo_affi' in step) - - input: - set file(accession2taxid), file(taxdump) from taxonomy_ch.collect() - set sampleId, file(diamond_file), file(prot_len), file(idxstats), file(depth) from diamond_parser_input_ch - - output: - set sampleId, file("${sampleId}.percontig.tsv") into taxo_percontig_ch - set sampleId, file("${sampleId}.pergene.tsv") into taxo_pergene_ch - set sampleId, file("${sampleId}.warn.tsv") into taxo_warn_ch - set sampleId, file("graphs") into taxo_graphs_ch - file("${sampleId}_quantif_percontig.tsv") into quantif_percontig_ch - file("${sampleId}_quantif_percontig_by_superkingdom.tsv") into quantif_percontig_superkingdom_ch - file("${sampleId}_quantif_percontig_by_phylum.tsv") into quantif_percontig_phylum_ch - file("${sampleId}_quantif_percontig_by_order.tsv") into quantif_percontig_order_ch - file("${sampleId}_quantif_percontig_by_class.tsv") into quantif_percontig_class_ch - file("${sampleId}_quantif_percontig_by_family.tsv") into quantif_percontig_family_ch - file("${sampleId}_quantif_percontig_by_genus.tsv") into quantif_percontig_genus_ch - file("${sampleId}_quantif_percontig_by_species.tsv") into quantif_percontig_species_ch - - script: - """ - aln2taxaffi.py -a ${accession2taxid} --taxonomy ${taxdump} -o ${sampleId} ${diamond_file} ${prot_len} - merge_contig_quantif_perlineage.py -i ${idxstats} -c ${sampleId}.percontig.tsv -m ${depth} -o ${sampleId}_quantif_percontig - """ -} - -process quantif_and_taxonomic_table_contigs { - publishDir "${params.outdir}/07_taxo_affi", mode: 'copy' - label 'python' - - when: ('07_taxo_affi' in step) - - input: - file(files_all) from quantif_percontig_ch.collect() - file(files_superkingdom) from quantif_percontig_superkingdom_ch.collect() - file(files_phylum) from quantif_percontig_phylum_ch.collect() - file(files_order) from quantif_percontig_order_ch.collect() - file(files_class) from quantif_percontig_class_ch.collect() - file(files_family) from quantif_percontig_family_ch.collect() - file(files_genus) from quantif_percontig_genus_ch.collect() - file(files_species) from 
quantif_percontig_species_ch.collect() - - output: - file("quantification_by_contig_lineage*.tsv") into quantif_by_lineage_contigs_ch - - script: - """ - echo "${files_all}" > all.txt - echo "${files_superkingdom}" > superkingdom.txt - echo "${files_phylum}" > phylum.txt - echo "${files_order}" > order.txt - echo "${files_class}" > class.txt - echo "${files_family}" > family.txt - echo "${files_genus}" > genus.txt - echo "${files_species}" > species.txt - for i in ${taxons_affi_taxo_contigs} ; - do - quantification_by_contig_lineage.py -i \$i".txt" -o quantification_by_contig_lineage_\$i".tsv" - done - """ -} - -/* - * Binning of contigs - from nf-core/mag - */ - -reads_assembly_ch_for_metabat2 = reads_assembly_ch_for_metabat2.groupTuple(by:[0,1]).join(prokka_renamed_fna_for_metabat2_ch) - -reads_assembly_ch_for_metabat2 = reads_assembly_ch_for_metabat2.dump(tag:'reads_assembly_ch_for_metabat2') - -process metabat { - tag "$sampleId" - publishDir "${params.outdir}/", mode: 'copy', - saveAs: {filename -> (filename.indexOf(".bam") == -1 && filename.indexOf(".fastq.gz") == -1) ? "08_binning/08_1_binning/$filename" : null} - label 'binning' - - input: - set val(sampleId), file(bam), file(index), file(assembly) from reads_assembly_ch_for_metabat2 - val(min_size) from params.min_contig_size - - output: - set val(sampleId), file("MetaBAT2/*") into metabat_bins mode flatten - set val(sampleId), file("MetaBAT2/*") into metabat_bins_for_cat - set val(sampleId), file("MetaBAT2/*") into metabat_bins_quast_bins - - when: ('08_binning' in step) - - script: - def name = "${sampleId}" - """ - jgi_summarize_bam_contig_depths --outputDepth depth.txt ${bam} - metabat2 -t "${task.cpus}" -i "${assembly}" -a depth.txt -o "MetaBAT2/${name}" -m ${min_size} - #if bin folder is empty - if [ -z \"\$(ls -A MetaBAT2)\" ]; then - cp ${assembly} MetaBAT2/${assembly} - fi - """ -} - -process busco_download_db { - tag "${database.baseName}" - label 'binning' - - input: - file(database) from file_busco_db - - output: - set val("${database.toString().replace(".tar.gz", "")}"), file("buscodb/*") into busco_db - - when: ('08_binning' in step) && !(params.skip_busco) - - script: - """ - mkdir buscodb - tar -xf ${database} -C buscodb - """ -} - -metabat_bins - .combine(busco_db) - .set { metabat_db_busco } - -/* - * BUSCO: Quantitative measures for the assessment of genome assembly - */ -process busco { - tag "${assembly}" - publishDir "${params.outdir}/08_binning/08_2_QC/BUSCO/", mode: 'copy' - label 'binning' - - input: - set val(sampleId), file(assembly), val(db_name), file(db) from metabat_db_busco - - output: - file("short_summary_${assembly}.txt") into (busco_summary_to_multiqc, busco_summary_to_plot) - val("$sampleId") into busco_assembler_sample_to_plot - file("${assembly}_busco_log.txt") - file("${assembly}_buscos.faa") - file("${assembly}_buscos.fna") - - when: ('08_binning' in step) && !(params.skip_busco) - - script: - if(workflow.profile.toString().indexOf("conda") == -1) { - """ - cp -r /opt/conda/pkgs/augustus*/config augustus_config/ - export AUGUSTUS_CONFIG_PATH=augustus_config - run_BUSCO.py \ - --in ${assembly} \ - --lineage_path $db_name \ - --cpu "${task.cpus}" \ - --blast_single_core \ - --mode genome \ - --out ${assembly} \ - >${assembly}_busco_log.txt - cp run_${assembly}/short_summary_${assembly}.txt short_summary_${assembly}.txt - for f in run_${assembly}/single_copy_busco_sequences/*faa; do - [ -e "\$f" ] && cat run_${assembly}/single_copy_busco_sequences/*faa >${assembly}_buscos.faa || touch 
${assembly}_buscos.faa - break - done - for f in run_${assembly}/single_copy_busco_sequences/*fna; do - [ -e "\$f" ] && cat run_${assembly}/single_copy_busco_sequences/*fna >${assembly}_buscos.fna || touch ${assembly}_buscos.fna - break - done - """ } else { - """ - run_BUSCO.py \ - --in ${assembly} \ - --lineage_path $db_name \ - --cpu "${task.cpus}" \ - --blast_single_core \ - --mode genome \ - --out ${assembly} \ - >${assembly}_busco_log.txt - cp run_${assembly}/short_summary_${assembly}.txt short_summary_${assembly}.txt - for f in run_${assembly}/single_copy_busco_sequences/*faa; do - [ -e "\$f" ] && cat run_${assembly}/single_copy_busco_sequences/*faa >${assembly}_buscos.faa || touch ${assembly}_buscos.faa - break - done - for f in run_${assembly}/single_copy_busco_sequences/*fna; do - [ -e "\$f" ] && cat run_${assembly}/single_copy_busco_sequences/*fna >${assembly}_buscos.fna || touch ${assembly}_buscos.fna - break - done - """ + skip_clean = true } -} - -process busco_plot { - publishDir "${params.outdir}/08_binning/08_2_QC/", mode: 'copy' - label 'binning' - - input: - file(summaries) from busco_summary_to_plot.collect() - val(assemblersample) from busco_assembler_sample_to_plot.collect() - - output: - file("*busco_figure.png") - file("BUSCO/*busco_figure.R") - file("BUSCO/*busco_summary.txt") - file("busco_summary.txt") into busco_summary - - when: ('08_binning' in step) && !(params.skip_busco) - - script: - def assemblersampleunique = assemblersample.unique() - """ - #for each assembler and sample: - assemblersample=\$(echo \"$assemblersampleunique\" | sed 's/[][]//g') - IFS=', ' read -r -a assemblersamples <<< \"\$assemblersample\" - mkdir BUSCO - for name in \"\${assemblersamples[@]}\"; do - mkdir \${name} - cp short_summary_\${name}* \${name}/ - generate_plot.py --working_directory \${name} - - cp \${name}/busco_figure.png \${name}-busco_figure.png - cp \${name}/busco_figure.R \${name}-busco_figure.R - summary_busco.py \${name}/short_summary_*.txt >BUSCO/\${name}-busco_summary.txt - done - cp *-busco_figure.R BUSCO/ - summary_busco.py short_summary_*.txt >busco_summary.txt - """ -} - -process quast_bins { - tag "$sampleId" - publishDir "${params.outdir}/08_binning/08_2_QC/", mode: 'copy' - label 'binning' - - input: - set val(sampleId), file(assembly) from metabat_bins_quast_bins - - output: - file("QUAST/*") - file("QUAST/*-quast_summary.tsv") into quast_bin_summaries - - when: ('08_binning' in step) - - script: - """ - ASSEMBLIES=\$(echo \"$assembly\" | sed 's/[][]//g') - IFS=', ' read -r -a assemblies <<< \"\$ASSEMBLIES\" - - for assembly in \"\${assemblies[@]}\"; do - metaquast.py --threads "${task.cpus}" --max-ref-number 0 --rna-finding --gene-finding -l "\${assembly}" "\${assembly}" -o "QUAST/\${assembly}" - if ! 
[ -f "QUAST/${sampleId}-quast_summary.tsv" ]; then - cp "QUAST/\${assembly}/transposed_report.tsv" "QUAST/${sampleId}-quast_summary.tsv" - else - tail -n +2 "QUAST/\${assembly}/transposed_report.tsv" >> "QUAST/${sampleId}-quast_summary.tsv" - fi - done - """ -} - -process merge_quast_and_busco { - publishDir "${params.outdir}/08_binning/08_2_QC/", mode: 'copy' - label 'binning' - - input: - file(quast_bin_sum) from quast_bin_summaries.collect() - file(busco_sum) from busco_summary - - output: - file("quast_and_busco_summary.tsv") - file("quast_summary.tsv") - - when: ('08_binning' in step) && !(params.skip_busco) - - script: - """ - QUAST_BIN=\$(echo \"$quast_bin_sum\" | sed 's/[][]//g') - IFS=', ' read -r -a quast_bin <<< \"\$QUAST_BIN\" - - for quast_file in \"\${quast_bin[@]}\"; do - if ! [ -f "quast_summary.tsv" ]; then - cp "\${quast_file}" "quast_summary.tsv" - else - tail -n +2 "\${quast_file}" >> "quast_summary.tsv" - fi - done - combine_tables.py $busco_sum quast_summary.tsv >quast_and_busco_summary.tsv - """ -} - -/* - * CAT: Bin Annotation Tool (BAT) are pipelines for the taxonomic - * classification of long DNA sequences and metagenome assembled genomes - * (MAGs/bins) - from nf-core/mag - */ -process cat_db { - tag "${database.baseName}" - label 'binning' - - input: - file(database) from file_cat_db - - output: - set val("${database.toString().replace(".tar.gz", "")}"), file("database/*"), file("taxonomy/*") into cat_db - - when: ('08_binning' in step) - - script: - """ - mkdir catDB - tar -xf ${database} -C catDB - mv `find catDB/ -type d -name "*taxonomy*"` taxonomy/ - mv `find catDB/ -type d -name "*CAT_database*"` database/ - """ -} -metabat_bins_for_cat - .combine(cat_db) - .set { cat_input } - -process cat { - tag "${sampleId}-${db_name}" - publishDir "${params.outdir}/08_binning/08_3_taxonomy/", mode: 'copy', - saveAs: {filename -> - if (filename.indexOf(".names.txt") > 0) filename - else "raw/$filename" + if ( !(params.stop_at_structural_annot) && !(params.diamond_bank) ) { + exit 1, "You must specify --stop_at_structural_annot or specify a diamond bank with --diamond_bank" } - - input: - set val(sampleId), file("bins/*"), val(db_name), file("database/*"), file("taxonomy/*") from cat_input - - output: - file("*.ORF2LCA.txt") optional true - file("*.names.txt") optional true - file("*.predicted_proteins.faa") optional true - file("*.predicted_proteins.gff") optional true - file("*.log") optional true - file("*.bin2classification.txt") optional true - stdout into stdout_ch - - when: ('08_binning' in step) - - script: - """ - COUNT=`ls -1 bins/*.fa 2>/dev/null | wc -l` - if [ \$COUNT != 0 ] - then - CAT bins -b "bins/" -d database/ -t taxonomy/ -n "${task.cpus}" -s .fa --top 6 -o "${sampleId}" --I_know_what_Im_doing > stdout_CAT.txt - CAT add_names -i "${sampleId}.ORF2LCA.txt" -o "${sampleId}.ORF2LCA.names.txt" -t taxonomy/ >> stdout_CAT.txt - CAT add_names -i "${sampleId}.bin2classification.txt" -o "${sampleId}.bin2classification.names.txt" -t taxonomy/ >> stdout_CAT.txt - else - echo "Sample ${sampleId}: no bins found, it is impossible to make taxonomic affiliation of bins." - fi - """ -} - -// Print error message. -stdout_ch.subscribe { - log.info(it) -} - -/* - * SOFTWARES VERSIONS. 
- */ - -process get_software_versions { - publishDir "${params.outdir}/pipeline_info", mode: 'copy', - saveAs: {filename -> - if (filename.indexOf(".csv") > 0) filename - else null + header = getAndCheckHeader() + + Channel.from(file(params.input) + .splitCsv ( header:true, sep:',' ) ) + .map { row -> + def sample = row.sample + def paired = false + if (row.fastq_2 != null) { + paired = true + } + if (hasExtension(row.fastq_1, "fastq") || hasExtension(row.fastq_1, "fq") || hasExtension(row.fastq_2, "fastq") || hasExtension(row.fastq_2, "fq")) { + exit 1, "Please provide gzipped fastq files (.fastq.gz) to reduce your data footprint." + } + ["sample":row.sample, + "fastq_1":returnFile(row.fastq_1), + "fastq_2":returnFile(row.fastq_2), + "paired": paired, + "assembly":returnFile(row.assembly) ] + } + .set { ch_inputs } + + //////////// + // End check samplesheet + //////////// + + // Databases + ch_host_fasta = Channel.empty() + ch_host_index = Channel.empty() + ch_kaiju_db = Channel.empty() + ch_eggnog_db = Channel.empty() + ch_taxonomy = Channel.empty() + + DATABASES (skip_clean) + ch_host_fasta = DATABASES.out.host_fasta + ch_host_index = DATABASES.out.host_index + ch_kaiju_db = DATABASES.out.kaiju_db + ch_eggnog_db = DATABASES.out.eggnog + ch_taxonomy = DATABASES.out.taxonomy + + ch_multiqc_config = Channel.empty() + + // SR only report + ch_cutadapt_report = Channel.empty() + ch_sickle_report = Channel.empty() + ch_before_filter_report = Channel.empty() + ch_after_filter_report = Channel.empty() + ch_fastqc_raw_report = Channel.empty() + ch_fastqc_clean_report = Channel.empty() + ch_kaiju_report = Channel.empty() + ch_dedup_report = Channel.empty() + ch_assembly_report = Channel.empty() + ch_filtered_report = Channel.empty() + + // HIFI only report + ch_hifi_fastqc_report = Channel.empty() + ch_hifi_quast_report = Channel.empty() + + // Shared report + ch_prokka_report = Channel.empty() + ch_quant_report = Channel.empty() + ch_v_eggnogmapper = Channel.empty() + + if ( params.type.toUpperCase() == "SR" ) { + ch_multiqc_config = file(params.sr_multiqc_config, checkIfExists: true) + println("Entering SR") + ch_inputs + .map { item -> [ item.sample, item.fastq_1, item.fastq_2 ] } + .set { ch_reads } + ch_inputs + .map { item -> [ item.sample, item.paired ] } + .set { ch_paired } + //ch_reads.view{ it -> "${it}" } + //ch_paired.view{ it -> "${it}" } + + SR ( + ch_reads, + ch_paired, + ch_host_fasta, + ch_host_index, + ch_kaiju_db + ) + ch_reads = SR.out.dedup + ch_assembly = SR.out.assembly + + ch_cutadapt_report = SR.out.cutadapt_report + ch_sickle_report = SR.out.sickle_report + ch_before_filter_report = SR.out.before_filter_report + ch_after_filter_report = SR.out.after_filter_report + ch_fastqc_raw_report = SR.out.fastqc_raw_report + ch_fastqc_clean_report = SR.out.fastqc_clean_report + ch_kaiju_report = SR.out.kaiju_report + ch_dedup_report = SR.out.dedup_report + ch_assembly_report = SR.out.assembly_report + ch_filtered_report = SR.out.filtered_report } - output: - file 'software_versions_mqc.yaml' into software_versions_yaml - file "software_versions.csv" - - script: - """ - echo $workflow.manifest.version > v_pipeline.txt - echo $workflow.manifest.nextflowVersion > v_nextflow.txt - echo \$(bwa 2>&1) &> v_bwa.txt - cutadapt --version &> v_cutadapt.txt - sickle --version &> v_sickle.txt - ktImportText &> v_kronatools.txt - python --version &> v_python.txt - echo \$(cd-hit -h 2>&1) > v_cdhit.txt - featureCounts -v &> v_featurecounts.txt - diamond help &> v_diamond.txt - multiqc
--version &> v_multiqc.txt - fastqc --version &> v_fastqc.txt - megahit --version &> v_megahit.txt - spades.py --version &> v_spades.txt - quast -v &> v_quast.txt - prokka -v &> v_prokka.txt - echo \$(kaiju -h 2>&1) > v_kaiju.txt - samtools --version &> v_samtools.txt - bedtools --version &> v_bedtools.txt - scrape_software_versions.py > software_versions_mqc.yaml - """ -} - -process multiqc { - publishDir "${params.outdir}/MultiQC", mode: 'copy' - - input: - file multiqc_config from multiqc_config_ch - file ('*') from cutadapt_log_ch_for_multiqc.collect().ifEmpty([]) - file ('*') from sickle_log_ch_for_multiqc.collect().ifEmpty([]) - file ('*') from fastqc_raw_for_multiqc_ch.collect().ifEmpty([]) - file ('*') from fastqc_cleaned_for_multiqc_ch.collect().ifEmpty([]) - file("*_select_contigs_QC/*") from quast_select_contigs_for_multiqc_ch.collect().ifEmpty([]) - file("*_all_contigs_QC/*") from quast_assembly_for_multiqc_ch.collect().ifEmpty([]) - file("*") from prokka_for_multiqc_ch.collect().ifEmpty([]) - file("*") from kaiju_summary_for_multiqc_ch.collect().ifEmpty([]) - file("*") from featureCounts_out_ch_for_multiqc.collect().ifEmpty([]) - file ('software_versions/*') from software_versions_yaml.collect().ifEmpty([]) - file ('*') from busco_summary_to_multiqc.collect().ifEmpty([]) - file("host_filter_flagstat/*") from flagstat_after_host_filter_for_multiqc_ch.collect().ifEmpty([]) - file("*") from flagstat_before_filter_for_multiqc_ch.collect().ifEmpty([]) - file("*") from flagstat_after_dedup_reads_for_multiqc_ch.collect().ifEmpty([]) + else if ( params.type.toUpperCase() == "HIFI" ) { + + ch_multiqc_config = file(params.hifi_multiqc_config, checkIfExists: true) + println("Entering HiFi") + ch_inputs.map { item -> [ item.sample, item.assembly ] } // [sample, assembly] + .set { ch_assembly } + + ch_inputs.map { item -> [ item.sample, item.fastq_1 ] } // [sample, reads] + .set { ch_reads } + + S04_HIFI_FASTQC( ch_reads ) + S04_HIFI_QUAST( ch_assembly ) + ch_hifi_fastqc_report = S04_HIFI_FASTQC.out.zip + ch_hifi_quast_report = S04_HIFI_QUAST.out.report + } + + else { + exit 1, "Invalid type option: ${params.type}. Valid options are 'HiFi' for long-read, 'SR' for short-read" + } + + SH ( + ch_reads, + ch_assembly, + ch_eggnog_db, + ch_taxonomy + ) + + ch_prokka_report = SH.out.prokka_report + ch_quant_report = SH.out.quant_report + ch_v_eggnogmapper = SH.out.v_eggnogmapper + + GET_SOFTWARE_VERSIONS( ch_v_eggnogmapper.ifEmpty([]).first() ) + ch_software_versions = GET_SOFTWARE_VERSIONS.out.yaml + + MULTIQC ( + ch_multiqc_config, + ch_software_versions, + ch_cutadapt_report.collect().ifEmpty([]), + ch_sickle_report.collect().ifEmpty([]), + ch_before_filter_report.collect().ifEmpty([]), + ch_after_filter_report.collect().ifEmpty([]), + ch_fastqc_raw_report.collect().ifEmpty([]), + ch_fastqc_clean_report.collect().ifEmpty([]), + ch_kaiju_report.collect().ifEmpty([]), + ch_dedup_report.collect().ifEmpty([]), + ch_assembly_report.collect().ifEmpty([]), + ch_filtered_report.collect().ifEmpty([]), + ch_hifi_fastqc_report.collect().ifEmpty([]), + ch_hifi_quast_report.collect().ifEmpty([]), + ch_prokka_report.collect().ifEmpty([]), + ch_quant_report.collect().ifEmpty([]) + ) + multiqc_report = MULTIQC.out.report - output: - file "multiqc_report.html" into ch_multiqc_report - - script: - """ - multiqc . 
--config $multiqc_config -m custom_content -m fastqc -m cutadapt -m sickle -m kaiju -m quast -m prokka -m featureCounts -m busco -m samtools - """ } diff --git a/modules/assembly.nf b/modules/assembly.nf new file mode 100644 index 0000000000000000000000000000000000000000..052cad0c62c3a3d91ccbf26d52f05a255233a841 --- /dev/null +++ b/modules/assembly.nf @@ -0,0 +1,29 @@ +process ASSEMBLY { + tag "${sampleId}" + publishDir "${params.outdir}/02_assembly", mode: 'copy' + label 'ASSEMBLY' + + input: + tuple val(sampleId), path(read1), path(read2) + val metaspades_mem + + output: + tuple val(sampleId), path("${params.assembly}/${sampleId}.contigs.fa"), emit: assembly + tuple val(sampleId), path("${params.assembly}/${sampleId}.log"), path("${params.assembly}/${sampleId}.params.txt"), emit: report + + script: + if(params.assembly=='metaspades') + """ + metaspades.py -t ${task.cpus} -m ${metaspades_mem} -1 ${read1} -2 ${read2} -o ${params.assembly} + mv ${params.assembly}/scaffolds.fasta ${params.assembly}/${sampleId}.contigs.fa + mv ${params.assembly}/spades.log ${params.assembly}/${sampleId}.log + mv ${params.assembly}/params.txt ${params.assembly}/${sampleId}.params.txt + """ + else if(params.assembly=='megahit') + """ + megahit -t ${task.cpus} -1 ${read1} -2 ${read2} -o ${params.assembly} --out-prefix "${sampleId}" + mv ${params.assembly}/options.json ${params.assembly}/${sampleId}.params.txt + """ + else + error "Invalid parameter: ${params.assembly}" +} \ No newline at end of file diff --git a/modules/assign_taxonomy.nf b/modules/assign_taxonomy.nf new file mode 100644 index 0000000000000000000000000000000000000000..749aea734cc9be8ca87b036a948401fba918b902 --- /dev/null +++ b/modules/assign_taxonomy.nf @@ -0,0 +1,29 @@ +process ASSIGN_TAXONOMY { + tag "${sampleId}" + publishDir "${params.outdir}/07_taxo_affi/${sampleId}", mode: 'copy' + label 'PYTHON' + + input: + tuple path(accession2taxid), path(taxdump) + tuple val(sampleId), path(m8), path(sam_coverage), path(prot_len) + + output: + tuple val(sampleId), path("${sampleId}.percontig.tsv"), emit: t_percontig + tuple val(sampleId), path("${sampleId}.pergene.tsv"), emit: t_pergene + tuple val(sampleId), path("${sampleId}.warn.tsv"), emit: t_warn + tuple val(sampleId), path("graphs"), emit: t_graphs + path "${sampleId}_quantif_percontig.tsv", emit: q_all + path "${sampleId}_quantif_percontig_by_superkingdom.tsv", emit: q_superkingdom + path "${sampleId}_quantif_percontig_by_phylum.tsv", emit: q_phylum + path "${sampleId}_quantif_percontig_by_order.tsv", emit: q_order + path "${sampleId}_quantif_percontig_by_class.tsv", emit: q_class + path "${sampleId}_quantif_percontig_by_family.tsv", emit: q_family + path "${sampleId}_quantif_percontig_by_genus.tsv", emit: q_genus + path "${sampleId}_quantif_percontig_by_species.tsv", emit: q_species + + script: + """ + aln2taxaffi.py -a ${accession2taxid} --taxonomy ${taxdump} -o ${sampleId} ${m8} ${prot_len} + merge_contig_quantif_perlineage.py -c ${sampleId}.percontig.tsv -s ${sam_coverage} -o ${sampleId}_quantif_percontig + """ +} diff --git a/modules/best_hits.nf b/modules/best_hits.nf new file mode 100644 index 0000000000000000000000000000000000000000..91660cc9af7a74806d592973873a28371fe7dfff --- /dev/null +++ b/modules/best_hits.nf @@ -0,0 +1,14 @@ +process BEST_HITS { + publishDir "${params.outdir}/06_func_annot/06_3_functional_annotation", mode: 'copy' + + input: + tuple val(sampleId), path(m8) + + output: + path "${sampleId}.best_hit", emit: best_hits + + script: + """ + filter_diamond_hits.py -o 
${sampleId}.best_hit ${m8} + """ +} \ No newline at end of file diff --git a/modules/cd_hit.nf b/modules/cd_hit.nf new file mode 100644 index 0000000000000000000000000000000000000000..68b221da7a02a9be6956a29e65f19d0ae18b5370 --- /dev/null +++ b/modules/cd_hit.nf @@ -0,0 +1,64 @@ +process INDIVIDUAL_CD_HIT { + tag "${sampleId}" + publishDir "${params.outdir}/06_func_annot/06_1_clustering", mode: 'copy' + label 'CD_HIT' + + input: + tuple val(sampleId), path(ffn) + val pct_id + + output: + path("${sampleId}.cd-hit-est.${pct_id}.fasta"), emit: clstr_fasta + path("${sampleId}.cd-hit-est.${pct_id}.table_cluster_contigs.txt"), emit: clstr_table + + script: + """ + cd-hit-est -c ${pct_id} -i ${ffn} -o ${sampleId}.cd-hit-est.${pct_id}.fasta -T ${task.cpus} -M ${task.mem} -d 150 + cat ${sampleId}.cd-hit-est.${pct_id}.fasta.clstr | cd_hit_produce_table_clstr.py > ${sampleId}.cd-hit-est.${pct_id}.table_cluster_contigs.txt + """ +} + + + +// Global clustering with CD-HIT. +process GLOBAL_CD_HIT { + publishDir "${params.outdir}/06_func_annot/06_1_clustering", mode: 'copy' + label 'CD_HIT' + + input: + path "*.fasta" + val pct_id + + output: + path "All-cd-hit-est.${pct_id}.fasta" + path "table_clstr.txt", emit: clstr_table + + // when: ('06_func_annot' in step) + + script: + """ + cat * > All-cd-hit-est.${pct_id} + cd-hit-est -c ${pct_id} -i All-cd-hit-est.${pct_id} -o All-cd-hit-est.${pct_id}.fasta -T ${task.cpus} -M ${task.mem} -d 150 + cat All-cd-hit-est.${pct_id}.fasta.clstr | cd_hit_produce_table_clstr.py > table_clstr.txt + """ +} + + + +workflow CD_HIT { + +take: +ch_assembly // channel: [ val(sampleid), path(assemblyfasta) ] +ch_percentage_identity // channel: val + +main: + INDIVIDUAL_CD_HIT( ch_assembly, ch_percentage_identity ) + + GLOBAL_CD_HIT( INDIVIDUAL_CD_HIT.out.clstr_fasta.collect(), ch_percentage_identity ) + +emit: +individual_clstr_table = INDIVIDUAL_CD_HIT.out.clstr_table +global_clstr_table = GLOBAL_CD_HIT.out.clstr_table + +} + diff --git a/modules/cutadapt.nf b/modules/cutadapt.nf index 736aaa800b1bc03d16c8bb92e6d02d836709b51f..9467d99fee12e7f989c50b6ff8166adaaeab61db 100644 --- a/modules/cutadapt.nf +++ b/modules/cutadapt.nf @@ -1,5 +1,5 @@ process CUTADAPT { - tag "$sampleId" + tag "${sampleId}" publishDir "${params.outdir}/01_clean_qc/01_1_cleaned_reads/", mode: 'copy', pattern: 'cleaned_*.fastq.gz' publishDir "${params.outdir}/01_clean_qc/01_1_cleaned_reads/logs", mode: 'copy', pattern: '*_cutadapt.log' @@ -11,12 +11,10 @@ process CUTADAPT { output: tuple val(sampleId), path("*${sampleId}*_R1.fastq.gz"), path("*${sampleId}*_R2.fastq.gz"), emit: reads - path "${sampleId}_cutadapt.log", emit: logs - - // when: ('01_clean_qc' in step || '02_assembly' in step || '03_filtering' in step || '04_structural_annot' in step || '05_alignment' in step || '06_func_annot' in step || '07_taxo_affi' in step || '08_binning' in step) && (!params.skip_01_clean_qc) + path "${sampleId}_cutadapt.log", emit: report script: - if(params.skip_sickle & params.skip_removal_host) { + if (params.skip_sickle & params.skip_host_filter) { // output are final cleaned paths output_paths = "-o cleaned_${sampleId}_R1.fastq.gz -p cleaned_${sampleId}_R2.fastq.gz" } diff --git a/modules/diamond.nf b/modules/diamond.nf new file mode 100644 index 0000000000000000000000000000000000000000..13b4523fb5ace23944c12464590178537fa7dab0 --- /dev/null +++ b/modules/diamond.nf @@ -0,0 +1,22 @@ +process DIAMOND { + publishDir "${params.outdir}/05_alignment/05_2_database_alignment/$sampleId", mode: 'copy' + tag "${sampleId}" + + 
input: + tuple val(sampleId), path(faa) + val diamond_bank + + output: + tuple val(sampleId), path("${sampleId}_aln_diamond.m8"), emit: m8 + + script: + fmt="qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen stitle" + fmt_tab=fmt.replaceAll(" ","\t") + """ + echo "$fmt_tab" > head.m8 + diamond blastp -p ${task.cpus} -d ${diamond_bank} -q ${faa} -o ${sampleId}_aln_diamond.nohead.m8 -f 6 $fmt + cat head.m8 ${sampleId}_aln_diamond.nohead.m8 > ${sampleId}_aln_diamond.m8 + rm ${sampleId}_aln_diamond.nohead.m8 + rm head.m8 + """ +} diff --git a/modules/eggnog_mapper.nf b/modules/eggnog_mapper.nf new file mode 100644 index 0000000000000000000000000000000000000000..81ce5c167efe81d822dee5c46a76a5565da9f55a --- /dev/null +++ b/modules/eggnog_mapper.nf @@ -0,0 +1,19 @@ +process EGGNOG_MAPPER { + publishDir "${params.outdir}/06_func_annot/06_3_functional_annotation", mode: 'copy' + label 'EGGNOG' + + input: + tuple val(sampleId), path(faa) + path db + + output: + path "${sampleId}_diamond_one2one.emapper.seed_orthologs", emit: seed + path "${sampleId}_diamond_one2one.emapper.annotations", emit: annot + path 'v_eggnogmapper.txt', emit: version + + script: + """ + /eggnog-mapper-2.0.4-rf1/emapper.py -i ${faa} --output ${sampleId}_diamond_one2one -m diamond --cpu ${task.cpus} --data_dir ${db} --target_orthologs one2one + /eggnog-mapper-2.0.4-rf1/emapper.py -v &> v_eggnogmapper.txt + """ +} \ No newline at end of file diff --git a/modules/fastqc.nf b/modules/fastqc.nf new file mode 100644 index 0000000000000000000000000000000000000000..42b00ecf190315f80985aad7bbe34db82c7216de --- /dev/null +++ b/modules/fastqc.nf @@ -0,0 +1,54 @@ +process FASTQC_RAW { + tag "${sampleId}" + label 'FASTQC' + + publishDir "${params.outdir}/01_clean_qc/01_2_qc/fastqc_raw", mode: 'copy' + + input: + tuple val(sampleId), path(read1), path(read2) + + output: + path "${sampleId}/*.zip", emit: zip + path "${sampleId}/*.html", emit: html + + script: + """ + mkdir ${sampleId} ; fastqc --nogroup --quiet -o ${sampleId} --threads ${task.cpus} ${read1} ${read2} + """ +} + +process FASTQC_CLEANED { + tag "${sampleId}" + label 'FASTQC' + publishDir "${params.outdir}/01_clean_qc/01_2_qc/fastqc_cleaned", mode: 'copy' + + input: + tuple val(sampleId), path(read1), path(read2) + + output: + path "${sampleId}/*.zip", emit: zip + path "${sampleId}/*.html", emit: html + + script: + """ + mkdir ${sampleId}; fastqc --nogroup --quiet -o ${sampleId} --threads ${task.cpus} ${read1} ${read2} + """ +} + +process FASTQC_HIFI { + tag "${sampleId}" + label 'FASTQC' + publishDir "${params.outdir}/04_structural_annot/fastqc_hifi", mode: 'copy' + + input: + tuple val(sampleId), path(read) + + output: + path "${sampleId}/*.zip", emit: zip + path "${sampleId}/*.html", emit: html + + script: + """ + mkdir ${sampleId}; fastqc --nogroup --quiet -o ${sampleId} --threads ${task.cpus} ${read} + """ +} \ No newline at end of file diff --git a/modules/feature_counts.nf b/modules/feature_counts.nf new file mode 100644 index 0000000000000000000000000000000000000000..78db042431147675fabb5a6f824b6bbde3fea8a2 --- /dev/null +++ b/modules/feature_counts.nf @@ -0,0 +1,64 @@ +// Quantification of reads on each gene in each sample. 
+process FEATURE_COUNTS { + tag "${sampleId}" + label 'QUANTIFICATION' + publishDir "${params.outdir}/06_func_annot/06_2_quantification", mode: 'copy' + + input: + tuple val(sampleId), file(gff_prokka), file(bam), file(bam_index) + + output: + path "${sampleId}.featureCounts.tsv", emit: count_table + path "${sampleId}.featureCounts.tsv.summary", emit: summary + path "${sampleId}.featureCounts.stdout" + + script: + """ + featureCounts -T ${task.cpus} -p -O -t gene -g ID -a ${gff_prokka} -o ${sampleId}.featureCounts.tsv ${bam} &> ${sampleId}.featureCounts.stdout + """ +} + +// Create table with sum of reads for each global cluster of genes in each sample. +process QUANTIFICATION_TABLE { + publishDir "${params.outdir}/06_func_annot/06_2_quantification", mode: 'copy' + label 'PYTHON' + + input: + path clusters_contigs + path global_clusters_clusters + path counts_files + + output: + path "Clusters_Count_table_all_samples.txt", emit: quantification_table + path "Correspondence_global_clstr_genes.txt" + + script: + """ + ls ${clusters_contigs} | cat > List_of_contigs_files.txt + ls ${counts_files} | cat > List_of_count_files.txt + Quantification_clusters.py -t ${global_clusters_clusters} -l List_of_contigs_files.txt -c List_of_count_files.txt -oc Clusters_Count_table_all_samples.txt -oid Correspondence_global_clstr_genes.txt + """ +} + +workflow QUANTIFICATION { + + take: + ch_gff // channel: [ val(sampleid), path(gff) ] + ch_bam // channel: [ val(sampleid), path(bam), path(bam_index) ] + ch_individual_clstr_table + ch_global_clstr_table + + main: + ch_gff_and_bam = ch_gff.join(ch_bam, remainder: false) + + FEATURE_COUNTS(ch_gff_and_bam) + ch_count_table = FEATURE_COUNTS.out.count_table.collect() + ch_quant_report = FEATURE_COUNTS.out.summary + QUANTIFICATION_TABLE(ch_individual_clstr_table.collect(), ch_global_clstr_table.collect(), ch_count_table) + + emit: + quantification_table = QUANTIFICATION_TABLE.out.quantification_table + quant_report = ch_quant_report +} + + diff --git a/modules/functional_annot_table.nf b/modules/functional_annot_table.nf new file mode 100644 index 0000000000000000000000000000000000000000..d65796fb69ba4404e416dea84d883e766e31317f --- /dev/null +++ b/modules/functional_annot_table.nf @@ -0,0 +1,14 @@ +process FUNCTIONAL_ANNOT_TABLE { + publishDir "${params.outdir}/06_func_annot/06_3_functional_annotation", mode: 'copy' + + input: + path merged_quant_annot_best + + output: + path "*", emit: functional_annot + + script: + """ + quantification_by_functional_annotation.py -i ${merged_quant_annot_best} + """ +} \ No newline at end of file diff --git a/modules/get_software_versions.nf b/modules/get_software_versions.nf new file mode 100644 index 0000000000000000000000000000000000000000..d855e9190b48cabaa2725efdd537e1070b3532ca --- /dev/null +++ b/modules/get_software_versions.nf @@ -0,0 +1,38 @@ +process GET_SOFTWARE_VERSIONS { + publishDir "${params.outdir}/pipeline_info", mode: 'copy', + saveAs: {filename -> + if (filename.indexOf(".csv") > 0) filename + else null + } + + input: + path v_eggnogmapper + + output: + path 'software_versions_mqc.yaml', emit: yaml + path "software_versions.csv" + + script: + """ + echo $workflow.manifest.version > v_pipeline.txt + echo $workflow.nextflow.version > v_nextflow.txt + echo \$(bwa 2>&1) &> v_bwa.txt + cutadapt --version &> v_cutadapt.txt + sickle --version &> v_sickle.txt + ktImportText &> v_kronatools.txt + python --version &> v_python.txt + echo \$(cd-hit -h 2>&1) > v_cdhit.txt + featureCounts -v &> v_featurecounts.txt + diamond 
help &> v_diamond.txt + multiqc --version &> v_multiqc.txt + fastqc --version &> v_fastqc.txt + megahit --version &> v_megahit.txt + spades.py --version &> v_spades.txt + quast -v &> v_quast.txt + prokka -v &> v_prokka.txt + echo \$(kaiju -h 2>&1) > v_kaiju.txt + samtools --version &> v_samtools.txt + bedtools --version &> v_bedtools.txt + scrape_software_versions.py > software_versions_mqc.yaml + """ +} \ No newline at end of file diff --git a/modules/host_filter.nf b/modules/host_filter.nf new file mode 100644 index 0000000000000000000000000000000000000000..e5f93284ebe201e62892459b3f28eb3b74c3a185 --- /dev/null +++ b/modules/host_filter.nf @@ -0,0 +1,34 @@ +process HOST_FILTER { + tag "${sampleId}" + + publishDir "${params.outdir}/01_clean_qc/01_1_cleaned_reads/", mode: 'copy', pattern: 'cleaned_*.fastq.gz' + publishDir "${params.outdir}/01_clean_qc/01_1_cleaned_reads/", mode: 'copy', pattern: '*.bam' + publishDir "${params.outdir}/01_clean_qc/01_1_cleaned_reads/logs", mode: 'copy', + saveAs: {filename -> + if (filename.indexOf(".flagstat") > 0) "$filename" + else null} + + input: + tuple val(sampleId), path(read1), path(read2) + path fasta + path index + + output: + tuple val(sampleId), path("cleaned_${sampleId}_R1.fastq.gz"), path("cleaned_${sampleId}_R2.fastq.gz"), emit: reads + path "host_filter_flagstat/${sampleId}.host_filter.flagstat", emit: hf_report + path "${sampleId}.no_filter.flagstat", emit: nf_report + + script: + """ + bwa mem -t ${task.cpus} ${fasta} ${read1} ${read2} > ${sampleId}.bam + samtools view -bhS -f 12 ${sampleId}.bam > ${sampleId}.without_host.bam + mkdir host_filter_flagstat + samtools flagstat ${sampleId}.bam > ${sampleId}.no_filter.flagstat + samtools flagstat ${sampleId}.without_host.bam >> host_filter_flagstat/${sampleId}.host_filter.flagstat + bamToFastq -i ${sampleId}.without_host.bam -fq cleaned_${sampleId}_R1.fastq -fq2 cleaned_${sampleId}_R2.fastq + gzip cleaned_${sampleId}_R1.fastq + gzip cleaned_${sampleId}_R2.fastq + rm ${sampleId}.bam + rm ${sampleId}.without_host.bam + """ +} \ No newline at end of file diff --git a/modules/kaiju.nf b/modules/kaiju.nf new file mode 100644 index 0000000000000000000000000000000000000000..d71e3deec0825dcbed12c077f7bb8ab04aba5864 --- /dev/null +++ b/modules/kaiju.nf @@ -0,0 +1,90 @@ +taxon_levels = "phylum class order family genus species" + +process KAIJU { + tag "${sampleId}" + + publishDir "${params.outdir}/01_clean_qc/01_3_taxonomic_affiliation_reads", mode: 'copy', pattern: '*.krona.html' + publishDir "${params.outdir}/01_clean_qc/01_3_taxonomic_affiliation_reads", mode: 'copy', pattern: '*_kaiju_MEM_verbose.out' + publishDir "${params.outdir}/01_clean_qc/01_3_taxonomic_affiliation_reads", mode: 'copy', pattern: '*.pdf' + + input: + tuple val(sampleId), path(read1), path(read2) + tuple path(nodes), path(fmi), path(names) + + output: + path "${sampleId}_kaiju_MEM_verbose.out", emit: out + path "${sampleId}.krona.html", emit: html + path "*.summary_*", emit: k_all + path "*.summary_species", emit: k_species + path "*.summary_genus", emit: k_genus + path "*.summary_family", emit: k_family + path "*.summary_class", emit: k_class + path "*.summary_order", emit: k_order + path "*.summary_phylum", emit: k_phylum + path "*.pdf", emit: pdf + + script: + """ + kaiju -z ${task.cpus} -t ${nodes} -f ${fmi} -i ${read1} -j ${read2} -o ${sampleId}_kaiju_MEM_verbose.out -a mem -v + kaiju2krona -t ${nodes} -n ${names} -i ${sampleId}_kaiju_MEM_verbose.out -o ${sampleId}_kaiju_MEM_without_unassigned.out.krona -u + ktImportText -o 
${sampleId}.krona.html ${sampleId}_kaiju_MEM_without_unassigned.out.krona + for i in ${taxon_levels} ; + do + kaiju2table -t ${nodes} -n ${names} -r \$i -o ${sampleId}_kaiju_MEM.out.summary_\$i ${sampleId}_kaiju_MEM_verbose.out + done + Generate_barplot_kaiju.py -i ${sampleId}_kaiju_MEM_verbose.out + """ +} + +process MERGE_KAIJU { + publishDir "${params.outdir}/01_clean_qc/01_3_taxonomic_affiliation_reads", mode: 'copy' + + input: + path k_species + path k_genus + path k_family + path k_class + path k_order + path k_phylum + + output: + path "taxo_affi_reads_*.tsv", emit: tsv + + script: + """ + echo '${k_species}' > species.txt + echo '${k_genus}' > genus.txt + echo '${k_family}' > family.txt + echo '${k_class}' > class.txt + echo '${k_order}' > order.txt + echo '${k_phylum}' > phylum.txt + for i in ${taxon_levels} ; + do + merge_kaiju_results.py -f \$i'.txt' -o taxo_affi_reads_\$i'.tsv' + done + """ +} + +workflow KAIJU_AND_MERGE { + take: + cleaned_reads + database + + main: + KAIJU ( + cleaned_reads, + database + ) + + MERGE_KAIJU ( + KAIJU.out.k_species.collect(), + KAIJU.out.k_genus.collect(), + KAIJU.out.k_family.collect(), + KAIJU.out.k_class.collect(), + KAIJU.out.k_order.collect(), + KAIJU.out.k_phylum.collect() + ) + + emit: + report = KAIJU.out.k_all +} \ No newline at end of file diff --git a/modules/merge_quant_eggnog_best.nf b/modules/merge_quant_eggnog_best.nf new file mode 100644 index 0000000000000000000000000000000000000000..2ea2fa0207246588e6167ec459c85610d8dbdda0 --- /dev/null +++ b/modules/merge_quant_eggnog_best.nf @@ -0,0 +1,35 @@ +process MERGE_QUANT_ANNOT_BEST { + publishDir "${params.outdir}/06_func_annot/06_3_functional_annotation", mode: 'copy' + + input: + path quant_table + path annot + path best_hits + + output: + path "Quantifications_and_functional_annotations.tsv", emit: merged + + script: + """ + awk '{ + if(NR == 1) { + print \$0 "\t" "sum"} + else { + for (i=1; i<=NF; i++) + { + if (i == 1) + { + sum = O; + } + else { + sum = sum + \$i; + } + } + print \$0 "\t" sum + } + }' ${quant_table} > ${quant_table}.sum + ls ${annot} | cat > List_of_functionnal_annotations_files.txt + ls ${best_hits} | cat > List_of_diamond_files.txt + merge_abundance_and_functional_annotations.py -t ${quant_table}.sum -f List_of_functionnal_annotations_files.txt -d List_of_diamond_files.txt -o Quantifications_and_functional_annotations.tsv + """ +} \ No newline at end of file diff --git a/modules/metaquast.nf b/modules/metaquast.nf new file mode 100644 index 0000000000000000000000000000000000000000..59c2b27ff0afcc2d1a89da725d58c5f2f4e638bc --- /dev/null +++ b/modules/metaquast.nf @@ -0,0 +1,59 @@ +process ASSEMBLY_QUAST { + tag "${sampleId}" + label 'QUAST' + publishDir "${params.outdir}/02_assembly/quast_primary", mode: 'copy' + + input: + tuple val(sampleId), path(assembly) + + output: + path "${sampleId}/*", emit: all + path "${sampleId}/report.tsv", emit: report + + script: + """ + mkdir ${sampleId}/ + touch ${sampleId}/report.tsv + metaquast.py --threads ${task.cpus} --rna-finding --max-ref-number 0 --min-contig 0 ${assembly} -o ${sampleId} + """ +} + +process FILTERED_QUAST { + tag "${sampleId}" + label 'QUAST' + publishDir "${params.outdir}/03_filtering/quast_filtered", mode: 'copy' + + input: + tuple val(sampleId), path(assembly) + + output: + path "${sampleId}/*", emit: all + path "${sampleId}/report.tsv", emit: report + + script: + """ + mkdir ${sampleId}/ + touch ${sampleId}/report.tsv + metaquast.py --threads ${task.cpus} --rna-finding --max-ref-number 0 
--min-contig 0 ${assembly} -o ${sampleId} + """ +} + +process HIFI_QUAST { + tag "${sampleId}" + label 'QUAST' + publishDir "${params.outdir}/04_structural_annot/quast_hifi", mode: 'copy' + + input: + tuple val(sampleId), path(assembly) + + output: + path "${sampleId}/*", emit: all + path "${sampleId}/report.tsv", emit: report + + script: + """ + mkdir ${sampleId}/ + touch ${sampleId}/report.tsv + metaquast.py --threads ${task.cpus} --rna-finding --max-ref-number 0 --min-contig 0 ${assembly} -o ${sampleId} + """ +} \ No newline at end of file diff --git a/modules/multiqc.nf b/modules/multiqc.nf new file mode 100644 index 0000000000000000000000000000000000000000..d7334dade0469a10afaf09a7c99ee6c7e8da17cf --- /dev/null +++ b/modules/multiqc.nf @@ -0,0 +1,30 @@ +process MULTIQC { + publishDir "${params.outdir}/MultiQC", mode: 'copy' + + input: + path multiqc_config + path 'software_versions/*' + path cutadapt_report + path sickle_report + path before_filter_report + path after_filter_report + path fastqc_raw_report + path fastqc_clean_report + path kaiju_report + path dedup_report + path 'quast_primary/*/report.tsv' + path 'quast_filtered/*/report.tsv' + path hifi_fastqc_report + path 'quast_hifi/*/report.tsv' + path prokka_report + path quant_report + + output: + path "multiqc_report.html", emit: report + path "multiqc_data/*" + + script: + """ + multiqc . --config ${multiqc_config} -m custom_content -m fastqc -m cutadapt -m sickle -m kaiju -m quast -m prokka -m featureCounts -m samtools + """ +} \ No newline at end of file diff --git a/modules/prokka.nf b/modules/prokka.nf new file mode 100644 index 0000000000000000000000000000000000000000..44c3475e680ba5a604d6549f52c65693da3446ed --- /dev/null +++ b/modules/prokka.nf @@ -0,0 +1,44 @@ +process PROKKA { + tag "${sampleId}" + + input: + tuple val(sampleId), file(assembly_file) + + output: + tuple val(sampleId), path("PROKKA_${sampleId}"), emit: prokka_results + path "PROKKA_${sampleId}/${sampleId}.txt", emit: report + + script: + """ + prokka --metagenome --noanno --rawproduct --outdir PROKKA_${sampleId} --prefix ${sampleId} ${assembly_file} --centre X --compliant --cpus ${task.cpus} + rm PROKKA_${sampleId}/*.gbk + """ +} + +process RENAME_CONTIGS_AND_GENES { + tag "${sampleId}" + publishDir "${params.outdir}/04_structural_annot", mode: 'copy' + label 'PYTHON' + + input: + tuple val(sampleId), path(prokka_results) + + output: + tuple val(sampleId), path("${sampleId}.annotated.fna"), emit: fna + tuple val(sampleId), path("${sampleId}.annotated.ffn"), emit: ffn + tuple val(sampleId), path("${sampleId}.annotated.faa"), emit: faa + tuple val(sampleId), path("${sampleId}.annotated.gff"), emit: gff + tuple val(sampleId), path("${sampleId}_prot.len"), emit: prot_length + + script: + """ + grep "^gnl" ${prokka_results}/${sampleId}.gff > ${sampleId}_only_gnl.gff + + Rename_contigs_and_genes.py -f ${sampleId}_only_gnl.gff -faa ${prokka_results}/${sampleId}.faa \ + -ffn ${prokka_results}/${sampleId}.ffn -fna ${prokka_results}/${sampleId}.fna \ + -p ${sampleId} -oGFF ${sampleId}.annotated.gff -oFAA ${sampleId}.annotated.faa \ + -oFFN ${sampleId}.annotated.ffn -oFNA ${sampleId}.annotated.fna + + samtools faidx ${sampleId}.annotated.faa; cut -f 1,2 ${sampleId}.annotated.faa.fai > ${sampleId}_prot.len + """ +} diff --git a/modules/quantif_and_taxonomic_table_contigs.nf b/modules/quantif_and_taxonomic_table_contigs.nf new file mode 100644 index 0000000000000000000000000000000000000000..875df7a03236e2d2915edca546d271ab5aed9278 --- /dev/null +++ 
b/modules/quantif_and_taxonomic_table_contigs.nf @@ -0,0 +1,35 @@ +taxo_list = "all superkingdom phylum class order family genus species" + +process QUANTIF_AND_TAXONOMIC_TABLE_CONTIGS { + publishDir "${params.outdir}/07_taxo_affi", mode: 'copy' + label 'PYTHON' + + input: + path q_all + path q_superkingdom + path q_phylum + path q_order + path q_class + path q_family + path q_genus + path q_species + + output: + path "quantification_by_contig_lineage*.tsv", emit: quantif_by_contig_lineage + + script: + """ + echo "${q_all}" > all.txt + echo "${q_superkingdom}" > superkingdom.txt + echo "${q_phylum}" > phylum.txt + echo "${q_order}" > order.txt + echo "${q_class}" > class.txt + echo "${q_family}" > family.txt + echo "${q_genus}" > genus.txt + echo "${q_species}" > species.txt + for i in ${taxo_list} ; + do + quantification_by_contig_lineage.py -i \$i".txt" -o quantification_by_contig_lineage_\$i".tsv" + done + """ +} \ No newline at end of file diff --git a/modules/read_alignment.nf b/modules/read_alignment.nf new file mode 100644 index 0000000000000000000000000000000000000000..e8068bbb4addbee9c88555bc23f9c08581e5fa0a --- /dev/null +++ b/modules/read_alignment.nf @@ -0,0 +1,53 @@ +process BWA_MEM { + tag "${sampleId}" + publishDir "${params.outdir}/05_alignment/05_1_reads_alignment_on_contigs/${sampleId}", mode: 'copy' + + input: + tuple val(sampleId), path(fna), path(read1), path(read2), path(gff) + + output: + tuple val(sampleId), path("${sampleId}.sort.bam"), path("${sampleId}.sort.bam.bai"), emit: bam + tuple val(sampleId), path("${sampleId}_coverage.tsv"), emit: sam_coverage + path "${sampleId}*" + + + script: + """ + bwa index ${fna} -p ${fna} + bwa mem ${fna} ${read1} ${read2} | samtools view -bS - | samtools sort - -o ${sampleId}.sort.bam + samtools index ${sampleId}.sort.bam + + samtools flagstat -@ ${task.cpus} ${sampleId}.sort.bam > ${sampleId}.flagstat + samtools coverage ${sampleId}.sort.bam > ${sampleId}_coverage.tsv + + samtools idxstats ${sampleId}.sort.bam > ${sampleId}.sort.bam.idxstats + + # awk 'BEGIN {FS="\t"}; {print \$1 FS "0" FS \$2}' ${sampleId}.sort.bam.idxstats > ${sampleId}_contig.bed + """ +} + +process MINIMAP2 { + tag "${sampleId}" + publishDir "${params.outdir}/05_alignment/05_1_reads_alignment_on_contigs/$sampleId", mode: 'copy' + + input: + tuple val(sampleId), path(fna_prokka), path(reads) + + output: + tuple val(sampleId), path("${sampleId}.sort.bam"), path("${sampleId}.sort.bam.bai"), emit: bam + tuple val(sampleId), path("${sampleId}_coverage.tsv"), emit: sam_coverage + path "${sampleId}*" + + script: + """ + # align reads to contigs, keep only primary aln and sort resulting bam + minimap2 -t ${task.cpus} -ax asm20 $fna_prokka $reads | samtools view -@ ${task.cpus} -b -F 2304 | samtools sort -@ ${task.cpus} -o ${sampleId}.sort.bam + + samtools index ${sampleId}.sort.bam -@ ${task.cpus} + samtools flagstat -@ ${task.cpus} ${sampleId}.sort.bam > ${sampleId}.flagstat + samtools coverage ${sampleId}.sort.bam > ${sampleId}_coverage.tsv + + samtools idxstats ${sampleId}.sort.bam > ${sampleId}.sort.bam.idxstats + + """ +} \ No newline at end of file diff --git a/modules/reads_deduplication.nf b/modules/reads_deduplication.nf new file mode 100644 index 0000000000000000000000000000000000000000..72b8ff3938c1d3f3cd0d31612ba0bc2d6fbcd237 --- /dev/null +++ b/modules/reads_deduplication.nf @@ -0,0 +1,35 @@ +process READS_DEDUPLICATION { + tag "${sampleId}" + publishDir "${params.outdir}/02_assembly", mode: 'copy', pattern: '*.fastq.gz' + publishDir 
"${params.outdir}/02_assembly/logs", mode: 'copy', pattern: '*.idxstats' + publishDir "${params.outdir}/02_assembly/logs", mode: 'copy', pattern: '*.flagstat' + + input: + tuple val(sampleId), path(assembly), path(read1), path(read2) + + output: + tuple val(sampleId), path("${sampleId}_R1_dedup.fastq.gz"), path("${sampleId}_R2_dedup.fastq.gz"), emit: dedup + tuple val(sampleId), path("${sampleId}.count_reads_on_contigs.idxstats"), emit: idxstats + path "${sampleId}.count_reads_on_contigs.flagstat", emit: flagstat + + script: + """ + mkdir logs + bwa index ${assembly} -p ${assembly} + bwa mem ${assembly} ${read1} ${read2} | samtools view -bS - | samtools sort -n -o ${sampleId}.sort.bam - + samtools fixmate -m ${sampleId}.sort.bam ${sampleId}.fixmate.bam + samtools sort -o ${sampleId}.fixmate.positionsort.bam ${sampleId}.fixmate.bam + samtools markdup -r -S -s -f ${sampleId}.stats ${sampleId}.fixmate.positionsort.bam ${sampleId}.filtered.bam + samtools index ${sampleId}.filtered.bam + samtools idxstats ${sampleId}.filtered.bam > ${sampleId}.count_reads_on_contigs.idxstats + samtools flagstat ${sampleId}.filtered.bam > ${sampleId}.count_reads_on_contigs.flagstat + samtools sort -n -o ${sampleId}.filtered.sort.bam ${sampleId}.filtered.bam + bedtools bamtofastq -i ${sampleId}.filtered.sort.bam -fq ${sampleId}_R1_dedup.fastq -fq2 ${sampleId}_R2_dedup.fastq + gzip ${sampleId}_R1_dedup.fastq ; gzip ${sampleId}_R2_dedup.fastq + rm ${sampleId}.sort.bam + rm ${sampleId}.fixmate.bam + rm ${sampleId}.fixmate.positionsort.bam + rm ${sampleId}.filtered.bam + rm ${sampleId}.filtered.sort.bam + """ +} \ No newline at end of file diff --git a/modules/sickle.nf b/modules/sickle.nf index cf64da98b543f2ddffb18c6cd3a725d439f6e645..d4d010bbc59ee7b08bef37640ae5a0b2a1067f70 100644 --- a/modules/sickle.nf +++ b/modules/sickle.nf @@ -1,33 +1,31 @@ process SICKLE { - tag "$sampleId" + tag "${sampleId}" publishDir "${params.outdir}/01_clean_qc/01_1_cleaned_reads/", mode: 'copy', pattern: 'cleaned_*.fastq.gz' publishDir "${params.outdir}/01_clean_qc/01_1_cleaned_reads/logs", mode: 'copy', pattern: '*_sickle.log' - // when: (!params.skip_sickle) && ('01_clean_qc' in step || '02_assembly' in step || '03_filtering' in step || '04_structural_annot' in step || '05_alignment' in step || '06_func_annot' in step || '07_taxo_affi' in step || '08_binning' in step) && (!params.skip_01_clean_qc) - input: - tuple val(sampleId), path(read1), path(read2) + tuple val(sampleId), path(read1), path(read2), val(paired) output: - tuple val(sampleId), path("*${sampleId}*_R1.fastq.gz"), path("*${sampleId}*_R2.fastq.gz"), emit: reads - path "${sampleId}_single_sickle.fastq.gz", emit: single - path "${sampleId}_sickle.log", emit: logs + tuple val(sampleId), path("*${sampleId}*_R1.fastq.gz"), path("*${sampleId}*_R2.fastq.gz"), emit: reads + path "${sampleId}_single_sickle.fastq.gz", emit: single + path "${sampleId}_sickle.log", emit: report script: - mode = params.single_end ? 'se' : 'pe' + mode = paired ? 
'pe' : 'se' - if(params.skip_removal_host) { - // output are final cleaned files - options = "-o cleaned_${sampleId}_R1.fastq.gz -p cleaned_${sampleId}_R2.fastq.gz" - } - else { - //tempory files not saved in publish dir - options = "-o ${sampleId}_sickle_R1.fastq.gz -p ${sampleId}_sickle_R2.fastq.gz" - } - options += " -t " + params.quality_type - """ - sickle ${mode} -f ${cutadapt_reads_R1} -r ${cutadapt_reads_R2} $options \ - -s ${sampleId}_single_sickle.fastq.gz -g > ${sampleId}_sickle.log - """ + if (params.skip_host_filter) { + // output are final cleaned files + options = "-o cleaned_${sampleId}_R1.fastq.gz -p cleaned_${sampleId}_R2.fastq.gz" + } + else { + //tempory files not saved in publish dir + options = "-o ${sampleId}_sickle_R1.fastq.gz -p ${sampleId}_sickle_R2.fastq.gz" + } + options += " -t " + params.quality_type + """ + sickle ${mode} -f ${read1} -r ${read2} $options \ + -s ${sampleId}_single_sickle.fastq.gz -g > ${sampleId}_sickle.log + """ } \ No newline at end of file diff --git a/nextflow.config b/nextflow.config index e973192377f99d3f34b23fc767ba50f34bffe8dc..0d1940cc261880514a93995757ba7144c0be44c3 100644 --- a/nextflow.config +++ b/nextflow.config @@ -7,43 +7,62 @@ params { // metagWGS parameters. - reads = "*_{R1,R2}.fastq.gz" - step = "01_clean_qc" + reads = "" + assemblies = "" single_end = false adapter1 = "AGATCGGAAGAGC" adapter2 = "AGATCGGAAGAGC" quality_type = "sanger" - host_bwa_index = "" + host_index = "" host_fasta = "" - kaiju_verbose = false + assembly = "metaspades" metaspades_mem = 440 - percentage_identity = 0.95 min_contigs_cpm = 1 - assembly = "metaspades" - min_contig_size = 1500 - busco_reference = "https://busco-archive.ezlab.org/v3/datasets/bacteria_odb9.tar.gz" - cat_db = false + diamond_bank = "" + percentage_identity = 0.95 + type = "" - // Skip step or sub-step. - skip_01_clean_qc = false + // Stop after step or skip optional step/sub-step. + + // Optional step + stop_at_clean = false + skip_clean = false + + // Sub-steps of clean skip_sickle = false - skip_removal_host = false + skip_host_filter = false skip_kaiju = false - skip_busco = false + + // Step + stop_at_assembly = false + + // Optional step + stop_at_filtering = false + skip_filtering = false + + // Step + stop_at_structural_annot = false + + // Optional step + skip_func_annot = false + + // Optional step + skip_taxo_affi = false // Ressources. kaiju_db_dir = false - kaiju_db = "https://kaiju.binf.ku.dk/database/kaiju_db_refseq_2021-02-26.tgz" - diamond_bank = "" + kaiju_db_url = "https://kaiju.binf.ku.dk/database/kaiju_db_refseq_2021-02-26.tgz" + eggnog_mapper_db_dir = "" + eggnog_mapper_db_download = true + // Others parameters. + outdir = "results" + databases = "databases" accession2taxid = "ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz" taxdump = "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz" taxonomy_dir = false - eggnogmapper_db = false - eggnog_mapper_db_dir = false - multiqc_config = "$baseDir/assets/multiqc_config.yaml" + hifi_multiqc_config = "$baseDir/assets/hifi_multiqc_config.yaml" + sr_multiqc_config = "$baseDir/assets/sr_multiqc_config.yaml" - // Others parameters. - outdir = "results" help = false } @@ -74,11 +93,11 @@ profiles { singularity { includeConfig 'conf/singularity.config' } } -// Trace file. 
-trace { - enabled = true - file = 'pipeline_trace.txt' - fields = 'task_id,name,status,exit,realtime,%cpu,rss' +process { + container = '$SING_IMG_FOLDER/metagwgs.sif' + withLabel: EGGNOG { + container = '$SING_IMG_FOLDER/eggnog_mapper.sif' + } } // Manifest. diff --git a/subworkflows/00_databases.nf b/subworkflows/00_databases.nf new file mode 100644 index 0000000000000000000000000000000000000000..b4dffa48ccdc957abe7706b521703dbbac2b917a --- /dev/null +++ b/subworkflows/00_databases.nf @@ -0,0 +1,160 @@ +workflow DATABASES { + take: + skip_clean + + main: + + ch_host_fasta = Channel.empty() + ch_host_index = Channel.empty() + if ( !skip_clean && !params.skip_host_filter ) { + println("Creating host db") + ch_host_fasta = Channel.value(file(params.host_fasta)) + if ( !params.host_index ) { + INDEX_HOST(ch_host_fasta) + ch_host_index = INDEX_HOST.out.index + } + else { + ch_host_index = Channel.value(file(params.host_index)) + } + } + + ch_kaiju_db = Channel.empty() + if ( !skip_clean && !params.skip_kaiju ) { //kaiju_db + println("Creating kaiju db") + if ( !params.kaiju_db_dir && params.kaiju_db_url ) { + INDEX_KAIJU(params.kaiju_db_url) + ch_kaiju_db = INDEX_KAIJU.out.kaiju_db + } else if (params.kaiju_db_dir) { + if (file(params.kaiju_db_dir + "/kaiju_db*.fmi").size == 1) { + ch_kaiju_db = Channel.value([file(params.kaiju_db_dir + "/nodes.dmp"), file(params.kaiju_db_dir + "/kaiju_db*.fmi"), file(params.kaiju_db_dir + "/names.dmp")]) + } else { + exit 1, "There is more than one file ending with .fmi in ${params.kaiju_db_dir}" + } + } else { + exit 1, "You must specify --kaiju_db_url or --kaiju_db_dir" + } + } + + ch_eggnog = Channel.empty() + if ( !params.stop_at_clean && !params.stop_at_filtering && !params.stop_at_assembly && !params.stop_at_structural_annot && !params.skip_func_annot ) { //eggnog_mapper_db + println("Creating eggnog db") + if( params.eggnog_mapper_db_dir != "" ) { + ch_eggnog = Channel.fromPath(params.eggnog_mapper_db_dir, checkIfExists: true).first() + } + else if ( params.eggnog_mapper_db_download ) { + // Built eggNOG-mapper database. 
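+            // The download only needs to happen once: a directory already populated by
+            // download_eggnog_data.py (the same script EGGNOG_MAPPER_DB runs) can be
+            // reused on later runs by passing it with --eggnog_mapper_db_dir, which
+            // takes the branch above instead of re-downloading.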
+ EGGNOG_MAPPER_DB() + ch_eggnog = EGGNOG_MAPPER_DB.out.functional_annot_db + } else { + exit 1, "You must specify --eggnog_mapper_db_download or --eggnog_mapper_db_dir" + } + } + + ch_taxonomy = Channel.empty() + if ( !params.stop_at_clean && !params.stop_at_filtering && !params.stop_at_assembly && !params.stop_at_structural_annot && !params.skip_taxo_affi ) { + println("Creating taxonomy db") + if( !params.taxonomy_dir ) { + ch_accession2taxid = Channel.value(params.accession2taxid) + ch_taxdump = Channel.value(params.taxdump) + + DOWNLOAD_TAXONOMY_DB(ch_accession2taxid, ch_taxdump) + ch_taxonomy = DOWNLOAD_TAXONOMY_DB.out.taxonomy + } + else if( params.taxonomy_dir ) { + ch_accession2taxid = Channel + .fromPath(params.taxonomy_dir + '/prot.accession2taxid', checkIfExists: true) + ch_taxdump = Channel + .fromPath(params.taxonomy_dir + '/taxdump', checkIfExists: true) + ch_taxonomy = ch_accession2taxid.combine(ch_taxdump) + } + else { + exit 1, "You must specify [--accession2taxid and --taxdump] or --taxonomy_dir" + } + } + + emit: + host_fasta = ch_host_fasta + host_index = ch_host_index + kaiju_db = ch_kaiju_db + eggnog = ch_eggnog + taxonomy = ch_taxonomy.first() +} + +process INDEX_HOST { + publishDir "${params.databases}/index_host" + + input: + path fasta + + output: + path "${fasta}.*", emit: index + + script: + """ + bwa index -a bwtsw $fasta + """ +} + +process INDEX_KAIJU { + publishDir "${params.databases}/kaiju_db" + + input: + val database + + output: + tuple path("nodes.dmp"), path("*.fmi"), path("names.dmp"), emit: kaiju_db + + script: + """ + wget ${database} + file='${database}' + fileNameDatabase=\${file##*/} + echo \$fileNameDatabase + tar -zxvf \$fileNameDatabase + """ +} + +process EGGNOG_MAPPER_DB { + publishDir "${params.databases}/eggnog_db" + label 'EGGNOG' + + output: + path "db_eggnog_mapper" , emit: functional_annot_db + + script: + """ + mkdir db_eggnog_mapper + /eggnog-mapper-2.0.4-rf1/download_eggnog_data.py -f -y --data_dir db_eggnog_mapper + """ +} + +process DOWNLOAD_TAXONOMY_DB { + publishDir "${params.databases}/taxonomy_db" + + input: + val accession2taxid + val taxdump + + output: + tuple file("*taxid*"), file("*taxdump*"), emit: taxonomy + + script: + """ + wget ${accession2taxid} + file='${accession2taxid}' + fileName=\${file##*/} + echo \$fileName + gunzip \$fileName + + wget ${taxdump} + file_taxdump='${taxdump}' + fileName_taxdump=\${file_taxdump##*/} + + echo \$fileName_taxdump + + mkdir taxdump + mv \$fileName_taxdump taxdump + cd taxdump + tar xzvf \$fileName_taxdump + """ +} diff --git a/subworkflows/01_clean_qc.nf b/subworkflows/01_clean_qc.nf new file mode 100644 index 0000000000000000000000000000000000000000..31c92c595a0901652e67e7199e7dbfbf68363e3a --- /dev/null +++ b/subworkflows/01_clean_qc.nf @@ -0,0 +1,87 @@ +include { CUTADAPT } from '../modules/cutadapt' +include { SICKLE } from '../modules/sickle' +include { HOST_FILTER } from '../modules/host_filter' +include { FASTQC_RAW; FASTQC_CLEANED } from '../modules/fastqc' +include { KAIJU_AND_MERGE } from '../modules/kaiju' + +workflow STEP_01_CLEAN_QC { + take: + raw_reads + paired + host_fasta + host_index + kaiju_db + + main: + ch_adapter1 = Channel.value(params.adapter1) + ch_adapter2 = Channel.value(params.adapter2) + + CUTADAPT ( + raw_reads, + ch_adapter1, + ch_adapter2 + ) + ch_intermediate_reads = CUTADAPT.out.reads + ch_cutadapt_report = CUTADAPT.out.report + + if (!params.skip_sickle) { + ch_sickle_reads = ch_intermediate_reads.join(paired) + // ch_sickle_reads.view{ it -> 
"${it}" } + SICKLE ( + ch_sickle_reads + ) + ch_intermediate_reads = SICKLE.out.reads + ch_sickle_report = SICKLE.out.report + } + + else { + ch_sickle_report = Channel.empty() + } + + if (!params.skip_host_filter) { + HOST_FILTER ( + ch_intermediate_reads, + host_fasta, + host_index + ) + ch_preprocessed_reads = HOST_FILTER.out.reads + ch_before_filter_report = HOST_FILTER.out.nf_report + ch_after_filter_report = HOST_FILTER.out.hf_report + } + + else { + ch_preprocessed_reads = ch_intermediate_reads + ch_before_filter_report = Channel.empty() + ch_after_filter_report = Channel.empty() + } + + FASTQC_RAW(raw_reads) + FASTQC_CLEANED(ch_preprocessed_reads) + ch_fastqc_raw_report = FASTQC_RAW.out.zip + ch_fastqc_clean_report = FASTQC_CLEANED.out.zip + + if (!params.skip_kaiju) { + + KAIJU_AND_MERGE( + ch_preprocessed_reads, + kaiju_db + ) + ch_kaiju_report = KAIJU_AND_MERGE.out.report + + } + + else { + ch_kaiju_report = Channel.empty() + } + + emit: + preprocessed_reads = ch_preprocessed_reads + + cutadapt_report = ch_cutadapt_report + sickle_report = ch_sickle_report + before_filter_report = ch_before_filter_report + after_filter_report = ch_after_filter_report + fastqc_raw_report = ch_fastqc_raw_report + fastqc_clean_report = ch_fastqc_clean_report + kaiju_report = ch_kaiju_report +} \ No newline at end of file diff --git a/subworkflows/02_assembly.nf b/subworkflows/02_assembly.nf new file mode 100644 index 0000000000000000000000000000000000000000..10f7eb91d41693593e27364432f2e1175b7b8168 --- /dev/null +++ b/subworkflows/02_assembly.nf @@ -0,0 +1,35 @@ +include { ASSEMBLY } from '../modules/assembly' +include { ASSEMBLY_QUAST } from '../modules/metaquast' +include { READS_DEDUPLICATION } from '../modules/reads_deduplication' + +workflow STEP_02_ASSEMBLY { + take: preprocessed_reads + + main: + + ch_metaspades_mem = Channel.value(params.metaspades_mem) + ASSEMBLY ( + preprocessed_reads, + ch_metaspades_mem + ) + ch_assembly = ASSEMBLY.out.assembly + + // ch_filtered = Channel.value(false) + ASSEMBLY_QUAST ( ch_assembly ) + + ch_assembly_report = ASSEMBLY_QUAST.out.report + + ch_assembly_and_preprocessed = ch_assembly.join(preprocessed_reads, remainder: true) + + READS_DEDUPLICATION ( ch_assembly_and_preprocessed ) + ch_dedup = READS_DEDUPLICATION.out.dedup + ch_idxstats = READS_DEDUPLICATION.out.idxstats + ch_flagstat = READS_DEDUPLICATION.out.flagstat + + emit: + assembly = ch_assembly + dedup = ch_dedup + idxstats = ch_idxstats + flagstat = ch_flagstat + assembly_report = ch_assembly_report +} \ No newline at end of file diff --git a/subworkflows/03_filtering.nf b/subworkflows/03_filtering.nf new file mode 100644 index 0000000000000000000000000000000000000000..5d04839cf449fb6611ead0a583326a48e4cfcd39 --- /dev/null +++ b/subworkflows/03_filtering.nf @@ -0,0 +1,77 @@ +process CHUNK_ASSEMBLY_FILTER { + label 'ASSEMBLY_FILTER' + + input: + tuple val(sampleId), path(assembly_file), path(idxstats) + val min_cpm + + output: + tuple val(sampleId), path("${chunk_name}_select_cpm${min_cpm}.fasta"), emit: chunk_selected + tuple val(sampleId), path("${chunk_name}_discard_cpm${min_cpm}.fasta"), emit: chunk_discarded + + script: + chunk_name = assembly_file.baseName + """ + Filter_contig_per_cpm.py -i ${idxstats} -f ${assembly_file} -c ${min_cpm} -s ${chunk_name}_select_cpm${min_cpm}.fasta -d ${chunk_name}_discard_cpm${min_cpm}.fasta + """ +} + +process MERGE_ASSEMBLY_FILTER { + label 'ASSEMBLY_FILTER' + + tag "${sampleId}" + publishDir "${params.outdir}/03_filtering/", mode: 'copy' + + input: + tuple 
val(sampleId), path(select_fasta) + tuple val(sampleId), path(discard_fasta) + val min_cpm + + output: + tuple val(sampleId), path("${sampleId}_select_contigs_cpm${min_cpm}.fasta"), emit: merged_selected + tuple val(sampleId), path("${sampleId}_discard_contigs_cpm${min_cpm}.fasta"), emit: merged_discarded + + shell: + ''' + echo !{select_fasta} | sed "s/ /\\n/g" | sort > select_list + echo !{discard_fasta} | sed "s/ /\\n/g" | sort > discard_list + + for i in `cat select_list` ; do cat $i >> !{sampleId}_select_contigs_cpm!{min_cpm}.fasta ; done + for j in `cat discard_list` ; do cat $j >> !{sampleId}_discard_contigs_cpm!{min_cpm}.fasta ; done + + rm select_list + rm discard_list + ''' +} + +workflow ASSEMBLY_FILTER { + take: + assembly_and_idxstats + min_cpm + + main: + CHUNK_ASSEMBLY_FILTER ( + assembly_and_idxstats, + min_cpm + ) + ch_chunk_selected = CHUNK_ASSEMBLY_FILTER.out.chunk_selected + ch_chunk_discarded = CHUNK_ASSEMBLY_FILTER.out.chunk_discarded + + ch_chunk_selected + .groupTuple(by: 0) + .set{ ch_grouped_selected } + + ch_chunk_discarded + .groupTuple(by: 0) + .set{ ch_grouped_discarded } + + MERGE_ASSEMBLY_FILTER ( + ch_grouped_selected, + ch_grouped_discarded, + min_cpm + ) + ch_merged_selected = MERGE_ASSEMBLY_FILTER.out.merged_selected + + emit: + selected = ch_merged_selected +} \ No newline at end of file diff --git a/subworkflows/04_structural_annot.nf b/subworkflows/04_structural_annot.nf new file mode 100644 index 0000000000000000000000000000000000000000..8f77f0bee114d4272f664b16f48687b10529b7de --- /dev/null +++ b/subworkflows/04_structural_annot.nf @@ -0,0 +1,17 @@ +include { PROKKA; RENAME_CONTIGS_AND_GENES } from '../modules/prokka' + +workflow STEP_04_STRUCTURAL_ANNOT { + take: assembly + + main: + PROKKA( assembly ) + RENAME_CONTIGS_AND_GENES(PROKKA.out.prokka_results) + + emit: + report = PROKKA.out.report + fna = RENAME_CONTIGS_AND_GENES.out.fna + ffn = RENAME_CONTIGS_AND_GENES.out.ffn + gff = RENAME_CONTIGS_AND_GENES.out.gff + faa = RENAME_CONTIGS_AND_GENES.out.faa + prot_length = RENAME_CONTIGS_AND_GENES.out.prot_length +} \ No newline at end of file diff --git a/subworkflows/05_alignment.nf b/subworkflows/05_alignment.nf new file mode 100644 index 0000000000000000000000000000000000000000..4bd23e10797cebc4bf02a94d65b2589d9d262553 --- /dev/null +++ b/subworkflows/05_alignment.nf @@ -0,0 +1,32 @@ +include { MINIMAP2; BWA_MEM } from '../modules/read_alignment' +include { DIAMOND } from '../modules/diamond' + +workflow STEP_05_ALIGNMENT { + take: + contigs_and_reads + prokka_faa + + main: + if (params.type == 'SR') { + BWA_MEM(contigs_and_reads) + //ch_depth_on_contigs = BWA_MEM.out.bed.join(BWA_MEM.out.bam) + ch_bam = BWA_MEM.out.bam + ch_sam_coverage = BWA_MEM.out.sam_coverage + } else { + MINIMAP2(contigs_and_reads) + //ch_depth_on_contigs = MINIMAP2.out.bed.join(MINIMAP2.out.bam) + ch_bam = MINIMAP2.out.bam + ch_sam_coverage = MINIMAP2.out.sam_coverage + } + + DIAMOND ( + prokka_faa, + params.diamond_bank + ) + ch_m8 = DIAMOND.out.m8 + + emit: + bam = ch_bam + m8 = ch_m8 + sam_coverage = ch_sam_coverage + } diff --git a/subworkflows/06_functionnal_annot.nf b/subworkflows/06_functionnal_annot.nf new file mode 100644 index 0000000000000000000000000000000000000000..4d205e2abc0f2d2bb25e10ff86740c847a582460 --- /dev/null +++ b/subworkflows/06_functionnal_annot.nf @@ -0,0 +1,45 @@ +include { CD_HIT } from '../modules/cd_hit' +include { QUANTIFICATION } from '../modules/feature_counts' +include { EGGNOG_MAPPER } from '../modules/eggnog_mapper' +include { BEST_HITS 
} from '../modules/best_hits' +include { MERGE_QUANT_ANNOT_BEST } from '../modules/merge_quant_eggnog_best' +include { FUNCTIONAL_ANNOT_TABLE } from '../modules/functional_annot_table' + +// cd_hit + quantification + quantification_table + eggnog_mapper_db + eggnog_mapper +// + best_hit_diamond + merge_quantif and functionnal annot + make_functionnal_annotation_tables + +workflow STEP_06_FUNC_ANNOT { + take: + ffn // channel: [ val(sampleid), path(ffn) ] + faa // channel: [ val(sampleid), path(faa) ] + gff // channel: [ val(sampleid), path(gff) ] + bam // channel: [ val(sampleid), path(bam), path(bam_index) ] + m8 // channel: [ val(sampleId), path(diamond_file) ] + eggnog_db + + main: + CD_HIT ( ffn, params.percentage_identity ) + ch_individual_clstr_table = CD_HIT.out.individual_clstr_table + ch_global_clstr_table = CD_HIT.out.global_clstr_table + + QUANTIFICATION ( gff, bam, ch_individual_clstr_table, ch_global_clstr_table) + ch_quant_table = QUANTIFICATION.out.quantification_table + ch_quant_report = QUANTIFICATION.out.quant_report + + EGGNOG_MAPPER ( faa, eggnog_db ) + ch_annot = EGGNOG_MAPPER.out.annot.collect() + ch_v_eggnogmapper = EGGNOG_MAPPER.out.version + + BEST_HITS ( m8 ) + ch_best_hits = BEST_HITS.out.best_hits.collect() + + MERGE_QUANT_ANNOT_BEST ( ch_quant_table, ch_annot, ch_best_hits ) + ch_merged_quant_annot_best = MERGE_QUANT_ANNOT_BEST.out.merged + + FUNCTIONAL_ANNOT_TABLE ( ch_merged_quant_annot_best ) + + emit: + functional_annot = FUNCTIONAL_ANNOT_TABLE.out.functional_annot + quant_report = ch_quant_report + v_eggnogmapper = ch_v_eggnogmapper +} diff --git a/subworkflows/07_taxonomic_affi.nf b/subworkflows/07_taxonomic_affi.nf new file mode 100644 index 0000000000000000000000000000000000000000..cbcc5347ab78bbbd34c7b2f7804c449adc9b2f47 --- /dev/null +++ b/subworkflows/07_taxonomic_affi.nf @@ -0,0 +1,29 @@ +include { ASSIGN_TAXONOMY } from '../modules/assign_taxonomy' +include { QUANTIF_AND_TAXONOMIC_TABLE_CONTIGS } from '../modules/quantif_and_taxonomic_table_contigs' + +workflow STEP_07_TAXO_AFFI { + take: + taxonomy + diamond_result // channel: [ val(sampleId), path(diamond_file) ] + sam_coverage // channel: [ val(sampleId), path(samtools coverage) ] + prot_length // channel: [ val(sampleId), path(prot_length) ] + main: + ch_assign_taxo_input = diamond_result.join(sam_coverage, remainder: true) + .join(prot_length, remainder: true) + + ASSIGN_TAXONOMY ( taxonomy, ch_assign_taxo_input ) + + QUANTIF_AND_TAXONOMIC_TABLE_CONTIGS ( + ASSIGN_TAXONOMY.out.q_all.collect(), + ASSIGN_TAXONOMY.out.q_superkingdom.collect(), + ASSIGN_TAXONOMY.out.q_phylum.collect(), + ASSIGN_TAXONOMY.out.q_order.collect(), + ASSIGN_TAXONOMY.out.q_class.collect(), + ASSIGN_TAXONOMY.out.q_family.collect(), + ASSIGN_TAXONOMY.out.q_genus.collect(), + ASSIGN_TAXONOMY.out.q_species.collect() + ) + + emit: + quantif_by_contig_lineage = QUANTIF_AND_TAXONOMIC_TABLE_CONTIGS.out.quantif_by_contig_lineage +} diff --git a/subworkflows/common.nf b/subworkflows/common.nf deleted file mode 100644 index 5f0e7c417ce52c23c254748067e94169b9772100..0000000000000000000000000000000000000000 --- a/subworkflows/common.nf +++ /dev/null @@ -1,22 +0,0 @@ -include { CUTADAPT } from '../modules/cutadapt' -include { SICKLE } from '../modules/sickle' - -ch_adapter1 = Channel.value(params.adapter1) -ch_adapter2 = Channel.value(params.adapter2) - -workflow COMMON { - take: - ch_reads - - main: - - CUTADAPT ( - ch_reads, - ch_adapter1, - ch_adapter2 - ) - SICKLE ( CUTADAPT.out.reads ) - - emit: - SICKLE.out.reads -} diff --git 
a/subworkflows/long_reads.nf b/subworkflows/long_reads.nf deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/subworkflows/shared.nf b/subworkflows/shared.nf new file mode 100644 index 0000000000000000000000000000000000000000..92ca8f10549af38e2c0192bcc7daec879150a6ae --- /dev/null +++ b/subworkflows/shared.nf @@ -0,0 +1,67 @@ +include { STEP_04_STRUCTURAL_ANNOT as S04_STRUCTURAL_ANNOT } from './04_structural_annot' +include { STEP_05_ALIGNMENT as S05_ALIGNMENT } from './05_alignment' +include { STEP_06_FUNC_ANNOT as S06_FUNC_ANNOT } from './06_functionnal_annot' +include { STEP_07_TAXO_AFFI as S07_TAXO_AFFI } from './07_taxonomic_affi' + +workflow SHARED { + take: + reads + assembly + eggnog_db + taxonomy + + main: + + ch_contigs_and_reads = Channel.empty() + ch_prokka_ffn = Channel.empty() + ch_prokka_faa = Channel.empty() + ch_prokka_gff = Channel.empty() + ch_prokka_fna = Channel.empty() + ch_prokka_report = Channel.empty() + ch_prot_length = Channel.empty() + + if ( !params.stop_at_clean && !params.stop_at_assembly && !params.stop_at_filtering ) { + println("S04_STRUCTURAL_ANNOT") + S04_STRUCTURAL_ANNOT ( assembly ) + ch_prokka_ffn = S04_STRUCTURAL_ANNOT.out.ffn + ch_prokka_faa = S04_STRUCTURAL_ANNOT.out.faa + ch_prokka_gff = S04_STRUCTURAL_ANNOT.out.gff + ch_prokka_fna = S04_STRUCTURAL_ANNOT.out.fna + ch_prokka_report = S04_STRUCTURAL_ANNOT.out.report + + ch_contigs_and_reads = ch_prokka_fna + .join(reads, remainder: true) + .join(ch_prokka_gff, remainder: true) + ch_prot_length = S04_STRUCTURAL_ANNOT.out.prot_length + } + + ch_bam = Channel.empty() + ch_m8 = Channel.empty() + ch_sam_coverage = Channel.empty() + if ( !params.stop_at_clean && !params.stop_at_assembly && !params.stop_at_filtering && !params.stop_at_structural_annot ) { + println("S05_ALIGNMENT") + S05_ALIGNMENT ( ch_contigs_and_reads, ch_prokka_faa ) + ch_bam = S05_ALIGNMENT.out.bam + ch_m8 = S05_ALIGNMENT.out.m8 + ch_sam_coverage = S05_ALIGNMENT.out.sam_coverage + } + + ch_quant_report = Channel.empty() + ch_v_eggnogmapper = Channel.empty() + if ( !params.stop_at_clean && !params.stop_at_assembly && !params.stop_at_filtering && !params.stop_at_structural_annot && !params.skip_func_annot ) { + println("S06_FUNC_ANNOT") + S06_FUNC_ANNOT ( ch_prokka_ffn, ch_prokka_faa, ch_prokka_gff, ch_bam, ch_m8, eggnog_db ) + ch_quant_report = S06_FUNC_ANNOT.out.quant_report + ch_v_eggnogmapper = S06_FUNC_ANNOT.out.v_eggnogmapper + } + + if ( !params.stop_at_clean && !params.stop_at_assembly && !params.stop_at_filtering && !params.stop_at_structural_annot && !params.skip_taxo_affi ) { + println("S07_TAXO_AFFI") + S07_TAXO_AFFI ( taxonomy, ch_m8, ch_sam_coverage, ch_prot_length) + } + + emit: + prokka_report = ch_prokka_report + quant_report = ch_quant_report + v_eggnogmapper = ch_v_eggnogmapper +} diff --git a/subworkflows/short_reads.nf b/subworkflows/short_reads.nf index 1b6d117048f583577a8b8872148caf62a52e6ca8..78486d73cbf8025561e2d5b72b5212cdaae0c8b8 100644 --- a/subworkflows/short_reads.nf +++ b/subworkflows/short_reads.nf @@ -1,11 +1,98 @@ -#!/usr/bin/env nextflow -/* -======================================================================================== - metagWGS -======================================================================================== - metagWGS Analysis Pipeline. 
- #### Homepage / Documentation - https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/ ----------------------------------------------------------------------------------------- -*/ +include { STEP_01_CLEAN_QC as S01_CLEAN_QC } from './01_clean_qc' +include { STEP_02_ASSEMBLY as S02_ASSEMBLY } from './02_assembly' +include { ASSEMBLY_FILTER as S03_FILTERING } from './03_filtering' +include { FILTERED_QUAST as S04_FILTERED_QUAST } from '../modules/metaquast' + +workflow SHORT_READS { + take: + reads + paired + host_fasta + host_index + kaiju_db + + main: + + ch_preprocessed_reads = reads + ch_cutadapt_report = Channel.empty() + ch_sickle_report = Channel.empty() + ch_before_filter_report = Channel.empty() + ch_after_filter_report = Channel.empty() + ch_fastqc_raw_report = Channel.empty() + ch_fastqc_clean_report = Channel.empty() + ch_kaiju_report = Channel.empty() + ch_idxstats = Channel.empty() + ch_dedup_report = Channel.empty() + ch_assembly_report = Channel.empty() + ch_filtered_report = Channel.empty() + + if ( !params.skip_clean ) { + println("S01_CLEAN_QC") + S01_CLEAN_QC ( + reads, + paired, + host_fasta, + host_index, + kaiju_db + ) + ch_preprocessed_reads = S01_CLEAN_QC.out.preprocessed_reads + ch_cutadapt_report = S01_CLEAN_QC.out.cutadapt_report + ch_sickle_report = S01_CLEAN_QC.out.sickle_report + ch_before_filter_report = S01_CLEAN_QC.out.before_filter_report + ch_after_filter_report = S01_CLEAN_QC.out.after_filter_report + ch_fastqc_raw_report = S01_CLEAN_QC.out.fastqc_raw_report + ch_fastqc_clean_report = S01_CLEAN_QC.out.fastqc_clean_report + ch_kaiju_report = S01_CLEAN_QC.out.kaiju_report + } + + ch_assembly = Channel.empty() + ch_dedup = Channel.empty() + + if ( !params.stop_at_clean ) { + println("S02_ASSEMBLY") + S02_ASSEMBLY ( ch_preprocessed_reads ) + ch_assembly = S02_ASSEMBLY.out.assembly + ch_dedup = S02_ASSEMBLY.out.dedup + ch_idxstats = S02_ASSEMBLY.out.idxstats + ch_dedup_report = S02_ASSEMBLY.out.flagstat + ch_assembly_report = S02_ASSEMBLY.out.assembly_report + } + + if ( !params.stop_at_clean && !params.stop_at_assembly && !params.skip_filtering ) { + println("S03_FILTERING") + ch_min_contigs_cpm = Channel.value(params.min_contigs_cpm) + + ch_assembly + .splitFasta(by: 100000, file: true) + .set{ch_chunk_assembly_for_filter} + + ch_chunk_assembly_for_filter + .combine(ch_idxstats, by:0) + .set{ch_assembly_and_idxstats} + + S03_FILTERING ( + ch_assembly_and_idxstats, + ch_min_contigs_cpm + ) + ch_assembly = S03_FILTERING.out.selected + + S04_FILTERED_QUAST( ch_assembly ) + ch_filtered_report = S04_FILTERED_QUAST.out.report + } + + emit: + assembly = ch_assembly + dedup = ch_dedup + + cutadapt_report = ch_cutadapt_report + sickle_report = ch_sickle_report + before_filter_report = ch_before_filter_report + after_filter_report = ch_after_filter_report + fastqc_raw_report = ch_fastqc_raw_report + fastqc_clean_report = ch_fastqc_clean_report + kaiju_report = ch_kaiju_report + dedup_report = ch_dedup_report + assembly_report = ch_assembly_report + filtered_report = ch_filtered_report +}
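The diff above introduces the DATABASES, SHORT_READS and SHARED subworkflows but does not show the entry workflow that chains them. The sketch below is a minimal, hypothetical wiring of those pieces, assuming a `main.nf` at the repository root where `--reads` is a paired-end glob; the HiFi branch, MultiQC and software-version reporting are omitted, and only the channel shapes follow the `take:`/`emit:` blocks defined in this diff.

// Hypothetical main.nf sketch (not part of this diff): chaining the new subworkflows.
include { DATABASES } from './subworkflows/00_databases'
include { SHORT_READS } from './subworkflows/short_reads'
include { SHARED } from './subworkflows/shared'

workflow {
    // [ sampleId, R1, R2 ] tuples built from the --reads glob (assumed layout)
    ch_reads = Channel.fromFilePairs(params.reads, flat: true)

    // [ sampleId, true ] companion channel consumed by SICKLE to select 'pe' mode
    ch_paired = ch_reads.map { sampleId, r1, r2 -> [ sampleId, true ] }

    // Reference databases are only built for the steps that will actually run
    DATABASES ( params.skip_clean )

    // Cleaning, assembly and contig filtering of short reads
    SHORT_READS (
        ch_reads,
        ch_paired,
        DATABASES.out.host_fasta,
        DATABASES.out.host_index,
        DATABASES.out.kaiju_db
    )

    // Structural annotation, alignment, functional and taxonomic annotation
    SHARED (
        SHORT_READS.out.dedup,
        SHORT_READS.out.assembly,
        DATABASES.out.eggnog,
        DATABASES.out.taxonomy
    )
}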