diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml index 90573b2369100335ba19b948ef07e3476c4bb2cb..07dcd383757f19e3d37add7b8bfd8247f515520c 100644 --- a/.gitlab-ci.yml +++ b/.gitlab-ci.yml @@ -1,9 +1,12 @@ -#image: python:3.7-alpine -image: registry.forgemia.inra.fr/umr-gdec/magatt +image: python:3.7-alpine +#image: registry.forgemia.inra.fr/umr-gdec/magatt # global before script, set venv in all job before_script: - - source /opt/venv/cicd/bin/activate + - pip install -U sphinx + - pip install -U sphinx-rtd-theme + - pip install -U myst-parser + test: stage: test diff --git a/README.md b/README.md index b1100a291dbfb7378a075c0d84d56cd0eb86f4f3..218dc1899039a705888abafad006fc8ea01bebe9 100644 --- a/README.md +++ b/README.md @@ -3,215 +3,3 @@ Marker Assisted Gene Annotation Transfert for Triticeae. Snakemake pipeline used to transfert GFF annotation on a new assembly with a fine target mapping approach. Documentation : [https://umr-gdec.pages.mia.inra.fr/magatt/](https://umr-gdec.pages.mia.inra.fr/magatt/) - -## Install the pipeline - -```console -$ git clone https://forgemia.inra.fr/umr-gdec/magatt.git -``` - -## Dependancies - -### Build magatt environment with conda - -We recommend to build the environment using conda (developped with miniconda 3, conda 4.9.2 ) with the file [environment.yml](environment.yml): - -```console -$ conda env create -f environment.yml -n magatt -``` - -Once created, you can activate the environment with: - -```console -$ conda activate magatt -``` - -All the dependancies installed in the conda env are listed below. - -* Snakemake : 5.5.2 -* Python: 3.5 -* Biopython: 1.68 -* numpy: 1.15 -* pandas: 0.23 -* pysam: 0.15 -* Bedtools: 2.27 -* Blat: 36 -* Exonerate (fastavalidcds): 2.4.0 -* GenomeTools: 1.5.9 -* gffread: 0.9.11 -* GMAP: 2018-05-11 -* NCBI-blast (BLAST+): 2.6 -* Samtools: 1.9 -* BWA: 0.7 - -## Prepare and run the pipeline - -### Creating the configuration file: inputs and other parameters - -The configuration file [config.yaml](config.yaml) will contain all the input files required for the pipeline and some other parameters. - -* Prepare the query genome data - -| Parameter in `config.yaml` | format |description |Example | -|----------------------------|--------|------------|--------| -|**annotationQuery** | GFF3 | The gff file of the annotation we want to transfert onto the new Target genome| annotationQuery: "/path/to/IWGSC_RefSeqv1_annotation.gff3"| -|**featureType** | \[STRING\] | The feature we want to use to anchore the annotation. Here we use the gene feature of the GFF.| featureType: 'gene| -|**queryFasta** | FASTA | Fasta file of the genome reference sequence. Must match the GFF file used in `annotationQuery` parameter| queryFasta: '/path/to/IWGSC_RefSeqv1_annotation.fasta'| -|**blastdb** | \[blast database\]|blast db of all mRNAs (all isoforms) of the query annotation. This will be used to rescue genes which have failed in the transfert|blastdb: 'data/IWGSCv1.1_all_mrna'| -|**chromosomes**|python list|list of all the chromosomes in the query reference genome. This will be used to split all the data per chromosome and speed up the analysis|chromosomes: ['1A', '2A', '3A', '4A', '5A', '6A', '7A', '1B', '2B', '3B', '4B', '5B', '6B', '7B', '1D', '2D', '3D', '4D', '5D', '6D', '7D', 'U']| - -* Prepare the markers/ISBPs input data - -The pipeline uses markers/ISBPs as anchores to target a restricted part of the target genome on which our genes are suspected to be located. 
- -| Parameter in `config.yaml` | format |description |Example | -|----------------------------|--------|------------|--------| -|**isbpBed**| BED|Initial coordinates of the markers/ISBPs on the query genome|isbpBed: 'data/ISBP_refseqv1.bed'| -|**mapq**| \[INT\]|Minimum mapping quality of the marker/ISBP to be kept in the anchoring process|mapq: 30| -|**mismatches**| \[INT\]|Max missmatches allowed for a marker to be kept in the analysis|mismatches: 0| - -* Prepare the target genome data - -Here, we only need the fasta of the new genome assembly - -| Parameter in `config.yaml` | format |description |Example | -|----------------------------|--------|------------|--------| -|**targetFasta**|FASTA|fasta file of the target genome assembly on which we transfert the annotation|targetFasta: 'data/CS_pesudo_v2.1.fa'| -|**targetGmapIndex**|PATH|Name of the GMAP index directory. This will be used with the `-d` option of `gmapl`|targetGmapIndex: 'ensembl_Triticum_aestivum_julius_2022-9-16'| -|**targetBwaIdx**|BWA index|Prefix for the BWA index files|targetBwaIdx: '/home/herimbert/gdec/shared/triticum_aestivum/arinalrfor/current/bwa/all'| -|**targetGmapIndexPath**|PATH|Full path to the directory in which the GMAPindex is found. This will be used with the `-D` option of `gmapl`|targetGmapIndexPath: '/home/data/triticum_aestivum/julius/gmapdb/all/'| - -* Output parameters/settings - -| Parameter in `config.yaml` | format |description |Example | -|----------------------------|--------|------------|--------| -|**results**| \[STRING\] | directory in which all the Snakemake rules will be executed|results: 'results'| -|**finalPrefix**| \[STRING\] | Prefix for all the final output files (annotaion, mrna/pep fasta sequences ect)|finalPrefix: 'IWGSC_refseqv2.0_annotv2.0'| -|**chromMapID**| CSV| Mapping file which sets the correspondance between the chromosome names in the GFF and the chromosome ID in the newlygenerated gene IDs|chromMapID: 'data/chromosomeMappingID.csv'| -|**transferType**| \[STRING\] | transfert all isoforms or only the representative transcript (.1) for each gene in the reference genome | transferType: 'all' ; transferType: 'first'| - -Example of `chromosomeMappingID.csv` file : - -```bash -$ cat data/chromosomeMappingID.csv -Chr1A 1A -Chr1B 1B -Chr1D 1D -Chr2A 2A -Chr2B 2B -Chr2D 2D -Chr3A 3A -Chr3B 3B -Chr3D 3D -Chr4A 4A -Chr4B 4B -Chr4D 4D -Chr5A 5A -Chr5B 5B -Chr5D 5D -Chr6A 6A -Chr6B 6B -Chr6D 6D -Chr7A 7A -Chr7B 7B -Chr7D 7D -``` - -Once all those parameters has been set up, the final configuration file may look like this: - -```yaml -##### QUERY related files/parameters (refseqv1.0) -# GFF annotatin to transfert -annotationQuery: 'data/IWGSC_v1.1_20170706.gff3' -# feature type used for anchoring on target genome -featureType: 'gene' -# FASTA of the query (used to check the sequences after the coordinates are calculated on the target genome) -queryFasta: 'data/161010_Chinese_Spring_v1.0_pseudomolecules.fasta' -# blastdb of all mrnas. 
used to rescue genes which have failed in the transfert using the targeted approache -blastdb: 'data/IWGSCv1.1_all_mrna' -# map of all chromosome ids --> NEED TO BE UPDATED in another version WITH ONE ARRAY FOR THE QUERY AND ONE ARRAY FOR THE TARGET GENOME ASSEMBLY -chromosomes: ['1A', '2A', '3A', '4A', '5A', '6A', '7A', '1B', '2B', '3B', '4B', '5B', '6B', '7B', '1D', '2D', '3D', '4D', '5D', '6D', '7D', 'U'] -refChrom: ['chr1A', 'chr1B', 'chr1D', 'chr2A', 'chr2B', 'chr2D', 'chr3A', 'chr3B', 'chr3D', 'chr4A', 'chr4B', 'chr4D', 'chr5A', 'chr5B', 'chr5D', 'chr6A', 'chr6B', 'chr6D', 'chr7A', 'chr7B', 'chr7D', 'chrUn'] - -##### TARGET related files/parameters (refseqv2.1) -targetFasta: 'data/CS_pesudo_v2.1.fa' -#GMAP index of the genome for -d option -targetGmapIndex: 'ensembl_Triticum_aestivum_julius_2022-9-16' -#GMAP index: path to the gmapindex directory, for -D option -targetGmapIndexPath: '/home/herimbert/gdec/shared/triticum_aestivum/julius/current/gmapdb/all/' - - -##### ISBP/markers related config and parameters -# BAM file of markers/ISBPs mapped on the target genome (REFSEQ v2.1) -isbpBam: 'data/iwgsc_refseqv1_ISBP.bwav2.1.bam' -# BED file of coordinates on the query genome (REFSEQ v1.0) -isbpBed: 'data/ISBP_refseqv1.bed' -# minimum mapping quality of markers on the target genome -mapq: 30 -# max mismatches per ISBP/marker -mismatches: 0 - -##### OUTPUT directory -results: 'resultsDEV' -finalPrefix: 'IWGSC_refseqv2.0_annotv2.0' -# this file contains two columns: the first is the chromosome name as it appears in the genome.fasta of the new reference, -# and the second the chromosome name as it will appear in the new gene Names -chromMapID: 'data/chromosomeMappingID.csv' -``` - -## Running the pipeline - -At first, it is recommended to make a dry-run of the analysis: -```bash -$ snakemake -nrp -``` -This will check all the rules and th parameters in the `config.yaml` file and print all the command lines which would have been executed. -If there is no errors, then you can execute the pipeline with: - -```bash -$ snakemake -``` - -If you have multiple CPUs available on your computer, you can choose to use them. -For example, if you want to use up to 8 CPUs in parallel, you can run: - -```bash -$ snakemake -j 8 -``` - -If you are on a computer cluster with a job scheduler, you can tell the pipeline to use this scheduler instead of runnin all the processes on the local machine: -```bash -$ snakemake -j 32 --cluster sbatch -``` - -This will allow to have at most 32 subproccess run through the SLURM scheduler with `sbatch`. - -You can use a custom [cluster.json](cluster.json) JSON file do setup the parameters of SBATCH for each rules, and use it with with: -```console -$ snakemake -j 32 --cluster-config cluster-hpc2.json --cluster "sbatch -J {cluster.jobName} -c {cluster.c} --mem {cluster.mem} -e {cluster.error} -o {cluster.output} -p gdec" --verbose" -``` - -The pipeline comes with conda environment file which can be used by Snakemake. To enable the use of conda: -```console -$ snakemake --use-conda -j 8 --cluster-config cluster-hpc2.json --cluster "sbatch -p {cluster.partition} -c {cluster.c} -N 1 -t {cluster.time} -J {cluster.jobName} --mem={cluster.mem} -e {cluster.error} -o {cluster.output}" -``` - -It is possible to force Snakemake to wait for a defined amount of time in case of latency on the filesystem of your cluster/server. -```console -# wating 30 seconds after each job to check for output files -$ snakemake --latency-wait 30 [...] 
-``` - -You can generate the diagram of all the processes and dependancies of you analysis: -```bash -$ snakemake --dag |dot -T png > dag.png -``` -This will generate a PNG file of your diagram. - - -If you simply want the global process of the pipeline, you may run: -```bash -$ snakemake --rulegraph |dot -T png > rulegraph.png -``` -This will generate a PNG file of your diagram. - diff --git a/Singularity b/Singularity index 8a28c7393e9fa9d55905d5af9bb12f330b4b9a03..953cf9c05fb3b43a328c60734d2390b78322a181 100644 --- a/Singularity +++ b/Singularity @@ -3,7 +3,7 @@ From: conda/miniconda3 IncludeCmd: yes %files -env.yaml /environment.yml +envs/magatt.yaml /environment.yml %help Contains all MAGATT dependancies diff --git a/cluster-sibi.json b/cluster-sibi.json deleted file mode 100644 index 573eca1d2212adf9d438d8aaf8a8f4991c30acae..0000000000000000000000000000000000000000 --- a/cluster-sibi.json +++ /dev/null @@ -1,22 +0,0 @@ -{ - "__default__" : - { - "c" : 1, - "mem" : "4G", - "jobName" : "magatt_{rule}", - "error" : "slurm_%x-%J.log", - "output" : "slurm_%x-%J.log" - }, - "mapHomologousRegions" : - { - "mem" : "8G" - }, - "gmapRescue": - { - "mem" : "8G" - }, - "gmapIndexTarget": - { - "mem" : "64G" - } -} diff --git a/docs/conf.py b/docs/conf.py index 86c4b2fef52cc679dcecf1bcc8011064e7b171dc..31a86c28b0088c6d3d166d3a9dd008e786d6ff0e 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -29,6 +29,6 @@ source_suffix = { # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output html_theme = 'sphinx_rtd_theme' -html_static_path = ['source/_static'] -html_templates_path = ['source/_templates'] +#html_theme = 'haiku' +html_static_path = ['_static'] diff --git a/docs/index.rst b/docs/index.rst index d4a0ee79d988176e440e49efe8c6e5b117b044ce..8aa92a717ce3ee8400067e0b64b5c90d88e87d9b 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -10,6 +10,8 @@ Marker Assisted Gene Annotation Transfert for Triticeae. Snakemake pipeline used to transfert GFF annotation on a new assembly with a fine target mapping approach. +GitLab Repository : https://forgemia.inra.fr/umr-gdec/magatt + .. 
toctree:: :maxdepth: 2 :titlesonly: @@ -24,3 +26,4 @@ Snakemake pipeline used to transfert GFF annotation on a new assembly with a fin :caption: Configuration source/pipeline_conf + source/run_pipeline diff --git a/docs/source/install.md b/docs/source/install.md index 90a8095485b0d3ba266bce8947f084d84d772f4c..bfb67f67de5350b12ba5ea87fb4caae25ddf6a20 100644 --- a/docs/source/install.md +++ b/docs/source/install.md @@ -12,7 +12,7 @@ $ git clone https://forgemia.inra.fr/umr-gdec/magatt.git We recommend to build the environment using conda (developped with miniconda 3, conda 4.9.2 ) with the file [environment.yml](https://forgemia.inra.fr/umr-gdec/magatt/-/blob/master/envs/environment.yml): ```console -$ conda env create -f environment.yml -n magatt +$ conda env create -f envs/environment.yml -n magatt ``` Once created, you can activate the environment with: diff --git a/docs/source/pipeline_conf.md b/docs/source/pipeline_conf.md index c2b98e8624f0d417cf504f0766c9f511812550d1..ed5423a399361d3e1b4d2114a2a2af592c46de54 100644 --- a/docs/source/pipeline_conf.md +++ b/docs/source/pipeline_conf.md @@ -1,4 +1,4 @@ -# Prepare and run the pipeline +# Prepare and configure the pipeline ## Creating the configuration file: inputs and other parameters @@ -31,9 +31,9 @@ Here, we only need the fasta of the new genome assembly | Parameter in `config.yaml` | format |description |Example | |----------------------------|--------|------------|--------| |**targetFasta**|FASTA|fasta file of the target genome assembly on which we transfert the annotation|targetFasta: 'data/CS_pesudo_v2.1.fa'| +|**targetBwaIdx**|BWA index|Prefix for the BWA index files|targetBwaIdx: '/data/db/triticum_aestivum/julius/current/bwa/all'| |**targetGmapIndex**|PATH|Name of the GMAP index directory. This will be used with the `-d` option of `gmapl`|targetGmapIndex: 'ensembl_Triticum_aestivum_julius_2022-9-16'| -|**targetBwaIdx**|BWA index|Prefix for the BWA index files|targetBwaIdx: '/home/herimbert/gdec/shared/triticum_aestivum/arinalrfor/current/bwa/all'| -|**targetGmapIndexPath**|PATH|Full path to the directory in which the GMAPindex is found. This will be used with the `-D` option of `gmapl`|targetGmapIndexPath: '/home/data/triticum_aestivum/julius/gmapdb/all/'| +|**targetGmapIndexPath**|PATH|Full path to the directory in which the GMAPindex is found. This will be used with the `-D` option of `gmapl`|targetGmapIndexPath: '/data/db/triticum_aestivum/julius/gmapdb/all/'| * Output parameters/settings @@ -92,7 +92,7 @@ targetFasta: 'data/CS_pesudo_v2.1.fa' #GMAP index of the genome for -d option targetGmapIndex: 'ensembl_Triticum_aestivum_julius_2022-9-16' #GMAP index: path to the gmapindex directory, for -D option -targetGmapIndexPath: '/home/herimbert/gdec/shared/triticum_aestivum/julius/current/gmapdb/all/' +targetGmapIndexPath: '/data/db/triticum_aestivum/julius/current/gmapdb/all/' ##### ISBP/markers related config and parameters @@ -112,69 +112,3 @@ finalPrefix: 'IWGSC_refseqv2.0_annotv2.0' # and the second the chromosome name as it will appear in the new gene Names chromMapID: 'data/chromosomeMappingID.csv' ``` - -# Running the pipeline - -At first, it is recommended to make a dry-run of the analysis: -```bash -$ snakemake -nrp -``` -This will check all the rules and th parameters in the `config.yaml` file and print all the command lines which would have been executed. 
-If there is no errors, then you can execute the pipeline with:
-
-```bash
-$ snakemake
-```
-
-If you have multiple CPUs available on your computer, you can choose to use them.
-For example, if you want to use up to 8 CPUs in parallel, you can run:
-
-```bash
-$ snakemake -j 8
-```
-
-If you are on a computer cluster with a job scheduler, you can tell the pipeline to use this scheduler instead of runnin all the processes on the local machine:
-
-```bash
-$ snakemake -j 32 --cluster sbatch
-```
-
-This will allow to have at most 32 subproccess run through the SLURM scheduler with `sbatch`.
-
-You can use a custom [cluster.json](https://forgemia.inra.fr/umr-gdec/magatt/-/blob/master/cluster.json) JSON file do setup the parameters of SBATCH for each rules, and use it with with:
-
-```console
-$ snakemake -j 32 --cluster-config cluster-hpc2.json --cluster "sbatch -J {cluster.jobName} -c {cluster.c} --mem {cluster.mem} -e {cluster.error} -o {cluster.output} -p gdec" --verbose"
-```
-
-The pipeline comes with conda environment file which can be used by Snakemake. To enable the use of conda:
-
-```console
-$ snakemake --use-conda -j 8 --cluster-config cluster-hpc2.json --cluster "sbatch -p {cluster.partition} -c {cluster.c} -N 1 -t {cluster.time} -J {cluster.jobName} --mem={cluster.mem} -e {cluster.error} -o {cluster.output}"
-```
-
-It is possible to force Snakemake to wait for a defined amount of time in case of latency on the filesystem of your cluster/server.
-
-```console
-# wating 30 seconds after each job to check for output files
-$ snakemake --latency-wait 30 [...]
-```
-
-You can generate the diagram of all the processes and dependancies of you analysis:
-
-```bash
-$ snakemake --dag |dot -T png > dag.png
-```
-This will generate a PNG file of your diagram.
-
-
-
-If you simply want the global process of the pipeline, you may run:
-
-```bash
-$ snakemake --rulegraph |dot -T png > rulegraph.png
-```
-
-This will generate a PNG file of your diagram.
-
-
diff --git a/docs/source/run_pipeline.md b/docs/source/run_pipeline.md
new file mode 100644
index 0000000000000000000000000000000000000000..203ed2ada33c756bbe3d1212df01f00b86489025
--- /dev/null
+++ b/docs/source/run_pipeline.md
@@ -0,0 +1,76 @@
+# Running the pipeline
+
+## Check
+
+At first, it is recommended to make a dry-run of the analysis:
+```bash
+$ snakemake -nrp
+```
+
+The dry-run checks all the rules and the parameters in the `config.yaml` file and prints the command lines that would be executed.
+
+## Run locally
+
+If there are no errors, you can execute the pipeline with:
+
+```bash
+$ snakemake
+```
+
+If you have multiple CPUs available on your computer, you can choose to use them.
+For example, if you want to use up to 8 CPUs in parallel, you can run:
+
+```bash
+$ snakemake -j 8
+```
+
+## Run on cluster
+
+If you are on a computer cluster with a job scheduler, you can tell the pipeline to use this scheduler instead of running all the processes on the local machine:
+
+```bash
+$ snakemake -j 32 --cluster sbatch
+```
+
+This allows at most 32 jobs to run in parallel through the SLURM scheduler with `sbatch`.
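+
+Per-rule `sbatch` resources (CPUs, memory, log files) can be described in a JSON cluster configuration file. The sketch below is only an illustration: the partition, time limit and per-rule memory values are placeholders to adapt to your cluster and to the rules of the pipeline, but its keys match the `{cluster.*}` placeholders used in the submission command that follows.
+
+```json
+{
+    "__default__" :
+    {
+        "partition" : "gdec",
+        "c" : 1,
+        "mem" : "4G",
+        "time" : "12:00:00",
+        "jobName" : "magatt_{rule}",
+        "error" : "slurm_%x-%J.log",
+        "output" : "slurm_%x-%J.log"
+    },
+    "gmapIndexTarget" :
+    {
+        "mem" : "64G"
+    }
+}
+```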
+
+You can use a custom cluster configuration file such as [cluster-hpc2.json](https://forgemia.inra.fr/umr-gdec/magatt/-/blob/master/cluster-hpc2.json) to set up the `sbatch` parameters for each rule, and use it with:
+
+```console
+$ snakemake -j 32 --cluster-config cluster-hpc2.json --cluster "sbatch -J {cluster.jobName} -c {cluster.c} --mem {cluster.mem} -e {cluster.error} -o {cluster.output} -p gdec" --verbose
+```
+
+## Use the Conda environment
+
+The pipeline comes with a conda environment file which can be used by Snakemake. To enable the use of conda:
+
+```console
+$ snakemake --use-conda -j 8 --cluster-config cluster-hpc2.json --cluster "sbatch -p {cluster.partition} -c {cluster.c} -N 1 -t {cluster.time} -J {cluster.jobName} --mem={cluster.mem} -e {cluster.error} -o {cluster.output}"
+```
+
+It is possible to make Snakemake wait a defined amount of time for output files after each job, which helps in case of latency on the filesystem of your cluster/server:
+
+```console
+# wait up to 30 seconds after each job for output files to appear
+$ snakemake --latency-wait 30 [...]
+```
+
+## Diagrams and graphs
+
+You can generate a diagram of all the processes and dependencies of your analysis:
+
+```bash
+$ snakemake --dag |dot -T png > dag.png
+```
+This will generate a PNG image of the full graph of jobs.
+
+If you only want the global structure of the pipeline (one node per rule), you can run:
+
+```bash
+$ snakemake --rulegraph |dot -T png > rulegraph.png
+```
+
+This will generate a PNG image of the rule graph.
+
diff --git a/env.yaml b/envs/magatt.yaml
similarity index 99%
rename from env.yaml
rename to envs/magatt.yaml
index 5771e2547a3e1375175aaf4d48bfb2180c8bd92e..7fa183673d3bc395a28ddb990f86587f1bd59021
--- a/env.yaml
+++ b/envs/magatt.yaml
@@ -113,3 +113,4 @@ dependencies:
   - xz=5.2.5=h516909a_1
   - zlib=1.2.11=h516909a_1010
   - zstd=1.4.5=h6597ccf_2
+  - bwa