Daniel Vaulot

2024-01-24

Introduction to Roscoff ABIMS server

Outline

  • Pre-requisites
  • Connecting to server
  • Linux basics
  • Running programs - SLURM

Pre-requisites

Software

Terminal program

File transfer

File editing

Reference

Linux

  • Intro
  • You can practice Linux on your PC by installing WSL2

ABIMS Roscoff cluster

SLURM (job management)

Connecting to server

Connecting to ABIMS terminal

  • Launch MobaXterm

  • Enter information in new session

Connecting to ABIMS terminal

  • Type password (cannot copy and paste)

Connecting to ABIMS terminal

  • Rename session

  • The password is saved in the session (save passwords = always)

Upload/download files with WinSCP

  • Launch WinSCP

  • Create new site

Exit system

logout 

Intro to Linux

Very important points

  • Linux is case sensitive
  • No spaces in file names: they always cause problems
  • Directories are separated by “/” (not “\” as in Windows)
  • Keep everything organized because Linux programs create lots of output files
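The first two points can be tried safely in a throw-away directory. This is a minimal sketch: the file names and the temporary directory are made up for illustration and are not part of the training material.

```shell
# Work in a temporary directory so that nothing real is touched
WORKDIR="$(mktemp -d)"
cd "$WORKDIR"

# Case sensitivity: on Linux these are two different files
touch data.txt Data.txt
ls

# A space splits the name into two arguments unless it is quoted
touch "my file.txt"
ls my file.txt 2>/dev/null || echo "Without quotes, ls looks for 'my' and 'file.txt'"
ls "my file.txt"
```

Using "-" or "_" instead of spaces avoids having to quote names everywhere.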

Create and manipulate files

# Copy file
cp ~/training/file-01.txt file-01.txt
ls

# Rename file
mv file-01.txt file-02.txt
ls

# Can also move to a directory (create it first)
mkdir -p test
mv file-02.txt test/

# Delete file (remove)
rm test/file-02.txt

# Create file
touch my-file.txt
ls

Create files using WinSCP

  • Update preferences to make VS Code the default editor
  • Create a new file in your directory
  • Double-click to edit
  • Save in VS Code and the file is updated on the server

Display file directly in terminal

# Copy file
cp ~/training/file-01.txt file-01.txt

# Display whole file
cat file-01.txt

# Display one page at a time
less file-01.txt

# Only the top of the file
head file-01.txt

Tips and tricks

  • Up and Down arrows recall previous commands
  • Use TAB to complete a command or a file name
  • Copy - Paste (do not use Ctrl-C, Ctrl-V)
    • Copy: just select the text you want to copy
    • Paste: right-click with the mouse

Arguments for commands
  • One-letter shortcut: -n
  • Full argument: --lines
# First five lines
head -n 5 file-01.txt

# First five lines, using the full argument
head --lines 5 file-01.txt

# To know the arguments
man head

Tips and tricks

Redirection and pipes
  • “>” Redirect output to a file
# Write the output to a file
head -n 10 file-01.txt > file-02.txt
ls
  • “|” Send (pipe) output to another program
# Redirect output of cat
cat  file-01.txt | head -n 10
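The two operators can be combined. The sketch below uses a generated file of numbers (not one of the training files) so that the result is easy to check; it also uses “>>”, which appends to a file instead of overwriting it, and tail, the counterpart of head.

```shell
# Work in a temporary directory with a sample file of 100 lines
WORKDIR="$(mktemp -d)"
cd "$WORKDIR"
seq 1 100 > numbers.txt

# ">" creates the file or overwrites it
head -n 10 numbers.txt > first.txt

# ">>" appends to the end of an existing file
head -n 5 numbers.txt >> first.txt
wc -l < first.txt

# Chain two programs: keep lines 6 to 10
head -n 10 numbers.txt | tail -n 5 > middle.txt
cat middle.txt
```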

Tips and tricks

Variables
  • Always in upper case (easy to spot)
  • Assignment with = sign
    • ! No space around = sign
    • Value between quotes
  • Get value with $FILE or ${FILE}
# Create variable
FILE="file-03.txt"

# Check variable
echo $FILE

# Write the output to a file
head -n 10 file-01.txt > $FILE
ls
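The ${FILE} form matters when a variable is glued to other text. A minimal sketch, with variable names chosen for illustration:

```shell
FILE_HEAD="file-03"
IDENTITY="0.99"

# Braces mark where the variable name ends. Without them the shell would
# look for a variable called IDENTITY_ (underscores are valid in names).
OUTPUT="clusters_${IDENTITY}_${FILE_HEAD}.txt"

# Quoting the variable protects against spaces in its value
echo "$OUTPUT"
```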

Job management (SLURM)

Motivation

  • Programs
    • Run for a long time (cannot wait for output)
    • Need a specific amount of memory
    • Can run on multiple processors
    • Many users run jobs simultaneously
  • Examples of programs
    • R
    • mafft: alignments
    • raxml: trees
    • vsearch: clustering, sequence manipulation
    • emu: metabarcode assignment
  • Jobs must always be run from the project folder
List available programs
  • You must load a module before using the program
  • If a program is not available, a request must be made to ABIMS (go through Daniel)
module avail
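A few other module subcommands are commonly used together with module avail. The sketch below assumes a standard Environment Modules or Lmod installation, which is what the module command normally belongs to; the guard only makes the snippet harmless on a machine without it (for example your own PC).

```shell
if command -v module >/dev/null 2>&1; then
    module avail vsearch      # list the available versions of one program
    module load vsearch       # load the default version
    module list               # show what is currently loaded
    MODULE_STATUS="available"
else
    echo "The module command is not available on this machine"
    MODULE_STATUS="unavailable"
fi
```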

Move to project folder - ALWAYS

Our projects are located in /shared/projects/geek_simple_laby

cd /shared/projects/geek_simple_laby/daniel
  • Only three folders are backed up.
  • You can copy your scripts to the script folder under your directory (script/sandra).
  • Always do your own backup on your PC (download files/folder)
# Project working directory usage at ABiMS

ABiMS provides a backup on your project directory.

To take advantage of this process, you have to follow some rules:
 - Only the subdirectories ‘archive’, ‘script’ and ‘finalresult’ are backed up.
 - You must place these subdirectories at the root of your project folder.
 - Please be smart in your backups for our finances and the planet.
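The three backed-up subdirectories can be created in one command. This is a sketch only: PROJECT_DIR points to a temporary directory standing in for the real project root, and the per-user folder name is taken from the example above.

```shell
# Stand-in for the root of the project folder (assumption for this sketch)
PROJECT_DIR="$(mktemp -d)"

# -p creates missing parents and does not complain if the directory exists
mkdir -p "$PROJECT_DIR/archive" "$PROJECT_DIR/script" "$PROJECT_DIR/finalresult"

# One subfolder per person inside script
mkdir -p "$PROJECT_DIR/script/sandra"

ls "$PROJECT_DIR"
```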

Interactive mode - srun

  • Use for programs like R
  • Not really recommended
  • Do not forget to quit when done (otherwise CPU time keeps being used)
# Load module (specifying the version is optional)
module load r/4.2.1

# Run with default values for memory and processors
srun R --vanilla
# srun: job 36256512 queued and waiting for resources


# Run with 4 processors and 32G of mem
# pty is for "pseudo-terminal"
srun -c 4 --mem=32G --pty R

Batch mode - sbatch

  • Copy the necessary files
cp /shared/home/csim/training/Labyrinthulomycetes.pacbio.fasta Labyrinthulomycetes.pacbio.fasta
cp /shared/home/csim/training/sbatch_cluster_01.sh sbatch_cluster_01.sh
cp /shared/home/csim/training/sbatch_cluster_02.sh sbatch_cluster_02.sh

ls

What we are going to do:

  • Singapore time series ASVs for Labyrinthulomycetes
  • 289 sequences, but some are very similar
  • Cluster at 99% similarity
  • Use vsearch program
  • Final: 78 sequences
Input file
>pacbio;3d4a51f4d1;Aplanochytrium_sp.;size=120;
AGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGGTAGGAGCGACCGTGCCGAACTTGATTGTTCGTGTATTGTGTTGTCTTCAGCCATCCTCGT
GGAGAACTTTTCTAACATTAACTTGTTGGGATTGGGACCCGCGTCGTTTACTGTGAAAAAATTAGAGTGTTTAAAGCAGGCATTAGCTTGAATACATTAGCATGGAATAATAAGATAGG
ACTTTGGTACTATTTTGTTGGTTTGCATACCAAATTAATGATCAACAGGAACAGTTTGAGGATATTCGTATGAACATGTCAGAGGTGAAATTCTTGGATTTTGATCAGACGAACTACTG
CGAAAGCATTTATCAAGGATGTTTTCATTAATCAAGAACGAAAGTTAGGGGATCGAA...

Batch mode - sbatch

sbatch_cluster_01.sh

Two parts:

  • Header (very strict formatting)
  • Command to Run
#!/bin/bash
#SBATCH -p fast                      # Partition: fast, long or bigmem
#SBATCH --cpus-per-task 4
#SBATCH --mem-per-cpu 4GB            # memory per CPU (4 CPUs x 4 GB = 16 GB in total)
#SBATCH -t 6-0:00                    # maximum job duration (D-HH:MM)
#SBATCH -o slurm.%N.%j.out           # STDOUT
#SBATCH -e slurm.%N.%j.err           # STDERR
#SBATCH --mail-user=vaulot@sb-roscoff.fr # ! Replace with your UiO email
#SBATCH --mail-type=BEGIN,END,FAIL


# Submitted with 
# cd /shared/home/csim/daniel # ! Change to your directory
# sbatch sbatch_cluster_01.sh

module load vsearch

cd /shared/home/csim/daniel # ! Change to your directory

vsearch --cluster_fast "Labyrinthulomycetes.pacbio.fasta" \
    --threads 4 \
    --id 0.99 \
    --uc clusters_0.99_Labyrinthulomycetes.pacbio.tsv \
    --sizeout \
    --centroids clusters_0.99_Labyrinthulomycetes.pacbio.centroids.fasta \
    --clusterout_sort \
    --clusterout_id

Before submitting, edit the email address and the directories.
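After editing a script, its shell syntax can be checked without running or submitting it. The sketch below writes a hypothetical minimal script (sbatch_minimal.sh, not one of the training files) and checks it with bash -n. Note that bash -n only parses the commands: the #SBATCH lines are comments for bash, so a mistake in the header is not detected this way.

```shell
WORKDIR="$(mktemp -d)"
cd "$WORKDIR"

# Write a minimal two-part script: header, then command to run
cat > sbatch_minimal.sh <<'EOF'
#!/bin/bash
#SBATCH -p fast
#SBATCH --cpus-per-task 1
#SBATCH --mem-per-cpu 1G
#SBATCH -t 0-0:05
#SBATCH -o slurm.%N.%j.out
#SBATCH -e slurm.%N.%j.err

echo "Running on $(hostname)"
EOF

# bash -n reads the script and reports syntax errors without executing it
if bash -n sbatch_minimal.sh; then
    SYNTAX="ok"
else
    SYNTAX="error"
fi
echo "Syntax check: ${SYNTAX}"
```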

Batch mode - sbatch

sbatch sbatch_cluster_01.sh
Get status of your run
sacct --format=JobID,JobName,User%15,Partition,ReqCPUS,ReqMem,State,CPUTime,MaxVMSize%15
JobID           JobName            User  Partition  ReqCPUS     ReqMem      State    CPUTime       MaxVMSize
------------ ---------- --------------- ---------- -------- ---------- ---------- ---------- ---------------
36263527     sbatch_cl+            csim       fast        4        16G  COMPLETED   00:01:32
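The ReqMem value of 16G follows from the header of the script: 4 CPUs per task multiplied by 4 GB per CPU. A small sketch of the arithmetic (the variable names are illustrative, not SLURM variables):

```shell
CPUS_PER_TASK=4      # from --cpus-per-task
MEM_PER_CPU_GB=4     # from --mem-per-cpu

# $(( )) performs integer arithmetic in the shell
TOTAL_MEM_GB=$((CPUS_PER_TASK * MEM_PER_CPU_GB))
echo "Total memory requested: ${TOTAL_MEM_GB}G"
```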

Three main states:

  • PENDING
  • RUNNING
  • COMPLETED
squeue
squeue -u csim

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          36263527      fast sbatch_c     csim  R       0:14      1 cpu-node-050

You should also get two emails:

  • Slurm Job_id=36263527 Name=sbatch_cluster_01.sh Began, Queued time 00:00:01
  • Slurm Job_id=36263527 Name=sbatch_cluster_01.sh Ended, Run time 00:00:23, COMPLETED, ExitCode 0

Batch mode - sbatch

Files produced
  • slurm.cpu-node-050.36263527.err

  • slurm.cpu-node-050.36263527.out

  • clusters_0.99_Labyrinthulomycetes.pacbio.centroids.fasta

  • clusters_0.99_Labyrinthulomycetes.pacbio.tsv

slurm.cpu-node-050.36263527.err

Output of the program vsearch

vsearch v2.22.1_linux_x86_64, 251.3GB RAM, 256 cores
https://github.com/torognes/vsearch

Reading file Labyrinthulomycetes.pacbio.fasta 100%
1327126 nt in 289 seqs, min 4174, max 5443, avg 4592
Masking 100%
Sorting by length 100%
Counting k-mers 100%
Clustering 100%
Sorting clusters 100%
Writing clusters 100%
Clusters: 78 Size min 1, max 29, avg 3.7
Singletons: 41, 14.2% of seqs, 52.6% of clusters
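The number of sequences in a FASTA file can be checked by counting header lines, which start with “>”. The sketch below uses a small made-up file; on the server, the same grep command applied to the input and centroid files should presumably give 289 and 78.

```shell
WORKDIR="$(mktemp -d)"
cd "$WORKDIR"

# Toy FASTA file with three sequences (one of them on two lines)
cat > toy.fasta <<'EOF'
>seq1;size=3
ACGTACGT
ACGT
>seq2;size=1
GGCCGGCC
>seq3;size=2
TTAATTAA
EOF

# -c counts matching lines; '^>' matches lines starting with ">"
N_SEQS="$(grep -c '^>' toy.fasta)"
echo "Number of sequences: ${N_SEQS}"
```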

Batch mode - sbatch

clusters_0.99_Labyrinthulomycetes.pacbio.centroids.fasta
>pacbio;1a9244a2e6;Thraustochytriaceae_X_sp.;clusterid=74;size=29
AGCTTCAATAGCATATACTAACGTTGTCGCAGTTAAAAAGTTCGTAGTTGAATTTCTGGTAGGAGTGACCTGGCCTTTTA
CGTTTGTAATTGTATGCTGTGTGTTATCTCTGGCCATCCTGAATCTGCTTTGTTGTAGATTCTCACATACTGTAAAAAAA
TTAGAGTGTTTAAAGCATTTCGTATGAAAAGAATACATCTTATGGGATATCAAAATAGGATTTTGGTGCTATTTTGTTGG
TTTGCACACCAAAATAATGATTAACAGGGACAGTTGGGGGTATTTGTATTTAATTGTCAGAGGTGAAATTCTTGGATTTA
TGAAAGACAAACTACTGCGAAAGCATTTATCAAGGATGTTTTCATTAATCATGAACGAAAGTTAGGGGATCGAAGATGAT
CAGATACCATCGTAGTCTTAACAGTAAACTATACCAACTTGCGATTATTCCATGGTGTTTTTTGCCAGGAGTAGCAGCAC
clusters_0.99_Labyrinthulomycetes.pacbio.tsv
S   0   5443    *   *   *   *   *   pacbio;4b0f43a6e4;Labyrinthulomycetes_LAB8_sp.;size=3;  *
H   0   5436    99.4    +   0   0   1120MI12M7I673MI289M3D174M2D1534M3I1629M    pacbio;e3f28aa0af;Labyrinthulomycetes_LAB8_sp.;size=4;  pacbio;4b0f43a6e4;Labyrinthulomycetes_LAB8_sp.;size=3;
H   0   5436    99.4    +   0   0   1120MI12M7I673MI289M3D174M2D1534M3I1629M    pacbio;70ee1479b8;Labyrinthulomycetes_LAB8_sp.;size=3;  pacbio;4b0f43a6e4;Labyrinthulomycetes_LAB8_sp.;size=3;
S   1   5230    *   *   *   *   *   pacbio;86192ef156;Labyrinthulomycetes_LAB8_sp.;size=15; *
H   1   5229    99.8    +   0   0   1267MI530MD301MI3130M   pacbio;e54d7a85c5;Labyrinthulomycetes_LAB8_sp.;size=4;  pacbio;86192ef156;Labyrinthulomycetes_LAB8_sp.;size=15;
H   1   5227    99.5    +   0   0   1120MI12M7I634M3D24MD305MD3127M pacbio;970a95784e;Labyrinthulomycetes_LAB8_sp.;size=7;  pacbio;86192ef156;Labyrinthulomycetes_LAB8_sp.;size=15;
H   1   5227    99.5    +   0   0   1120MI12M7I634M3D24MD305MD3127M pacbio;d23a7e7d59;Labyrinthulomycetes_LAB8_sp.;size=2;  pacbio;86192ef156;Labyrinthulomycetes_LAB8_sp.;size=15;
H   1   5226    99.7    +   0   0   1267MI555MI275MI39MI3090M   pacbio;abfdcca79c;Labyrinthulomycetes_LAB8_sp.;size=9;  pacbio;86192ef156;Labyrinthulomycetes_LAB8_sp.;size=15;
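In this file the first column gives the record type: S for a cluster centroid and H for a sequence assigned to a centroid. Counting them is a quick sanity check. The sketch below uses a simplified toy file with fewer columns than the real output and made-up sequence names.

```shell
WORKDIR="$(mktemp -d)"
cd "$WORKDIR"

# Simplified tab-separated records: type, cluster, length, identity, query, target
printf 'S\t0\t5443\t*\tseqA\t*\n'       >  toy.tsv
printf 'H\t0\t5436\t99.4\tseqB\tseqA\n' >> toy.tsv
printf 'H\t0\t5436\t99.4\tseqC\tseqA\n' >> toy.tsv
printf 'S\t1\t5230\t*\tseqD\t*\n'       >> toy.tsv

# Count centroids and hits
N_CENTROIDS="$(grep -c '^S' toy.tsv)"
N_HITS="$(grep -c '^H' toy.tsv)"
echo "Centroids: ${N_CENTROIDS}, hits: ${N_HITS}"

# Same information for all record types at once
cut -f 1 toy.tsv | sort | uniq -c
```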

Batch mode - sbatch

If something goes wrong:

  • Job stays PENDING: problem with the SLURM instructions at the top of the file
  • Script error: check the .err file CAREFULLY
Error in file names
sacct --format=JobID,JobName,User%15,Partition,ReqCPUS,ReqMem,State,CPUTime,MaxVMSize%15

36263532     sbatch_cl+            csim       fast        4        16G     FAILED   00:00:04
36263532.ba+      batch                                   4                FAILED   00:00:04         146608K

slurm.cpu-node-050.36263532.err

vsearch v2.22.1_linux_x86_64, 251.3GB RAM, 256 cores
https://github.com/torognes/vsearch

Fatal error: Unable to open file for reading (Labyrinthulomycete.pacbio.fasta)
Cancel a job
scancel 36263527
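A file-name error like the one above can be caught before submitting by testing that the input file exists. A minimal sketch, run here in an empty temporary directory so that the misspelled name is indeed missing:

```shell
WORKDIR="$(mktemp -d)"
cd "$WORKDIR"

# Name misspelled on purpose (missing final "s"), as in the failed job
INPUT="Labyrinthulomycete.pacbio.fasta"

# -f is true only if the file exists and is a regular file
if [ -f "$INPUT" ]; then
    INPUT_STATUS="found"
else
    INPUT_STATUS="missing"
fi
echo "Input file ${INPUT}: ${INPUT_STATUS}"
```

The same test can be placed at the top of an sbatch script, followed by exit 1 in the missing case, so that the job stops with a clear message.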

Batch mode - sbatch

sbatch_cluster_02.sh
  • Replace strings with variables
  • Can re-use script for different files and parameters
  • Easier to track errors
module load vsearch

DIR="/shared/home/csim/daniel/" # ! Change to your directory
FILE_HEAD="Labyrinthulomycetes.pacbio"
IDENTITY="0.99"
THREADS=4

cd "${DIR}"

vsearch --cluster_fast "${FILE_HEAD}.fasta" \
    --threads "${THREADS}" \
    --id "${IDENTITY}" \
    --uc clusters_${IDENTITY}_${FILE_HEAD}.tsv \
    --msaout clusters_${IDENTITY}_${FILE_HEAD}.align.fasta \
    --sizeout \
    --centroids clusters_${IDENTITY}_${FILE_HEAD}.centroids.fasta \
    --clusterout_sort \
    --clusterout_id

More advanced processing

  • Loop through files

  • Loop through parameters

  • Work directly on files

    • grep: search (regular expression)
    • sed: edit text
  • tar/zip/unzip: compress files/directories

  • wget: download files from internet
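The loops and the grep and sed tools listed above can be sketched on toy files. Everything here (file names, identity values, the text replaced by sed) is made up for illustration; the second loop only prints the output names a real script might use.

```shell
WORKDIR="$(mktemp -d)"
cd "$WORKDIR"

# Two toy FASTA files
printf '>a;size=1\nACGT\n' > sample_A.fasta
printf '>b;size=2\nGGCC\n>c;size=1\nTTAA\n' > sample_B.fasta

# Loop through files: count the sequences in each one
for FILE in *.fasta; do
    N="$(grep -c '^>' "$FILE")"
    echo "${FILE}: ${N} sequences"
done

# Loop through parameters: build one output name per identity level
for IDENTITY in 0.97 0.99; do
    echo "Would write clusters_${IDENTITY}_sample_A.tsv"
done

# sed: replace text on each line and write the result to a new file
sed 's/;size=/;abundance=/' sample_B.fasta > sample_B.renamed.fasta

# grep: show only the header lines of the new file
grep '^>' sample_B.renamed.fasta
```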

List of useful Linux programs