Bioweek 3.2.2020: Managing data in Allas and Puhti

A. Log in to Puhti and use scratch

1. Log in to puhti.csc.fi and move to scratch:

Linux/mac

ssh training0XX@puhti.csc.fi (replace XX with your account number)

Windows/PuTTY

host: puhti.csc.fi

In Puhti, check your environment with the command:

csc-workspaces

Switch to the scratch directory of your project

cd /scratch/project_2002389

And create your own sub-directory, named after your training account:

mkdir training0XX (replace XX with your account number)

Set the directory permissions so that other group members can only read the contents but not modify them:

chmod g-wx training0XX
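
To see what the chmod command above changes, the following sketch repeats it on a throwaway directory and prints the mode before and after. The directory name demo_dir and the starting mode 775 are assumptions made for illustration only, and GNU stat (as found on Linux systems) is assumed.

```shell
# Work in a temporary location so that nothing real is touched.
workdir=$(mktemp -d)
mkdir "$workdir/demo_dir"

# Assumed starting point: owner and group have full access (rwxrwxr-x).
chmod 775 "$workdir/demo_dir"
before=$(stat -c '%A' "$workdir/demo_dir")

# Same change as in the exercise: remove write and execute from the group.
chmod g-wx "$workdir/demo_dir"
after=$(stat -c '%A' "$workdir/demo_dir")

echo "before: $before"
echo "after:  $after"

# Clean up the throwaway directory.
rm -r "$workdir"
```

The group part of the mode string changes from rwx to r--. On the real directory the same information can be viewed with ls -ld training0XX.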

Move to the new directory:

cd training0XX

2. Download data with curl

Next, download a dataset from the internet and uncompress it. The dataset contains some Pythium genomes together with the BWA indexes related to them.

curl https://a3s.fi/course_12.11.2019/pythium.tgz > pythium.tgz

ls -ltr

tar zxvf pythium.tgz

ls -ltr

tree pythium
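
Before unpacking an archive it can be useful to list its contents first. The sketch below builds a small stand-in archive (the names demo_data and demo.tgz are made up for illustration) and lists it with the t option of tar instead of x; the same option could be tried on pythium.tgz.

```shell
# Build a tiny stand-in archive in a temporary directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/demo_data/species_a"
echo ">seq1" > "$workdir/demo_data/species_a/genome.fasta"
tar -C "$workdir" -zcf "$workdir/demo.tgz" demo_data

# List the archive contents without extracting anything (t = list, z = gzip, f = file).
listing=$(tar ztf "$workdir/demo.tgz")
echo "$listing"

# Clean up.
rm -r "$workdir"
```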

3. Downloading data with NCBI edirect

Then move one step up in the directory hierarchy, create a directory called cellulose_synthase and move to this new directory:

cd ..

mkdir cellulose_synthase

cd cellulose_synthase

Next we test the NCBI edirect tool (https://docs.csc.fi/apps/edirect/) to retrieve some data:

Check how many proteins are found in the NCBI protein database for Pythium species (see the Count row in the results):

esearch -db protein -query "Pythium [ORGN]"

Then check the number of cellulose synthase 1, cellulose synthase 2 and cellulose synthase 3 proteins that are found for Pythium species.

For cellulose synthase 1 this can be done with:

esearch -db protein -query "Pythium [ORGN] AND cellulose synthase 1 [PROT]"

(Do the same for 2 and 3.)
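
Repeating the search for each protein name can be scripted. The sketch below assumes that esearch prints an XML summary containing a <Count> element, which is what the count row mentioned above refers to. The helper name extract_count is hypothetical, and the loop only runs when esearch is found on the command path.

```shell
# Hypothetical helper: print the value of the first <Count> element read from stdin.
extract_count() {
    sed -n 's:.*<Count>\([0-9][0-9]*\)</Count>.*:\1:p' | head -n 1
}

# Run the three searches from the exercise, but only if edirect is available.
if command -v esearch >/dev/null 2>&1; then
    for n in 1 2 3; do
        count=$(esearch -db protein -query "Pythium [ORGN] AND cellulose synthase $n [PROT]" | extract_count)
        echo "cellulose synthase $n: $count"
    done
fi
```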

Retrieve the cellulose synthase 3 sequences in FASTA format:

esearch -db protein -query "Pythium [ORGN] AND cellulose synthase 3 [PROT]" | efetch -format fasta > cesy3.fasta

Then run an esearch command that tells how many cellulose synthase 3 sequences there are in total in the NCBI protein database.

Extra exercise for the fast ones: align the cellulose synthase 3 set with mafft:

mafft cesy3.fasta > cesy3_aln.fasta

And study the results:

infoalign cesy3_aln.fasta

showalign cesy3_aln.fasta

4. Downloading with enaDataGet

Check the options of enaDataGet with command:

enaDataGet -h

Download a file (Pythium iwayamai genome assembly)

enaDataGet AKYA02000000 -f fasta

gunzip AKYA02.fasta.gz

ls -ltr

Extra exercise for the fast ones: study the downloaded file:

head -20 AKYA02.fasta

tail AKYA02.fasta

infoseq_summary AKYA02.fasta
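
A quick cross-check of a FASTA file can also be done with standard tools. The sketch below counts header lines and sequence characters with awk; the function name fasta_stats and the two-record test file are made up for illustration.

```shell
# Hypothetical helper: print "<number of sequences> <total sequence length>" for a FASTA file.
fasta_stats() {
    awk '/^>/ { n++; next }
         { gsub(/[ \t\r]/, ""); len += length($0) }
         END { print n + 0, len + 0 }' "$1"
}

# Small made-up file: two sequences with 6 and 3 characters.
demo=$(mktemp)
printf '>seq1\nACGT\nAC\n>seq2\nGGG\n' > "$demo"
fasta_stats "$demo"   # prints: 2 9
rm -f "$demo"
```

For the downloaded assembly the call would be fasta_stats AKYA02.fasta, and the numbers should agree with what the other tools report.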

Then compare the cellulose synthase 3 sequences against the genome using BLAST

pb tblastn -query cesy3.fasta -dbnuc AKYA02.fasta -out blast_result.txt

B. Using Allas

Upload case 1. rclone

Upload the data from Puhti to Allas with rclone. Use the command below (replace XX):

rclone -P copyto pythium allas:training0XX-genomes-rc/

• How long did the data upload take?
• What was the transfer rate?
• How long would it take to transfer 100 GB at the same speed?
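
The last question is a simple proportion: time = size / rate. A minimal sketch, assuming that the rate is read from the rclone progress output in MiB/s and that 100 GB is treated as 100 GiB (1 GiB = 1024 MiB). The function name transfer_seconds and the 50 MiB/s figure are example values, not measurements.

```shell
# Hypothetical helper: seconds needed to move SIZE (GiB) at RATE (MiB/s).
transfer_seconds() {
    awk -v size="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", size * 1024 / rate }'
}

# Example: 100 GiB at an assumed rate of 50 MiB/s.
secs=$(transfer_seconds 100 50)
echo "$secs seconds, about $((secs / 60)) minutes"
```

Replace the second argument with the rate you observed in your own upload.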

Then study what you have uploaded to Allas with commands (replace XX)

rclone lsd allas:

rclone ls allas:training0XX-genomes-rc/

rclone lsl allas:training0XX-genomes-rc/

rclone lsf allas:training0XX-genomes-rc/

Check what this looks like in the Pouta web interface. Open a browser and go to: https://pouta.csc.fi/

In the Pouta interface, go to the “object store” section and list the buckets (which are called “Containers” there).

Locate your own training0XX-genomes-rc bucket and download one of the uploaded fasta files to your local computer.

Upload case 2. a-put

Upload the pythium directory from Puhti to Allas using the following commands (replace XX with your account number).

Case 1: Store everything in one object

a-put pythium

a-list

a-list 2002389-puhti-SCRATCH

a-info 2002389-puhti-SCRATCH/training0XX/pythium.tar.zst

Case 2: Each subdirectory (species) as one object

a-put pythium/*

a-list 2002389-puhti-SCRATCH/training0XX

a-check pythium/*

a-info 2002389-puhti-SCRATCH/training0XX/pythium/pythium_vexans.tar.zst

Case 3: Use your own bucket name

a-put pythium/* -b training0XX-genomes-ap

a-list training0XX-genomes-ap

Case 4: Upload files without compression.

a-put --nc pythium/pythium_vexans/bwaindex/* -b training0XX-a_vexans_bwa

a-list training0XX-a_vexans_bwa

Can you see the difference between the four a-put commands above?

Study the training0XX-genomes-ap bucket with commands

a-list training0XX-genomes-ap

rclone ls allas:training0XX-genomes-ap

Why do the two commands above list a different number of objects?

Try the command:

a-info training0XX-genomes-ap/pythium_vexans.tar.zst

which is actually the same as:

rclone cat allas:training0XX-genomes-ap/pythium_vexans.tar.zst_ameta
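
The rclone cat command above reads an object whose name ends in _ameta, a metadata object stored next to the data object. As a sketch, a raw object listing can be reduced to the data objects only by dropping those names. The helper name data_objects_only and the sample listing are made up for illustration.

```shell
# Hypothetical helper: drop metadata objects (names ending in _ameta) from a listing on stdin.
data_objects_only() {
    grep -v '_ameta$'
}

# Made-up listing with one object name per line, in the style of `rclone lsf`.
printf '%s\n' \
    'pythium_vexans.tar.zst' \
    'pythium_vexans.tar.zst_ameta' \
    'example_species.tar.zst' \
    'example_species.tar.zst_ameta' | data_objects_only
```

With a live connection the same filter could be applied as rclone lsf allas:training0XX-genomes-ap | data_objects_only.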

Finally, try the command:

a-flip pythium/pythium_vexans/pythium_vexans.fasta

Try opening the public link that a-flip produced, with your browser.

Upload case 3. Allas-backup

Run commands:

allas-backup --help

allas-backup pythium

allas-backup list

What did these commands do for your data?

Exit

The data in the pythium directory is now stored in Allas in several ways, so we can remove the data from Puhti and log out:

rm -r pythium

exit

C. Downloading data from Allas to Puhti

1. Log in to puhti.csc.fi

Linux/mac

ssh training0XX@puhti.csc.fi (replace XX with your account number)

Windows/PuTTY

host: puhti.csc.fi

In Puhti, check your environment with the command:

csc-workspaces

Switch to the personal scratch directory of your project

cd /scratch/project_2002389/training0XX

Set up Allas connection

module load allas

allas-conf

Then run commands

a-list

rclone lsd allas:

a-list training0XX-genomes-ap

rclone ls allas:training0XX-genomes-ap

a-find pythium_vexans.fasta

a-find -a pythium_vexans.fasta

Next download the data in different ways:

1. Download with rclone

mkdir rclone_dir

cd rclone_dir/

example 1: copy everything

mkdir all

rclone ls allas:training0XX-genomes-rc

rclone copyto -P allas:training0XX-genomes-rc all/

ls -l all

example 2: copy a set of objects

mkdir vexans

rclone copyto allas:training0XX-genomes-rc/pythium_vexans vexans/

ls -l vexans

example 3: copy just one object

rclone copyto allas:training0XX-genomes-rc/pythium_vexans/pythium_vexans.fasta ./vexans.fasta

ls -l

2. Download with a-get

Return to your training0XX directory

cd ..

Check that you are in the right place:

pwd

The pwd command should print /scratch/project_2002389/training0XX

Make a new directory

mkdir a_dir

cd a_dir/

Create a directory called all and go there:

mkdir all

cd all

List your default scratch bucket:

a-list 2002389-puhti-SCRATCH

a-list 2002389-puhti-SCRATCH/training0XX

Look for file pythium_vexans.fasta in Puhti SCRATCH bucket:

a-find pythium_vexans.fasta -b 2002389-puhti-SCRATCH

Download the full dataset with the command:

a-get 2002389-puhti-SCRATCH/training0XX/pythium.tar.zst

And check what you got:

ls -l

ls -R

Now get just one genome dataset:

cd ..

a-get 2002389-puhti-SCRATCH/training0XX/pythium/pythium_vexans.tar.zst

ls -l pythium/

ls -l pythium/pythium_vexans/

3. Downloading data from allas-backup

Return to your main scratch directory and make a new directory

cd ..
mkdir a_backup
cd a_backup/

Use the commands below to find out the ID of the most recent backup version of your pythium directory (replace XX with your account number):

allas-backup list
allas-backup list | grep training0XX

Then use allas-backup restore to download the data:

allas-backup restore ID-string
ls -l
ls -l pythium