Bioweek 3.2.2020: Managing data in Allas and Puhti

A. Log in to Puhti and use scratch

 

1. Log in to puhti.csc.fi and move to scratch:

 

Linux/mac

 

   ssh training0XX@puhti.csc.fi   (replace XX with your account number)

 

 

Windows/PuTTY

 

   host: puhti.csc.fi

 

   login as: training0XX  (replace XX with your account number)

 

 

In Puhti, check your environment with the command:

 

  csc-workspaces

 

Switch to the scratch directory of your project

 

   cd /scratch/project_2002389

Then create your own subdirectory, named after your training account:

 

  mkdir training0XX (replace XX with your account number)

 

Set the directory permissions so that other group members can read the contents but not modify them:

 

 chmod g-w training0XX
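
The effect of individual group permission bits on a directory can be checked in a throwaway location. The sketch below is only an illustration: demo_dir is a made-up name, and it assumes GNU stat (as on Linux systems such as Puhti). It compares removing only the group write bit with removing both the write and the execute (search) bit. Without the execute bit, group members can still list file names but can no longer open the files inside.

```shell
# Throwaway directory so nothing real is touched (demo_dir is a made-up name).
workdir=$(mktemp -d)
mkdir "$workdir/demo_dir"

# Start from a known state: full access for owner and group, none for others.
chmod 770 "$workdir/demo_dir"

# Remove only the group's write bit: listing and entering still work.
chmod g-w "$workdir/demo_dir"
mode_read_only=$(stat -c '%A' "$workdir/demo_dir")
echo "after removing w:       $mode_read_only"

# Also remove the group's execute (search) bit: names can be listed,
# but files inside can no longer be opened by group members.
chmod g-x "$workdir/demo_dir"
mode_no_search=$(stat -c '%A' "$workdir/demo_dir")
echo "after removing w and x: $mode_no_search"

# Clean up the throwaway directory.
rm -r "$workdir"
```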

 

Move to the new directory:

 

  cd training0XX

 

2. Download data with curl

Next, download a dataset from the internet and uncompress it. The dataset contains some Pythium genomes with related BWA indexes; it unpacks into the pythium directory.

 

curl https://a3s.fi/course_12.11.2019/pythium.tgz > pythium.tgz

ls -ltr

tar zxvf pythium.tgz  

ls -ltr

tree pythium
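
Before unpacking a large archive it can be useful to list its contents first. The following is a minimal sketch using a small stand-in archive built on the spot; the names demo.tgz, demo_genomes and species_a are made up for the illustration and are not part of the course dataset.

```shell
# Build a tiny stand-in archive; all names below are made up for the demo.
workdir=$(mktemp -d)
mkdir -p "$workdir/src/demo_genomes/species_a"
echo ">seq1" > "$workdir/src/demo_genomes/species_a/genome.fasta"
tar -czf "$workdir/demo.tgz" -C "$workdir/src" demo_genomes

# List the archive contents without extracting anything.
listing=$(tar -tzf "$workdir/demo.tgz")
echo "$listing"

# Extract into a separate directory and confirm that the file arrived.
mkdir "$workdir/out"
tar -xzf "$workdir/demo.tgz" -C "$workdir/out"
extracted_ok=no
if [ -f "$workdir/out/demo_genomes/species_a/genome.fasta" ]; then
    extracted_ok=yes
fi
echo "extracted: $extracted_ok"

# Clean up the throwaway directory.
rm -r "$workdir"
```

With the course data, the same listing step would be tar -tzf pythium.tgz.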

 

3. Downloading data with NCBI edirect

 

Then move one step up in the directory hierarchy, create the directory cellulose_synthase and move into it:

 

cd  ..

mkdir cellulose_synthase

cd cellulose_synthase

 

Next we test the NCBI edirect tool (https://docs.csc.fi/apps/edirect/) to retrieve some data:

 

Check how many proteins are found in the NCBI protein database for Pythium species (see the Count row in the results):

 

esearch -db protein -query "Pythium [ORGN]"

 

Then check the number of cellulose synthase 1, cellulose synthase 2 and cellulose synthase 3 proteins that are found for Pythium species.

 

For cellulose synthase 1 this can be done with:

 

esearch -db protein -query "Pythium [ORGN] AND cellulose synthase 1 [PROT]"

 

(Do the same for 2 and 3.)

 

Retrieve the cellulose synthase 3 sequences in FASTA format:

 

esearch -db protein -query "Pythium [ORGN] AND cellulose synthase 3 [PROT]" | efetch -format fasta > cesy3.fasta
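
A quick way to check how many sequences ended up in a FASTA file is to count its header lines. The sketch below uses a made-up three-record file as a stand-in for cesy3.fasta; the sequence names and residues are invented for the illustration.

```shell
# Made-up three-record FASTA file standing in for cesy3.fasta.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>demo_1 hypothetical protein A
MKTAYIAKQR
>demo_2 hypothetical protein B
MSLLTEVETY
VLSIIPSGPL
>demo_3 hypothetical protein C
MAHHHHHHVG
EOF

# Every FASTA record starts with a ">" header line, so counting those
# lines gives the number of sequences.
n_seqs=$(grep -c '^>' "$fasta")
echo "sequences: $n_seqs"

# Remove the temporary file.
rm "$fasta"
```

For the real file, the corresponding command would be grep -c '^>' cesy3.fasta.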

 

Then run an esearch command that tells how many cellulose synthase 3 sequences there are in total in the NCBI protein database.

 

Extra exercise for the fast ones: align the cellulose synthase 3 set with mafft

 

 mafft cesy3.fasta > cesy3_aln.fasta

 

And study the results:

 

  infoalign cesy3_aln.fasta

 showalign cesy3_aln.fasta

 

 

4. Downloading with enaDataGet

 

Check the options of enaDataGet with the command:

 

enaDataGet -h

 

Download a file (the Pythium iwayamai genome assembly):

 

enaDataGet AKYA02000000 -f fasta

gunzip AKYA02.fasta.gz

ls -ltr

 

 

Extra exercise for the fast ones: study the downloaded file:

head -20 AKYA02.fasta

tail AKYA02.fasta

infoseq_summary  AKYA02.fasta
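
If EMBOSS tools are not at hand, the length of each sequence in a FASTA file can also be summed with awk. This is a minimal sketch on a made-up two-contig file; contig_1 and contig_2 are invented names, not entries of AKYA02.fasta.

```shell
# Made-up two-record nucleotide FASTA file for the demonstration.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>contig_1
ACGTACGTAC
GTAC
>contig_2
GGGCCC
EOF

# For each record, add up the lengths of its sequence lines and print
# "name length" when the next header (or the end of the file) is reached.
lengths=$(awk '
    /^>/ { if (name) print name, len; name = substr($1, 2); len = 0; next }
         { len += length($0) }
    END  { if (name) print name, len }
' "$fasta")
echo "$lengths"

# Remove the temporary file.
rm "$fasta"
```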

 

Then compare the cellulose synthase 3 sequences against the genome using BLAST

pb tblastn -query cesy3.fasta -dbnuc AKYA02.fasta -out blast_result.txt

 

 

 

B. Using Allas

 

Upload case 1.  rclone

 

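Upload the data from Puhti to Allas with rclone. Run the commands in your training0XX directory, where the pythium directory is located, and open the Allas connection first (module load allas, then allas-conf) if you have not done so in this session. Use the command below (replace XX):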

 

rclone -P copyto pythium allas:training0XX-genomes-rc/

 

 

Then study what you have uploaded to Allas with the commands below (replace XX):

 

     rclone lsd allas:

  rclone ls allas:training0XX-genomes-rc/

  rclone lsl allas:training0XX-genomes-rc/

  rclone lsf allas:training0XX-genomes-rc/
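
To avoid editing XX in every command, the account name can be stored once in a shell variable. The sketch below is only a convenience suggestion and not part of the original exercise: ACCOUNT and BUCKET are made-up variable names, and training027 is just an example value. It prints the resulting commands instead of running them.

```shell
# Set the account name once (training027 is only an example value).
ACCOUNT=training027
BUCKET="${ACCOUNT}-genomes-rc"

# Print the listing commands with the bucket name filled in.
echo "rclone ls allas:${BUCKET}/"
echo "rclone lsl allas:${BUCKET}/"
echo "rclone lsf allas:${BUCKET}/"
```

In an interactive session the same variable can be used directly, for example rclone ls "allas:${BUCKET}/".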

 

Check how this looks in the Pouta web interface. Open a browser and go to: https://pouta.csc.fi/

 

In the Pouta interface, go to the “Object Store” section and list the buckets (called “Containers” there).

Locate your own training0XX-genomes-rc bucket and download one of the uploaded FASTA files to your local computer.

 

Upload case 2. a-put

 

Upload the pythium directory from Puhti to Allas using the following commands

(replace XX with your account number)

 

Case 1: Store everything in one object

 a-put pythium

 a-list

 a-list 2002389-puhti-SCRATCH

 a-info 2002389-puhti-SCRATCH/training0XX/pythium.tar.zst

 

Case 2: Each subdirectory (species) as one object

 a-put pythium/*

 a-list 2002389-puhti-SCRATCH/training0XX

 a-check pythium/*

 a-info 2002389-puhti-SCRATCH/training0XX/pythium/pythium_vexans.tar.zst
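
The object names used in Case 1 and Case 2 follow a visible pattern: the project number gives the bucket name and the rest of the scratch path gives the object name. The sketch below only mirrors that pattern as inferred from the names in this exercise. default_object_name is a hypothetical helper, not a-put's own code, and training027 is an example account.

```shell
# Hypothetical helper: reproduces the naming pattern seen in this exercise.
# It is NOT how a-put is implemented, only an illustration of the result.
default_object_name() (
    path=$1                             # absolute path under /scratch/project_<number>/
    rest=${path#/scratch/project_}      # e.g. 2002389/training027/pythium
    project=${rest%%/*}                 # e.g. 2002389
    subpath=${rest#*/}                  # e.g. training027/pythium
    echo "${project}-puhti-SCRATCH/${subpath}.tar.zst"
)

# Case 1: the whole directory becomes one object.
default_object_name /scratch/project_2002389/training027/pythium

# Case 2: each subdirectory becomes its own object.
default_object_name /scratch/project_2002389/training027/pythium/pythium_vexans
```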

 

Case 3: Use your own bucket name

 a-put pythium/* -b training0XX-genomes-ap

 a-list training0XX-genomes-ap

 

Case 4: Upload files without compression.

 

a-put --nc  pythium/pythium_vexans/bwaindex/* -b training0XX-a_vexans_bwa

 

 a-list training0XX-a_vexans_bwa

 

Can you see the difference between the four a-put commands above?

 

Study the training0XX-genomes-ap bucket with the commands:

 

a-list training0XX-genomes-ap

rclone ls allas:training0XX-genomes-ap

 

Why do the two commands above list a different number of objects?

 

Try the command:

 

a-info training0XX-genomes-ap/pythium_vexans.tar.zst

 

which is actually the same as:

 

rclone cat allas:training0XX-genomes-ap/pythium_vexans.tar.zst_ameta
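
The _ameta object read above is stored next to the data object in the same bucket, which is one likely reason why two listings of one bucket can differ in length. The sketch below illustrates this on a made-up listing in the "size name" layout of rclone ls. The object sizes and the second species name are invented for the illustration.

```shell
# Made-up object listing in the "size name" layout that rclone ls prints.
listing=$(cat <<'EOF'
   105332 pythium_arrhenomanes.tar.zst
      412 pythium_arrhenomanes.tar.zst_ameta
    98211 pythium_vexans.tar.zst
      398 pythium_vexans.tar.zst_ameta
EOF
)

# Count all objects, then only those that are not metadata objects.
total=$(echo "$listing" | grep -c '')
data_only=$(echo "$listing" | grep -vc '_ameta$')
echo "all objects: $total, data objects: $data_only"
```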

 

 

Finally, try the command:

 

 a-flip pythium/pythium_vexans/pythium_vexans.fasta

 

Try opening the public link that a-flip produced in your browser.

 

 

Upload case 3. allas-backup

Run the commands:

allas-backup --help

allas-backup pythium

allas-backup list

What did these commands do for your data?

Exit

The data in the pythium directory is now stored in Allas in several ways, so we can remove it from Puhti and log out:

 

rm -r pythium

exit

 

C. Downloading data from Allas to Puhti

 

1. Log in to puhti.csc.fi

 

Linux/mac

 

   ssh training0XX@puhti.csc.fi   (replace XX with your account number)

 

 

Windows/PuTTY

 

   host: puhti.csc.fi

 

   login as: training0XX  (replace XX with your account number)

 

 

In Puhti, check your environment with the command:

 

  csc-workspaces

 

Switch to the personal scratch directory of your project

 

   cd /scratch/project_2002389/training0XX

 

 

Set up the Allas connection:

 

module load allas

allas-conf

 

 

Then run the commands:

 

a-list

rclone lsd allas:

 

a-list training0XX-genomes-ap

rclone ls allas:training0XX-genomes-ap

 

 

 

a-find pythium_vexans.fasta

a-find -a pythium_vexans.fasta

 

 

Next download the data in different ways:

 

1. Download with rclone

 

mkdir rclone_dir

cd rclone_dir/

 

Example 1: copy everything

mkdir all

rclone ls allas:training0XX-genomes-rc

rclone copyto -P allas:training0XX-genomes-rc all/

ls -l all

 

Example 2: copy a set of objects

mkdir vexans

rclone copyto allas:training0XX-genomes-rc/pythium_vexans vexans/

ls -l vexans

 

Example 3: copy just one object

rclone copyto allas:training0XX-genomes-rc/pythium_vexans/pythium_vexans.fasta ./vexans.fasta

ls -l

 

 

2. Download with a-get

 

Return to your training0XX directory:

 

cd ..

 

Check that you are in the right place:

 

pwd

 

The pwd command should print /scratch/project_2002389/training0XX
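
The same check can be done automatically. The sketch below assumes that a personal directory always has the form /scratch/project_<digits>/training<digits>, as in this exercise. in_training_dir is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds only for a path of the form
# /scratch/project_<digits>/training<digits>.
in_training_dir() {
    echo "$1" | grep -Eq '^/scratch/project_[0-9]+/training[0-9]+$'
}

# Report whether the current directory looks like a personal training directory.
if in_training_dir "$PWD"; then
    echo "OK: $PWD"
else
    echo "Not in a training directory: $PWD"
fi
```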

 

Make a new directory:

 

mkdir a_dir

cd a_dir/

 

Create the directory all and move into it:

 mkdir all

 cd all

 

List your default scratch bucket:

 

a-list 2002389-puhti-SCRATCH

a-list 2002389-puhti-SCRATCH/training0XX

 

 

Look for the file pythium_vexans.fasta in the Puhti SCRATCH bucket:

a-find pythium_vexans.fasta -b  2002389-puhti-SCRATCH

 

Download the full dataset with the command:

 

a-get 2002389-puhti-SCRATCH/training0XX/pythium.tar.zst

 

And check what you got:

 

ls -l

ls -R

 

Now get just one genome dataset:

 

cd ..

a-get 2002389-puhti-SCRATCH/training0XX/pythium/pythium_vexans.tar.zst

ls -l pythium/

ls -l pythium/pythium_vexans/

 

 

 

 

3. Downloading data from allas-backup

Return to your training0XX scratch directory and make a new directory:

cd ..
mkdir a_backup
cd a_backup/

Use the commands below to find the ID of the most recent backup version of your pythium directory:

allas-backup list
allas-backup list | grep training0XX
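
Picking the newest ID from such a listing can be scripted. The sketch below works on a made-up listing and rests on two assumptions that should be checked against the real allas-backup list output: that the ID is the first column and that backups are listed oldest first. The IDs, times and host names are invented, and training027 is an example account.

```shell
# Made-up backup listing; the column layout and ordering are assumptions.
listing=$(cat <<'EOF'
ID        Time                 Host          Paths
-----------------------------------------------------------------------------------------
1a2b3c4d  2020-02-03 10:15:02  puhti-login1  /scratch/project_2002389/training027/pythium
5e6f7a8b  2020-02-03 11:40:37  puhti-login1  /scratch/project_2002389/training027/pythium
9c0d1e2f  2020-02-03 11:52:10  puhti-login1  /scratch/project_2002389/training031/pythium
EOF
)

# Keep only the lines of one account, take the last one (assumed newest),
# and print its first column (assumed to be the ID).
latest_id=$(echo "$listing" | grep training027 | tail -n 1 | awk '{ print $1 }')
echo "latest backup ID: $latest_id"
```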

Then use allas-backup restore to download the data (replace ID-string with the ID you found):

allas-backup restore ID-string
ls -l
ls -l pythium