wc - count the number of lines, words, characters
tail - see the last N lines of a file
head - see the first N lines of a file
cat - print out entire file to screen
sed - ‘stream editor’ - edit data stream on the fly
curl - downloading tool for web/ftp data streams
which - list path to a program based on the $PATH
pwd - print the current working directory
ps - processes running on the system
man - view manual pages about a command or program
date - date and time
time - prefix a command/program, report how long it took to run
find - find files/folders by name or other property
du - reports disk usage (e.g. how big a file or folder is)
awk - a simple language for processing files/great for column delimited data
^ means “Control key”
^C
^D (End of File message)
^E
^A
!! - execute the last command
history - see the list of previous commands
!NUMBER - execute the command with that number from the list
!g - run the last command that started with g
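A few of the commands listed above can be tried safely on a throwaway file. This is only a minimal sketch: the file name numbers.txt and the temporary folder are made up for illustration and are not part of the class data.

```shell
# Build a small throwaway file to experiment on (numbers.txt is a hypothetical name)
workdir=$(mktemp -d)
seq 1 100 > "$workdir/numbers.txt"

# Count lines only (-l); reading via < keeps wc from printing the file name
line_count=$(wc -l < "$workdir/numbers.txt" | tr -d ' ')
echo "lines: $line_count"

# First and last three lines of the file
first_three=$(head -n 3 "$workdir/numbers.txt")
last_three=$(tail -n 3 "$workdir/numbers.txt")
echo "$first_three"
echo "$last_three"

# Find the file by name underneath the temporary folder
found=$(find "$workdir" -name 'numbers.txt')
echo "$found"
```

Running man on any of these (for example man head) shows the full list of options.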
UNIX allows multiple processes (programs) to run at the same time. When you run a program at the command line, the command line is blocked until that program finishes.
Try this command, which simply pauses for 30 seconds.
sleep 30
Jobs run in the foreground by default. While a job is running, press ^Z to suspend it.
sleep 30
^Z
[1]+ Stopped sleep 30
To keep the job running but move it to the background, use the command bg.
$ bg
[1]+ sleep 30 &
If you want to bring the job back to the foreground so you can interact with it or cancel it, use the command fg.
$ fg
fg
sleep 30
To launch a job directly into the background, put an & at the end of the command.
$ sleep 30 &
It can still be brought to the foreground with fg.
A typical use is launching a tool with a graphical interface that you will interact with, such as an editor. The shell prints the job number and the process id of the command that is running.
$ emacs &
[1] 25341
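Inside a script, the same background mechanism can be combined with wait to pause until the job has finished. The following is a small sketch; the variable names bg_pid and bg_status are illustrative choices.

```shell
# Launch a job in the background; $! holds the process id of the most recent background job
sleep 2 &
bg_pid=$!
echo "started background job with PID $bg_pid"

# The shell is free to do other work while the job runs
echo "doing other work..."

# wait blocks until that job finishes and returns its exit status
wait "$bg_pid"
bg_status=$?
echo "background job finished with status $bg_status"
```

At an interactive prompt, the jobs command lists what is currently running in the background.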
To copy a file use the command cp.
$ touch one.txt # create an empty file
$ cp one.txt two.txt # copy one file to another
$ mkdir books # make a directory
$ cp one.txt books # copy into a directory
$ ls books # list the contents
one.txt
$ cp books more_books # copy the folder, will fail
cp: books is a directory (not copied).
$ cp -r books more_books # recursive copy succeeds
$ cp one.txt two.txt books # can copy more than one at a time
# will also OVERWRITE the previous
# one.txt that was in the folder
$ ls books
$ ls more_books
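The overwrite behavior mentioned in the comments above is easy to miss, so here is a small sketch that makes it visible by putting different text in the file before each copy. It runs in a temporary folder so nothing of yours is touched; the file contents are invented for the demonstration.

```shell
workdir=$(mktemp -d)
cd "$workdir"

echo "first version" > one.txt
mkdir books
cp one.txt books            # books/one.txt now holds "first version"

echo "second version" > one.txt
cp one.txt books            # silently replaces books/one.txt

copied=$(cat books/one.txt)
echo "$copied"

# A recursive copy duplicates the folder and everything inside it
cp -r books more_books
ls more_books
```

If you would rather be asked before a file is replaced, cp -i prompts for confirmation.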
The command rsync can also be used to copy files between folders or between computers.
Here we copy a file called LOCALFILE from your laptop onto the HPCC and put it in the folder bigdata, which is located in your home directory.
[your laptop] $ rsync -a --progress LOCALFILE USER@cluster.hpcc.ucr.edu:bigdata/
You can also specify an explicit path (one that starts with /).
[your laptop] $ rsync -a --progress LOCALFILE USER@cluster.hpcc.ucr.edu:/bigdata/gen220/USER/
You can also copy FROM the HPCC to your local computer:
[your laptop] $ rsync -a --progress USER@cluster.hpcc.ucr.edu:/bigdata/gen220/shared/simple/yeast_gene_names.txt .
Moving files is just renaming them.
$ mv one.txt three.txt # rename one.txt to three.txt
$ mv three.txt books # relocate three.txt to books folder
$ cd books
$ mv one.txt two.txt three.txt .. # move these files back UP one directory
$ ls # nothing in the 'books' directory
$ cd .. # go back
$ ls # these files are in the current folder
one.txt two.txt three.txt books more_books
$ ls books # is now empty, we moved everything
# out of there
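How does UNIX determine what program to run? The shell looks through the directories listed in $PATH, in order, and runs the first matching program it finds.
Try typing echo $PATH to see the directories that are searched.
Also run env to see all environment variables.
You can use the command which to tell you where a program is located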
which nano # will tell you where the nano program
# is located
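The search order can be inspected directly. Below is a minimal sketch; the directory $HOME/bin is a hypothetical location used only to show how prepending works, and it does not need to exist.

```shell
# Print each directory in $PATH on its own line, in the order the shell searches them
echo "$PATH" | tr ':' '\n'

# command -v is the shell's built-in counterpart of which
ls_path=$(command -v ls)
echo "ls resolves to: $ls_path"

# Prepending a directory makes the shell look there first
# ($HOME/bin is hypothetical; the change lasts only for this shell session)
PATH="$HOME/bin:$PATH"
first_dir=$(echo "$PATH" | cut -d: -f1)
echo "first directory searched: $first_dir"
```

Because the first match wins, a program placed in an earlier directory hides one of the same name in a later directory.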
To search a file for matches, the grep command is really useful and powerful. Here we show a short example; more details come later.
Remember that we can redirect output with > which will overwrite a file, and >> which will append to the end.
If instead of storing the output we want to pass it from one program into another, we use | (the pipe).
For example, count the lines, words, and characters in this message:
echo "This is a long sentence" | wc
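The difference between the two redirects and the pipe can be checked with a short experiment. This sketch writes to a temporary file (message.txt is a made-up name) and uses wc -w to count only words.

```shell
workdir=$(mktemp -d)
out="$workdir/message.txt"

echo "This is a long sentence" > "$out"     # > creates the file (or overwrites it)
echo "This is a second line" >> "$out"      # >> appends to the end
echo "Starting over" > "$out"               # > again: the earlier content is gone

lines=$(wc -l < "$out" | tr -d ' ')
echo "lines in file: $lines"

# The pipe sends echo's output straight into wc; -w counts words only
words=$(echo "This is a long sentence" | wc -w | tr -d ' ')
echo "words: $words"
```

After the final > the file holds a single line, and the piped sentence contains 5 words.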
Let’s get a few datasets together and try out some simple tools.
If you haven’t already cloned the GitHub data repository for the class examples, do this:
git clone https://github.com/biodataprog/GEN220_data.git
You can also browse this dataset at https://github.com/biodataprog/GEN220_data
Now let's run some basic UNIX tools on these data.
cd GEN220_data
ls # see what is in the folder, notice there is a 'data' folder
cd data
# take a look at one of the files
more codon_table.txt # Also see it here https://github.com/biodataprog/GEN220_data/blob/main/data/codon_table.txt
# let's see how many codons are in the codon table
wc -l codon_table.txt
# now let's see how many codons there are which code for Valine
grep Valine codon_table.txt # see how many show up
# now let's just report the count of matching lines
grep Valine codon_table.txt | wc -l
# note that grep is also really useful - you can tell it just to report the number of lines so
# the following also works
grep -c Valine codon_table.txt
# the matching lines can also be passed to sort to put them in order
grep cine codon_table.txt
grep cine codon_table.txt | sort
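If you do not have the repository at hand, the same grep ideas can be tried on a tiny stand-in file. The table below is invented for illustration and the real codon_table.txt may be laid out differently.

```shell
workdir=$(mktemp -d)
# A tiny stand-in table (hypothetical layout: codon, then amino acid name)
cat > "$workdir/mini_codons.txt" <<'EOF'
GTT Valine
GTC Valine
GTA Valine
GTG Valine
CTT Leucine
GGT Glycine
ATG Methionine
EOF

# -c reports the number of matching lines instead of the lines themselves
valine_count=$(grep -c Valine "$workdir/mini_codons.txt")
echo "Valine codons: $valine_count"

# -i ignores case, so a lowercase pattern still matches
valine_nocase=$(grep -ci valine "$workdir/mini_codons.txt")

# 'cine' matches Leucine and Glycine; sort puts the matching lines in order
cine_sorted=$(grep cine "$workdir/mini_codons.txt" | sort)
echo "$cine_sorted"
```

Note that grep matches anywhere in the line, so a short pattern such as cine can hit several different names.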
On the UCR HPCC many applications are installed through a UNIX module system. Loading a module adds that program to your path and in some cases sets other environment variables.
For example, to get access to the BLAST suite:
$ which blastn
/usr/bin/which: no blastn in ....
$ module load ncbi-blast
$ which blastn
/opt/linux/centos/7.x/x86_64/pkgs/ncbi-blast/2.2.30+/bin/blastn
There are multiple versions installed on the system
module avail ncbi-blast
---- /opt/linux/centos/7.x/x86_64/modules ----------------------------
ncbi-blast/2.2.22+ ncbi-blast/2.2.30+(default) ncbi-blast/2.6.0+
ncbi-blast/2.2.25+ ncbi-blast/2.2.31+ ncbi-blast/2.7.1+
ncbi-blast/2.2.26 ncbi-blast/2.3.0+ ncbi-blast/2.8.0+
ncbi-blast/2.2.26+ ncbi-blast/2.4.0+ ncbi-blast/2.8.1+
ncbi-blast/2.2.29+ ncbi-blast/2.5.0+ ncbi-blast/2.9.0+
You can load a specific version
module load ncbi-blast/2.9.0+
which blastn
/opt/linux/centos/7.x/x86_64/pkgs/ncbi-blast/2.9.0+/bin/blastn
module unload ncbi-blast
which blastn
/usr/bin/which: no blastn in ....
See what versions of modules you currently have loaded
module list
Currently Loaded Modulefiles:
1) slurm/19.05.0 4) texlive/2017 7) geos/3.7.1 10) R/3.6.0 13) iigb_utilities/1 16) miniconda3/4.3.31
2) openmpi/4.0.1-slurm-19.05.0 5) pandoc/2.0 8) gdal/2.1.3 11) perl/5.20.2 14) pbzip2/1.1.12 17) neovim/0.2.1
3) ggobi/2.1.11 6) netcdf/4.4.1.1 9) hdf5/1.10.1 12) less-highlight/1.0 15) parallel/20151222 18) ncbi-blast/2.9.0+
curl is useful for downloading from remote sites. URLs can be FTP, HTTP, or HTTPS.
$ curl https://www.uniprot.org/uniprot/E3Q6S8.fasta
>tr|E3Q6S8|E3Q6S8_COLGM RNAse P Rpr2/Rpp21/SNM1 subunit domain-containing protein
OS=Colletotrichum graminicola (strain M1.001 / M2 / FGSC 10212) GN=GLRG_02386 PE=4 SV=1
MAKPKSESLPNRHAYTRVSYLHQAAAYLATVQSPTSDSTTNSSQPGHAPHAVDHERCLET
NETVARRFVSDIRAVSLKAQIRPSPSLKQMMCKYCDSLLVEGKTCSTTVENASKGGKKPW
ADVMVTKCKTCGNVKRFPVSAPRQKRRPFREQKAVEGQDTTPAVSEMSTGAD
$ curl -OL https://www.uniprot.org/uniprot/E3Q6S8.fasta
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 345 100 345 0 0 724 0 --:--:-- --:--:-- --:--:-- 724
$ curl -o myseqfile.fa https://www.uniprot.org/uniprot/E3Q6S8.fasta
The -L option is often used with curl to follow URL redirects. There are also resuming options (-C) and ways to pass in a username/password for authenticated sites.
> - write the output to a file (create it if it does not exist, overwrite it if it does)
$ curl https://www.uniprot.org/uniprot/E3Q6S8.fasta > E3Q6S8.fa
>> - write the output to a file (create it if it does not exist) but append to the end of the file
$ echo "my name is " >> what_is_my_name
$ echo "Joe" >> what_is_my_name
$ cat what_is_my_name
my name is
Joe
< - redirect INPUT from a file. We’ll talk about this more later, but this is how we might feed a set of commands into a program that expects input
$ R --no-save < My_R_commands.R
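Input redirection works with any program that reads STDIN, not only R. A small sketch with standard tools follows; the file fruit.txt and its contents are made up for the demonstration.

```shell
workdir=$(mktemp -d)
printf 'banana\napple\ncherry\n' > "$workdir/fruit.txt"

# sort reads the file on STDIN via <, and > stores the sorted result in a new file
sort < "$workdir/fruit.txt" > "$workdir/fruit.sorted.txt"
cat "$workdir/fruit.sorted.txt"

# Reading on STDIN means wc never sees a file name, so it prints only the count
n=$(wc -l < "$workdir/fruit.txt" | tr -d ' ')
echo "lines: $n"
```

Input and output redirection can be combined on one command line, as in the sort step.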
File compression can save disk space and reduce file transfer time when copying between computers.
gzip - GNU zip compression. Compresses a single file at a time.
pigz - a parallelized gzip that can use multiple processors
$ pigz file.txt
$ du -h data/Nc20H.expr.tab # report how big the file is
656K data/Nc20H.expr.tab
$ pigz data/Nc20H.expr.tab # to compress
$ du -h data/Nc20H.expr.tab.gz # report size of compressed file
236K data/Nc20H.expr.tab.gz
$ pigz -d data/Nc20H.expr.tab.gz # to uncompress
bzip2 - Bzip compression. Better compression than gzip but slower.
$ module load pbzip2
$ bzip2 data/Nc20H.expr.tab # compress with bzip2
$ du -h data/Nc20H.expr.tab.bz2 # report size of bzipped file
204K data/Nc20H.expr.tab.bz2
$ bunzip2 data/Nc20H.expr.tab.bz2
zcat, zmore and bzcat, bzmore read compressed files on the fly
$ zmore file.gz
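The round trip of compressing, checking the size, and reading the content back can be sketched with plain gzip. The file expr.tab here is a made-up stand-in filled with numbers, not the class data file.

```shell
workdir=$(mktemp -d)
# Repetitive text compresses well; expr.tab is a hypothetical stand-in file
seq 1 5000 > "$workdir/expr.tab"
original_bytes=$(wc -c < "$workdir/expr.tab" | tr -d ' ')

# -c writes the compressed stream to STDOUT, leaving the original file in place
gzip -c "$workdir/expr.tab" > "$workdir/expr.tab.gz"
compressed_bytes=$(wc -c < "$workdir/expr.tab.gz" | tr -d ' ')
echo "original: $original_bytes bytes, compressed: $compressed_bytes bytes"

# gzip -dc decompresses to STDOUT (the job zcat does) without changing the file on disk
first_line=$(gzip -dc "$workdir/expr.tab.gz" | head -n 1)
echo "first line: $first_line"
```

Plain gzip file.txt, by contrast, replaces the original with file.txt.gz, which is what the pigz example above does.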
du - disk usage
$ du -h /bigdata/gen220/shared/data_files/S_cerevisiae_ORFs.fasta
11M /bigdata/gen220/shared/data_files/S_cerevisiae_ORFs.fasta
$ du /bigdata/gen220/shared/data_files/S_cerevisiae_ORFs.fasta
10752 /bigdata/gen220/shared/data_files/S_cerevisiae_ORFs.fasta
$ du -h --time /bigdata/gen220/shared/data_files/S_cerevisiae_ORFs.fasta
Can also be used on folders to summarize the total size of contents of a folder.
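Summarizing folders might look like the sketch below. The folder names project, small, and large are invented, and the exact sizes printed depend on the filesystem.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/project/small" "$workdir/project/large"
seq 1 10 > "$workdir/project/small/a.txt"
seq 1 200000 > "$workdir/project/large/b.txt"

# -s prints one summary line per argument instead of one per subfolder; -h is human-readable
du -sh "$workdir/project"

# -k reports kilobytes so the sizes can be sorted numerically; the largest ends up last
du -sk "$workdir/project"/* | sort -n
biggest=$(du -sk "$workdir/project"/* | sort -n | tail -n 1 | cut -f2)
echo "largest subfolder: $biggest"
```

Sorting the du output is a quick way to find what is filling up a directory.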
The pipe operator | lets you redirect output to another program instead of to a file. Specifically, the STDOUT of the first program becomes the STDIN of the next. This is very powerful and allows you to chain together different processes.
$ zcat data/Nc20H.expr.tab.gz | wc -l
# output from zcat is printed to STDOUT and that is redirected
# to the command wc with the -l option which in turn
# expects input on STDIN.
# Output from a program can be fed to a pager like less
$ blastn -help | less
$ fasta36 query db | more
$ fasta36 query db | tee report.out | more
tee is a program which reads from STDIN and writes it BOTH to a file and to STDOUT. It is a way to monitor a program's messages as they appear while still capturing all of them in a file.
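A quick way to see tee doing both jobs is to count what passes through it and what lands in the file. In this sketch seq stands in for a real program, and report.out lives in a temporary folder.

```shell
workdir=$(mktemp -d)

# seq stands in for a program that prints messages; tee saves a copy and passes them on
shown=$(seq 1 5 | tee "$workdir/report.out" | wc -l | tr -d ' ')
saved=$(wc -l < "$workdir/report.out" | tr -d ' ')
echo "lines passed along: $shown, lines saved: $saved"

# tee -a appends to the file instead of overwriting it
echo "done" | tee -a "$workdir/report.out" > /dev/null
total=$(wc -l < "$workdir/report.out" | tr -d ' ')
echo "lines in report after append: $total"
```

Both counts match because tee passes every line through unchanged while writing its copy.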
Multiple pipes can be used, and by building them up step by step we can start to construct a series of queries. We will go into this in more detail in the next lecture, but here we process a file and capture the output.
# shows top ten results from a blast report
$ zcat data/blast.out.gz | head -n 10
# returns the total number of unique items found in column 1
$ zcat data/blast.out.gz | awk '{print $1}' | sort | uniq | wc -l
# take output from blast program, compress it on the fly to a new file
$ blastn -query query.fa -db db.fa -outfmt 6 | gzip -c > blastresult.gz
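The column-counting pipeline above can be rehearsed on a small invented table before running it on a real BLAST report. The file hits.tab and its three columns (query, hit, percent identity) are assumptions made for this sketch.

```shell
workdir=$(mktemp -d)
# Hypothetical tab-delimited table: query, hit, percent identity
printf 'geneA\thit1\t98.5\ngeneA\thit2\t89.0\ngeneB\thit3\t88.2\ngeneC\thit4\t99.1\ngeneB\thit5\t75.4\n' > "$workdir/hits.tab"

# Number of distinct values in column 1 (sort -u merges the sort | uniq steps)
unique_queries=$(awk '{print $1}' "$workdir/hits.tab" | sort -u | wc -l | tr -d ' ')
echo "unique queries: $unique_queries"

# uniq -c prefixes each value with its count; sort -rn lists the most frequent first
awk '{print $1}' "$workdir/hits.tab" | sort | uniq -c | sort -rn

# awk can also filter rows: keep only hits with identity of at least 90
strong=$(awk -F'\t' '$3 >= 90' "$workdir/hits.tab" | wc -l | tr -d ' ')
echo "hits with identity >= 90: $strong"
```

uniq only collapses adjacent duplicates, which is why the sort step comes before it.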
On Biocluster there are a couple of folder structures to understand:
/rhome/USERNAME - your home directory, limited space (20gb)
/bigdata/labname/USERNAME - your ‘bigdata’ folder, bigger space (100gb+)
/bigdata/labname/shared - shared folder space for your lab
Currently everyone is in the gen220 ‘lab’ during this course, so you have access to /bigdata/gen220/shared and /bigdata/gen220/USERNAME
How much data am I using currently: https://dashboard.hpcc.ucr.edu
/scratch - local space on a cluster node which is FAST disk access but temporary (30 days)
Graphical tools: Filezilla - https://filezilla-project.org/download.php
Command-line:
# interactive SFTP client
$ sftp USERNAME@cluster.hpcc.ucr.edu
# copy a file
$ scp USERNAME@cluster.hpcc.ucr.edu:fileoncluster.txt ./file-on-your-machine.txt
# copy a folder, recursively
$ scp -r USERNAME@cluster.hpcc.ucr.edu:/bigdata/gen220/shared/simple .
# rsync copies, but can check and only copy changed files
$ rsync -a --progress USERNAME@cluster.hpcc.ucr.edu:/bigdata/gen220/shared/simple .
# to copy FROM your computer TO the cluster, swap the order;
# here a folder is copied to your HOME directory on the cluster
$ scp -r simple USERNAME@cluster.hpcc.ucr.edu:
So far we have only shown how to log in to the main “head” node (cluster.hpcc.ucr.edu).
To use the 6500 CPUs we need to submit jobs through a job management (queueing) system.
We use the SLURM queueing system on the UCR HPCC. Read more here: http://hpcc.ucr.edu/manuals_linux-cluster_jobs.html
Getting an interactive shell (e.g. to get your own CPU to do work on)
$ srun --pty bash -l
$ srun --nodes 1 --ntasks 2 --mem 8gb --time 8:00:00 --pty bash -l
You can type this in on the command line:
module load ncbi-blast
module load db-ncbi
curl -O https://www.uniprot.org/uniprot/Q5T6X5.fasta
blastp -num_threads 2 -query Q5T6X5.fasta -db swissprot -out result.blastp
You can also make this a job script (call it job.sh)
#!/bin/bash
module load ncbi-blast
module load db-ncbi
curl -O https://www.uniprot.org/uniprot/Q5T6X5.fasta
blastp -num_threads 2 -query Q5T6X5.fasta -db swissprot -out result.blastp
Submit it with the following options
$ sbatch -N 1 -n 2 -p short job.sh
Requesting job resources
These can also be set INSIDE the script, on a line starting with #SBATCH
#!/bin/bash
#SBATCH --nodes 1 --ntasks 2 --mem 2gb --time 2:00:00
module load ncbi-blast
module load db-ncbi
curl -O https://www.uniprot.org/uniprot/Q5T6X5.fasta
blastp -num_threads 2 -query Q5T6X5.fasta -db swissprot -out result.blastp
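The #SBATCH options can also be written one per line, which some people find easier to read and edit. The sketch below writes such a script to a temporary folder and only checks that it parses; it does not submit anything. The job name blastp_test and the log file name are assumptions added for illustration, and SLURM_NTASKS is the variable SLURM sets inside a job when --ntasks is given.

```shell
workdir=$(mktemp -d)
cat > "$workdir/job.sh" <<'EOF'
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks 2
#SBATCH --mem 2gb
#SBATCH --time 2:00:00
#SBATCH --job-name blastp_test
#SBATCH --output blastp_test.%j.log

module load ncbi-blast
module load db-ncbi

# Use the task count SLURM granted; fall back to 2 when the script is run by hand
CPUS=${SLURM_NTASKS:-2}

curl -O https://www.uniprot.org/uniprot/Q5T6X5.fasta
blastp -num_threads "$CPUS" -query Q5T6X5.fasta -db swissprot -out result.blastp
EOF

# bash -n parses the script without running any of it
bash -n "$workdir/job.sh" && echo "job.sh parses cleanly"
directive_count=$(grep -c '^#SBATCH' "$workdir/job.sh")
echo "SBATCH directives: $directive_count"
```

Reading the thread count from the environment keeps the blastp call in step with the --ntasks request if you change it later. On the cluster the script would then be submitted with sbatch job.sh.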