Introduction to Computing
Overview
This session will introduce you to the following:
- Understanding your computer’s organization
- Using the command line
- Regular expressions
- Installing standard software packages
For a more in-depth look at these concepts (and more!), I highly recommend the excellent book Practical Computing for Biologists by Haddock and Dunn. This book is a thorough, entry-level overview of computing concepts, showcased with hands-on exercises. The book’s accompanying website is also regularly updated with important tips, examples, and errata.
Materials
You can download slides for today’s session here. You may (will!) find it helpful to download a text editor that can handle regular expressions.
- Mac Users: Download BBEdit or Sublime Text 3
- Linux/VM/Windows Users: Download Sublime Text 3
Command line exercises I
- Launch a terminal session and enter the command pwd to determine which directory you are in. If you are not in your home directory, navigate there by simply typing cd (without an argument, this takes you home!).
- Create a new directory called lacawac using the command mkdir. Navigate into this directory with cd and enter pwd to confirm your successful directory change.
- List the contents of this directory using the command ls -la, which includes two flags: -l to list in long format and -a to show hidden files. What are the contents of this new directory (even though it “should” be empty)?
- Create a file called myfile.txt using this command: echo "This is my new file." > myfile.txt
- Use less to examine the contents of myfile.txt. Is it what you expected?
- Now issue the command echo "oh man more words" > myfile.txt and again examine the contents. What changed, and why?
- Finally, issue the command echo "the last words" >> myfile.txt and re-examine the file contents yet again. Think about how and why the contents are what they are.
- Now create another directory, inside lacawac/, called lodge. Copy the file myfile.txt into lodge using cp. The new file should be called mynewfile.txt.
- In theory, the contents of myfile.txt and mynewfile.txt should be identical at this point. Confirm this by running diff myfile.txt lodge/mynewfile.txt (the diff command reports the differences between two text files). If there is no difference between the files, diff will not yield any output.
- Now create some differences between the files. Issue the command echo "different now" >> lodge/mynewfile.txt (note, you’re still in the directory lacawac!). Re-run the diff command from the previous step and see how the output has changed.
- At long last, navigate into the lodge/ directory with cd. Rename the file mynewfile.txt to myfavefile.txt using the mv command. Use ls afterwards to confirm that your command worked.
- Now remove the file myfavefile.txt using rm.
- Navigate one directory up, back into lacawac/, with the command cd .. and enter pwd to confirm that this worked. Now remove the directory lodge/ using rm (Hint: which flag do you need to use?). Use ls again to confirm that the directory is gone.
- Navigate to your home directory (which is one directory up from lacawac/). Any of these commands will work: cd, cd ~, or cd .. (confirm using pwd).
- Create a new empty file called .secretfile using the touch command. Enter ls to confirm that the file was created (Hint: does ls work for this goal? How can we make ls work for this goal?).
- Finally, remove the file .secretfile and the directory lacawac/, or you may keep them around for posterity :).
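If you want to check your reasoning about redirection, copying, and hidden files, the sketch below replays the key commands from these exercises. It runs inside a throwaway directory created with mktemp -d (an assumption made here so that nothing in your home directory is touched); the file and directory names are the ones used above.

```shell
# Work in a scratch directory so the real lacawac/ is left alone.
scratch=$(mktemp -d)
cd "$scratch"

echo "This is my new file." > myfile.txt   # '>' creates the file (or empties an existing one)
echo "oh man more words" > myfile.txt      # '>' again: the previous contents are replaced
echo "the last words" >> myfile.txt        # '>>' appends to the end instead
cat myfile.txt                             # two lines: "oh man more words", "the last words"

mkdir lodge
cp myfile.txt lodge/mynewfile.txt
# diff exits with status 0 and prints nothing when the files are identical.
diff myfile.txt lodge/mynewfile.txt && echo "files are identical"

echo "different now" >> lodge/mynewfile.txt
# diff now prints the extra line and exits with a non-zero status.
diff myfile.txt lodge/mynewfile.txt || echo "files differ"

rm -r lodge          # -r is required to remove a directory and its contents
touch .secretfile
ls                   # names starting with '.' are hidden, so .secretfile is not listed
ls -a                # -a also shows hidden entries: ., .., and .secretfile
```

The difference between > and >> is the one to remember: the first overwrites, the second appends.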
Regular expression exercises
Use regular expressions to change each string and/or file contents to the desired target. Feel free to perform these operations using your regex-enabled text editor.
To check that you have succeeded in changing file contents, use the diff command. Again, this UNIX command shows differences between two text files. If there is no output from the diff command, then the two files are identical. Use diff as follows:
diff a.txt b.txt
Protip: Not all whitespace is created equal! Just because two files look the same to you doesn’t mean they are the same.
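To see why whitespace matters, here is a minimal sketch with two made-up files (a.txt and b.txt in a scratch directory, both hypothetical) that look identical on screen but differ by a single trailing space.

```shell
scratch=$(mktemp -d)
printf 'ACGT\n'  > "$scratch/a.txt"   # no trailing whitespace
printf 'ACGT \n' > "$scratch/b.txt"   # one trailing space, invisible in most editors

# diff prints the differing lines and exits with a non-zero status.
diff "$scratch/a.txt" "$scratch/b.txt" || echo "files differ"

# od -c prints every character, so the extra space before \n becomes visible.
od -c "$scratch/b.txt"
```

When diff reports a difference you cannot see, a character-level dump like od -c is one way to track it down.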
- Use regular expressions to change the contents of the FASTA sequence file exampleGFP_raw.fasta to match the contents of the file exampleGFP_clean.fasta.
- Use regular expressions to change the contents of the file data1_raw.txt to match the contents of the file data1_clean.txt.
  - Once you have succeeded, try to perform all replacements using a single regular expression.
- Generate a single masterful regular expression that can match a phone number in any of the formats below and convert it into this format: (XXX)-XXX-XXXX
  555-555-1234
  555.555.4321
  555 555 4321
  (555).555.6789
  (555)-555-0123
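If you get stuck on the phone number exercise, the following is one possible pattern, sketched with sed -E rather than a text editor (an assumption; your editor may write backreferences as $1 instead of \1). The file phones.txt is a hypothetical input holding the five example formats. Try the exercise yourself first!

```shell
scratch=$(mktemp -d)
# Hypothetical input file containing the five formats listed in the exercise.
printf '%s\n' '555-555-1234' '555.555.4321' '555 555 4321' \
  '(555).555.6789' '(555)-555-0123' > "$scratch/phones.txt"

# \(? and \)?   optional literal parentheses around the area code
# ([0-9]{3})    a captured block of three digits (four for the last block)
# [-. ]         exactly one separator: hyphen, period, or space
sed -E 's/^\(?([0-9]{3})\)?[-. ]([0-9]{3})[-. ]([0-9]{4})$/(\1)-\2-\3/' \
  "$scratch/phones.txt" > "$scratch/phones_clean.txt"

cat "$scratch/phones_clean.txt"
```

This pattern only covers the five formats shown; numbers with no separator or with a country code would pass through unchanged.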
Command line exercises II
Important: Depending on your system, you may need to use grep with the flag -E, as in grep -E, which stands for extended regular expressions. This flag will allow you to use more flexible regexes. Alternatively, you can use the command egrep (equivalent to grep -E) if your system has it.
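As a small illustration of what -E changes, the sketch below runs the same pattern through grep with and without the flag. The three input lines are made up for this example.

```shell
# Without -E (basic regular expressions), + is just a literal plus sign,
# so only the line "a+b" matches.
basic=$(printf 'ab\naab\na+b\n' | grep -c 'a+b')

# With -E (extended regular expressions), + means "one or more of the
# preceding item", so "ab" and "aab" match and "a+b" does not.
extended=$(printf 'ab\naab\na+b\n' | grep -cE 'a+b')

echo "basic: $basic matching line(s); extended: $extended matching line(s)"
```

The same holds for other quantifiers such as ? and {n}: with plain grep they are treated as ordinary characters unless escaped.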
Set I
For these exercises, we will use the file example.fastq (source). This is a FASTQ file, which contains raw NGS reads and quality scores. FASTQ files are formatted as follows:
@cluster_2:UMI_ATTCCG # record name; starts with '@'
TTTCCGGGGCACATAATCTTCAGCCGGGCGC # DNA sequence
+ # empty line; starts with '+'
9C;=;=<9@4868>9:67AA<9>65<=>591 # phred-scaled quality scores
Therefore, each raw read is represented by four lines, beginning with a unique ID that always starts with @.
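To make the four-line structure concrete, here is a minimal sketch that writes a tiny two-record FASTQ file and pulls out specific lines of each record with awk. The file tiny.fastq and its second record are made up for illustration; they are not taken from example.fastq.

```shell
scratch=$(mktemp -d)
# Two illustrative records, four lines each.
cat > "$scratch/tiny.fastq" <<'EOF'
@cluster_2:UMI_ATTCCG
TTTCCGGGGCACATAATCTTCAGCCGGGCGC
+
9C;=;=<9@4868>9:67AA<9>65<=>591
@cluster_8:UMI_CTTGGG
TATCCTTGCAATACTCTCCGAACGGGAGAGC
+
1/04.72,(003,-2-22+00-12./.-.4-
EOF

# NR is the current line number; the remainder after dividing by 4
# tells us which of the four record lines we are on.
awk 'NR % 4 == 1' "$scratch/tiny.fastq"   # record names
awk 'NR % 4 == 2' "$scratch/tiny.fastq"   # DNA sequences
```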
- Use head to examine the beginning of the FASTQ file. To see more than the default 10 lines, specify the flag -n with a numeric argument, e.g. head -n 50 example.fastq
- Our first goal is to determine the number of reads in this file. Do this in two ways:
  - Use the command wc to determine the number of lines (hint: this isn’t the number of reads, but it kind of is!). You might find the flag -l useful, as in wc -l
  - You’ll notice that all IDs start with the string @cluster. Use grep to search for occurrences of this pattern at the beginning of a line. Use grep in two ways:
    - Use grep with the -c flag to count all occurrences.
    - Use grep without any flag, and pipe the output to wc -l
- Now search for all occurrences of + in the file (you’ll need to escape this character!) that begin a line. Is this the same as the number of raw reads? Why or why not?
  - Run this command to find the problematic lines: grep "^\+\S" example.fastq
    Make sure you understand what the regular expression I used here means!
- You should also have seen that each read is named (in regex) as "@cluster_\d+:UMI_\w{6}" (before proceeding, make sure you understand this regex). Here, the final 6 nucleotides are a unique sequence identifier. Use grep to find out how many reads have an identifier that ends with “GGG”. Note that there are many ways to accomplish this goal! IMPORTANT: You will need to use grep as grep -E for this question!
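The counting approaches above can be sketched on a small made-up file before you try them on example.fastq. Everything in tiny.fastq below is illustrative, so the counts will differ from the real file. The sketch also assumes the portable classes [0-9] and [ACGT] in place of \d and \w, since \d is not part of POSIX extended regular expressions.

```shell
scratch=$(mktemp -d)
# Three illustrative records; the last quality line deliberately starts with '@'.
cat > "$scratch/tiny.fastq" <<'EOF'
@cluster_2:UMI_ATTCCG
TTTCCGGGGCACATAATCTTCAGCCGGGCGC
+
9C;=;=<9@4868>9:67AA<9>65<=>591
@cluster_8:UMI_CTTGGG
TATCCTTGCAATACTCTCCGAACGGGAGAGC
+
1/04.72,(003,-2-22+00-12./.-.4-
@cluster_12:UMI_GGTCAA
GCAGTTTAAGATCATTTTATTGAAGAGCAAG
+
@=<;9:8<=>?@A<;:9876543210/.-,+
EOF

# Way 1: four lines per read, so reads = lines / 4.
lines=$(wc -l < "$scratch/tiny.fastq")
reads_from_lines=$(( lines / 4 ))

# Way 2: count lines that begin with the ID prefix.
reads_grep_c=$(grep -c '^@cluster' "$scratch/tiny.fastq")
reads_grep_wc=$(grep '^@cluster' "$scratch/tiny.fastq" | wc -l | tr -d ' ')

# Searching for '^@' alone overcounts, because quality lines may also start with '@'.
at_lines=$(grep -c '^@' "$scratch/tiny.fastq")

# Reads whose six-letter identifier ends in GGG (needs -E for the {3} quantifier).
ggg_reads=$(grep -cE '^@cluster_[0-9]+:UMI_[ACGT]{3}GGG$' "$scratch/tiny.fastq")

echo "reads: $reads_from_lines / $reads_grep_c / $reads_grep_wc"
echo "lines starting with @: $at_lines; identifiers ending in GGG: $ggg_reads"
```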
Set II
For these exercises, we will use the tab-delimited file sampleinfo.tsv, which contains example sample information for an RNAseq experiment setup. Again, examine this file with less (or more, etc.) before proceeding. It’s fairly small, so you can also easily open it in a text editor.
- Use the cut command to print out only the first column from this file.
- Now download the file sampleinfo.csv. This file contains the same information, but it is delimited with commas instead of tabs. In other words, this is a CSV. Again use the cut command to print out only the first column.
- Return to the original sampleinfo.tsv file. Now, pipe the output from your cut command into sort to sort the first column.
- In a single command (use pipes!), print out the sorted, unique values in the 4th column of this file. You can/should build up this command in small chunks:
  - First extract column 4
  - Sort column 4 (Hint: use sort -f to ignore case!)
  - Make column 4 unique
  - Merge commands with the | symbol
- Finally, modify the previous pipey command with yet another pipe to determine how many unique entries exist in the 4th column.
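The pipeline described above can be sketched on a made-up table first. The files demo.tsv and demo.csv and all their values are hypothetical; the real sampleinfo.tsv will have different contents (and may have a header row, which this sketch does not handle).

```shell
scratch=$(mktemp -d)
# Made-up sample sheet with four tab-separated columns.
printf 's3\ttreated\t1\tliver\ns1\tcontrol\t1\tbrain\ns2\tcontrol\t2\tliver\ns4\ttreated\t2\theart\n' \
  > "$scratch/demo.tsv"
# The same table with commas instead of tabs.
tr '\t' ',' < "$scratch/demo.tsv" > "$scratch/demo.csv"

cut -f 1 "$scratch/demo.tsv"            # tab is cut's default delimiter
cut -d ',' -f 1 "$scratch/demo.csv"     # -d sets a different delimiter
cut -f 1 "$scratch/demo.tsv" | sort     # first column, sorted: s1 s2 s3 s4

# Build the pipeline one stage at a time: extract, sort, de-duplicate, count.
# uniq only collapses adjacent duplicates, which is why sort comes first.
cut -f 4 "$scratch/demo.tsv" | sort -f | uniq      # brain, heart, liver
n_unique=$(cut -f 4 "$scratch/demo.tsv" | sort -f | uniq | wc -l | tr -d ' ')
echo "$n_unique unique values in column 4"
```

Note that sort -f ignores case only when ordering; uniq itself is case-sensitive, so values that differ only in case are still reported separately (GNU and BSD uniq accept -i to ignore case).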