[Solutions] Introduction to Computing
This page contains the solutions for exercises from Day 2 of the workshop.
[Solutions] Command Line exercises I
- Launch a terminal session and enter the command
pwd
to determine which directory you are in. If you are not in your home directory, then navigate there by simply typingcd
(without an argument this takes you home!)
cd
- Create a new directory called
lacawac
using the commandmkdir
. Navigate into this directory withcd
and enterpwd
to confirm your successful directory change.
mkdir lacawac cd lacawac pwd
- List the contents of this directory using the command
ls -la
, which includes two flags:-l
to list in long format and-a
to list to show hidden files. What are the contents of this new directory (even though it “should” be empty)?
ls -la # The directory should contain . and .., which stand for the current directory and the directory one level up, respectively.
- Create a file called
myfile.txt
using this command:echo "This is my new file." > myfile.txt
echo "This is my new file." > myfile.txt
- Use
less
to examine the contents ofmyfile.txt
. Is is what you expected?
less # The file should contain the single sentence "This is my new file."
- Now issue this command:
echo "oh man more words" > myfile.txt
and again examine the contents. What changed, and why?
echo "oh man more words" > myfile.txt # The file should now only contain the single phrase "oh man more words" because we overwrote the previous contents with the ">" in our command.
- Finally, issue this command:
echo "the last words" >> myfile.txt
. Yet again rexamine the file contents. Think about how and why the contents are what they are.
echo "the last words" >> myfile.txt # The file should now only contain two lines, "oh man more words" and "the last words". In this case we *appended* to the original file contents by using ">>" in our command.
- Now create another directory, inside
lacawac/
, calledlodge
. Copy the filemyfile.txt
intolodge
usingcp
. The new file should be calledmynewfile.txt
.
mkdir lodge cp myfile.txt lodge/mynewfile.txt
- In theory, the contents of
myfile.txt
andmynewfile.txt
should be identical at this point. Confirm this using the commanddiff
, which determines the difference between two text files:diff myfile.txt lodge/mynewfile.txt
. If there is no difference between files, thendiff
will not yield any output.
diff myfile.txt lodge/mynewfile.txt # No output shoud result!
- Now create some differences between the files. Issue the command (note, you’re still in the directory
lacawac
!)echo "different now" >> lodge/mynewfile.txt
. Re-run thediff
command from step #10 and see how the output has changed.
echo "different now" >> lodge/mynewfile.txt # The output now reads: "2a3 \\ different now", indicating the difference between these files.
- At long last, navigate into the
lodge/
directory withcd
. Rename the filemynewfile.txt
tomyfavefile.txt
using themv
command. Usels
afterwards to confirm that your command worked.
cd lodge mv mynewfile.txt myfavefile.txt ls
- Now remove the file
myfavefile.txt
usingrm
.
rm myfavefile.txt
- Navigate one directory up, back into
lacawac/
, with the commandcd ..
. Enterpwd
to confirm that this worked. Now remove the directorylodge/
usingrm
(Hint: which flag do you need to use?). Usels
again to confirm that the directory is gone.
cd .. pwd rm -r lodge/ ls
-
Navigate to your home directory (which is one directory up from
lacawac/
). Use any of these commands:cd cd ~ cd ..
Confirm using pwd
.
- Create a new empty file called
.secretfile
using thetouch
command. Enterls
to confirm that the file was created (Hint: doesls
work for this goal? how can we makels
work for this goal?).
touch .secretfile ls # won't appear! ls -a # appears now
- Finally, remove the file
.secret
and the directorylacawac/
, or you may keep them around for posterity :).
cd .. rm -r lacawac
Regular expression exercises
Use regular expressions to change each string and/or file contents to the desired target. Feel free to perform these operations using your regex-enabled text editor.
To check that you have succeeded in changing file contents, use the diff
command. Again, this UNIX command shows differences between two text files. If there is no output from the diff
command, then the two files are identical. Use diff
as follows:
diff a.txt b.txt
Protip: Note all whitespace is created equal! Just because two files look the same to you doesn’t mean they are the same.
- Use regular expressions to change the contents of the FASTA sequence file exampleGFP_raw.fasta to match the contents of the file exampleGFP_clean.fasta.
This single regular expression will change all headers:
[Search] ^(>\w+)\.*\d*=.+\[(\w+) (\w+)\]\.* [Replace] \1_\2_\3
- Use regular expressions to change the file contents of the file data1_raw.txt to match the contents of the file data1_clean.txt.
Note: These steps are only one of many solutions! Also of interest, before determining these commands, I opted to “Show Invisibles” to see the type of whitespace in the file.## Step 1: Change all tabs commas: [Search] \t [Replace] , ## Step 2: The remaining space to a comma [Search] # can't see, but there's a literal space (made w/ spacebar) here [Replace] , ## Step 3: Change all decimals to have a single decimal point [Search] (\d+\.\d{1})\d+ [Replace] \1 ## Step 4: Remove decimal entirely from the date field [Search] (:\d+)\.\d [Replace] \1
-
Generate a single masterful regular expression that can match a phone number in any of the following formats below and convert it to a phone number into this format:
(XXX)-XXX-XXXX
.555-555-1234 555.555.4321 555 555 4321 (555).555.6789 (555)-555-0123
[Search] \(*(\d{3})\)*[\. -]*(\d{3})[\. -]*(\d{4}) [Replace] (\1)-\2-\3
Command line exercises II
Important: Depending on your system, you may need to use grep
with the flag -E
, as in grep -E
, which stands for extended regular expressions. This flag will allow you to use more flexible regex’s. Alternatively, you can use the command egrep
(stands for grep -E
) if your system has it.
Set I
For these exercises, we will use the file example.fastq (source). This is FASTQ file, which contains raw NGS reads and quality scores. FASTQ files are formatted in this format:
@cluster_2:UMI_ATTCCG # record name; starts with '@'
TTTCCGGGGCACATAATCTTCAGCCGGGCGC # DNA sequence
+ # empty line; starts with '+'
9C;=;=<9@4868>9:67AA<9>65<=>591 # phred-scaled quality scores
Therefore, each raw read is represented by four lines, beginning with a unique ID that always starts with @
.
- Use
head
to examine the beginning contents of the FASTQ file. To see more than the default 10 lines, specify the flag-n
with a numeric argument, i.e.head -n 50 example.fastq
.
head example.fastq head -n 20 example.fastq
- Our first goal is to determine the number of reads in this file. Do this in two ways:
- Use the command
wc
to determine the number of lines (hint: this isn’t the number of reads, but it kind of is!). You might find the flagwc -l
useful.
wc -l example.fastq # Returns 1000. Divide this by 4 for the corret answer (250).
- You’ll notice that all IDs start with the string
@cluster
. Usegrep
to search for occurrences of this pattern at the beginning of a line. Use grep in two ways:- Use grep with the
-c
flag to count all occurrences.
grep -c "^@cluster" example.fastq # should return 250
- Use grep without any flag, and pipe the output to
wc -l
.
grep "^@cluster" example.fastq | wc -l # again, should return 250
- Use grep with the
- Use the command
-
Now search for all occurrences of
+
in the file (you’ll need to escape this character!) that begin a line. Is this the same as the number of raw reads? Why or why not?grep "^+" example.fastq | wc -l # The result is 251, not 250! Why? This is because the + symbol can be part of a quality score! Indeed, one of the reads' quality scores begins with +.
- Run this command to find the problematic lines:
grep "^\+\S" example.fastq
. Make sure you understand what the regular expression I used here means!
grep "^\+\S" example.fastq ## Find all lines beginning with a + followed by non-whitespace # Result: +670.98;.+460.1533=09;3481/14*3 # This result is the quality score which threw off our count.
- Run this command to find the problematic lines:
-
You should also have seen that each read is named (in regex) as
"@cluster_\d+:UMI_\w{6}"
(before proceeding, make sure you understand this regex). Here, the final 6 nucleotides are a unique sequence identifier. Usegrep
to find out how many reads have an identifier that ends with “GGG”. Note that there are many way to accomplish this goal! IMPORTANT: You will need to use grep asgrep -E
for this question!Several options for commands are below. The result for all is 8.
grep -E "@cluster_\d+:UMI_\w{3}GGG" example.fastq | wc -l grep -cE "@cluster_\d+:UMI_\w{3}GGG" example.fastq grep -cE "UMI_\w{3}GGG$" example.fastq grep -cE "^@cluster.+\w{3}GGG$" example.fastq
Set II
For these exercises, we will use the tab-delimited file file sampleinfo.tsv, which contains example sample information for an RNAseq experiment setup. Again, examine this file with less
(or more
, etc.) before proceeding. It’s fairly small, so you can also easily open it.
- Use the
cut
command to print out only the first column from this file.
cut -f 1 sampleinfo.tsv
- Now download the file sampleinfo.csv. This file contains the same information, but it is delimited with commas instead of tabs. In other words, this is a CSV. Again use the
cut
command to print out only the first column.
cut -f 1 -d , sampleinfo.csv
- Return the original sampleinfo.tsv file. Now, pipe the output from your
cut
command intosort
to sort the first column.
cut -f 1 sampleinfo.tsv | sort
- In a single command (use pipes!), print out the sorted, unique values in the 4th column of this file. You can/should build up this command in small chunks:
- First extract column 4
- Sort column 4 (Hint: use as
sort -f
to ignore case!) - Make column 4 unique
- Merge commands with the
|
symbol
cut -f 4 sampleinfo.tsv | sort -f | uniq
- Finally, modify the previous pipey command with yet another pipe to determine how many unique entries in the 4th column exist.
cut -f 4 sampleinfo.tsv | sort -f | uniq | wc -l