Functional Ecological Genomics 2017

Materials for Stephanie Spielman's sessions

[Solutions] Introduction to Computing



This page contains the solutions for exercises from Day 2 of the workshop.

[Solutions] Command Line exercises I

  • Launch a terminal session and enter the command pwd to determine which directory you are in. If you are not in your home directory, then navigate there by simply typing cd (without an argument this takes you home!)
cd
  • Create a new directory called lacawac using the command mkdir. Navigate into this directory with cd and enter pwd to confirm your successful directory change.
mkdir lacawac
cd lacawac
pwd
  • List the contents of this directory using the command ls -la, which includes two flags: -l to list in long format and -a to show hidden files. What are the contents of this new directory (even though it “should” be empty)?
ls -la
# The directory should contain . and .., which stand for the current directory and the directory one level up, respectively.
  • Create a file called myfile.txt using this command: echo "This is my new file." > myfile.txt
echo "This is my new file." > myfile.txt 
  • Use less to examine the contents of myfile.txt. Is it what you expected?
less myfile.txt
# The file should contain the single sentence "This is my new file."
  • Now issue this command: echo "oh man more words" > myfile.txt and again examine the contents. What changed, and why?
echo "oh man more words" > myfile.txt
# The file should now only contain the single phrase "oh man more words" because we overwrote the previous contents with the ">" in our command.
  • Finally, issue this command: echo "the last words" >> myfile.txt. Yet again re-examine the file contents. Think about how and why the contents are what they are.
echo "the last words" >> myfile.txt
# The file should now contain two lines, "oh man more words" and "the last words". In this case we *appended* to the original file contents by using ">>" in our command. (A short recap sketch of ">" vs ">>" appears at the end of this exercise set.)
  • Now create another directory, inside lacawac/, called lodge. Copy the file myfile.txt into lodge using cp. The new file should be called mynewfile.txt.
mkdir lodge
cp myfile.txt lodge/mynewfile.txt
  • In theory, the contents of myfile.txt and mynewfile.txt should be identical at this point. Confirm this using the command diff, which determines the difference between two text files: diff myfile.txt lodge/mynewfile.txt. If there is no difference between files, then diff will not yield any output.
diff myfile.txt lodge/mynewfile.txt
# No output should result!
  • Now create some differences between the files. Issue the command (note, you’re still in the directory lacawac!) echo "different now" >> lodge/mynewfile.txt. Re-run the diff command from step #10 and see how the output has changed.
echo "different now" >> lodge/mynewfile.txt
diff myfile.txt lodge/mynewfile.txt
# The output now reads "2a3" followed by "> different now", indicating that the second file has an extra line ("different now") that the first does not.
  • At long last, navigate into the lodge/ directory with cd. Rename the file mynewfile.txt to myfavefile.txt using the mv command. Use ls afterwards to confirm that your command worked.
cd lodge
mv mynewfile.txt myfavefile.txt
ls
  • Now remove the file myfavefile.txt using rm.
rm myfavefile.txt
  • Navigate one directory up, back into lacawac/, with the command cd .. (two dots). Enter pwd to confirm that this worked. Now remove the directory lodge/ using rm (Hint: which flag do you need to use?). Use ls again to confirm that the directory is gone.
cd ..
pwd
rm -r lodge/
ls
  • Navigate to your home directory (which is one directory up from lacawac/). Use any of these commands:

       cd
       cd ~
       cd ..
    

Confirm using pwd.

  • Create a new empty file called .secretfile using the touch command. Enter ls to confirm that the file was created (Hint: does ls work for this goal? how can we make ls work for this goal?).
touch .secretfile
ls # won't appear!
ls -a # appears now
  • Finally, remove the file .secretfile and the directory lacawac/, or you may keep them around for posterity :).
rm .secretfile
rm -r lacawac
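
As a quick recap of the ">" vs ">>" behavior used throughout these exercises, here is a minimal sketch. It uses a throwaway file name (scratch.txt) and cat, which is assumed to be available (it is on essentially every UNIX system); less works just as well for viewing.

echo "first" > scratch.txt     # ">" creates the file, or overwrites it if it already exists
echo "second" >> scratch.txt   # ">>" appends a second line instead of overwriting
cat scratch.txt                # prints both lines: "first" then "second"
rm scratch.txt                 # clean up the throwaway file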

Regular expression exercises

Use regular expressions to change each string and/or file contents to the desired target. Feel free to perform these operations using your regex-enabled text editor.

To check that you have succeeded in changing file contents, use the diff command. Again, this UNIX command shows differences between two text files. If there is no output from the diff command, then the two files are identical. Use diff as follows:

diff a.txt b.txt

Protip: Not all whitespace is created equal! Just because two files look the same to you doesn’t mean they are the same.
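
If two files look identical to you but diff insists otherwise, one way to reveal the invisible characters from the command line is cat with the -e and -t flags (both the GNU and BSD versions of cat support them): line endings are shown as $ and tabs as ^I. This is just a convenience sketch; the “Show Invisibles” option in your regex-enabled text editor accomplishes the same thing.

cat -et a.txt       # tabs appear as ^I and each line ends with $
cat -et b.txt
diff a.txt b.txt    # any whitespace mismatch flagged here should now be visible above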

  1. Use regular expressions to change the contents of the FASTA sequence file exampleGFP_raw.fasta to match the contents of the file exampleGFP_clean.fasta.

This single regular expression will change all headers:

[Search]     ^(>\w+)\.*\d*=.+\[(\w+) (\w+)\]\.*
[Replace]    \1_\2_\3
  2. Use regular expressions to change the contents of the file data1_raw.txt to match the contents of the file data1_clean.txt.
Note: These steps are only one of many solutions! Also of interest, before determining these commands, I opted to “Show Invisibles” to see the type of whitespace in the file.
## Step 1: Change all tabs to commas:
  [Search]	\t
  [Replace]   , 

## Step 2: Change the remaining space to a comma
  [Search]	   # can't see, but there's a literal space (made w/ spacebar) here 
  [Replace]  ,

## Step 3: Truncate all decimals to a single digit after the decimal point
  [Search]	(\d+\.\d{1})\d+
  [Replace]   \1

## Step 4: Remove the decimal entirely from the date field
  [Search]	(:\d+)\.\d
  [Replace]   \1      
  3. Generate a single masterful regular expression that can match a phone number in any of the formats below and convert it into this format: (XXX)-XXX-XXXX.

     555-555-1234
     555.555.4321
     555 555 4321
     (555).555.6789
     (555)-555-0123
    
  [Search]	\(*(\d{3})\)*[\. -]*(\d{3})[\. -]*(\d{4})
  [Replace]   (\1)-\2-\3     
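
If you would like to test this pattern outside of a text editor, the sketch below pushes two of the example numbers through sed -E. Note that sed does not understand \d, so the character class [0-9] stands in for it; this assumes a sed that supports -E (both GNU sed and the sed shipped with macOS do).

printf '555-555-1234\n(555).555.6789\n' | \
    sed -E 's/\(*([0-9]{3})\)*[. -]*([0-9]{3})[. -]*([0-9]{4})/(\1)-\2-\3/'
# Expected output:
# (555)-555-1234
# (555)-555-6789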

Command line exercises II

Important: Depending on your system, you may need to use grep with the flag -E, as in grep -E, which stands for extended regular expressions. This flag will allow you to use more flexible regex’s. Alternatively, you can use the command egrep (stands for grep -E) if your system has it.

Set I

For these exercises, we will use the file example.fastq (source). This is a FASTQ file, which contains raw NGS reads and quality scores. FASTQ files are formatted as follows:

@cluster_2:UMI_ATTCCG             # record name; starts with '@'
TTTCCGGGGCACATAATCTTCAGCCGGGCGC   # DNA sequence
+                                 # separator line; starts with '+'
9C;=;=<9@4868>9:67AA<9>65<=>591   # phred-scaled quality scores

Therefore, each raw read is represented by four lines, beginning with a unique ID that always starts with @.
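
Because every read occupies exactly four lines, you can sanity-check a FASTQ file straight from the shell. The sketch below assumes example.fastq sits in your current directory and that you are in a bash-like shell; the first line uses shell arithmetic on the output of wc -l, and the second uses awk to pull out every fourth line (the IDs). Both should agree with the read counts you compute in the exercises below.

echo $(( $(wc -l < example.fastq) / 4 ))   # total number of lines, divided by 4
awk 'NR % 4 == 1' example.fastq | wc -l    # print only the ID lines (1, 5, 9, ...), then count them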

  • Use head to examine the beginning contents of the FASTQ file. To see more than the default 10 lines, specify the flag -n with a numeric argument, i.e. head -n 50 example.fastq.
head example.fastq
head -n 20 example.fastq
  • Our first goal is to determine the number of reads in this file. Do this in two ways:
    • Use the command wc to determine the number of lines (hint: this isn’t the number of reads, but it kind of is!). You might find the -l flag (as in wc -l) useful.
    wc -l example.fastq
    # Returns 1000. Divide this by 4 for the correct answer (250).
    
    • You’ll notice that all IDs start with the string @cluster. Use grep to search for occurrences of this pattern at the beginning of a line. Use grep in two ways:
      • Use grep with the -c flag to count all occurrences.
      grep -c "^@cluster" example.fastq 
      # should return 250
      
      • Use grep without any flag, and pipe the output to wc -l.
      grep "^@cluster" example.fastq | wc -l 
      # again, should return 250
      
  • Now search for all occurrences of + in the file (you’ll need to escape this character!) that begin a line. Is this the same as the number of raw reads? Why or why not?

    grep "^+" example.fastq | wc -l
    # The result is 251, not 250! Why? This is because the + symbol can be part of a quality score! Indeed, one of the reads' quality scores begins with +. 
    
    • Run this command to find the problematic lines: grep "^\+\S" example.fastq. Make sure you understand what the regular expression I used here means!
    grep "^\+\S" example.fastq ## Find all lines beginning with a + followed by non-whitespace
    # Result: +670.98;.+460.1533=09;3481/14*3
    # This result is the quality score which threw off our count.
    
  • You should also have seen that each read is named (in regex) as "@cluster_\d+:UMI_\w{6}" (before proceeding, make sure you understand this regex). Here, the final 6 nucleotides are a unique sequence identifier. Use grep to find out how many reads have an identifier that ends with “GGG”. Note that there are many ways to accomplish this goal! IMPORTANT: You will need to use grep as grep -E for this question!

    Several options for commands are below. The result for all is 8.

    grep -E "@cluster_\d+:UMI_\w{3}GGG" example.fastq | wc -l
    grep -cE "@cluster_\d+:UMI_\w{3}GGG" example.fastq
    grep -cE "UMI_\w{3}GGG$" example.fastq 
    grep -cE "^@cluster.+\w{3}GGG$" example.fastq 		
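
    If \d gives your grep trouble even with -E, an awk-based alternative (a sketch, not part of the original solutions) sidesteps the pattern issue by relying on the four-line record structure instead; it assumes the ID lines end with the UMI and carry no trailing whitespace.

    awk 'NR % 4 == 1 && /GGG$/' example.fastq | wc -l   # ID lines whose identifier ends in GGG
    # Like the grep commands above, this should return 8.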
    

Set II

For these exercises, we will use the tab-delimited file sampleinfo.tsv, which contains example sample information for an RNAseq experiment setup. Again, examine this file with less (or more, etc.) before proceeding. It’s fairly small, so you can also easily open it in a text editor.
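
If you would like to see the columns lined up while you explore, one option (assuming the column utility is installed, as it is on most Linux and macOS systems, and that your shell understands the bash-style $'\t') is:

column -t -s $'\t' sampleinfo.tsv   # split on tabs only (-s $'\t') and align the columns (-t)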

  • Use the cut command to print out only the first column from this file.
cut -f 1 sampleinfo.tsv 
  • Now download the file sampleinfo.csv. This file contains the same information, but it is delimited with commas instead of tabs. In other words, this is a CSV. Again use the cut command to print out only the first column.
cut -f 1 -d , sampleinfo.csv 
  • Return to the original sampleinfo.tsv file. Now, pipe the output from your cut command into sort to sort the first column.
cut -f 1 sampleinfo.tsv  | sort
  • In a single command (use pipes!), print out the sorted, unique values in the 4th column of this file. You can/should build up this command in small chunks:
    • First extract column 4
    • Sort column 4 (Hint: use sort -f to ignore case!)
    • Make column 4 unique
    • Merge commands with the | symbol
cut -f 4 sampleinfo.tsv  | sort -f | uniq
  • Finally, modify the previous pipey command with yet another pipe to determine how many unique entries in the 4th column exist.
cut -f 4 sampleinfo.tsv  | sort -f | uniq | wc -l
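
As a small extension of that last pipeline (not part of the exercise itself), swapping uniq for uniq -c also shows how many rows fall into each unique category:

cut -f 4 sampleinfo.tsv | sort -f | uniq -c   # prefix each unique value with its number of occurrences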