Functional Ecological Genomics 2017

Materials for Stephanie Spielman's sessions

Introduction to Computing



Overview

This session will introduce you to the following:

  • Understanding your computer’s organization
  • Using the command line
  • Regular expressions
  • Installing standard software packages

For a more in-depth look at these concepts (and more!), I highly recommend the really excellent book Practical Computing for Biologists by Haddock and Dunn. This book is a thorough, entry-level overview of introductory computing concepts, showcased with hands-on exercises. The book’s accompanying website (linked above) is also regularly updated with important tips, examples, and errata.

Materials

You can download slides for today’s session here. You may (will!) find it helpful to download a Text Editor that can handle regular expressions.

Command Line exercises I

  • Launch a terminal session and enter the command pwd to determine which directory you are in. If you are not in your home directory, navigate there by typing cd (without an argument, cd takes you home!).

  • Create a new directory called lacawac using the command mkdir. Navigate into this directory with cd and enter pwd to confirm your successful directory change.

  • List the contents of this directory using the command ls -la, which includes two flags: -l to list in long format and -a to show hidden files. What are the contents of this new directory (even though it “should” be empty)?

  • Create a file called myfile.txt using this command: echo "This is my new file." > myfile.txt

  • Use less to examine the contents of myfile.txt. Is it what you expected?

  • Now issue this command: echo "oh man more words" > myfile.txt and again examine the contents. What changed, and why?

  • Finally, issue this command: echo "the last words" >> myfile.txt. Yet again, re-examine the file contents. Think about how and why the contents are what they are.

  • Now create another directory, inside lacawac/, called lodge. Copy the file myfile.txt into lodge using cp. The new file should be called mynewfile.txt.

  • In theory, the contents of myfile.txt and mynewfile.txt should be identical at this point. Confirm this using the command diff, which determines the difference between two text files: diff myfile.txt lodge/mynewfile.txt. If there is no difference between files, then diff will not yield any output.

  • Now create some differences between the files. Issue the command (note, you’re still in the directory lacawac!) echo "different now" >> lodge/mynewfile.txt. Re-run the diff command from the previous step and see how the output has changed.

  • At long last, navigate into the lodge/ directory with cd. Rename the file mynewfile.txt to myfavefile.txt using the mv command. Use ls afterwards to confirm that your command worked.

  • Now remove the file myfavefile.txt using rm.

  • Navigate one directory up, back into lacawac/, with the command cd .. and enter pwd to confirm that this worked. Now remove the directory lodge/ using rm (Hint: which flag do you need to use?). Use ls again to confirm that the directory is gone.

  • Navigate to your home directory (which is one directory up from lacawac/). Use any of these commands:

       cd
       cd ~
       cd ..
    

Confirm using pwd.

  • Create a new empty file called .secretfile using the touch command. Enter ls to confirm that the file was created (Hint: does ls work for this goal? how can we make ls work for this goal?).

  • Finally, remove the file .secretfile and the directory lacawac/, or you may keep them around for posterity :).
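
If the behavior of the redirection operators or of hidden files is still unclear after working through the steps above, the following minimal sketch may help. It runs in a throwaway directory created by mktemp, and the file names notes.txt and .hidden are made up for illustration; they are not part of the exercise.

```shell
# Work in a scratch directory so nothing in lacawac/ is touched.
scratch=$(mktemp -d)
cd "$scratch"

echo "first line" > notes.txt      # > creates the file (or overwrites it)
echo "second line" > notes.txt     # > again: the old contents are replaced
echo "third line" >> notes.txt     # >> appends to the end instead
contents=$(cat notes.txt)
echo "$contents"

touch .hidden                      # names that start with a dot are hidden
plain=$(ls)                        # plain ls skips dotfiles
all=$(ls -a)                       # -a also lists ., .., and dotfiles
echo "$all"
```

Only the second and third lines survive in notes.txt, because the second > replaced the first line before >> added to the file.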

Regular expression exercises

Use regular expressions to change each string and/or file contents to the desired target. Feel free to perform these operations using your regex-enabled text editor.

To check that you have succeeded in changing file contents, use the diff command. Again, this UNIX command shows differences between two text files. If there is no output from the diff command, then the two files are identical. Use diff as follows:

diff a.txt b.txt

Protip: Not all whitespace is created equal! Just because two files look the same to you doesn’t mean they are the same.
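
As a small illustration of the whitespace point, here is a sketch with two made-up files that look alike on screen but differ in whether a tab or spaces separates the fields. The file contents are invented; only the diff behavior (exit status 0 for identical files, 1 for different ones) is standard.

```shell
scratch=$(mktemp -d)
printf 'ATGC\tsample1\n' > "$scratch/a.txt"     # a tab between the fields
printf 'ATGC    sample1\n' > "$scratch/b.txt"   # four spaces instead

# diff exits with status 0 when the files match and 1 when they differ.
if diff "$scratch/a.txt" "$scratch/b.txt" > /dev/null; then
    verdict="identical"
else
    verdict="different"
fi
echo "$verdict"
```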

  1. Use regular expressions to change the contents of the FASTA sequence file exampleGFP_raw.fasta to match the contents of the file exampleGFP_clean.fasta.

  2. Use regular expressions to change the file contents of the file data1_raw.txt to match the contents of the file data1_clean.txt.

    • Once you have succeeded, try to perform all replacements using a single regular expression.

  3. Write a single masterful regular expression (with a replacement) that matches a phone number in any of the formats below and converts it into this format: (XXX)-XXX-XXXX.

     555-555-1234
     555.555.4321
     555 555 4321
     (555).555.6789
     (555)-555-0123
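
Try the phone number exercise yourself first. If you would like to check your answer from the command line rather than in a text editor, the sketch below shows one possible pattern, run through sed -E. Using sed is an assumption here, not something the exercise requires. Text editors often write the digit class as \d and the replacement groups as $1, whereas sed uses [0-9] and \1.

```shell
# One candidate pattern: optional parentheses around the area code, then
# a '-', '.', or space as the separator. The groups \1, \2, \3 hold the
# three runs of digits.
pattern='s/^\(?([0-9]{3})\)?[-. ]([0-9]{3})[-. ]([0-9]{4})$/(\1)-\2-\3/'

converted=$(printf '%s\n' \
    '555-555-1234' \
    '555.555.4321' \
    '555 555 4321' \
    '(555).555.6789' \
    '(555)-555-0123' | sed -E "$pattern")
echo "$converted"
```

Each input line should come out in the (XXX)-XXX-XXXX form. Many other patterns would work equally well.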
    

Command line exercises II

Important: Depending on your system, you may need to use grep with the flag -E, as in grep -E, which stands for extended regular expressions. This flag allows you to use more flexible regexes. Alternatively, you can use the command egrep (equivalent to grep -E) if your system has it.
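
To make the effect of -E concrete, here is a minimal sketch. The three read names are made up (only the first appears in the materials), and the third is deliberately malformed so that it fails to match.

```shell
ids='@cluster_2:UMI_ATTCCG
@cluster_81:UMI_CTTGGG
@cluster_x:UMI_AAAGGG'

# With -E, the quantifiers + and {n} work without backslashes.
# \d is not part of POSIX extended regex syntax (GNU grep -E does not
# support it), so [0-9] is the portable way to match a digit.
matches=$(printf '%s\n' "$ids" | grep -E -c '^@cluster_[0-9]+:UMI_[ACGT]{6}$')
echo "$matches"
```

Two of the three names match; the one with x in place of the cluster number does not.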

Set I

For these exercises, we will use the file example.fastq (source). This is a FASTQ file, which contains raw NGS reads and quality scores. FASTQ files are formatted as follows:

@cluster_2:UMI_ATTCCG             # record name; starts with '@'
TTTCCGGGGCACATAATCTTCAGCCGGGCGC   # DNA sequence
+                                 # separator line; starts with '+'
9C;=;=<9@4868>9:67AA<9>65<=>591   # phred-scaled quality scores

Therefore, each raw read is represented by four lines, beginning with a unique ID that always starts with @.

  • Use head to examine the beginning contents of the FASTQ file. To see more than the default 10 lines, specify the flag -n with a numeric argument, i.e. head -n 50 example.fastq.

  • Our first goal is to determine the number of reads in this file. Do this in two ways:
    • Use the command wc to determine the number of lines (hint: this isn’t the number of reads, but it kind of is!). You might find the flag -l useful.
    • You’ll notice that all IDs start with the string @cluster. Use grep to search for occurrences of this pattern at the beginning of a line. Use grep in two ways:
      • Use grep with the -c flag to count all occurrences.
      • Use grep without any flag, and pipe the output to wc -l.
  • Now search for all occurrences of + in the file (you’ll need to escape this character!) that begin a line. Is this the same as the number of raw reads? Why or why not?

    • Run this command to find the problematic lines: grep "^\+\S" example.fastq. Make sure you understand what the regular expression I used here means!
  • You should also have seen that each read is named (in regex) as "@cluster_\d+:UMI_\w{6}" (before proceeding, make sure you understand this regex). Here, the final 6 nucleotides are a unique sequence identifier. Use grep to find out how many reads have an identifier that ends with “GGG”. Note that there are many ways to accomplish this goal! IMPORTANT: You will need to use grep as grep -E for this question!
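
The counting approaches above can be tried out on a toy file before running them on example.fastq. The sketch below builds a two-read FASTQ file in a temporary location. The first record comes from the format example above; the second record is made up, and its quality string deliberately starts with + to show why counting + lines can overcount.

```shell
fastq=$(mktemp)
cat > "$fastq" <<'EOF'
@cluster_2:UMI_ATTCCG
TTTCCGGGGCACATAATCTTCAGCCGGGCGC
+
9C;=;=<9@4868>9:67AA<9>65<=>591
@cluster_8:UMI_CTTTGA
TATCCTTGCAATACTCTCCGAACGGGAGAGC
+
+/04.72,(003,-2-22+00-12./.-.41
EOF

lines=$(wc -l < "$fastq")                      # reading stdin keeps the file name out of the output
by_lines=$((lines / 4))                        # four lines per read
by_flag=$(grep -c '^@cluster' "$fastq")        # grep counts the matching lines itself
by_pipe=$(grep '^@cluster' "$fastq" | wc -l)   # or let wc count what grep prints
plus_lines=$(grep -c '^[+]' "$fastq")          # [+] is a literal plus in any grep flavor
echo "$by_lines $by_flag $((by_pipe)) $plus_lines"
```

The three read counts agree at 2, but three lines start with +, because one of them is a quality string rather than a separator. Writing the plus as [+] is just one way to avoid escaping it.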

Set II

For these exercises, we will use the tab-delimited file sampleinfo.tsv, which contains example sample information for an RNAseq experiment setup. Again, examine this file with less (or more, etc.) before proceeding. It’s fairly small, so you can also easily open it in a text editor.

  • Use the cut command to print out only the first column from this file.

  • Now download the file sampleinfo.csv. This file contains the same information, but it is delimited with commas instead of tabs. In other words, this is a CSV. Again use the cut command to print out only the first column.

  • Return to the original sampleinfo.tsv file. Now, pipe the output from your cut command into sort to sort the first column.

  • In a single command (use pipes!), print out the sorted, unique values in the 4th column of this file. You can/should build up this command in small chunks:
    • First extract column 4
    • Sort column 4 (Hint: use sort -f to ignore case!)
    • Make column 4 unique
    • Merge commands with the | symbol
  • Finally, modify the previous pipey command with yet another pipe to determine how many unique entries in the 4th column exist.
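
The sketch below shows how cut, sort, uniq, and wc fit together. It uses a tiny two-column table that is invented for illustration (it is not sampleinfo.tsv), so you will still need to adapt the column numbers for the exercise itself.

```shell
tsv=$(mktemp)
# Hypothetical sample sheet: sample ID, then tissue.
printf 's1\tliver\n'  > "$tsv"
printf 's2\tliver\n' >> "$tsv"
printf 's3\tgill\n'  >> "$tsv"
printf 's4\tgill\n'  >> "$tsv"

first_col=$(cut -f 1 "$tsv")                        # tab is cut's default delimiter
tissues=$(cut -f 2 "$tsv" | sort | uniq)            # uniq only collapses adjacent duplicates, so sort first
n_tissues=$(cut -f 2 "$tsv" | sort | uniq | wc -l)  # one more pipe turns the list into a count

# For comma-delimited input, name the delimiter with -d.
csv_first=$(tr '\t' ',' < "$tsv" | cut -d , -f 1)

echo "$tissues"
echo "$((n_tissues))"
```

Building the pipeline one stage at a time, and checking the output after each stage, makes it much easier to see where something goes wrong.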