Day 3 Solutions
Day 3 materials can be found here
Part I: Functions
-
Write a function to compute the
GC
content of a DNA sequence. The function should accept a single argument, the DNA sequence, and return the GC percentage. Test your function with the nucleotide sequence"AGCTATAGCATAGC"
.def calc_gc(sequence): num_gc = sequence.count("G") + sequence.count("C") percentage = num_gc / len(sequence) return percentage ## Use function perc = calc_gc("AGCTATAGCATAGC")
-
Write a function that calculates the percentage of a given nucleotide from a DNA sequence. The function should accept two arguments: the nucleotide of interest and the DNA sequence. It should return the nucleotide percentage. Test your function with the nucleotide sequence
"AGCTATAGCATAGC"
.def calc_nucleotide(sequence, nucleotide): return sequence.count(nucleotide) / len(sequence) ## Use function, to count A's perc = calc_nucleotide("AGCTATAGCATAGC", "A")
-
Write a function that calculates the percentage each nucleotide in a given DNA sequence. of a given nucleotide from a DNA sequence. The function should accept a single argument, the DNA sequence, and return a dictionary containing
key:value
pairs ofnucleotide:percentage
. You can assume that the provided sequence contains only A, C, G, T. Test your function with the nucleotide sequence"AGCTATAGCATAGC"
.def calc_percentage(sequence): total = len(sequence) counts = {} for nuc in sequence: if nuc not in counts: counts[nuc] = 1/total else: counts[nuc] +=1/total return counts ### Use function seq = "AGCTATAGCATAGC" value = calc_percentage(seq)
- Write a function to guess whether a provided sequence is DNA or protein. For this task, assume that any sequence comprised of % A, C, G, T is a DNA sequence. Test your function with the following two sequences:
"AGCTATGCATACGAGCATAGC"
"AGIILLCPKLKKQWTATWCAGCATADSARCVLMKGC"
def is_it_protein(sequence): """ This information enclosed in a 3-quote block is a "docstring". It is the official place to document what the function does. This function returns `True` if the sequence is likely protein, and `False` if it is likely DNA. """ nucleotides = "ACGT" ## Determine percentage that are ACGT nuc_perc = 0. total_length = len(sequence) for nuc in nucleotides: nuc_perc += ( sequence.count(nuc)/ total_length ) ## Compare to 0.5 ## Return true if mostly nucleotides, false otherwise if nuc_perc >= 0.5: return False else: return True ### Use function seq1 = "AGCTATGCATACGAGCATAGC" is_it_protein(seq1) seq2 = "AGIILLCPKLKKQWTATWCAGCATADSARCVLMKGC" is_it_protein(seq2)
-
Modify the previous function to ignore all ambiguities in calculations. Use this list of ambiguous characters for this task:
ambig = ["B", "J", "N", "O", "X", "Z"]
. Test your function with the following sequence:"APAPPPKKLRATNNYPOPPBXXXXXNTYGCTATLMQASDFTDTCATAGC"
def is_it_protein(sequence): ## First, we can remove all ambiguities from the sequence with .replace(). ## Then, we can make our protein/DNA guess based on the "clean" sequence. ambig = ["B", "J", "N", "O", "X", "Z"] for x in ambig: ## loop to replace each amibuity with nothing (blank "") sequence = sequence.replace(x, "") nucleotides = "ACGT" ## Determine percentage that are ACGT nuc_perc = 0. total_length = len(sequence) for nuc in nucleotides: nuc_perc += ( sequence.count(nuc)/ total_length ) ## Compare to 0.5 ## Return true if mostly nucleotides, false otherwise if nuc_perc >= 0.5: return False else: return True
Part II: File Input/Output
Files used in these exercises can be downloaded from the course website. Be sure to write your scripts in the same directory as these files!
-
Open the file
file1.txt
in read-mode, and print its contents to screen. Use the.read()
method, which saves the contents of the file to a single string. Perform this task twice: once usingopen
andclose
, and once usingwith
control-flow.## Open and close f = open("file1.txt", "r") contents = f.read() f.close() print(contents) ## with with open("file1.txt", "r") as f: f.read() print(contents)
-
Open the file
file1.txt
in read-mode, and save all lines in this file to a list using the.readlines()
method. Write a new file calledupper_file1.txt
which contains the same contents offile1.txt
but in upper-case. Try to do this task using a single for-loop.## open file for reading with open("file1.txt", "r") as f: filelines = f.readlines() ## open new file to write (creates file on the fly!) with open("upper_file1.txt", "w") as f: ## loop over filelines and write its uppercase version to the `f` handle for line in filelines: f.write(line.upper())
-
Open the newly created file
upper_file1.txt
in read-mode. Loop over the file lines without using.read()
or.readlines()
, and print out lines as you loop.with open("upper_file1.txt", "r") as f: for line in f: print(line)
-
Modify the previous for-loop to only print out lines in
upper_file1.txt
which contain at least (i.e. ) 5 letterE
’s.letter = "E" threshold = 5 with open("upper_file1.txt", "r") as f: for line in f: if line.count(letter) >= threshold: print(line)
-
You should notice 20 files named
file1.txt, file2.txt, ..., file20.txt
. Write a for-loop to open each of these files (Hint: use therange()
function to loop over file names). For each file, print each line that contains more than 25 characters.characters = 25 for i in range(1, 21): ## i will be 1-20 with this range() specification thisfile = "file" + str(i) + ".txt" with open(thisfile, "r") as f: ## loop over lines, asking each time if it has > 25 characters and printing if true for line in f: if len(line) > characters: print(line)
-
Write another for-loop over the same 20 files. For each file, create a second file named
fileX_odd.txt
(where X=1-20) which contains only the odd-numbered lines from the original file. For this, use a counter in the for-loop that goes over file lines (this will count the line numbers), but be careful: Remember that python indexing starts from 0, but the first line is technically line #1!for i in range(1, 21): ## i will be 1-20 with this range() specification thisfile = "file" + str(i) + ".txt" savelines = [] with open(thisfile, "r") as f: x = 1 ## start our counter variable at 1, so that it matches "human" line numbering! for line in f: if x%2 == 1: ## is i an odd number? if so, print! savelines.append(line) x += 1 # increment the counter for numbering the next line ## create and open output file for writing to outfile = "file" + str(i) + "_odd.txt" with open(outfile, "w") as f: for line in savelines: f.write(line)
-
Convert our zoo-keeper dictionaries into a comma-separated file with the header
animal,vore,food
, and rows should contain corresponding information, i.e.lion,carnivore,meat
. Perform this task with a single for-loop.category = {"lion": "carnivore", "gazelle": "herbivore", "anteater": "insectivore", "alligator": "homovore", "hedgehog": "insectivore", "cow": "herbivore", "tiger": "carnivore", "orangutan": "frugivore"} feed = {"carnivore": "meat", "herbivore": "grass", "frugivore": "mangos", "homovore": "visitors", "insectivore": "termites"} ## Open output file for reading output_file = "animal_foods.csv" with open(output_file, "w") as f: ## First, write header to file, including new line! f.write("animal,vore,food\n") ## now loop over dictionary to get vore and food lines per animal, writing to file as we go for animal in category: vore = category[animal] itsfood = feed[ vore ] ## line to write to file, including a newline symbols linetowrite = animal + "," + vore + "," + itsfood + "\n" ## write the line to file f.write(linetowrite)
-
Create a second zoo-keeper file by converting the CSV into a tab-separated file. Perform this task by reading in the CSV, replacing commas with tabs, where tabs can be created as the string
"\t"
.infile = "animal_foods.csv" outfile = "animal_foods.txt" # note different extension with open(infile, "r") as f: first_contents = f.read() ## replace all , with \t new_contents = first_contents.replace(",", "\t") ## save to outfile with open(outfile, "w") as f: f.write(new_contents)