Day 4 Materials
Downloads
- Download Slides
- Download Exercise files
Day 4 solutions can be found here
Part I: Regular expressions
Perform these exercises in a text editor or https://pythex.org/ directly, NOT in a Python console/script.
-
Use regular expressions to change the contents of the FASTA sequence file
exampleGFP_raw.fasta
to match the contents of the fileexampleGFP_clean.fasta
. -
Generate a single masterful regular expression that can match a phone number in any of the following formats below and convert it to a phone number into this format:
(XXX)-XXX-XXXX
.555-555-1234 555.555.4321 555 555 4321 (555).555.6789 (555)-555-0123
Part II: The re
module
Perform these exercises in Python using the re
module.
-
The file
fixed_width.txt
is a fixed-width file, meaning that each column is delimited by an arbitrary number of spaces to increase human-readability. This format, however, is the bane of the programmer’s existence, as we like to have standardized delimiters. Use regular expressions to convert this file into a CSV file calledfixed_width_asCSV.csv
. - Use the
re
module to change the file contents of the filedata1_raw.txt
to match the contents of the filedata1_clean.txt
. Hints:- Split (with
.split()
) each line in the file into its individual columns, and use regular expressions (when needed) to reform each column - Join (with
.join()
) new columns back together to create the new lines - Remember that the first line of the file is the header and should be treated differently from lines containing content
- For an extra challenge, write functions for the conversion of each column
- Split (with
- Looking for more? Redo the exercises from part 1 in Python scripts, making use of the
re
module.
Part III: File manipulation
-
Download today’s exercise files, and create a new Python script in the directory called
csv_files/
. This directory contains real research data from Stephanie’s github. Files are named as:<Ensembl-gene-id>.fna.<MODEL>.csv
, where each Ensembl gene ID has two associated model CSVs, calledJTT
andWAG
.For this exercise, write a script to process these files in order to create a single new tab-separated file with the following columns:
Ensembl-gene-ID
, obtained from the file nameModel
, obtained from the file nameMLE
, obtained from the file contents. This column should the value in each file’sMLE
column that is associated with the second to last site in each file (it will be different for each gene!)
This task will require you to obtain information both from each file name, as well as from within each file. Use the
os
module to obtain all file names as part of this exercise, and any string methods that are appropriate. -
Also in today’s exercise files is a file called
sacCer3_ncbiRefSeqCurated.gtf
. This file is a Gene Transfer Format (gtf
) file obtained from the UCSC Genome Table Browser, and it is a standard file format in next-generation sequence analysis. It is a tab-delimited and contains gene information from S. cerevisiae, with the following columns (but there is no header, as the format is standard):- The chromosome
- The source of the information (in this case, all sources are the same)
- The type of sequence, also known as feature, in this case either
start_codon
,CDS
(coding sequence),exon
,stop_codon
- The starting index for the genomic feature
- The ending index for for the genomic feature
- The score for the feature (in this case all scores are
0.000000
) - The strand of the feature, either
+
for positive or-
for negative - The phase of the feature, either
.
,0
,1
, or2
. This information indicates the reading frame, specifically the number of bases that need to be skipped before the codon structure begins. A value of.
indicates that there is no specific phase information to deal with - The gene information with a specific idenfitier, in this case
gene_id
.
In this particular
gtf
file, each gene id will have four associated lines, for example:chrI sacCer3_ncbiRefSeqCurated start_codon 130799 130801 0.000000 + . gene_id "NM_001178157.1"; chrI sacCer3_ncbiRefSeqCurated CDS 130799 131980 0.000000 + 0 gene_id "NM_001178157.1"; chrI sacCer3_ncbiRefSeqCurated stop_codon 131981 131983 0.000000 . gene_id "NM_001178157.1"; chrI sacCer3_ncbiRefSeqCurated exon 130799 131983 0.000000 + . gene_id "NM_001178157.1";
Perform the following tasks using this file. For each incremental task, AVOID HARDCODING!!- Write a script to extract only lines whose feature is
CDS
, creating a new file calledsacCer3_ncbiRefSeqCurated_EXON.gtf
. Perform this task by specifically asking (usingif
) if the third column is equal to “CDS”. - Modify the previous script to again create a new file only of CDS’s, but this time do so for only sequences which fall on chromosome
chrV
. - Modify yet again to export exons from chrV, with the additional condition that you should only consider CDS’s on the positive strand
- Modify yet again, with the additional condition that you should only consider CDS’s which are base pairs in length.
- BONUS modification for the highly motivated :Modify yet again, considering only CDS’s whose gene_id is the first version of the gene. More specifically, gene ids are named as
NM_<somenumbers>.<VERSIONUMBER
. In other words, you want the number after the decimal to equal 1.
Once you have built up all this final script, modify it one last time by incorporating the
sys
module to have these five command line arguments, in this order:- chromosome of interest
- strand of interest
- length of interest
- version of interest (if you took that step above!)
- output file name
Assume we always want the
CDS
feature.Remember:
sys
always reads in arguments as strings, so you may need to convert to numbers withint()
orfloat()
.