Sample Data files

We will use several example data files throughout the class.

BED format

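Data in BED format describes genomic regions (e.g. single nucleotides or megabase regions) in a simple tab-delimited format, one region per line [1]. The first three fields are required: chromosome, start position (0-based), and end position (not included in the region).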

Download a sample BED file: lamina.bed
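
As a rough illustration of the layout described above, the following sketch reads the three required BED fields from each line. The `BedRegion` class, the `parse_bed` function, and the two example records are hypothetical names and values made up for this example; they are not taken from lamina.bed.

```python
from typing import Iterable, Iterator, NamedTuple


class BedRegion(NamedTuple):
    chrom: str
    start: int  # 0-based, first base included in the region
    end: int    # first base NOT included in the region

    @property
    def length(self) -> int:
        # Because the end is exclusive, the length is a plain difference.
        return self.end - self.start


def parse_bed(lines: Iterable[str]) -> Iterator[BedRegion]:
    """Yield one BedRegion per data line, using only the first three fields."""
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and optional track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        yield BedRegion(fields[0], int(fields[1]), int(fields[2]))


# Hypothetical records for illustration only.
example = [
    "chr1\t100\t101",          # a single nucleotide
    "chr2\t5000000\t7000000",  # a 2-megabase region
]
regions = list(parse_bed(example))
for region in regions:
    print(region.chrom, region.start, region.end, region.length)
```

To read a real file, pass an open file handle instead of the list, e.g. `parse_bed(open("lamina.bed"))`, assuming the file has been downloaded to the working directory.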

[1] BED documentation: http://genome.ucsc.edu/FAQ/FAQformat.html#format1

FASTA format

FASTA format contains only DNA sequence data, with no quality scores:

>cluster_2:UMI_ATTCCG             # record name; starts with '>'
TTTCCGGGGCACATAATCTTCAGCCGGGCGC   # DNA sequence

Download a sample FASTA file: sample.fa
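
The record structure shown above can be read with a few lines of Python. This is a minimal sketch, not a full FASTA parser: `parse_fasta` is a hypothetical helper name, and it assumes that a sequence may be wrapped over several lines, which FASTA files commonly do.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name = line[1:]
            chunks = []
        else:
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if name is not None:
        yield name, "".join(chunks)


# The record from the example above.
example = [
    ">cluster_2:UMI_ATTCCG",
    "TTTCCGGGGCACATAATCTTCAGCCGGGCGC",
]
records = list(parse_fasta(example))
for name, seq in records:
    print(name, len(seq))
```

To read a real file, pass an open file handle, e.g. `parse_fasta(open("sample.fa"))`, assuming the file has been downloaded to the working directory.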

FASTQ format

FASTQ format contains DNA sequence data with quality scores:

@cluster_2:UMI_ATTCCG             # record name; starts with '@'
TTTCCGGGGCACATAATCTTCAGCCGGGCGC   # DNA sequence
+                                 # separator line; starts with '+'
9C;=;=<9@4868>9:67AA<9>65<=>591   # phred-scaled quality scores

Download a sample FASTQ file: SP1.fq
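
To make the four-line record and the quality string more concrete, here is a small sketch that reads FASTQ records and converts each quality character to a numeric Phred score. The function names are hypothetical. Two assumptions are built in: every record occupies exactly four lines, and qualities use the Phred+33 encoding (an offset of 33 from the ASCII code), which is consistent with the characters in the example above.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    stripped = (line.rstrip("\n") for line in lines if line.strip())
    while True:
        record = list(islice(stripped, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, seq, plus, qual = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], seq, qual


def phred_scores(qual: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert a quality string to a list of integer Phred scores."""
    return [ord(char) - offset for char in qual]


def error_probability(score: int) -> float:
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-score / 10)


# The record from the example above.
example = [
    "@cluster_2:UMI_ATTCCG",
    "TTTCCGGGGCACATAATCTTCAGCCGGGCGC",
    "+",
    "9C;=;=<9@4868>9:67AA<9>65<=>591",
]
name, seq, qual = next(parse_fastq(example))
scores = phred_scores(qual)
print(name, scores[:4])  # first four scores: [24, 34, 26, 28]
```

A Phred score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 30 to a 1 in 1000 chance.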

ENCODE data

All ENCODE data are available at https://genome.ucsc.edu/ENCODE/downloads.html

For Problem Set 3, you will need the files listed below. They are available on the amc-tesla cluster in:

/vol1/opt/data

| Experiment | Target | Cell line | Replicate | File type | File name |
|---|---|---|---|---|---|
| ChIP-seq | Histone H3 Lysine 4 trimethyl (H3K4me3) | HeLa-S3 | 1 | FASTQ | wgEncodeBroadHistoneHelas3H3k4me3StdRawDataRep1.fastq.gz |
| ChIP-seq | CTCF | HeLa-S3 | 1 | narrowPeak | wgEncodeUwTfbsHelas3CtcfStdPkRep1.narrowPeak.gz |
| Merged TFBS ChIP-seq | all | all | n/a | BED | wgEncodeRegTfbsClusteredV3.bed.gz |
| Merged DNase I hypersensitive sites | all | all | n/a | BED | wgEncodeRegDnaseClusteredV2.bed.gz |
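
All of these files are gzip-compressed, and they can be read directly without decompressing them first. The sketch below counts the regions per chromosome in a compressed BED or narrowPeak file using Python's standard `gzip` module. `count_regions_per_chrom` is a hypothetical helper name, and the code assumes the file has no track or browser header lines, so that the first field of every non-empty line is the chromosome.

```python
import gzip
from collections import Counter
from pathlib import Path

DATA_DIR = Path("/vol1/opt/data")  # directory on the amc-tesla cluster


def count_regions_per_chrom(path) -> Counter:
    """Count lines per chromosome in a gzip-compressed BED-like file."""
    counts: Counter = Counter()
    # "rt" opens the compressed file in text mode, one decoded line at a time.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            if fields:
                counts[fields[0]] += 1
    return counts


peaks = DATA_DIR / "wgEncodeUwTfbsHelas3CtcfStdPkRep1.narrowPeak.gz"
if peaks.exists():  # only true when run on the cluster
    for chrom, n in count_regions_per_chrom(peaks).most_common(5):
        print(chrom, n)
```

Reading line by line keeps memory use low, which matters for the larger merged files.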