Description

This track shows Open Reading Frame annotations. For consensuses from Dfam these come directly from the Dfam 3.1 annotations stored here.

You can also view them on Dfam.org under the "features" tab.

For consensuses built by the RepeatBrowser "buildSeqs" script we used tblastn to search the RepeatMasker peptide library and took the highest scoring hit (the highest scoring hit also had to pass a score threshold of 50).

Note that the Repeat Masker Peptide Library does not contain peptides for every element. That is, L1PA2 proteins are assigned to L1HS not because of a poor match, but because L1PA2 is the representative protein for all young L1PA (i.e. there is no L1HS peptide in the library).

Methods


#for exact matches 
awk '($6=="exact")' summary.tsv  | cut -f 1 | while read i; do sed -n -e '/^NM\s*'${i}'$/,/\/\// p' Dfam.embl | grep -A1 "FT.*CDS "| tr -s "\n" "\t" | sed "s/--/\n/g" | tr -s "." " \t" | tr -s "  " "\t" | tr -s "\"" "\t" | awk -v s=$i '{print s"\t"$3"\t"$4"\t"$7}'; done > exact_annotations.tab

#for renamed matches
awk '($6=="int-strip")' summary.tsv  | cut -f 1 | while read i; do sed -n -e '/^NM\s*'${i}'$/,/\/\// p' Dfam.embl | grep -A1 "FT.*CDS "| tr -s "\n" "\t" | sed "s/--/\n/g" | tr -s "." " \t" | tr -s "  " "\t" | tr -s "\"" "\t" | awk -v s=$i '{print s"\t"$3"\t"$4"\t"$7}'; done > intstrip_annotations.tab

#for manual matches
awk '($6!="int-strip" && $6!="exact" && $6!="cons")' summary.tsv  |  cut -f 1 | while read i; do sed -n -e '/^NM\s*'${i}'$/,/\/\// p' Dfam.embl | grep -A1 "FT.*CDS "| tr -s "\n" "\t" | sed "s/--/\n/g" | tr -s "." " \t" | tr -s "  " "\t" | tr -s "\"" "\t" | awk -v s=$i '{print s"\t"$3"\t"$4"\t"$7}'; done > manual_annotations.tab


cat intstrip_annotations.bed manual_annotations.bed exact_annotations.bed | awk '{print $0"\t0\t+"}'| sort | uniq > Dfam_annotations.bed


#built cons
awk '($6=="cons")' ../summary.tsv  |  cut -f 1 | while read i; do sed -n -e '/^>'${i}'$/,/^>/ p' ~/hive/jferna10/RepeatBrowserHub/hg38reps/hg38reps.fa | head -n-1 >> built_cons.fa; done
cd peps/

#RepeatMasker pep library from Hubley
wget https://github.com/rmhubley/RepeatMasker/raw/master/Libraries/RepeatPeps.lib

#
makeblastdb -in built_cons.fa -dbtype nucl -parse_seqids

#phylogenetic groupings for TEs found in hg38 (this list was arrived at by taking the Dfam name, reduce repeat peptide libs to peptides from those groups
cut -f 1 summary.tsv | while read i ; do sed -n -e '/^NM\s*'${i}'$/,/\/\// p' Dfam.embl | grep "^OS" | cut -f2; done >  hg38_TE_phlyo.txt
cat hg38_TE_phlyo.txt | tr -s " " "\t" | cut -f2 |  sort | uniq | tr "\n" "|"

#copy and paste output to grep,reduces peptide library significantly, prevents aligning proteins from random creatures to human
grep -E  'Afrotheria|Amniota|Boreoeutheria|Carnivora|Catarrhini|Cetartiodactyla|Euarchontoglires|Euteleostomi|Eutheria|Glires|Haplorrhini|Hominidae|Homininae|Hominoidea|Homo|Mammalia|Metatheria|Metazoa|Monotremata|Primates|Rodentia|Sauropsida|Scandentia|Simiiformes|Tetrapoda|Theria|Vertebrata|Xenarthra' RepeatPeps.lib -A1 > HumanRepeatPeps.lib 
grep ">" built_cons.fa | cut -f 2 -d">" | while read i; do grep -A1 "$i" HumanRepeatPeps.lib;done > Human_built_cons_peps.lib
sed -zi "s/--\n//g" Human_built_cons_peps.lib

#find peptides in consensuses
tblastn -db built_cons.fa -query Human_built_cons_peps.lib -outfmt 6 > built_consensus_annotations.tab

cat built_consensus_annotations.tab | sort | uniq | sort -k 2,12 -n -r > temp.tab

#take the highest scoring gag and pol, (since all built consesnsus are L1s); Also impose filter that peptide score must also be above 50
grep ">" built_cons.fa | cut -f 2 -d">" | while read i; do grep -P "\t$i\t" temp.tab | grep "gag" | sort -k 12 -n -r | head -n+1 |  awk '($12> 50) {print $2"\t"$9-1"\t"$10"\t"$1"\t"$12"\t+"}';done | sort | uniq > cons_gag.bed
grep ">" built_cons.fa | cut -f 2 -d">" | while read i; do grep -P "\t$i\t" temp.tab | grep "pol" | sort -k 12 -n -r | head -n+1 |  awk '($12> 50) {print $2"\t"$9-1"\t"$10"\t"$1"\t"$12"\t+"}';done | sort | uniq > cons_pol.bed

mv *.bed ../
cd ..

#combine and change score to 0 (avoid issues with decimals etc that make browser unhappy)
cat Dfam_annotations.bed cons_gag.bed cons_pol.bed HERV_full.bed| awk '{print $1"\t"$2"\t"$3"\t"$4"\t"0"\t"$6}' | sed "s/L1PBA1/L1PBa1/g" | bedtools sort > ORFs.bed
bedToBigBed ORFs.bed ../hg38reps/hg38reps.sizes ORFs.bb

mv ORFs.bb ../hg38reps/orfs.bb

References

https://www.dfam.org/home

The Dfam database of repetitive DNA families, Robert Hubley; Robert D. Finn; Jody Clements; Sean R. Eddy; Thomas A. Jones; Weidong Bao; Arian F.A. Smit; Travis J. Wheeler;Nucleic Acids Research (2016) Database Issue 44:D81-89. doi: 10.1093/nar/gkv1272

< a href = "http://www.repeatmasker.org/libraries/"> RepeatPeps library from RepeatMasker

Email:jferna10@ucsc.edu or max@soe.ucsc.edu