IncorporateRepeats

Creating hints for AUGUSTUS from RepeatMasker output

Instead of running AUGUSTUS on a repeat-masked genome (i.e. a genome in which all nucleotides that are part of predicted repeats have been replaced by the letter "N"), we recommend that you run AUGUSTUS on the unmasked genome and supply the repeat information as nonexonpart hints.

Such hints can for example be generated from a RepeatMasker output file. The format of such a file (in the following called repeats.out) looks like this:

   SW  perc perc perc  query      position in query           matching       repeat              position in  repeat 
score  div. del. ins.  sequence    begin     end    (left)    repeat         class/family         begin  end (left)   ID

  778  16.7  0.7  2.0  chr2L           2     154 (23011390) +  HETRP_DM       Satellite             1519 1669  (203)      1
43675   0.5  0.0  0.0  chr2L       47514   52519 (22959025) +  LINEJ1_DM      LINE/Jockey              2 5007   (13)     33
(...)

Note that there are two header lines, followed by one empty line, and then the actual data rows begin (further rows are denoted with (...)). What we want to use for generating hints is everything except the first three lines of the file. Also pay attention to the beginning of the two data rows shown: some data rows start directly with a number, while others have leading whitespace before the first number.

The tab-separated hints format that we want to obtain looks like this:

chr2L  RepeatMasker  nonexonpart      2    154  0  .  .  src=RM
chr2L  RepeatMasker  nonexonpart  47514  52519  0  .  .  src=RM

There are certainly many ways to achieve this kind of formatting. In the following, we use the Unix tail and sort commands and a bit of Perl, combined in a Bash pipeline:

# tail -n +4 starts at line 4, i.e. it skips the two header lines and the empty line
tail -n +4 repeats.out | perl -ne 'chomp; s/^\s+//; @t = split(/\s+/);
print $t[4]."\t"."RepeatMasker\tnonexonpart\t".$t[5]."\t".$t[6]."\t0\t.\t.\tsrc=RM\n";' | sort -k 1,1 -k 4,4n > repeats.gff

This drops the first three lines (two header lines and one empty line) and strips the leading whitespace from each data row.
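To try the conversion without touching real data, the following is a minimal sketch of the same idea using awk instead of Perl. It assumes awk and sort are available; the scratch directory and the two-row sample file are only illustrative, and the source column is written as RepeatMasker to match the target format shown above.

```shell
# Work in a scratch directory so that no real files are overwritten.
workdir=$(mktemp -d)

# Tiny stand-in for repeats.out: two header lines, one empty line, two data rows.
cat > "$workdir/repeats.out" <<'EOF'
   SW  perc perc perc  query      position in query           matching       repeat              position in  repeat
score  div. del. ins.  sequence    begin     end    (left)    repeat         class/family         begin  end (left)   ID

  778  16.7  0.7  2.0  chr2L           2     154 (23011390) +  HETRP_DM       Satellite             1519 1669  (203)      1
43675   0.5  0.0  0.0  chr2L       47514   52519 (22959025) +  LINEJ1_DM      LINE/Jockey              2 5007   (13)     33
EOF

# awk splits on runs of whitespace and ignores leading blanks by default,
# so $5, $6 and $7 are the sequence name, begin and end of each repeat.
awk -v OFS='\t' 'NR > 3 && NF >= 7 {
    print $5, "RepeatMasker", "nonexonpart", $6, $7, 0, ".", ".", "src=RM"
}' "$workdir/repeats.out" | sort -k 1,1 -k 4,4n > "$workdir/repeats.gff"

# Every hint line should have exactly nine tab-separated columns.
awk -F'\t' 'NF != 9 { bad++ } END { exit (bad > 0) }' "$workdir/repeats.gff" \
    && echo "repeats.gff looks well-formed"
cat "$workdir/repeats.gff"
```

The NF >= 7 condition simply skips empty or truncated lines, and the final awk call is a cheap column-count check that can be run on any hints file before it is passed to AUGUSTUS.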

The last column of a hints file for AUGUSTUS must contain a source (src=...). The name of the source, here RM, must be described in the extrinsic.cfg file that you use to run the AUGUSTUS predictions with those hints! Let's assume that your original extrinsic.cfg file looked like this:

[SOURCES]
M E

exonpart    1   .992    M 1 1e+100 E 1 1
intron      1   .34     M 1 1e+100 E 1 1e5
CDSpart     1   1 0.985 M 1 1e+100 E 1 1
UTRpart     1   1 0.985 M 1 1e+100 E 1 1
nonexonpart 1   1       M 1 1e+100 E 1 1

Then you'd need to add the source RM in the following way:

[SOURCES]
M RM E

exonpart    1   .992    M 1 1e+100 RM 1 1    E 1 1
intron      1   .34     M 1 1e+100 RM 1 1    E 1 1e5
CDSpart     1   1 0.985 M 1 1e+100 RM 1 1    E 1 1
UTRpart     1   1 0.985 M 1 1e+100 RM 1 1    E 1 1
nonexonpart 1   1       M 1 1e+100 RM 1 1.01 E 1 1

Of course, when you now call AUGUSTUS, you have to specify a hintsfile that contains the repeats as hints, and you have to specify the correct extrinsic.cfg file!
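Since a source that appears in the hints file but is missing from extrinsic.cfg is an easy mistake to make, a small pre-flight check can help. The following is only a sketch: it assumes that the source names are listed on the first non-comment line after [SOURCES], as in the example above, and it uses stand-in files in a scratch directory with hypothetical minimal content.

```shell
workdir=$(mktemp -d)

# Stand-ins for the real files (hypothetical minimal content).
printf 'chr2L\tRepeatMasker\tnonexonpart\t2\t154\t0\t.\t.\tsrc=RM\n' > "$workdir/repeats.gff"
cat > "$workdir/extrinsic.cfg" <<'EOF'
[SOURCES]
M RM E

nonexonpart 1   1       M 1 1e+100 RM 1 1.01 E 1 1
EOF

# Source names used in the hints file (the src=... attribute in column 9).
used=$(cut -f 9 "$workdir/repeats.gff" | tr ';' '\n' | sed -n 's/^ *src=//p' | sort -u)

# Source names declared on the first non-empty, non-comment line after [SOURCES].
declared=$(awk '/^\[SOURCES\]/ { grab = 1; next }
                grab && NF && !/^#/ { print; exit }' "$workdir/extrinsic.cfg")

# Collect every used source that is not among the declared ones.
missing=""
for s in $used; do
    case " $declared " in
        *" $s "*) ;;                      # declared, nothing to do
        *) missing="$missing $s" ;;
    esac
done

if [ -z "$missing" ]; then
    echo "all hint sources are declared"
else
    echo "undeclared hint sources:$missing"
fi
```

With both files in place, the hints file and the configuration file are passed to AUGUSTUS via its --hintsfile and --extrinsicCfgFile options, for example (the species and genome file name are placeholders):

augustus --species=fly --hintsfile=repeats.gff --extrinsicCfgFile=extrinsic.cfg genome.fa > augustus.gff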

Page last modified on January 18, 2013, at 03:53 PM