Creating hints for AUGUSTUS from RepeatMasker output

Instead of running AUGUSTUS on a repeat-masked genome (i.e. a genome in which all nucleotides that are part of predicted repeats have been replaced by the letter "N"), we recommend that you run AUGUSTUS on the unmasked genome but supply the repeat information as hints. Such hints can, for example, be generated from a RepeatMasker output file. The format of such a file (in the following called repeats.out) looks like this:

       SW  perc perc perc  query     position in query            matching     repeat         position in repeat
    score  div. del. ins.  sequence  begin  end    (left)         repeat       class/family   begin  end   (left)  ID

      778  16.7  0.7  2.0  chr2L         2    154  (23011390)  +  HETRP_DM     Satellite       1519  1669  (203)    1
    43675   0.5  0.0  0.0  chr2L     47514  52519  (22959025)  +  LINEJ1_DM    LINE/Jockey        2  5007   (13)   33
    (...)

Note that there are two header lines, followed by one empty line, and only then do the actual data rows begin (further lines are denoted by (...)). What we want to use for generating hints is everything except the first three lines of the file. Also pay attention to the beginning of the two data lines shown: some data lines start directly with a number, while others have leading whitespace before the first number.

The tab-separated hints format that we want to obtain looks like this:

    chr2L   RepeatMasker   nonexonpart   2       154     0   .   .   src=RM
    chr2L   RepeatMasker   nonexonpart   47514   52519   0   .   .   src=RM

There are certainly many ways to achieve this kind of formatting. In the following, we use standard Unix tools:

    cat repeats.out | tail -n +4 | perl -ne 'chomp; s/^\s+//; @t = split(/\s+/); print $t[4]."\t"."RepeatMasker\tnonexonpart\t".$t[5]."\t".$t[6]."\t0\t.\t.\tsrc=RM\n";' | sort -n -k 1,1 > repeats.gff

This takes care of deleting the first three lines (tail -n +4 starts the output at the fourth line) and of stripping leading whitespace.
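The same conversion can also be sketched in Python, which may be easier to adapt than the one-liner. This is only an illustrative sketch under the assumptions stated in the text (three leading lines to skip, whitespace-separated columns, sequence name and coordinates in columns 5 to 7); the function name `repeatmasker_to_hints` and its parameters are hypothetical, and unlike the pipeline above it does not sort the output.

```python
def repeatmasker_to_hints(lines, source="RM", program="RepeatMasker"):
    """Yield tab-separated nonexonpart hint lines from RepeatMasker output lines.

    `lines` is any iterable of text lines, e.g. an open file handle.
    (Hypothetical helper, not part of AUGUSTUS or RepeatMasker.)
    """
    for index, line in enumerate(lines):
        if index < 3:
            # Skip the two header lines and the empty line that follows them.
            continue
        # str.split() without arguments splits on runs of whitespace and
        # ignores leading whitespace, so both kinds of data line work.
        fields = line.split()
        if len(fields) < 7:
            # Skip blank or truncated lines instead of emitting broken hints.
            continue
        sequence, begin, end = fields[4], fields[5], fields[6]
        yield "\t".join(
            [sequence, program, "nonexonpart", begin, end,
             "0", ".", ".", "src=" + source]
        )


# Possible usage (file names as in the text):
# with open("repeats.out") as rm_out, open("repeats.gff", "w") as gff:
#     for hint in repeatmasker_to_hints(rm_out):
#         gff.write(hint + "\n")
```

Checking the number of fields before indexing means that a stray empty line produces no hint at all, rather than a line with empty coordinate columns.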
The last column of a hints file for AUGUSTUS must contain a source (src=RM in the example above), and this source has to be declared in the extrinsic.cfg file. Suppose your extrinsic.cfg currently looks like this:

    [SOURCES]
    M E

    exonpart     1  .992     M 1 1e+100  E 1 1
    intron       1  .34      M 1 1e+100  E 1 1e5
    CDSpart      1  1 0.985  M 1 1e+100  E 1 1
    UTRpart      1  1 0.985  M 1 1e+100  E 1 1
    nonexonpart  1  1        M 1 1e+100  E 1 1

Then you'd need to add the source RM, both to the [SOURCES] list and to every feature line:

    [SOURCES]
    M RM E

    exonpart     1  .992     M 1 1e+100  RM 1 1     E 1 1
    intron       1  .34      M 1 1e+100  RM 1 1     E 1 1e5
    CDSpart      1  1 0.985  M 1 1e+100  RM 1 1     E 1 1
    UTRpart      1  1 0.985  M 1 1e+100  RM 1 1     E 1 1
    nonexonpart  1  1        M 1 1e+100  RM 1 1.01  E 1 1

Note that only the nonexonpart line gives RM a value other than 1 (namely 1.01).

Of course, when you now call AUGUSTUS, you have to specify a hints file that contains the repeats as hints, and you have to specify the correct extrinsic.cfg file!
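As a rough illustration of that final call, the following Python sketch assembles an AUGUSTUS command line that passes both files. The options `--species`, `--hintsfile` and `--extrinsicCfgFile` are standard AUGUSTUS options; the helper name `build_augustus_command` and the file names `genome.fa` and `extrinsic.RM.cfg`, as well as the species `fly`, are assumptions made purely for this example.

```python
import subprocess


def build_augustus_command(species, genome_fasta, hints_gff, extrinsic_cfg):
    """Return the argument list for an AUGUSTUS run that uses repeat hints.

    (Hypothetical helper; adjust the options to your own setup.)
    """
    return [
        "augustus",
        "--species=" + species,
        "--hintsfile=" + hints_gff,             # the repeat hints generated above
        "--extrinsicCfgFile=" + extrinsic_cfg,  # the cfg file that declares source RM
        genome_fasta,                           # the unmasked genome
    ]


command = build_augustus_command(
    species="fly",                     # assumed species identifier
    genome_fasta="genome.fa",          # assumed file name
    hints_gff="repeats.gff",
    extrinsic_cfg="extrinsic.RM.cfg",  # assumed file name
)

# To actually run the prediction and capture the output, something like:
# with open("augustus.gff", "w") as out:
#     subprocess.run(command, stdout=out, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when file paths contain spaces.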