I recently ran into a severe error that could have spoiled a project, and luckily I worked out a bit of why it happened.
I used tgicl (the TIGR sequence assembly software) to run an assembly job. tgicl called formatdb to build an index of all the sequence reads, which produced .nhr, .nin, and .nsq files.
After the assembly, I wanted to get the sequences of all singletons, so I used fastacmd to fetch them from the database formatted by tgicl. Everything ran as usual, but the output was unusual: the identifiers of the fetched sequences were strange names like the following:
>gnl|BL_ORD_ID|10 000067_0726_3676 ...descriptions...
The second word is the actual sequence name, and I had no idea where the strange first name came from. This would cause serious trouble in further processing if the name change were not taken into account.
Anyway, I found that if I run formatdb with the following parameters:
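If the fetched file is already on disk, the original names can be put back by stripping the first word from each header. As far as I understand, the gnl|BL_ORD_ID|N part is an ordinal ID that BLAST assigns when the database was built without parsing sequence IDs. Below is a minimal Python sketch, assuming every affected header has exactly the form shown above (prefix, a space, then the real name). The function names are my own and are not part of tgicl or the BLAST tools.

```python
import re

# Matches the placeholder ID at the start of a header, e.g. ">gnl|BL_ORD_ID|10 "
# Assumption: the prefix is always followed by a space or tab and the real name.
BL_ORD_PREFIX = re.compile(r"^>gnl\|BL_ORD_ID\|\d+[ \t]+")


def restore_header(line):
    """Return a FASTA header line with the gnl|BL_ORD_ID|N prefix removed."""
    return BL_ORD_PREFIX.sub(">", line)


def restore_fasta(lines):
    """Yield FASTA lines, rewriting header lines and passing sequence lines through."""
    for line in lines:
        yield restore_header(line) if line.startswith(">") else line
```

For example, restore_header(">gnl|BL_ORD_ID|10 000067_0726_3676 desc") gives ">000067_0726_3676 desc", and headers without the prefix are left unchanged.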
$ formatdb -i xxx.fasta -p F
The resulting file xxx.fasta.nhr contains all the name correspondences, stored as a single very long line! This indirectly saved my ass in this nasty issue, especially since I was racing to meet a deadline!