MPIBlastHowto
The NCBI blast program is extremely I/O intensive. Even running just a few jobs could bring the cluster to a virtual standstill because of NFS traffic to the home directories. Fortunately, BioBrew comes with a solution in the form of mpiblast. Full documentation is available at the mpiblast home page; this is just what I did to run mpiblast on espresso.
Carlos and Daniel have set up mpiformatdb databases for nr and nt, the largest NCBI databases. Here's an example of how a hypothetical user named "foo" would run a large blast job.
Prepare an SGE script called blast.qsub to run mpiblast:
#$ -M foo@hpcf.upr.edu
#$ -m abe
#$ -pe mpi 4
#$ -cwd
# SGE default shell is /bin/csh, unless you do this
#$ -S /bin/bash
time /opt/mpich/gnu/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines \
    /opt/BioBrew/NCBI/6.1.0/bin/mpiblast -i xlrhodop.fasta -p blastn -d nt -o test.out
Prepare a /home/foo/.ncbirc similar to this one:
[NCBI]
Data=/opt/BioBrew/NCBI/6.1.0/data
[BLAST]
BLASTDB=/home/blast/databases
BLASTMAT=/opt/BioBrew/NCBI/6.1.0/data
[mpiBLAST]
Shared=/home/blast/databases
Local=/state/partition1/tmp/foo
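If several users need such a file, it could be generated instead of typed by hand. The following is only a sketch: the helper name write_ncbirc is made up for illustration, and the paths are simply the ones from the example above, so adjust them if your installation differs. The example call writes to a temporary file so that nothing in a real home directory is overwritten.

```shell
# Hypothetical helper: write an .ncbirc for USER into file OUT.
# Paths are copied from the example above; they are assumptions
# about the local install, not something mpiblast requires.
write_ncbirc() {
    user=$1
    out=$2
    ncbi=/opt/BioBrew/NCBI/6.1.0
    cat > "$out" <<EOF
[NCBI]
Data=$ncbi/data
[BLAST]
BLASTDB=/home/blast/databases
BLASTMAT=$ncbi/data
[mpiBLAST]
Shared=/home/blast/databases
Local=/state/partition1/tmp/$user
EOF
}

# Example: generate the file for user "foo" in a scratch location and show it.
scratch=$(mktemp)
write_ncbirc foo "$scratch"
cat "$scratch"
rm -f "$scratch"
```

After checking the output, the same call with /home/foo/.ncbirc as the second argument would put the file in place.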
Make sure your local directory exists:
cluster-fork mkdir /state/partition1/tmp/foo
You can now submit the SGE script with the qsub command: qsub blast.qsub. Blast output will be in test.out; diagnostic output and errors will be in files like blast.qsub.o111 and blast.qsub.e111, where 111 is replaced by the job number reported by qsub for your job.
Note: xlrhodop.fasta vs nt took 24 minutes 18 seconds to complete on a single node (4 hyperthreaded processors). With six processors it took 23m26.533s, and with 10 round_robin processors, 6m57s. A local blast on a single compute node took 2 minutes. We need to retune NFS; the tuning done on ROCKS 3.1 was not preserved after the upgrade.
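To find the diagnostic files without reading the job number by eye, the qsub message can be parsed. This is a rough sketch: the helper name job_files is hypothetical, and it assumes qsub prints a line of the form 'Your job 111 ("blast.qsub") has been submitted', which may differ between SGE versions.

```shell
# Hypothetical helper: given the script name and the text printed by qsub,
# print the names of the stdout (.o) and stderr (.e) files SGE will create.
# Assumes the qsub message starts with "Your job <number> ".
job_files() {
    script=$1
    msg=$2
    id=$(printf '%s\n' "$msg" | sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p')
    # Give up if no job number could be found in the message.
    [ -n "$id" ] || return 1
    printf '%s.o%s\n%s.e%s\n' "$script" "$id" "$script" "$id"
}

# Example with a canned message instead of a real submission.
job_files blast.qsub 'Your job 111 ("blast.qsub") has been submitted'
```

On the cluster the canned message would be replaced by the real one, for example job_files blast.qsub "$(qsub blast.qsub)".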
Note: I followed the tips in the mpiblast paper and got a decent speedup. I distributed the database to every node (via rsync), then ran tblastn with the seven protein sequences obtained by running seqret tsw:ops* (about 3Kb). Walltimes were 109 minutes on a single processor and 20 minutes on 10 processors on 10 nodes!
Note: mpiblast chokes on formatdb-formatted databases (like ecoli.nt), but blast runs just fine with mpiformatdb-formatted files. Always use mpiformatdb when formatting databases.
Note: the command I used to format nr is:
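To put the walltimes above in perspective, speedup is the serial time divided by the parallel time, and efficiency is the speedup divided by the number of processors. A small sketch of that arithmetic follows; the function name speedup is made up for illustration, and the inputs are the 109-minute and 20-minute figures quoted above.

```shell
# Hypothetical helper: speedup SERIAL_MINUTES PARALLEL_MINUTES NPROCS
# Prints the speedup and the parallel efficiency as a percentage.
speedup() {
    awk -v s="$1" -v p="$2" -v n="$3" 'BEGIN {
        sp = s / p
        printf "speedup=%.2f efficiency=%.1f%%\n", sp, 100 * sp / n
    }'
}

# 109 minutes on one processor vs 20 minutes on 10 processors.
speedup 109 20 10
```

For these numbers the speedup is about 5.45 on 10 processors, so roughly half of the ideal linear scaling.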
mpiformatdb -i /home/blast/databases/ncbi_db/nr -p T -o T -n nr -N 154
The database .nal indices had to be hand-edited to delete a bogus /export pathname added to the fragment names (Carlos suggests deleting the path component completely). The actual problem is that mpiformatdb uses the .ncbirc file to name parts, and root's .ncbirc had the /export component included in the directory names.
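The hand edit described above could be scripted roughly as follows. This is a sketch only: it assumes the fragment names sit on a space-separated DBLIST line of the .nal alias file, and the helper name strip_nal_paths and the sample file contents are made up for illustration. Following Carlos's suggestion, it removes the whole directory component rather than just /export.

```shell
# Hypothetical helper: print the .nal file with the directory part
# stripped from every fragment name on the DBLIST line.
# Other lines (TITLE etc.) are passed through unchanged.
strip_nal_paths() {
    sed '/^DBLIST/ s#[^ ]*/##g' "$1"
}

# Example on a made-up alias file; a real nr.nal lists many more fragments.
demo=$(mktemp)
printf 'TITLE nr\nDBLIST /export/home/blast/databases/nr.000 /export/home/blast/databases/nr.001\n' > "$demo"
strip_nal_paths "$demo"
rm -f "$demo"
```

The function only prints the cleaned text, so the original file stays untouched until you have compared the result and replaced it yourself.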