BioDataBases
Genbank
I've set up GenBank release 139.0 on nas1 and formatted the data for use with EMBOSS. EMBOSS is installed on pijuyo and pise; I did the work on pijuyo, which mounts nas1:/home on /mnt/nas1.
The steps are (see the Admin docs for details on dbiflat and the emboss.default file):
- Get the compressed data from ftp://ftp.ncbi.nih.gov/genbank/*.gz and the release info from ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt (66 GB in release 139.0)
- Uncompress the data (128 GB in release 139.0)
- Run 'dbiflat':
# dbiflat
Index a flat file database
EMBL : EMBL
SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew
GB : Genbank, DDBJ
Entry format [SWISS]: GB
Database name: genbank
Database directory [.]:
Wildcard database filename [*.dat]: gb*.seq
Release number [0.0]: 138.0
Index date [00/00/00]: 28/01/04
- Edit emboss.default to use the new database:
DB genbank [
  method: emblcd
  format: genbank
  dir: $genbank_dir
  file: gb*.seq
  type: N
  release: "139.0"
  comment: "GenBank core sequences"
]
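The download and uncompress steps above can be scripted. The following is a minimal Python sketch, not the procedure actually used for release 139.0. The function names and the destination directory are hypothetical, and it assumes anonymous FTP access to the server named above and enough free disk space for the compressed and uncompressed files.

```python
import fnmatch
import gzip
import shutil
from ftplib import FTP
from pathlib import Path


def fetch_release(dest_dir, host="ftp.ncbi.nih.gov", remote_dir="/genbank"):
    """Download gbrel.txt and every *.gz file from remote_dir into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with FTP(host) as ftp:
        ftp.login()  # anonymous login
        ftp.cwd(remote_dir)
        for name in ftp.nlst():
            if name == "gbrel.txt" or fnmatch.fnmatch(name, "*.gz"):
                with open(dest / name, "wb") as out:
                    ftp.retrbinary("RETR " + name, out.write)


def decompress_gz(gz_path, keep_original=False):
    """Uncompress one .gz file next to itself and return the new path."""
    gz_path = Path(gz_path)
    target = gz_path.with_suffix("")  # e.g. gbbct1.seq.gz -> gbbct1.seq
    with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    if not keep_original:
        gz_path.unlink()  # free the space taken by the compressed copy
    return target


if __name__ == "__main__":
    data_dir = Path("/mnt/nas1/genbank")  # hypothetical location on nas1
    fetch_release(data_dir)
    for gz_file in sorted(data_dir.glob("*.gz")):
        decompress_gz(gz_file)
```

Deleting each compressed file as soon as it has been expanded keeps peak disk use lower than holding both full copies at once. Pass keep_original=True to keep the downloads.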
EMBOSS can also use BLAST-formatted databases for sequences. Carlos and I have to work out how best to share resources. I'm using nas1 to store GenBank; I think the BLAST databases are on nas0.
Genpept
Release 139.0 of GenBank includes a FASTA-format file, genpept.fsa, known as the GenPept database. The name will change in the next release (to gb_NNN.aa_fsa). I moved this file to a new directory, since otherwise its index files would clobber the GenBank ones, then indexed it with 'dbifasta':
# dbifasta
Index a fasta database
simple : >ID
idacc : >ID ACC
gcgid : >db:ID
gcgidacc : >db:ID ACC
dbid : >db ID
ncbi : | formats
ID line format [idacc]: ncbi
Database directory [.]:
Wildcard database filename [*.dat]: genpept*.fsa
Database name: genpept
Release number [0.0]: 139.0
Index date [00/00/00]: 30/01/04
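Choosing 'ncbi' as the ID line format tells dbifasta to expect bar-separated identifiers on each header line. As a rough illustration of what that means, the Python sketch below splits such a header into its fields. It assumes headers of the common NCBI form '>gi|number|db|accession|locus description'. The function name is hypothetical, and this is not how dbifasta itself parses the file.

```python
def parse_ncbi_header(line):
    """Split an NCBI-style FASTA header into its bar-separated fields."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    # The identifier is everything up to the first space; the rest is free text.
    ident, _, description = line[1:].strip().partition(" ")
    parts = ident.split("|")
    fields = {}
    # Walk the identifier as (tag, value) pairs, e.g. gi|12345|gb|AAA00001.1
    for i in range(0, len(parts) - 1, 2):
        if parts[i]:
            fields[parts[i]] = parts[i + 1]
    # An odd trailing field, when present, is taken as the locus name.
    locus = parts[-1] if len(parts) % 2 == 1 else ""
    return {"ids": fields, "locus": locus, "description": description}
```

Running it over the first few header lines of genpept.fsa is a quick way to confirm that the file really uses bar-separated identifiers before starting a long indexing run.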
This can be set up in emboss.default like this:
DB genpept [
  method: emblcd
  format: fasta
  dir: $genbank_dir/genpept
  file: genpept.fsa
  type: P
  release: "139.0"
  comment: "GenPept protein sequences"
]
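The genpept subdirectory matters because dbiflat and dbifasta write their index files under the same fixed names (division.lkp, entrynam.idx, acnum.trg, acnum.hit), so two databases indexed into one directory overwrite each other's indexes. Below is a small Python sketch of the move into a separate directory. It refuses to proceed if the target already holds index files. The function name and the paths in the usage note are hypothetical.

```python
import shutil
from pathlib import Path

# Core index files written by the dbi* programs (EMBL-CD style indexes).
INDEX_FILES = ("division.lkp", "entrynam.idx", "acnum.trg", "acnum.hit")


def stage_database(src_file, target_dir):
    """Move src_file into target_dir, refusing to share it with old indexes."""
    src = Path(src_file)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Any existing index file means another database is already indexed here.
    existing = [name for name in INDEX_FILES if (target / name).exists()]
    if existing:
        raise FileExistsError(f"{target} already holds index files: {existing}")
    return Path(shutil.move(str(src), str(target / src.name)))
```

For example, stage_database("/mnt/nas1/genbank/genpept.fsa", "/mnt/nas1/genbank/genpept") would give the layout that the dir: entry above expects, assuming $genbank_dir points at /mnt/nas1/genbank.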