UPR HPCf

BioDataBases

last edited 5 years ago by humberto

Genbank

I've set up Genbank release 139.0 on nas1, and formatted the data for use with EMBOSS. EMBOSS is installed on pijuyo and pise, but I did the work on pijuyo. Pijuyo mounts nas1:/home on /mnt/nas1.

The steps are as follows (see the Admin docs for details on dbiflat and the emboss.default file):

  1. Get compressed data from ftp://ftp.ncbi.nih.gov/genbank/*.gz and release info from ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt (66 GB in release 139.0)
  2. Uncompress the data (128 GB in release 139.0)
  3. Run 'dbiflat' and answer its prompts:
      # dbiflat
      Index a flat file database
          EMBL : EMBL
         SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew
            GB : Genbank, DDBJ
      Entry format [SWISS]: GB   
      Database name: genbank
      Database directory [.]: 
      Wildcard database filename [*.dat]: gb*.seq
      Release number [0.0]: 139.0
      Index date [00/00/00]: 28/01/04
    
  4. Edit emboss.default to use the new database:
      DB genbank [
       method: emblcd
       format: genbank
       dir: $genbank_dir
       file: gb*.seq
       type: N
       release: "139.0"
       comment: "GenBank core sequences"
      ]
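
Step 2 above can be scripted. What follows is a minimal Python sketch, not the procedure actually used on pijuyo: the function names and the directory arguments are hypothetical, and it assumes the downloaded files all end in .gz and sit in one directory.

```python
import gzip
import shutil
from pathlib import Path


def decompress_release(src_dir, dest_dir, pattern="*.gz"):
    """Decompress every file matching `pattern` in src_dir into dest_dir.

    Returns the list of decompressed paths. The .gz originals are left
    in place, so a failed run can simply be repeated.
    """
    src = Path(src_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for gz_path in sorted(src.glob(pattern)):
        # Drop only the final suffix: gbpri1.seq.gz -> gbpri1.seq
        out_path = dest / gz_path.with_suffix("").name
        with gzip.open(gz_path, "rb") as fin, open(out_path, "wb") as fout:
            shutil.copyfileobj(fin, fout)  # stream in chunks; the files are large
        written.append(out_path)
    return written


def free_gigabytes(path):
    """Free space on the filesystem holding `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9
```

Since release 139.0 expands to about 128 GB, calling free_gigabytes() on the destination before decompress_release() is a cheap way to avoid filling the disk halfway through.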
    

EMBOSS can also use BLAST-formatted databases for sequences. Carlos and I have to work out how best to share resources: I'm using nas1 to store genbank, and I think the BLAST databases are on nas0.

Genpept

Release 139.0 of Genbank includes genpept.fsa, a FASTA-format file known as the genpept database. The name will change in the next release (to gb_NNN.aa_fsa). I moved this file to a new directory, because otherwise its index files would clobber the genbank ones, then indexed it with 'dbifasta':

  # dbifasta
  Index a fasta database
    simple : >ID
     idacc : >ID ACC
     gcgid : >db:ID
  gcgidacc : >db:ID ACC
      dbid : >db ID
      ncbi : | formats
  ID line format [idacc]: ncbi
  Database directory [.]:
  Wildcard database filename [*.dat]: genpept*.fsa
  Database name: genpept
  Release number [0.0]: 139.0
  Index date [00/00/00]: 30/01/04
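
Both indexing sessions on this page were run interactively. For repeat runs on later releases it may be easier to pass the answers on the command line. The Python sketch below builds such command lines; it assumes that the qualifier names (-dbname, -idformat, -directory, -filenames, -release, -date) mirror the prompts, which should be confirmed with 'dbiflat -help' and 'dbifasta -help' on the installed EMBOSS version. The helper names dbi_command and run_index are hypothetical.

```python
import shutil
import subprocess


def dbi_command(program, dbname, idformat, filenames, release, date, directory="."):
    """Build an argument list for a non-interactive dbiflat/dbifasta run.

    ASSUMPTION: the qualifier names mirror the interactive prompts; check
    them against `<program> -help` before relying on this.
    """
    if program not in ("dbiflat", "dbifasta"):
        raise ValueError(f"unsupported indexer: {program}")
    return [
        program,
        "-dbname", dbname,
        "-idformat", idformat,
        "-directory", directory,
        # Passed as a literal string: no shell is involved, so the
        # wildcard reaches the indexer unexpanded.
        "-filenames", filenames,
        "-release", release,
        "-date", date,
    ]


def run_index(cmd, cwd="."):
    """Run the indexer if it is on PATH; return False when it is not installed."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, cwd=cwd, check=True)
    return True


# The answers given in the two sessions on this page.
genbank_cmd = dbi_command("dbiflat", "genbank", "GB", "gb*.seq", "139.0", "28/01/04")
genpept_cmd = dbi_command("dbifasta", "genpept", "ncbi", "genpept*.fsa", "139.0", "30/01/04")
```

run_index(genpept_cmd, cwd=...) would then be called with the directory that holds genpept.fsa, so that the index files are written next to the data rather than into the genbank directory.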

The genpept database can be set up in emboss.default like this:

  DB genpept [
   method: emblcd
   format: fasta
   dir: $genbank_dir/genpept
   file: genpept.fsa
   type: P
   release: "139.0"
   comment: "GenPept protein sequences"
  ]
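
Because two indexed databases in the same directory overwrite each other's index files, it can be worth checking emboss.default for DB entries that share a 'dir' value. The following Python sketch is one way to do that. It is a simplified reader written for this page, not an EMBOSS tool: it only understands 'DB name [ ... ]' blocks with one 'key: value' per line, compares 'dir' values as plain text without expanding variables such as $genbank_dir, and ignores any other setting that might relocate the index files.

```python
import re
from collections import defaultdict

# A block starts with "DB name [" and ends at a line holding only "]".
DB_BLOCK = re.compile(r"^\s*DB\s+(\S+)\s*\[(.*?)^\s*\]", re.MULTILINE | re.DOTALL)
FIELD = re.compile(r"^\s*(\w+):\s*(.*?)\s*$", re.MULTILINE)


def parse_databases(text):
    """Return {db_name: {field: value}} for each 'DB name [ ... ]' block."""
    databases = {}
    for name, body in DB_BLOCK.findall(text):
        databases[name] = {key: value.strip('"') for key, value in FIELD.findall(body)}
    return databases


def shared_directories(databases):
    """Map each 'dir' value used by more than one database to those names."""
    by_dir = defaultdict(list)
    for name, fields in databases.items():
        if "dir" in fields:
            by_dir[fields["dir"].rstrip("/")].append(name)
    return {d: names for d, names in by_dir.items() if len(names) > 1}


# The two stanzas from this page.
EXAMPLE = """\
DB genbank [
 method: emblcd
 format: genbank
 dir: $genbank_dir
 file: gb*.seq
 type: N
 release: "139.0"
 comment: "GenBank core sequences"
]
DB genpept [
 method: emblcd
 format: fasta
 dir: $genbank_dir/genpept
 file: genpept.fsa
 type: P
 release: "139.0"
 comment: "GenPept protein sequences"
]
"""

dbs = parse_databases(EXAMPLE)
print(shared_directories(dbs))  # prints {} because the two dirs differ
```

If genpept had been left in $genbank_dir, shared_directories() would report that directory together with both database names.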

 
