
bbcflib::mapseq Namespace Reference

Functions

def fastq_dump
def fastqc
def run_fastqc
def get_fastq_files

Variables

string demultiplex_path = "/srv/demultiplexing/public/data/demultiplexing_minilims.files/"
run_lib_name = None
tuple run = str(run['url'])
list dafl1 = dafl[run['facility']]
tuple daf_data

Detailed Description

======================
Module: bbcflib.mapseq
======================

This module provides functions useful for mapping raw reads with bowtie.
The most common workflow will first use ``map_reads`` which takes the following arguments:

  * ``'ex'``: an execution environment to run jobs in,

  * ``'fastq_file'``: the raw reads as a fastq file,

  * ``'chromosomes'``: a dictionary whose keys are the 'chromosome_id's used in the bowtie indexes and whose values are dictionaries giving the usual chromosome names and their lengths,

  * ``'bowtie_index'``: the file prefix to the bowtie index,

  * ``'maxhits'``: the maximum number of hits per read to keep in the output (default *5*),

  * ``'antibody_enrichment'``: an approximate enrichment ratio of protein bound sequences over controls (default *50*),

  * ``'name'``: sample name to be used in descriptions and file names,

  * ``'remove_pcr_duplicates'``: whether to remove probable PCR amplification artifacts based on a Poisson confidence threshold (default *True*),

  * ``'bwt_args'``: a list of specific arguments for bowtie (in addition to ["-Sam",str(max(20,maxhits)),"--best","--strata"]).
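For illustration, the default argument list above can be assembled as follows. The helper name is hypothetical and not part of the module; only the default list itself comes from the text.

```python
def default_bowtie_args(maxhits=5, bwt_args=None):
    # The default bowtie arguments quoted above; any user-supplied
    # ``bwt_args`` are appended to them.  Illustrative sketch only.
    args = ["-Sam", str(max(20, maxhits)), "--best", "--strata"]
    return args + list(bwt_args or [])
```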

The function ``map_groups`` will take a collection of samples as described in a *job* object from the ``frontend`` module, fetch fastq files for each of them using a ``daflims`` object, use an 'Assembly' object from ``genrep`` to get the bowtie indexes, and run ``map_reads`` for each sample.
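The loop performed by ``map_groups`` can be sketched roughly as follows. The group/run dictionary layout and the ``map_reads`` call signature shown here are assumptions for illustration, not the module's exact structure.

```python
def map_groups_sketch(ex, job, assembly, map_reads):
    # Illustrative outline only: iterate over each group and each of its
    # runs, mapping every fastq file against the assembly's bowtie index.
    mapped_files = {}
    for gid, group in job.groups.items():
        mapped_files[gid] = {}
        for rid, run in group['runs'].items():
            mapped_files[gid][rid] = map_reads(
                ex, run['fastq_file'], assembly.chromosomes,
                assembly.index_path, name=group['name'])
    return mapped_files
```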

The second step in the workflow consists of generating genome-wide density maps with ``densities_groups``, which needs a 'bein' execution environment, the same *job* and *assembly* as above, and a dictionary 'file_dict' as returned by ``map_groups``. The function then runs ``parallel_density_sql`` on each 'bam' file to obtain a normalized read density profile as a 'sqlite' file.

Below is the script used by the frontend::

from bein import execution, MiniLIMS
from bbcflib import daflims, genrep, frontend, gdv, common
from bbcflib.mapseq import *
limspath = '/path/to/mapseq/minilims'
M = MiniLIMS( limspath )
working_dir = '/path/to/scratch/on/cluster'
hts_key = 'test_key'
gl = { 'hts_mapseq': {'url': 'http://htsstation.epfl.ch/mapseq/'},
       'genrep_url': 'http://bbcftools.vital-it.ch/genrep/',
       'bwt_root': '/db/genrep/',
       'script_path': '/srv/chipseq/lib',
       'lims': {'user': 'alice',
                 'passwd': {'lgtf': 'bob123',
                            'gva': 'joe456'}},
        'gdv': {'url': 'http://svitsrv25.epfl.ch/gdv',
                'email': 'alice.ecila@somewhere.edu',
                'key': 'xxxxxxxxxxxxxxxxxxxxxxxx'} }
assembly_id = 'mm9'
htss = frontend.Frontend( url=gl['hts_mapseq']['url'] )
job = htss.job( hts_key )
g_rep = genrep.GenRep( gl['genrep_url'], gl['bwt_root'] )
assembly = genrep.Assembly( assembly=assembly_id, genrep=g_rep )
daflims1 = dict((loc,daflims.DAFLIMS( username=gl['lims']['user'],
                                      password=gl['lims']['passwd'][loc] ))
                for loc in gl['lims']['passwd'].keys())
with execution( M, description=hts_key, remote_working_directory=working_dir ) as ex:
    job = get_fastq_files( ex, job, daflims1 )
    run_fastqc( ex, job, via='lsf' )
    mapped_files = map_groups( ex, job, assembly )
    pdf = add_pdf_stats( ex, mapped_files,
                         dict((k,v['name']) for k,v in job.groups.iteritems()),
                         gl['script_path'] )
    density_files = densities_groups( ex, job, mapped_files, assembly.chromosomes )
    gdv_project = gdv.new_project( gl['gdv']['email'], gl['gdv']['key'],
                                   job.description, assembly.id, gl['gdv']['url'] )
    add_pickle( ex, gdv_project, description='py:gdv_json' )
print ex.id
allfiles = get_files( ex.id, M )
print allfiles

Function Documentation

def bbcflib::mapseq::fastq_dump ( filename, options = None )

Binds ``fastq-dump`` to convert *sra* (short read archive) files to *fastq* format.
def bbcflib::mapseq::fastqc ( fastqfile, outdir = None, options = None )

Binds ``fastqc`` (http://www.bioinformatics.bbsrc.ac.uk/) which generates a QC report of the short reads present in the fastq file.
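The command line such a binding assembles can be sketched as below. The helper name and option handling are assumptions; ``--outdir`` is a real fastqc flag.

```python
def fastqc_cmd(fastqfile, outdir=None, options=None):
    # Illustrative: one plausible fastqc invocation built from the
    # binding's arguments.  Not the module's actual implementation.
    args = list(options or [])
    if outdir is not None:
        args += ["--outdir", outdir]
    return ["fastqc"] + args + [fastqfile]
```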
def bbcflib::mapseq::get_fastq_files ( ex, job, dafl = None, set_seed_length = True )
Replaces file references with actual file paths in the 'job' object.
These references are either 'dafl' run descriptions or urls.
Argument 'dafl' is a dictionary of 'DAFLIMS' objects (keys are the facility names).
If 'set_seed_length' is true, a dictionary job.groups[gid]['seed_lengths']
of seed lengths is constructed, with values corresponding to 70% of the
read length.
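The 70% rule above can be read as the following one-liner; the exact rounding used by ``get_fastq_files`` is an assumption here.

```python
def seed_length(read_length):
    # 70% of the read length, truncated to an integer (rounding mode
    # assumed for illustration).
    return int(0.7 * read_length)
```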
def bbcflib::mapseq::run_fastqc ( ex, job, via = 'lsf' )

Returns the name of the report file.

Variable Documentation

tuple bbcflib::mapseq::daf_data
Initial value:
dafl1.fetch_fastq( str(run['facility']), str(run['machine']), run['run'], run['lane'],
                   libname=run.get('sequencing_library') )