======================
Module: bbcflib.mapseq
======================

This module provides functions to map raw reads using bowtie.
The most common workflow will first use ``map_reads``, which takes the following arguments:

* ``'ex'``: an execution environment to run jobs in,
* ``'fastq_file'``: the raw reads as a fastq file,
* ``'chromosomes'``: a dictionary whose keys are the chromosome ids used in the bowtie index and whose values are dictionaries giving the usual chromosome names and their lengths,
* ``'bowtie_index'``: the file prefix to the bowtie index,
* ``'maxhits'``: the maximum number of hits per read to keep in the output (default *5*),
* ``'antibody_enrichment'``: an approximate enrichment ratio of protein-bound sequences over controls (default *50*),
* ``'name'``: sample name to be used in descriptions and file names,
* ``'remove_pcr_duplicates'``: whether to remove probable PCR amplification artifacts based on a Poisson confidence threshold (default *True*),
* ``'bwt_args'``: a list of additional arguments passed to bowtie (on top of ``["-Sam", str(max(20, maxhits)), "--best", "--strata"]``).

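For illustration, the ``chromosomes`` argument might be built as follows; the chromosome ids and lengths below are hypothetical values chosen only to show the expected shape of the dictionary:

```python
# Hypothetical 'chromosomes' dictionary for map_reads: keys are the
# chromosome ids used in the bowtie index, values give the usual name
# and length of each chromosome (ids and lengths are made up here).
chromosomes = {
    'chr1_NC_000067.5': {'name': 'chr1', 'length': 197195432},
    'chr2_NC_000068.6': {'name': 'chr2', 'length': 181748087},
}

# e.g. the total genome length covered by the index
total_length = sum(c['length'] for c in chromosomes.values())
print(total_length)
```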
The function ``map_groups`` takes a collection of samples as described in a *job* object from the ``frontend`` module, fetches the fastq files for each of them using a ``daflims`` object, obtains the bowtie indexes via an 'Assembly' object from ``genrep``, and runs ``map_reads`` on each sample.

The second step in the workflow generates genome-wide density maps with ``densities_groups``, which needs a 'bein' execution environment, the same *job* and *assembly* as above, and a dictionary 'file_dict' as returned by ``map_groups``. The function then runs ``parallel_density_sql`` on each 'bam' file to obtain a normalized read density profile as an 'sqlite' file.

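The PCR-duplicate filter mentioned above rests on a Poisson confidence threshold: positions where the read count exceeds what the mean coverage can plausibly explain are trimmed. The following is a minimal, self-contained sketch of such a threshold computation, not bbcflib's actual implementation; the 95% cutoff is an assumed default:

```python
import math

def poisson_threshold(mean, cutoff=0.95, max_terms=1000):
    """Smallest count k whose Poisson(mean) CDF reaches `cutoff`.

    In this sketch, read stacks higher than k at a single genomic
    position would be treated as probable PCR amplification artifacts.
    """
    term = math.exp(-mean)        # P(X = 0)
    cdf = 0.0
    for k in range(max_terms):
        cdf += term
        if cdf >= cutoff:
            return k
        term *= mean / (k + 1)    # recurrence: P(X = k+1) from P(X = k)
    return max_terms
```

With a mean coverage of 2 reads per position and the default cutoff, the threshold is 5: stacks of more than 5 reads at one position would be flagged as likely duplicates.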
Below is the script used by the frontend::

    from bbcflib import daflims, genrep, frontend, gdv, common
    from bbcflib.mapseq import *
    from bein import execution, MiniLIMS

    limspath = '/path/to/minilims'        # MiniLIMS database file
    working_dir = '/path/to/scratch/on/cluster'
    hts_key = 'test_key'
    via = 'lsf'                           # job submission method
    M = MiniLIMS( limspath )
    gl = { 'hts_mapseq': {'url': 'http://htsstation.epfl.ch/mapseq/'},
           'genrep_url': 'http://bbcftools.vital-it.ch/genrep/',
           'bwt_root': '/db/genrep/',
           'script_path': '/srv/chipseq/lib',
           'lims': {'user': 'alice',
                    'passwd': {'lgtf': 'bob123',
                               'gva': 'joe456'}},
           'gdv': {'url': 'http://svitsrv25.epfl.ch/gdv',
                   'email': 'alice.ecila@somewhere.edu',
                   'key': 'xxxxxxxxxxxxxxxxxxxxxxxx'} }
    assembly_id = 'mm9'
    htss = frontend.Frontend( url=gl['hts_mapseq']['url'] )
    job = htss.job( hts_key )
    g_rep = genrep.GenRep( gl['genrep_url'], gl['bwt_root'] )
    assembly = genrep.Assembly( assembly=assembly_id, genrep=g_rep )
    daflims1 = dict((loc, daflims.DAFLIMS( username=gl['lims']['user'],
                                           password=gl['lims']['passwd'][loc] ))
                    for loc in gl['lims']['passwd'].keys())
    with execution( M, description=hts_key, remote_working_directory=working_dir ) as ex:
        job = get_fastq_files( ex, job, daflims1 )
        run_fastqc( ex, job, via=via )
        mapped_files = map_groups( ex, job, assembly )
        pdf = add_pdf_stats( ex, mapped_files,
                             dict((k,v['name']) for k,v in job.groups.iteritems()),
                             gl['script_path'] )
        density_files = densities_groups( ex, job, mapped_files, assembly.chromosomes )
        gdv_project = gdv.new_project( gl['gdv']['email'], gl['gdv']['key'],
                                       job.description, assembly.id, gl['gdv']['url'] )
        add_pickle( ex, gdv_project, description='py:gdv_json' )
    print ex.id
    allfiles = get_files( ex.id, M )
    print allfiles