
Genome Informatics Google Summer Of Code Student Application

eQTL Pipeline and Visualisation

Name: Saket Kumar Choudhary

Street Address: A 19 Hastinapur , Anushakti Nagar

City: Mumbai

State: Maharashtra

Postal/Zip Code: 400094

Email Address: [email protected]

Phone: +91-9869649197

University: Indian Institute of Technology Bombay

Mumbai

India

Background: Fourth Year Undergraduate

Department of Chemical Engineering

IIT Bombay

Website: http://home.iitb.ac.in/~saket.kumar

Github: https://github.com/saketkc

CV: http://home.iitb.ac.in/~saket.kumar/saket_cv.pdf

Idea I would like to contribute to: eQTL pipeline for Galaxy and Visualisations

Brief Background

I am a fourth-year undergraduate pursuing an integrated Bachelor’s and Master’s degree (a 5-year course) in Chemical Engineering at IIT Bombay. I am inspired a lot by Biology and love to code.

I was a Google Summer of Code 2012 student for Connexions (http://cnx.org) and I continue to contribute to the project. I worked on a Python/Pyramid submodule that imported presentations from the user’s desktop, converted them into CNXML (the native XML format developed by Connexions) and deposited the slideshow to services like SlideShare and Google Drive, embedding it into a readable module. The user could also add a list of questions and answers at the end of the module. [https://github.com/oerpub/oerpub.rhaptoslabs.slideimporter]

I generally like to hack around with stuff, be it code or hardware. I recently participated in the MIT Media Lab Design Innovation Workshop [http://mitdi2013.pes.edu]. Our team came up with an innovative low-cost spectrometer called ‘Spectral Eye’, for which we were awarded the Gandhian Young Technological Innovation Award 2013. [http://www.techpedia.in/award/project-detail/Spectral-Eye]

I developed an interest in Biology during my fifth semester while taking a course on "Introduction to Cell and Molecular Biology". In the subsequent semesters I took up extra course work in subjects that dealt with molecular or computational biology. I have taken courses on Molecular Biology, Bioprocess Principles, and Computational Biology.

Besides Biology, I love to code and develop algorithms. I have a special interest in Machine Learning approaches. After taking a basic course on Artificial Intelligence at my home institution, I was inspired by the subject and applied for an internship at EMBL-EBI (Cambridge, UK), where I was selected to work on developing classifiers for EC number prediction and a web service for submitting prediction jobs. I worked with Dr. Janet Thornton’s group [http://www.ebi.ac.uk/thornton-srv/software/rbl/acknowledgements.html]. This was essentially a chemoinformatics project and hence introduced me to elementary bioinformatics.

For the past 2 semesters, I have been working on Next Generation Sequencing analysis. I explored the NGS domain, set up a pipeline (a basic shell script) using the standard set of tools (BWA, samtools, GATK, ANNOVAR) and have lately been playing around with Galaxy. In the process I have concentrated on using Python in my pipeline and analysis work. I recently contributed a BWA wrapper for BioPython.

The essential problem we are trying to address is to study the kind of patterns found in SNVs of cancer genomic data. I am in the process of benchmarking a PSSM approach for the BWA aligner using bwa-pssm [https://github.com/pkerpedjiev/bwa-pssm/]. I haven’t written any code for it as such; my inputs have mainly been in the form of bug reports.

Programming Interests

I started programming in PHP in my first year. Later I explored languages like Ruby and Haskell. As part of my second-year internship at SlideShare [http://www.slideshare.net] I got to work on a Rails project in a Behaviour Driven Development environment, which was quite a good learning experience.

I later took up Python while doing a course on Introductory Python at my home institution and have been hacking in Python since then. I am familiar with Python based MVC frameworks (Django/Pyramid) and micro frameworks (bottle.py, Flask, etc.). I implemented an Electronic Health Record Management System [EHR system] in Django for an Indian startup, but it didn’t pick up and I open-sourced it here: [http://github.com/saketkc/open-ehr-django]

Open Source Projects

1. BWA Wrapper for BioPython

[biopython/biopython#167]

I use BioPython for my Computational Biology course assignments. Besides this, as part of my own NGS pipeline, I submitted a BWA wrapper for BioPython.

2. open-ehr-django: An Electronic Health Record Management System

[http://github.com/saketkc/open-ehr-django]

Implemented a cloud based electronic health record management solution in Django (a Python based MVC framework) for hospitals and clinicians, to facilitate digitisation of health records and online tracking of a patient’s medical history. I built it while working as a software developer for an Indian startup and later decided to open source it.

3. SlideImporter For Connexions, GSoC2012 Project

[https://github.com/oerpub/oerpub.rhaptoslabs.slideimporter]

Implemented a slide importer that converted users’ slideshows to CNXML and deposited them to Google Drive and SlideShare. This is in turn imported into a Pyramid project [https://github.com/oerpub/oerpub.rhaptoslabs.swordpushweb/tree/gsoc2012]

4. Scilab on Cloud

[https://github.com/saketkc/scilab_cloud]

Developed a Django based application that enables running Scilab [http://scilab.org/], a scientific computing software. The app lets users run their Scilab code online through a browser, removing the need to install a Scilab client locally. Scilab is an open source equivalent of MATLAB for numerical and scientific computing. This was widely accepted by the scilab-users community.

This has now been ported to work on AAKASH the low cost tablet:

https://github.com/saketkc/scilab_on_aakash

URL : http://scilab-test.garudaindia.in/cloud/ [ username : guest, password: guest123 ]

5. IIT Bombay Grading System on SMS

[https://github.com/saketkc/iitb-library-sms-interface]

In a team of 4, I developed a Flask [http://flask.pocoo.org/] based app that scrapes our institute’s grading interface and sends the grades to a user’s mobile on request. I wrote the scraping function, making use of the Beautiful Soup and urllib2 libraries.

6. Google AppEngine Projects

[https://github.com/saketkc/Google-AppEngine-Projects]

Developed a set of App Engine applications that allow programmatically creating, updating and deleting events in a Google Calendar and doing an ISBN search for a book given its name/author, together with a Gmail bot that queries these interfaces in the background to make them accessible over chat.

7. Pivotal Tracker Email Bot

[https://github.com/saketkc/pivotal-tracker-email-wrapper]

Developed an interface in Ruby, using the API of Pivotal Tracker (http://pivotaltracker.com), an online Agile project management tool, to create "New Issues" by fetching emails from a Gmail account. Pivotal accepted this as a third-party tool and it is now listed here: [http://www.pivotaltracker.com/help/thirdpartytools]

8. DropBox on SMS

[https://github.com/saketkc/dropbox_on_sms]

This Sinatra [http://www.sinatrarb.com/] based app lets one send a file from one’s Dropbox folder to any email id with just one SMS. The app placed among the Top 20 at Yahoo! Open Hack India [September 2011], where I participated as a student hacker.

9. A blather [Ruby] based bot

[https://github.com/saketkc/blather-bot]

Implemented a bot using the blather library in Ruby to get info about courses, their grading statistics and upcoming events on campus. The aim was to make this information available to students irrespective of their location, since the data is otherwise not viewable from outside the IITB network.

10. DriveStack

[https://github.com/anilshanbhag/drivestack]

Hacked during the Yahoo! HackU 2012 at IIT Bombay, DriveStack integrates cloud storage services like GDrive, box.net and Dropbox to give its user one unified ‘stack’ for file storage using the respective APIs. Our team was declared Winner.

11. dcpp.js

[https://github.com/saketkc/dcpp.js]

Implemented a JavaScript version of the popular file-sharing software DC++. This essentially used Python code for TCP socket programming and enabled the user to download files without the need for a DC++ desktop client.

Plan of Action

The study of expression Quantitative Trait Loci (eQTL) associates genetic variation in populations with variation of gene expression in order to pinpoint polymorphic genetic regions affecting gene expression. Since high-throughput sequencing and microarrays make it possible to measure both genetic variation and gene expression at the genomic level, eQTL methods allow for studying the variation of all regions in a genome against the expression of all genes. The computational analysis essentially consists of finding a correlation between the genetic patterns of markers and the expression of genes. The markers are essentially potential regulators of target genes.
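
As a minimal sketch of the correlation step described above, the following Python snippet computes a Pearson correlation between each marker's genotypes and one gene's expression across individuals, then ranks the markers. The 0/1/2 allele-count coding, the function names and the toy numbers are assumptions made for illustration; real eQTL tools typically use more elaborate statistical models.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def rank_markers(genotypes, expression):
    """Return (marker, r) pairs sorted by decreasing absolute correlation.

    genotypes: dict mapping a marker name to per-individual allele counts.
    expression: per-individual expression values of one gene.
    """
    scored = [(name, pearson_r(calls, expression))
              for name, calls in genotypes.items()]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)


# Toy data (invented): six individuals, two hypothetical markers.
toy_genotypes = {
    "marker_A": [0, 0, 1, 1, 2, 2],
    "marker_B": [0, 2, 1, 0, 1, 2],
}
toy_expression = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]

ranking = rank_markers(toy_genotypes, toy_expression)
```

Here `ranking` lists the marker whose genotype tracks expression most closely first; repeating this over every marker and gene pair gives the genome-wide picture the paragraph describes.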

The following tools perform eQTL analysis:

FastMap is GUI based software, so getting it to work with Galaxy will be non-trivial and can be left as a challenge to tackle at the end, in case I have surplus time.

I would begin with adding these tools to Galaxy and then implementing a Galaxy workflow for doing eQTL analysis. The data would be run through the workflow and the results from all these tools can be compared.

The next task would be to enable workflows for Galaxy’s Visual Analysis Framework. It is possible to enable visualisations through tools, but packing them in a workflow would give the user more control over the visualisation, since the user can decide which parameters/columns of the datasets need to be visualised.

The next step would be to come up with a d3.js based heatmap (d3.js + HTML5 canvas (SVG)). As seen from the following graphs, heatmaps are a good way to study gene-gene correlation in eQTL analysis.

[Figure: two example heatmaps]

I created a jsfiddle that builds a heatmap with d3.js: http://jsfiddle.net/saketkc/neADm/
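
On the server side, a heatmap like this needs its matrix flattened into one record per cell before a d3.js front end can bind it to rectangles. The sketch below shows one possible way to do that in Python for a gene-gene correlation matrix. The record fields (`row`, `col`, `value`), the function names and the toy expression values are assumptions for illustration and are not taken from the jsfiddle or from Galaxy.

```python
import json
import math


def correlation(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def heatmap_cells(expression):
    """Flatten a gene-by-gene correlation matrix into one dict per cell."""
    genes = sorted(expression)
    return [
        {"row": a, "col": b,
         "value": round(correlation(expression[a], expression[b]), 3)}
        for a in genes
        for b in genes
    ]


# Toy expression profiles (invented) for three hypothetical genes.
toy_expression = {
    "gene1": [1.0, 2.0, 3.0, 4.0],
    "gene2": [2.0, 4.1, 5.9, 8.0],
    "gene3": [4.0, 3.0, 2.0, 1.0],
}

cells = heatmap_cells(toy_expression)
payload = json.dumps(cells)  # JSON string a browser-side script could load
```

Keeping one record per cell means the front end only has to map `row` and `col` to positions and `value` to a colour scale.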

Improving the scatter plot functionality of Galaxy: The scatter plots in Galaxy are generated using a Python based backend. To make them more interactive, the scatter plot module can be changed to use d3.js for plotting. This would not only make the plots more interactive, but would also facilitate a real-time plotting experience.
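
With client-side plotting, the Python backend would no longer render an image; it would only hand the selected columns to the browser. A minimal sketch of that hand-off, assuming a tab-separated dataset and 0-based column indices (the function name, the `x`/`y` field names and the toy rows are hypothetical):

```python
import csv
import io
import json


def scatter_points(tabular_text, x_col, y_col):
    """Extract two numeric columns as a list of {"x": ..., "y": ...} dicts.

    Rows whose selected fields are missing or not numeric (for example a
    header row) are skipped.
    """
    points = []
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        try:
            points.append({"x": float(row[x_col]), "y": float(row[y_col])})
        except (ValueError, IndexError):
            continue
    return points


# Toy tab-separated dataset (invented) with a header row.
toy_table = "snp\tposition\tscore\nrs1\t100\t0.5\nrs2\t250\t1.5\n"

points = scatter_points(toy_table, 1, 2)
payload = json.dumps(points)  # what a d3.js scatter plot could consume
```

Because the column indices are parameters, the user can switch axes without the server regenerating a plot.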

Finally, I would try to implement nested workflows, so that one workflow can be imported and used as a step inside another.

Timeline

1. Community Bonding Period :

  • Test out the various workflows given at [https://main.g2.bx.psu.edu/workflow/list_published]

  • Install and test out the eQTL analysis tools listed above by giving them proper input files. Some simulated data files are here: [http://ml.sheffield.ac.uk/qtl/panama/]

  • Discuss with the community the requirements and exact use cases of the heatmap and scatter plot, and how flexible they should be.

  • Study the codebase and figure out how tools that are not command line based can be integrated. For example, there are a lot of "R packages" which would need to be run by writing the necessary functionality in an R file and running that file.

2. Week 1 [June 17th - June 24th]

  • Integrate PANAMA into the Galaxy tool set

  • Implement functional tests for the above module

3. Week 2 [June 24th - July 1st] and Week 3 [July 1st - July 8th]

  • Integrate MERLIN and PLINK into the Galaxy tool set

    • Once PANAMA integration is complete, this should be relatively easy to implement

    • This will involve writing a Python wrapper along the lines of the rgClustalw.py wrapper in galaxy-dist/tools/rgenetics/

  • Implement functional tests for both these modules

  • Documentation
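
A wrapper of the kind mentioned for this step mostly assembles a command line and delegates to the external program. The sketch below illustrates the idea for PLINK; it is not modelled on the actual rgClustalw.py code. The option names `--file`, `--assoc` and `--out` follow PLINK 1.x usage as an assumption and should be checked against the installed version, and the function names and file prefixes are hypothetical.

```python
import shutil
import subprocess


def build_plink_command(input_prefix, output_prefix, executable="plink"):
    """Assemble the argument list for an association run.

    The option names are assumed from PLINK 1.x usage; verify them
    against the installed version before relying on this.
    """
    return [executable, "--file", input_prefix,
            "--assoc", "--out", output_prefix]


def run_wrapper(input_prefix, output_prefix, executable="plink"):
    """Run the tool if it is installed.

    Returns (command, return code), with None as the return code when
    the executable is not on PATH and nothing was executed.
    """
    command = build_plink_command(input_prefix, output_prefix, executable)
    if shutil.which(executable) is None:
        return command, None
    completed = subprocess.run(command, capture_output=True, text=True)
    return command, completed.returncode


# Demonstration with a deliberately missing executable, so nothing is run.
command, status = run_wrapper("study", "study_out",
                              executable="plink-not-installed-example")
```

Passing the arguments as a list rather than a shell string avoids quoting problems with user-supplied file names.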

4. Week 4 [July 8th - July 15th]

  • Implement functionality to run R/qtl, an R-based package

    • The Python method implemented in the previous week needs to be extended (inherited), or a new wrapper function might be required, since the input parameter names are entirely different for this package

  • Implement unit/functional tests for this method, or extend the previously implemented tests

  • Documentation

5. Week 5 [July 15th - July 22nd]

  • Implement functionality to run eMap, an R-based package

    • Implement a method that essentially creates a ".r" file, puts in the necessary code for eQTL analysis using library(eMap), and then calls this R code

    • There is an r_wrapper.sh already implemented in galaxy-tools/tools/plotting/, so the Python method would essentially call this wrapper through subprocess after creating the necessary file

  • Implement unit/functional tests for this method

  • Documentation
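
The "write an R file, then run it" approach for this step could look roughly like the sketch below. It calls `Rscript` directly instead of going through r_wrapper.sh, which is a simplification, and the script body is a placeholder comment because the specific eMap calls are not spelled out in this proposal. The function names and the `analysis.r` file name are hypothetical.

```python
import os
import shutil
import subprocess
import tempfile


def write_r_script(package, body_lines, directory):
    """Write an R script that loads `package` and then runs `body_lines`."""
    path = os.path.join(directory, "analysis.r")
    with open(path, "w") as handle:
        handle.write("library(%s)\n" % package)
        for line in body_lines:
            handle.write(line + "\n")
    return path


def run_r_script(path, interpreter="Rscript"):
    """Run the script and return its exit code, or None if R is missing."""
    if shutil.which(interpreter) is None:
        return None
    completed = subprocess.run([interpreter, path],
                               capture_output=True, text=True)
    return completed.returncode


workdir = tempfile.mkdtemp()
# Placeholder body: the real analysis calls would be generated from the
# parameters the user selects in the Galaxy tool form.
script_path = write_r_script("eMap", ["# eQTL analysis calls go here"], workdir)

with open(script_path) as handle:
    script_text = handle.read()
```

Calling `run_r_script(script_path)` would then execute the generated file on a machine where R and the package are installed.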

6. Week 6 [July 22nd - July 29th] and Week 7 [July 29th - August 5th]

  • Implement a method to allow creating workflows for visualisation tools

    • The workflow would allow the users to use workflows in place of tools to visualise the data

    • The workflow should allow playing around with various parameters that need to be visualised through Trackster

  • Implement unit/functional tests for this method

  • Documentation

7. Week 8 [August 5th - August 12th] and Week 9 [August 12th - August 19th]

  • Integrate d3.js for plotting heatmaps with Galaxy

    • Heatmaps are a good visualisation tool for eQTL analysis in terms of studying gene-gene interaction, but this heatmap can also be used for visualising other data, such as microarray data.

    • A d3.js heatmap demo that I implemented: [http://jsfiddle.net/saketkc/neADm/]

    • This can also be plugged into the visualisation workflow developed in the previous week

  • Documentation

8. Week 10 [August 19th - August 26th]

  • Integrate d3.js for scatter plots

    • Galaxy currently uses python matplotlib for plotting scatter plots.

    • Replacing matplotlib with d3.js will give real time support for scatter plots

    • d3.js scatter plots are more interactive

  • Documentation

9. Week 11 [August 26th - September 2nd] and Week 12 [September 2nd - September 9th]

  • Implement nested workflow environment functionality

    • This will require a change both at the Canvas level and at the backend python level

    • One workflow can be piped into the other

    • This can possibly get extended into the following week as it is a bit challenging

  • Documentation
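
One way to think about the backend side of nesting is to let a workflow expose the same interface as a single step, so that a workflow can appear wherever a step can. The sketch below illustrates that idea with plain Python classes; `Step` and `Workflow` are hypothetical names, and this is not Galaxy's actual workflow model.

```python
class Step:
    """A single tool invocation, modelled here as a plain function."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def run(self, data):
        return self.func(data)

    def flatten(self):
        return [self.name]


class Workflow:
    """An ordered list of steps; each step may itself be a Workflow."""

    def __init__(self, name, steps):
        self.name = name
        self.steps = list(steps)

    def run(self, data):
        # The output of each step is piped into the next one.
        for step in self.steps:
            data = step.run(data)
        return data

    def flatten(self):
        # Expand nested workflows into the linear order of tool steps.
        names = []
        for step in self.steps:
            names.extend(step.flatten())
        return names


# An inner workflow reused as the first step of an outer workflow.
normalise = Workflow("normalise", [
    Step("strip", lambda rows: [r.strip() for r in rows]),
    Step("drop_empty", lambda rows: [r for r in rows if r]),
])
outer = Workflow("outer", [normalise, Step("count", len)])

result = outer.run([" a ", "", "b "])
order = outer.flatten()
```

`flatten` shows how a nested definition could still be reduced to a linear list of tool steps for execution or display on the canvas.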

10. Week 13 [September 9th-September 16th]

  • Bug fixing

  • More unit tests

  • Documentation

Obligations

I have no obligations during the coding period. My college reopens on July 24th, but I can manage my time for GSoC: since this is my final year at college, the credit requirements are minimal.