Skip to content

R source for parsing GEO_family.soft files (using GEOquery lib) and creating a custom csv table. Run R source by command line (without having to get in R environment) by passing through arguments.

Notifications You must be signed in to change notification settings

dimitras/GEOdatasets_analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

COX expression analysis

All experiments retrieved from GEO were carried out under the same conditions. All microarrays performed on LCLs using an Affymetrix Human HG-Focus Target Array. Samples were treated the same, the platform and protocol used to perform the microarray was the same.


### PART 1 ###

Retrieve GEO datasets to check the variability of expression of a key genes list for the COX pathway.

- Links for studies in GEO:
http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2106
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1485
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5859
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE10824

- GSM lists (source.R): Parse GSE_family.soft files and creating a custom csv table with a 'GSM-expression values' set per probe. Used the GEOquery lib written in R (http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/geo). COMMENT: Embodied script to run R source by command line (without having to get in the R environment) by passing through arguments.


### PART 2 ###

Find the COX pathway genes through the datasets (GSE as source files).

- Found genes (find_genes.rb): Parse GSM tables with ruby script to find and count the existence of COX pathway genes. Export these genes with their expression values for all samples.

- Gene expression plots (barplot.R): Barplot the average, variance, and standard deviation in gene expression level for COX pathway genes, per experiment and by gene (pending).


### PART 3 ###

Retrieve the respective CEL files for GEO datasets from GEO to process and RMA normalize the Affy CEL files.

- CEL to PCL files (_ProcessToPCL.R): Run Princeton's script for all experiments and get the PCL files in format: Entrezid_at, expression values for all samples (https://bitbucket.org/FunctionLab/arrayprocess/src/7ac17f671ac91381527ad97de377a7c0c4be6a9a/ProcessToPCL.R?at=default). 
NOTE: Used the hgfocus chip lib from brain: http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/16.0.0/entrezg.asp


### PART 4 ###

Find the COX pathway genes through the datasets, in PCLs (same datasets but using CEL files as source, as mentioned in step 3).
NOTE: The probe-ids-like in the PCLs are the entrez ids with the probe_id suffix "at".

- Found genes in PCLs (find_coxgenes_in_pcl.R): Locate the COX genes among the csv datasets (PCL files converted to csv first).

- Gene expression plots (pcl_plots.R, pcl_plots_by_gene.R): Barplot the average, variance, and standard deviation in gene expression level for COX pathway genes, per experiment and by gene (pending). Note that the x axis in the plots are the genes (entrez ids).


### PART 5 ###

Check if the GSM ids (samples) are the same through the experiments (used the GSM tables created in step 2)

- GSM ids comparison (compareIDs.R): Get the uniques through the experiments, export a table with all the relations (https://upenn.box.com/s/zormz8yo5tgbkftru4aj)


### PART 6 ###

Retrieve genotype data from hapmap by coriell id (linked to GEO ids) and convert them to ped/map (plink format). 

- Samples: ftp://ftp.ncbi.nlm.nih.gov/hapmap/samples_individuals/ (info about the founders)
- Genotypes: http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/2010-08_phaseII+III/forward/

- Genotype files with unique rs-id entries in hapmap format (unique_rs_ids.rb):
Filtered all hapmap genotype files to get only the unique rs ids.
IMPORTANT NOTE: the genotype files contain duplicated rs ids with different locations from different origins/ centers (sanger, perlegen, broad, affymetrix, bcm, illumina). Kept the unique rs ids, by merging alleles per sample. Between the duplicated rs entries, compare the alleles like this: 
known-unknown => known
known==known => known
known!=known => unknown
and ignore the location, origin and other fields.

- GM2NA lists (GM_to_NA.rb): Linking GEO expression data with genotypes from HapMap. Actually link the GM ids (GEO sample ids) with the coriell ids (Hapmap sample ids).

- Samples' list (coriell_founders_list.rb): Filtering the coriell ids list by the samples that are the founders. Get the sample-individuals with the coriell ids that correspond to the GEO ids, and also fulfill the criterion of being a founder (0 for a founder).

- PED & MAP files (run_glu.sh): Convert all hapmap genotype files (with merged duplicates) for each sample set (CEU, JPT, YRI, CHB) to ped & map files (plink format) with GluGenetics library and filter them with samples' list (list with those coriell ids that correspond to the GEO ids). RUN shell script to transform by sample group.
REF:
- PLINK: http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/hapmap3_r3/plink_format/
- GLU: http://code.google.com/p/glu-genetics/
NOTE: May need to cross validate with the inconsistencies.

todo:
- filter the genetic data with 5% variance
- specify in genome 100KB +- from  transcription start site


### PART 7 ###

Association analysis to identify genetic variants that influence COX pathway gene expression.

- Data QC (plink1.sh): Whole genome association analysis with plink toolset. 
Added sex info into files.
Get allele frequencies.
Merge hapmap samples CEU, YRI, JPT, CHB (failed due to strand error).
Cross merge hapmap samples YRI, JPT, CHB with CEU (failed, but we got missing snps).
Flip dna strand for missing snps.
Re-cross merge hapmap samples (finished successfully).
Merge all hapmap samples: CEU with the flipped-stand versions of YRI, JPT, CHB (failed due to strand error).

IMPORTANT NOTE: QC paused as it seems that a great deal of QC work has not been uploaded to the Hapmap resource.

todo:
- Analysis with Genetic Power Calculator.


### PART 8 ###

Analysis with ready eQTL results.

Downloaded the results sets for our genes of interest, from eqtl browser: http://eqtl.uchicago.edu/cgi-bin/gbrowse/eqtl/
The results sets include the Stranger et al. 2007 paper whose data we have retrieved from GEO.

MANUALLY:
- Searched each of the 21 cox genes (only 20 genes found actually), directly on the eqtl browser, like this: *PTGS1*
- Downloaded the gff file by selecting the tracks of interest (specifically the studies we are interested, which were selected according to the tissues of our interest), and excluding entrez genes.

AUTOMATED WAY (selected_eqtl_results_to_gff.rb):
- Downloaded one full gff file that is available at eqtl browser and contains all the data in this browser.
- Filtered the results
1. by our genes of interest
2. by the studies we are interested, which were selected according to the tissues of our interest
3. by score with cutoff set to 7.0

FOR THE MISSED GENES:
Search them in the gff file by loci
- download genes from ncbi with efetch (put all the gene ids seperated with commas in the id argument: get_gene_ids_by_commas.rb)
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=1544,8529&retmode=text
- create a csv with geneid, genename, start and stop genomic locations (gene_locis.rb)
- add genomic location filter (within 10kbp from start and stop gene position) to search the genes in the gff file (selected_eqtl_results_for_missed_genes.rb)

MORE:
also searched results from Dixon et al 2007 http://www.sph.umich.edu/csg/liang/publications/pdf/ng2109.pdf
and did not find any relevant hits (no gene matches)


### PART 9 ###

Results visualization with region graphs.

Get recombination rates, allele frequencies and linkage disequilibrium data from hapmap for CEU group.
- Bulk download multiple links from a webpage: 
#get the index page to grab the links 
wget http://hapmap.ncbi.nlm.nih.gov/downloads/ld_data/2009-04_rel27/
# bulk download all the files with perl one-liner
perl -ne '/href="(.*?)"/ and print "wget http://hapmap.ncbi.nlm.nih.gov/downloads/ld_data/2009-04_rel27/$1\n"' index.html | sh
# bulk download the CEU files with perl one-liner
perl -ne '/href="(.*?YRI.txt.gz)"/ and print "wget http://hapmap.ncbi.nlm.nih.gov/downloads/ld_data/2009-04_rel27/$1\n"' index.html | sh

Links that got tha data from hapmap:
ld data: hapmap.ncbi.nlm.nih.gov/downloads/ld_data/2009-04_rel27/
frequencies: http://hapmap.ncbi.nlm.nih.gov/downloads/frequencies/2010-08_phaseII+III/
recombination: http://hapmap.ncbi.nlm.nih.gov/downloads/recombination/2011-01_phaseII_B37/

REGION GRAPHS:
1) fix files with SNPs per gene in ceratin format (manhattan graphs) & files for plotting only the SNPs region (plain graphs) {TODO: needs to be automated}
2) fix lists for highlighted SNPs in certain format
formats are explained here: http://people.virginia.edu/~sdt5z/0STABLE/qqman.r
3) plot with manhattan_plot.R (used manhattan function available on web)

LD PLOTS:
1) installed & used haploview command line tool
2) run the commands in haploview_commands.txt

ALLELE FREQUENCIES PLOTS:
1) used frequencies' data from hapmap and the files for plotting only the SNPs region (see region graphs)
2) plot with allele_frequencies_plot.R

SD & VAR ABUNDANCE BARPLOTS PER POPULATION:
1) from geo ids get the correil ids (GM_to_NA.rb)
2) get the sample-individuals with the coriell ids that correspond to the GEO ids (coriell_founders_list.rb)
3) create a new mixed genes list, adding a few housekeeping genes
4) find the 'grouped by hapmap poulation' GM ids from each geogroup (find_genes_in_geo_per_population.rb)
5) merge the same population's genes that were found in the different geogroups (use paste unix command) 
6) keep the unique ids from the merged file (merge_genes_per_population.rb)
7) barplot the variance and standard deviation in gene expression level for COX pathway genes & housekeeping, per population (mixed_genes_expression_per_hapgroup.R)








About

R source for parsing GEO_family.soft files (using GEOquery lib) and creating a custom csv table. Run R source by command line (without having to get in R environment) by passing through arguments.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published