Data: vsn-normalization.xls
Setting:
1. k = 6 % optimal k
2. seeds = [1:10] % get average CB+D matrix to generate network
3. sds = 3 % default CBDioZ threshold
top50 + reg : **Control ( .sif / .eda / .pdf ) **Shift ( .sif / .eda / .pdf )
top50: **Control ( .sif / .eda / .pdf ) **Shift ( .sif / .eda / .pdf )
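A rough sketch of how settings 2 and 3 fit together, written in Python for illustration (the project scripts are MATLAB). It assumes the threshold is applied per entry as |mean| > sds x (standard deviation across seeds). That rule, the function name and the toy matrices are my assumptions, not the toolbox's CBDioZ definition.

```python
from statistics import mean, stdev

def average_and_threshold(cbd_per_seed, sds=3.0):
    """Average square CB+D matrices over seeds and flag entries that pass the threshold.

    cbd_per_seed: list of n x n matrices (lists of lists), one per seed.
    Assumption: an arc is kept when |mean| > sds * stdev across seeds.
    """
    n = len(cbd_per_seed[0])
    avg = [[0.0] * n for _ in range(n)]
    keep = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            vals = [m[i][j] for m in cbd_per_seed]
            avg[i][j] = mean(vals)
            keep[i][j] = abs(avg[i][j]) > sds * stdev(vals)
    return avg, keep

# Toy input (made up): 10 seeds, 3 genes. Entry (0, 1) is consistently strong,
# every other entry flips sign from seed to seed and averages out to zero.
toy = [
    [[(5.0 + 0.01 * s) if (i, j) == (0, 1) else 0.5 * (-1) ** s
      for j in range(3)] for i in range(3)]
    for s in range(1, 11)
]
avg, keep = average_and_threshold(toy, sds=3.0)
```

Only the (0, 1) entry survives in this toy case, which is the behaviour one would want before exporting arcs to a .sif file.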
2/23/2007
2/22/2007
Data.xls vs. vsn-normalization.xls
2/21/2007
Hyper-para Derivation for VBSSM
The latest LaTeX file with comments is available at
http://www.cse.buffalo.edu/~juanli/hyper_update0302.rar
Code is available at
http://www.cse.buffalo.edu/~juanli/vbssm_v3.3.5.rar
2/16/2007
F_vs_K Plot on 'Data.xls'
Upper : top50 & reg (Control + Shift)
Lower: top50 (Control + Shift)
Download: PDF
Hi David,
I just finished the experiment. As you said this morning, there are 50 unique genes and 12 regulators, 62 in total. However, Data.xls contains only 56 item entries, so I used these 56 genes/regs for vbssm training.
The attached includes results of 'top50'(yesterday) and 'top50 & reg' (today) for comparison.
The tarball is the latest F_kk scripts with detailed comments, plus the vbssm toolbox. I have tested them both locally and on the cluster. Linda should be able to run them on her laptop directly. To run on the cluster, she has to check the working paths carefully first. If she has any questions, feel free to contact me.
Lastly, thank you very much for considering me to carry on our project. It has been a pleasure to work with you.
All the best,
-Juan
(Local Data Folder: Fkk0213)
2/15/2007
top50&reg.sif --- Info
I have manually corrected the parsing errors in 'top50&reg.sif' and imported it into Cytoscape.
# of unique genes (right-hand column) : 55
# of unique regulators (left-hand column) : 29
# of unique genes & regulators (left+right) : 64
# of regulation arcs: 195
They agree with the information provided by Cytoscape:
64 nodes + 195 arcs
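The counts above can be reproduced with a few set operations. Here is a minimal Python sketch (the project scripts are MATLAB), assuming a whitespace-delimited .sif file in which each line reads "source interaction target [target ...]". The function name and the toy lines are made up for illustration.

```python
def sif_summary(lines):
    """Count unique regulators (left column), genes (right column), nodes and arcs."""
    regulators, genes = set(), set()
    n_arcs = 0
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        source, targets = parts[0], parts[2:]
        regulators.add(source)
        for target in targets:
            genes.add(target)
            n_arcs += 1
    return {
        "genes": len(genes),
        "regulators": len(regulators),
        "nodes": len(genes | regulators),
        "arcs": n_arcs,
    }

# Toy network (not the real file): regA and gene1 appear in both columns.
toy_sif = [
    "regA pd gene1",
    "regA pd gene2",
    "gene1 pd gene2",
    "regB pd regA",
]
summary = sif_summary(toy_sif)
```

Because a name can appear in both columns, the node count is smaller than genes + regulators, which is why 55 + 29 gives 64 nodes rather than 84.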
top50.xls vs. top50&reg.sif
Left side = top50.xls (by Shabnam)
Right side = 55 unique genes in right-hand column of top50&reg.sif (by Juan)
Comparison Info:
34 match
16 left side only
21 right side only
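The comparison is plain set arithmetic. A small Python sketch follows (the project scripts are MATLAB); the function name and the gene names are hypothetical, and names are compared exactly after stripping whitespace, which is an assumption about how the lists were cleaned.

```python
def compare_gene_lists(left, right):
    """Return names found in both lists, only in the left, and only in the right."""
    left_set = {name.strip() for name in left}
    right_set = {name.strip() for name in right}
    return {
        "match": sorted(left_set & right_set),
        "left_only": sorted(left_set - right_set),
        "right_only": sorted(right_set - left_set),
    }

# Toy lists, not the real gene names.
left = ["geneA", "geneB", "geneC"]
right = ["geneB", "geneC ", "geneD", "geneE"]
cmp_result = compare_gene_lists(left, right)
```

A quick sanity check on the numbers above: match + left only = 34 + 16 = 50 (size of top50.xls) and match + right only = 34 + 21 = 55 (unique genes in the .sif file).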
Data Set 02/15
Hi Juan
Here are the original files Shabnam used. It seems that vsn_normalization.xls is the original file, so you might want to write your own script to extract the gene time profiles from that one directly. Please rerun the F vs K plots with these.
Please can you also check that the gene names in top50.xls are actually the genes in the right hand column of top50&reg.sif?
Thanks
David
-----Original Message-----
From: Shabnam Moobedmehdiabadi
Sent: Wed 2/14/2007 15:04
To: David Wild
Cc:
Subject: Data Sets
Dear David,
The vsn-normalization file is the original normalized data, with 2 technical replicates of 3 reps each. I used the combine.m script to combine the data into 6 reps for each gene and saved the result in timecourse.xls. I used timecourse.xls for the timecourse analysis, finding the HotellingT2 value for each gene.
The ready.xls file is timecourse.xls sorted on the HotellingT2 value. I used the top 50 genes in ready.xls for vbssm training and saved them in top50.xls. This is the file that I used to produce figure 3 in the NSF report. normalize.xls is the same data, but with a standard normalization transform applied as well.
About the new set of genes (50 genes + regulators): I extracted the genes and their relative expressions from timecourse.xls and saved them as ready_top50&reg.xls (in alphabetical order). I extracted the names of the 50 genes + regulators from the top50&reg.sif file you sent me (which is the same as the figure in networktop50&reg.tiff). Examining the gene names, I found that the yellow-circle genes in networktop50&reg.tiff are the same genes as in top50.xls, but the white-circle genes are not included in top50.xls.
Please let me know if my explanations are not clear.
Best regards,
Shabnam
2/14/2007
F_vs_K Plot on 'vsn-normalization.xls'
top50 & reg (normalized):
Download: PDF
top50 & reg (not-normalized):
Download: PDF
1. Reshape XLS File ...
# of Item Entries: 4295
Extract Control & Shift Data ...Done!
2. Count Unique Gene/Regs in 'top50&reg.sif' (corrected by Juan)...
# of Unique Genes: 55
# of Unique Regulators: 29
# of Unique Genes & Regulators: 64
**Genes = right-hand column of 'top50&reg.sif'
**Regulators = left-hand column of 'top50&reg.sif'
3. Find Genes/Regulators shown in both .sif and .xls files ...
# of Item Entries in xls file: 4295
a) Common Genes ...
# of Unique Genes in 'top50&reg.sif': 55
# of Genes hit: 50
b) Common Regulators ...
# of Unique Regulators in 'top50&reg.sif': 29
# of Regulators hit: 23
c) Common Genes & Regulators ...
# of Unique Genes + Regs in 'top50&reg.sif': 64
# of Genes + Regs hit: 56
# of Genes + Regs hit: 56
4. Generate inpn/yn data...Done.
5. Train VBSSM (10 seeds on cluster)
6. Plot F_K (see top) : normed vs. non-normed
(Local Data Folder: Fkk0215/non-normed & Fkk0216/normed)
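Steps 5 and 6 boil down to averaging F over the seeds for each k and reading the best k off the curve. A minimal Python sketch (the project scripts are MATLAB), assuming F is the variational lower bound so that larger is better. The function name and the F values are made-up placeholders, not experiment output.

```python
from statistics import mean

def best_k(f_by_k):
    """f_by_k maps k -> list of F values, one per seed.

    Returns the k with the highest seed-averaged F, plus the averaged curve.
    """
    mean_f = {k: mean(vals) for k, vals in f_by_k.items()}
    k_star = max(mean_f, key=mean_f.get)
    return k_star, mean_f

# Placeholder values for two seeds and three choices of k.
toy_f = {
    1: [-120.0, -118.0],
    2: [-100.0, -104.0],
    3: [-110.0, -108.0],
}
k_star, mean_f = best_k(toy_f)
```

Plotting mean_f against k gives the F_vs_K curve; running it once per data set would give the normed vs. non-normed comparison.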
1/25/2007
ARD Results
Model Parameters:
time_window = [600 3480]; % ideal time window 600-3000(min) == 10-50(hr)
sample = 12; % time points
noise = 0.1; % set noise to 10%
vbits = 300; % maximum number of iterations
kkrange = [1:16]; % hidden state dimension
stdrange = [0.5:0.5:10]; % standard deviation range to produce ROC curve
murange = [.1 .5 1 2 4 8 12 16]; % magnitude of the mean of ARD prior
Target 1: Compare the AUC improvement of 1-arc,2-arc and 3-arc (10 seeds)
1-a) replicates = 1: kk_range = [1:16] / 16-in-1
1-b) replicates = 4: kk_range = [1:16] / 16-in-1
1-c) replicates = 8: kk_range = [1:16] / 16-in-1
1-d) reps = [1 ; 4; 8]: kk_range = [1:16]
Target 2: Capture Informative Interaction (arc) / k
An arc/k is called informative if the AUC obtained with the prior (priored-auc) is better than the AUC obtained by manual correction (corrected-auc)
2-a) Info Table: pdf / xls
2-b) replicates = 1: arc/k statistics in bar-chart or hit the common Informative arc / k
2-c) replicates = 4: arc/k statistics in bar-chart or hit the common Informative arc / k
2-d) replicates = 8: arc/k statistics in bar-chart or hit the common Informative arc / k
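The bar-chart statistics for Target 2 can be tallied as below. This is a Python sketch (the project scripts are MATLAB) that assumes one record per (arc, k) pair holding both AUC values. The record layout, names and numbers are hypothetical.

```python
from collections import Counter

def informative_counts(results):
    """Tally informative (arc, k) pairs.

    results: iterable of (arc, k, priored_auc, corrected_auc).
    A pair is informative when priored_auc > corrected_auc.
    """
    per_k, per_arc = Counter(), Counter()
    for arc, k, priored_auc, corrected_auc in results:
        if priored_auc > corrected_auc:
            per_k[k] += 1
            per_arc[arc] += 1
    return per_k, per_arc

# Placeholder records: arcs are (r, s) index pairs, AUC values are invented.
toy_results = [
    ((1, 2), 4, 0.81, 0.78),
    ((1, 2), 5, 0.77, 0.79),
    ((3, 7), 4, 0.83, 0.80),
    ((3, 7), 5, 0.82, 0.80),
]
per_k, per_arc = informative_counts(toy_results)
```

per_k feeds the bar chart over k, and the arcs with the highest per_arc counts are the candidates for "common informative" arcs.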
1/15/2007
ARD Experiments
Hello David.
Here is the experiment I have currently tasked Juan Li with:
[mu is the location of the ARD prior's mean, and is a matrix]
1) mu=0, set random seed, run SSM, obtain ROC curve ROC-A.
2) choose an arc r->s which SSM doesn't get right at all.
3) plot ROC curve B, ROC-B, obtained from ROC-A but where arc r->s is
manually corrected.
4) rerun SSM (from same random seed as above), with mu_{r,s} =
[0 .1 .5 1 2 4 8 12 16]
5) in each case plot the new ROC curve C, {ROC-C}_{mu_{r,s}
=0 .1 .5 1 2 4 8 12 16}.
By construction, ROC-B must be at least as good as ROC-A.
Without local minima issues, we would hope that ROC-C were better
than ROC-A.
But we would hope that ideally ROC-C is better than ROC-B, which
would show that providing information about r->s provides *more than
just this isolated help*.
Will follow up on the discussion we had about the knock-out.
-Matt
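To make step 3 concrete, here is a small Python sketch (the project scripts are MATLAB) of why ROC-B cannot be worse than ROC-A. It assumes each candidate arc has a confidence score and a true/false label, and that "manual correction" means forcing one arc's score to the extreme that matches its label. The AUC is computed as the fraction of (true arc, false arc) pairs ranked correctly. All names and numbers are illustrative.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def correct_arc(scores, labels, idx):
    """Return a copy of scores with arc idx forced to its correct extreme."""
    fixed = list(scores)
    fixed[idx] = max(scores) + 1.0 if labels[idx] else min(scores) - 1.0
    return fixed

# Toy problem: arc 1 is a true arc that the model scores badly.
scores = [0.9, 0.2, 0.8, 0.1, 0.4]
labels = [1, 1, 0, 0, 0]
auc_a = auc(scores, labels)
auc_b = auc(correct_arc(scores, labels, idx=1), labels)
```

ROC-C is the interesting case: it needs a fresh run with mu_{r,s} set, so it cannot be derived from ROC-A the way ROC-B is.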
1/12/2007
RUN safely on "underground"
Introduction:
Tube is a 34-processor cluster, owned by Dr. Beal and administered by CSE, available to members of his group for algorithm prototyping. Each node has 2GB ram, 80GB hd, and dual core 3.2GHz Intel Xeon processors. underground.cse.buffalo.edu is the head-node, accessible only via ssh. Matlab is available with the Distributed Computing Toolbox.
Tips:
1. Make sure the script runs on your local machine first. It is hard to terminate a bad, heavy job on the cluster; I have no authority to do so.
2. Write a LOG file so you can see where a job gets stuck.
3. Save any very large result variable to a FILE instead of returning it directly. I have lost results by returning a big variable, probably because of the Java VM.
4. After creating a folder on the cluster, make sure it is writable, or you will not be able to write files into it. I use the command 'chmod 777 foldername'.
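Tips 2 to 4 follow a generic pattern: check that the folder is writable, log progress to a file, save the result to disk and hand back only its path. The sketch below shows that pattern in Python purely as an illustration (the actual jobs are MATLAB functions run through dfeval). The function name, file names and the stand-in result are hypothetical.

```python
import json
import os
import tempfile

def run_job(out_dir, seed):
    """Run one job, logging progress and saving the result to a file."""
    if not os.access(out_dir, os.W_OK):
        raise PermissionError(f"{out_dir} is not writable")
    log_path = os.path.join(out_dir, f"job_{seed}.log")
    result_path = os.path.join(out_dir, f"result_{seed}.json")

    def log(message):
        # Append so the log survives if the job dies part-way through.
        with open(log_path, "a") as fh:
            fh.write(message + "\n")

    log(f"seed {seed}: started")
    result = {"seed": seed, "values": [seed * i for i in range(5)]}  # stand-in for a big result
    with open(result_path, "w") as fh:
        json.dump(result, fh)
    log(f"seed {seed}: saved {result_path}")
    return result_path  # return the small path, not the big variable

work_dir = tempfile.mkdtemp()
saved = run_job(work_dir, seed=1)
```

The caller then loads the file from the returned path once the job has finished.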
How to run on underground?
1. Invoke matlab
>matlab
2. Specify absolute paths to tell the workers where to load the scripts and write the log.
> pppp = {'/home/csgrad/juanli/work/toolkits/vbssm_v3.3.5',...
'/home/csgrad/juanli/work/code'};
3. Enjoy the underground trip
> tic; s = dfeval(@func,num2cell(1),'PathDependencies',pppp);toc;
Examples are available at http://www.cse.buffalo.edu/~juanli/scripts
MATLAB online DCT help