Tuesday, July 21, 2009

Google Code Jam 2009

Just got to know that Google Code Jam is on this year too. Definitely will participate. Last year I got past Qualification round. This year I will have to practice more and hopefully will get past Round 1.

Monday, July 20, 2009

OSCON 2009

Just wanted to say that I am excited to be at OSCON 2009 conference. :) Also excited to meet other GSOC students at BoF session.

Sunday, June 28, 2009

Profiling code: Breadth first search surprise

At this point in the project I am almost done with writing parser so I did some profiling of the code.

My system is Ubuntu 9.04, ruby 1.8.7 [i486-linux], Intel Core 2 Duo P8600 @2.4GHz

I created test_phyloxml_big.rb test file. It has test_next_tree method which calls next_tree on the phyloxml file until end of file is reached. Here follow results on the ncbi_taxonomy_mollusca.xml file which is 1.5MB large with 5632 external nodes.

It takes around 7.5min to finish test_phyloxml_big.rb test.
(Finished in 443.231507 seconds.
Finished in 457.255576 seconds. )

output of the top:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21222 diana 20 0 29020 25m 1928 R 95 0.9 5:32.92 ruby

So it looks like memory footprint is small ~ 25MBs. CPU usage is 95% (i have two processors, so it is completely using one of them).

I did the same thing for tol_life_xml but it took forever to finish. (more than 3 hours)


For curiosity I created a method next_tree_dummy. All it does is to reader.read from file until it reaches element.

tree of life xml (file size: 45.1MB) - Finished in 3.177743 seconds.

mollusca xml (1.5MB) - Finished in 0.252993 seconds.

metazoa xml (32.3MB) - Finished in 3.393467 seconds.

I think this shows that libxml is really fast.


I also did profiling with ruby-prof on ncbi mollusca taxonomy file.

Here is partial output:


diana@diana-ubuntu:~/bioruby$ ruby-prof -p graph test/unit/bio/db/test_phyloxml_big.rb
Loaded suite /usr/bin/ruby-prof
Started
.
Finished in 1345.4039 seconds.

1 tests, 0 assertions, 0 failures, 0 errors
Thread ID: 3084157360
Total Time: 1257.6

[..]
-------------------------------------------------------------------------------
1257.56 0.60 0.00 1256.96 2/2 Bio::TestPhyloXMLBig#test_next_tree
100.00% 0.05% 1257.56 0.60 0.00 1256.96 2 Bio::PhyloXML#next_tree
0.08 0.03 0.00 0.05 8107/16210 Bio::PhyloXML#parse_attributes
0.00 0.00 0.00 0.00 1/243188 String#==
0.00 0.00 0.00 0.00 8104/24322 LibXML::XML::Reader#[]
6.18 0.97 0.00 5.21 48616/48616 Bio::PhyloXML#parse_clade_elements
0.14 0.14 0.00 0.00 48623/97244 LibXML::XML::Reader#read
0.27 0.17 0.00 0.10 48644/875134 Bio::PhyloXML#is_element?
0.11 0.04 0.00 0.07 16206/32442 Class#new
0.04 0.04 0.00 0.00 16206/116034 Kernel#==
0.00 0.00 0.00 0.00 4/72929 Bio::PhyloXML#parse_simple_elements
0.07 0.01 0.00 0.06 8102/8102 Bio::Tree#add_node
0.44 0.31 0.00 0.13 97243/137758 Bio::PhyloXML#is_end_element?
1249.05 0.02 0.00 1249.03 8102/8102 Bio::Tree#parent
0.58 0.07 0.00 0.51 8102/8102 Bio::Tree#add_edge
-----------------------------------------------------------------------------
1249.05 0.02 0.00 1249.03 8102/8102 Bio::PhyloXML#next_tree
99.32% 0.00% 1249.05 0.02 0.00 1249.03 8102 Bio::Tree#parent
1249.03 0.13 0.00 1248.90 8102/8102 Bio::Tree#path
0.00 0.00 0.00 0.00 8102/72975 Array#[]
--------------------------------------------------------------------------------
1249.03 0.13 0.00 1248.90 8102/8102 Bio::Tree#parent
99.32% 0.01% 1249.03 0.13 0.00 1248.90 8102 Bio::Tree#path
0.04 0.01 0.00 0.03 16204/164638052 Hash#[]
1248.82 0.27 0.00 1248.55 8102/8102 Bio::Pathway#bfs_shortest_path
0.03 0.03 0.00 0.00 24306/116034 Kernel#==
0.01 0.01 0.00 0.00 16204/72975 Array#[]
--------------------------------------------------------------------------------
1248.82 0.27 0.00 1248.55 8102/8102 Bio::Tree#path
99.30% 0.02% 1248.82 0.27 0.00 1248.55 8102 Bio::Pathway#bfs_shortest_path
0.26 0.19 0.00 0.07 142736/164638052 Hash#[]
0.07 0.07 0.00 0.00 75419/116034 Kernel#==
1248.18 115.50 0.00 1132.68 8102/8102 Bio::Pathway#breadth_first_search
0.04 0.04 0.00 0.00 67317/67330 Array#unshift
--------------------------------------------------------------------------------
1248.18 115.50 0.00 1132.68 8102/8102 Bio::Pathway#bfs_shortest_path
99.25% 9.18% 1248.18 115.50 0.00 1132.68 8102 Bio::Pathway#breadth_first_search
136.52 92.65 0.00 43.8765785140/164638052 Hash#[]
22.53 22.53 0.00 0.0032900672/32900681 Array#shift
973.59 324.56 0.00 649.0332892570/32892570 Hash#each_key
0.04 0.03 0.00 0.01 24306/98702064 Hash#[]=
[..]


99.32% of the total time is spent in Bio::Tree#parent method and the methods it calls.
Bio::Tree#parent calls Bio::Tree#path which calls Bio::Pathways#bfs_shortest_path which in turn calls Bio::Pathway#breadth_first_search (99.25% of total time is spent in this method and its sub calls).

This was a huge surprise for me. Why would breadth first search be needed if I just want to know the parent node of the current node.

The reason I am using Bio::Tree#parent is because I have to keep track of the current node I am parsing. When I have reached element i set the current_node to the parent of the node I just parsed.

I see here two options.

1) Keep track of the current node myself (by putting references in an array and pushing and poping accordingly). Thus I won't have to call the Bio::Tree#parent method.

2) Update Bio::Tree/ Bio::Node class so that nodes contain references to their parents. (thus not needing to call breadth first search).

What do you think?

(Code is available at github )

Monday, June 15, 2009

A quote about Microsoft

I somehow ended up reading this site and read this awesome quote:

"The day Microsoft makes something that doesn't suck is probably the day they start making vacuum cleaners." -Ernst Jan Plugge

No offense to anybody, but it is funny!

Sunday, May 31, 2009

Calling method from string in Ruby

A week and a little bit has passed since the beginning of coding for Google Summer of Code 2009. I am getting more and more excited since I am learning more of advanced Ruby programming.

So I am parsing XML into objects using LibXML::XML::Reader. I had this code:

 1     def parse_events()
2 events = Events.new
3 @reader.read #go to next element
4 #read while have reached end of events
5 while not(is_end_element?('events')) do
6 if is_element?('type')
7 @reader.read
8 events.type = @reader.value
9 @reader.read
10 has_reached_end_tag?('type')
11 end
12 if is_element?('duplications')
13 @reader.read
14 events.duplications = @reader.value.to_i
15 @reader.read
16 has_reached_end_tag?('duplications')
17 end
18 if is_element?('speciations')
19 @reader.read
20 events.speciations = @reader.value.to_i
21 @reader.read
22 has_reached_end_tag?('speciations')
23 end
24 if is_element?('losses')
25 @reader.read
26 events.losses = @reader.value.to_i
27 @reader.read
28 has_reached_end_tag?('losses')
29 end
30 if is_element?('confidence')
31 events.confidence = parse_confidence
32 end
33 @reader.read
34 end
35 return events
36 end #parse_events

This was too much of code duplication, but the problem was that each time different method of the events object is called.

After some searching on interwebs, luckily, I found out that ruby has a method send, which invokes the method identified by symbol and passes any arguments.

So I created such method:

def parse_simple_element(object, name)
if is_element?(name)
@reader.read
object.send("#{name}=", @reader.value)
@reader.read
has_reached_end_tag?(name)
end
end

So now the method parse_events() can be rewritten like this:

def parse_events()
events = Events.new
@reader.read #go to next element
while not(is_end_element?('events')) do
['type', 'duplications', 'speciations', 'losses'].each { |elmt|
parse_simple_element(events, elmt)
}
if is_element?('confidence')
events.confidence = parse_confidence
end
@reader.read
end
return events
end #parse_events

Much shorter!! Now I run the unit tests and it works! No worries that I broke something while refactoring code. In cases like this I really appreciate the power of unit testing.

Monday, May 25, 2009

Hello World!

I wanted to have a blog for quite a long time, and now I have a good reason to start it: Google Summer of Code 2009.

My project is to implement phyloXML support in BioRuby and I am working for NESCent organization.

Here is the project abstract:

Phylogenetic trees are used in important applications, including phylogenomics, phylogeography, gene function prediction, cladistics and the study of molecular evolution. In order to foster successful analysis, exchange, storage and reuse of phylogenetic trees and associated data, the phyloXML format was developed. It can store all necessary information about the phylogenetic tree, like clade, sequence, name and distance. The goal of this project is to implement support for phyloXML in BioRuby.

My project page is: https://www.nescent.org/wg_phyloinformatics/PhyloSoC:PhyloXML_support_in_BioRuby

The source code is on GitHub: http://github.com/rozziite/bioruby/tree/master

I have been using Open Source software for quite a long time already and I am now excited to contribute back to community.