My system is Ubuntu 9.04, ruby 1.8.7 [i486-linux], Intel Core 2 Duo P8600 @2.4GHz
I created test_phyloxml_big.rb test file. It has test_next_tree method which calls next_tree on the phyloxml file until end of file is reached. Here follow results on the ncbi_taxonomy_mollusca.xml file which is 1.5MB large with 5632 external nodes.
It takes around 7.5min to finish test_phyloxml_big.rb test.
(Finished in 443.231507 seconds.
Finished in 457.255576 seconds. )
output of the top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21222 diana 20 0 29020 25m 1928 R 95 0.9 5:32.92 ruby
So it looks like memory footprint is small ~ 25MBs. CPU usage is 95% (i have two processors, so it is completely using one of them).
I did the same thing for tol_life_xml but it took forever to finish. (more than 3 hours)
For curiosity I created a method next_tree_dummy. All it does is to reader.read from file until it reaches element.
tree of life xml (file size: 45.1MB) - Finished in 3.177743 seconds.
mollusca xml (1.5MB) - Finished in 0.252993 seconds.
metazoa xml (32.3MB) - Finished in 3.393467 seconds.
I think this shows that libxml is really fast.
I also did profiling with ruby-prof on ncbi mollusca taxonomy file.
Here is partial output:
diana@diana-ubuntu:~/bioruby$ ruby-prof -p graph test/unit/bio/db/test_phyloxml_big.rb
Loaded suite /usr/bin/ruby-prof
Started
.
Finished in 1345.4039 seconds.
1 tests, 0 assertions, 0 failures, 0 errors
Thread ID: 3084157360
Total Time: 1257.6
[..]
-------------------------------------------------------------------------------
1257.56 0.60 0.00 1256.96 2/2 Bio::TestPhyloXMLBig#test_next_tree
100.00% 0.05% 1257.56 0.60 0.00 1256.96 2 Bio::PhyloXML#next_tree
0.08 0.03 0.00 0.05 8107/16210 Bio::PhyloXML#parse_attributes
0.00 0.00 0.00 0.00 1/243188 String#==
0.00 0.00 0.00 0.00 8104/24322 LibXML::XML::Reader#[]
6.18 0.97 0.00 5.21 48616/48616 Bio::PhyloXML#parse_clade_elements
0.14 0.14 0.00 0.00 48623/97244 LibXML::XML::Reader#read
0.27 0.17 0.00 0.10 48644/875134 Bio::PhyloXML#is_element?
0.11 0.04 0.00 0.07 16206/32442 Class#new
0.04 0.04 0.00 0.00 16206/116034 Kernel#==
0.00 0.00 0.00 0.00 4/72929 Bio::PhyloXML#parse_simple_elements
0.07 0.01 0.00 0.06 8102/8102 Bio::Tree#add_node
0.44 0.31 0.00 0.13 97243/137758 Bio::PhyloXML#is_end_element?
1249.05 0.02 0.00 1249.03 8102/8102 Bio::Tree#parent
0.58 0.07 0.00 0.51 8102/8102 Bio::Tree#add_edge
-----------------------------------------------------------------------------
1249.05 0.02 0.00 1249.03 8102/8102 Bio::PhyloXML#next_tree
99.32% 0.00% 1249.05 0.02 0.00 1249.03 8102 Bio::Tree#parent
1249.03 0.13 0.00 1248.90 8102/8102 Bio::Tree#path
0.00 0.00 0.00 0.00 8102/72975 Array#[]
--------------------------------------------------------------------------------
1249.03 0.13 0.00 1248.90 8102/8102 Bio::Tree#parent
99.32% 0.01% 1249.03 0.13 0.00 1248.90 8102 Bio::Tree#path
0.04 0.01 0.00 0.03 16204/164638052 Hash#[]
1248.82 0.27 0.00 1248.55 8102/8102 Bio::Pathway#bfs_shortest_path
0.03 0.03 0.00 0.00 24306/116034 Kernel#==
0.01 0.01 0.00 0.00 16204/72975 Array#[]
--------------------------------------------------------------------------------
1248.82 0.27 0.00 1248.55 8102/8102 Bio::Tree#path
99.30% 0.02% 1248.82 0.27 0.00 1248.55 8102 Bio::Pathway#bfs_shortest_path
0.26 0.19 0.00 0.07 142736/164638052 Hash#[]
0.07 0.07 0.00 0.00 75419/116034 Kernel#==
1248.18 115.50 0.00 1132.68 8102/8102 Bio::Pathway#breadth_first_search
0.04 0.04 0.00 0.00 67317/67330 Array#unshift
--------------------------------------------------------------------------------
1248.18 115.50 0.00 1132.68 8102/8102 Bio::Pathway#bfs_shortest_path
99.25% 9.18% 1248.18 115.50 0.00 1132.68 8102 Bio::Pathway#breadth_first_search
136.52 92.65 0.00 43.8765785140/164638052 Hash#[]
22.53 22.53 0.00 0.0032900672/32900681 Array#shift
973.59 324.56 0.00 649.0332892570/32892570 Hash#each_key
0.04 0.03 0.00 0.01 24306/98702064 Hash#[]=
[..]
99.32% of the total time is spent in Bio::Tree#parent method and the methods it calls.
Bio::Tree#parent calls Bio::Tree#path which calls Bio::Pathways#bfs_shortest_path which in turn calls Bio::Pathway#breadth_first_search (99.25% of total time is spent in this method and its sub calls).
This was a huge surprise for me. Why would breadth first search be needed if I just want to know the parent node of the current node.
The reason I am using Bio::Tree#parent is because I have to keep track of the current node I am parsing. When I have reached element i set the current_node to the parent of the node I just parsed.
I see here two options.
1) Keep track of the current node myself (by putting references in an array and pushing and poping accordingly). Thus I won't have to call the Bio::Tree#parent method.
2) Update Bio::Tree/ Bio::Node class so that nodes contain references to their parents. (thus not needing to call breadth first search).
What do you think?
(Code is available at github )
No comments:
Post a Comment