__author__ = "$Author: jurgenfd $"
__revision__ = "$Revision: 9 $"
__date__ = "$Date: 2007-01-11 20:40:26 +0100 (Thu, 11 Jan 2007) $"

"""
The goal of these routines is to provide a Python interface for writing,
reading, analyzing, and modifying NMR-STAR and mmCIF files and objects.

NOTES:
    * STAR features not supported (not used in NMR-STAR and mmCIF files):
        - Nested loops
        - Global block
    * Limitations to content:
        - A STAR file should have one and only one data_ tag, and it should
            be the first thing in the file.
        - Comments on input are ignored.
    * Limitations to the layout (for fast parsing):
        - Save frames should start and end with save_ at the beginning of
            the line.
        - Perhaps some unknown ;-(

SPEED ISSUES:
    * There was a good Python API written by Jens Linge and Lutz Ehrlich (EMBL).
        It can handle many more STAR features and variations in content
        and layout. The current API was written to handle NMR-STAR files on
        the order of several Mb, for which the EMBL API demanded a lot of
        resources. Parsing a 1 Mb STAR file with a huge table of mostly numeric
        values required a peak of 50 Mb of memory and about 2 hours with
        StarFormat. My guess was that this could be much faster if at least the
        lowest level of the dataNode value (where it is a string or number)
        used native Python objects instead.
        Another issue is that a large text object, when parsed by the
        EMBL API, got copied over and over, resulting in a loss of speed and a
        significant increase in memory use.
    * This API uses native Python objects for a list of tags (looped or free),
        with user-defined objects above that level, where speed and memory are
        less of an issue. It parses a 10 Mb STAR file in 25 seconds with a peak
        memory usage of 45 Mb. The average value in the file is 3 chars long. A
        Python string object has a reference count (4), type pointer (4), malloc
        overhead (4), trailing \0 (1) and the content (rounded up to multiples
        of 4). Ignoring the content rounding we go from 3 bytes to 20 bytes
        (factor 7) in total for the average string in the example file.
        Considering some overhead for the objects on top of the string objects,
        the 45 Mb doesn't look that bad.
    * Compare this with the C STARLIB2 from Steve Mading (BMRB), which takes 12
        cpu seconds and 18 Mb peak memory usage. For STARLIBJ (Java) Steve
        got 40 Mb peak memory usage and 57 seconds. Compared with this Python
        API, memory usage is slightly better but speed is a factor 2 slower.
        This was using the best Java engine we had. Another one we tested was
        a factor 3 slower.
    * Added yet another STAR parser in a Java project: Wattos.Star.STARParser.
        It is optimized to be fast and memory efficient.
    * Summary:

    Test on Windows using a single 2 GHz Pentium IV CPU
    Language    STAR file size (Mb)    Time (s)    RAM (Mb)    Notes
    ###############################################################################
    C           10                     7.2         18          Using Steve's STARLIB2
    Java        10                     57          40          Tested by Steve
    JavaNEW     10                     5.2         100         New parser based on SANSj: Wattos.Star.STARParser
    Python      10                     25          45          Written at BMRB
    Python*     1*                     7200*       50*         Written at EMBL
    ###############################################################################
    * Labeled with an asterisk because the test file had to be truncated and the
      test was run on an older machine. Their API was developed for small files
      (< 100 kb).

    * References:
        S. R. Hall and A. P. F. Cook. STAR dictionary definition language:
            initial specification. J. Chem. Inf. Comput. Sci. 35:819-825, 1995.
        S. R. Hall and N. Spadaccini. The STAR file: detailed specifications.
            J. Chem. Inf. Comput. Sci. 34:505-508, 1994.
        J. P. Linge, M. Nilges, and L. Ehrlich. StarDOM: from STAR format to XML.
            J. Biomol. NMR, 1999.
        N. Spadaccini and C. B. Hall. Star_base: accessing STAR file data.
            J. Chem. Inf. Comput. Sci. 34:509-516, 1994.
        J. Westbrook and P. E. Bourne. STAR/mmCIF: An ontology for
            macromolecular structure. Bioinformatics 16(2):159-168, 2000.
"""
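
# Illustrative sketch only, not the parser implemented in this module. It shows
# one way the layout limitation noted in the docstring (save_ at the beginning
# of the line) can make it cheap to find save frame boundaries with a plain
# line scan. The function name and the returned structure are assumptions made
# for this example. A multi-line text value that happens to contain a line
# starting with save_ would confuse this simple scan.
def split_save_frames(text):
    """Return a list of (frame_name, body_lines) tuples found in text."""
    frames = []
    name = None  # name of the save frame currently open, if any
    body = []
    for line in text.splitlines():
        if line.startswith("save_"):
            token = line.split()[0]
            if token == "save_":
                # A bare save_ closes the frame that is currently open.
                if name is not None:
                    frames.append((name, body))
                name = None
                body = []
            else:
                # save_<name> opens a new frame.
                name = token[len("save_"):]
                body = []
        elif name is not None:
            body.append(line)
    return frames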


verbosity = 2
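
# Illustrative sketch only, not part of the original API. It writes out the
# per-string memory estimate from the SPEED ISSUES notes in the docstring as
# plain arithmetic. The field sizes are the ones quoted there; the helper name
# and its arguments are assumptions made for this example. The itemized fields
# give 16 bytes for a 3-character value, while the notes quote 20 bytes, so the
# remainder presumably comes from fields that are not itemized there.
def estimate_string_bytes(content_length, round_content=False):
    """Return the estimated size in bytes of one Python string object."""
    reference_count = 4
    type_pointer = 4
    malloc_overhead = 4
    trailing_null = 1
    content = content_length
    if round_content:
        # Round the content up to the next multiple of 4 bytes.
        content = (content_length + 3) // 4 * 4
    return (reference_count + type_pointer + malloc_overhead
            + trailing_null + content)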