source: trunk/doc/specifications/technical.html @ 3679

Last change on this file since 3679 was 3679, checked in by Jari Häkkinen, 15 years ago

Changing the pesky "a (ä) character to a.

1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "">
3  $Id: technical.html 3679 2007-08-17 07:18:29Z jari $
5  Copyright (C) 2005 Samuel Andersson, Jari Hakkinen, Nicklas Nordborg
6  Copyright (C) 2006 Jari Hakkinen
8  This file is part of BASE - BioArray Software Environment.
9  Available at
11  BASE is free software; you can redistribute it and/or
12  modify it under the terms of the GNU General Public License
13  as published by the Free Software Foundation; either version 2
14  of the License, or (at your option) any later version.
16  BASE is distributed in the hope that it will be useful,
17  but WITHOUT ANY WARRANTY; without even the implied warranty of
18  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
19  GNU General Public License for more details.
21  You should have received a copy of the GNU General Public License
22  along with this program; if not, write to the Free Software
23  Foundation, Inc., 59 Temple Place - Suite 330,
24  Boston, MA  02111-1307, USA.
28  <title>BASE - Initial technical specification</title>
29  <link rel=stylesheet type="text/css" href="../styles.css">
33<div class="navigation">
34  <a href="../index.html">BASE</a>
35  <img src="../next.gif">
36  Initial technical specification
39<h1>Initial technical specification</h1>
41<div class="abstract">
43  <ol>
44  <li><a href="#1">Background</a>
45  <li><a href="#2">Requirements for BASE 2.0</a>
46  <li><a href="#3">Generic solution</a>
47  <li><a href="#4">Technical details</a>
48  <li><a href="#5">Work items</a>
49  </ol>
51  <p class="authors">
52  <b>Created by:</b> Nicklas<br>
53  <b>Contributions by:</b> Carl, Jari, Per<br>
54  <b>Last updated:</b> $Date: 2007-08-17 07:18:29 +0000 (Fri, 17 Aug 2007) $
55  </p>
58<a name="1">
59<h2>1. Background</h2>
61  <p>
62  The current BASE 1.2 implementation uses a 3-tier architecture. At the bottom is
63  the data layer running MySQL or Postgres. In the middle is the logic layer with
64  PHP scripts running on an Apache web server. The top layer is the HTML
65  presentation in the browser.
66  </p>
67  <p>
68  This follows a classical and well-known design for web applications. However,
69  the actual implementation of it fails at several points, especially at the
70  logic layer. Here are some examples:
71  </p>
73  <ul>
74  <li>
75    <p>
76    Several of the PHP scripts have too much responsibility. For example, the
77    plotting function uses the script "". This script is
78    responsible both for generating the HTML where the user selects parameters
79    for the plot and for generating the final graphs in the form of images or
80    postscript/pdf files.
81    </p>
82    <p>
83    Another example is the file "trans_create.phtml" which is used for filtering
84    BioAssay data. It does the following:
85    </p>
86    <ul>
87      <li>generate the HTML where the user creates the filter</li>
88      <li>generate the HTML where the user specifies input parameters for a job</li>
89      <li>store and fetch "presets", i.e. filter definitions that the user may
90        want to reuse</li>
91      <li>do the actual filtering</li>
92      <li>start the selected job</li>
93    </ul>
94  </li>
95  <li>
96    <p>
97    There are too many dependencies between different parts of the PHP scripts
98    and classes. This is actually the same problem as the first point but on a
99    wider scale.
100    </p>
101    <p>
102    I will use the plot function as an example again. When the interface is
103    presented to the user, he/she is supposed to select the values to plot on
104    the X and Y axes respectively. The lists of values to choose from are
105    generated by the BioAssay object. This is OK, since the BioAssay object is
106    the only object that knows what data is available. When the user has
107    made the selection, the information is passed to the BioAssay object, which
108    fetches the data and gives it back to the plot function. This seems like a
109    good idea, but if one looks deeper into the code there is a very tight
110    coupling between the plot function and the BioAssay object. The BioAssay
111    object has methods such as "getDataForPlot" and "getPlotType", which do not
112    belong there. The BioAssay object should not need to know anything about plotting
113    or how the data should be used. It should only have a "getData" method.
114    </p>
115    <p>
116    As it is now, the plot function will only plot data from a BioAssay, but what
117    if we want to plot data from a BioAssaySet? The current design makes it hard
118    to change the plot function to accomplish this.
119    </p>
120  </li>
121  <li>
122    <p>
123    SQL commands are scattered around in several different places. This will
124    become a bigger problem as the code grows and the wish to support other
125    databases increases. How do we verify that all SQL queries also work on,
126    for example, Oracle? And, once we have done that, what about the next version of
127    BASE?
128    </p>
129  </li>
130  </ul>
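  <p>
  To make the coupling problem concrete, here is a minimal Java sketch of the
  alternative argued for above: data objects expose only a generic "getData"
  method, and the plot code depends on nothing else. Java is used because
  section 4.2 proposes it for the core. All names (DataSource, BioAssay,
  BioAssaySet, PlotSketch and the column names) are made up for this
  illustration and are not taken from the BASE code.
  </p>

```java
import java.util.Arrays;

// Illustration only: DataSource, BioAssay, BioAssaySet and PlotSketch are
// hypothetical names, not classes taken from the BASE code base.
class PlotSketch {
    /** Pairs up X and Y values. Works for any DataSource, not just a BioAssay. */
    static double[][] points(DataSource source, String xColumn, String yColumn) {
        double[] x = source.getData(xColumn);
        double[] y = source.getData(yColumn);
        int n = Math.min(x.length, y.length);
        double[][] result = new double[n][];
        for (int i = 0; i < n; i++) {
            result[i] = new double[] { x[i], y[i] };
        }
        return result;
    }

    public static void main(String[] args) {
        BioAssay first = new BioAssay(new double[] { 1, 2 }, new double[] { 10, 20 });
        BioAssay second = new BioAssay(new double[] { 3 }, new double[] { 30 });
        DataSource set = new BioAssaySet(first, second);
        System.out.println(Arrays.deepToString(points(first, "intensity1", "intensity2")));
        System.out.println(Arrays.deepToString(points(set, "intensity1", "intensity2")));
    }
}

/** The only thing a data object exposes: named columns of values. */
interface DataSource {
    double[] getData(String column);
}

class BioAssay implements DataSource {
    private final double[] intensity1;
    private final double[] intensity2;

    BioAssay(double[] intensity1, double[] intensity2) {
        this.intensity1 = intensity1.clone();
        this.intensity2 = intensity2.clone();
    }

    @Override
    public double[] getData(String column) {
        switch (column) {
            case "intensity1": return intensity1.clone();
            case "intensity2": return intensity2.clone();
            default: throw new IllegalArgumentException("Unknown column: " + column);
        }
    }
}

/** A set of assays answers the same question by concatenating its members. */
class BioAssaySet implements DataSource {
    private final BioAssay[] assays;

    BioAssaySet(BioAssay... assays) {
        this.assays = assays.clone();
    }

    @Override
    public double[] getData(String column) {
        return Arrays.stream(assays)
                     .flatMapToDouble(assay -> Arrays.stream(assay.getData(column)))
                     .toArray();
    }
}
```

  <p>
  Because the plot code only sees the interface, plotting a BioAssaySet
  requires no change to the plot function in this sketch.
  </p>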
132  <p><b>To summarize:</b><br>
133  The basic problem is that the division into three layers has been unsuccessful.
134  Code that belongs to the data layer (SQL queries) is scattered among the scripts
135  in the logic layer. Several PHP scripts perform functions both for presenting
136  the data and for manipulating it. In other words, there is no clear division between the
137  data layer, the logic layer and the presentation layer.
138  </p>
140<a name="2">
141<h2>2. Requirements for BASE 2.0</h2>
143  <p>
144  The main goal for BASE 2.0 is to make the division between data, logic and
145  presentation clear.
146  </p>
148  <ul>
149    <li>
150    It should be possible to add support for other databases without having
151    to go through every piece of code. The requirements for the capabilities
152    of the database system must be well documented.
153    </li>
154    <li>
155    Expose an API from the logic layer that is accessible from at least Perl and
156    C++. If possible, the API should also be accessible from Java. Any other
157    languages are considered a bonus.
158    </li>
159    <li>
160    The design must allow calculation intensive parts (i.e. plugins) to be
161    executed on remote servers, using a suitable language for the task.
162    </li>
163    <li>
164    It should be possible to add support for other import and export file formats,
165    including very cryptic ones (i.e. everything other than tab-separated text files).
166    </li>
167    <li>
168    It must be possible to run a BASE server without the need to purchase any
169    additional software. Any 3rd-party software required by BASE should be
170    freely available. Optional software that is not required for the basic
171    operation of BASE is exempt from this restriction.
172    </li>
173  </ul>
176<h3>2.1 Possible features of BASE 2.0</h3>
177  <p>
178  Here are some features that are not requirements, but might be nice to have. We
179  should try to include as much as possible, but if we are short of time some
180  features may have to wait until a later version.
181  </p>
182  <ul>
183    <li>
184    Add support for external user authentication, for example via LDAP. A
185    minimum requirement of the authentication system will be the ability to
186    validate a user against a password and check for permission to use BASE.
187    </li>
188  </ul>
190<a name="3">
191<h2>3. Generic solution</h2>
193  <p>
194  The generic solution is an extension to the current one, i.e. the 3-tier
195  solution is replaced by an N-tier solution. This is accomplished by subdividing
196  the layers and precisely specifying their areas of responsibility. At this stage
197  we shouldn't make any assumptions about the technology to use, i.e. the
198  programming language, the kind of database, etc.
199  </p>
201  <h3>3.1 The data layer</h3>
202  <p>
203  The data layer is divided into three layers:
204  </p>
205  <ol>
206    <li>The data storage layer
207      <ul>
208      <li>is responsible for holding the data
209      </ul>
211    <li>The database driver layer
212      <ul>
213      <li>is responsible for all queries to the database.
214      <li>knows how to connect to the database
215      <li>handles transactions
216      <li>parses and formats user input data, e.g. escapes "dangerous" characters
217      <li>should be able to do some simple calculations, such as counting the number of
218      items, calculating means, sums, etc. Note! If the technical implementation
219      uses a relational database capable of executing SQL queries this
220      functionality is most likely available in the database, but if we use XML
221      files as the data storage it is not. As noted above, we try not to make
222      any assumptions about the technology to use.
223      </ul>
224    <li>The data abstraction layer
225      <ul>
226      <li>knows which database driver to load
227      <li>defines helper functions usable for a substantial subset of database drivers
228      <li>transports data to and from the logic layer
229      <li>possibly a low-level, efficient method for importing large quantities of
230      data
231      <li>possibly define an API for use with plugins
232      </ul>
233  </ol>
234  <p>
235  The data abstraction layer is the only part of the data layer that is allowed
236  to talk with the outside world, i.e. the logic layer, plugins, etc. Flaws in the
237  actual design might make this impossible to follow at certain times, but much
238  effort should go into not breaking this rule!
239  </p>
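  <p>
  The following Java sketch shows one possible reading of this rule. The logic
  layer only ever holds a DataAbstractionLayer, which picks a driver by name and
  forwards the simple calculations mentioned above (count, mean). The driver
  shown keeps its data in memory to stand in for a storage without a query
  language, where the driver has to compute such values itself. All class,
  method and table names are assumptions made for the example.
  </p>

```java
import java.util.Arrays;

// Illustration only: every name in this file is hypothetical.
class DataLayerSketch {
    public static void main(String[] args) {
        DataAbstractionLayer dal = DataAbstractionLayer.open("memory");
        System.out.println(dal.count("spots") + " rows, mean " + dal.mean("spots"));
    }
}

/** What every driver must offer, whatever the storage behind it. */
interface DatabaseDriver {
    long count(String table);
    double mean(String table);
}

/** A driver for a storage with no query language: it computes aggregates itself. */
class InMemoryDriver implements DatabaseDriver {
    private final double[] spots = { 2.0, 4.0, 9.0 };

    private double[] rows(String table) {
        if (!table.equals("spots")) {
            throw new IllegalArgumentException("No such table: " + table);
        }
        return spots;
    }

    @Override
    public long count(String table) {
        return rows(table).length;
    }

    @Override
    public double mean(String table) {
        return Arrays.stream(rows(table)).average().orElse(Double.NaN);
    }
}

/** The only class of the data layer that the logic layer is supposed to see. */
final class DataAbstractionLayer {
    private final DatabaseDriver driver;

    private DataAbstractionLayer(DatabaseDriver driver) {
        this.driver = driver;
    }

    /** Knows which driver to load; callers never name a driver class. */
    static DataAbstractionLayer open(String driverName) {
        switch (driverName) {
            case "memory": return new DataAbstractionLayer(new InMemoryDriver());
            default: throw new IllegalArgumentException("No driver named " + driverName);
        }
    }

    long count(String table) { return driver.count(table); }
    double mean(String table) { return driver.mean(table); }
}
```

  <p>
  A driver for a relational database would implement the same interface by
  sending the corresponding aggregate queries to the database instead.
  </p>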
241  <h3>3.2 The logic layer</h3>
242  <p>
243  The logic layer is also divided into 3 parts:
244  </p>
246  <ol>
247    <li>The core logic layer
248      <ul>
249      <li>abstracts the data to a class representation with attributes and methods
250      <li>is responsible for data consistency, i.e. initiating, aborting and committing
251        transactions
252      <li>error checking of user supplied data
253      <li>handling of plugins and external jobs
254      <li>defining an API to make the functions accessible from other languages (Perl,
255        C++ and maybe Java)
256      </ul>
258    <li>Plugins
259      <ul>
260      <li>performs advanced data analysis
261      <li>import and export of data, i.e. parsing input files and generating output
262        files
263      </ul>
264    <li>Helper classes
265      <ul>
266      <li>providing some common services for the presentation layer clients, for
267      example plotting, file handling, etc.
268      </ul>
269  </ol>
271  <p>
272  Both the core and the plugins are allowed to talk to the data abstraction
273  layer. Neither should talk to a specific database driver or use the data
274  storage directly.
275  </p>
276  <p>
277  The helper classes should not talk to the core or the database layer. They
278  should only depend on what they are fed from the presentation layer. It is
279  arguable whether these components are seen as parts of the presentation
280  layer or the logic layer. The reason I chose to put them in the logic layer
281  is that they are providing services to several client applications.
282  </p>
285  <h3>3.3 The presentation layer</h3>
286  <p>
287  The presentation layer is divided into 2 parts:
288  </p>
289  <ol>
290    <li>The web server layer
291      <ul>
292      <li>generating HTML for the browser for presentation and manipulation of data
293      </ul>
294    <li>The browser layer
295      <ul>
296      <li>providing the user interface as specified by the HTML generated from the web
297        server
298      <li>initial error checking of user-supplied data
299      </ul>
300  </ol>
301  <p>
302  In addition to this the presentation layer can be extended with other client
303  applications, e.g. standalone programs written in C++, Perl or Java.
304  </p>
305  <p>
306  The presentation layer is only allowed to talk with the core layer and the
307  helper classes. Communication with plugins should go through the core layer.
308  </p>
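  <p>
  As a rough sketch of what "communication with plugins should go through the
  core layer" could look like in Java: the client asks the core to import data
  in a named format, and the core picks the plugin. The ImportPlugin interface,
  the "tsv" format name and the other identifiers are invented for this
  example. The real plugin API is still to be specified.
  </p>

```java
import java.util.Arrays;

// Illustration only: CoreSketch, ImportPlugin and TabSeparatedImporter are
// hypothetical names.
class CoreSketch {
    /** A plugin that can parse one or more import file formats. */
    interface ImportPlugin {
        boolean accepts(String formatName);
        String[][] parse(String text);
    }

    /** Import plugin for tab-separated text: one row per line, one cell per tab. */
    static class TabSeparatedImporter implements ImportPlugin {
        @Override
        public boolean accepts(String formatName) {
            return formatName.equals("tsv");
        }

        @Override
        public String[][] parse(String text) {
            return Arrays.stream(text.split("\n"))
                         .map(line -> line.split("\t"))
                         .toArray(String[][]::new);
        }
    }

    private final ImportPlugin[] plugins;

    CoreSketch(ImportPlugin... plugins) {
        this.plugins = plugins.clone();
    }

    /** Entry point for the presentation layer; it never touches a plugin directly. */
    String[][] importData(String formatName, String text) {
        for (ImportPlugin plugin : plugins) {
            if (plugin.accepts(formatName)) {
                return plugin.parse(text);
            }
        }
        throw new IllegalArgumentException("No import plugin for format: " + formatName);
    }

    public static void main(String[] args) {
        CoreSketch core = new CoreSketch(new TabSeparatedImporter());
        String[][] rows = core.importData("tsv", "id\tvalue\n1\t0.5");
        System.out.println(Arrays.deepToString(rows));
    }
}
```

  <p>
  Supporting a more cryptic file format then means registering another plugin
  with the core, while the client code stays the same.
  </p>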
310  <h3>3.4 Visualising the design</h3>
311  <p>
312  The design could be represented by the following image:
313  </p>
314  <pre class="code">
316                                                <b>Presentation layer</b>
317       ____________
318      |            |
319      |   Browser  |
320      |____________|
321            |
322            |                               __________
323       _____v______                        |          |
324      |            |                       |  Other   |
325      | Web server |                       |  client  |
326      |____________|                       |__________|
327           |    |                             |  |
329           |    |       ___________           |  |     <b>Logic layer</b>
330           |    |      |           |          |  |
331           |    ------&gt;|  Helper   |&lt;----------  |
332           |           |  classes  |             |
333           |           |___________|             |
334           |         ____________________________|
335           |        |
336           |   _____v____       ___________
337           |  |   API    |     |           |
338       ____v__|__________|&lt;---&gt;|  Plugins  |
339      |                  |     |___________|
340      |  Core logic      |          |
341      |  layer           |          |
342      |__________________|          |&lt;--Maybe
343          |                         |
345          |              ___________v____               <b>Data layer</b>
346          |             |      API       |
347       ___v_____________|________________|
348      |                                  |
349      |     Data abstraction layer       |
350      |                                  |
351      |----------------------------------|
352      | MySQL  |                         |
353      | driver |  Other drivers...       |
354      |________|_________________________|
355          |                |
356          |                |
357       ___v___         ____v_____
358      |       |       |          |
359      | MySQL |       | Other DB |
360      |_______|       |__________|
363  </pre>
364  <p class="annotation">A visual representation of the system design</p>
365  <p>
366  Note! In the image above the different layers do not correspond to the
367  ability to break up the execution on different servers! A discussion
368  about that will follow later.
369  </p>
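  <p>
  The communication rules of sections 3.1 to 3.3 can also be written down as a
  small table in code, which might later serve as the basis for an automated
  dependency check. The Java sketch below is my reading of the text and the
  image, not a finished rule set. In particular, it follows the text in letting
  plugins talk to the data abstraction layer, which the image marks as "Maybe".
  The class and enum names are invented.
  </p>

```java
// Illustration only: LayerRules and the Layer names are hypothetical.
class LayerRules {
    enum Layer { PRESENTATION, HELPERS, CORE, PLUGINS, DATA_ABSTRACTION, DRIVER, STORAGE }

    /** Returns true if code in layer "from" is allowed to call layer "to". */
    static boolean mayCall(Layer from, Layer to) {
        switch (from) {
            case PRESENTATION:
                return to == Layer.CORE || to == Layer.HELPERS;
            case CORE:
                return to == Layer.PLUGINS || to == Layer.DATA_ABSTRACTION;
            case PLUGINS:
                return to == Layer.CORE || to == Layer.DATA_ABSTRACTION;
            case DATA_ABSTRACTION:
                return to == Layer.DRIVER;
            case DRIVER:
                return to == Layer.STORAGE;
            default:
                // HELPERS depend only on what they are fed; STORAGE calls nobody.
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(mayCall(Layer.PRESENTATION, Layer.CORE));
        System.out.println(mayCall(Layer.PRESENTATION, Layer.PLUGINS));
        System.out.println(mayCall(Layer.CORE, Layer.DRIVER));
    }
}
```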
371<a name="4">
372<h2>4. Technical details</h2>
374  <p>
375  Now we have a conceptual image of the design we are trying to accomplish.
376  Until now we haven't paid much attention to the technical details of the
377  solution, i.e.:
378  </p>
379  <ul>
380  <li>What kind of database do we need?
381  <li>What programming languages should we use?
382  <li>What operating systems should we support?
383  <li>Etc.
384  </ul>
386  <h3>4.1 The data layer</h3>
387  <p>
388  The requirements specify that BASE must be able to use different data storage
389  engines and that it should be possible to add support for other ones without
390  major modification of the rest of the code.
391  </p>
392  <p>
393  The requirements do not specify what type of storage should be
394  supported: relational database, flat file, XML, etc.
395  </p>
396  <p>
397  In order to not complicate the design we choose to limit the support to
398  relational databases using SQL as the query language. The major task for a
399  driver will then be to shield the rest of the application from the various
400  dialects of SQL. The helper functions in the data abstraction layer will then
401  most likely be ones that can be used for dynamic creation of SQL queries.
402  </p>
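  <p>
  To give an idea of what "shielding the rest of the application from the
  various dialects" can mean in practice, the Java sketch below builds the same
  query for three databases that differ in how identifiers are quoted and how
  the number of returned rows is limited. The SqlDialect interface and the
  firstRows helper are hypothetical. The table and column names are made up.
  A real driver would have to cover far more differences than these two.
  </p>

```java
// Illustration only: DialectSketch, SqlDialect and firstRows are hypothetical names.
class DialectSketch {
    /** Slice of a driver: the two places where the dialects below differ. */
    interface SqlDialect {
        String quoteIdentifier(String name);
        String limit(String select, int maxRows);
    }

    static final SqlDialect MYSQL = new SqlDialect() {
        public String quoteIdentifier(String name) {
            return "`" + name.replace("`", "``") + "`";
        }
        public String limit(String select, int maxRows) {
            return select + " LIMIT " + maxRows;
        }
    };

    static final SqlDialect POSTGRES = new SqlDialect() {
        public String quoteIdentifier(String name) {
            return "\"" + name.replace("\"", "\"\"") + "\"";
        }
        public String limit(String select, int maxRows) {
            return select + " LIMIT " + maxRows;
        }
    };

    static final SqlDialect ORACLE = new SqlDialect() {
        public String quoteIdentifier(String name) {
            if (name.contains("\"")) {
                throw new IllegalArgumentException("Quote character not allowed: " + name);
            }
            return "\"" + name + "\"";
        }
        public String limit(String select, int maxRows) {
            // Classic ROWNUM idiom; newer Oracle versions also offer FETCH FIRST.
            return "SELECT * FROM (" + select + ") WHERE ROWNUM <= " + maxRows;
        }
    };

    /** Helper for the abstraction layer: builds a query without knowing the dialect. */
    static String firstRows(SqlDialect dialect, String table, String column, int maxRows) {
        String select = "SELECT " + dialect.quoteIdentifier(column)
                      + " FROM " + dialect.quoteIdentifier(table);
        return dialect.limit(select, maxRows);
    }

    public static void main(String[] args) {
        System.out.println(firstRows(MYSQL, "spots", "intensity", 10));
        System.out.println(firstRows(POSTGRES, "spots", "intensity", 10));
        System.out.println(firstRows(ORACLE, "spots", "intensity", 10));
    }
}
```

  <p>
  Note that quoted identifiers are case sensitive in some databases, so a real
  helper would also need a policy for the case of table and column names.
  </p>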
403  <p>
404  Other issues:
405  </p>
407  <dl>
408  <dt>Transaction support</dt>
409  <dd>
410    <p>
411    This is the ability to treat a series of SQL queries as one operation, i.e.
412    if one query fails the rest would also fail and the database should be
413    returned to the state prior to the beginning of the transaction.
414    </p>
415    <p>
416    In my opinion this is one of the most important features of a relational
417    database. Nevertheless, we will not require that the database supports
418    transactions. However, the code in the logic layer will assume that
419    transactions are supported. If the database does not support them directly,
420    the database driver layer must handle the consequences of failing queries.
421    </p>
422    <p>
423    We will not require support for nested transactions, neither at the storage
424    level nor at the driver level.
425    </p>
426  </dd>
428  <dt>Unicode support</dt>
429  <dd>
430    <p>
431    Requests for multi-language support will come sooner or later, and Unicode
432    is the way to go. As we will use Java as the programming language (see below)
433    Unicode support is already built in at the code level. Again, we will not
434    require Unicode support by the data storage, but all code in the logic layer
435    will behave as if it is supported. So, as for transactions, this is also an
436    issue that the driver must take care of.
437    </p>
439  <dt>Connection pooling</dt>
440  <dd>
441    <p>
442    Opening a connection to a database is a time-consuming operation. A connection
443    pool maintains a list of already opened connections which can be recycled
444    between different requests, thereby increasing the performance. With JDBC, it
445    is not very complicated to add support for connection pooling for any
446    database.
447    </p>
448  </dl>
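  <p>
  The transaction behaviour described above (a series of queries treated as one
  operation, with a rollback if any of them fails) could look roughly like this
  with the standard JDBC Connection methods setAutoCommit, commit and rollback.
  This is a sketch under assumptions, not the planned implementation: the Work
  callback, runInTransaction and the other names are invented, and a recording
  stand-in is used instead of a real database connection so that the example
  runs without a database.
  </p>

```java
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.sql.SQLException;

// Illustration only: TransactionSketch, Work and demo are hypothetical names.
class TransactionSketch {
    /** Callback type: a series of queries that belong together. */
    interface Work {
        void run(Connection connection) throws SQLException;
    }

    /** Runs the work as one operation: commit if every query succeeds, else roll back. */
    static boolean runInTransaction(Connection connection, Work work) throws SQLException {
        boolean previousAutoCommit = connection.getAutoCommit();
        connection.setAutoCommit(false);          // start of the transaction
        try {
            work.run(connection);
            connection.commit();
            return true;
        } catch (SQLException e) {
            connection.rollback();                // back to the state before the transaction
            return false;
        } finally {
            connection.setAutoCommit(previousAutoCommit);
        }
    }

    /** Stand-in for a real JDBC connection: it only records which methods were called. */
    static Connection recordingConnection(StringBuilder log) {
        return (Connection) Proxy.newProxyInstance(
                TransactionSketch.class.getClassLoader(),
                new Class[] { Connection.class },
                (proxy, method, args) -> {
                    log.append(method.getName()).append(' ');
                    return method.getReturnType() == boolean.class ? Boolean.TRUE : null;
                });
    }

    /** Returns the recorded call sequence for a succeeding or a failing unit of work. */
    static String demo(boolean fail) {
        StringBuilder log = new StringBuilder();
        try {
            runInTransaction(recordingConnection(log), connection -> {
                if (fail) {
                    throw new SQLException("simulated failing query");
                }
            });
        } catch (SQLException e) {
            throw new IllegalStateException(e);
        }
        return log.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(demo(false));
        System.out.println(demo(true));
    }
}
```

  <p>
  For a database without transaction support, the driver would have to provide
  the same commit and rollback behaviour by other means, as noted above.
  </p>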
450  <h3>4.2 The logic layer</h3>
451  <p>
452  The requirements specify that this layer must expose an API usable for
453  clients programmed in C++ and Perl, with optional support for Java.
454  </p>
455  <p>
456  It must also be able to handle plugins on both local and remote servers.
457  </p>
458  <p>
459  In the implementation of the core logic layer we will look at Java, since
460  this is a well-designed language, which will make it easier to isolate and
461  componentify functionality. In the database layer this will also give us
462  automatic connection pooling through JDBC if the database supports it.
463  </p>
464  <p>
465  We will look at CORBA as the platform for the API. It will give us support for
466  not only C++ and Perl, but also most other programming languages used today.
467  Direct calling into the Java API is also allowed whenever that is more suitable.
468  For instance, the web server should probably do that since going through CORBA
469  every time might affect performance. See also the discussion about scalability
470  below.
471  </p>
472  <p>
473  More arguments:
474  </p>
475  <ul>
476    <li>
477    Java has a lot of freely available class libraries, for example for XML
478    parsing, image generation, etc. We will not need much special 3rd-party
479    software.
480    </li>
481    <li>
482    The performance is of course worse than for C++, but this is not considered
483    a big issue since most of the computationally intensive tasks will be performed
484    by plugins, which may use any suitable language.
485    </li>
486    <li>
487    Java is platform independent, but that is not a main issue. We will concentrate
488    on getting things to work on the Linux platform. Some effort will be made
489    to get it to work on other Unix versions as well. If it happens to work on
490    other platforms, e.g. Windows, that should not be taken for granted
491    in future releases.
492    </li>
493  </ul>
495  <h3>4.3 The presentation layer</h3>
496  <p>
497  The requirements say nothing about the presentation layer, but since BASE 1.2
498  is web-based it is implicit that we support a web interface for BASE 2.0.
499  </p>
500  <p>
501  The web server of choice is Apache. It has proven reliable and works on several
502  platforms. Knowledge of how to set up and run an Apache web server is
503  widespread.
504  </p>
505  <p>
506  We will use a scripting module on the web server. Java Server Pages is probably
507  a good choice. It will certainly make it easy to use the core API. Perl is
508  another possibility. There exist Perl modules for using Java objects directly.
509  The performance might suffer, but it is definitely worth a look.
510  </p>
511  <p>
512  Other issues:
513  </p>
515  <dl>
516  <dt>Browser versions</dt>
517  <dd>
518    <p>
519    This is always an issue when designing web applications. Luckily,
520    conformance with the different standards is getting better with each
521    browser version. For this reason we should not go out of our way to support
522    browsers that are too old. Things to be considered are:
523    </p>
524    <ul>
525      <li>HTML version
526      <li>Style sheet support
527      <li>JavaScript support
528      <li>Java applet support
529    </ul>
530    <p>
531    In my opinion there is no need to support older versions than IE 6.0 and
532    NS 6.0. If we stay away from Dynamic HTML and similar technologies, any code
533    that works on both of these browsers will probably work on most older ones
534    also (IE 5.x and NS 4.x). Browser related issues can also easily be solved
535    by the open source community.
536    </p>
537    <p>
538    Note! It is mainly an issue of testing, which takes a lot of time, and if
539    one has to do it over and over again with different browser versions and
540    operating systems it is going to take a lot of valuable time from more
541    productive development.
542    </p>
543  </dd>
545  <dt>Unicode support</dt>
546  <dd>
547    <p>
548    The newer browsers support enough Unicode to get it to work. Older ones have
549    a few annoying issues (especially Netscape). See also the discussion about
550    Unicode for the data layer.
551    </p>
552  </dd>
553  </dl>
555  <h3>4.4 Scalability</h3>
556  <p>
557  The scalability issue is only important in certain parts of the application.
558  For instance, we do not expect the performance of the web server to be a
559  problem. This is not the kind of application that attracts thousands or more
560  simultaneous users.
561  </p>
562  <p>
563  On the other hand, some parts of the application can be very calculation
564  intensive, i.e. the plugins. The requirements specify that it should be possible
565  to run plugins on separate servers. With the use of CORBA this should not pose
566  any problems. Different plugins can run on different servers and in theory it
567  should be possible to create a cluster of servers for the plugins.
568  </p>
569  <p>
570  Because of the large quantities of data, the database itself may also be put
571  under strain. It should not pose any problem to run the database on a
572  different server. It is the database driver's responsibility to connect to
573  the database and once connected it should not matter to the rest of the
574  BASE application where it is located. One exception might be a low-level
575  import and/or export function where the database reads/writes data from/to a
576  file on the disk. In this case the network may have to be configured
577  appropriately to allow the database to access the file or, if it is impossible,
578  the driver should do the reading and writing, using SQL to communicate with
579  the database.
580  </p>
581  <p>
582  The minimal configuration involves two computers:
583  </p>
584  <ol>
585    <li>the user's workstation running a browser
586    <li>the BASE server running everything else
587  </ol>
589  <p>
590  The maximum configuration involves at least four computers:
591  </p>
592  <ol>
593    <li>the user's workstation running a browser
594    <li>the main BASE server running the webserver, core logic layer and
595      helper classes, data abstraction layer and database drivers
596    <li>database server
597    <li>one or more plugin servers
598  </ol>
601<a name="5">
602<h2>5. Work items</h2>
604  <p>
605  Here is a list of what needs to be done before BASE 2.0 can be released. The
606  list is ordered by the start time of each item. For a complete time plan see
607  base2.0timplan.sxc.
608  </p>
610  <dl>
611  <dt>1.  Get this specification finished</dt>
612  <dd></dd>
614  <dt>2.  Finding more developers/contributors.</dt>
615  <dd>
616    <p>
617    BASE has a large user base and already
618    a few interested developers. We need to notify them of our plans and find
619    out if someone is interested in contributing to the development.
620    </p>
621  </dd>
623  <dt>3. Make a specification for new functionality in BASE 2.0.</dt>
624  <dd>
625    <p>
626    It is implicit
627    that all functionality in the current version of BASE should also be
628    in BASE 2.0. One important part of this specification is to specify plugins
629    and import/export formats (implemented as plugins).
630    </p>
631    <p>
632    This specification should also include some use cases. A few of them will
633    be used for the prototype development. All will be used during the main
634    implementation and the testing.
635    </p>
636  </dd>
638  <dt>4.  Make a prototype for a subset of BASE 2.0</dt>
639  <dd>
640    <p>
641    The prototype should include test implementations of the most important
642    technical problems we are expecting to encounter during the development.
643    </p>
644    <ul>
645    <li>MySQL connection from Java, including transaction support and connection
646      pooling
647    <li>Test of the database driver concept, i.e. test with another database
648    <li>Clear division into the different layers
649    <li>Test of CORBA interface
650    <li>Test of Java Server Pages
651    <li>Test of Perl calling Java
652    <li>Test of plugin concept, run plugins locally and remotely
653    <li>Investigate LDAP and if it can be used for user authentication
654    </ul>
655    <p>
656    At the end of the prototype development all decisions regarding technical
657    solutions must have been made.
658    </p>
659  </dd>
661  <dt>5.  Implement the data layers and the core logic layer</dt>
662  <dd>
663    <ul>
664      <li>database schema
665      <li>driver for MySQL
666      <li>database abstraction layer
667      <li>core logic layer
668    </ul>
669  </dd>
671  <dt>6.  Implement web interface and helper functions</dt>
672  <dd>
673    <ul>
674      <li>basic web functionality, e.g. adding data
675      <li>extended functionality, e.g. analysing data
676      <li>helper classes
677    </ul>
678  </dd>
680  <dt>7.  CORBA API</dt>
681  <dd></dd>
683  <dt>8.  Plugins</dt>
684  <dd>
685    <ul>
686      <li>analysis plugins
687      <li>import/export plugins
688    </ul>
689  </dd>
691  <dt>9.  Testing</dt>
692  <dd></dd>
694  <dt>10. Migration functions</dt>
695  <dd>
696    <p>
697    I don't think it is possible to create a version that is backwards compatible
698    with BASE 1.2. This means that before the installation all data must be
699    exported and then imported into the new version.
700    </p>
701  </dd>
703  <dt>11. Installation script</dt>
704  <dd></dd>
706  <dt>12. Extra functionality</dt>
707  <dd>
708    <ul>
709    <li>support for Postgres and other databases
710    <li>more plugins
711    <li>standalone client software
712    </ul>
713  </dd>
715  <dt>13. Documentation</dt>
716  <dd>
717    <p>
718    All points above include writing documentation!
719    Since it is a very important issue it is also included as a separate point.
720    Proper documentation MUST be available for:
721    </p>
722    <ul>
723      <li>the database layout (tables, etc.)
724      <li>how to write a database driver
725      <li>API for the data abstraction layer
726      <li>API for the core logic layer
727      <li>helper functions in the logic layer
728      <li>how to create plugins
729      <li>online help/manual for the web interface
730      <li>the findings we made during the prototype development
731    </ul>
732  </dd>
734  </dl>