1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "">
3  $Id: technical.html 4889 2009-04-06 12:52:39Z nicklas $
5  Copyright (C) 2005 Samuel Andersson, Jari Häkkinen, Nicklas Nordborg
6  Copyright (C) 2006 Jari Häkkinen
8  This file is part of BASE - BioArray Software Environment.
9  Available at
11  BASE is free software; you can redistribute it and/or
12  modify it under the terms of the GNU General Public License
13  as published by the Free Software Foundation; either version 3
14  of the License, or (at your option) any later version.
16  BASE is distributed in the hope that it will be useful,
17  but WITHOUT ANY WARRANTY; without even the implied warranty of
19  GNU General Public License for more details.
21  You should have received a copy of the GNU General Public License
22  along with BASE. If not, see <>.
26  <title>BASE - Initial technical specification</title>
27  <link rel=stylesheet type="text/css" href="../styles.css">
31<div class="navigation">
32  <a href="../index.html">BASE</a>
33  <img src="../next.gif">
34  Initial technical specification
37<h1>Initial technical specification</h1>
39<div class="abstract">
41  <ol>
42  <li><a href="#1">Background</a>
43  <li><a href="#2">Requirements for BASE 2.0</a>
44  <li><a href="#3">Generic solution</a>
45  <li><a href="#4">Technical details</a>
46  <li><a href="#5">Work items</a>
47  </ol>
49  <p class="authors">
50  <b>Created by:</b> Nicklas<br>
51  <b>Contributions by:</b> Carl, Jari, Per<br>
52  <b>Last updated:</b> $Date: 2009-04-06 12:52:39 +0000 (Mon, 06 Apr 2009) $
53  </p>
<a name="1"></a>
57<h2>1. Background</h2>
59  <p>
  The current BASE 1.2 implementation uses a 3-tier architecture. At the bottom is
61  the data layer running MySQL or Postgres. In the middle is the logic layer with
62  PHP scripts running on an Apache web server. The top layer is the HTML
63  presentation in the browser.
64  </p>
65  <p>
66  This follows a classical and well-known design for web applications. However,
67  the actual implementation of it fails at several points, especially at the
  logic layer. Here are some examples:
69  </p>
71  <ul>
72  <li>
73    <p>
74    Several of the PHP scripts have too much responsibility. For example, the
75    plotting function uses the script "". This script is
76    responsible both for generating the HTML where the user selects parameters
77    for the plot and for generating the final graphs in the form of images or
78    postscript/pdf files.
79    </p>
80    <p>
81    Another example is the file "trans_create.phtml" which is used for filtering
82    BioAssay data. It does the following:
83    </p>
84    <ul>
85      <li>generate the HTML where the user creates the filter</li>
86      <li>generate the HTML where the user specifies input parameters for a job</li>
87      <li>store and fetch "Preset:s", i.e. filter definitions that the user may
88        want to reuse</li>
89      <li>do the actual filtering</li>
90      <li>start the selected job</li>
91    </ul>
92  </li>
93  <li>
94    <p>
95    There are too many dependencies between different parts of the PHP scripts
96    and classes. This is actually the same problem as the first point but on a
97    wider scale.
98    </p>
99    <p>
100    I will use the plot function as an example again. When the interface is
101    presented for the user, he/she is supposed to select the values to plot on
102    the X and Y axis respectively. The lists of values to choose from are
103    generated by the BioAssay object. This is ok, since the BioAssay object is
104    the only object that knows about what data is available. When the user has
    made the selection, the information is passed to the BioAssay object, which
    fetches the data and gives it back to the plot function. This seems like a
    good idea, but if one looks deeper into the code there is a very tight
    coupling between the plot function and the BioAssay object. The BioAssay
    object has methods such as "getDataForPlot" and "getPlotType", which are totally
110    wrong. The BioAssay object should not need to know anything about plotting
111    or how the data should be used. It should only have a "getData" method.
112    </p>
113    <p>
114    As it is now, the plot function will only plot data from a BioAssay, but what
115    if we want to plot data from a BioAssaySet? The current design makes it hard
116    to change the plot function to accomplish this.
117    </p>
118  </li>
119  <li>
120    <p>
121    SQL commands are scattered around in several different places. This will
122    become a bigger problem as the code grows and the wish to support other
    databases increases. How do we verify that all SQL queries also work on, for
    example, Oracle? And, once we have done that, what about the next version of
125    BASE?
126    </p>
127  </li>
128  </ul>
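  <p>
  The looser coupling argued for in the second point can be sketched in Java
  (the language proposed in section 4.2). This is only an illustration under
  stated assumptions: the names DataSource, availableColumns, PlotFunction and
  collectPoints, and the column names, are hypothetical and are not taken from
  the BASE code. Only the idea of a single "getData" method comes from the text
  above.
  </p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical interface: a data source only hands out named columns of
// values. It knows nothing about plotting.
interface DataSource {
    List<String> availableColumns();
    double[] getData(String column);
}

// Hypothetical BioAssay holding two columns of illustration values.
class BioAssay implements DataSource {
    private final double[] intensity1;
    private final double[] intensity2;

    BioAssay(double[] intensity1, double[] intensity2) {
        this.intensity1 = intensity1;
        this.intensity2 = intensity2;
    }

    public List<String> availableColumns() {
        return Arrays.asList("intensity1", "intensity2");
    }

    public double[] getData(String column) {
        if (column.equals("intensity1")) return intensity1.clone();
        if (column.equals("intensity2")) return intensity2.clone();
        throw new IllegalArgumentException("Unknown column: " + column);
    }
}

// Hypothetical BioAssaySet: concatenates the data of its member BioAssays.
class BioAssaySet implements DataSource {
    private final List<BioAssay> members;

    BioAssaySet(List<BioAssay> members) {
        this.members = members;
    }

    public List<String> availableColumns() {
        return Arrays.asList("intensity1", "intensity2");
    }

    public double[] getData(String column) {
        List<Double> all = new ArrayList<Double>();
        for (BioAssay member : members) {
            for (double value : member.getData(column)) all.add(value);
        }
        double[] result = new double[all.size()];
        for (int i = 0; i < result.length; i++) result[i] = all.get(i);
        return result;
    }
}

// The plot function depends only on DataSource, so it works unchanged for
// a BioAssay, a BioAssaySet, or any future data source.
class PlotFunction {
    static double[][] collectPoints(DataSource source, String xColumn, String yColumn) {
        double[] x = source.getData(xColumn);
        double[] y = source.getData(yColumn);
        if (x.length != y.length) {
            throw new IllegalStateException("Columns differ in length");
        }
        double[][] points = new double[x.length][2];
        for (int i = 0; i < x.length; i++) {
            points[i][0] = x[i];
            points[i][1] = y[i];
        }
        return points;
    }
}
```

  <p>
  With this shape, plotting a BioAssaySet requires no change to the plot
  function, and no data class needs a method such as "getDataForPlot".
  </p>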
130  <p><b>To summarize:</b><br>
  The basic problem is that the division into three layers has been unsuccessful.
  Code that belongs to the data layer (SQL queries) is scattered among the scripts
  in the logic layer. Several PHP scripts perform functions both for presenting
  the data and for manipulating it. I.e. there is no clear division between the
  data layer, the logic layer and the presentation layer.
136  </p>
<a name="2"></a>
139<h2>2. Requirements for BASE 2.0</h2>
141  <p>
142  The main goal for BASE 2.0 is to make the division between data, logic and
143  presentation clear.
144  </p>
146  <ul>
147    <li>
148    It should be possible to add support for other databases without having
149    to go through every piece of code. The requirements for the capabilities
    of the database system must be well documented.
151    </li>
152    <li>
153    Expose an API from the logic layer that is accessible from at least Perl and
154    C++. If possible, the API should also be accessible from Java. Any other
155    languages are considered a bonus.
156    </li>
157    <li>
158    The design must allow calculation intensive parts (i.e. plugins) to be
159    executed on remote servers, using a suitable language for the task.
160    </li>
161    <li>
    It should be possible to add support for other import and export file formats,
    including very cryptic ones (i.e. anything other than tab-separated text files).
164    </li>
165    <li>
166    It must be possible to run a BASE server without the need to purchase any
167    additional software. Any 3rd-party software required by BASE should be
    freely available. Optional software that is not required for the basic
    operation of BASE is exempt from this restriction.
170    </li>
171  </ul>
174<h3>2.1 Possible features of BASE 2.0</h3>
175  <p>
176  Here are some features that are not requirements, but might be nice to have. We
177  should try to include as much as possible, but if we are short of time some
178  features may have to wait until a later version.
179  </p>
180  <ul>
181    <li>
182    Add support for external user authentication, for example via LDAP. A
    minimum requirement of the authentication system will be the ability to
184    validate a user against a password and check for permission to use BASE.
185    </li>
186  </ul>
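  <p>
  The minimum requirement on an authentication system could be expressed as a
  small contract, sketched here in Java. All names (Authenticator, validate,
  mayUseBase, InMemoryAuthenticator, LoginService) are hypothetical. The
  in-memory implementation with plain-text passwords exists only to make the
  sketch runnable; an LDAP-backed implementation would implement the same
  interface.
  </p>

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical contract covering the two minimum abilities named above.
interface Authenticator {
    // Validate a user against a password.
    boolean validate(String login, String password);
    // Check whether the user has permission to use BASE.
    boolean mayUseBase(String login);
}

// Illustration-only implementation; not suitable for real use.
class InMemoryAuthenticator implements Authenticator {
    private final Map<String, String> passwords = new HashMap<String, String>();
    private final Map<String, Boolean> permitted = new HashMap<String, Boolean>();

    void addUser(String login, String password, boolean mayUseBase) {
        passwords.put(login, password);
        permitted.put(login, mayUseBase);
    }

    public boolean validate(String login, String password) {
        String expected = passwords.get(login);
        return expected != null && expected.equals(password);
    }

    public boolean mayUseBase(String login) {
        return Boolean.TRUE.equals(permitted.get(login));
    }
}

// The rest of the application depends only on the Authenticator interface.
class LoginService {
    private final Authenticator authenticator;

    LoginService(Authenticator authenticator) {
        this.authenticator = authenticator;
    }

    boolean login(String login, String password) {
        return authenticator.validate(login, password)
            && authenticator.mayUseBase(login);
    }
}
```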
<a name="3"></a>
189<h2>3. Generic solution</h2>
191  <p>
192  The generic solution is an extension to the current one, i.e. the 3-tier
193  solution is replaced by an N-tier solution. This is accomplished by subdividing
194  the layers and precisely specifying their areas of responsibility. At this stage
195  we shouldn't make any assumption about the technology to use, i.e. the
196  programming language, the kind of database, etc.
197  </p>
199  <h3>3.1 The data layer</h3>
200  <p>
201  The data layer is divided into three layers:
202  </p>
203  <ol>
204    <li>The data storage layer
205      <ul>
206      <li>is responsible for holding the data
207      </ul>
209    <li>The database driver layer
210      <ul>
211      <li>is responsible for all queries to the database.
212      <li>knows how to connect to the database
      <li>handles transactions
      <li>parses and formats user input data, e.g. escapes "dangerous" characters
215      <li>should be able to do some simple calculations, such as counting number of
216      items, calculating means, sums, etc. Note! If the technical implementation
217      uses a relational database capable of executing SQL queries this
218      functionality is most likely available in the database, but if we use XML
219      files as the data storage it is not. As noted above, we try not to make
220      any assumptions about the technology to use.
221      </ul>
222    <li>The data abstraction layer
223      <ul>
224      <li>knows which database driver to load
225      <li>defines helper functions usable for a substantial subset of database drivers
226      <li>transport data to and from the logic layer
227      <li>possibly a low-level, efficient method for importing large quantities of
228      data
229      <li>possibly define an API for use with plugins
230      </ul>
231  </ol>
232  <p>
233  The data abstraction layer is the only part of the data layer that is allowed
234  to talk with the outside world, i.e. the logic layer, plugins, etc. Flaws in the
235  actual design might make this impossible to follow at certain times, but much
236  effort should go into not breaking this rule!
237  </p>
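  <p>
  As a rough Java sketch of this division, the abstraction layer below knows
  which driver to load and is the only class the outside world talks to. The
  names DatabaseDriver, InMemoryDriver and DataAbstractionLayer, and their
  methods, are assumptions made for illustration. The in-memory driver stands
  in for a real storage engine so that the example runs on its own.
  </p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical driver contract: all access to the storage goes through it,
// including simple calculations such as counting items.
interface DatabaseDriver {
    String name();
    void store(String itemType, String item);
    long countItems(String itemType);
}

// Stand-in driver that keeps everything in memory.
class InMemoryDriver implements DatabaseDriver {
    private final Map<String, List<String>> storage =
        new HashMap<String, List<String>>();

    public String name() {
        return "memory";
    }

    public void store(String itemType, String item) {
        List<String> items = storage.get(itemType);
        if (items == null) {
            items = new ArrayList<String>();
            storage.put(itemType, items);
        }
        items.add(item);
    }

    public long countItems(String itemType) {
        List<String> items = storage.get(itemType);
        return items == null ? 0 : items.size();
    }
}

// The abstraction layer selects a driver by name and forwards requests.
// Callers never see which driver is active.
class DataAbstractionLayer {
    private final Map<String, DatabaseDriver> drivers =
        new HashMap<String, DatabaseDriver>();
    private DatabaseDriver active;

    void register(DatabaseDriver driver) {
        drivers.put(driver.name(), driver);
    }

    void useDriver(String name) {
        DatabaseDriver driver = drivers.get(name);
        if (driver == null) {
            throw new IllegalArgumentException("No such driver: " + name);
        }
        active = driver;
    }

    void store(String itemType, String item) {
        requireDriver().store(itemType, item);
    }

    long countItems(String itemType) {
        return requireDriver().countItems(itemType);
    }

    private DatabaseDriver requireDriver() {
        if (active == null) throw new IllegalStateException("No driver selected");
        return active;
    }
}
```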
239  <h3>3.2 The logic layer</h3>
240  <p>
241  The logic layer is also divided into 3 parts:
242  </p>
244  <ol>
245    <li>The core logic layer
246      <ul>
247      <li>abstracts the data to a class representation with attributes and methods
      <li>is responsible for data consistency, i.e. initiating, aborting and committing
249        transactions
250      <li>error checking of user supplied data
251      <li>handling of plugins and external jobs
252      <li>defining an API to make the functions accessible from other languages (Perl,
253        C++ and maybe Java)
254      </ul>
256    <li>Plugins
257      <ul>
258      <li>performs advanced data analysis
259      <li>import and export of data, i.e. parsing input files and generating output
260        files
261      </ul>
262    <li>Helper classes
263      <ul>
264      <li>providing some common services for the presentation layer clients, for
265      example plotting, file handling, etc.
266      </ul>
267  </ol>
269  <p>
270  Both the core and the plugins are allowed to talk to the data abstraction
271  layer. Neither should talk to a specific database driver or use the data
272  storage directly.
273  </p>
274  <p>
275  The helper classes should not talk to the core or the database layer. They
276  should only depend on what they are fed from the presentation layer. It is
277  arguable whether these components are seen as parts of the presentation
278  layer or the logic layer. The reason I choose to put them in the logic layer
279  is that they are providing services to several client applications.
280  </p>
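  <p>
  How the core might handle import plugins can be sketched as follows. This is
  a minimal illustration, assuming a hypothetical ImportPlugin interface and a
  hypothetical CoreLogic class that selects a plugin by format name. The real
  plugin API is still to be specified. Clients call the core, never the plugin
  directly.
  </p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plugin contract for parsing one input file format.
interface ImportPlugin {
    String formatName();
    // Returns one String[] of fields per non-empty input line.
    List<String[]> parse(String fileContents);
}

// Example plugin for tab-separated text.
class TabSeparatedImporter implements ImportPlugin {
    public String formatName() {
        return "tab-separated";
    }

    public List<String[]> parse(String fileContents) {
        List<String[]> rows = new ArrayList<String[]>();
        for (String line : fileContents.split("\r?\n")) {
            if (line.isEmpty()) continue;
            // Limit -1 keeps trailing empty fields.
            rows.add(line.split("\t", -1));
        }
        return rows;
    }
}

// The core keeps track of the plugins and dispatches to the matching one.
class CoreLogic {
    private final List<ImportPlugin> importers = new ArrayList<ImportPlugin>();

    void registerImporter(ImportPlugin plugin) {
        importers.add(plugin);
    }

    List<String[]> importData(String formatName, String fileContents) {
        for (ImportPlugin plugin : importers) {
            if (plugin.formatName().equals(formatName)) {
                return plugin.parse(fileContents);
            }
        }
        throw new IllegalArgumentException("No importer for format: " + formatName);
    }
}
```

  <p>
  Supporting a new file format then means registering one more plugin; the
  core and its clients stay unchanged.
  </p>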
283  <h3>3.3 The presentation layer</h3>
284  <p>
285  The presentation layer is divided into 2 parts:
286  </p>
287  <ol>
288    <li>The web server layer
289      <ul>
290      <li>generating HTML for the browser for presentation and manipulation of data
291      </ul>
292    <li>The browser layer
293      <ul>
294      <li>providing the user interface as specified by the HTML generated from the web
295        server
296      <li>initial error checking of user-supplied data
297      </ul>
298  </ol>
299  <p>
300  In addition to this the presentation layer can be extended with other client
301  applications, i.e. standalone programs written in C++ or Perl or Java.
302  </p>
303  <p>
304  The presentation layer is only allowed to talk with the core layer and the
  helper classes. Communication with plugins should go through the core layer.
306  </p>
308  <h3>3.4 Visualising the design</h3>
309  <p>
310  The design could be represented by the following image:
311  </p>
312  <pre class="code">
314                                                <b>Presentation layer</b>
315       ____________
316      |            |
317      |   Browser  |
318      |____________|
319            |
320            |                               __________
321       _____v______                        |          |
322      |            |                       |  Other   |
323      | Web server |                       |  client  |
324      |____________|                       |__________|
325           |    |                             |  |
327           |    |       ___________           |  |     <b>Logic layer</b>
328           |    |      |           |          |  |
329           |    ------&gt;|  Helper   |&lt;----------  |
330           |           |  classes  |             |
331           |           |___________|             |
332           |         ____________________________|
333           |        |
334           |   _____v____       ___________
335           |  |   API    |     |           |
336       ____v__|__________|&lt;---&gt;|  Plugins  |
337      |                  |     |___________|
338      |  Core logic      |          |
339      |  layer           |          |
340      |__________________|          |&lt;--Maybe
341          |                         |
343          |              ___________v____               <b>Data layer</b>
344          |             |      API       |
345       ___v_____________|________________|
346      |                                  |
347      |     Data abstraction layer       |
348      |                                  |
349      |----------------------------------|
350      | MySQL  |                         |
351      | driver |  Other drivers...       |
352      |________|_________________________|
353          |                |
354          |                |
355       ___v___         ____v_____
356      |       |       |          |
357      | MySQL |       | Other DB |
358      |_______|       |__________|
361  </pre>
362  <p class="annotation">A visual representation of the system design</p>
363  <p>
364  Note! In the image above the different layers do not correspond to the
365  ability to break up the execution on different servers! A discussion
366  about that will follow later.
367  </p>
<a name="4"></a>
370<h2>4. Technical details</h2>
372  <p>
373  Now we have a conceptual image of the design we are trying to accomplish.
  Until now we haven't paid much attention to the technical details of the
375  solution, i.e.:
376  </p>
377  <ul>
378  <li>What kind of database do we need?
379  <li>What programming languages should we use?
380  <li>What operating systems should we support?
381  <li>Etc.
382  </ul>
384  <h3>4.1 The data layer</h3>
385  <p>
386  The requirements specify that BASE must be able to use different data storage
387  engines and that it should be possible to add support for other ones without
388  major modification of the rest of the code.
389  </p>
390  <p>
  The requirements do not specify what type of storage should be
  supported, i.e. relational database, flat file, XML, etc.
393  </p>
394  <p>
395  In order to not complicate the design we choose to limit the support to
396  relational databases using SQL as the query language. The major task for a
397  driver will then be to shield the rest of the application from the various
398  dialects of SQL. The helper functions in the data abstraction layer will then
399  most likely be ones that can be used for dynamic creation of SQL queries.
400  </p>
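  <p>
  One small, concrete example of a dialect difference is identifier quoting:
  MySQL quotes identifiers with backticks, PostgreSQL with double quotes. The
  Java sketch below shows how a driver could hide this behind a helper for
  dynamic query creation. SqlDialect, MySqlDialect, PostgresDialect and
  QueryBuilder are hypothetical names, and the table and column names are
  invented for the example.
  </p>

```java
// Hypothetical per-driver hook for dialect-specific syntax.
interface SqlDialect {
    String quoteIdentifier(String identifier);
}

// MySQL: backticks, with embedded backticks doubled.
class MySqlDialect implements SqlDialect {
    public String quoteIdentifier(String identifier) {
        return "`" + identifier.replace("`", "``") + "`";
    }
}

// PostgreSQL: double quotes, with embedded double quotes doubled.
class PostgresDialect implements SqlDialect {
    public String quoteIdentifier(String identifier) {
        return "\"" + identifier.replace("\"", "\"\"") + "\"";
    }
}

// Helper in the data abstraction layer: builds the query text once and
// lets the dialect supply the database-specific parts. The filter value
// is left as a "?" placeholder for a prepared statement.
class QueryBuilder {
    static String selectWhereEquals(SqlDialect dialect, String table, String column) {
        return "SELECT * FROM " + dialect.quoteIdentifier(table)
            + " WHERE " + dialect.quoteIdentifier(column) + " = ?";
    }
}
```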
401  <p>
402  Other issues:
403  </p>
405  <dl>
406  <dt>Transaction support</dt>
407  <dd>
408    <p>
409    This is the ability to treat a series of SQL queries as one operation, i.e.
410    if one query fails the rest would also fail and the database should be
411    returned to the state prior to the beginning of the transaction.
412    </p>
413    <p>
    In my opinion this is one of the most important features of a relational
    database. Nevertheless, we will not require that the database supports
    transactions. The code in the logic layer will assume that transactions
    are supported. If the database does not support them directly, the database
    driver layer must handle the consequences of failing queries itself.
    </p>
    <p>
    We will not require support for nested transactions, neither at the storage
    level nor at the driver level.
423    </p>
424  </dd>
426  <dt>Unicode support</dt>
427  <dd>
428    <p>
    Requests for multi-language support will come sooner or later, and Unicode
    is the way to go. As we will use Java as the programming language (see below)
    Unicode support is already built in at the code level. Again, we will not
    require Unicode support by the data storage, but all code in the logic layer
    will behave as if it is supported. So, as for transactions, this is also an
    issue that the driver must take care of.
435    </p>
437  <dt>Connection pooling</dt>
438  <dd>
439    <p>
    Opening a connection to a database is a time-consuming operation. A connection
441    pool maintains a list of already opened connections which can be recycled
442    between different requests, thereby increasing the performance. With JDBC, it
443    is not very complicated to add support for connection pooling for any
444    database.
445    </p>
446  </dl>
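  <p>
  For a database that does support transactions, the all-or-nothing behaviour
  described under "Transaction support" maps onto the standard JDBC calls
  setAutoCommit, commit and rollback on java.sql.Connection. The sketch below
  assumes JDBC is used by the driver. The names UnitOfWork and Transactions are
  hypothetical helpers introduced for illustration.
  </p>

```java
import java.sql.Connection;
import java.sql.SQLException;

// Hypothetical callback holding the series of queries to run as one operation.
interface UnitOfWork {
    void run(Connection connection) throws SQLException;
}

class Transactions {
    // Runs the work inside a transaction: commits if it completes, rolls
    // back if it throws, and restores the previous auto-commit mode.
    static void runInTransaction(Connection connection, UnitOfWork work)
            throws SQLException {
        boolean previousAutoCommit = connection.getAutoCommit();
        connection.setAutoCommit(false);
        try {
            work.run(connection);
            connection.commit();
        } catch (SQLException | RuntimeException e) {
            // Return the database to its state before the transaction.
            connection.rollback();
            throw e;
        } finally {
            connection.setAutoCommit(previousAutoCommit);
        }
    }
}
```

  <p>
  A driver for a database without transaction support would have to provide
  the same guarantee by other means, which this sketch does not attempt.
  </p>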
448  <h3>4.2 The logic layer</h3>
449  <p>
450  The requirements specify that this layer must expose an API usable for
451  clients programmed in C++ and Perl, with optional support for Java.
452  </p>
453  <p>
454  It must also be able to handle plugins on both local and remote servers.
455  </p>
456  <p>
457  In the implementation of the core logic layer we will look at Java, since
458  this is a well-designed language, which will make it easier to isolate and
459  componentify functionality. In the database layer this will also give us
460  automatic connection pooling through JDBC if the database supports it.
461  </p>
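  <p>
  The recycling idea behind a connection pool can be shown with a deliberately
  simplified Java sketch. SimplePool is a hypothetical class, generic over the
  pooled type so that it runs without a database; in BASE the pooled type would
  be java.sql.Connection. It has no size limit, timeout or health check, all of
  which a real pool needs.
  </p>

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Minimal pool: hands out an idle resource if one exists, otherwise
// creates a new one through the factory.
class SimplePool<T> {
    private final Deque<T> idle = new ArrayDeque<T>();
    private final Supplier<T> factory;
    private int created = 0;

    SimplePool(Supplier<T> factory) {
        this.factory = factory;
    }

    synchronized T acquire() {
        T resource = idle.pollFirst();
        if (resource == null) {
            // Nothing to recycle: pay the cost of opening a new one.
            resource = factory.get();
            created++;
        }
        return resource;
    }

    // Returns the resource to the pool instead of closing it.
    synchronized void release(T resource) {
        idle.addFirst(resource);
    }

    synchronized int createdCount() {
        return created;
    }
}
```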
462  <p>
463  We will look at CORBA as the platform for the API. It will give us support for
464  not only C++ and Perl, but also most other programming languages used today.
465  Direct calling into the Java API is also allowed whenever that is more suitable.
466  For instance, the web server should probably do that since going through CORBA
  every time might affect performance. See also the discussion about scalability
468  below.
469  </p>
470  <p>
471  More arguments:
472  </p>
473  <ul>
474    <li>
475    Java has a lot of freely available class libraries, for example for XML
476    parsing, image generation, etc. We will not need much special 3rd-party
477    software.
478    </li>
479    <li>
480    The performance is of course worse than for C++, but this is not considered
    a big issue since most of the computationally intensive tasks will be performed
482    by plugins, which may use any suitable language.
483    </li>
484    <li>
    Java is platform independent, but this is not a main issue. We will concentrate
    on getting things to work on the Linux platform. Some effort will be made
    to get it to work on other Unix versions as well. If it happens to work on
    other platforms, e.g. Windows, that should not be taken for granted
    in future releases.
490    </li>
491  </ul>
493  <h3>4.3 The presentation layer</h3>
494  <p>
  The requirements say nothing about the presentation layer, but since BASE 1.2
496  is web-based it is implicit that we support a web interface for BASE 2.0.
497  </p>
498  <p>
  The web server of choice is Apache. It has proven reliable and works on several
  platforms. Knowledge of how to set up and run an Apache web server is
  widespread.
502  </p>
503  <p>
  We will use a scripting module on the web server. Java Server Pages is probably
  a good choice. It will certainly make it easy to use the core API. Perl is
  another possibility. There exist Perl modules for using Java objects directly.
  The performance might suffer, but it is definitely worth a look.
508  </p>
509  <p>
510  Other issues:
511  </p>
513  <dl>
514  <dt>Browser versions</dt>
515  <dd>
516    <p>
    This is always an issue when designing web applications. Luckily,
    conformance with the different standards is getting better with each
    browser version. For this reason we should not go to great lengths to
    support very old browsers. Things to be considered are:
521    </p>
522    <ul>
523      <li>HTML version
524      <li>Style sheet support
525      <li>JavaScript support
526      <li>Java applet support
527    </ul>
528    <p>
529    In my opinion there is no need to support older versions than IE 6.0 and
530    NS 6.0. If we stay away from Dynamic HTML and similar technologies, any code
531    that works on both of these browsers will probably work on most older ones
532    also (IE 5.x and NS 4.x). Browser related issues can also easily be solved
533    by the open source community.
534    </p>
535    <p>
536    Note! It is mainly an issue of testing, which takes a lot of time, and if
537    one has to do it over and over again with different browser versions and
538    operating systems it is going to take a lot of valuable time from more
539    productive development.
540    </p>
541  </dd>
543  <dt>Unicode support</dt>
544  <dd>
545    <p>
    The newer browsers support enough Unicode to get it to work. Older ones have
    a few annoying issues (especially Netscape). See also the discussion about
    Unicode for the data layer.
549    </p>
550  </dd>
551  </dl>
553  <h3>4.4 Scalability</h3>
554  <p>
555  The scalability issue is only important in certain parts of the application.
556  For instance, we do not expect the performance of the web server to be a
557  problem. This is not the kind of application that attracts thousands or more
558  simultaneous users.
559  </p>
560  <p>
561  On the other hand, some parts of the application can be very calculation
562  intensive, i.e. the plugins. The requirements specify that it should be possible
563  to run plugins on separate servers. With the use of CORBA this should not pose
  any problems. Different plugins can run on different servers and in theory it
565  should be possible to create a cluster of servers for the plugins.
566  </p>
567  <p>
568  Because of the large quantities of data, the database itself may also be put
569  under strain. It should not pose any problem to run the database on a
570  different server. It is the database driver's responsibility to connect to
571  the database and once connected it should not matter to the rest of the
572  BASE application where it is located. One exception might be a low-level
573  import and/or export function where the database reads/writes data from/to a
574  file on the disk. In this case the network may have to be configured
575  appropriately to allow the database to access the file or, if it is impossible,
576  the driver should do the reading and writing, using SQL to communicate with
577  the database.
578  </p>
579  <p>
580  The minimal configuration involves two computers:
581  </p>
582  <ol>
583    <li>the user's workstation running a browser
584    <li>the BASE server running everything else
585  </ol>
587  <p>
588  The maximum configuration involves at least four computers:
589  </p>
590  <ol>
591    <li>the user's workstation running a browser
592    <li>the main BASE server running the webserver, core logic layer and
593      helper classes, data abstraction layer and database drivers
594    <li>database server
595    <li>one or more plugin servers
596  </ol>
<a name="5"></a>
600<h2>5. Work items</h2>
602  <p>
603  Here is a list of what needs to be done before BASE 2.0 can be released. The
604  list is ordered by the start time of each item. For a complete time plan see
605  base2.0timplan.sxc.
606  </p>
608  <dl>
609  <dt>1.  Get this specification finished</dt>
610  <dd></dd>
  <dt>2.  Finding more developers/contributors.</dt>
613  <dd>
614    <p>
615    BASE has a large user base and already
616    a few interested developers. We need to notify them of our plans and find
617    out if someone is interested in contributing to the development.
618    </p>
619  </dd>
621  <dt>3. Make a specification for new functionality in BASE 2.0.</dt>
622  <dd>
623    <p>
624    It is implicit
625    that all functionality in the current version of BASE also should be
626    in BASE 2.0. One important part of this specification is to specify plugins
627    and import/export formats (implemented as plugins).
628    </p>
629    <p>
630    This specification should also include some use cases. A few of them will
631    be used for the prototype development. All will be used during the main
632    implementation and the testing.
633    </p>
634  </dd>
636  <dt>4.  Make a prototype for a subset of BASE 2.0</dt>
637  <dd>
638    <p>
639    The prototype should include test implementations of the most important
640    technical problems we are expecting to encounter during the development.
641    </p>
642    <ul>
643    <li>MySQL connection from Java, including transaction support and connection
644      pooling
645    <li>Test of the database driver concept, i.e. test with another database
646    <li>Clear division into the different layers
647    <li>Test of CORBA interface
648    <li>Test of Java Server Pages
649    <li>Test of Perl calling Java
650    <li>Test of plugin concept, run plugins locally and remote
651    <li>Investigate LDAP and if it can be used for user authentication
652    </ul>
653    <p>
654    At the end of the prototype development all decisions regarding technical
655    solutions must have been made.
656    </p>
657  </dd>
659  <dt>5.  Implement the data layers and the core logic layer</dt>
660  <dd>
661    <ul>
662      <li>database schema
663      <li>driver for MySQL
664      <li>database abstraction layer
665      <li>core logic layer
666    </ul>
667  </dd>
669  <dt>6.  Implement web interface and helper functions</dt>
670  <dd>
671    <ul>
672      <li>basic web functionality, i.e. adding data
673      <li>extended functionality, i.e. analysing data
674      <li>helper classes
675    </ul>
676  </dd>
678  <dt>7.  CORBA API</dt>
679  <dd></dd>
681  <dt>8.  Plugins</dt>
682  <dd>
683    <ul>
684      <li>analysis plugins
685      <li>import/export plugins
686    </ul>
687  </dd>
689  <dt>9.  Testing</dt>
690  <dd></dd>
692  <dt>10. Migration functions</dt>
693  <dd>
694    <p>
695    I don't think it is possible to create a version that is backwards compatible
696    with BASE 1.2. This means that before the installation all data must be
697    exported and then imported into the new version.
698    </p>
  </dd>
701  <dt>11. Installation script</dt>
702  <dd></dd>
704  <dt>12. Extra functionality</dt>
705  <dd>
706    <ul>
707    <li>support for Postgres and other databases
708    <li>more plugins
709    <li>standalone client software
710    </ul>
711  </dd>
713  <dt>13. Documentation</dt>
714  <dd>
715    <p>
    All points above include writing documentation!
    Since it is a very important issue it is also included as a separate point.
718    Proper documentation MUST be available for:
719    </p>
720    <ul>
721      <li>the database layout (tables, etc.)
722      <li>how to write a database driver
723      <li>API for the data abstraction layer
724      <li>API for the core logic layer
725      <li>helper functions in the logic layer
726      <li>how to create plugins
727      <li>online help/manual for the web interface
728      <li>the findings we made during the prototype development
729    </ul>
730  </dd>
732  </dl>