source: trunk/doc/development/overview/core/data_in_files.html @ 3681

Last change on this file since 3681 was 3681, checked in by Nicklas Nordborg, 16 years ago

References #721: Store data in files instead of in the database

First draft of a specification/implementation design

  • Property svn:eol-style set to native
  • Property svn:keywords set to Date Id
1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
2<!--
3  $Id: data_in_files.html 3681 2007-08-17 11:56:30Z nicklas $
4
5  Copyright (C) Authors contributing to this file.
6
7  This file is part of BASE - BioArray Software Environment.
8  Available at http://base.thep.lu.se/
9
10  BASE is free software; you can redistribute it and/or
11  modify it under the terms of the GNU General Public License
12  as published by the Free Software Foundation; either version 2
13  of the License, or (at your option) any later version.
14
15  BASE is distributed in the hope that it will be useful,
16  but WITHOUT ANY WARRANTY; without even the implied warranty of
17  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
18  GNU General Public License for more details.
19
20  You should have received a copy of the GNU General Public License
21  along with this program; if not, write to the Free Software
22  Foundation, Inc., 59 Temple Place - Suite 330,
23  Boston, MA  02111-1307, USA.
24-->
25<html>
26  <head>
27    <title>BASE - Development information - Overview of data in files</title>
28  <link rel=stylesheet type="text/css" href="../../../styles.css">
29  </head>
30<body>
31
32<div class="navigation">
33  <a href="../../../index.html">BASE</a>
34  <img src="../../../next.gif">
35  <a href="../../index.html">Development information</a>
36  <img src="../../../next.gif">
37  <a href="index.html">Overview of the core API</a>
38  <img src="../../../next.gif">
39  Store data in files
40</div>
41
42  <h1>Store data in files</h1>
43
44  <div class="abstract">
45 
46    <p class="warning">
      NOTE! This document is a draft currently being worked on!
48      Major changes are expected before the design is finalized.
49    </p>
50 
51    <p>
52    This document explains how to use BASE for storing data
53    in original files instead of in the database. This API solves
54    the following problems:
55    </p>
56   
57    <ul>
58    <li>
59      Makes it possible for a GUI application to ask a user
60      for one or more files containing the data. This is a
61      dynamic and completely configurable (by the server admin)
62      process.
63    </li>
64    <li>
      Validate the correctness of the files used and read
      metadata into the database. Interesting metadata includes,
      for example, the number of spots and the headers.
68    </li>
69   
70    <li>
      For data types that support it, import the data into the
72      database.
73    </li>
74    </ul>
75 
76    <p>
77    <b>Contents</b>
78    </p>
79
80    <ol>
81    <li><a href="#diagram">Diagram of classes and methods</a>
82    <li><a href="#ask">Asking the user for files</a>
83    <li><a href="#link">Link to the selected files</a>
84    <li><a href="#metadata">Validate the file and extract metadata</a>
85    <li><a href="#import">Import data into the database</a>
86    <li><a href="#platforms">Pre-installed platforms</a>
87    <li><a href="#problems">Remaining problems</a>
88    </ol>
89
90    <p>
91    <b>See also</b>
92    </p>
93    <ul>
94    <li><a href="../data/data_in_files.html">Database schema</a>
95    </ul>
96
97    <p class="authors">
98    <b>Last updated:</b> $Date: 2007-08-17 11:56:30 +0000 (Fri, 17 Aug 2007) $
99    </p>
100  </div>
101
102  <a name="diagram">
103  <h2>1. Diagram of classes and methods</h2>
104  </a>
105  <img src="data_in_files.png">
106
107  <a name="ask">
108  <h2>2. Asking the user for files</h2>
109  </a>
110 
111  <p>
    Given that we have a <code>FileXxxAble</code> object (for example
    a <code>RawBioAssay</code> or an <code>ArrayDesign</code>), we call
    <code>getPlatform()</code> to load the associated platform. This is a required
    property. After executing the query returned by <code>Platform.getFileXxxTypes()</code>,
    we have a list of <code>FileXxxType</code> objects. Each one describes a specific type
    of file that can be used on the given platform. For example:
118  </p>
119 
120  <ul>
121  <li>
    The <code>Affymetrix</code> platform defines the <code>CEL</code> and <code>CDF</code>
    file types for <code>RAW_DATA</code> and <code>FEATURE_DATA</code>, respectively.
    If we have a <code>RawBioAssay</code>, we filter the query to only return raw data
    file types. Now we can ask the user for a CEL file.
126  </li>
127  </ul>
128 
129  <p>
    In fact, we can get the list of <code>FileXxxType</code> objects for any
    type of item using the simple code below:
132  </p>
133 
134  <pre class="code">
DbControl dc = ...
FileXxxAble item = ...
Platform p = item.getPlatform();
List&lt;FileXxxType&gt; fileType =
  p.getFileXxxTypes(item.getItemType()).list(dc);
// Now, ask the user to select one file for each type
141</pre>
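  <p>
    The lookup can also be illustrated with a small self-contained model.
    This is only a sketch: <code>FileTypeLookupSketch</code>, <code>PlatformInfo</code>
    and <code>FileTypeInfo</code> are hypothetical stand-ins and not part of the
    BASE API, and the sketch filters on the data type instead of the item type
    used in the code above. The file type IDs are taken from the table of
    pre-installed platforms.
  </p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the classes in this draft; not the BASE API.
final class FileTypeLookupSketch
{
  enum DataType { RAW_DATA, FEATURE_DATA }

  // Describes one type of file that can be used on a platform
  static final class FileTypeInfo
  {
    final String externalId;
    final DataType dataType;

    FileTypeInfo(String externalId, DataType dataType)
    {
      this.externalId = externalId;
      this.dataType = dataType;
    }
  }

  static final class PlatformInfo
  {
    private final List<FileTypeInfo> fileTypes = new ArrayList<>();

    PlatformInfo add(String externalId, DataType dataType)
    {
      fileTypes.add(new FileTypeInfo(externalId, dataType));
      return this;
    }

    // Return only the file types registered for the wanted data type
    List<FileTypeInfo> getFileTypes(DataType wanted)
    {
      List<FileTypeInfo> result = new ArrayList<>();
      for (FileTypeInfo type : fileTypes)
      {
        if (type.dataType == wanted) result.add(type);
      }
      return result;
    }
  }

  // The Affymetrix platform with its CEL and CDF file types
  static PlatformInfo affymetrix()
  {
    return new PlatformInfo()
      .add("affymetrix.cel", DataType.RAW_DATA)
      .add("affymetrix.cdf", DataType.FEATURE_DATA);
  }

  public static void main(String[] args)
  {
    // A raw bioassay only needs the raw data file types
    for (FileTypeInfo type : affymetrix().getFileTypes(DataType.RAW_DATA))
    {
      System.out.println("Ask the user for: " + type.externalId);
    }
  }
}
```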
142 
143  <a name="link">
144  <h2>3. Link to the selected files</h2>
145  </a>
146
147  <p>
    When the user has selected the file(s), we must store the links to them
    in the database. This is done via a <code>FileSet</code>. A file set
    contains zero, one or more files. The only limitation is that it can contain
    at most one file of each <code>FileXxxType</code>. Call <code>FileSet.addMember</code>
    to store a file in the file set. If a file already exists for the given
    file type, it is replaced; otherwise a new entry is created.
154  </p>
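  <p>
    The replace-or-create behaviour could be modelled with a map keyed by
    file type. The sketch below is an illustration only: <code>FileSetSketch</code>
    is a hypothetical class, files are represented by plain path strings, and
    the return value of <code>addMember</code> is an assumption made for the
    example, not something this draft specifies.
  </p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of a file set; not the BASE implementation.
final class FileSetSketch
{
  // File type ID -> path of the linked file; at most one file per type
  private final Map<String, String> members = new LinkedHashMap<>();

  // Link a file for the given type. Returns the path that was replaced,
  // or null if no file existed for the type (assumed return value).
  String addMember(String fileTypeId, String path)
  {
    return members.put(fileTypeId, path);
  }

  String getMember(String fileTypeId)
  {
    return members.get(fileTypeId);
  }

  int size()
  {
    return members.size();
  }

  public static void main(String[] args)
  {
    FileSetSketch fileSet = new FileSetSketch();
    fileSet.addMember("affymetrix.cel", "/data/scan1.cel");
    fileSet.addMember("affymetrix.cdf", "/data/chip.cdf");
    // Adding a second CEL file replaces the first one
    String replaced = fileSet.addMember("affymetrix.cel", "/data/scan2.cel");
    System.out.println("Replaced: " + replaced);
    System.out.println("Number of members: " + fileSet.size());
  }
}
```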
155
156  <a name="metadata">
157  <h2>4. Validate the file and extract metadata</h2>
158  </a>
159 
160  <p>
    Validation and extraction of metadata are important if we
    want data in files to be equivalent to data in the database. The validation
    and metadata extraction are normally performed when adding a file
    to a file set.
165  </p>
166 
167  <p>
    Each <code>FileXxxType</code> may store the class name of a <code>FileValidator</code>
    and a <code>MetadataReader</code>. If so, they are used when a file is
    added to the file set. Note that if the same class is used
    for both validation and metadata reading, only one instance is created.
172  </p>
173 
174  <pre class="code">
FileXxxAble item = ...
FileXxxType type = ...
File file = ...

FileValidator validator = type.getValidator();
MetadataReader reader = type.getMetadataReader();

validator.setFile(file);
validator.setItem(item);
// Repeat setFile() and setItem() for 'reader' unless it is
// the same instance as 'validator'
validator.validate();
reader.writeMetadata();
187</pre>
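  <p>
    One way to guarantee a single instance when the same class is used for both
    roles is to compare the stored class names before instantiating. The sketch
    below assumes the handlers are created by reflection from their class names
    and have a no-argument constructor; <code>HandlerFactorySketch</code>,
    <code>Handlers</code> and <code>CelHandler</code> are hypothetical names, and
    the two interfaces are reduced to a single method each.
  </p>

```java
// Hypothetical sketch; the interfaces are simplified stand-ins.
final class HandlerFactorySketch
{
  interface FileValidator { void validate(); }
  interface MetadataReader { void writeMetadata(); }

  // A single class that handles both validation and metadata reading
  static final class CelHandler implements FileValidator, MetadataReader
  {
    public void validate() { /* check the file here */ }
    public void writeMetadata() { /* store metadata here */ }
  }

  // Holds the handlers for one file type; either may be null
  static final class Handlers
  {
    final FileValidator validator;
    final MetadataReader reader;

    Handlers(FileValidator validator, MetadataReader reader)
    {
      this.validator = validator;
      this.reader = reader;
    }
  }

  // Create the handlers, reusing the validator instance as the reader
  // when both class names are the same.
  static Handlers create(String validatorClass, String readerClass)
  {
    Object validator = validatorClass == null ? null : newInstance(validatorClass);
    Object reader;
    if (readerClass == null) reader = null;
    else if (readerClass.equals(validatorClass)) reader = validator;
    else reader = newInstance(readerClass);
    return new Handlers((FileValidator)validator, (MetadataReader)reader);
  }

  private static Object newInstance(String className)
  {
    try
    {
      return Class.forName(className).getDeclaredConstructor().newInstance();
    }
    catch (ReflectiveOperationException ex)
    {
      throw new IllegalStateException("Cannot create handler: " + className, ex);
    }
  }

  public static void main(String[] args)
  {
    String name = CelHandler.class.getName();
    Handlers handlers = create(name, name);
    Object validator = handlers.validator;
    Object reader = handlers.reader;
    System.out.println("Same instance: " + (validator == reader));
  }
}
```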
188
189  <p>
    All validators and metadata readers should extend <code>AbstractFileHandler</code>.
    The reason is that I feel that we may have to add more methods to the <code>FileHandler</code>
    interface in the future. <code>AbstractFileHandler</code> will then provide default
    implementations, so that existing handlers continue to compile.
194  </p>
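  <p>
    The relationship between the interface and the abstract base class might look
    like the sketch below. It assumes that <code>FileValidator</code> extends
    <code>FileHandler</code> and that <code>setFile</code> and <code>setItem</code>
    belong to <code>FileHandler</code>; neither is confirmed by this draft. Files
    are represented by path strings, and <code>NonEmptyPathValidator</code> is a
    made-up example handler.
  </p>

```java
// Hypothetical sketch of the handler hierarchy; not the BASE API.
final class FileHandlerSketch
{
  interface FileHandler
  {
    void setFile(String path);
    void setItem(Object item);
  }

  interface FileValidator extends FileHandler
  {
    void validate();
  }

  // Default implementations live here. If FileHandler gets more methods,
  // only this class has to change and not every handler.
  static abstract class AbstractFileHandler implements FileHandler
  {
    private String path;
    private Object item;

    public void setFile(String path) { this.path = path; }
    public void setItem(Object item) { this.item = item; }

    protected String getFile() { return path; }
    protected Object getItem() { return item; }
  }

  // Example handler that only checks that a file has been set
  static final class NonEmptyPathValidator
    extends AbstractFileHandler implements FileValidator
  {
    public void validate()
    {
      if (getFile() == null || getFile().isEmpty())
      {
        throw new IllegalStateException("No file has been set");
      }
    }
  }

  public static void main(String[] args)
  {
    NonEmptyPathValidator validator = new NonEmptyPathValidator();
    validator.setFile("/data/scan1.cel");
    validator.setItem("raw bioassay placeholder");
    validator.validate();
    System.out.println("Validation passed");
  }
}
```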
195
196  <a name="import">
197  <h2>5. Import data into the database</h2>
198  </a>
199
200  <p>
201    TODO....
202  </p>
203  <p>
    ...but I think this is done by the already existing plug-ins in
    more or less the same manner as before. They may benefit from already
    selected file(s), so it would probably be a good idea to make them
    aware of the <code>FileSet</code> to offer good default values.
208  </p>
209 
210  <pre class="code">
// Get the file to use as a default value
File defaultFile = null;
RawBioAssay rba = ...
FileSet fileSet = rba.getFileSet();
if (fileSet != null)
{
  List&lt;FileSetMember&gt; list = fileSet.getMembers(DataType.RAW_DATA);
  if (list.size() &gt; 0)
  {
    defaultFile = list.get(0).getFile();
  }
}
223</pre>
224
225  <p>
226    The auto detect option of the web interface should also be made aware
227    of this.
228  </p>
229
230  <a name="platforms">
231  <h2>6. Pre-installed platforms</h2>
232  </a>
233 
234  <p>
    BASE ships with a number of platforms pre-installed. It is important
    that the external IDs of the platforms and file types are not changed.
237  </p>
238 
239  <table border="1">
240  <tr>
241    <th colspan="2">Platform</th>
242    <th colspan="3">File types</th>
243  </tr>
244  <tr>
245    <th>Name</th>
246    <th>ID</th>
247    <th>Data type</th>
248    <th>Name</th>
249    <th>ID</th>
250  </tr>
251  <tr>
252    <td rowspan="3">Generic</td>
253    <td rowspan="3">generic</td>
254    <td>RAW_DATA</td>
255    <td>Raw data file</td>
256    <td>generic.raw</td>
257  </tr>
258  <tr>
259    <td>FEATURE_DATA</td>
260    <td>Print map</td>
261    <td>generic.printmap</td>
262  </tr>
263  <tr>
264    <td>FEATURE_DATA</td>
265    <td>Reporter map</td>
266    <td>generic.reportermap</td>
267  </tr>
268  <tr>
269    <td rowspan="2">Affymetrix</td>
270    <td rowspan="2">affymetrix</td>
271    <td>RAW_DATA</td>
272    <td>Affymetrix CEL file</td>
273    <td>affymetrix.cel</td>
274  </tr>
275  <tr>
276    <td>FEATURE_DATA</td>
277    <td>Affymetrix CDF file</td>
278    <td>affymetrix.cdf</td>
279  </tr>
280  </table>
281 
282  <p>
    On servers that are upgraded from a previous release, existing items are
    assigned the generic platform unless the array design is an Affymetrix chip and
    the raw bioassay has the Affymetrix raw data type.
286  </p>
287
288  <a name="problems">
289  <h2>7. Remaining problems</h2>
290  </a>
291
292  <ul>
293  <li>
294    Naming of the new classes: <code>FileXxxAble</code> and <code>FileXxxType</code>
295    need better names. <code>FileType</code> and <code>FileAttachable</code> are already
296    used. <code>Platform</code> makes sense only for <code>RawBioAssay</code> and
297    <code>ArrayDesign</code>. It is not so good when used for <code>BioAssaySet</code>.
298  </li>
299 
300  <li>
301    How does the current <code>RawDataType</code> fit
302    into this? There seems to be an overlap with <code>Platform</code>. Should
303    we have a link from a platform to a raw data type? Can we mix any platform
304    with any raw data type? Can we mix platforms in an experiment? Should we
305    allow mixing of raw data types in an experiment? Can we still maintain
306    backwards compatibility?
307  </li>
308 
309  <li>
310    Is it too restrictive to only allow one file of each <code>FileXxxType</code>
311    in a <code>FileSet</code>? Do we need a
312    'multiplicity' option for the <code>FileXxxType</code>?
313  </li>
314 
315  <li>
    Validation and metadata extraction are done on a single-file basis (<code>FileSetMember</code>).
317    Is this too limiting? Maybe we need validation and metadata extraction based on the entire
318    <code>FileSet</code>. This could happen if data is split into multiple files
319    (for example Imagene has one file for cy3 data and one for cy5 data). Do we need
320    a <code>FileSetValidator</code> / <code>FileSetMetadataReader</code> that can
321    be assigned to <code>Platform</code>?
322  </li>
323 
324  <li>
    Is one <code>FileXxxType</code> per <code>File</code> enough? Probably in most cases, but maybe
    for analysed data there is overlap/compatibility between file formats.
    Can we do this by saying that file type X is compatible with file type Y
    and if someone asks for Y we give them X? Can we do this by a directional many-to-many
    relation between two <code>FileXxxType</code> items? Is this a real problem
    whose solution would improve the user experience, or is it only theoretical?
331  </li>
332 
333  <li>
334    More...??
335  </li>
336 
337  </ul>
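  <p>
    To make the compatibility question above more concrete, a directional
    many-to-many relation between file types could be modelled roughly as below.
    This is only a sketch of one possible design: <code>FileTypeCompatibilitySketch</code>
    and its methods are hypothetical, and file types are represented by ID strings.
  </p>

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a directional compatibility relation.
final class FileTypeCompatibilitySketch
{
  // Requested file type ID -> IDs of file types that may be given instead
  private final Map<String, Set<String>> substitutes = new LinkedHashMap<>();

  // Declare that 'provided' may be handed out when 'requested' is asked for.
  // The relation is directional: the reverse is not implied.
  void addCompatibility(String provided, String requested)
  {
    substitutes.computeIfAbsent(requested, key -> new LinkedHashSet<>()).add(provided);
  }

  // A file type always satisfies a request for itself
  boolean canSatisfy(String provided, String requested)
  {
    if (provided.equals(requested)) return true;
    Set<String> allowed = substitutes.get(requested);
    return allowed != null && allowed.contains(provided);
  }

  public static void main(String[] args)
  {
    FileTypeCompatibilitySketch compatibility = new FileTypeCompatibilitySketch();
    // File type X is compatible with file type Y
    compatibility.addCompatibility("x", "y");
    System.out.println("X for Y: " + compatibility.canSatisfy("x", "y"));
    System.out.println("Y for X: " + compatibility.canSatisfy("y", "x"));
  }
}
```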
338 
339
340</body>
341</html>