<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<!--
  $Id: data_in_files.html 3681 2007-08-17 11:56:30Z nicklas $

  Copyright (C) Authors contributing to this file.

  This file is part of BASE - BioArray Software Environment.
  Available at http://base.thep.lu.se/

  BASE is free software; you can redistribute it and/or
  modify it under the terms of the GNU General Public License
  as published by the Free Software Foundation; either version 2
  of the License, or (at your option) any later version.

  BASE is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program; if not, write to the Free Software
  Foundation, Inc., 59 Temple Place - Suite 330,
  Boston, MA 02111-1307, USA.
-->
<html>
<head>
<title>BASE - Development information - Overview of data in files</title>
<link rel=stylesheet type="text/css" href="../../../styles.css">
</head>
<body>

<div class="navigation">
<a href="../../../index.html">BASE</a>
<img src="../../../next.gif">
<a href="../../index.html">Development information</a>
<img src="../../../next.gif">
<a href="index.html">Overview of the core API</a>
<img src="../../../next.gif">
Store data in files
</div>

<h1>Store data in files</h1>

<div class="abstract">

<p class="warning">
NOTE! This document is a draft currently being worked on!
Major changes are expected before the design is finalized.
</p>

<p>
This document explains how to use BASE for storing data
in original files instead of in the database. The API addresses
the following needs:
</p>

<ul>
<li>
Let a GUI application ask a user for one or more files
containing the data. This is a dynamic process that is
completely configurable by the server admin.
</li>
<li>
Validate the correctness of the files used and read
metadata into the database. Interesting metadata includes,
for example, the number of spots and the headers.
</li>

<li>
For data types that support it, import the data into the
database.
</li>
</ul>

<p>
<b>Contents</b>
</p>

<ol>
<li><a href="#diagram">Diagram of classes and methods</a>
<li><a href="#ask">Asking the user for files</a>
<li><a href="#link">Link to the selected files</a>
<li><a href="#metadata">Validate the file and extract metadata</a>
<li><a href="#import">Import data into the database</a>
<li><a href="#platforms">Pre-installed platforms</a>
<li><a href="#problems">Remaining problems</a>
</ol>

<p>
<b>See also</b>
</p>
<ul>
<li><a href="../data/data_in_files.html">Database schema</a>
</ul>

<p class="authors">
<b>Last updated:</b> $Date: 2007-08-17 11:56:30 +0000 (Fri, 17 Aug 2007) $
</p>
</div>

<a name="diagram">
<h2>1. Diagram of classes and methods</h2>
</a>
<img src="data_in_files.png">

<a name="ask">
<h2>2. Asking the user for files</h2>
</a>

<p>
Given a <code>FileXxxAble</code> object (for example
a <code>RawBioAssay</code> or an <code>ArrayDesign</code>), we call
<code>getPlatform()</code> to load the associated platform, which is a required
property. Executing the query returned by <code>Platform.getFileXxxTypes()</code>
gives us a list of <code>FileXxxType</code> objects. Each one describes a specific type
of file that can be used on the given platform. For example:
</p>

<ul>
<li>
The <code>Affymetrix</code> platform defines the <code>CEL</code> and <code>CDF</code>
file types for <code>RAW_DATA</code> and <code>FEATURE_DATA</code> respectively.
If we have a <code>RawBioAssay</code> we filter the query to only return raw data
file types. Now, we can ask the user for a CEL file.
</li>
</ul>

<p>
In fact, we can get the list of <code>FileXxxType</code> objects for any
type of item using the simple code below:
</p>

<pre class="code">
DbControl dc = ...
FileXxxAble item = ...
Platform p = item.getPlatform();
List&lt;FileXxxType&gt; fileTypes =
   p.getFileXxxTypes(item.getItemType()).list(dc);
// Now, ask the user to select one file for each type
</pre>

<a name="link">
<h2>3. Link to the selected files</h2>
</a>

<p>
When the user has selected the file(s) we must store the links to them
in the database. This is done via a <code>FileSet</code>. A file set
contains zero or more files. The only limitation is that it can contain
at most one file of each <code>FileXxxType</code>. Call <code>FileSet.addMember</code>
to store a file in the file set. If a file already exists for the given
file type it is replaced; otherwise a new entry is created.
</p>
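
<p>
The replace-or-create rule can be illustrated with a small, self-contained sketch.
It is not the BASE implementation: <code>FileSetSketch</code> is a hypothetical
stand-in, file types and files are represented by plain strings, and the file
paths are made up for the example.
</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a file set; NOT the BASE implementation.
// A file type is represented by its external ID and a file by its path.
class FileSetSketch
{
    // At most one file per file type, kept in insertion order
    private final Map<String, String> members = new LinkedHashMap<String, String>();

    // Store a file for the given type; returns the replaced path, or null
    // if no file was registered for that type before
    String addMember(String fileTypeId, String path)
    {
        return members.put(fileTypeId, path);
    }

    // Returns the path registered for the type, or null if there is none
    String getMember(String fileTypeId)
    {
        return members.get(fileTypeId);
    }

    int size()
    {
        return members.size();
    }

    public static void main(String[] args)
    {
        FileSetSketch fileSet = new FileSetSketch();
        fileSet.addMember("affymetrix.cel", "/raw/sample1.cel");
        // Same file type again: the earlier entry is replaced, not duplicated
        String replaced = fileSet.addMember("affymetrix.cel", "/raw/sample2.cel");
        System.out.println("Replaced: " + replaced);
        System.out.println("Members: " + fileSet.size());
    }
}
```

<p>
Keying the members on the file type is one simple way to get the
"one file per type" limitation without an explicit check.
</p>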

<a name="metadata">
<h2>4. Validate the file and extract metadata</h2>
</a>

<p>
Validation and metadata extraction are important if we
want data in files to be equivalent to data in the database. Both
are normally performed when a file is added
to a file set.
</p>

<p>
Each <code>FileXxxType</code> may store the class name of a <code>FileValidator</code>
and a <code>MetadataReader</code>. If so, they are used when a file is
added to the file set. Note that if the same class is used
for both validation and metadata reading, only one instance is created.
</p>

<pre class="code">
FileXxxAble item = ...
FileXxxType type = ...
File file = ...

FileValidator validator = type.getValidator();
MetadataReader reader = type.getMetadataReader();

validator.setFile(file);
validator.setItem(item);
// Repeat for 'reader' if not same as 'validator'
validator.validate();
reader.writeMetadata();
</pre>
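
<p>
The "only one instance" rule could be implemented roughly as sketched below.
This is an assumption about how such a factory might look, not the actual BASE code:
the two interfaces are reduced to a single method each, and
<code>CombinedHandlerSketch</code> and <code>HandlerFactorySketch</code> are
hypothetical names introduced for the example.
</p>

```java
// Hypothetical stand-ins for the interfaces named in the text
interface FileValidator { void validate(); }
interface MetadataReader { void writeMetadata(); }

// Example class that fills both roles (illustrative only)
class CombinedHandlerSketch implements FileValidator, MetadataReader
{
    public void validate() { /* check the file here */ }
    public void writeMetadata() { /* copy metadata to the item here */ }
}

class HandlerFactorySketch
{
    // Returns { validator, reader }; an entry is null when no class name is stored
    static Object[] createHandlers(String validatorClass, String readerClass)
    {
        Object validator = validatorClass == null ? null : newInstance(validatorClass);
        Object reader = null;
        if (readerClass != null)
        {
            // Reuse the validator instance when both roles name the same class
            reader = readerClass.equals(validatorClass) ? validator : newInstance(readerClass);
        }
        return new Object[] { validator, reader };
    }

    private static Object newInstance(String className)
    {
        try
        {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        }
        catch (ReflectiveOperationException ex)
        {
            throw new IllegalStateException("Cannot create handler: " + className, ex);
        }
    }

    public static void main(String[] args)
    {
        String name = CombinedHandlerSketch.class.getName();
        Object[] handlers = createHandlers(name, name);
        System.out.println("Same instance: " + (handlers[0] == handlers[1]));
    }
}
```

<p>
Sharing the instance means the file only needs to be handed to one object,
which can then parse it once for both validation and metadata extraction.
</p>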

<p>
All validators and metadata readers should extend <code>AbstractFileHandler</code>.
The reason is that we may have to add more methods to the <code>FileHandler</code>
interface in the future. <code>AbstractFileHandler</code> will then provide default
implementations, so that existing handlers do not have to be changed.
</p>
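
<p>
A minimal sketch of why the abstract base class helps is shown below. The method
names <code>setFile</code> and <code>getDescription</code>, the string-based file
path and the <code>CelValidatorSketch</code> class are assumptions made for the
example only; the real interface may look different.
</p>

```java
// Hypothetical shape of the handler interface (names are assumptions)
interface FileHandler
{
    void setFile(String path);

    // Imagine this method being added to the interface in a later release
    String getDescription();
}

// Base class that absorbs interface changes on behalf of its subclasses
abstract class AbstractFileHandler implements FileHandler
{
    private String path;

    public void setFile(String path)
    {
        this.path = path;
    }

    protected String getFile()
    {
        return path;
    }

    // Default implementation of the later-added method
    public String getDescription()
    {
        return getClass().getSimpleName();
    }
}

// Written without any knowledge of getDescription(), yet it still compiles
class CelValidatorSketch extends AbstractFileHandler
{
    // Toy check for illustration: only looks at the file extension
    boolean validate()
    {
        return getFile() != null && getFile().endsWith(".cel");
    }
}
```

<p>
A handler that implemented <code>FileHandler</code> directly would stop compiling
as soon as the interface gained a method; the subclass above is unaffected.
</p>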

<a name="import">
<h2>5. Import data into the database</h2>
</a>

<p>
TODO....
</p>
<p>
...but I think this is done by the already existing plug-ins in
more or less the same manner as before. They may benefit from already
selected file(s), so it would probably be a good idea to make them
aware of the <code>FileSet</code> to offer good default values.
</p>

<pre class="code">
// Get the file to use as the default value
File defaultFile = null;
RawBioAssay rba = ...
FileSet fileSet = rba.getFileSet();
if (fileSet != null)
{
   List&lt;FileSetMember&gt; list = fileSet.getMembers(DataType.RAW_DATA);
   if (list.size() &gt; 0)
   {
      defaultFile = list.get(0).getFile();
   }
}
</pre>

<p>
The auto detect option of the web interface should also be made aware
of this.
</p>

<a name="platforms">
<h2>6. Pre-installed platforms</h2>
</a>

<p>
BASE ships with a number of platforms pre-installed. It is important
that the external IDs of the platforms and file types are not changed.
</p>

<table border="1">
<tr>
  <th colspan="2">Platform</th>
  <th colspan="3">File types</th>
</tr>
<tr>
  <th>Name</th>
  <th>ID</th>
  <th>Data type</th>
  <th>Name</th>
  <th>ID</th>
</tr>
<tr>
  <td rowspan="3">Generic</td>
  <td rowspan="3">generic</td>
  <td>RAW_DATA</td>
  <td>Raw data file</td>
  <td>generic.raw</td>
</tr>
<tr>
  <td>FEATURE_DATA</td>
  <td>Print map</td>
  <td>generic.printmap</td>
</tr>
<tr>
  <td>FEATURE_DATA</td>
  <td>Reporter map</td>
  <td>generic.reportermap</td>
</tr>
<tr>
  <td rowspan="2">Affymetrix</td>
  <td rowspan="2">affymetrix</td>
  <td>RAW_DATA</td>
  <td>Affymetrix CEL file</td>
  <td>affymetrix.cel</td>
</tr>
<tr>
  <td>FEATURE_DATA</td>
  <td>Affymetrix CDF file</td>
  <td>affymetrix.cdf</td>
</tr>
</table>

<p>
When a server is upgraded from a previous release, existing array designs
and raw bioassays are assigned the generic platform, unless the array design
is an Affymetrix chip and the raw bioassay has the Affymetrix raw data type.
</p>

<a name="problems">
<h2>7. Remaining problems</h2>
</a>

<ul>
<li>
Naming of the new classes: <code>FileXxxAble</code> and <code>FileXxxType</code>
need better names. <code>FileType</code> and <code>FileAttachable</code> are already
used. <code>Platform</code> makes sense only for <code>RawBioAssay</code> and
<code>ArrayDesign</code>. It fits less well when used for <code>BioAssaySet</code>.
</li>

<li>
How does the current <code>RawDataType</code> fit
into this? There seems to be an overlap with <code>Platform</code>. Should
we have a link from a platform to a raw data type? Can we mix any platform
with any raw data type? Can we mix platforms in an experiment? Should we
allow mixing of raw data types in an experiment? Can we still maintain
backwards compatibility?
</li>

<li>
Is it too restrictive to only allow one file of each <code>FileXxxType</code>
in a <code>FileSet</code>? Do we need a
'multiplicity' option for the <code>FileXxxType</code>?
</li>

<li>
Validation and metadata extraction are done on a single-file basis (<code>FileSetMember</code>).
Is this too limiting? Maybe we need validation and metadata extraction based on the entire
<code>FileSet</code>. This could be needed if data is split into multiple files
(for example, Imagene has one file for cy3 data and one for cy5 data). Do we need
a <code>FileSetValidator</code> / <code>FileSetMetadataReader</code> that can
be assigned to <code>Platform</code>?
</li>

<li>
Is one <code>FileXxxType</code> per <code>File</code> enough? Probably in most cases, but maybe
for analysed data there is overlap/compatibility between file formats.
Can we do this by saying that file type X is compatible with file type Y
and if someone asks for Y we give them X? Can we do this by a directional many-to-many
relation between two <code>FileXxxType</code> objects? Is this a real problem
whose solution would improve the user experience, or is it only theoretical?
</li>

<li>
More...??
</li>

</ul>


</body>
</html>