Changeset 3757


Timestamp:
Sep 20, 2007, 9:50:22 AM (14 years ago)
Author:
Nicklas Nordborg
Message:

References #721: Store data in files instead of in the database

Updated specification and documentation.

Location:
trunk/doc
Files:
2 added
6 edited

  • trunk/doc/development/overview/core/data_in_files.html

    r3681 r3757  
    4343
    4444  <div class="abstract">
    45  
    46     <p class="warning">
    47       NOTE! This document is a draft currently beeing worked on!
    48       Major changes are expected before the design is finalized.
    49     </p>
     45    <br>
     46    <div class="warning">
     47      NOTE! This document has been replaced with new documents:
     48      <ul>
     49      <li><a href="../../../html/developerdoc/api/core_api.data_in_files.html">Core API : Using files to store data</a>
     50      <li><a href="../../../html/developerdoc/api/data_api.platforms.html">Database/UML schema</a>
     51      <li><a href="../../../html/appendix/appendix.rawdatatypes.html">Default platforms installed with BASE</a>
     52      </ul>
     53    </div>
    5054 
    5155    <p>
     
    291295
    292296  <ul>
    293   <li>
    294     Naming of the new classes: <code>FileXxxAble</code> and <code>FileXxxType</code>
    295     need better names. <code>FileType</code> and <code>FileAttachable</code> are already
    296     used. <code>Platform</code> makes sense only for <code>RawBioAssay</code> and
    297     <code>ArrayDesign</code>. It is not so good when used for <code>BioAssaySet</code>.
    298   </li>
    299297 
    300298  <li>
     
    305303    allow mixing of raw data types in an experiment? Can we still maintain
    306304    backwards compatibility?
     305    <p>
     306    ANSWER: There are two types of platforms: platforms that can't store data in
     307    the database (file-only platforms) and platforms which may store data in the
     308    database. File-only platforms will auto-generate a raw data type, there is
     309    no need to define it in <code>raw-data-types.xml</code>. Database-platforms
     310    may be locked to a specific raw data type, but do not have to be.
    307311  </li>
    308312 
     
    311315    in a <code>FileSet</code>? Do we need a
    312316    'multiplicity' option for the <code>FileXxxType</code>?
     317    <p>
     318    ANSWER: Only one file of each type is allowed.
    313319  </li>
    314320 
     
    320326    a <code>FileSetValidator</code> / <code>FileSetMetadataReader</code> that can
    321327    be assigned to <code>Platform</code>?
     328    <p>
     329    ANSWER: This is solved by having the same validator class for more
     330    than one file type. Only one instance is created and it is given access
     331    to all files of the specified types in the file set before
     332    <code>validate()</code> is called.
    322333  </li>
    323334 
     
    329340    relation between two <code>FileXxxType</code>:s? Is this a real problem
    330341    which increases the user experience, or is it only theoretical?
     342    <p>
     343    ANSWER: We don't see a need for it right now.
    331344  </li>
    332345 
  • trunk/doc/development/overview/data/data_in_files.html

    r3681 r3757  
    4343
    4444  <div class="abstract">
    45     <p class="warning">
    46       NOTE! This document is a draft currently beeing worked on!
    47       Major changes are expected before the design is finalized.
    48     </p>
     45    <br>
     46    <div class="warning">
     47      NOTE! This document has been replaced with new documents:
     48      <ul>
     49      <li><a href="../../../html/developerdoc/api/core_api.data_in_files.html">Core API : Using files to store data</a>
     50      <li><a href="../../../html/developerdoc/api/data_api.platforms.html">Database/UML schema</a>
     51      <li><a href="../../../html/appendix/appendix.rawdatatypes.html">Default platforms installed with BASE</a>
     52      </ul>
     53    </div>
    4954
    5055    <p>
  • trunk/doc/src/docbook/appendix/raw_data_types.xml

    r3675 r3757  
    2828
    2929<appendix id="appendix.rawdatatypes">
    30   <title>raw-data-types.xml reference</title>
    31   <para>TODO</para>
     30  <title>Platforms and raw-data-types.xml reference</title>
     31 
     32  <para>
     33    Raw data can be stored either as files attached to items or in
     34    the database. The <classname>Platform</classname> item has information
     35    about this. Configuration information for the database tables
     36    and columns used to store raw data in the database is found in the
     37    <filename>raw-data-types.xml</filename> file. For detailed information
     38    see <xref linkend="core_api.data_in_files" />.
     39  </para>
     40 
     41  <sect1 id="appendix.rawdatatypes.platforms">
     42    <title>Default platforms installed with BASE</title>
     43   
     44      <note>
     45        <title>THIS IS A DRAFT!</title>
     46        <para>
     47          This document is a draft currently being worked on!
     48          Changes are expected before the design is finalized.
     49        </para>
     50      </note>
     51   
     52    <informaltable>
     53      <tgroup cols="7">
     54        <colspec colname="platform.name" />
     55        <colspec colname="platform.id" />
     56        <colspec colname="variant.name" />
     57        <colspec colname="variant.id" />
     58        <colspec colname="filetype.item" />
     59        <colspec colname="filetype.name" />
     60        <colspec colname="filetype.id" />
     61        <thead>
     62          <row>
     63            <entry namest="platform.name" nameend="platform.id">Platform</entry>
     64            <entry namest="variant.name" nameend="variant.id">Variants</entry>
     65            <entry namest="filetype.item" nameend="filetype.id">File types</entry>
     66          </row>
     67          <row>
     68            <entry>Name</entry>
     69            <entry>ID</entry>
     70            <entry>Name</entry>
     71            <entry>ID</entry>
     72            <entry>Item</entry>
     73            <entry>Name</entry>
     74            <entry>ID</entry>
     75          </row>
     76        </thead>
     77        <tbody>
     78          <row>
     79            <entry morerows="2">Generic</entry>
     80            <entry morerows="2">generic</entry>
     81            <entry morerows="2"></entry>
     82            <entry morerows="2"></entry>
     83
     84            <entry morerows="1">Array design</entry>
     85            <entry>Reporter map</entry>
     86            <entry>generic.reportermap</entry>
     87          </row>
     88          <row>
     89            <entry>Print map</entry>
     90            <entry>generic.printmap</entry>
     91          </row>
     92          <row>
     93            <entry>Raw bioassay</entry>
     94            <entry>Raw data</entry>
     95            <entry>generic.rawdata</entry>
     96          </row>
     97          <row>
     98            <entry morerows="2">Affymetrix</entry>
     99            <entry morerows="2">affymetrix</entry>
     100            <entry morerows="1">Expression</entry>
     101            <entry morerows="1">affymetrix.expression</entry>
     102            <entry>Array design</entry>
     103            <entry>CDF</entry>
     104            <entry>affymetrix.cdf</entry>
     105          </row>
     106          <row>
     107            <entry>Raw bioassay</entry>
     108            <entry>CEL</entry>
     109            <entry>affymetrix.cel</entry>
     110          </row>
     111          <row>
     112            <entry>Snip</entry>
     113            <entry>affymetrix.snip</entry>
     114            <entry>??</entry>
     115          </row>
     116          <row>
     117            <entry>Illumina</entry>
     118            <entry>illumina</entry>
     119            <entry></entry>
     120            <entry></entry>
     121            <entry>??</entry>
     122          </row>
     123          <row>
     124            <entry namest="platform.name" nameend="filetype.id">More???</entry>
     125          </row>
     126        </tbody>
     127      </tgroup>
     128    </informaltable>
     129   
     130 
     131  </sect1>
     132 
     133  <sect1 id="appendix.rawdatatypes.ref">
     134    <title>raw-data-types.xml reference</title>
     135   
     136    <para>
     137      TODO
     138    </para>
     139   
     140  </sect1>
    32141
    33142</appendix>
  • trunk/doc/src/docbook/developerdoc/api_overview.xml

    r3715 r3757  
    627627          but not quota, expiration date and other administrative information.
    628628        </para>
     629
    629630      </sect3>
    630631    </sect2>
     
    649650      <title>Files and directories</title>
    650651    </sect2>
     652   
     653    <sect2 id="data_api.platforms">
     654      <title>Experimental platforms</title>
     655
     656      <para>
     657         This section gives an overview of experimental platforms
     658         and how they are used to enable data storage in files
     659         instead of in the database.
     660      </para>
     661     
     662      <note>
     663        <title>THIS IS A DRAFT!</title>
     664        <para>
     665          This document is a draft currently being worked on!
     666          Changes are expected before the design is finalized.
     667        </para>
     668      </note>
     669     
     670      <sect3 id="data_api.platforms.uml">
     671        <title>UML diagram</title>
     672       
     673        <figure id="data_api.figures.platforms">
     674          <title>Experimental platforms</title>
     675          <screenshot>
     676            <mediaobject>
     677              <imageobject>
     678                <imagedata
     679                  fileref="figures/uml/datalayer.platforms.png" format="PNG" />
     680              </imageobject>
     681            </mediaobject>
     682          </screenshot>
     683        </figure>
     684      </sect3>
     685     
     686      <sect3 id="data_api.platforms.platforms">
     687        <title>Platforms</title>
     688       
     689        <para>
     690          The <classname>PlatformData</classname> holds information about a
     691          platform. A platform can have one or more <classname>PlatformVariant</classname>:s.
     692          Both the platform and variant are identified by a system ID that
     693          is fixed and can't be changed. <emphasis>Affymetrix</emphasis>
     694          and <emphasis>Illumina</emphasis> are examples of platforms.
     695          If the <varname>fileOnly</varname> flag is set data for the platform
     696          can only be stored in files and not imported into the database. If
     697          the flag is not set data can be imported into the database.
     698          The <varname>rawDataType</varname> can be used to lock the platform
     699          to a specific raw data type. If the value is <constant>null</constant>
     700          the platform can use any raw data type.
     701        </para>
     702       
     703        <para>
     704          Each platform and its variants can be connected to one or more
     705          <classname>FileSetMemberTypeData</classname> items. This item
     706          describes the kind of files that are used to hold data for
     707          the platform and/or variant. The file types are re-usable between
     708          different platforms and variants. Note that a file type may be attached
     709          to either only a platform or to a platform with a variant. File
     710          types attached to platforms are inherited by the variants. The variants
     711          can only define additional file types, not remove or redefine file types
     712          that have been attached to the platform.
     713        </para>
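        <para>
          As a rough illustration of the inheritance rule above, the following
          sketch computes the file types visible to a variant as the platform's
          file types plus the variant's own additions. The class and method names
          are hypothetical and are not part of the BASE API. The ID
          <constant>example.extra</constant> is made up for the example; the
          other two IDs come from the default platforms table.
        </para>

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; these names are NOT part of the BASE API.
final class FileTypeInheritance {

    /**
     * Returns the file type IDs visible to a variant: everything attached
     * to the platform plus whatever the variant adds on its own.
     */
    static Set<String> effectiveFileTypes(List<String> platformTypes,
                                          List<String> variantTypes) {
        // Types attached to the platform are always inherited.
        Set<String> result = new LinkedHashSet<>(platformTypes);
        // A variant can only add types; there is no way to remove one here.
        result.addAll(variantTypes);
        return result;
    }
}
```

        <para>
          Because the result is a set, a variant that repeats a file type already
          attached to the platform does not create a second entry.
        </para>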
     714        <para>
     715          The file type is also identified
     716          by a fixed, non-changeable system ID. The <varname>itemType</varname>
     717          property tells us what type of item the file holds data for (i.e.
     718          array design or raw bioassay). It also links to a <classname>FileType</classname>
     719          which is the generic type of data in the file. This allows us to query
     720          the database for, as an example, files with the generic type
     721          <constant>FileType.RAW_DATA</constant>. If we are in an Affymetrix
     722          experiment we will get the CEL file, for another platform we will
     723          get another file.
     724        </para>
     725
     726      </sect3>
     727     
     728      <sect3 id="data_api.platforms.files">
     729        <title>Files</title>
     730       
     731        <para>
     732          An item must implement the <interfacename>FileStoreEnabledData</interfacename>
     733          interface to be able to store data in files instead of in the database.
     734          The interface creates a link to a <classname>FileSetData</classname> object.
     735          In a file set it is only possible to store one file for each
     736          <classname>FileSetMemberTypeData</classname> item.
     737        </para>
     738       
     739      </sect3>
     740    </sect2>
    651741
    652742    <sect2 id="data_api.protocols">
     
    692782  </sect1>
    693783 
    694   <sect1 id="api_overview.core_api">
     784  <sect1 id="api_overview.core_api" chunked="1">
    695785    <title>The Core API</title>
     786   
    696787    <para>
    697       TODO
     788      This section gives an overview of various parts of the core API.
    698789    </para>
     790   
     791    <sect2 id="core_api.data_in_files">
     792      <title>Using files to store data</title>
     793      <note>
     794        <title>THIS IS A DRAFT!</title>
     795        <para>
     796          This document is a draft currently being worked on!
     797          Changes are expected before the design is finalized.
     798        </para>
     799      </note>
     800     
     801      <para>
     802        This section is about how BASE can use files to store data instead
     803        of importing it into the database. See <xref linkend="data_api.platforms" />
     804        for an overview of the database schema for this feature. Files can be attached
     805        to any item that implements the <interfacename>FileStoreEnabled</interfacename>
     806        interface. Currently this is <classname>RawBioAssay</classname>, <classname>ArrayDesign</classname>,
     807        <classname>BioAssaySet</classname> and <classname>BioAssay</classname>. The
     808        ability to store data in files is not a replacement for storing data in the
     809        database. It is possible (for some platforms/raw data types) to have data in
     810        files and in the database at the same time. We would have liked to enforce
     811        that (raw) data is always present in files, but this will not be backwards compatible
     812        with older installations, so there are three cases:
     813      </para>
     814     
     815      <itemizedlist>
     816      <listitem>
     817        <para>
     818        Data in files only
     819        </para>
     820      </listitem>
     821      <listitem>
     822        <para>
     823        Data in the database only
     824        </para>
     825      </listitem>
     826      <listitem>
     827        <para>
     828        Data in both files and in the database
     829        </para>
     830      </listitem>
     831      </itemizedlist>
     832     
     833      <para>
     834        Not all three cases are supported for all types of data. This is controlled
     835        by the <classname>Platform</classname> class, which may disallow
     836        storing data in the database. To check this, call
     837        <methodname>getRawDataType()</methodname>, which may return:
     838      </para>
     839     
     840      <itemizedlist>
     841      <listitem>
     842        <para>
     843          <constant>null</constant>: The platform can store data with any
     844          raw data type in the database.
     845        </para>
     846      </listitem>
     847      <listitem>
     848        <para>
     849        A <classname>RawDataType</classname> that has <code>isStoredInDb() == true</code>:
     850        The platform can store data in the database but only data with the specified raw
     851        data type.
     852        </para>
     853      </listitem>
     854      <listitem>
     855        <para>
     856        A <classname>RawDataType</classname> that has <code>isStoredInDb() == false</code>:
     857        The platform can't store data in the database.
     858        </para>
     859      </listitem>
     860      </itemizedlist>
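      <para>
        The three return cases listed above could be told apart roughly as in
        the sketch below. It uses a hypothetical stand-in class,
        <classname>RawDataTypeStub</classname>, so that it compiles on its own;
        only <methodname>isStoredInDb()</methodname> is taken from this
        document, and the enum and helper names are assumptions.
      </para>

```java
// Hypothetical stand-in for a raw data type; NOT the BASE RawDataType class.
final class RawDataTypeStub {
    private final boolean storedInDb;

    RawDataTypeStub(boolean storedInDb) {
        this.storedInDb = storedInDb;
    }

    boolean isStoredInDb() {
        return storedInDb;
    }
}

// Assumed names for the three cases described in the text.
enum DbStorage { ANY_RAW_DATA_TYPE, LOCKED_TO_RAW_DATA_TYPE, NOT_ALLOWED }

final class StorageRules {

    /** Maps the value returned by a platform's getRawDataType() to a case. */
    static DbStorage classify(RawDataTypeStub platformRawDataType) {
        if (platformRawDataType == null) {
            // No lock: any raw data type may be stored in the database.
            return DbStorage.ANY_RAW_DATA_TYPE;
        }
        return platformRawDataType.isStoredInDb()
            ? DbStorage.LOCKED_TO_RAW_DATA_TYPE  // database storage, one type only
            : DbStorage.NOT_ALLOWED;             // file-only platform
    }
}
```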
     861
     862      <para>
     863        One major modification is that the registration of raw data types
     864        has changed. The <filename>raw-data-types.xml</filename> file should
     865        only be used for raw data types that are stored in the database. The
     866        <sgmltag>storage</sgmltag> tag has been deprecated and BASE will ignore
     867        any raw data type definitions with <code>storage="file"</code>.
     868        To replace this, each <classname>Platform</classname> that
     869        can only store data in files also defines a "virtual" raw data type.
     870      </para>
     871     
     872      <sect3 id="core_api.data_in_files.diagram">
     873        <title>Diagram of classes and methods</title>
     874        <figure id="core_api.figures.data_in_files">
     875          <title>Store data in files</title>
     876          <screenshot>
     877            <mediaobject>
     878              <imageobject>
     879                <imagedata
     880                  fileref="figures/uml/corelayer.datainfiles.png" format="PNG" />
     881              </imageobject>
     882            </mediaobject>
     883          </screenshot>
     884        </figure>
     885      </sect3>
     886     
     887      <sect3 id="core_api.data_in_files.ask">
     888        <title>Asking the user for files</title>
     889
     890        <para>
     891          A client application must know what types of files it makes sense
     892          to ask the user for. In some cases, data may be split into more than
     893          one file so we need a generic way to select files.
     894        </para>
     895       
     896        <para>
     897          Given that we have a <interfacename>FileStoreEnabled</interfacename>
     898          item we use the <methodname>FileSetMemberType.getQuery()</methodname>
     899          method to find which file types can be used for that
     900          item. Internally, the <methodname>getQuery()</methodname>
     901          method uses the <methodname>FileStoreEnabled.getPlatform()</methodname>
     902          and <methodname>FileStoreEnabled.getVariant()</methodname>
     903          methods to restrict the query to only return file types for
     904          a given platform and/or variant. If the item doesn't have
     905          a platform or variant the query will only return file types
     906          that are associated with the given item type, but not with any specific
     907          platform. In any case, we get a list of <classname>FileSetMemberType</classname>
     908          items, each one representing a specific file type that
     909          we should ask the user about. Examples:
     910        </para>
     911
     912        <orderedlist>
     913        <listitem>
     914          <para>
     915          The <constant>Affymetrix</constant> platform defines <constant>CEL</constant>
     916          for <constant>FileType.RAW_DATA</constant>
     917          and <constant>CDF</constant> for <constant>FileType.REPORTER_MAP</constant>,
     918          respectively. If we have a
     919          <classname>RawBioAssay</classname> the query will only return
     920          the CEL file type and the client can ask the user for a CEL file.
     921          </para>
     922        </listitem>
     923        <listitem>
     924          <para>
     925          More examples.... ???
     926          </para>
     927        </listitem>
     928        </orderedlist>
     929     
     930        <para>
     931          Here is a simple code template that might be useful.
     932        </para>
     933       
     934        <programlisting>
     935DbControl dc = ...
     936FileStoreEnabled item = ...
     937ItemQuery&lt;FileSetMemberType&gt; query =
     938   FileSetMemberType.getQuery(item);
     939List&lt;FileSetMemberType&gt; types = query.list(dc);
     940// We now have a list of file types...
     941// ... ask the user to select a file for each one of them
     942</programlisting>
     943     
     944      </sect3>
     945     
     946      <sect3 id="core_api.data_in_files.link">
     947        <title>Link to the selected files</title>
     948        <para>
     949          When the user has selected the file(s) we must store the links
     950          to them in the database. This is done with a <classname>FileSet</classname>
     951          object. A file set can contain any number of files. The only limitation
     952          is that it can only contain one file for each file type.
     953          Call <methodname>FileSet.setMember()</methodname> to store
     954          a file in the set. If a file already exists for the given file type
     955          it is replaced, otherwise a new entry is created.
     956        </para>
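        <para>
          The replace-or-create behaviour described above can be pictured as a
          map keyed by file type. The sketch below is a simplified, hypothetical
          model and not the actual <classname>FileSet</classname> implementation;
          file types and files are reduced to plain strings, and the file paths
          are made up.
        </para>

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model of a file set; NOT the BASE FileSet class.
final class FileSetSketch {

    // One entry per file type ID, which enforces "one file for each file type".
    private final Map<String, String> membersByType = new LinkedHashMap<>();

    /**
     * Stores a file for the given file type. Returns the path that was
     * replaced, or null if a new entry was created.
     */
    String setMember(String fileTypeId, String path) {
        return membersByType.put(fileTypeId, path);
    }

    /** Returns the file stored for the given type, or null if there is none. */
    String getMember(String fileTypeId) {
        return membersByType.get(fileTypeId);
    }

    /** Number of files (and therefore file types) in the set. */
    int size() {
        return membersByType.size();
    }
}
```

        <para>
          Calling <methodname>setMember()</methodname> twice with the same file
          type leaves a single entry that points to the second file.
        </para>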
     957      </sect3>
     958     
     959      <sect3 id="core_api.data_in_files.validate">
     960        <title>Validate the file and extract metadata</title>
     961       
     962        <para>
     963          Validation and extraction of metadata is important since we want
     964          data in files to be equivalent to data in the database. The validation
     965          and metadata extraction is automatically done by the core when a
     966          file is added to a file set. The process is partly pluggable
     967          since each <classname>FileSetMemberType</classname> can name a class
     968          that should do the validation and/or metadata extraction.
     969          Here is the general outline:
     970        </para>
     971       
     972        <programlisting>
     973FileStoreEnabled item = ...
     974FileSetMemberType type = ...
     975File file = ...
     976FileSetMember member = new FileSetMember(file, type);
     977
     978FileValidator validator = type.getValidator();
     979MetadataReader metadata = type.getMetadataReader();
     980validator.setFile(member);
     981validator.setItem(item);
     982// Repeat for 'metadata' if not same as 'validator'
     983
     984validator.validate();
     985metadata.extractMetadata();
     986</programlisting>
     987       
     988        <note>
     989          <title>Only one instance of each validator class is created</title>
     990          <para>
     991          The validation/metadata extraction is not done until all files have been
     992          added to the fileset. If the same validator/meta data extractor is
     993          used for more than one file, the same instance is reused. That is,
     994          the <methodname>setFile()</methodname> is called one time
     995          for each file/file type pair. The <methodname>validate()</methodname>
     996          and <methodname>extractMetadata()</methodname> methods are only
     997          called once.
     998          </para>
     999        </note>
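        <para>
          The instance reuse described in the note might be organised roughly as
          follows. This is a sketch under stated assumptions: the classes
          <classname>CountingHandler</classname> and
          <classname>HandlerRegistry</classname> are hypothetical, files are
          reduced to path strings, and the core may well do this differently.
        </para>

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical handler that only records what it is asked to do.
class CountingHandler {
    final List<String> files = new ArrayList<>();
    int validateCalls = 0;

    void setFile(String path) {
        files.add(path);
    }

    void validate() {
        validateCalls++;
    }
}

// Hypothetical registry; NOT part of the BASE API.
final class HandlerRegistry {

    // One handler instance per handler class name.
    private final Map<String, CountingHandler> byClassName = new LinkedHashMap<>();

    /** Hands the file to the single instance registered for the class name. */
    CountingHandler register(String handlerClassName, String path) {
        CountingHandler handler =
            byClassName.computeIfAbsent(handlerClassName, k -> new CountingHandler());
        handler.setFile(path);
        return handler;
    }

    /** Called after all files have been added: one validate() per instance. */
    void validateAll() {
        for (CountingHandler handler : byClassName.values()) {
            handler.validate();
        }
    }
}
```

        <para>
          Two files registered under the same handler class name end up in the
          same instance, which then sees both files when it is validated.
        </para>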
     1000       
     1001        <para>
     1002          All validators and meta data extractors should extend
     1003          the <classname>AbstractFileHandler</classname> class. The reason
     1004          is that we may want to add more methods to the <interfacename>FileHandler</interfacename>
     1005          interface in the future. The <classname>AbstractFileHandler</classname> will
     1006          be used to provide default implementations for backwards compatibility.
     1007        </para>
     1008       
     1009      </sect3>
     1010     
     1011      <sect3 id="core_api.data_in_files.import">
     1012        <title>Import data into the database</title>
     1013       
     1014        <para>
     1015          This should be done by existing plug-ins in the same way as before.
     1016          A slight modification is needed since it is good if the importers
     1017          are made aware of already selected files in the <classname>FileSet</classname>
     1018          to provide good default values. Something like this.
     1019        </para>
     1020       
     1021        <programlisting>
     1022File defaultFile = null;
     1023RawBioAssay rba = ...;
     1024if (rba.hasFileSet())
     1025{
     1026   FileSet fileSet = rba.getFileSet();
     1027   List&lt;FileSetMember&gt; members =
     1028      fileSet.getMembers(FileType.RAW_DATA);
     1029   if (members.size() &gt; 0)
     1030   {
     1031      defaultFile = members.get(0).getFile();
     1032   }
     1033}       
     1034</programlisting>
     1035      </sect3>
     1036     
     1037      <sect3 id="core_api.data_in_files.experiments">
     1038        <title>Using raw data from files in an experiment</title>
     1039       
     1040        <para>
     1041          Just as before, an experiment is still locked to a single
     1042          <classname>RawDataType</classname>. This is a design issue that
     1043          would break too many things if changed. If data is stored in files
     1044          the experiment is also locked to a single <classname>Platform</classname>.
     1045          This has been designed to have as little impact on existing
     1046          plug-ins as possible. In most cases, the plug-ins will continue
     1047          to work as before.
     1048        </para>
     1049       
     1050        <para>
     1051          A plug-in (using data from the database) that needs to check if it can
     1052          be used within an experiment can still do:
     1053        </para>
     1054       
     1055        <programlisting>
     1056Experiment e = ...
     1057RawDataType rdt = e.getRawDataType();
     1058if (rdt.isStoredInDb())
     1059{
     1060   // Check number of channels, etc...
     1061   // ... run plug-in code ...
     1062}
     1063</programlisting>
     1064       
     1065        <para>
     1066          A newer plug-in which uses data from files should do:
     1067        </para>
     1068       
     1069        <programlisting>
     1070Experiment e = ...
     1071RawDataType rdt = e.getRawDataType();
     1072if (!rdt.isStoredInDb())
     1073{
     1074   Platform p = rdt.getPlatform();
     1075   PlatformVariant v = rdt.getVariant();
     1076   // Check that platform/variant is supported
     1077   // ... run plug-in code ...
     1078}
     1079</programlisting>
     1080       
     1081      </sect3>
     1082     
     1083    </sect2>
    6991084  </sect1>
    7001085