$Id: affymetrix.txt 2342 2006-05-30 22:47:50Z jari $

Affymetrix in BASE 2.0

This document summarizes discussions with several parties regarding
Affymetrix data storage in BASE.

The Affymetrix use cases we want to cover in BASE 2.0 are:

i)  Allow Affymetrix-specific file upload and data management, and
    allow easy creation of Affymetrix slides from an Affymetrix array
    design.
ii) Allow recording of probeset-level annotation.

Affymetrix data and file formats:

In what formats do Affymetrix CEL files come? We need to be able to
read them in a non-Windows environment. There are different versions of
the CEL file format. There are also EXP, RPT, and CHP files.

Affymetrix/BASE issues:

- BASE loads all raw results into a single table. This is a problem for
  Affymetrix data because Affymetrix chips will soon have ~6 million
  probes.
- Most of the freely available plug-ins/tools for Affymetrix expect the
  Affymetrix file(s) as input, and since those files are in an
  Affymetrix binary format, hacking these R packages to work on other
  input is not practical.
- Affymetrix has concepts such as probesets and match/mismatch oligos.

Affymetrix offers different technologies (with their implied
complications in BASE):

- Expression Array: these use the ProbeSet concept with
  perfect-match/mismatch oligos. The good news is that, so far, once
  you talk about probesets you talk about a reasonable number of
  objects, i.e. the hundreds of thousands of probes on the chip end up
  as ~30000 to 50000 probeset results. This means that probeset results
  could be uploaded into the raw data table and thus be available in
  the BASE analysis module. However, this hack has several drawbacks:
  a) The data is not really raw data but rather already transformed
     data.
  b) The probeset values depend on the algorithm used.
  c) Probesets have no position on the array.
  d) Affymetrix files are binary files so users can load them straight
     (we are using the Java AffyFusion API beta version to read
     Affymetrix files) ???
- Tiling Array: this is the "normal" situation, i.e. one feature = one
  probe = one position. Here the real problem is the number of probes
  (soon 6 million): they cannot realistically be loaded into the raw
  data table, and then it becomes impossible to get this data into the
  analysis module. Even if we could, the number of probes is still
  limiting when you think about an experiment having ~50 hybs
  (=> 300 million rows in the table at each transformation).
- ExonArray: a new kind of array for studying splicing. It combines the
  ProbeSet concept (though with only 4 probes per set) with a large
  number of probes on the chip, so the problems of both previous
  situations apply.

Conclusions from above remarks:

1) Loading Affymetrix probe values into the database is useless.
2) Plug-ins should be able to accept files as input.
3) BASE should be aware that raw data might not be stored in database
   tables but rather in files.
4) The dynamic part of BASE should also deal with data storage in files
   or database tables (or even a mix).

Clarification for items 3 and 4: The CEL files are uploaded to BASE's
file storage area (and will be kept as long as they are not deleted by
the user). However, when the probeset values are calculated and the
chips are normalized, these results end up in base2dynamic.

Tentative list of decisions:

A) Raw Affymetrix data will not be loaded into database tables.
   [Target milestone: now]
B) The Feature* classes/tables added in BASE 2 to allow for storage of
   raw Affymetrix data are deprecated and will be removed.
   [Target milestone: BASE 2.0RC2]
C) The CEL files will be the raw Affymetrix data in BASE 2. This will
   require some modifications to the core (rawdata related).
   [Target milestone: BASE 2.0RC2]
D) We have discussed allowing four different types of import, to be
   implemented in the following order:
   o) Probe information/chip design import (compare with reporter
      import) using CDF files. There are 52 CDF files matching 52
      designs.
   i) CEL-file import using a free C++ tool. The idea is (if possible)
      to convert this to Java and ship it with BASE 2.0.
      [Target milestone: BASE 2.0RC2]
   ii) Create a CEL-file importer using the available R packages. This
      will include normalization providing minimal gcrma support from
      within BASE, without other support, based on RMAExpress code.
      [Target milestone: BASE 2.0]
   iii) CHP-file import. Included for completeness, but may be fairly
      useless. No target release is set since import of CHP data will
      probably not be supported. [No target milestone]
   The current view is that the imported (non-raw) data should be
   stored in the dynamic tables, to allow other plug-ins to manipulate
   the probesets. Maybe we should in some cases store data in files?
   [Target milestone for last question: BASE 2.0+]?
E) extended-properties.xml is data-format dependent (as well as
   migration dependent when migrating). The default is 2-channel style.
   NuGO is to create a template for Affymetrix data.
   [Target milestone: BASE 2.0RC2]
   This means that the extended-properties.xml file should include
   Affymetrix-related annotation (e.g. Ensembl gene id, UniProt id,
   RefSeq id...).
F) Allow creation of a batch of slides in one go from any given
   Affymetrix design using an import function.
   [Target milestone: BASE 2.0RC2]
G) Allow upload and parsing of EXP and RPT files, loading relevant
   parameters (scaling factor...) into specific fields. Philippe is to
   document the EXP and RPT formats and the important tags.
   [Target milestone: ?]

Other remarks and links:

We have an Affymetrix data set available from
/home/lev/jari/projects/base/base2/testdata/expt_111_cel.zip. The
experiment is based on Affymetrix design MG_U74Av2.
The archive contains CEL files and a descriptive file formatted
according to the tab2mage specs
(http://tab2mage.sourceforge.net/docs/spreadsheet.html).

There is a resource for Affymetrix designs and annotations provided by
Bioconductor at
http://www.bioconductor.org/packages/data/annotation/stable/src/contrib/html/

RMAExpress is available at
http://www.stat.berkeley.edu/~bolstad/RMAExpress/RMAExpress.html

For item D.ii) One could use the annotation data package and use the
Reporter Import plugin. This might require some changes to the Reporter
table to make it suitable for Affymetrix annotation, or just a mapping
to some of the existing fields.

For item D.iii) Chip-to-chip normalization must be done, which is not
possible with .CHP files. This will require a plug-in (with
documentation). If R (Bioconductor) is to be used we should contact
Henrik Bengtsson, creator of a BASE 1.2 R dispatcher.

Future development:

Connectivity to R/Bioconductor, and development of specific functions
to support work with tiling arrays and SNP arrays. How to use an R
server? Do we want to use an R server?
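
Appendix (illustrative only): row-count estimate

The row counts quoted in the technology overview can be checked with a
back-of-the-envelope calculation. The Python sketch below is not BASE
code. It assumes that each transformation writes one table row per
feature per hybridization; the function name is hypothetical, and the
probeset figure is simply the upper end of the ~30000 to 50000 range
given above.

```python
def rows_per_transformation(features_per_chip: int, hybridizations: int) -> int:
    """Rows written by one transformation, assuming one row per feature per hyb."""
    return features_per_chip * hybridizations


HYBS = 50  # experiment size used in the tiling array discussion (~50 hybs)

estimates = {
    # Upper end of the ~30000 to 50000 probeset results per expression array.
    "expression array (probesets)": rows_per_transformation(50_000, HYBS),
    # ~6 million probes per chip, as expected for tiling arrays.
    "tiling array (probes)": rows_per_transformation(6_000_000, HYBS),
}

for label, rows in estimates.items():
    print(f"{label}: {rows:,} rows per transformation")
```

Under these assumptions the probeset-level data stays in the low
millions of rows per transformation, while probe-level tiling data
reaches the 300 million rows mentioned above, which is the motivation
for keeping raw probe values in files.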