source: plugins/base2/net.sf.basedb.illumina/tags/1.3/README_SNP

Last change on this file was 941, checked in by Martin Svensson, 12 years ago

References #174 license text should referr to the Illumina plug-in package


Copyright (C) 2008
This file is part of Illumina plug-in package for BASE.
Available at http://baseplugins.thep.lu.se/
BASE main site: http://base.thep.lu.se/
This is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 3
of the License, or (at your option) any later version.
The software is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with BASE. If not, see <http://www.gnu.org/licenses/>.

Introduction

This file contains only information that is specific to Illumina SNP data. For general information, or for information about expression data, see the README file.

Illumina SNP raw data files

The SNP raw data files are created with BeadStudio and may contain multiple samples. The file should be saved as a tab-separated text file. The first line is the header line, which contains the column names. Some columns are specific to each sample; others are common columns that are valid for all samples. Sample-specific columns are named with the sample name, followed by a dot, followed by a generic column name. For example, in UC199_B.GType, UC199_B is the sample name and GType is the generic column name. The following table lists the columns that are required by the plug-ins in this package.

Column          Column type   Example value
Address         Common        830575
GenTrain Score  Common        0,8607027
GType           Sample        BB
Log R Ratio     Sample        0,1801754
B Allele Freq   Sample        1

If any of these columns is missing, the plug-ins may not function correctly. Additional columns, both common and sample-specific, may be present in the data file. When the import plug-in parses the input file, it splits it into one file per sample. Each new file includes all common columns and all sample columns for that sample. The column headers in the new files include only the generic column name, without the sample name prefix.
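The split described above can be sketched in a few lines of Python. This is only an illustration and not the plug-in's actual implementation. It assumes that a column is sample-specific whenever its name contains a dot and that generic column names never contain one. The function name `split_by_sample`, the second sample `UC200_A`, and its values are made up for the example.

```python
import csv
import io


def split_by_sample(text):
    """Split a multi-sample, tab-separated export into one table per sample.

    Assumption: a column is sample-specific if its name contains a dot, and
    the generic column name is the part after the last dot.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, data = rows[0], rows[1:]

    common = []   # indexes of the common columns
    samples = {}  # sample name -> list of (column index, generic column name)
    for i, name in enumerate(header):
        sample, dot, generic = name.rpartition(".")
        if dot:
            samples.setdefault(sample, []).append((i, generic))
        else:
            common.append(i)

    tables = {}
    for sample, cols in samples.items():
        indexes = common + [i for i, _ in cols]
        # New header: common names unchanged, sample names without the prefix.
        new_header = [header[i] for i in common] + [g for _, g in cols]
        tables[sample] = [new_header] + [[row[i] for i in indexes] for row in data]
    return tables


# Two samples; UC200_A and its values are made up for illustration.
EXPORT = (
    "Address\tGenTrain Score\tUC199_B.GType\tUC199_B.Log R Ratio\t"
    "UC200_A.GType\tUC200_A.Log R Ratio\n"
    "830575\t0,8607027\tBB\t0,1801754\tAB\t-0,05\n"
)

for sample, table in split_by_sample(EXPORT).items():
    print(sample, table)
```

Each resulting table starts with a header row that holds the common column names followed by the generic names of that sample's columns.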

Illumina SNP manifest files

The SNP manifest files are comma-separated text files that contain information about the probes on a specific SNP array, including gene symbol, probe sequence, and so on. In BASE, the manifest files are used to create array designs that describe the probe content of a specific SNP array.

A manifest file is composed of two sections named Heading and Assay. The first section is the Heading section. It is preceded by a row containing the text [Heading]. The Heading section contains general information, including the number of SNPs described in the file. See below for an example of the Heading section.

[Heading]
Descriptor File Name(s),HumanCNV370v1_C.bpm
Assay Format,Infinium
SNP Count,370404

Following the Heading section is the Assay section, which is preceded by a row containing the text [Assay]. The first row of the Assay section, i.e., the row after [Assay], contains the column header for the Assay section. The next section shows an example of the Assay columns and how the information in the manifest file is mapped to BASE.
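To make the two-section layout concrete, here is a minimal Python sketch of a reader for it. It is not the parser used by the plug-ins. The function name `read_manifest` is hypothetical, and the [Assay] header in the sample text is shortened to a few columns for readability.

```python
import csv
import io


def read_manifest(text):
    """Return (heading, assay_header, assay_rows) from manifest text."""
    heading = {}         # key/value pairs from the [Heading] section
    assay_header = None  # column names: the first row after [Assay]
    assay_rows = []      # one dict per probe row
    section = None
    for row in csv.reader(io.StringIO(text)):
        if not row or not any(cell.strip() for cell in row):
            continue  # skip empty lines
        first = row[0].strip()
        if first.startswith("[") and first.endswith("]"):
            section = first[1:-1]  # e.g. "Heading" or "Assay"
            continue
        if section == "Heading":
            heading[first] = row[1].strip() if len(row) > 1 else ""
        elif section == "Assay":
            if assay_header is None:
                assay_header = [cell.strip() for cell in row]
            else:
                assay_rows.append(dict(zip(assay_header, row)))
    return heading, assay_header, assay_rows


# Shortened sample: a real manifest has more [Assay] columns.
MANIFEST = """[Heading]
Descriptor File Name(s),HumanCNV370v1_C.bpm
Assay Format,Infinium
SNP Count,370404
[Assay]
IlmnID,Name,IlmnStrand,SNP,AddressA_ID,Chr,MapInfo
rs10000010-126_B_F_IFB1153208421:0,rs10000010,Bot,[T/C],900010475,4,21227772
"""

heading, assay_header, assay_rows = read_manifest(MANIFEST)
print(heading["SNP Count"], assay_rows[0]["Name"])
```

A value such as [T/C] in the SNP column is not mistaken for a section marker, because only the first cell of a row is checked.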

Mapping reporter/control annotations from SNP manifest files to BASE

The table below shows how the columns in the [Assay] section of the manifest file are mapped to reporter annotations in BASE. Annotations in <brackets> are new annotations defined in the illumina-extended-properties.xml file. Columns marked with - are not mapped to BASE.

Manifest column    BASE reporter annotation   Example value
IlmnID             External ID                rs10000010-126_B_F_IFB1153208421:0
Name               Name                       rs10000010
IlmnStrand         <Ilmn strand>              Bot
SNP                <SNP>                      [T/C]
AddressA_ID *      -                          900010475
AlleleA_ProbeSeq   Sequence
AddressB_ID        -
AlleleB_ProbeSeq   -
Chr                Chromosome                 4
MapInfo            <Start position>           21227772
Ploidy             -                          2
Species            Species                    Homo sapiens
CustomerStrand     -                          BOT
IllumicodeSeq      -
TopGenomicSeq      -
* The AddressA_ID is not a reporter annotation. It is used to identify the probe on an array design. Its value is found in the Address column in the raw data files and is used to find the reporter.

The column mappings for the [Assay] section can be changed by modifying the existing import configuration or creating a new configuration.
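As an illustration of the default mapping in the table above, the following Python sketch applies it to one [Assay] row. It only mirrors the table and does not reproduce the format of a BASE import configuration. The names `ASSAY_TO_ANNOTATION` and `to_reporter` are hypothetical. A changed mapping can be tried by passing a different dictionary.

```python
# Manifest column -> BASE reporter annotation, copied from the table above.
# Columns marked with "-" in the table are left out.
ASSAY_TO_ANNOTATION = {
    "IlmnID": "External ID",
    "Name": "Name",
    "IlmnStrand": "Ilmn strand",
    "SNP": "SNP",
    "AlleleA_ProbeSeq": "Sequence",
    "Chr": "Chromosome",
    "MapInfo": "Start position",
    "Species": "Species",
}


def to_reporter(assay_row, mapping=ASSAY_TO_ANNOTATION):
    """Return (address, annotations) for one [Assay] row given as a dict.

    The address (AddressA_ID) is returned separately because it is not a
    reporter annotation; it identifies the probe on the array design.
    """
    annotations = {
        target: assay_row[source]
        for source, target in mapping.items()
        if source in assay_row
    }
    return assay_row.get("AddressA_ID"), annotations


# Example row built from the example values in the table above.
row = {
    "IlmnID": "rs10000010-126_B_F_IFB1153208421:0",
    "Name": "rs10000010",
    "IlmnStrand": "Bot",
    "SNP": "[T/C]",
    "AddressA_ID": "900010475",
    "Chr": "4",
    "MapInfo": "21227772",
    "Ploidy": "2",
    "Species": "Homo sapiens",
}

address, annotations = to_reporter(row)
print(address, annotations)
```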

Getting started

  1. Install this package as described by the instructions in the INSTALL file.
  2. Import reporter annotations. You will need one or more SNP manifest files for this.
    • Upload the manifest file(s) to BASE.
    • Go to the View -> Reporters menu.
    • Click on the Import button.
    • Use the auto-detect function or select the Illumina SNP reporter importer plug-in.
    • Select the manifest file.
    • Finish the job registration and wait for the plug-in to complete.
    • Repeat this one time for each manifest file.
  3. Create array designs. You will need one array design for each SNP manifest file.
    • Go to the Array LIMS -> Array designs menu.
    • Click on the New button.
    • Choose the Illumina/SNP platform.
    • We recommend that you give the array design the same name as the manifest file.
    • Switch to the Data files tab and select the manifest file.
    • Click on Save.
    • Repeat this for each manifest file.
  4. Import raw data. You will need a SNP raw data file.
    • Upload the file to BASE.
    • Go to the View -> Experiments page and create a new Experiment.
    • Select the SNP platform for the experiment.
    • Save the experiment and then click on the newly created experiment in the list.
    • Click on the Import button.
    • Use the auto-detect function or select the Illumina SNP raw data importer plug-in.
    • Select the raw data file.
    • Select one of the array designs created in step 3.
    • Finish the job registration and wait for the plug-in to complete.
    • Repeat this if you have more raw data files.

Tip! Steps 1-3 only need to be done once for a BASE installation. If more than one user is going to use the Illumina package, we recommend that the array designs created in step 3, and the associated manifest files, are shared with the appropriate users, for example, the Everyone group.

Analyzing SNP data

The first step is to create a root bioassayset. To do this:

  1. Go to the "Bioassay sets" tab of your experiment.
  2. Click on the "New root bioassayset" button.
  3. This should start the "Illumina SNP root bioassayset creator" plug-in.
  4. You must tell it which raw data sets to use.
  5. You may also have to specify the character set and/or the decimal separator used in your data files.
  6. Finish the job registration and wait for the plug-in to complete.

The above procedure creates a root bioassayset, which means that data from the files is imported into the database. BASE can only store data as numeric values in a predetermined number of "channels". The number of channels for SNP data is 3, which means that 3 data columns can be imported. In addition, the Address column is imported as the 'position' value. This allows plug-ins used later in the analysis to extract other columns directly from the data files by finding the row whose Address value equals the position.

Note! This position->Address relation is guaranteed to be correct only for bioassay sets living in the same "data cube" as the root bioassay set. During the analysis, other plug-ins may decide to create a new "data cube", re-arrange the position numbers and break the mapping.
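The position-to-Address lookup can be sketched as follows. This is an assumption about how a downstream plug-in might do it, not code from this package. It reads a per-sample, tab-separated file (given here as text) and indexes the rows by Address. The function names `index_by_address` and `extra_column` are hypothetical.

```python
import csv
import io


def index_by_address(text):
    """Index the rows of a per-sample, tab-separated file by Address."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {int(row["Address"]): row for row in reader}


def extra_column(index, position, column):
    """Return the raw text value of `column` for the row whose Address
    equals `position`, or None if there is no such row or column."""
    row = index.get(position)
    return None if row is None else row.get(column)


# Per-sample file built from the example values earlier in this document.
SAMPLE_FILE = (
    "Address\tGenTrain Score\tGType\tLog R Ratio\tB Allele Freq\n"
    "830575\t0,8607027\tBB\t0,1801754\t1\n"
)

index = index_by_address(SAMPLE_FILE)
print(extra_column(index, 830575, "GenTrain Score"))
```

The lookup is only meaningful while the position values still equal the Address values, that is, within the same data cube as the root bioassay set.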

The table below shows how data from the file are imported into the database.

Column          Imported to
Address         position
GType           ch(1): AA=1.0, AB=0.0, BB=-1.0, Other values=null
Log R Ratio     ch(2)
B Allele Freq   ch(3)
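The conversion in the table above can be written out as a short Python sketch. It is an illustration only, not the plug-in's code. The names `GTYPE_TO_CH1`, `parse_number` and `to_channels` are hypothetical. The default decimal separator is a comma because the example values in this document use one.

```python
# Genotype call -> numeric value stored in ch(1); other values become None.
GTYPE_TO_CH1 = {"AA": 1.0, "AB": 0.0, "BB": -1.0}


def parse_number(value, decimal_separator=","):
    """Parse a number written with the given decimal separator.

    Returns None for an empty value.
    """
    value = value.strip()
    if not value:
        return None
    return float(value.replace(decimal_separator, "."))


def to_channels(row, decimal_separator=","):
    """Convert one per-sample data row (a dict) to position and channels."""
    return {
        "position": int(row["Address"]),
        "ch1": GTYPE_TO_CH1.get(row["GType"]),
        "ch2": parse_number(row["Log R Ratio"], decimal_separator),
        "ch3": parse_number(row["B Allele Freq"], decimal_separator),
    }


# Example row using the example values from the raw data table.
row = {
    "Address": "830575",
    "GType": "BB",
    "Log R Ratio": "0,1801754",
    "B Allele Freq": "1",
}
print(to_channels(row))
```

For files that use a point as the decimal separator, pass `decimal_separator="."`.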

Tip! The installation program has created 3 formulas: GType=ch(1), Log R Ratio=ch(2) and B Allele Freq=ch(3). The formulas can be used instead of the channel numbers when displaying or plotting data. This makes no real difference, except that the formula names are used in column headers, etc., instead of the generic channel numbers.
