DNA microarrays can simultaneously measure the expression levels of thousands of genes within a particular mRNA sample. A huge amount of useful information is available on this topic. I do not describe how microarray data are generated; I concentrate only on the most useful information for analyzing microarray data.

What is a microarray

In recent years, a new type of experiment has been changing the way that biologists and other specialists analyze many problems. These are called high-throughput experiments, and the main difference from those performed some years ago is the quantity of data obtained from them. Thanks to the technology known generically as microarrays, it is now possible to study in a single experiment the behavior of all the genes of an organism under different conditions. Microarray technology evolved from Southern blotting, where fragmented DNA is attached to a substrate and then probed with a known gene or fragment. The data generated by these experiments may comprise thousands to millions of variables, and they pose many challenges to the scientists who have to analyze them. Many types of microarrays have been developed to answer different biological questions, and some of them will be explained later. For the sake of simplicity we start with the best known ones: expression microarrays.

Where can you get microarray data

First, you need datasets to analyze. There are two main sources of microarray data. One is to generate the data in your own lab or through a commercial service provider (our lab has generated microarray data for pig iPS cells through the service of CapitalBio Corporation). The other is to obtain the data from microarray databases. Microarray databases fall into two distinct classes:

1. A peer-reviewed, public repository that adheres to academic or industry standards and is designed to be used by many analysis applications and groups. Good examples are the Gene Expression Omnibus (GEO) from NCBI and ArrayExpress from EBI.

2. A specialized repository associated primarily with the brand of a particular entity (lab, company, university, consortium, group), an application suite, a topic, or an analysis method, whether it is commercial, non-profit, or academic.

The two most used databases are:

a. ArrayExpress - Database of gene expression and other microarray data at the European Bioinformatics Institute (EBI), located on the Wellcome Genome Campus in Hinxton, near Cambridge, UK.

b. Gene Expression Omnibus - Database of gene expression data from NCBI (National Center for Biotechnology Information). Includes a definable search by different characteristics.

What can you learn by analyzing microarray datasets

Microarrays are useful in a wide variety of studies with a wide variety of objectives.

1. A typical microarray experiment is one that looks for genes differentially expressed between two or more conditions, that is, genes which behave differently in one condition (for instance healthy, untreated, or wild-type cells) than in another (for instance tumor, treated, or mutant cells). These are known as class comparison experiments.

2. When the emphasis is on developing a statistical model that can predict to which class a new individual belongs, we have a class prediction problem. Examples are predicting the response to a treatment (e.g. classes are responder and non-responder) or the course of a disease (e.g. relapsed or cured).

3. Sometimes the objective is the identification of novel sub-types of individuals within a population. For example it has been shown that certain types of leukemia present some subclasses that are very hard to distinguish morphologically but which can be classified using gene expression. This is an example of class discovery.

4. Pathway Analysis studies are those that try to find genes whose co-regulation reflects their participation in common or related biochemical processes.

How to design the microarray experiment

First and foremost, design your experiment carefully [1]. The main distinction among platforms is whether essentially full-length transcripts are printed onto slides (cDNA microarrays) or the desired, typically shorter, oligonucleotides are synthesized in situ (oligonucleotide arrays). While the former may be less expensive because they can be manufactured in the lab or at institutional core facilities, the latter may outperform the former in terms of the number of spots per array and the spots' homogeneity [2]. Slightly different oligonucleotide array platforms are manufactured by companies such as Affymetrix (Figure 1), Agilent, and NimbleGen (Table 1).

Figure 1 Two Affymetrix chips. A match is shown at bottom left for size comparison.

Table 1 The differences between three commercial oligonucleotide array platforms.

| | Affymetrix | Roche NimbleGen | Agilent |
|---|---|---|---|
| | Photolithography with masks | Maskless photolithography | Inkjet technology |
| Number of features | | | |
| Arrays per chip | | 1x385k, 4x72k, 12x135k | 1x1M, 2x400k, 4x180k, 8x60k |
| Length of probes | | | |
| Labeling of nucleotides | | | |
| Probes per gene | 4 to 11 probes per probe-set | 2 to 12* | 2 to 50* |
| Custom design of microarrays | Prohibitively expensive | | |

*assuming 20k targets

A major design question is whether to measure the expression levels from each sample on a different microarray (using single-color, or single-channel, arrays), or instead to compare relative expression levels between a pair of samples on each microarray (two-color or two-channel arrays). There are tradeoffs between the two approaches. Single-color arrays allow for more flexibility in analysis, while two-color arrays can control for some technical issues by allowing a direct comparison in a single hybridization [3]. A recent comparison of single- and two-color methods on the same platforms found good overall agreement in the data produced by the two methods.

Figure 2 Comparison of single- and two-color methods.

As attractive as it might seem financially to run just one microarray for each “class” of samples (of the same phenotype, time-point, or tissue type) under consideration, replicates are essential for providing meaningful results [4]. Without replicates, no statistical analysis of the significance and reliability of the observed changes is possible; the typical result is an increased number of both false-positive and false-negative errors in detecting differentially expressed genes [5]. However, we distinguish between technical and biological replicates. Technical replication, in which the same biological material is hybridized several independent times, is generally no longer performed, as analyses have shown that the results will be relatively consistent overall [2], although they may include consistent sources of bias [1]. During the experimental design stage, it is important to identify all the variables to be compared and to ensure that the proposed design allows their measurement. Be aware of other variables, such as the date of sample collection or the sex of the animals, that might confound the distinction between the compared classes. One option is to randomize confounding variables related to experimental conditions under your control.

How to analyze microarray data if you are not a professional statistician

The task of analyzing microarray data is often at least as much an art as a science, and it typically consumes considerably more time than the laboratory protocols required to generate the data. Part of the challenge is assessing the quality of the data and ensuring that all samples are comparable for further analysis.


Figure 3 Median-standardized results for porcine iPS cells from different labs.

1. Normalization

Normalization of the raw data, which controls for technical variation between arrays within a study, is essential. The challenge of normalization is to remove as much of the technical variation as possible while leaving the biological variation untouched. This is a big challenge, and here we only touch upon the main issues. First, visualization of the raw data is an essential part of assessing data quality, choosing a normalization method, and estimating the effectiveness of the normalization. There are a number of reasons why data must be normalized, including unequal quantities of starting RNA, differences in labeling or detection efficiencies between the fluorescent dyes used, and systematic biases in the measured expression levels. Conceptually, normalization is similar to adjusting expression levels measured by northern analysis or quantitative reverse transcription PCR (RT–PCR) relative to the expression of one or more reference genes whose levels are assumed to be constant between samples. Many methods for visualization, quality assessment, and data normalization have been developed [6].

There are three normalization methods that we use most frequently: mean-signal (or “scaling”), spike-in normalization, and quantile normalization. Scaling is the simplest normalization method, in which each microarray’s average expression level is set to the same value. Though minimal, it avoids over-normalization, and may be effective if the samples to be compared can be expected to have similar average expression levels (e.g., they come from the same tissues and developmental stages, have similar mRNA quality, etc.). The Affymetrix data analysis software allows for automatic array scaling [7]; samples can also be scaled manually using Excel, R, MATLAB, or similar software.
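As an illustration of mean-signal scaling, here is a minimal Python sketch using only the standard library. The function name `scale_arrays`, the dictionary layout, and the intensity values are assumptions made for this example; they do not come from any vendor software.

```python
from statistics import mean


def scale_arrays(arrays, target=None):
    """Mean-signal scaling: rescale each array so its mean equals `target`.

    `arrays` maps an array name to a list of raw intensities (hypothetical
    layout). If `target` is None, the mean of the per-array means is used.
    """
    array_means = {name: mean(values) for name, values in arrays.items()}
    if target is None:
        target = mean(list(array_means.values()))
    return {
        name: [v * target / array_means[name] for v in values]
        for name, values in arrays.items()
    }


# Made-up intensities: array_B has the same profile but is twice as bright.
raw = {
    "array_A": [100.0, 200.0, 300.0],
    "array_B": [200.0, 400.0, 600.0],
}
scaled = scale_arrays(raw)
print(scaled)
```

After scaling, both arrays share the same mean, so the purely technical brightness difference between them disappears while the relative ordering of probes within each array is unchanged.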

A more exacting method runs an internal control on each array by adding exogenous RNA of known quantities to the RNA sample. These so-called “spike-in” methods can be valuable for comparing arrays where one cannot expect all samples to have the same average expression levels [8]. Agilent provides spike-in RNAs for this purpose [9]; Affymetrix incorporates suitable controls on their commercial arrays [8]. Spike-in controls are particularly helpful when comparing samples with inherently different transcriptomes; for example, a developmental time-course ranging from oocytes to blastocyst, which have vastly different amounts of mRNA. In cases where spike-in controls are not necessary, or for pre-existing data where they were not part of the experiment, quantile normalization [10] is our current favorite among many popular normalization methods. More strict than mean- or median-scaling, quantile normalization aligns the entire expression distributions on each array [10]. Thus, the most highly-expressed value is set to be the same across arrays, as is the next most highly-expressed, and so on. RMA, another popular method, builds on this approach [11]; RMAexpress [12] offers an easy way to apply this method to Affymetrix arrays (Figure 3).
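The idea behind quantile normalization can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the implementation used by RMA or any Bioconductor package: the function name `quantile_normalize` and the sample values are assumptions, and ties are broken by probe position rather than averaged.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equal-length arrays (lists of intensities).

    Each value is replaced by the mean, across arrays, of the values that
    share its rank. Ties are broken by position, which is a simplification.
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # Reference distribution: mean of the k-th smallest value over all arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n_probes)
    ]

    normalized = []
    for a in arrays:
        # order[k] is the index of the probe holding the k-th smallest value.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized


# Made-up intensities for three arrays with four probes each.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
print(normalized)
```

Every normalized array ends up with exactly the same set of values; only the assignment of those values to probes differs between arrays, which is what aligning the entire distribution means.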

In addition to normalization of features across arrays, it has also been shown to be important to account for intra-array noise derived from such sources as dust on the slide, background surface variation, or scanning device biases. This general error of measurement appears to vary greatly between datasets [13]. Several statistical models have been recently developed to mitigate these effects at the probe-level using model-based estimates [14-17] and are integrated in most of the common normalization packages described here. The length to which researchers have gone to account for noise in microarray experiments highlights the importance of collecting many replicates to avoid being misled by noisy data.

Related issues of background adjustment and data “summarization” (reducing multiple probes representing a single transcript to a single measurement of expression) for Affymetrix arrays are well introduced in chapter 2 of [10].


Figure 4 Clustering result for iPS cells derived from different labs.

2. Clustering

Clustering is a way of finding and visualizing patterns in the data. Many papers and indeed books have been written on this topic (Figure 4) [11–13]. Different methods highlight different patterns, so trying more than one method can be worthwhile. Note that while clustering finds predominant patterns in the data, those patterns may not correspond to the phenotypic distinction of interest in the experiment. To identify gene expression patterns related to this distinction, more directed methods are appropriate.

There are at least as many clustering methods available as there are data sets, and nearly every expression analysis toolkit includes some clustering or data visualization tools. We suggest trying a couple of different clustering methods to look at broad patterns in each data set and to ensure that such patterns don’t correlate too obviously with any possible confounding variables (for example, we’ve seen samples cluster based on which lab technician performed the hybridization protocol!). However, if the clusters simply don’t make much sense, that’s fine; just proceed with your analysis. Free clustering modules for R and MATLAB are available online. For users without the programming skills to work in these environments, the GenePattern software package from the Broad Institute provides a full, freely available data analysis pipeline that includes a variety of clustering methods and data preparation tools [18]. Other clustering methods we like (though not all include software) include the GeneSignature method [19], the two-way clustering method of Getz et al. [20], and the quality clustering method of Heyer et al. [21, 22].
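To make the idea of sample clustering concrete, here is a small standard-library Python sketch of average-linkage hierarchical clustering with a correlation-based distance. It is only one of the many possible methods and is not the algorithm of any package cited above; the function names, sample names, and expression values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_linkage(samples, n_clusters):
    """Merge the two closest clusters until `n_clusters` remain.

    `samples` maps a sample name to its expression profile. The distance
    between samples is 1 - Pearson r; the distance between clusters is the
    mean of all pairwise sample distances.
    """
    names = list(samples)
    dist = {
        (a, b): 1.0 - pearson(samples[a], samples[b])
        for a in names for b in names if a != b
    }
    clusters = [[n] for n in names]

    def linkage(c1, c2):
        return sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Made-up log-expression values for four genes in four samples.
profiles = {
    "iPS_lab1": [8.1, 2.0, 7.9, 1.5],
    "iPS_lab2": [7.8, 2.3, 8.2, 1.2],
    "fibroblast_1": [2.1, 7.5, 1.9, 8.0],
    "fibroblast_2": [1.8, 8.1, 2.2, 7.7],
}
clusters = average_linkage(profiles, n_clusters=2)
print(clusters)
```

With these invented values the samples group by cell type. On real data it is worth checking the same grouping against the lab, batch, or hybridization date to spot confounding.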

3. Selecting differentially expressed genes


Figure 5 Scatter plots depicting gene expression fold changes between paired cell types.

A wide range of methods to adjust for multiple testing are available [23]. Many rely on the assumption that the tests are independent, whereas we know that genes’ expression patterns are heavily dependent on each other (Figure 5). The preferred approach for microarray analysis is to control the “false-discovery rate” (FDR), the expected proportion of false positives among the findings declared significant [24]. To better account for the dependencies within the data, we suggest multiple testing adjustment using “permutation-based” methods, which estimate the null distribution by permuting the actual data. If that is not feasible, the Benjamini-Hochberg step-up method [25] offers a reasonable combination of statistical rigor and power for microarray analysis.
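The Benjamini-Hochberg adjustment is simple enough to sketch directly. The following standard-library Python example computes adjusted p-values; the function name `benjamini_hochberg` and the five p-values are assumptions for illustration, and in practice the adjustment functions in packages such as multtest or limma would normally be used instead.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # order[k] is the index of the (k+1)-th smallest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for five genes.
p = [0.001, 0.008, 0.039, 0.041, 0.20]
q = benjamini_hochberg(p)
significant = [i for i, value in enumerate(q) if value <= 0.05]
print(q, significant)
```

Genes whose adjusted value falls at or below the chosen FDR level (0.05 here) are reported. Note that two genes with raw p-values just under 0.05 no longer pass after adjustment.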

The Bioconductor [26] software project offers freely available, customizable, and frequently updated packages for identifying differential expression. We recommend the multtest [27] and limma [28,29] packages in particular; both include FDR adjustment methods for multiple testing.

Issues in choosing a differential expression method are nicely explained in [30]. Users of the multtest package can choose among several parametric methods (which make assumptions about the normality of the data), including the Welch t-test, paired t-test, or ANOVA [31]. All of these look for differences in the average expression level between groups. Since assumptions about normality are often inappropriate [30], the reported p-values are more appropriately used as a guide to prioritizing the genes, not as accurate probabilities, even after adjusting for multiple testing. Multtest also supports the non-parametric Wilcoxon method [27]. While more appropriate for non-normally distributed data, the power of this test is often significantly weaker, so important differential expression may be missed by this approach. Sophisticated variance-estimation methods may be helpful when comparing small numbers of samples [32-34]. Non-programmers in particular may appreciate the SAM [34] Excel plug-in for finding differential expression, which adjusts for multiple testing, includes tools for handling time-course data, and is available at no cost to non-commercial users.
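For a single gene, the Welch t statistic and a permutation-based p-value that avoids the normality assumption can be sketched as follows. This is a standard-library Python illustration, not the multtest or SAM implementation; the function names, the number of permutations, the seed, and the expression values are all assumptions.

```python
import random
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch t statistic for two groups with possibly unequal variances."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value estimated by randomly relabeling the samples."""
    rng = random.Random(seed)
    observed = abs(welch_t(x, y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        t = welch_t(pooled[:len(x)], pooled[len(x):])
        if abs(t) >= observed:
            hits += 1
    # The add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)


# Made-up log-expression values for one gene in two groups of five samples.
treated = [9.1, 8.7, 9.4, 8.9, 9.3]
control = [7.2, 7.9, 7.5, 7.0, 7.7]
t_stat = welch_t(treated, control)
p_value = permutation_p_value(treated, control)
print(t_stat, p_value)
```

With five samples per group there are only 252 distinct ways to split the samples, so the smallest achievable two-sided p-value is about 0.008. This is one concrete reason why small experiments yield p-values that are better used for ranking genes than as precise probabilities.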

The GenePattern analysis pipeline software [18] is another freely available, stand-alone option that can help non-programmers identify individually differentially expressed genes using parametric tests and FDR adjustment. GenePattern also performs sample classification using a number of standard machine learning methods, making it a good entry-point for biologists who want to classify new samples.

4. Functional analysis of gene lists

For the R/Bioconductor user, the GOstats package [35] may be helpful in identifying over-represented GO terms among lists of differentially expressed genes. Onto-Express [36,37] is a stand-alone tool that also finds significant GO term enrichment. However, GO terms are neither the only nor the best source of functional annotation available. In fact, GO terms are assigned on a gene-by-gene or paper-by-paper basis, but not explicitly by pathway or function [38], so they are often ill-suited for defining functionally-related gene sets. We have become increasingly fond of the web-based DAVID tools [39,40] for their ease of use and their large, customizable selection of available pathway databases.
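Over-representation of an annotation term in a gene list is commonly assessed with a hypergeometric (one-sided Fisher) test. The standard-library Python sketch below shows the calculation; it is not claimed to reproduce the exact procedure of GOstats, Onto-Express, or DAVID, and the function name and the counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_universe, n_annotated, n_selected, n_overlap):
    """Probability of seeing at least `n_overlap` annotated genes by chance.

    n_universe:  genes measured on the array
    n_annotated: genes in the universe carrying the term
    n_selected:  genes in the differentially expressed list
    n_overlap:   selected genes that carry the term
    """
    max_overlap = min(n_annotated, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Made-up counts: 20 genes measured, 5 carry the term, 4 selected, 3 overlap.
p_enrich = enrichment_p_value(20, 5, 4, 3)
print(p_enrich)
```

Because many terms are tested at once, the resulting p-values need the same kind of multiple-testing adjustment as the gene-level tests.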

An alternative is gene set analysis. GSEA is the gold standard here. It comes in freely-available open-source Java, web-based, and R implementations, so it can be used by non-programmers (with a slight learning curve for formatting the data and understanding the program options), and can also be incorporated into R analysis pipelines or tweaked by programmers willing to dive into the code internals. Permutation-based adjustment for multiple testing is performed. Ranking the genes still relies on an individual test for differential expression, but several choices are provided in the software, or one can upload a ranked gene list. Extensions to the basic GSEA methodology are becoming plentiful; several are reviewed in [41].
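The running-sum idea behind gene set analysis can be illustrated with an unweighted, Kolmogorov-Smirnov-like enrichment score. This standard-library Python sketch is a deliberate simplification and not the GSEA software itself: GSEA as commonly used weights each step by the gene's ranking statistic and assesses significance by permutation, neither of which is done here. The function name and gene identifiers are hypothetical, and the ranked list is assumed to contain no duplicates.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for a ranked gene list.

    Walking down the list, the running sum rises at genes in the set and
    falls otherwise; the score is the largest deviation from zero.
    """
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += 1.0 / n_hit if gene in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical genes ranked from most up-regulated to most down-regulated.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
score_top = enrichment_score(ranked, {"g1", "g2"})
score_bottom = enrichment_score(ranked, {"g5", "g6"})
print(score_top, score_bottom)
```

A set concentrated at the top of the ranking gives a score near +1, and one concentrated at the bottom gives a score near -1. To judge significance, the score would be compared with scores obtained after permuting the sample labels or the gene ranking.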

1. Yang YH, Speed T: Design issues for cDNA microarray experiments. Nat Rev Genet 2002, 3(8):579-588.

2. Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, de Longueville F, Kawasaki ES, Lee KY et al: The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol 2006, 24(9):1151-1161.

3. Altman NS, Hua J: Extending the loop design for two-channel microarray experiments. Genet Res 2006, 88(3):153-163.

4. Kerr MK, Churchill GA: Experimental design for gene expression microarrays. Biostatistics 2001, 2(2):183-201.

5. Lee ML, Kuo FC, Whitmore GA, Sklar J: Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations. Proc Natl Acad Sci U S A 2000, 97(18):9834-9839.

6. Quackenbush J: Microarray data normalization and transformation. Nat Genet 2002, 32 Suppl:496-501.

7. Affymetrix: Affymetrix Expression Console Software Version 1.0 User Guide. In. Santa Clara, CA: Affymetrix, Inc.; 2006.

8. Hill AA, Brown EL, Whitley MZ, Tucker-Kellogg G, Hunter CP, Slonim DK: Evaluation of normalization procedures for oligonucleotide array data based on spiked cRNA controls. Genome Biol 2001, 2(12):RESEARCH0055.

9. Zahurak M, Parmigiani G, Yu W, Scharpf RB, Berman D, Schaeffer E, Shabbeer S, Cope L: Pre-processing Agilent microarray data. BMC Bioinformatics 2007, 8:142.

10. Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19(2):185-193.

11. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003, 4(2):249-264.


13. Albers CJ, Jansen RC, Kok J, Kuipers OP, van Hijum SA: SIMAGE: simulation of DNA-microarray gene expression data. BMC Bioinformatics 2006, 7:205.

14. Li C, Wong WH: Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci U S A 2001, 98(1):31-36.

15. Karakach TK, Wentzell PD: Methods for estimating and mitigating errors in spotted, dual-color DNA microarrays. Omics 2007, 11(2):186-199.

16. Rocke DM, Durbin B: A model for measurement error for gene expression arrays. J Comput Biol 2001, 8(6):557-569.

17. Brody JP, Williams BA, Wold BJ, Quake SR: Significance and statistical errors in the analysis of DNA microarray data. Proc Natl Acad Sci U S A 2002, 99(20):12975-12978.

18. Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP: GenePattern 2.0. Nat Genet 2006, 38(5):500-501.

19. Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N: Revealing modular organization in the yeast transcriptional network. Nat Genet 2002, 31(4):370-377.

20. Getz G, Levine E, Domany E: Coupled two-way clustering analysis of gene microarray data. Proc Natl Acad Sci U S A 2000, 97(22):12079-12084.

21. Heyer LJ, Kruglyak S, Yooseph S: Exploring expression data: identification and analysis of coexpressed genes. Genome Res 1999, 9(11):1106-1115.

22. Baugh LR, Hill AA, Slonim DK, Brown EL, Hunter CP: Composition and dynamics of the Caenorhabditis elegans early embryonic transcriptome. Development 2003, 130(5):889-900.
