Correlations: P53 Isoform Biosignatures vs Biomedical Parameters

Werner Van Belle¹^* - werner@yellowcouch.org, werner.van.belle@gmail.com

1- Bioinformatics Group Norut IT; Research Park; 9294 Tromsø; Norway

Abstract : This deliverable compares 132 biological parameters against P53 isoforms on 2D gels of AML and ALL patients.

Keywords: P53 AML ALL 2DGel Correlation
Reference: Werner Van Belle; Correlations: P53 Isoform Biosignatures vs Biomedical Parameters; 93 Gb; February 2007

Disclaimer: This is a report I wrote on the analysis I performed regarding P53 correlations against biomedical parameters. The report is presented here to show what can be achieved with this technique and what a typical workflow looks like. Although the money in general comes directly or indirectly from the taxpayer, we honour the current practice in biomedical science and do not present the data online, except for the FAB classifications, which have been previously published. If you have inquiries regarding the technique please contact werner@yellowcouch.org. If you have inquiries regarding P53 itself, AML or ALL then please contact bjorn.gjertsen@med.uib.no.

Introduction

Correlation can be measured between any two numerical data sets. In our case, the first data set is a set of images, and the second data set is a set of biomedical parameters. We used 8 different image-sets.

Gel1: is an image set containing the first gels made by Ingvild
Gel2: is an image set containing the second gels made by Ingvild
Gel1&2: are all the images made by Ingvild
Old: are all the images made by Nina, excluding the X-ray images.
Films: are all the X-ray images made by Nina
AllWithoutFilms: is an image set containing all of Ingvild's and Nina's gels, excluding the X-ray images.
AllWithFilms: is an image set containing all of Ingvild's and Nina's gels, including the X-ray images. Since the acquisition process differs between gels and X-ray films, I believe these should be treated as unreliable.
OldWithFilms: is an image set containing all of Nina's gels, including the X-ray images. Since the acquisition process differs between gels and X-ray films, I believe these should be treated as unreliable. This is the analysis as done before.

Next to this collection of image sets, we also have a collection of parameter-sets. Each parameter-set is correlated to each of the 8 image sets. Of course, the limited overlap between the new and the old parameters limits the usability of the combined sets. Where a parameter is available for both sets, it must be ensured that the new and old images can be pooled together (meaning that Ingvild and Nina must have followed exactly the same procedures). The image sets and parameters can be found in Report/imagesets-and-parameters.xls (Excel format), Report/imagesets-and-parameters.ods (OpenOffice format) and Report/imagesets-and-parameters.html. Those files are based on the spreadsheets as provided by Nina (AML-Chemokine-cluster.xls, Cell paper data_Master Sheet.xls, Mastersheet_Nina.xls) and Ingvild (Patient numbers.doc).

Below is a list of parameters which link to HTML files. Each parameter entry gives the image count, states whether it is based on original or normalized images, and shows the actual image. PDF files are also available. In the PDF files, each image lists the number of gel images used in the top left corner. The image count is given for every correlation image because it indicates how reliable the result is: results produced with too few images might be biased by the little information available and must be treated with caution.

The full movies and images are directly available in the directories <parameter>/<parameter>-vs-old[-ns]. For instance, age/age-vs-old contains the correlations between the age and the old image set. The directory age/age-vs-old-ns contains the same correlations, but based on normalized images.

2D images: the color reflects the correlation as presented in the key next to the image. The intensity of an area reflects the combined mask based on the standard deviation and the significance of that area. This is the same mask as used in the Bioinformatics paper. A total overview of all images is available as a PDF in everything.pdf.

3D images: these use the same color code as the key contained in the 2D image. The height of a position reflects the combined mask of standard deviation and significance but, unlike in the 2D images, the correlation strength itself is integrated as well.

The correlation measure used is the Spearman rank-order correlation. Correlations were calculated for both the original images and the scale-normalized images. The normalisation translates the median of the grey values to 0 and scales the standard deviation of the area to 1. The normalisation factors are determined from the area contained within the white rectangle. All measures outside the white rectangle have a correct correlation but the significance might be too low. We filtered the data so that outliers outside the white boundary could not remove the information contained within the white rectangle. In the 3D images these areas are clipped.
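The normalisation and correlation steps described above can be illustrated with a minimal sketch. It is not the program used for this report. It assumes images are plain lists of rows of grey values, that the white rectangle is given as a hypothetical (top, left, bottom, right) tuple, that the population standard deviation is meant, and that the rectangle is not uniformly grey. All function names are illustrative.

```python
import math
from statistics import median, pstdev


def normalize(image, rect):
    """Shift the rectangle's median to 0 and scale its standard deviation to 1.

    The factors come from the rectangle only but are applied to the whole image.
    """
    top, left, bottom, right = rect  # hypothetical rectangle layout
    inside = [v for row in image[top:bottom] for v in row[left:right]]
    centre, spread = median(inside), pstdev(inside)
    return [[(v - centre) / spread for v in row] for row in image]


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average
        start = end + 1
    return result


def pearson(x, y):
    """Pearson correlation; NaN when one of the inputs is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return math.nan
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def correlation_image(images, parameter):
    """Correlate, per pixel position, the grey values across images with the parameter."""
    rows, cols = len(images[0]), len(images[0][0])
    return [
        [spearman([img[r][c] for img in images], parameter) for c in range(cols)]
        for r in range(rows)
    ]
```

Here images[i] is the aligned gel image of the patient whose parameter value is parameter[i]. Each pixel position then yields one correlation value, which is what the 2D and 3D images visualise.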

Not available. See disclaimer above.
Demonstrator: fabm

Overlaid Gel Images

The overlay images are provided for your information. The overlay was based on input given by Nina regarding the alpha and delta positions. We further added the antibody spot and information based on the ladder. All these pseudo-calibration points were used in combination with a pair-wise alignment algorithm to maximize the overlap between any two gels. Every overlay image contains a red and a blue component. The overlay of all gel images is in Overlays/overlay.png. The following table contains the image in the red channel and the overlay in the blue channel. Note: For technical reasons the images are flipped over the y-axis.
Not available. See disclaimer above

Gel Images

The gel images below are in 16-bit .TIF format as we received them for analysis.
Not available. See disclaimer above

Raw 24-bit correlation data

The correlation program saves its data to disk in three files: the correlation image, the standard deviation image and the significance image. These can be combined as necessary to obtain a proper visualisation (as done above). The data format for those images is a losslessly compressed PNG in which the three RGB bytes together express one quantity ranging from the minimal value to the maximal value. For instance, if we have a correlation ranging from -0.4 to 0.3 then this range is linearly mapped to numbers ranging from 0 to 16777215. The minimal and maximal values are listed in a separate file (Report/minmaxcor.*, available as .html, .ods, .xls and .csv), which must accompany the image before it is useful. The values listed in this file are extrema, meaning that possibly only 1 pixel in the entire image has this value; we therefore caution against using this file for any interpretation of the results. The directory raw24bit-cors contains the images for all of the calculated correlations.
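The linear mapping described above can be sketched as follows. This is only an illustration of the principle: the byte order (red as most significant byte) and the rounding are assumptions, since the report does not specify them, and the function names are hypothetical.

```python
MAX_24BIT = 0xFFFFFF  # 16777215


def encode(value, lo, hi):
    """Map a value in [lo, hi] linearly to a 24-bit integer, split into (R, G, B)."""
    n = round((value - lo) / (hi - lo) * MAX_24BIT)
    # Assumption: red holds the most significant byte, blue the least significant.
    return (n >> 16) & 0xFF, (n >> 8) & 0xFF, n & 0xFF


def decode(rgb, lo, hi):
    """Recover the quantity from an (R, G, B) triple and the extrema of the image."""
    r, g, b = rgb
    n = (r << 16) | (g << 8) | b
    return lo + n / MAX_24BIT * (hi - lo)
```

With the example range from the text, encode(-0.4, -0.4, 0.3) gives (0, 0, 0) and encode(0.3, -0.4, 0.3) gives (255, 255, 255). Decoding requires the same lo and hi, which is why the minmaxcor file must accompany each image.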

Potential areas of Improvement

The correlation process was run for each image set with the images 'as they are', and once more for each image set after normalizing the images. This normalisation process has been described in detail in the BMC Bioinformatics paper and seems to work relatively well. However, a scientific approach should aim for the best techniques available, and normalisation does not necessarily outclass actual measurement. It would, for instance, have been very useful to improve on the following points:

we were unable to investigate the influence of the capturing process (exposure time, gain, etc.) on the correlations. Kodak did not cooperate on this matter and our inquiry regarding their data format went unanswered.
we were unable to assess the influence of the gel running time since this parameter was not recorded.
gels were not always positioned at the same place, which is bad since we know that the position on the plate has a strong impact on the results.
the camera seems unfocused and filthy. It should probably have been cleaned before acquiring the images.
calibration points were insufficient. The ladder to the left is too far out of the p53 scope to be useful. The actin spot could have been a useful position, if it was somehow related to the 2D gel images.

Part of the reason we redid these experiments was to gain a better understanding of these factors. It is a pity that little effort in that direction has been realized. Aside from this, there remain 3 other points that might be of interest:

1. Bias Through Gel Removal

Removing gels that are 'bad' is not good. Typically those gels have a very low signal or no signal at all. By removing them, one influences the resulting correlations.

2. Incomplete Measurements (still living is not dead)

It is scientifically unsound to include 'clipped' parameters into the set. For instance, the 'age' parameter includes still living patients. These are marked with >48. It is incorrect to include this value as 48. Instead this value should be removed since we do not yet have full information on the variable. Where we were aware of such conduct we included both the set as we received it and one with only those fully known ages.

3. Unknown is not zero

It is scientifically unsound to replace an 'unknown' with 0. This seems to have occurred in numerous places. Where we were aware of it, we cloned the parameter set and filtered out the zeros. This has happened in the survival parameter, for instance.
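The two filters described in points 2 and 3 can be sketched as below. This is a minimal illustration, assuming a parameter set is a mapping from a hypothetical patient identifier to the value as it appears in the spreadsheet. The function names are illustrative and are not those of the actual analysis scripts.

```python
def drop_censored(parameter):
    """Keep only fully known values; drop entries recorded as a bound such as '>48'."""
    kept = {}
    for patient, raw in parameter.items():
        text = str(raw).strip()
        if text.startswith((">", "<")):
            continue  # only a bound is known, so the value is left out
        kept[patient] = float(text)
    return kept


def drop_zeros(parameter):
    """Clone the parameter set without the entries equal to 0.

    Assumption: in this parameter a 0 stands for 'unknown'. A genuine
    measurement of 0 would be removed as well.
    """
    return {p: v for p, v in parameter.items() if float(v) != 0.0}
```

Both functions return a new mapping and leave the original untouched, so the set as received and the filtered set can each be correlated against the images.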

Resource Usage

The computer resources necessary for this analysis were slightly larger than anticipated and it might be interesting to shed some light on execution times. All times are expressed on a 2.8 GHz Intel Pentium 4 processor with 1Gb of memory.

creating the fine-tuned overlay alignment: 72h
computing all the correlations: 85.55h, which produced 5.8 Gb of raw data.
rendering of the images: at 5 hours per image, with 1416 images: 7080h.

With one machine this would take approximately 300 days to finish. Luckily we had the following hardware available:

A PowerEdge Dell workhorse with 4 Dual-Core AMD Opteron(tm) Processors 8218; Cache: 1024 Kb; Memory: 1608.
1 Intel(R) Pentium(R) 4 CPU 2.80GHz; Cache: 512 KB; Memory: 1004
2 Intel(R) Pentium(R) D CPU 3.00GHz; Cache: 1024 KB; Memory: 1006
2 AMD Athlon(tm) 64 Processors 3200+; Cache: 512 Kb; Memory: 1012
1 Intel(R) Pentium(R) 4 CPU 2.00GHz; Cache: 512 Kb; Memory: 1011

In a joint effort, these machines have been rendering for 30 days.
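The single-machine estimate can be checked from the figures listed above. The short calculation below only restates that arithmetic; the variable names are illustrative.

```python
# Figures taken from the list of execution times above.
alignment_hours = 72.0
correlation_hours = 85.55
rendering_hours = 5 * 1416  # 5 hours per image, 1416 images

total_hours = alignment_hours + correlation_hours + rendering_hours
total_days = total_hours / 24          # a little over 301 days on one machine
rendering_days = rendering_hours / 24  # rendering alone, on one machine

print(f"total: {total_hours:.2f} h, about {total_days:.1f} days")
print(f"rendering alone: {rendering_days:.1f} days")
```

Rendering dominates the total, which explains why spreading only that step over the available machines was enough to bring the wall-clock time down to about 30 days.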

http://werner.yellowcouch.org/
werner@yellowcouch.org