The data linked to in this page is intended to help answer the question:
How reliable is a given prediction?
However, the question that it directly (albeit empirically) answers is:
Given that a protein localizes to site A, what is the probability that it has a predicted utility of some value u or more for site A?
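As a rough illustration of this empirical question, the fraction of site-A proteins whose predicted utility for site A is at least u can be computed directly from a list of predicted utilities. The function name and the utility values below are hypothetical and are not taken from WoLF PSORT's dataset.

```python
def fraction_at_or_above(utilities, u):
    """Empirical estimate of P(predicted utility >= u | protein localizes to site A).

    `utilities` holds the predicted utility for site A of each protein
    that is known to localize to site A.
    """
    if not utilities:
        raise ValueError("need at least one predicted utility")
    return sum(1 for x in utilities if x >= u) / len(utilities)


# Hypothetical predicted utilities for site A (illustration only).
site_a_utilities = [0.9, 0.7, 0.4, 0.8, 0.2]
print(fraction_at_or_above(site_a_utilities, 0.7))
```

Three of the five hypothetical proteins have a predicted utility of 0.7 or more, so the estimate here is 3/5.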
I borrowed the word utility from decision theory, with the intention that WoLF PSORT could eventually predict the utility of believing that a protein localizes to a given site. This utility combines the probability that the protein localizes to each site with the (in general non-uniform) cost of the various possible mistakes. For example, when one is only interested in predicting secreted proteins, one could consider predicting a mitochondrial protein to be a nuclear protein to be a "right" answer, since both sites are non-secreted.
In practice I have only used this functionality as one way to address proteins with multiple localizations, by lightly penalizing misclassifications between related sites (localization classes). For example, predicting nuclear for a protein with dual localization to the cytoplasm and nucleus is treated as more acceptable than predicting mitochondria for that protein. For details please see our APBC06 Paper.
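The decision-theoretic idea described above can be sketched as an expected-utility calculation. This is only a minimal sketch: the site list, the utility table, and the probabilities are hypothetical values chosen for illustration, and it is not the computation WoLF PSORT actually performs (see the APBC06 paper for that).

```python
SITES = ["nuclear", "cytoplasmic", "mitochondrial"]

# UTILITY[predicted][true]: 1.0 for a correct call, a lightly penalized 0.8
# for confusing the related nuclear/cytoplasmic sites, 0.0 otherwise.
# All values are hypothetical.
UTILITY = {
    "nuclear":       {"nuclear": 1.0, "cytoplasmic": 0.8, "mitochondrial": 0.0},
    "cytoplasmic":   {"nuclear": 0.8, "cytoplasmic": 1.0, "mitochondrial": 0.0},
    "mitochondrial": {"nuclear": 0.0, "cytoplasmic": 0.0, "mitochondrial": 1.0},
}


def expected_utility(predicted, site_probs):
    """Expected utility of predicting `predicted`, given P(true site) for each site."""
    return sum(p * UTILITY[predicted][true] for true, p in site_probs.items())


# Hypothetical localization probabilities for one protein.
probs = {"nuclear": 0.45, "cytoplasmic": 0.35, "mitochondrial": 0.20}
best = max(SITES, key=lambda s: expected_utility(s, probs))
print(best, expected_utility(best, probs))
```

With these made-up numbers, predicting nuclear earns credit both from the nuclear probability and, at a small discount, from the cytoplasmic probability, so it scores higher than predicting mitochondrial.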
Near the bottom of this page there is a table containing graphs like the
following pair for each localization site.
The left-hand graph is a histogram plotting the observed frequency of predicted utilities for lysosome, for proteins which actually localize there (light bars) vs. proteins which localize to other sites.
The right-hand graph is a smoothed curve representing the probability that a protein localizes to the lysosome, given that it has a particular predicted utility for lysosome, under the assumption that the prior probability that a protein localizes to the lysosome is equal to the proportion of lysosome proteins in WoLF PSORT's dataset.
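The relationship between the two graphs can be sketched with Bayes' rule applied to a single histogram bin. The function name and all numbers below are hypothetical, and the sketch applies no smoothing, whereas the curve on this page is smoothed by a method not described here.

```python
def posterior_site_given_bin(freq_site, freq_other, prior):
    """P(protein localizes to the site | predicted utility falls in this bin).

    freq_site:  fraction of the site's proteins whose utility falls in the bin
    freq_other: fraction of all other proteins whose utility falls in the bin
    prior:      assumed prior, e.g. the proportion of the site's proteins
                in the dataset
    """
    numerator = freq_site * prior
    denominator = numerator + freq_other * (1.0 - prior)
    if denominator == 0.0:
        raise ValueError("no proteins observed in this bin")
    return numerator / denominator


# Hypothetical counts: 50 lysosome proteins in a dataset of 1000 proteins.
prior = 50 / 1000
# Hypothetical bin frequencies read off a histogram like the left-hand graph.
print(posterior_site_given_bin(freq_site=0.30, freq_other=0.02, prior=prior))
```

Even when a bin is far more common among the site's own proteins than among others, a small prior keeps the posterior well below 1, which is why the choice of prior matters when reading the right-hand graph.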
By inspecting these graphs, one may draw some conclusions. For example
Each row in this table represents a localization site, with links to statistics which were computed on proteins labeled with that localization site in the dataset. Note that in some cases it may be useful to look at sites other than the predicted site, since the predicted site may not be the true site. The numbers give the count of proteins with that localization site in the dataset.
Copyright (C) National Institute of Advanced Science and Technology (AIST), Computational Biology Research Center (CBRC). All Rights Reserved.