Dear Assa,
you need to call prediction() with continuous predictions and a _binary_ true
class label.
Only you can tell whether the p-values are actually predictions and what the
class labels are. For readers of the list, p is just the name of some variable;
you did not say, even vaguely, what you are trying to classify, nor did you
explain what the columns are.
The only information we get from your table is that the p-value column holds
small, continuous values. From what I see, the p-values could just as well be
fitting errors of the predictions (e.g. expressed as the probability that the
similarity to the predicted class is random).
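A minimal sketch of such a call, under assumptions that only the original poster can confirm: that the p-value really is the classifier's continuous output, that a smaller p-value means a more confident hit, and that "Is Expected" is the true class. The data are the first seven rows of the table quoted below; the variable names are made up for illustration.

```r
library(ROCR)

# First seven rows of the quoted table (variable names are illustrative).
p.value     <- c(1.15e-05, 0.0173, 9.40e-08, 3.83e-06, 0.009, 0.3975, 0.0219)
is.expected <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)

# ROCR reads larger scores as "more positive", so the p-values are flipped.
# Assumption: a small p-value means a confident prediction.
score <- -log10(p.value)

pred <- prediction(score, is.expected)            # continuous score, binary label
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)                                        # ROC curve

auc <- performance(pred, measure = "auc")@y.values[[1]]
```

If either assumption is wrong (for instance if the p-values are fitting errors rather than predictions), the resulting curve is meaningless.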
Claudia
Assa Yeroslaviz wrote:
Dear Claudia,
thank you for your fast answer.
I am including the table of data again as an example.
Protein ID   Pfam Domain      p-value      Expected         Is        True      False     False     True
                                                            Expected  Positive  Negative  Positive  Negative
NP_000011.2  APH              1.15E-05     APH              TRUE      1         0         0         0
NP_000011.2  MutS_V           0.0173       APH              FALSE     0         0         1         0
NP_000062.1  CBS              9.40E-08     CBS              TRUE      1         0         0         0
NP_000066.1  APH              3.83E-06     APH              TRUE      1         0         0         0
NP_000066.1  CobU             0.009        APH              FALSE     0         0         1         0
NP_000066.1  FeoA             0.3975       APH              FALSE     0         0         1         0
NP_000066.1  Phage_integr_N   0.0219       APH              FALSE     0         0         1         0
NP_000161.2  Beta_elim_lyase  6.25E-12     Beta_elim_lyase  TRUE      1         0         0         0
NP_000161.2  Glyco_hydro_6    0.002        Beta_elim_lyase  FALSE     0         0         1         0
NP_000161.2  SurE             0.0059       Beta_elim_lyase  FALSE     0         0         1         0
NP_000161.2  SapB_2           0.0547       Beta_elim_lyase  FALSE     0         0         1         0
NP_000161.2  Runt             0.1034       Beta_elim_lyase  FALSE     0         0         1         0
NP_000204.3  EGF              0.004666118  EGF              TRUE      1         0         0         0
NP_000229.1  PAS              3.13E-06     PAS              TRUE      1         0         0         0
NP_000229.1  zf-CCCH          0.2067       PAS              FALSE     0         1         1         0
NP_000229.1  E_raikovi_mat    0.0206       PAS              FALSE     0         0         0         0
NP_000388.2  NAD_binding_1    8.21E-24     NAD_binding_1    TRUE      1         0         0         0
NP_000388.2  ABM              1.40E-08     NAD_binding_1    FALSE     0         0         1         0
NP_000483.3  MMR_HSR1         1.98E-05     MMR_HSR1         TRUE      1         0         0         0
NP_000483.3  DEAD             2.30E-05     MMR_HSR1         FALSE     0         0         1         0
NP_000483.3  APS_kinase       1.80E-09     MMR_HSR1         FALSE     0         0         1         0
NP_000483.3  CbiA             0.0003       MMR_HSR1         FALSE     0         0         1         0
NP_000483.3  CoaE             1.28E-07     MMR_HSR1         FALSE     0         0         1         0
NP_000483.3  FMN_red          4.61E-08     MMR_HSR1         FALSE     0         0         1         0
NP_000483.3  Fn_bind          0.3855       MMR_HSR1         FALSE     0         0         1         0
NP_000483.3  Invas_SpaK       0.2431       MMR_HSR1         FALSE     0         0         1         0
NP_000483.3  PEP-utilizers    0.127        MMR_HSR1         FALSE     0         0         1         0
NP_000483.3  NIR_SIR_ferr     0.1661       MMR_HSR1         FALSE     0         0         1         0
NP_000483.3  AAA              0.0031       MMR_HSR1         FALSE     0         0         1         0
NP_000483.3  DUF448           0.0021       MMR_HSR1         FALSE     0         0         1         0
NP_000483.3  CBF_beta         0.1201       MMR_HSR1         FALSE     0         0         1         0
NP_000483.3  zf-C3HC4         0.0959       MMR_HSR1         FALSE     0         0         1         0
NP_000560.5  ig               5.69E-39     ig               TRUE      1         0         0         0
NP_000704.1  Epimerase        4.40E-21     Epimerase        TRUE      1         0         0         0
NP_000704.1  Lipase_GDSL      6.63E-11     Epimerase        FALSE     0         0         1         0
...
This is a shortened list from one of the 10 lists I have for different
p-values.
As you can see, I have separate p-value experiments and probably need to
calculate a separate ROC for each of them. But I don't know how to
calculate these characteristics for the p-values.
How do I assign the predictions to each of the individual p-value experiments?
I would appreciate any help.
Thanks
Assa
On Tue, Aug 17, 2010 at 12:55, Claudia Beleites <cbelei...@units.it> wrote:
Dear Assa,
I am having a problem building a ROC curve with my data using the ROCR
package.
I have 10 lists of proteins such as attached (proteinlist.xls).
Your file didn't make it to the list.
Each of the lists was calculated with a different p-value.
The goal is to find the optimal p-value for the highest number of true
positives as well as the lowest number of false positives.
As far as I understood the explanations from the vignette of ROCR, my data
of TP and FP are the labels of the prediction function. But I don't know how
to assign the right predictions to these labels.
I assume the p-values are different cutoffs that you use for
"hardening" (= making yes/no predictions) from some soft (=
continuous class membership) output of your classifier.
Usually, ROCR calculates the curves as a function of the
cutoff/threshold itself from the continuous predictions. If you have
these soft predictions, let ROCR do the calculation for you.
If you don't have them, ROCR can calculate your characteristics
(sens, spec, precision, recall, whatever) for each of the p-values.
While you could combine the results "by hand" into a
ROCR-performance object and let ROCR do the plotting, it is then
probably easier if you plot directly yourself.
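Plotting directly could look roughly like the base R sketch below. It reuses the first seven rows of the table from this thread; the cutoffs are hypothetical stand-ins for the 10 p-value lists, and all variable names are illustrative. It also assumes that a hit counts as a positive call when its p-value is at or below the cutoff.

```r
# First seven rows of the table from this thread (names are illustrative).
p.value     <- c(1.15e-05, 0.0173, 9.40e-08, 3.83e-06, 0.009, 0.3975, 0.0219)
is.expected <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)

# Hypothetical cutoffs standing in for the separate p-value lists.
cutoffs <- c(1e-06, 1e-04, 0.01, 0.05, 1)

# One row per cutoff: confusion counts turned into sensitivity/specificity.
roc.points <- t(sapply(cutoffs, function(cut) {
  called <- p.value <= cut                # hard yes/no prediction at this cutoff
  tp <- sum(called & is.expected)
  fp <- sum(called & !is.expected)
  fn <- sum(!called & is.expected)
  tn <- sum(!called & !is.expected)
  c(cutoff = cut, sens = tp / (tp + fn), spec = tn / (tn + fp))
}))

# ROC plot: one point per cutoff.
plot(1 - roc.points[, "spec"], roc.points[, "sens"], type = "b",
     xlim = c(0, 1), ylim = c(0, 1),
     xlab = "1 - specificity", ylab = "sensitivity")
```

With the real data, the TP/FN/FP/TN columns of each list could be summed instead of recomputed, which gives the same kind of one-point-per-list plot.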
Don't be shy to look into the prediction and performance objects, I
find them pretty obvious. Maybe start with the objects produced by
the examples.
Also, note ROCR works with binary validation data only. If your data
has more than one class, you need to make two-class-problems first
(e.g. protein xy ./. not protein xy).
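As a rough illustration of such a split, the sketch below turns a multi-class column into a two-class label for one target class. The domain values are taken from the table in this thread; the choice of "APH" as target and the "not_" prefix are arbitrary, illustrative choices.

```r
# Pfam domains from the first rows of the table (multi-class).
domain <- c("APH", "MutS_V", "CBS", "APH", "CobU")

# Pick one class and lump everything else together.
target   <- "APH"
negative <- paste0("not_", target)

# Ordered factor with the negative class first, so that the class
# ordering is explicit when the labels are handed to ROCR.
label <- factor(ifelse(domain == target, target, negative),
                levels = c(negative, target), ordered = TRUE)
```

Repeating this for each class of interest gives one two-class problem per class.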
BTW, is there a way of finding the optimum in the curve? I mean to find the
exact value in the ROC curve (see sheet 2 in the Excel file for the ROC
curve).
Someone asked about the optimum on a ROC curve a couple of months ago; an
RSiteSearch on the mailing list for ROC and optimal or optimum should get you
answers.
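For orientation only, here is a sketch of one common criterion, Youden's index (sensitivity + specificity - 1), evaluated on a ROCR performance object. This is not necessarily what the earlier list discussion recommended, and what counts as "optimal" depends on how costly false positives and false negatives are. The data are the first seven rows of the table in this thread, with the assumption that small p-values mean confident predictions; variable names are illustrative.

```r
library(ROCR)

# First seven rows of the table in this thread (names are illustrative).
p.value     <- c(1.15e-05, 0.0173, 9.40e-08, 3.83e-06, 0.009, 0.3975, 0.0219)
is.expected <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)

# Assumption: smaller p-value = more confident, so flip to a score.
pred <- prediction(-log10(p.value), is.expected)
perf <- performance(pred, measure = "sens", x.measure = "spec")

# Youden's index at every cutoff ROCR evaluated.
youden <- perf@y.values[[1]] + perf@x.values[[1]] - 1
best   <- which.max(youden)

# Convert the best score cutoff back to the p-value scale.
best.p.cutoff <- 10^(-perf@alpha.values[[1]][best])
```

Here `best.p.cutoff` is the largest p-value that would still be called positive under this criterion.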
I would like to thank you in advance for any help.
You're welcome.
Claudia
--
Claudia Beleites
Dipartimento dei Materiali e delle Risorse Naturali
Università degli Studi di Trieste
Via Alfonso Valerio 6/a
I-34127 Trieste
phone: +39 0 40 5 58-37 68
email: cbelei...@units.it
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.