[
https://issues.apache.org/jira/browse/SPARK-12757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15638546#comment-15638546
]
Felix Cheung commented on SPARK-12757:
--------------------------------------
I'm seeing the same with the latest master when running a pipeline with GBTClassifier:
{code}
WARN Executor: 1 block locks were not released by TID = 7:
[rdd_28_0]
{code}
To reproduce, run the code sample from the ML programming guide:
{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(data)
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(data)
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a GBT model.
val gbt = new GBTClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures")
.setMaxIter(10)
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels)
// Chain indexers and GBT in a Pipeline.
val pipeline = new Pipeline()
.setStages(Array(labelIndexer, featureIndexer, gbt, labelConverter))
// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)
{code}
> Use reference counting to prevent blocks from being evicted during reads
> ------------------------------------------------------------------------
>
> Key: SPARK-12757
> URL: https://issues.apache.org/jira/browse/SPARK-12757
> Project: Spark
> Issue Type: Improvement
> Components: Block Manager
> Reporter: Josh Rosen
> Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> As a pre-requisite to off-heap caching of blocks, we need a mechanism to
> prevent pages / blocks from being evicted while they are being read. With
> on-heap objects, evicting a block while it is being read merely leads to
> memory-accounting problems (because we assume that an evicted block is a
> candidate for garbage-collection, which will not be true during a read), but
> with off-heap memory this will lead to either data corruption or segmentation
> faults.
> To address this, we should add a reference-counting mechanism to track which
> blocks/pages are being read in order to prevent them from being evicted
> prematurely. I propose to do this in two phases: first, add a safe,
> conservative approach in which all BlockManager.get*() calls implicitly
> increment the reference count of blocks and where tasks' references are
> automatically freed upon task completion. This will be correct but may have
> adverse performance impacts because it will prevent legitimate block
> evictions. In phase two, we should incrementally add release() calls in order
> to fix the eviction of unreferenced blocks. The latter change may need to
> touch many different components, which is why I propose to do it separately
> in order to make the changes easier to reason about and review.
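
The two-phase plan above can be sketched as a small reference counter. This is an illustrative sketch only, not the actual BlockManager API: the class and method names (BlockRefCounter, acquire, release, isEvictable) are hypothetical. In phase one every get*() would call acquire() and task completion would release any leftover pins; in phase two explicit release() calls would be added at the call sites, and the eviction path would consult the count before dropping a block.

{code}
import scala.collection.mutable

// Hypothetical sketch: tracks per-block read pins so eviction can be
// refused while a block is being read.
class BlockRefCounter {
  private val counts = mutable.Map.empty[String, Int].withDefaultValue(0)

  // In phase one, every BlockManager.get*() would call this implicitly.
  def acquire(blockId: String): Unit = synchronized {
    counts(blockId) += 1
  }

  // Added incrementally in phase two; also invoked on task completion
  // to free any references the task did not release itself.
  def release(blockId: String): Unit = synchronized {
    counts(blockId) -= 1
    if (counts(blockId) <= 0) counts -= blockId
  }

  // The eviction path would check this before dropping a block.
  def isEvictable(blockId: String): Boolean = synchronized {
    counts(blockId) == 0
  }
}
{code}

Under this scheme the warning above ("1 block locks were not released by TID = 7") would correspond to the task-completion cleanup freeing a pin (here, on rdd_28_0) that no explicit release() call had freed.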
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)