Boston RegressionΒΆ

The following code illustrates how TransmogrifAI can be used to do linear regression. We use Boston dataset to predict housing prices. The code for this example can be found here, and the data over here.

Define features

val rowId = FeatureBuilder.Integral[BostonHouse].extract(_.rowId.toIntegral).asPredictor
val crim = FeatureBuilder.RealNN[BostonHouse].extract(_.crim.toRealNN).asPredictor
val zn = FeatureBuilder.RealNN[BostonHouse].extract(_.zn.toRealNN).asPredictor
val indus = FeatureBuilder.RealNN[BostonHouse].extract(_.indus.toRealNN).asPredictor
val chas = FeatureBuilder.PickList[BostonHouse].extract(x => Option(x.chas).toPickList).asPredictor
val nox = FeatureBuilder.RealNN[BostonHouse].extract(_.nox.toRealNN).asPredictor
val rm = FeatureBuilder.RealNN[BostonHouse].extract(_.rm.toRealNN).asPredictor
val age = FeatureBuilder.RealNN[BostonHouse].extract(_.age.toRealNN).asPredictor
val dis = FeatureBuilder.RealNN[BostonHouse].extract(_.dis.toRealNN).asPredictor
val rad = FeatureBuilder.Integral[BostonHouse].extract(_.rad.toIntegral).asPredictor
val tax = FeatureBuilder.RealNN[BostonHouse].extract(_.tax.toRealNN).asPredictor
val ptratio = FeatureBuilder.RealNN[BostonHouse].extract(_.ptratio.toRealNN).asPredictor
val b = FeatureBuilder.RealNN[BostonHouse].extract(_.b.toRealNN).asPredictor
val lstat = FeatureBuilder.RealNN[BostonHouse].extract(_.lstat.toRealNN).asPredictor
val medv = FeatureBuilder.RealNN[BostonHouse].extract(_.medv.toRealNN).asResponse

Feature Engineering

val houseFeatures = Seq(crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat).transmogrify()

Modeling & Evaluation

val prediction = RegressionModelSelector
  .withCrossValidation(dataSplitter = Option(DataSplitter(seed = randomSeed)), seed = randomSeed)
  .setRandomForestSeed(randomSeed)
  .setGradientBoostedTreeSeed(randomSeed)
  .setInput(medv, houseFeatures)
  .getOutput()

val workflow = new OpWorkflow().setResultFeatures(prediction)

val evaluator = Evaluators.Regression().setLabelCol(medv).setPredictionCol(prediction)

def runner(opParams: OpParams): OpWorkflowRunner =
  new OpWorkflowRunner(
    workflow = workflow,
    trainingReader = trainingReader,
    scoringReader = scoringReader,
    evaluationReader = Option(trainingReader),
    evaluator = Option(evaluator),
    scoringEvaluator = None,
    featureToComputeUpTo = Option(houseFeatures)
  )

You can run the code using the following commands for train, score and evaluate:

cd helloworld
./gradlew compileTestScala installDist

Train

./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.boston.OpBoston -Dargs="\
--run-type=train \
--model-location=/tmp/boston-model \
--read-location BostonHouse=`pwd`/src/main/resources/BostonDataset/housing.data"

Score

./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.boston.OpBoston -Dargs="\
--run-type=score \
--model-location=/tmp/boston-model \
--read-location BostonHouse=`pwd`/src/main/resources/BostonDataset/housing.data \
--write-location=/tmp/boston-scores"

Evaluate

./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.boston.OpBoston -Dargs="\
--run-type=evaluate \
--read-location BostonHouse=`pwd`/src/main/resources/BostonDataset/housing.data \
--write-location=/tmp/boston-eval \
--model-location=/tmp/boston-model \
--metrics-location=/tmp/boston-metrics"