Spark code and results #28

KiprasKancys · 2016-07-28T10:44:09Z

No description provided.

vkuznet · 2016-07-28T22:31:41Z

Popularity/Kipras/pyspark/pyspark_ml.py

+    "Load, add label col and convert data into label and feature type dataframe (needed for MLlib)"
+
+    lines = ctx.textFile(fname)
+    parts = lines.map(lambda l: l.split(","))


Kipras, Spark SQL has a split function (pyspark.sql.functions.split) which will be much faster than the Python one. Try searching for it and look up examples of its use.

We can use the Spark split function, but a few things need to change. We need to start with a data file without headers. Later on, however, I want to join the 'dataset' column with the prediction results, so I will need to know at least which column was the 'dataset' column.
I can do all of that, but we have to test whether it will actually be faster.

There is another option: use the Spark split function on the training data (which grows over the weeks and is much bigger, and where we don't need column names) and keep the old method for the validation data (which is more or less the same size every week), because we need the dataset column for the output file.

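Code to use the SQL split function:
from pyspark.sql import Row
from pyspark.sql.functions import split
from pyspark.mllib.regression import LabeledPoint

lines = sc.textFile('dataFileWithoutHeader.csv')
row = Row("val")
df = lines.map(row).toDF()  # one string column named "val"
tr = df.select(split(df.val, ","))  # one array column holding the split fields

# last field is the label, the rest are features; .rdd also works on Spark 2.x, where DataFrame has no map()
def labelData(data):
    return data.rdd.map(lambda row: LabeledPoint(row[0][-1], row[0][:-1]))
trLabeled = labelData(tr)
training = sqlContext.createDataFrame(trLabeled, ['features','label'])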
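
The header concern raised above (a header-free file is needed for the split, but the position of the 'dataset' column must still be known for the later join) can be sketched in plain Python. This is only an illustration under assumptions: the helper names and the sample rows are hypothetical and do not come from the PR, and the real column set is not shown in this thread.

```python
def split_header(lines):
    """Return (column_names, data_lines) for CSV text whose first line is a header."""
    header, data = lines[0], lines[1:]
    return header.split(","), data


def column_index(columns, name="dataset"):
    """Position of a named column, recorded before the header is discarded."""
    return columns.index(name)


def to_label_and_features(line):
    """Split one CSV row into (label, features): the last field is the label,
    all preceding fields are features (same convention as labelData above)."""
    fields = line.split(",")
    return float(fields[-1]), [float(v) for v in fields[:-1]]


# Hypothetical sample data, for illustration only.
sample = ["dataset,naccess,nusers,target", "101,5,2,1", "102,0,0,0"]

columns, rows = split_header(sample)
dataset_col = column_index(columns)          # remember where 'dataset' lives
labeled = [to_label_and_features(r) for r in rows]
dataset_ids = [r.split(",")[dataset_col] for r in rows]  # kept for the later join
```

On an RDD, the same two steps would roughly correspond to taking the header with `lines.first()` and dropping it with `lines.filter(lambda l: l != header)`; whether that is faster than the current approach would still need to be measured.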

vkuznet · 2016-07-28T22:39:06Z

Kipras, feel free to contribute more to this PR; I'll keep it open until I am back. You may want to check a few things I pointed out along the code lines.

vkuznet · 2016-08-03T14:16:50Z

Popularity/DCAFPilot/src/scripts/roll2.sh

@@ -28,7 +28,7 @@ valid_end=20170101          # stop date. use future date like 20200101 to run un
 # other options
 classifiers=("AdaBoostClassifier" "BaggingClassifier" "DecisionTreeClassifier" "ExtraTreesClassifier" "GradientBoostingClassifier" "XGBClassifier" "KNeighborsClassifier" "RandomForestClassifier" "RidgeClassifier" "SGDClassifier")
 #classifiers=("KNeighborsClassifier" "RidgeClassifier" "BaggingClassifier")
-drops="nusers,totcpu,rnaccess,rnusers,rtotcpu,nsites,s_0,s_1,s_2,s_3,s_4,wct"
+drops="campain,creation_date,tier_name,dataset_access_type,dataset_id,energy,flown_with,idataset,last_modification_date,last_modified_by,mcmevts,mcmpid,mcmtype,nseq,pdataset,physics_group_name,prep_id,primary_ds_name,primary_ds_type,processed_ds_name,processing_version,pwg,this_dataset,rnaccess,rnusers,rtotcpu,s_0,s_1,s_2,s_3,s_4,totcpu,wct,cpu,xtcrosssection"


Kipras, please revert this, since I have already committed the change to master. Otherwise you'll get a conflict here.

This reverts commit 4ccae63.

KiprasKancys added 2 commits July 28, 2016 12:27

code for model on spark

b7b9839

Update work_flow.md

bca2275

vkuznet reviewed Jul 28, 2016

Updated drops

4ccae63

vkuznet reviewed Aug 3, 2016

Revert "Updated drops"

d0cc7cc

This reverts commit 4ccae63.


KiprasKancys commented Jul 28, 2016

vkuznet Jul 28, 2016

KiprasKancys Jul 29, 2016

vkuznet commented Jul 28, 2016

vkuznet Aug 3, 2016
