Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark code and results #28

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Conversation

KiprasKancys
Copy link
Contributor

No description provided.

"Load, add label col and convert data into label and feature type dataframe (needed for MLlib)"

lines = ctx.textFile(fname)
parts = lines.map(lambda l: l.split(","))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kipras, there is a split spark sql_function which will be much faster than python one. Try to google it and look-up examples with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use spark split function, but we need to change a few things. We need to start with data file without headers. However later on when I want to join 'dataset' column with prediction results. I will need to know at least which column was a 'dataset' col.
I can do all of that, but we have to test if that will be faster.

There is a another option. We use spark split function on training data (which grows over the weeks and is much bigger AND we don't need column names here) and old method with validation data (which every week is more or less the same size). Because for output file we need dataset column.

Code to use sql split function:
lines = sc.textFile('dataFileWithoutHeader.csv')
row = Row("val")
df = lines.map(row).toDF()
tr = df.select(split(df.val, ","))

def labelData(data): return data.map(lambda row: LabeledPoint(row[0][-1], row[0][:-1]))
trLabeled = labelData(tr)
training = sqlContext.createDataFrame(trLabeled, ['features','label'])

@vkuznet
Copy link
Contributor

vkuznet commented Jul 28, 2016

Kipras, feel free to contribute to this PR more, I'll keep it open until I back. You may check a few things I pointed out along the code lines.

@@ -28,7 +28,7 @@ valid_end=20170101 # stop date. use future date like 20200101 to run un
# other options
classifiers=("AdaBoostClassifier" "BaggingClassifier" "DecisionTreeClassifier" "ExtraTreesClassifier" "GradientBoostingClassifier" "XGBClassifier" "KNeighborsClassifier" "RandomForestClassifier" "RidgeClassifier" "SGDClassifier")
#classifiers=("KNeighborsClassifier" "RidgeClassifier" "BaggingClassifier")
drops="nusers,totcpu,rnaccess,rnusers,rtotcpu,nsites,s_0,s_1,s_2,s_3,s_4,wct"
drops="campain,creation_date,tier_name,dataset_access_type,dataset_id,energy,flown_with,idataset,last_modification_date,last_modified_by,mcmevts,mcmpid,mcmtype,nseq,pdataset,physics_group_name,prep_id,primary_ds_name,primary_ds_type,processed_ds_name,processing_version,pwg,this_dataset,rnaccess,rnusers,rtotcpu,s_0,s_1,s_2,s_3,s_4,totcpu,wct,cpu,xtcrosssection"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kipras, please revert this back since I already committed the change to the master. You'll get conflict here.

This reverts commit 4ccae63.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants