FEM21037

Computer Science for Business Analytics individual paper

The code in this repository detects duplicate products in data aggregated from different webshops. The code itself contains comments about the steps taken. The data is included in the repository. The code is organised as follows:

Line 7 - 19: imports of packages
Line 20 - 394: various functions that are used
Line 396: data initialization and modification
Line 397 - 431: bootstrap settings and bootstrap data initialization
Line 432 - 519: LSH (for flow, please read below)
Line 520 - 575: classification (for flow, please read below)
Line 576 - 600: plots

Line 432 - 519: LSH flow

1. For every bootstrap, obtain the correct train and test sample indices from the pre-defined bootstrap sample.
2. Then within every bootstrap, for threshold values 0.05, 0.15, ..., 0.95, get the optimal values for the bands, rows and number of minhashes using the threshold value.
3. With these values, calculate the bootstrap results for the train sample by means of the LSH function.
   3.1. Create binary vectors from the tokens given as input.
   3.2. Create the signature matrix based on these binary vectors.
   3.3. For every band, for every product, find out which products hash to the same bucket. If multiple products hash to the same bucket, all unique combinations within these products are considered possible pairs.
   3.4. For every possible pair, evaluate whether the two products have the same brand and are from a different webshop, or are identical, by comparing the pair to all pairs that obey this property.
   3.5. Consider these products as the products that need to be compared. Calculate the Jaccard similarity based on the signature matrix.
   3.6. The pairs whose Jaccard similarity passes the threshold value are considered candidate pairs.
4. From these candidate pairs, check whether they are true duplicates by comparing them based on their modelID.
5. Calculate pair completeness, pair quality and F1 score. Store all in a dictionary.
6. Perform step 3 for the test sample.
7. Calculate for both train and test sample the average pair completeness, pair quality and F1 score. Store in a list.
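
The LSH flow above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code from the repository: all function names and the product keys `shop`, `brand` and `tokens` are hypothetical (only `modelID` is named in this README), the number of minhashes is taken as a fixed input instead of being optimised, the binary vectors are represented implicitly as token sets, and pair quality is computed over the candidate pairs.

```python
import hashlib
from itertools import combinations


def choose_bands_rows(n_hashes, threshold):
    """Pick (bands, rows) with bands * rows == n_hashes whose implied
    LSH threshold (1 / bands) ** (1 / rows) is closest to the target."""
    options = [(n_hashes // r, r) for r in range(1, n_hashes + 1)
               if n_hashes % r == 0]
    return min(options,
               key=lambda br: abs((1 / br[0]) ** (1 / br[1]) - threshold))


def _hash(seed, token):
    # Seeded, deterministic 64-bit hash of a token.
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, n_hashes):
    """Min-hash signature of a non-empty token set (3.1 and 3.2)."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(n_hashes)]


def lsh_pairs(signatures, bands, rows):
    """Index pairs that share a bucket in at least one band (3.3)."""
    pairs = set()
    for band in range(bands):
        buckets = {}
        for idx, sig in enumerate(signatures):
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets.setdefault(key, []).append(idx)
        for members in buckets.values():
            pairs.update(combinations(members, 2))
    return pairs


def signature_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def run_lsh(products, n_hashes, threshold):
    """Candidate pairs for one sample at one threshold (3.1 to 3.6)."""
    bands, rows = choose_bands_rows(n_hashes, threshold)
    sigs = [minhash_signature(p["tokens"], n_hashes) for p in products]
    candidates = set()
    for i, j in lsh_pairs(sigs, bands, rows):
        a, b = products[i], products[j]
        # 3.4: keep only same-brand pairs from different webshops.
        if a["brand"] != b["brand"] or a["shop"] == b["shop"]:
            continue
        if signature_similarity(sigs[i], sigs[j]) >= threshold:
            candidates.add((i, j))
    return candidates


def true_duplicates(products):
    """All index pairs whose modelID matches."""
    return {(i, j) for i, j in combinations(range(len(products)), 2)
            if products[i]["modelID"] == products[j]["modelID"]}


def evaluate(candidates, truth):
    """Pair completeness, pair quality and their harmonic mean (F1)."""
    found = len(candidates & truth)
    pc = found / len(truth) if truth else 0.0
    pq = found / len(candidates) if candidates else 0.0
    f1 = 2 * pc * pq / (pc + pq) if pc + pq else 0.0
    return {"pair_completeness": pc, "pair_quality": pq, "f1": f1}
```

Calling `evaluate(run_lsh(sample, n_hashes, t), true_duplicates(sample))` once per bootstrap sample and threshold `t`, and averaging the resulting dictionaries over the bootstraps, would mirror the last steps of the flow.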

Line 520 - 575: classification flow

For every train and test bootstrap, for every threshold value, calculate the title similarity, feature value similarity and obtain the Jaccard similarity from LSH.
Train a logistic regression model and perform a grid search on parameter c. Calculate the F1 score.
Calculate for both train and test sample the average F1 score.
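
The classification flow can be illustrated with the sketch below. It is not the repository's implementation: this README does not say which library is used or how the grid search is validated, so the sketch stays dependency-free, fits an L2-regularised logistic regression by batch gradient descent, and scores each value of the parameter (written `C` here, as the inverse regularisation strength) on a held-out validation set. All function names and the grid values are assumptions; in practice scikit-learn's `LogisticRegression` with `GridSearchCV` would be the usual choice. Each row of `X` holds the three similarities named above: title, feature value, and Jaccard from LSH.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logreg(X, y, C, epochs=2000):
    """L2-regularised logistic regression by batch gradient descent.
    C is the inverse regularisation strength (larger C, weaker penalty)."""
    n, d = len(X), len(X[0])
    # Step size 1 / L, with L an upper bound on the objective's curvature.
    curvature = 0.25 * max(1.0 + sum(v * v for v in xi) for xi in X)
    lr = 1.0 / (curvature + 1.0 / (C * n))
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        # Mean log-loss gradient plus the L2 penalty gradient w / (C * n).
        w = [wj - lr * (gj / n + wj / (C * n)) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, X):
    """Label a pair as duplicate (1) when its linear score is non-negative."""
    return [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else 0
            for xi in X]


def f1_score(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def grid_search_C(X_train, y_train, X_val, y_val,
                  grid=(0.01, 0.1, 1.0, 10.0, 100.0)):
    """Return the grid value of C with the highest validation F1."""
    best = None
    for C in grid:
        w, b = fit_logreg(X_train, y_train, C)
        score = f1_score(y_val, predict(w, b, X_val))
        if best is None or score > best["f1"]:
            best = {"C": C, "f1": score, "w": w, "b": b}
    return best
```

Running `grid_search_C` once per bootstrap and threshold value, then averaging the returned `f1` values, corresponds to the final averaging step.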

