Rivet Labs Internship 2017

This code was developed during the summer of 2017 for Rivet Labs. The goal of this project was to find a way to determine if two tables uploaded to the system were the same or similar. Initially, we attempted to determine similarity numerically – according to the content in each of the columns of the two tables. For each table, we iterated through every column and selected only the values that could be parsed as real numbers. If those numbers constituted at least 90% of the column’s total values, the column was considered “numeric” and was used to determine numerical similarity by looking at column similarity methods.

This was done first using the column’s summary statistics (max, min, mean, median, mode, standard deviation), then by determining each column’s best fit distribution, and later using Welsh’s t-test. Each of these tests failed to show similarity due to variance between tables, even when the data was drawn from the exact same distribution (ie as part of a time series). Because of these results, we were forced to reject the notion that statistical methods such as the t-test are accurately able to give the probability that the columns were drawn from the same distribution and are therefore similar.

Summary statistics for for two quarters from a time series:

(Max, Min, Mean, Median, Mode, Standard Deviation)

Alabama Wage Data, Quarter 1, 2016: 
 > lq_qtrly_estabs - (44.530000000000001, 0.0, 2.1738356378001873, 1.0, 2.3658741444158093, 5.5973604672152373)
 > lq_month1_emplvl - (30.719999999999999, 0.0, 1.071302110406487, 1.0, 1.0288827600708506, 1.0585997339710116)
 > lq_month2_emplvl - (30.66, 0.0, 1.0685965277055829, 1.0, 1.0245988594877031, 1.0498028228635019)
 > lq_month3_emplvl - (30.760000000000002, 0.0, 1.0688543507641126, 1.0, 1.0236544458296093, 1.0478684244667245)
 > lq_total_qtrly_wages - (26.850000000000001, 0.0, 1.1525756315625324, 1.0, 1.2291188920220222, 1.5107332507254434)
 > lq_taxable_qtrly_wages - (106.70999999999999, 0.0, 1.215167377066223, 0.88, 3.9302066546454588, 15.44652434821945)
 > lq_qtrly_contributions - (471.13999999999999, 0.0, 1.5803363135461066, 0.95999999999999996, 7.826846299257805, 61.259522992205603)
 > lq_avg_wkly_wage - (3.3700000000000001, 0.0, 1.0010422081297434, 1.0, 0.29981829092794815, 0.089891007574955745)
 > oty_qtrly_estabs_pct_chg - (6150.0, -100.0, 0.85199604948539331, 0.0, 45.470661613757883, 2067.5810675928747)
 > oty_month1_emplvl_pct_chg - (928.60000000000002, -100.0, 0.68383407838652666, 0.40000000000000002, 12.933956755899063, 167.287237363467)
 > oty_month2_emplvl_pct_chg - (813.89999999999998, -100.0, 0.83368333506601522, 0.5, 11.23101401976681, 126.13567591219864)
 > oty_month3_emplvl_pct_chg - (917.10000000000002, -100.0, 1.0873271649859655, 0.69999999999999996, 12.128697136470883, 147.105294228237)
 > oty_total_qtrly_wages_pct_chg - (701.60000000000002, -100.0, -0.1514086703399522, 0.20000000000000001, 11.827112290618157, 139.88058513489105)
 > oty_taxable_qtrly_wages_pct_chg - (27797.400000000001, -100.0, 6.9538049693315305, 0.0, 242.08760639315918, 58606.409169169157)

Alabama Wage Data, Quarter 2, 2016: 
 > lq_qtrly_estabs - (44.960000000000001, 0.0, 2.1814892400457429, 1.0, 2.3802185441195727, 5.6654403177706971)
 > lq_month1_emplvl - (30.800000000000001, 0.0, 1.0715329036282359, 1.0, 1.0242332206755345, 1.049053690335378)
 > lq_month2_emplvl - (30.780000000000001, 0.0, 1.0727263748830438, 1.0, 1.0244269581879926, 1.049450592662303)
 > lq_month3_emplvl - (30.699999999999999, 0.0, 1.077089614305021, 1.0, 1.0363448574971172, 1.0740106636607203)
 > lq_total_qtrly_wages - (27.129999999999999, 0.0, 1.1100478220189207, 1.0, 1.1403267405283157, 1.3003450751639329)
 > lq_taxable_qtrly_wages - (64.109999999999999, 0.0, 1.0015048341823474, 0.70999999999999996, 3.076618292752249, 9.4655801192977638)
 > lq_qtrly_contributions - (158.53, 0.0, 1.3771431541740304, 0.93999999999999995, 5.563226281680314, 30.949486661178575)
 > lq_avg_wkly_wage - (3.5299999999999998, 0.0, 0.96833610562428518, 1.0, 0.27446596245484495, 0.075331564546264343)
 > oty_qtrly_estabs_pct_chg - (1000.0, -100.0, 0.6061960702775756, 0.0, 12.409800775511657, 154.00315528788974)
 > oty_month1_emplvl_pct_chg - (891.39999999999998, -100.0, 0.70900301486641015, 0.5, 12.000187779393805, 144.0045067407124)
 > oty_month2_emplvl_pct_chg - (958.79999999999995, -100.0, 0.73239421977336516, 0.29999999999999999, 12.027979587520266, 144.67229295780419)
 > oty_month3_emplvl_pct_chg - (1084.8, -100.0, 0.93455660671587482, 0.5, 13.181562901193988, 173.75360051813365)
 > oty_total_qtrly_wages_pct_chg - (778.5, -100.0, 2.0831271441937829, 2.1000000000000001, 12.216579794143302, 149.24482186667038)
 > oty_taxable_qtrly_wages_pct_chg - (23796.299999999999, -100.0, 5.0945940326437258, 0.0, 183.83975086814277, 33797.053999260803)
 > oty_qtrly_contributions_pct_chg - (690400.0, -100.0, 78.281645701216334, 0.0, 5665.3227393727757, 32095881.741254251)
 > oty_avg_wkly_wage_pct_chg - (262.5, -100.0, 1.4304501507433205, 1.5, 6.5978802049571428, 43.532023198965312)

Best fit distributions for data taken quarterly for two quarters from a time series:

Alabama Wage Data	Quarter 3, 2016	Quarter 4, 2016
lq_qtrly_estabs	'cauchy'	'dgamma'
lq_month1_emplvl	'dgamma'	'johnsonsu'
lq_month2_emplvl	'dgamma'	'cauchy'
lq_month3_emplvl	'dweibull'	'dgamma'
lq_total_qtrly_wages	'gennorm'	'cauchy'
lq_taxable_qtrly_wages	'alpha'	'alpha'
lq_qtrly_contributions	't'	'exponnorm'
lq_avg_wkly_wage	'dgamma'	'gennorm'
oty_qtrly_estabs_pct_chg	'gennorm'	'dweibull'
oty_month1_emplvl_pct_chg	't'	'gennorm'
oty_month2_emplvl_pct_chg	'nct'	'nct'
oty_month3_emplvl_pct_chg	'laplace'	'laplace'
oty_total_qtrly_wages_pct_chg	'tukeylambda'	'hypsecant'
oty_taxable_qtrly_wages_pct_chg	'gengamma'	'exponweib'
oty_qtrly_contributions_pct_chg	'pearson3'	'foldnorm'
oty_avg_wkly_wage_pct_chg	'johnsonsu'	'burr'

Welch's t-test results for Alabama wage data taken quarterly over four years from a time-series:

(Columns over time were considered to be "similar" if they returned a pvalue greater than 0.05)

Columns that appear similar over time:
 > lq_month2_emplvl
 > lq_month3_emplvl
 > lq_qtrly_estabs

Columns that appear dissimilar over time:

Columns that sometimes appear similar and sometimes appear dissimilar: 
(The first value in the bracket represents the instances when the columns over time "passed" the t-test;
the second value represents the "failed" tests)
 > lq_avg_wkly_wage:[6, 9]
 > lq_total_qtrly_wages:[8, 7]
 > oty_month1_emplvl_pct_chg:[10, 5]
 > oty_month2_emplvl_pct_chg:[11, 4]
 > oty_month3_emplvl_pct_chg:[11, 4]
 > oty_taxable_qtrly_wages_pct_chg:[12, 3]
 > oty_total_qtrly_wages_pct_chg:[5, 10]
 > lq_qtrly_contributions:[6, 9]
 > lq_taxable_qtrly_wages:[3, 12]
 > oty_qtrly_contributions_pct_chg:[11, 4]
 > oty_qtrly_estabs_pct_chg:[12, 3]
 > lq_month1_emplvl:[14, 1]
 > oty_avg_wkly_wage_pct_chg:[1, 14]

Because the data used to test the column similarity methods was drawn from the same time-series, the columns of the tables with the same names were the same semantically as well. Therefore, we are essentially able to rely on column names to determine if two tables are similar, with greater numbers of shared columns connoting the higher similarity. This, however, creates an efficiency challenge because in practical situations there are often many tables to be compared, each containing a vast number of columns. In fact, in order to test such large data in a reliable way, we generated our own CSV files so we could control the numbers of files we were comparing and the number of columns per table. Our upper limit during our tests was testing 100,000 tables with up to 100,000 columns each (the columns were randomly selected from a list) but we generally only tested tables with up to 10,000 columns due to time contraints.

We decided to increase the efficiency of this process by using bitwise comparisons. This was first attempted with the BitVector package for Python using a universal column vocabulary. This package, however, still requires more operations than mathematically necessary because the bitsets are very sparse but contain runs. Therefore, we converted the same tables to MutableSparseIntSets and SemiSparseIntSets to see if they were more suited to these types of comparisons. Running the tables through each of these packages showed that using SemiSparseIntSets was the superior method for efficiently determining similarity between very large sets of data.

Testing runtime (seconds) of BitVector, MutableSparseIntSet, and SemiSparseMutableIntSet packages with varying numbers of tables and columns per table:

	100 tables	1000 tables	10000 tables	25000 tables	50000 tables	75000 tables	100000 tables
100 columns
BitVectors	0.117342861	1.330632427	7.833533564	18.79500713	160.138788	211.4978684	255.0115816
MutableSparse	0.27530164	0.562335432	75.65324984	128.526734	193.9695324	100.6847252	131.9256592
SemiSparse	0.026061443	0.172061698	1.420522832	3.501075161	7.527625295	10.84277683	14.7325597
500 columns
BitVectors	0.503639508	11.06635143	80.94988798	171.869837	292.4827158	309.1632609	460.2269579
MutableSparse	0.428041255	3.91960056	70.23124473	154.1113304	158.9049406	177.84524	198.3105901
SemiSparse	0.030555271	0.284721495	0.284721495	6.176032069	12.4457266	18.40387019	24.7249516
1000 columns
BitVectors	0.755797632	11.13807729	108.7040595	176.1630223	363.6930041	498.522717	486.6650112
MutableSparse	0.53404347	4.066030248	39.91804323	105.8073022	129.4894868	181.4601254	180.1262393
SemiSparse	0.04621434	0.475913274	4.101160731	9.258089584	19.65058712	28.60314256	42.41706525
5000 columns
BitVectors	2.36950059	19.68828472	173.5585634	422.4184266	836.6417741	1246.242342	1664.412147
MutableSparse	1.744588146	16.92050881	113.2346881	201.5243091	355.5350035	467.9677709	840.2451703
SemiSparse	0.592129024	2.963505763	29.94089598	73.13437793	138.9306445	209.4756217	280.8542945
10000 columns
BitVectors	5.075053067	45.88543995	431.6960821	976.0051405	1926.867358	2799.844562	3885.38318
MutableSparse	2.243663899	18.34730918	169.0528138	385.4876877	729.5381877	1268.196914	2779.936492
SemiSparse	2.562875087	19.99145442	180.815371	359.69897	596.4114277	490.1489115	482.0160861

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
Best Distributions.py		Best Distributions.py
Collecting Statistics from Postgres.py		Collecting Statistics from Postgres.py
Compare BitVectors (pickled).py		Compare BitVectors (pickled).py
Comparing Tables in TimeSeries.py		Comparing Tables in TimeSeries.py
Existing CSVs -> Bit Vectors.py		Existing CSVs -> Bit Vectors.py
Existing CSVs -> MutableSparseIntSets.java		Existing CSVs -> MutableSparseIntSets.java
Generate CSV -> Bit Vectors.py		Generate CSV -> Bit Vectors.py
Generate CSVs.py		Generate CSVs.py
README.md		README.md
bv vs msis.jpg		bv vs msis.jpg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Rivet Labs Internship 2017

About

Releases

Packages

Languages

bjbroder/Rivet-Labs-Internship-2017

Folders and files

Latest commit

History

Repository files navigation

Rivet Labs Internship 2017

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages