Support for faster scans #15

anuragkh · 2015-12-07T08:28:21Z

Faster scans can be supported by keeping a Snappy-compressed representation of the data alongside the Succinct data structures. Operations on the Succinct RDDs / DataFrame that require full scans (e.g., aggregates) can execute efficiently on the alternate representation, whereas search/random access queries are handled by the Succinct data structures. The two representations should remain under the hood, exposing a single unified interface to the Succinct RDDs / DataFrame.
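To make the proposal concrete, here is a minimal sketch of the idea, not Succinct's actual implementation. The class `DualStore` and its methods are hypothetical names. A plain `dict` stands in for the Succinct structures, and `zlib` stands in for Snappy only because it ships with Python. Values are assumed to contain no tabs or newlines.

```python
import zlib
from typing import Dict, Iterator, List, Tuple


class DualStore:
    """Toy store: two hidden representations behind one interface."""

    def __init__(self, records: Dict[int, str]) -> None:
        # Stand-in for the Succinct structures: serves get() and search().
        self._index = dict(records)
        # Stand-in for the Snappy-compressed copy: one block that is
        # decompressed sequentially for full scans. zlib is NOT Snappy;
        # it is used here only to keep the sketch dependency-free.
        payload = "\n".join(f"{k}\t{v}" for k, v in sorted(records.items()))
        self._scan_block = zlib.compress(payload.encode("utf-8"))

    def get(self, key: int) -> str:
        # Random access never touches the scan block.
        return self._index[key]

    def search(self, term: str) -> List[int]:
        # Search is answered from the index representation.
        return sorted(k for k, v in self._index.items() if term in v)

    def scan(self) -> Iterator[Tuple[int, str]]:
        # Full scans decompress the scan-friendly copy in one pass.
        text = zlib.decompress(self._scan_block).decode("utf-8")
        if not text:
            return
        for line in text.split("\n"):
            key, value = line.split("\t", 1)
            yield int(key), value

    def total_length(self) -> int:
        # Example aggregate that is routed to the scan representation.
        return sum(len(value) for _, value in self.scan())
```

Callers only see `get`, `search`, and aggregates such as `total_length`; which representation answers each call is an internal routing decision, which is the "single unified interface" described above.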

koertkuipers · 2016-04-22T17:19:34Z

how fast are scans currently compared to say parquet or avro?

anuragkh · 2016-04-22T22:02:57Z

Succinct is currently not optimized for full scans, but supports queries like random access (i.e., get(key)) and search (i.e., find all occurrences of a certain term in a structured/unstructured dataset) without requiring full scans. We have comparisons against the Parquet format for search here: http://succinct.cs.berkeley.edu/wp/wordpress/?page_id=8.

The time taken to extract the data for full scans would depend on the dataset size and the dataset itself. I haven't benchmarked the scan performance against Parquet and Avro, but I would expect Parquet and Avro to be faster for full scans.

What would your intended use-case be?

koertkuipers · 2016-04-22T22:25:11Z

The use case would be similar to elasticsearch or hbase: most access is by
key and/or search, but full scans are also needed for certain types of
analytics. in the case of hbase we accept the fact that the full scan is
about 10x slower (at least an order of magnitude) than optimized avro or
parquet data on hdfs.

On Fri, Apr 22, 2016 at 6:02 PM, Anurag Khandelwal wrote:

Succinct is currently not optimized for full scans, but supports queries
like random access (i.e., get(key)) and search (i.e., find all
occurrences of a certain term in a structured/unstructured dataset)
without requiring full scans. We have comparisons against parquet
format for search here
http://succinct.cs.berkeley.edu/wp/wordpress/?page_id=8.

The time taken to extract the data for full scans would depend on the
dataset size and the dataset itself. I haven't benchmarked the scan
performance against parquet and avro, but I would expect parquet and avro
to be faster for full scans.

What would your intended use-case be?


anuragkh · 2016-04-22T23:10:14Z

The full scan performance for Succinct could be slower than HBase depending on the size of the dataset and the size of the cluster being used.

However, with the planned optimization described here, we should be able to support full scans at the rate at which the Snappy codec can decompress data, at the cost of some reduction in the overall compression factor (since the Succinct representation would be supplemented with a scan-efficient compressed representation). Again, I don't have a performance comparison against Avro or Parquet, but the Snappy codec is known to provide very fast decompression (on the order of ~500 MB/s per thread, as suggested here).
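As a rough back-of-the-envelope sketch of that trade-off: the function names and the sample sizes below are hypothetical. The estimate assumes scan time is bounded only by decompression, that throughput scales linearly with threads, and that the ~500 MB/s per-thread figure holds. It ignores disk and network I/O.

```python
def estimated_scan_seconds(uncompressed_bytes: float,
                           threads: int,
                           per_thread_bytes_per_sec: float = 500e6) -> float:
    """Lower-bound scan time if decompression is the only cost."""
    if threads < 1 or per_thread_bytes_per_sec <= 0:
        raise ValueError("threads and throughput must be positive")
    return uncompressed_bytes / (threads * per_thread_bytes_per_sec)


def combined_compression_factor(raw_bytes: float,
                                succinct_bytes: float,
                                scan_copy_bytes: float) -> float:
    """Overall compression factor when both representations are stored."""
    return raw_bytes / (succinct_bytes + scan_copy_bytes)


# Hypothetical example: 100 GB of raw data scanned with 16 threads.
print(estimated_scan_seconds(100e9, 16))

# Hypothetical sizes: if each representation is half the raw size,
# storing both brings the overall factor down from 2.0 to 1.0.
print(combined_compression_factor(100e9, 50e9, 50e9))
```

Real numbers would depend on the dataset and cluster, as noted earlier in the thread, so this only illustrates how the two quantities move against each other.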

Would that cater to your use case? I'd be happy to assign higher priority to this if it helps!

anuragkh added the enhancement label Dec 7, 2015

anuragkh added this to the v0.1.7 milestone Dec 7, 2015

anuragkh modified the milestones: v0.1.8, v0.1.7, v0.1.9 Oct 17, 2016

anuragkh self-assigned this Oct 17, 2016

anuragkh closed this as completed Nov 27, 2018

anuragkh reopened this Nov 27, 2018
