Support for faster scans #15

Open
anuragkh opened this issue Dec 7, 2015 · 4 comments
@anuragkh
Collaborator

anuragkh commented Dec 7, 2015

Faster scans can be supported by keeping a snappy-compressed representation of the data alongside the Succinct data structures. Operations on the Succinct RDDs / DataFrame that require full scans (e.g., aggregates) can then execute efficiently on the alternate representation, while search and random access queries are still handled by the Succinct data structures. The two representations should remain under the hood, exposing a single unified interface to the Succinct RDDs / DataFrame.
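To make the proposal concrete, here is a minimal sketch of the dual-representation idea in Python. It is not Succinct's implementation: `DualPartition` and its methods are hypothetical names, a plain dict stands in for the Succinct data structures, `zlib` stands in for snappy (to keep the example standard-library only), and record values are assumed not to contain newlines.

```python
import zlib


class DualPartition:
    """Hypothetical partition holding two views of the same records."""

    def __init__(self, records):
        # records: dict mapping key -> text value (values must not contain "\n").
        # Stand-in for the Succinct structures: a plain dict.
        self._index = dict(records)
        self._count = len(self._index)
        # Scan-friendly copy: values joined by newlines and block-compressed.
        # zlib is only a stand-in here; the issue proposes snappy.
        payload = "\n".join(self._index.values()).encode("utf-8")
        self._scan_blob = zlib.compress(payload)

    def get(self, key):
        # Random access is served by the index side.
        return self._index[key]

    def search(self, term):
        # Search is also served by the index side. This stand-in loops over
        # records; a real Succinct structure would avoid the full pass.
        return sorted(k for k, v in self._index.items() if term in v)

    def scan(self):
        # Full scans decompress the alternate representation sequentially.
        if self._count == 0:
            return []
        text = zlib.decompress(self._scan_blob).decode("utf-8")
        return text.split("\n")

    def count_matching(self, predicate):
        # Example aggregate that runs on the scan-friendly copy.
        return sum(1 for value in self.scan() if predicate(value))
```

Callers only see `get`, `search`, `scan`, and aggregates such as `count_matching`; which representation answers each call stays hidden, which is the "single unified interface" part of the proposal.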

@anuragkh anuragkh added this to the v0.1.7 milestone Dec 7, 2015
@koertkuipers
Contributor

How fast are scans currently compared to, say, Parquet or Avro?

@anuragkh
Collaborator Author

Succinct is currently not optimized for full scans, but supports queries like random access (i.e., get(key)) and search (i.e., find all occurrences of a certain term in a structured/unstructured dataset) without requiring full scans. We have comparisons against the Parquet format for search here: http://succinct.cs.berkeley.edu/wp/wordpress/?page_id=8.

The time taken to extract the data for full scans would depend on the dataset size and the dataset itself. I haven't benchmarked the scan performance against Parquet and Avro, but I would expect Parquet and Avro to be faster for full scans.

What would your intended use-case be?

@koertkuipers
Contributor

The use case would be similar to Elasticsearch or HBase: most access is by
key and/or search, but full scans are also needed for certain types of
analytics. In the case of HBase we accept the fact that the full scan is
about 10x slower (at least an order of magnitude) than optimized Avro or
Parquet data on HDFS.

@anuragkh
Collaborator Author

The full scan performance for Succinct could be slower than HBase depending on the size of the dataset and the size of the cluster being used.

However, with the planned optimization described here, we should be able to support full scans at the rate at which the snappy codec can decompress data, at the cost of some reduction in the overall compression factor (since the Succinct representation would be supplemented with a scan-efficient compressed representation). Again, I don't have a performance comparison against Avro or Parquet, but the snappy codec is known to provide very fast decompression (on the order of ~500 MB/s per thread, as suggested here).
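The trade-off above can be put into rough numbers. The sketch below is back-of-envelope arithmetic only: the function names are hypothetical, the 500 MB/s per-thread default is the approximate figure quoted in this thread rather than a measurement, and it assumes decompression is the bottleneck and scales linearly with threads.

```python
def scan_seconds(data_bytes, threads, mb_per_sec_per_thread=500.0):
    """Estimate full-scan time if snappy decompression is the bottleneck.

    Assumes the quoted rate applies to data_bytes and that throughput
    scales linearly with the number of threads.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    bytes_per_sec = mb_per_sec_per_thread * 1_000_000 * threads
    return data_bytes / bytes_per_sec


def combined_compression_factor(raw_bytes, succinct_bytes, scan_copy_bytes):
    """Overall compression factor when both representations are stored."""
    return raw_bytes / (succinct_bytes + scan_copy_bytes)
```

For example, under these assumptions `scan_seconds(100 * 10**9, threads=8)` estimates the time to scan 100 GB on 8 threads, and `combined_compression_factor` shows how much the extra scan-friendly copy lowers the overall compression factor for whatever sizes are measured on a real dataset.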

Would that cater to your use case? I'd be happy to assign higher priority to this if it helps!

@anuragkh anuragkh modified the milestones: v0.1.8, v0.1.7, v0.1.9 Oct 17, 2016
@anuragkh anuragkh self-assigned this Oct 17, 2016
@anuragkh anuragkh reopened this Nov 27, 2018