[Feat]Copy-free save and load for cuckoo hashtable #243

Open
Lifann wants to merge 7 commits into master from indep-table-save

Lifann
Member

@Lifann Lifann commented May 16, 2022

Description

Brief Description of the PR:
Dynamic embedding tables can be very large relative to available memory, so saving or loading them through the traditional TensorFlow checkpoint mechanism uses a lot of memory.
This PR provides a way to save dynamic embedding tables to files and load them back without copying the full table.
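
A rough illustration of the idea, not the code in this PR: instead of exporting the whole table into one tensor, entries are streamed to disk through a bounded buffer. The function name `save_keys_chunked`, the little-endian int64 encoding, and the final rename step are assumptions made for this sketch; the `.keys.tmp` and `.keys` suffixes follow the snippet reviewed further down.

```python
import os
import struct


def save_keys_chunked(keys, filepath, buffer_size=4):
    """Stream int64 keys to disk, holding at most `buffer_size` keys in memory."""
    tmp_path = filepath + ".keys.tmp"
    final_path = filepath + ".keys"
    buffer = []
    with open(tmp_path, "wb") as f:
        for key in keys:  # `keys` can be a generator over the table
            buffer.append(key)
            if len(buffer) == buffer_size:
                f.write(struct.pack(f"<{len(buffer)}q", *buffer))
                buffer.clear()
        if buffer:  # flush the final partial chunk
            f.write(struct.pack(f"<{len(buffer)}q", *buffer))
    # Assumption: publish the file only after a complete write.
    os.replace(tmp_path, final_path)
    return final_path
```

Peak memory here depends on `buffer_size` rather than on the number of keys in the table.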

Type of change

  • Bug fix
  • New Tutorial
  • Updated or additional documentation
  • Additional Testing
  • New Feature

Checklist:

  • I've properly formatted my code according to the guidelines
    • By running yapf
    • By running clang-format
  • This PR addresses an already submitted issue for TensorFlow Recommenders-Addons
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

How Has This Been Tested?

Yes

@Lifann Lifann requested a review from rhdong as a code owner May 16, 2022 06:10
@google-cla

google-cla bot commented May 16, 2022

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

For more information, open the CLA check for this pull request.

@Lifann Lifann force-pushed the indep-table-save branch 2 times, most recently from 3fdfd57 to 1d4fe07 Compare May 16, 2022 06:15
@@ -304,6 +304,18 @@ class CuckooHashTableOfTensors final : public LookupInterface {
    return table_->export_values(ctx, value_dim);
  }

  Status Save(OpKernelContext* ctx, const string filepath,
              const size_t buffer_size) {
    int64 value_dim = value_shape_.dim_size(0);
Member

int64_t

Member Author

tensorflow::int64 is returned from dim_size and also used in table_->save()


  Status Load(OpKernelContext* ctx, const string filepath,
              const size_t buffer_size) {
    int64 value_dim = value_shape_.dim_size(0);
Member

@rhdong rhdong May 26, 2022

int64_t

@@ -304,6 +304,18 @@ class CuckooHashTableOfTensors final : public LookupInterface {
    return table_->export_values(ctx, value_dim);
  }

  Status Save(OpKernelContext* ctx, const string filepath,
Member

SaveToFile might be a better name, since it leaves room for future extensions.

Member Author

Accept.

  Status Save(OpKernelContext* ctx, const string filepath,
              const size_t buffer_size) {
    int64 value_dim = value_shape_.dim_size(0);
    return table_->save(ctx, value_dim, filepath, buffer_size);
Member

SaveToFile might be a better name, since it leaves room for future extensions.

Member Author

Accept

    return table_->save(ctx, value_dim, filepath, buffer_size);
  }

  Status Load(OpKernelContext* ctx, const string filepath,
Member

LoadFromFile might be a better name, since it leaves room for future extensions.

Member Author

Accept

  Status Load(OpKernelContext* ctx, const string filepath,
              const size_t buffer_size) {
    int64 value_dim = value_shape_.dim_size(0);
    return table_->load(ctx, value_dim, filepath, buffer_size);
Member

LoadFromFile

@@ -49,6 +49,84 @@ struct ValueArray : public ValueArrayBase<V> {
template <class T>
using ValueType = ValueArrayBase<T>;

template <typename T>
class HostFileBuffer {
Member

This looks like a duplicate of HostFileBuffer in lookup_table_op_cpu.h. If so, I recommend moving both to tensorflow_recommenders_addons/dynamic_embedding/core/utils/host_file_buffer.h

Member Author

Accept

@@ -338,6 +338,53 @@ def export(self, name=None):
        self.resource_handle, self._key_dtype, self._value_dtype)
    return keys, values

  def save(self, filepath, buffer_size=4194304, name=None):
Member

save_to_file would be better, because save_to_hdfs is possible in the future.

Member Author

Accept

        value_dtype=self._value_dtype,
        buffer_size=buffer_size)

  def load(self, filepath, buffer_size=4194304, name=None):
Member

@rhdong rhdong May 26, 2022

load_from_file or load_from_localfile

@rhdong rhdong force-pushed the indep-table-save branch from 1d4fe07 to 9df45e4 Compare June 7, 2022 03:45
@rhdong rhdong changed the title Copy-free save and load for cuckoo hashtable [Feat]Copy-free save and load for cuckoo hashtable Jun 7, 2022
@acmore
Contributor

acmore commented Jun 9, 2022

This feature is very useful. Looking forward to it.

May I ask how to use it? In our case, we will use estimator and save checkpoints per epoch. So should we customize a saver to save the tables manually? But if so, how can we get all the tables to save?

Currently it only supports usage like save_op = table.save(path). In eager mode this is Pythonic and simple. In graph mode, the op has to be managed on the graph, for example through a SessionRunHook or by running it as a separate sub-branch of the graph.
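
For graph mode, the hook approach could look roughly like the sketch below. It is kept framework-free so it runs on its own: in real code the class would subclass a hook base such as tf.estimator.SessionRunHook, whose begin() and end(session) methods it mirrors. The class name SaveTablesHook and the path_fn argument are hypothetical, the list of tables has to be collected by the caller, and table.save(path) is assumed to return an op as described above (it may be renamed to save_to_file, as suggested in the review).

```python
class SaveTablesHook:
    """Hook-shaped helper that runs each table's save op when training ends."""

    def __init__(self, tables, path_fn):
        self._tables = tables    # tables collected explicitly by the caller
        self._path_fn = path_fn  # hypothetical: maps a table index to a file path
        self._save_ops = []

    def begin(self):
        # Build the save ops once, before the session is created.
        self._save_ops = [
            table.save(self._path_fn(i)) for i, table in enumerate(self._tables)
        ]

    def end(self, session):
        # Run all save ops in a single call at the end of the session.
        session.run(self._save_ops)
```

Saving per epoch instead of only at the end would need the same ops to be triggered from a periodic callback; that part is not shown here.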

size_t new_max_size = max_size_;
size_t capacity = table_->get_capacity();

size_t cur_size = table_->get_size(stream);
Contributor

According to our profiling results, this kernel is expensive. I suggest caching the last size and checking whether the table needs to expand given the number of keys expected to be added. If it does not, we could add the expected count to the cached size and just return. This would reduce the number of calls to get_size.

Member Author

@Lifann Lifann Jun 10, 2022

A cached size lags behind the real-time size. Save and load are usually low-frequency operations, so the gap between the recorded size and the real-time size could become large.
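
To make the trade-off in this thread concrete, here is a small sketch of the caching idea. It is only an illustration: the class CachedSizeTable and the get_size callable are hypothetical stand-ins, not the table API in this PR.

```python
class CachedSizeTable:
    """Tracks an estimated size to avoid calling an expensive get_size()."""

    def __init__(self, capacity, get_size):
        self.capacity = capacity
        self._get_size = get_size  # stand-in for the expensive size kernel
        self._cached_size = get_size()
        self.size_queries = 1

    def needs_expand(self, expected_new_keys):
        # Fast path: the cached estimate says the new keys still fit.
        if self._cached_size + expected_new_keys <= self.capacity:
            self._cached_size += expected_new_keys  # upper bound, may drift
            return False
        # Slow path: refresh from the real size before deciding.
        self._cached_size = self._get_size()
        self.size_queries += 1
        if self._cached_size + expected_new_keys <= self.capacity:
            self._cached_size += expected_new_keys
            return False
        return True
```

The estimate only moves when needs_expand is called, so anything that changes the table through another path (a load, for instance) leaves it stale until the next refresh, which is the gap described in the reply above.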


  ~HostFileBuffer() { Close(); }

  void Put(const T value) {
Contributor

Will this work if T is tstring? I remember that the key is sometimes a tstring.

Member Author

I haven't used string keys. I think this can be solved by abstracting the host buffer class and adding a template specialization.
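
The underlying difficulty is that string keys are variable-length, so they cannot be written as fixed-width records the way integer keys can. One common way around that, shown here as a Python sketch rather than the C++ specialization mentioned above, is a length-prefixed encoding. The function names and the 4-byte little-endian prefix are assumptions of this sketch.

```python
import struct


def encode_int64_key(key):
    # Fixed-width record: always 8 bytes.
    return struct.pack("<q", key)


def encode_string_key(key):
    # Variable-width record: 4-byte length prefix, then the UTF-8 bytes.
    data = key.encode("utf-8")
    return struct.pack("<I", len(data)) + data


def decode_string_keys(blob):
    keys, offset = [], 0
    while offset < len(blob):
        (length,) = struct.unpack_from("<I", blob, offset)
        offset += 4
        keys.append(blob[offset:offset + length].decode("utf-8"))
        offset += length
    return keys
```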

size_t key_buffer_size = buffer_size;
string key_tmpfile = filepath + ".keys.tmp";
string key_file = filepath + ".keys";
auto key_buffer = HostFileBuffer<K>(ctx, key_tmpfile, key_buffer_size,
Contributor

@acmore acmore Jun 9, 2022

Would it be better to abstract the file out of the HostFileBuffer class? We depend heavily on HDFS and would like to add HDFS support based on your work. Thanks.

Member Author

Accept
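
One possible shape for that separation, sketched in Python rather than in the PR's C++: the buffer only knows how to batch records, and the destination is any object with a write() method, so a local file, an in-memory stream, or an HDFS client wrapper could be plugged in. The class name HostBuffer and its methods are hypothetical.

```python
class HostBuffer:
    """Batches encoded records and flushes them to any object with write()."""

    def __init__(self, sink, capacity):
        self._sink = sink          # local file, io.BytesIO, HDFS wrapper, ...
        self._capacity = capacity  # maximum number of buffered records
        self._records = []

    def put(self, record):
        self._records.append(record)
        if len(self._records) >= self._capacity:
            self.flush()

    def flush(self):
        # Write whatever is buffered in one call, then reset the buffer.
        if self._records:
            self._sink.write(b"".join(self._records))
            self._records.clear()
```

With this split, supporting another storage backend means supplying a different sink rather than changing the buffering logic.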

size_t total_keys = 0;
size_t total_values = 0;
while (nkeys > 0) {
  nkeys = key_buffer.Fill();
Contributor

Is it possible that nkeys is 0?

Member Author

If nkeys is 0, then it leaves an empty file.
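
The zero-key case can be seen in a chunked read loop like the following Python sketch, which plays the role of the Fill() loop in the snippet above. The function name load_keys_chunked and the int64 little-endian layout are assumptions of the sketch, not the PR's file format.

```python
import struct


def load_keys_chunked(path, buffer_size=4):
    """Yield int64 keys from a file, reading at most `buffer_size` keys at a time."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(buffer_size * 8)
            nkeys = len(chunk) // 8
            if nkeys == 0:  # empty file or end of file: nothing more to load
                break
            # Trailing bytes that do not form a whole key are ignored.
            yield from struct.unpack(f"<{nkeys}q", chunk[:nkeys * 8])
```

An empty file makes the first read return zero keys, so the loop ends immediately and nothing is inserted into the table.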

@Lifann Lifann force-pushed the indep-table-save branch 4 times, most recently from dfc5667 to cf6c24f Compare June 10, 2022 14:36
@Lifann Lifann force-pushed the indep-table-save branch from 3578760 to 5feb26c Compare July 1, 2022 03:39
rhdong and others added 4 commits July 1, 2022 15:01
- Also include the horovod compile fail on macOS(They was caused by the same reason)
…t reference with error address when high concurrency and server disconnection.
@Lifann Lifann force-pushed the indep-table-save branch from ce1b182 to b374514 Compare July 1, 2022 07:02
@@ -172,8 +172,6 @@ class RedisWrapper<RedisInstance, K, V,
} catch (const std::exception &err) {
  LOG(ERROR) << "RedisHandler error in PipeExecRead for slices "
             << hkey.data() << " -- " << err.what();
  error_ptr = std::current_exception();

This comment was marked as resolved.

@PWZER
Contributor

PWZER commented Mar 13, 2023

This is a good solution, and we want to solve this problem too. When can it be merged?

@MoFHeka
Collaborator

MoFHeka commented Mar 14, 2023

Try this: https://github.com/tensorflow/recommenders-addons/blob/master/docs/api_docs/tfra/dynamic_embedding/FileSystemSaver.md

Labels
None yet
Projects
None yet
5 participants