
Incorrect usage of sys.getsizeof to calculate the byte size of event data #236

Open
asymness opened this issue Oct 3, 2024 · 1 comment

asymness commented Oct 3, 2024

Description
The trigger_batch and trigger functions in this library use sys.getsizeof(event['data']) to measure the size of the event data. However, sys.getsizeof() returns the in-memory size of the Python object, including interpreter overhead, and does not represent the actual byte size of the data when encoded for transmission over HTTP. This leads to false positives when checking against the 10KB limit, raising ValueError: Too much data even when the data is within acceptable limits.

@request_method
def trigger_batch(self, batch=[], already_encoded=False):
    """Trigger multiple events with a single HTTP call.

    http://pusher.com/docs/rest_api#method-post-batch-events
    """
    if not already_encoded:
        for event in batch:
            validate_channel(event['channel'])

            event_name = ensure_text(event['name'], "event_name")
            if len(event['name']) > 200:
                raise ValueError("event_name too long")

            event['data'] = data_to_string(event['data'], self._json_encoder)
            if sys.getsizeof(event['data']) > 10240:
                raise ValueError("Too much data")

            if is_encrypted_channel(event['channel']):
                event['data'] = json.dumps(encrypt(event['channel'], event['data'], self._encryption_master_key), ensure_ascii=False)

    params = {
        'batch': batch}

    return Request(
        self, POST, "/apps/%s/batch_events" % self.app_id, params)

Steps to Reproduce:

  • Prepare a batch of events containing a good number of non-ASCII characters, where the data size is below 10KB when encoded in UTF-8 (a reproduction sketch follows this list).
  • Use the trigger_batch function to send the batch.
  • Observe that a ValueError is raised despite the data being within the size limit.
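For concreteness, here is a minimal reproduction sketch. The Pusher credentials are placeholders, and the payload is constructed to match the sizes in the logging output below: mostly ASCII plus a single character outside Latin-1, which keeps the UTF-8 size well under 10KB while pushing sys.getsizeof past the 10KB threshold.

    import sys
    from pusher import Pusher

    # Placeholder credentials.
    client = Pusher(app_id="APP_ID", key="KEY", secret="SECRET", cluster="mt1")

    # 5,670 ASCII characters plus one "≥" (U+2265). CPython then stores the
    # whole string at 2 bytes per character, so sys.getsizeof reports 11,416
    # bytes here even though the UTF-8 encoding is only 5,673 bytes.
    data = "a" * 5670 + "≥"
    assert len(data.encode("utf-8")) < 10240   # 5,673 bytes: within the limit
    assert sys.getsizeof(data) > 10240         # 11,416 on 64-bit CPython 3.x

    # Raises ValueError: Too much data, despite the payload being within limits.
    client.trigger_batch([
        {"channel": "my-channel", "name": "my-event", "data": data},
    ])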

Upon modifying the trigger_batch function to add some logging as follows:

    @request_method
    def trigger_batch(self, batch=[], already_encoded=False):
        """Trigger multiple events with a single HTTP call.

        http://pusher.com/docs/rest_api#method-post-batch-events
        """
        if not already_encoded:
            for event in batch:
                validate_channel(event['channel'])

                event_name = ensure_text(event['name'], "event_name")
                if len(event['name']) > 200:
                    raise ValueError("event_name too long")

                event['data'] = data_to_string(event['data'], self._json_encoder)
                print("---- EVENT SIZE DETAILS ----")
                print("len(event['data'])")
                print(len(event['data']))
                print()
                print("sys.getsizeof(event['data'])")
                print(sys.getsizeof(event['data']))
                print()
                print("len(event['data'].encode('utf-8'))")
                print(len(event['data'].encode('utf-8')))
                print()
                
                if sys.getsizeof(event['data']) > 10240:
                    raise ValueError("Too much data")

                if is_encrypted_channel(event['channel']):
                    event['data'] = json.dumps(encrypt(event['channel'], event['data'], self._encryption_master_key), ensure_ascii=False)

        params = {
            'batch': batch}

        return Request(
            self, POST, "/apps/%s/batch_events" % self.app_id, params)

I get the following output:

---- EVENT SIZE DETAILS ----
len(event['data'])
5778  <-- Length of the string

sys.getsizeof(event['data'])  
5827  <-- In-memory object size

len(event['data'].encode('utf-8'))
5778  <-- Actual byte-size of the data

---- EVENT SIZE DETAILS ----
len(event['data'])
5671  <-- Length of the string

sys.getsizeof(event['data'])
11416  <-- In-memory object size

len(event['data'].encode('utf-8'))
5673  <-- Actual byte-size of the data

ERROR:root:Too much data

Notice how the result of sys.getsizeof and the UTF-8-encoded byte size differ drastically for the last event just because it contains a single non-ASCII character: under CPython's flexible string representation (PEP 393), one character above U+00FF forces the entire string to be stored at two bytes per character, roughly doubling the reported in-memory size.

Expected Behavior:
The function should allow sending event data that is under the 10KB limit when encoded, without raising an exception.

Actual Behavior:
A ValueError is raised stating "Too much data" even when the actual encoded data size is under 10KB.

Analysis:
Using sys.getsizeof() is not reliable for measuring the size of the data to be sent over the network. It measures the memory footprint of the object in Python, which can include additional overhead and doesn't correspond to the actual size of the encoded data.

Here is further proof of how sys.getsizeof can be wildly inaccurate for calculating the byte size of data:

>>> import sys
>>> a = '2!=1'
>>> b = '2≥1'
>>> 
>>> len(a)
4
>>> len(b)
3
>>> sys.getsizeof(a)
53
>>> sys.getsizeof(b)
80
>>> len(a.encode('utf-8'))
4
>>> len(b.encode('utf-8'))
5
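
The jump comes from CPython's flexible string representation (PEP 393): a str is stored with 1, 2, or 4 bytes per character depending on the widest code point it contains, so sys.getsizeof reflects that storage width plus object overhead, not any wire encoding. A small illustration (sizes observed on 64-bit CPython 3.x; exact overheads vary by version):

    import sys

    ascii_s = "a" * 1000         # all ASCII:       1 byte per char
    ucs2_s  = "a" * 999 + "≥"    # one BMP char:    2 bytes per char
    ucs4_s  = "a" * 999 + "😀"   # one astral char: 4 bytes per char

    for s in (ascii_s, ucs2_s, ucs4_s):
        print(len(s), sys.getsizeof(s), len(s.encode("utf-8")))
    # 1000 1049 1000  <-- getsizeof ~= 49 + n
    # 1000 2074 1002  <-- getsizeof ~= 74 + 2n
    # 1000 4076 1003  <-- getsizeof ~= 76 + 4n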

Proposed Solution:
Replace the size check using sys.getsizeof(event['data']) with len(event['data'].encode('utf-8')) to accurately measure the byte size of the data when encoded.
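
A minimal sketch of that change inside trigger_batch (the same one-line substitution applies to trigger):

    event['data'] = data_to_string(event['data'], self._json_encoder)
    # Measure the payload as it will actually be transmitted,
    # not the in-memory footprint of the Python str object.
    if len(event['data'].encode('utf-8')) > 10240:
        raise ValueError("Too much data")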


@evgeniibreikin commented

Hi, LGTM, could you create a PR?
