Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up accessors for time series and supplemental attributes #416

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

daniel-thom
Copy link
Contributor

@daniel-thom daniel-thom commented Dec 6, 2024

Latency fixes

The existing code had a few issues that caused slowness when calling has_time_series or get_time_series:

  • SQLite indexes were not present when de-serializing a PSB case. I'm actually still not sure why. They were correctly set after to_json(sys) followed by System(file.json). The new code ensures that they are always set. This could have caused latency of up to 30-40 us on each query.
  • Calling SQLite.DBInterface.execute(db, "a sql query as string") has an overhead of ~3-4 us. The delay seems to come from compiling the SQL statement. I'll note that Python does not suffer from this latency. Interpolating strings causes additional latency. This new code caches the statements. It also avoids string interpolations, but only for the has_time_series code path. This is perhaps over-engineered. If this still isn't fast enough for PSI, we need to re-think the entire idea and allow the application to defined what time series attributes should be cached.
  • Calling get_time_series results in a SQL query to retrieve the time series metadata, which is stored as a JSON string, and then deserialize it into an object. That takes ~30 us. In most cases, this is harmless. Reading the data from the disk takes a lot longer than that. Some users might want to use the in-memory time series storage, and so might notice this delay. To solve that I split the metadata-as-json data into a separate table and added caching of the real metadata objects. De-serialization will only occur on the first read request.
  • The existing code has a bug that causes has_time_series(component, Deterministic) to return false if the component has DeterministicSingleTimeSeries. This PR fixes the bug.
  • The fact that we expect has_time_series(component, Deterministic) to return true when only DeterministicSingleTimeSeries is present causes extra latency. We could remove that latency by changing the requirement (make the user ask for the exact type stored) or add another column to the database table. We agreed to change this behavior in the next release.

Bulk time series addition

This PR introduces a simplification to bulk time series addition. Previously, the user had to call bulk_add_time_series!() with an iterable of TimeSeriesAssociation instances. The code performed bulk database queries to check for duplicates and insert new rows. That was misguided. The code is now easier to write for users. They will do this instead:

begin_time_series_update(sys) do
    for loop over components and time series
        add_time_series!(sys, component, time_series)
    end
end

The new code is a ~10% faster than the old code. The user can still call the existing function bulk_add_time_series!. It redirects to the new code. We can deprecate that function now and remove it in the next major release.

Supplemental attributes

  • Improves query performance when searching for components attached to attributes or attributes attached to components.
  • Adds a procedure for bulk addition of attributes, similar to that for time series.

Fixes #417 and #382

@daniel-thom daniel-thom force-pushed the fix/faster-has-time-series branch from 6c3dbab to e2b7959 Compare December 7, 2024 00:24
Copy link

codecov bot commented Dec 7, 2024

Codecov Report

Attention: Patch coverage is 94.05941% with 18 lines in your changes missing coverage. Please review.

Project coverage is 77.90%. Comparing base (0f39163) to head (95f9851).

Files with missing lines Patch % Lines
src/time_series_metadata_store.jl 93.71% 12 Missing ⚠️
src/supplemental_attribute_associations.jl 94.44% 2 Missing ⚠️
src/time_series_manager.jl 95.74% 2 Missing ⚠️
src/utils/sqlite.jl 75.00% 2 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #416      +/-   ##
==========================================
- Coverage   78.07%   77.90%   -0.17%     
==========================================
  Files          69       69              
  Lines        5504     5612     +108     
==========================================
+ Hits         4297     4372      +75     
- Misses       1207     1240      +33     
Flag Coverage Δ
unittests 77.90% <94.05%> (-0.17%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/supplemental_attribute_manager.jl 94.11% <100.00%> (+0.43%) ⬆️
src/system_data.jl 91.86% <100.00%> (+0.05%) ⬆️
src/time_series_interface.jl 95.58% <100.00%> (ø)
src/supplemental_attribute_associations.jl 92.61% <94.44%> (-0.68%) ⬇️
src/time_series_manager.jl 94.49% <95.74%> (-1.38%) ⬇️
src/utils/sqlite.jl 73.17% <75.00%> (+0.44%) ⬆️
src/time_series_metadata_store.jl 93.41% <93.71%> (-0.08%) ⬇️

... and 5 files with indirect coverage changes

@daniel-thom daniel-thom force-pushed the fix/faster-has-time-series branch 3 times, most recently from bb3a3b3 to bfe1415 Compare December 10, 2024 00:42
@daniel-thom daniel-thom marked this pull request as draft December 10, 2024 01:24
@@ -567,7 +567,7 @@ function _transform_single_time_series!(
horizon = params.horizon,
time_series_type = DeterministicSingleTimeSeries,
scaling_factor_multiplier = get_scaling_factor_multiplier(metadata),
internal = get_internal(metadata),
internal = InfrastructureSystemsInternal(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous behavior was incorrect, but was harmless. It caused the metadata of the transformed object to have the same UUID, but no code ever looked at that.

@@ -110,7 +110,11 @@ function add_time_series!(mgr::TimeSeriesManager, batch::Vector{TimeSeriesAssoci
for uuid in new_ts_uuids
serialize_time_series!(mgr.data_store, time_series_uuids[uuid])
end
add_metadata!(mgr.metadata_store, owners, all_metadata)
if length(all_metadata) == 1
add_metadata!(mgr.metadata_store, owners[1], all_metadata[1])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is faster than calling the the array version.

@@ -23,14 +23,17 @@ const SUPPLEMENTAL_ATTRIBUTE_TABLE_NAME = "supplemental_attributes"

mutable struct SupplementalAttributeAssociations
db::SQLite.DB
# If we don't cache SQL statements, there is a cost of 3-4 us on every query.
cached_statements::Dict{String, SQLite.Stmt}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some minor changes to the supplemental attribute associations, but did not profile them. If the time series changes work out, I can pursue these in greater detail.

# Caching compiled SQL statements saves 3-4 us per query query.
# DBInterface.jl does something similar with @prepare.
# We need this to be tied to our connection.
cached_statements::Dict{String, SQLite.Stmt}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This applies to selective queries that we run frequently.

# This caching allows the code to skip some string interpolations.
# It is experimental for PowerSimulations, which calls has_metadata frequently.
# It may not be necessary. Savings are less than 1 us.
has_metadata_statements::Dict{HasMetadataQueryKey, SQLite.Stmt}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cached applies only to the has_time_series path, which is the most important to optimize for PSI. And is possibly over-engineered.

@daniel-thom daniel-thom force-pushed the fix/faster-has-time-series branch from 47db714 to 297a698 Compare December 10, 2024 18:04
@jd-lara
Copy link
Member

jd-lara commented Dec 11, 2024

@daniel-thom now there is a workaround in PowerSimulations to handle this case "The fact that we expect has_time_series(component, Deterministic) to return true when only DeterministicSingleTimeSeries is present causes extra latency." and any impact of that can be addressed.

@daniel-thom
Copy link
Contributor Author

@daniel-thom now there is a workaround in PowerSimulations to handle this case "The fact that we expect has_time_series(component, Deterministic) to return true when only DeterministicSingleTimeSeries is present causes extra latency." and any impact of that can be addressed.

Currently, get_time_series(Deterministic, ...) will return a DeterministicSingleTimeSeries. has_time_series doesn't, and that is a bug. We need to make the behaviors consistent. Changing get_time_series could be a big API breakage for users. Let's test the performance on a large system before we make a decision.

@jd-lara
Copy link
Member

jd-lara commented Jan 7, 2025

@daniel-thom I tested this branch and it should be merged. These are the results on the EI

main:

Build of The EI: 407.636684 seconds (3.26 G allocations: 109.041 GiB, 43.34% gc time, 17.21% compilation time: 2% of which was recompilation)

Loop 10 times over all StaticInjections and call has_time_series on the EI: 2939.589924 seconds (22.27 M allocations: 1.222 GiB, 0.01% gc time, 0.00% compilation time)

This branch:

Build of The EI: 384.807498 seconds (3.25 G allocations: 108.283 GiB, 44.03% gc time, 17.02% compilation time: 2% of which was recompilation)

Loop 10 times over all StaticInjections and call has_time_series on the EI: 1.538067 seconds (14.96 M allocations: 727.982 MiB, 5.98% gc time, 2.98% compilation time)

@daniel-thom daniel-thom force-pushed the fix/faster-has-time-series branch from b71e91a to 42c4f21 Compare January 10, 2025 20:33
@daniel-thom daniel-thom marked this pull request as ready for review January 10, 2025 20:57
@daniel-thom daniel-thom force-pushed the fix/faster-has-time-series branch from 42c4f21 to d5cac54 Compare January 10, 2025 21:03
@daniel-thom daniel-thom force-pushed the fix/faster-has-time-series branch from d5cac54 to 95f9851 Compare January 10, 2025 21:05
@daniel-thom daniel-thom changed the title Fix: Make has_time_series faster Speed up accessors for time series and supplemental attributes Jan 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Performance for adding multiple supplemental attributes IS
2 participants