Live Metrics Filtering Part 6: Error Tracker #43744

Open
wants to merge 30 commits into
base: main
4aa78d3
adding projection functionality, rethink automic double
harsimar Dec 10, 2024
bf15276
atomic double and some other changes
harsimar Dec 11, 2024
4ace19d
merge from master to get p3 changes
harsimar Dec 11, 2024
75718c7
pr comments and some build stuff
harsimar Dec 11, 2024
27d4979
changing guava version
harsimar Dec 11, 2024
619bb47
spotbug fix
harsimar Dec 11, 2024
915d579
starting to add tests
harsimar Dec 12, 2024
9c8fa37
added unit tests for derived metric projection
harsimar Dec 13, 2024
e3377d7
fix conflict from main
harsimar Dec 13, 2024
c599211
reorganize tests
harsimar Dec 13, 2024
a6ab936
fixing inconsistency with main re okio
harsimar Dec 13, 2024
494a1b9
pr comments & small refactorings
harsimar Dec 16, 2024
1351d2b
remove logging, spotless
harsimar Dec 16, 2024
a77adfe
validator
harsimar Dec 17, 2024
68b9c83
minor
harsimar Dec 17, 2024
d80053e
changes to concurrency handling
harsimar Dec 17, 2024
f83c116
moving synchronization to derivedMetricAggregation
harsimar Dec 17, 2024
d8b618f
moving derivedMetricAggregation to its own file
harsimar Dec 17, 2024
77f176a
merge projection into here
harsimar Dec 18, 2024
96e705f
validator round 1 - need to refactor for validator to store errors
harsimar Dec 18, 2024
6e81b55
finish implementation and starting to add tests
harsimar Dec 19, 2024
568f059
merge from main
harsimar Dec 19, 2024
de6bd93
spotless
harsimar Dec 20, 2024
f492237
remove unused classes
harsimar Dec 20, 2024
191783d
pr comments
harsimar Jan 6, 2025
c3c3c49
fix ci errors
harsimar Jan 6, 2025
0ee6032
initial implementation
harsimar Jan 7, 2025
73f4cc7
merge conflicts and starting to add some logging
harsimar Jan 8, 2025
0c5aae9
logging and testing
harsimar Jan 9, 2025
f1ba015
restructure of error tracking in validator
harsimar Jan 10, 2025
Original file line number Diff line number Diff line change
@@ -252,6 +252,7 @@
<suppress files="com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.filtering.DerivedMetricProjections.java" checks="MissingJavadocTypeCheck" />
<suppress files="com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.filtering.DerivedMetricAggregation.java" checks="MissingJavadocTypeCheck" />
<suppress files="com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.filtering.Validator.java" checks="MissingJavadocTypeCheck" />
<suppress files="com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.filtering.ConfigErrorTracker.java" checks="MissingJavadocTypeCheck" />
<suppress files="com.azure.monitor.opentelemetry.autoconfigure.implementation.statsbeat.CustomDimensions.java" checks="MissingJavadocTypeCheck" />
<suppress files="com.azure.monitor.opentelemetry.autoconfigure.implementation.statsbeat.Feature.java" checks="MissingJavadocTypeCheck" />
<suppress files="com.azure.monitor.opentelemetry.autoconfigure.implementation.statsbeat.FeatureStatsbeat.java" checks="MissingJavadocTypeCheck" />
@@ -368,6 +369,7 @@
<suppress files="com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.filtering.Filter" checks="MissingJavadocMethod" />
<suppress files="com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.filtering.CustomDimensions" checks="MissingJavadocMethod" />
<suppress files="com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.filtering.DerivedMetricProjections" checks="MissingJavadocMethod" />
<suppress files="com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.filtering.ConfigErrorTracker" checks="MissingJavadocMethod" />
<suppress files="com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.swagger.LiveMetricsRestAPIsForClientSDKsBuilder" checks="DenyListedWords" />
<suppress files="com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.swagger.LiveMetricsRestAPIsForClientSDKsBuilder" checks="ServiceClientBuilder" />
<suppress files="com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.filtering.*" checks="JavadocPackage" />
Original file line number Diff line number Diff line change
@@ -85,6 +85,7 @@ private long sendData() {
// in an error state (ping once a minute) if the first ping after the failing post also fails.
long errorDelayInNs = TimeUnit.SECONDS.toNanos(40);
pingSender.resetLastValidRequestTimeNs(dataSender.getLastValidPostRequestTimeNs() - errorDelayInNs);
logger.verbose("Switching to fallback mode.");
return waitOnErrorInMillis;

case QP_IS_OFF:
@@ -95,6 +96,7 @@ private long sendData() {
// sender to go into backoff state immediately instead of waiting 60s to go into backoff state like
// the spec describes. See: https://github.com/aep-health-and-standards/Telemetry-Collection-Spec/blob/main/ApplicationInsights/livemetrics.md#timings
pingSender.resetLastValidRequestTimeNs(dataSender.getLastValidPostRequestTimeNs());
logger.verbose("Switching to ping mode.");
return qpsServicePollingIntervalHintMillis > 0
? qpsServicePollingIntervalHintMillis
: waitBetweenPingsInMillis;
@@ -118,10 +120,12 @@ private long ping() {
collector.setQuickPulseStatus(qpStatus);
switch (qpStatus) {
case ERROR:
logger.verbose("In fallback mode");
return waitOnErrorInMillis;

case QP_IS_ON:
pingMode = false;
logger.verbose("Switching to post mode");
// Below two lines are necessary because there are cases where the last valid request is a ping
// before a failing post. This can happen in cases where authentication fails - pings would return
// http 200 but posts http 401.
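
The log lines added in these hunks trace a two-mode loop: the coordinator posts data while the service is listening and pings otherwise. A rough sketch of those transitions follows. The class name `ModeSketch`, the method names, and every delay value are hypothetical and not taken from the SDK, and the assumption that a failed post returns the loop to ping mode is inferred only from the "fallback" wording in the log message.

```java
// Hypothetical sketch of the ping/post mode switching; not the SDK's coordinator.
final class ModeSketch {
    enum Status { ERROR, QP_IS_OFF, QP_IS_ON }

    // Placeholder delays (assumptions, not values from the source).
    static final long WAIT_ON_ERROR_MILLIS = 10_000L;
    static final long WAIT_BETWEEN_PINGS_MILLIS = 5_000L;
    static final long WAIT_BETWEEN_POSTS_MILLIS = 1_000L;

    boolean pingMode = true;

    // Reacts to the status of a post and returns the delay before the next request.
    long afterPost(Status status, long pollingHintMillis) {
        switch (status) {
            case ERROR:
                pingMode = true; // assumed: fall back to pinging after a failed post
                return WAIT_ON_ERROR_MILLIS;
            case QP_IS_OFF:
                pingMode = true; // nobody is watching, so stop posting
                return pollingHintMillis > 0 ? pollingHintMillis : WAIT_BETWEEN_PINGS_MILLIS;
            default:
                return WAIT_BETWEEN_POSTS_MILLIS;
        }
    }

    // Reacts to the status of a ping and returns the delay before the next request.
    long afterPing(Status status, long pollingHintMillis) {
        switch (status) {
            case ERROR:
                return WAIT_ON_ERROR_MILLIS; // stay in ping mode and retry later
            case QP_IS_ON:
                pingMode = false; // service is listening, start posting
                return WAIT_BETWEEN_POSTS_MILLIS;
            default:
                return pollingHintMillis > 0 ? pollingHintMillis : WAIT_BETWEEN_PINGS_MILLIS;
        }
    }
}
```

The service-provided polling hint wins over the local default only when it is positive, mirroring the ternary in the `QP_IS_OFF` branch above.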
Original file line number Diff line number Diff line change
@@ -29,6 +29,7 @@
import com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.swagger.models.Trace;
import com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.swagger.models.TelemetryType;
import com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.swagger.models.AggregationType;
import com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.swagger.models.CollectionConfigurationError;
import com.azure.monitor.opentelemetry.autoconfigure.implementation.utils.CpuPerformanceCounterCalculator;
import reactor.util.annotation.Nullable;

@@ -59,7 +60,6 @@ final class QuickPulseDataCollector {

private volatile Supplier<String> instrumentationKeySupplier;

// TODO (harskaur): Track projection (runtime) related errors in future PR
private final AtomicReference<FilteringConfiguration> configuration;

QuickPulseDataCollector(AtomicReference<FilteringConfiguration> configuration) {
@@ -77,7 +77,8 @@ synchronized void disable() {

synchronized void enable(Supplier<String> instrumentationKeySupplier) {
this.instrumentationKeySupplier = instrumentationKeySupplier;
counters.set(new Counters(configuration.get().getValidProjectionInitInfo()));
FilteringConfiguration config = configuration.get();
counters.set(new Counters(config.getValidProjectionInitInfo(), config.getErrors()));
}

synchronized void setQuickPulseStatus(QuickPulseStatus quickPulseStatus) {
@@ -91,7 +92,9 @@ synchronized QuickPulseStatus getQuickPulseStatus() {

@Nullable
synchronized FinalCounters getAndRestart() {
Counters currentCounters = counters.getAndSet(new Counters(configuration.get().getValidProjectionInitInfo()));
FilteringConfiguration config = configuration.get();
Counters currentCounters
= counters.getAndSet(new Counters(config.getValidProjectionInitInfo(), config.getErrors()));
if (currentCounters != null) {
return new FinalCounters(currentCounters);
}
@@ -180,7 +183,6 @@ private void applyMetricFilters(TelemetryColumns columns, TelemetryType telemetr
List<DerivedMetricInfo> metricsConfig = currentConfig.fetchMetricConfigForTelemetryType(telemetryType);
for (DerivedMetricInfo derivedMetricInfo : metricsConfig) {
if (Filter.checkMetricFilters(derivedMetricInfo, columns)) {
// TODO (harskaur): In future PR, track any error that comes from calculateProjection
currentCounters.derivedMetrics.calculateProjection(derivedMetricInfo, columns);
}
}
@@ -411,6 +413,8 @@ class FinalCounters {

final Map<String, Double> projections;

final List<CollectionConfigurationError> configErrors;

private FinalCounters(Counters currentCounters) {

processPhysicalMemory = getPhysicalMemory(memory);
@@ -431,6 +435,7 @@ private FinalCounters(Counters currentCounters) {
this.documentList.addAll(currentCounters.documentList);
}
this.projections = currentCounters.derivedMetrics.fetchFinalDerivedMetricValues();
this.configErrors = currentCounters.configErrors;

}

@@ -486,8 +491,11 @@ static class Counters {

final DerivedMetricProjections derivedMetrics;

Counters(Map<String, AggregationType> projectionInfo) {
final List<CollectionConfigurationError> configErrors;

Counters(Map<String, AggregationType> projectionInfo, List<CollectionConfigurationError> errors) {
derivedMetrics = new DerivedMetricProjections(projectionInfo);
configErrors = errors;
}

static long encodeCountAndDuration(long count, long duration) {
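
The collector changes above attach the current configuration's errors to each fresh `Counters` instance, so every reporting interval carries the errors of the configuration it was collected under. A minimal sketch of that swap pattern follows. `CounterSwapSketch`, its fields, and the use of plain strings for errors are hypothetical simplifications; the real classes hold many more counters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of "swap in fresh counters, hand back the old ones".
final class CounterSwapSketch {
    static final class Counters {
        final List<String> configErrors; // errors of the config active when created
        long requests;

        Counters(List<String> configErrors) {
            this.configErrors = configErrors;
        }
    }

    private final AtomicReference<Counters> counters = new AtomicReference<>();
    private final AtomicReference<List<String>> currentConfigErrors
        = new AtomicReference<>(Collections.<String>emptyList());

    // Stands in for a new filtering configuration arriving from the service.
    synchronized void updateConfigErrors(List<String> errors) {
        currentConfigErrors.set(new ArrayList<>(errors));
    }

    synchronized void enable() {
        counters.set(new Counters(currentConfigErrors.get()));
    }

    synchronized void recordRequest() {
        Counters current = counters.get();
        if (current != null) {
            current.requests++;
        }
    }

    // Returns the finished interval (or null if collection was never enabled)
    // and starts a new interval in the same step.
    synchronized Counters getAndRestart() {
        return counters.getAndSet(new Counters(currentConfigErrors.get()));
    }
}
```

Reading the configuration once per swap, as the diff does with the local `config` variable, keeps the projection info and the error list consistent with each other even if a new configuration arrives between the two getter calls.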
Original file line number Diff line number Diff line change
@@ -93,6 +93,7 @@ private MonitoringDataPoint buildMonitoringDataPoint(QuickPulseDataCollector.Fin
point.setVersion(sdkVersion);
point.setTimestamp(OffsetDateTime.now());
point.setMetrics(addMetricsToMonitoringDataPoint(counters));
point.setCollectionConfigurationErrors(counters.configErrors);
return point;
}

Original file line number Diff line number Diff line change
@@ -79,13 +79,15 @@ public void run() {
dataPointList.add(point);
Date currentDate = new Date();
long transmissionTimeInTicks = currentDate.getTime() * 10000 + TICKS_AT_EPOCH;
String etag = configuration.get().getETag();

logger.verbose("Attempting to send data points to quickpulse with etag {}: {}", etag,
printListOfMonitoringPoints(dataPointList));

try {
// TODO (harskaur): remove logging when manual testing done
logger.verbose("Monitoring point: {}", point.toJsonString());
logger.verbose("etag: {}", configuration.get().getETag());
Response<CollectionConfigurationInfo> responseMono = liveMetricsRestAPIsForClientSDKs
.publishNoCustomHeadersWithResponseAsync(endpointPrefix, instrumentationKey.get(),
configuration.get().getETag(), transmissionTimeInTicks, dataPointList)
.publishNoCustomHeadersWithResponseAsync(endpointPrefix, instrumentationKey.get(), etag,
transmissionTimeInTicks, dataPointList)
.block();
if (responseMono == null) {
// this shouldn't happen, the mono should complete with a response or a failure
@@ -105,17 +107,17 @@ public void run() {

lastValidRequestTimeNs = sendTime;
CollectionConfigurationInfo body = responseMono.getValue();
if (body != null && !configuration.get().getETag().equals(body.getETag())) {
if (body != null && !etag.equals(body.getETag())) {
configuration.set(new FilteringConfiguration(body));
// TODO (harskaur): remove logging when manual testing done
try {
logger.verbose("passed in config {}", body.toJsonString());
logger.verbose("Received a new live metrics filtering configuration from post response: {}",
body.toJsonString());
} catch (IOException e) {
logger.error(e.getMessage());
logger.verbose(e.getMessage());
}
}

} catch (RuntimeException | IOException e) { // this includes ServiceErrorException & RuntimeException thrown from quickpulse post api
} catch (RuntimeException e) { // this includes ServiceErrorException & RuntimeException thrown from quickpulse post api
onPostError(sendTime);
logger.error(
"QuickPulseDataSender received a service error while attempting to send data to quickpulse {}",
@@ -136,6 +138,20 @@ private void onPostError(long sendTime) {
}
}

private String printListOfMonitoringPoints(List<MonitoringDataPoint> points) {
StringBuilder dataPointsPrint = new StringBuilder("[");
for (MonitoringDataPoint p : points) {
try {
dataPointsPrint.append(p.toJsonString());
dataPointsPrint.append("\n");
} catch (IOException e) {
logger.verbose(e.getMessage());
}
}
dataPointsPrint.append("]");
return dataPointsPrint.toString();
}

public void setRedirectEndpointPrefix(String endpointPrefix) {
this.redirectEndpointPrefix = endpointPrefix;
}
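
The `transmissionTimeInTicks` expression in `run()` converts Unix epoch milliseconds into 100-nanosecond ticks counted from 0001-01-01T00:00:00Z. A small worked sketch of that arithmetic follows. The value of `TICKS_AT_EPOCH` is not shown in this diff, so the constant below is an assumption (it is the well-known tick count at the Unix epoch), and `TicksSketch` is a hypothetical name.

```java
// Hypothetical sketch of the milliseconds-to-ticks conversion.
final class TicksSketch {
    // Ticks (100 ns units) between 0001-01-01T00:00:00Z and 1970-01-01T00:00:00Z.
    // Assumed to match the TICKS_AT_EPOCH constant, whose value this diff does not show.
    static final long TICKS_AT_EPOCH = 621_355_968_000_000_000L;

    // One millisecond is 10,000 ticks of 100 ns each.
    static final long TICKS_PER_MILLI = 10_000L;

    static long toTicks(long epochMillis) {
        return epochMillis * TICKS_PER_MILLI + TICKS_AT_EPOCH;
    }
}
```

For example, one second after the epoch is `toTicks(1000L)`, which adds 10,000,000 ticks to the epoch offset.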
Original file line number Diff line number Diff line change
@@ -17,6 +17,7 @@
import com.azure.monitor.opentelemetry.autoconfigure.implementation.utils.Strings;
import reactor.util.annotation.Nullable;

import java.io.IOException;
import java.net.URL;
import java.util.Date;
import java.util.concurrent.atomic.AtomicBoolean;
@@ -104,6 +105,12 @@ IsSubscribedHeaders ping(String redirectedEndpoint) {

CollectionConfigurationInfo body = responseMono.getValue();
if (body != null && !configuration.get().getETag().equals(body.getETag())) {
try {
logger.verbose("Received a new live metrics filtering configuration from ping response: {}",
body.toJsonString());
} catch (IOException e) {
logger.verbose(e.getMessage());
}
configuration.set(new FilteringConfiguration(body));
}
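
Both the ping and post paths replace the shared configuration only when the ETag in the response differs from the one currently held. A minimal sketch of that gate follows. `ETagSwapSketch` and its nested `Config` type are hypothetical stand-ins for `FilteringConfiguration` and `CollectionConfigurationInfo`.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of an ETag-gated configuration swap.
final class ETagSwapSketch {
    static final class Config {
        final String eTag;
        final String payload;

        Config(String eTag, String payload) {
            this.eTag = eTag;
            this.payload = payload;
        }
    }

    private final AtomicReference<Config> configuration
        = new AtomicReference<>(new Config("", "{}"));

    // Returns true when the received configuration replaced the current one.
    boolean onResponse(Config received) {
        // Read the known ETag once so the comparison and the request that
        // produced the response refer to the same configuration.
        String knownETag = configuration.get().eTag;
        if (received != null && !knownETag.equals(received.eTag)) {
            configuration.set(received);
            return true;
        }
        return false;
    }

    String currentETag() {
        return configuration.get().eTag;
    }
}
```

Because validation runs only when a swap happens, per-configuration log messages and error lists are produced once per configuration change rather than once per request.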

Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.
package com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.filtering;

import com.azure.core.util.logging.ClientLogger;
import com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.swagger.models.CollectionConfigurationError;
import com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.swagger.models.CollectionConfigurationErrorType;
import com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.swagger.models.KeyValuePairString;

import java.util.ArrayList;
import java.util.List;

public class ConfigErrorTracker {
private final List<CollectionConfigurationError> errors = new ArrayList<>();

private static final ClientLogger LOGGER = new ClientLogger(ConfigErrorTracker.class);

public void constructAndTrackCollectionConfigurationError(String message, String eTag, String id,
boolean isDerivedMetricId) {
CollectionConfigurationError error = new CollectionConfigurationError();
error.setMessage(message);
error.setCollectionConfigurationErrorType(setErrorType(message));

KeyValuePairString keyValuePair1 = new KeyValuePairString();
keyValuePair1.setKey("ETag");
keyValuePair1.setValue(eTag);

KeyValuePairString keyValuePair2 = new KeyValuePairString();
keyValuePair2.setKey(isDerivedMetricId ? "DerivedMetricInfoId" : "DocumentStreamInfoId");
keyValuePair2.setValue(id);

List<KeyValuePairString> data = new ArrayList<>();
data.add(keyValuePair1);
data.add(keyValuePair2);

error.setData(data);

errors.add(error);
// This message gets logged once for every error we see on config validation. Config validation
// only happens once per config change.
LOGGER.verbose("{}. Due to this misconfiguration the {} rule with id {} will be ignored by the SDK.", message,
isDerivedMetricId ? "derived metric" : "document filter conjunction", id);
}

private CollectionConfigurationErrorType setErrorType(String message) {
if (message.contains("telemetry type")) {
return CollectionConfigurationErrorType.METRIC_TELEMETRY_TYPE_UNSUPPORTED;
} else if (message.contains("duplicate metric id")) {
return CollectionConfigurationErrorType.METRIC_DUPLICATE_IDS;
}
return CollectionConfigurationErrorType.FILTER_FAILURE_TO_CREATE_UNEXPECTED;
}

public List<CollectionConfigurationError> getErrors() {
return new ArrayList<CollectionConfigurationError>(errors);
}
}
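
`ConfigErrorTracker` derives the error type from substrings of the message, so the result depends on the exact wording and casing that the validator uses. The following standalone sketch reproduces only that mapping. `ErrorTypeSketch` and its local enum are hypothetical stand-ins for the generated `CollectionConfigurationErrorType`, and the sample messages in the usage note are invented.

```java
// Hypothetical sketch of the message-based error classification.
final class ErrorTypeSketch {
    enum ErrorType {
        METRIC_TELEMETRY_TYPE_UNSUPPORTED,
        METRIC_DUPLICATE_IDS,
        FILTER_FAILURE_TO_CREATE_UNEXPECTED
    }

    static ErrorType classify(String message) {
        // String.contains is case-sensitive, so callers must use these exact phrases.
        if (message.contains("telemetry type")) {
            return ErrorType.METRIC_TELEMETRY_TYPE_UNSUPPORTED;
        } else if (message.contains("duplicate metric id")) {
            return ErrorType.METRIC_DUPLICATE_IDS;
        }
        // Anything unrecognized falls through to the generic filter failure.
        return ErrorType.FILTER_FAILURE_TO_CREATE_UNEXPECTED;
    }
}
```

With this mapping, a message such as "The telemetry type Event is not supported" is classified as `METRIC_TELEMETRY_TYPE_UNSUPPORTED`, while "Duplicate metric id" with a capital D falls through to `FILTER_FAILURE_TO_CREATE_UNEXPECTED`. Passing the error type explicitly from the validator would avoid that coupling, at the cost of a wider method signature.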
Original file line number Diff line number Diff line change
@@ -2,13 +2,15 @@
// Licensed under the MIT License.
package com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.filtering;

import com.azure.core.util.logging.ClientLogger;
import com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.swagger.models.FilterInfo;

import java.util.HashMap;
import java.util.Map;

public class CustomDimensions {
private final Map<String, String> customDimensions;
private static ClientLogger logger = new ClientLogger(CustomDimensions.class);
Contributor

Suggested change
private static ClientLogger logger = new ClientLogger(CustomDimensions.class);
private static final ClientLogger logger = new ClientLogger(CustomDimensions.class);


public CustomDimensions(Map<String, String> customDimensions, Map<String, Double> customMeasurements) {
Map<String, String> resultMap = new HashMap<>();
@@ -47,9 +49,13 @@ public double getCustomDimValueForProjection(String key) {
try {
return Double.parseDouble(value);
} catch (NumberFormatException e) {

logger.verbose(
"The value for the custom dimension could not be converted to a numeric value for a derived metric projection");
Contributor

Suggested change
"The value for the custom dimension could not be converted to a numeric value for a derived metric projection");
value + " for the custom dimension could not be converted to a numeric value for a derived metric projection");

}
return Double.NaN;
}
logger.verbose(
"The custom dimension could not be found in this telemetry item when calculating a derived metric.");
return Double.NaN;
}
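
`getCustomDimValueForProjection` signals both "key missing" and "value not numeric" with `Double.NaN` instead of throwing. A self-contained sketch of that lookup follows. `CustomDimParseSketch` and `valueForProjection` are hypothetical names, and logging is omitted.

```java
import java.util.Map;

// Hypothetical sketch of the parse-or-NaN lookup used for projections.
final class CustomDimParseSketch {
    static double valueForProjection(Map<String, String> dimensions, String key) {
        String value = dimensions.get(key);
        if (value == null) {
            return Double.NaN; // dimension not present on this telemetry item
        }
        try {
            return Double.parseDouble(value);
        } catch (NumberFormatException e) {
            return Double.NaN; // present, but not interpretable as a number
        }
    }
}
```

Returning a sentinel keeps the per-item hot path free of exceptions for callers, which only need a `Double.isNaN` check to decide whether to count the item.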

Original file line number Diff line number Diff line change
@@ -3,6 +3,7 @@

package com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.filtering;

import com.azure.core.util.logging.ClientLogger;
import com.azure.monitor.opentelemetry.autoconfigure.implementation.models.RemoteDependencyData;
import com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.swagger.models.FilterInfo;
import com.azure.monitor.opentelemetry.autoconfigure.implementation.utils.FormattedDuration;
@@ -15,17 +16,26 @@ public class DependencyDataColumns implements TelemetryColumns {
private final CustomDimensions customDims;
private final Map<String, Object> mapping = new HashMap<>();

private static ClientLogger logger = new ClientLogger(DependencyDataColumns.class);
Contributor

Suggested change
private static ClientLogger logger = new ClientLogger(DependencyDataColumns.class);
private static final ClientLogger logger = new ClientLogger(DependencyDataColumns.class);


public DependencyDataColumns(RemoteDependencyData rdData) {
customDims = new CustomDimensions(rdData.getProperties(), rdData.getMeasurements());
mapping.put(KnownDependencyColumns.TARGET, rdData.getTarget());
mapping.put(KnownDependencyColumns.DURATION,
FormattedDuration.getDurationFromTelemetryItemDurationString(rdData.getDuration()));

long durationMicroSec = FormattedDuration.getDurationFromTelemetryItemDurationString(rdData.getDuration());
if (durationMicroSec == -1) {
logger.verbose("The provided duration could not be converted to microseconds: {}", rdData.getDuration());
}
mapping.put(KnownDependencyColumns.DURATION, durationMicroSec);

mapping.put(KnownDependencyColumns.SUCCESS, rdData.isSuccess());
mapping.put(KnownDependencyColumns.NAME, rdData.getName());
int resultCode;
try {
resultCode = Integer.parseInt(rdData.getResultCode());
} catch (NumberFormatException e) {
logger.verbose("The provided result code could not be converted to a numeric value: {}",
Contributor

Suggested change
logger.verbose("The provided result code could not be converted to a numeric value: {}",
logger.verbose(rdData.getResultCode() + " result code could not be converted to a numeric value: {}",

rdData.getResultCode());
resultCode = -1;
}
mapping.put(KnownDependencyColumns.RESULT_CODE, resultCode);
Original file line number Diff line number Diff line change
@@ -3,6 +3,7 @@

package com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.filtering;

import com.azure.core.util.logging.ClientLogger;
import com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.swagger.models.AggregationType;
import com.azure.monitor.opentelemetry.autoconfigure.implementation.quickpulse.swagger.models.DerivedMetricInfo;

@@ -14,6 +15,8 @@ public class DerivedMetricProjections {
public static final String COUNT = "Count()";
private final Map<String, DerivedMetricAggregation> derivedMetricValues = new HashMap<>();

private static final ClientLogger LOGGER = new ClientLogger(DerivedMetricProjections.class);

public DerivedMetricProjections(Map<String, AggregationType> projectionInfo) {
for (Map.Entry<String, AggregationType> entry : projectionInfo.entrySet()) {
AggregationType aggregationType = entry.getValue();
@@ -61,7 +64,10 @@ public void calculateProjection(DerivedMetricInfo derivedMetricInfo, TelemetryCo
// For now, such cases produce Double.NaN and get skipped when calculating projection.
}

if (!Double.isNaN(incrementBy)) {
if (Double.isNaN(incrementBy)) {
LOGGER.verbose(
"This telemetry item will not be counted in derived metric projections because the Duration or a CustomDimension column could not be interpreted as a numeric value.");
} else {
calculateAggregation(derivedMetricInfo.getId(), incrementBy);
}
}
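
The `Double.isNaN` guard in `calculateProjection` matters because a single NaN added to a running sum turns the whole aggregate into NaN, and because `x == Double.NaN` is always false in Java. A minimal sketch of a sum aggregation that skips such values follows. `SumSketch` is a hypothetical stand-in for `DerivedMetricAggregation`, without its synchronization.

```java
// Hypothetical sketch of a sum aggregation that skips NaN increments.
final class SumSketch {
    private double sum;
    private long counted;

    void add(double incrementBy) {
        // Double.isNaN is required here: comparing with == Double.NaN never matches.
        if (Double.isNaN(incrementBy)) {
            return; // item is left out of the projection
        }
        sum += incrementBy;
        counted++;
    }

    double sum() {
        return sum;
    }

    long counted() {
        return counted;
    }
}
```

Adding 1.0, NaN, and 2.0 in that order leaves the sum at 3.0 with two items counted.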