Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[connector/failover] Support queuing data in failover connector #33007

Open
sinkingpoint opened this issue May 13, 2024 · 21 comments
Open

[connector/failover] Support queuing data in failover connector #33007

sinkingpoint opened this issue May 13, 2024 · 21 comments
Assignees
Labels
connector/failover enhancement New feature or request

Comments

@sinkingpoint
Copy link
Contributor

sinkingpoint commented May 13, 2024

Component(s)

connector/failover

Is your feature request related to a problem? Please describe.

When using the failover connector, we've found it beneficial to disable the sending queue for the underlying exporters, so as to failover more quickly in the event of a total failure. This has the downside of causing us to drop telemetry in the event that all of our exporters are down. In this case, it would be helpful for the failover connector to maintain a sending queue that operates over all the underlying exporters, flushing to whatever exporter comes up.

Describe the solution you'd like

I'd like to bring in a queueSender (https://github.com/open-telemetry/opentelemetry-collector/blob/1d52fb9c3cca27f5101b25abebd1a3bfb09bf852/exporter/exporterhelper/queue_sender.go#L31-L43) or similar to the failover connector that new data gets enqueued on to before being flushed to the exporters

Describe alternatives you've considered

We could enable queuing on the underlying exporters, but this has a few disadvantages:

  • It slows the failover connectors ability to respond to failures
  • The underlying data can get dropped as it gets enqueued in a failing exporter, and isn't shipped to the failed-over-to exporter

Additional context

No response

@sinkingpoint sinkingpoint added enhancement New feature or request needs triage New item requiring triage labels May 13, 2024
Copy link
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@djaglowski
Copy link
Member

Makes sense to me. Can I assign this to you @sinkingpoint?

@djaglowski djaglowski removed the needs triage New item requiring triage label May 13, 2024
@akats7
Copy link
Contributor

akats7 commented May 14, 2024

Hey @sinkingpoint, thanks for the issue.

I see your point, I can add this if you’d like. I’m also planning an update atm to the export flow (to retry by sampling data points instead of switching the entire pipeline), and can add this in tandem.

@sinkingpoint
Copy link
Contributor Author

I have a patch internally already, so I can PR that tomorrow. If it doesn't work with your expected changes @akats7 then I'm happy to abandon it in favor of your work

@akats7
Copy link
Contributor

akats7 commented Jun 5, 2024

Hey @sinkingpoint, any update on this?

@sinkingpoint
Copy link
Contributor Author

Hey - apologies, been a bit busy. I think it's best to go with your stuff here. My internal patch is proving weirdly buggy for reasons I can't seem to work out

@djaglowski
Copy link
Member

Hey @sinkingpoint, thanks for the issue.

I see your point, I can add this if you’d like. I’m also planning an update atm to the export flow (to retry by sampling data points instead of switching the entire pipeline), and can add this in tandem.

@akats7 since this sounds like quite a different solution than proposed in this issue, do you mind opening a new issue to describe this? Then I will close this one in favor of the new one.

@akats7
Copy link
Contributor

akats7 commented Jun 17, 2024

Hey @djaglowski,

Sure thing, we can probably leave this issue open as well, since it’s a separate feature and can be tracked independently.

@djaglowski
Copy link
Member

Ok, great. I misunderstood it to be one or the other but if both would be useful then we'll leave this open too.

Copy link
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added Stale and removed Stale labels Aug 19, 2024
@akats7
Copy link
Contributor

akats7 commented Aug 19, 2024

Will pick this back up

@robincw
Copy link

robincw commented Aug 29, 2024

Is it worth considering a queue connector processor component? Potentially separating from existing exporters all the configurable features of queueing/stacking/priority strategies, persistence options, backoff/retry, with good telemetry for monitoring.

@akats7
Copy link
Contributor

akats7 commented Aug 30, 2024

I do actually like the idea of decoupling the queueing from the exporter as a component that can be used for use cases similar to this one.

To me it seems like it might make more sense as a processor though, that sits in front of the exporter portion of a connector. I don’t think a queue connector would be ideal because for the one to many consumption model it would have to have some sort of routing logic, which components like the routing connector and round robin connector already do. I think putting a queue processor in front of the component responsible for the routing achieves the same effect.

@djaglowski what are your thoughts on this.

@djaglowski
Copy link
Member

I agree that a queueing processor makes more sense than a connector. That said, I don't whether anyone will sponsor it. I think it seems like a reasonable idea but don't have time to commit to maintaining it. Seems worth pitching as a new component though.

@akats7
Copy link
Contributor

akats7 commented Sep 5, 2024

Makes sense, I’ll create an issue.

@akats7
Copy link
Contributor

akats7 commented Oct 17, 2024

Hey @djaglowski, wanted to get your thoughts.
I created a new component issue but it looks like there was in fact a queueretry processor that was deprecated a while back prior to the both being included in the exporter interface. That made me think would it possibly make sense to pitch it as something that should be inherently supported in the connector interface as well, since to an extent connectors do function as exporters.

@djaglowski
Copy link
Member

@dmitryax, you've spent a lot more time considering the design implications of how we manage queueing of data. Do you have an opinion about whether connectors should support it as well?

I don't think I'm very tuned into the considerations here but my default is to not add complexity to the connectors framework unless there is a specific reason it needs to be there, so I'd prefer we implement queueing in a processor if we need it earlier in the pipeline in some cases.

@dmitryax
Copy link
Member

I'm not up to speed on the original problem. Do I understand correctly that the goal is to have failover behavior between several exporters? I would imagine that having a failover exporter, not a connector, would be easier for end users. Making it an exporter goes with all the queuing and batching (experimental for now) capabilities OOTB. The problem is that this failover exporter would need the component.Host.GetExporters API that we are trying to deprecate... Have you ever considered making it an exporter at any point?

@akats7
Copy link
Contributor

akats7 commented Nov 12, 2024

That’s an interesting thought, to me it seems like the connector pattern is a better fit than a sort of wrapper exporter, that is responsible for determining which true exporter will be used, I don’t recall if there are other exporters that are doing something similar.

Another nice feature of having it be a connector also is the flexibility to have access to the processor pipeline. In the case you use a different exporter in the case of failure and want to do some additional tagging or something along those lines.

I do agree that it would be great to have the exporter capabilities OOTB, but not sure if it is worth making it less flexible.

@djaglowski
Copy link
Member

I'm not up to speed on the original problem. Do I understand correctly that the goal is to have failover behavior between several exporters? I would imagine that having a failover exporter, not a connector, would be easier for end users. Making it an exporter goes with all the queuing and batching (experimental for now) capabilities OOTB. The problem is that this failover exporter would need the component.Host.GetExporters API that we are trying to deprecate... Have you ever considered making it an exporter at any point?

The original reason to implement failover as a connector instead of an exporter is that it does not have to be associated with any particular type of exporter. Even if you implemented failover as a generic feature of all exporters, similar to exporter queue, this doesn't allow for actual use cases like where your backup destination is a different type. e.g. try to export with otlp but if that fails, then dump to s3.

@namco1992
Copy link
Contributor

Hello, I'd like to follow up on this issue as this is bothering us too. To summarize:

  1. We don't want to add complexity to the connectors framework unless there is a good reason
  2. We'd like to keep the failover as a connector to have it available for all exporters

For point 1, #35803 is proposed but doesn't seem to gain much attention. I also noticed that in open-telemetry/opentelemetry-collector#8122, the team is trying to move the batching function to exporter itself, and eventually deprecate the batch processor. In the meantime, it seems we are shifting to fully synchronous consumption in the exporter once open-telemetry/opentelemetry-collector#11951 is merged. Considering the efforts on the exporter side, it feels a bit awkward to propose a new queue processor.

Point 2 reminds me #36094 that adds the exporter settings to the loadbalancing exporter to make it more robust. I agree that it's valuable to provide the failover capability for all exporters, which also means that we can't add the exporter utility functions like what loadbalancing exporter does.

According to the connector's readme:

A connector is both an exporter and receiver. As the name suggests a Connector connects two pipelines: it emits data as an exporter at the end of one pipeline and consumes data as a receiver at the start of another pipeline.

Reading it gives me an impression that the connector should implement both the exporter and receiver interfaces, and have the capabilities of exporter and receiver. The connector indeed implements both interfaces, but the receiver interface is very loose (only Start and Shutdown methods) and also a subset of exporter interface. In the end the connector interface is identical with the exporter interface, and the implementation looks very much like an exporter, with the main logic in Consume[Logs|Metrics|Traces]. If you follow my train of thought here, it feels natural to treat the connector as a fully-fledged exporter, thus it should have the exporter capabilities OOTB. After all, that's what the diagram here tells us too.

I put together a dirty example with the failover connector and it works as expected, with all the exporter capabilities.

type Config struct {
        // failover configs are omitted
	// exporter helper configs
	exporterhelper.TimeoutConfig `mapstructure:",squash"` // squash ensures fields are correctly decoded in embedded struct.
	exporterhelper.QueueConfig   `mapstructure:"sending_queue"`
	RetryConfig                  configretry.BackOffConfig `mapstructure:"retry_on_failure"`

	// Experimental: This configuration is at the early stage of development and may change without backward compatibility
	// until https://github.com/open-telemetry/opentelemetry-collector/issues/8122 is resolved
	BatcherConfig exporterbatcher.Config `mapstructure:"batcher"`
}

func createLogsToLogs(
	ctx context.Context,
	set connector.Settings,
	cfg component.Config,
	logs consumer.Logs,
) (connector.Logs, error) {
	l, err := newLogsToLogs(set, cfg, logs)
	if err != nil {
		return nil, err
	}
	expSet := exporter.Settings{
		ID:                set.ID,
		TelemetrySettings: set.TelemetrySettings,
		BuildInfo:         set.BuildInfo,
	}
	oCfg := cfg.(*Config)
	return exporterhelper.NewLogs(ctx, expSet, cfg,
		l.ConsumeLogs,
		exporterhelper.WithCapabilities(consumer.Capabilities{MutatesData: false}),
		exporterhelper.WithTimeout(oCfg.TimeoutConfig),
		exporterhelper.WithRetry(oCfg.RetryConfig),
		exporterhelper.WithQueue(oCfg.QueueConfig),
		exporterhelper.WithBatcher(oCfg.BatcherConfig),
		exporterhelper.WithStart(l.Start),
		exporterhelper.WithShutdown(l.Shutdown),
	)
}

To me the end result is not too bad. The change is minimal, it solves the original issue, and also brings the connector closer to what it claims to be, "an exporter and receiver". I'm not sure if this is considered as antipattern or "adding complexity to the connectors framework".

Another downside I can think of (and specific to the failover connector) is that the retry_on_failure config might look too similar to the retry settings in the failover connector itself, so the config might look like this:

connectors:
  failover:
    retry_interval: 10s
    retry_gap: 1s
    max_retries: 10000000
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0

We have two set of "retries" here but with different meanings, which could be confusing. However, I believe most of the users should be relatively familiar with the retry_on_failure settings built in all the exporters, and we can document it clearly.

@djaglowski @dmitryax @akats7 would like to see WDYT on this approach, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
connector/failover enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants