bug: ingress stuck in pending/deleting #982

Open · Signum opened this issue Aug 20, 2024 · 27 comments

Signum commented Aug 20, 2024

Describe the bug

First-time user of SwiftWave here. Trying it out on a Debian 12 VM in the local network. Installation was quick and flawless.

I have now created a new app running WordPress with MariaDB from the App Store. I figured that my ingress would not work properly, so I tried to remove it and create a new one with the proper FQDN. However, the old ingress has not been removed after several minutes, and the new one is still pending:

[screenshot]

The server's log shows:

2024/08/20 10:53:21 error while invoking function for queue [ingress_rule_apply]
POST /graphql | 10.6.56.60 | 200 
POST /graphql | 10.6.56.60 | 200 
POST /graphql | 10.6.56.60 | 200 
2024/08/20 10:53:25 error while invoking function for queue [ingress_rule_delete]

Are you working on this issue?

No

@Signum Signum added the bug Something isn't working label Aug 20, 2024
@tanmoysrt (Member)

Hi @Signum, have you enabled the proxy from the server list?
To check whether the proxy has been enabled successfully, visit the server IP; it should return a Bad Gateway error page.
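
(A minimal command-line version of that check, assuming the proxy listens on port 80 of the server; SERVER_IP is a placeholder:)

# Expect an HTTP 502 once HAProxy is up (its error backend denies with 502);
# "connection refused" means the proxy is not running yet
curl -sI http://SERVER_IP/ | head -n 1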

Signum (Author) commented Aug 20, 2024

Thanks for the swift response, @tanmoysrt.

It seems I had not:
[screenshot]

Enabling ingress proxy as "Active"…
[screenshot]

The ingresses went into "Failed" and after waiting 1-2 minutes, they are now gone.

Thanks. Continuing my journey.

@Signum Signum closed this as completed Aug 20, 2024
@HWiese1980

I have a domain in my list in "failed" state and it's not going away. What's going on?

@tanmoysrt (Member)

I have a domain in my list in "failed" state and it's not going away. What's going on?

Can you share a few details, e.g. whether the proxy is enabled for any server?
Also, please give some reproducible steps.

@HWiese1980

Reproducible steps are difficult because this is an experimental setup and I do not really know what led to this state.

I'll try to remember...

I added a server, proxy disabled. I added a domain for an app. Deployed the app from a GitLab repo, building from a Dockerfile, to that server. The domain got stuck in the "pending" state, probably because the proxy was not enabled. Because I didn't know that, and also wondered why the app was deployed but did not start (docker ps -a showed no app on the server), I added another server and redeployed there. Still the app was deploying but not starting. I tried to delete the domain, which got stuck in deleting. I enabled the proxy on the second server, and the domain went into the failed state. And that's where it's sitting now.

@tanmoysrt tanmoysrt reopened this Jan 7, 2025
@tanmoysrt (Member)

@HWiese1980
Can you share a screenshot of the current state? (Hide the domain name.)

@HWiese1980

Sure can.

[screenshot: Bildschirmfoto 2025-01-08 um 07 28 26]

@HWiese1980

This is from this morning. The domain has been in this state for more than 12 hours or so now.

@tanmoysrt (Member)

This is from this morning. The domain has been in this state for more than 12 hours or so now.

If the proxy is working, just try to recreate the rule.
See the docs: https://swiftwave.org/docs/dashboard/ingress-rules#recreate--fix

@HWiese1980

Nope, doesn't work. It remains in the "failed" state. I see no indication of errors in the system log. I initiated "Recreate & Fix" at around 14:10; here's the system log. The "No change in haproxy service" messages do not seem to be related to the action.

[CRONJOB] 2025/01/08 14:09:11 server_status_monitor.go:23: Triggering Server Status Monitor Job
[CRONJOB] 2025/01/08 14:09:13 server_status_monitor.go:23: Triggering Server Status Monitor Job
[CRONJOB] 2025/01/08 14:09:14 sync_proxy_state.go:130: No change in haproxy service
[CRONJOB] 2025/01/08 14:09:14 sync_proxy_state.go:178: No change in udpproxy service
[CRONJOB] 2025/01/08 14:09:14 sync_proxy_state.go:235: No change in exposed tcp ports of haproxy service
[CRONJOB] 2025/01/08 14:09:14 sync_proxy_state.go:255: No change in exposed udp ports of udpproxy service
[... "Triggering Server Status Monitor Job" repeated every ~2 seconds from 14:09:15 to 14:10:12 ...]
[CRONJOB] 2025/01/08 14:10:14 sync_proxy_state.go:130: No change in haproxy service
[CRONJOB] 2025/01/08 14:10:14 sync_proxy_state.go:178: No change in udpproxy service
[CRONJOB] 2025/01/08 14:10:14 sync_proxy_state.go:235: No change in exposed tcp ports of haproxy service
[CRONJOB] 2025/01/08 14:10:14 sync_proxy_state.go:255: No change in exposed udp ports of udpproxy service
[... "Triggering Server Status Monitor Job" repeated every ~2 seconds from 14:10:14 to 14:11:12 ...]
[CRONJOB] 2025/01/08 14:11:14 sync_proxy_state.go:130: No change in haproxy service
[CRONJOB] 2025/01/08 14:11:14 sync_proxy_state.go:178: No change in udpproxy service
[CRONJOB] 2025/01/08 14:11:14 sync_proxy_state.go:235: No change in exposed tcp ports of haproxy service
[CRONJOB] 2025/01/08 14:11:14 sync_proxy_state.go:255: No change in exposed udp ports of udpproxy service
[... "Triggering Server Status Monitor Job" repeated every ~2 seconds from 14:11:14 to 14:11:32 ...]

tanmoysrt (Member) commented Jan 8, 2025

Hi @HWiese1980
Try disabling the proxy, then re-enabling it.
After some time, try to recreate or delete the rule again.

Also, SwiftWave is on the latest version, right?

@HWiese1980

SwiftWave is v2.2.20-1.

I've switched the proxy off and on repeatedly, and also tried to delete/recreate/fix the ingress rule in various combinations. It just persists; I can't get rid of it.

A potentially "dangerous" force-delete action would be convenient. Or an overwrite option when creating a new rule with the same parameters, which simply (after a confirmation) overwrites all the configs of the existing rule and behaves as if it were a new rule.

@HWiese1980

I think I've tried every possible permutation of settings now, including waiting. The ingress rule just won't disappear.

@tanmoysrt (Member)

@HWiese1980
I will check whether there is a bug, because during recreation it should handle cases like a duplicate record, or even no record, in the proxy.

@tanmoysrt (Member)

@HWiese1980
Can you share the HAProxy config? (Please anonymize domain names.)

Go to your proxy server and run:

cat /var/lib/swiftwave/haproxy/haproxy.cfg

@HWiese1980

Sure. There are no domain names in the config.

global
  master-worker
  maxconn 100000
  chroot /var/lib/haproxy
  user haproxy
  group haproxy
  stats socket /var/run/haproxy.sock user haproxy group haproxy mode 660 level admin expose-fd listeners

defaults
  mode http
  option forwardfor
  maxconn 4000
  log global
  option tcp-smart-accept
  timeout http-request 10s
  timeout check 10s
  timeout connect 10s
  timeout client 1m
  timeout queue 1m
  timeout server 1m
  timeout http-keep-alive 10s
  retries 3
  errorfile 502 /etc/haproxy/errors/502.http
  errorfile 503 /etc/haproxy/errors/503.http

resolvers docker
  nameserver ns1 127.0.0.11:53
  hold valid    10s
  hold other    30s
  hold refused  30s
  hold nx       30s
  hold timeout  30s
  hold obsolete 30s
  timeout resolve 2s
  timeout retry 2s
  resolve_retries 5
  accepted_payload_size 8192

frontend fe_http
  mode http
  bind :80
  acl letsencrypt-acl path_beg /.well-known
  use_backend letsencrypt_backend if letsencrypt-acl
  default_backend error_backend

frontend fe_https
  mode http
  bind :443 ssl crt /etc/haproxy/ssl/ alpn h2,http/1.1
  http-request set-header X-Forwarded-Proto https
  acl letsencrypt-acl path_beg /.well-known
  use_backend letsencrypt_backend if letsencrypt-acl
  default_backend error_backend

backend error_backend
  mode http
  http-request deny deny_status 502

backend letsencrypt_backend
  option httpchk
  http-check send meth GET uri /healthcheck hdr Host "$SWIFTWAVE_SERVICE_ADDRESS"
  http-check expect status 200
  http-request set-header Host "$SWIFTWAVE_SERVICE_ADDRESS"
  server swiftwave_service_https "$SWIFTWAVE_SERVICE_ENDPOINT" check ssl verify required ca-file /etc/ssl/certs/ca-certificates.crt check-sni "$SWIFTWAVE_SERVICE_ADDRESS" sni str("$SWIFTWAVE_SERVICE_ADDRESS")
  server swiftwave_service_http "$SWIFTWAVE_SERVICE_ENDPOINT" check

program api
  command /dataplaneapi.sh
  no option start-on-reload

HWiese1980 commented Jan 10, 2025

This may be related to my setup.

My SwiftWave runs behind a Caddy reverse proxy. I had to set up Let's Encrypt while creating the application. I do not need to use Let's Encrypt from SwiftWave, because the Caddy reverse proxy in front of it takes care of certificate creation and TLS termination.

I was able to recreate the behavior by completely starting over with SwiftWave.

Remember: SwiftWave runs behind a reverse proxy (in my case Caddy). The internet-facing domain name is already configured in Caddy, including a wildcard for the SwiftWave ingress rules and a TLS certificate issued through Let's Encrypt.

Steps:

  1. Install SwiftWave using the docs
  2. Fail because GPG is not automatically installed
  3. Install GPG
  4. Install SwiftWave
  5. Remember that SwiftWave also needs rsync; install rsync
  6. initialize and start SwiftWave according to the docs (use public domain name when asked for a domain during init)
  7. check on local registry credentials (because using "local registry" does not work if I set up SwiftWave behind a reverse proxy that does not forward the corresponding port)
  8. configure remote registry with local registry credentials and 127.0.0.1:3334
  9. configure git credentials and git repo
  10. configure a server using SSH (server runs docker and portainer)
  11. check the server's docker (multiple haproxy containers, all in "Created", none running; two udpproxy containers, one running); see the diagnostic sketch after this list
  12. set up an application domain (TLS cert is indeed issued)
  13. deploy application (docker build finishes successfully, push works, I see "Failed to create new haproxy transaction" in the logs)
  14. set up an ingress rule: failed, undeletable
  15. destroying the application is stuck
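
(A generic Docker Swarm diagnostic sketch for step 11, not SwiftWave-specific; the service and container names below are placeholders, so adjust them to whatever actually exists on the server:)

# List swarm services and spot the proxy one
docker service ls

# Show the task history for that service, including the untruncated error column
docker service ps --no-trunc SERVICE_NAME

# Or inspect one of the containers stuck in "Created" for its error, if it recorded one
docker ps -a --filter name=haproxy --format '{{.ID}} {{.Status}}'
docker inspect --format '{{.State.Error}}' CONTAINER_ID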

@tanmoysrt (Member)

@HWiese1980
Caddy is on the same server?

@HWiese1980

My workhorse Caddy is in the same network, on the same host (Proxmox) but not in the same VM. And it is unfortunately not the only Caddy in my setup.

There are actually two Caddys running. One faces the internet (on a virtual server rented from a service provider) and acts exclusively as a bastion-host gateway, forwarding only incoming :80 and :443 traffic to my actual workhorse Caddy on the Proxmox host. The workhorse Caddy is the one that does the routing to the different VMs (including SwiftWave) and the TLS termination.

@HWiese1980

Ah, the connection between the bastion host gateway Caddy and the workhorse Caddy is done through Wireguard.

@HWiese1980

HWiese1980 commented Jan 10, 2025

It might have helped to set the management_node_address in /var/lib/swiftwave/config.yml (which is also set during swiftwave init) to 127.0.0.1. This is probably because I do not route port 3333 on the public domain name.

@HWiese1980

No, setting the management_node_address to 127.0.0.1 has not helped. I was able to destroy the app after doing so (maybe because I had to restart SwiftWave), but now I am back at the start: an ingress rule I cannot delete because it's stuck in "failed", and "Failed to create new haproxy transaction" in the deployment logs.

@HWiese1980

SwiftWave is trying to destroy the app again; meanwhile I'm running docker system prune on the target server over and over, watching udpproxy and a bunch of haproxy containers get recreated after a while each time. udpproxy starts; the multiple haproxy containers remain in "Created".

@tanmoysrt (Member)

@HWiese1980
So, after enabling HAProxy, just check with docker ps whether it's running.
If something is already listening on port 80 or 443 on the VM, HAProxy will not start.
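
(A minimal sketch of that check, assuming iproute2 and the Docker CLI are available on the VM:)

# Anything already bound to 80/443 on the host will keep HAProxy from binding
sudo ss -ltnp '( sport = :80 or sport = :443 )'

# Also check for containers that already publish those ports
docker ps --filter publish=80
docker ps --filter publish=443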

@HWiese1980

Aaah, yeah, that may be the reason. Good catch.

If this is the reason (and it looks like it), I would suggest somehow catching that error. A plain "failed" is a little ambiguous; I would have expected to see something like that in the logs.

@HWiese1980

Okay, this has fixed at least the issue with undeletable and stuck ingress rules. So this is kind of solved (aside from maybe some more detailed logging and UX).

Thank you for your support! It's been a nice learning experience.

tanmoysrt (Member) commented Jan 10, 2025

@HWiese1980
The issue is with Docker Swarm. It has no mechanism to report status back when something changes.
SwiftWave polls it on a fixed interval, but in most cases the poll simply reports the service as running.

That is because every time a container fails to start, the swarm service tries to start a new one.
Polling every second might work, but that would put a lot of pressure on the server (cAdvisor has the same issue on large servers with constant polling, google/cadvisor#2459).

Without proper integration with the Docker daemon event stream, it's tough to tackle.

Most of the time this issue doesn't appear, because people are doing this on a fresh instance.

In v3.0 the integration goes one level deeper and will not even require Swarm.
Hopefully all of these issues will be solved.
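
(For reference, and explicitly not how SwiftWave is implemented today: the daemon event stream mentioned above is already reachable from the CLI, so a rough sketch of watching for failing containers without tight polling could look like this:)

# Stream container exit events instead of polling service state;
# the format string pulls the container name and exit code out of each event
docker events \
  --filter type=container \
  --filter event=die \
  --format '{{.Actor.Attributes.name}} exited with code {{index .Actor.Attributes "exitCode"}}'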
