This repository has been archived by the owner on Apr 24, 2023. It is now read-only.

Fluent-bit cannot send logs to elasticsearch in one environment but works fine in another. #42

Open
thearifismail opened this issue Dec 8, 2018 · 4 comments

Comments

@thearifismail

Need help locating the problem that is blocking the fluent-bit containers from pushing logs to Elasticsearch.

This setup works fine in one environment but not in our staging environment, where it must succeed before we move to production.


Setup

Kubernetes v1.11 (installed using the RKE CLI with controlplane, etcd, and worker roles on separate nodes)
Elasticsearch v6.4.3 native install
Fluent-bit image: fluent/fluent-bit:0.14.6
Kibana v6.4.2

The Elasticsearch host is accessible from every node in the problem cluster. The fluent-bit containers can read the logs, but what happens after that is a mystery. Here is the docker log from one of the nodes:

docker logs 54b2ed96ca7f
Fluent-Bit v0.14.6
Copyright (C) Treasure Data

[2018/12/07 22:15:28] [ info] [engine] started (pid=1)
[2018/12/07 22:15:28] [ info] [filter_kube] https=1 host=kubernetes.default.svc.cluster.local port=443
[2018/12/07 22:15:28] [ info] [filter_kube] local POD info OK
[2018/12/07 22:15:28] [ info] [filter_kube] testing connectivity with API server...
[2018/12/07 22:15:28] [ info] [filter_kube] API server connectivity OK
[2018/12/07 22:15:28] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
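
For what it's worth, reachability of the Elasticsearch host from a node can be sanity-checked with a plain HTTP request against port 9200 (the hostname below is the placeholder used in the daemonset further down, not the real one):

# run from a worker node; a healthy Elasticsearch answers with its name/version JSON
curl -s http://my-elasticsearch.something.com:9200
# list indices to see whether anything named "fluent-bit" is being written
curl -s http://my-elasticsearch.something.com:9200/_cat/indices?v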

I don't know if it has any bearing, but I don't have permission on the system to check if port 2020 is available or not.
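
Since the HTTP_Server in the config below listens on 2020, the port can also be checked without shell access to the node, for example by port-forwarding one of the fluent-bit pods from a workstation (the pod name here is just an example):

kubectl -n logging port-forward fluent-bit-xxxxx 2020:2020
# in a second terminal; should return Fluent Bit build info / metrics if the server is listening
curl -s http://127.0.0.1:2020/
curl -s http://127.0.0.1:2020/api/v1/metrics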

The /var/log/messages in the fluent-bit container on a node is flooded with messages like the following:

kernel: ipmi-sensors:61430 map pfn expected mapping type uncached-minus for [mem 0xbfee0000-0xbfee0fff], got write-back

Dec 7 22:44:37 , dockerd: time="2018-12-07T22:44:37.062465721Z" level=error msg="Error running exec in container: OCI runtime exec failed: exec failed: container_linux.go:348: starting container process caused "exec: \"bash\": executable file not found in $PATH": unknown"

dockerd: time="2018-12-07T22:44:37.665307619Z" level=error msg="stream copy error: reading from a closed fifo"

Dec 7 22:24:39 dockerd: time="2018-12-07T22:24:39.310744098Z" level=error msg="Error running exec in container: OCI runtime exec failed: exec failed: container_linux.go:348: starting container process caused "exec: \"bash\": executable file not found in $PATH": unknown"
Dec 7 22:24:39 dockerd: time="2018-12-07T22:24:39.424232019Z" level=error msg="stream copy error: reading from a closed fifo"
Dec 7 22:24:39 dockerd: time="2018-12-07T22:24:39.424235038Z" level=error msg="stream copy error: reading from a closed fifo"
Dec 7 22:25:01 systemd: Created slice User Slice of pcp.
Dec 7 22:25:01 systemd: Starting User Slice of pcp.
Dec 7 22:25:01 systemd: Started Session 45542 of user pcp.
Dec 7 22:25:01 systemd: Starting Session 45542 of user pcp.
Dec 7 22:25:01 systemd: Removed slice User Slice of pcp.
Dec 7 22:25:01 systemd: Stopping User Slice of pcp.
Dec 7 22:25:10 telegraf: 2018-12-07T22:25:10Z E! [outputs.influxdb]: when writing to : received error partial write: max-values-per-tag limit exceeded (100055/100000): measurement="net" tag="interface" value="<some_string>" dropped=1; discarding points
Dec 7 22:25:37 dockerd: time="2018-12-07T22:25:37.189532650Z" level=error msg="stream copy error: reading from a closed fifo"
Dec 7 22:25:37 dockerd: time="2018-12-07T22:25:37.189532758Z" level=error msg="stream copy error: reading from a closed fifo"
Dec 7 22:25:37 dockerd: time="2018-12-07T22:25:37.199774849Z" level=error msg="Error running exec in container: OCI runtime exec failed: exec failed: container_linux.go:348: starting container process caused "exec: \"bash\": executable file not found in $PATH": unknown"


@edsiper
Member

edsiper commented Dec 8, 2018

Looks like the problem is in the host and not in Fluent Bit. Did you check if your Node is under memory pressure?
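
A quick way to check that is the node's MemoryPressure condition, for example:

kubectl describe node <node-name> | grep MemoryPressure   # should report False on a healthy node
kubectl top nodes                                         # needs metrics-server or heapster in the cluster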

@thearifismail
Author

Sorry Eduardo for my delayed response. I think I have found the problem, but it looks like I am not out of the woods yet. The problem was that the mount path for "varlibdockercontainers" was different from the default value of "/var/lib/docker/containers" (see the quick check after the daemonset below). Here are my "fluent-bit-ds.yml" and "fluent-bit-daemonset.yml" files. There is one question at the end; please respond to that if you can. Thanks.


apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
  labels:
    k8s-app: fluent-bit-logging
    version: v1
    kubernetes.io/cluster-service: "true"
spec:
  template:
    metadata:
      labels:
        k8s-app: fluent-bit-logging
        version: v1
        kubernetes.io/cluster-service: "true"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "2020"
        prometheus.io/path: /api/v1/metrics/prometheus
    spec:
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:0.14.6
        imagePullPolicy: Always
        ports:
        - containerPort: 2020
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "my-elasticsearch.something.com"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /data/docker/containers
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
      terminationGracePeriodSeconds: 10
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /data/docker/containers
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
      serviceAccountName: fluent-bit
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
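
The part that was wrong before is the varlibdockercontainers pair above: the hostPath and the container mountPath both have to point at the node's actual Docker root. Assuming shell access to a node, the actual root can be confirmed with:

docker info --format '{{.DockerRootDir}}'
# on these nodes this should print /data/docker rather than the default /var/lib/docker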


apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
  labels:
    k8s-app: fluent-bit
data:

# Configuration files: server, input, filters and output
# ======================================================

fluent-bit.conf: |
[SERVICE]
    Flush        1
    Log_Level    info
    Daemon       off
    Parsers_File parsers.conf
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020

@INCLUDE input-kubernetes.conf
@INCLUDE filter-kubernetes.conf
@INCLUDE output-elasticsearch.conf

input-kubernetes.conf: |
[INPUT]
    Name              tail
    Tag               kube.*
    # Path            /data/docker/containers/4d2248b0d7620ab3f86c679bc5ed2c482c20df437980202a5507f8abce0aa717/*.log
    Path              /data/docker/containers/*/*.log
    Parser            docker
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     5MB
    Skip_Long_Lines   On
    Refresh_Interval  10

filter-kubernetes.conf: |
[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc.cluster.local:443
    Merge_Log           On
    K8S-Logging.Parser  On

output-elasticsearch.conf: |
[OUTPUT]
    Name            es
    Match           *
    Host            ${FLUENT_ELASTICSEARCH_HOST}
    Port            ${FLUENT_ELASTICSEARCH_PORT}
    Logstash_Format Off
    Retry_Limit     False
    Index           fluent-bit

parsers.conf: |
[PARSER]
    Name   apache
    Format regex
    Regex  ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
    Time_Key time
    Time_Format %d/%b/%Y:%H:%M:%S %z

[PARSER]
    Name   apache2
    Format regex
    Regex  ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^ ]*) +\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
    Time_Key time
    Time_Format %d/%b/%Y:%H:%M:%S %z

[PARSER]
    Name   apache_error
    Format regex
    Regex  ^\[[^ ]* (?<time>[^\]]*)\] \[(?<level>[^\]]*)\](?: \[pid (?<pid>[^\]]*)\])?( \[client (?<client>[^\]]*)\])? (?<message>.*)$

[PARSER]
    Name   nginx
    Format regex
    Regex ^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
    Time_Key time
    Time_Format %d/%b/%Y:%H:%M:%S %z

[PARSER]
    Name   json
    Format json
    Time_Key time
    Time_Format %d/%b/%Y:%H:%M:%S %z

[PARSER]
    Name        docker
    Format      json
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%L
    Time_Keep   On
    # Command      |  Decoder | Field | Optional Action
    # =============|==================|=================
    Decode_Field_As   escaped    log

[PARSER]
    Name        syslog
    Format      regex
    Regex       ^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
    Time_Key    time
    Time_Format %b %d %H:%M:%S

With the above daemonset, I have started getting logs but there are a lot of entries like this:
{
  "_index": "fluent-bit",
  "_type": "flb_type",
  "_id": "8CZFqGcBeFDOdcHK3wlB",
  "_version": 1,
  "_score": null,
  "_source": {
    "@timestamp": "2018-12-13T15:53:30.999Z",
    "log": "[2018/12/13 15:53:30] [ warn] [filter_kube] invalid pattern for given tag kube.data.docker.containers.e21ddf07416d5cf36cdde9b05b9efffb163e7b43f87cb55c87a0ae470c932757.e21ddf07416d5cf36cdde9b05b9efffb163e7b43f87cb55c87a0ae470c932757-json.log\n",
    "stream": "stderr",
    "time": "2018-12-13T15:53:30.999064873Z"
  },
  "fields": {
    "@timestamp": [
      "2018-12-13T15:53:30.999Z"
    ],
    "time": [
      "2018-12-13T15:53:30.999Z"
    ]
  },
  "sort": [
    1544716410999
  ]
}
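
For context on what that warning means: the kubernetes filter derives the pod name and namespace from the tail tag, and its built-in pattern expects tags built from the kubelet log path (/var/log/containers/<pod>_<namespace>_<container>-<id>.log), so a tag like kube.data.docker.containers.<id>.<id>-json.log cannot be parsed and the record gets no Kubernetes metadata. A minimal sketch of the usual workaround is to tail the kubelet symlinks instead, while keeping /data/docker/containers mounted read-only so the symlink targets still resolve (newer Fluent Bit versions also expose a Kube_Tag_Prefix option if tailing the Docker directory directly is required):

[INPUT]
    Name              tail
    Tag               kube.*
    # /var/log/containers/*.log are symlinks into the Docker root on each node
    Path              /var/log/containers/*.log
    Parser            docker
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     5MB
    Skip_Long_Lines   On
    Refresh_Interval  10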

Should I file a separate issue for it, or is it related to the current one so I should keep both in this thread?

@linbingdouzhe

So, any update?

@samiamoura

Did you find any solution? I have the same problem.
