[BUG] error while writing data in Parquet format using JsonSchema as schema format #439
Comments
Hi @grandimk, thanks for opening this issue. Can you please provide more information about your record schema?
Hi @alpreu, I looked at the code in the
{
"type": "record",
"name": "Quiz",
"fields": [
{
"name": "audit",
"type": {
"type": "record",
"name": "Audit",
"fields": [
{
"name": "actor",
"type": {
"type": "record",
"name": "Actor",
"fields": [
{ "name": "actorId", "type": "string" },
{
"name": "actorType",
"type": {
"type": "enum",
"name": "ActorType",
"symbols": ["person", "service"]
}
},
{ "name": "ip", "type": ["null", "string"], "default": null }
]
}
},
{
"name": "producer",
"type": {
"type": "record",
"name": "Producer",
"fields": [
{ "name": "code", "type": "string" },
{
"name": "instanceId",
"type": ["null", "string"],
"default": null
},
{ "name": "producerType", "type": "string" },
{
"name": "version",
"type": ["null", "string"],
"default": null
}
]
}
}
]
}
},
{
"name": "metadata",
"type": {
"type": "record",
"name": "Metadata",
"fields": [
{ "name": "eventId", "type": "string" },
{ "name": "eventTimestamp", "type": "long" },
{ "name": "eventType", "type": ["null", "string"], "default": null }
]
}
},
{
"name": "payload",
"type": {
"type": "record",
"name": "QuizPayload",
"fields": [
{
"name": "categoryTreeNodesIds",
"type": ["null", { "type": "array", "items": "long" }],
"default": null
},
{ "name": "id", "type": "string" },
{ "name": "isHidden", "type": "boolean" },
{
"name": "knowledgeGraphNodesIds",
"type": ["null", { "type": "array", "items": "long" }],
"default": null
},
{
"name": "properties",
"type": {
"type": "record",
"name": "ContentProperties",
"fields": [
{
"name": "authorId",
"type": ["null", "string"],
"default": null
},
{
"name": "copiedFrom",
"type": ["null", "string"],
"default": null
},
{
"name": "createdAt",
"type": ["null", "string"],
"default": null
},
{
"name": "creatorId",
"type": ["null", "string"],
"default": null
},
{
"name": "ownerId",
"type": ["null", "string"],
"default": null
},
{
"name": "permission",
"type": {
"type": "record",
"name": "ContentPermission",
"fields": [
{
"name": "teams",
"type": [
"null",
{ "type": "array", "items": "string" }
],
"default": null
}
]
}
},
{
"name": "version",
"type": ["null", "string"],
"default": null
}
]
}
},
{
"name": "questions",
"type": { "type": "array", "items": "string" }
},
{
"name": "questionsNumber",
"type": ["null", "long"],
"default": null
},
{ "name": "quizId", "type": ["null", "string"], "default": null },
{
"name": "showDetailedFeedback",
"type": ["null", "boolean"],
"default": null
},
{ "name": "slug", "type": ["null", "string"], "default": null },
{ "name": "title", "type": "string" }
]
}
}
]
}
As an additional note, we define our events using JSON Schema and then generate both the related Python and Java classes from those definitions.
@grandimk Thanks for providing the record schema. I had another look but I cannot see an immediate issue either.
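Since no single field stands out, one way to narrow things down is to list every field path together with its Avro type, so that the nullable unions, arrays, and the enum can be checked one at a time against a failing message. Below is a minimal sketch in plain Python (standard library only). The names `describe_type`, `walk_fields`, and `SCHEMA_EXCERPT` are hypothetical, and the excerpt is a trimmed copy of the schema above. It only inspects the schema and does not reproduce the sink's conversion logic.

```python
import json


def describe_type(avro_type):
    """Return a short, human-readable label for an Avro type declaration."""
    if isinstance(avro_type, str):
        return avro_type
    if isinstance(avro_type, list):
        # A JSON list in an Avro schema denotes a union of its branches.
        return "union[" + ", ".join(describe_type(t) for t in avro_type) + "]"
    kind = avro_type["type"]
    if kind == "array":
        return "array<" + describe_type(avro_type["items"]) + ">"
    if kind in ("enum", "record"):
        return kind + " " + avro_type["name"]
    return kind


def walk_fields(record, prefix=""):
    """Yield (dotted_path, type_label) for every field, descending into nested records."""
    for field in record["fields"]:
        path = prefix + field["name"]
        avro_type = field["type"]
        yield path, describe_type(avro_type)
        # Records can appear directly or as one branch of a union.
        branches = avro_type if isinstance(avro_type, list) else [avro_type]
        for branch in branches:
            if isinstance(branch, dict) and branch.get("type") == "record":
                yield from walk_fields(branch, path + ".")


# Trimmed excerpt of the schema posted above (assumption: enough to show the idea).
SCHEMA_EXCERPT = json.loads("""
{
  "type": "record",
  "name": "Quiz",
  "fields": [
    {
      "name": "metadata",
      "type": {
        "type": "record",
        "name": "Metadata",
        "fields": [
          { "name": "eventId", "type": "string" },
          { "name": "eventTimestamp", "type": "long" },
          { "name": "eventType", "type": ["null", "string"], "default": null }
        ]
      }
    },
    {
      "name": "payload",
      "type": {
        "type": "record",
        "name": "QuizPayload",
        "fields": [
          {
            "name": "categoryTreeNodesIds",
            "type": ["null", { "type": "array", "items": "long" }],
            "default": null
          },
          { "name": "questions", "type": { "type": "array", "items": "string" } }
        ]
      }
    }
  ]
}
""")

fields = dict(walk_fields(SCHEMA_EXCERPT))
for path, label in fields.items():
    print(f"{path}: {label}")
```

Running the same walk over the full schema gives a flat checklist of paths to compare with the field named in each stack trace, assuming the trace mentions one.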
Describe the bug
I was using the Cloud Storage Sink to collect data from Pulsar and write it to AWS S3 in Parquet. Messages were produced using the JsonSchema format. The Sink fails as soon as it tries to convert the collected data into org.apache.avro.generic.GenericRecord (within the convertGenericRecord function).
I tried producing messages from both Python and Java; both fail, but with different stack traces.
Note: if the formatType specified in the configuration is json, everything works fine.
To Reproduce
Use this template configuration for the pulsar-io-cloud-storage v2.9.3.6:
And produce messages in JsonSchema format. Here is the code for a minimal Python producer:
Expected behavior
A chunk of data containing a list of collected messages, written to the specified AWS S3 prefix in Parquet format.
Screenshots
None
Additional context
The tests were done on my laptop, using an Apache Pulsar Docker container in which the schema registry was properly configured (the schema definition of the messages had been uploaded) and pulsar-io-cloud-storage-2.9.3.6.nar was loaded.
This is the error that occurred while writing data produced with the Python producer:
This is the error that occurred while writing data produced with the Java producer: