Loading Parquet Data from an S3 Bucket into StarRocks without Data Ingestion Using the External Table Feature #22723
Just a note: if you want to load your Parquet file into StarRocks for maximum performance, you can create a StarRocks OLAP table and perform an INSERT INTO ... SELECT from the external table.
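A minimal sketch of that approach, assuming the tutorial's external table is named cities_external in a database named cities_db with continent and city columns (all names and types are placeholders, not the original poster's exact schema):

```sql
-- Native StarRocks OLAP table mirroring the (assumed) external schema.
CREATE TABLE cities_db.cities_olap (
    continent VARCHAR(64),
    city      VARCHAR(64)
)
DUPLICATE KEY (continent)
DISTRIBUTED BY HASH (continent) BUCKETS 1;

-- Ingest the Parquet data by selecting it out of the external table.
INSERT INTO cities_db.cities_olap
SELECT continent, city
FROM cities_db.cities_external;
```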
This tutorial describes how you can load Parquet data without data ingestion using the "external table" feature. You can also use this tutorial to access Parquet data stored in a remote or local (#22782) S3-like object store.
Prerequisites
For this tutorial you need a StarRocks or CelerData database cluster. Setting one up is out of scope for this tutorial.
Downloading the Sample Data File
To download the sample Parquet data file, click cities.parquet.
The Parquet data file includes sample continent data.
Access an Object Store and Upload the Parquet Data File
This is out of scope for the tutorial. Note the URI of the uploaded file and any credentials needed to access it; you will need both later.
Create a Database and a Table, and Query the Data
The following commands create objects specifically for use with this tutorial. When you have completed the tutorial, you can drop these objects.
Step 0: Login to Database
To log in to the database, you'll need the server name, host port, username, and password.
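StarRocks speaks the MySQL protocol, so any MySQL-compatible client can be used. A minimal sketch, assuming the frontend node is reachable at starrocks-fe.example.com on the default query port 9030 with a root user (host and user are placeholders):

```
mysql -h starrocks-fe.example.com -P 9030 -u root -p
```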
Step 1: Create Database
Run the create database command.
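A minimal sketch, using cities_db as a placeholder database name:

```sql
CREATE DATABASE cities_db;
USE cities_db;
```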
Step 2: Create Table
Run the create table command.
Issue: The Snowflake "variant" type is not supported at this time, which is why the "varchar" type is used for "city". See GitHub Issue #22781.
AWS and MinIO
The command differs between AWS S3 and MinIO only in its connection properties; a sketch covering both follows.
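A minimal sketch, assuming StarRocks' file external table feature (ENGINE=FILE); the column names, bucket path, and credentials are all placeholders, not the original poster's values. For MinIO or another S3-compatible store, point aws.s3.endpoint at your own server and, typically, enable path-style access:

```sql
-- External table over the Parquet file in S3; no data is ingested.
-- "city" is VARCHAR because the "variant" type is not supported
-- (GitHub Issue #22781). All values below are placeholders.
CREATE EXTERNAL TABLE cities_external (
    continent VARCHAR(64),
    city      VARCHAR(64)
)
ENGINE = FILE
PROPERTIES (
    "path"   = "s3://my-bucket/cities.parquet",
    "format" = "parquet",
    "aws.s3.access_key" = "<access_key>",
    "aws.s3.secret_key" = "<secret_key>",
    -- AWS S3 endpoint; for MinIO use e.g. "http://minio.example.com:9000"
    -- together with "aws.s3.enable_path_style_access" = "true".
    "aws.s3.endpoint" = "https://s3.us-east-1.amazonaws.com"
);
```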
Step 3: Run Query
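A minimal sketch, reusing the placeholder names from the earlier steps:

```sql
SELECT continent, city
FROM cities_db.cities_external
LIMIT 10;
```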
The query should return the rows of the sample Parquet file.
Step 4: Clean Up
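A sketch of the clean-up, dropping only the placeholder objects created in the earlier sketches:

```sql
DROP TABLE IF EXISTS cities_db.cities_olap;
DROP TABLE IF EXISTS cities_db.cities_external;
DROP DATABASE IF EXISTS cities_db;
```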