diff --git a/TOC-tidb-cloud-lake.md b/TOC-tidb-cloud-lake.md index bfbe8deed0ea1..c8a5e79e9ee01 100644 --- a/TOC-tidb-cloud-lake.md +++ b/TOC-tidb-cloud-lake.md @@ -27,7 +27,8 @@ - [Overview](/tidb-cloud-lake/guides/data-integration-overview.md) - Data Sources - [Overview](/tidb-cloud-lake/guides/data-sources.md) - - [AWS - Credentials](/tidb-cloud-lake/guides/aws-credentials.md) + - [Amazon S3 - Credentials](/tidb-cloud-lake/guides/aws-credentials.md) + - [Amazon SQS (S3) - IAM Role](/tidb-cloud-lake/guides/amazon-sqs-s3-iam-role.md) - [MySQL - Credentials](/tidb-cloud-lake/guides/mysql-credentials.md) - [PostgreSQL - Credentials](/tidb-cloud-lake/guides/postgresql-credentials.md) - [FeiShuBot](/tidb-cloud-lake/guides/feishubot.md) @@ -35,6 +36,7 @@ - [Overview](/tidb-cloud-lake/guides/integration-tasks.md) - [Task Management](/tidb-cloud-lake/guides/task-management.md) - [Amazon S3 Integration Task](/tidb-cloud-lake/guides/integrate-with-amazon-s3.md) + - [Amazon SQS (S3) Integration Task](/tidb-cloud-lake/guides/integrate-with-amazon-sqs-s3.md) - [MySQL Integration Task](/tidb-cloud-lake/guides/integrate-with-mysql.md) - [PostgreSQL Integration Task](/tidb-cloud-lake/guides/integrate-with-postgresql.md) - Connect @@ -135,11 +137,13 @@ - [Roles](/tidb-cloud-lake/guides/roles.md) - [Ownership](/tidb-cloud-lake/guides/ownership.md) - [Audit Trail](/tidb-cloud-lake/guides/audit-trail.md) + - Data Protection + - [Overview](/tidb-cloud-lake/guides/data-protection-policies.md) + - [Masking Policy](/tidb-cloud-lake/guides/masking-policy.md) + - [Row Access Policy](/tidb-cloud-lake/guides/row-access-policy.md) - [Fail-Safe](/tidb-cloud-lake/guides/fail-safe.md) - - [Masking Policy](/tidb-cloud-lake/guides/masking-policy.md) - [Network Policy](/tidb-cloud-lake/guides/network-policy.md) - [Password Policy](/tidb-cloud-lake/guides/password-policy.md) - - [Row Access Policy](/tidb-cloud-lake/guides/row-access-policy.md) - [Recovery from Operational 
Errors](/tidb-cloud-lake/guides/recovery-from-operational-errors.md) - Data Management - [Overview](/tidb-cloud-lake/guides/data-management.md) diff --git a/tidb-cloud-lake/guides/amazon-sqs-s3-iam-role.md b/tidb-cloud-lake/guides/amazon-sqs-s3-iam-role.md new file mode 100644 index 0000000000000..dc0f46ba9a972 --- /dev/null +++ b/tidb-cloud-lake/guides/amazon-sqs-s3-iam-role.md @@ -0,0 +1,422 @@ +--- +title: Amazon SQS (S3) - IAM Role +summary: Learn how to create an "Amazon SQS (S3) - IAM Role" data source in {{{ .lake }}}. +--- + +# Amazon SQS (S3) - IAM Role + +This page describes how to create an `Amazon SQS (S3) - IAM Role` data source. This data source stores the configuration required to access an Amazon SQS queue and the corresponding S3 bucket, and is used for consuming S3 object creation events delivered from Amazon S3 to SQS. + +`Amazon SQS (S3) - IAM Role` only stores the connection and authorization information required for SQS (S3) ingestion. It does not consume messages by itself. The actual process of reading SQS messages, parsing S3 ObjectCreated events, and writing data into {{{ .lake }}} is performed by an [Amazon SQS (S3) Integration Task](/tidb-cloud-lake/guides/integrate-with-amazon-sqs-s3.md). + +## Use Cases + +- Centrally manage the queue URL, Region, IAM Role, and path scope required for SQS (S3) ingestion +- Consume S3 `ObjectCreated` events and write the corresponding object data into {{{ .lake }}} +- Use S3 event notifications to drive data ingestion instead of relying only on polling an S3 path +- Update the IAM Role, queue URL, or path scope in one place when referenced by multiple tasks + +## Create Amazon SQS (S3) - IAM Role + +1. Navigate to **Data** > **Data Sources**, then click **Create Data Source**. +2. 
Select **Amazon SQS (S3) - IAM Role** as the service type, then fill in the connection details: + + | Field | Required | Description | + |-------|----------|-------------| + | **Name** | Yes | A descriptive name for the data source | + | **Queue URL** | Yes | SQS standard queue URL, for example `https://sqs.us-east-1.amazonaws.com/123456789012/my-queue` | + | **Queue Region** | Yes | AWS Region where the SQS queue is located, for example `us-east-1`. The S3 bucket must be in the same Region as the SQS queue | + | **Role ARN** | Yes | IAM Role ARN in your AWS account that {{{ .lake }}} is allowed to assume | + | **External ID** | Yes | Organization ID from the {{{ .lake }}} console, used in the IAM Role trust policy | + | **Bucket** | Yes | Name of the S3 bucket that sends ObjectCreated events | + | **Object Key Prefix** | No | Prefix filter for S3 object keys. This should match the S3 notification filter | + | **Object Key Suffix** | No | Suffix filter for S3 object keys. This should match the S3 notification filter | + +3. Click **Test Connectivity** to validate the connection. If the test succeeds, click **OK** to save the data source. + + > **Note:** + > + > SQS (S3) ingestion uses the AssumeRole model. You do not need to provide AWS Access Key or Secret Key to {{{ .lake }}}. Instead, create an IAM Role in your AWS account and allow {{{ .lake }}} platform roles to obtain temporary credentials through `sts:AssumeRole` in the role trust policy. + +## AWS-Side Configuration Overview + +Before creating the data source, complete the following configuration in your AWS account: + +1. Create or prepare an SQS standard queue. +2. Configure the SQS queue policy to allow the specified S3 bucket to send messages to the queue. +3. Configure S3 bucket notification to send `ObjectCreated` events to the SQS queue. +4. Create an IAM Role that allows {{{ .lake }}} platform roles to access it through `sts:AssumeRole`. +5. 
Attach S3 read permissions and SQS consume permissions to the IAM Role. +6. Upload a test object and confirm that S3 can deliver the event to SQS. + +Prepare the following variables first. `AWS_REGION` must be the Region where both the S3 bucket and SQS queue are located. `EXTERNAL_ID` is the organization ID from the {{{ .lake }}} console. + +```bash +export AWS_REGION="" +export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text) + +export BUCKET_NAME="" +export BUCKET_ARN="arn:aws:s3:::$BUCKET_NAME" + +export QUEUE_NAME="" +export ROLE_NAME="databend-s3-sqs-consumer-role" + +export PREFIX="" +export SUFFIX="" + +export DATABEND_SETUP_ROLE_ARN="" +export DATABEND_LOAD_ROLE_ARN="" +export EXTERNAL_ID="" +``` + +> **Tip:** +> +> Use the role ARNs provided by {{{ .lake }}}: `DATABEND_SETUP_ROLE_ARN` is the ARN of **{{{ .lake }}} setup and validation role**, and `DATABEND_LOAD_ROLE_ARN` is the ARN of **{{{ .lake }}} data loading role**. In most cases, the trust policy of your IAM Role should trust both platform roles. + +## Step 1: Create or Get an SQS Standard Queue + +Create an SQS standard queue: + +```bash +aws sqs create-queue \ + --region "$AWS_REGION" \ + --queue-name "$QUEUE_NAME" +``` + +Get the queue URL and queue ARN required by later steps: + +```bash +export QUEUE_URL=$( + aws sqs get-queue-url \ + --region "$AWS_REGION" \ + --queue-name "$QUEUE_NAME" \ + --query 'QueueUrl' \ + --output text +) + +export QUEUE_ARN=$( + aws sqs get-queue-attributes \ + --region "$AWS_REGION" \ + --queue-url "$QUEUE_URL" \ + --attribute-names QueueArn \ + --query 'Attributes.QueueArn' \ + --output text +) +``` + +We recommend using a dedicated SQS standard queue for each SQS (S3) data source. Do not reuse the same queue for other buckets, other prefix / suffix scopes, or other business events. 
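Before moving on to the queue policy, it can help to sanity-check the queue URL captured in Step 1. The following is a minimal sketch, not part of the {{{ .lake }}} tooling: `check_queue_url` is a hypothetical helper, and it assumes the `https://sqs.<region>.amazonaws.com/<account-id>/<queue-name>` URL form used in the field table example. It rejects FIFO queues (their names end in `.fifo`) and checks that the Region embedded in the URL matches the expected Region. URLs in other forms are reported as unrecognized rather than validated.

```bash
# Hypothetical helper (an assumption for illustration, not a documented tool).
# Validates a queue URL of the form
#   https://sqs.<region>.amazonaws.com/<account-id>/<queue-name>
# against the Region you expect the queue to be in.
check_queue_url() {
  local url="$1"
  local expected_region="$2"
  local rest region name

  # Reject anything that does not look like the assumed URL form.
  case "$url" in
    https://sqs.*.amazonaws.com/*/*) ;;
    *) echo "unrecognized queue URL: $url" >&2; return 1 ;;
  esac

  rest="${url#https://sqs.}"            # <region>.amazonaws.com/<account>/<name>
  region="${rest%%.amazonaws.com/*}"    # <region>
  name="${url##*/}"                     # <name>

  if [ -z "$name" ]; then
    echo "queue name is empty: $url" >&2
    return 1
  fi

  # FIFO queue names end in .fifo; only standard queues are supported.
  case "$name" in
    *.fifo) echo "FIFO queue not supported: $name" >&2; return 1 ;;
  esac

  if [ "$region" != "$expected_region" ]; then
    echo "queue Region $region does not match $expected_region" >&2
    return 1
  fi

  echo "ok: $name in $region"
}
```

A possible call, using the variables exported above, is `check_queue_url "$QUEUE_URL" "$AWS_REGION"`. The check is purely local and does not contact AWS, so it does not replace **Test Connectivity** in the console.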
+ +## Step 2: Configure the SQS Queue Policy + +Back up the current SQS attributes before making changes: + +```bash +aws sqs get-queue-attributes \ + --region "$AWS_REGION" \ + --queue-url "$QUEUE_URL" \ + --attribute-names Policy QueueArn \ + > "sqs-attributes.backup.$(date +%Y%m%d-%H%M%S).json" +``` + +Generate `queue-policy.json`, which only allows the specified S3 bucket to send messages: + +```bash +jq -n \ + --arg policyId "$QUEUE_NAME-policy" \ + --arg queueArn "$QUEUE_ARN" \ + --arg bucketArn "$BUCKET_ARN" \ + --arg accountId "$AWS_ACCOUNT_ID" \ + '{ + Version: "2012-10-17", + Id: $policyId, + Statement: [ + { + Sid: "AllowS3ToSendMessage", + Effect: "Allow", + Principal: { + Service: "s3.amazonaws.com" + }, + Action: "sqs:SendMessage", + Resource: $queueArn, + Condition: { + ArnLike: { + "aws:SourceArn": $bucketArn + }, + StringEquals: { + "aws:SourceAccount": $accountId + } + } + } + ] + }' \ + > queue-policy.json +``` + +Apply the policy: + +```bash +jq -n \ + --arg policy "$(jq -c . queue-policy.json)" \ + '{Policy: $policy}' \ + > set-queue-attributes.json + +aws sqs set-queue-attributes \ + --region "$AWS_REGION" \ + --queue-url "$QUEUE_URL" \ + --attributes file://set-queue-attributes.json +``` + +## Step 3: Configure S3 Bucket Notification + +Back up the current bucket notification before making changes. `put-bucket-notification-configuration` replaces the entire bucket notification configuration. If the bucket already has other notifications, merge them before applying the new configuration. 
+ +```bash +aws s3api get-bucket-notification-configuration \ + --region "$AWS_REGION" \ + --bucket "$BUCKET_NAME" \ + > "bucket-notification.backup.$(date +%Y%m%d-%H%M%S).json" +``` + +Generate `bucket-notification.json`: + +```bash +jq -n \ + --arg id "$QUEUE_NAME" \ + --arg queueArn "$QUEUE_ARN" \ + --arg prefix "$PREFIX" \ + --arg suffix "$SUFFIX" \ + '{ + QueueConfigurations: [ + ( + { + Id: $id, + QueueArn: $queueArn, + Events: [ + "s3:ObjectCreated:*" + ] + } + + + ( + [ + if $prefix != "" then {Name: "prefix", Value: $prefix} else empty end, + if $suffix != "" then {Name: "suffix", Value: $suffix} else empty end + ] as $rules + | if ($rules | length) > 0 + then {Filter: {Key: {FilterRules: $rules}}} + else {} + end + ) + ) + ] + }' \ + > bucket-notification.json +``` + +Apply the configuration: + +```bash +aws s3api put-bucket-notification-configuration \ + --region "$AWS_REGION" \ + --bucket "$BUCKET_NAME" \ + --notification-configuration file://bucket-notification.json +``` + +Check the configuration: + +```bash +aws s3api get-bucket-notification-configuration \ + --region "$AWS_REGION" \ + --bucket "$BUCKET_NAME" +``` + +Confirm that `QueueArn` points to the target SQS queue, `Events` includes `s3:ObjectCreated:*`, and `FilterRules` matches the `Object Key Prefix` / `Object Key Suffix` configured in the {{{ .lake }}} data source. + +## Step 4: Create an IAM Role for {{{ .lake }}} to Assume + +Generate `trust-policy.json`. `ExternalId` is the organization ID from the {{{ .lake }}} console. 
+ +```bash +jq -n \ + --arg databendSetupRoleArn "$DATABEND_SETUP_ROLE_ARN" \ + --arg databendLoadRoleArn "$DATABEND_LOAD_ROLE_ARN" \ + --arg externalId "$EXTERNAL_ID" \ + '{ + Version: "2012-10-17", + Statement: [ + { + Sid: "AllowDatabendSetupAssumeRole", + Effect: "Allow", + Principal: { + AWS: $databendSetupRoleArn + }, + Action: "sts:AssumeRole", + Condition: { + StringEquals: { + "sts:ExternalId": $externalId + } + } + }, + { + Sid: "AllowDatabendLoadAssumeRole", + Effect: "Allow", + Principal: { + AWS: $databendLoadRoleArn + }, + Action: "sts:AssumeRole", + Condition: { + StringEquals: { + "sts:ExternalId": $externalId + } + } + } + ] + }' \ + > trust-policy.json +``` + +Create the IAM Role: + +```bash +aws iam create-role \ + --role-name "$ROLE_NAME" \ + --assume-role-policy-document file://trust-policy.json +``` + +If the role already exists, back up and update the trust policy: + +```bash +aws iam get-role \ + --role-name "$ROLE_NAME" \ + --query 'Role.AssumeRolePolicyDocument' \ + --output json \ + > "trust-policy.backup.$(date +%Y%m%d-%H%M%S).json" + +aws iam update-assume-role-policy \ + --role-name "$ROLE_NAME" \ + --policy-document file://trust-policy.json +``` + +## Step 5: Attach S3/SQS Permissions + +Generate `permissions-policy.json`: + +```bash +jq -n \ + --arg bucketArn "$BUCKET_ARN" \ + --arg objectArn "$BUCKET_ARN/*" \ + --arg queueArn "$QUEUE_ARN" \ + '{ + Version: "2012-10-17", + Statement: [ + { + Sid: "S3BucketMetadataAccess", + Effect: "Allow", + Action: [ + "s3:GetBucketLocation", + "s3:ListBucket" + ], + Resource: $bucketArn + }, + { + Sid: "S3ObjectReadAccess", + Effect: "Allow", + Action: [ + "s3:GetObject" + ], + Resource: $objectArn + }, + { + Sid: "SQSConsumeAccess", + Effect: "Allow", + Action: [ + "sqs:ReceiveMessage", + "sqs:DeleteMessage", + "sqs:GetQueueAttributes", + "sqs:ChangeMessageVisibility" + ], + Resource: $queueArn + } + ] + }' \ + > permissions-policy.json +``` + +Apply the permissions: + +```bash +aws iam 
put-role-policy \ + --role-name "$ROLE_NAME" \ + --policy-name databend-s3-sqs-access \ + --policy-document file://permissions-policy.json +``` + +Permission checklist: + +- SQS permissions are scoped to the target queue ARN. +- S3 permissions are scoped to the target bucket and object ARN. +- By default, this policy does not require S3 write or delete permissions. +- If a future SQS (S3) integration task enables **PURGE** or **Clean Up Original Files**, meaning source objects are deleted after successful ingestion, grant `s3:DeleteObject` on the target object path. + +## Step 6: Verify S3 to SQS + +Upload a test object that matches `PREFIX` / `SUFFIX`: + +```bash +echo 'a,b' > /tmp/databend-test.csv + +aws s3 cp /tmp/databend-test.csv \ + "s3://$BUCKET_NAME/${PREFIX}databend-test-$(date +%s)$SUFFIX" \ + --region "$AWS_REGION" +``` + +Receive a message from SQS: + +```bash +aws sqs receive-message \ + --region "$AWS_REGION" \ + --queue-url "$QUEUE_URL" \ + --max-number-of-messages 1 \ + --wait-time-seconds 10 \ + --visibility-timeout 60 +``` + +Confirm that the message contains `Records`, that `eventSource` is `aws:s3`, that `eventName` starts with `ObjectCreated:` (for example, `ObjectCreated:Put`), and that `Records[].s3.bucket.name` and `Records[].s3.object.key` match the test object. If the message contains `"Event": "s3:TestEvent"` instead of `Records`, it is the test message that S3 sends when a notification configuration is applied; receive again to get the upload event. + +> **Note:** +> +> `receive-message` does not delete the message automatically. It only hides the message temporarily during the visibility timeout. If you want {{{ .lake }}} to consume this test message later, do not delete it manually. Wait for the visibility timeout to expire before testing data source connectivity. 
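The `PREFIX` / `SUFFIX` match from Step 6 can also be checked locally before uploading, or against a key copied out of a received message. The following is a minimal sketch under stated assumptions: `key_matches_filter` is a hypothetical helper, and it treats the prefix and suffix as two independent conditions on the object key, with an empty value matching everything.

```bash
# Hypothetical helper (an assumption for illustration, not a documented tool).
# Succeeds only if the object key starts with the prefix and ends with the
# suffix. An empty prefix or suffix matches any key.
key_matches_filter() {
  local key="$1"
  local prefix="$2"
  local suffix="$3"

  # Quoted variables in case patterns are matched literally, so characters
  # such as * or ? in a prefix are not treated as wildcards.
  case "$key" in
    "$prefix"*) ;;
    *) return 1 ;;
  esac

  case "$key" in
    *"$suffix") ;;
    *) return 1 ;;
  esac

  return 0
}
```

For example, `key_matches_filter "${PREFIX}databend-test-1$SUFFIX" "$PREFIX" "$SUFFIX"` should succeed for any values of the two variables. Object keys in S3 event messages are URL-encoded, so a key that contains spaces or special characters likely needs decoding before it is compared this way.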
+ +## Information to Provide to {{{ .lake }}} + +After completing the AWS-side configuration, fill in the following information when creating the data source in {{{ .lake }}}: + +| Parameter | Description | +|-----------|-------------| +| `role_arn` | IAM Role ARN in your AWS account that {{{ .lake }}} is allowed to assume | +| `external_id` | Organization ID from the {{{ .lake }}} console | +| `queue_url` | SQS standard queue URL | +| `queue_region` | Region where the SQS queue is located | +| `bucket` | S3 bucket name | +| `prefix` / `suffix` | Optional. These should match the S3 notification filter | + +To get `role_arn`, run: + +```bash +aws iam get-role \ + --role-name "$ROLE_NAME" \ + --query 'Role.Arn' \ + --output text +``` + +## Configuration Requirements + +- The S3 bucket and SQS queue must be in the same AWS Region. +- The SQS queue must be a standard queue. FIFO queues are not supported. +- The SQS queue should be dedicated to one S3 notification rule. Do not reuse it for other buckets, other prefix / suffix scopes, or other business events. +- The bucket, prefix, and suffix in the S3 notification should match the {{{ .lake }}} data source configuration. +- `put-bucket-notification-configuration` replaces the entire bucket notification configuration. Back up and merge existing configurations before applying changes. +- S3 event notifications and SQS standard queues both use at-least-once delivery, so messages may be duplicated. + +## Next Steps + +After creating this data source, you can use it to create an [Amazon SQS (S3) Integration Task](/tidb-cloud-lake/guides/integrate-with-amazon-sqs-s3.md). 
diff --git a/tidb-cloud-lake/guides/aws-credentials.md b/tidb-cloud-lake/guides/aws-credentials.md index 644563777a1b6..ab36e58e3a19d 100644 --- a/tidb-cloud-lake/guides/aws-credentials.md +++ b/tidb-cloud-lake/guides/aws-credentials.md @@ -1,11 +1,11 @@ --- -title: AWS - Credentials -summary: This page describes how to create an "AWS - Credentials" data source. This data source stores the credentials required to access Amazon S3 and can be reused across multiple S3 integration tasks. +title: Amazon S3 - Credentials +summary: This page describes how to create an "Amazon S3 - Credentials" data source. This data source stores the credentials required to access Amazon S3 and can be reused across multiple S3 integration tasks. --- -# AWS - Credentials +# Amazon S3 - Credentials -This page describes how to create an `AWS - Credentials` data source. This data source stores the credentials required to access Amazon S3 and can be reused across multiple S3 integration tasks. +This page describes how to create an `Amazon S3 - Credentials` data source. This data source stores the credentials required to access Amazon S3 and can be reused across multiple S3 integration tasks. ## Use Cases @@ -13,10 +13,10 @@ This page describes how to create an `AWS - Credentials` data source. This data - Avoid re-entering the same S3 access credentials in every task - Update credentials centrally when they are rotated -## Create AWS - Credentials +## Create Amazon S3 - Credentials 1. Navigate to **Data** > **Data Sources** and click **Create Data Source**. -2. Select **AWS - Credentials** as the service type, then fill in the credentials: +2. 
Select **Amazon S3 - Credentials** as the service type, then fill in the credentials: | Field | Required | Description | |-------|----------|-------------| diff --git a/tidb-cloud-lake/guides/data-integration-overview.md b/tidb-cloud-lake/guides/data-integration-overview.md index 0f8b64d4a06f1..b6f6cddbb91b5 100644 --- a/tidb-cloud-lake/guides/data-integration-overview.md +++ b/tidb-cloud-lake/guides/data-integration-overview.md @@ -11,7 +11,7 @@ The Data Integration feature in {{{ .lake }}} provides a visual, no-code interfa | Concept | Description | |---------|-------------| -| [Data Sources](/tidb-cloud-lake/guides/data-sources.md) | Reusable connection settings or credentials used to access external systems or send notifications, such as AWS Access Key / Secret Key, MySQL hostname / username / password, or a FeiShu bot webhook. | +| [Data Sources](/tidb-cloud-lake/guides/data-sources.md) | Reusable connection settings or credentials used to access external systems or send notifications, such as AWS Access Key / Secret Key, MySQL hostname / username / password, SQS (S3) queue URL, or a FeiShu bot webhook. | | [Integration Tasks](/tidb-cloud-lake/guides/integration-tasks.md) | Executable tasks that define where data comes from, which {{{ .lake }}} table it is written to, which runtime parameters are used, and how the task is started and monitored. | Data sources do not move data by themselves. They only store the information required to access external systems. Integration tasks are the units that actually perform imports, snapshots, and continuous synchronization. @@ -23,7 +23,9 @@ Not every data source corresponds to an ingestion task. For example, `FeiShuBot` | Task Type | Description | |-----------|-------------| | [Amazon S3](/tidb-cloud-lake/guides/integrate-with-amazon-s3.md) | Imports CSV, Parquet, or NDJSON files from Amazon S3 with support for one-time or continuous ingestion. 
| +| [Amazon SQS (S3)](/tidb-cloud-lake/guides/integrate-with-amazon-sqs-s3.md) | Consumes S3 object creation events from an SQS queue and writes the corresponding object data into {{{ .lake }}}. | | [MySQL](/tidb-cloud-lake/guides/integrate-with-mysql.md) | Synchronizes table data from MySQL using `Snapshot`, `CDC Only`, or `Snapshot + CDC` modes. | +| [PostgreSQL](/tidb-cloud-lake/guides/integrate-with-postgresql.md) | Synchronizes table data from PostgreSQL using `Snapshot`, `CDC Only`, or `Snapshot + CDC` modes. | ## Recommended Flow diff --git a/tidb-cloud-lake/guides/data-protection-policies.md b/tidb-cloud-lake/guides/data-protection-policies.md new file mode 100644 index 0000000000000..35e00e0e3c6c3 --- /dev/null +++ b/tidb-cloud-lake/guides/data-protection-policies.md @@ -0,0 +1,418 @@ +--- +title: Data Protection Policies +summary: Learn how masking policies and row access policies safeguard sensitive information without altering stored values. +--- + +# Data Protection Policies + +{{{ .lake }}} provides two complementary policy types that protect sensitive data without changing stored values: + +- **Masking Policy**: transforms column values at query time so unauthorized users see redacted data. +- **Row Access Policy**: filters entire rows at query time so unauthorized users never see them. + +Both policies are transparent to applications: no code changes, no extra views, no data duplication. + +## Choosing the Right Policy + +Consider an `orders` table with customer phone numbers, order amounts, and regions. Three roles query it: + +- **Support agents** need phone numbers (to contact customers) but should only see orders in their own region. +- **Analysts** need all regions for reporting, but must see phone numbers redacted. +- **Admins** see everything. 
+ +This single requirement splits into two policies: + +| Requirement | Policy Type | +|---|---| +| Support agents only see their region's rows | Row Access Policy | +| Analysts see `138****1234` instead of real phone numbers | Masking Policy | + +## When to Use Each + +| Scenario | Use | +|----------|-----| +| Users should not see certain rows at all | Row Access Policy | +| All users see the row, but a sensitive column is redacted | Masking Policy | +| Different roles see different precision of the same column | Masking Policy | +| Multi-tenant isolation — tenants only see their own data | Row Access Policy | +| Restrict queryable time range by role | Row Access Policy | +| Hide specific keys inside a JSON/VARIANT column | Masking Policy | +| Row-level isolation + column-level redaction together | Both (but the same column cannot have both) | + +## How They Work Together + +``` +Query + → Row Access Policy filters rows (rows that fail the predicate disappear) + → Masking Policy transforms column values (sensitive fields are replaced) + → Result returned to user +``` + +Row filtering happens first. Masking applies only to the surviving rows. + +## Quick Comparison + +| | Masking Policy | Row Access Policy | +|---|---|---| +| Protection granularity | Column (values replaced) | Row (entire row hidden) | +| Return type | Must match column type | Always BOOLEAN | +| Limit per table | One policy per column | One policy per table | +| Affected operations | SELECT | SELECT, UPDATE, DELETE, MERGE | +| Stored data changed? | No | No | +| INSERT affected? | No | No | + +## Combining Both Policies + +You can attach a masking policy to one column and a row access policy to the same table — they compose naturally. The only constraint is that a single column cannot be referenced in both a masking policy binding and a row access policy binding simultaneously. 
+ +**Example**: a `customers` table where: + +- Row access policy on `region` ensures each sales rep only sees their territory +- Masking policy on `ssn` ensures non-HR roles see `***-**-****` + +```sql +-- Row-level: filter by region +CREATE ROW ACCESS POLICY rap_region +AS (r STRING) RETURNS BOOLEAN -> + CASE + WHEN is_role_in_session('admin') THEN true + ELSE is_role_in_session(r) + END; + +ALTER TABLE customers ADD ROW ACCESS POLICY rap_region ON (region); + +-- Column-level: mask SSN +CREATE MASKING POLICY mask_ssn +AS (val STRING) RETURNS STRING -> + CASE + WHEN is_role_in_session('hr') THEN val + ELSE '***-**-****' + END; + +ALTER TABLE customers MODIFY COLUMN ssn SET MASKING POLICY mask_ssn; +``` + +## Advanced Practice: End-to-End Access Control + +This section walks through a production-ready setup combining RBAC, ownership, table privileges, and policy privileges. By the end, you'll see how separation of duties works in practice — who creates policies, who attaches them, and who queries the protected data. + +### Scenario + +An e-commerce company has an `orders` table with sensitive customer data. Four roles need different levels of access: + +| Role | Responsibility | Data Visibility | +|------|---------------|-----------------| +| `security_admin` | Creates and manages all policies | Cannot query data directly | +| `data_engineer` | Creates tables, attaches policies | Sees all data (admin-level) | +| `analyst_apac` | Analyzes APAC region data | Only APAC rows, phone numbers masked | +| `support_global` | Global customer support | All rows, phone numbers visible | + +
+How It All Fits Together (click to expand) + +```text ++--------------------------------------------------------------------------------+ +| ecommerce.orders (raw data) | ++----------+---------------+-------------+--------+--------+---------------------+ +| order_id | customer_name | phone | region | amount | created_at | ++----------+---------------+-------------+--------+--------+---------------------+ +| 1 | Alice | 13812345678 | APAC | 299.00 | 2025-01-15 10:00:00 | +| 2 | Bob | 14987654321 | EMEA | 150.00 | 2025-01-16 11:00:00 | +| 3 | Charlie | 13698765432 | APAC | 520.00 | 2025-01-17 09:30:00 | +| 4 | Diana | 15012349876 | AMER | 89.00 | 2025-01-18 14:00:00 | ++---------------------------------------+----------------------------------------+ + | + v + +------------------------------------------------+ + | 1) Row Access Policy: rap_region | + | ON (region) | + | | + | data_engineer / support_global -> ALL | + | analyst_apac -> region = 'APAC' only | + | others -> NONE | + +------------------------+-----------------------+ + | + v + +------------------------------------------------+ + | 2) Masking Policy: mask_phone | + | ON (phone) | + | | + | data_engineer / support_global -> raw | + | others -> CONCAT(LEFT(3), '****', ...) 
| + +------------------------+-----------------------+ + | + v + +-------------------------+-----------------------------+-----------------------------+ + | | | | + v v v v ++--------------------+ +---------------------------+ +---------------------------+ +---------------------------+ +| security_admin | | data_engineer | | analyst_apac | | support_global | ++--------------------+ +---------------------------+ +---------------------------+ +---------------------------+ +| permission denied | | id name phone | | id name phone | | id name phone | +| no SELECT | | 1 Alice 13812345678 | | 1 Alice 138****5678 | | 1 Alice 13812345678 | +| | | 2 Bob 14987654321 | | 3 Charlie 136****5432 | | 2 Bob 14987654321 | +| | | 3 Charlie 13698765432 | | | | 3 Charlie 13698765432 | +| | | 4 Diana 15012349876 | | | | 4 Diana 15012349876 | +| | | | | | | | +| 0 rows | | 4 rows, all regions | | 2 rows, APAC only | | 4 rows, all regions | +| | | phone: visible | | phone: masked | | phone: visible | ++--------------------+ +---------------------------+ +---------------------------+ +---------------------------+ +``` + +
+ +The following steps show how to build this setup from scratch. + +### Step 1: Create Roles and Users + +```sql +-- Run as account_admin. The passwords below are placeholders. + +CREATE ROLE security_admin; +CREATE ROLE data_engineer; +CREATE ROLE analyst_apac; +CREATE ROLE support_global; + +CREATE USER 'sec_user' IDENTIFIED BY 'password123'; +CREATE USER 'eng_user' IDENTIFIED BY 'password123'; +CREATE USER 'analyst_user' IDENTIFIED BY 'password123'; +CREATE USER 'support_user' IDENTIFIED BY 'password123'; + +GRANT ROLE security_admin TO USER 'sec_user'; +GRANT ROLE data_engineer TO USER 'eng_user'; +GRANT ROLE analyst_apac TO USER 'analyst_user'; +GRANT ROLE support_global TO USER 'support_user'; +``` + +### Step 2: Set Up Table and Ownership + +Grant `data_engineer` the ability to create databases, then create the table as that role. Ownership is automatically assigned to the creating role. + +```sql +-- Run as account_admin +GRANT CREATE DATABASE ON *.* TO ROLE data_engineer; + +-- Switch to data_engineer +SET ROLE data_engineer; + +CREATE DATABASE ecommerce; +CREATE TABLE ecommerce.orders ( + order_id INT, + customer_name STRING, + phone STRING, + region STRING, + amount DECIMAL(10,2), + created_at TIMESTAMP +); + +INSERT INTO ecommerce.orders VALUES + (1, 'Alice', '13812345678', 'APAC', 299.00, '2025-01-15 10:00:00'), + (2, 'Bob', '14987654321', 'EMEA', 150.00, '2025-01-16 11:00:00'), + (3, 'Charlie', '13698765432', 'APAC', 520.00, '2025-01-17 09:30:00'), + (4, 'Diana', '15012349876', 'AMER', 89.00, '2025-01-18 14:00:00'); +``` + +At this point, `data_engineer` owns `ecommerce.orders` and has full control over it. + +### Step 3: Grant Policy Creation Privileges + +Policy creation privileges are global (on `*.*`) and must be granted to roles, not users. Also grant the `GRANT` privilege to `security_admin` if that role should delegate policy APPLY privileges itself. 
+ +```sql +-- Run as account_admin +GRANT CREATE MASKING POLICY ON *.* TO ROLE security_admin; +GRANT CREATE ROW ACCESS POLICY ON *.* TO ROLE security_admin; +GRANT GRANT ON *.* TO ROLE security_admin; +``` + +Now `security_admin` can create policies and delegate APPLY privileges, but still cannot query any table. + +### Step 4: Create Policies (as security_admin) + +```sql +SET ROLE security_admin; +SET enable_experimental_row_access_policy = 1; + +-- Masking policy: hide phone numbers from roles without 'support_global' or 'data_engineer' +CREATE MASKING POLICY mask_phone +AS (val STRING) +RETURNS STRING -> + CASE + WHEN is_role_in_session('data_engineer') OR is_role_in_session('support_global') THEN val + ELSE CONCAT(SUBSTRING(val, 1, 3), '****', SUBSTRING(val, 8)) + END; + +-- Row access policy: filter by region +CREATE ROW ACCESS POLICY rap_region +AS (r STRING) +RETURNS BOOLEAN -> + CASE + WHEN is_role_in_session('data_engineer') OR is_role_in_session('support_global') THEN true + WHEN is_role_in_session('analyst_apac') AND r = 'APAC' THEN true + ELSE false + END; +``` + +`security_admin` now owns both policies (OWNERSHIP auto-granted). But it cannot attach them to `ecommerce.orders` because it does not have ALTER on the table. + +### Step 5: Grant Policy Apply Privileges + +The policy owner (`security_admin`) delegates APPLY to `data_engineer`, who owns the table and can attach policies. + +```sql +-- Run as security_admin (owner of the policies) +GRANT APPLY ON MASKING POLICY mask_phone TO ROLE data_engineer; +GRANT APPLY ON ROW ACCESS POLICY rap_region TO ROLE data_engineer; +``` + +### Step 6: Attach the Masking Policy First (as data_engineer) + +`data_engineer` has both ALTER on the table (via ownership) and APPLY on the masking policy. Both are required. + +```sql +SET ROLE data_engineer; +ALTER TABLE ecommerce.orders MODIFY COLUMN phone SET MASKING POLICY mask_phone; +``` + +At this point, the table has column masking but no row filtering. 
Users with SELECT can still see all rows, but unauthorized phone numbers are masked. + +### Step 7: Grant Table Access Through Roles + +Grant table access to roles, not directly to users. The users received these roles in Step 1, so their table access flows through role membership. + +```sql +-- Run as account_admin +GRANT SELECT ON ecommerce.orders TO ROLE analyst_apac; +GRANT SELECT ON ecommerce.orders TO ROLE support_global; +GRANT USAGE ON ecommerce.* TO ROLE analyst_apac; +GRANT USAGE ON ecommerce.* TO ROLE support_global; +``` + +### Step 8: Verify Without Row Access Policy + +**analyst_user** has the `analyst_apac` role, so it can query the table. Because the row access policy is not attached yet, it sees all rows. Because the masking policy is already attached, phone numbers are masked. + +```sql +-- Connect as analyst_user +SET ROLE analyst_apac; +SELECT * FROM ecommerce.orders; +``` + +``` +order_id | customer_name | phone | region | amount | created_at +---------|---------------|-------------|--------|--------|-------------------- + 1 | Alice | 138****5678 | APAC | 299.00 | 2025-01-15 10:00:00 + 2 | Bob | 149****4321 | EMEA | 150.00 | 2025-01-16 11:00:00 + 3 | Charlie | 136****5432 | APAC | 520.00 | 2025-01-17 09:30:00 + 4 | Diana | 150****9876 | AMER | 89.00 | 2025-01-18 14:00:00 +``` + +### Step 9: Attach Row Access Policy + +Now attach the row access policy. This adds row filtering on top of the existing phone masking. 
+
+```sql
+-- Run as data_engineer
+SET ROLE data_engineer;
+SET enable_experimental_row_access_policy = 1;
+
+ALTER TABLE ecommerce.orders ADD ROW ACCESS POLICY rap_region ON (region);
+```
+
+### Step 10: Verify With Row Access Policy
+
+**analyst_user** — only APAC rows, phone masked:
+
+```sql
+-- Connect as analyst_user
+SET ROLE analyst_apac;
+SELECT * FROM ecommerce.orders;
+```
+
+```
+order_id | customer_name | phone       | region | amount | created_at
+---------|---------------|-------------|--------|--------|--------------------
+       1 | Alice         | 138****5678 | APAC   | 299.00 | 2025-01-15 10:00:00
+       3 | Charlie       | 136****5432 | APAC   | 520.00 | 2025-01-17 09:30:00
+```
+
+**support_user** — all rows, phone visible:
+
+```sql
+-- Connect as support_user
+SET ROLE support_global;
+SELECT * FROM ecommerce.orders;
+```
+
+```
+order_id | customer_name | phone       | region | amount | created_at
+---------|---------------|-------------|--------|--------|--------------------
+       1 | Alice         | 13812345678 | APAC   | 299.00 | 2025-01-15 10:00:00
+       2 | Bob           | 14987654321 | EMEA   | 150.00 | 2025-01-16 11:00:00
+       3 | Charlie       | 13698765432 | APAC   | 520.00 | 2025-01-17 09:30:00
+       4 | Diana         | 15012349876 | AMER   |  89.00 | 2025-01-18 14:00:00
+```
+
+**sec_user** — no SELECT privilege, access denied:
+
+```sql
+-- Connect as sec_user
+SET ROLE security_admin;
+SELECT * FROM ecommerce.orders;
+-- ERROR: Permission denied
+```
+
+### Step 11: Revoke Role Access
+
+Because table privileges were granted to roles, removing the role from a user removes the user's table access without changing table grants.
+
+```sql
+-- Run as account_admin
+REVOKE ROLE analyst_apac FROM USER 'analyst_user';
+
+-- Start a new session as analyst_user
+SELECT * FROM ecommerce.orders;
+-- ERROR: Permission denied
+```
+
+### Privilege Flow
+
+```
+account_admin
+  │
+  ├─ GRANT CREATE MASKING POLICY ON *.* ────────► security_admin
+  ├─ GRANT CREATE ROW ACCESS POLICY ON *.* ─────► security_admin
+  ├─ GRANT GRANT ON *.* ────────────────────────► security_admin
+  └─ GRANT CREATE DATABASE ON *.* ──────────────► data_engineer
+                                                        │
+security_admin                                          │
+  │  (owns policies via auto-OWNERSHIP)                 │
+  ├─ GRANT APPLY ON MASKING POLICY ─────────────► data_engineer
+  └─ GRANT APPLY ON ROW ACCESS POLICY ──────────► data_engineer
+                                                        │
+data_engineer                                           │
+  │  (owns table via auto-OWNERSHIP)                    │
+  │  (has APPLY on policies)                            │
+  ├─ ALTER TABLE ... SET MASKING POLICY                 │
+  └─ ALTER TABLE ... ADD ROW ACCESS POLICY              │
+                                                        │
+account_admin                                           │
+  ├─ GRANT SELECT ON ecommerce.orders ──────────► analyst_apac ──► analyst_user
+  ├─ GRANT SELECT ON ecommerce.orders ──────────► support_global ─► support_user
+  └─ REVOKE ROLE analyst_apac FROM USER ────────► analyst_user loses access
+```
+
+### Key Takeaways
+
+- **Separation of duties**: the role that creates policies (`security_admin`) cannot query data; the role that queries data (`analyst_apac`) cannot modify policies.
+- **Least privilege**: attaching a policy requires BOTH `APPLY` on the policy AND `ALTER` on the table — neither alone is sufficient.
+- **Masking and row access are independent**: masking alone hides column values but does not remove rows; adding a row access policy filters rows before masking is applied.
+- **Grant table access through roles**: users query through roles such as `analyst_apac`; revoking the role from a user removes access without changing table grants.
+- **Ownership is automatic**: the creator's role receives OWNERSHIP on the new policy/table. No extra GRANT needed.
+- **CREATE privileges go to roles, not users**: `CREATE MASKING POLICY` and `CREATE ROW ACCESS POLICY` cannot be granted directly to users. +- **Audit your setup**: use `SHOW GRANTS ON MASKING POLICY mask_phone`, `SHOW GRANTS ON ROW ACCESS POLICY rap_region`, and `POLICY_REFERENCES(POLICY_NAME => 'mask_phone')` to verify who has access and where policies are attached. + +## Next Steps + +- [Masking Policy](/tidb-cloud-lake/guides/masking-policy.md) — full syntax, conditional masking, VARIANT sub-field masking +- [Row Access Policy](/tidb-cloud-lake/guides/row-access-policy.md) — full syntax, DML behavior, multi-argument policies, time-range examples diff --git a/tidb-cloud-lake/guides/data-sources.md b/tidb-cloud-lake/guides/data-sources.md index 1ae813f836fb3..d3bf43b1a319b 100644 --- a/tidb-cloud-lake/guides/data-sources.md +++ b/tidb-cloud-lake/guides/data-sources.md @@ -13,11 +13,13 @@ Data sources do not execute synchronization by themselves. Their role is to cent | Type | Purpose | |------|---------| -| [AWS - Credentials](/tidb-cloud-lake/guides/aws-credentials.md) | Stores the Access Key and Secret Key required to access Amazon S3. These credentials can be reused across multiple S3 import tasks. | +| [Amazon S3 - Credentials](/tidb-cloud-lake/guides/aws-credentials.md) | Stores the Access Key and Secret Key required to access Amazon S3. These credentials can be reused across multiple S3 import tasks. | +| [Amazon SQS (S3) - IAM Role](/tidb-cloud-lake/guides/amazon-sqs-s3-iam-role.md) | Stores the queue URL, Region, IAM Role, and S3 path scope required for SQS (S3) ingestion. It can be used to consume S3 object creation events. | | [MySQL - Credentials](/tidb-cloud-lake/guides/mysql-credentials.md) | Stores the host, port, username, password, and database information required to access MySQL. These settings can be reused across multiple MySQL sync tasks. 
| +| [PostgreSQL - Credentials](/tidb-cloud-lake/guides/postgresql-credentials.md) | Stores the host, port, username, password, and database information required to access PostgreSQL. These settings can be reused across multiple PostgreSQL sync tasks. | | [FeiShuBot](/tidb-cloud-lake/guides/feishubot.md) | Stores a FeiShu bot webhook and message template for task failure notifications and similar scenarios. | -Not every data source corresponds to an integration task. For example, `FeiShuBot` is used for notification configuration, while `AWS - Credentials` and `MySQL - Credentials` are referenced by actual data import or synchronization tasks. +Not every data source corresponds to an integration task. For example, `FeiShuBot` is used for notification configuration, while `Amazon S3 - Credentials`, `Amazon SQS (S3) - IAM Role`, `MySQL - Credentials`, and `PostgreSQL - Credentials` are referenced by actual import, synchronization, or event-consuming tasks. ## Managing Data Sources diff --git a/tidb-cloud-lake/guides/integrate-with-amazon-s3.md b/tidb-cloud-lake/guides/integrate-with-amazon-s3.md index 4c470e439e357..82b7b28d08c3e 100644 --- a/tidb-cloud-lake/guides/integrate-with-amazon-s3.md +++ b/tidb-cloud-lake/guides/integrate-with-amazon-s3.md @@ -7,7 +7,7 @@ summary: The Amazon S3 data integration enables you to import files from S3 buck This page describes how to create an Amazon S3 integration task that imports files from an S3 bucket into {{{ .lake }}}. CSV, Parquet, and NDJSON file formats are supported, and the task can be configured for one-time import or continuous ingestion. -If you need to create reusable AWS credentials first, see [AWS - Credentials](/tidb-cloud-lake/guides/aws-credentials.md). +If you need to create reusable AWS credentials first, see [Amazon S3 - Credentials](/tidb-cloud-lake/guides/aws-credentials.md). 
## Supported File Formats @@ -19,7 +19,7 @@ If you need to create reusable AWS credentials first, see [AWS - Credentials](/t ## Prerequisites -- An **AWS - Credentials** data source has already been created +- An **Amazon S3 - Credentials** data source has already been created - The AWS credentials have read access to the target S3 bucket - If you plan to enable **Clean Up Original Files**, the credentials also need write and delete permissions @@ -33,7 +33,7 @@ If you need to create reusable AWS credentials first, see [AWS - Credentials](/t | Field | Required | Description | |--------------------|----------|--------------------------------------------------------------------------------------------------| - | **Data Source** | Yes | Select an existing **AWS - Credentials** data source from the dropdown | + | **Data Source** | Yes | Select an existing **Amazon S3 - Credentials** data source from the dropdown | | **Name** | Yes | A name for this integration task | | **File Path** | Yes | S3 URI with optional wildcard pattern (e.g., `s3://mybucket/data/2025-*.csv`) | | **File Type** | Auto | Auto-detected from file extension. Supported: CSV, Parquet, NDJSON | diff --git a/tidb-cloud-lake/guides/integrate-with-amazon-sqs-s3.md b/tidb-cloud-lake/guides/integrate-with-amazon-sqs-s3.md new file mode 100644 index 0000000000000..5b06ad18e7a07 --- /dev/null +++ b/tidb-cloud-lake/guides/integrate-with-amazon-sqs-s3.md @@ -0,0 +1,110 @@ +--- +title: Amazon SQS (S3) Integration Task +summary: Learn how to create an Amazon SQS (S3) integration task that consumes S3 object creation events from an SQS queue and writes the corresponding object data into {{{ .lake }}}. +--- + +# Amazon SQS (S3) Integration Task + +This page describes how to create an Amazon SQS (S3) integration task that consumes S3 object creation events from an SQS queue and writes the corresponding object data into {{{ .lake }}}. + +This task is designed for S3 event-driven data ingestion. 
After an upstream system writes an object to S3, S3 sends an `ObjectCreated` event to SQS. {{{ .lake }}} consumes the SQS message through AssumeRole and writes data into {{{ .lake }}} based on the bucket and object key in the event.
+
+If you need to create reusable SQS (S3) connection settings first, see [Amazon SQS (S3) - IAM Role](/tidb-cloud-lake/guides/amazon-sqs-s3-iam-role.md).
+
+## Use Cases
+
+- Automatically ingest newly written S3 objects based on S3 `ObjectCreated` events
+- Use S3 event notifications to drive data ingestion and reduce latency after new files arrive
+- Avoid relying only on polling an S3 path to discover new files
+
+## Workflow
+
+1. An upstream system writes an object to an S3 bucket.
+2. S3 Event Notification sends the `ObjectCreated` event to an SQS standard queue.
+3. {{{ .lake }}} reads messages from the SQS queue through the IAM Role configured by the user.
+4. The task parses the S3 event records in the message.
+5. The task writes data into the {{{ .lake }}} target table based on the bucket, object key, and file format in the S3 event records.
+6. After the write succeeds, the task deletes the processed SQS message from the queue.
+
+> **Note:**
+>
+> S3 event notifications and SQS standard queues may both produce duplicate messages. {{{ .lake }}} handles failed retries. If your business logic requires strict deduplication, design downstream deduplication based on object information, event time, `sequencer`, or SQS message ID.
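
The downstream deduplication mentioned in the note could look roughly like the Python sketch below. It builds a key from each S3 event record's bucket, object key, and `sequencer`, and skips records whose key has already been seen. This is an illustration under assumptions, not how {{{ .lake }}} works internally: it assumes the standard S3 Event Notification JSON layout (`Records[].s3.bucket.name`, `Records[].s3.object.key`, `Records[].s3.object.sequencer`), and the helper names and the in-memory `seen` set are hypothetical.

```python
import json


def dedup_key(record):
    """Build a deduplication key from one S3 event record (hypothetical helper)."""
    s3 = record["s3"]
    return (s3["bucket"]["name"], s3["object"]["key"], s3["object"].get("sequencer"))


def unique_records(message_bodies, seen=None):
    """Yield S3 event records whose key has not been seen before.

    `message_bodies` is an iterable of SQS message bodies (JSON strings).
    `seen` is an in-memory set here, which is an assumption made for illustration.
    """
    seen = set() if seen is None else seen
    for body in message_bodies:
        for record in json.loads(body).get("Records", []):
            key = dedup_key(record)
            if key in seen:
                continue  # duplicate delivery of the same object event
            seen.add(key)
            yield record


if __name__ == "__main__":
    body = json.dumps({"Records": [{
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "example-bucket"},
            "object": {"key": "raw/events/a.json", "sequencer": "0055AED6DCD90281E5"},
        },
    }]})
    # The same message delivered twice yields a single record.
    print(len(list(unique_records([body, body]))))  # prints 1
```

In a real pipeline the seen keys would have to live somewhere durable, for example a table keyed on the same three values, because an in-memory set is lost when the process restarts.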
+
+## Prerequisites
+
+Before creating an SQS (S3) integration task, make sure:
+
+- An **Amazon SQS (S3) - IAM Role** data source has already been created
+- The S3 bucket has been configured with `ObjectCreated` event notification and sends events to the target SQS queue
+- The SQS queue policy allows Amazon S3 to call `sqs:SendMessage`
+- The user IAM Role allows {{{ .lake }}} platform roles to access it through `sts:AssumeRole`
+- The user IAM Role has permissions to read the target S3 objects and consume the target SQS queue
+- The SQS queue contains messages in the standard S3 Event Notification format
+- The bucket, prefix, and suffix in the S3 notification match the data source configuration
+
+## Creating an SQS (S3) Integration Task
+
+### Step 1: Basic Info
+
+1. Navigate to **Data** > **Data Integration** and click **Create Task**.
+2. Select an SQS (S3) data source, then configure the basic parameters:
+
+    | Field | Required | Description |
+    |-------|----------|-------------|
+    | **Data Source** | Yes | Select an existing **Amazon SQS (S3) - IAM Role** data source from the dropdown |
+    | **Name** | Yes | Name of the integration task |
+    | **File Format** | Yes | File format of the S3 objects, such as CSV, Parquet, or NDJSON |
+    | **Object Key Prefix** | No | Only process object events with the specified prefix, such as `raw/events/`. This should match the data source and S3 notification filter |
+    | **Object Key Suffix** | No | Only process object events with the specified suffix, such as `.json` or `.parquet`. This should match the data source and S3 notification filter |
+
+    > **Tip:**
+    >
+    > We recommend configuring prefix or suffix filters in S3 Event Notification first, and keeping them consistent with the filters in the data source and task. This reduces unrelated messages entering SQS.
+
+### Step 2: Preview Data
+
+After completing the basic settings, click **Next** to preview the source data.
+
+The preview result is the same as an [Amazon S3 Integration Task](./integrate-with-amazon-s3.md). The system locates the corresponding S3 objects based on the SQS (S3) configuration, reads file content, and displays:
+
+- Sample data with column names and data types
+- The matched S3 object list and object sizes
+
+> **Note:**
+>
+> If there are no previewable S3 objects in the current path scope, the preview page may not show sample data. Upload a test object that matches the target prefix / suffix, then retry the preview.
+
+### Step 3: Set Target Table
+
+Configure the target location in {{{ .lake }}}:
+
+| Field | Description |
+|-------|-------------|
+| **Warehouse** | Select the {{{ .lake }}} warehouse used to run the SQS (S3) integration task |
+| **Target Database** | Select the target database in {{{ .lake }}} |
+| **Target Table** | Name of the target table to write data into |
+
+The system infers column names and data types from the previewed S3 object content. Before continuing, you can review and edit the target table schema. If writing to an existing table, select the target table and verify the column mapping.
+
+Click **Create** to create the integration task.
+
+## Task Behavior
+
+An SQS (S3) integration task is a continuously running task. After it starts, it periodically reads messages from the SQS queue and writes data into the target table until it is manually stopped.
+ +| Scenario | Behavior | +|----------|----------| +| Messages exist in the queue | Reads messages, parses S3 event records, and writes data into the target table based on the object information in the events | +| Write succeeds | Deletes the corresponding SQS message to avoid duplicate processing | +| Write fails | Does not delete the corresponding SQS message, keeping it for later retry | +| Message format is not valid S3 Event Notification | Records the error and skips or stops processing | +| Task is stopped manually | Stops polling and saves the current task state | + +## Difference from Amazon S3 Integration Task + +| Task Type | Processed Object | Data Written to {{{ .lake }}} | Typical Use Case | +|-----------|------------------|--------------------------|------------------| +| Amazon S3 Integration Task | S3 file content | Business data from CSV, Parquet, or NDJSON files | File data import | +| Amazon SQS (S3) Integration Task | S3 ObjectCreated events in SQS | S3 object data corresponding to the events | Automatic ingestion of new objects, event-driven import | + +If your goal is to periodically scan an S3 path and import file content, use an Amazon S3 Integration Task. If your goal is to trigger ingestion based on S3 ObjectCreated events, use an Amazon SQS (S3) Integration Task. diff --git a/tidb-cloud-lake/guides/integration-tasks.md b/tidb-cloud-lake/guides/integration-tasks.md index 2a9ced5f4c704..1094cb6ad3c29 100644 --- a/tidb-cloud-lake/guides/integration-tasks.md +++ b/tidb-cloud-lake/guides/integration-tasks.md @@ -14,6 +14,7 @@ Unlike data sources, integration tasks are the executable units that actually pe | Task Type | Description | |-----------|-------------| | [Amazon S3](/tidb-cloud-lake/guides/integrate-with-amazon-s3.md) | Imports CSV, Parquet, or NDJSON files from Amazon S3 with support for one-time or continuous ingestion. 
| +| [Amazon SQS (S3)](/tidb-cloud-lake/guides/integrate-with-amazon-sqs-s3.md) | Consumes S3 object creation events from an SQS queue and writes the corresponding object data into {{{ .lake }}}. | | [MySQL](/tidb-cloud-lake/guides/integrate-with-mysql.md) | Synchronizes table data from MySQL using `Snapshot`, `CDC Only`, or `Snapshot + CDC`. | | [PostgreSQL](/tidb-cloud-lake/guides/integrate-with-postgresql.md) | Synchronizes table data from PostgreSQL using `Snapshot`, `CDC Only`, or `Snapshot + CDC`. | @@ -26,5 +27,6 @@ Recommended reading order: ## Task Type Differences -- Amazon S3 tasks are designed for file import scenarios and mainly focus on file path patterns, file formats, and ingestion behavior. +- S3 tasks are designed for file import scenarios and mainly focus on file path patterns, file formats, and ingestion behavior. +- SQS (S3) tasks are designed for S3 event-driven data ingestion and mainly focus on the SQS queue, S3 event filters, IAM Role, and target table. - MySQL and PostgreSQL tasks are designed for table synchronization scenarios and mainly focus on sync modes, primary keys, incremental capture, and archive scheduling. diff --git a/tidb-cloud-lake/guides/masking-policy.md b/tidb-cloud-lake/guides/masking-policy.md index 106dfa7d1c0b0..e0fedea0171ee 100644 --- a/tidb-cloud-lake/guides/masking-policy.md +++ b/tidb-cloud-lake/guides/masking-policy.md @@ -7,6 +7,15 @@ summary: Masking policies protect sensitive data by dynamically transforming col Masking policies protect sensitive data by dynamically transforming column values during query execution. They enable role-based access to confidential information—authorized users see actual data, while others see masked values. +## When to Use + +- **Customer support systems**: agents see order records, but customer ID numbers display as `3201**********1234`. +- **Data analytics**: analysts run reports where email fields show as `***@***.com` without affecting aggregate statistics. 
+- **VARIANT logs**: everyone can query logs, but JSON fields like `secret_key` and `token` are invisible to non-admins. +- **Partial redaction**: show the last 4 digits of a credit card (`****-****-****-5678`) to support staff for verification. + +If you need to hide entire rows rather than redact column values, use a [Row Access Policy](/tidb-cloud-lake/guides/row-access-policy.md) instead. + ## How Masking Works Policies transform column data at query time, usually based on the caller’s role. @@ -262,9 +271,67 @@ SELECT * FROM events WHERE data['content'] IS NOT NULL; > ELSE delete_by_keypath(val, 'nested:secret') > ``` +## Masking Policy vs Row Access Policy + +| | Masking Policy | Row Access Policy | +|---|---|---| +| Scope | Column-level (transforms values) | Table-level (filters rows) | +| Return type | Must match column type | Always BOOLEAN | +| Per table | One per column | One per table | +| Affects | SELECT only | SELECT, UPDATE, DELETE, MERGE | + +A column cannot be protected by both a masking policy and a row access policy simultaneously. Use masking when you want all users to see the row but with sensitive fields redacted. Use row access when certain rows should be completely invisible to unauthorized users. + +## Limits and Requirements + +- A column can have at most one masking policy at a time. +- A column cannot be bound to both a masking policy and a row access policy simultaneously. +- The policy return type must match the target column's data type. +- A column protected by a masking policy cannot be directly altered or dropped — `UNSET MASKING POLICY` first. +- A policy cannot be dropped while it is still referenced by any table. Use `POLICY_REFERENCES()` to find all bindings. +- `CREATE OR REPLACE MASKING POLICY` is not supported. Drop and recreate instead. +- Masking policies cannot be applied to temporary tables, views, or streams. +- Masking only affects the read path (SELECT). INSERT, UPDATE, and DELETE operate on true values. 
+- Policy names are globally unique across both masking policies and row access policies. +- Policy argument names are normalized to lowercase at creation time. + +## Best Practices + +### Use is_role_in_session() over current_role() + +`current_role()` only checks the currently active role. Users can bypass masking by switching to an unrestricted role with `SET ROLE`. `is_role_in_session()` checks all granted roles regardless of which is active — it cannot be bypassed. + +```sql +-- Preferred +CASE WHEN is_role_in_session('managers') THEN val ELSE '*********' END + +-- Avoid: can be bypassed with SET ROLE +CASE WHEN current_role() = 'managers' THEN val ELSE '*********' END +``` + +### Minimize conditional columns in USING + +Every column in the `USING` clause is evaluated at runtime. If your masking logic only depends on the caller's role, don't reference extra columns. + +### Keep masked values type-consistent + +Returning `'***'` for an email column works, but if downstream logic uses `LENGTH()` or `LIKE`, consider returning a fixed-format value like `'***@***.com'` to avoid breaking application assumptions. + +### Use object_delete for VARIANT columns + +When hiding specific keys in JSON data, `object_delete(val, 'secret_key', 'token')` is more precise than replacing the entire value — other fields remain queryable. + +### Unbind before dropping + +`DROP MASKING POLICY` fails if any column still references it. Query `POLICY_REFERENCES(POLICY_NAME => '')` to find all bindings, then `UNSET MASKING POLICY` on each before dropping. + +### Test with restricted roles + +After creating a policy, use `SET ROLE` to switch to a restricted role and run SELECT queries to verify the masking effect. Don't assume correctness from admin-role testing alone. + ## Privileges & References -- Grant `CREATE MASKING POLICY` on `*.*` to any role responsible for creating or replacing policies; the creator automatically owns the policy. 
+- Grant `CREATE MASKING POLICY` on `*.*` to any role responsible for creating policies; the creator automatically owns the policy. - Grant the global `APPLY MASKING POLICY` privilege or `APPLY ON MASKING POLICY ` to roles that attach or detach policies via `ALTER TABLE`. - Audit access with `SHOW GRANTS ON MASKING POLICY `. - Additional references: diff --git a/tidb-cloud-lake/guides/row-access-policy.md b/tidb-cloud-lake/guides/row-access-policy.md index 05900ce65c520..6791a3d2adea3 100644 --- a/tidb-cloud-lake/guides/row-access-policy.md +++ b/tidb-cloud-lake/guides/row-access-policy.md @@ -7,6 +7,15 @@ summary: Row access policies protect data by filtering table rows at query time. Row access policies protect data by filtering table rows at query time. They let you define centralized row-level predicates once, attach them to tables, and ensure users only see rows that satisfy the policy. +## When to Use + +- **Multi-tenant SaaS**: each tenant only sees their own data — no need for separate tables or views per tenant. +- **Regional isolation**: sales reps only see orders in their territory; managers see everything. +- **Time-window control**: a real-time alerting system can only query the last 1 day; an offline analysis system can query 7 days. +- **Compliance auditing**: external auditors only see data from a specific time period. + +If you need all users to see the same rows but with certain column values redacted, use a [Masking Policy](/tidb-cloud-lake/guides/masking-policy.md) instead. + > **Note:** > > Row access policy is currently experimental. Enable it with `SET enable_experimental_row_access_policy = 1` for the current session or `SET GLOBAL enable_experimental_row_access_policy = 1` for the account. 
@@ -71,7 +80,7 @@ CREATE ROW ACCESS POLICY rap_engineering AS (dept STRING) RETURNS BOOLEAN -> CASE - WHEN current_role() = 'admin' THEN true + WHEN IS_ROLE_IN_SESSION('admin') THEN true WHEN dept = 'Engineering' THEN true ELSE false END; @@ -402,7 +411,51 @@ ALTER TABLE employees DROP ALL ROW ACCESS POLICIES; - `SELECT` is filtered by row access policies and only returns policy-visible rows. - `UPDATE`, `DELETE`, and `MERGE` are filtered by row access policies when matching target rows. Invisible target rows are not updated, deleted, or merged. - Drop or detach the policy before altering or dropping protected columns. -- `CREATE OR REPLACE ROW ACCESS POLICY` and `ALTER ROW ACCESS POLICY` are not supported. +- `CREATE OR REPLACE ROW ACCESS POLICY` and `ALTER ROW ACCESS POLICY` are not supported. Drop and recreate instead. +- Policy names are globally unique across both row access policies and masking policies. +- Policy argument names are normalized to lowercase at creation time. +- Row access policies cannot be applied to tables in ICE-type databases. + +## Best Practices + +### Use IS_ROLE_IN_SESSION() over current_role() + +`IS_ROLE_IN_SESSION()` checks all roles granted to the user, including secondary roles active in the session. Users cannot bypass the policy by switching roles with `SET ROLE`. `current_role()` only checks the single active role and can be circumvented. + +```sql +-- Preferred: accounts for role hierarchy +CASE + WHEN IS_ROLE_IN_SESSION('admin') THEN true + WHEN IS_ROLE_IN_SESSION('sales_apac') THEN region = 'APAC' + ELSE false +END + +-- Avoid: can be bypassed with SET ROLE +CASE WHEN current_role() = 'admin' THEN true ELSE false END +``` + +### Order CASE branches from widest to narrowest + +CASE expressions evaluate top-down. Put the most permissive condition first (e.g., admin sees all) to short-circuit evaluation for privileged roles and reduce unnecessary computation. 
+ +### Keep mapping tables in the same database + +If your policy references a lookup table (e.g., role-to-region mapping), store it in the same database as the protected table. This simplifies privilege management and avoids cross-database access issues. + +### Test with multiple roles + +After attaching a policy, connect as different users/roles and compare query results. Verify that: +- Admin roles see all rows +- Restricted roles see only their permitted subset +- Roles with no matching condition see zero rows + +### Use account_admin for full-data access + +When you need to inspect all rows (debugging, auditing), use a role that satisfies the policy (like `account_admin`) rather than repeatedly detaching and reattaching the policy. + +### Detach before dropping + +`DROP ROW ACCESS POLICY` fails if the policy is still attached to a table. Use `ALTER TABLE ... DROP ROW ACCESS POLICY` or `DROP ALL ROW ACCESS POLICIES` first. ## Privileges & References diff --git a/tidb-cloud-lake/guides/security-reliability.md b/tidb-cloud-lake/guides/security-reliability.md index 2235ed67c54fa..bd62b1199f94d 100644 --- a/tidb-cloud-lake/guides/security-reliability.md +++ b/tidb-cloud-lake/guides/security-reliability.md @@ -13,7 +13,6 @@ summary: "{{{ .lake }}} offers enterprise-grade security and reliability feature | [**Audit Trail**](/tidb-cloud-lake/guides/audit-trail.md) | Track database activities | When you need comprehensive audit trails for security monitoring, compliance, and performance analysis | | [**Network Policy**](/tidb-cloud-lake/guides/network-policy.md) | Restrict network access | When you want to limit connections to specific IP ranges even with valid credentials | | [**Password Policy**](/tidb-cloud-lake/guides/password-policy.md) | Set password requirements | When you need to enforce password complexity, rotation, and account lockout rules | -| [**Masking Policy**](/tidb-cloud-lake/guides/masking-policy.md) | Hide sensitive data | When you need to protect 
confidential data while still allowing authorized access | -| [**Row Access Policy**](/tidb-cloud-lake/guides/row-access-policy.md) | Filter rows dynamically | When users should only see rows that match role-aware access rules | +| [**Data Protection Policies**](/tidb-cloud-lake/guides/data-protection-policies.md) | Protect sensitive data at row and column level | When you need row-level filtering, column-level masking, or both | | [**Fail-Safe**](/tidb-cloud-lake/guides/fail-safe.md) | Prevent data loss | When you need to recover accidentally deleted data from S3-compatible storage | | [**Recovery from Errors**](/tidb-cloud-lake/guides/recovery-from-operational-errors.md) | Fix operational mistakes | When you need to recover from dropped databases/tables or incorrect data modifications | diff --git a/tidb-cloud-lake/guides/stage-overview.md b/tidb-cloud-lake/guides/stage-overview.md index 13fc4faf14d0f..1fa571e7ffdb9 100644 --- a/tidb-cloud-lake/guides/stage-overview.md +++ b/tidb-cloud-lake/guides/stage-overview.md @@ -75,6 +75,22 @@ The user stage can serve as a convenient repository for your data files that do LIST @~; ``` +## Filtering Staged Files with PATTERN + +Commands and functions that read, list, remove, or inspect staged files can use `PATTERN` to filter files by regular expression. For staged locations, `PATTERN` matches the file path portion after `@[/]`, not the full stage URI. 
+ +For example, with `@sales_stage/raw/`, the staged file `@sales_stage/raw/year=2025/month=01/sales_20250101.parquet` is matched as `year=2025/month=01/sales_20250101.parquet`: + +```sql +LIST @sales_stage/raw/ PATTERN = 'year=2025/month=01/.*[.]parquet'; +``` + +To match all `.log` files under a stage path, use a regular expression such as: + +```sql +LIST @my_stage PATTERN = '.*[.]log'; +``` + ## Managing Stages {{{ .lake }}} provides a variety of commands to assist you in managing stages and the files staged within them: diff --git a/tidb-cloud-lake/guides/task-management.md b/tidb-cloud-lake/guides/task-management.md index fb5bbc1832895..dfddd075e1e12 100644 --- a/tidb-cloud-lake/guides/task-management.md +++ b/tidb-cloud-lake/guides/task-management.md @@ -43,11 +43,15 @@ Click a task to view its execution history. The run history includes: ## Runtime Behavior by Task Type - S3 tasks can run once or continuously poll for new files. +- SQS (S3) tasks continuously poll the SQS queue, consume S3 object creation events, and write data into the target table until manually stopped. - MySQL `Snapshot` tasks usually stop automatically after the full load completes. - MySQL `CDC Only` and `Snapshot + CDC` tasks continue running until manually stopped. +- PostgreSQL `Snapshot` tasks usually stop automatically after the full load completes. +- PostgreSQL `CDC Only` and `Snapshot + CDC` tasks continue running until manually stopped. 
For field-level configuration and detailed behavior, continue with the relevant task guide: - [Amazon S3 Integration Task](/tidb-cloud-lake/guides/integrate-with-amazon-s3.md) +- [Amazon SQS (S3) Integration Task](/tidb-cloud-lake/guides/integrate-with-amazon-sqs-s3.md) - [MySQL Integration Task](/tidb-cloud-lake/guides/integrate-with-mysql.md) - [PostgreSQL Integration Task](/tidb-cloud-lake/guides/integrate-with-postgresql.md) diff --git a/tidb-cloud-lake/sql/copy-into-table.md b/tidb-cloud-lake/sql/copy-into-table.md index b815d8a240f70..0ca93be959424 100644 --- a/tidb-cloud-lake/sql/copy-into-table.md +++ b/tidb-cloud-lake/sql/copy-into-table.md @@ -187,7 +187,7 @@ copyOptions ::= - **FILES**: Specifies one or more file names (separated by commas) to be loaded. -- **PATTERN**: A [PCRE2](https://www.pcre.org/current/doc/html/)-based regular expression pattern string that specifies file names to match. See [Example 4: Filtering Files with Pattern](#example-4-filtering-files-with-pattern). +- **PATTERN**: A [PCRE2](https://www.pcre.org/current/doc/html/)-based regular expression pattern string that specifies file names to match. When loading from a stage, the pattern matches the part of the file path after `@[/]`. See [Filtering Staged Files with PATTERN](/tidb-cloud-lake/guides/stage-overview.md#filtering-staged-files-with-pattern) and [Example 4: Filtering Files with Pattern](#example-4-filtering-files-with-pattern). ## Format Type Options @@ -538,21 +538,21 @@ COPY INTO mytable ); ``` -When specifying the pattern for a file path including multiple folders, consider your matching criteria: +When specifying the pattern for staged files in paths with multiple folders, remember that the pattern matches only the path portion after `@[/]`. For example, with `FROM @sales_stage/raw/`, the file `@sales_stage/raw/year=2025/month=01/sales_20250101.parquet` is matched as `year=2025/month=01/sales_20250101.parquet`. 
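
To make the relative-path rule concrete, the Python sketch below mimics it: it strips the stage location from the file path and tests the regular expression against what remains. It is only an approximation under stated assumptions. Python's `re` module stands in for PCRE2, the pattern is assumed to have to match the whole relative path, and `matches_pattern` is a hypothetical helper, not part of {{{ .lake }}}.

```python
import re


def matches_pattern(stage_location, file_path, pattern):
    """Return True if `pattern` matches the part of `file_path` after `stage_location`."""
    if not file_path.startswith(stage_location):
        return False
    # Only the portion after the stage location takes part in matching.
    relative_path = file_path[len(stage_location):]
    return re.fullmatch(pattern, relative_path) is not None


if __name__ == "__main__":
    location = "@sales_stage/raw/"
    path = "@sales_stage/raw/year=2025/month=01/sales_20250101.parquet"
    # Pattern written relative to the stage location: matches.
    print(matches_pattern(location, path, r"year=2025/month=01/.*[.]parquet"))  # True
    # Pattern that repeats the `raw/` prefix from the location: does not match.
    print(matches_pattern(location, path, r"raw/year=2025/.*[.]parquet"))  # False
```

The second call shows the common mistake: repeating part of the stage location inside the pattern, which then fails because that prefix is no longer part of the matched path.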
-- If you want to match a specific subpath following a prefix, include the prefix in the pattern (e.g., 'multi_page/') and then specify the pattern you want to match within that subpath (e.g., '\_page_1'). +- If you want to match a specific subpath following a prefix, include the prefix in the pattern (e.g., 'year=2025/month=01/') and then specify the pattern you want to match within that subpath (e.g., 'sales_'). -```sql --- File path: parquet/multi_page/multi_page_1.parquet -COPY INTO ... FROM @data/parquet/ PATTERN = 'multi_page/.*_page_1.*') ... -``` + ```sql + -- File path: raw/year=2025/month=01/sales_20250101.parquet + COPY INTO ... FROM @sales_stage/raw/ PATTERN = 'year=2025/month=01/.*sales_.*[.]parquet') ... + ``` -- If you want to match any part of the file path that contains the desired pattern, use '.*' before and after the pattern (e.g., '.*multi_page_1.\*') to match any occurrences of 'multi_page_1' within the path. +- If you want to match any part of the file path that contains the desired pattern, use '.*' before and after the pattern (e.g., '.*sales_20250101.*') to match any occurrences of 'sales_20250101' within the path. -```sql --- File path: parquet/multi_page/multi_page_1.parquet -COPY INTO ... FROM @data/parquet/ PATTERN ='.*multi_page_1.*') ... -``` + ```sql + -- File path: raw/year=2025/month=01/sales_20250101.parquet + COPY INTO ... FROM @sales_stage/raw/ PATTERN = '.*sales_20250101.*') ... + ``` ### Example 5: Loading to Table with Extra Columns diff --git a/tidb-cloud-lake/sql/create-masking-policy.md b/tidb-cloud-lake/sql/create-masking-policy.md index 5d7b0ad66b2d8..85462f413e777 100644 --- a/tidb-cloud-lake/sql/create-masking-policy.md +++ b/tidb-cloud-lake/sql/create-masking-policy.md @@ -14,7 +14,7 @@ Creates a new masking policy in {{{ .lake }}}. ## Syntax ```sql -CREATE [ OR REPLACE ] MASKING POLICY [ IF NOT EXISTS ] AS +CREATE MASKING POLICY [ IF NOT EXISTS ] AS ( [ , ... 
] ) RETURNS -> [ COMMENT = '' ] @@ -38,7 +38,7 @@ CREATE [ OR REPLACE ] MASKING POLICY [ IF NOT EXISTS ] AS | Privilege | Description | |:----------|:------------| -| CREATE MASKING POLICY | Required to create or replace a masking policy. Typically granted on `*.*`. | +| CREATE MASKING POLICY | Required to create a masking policy. Typically granted on `*.*`. | {{{ .lake }}} automatically grants OWNERSHIP on the new masking policy to the current role so that it can manage the policy with others. diff --git a/tidb-cloud-lake/sql/infer-schema.md b/tidb-cloud-lake/sql/infer-schema.md index 6d120b9f414f0..109a8a3c6b727 100644 --- a/tidb-cloud-lake/sql/infer-schema.md +++ b/tidb-cloud-lake/sql/infer-schema.md @@ -47,7 +47,7 @@ INFER_SCHEMA( | Parameter | Description | Default | Example | |-----------|-------------|---------|---------| | `LOCATION` | Stage location: `@[/]` | Required | `'@my_stage/data/'` | -| `PATTERN` | File name pattern to match | All files | `'*.csv'`, `'*.parquet'` | +| `PATTERN` | Regular expression pattern to match staged files. It matches the file path portion after `@[/]`. See [Filtering Staged Files with PATTERN](/tidb-cloud-lake/guides/stage-overview.md#filtering-staged-files-with-pattern). 
| All files | `'.*[.]csv'`, `'.*[.]parquet'` | | `FILE_FORMAT` | File format name for parsing | Stage's format | `'csv_format'`, `'NDJSON'` | | `MAX_RECORDS_PRE_FILE` | Max records to sample per file | All records | `100`, `1000` | | `MAX_FILE_COUNT` | Max number of files to process | All files | `5`, `10` | @@ -64,7 +64,7 @@ COPY INTO @test_parquet FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYP -- Infer schema from parquet files using pattern SELECT * FROM INFER_SCHEMA( location => '@test_parquet', - pattern => '*.parquet' + pattern => '.*[.]parquet' ); ``` @@ -91,7 +91,7 @@ CREATE FILE FORMAT csv_format TYPE = 'CSV'; -- Infer schema using pattern and file format SELECT * FROM INFER_SCHEMA( location => '@test_csv', - pattern => '*.csv', + pattern => '.*[.]csv', file_format => 'csv_format' ); ``` @@ -135,7 +135,7 @@ Limit records for faster inference: -- Sample only first 5 records for schema inference SELECT * FROM INFER_SCHEMA( location => '@test_csv', - pattern => '*.csv', + pattern => '.*[.]csv', file_format => 'csv_format', max_records_pre_file => 5 ); @@ -151,7 +151,7 @@ COPY INTO @test_ndjson FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE -- Infer schema using pattern and NDJSON format SELECT * FROM INFER_SCHEMA( location => '@test_ndjson', - pattern => '*.ndjson', + pattern => '.*[.]ndjson', file_format => 'NDJSON' ); ``` @@ -172,7 +172,7 @@ Limit records for faster inference: -- Sample only first 5 records for schema inference SELECT * FROM INFER_SCHEMA( location => '@test_ndjson', - pattern => '*.ndjson', + pattern => '.*[.]ndjson', file_format => 'NDJSON', max_records_pre_file => 5 ); @@ -190,7 +190,7 @@ When files have different schemas, `infer_schema` merges them intelligently: SELECT * FROM INFER_SCHEMA( location => '@my_stage/', - pattern => '*.csv', + pattern => '.*[.]csv', file_format => 'csv_format' ); ``` @@ -215,7 +215,7 @@ Use pattern matching to infer schema from multiple files: -- Infer schema from all CSV files in the 
directory SELECT * FROM INFER_SCHEMA( location => '@my_stage/', - pattern => '*.csv' + pattern => '.*[.]csv' ); ``` @@ -225,7 +225,7 @@ Limit the number of files processed to improve performance: -- Process only the first 5 matching files SELECT * FROM INFER_SCHEMA( location => '@my_stage/', - pattern => '*.csv', + pattern => '.*[.]csv', max_file_count => 5 ); ``` @@ -253,7 +253,7 @@ The `infer_schema` function displays the schema but doesn't create tables. To cr ```sql -- Create table structure from file schema CREATE TABLE my_table AS -SELECT * FROM @my_stage/ (pattern=>'*.parquet') +SELECT * FROM @my_stage/ (pattern=>'.*[.]parquet') LIMIT 0; -- Verify the table structure diff --git a/tidb-cloud-lake/sql/list-stage-files.md b/tidb-cloud-lake/sql/list-stage-files.md index 85dfd0b6a2aae..2558596443de7 100644 --- a/tidb-cloud-lake/sql/list-stage-files.md +++ b/tidb-cloud-lake/sql/list-stage-files.md @@ -19,6 +19,8 @@ See also: LIST { userStage | internalStage | externalStage } [ PATTERN = '' ] ``` +`PATTERN` filters staged files by regular expression. It matches the file path portion after `@[/]`. See [Filtering Staged Files with PATTERN](/tidb-cloud-lake/guides/stage-overview.md#filtering-staged-files-with-pattern). + ## Examples The stage below contains a file named **books.parquet** and a folder named **2023**. 
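Reviewer note: the `PATTERN` changes in the hunks above rest on two rules. The pattern is applied to the file path portion after the stage location, and it is a regular expression rather than a glob. The Python sketch below illustrates both under stated assumptions. The helpers `relative_path`, `pattern_matches`, and `is_valid_regex` are hypothetical names and not part of any product API. Whole-path (full-match) semantics are an assumption inferred from the `'.log'` to `'.*[.]log'` fix, not something the diff states. Python's `re` module stands in for PCRE2, which should behave the same way for patterns this simple.

```python
import re


def relative_path(staged_file: str, location: str) -> str:
    """Return the part of a staged file path that follows the stage location.

    Hypothetical helper: both arguments are written like '@stage/path/...'.
    """
    prefix = location if location.endswith("/") else location + "/"
    if not staged_file.startswith(prefix):
        raise ValueError(f"{staged_file!r} is not under {location!r}")
    return staged_file[len(prefix):]


def pattern_matches(staged_file: str, location: str, pattern: str) -> bool:
    """Assumption: the pattern must match the whole relative path."""
    rel = relative_path(staged_file, location)
    return re.fullmatch(pattern, rel) is not None


def is_valid_regex(pattern: str) -> bool:
    """Report whether Python's regex engine accepts the pattern at all."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


FILE = "@sales_stage/raw/year=2025/month=01/sales_20250101.parquet"
LOCATION = "@sales_stage/raw/"

# The path is matched relative to the stage location used in the statement.
rel = relative_path(FILE, LOCATION)

# Matches: the pattern starts after 'raw/'.
regex_ok = pattern_matches(FILE, LOCATION, "year=2025/month=01/.*[.]parquet")

# Does not match: repeating the 'raw/' prefix inside the pattern.
with_prefix = pattern_matches(FILE, LOCATION, "raw/year=2025/.*[.]parquet")

# Under full-match semantics '.log' only matches a four-character path.
bare_log = pattern_matches("@my_stage/app.log", "@my_stage", ".log")
full_log = pattern_matches("@my_stage/app.log", "@my_stage", ".*[.]log")

# A glob such as '*.csv' is not a valid regular expression in Python's engine.
glob_valid = is_valid_regex("*.csv")
```

In this sketch, `regex_ok` and `full_log` are true, while `with_prefix`, `bare_log`, and `glob_valid` are false. That is consistent with the replacements of `'*.csv'` by `'.*[.]csv'` and `'.log'` by `'.*[.]log'` in this change, although the actual engine behavior should be confirmed against the product itself.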
@@ -54,10 +56,10 @@ LIST @my_internal_stage/2023/; +-----------------+------+------------------------------------+-------------------------------+---------+ ``` -To list all the files with the extension *.log in the stage, run the following command: +To list all the files with the extension `.log` in the stage, run the following command: ```sql -LIST @my_internal_stage PATTERN = '.log'; +LIST @my_internal_stage PATTERN = '.*[.]log'; +----------------+------+------------------------------------+-------------------------------+---------+ | name | size | md5 | last_modified | creator | +----------------+------+------------------------------------+-------------------------------+---------+ diff --git a/tidb-cloud-lake/sql/list-stage.md b/tidb-cloud-lake/sql/list-stage.md index c09b81786acc3..50bc9af1a6b35 100644 --- a/tidb-cloud-lake/sql/list-stage.md +++ b/tidb-cloud-lake/sql/list-stage.md @@ -42,7 +42,7 @@ userStage ::= @~[/] ### PATTERN -See [COPY INTO table](/tidb-cloud-lake/sql/copy-into-table.md). +Filters staged files by regular expression. It matches the file path portion after `@[/]`. See [Filtering Staged Files with PATTERN](/tidb-cloud-lake/guides/stage-overview.md#filtering-staged-files-with-pattern). ## Examples @@ -56,5 +56,5 @@ SELECT * FROM list_stage(location => '@my_stage/', pattern => '.*[.]log'); +----------------+------+------------------------------------+-------------------------------+---------+ -- Equivalent to the following statement: -LIST @my_stage PATTERN = '.log'; +LIST @my_stage PATTERN = '.*[.]log'; ``` diff --git a/tidb-cloud-lake/sql/remove-stage-files.md b/tidb-cloud-lake/sql/remove-stage-files.md index 1ef6c1932ab95..3c40f307f9497 100644 --- a/tidb-cloud-lake/sql/remove-stage-files.md +++ b/tidb-cloud-lake/sql/remove-stage-files.md @@ -34,7 +34,7 @@ externalStage ::= @[/] ### PATTERN = 'regex_pattern' -A regular expression pattern string, enclosed in single quotes, filters files to remove by their filename. 
+A regular expression pattern string, enclosed in single quotes, filters staged files to remove. It matches the file path portion after `@[/]`. See [Filtering Staged Files with PATTERN](/tidb-cloud-lake/guides/stage-overview.md#filtering-staged-files-with-pattern). ## Examples diff --git a/tidb-cloud-lake/sql/system-history-tables.md b/tidb-cloud-lake/sql/system-history-tables.md index 8896c28d9cc74..456f02d140904 100644 --- a/tidb-cloud-lake/sql/system-history-tables.md +++ b/tidb-cloud-lake/sql/system-history-tables.md @@ -50,8 +50,8 @@ GRANT ROLE audit_team TO USER compliance_officer; ### Self-Hosted {{{ .lake }}} -
-📝 **Manual configuration required** - Click to expand configuration details +
+Manual configuration required - Click to expand configuration details #### Minimal Configuration @@ -116,6 +116,8 @@ table_name = "login_history" table_name = "access_history" ``` -> ⚠️ **Note:** When changing storage configuration, existing history tables will be dropped and recreated. +> **Note:** +> +> When changing storage configuration, existing history tables will be dropped and recreated.