The Hacker News archive is hosted in ClickHouse as a publicly accessible data lake. It is available without sign-up and is updated in real time. Example:
# Download ClickHouse:
curl https://clickhouse.com/ | sh
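# Run clickhouse-local (no server needed, queries run in-process):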
./clickhouse local
# Attach the table:
CREATE TABLE hackernews_history UUID '66491946-56e3-4790-a112-d2dc3963e68a'
(
    update_time DateTime DEFAULT now(),
    id UInt32,
    deleted UInt8,
    type Enum8('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
    by LowCardinality(String),
    time DateTime,
    text String,
    dead UInt8,
    parent UInt32,
    poll UInt32,
    kids Array(UInt32),
    url String,
    score Int32,
    title String,
    parts Array(UInt32),
    descendants Int32
)
-- ReplacingMergeTree deduplicates by the ORDER BY key (id), keeping the row with the latest update_time
ENGINE = ReplacingMergeTree(update_time)
ORDER BY id
SETTINGS refresh_parts_interval = 60,
    -- attach a read-only, S3-compatible disk pointing at the public bucket
    disk = disk(
        readonly = true,
        type = 's3_plain_rewritable',
        endpoint = 'https://clicklake-test-2.s3.eu-central-1.amazonaws.com/',
        use_environment_credentials = false);
# Run queries:
SELECT time, decodeHTMLComponent(extractTextFromHTML(text)) AS t
FROM hackernews_history ORDER BY time DESC LIMIT 10 \G
# Download everything as Parquet/JSON/CSV...
SELECT * FROM hackernews_history INTO OUTFILE 'dump.parquet'
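As a further usage sketch of my own (it only uses columns from the CREATE TABLE above; the numbers are whatever the live data returns), aggregations run directly against the bucket:
# Example: high-scoring stories per year
SELECT toYear(time) AS year, count() AS stories
FROM hackernews_history
WHERE type = 'story' AND score >= 100
GROUP BY year
ORDER BY year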
Honestly, I don't understand how Cloudflare thinks this is a higher priority than versioning, replication of buckets, or even geo-distribution of objects.
It's a strange direction. I thought Cloudflare viewed R2 mostly as competition for S3 when used as CDN backing storage (a very natural place to compete). For which, btw, it is great -- I use it seamlessly for ActiveStorage, and not only is it way cheaper, but configuring it is about 100x simpler than the S3/CloudFront/random ACLs/signed cookies stuff.
I vaguely remember reading comments here saying you can get rate-limited on R2 without warning if egress is too high. Was that true, and is it still true? If so, what is the limit?
I tried looking for that thread again and only found the exact opposite comment, from the Cloudflare founder:
> Not abuse. Thanks for being a customer. Bandwidth at scale is effectively free.[0]
I distinctly remember such a thread, though.
Edit: I did find these, but neither is what I remember:
https://news.ycombinator.com/item?id=42263554
https://news.ycombinator.com/item?id=33337183
[0] https://news.ycombinator.com/item?id=38124676
This post also introduces Iceberg pretty nicely. Details on Class A vs Class B operations are here[0].
What kind of latency/throughput are people getting from R2? Does it benefit from parallelism in the same way s3 does?
[0]: https://developers.cloudflare.com/r2/pricing/#class-a-operat...
> What kind of latency/throughput are people getting from R2?
Not sure about now, but upload speeds were very inconsistent when we tested it a year or so ago.
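One way to get a rough feel for R2 read throughput yourself (a sketch, not from the post; the account ID, bucket, credentials, and object layout below are placeholders) is ClickHouse's s3 table function, since R2 speaks the S3 API and a glob of objects is read in parallel:
SELECT count()
FROM s3('https://<account-id>.r2.cloudflarestorage.com/<bucket>/data/*.parquet',
        '<access_key_id>', '<secret_access_key>', 'Parquet')
SETTINGS max_threads = 16  -- rerun with different thread counts and compare wall-clock time
Timing the same query at 1, 8, and 32 threads gives a crude answer to the parallelism question.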
Woo this is cool! I hope they start hosting public datasets like Google does for BigQuery, such as (wink wink) Hacker News archive.
Also available on the public Playground: https://play.clickhouse.com/
Nice! And the CREATE TABLE in that example is exactly why I'd love to have it with a catalog ;-)
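To sketch what a catalog would buy you (my assumption of the workflow, not something from the thread; the bucket path and credentials are made up): if the data is published as an Iceberg table, ClickHouse can take the schema from the Iceberg metadata via the icebergS3 table function, so the hand-written column list above isn't needed; a REST catalog like R2 Data Catalog goes one step further and resolves the table's location by name.
-- schema is read from the Iceberg table metadata, no explicit column list
SELECT count()
FROM icebergS3('https://<bucket>.s3.<region>.amazonaws.com/warehouse/hackernews/',
               '<access_key_id>', '<secret_access_key>')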