“cwensel” — Fediverse search results on home.social

Chris K Wensel @cwensel · 2023-11-17 · 17:40 UTC

I finally wrote up some documentation on using #clusterless Tessellate for #data stuff

https://docs.clusterless.io/tessellate/1.0-wip/index.html

#clusterless #data

Chris K Wensel @cwensel · 2023-11-13 · 16:38 UTC

I'm hosting my #clusterless docs on github, but #google is refusing to index the site fully.

it shows "currently not indexed" in the search console. which in turn claims they don't want to overload the site.

I get the sense that this is a shared problem for anyone hosting on github..

any suggestions on alternatives or hacks?

#clusterless #google

Chris K Wensel @cwensel · 2023-11-08 · 15:34 UTC

Released a new wip of #clusterless, for #cloud #data pipelines, last night that includes reporting on both arcs (workloads) and datasets.

https://github.com/ClusterlessHQ/clusterless

Below is a summary of the three datasets the #AWS s3 log sample app creates.

https://github.com/ClusterlessHQ/clusterless-aws-examples

Note we track the difference between intervals that have no data (empty, which may be intentional) vs a gap (the workload didn't run and create data).
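A rough sketch of the empty vs gap distinction described above. This is not how Clusterless implements it; the function name, the state names, and the interval labels are all made up for illustration. The assumption is that for each expected interval we know whether the workload ran and whether it produced data.

```python
from enum import Enum


class IntervalState(Enum):
    COMPLETE = "complete"  # the workload ran and wrote data
    EMPTY = "empty"        # the workload ran but wrote nothing (may be intentional)
    GAP = "gap"            # the workload never ran for this interval


def classify(expected, ran, has_data):
    """Label every expected interval as complete, empty, or a gap.

    expected -- ordered list of interval ids (hypothetical labels)
    ran      -- set of interval ids the workload ran for
    has_data -- set of interval ids that have data in the dataset
    """
    report = {}
    for interval in expected:
        if interval not in ran:
            report[interval] = IntervalState.GAP
        elif interval in has_data:
            report[interval] = IntervalState.COMPLETE
        else:
            report[interval] = IntervalState.EMPTY
    return report


# Hypothetical five-minute intervals
expected = ["2023-11-07T00:00", "2023-11-07T00:05", "2023-11-07T00:10"]
ran = {"2023-11-07T00:00", "2023-11-07T00:05"}
has_data = {"2023-11-07T00:00"}

report = classify(expected, ran, has_data)
for interval, state in report.items():
    print(interval, state.value)
```

Only the gap is a candidate for a re-run; the empty interval is a valid result.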

#clusterless #cloud #data #aws

Chris K Wensel @cwensel · 2023-11-07 · 00:36 UTC

getting closer...

here is a screenshot of the #clusterless cls command printing a summary table of workload (arc) completions since yesterday

I need to release 2.0 of the #java library mini-parsers into maven central before I can push this out and begin work on dataset status (think fsck for workload results)

#clusterless #java

Chris K Wensel @cwensel · 2023-11-02 · 04:28 UTC

hoping to make time to get another #clusterless release out this week.

I have commands to list deployed placements (regions etc), projects, and arcs (workloads). still need to get deployed datasets.

and, status reporting of both arcs and datasets.

that is, completed and failed arcs. and dataset completions, partials, empties, and gaps.

if a gap is found, the arc was skipped or failed. this is where you can re-run workloads deterministically, from the cli.

#clusterless

Chris K Wensel @cwensel · 2023-10-26 · 18:16 UTC

I'm thinking of resurrecting some code I have for Splunk-like relative time adjusters

https://docs.splunk.com/Documentation/Splunk/9.1.1/Search/Specifytimemodifiersinyoursearch#Specify_relative_time_ranges

the #java library mini-parsers is due for an update, modern parboiled supports jdk17 now.

https://github.com/Heretical/mini-parsers

and #clusterless status reporting needs time range support on the cli and the splunk syntax is fairly concise.
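For a sense of why the syntax is concise, here is a minimal sketch of a relative time adjuster for a small subset of the Splunk-style modifiers (an optional signed offset, then an optional `@` snap, with units s, m, h, d only). It is not the mini-parsers API; `adjust` and its helpers are hypothetical names, and weeks, months, and chained modifiers are left out.

```python
import re
from datetime import datetime, timedelta

# Units supported by this sketch, mapped to timedelta keyword names
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}

# [+|-][amount]<unit> optionally followed by @<snap unit>, e.g. -1d@d, -2h, @h
_PATTERN = re.compile(r"^(?:([+-])(\d*)([smhd]))?(?:@([smhd]))?$")


def _snap(moment, unit):
    """Truncate a datetime to the start of the given unit."""
    if unit == "s":
        return moment.replace(microsecond=0)
    if unit == "m":
        return moment.replace(second=0, microsecond=0)
    if unit == "h":
        return moment.replace(minute=0, second=0, microsecond=0)
    return moment.replace(hour=0, minute=0, second=0, microsecond=0)


def adjust(now, modifier):
    """Apply the offset first, then the snap, as in -1d@d."""
    match = _PATTERN.match(modifier)
    if not modifier or not match:
        raise ValueError(f"unsupported time modifier: {modifier!r}")
    sign, amount, unit, snap = match.groups()
    result = now
    if unit:
        # A missing amount defaults to 1, so -h means one hour ago
        delta = timedelta(**{_UNITS[unit]: int(amount) if amount else 1})
        result = result + delta if sign == "+" else result - delta
    if snap:
        result = _snap(result, snap)
    return result


now = datetime(2023, 10, 26, 18, 16, 30)
print(adjust(now, "-1d@d"))  # start of yesterday
print(adjust(now, "-2h"))    # two hours ago
print(adjust(now, "@h"))     # top of the current hour
```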

anyone else interested in parser support?

#java #clusterless

Chris K Wensel @cwensel · 2023-10-24 · 23:15 UTC

Updated the #clusterless install docs to reflect the homebrew install option

https://docs.clusterless.io/guide/1.0-wip/install-quickstart.html

#clusterless

Chris K Wensel @cwensel · 2023-10-24 · 15:32 UTC

ok, finally! the #clusterless and #tessellate wip builds are published to #homebrew

https://github.com/ClusterlessHQ/homebrew-tap

I'll update all the install docs this week.

#clusterless #tessellate #homebrew

Chris K Wensel @cwensel · 2023-10-18 · 22:06 UTC

here is a little pre-announcement of a new #clusterless library clusterless-commons

https://github.com/ClusterlessHQ/clusterless-commons

currently available in maven central.. but still underdocumented etc.

this project allows for sharing of some core libraries I find useful developing clusterless and tessellate. as well as some basics to help with #aws cdk development.

i'll make a bigger announcement as it matures.

#clusterless #aws

Chris K Wensel @cwensel · 2023-10-18 · 15:27 UTC

I love how #Hopin adds stuff to your agenda for you, and you can't remove it.

Now i'll ignore the agenda I just crafted.

#hopin

Chris K Wensel @cwensel · 2023-10-11 · 18:55 UTC

probably time i sort out a real logo for #clusterless

https://github.com/ClusterlessHQ

is 99designs still a thing?

#clusterless

Chris K Wensel @cwensel · 2023-10-11 · 02:55 UTC

I've added a new how-to guide on creating a copy #data pipeline in #AWS s3 using only #clusterless intrinsic components. As files get uploaded, they get copied to a new location.

https://docs.clusterless.io/guide/1.0-wip/howtos/s3-copy-files.html

This roughly mirrors the example project, but has a bunch more explainers and examples on using the cls command to build a project file.

https://github.com/ClusterlessHQ

#data #aws #clusterless

Chris K Wensel @cwensel · 2023-10-07 · 19:17 UTC

Long weekend of yard work ahead but look forward to completing a set of improved #clusterless documentation early next week.

Using jackson json views I can print out json for required properties and full json so configuring a simple pipeline doesn’t seem so daunting.

These in turn can be embedded directly in the docs online and help messages.

#clusterless

Chris K Wensel @cwensel · 2023-10-05 · 20:47 UTC

I added include/exclude filters to the S3PutListenerBoundary and S3CopyArc components to #clusterless

Now you can use ant-like paths to exclude hidden files in #AWS s3 buckets, e.g. _SUCCESS with an exclude on **/_*
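To illustrate how an ant-style exclude such as `**/_*` behaves, here is a minimal sketch that translates a pattern into a regular expression and tests S3-style keys against it. This is not the Clusterless implementation, and the exact matching rules of the components may differ; `ant_to_regex` and `is_excluded` are hypothetical helpers, and the keys are made up.

```python
import re


def ant_to_regex(pattern):
    """Translate a simplified ant-style path pattern into a compiled regex.

    **/  matches zero or more whole directories
    *    matches any run of characters within one path segment
    ?    matches a single character within one path segment
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")


def is_excluded(key, excludes):
    """True if the key matches any of the exclude patterns."""
    return any(ant_to_regex(p).match(key) for p in excludes)


excludes = ["**/_*"]
for key in ["_SUCCESS", "logs/2023/_SUCCESS", "logs/2023/part-0000.gz"]:
    print(key, "excluded" if is_excluded(key, excludes) else "kept")
```

In this sketch the pattern only looks at the last path segment, so a file under a directory named `_tmp` is kept unless its own name starts with an underscore.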

https://docs.clusterless.io/reference/1.0-wip/components/aws-core-s3-put-listener-boundary.html

https://docs.clusterless.io/reference/1.0-wip/components/aws-core-s3-copy-arc.html

#clusterless #aws

Chris K Wensel @cwensel · 2023-09-30 · 18:38 UTC

How's this for a #clusterless tag line?

Think #cloud + #airflow without the airflow, but with a lot more trust and agility.

#clusterless #cloud #airflow

Chris K Wensel @cwensel · 2023-09-29 · 14:35 UTC

I've started publishing how-tos on using #clusterless to manage #cloud #data pipelines.

A little overkill, but the first is how to manage an #aws s3 bucket.

https://docs.clusterless.io/guide/1.0-wip/howtos/s3-bucket-resource.html

#clusterless #cloud #data #aws

Chris K Wensel @cwensel · 2023-09-27 · 00:36 UTC

Having a scenario runner for #data flows in #clusterless is pretty cool for automated testing of dags of workloads.

especially if they are part of your ci/cd

https://github.com/ClusterlessHQ/clusterless/actions/runs/6319702471/job/17161069843

on every commit, a suite of scenarios is deployed, run, and destroyed against #aws

https://github.com/ClusterlessHQ/clusterless/tree/wip-1.0/clusterless-scenario/src/main/cls/scenarios

#data #clusterless #aws

Chris K Wensel @cwensel · 2023-09-21 · 16:28 UTC

would be great to have time (be paid) to create a sample #clusterless app on #AWS to put data behind a #DuckDB WASM frontend

https://duckdb.org/2021/10/29/duckdb-wasm.html

So instead of Athena/Glue integration that can work against a complete corpus, have DuckDB over the last 30 days for investigations etc

https://github.com/ClusterlessHQ/aws-s3-log-pipeline

quick reminder, https://chris.wensel.net

I want to do this with #sqlite https://datasette.io as well (I did it in a previous role and it was awesome)

#clusterless #aws #duckdb #sqlite

Chris K Wensel @cwensel · 2023-09-20 · 21:53 UTC

Just pushed a new #clusterless Tessellate release that updates the transform statement syntax to include intrinsic functions.

The first function is tsid, a unique long value generated by the https://github.com/f4b6a3/tsid-creator library.

More here: https://github.com/ClusterlessHQ/tessellate#intrinsic-functions

https://github.com/ClusterlessHQ/tessellate

#clusterless

Chris K Wensel @cwensel · 2023-09-01 · 02:17 UTC

@Cmastication depends on how you access it? If via a query, only partition on the most common predicates.

Repartitioning data for different access patterns is a key use case behind #clusterless and tessellate. See bio for links.

Otherwise yeah, partition via hash to get equal sized bits. Reminds me to add a hash transform to tessellate.
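A minimal sketch of partitioning via hash, assuming string keys and a fixed partition count. `partition_for` is a hypothetical helper, not a tessellate transform, and SHA-256 is used only because it is stable across processes (Python's built-in `hash` is salted per run).

```python
import hashlib
from collections import Counter


def partition_for(key, partitions):
    """Map a key to a partition number in [0, partitions)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer
    return int.from_bytes(digest[:8], "big") % partitions


# Hypothetical keys spread across 8 partitions
counts = Counter(partition_for(f"user-{n}", 8) for n in range(10_000))
for partition in sorted(counts):
    print(partition, counts[partition])
```

The same key always lands in the same partition, and the sizes come out roughly equal as long as the keys themselves are not heavily skewed.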

#clusterless

Chris K Wensel @cwensel · 2023-08-17 · 23:17 UTC

I'll make a bigger announcement later, but if you are following along with #clusterless development, note that we just added #AWS Glue/Athena support.

That means databases and tables can be deployed in tandem with a workload, and any new partitions that arrive will be added to the table.

https://github.com/ClusterlessHQ/clusterless

This example has been updated to show how it works and how simple it is (relatively).

https://github.com/ClusterlessHQ/aws-s3-log-pipeline

So imagine, every result dataset in the dag having a table to query.

#clusterless #aws

Chris K Wensel @cwensel · 2023-08-16 · 17:03 UTC

just sayin', if you find chaining sql statements into a data processing dag a bit of a drag, I suggest you spend some time with #clusterless

https://github.com/ClusterlessHQ

declarative decentralized heterogeneous #data flows (in #aws today)

#clusterless #data #aws