Search
103 results for “cwensel”
-
I finally wrote up some documentation on using #clusterless Tessellate for #data stuff
-
I'm hosting my #clusterless docs on github, but #google is refusing to index the site fully.
the search console shows "currently not indexed", and the reason it gives is that they don't want to overload the site.
I get the sense that this is a shared problem for anyone hosting on github..
any suggestions on alternatives or hacks?
-
Released a new wip of #clusterless, for #cloud #data pipelines, last night that includes reporting on both arcs (workloads) and datasets.
https://github.com/ClusterlessHQ/clusterless
Below is a summary of the three datasets the #AWS s3 log sample app creates.
https://github.com/ClusterlessHQ/clusterless-aws-examples
Note we track the difference between intervals that have no data (empty, which may be intentional) vs a gap (the workload didn't run and create data).
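A rough sketch of the empty-vs-gap distinction described above, not the clusterless implementation. The names `IntervalState` and `classify`, and the idea of passing in a map of interval id to object count, are assumptions made up for illustration.

```python
from enum import Enum


class IntervalState(Enum):
    COMPLETE = "complete"  # the workload ran and wrote data
    EMPTY = "empty"        # the workload ran, but there was nothing to write
    GAP = "gap"            # the workload never ran for this interval


def classify(expected, completed):
    """Label every expected interval.

    expected:  ordered list of interval ids (hypothetical format)
    completed: dict of interval id -> number of objects written, holding
               only the intervals where the workload actually ran
    """
    report = {}
    for interval in expected:
        if interval not in completed:
            report[interval] = IntervalState.GAP
        elif completed[interval] == 0:
            report[interval] = IntervalState.EMPTY
        else:
            report[interval] = IntervalState.COMPLETE
    return report
```

The point of keeping the two states apart is that an empty interval needs no action, while a gap is a candidate for a re-run.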
-
getting closer...
here is a screenshot of the #clusterless cls command printing a summary table of workload (arc) completions since yesterday
I need to release 2.0 of the #java library mini-parsers into maven central before I can push this out and begin work on dataset status (think fsck for workload results)
-
hoping to make time to get another #clusterless release out this week.
I have commands to list deployed placements (regions etc), projects, and arcs (workloads). still need to get deployed datasets.
and, status reporting of both arcs and datasets.
that is, completed and failed arcs. and dataset completions, partials, empties, and gaps.
if a gap is found, the arc was skipped or failed. this is where you can re-run workloads deterministically, from the cli.
-
I'm thinking of resurrecting some code I have for Splunk-like relative time adjusters
the #java library mini-parsers is due for an update; modern parboiled supports jdk17 now.
https://github.com/Heretical/mini-parsers
and #clusterless status reporting needs time range support on the cli and the splunk syntax is fairly concise.
anyone else interested in parser support?
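For anyone unfamiliar with the syntax, here is a minimal sketch of what a Splunk-style relative time adjuster does. It is not the mini-parsers API. The function names `adjust` and `snap` are hypothetical, and only a subset of the syntax is handled: an optional signed offset in s, m, h, or d, followed by an optional "@unit" snap.

```python
import re
from datetime import datetime, timedelta

# Subset of Splunk-style units; weeks, months, quarters, years are left out.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}
_PATTERN = re.compile(r"(?:([+-])(\d*)([smhd]))?(?:@([smhd]))?")


def snap(ts, unit):
    """Truncate ts down to the start of the given unit."""
    if unit == "s":
        return ts.replace(microsecond=0)
    if unit == "m":
        return ts.replace(second=0, microsecond=0)
    if unit == "h":
        return ts.replace(minute=0, second=0, microsecond=0)
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def adjust(now, expr):
    """Apply an expression like '-1d@d', '-4h', or '@h' to now."""
    match = _PATTERN.fullmatch(expr)
    if not expr or match is None:
        raise ValueError(f"unsupported time adjuster: {expr!r}")
    sign, amount, unit, snap_unit = match.groups()
    result = now
    if unit:
        # An omitted amount means 1, so '-d' is the same as '-1d'.
        delta = timedelta(**{_UNITS[unit]: int(amount or 1)})
        result = result + delta if sign == "+" else result - delta
    if snap_unit:
        result = snap(result, snap_unit)
    return result
```

So "-1d@d" reads as "go back one day, then snap to midnight", which is why the syntax works well for a "since yesterday" range on a command line.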
-
Updated the #clusterless install docs to reflect the homebrew install option
https://docs.clusterless.io/guide/1.0-wip/install-quickstart.html
-
ok, finally! the #clusterless and #tessellate wip builds are published to #homebrew
https://github.com/ClusterlessHQ/homebrew-tap
I'll update all the install docs this week.
-
here is a little pre-announcement of a new #clusterless library clusterless-commons
https://github.com/ClusterlessHQ/clusterless-commons
currently available in maven central.. but still underdocumented etc.
this project allows for sharing of some core libraries I find useful developing clusterless and tessellate. as well as some basics to help with #aws cdk development.
i'll make a bigger announcement as it matures.
-
I love how #Hopin adds stuff to your agenda for you, and you can't remove it.
Now i'll ignore the agenda I just crafted.
-
probably time i sort out a real logo for #clusterless
https://github.com/ClusterlessHQ
is 99designs still a thing?
-
I've added a new how-to guide on creating a copy #data pipeline in #AWS s3 using only #clusterless intrinsic components. As files get uploaded, they get copied to a new location.
https://docs.clusterless.io/guide/1.0-wip/howtos/s3-copy-files.html
This roughly mirrors the example project, but has a bunch more explainers and examples on using the cls command to build a project file.
-
Long weekend of yard work ahead but look forward to completing a set of improved #clusterless documentation early next week.
Using jackson json views I can print out json for required properties and full json so configuring a simple pipeline doesn’t seem so daunting.
These in turn can be embedded directly in the docs online and help messages.
-
I added include/exclude filters to the S3PutListenerBoundary and S3CopyArc components to #clusterless
Now you can use Ant-like paths to exclude hidden files etc. in #AWS s3 buckets, like _SUCCESS with an exclude on **/_*
https://docs.clusterless.io/reference/1.0-wip/components/aws-core-s3-put-listener-boundary.html
https://docs.clusterless.io/reference/1.0-wip/components/aws-core-s3-copy-arc.html
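To show why an exclude of **/_* catches _SUCCESS at any depth, here is a small sketch of Ant-style path matching. It is an approximation written for illustration, not the matcher the components use. `ant_to_regex` and `is_excluded` are hypothetical names, and the object keys are invented.

```python
import re


def ant_to_regex(pattern):
    """Translate an Ant-style path pattern into a compiled regex.

    '**/' matches zero or more whole directories, '*' matches within a
    single path segment, and '?' matches one non-separator character.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append(r"(?:[^/]+/)*")
            i += 3
        elif pattern.startswith("**", i):
            parts.append(r".*")
            i += 2
        elif pattern[i] == "*":
            parts.append(r"[^/]*")
            i += 1
        elif pattern[i] == "?":
            parts.append(r"[^/]")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_excluded(key, excludes):
    """True if the object key fully matches any exclude pattern."""
    return any(ant_to_regex(p).fullmatch(key) for p in excludes)
```

With excludes of ["**/_*"], a key like logs/2023/_SUCCESS is skipped while logs/2023/part-0000.parquet is kept, since only the last path segment has to start with an underscore.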
-
How's this for a #clusterless tag line?
Think #cloud + #airflow without the airflow, but with a lot more trust and agility.
-
I've started publishing how-tos on using #clusterless to manage #cloud #data pipelines.
A little overkill, but the first is how to manage an #aws s3 bucket.
https://docs.clusterless.io/guide/1.0-wip/howtos/s3-bucket-resource.html
-
Having a scenario runner for #data flows in #clusterless is pretty cool for automated testing of dags of workloads.
especially if they are part of your ci/cd
https://github.com/ClusterlessHQ/clusterless/actions/runs/6319702471/job/17161069843
on every commit, a suite of scenarios are deployed, run, and destroyed against #aws
-
would be great to have time (be paid) to create a sample #clusterless app on #AWS to put data behind a #DuckDB WASM frontend
https://duckdb.org/2021/10/29/duckdb-wasm.html
So instead of Athena/Glue integration that can work against a complete corpus, have DuckDB over the last 30 days for investigations etc
https://github.com/ClusterlessHQ/aws-s3-log-pipeline
quick reminder, https://chris.wensel.net
I want to do this with #sqlite https://datasette.io as well (I did it in a previous role and it was awesome)
-
Just pushed a new #clusterless Tessellate release that updates the transform statement syntax to include intrinsic functions.
The first function is tsid, a unique long value generated by the https://github.com/f4b6a3/tsid-creator library.
More here: https://github.com/ClusterlessHQ/tessellate#intrinsic-functions
-
@Cmastication depends on how you access it? If via a query, only partition on the most common predicates.
Repartitioning data for different access patterns is a key use case behind #clusterless and tessellate. See bio for links.
Otherwise yeah, partition via hash to get equal-sized bits. Reminds me to add a hash transform to tessellate.
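A quick sketch of the "partition via hash" idea, for reference. This is not a tessellate transform (the post notes that one does not exist yet). `partition_for` is a hypothetical name, and SHA-256 is just one stable hash that would work.

```python
import hashlib


def partition_for(key, partitions):
    """Map a key to a partition number in [0, partitions).

    A cryptographic digest is used instead of Python's built-in hash()
    so the assignment is stable across processes and runs.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions
```

Because the digest is close to uniformly distributed, rows spread roughly evenly across the partitions regardless of how skewed the key values are, as long as the keys themselves are mostly distinct.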
-
I'll make a bigger announcement later, but if you are following along with #clusterless development, note that we just added #AWS Glue/Athena support.
That means databases and tables can be deployed in tandem with a workload, and any new partitions that arrive will be added to the table.
https://github.com/ClusterlessHQ/clusterless
This example has been updated to show how it works and how simple it is (relatively).
https://github.com/ClusterlessHQ/aws-s3-log-pipeline
So imagine, every result dataset in the dag having a table to query.
-
just sayin', if you find chaining sql statements into a data processing dag a bit of a drag, I suggest you spend some time with #clusterless
https://github.com/ClusterlessHQ
declarative decentralized heterogeneous #data flows (in #aws today)