Search
103 results for “cwensel”
-
@cwensel yep, and going back to your toot on open source sustainability — #b2evolution with a great project and a wonderful community shuttered after 18 years
Community and adoption and even commercial backers don’t always add up to forever sustainability
-
So Tessellate inherits lots of support for various data formats from Cascading
https://github.com/cwensel/cascading
Even though #apacheparquet dropped Cascading support, we were able to port it over.
Now that Parquet is native to Cascading, it should be easier to add #apacheiceberg support.
This would allow #clusterless to convert data as it arrives into Iceberg continuously for use in #aws Athena or other data front-ends.
Anyone interested in a challenge?
-
A little more color on this announcement..
https://fosstodon.org/@cwensel/110549001614086663
First, #ApacheParquet removed #Cascading support, so I had to splice the original source into Cascading. But the ParquetScheme didn't honor type information fully. So there is a new TypedParquetScheme that has native support for JSON and Timestamps.
Second, Parquet requires the #ApacheHadoop FileSystem, which means we get the wonderful S3A implementation. But we also get a 331MB jar dependency with the aws bundle.
-
Many thanks to @czds @RuthMalan @kcarruthers @EmilyK @cwensel @douglasvb @digikata @deborahh @Sevoris for the boosts and favoriting of this thread for #IoTday #IoTday2024
https://mastodon.social/@jadp/112242993601272319
-
Hey all, I'm hiring for a #DataEng (and #DevOps) role in the SF Bay Area. Reach out directly to me for more info.
#Java background is a strong want plus AWS and a desire to work on #OpenSource stuff.
-
so my #Grafana #k6 cloud runs seem to initialize the SharedArray object multiple times and pass different instances to the remote processes. It inits once locally, and historically I don't remember this being an issue in the cloud.
https://grafana.com/docs/k6/latest/javascript-api/k6-data/sharedarray/
I have support and Slack questions open, but I find it odd if I'm the only person experiencing this.
-
updated #clusterless subpop cli build to provide a Homebrew tap for easy installation.
https://github.com/ClusterlessHQ/subpop
subpop is an experimental tool for diffing datasets from the cli.
runs on #Linux and #macOS but sadly written in #java so no native binaries just yet.
-
Has no one ever read Hyperion/Endymion?
Fun project "Last digital common ancestor"
Self-replicating, self-modifying Assembly program that can evolve into every possible computer program in the universe.
https://github.com/mertyildiran/ldca
#Assembly #OpenSource #github #fasm
-
For those a little familiar with Cascading, in #java, it was originally designed to run on #ApacheHadoop, and then #ApacheTez, but it also has a local planner.
This lets developers create non-clustered data applications, without the Hadoop/Tez etc dependencies or runtime.
I've been using the local planner in production for over 5 years now.
But Parquet requires Hadoop libraries, and this is ok: there is a shim between the libraries that allows Parquet and S3AFileSystem to be used locally.
-
while pondering my need for a remote compute environment, vs having random boxes littered about generating heat, I realized I could add a 'device' component concept to #Clusterless.
this concept not only complements the current model types, it will be handy standalone.
consider an #AWS ec2/ecs instance doing some complex work and dropping files into S3 (over the new mount point feature) where a clusterless DAG takes over processing when the files arrive (via the S3 put boundary).
-
On that note, not using #clusterless for #dataengineering is a trap
-
@seldo I don't know any firsthand,
but I spent the last couple weeks exploring what a #RAG pipeline would look like so I could write a sample application/pipeline using my #OpenSource #clusterless project
https://github.com/ClusterlessHQ
unfortunately the idea I had wasn't ultimately suitable for RAG and could be a simple BERT/BART summarizer pipeline without having an open/elasticsearch backend or other vector db.
still looking for a fun RAG based prototype I could build and share.
-
need to dig into this, but i've been doing replay (redrive) on #aws StepFunctions for years with my #data pipelines
replay is one feature I haven't added back to #Clusterless yet, though all the metadata is there.
-
currently all my #clusterless examples (and scenario tester) use jsonnet, but it's got weak overall support.
CUE looks interesting, but no Java implementation for embedding (if that was a thing I was considering)
-
Tessellate is now on Docker Hub
https://hub.docker.com/r/clusterless/tessellate
Tessellate is a command line tool for reading and writing #data to/from multiple locations and across multiple formats.
-
Automating #AWS CloudWatch log export into S3 is no simple task.
Next #clusterless release will have a new Component type called Activity, which is simply a scheduled task.
The first Activity will be a function that exports CloudWatch logs created within the previous interval.
As they arrive, any arc can subscribe to the data drop and do things. To simplify that task, I'll update #tessellate
The CloudWatch log is a delimited text file with two columns, one of which is JSON. Unlike all the others in AWS!
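
A minimal sketch of reading one such two-column line, assuming the first column is a timestamp and the delimiter is a single space (neither is stated above). The function name `parse_log_line` and the sample line are hypothetical, not taken from tessellate or from a real export:

```python
import json


def parse_log_line(line: str, delimiter: str = " ") -> tuple[str, dict]:
    """Split a line into (first column, parsed JSON from the second column).

    Assumption: the delimiter only matters at its first occurrence, so the
    JSON column may itself contain the delimiter character.
    """
    first, _, payload = line.rstrip("\n").partition(delimiter)
    return first, json.loads(payload)


# Hypothetical sample line, made up for illustration only.
sample = '2024-01-01T00:00:00.000Z {"level": "INFO", "detail": {"id": 7}}'
ts, record = parse_log_line(sample)
print(ts, record["detail"]["id"])
```

Splitting only on the first delimiter keeps the JSON column intact even when it contains spaces.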
-
ok, here's a new one for #aws users.
would anyone be interested in an automated way to extract CloudWatch logs (continuously) into an s3 bucket
and have them converted into #parquet (etc.) for downstream custom processing, or simply partitioned with partition updates to AWS Athena/Glue?
the challenge for users is getting the `detail` json field exposed since it's app specific.
with #clusterless devs could then inject custom processing for custom app logs into the #data pipeline
-
I finally wrote up some documentation on using #clusterless Tessellate for #data stuff
-
I'm hosting my #clusterless docs on github, but #google is refusing to index the site fully.
it shows "currently not indexed" in the search console, which in turn claims they don't want to overload the site.
I get the sense that this is a shared problem for anyone hosting on github..
any suggestions on alternatives or hacks?
-
Released a new wip of #clusterless, for #cloud #data pipelines, last night that includes reporting on both arcs (workloads) and datasets.
https://github.com/ClusterlessHQ/clusterless
Below is a summary of the three datasets the #AWS s3 log sample app creates.
https://github.com/ClusterlessHQ/clusterless-aws-examples
Note we track the difference between intervals that have no data (empty, which may be intentional) vs a gap (the workload didn't run and create data).
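
To make the empty-vs-gap distinction concrete, here is a rough sketch, not how clusterless implements it. The names `IntervalStatus` and `classify`, and the inputs (a list of expected intervals, a set of intervals where the workload ran, and row counts per interval), are all assumptions for illustration:

```python
from enum import Enum


class IntervalStatus(Enum):
    COMPLETE = "complete"  # workload ran and produced data
    EMPTY = "empty"        # workload ran, produced no data (may be intentional)
    GAP = "gap"            # workload did not run for this interval


def classify(expected, ran, row_counts):
    """Map each expected interval to a status.

    expected:   ordered iterable of interval ids
    ran:        set of interval ids where the workload completed
    row_counts: dict of interval id -> number of rows written
    """
    report = {}
    for interval in expected:
        if interval not in ran:
            report[interval] = IntervalStatus.GAP
        elif row_counts.get(interval, 0) == 0:
            report[interval] = IntervalStatus.EMPTY
        else:
            report[interval] = IntervalStatus.COMPLETE
    return report


# Hypothetical interval ids, for illustration only.
report = classify(
    expected=["t0", "t1", "t2"],
    ran={"t0", "t1"},
    row_counts={"t0": 120, "t1": 0},
)
for interval, status in report.items():
    print(interval, status.value)
```

The key point is that an empty interval is recorded as having run, so only a gap is a candidate for a re-run.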
-
getting closer...
here is a screenshot of the #clusterless cls command printing a summary table of workload (arc) completions since yesterday
I need to release 2.0 of the #java library mini-parsers into maven central before I can push this out and begin work on dataset status (think fsck for workload results)
-
hoping to make time to get another #clusterless release out this week.
I have commands to list deployed placements (regions etc), projects, and arcs (workloads). still need to get deployed datasets.
and, status reporting of both arcs and datasets.
that is, completed and failed arcs. and dataset completions, partials, empties, and gaps.
if a gap is found, the arc was skipped or failed. this is where you can re-run workloads deterministically from the cli.