Data validation #260

Open
spiffytech opened this issue Jan 16, 2021 · 11 comments

@spiffytech

Is any kind of enforced data validation on the roadmap? If data quality is just an honor code with the client software, it seems like I'd have to defensively validate every DB read to prevent a buggy/malicious client from storing malformed data that crashes every client they share their DB with.

@j-berman
Collaborator

j-berman commented Jan 16, 2021

TL;DR At the root of this, because data is end-to-end encrypted, it would be extremely challenging to[1] prevent malicious users with write access to a database from storing whatever data they want on the server. That being said, in addition to the validation rules our client applies when reading the data before returning it to you in the changeHandler, you can limit that data's harmful effects in your app's code via defensive validation as you mentioned, or use userbase-sql.js.

And in the future, in the worst case, we're planning to allow you to recover a database from an older point in time.


My original response before all the edits: The client does validate every DB write for you before returning data to you via the changeHandler, though you raise a valid point and we can do a better job limiting what malicious users can do.

Will strengthen validation this weekend.

Edit 1: Because all data is end-to-end encrypted, we can't validate the data server-side like traditional databases do. But that doesn't mean data quality relies entirely on an honor code with the client software. Our client does offer some powerful (but not perfect) guarantees about your data. For example, if a malicious user inserts an item that is not encrypted with the correct encryption key, an honest user's client shouldn't crash, it should just ignore that malformed item and move on decrypting properly encrypted items before returning items back to the user via the changeHandler.

Now, this would still leave honest clients vulnerable if a malicious client inserts a string, but in your app code you were expecting a nested object or something like that. To defend against that, like you mentioned, your honest clients could safely validate the data before using it to make sure it matches what you expect. So basically what you would normally expect from server-side validation is moved into the client (but it's not necessarily "unsafe client-side" validation in that it trusts all users to be honest).
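A minimal sketch of that kind of read-time validation, assuming each entry handed to the changeHandler looks like `{ itemId, item }` and that the app expects `item` to be a to-do object. The `isValidTodo` and `sanitizeItems` names and the to-do schema are hypothetical, not part of userbase-js:

```javascript
// Hypothetical app-level schema: { text: string (max 500 chars), done: boolean }.
function isValidTodo(item) {
  return (
    typeof item === 'object' &&
    item !== null &&
    !Array.isArray(item) &&
    typeof item.text === 'string' &&
    item.text.length <= 500 &&
    typeof item.done === 'boolean'
  )
}

// Drop anything a buggy or malicious client stored that does not match
// the schema, instead of letting it reach rendering code.
function sanitizeItems(items) {
  return items.filter(
    (entry) => entry !== null && typeof entry === 'object' && isValidTodo(entry.item)
  )
}
```

Calling `sanitizeItems(items)` as the first step inside a changeHandler would keep malformed entries away from the rest of the app.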

Additionally, you may find userbase-sql.js useful if you're looking for custom SQL-like validation beyond what the default client offers (the default client can offer item uniqueness, versioning validation, correctly decrypted items). With userbase-sql.js, you can enforce whatever SQL rules you want, and data returned via the changeHandler must satisfy your rules. The biggest catch with that library today is that you can't share the SQL databases. Would you want to use that library if sharing was possible?

Finally, if data is corrupted by a buggy client or a malicious client manages to sneak through, we can also offer the ability to roll back transactions. This is planned. Something like #206, but allowing you to retrieve the history of an entire database, in the order transactions were pushed to the server. And then recover the database from a particular point in time.

Edit 2: apologies for all the edits.

[1]: maybe validating data server-side can be done with more advanced cryptography, like Zero Knowledge proofs, but that's beyond the scope of Userbase today. If anyone has ideas on this, happy to hear them.

@spiffytech
Author

spiffytech commented Jan 17, 2021

While I like the idea of E2E data that the platform is blind to, a lot of the projects I'd like to use Userbase for require a server to interact with user data / perform actions on a user's behalf to enable some features, so the E2E bubble is already popped when the client automatically shares their DBs with the server's service user account. If my app is set for server-side encryption, it would be nice if Userbase could offer support for server-side validation (among other features like triggers, but that's out of scope for this issue). Upload a JS function to run, or check the status code response from calling a webhook, or flagging a user as a service account and letting the service account user open a DB with a change handler that approves/rejects changes, or something.

Besides the convenience of schema enforcement, it would help with a lot of use cases like "freemium users have a quota" and "users can't pack the DB to the brim and use up all my storage space". If I'm going to expose user data to my systems anyway, it would be nice if Userbase could take advantage of that.


userbase-sql is intriguing, and I'll bet I could put it to good use. Sharing userbase-sql DBs would be a hard requirement for a lot of my projects though (both sharing between users, and to my service accounts).

@j-berman
Collaborator

j-berman commented Jan 18, 2021

If my app is set for server-side encryption, it would be nice if Userbase could offer support for server-side validation

True. Something we will keep in mind for the future.

"freemium users have a quota" and "users can't pack the DB to the brim and use up all my storage space"

We can offer this even in the end-to-end encryption mode, and it's something we've thought about offering in the past. Rules like "user X can't insert more than Y items", "database X can't exceed Y items in total", "user X can't insert more than Y bytes", or "only approved users can insert more than Y bytes" are all possible. It's mainly a matter of getting the right UX down and how to offer it. Will think on this some more, and happy to hear suggestions on how you'd like to set these rules (e.g. in the Admin Panel, or via custom server-side code that you can run).
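As a rough illustration of how such rules might be expressed, here is a sketch of a quota check. Nothing here is an existing Userbase setting: the rule names, the `usage` record, and the idea that the check receives the stored payload size in bytes (the server only sees encrypted payloads in end-to-end mode) are all assumptions:

```javascript
// Hypothetical quota rules and check; not a Userbase API.
// quota: { maxItemsPerUser, maxBytesPerUser }
// usage: { itemCount, bytesStored } for the inserting user
function checkQuota(quota, usage, payloadBytes) {
  if (usage.itemCount + 1 > quota.maxItemsPerUser) {
    return { ok: false, reason: 'item limit reached' }
  }
  if (usage.bytesStored + payloadBytes > quota.maxBytesPerUser) {
    return { ok: false, reason: 'byte limit reached' }
  }
  return { ok: true }
}
```

Because the check only needs counts and sizes, it would not require access to decrypted data.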

Upload a JS function to run, or check the status code response from calling a webhook, or flagging a user as a service account and letting the service account user open a DB with a change handler that approves/rejects changes, or something.

Can you clarify these use cases a bit more? Does the coming userbase-js-node not solve some of this for you?

Sharing userbase-sql DBs would be a hard requirement for a lot of my projects though (both sharing between users, and to my service accounts).

Got it, will get this on the roadmap :)

@spiffytech
Author

spiffytech commented Jan 18, 2021

Upload a JS function to run, or check the status code response from calling a webhook, or flagging a user as a service account and letting the service account user open a DB with a change handler that approves/rejects changes, or something.

Can you clarify these use cases a bit more? Does the coming userbase-js-node not solve some of this for you?

userbase-js-node solves the problem that I need the ability for automated processes to interact with user data, but some projects also need the server to have authority over how users interact with stored data.

E.g., for one project on my plate, I'm doing realtime collaboration and need to prevent race conditions and eventual consistency problems with users editing the same data. So I can turn to CRDTs, where you store a separate record of every change a user makes, and all users can replay them to arrive at the same worldstate. But I can't allow users to delete CRDT records from the database, else users will arrive at incorrect worldstates (and may even encounter problems replaying at all, depending on the CRDT algorithm). I do need a server to be able to delete CRDTs, so I can squash and trim the history that gets sent to users each time they load the app.

Or, I want an app to support RBAC, so anyone can create/edit a record, but only managers/admins can delete it. Or enforce that record changes get versioned (similar problem to the CRDTs above).

Or something like HN comments, where a user can edit a comment for a little while, then the comment is locked in place, and where a moderator can edit/remove comments if they violate guidelines. I can sorta implement that by inserting an item with the service user included in writeAccess["users"], but if a nefarious user uses raw userbase-js to put a comment into the DB, they can block my service from having write access to the comment, so I can't moderate it or revoke their access after the edit window or anything else.

Or one of the projects on my plate calls for different users to have different write access to specific, defined-at-runtime properties on DB records.

Or I could see creating projects that have to implement a pathological combination of GDPR/Right to be Forgotten + audit trail regulations, where custom logic for what can/can't be deleted is a must-have.

And from a practicality perspective, it'd simplify a lot of coding if clients could trust that reads contained good data without me having to think up all of the possible malformed data scenarios I have to defend against at read-time.

I could come up with more, but I think the broader point is that the current design of Userbase leaves me very concerned that even if I can see how to make my app work today, at any time my requirements could change in ways Userbase flatly doesn't support because it doesn't have the notion of superuser data access or custom validation for writes. If whole categories of new requirements can only be implemented by upstreaming a patch to userbase server (like your above mention of implementing specific user restrictions), that's a significant risk for non-toy projects.

Maybe you're fine with that. The more I get up to speed on Userbase, the more I get the impression that operating a server with insight/control over user-created data feels antithetical to the philosophy that drove Userbase to implement E2E in the first place. If that's not the direction Userbase wants to go, I'm fine with that. I'd just appreciate some official messaging to that effect so I can make decisions accordingly.

@j-berman
Collaborator

j-berman commented Jan 18, 2021

It’s true, our focus is primarily on E2E apps. That being said, a lot of what you’re asking for is possible to do securely without relying on the server to have access to the data. It’s just a matter of building out the features, or approaching the problems with a different mental framework relative to traditional client-server architectures.

But yes, to be clear, we will likely spend more time and energy enabling more powerful clients to do the things you’re asking, rather than turn more attention toward the server-side. That is, unless there are relatively easy things that we can enable for server-side focused apps (userbase-js-node being an example), which in some of these cases there may be.

So I can turn to CRDTs…

I’m spending the day working on a sample offline-first collaborative app using CRDTs to show how this can be done (described in #255).

But I can't allow users to delete CRDT records from the database… I do need a server to be able to delete CRDTs, so I can squash and trim the history that gets sent to users each time they load the app.

We can relatively easily expose userbase-js functions so you can do both of these things securely:

  1. You can override userbase-js’s internal logic to process transactions and only accept "Insert" commands, which are your CRDT changes. All transactions are stored in an immutable transaction log server-side, and the server is guaranteed to send you transactions in sequence. No user can delete existing changes if your client worked like that.

  2. Right now whenever a user’s database exceeds a fixed size, the server tells the client to "bundle" the database, which compresses and squashes the state, then uploads it to the server in chunks for quick loading. You can override the bundling logic to squash and trim exactly how you want. The server only allows database owners to trigger this process.
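The first point can be sketched roughly as follows. This is not userbase-js's actual internal code: the transaction shape (`command`, `itemId`, `item`) and the function name are assumptions made for illustration, and the sketch relies on transactions arriving in sequence as described above:

```javascript
// Replay a transaction log, accepting only "Insert" commands.
// Updates and deletes from any client are skipped, so they cannot
// remove or rewrite CRDT changes that were already accepted.
function applyInsertOnly(transactions) {
  const items = new Map()
  for (const tx of transactions) {
    if (tx.command !== 'Insert') continue // out-of-spec command: ignore
    if (items.has(tx.itemId)) continue // duplicate id: first insert wins
    items.set(tx.itemId, tx.item)
  }
  return items
}
```

Every honest client replaying the same log this way ends up with the same set of changes, regardless of what other commands were pushed in between.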

For reference, userbase-sql.js relies on just Inserts for SQL statements, and has some smooth logic to bundle databases in a way that clients experience no downtime. But I'd like to make it even easier to roll your own logic like that by just exposing the above internal userbase-js functions -- I'll probably need to do that to get sharing sql.js databases working.

Or, I want an app to support RBAC, so anyone can create/edit a record, but only managers/admins can delete it.

Definitely possible with a new database permission, like suggested in #208. Also wouldn’t require server-side changes.

Or enforce that record changes get versioned (similar problem to the CRDTs above).

Our client implements versioning. All items are versioned under the hood, we can expose those versions if you’d like — you’d be able to include an item’s version in updateItem/deleteItem.
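To illustrate what passing a version along with an update could buy you, here is a sketch of a version-checked update. How userbase-js tracks versions internally is not described in this thread, so the integer counter, the `store` map, and the `expectedVersion` parameter are all hypothetical:

```javascript
// store: Map of itemId -> { item, version }
// An update is applied only if the caller saw the latest version.
function applyVersionedUpdate(store, { itemId, item, expectedVersion }) {
  const current = store.get(itemId)
  if (!current) return { ok: false, reason: 'item not found' }
  if (current.version !== expectedVersion) {
    return { ok: false, reason: 'version conflict' }
  }
  const version = current.version + 1
  store.set(itemId, { item, version })
  return { ok: true, version }
}
```

A stale writer gets a conflict instead of silently overwriting a newer change.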

but if a nefarious user uses raw userbase-js to put a comment into the DB, they can block my service from having write access to the comment

Database owners have root privileges, all honest clients respect database owners’ edits/removals of any items, so a nefarious user can't do that. We can implement an "admin" privilege too so owners could dole out admin privileges.

If whole categories of new requirements can only be implemented by upstreaming a patch to userbase server (like your above mention of implementing specific user restrictions)


When I said "custom server-side code that you can run", I meant code you run on your server, not code that’s patched up to Userbase. This doesn’t seem too challenging to implement on our end. For example, before storing a transaction on our end, we pass it through your proxy server providing enough data along that you can write custom logic as needed telling us if we should store it or return an error to clients.
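A sketch of what the decision logic on such a proxy server might look like. This feature does not exist at the time of this comment, so the payload shape (`userId`, `databaseId`, `command`, `payloadBytes`), the `rules` object, and the use of HTTP-style status codes are assumptions, not a Userbase contract:

```javascript
// Hypothetical decision function for the proposed proxy/webhook.
// Returns a status-code style answer: 200 means "store it",
// anything else means "reject and return an error to the client".
function reviewTransaction(tx, rules) {
  if (!rules.allowedCommands.includes(tx.command)) {
    return { status: 403, body: { error: 'command not allowed' } }
  }
  if (tx.payloadBytes > rules.maxPayloadBytes) {
    return { status: 413, body: { error: 'payload too large' } }
  }
  return { status: 200, body: { ok: true } }
}
```

Keeping the decision in a plain function like this would make it easy to unit test separately from whatever HTTP framework ends up serving the webhook.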

without me having to think up all of the possible malformed data scenarios I have to defend against at read-time.

Really you just have to take logic you’d place on the server, and place it in the client. It’s just a different way of thinking about the problem.

All the above being said, you will likely find greater flexibility today in tools that rely on traditional client-server architectures. And that’s likely going to be the case into perpetuity, but, we’d like to get Userbase as close to what you’d expect as possible :)

(I know I missed some of your use cases in the above^ happy to discuss any of them in more detail)

@j-berman
Collaborator

j-berman commented Jan 19, 2021

With userbase-js-node, you could also do something like this:

  • Only your admin creates databases
  • All users have access to these databases via read-only share tokens
  • Users make changes to these databases via requests to your server

@spiffytech
Author

Thanks for the detailed response. Userbase excites me - I can spin up new projects without sinking hours into writing boilerplate user management code, CRUD API endpoints, secure sharing, etc., and E2E is there if it suits my use case, I get real-time updates, and it's OSS so I can feel good building OSS projects with it, and feel good that I'm not held hostage by a proprietary vendor's policies/pricing. I'm eager to use it, I just need to get my head around the possibilities and pitfalls so I can verify it's a good fit for my needs.

I'm trying not to become a pathological customer, so feel free to stop me if it's clear Userbase isn't aligned with what I'd ask of it.

Here are my thoughts:

before storing a transaction on our end, we pass it through your proxy server providing enough data along that you can write custom logic as needed telling us if we should store it or return an error to clients.

This sounds like a stellar 80/20 solution that, on the surface, appears to solve most of my use cases with minimal complexity for either the Userbase server or my app. If the validator could enforce that my service user is made an admin on select new user-created databases, that solves a ton of other problems, too.

You can override userbase-js’s internal logic to process transactions and only accept "Insert" commands, which are your CRDT changes. All transactions are stored in an immutable transaction log server-side ... No user can delete existing changes if your client worked like that.

Reading through the Userbase architecture doc and your comment, it sounds like the idea for CRDTs would be piggybacking on userbase-js' applyTransaction method and customizing the Update/Delete/BatchTransaction cases with custom validation behavior. Then, while the server won't reject deleteItem calls (at least, without the webhook validator you mentioned), honest clients will refuse out-of-spec transactions, so all a nefarious actor could do is pollute the transaction log with wasted-but-harmless bytes. Is that right?

You can override the bundling logic to squash and trim exactly how you want.

Customizing Userbase's squash+trim behavior to capture the resolved state of my CRDT stream at bundle-time sounds great! Is that something I can do today, or does that need updates to userbase-js?

Our client implements versioning. All items are versioned under the hood, we can expose those versions if you’d like

Versioning is on my app's roadmap, so if the platform can give that to me for free, I'm delighted to take it. Could my app still see/request versions that had been bundled?

you just have to take logic you’d place on the server, and place it in the client

Hmmm, I was thinking eventual consistency and race conditions between clients would make this a bugbear vs a server linearizing the authorizations, but right now I'm having trouble constructing a scenario where it makes a difference. I suppose if honest clients can reject out-of-spec transactions, it works out the same.

With userbase-js-node, you could also do something like this

Interesting... It leaves me making the CRUD endpoints I'd hoped Userbase would free me from, but it lets me get going now, sounds like it has similar flexibility to traditional-style server apps, and still gets me the user management and real-time features. And if something like the webhook validator is on the roadmap I could eventually rip out my CRUD endpoints, and then it's just easy street for every new project which could skip the CRUD step and jump straight to the webhook validator.

All the above being said, you will likely find greater flexibility today in tools that rely on traditional client-server architectures. And that’s likely going to be the case into perpetuity, but, we’d like to get Userbase as close to what you’d expect as possible :)

The ideas you gave seem to solve every objection I have at this point once userbase-js/userbase-server expose the required behaviors. Webhook validator, webhook-enforced make-me-a-supplemental-admin, and custom client-side transaction validators seem to nail all of my schema enforcement, moderation, and administration concerns. At this moment, with those in place, I wouldn't see any big reason to hesitate to use Userbase for the projects I'm working on / have planned.

@j-berman
Collaborator

I'm trying not to become a pathological customer, so feel free to stop me if it's clear Userbase isn't aligned with what I'd ask of it.

You're fine, you're pushing us to expand Userbase's usefulness :)

This sounds like a stellar 80/20 solution...

Webhook added to the roadmap :)

If the validator could enforce that my service user is made an admin on select new user-created databases, that solves a ton of other problems, too.

When a user creates a database, they are by default the owner of that database, and have admin privileges on the database. Users can then give access to these databases to your admin server. But your admin server wouldn't really have admin privileges over the database in that case. You'd like for both users and your admin server to be granted admin privileges over some databases, right? Or are you specifically looking for custom complex logic to determine whether or not users are admins over some databases at runtime?

honest clients will refuse out-of-spec transactions, so all a nefarious actor could do is pollute the transaction log with wasted-but-harmless bytes. Is that right?

Yes, exactly! :D This is what I've been trying to explain but didn't find the right words to explain it.

Customizing Userbase's squash+trim behavior to capture the resolved state of my CRDT stream at bundle-time sounds great! Is that something I can do today, or does that need updates to userbase-js?

Sweeeet. You can technically just rewrite the client source to do whatever you want today, but I'd like to rewrite userbase-js to make it as simple as overwriting the applyTransactions, getItems, and buildBundle functions, so you could just supply your own functions to init, something along those lines. I could push this up to high priority if you want :)

Versioning is on my app's roadmap, so if the platform can give that to me for free, I'm delighted to take it. Could my app still see/request versions that had been bundled?

Yep, got it. Added to the roadmap. This one is pretty straightforward.

The ideas you gave seem to solve every objection I have at this point once userbase-js/userbase-server expose the required behaviors.

🎉

@spiffytech
Author

You'd like for both users and your admin server to be granted admin privileges over some databases, right?

Yes, so users can write to their own little pool of data, but my server can do admin/moderation if necessary. I'm not sure, but maybe enforcing that a new database is shared rw with my service user accomplishes the same thing as enforcing admin. For example, my project lets users post public data read-only, and I need to be able to remove content that violates TOS or DMCAs. Per your previous comment, I could have an API endpoint that lets the user request my server create a DB where I'm the owner and I share it with them to populate, but if Userbase is up for turning an API endpoint with request authentication and DB creation into a couple of lines in a validation if statement in a webhook I already have, I'm happy to take that path.

you could just supply your own functions to init, something along those lines. I could push this up to high priority if you want :)

I'd appreciate that. I probably have a couple/few weeks before the CRDT project is far enough along that I'd be ready to implement squash+trim customization.

@j-berman
Collaborator

j-berman commented Jan 19, 2021

For example, my project lets users post public data read-only, and I need to be able to remove content that violates TOS or DMCAs.

Unless I'm missing something, it seems just having your clients provide your service user rw access upon item insertion would work (via the writeAccess param). Your honest clients could check to make sure items have that setting before returning to the user. And if you find users are modifying their clients and not doing this, they're violating your TOS and you can delete their accounts.
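A minimal sketch of that client-side check, assuming each entry returned to the client exposes its write access as `writeAccess.users`, an array of `{ username }` objects. That shape, the `moderation-service` username, and the function names are assumptions for illustration:

```javascript
// Hypothetical username of the app's service user.
const SERVICE_USERNAME = 'moderation-service'

// True only if the entry grants write access to the service user.
function grantsServiceWriteAccess(entry, serviceUsername) {
  const users = entry && entry.writeAccess && entry.writeAccess.users
  return (
    Array.isArray(users) &&
    users.some((u) => u !== null && typeof u === 'object' && u.username === serviceUsername)
  )
}

// Honest clients hide items the service user cannot moderate.
function moderatableItems(items, serviceUsername) {
  return items.filter((entry) => grantsServiceWriteAccess(entry, serviceUsername))
}
```

An honest client would pass its items through `moderatableItems(items, SERVICE_USERNAME)` before showing them to the user.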

I get how this wouldn't protect you from malicious users modifying their client who slip away from your moderation (again, to be clear, honest users are still protected). If that's the threat you're concerned about, userbase-js-node actually wouldn't help either because technically any user of your site could use a userbase-js client to just store E2E data, which seems about as easy to do as malicious users modifying their client to avoid providing rw access to your server. (edit: never mind, your app ID could be kept a server-side secret in this case. And userbase-js-node could work)

Point for the webhook: I could see in this case how you don't want anyone using your app ID to store any data unless it passes through your webhook validations first. That would solve this problem. Any time a user inserts an item, they must grant rw access to your server, otherwise the webhook will prevent it from storing.

The more I think about the webhook idea, the more I like it. Will keep you posted on that. This issue can serve as a catch-all for all the features discussed above.

I'd appreciate that.

Got it, now high priority :)

@fabien

fabien commented Feb 24, 2021

Looking forward to webhooks as well! My main concern is that I wouldn’t want any open registration, or at least no db usage at all without prior payment.
