Data Information Knowledge
Mangle started out inside a web app, as a way to access metadata about some artifacts.
I wanted to keep the door open to large-scale data processing, and to separate the language from the application.
The problem space, very broadly:
- analyze data and turn it into knowledge (analytics)
- make knowledge explicit (ontology, logic/rules)
- ... in order to help communication and collaboration
- ... in a way that does not force us into a schema
- we knew that our ideas of entities, queries and UI would have to evolve.
The data-information-knowledge hierarchy distinguishes:
- data: what comes in all shapes and forms, not yet organized.
- information: what is derived from data by organizing, classifying and normalizing it.
- knowledge: what can serve as the specific answer that humans need when making decisions, judging a situation or taking action in pursuit of some outcome.
All these involve querying.
Databases, pipelines, query processing
- SQL is likely the most prevalent language for querying data.
- But I remember a brief time when SQL was less relevant
- "NoSQL", sharding, Bigtable, MapReduce, FlumeJava (Craig Chambers).
- through engineering, one can store and query data with high performance on many cheap servers
As a programming language, SQL is hopelessly anachronistic (1970s style).
- Not readable, not testable
- No proper abstraction mechanism (modules), copy-paste-reuse
- No standard extensibility (connecting to external systems), schema evolution issues
- No (easy, standard) way to deal with recursion / transitive closure
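
For comparison, here is a rough sketch of what transitive closure looks like when written as a recursive common table expression. It assumes SQLite, reached through Python's built-in sqlite3 module; the depends_on table, its columns and the sample rows are hypothetical and only serve as illustration. Other engines may spell or restrict the recursive form differently.

```python
import sqlite3

# Hypothetical schema: direct dependencies between artifacts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE depends_on (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO depends_on VALUES (?, ?)",
    [("app", "lib"), ("lib", "core"), ("core", "libc")],
)

# Transitive closure: every artifact reachable from each artifact.
# UNION (rather than UNION ALL) drops duplicate rows, which also
# makes the recursion terminate on cyclic data.
TRANSITIVE_SQL = """
WITH RECURSIVE reaches(src, dst) AS (
    SELECT src, dst FROM depends_on
    UNION
    SELECT r.src, d.dst
    FROM reaches AS r
    JOIN depends_on AS d ON r.dst = d.src
)
SELECT src, dst FROM reaches ORDER BY src, dst
"""

closure = conn.execute(TRANSITIVE_SQL).fetchall()
for src, dst in closure:
    print(src, "->", dst)
```

The join condition has to be spelled out by hand, and the recursive query cannot be named and reused from other queries without copying it or wrapping it in a view.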
But why did SQL come back?
I asked an AI to generate a picture of a SQL hacker in the 1970s.
- not only due to "familiarity" ...
  - evolution: ZetaSQL, BigQuery SQL with protos ...
  - relational algebra and first-order logic are a good foundation ("declarative")
  - theory continues to be taught at university (even completely irrelevant stuff like normal forms): there is material that can be taught
- business reasons related to people/skills
  - not everybody is an engineer
  - even for engineers, one may want to scale up to 1000s of queries or processing pipelines
  - not just familiar, but familiar and high-level / declarative: can adapt to embarrassingly parallel execution
Why Datalog is even better as a foundation:
- can address the shortcomings above
- if nothing else: implicit joins
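
To make "implicit joins" concrete, here is a minimal sketch of naive bottom-up Datalog evaluation in Python. It is not Mangle's implementation or syntax: the rules in the comments use generic Datalog notation, and the predicate names (depends_on, reaches) and sample facts are hypothetical. The point is that a variable shared between two body atoms (Y below) is the join; nothing else needs to be written.

```python
# Facts are tuples: (predicate, arg1, arg2, ...). All names are hypothetical.
FACTS = {
    ("depends_on", "app", "lib"),
    ("depends_on", "lib", "core"),
    ("depends_on", "core", "libc"),
}

# Rules as (head, body). Strings starting with an upper-case letter are variables.
#   reaches(X, Y) :- depends_on(X, Y).
#   reaches(X, Z) :- depends_on(X, Y), reaches(Y, Z).
RULES = [
    (("reaches", "X", "Y"), [("depends_on", "X", "Y")]),
    (("reaches", "X", "Z"), [("depends_on", "X", "Y"), ("reaches", "Y", "Z")]),
]

def is_var(term):
    return term[:1].isupper()

def match(atom, fact, env):
    """Extend env so that atom equals fact, or return None if they clash."""
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    env = dict(env)
    for term, value in zip(atom[1:], fact[1:]):
        if is_var(term):
            # A variable seen before must keep its binding: this is the join.
            if env.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return env

def solve(body, facts, env):
    """Yield every variable binding that satisfies all atoms in body."""
    if not body:
        yield env
        return
    for fact in facts:
        extended = match(body[0], fact, env)
        if extended is not None:
            yield from solve(body[1:], facts, extended)

def evaluate(rules, facts):
    """Apply all rules repeatedly until no new fact is derived (fixpoint)."""
    facts = set(facts)
    while True:
        derived = set()
        for head, body in rules:
            for env in solve(body, facts, {}):
                derived.add((head[0],) + tuple(env.get(t, t) for t in head[1:]))
        if derived <= facts:
            return facts
        facts |= derived

closure = sorted(f[1:] for f in evaluate(RULES, FACTS) if f[0] == "reaches")
for src, dst in closure:
    print(src, "->", dst)
```

The same two rules also cover recursion / transitive closure, and a rule head such as reaches can be reused by name in other rules. A real engine would use semi-naive evaluation and indexes rather than rescanning all facts on every round.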