Architecturally, a big problem with CAS was that it was designed primarily to serve the “V” solutions with row-wise access, as evolved from the earlier LASR-based VA patterns. As a result, a relational column projection had to stride across entire rows in RAM even when only a few columns of a very wide table were needed, which is a lot of overhead. VA also depended heavily on FCMP computed columns, which made it very difficult to optimize physical row selection for WHERE processing. Much of this traces back to even older, traditional SAS row/var processing patterns. Were modern advances in data access for analytical processing even considered?
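To make the striding overhead concrete, here is a minimal Python sketch. It is purely illustrative and says nothing about how CAS actually lays out memory; the table shape, the `WANTED` column list, and the "bytes spanned" counter are all assumptions of mine, used as a crude proxy for memory traffic.

```python
import struct

NUM_ROWS = 1000
NUM_COLS = 200                 # a "wide" table; the sizes are arbitrary assumptions
WANTED = [3, 17]               # the only columns the query projects
CELL = struct.calcsize("<d")   # 8 bytes per float64 cell


def cell_value(r, c):
    """Deterministic dummy value for row r, column c."""
    return float(r * NUM_COLS + c)


# Row-wise layout: all cells of row 0, then all cells of row 1, and so on.
row_store = b"".join(
    struct.pack(f"<{NUM_COLS}d", *(cell_value(r, c) for c in range(NUM_COLS)))
    for r in range(NUM_ROWS)
)

# Columnar layout: one contiguous buffer per column.
col_store = [
    struct.pack(f"<{NUM_ROWS}d", *(cell_value(r, c) for r in range(NUM_ROWS)))
    for c in range(NUM_COLS)
]


def project_rowwise(buf, wanted):
    """Step through every row record and pick out the wanted cells."""
    row_bytes = NUM_COLS * CELL
    out = {c: [] for c in wanted}
    bytes_spanned = 0
    for r in range(NUM_ROWS):
        base = r * row_bytes
        for c in wanted:
            (v,) = struct.unpack_from("<d", buf, base + c * CELL)
            out[c].append(v)
        bytes_spanned += row_bytes   # the scan strides over the whole record
    return out, bytes_spanned


def project_columnar(cols, wanted):
    """Touch only the buffers of the wanted columns."""
    out = {}
    bytes_spanned = 0
    for c in wanted:
        out[c] = list(struct.unpack(f"<{NUM_ROWS}d", cols[c]))
        bytes_spanned += len(cols[c])
    return out, bytes_spanned


row_result, row_bytes_spanned = project_rowwise(row_store, WANTED)
col_result, col_bytes_spanned = project_columnar(col_store, WANTED)
print(row_bytes_spanned, col_bytes_spanned)
```

Both projections return the same values, but with these assumed sizes the row-wise scan spans the whole table while the columnar one spans only the two requested columns.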
By contrast, here’s what Spark is optimized for:
AI reply to query: “is apache spark optimized for columnar access”
Yes, Apache Spark is optimized for columnar access, particularly when using data formats like Apache Parquet, which stores data in a columnar format, leading to significantly improved performance for operations like filtering and aggregations on large datasets.
Key points about Spark and columnar access:
Parquet integration: Spark leverages Parquet as a preferred storage format due to its columnar structure, allowing for efficient data retrieval by only accessing the relevant columns for a query.
Optimized operators: Spark's internal operators are designed to work effectively with columnar data, enabling faster processing of large datasets.
Performance benefits: Columnar access allows Spark to significantly reduce the amount of data that needs to be read from storage, leading to faster query execution times.
——————
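The "read only the relevant columns" point can be sketched with a toy file layout in Python. This is not Parquet's actual format and uses no Spark or Parquet API; it is only loosely inspired by the general idea of per-column chunks plus footer metadata. The function names `write_columnar` and `read_columns` and the JSON footer are hypothetical, made up for illustration.

```python
import io
import json
import struct


def write_columnar(f, columns):
    """Write each column as one contiguous chunk, then a JSON footer of offsets."""
    index = {}
    for name, values in columns.items():
        offset = f.tell()
        f.write(struct.pack(f"<{len(values)}d", *values))
        index[name] = {"offset": offset, "count": len(values)}
    footer = json.dumps(index).encode("utf-8")
    f.write(footer)
    f.write(struct.pack("<I", len(footer)))   # footer length goes last


def read_columns(f, wanted):
    """Read the footer, then seek straight to the chunks of the wanted columns."""
    f.seek(-4, io.SEEK_END)
    (footer_len,) = struct.unpack("<I", f.read(4))
    f.seek(-(4 + footer_len), io.SEEK_END)
    index = json.loads(f.read(footer_len))
    result = {}
    bytes_read = 4 + footer_len
    for name in wanted:
        entry = index[name]
        f.seek(entry["offset"])
        raw = f.read(entry["count"] * 8)      # 8 bytes per float64 value
        result[name] = list(struct.unpack(f"<{entry['count']}d", raw))
        bytes_read += len(raw)
    return result, bytes_read


# Hypothetical table: 50 columns of 1000 float64 values each.
table = {f"c{i}": [float(i * 1000 + r) for r in range(1000)] for i in range(50)}

buf = io.BytesIO()
write_columnar(buf, table)
total_size = buf.getbuffer().nbytes

result, bytes_read = read_columns(buf, ["c3", "c17"])
print(bytes_read, total_size)
```

Under these assumptions, the reader pulls the footer plus two column chunks and never touches the other 48 columns, which is the effect the reply above attributes to columnar formats.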
Wasn’t Parquet eventually integrated into CAS? How is that working out? Does it perform well?
Moving on to even more modern optimizations for analytical data processing:
https://medium.com/@zujkanovic/exploring-duckdb-and-the-columnar-advantage-f7beb8cbf478