Snowflake: The Basics

by Imogen Emmett

What is Snowflake?

Snowflake is a fully managed software as a service (SaaS) that provides a single platform that enables data storage, processing, and analytic solutions. Its primarily a cloud-based data warehouse but it can also be used as a data lake if required. It allows you to use multiple databases from multiple locations.

Data warehouse: a system that aggregates data from different sources into a single, central, consistent data store that is specifically designed for fast query and analysis.

Database: an organised collection of structured information, or data, typically stored electronically in a computer system.

Data lake: a centralised repository designed to store, process, and secure large amounts of structured, semi-structured, and unstructured data. It can store data in its native format and process any variety of it, ignoring size limits.

Snowflake supports 3 cloud platforms – AWS (Amazon Web Services), Azure and GCP (Google Cloud Platform).

What is it used for?

Snowflake is a true SaaS offering and runs completely on cloud infrastructure - the benefits of this are that there is:

  • No hardware (virtual or physical) to select, install, configure, or manage
  • Virtually no software to install, configure, or manage
  • Ongoing maintenance, management, upgrades, and tuning are handled by Snowflake

Further benefits of using Snowflake include:

  • It overcomes some of the problems that are common in other hardware-based data warehouses, for example performance and speed. If you want to load data faster, or run a large number of queries, you can scale up/out your virtual warehouse to accommodate these requests.

Scaling Up: This is the process of increasing the size of a warehouse, allowing you to run larger, more complex queries.

Scaling Out: this is the process of adding warehouses to create a multi-cluster warehouse. Multi-cluster warehouses are best utilised for scaling resources to improve concurrency for users/queries.

Performance and speed

The elastic nature of the cloud means that if you want to load data faster, or run a large volume of queries, you can easily scale up your virtual warehouse. Afterwards, you have the option to scale down the virtual warehouse and pay for only the time you used.

Accessibility

With a traditional data warehouse and a large number of users, you could experience concurrency issues (such as delays or failures) when too many queries compete for resources. Snowflake, on the other hand, addresses these issues with its unique multi-cluster architecture - queries from one virtual warehouse never affect the queries from another, and each virtual warehouse can scale up or down as required.

Data sharing capacity

Snowflake's architecture allows for data sharing between users. It also allows organisations to share data with any data consumer — whether they are a Snowflake customer or not — through reader accounts that can be created directly from the user interface.

High availability

Snowflake is distributed across availability zones of the platform on which it runs — either AWS, Azure or GCP — and is designed to operate continuously and tolerate component and network failures with minimal impact to customers.

Snowflake Interface: The Basics

Worksheets: this is where you run SQL queries, as well as perform other Snowflake data loading, definition and manipulation tasks.

SQL (structured query language): standardised programming language that allows you to transform your data in Snowflake.

Databases: all data in Snowflake is maintained in databases. A database is a logical grouping of schemas. Each database belongs to a single Snowflake account.

Schema: a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.

Snowflake offers 3 types of tables - Temporary, Transient & Permanent:

Temporary tables: only exist within the session in which they were created and persist only for the remainder of the session. They are not visible to other users or sessions. Once the session ends, data stored in the table is removed completely from the system and, therefore, is not recoverable, either by the user who created the table or Snowflake.

Transient tables: persist until explicitly dropped and are available to all users with the appropriate privileges. Transient tables are similar to permanent tables with the key difference that they do not have a Fail-safe period - they are specifically designed for transitory data that needs to be maintained beyond each session, but does not need the same level of data protection and recovery provided by permanent tables.

Permanent tables (DEFAULT): similar to transient tables - the key difference is that they do have a Fail-safe period, which provides an additional level of data protection and recovery.