This library provides an integration with the Snowflake data warehouse.
To use this library, you should first ensure that you have an appropriate Snowflake user configured to access your data warehouse.
database – Name of the database to use.
account – Your Snowflake account name. For more details, see https://bit.ly/2FBL320.
user – User login name.
password – User password.
warehouse – Name of the warehouse to use.
schema – Name of the schema to use.
role – Name of the role to use.
private_key – Raw private key to use. See https://docs.snowflake.com/en/user-guide/key-pair-auth.html for details.
private_key_path – Path to the private key. See https://docs.snowflake.com/en/user-guide/key-pair-auth.html for details.
private_key_password – The password of the private key. See https://docs.snowflake.com/en/user-guide/key-pair-auth.html for details.
Builds an IO manager definition that reads inputs from and writes outputs to Snowflake.
type_handlers (Sequence[DbTypeHandler]) – Each handler defines how to translate between slices of Snowflake tables and an in-memory type, e.g. a Pandas DataFrame.
IOManagerDefinition
Examples
from dagster_snowflake import build_snowflake_io_manager
from dagster_snowflake_pandas import SnowflakePandasTypeHandler
from dagster_snowflake_pyspark import SnowflakePySparkTypeHandler
from dagster import Definitions, asset

import pandas as pd

@asset(
    key_prefix=["my_schema"]  # will be used as the schema in snowflake
)
def my_table() -> pd.DataFrame:  # the name of the asset will be the table name
    ...

snowflake_io_manager = build_snowflake_io_manager([SnowflakePandasTypeHandler(), SnowflakePySparkTypeHandler()])

defs = Definitions(
    assets=[my_table],
    resources={
        "io_manager": snowflake_io_manager.configured({
            "database": "my_database",
            "account": {"env": "SNOWFLAKE_ACCOUNT"},
            ...
        })
    }
)
If you do not provide a schema, Dagster will determine a schema based on the assets and ops using the IO Manager. For assets, the schema will be determined from the asset key. For ops, the schema can be specified by including a “schema” entry in output metadata. If “schema” is not provided via config or on the asset/op, “public” will be used for the schema.
from dagster import op, Out
import pandas as pd

@op(
    out={"my_table": Out(metadata={"schema": "my_schema"})}
)
def make_my_table() -> pd.DataFrame:
    # the returned value will be stored at my_schema.my_table
    ...
To use only specific columns of a table as input to a downstream op or asset, add the metadata key “columns” to the In or AssetIn.
from dagster import asset, AssetIn
import pandas as pd

@asset(
    ins={"my_table": AssetIn("my_table", metadata={"columns": ["a"]})}
)
def my_table_a(my_table: pd.DataFrame) -> pd.DataFrame:
    # my_table will just contain the data from column "a"
    ...
account – Your Snowflake account name. For more details, see https://bit.ly/2FBL320.
user – User login name.
password – User password.
database – Name of the default database to use. After login, you can use USE DATABASE to change the database.
schema – Name of the default schema to use. After login, you can use USE SCHEMA to change the schema.
role – Name of the default role to use. After login, you can use USE ROLE to change the role.
warehouse – Name of the default warehouse to use. After login, you can use USE WAREHOUSE to change the warehouse.
autocommit – None by default, which honors the Snowflake parameter AUTOCOMMIT. Set to True or False to enable or disable autocommit mode in the session, respectively.
private_key – Raw private key to use. See https://docs.snowflake.com/en/user-guide/key-pair-auth.html for details. Alternately, set private_key_path and private_key_password.
private_key_password – Raw private key password to use. See https://docs.snowflake.com/en/user-guide/key-pair-auth.html for details. Required for both private_key and private_key_path.
private_key_path – Raw private key path to use. See https://docs.snowflake.com/en/user-guide/key-pair-auth.html for details. Alternately, set the raw private key as private_key.
client_prefetch_threads – Number of threads used to download the result sets (4 by default). Increasing the value improves fetch performance but requires more memory.
client_session_keep_alive – False by default. Set this to True to keep the session active indefinitely, even if there is no activity from the user. Make certain to call the close method to terminate the thread properly, or the process may hang.
login_timeout – Timeout in seconds for login. By default, 60 seconds. The login request gives up after the timeout length if the HTTP response is “success”.
network_timeout – Timeout in seconds for all other operations. By default, none/infinite. A general request gives up after the timeout length if the HTTP response is not “success”.
ocsp_response_cache_filename – URI for the OCSP response cache file. By default, the OCSP response cache file is created in the cache directory.
validate_default_parameters – False by default. If True, raise an exception if any one of the specified database, schema, or warehouse does not exist.
paramstyle – pyformat by default for client-side binding. Specify qmark or numeric to change bind variable formats for server-side binding.
timezone – None by default, which honors the Snowflake parameter TIMEZONE. Set to a valid time zone (e.g. America/Los_Angeles) to set the session time zone.
connector – Indicates an alternative database connection engine. The only permissible option is ‘sqlalchemy’; otherwise the Snowflake Connector for Python is used.
cache_column_metadata – Optional parameter when connector is set to sqlalchemy. Snowflake SQLAlchemy takes a flag cache_column_metadata=True so that all column metadata for all tables is cached.
numpy – Optional parameter when connector is set to sqlalchemy. To enable fetching NumPy data types, add numpy=True to the connection parameters.
authenticator – Optional parameter to specify the authentication mechanism to use.
A resource for connecting to the Snowflake data warehouse. The returned resource object is an instance of SnowflakeConnection.
A simple example of using the Snowflake resource to execute a query is shown below:
Examples
from dagster import job, op
from dagster_snowflake import snowflake_resource

@op(required_resource_keys={'snowflake'})
def get_one(context):
    context.resources.snowflake.execute_query('SELECT 1')

@job(resource_defs={'snowflake': snowflake_resource})
def my_snowflake_job():
    get_one()

my_snowflake_job.execute_in_process(
    run_config={
        'resources': {
            'snowflake': {
                'config': {
                    'account': {'env': 'SNOWFLAKE_ACCOUNT'},
                    'user': {'env': 'SNOWFLAKE_USER'},
                    'password': {'env': 'SNOWFLAKE_PASSWORD'},
                    'database': {'env': 'SNOWFLAKE_DATABASE'},
                    'schema': {'env': 'SNOWFLAKE_SCHEMA'},
                    'warehouse': {'env': 'SNOWFLAKE_WAREHOUSE'},
                }
            }
        }
    }
)
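The resource also accepts the key-pair authentication options listed above. The following is a minimal sketch, assuming those field names; the key path, database, and warehouse names are placeholders:

from dagster_snowflake import snowflake_resource

# Sketch only: field names follow the configuration options described above;
# the key path and the database/warehouse names are placeholders.
snowflake_with_key_pair = snowflake_resource.configured({
    "account": {"env": "SNOWFLAKE_ACCOUNT"},
    "user": {"env": "SNOWFLAKE_USER"},
    "private_key_path": "/path/to/rsa_key.p8",
    "private_key_password": {"env": "SNOWFLAKE_PRIVATE_KEY_PASSWORD"},
    "database": "MY_DATABASE",
    "warehouse": "MY_WAREHOUSE",
})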
A connection to Snowflake that can execute queries. In general, this class should not be directly instantiated, but rather used as a resource in an op or asset via snowflake_resource().
Execute multiple queries in Snowflake.
sql_queries (Sequence[str]) – List of queries to be executed in series.
parameters (Optional[Union[Sequence[Any], Mapping[Any, Any]]]) – Parameters to be passed to every query. See https://docs.snowflake.com/en/user-guide/python-connector-example.html#binding-data
fetch_results (bool) – If True, will return the results of the queries as a list. Defaults to False
use_pandas_result (bool) – If True, will return the results of the queries as a list of Pandas DataFrames. Defaults to False
The results of the queries as a list if fetch_results or use_pandas_result is True, otherwise returns None
Examples
@op(required_resource_keys={"snowflake"})
def create_fresh_database(context):
    queries = ["DROP DATABASE IF EXISTS MY_DATABASE", "CREATE DATABASE MY_DATABASE"]
    context.resources.snowflake.execute_queries(
        sql_queries=queries
    )
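As a follow-up sketch, results can be fetched back by passing fetch_results=True; the table name here is a hypothetical placeholder:

@op(required_resource_keys={"snowflake"})
def table_stats(context):
    # fetch_results=True returns the results of the queries as a list.
    # MY_TABLE and the ID column are hypothetical names used for illustration.
    return context.resources.snowflake.execute_queries(
        sql_queries=[
            "SELECT COUNT(*) FROM MY_TABLE",
            "SELECT MAX(ID) FROM MY_TABLE",
        ],
        fetch_results=True,
    )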
Execute a query in Snowflake.
sql (str) – the query to be executed
parameters (Optional[Union[Sequence[Any], Mapping[Any, Any]]]) – Parameters to be passed to the query. See https://docs.snowflake.com/en/user-guide/python-connector-example.html#binding-data
fetch_results (bool) – If True, will return the result of the query. Defaults to False
use_pandas_result (bool) – If True, will return the result of the query as a Pandas DataFrame. Defaults to False
The result of the query if fetch_results or use_pandas_result is True, otherwise returns None
Examples
@op(required_resource_keys={"snowflake"})
def drop_database(context):
    context.resources.snowflake.execute_query(
        "DROP DATABASE IF EXISTS MY_DATABASE"
    )
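To bind query parameters, a minimal sketch (assuming the default pyformat paramstyle; the table and column names are hypothetical) could look like:

@op(required_resource_keys={"snowflake"})
def count_recent_rows(context):
    # With the default pyformat paramstyle, bind variables are written as %(name)s.
    # MY_TABLE and ID are hypothetical names used for illustration.
    return context.resources.snowflake.execute_query(
        "SELECT COUNT(*) FROM MY_TABLE WHERE ID > %(min_id)s",
        parameters={"min_id": 100},
        fetch_results=True,
    )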
Gets a connection to Snowflake as a context manager.
If using the execute_query, execute_queries, or load_table_from_local_parquet methods, you do not need to create a connection using this context manager.
raw_conn (bool) – If using the sqlalchemy connector, you can set raw_conn to True to create a raw connection. Defaults to True.
Examples
@op(required_resource_keys={"snowflake"})
def get_query_status(context, query_id):
    with context.resources.snowflake.get_connection() as conn:
        # conn is a Snowflake Connection object or a SQLAlchemy Connection if
        # sqlalchemy is specified as the connector in the Snowflake Resource config
        return conn.get_query_status(query_id)
Stores the content of a parquet file to a Snowflake table.
src (str) – the name of the file to store in Snowflake
table (str) – the name of the table to store the data. If the table does not exist, it will be created. Otherwise the contents of the table will be replaced with the data in src
Examples
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

@op(required_resource_keys={"snowflake"})
def write_parquet_file(context):
    df = pd.DataFrame({"one": [1, 2, 3], "ten": [11, 12, 13]})
    table = pa.Table.from_pandas(df)
    pq.write_table(table, "example.parquet")
    context.resources.snowflake.load_table_from_local_parquet(
        src="example.parquet",
        table="MY_TABLE"
    )
This function is an op factory that constructs an op to execute a Snowflake query.
Note that you can only use snowflake_op_for_query if you know the query you’d like to execute at graph construction time. If you’d like to execute queries dynamically during job execution, you should manually execute those queries in your custom op using the snowflake resource.
sql (str) – The SQL query that will be executed against the provided Snowflake resource.
parameters (dict) – The parameters for the sql query.
Returns the constructed op definition.
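For example, a minimal sketch (the table name is a placeholder) that builds such an op from a fixed query and wires it into a job with the snowflake resource:

from dagster import job
from dagster_snowflake import snowflake_op_for_query, snowflake_resource

# The query must be known at graph construction time; MY_TABLE is a placeholder.
count_rows = snowflake_op_for_query("SELECT COUNT(*) FROM MY_TABLE")

@job(resource_defs={"snowflake": snowflake_resource})
def count_rows_job():
    count_rows()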