dbt Cannot Connect to Spark
What is dbt
dbt (Data Build Tool) is an open-source command-line tool for transforming data inside the data warehouse, focusing on the T in ETL. It has become one of the most popular components in modern big data architectures.
Usage
Installation
The official documentation lists four installation methods:
- Homebrew (for macOS users)
- pip
- Docker image
- Install from source
Installing with pip is usually the quickest and most convenient, and is the recommended practice. dbt provides a different adapter for each kind of database (or data warehouse); install the adapter you need and it will automatically pull in dbt-core as a dependency. Here the goal is to connect dbt to Spark over the thrift protocol.
pip install "dbt-spark[PyHive]"
Once installed, the dbt CLI is available. The following shows the process of initializing a dbt project.
[root@mutoulbj spark_demo]# dbt init
08:08:09 Running with dbt=1.1.0
Enter a name for your project (letters, digits, underscore): spark_demo
The profile spark_demo already exists in /root/.dbt/profiles.yml. Continue and overwrite it? [y/N]: Y
Which database would you like to use?
[1] spark
(Don't see the one you want? https://docs.getdbt.com/docs/available-adapters)
Enter a number: 1
host (yourorg.sparkhost.com): mcdp1
[1] odbc
[2] http
[3] thrift
Desired authentication method option (enter a number): 3
port [443]: 10000
schema (default schema that dbt will build objects in): dbt_example
threads (1 or more) [1]: 1
08:08:35 Profile spark_demo written to /root/.dbt/profiles.yml using target's profile_template.yml and your supplied values. Run 'dbt debug' to validate the connection.
08:08:35
Your new dbt project "spark_demo" was created!
For more information on how to configure the profiles.yml file,
please consult the dbt documentation here:
https://docs.getdbt.com/docs/configure-your-profile
One more thing:
Need help? Don't hesitate to reach out to us via GitHub issues or on Slack:
https://community.getdbt.com/
Happy modeling!
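For reference, a thrift profile written to /root/.dbt/profiles.yml by the wizard looks roughly like the sketch below. This is an assumption reconstructed from the values entered above, not a copy of the actual generated file; the exact layout can vary between dbt-spark versions.

```yaml
# ~/.dbt/profiles.yml (sketch)
spark_demo:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift        # connect to the Spark Thrift server
      host: mcdp1
      port: 10000
      schema: dbt_example   # default schema dbt builds objects in
      threads: 1
```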
The generated project directory structure looks like this:
[root@mutoulbj spark_demo]# tree
.
├── logs
│ └── dbt.log
└── spark_demo
├── analyses
├── dbt_project.yml
├── logs
│ └── dbt.log
├── macros
├── models
│ └── example
│ ├── my_first_dbt_model.sql
│ ├── my_second_dbt_model.sql
│ └── schema.yml
├── README.md
├── seeds
├── snapshots
└── tests
The Problem and Troubleshooting
The dbt debug command validates that the configuration is correct and that the database (or warehouse) is reachable. And here the problem appeared: it kept failing with the following error:
[root@mutoulbj spark_demo]# dbt debug
06:33:04 Running with dbt=1.1.0
dbt version: 1.1.0
python version: 3.9.10
python path: /usr/local/bin/python3.9
os info: Linux-3.10.0-1160.42.2.el7.x86_64-x86_64-with-glibc2.17
Using profiles.yml file at /root/.dbt/profiles.yml
Using dbt_project.yml file at /root/spark_demo/dbt_project.yml
Configuration:
profiles.yml file [OK found and valid]
dbt_project.yml file [OK found and valid]
Required dependencies:
- git [OK found]
Connection:
host: 192.168.51.194
port: 10000
cluster: None
endpoint: None
schema: dbt_example
organization: 0
Connection test: [ERROR]
1 check failed:
dbt was unable to connect to the specified database.
The database returned the following error:
>Runtime Error
Database Error
failed to connect
Check your database credentials and try again. For more information, visit:
https://docs.getdbt.com/docs/configure-your-profile
The error message shows that the connection to Spark failed, but gives no specifics, so some troubleshooting was needed:
- Used beeline to verify that the Spark Thrift server itself was up ✔︎
- Switched environments and verified the connection from a local Mac ✔︎
- Reinstalled dbt and tried again ✘
None of this led anywhere, so the next step was to read the source. The relevant code is in dbt-spark/dbt/adapters/spark/connections.py:
elif creds.method == SparkConnectionMethod.THRIFT:
    cls.validate_creds(creds, ["host", "port", "user", "schema"])

    if creds.use_ssl:
        transport = build_ssl_transport(
            host=creds.host,
            port=creds.port,
            username=creds.user,
            auth=creds.auth,
            kerberos_service_name=creds.kerberos_service_name,
        )
        conn = hive.connect(thrift_transport=transport)
    else:
        conn = hive.connect(
            host=creds.host,
            port=creds.port,
            username=creds.user,
            auth=creds.auth,
            kerberos_service_name=creds.kerberos_service_name,
        )  # noqa
    handle = PyhiveConnectionWrapper(conn)
As the code shows, when hive.connect() is called, the exception is not caught and surfaced with its details. To find the actual cause, we can reproduce the call directly on the server:
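Since the adapter swallows the detail, another option during development is to wrap the connect call and re-raise with the underlying message attached. A minimal sketch; connect_verbose and the stub failing_connect are hypothetical names, not part of dbt-spark or pyhive:

```python
def connect_verbose(connect_fn, **kwargs):
    """Call a connect function, re-raising any failure with the
    underlying exception's type and message attached."""
    try:
        return connect_fn(**kwargs)
    except Exception as exc:
        raise RuntimeError(
            f"failed to connect: {type(exc).__name__}: {exc}"
        ) from exc

# In real use this would wrap pyhive, e.g.:
#   conn = connect_verbose(hive.connect, host="192.168.51.194", port=10000)
# Demo with a stub that mimics the SASL failure seen below:
def failing_connect(**kwargs):
    raise OSError("Could not start SASL: no mechanism available")

try:
    connect_verbose(failing_connect, host="mcdp1", port=10000)
except RuntimeError as err:
    print(err)
```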
[root@mutoulbj ~]# python3
Python 3.9.10 (main, May 27 2022, 10:27:15)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyhive import hive
>>> conn = hive.connect(host='192.168.51.194',port=10000)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.9/site-packages/pyhive/hive.py", line 104, in connect
return Connection(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/pyhive/hive.py", line 243, in __init__
self._transport.open()
File "/usr/local/lib/python3.9/site-packages/thrift_sasl/__init__.py", line 84, in open
raise TTransportException(type=TTransportException.NOT_OPEN,
thrift.transport.TTransport.TTransportException: Could not start SASL: b'Error in sasl_client_start (-4) SASL(-4): no mechanism available: No worthy mechs found'
Bingo. The cause is now clear: no mechanism available: No worthy mechs found means some SASL-related dependency packages are missing.
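Cyrus SASL discovers its mechanisms at runtime from shared-library plugins in a directory such as /usr/lib64/sasl2 (that path is an assumption typical of RHEL/CentOS 7; other distros differ). As a quick sketch, the plugins present on a host can be listed like this; list_sasl_plugins is a hypothetical helper, not part of any library:

```python
import glob
import os


def list_sasl_plugins(plugin_dir="/usr/lib64/sasl2"):
    """Return the SASL mechanism plugin libraries found in plugin_dir.

    The default path is typical for RHEL/CentOS 7; other distros use
    e.g. /usr/lib/x86_64-linux-gnu/sasl2.
    """
    pattern = os.path.join(plugin_dir, "lib*.so*")
    return sorted(os.path.basename(p) for p in glob.glob(pattern))


# On a host missing the PLAIN mechanism, libplain.so* will be absent;
# it appears after installing cyrus-sasl-plain.
print(list_sasl_plugins())
```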
Check which related packages are already installed on the server:
[root@mutoulbj ~]# rpm -qa | grep cyrus
cyrus-sasl-lib-2.1.26-24.el7_9.x86_64
cyrus-sasl-devel-2.1.26-24.el7_9.x86_64
cyrus-sasl-2.1.26-24.el7_9.x86_64
The mechanism libraries are indeed missing. Install them and check the dependencies again:
[root@mutoulbj ~]# yum install cyrus-sasl-devel cyrus-sasl-gssapi cyrus-sasl-md5 cyrus-sasl-plain
[root@mutoulbj ~]# rpm -qa | grep cyrus
cyrus-sasl-gssapi-2.1.26-24.el7_9.x86_64
cyrus-sasl-lib-2.1.26-24.el7_9.x86_64
cyrus-sasl-plain-2.1.26-24.el7_9.x86_64
cyrus-sasl-devel-2.1.26-24.el7_9.x86_64
cyrus-sasl-2.1.26-24.el7_9.x86_64
cyrus-sasl-md5-2.1.26-24.el7_9.x86_64
Run dbt debug again, and it succeeds.
[root@mutoulbj spark_demo]# dbt debug
08:08:51 Running with dbt=1.1.0
dbt version: 1.1.0
python version: 3.9.10
python path: /usr/local/bin/python3.9
os info: Linux-3.10.0-1160.42.2.el7.x86_64-x86_64-with-glibc2.17
Using profiles.yml file at /root/.dbt/profiles.yml
Using dbt_project.yml file at /root/spark_demo/spark_demo/dbt_project.yml
Configuration:
profiles.yml file [OK found and valid]
dbt_project.yml file [OK found and valid]
Required dependencies:
- git [OK found]
Connection:
host: mcdp1
port: 10000
cluster: None
endpoint: None
schema: dbt_example
organization: 0
Connection test: [OK connection ok]
All checks passed!
Summary
When a tool does not surface specific, friendly error messages, troubleshooting is hard. But the great advantage of open source is that nothing in the code can hide: by reading the source, plus a bit of debugging, you can always track down the actual problem.
dbt is easy to get started with and its surrounding ecosystem is already quite mature; the rest remains to be tested further.