AI knowledge base: document vectorization fails on import

For the AI knowledge base, I import documents through the API, with a call like this:

app.rags.integratedManagementOfTheKnowled.addDocumentByBusinessId('aa', [{"fileName": 'aa', "size": size, "url": 'https://wd.hbxtxzl.cn/group1/M00/02/3E/CsiNamj4RdmAdH19ABkvYlzOmJY995.pdf', "type": ''}], {"chunkSeparator": ["\n"], "chunkSize": 1024, "chunkOverlap": 100}, {"chunkCleaning": True})

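Recently I noticed that some documents imported this way never show up in the knowledge base.

So I tried downloading the document and importing it from a local copy instead.

In that case, vectorization fails.

The error message is as follows: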

文档 processing failed: {‘code’: 60101, ‘message’: ‘Document loading failed [{file_path}https://jit.hbxtxzl.cn/api/xtjit/hbxt/storages/services/StorageSvc/preview?file=4ff40de7673e07a70777b9b868fa51b0.pdf\]: {\‘code\’: 60101, \‘message\’: \‘Document loading failed [https://jit.hbxtxzl.cn/api/xtjit/hbxt/storages/services/StorageSvc/preview?file=4ff40de7673e07a70777b9b868fa51b0.pdf{file_url}\]: {“code”: 20005, “reason”: "Element rags.NormalType.services.RemoteLoader [HTTP Document Download Service] failed. Error: {\\\‘code\\\’: 60101, \\\‘message\\\’: \\\‘Document loading failed [/tmp/tmp_4wba758.pdf{file_url}]: {“code”: 20005, “reason”: “Element rags.NormalType.services.PdfLoader [PDF Document Loader Service] failed. Error: Could not read Boolean object”}\\\’, \\\‘description\\\’: \\\‘Unable to load specified document file\\\’, \\\‘file_path\\\’: \\\’/tmp/tmp_4wba758.pdf\\\’, \\\‘error\\\’: \\\‘{“code”: 20005, “reason”: “Element rags.NormalType.services.PdfLoader [PDF Document Loader Service] failed. Error: Could not read Boolean object”}\\\’}“}\‘, \‘description\’: \‘Unable to load specified document file\’, \‘file_path\’: \‘https://jit.hbxtxzl.cn/api/xtjit/hbxt/storages/services/StorageSvc/preview?file=4ff40de7673e07a70777b9b868fa51b0.pdf\’, \‘error\’: \’{“code”: 20005, “reason”: “Element rags.NormalType.services.RemoteLoader [HTTP Document Download Service] failed. Error: {\\\‘code\\\’: 60101, \\\‘message\\\’: \\\‘Document loading failed [/tmp/tmp_4wba758.pdf{file_url}]: {“code”: 20005, “reason”: “Element rags.NormalType.services.PdfLoader [PDF Document Loader Service] failed. Error: Could not read Boolean object”}\\\’, \\\‘description\\\’: \\\‘Unable to load specified document file\\\’, \\\‘file_path\\\’: \\\‘/tmp/tmp_4wba758.pdf\\\’, \\\‘error\\\’: \\\‘{“code”: 20005, “reason”: “Element rags.NormalType.services.PdfLoader [PDF Document Loader Service] failed. 
Error: Could not read Boolean object”}\\\’}”}\‘}’, ‘description’: ‘Unable to load specified document file’, ‘file_url’: ‘https://jit.hbxtxzl.cn/api/xtjit/hbxt/storages/services/StorageSvc/preview?file=4ff40de7673e07a70777b9b868fa51b0.pdf’, ‘error’: ‘{\‘code\’: 60101, \‘message\’: \‘Document loading failed [https://jit.hbxtxzl.cn/api/xtjit/hbxt/storages/services/StorageSvc/preview?file=4ff40de7673e07a70777b9b868fa51b0.pdf{file_url}\]: {“code”: 20005, “reason”: "Element rags.NormalType.services.RemoteLoader [HTTP Document Download Service] failed. Error: {\\\‘code\\\’: 60101, \\\‘message\\\’: \\\‘Document loading failed [/tmp/tmp_4wba758.pdf{file_url}]: {“code”: 20005, “reason”: “Element rags.NormalType.services.PdfLoader [PDF Document Loader Service] failed. Error: Could not read Boolean object”}\\\’, \\\‘description\\\’: \\\‘Unable to load specified document file\\\’, \\\‘file_path\\\’: \\\’/tmp/tmp_4wba758.pdf\\\’, \\\‘error\\\’: \\\‘{“code”: 20005, “reason”: “Element rags.NormalType.services.PdfLoader [PDF Document Loader Service] failed. Error: Could not read Boolean object”}\\\’}”}\‘, \‘description\’: \‘Unable to load specified document file\’, \‘file_path\’: \‘https://jit.hbxtxzl.cn/api/xtjit/hbxt/storages/services/StorageSvc/preview?file=4ff40de7673e07a70777b9b868fa51b0.pdf\’, \‘error\’: \’{“code”: 20005, “reason”: “Element rags.NormalType.services.RemoteLoader [HTTP Document Download Service] failed. Error: {\\\‘code\\\’: 60101, \\\‘message\\\’: \\\‘Document loading failed [/tmp/tmp_4wba758.pdf{file_url}]: {“code”: 20005, “reason”: “Element rags.NormalType.services.PdfLoader [PDF Document Loader Service] failed. Error: Could not read Boolean object”}\\\’, \\\‘description\\\’: \\\‘Unable to load specified document file\\\’, \\\‘file_path\\\’: \\\‘/tmp/tmp_4wba758.pdf\\\’, \\\‘error\\\’: \\\‘{“code”: 20005, “reason”: “Element rags.NormalType.services.PdfLoader [PDF Document Loader Service] failed. Error: Could not read Boolean object”}\\\’}”}\‘}’}

So I would like to ask: what causes this, and how can it be resolved?

The PDF file is malformed or in an incompatible format. Please send me the PDF privately and I will take a look.

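I looked at the file. PDF parsing currently uses the pypdf library, which does not support your file's format. I also tried pymupdf, and it still could not extract any text. This PDF is probably a design-style brochure whose text was rasterized when it was exported from the design software, so none of the parsing libraries can get at the text, and it therefore cannot be vectorized. I suggest finding a docx version or the plain-text content and vectorizing that. Alternatively, you can run OCR on the PDF to convert it to text and then vectorize the result.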
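
If it helps, the check described above can be run before calling the import API. The following is only a rough sketch, not how the platform does it internally: the function names, the return labels, and the threshold of 20 characters per page are all hypothetical choices made for illustration. It uses pypdf's `PdfReader` and `extract_text()` to see whether a PDF has a usable text layer.

```python
from typing import List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the extractable text of each page, using pypdf."""
    # Imported lazily so the heuristic below can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [(page.extract_text() or "") for page in reader.pages]


def has_text_layer(page_texts: List[str], min_chars_per_page: int = 20) -> bool:
    """Heuristic (assumed threshold): the PDF counts as text-based when the
    average page carries at least `min_chars_per_page` non-whitespace characters."""
    if not page_texts:
        return False
    total = sum(len("".join(text.split())) for text in page_texts)
    return total / len(page_texts) >= min_chars_per_page


def choose_import_path(pdf_path: str) -> str:
    """Decide whether to vectorize the PDF directly or go through OCR / a source file."""
    try:
        texts = extract_page_texts(pdf_path)
    except ImportError:
        raise  # pypdf is missing: a setup problem, not a document problem
    except Exception:
        # The parser could not read the file at all (for example, a parse
        # error like the one in the log above), so direct import will not work.
        return "ocr_or_source_document"
    if has_text_layer(texts):
        return "vectorize_directly"
    return "ocr_or_source_document"
```

A file that returns "ocr_or_source_document" is one where either parsing failed outright or the pages contain almost no extractable text, which matches the two cases seen here. Such files would need the docx or plain-text version, or an OCR pass, before being imported.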