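This article describes how to parse documents with LlamaCloud and LlamaIndex in order to extract sensitive artefacts such as UK postcodes, IP addresses, email addresses and bank details, for use in fraud detection and digital forensics. A GenAI engine can extract this information from documents in many formats, and the article gives code examples that use LlamaExtract. It also compares the cost and output of different configurations.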
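At LastingAsset, we are researching privacy-aware methods of fraud detection. A key part of this is parsing documents for their concepts and then matching those concepts across documents. To support this, we are implementing cryptographic methods that allow artefacts to be searched for in encrypted secure storage.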
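One way to parse documents is to use a GenAI engine such as Llama. For this, we can use LlamaCloud, which parses documents in almost every format we need, such as PDF, DOCX, PPTX and XLSX. The service uses LlamaIndex. With it, we can create a model (an extraction agent) that we can run whenever we need it. The following code extracts UK postcodes, IP addresses, email addresses, bank details, telephone numbers, passwords, credit card details, MAC addresses and UK cities from a sample PDF document: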
import os

from llama_cloud.types import ExtractConfig, ExtractMode
from llama_cloud_services import LlamaExtract
from pydantic import BaseModel, Field

# Fail early with a KeyError if the API key is not set.
# LlamaExtract() reads LLAMA_CLOUD_API_KEY from the environment.
LLAMA_CLOUD_API_KEY = os.environ['LLAMA_CLOUD_API_KEY']


class ExtractArtefacts(BaseModel):
    # Each description is the instruction given to the model for that field.
    postcode: str = Field(description="Extract all of the UK postcodes from the document")
    ip_addresses: str = Field(description="Find all the IP addresses")
    email_address: str = Field(description="Find all the email addresses")
    bank_details: str = Field(description="Find all the bank details and sort codes")
    telephone: str = Field(description="Find all the telephone numbers and their location")
    passwords: str = Field(description="Find all the passwords")
    credit_card: str = Field(description="Find all the credit card details")
    mac_address: str = Field(description="Find all the MAC addresses")
    cities: str = Field(description="Find all the UK cities or towns")


llama_extract = LlamaExtract()

# Ask for per-field reasoning and source citations in the metadata.
config = ExtractConfig(use_reasoning=True, cite_sources=True,
                       extraction_mode=ExtractMode.MULTIMODAL)

# Create the agent once; later runs can fetch it by name instead.
agent = llama_extract.create_agent(name="artefact-parser", data_schema=ExtractArtefacts, config=config)
# agent = llama_extract.get_agent(name="artefact-parser")

artefact_info = agent.extract("mydoc.pdf")
print(artefact_info.data)
print(artefact_info.extraction_metadata)
To use this, we need an API key. Once the agent has been created, it is added to our LlamaCloud account.
After that, we can call the agent directly by name:
# The agent already exists, so fetch it by name instead of creating it again.
# agent = llama_extract.create_agent(name="artefact-parser",
#                                    data_schema=ExtractArtefacts, config=config)
agent = llama_extract.get_agent(name="artefact-parser")
artefact_info = agent.extract("mydoc.pdf")
print(artefact_info.data)
print(artefact_info.extraction_metadata)
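In a forensic setting we would rarely parse a single file, so here is a minimal sketch of running a saved agent over a folder. It relies only on the `agent.extract(path)` call and the `.data` attribute used above. The `extract_folder` helper, the `FakeAgent` stand-in and the file names are our own illustrative assumptions and are not part of LlamaExtract:

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def extract_folder(agent, folder, pattern="*.pdf"):
    """Run agent.extract() on each matching file and collect the .data results."""
    results = {}
    for path in sorted(Path(folder).glob(pattern)):
        run = agent.extract(str(path))
        results[path.name] = run.data
    return results


class FakeAgent:
    """Hypothetical stand-in so the sketch runs without an API key."""

    def extract(self, path):
        # A real agent would return the extracted artefacts here.
        return SimpleNamespace(data={"source": Path(path).name})


# Demo on a temporary folder holding two empty PDFs and one ignored text file.
with tempfile.TemporaryDirectory() as folder:
    for name in ("b.pdf", "a.pdf", "notes.txt"):
        (Path(folder) / name).write_bytes(b"")
    demo = extract_folder(FakeAgent(), folder)

print(demo)
```

With a real key, we would pass `llama_extract.get_agent(name="artefact-parser")` in place of `FakeAgent()`. Each call is billed, so the folder size drives the cost.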
Next, we can put some test content into the PDF document. It contains the following text:
There is not much we can do apart from contacting, there is not much we can
do apart from contacting f.smith@home.net to see if he would like to reboot
the server at 192.168.0.1. If he can do this then I will call him on
444.3212.5431. My credit card details are 4321-4444-5412-2310 and
5430-5411-4333-5123 and my name on the card is Fred Smith. I really like
the name domain fred@home.
Overall our target areas are SW1 7AF and EH105DT. I tested the server last
night, and I think the IP address is 10.0.0.1 and 192.168.1.1 and there are
two MAC addresses which is 01:23:45:67:89:ab or it might be 00.11.22.33.44.55.
The book we will use is "At Home" and it can be bought on amazon.com or google.com, if you search for 978-1-4302-1998-9. My account email addresses are Fred.blogs@gmail.com and f.blogs@mail.com.
I think my password might be "Qwerty123" or "inkwell!!".
Here are the details that I have:
IBAN Sort code Account
---------------------------------------
GB91BKEN10000041610008 100000 41610008
GB27BOFI90212729823529 902127 29823529
GB17BOFS80055100813796 800551 00813796
GB92BARC20005275849855 200052 75849855
Shall we perhaps meet in Glasgow or Edinburgh, or even Stirling?
If you need to access the account, the password is: a1b2c3
Best regards,
Bert.
EH14 1DJ
+44 (960) 000 00 00
1/1/2009
In both MULTIMODAL and BALANCED modes, the cost is around 14 credits, where 1,000 credits cost $1, so this run costs about $0.014. Overall, the extracted information is:
{
'postcode': 'SW1 7AF, EH105DT, EH14 1DJ',
'ip_addresses': '192.168.0.1, 10.0.0.1, 192.168.1.1',
'email_address': 'f.smith@home.net, Fred.blogs@gmail.com, f.blogs@mail.com',
'bank_details': 'GB91BKEN10000041610008, 100000, 41610008; GB27BOFI90212729823529, 902127, 29823529; GB17BOFS80055100813796, 800551, 00813796; GB92BARC20005275849855, 200052, 75849855',
'telephone': '444.3212.5431, +44 (960) 000 00 00',
'passwords': 'Qwerty123, inkwell!!, a1b2c3',
'credit_card': '4321-4444-5412-2310, 5430-5411-4333-5123',
'mac_address': '01:23:45:67:89:ab, 00.11.22.33.44.55',
'cities': 'Glasgow, Edinburgh, Stirling'}
{'field_metadata': {
'postcode': {'reasoning': 'VERBATIM EXTRACTION', 'citation': [{'page': 1, 'matching_text': 'SW1 7AF and EH105DT'}, {'page': 1, 'matching_text': 'EH14 1DJ'}]},
'ip_addresses': {'reasoning': 'VERBATIM EXTRACTION', 'citation': [{'page': 1, 'matching_text': '192.168.0.1'}, {'page': 1, 'matching_text': '10.0.0.1 and 192.168.1.1'}]},
'email_address': {'reasoning': 'VERBATIM EXTRACTION', 'citation': [{'page': 1, 'matching_text': 'f.smith@home.net'}, {'page': 1, 'matching_text': 'Fred.blogs@gmail.com and f.blogs@mail.com'}]},
'bank_details': {'reasoning': 'VERBATIM EXTRACTION', 'citation': [{'page': 1, 'matching_text': '| GB91BKEN10000041610008 | 100000 | 41610008 |'}, {'page': 1, 'matching_text': '| GB27BOFI90212729823529 | 902127 | 29823529 |'}, {'page': 1, 'matching_text': '| GB17BOFS80055100813796 | 800551 | 00813796 |'}, {'page': 1, 'matching_text': '| GB92BARC20005275849855 | 200052 | 75849855 |'}]},
'telephone': {'reasoning': 'VERBATIM EXTRACTION', 'citation': [{'page': 1, 'matching_text': '444.3212.5431'}, {'page': 1, 'matching_text': '+44 (960) 000 00 00'}]},
'passwords': {'reasoning': 'VERBATIM EXTRACTION', 'citation': [{'page': 1, 'matching_text': 'password might be "Qwerty123" or "inkwell!!"'}, {'page': 1, 'matching_text': 'the password is: a1b2c3'}]},
'credit_card': {'reasoning': 'VERBATIM EXTRACTION', 'citation': [{'page': 1, 'matching_text': '4321-4444-5412-2310 and 5430-5411-4333-5123'}]},
'mac_address': {'reasoning': 'VERBATIM EXTRACTION', 'citation': [{'page': 1, 'matching_text': '01:23:45:67:89:ab or it might be 00.11.22.33.44.55'}]},
'cities': {'reasoning': 'VERBATIM EXTRACTION', 'citation': [{'page': 1, 'matching_text': 'meet in Glasgow or Edinburgh, or even Stirling'}]}},
'usage': {'num_pages_extracted': 1, 'num_document_tokens': 461, 'num_output_tokens': 1082}}
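Because each field comes back with citations, we can cross-check that every extracted value really appears in the cited text before trusting it as evidence. The sketch below is our own illustration and not part of LlamaExtract. It assumes values are separated by commas or semicolons, as in the output above, and it uses two fields copied from that output:

```python
import re


def split_values(field):
    """Split one extracted string on the commas and semicolons seen in the output."""
    return [v.strip() for v in re.split(r"[;,]", field) if v.strip()]


def uncited_values(data, field_metadata):
    """Return {field: [values]} for extracted values absent from every cited text."""
    missing = {}
    for name, raw in data.items():
        citations = field_metadata.get(name, {}).get("citation", [])
        cited_text = " ".join(c["matching_text"] for c in citations)
        absent = [v for v in split_values(raw) if v not in cited_text]
        if absent:
            missing[name] = absent
    return missing


# Two fields copied from the output above.
data = {
    "postcode": "SW1 7AF, EH105DT, EH14 1DJ",
    "ip_addresses": "192.168.0.1, 10.0.0.1, 192.168.1.1",
}
field_metadata = {
    "postcode": {"citation": [
        {"page": 1, "matching_text": "SW1 7AF and EH105DT"},
        {"page": 1, "matching_text": "EH14 1DJ"},
    ]},
    "ip_addresses": {"citation": [
        {"page": 1, "matching_text": "192.168.0.1"},
        {"page": 1, "matching_text": "10.0.0.1 and 192.168.1.1"},
    ]},
}

print(uncited_values(data, field_metadata))  # {}

# A made-up postcode that is not in the document is flagged.
tampered = dict(data, postcode=data["postcode"] + ", ZZ9 9ZZ")
print(uncited_values(tampered, field_metadata))  # {'postcode': ['ZZ9 9ZZ']}
```

The substring test is deliberately simple. A stricter check would match on token boundaries, but even this catches a value that the model has invented.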
Alternatively, we can turn citations off:
config = ExtractConfig(use_reasoning=True, cite_sources=False,
                       extraction_mode=ExtractMode.MULTIMODAL)
This gives less detail on why each value was extracted:
{'postcode': 'SW1 7AF, EH105DT, EH14 1DJ',
'ip_addresses': '192.168.0.1, 10.0.0.1, 192.168.1.1',
'email_address': 'f.smith@home.net, Fred.blogs@gmail.com, f.blogs@mail.com',
'bank_details': 'IBANs: GB91BKEN10000041610008, GB27BOFI90212729823529, GB17BOFS80055100813796, GB92BARC20005275849855; Sort codes: 100000, 902127, 800551, 200052; Accounts: 41610008, 29823529, 00813796, 75849855',
'telephone': '444.3212.5431, +44 (960) 000 00 00',
'passwords': 'Qwerty123, inkwell!!, a1b2c3',
'credit_card': '4321-4444-5412-2310, 5430-5411-4333-5123 (Name: Fred Smith)',
'mac_address': '01:23:45:67:89:ab, 00.11.22.33.44.55',
'cities': 'Glasgow, Edinburgh, Stirling'}
{'field_metadata': {'postcode': {'reasoning': 'VERBATIM EXTRACTION'},
'ip_addresses': {'reasoning': 'VERBATIM EXTRACTION'},
'email_address': {'reasoning': 'VERBATIM EXTRACTION'},
'bank_details': {'reasoning': 'VERBATIM EXTRACTION'},
'telephone': {'reasoning': 'VERBATIM EXTRACTION'},
'passwords': {'reasoning': 'VERBATIM EXTRACTION'},
'credit_card': {'reasoning': 'VERBATIM EXTRACTION'},
'mac_address': {'reasoning': 'VERBATIM EXTRACTION'},
'cities': {'reasoning': 'VERBATIM EXTRACTION'}},
'usage': {'num_pages_extracted': 1, 'num_document_tokens': 461, 'num_output_tokens': 694}}
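To put the numbers above in context, here is a small sketch of the arithmetic. It uses the 14 credits and the $1 per 1,000 credits quoted earlier, along with the two `num_output_tokens` values from the usage blocks. Treating 14 credits as a fixed per-document cost is our assumption, as the real charge may vary with page count and mode:

```python
CREDITS_PER_DOLLAR = 1_000   # from the article: 1,000 credits cost $1
CREDITS_PER_DOCUMENT = 14    # observed for the one-page sample; assumed constant here


def cost_usd(num_documents, credits_per_document=CREDITS_PER_DOCUMENT):
    """Estimated cost in dollars, assuming every document costs the same credits."""
    return num_documents * credits_per_document / CREDITS_PER_DOLLAR


def token_saving(with_citations, without_citations):
    """Fraction of output tokens saved by turning citations off."""
    return 1 - without_citations / with_citations


print(f"1 document: ${cost_usd(1):.3f}")                # 1 document: $0.014
print(f"10,000 documents: ${cost_usd(10_000):.2f}")     # 10,000 documents: $140.00
print(f"Output tokens saved: {token_saving(1082, 694):.0%}")  # Output tokens saved: 36%
```

So dropping citations cuts the output from 1,082 to 694 tokens, at the price of losing the page and matching-text evidence for each field.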
The days of regular expressions are behind us, and we welcome this new, intelligent way of parsing.
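As a point of comparison, here is a naive IPv4 regular expression run over sentences copied from the sample document. The pattern is our own illustrative choice and was not part of the original workflow. It finds the three IP addresses, but it also reports part of the dot-separated MAC address as a fourth one:

```python
import re

# Sentences copied from the sample document above.
text = (
    "do apart from contacting f.smith@home.net to see if he would like to reboot "
    "the server at 192.168.0.1. If he can do this then I will call him on "
    "444.3212.5431. "
    "I think the IP address is 10.0.0.1 and 192.168.1.1 and there are "
    "two MAC addresses which is 01:23:45:67:89:ab or it might be 00.11.22.33.44.55."
)

# Naive pattern: four groups of one to three digits separated by dots.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

found = IPV4.findall(text)
print(found)
# ['192.168.0.1', '10.0.0.1', '192.168.1.1', '00.11.22.33']
```

The LlamaExtract output above lists only the three real IP addresses and places 00.11.22.33.44.55 under `mac_address`, because it reads the surrounding sentence instead of matching on shape alone.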
- Original article: medium.com/asecuritysite...