Crypeto News
The importance of data ingestion and integration for enterprise AI

by crypetonews
January 9, 2024
in Blockchain
The emergence of generative AI prompted several prominent companies to restrict its use because of the mishandling of sensitive internal data. According to CNN, some companies imposed internal bans on generative AI tools while they sought to better understand the technology, and many also blocked internal use of ChatGPT.

Companies still often accept the risk of using internal data when exploring large language models (LLMs), because this contextual data is what turns an LLM from a general-purpose model into one with domain-specific knowledge. In the generative AI or traditional AI development cycle, data ingestion serves as the entry point. Here, raw data tailored to a company’s requirements is gathered, preprocessed, masked and transformed into a format suitable for LLMs or other models. Currently, no standardized process exists for overcoming the challenges of data ingestion, yet the model’s accuracy depends on it.
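As a rough illustration of the masking step mentioned above, the following minimal sketch replaces email addresses and US-style phone numbers with placeholders before text reaches a model. The regular expressions, the placeholder tokens and the `mask_sensitive` helper are assumptions made for this example, not part of any particular product, and a production pipeline would need far broader coverage.

```python
import re

# Hypothetical patterns for illustration only; real pipelines need broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def mask_sensitive(text: str) -> str:
    """Replace email addresses and US-style phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


record = "Contact Jane at jane.doe@example.com or 555-123-4567 about the renewal."
masked = mask_sensitive(record)
print(masked)  # Contact Jane at [EMAIL] or [PHONE] about the renewal.
```

Masking at ingestion time, rather than after training, means the sensitive values never enter the model's training or retrieval data in the first place.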

4 risks of poorly ingested data

Misinformation generation: When an LLM is trained on contaminated data (data that contains errors or inaccuracies), it can generate incorrect answers, leading to flawed decision-making and potential cascading issues. 

Increased variance: Variance measures consistency. Insufficient data can lead to answers that vary over time or to misleading outliers, which particularly affects smaller data sets. High variance may indicate that a model works with its training data but is inadequate for real-world industry use cases.

Limited data scope and non-representative answers: When data sources are restrictive, homogeneous or contain mistaken duplicates, statistical errors like sampling bias can skew all results. This may cause the model to exclude entire areas, departments, demographics, industries or sources from the conversation.

Challenges in rectifying biased data: If the data is biased from the beginning, “the only way to retroactively remove a portion of that data is by retraining the algorithm from scratch.” It is difficult for LLMs to unlearn answers derived from unrepresentative or contaminated data once that data has been vectorized. These models tend to reinforce their understanding based on previously assimilated answers.
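Because biased data is hard to remove after the fact, the limited-scope risk described above is worth checking before training. The sketch below flags sources whose share of the records falls under a threshold. The `underrepresented` helper, the department labels and the 5% threshold are all hypothetical choices for illustration.

```python
from collections import Counter


def underrepresented(labels, min_share):
    """Return the labels whose share of all records is below `min_share`."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sorted(name for name, n in counts.items() if n / total < min_share)


# Hypothetical source labels: one label per ingested record.
records = ["sales"] * 90 + ["support"] * 8 + ["legal"] * 2
flagged = underrepresented(records, min_share=0.05)
print(flagged)  # ['legal']
```

A check like this does not prove that a data set is representative, but it can surface departments or sources that are nearly absent before the data is vectorized.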

Data ingestion must be done properly from the start, as mishandling it can lead to a host of new issues. The groundwork of training data in an AI model is comparable to piloting an airplane: if the takeoff angle is a single degree off, you might land on an entirely different continent than expected.

The entire generative AI pipeline hinges on the data pipelines that empower it, making it imperative to take the correct precautions.

4 key components to ensure reliable data ingestion

Data quality and governance: Data quality means ensuring the security of data sources, maintaining holistic data and providing clear metadata. This may also entail working with new data through methods like web scraping or uploading. Data governance is an ongoing process in the data lifecycle to help ensure compliance with laws and company best practices.

Data integration: Data integration tools enable companies to combine disparate data sources into one secure location. A popular method is extract, load, transform (ELT). In an ELT system, data sets are extracted from siloed warehouses, loaded into a target data pool and then transformed there. ELT tools such as IBM® DataStage® facilitate fast and secure transformations through parallel processing engines. In 2023, the average enterprise received hundreds of disparate data streams, making efficient and accurate data transformations crucial for traditional and new AI model development.
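In the usual reading of ELT, data is landed in the target store unchanged and transformed there afterwards. The following minimal sketch shows that ordering with Python's built-in `sqlite3` module standing in for the target. The table names, columns and sample rows are assumptions for illustration, and the code does not represent how DataStage itself works.

```python
import sqlite3

# Extract: hypothetical raw rows standing in for data pulled from a siloed source.
source_rows = [
    ("1001", " Alice ", "42.50"),
    ("1002", "Bob", "17.00"),
    ("1002", "Bob", "17.00"),  # accidental duplicate
]

target = sqlite3.connect(":memory:")

# Load: land the data unchanged in a raw staging table inside the target.
target.execute("CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)")
target.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", source_rows)

# Transform: trim, cast and deduplicate using the target's own SQL engine.
target.execute(
    """
    CREATE TABLE orders AS
    SELECT DISTINCT
        CAST(order_id AS INTEGER) AS order_id,
        TRIM(customer)            AS customer,
        CAST(amount AS REAL)      AS amount
    FROM raw_orders
    """
)

orders = target.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(orders)  # [(1001, 'Alice', 42.5), (1002, 'Bob', 17.0)]
```

Keeping the untouched `raw_orders` table alongside the transformed `orders` table means a transformation can be corrected and rerun without extracting from the source again.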

Data cleaning and preprocessing: This includes formatting data to meet the requirements of specific LLM training processes, orchestration tools or data types. Text data can be chunked or tokenized, while image data can be stored as embeddings. Comprehensive transformations can be carried out using data integration tools. There may also be a need to manipulate raw data directly by deleting duplicates or changing data types.
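Two of the steps named above, deleting duplicates and chunking text, can be sketched in a few lines. The helper names, the word-based chunking and the sample documents are assumptions for illustration. Real pipelines often chunk by tokens or sentences instead of whitespace-separated words.

```python
def deduplicate(docs: list[str]) -> list[str]:
    """Drop exact duplicates after whitespace normalization, keeping first-seen order."""
    seen = set()
    unique = []
    for doc in docs:
        key = " ".join(doc.split())  # collapse runs of whitespace
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def chunk_words(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most `size` words; adjacent chunks share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


docs = [
    "Quarterly  revenue rose.",
    "Quarterly revenue rose.",
    "Costs fell slightly in the third quarter.",
]
clean = deduplicate(docs)
chunks = chunk_words(clean[1], size=4, overlap=1)
print(clean)   # ['Quarterly revenue rose.', 'Costs fell slightly in the third quarter.']
print(chunks)  # ['Costs fell slightly in', 'in the third quarter.']
```

The overlap gives each chunk a little context from its neighbor, which can help when chunks are later retrieved independently.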

Data storage: After data is cleaned and processed, the challenge of data storage arises. Most data is hosted either in the cloud or on-premises, requiring companies to make decisions about where to store their data. It’s important to be cautious about using external LLMs to handle sensitive information such as personal data, internal documents or customer data. However, LLMs play a critical role in fine-tuning or implementing a retrieval-augmented generation (RAG)-based approach. To mitigate risks, it’s important to run as many data integration processes as possible on internal servers. One potential solution is to use a remote runtime option.

Start your data ingestion with IBM

IBM DataStage streamlines data integration by combining various tools, allowing you to effortlessly pull, organize, transform and store the data needed to train AI models in a hybrid cloud environment. Data practitioners of all skill levels can engage with the tool by using no-code GUIs or by accessing APIs with guided custom code.

The new DataStage as a Service Anywhere remote runtime option provides flexibility to run your data transformations. It empowers you to use the parallel engine from anywhere, giving you unprecedented control over its location. DataStage as a Service Anywhere manifests as a lightweight container, allowing you to run all data transformation capabilities in any environment. This allows you to avoid many of the pitfalls of poor data ingestion as you run data integration, cleaning and preprocessing within your virtual private cloud. With DataStage, you maintain complete control over security, data quality and efficacy, addressing all your data needs for generative AI initiatives.

While there are virtually no limits to what can be achieved with generative AI, there are limits on the data a model uses, and that data may well make all the difference.

Book a meeting to learn more

Try DataStage with the data integration trial

Product Manager, Innovations Lead


