Parallelization

Data can now be cleaned and analyzed using comparisons that would have been impossible just a few years ago.

 


Parallelization / In-Memory, Multi-Dimensional Processing / Distributed Computing / Big Data Analytics / etc.

This is a new technology and approach, so of course it goes by a lot of names.  But what exactly IS it?  Some real-world examples may shed some light.

 

Traditionally, to combine a series of data sets, eliminate duplicates, and generally clean up the resulting file, a single record would be selected and compared against all the other records to pick the best one.

 

While this approach is fast and can easily be applied to huge data sets, the resulting data is likely to be of poor quality.  For example, if the first record has a correct address but an incomplete name, the second record has a complete name but a bad address, and the third record is the only one with a phone number, picking a single "best" record inevitably means sacrificing some data elements in favor of others.
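To make that trade-off concrete, here is a minimal sketch of the traditional "pick the most complete record" approach.  The records and the completeness rule are illustrative assumptions, not any particular product's logic:

```python
# Three records describing the same person; each is strong in a different field.
# (The records and the completeness rule are made up for illustration.)
records = [
    {"name": "J. Smith",       "address": "12 Oak St, Springfield, IL 62704", "phone": ""},
    {"name": "Jonathan Smith", "address": "12 Oak Streeet, Sprngfield",       "phone": ""},
    {"name": "Jon Smith",      "address": "",                                 "phone": "217-555-0134"},
]

def completeness(record):
    """Score a record by how many of its fields are populated."""
    return sum(1 for value in record.values() if value)

# Traditional survivorship: keep the single most complete record as-is.
best = max(records, key=completeness)
print(best)
# The winner keeps its own gaps: here it has the verified address but an
# incomplete name and no phone; the complete name and the phone number held
# by the losing records are simply discarded.
```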

 

 

One solution is to compare every record to every other record at the same time.  By "scoring" records with similar names and addresses, this approach allows records to be matched as duplicates even when names or other identifying data elements don't match exactly.
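As a rough illustration of this kind of pairwise scoring, the sketch below uses Python's standard-library SequenceMatcher as a stand-in for whatever similarity measures a real matching engine would use; the sample records, weights, and threshold are all assumptions chosen for the example:

```python
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"name": "Jonathan Smith", "address": "12 Oak Street, Springfield IL"},
    {"name": "Jon Smith",      "address": "12 Oak St., Springfield, IL 62704"},
    {"name": "Maria Garcia",   "address": "980 Pine Ave, Portland OR"},
]

def similarity(a, b):
    """Rough string similarity in [0, 1]; a real engine would add phonetic,
    address-standardizing, or model-based comparisons."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_score(rec1, rec2):
    """Weighted score for one pair of records (weights are arbitrary here)."""
    return (0.6 * similarity(rec1["name"], rec2["name"])
            + 0.4 * similarity(rec1["address"], rec2["address"]))

# Compare every record to every other record; keep pairs that score above a
# threshold even though no field matches exactly.
THRESHOLD = 0.55
likely_duplicates = [
    (i, j, round(pair_score(records[i], records[j]), 2))
    for i, j in combinations(range(len(records)), 2)
    if pair_score(records[i], records[j]) >= THRESHOLD
]
print(likely_duplicates)   # records 0 and 1 score as likely duplicates
```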

 

Further, instead of scoring and selecting a single record from a set, each individual data element can be scored and selected.  A new record can then be built that contains the best address, name, and phone number, even though those elements were sourced from different original records.  The result is a single, canonical record containing the best information from all sources.
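A minimal sketch of that element-level selection might look like the following.  The "longest non-empty value wins" rule is an illustrative assumption standing in for real per-field quality rules (source reliability, recency, validation, and so on):

```python
# A cluster of records already judged (e.g. by pairwise scoring) to describe
# the same entity; the data is made up for illustration.
cluster = [
    {"name": "J. Smith",       "address": "12 Oak St, Springfield, IL 62704", "phone": ""},
    {"name": "Jonathan Smith", "address": "12 Oak Street",                    "phone": ""},
    {"name": "Jon Smith",      "address": "",                                 "phone": "217-555-0134"},
]

def best_value(values):
    """Pick the 'best' value for one field -- here simply the longest non-empty
    string, standing in for real per-field quality rules."""
    candidates = [v for v in values if v]
    return max(candidates, key=len) if candidates else ""

# Build one canonical record field by field, so the best name, address, and
# phone number can each come from a different source record.
canonical = {field: best_value([rec[field] for rec in cluster])
             for field in cluster[0]}
print(canonical)
# {'name': 'Jonathan Smith',
#  'address': '12 Oak St, Springfield, IL 62704',
#  'phone': '217-555-0134'}
```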

 

While this sounds great, until very recently the computing power needed to do these sorts of comparisons was out of reach, even for smaller data sets: comparing every record to every other record grows quadratically, so a set of just one million records already implies roughly 500 billion pairwise comparisons.

The EDI Project™ has successfully implemented projects using new tools that make these previously impossible comparisons practical.  Complex comparisons are spread across multiple threads and processors.  Large data sets that may contain many billions of records are loaded into memory spanning many machines, so the entire set can be processed at once.  Loads are balanced so that complex comparisons on these big data sets yield results that were previously out of reach.
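The EDI Project's own tooling is not shown here, but the general shape of the idea can be sketched with Python's standard multiprocessing module: fan the pairwise comparisons out across worker processes and stream the results back as they finish.  The scoring function, chunk size, and sample data are illustrative assumptions; a production system would also shard the records themselves across many machines.

```python
from difflib import SequenceMatcher
from itertools import combinations
from multiprocessing import Pool

def score_pair(work_item):
    """Score one (i, j, record_i, record_j) work item on the name field."""
    i, j, rec1, rec2 = work_item
    score = SequenceMatcher(None, rec1["name"].lower(), rec2["name"].lower()).ratio()
    return i, j, score

def all_pair_scores(records, chunksize=500):
    """Fan every pairwise comparison out across local CPU cores.  A cluster-scale
    system would also shard the records across many machines, but the shape of
    the work is the same."""
    pairs = ((i, j, records[i], records[j])
             for i, j in combinations(range(len(records)), 2))
    with Pool() as pool:
        # imap_unordered streams results back as workers finish, which keeps the
        # load balanced when some comparisons are costlier than others.
        yield from pool.imap_unordered(score_pair, pairs, chunksize=chunksize)

if __name__ == "__main__":
    # Small synthetic data set; real workloads are many orders of magnitude larger.
    records = [{"name": f"Customer {i % 400}"} for i in range(800)]
    matches = [(i, j) for i, j, score in all_pair_scores(records) if score > 0.9]
    print(f"{len(matches)} likely duplicate pairs found")
```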

 
