Important XML Parsing UTF8 Problem

  • Dinko
  • Topic Author
  • Junior Boarder
9 months 2 days ago #11488 by Dinko
XML Parsing UTF8 Problem was created by Dinko
I have a problem with UTF-8 XML parsing in CT640. It seems that CT640 is losing UTF-8 characters, or that something has changed deep inside FPC, but I do not know what.
This happens from time to time when CT moves to a new version of FPC trunk.

I have attached a test project, so you can compile it to test and confirm.

The program's output loses one character:
Expected: name_event=|Total Cartonașe Galbene
Actual:   name_event=|Total Cartona?e Galbene

Regards, Dinko

  • Sternas Stefanos
  • Moderator
  • Ex Pilot, M.Sc, Ph.D
9 months 1 day ago - 9 months 1 day ago #11489 by Sternas Stefanos
Replied by Sternas Stefanos on topic XML Parsing UTF8 Problem
Thanks Sir

Now, does the string

"|Total Cartonașe Galbene"

use the same character encoding for all letters?
And which character encoding is it?

In an XML document, ALL strings in the file must have the same "character encoding".

XML does NOT support per-element encoding.

You can declare the character encoding of an XML file like this:
<?xml version='1.0' encoding='US-ASCII' ?>
<?xml version='1.0' encoding='US-ASCII' standalone='yes' ?>
<?xml version='1.0' encoding='UTF-8' ?>
<?xml version='1.0' encoding='UTF-16' ?>
<?xml version='1.0' encoding='ISO-10646-UCS-2' ?>
<?xml version='1.0' encoding='ISO-8859-1' ?>
<?xml version='1.0' encoding='Shift_JIS' ?>

etc

CodeTyphon Architect and Programmer
Last edit: 9 months 1 day ago by Sternas Stefanos.

9 months 1 day ago #11490 by Dinko
Replied by Dinko on topic XML Parsing UTF8 Problem
The encoding is UTF-8.

File Attachment:

File Name: export_par...08-2.txt
File Size:0 KB

File Attachment:

File Name: UTF8ChartTestSave.zip
File Size:0 KB


I tried with the header
<?xml version='1.0' encoding='UTF-8' ?>
and I get the same result.

CT600 does not have this problem.

9 months 1 day ago #11491 by Sternas Stefanos
Replied by Sternas Stefanos on topic XML Parsing UTF8 Problem
My suggestion is to report the problem to the FPC mailing list:
https://www.mail-archive.com/fpc-pascal@lists.freepascal.org/

We can't solve all... problems :blush:

9 months 1 day ago #11494 by Dinko
Replied by Dinko on topic XML Parsing UTF8 Problem
I found the place where the problem starts to appear.
It seems that the UTF-8 string is converted incorrectly.
Maybe I need to set some application-wide encoding option, but I do not know how.

c:\codetyphon\fpcsrc\rtl\objpas\classes\streams.inc

CT640
constructor TStringStream.Create(const AString: string = '');
begin
  Create(AString,TEncoding.Default, False);
end;

CT600
constructor TStringStream.Create(const AString: string = '');
begin
  Inherited create;
  FDataString:=AString;
  UniqueString(FDataString);
end;


laz2_xmlread.pas calls this constructor here:
procedure TXMLReader.ConvertSource(SrcIn: TXMLInputSource; out SrcOut: TXMLCharSource);
begin
  SrcOut := nil;
  if Assigned(SrcIn) then
  begin
    if Assigned(SrcIn.FStream) then
      SrcOut := TXMLStreamInputSource.Create(SrcIn.FStream, False)
    else if SrcIn.FStringData <> '' then
      SrcOut := TXMLStreamInputSource.Create(TStringStream.Create(SrcIn.FStringData), True)
    else if (SrcIn.SystemID <> '') then
      ResolveEntity(SrcIn.SystemID, SrcIn.PublicID, SrcIn.BaseURI, SrcOut);
  end;
  if (SrcOut = nil) and (FSource = nil) then
    DoErrorPos(esFatal, 'No input source specified', NullLocation);
end;


If I manually copy the string into a stream and work with that stream, everything works as expected, but I would need to do that everywhere in my source.
procedure FavStringToStream(OutStream: TStream; const InString: TFavString);
begin
  try
    OutStream.Position:=0;

    // Hardcoded because the Lazarus team changes UTF8String all the time
    if OutStream is TMemoryStream then begin
      OutStream.WriteBuffer(Pointer(InString)^, Length(InString));
    end
    else if OutStream is TFileStream then begin
      OutStream.WriteBuffer(Pointer(InString)^, Length(InString));
    end
    else begin
      OutStream.WriteBuffer(Pointer(InString)^, Length(InString));
    end;

    OutStream.Position:=0;
  except
    on E: Exception do begin
      raise;
    end;
  end;
end;
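
Since all three branches of the helper above do the same thing, it can be reduced to a single raw byte copy. The following is a minimal sketch, assuming the string already holds UTF-8 bytes. RawStringToStream is a hypothetical name, and the sample text uses byte escapes (#$C8#$99 is the UTF-8 encoding of ș) so that the encoding of the source file itself does not matter:

```pascal
program RawUtf8ToStream;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical helper: copies the bytes of S into Dest unchanged.
// The parameter is a RawByteString so that passing the string in
// does not trigger a code page conversion.
procedure RawStringToStream(Dest: TStream; const S: RawByteString);
begin
  Dest.Position := 0;
  if Length(S) > 0 then
    Dest.WriteBuffer(Pointer(S)^, Length(S));
  Dest.Position := 0;
end;

var
  MS: TMemoryStream;
  Sample: RawByteString;
begin
  // 'Cartona' + UTF-8 bytes of U+0219 (ș) + 'e'
  Sample := 'Cartona'#$C8#$99'e';
  MS := TMemoryStream.Create;
  try
    RawStringToStream(MS, Sample);
    // 7 ASCII bytes + 2 bytes for ș + 1 ASCII byte
    WriteLn('Bytes in stream: ', MS.Size);
  finally
    MS.Free;
  end;
end.
```

If the stream holds fewer bytes than the string, or the ș bytes have turned into a single '?', a conversion happened somewhere before the copy.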

9 months 1 day ago #11495 by Dinko
Replied by Dinko on topic XML Parsing UTF8 Problem
I found documentation which says that the new version of Free Pascal handles string encoding the way Delphi does, but I think not all libraries have been adjusted to that (especially the TStringStream class, which has changed a lot).
I think we need to wait for a new version, until they fix all the issues.

9 months 1 day ago #11496 by Sternas Stefanos
Replied by Sternas Stefanos on topic XML Parsing UTF8 Problem
Yes, "like Delphi" is the "motto" of many people in the FPC/Lazarus team :whistle:

You can also use

fpcsrc\packages\fcl-xml\src\xmlread.pp
fpcsrc\packages\fcl-xml\src\xmlutils.pp
fpcsrc\packages\fcl-xml\src\dom.pp

TStringStream has

constructor TStringStream.Create(const AString: UnicodeString; AEncoding: TEncoding; AOwnsEncoding: Boolean);
constructor TStringStream.Create(const AString: UnicodeString; ACodePage: Integer);
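
The fcl-xml units listed above can read from any TStream, which avoids the TStringStream constructor altogether. The following is a minimal sketch, assuming the XML text is already UTF-8. The element and attribute names are made up for illustration, and the ș is written as the byte escapes #$C8#$99 so that the encoding of the source file does not matter:

```pascal
program ParseUtf8FromStream;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, XMLRead;

var
  Xml: RawByteString;
  MS: TMemoryStream;
  Doc: TXMLDocument;
  Value: DOMString;
begin
  // Hypothetical sample document; #$C8#$99 is U+0219 (ș) in UTF-8.
  Xml := '<?xml version="1.0" encoding="UTF-8"?>' +
         '<event name="Cartona'#$C8#$99'e"/>';
  MS := TMemoryStream.Create;
  try
    // Raw byte copy: no code page conversion takes place here.
    MS.WriteBuffer(Pointer(Xml)^, Length(Xml));
    MS.Position := 0;
    ReadXMLFile(Doc, MS);
    try
      Value := Doc.DocumentElement.GetAttribute('name');
      // DOMString is UTF-16, so ș should arrive as one code unit, $0219.
      if (Length(Value) = 9) and (Ord(Value[8]) = $0219) then
        WriteLn('OK: character preserved')
      else
        WriteLn('Character lost or altered');
    finally
      Doc.Free;
    end;
  finally
    MS.Free;
  end;
end.
```

Note that fcl-xml returns DOMString values, so code written against the laz2_* units, which work with UTF-8 strings, would need a conversion such as UTF8Encode at the boundary.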

9 months 1 day ago #11497 by Dinko
Replied by Dinko on topic XML Parsing UTF8 Problem
Not all Delphi solutions are good ones. They needed to support the Windows platform, which uses UTF-16, so they did. Everything else is pure alchemy.

procedure TXMLReader.ConvertSource(SrcIn: TXMLInputSource; out SrcOut: TXMLCharSource);
begin
  SrcOut := nil;
  if Assigned(SrcIn) then
  begin
    if Assigned(SrcIn.FStream) then
      SrcOut := TXMLStreamInputSource.Create(SrcIn.FStream, False)
    else if SrcIn.FStringData <> '' then
      // SrcOut := TXMLStreamInputSource.Create(TStringStream.Create(SrcIn.FStringData), True)
      SrcOut := TXMLStreamInputSource.Create(TStringStream.Create(SrcIn.FStringData, CP_UTF8), True)
    else if (SrcIn.SystemID <> '') then
      ResolveEntity(SrcIn.SystemID, SrcIn.PublicID, SrcIn.BaseURI, SrcOut);
  end;
  if (SrcOut = nil) and (FSource = nil) then
    DoErrorPos(esFatal, 'No input source specified', NullLocation);
end;

So my temporary solution for CT640 is to change this procedure in laz2_xmlread.pas.

I'm afraid this is not the end of the UTF-8 problems in the near future. It would be great if I could detect the code page of the XML document and pass it to the procedure. Since I use only UTF-8, this solution works with my old CT600 code.
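
Detecting the declared encoding, as wished for above, could start from the XML declaration itself. The following is a rough sketch rather than a full XML prolog parser: DeclaredXmlEncoding is a hypothetical helper, and it only works when the document is in an ASCII-compatible encoding (a UTF-16 file would need a BOM check first). An empty result means that no encoding was declared, in which case UTF-8 is the usual assumption for such documents:

```pascal
program DetectXmlEncoding;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: returns the encoding name from the XML declaration,
// or '' when there is no declaration or no encoding pseudo-attribute.
function DeclaredXmlEncoding(const Xml: string): string;
var
  DeclEnd, P, Q: Integer;
  Decl: string;
  Quote: Char;
begin
  Result := '';
  if Copy(Xml, 1, 5) <> '<?xml' then
    Exit;
  DeclEnd := Pos('?>', Xml);
  if DeclEnd = 0 then
    Exit;
  // Only look inside the declaration, never in the document body.
  Decl := Copy(Xml, 1, DeclEnd - 1);
  P := Pos('encoding', Decl);
  if P = 0 then
    Exit;
  Inc(P, Length('encoding'));
  // Skip optional whitespace and the '=' sign.
  while (P <= Length(Decl)) and (Decl[P] in [' ', #9, #10, #13, '=']) do
    Inc(P);
  if (P > Length(Decl)) or not (Decl[P] in ['"', '''']) then
    Exit;
  Quote := Decl[P];
  Q := P + 1;
  while (Q <= Length(Decl)) and (Decl[Q] <> Quote) do
    Inc(Q);
  if Q <= Length(Decl) then
    Result := Copy(Decl, P + 1, Q - P - 1);
end;

begin
  // Prints: UTF-8
  WriteLn(DeclaredXmlEncoding('<?xml version=''1.0'' encoding=''UTF-8'' ?><a/>'));
end.
```

The returned name would still have to be mapped to a code page before it could be passed to a TStringStream constructor; that mapping is left out here.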

Regards, Dinko

9 months 1 day ago #11498 by Dinko
Replied by Dinko on topic XML Parsing UTF8 Problem
Same story with JSON parsing, I'm afraid.
